  1. Core Server
  2. SERVER-2377

Mongos sharding/failover behaves strangely

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • 1.7.5
    • Affects Version/s: 1.7.5
    • Component/s: Replication, Sharding
    • Labels:
      None
    • Environment:
      Linux x86/x86_64
    • Linux

      Summary:
      mongos fails temporarily when the primary member of the replica set goes down ("dbclient error communicating with server: <host>:<port>"), then fails semi-permanently until all replica set members are up again ("mongos connectionpool: connect failed <replicaSet>/<host>:<port>[,<host>:<port>...]" and "not master and slaveok=false").
      I wonder if mongos should:
      1. Auto-retry/auto-reconnect, at least for read operations
      2. Not fail permanently until the replica set has all servers running again
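
To illustrate the first suggestion from the client side, here is a minimal sketch of a read retry wrapper. It is not how mongos behaves and not part of any MongoDB driver: `retryRead`, `TRANSIENT_CODES`, and `fakeFind` are hypothetical names, the error codes are the ones seen in the session in this report, and the code assumes a modern JavaScript runtime such as Node.js rather than the 1.7.5 shell. It retries immediately, with no delay between attempts.

```javascript
// Error codes copied from the session in this report.
const TRANSIENT_CODES = [10278, 11002, 13435];

// Hypothetical helper: retry a synchronous read a few times when it fails
// with one of the codes above. `readFn` is any function that performs the
// read and throws an error object carrying a numeric `code` on failure.
function retryRead(readFn, maxAttempts) {
  let lastError = new Error("no attempts were made");
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return readFn();
    } catch (err) {
      if (TRANSIENT_CODES.indexOf(err.code) === -1) {
        throw err; // not a failover-related error, so do not retry
      }
      lastError = err;
    }
  }
  throw lastError; // every attempt failed with a transient error
}

// Simulated read: fails twice with code 10278, then succeeds.
let calls = 0;
function fakeFind() {
  calls += 1;
  if (calls <= 2) {
    const err = new Error(
      "dbclient error communicating with server: celestine-2:27100"
    );
    err.code = 10278;
    throw err;
  }
  return [{ value: 98 }];
}

const docs = retryRead(fakeFind, 5);
```

A wrapper like this only hides the temporary failure. It would not help with the semi-permanent state described above, where every attempt keeps failing until all members are back.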

      Configuration:
      A set of 3 machines hosting a replica set (named testRS), config servers, and one instance of mongos. Sharding is enabled, but no collections are actually sharded.
      celestine-1: config1, rs1, mongos
      celestine-2: config2, rs2
      celestine-3: config3, rs3
      One user database "test1", having one collection "items" with two documents (see session below).

      Versions:
      mongos 1.7.5 nightly (2011-01-18), used because mongos 1.6.5/1.6.6 causes the mongo shell to fail with an assertion (ERROR: MessagingPort::call() wrong id got:XXX expect:YYY)
      mongod 1.7.5 nightly (2011-01-18)
      mongo shell 1.7.5 nightly (2011-01-18)

      Mongos session:
      > db.items.find()

      { "_id" : ObjectId("4d35bea8ba15dc15b0d3878e"), "value" : 123456789012345 }
      { "_id" : ObjectId("4d35c228abfec6a6bac2d04b"), "value" : 98 }

      — bring down primary member of replica set (celestine-2 ATM) here —

      > db.items.find()
      error: {
      "$err" : "dbclient error communicating with server: celestine-2:27100",
      "code" : 10278
      }
      > db.items.find()
      error: {
      "$err" : "dbclient error communicating with server: celestine-2:27100",
      "code" : 10278
      }
      > db.items.find()
      error: {
      "$err" : "mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100",
      "code" : 11002
      }
      > db.items.find()
      error: { "$err" : "not master and slaveok=false", "code" : 13435 }

      Mongos log:
      Wed Jan 19 12:31:46 [Balancer] ~ScopedDbConnection: _conn != null
      Wed Jan 19 12:31:46 [Balancer] caught exception while doing balance: DBClientBase::findOne: transport error: celestine-1:27100 query: { features: 1 }
      Wed Jan 19 12:32:36 [Balancer] ~ScopedDbConnection: _conn != null
      Wed Jan 19 12:32:36 [Balancer] caught exception while doing balance: mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100
      Wed Jan 19 12:33:06 [Balancer] ~ScopedDbConnection: _conn != null
      Wed Jan 19 12:33:06 [Balancer] caught exception while doing balance: mongos connectionpool: connect failed testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100 : connect failed to set testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100
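
The connect-failed messages in the session and log carry the replica set seed list in the form `<replicaSet>/<host>:<port>[,<host>:<port>...]`. When checking which members mongos is trying to reach, a small parser can help. The sketch below is only an illustration: `parseReplicaSetSeed` is a hypothetical helper, not a MongoDB API, and it assumes every entry has an explicit port, as in the messages in this report.

```javascript
// Hypothetical helper: split "<replicaSet>/<host>:<port>[,<host>:<port>...]"
// into the set name and a list of { host, port } members.
function parseReplicaSetSeed(seed) {
  const slash = seed.indexOf("/");
  if (slash === -1) {
    throw new Error("expected <replicaSet>/<host>:<port>,... but got: " + seed);
  }
  const setName = seed.slice(0, slash);
  const members = seed.slice(slash + 1).split(",").map(function (entry) {
    const colon = entry.lastIndexOf(":");
    if (colon === -1) {
      // Assumption: entries always include a port, as in the report.
      throw new Error("expected <host>:<port> but got: " + entry);
    }
    return {
      host: entry.slice(0, colon),
      port: Number(entry.slice(colon + 1)),
    };
  });
  return { setName: setName, members: members };
}

// Seed string copied from the error 11002 message in the session.
const parsed = parseReplicaSetSeed(
  "testRS/celestine-1:27100,celestine-3:27100,celestine-2:27100"
);
```

For the string above, `parsed.setName` is `testRS` and `parsed.members` lists the three celestine hosts on port 27100, in the order they appear in the message.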

            Assignee:
            kristina Kristina Chodorow (Inactive)
            Reporter:
            onyxmaster Aristarkh Zagorodnikov