Core Server / SERVER-8273

Mongos view of Replica Set can become stale forever if members change completely between refresh

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 2.2.1
    • Component/s: Sharding
    • Labels: None
    • Environment:
      Linux RHEL 5.5

      The ReplicaSetMonitor refreshes its view of the replica set every 10 seconds, or when a new operation is requested on a replica set connection that had previously errored out. The problem arises when the set's membership changes so completely that none of the current members appear in the view the ReplicaSetMonitor has cached. The direct consequence is that the monitor has no way to contact any of the new members.
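
      To illustrate the failure mode, here is a simplified sketch (not the actual mongos code; the hostnames are the ones from the report below). Because a refresh only queries hosts already in the cached view, a cached list that no longer overlaps with the live membership can never recover:

        # Simplified illustration of the stale-view failure mode; not actual
        # ReplicaSetMonitor code. Hostnames are taken from this report.
        cached = {"cn54:27118", "cn55:27118"}       # members the monitor remembers
        live = {"cn54-ib:27118", "cn55-ib:27118"}   # members the set actually has now

        def refresh(cached, live):
            """A refresh can only ask hosts that are already in the cached view."""
            reachable = cached & live               # cached hosts that still exist
            if not reachable:
                return cached                       # nothing answers -> view stays stale
            return set(live)                        # any overlap lets the monitor catch up

        print(refresh(cached, live))                # prints the old, stale member list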

      The only workaround for this issue is to manually edit the config.shards collection and restart all mongos processes.

      Attached is a script, test.patch, that demonstrates this problem.

      Original bug report:

      We had an issue whereby our mongo config servers didn't notice when the hostnames in one shard were changed.

      We have two shards:
      rs0: cn14:27118,cn53:27118
      rs1: cn54:27118,cn55:27118

      These were changed in the replicaset to:
      rs0: cn14-ib:27118,cn53-ib:27118
      rs1: cn54-ib:27118,cn55-ib:27118

      The '-ib' names refer to different (InfiniBand) interfaces on the same hosts.

      The replica sets appeared to be healthy and in sync for both rs0 and rs1. However, only rs0 was updated in the config servers' shards collection!

      The entire cluster was rebooted over the weekend. Two days later the config.shards collection still had not learned the new hostnames for rs1. Killing and restarting the config servers and mongos processes since then hasn't helped either.

      That said, the cluster appeared to work fine until we enabled sharding on a database. At that point the mongos processes and pymongo clients started failing (see attached assertions and backtraces).

      The error seen in pymongo:

        File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 814, in next
          if len(self.__data) or self._refresh():
        File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 763, in _refresh
          self.__uuid_subtype))
        File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 720, in __send_message
          self.__uuid_subtype)
        File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 104, in _unpack_response
          error_object["$err"])
      OperationFailure: database error: can't find shard for: cn54-ib:27118
      

      The error seen on the mongos:

      Tue Jan 22 15:29:17 [conn270] warning: db exception when initializing on rs1:rs1/cn54:27118,cn55:27118, current connection state is { state: { conn: "rs1/cn54-ib:27118,cn55-ib:27118", vinfo: "mongoose.centaur @ 6|1||50fe9e767e7521213b281407", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 13129 can't find shard for: cn54-ib:27118
      Tue Jan 22 15:29:17 [conn270] AssertionException while processing op type : 2004 to : mongoose.centaur :: caused by :: 13129 can't find shard for: cn54-ib:27118
      Tue Jan 22 15:29:19 [conn270] ERROR: can't get TCP_KEEPINTVL: errno:92 Protocol not available
      Tue Jan 22 15:29:19 [conn270] Assertion: 13129:can't find shard for: cn54-ib:27118
      0x80de91 0x7d76e9 0x7d786c 0x7688a8 0x763e64 0x769f76 0x76c21a 0x7704ec 0x56b7df 0x58249a 0x5889d9 0x7814a2 0x75a81b 0x4ffe41 0x7fc0b1 0x3e1c60673d 0x3e1bed3f6d 
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0x80de91]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x99) [0x7d76e9]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos [0x7d786c]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15StaticShardInfo4findERKSs+0x358) [0x7688a8]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo5Shard5resetERKSs+0x34) [0x763e64]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo17checkShardVersionEPNS_12DBClientBaseERKSsN5boost10shared_ptrIKNS_12ChunkManagerEEEbi+0x906) [0x769f76]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo14VersionManager19checkShardVersionCBEPNS_15ShardConnectionEbi+0x6a) [0x76c21a]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15ShardConnection11_finishInitEv+0xfc) [0x7704ec]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor28setupVersionAndHandleSlaveOkEN5boost10shared_ptrINS_23ParallelConnectionStateEEERKNS_5ShardENS2_IS5_EERKNS_15NamespaceStringERKSsNS2_IKNS_12ChunkManagerEEE+0x19f) [0x56b7df]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor9startInitEv+0xdea) [0x58249a]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor8fullInitEv+0x9) [0x5889d9]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0x472) [0x7814a2]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo7Request7processEi+0x1fb) [0x75a81b]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x4ffe41]
       /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x7fc0b1]
       /lib64/libpthread.so.0 [0x3e1c60673d]
       /lib64/libc.so.6(clone+0x6d) [0x3e1bed3f6d]
      Tue Jan 22 15:29:19 [conn270] warning: db exception when initializing on rs1:rs1/cn54:27118,cn55:27118, current connection state is { state: { conn: "rs1/cn54-ib:27118,cn55-ib:27118", vinfo: "mongoose.centaur @ 6|1||50fe9e767e7521213b281407", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 13129 can't find shard for: cn54-ib:27118
      Tue Jan 22 15:29:19 [conn270] AssertionException while processing op type : 2004 to : mongoose.centaur :: caused by :: 13129 can't find shard for: cn54-ib:27118
      

      We ended up fixing this by manually changing the host entry for rs1 in the config.shards collection.
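
      For reference, the manual fix amounts to something like the following (a minimal sketch using a current pymongo; the config server address is a placeholder, and the shard name and host string are the rs1 values from this report):

        # Minimal sketch of the manual config.shards fix; the address is a placeholder.
        from pymongo import MongoClient

        # Connect to a config server (or through a mongos) that serves the config database.
        client = MongoClient("config-server-host:27019")

        # Point the stale shard entry at the new replica set members.
        client.config.shards.update_one(
            {"_id": "rs1"},
            {"$set": {"host": "rs1/cn54-ib:27118,cn55-ib:27118"}},
        )

        # As noted above, every mongos must then be restarted so it re-reads
        # config.shards and rebuilds its view of the replica set.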

      According to this newsgroup posting by Eliot, this shouldn't be necessary:
      https://groups.google.com/d/topic/mongodb-user/9L3LM5nK5aI/discussion

      Attachments:
        1. replicaset-errors.txt (251 kB)
        2. test.patch (5 kB)

            Assignee: Randolph Tan (randolph@mongodb.com)
            Reporter: James Blackburn (jblackburn)
            Votes: 1
            Watchers: 7
