- Type: Bug
- Resolution: Done
- Priority: Major - P3
- Affects Version/s: 2.2.1
- Component/s: Sharding
- Environment: Linux RHEL 5.5
The ReplicaSetMonitor refreshes its view of a replica set every 10 seconds, or when a new operation is requested on a replica set connection that has errored out. The problem arises when the set's membership changes so that it no longer shares any member with the list the ReplicaSetMonitor already holds. Since the monitor can only refresh by contacting hosts it already knows about, the direct consequence is that it has no way to reach any of the new members.
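A toy sketch of that failure mode (plain Python, not the actual C++ ReplicaSetMonitor; the hostnames follow the report below):

    # Toy model of the failure: the monitor can only refresh by asking hosts it
    # already knows, so once every cached host has been replaced it can never
    # discover the new member list.
    class ToyReplicaSetMonitor(object):
        def __init__(self, seed_hosts):
            self.known_hosts = set(seed_hosts)

        def refresh(self, live_members):
            # live_members maps each currently reachable host to the set's
            # current member list (what isMaster would report).
            for host in self.known_hosts:
                if host in live_members:
                    self.known_hosts = set(live_members[host])
                    return True
            return False  # no cached host is reachable any more -> stuck forever

    monitor = ToyReplicaSetMonitor(['cn54:27118', 'cn55:27118'])
    live = {'cn54-ib:27118': ['cn54-ib:27118', 'cn55-ib:27118'],
            'cn55-ib:27118': ['cn54-ib:27118', 'cn55-ib:27118']}
    print(monitor.refresh(live))  # False: the old and new views are disjoint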
The only workaround for this issue is to manually edit the config.shards collection and restart all mongos processes.
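For illustration, that manual edit looks roughly like this with pymongo ('configsvr' is a placeholder address; with multiple config servers the same change has to be applied to each, and every mongos restarted afterwards so it reloads the shard registry):

    # Hedged sketch of the manual workaround against one config server.
    from pymongo import MongoClient

    config_db = MongoClient('configsvr', 27019)['config']

    # Point the rs1 shard entry at the new member hostnames.
    config_db.shards.update(
        {'_id': 'rs1'},
        {'$set': {'host': 'rs1/cn54-ib:27118,cn55-ib:27118'}}
    )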
Attached is a script, test.patch, that demonstrates this problem.
Original bug report:
We had an issue whereby our mongo config servers didn't notice when the hostnames in one shard were changed.
We have two shards:
rs0: cn14:27118,cn53:27118
rs1: cn54:27118,cn55:27118

These were changed in the replica set configuration to:
rs0: cn14-ib:27118,cn53-ib:27118
rs1: cn54-ib:27118,cn55-ib:27118

The '-ib' names are separate (InfiniBand) interfaces on the same hosts.
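A hostname change of that kind is typically made with a replica set reconfiguration; the pymongo sketch below is only an illustration (it assumes cn54:27118 was the rs1 primary), not a record of the exact commands we ran:

    # Rough sketch: rename rs1's members to their -ib interfaces via replSetReconfig.
    from pymongo import MongoClient

    primary = MongoClient('cn54', 27118)            # assumed rs1 primary
    cfg = primary.local.system.replset.find_one({'_id': 'rs1'})

    for member in cfg['members']:
        host, port = member['host'].split(':')
        member['host'] = '%s-ib:%s' % (host, port)  # cn54:27118 -> cn54-ib:27118
    cfg['version'] += 1

    # replSetReconfig runs against the admin database on the primary; it may drop
    # the connection (raising AutoReconnect) even when the reconfig succeeds.
    primary.admin.command('replSetReconfig', cfg)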
Both replica sets, rs0 and rs1, appeared to be healthy and in sync. However, only rs0 was updated in the config servers' shards collection!
The entire cluster was rebooted over the weekend. Two days later the config.shards collection still had not learned the new hostnames of rs1. Killing and restarting the config servers and mongos processes since then hasn't helped either.
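What the config servers believe can be checked by reading config.shards directly; a minimal pymongo sketch (the config server address is a placeholder):

    # Print the shard registry as stored on a config server. In our case rs1
    # still showed the old cn54/cn55 hostnames while rs0 had the -ib names.
    from pymongo import MongoClient

    for shard in MongoClient('configsvr', 27019)['config']['shards'].find():
        print(shard)  # e.g. {u'_id': u'rs1', u'host': u'rs1/cn54:27118,cn55:27118'}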
That said, the cluster appeared to work fine until we enabled sharding on a database. At that point the mongos processes and pymongo clients started failing (see the attached assertions and backtraces).
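The sharding setup that triggered the failures amounts to the standard admin commands against a mongos, roughly as in this pymongo sketch (the mongos address and the shard key are placeholders, not our actual values):

    # Enable sharding on the database and shard the collection that later failed.
    from pymongo import MongoClient

    mongos = MongoClient('localhost', 27017)        # placeholder mongos address
    mongos.admin.command('enablesharding', 'mongoose')
    mongos.admin.command('shardcollection', 'mongoose.centaur',
                         key={'_id': 1})            # placeholder shard key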
The error seen in pymongo:
File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 814, in next if len(self.__data) or self._refresh(): File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 763, in _refresh self.__uuid_subtype)) File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 720, in __send_message self.__uuid_subtype) File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 104, in _unpack_response error_object["$err"]) OperationFailure: database error: can't find shard for: cn54-ib:27118The error seen on the mongos:
The error seen on the mongos:
Tue Jan 22 15:29:17 [conn270] warning: db exception when initializing on rs1:rs1/cn54:27118,cn55:27118, current connection state is { state: { conn: "rs1/cn54-ib:27118,cn55-ib:27118", vinfo: "mongoose.centaur @ 6|1||50fe9e767e7521213b281407", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 13129 can't find shard for: cn54-ib:27118
Tue Jan 22 15:29:17 [conn270] AssertionException while processing op type : 2004 to : mongoose.centaur :: caused by :: 13129 can't find shard for: cn54-ib:27118
Tue Jan 22 15:29:19 [conn270] ERROR: can't get TCP_KEEPINTVL: errno:92 Protocol not available
Tue Jan 22 15:29:19 [conn270] Assertion: 13129:can't find shard for: cn54-ib:27118
0x80de91 0x7d76e9 0x7d786c 0x7688a8 0x763e64 0x769f76 0x76c21a 0x7704ec 0x56b7df 0x58249a 0x5889d9 0x7814a2 0x75a81b 0x4ffe41 0x7fc0b1 0x3e1c60673d 0x3e1bed3f6d
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0x80de91]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x99) [0x7d76e9]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos [0x7d786c]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15StaticShardInfo4findERKSs+0x358) [0x7688a8]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo5Shard5resetERKSs+0x34) [0x763e64]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo17checkShardVersionEPNS_12DBClientBaseERKSsN5boost10shared_ptrIKNS_12ChunkManagerEEEbi+0x906) [0x769f76]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo14VersionManager19checkShardVersionCBEPNS_15ShardConnectionEbi+0x6a) [0x76c21a]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15ShardConnection11_finishInitEv+0xfc) [0x7704ec]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor28setupVersionAndHandleSlaveOkEN5boost10shared_ptrINS_23ParallelConnectionStateEEERKNS_5ShardENS2_IS5_EERKNS_15NamespaceStringERKSsNS2_IKNS_12ChunkManagerEEE+0x19f) [0x56b7df]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor9startInitEv+0xdea) [0x58249a]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor8fullInitEv+0x9) [0x5889d9]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0x472) [0x7814a2]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo7Request7processEi+0x1fb) [0x75a81b]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x4ffe41]
 /opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x7fc0b1]
 /lib64/libpthread.so.0 [0x3e1c60673d]
 /lib64/libc.so.6(clone+0x6d) [0x3e1bed3f6d]
Tue Jan 22 15:29:19 [conn270] warning: db exception when initializing on rs1:rs1/cn54:27118,cn55:27118, current connection state is { state: { conn: "rs1/cn54-ib:27118,cn55-ib:27118", vinfo: "mongoose.centaur @ 6|1||50fe9e767e7521213b281407", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 13129 can't find shard for: cn54-ib:27118
Tue Jan 22 15:29:19 [conn270] AssertionException while processing op type : 2004 to : mongoose.centaur :: caused by :: 13129 can't find shard for: cn54-ib:27118

We ended up fixing this by manually changing the rs1 location in the config.shards collection.
According to this newsgroup posting by Eliot, this shouldn't be necessary:
https://groups.google.com/d/topic/mongodb-user/9L3LM5nK5aI/discussion