Details
Bug
Resolution: Done
Major - P3
None
2.2.1
None
Linux RHEL 5.5
Linux
Description
The ReplicaSetMonitor refreshes its view of the replica set every 10 seconds, or when a new operation is requested on a replica set connection that had errored out. The problem arises when the membership of the set changes such that none of the new members are among the members the ReplicaSetMonitor already knows about. As a direct consequence, the monitor has no way to contact any of the new members.
The only workaround for this issue is to manually edit the config.shards collection and restart all mongos processes.
Attached is a script, test.patch, that demonstrates this problem.
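The failure mode described above can be illustrated with a toy model. This is only a sketch under stated assumptions, not the actual ReplicaSetMonitor code: the function `refresh`, its arguments, and the hosts `a:27017`, `b:27017`, `c:27017` are hypothetical, and the model assumes the monitor can only learn the current member list by asking a host it already knows.

```python
def refresh(known_hosts, reachable):
    """Toy model of a single monitor refresh (hypothetical, for illustration).

    known_hosts: set of host strings the monitor currently holds.
    reachable:   dict mapping each host that answers to the member list it reports.
    Returns the set of hosts the monitor knows after the refresh.
    """
    for host in sorted(known_hosts):
        members = reachable.get(host)
        if members is not None:
            # A known host answered, so adopt the member list it reports.
            return set(members)
    # No known host answered, so the view cannot be updated.
    return set(known_hosts)


# Overlapping change: "a:27017" stays in the set, so the new member is learned.
new_members = ["a:27017", "c:27017"]
learned = refresh({"a:27017", "b:27017"},
                  {"a:27017": new_members, "c:27017": new_members})

# Disjoint change (the rs1 case): none of the old names answer any more,
# so the monitor stays stuck on the stale host list.
rs1_new = ["cn54-ib:27118", "cn55-ib:27118"]
stuck = refresh({"cn54:27118", "cn55:27118"},
                {"cn54-ib:27118": rs1_new, "cn55-ib:27118": rs1_new})
```

In the second call every host the monitor could ask is gone from the set, so no number of refreshes changes its view.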
Original bug report:
We had an issue whereby our mongo config servers didn't notice when the host names in one shard were changed.
We have two shards:
rs0: cn14:27118,cn53:27118
rs1: cn54:27118,cn55:27118
These were changed in the replica set to:
rs0: cn14-ib:27118,cn53-ib:27118
rs1: cn54-ib:27118,cn55-ib:27118
The '-ib' names are different (InfiniBand) interfaces on the same hosts.
The replica sets appeared to be happy and in sync, for both rs0 and rs1. However, only rs0 was updated in the config servers' shards collection!
The entire cluster was rebooted over the weekend. Two days later, the config.shards collection had still not learned the new hostnames of rs1. Killing and restarting the config servers and mongos processes since then hasn't helped either.
That said, the cluster appeared to work fine until we enabled sharding on a database. At that point the mongos processes and pymongo clients started failing (see attached assertions and backtraces).
The error seen in pymongo:
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 814, in next
    if len(self.__data) or self._refresh():
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 763, in _refresh
    self.__uuid_subtype))
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/cursor.py", line 720, in __send_message
    self.__uuid_subtype)
  File "/users/is/ahlpypi/egg_cache/p/pymongo-2.4.1_1-py2.6-linux-x86_64.egg/pymongo/helpers.py", line 104, in _unpack_response
    error_object["$err"])
OperationFailure: database error: can't find shard for: cn54-ib:27118
The error seen on the mongos:
Tue Jan 22 15:29:17 [conn270] warning: db exception when initializing on rs1:rs1/cn54:27118,cn55:27118, current connectionstate is { state: { conn: "rs1/cn54-ib:27118,cn55-ib:27118", vinfo: "mongoose.centaur @ 6|1||50fe9e767e7521213b281407", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 13129 can't find shard for: cn54-ib:27118
Tue Jan 22 15:29:17 [conn270] AssertionException while processing op type : 2004 to : mongoose.centaur :: caused by :: 13129 can't find shard for: cn54-ib:27118
Tue Jan 22 15:29:19 [conn270] ERROR: can't get TCP_KEEPINTVL: errno:92 Protocol not available
Tue Jan 22 15:29:19 [conn270] Assertion: 13129:can't find shard for: cn54-ib:27118
0x80de91 0x7d76e9 0x7d786c 0x7688a8 0x763e64 0x769f76 0x76c21a 0x7704ec 0x56b7df 0x58249a 0x5889d9 0x7814a2 0x75a81b 0x4ffe41 0x7fc0b1 0x3e1c60673d 0x3e1bed3f6d
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0x80de91]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x99) [0x7d76e9]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos [0x7d786c]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15StaticShardInfo4findERKSs+0x358) [0x7688a8]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo5Shard5resetERKSs+0x34) [0x763e64]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo17checkShardVersionEPNS_12DBClientBaseERKSsN5boost10shared_ptrIKNS_12ChunkManagerEEEbi+0x906) [0x769f76]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo14VersionManager19checkShardVersionCBEPNS_15ShardConnectionEbi+0x6a) [0x76c21a]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo15ShardConnection11_finishInitEv+0xfc) [0x7704ec]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor28setupVersionAndHandleSlaveOkEN5boost10shared_ptrINS_23ParallelConnectionStateEEERKNS_5ShardENS2_IS5_EERKNS_15NamespaceStringERKSsNS2_IKNS_12ChunkManagerEEE+0x19f) [0x56b7df]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor9startInitEv+0xdea) [0x58249a]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor8fullInitEv+0x9) [0x5889d9]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0x472) [0x7814a2]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo7Request7processEi+0x1fb) [0x75a81b]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x4ffe41]
/opt/ahl/releases/mongodb/2.2.1-1.ahl/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0x7fc0b1]
/lib64/libpthread.so.0 [0x3e1c60673d]
/lib64/libc.so.6(clone+0x6d) [0x3e1bed3f6d]
Tue Jan 22 15:29:19 [conn270] warning: db exception when initializing on rs1:rs1/cn54:27118,cn55:27118, current connectionstate is { state: { conn: "rs1/cn54-ib:27118,cn55-ib:27118", vinfo: "mongoose.centaur @ 6|1||50fe9e767e7521213b281407", cursor: "(none)", count: 0, done: false }, retryNext: false, init: false, finish: false, errored: false } :: caused by :: 13129 can't find shard for: cn54-ib:27118
Tue Jan 22 15:29:19 [conn270] AssertionException while processing op type : 2004 to : mongoose.centaur :: caused by :: 13129 can't find shard for: cn54-ib:27118
We ended up fixing this by manually changing the rs1 location in the config.shards collection.
According to this newsgroup posting by Eliot, this shouldn't be necessary:
https://groups.google.com/d/topic/mongodb-user/9L3LM5nK5aI/discussion
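For reference, the manual fix to config.shards mentioned above could be prepared along these lines. This is a hedged sketch, not the exact commands that were run: the helper `shard_host_update` is hypothetical, and it assumes the shard document is keyed by `_id` with the location stored in a `host` field in the `setName/host1,host2` form that appears in the mongos log.

```python
def shard_host_update(shard_id, set_name, hosts):
    """Build the query and update documents for one config.shards entry.

    Hypothetical helper: assumes the entry has the shape
    {"_id": <shard name>, "host": "<set name>/<host>,<host>,..."}.
    """
    host_string = "%s/%s" % (set_name, ",".join(hosts))
    query = {"_id": shard_id}
    update = {"$set": {"host": host_string}}
    return query, update


# Documents for pointing shard rs1 at the '-ib' host names.
query, update = shard_host_update(
    "rs1", "rs1", ["cn54-ib:27118", "cn55-ib:27118"])
```

With a pymongo 2.x connection, the two documents would be passed to `update(query, update)` on the `shards` collection of the `config` database, followed by a restart of every mongos as described above. Back up the config database before attempting an edit like this.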