-
Type: Bug
-
Resolution: Duplicate
-
Priority: Major - P3
-
None
-
Affects Version/s: 2.0.3
-
Component/s: Replication, Sharding, Stability
-
Environment: Ubuntu 12.04.1 LTS on a hi1.4xlarge AWS instance
-
Linux
-
We were taking an old MongoDB server out of a replica set (one of two shards) so that it could be replaced with a newer build of the system. The machine being removed was mongo2-1, and it was the primary.
I first stepped it down from Primary:
rs.stepDown();
mongo2-4 was elected the new primary. That went very smoothly, and mongo2-4 was serving requests with little or no effect on the site from the transition.
I let that run for a minute, and since everything looked fine, I ran
rs.remove("mongo2-1.prod.trello.local:27017") on mongo2-4.
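The precondition I checked by eye before the rs.remove() call can be sketched as a small function. This is only a sketch: canRemoveMember is a hypothetical helper, not a mongo shell built-in, and it assumes the members array (with name and stateStr fields) that rs.status() returns.

```javascript
// Hypothetical helper, not part of the mongo shell.
// `status` is assumed to look like the document returned by rs.status().
function canRemoveMember(status, host) {
  var members = status.members || [];

  // Find the member we intend to remove.
  var target = members.filter(function (m) { return m.name === host; })[0];
  if (!target) {
    return { ok: false, reason: "not a member" };
  }

  // Removing the current primary would force an election mid-removal.
  if (target.stateStr === "PRIMARY") {
    return { ok: false, reason: "still primary" };
  }

  // Some other member must already have taken over as primary.
  var otherPrimary = members.some(function (m) {
    return m.name !== host && m.stateStr === "PRIMARY";
  });
  if (!otherPrimary) {
    return { ok: false, reason: "no other primary" };
  }

  return { ok: true, reason: "" };
}
```

In the shell this would be called as canRemoveMember(rs.status(), "mongo2-1.prod.trello.local:27017") on the new primary. It only encodes the check described above; it does not protect against the crash reported in this ticket.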
The new primary (mongo2-4) logged some very strange replica set behavior:
Mon Dec 3 21:56:59 [conn219311] starting new replica set monitor for replica set rs2 with seed of mongo2-1.prod.trello.local:27017
Mon Dec 3 21:56:59 [conn219311] successfully connected to seed mongo2-1.prod.trello.local:27017 for replica set rs2
Note that the seed there is the replica set member I had just removed.
Then MongoDB on mongo2-4 segfaulted:
Mon Dec 3 21:57:00 Invalid access at address: 0xffffffffffffffe0
Mon Dec 3 21:57:00 Got signal: 11 (Segmentation fault).
Mon Dec 3 21:57:00 Backtrace:
0xa90d79 0xa91350 0x7f588eb5acb0 0x5d3f2d 0x5d9d01 0x5da4b0 0xa20db3 0xa2163b 0xa27c45 0xa2837e 0xa23b42 0xa6b06e 0x97cc14 0x97e20f 0x940e25 0x9441b1 0x8869d7 0x88df49 0xaa37d6 0x637497
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0xa90d79]
/usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x220) [0xa91350]
/lib/x86_64-linux-gnu/libpthread.so.0(+0xfcb0) [0x7f588eb5acb0]
/usr/bin/mongod(_ZN5mongo17ReplicaSetMonitor16_checkConnectionEPNS_18DBClientConnectionERSsbi+0x11ad) [0x5d3f2d]
/usr/bin/mongod(_ZN5mongo17ReplicaSetMonitorC1ERKSsRKSt6vectorINS_11HostAndPortESaIS4_EE+0x441) [0x5d9d01]
/usr/bin/mongod(_ZN5mongo17ReplicaSetMonitor3getERKSsRKSt6vectorINS_11HostAndPortESaIS4_EE+0x200) [0x5da4b0]
/usr/bin/mongod(_ZN5mongo5Shard7_rsInitEv+0x133) [0xa20db3]
/usr/bin/mongod(_ZN5mongo5Shard8_setAddrERKSs+0x13b) [0xa2163b]
/usr/bin/mongod(_ZN5mongo15StaticShardInfo6reloadEv+0xa65) [0xa27c45]
/usr/bin/mongod(_ZN5mongo15StaticShardInfo4findERKSs+0x14e) [0xa2837e]
/usr/bin/mongod(_ZN5mongo5Shard5resetERKSs+0x42) [0xa23b42]
/usr/bin/mongod(_ZN5mongo17SplitChunkCommand3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0xbee) [0xa6b06e]
/usr/bin/mongod(_ZN5mongo11execCommandEPNS_7CommandERNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x6a4) [0x97cc14]
/usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x6ff) [0x97e20f]
/usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x35) [0x940e25]
/usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0x11e1) [0x9441b1]
/usr/bin/mongod() [0x8869d7]
/usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x559) [0x88df49]
/usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x76) [0xaa37d6]
/usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x637497]
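To pull the mangled symbols out of a backtrace like the one above (for example, to feed them to a demangler), a minimal sketch follows. parseFrame is a hypothetical helper, and the regular expression assumes the "binary(symbol+0xoffset) [0xaddress]" frame layout seen in these logs.

```javascript
// Hypothetical helper: splits one backtrace frame line into its parts.
// Returns null when the line does not match the assumed frame layout.
function parseFrame(line) {
  var re = /^(\S+)\(([^+)]*)(?:\+0x([0-9a-f]+))?\)\s+\[0x([0-9a-f]+)\]$/;
  var m = re.exec(line.trim());
  if (!m) {
    return null;
  }
  return {
    binary: m[1],                          // e.g. /usr/bin/mongod
    symbol: m[2] || null,                  // mangled name, null if absent
    offset: m[3] ? "0x" + m[3] : null,     // offset within the symbol
    address: "0x" + m[4]                   // absolute return address
  };
}
```

Frames with no symbol, such as "/usr/bin/mongod() [0x8869d7]", come back with symbol and offset set to null rather than being dropped.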
After it restarted, we were consistently getting socket exceptions in Mongo clients connecting through mongos, and the mongos logs were showing:
Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-1.prod.trello.local:27019]
Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-2.prod.trello.local:27019]
Mon Dec 3 21:57:36 [conn18303] SyncClusterConnection connecting to [mongo-config-3.prod.trello.local:27019]
Mon Dec 3 21:57:37 [conn18339] Assertion: 13129:can't find shard for: mongo2-5.prod.trello.local:27017
0x5350c2 0x75caba 0x757fd2 0x7f5297 0x5c3526 0x5c1707 0x5ef3ee 0x76fbd9 0x7b60e7 0x7c8691 0x5e8127 0x7f270a3dbe9a 0x7f27098f6cbd
/usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x112) [0x5350c2]
/usr/bin/mongos(_ZN5mongo15StaticShardInfo4findERKSs+0x3aa) [0x75caba]
/usr/bin/mongos(_ZN5mongo5Shard5resetERKSs+0x42) [0x757fd2]
/usr/bin/mongos() [0x7f5297]
/usr/bin/mongos(_ZN5boost6detail8function17function_invoker4IPFbRN5mongo12DBClientBaseERKSsbiEbS5_S7_biE6invokeERNS1_15function_bufferES5_S7_bi+0x16) [0x5c3526]
/usr/bin/mongos(_ZN5mongo15ShardConnection11_finishInitEv+0x137) [0x5c1707]
/usr/bin/mongos(_ZN5mongo27ParallelSortClusteredCursor5_initEv+0x5be) [0x5ef3ee]
/usr/bin/mongos(_ZN5mongo13ShardStrategy7queryOpERNS_7RequestE+0xc59) [0x76fbd9]
/usr/bin/mongos(_ZN5mongo7Request7processEi+0x187) [0x7b60e7]
/usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x7c8691]
/usr/bin/mongos(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x5e8127]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f270a3dbe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f27098f6cbd]
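For anyone scanning mongos logs for the same symptom, the assertion line above can be matched with a short function. This is a sketch only: parseShardAssertion is a hypothetical helper, and the pattern assumes the exact "Assertion: <code>:can't find shard for: <host>" wording shown in this log.

```javascript
// Hypothetical helper: extracts the assertion code and the shard host from a
// mongos "can't find shard for" log line. Returns null for any other line.
function parseShardAssertion(line) {
  var m = /Assertion: (\d+):can't find shard for: (\S+)/.exec(line);
  if (!m) {
    return null;
  }
  return { code: Number(m[1]), shard: m[2] };
}
```

Applied to the log above, it picks out assertion code 13129 and the host mongos could not map to a shard.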
Finally, we added mongo2-1 back into the replica set, restarted all mongos and client processes, and things went back to normal. An hour later, we were able to remove mongo2-1 from the replica set and nothing exploded. All told, about a 15-minute site outage.
- duplicates
-
SERVER-5110 ReplicaSetMonitor::check not thread safe wrt _master
- Closed