[SERVER-8074] mongos segfault on replica set member shutdown Created: 03/Jan/13  Updated: 15/Feb/13  Resolved: 08/Jan/13

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Arkadiy Kukarkin Assignee: Randolph Tan
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-7061 mongos can use invalid ptr to master ... Closed
Operating System: Linux
Participants:

 Description   

We have a sharded cluster with two shards, each a two-member replica set. When we shut down one of these members (the SECONDARY in the 2nd shard, 'RIGHT' below), the PRIMARY also stepped down because it could no longer see a majority (there is no arbiter). This upset the mongos routers badly; 4 out of 10 of them segfaulted:

Thu Jan 3 16:37:25 [WriteBackListener-RIGHT:27018] DBClientCursor::init call() failed
Thu Jan 3 16:37:25 [WriteBackListener-RIGHT:27018] WriteBackListener exception : DBClientBase::findN: transport error: RIGHT:27018 ns: admin.$cmd query:

{ writebacklisten: ObjectId('50d0c65444377d91a6e94862') }

Thu Jan 3 16:37:25 [WriteBackListener-LEFT:27018] GLE is

{ singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497925, connectionId: 104511, err: null, ok: 1.0 }

Thu Jan 3 16:37:25 [WriteBackListener-LEFT:27018] GLE is

{ singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497926, connectionId: 104511, err: null, ok: 1. 0 }

Thu Jan 3 16:37:25 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355722684497935, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:27 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:29 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355739864367115, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:30 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355744159334401, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:30 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:31 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6082b44377d91a6ecadd3
Thu Jan 3 16:37:31 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:32 [ReplicaSetMonitorWatcher] DBClientCursor::init call() failed
Thu Jan 3 16:37:34 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:37 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083144377d91a6ecadd4
Thu Jan 3 16:37:37 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:39 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:39 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355782814040073, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:41 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355791403974659, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:42 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355795698941961, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:42 [ReplicaSetMonitorWatcher] trying reconnect to RIGHT:27018
Thu Jan 3 16:37:42 [ReplicaSetMonitorWatcher] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
Thu Jan 3 16:37:43 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083744377d91a6ecadd5
Thu Jan 3 16:37:43 [Balancer] events2 is unavailable
Thu Jan 3 16:37:43 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:43 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355799993909263, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:44 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355804288876560, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:45 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:45 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355808583843853, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

{ singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811137, connectionId: 104511, err: null, ok: 1.0 }

Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

{ singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811139, connectionId: 104511, err: null, ok: 1. 0 }

Thu Jan 3 16:37:46 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811145, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

{ singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811147, connectionId: 104511, err: null, ok: 1. 0 }

Thu Jan 3 16:37:46 [WriteBackListener-LEFT:27018] GLE is

{ singleShard: "events/LEFT:27018,RIGHT:27018", updatedExisting: true, n: 1, lastOp: 5829355812878811149, connectionId: 104511, err: null, ok: 1. 0 }

Thu Jan 3 16:37:47 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355817173778439, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:48 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355821468745731, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:49 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' acquired, ts : 50e6083d44377d91a6ecadd6
Thu Jan 3 16:37:49 [Balancer] events2 is unavailable
Thu Jan 3 16:37:49 [Balancer] distributed lock 'balancer/g14.hypem.com:27018:1355859540:1804289383' unlocked.
Thu Jan 3 16:37:49 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355825763713029, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:49 [WriteBackListener-LEFT:27018] DBClientCursor::init call() failed
Thu Jan 3 16:37:49 [WriteBackListener-LEFT:27018] WriteBackListener exception : DBClientBase::findN: transport error: LEFT:27018 ns: admin.$cmd query:

{ writebacklisten: ObjectId('50d0c65444377d91a6e94862') }

Thu Jan 3 16:37:49 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355825763713039, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680325, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] DBClientCursor::init call() failed
Thu Jan 3 16:37:50 [WriteBackListener-LEFT:27018] WriteBackListener exception : DBClientBase::findN: transport error: LEFT:27018 ns: admin.$cmd query:

{ writebacklisten: ObjectId('50d0c65444377d91a6e94862') }

Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680332, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:50 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355830058680336, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:51 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:51 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:51 [conn64348] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:51 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:51 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:51 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:51 [conn64348] DBClientCursor::init call() failed
Thu Jan 3 16:37:51 [conn64348] trying reconnect to RIGHT:27018
Thu Jan 3 16:37:51 [conn64348] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
Thu Jan 3 16:37:51 [conn64330] trying reconnect to LEFT:27018
Thu Jan 3 16:37:51 [conn64330] reconnect LEFT:27018 ok
Thu Jan 3 16:37:51 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355834353647627, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:52 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:52 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:52 [conn64348] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:52 [conn64348] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:52 [conn64348] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:52 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:52 [WriteBackListener-g29i:27018] GLE is { singleShard: "events2/g19i:27018,g29i:27018", updatedExisting: true, n: 1, lastOp: 5829355838648614917, connectionId: 202, err: null, ok: 1.0
}
Thu Jan 3 16:37:52 [conn64330] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:52 [conn64330] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:52 [conn64330] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:52 [conn64330] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:52 [conn64330] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:52 [conn64330] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:52 [WriteBackListener-RIGHT:27018] WriteBackListener exception : socket exception [CONNECT_ERROR] for RIGHT:27018
Thu Jan 3 16:37:52 [conn64320] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:52 [conn64320] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:52 [conn64320] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:52 [conn64320] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:52 [conn64320] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:52 [conn64320] warning: Weird shift of primary detected
Thu Jan 3 16:37:53 [conn64348] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:53 [conn64348] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:53 [conn64348] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:53 [conn64348] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:53 [conn64348] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:53 [conn64348] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:53 [conn64348] trying reconnect to RIGHT:27018
Thu Jan 3 16:37:53 [conn64348] reconnect RIGHT:27018 failed couldn't connect to server RIGHT:27018
Thu Jan 3 16:37:53 [conn64330] Socket recv() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:53 [conn64330] SocketException: remote: 10.0.25.172:27018 error: 9001 socket exception [1] server [10.0.25.172:27018]
Thu Jan 3 16:37:53 [conn64330] DBClientCursor::init lazy say() failed
Thu Jan 3 16:37:53 [conn64330] DBClientCursor::init message from say() was empty
Thu Jan 3 16:37:53 [conn64330] slave no longer has secondary status: LEFT:27018
Thu Jan 3 16:37:53 [conn64330] sharded connection to events/LEFT:27018,RIGHT:27018 not being returned to the pool
Thu Jan 3 16:37:53 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: LEFT:27018

{ setName: "events", ismaster: false, secondary: true, hosts: [ "LEFT:27018", "RIGHT:27018" ], me: "LEFT:27018", maxBsonObjectSize: 16777216, ok: 1.0 }

Thu Jan 3 16:37:53 [ReplicaSetMonitorWatcher] ReplicaSetMonitor::_checkConnection: caught exception RIGHT:27018 socket exception [FAILED_STATE] for RIGHT:27018
Thu Jan 3 16:37:53 [conn64320] Socket say send() errno:104 Connection reset by peer 10.0.25.172:27018
Thu Jan 3 16:37:53 [conn64320] DBException in process: socket exception [SEND_ERROR] for 10.0.25.172:27018
Thu Jan 3 16:37:54 [conn64330] got not master for: LEFT:27018
Received signal 11
Backtrace: 0x8386d5 0x7fc340c094c0 0xc61e30
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/x86_64-linux-gnu/libc.so.6(+0x364c0)[0x7fc340c094c0]
/usr/bin/mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]
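
The stepdown itself is expected with this topology: in a two-member replica set with no arbiter, the surviving member loses its voting majority as soon as its peer goes down, so it steps down rather than remain PRIMARY. As a sketch (the arbiter host name below is a placeholder, not part of our setup), adding an arbiter to the 'events' set would preserve a majority during single-node maintenance:

// Run from a mongo shell connected to the current PRIMARY of 'events'.
// "arbiter-host" is hypothetical; use a small node that holds no data.
rs.addArb("arbiter-host:27018")
rs.status()   // confirm three voting members before taking RIGHT down again

With an arbiter in place, shutting down either data node still leaves a 2-of-3 voting majority, so the surviving data node stays PRIMARY.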



 Comments   
Comment by Randolph Tan [ 08/Jan/13 ]

Thanks for the confirmation!

Comment by Arkadiy Kukarkin [ 07/Jan/13 ]

Looks right; I have not been able to reproduce this since upgrading the entire cluster to 2.2.2.

Comment by Randolph Tan [ 04/Jan/13 ]

Do you have a reproducible case for this bug? I believe that this should have been fixed in v2.2.1 by SERVER-7061. Would you be able to confirm that this bug won't appear in 2.2.1?

Thanks!
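
For reference, one way to confirm the version is from a mongo shell connected to each router, since db.version() reports the version of the process the shell is attached to (2.2.1 or later should include the SERVER-7061 fix):

// With the shell connected to a mongos, this returns that router's binary version.
db.version()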
