[SERVER-7029] Mongos crash on replica set failover Created: 13/Sep/12  Updated: 15/Feb/13  Resolved: 18/Sep/12

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Gregor Macadam Assignee: Greg Studer
Resolution: Duplicate Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-7061 mongos can use invalid ptr to master ... Closed
Operating System: ALL
Participants:

 Description   

Sharded cluster whose shards are 4-node replica sets with member priorities 2, 1, 0, 0.
A transient failover from the priority-2 node to the priority-1 node and back to the priority-2 node causes a mongos segfault.

https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/NeeB86n9-JU
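
For context, here is a minimal mongo-shell sketch of a replica set configuration matching the priorities above, plus one way to simulate the transient failover. Only the set name and the first two hostnames come from the log below; the two priority-0 hosts are hypothetical placeholders.

// Hypothetical 4-member config with priorities 2, 1, 0, 0.
cfg = {
    _id: "col03",
    members: [
        { _id: 0, host: "server01c03:27017", priority: 2 },
        { _id: 1, host: "server02c03:27017", priority: 1 },
        { _id: 2, host: "server03c03:27017", priority: 0 },  // placeholder host
        { _id: 3, host: "server04c03:27017", priority: 0 }   // placeholder host
    ]
};
rs.initiate(cfg);

// Simulate a transient failover: briefly step down the priority-2 primary.
// The priority-1 member takes over, and once the stepdown window expires the
// priority-2 member is re-elected, producing the primary "flapping" that the
// ReplicaSetMonitorWatcher lines in the log below report.
rs.stepDown(30);   // run on the priority-2 primary; it stays secondary for ~30s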

Just had more nodes collapse on a different 'col' replica set now... here's the mongos log with logLevel: 2 turned on:

Tue Sep 11 14:03:29 [conn1510] warning: splitChunk failed - cmd: { splitChunk: "catalog.feed_data_changelog", keyPattern: { retid: 1.0, feedid: 1.0, uniqid: 1.0, version: 1.0 }, min: { retid: 13712, feedid: 2669, uniqid: "287e1d9af8a592cfe0a80aa7b80df7ba", version: 1 }, max: { retid: 13712, feedid: 2669, uniqid: "4cf79bccacf4e71d2e73607677fea6e5", version: 1 }, from: "col03", splitKeys: [ { retid: 13712, feedid: 2669, uniqid: "38623d70f41a1a433b27e0f187219eb0", version: 3 } ], shardId: "catalog.feed_data_changelog-retid_13712feedid_2669uniqid_"287e1d9af8a592cfe0a80aa7b80df7ba"version_1", configdb: "servercfg1:27019,servercfg2:27019,servercfg3:27019" } result: { who: { _id: "catalog.feed_data_changelog", process: "server01c03:27017:1346968636:1431697445", state: 2, ts: ObjectId('504f71768fb2ef42a78a0c25'), when: new Date(1347383670849), who: "server01c03:27017:1346968636:1431697445:conn62819:2040432535", why: "migrate-{ retid: 8941, feedid: 8005, uniqid: "9cd5f39b3cfaeb79e8925eef346e68d0", version: 32 }" }, errmsg: "the collection's metadata lock is taken", ok: 0.0 }
Tue Sep 11 14:03:29 [conn1510] ChunkManager: time to load chunks for catalog.feed_data_changelog: 5ms sequenceNumber: 191 version: 227|3||504aa7463a46fa0144cf6f5e based on: 227|3||504aa7463a46fa0144cf6f5e
Tue Sep 11 14:03:29 [conn1510] warning: chunk manager reload forced for collection 'catalog.feed_data_changelog', config version is 227|3||504aa7463a46fa0144cf6f5e
Tue Sep 11 14:04:19 [conn1510] end connection 127.0.0.1:52194 (2 connections now open)
Tue Sep 11 14:04:20 [mongosMain] connection accepted from 127.0.0.1:52353 #1515 (3 connections now open)
Tue Sep 11 14:04:35 [ReplicaSetMonitorWatcher] Primary for replica set col03 changed to server02c03:27017
Tue Sep 11 14:04:45 [ReplicaSetMonitorWatcher] Primary for replica set col03 changed to server01c03:27017
Tue Sep 11 14:04:45 [ReplicaSetMonitorWatcher] Primary for replica set col03 changed to server02c03:27017
Tue Sep 11 14:04:49 [WriteBackListener-server02c03:27017] DBClientCursor::init call() failed
Tue Sep 11 14:04:49 [WriteBackListener-server02c03:27017] WriteBackListener exception : DBClientBase::findN: transport error: server02c03:27017 ns: admin.$cmd query: { writebacklisten: ObjectId('504e77846a941ccd587623c8') }
Tue Sep 11 14:04:49 [conn1515] got not master for: server02c03:27017
Received signal 11
Backtrace: 0x8386d5 0x3bd7a302d0 0x2aaaab2bad80 
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib64/libc.so.6[0x3bd7a302d0]
[0x2aaaab2bad80]



 Comments   
Comment by Ian Whalen (Inactive) [ 31/Oct/12 ]

Hi Chamith, I'm sorry you're still seeing a problem. The best option for now would be to open up a new conversation on our forums at https://groups.google.com/forum/?fromgroups#!forum/mongodb-user - our team will be able to discuss the problem further with you there and then, if needed, either reopen this ticket or open a new one covering any bugs you've identified.

Comment by Chamith Wijesundara [ 31/Oct/12 ]

Hi Ian Whalen/All

We updated our MongoDB servers from 2.2.0 to 2.2.1 as you suggested for this issue, but the same problem is still occurring.

The following is the log from just before the crash:

"Wed Oct 31 06:16:41 [conn733481] got not master for: xx.xxx.xxx.xx:27772
Wed Oct 31 06:16:41 [conn740969] end connection xx.xxx.xxx.xx:56104 (489 connections now open)
Received signal 11
Backtrace: 0x8386d5 0x7fb65e9c6230 0xc632b0
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/libc.so.6(+0x32230)[0x7fb65e9c6230]
Wed Oct 31 06:16:41 [conn741593] end connection xx.xxx.xxx.xx:42052 (488 connections now open)
Wed Oct 31 06:16:41 [mongosMain] connection accepted from xx.xxx.xxx.xx:43605 #741604 (489 connections now open)
Wed Oct 31 06:16:41 [conn741577] Socket recv() errno:104 Connection reset by peer xx.xxx.xxx.xx:27772
Wed Oct 31 06:16:41 [conn741577] SocketException: remote: xx.xxx.xxx.xx:27772 error: 9001 socket exception [1] server [xx.xxx.xxx.xx:27772]
Wed Oct 31 06:16:41 [conn741577] DBClientCursor::init call() failed
mongos(_ZTVN5mongo18DBClientReplicaSetE+0x10)[0xc632b0]
==="

Many thanks,
Chamith

Comment by Gregor Macadam [ 29/Oct/12 ]

Hi - the 2.2.1 release should be available in the next week or two, pending any unexpected problems.
Thanks
Gregor

Comment by Chamith Wijesundara [ 29/Oct/12 ]

Hi,

According to the MongoDB developers (https://jira.mongodb.org/browse/SERVER), 2.2.1 should have been released last Friday.

But I still can't find that version on the downloads page (http://www.mongodb.org/downloads).
Is there somewhere I can download this version, or is it yet to be released? If the latter, when will it be released?

Many thanks,
Chamith.

Comment by Daniel Pasette (Inactive) [ 24/Oct/12 ]

The 2.2.1-rc1 build is still being qualified. The 2.2.1 release will be announced here:
https://groups.google.com/forum/?fromgroups=#!forum/mongodb-announce

Comment by Chamith Wijesundara [ 22/Oct/12 ]

Thanks Ian.

We will upgrade to 2.2.1 when it is released. It will be released by Thursday, right?

Thanks for the quick reply.

Comment by Ian Whalen (Inactive) [ 22/Oct/12 ]

Chamith, the fix for this issue has indeed been committed and will be available in both 2.3.0 and 2.2.1 when those releases are made. Please note, however, that 2.3.0 is a development release, so it should not be used in a production environment.

Comment by Chamith Wijesundara [ 22/Oct/12 ]

Hi,

We are currently having the same issue. Our MongoDB version is 2.2.0.
According to https://jira.mongodb.org/browse/SERVER-7061, this issue is fixed in 2.3.0, right?
Is there a way to overcome this issue other than upgrading to 2.3.0?

Thanks and appreciate your help.

Comment by Greg Studer [ 18/Sep/12 ]

SERVER-7061 describes the issue in more detail.

Comment by Greg Studer [ 18/Sep/12 ]

Think we've tracked this down, thanks again for the help. Linked to the duplicate ticket.

Comment by Greg Studer [ 18/Sep/12 ]

@Daniel - thanks for the trace. If you have any more of the log, it would be very useful to see that as well; as-is, it will still help us out quite a bit, and we'll start digging.

I think the other problem is an unreachable config server? Pretty sure it's unrelated.

Comment by Taylor Fort [ 18/Sep/12 ]

Daniel, I did the same thing and was still able to reproduce the issue after a few hours of heavy write traffic. Depending on how well your shard key distributes on its own, we found you can reduce the impact of the mongos crashes by restricting the balancer window to off-peak hours, increasing the chunk size, and (if reasonable in your environment) setting all secondaries to priority: 0, which gives up automatic failover. I have a Nagios check on the primaries so I'm alerted if one goes down and can manually promote a secondary. It's definitely not a production-ready plan, but it keeps mongos up and running without being a nightmare to administer.
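
For anyone else applying these mitigations, a rough mongo-shell sketch follows, assuming a shell connected to a mongos (the window times, chunk size, and member indexes are illustrative, not Taylor's actual settings):

// Restrict the balancer to an off-peak window.
use config
db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
    { upsert: true }
)

// Raise the maximum chunk size (in MB) to reduce split/migration activity.
db.settings.save({ _id: "chunksize", value: 128 })

// On each shard's replica set (connected to its primary, not the mongos),
// set every secondary to priority 0 so no automatic failover occurs.
cfg = rs.conf()
cfg.members[1].priority = 0   // repeat for each secondary's index
cfg.members[2].priority = 0
rs.reconfig(cfg)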

Comment by Daniel Brügge [ 18/Sep/12 ]

Hi, I have just completely reset my cluster, including deleting all data (config servers, all mongod data, etc.). I am importing data now and saw this error. It's not crashing the mongos, but it starts out similarly. Maybe it's useful:

 
Tue Sep 18 18:31:09 [LockPinger] cluster config1:27019,config2:27019,config3:27019 pinged successfully at Tue Sep 18 18:31:06 2012 by distributed lock pinger 'config1:27019,config2:27019,config3:27019/server1:27017:1347980213:1804289383', sleeping for 30000ms
Tue Sep 18 18:31:13 [Balancer] distributed lock 'balancer/server1:27017:1347980213:1804289383' acquired, ts : 5058a1cc86b1876254c835da
Tue Sep 18 18:31:17 [Balancer] distributed lock 'balancer/server1:27017:1347980213:1804289383' unlocked. 
Tue Sep 18 18:32:09 [LockPinger] Socket recv() timeout  server1:27019
Tue Sep 18 18:32:29 [CheckConfigServers] Socket recv() timeout  server1:27019
Tue Sep 18 18:32:53 [CheckConfigServers] SocketException: remote: server1:27019 error: 9001 socket exception [3] server [server1:27019] 
Tue Sep 18 18:32:53 [CheckConfigServers] DBClientCursor::init call() failed
Tue Sep 18 18:32:53 [LockPinger] SocketException: remote: server1:27019 error: 9001 socket exception [3] server [server1:27019] 
Tue Sep 18 18:32:53 [LockPinger] DBClientCursor::init call() failed
Tue Sep 18 18:32:53 [CheckConfigServers] warning:  couldn't check on config server:config2:27019 ok for now : 10276 DBClientBase::findN: transport error: config2:27019 ns: admin.$cmd query: { getlasterror: 1 }
Tue Sep 18 18:32:53 [LockPinger] scoped connection to config1:27019,config2:27019,config3:27019 not being returned to the pool
Tue Sep 18 18:32:53 [LockPinger] warning: distributed lock pinger 'config1:27019,config2:27019,config3:27019/server1:27017:1347980213:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 10276 DBClientBase::findN: transport error: config2:27019 ns: admin.$cmd query: { fsync: 1 } config2:27019:{}
Tue Sep 18 18:33:01 [Balancer] distributed lock 'balancer/server1:27017:1347980213:1804289383' acquired, ts : 5058a1db86b1876254c835db

Comment by Daniel Brügge [ 17/Sep/12 ]

Yes. Took some hours to reproduce it:

Mon Sep 17 18:17:42 [Balancer] Refreshing MaxChunkSize: 64
Mon Sep 17 18:17:42 BackgroundJob starting: ConnectBG
Mon Sep 17 18:17:42 [Balancer] checking last ping for lock 'balancer/client1:27017:1347895185:1804289383' against process client1:27017:1347895185:1804289383 and ping Mon Sep 17 18:16:36 2012
Mon Sep 17 18:17:42 [Balancer] could not force lock 'balancer/client1:27017:1347895185:1804289383' because elapsed time 0 <= takeover time 900000
Mon Sep 17 18:17:42 [Balancer] skipping balancing round because another balancer is active
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] SocketException: remote: shard3_rs2_IP:27018 error: 9001 socket exception [0] server [shard3_rs2_IP:27018] 
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] DBClientCursor::init call() failed
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] User Assertion: 10276:DBClientBase::findN: transport error: shard3_rs2:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('5057388e7773804af634bb61') }
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] WriteBackListener exception : DBClientBase::findN: transport error: shard3_rs2:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('5057388e7773804af634bb61') }
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] GLE is { singleShard: "rs_wa_3/shard3_rs1:27018,shard3_rs2:27018,shard3_rs3:27018", n: 0, lastOp: Timestamp 1347898666000|3, connectionId: 101460, err: null, ok: 1.0 }
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] writebacklisten result: { data: { writeBack: true, ns: "database.documents", id: ObjectId('50574c6600000000000209d2'), connectionId: 101453, instanceIdent: "shard3_rs1:27018", version: Timestamp 428000|0, versionEpoch: ObjectId('000000000000000000000000'), yourVersion: Timestamp 427000|5, yourVersionEpoch: ObjectId('000000000000000000000000'), msg: BinData }, ok: 1.0 }
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] connectionId: shard3_rs1:27018:101453 writebackId: 50574c6600000000000209d2 needVersion : 428|0||000000000000000000000000 mine : 428|1||000000000000000000000000
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] op: insert len: 631 ns: database.documents{ _id: ObjectId('50367fbae4b0c4a3a7f44e6e'), e_uuid: "59a9b774-5e7c-475e-8a3b-186e3810c8f1", uuid: "2d5f4ae8-a4f7-44a7-b962-83bd0bf0f39f", time_of_import: 1345748922, content: "test content", content_source: "BODY", document_uuid: "8bf3c955-0bd0-4339-a125-d7b02730a912", domain: "www.test.eu", import_date: new Date(1345748922054), language: "EN", persist_date: new Date(1345748922054), published_date: new Date(1345520473835), published_date_not_found: false, source_name: "blog", source_type: "BLOG", url: "http://www.test.eu/2012/08/21/", value: 0 }
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] wbl already reloaded config information for version 428|0||000000000000000000000000, at version 428|1||000000000000000000000000
Mon Sep 17 18:17:46 BackgroundJob starting: ConnectBG
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018]  have to set shard version for conn: shard3_rs2:27018 ns:database.documents my last seq: 0  current: 59 version: 428|1||000000000000000000000000 manager: 0x7f38f6c17b00
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018]     setShardVersion  rs_wa_3 shard3_rs2:27018  database.documents  { setShardVersion: "database.documents", configdb: "config1:27019,config2:27019,config3:27019", version: Timestamp 428000|1, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('5057388e7773804af634bb61'), shard: "rs_wa_3", shardHost: "rs_wa_3/shard3_rs1:27018,shard3_rs2:27018,shard3_rs3:27018" } 0x7f38ffb1b9e0
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018]        setShardVersion failed!
{ need_authoritative: true, errmsg: "first setShardVersion", ok: 0.0 }
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018]  have to set shard version for conn: shard3_rs2:27018 ns:database.documents my last seq: 0  current: 59 version: 428|1||000000000000000000000000 manager: 0x7f38f6c17b00
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018]     setShardVersion  rs_wa_3 shard3_rs2:27018  database.documents  { setShardVersion: "database.documents", configdb: "config1:27019,config2:27019,config3:27019", version: Timestamp 428000|1, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('5057388e7773804af634bb61'), authoritative: true, shard: "rs_wa_3", shardHost: "rs_wa_3/shard3_rs1:27018,shard3_rs2:27018,shard3_rs3:27018" } 0x7f38ffb1b9e0
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] got not master for: shard3_rs2:27018
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018]        setShardVersion failed!
{ errmsg: "not master", note: "from post init in setShardVersion", errmsg: "", ok: 0.0 }
Received signal 11
Backtrace: 0x8386d5 0x7f38ffbd8230 
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/libc.so.6(+0x32230)[0x7f38ffbd8230]
===

Comment by Gregor Macadam [ 17/Sep/12 ]

@Daniel can you upload a mongos log with loglevel set to 2?

Comment by Daniel Brügge [ 17/Sep/12 ]

I have exactly the same error here. The PRIMARY member tries to change and then it crashes:

WriteBackListener exception : DBClientBase::findN: transport error: member1:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('50572fea6e34dd533b7f4bb1') }
[...]
Mon Sep 17 16:32:32 [conn2] got not master for: member1:27018
Received signal 11
Backtrace: 0x8386d5 0x7f5bb706e230 0xc61e30 
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/libc.so.6(+0x32230)[0x7f5bb706e230]
mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]
===

Comment by Taylor Fort [ 13/Sep/12 ]

I'll get the debug info for you tomorrow. We're currently testing a short-term workaround of setting all secondaries to priority: 0 and living without automatic failover. We're waiting to see the consequences during our balancer window.

Comment by Greg Studer [ 13/Sep/12 ]

Are you sure the logLevel is set to 2? If so, there should be additional information about the isMaster values and the replica set. You can check directly at runtime with:

http://docs.mongodb.org/manual/reference/command/setParameter/

If you set the logLevel, you should see output showing the previous logLevel (so if you set twice, you can be sure the setting was acknowledged).
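
For example, from a shell connected to the mongos (output shown is illustrative):

// Read the current verbosity, then raise it; the "was" field in the
// setParameter reply echoes the previous value, confirming the change took.
db.adminCommand({ getParameter: 1, logLevel: 1 })
// { "logLevel" : 0, "ok" : 1 }
db.adminCommand({ setParameter: 1, logLevel: 2 })
// { "was" : 0, "ok" : 1 }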

This very much looks like some sort of race condition, and we'll start looking, but the information returned by the replica set members would be extremely helpful here (it's reported at logLevel >= 1).
