|
Sharded 4 node replica set with priorities 2, 1, 0, 0
Transient failover from priority 2 to priority 1 node and back to priority 2 node causes mongos seg fault.
https://groups.google.com/forum/?fromgroups=#!topic/mongodb-user/NeeB86n9-JU
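For reference, the topology described above would correspond to a replica set config roughly like the sketch below (the set name col03 and the hosts server01c03/server02c03 come from the logs in this ticket; the other two hostnames are placeholders):

// Sketch of the 4-member set with priorities 2, 1, 0, 0 described above.
// server01c03/server02c03 appear in the logs; the remaining hosts are hypothetical.
rs.initiate({
    _id: "col03",
    members: [
        { _id: 0, host: "server01c03:27017", priority: 2 },  // preferred primary
        { _id: 1, host: "server02c03:27017", priority: 1 },  // failover target
        { _id: 2, host: "server03c03:27017", priority: 0 },  // hypothetical host, never primary
        { _id: 3, host: "server04c03:27017", priority: 0 }   // hypothetical host, never primary
    ]
})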
Just had more nodes collapse on a different column set now... here's the log with logLevel:2 turned on from mongos
Tue Sep 11 14:03:29 [conn1510] warning: splitChunk failed - cmd: { splitChunk: "catalog.feed_data_changelog", keyPattern: { retid: 1.0, feedid: 1.0, uniqid: 1.0, version: 1.0 }, min: { retid: 13712, feedid: 2669, uniqid: "287e1d9af8a592cfe0a80aa7b80df7ba", version: 1 }, max: { retid: 13712, feedid: 2669, uniqid: "4cf79bccacf4e71d2e73607677fea6e5", version: 1 }, from: "col03", splitKeys: [ { retid: 13712, feedid: 2669, uniqid: "38623d70f41a1a433b27e0f187219eb0", version: 3 } ], shardId: "catalog.feed_data_changelog-retid_13712feedid_2669uniqid_"287e1d9af8a592cfe0a80aa7b80df7ba"version_1", configdb: "servercfg1:27019,servercfg2:27019,servercfg3:27019" } result: { who: { _id: "catalog.feed_data_changelog", process: "server01c03:27017:1346968636:1431697445", state: 2, ts: ObjectId('504f71768fb2ef42a78a0c25'), when: new Date(1347383670849), who: "server01c03:27017:1346968636:1431697445:conn62819:2040432535", why: "migrate-{ retid: 8941, feedid: 8005, uniqid: "9cd5f39b3cfaeb79e8925eef346e68d0", version: 32 }" }, errmsg: "the collection's metadata lock is taken", ok: 0.0 }
|
Tue Sep 11 14:03:29 [conn1510] ChunkManager: time to load chunks for catalog.feed_data_changelog: 5ms sequenceNumber: 191 version: 227|3||504aa7463a46fa0144cf6f5e based on: 227|3||504aa7463a46fa0144cf6f5e
|
Tue Sep 11 14:03:29 [conn1510] warning: chunk manager reload forced for collection 'catalog.feed_data_changelog', config version is 227|3||504aa7463a46fa0144cf6f5e
|
Tue Sep 11 14:04:19 [conn1510] end connection 127.0.0.1:52194 (2 connections now open)
|
Tue Sep 11 14:04:20 [mongosMain] connection accepted from 127.0.0.1:52353 #1515 (3 connections now open)
|
Tue Sep 11 14:04:35 [ReplicaSetMonitorWatcher] Primary for replica set col03 changed to server02c03:27017
|
Tue Sep 11 14:04:45 [ReplicaSetMonitorWatcher] Primary for replica set col03 changed to server01c03:27017
|
Tue Sep 11 14:04:45 [ReplicaSetMonitorWatcher] Primary for replica set col03 changed to server02c03:27017
|
Tue Sep 11 14:04:49 [WriteBackListener-server02c03:27017] DBClientCursor::init call() failed
|
Tue Sep 11 14:04:49 [WriteBackListener-server02c03:27017] WriteBackListener exception : DBClientBase::findN: transport error: server02c03:27017 ns: admin.$cmd query: { writebacklisten: ObjectId('504e77846a941ccd587623c8') }
|
Tue Sep 11 14:04:49 [conn1515] got not master for: server02c03:27017
|
Received signal 11
|
Backtrace: 0x8386d5 0x3bd7a302d0 0x2aaaab2bad80
|
/usr/bin/mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
|
/lib64/libc.so.6[0x3bd7a302d0]
|
[0x2aaaab2bad80]
|
|
|
Hi Chamith, I'm sorry you're still seeing a problem. The best option for now would be to open up a new conversation on our forums at https://groups.google.com/forum/?fromgroups#!forum/mongodb-user - our team will be able to discuss the problem further with you there and then, if needed, either reopen this ticket or open a new one covering any bugs you've identified.
|
|
Hi Ian Whalen/All
We updated our MongoDB servers from 2.2.0 to 2.2.1 as you suggested for this issue, but the same crash is still occurring.
The following is the log from just before the crash:
"Wed Oct 31 06:16:41 [conn733481] got not master for: xx.xxx.xxx.xx:27772
Wed Oct 31 06:16:41 [conn740969] end connection xx.xxx.xxx.xx:56104 (489 connections now open)
Received signal 11
Backtrace: 0x8386d5 0x7fb65e9c6230 0xc632b0
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
/lib/libc.so.6(+0x32230)[0x7fb65e9c6230]
Wed Oct 31 06:16:41 [conn741593] end connection xx.xxx.xxx.xx:42052 (488 connections now open)
Wed Oct 31 06:16:41 [mongosMain] connection accepted from xx.xxx.xxx.xx:43605 #741604 (489 connections now open)
Wed Oct 31 06:16:41 [conn741577] Socket recv() errno:104 Connection reset by peer xx.xxx.xxx.xx:27772
Wed Oct 31 06:16:41 [conn741577] SocketException: remote: xx.xxx.xxx.xx:27772 error: 9001 socket exception [1] server [xx.xxx.xxx.xx:27772]
Wed Oct 31 06:16:41 [conn741577] DBClientCursor::init call() failed
mongos(_ZTVN5mongo18DBClientReplicaSetE+0x10)[0xc632b0]
==="
Many thanks,
Chamith
|
|
Hi - the 2.2.1 release should be available in the next week or two, pending any unexpected problems.
Thanks
Gregor
|
|
Hi,
According to the MongoDB developers (https://jira.mongodb.org/browse/SERVER), 2.2.1 was due to be released last Friday.
But I still can't find that version on the downloads page (http://www.mongodb.org/downloads).
Is there somewhere I can download this version, or has it not been released yet? If not, when will it be released?
Many thanks,
Chamith.
|
|
The 2.2.1-rc1 build is still being qualified. The 2.2.1 release will be announced here:
https://groups.google.com/forum/?fromgroups=#!forum/mongodb-announce
|
|
Thanks Ian.
We will upgrade to 2.2.1 when it is released. It will be released by Thursday, right?
Thanks for the quick reply.
|
|
Chamith, the fix for this issue has indeed been committed and will be available in both 2.3.0 and 2.2.1 when those releases are made. Please note, however, that 2.3.0 is a development release, so it should not be used in a production environment.
|
|
Hi,
We are currently seeing the same issue. Our MongoDB version is 2.2.0.
According to https://jira.mongodb.org/browse/SERVER-7061, this issue is fixed in 2.3.0, right?
Is there a way to work around this issue other than upgrading to 2.3.0?
Thanks and appreciate your help.
|
|
SERVER-7061 describes the issue in more detail.
|
|
Think we've tracked this down, thanks again for the help. Linked to the duplicate ticket.
|
|
@Daniel - thanks for the trace - if you have any more of the log, it would be very useful to see that as well. As-is, it'll help us out quite a bit though - we'll start digging.
I think the other problem is an unreachable config server? Pretty sure it's unrelated.
|
|
Daniel, I did the same thing and was still able to reproduce the issue after a few hours of heavy write traffic. Depending on how well your shard key distributes on its own, we observed that you can minimize the impact of the mongos crashes by setting the balancer window to off-peak hours, increasing the chunk size, and (if reasonable in your environment) setting all secondaries to priority:0 (disabling write fault tolerance/redundancy). I have a Nagios check on the primaries that alerts if one ever goes down, so that I can manually promote a secondary. It's definitely not a production-ready plan, but it keeps mongos up and running without being a nightmare to administer.
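For anyone trying the same mitigations, a minimal mongo shell sketch of the balancer-window and chunk-size settings mentioned above (the window times and the 128 MB size are illustrative values, not recommendations):

// Run against a mongos; values are examples only.
use config
// Restrict balancing to an off-peak window (server local time).
db.settings.update(
    { _id: "balancer" },
    { $set: { activeWindow: { start: "23:00", stop: "06:00" } } },
    true  // upsert
)
// Raise the maximum chunk size (in MB) to reduce split/migration frequency.
db.settings.save({ _id: "chunksize", value: 128 })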
|
|
Hi, I have just completely reset my cluster, including deleting all data (config servers, all mongod data, etc.). I am importing data now and saw this error. It's not crashing the mongos, but it starts out similarly. Maybe it's useful:
|
Tue Sep 18 18:31:09 [LockPinger] cluster config1:27019,config2:27019,config3:27019 pinged successfully at Tue Sep 18 18:31:06 2012 by distributed lock pinger 'config1:27019,config2:27019,config3:27019/server1:27017:1347980213:1804289383', sleeping for 30000ms
|
Tue Sep 18 18:31:13 [Balancer] distributed lock 'balancer/server1:27017:1347980213:1804289383' acquired, ts : 5058a1cc86b1876254c835da
|
Tue Sep 18 18:31:17 [Balancer] distributed lock 'balancer/server1:27017:1347980213:1804289383' unlocked.
|
Tue Sep 18 18:32:09 [LockPinger] Socket recv() timeout server1:27019
|
Tue Sep 18 18:32:29 [CheckConfigServers] Socket recv() timeout server1:27019
|
Tue Sep 18 18:32:53 [CheckConfigServers] SocketException: remote: server1:27019 error: 9001 socket exception [3] server [server1:27019]
|
Tue Sep 18 18:32:53 [CheckConfigServers] DBClientCursor::init call() failed
|
Tue Sep 18 18:32:53 [LockPinger] SocketException: remote: server1:27019 error: 9001 socket exception [3] server [server1:27019]
|
Tue Sep 18 18:32:53 [LockPinger] DBClientCursor::init call() failed
|
Tue Sep 18 18:32:53 [CheckConfigServers] warning: couldn't check on config server:config2:27019 ok for now : 10276 DBClientBase::findN: transport error: config2:27019 ns: admin.$cmd query: { getlasterror: 1 }
|
Tue Sep 18 18:32:53 [LockPinger] scoped connection to config1:27019,config2:27019,config3:27019 not being returned to the pool
|
Tue Sep 18 18:32:53 [LockPinger] warning: distributed lock pinger 'config1:27019,config2:27019,config3:27019/server1:27017:1347980213:1804289383' detected an exception while pinging. :: caused by :: SyncClusterConnection::udpate prepare failed: 10276 DBClientBase::findN: transport error: config2:27019 ns: admin.$cmd query: { fsync: 1 } config2:27019:{}
|
Tue Sep 18 18:33:01 [Balancer] distributed lock 'balancer/server1:27017:1347980213:1804289383' acquired, ts : 5058a1db86b1876254c835db
|
|
|
Yes. Took some hours to reproduce it:
Mon Sep 17 18:17:42 [Balancer] Refreshing MaxChunkSize: 64
|
Mon Sep 17 18:17:42 BackgroundJob starting: ConnectBG
|
Mon Sep 17 18:17:42 [Balancer] checking last ping for lock 'balancer/client1:27017:1347895185:1804289383' against process client1:27017:1347895185:1804289383 and ping Mon Sep 17 18:16:36 2012
|
Mon Sep 17 18:17:42 [Balancer] could not force lock 'balancer/client1:27017:1347895185:1804289383' because elapsed time 0 <= takeover time 900000
|
Mon Sep 17 18:17:42 [Balancer] skipping balancing round because another balancer is active
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] SocketException: remote: shard3_rs2_IP:27018 error: 9001 socket exception [0] server [shard3_rs2_IP:27018]
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] DBClientCursor::init call() failed
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] User Assertion: 10276:DBClientBase::findN: transport error: shard3_rs2:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('5057388e7773804af634bb61') }
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs2:27018] WriteBackListener exception : DBClientBase::findN: transport error: shard3_rs2:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('5057388e7773804af634bb61') }
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] GLE is { singleShard: "rs_wa_3/shard3_rs1:27018,shard3_rs2:27018,shard3_rs3:27018", n: 0, lastOp: Timestamp 1347898666000|3, connectionId: 101460, err: null, ok: 1.0 }
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] writebacklisten result: { data: { writeBack: true, ns: "database.documents", id: ObjectId('50574c6600000000000209d2'), connectionId: 101453, instanceIdent: "shard3_rs1:27018", version: Timestamp 428000|0, versionEpoch: ObjectId('000000000000000000000000'), yourVersion: Timestamp 427000|5, yourVersionEpoch: ObjectId('000000000000000000000000'), msg: BinData }, ok: 1.0 }
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] connectionId: shard3_rs1:27018:101453 writebackId: 50574c6600000000000209d2 needVersion : 428|0||000000000000000000000000 mine : 428|1||000000000000000000000000
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] op: insert len: 631 ns: database.documents{ _id: ObjectId('50367fbae4b0c4a3a7f44e6e'), e_uuid: "59a9b774-5e7c-475e-8a3b-186e3810c8f1", uuid: "2d5f4ae8-a4f7-44a7-b962-83bd0bf0f39f", time_of_import: 1345748922, content: "test content", content_source: "BODY", document_uuid: "8bf3c955-0bd0-4339-a125-d7b02730a912", domain: "www.test.eu", import_date: new Date(1345748922054), language: "EN", persist_date: new Date(1345748922054), published_date: new Date(1345520473835), published_date_not_found: false, source_name: "blog", source_type: "BLOG", url: "http://www.test.eu/2012/08/21/", value: 0 }
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] wbl already reloaded config information for version 428|0||000000000000000000000000, at version 428|1||000000000000000000000000
|
Mon Sep 17 18:17:46 BackgroundJob starting: ConnectBG
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] have to set shard version for conn: shard3_rs2:27018 ns:database.documents my last seq: 0 current: 59 version: 428|1||000000000000000000000000 manager: 0x7f38f6c17b00
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] setShardVersion rs_wa_3 shard3_rs2:27018 database.documents { setShardVersion: "database.documents", configdb: "config1:27019,config2:27019,config3:27019", version: Timestamp 428000|1, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('5057388e7773804af634bb61'), shard: "rs_wa_3", shardHost: "rs_wa_3/shard3_rs1:27018,shard3_rs2:27018,shard3_rs3:27018" } 0x7f38ffb1b9e0
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] setShardVersion failed!
|
{ need_authoritative: true, errmsg: "first setShardVersion", ok: 0.0 }
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] have to set shard version for conn: shard3_rs2:27018 ns:database.documents my last seq: 0 current: 59 version: 428|1||000000000000000000000000 manager: 0x7f38f6c17b00
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] setShardVersion rs_wa_3 shard3_rs2:27018 database.documents { setShardVersion: "database.documents", configdb: "config1:27019,config2:27019,config3:27019", version: Timestamp 428000|1, versionEpoch: ObjectId('000000000000000000000000'), serverID: ObjectId('5057388e7773804af634bb61'), authoritative: true, shard: "rs_wa_3", shardHost: "rs_wa_3/shard3_rs1:27018,shard3_rs2:27018,shard3_rs3:27018" } 0x7f38ffb1b9e0
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] got not master for: shard3_rs2:27018
|
Mon Sep 17 18:17:46 [WriteBackListener-shard3_rs1:27018] setShardVersion failed!
|
{ errmsg: "not master", note: "from post init in setShardVersion", errmsg: "", ok: 0.0 }
|
Received signal 11
|
Backtrace: 0x8386d5 0x7f38ffbd8230
|
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
|
/lib/libc.so.6(+0x32230)[0x7f38ffbd8230]
|
===
|
|
|
@Daniel can you upload a mongos log with loglevel set to 2?
|
|
I have exactly the same error here. The PRIMARY member tries to change and then mongos crashes:
WriteBackListener exception : DBClientBase::findN: transport error: member1:27018 ns: admin.$cmd query: { writebacklisten: ObjectId('50572fea6e34dd533b7f4bb1') }
|
[...]
|
Mon Sep 17 16:32:32 [conn2] got not master for: member1:27018
|
Received signal 11
|
Backtrace: 0x8386d5 0x7f5bb706e230 0xc61e30
|
mongos(_ZN5mongo17printStackAndExitEi+0x75)[0x8386d5]
|
/lib/libc.so.6(+0x32230)[0x7f5bb706e230]
|
mongos(_ZTVN5mongo18DBClientConnectionE+0x10)[0xc61e30]
|
===
|
|
|
I'll get the debug info for you tomorrow. We're currently testing a short-term workaround: setting all secondaries to priority:0 and living with the lack of write redundancy. We're waiting to see the consequences during our balancer window.
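In case it helps anyone else, the priority:0 workaround is just a reconfig along these lines (a sketch; the member indexes are assumptions, check rs.conf() for your own layout):

// Demote all secondaries so they can never become primary.
cfg = rs.conf()
cfg.members[1].priority = 0
cfg.members[2].priority = 0
rs.reconfig(cfg)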
|
|
Are you sure the logLevel is set to 2? If so, there should be additional information about the isMaster values and the replica set. You can check directly at runtime with:
http://docs.mongodb.org/manual/reference/command/setParameter/
If you set the logLevel, you should see output showing the previous logLevel (so if you set it twice, you can be sure the setting was acknowledged).
This very much looks like some sort of race condition and we'll start looking, but the information returned by the replica set members would be extremely helpful here (it's reported at logLevel >= 1).
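For reference, setting and verifying the logLevel at runtime looks roughly like this from a shell connected to the mongos; the "was" field in the reply shows the previous level:

// Raise verbosity; the reply's "was" field reports the previous logLevel.
db.adminCommand({ setParameter: 1, logLevel: 2 })
// e.g. { "was" : 0, "ok" : 1 }
// Confirm the current value.
db.adminCommand({ getParameter: 1, logLevel: 1 })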
|