[SERVER-4888] mongod crash with signal 7 under high write load Created: 07/Feb/12  Updated: 11/Jul/13  Resolved: 17/Jan/13

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.0.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: P Eger Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

uname -a
Linux test-mongo2-us.internal.net 2.6.18-194.el5 #1 SMP Fri Apr 2 14:58:14 EDT 2010 x86_64 x86_64 x86_64 GNU/Linux


Operating System: Linux
Participants:

 Description   

I have a simple 3-shard cluster (1 config server, 1 mongos) set up for testing and am throwing a high load of reads and writes at a collection sharded on _id (a random integer between 1 and 10M). After about an hour of testing, test-mongo2-us (the master) crashed with the error below, possibly while splitting a chunk / rebalancing, given the timestamps on the log entries.
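
For reference, a minimal sketch of how a comparable setup can be reproduced from the mongo shell. The testdb.user namespace and the _id shard key come from the logs below; the load loop, document shape, and iteration count are illustrative assumptions, not the exact harness used in this report:

// run against the mongos: enable sharding and shard the collection on _id
sh.enableSharding("testdb")
sh.shardCollection("testdb.user", { _id: 1 })

// assumed load generator: mixed writes and reads with a random _id in [1, 10M]
var userColl = db.getSiblingDB("testdb").user;
for (var i = 0; i < 10000000; i++) {
    var id = Math.floor(Math.random() * 10000000) + 1;
    userColl.save({ _id: id, payload: "some test data" });                // upsert write
    userColl.findOne({ _id: Math.floor(Math.random() * 10000000) + 1 });  // read
}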

Tue Feb 7 00:01:21 [conn71] received splitChunk request: { splitChunk: "testdb.user", keyPattern: { _id: 1.0 }, min: { _id: 6252355 }, max: { _id: 6965549 }, from: "test-mongo2-us:27117", splitKeys: [ { _id: 6585511 } ], shardId: "testdb.user-_id_6252355", configdb: "test-mongo1-us:27019" }
Tue Feb 7 00:01:21 [conn71] created new distributed lock for testdb.user on test-mongo1-us:27019 ( lock timeout : 900000, ping interval : 30000, process : 0 )
Tue Feb 7 00:01:21 [conn73] command admin.$cmd command: { splitChunk: "testdb.user", keyPattern: { _id: 1.0 }, min: { _id: 6252355 }, max: { _id: 6965549 }, from: "test-mongo2-us:27117", splitKeys: [ { _id: 6585515 } ], shardId: "testdb.user-_id_6252355", configdb: "test-mongo1-us:27019" } ntoreturn:1 reslen:351 555ms
Tue Feb 7 00:01:21 [conn68] received splitChunk request: { splitChunk: "testdb.user", keyPattern: { _id: 1.0 }, min: { _id: 6252355 }, max: { _id: 6965549 }, from: "test-mongo2-us:27117", splitKeys: [ { _id: 6585511 } ], shardId: "testdb.user-_id_6252355", configdb: "test-mongo1-us:27019" }
Tue Feb 7 00:01:21 [conn68] created new distributed lock for testdb.user on test-mongo1-us:27019 ( lock timeout : 900000, ping interval : 30000, process : 0 )
Tue Feb 7 00:01:21 [conn23] could not acquire lock 'testdb.user/test-mongo2-us.web.blizzard.net:27117:1328559662:781691710' (another update won)
Tue Feb 7 00:01:21 [conn23] distributed lock 'testdb.user/test-mongo2-us.web.blizzard.net:27117:1328559662:781691710' was not acquired.
Tue Feb 7 00:01:21 [conn23] command admin.$cmd command: { splitChunk: "testdb.user", keyPattern: { _id: 1.0 }, min: { _id: 6252355 }, max: { _id: 6965549 }, from: "test-mongo2-us:27117", splitKeys: [ { _id: 6585531 } ], shardId: "testdb.user-_id_6252355", configdb: "test-mongo1-us:27019" } ntoreturn:1 reslen:351 653ms
Tue Feb 7 00:01:21 Invalid access at address: 0xb9bd3c
Tue Feb 7 00:01:21 Got signal: 7 (Bus error).



 Comments   
Comment by Geoff Wagstaff [ 11/Jul/13 ]

I'm also having issues with sharding under high write load. More info: https://groups.google.com/forum/#!topic/mongodb-user/BF06C0yqtV4

Here's my mongos log (includes stack trace):

Thu Jul 11 16:26:17.903 [conn468] warning: splitChunk failed - cmd: { splitChunk: "trends.dimension.language", keyPattern: { _id: 1.0 }, min: { _id: BinData }, max: { _id: MaxKey }, from: "rs4", splitKeys: [ { _id: BinData } ], shardId: "trends.dimension.language-_id_BinData(0, 000120130711)", configdb: "mongobig-1-1:27019,mongobig-2-1:27019,mongobig-3-1:27019" } result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedca979a3c5840b68bbe1'), when: new Date(1373559977876), who: "mongobig-4-1:27018:1373558874:1854596173:conn133:1732459270", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }
Thu Jul 11 16:26:18.530 [conn470] ChunkManager: time to load chunks for trends.dimension.language: 0ms sequenceNumber: 141 version: 2|3||51dedb10e68a130889d33a14 based on: 2|1||51dedb10e68a130889d33a14
Thu Jul 11 16:26:18.530 [conn470] autosplitted trends.dimension.language shard: ns:trends.dimension.languageshard: rs4:rs4/mongobig-4-1:27018lastmod: 2|1||000000000000000000000000min: { _id: BinData }max: { _id: MaxKey } on: { _id: BinData } (splitThreshold 471859) (migrate suggested)
Thu Jul 11 16:26:18.535 [conn470] moving chunk (auto): ns:trends.dimension.languageshard: rs4:rs4/mongobig-4-1:27018lastmod: 2|3||000000000000000000000000min: { _id: BinData }max: { _id: MaxKey } to: rs5:rs5/mongobig-5-1:27018
Thu Jul 11 16:26:18.535 [conn470] moving chunk ns: trends.dimension.language moving ( ns:trends.dimension.languageshard: rs4:rs4/mongobig-4-1:27018lastmod: 2|3||000000000000000000000000min: { _id: BinData }max: { _id: MaxKey }) rs4:rs4/mongobig-4-1:27018 -> rs5:rs5/mongobig-5-1:27018
Thu Jul 11 16:26:18.549 [conn464] warning: splitChunk failed - cmd: { splitChunk: "trends.dimension.language", keyPattern: { _id: 1.0 }, min: { _id: BinData }, max: { _id: MaxKey }, from: "rs4", splitKeys: [ { _id: BinData } ], shardId: "trends.dimension.language-_id_BinData(0, 000120130711)", configdb: "mongobig-1-1:27019,mongobig-2-1:27019,mongobig-3-1:27019" } result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }
Thu Jul 11 16:26:18.549 [conn466] warning: splitChunk failed - cmd: { splitChunk: "trends.dimension.language", keyPattern: { _id: 1.0 }, min: { _id: BinData }, max: { _id: MaxKey }, from: "rs4", splitKeys: [ { _id: BinData } ], shardId: "trends.dimension.language-_id_BinData(0, 000120130711)", configdb: "mongobig-1-1:27019,mongobig-2-1:27019,mongobig-3-1:27019" } result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }
Thu Jul 11 16:26:18.625 [conn470] moveChunk result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: BinData }" }
Thu Jul 11 16:26:18.625 [conn470] Assertion: 10412:moveAndCommit failed: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: BinData }" }
0xa82161 0xa46e8b 0xa473cc 0x8c7b5e 0x9bc29c 0x9c30ca 0x991db1 0x669891 0xa6e8ce 0x7f5fc5b4fe9a 0x7f5fc4e62ccd
 /usr/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0xa82161]
 /usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x9b) [0xa46e8b]
 /usr/bin/mongos() [0xa473cc]
 /usr/bin/mongos(_ZNK5mongo5Chunk13splitIfShouldEl+0x23ee) [0x8c7b5e]
 /usr/bin/mongos(_ZN5mongo13ShardStrategy7_updateERKSsRKNS_7BSONObjES5_iRNS_7RequestERNS_9DbMessageEi+0x3bc) [0x9bc29c]
 /usr/bin/mongos(_ZN5mongo13ShardStrategy7writeOpEiRNS_7RequestE+0x63a) [0x9c30ca]
 /usr/bin/mongos(_ZN5mongo7Request7processEi+0xd1) [0x991db1]
 /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x669891]
 /usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xa6e8ce]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f5fc5b4fe9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5fc4e62ccd]

Comment by Eliot Horowitz (Inactive) [ 17/Jan/13 ]

Without a full stack trace, diagnosing this will be impossible.
There have been various fixes in 2.2.x, so if you haven't already, I would upgrade to the latest.
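
A quick way to confirm the version each node is actually running before upgrading (standard shell helper, shown here for completeness; the 2.0.2 output reflects the affected version on this ticket):

// from a mongo shell connected to the mongod/mongos in question
> db.version()
2.0.2
// or from the host's command line
$ mongod --version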
