[SERVER-10012] Segmentation fault on splitChunk Created: 24/Jun/13  Updated: 17/Mar/14  Resolved: 17/Mar/14

Status: Closed
Project: Core Server
Component/s: Sharding, Stability
Affects Version/s: 2.4.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Hendrik Schumacher Assignee: Unassigned
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Debian squeeze


Issue Links:
Related
related to SERVER-7790 Segfault in splitchunk following drop... Closed
is related to SERVER-5160 Handle all failed shardCollection com... Closed
Operating System: Linux
Participants:

 Description   

We use a MongoDB sharded setup with four shards (each a replica set with two mongods and one arbiter). Since we do batch inserts we use pre-splitting. All mongods are version 2.4.3.
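For reference, a minimal mongo shell sketch of this kind of pre-splitting on a hashed shard key (the collection name is hypothetical; the real namespaces appear abbreviated as "ad_cache.cache-xxx" in the logs below, and the split points mirror values seen there):

sh.enableSharding("ad_cache")
sh.shardCollection("ad_cache.cache_example", { _id: "hashed" })   // hypothetical collection name

// Split the hashed key space at fixed points before the batch insert.
[ NumberLong("0"),
  NumberLong("2305843009213693950"),
  NumberLong("4611686018427387900"),
  NumberLong("6917529027641081850") ].forEach(function (point) {
    sh.splitAt("ad_cache.cache_example", { _id: point });
});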

On the slave of one of the shards we got the following segmentation fault:

Mon Jun 24 16:22:25.325 [conn912] command admin.$cmd command: { splitChunk: "ad_cache.cache-xxx", keyPattern: { _id: "hashed" }, min: { _id: 0 }, max: { _id: 4611686018427387900 }, from: "mongo_rs33", splitKeys: [ { _id: 2305843009213693950 } ], shardId: "ad_cache.cache-xxx-_id_0", configdb: "s33:27019,s34:27019,s35:27019" } ntoreturn:1 keyUpdates:0 locks(micros) r:61 reslen:37 5490ms
Mon Jun 24 16:22:25.329 [conn912] received splitChunk request: { splitChunk: "ad_cache.cache-xxx", keyPattern: { _id: "hashed" }, min: { _id: 4611686018427387900 }, max: { _id: MaxKey }, from: "mongo_rs33", splitKeys: [ { _id: 6917529027641081850 } ], shardId: "ad_cache.cache-xxx-_id_4611686018427387900", configdb: "s33:27019,s34:27019,s35:27019" }
Mon Jun 24 16:22:27.501 [conn912] distributed lock 'ad_cache.cache-xxx/h34:27020:1372068904:410557907' acquired, ts : 51c85621020fcf91b8e782c2
Mon Jun 24 16:22:27.533 [conn912] splitChunk accepted at version 2|5||51c855e903e25cae7275142f
Mon Jun 24 16:22:28.813 [conn912] about to log metadata event: { _id: "h34-2013-06-24T14:22:28-51c85624020fcf91b8e782cd", server: "h34", clientAddr: "10.48.2.33:38163", time: new Date(1372083748813), what: "split", ns: "ad_cache.cache-xxx", details: { before: { min: { _id: 4611686018427387900 }, max: { _id: MaxKey }, lastmod: Timestamp 1000|3, lastmodEpoch: ObjectId('000000000000000000000000') }, left: { min: { _id: 4611686018427387900 }, max: { _id: 6917529027641081850 }, lastmod: Timestamp 2000|8, lastmodEpoch: ObjectId('51c855e903e25cae7275142f') }, right: { min: { _id: 6917529027641081850 }, max: { _id: MaxKey }, lastmod: Timestamp 2000|9, lastmodEpoch: ObjectId('51c855e903e25cae7275142f') } } }
Mon Jun 24 16:22:30.358 Invalid access at address: 0xbc from thread: conn912
 
Mon Jun 24 16:22:30.358 Got signal: 11 (Segmentation fault).
 
Mon Jun 24 16:22:30.451 Backtrace:
0xdcf361 0x6cf729 0x6cfcb2 0x7fb1c1e0bff0 0xccf5d4 0x8d236a 0x8d5065 0x8d6592 0xa7c97b 0xa80360 0x9f44d4 0x9f57e2 0x6e747a 0xdbbb7e 0x7fb1c1e038ca 0x7fb1c11b6b6d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf729]
 /usr/bin/mongod(_ZN5mongo24abruptQuitWithAddrSignalEiP7siginfoPv+0x262) [0x6cfcb2]
 /lib/libpthread.so.0(+0xeff0) [0x7fb1c1e0bff0]
 /usr/bin/mongod(_ZN5mongo17SplitChunkCommand3runERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x85c4) [0xccf5d4]
 /usr/bin/mongod(_ZN5mongo12_execCommandEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x3a) [0x8d236a]
 /usr/bin/mongod(_ZN5mongo7Command11execCommandEPS0_RNS_6ClientEiPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0x705) [0x8d5065]
 /usr/bin/mongod(_ZN5mongo12_runCommandsEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x5e2) [0x8d6592]
 /usr/bin/mongod(_ZN5mongo11runCommandsEPKcRNS_7BSONObjERNS_5CurOpERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x3b) [0xa7c97b]
 /usr/bin/mongod(_ZN5mongo8runQueryERNS_7MessageERNS_12QueryMessageERNS_5CurOpES1_+0xd50) [0xa80360]
 /usr/bin/mongod() [0x9f44d4]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x392) [0x9f57e2]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x9a) [0x6e747a]
 /usr/bin/mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xdbbb7e]
 /lib/libpthread.so.0(+0x68ca) [0x7fb1c1e038ca]
 /lib/libc.so.6(clone+0x6d) [0x7fb1c11b6b6d]

The database runs fine for some time after we restart it, but after a few hours the segmentation fault occurs again. We already tried removing the dbpath files and letting the mongod do an initial sync (this occasionally succeeds, but the segmentation fault happens again afterwards).



 Comments   
Comment by Stennie Steneker (Inactive) [ 17/Mar/14 ]

Hi Hendrik.

Apologies for the delay in follow-up on this issue. Your last response suggested that you had found a workaround using splitChunk from a single mongos, and we did not have any additional logs to review.

SERVER-7790, which you mentioned, does seem to be a similar issue, and the feature requests SERVER-4537 and SERVER-5160 may also be relevant for you to watch & upvote.

FYI, in MongoDB 2.4.8+ there is now caching for dbhash results, which should improve performance when using multiple mongos: https://jira.mongodb.org/browse/SERVER-11021. If you have a large config database and multiple mongos, it's possible that long-running dbhash commands could have contributed to slowness and spurious "config servers differ" warnings.
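For anyone comparing the config servers by hand, a minimal shell sketch using the dbHash command against the three config servers quoted in the log (direct access to them is assumed):

var hosts = ["s33:27019", "s34:27019", "s35:27019"];
hosts.forEach(function (host) {
    // dbHash returns an md5 over the database's collections;
    // differing values indicate the config servers are out of sync.
    var res = new Mongo(host).getDB("config").runCommand({ dbHash: 1 });
    print(host + " -> " + res.md5);
});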

I'm going to close this issue, but please feel free to reopen if you have new information to investigate.

Thanks,
Stephen

Comment by Hendrik Schumacher [ 05/Aug/13 ]

Hello,

sorry for not replying earlier, I was on vacation.

1. The change we made was to limit the splitChunk commands to one application. Beforehand we had multiple servers, each with an application and a mongos. Each application created collections in a sharded database, sharded these collections, and issued splitChunk commands to pre-split them before doing mass inserts. After the change we only had one application creating, sharding and splitting the collections (and putting the collections into a collection pool). The actual commands executed did not change. I assume that not doing createCollection, shardCollection and splitChunk in parallel is what fixed our problem with the segmentation fault; a sketch of the serialized workflow follows this list.

2. It is possible that individual mongos instances did not run exactly the same version, but if so they were very close (mainly 2.4.3 and one or two on 2.4.2).

3. The "config servers x and y differ" messages occurred all over the place (every other minute, for hours). This problem reappeared even after we manually re-synced the config servers (several times) and restarted the mongos instances. This was another reason we limited the creating, sharding and splitting of the collections to one application, since it also fixed the config server problem.

4. I am not sure if we still have mongos logs from the time of the segmentation fault, since this was quite some time ago, but I will look. It will probably be hard to find the right mongos since we had 6 or 7 of them and only one would have issued the problematic splitChunk command.
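
A minimal mongo shell sketch of the serialized workflow described in point 1, assuming hypothetical host and collection names, with every createCollection, shardCollection and split issued sequentially through a single mongos:

var mongos = new Mongo("mongos-host:27017");   // assumed: the single mongos used for pre-splitting
var admin  = mongos.getDB("admin");
var ns     = "ad_cache.cache_example";         // hypothetical collection name

// Create, shard and pre-split one collection at a time, never in parallel.
mongos.getDB("ad_cache").createCollection("cache_example");
admin.runCommand({ enableSharding: "ad_cache" });          // returns an error if already enabled, which is harmless here
admin.runCommand({ shardCollection: ns, key: { _id: "hashed" } });

[ NumberLong("2305843009213693950"),
  NumberLong("4611686018427387900"),
  NumberLong("6917529027641081850") ].forEach(function (point) {
    admin.runCommand({ split: ns, middle: { _id: point } });
});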

Best,
Hendrik

Comment by sam.helman@10gen.com [ 25/Jul/13 ]

Hello,

To help us diagnose the issue, I have a few questions:

1. What exactly do you mean by "if we only did the splitChunk from one mongos"? What were the exact split commands you were running before and after this change, and from what arrangement of mongos's?
2. Can you confirm that all of the mongos's are running the same version?
3. The "config servers x and y differ" line indicates that the data is not consistent across the config servers. Did these errors occur before or after the segmentation fault?
4. Would it be possible to post logs from one or more of the mongos's from the time that the issue occurred?

Thank you

Comment by Hendrik Schumacher [ 12/Jul/13 ]

We tracked this down a bit. We had a number of mongos running on the sharded setup and we did not get the segmentation fault any longer if we only did the splitChunk from one mongos (one application using this mongos). Maybe this is related to https://jira.mongodb.org/browse/SERVER-4537 or https://jira.mongodb.org/browse/SERVER-7790.

Also one additional note: I don't know if this is connected, but while we were having this problem we also encountered problems with our config servers. The config servers continued to run, but the mongos instances had many "config servers x and y differ" lines in the logs.

Comment by Geoff Wagstaff [ 11/Jul/13 ]

While experiencing problems with sharding (more info: https://groups.google.com/forum/#!topic/mongodb-user/BF06C0yqtV4), I'm also seeing this. Here's a snippet from my mongos log:

Thu Jul 11 16:26:17.903 [conn468] warning: splitChunk failed - cmd: { splitChunk: "trends.dimension.language", keyPattern: { _id: 1.0 }, min: { _id: BinData }, max: { _id: MaxKey }, from: "rs4", splitKeys: [ { _id: BinData } ], shardId: "trends.dimension.language-_id_BinData(0, 000120130711)", configdb: "mongobig-1-1:27019,mongobig-2-1:27019,mongobig-3-1:27019" } result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedca979a3c5840b68bbe1'), when: new Date(1373559977876), who: "mongobig-4-1:27018:1373558874:1854596173:conn133:1732459270", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }
Thu Jul 11 16:26:18.530 [conn470] ChunkManager: time to load chunks for trends.dimension.language: 0ms sequenceNumber: 141 version: 2|3||51dedb10e68a130889d33a14 based on: 2|1||51dedb10e68a130889d33a14
Thu Jul 11 16:26:18.530 [conn470] autosplitted trends.dimension.language shard: ns:trends.dimension.languageshard: rs4:rs4/mongobig-4-1:27018lastmod: 2|1||000000000000000000000000min: { _id: BinData }max: { _id: MaxKey } on: { _id: BinData } (splitThreshold 471859) (migrate suggested)
Thu Jul 11 16:26:18.535 [conn470] moving chunk (auto): ns:trends.dimension.languageshard: rs4:rs4/mongobig-4-1:27018lastmod: 2|3||000000000000000000000000min: { _id: BinData }max: { _id: MaxKey } to: rs5:rs5/mongobig-5-1:27018
Thu Jul 11 16:26:18.535 [conn470] moving chunk ns: trends.dimension.language moving ( ns:trends.dimension.languageshard: rs4:rs4/mongobig-4-1:27018lastmod: 2|3||000000000000000000000000min: { _id: BinData }max: { _id: MaxKey }) rs4:rs4/mongobig-4-1:27018 -> rs5:rs5/mongobig-5-1:27018
Thu Jul 11 16:26:18.549 [conn464] warning: splitChunk failed - cmd: { splitChunk: "trends.dimension.language", keyPattern: { _id: 1.0 }, min: { _id: BinData }, max: { _id: MaxKey }, from: "rs4", splitKeys: [ { _id: BinData } ], shardId: "trends.dimension.language-_id_BinData(0, 000120130711)", configdb: "mongobig-1-1:27019,mongobig-2-1:27019,mongobig-3-1:27019" } result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }
Thu Jul 11 16:26:18.549 [conn466] warning: splitChunk failed - cmd: { splitChunk: "trends.dimension.language", keyPattern: { _id: 1.0 }, min: { _id: BinData }, max: { _id: MaxKey }, from: "rs4", splitKeys: [ { _id: BinData } ], shardId: "trends.dimension.language-_id_BinData(0, 000120130711)", configdb: "mongobig-1-1:27019,mongobig-2-1:27019,mongobig-3-1:27019" } result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection's metadata lock is taken" }
Thu Jul 11 16:26:18.625 [conn470] moveChunk result: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: BinData }" }
Thu Jul 11 16:26:18.625 [conn470] Assertion: 10412:moveAndCommit failed: { who: { _id: "trends.dimension.language", process: "mongobig-4-1:27018:1373558874:1854596173", state: 1, ts: ObjectId('51dedcaa79a3c5840b68bbe3'), when: new Date(1373559978507), who: "mongobig-4-1:27018:1373558874:1854596173:conn149:504692606", why: "split-{ _id: BinData }" }, ok: 0.0, errmsg: "the collection metadata could not be locked with lock migrate-{ _id: BinData }" }
0xa82161 0xa46e8b 0xa473cc 0x8c7b5e 0x9bc29c 0x9c30ca 0x991db1 0x669891 0xa6e8ce 0x7f5fc5b4fe9a 0x7f5fc4e62ccd
 /usr/bin/mongos(_ZN5mongo15printStackTraceERSo+0x21) [0xa82161]
 /usr/bin/mongos(_ZN5mongo11msgassertedEiPKc+0x9b) [0xa46e8b]
 /usr/bin/mongos() [0xa473cc]
 /usr/bin/mongos(_ZNK5mongo5Chunk13splitIfShouldEl+0x23ee) [0x8c7b5e]
 /usr/bin/mongos(_ZN5mongo13ShardStrategy7_updateERKSsRKNS_7BSONObjES5_iRNS_7RequestERNS_9DbMessageEi+0x3bc) [0x9bc29c]
 /usr/bin/mongos(_ZN5mongo13ShardStrategy7writeOpEiRNS_7RequestE+0x63a) [0x9c30ca]
 /usr/bin/mongos(_ZN5mongo7Request7processEi+0xd1) [0x991db1]
 /usr/bin/mongos(_ZN5mongo21ShardedMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x71) [0x669891]
 /usr/bin/mongos(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x42e) [0xa6e8ce]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f5fc5b4fe9a]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5fc4e62ccd]
