[SERVER-29310] server crash during chunk split Created: 21/May/17  Updated: 21/Jun/17  Resolved: 22/May/17

Status: Closed
Project: Core Server
Component/s: Networking
Affects Version/s: 3.2.13
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Robert Fehrmann Assignee: Samantha Ritter (Inactive)
Resolution: Duplicate Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongod.log    
Issue Links:
Duplicate
duplicates SERVER-29152 Segfault in multiple shard primaries ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

I suspect this will happen again (we have 9 different clusters), but right now I do not have steps to reproduce the problem.

Participants:

 Description   

Four days ago we upgraded from 3.0.7 to 3.2.13. We had not had any crashes in a year, but we have had 2 crashes on 2 different clusters since the upgrade. The first crash did not produce a stack trace, but the second one did (please see the attached log file). Here is the log from right before the crash, followed by the stack trace. It seems that the segmentation fault happened during a chunk split attempt.

2017-05-21T09:13:28.466-0400 I SHARDING [conn3787] request split points lookup for chunk postingrecommendation.postingrecommendation { : "TN", : "5913264d50499b0bb4434b24" } -->> { : "TN", : "7f934f001357ce9c0eb72c05" }
2017-05-21T09:13:28.515-0400 I SHARDING [conn3787] received splitChunk request: { splitChunk: "postingrecommendation.postingrecommendation", keyPattern: { _skp: 1.0, _id: 1.0 }, min: { _skp: "TN", _id: "5913264d50499b0bb4434b24" }, max: { _skp: "TN", _id: "7f934f001357ce9c0eb72c05" }, from: "jsra", splitKeys: [ { _skp: "TN", _id: "59185007638e290ba8e933a5" }, { _skp: "TN", _id: "591cff1650499b0bb44d70df" } ], shardId: "postingrecommendation.postingrecommendation-_skp_"TN"_id_"5913264d50499b0bb4434b24"", configdb: "mgocnf-a.snagprod.corp:27340,mgocnf-b.snagprod.corp:27340,mgocnf-c.snagprod.corp:27340", epoch: ObjectId('527abca8d31d1633acdaa97e') }
2017-05-21T09:13:28.749-0400 I SHARDING [conn3787] distributed lock 'postingrecommendation.postingrecommendation/mgo-jsra-a.snagprod.corp:27017:1495120452:697408834' acquired for 'splitting chunk [{ _skp: "TN", _id: "5913264d50499b0bb4434b24" }, { _skp: "TN", _id: "7f934f001357ce9c0eb72c05" }) in postingrecommendation.postingrecommendation', ts : 592192782e655bbf91dde53f
2017-05-21T09:13:28.749-0400 I SHARDING [conn3787] remotely refreshing metadata for postingrecommendation.postingrecommendation based on current shard version 30|12824||527abca8d31d1633acdaa97e, current metadata version is 30|12824||527abca8d31d1633acdaa97e
2017-05-21T09:13:28.751-0400 I SHARDING [conn3787] metadata of collection postingrecommendation.postingrecommendation already up to date (shard version : 30|12824||527abca8d31d1633acdaa97e, took 2 ms)
2017-05-21T09:13:28.752-0400 W SHARDING [conn3787] splitChunk cannot find chunk [{ _skp: "TN", _id: "5913264d50499b0bb4434b24" },{ _skp: "TN", _id: "7f934f001357ce9c0eb72c05" }) to split, the chunk boundaries may be stale
2017-05-21T09:13:28.850-0400 I SHARDING [conn3787] distributed lock 'postingrecommendation.postingrecommendation/mgo-jsra-a.snagprod.corp:27017:1495120452:697408834' unlocked.
2017-05-21T09:13:28.850-0400 I COMMAND  [conn3787] command admin.$cmd command: splitChunk { splitChunk: "postingrecommendation.postingrecommendation", keyPattern: { _skp: 1.0, _id: 1.0 }, min: { _skp: "TN", _id: "5913264d50499b0bb4434b24" }, max: { _skp: "TN", _id: "7f934f001357ce9c0eb72c05" }, from: "jsra", splitKeys: [ { _skp: "TN", _id: "59185007638e290ba8e933a5" }, { _skp: "TN", _id: "591cff1650499b0bb44d70df" } ], shardId: "postingrecommendation.postingrecommendation-_skp_"TN"_id_"5913264d50499b0bb4434b24"", configdb: "mgocnf-a.snagprod.corp:27340,mgocnf-b.snagprod.corp:27340,mgocnf-c.snagprod.corp:27340", epoch: ObjectId('527abca8d31d1633acdaa97e') } keyUpdates:0 writeConflicts:0 exception: splitChunk cannot find chunk [{ _skp: "TN", _id: "5913264d50499b0bb4434b24" },{ _skp: "TN", _id: "7f934f001357ce9c0eb72c05" }) to split, the chunk boundaries may be stale ( ns : postingrecommendation.postingrecommendation, received : 0|0||000000000000000000000000, wanted : 30|12824||527abca8d31d1633acdaa97e, send ) code:13388 numYields:0 reslen:496 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, MMAPV1Journal: { acquireCount: { w: 1 } }, Database: { acquireCount: { w: 1 } }, Collection: { acquireCount: { W: 1 } } } protocol:op_query 335ms
2017-05-21T09:13:28.851-0400 I NETWORK  [conn3787] end connection 10.70.18.214:49658 (222 connections now open)
2017-05-21T09:13:28.866-0400 F -        [thread1] Invalid access at address: 0xffffffffffffffe8
2017-05-21T09:13:28.997-0400 F -        [thread1] Got signal: 11 (Segmentation fault).
0x133f4f2 0x133e649 0x133e9c8 0x7f3fd19db330 0x1b514e9 0x1b51ba9 0xa111e9 0xa118b5 0x11f3c03 0x11f5e50 0x1418fc0 0x7f3fd19d2f82 0x7f3fd19d3197 0x7f3fd1700bed
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"F3F4F2","s":"_ZN5mongo15printStackTraceERSo"},{"b":"400000","o":"F3E649"},{"b":"400000","o":"F3E9C8"},{"b":"7F3FD19CB000","o":"10330"},{"b":"400000","o":"17514E9","s":"_ZNSo6sentryC2ERSo"},{"b":"400000","o":"1751BA9","s":"_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l"},{"b":"400000","o":"6111E9","s":"_ZN5mongo11PoolForHost4doneEPNS_16DBConnectionPoolEPNS_12DBClientBaseE"},{"b":"400000","o":"6118B5","s":"_ZN5mongo16DBConnectionPool7releaseERKSsPNS_12DBClientBaseE"},{"b":"400000","o":"DF3C03"},{"b":"400000","o":"DF5E50"},{"b":"400000","o":"1018FC0"},{"b":"7F3FD19CB000","o":"7F82"},{"b":"7F3FD19CB000","o":"8197"},{"b":"7F3FD1603000","o":"FDBED","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.2.13", "gitVersion" : "23899209cad60aaafe114f6aea6cb83025ff51bc", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.13.0-112-generic", "version" : "#159-Ubuntu SMP Fri Mar 3 15:26:07 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "B559BDA626A4B7F4A29153D8DA0DAA0B3B48A82B" }, { "b" : "7FFCB9DAB000", "elfType" : 3, "buildId" : "012E1338BA43AF7C0DC7D069F64F0A6490CC6D9C" }, { "b" : "7F3FD28ED000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "48A664AE6B0B4918A3EB0156C6364C4F084232FD" }, { "b" : "7F3FD2511000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "6B8997EA892A7FF37AC8CAA8F239D595251889BB" }, { "b" : "7F3FD2309000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "1EEBA762A6A2C8884D56033EE8CCE79B95CD974D" }, { "b" : "7F3FD2105000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "D0F881E59FF88BE4F29A228C8657376B3C325C2C" }, { "b" : "7F3FD1DFF000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1654CB13B1D24ED03F4BDCB51FC7524B9181A771" }, { "b" : "7F3FD1BE9000", "path" : 
"/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "36311B4457710AE5578C4BF00791DED7359DBB92" }, { "b" : "7F3FD19CB000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "22F9078CFA529CCE1A814A4A1A1C018F169D5652" }, { "b" : "7F3FD1603000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "CA5C6CFE528AF541C3C2C15CEE4B3C74DA4E2FB4" }, { "b" : "7F3FD2B4C000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "237E22E5AAC2DDFCD06518F63FD720FE758E6E5B" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x133f4f2]
 mongod(+0xF3E649) [0x133e649]
 mongod(+0xF3E9C8) [0x133e9c8]
 libpthread.so.0(+0x10330) [0x7f3fd19db330]
 mongod(_ZNSo6sentryC2ERSo+0x19) [0x1b514e9]
 mongod(_ZSt16__ostream_insertIcSt11char_traitsIcEERSt13basic_ostreamIT_T0_ES6_PKS3_l+0x29) [0x1b51ba9]
 mongod(_ZN5mongo11PoolForHost4doneEPNS_16DBConnectionPoolEPNS_12DBClientBaseE+0x109) [0xa111e9]
 mongod(_ZN5mongo16DBConnectionPool7releaseERKSsPNS_12DBClientBaseE+0xE5) [0xa118b5]
 mongod(+0xDF3C03) [0x11f3c03]
 mongod(+0xDF5E50) [0x11f5e50]
 mongod(+0x1018FC0) [0x1418fc0]
 libpthread.so.0(+0x7F82) [0x7f3fd19d2f82]
 libpthread.so.0(+0x8197) [0x7f3fd19d3197]
 libc.so.6(clone+0x6D) [0x7f3fd1700bed]
-----  END BACKTRACE  -----
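
For anyone cross-checking the backtrace above: each JSON frame carries a module load base ("b") and an offset ("o"), and adding them should give the absolute address printed in the symbolized lines. The following is a minimal sketch of that arithmetic on three frames copied from this log; the helper name absolute_address is hypothetical and is not part of any MongoDB tooling.

```python
# Frames copied from the "backtrace" array in this ticket. "b" is the module
# load base and "o" is the offset within that module, both hexadecimal strings.
frames = [
    {"b": "400000", "o": "F3F4F2", "s": "_ZN5mongo15printStackTraceERSo"},
    {"b": "7F3FD19CB000", "o": "10330"},
    {"b": "400000", "o": "6111E9",
     "s": "_ZN5mongo11PoolForHost4doneEPNS_16DBConnectionPoolEPNS_12DBClientBaseE"},
]


def absolute_address(frame):
    """Return base + offset for one backtrace frame as an integer."""
    return int(frame["b"], 16) + int(frame["o"], 16)


addresses = [absolute_address(f) for f in frames]
for frame, addr in zip(frames, addresses):
    # Frames without an "s" key have no symbol name in the log.
    print(hex(addr), frame.get("s", "<no symbol>"))
```

The computed values can be compared against the bracketed addresses in the symbolized frames, such as [0x133f4f2] and [0xa111e9].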

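The faulting address in the "Invalid access at address" line is easier to read as a signed 64-bit value. The sketch below only performs that conversion. Reading the result as a small negative offset from a null pointer is an assumption that fits the number, not something this ticket confirms.

```python
def as_signed_64(value):
    """Interpret an unsigned 64-bit integer as a two's-complement signed value."""
    return value - (1 << 64) if value & (1 << 63) else value


# Address taken from the "Invalid access at address" log line.
fault_address = 0xFFFFFFFFFFFFFFE8
offset = as_signed_64(fault_address)
print(offset)
```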

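The splitChunk failure message reports two version strings, "received" and "wanted". A rough sketch for comparing them follows, assuming the strings use a major|minor||epoch layout; the function name and the field interpretation are assumptions made for illustration, and the two strings are copied from the log above.

```python
def parse_chunk_version(text):
    """Split a 'major|minor||epoch' string into (major, minor, epoch).

    The field layout is an assumption based on how the strings appear in the log.
    """
    major_minor, epoch = text.split("||")
    major, minor = major_minor.split("|")
    return int(major), int(minor), epoch


received = parse_chunk_version("0|0||000000000000000000000000")
wanted = parse_chunk_version("30|12824||527abca8d31d1633acdaa97e")

# Under this reading, the request carried an all-zero version, while the
# shard expected version 30|12824 with a non-zero epoch.
print("received:", received)
print("wanted:  ", wanted)
```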

 Comments   
Comment by Samantha Ritter (Inactive) [ 22/May/17 ]

This issue duplicates SERVER-29152.

Comment by Samantha Ritter (Inactive) [ 22/May/17 ]

Hi Robert,

Thank you for these stack traces. They helped us track down the root cause of this issue, which is an internal race condition between the shutdown of different objects while threads within the server are exiting. I've written a bit more about this in SERVER-29152, and we're going to use that ticket to track this bug and the status of a patch to fix it.

Thank you,
Samantha

Comment by Daniel Pasette (Inactive) [ 21/May/17 ]

We're investigating a crash with a very similar fingerprint in SERVER-29152. We are treating this issue as critical and will release a patch as soon as we have determined the root cause.

Generated at Thu Feb 08 04:20:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.