[SERVER-23615] Mongod crashes with a signal 11 (Segmentation fault) Created: 08/Apr/16  Updated: 06/Dec/22  Resolved: 19/Jul/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: shiyingwang Assignee: Backlog - Query Team (Inactive)
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongodb.log     File mongodb.log.2016-04-07T11-37-03     File mongodb.log.2016-04-08T03-24-41    
Assigned Teams:
Query
Operating System: ALL
Sprint: Query 14 (05/13/16)
Participants:

 Description   

016-04-08T01:04:10.993+0800 I COMMAND  [conn6] CMD: drop unilife.suningGoodProfileCollection
2016-04-08T01:04:10.994+0800 I COMMAND  [conn6] CMD: drop unilife.suningGoodDetailCollection
2016-04-08T01:30:44.355+0800 I WRITE    [conn6] update unilife.productBasicCollection query: { source: "suning" } update: { $set: { bombWrods: "", state: -1 } } nscanned:0 nscannedObjects:14604 nMatched:3851 nModified:2617 keyUpdates:1 writeConflicts:0 numYields:200 locks:{ Global: { acquireCount: { r: 201, w: 201 } }, MMAPV1Journal: { acquireCount: { w: 2818 }, acquireWaitCount: { w: 2 }, timeAcquiringMicros: { w: 268 } }, Database: { acquireCount: { w: 201 } }, Collection: { acquireCount: { W: 201 } } } 1664ms
2016-04-08T01:30:44.356+0800 I COMMAND  [conn6] command unilife.$cmd command: update { update: "productBasicCollection", ordered: false, updates: [ { q: { source: "suning" }, u: { $set: { bombWrods: "", state: -1 } }, multi: true } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:55 locks:{ Global: { acquireCount: { r: 201, w: 201 } }, MMAPV1Journal: { acquireCount: { w: 2818 }, acquireWaitCount: { w: 2 }, timeAcquiringMicros: { w: 268 } }, Database: { acquireCount: { w: 201 } }, Collection: { acquireCount: { W: 201 } } } 1665ms
2016-04-08T01:30:44.361+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:44.797+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:45.182+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:45.581+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:45.983+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:46.407+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:46.831+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:47.246+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:47.719+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:48.288+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:48.913+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:49.508+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:50.164+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:50.695+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:51.349+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:52.015+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:52.713+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:53.265+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:53.715+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:54.167+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:54.635+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:55.083+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:55.543+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:55.999+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:56.490+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:56.976+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:57.475+0800 I COMMAND  [conn6] killcursors: found 0 of 1
2016-04-08T01:30:57.738+0800 F -        [conn6] Invalid access at address: 0x1029b7680
2016-04-08T01:30:57.760+0800 F -        [conn6] Got signal: 11 (Segmentation fault).
 
 0xf5e489 0xf5dd52 0xf5e0ae 0x7fd023123130 0x7fd021c1c688 0x848893 0x8cb26a 0xa39a44 0xa3a175 0xbd09a4 0xbd0d54 0xbd13bd 0x9b1bdd 0x9b33bc 0x9b38fb 0x9b62ad 0x9d8474 0x9d93fd 0x9da10b 0xb9ef5a 0xab5290 0x80fc1d 0xf115bb 0x7fd02311bdf5 0x7fd021bc91ad
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B5E489"},{"b":"400000","o":"B5DD52"},{"b":"400000","o":"B5E0AE"},{"b":"7FD023114000","o":"F130"},{"b":"7FD021AD3000","o":"149688"},{"b":"400000","o":"448893"},{"b":"400000","o":"4CB26A"},{"b":"400000","o":"639A44"},{"b":"400000","o":"63A175"},{"b":"400000","o":"7D09A4"},{"b":"400000","o":"7D0D54"},{"b":"400000","o":"7D13BD"},{"b":"400000","o":"5B1BDD"},{"b":"400000","o":"5B33BC"},{"b":"400000","o":"5B38FB"},{"b":"400000","o":"5B62AD"},{"b":"400000","o":"5D8474"},{"b":"400000","o":"5D93FD"},{"b":"400000","o":"5DA10B"},{"b":"400000","o":"79EF5A"},{"b":"400000","o":"6B5290"},{"b":"400000","o":"40FC1D"},{"b":"400000","o":"B115BB"},{"b":"7FD023114000","o":"7DF5"},{"b":"7FD021AD3000","o":"F61AD"}],"processInfo":{ "mongodbVersion" : "3.0.4", "gitVersion" : "0481c958daeb2969800511e7475dc66986fa9ed5", "uname" : { "sysname" : "Linux", "release" : "3.10.0-229.el7.x86_64", "version" : "#1 SMP Fri Mar 6 11:36:42 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000", "buildId" : "69E53C6FFEE320204F7099893CA3B40309400897" }, { "b" : "7FFFBA2FE000", "elfType" : 3, "buildId" : "CBFB9E3C89636A82FF8DA0DBDD27455704C5A38F" }, { "b" : "7FD023114000", "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "12F30315D4F4A2FE58B1977405C8B5515861E66B" }, { "b" : "7FD022EA7000", "path" : "/lib64/libssl.so.10", "elfType" : 3, "buildId" : "7E24873B9C50842F0D3285B7D06F872226D17262" }, { "b" : "7FD022AC0000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "9523142ED92FE40B988E6EED854D79BF587954C5" }, { "b" : "7FD0228B8000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "7376A07360DC57189A8F92B20AA4AA1CAEA80551" }, { "b" : "7FD0226B4000", "path" : "/lib64/libdl.so.2", "elfType" : 3, "buildId" : "4DFEE4EA9AE8FDD4C71BD4CCC0727222F19DF810" }, { "b" : "7FD0223AC000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "8941888BF8EE9CED585599BE5397A385FC1C73CE" }, { "b" : "7FD0220AA000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "A1AA62B29765BE03A36BF927B047EEEF8696EEC6" }, { "b" : "7FD021E94000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3, "buildId" : "6AA1DCC4DE7F1836344949857FC2017278631FFD" }, { "b" : "7FD021AD3000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "C31FFE7942BFD77B2FCA8F9BD5709D387A86D3BC" }, { "b" : "7FD023330000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9866E1D2BA61EBB4CE4F009FACDAACC24EF3B804" }, { "b" : "7FD021887000", "path" : "/lib64/libgssapi_krb5.so.2", "elfType" : 3, "buildId" : "D46A230FFF4A7B808B3CFC213D31FCAC542FB504" }, { "b" : "7FD0215A2000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6D6136A0E795420B05854DEF13A10C226FE9CCB2" }, { "b" : "7FD02139E000", "path" : "/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "3A1166709F88740C49E060731832E3FAD2DFB66B" }, { "b" : "7FD02116C000", "path" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "AA97A848DD7C9E57B06EC913E10D420AEBBCE027" }, { "b" : "7FD020F56000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "1982C8CDAE90F898D1AD26DC07E807333B4789D0" }, { "b" : "7FD020D47000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "AEF6C3D3C5152F339942041519A106FC055DAF71" }, { "b" : "7FD020B43000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "2E01D5AC08C1280D013AAB96B292AC58BC30A263" }, { "b" : "7FD020929000", "path" : "/lib64/libresolv.so.2", "elfType" : 3, "buildId" : "AC596E865AF0D14B10F7B707F47D2031AD6D68DC" }, { "b" : "7FD020704000", "path" : "/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "82FF6B18E1E42825CC2D060F969479AD4AF2F62C" }, { "b" : "7FD0204A3000", "path" : "/lib64/libpcre.so.1", "elfType" : 3, "buildId" : "30FA397B01197ECABC647CBD8E75FDD5B743D730" }, { "b" : "7FD02027E000", "path" : "/lib64/liblzma.so.5", "elfType" : 3, "buildId" : "218D03D1F6CF1A099A4D467B5E8ECF4F2BF45750" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf5e489]
 mongod(+0xB5DD52) [0xf5dd52]
 mongod(+0xB5E0AE) [0xf5e0ae]
 libpthread.so.0(+0xF130) [0x7fd023123130]
 libc.so.6(+0x149688) [0x7fd021c1c688]
 mongod(_ZNK5mongo11mutablebson8Document4Impl13writeChildrenINS_14BSONObjBuilderEEEvjPT_+0x1D3) [0x848893]
 mongod(_ZNK5mongo11mutablebson8Document9getObjectEv+0xFA) [0x8cb26a]
 mongod(_ZN5mongo11UpdateStage18transformAndUpdateERKNS_11SnapshottedINS_7BSONObjEEERNS_8RecordIdE+0x734) [0xa39a44]
 mongod(_ZN5mongo11UpdateStage4workEPm+0x3E5) [0xa3a175]
mongod(_ZN5mongo12PlanExecutor18getNextSnapshottedEPNS_11SnapshottedINS_7BSONObjEEEPNS_8RecordIdE+0xA4) [0xbd09a4]
 mongod(_ZN5mongo12PlanExecutor7getNextEPNS_7BSONObjEPNS_8RecordIdE+0x34) [0xbd0d54]
 mongod(_ZN5mongo12PlanExecutor11executePlanEv+0x3D) [0xbd13bd]
 mongod(_ZN5mongo18WriteBatchExecutor10execUpdateERKNS_12BatchItemRefEPNS_7BSONObjEPPNS_16WriteErrorDetailE+0x71D) [0x9b1bdd]
 mongod(_ZN5mongo18WriteBatchExecutor11bulkExecuteERKNS_21BatchedCommandRequestERKNS_19WriteConcernOptionsEPSt6vectorIPNS_19BatchedUpsertDetailESaIS9_EEPS7_IPNS_16WriteErrorDetailESaISE_EE+0x23C) [0x9b33bc]
 mongod(_ZN5mongo18WriteBatchExecutor12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x37B) [0x9b38fb]
 mongod(_ZN5mongo8WriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x15D) [0x9b62ad]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9d8474]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC1D) [0x9d93fd]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9da10b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_+0x77A) [0xb9ef5a]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0xB10) [0xab5290]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xDD) [0x80fc1d]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x34B) [0xf115bb]
 libpthread.so.0(+0x7DF5) [0x7fd02311bdf5]
 libc.so.6(clone+0x6D) [0x7fd021bc91ad]
-----  END BACKTRACE  -----



 Comments   
Comment by Ian Whalen (Inactive) [ 18/Jul/16 ]

Given the lack of activity on this issue we're going to have to resolve as Gone Away, but please do feel free to re-open with the additional requested logs and information if you'd like us to continue this investigation.

Comment by Kelsey Schubert [ 21/Jun/16 ]

Hi shiyingwang,

We still need logs with a higher level of verbosity to diagnose the problem. If this is still an issue for you, would you please clarify how frequently this issue occurs and increase the logging level per Max's suggestion so we can continue to investigate?

Thank you,
Thomas

Comment by Max Hirschhorn [ 02/Jun/16 ]

Hi shiyingwang,

The entire query team spent an hour investigating the behavior of the update subsystem in the areas surrounding those from the reported backtrace. We don't have much more information regarding what the underlying cause of the memory corruption issue is at this time. However, to Rassi's earlier comment

I believe that MutableDocument is holding on to a copy of the update pre-image BSONObj, and my working theory is that something happens to this BSONObj before the MutableDocument is serialized. In particular, I think that the serialization process makes a copy of this BSONObj here (which happens to be owned, in this case), and then segfaults when trying to decrement the refcount for this BSONObj's shared buffer when it goes out of scope here.

we've realized that since the mmapv1 storage engine is being used, the BSONObj representing the document is should be "unowned". This means that the memory for the object is backed by the memory-mapped data file and not a separate copy of that data. This is peculiar because segfaulting when destructing the SharedBuffer underlying the BSONObj would suggest that the object was in fact an owned object. We have yet to determine whether it's possible for a serialized element not at the root to be owned.

Additionally, based on where we are within the UpdateStage::transformAndUpdate() function that an in-place update to the document isn't being performed. This means that the update subsystem will be building up a new in-memory representation of the document instead of modifying bytes within the existing object.

We have two theories about what is causing the segmentation fault:

  1. There is a corrupted document (e.g. invalid BSON) in one of the collections that when updated is triggering a segmentation fault. The contents of the mongod's data directory would be useful to determine if this is the root cause. We can create a private upload portal if you are interested and able to upload the data files.
  2. Running a particular update on a document in one of the collections is triggering a segmentation fault. Having logs at a higher verbosity level would be useful to determine if this is the root cause.

From the logs you've uploaded, we see that you've experience this same segmentation fault multiple times - with occurrences on March 25th, April 2nd, and April 8th.

  1. How frequently does this issue occur for you?
  2. Have you already determined from the error that is propagated to your application what document and update operation is triggering the issue?

If not, then we'd like to request that you increase the verbosity of your mongod to log-level 1 and the log-level for commands to 3. You can do so by running the following commands in the mongo shell:

db.adminCommand({setParameter: 1, logLevel: 1});
db.adminCommand({setParameter: 1, logComponentVerbosity: {command: 3}});

This way the mongod process will log what update operation is about to be performed prior to actually reading and modifying the document (and thus potentially triggering the segfault).

Once we determine the offending document and update operation, we'll have a much better chance of determining what the issue in the code is. We appreciate your patience.

Thanks,
Max

Comment by Ramon Fernandez Marina [ 14/Apr/16 ]

Thanks for the additional information shiyingwang, this ticket is being investigated by the Query team.

Comment by shiyingwang [ 14/Apr/16 ]

storage engine is MMAPv1

Comment by shiyingwang [ 14/Apr/16 ]

the complete logs

Comment by J Rassi [ 08/Apr/16 ]

It looks like memory corruption is being encountered during a user update here, when serializing the updated document BSONObj from the corresponding MutableDocument. I believe that MutableDocument is holding on to a copy of the update pre-image BSONObj, and my working theory is that something happens to this BSONObj before the MutableDocument is serialized. In particular, I think that the serialization process makes a copy of this BSONObj here (which happens to be owned, in this case), and then segfaults when trying to decrement the refcount for this BSONObj's shared buffer when it goes out of scope here.

shiyingwang: please provide the above requested information when you get a chance. It will be particularly helpful for the debugging process.

Comment by Kelsey Schubert [ 08/Apr/16 ]

shiyingwang, in addition, can you please answer the following questions:

  1. Can you please upload the complete logs?
  2. Have you observed this behavior before?
  3. Can you provide details about when this issue occurs that would help us reproduce this issue?

Thank you,
Thomas

Comment by Kelsey Schubert [ 08/Apr/16 ]

Hi shiyingwang,

We are investigating this issue. Can you please clarify which storage engine you are using, MMAPv1 or WiredTiger?

Thank you,
Thomas

Generated at Thu Feb 08 04:03:56 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.