[SERVER-34616] second index build interrupt crashed mongod Created: 23/Apr/18  Updated: 10/May/18  Resolved: 10/May/18

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Asya Kamsky Assignee: Daniel Gottlieb (Inactive)
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
depends on SERVER-32638 Permit dblock acquisition waiting to ... Closed
Problem/Incident
is caused by SERVER-32638 Permit dblock acquisition waiting to ... Closed
Related
is related to SERVER-33377 MultiIndexBlock builder can incorrect... Closed
Operating System: ALL
Sprint: Storage NYC 2018-05-21
Participants:

 Description   

I had a background index build, and started another one from shell, control-C aborted the second one (prompted if I want to kill operation on server, responded yes). Then I found the background index build via currentOp and sent it killOp. At this point the server crashed.

This was build from master

2018-03-02T13:39:14.753-0500 I CONTROL  [initandlisten] DEBUG build (which is slower)
2018-03-02T13:39:14.753-0500 I CONTROL  [initandlisten] db version v3.7.2-264-g6618ad3
2018-03-02T13:39:14.753-0500 I CONTROL  [initandlisten] git version: 6618ad302577418504e9a1b7849bfa7d400d4c53

And here is the stack

2018-04-23T11:40:14.057-0400 I -        [conn60]   Index Build (background): 660100/1029432 64%
2018-04-23T11:40:17.063-0400 I -        [conn60]   Index Build (background): 663700/1029432 64%
2018-04-23T11:40:31.155-0400 I INDEX    [conn37] build index on: yelp.review properties: { v: 2, key: { user_id: 1.0 }, name: "user_id_1", ns: "yelp.review" }
2018-04-23T11:40:31.954-0400 I INDEX    [conn37] 	 building index using bulk method; build may temporarily use up to 500 megabytes of RAM
2018-04-23T11:40:34.285-0400 I -        [conn37]   Index Build: 2700/7684000 0%
2018-04-23T11:40:37.003-0400 I -        [conn37]   Index Build: 61500/7684000 0%
2018-04-23T11:40:40.003-0400 I -        [conn37]   Index Build: 690200/7684000 8%
2018-04-23T11:40:40.901-0400 I NETWORK  [listener] connection accepted from 127.0.0.1:35980 #65 (7 connections now open)
2018-04-23T11:40:40.902-0400 I NETWORK  [conn65] received client metadata from 127.0.0.1:35980 conn65: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.6.4" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
2018-04-23T11:40:42.347-0400 I COMMAND  [conn65] going to kill op: 5735906
2018-04-23T11:40:42.347-0400 I NETWORK  [conn65] end connection 127.0.0.1:35980 (6 connections now open)
2018-04-23T11:40:42.698-0400 I COMMAND  [conn37] command yelp.$cmd appName: "MongoDB Shell" command: createIndexes { createIndexes: "review", indexes: [ { key: { user_id: 1.0 }, name: "user_id_1" } ], $db: "yelp" } numYields:0 ok:0 errMsg:"operation was interrupted" errName:Interrupted errCode:11601 reslen:96 locks:{ Global: { acquireCount: { r: 1, w: 1 } }, Database: { acquireCount: { W: 1 }, acquireWaitCount: { W: 1 }, timeAcquiringMicros: { W: 169771 } }, Collection: { acquireCount: { w: 1 } } } protocol:op_msg 22964ms
2018-04-23T11:40:42.698-0400 I NETWORK  [conn37] end connection 127.0.0.1:60420 (5 connections now open)
2018-04-23T11:40:42.713-0400 I COMMAND  [conn64] command yelp.review command: insert { insert: "review", writeConcern: { getLastError: 1, w: 1 }, ordered: false, $db: "yelp" } ninserted:1000 keysInserted:1000 numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 }, acquireWaitCount: { w: 1 }, timeAcquiringMicros: { w: 22910069 } }, Collection: { acquireCount: { w: 16 } } } protocol:op_query 22925ms
2018-04-23T11:40:42.773-0400 I -        [conn60]   Index Build (background): 665900/1029432 64%
2018-04-23T11:40:43.855-0400 I NETWORK  [listener] connection accepted from 127.0.0.1:35982 #66 (6 connections now open)
2018-04-23T11:40:43.855-0400 I NETWORK  [conn66] received client metadata from 127.0.0.1:35982 conn66: { application: { name: "MongoDB Shell" }, driver: { name: "MongoDB Internal Client", version: "3.6.4" }, os: { type: "Linux", name: "Ubuntu", architecture: "x86_64", version: "16.04" } }
2018-04-23T11:40:45.016-0400 I -        [conn60]   Index Build (background): 670400/1029432 65%
2018-04-23T11:40:48.041-0400 I -        [conn60]   Index Build (background): 675000/1029432 65%
2018-04-23T11:40:51.050-0400 I -        [conn60]   Index Build (background): 678500/1029432 65%
2018-04-23T11:40:54.124-0400 I -        [conn60]   Index Build (background): 687300/1029432 66%
2018-04-23T11:40:57.092-0400 I -        [conn60]   Index Build (background): 693400/1029432 67%
2018-04-23T11:41:00.007-0400 I -        [conn60]   Index Build (background): 697000/1029432 67%
2018-04-23T11:41:03.009-0400 I -        [conn60]   Index Build (background): 702500/1029432 68%
2018-04-23T11:41:06.077-0400 I -        [conn60]   Index Build (background): 707000/1029432 68%
2018-04-23T11:41:07.094-0400 I COMMAND  [conn64] command yelp.review command: insert { insert: "review", writeConcern: { getLastError: 1, w: 1 }, ordered: false, $db: "yelp" } ninserted:1000 keysInserted:1000 numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 } }, Collection: { acquireCount: { w: 16 } } } protocol:op_query 115ms
2018-04-23T11:41:07.672-0400 I COMMAND  [conn64] command yelp.review command: insert { insert: "review", writeConcern: { getLastError: 1, w: 1 }, ordered: false, $db: "yelp" } ninserted:1000 keysInserted:1000 numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 } }, Collection: { acquireCount: { w: 16 } } } protocol:op_query 104ms
2018-04-23T11:41:09.007-0400 I -        [conn60]   Index Build (background): 708700/1029432 68%
2018-04-23T11:41:09.246-0400 I COMMAND  [conn64] command yelp.review command: insert { insert: "review", writeConcern: { getLastError: 1, w: 1 }, ordered: false, $db: "yelp" } ninserted:1000 keysInserted:1000 numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 } }, Collection: { acquireCount: { w: 16 } } } protocol:op_query 102ms
2018-04-23T11:41:09.711-0400 I COMMAND  [conn64] command yelp.review command: insert { insert: "review", writeConcern: { getLastError: 1, w: 1 }, ordered: false, $db: "yelp" } ninserted:1000 keysInserted:1000 numYields:0 reslen:44 locks:{ Global: { acquireCount: { r: 16, w: 16 } }, Database: { acquireCount: { w: 16 } }, Collection: { acquireCount: { w: 16 } } } protocol:op_query 138ms
2018-04-23T11:41:09.931-0400 I COMMAND  [conn66] going to kill op: 5729150
2018-04-23T11:41:09.934-0400 F -        [conn60] terminate() called. An exception is active; attempting to gather more information
2018-04-23T11:41:10.025-0400 F -        [conn60] DBException::toString(): Interrupted: operation was interrupted
Actual exception type: mongo::error_details::ExceptionForImpl<(mongo::ErrorCodes::Error)11601, mongo::ExceptionForCat<(mongo::ErrorCategory)1> >
 0x55c6f54fd5b1 0x55c6f54fcf95 0x55c6f55fe106 0x55c6f55fe151 0x55c6f4105fa0 0x55c6f4e2ff16 0x55c6f4e32846 0x55c6f4e2e04e 0x55c6f3d24ec6 0x55c6f3d26f7c 0x55c6f3d27d6e 0x55c6f3d16e0a 0x55c6f3d212ca 0x55c6f3d1bcaf 0x55c6f3d1fdee 0x55c6f4be3b22 0x55c6f3d1a7ac 0x55c6f3d1d244 0x55c6f3d1de24 0x55c6f3d1bd41 0x55c6f3d1fdee 0x55c6f4be4085 0x55c6f5466a44 0x7f6fea3cd6ba 0x7f6fea10341d
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"55C6F32F7000","o":"22065B1","s":"_ZN5mongo15printStackTraceERSo"},{"b":"55C6F32F7000","o":"2205F95"},{"b":"55C6F32F7000","o":"2307106","s":"_ZN10__cxxabiv111__terminateEPFvvE"},{"b":"55C6F32F7000","o":"2307151"},{"b":"55C6F32F7000","o":"E0EFA0","s":"_ZN5mongo14CmdCreateIndex9errmsgRunEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERS8_RNS_14BSONObjBuilderE"},{"b":"55C6F32F7000","o":"1B38F16","s":"_ZN5mongo23ErrmsgCommandDeprecated3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERNS_14BSONObjBuilderE"},{"b":"55C6F32F7000","o":"1B3B846","s":"_ZN5mongo12BasicCommand11enhancedRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE"},{"b":"55C6F32F7000","o":"1B3704E","s":"_ZN5mongo7Command9publicRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE"},{"b":"55C6F32F7000","o":"A2DEC6"},{"b":"55C6F32F7000","o":"A2FF7C"},{"b":"55C6F32F7000","o":"A30D6E","s":"_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE"},{"b":"55C6F32F7000","o":"A1FE0A","s":"_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE"},{"b":"55C6F32F7000","o":"A2A2CA","s":"_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE"},{"b":"55C6F32F7000","o":"A24CAF","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55C6F32F7000","o":"A28DEE"},{"b":"55C6F32F7000","o":"18ECB22","s":"_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE"},{"b":"55C6F32F7000","o":"A237AC","s":"_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE"},{"b":"55C6F32F7000","o":"A26244","s":"_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE"},{"b":"55C6F32F7000","o":"A26E24","s":"_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE"},{"b":"55C6F32F7000","o":"A24D41","s":"_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE"},{"b":"55C6F32F7000","o":"A28DEE"},{"b":"55C6F32F7000","o":"18ED085"},{"b":"55C6F32F7000","o":"216FA44"},{"b":"7F6FEA3C6000","o":"76BA"},{"b":"7F6FE9FFC000","o":"10741D","s":"clone"}],"processInfo":{ "mongodbVersion" : "3.7.2-264-g6618ad3", "gitVersion" : "6618ad302577418504e9a1b7849bfa7d400d4c53", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "4.8.0-56-generic", "version" : "#61~16.04.1-Ubuntu SMP Wed Jun 14 11:58:22 UTC 2017", "machine" : "x86_64" }, "somap" : [ { "b" : "55C6F32F7000", "elfType" : 3, "buildId" : "190CA0EEC11CE7C478AE765DD00D13B7A267090D" }, { "b" : "7FFC561AF000", "elfType" : 3, "buildId" : "9734E1B5F33BBF806373CB77E9980C9FEB6D2509" }, { "b" : "7F6FEAF0E000", "path" : "/lib/x86_64-linux-gnu/libresolv.so.2", "elfType" : 3, "buildId" : "6EF73266978476EF9F2FD2CF31E57F4597CB74F8" }, { "b" : "7F6FEAD06000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "89C34D7A182387D76D5CDA1F7718F5D58824DFB3" }, { "b" : "7F6FEAB02000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "8CC8D0D119B142D839800BFF71FB71E73AEA7BD4" }, { "b" : "7F6FEA7F9000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "DFB85DE42DAFFD09640C8FE377D572DE3E168920" }, { "b" : "7F6FEA5E3000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "68220AE2C65D65C1B6AAA12FA6765A6EC2F5F434" }, { "b" : "7F6FEA3C6000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "CE17E023542265FC11D9BC8F534BB4F070493D30" }, { "b" : "7F6FE9FFC000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "B5381A457906D279073822A5CEB24C4BFEF94DDB" }, { "b" : "7F6FEB129000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "5D7B6259552275A3C17BD4C3FD05F5A6BF40CAA5" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x41) [0x55c6f54fd5b1]
 mongod(+0x2205F95) [0x55c6f54fcf95]
 mongod(_ZN10__cxxabiv111__terminateEPFvvE+0x6) [0x55c6f55fe106]
 mongod(+0x2307151) [0x55c6f55fe151]
 mongod(_ZN5mongo14CmdCreateIndex9errmsgRunEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERS8_RNS_14BSONObjBuilderE+0x1D30) [0x55c6f4105fa0]
 mongod(_ZN5mongo23ErrmsgCommandDeprecated3runEPNS_16OperationContextERKNSt7__cxx1112basic_stringIcSt11char_traitsIcESaIcEEERKNS_7BSONObjERNS_14BSONObjBuilderE+0x46) [0x55c6f4e2ff16]
 mongod(_ZN5mongo12BasicCommand11enhancedRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE+0x76) [0x55c6f4e32846]
 mongod(_ZN5mongo7Command9publicRunEPNS_16OperationContextERKNS_12OpMsgRequestERNS_14BSONObjBuilderE+0x2E) [0x55c6f4e2e04e]
 mongod(+0xA2DEC6) [0x55c6f3d24ec6]
 mongod(+0xA2FF7C) [0x55c6f3d26f7c]
 mongod(_ZN5mongo23ServiceEntryPointCommon13handleRequestEPNS_16OperationContextERKNS_7MessageERKNS0_5HooksE+0x2BE) [0x55c6f3d27d6e]
 mongod(_ZN5mongo23ServiceEntryPointMongod13handleRequestEPNS_16OperationContextERKNS_7MessageE+0x3A) [0x55c6f3d16e0a]
 mongod(_ZN5mongo19ServiceStateMachine15_processMessageENS0_11ThreadGuardE+0xBA) [0x55c6f3d212ca]
 mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0xBF) [0x55c6f3d1bcaf]
 mongod(+0xA28DEE) [0x55c6f3d1fdee]
 mongod(_ZN5mongo9transport26ServiceExecutorSynchronous8scheduleESt8functionIFvvEENS0_15ServiceExecutor13ScheduleFlagsENS0_23ServiceExecutorTaskNameE+0x1A2) [0x55c6f4be3b22]
 mongod(_ZN5mongo19ServiceStateMachine22_scheduleNextWithGuardENS0_11ThreadGuardENS_9transport15ServiceExecutor13ScheduleFlagsENS2_23ServiceExecutorTaskNameENS0_9OwnershipE+0x19C) [0x55c6f3d1a7ac]
 mongod(_ZN5mongo19ServiceStateMachine15_sourceCallbackENS_6StatusE+0xAC4) [0x55c6f3d1d244]
 mongod(_ZN5mongo19ServiceStateMachine14_sourceMessageENS0_11ThreadGuardE+0x314) [0x55c6f3d1de24]
 mongod(_ZN5mongo19ServiceStateMachine15_runNextInGuardENS0_11ThreadGuardE+0x151) [0x55c6f3d1bd41]
 mongod(+0xA28DEE) [0x55c6f3d1fdee]
 mongod(+0x18ED085) [0x55c6f4be4085]
 mongod(+0x216FA44) [0x55c6f5466a44]
 libpthread.so.0(+0x76BA) [0x7f6fea3cd6ba]
 libc.so.6(clone+0x6D) [0x7f6fea10341d]
-----  END BACKTRACE  -----



 Comments   
Comment by Daniel Gottlieb (Inactive) [ 10/May/18 ]

I'm rather certain this problem has gone away. I've been able to reproduce this with a modification to make it easier to hit. The sequence:

  1. While building the index, the operation is called which a DBException being thrown from this statement.
  2. Because this is a background index build, the code grabs the DBLock to check the replication state.
  3. SERVER-32638 introduced DBLock acquisition throwing exceptions if the operation context was killed. Specifically that would only throw in cases where the operation was killed and* there was contention acquiring the lock. This race is tricky to hit.
  4. The exception handler has another exception handler that terminates the process. According to Kelsey's line numbers, this is the terminate that killed the process.

This crash was fixed by a later push under the SERVER-32638 ticket by having createIndexes opt out of operation killing throwing exceptions when acquiring locks.

The modification for reproduction was to include changes from SERVER-27534. These changes caused uncontested lock acquisition to throw an exception if the operation context was already killed.

Comment by Eric Milkie [ 27/Apr/18 ]

Should investigate this to see what code path was hit when it crashed.

Comment by Asya Kamsky [ 23/Apr/18 ]

esha.maharishi
db version v3.7.2-264-g6618ad3
git version: 6618ad302577418504e9a1b7849bfa7d400d4c53

Comment by Esha Maharishi (Inactive) [ 23/Apr/18 ]

milkie thanks, I wanted to rule out changes made in SERVER-34370.

Comment by Eric Milkie [ 23/Apr/18 ]

esha.maharishi db version v3.7.2-264-g6618ad3

Comment by Esha Maharishi (Inactive) [ 23/Apr/18 ]

asya, what version did this crash happen on?

Comment by Asya Kamsky [ 23/Apr/18 ]

Please note - I have not been able to reproduce this in 3.7.5 but I don't know if there is some timing component to the triggering of the crash so filed this to see if decoding the backtrace may help confirm this has been fixed or how to reproduce it if not.

Generated at Thu Feb 08 04:37:16 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.