[SERVER-16867] Server crashed during iiBench run using WiredTiger engine Created: 15/Jan/15  Updated: 02/Feb/15  Resolved: 23/Jan/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 2.8.0-rc4
Fix Version/s: 3.0.0-rc6

Type: Bug Priority: Critical - P2
Reporter: Tim Callaghan Assignee: Andy Schwerin
Resolution: Done Votes: 0
Labels: wiredtiger
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File lex5.ssd-mongoiibench-1-100000000-1000-999999-4-wt-0-15.log.gz    
Issue Links:
Duplicate
is duplicated by SERVER-16345 WiredTiger primary died with: Fatal D... Closed
is duplicated by SERVER-16926 mongod segfault with wiredtiger durin... Closed
is duplicated by SERVER-16976 Invariant failure in wiredtiger_recor... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

iiBench: https://github.com/tmcallaghan/iibench-mongodb
4 writer threads, 1000 inserts per batch
Ubuntu 14.04.1
48 GB RAM in the server
WiredTiger configured with an 8 GB cache and direct_io=[data]; all other settings at defaults
Storage is SSD (~50,000 random IOPS)

An immediate rerun did not reproduce the crash. A rough sketch of the workload shape follows.
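
For context, this C++ sketch models only the thread and batch structure of the run; iiBench itself is a Java benchmark (see the GitHub link above). The 100M-document total is inferred from the attached log's file name, and all names below are illustrative, not part of the benchmark.

#include <cstdio>
#include <thread>
#include <vector>

constexpr int  kWriters          = 4;      // 4 concurrent writer threads
constexpr int  kBatchSize        = 1000;   // 1000 documents per insert batch
constexpr long kBatchesPerWriter = 25000;  // 4 x 25000 x 1000 = 100M docs total

// Stand-in for one batched insert command sent to mongod; in the real
// benchmark each document carries several secondary-indexed fields.
void insertBatch(int writerId, long firstSeq, int count) {
    std::printf("writer %d: inserting docs [%ld, %ld)\n",
                writerId, firstSeq, firstSeq + count);
}

void writer(int id) {
    for (long b = 0; b < kBatchesPerWriter; ++b)
        insertBatch(id, b * kBatchSize, kBatchSize);
}

int main() {
    std::vector<std::thread> threads;
    for (int i = 0; i < kWriters; ++i)
        threads.emplace_back(writer, i);
    for (auto& t : threads)
        t.join();
}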

 Description   

Running an iiBench insert-only workload; specific details are in Steps To Reproduce. Here is the server's log from startup:

2015-01-15T10:03:05.095-0500 I CONTROL  [initandlisten] MongoDB starting : pid=1708 port=27017 dbpath=/data.ssd/tcallaghan/data/mongo-data 64-bit host=lex5
2015-01-15T10:03:05.095-0500 I CONTROL  [initandlisten] db version v2.8.0-rc4
2015-01-15T10:03:05.095-0500 I CONTROL  [initandlisten] git version: 3ad571742911f04b307f0071979425511c4f2570
2015-01-15T10:03:05.095-0500 I CONTROL  [initandlisten] build info: Linux build11.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2015-01-15T10:03:05.095-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2015-01-15T10:03:05.095-0500 I CONTROL  [initandlisten] options: { processManagement: { fork: true }, replication: { replSet: "tmcRepl" }, storage: { dbPath: "/data.ssd/tcallaghan/data/mongo-data", engine: "wiredtiger", journal: { enabled: true }, wiredTiger: { engineConfig: { cacheSizeGB: 8, configString: "direct_io=[data]" } } }, systemLog: { destination: "file", path: "lex5.ssd-mongoiibench-1-100000000-1000-999999-4-wt-0-15.log.mongolog" } }
2015-01-15T10:03:05.095-0500 W STORAGE  [initandlisten] Detected configuration for non-active storage engine wiredTiger when current storage engine is wiredtiger
2015-01-15T10:03:05.095-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,extensions=[local=(entry=index_collator_extension)],statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),direct_io=[data]
2015-01-15T10:03:05.114-0500 I REPL     [initandlisten] Did not find local replica set configuration document at startup;  NoMatchingDocument Did not find replica set configuration document in local.system.replset
2015-01-15T10:03:05.118-0500 I NETWORK  [initandlisten] waiting for connections on port 27017
2015-01-15T10:03:05.118-0500 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:34456 #1 (1 connection now open)
2015-01-15T10:03:06.125-0500 I NETWORK  [conn1] end connection 127.0.0.1:34456 (0 connections now open)
2015-01-15T10:03:06.172-0500 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:34457 #2 (1 connection now open)
2015-01-15T10:03:06.173-0500 I COMMAND  [conn2] replSet info initiate : no configuration specified.  Using a default configuration for the set
2015-01-15T10:03:06.173-0500 I COMMAND  [conn2] replSet created this configuration for initiation : { _id: "tmcRepl", version: 1, members: [ { _id: 0, host: "lex5:27017" } ] }
2015-01-15T10:03:06.173-0500 I REPL     [conn2] replSetInitiate admin command received from client
2015-01-15T10:03:06.174-0500 I REPL     [conn2] replSet replSetInitiate config object with 1 members parses ok
2015-01-15T10:03:06.177-0500 I REPL     [ReplicationExecutor] new replica set config in use: { _id: "tmcRepl", version: 1, members: [ { _id: 0, host: "lex5:27017", arbiterOnly: false, buildIndexes: true, hidden: false, priority: 1.0, tags: {}, slaveDelay: 0, votes: 1 } ], settings: { chainingAllowed: true, heartbeatTimeoutSecs: 10, getLastErrorModes: {}, getLastErrorDefaults: { w: 1, wtimeout: 0 } } }
2015-01-15T10:03:06.177-0500 I REPL     [ReplicationExecutor] transition to STARTUP2
2015-01-15T10:03:06.178-0500 I REPL     [conn2] ******
2015-01-15T10:03:06.178-0500 I REPL     [conn2] creating replication oplog of size: 22838MB...
2015-01-15T10:03:06.188-0500 I REPL     [conn2] ******
2015-01-15T10:03:06.188-0500 I REPL     [conn2] Starting replication applier threads
2015-01-15T10:03:06.188-0500 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-01-15T10:03:06.190-0500 I REPL     [ReplicationExecutor] transition to PRIMARY
2015-01-15T10:03:06.190-0500 I REPL     [rsSync] transition to primary complete; database writes are now permitted

Here is the stack trace from the crash; a sketch of the retry pattern that normally handles WriteConflict errors follows the trace:

2015-01-15T10:11:32.156-0500 F REPL     [conn8] Fatal DBException in logOp(): 112 WriteConflict
2015-01-15T10:11:32.156-0500 F -        [conn8] terminate() called. An exception is active; attempting to gather more information
2015-01-15T10:11:32.192-0500 F -        [conn8] DBException::toString(): 112 WriteConflict
Actual exception type: mongo::WriteConflictException
 
 0xf133b9 0xf12eb0 0x7f29dad43836 0x7f29dad43863 0xc1594c 0x98cb15 0x98d210 0x98e954 0x98f055 0x99110d 0x9ad054 0x9adf93 0x9aea4b 0xb7ca1a 0xa8fcd5 0x7e41f0 0xed1381 0x7f29db3fd182 0x7f29da4fdefd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B133B9"},{"b":"400000","o":"B12EB0"},{"b":"7F29DACE5000","o":"5E836"},{"b":"7F29DACE5000","o":"5E863"},{"b":"400000","o":"81594C"},{"b":"400000","o":"58CB15"},{"b":"400000","o":"58D210"},{"b":"400000","o":"58E954"},{"b":"400000","o":"58F055"},{"b":"400000","o":"59110D"},{"b":"400000","o":"5AD054"},{"b":"400000","o":"5ADF93"},{"b":"400000","o":"5AEA4B"},{"b":"400000","o":"77CA1A"},{"b":"400000","o":"68FCD5"},{"b":"400000","o":"3E41F0"},{"b":"400000","o":"AD1381"},{"b":"7F29DB3F5000","o":"8182"},{"b":"7F29DA403000","o":"FAEFD"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc4", "gitVersion" : "3ad571742911f04b307f0071979425511c4f2570", "uname" : { "sysname" : "Linux", "release" : "3.13.0-40-generic", "version" : "#69-Ubuntu SMP Thu Nov 13 17:53:56 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF894EE000", "elfType" : 3 }, { "b" : "7F29DB3F5000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3 }, { "b" : "7F29DB1ED000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3 }, { "b" : "7F29DAFE9000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3 }, { "b" : "7F29DACE5000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3 }, { "b" : "7F29DA9DF000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3 }, { "b" : "7F29DA7C9000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7F29DA403000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3 }, { "b" : "7F29DB613000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf133b9]
 mongod(+0xB12EB0) [0xf12eb0]
 libstdc++.so.6(+0x5E836) [0x7f29dad43836]
 libstdc++.so.6(+0x5E863) [0x7f29dad43863]
 mongod(_ZN5mongo4repl5logOpEPNS_16OperationContextEPKcS4_RKNS_7BSONObjEPS5_Pbb+0x1DC) [0xc1594c]
 mongod(_ZN5mongo18WriteBatchExecutor13execOneInsertEPNS0_16ExecInsertsStateEPPNS_16WriteErrorDetailE+0xEA5) [0x98cb15]
 mongod(_ZN5mongo18WriteBatchExecutor11execInsertsERKNS_21BatchedCommandRequestEPSt6vectorIPNS_16WriteErrorDetailESaIS6_EE+0x250) [0x98d210]
 mongod(_ZN5mongo18WriteBatchExecutor11bulkExecuteERKNS_21BatchedCommandRequestEPSt6vectorIPNS_19BatchedUpsertDetailESaIS6_EEPS4_IPNS_16WriteErrorDetailESaISB_EE+0x34) [0x98e954]
 mongod(_ZN5mongo18WriteBatchExecutor12executeBatchERKNS_21BatchedCommandRequestEPNS_22BatchedCommandResponseE+0x395) [0x98f055]
 mongod(_ZN5mongo8WriteCmd3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x15D) [0x99110d]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9ad054]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9adf93]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9aea4b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERNS_5CurOpES3_b+0x76A) [0xb7ca1a]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xB25) [0xa8fcd5]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e41f0]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x411) [0xed1381]
 libpthread.so.0(+0x8182) [0x7f29db3fd182]
 libc.so.6(clone+0x6D) [0x7f29da4fdefd]
-----  END BACKTRACE  -----
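
For readers unfamiliar with the failure mode: under WiredTiger, concurrent updates to the same record raise WriteConflictException (error code 112), which normal write paths catch and retry. Here the exception escaped logOp(), where no retry was attempted, so it reached terminate() and aborted the process. Below is a minimal, self-contained sketch of the usual retry pattern; the real server wraps this in helpers along the lines of its MONGO_WRITE_CONFLICT_RETRY_LOOP_BEGIN macro, and the names here are illustrative only.

#include <exception>
#include <iostream>

// Illustrative stand-in for mongo::WriteConflictException (error 112).
struct WriteConflictException : std::exception {
    const char* what() const noexcept override { return "112 WriteConflict"; }
};

// Retry an operation whenever the storage engine reports a write conflict.
// This is only safe when `op` has no side effects that outlive the aborted
// storage transaction -- the crux of this bug: logOp() had already notified
// a non-rollback-safe listener before the exception escaped.
template <typename Op>
void retryOnWriteConflict(Op op) {
    for (int attempt = 1;; ++attempt) {
        try {
            op();
            return;
        } catch (const WriteConflictException&) {
            std::cerr << "write conflict, retrying (attempt " << attempt << ")\n";
            // A real implementation would also abandon the current storage
            // snapshot and back off briefly before retrying.
        }
    }
}

int main() {
    int failuresLeft = 2;
    retryOnWriteConflict([&] {
        if (failuresLeft-- > 0)
            throw WriteConflictException();  // simulate two conflicts
        std::cout << "insert committed\n";
    });
}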



 Comments   
Comment by Githook User [ 22/Jan/15 ]

Author:

Andy Schwerin (andy10gen) <schwerin@mongodb.com>

Message: SERVER-16867 Allow logOp() to throw before calling first non-rollback-safe logOp listener.
Branch: master
https://github.com/mongodb/mongo/commit/dc0a3a6cbcc91a410b07ad0e10c6ecc4ae2c92dc
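
The commit message suggests the fix is an ordering guarantee: let the storage write inside logOp() throw (so the caller's write-conflict retry loop can re-run it) before any listener with non-rollback-safe side effects is invoked. A speculative sketch of that idea follows; the function names are hypothetical stand-ins, not the server's actual API.

#include <cstdio>
#include <stdexcept>

// Hypothetical stand-in for mongo::WriteConflictException (error 112).
struct WriteConflictException : std::runtime_error {
    WriteConflictException() : std::runtime_error("112 WriteConflict") {}
};

// Transactional storage write: rolled back cleanly if it throws.
void writeOplogEntry() { /* may throw WriteConflictException */ }

// Observer with effects that cannot be rolled back (e.g. waking waiters).
void notifyListeners() noexcept { std::puts("listeners notified"); }

// After the fix: everything that can throw runs before the first
// non-rollback-safe listener, so a WriteConflictException escaping here
// leaves no externally visible trace and the caller can simply retry.
// Before the fix, a throw after listener notification left no safe
// recovery, which is why the server aborted via terminate() instead.
void logOpSketch() {
    writeOplogEntry();   // throws escape to the caller's retry loop
    notifyListeners();   // runs only once the oplog write has succeeded
}

int main() {
    logOpSketch();
}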

Comment by Tim Callaghan [ 15/Jan/15 ]

Full client log attached. The issue occurred ~500 seconds into the benchmark.

Comment by Alexander Gorrod [ 15/Jan/15 ]

tmcallaghan Do you know how long the benchmark was running before hitting this issue? From the log snippets you posted it appears to be about 1.5 hours. Was that for a single benchmark run?

Comment by Tim Callaghan [ 15/Jan/15 ]

Yes, this all runs on localhost. I saved all my log files and restarted the process hoping for a reproducer, but the second run hasn't crashed. It does, however, show long stalls on inserts; I'll file that as a separate ticket.

Comment by Asya Kamsky [ 15/Jan/15 ]

tmcallaghan did you run the client on the same machine that mongod was running on? Trying to reproduce with same conditions you had.

Comment by Tim Callaghan [ 15/Jan/15 ]

After this event the server no longer accepts any writes.
