[SERVER-17304] rc8 on CentOs 6.6 crashed under heavy r/w workload Created: 17/Feb/15  Updated: 18/Feb/15  Resolved: 17/Feb/15

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 3.0.0-rc8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Buzz Moschetti Assignee: Mathias Stearn
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongo0.log    
Issue Links:
Duplicate
duplicates SERVER-17198 logOp rollback in findAndModify when ... Closed
Operating System: ALL
Steps To Reproduce:

On the box:
cd /home/buzz/
sh runt stateEvent --port 37017 -d test5 --numevents 2000000 --posn 100000 --threads 32 --drop --payload 8192

Participants:

 Description   

This is the big IBM X6 box:

buzz@IDF-IBM-Test-1 ~]$ uname -a
Linux IDF-IBM-Test-1.10gen.cc 2.6.32-431.el6.x86_64 #1 SMP Fri Nov 22 03:15:09 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
[buzz@IDF-IBM-Test-1 ~]$ cat /etc/re
readahead.conf    redhat-lsb/       redhat-release    request-key.conf  request-key.d/    resolv.conf
[buzz@IDF-IBM-Test-1 ~]$ cat /etc/redhat-release 
CentOS release 6.6 (Final)
[buzz@IDF-IBM-Test-1 ~]$ df
Filesystem           1K-blocks     Used Available Use% Mounted on
/dev/mapper/vg_idfibmtest1-lv_root
                      51606140 11926496  37058204  25% /
tmpfs                396812172        0 396812172   0% /dev/shm
/dev/sda1               495844    96335    373909  21% /boot
/dev/mapper/vg_idfibmtest1-lv_home
                     231167100   832892 218591592   1% /home
/dev/md1             768924576 20997468 708868020   3% /data/1
/dev/md2             768924576 10687204 719178284   2% /data/2
/dev/md3             768924576   201440 729664048   1% /data/3
/dev/md4             768924576  3242900 726622588   1% /data/4
 
top - 06:18:58 up 5 days, 20:39,  3 users,  load average: 0.00, 0.00, 0.07
Tasks: 1814 total,   1 running, 1813 sleeping,   0 stopped,   0 zombie
Cpu(s):  0.0%us,  0.0%sy,  0.0%ni,100.0%id,  0.0%wa,  0.0%hi,  0.0%si,  0.0%st
Mem:  793624344k total, 10494188k used, 783130156k free,   270752k buffers
Swap:  4194296k total,        0k used,  4194296k free,  4892660k cached

The rc8 server is vanilla WiredTiger with no special options:

numactl --interleave=all /home/buzz/3.0.0-rc8/bin/mongod --storageEngine=wiredTiger --port 37017 --dbpath /data/4/data0/db0 --logpath /tmp/mongo0.log --fork

Program is performance test. It generates load for a state-event model like this:
1. Find an item in the currentPos collection using findAndModify to "logically reserve" it
2. Copy it to historicPos collection
3. Insert an event doc into the events collection
4 findAndModify the same item in currentPos to write updated data
There are 32 threads. Each thread has a loop that randomly picks a position key and goes thru the steps outlined above.

Failure occurs about 90% of the time between 3000 and 7000 turns of the crank (1-4 above).

This seems to be the start of trouble:

2015-02-17T05:54:22.306-0500 F REPL     [conn27] Rollback of logOp not currently allowed (SERVER-15192)
2015-02-17T05:54:22.306-0500 I -        [conn27] Fatal Assertion 18805
2015-02-17T05:54:22.318-0500 I COMMAND  [conn22] command test5.$cmd command: findAndModify { findandmodify: "currentPos", query: { k: "P5507", r: 1 }, update: { $inc: { n: 1 }, $set: { val: 121, ts: new Date(1424170462154), r: 0 } }, upsert: true } update: { $inc: { n: 1 }, $set: { val: 121, ts: new Date(1424170462154), r: 0 } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:8361 163ms
2015-02-17T05:54:22.320-0500 I COMMAND  [conn30] command test5.$cmd command: findAndModify { findandmodify: "currentPos", query: { k: "P48449", r: 0 }, update: { $set: { r: 1, ts: new Date(1424170462156), x: BinData(0, 02654E522D2D326778F31B1BA8B6A8EF889742FAFEC2AA305660EADF51F63400EC4832AABD4D264BAFBCF357C98DD4BE590A18CB892BB02F009BCD4C7D1BF08FF6023526BA33...) } }, new: true, upsert: true } update: { $set: { r: 1, ts: new Date(1424170462156), x: BinData(0, 02654E522D2D326778F31B1BA8B6A8EF889742FAFEC2AA305660EADF51F63400EC4832AABD4D264BAFBCF357C98DD4BE590A18CB892BB02F009BCD4C7D1BF08FF6023526BA33...) } } nscanned:1 nscannedObjects:1 nMatched:1 nModified:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:8362 163ms
2015-02-17T05:54:22.320-0500 I CONTROL  [conn27] 
 0xf3f059 0xee9191 0xecd5b1 0xc247ac 0xd60996 0x7e42f6 0x94218e 0x9429de 0x9b9434 0x9ba373 0x9baf6b 0xb8abe5 0xa9c968 0x7e6cc0 0xefd11b 0x30ba2079d1 0x30b9ee88fd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B3F059"},{"b":"400000","o":"AE9191"},{"b":"400000","o":"ACD5B1"},{"b":"400000","o":"8247AC"},{"b":"400000","o":"960996"},{"b":"400000","o":"3E42F6"},{"b":"400000","o":"54218E"},{"b":"400000","o":"5429DE"},{"b":"400000","o":"5B9434"},{"b":"400000","o":"5BA373"},{"b":"400000","o":"5BAF6B"},{"b":"400000","o":"78ABE5"},{"b":"400000","o":"69C968"},{"b":"400000","o":"3E6CC0"},{"b":"400000","o":"AFD11B"},{"b":"30BA200000","o":"79D1"},{"b":"30B9E00000","o":"E88FD"}],"processInfo":{ "mongodbVersion" : "3.0.0-rc8", "gitVersion" : "9d0714cdcffadb203ff68730acedbc40875403da", "uname" : { "sysname" : "Linux", "release" : "2.6.32-431.el6.x86_64", "version" : "#1 SMP Fri Nov 22 03:15:09 UTC 2013", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFFEEEFF000", "elfType" : 3 }, { "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3 }, { "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf3f059]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xee9191]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xecd5b1]
 mongod(+0x8247AC) [0xc247ac]
 mongod(_ZN5mongo22WiredTigerRecoveryUnit6_abortEv+0x36) [0xd60996]
 mongod(_ZN5mongo15WriteUnitOfWorkD1Ev+0x16) [0x7e42f6]
 mongod(_ZN5mongo16CmdFindAndModify7runImplEPNS_16OperationContextERKSsS4_RKNS_7BSONObjES7_S7_S7_bbbRNS_14BSONObjBuilderERSs+0x18AE) [0x94218e]
 mongod(_ZN5mongo16CmdFindAndModify3runEPNS_16OperationContextERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x2AE) [0x9429de]
 mongod(_ZN5mongo12_execCommandEPNS_16OperationContextEPNS_7CommandERKSsRNS_7BSONObjEiRSsRNS_14BSONObjBuilderEb+0x34) [0x9b9434]
 mongod(_ZN5mongo7Command11execCommandEPNS_16OperationContextEPS0_iPKcRNS_7BSONObjERNS_14BSONObjBuilderEb+0xC13) [0x9ba373]
 mongod(_ZN5mongo12_runCommandsEPNS_16OperationContextEPKcRNS_7BSONObjERNS_11_BufBuilderINS_16TrivialAllocatorEEERNS_14BSONObjBuilderEbi+0x28B) [0x9baf6b]
 mongod(_ZN5mongo8runQueryEPNS_16OperationContextERNS_7MessageERNS_12QueryMessageERKNS_15NamespaceStringERNS_5CurOpES3_b+0x755) [0xb8abe5]
 mongod(_ZN5mongo16assembleResponseEPNS_16OperationContextERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortEb+0xAC8) [0xa9c968]
 mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0xE0) [0x7e6cc0]
 mongod(_ZN5mongo17PortMessageServer17handleIncomingMsgEPv+0x32B) [0xefd11b]
 libpthread.so.0(+0x79D1) [0x30ba2079d1]
 libc.so.6(clone+0x6D) [0x30b9ee88fd]
-----  END BACKTRACE  -----
2015-02-17T05:54:22.320-0500 I -        [conn27] 
 
***aborting after fassert() failure

Interesting note: Same test program (but 16 threads instead of 32) ran OK all day yesterday (start, stop, reset, over and over) on a r3.4xlarge AWS instance running ami-d13845e1.



 Comments   
Comment by David Storch [ 17/Feb/15 ]

Guessing this is fixed by f0e6b92744cbe03aa9 (see SERVER-17198).

Comment by Andy Schwerin [ 17/Feb/15 ]

Strictly speaking, this is a duplicate of SERVER-15192. However, redbeard0531 or david.storch may have fixed this and the other known RC8 occurrences already.

Generated at Thu Feb 08 03:43:57 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.