[SERVER-19274] DB fails to recover after system crash, with "No such file or directory" Created: 02/Jul/15  Updated: 11/Apr/16  Resolved: 17/Nov/15

Status: Closed
Project: Core Server
Component/s: Storage, WiredTiger
Affects Version/s: 3.1.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jonathan Abrahams Assignee: Susan LoVerso
Resolution: Cannot Reproduce Votes: 0
Labels: 32powercycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File logs.tar     File powertest.sh     File wiredTiger-afterrecovery.taraa     File wiredTiger-afterrecovery.tarab     File wiredTiger-afterrecovery.tarac     File wiredTiger-afterrecovery.tarad     File wiredTiger-afterrecovery.tarae     File wiredTiger-afterrecovery.taraf     File wiredTiger-afterrecovery.tarag     File wiredTiger-beforerecovery.taraa     File wiredTiger-beforerecovery.tarab     File wiredTiger-beforerecovery.tarac     File wiredTiger-beforerecovery.tarad     File wiredTiger-beforerecovery.tarae     File wiredTiger-beforerecovery.taraf     File wiredTiger-beforerecovery.tarag     File wiredTiger-firstrun.tar     File wiredTiger.taraa     File wiredTiger.tarab     File wiredTiger.tarac     File wiredTiger.tarad     File wiredTiger.tarae     File wiredTiger.taraf     File wiredTiger.tarag    
Issue Links:
Depends
depends on WT-2125 recovery failing with ENOENT Closed
is depended on by WT-1973 MongoDB changes for WiredTiger 2.7.0 Closed
Related
related to SERVER-19774 WT_NOTFOUND: item not found during DB... Closed
is related to WT-2008 Recovery fails due to a missing file Closed
is related to WT-2009 Recovery failure when there are updat... Closed
is related to WT-2014 Compact causes updates to the turtle ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Using the powercycle.sh script with the following arguments:

./powertest.sh -d https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-3.1.5.tgz -r pt-315 -F ~/mongodb/mongo -s 10.4.1.54  -C mpower -O output1 -S "-i /Users/jonathan/.ssh/mFi.pem admin@10.4.117.131" -R -N 10 -l 50

Note - If you need to run this on an AWS instance you would use it as follows. The -F option is the root of the mongo source on your client.

./powertest.sh -d https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-3.1.5.tgz -r pt-315 -F ~/mongodb/mongo -s "-i <pem file> <ip addr>  -M <mongo bin dir> -R -N 10 -l 50

Participants:

 Description   

The powercycle test created a DB that was not recoverable:

2015-07-02T15:10:36.727-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-07-02T15:10:36.788-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pt-315/data/wiredTiger/mongod.lock is not empty.
2015-07-02T15:10:36.788-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-07-02T15:10:36.788-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-07-02T15:10:36.861-0400 E STORAGE  [initandlisten] WiredTiger (2) [1435864236:861392][2600:0x7f6b3b048b80], file:collection-181-5805430730046625089.wt: /home/jonathan/pt-315/data/wiredTiger/collection-181-5805430730046625089.wt: No such file or directory
2015-07-02T15:10:36.861-0400 E STORAGE  [initandlisten] WiredTiger (2) [1435864236:861464][2600:0x7f6b3b048b80], file:collection-181-5805430730046625089.wt: Operation failed during recovery: No such file or directory
2015-07-02T15:10:36.880-0400 I -        [initandlisten] Fatal assertion 28548 NoSuchKey Unable to find metadata for table:collection-175-5805430730046625089
2015-07-02T15:10:36.880-0400 I -        [initandlisten]
 
***aborting after fassert() failure



 Comments   
Comment by Jonathan Abrahams [ 17/Nov/15 ]

I'll mark this as resolved, since it could not be reproduced in 3.2.0-rc3-pre.

Comment by Jonathan Abrahams [ 12/Nov/15 ]

michael.cahill The write cache cannot be disabled for the device:

[jonathan@capja2 etc]$ sudo hdparm  -W0 /dev/sda
 
/dev/sda:
 setting drive write-caching to 0 (off)
 write-caching =  1 (on)

I can try a different physical machine, also running CentOS 7 and see if I can disable the write cache. This will take a little more time to setup, but it may achieve our test.

Question - Should we test this on an ext4 and/or xfs file system?

Comment by Jonathan Abrahams [ 09/Nov/15 ]

sue.loverso That build is no good:

2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] MongoDB starting : pid=2582 port=37017 dbpath=/home/jonathan/wt320rc2-sue-110915-1/data/wiredTiger 64-bit host=capja2
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] db version v3.2.0-rc2-65-gc4d2fd3
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] git version: c4d2fd393eec1141caf88b7d68d3b12a65ae40c1
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] modules: none
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] build environment:
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten]     distarch: x86_64
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2015-11-09T13:06:29.634-0500 I CONTROL  [initandlisten] options: { net: { port: 37017 }, processManagement: { fork: true }, replication: { replSet: "pc" }, setParameter: { enableTestCommands: "1" }, storage: { dbPath: "wt320rc2-sue-110915-1/data/wiredTiger", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, path: "wt320rc2-sue-110915-1/logs/mongod-wiredTiger-recovery.log" } }
2015-11-09T13:06:29.652-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-11-09T13:06:29.686-0500 F -        [initandlisten] Invalid access at address: 0xa7afd
2015-11-09T13:06:29.711-0500 F -        [initandlisten] Got signal: 11 (Segmentation fault).
 
 0x12b4a72 0x12b39a9 0x12b3d28 0x7fbe32378130 0x7fbe31ff0a94 0x7fbe3201d079 0x19b4375 0x19f5227 0x19e6f80 0x19f888b 0x1980b7b 0x198f64b 0x1988693 0x103848a 0x1035761 0xf5d9c8 0x9350ef 0x9397f9 0x7fbe31fc9af5 0x932d19
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"EB4A72"},{"b":"400000","o":"EB39A9"},{"b":"400000","o":"EB3D28"},{"b":"7FBE32369000","o":"F130"},{"b":"7FBE31FA8000","o":"48A94"},{"b":"7FBE31FA8000","o":"75079"},{"b":"400000","o":"15B4375"},{"b":"400000","o":"15F5227"},{"b":"400000","o":"15E6F80"},{"b":"400000","o":"15F888B"},{"b":"400000","o":"1580B7B"},{"b":"400000","o":"158F64B"},{"b":"400000","o":"1588693"},{"b":"400000","o":"C3848A"},{"b":"400000","o":"C35761"},{"b":"400000","o":"B5D9C8"},{"b":"400000","o":"5350EF"},{"b":"400000","o":"5397F9"},{"b":"7FBE31FA8000","o":"21AF5"},{"b":"400000","o":"532D19"}],"processInfo":{ "mongodbVersion" : "3.2.0-rc2-65-gc4d2fd3", "gitVersion" : "c4d2fd393eec1141caf88b7d68d3b12a65ae40c1", "compiledModules" : [], "uname" : { "sysname" : "Linux", "release" : "3.10.0-229.4.2.el7.x86_64", "version" : "#1 SMP Wed May 13 10:06:09 UTC 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF8A65C000", "elfType" : 3 }, { "b" : "7FBE32FA8000", "path" : "/lib64/librt.so.1", "elfType" : 3 }, { "b" : "7FBE32DA4000", "path" : "/lib64/libdl.so.2", "elfType" : 3 }, { "b" : "7FBE32A9D000", "path" : "/lib64/libstdc++.so.6", "elfType" : 3 }, { "b" : "7FBE3279B000", "path" : "/lib64/libm.so.6", "elfType" : 3 }, { "b" : "7FBE32585000", "path" : "/lib64/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7FBE32369000", "path" : "/lib64/libpthread.so.0", "elfType" : 3 }, { "b" : "7FBE31FA8000", "path" : "/lib64/libc.so.6", "elfType" : 3 }, { "b" : "7FBE331B0000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x32) [0x12b4a72]
 mongod(+0xEB39A9) [0x12b39a9]
 mongod(+0xEB3D28) [0x12b3d28]
 libpthread.so.0(+0xF130) [0x7fbe32378130]
 libc.so.6(_IO_vfprintf+0x1564) [0x7fbe31ff0a94]
 libc.so.6(vsnprintf+0x79) [0x7fbe3201d079]
 mongod(__wt_log_vprintf+0x65) [0x19b4375]
 mongod(+0x15F5227) [0x19f5227]
 mongod(__wt_schema_drop+0xE0) [0x19e6f80]
 mongod(__wt_session_drop+0xDB) [0x19f888b]
 mongod(__wt_las_create+0x4B) [0x1980b7b]
 mongod(__wt_connection_workers+0x5B) [0x198f64b]
 mongod(wiredtiger_open+0x1563) [0x1988693]
 mongod(_ZN5mongo18WiredTigerKVEngineC2ERKSsS2_bbb+0x5AA) [0x103848a]
 mongod(+0xC35761) [0x1035761]
 mongod(_ZN5mongo20ServiceContextMongoD29initializeGlobalStorageEngineEv+0x588) [0xf5d9c8]
 mongod(_ZN5mongo13initAndListenEi+0x2EF) [0x9350ef]
 mongod(main+0x149) [0x9397f9]
 libc.so.6(__libc_start_main+0xF5) [0x7fbe31fc9af5]
 mongod(+0x532D19) [0x932d19]
-----  END BACKTRACE  -----
2015-11-09T13:06:29.711-0500 F -        [initandlisten] /proc/self/maps:
00400000-01f9a000 r-xp 00000000 fd:02 538007974                          /home/jonathan/wt320rc2-sue-110915-1/mongodb-linux-x86_64-3.2.0-rc2-65-gc4d2fd3/bin/mongod
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 02199000-02270000 rw-p 01b99000 fd:02 538007974                          /home/jonathan/wt320rc2-sue-110915-1/mongodb-linux-x86_64-3.2.0-rc2-65-gc4d2fd3/bin/mongod
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 02270000-022ba000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 038b1000-044b2000 rw-p 00000000 00:00 0                                  [heap]
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 044b2000-07694000 rw-p 00000000 00:00 0                                  [heap]
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe29a7d000-7fbe29a7e000 ---p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe29a7e000-7fbe2a27e000 rw-p 00000000 00:00 0                          [stack:2586]
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2a27e000-7fbe2a27f000 ---p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2a27f000-7fbe2aa7f000 rw-p 00000000 00:00 0                          [stack:2585]
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2aa7f000-7fbe2aa80000 ---p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2aa80000-7fbe2b280000 rw-p 00000000 00:00 0                          [stack:2584]
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2b280000-7fbe2b281000 ---p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2b281000-7fbe2ba81000 rw-p 00000000 00:00 0                          [stack:2583]
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe2ba81000-7fbe31fa8000 r--p 00000000 fd:01 11251                      /usr/lib/locale/locale-archive
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe31fa8000-7fbe3215e000 r-xp 00000000 fd:01 205023206                  /usr/lib64/libc-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3215e000-7fbe3235e000 ---p 001b6000 fd:01 205023206                  /usr/lib64/libc-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3235e000-7fbe32362000 r--p 001b6000 fd:01 205023206                  /usr/lib64/libc-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32362000-7fbe32364000 rw-p 001ba000 fd:01 205023206                  /usr/lib64/libc-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32364000-7fbe32369000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32369000-7fbe3237f000 r-xp 00000000 fd:01 201334002                  /usr/lib64/libpthread-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3237f000-7fbe3257f000 ---p 00016000 fd:01 201334002                  /usr/lib64/libpthread-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3257f000-7fbe32580000 r--p 00016000 fd:01 201334002                  /usr/lib64/libpthread-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32580000-7fbe32581000 rw-p 00017000 fd:01 201334002                  /usr/lib64/libpthread-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32581000-7fbe32585000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32585000-7fbe3259a000 r-xp 00000000 fd:01 205023195                  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3259a000-7fbe32799000 ---p 00015000 fd:01 205023195                  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32799000-7fbe3279a000 r--p 00014000 fd:01 205023195                  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3279a000-7fbe3279b000 rw-p 00015000 fd:01 205023195                  /usr/lib64/libgcc_s-4.8.3-20140911.so.1
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3279b000-7fbe3289c000 r-xp 00000000 fd:01 201333945                  /usr/lib64/libm-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe3289c000-7fbe32a9b000 ---p 00101000 fd:01 201333945                  /usr/lib64/libm-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32a9b000-7fbe32a9c000 r--p 00100000 fd:01 201333945                  /usr/lib64/libm-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32a9c000-7fbe32a9d000 rw-p 00101000 fd:01 201333945                  /usr/lib64/libm-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32a9d000-7fbe32b86000 r-xp 00000000 fd:01 201334037                  /usr/lib64/libstdc++.so.6.0.19
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32b86000-7fbe32d85000 ---p 000e9000 fd:01 201334037                  /usr/lib64/libstdc++.so.6.0.19
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32d85000-7fbe32d8d000 r--p 000e8000 fd:01 201334037                  /usr/lib64/libstdc++.so.6.0.19
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32d8d000-7fbe32d8f000 rw-p 000f0000 fd:01 201334037                  /usr/lib64/libstdc++.so.6.0.19
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32d8f000-7fbe32da4000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32da4000-7fbe32da7000 r-xp 00000000 fd:01 201333943                  /usr/lib64/libdl-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32da7000-7fbe32fa6000 ---p 00003000 fd:01 201333943                  /usr/lib64/libdl-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32fa6000-7fbe32fa7000 r--p 00002000 fd:01 201333943                  /usr/lib64/libdl-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32fa7000-7fbe32fa8000 rw-p 00003000 fd:01 201333943                  /usr/lib64/libdl-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32fa8000-7fbe32faf000 r-xp 00000000 fd:01 201334007                  /usr/lib64/librt-2.17.so
2015-11-09T13:06:29.711-0500 F -        [initandlisten] 7fbe32faf000-7fbe331ae000 ---p 00007000 fd:01 201334007                  /usr/lib64/librt-2.17.so
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe331ae000-7fbe331af000 r--p 00006000 fd:01 201334007                  /usr/lib64/librt-2.17.so
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe331af000-7fbe331b0000 rw-p 00007000 fd:01 201334007                  /usr/lib64/librt-2.17.so
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe331b0000-7fbe331d1000 r-xp 00000000 fd:01 205023199                  /usr/lib64/ld-2.17.so
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe333b6000-7fbe333bd000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe333c9000-7fbe333d0000 r--s 00000000 fd:01 68823729                   /usr/lib64/gconv/gconv-modules.cache
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe333d0000-7fbe333d1000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe333d1000-7fbe333d2000 r--p 00021000 fd:01 205023199                  /usr/lib64/ld-2.17.so
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe333d2000-7fbe333d3000 rw-p 00022000 fd:01 205023199                  /usr/lib64/ld-2.17.so
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fbe333d3000-7fbe333d4000 rw-p 00000000 00:00 0
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fff89cd3000-7fff89cf4000 rw-p 00000000 00:00 0                          [stack]
2015-11-09T13:06:29.712-0500 F -        [initandlisten] 7fff89d5c000-7fff89d5e000 r-xp 00000000 00:00 0                          [vdso]
2015-11-09T13:06:29.712-0500 F -        [initandlisten] ffffffffff600000-ffffffffff601000 r-xp 00000000 00:00 0                  [vsyscall]

Comment by Jonathan Abrahams [ 04/Nov/15 ]

sue.loverso Sorry for stating there were backups of the DBs. I modified the powertest.sh script to not backup the DBs (by default). To get the backups I reran with the "-D" option enabled. The results are in wt320rc1-run3.

Failure observed:

2015-11-03T22:31:09.449-0500 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] MongoDB starting : pid=2577 port=37017 dbpath=/home/jonathan/wt320rc1-run3/data/wiredTiger 64-bit host=capja2
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] db version v3.2.0-rc1-85-g217ec54
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] git version: 217ec54c518a04173f4b4b1fc7778ad4a1095c84
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] modules: none
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] build environment:
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten]     distarch: x86_64
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2015-11-03T22:31:09.461-0500 I CONTROL  [initandlisten] options: { net: { port: 37017 }, processManagement: { fork: true }, replication: { replSet: "pc" }, setParameter: { enableTestCommands: "1" }, storage: { dbPath: "wt320rc1-run3/data/wiredTiger", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, path: "wt320rc1-run3/logs/mongod-wiredTiger-recovery.log" } }
2015-11-03T22:31:09.502-0500 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/wt320rc1-run3/data/wiredTiger/mongod.lock is not empty.
2015-11-03T22:31:09.502-0500 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-11-03T22:31:09.502-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-11-03T22:31:09.862-0500 E STORAGE  [initandlisten] WiredTiger (-31803) [1446607869:862931][2577:0x7f66a0fc0cc0], file:collection-2380--3971201562620218084.wt, txn-recover: Operation failed during recovery: WT_NOTFOUND: item not found
2015-11-03T22:31:09.863-0500 E STORAGE  [initandlisten] WiredTiger (-31803) [1446607869:863118][2577:0x7f66a0fc0cc0], file:collection-2380--3971201562620218084.wt, txn-recover: Recovery failed: WT_NOTFOUND: item not found
2015-11-03T22:31:09.919-0500 I -        [initandlisten] Assertion: 28595:-31803: WT_NOTFOUND: item not found
2015-11-03T22:31:09.925-0500 I STORAGE  [initandlisten] exception in initAndListen: 28595 -31803: WT_NOTFOUND: item not found, terminating
2015-11-03T22:31:09.925-0500 I CONTROL  [initandlisten] dbexit:  rc: 100

Comment by Susan LoVerso [ 04/Nov/15 ]

jonathan.abrahams I went onto the machine and found wt320rc1-run2 dated today. However, I don't see in the data directory any of the saved copies like beforerecovery and afterrecovery. Can you rerun with that? I'd suggest always having it on, or if it is, any ideas where those directories are and/or why they're not in the data directory? Did I miss them somewhere?

On the machine I do not see any other file or directory dated today (other than the tarball I just created). Are the DB tarball backups the copies I'm looking for? If so, where are they?

Comment by Jonathan Abrahams [ 03/Nov/15 ]

It could not be reproduced using a VM (with atleast 300+ "power" cycles).

Comment by Alexander Gorrod [ 03/Nov/15 ]

Thanks for the update jonathan.abrahams. Did you try running the script against a VM? I'd like to understand whether the physical reboot is the only difference between your setup and the one david.hows has been running.

Comment by Jonathan Abrahams [ 03/Nov/15 ]

The failure was seen in both cases within 10 power cycles. The DB tarball backups are available for both failures (as well as direct access to the host).

Comment by Michael Cahill (Inactive) [ 03/Nov/15 ]

jonathan.abrahams, thanks for re-running.

FWIW, we did a new drop just after the version you tested (6262a4f), but I don't expect that to make any difference to this test since you are running with journal enabled.

Do you know how many cycles this took to reproduce, and are the before / after recovery database tarballs available?

Comment by Jonathan Abrahams [ 03/Nov/15 ]

david.hows Please make sure you use the latest powercycle.sh from the CAP repo.

I saw 2 failures (master, i.e., 3.2.0-rc2-pre) using the following command:

./powertest.sh -s <ip addr> -R  -N 10 -F ~/mongo -f 5 -C mpower -O output1 -S "-i mFi.pem admin@<mfi ip>"  -l 200 -i 1000000

Note that this differs from the command you use, as it runs on a physical host and uses a power off/on procedure to crash the server where mongod is running.

Crash 1:

2015-10-29T11:01:36.514-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-10-29T11:01:36.550-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/wt320rc1/data/wiredTiger/mongod.lock is not empty.
2015-10-29T11:01:36.550-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-10-29T11:01:36.550-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-10-29T11:01:36.850-0400 E STORAGE  [initandlisten] WiredTiger (-31803) [1446130896:850821][2457:0x7f2838613cc0], file:collection-2109-1173218696634382594.wt: Operation failed during recovery: WT_NOTFOUND: item not found
2015-10-29T11:01:36.850-0400 E STORAGE  [initandlisten] WiredTiger (-31803) [1446130896:850924][2457:0x7f2838613cc0], file:collection-2109-1173218696634382594.wt: Recovery failed: WT_NOTFOUND: item not found
2015-10-29T11:01:36.863-0400 I -        [initandlisten] Assertion: 28595:-31803: WT_NOTFOUND: item not found
2015-10-29T11:01:36.869-0400 I STORAGE  [initandlisten] exception in initAndListen: 28595 -31803: WT_NOTFOUND: item not found, terminating
2015-10-29T11:01:36.869-0400 I CONTROL  [initandlisten] dbexit:  rc: 100

Crash 2:

2015-11-03T16:44:20.661-0500 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] MongoDB starting : pid=2565 port=37017 dbpath=/home/jonathan/wt320rc1-run2/data/wiredTiger 64-bit host=capja2
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] db version v3.2.0-rc1-80-g6262a4f
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] git version: 6262a4f016ebd159a83712d0d1b550af39212c3e
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] allocator: tcmalloc
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] modules: none
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] build environment:
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten]     distarch: x86_64
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten]     target_arch: x86_64
2015-11-03T16:44:20.669-0500 I CONTROL  [initandlisten] options: { net: { port: 37017 }, processManagement: { fork: true }, replication: { replSet: "pc" }, setParameter: { enableTestCommands: "1" }, storage: { dbPath: "wt320rc1-run2/data/wiredTiger", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, path: "wt320rc1-run2/logs/mongod-wiredTiger-recovery.log" } }
2015-11-03T16:44:20.697-0500 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/wt320rc1-run2/data/wiredTiger/mongod.lock is not empty.
2015-11-03T16:44:20.697-0500 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-11-03T16:44:20.697-0500 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-11-03T16:44:24.786-0500 E STORAGE  [initandlisten] WiredTiger (2) [1446587064:786539][2565:0x7fbbf4350cc0], file:index-1915-3843389990893053778.wt, txn-recover: /home/jonathan/wt320rc1-run2/data/wiredTiger/index-1915-3843389990893053778.wt: No such file or directory
2015-11-03T16:44:24.786-0500 E STORAGE  [initandlisten] WiredTiger (2) [1446587064:786606][2565:0x7fbbf4350cc0], file:index-1915-3843389990893053778.wt, txn-recover: Operation failed during recovery: No such file or directory
2015-11-03T16:44:25.547-0500 I -        [initandlisten] Fatal assertion 28548 NoSuchKey Unable to find metadata for table:collection-1898-3843389990893053778
2015-11-03T16:44:25.547-0500 I -        [initandlisten]
 
***aborting after fassert() failure

Comment by David Hows [ 27/Oct/15 ]

jonathan.abrahams, would you be able to attempt to reproduce in your environment with either 3.2.0RC0 or a recent compile of master and see if the issue is resolved on your end?

My Reproduction command is

bash ./powertest.sh -s " -i /home/david/work/crashtest/.vagrant/machines/default/virtualbox/private_key vagrant@172.28.128.3" -d https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-3.1.8.tgz -r pt-318 -C vagrant -R -l 200 -N 10 -F /home/david/work/mongo/ -o "--setParameter enableTestCommands=1" | tee powertest-318.log

It reproduces very quickly with 3.1.5

bash ./powertest.sh -s " -i /home/david/work/crashtest/.vagrant/machines/default/virtualbox/private_key vagrant@172.28.128.3" -d https://fastdl.mongodb.org/linux/mongodb-linux-x86_64-3.1.5.tgz -r pt-315 -C vagrant -R -l 200 -N 10 -F /home/david/work/mongo/ -o "--setParameter enableTestCommands=1" | tee powertest-315.log

Comment by David Hows [ 27/Oct/15 ]

I've been running this for a day.

I ran 200 iterations vs 3.1.8 and was unable to reproduce.

I can readily reproduce in 3.1.5

Comment by Jonathan Abrahams [ 11/Sep/15 ]

alexander.gorrod I reran with the v3.1.8-pre-, git version: 188a6021d72322f1
f45adbbe0d0973686deaf8f8, and saw this fatal:

2015-09-11T13:42:25.336-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-09-11T13:42:25.386-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/m318wt261/data/wiredTiger/mongod.lock is not empty.
2015-09-11T13:42:25.386-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-09-11T13:42:25.386-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),config_base=false,statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-09-11T13:42:31.877-0400 E STORAGE  [initandlisten] WiredTiger (2) [1441993351:877615][2584:0x7f9a10c34cc0], file:collection-1444--8026368725953356873.wt: /home/jonathan/m318wt261/data/wiredTiger/collection-1444--8026368725953356873.wt: No such file or directory
2015-09-11T13:42:31.877-0400 E STORAGE  [initandlisten] WiredTiger (2) [1441993351:877682][2584:0x7f9a10c34cc0], file:collection-1444--8026368725953356873.wt: Operation failed during recovery: No such file or directory
2015-09-11T13:42:33.787-0400 I STORAGE  [initandlisten] Starting WiredTigerRecordStoreThread local.oplog.rs
2015-09-11T13:42:33.787-0400 I STORAGE  [initandlisten] Sampling from the oplog to determine where to place markers for when to truncate
2015-09-11T13:42:33.787-0400 I STORAGE  [initandlisten] Taking 1003 samples and assuming that each section of oplog contains approximately 155792 records totaling to 61133856 bytes
2015-09-11T13:42:33.789-0400 I STORAGE  [initandlisten] Failed to get enough random samples, falling back to scanning the oplog
2015-09-11T13:42:33.789-0400 I STORAGE  [initandlisten] Scanning the oplog to determine where to place markers for when to truncate
2015-09-11T13:42:40.518-0400 I -        [initandlisten] Fatal assertion 28548 NoSuchKey Unable to find metadata for table:collection-1410--8026368725953356873
2015-09-11T13:42:40.518-0400 I -        [initandlisten]
 
***aborting after fassert() failure

Comment by Alexander Gorrod [ 11/Sep/15 ]

jonathan.abrahams I pushed a change into MongoDB master last night that may result in a "No such file or directory" return on an index. When you re-run this test could you try running with the latest Mongo version?

The relevant ticket is SERVER-20322

Comment by Susan LoVerso [ 03/Sep/15 ]

alexander.gorrod I have ssh access to the failing machine now. Unfortunately since recovery (successfully) has been run on the database, WT has archived away all the original log files. But the symptoms you originally saw are exactly what still happens in this case. The file getting ENOENT when starting mongod exists in the WT metadata file, but does not exist on disk. I'll look at WT-2008 and WT-2009 and see if there is another path that ends up with the same problem.

I've asked jonathan.abrahams to reproduce with saving copies of the database directory before running recovery. No need to upload tar files.

Comment by Susan LoVerso [ 03/Sep/15 ]

jonathan.abrahams can you give me ssh access to the machine when it fails? Is it still in the mode from this last failure or do you need to repro again?

Comment by Jonathan Abrahams [ 28/Aug/15 ]

alexander.gorrod Do you want ssh access to view the storage directory and files, or should I provide them in this ticket?

Comment by Alexander Gorrod [ 28/Aug/15 ]

jonathan.abrahams I'm not having any luck reproducing this failure - would it be possible for us to see the failure on the machine you are using?

Comment by Jonathan Abrahams [ 26/Aug/15 ]

Still reproducible on 3.1.8-pre- on a physical host using this test command:

powertest.sh -s 10.4.1.54 -r wt-317-2 -C mpower -O output1 -S "-i /Users/jonathan/.ssh/mFi.pem admin@10.4.117.131" -R -F /Users/jonathan/mongodb/mongo -N 20 -l 100 -D -A -p

Comment by Jonathan Abrahams [ 26/Aug/15 ]

I am rerunning the tests using 3.1.8-pre-, and so far I have not been able to reproduce this issue. It's possible the fixes have corrected this issue.

Some tests run on VMs, some on physical machines. I have pushed the latest powertest.sh to the GitHub repo.

Stay tuned and if this is not reproducible, it will be marked resolved.

Comment by Alexander Gorrod [ 26/Aug/15 ]

Thanks for re-testing jonathan.abrahams. Could you provide some more information:

When you say latest - which version do you mean?
Do you know how long the test ran for before the failure happened? Were you running the crash cycling on physical hardware or with a virtual machine?
Are you running with the power-cycle script as in the GitHub repository? I've updated to the newest version of the script and starting running locally again.

Comment by Jonathan Abrahams [ 14/Aug/15 ]

Reopening based on last comment.

Comment by Jonathan Abrahams [ 14/Aug/15 ]

I reran this test on latest and still have this problem:

2015-08-13T16:57:17.880-0400 I CONTROL  [main] ***** SERVER RESTARTED *****
2015-08-13T16:57:17.937-0400 W -        [initandlisten] Detected unclean shutdown - /home/jonathan/pt-317-wt-100m/data/wiredTiger/mongod.lock is not empty.
2015-08-13T16:57:17.937-0400 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
2015-08-13T16:57:17.937-0400 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=8G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-08-13T16:59:33.576-0400 E STORAGE  [initandlisten] WiredTiger (2) [1439499573:576196][2591:0x7f547b92ccc0], file:index-27-2812787096723087782.wt: /home/jonathan/pt-317-wt-100m/data/wiredTiger/index-27-2812787096723087782.wt: No such file or directory
2015-08-13T16:59:33.576-0400 E STORAGE  [initandlisten] WiredTiger (2) [1439499573:576268][2591:0x7f547b92ccc0], file:index-27-2812787096723087782.wt: Operation failed during recovery: No such file or directory
2015-08-13T16:59:47.078-0400 I STORAGE  [initandlisten] dropping unused ident: collection-28-2812787096723087782
2015-08-13T16:59:47.080-0400 I STORAGE  [initandlisten] dropping unused ident: collection-30-2812787096723087782
2015-08-13T16:59:47.081-0400 I STORAGE  [initandlisten] dropping unused ident: collection-32-2812787096723087782
2015-08-13T16:59:47.082-0400 I STORAGE  [initandlisten] dropping unused ident: collection-34-2812787096723087782
2015-08-13T16:59:47.083-0400 I STORAGE  [initandlisten] dropping unused ident: collection-36-2812787096723087782
2015-08-13T16:59:47.084-0400 I STORAGE  [initandlisten] dropping unused ident: collection-38-2812787096723087782
2015-08-13T16:59:47.085-0400 I STORAGE  [initandlisten] dropping unused ident: collection-40-2812787096723087782
2015-08-13T16:59:47.086-0400 I STORAGE  [initandlisten] dropping unused ident: collection-55-2812787096723087782
2015-08-13T16:59:47.086-0400 I STORAGE  [initandlisten] dropping unused ident: index-173-2812787096723087782
2015-08-13T16:59:47.087-0400 I STORAGE  [initandlisten] dropping unused ident: index-174-2812787096723087782
2015-08-13T16:59:47.088-0400 I STORAGE  [initandlisten] dropping unused ident: index-175-2812787096723087782
2015-08-13T16:59:47.089-0400 I STORAGE  [initandlisten] dropping unused ident: index-176-2812787096723087782
2015-08-13T16:59:47.090-0400 I STORAGE  [initandlisten] dropping unused ident: index-177-2812787096723087782
2015-08-13T16:59:47.091-0400 I STORAGE  [initandlisten] dropping unused ident: index-178-2812787096723087782
2015-08-13T16:59:47.092-0400 I STORAGE  [initandlisten] dropping unused ident: index-179-2812787096723087782
2015-08-13T16:59:47.093-0400 I STORAGE  [initandlisten] dropping unused ident: index-180-2812787096723087782
2015-08-13T16:59:47.094-0400 I STORAGE  [initandlisten] dropping unused ident: index-181-2812787096723087782
2015-08-13T16:59:47.094-0400 I STORAGE  [initandlisten] dropping unused ident: index-182-2812787096723087782
2015-08-13T16:59:47.095-0400 I STORAGE  [initandlisten] dropping unused ident: index-183-2812787096723087782
2015-08-13T16:59:47.096-0400 I STORAGE  [initandlisten] dropping unused ident: index-184-2812787096723087782
2015-08-13T16:59:47.097-0400 I STORAGE  [initandlisten] dropping unused ident: index-189-2812787096723087782
2015-08-13T16:59:47.098-0400 I STORAGE  [initandlisten] dropping unused ident: index-190-2812787096723087782
2015-08-13T16:59:47.099-0400 I STORAGE  [initandlisten] dropping unused ident: index-191-2812787096723087782
2015-08-13T16:59:47.100-0400 I STORAGE  [initandlisten] dropping unused ident: index-192-2812787096723087782
2015-08-13T16:59:47.101-0400 I STORAGE  [initandlisten] dropping unused ident: index-213-2812787096723087782
2015-08-13T16:59:47.102-0400 I STORAGE  [initandlisten] dropping unused ident: index-214-2812787096723087782
2015-08-13T16:59:47.103-0400 I STORAGE  [initandlisten] dropping unused ident: index-215-2812787096723087782
2015-08-13T16:59:47.104-0400 I STORAGE  [initandlisten] dropping unused ident: index-216-2812787096723087782
2015-08-13T16:59:47.105-0400 I STORAGE  [initandlisten] dropping unused ident: index-221-2812787096723087782
2015-08-13T16:59:47.106-0400 I STORAGE  [initandlisten] dropping unused ident: index-222-2812787096723087782
2015-08-13T16:59:47.107-0400 I STORAGE  [initandlisten] dropping unused ident: index-223-2812787096723087782
2015-08-13T16:59:47.108-0400 I STORAGE  [initandlisten] dropping unused ident: index-224-2812787096723087782
2015-08-13T16:59:47.108-0400 I STORAGE  [initandlisten] dropping unused ident: index-233-2812787096723087782
2015-08-13T16:59:47.109-0400 I STORAGE  [initandlisten] dropping unused ident: index-234-2812787096723087782
2015-08-13T16:59:47.110-0400 I STORAGE  [initandlisten] dropping unused ident: index-235-2812787096723087782
2015-08-13T16:59:47.111-0400 I STORAGE  [initandlisten] dropping unused ident: index-236-2812787096723087782
2015-08-13T16:59:47.112-0400 I STORAGE  [initandlisten] dropping unused ident: index-241-2812787096723087782
2015-08-13T16:59:47.113-0400 I STORAGE  [initandlisten] dropping unused ident: index-242-2812787096723087782
2015-08-13T16:59:47.114-0400 I STORAGE  [initandlisten] dropping unused ident: index-243-2812787096723087782
2015-08-13T16:59:47.115-0400 I STORAGE  [initandlisten] dropping unused ident: index-244-2812787096723087782
2015-08-13T16:59:47.116-0400 I STORAGE  [initandlisten] dropping unused ident: index-245-2812787096723087782
2015-08-13T16:59:47.117-0400 I STORAGE  [initandlisten] dropping unused ident: index-246-2812787096723087782
2015-08-13T16:59:47.118-0400 I STORAGE  [initandlisten] dropping unused ident: index-247-2812787096723087782
2015-08-13T16:59:47.119-0400 I STORAGE  [initandlisten] dropping unused ident: index-248-2812787096723087782
2015-08-13T16:59:47.120-0400 I STORAGE  [initandlisten] dropping unused ident: index-253-2812787096723087782
2015-08-13T16:59:47.121-0400 I STORAGE  [initandlisten] dropping unused ident: index-254-2812787096723087782
2015-08-13T16:59:47.122-0400 I STORAGE  [initandlisten] dropping unused ident: index-255-2812787096723087782
2015-08-13T16:59:47.123-0400 I STORAGE  [initandlisten] dropping unused ident: index-256-2812787096723087782
2015-08-13T16:59:47.124-0400 I STORAGE  [initandlisten] dropping unused ident: index-257-2812787096723087782
2015-08-13T16:59:47.124-0400 I STORAGE  [initandlisten] dropping unused ident: index-258-2812787096723087782
2015-08-13T16:59:47.125-0400 I STORAGE  [initandlisten] dropping unused ident: index-259-2812787096723087782
2015-08-13T16:59:47.126-0400 I STORAGE  [initandlisten] dropping unused ident: index-260-2812787096723087782
2015-08-13T16:59:47.128-0400 I STORAGE  [initandlisten] dropping unused ident: index-261-2812787096723087782
2015-08-13T16:59:47.128-0400 I STORAGE  [initandlisten] dropping unused ident: index-262-2812787096723087782
2015-08-13T16:59:47.129-0400 I STORAGE  [initandlisten] dropping unused ident: index-263-2812787096723087782
2015-08-13T16:59:47.130-0400 I STORAGE  [initandlisten] dropping unused ident: index-264-2812787096723087782
2015-08-13T16:59:47.131-0400 I STORAGE  [initandlisten] dropping unused ident: index-265-2812787096723087782
2015-08-13T16:59:47.132-0400 I STORAGE  [initandlisten] dropping unused ident: index-266-2812787096723087782
2015-08-13T16:59:47.133-0400 I STORAGE  [initandlisten] dropping unused ident: index-267-2812787096723087782
2015-08-13T16:59:47.134-0400 I STORAGE  [initandlisten] dropping unused ident: index-268-2812787096723087782
2015-08-13T16:59:47.135-0400 I STORAGE  [initandlisten] dropping unused ident: index-269-2812787096723087782
2015-08-13T16:59:47.136-0400 I STORAGE  [initandlisten] dropping unused ident: index-270-2812787096723087782
2015-08-13T16:59:47.137-0400 I STORAGE  [initandlisten] dropping unused ident: index-271-2812787096723087782
2015-08-13T16:59:47.138-0400 I STORAGE  [initandlisten] dropping unused ident: index-272-2812787096723087782
2015-08-13T16:59:47.138-0400 I STORAGE  [initandlisten] dropping unused ident: index-54-2812787096723087782
2015-08-13T16:59:47.139-0400 I STORAGE  [initandlisten] dropping unused ident: index-67-2812787096723087782
2015-08-13T16:59:47.140-0400 I STORAGE  [initandlisten] dropping unused ident: index-79-2812787096723087782
2015-08-13T16:59:47.142-0400 I CONTROL  [initandlisten] MongoDB starting : pid=2591 port=37017 dbpath=/home/jonathan/pt-317-wt-100m/data/wiredTiger 64-bit host=capja2
2015-08-13T16:59:47.142-0400 I CONTROL  [initandlisten]
2015-08-13T16:59:47.142-0400 I CONTROL  [initandlisten] ** NOTE: This is a development version (3.1.7-pre-) of MongoDB.
2015-08-13T16:59:47.142-0400 I CONTROL  [initandlisten] **       Not recommended for production.
2015-08-13T16:59:47.142-0400 I CONTROL  [initandlisten]
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten]
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten]
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/defrag is 'always'.
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten]
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] db version v3.1.7-pre-
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] git version: a067d03eca22d67a1ca036ea61544cd51df1490b
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] allocator: tcmalloc
2015-08-13T16:59:47.143-0400 I CONTROL  [initandlisten] options: { net: { port: 37017 }, processManagement: { fork: true }, setParameter: { enableTestCommands: "1" }, storage: { dbPath: "pt-317-wt-100m/data/wiredTiger", engine: "wiredTiger" }, systemLog: { destination: "file", logAppend: true, path: "pt-317-wt-100m/logs/mongod-wiredTiger-recovery.log" } }
2015-08-13T16:59:47.146-0400 I -        [initandlisten] Fatal assertion 28579 NoSuchKey Unable to find metadata for table:index-21-2812787096723087782
2015-08-13T16:59:47.146-0400 I -        [initandlisten]
 
***aborting after fassert() failure

Comment by Alexander Gorrod [ 23/Jul/15 ]

I've been running this test for 24 hours without a failure, when it would fail reliably within 5 minutes before. The fixes in WT-2008 and WT-2009 have fixed the issues I can reproduce here.

Comment by Alexander Gorrod [ 15/Jul/15 ]

I've found two bugs WT-2008 and WT-2009 that are causing the failures reported here. I have a fix for WT-2008 awaiting review, WT-2009 needs further discussion before implementing a fix.

Comment by Alexander Gorrod [ 14/Jul/15 ]

I've reproduced the failure - I'm actively chasing it.

The failure I see is:

2015-07-14T05:54:38.664+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-07-14T05:54:40.216+0000 E STORAGE  [initandlisten] WiredTiger (2) [1436853280:216378][1604:0x7f984eb1cbc0], file:collection-183--5187426057094844684.wt: /home/vagrant/pt-latest/data/wiredTiger/collection-183--5187426057094844684.wt: No such file or directory
2015-07-14T05:54:40.216+0000 E STORAGE  [initandlisten] WiredTiger (2) [1436853280:216510][1604:0x7f984eb1cbc0], file:collection-183--5187426057094844684.wt: Operation failed during recovery: No such file or directory

A printlog on the failed database reveals:

  { "lsn" : [24,33857152],
    "hdr_flags" : "",
    "rec_len" : 256,
    "mem_len" : 256,
    "type" : "commit",
    "txnid" : 115824,
    "ops": [
      { "optype": "row_put",
        "fileid": "0",
        "key": "table:collection-183--5187426057094844684\u0000",
        "value": "app_metadata=(formatVersion=1),colgroups=,collator=,columns=,key_format=q,value_format=u\u0000"
      }
    ]
  },

Which indicates that the file has been written to the WiredTiger metadata. The file definitely doesn't exist on disk:

$ ls wiredTiger-beforerecovery/collection*
wiredTiger-beforerecovery/collection-0--6622855524004531093.wt    wiredTiger-beforerecovery/collection-193--5187426057094844684.wt
wiredTiger-beforerecovery/collection-1--5187426057094844684.wt    wiredTiger-beforerecovery/collection-2--6622855524004531093.wt
wiredTiger-beforerecovery/collection-1-5882861703348662567.wt     wiredTiger-beforerecovery/collection-3--5187426057094844684.wt
wiredTiger-beforerecovery/collection-179--5187426057094844684.wt  wiredTiger-beforerecovery/collection-3-5882861703348662567.wt
wiredTiger-beforerecovery/collection-181--5187426057094844684.wt  wiredTiger-beforerecovery/collection-4--6622855524004531093.wt
wiredTiger-beforerecovery/collection-185--5187426057094844684.wt  wiredTiger-beforerecovery/collection-6--6622855524004531093.wt
wiredTiger-beforerecovery/collection-189--5187426057094844684.wt

I'll do some more digging into create code paths tomorrow, to see what may be causing the failure.

Comment by Alexander Gorrod [ 14/Jul/15 ]

Thanks for reporting this issue jonathan.abrahams. Can you reproduce the failure reliably? How long do you need to run the power-cycle script before seeing the failure?

I've been running the power cycle script with the configuration options you specified for over 2 hours now without failure.

Comment by Jonathan Abrahams [ 07/Jul/15 ]

I also ran the same scenario with a new option for the CRUD client, using writeConcern of {j: true}. This produced the same result.

Comment by Jonathan Abrahams [ 02/Jul/15 ]

Files were split, cat them back together:

cat wiredTiger-afterrecovery.tar* > wiredTiger-afterrecovery.tar

Generated at Thu Feb 08 03:50:25 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.