[SERVER-17086] Can't start MongoDB w/ WiredTiger Engine because of Checksum Error Created: 28/Jan/15  Updated: 04/Jun/15  Resolved: 15/May/15

Status: Closed
Project: Core Server
Component/s: WiredTiger
Affects Version/s: 2.8.0-rc5, 3.0.0-rc6
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Soner K Assignee: Bruce Lucas (Inactive)
Resolution: Cannot Reproduce Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File WiredTiger.turtle     File WiredTiger.wt     File sizeStorer.wt    
Issue Links:
Related
related to SERVER-17152 WiredTiger file corrupted during powe... Closed
is related to SERVER-17152 WiredTiger file corrupted during powe... Closed
is related to SERVER-17142 corruption in sizeStorer.wt file afte... Closed
Operating System: ALL
Steps To Reproduce:

1. Do some concurrent inserts into the database
2. Wait for the RAM to go out
2.1. MongoDB should crash
3. Try to start MongoDB again
4. The error stated above occurs.

Participants:

 Description   

Hi,

I inserted many data via a concurrent program into a MongoDB Database (2.8.0-rc5) with the WiredEngine StorageEngine.

I'm guessing the 12 GB RAM went out and the MongoDB process was killed. Before I post the log file, I want to say that I searched the jira for this issue and found some bugreports on this:

  • SERVER-16210 (Solution: Remove any Non-MongoDB file from the dbpath-folder. Didn't work for me)
  • SERVER-16214 (No Solution)
  • SERVER-16596 (No Solution)
  • SERVER-16172 (Solution: Update should have fixed that. Didn't work for me, I tried just now with MongoDB-3.0-latest, same error persists)
  • SERVER-16173 (Solution: Update should have fixed that. Didn't work for me)
  • SERVER-16804 (No Solution)

So my issue seems to be the checksum error. Is there any way for me to change the checksum number? I tried to open the .wt files in the dbpath with a hex editor but due to the huge size of the file (~228 GiB) I don't even know where to begin searching. The hex addresses in the error log below don't help either.

I'd appreciate any help, comment or hint

Here's the log file on trying to start mongod with the WiredTiger StorageEngine. The exact command is:

./mongod --dbpath /some/path --port 8003 --storageEngine wiredTiger

I changed the hostname and pathname in the log below manually.

2015-01-28T11:16:40.912+0100 I CONTROL  [initandlisten] MongoDB starting : pid=866 port=8003 dbpath=/some/path 64-bit host=someHost
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] ** WARNING: You are running this process as the root user, which is not recommended.
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] 
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] 
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] ** WARNING: /sys/kernel/mm/transparent_hugepage/enabled is 'always'.
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] **        We suggest setting it to 'never'
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] 
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] db version v2.8.0-rc5
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] git version: 74b351de21c84438b12a83b28e155f5e69e3c1eb
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] build info: Linux build19.nj1.10gen.cc 2.6.32-431.3.1.el6.x86_64 #1 SMP Fri Jan 3 21:39:27 UTC 2014 x86_64 BOOST_LIB_VERSION=1_49
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] allocator: tcmalloc
2015-01-28T11:16:40.913+0100 I CONTROL  [initandlisten] options: { net: { port: 8003 }, storage: { dbPath: "/some/path", engine: "wiredTiger" } }
2015-01-28T11:16:40.913+0100 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=3G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-01-28T11:16:41.728+0100 E STORAGE  [initandlisten] WiredTiger (0) [1422440201:728246][866:0x7fde3ffd2b80], file:sizeStorer.wt: read checksum error [4096B @ 32768, 546414969 != 1946601003]
2015-01-28T11:16:41.728+0100 E STORAGE  [initandlisten] WiredTiger (0) [1422440201:728309][866:0x7fde3ffd2b80], file:sizeStorer.wt: sizeStorer.wt: encountered an illegal file format or internal value
2015-01-28T11:16:41.728+0100 E STORAGE  [initandlisten] WiredTiger (-31804) [1422440201:728322][866:0x7fde3ffd2b80], file:sizeStorer.wt: the process must exit and restart: WT_PANIC: WiredTiger library panic
2015-01-28T11:16:41.728+0100 I -        [initandlisten] Fatal Assertion 28558
2015-01-28T11:16:41.745+0100 I CONTROL  [initandlisten] 
 0xf25749 0xecf571 0xeb46e1 0xd406f6 0x13501f0 0x13504b5 0x1350921 0x12abb92 0x12c457c 0x12c24b6 0x12c33cf 0x12e93db 0x134f868 0x134f97a 0x12f58c1 0x134e40a 0x135b3af 0x135b729 0x130c7d0 0x135c264 0x12ec211 0x12e6e8c 0xd421c8 0xd3fd28 0xa5fdcd 0x7df7af 0x7e4344 0x7fde3ebf2040 0x7dd589
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B25749"},{"b":"400000","o":"ACF571"},{"b":"400000","o":"AB46E1"},{"b":"400000","o":"9406F6"},{"b":"400000","o":"F501F0"},{"b":"400000","o":"F504B5"},{"b":"400000","o":"F50921"},{"b":"400000","o":"EABB92"},{"b":"400000","o":"EC457C"},{"b":"400000","o":"EC24B6"},{"b":"400000","o":"EC33CF"},{"b":"400000","o":"EE93DB"},{"b":"400000","o":"F4F868"},{"b":"400000","o":"F4F97A"},{"b":"400000","o":"EF58C1"},{"b":"400000","o":"F4E40A"},{"b":"400000","o":"F5B3AF"},{"b":"400000","o":"F5B729"},{"b":"400000","o":"F0C7D0"},{"b":"400000","o":"F5C264"},{"b":"400000","o":"EEC211"},{"b":"400000","o":"EE6E8C"},{"b":"400000","o":"9421C8"},{"b":"400000","o":"93FD28"},{"b":"400000","o":"65FDCD"},{"b":"400000","o":"3DF7AF"},{"b":"400000","o":"3E4344"},{"b":"7FDE3EBD2000","o":"20040"},{"b":"400000","o":"3DD589"}],"processInfo":{ "mongodbVersion" : "2.8.0-rc5", "gitVersion" : "74b351de21c84438b12a83b28e155f5e69e3c1eb", "uname" : { "sysname" : "Linux", "release" : "3.18.4-1-ARCH", "version" : "#1 SMP PREEMPT Tue Jan 27 20:45:02 CET 2015", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b" : "400000" }, { "b" : "7FFF9F704000", "path" : "linux-vdso.so.1", "elfType" : 3 }, { "b" : "7FDE3FBAB000", "path" : "/usr/lib/libpthread.so.0", "elfType" : 3 }, { "b" : "7FDE3F9A3000", "path" : "/usr/lib/librt.so.1", "elfType" : 3 }, { "b" : "7FDE3F79F000", "path" : "/usr/lib/libdl.so.2", "elfType" : 3 }, { "b" : "7FDE3F490000", "path" : "/usr/lib/libstdc++.so.6", "elfType" : 3 }, { "b" : "7FDE3F18B000", "path" : "/usr/lib/libm.so.6", "elfType" : 3 }, { "b" : "7FDE3EF75000", "path" : "/usr/lib/libgcc_s.so.1", "elfType" : 3 }, { "b" : "7FDE3EBD2000", "path" : "/usr/lib/libc.so.6", "elfType" : 3 }, { "b" : "7FDE3FDC7000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3 } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf25749]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xecf571]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xeb46e1]
 mongod(+0x9406F6) [0xd406f6]
 mongod(+0xF501F0) [0x13501f0]
 mongod(__wt_err+0x95) [0x13504b5]
 mongod(__wt_panic+0x21) [0x1350921]
 mongod(__wt_bm_read+0x72) [0x12abb92]
 mongod(__wt_bt_read+0x1AC) [0x12c457c]
 mongod(__wt_btree_tree_open+0x56) [0x12c24b6]
 mongod(__wt_btree_open+0xD9F) [0x12c33cf]
 mongod(__wt_conn_btree_get+0x19B) [0x12e93db]
 mongod(__wt_session_get_btree+0x2D8) [0x134f868]
 mongod(__wt_session_get_btree_ckpt+0xCA) [0x134f97a]
 mongod(__wt_curfile_open+0xE1) [0x12f58c1]
 mongod(__wt_open_cursor+0x26A) [0x134e40a]
 mongod(+0xF5B3AF) [0x135b3af]
 mongod(+0xF5B729) [0x135b729]
 mongod(__wt_log_scan+0x780) [0x130c7d0]
 mongod(__wt_txn_recover+0x424) [0x135c264]
 mongod(__wt_connection_workers+0x61) [0x12ec211]
 mongod(wiredtiger_open+0x118C) [0x12e6e8c]
 mongod(_ZN5mongo18WiredTigerKVEngineC1ERKSsS2_bb+0x308) [0xd421c8]
 mongod(+0x93FD28) [0xd3fd28]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD22setGlobalStorageEngineERKSs+0x30D) [0xa5fdcd]
 mongod(_ZN5mongo13initAndListenEi+0x6EF) [0x7df7af]
 mongod(main+0x134) [0x7e4344]
 libc.so.6(__libc_start_main+0xF0) [0x7fde3ebf2040]
 mongod(+0x3DD589) [0x7dd589]
-----  END BACKTRACE  -----
2015-01-28T11:16:41.745+0100 I -        [initandlisten] 
 
***aborting after fassert() failure



 Comments   
Comment by Ramon Fernandez Marina [ 28/May/15 ]

hannes_brt, in order to avoid confusion can you please open a separate ticket for your issue? Please include the full log and the exact version you're running, as well as details about your setup and any other information you think may help us track down this problem.

Thanks,
Ramón.

Comment by Hannes Bretschneider [ 28/May/15 ]

The same issue seems to be happening to me.

I was running a job that had 31 threads writing simultaneously to MongoDB. The job finished successfully, but after the following re-start, mongod now no longer starts up.

Here is the log:

2015-05-28T13:37:20.738-0400 I CONTROL ***** SERVER RESTARTED *****
2015-05-28T13:37:20.763-0400 I STORAGE [initandlisten] wiredtiger_open config: create,cache_size=31G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-05-28T13:37:20.770-0400 E STORAGE [initandlisten] WiredTiger (0) [1432834640:770235][30102:0x7f257ae31bc0], file:WiredTiger.wt, connection: read checksum error [4096B @ 57344, 1959803539 != 2884623168]
2015-05-28T13:37:20.770-0400 E STORAGE [initandlisten] WiredTiger (0) [1432834640:770298][30102:0x7f257ae31bc0], file:WiredTiger.wt, connection: WiredTiger.wt: encountered an illegal file format or internal value
2015-05-28T13:37:20.770-0400 E STORAGE [initandlisten] WiredTiger (-31804) [1432834640:770313][30102:0x7f257ae31bc0], file:WiredTiger.wt, connection: the process must exit and restart: WT_PANIC: WiredTiger library panic
2015-05-28T13:37:20.770-0400 I - [initandlisten] Fatal Assertion 28558
2015-05-28T13:37:20.780-0400 I CONTROL [initandlisten]
0xf51949 0xef1671 0xed6261 0xd7b2ba 0x13816c9 0x1381885 0x1381d24 0x12d578e 0x12d5c28 0x12d2fe3 0x12d6956 0x12ef431 0x1317a0b 0x1380b43 0x134eafb 0x13150f7 0xd65c2b 0xd63c28 0xa81dbd 0x808712 0x7d5334 0x7f2579415ec5 0x8064d7
----- BEGIN BACKTRACE -----
{"backtrace":[

{"b":"400000","o":"B51949"}

,

{"b":"400000","o":"AF1671"}

,

{"b":"400000","o":"AD6261"}

,

{"b":"400000","o":"97B2BA"}

,

{"b":"400000","o":"F816C9"}

,

{"b":"400000","o":"F81885"}

,

{"b":"400000","o":"F81D24"}

,

{"b":"400000","o":"ED578E"}

,

{"b":"400000","o":"ED5C28"}

,

{"b":"400000","o":"ED2FE3"}

,

{"b":"400000","o":"ED6956"}

,

{"b":"400000","o":"EEF431"}

,

{"b":"400000","o":"F17A0B"}

,

{"b":"400000","o":"F80B43"}

,

{"b":"400000","o":"F4EAFB"}

,

{"b":"400000","o":"F150F7"}

,

{"b":"400000","o":"965C2B"}

,

{"b":"400000","o":"963C28"}

,

{"b":"400000","o":"681DBD"}

,

{"b":"400000","o":"408712"}

,

{"b":"400000","o":"3D5334"}

,

{"b":"7F25793F4000","o":"21EC5"}

,

{"b":"400000","o":"4064D7"}

],"processInfo":{ "mongodbVersion" : "3.0.3", "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c6105", "uname" :

{ "sysname" : "Linux", "release" : "3.13.0-53-generic", "version" : "#89-Ubuntu SMP Wed May 20 10:34:39 UTC 2015", "machine" : "x86_64" }

, "somap" : [

{ "elfType" : 2, "b" : "400000", "buildId" : "F56F80CB96B4DBFC070BEB0ADAC7D6B274BFC6B1" }

,

{ "b" : "7FFF250AF000", "elfType" : 3, "buildId" : "E76682F06460BD7F01BD69179E0ADC594943C298" }

,

{ "b" : "7F257AA1E000", "path" : "/lib/x86_64-linux-gnu/libpthread.so.0", "elfType" : 3, "buildId" : "9318E8AF0BFBE444731BB0461202EF57F7C39542" }

,

{ "b" : "7F257A7C0000", "path" : "/lib/x86_64-linux-gnu/libssl.so.1.0.0", "elfType" : 3, "buildId" : "FF43D0947510134A8A494063A3C1CF3CEBB27791" }

,

{ "b" : "7F257A3E5000", "path" : "/lib/x86_64-linux-gnu/libcrypto.so.1.0.0", "elfType" : 3, "buildId" : "B927879B878D90DD9FF4B15B00E7799AA8E0272F" }

,

{ "b" : "7F257A1DD000", "path" : "/lib/x86_64-linux-gnu/librt.so.1", "elfType" : 3, "buildId" : "92FCF41EFE012D6186E31A59AD05BDBB487769AB" }

,

{ "b" : "7F2579FD9000", "path" : "/lib/x86_64-linux-gnu/libdl.so.2", "elfType" : 3, "buildId" : "C1AE4CB7195D337A77A3C689051DABAA3980CA0C" }

,

{ "b" : "7F2579CD5000", "path" : "/usr/lib/x86_64-linux-gnu/libstdc++.so.6", "elfType" : 3, "buildId" : "19EFDDAB11B3BF5C71570078C59F91CF6592CE9E" }

,

{ "b" : "7F25799CF000", "path" : "/lib/x86_64-linux-gnu/libm.so.6", "elfType" : 3, "buildId" : "1D76B71E905CB867B27CEF230FCB20F01A3178F5" }

,

{ "b" : "7F25797B9000", "path" : "/lib/x86_64-linux-gnu/libgcc_s.so.1", "elfType" : 3, "buildId" : "8D0AA71411580EE6C08809695C3984769F25725B" }

,

{ "b" : "7F25793F4000", "path" : "/lib/x86_64-linux-gnu/libc.so.6", "elfType" : 3, "buildId" : "30C94DC66A1FE95180C3D68D2B89E576D5AE213C" }

,

{ "b" : "7F257AC3C000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "9F00581AB3C73E3AEA35995A0C50D24D59A01D47" }

] }}
mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf51949]
mongod(_ZN5mongo10logContextEPKc+0xE1) [0xef1671]
mongod(_ZN5mongo13fassertFailedEi+0x61) [0xed6261]
mongod(+0x97B2BA) [0xd7b2ba]
mongod(__wt_eventv+0x489) [0x13816c9]
mongod(__wt_err+0x95) [0x1381885]
mongod(__wt_panic+0x24) [0x1381d24]
mongod(__wt_block_extlist_read+0x6E) [0x12d578e]
mongod(__wt_block_extlist_read_avail+0x28) [0x12d5c28]
mongod(__wt_block_checkpoint_load+0x193) [0x12d2fe3]
mongod(+0xED6956) [0x12d6956]
mongod(__wt_btree_open+0xAB1) [0x12ef431]
mongod(__wt_conn_btree_get+0x19B) [0x1317a0b]
mongod(__wt_session_get_btree+0x343) [0x1380b43]
mongod(__wt_metadata_open+0x2B) [0x134eafb]
mongod(wiredtiger_open+0xCD7) [0x13150f7]
mongod(_ZN5mongo18WiredTigerKVEngineC1ERKSsS2_bb+0x2EB) [0xd65c2b]
mongod(+0x963C28) [0xd63c28]
mongod(_ZN5mongo23GlobalEnvironmentMongoD22setGlobalStorageEngineERKSs+0x30D) [0xa81dbd]
mongod(_ZN5mongo13initAndListenEi+0x422) [0x808712]
mongod(main+0x134) [0x7d5334]
libc.so.6(__libc_start_main+0xF5) [0x7f2579415ec5]
mongod(+0x4064D7) [0x8064d7]
----- END BACKTRACE -----
2015-05-28T13:37:20.780-0400 I - [initandlisten]

***aborting after fassert() failure

Comment by Ramon Fernandez Marina [ 15/May/15 ]

Hi soner, unfortunately we have not been able to reproduce this ticket. There's been a significant number of improvements in MongoDB and WiredTiger since this issue was first reported, and the 2.8/3.0 release candidates are no longer relevant nor recommended, so I'd like to ask you to upgrade to version 3.0.3 and report back.

I'm going to resolve this ticket, but if after upgrading to 3.0.3 you continue to observe the same behavior please feel free to reopen this ticket.

Regards,
Ramón.

Comment by Soner K [ 05/Mar/15 ]

Ok here are more details on what I did and what happened exactly:

I needed a large collection of generated documents which have a specific structure with semi-random content (choose a random item from a set of items for each document). I wrote a Java-tool which generates these random documents. To make things faster I ran 1000 Threads which each inserts 1'000'000 documents. In the end I would have one billion documents. It didn't work quite well because the MongoDB server process crashed pretty often and I had to restart the MongoDB server (and the Java-tool for inserting the documents).
The MongoDB server started without any issues and was pretty reliable until the last time where it displayed the error message stated in this jira-ticket.

Here's an example of the JSON-structure:

aDocument.json

{
  "_id": { "$oid": "123456789"},
  "someKey" : "",
  "someOtherKey" : "someContent",
  "nullObject" : null,
  "weHaveIntegersToo" : 0,
  "andBooleans" : true,
  "andArrays" : [
    {
      "moreKeyValues" : "content",
      "floats" : 0.47234,
      "anything" : null
    },
    {
      "moreKeyValues" : "otherContent",
      "floats" : 1.234567,
      "anything" : "NOT_NULL"
    }
  ]
}

I'd like to upload the JSON-file but it might contain sensitive data. So if you need a specific JSON to test, I might give it to you in private.
One JSON-document has about 1100 lines (pretty printed of course) and has a filesize of ~35 KiB.

I had three indexes on keys containing integers (like

{ "aNumber" : 5}

).

Thank you for looking into this!

Comment by Bruce Lucas (Inactive) [ 04/Mar/15 ]

Thanks Soner. Those files confirm that it appears that some data was not properly written to disk. I have been running a repro attempt that writes some data and kills mongod with SIGKILL for about 24 hours, through 900 or so server crashes, and have not reproduced a problem. I will try modifying the repro attempt to make it more write intensive and try some more. Can you say a little bit about the kind of data you were inserting? How many threads, how many indexes, what size documents?
Thanks,
Bruce

Comment by Soner K [ 03/Mar/15 ]

Hi Bruce,

I've uploaded the two files you requested.

Only the mongod process was killed, the system was ok. This happened a few times before but I could start the mongod process afterwards. Only this one time it wouldn't start.

Thank you,
Soner

Comment by Bruce Lucas (Inactive) [ 03/Mar/15 ]

Thanks Soner. That file in itself appears to be ok, so I think the problem is a mismatch between the content of that file and one of the other WiredTiger metadata files that points to it. Would you be able to also attach the following two metadata files to the ticket: WiredTiger.turtle and WiredTiger.wt so I can check my theory?

Also, can you clarify whether the node crashed prior to this problem occurring, or only the mongod process was killed?

Thanks,
Bruce

Comment by Soner K [ 19/Feb/15 ]

Hi,
sorry for the late reply, I wasn't able to response earlier due to an illness.

So I attached the sizeStorer.wt
Thanks for looking into this!

Comment by Daniel Pasette (Inactive) [ 31/Jan/15 ]

Would you be able to upload and attach the sizeStorer.wt file from your dbpath to this ticket for examination?

Generated at Thu Feb 08 03:43:15 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.