[SERVER-19916] WT 3.0.3 config server does not start Created: 12/Aug/15  Updated: 10/Oct/15  Resolved: 10/Oct/15

Status: Closed
Project: Core Server
Component/s: Admin, WiredTiger
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Dharshan Rangegowda Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Zip Archive data.zip     Zip Archive logs.zip     File new_files.tgz    
Operating System: ALL
Participants:

 Description   

Config server is running on AWS.

I replaced the data disk with a snapshot taken previously and the server refuses to start after that.

2015-08-12T22:40:35.978+0000 I CONTROL  ***** SERVER RESTARTED *****
2015-08-12T22:40:36.048+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
2015-08-12T22:40:36.078+0000 E STORAGE  [initandlisten] WiredTiger (0) [1439419236:78133][3373:0x7f3330c47c80], file:WiredTiger.wt, connection: read checksum error [4096B @ 20480, 4285996209 != 1734960750]
2015-08-12T22:40:36.078+0000 E STORAGE  [initandlisten] WiredTiger (0) [1439419236:78187][3373:0x7f3330c47c80], file:WiredTiger.wt, connection: WiredTiger.wt: encountered an illegal file format or internal value
2015-08-12T22:40:36.078+0000 E STORAGE  [initandlisten] WiredTiger (-31804) [1439419236:78213][3373:0x7f3330c47c80], file:WiredTiger.wt, connection: the process must exit and restart: WT_PANIC: WiredTiger library panic
2015-08-12T22:40:36.078+0000 I -        [initandlisten] Fatal Assertion 28558
2015-08-12T22:40:36.095+0000 I CONTROL  [initandlisten]
 0xf6a889 0xf08321 0xeec161 0xd88caa 0x139bc59 0x139be15 0x139c2b4 0x12f0cae 0x12f1148 0x12ee503 0x12f1e76 0x130a021 0x13322bb 0x139b0d3 0x136919b 0x132f9a7 0xd72b8b 0xd70a28 0xa8104d 0x7f3e
62 0x7f93c4 0x7f332f1feaf5 0x7f1bbd
----- BEGIN BACKTRACE -----
{"backtrace":[{"b":"400000","o":"B6A889"},{"b":"400000","o":"B08321"},{"b":"400000","o":"AEC161"},{"b":"400000","o":"988CAA"},{"b":"400000","o":"F9BC59"},{"b":"400000","o":"F9BE15"},{"b":"40
0000","o":"F9C2B4"},{"b":"400000","o":"EF0CAE"},{"b":"400000","o":"EF1148"},{"b":"400000","o":"EEE503"},{"b":"400000","o":"EF1E76"},{"b":"400000","o":"F0A021"},{"b":"400000","o":"F322BB"},{"
b":"400000","o":"F9B0D3"},{"b":"400000","o":"F6919B"},{"b":"400000","o":"F2F9A7"},{"b":"400000","o":"972B8B"},{"b":"400000","o":"970A28"},{"b":"400000","o":"68104D"},{"b":"400000","o":"3F3E6
2"},{"b":"400000","o":"3F93C4"},{"b":"7F332F1DD000","o":"21AF5"},{"b":"400000","o":"3F1BBD"}],"processInfo":{ "mongodbVersion" : "3.0.3", "gitVersion" : "b40106b36eecd1b4407eb1ad1af6bc60593c
6105", "uname" : { "sysname" : "Linux", "release" : "3.10.48-55.140.amzn1.x86_64", "version" : "#1 SMP Wed Jul 9 23:32:19 UTC 2014", "machine" : "x86_64" }, "somap" : [ { "elfType" : 2, "b"
: "400000", "buildId" : "108A63CA14A4BD5E599BAC10885DBD3A85DA5439" }, { "b" : "7FFF1E6FB000", "elfType" : 3, "buildId" : "9C6F128297F4D2216E6CD02B515D59A00C4063AF" }, { "b" : "7F333081D000",
 "path" : "/lib64/libpthread.so.0", "elfType" : 3, "buildId" : "D48D3E6672A77B603B402F661BABF75E90AD570B" }, { "b" : "7F33305B0000", "path" : "/usr/lib64/libssl.so.10", "elfType" : 3, "build
Id" : "22480480235F3B1C6C2E5E5953949728676D3796" }, { "b" : "7F33301CB000", "path" : "/lib64/libcrypto.so.10", "elfType" : 3, "buildId" : "F1C146B78505646930DD9003AA2B3622C5226D1B" }, { "b"
: "7F332FFC3000", "path" : "/lib64/librt.so.1", "elfType" : 3, "buildId" : "E81013CBFA409053D58A65A0653271AB665A4619" }, { "b" : "7F332FDBF000", "path" : "/lib64/libdl.so.2", "elfType" : 3,
"buildId" : "62A8842157C62F95C3069CBF779AFCC26577A99A" }, { "b" : "7F332FAB6000", "path" : "/usr/lib64/libstdc++.so.6", "elfType" : 3, "buildId" : "66F1CF311C61879639BD3DC0034DEE0D6D042261"
}, { "b" : "7F332F7B4000", "path" : "/lib64/libm.so.6", "elfType" : 3, "buildId" : "5F97F8F8E5024E29717CF35998681F84D4A22D45" }, { "b" : "7F332F59E000", "path" : "/lib64/libgcc_s.so.1", "elf
Type" : 3, "buildId" : "DB655E06F0F4F7B4EC561BB7E620F5D5BC4F1C54" }, { "b" : "7F332F1DD000", "path" : "/lib64/libc.so.6", "elfType" : 3, "buildId" : "DF6DA145A649EA093507A635AF383F608E7CE3F2
" }, { "b" : "7F3330A39000", "path" : "/lib64/ld-linux-x86-64.so.2", "elfType" : 3, "buildId" : "6F90843B9087FE91955FEB0355EB0858EF9E97B2" }, { "b" : "7F332EF9A000", "path" : "/lib64/libgssa
pi_krb5.so.2", "elfType" : 3, "buildId" : "9DF61878D8918F25CC74AD01F417FDB051DFE3DA" }, { "b" : "7F332ECB5000", "path" : "/lib64/libkrb5.so.3", "elfType" : 3, "buildId" : "6F1DB0F811D1B21052
0443442D4437BC43BF9A80" }, { "b" : "7F332EAB2000", "path" : "/usr/lib64/libcom_err.so.2", "elfType" : 3, "buildId" : "E52249AE6C9865B5C3B9697A57FC92200DA51CF3" }, { "b" : "7F332E887000", "pa
th" : "/lib64/libk5crypto.so.3", "elfType" : 3, "buildId" : "F7DF34078FD7BFD684FE46D5F677EEDA1D9B9DC9" }, { "b" : "7F332E671000", "path" : "/lib64/libz.so.1", "elfType" : 3, "buildId" : "87B
4EBF2183C8EA4AB657212203EFFE6340E2F4F" }, { "b" : "7F332E466000", "path" : "/lib64/libkrb5support.so.0", "elfType" : 3, "buildId" : "381960ACAB9C39461D58BDE7B272C4F61BB3582F" }, { "b" : "7F3
32E263000", "path" : "/lib64/libkeyutils.so.1", "elfType" : 3, "buildId" : "BF48CD5658DE95CE058C4B828E81C97E2AE19643" }, { "b" : "7F332E049000", "path" : "/lib64/libresolv.so.2", "elfType" :
 3, "buildId" : "6A7DA1CED90F65F27CB7B5BACDBB1C386C05F592" }, { "b" : "7F332DE28000", "path" : "/usr/lib64/libselinux.so.1", "elfType" : 3, "buildId" : "803D7EF21A989677D056E52BAEB9AB5B154FB
9D9" } ] }}
 mongod(_ZN5mongo15printStackTraceERSo+0x29) [0xf6a889]
 mongod(_ZN5mongo10logContextEPKc+0xE1) [0xf08321]
 mongod(_ZN5mongo13fassertFailedEi+0x61) [0xeec161]
 mongod(+0x988CAA) [0xd88caa]
 mongod(__wt_eventv+0x489) [0x139bc59]
 mongod(__wt_err+0x95) [0x139be15]
 mongod(__wt_panic+0x24) [0x139c2b4]
 mongod(__wt_block_extlist_read+0x6E) [0x12f0cae]
 mongod(__wt_block_extlist_read_avail+0x28) [0x12f1148]
 mongod(__wt_block_checkpoint_load+0x193) [0x12ee503]
 mongod(+0xEF1E76) [0x12f1e76]
 mongod(__wt_btree_open+0xAB1) [0x130a021]
 mongod(__wt_conn_btree_get+0x19B) [0x13322bb]
 mongod(__wt_session_get_btree+0x343) [0x139b0d3]
 mongod(__wt_metadata_open+0x2B) [0x136919b]
 mongod(wiredtiger_open+0xCD7) [0x132f9a7]
 mongod(_ZN5mongo18WiredTigerKVEngineC1ERKSsS2_bb+0x2EB) [0xd72b8b]
 mongod(+0x970A28) [0xd70a28]
 mongod(_ZN5mongo23GlobalEnvironmentMongoD22setGlobalStorageEngineERKSs+0x30D) [0xa8104d]
 mongod(_ZN5mongo13initAndListenEi+0x422) [0x7f3e62]
 mongod(main+0x134) [0x7f93c4]
 libc.so.6(__libc_start_main+0xF5) [0x7f332f1feaf5]
 mongod(+0x3F1BBD) [0x7f1bbd]
-----  END BACKTRACE  -----
2015-08-12T22:40:36.095+0000 I -        [initandlisten]
 
***aborting after fassert() failure

The logs and data files are attached



 Comments   
Comment by Ramon Fernandez Marina [ 10/Oct/15 ]

Apologies for the long delay dharshanr@scalegrid.net. It seems that the WiredTiger.wt file in this node got corrupted:

  • since the symptoms are similar to those found in SERVER-18316
  • SERVER-18316 could be triggered by an unclean shutdown
  • since the shutdown immediately before the problem was encountered, I wonder if the storage layer is not providing enough guarantees around fsync() here...

Anyway, I've uploaded the new_files.tgz archive to this ticket containing the repaired files. If you extract this archive into your dbpath you should be able to recover this node. I would also encourage you to upgrade to MongoDB 3.0.6, which fixes SERVER-18316 and 102 other issues since 3.0.3.

Regards,
Ramón.

Comment by Dharshan Rangegowda [ 18/Aug/15 ]

Yes. The snapshot is for backup purposes. With previous versions of MongoDB we use EBS snapshots for backup. With WiredTiger we would like to continue to use the same process.

When I said the "service starts file" I implied Step 2 in your descriptions above. Once the service is stopped the data on disk should be consistent right? Or is there some other command that needs to be run? (E.g. disk flush etc)

Comment by Ramon Fernandez Marina [ 18/Aug/15 ]

Hi dharshanr@scalegrid.net, thanks for the additional info. We're taking a closer look at the uploaded data and logs, and depending on what we find we may need to take you up on your offer to log in remotely.

The part I'm confused about is that the ticket description says "config server does not start", but in step 1. above you mention "the service starts fine", so I think the scenario is as follows (please correct me if I'm wrong):

  1. The mongod process on this config server is stopped, and the data is snapshotted
  2. After the snapshot the mongod process is restarted without problems
  3. Some time later the same mongod process is stopped, and its data is replaced with the one from the snapshot
  4. At this point the mongod process fails to start

Is this correct? Did I miss any steps? I'll be investigating the consistency of the data you uploaded, but I'd be interested in knowing what your ultimate goal on this process is – I assume this step is part of a larger process (e.g.: cluster backups, etc.) and it would be helpful to see the bigger picture.

Thanks,
Ramón.

Comment by Dharshan Rangegowda [ 18/Aug/15 ]

Hi Sam,

I don't think the issue is specific to config server. I have another replica set where I am able to repro the same bug. Also I am not sure there is an issue with EBS snapshots since it is used by thousands of users every day.

1. Steps for snapshot - Stop the mongod instance, snapshot the ebs volume, start the mongod instance. Note the service starts fine.
2. I have a machine with a live repro. Can you share your public key and I will give you access.
3. Do you want to look at the machine first before I do the upgrade to 3.0.5?

Comment by Sam Kleinman (Inactive) [ 17/Aug/15 ]

I would assume based on the stacktrace, that there was something wrong with the snapshot, the storage system, or the networking layers used to create or transmit the query responses. For reference, if you haven't read our Replace Disabled Config Server , you may find this resource useful. A few questions:

  1. Can you provide the steps that you used to create the snapshot?
  2. Are you able to create a new snapshot and reproduce this problem?
  3. Can you upgrade to 3.0.5 or 3.0.6-rc0 with the new snapshot and validate that the problem hasn't been fixed in a more recent release?

Thanks for the feedback, and sorry that we didn't get back to you sooner.

Regards,
sam

Comment by Dharshan Rangegowda [ 15/Aug/15 ]

Hi folks - any update on this? I had the same error repro on another machine. A "repair" operation did not the fix the error as well.

Comment by Dharshan Rangegowda [ 13/Aug/15 ]

One more thing I would like to add - the snapshot of data was taken after the service was stopped - so technically it should be fully consistent on disk.

Generated at Thu Feb 08 03:52:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.