[SERVER-19916] WT 3.0.3 config server does not start Created: 12/Aug/15 Updated: 10/Oct/15 Resolved: 10/Oct/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Admin, WiredTiger |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Dharshan Rangegowda | Assignee: | Ramon Fernandez Marina |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
Config server is running on AWS. I replaced the data disk with a snapshot taken previously and the server refuses to start after that.
The logs and data files are attached |
| Comments |
| Comment by Ramon Fernandez Marina [ 10/Oct/15 ] |
|
Apologies for the long delay dharshanr@scalegrid.net. It seems that the WiredTiger.wt file in this node got corrupted:
Anyway, I've uploaded the new_files.tgz archive to this ticket containing the repaired files. If you extract this archive into your dbpath you should be able to recover this node. I would also encourage you to upgrade to MongoDB 3.0.6, which fixes Regards, |
| Comment by Dharshan Rangegowda [ 18/Aug/15 ] |
|
Yes. The snapshot is for backup purposes. With previous versions of MongoDB we use EBS snapshots for backup. With WiredTiger we would like to continue to use the same process. When I said the "service starts fine" I meant Step 2 in your description above. Once the service is stopped the data on disk should be consistent, right? Or is there some other command that needs to be run (e.g. a disk flush)? |
| Comment by Ramon Fernandez Marina [ 18/Aug/15 ] |
|
Hi dharshanr@scalegrid.net, thanks for the additional info. We're taking a closer look at the uploaded data and logs, and depending on what we find we may need to take you up on your offer to log in remotely. The part I'm confused about is that the ticket description says "config server does not start", but in step 1. above you mention "the service starts fine", so I think the scenario is as follows (please correct me if I'm wrong):
Is this correct? Did I miss any steps? I'll be investigating the consistency of the data you uploaded, but I'd be interested in knowing what your ultimate goal for this process is. I assume this step is part of a larger process (e.g. cluster backups) and it would be helpful to see the bigger picture. Thanks, |
| Comment by Dharshan Rangegowda [ 18/Aug/15 ] |
|
Hi Sam, I don't think the issue is specific to the config server. I have another replica set where I am able to repro the same bug. Also, I am not sure there is an issue with EBS snapshots, since they are used by thousands of users every day. 1. Steps for snapshot: stop the mongod instance, snapshot the EBS volume, start the mongod instance. Note the service starts fine. |
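For reference, the stop / snapshot / start sequence in step 1 can be sketched as below. This is only an illustration, not the exact commands used on the affected node: the service name `mongod`, the use of `sudo service`, the placeholder volume ID, and the helper names `snapshot_backup_commands` and `render` are all assumptions. The script only builds and prints the commands in order; it does not run them.

```python
import shlex


def snapshot_backup_commands(volume_id, service="mongod",
                             description="mongod dbpath backup"):
    """Return the backup commands in the order they should be run.

    Assumptions (not from the ticket): the node is managed with
    `sudo service <name>` and the snapshot is taken with the AWS CLI.
    """
    return [
        # 1. Stop mongod so nothing is writing to the dbpath volume.
        ["sudo", "service", service, "stop"],
        # 2. Snapshot the EBS volume that holds the dbpath.
        ["aws", "ec2", "create-snapshot",
         "--volume-id", volume_id,
         "--description", description],
        # 3. Start mongod again.
        ["sudo", "service", service, "start"],
    ]


def render(commands):
    """Turn argument lists into shell-quoted command lines for review."""
    return [shlex.join(cmd) for cmd in commands]


if __name__ == "__main__":
    # Hypothetical volume ID, used only to show the output format.
    for line in render(snapshot_backup_commands("vol-0123456789abcdef0")):
        print(line)
```

Printing the commands rather than executing them keeps the sketch safe to run anywhere; each line can be reviewed and then run by hand on the node.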
| Comment by Sam Kleinman (Inactive) [ 17/Aug/15 ] |
|
I would assume, based on the stack trace, that there was something wrong with the snapshot, the storage system, or the networking layers used to create or transmit the query responses. For reference, if you haven't read our Replace Disabled Config Server, you may find this resource useful. A few questions:
Thanks for the feedback, and sorry that we didn't get back to you sooner. Regards, |
| Comment by Dharshan Rangegowda [ 15/Aug/15 ] |
|
Hi folks - any update on this? I had the same error repro on another machine. A "repair" operation did not fix the error either. |
| Comment by Dharshan Rangegowda [ 13/Aug/15 ] |
|
One more thing I would like to add - the snapshot of data was taken after the service was stopped - so technically it should be fully consistent on disk. |