[SERVER-18006] Shard replicas crash after 3.0.2 upgrade with "Didn't find RecordId in WiredTigerRecordStore" Created: 13/Apr/15 Updated: 01/Sep/15 Resolved: 19/May/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.0.2 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Oleg Rekutin | Assignee: | Ramon Fernandez Marina |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
Upgraded a 4-shard replicated 3.0.1 cluster to 3.0.2. 1 hour later, two of the shard replicas crashed with the following stack trace:
The second replica crashed with the same stack trace, but on a different collection and about 5 minutes later. Both failures were preceded by a chunk move (visible in the mongos logs), with the crash occurring a few minutes after the move. Here is the only pertinent line in the mongos log:
So it seems this was the result of a chunk move. We turned off the balancer and have run without crashes for the last two days; of course, leaving the balancer off is not sustainable. The replicas would NOT restart after this crash and kept crashing with the same stack trace on the same collection. They would start without the replSet argument, but with replSet they would fail. We ended up removing the collection locally (starting without replSet), which allowed the node to restart as a replica set member. (We then dropped that collection cluster-wide.)
About an hour before the failure, the entire cluster had been upgraded from 3.0.1 to 3.0.2, specifically the mongodb-linux-x86_64-amazon-3.0.2.tgz release. Here's the uname -a output:
|
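For reference, a minimal sketch of the recovery procedure described in the report above (start the member standalone, drop the affected collection locally, restart with --replSet, and stop the balancer). The paths, port, replica set name, and namespace below are placeholders, not values from this cluster:

```
# Start the affected member standalone (no --replSet) so it comes up despite the bad collection.
mongod --dbpath /data/db --port 27017 --fork --logpath /var/log/mongod-standalone.log

# Drop the damaged collection locally; "mydb.badcoll" is a placeholder namespace.
mongo localhost:27017/mydb --eval 'db.badcoll.drop()'

# Shut down cleanly, then restart with the original --replSet argument.
mongo localhost:27017/admin --eval 'db.shutdownServer()'
mongod --dbpath /data/db --port 27017 --replSet rs0 --fork --logpath /var/log/mongod.log

# Stop the balancer cluster-wide (run against a mongos) until the cause is understood.
mongo mongos-host:27017/admin --eval 'sh.setBalancerState(false)'

# Once the node is healthy again, drop the collection cluster-wide through a mongos.
mongo mongos-host:27017/mydb --eval 'db.badcoll.drop()'
```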
| Comments |
| Comment by Ramon Fernandez Marina [ 01/Sep/15 ] | |
|
Thanks for the update, oleg@evergage.com, that's great news! And your message should help other users who may run into this issue, so thanks again. Cheers, | |
| Comment by Oleg Rekutin [ 01/Sep/15 ] | |
|
Ramon, thank you for your response! Indeed, we were able to re-sync all the nodes over time, and it looks like we have been able to get past this issue, as it has not come up since June 17. | |
| Comment by Ramon Fernandez Marina [ 01/Sep/15 ] | |
|
Hi oleg@evergage.com, as per Michael's message on the linked ticket: if you have a replica set, you can try to re-sync from a different node; if the source node was not affected, the resync process should succeed and give you a baseline to recover the rest of the nodes. Regards, | |
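For anyone following along, a rough sketch of that re-sync (initial sync) procedure on an affected member, assuming a packaged install; the service name, dbpath, and file owner are placeholders:

```
# Stop the affected member, then move its data files aside so that it performs
# an initial sync from a healthy member when it restarts.
sudo service mongod stop
mv /data/db /data/db.bak.$(date +%F)          # keep a copy rather than deleting outright
mkdir /data/db && chown mongod:mongod /data/db   # adjust owner to your mongod user

# Restart with the usual replica set configuration; the member goes through
# STARTUP2 while it copies data from another member, then returns to SECONDARY.
sudo service mongod start

# Watch progress from any member.
mongo --eval 'printjson(rs.status())'
```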
| Comment by Oleg Rekutin [ 17/Jun/15 ] | |
|
ramon.fernandez, we hit this problem again. Is there a way to proactively find out every database that was affected by (or has leftover corruption from) the original issue? | |
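One possible way to sweep for damaged collections proactively, as a hedged sketch rather than an official procedure from this ticket: run a full validate() over every collection on each member of each shard, since any damage is node-local. A full validate is I/O-intensive, so it is best run off-peak or against a member taken out of rotation. The host and port are placeholders:

```
# Print every collection on this mongod that fails a full validation.
mongo localhost:27017 --quiet --eval '
  db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
    var dbh = db.getSiblingDB(d.name);
    dbh.getCollectionNames().forEach(function (c) {
      var res = dbh.getCollection(c).validate(true);   // full validation
      if (!res.valid) print("INVALID: " + d.name + "." + c);
    });
  });
'
```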
| Comment by Ramon Fernandez Marina [ 19/May/15 ] | |
|
Thanks for getting back to us oleg@evergage.com. We'll close the ticket then – feel free to re-open (or open a new one) if you see this again. Regards, | |
| Comment by Oleg Rekutin [ 18/May/15 ] | |
|
Dan, apologies for missing your question earlier. Yes, I did run this cluster with 3.0.0 at one point, so indeed it's quite possible that these are leftovers from that issue. We were forced to turn the balancer back on, as the cluster wasn't too happy unbalanced, but I am happy to report that we have not run into this problem again, even while running more load and with balancing enabled. The cluster is still running 3.0.2. Given all the data points, it's quite possible that this is a leftover from the earlier issue. | |
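For completeness, a small sketch of re-enabling the balancer and confirming its state, run against a mongos (the host name is a placeholder):

```
# Turn the balancer back on and verify that it is enabled.
mongo mongos-host:27017/admin --eval 'sh.setBalancerState(true)'
mongo mongos-host:27017/admin --eval 'print("balancer enabled: " + sh.getBalancerState())'
```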
| Comment by Ramon Fernandez Marina [ 15/May/15 ] | |
|
Hi oleg@evergage.com, we haven't heard back from you for some time. If this is still an issue for you, can you please provide the information requested by Dan in the previous comment? Thanks, | |
| Comment by Daniel Pasette (Inactive) [ 01/May/15 ] | |
|
Hi Oleg, can you confirm whether or not you were running this cluster with 3.0.0 at one point? It is possible that the collection was impacted by the issue linked to this ticket. Thanks, | |
| Comment by Ramon Fernandez Marina [ 22/Apr/15 ] | |
|
Thanks oleg@evergage.com, I think you're right that storage problems are unlikely here, so it's safe to rule those out. That being said, the assertion is the same as the one in the linked ticket. Regards, | |
| Comment by Oleg Rekutin [ 22/Apr/15 ] | |
|
Hi Ramón, these replicas run on top of AWS EBS, and I don't see any errors in the system logs related to disk or I/O issues. I also don't see any read or write latency abnormalities for the involved GP2 EBS volume during the failure time period, so I don't think this is related to on-disk corruption. It's also strange that two of our nodes experienced this shortly after the 3.0.2 upgrade, which is another point against on-disk corruption. Could this be related to the linked ticket? | |
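As an illustration of that kind of check (assuming the metrics are pulled with the AWS CLI rather than the console), average EBS read latency over the failure window can be derived from the VolumeTotalReadTime and VolumeReadOps CloudWatch metrics; the volume ID and time window are placeholders:

```
# Total time spent on reads per 5-minute period for the volume; divide by the
# VolumeReadOps sum over the same period to get average read latency.
aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name VolumeTotalReadTime \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2015-04-13T00:00:00Z --end-time 2015-04-13T06:00:00Z \
  --period 300 --statistics Sum

aws cloudwatch get-metric-statistics \
  --namespace AWS/EBS --metric-name VolumeReadOps \
  --dimensions Name=VolumeId,Value=vol-0123456789abcdef0 \
  --start-time 2015-04-13T00:00:00Z --end-time 2015-04-13T06:00:00Z \
  --period 300 --statistics Sum
```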
| Comment by Ramon Fernandez Marina [ 22/Apr/15 ] | |
|
Thanks for uploading the logs, oleg@evergage.com. One thing you may be able to do is check the health of the disks on the affected nodes, to rule out on-disk corruption caused by flaky hardware. There's a similar error message in a related ticket. I understand your concerns about turning the balancer back on; we can try to reproduce on our end, but until we do there's unfortunately not enough information for us to proceed. I'm going to link this ticket to the related issue. Regards, | |
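A hedged sketch of that kind of disk health check on a Linux host; the device and log paths are placeholders, and note that EBS volumes generally do not expose SMART data, so kernel logs are usually the more useful signal there:

```
# Look for I/O or filesystem errors reported by the kernel.
dmesg | grep -iE 'i/o error|ext4-fs error|xfs.*corrupt'
sudo grep -iE 'i/o error|remounting.*read-only' /var/log/messages

# If the device supports SMART (often not the case on EBS), check its health;
# /dev/xvdf is a placeholder device name.
sudo smartctl -H /dev/xvdf
```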
| Comment by Oleg Rekutin [ 16/Apr/15 ] | |
|
Hi Jason, apologies for the delay in response. Yes, both members that suffered the issue were in the SECONDARY state. I've uploaded the logs from both members, starting from when they were first started as secondaries on 3.0.2 (the first start after the 3.0.2 upgrade), to your private server. You can see that both members were in SECONDARY state. Unfortunately, we did not capture the data files for these members at the time of the assertion or shortly thereafter. If this happens again, I will try to capture the data files. But our balancer is off for the time being, and I am not eager to reproduce this problem.
| |
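If it does recur, a small sketch of capturing the data files for diagnosis after a clean shutdown (paths are placeholders):

```
# Shut the affected member down cleanly, then archive its dbpath so the data
# files can be uploaded for analysis.
mongo localhost:27017/admin --eval 'db.shutdownServer()'
tar czf mongod-dbpath-$(hostname)-$(date +%F).tar.gz /data/db
```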
| Comment by J Rassi [ 13/Apr/15 ] | |
|
Hi, sorry to hear that you encountered this problem. We'll need additional information to further diagnose the issue:
If you are not able to share the files publicly (or any file exceeds ~100MB in size), note also that you may upload the files to a private server of ours that is accessible only to MongoDB staff members (hit "enter" at the password prompt):
~ Jason Rassi |