[SERVER-22617] SnapshotThread hits invariant due to reading oplog entries out of order Created: 08/Feb/16 Updated: 15/Nov/21 Resolved: 24/Feb/16 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.2.4, 3.3.3 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Ricardo Hilsenrath | Assignee: | Ramon Fernandez Marina |
| Resolution: | Done | Votes: | 0 |
| Labels: | code-and-test |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
|
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Completed: | |
| Sprint: | Integration 10 (02/22/16) |
| Participants: | |
| Description |
|
It happens occasionally in our production environment.
Steps to reproduce are in the original ticket.
|
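For context, the invariant at issue fires when the SnapshotThread observes oplog entries whose timestamps are not in non-decreasing order. The snippet below is purely illustrative of that ordering property, not a reproducer for the underlying race; the port and the window of 1000 entries are assumptions.

```sh
# Illustrative only: scan the newest 1000 oplog entries in reverse natural
# order and print any entry whose ts is greater than that of the entry
# written after it. Adjust port and limit for your deployment.
mongo --port 27017 --eval '
  var prev = null;
  db.getSiblingDB("local").oplog.rs.find({}, {ts: 1})
    .sort({$natural: -1}).limit(1000).forEach(function(doc) {
      // Walking backwards, each ts should be <= the previously seen one.
      if (prev && (doc.ts.t > prev.t || (doc.ts.t === prev.t && doc.ts.i > prev.i))) {
        print("out-of-order entry: " + tojson(doc.ts));
      }
      prev = doc.ts;
    });
'
```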
| Comments |
| Comment by Hoyt Ren [ 25/Feb/16 ] |
|
Hello Youdong, this hasn't happened again in my cluster this month, so I guess it's not that frequent. I simply restart the server when it happens. I will wait for 3.2.4 and will update as soon as it's available. |
| Comment by Zhang Youdong [ 25/Feb/16 ] |
|
I get your point. But this happened again today in our environment (just inserts, updates, queries, and some commands). When will 3.2.4 be released? Or is there something I can do to avoid this problem in the meantime? |
| Comment by Adam Midvidy [ 25/Feb/16 ] |
|
First off, a disclaimer: I strongly recommend against cherry-picking random commits and then running the resulting binaries in production. MongoDB, Inc. only provides support for the builds that we distribute - we put extensive effort into validating each release against our considerable regression suite - none of which applies to picking random commits and applying them to the tree at some point in time. That said, it is possible, with some effort, to apply only the WT commit you mentioned. The reason I didn't suggest it initially is that it won't apply cleanly, because the paths in that commit are different from those used in the version of the WiredTiger sources in the MongoDB repository. If you'd like to verify that the aforementioned fix works in your test or staging environment, that is a reasonable idea (one way to work around the path mismatch is sketched below). |
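For anyone who accepts the risks above, a minimal sketch of working around the path mismatch, assuming side-by-side clones of the upstream wiredtiger repository and the mongo repository; WT_COMMIT is a placeholder for the upstream fix's hash, not a value from this ticket:

```sh
# Export the upstream WiredTiger fix as a patch (WT_COMMIT is a placeholder).
git -C wiredtiger format-patch -1 WT_COMMIT --stdout > wt-fix.patch

# Re-root the patch onto MongoDB's vendored copy of the WiredTiger sources.
# --check performs a dry run first; then apply for real.
git -C mongo apply --directory=src/third_party/wiredtiger --check wt-fix.patch
git -C mongo apply --directory=src/third_party/wiredtiger wt-fix.patch
```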
| Comment by Zhang Youdong [ 25/Feb/16 ] |
|
Thanks for your reply, Adam Midvidy. |
| Comment by Adam Midvidy [ 25/Feb/16 ] |
|
zyd_com, the commit is https://github.com/mongodb/mongo/commit/f77630a9e971cae1f921292ea31d9d40a4b096b8. Since the issue was in WiredTiger, the true fix was tracked in the upstream WiredTiger project. |
| Comment by Zhang Youdong [ 25/Feb/16 ] |
|
I encountered this problem in our production environment yesterday (using 3.2.3). I saw that this issue is marked as RESOLVED, but I cannot find the commit. Could you provide the commit so I can merge it and test? |
| Comment by Hoyt Ren [ 16/Feb/16 ] |
|
I will continue in this ticket. Here is my situation. First, a correction: after re-checking the rest of the log, I found that one of my crashed servers was a primary (it had been auto-switched) and another was a secondary (a different replica set and host). When I set up the replica set I didn't know about --enableMajorityReadConcern, so it should be at its default value. The replica set is a brand-new 3.2.0 deployment with data imported by mongoimport. The servers have been running since 2016-01-04. All members use the WiredTiger engine. |
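For reference, one way to confirm what a running mongod is actually using (the port below is an assumption) is to ask it for its parsed startup options:

```sh
# Print the replication options the mongod was started with; if
# enableMajorityReadConcern is absent, the server is using the default.
mongo --port 27017 --eval 'printjson(db.adminCommand({getCmdLineOpts: 1}).parsed.replication)'
```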
| Comment by Ramon Fernandez Marina [ 15/Feb/16 ] |
|
ricardo_fanatee, this is to let you know that I've moved this ticket back to the SERVER project. All the private information you uploaded for this ticket is in a private area and will remain private, but it's very useful for the original description to be visible to search engines so other users may find this ticket if they're affected by the same bug. We're investigating this invariant failure at high priority; stay tuned for updates. Regards, |
| Comment by Ramon Fernandez Marina [ 12/Feb/16 ] |
|
Thanks ricardo_fanatee, we're looking at the data you uploaded and working on a reproducer locally. We'll post updates to this ticket as we have them. |
| Comment by Ricardo Hilsenrath [ 09/Feb/16 ] |
|
diagnostic.data uploaded |
| Comment by Ramon Fernandez Marina [ 09/Feb/16 ] |
|
Inside your dbpath you'll find a directory named diagnostic.data; that's the data Scott was talking about. The easiest way forward is to zip/tar this directory up and upload it (a sketch follows below). |
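A minimal sketch, assuming the default dbpath of /data/db (substitute your own storage.dbPath):

```sh
# Bundle the diagnostic.data (FTDC metrics) directory for upload.
tar czf diagnostic.data.tar.gz -C /data/db diagnostic.data
```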
| Comment by Ricardo Hilsenrath [ 09/Feb/16 ] |
|
Scott, I have two small logs (from the primary server that crashed and from the secondary that has priority 0), but the other secondary that took over the primary role when our "main server" went down has a 150MB log file. Where can I find the diagnostic.data directory? We made a lot of changes to the replica set in December and early January, but we've been running a stable configuration (we didn't change a single setting on any server) for the past 2-3 weeks, and suddenly the server crashed. All members are running the WiredTiger engine, and all the logs are from yesterday (I'm uploading them to S3). |
| Comment by Ramon Fernandez Marina [ 09/Feb/16 ] |
|
ricardo_fanatee, this ticket is now in a private project as you requested. Whenever you have the information requested above by Scott you can upload it here privately. If any of the files are big you can use this upload portal instead (also private). Thanks, |
| Comment by Ricardo Hilsenrath [ 09/Feb/16 ] |
|
Hi Scott, I will be glad to share more information about the servers and logs, and as you suggested, I think it'll be better if you move this issue to a private project. I'll prepare all the logs you asked for and upload them as soon as possible. Thanks, |
| Comment by Scott Hernandez (Inactive) [ 09/Feb/16 ] |
|
Hi Ricardo, we would like to diagnose this issue if possible. Can you upload the rest of the logs from before the crash, for each occasion? It would also be useful to include logs from the other members covering the same time. Also, please include the "diagnostic.data" directory with the logs. (Please let us know if you would like to provide the data privately; if so, we can move this issue from the public SERVER project to a private one.) Which node is this from? Is it always the same member, or does it change? Are all the members running the same storage engine, and is the data new or carried over from a previous version? (A sketch for gathering these details follows below.) |
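A minimal sketch for collecting the member and storage-engine details asked about above; the port is an assumption, and the second command should be run against each member:

```sh
# List each replica set member and its current state (PRIMARY, SECONDARY, ...).
mongo --port 27017 --eval 'printjson(rs.status().members.map(function(m) {
  return {name: m.name, state: m.stateStr};
}))'

# Report the storage engine this particular member is running.
mongo --port 27017 --eval 'print(db.serverStatus().storageEngine.name)'
```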