[SERVER-23794] Multiple documents with same _id after 3.2 upgrade Created: 19/Apr/16 Updated: 07/Nov/17 Resolved: 03/Oct/17
| Status: | Closed |
| Project: | Core Server |
| Component/s: | WiredTiger |
| Affects Version/s: | 3.2.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Greg Murphy | Assignee: | Eric Milkie |
| Resolution: | Incomplete | Votes: | 1 |
| Labels: | NR, uic |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Operating System: | ALL |
| Steps To Reproduce: | Hard to be precise, but I expect this is connected with the upgrade from 3.0 on MMAPv1 to WiredTiger on 3.2. |
| Participants: |
| Description |
We upgraded a replica set from 2.6 to 3.0 and then to 3.2, each time adding new servers, letting them replicate, and then removing the old servers. On the day the first 3.2 server was added as a 4th (hidden) node in the 3.0 cluster to start the data sync, we appear to have encountered some data corruption. To be precise, we now have examples of multiple documents in a collection with the same _id. An unbounded find on the collection shows them (only showing the records around the problem one for brevity, and anonymising the collection name):
Note that "_id" : "2016-03-18" is there twice. If I try and query directly for this record, only one appears:
Would a copy of the WiredTiger datafiles for this collection and its indexes help with analysing this issue?
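The discrepancy described above (an unbounded find returns two documents with the same _id, while a direct query returns only one) is consistent with a supposedly-unique _id index holding a single entry while the underlying table holds two records. A minimal sketch of that situation, in plain Python with no MongoDB required and with all data and names illustrative rather than taken from the ticket:

```python
# Sketch: why an index-backed _id lookup can return one document while a
# full collection scan returns two. The collection stores records keyed by
# an internal record id; the _id "index" maps each _id to a single record id.

# Collection: internal record id -> document. Records 2 and 3 share an _id.
collection = {
    1: {"_id": "2016-03-17", "count": 5},
    2: {"_id": "2016-03-18", "count": 7},   # first copy
    3: {"_id": "2016-03-18", "count": 9},   # duplicate record
    4: {"_id": "2016-03-19", "count": 2},
}

# A (supposedly unique) _id index can only hold one entry per key,
# so one of the duplicate records is invisible to it.
id_index = {doc["_id"]: rid for rid, doc in collection.items()}

def find_by_id(_id):
    """Index-backed lookup: follows the single index entry."""
    rid = id_index.get(_id)
    return [collection[rid]] if rid is not None else []

def find_all(_id):
    """Unbounded collection scan: visits every record."""
    return [doc for doc in collection.values() if doc["_id"] == _id]

print(len(find_by_id("2016-03-18")))  # 1 - the index hides the duplicate
print(len(find_all("2016-03-18")))    # 2 - the scan sees both records
```

This is only a model of the symptom, not of WiredTiger internals; it shows why queries routed through the _id index would behave differently from a collection scan if the index and table disagreed.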
| Comments |
| Comment by Eric Milkie [ 13/Sep/17 ] |

Hi Greg,
| Comment by Greg Murphy [ 22/Jul/16 ] |

Hi - Yes, journalling is enabled. The process that writes this data uses the Java driver, version 3.2.2.
| Comment by Kelsey Schubert [ 18/Jul/16 ] |

Hi gregmurphy, Thank you for the detailed timeline. Unfortunately, we haven't been able to reproduce this issue yet. To help us reproduce it, would you please specify which MongoDB driver you were using and its specific version? Additionally, would you please clarify whether journaling is enabled? Thank you,
| Comment by Greg Murphy [ 26/Apr/16 ] |

I've dug back through the timeline and any available logs to clarify (as well as I can) what happened around that time. This is hopefully an accurate representation of events:

In answer to your questions:
| Comment by Kelsey Schubert [ 22/Apr/16 ] |

Hi gregmurphy, Thank you for answering my questions; we are continuing to investigate this issue. My understanding is that on March 18th the cluster contained three MongoDB 3.0 nodes and one hidden MongoDB 3.2 node, which was performing an initial sync. The affected documents were inserted that day into a MongoDB 3.0 primary and may have been updated during the initial sync process of the hidden MongoDB 3.2 node. Is this correct?

Thank you for your help,
| Comment by Greg Murphy [ 20/Apr/16 ] |

Should have added - I kept the output of some finds against the collections that had the missing elements (note the docs with { "_id" : "2016-03-18" }).

Primary:

Secondary 1:

Secondary 2:
| Comment by Greg Murphy [ 20/Apr/16 ] |

Hi Thomas

Yes, this is the same replica set that

Regarding your questions, these answers are in relation to the multiple documents with the same _id:

1. Primary and secondaries have identical output:

2. Unfortunately we no longer have the logs, as we only retain them for 30 days and the problem data was generated on March 18th. I did check the logs at the time, though, and didn't notice any errors.

3. Inserts and updates, and the documents are all very small (longest ~200 bytes); once they've been inserted, any updates should just change numeric values, not add more elements.

4. Validate output:
Regarding the collections with missing elements on the secondaries (which I have already repaired), I only kept notes about one of the affected collections at the time. From memory, 10-15 exhibited this problem. For the one I kept notes about:

getIndexes (same output from primary and secondaries):

Collection validation:

Interestingly, this collection also suffers from the duplicate _id issue, and in turn this duplicate record was the specific record that had the missing elements on the secondaries.
| Comment by Kelsey Schubert [ 19/Apr/16 ] |

Hi gregmurphy, Thank you for reporting this issue. To get a better understanding of what is going on here, please answer the following questions:

Kind regards,
| Comment by Greg Murphy [ 19/Apr/16 ] |

I'm in the process of scanning all of our collections for instances like this and have so far found 16. I'll update with the complete figures once the scan is finished.
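The kind of scan described above (walking every collection and flagging any _id that appears more than once) can be sketched in plain Python; the collection names and sample documents below are illustrative, not taken from the ticket:

```python
from collections import Counter

# Sketch of a duplicate-_id scan: for each collection, count how often
# each _id appears and report any value seen more than once.

collections = {
    "daily_stats": [          # illustrative collection names and data
        {"_id": "2016-03-17"},
        {"_id": "2016-03-18"},
        {"_id": "2016-03-18"},  # duplicate
    ],
    "events": [
        {"_id": "a"},
        {"_id": "b"},
    ],
}

def find_duplicate_ids(docs):
    """Return the sorted list of _id values that occur more than once."""
    counts = Counter(doc["_id"] for doc in docs)
    return sorted(_id for _id, n in counts.items() if n > 1)

# Report only the collections that actually contain duplicates.
report = {name: find_duplicate_ids(docs)
          for name, docs in collections.items()
          if find_duplicate_ids(docs)}
print(report)  # {'daily_stats': ['2016-03-18']}
```

Against a live deployment, the same check would typically be done server-side with an aggregation that groups on `_id`, sums a counter, and matches groups whose count exceeds 1, rather than by pulling every document to the client.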
| Comment by Greg Murphy [ 19/Apr/16 ] |

As a side note - we also saw some other data corruption on the same day, with elements missing from documents on secondaries. I have already repaired this data by running updates against the primary, which replicated successfully to the secondaries, but it's interesting that this happened to data from exactly the same day as this issue.