[SERVER-9870] Rollback should not enforce unique indexes while applying ops Created: 06/Jun/13 Updated: 06/Dec/22 Resolved: 21/Dec/15 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.2.3, 2.4.4, 2.5.0 |
| Fix Version/s: | 3.0.0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Michael Henson | Assignee: | Backlog - Replication Team |
| Resolution: | Done | Votes: | 3 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
Ubuntu Precise 12.04 LTS |
||
| Attachments: |
|
||||||||||||||||
| Issue Links: |
|
||||||||||||||||
| Assigned Teams: |
Replication
|
||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||
| Operating System: | ALL | ||||||||||||||||
| Participants: | |||||||||||||||||
| Description |
|
We recently had an event where our primary server had to step down and after the new election occurred was put into a rollback state. During the rollback mongo encountered a unique index violation while attempting to write to the view_stats collection and promptly crashed.
I brought the machine back up out of the replicaset and was able to verify that there were 4 bad documents that lacked the host, view and pid fields. No other active hosts had these bad documents and it's technically highly unlikely that any of our code inserted them in the first place. I removed the bad documents and brought the member back into the replicaset, but it crashed again shortly thereafter with the same error. We were only able to resolve the issue by restoring the database from an EBS snapshot and letting it catch up again. I've attached a log from the primary to this ticket. |
| Comments |
| Comment by Anthony MOI [ 02/Aug/13 ] | ||
|
We had a similar behavior just after adding an index with { unique: true, dropDups: true }. The index got created, and just after that, all our secondaries started to crash with errors "E11000 duplicate key error index" and similar stacks. The only way to stop this from happening again was to remove the index and re-sync secondaries from scratch. All our instances are running MongoDB 2.4.3 | ||
| Comment by Scott Hernandez (Inactive) [ 11/Jun/13 ] | ||
|
Yes, this sounds like the exact situation which would cause this kind of issue. We will look through the oplog for some proof, but I think we can reproduce this in the future either way. Thanks for the oplog data. | ||
| Comment by Michael Henson [ 08/Jun/13 ] | ||
|
Here are the indexes that are currently on that collection: [ , , This collection acts as a staging ground to store performance information on various routes in our REST api. There are basically two sorts of operations that we perform on the documents in this collection. The first is to perform an upsert on a particular (view, host, pid) which will increment the count, sum_ms, and sum_square_ms fields. This action is triggered by a request being issued against our API for the given route (called view in this case). We don't actually perform an upsert for every request. Instead, we sample, let's say 10% of our requests. The second sort of operation occurs once per minute. We aggregate all of the documents for a given view and then delete them all one view at a time. | ||
| Comment by Scott Hernandez (Inactive) [ 07/Jun/13 ] | ||
|
Thanks Michael, Can you also provide your index definitions for that collection:
We believe the problem was a transitive index violation during the rollback because documents were being replaced by the current ones (as existed on the new/current primary) in an order different than happened in the original timeline. We are trying to track down where this might be possible by looking at the oplog as a hint since we can't really go back in time Do you think the application/data might have changed documents in an order where looking at two different points in time, in that small windows of time, there were two documents which had the same [view, host, pid] values (it seems like you have a unique index on the combination of those three fields)? | ||
| Comment by Michael Henson [ 07/Jun/13 ] | ||
|
Here are the contents of the oplog on that machine from after 2013-06-03T18:48:00 (the log indicated this was the date to sync back to). | ||
| Comment by Scott Hernandez (Inactive) [ 06/Jun/13 ] | ||
|
Would it be possible for you to upload the oplog from that "rollback" host? Really we just need the portion being used to do the rollback. |