[SERVER-8636] Broken indexes Created: 20/Feb/13 Updated: 10/Dec/14 Resolved: 30/Jul/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 2.2.3 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Blocker - P1 |
| Reporter: | Aristarkh Zagorodnikov | Assignee: | Joanna Cheng |
| Resolution: | Incomplete | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
| Operating System: | ALL |
| Participants: |
| Description |
|
We are having some problems with sharding; while diagnosing them I stumbled upon this:
How is this possible? |
| Comments |
| Comment by Joanna Cheng [ 30/Jul/13 ] | |||||||
|
Hi Aristarkh,
I reviewed your logs and have some comments:
1. I can see the problems you first hit at ~23:35, which causes the secondaries to crash. Also, as the replica set members come back up, we try to catch up but hit the same problematic transaction again.
2. I don't see the write actually going through in your logs. I wouldn't expect the write to go through on the secondaries while the unique index exists.
3. I can see the index being rebuilt around 01:40 on d3 (which is the primary at this stage?). Not sure what this was for. I see more errors (for the same duplicate key) as late as 08:44 on member d2.
4. Since there are only logs from the incident (but not earlier), I can't ascertain how the index inconsistency among replica set members happened in the first place.
As you have moved past this issue and probably no longer have any of the data from the incident, there is not much else we can advise without resorting to pure guesswork. I will resolve this ticket; however, if you have any further information related to this issue, please feel free to re-open with additional details.
Thanks and kind regards, | |||||||
| Comment by Aristarkh Zagorodnikov [ 06/May/13 ] | |||||||
|
Hi! | |||||||
| Comment by Ian Daniel [ 06/May/13 ] | |||||||
|
Hi Aristarkh, It has been a couple of months since you raised this issue. What is the status of the problem now? Have you been able to resolve the problem, or are you still looking for assistance? Kind regards, | |||||||
| Comment by Aristarkh Zagorodnikov [ 21/Feb/13 ] | |||||||
|
The offending insert happened at about Feb 20 23:35 log time. | |||||||
| Comment by Aristarkh Zagorodnikov [ 21/Feb/13 ] | |||||||
|
Attached logs. d3 is usually the primary (it has priority 10); d1 and d2 are secondaries. d3 had a non-unique index, while d1 and d2 had a unique one. The insert went into d3. At first d1 and d2 refused to replicate the change (and that's fine), but after a series of restarts (puppet periodically ensures the mongods are up and running) the insert got through without any user intervention: no one altered any data on d1 and d2, and since the replica set was degraded with only one member up, d3 was also a secondary, so it wasn't accepting any writes. | |||||||
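For anyone checking for the same kind of mismatch, here is a minimal sketch of comparing index definitions across members. It assumes the index documents of each member have already been collected (for example with `db.collection.getIndexes()` run against each mongod directly) and placed in a plain dict. The function names and the sample data are hypothetical; only the member names d1, d2, d3 and the ( files_id , n ) key come from this ticket.

```python
def index_signature(spec):
    """Reduce one index document to (key pattern, is_unique)."""
    key = tuple(spec["key"].items())
    return key, bool(spec.get("unique", False))


def find_uniqueness_mismatches(member_indexes):
    """Return {key_pattern: {member: is_unique or None}} for every index
    whose uniqueness differs between members. None marks a member that
    does not have the index at all."""
    seen = {}
    for member, specs in member_indexes.items():
        for spec in specs:
            key, unique = index_signature(spec)
            seen.setdefault(key, {})[member] = unique

    mismatches = {}
    for key, per_member in seen.items():
        missing = set(member_indexes) - set(per_member)
        if missing or len(set(per_member.values())) > 1:
            report = dict(per_member)
            for member in missing:
                report[member] = None
            mismatches[key] = report
    return mismatches


# Hypothetical sample data shaped like getIndexes() output.
members = {
    "d1": [{"name": "_id_", "key": {"_id": 1}},
           {"name": "files_id_1_n_1", "key": {"files_id": 1, "n": 1}, "unique": True}],
    "d2": [{"name": "_id_", "key": {"_id": 1}},
           {"name": "files_id_1_n_1", "key": {"files_id": 1, "n": 1}, "unique": True}],
    "d3": [{"name": "_id_", "key": {"_id": 1}},
           {"name": "files_id_1_n_1", "key": {"files_id": 1, "n": 1}}],
}

result = find_uniqueness_mismatches(members)
for key, per_member in result.items():
    print(key, per_member)
```

With the sample data above, only the ( files_id , n ) index is reported, because d3 lacks the unique flag that d1 and d2 have.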
| Comment by Aristarkh Zagorodnikov [ 21/Feb/13 ] | |||||||
|
Will attach logs later today. | |||||||
| Comment by Aristarkh Zagorodnikov [ 21/Feb/13 ] | |||||||
|
Yes, the index situation is as described. It has been the primary for at least a month (it might have been demoted for a while for a reboot). | |||||||
| Comment by Eliot Horowitz (Inactive) [ 21/Feb/13 ] | |||||||
|
So to be clear, the current primary has an index on (files_id, n) but it is non-unique? | |||||||
| Comment by Aristarkh Zagorodnikov [ 20/Feb/13 ] | |||||||
|
The only "legal" way it could happen (as I see it; I don't know enough about replication internals) is that batch application ignores index uniqueness when inserting successive batched data that violates the uniqueness constraint. Otherwise, it looks like corruption to me (and since it happened on four different RAID arrays, it's a MongoDB error, not a hardware one). | |||||||
| Comment by Aristarkh Zagorodnikov [ 20/Feb/13 ] | |||||||
|
P.S. We traced this error to a non-unique index on the primary, while the secondaries have the unique index. How are such inserts possible? | |||||||
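To make the expected behaviour concrete, here is a toy model in plain Python. It is not MongoDB code and makes no claim about replication internals. It only assumes that a member with a non-unique index accepts a duplicate ( files_id , n ) pair while a member with a unique index rejects it. The `Member` class, `DuplicateKeyError` and `replicate` are hypothetical names made up for this sketch.

```python
class DuplicateKeyError(Exception):
    """Raised by the toy model when a unique index would be violated."""


class Member:
    def __init__(self, name, unique):
        self.name = name
        self.unique = unique  # whether the (files_id, n) index is unique here
        self.docs = []

    def insert(self, doc):
        key = (doc["files_id"], doc["n"])
        if self.unique and any((d["files_id"], d["n"]) == key for d in self.docs):
            raise DuplicateKeyError(f"{self.name}: duplicate key {key}")
        self.docs.append(doc)


def replicate(primary, secondaries, doc):
    """Insert on the primary, then apply to each secondary.
    Returns the names of secondaries that rejected the document."""
    primary.insert(doc)
    rejected = []
    for secondary in secondaries:
        try:
            secondary.insert(doc)
        except DuplicateKeyError:
            rejected.append(secondary.name)
    return rejected


d3 = Member("d3", unique=False)  # primary with the non-unique index
d1 = Member("d1", unique=True)
d2 = Member("d2", unique=True)

first = replicate(d3, [d1, d2], {"files_id": 1, "n": 0})
second = replicate(d3, [d1, d2], {"files_id": 1, "n": 0})
print(first, second)
```

In this model the second insert succeeds on d3 and is rejected by d1 and d2, which matches the initial refusal to replicate. Nothing in the model lets the duplicate reach d1 or d2 later while their index stays unique.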
| Comment by Aristarkh Zagorodnikov [ 20/Feb/13 ] | |||||||
|
We were getting this after a series of errors like this one:
These errors eventually went away after a series of secondary restarts (I cannot even begin to understand how this error could go away at all). Different machine, different replica set:
|