[SERVER-12295] Migrations occasionally aborted due to duplicate key error on unique shard key Created: 08/Jan/14 Updated: 15/Nov/21 Resolved: 10/Jun/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Index Maintenance, Replication, Stability, Storage |
| Affects Version/s: | 2.2.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Dan Cooper | Assignee: | Thomas Rueckstiess |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
CentOS 6.4 |
||
| Operating System: | ALL |
| Participants: |
| Description |
|
Hello, we've noticed recently that the moveChunk directory will rapidly start using all of its available disk space. Looking through the logs, we saw errors like: Sun Jan 5 09:43:14 [migrateThread] ERROR: migrate failed: E11000 duplicate key error index: blah.$ownerid_1_docname_1_osvcname_1 dup key: { : "11446719", : "app.json", : "blah" }. We recently upgraded to 2.2.6 from 2.2.4, though I am not sure if that is relevant. When the disk fills, mongod crashes and fails over to the secondary. After clearing out the moveChunk dir, the partner (now the primary node) exhibits the same behavior the next day. If we remove the offending duplicate key, the issue stops. |
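The aborted migrations themselves can be confirmed from the cluster's changelog. A minimal mongo shell sketch, run against a mongos (the field layout of the 2.2.x changelog entries is an assumption):

```
// config.changelog keeps one entry per migration step; "moveChunk.from" /
// "moveChunk.to" entries describe migrations, and aborted ones generally
// carry an error note in "details".
var cfg = db.getSiblingDB("config");
cfg.changelog.find({ what: /^moveChunk/ })
             .sort({ time: -1 })
             .limit(20)
             .forEach(printjson);
```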
| Comments |
| Comment by Ramon Fernandez Marina [ 10/Jun/14 ] | ||||||||||||||||||||||
|
Hi al.gehrig@rockstarsandiego.com, we haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If you happen to run into this issue again feel free to re-open the ticket and provide additional information. Regards, | ||||||||||||||||||||||
| Comment by Thomas Rueckstiess [ 01/May/14 ] | ||||||||||||||||||||||
|
Okay, thanks for letting us know. We'll wait for you to provide additional details. Regards, | ||||||||||||||||||||||
| Comment by Al Gehrig [ 01/May/14 ] | ||||||||||||||||||||||
|
Hi Thomas, Yes - we're still getting these from time-to-time. The next time we see one, I'll post to this ticket. -Al | ||||||||||||||||||||||
| Comment by Thomas Rueckstiess [ 01/May/14 ] | ||||||||||||||||||||||
|
Hi Al, I just wanted to follow up with you. Have you seen this issue appear again, or can we go ahead and resolve the ticket? Regards, | ||||||||||||||||||||||
| Comment by Asya Kamsky [ 14/Mar/14 ] | ||||||||||||||||||||||
|
Checking indexes now can help, since they should be identical on each shard; if they are not, that could suggest why this problem keeps recurring. If they are identical, it rules out one possible source of the problem. I will put the ticket into the waiting state. When this happens again, please let us know, attach the related information, and we will try to get to the bottom of it with you at that point. | ||||||||||||||||||||||
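A minimal sketch of that check in the mongo shell, connecting to each shard's primary in turn (the host names and port are placeholders, and the namespace is an assumption based on the collection quoted elsewhere in this ticket):

```
// Dump the index definitions directly from each shard's primary (not through
// mongos); the specs -- including any "unique" flag on the shard-key index --
// should be identical on every shard.
["rsgewrmng20:27018", "rsgewrmng53:27018"].forEach(function (host) {
    var coll = new Mongo(host).getDB("cloud_prod").getCollection("members.yay.blah");
    print("indexes on " + host + ":");
    coll.getIndexes().forEach(printjson);
});
```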
| Comment by Al Gehrig [ 14/Mar/14 ] | ||||||||||||||||||||||
|
We had to delete the document to prevent the storage from filling up. We will keep an eye out for this, and the next time it happens we'll capture the "to" and "from" sides as well as the getIndexes() output. | ||||||||||||||||||||||
| Comment by Asya Kamsky [ 14/Mar/14 ] | ||||||||||||||||||||||
|
Could you query for that document directly on the mongod in the "from" shard? It would also help to see db.collectionblah.getIndexes() on each shard's primary. | ||||||||||||||||||||||
| Comment by Al Gehrig [ 14/Mar/14 ] | ||||||||||||||||||||||
|
Sorry - yes, that was a typo. I've updated the comment. | ||||||||||||||||||||||
| Comment by Asya Kamsky [ 14/Mar/14 ] | ||||||||||||||||||||||
|
If I'm reading this right, it appears the error is from collection "cloud_prod.members.yay.blah" and the query on mongod was run against collection "db.members.yay.iblah" (blah vs iblah) - was this a typo when cutting-and-pasting the diagnostics? | ||||||||||||||||||||||
| Comment by Al Gehrig [ 14/Mar/14 ] | ||||||||||||||||||||||
|
We are still seeing this. We have seen incidents where there were actually duplicate keys which prevented chunk migration. But there have been other cases where there was not a duplicate key and chunk migration still failed. For example:
So I went and searched for that key. Only one doc shows up when querying via the mongoD (this is the "to" mongoD):
and only one doc shows up when querying via mongoS:
We end up with aborted chunk migrations in this case as well. | ||||||||||||||||||||||
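A minimal sketch of that comparison in the mongo shell, counting the document through mongos and directly on each shard primary (host names, namespace, and key values are assumptions taken from the error messages quoted in this ticket):

```
// If the per-shard total exceeds the routed mongos count, an extra (orphaned
// or stale) copy is sitting on one of the shards.
var key = { ownerid: "11446719", docname: "app.json", osvcname: "blah" };

// Routed query through mongos:
print("via mongos: " +
      db.getSiblingDB("cloud_prod").getCollection("members.yay.blah").count(key));

// Unrouted queries directly against each shard primary:
["shard10-primary:27018", "shard27-primary:27018"].forEach(function (host) {
    var coll = new Mongo(host).getDB("cloud_prod").getCollection("members.yay.blah");
    print(host + ": " + coll.count(key));
});
```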
| Comment by Dan Cooper [ 13/Mar/14 ] | ||||||||||||||||||||||
|
This is still an issue. A coworker will be adding details. | ||||||||||||||||||||||
| Comment by David Hows [ 10/Feb/14 ] | ||||||||||||||||||||||
|
Hi Dan, I'm marking this issue as closed as we have not heard back from you. If there is anything further we can do for you on this issue, please feel free to re-open it with details. Regards, | ||||||||||||||||||||||
| Comment by Asya Kamsky [ 10/Jan/14 ] | ||||||||||||||||||||||
|
David, if I understand correctly, 10 is the "to" shard and 27 is the "from" shard, correct? Would you run the same query and command on the primary of shard27 (the "from" shard), which I believe is host rsgewrmng53?
If somehow the index on the "from" shard is not unique and there are duplicate records there, it would explain why the duplicates keep showing up on the "to" shard: they are not already there before the migration (which would clean them out anyway); they are being introduced by migration from the shard that may be missing a unique constraint on the index. | ||||||||||||||||||||||
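A minimal sketch of how that hypothesis could be checked directly on the "from" shard's primary (namespace and field names are assumptions based on the shard key quoted in this ticket):

```
// Run in a shell connected directly to the "from" shard's primary.
var coll = db.getSiblingDB("cloud_prod").getCollection("members.yay.blah");

// Check whether the shard-key index carries "unique" : true.
printjson(coll.getIndexes());

// If the unique constraint is missing, duplicate shard-key values can exist;
// group by the shard key and report any value that appears more than once.
coll.aggregate([
    { $group: { _id: { ownerid: "$ownerid", docname: "$docname", osvcname: "$osvcname" },
                n: { $sum: 1 } } },
    { $match: { n: { $gt: 1 } } }
]);
```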
| Comment by David Hows [ 09/Jan/14 ] | ||||||||||||||||||||||
|
Thanks for those logs, Dan. There is nothing in them to show why the document might have come back prior to the migration, or to explain why you received the duplicate key error during the migration. Would it be possible for you to run the same test again, but this time at a higher logging level (say, log level 4)? I'd like to see if this sheds any further light on what could be the problem. If you are able to do so, could you delete the duplicate document, increase the logging on set27 and set10, and then trigger the migration once again? Following this, could you please upload the logs from both primaries of the two sets? This should give us significantly more information about what might be causing these issues. Thanks for reporting this issue and for your continuing patience, | ||||||||||||||||||||||
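A minimal mongo shell sketch of those two steps (the namespace, shard-key values, and shard name are assumptions based on the rest of this ticket):

```
// On each replica set primary (set27 and set10), raise the server log level
// before re-running the test, and lower it again afterwards:
db.adminCommand({ setParameter: 1, logLevel: 4 });
// ... reproduce the migration, then:
db.adminCommand({ setParameter: 1, logLevel: 0 });

// The migration of the affected chunk can be triggered by hand from a mongos:
db.adminCommand({
    moveChunk: "cloud_prod.members.yay.blah",
    find: { ownerid: "11446719", docname: "app.json", osvcname: "blah" },
    to: "set10"
});
```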
| Comment by Dan Cooper [ 09/Jan/14 ] | ||||||||||||||||||||||
|
So there was a case where there was a duplicate key error moving a chunk from set27 to set10. Mng20 is the primary for that replica set.
I went and deleted the key from the primary mongoD.
and so it should be gone (no duplicate)… or so I thought. Then chunk migration resumed, and mongo complained about it being back as a duplicate key:
And sure enough it's back on set10:
Unless I delete it from mongoS, there doesn’t appear to be a way to get rid of the duplicate issue (and, of course, deleting it from mongoS means actually losing data). | ||||||||||||||||||||||
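For reference, the difference between the two delete paths, sketched in the mongo shell (host name, namespace, and key values are assumptions):

```
// Deleting through mongos routes the delete to the shard that owns the chunk,
// i.e. it removes the authoritative copy of the document:
db.getSiblingDB("cloud_prod").getCollection("members.yay.blah")
  .remove({ ownerid: "11446719", docname: "app.json", osvcname: "blah" });

// Deleting while connected directly to the "to" shard's primary only touches
// the local copy left behind by the failed migration, leaving the authoritative
// copy on the "from" shard intact:
var toShard = new Mongo("set10-primary:27018")
                  .getDB("cloud_prod").getCollection("members.yay.blah");
toShard.remove({ ownerid: "11446719", docname: "app.json", osvcname: "blah" });
```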
| Comment by David Hows [ 09/Jan/14 ] | ||||||||||||||||||||||
|
Thanks, Dan. We will wait for the logs and review them. Regards, | ||||||||||||||||||||||
| Comment by Dan Cooper [ 08/Jan/14 ] | ||||||||||||||||||||||
|
We just tried to delete the key from the mongod; it tried to move the chunk and then complained about the duplicate again. I'll attach logs shortly. | ||||||||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 08/Jan/14 ] | ||||||||||||||||||||||
|
Can you send the full log where you got the duplicate key exception, as well as the log from the FROM side of that chunk migration? Do you know if the document in question is being modified during migration? | ||||||||||||||||||||||
| Comment by Dan Cooper [ 08/Jan/14 ] | ||||||||||||||||||||||
|
Can you delete the document with the offending key on the TO side of the migrate? - Yes, but we'd like to know why we keep getting duplicates. It seems to happen randomly on any given shard. How many shards are there? - 40 shards, with one primary, one secondary, and an arbiter on each shard. Is there a specific reason you haven't upgraded to 2.4 yet? - We're planning to soon. Last time we tried, we had an issue with mongostat not working properly, which we use heavily. You guys weren't able to replicate the problem. | ||||||||||||||||||||||
| Comment by Eliot Horowitz (Inactive) [ 08/Jan/14 ] | ||||||||||||||||||||||
|
Can you delete the document with the offending key on the TO side of the migrate? | ||||||||||||||||||||||
| Comment by Dan Cooper [ 08/Jan/14 ] | ||||||||||||||||||||||
|
Shard key is "key" : { "ownerid" : 1, "docname" : 1, "osvcname" : 1 }. Indexes are: | ||||||||||||||||||||||
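A minimal way to confirm the shard key and its unique flag from the cluster metadata, run from a mongos (the namespace is an assumption based on the error messages in this ticket):

```
// The config database records the shard key and whether it is enforced as
// unique for each sharded collection.
db.getSiblingDB("config").collections.findOne(
    { _id: "cloud_prod.members.yay.blah" },
    { key: 1, unique: 1 }
);
```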
| Comment by Eliot Horowitz (Inactive) [ 08/Jan/14 ] | ||||||||||||||||||||||
|
Some questions: what is the shard key, and what indexes exist on the collection? How many shards are there? Is there a specific reason you haven't upgraded to 2.4 yet?
|