[SERVER-12295] Migrations occasionally aborted due to duplicate key error on unique shard key Created: 08/Jan/14  Updated: 15/Nov/21  Resolved: 10/Jun/14

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication, Stability, Storage
Affects Version/s: 2.2.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Dan Cooper Assignee: Thomas Rueckstiess
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

CentOS 6.4


Operating System: ALL
Participants:

 Description   

Hello, we've noticed recently that the moveChunk directory rapidly starts using all of its available disk space. Looking through the logs, we saw errors like: Sun Jan 5 09:43:14 [migrateThread] ERROR: migrate failed: E11000 duplicate key error index: blah.$ownerid_1_docname_1_osvcname_1 dup key: { : "11446719", : "app.json", : "blah" }. We recently upgraded to 2.2.6 from 2.2.4, though I am not sure if that is relevant.

When the disk fills, mongod crashes and fails over to the secondary. After clearing out the moveChunk dir, the partner (now the primary node) exhibits the same behavior the next day. If we remove the offending duplicate key, the issue stops.
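
For reference, this is roughly how we clear the offending key when it happens (the namespace in the error above is redacted, so "blah" below is just that placeholder; the field names come from the index name):

// values taken straight from the dup key in the error message
db.blah.find({ ownerid: "11446719", docname: "app.json", osvcname: "blah" })
db.blah.remove({ ownerid: "11446719", docname: "app.json", osvcname: "blah" })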



 Comments   
Comment by Ramon Fernandez Marina [ 10/Jun/14 ]

Hi al.gehrig@rockstarsandiego.com,

we haven't heard back from you for some time, so I'm going to mark this ticket as resolved. If you happen to run into this issue again feel free to re-open the ticket and provide additional information.

Regards,
Ramón.

Comment by Thomas Rueckstiess [ 01/May/14 ]

Okay, thanks for letting us know. We'll wait for you to provide additional details.

Regards,
Thomas

Comment by Al Gehrig [ 01/May/14 ]

Hi Thomas,

Yes - we're still getting these from time to time. The next time we see one, I'll post to this ticket.

-Al

Comment by Thomas Rueckstiess [ 01/May/14 ]

Hi Al,

I just wanted to follow up with you. Have you seen this issue appear again, or can we go ahead and resolve the ticket?

Regards,
Thomas

Comment by Asya Kamsky [ 14/Mar/14 ]

Checking indexes now can help, since they should be identical on each shard; if they are not, it could suggest why this problem keeps recurring. If they are identical, it rules out one possible source of the problem.

I will put the ticket into a waiting state. When this happens again, please let us know and attach the related information, and we will try to get to the bottom of it with you at that point.
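
For example (just a sketch; the namespace here is the one from the E11000 error you posted), connecting to each shard's primary mongod in turn:

use cloud_prod
// the shard-key index should have the same definition and report unique: true on every shard
db.members.yay.blah.getIndexes().forEach(function (ix) {
    print(ix.name + "  unique: " + (ix.unique === true));
});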

Comment by Al Gehrig [ 14/Mar/14 ]

We had to delete the document to prevent the storage from filling up. We will keep an eye out for this, and the next time it happens we'll capture both the "to" and the "from" sides as well as the getIndexes() output.

Comment by Asya Kamsky [ 14/Mar/14 ]

Could you query for that document directly on the mongod in the "from" shard?

It would also help to see db.collectionblah.getIndexes() on each shard's primary.
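
Concretely, something like this against the "from" shard's primary (key values taken from the error you posted):

db.members.yay.blah.find({ ownerid: "35290103", docname: "savedata.json", osvcname: "sc" })
db.members.yay.blah.getIndexes()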

Comment by Al Gehrig [ 14/Mar/14 ]

Sorry - yes, that was a typo. I've updated the comment.

Comment by Asya Kamsky [ 14/Mar/14 ]

If I'm reading this right, it appears the error is from collection "cloud_prod.members.yay.blah" and the query on mongod was run against collection "db.members.yay.iblah" (blah vs iblah) - was this a typo when cutting-and-pasting the diagnostics?

Comment by Al Gehrig [ 14/Mar/14 ]

We are still seeing this. We have seen incidents where there actually were duplicate keys that prevented chunk migration, but there have been other cases where there was no duplicate key and chunk migration still failed. For example:

Wed Mar  5 04:24:44 [repl writer worker 5] ERROR: writer worker caught exception: E11000 duplicate key error index: cloud_prod.members.yay.blah.$ownerid_1_docname_1_osvcname_1  dup key: { : "35290103", : "savedata.json", : "sc" }

So I went and searched for that key

Only one doc shows up when querying via the mongoD (this is the "to" mongoD):

rsgewrset12:PRIMARY> db.members.yay.blah.find({ ownerid: "35290103", docname: "savedata.json", osvcname: "sc" } )
{ "_id" : ObjectId("52d0e3879c551e923a0b0575"), "data" : { "version" : 1, "ownerID" ........

and only one doc shows up when querying via mongoS:

mongos>  db.members.yay.blah.find({ ownerid: "35290103", docname: "savedata.json", osvcname: "sc" } )
{ "_id" : ObjectId("52d0e3879c551e923a0b0575"), "data" : { "version" : 1, "ownerID" ........

We end up with aborted chunk migrations in this case as well.

Comment by Dan Cooper [ 13/Mar/14 ]

This is still an issue. A coworker will be adding details.

Comment by David Hows [ 10/Feb/14 ]

Hi Dan,

I'm marking this issue as closed, as we have not heard back from you.

If there is anything further we can do for you on this issue, please feel free to re-open it with details.

Regards,
David

Comment by Asya Kamsky [ 10/Jan/14 ]

David,

If I understand correctly, 10 is the "to" shard and 27 is the "from" shard, correct?

Would you run the same query and command on the primary of shard27 (from shard), which I believe is host rsgewrmng53?

use cloud_prod
db.members.yay.car.find({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })
db.members.getIndexes()

If somehow the index on the "from" shard is not unique and there are duplicate records there, it would explain why the duplicates keep showing up on the "to" shard: they are not already there before the migration (which would clean them out anyway); they are being introduced by the migration from a shard that may be missing a unique constraint on the index.
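
As a quick check (just a sketch, run directly on the primary of the "from" shard), an aggregation like this should surface any shard-key values that occur more than once there:

db.members.yay.car.aggregate([
    // group by the full shard key and keep only values that appear more than once
    { $group: { _id: { ownerid: "$ownerid", docname: "$docname", osvcname: "$osvcname" },
                count: { $sum: 1 } } },
    { $match: { count: { $gt: 1 } } }
])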

Comment by David Hows [ 09/Jan/14 ]

Thanks for those logs Dan,

There is nothing in those logs to show why the document might have come back before the migration was processed, or to explain why you received the duplicate key error during the migration.

Would it be possible for you to run the same test again, but this time at a higher logging level (say, log level 4)? I'd like to see if this sheds any further light on what the problem could be. If you are able to, could you delete the duplicate document, increase the logging on set27 and set10, and then trigger the migration once again? Following this, could you please upload the logs from both primaries of the two sets?

This should give us significantly more information to see what might be causing these issues.
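
For reference, a rough sketch of how that could look (adjust the namespace, document and shard names to match your deployment):

// on the primaries of set27 and set10: raise the server log verbosity
db.adminCommand({ setParameter: 1, logLevel: 4 })

// on a mongos: re-trigger the migration of the chunk containing the problem document
db.adminCommand({
    moveChunk: "cloud_prod.members.yay.car",
    find: { ownerid: "16758570", docname: "gamenp.json", osvcname: "sc" },
    to: "ewrset10"
})

// once the logs are captured, drop verbosity back to normal
db.adminCommand({ setParameter: 1, logLevel: 0 })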

Thanks for reporting this issue and your continuing patience,
David

Comment by Dan Cooper [ 09/Jan/14 ]

So there was a case where there was a duplicate key error moving a chunk from set27 to set10. Mng20 is the primary for that replica set.

Wed Jan  8 18:40:29 [conn579943] command admin.$cmd command: { moveChunk: "cloud_prod.members.yay.car", from: "ewrset27/ewrmng53:27017,ewrmng54:27017", to: "ewrset10/ewrmng19:27017,ewrmng20:27017", fromShard: "ewrset27", toShard: "ewrset10", min: { ownerid: "16723769", docname: "gamenp.json", osvcname: "sc" }, max: { ownerid: "16792097", docname: "gamenp.json", osvcname: "sc" }, maxChunkSizeBytes: 67108864, shardId: "cloud_prod.members.yay.car-ownerid_"16723769"docname_"gamenp.json"osvcname_"sc"", configdb: "ewrmngcfg1:30001,ewrmngcfg2:30001,ewrmngcfg3:30001", secondaryThrottle: false } ntoreturn:1 keyUpdates:0 locks(micros) W:914 r:31419 reslen:693 3504ms
 

I went and deleted the key from the primary mongoD.

ewrset10:PRIMARY> db.members.yay.car.find({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })
{ "_id" : ObjectId("5238764b125a4204301d6944"), "data" : { "version" : 1, "ownerID" : "raaan90", "sp0_order" : { }, "multiplayer0" : { "vehicle0" : { "45" : "5", "46" : "203402306", "47" : "67506433",
"appdata" : { "65" : "1.000000", "66" : "1.000000", "appUID" : "-1199345894" } }, "docname" : "gamenp.json", "lastmodified" : { "DateTime" : ISODate("2013-12-03T14:22:16.129Z"), "Ticks" : NumberLong("635216773361295841") }, "osvcname" : "sc", "ownerid" : "16758570" }
ewrset10:PRIMARY> db.members.yay.car.remove({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })
ewrset10:PRIMARY> db.members.yay.car.find({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })
 

and so it should be gone (no duplicate)… or so I thought

Then chunk migration resumed, and mongod then complained about the key being back as a duplicate:

Wed Jan  8 18:40:31 [conn579955] moveChunk request accepted at version 1142|3||5217f1a621ce2f768a688386
Wed Jan  8 18:40:31 [conn579955] moveChunk number of documents: 6552
Wed Jan  8 18:40:31 [initandlisten] connection accepted from 10.90.52.180:36457 #1141511 (4415 connections now open)
Wed Jan  8 18:40:32 [conn1141511] end connection 10.90.52.180:36457 (4414 connections now open)
Wed Jan  8 18:40:32 [conn579955] moveChunk data transfer progress: { active: true, ns: "cloud_prod.members.yay.car", from: "ewrset27/ewrmng53:27017,ewrmng54:27017", min: { ownerid: "16723769", docname: "gamenp.json", osvcname: "sc" }, max: { ownerid: "16792097", docname: "gamenp.json", osvcname: "sc" }, shardKeyPattern: { ownerid: 1, docname: 1, osvcname: 1 }, state: "clone", counts: { cloned: 1042, clonedBytes: 2055855, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed Jan  8 18:40:33 [conn579955] moveChunk data transfer progress: { active: false, ns: "cloud_prod.members.yay.car", from: "ewrset27/ewrmng53:27017,ewrmng54:27017", min: { ownerid: "16723769", docname: "gamenp.json", osvcname: "sc" }, max: { ownerid: "16792097", docname: "gamenp.json", osvcname: "sc" }, shardKeyPattern: { ownerid: 1, docname: 1, osvcname: 1 }, state: "fail", errmsg: "E11000 duplicate key error index: cloud_prod.members.yay.car.$ownerid_1_docname_1_osvcname_1  dup key: { : "16758570", : "gamenp.json", : "sc" }", counts: { cloned: 1042, clonedBytes: 2055855, catchup: 0, steady: 0 }, ok: 1.0 } my mem used: 0
Wed Jan  8 18:40:33 [conn579955] warning: moveChunk error transferring data caused migration abort: { active: false, ns: "cloud_prod.members.yay.car", from: "ewrset27/ewrmng53:27017,ewrmng54:27017", min: { ownerid: "16723769", docname: "gamenp.json", osvcname: "sc" }, max: { ownerid: "16792097", docname: "gamenp.json", osvcname: "sc" }, shardKeyPattern: { ownerid: 1, docname: 1, osvcname: 1 }, state: "fail", errmsg: "E11000 duplicate key error index: cloud_prod.members.yay.car.$ownerid_1_docname_1_osvcname_1  dup key: { : "16758570", : "gamenp.json", : "sc" }", counts: { cloned: 1042, clonedBytes: 2055855, catchup: 0, steady: 0 }, ok: 1.0 }
  

And sure enough it’s back on set10:

ewrset10:PRIMARY> db.members.yay.car.find({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })
{ "_id" : ObjectId("5238764b125a4204301d6944"), "data" : { "version" : 1, "ownerID" : "raaan90", "sp0_order" : { }, "multiplayer0" : { "vehicle0" : { "45" : "5", "46" : "203402306", "47" : "67506433", "48" : "133505289", "49" : "-2122317693", "10" : "3", "11" : "0", "12" : "0", "13" : "0", "14" : "0", "50" : "2118394178", "15" : "4", "51" : "116", "16" : "2108042618", "52" : "1.000000", "17" : "1766676233", "18" : "-1390777827", "53" : "car", "54" : "None", "19" : "-1751761149", "20" : "1732399718", "25" : "0", "26" : "255", "27" : "255", "28" : "255", "29" : "0", "69" : "7234", "1073741823"
}, "mp0_order" : { }, "sp1_order" : { }, "appdata" : { "65" : "1.000000", "66" : "1.000000", "appUID" : "-1199345894" } }, "docname" : "gamenp.json", "lastmodified" : { "DateTime" : ISODate("2013-12-03T14:22:16.129Z"), "Ticks" : NumberLong("635216773361295841") }, "osvcname" : "sc", "ownerid" : "16758570" }
  

Unless I delete it through mongoS, there doesn’t appear to be a way to get rid of the duplicate issue (and, of course, deleting it through mongoS means actually losing data).
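
To spell out the two delete paths we've tried (same key as above):

// deleting directly on the "to" shard's primary only removes the local copy;
// the next migration attempt clones the document from the "from" shard again,
// which is presumably why it keeps coming back
ewrset10:PRIMARY> db.members.yay.car.remove({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })

// deleting through mongos removes the copy on the shard that currently owns
// the chunk, so nothing is left to clone back, but the data is genuinely lost
mongos> db.members.yay.car.remove({ ownerid : "16758570", docname : "gamenp.json", osvcname : "sc" })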

Comment by David Hows [ 09/Jan/14 ]

Thanks Dan,

We will wait for the logs and review then.

Regards,
David

Comment by Dan Cooper [ 08/Jan/14 ]

We just tried to delete the key from the mongod, it tried to move the chunk and then complained about the duplicate again. I'll attach logs shortly.

Comment by Eliot Horowitz (Inactive) [ 08/Jan/14 ]

Can you send the full log where you got the duplicate key exception and the FROM side of that chunk migrate?

Do you know if the document in question is being modified during migration?

Comment by Dan Cooper [ 08/Jan/14 ]

Can you delete the document with the offending key on the TO side of the migrate? - Yes, but we'd like to know why we keep getting duplicates. It seems to happen randomly on any given shard.

How many shards are there? - 40 shards. One primary, one secondary, and an arbiter on each shard.

Also, is there a specific reason you haven't upgraded to 2.4 yet? - We're planning to soon. Last time we tried, we had an issue with mongostat not working properly, which we use heavily. You guys weren't able to replicate the problem.

Comment by Eliot Horowitz (Inactive) [ 08/Jan/14 ]

Can you delete the document with the offending key on the TO side of the migrate?
How many shards are there?
Also, is there a specific reason you haven't upgraded to 2.4 yet?

Comment by Dan Cooper [ 08/Jan/14 ]

The shard key is:

{ "ownerid" : 1, "docname" : 1, "osvcname" : 1 }

Indexes are:
[
        {
                "v" : 1,
                "key" : {
                        "_id" : 1
                },
                "ns" : "cloud_prod.members.gta5.car",
                "name" : "_id_"
        },
        {
                "v" : 1,
                "key" : {
                        "ownerid" : 1,
                        "docname" : 1,
                        "osvcname" : 1
                },
                "unique" : true,
                "ns" : "cloud_prod.members.gta5.car",
                "name" : "ownerid_1_docname_1_osvcname_1"
        }
]
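
For completeness, the collection was presumably sharded with the unique option (which is why the shard-key index above is unique); something along the lines of:

db.adminCommand({
    shardCollection: "cloud_prod.members.gta5.car",
    key: { ownerid: 1, docname: 1, osvcname: 1 },
    unique: true
})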

Comment by Eliot Horowitz (Inactive) [ 08/Jan/14 ]

Some questions.

  • What's the shard key?
  • Can you send the indexes on the collection? (db.<collection>.getIndexes())