[SERVER-38212] $out fails with duplicate _id key error Created: 20/Nov/18  Updated: 29/Jul/20  Resolved: 26/Nov/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.6.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Prashant Chaudhari Assignee: Danny Hatcher (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
Operating System: ALL
Steps To Reproduce:

I have a collection with about a million records. The aggregation query shown in the Description below returns about 300k results, which I am trying to dump into a collection with a randomly generated name.

This aggregation query fails with the following error:

Mongo::Error::OperationFailure (insert for $out failed: { lastOp: { ts: Timestamp(1542714271, 9378), t: 39 }, connectionId: 242453, err: "E11000 duplicate key error collection: api_smartquest_co_production.tmp.agg_out.637144 index: _id_ dup key: { : ObjectId('5bf2347a4b8a98775e4dbf95') }", code: 11000, codeName: "DuplicateKey", n: 0, ok: 1.0, operationTime: Timestamp(1542714271, 9378), $clusterTime: { clusterTime: Timestamp(1542714271, 9379), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } } (16996))


 Description   

db.sq_lesson_user_lessons.aggregate([
  {
    "$match": {
      lesson_id: {
        "$in": [ObjectId("5bb6ec0a178353bbdecdd94d"), ObjectId("5bbf1e611783538013ce2f0a"), ObjectId("5bc1871a98f172c52b8710a6"), ObjectId("5bc1ceef0789fa947c1da8b2")]
      },
      status: { "$in": ['featured','started','pending','completed'] }
    }
  },
  {
    "$project": {
      _id: 1,
      user_profile_id: 1,
      status: 1,
      lesson_id: 1
    }
  },
  {
    "$out": "analytics_company_5bb6039598f17297c964fc54_sq_user_lessons"
  }
])

 



 Comments   
Comment by Talles Airan [ 29/Jul/20 ]

This bug is happening in high-load environments.

I have a system that accesses user records; at the moment it generates an ObjectId, MongoDB has already generated the same one elsewhere. I see about 144 requests per second.

I had to implement my own id, combining a MongoDB ObjectId with the current timestamp and some random bytes I generate.
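
A minimal sketch, in legacy mongo shell JavaScript, of the kind of composite id described above; the makeCustomId helper and the users collection are hypothetical names for illustration, not from the original report:

function makeCustomId() {
  var oid = ObjectId().str;                                        // 24 hex chars from a regular ObjectId
  var ts  = Date.now().toString(16);                               // current time in milliseconds, as hex
  var rnd = Math.floor(Math.random() * 0x100000000).toString(16);  // extra random bytes
  return oid + ts + rnd;                                           // longer, collision-resistant string id
}

db.users.insertOne({ _id: makeCustomId() })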

Comment by Danny Hatcher (Inactive) [ 12/Dec/18 ]

Hello Prashant,

Yes, that is correct.

Thank you,

Danny

Comment by Prashant Chaudhari [ 10/Dec/18 ]

I couldn't really correlate my use case with the Read Isolation mentioned in the docs. Are you suggesting that while the $out operation is in progress, other write operations affecting the same collection may interleave and affect the result of the aggregation?

Comment by Danny Hatcher (Inactive) [ 26/Nov/18 ]

Hello Prashant,

I believe that you may be encountering a consequence of MongoDB's read isolation semantics. As the aggregation scans the large collection to return results, it is possible for some documents to be returned more than once. Because you are then attempting to insert those documents into a new collection using their original _id values, conflicts occur when the same document is inserted a second time.

I see that you have also posted this question on Stack Overflow. You mentioned in one of your comments there that this only happens sometimes. That helps support the above theory; only occasionally are writes causing your reads to "duplicate".

Would it fit your business case to insert those documents without their original _id fields, or to project that field to a different name? That way a new _id will be generated for each document and you shouldn't encounter this error.
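
As an illustration, here is a minimal sketch of that suggestion against the pipeline from this ticket; the original_id field name is just an example, not something from the report. Excluding _id in the $project stage makes $out insert each document with a freshly generated _id, so duplicated reads no longer collide on the _id_ index:

db.sq_lesson_user_lessons.aggregate([
  {
    "$match": {
      lesson_id: { "$in": [ObjectId("5bb6ec0a178353bbdecdd94d"), ObjectId("5bbf1e611783538013ce2f0a")] },
      status: { "$in": ['featured','started','pending','completed'] }
    }
  },
  {
    "$project": {
      _id: 0,                // drop the original _id so the insert generates a new one
      original_id: "$_id",   // optionally keep the old value under another name (illustrative)
      user_profile_id: 1,
      status: 1,
      lesson_id: 1
    }
  },
  { "$out": "analytics_company_5bb6039598f17297c964fc54_sq_user_lessons" }
])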

As a question such as this is better suited to Stack Overflow and you have already asked the question there, I will close this ticket.

Thank you,

Danny

Comment by Prashant Chaudhari [ 20/Nov/18 ]

Here is a mongo console log for the same query:

db.sq_lesson_user_lessons.aggregate([
  {
   "$match": {
     lesson_id: { 
       "$in": [ObjectId("5bb6ec0a178353bbdecdd94d"), ObjectId("5bbf1e611783538013ce2f0a")] 
     },
     status: { "$in": ['featured','started','pending','completed'] }
   }
  },
  {
   "$project": {
     _id: 1,
     user_profile_id: 1,
     status: 1,
     lesson_id: 1
   }
  },
  {
   "$out": "analytics_company_5bb6039598f17297c964fc54_sq_user_lessons"
  }
])
 
assert: command failed: {
    "operationTime" : Timestamp(1542715086, 67659),
    "ok" : 0,
    "errmsg" : "insert for $out failed: { lastOp: { ts: Timestamp(1542715086, 67657), t: 39 }, connectionId: 242551, err: \"E11000 duplicate key error collection: api_smartquest_co_production.tmp.agg_out.637145 index: _id_ dup key: { : ObjectId('5bf22e554b8a982ada5e2828') }\", code: 11000, codeName: \"DuplicateKey\", n: 0, ok: 1.0, operationTime: Timestamp(1542715086, 67657), $clusterTime: { clusterTime: Timestamp(1542715086, 67658), signature: { hash: BinData(0, 0000000000000000000000000000000000000000), keyId: 0 } } }",
    "code" : 16996,
    "codeName" : "Location16996",
    "$clusterTime" : {
        "clusterTime" : Timestamp(1542715086, 67659),
        "signature" : {
            "hash" : BinData(0,"wvZz15/714/PHqAWywLpZlP4azQ="),
            "keyId" : NumberLong("6606442824109916161")
        }
    }
} : aggregate failed
