Mongo Primary misses certain writes to the Oplog

XMLWordPrintableJSON

    • Type: Bug
    • Resolution: Done
    • Priority: Major - P3
    • None
    • Affects Version/s: 2.0.8
    • Component/s: Replication, Stability
    • None
    • Environment:
      Solaris 10u9/10u10 using ZFS
      Oracle Corporation SUN FIRE X4270 M2 SERVER
      96GB RAM
    • Solaris
    • Hide

      None found

      Show
      None found
    • None
    • 3
    • None
    • None
    • None
    • None
    • None
    • None

      This may be related to SERVER-8313.

      I have removed some information due to NDA restrictions. If neccesary I will include that information.

      On a client's site where we are running 20 replica sets in a sharded cluster, with the unique shard key "ownerId, profileId, instanceId", we find that on occasion we get the following error on a rs.status():

      "errmsg" : "syncTail: 11000 E11000 duplicate key error index: <removed> dup key: { : <removed> }, syncing:

      { ts: Timestamp 1370449518000|2, h: 7055153591203270570, op: \"i\", ns: \"<db>"}

      This appears on both the secondaries, and at that point they stop replicating from the primary. The only way we have found to resolve this is to stop the secondaries and re-sync them from the primary, which is time-consuming and doesn't really give the customer confidence in our solution.

      Upon closer investigation, we found the following in the oplog on the primary. The parts mark in <xxx> all match each other. The entries are sorted by _id to ensure that the same entries go together.

      51ad5a4deefa529ba5336e7a
      ------------------
      { "ts" :

      { "t" : 1370315341000 , "i" : 2 }

      , "h" : -7125820950950980008, "op" : "i", "ns" : "<db>.default", "o" : { "_id" :

      { "$oid" : "51ad5a4deefa529ba5336e7a" }

      , "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370315341, "startDateTime" : 1370315341, "expireDateTime" : 1370401741, "removeDateTime" : 1370401741, "cancelled" : false, "mode" : "onceOff", "properties" :

      { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" }

      } }
      { "ts" :

      { "t" : 1370347540000 , "i" : 3 }

      , "h" : -2674495216582342342, "op" : "d", "ns" : "<db>.default", "b" : true, "o" : { "_id" :

      { "$oid" : "51ad5a4deefa529ba5336e7a" }

      } }

      51add814fc83b2a3bc30b5e8
      ------------------
      { "ts" :

      { "t" : 1370347540000 , "i" : 4 }

      , "h" : -8427812199229652004, "op" : "i", "ns" : "<db>.default", "o" : { "_id" :

      { "$oid" : "51add814fc83b2a3bc30b5e8" }

      , "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370347540, "startDateTime" : 1370347540, "expireDateTime" : 1370433940, "removeDateTime" : 1370433940, "cancelled" : false, "mode" : "onceOff", "properties" :

      { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" }

      } }
      { "ts" :

      { "t" : 1370433968000 , "i" : 6 }

      , "h" : 3274260589582880478, "op" : "d", "ns" : "<db>.default", "b" : true, "o" : { "_id" :

      { "$oid" : "51add814fc83b2a3bc30b5e8" }

      } }

      51af520f4000ab14865d3283
      ------------------
      { "ts" :

      { "t" : 1370444303000 , "i" : 177 }

      , "h" : 2830599934484141765, "op" : "i", "ns" : "<db>.default", "o" : { "_id" :

      { "$oid" : "51af520f4000ab14865d3283" }

      , "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370444303, "startDateTime" : 1370444303, "expireDateTime" : 1370530703, "removeDateTime" : 1370530703, "cancelled" : false, "mode" : "onceOff", "properties" :

      { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" }

      } }

      51af666ed18b4aa6c892bbc9
      ------------------
      { "ts" :

      { "t" : 1370449518000 , "i" : 2 }

      , "h" : 7055153591203270570, "op" : "i", "ns" : "<db>.default", "o" : { "_id" :

      { "$oid" : "51af666ed18b4aa6c892bbc9" }

      , "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370449518, "startDateTime" : 1370449518, "expireDateTime" : 1370535918, "removeDateTime" : 1370535918, "cancelled" : false, "mode" : "onceOff", "properties" :

      { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" }

      } }

      From this log it quickly became clear that the primary missed writing a delete entry into the oplog, causing a double-insert with the same key (ownerId, profileId, instanceId). This would obviously cause a conflict on the secondary, and require intervention. Unfortunately the issue only appears infrequently, making this very hard to duplicate or test.

      Is this something we can resolve by upgrading to a different version of MongoDB? We are currently looking at going to 2.2.4. However, I couldn't find a ticket in JIRA that seems to address this directly and is fixed in that version.

            Assignee:
            Matt Dannenberg (Inactive)
            Reporter:
            Braam van Heerden
            Votes:
            0 Vote for this issue
            Watchers:
            6 Start watching this issue

              Created:
              Updated:
              Resolved: