[SERVER-9965] Mongo Primary misses certain writes to the Oplog Created: 19/Jun/13  Updated: 10/Dec/14  Resolved: 02/Aug/13

Status: Closed
Project: Core Server
Component/s: Replication, Stability
Affects Version/s: 2.0.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Braam van Heerden Assignee: Matt Dannenberg
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Solaris 10u9/10u10 using ZFS
Oracle Corporation SUN FIRE X4270 M2 SERVER
96GB RAM


Operating System: Solaris
Steps To Reproduce:

None found

Participants:

Description

This may be related to SERVER-8313.

I have removed some information due to NDA restrictions. If necessary, I can provide that information.

At a client's site, where we run 20 replica sets in a sharded cluster with the unique shard key "ownerId, profileId, instanceId", we occasionally see the following error in the rs.status() output:

"errmsg" : "syncTail: 11000 E11000 duplicate key error index: <removed> dup key: { : <removed> }, syncing:

{ ts: Timestamp 1370449518000|2, h: 7055153591203270570, op: \"i\", ns: \"<db>"}

This appears on both secondaries, and at that point they stop replicating from the primary. The only way we have found to resolve this is to stop the secondaries and re-sync them from the primary, which is time-consuming and does little for the customer's confidence in our solution.
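A minimal sketch of why the replay conflicts, in mongo shell syntax. It assumes the unique shard key is backed by a unique compound index; "appdb" stands in for the redacted database name, and the index definition is illustrative, not the real one:

// The shard key "ownerId, profileId, instanceId" implies a unique compound
// index on the collection ("appdb" is a placeholder for the redacted <db>):
db.getSiblingDB("appdb").default.ensureIndex(
    { ownerId : 1, profileId : 1, instanceId : 1 },
    { unique : true }
);
// If the oplog carries two inserts for the same (ownerId, profileId,
// instanceId) triple with no delete between them, the secondary's replay of
// the second insert violates this index and syncTail aborts with E11000.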

Upon closer investigation, we found the following in the oplog on the primary. The parts marked as <xxx> all match each other. The entries are sorted by _id so that the entries for the same document appear together.
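For reference, a query along these lines (a sketch; "appdb" again stands in for the redacted database name) pulls the oplog history for a single document on the primary:

// Run against the primary. The oplog lives in the "local" database; natural
// order is timestamp order, so no explicit sort index is needed.
var oplog = db.getSiblingDB("local").oplog.rs;
oplog.find(
    { "ns" : "appdb.default", "o._id" : ObjectId("51ad5a4deefa529ba5336e7a") }
).sort({ "$natural" : 1 });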

51ad5a4deefa529ba5336e7a
------------------
{ "ts" : { "t" : 1370315341000, "i" : 2 }, "h" : -7125820950950980008, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51ad5a4deefa529ba5336e7a" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370315341, "startDateTime" : 1370315341, "expireDateTime" : 1370401741, "removeDateTime" : 1370401741, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }
{ "ts" : { "t" : 1370347540000, "i" : 3 }, "h" : -2674495216582342342, "op" : "d", "ns" : "<db>.default", "b" : true, "o" : { "_id" : { "$oid" : "51ad5a4deefa529ba5336e7a" } } }

51add814fc83b2a3bc30b5e8
------------------
{ "ts" : { "t" : 1370347540000, "i" : 4 }, "h" : -8427812199229652004, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51add814fc83b2a3bc30b5e8" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370347540, "startDateTime" : 1370347540, "expireDateTime" : 1370433940, "removeDateTime" : 1370433940, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }
{ "ts" : { "t" : 1370433968000, "i" : 6 }, "h" : 3274260589582880478, "op" : "d", "ns" : "<db>.default", "b" : true, "o" : { "_id" : { "$oid" : "51add814fc83b2a3bc30b5e8" } } }

51af520f4000ab14865d3283
------------------
{ "ts" : { "t" : 1370444303000, "i" : 177 }, "h" : 2830599934484141765, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51af520f4000ab14865d3283" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370444303, "startDateTime" : 1370444303, "expireDateTime" : 1370530703, "removeDateTime" : 1370530703, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }

51af666ed18b4aa6c892bbc9
------------------
{ "ts" : { "t" : 1370449518000, "i" : 2 }, "h" : 7055153591203270570, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51af666ed18b4aa6c892bbc9" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370449518, "startDateTime" : 1370449518, "expireDateTime" : 1370535918, "removeDateTime" : 1370535918, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }

From this log it quickly became clear that the primary missed writing a delete entry into the oplog: the third document (51af520f4000ab14865d3283) has an insert but no matching delete, and the fourth insert (51af666ed18b4aa6c892bbc9, the very entry named in the errmsg above) reuses the same unique key (ownerId, profileId, instanceId). Replayed on a secondary, this is a double-insert against the unique index, which causes the conflict and requires intervention. Unfortunately the issue appears only infrequently, making it very hard to duplicate or test.
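A quick check along these lines (again a sketch, with "appdb" standing in for the redacted database name) confirms the gap for the third document:

// Count insert vs. delete oplog entries for the document whose unique key
// is later reused:
var oplog = db.getSiblingDB("local").oplog.rs;
var id = ObjectId("51af520f4000ab14865d3283");
oplog.find({ "ns" : "appdb.default", "op" : "i", "o._id" : id }).count();  // 1 insert
oplog.find({ "ns" : "appdb.default", "op" : "d", "o._id" : id }).count();  // 0 deletes
// The document is gone from the primary's data, yet no "d" entry was ever
// logged -- the missed oplog write described above.
db.getSiblingDB("appdb").default.count({ "_id" : id });                    // 0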

Is this something we can resolve by upgrading to a different version of MongoDB? We are currently looking at moving to 2.2.4; however, I couldn't find a JIRA ticket that addresses this directly and is fixed in that version.



Comments
Comment by Matt Dannenberg [ 02/Aug/13 ]

I'm closing this ticket. Feel free to re-open it if the issue recurs on 2.2.5.

Comment by Braam van Heerden [ 27/Jun/13 ]

Thanks for the reply.

I will discuss a migration to 2.2.5 with our management and see what the timelines are. This issue occurred about 5-6 times in a two-week period and has not recurred since, so it seems to be very transient.

Comment by Matt Dannenberg [ 26/Jun/13 ]

I do not believe this is related to SERVER-8313 because SERVER-8313 refers to features that were not implemented in 2.0.X.

We can try further debugging by capturing all write operations with db.adminCommand({ setParameter: 1, logLevel: 1 }), but this may bloat your logs.
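If you go that route, verbosity can be turned back down afterwards with the same mechanism:

// Restore the default verbosity once enough has been captured:
db.adminCommand({ setParameter : 1, logLevel : 0 });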

2.2.5 contains significant improvements to oplog replication, both in stability and performance.
