[SERVER-9965] Mongo Primary misses certain writes to the Oplog Created: 19/Jun/13 Updated: 10/Dec/14 Resolved: 02/Aug/13 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication, Stability |
| Affects Version/s: | 2.0.8 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Braam van Heerden | Assignee: | Matt Dannenberg |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | Solaris 10u9/10u10 using ZFS |
| Operating System: | Solaris |
| Steps To Reproduce: | None found |
| Participants: |
| Description |
|
This may be related to another issue. I have removed some information due to NDA restrictions; if necessary I will include that information.

On a client's site we are running 20 replica sets in a sharded cluster with the unique shard key (ownerId, profileId, instanceId). On occasion we get the following error from rs.status():

    "errmsg" : "syncTail: 11000 E11000 duplicate key error index: <removed> dup key: { : <removed> }, syncing: { ts: Timestamp 1370449518000|2, h: 7055153591203270570, op: \"i\", ns: \"<db>"}

This appears on both secondaries, and at that point they stop replicating from the primary. The only way we have found to resolve this is to stop the secondaries and re-sync them from the primary, which is time-consuming and does not give the customer confidence in our solution.

Upon closer investigation, we found the following in the oplog on the primary. The parts marked <xxx> all match each other, and the entries are grouped by _id so that related operations appear together.

_id 51ad5a4deefa529ba5336e7a:
    { ..., "h" : -7125820950950980008, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51ad5a4deefa529ba5336e7a" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370315341, "startDateTime" : 1370315341, "expireDateTime" : 1370401741, "removeDateTime" : 1370401741, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }
    { ..., "h" : -2674495216582342342, "op" : "d", "ns" : "<db>.default", "b" : true, "o" : { "_id" : { "$oid" : "51ad5a4deefa529ba5336e7a" } } }

_id 51add814fc83b2a3bc30b5e8:
    { ..., "h" : -8427812199229652004, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51add814fc83b2a3bc30b5e8" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370347540, "startDateTime" : 1370347540, "expireDateTime" : 1370433940, "removeDateTime" : 1370433940, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }
    { ..., "h" : 3274260589582880478, "op" : "d", "ns" : "<db>.default", "b" : true, "o" : { "_id" : { "$oid" : "51add814fc83b2a3bc30b5e8" } } }

_id 51af520f4000ab14865d3283 (insert only, no matching delete):
    { ..., "h" : 2830599934484141765, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51af520f4000ab14865d3283" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370444303, "startDateTime" : 1370444303, "expireDateTime" : 1370530703, "removeDateTime" : 1370530703, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }

_id 51af666ed18b4aa6c892bbc9 (insert only, no matching delete):
    { ..., "h" : 7055153591203270570, "op" : "i", "ns" : "<db>.default", "o" : { "_id" : { "$oid" : "51af666ed18b4aa6c892bbc9" }, "ownerId" : "<number>", "profileId" : "<package>", "instanceId" : "<instance>", "state" : "active", "createDateTime" : 1370449518, "startDateTime" : 1370449518, "expireDateTime" : 1370535918, "removeDateTime" : 1370535918, "cancelled" : false, "mode" : "onceOff", "properties" : { "0" : "<remove>:remove:<instance>", "1" : "<remove>:delete:1055" } } }

From this log it quickly became clear that the primary missed writing a delete entry into the oplog, causing a double insert with the same unique key (ownerId, profileId, instanceId). This obviously causes a conflict on the secondary and requires intervention.
Unfortunately the issue only appears infrequently, making it very hard to duplicate or test. Is this something we can resolve by upgrading to a different version of MongoDB? We are currently looking at moving to 2.2.4, but I could not find a ticket in JIRA that addresses this directly and is fixed in that version. |
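A quick way to confirm this pattern on the primary is to query its oplog directly for every operation that touches a suspect _id and check whether the insert is ever followed by a delete. The sketch below is not part of the original report; it assumes the legacy mongo shell on a replica-set member and uses one of the _ids above purely as a placeholder.

    // Run on the PRIMARY; substitute the _id referenced by the duplicate key error.
    var oplog = db.getSiblingDB("local").oplog.rs;
    oplog.find({ "ns" : "<db>.default", "o._id" : ObjectId("51af520f4000ab14865d3283") })
         .sort({ $natural : 1 })
         .forEach(printjson);
    // An "i" entry with no later "d", for a document that has since been replaced by a
    // new document carrying the same (ownerId, profileId, instanceId), is the gap
    // described in this ticket: the secondary still holds the old document, so the
    // new insert fails there with E11000.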
| Comments |
| Comment by Matt Dannenberg [ 02/Aug/13 ] |
|
I'm closing this ticket. Feel free to re-open it if the issue recurs on 2.2.5. |
| Comment by Braam van Heerden [ 27/Jun/13 ] |
|
Thanks for the reply. I will discuss a migration to 2.2.5 with our management and see what the timeline for this is. The issue occurred about 5-6 times in a two-week period and has not appeared since, so it seems to be very transient. |
| Comment by Matt Dannenberg [ 26/Jun/13 ] |
|
I do not believe this is related to that issue. We can try further debugging by capturing all write operations (using db.adminCommand({ setParameter: 1, logLevel: 1 })), but this may bloat your logs. 2.2.5 contains significant improvements to oplog replication, both in stability and performance. |
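For completeness, a minimal sketch of the logging step suggested above, for the legacy mongo shell: raise the server log level so all operations are written to the log, then drop it back to the default afterwards so the log files do not keep growing.

    // Capture all operations in the server log (verbose; watch log size and disk space).
    db.adminCommand({ setParameter: 1, logLevel: 1 });

    // ...after the problem window has been observed, restore the default level...
    db.adminCommand({ setParameter: 1, logLevel: 0 });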