[SERVER-7551] _id Unique key violation during initial sync Created: 05/Nov/12  Updated: 11/Jul/16  Resolved: 11/Nov/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.1
Fix Version/s: 2.2.2, 2.3.1

Type: Bug Priority: Major - P3
Reporter: Eric Milkie Assignee: Eric Milkie
Resolution: Done Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongoDB.txt     Text File mongod.log     File mongodb-logs.zip    
Issue Links:
Depends
Duplicate
is duplicated by SERVER-7628 [rsSync] warning: DR102 too much dat... Closed
Related
is related to SERVER-7186 Batched oplog application increases f... Closed
Operating System: ALL
Participants:

 Comments   
Comment by Christian Ribe [ 14/Nov/12 ]

Thanks for your help Eric.
Great job

Chris

Comment by auto [ 09/Nov/12 ]

Author: Eric Milkie <milkie@10gen.com> (2012-11-09T15:47:00Z)

Message: SERVER-7551 use same batch limits for initialsync and synctail
Branch: v2.2
https://github.com/mongodb/mongo/commit/39f9638a400a1189d9f00f89c5ece874b4acaf2a

Comment by Eric Milkie [ 09/Nov/12 ]

Hi Chris. I've committed a change that I believe will solve your issue. The amount of uncommitted bytes was growing too large during initial sync due to a variety of factors. I've put in some adjustments so that initial sync more closely matches what we do during normal syncing, to ensure the journal doesn't overflow. This fix will be released in version 2.2.2.
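
The fix amounts to capping each initial-sync apply batch by the same byte and operation-count limits that the steady-state applier (synctail) already uses, so a single batch can never push the journal's uncommitted data past its threshold. The sketch below only illustrates that batching idea; the struct, reader interface, and limit values are assumptions made for the example, not the actual server code.

    #include <cstddef>
    #include <vector>

    // Illustrative stand-in for an oplog entry; the real server works with BSON.
    struct OplogEntry {
        std::vector<char> raw;  // serialized operation
        std::size_t sizeBytes() const { return raw.size(); }
    };

    // Assumed caps, shared by initial sync and normal (tail) syncing, so that
    // one apply batch stays well under the journal's uncommitted-data limit.
    constexpr std::size_t kMaxBatchBytes = 100 * 1024 * 1024;
    constexpr std::size_t kMaxBatchOps   = 5000;

    // Collect oplog entries from the sync source into one batch, stopping at
    // whichever limit is hit first. Reader is any type with more()/peek()/next().
    template <typename Reader>
    std::vector<OplogEntry> nextApplyBatch(Reader& source) {
        std::vector<OplogEntry> batch;
        std::size_t bytes = 0;
        while (source.more() && batch.size() < kMaxBatchOps) {
            OplogEntry op = source.peek();
            if (!batch.empty() && bytes + op.sizeBytes() > kMaxBatchBytes)
                break;  // batch is full by size; apply it, then let the journal commit
            bytes += op.sizeBytes();
            batch.push_back(source.next());
        }
        return batch;
    }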

Comment by auto [ 09/Nov/12 ]

Author: Eric Milkie <milkie@10gen.com> (2012-11-09T15:47:00Z)

Message: SERVER-7551 use same batch limits for initialsync and synctail
Branch: master
https://github.com/mongodb/mongo/commit/61c4869d25f58e8dae4da9d73f65ed42de84fbe6

Comment by Eric Milkie [ 08/Nov/12 ]

Hi Chris.
We're still analyzing your logs. I'm attempting to reproduce the issue locally on my machine, since after inspecting the code we do not immediately see how the problem could occur.

Comment by Christian Ribe [ 07/Nov/12 ]

Hey Eric,

I seem to have solved the issue by deleting the offending index. The secondary was able to resync correctly after that. The index was not needed anymore, so I did not bother to rebuild it.

I will not run repairDatabase since the MongoDB documentation seems to say NOT to do that: http://www.mongodb.org/display/DOCS/Durability+and+Repair

Still interested about your analysis of my logs.
Thanks
Chris

Comment by Christian Ribe [ 06/Nov/12 ]

Hi Eric,

Here are the log files:

  • mongoPRI.full.log (primary server full logs from sync start time to fail time)
  • mongoPRI.clean.log (the same file with all the "connection accepted" and "end connection" lines removed)
  • mongoSEC.full.log (full log of the secondary that I am trying to sync)

Hope it helps you figure out why my setup is crashing.

I'm planning on dropping some old data + less critical indexes so the primary is smaller.
Hopefully that will allow the sync to complete; if it does, I will be able to stop one server and run a repairDatabase.
That should fix the duplicates, right?

Thank you for your help.
Chris

Comment by Christian Ribe [ 06/Nov/12 ]

MongoDB logs of primary and crashing secondary on sync.

Comment by auto [ 06/Nov/12 ]

Author: Eric Milkie <milkie@10gen.com> (2012-11-06T15:54:08Z)

Message: SERVER-7551 do not convert updates to upserts if primary is pre-2.2

Converting updates to upserts during replication was added for version 2.2.0.
This code was dependent on all operators being idempotent.
Some operators in 2.0.x are not fully idempotent, so if we are reading an
oplog generated by a 2.0.x node, we shouldn't activate this new feature.
Branch: v2.2
https://github.com/mongodb/mongo/commit/cfac2cb0aa85b9d45deca4d658726999d232eefb
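
As a rough illustration of the guard described in this commit message: update-to-upsert conversion is only safe when the oplog being applied was generated by a 2.2+ node, since the feature relies on every operator being idempotent. The names below are assumptions made for the sketch, not the server's real API.

    #include <iostream>

    // Illustrative stand-in for the sync source's server version.
    struct Version {
        int major;
        int minor;
    };

    // The update-to-upsert conversion added in 2.2.0 depends on all operators
    // being idempotent. Some 2.0.x operators are not, so the conversion must
    // stay disabled while applying an oplog produced by a pre-2.2 node.
    bool mayConvertUpdatesToUpserts(const Version& syncSource) {
        return syncSource.major > 2 ||
               (syncSource.major == 2 && syncSource.minor >= 2);
    }

    int main() {
        std::cout << mayConvertUpdatesToUpserts({2, 0}) << "\n";  // 0: pre-2.2 oplog
        std::cout << mayConvertUpdatesToUpserts({2, 2}) << "\n";  // 1: 2.2+ oplog
    }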

Comment by Eric Milkie [ 06/Nov/12 ]

Hi Chris,
Thanks for attaching MongoDB.txt. Can you attach a log that shows log entries from the beginning, where the server was started up? Also, can you attach the log from the primary as well?

Comment by Christian Ribe [ 06/Nov/12 ]

Here is an update.
Upgraded primary and secondary to 2.2.1 but still failing with:
DR102 too much data written uncommitted 314.577MB

Here are my logs from just before it crashes...
No joy.
Chris

Comment by Christian Ribe [ 06/Nov/12 ]

MongoDB.log file showing the "DR102 too much data written uncommitted 314.577MB" error.

Comment by auto [ 06/Nov/12 ]

Author: Eric Milkie <milkie@10gen.com> (2012-11-06T15:54:08Z)

Message: SERVER-7551 do not convert updates to upserts if primary is pre-2.2

Converting updates to upserts during replication was added for version 2.2.0.
This code was dependent on all operators being idempotent.
Some operators in 2.0.x are not fully idempotent, so if we are reading an
oplog generated by a 2.0.x node, we shouldn't activate this new feature.
Branch: master
https://github.com/mongodb/mongo/commit/e541866fa44c13c82df51131b2075739f9d448d9

Comment by Christian Ribe [ 05/Nov/12 ]

My secondary did not come back up for me on 2.0.7.
I will try to stop my primary and upgrade to the latest 2.2.1; I hope it works.

Comment by Rotem Hermon [ 05/Nov/12 ]

Thanks, the secondary came up ok in 2.0.7.
I will wait for 2.2.2 for the upgrade.

Comment by Eric Milkie [ 05/Nov/12 ]

Hi Rotem,
Thanks for the answers. We discussed possible solutions and we think we have something that will allow all users who make heavy use of $push and other $ operators to upgrade smoothly. We are going to publish a 2.2.2 release candidate in the next few days with a modification that will correct this issue.
For now, you should restart your secondary node with version 2.0.7. There is a slight chance that the secondary will experience the same unique key error (in fact, there has always been this chance ever since MongoDB supported replication); in such a situation, one must null out the conflicting oplog entries manually. Our support team will help with this if it occurs.

Comment by Christian Ribe [ 05/Nov/12 ]

Hi,

I am in the same situation: I am stuck with no working secondaries, as they cannot sync up with the primary.
So I cannot step down the primary to switch over to an upgraded secondary.

I can understand the need to raise an exception on duplicates, but I don't see why the whole server should crash. Stop the replication maybe, but don't stop the daemon; let it keep voting, no?
Crashing also has a second effect where N secondaries can go down, leaving the primary alone. Since the primary does not know whether it is the one isolated from the replica set, it switches itself to a secondary state, so all writes stop. In my case I had no arbiter, which would have avoided this, but still.

Will upgrade my primary as suggested and try to limit my downtime...

Comment by Rotem Hermon [ 05/Nov/12 ]

1. This log is from a second restart after an earlier crash (same crash, same error); that's probably why there was recovery info.

2. Yes, we're generating our own _ids.

Comment by Eric Milkie [ 05/Nov/12 ]

There may be other ways to upgrade without downtime; my first idea was merely a suggestion.

Some more questions I have:
It looks like there was recovery information in the journal when you started up your secondary, which is unusual. Was it shut down cleanly? The recovery stage ought not to have caused this issue, but I am curious about it.

Due to the value of the _id that conflicted, it appears that you are creating your own _id values; is that correct?

Comment by Rotem Hermon [ 05/Nov/12 ]

I have many writes happening all the time, and the whole point is not taking down the entire replica set as this will cause downtime to my site. So you're saying there's no way of upgrading without downtime? That's pretty bad!

Comment by Eric Milkie [ 05/Nov/12 ]

Unfortunately, you won't get the fix for unique index key violations until you upgrade both the primary (which generates the oplog) and the secondary (which consumes it).
Was the secondary far behind the primary when you upgraded it? Were many writes occurring on the primary during this time? If you can temporarily stop writes to the primary while the secondary resyncs, that should avoid the problem.

Comment by Rotem Hermon [ 05/Nov/12 ]

Running version 2.0.7. It's a replica set with 2 members (and an arbiter), all running 2.0.7. The crash happened when trying to upgrade the secondary to 2.2.1.

Comment by Eric Milkie [ 05/Nov/12 ]

Hi Rotem.
What version of mongod were you running before you upgraded? Were all the members of the replica set running version 2.2.1 at the time of the crash?
