[SERVER-6671] oplog is not strictly idempotent when unique index is present Created: 31/Jul/12 Updated: 11/Jul/16 Resolved: 09/Oct/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.0.6, 2.2.0-rc0 |
| Fix Version/s: | 2.2.1, 2.3.0 |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Ben Becker | Assignee: | Eric Milkie |
| Resolution: | Done | Votes: | 2 |
| Labels: | index, oplog, replication, unique |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Any replicated environment |
| Operating System: | ALL |
| Description |
|
Given a replica set which has collection 'c' with a unique index on 'a', consider the following sequence of events:
If a secondary node encounters an error during replication (e.g. a killed getMore op or a network disconnect), it may try to reapply the same operations. In the above case, reapplying the operations fails on the secondary nodes because the document {_id:1} already holds the conflicting value in the unique index on {a:1}. On version 2.0.2, the issue manifests as secondaries halting replication when syncTail encounters E11000. On master (~2.2.0-rc0, git hash 73b97070259a0e30d5ed35880b11b32078620e42), the following fassert() is hit:
|
| Comments |
| Comment by auto [ 05/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-04T15:04:31-07:00 Message: |
| Comment by auto [ 04/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-04T15:04:31-07:00 Message: |
| Comment by auto [ 04/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-03T08:49:45-07:00 Message: |
| Comment by auto [ 04/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-03T08:32:40-07:00 Message: The version field will allow us to detect the primary's version. |
| Comment by auto [ 03/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-03T08:49:45-07:00 Message: |
| Comment by auto [ 03/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-03T08:32:40-07:00 Message: The version field will allow us to detect the primary's version. |
| Comment by Pierre Dane [ 03/Oct/12 ] |
|
Thanks Eric - have done: SERVER-7243 |
| Comment by Eric Milkie [ 03/Oct/12 ] |
|
Hi Pierre. |
| Comment by Pierre Dane [ 03/Oct/12 ] |
|
Update: this is the Windows 2008+ build using the latest C# driver. Please let me know if I need to open a separate ticket. Thanks |
| Comment by Pierre Dane [ 03/Oct/12 ] |
|
We are seeing this error with no unique key (except _id) present: This happens when upgrading from 2.0.6 to 2.2 - simply stopping the server (primary or secondary) and trying to restart with the 2.2 binaries. Will the above fix solve our issue as well? |
| Comment by auto [ 02/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-02T10:13:15-07:00 Message: |
| Comment by auto [ 02/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-01T10:29:58-07:00 Message: |
| Comment by auto [ 02/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-09-28T13:20:59-07:00 Message: During rollback, there is a stage where we apply all the docs we fetched to undo all the |
| Comment by auto [ 02/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-09-28T11:54:44-07:00 Message: Operations involving unique indexes may be repeated during a recovery or initial sync. We need to ignore unique key violations in this situation. This is okay because by the time we reach the end of the oplog stream (and are finally caught up), all the violations should be resolved (as we should end up with an exact copy of the primary). |
| Comment by auto [ 02/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-02T10:13:15-07:00 Message: |
| Comment by auto [ 01/Oct/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-10-01T10:29:58-07:00 Message: |
| Comment by auto [ 28/Sep/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-09-28T13:20:59-07:00 Message: During rollback, there is a stage where we apply all the docs we fetched to undo all the |
| Comment by auto [ 28/Sep/12 ] |
|
Author: Eric Milkie <milkie@10gen.com> Date: 2012-09-28T11:54:44-07:00 Message: Operations involving unique indexes may be repeated during a recovery or initial sync. We need to ignore unique key violations in this situation. This is okay because by the time we reach the end of the oplog stream (and are finally caught up), all the violations should be resolved (as we should end up with an exact copy of the primary). |
| Comment by Ben Becker [ 31/Jul/12 ] |
|
Just to note – the attached test is a bit nondeterministic due to killOp()/syncTail timing. Open to any advice/input on how to make it more deterministic. |
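
For readers who want to see the failure mode without a replica set, here is a minimal Python sketch of a toy model. It is not the server code and not the attached test. `Collection`, `replay`, `ignore_dup_keys`, and the three-entry `OPLOG` are hypothetical names and data chosen for illustration; the sequence is an assumed one that happens to end with {_id:1} owning a:1, not necessarily the sequence from the description. The model also assumes that oplog inserts are applied as upserts keyed by `_id`.

```python
class DuplicateKeyError(Exception):
    """Stand-in for the server's E11000 duplicate key error."""


class Collection:
    """Toy collection keyed by _id with a unique index on one field."""

    def __init__(self, unique_field):
        self.unique_field = unique_field
        self.docs = {}  # maps _id -> document

    def _check_unique(self, doc):
        # Reject the write if a *different* document already holds the value.
        for other in self.docs.values():
            same_value = other[self.unique_field] == doc[self.unique_field]
            if other["_id"] != doc["_id"] and same_value:
                raise DuplicateKeyError(
                    f"E11000: {self.unique_field}={doc[self.unique_field]!r} "
                    f"already held by _id={other['_id']!r}"
                )

    def apply(self, op):
        kind, doc = op
        if kind == "i":
            # Assumption: an oplog insert behaves as an upsert by _id.
            self._check_unique(doc)
            self.docs[doc["_id"]] = dict(doc)
        elif kind == "d":
            # Deleting a missing document is a no-op.
            self.docs.pop(doc["_id"], None)
        else:
            raise ValueError(f"unknown op type: {kind!r}")


def replay(coll, oplog, ignore_dup_keys=False):
    """Apply every oplog entry in order, optionally skipping unique key violations."""
    for op in oplog:
        try:
            coll.apply(op)
        except DuplicateKeyError:
            if not ignore_dup_keys:
                raise


# Hypothetical sequence: the value a:1 moves from _id:2 to _id:1.
OPLOG = [
    ("i", {"_id": 2, "a": 1}),
    ("d", {"_id": 2}),
    ("i", {"_id": 1, "a": 1}),
]

if __name__ == "__main__":
    secondary = Collection("a")
    replay(secondary, OPLOG)  # first pass succeeds
    try:
        replay(secondary, OPLOG)  # strict re-application hits the unique index
    except DuplicateKeyError as exc:
        print("strict replay failed:", exc)
    replay(secondary, OPLOG, ignore_dup_keys=True)
    print("after tolerant replay:", secondary.docs)
```

In this model the first entry of the repeated batch is the one that fails, because {_id:1} already holds a:1. Skipping that violation is harmless here: the later entries in the same batch bring the collection back to the state it had after the first pass, which is the reasoning given in the commit message above.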