[SERVER-6975] $push idempotency issues with version 2.2.0 Created: 09/Sep/12  Updated: 15/Feb/13  Resolved: 28/Sep/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.2.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Braden Evans Assignee: Eric Milkie
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu/3.0.0-17


Issue Links:
Duplicate
duplicates SERVER-3407 oplog is not idempotent for array ope... Closed
Related
is related to SERVER-5961 $push array size constraint can cause... Closed
Operating System: Linux
Participants:

 Description   

We first got this crash during initial sync of a secondary node and ended up downgrading to 2.0 because the sync could not resume, then upgraded back to 2.2 once it completed. Now, after three days of uptime, we have the crash again:

Sat Sep  8 17:53:33 [repl writer worker 7] ERROR: writer worker caught exception: E11000 duplicate key error index: [collection].$_id_  dup key: { : ObjectId('50287481463c0b83d82f8dea') } on: { ts: Timestamp 1347152002000|57, h: -1992302823673752270, op: "u", ns: "[collection]", o2: { _id: ObjectId('50287481463c0b83d82f8dea'), tu: { $size: 6 } }, o: { $set: { a: ObjectId('5026b145003c0b2238a9ddf6') }, $push: { tu: ObjectId('0000000000000000002721eb') }, $set: { tuc: 7 } } }
Sat Sep  8 17:53:33 [repl writer worker 7]   Fatal Assertion 16360
0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f69a5e88efc 0x7f69a521a59d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f69a5e88efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f69a521a59d]
Sat Sep  8 17:53:33 [repl writer worker 7]
 
***aborting after fassert() failure
 
 
Sat Sep  8 17:53:33 Got signal: 6 (Aborted).
 
Sat Sep  8 17:53:33 Backtrace:
0xade6e1 0x5582d9 0x7f69a516d420 0x7f69a516d3a5 0x7f69a5170b0b 0x802e3e 0x64f77d 0x77d3dd 0x7c3659 0x7f69a5e88efc 0x7f69a521a59d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /lib/x86_64-linux-gnu/libc.so.6(+0x36420) [0x7f69a516d420]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f69a516d3a5]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f69a5170b0b]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f69a5e88efc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f69a521a59d]

And it crashes on startup with the same exception:

Sat Sep  8 17:57:08 [rsStart] replSet STARTUP2
Sat Sep  8 17:57:08 [rsHealthPoll] replSet member 192.168.1.40:27017 is up
Sat Sep  8 17:57:08 [rsHealthPoll] replSet member 192.168.1.40:27017 is now in state PRIMARY
Sat Sep  8 17:57:08 [rsHealthPoll] replSet member 192.168.1.20:27017 is up
Sat Sep  8 17:57:08 [rsHealthPoll] replSet member 192.168.1.20:27017 is now in state ARBITER
Sat Sep  8 17:57:09 [rsSync] replSet SECONDARY
Sat Sep  8 17:57:13 [rsBackgroundSync] replSet syncing to: 192.168.1.40:27017
Sat Sep  8 17:57:13 [repl writer worker 8] ERROR: writer worker caught exception: E11000 duplicate key error index: [collection].$_id_  dup key: { : ObjectId('50287481463c0b83d82f8dea') } on: { ts: Timestamp 1347152002000|57, h: -1992302823673752270, op: "u", ns: "[collection]", o2: { _id: ObjectId('50287481463c0b83d82f8dea'), tu: { $size: 6 } }, o: { $set: { a: ObjectId('5026b145003c0b2238a9ddf6') }, $push: { tu: ObjectId('0000000000000000002721eb') }, $set: { tuc: 7 } } }
Sat Sep  8 17:57:13 [repl writer worker 8]   Fatal Assertion 16360
0xade6e1 0x802e03 0x64f77d 0x77d3dd 0x7c3659 0x7f993759cefc 0x7f993692e59d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0x802e03]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f993759cefc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f993692e59d]
Sat Sep  8 17:57:13 [repl writer worker 8]
 
***aborting after fassert() failure
 
 
Sat Sep  8 17:57:13 Got signal: 6 (Aborted).
 
Sat Sep  8 17:57:13 Backtrace:
0xade6e1 0x5582d9 0x7f9936881420 0x7f99368813a5 0x7f9936884b0b 0x802e3e 0x64f77d 0x77d3dd 0x7c3659 0x7f993759cefc 0x7f993692e59d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xade6e1]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x5582d9]
 /lib/x86_64-linux-gnu/libc.so.6(+0x36420) [0x7f9936881420]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f99368813a5]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f9936884b0b]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0x802e3e]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12d) [0x64f77d]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x77d3dd]
 /usr/bin/mongod() [0x7c3659]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7efc) [0x7f993759cefc]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f993692e59d]



 Comments   
Comment by Eric Milkie [ 15/Nov/12 ]

Hi Andy.
This particular JIRA issue is marked as Duplicate, and as such has no corresponding Fix version. The linked ticket that duplicates this one, SERVER-3407, has been fixed by SERVER-7186 and SERVER-4781, as noted in the last posted comment there, and the fix version for those was 2.2.1. If you are seeing a crash with version 2.2.1, we need more information, and the discussion should probably be moved to a new SERVER ticket, or at least applied to one of the tickets where the code fixes were made. I apologize for how complicated tracking these particular issues has become.

Comment by Andy O'Neill [ 15/Nov/12 ]

This is marked as resolved, but it remains a crashing bug in the production version (2.2.1). Is there a tracking bug for a backport to 2.2?

I do not agree that crashing during initial sync is preferable to keeping the 2.0 behavior... consider it from my position: I now have to remain on 2.0 until 2.4 comes out, because I can't bring up new 2.2 servers. Is that any better than not aborting during the sync?

Comment by Eric Milkie [ 28/Sep/12 ]

Please see SERVER-3407 for further updates.

Comment by Eric Milkie [ 10/Sep/12 ]

The issue behind SERVER-5961 has existed since the beginning. You are having trouble with 2.2 because it now aborts when it detects this situation, whereas previous versions ignored the duplicate key error and continued; arguably, continuing after this error is not the correct thing to do. If you look at your initial sync logs from when you were running 2.0, you may well find the same duplicate key error.
Note that the duplicate key error can occur at any time, not just at initial sync time. It is just an effect of the data left behind by the problem with $push operations occurring during initial sync.
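To illustrate the failure mode, here is a rough mongo shell sketch of what we believe happens on replay (it assumes the oplog applier falls back to an upsert when the recorded $size criterion no longer matches; the collection name and array values are placeholders):

// State on the secondary after the push has already landed (e.g. the document
// was cloned after the write): the array has 7 elements, so the recorded
// criterion tu: { $size: 6 } no longer matches.
db.coll.insert({ _id: ObjectId('50287481463c0b83d82f8dea'), tu: [ 1, 2, 3, 4, 5, 6, 7 ], tuc: 7 })

// Re-applying the oplog entry with upsert semantics then tries to insert a
// new document carrying the _id from the query ...
db.coll.update(
    { _id: ObjectId('50287481463c0b83d82f8dea'), tu: { $size: 6 } },
    { $push: { tu: 8 }, $set: { tuc: 7 } },
    true  // upsert
)
// ... which collides with the existing document on the _id index:
// E11000 duplicate key error index: coll.$_id_
// 2.0 logged and skipped this error during oplog application; 2.2 fasserts.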

Comment by Braden Evans [ 10/Sep/12 ]

Eric,
It seems possible this is the cause, although I see this case was added back in 1.8, and I am fairly confident my problem is with 2.2. I see a large number of "repl writer" workers (around 10); if concurrency has been increased, perhaps that is what triggers this bug. These push operations are all upserts, although that "tuc" field indicates there were already a few entries in the array, so it was likely not newly created.

Unfortunately this is our production system, and it is not viable to stop the $push operations while we sync (and indeed the crash I opened this case for did not happen during a sync, just during normal operations). We are heavy users (abusers?) of $push.

Comment by Eric Milkie [ 10/Sep/12 ]

Hi Braden. Thanks for the detailed problem description.
I think you may be hitting SERVER-5961. We're working on changing the semantics of the array-manipulation operators for 2.3 to ameliorate this type of situation.
Is it possible to do the initial sync of the secondary without concurrently issuing the $push/$set writes to the primary?

Comment by Braden Evans [ 10/Sep/12 ]

Looking through the logs from the first time we encountered this, I noticed it always fails on a $push operation, across various collections. The exception appears to involve the _id index, but to clarify, there are no indexes on the field being pushed or on the other field being updated.

The update that generated this op is:
collection.Update(Query.EQ("_id", receiver.Id), Update.Push("tu", sender.Id).Inc("tuc", 1))
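For reference, a rough mongo shell equivalent of that driver call is below (receiverId and senderId stand in for the actual ObjectId values); note that the oplog appears to record the $inc as a $set of the resulting value, which would explain the $set: { tuc: 7 } in the entry above:

db.collection.update(
    { _id: receiverId },
    { $push: { tu: senderId }, $inc: { tuc: 1 } }
)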

Some other examples of this exception (different collections):

Mon Sep 3 15:00:38 [repl writer worker 7] ERROR: writer worker caught exception: E11000 duplicate key error index: [collection].$_id_  dup key: { : ObjectId('5044fe2d293c0b02a893576b') } on: { ts: Timestamp 1346698527000|3315, h: -2145615260524114526, op: "u", ns: "[collection]", o2: { _id: ObjectId('5044fe2d293c0b02a893576b'), pc: { $size: 35 } }, o: { $push: { pc: { u: ObjectId('0000000000000000002283a9'), c: [ 12, 20, 43, 51, 73, 5, 24, 31, 48, 67, 1, 23, 34, 58, 69, 3, 30, 36, 53, 71, 11, 19, 39, 59, 68 ] } } } }

Mon Sep 3 20:43:57 [repl writer worker 8] ERROR: writer worker caught exception: E11000 duplicate key error index: [collection].$_id_  dup key: { : ObjectId('50454763063c0b8da41b0558') } on: { ts: Timestamp 1346717426000|21, h: -2936717105214827886, op: "u", ns: "[collection]", o2: { _id: ObjectId('50454763063c0b8da41b0558'), i: { $size: 1 } }, o: { $push: { i: { a: ObjectId('50427caa083c0b958c01d055'), i: ObjectId('50454807463c0b8da42307ff') } } } }

Comment by Braden Evans [ 09/Sep/12 ]

Downgraded to 2.0 and the server came right back up.
