[SERVER-10905] Duplicate key error killed all secondaries on cluster Created: 25/Sep/13  Updated: 10/Dec/14  Resolved: 18/Mar/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Ramiro Berrelleza Assignee: Samantha Ritter (Inactive)
Resolution: Cannot Reproduce Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

ubuntu 12.04 on aws


Attachments: mongodb-primary.log, mongodb-secondary-1.log, mongodb-secondary-2.log
Operating System: Linux
Participants:

 Description   

We have a cluster of 3 nodes to enable replication. Two of the nodes are configured with the same priority (1) while the third one is configured with priority 0 so it never gets promoted (it's the one we use for backups).
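For reference, a configuration with that priority layout is typically set up from the mongo shell roughly as below; the set name and hostnames here are placeholders, not our actual values:

rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "node-a:27017", priority: 1 },
    { _id: 1, host: "node-b:27017", priority: 1 },
    { _id: 2, host: "node-c:27017", priority: 0 }    // backup node, never eligible to become primary
  ]
})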

As of last night, the two secondaries have been failing continuously during writes, with the following error logged on both servers:

Wed Sep 25 05:28:16.211 [repl writer worker 2] ERROR: writer worker caught exception: E11000 duplicate key error index: marketshare.Application.$name_1 dup key: { : "TestingRDS" } on: { ts: Timestamp 1380112101000|1, h: -5880003146554201345, v: 2, op: "u", ns: "marketshare.Application", o2: { _id: ObjectId('5242be375f6ffb0c37145385') }, o: { $set: { boxes.0: { box_type_name: "OPTIMIZERMS-APPV4", updated: "2013-09-25 12:28:22.855168", <rest of object>...} }}

Wed Sep 25 05:28:16.211 [repl writer worker 2] Fatal Assertion 16360
0xdddd81 0xd9dc13 0xc26bfc 0xdab721 0xe26609 0x7ff9f4923e9a 0x7ff9f3c36ccd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd9dc13]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7ff9f4923e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff9f3c36ccd]
Wed Sep 25 05:28:16.215 [repl writer worker 2]

***aborting after fassert() failure

Wed Sep 25 05:28:16.215 Got signal: 6 (Aborted).

Wed Sep 25 05:28:16.219 Backtrace:
0xdddd81 0x6d0d29 0x7ff9f3b794a0 0x7ff9f3b79425 0x7ff9f3b7cb8b 0xd9dc4e 0xc26bfc 0xdab721 0xe26609 0x7ff9f4923e9a 0x7ff9f3c36ccd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0d29]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7ff9f3b794a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7ff9f3b79425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7ff9f3b7cb8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd9dc4e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
/usr/bin/mongod() [0xe26609]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7ff9f4923e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7ff9f3c36ccd]

After rebooting both secondaries, the cluster was able to establish quorum again for about 15 minutes, but after that the issue reproduced. When we logged in to the instances directly, we noticed that the indexes were correct on the primary but completely empty on the secondary.
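The discrepancy is visible by listing the collection's indexes on each node (db.collection.getIndexes() is the standard call; reading from a secondary on 2.4 needs rs.slaveOk() first). A sketch of the check, using the collection name from the error above:

// run against the primary and against each secondary
rs.slaveOk()                                            // only needed on a secondary
db.getSiblingDB("marketshare").Application.getIndexes() // primary listed name_1; secondaries listed nothing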



 Comments   
Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ]

Hi Ramiro,

Please be advised I'm now closing this issue. It looks like something went amiss in your replica set upgrade back in September, which you resolved by rebuilding the replica set on fresh servers. From the log files provided it is unclear how to reproduce this issue.

If you do have any further questions or concerns on this issue, please let us know.

Thanks,
Stephen

Comment by Ramiro Berrelleza [ 18/Nov/13 ]

No, we haven't seen the issue repro. I've launched a parallel set of servers with the same topology and haven't seen this issue. However, those are fresh servers rather than upgraded ones, so it's not exactly the same scenario.

Comment by Samantha Ritter (Inactive) [ 04/Nov/13 ]

Hi Ramiro, I'm sorry for the delay in response. Has this issue reappeared, with either your original cluster or your parallel set of servers?

Comment by Ramiro Berrelleza [ 28/Sep/13 ]

We removed the server from the cluster, so for now we're running on a single server until we understand the problem better.

The history of the cluster goes as follows:
Originally, we had a primary and a replica running on AWS (mongodb 2.2.3, on us-west-1). After about 3-4 months running on that setup, we decided to migrate our servers to run with a primary and two replicas (primary and 1 secondary on AWS, the other secondary on our datacenter, all mongodb 2.4.6). In order to migrate the data and avoid disruptions, we decided to join our new primary to the old setup, and once replication was complete, we promoted it to primary (by calling rs.stepDown() on the old server), removed the old servers from the configuration, and added the new servers to the replica set. Our servers ran fine for about a week, until the reported incident happened.
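In shell terms, the cutover was roughly the following; hostnames below are placeholders, and this is a sketch of the sequence described above rather than a verbatim transcript:

// on the old primary: join the new members and let them finish their initial sync
rs.add("new-primary:27017")
rs.add("new-secondary:27017")
rs.add("backup-node:27017")
// still on the old primary: step down so an election promotes one of the new members
rs.stepDown(60)
// on the newly elected primary: drop the old members from the configuration
rs.remove("old-primary:27017")
rs.remove("old-secondary:27017")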

We have a parallel set of servers with the same configuration, but with different data. We haven't seen the issue repro on those servers.

Comment by Eliot Horowitz (Inactive) [ 27/Sep/13 ]

What's the status now?

What's the history of the cluster? i.e. restores from backup, syncs, etc...
