[SERVER-10905] Duplicate key error killed all secondaries on cluster Created: 25/Sep/13 Updated: 10/Dec/14 Resolved: 18/Mar/14 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.4.6 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Critical - P2 |
| Reporter: | Ramiro Berrelleza | Assignee: | Samantha Ritter (Inactive) |
| Resolution: | Cannot Reproduce | Votes: | 1 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: | ubuntu 12.04 on aws |
| Attachments: | |
| Operating System: | Linux |
| Participants: | |
| Description |
|
We have a cluster of 3 nodes to enable replication. Two of the nodes are configured with the same priority (1), while the third is configured with priority 0 so it never gets promoted (it's the one we use for backups). As of last night, the two secondaries have been failing continuously during writes, with the following error logged on both servers:

Wed Sep 25 05:28:16.211 [repl writer worker 2] ERROR: writer worker caught exception: E11000 duplicate key error index: marketshare.Application.$name_1 dup key: { : "TestingRDS" } on: { ts: Timestamp 1380112101000|1, h: -5880003146554201345, v: 2, op: "u", ns: "marketshare.Application", o2: { _id: ObjectId('5242be375f6ffb0c37145385') }, o: { $set: { boxes.0: { box_type_name: "OPTIMIZERMS-APPV4", updated: "2013-09-25 12:28:22.855168", <rest of object>...}}}
Wed Sep 25 05:28:16.211 [repl writer worker 2] Fatal Assertion 16360
***aborting after fassert() failure
Wed Sep 25 05:28:16.215 Got signal: 6 (Aborted).
Wed Sep 25 05:28:16.219 Backtrace:

After rebooting both secondaries, the cluster was able to establish quorum again for about 15 minutes, but after that the issue reproduced. When we logged in to the instances directly, we noticed that the indexes were correct on the primary but were completely empty on the secondary. |
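For context, a minimal mongo shell sketch of the topology described above; the replica set name, host names, and ports are hypothetical and not taken from this report. The index check at the end mirrors the primary-versus-secondary comparison mentioned in the last paragraph.

// Hypothetical 3-member replica set matching the description:
// two electable members at priority 1 and one priority-0 member kept for backups.
rs.initiate({
  _id: "rs0",
  members: [
    { _id: 0, host: "mongo-a.example.com:27017", priority: 1 },
    { _id: 1, host: "mongo-b.example.com:27017", priority: 1 },
    { _id: 2, host: "mongo-backup.example.com:27017", priority: 0 }
  ]
})

// Comparing index definitions on the primary and a secondary (2.4-era shell).
// On a secondary, permit reads first:
rs.slaveOk()
db.getSiblingDB("marketshare").Application.getIndexes()
// A quick way to compare what the report describes: correct index listings
// on the primary versus empty listings on the secondary.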
| Comments |
| Comment by Stennie Steneker (Inactive) [ 18/Mar/14 ] |
|
Hi Ramiro, Please be advised I'm now closing this issue. It looks like something went amiss in your replica set upgrade back in September, which you resolved by rebuilding the replica set on fresh servers. From the log files provided it is unclear how to reproduce this issue. If you do have any further questions or concerns on this issue, please let us know. Thanks, |
| Comment by Ramiro Berrelleza [ 18/Nov/13 ] |
|
No, we haven't seen the issue reproduce. I've launched a parallel set of servers with the same topology and haven't seen this issue there. However, these are fresh servers rather than upgraded ones, so it's not exactly the same scenario. |
| Comment by Samantha Ritter (Inactive) [ 04/Nov/13 ] |
|
Hi Ramiro, I'm sorry for the delay in response. Has this issue reappeared, with either your original cluster or your parallel set of servers? |
| Comment by Ramiro Berrelleza [ 28/Sep/13 ] |
|
We removed the server from the cluster, so for now we're running on a single server until we understand the problem better. The history of the cluster goes as follows: We have a parallel set of servers with the same configuration, but with different data. We haven't seen the issue reproduce on those servers. |
| Comment by Eliot Horowitz (Inactive) [ 27/Sep/13 ] |
|
What's the status now? What's the history of the cluster? e.g. restores from backup, syncs, etc... |