[SERVER-19794] Fatal Assertion 16360: duplicate key error during replication Created: 06/Aug/15  Updated: 31/May/17  Resolved: 03/Oct/15

Status: Closed
Project: Core Server
Component/s: Index Maintenance, Replication
Affects Version/s: 3.0.5
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Shankar Karuppiah Assignee: Eric Milkie
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-19633 Fatal Assertion 16360 during initial ... Closed
Operating System: Linux
Participants:

 Description   

2015-08-04T06:37:49.530+0000 E REPL     [repl writer worker 9] writer worker caught exception:  :: caused by :: 11000 E11000 duplicate key error collection: talk-vsession.visitorSessions index: _id_ dup key: { : ObjectId('55c05dbdd896ff11c05bf548') } on: { ts: Timestamp 1438670269000|632, h: -3999426385275226819, v: 2, op: "u", ns: "talk-vsession.visitorSessions", o2: { _id: ObjectId('55c05dbdd896ff11c05bf548') }, o: { $set: { sm: "e56d8d2df5c4e6761a36b7e88efe0681853a6f54", uo: new Date(1438670269476) } } }
2015-08-04T06:37:49.530+0000 I -        [repl writer worker 9] Fatal Assertion 16360
2015-08-04T06:37:49.530+0000 I -        [repl writer worker 9] 
 
***aborting after fassert() failure
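
For diagnosis, a minimal way to inspect the offending oplog entry and the document it targets on the affected node, once it is reachable in the mongo shell again (collection name and ObjectId are taken from the log above; this is only a diagnostic sketch, not a fix):

rs.slaveOk()  // allow reads while the member is a secondary
var id = ObjectId('55c05dbdd896ff11c05bf548');

// Replicated operations recorded for this document, newest first
db.getSiblingDB("local").oplog.rs.find(
    { ns: "talk-vsession.visitorSessions", "o2._id": id }
).sort({ $natural: -1 })

// The document as it currently exists on this member
db.getSiblingDB("talk-vsession").visitorSessions.find({ _id: id })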



 Comments   
Comment by Ramon Fernandez Marina [ 07/May/16 ]

saisrinivase@gmail.com, for MongoDB-related support discussion please post on the mongodb-user group or Stack Overflow with the mongodb tag, where your question will reach a larger audience. A question like this, which needs more back-and-forth discussion, is best suited to the mongodb-user group.

Regards,
Ramón.

Comment by Sai [ 07/May/16 ]

Hi Sachin,

Sorry for the delay in responding.

We did not upgrade or downgrade the version. I restarted the resync and the node is now in sync with the primary.

The major remaining concern is that mongod services in our sharded cluster sometimes go down unexpectedly and we have to restart them manually.

When I looked in the error log I did not see any errors, so I am not sure how to resolve this.

Do you have any clues? Any help is appreciated.

Comment by Sachin Kumar [ 06/May/16 ]

Hello Ramon ,

Thanks for your response. What I wanted to know is whether anyone who upgraded had this issue resolved completely; if I would hit the same issue on the newer version, there is no point in upgrading.

This is the only issue I am facing; otherwise MongoDB is working well for me, so I want to be certain before upgrading.

Thanks
Sachin Kumar

Comment by Ramon Fernandez Marina [ 05/May/16 ]

sachinrohdia@gmail.com, if you're seeing the behavior described in this ticket it could be due to SERVER-21275, so I'd recommend you upgrade to the latest version (MongoDB 3.2.6 at the time of this writing) and open a new ticket if the issue persists.

Thanks,
Ramón.

Comment by Sachin Kumar [ 05/May/16 ]

Hello Sai ,

Did upgrading to 3.0.11 resolve this issue?

Comment by Sai [ 13/Apr/16 ]

Thank you Ramón!!!

Comment by Ramon Fernandez Marina [ 13/Apr/16 ]

We were not able to reproduce the behavior described in this ticket, but as pointed out above it could be related to SERVER-21275. I'd recommend upgrading to 3.0.11 and, if the problem persists, opening a new SERVER ticket.

Thanks,
Ramón.

Comment by Sai [ 13/Apr/16 ]

Hi Ramon,

Thank you, I have already started the resync to bring it back as a secondary.

I am glad to know more about the fatal assertion with the duplicate key error, because that is the part I have not been able to figure out.

Could you help me with this? I appreciate your help!

Comment by Ramon Fernandez Marina [ 13/Apr/16 ]

saisrinivase@gmail.com, it seems this node has become too stale to be able to catch up, and you'll need to resync it. If the problem persists please open a new ticket so we can investigate it separately.

Thanks,
Ramón.

Comment by Sai [ 13/Apr/16 ]

Today the resync failed again, with the error below:

2016-04-12T20:45:59.697-0400 I REPL [ReplicationExecutor] syncing from: xxxxxxxxxx:27015
2016-04-12T20:45:59.778-0400 W REPL [rsBackgroundSync] we are too stale to use xxxxxxx:27015 as a sync source
2016-04-12T20:45:59.778-0400 I REPL [ReplicationExecutor] syncing from: xxxxxxxxxxx:27015
2016-04-12T20:45:59.833-0400 W REPL [rsBackgroundSync] we are too stale to use xxxxxxxxxx:27015 as a sync source
2016-04-12T20:45:59.833-0400 I REPL [ReplicationExecutor] syncing from:xxxxxxxxxxxxxxxx:27015
2016-04-12T20:45:59.886-0400 I REPL [ReplicationExecutor] could not find member to sync from
2016-04-12T20:45:59.886-0400 I REPL [rsBackgroundSync] replSet error RS102 too stale to catch up
2016-04-12T20:45:59.886-0400 I REPL [rsBackgroundSync] replSet our last optime : Apr 11 09:00:36 570b9ff4:1dc
2016-04-12T20:45:59.886-0400 I REPL [rsBackgroundSync] replSet oldest available is Apr 11 15:41:34 570bfdee:139
2016-04-12T20:45:59.886-0400 I REPL [rsBackgroundSync] replSet See http://dochub.mongodb.org/core/resyncingaverystalereplicasetmember

2016-04-12T20:46:20.811-0400 I QUERY [conn15428] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:config.settings query:{}
2016-04-12T20:46:24.096-0400 I NETWORK [conn15539] end connection 10.160.137.7:53778 (16 connections now open)

2016-04-12T20:47:15.724-0400 I QUERY [conn15428] assertion 13436 not master or secondary; cannot currently read from this replSet member ns:config.settings query:{}
2016-04-12T20:47:24.947-0400 I NETWORK [conn15547] end connection 10.160.137.7:53794 (17 connections now open)
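
The RS102 messages above mean the member's last applied optime (Apr 11 09:00:36) is older than the oldest entry still available on any sync source (Apr 11 15:41:34), so it can no longer catch up from the oplog and needs a full resync, as described at the dochub link in the log. A quick way to compare the oplog window against a lagging member before the next resync (a sketch assuming mongo shell access to the set):

// On the primary (or any healthy member): oplog size and the time range it covers
rs.printReplicationInfo()

// On any member: last applied optime per member, to see how far behind the stale node is
rs.status().members.forEach(function(m) { print(m.name + "  " + m.optimeDate + "  " + m.stateStr) })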

Comment by Shankar Karuppiah [ 12/Apr/16 ]

I think this issue is related to https://jira.mongodb.org/browse/SERVER-21275
It might be resolved if you upgrade your MongoDB to 3.0.8 or later.

Comment by Sai [ 12/Apr/16 ]

2016-03-23T20:15:51.366-0400 I - [repl writer worker 2] Fatal Assertion 16361

The error message is as above; please help me with this. Thanks in advance!

Comment by Sai [ 11/Apr/16 ]

Is there any solution for this in version 3.0.7?

I would be grateful for any solution to this error.

I have started a resync twice, but each time it fails with the same error Shankar posted.

Comment by JongWon Kim [ 03/Nov/15 ]

We're experiencing the same issue the original author mentioned. I think I can answer the questions:

1. No.
2. Local RAID volume. No recent failures.
3. Yes, and not on just one or two nodes; we have seen this 5-6 times a week.

We installed 3.0.7 on every node from the beginning and migrated data from nodes running 2.4.13.
We're using WiredTiger as the storage engine, with snappy compression and no journaling.
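
For anyone comparing environments, the storage configuration mentioned above can be confirmed on a running node (a sketch using standard shell helpers; the exact fields returned depend on how the options were set at startup):

db.serverStatus().storageEngine                        // e.g. { "name" : "wiredTiger", ... }
db.adminCommand({ getCmdLineOpts: 1 }).parsed.storage  // engine, journal and compression settings as started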

Comment by Ramon Fernandez Marina [ 03/Oct/15 ]

shankar.k, we haven't heard back from you for a while so we're closing this ticket. If this is still an issue for you please provide the additional information requested by Sam above.

Thanks,
Ramón.

Comment by Sam Kleinman (Inactive) [ 01/Sep/15 ]

Sorry for the delay in getting back to you. I've discussed this case with a few of my colleagues and we have three possible explanations for the error that you saw:

  1. The data on the secondary was modified during a maintenance operation where the instance was running in "standalone" mode and not a member of the set.
  2. Data corruption on the secondary member of the set, as a result of storage (block device) system failure, networking errors, or other modification.
  3. A logical error in the replication system.

In most cases the clock skew should not be a problem, although running a time synchronization service (e.g. ntpd) may be a good idea as a matter of general practice. There are some cases where the combination of clock skew and unreliable networks can produce a replica set with multiple primaries for a short period of time. This is confusing for the client driver, but should lead to a rollback rather than the kind of error that you see.
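
For reference, a rough way to spot-check clock skew across the members from a single shell session (the hostnames below are illustrative placeholders; substitute the actual members):

["mongodb-alpha-1:27017", "mongodb-alpha-3:27017"].forEach(function(h) {
    var t = new Mongo(h).getDB("admin").runCommand({ isMaster: 1 }).localTime;
    print(h + "  " + t);
});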

In an attempt to better understand what's happening here, could you answer the following questions:

  1. Did you perform any maintenance on this member of the replica set at any point in its history? Was there a period where this member was running but not part of the replica set (i.e. running without the replSet option), where a client might have been able to connect and modify the data (even unintentionally)?
  2. What kind of storage system are the data files stored on? Is it local or attached via a network interface? Are you aware of recent drive failures? Have you had previous issues with storage consistency?
  3. Has this issue continued to affect your deployment? Have you had other nodes affected by it? Has the same node continued to be affected? Do you have an idea of which document or documents are triggering this error? We'd like to figure out a way to reproduce this issue. Do you still have a copy of the data where you see this error?

Sorry again for the delay, and thanks for your help.

Regards,
sam

Comment by Ramon Fernandez Marina [ 12/Aug/15 ]

Thanks for uploading the logs shankar.k; this is what I see on the primary:

2015-08-04T06:48:01.310+0000 I COMMAND  [conn6452433] command talk-vsession.$cmd command: delete { delete: "visitorSessions", 
writeConcern: { w: "majority", wtimeout: 5000 }, ordered: true, deletes: [ { q: { _id: { $in: [ ObjectId('55c05d77d896ff11c05b
e9a6'), ObjectId('55c05da8d896ff11c05bf1cf'), ObjectId('55c05d14d896ff11c05bd8d6'), ObjectId('55c05c57d896ff11c05bb85e'), ObjectId('55c05dbdd896ff11c05bf53b'), ObjectId('55c05b06d896ff11c05b7fee'), ObjectId('55c057ecd896ff11c05aff26'), ObjectId('55c05893d896ff11c05b1a26'), ObjectId('55c05db8d896ff11c05bf463'), ObjectId('55c05d46d896ff11c05be13a'), ObjectId('55c05986d896ff11c05b4141'), ObjectId('55c05d87d896ff11c05bec3b'), ObjectId('55c05dbad896ff11c05bf4b6'), ObjectId('55c05d78d896ff11c05be9c8'), ObjectId('55c05d94d896ff11c05bee83'), ObjectId('55c05d26d896ff11c05bdbdf'), ObjectId('55c05d42d896ff11c05be083'), ObjectId('55c05c21d896ff11c05baee7'), ObjectId('55c05d9fd896ff11c05bf07d'), ObjectId('55c05d87d896ff11c05bec56'), ObjectId('55c05d95d896ff11c05bee9f'), ObjectId('55c05dbcd896ff11c05bf527'), ObjectId('55c05dbbd896ff11c05bf4f1'), ObjectId('55c05d47d896ff11c05be141'), ObjectId('55c04478d896ff11c0582d7d'), ObjectId('55c05d17d896ff11c05bd966'), ObjectId('55c05d5fd896ff11c05be563'), ObjectId('55c05d81d896ff11c05beb4e'), ObjectId('55c05dbdd896ff11c05bf548'), ObjectId('55c05973d896ff11c05b3e35'), ObjectId('55c05d87d896ff11c05bec4e'), ObjectId('55c05db9d896ff11c05bf49d'), ObjectId('55c0594bd896ff11c05b37c7'), ObjectId('55c04f8fd896ff11c059b8f9'), ObjectId('55c05d00d896ff11c05bd58f'), ObjectId('55c049efd896ff11c058eb16'), ObjectId('55c05c95d896ff11c05bc305'), ObjectId('55c05d78d896ff11c05be9cd'), ObjectId('55c04fb4d896ff11c059be68'), ObjectId('55c05d03d896ff11c05bd62a'), ObjectId('55c05d76d896ff11c05be983'), ObjectId('55c05d79d896ff11c05be9e3'), ObjectId('55c048afd896ff11c058be66'), ObjectId('55c05da5d896ff11c05bf15e'), ObjectId('55c05bddd896ff11c05ba338'), ObjectId('55c05c86d896ff11c05bc05c'), ObjectId('55c05b8ad896ff11c05b9544'), ObjectId('55c05d88d896ff11c05bec60'), ObjectId('55c05da4d896ff11c05bf149'), ObjectId('55c05a29d896ff11c05b5bc8') ] } }, limit: 0 } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:185 locks:{ Global: { acquireCount: { r: 51, w: 51 } }, Database: { acquireCount: { w: 51 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 50 } } } 5019ms
2015-08-04T06:48:01.310+0000 I NETWORK  [conn6452433] end connection 10.96.4.137:50583 (3041 connections now open)
2015-08-04T06:48:01.602+0000 I COMMAND  [conn3261683] command talk-asession.$cmd command: update { update: "agentSessions", writeConcern: { w: "majority", wtimeout: 15000 }, ordered: true, updates: [ { q: { _id: ObjectId('55c053fdac9121063ec28f86'), eo: { $gt: new Date(1438670775312) } }, u: { $set: { eo: new Date(1438672575312), uo: new Date(1438670775312) } }, multi: false, upsert: false } ] } keyUpdates:0 writeConflicts:0 numYields:0 reslen:200 locks:{ Global: { acquireCount: { r: 2, w: 2 } }, Database: { acquireCount: { w: 2 } }, Collection: { acquireCount: { w: 1 } }, oplog: { acquireCount: { w: 1 } } } 15001ms
2015-08-04T06:48:01.729+0000 W NETWORK  [ReplExecNetThread-402] Failed to connect to 10.97.176.65:27017, reason: errno:111 Connection refused
2015-08-04T06:48:01.729+0000 I REPL     [ReplicationExecutor] Error in heartbeat request to mongodb-alpha-1:27017; Location18915 Failed attempt to connect to mongodb-alpha-1:27017; couldn't connect to server mongodb-alpha-1:27017 (10.97.176.65), connection attempt failed

Right after the delete command the secondary mongodb-alpha-1 becomes unreachable; in the mongodb-alpha-1 logs I see:

2015-08-04T06:37:49.530+0000 E REPL     [repl writer worker 9] writer worker caught exception:  :: caused by :: 11000 E11000 duplicate key error collection: talk-vsession.visitorSessions index: _id_ dup key: { : ObjectId('55c05dbdd896ff11c05bf548') } on: { ts: Timestamp 1438670269000|632, h: -3999426385275226819, v: 2, op: "u", ns: "talk-vsession.visitorSessions", o2: { _id: ObjectId('55c05dbdd896ff11c05bf548') }, o: { $set: { sm: "e56d8d2df5c4e6761a36b7e88efe0681853a6f54", uo: new Date(1438670269476) } } }
2015-08-04T06:37:49.530+0000 I -        [repl writer worker 9] Fatal Assertion 16360
2015-08-04T06:37:49.530+0000 I -        [repl writer worker 9] 
 
***aborting after fassert() failure
 
 
2015-08-04T07:01:24.391+0000 I CONTROL  ***** SERVER RESTARTED *****

Both commands mention the offending ObjectId, 55c05dbdd896ff11c05bf548, so I'm trying to understand how the delete command may trigger this assertion on the secondary node.
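
One way to list every replicated operation that touched that _id on the primary, which may help correlate the failing update with the later delete (a diagnostic sketch; insert and delete entries carry the _id in the o field, updates in o2):

var id = ObjectId('55c05dbdd896ff11c05bf548');
db.getSiblingDB("local").oplog.rs.find({
    ns: "talk-vsession.visitorSessions",
    $or: [ { "o._id": id }, { "o2._id": id } ]
}).sort({ ts: 1 })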

I'm very suspicious of the clock difference between these two nodes, which is more than 10 minutes. I don't know yet whether this is the source of the problem or just a coincidence.

We'll investigate this further and let you know if we need additional information. I see that mongodb-alpha-1 successfully transitioned to secondary after a restart; I also see a number of restarts on this system. Are you still affected by this issue, or are the restarts expected?

Thanks,
Ramón.

Comment by Shankar Karuppiah [ 06/Aug/15 ]

Hello Ramon,

Thank you for creating the bucket. I have uploaded the logs:
mongod-alpha-3.log.2015-08-05T00-00-01.tar.gz => Primary
mongod-alpha-1.log.2015-08-05T00-00-01.tar.gz => Secondary that crashed

For info about the load, you can check our MMS account; the organization name is loveclients.
I'm not sure about the type of operations that were running when it crashed.

Thank you,
Shankar

Comment by Ramon Fernandez Marina [ 06/Aug/15 ]

Here it is: https://10gen-httpsupload.s3.amazonaws.com/upload_forms/e108a7f2-b1d4-492c-96dd-841c1ff25316.html

Comment by Shankar Karuppiah [ 06/Aug/15 ]

Hello Ramon,

Could you create a MongoDB Support File Upload bucket, please ?

Thank you,
Shankar

Comment by Ramon Fernandez Marina [ 06/Aug/15 ]

Hi shankar.k, could you please upload full logs for the primary node as well as the affected secondary? Are there other details you can provide as to how this assertion was triggered (load, type of operations, etc.)?

Thanks,
Ramón.
