[SERVER-19633] Fatal Assertion 16360 during initial sync Created: 28/Jul/15  Updated: 25/Nov/15  Resolved: 10/Oct/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.9, 2.6.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: John Fisher Assignee: Ramon Fernandez Marina
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.log-20150728.gz    
Issue Links:
Related
related to SERVER-19794 Fatal Assertion 16360: duplicate key ... Closed
Operating System: ALL
Participants:

 Description   

After provisioning a new database server, and adding it to our existing replica set, we got this error during the initial sync. A large portion of the data transferred successfully before we got the error.

The duplicate key error mentioned in the logs does not exist on the primary. The index in question (on Primary) is:

	{
		"v" : 0,
		"unique" : true,
		"key" : {
			"linkedin_ids" : 1
		},
		"name" : "linkedin_ids_1",
		"ns" : "datafox.mappings",
		"sparse" : true,
		"background" : true,
		"safe" : null
	}
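
Since the reported dup key is "{ : undefined }" on a sparse unique index, one way to double-check the primary is to count documents that lack the indexed field entirely and to confirm the index options. This is a hedged mongo-shell sketch using the collection and field names from the index spec above:

```javascript
// Run against the PRIMARY in the mongo shell.
// A sparse index should skip documents missing linkedin_ids, so a
// nonzero count here is expected and not itself a problem; it just
// identifies the documents the "{ : undefined }" dup key refers to.
use datafox
db.mappings.count({ linkedin_ids: { $exists: false } })

// Confirm the index really is sparse and unique as reported:
db.mappings.getIndexes().filter(function (ix) {
    return ix.name === "linkedin_ids_1";
})
```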

The pre-existing Primary and Secondary hosts are running 2.6.9, but the new server (on which the error occurs) is running 2.6.10.

We've tried to bring this new host up twice and got the same error both times, though on different documents.

2015-07-27T20:48:29.379-0700 [initandlisten] connection accepted from 172.31.16.82:37930 #825633 (318 connections now open)
2015-07-27T20:48:29.488-0700 [initandlisten] connection accepted from 172.31.10.102:60198 #825634 (319 connections now open)
2015-07-27T20:48:29.531-0700 [repl writer worker 2] ERROR: writer worker caught exception:  :: caused by :: 11000 E11000 duplicate key error index: datafox.mappings.$linkedin_ids_1  dup key: { : undefined } on: { ts: Timestamp 1438054759000|1, h: -7888321465560204017, v: 2, op: "u", ns: "datafox.mappings", o2: { _id: ObjectId('54fdb9c9998e85da06001fd7') }, o: { $set: { mtime: new Date(1438054759071), urls.1: { url: "www.boosharticles.com/", url_domain: "boosharticles.com", _id: ObjectId('55b6f96701c3755e3800f0c4') }, __v: 1 } } }
2015-07-27T20:48:29.531-0700 [repl writer worker 2] Fatal Assertion 16360
2015-07-27T20:48:29.531-0700 [repl writer worker 2] 
 
***aborting after fassert() failure



 Comments   
Comment by Ramon Fernandez Marina [ 10/Oct/15 ]

jr@datafox.co, thanks for uploading the logs and for your patience. Unfortunately we haven't been able to determine whether a bug in mongod caused this behavior. This comment in SERVER-19794 hints at what issues (other than a possible bug in mongod) may have triggered this, but without catching the exact moment when the document is inserted/updated on the secondary node it's not possible to get enough information to track down exactly what triggered this condition.

A possible workaround is to resync this secondary node. Depending on the size of the data this may be a long operation, so you may want to try rebuilding the linkedin_ids index on the secondary node first.

On the off chance that this issue was caused by a bug in v0 indexes, you may want to consider rebuilding this index on the primary as well to update it to v1.
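
A hedged sketch of the rebuild suggested above, in the mongo shell. Note that dropping and recreating an index on the primary replicates to the secondaries; to rebuild only on one secondary, the usual procedure is to restart that node standalone first. The options below are copied from the index spec in the description:

```javascript
// Rebuilding drops and recreates the index, which also upgrades it
// from v0 to v1 on 2.6. CAUTION: recreating a unique index will fail
// if genuine duplicates exist, and index builds take locks, so
// schedule this during a maintenance window.
use datafox
db.mappings.dropIndex("linkedin_ids_1")
db.mappings.ensureIndex(
    { linkedin_ids: 1 },
    { unique: true, sparse: true, background: true }
)

// Alternatively, rebuild all indexes on the collection at once:
// db.mappings.reIndex()
```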

Sorry you ran into this, and I hope the workaround provided above gets you back up and running.

Regards,
Ramón.

Comment by Ramon Fernandez Marina [ 31/Aug/15 ]

Hi jr@datafox.co, apologies for the radio silence. This is to let you know that we're still investigating this issue alongside SERVER-19794, which exhibits the same symptoms. We'll update this ticket when we know more.

Thanks for your patience,
Ramón.

Comment by John Fisher [ 30/Jul/15 ]

Here is the full log from the replication fatal assertion issue.

Comment by Ramon Fernandez Marina [ 30/Jul/15 ]

jr@datafox.co, which version of MongoDB was this index created with? As you may have noticed, the version number is 0, which corresponds to an old version; if you were to create an index with 2.6.9 the version would be 1.

I'm asking because this may be the source of the issue: when I try to reproduce this assertion with a 2-node replica set running 2.6.9, a v0 index does replicate as a v0 index, but when I add a 2.6.10 node the index is built as v1. While I haven't been able to trigger the assertion, I suspect index versions may be the issue.
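
To compare index format versions across nodes, a quick mongo-shell check like the following (run on each member and compare the output) should be enough; the "v" field is the same one shown in the index spec in the description:

```javascript
// Print the name and format version of every index on the collection.
use datafox
db.mappings.getIndexes().forEach(function (ix) {
    print(ix.name + " v=" + ix.v);
})
```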

I'd like to get full logs for the 2.6.10 node from startup until it aborts replication and shuts down. If you don't have them but are willing to repeat the experiment, I'd ask you to start mongod with -vv to get more verbose logs. Is this something you can do to help us track down the source of the problem?
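
For reference, a restart with very verbose logging might look like the following; the replica set name, paths, and port here are placeholders, so substitute your existing configuration (or keep using your config file and just add the verbosity flag):

```shell
# -vv raises log verbosity two levels; the other options are
# illustrative stand-ins for your current settings.
mongod -vv --replSet rs0 --port 27017 \
    --dbpath /data/db --logpath /var/log/mongodb/mongod.log --fork
```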

Thanks,
Ramón.

Comment by John Fisher [ 30/Jul/15 ]

Any updates?
