[SERVER-20796] Warning on replica that collection lacks a unique index on _id, though index is present Created: 07/Oct/15  Updated: 09/Jun/16  Resolved: 13/Dec/15

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.6.9, 2.6.10
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Marcin Gozdalik Assignee: Wan Bachtiar
Resolution: Incomplete Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File mongod.log-SERVER-20796.log.bz2.1     File mongod.log-SERVER-20796.log.bz2.2    
Issue Links:
Related
is related to SERVER-21357 index warning on hidden secondary aft... Closed
Operating System: ALL
Participants:

 Description   

I set up a replica against a master server. The master has a few databases.
After the initial sync, I get the following warnings on the replica:

Server has startup warnings: 
2015-09-14T09:33:23.616-0400 [initandlisten] WARNING: the collection 'sffs03.e' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:23.616-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:28.493-0400 [initandlisten] WARNING: the collection 'sffs05.e' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:28.493-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:34.136-0400 [initandlisten] WARNING: the collection 'sf.catalog' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:34.136-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:34.136-0400 [initandlisten] WARNING: the collection 'sf.jobs' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:34.136-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:34.136-0400 [initandlisten] WARNING: the collection 'sf.migrator' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:34.136-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:34.136-0400 [initandlisten] WARNING: the collection 'sf.v_fs06' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:34.136-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:34.136-0400 [initandlisten] WARNING: the collection 'sf.v_jeremytest' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:34.136-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
2015-09-14T09:33:34.250-0400 [initandlisten] WARNING: the collection 'sffs06_new.e' lacks a unique index on _id. This index is needed for replication to function properly
2015-09-14T09:33:34.250-0400 [initandlisten] 	 To fix this, you need to create a unique index on _id. See http://dochub.mongodb.org/core/build-replica-set-indexes
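
For reference, the remediation the warning points to (http://dochub.mongodb.org/core/build-replica-set-indexes) amounts to building the missing unique index on the affected member, after restarting it as a standalone per the linked procedure. A minimal sketch, using the sffs03.e collection named above:

use sffs03
// core step of the linked procedure; meant to be run against the member restarted as a standalone
db.e.ensureIndex({ _id: 1 }, { unique: true })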

The warnings are incorrect: the collections have the default index on _id, and it's not possible to insert a document with a duplicate _id on the master:

sf0:PRIMARY> use sffs03
switched to db sffs03
sf0:PRIMARY> db.e.insert({ "_id" : ObjectId("55c4c48c7c738904a2390306"), "zip": 1})
WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: sffs03.e.$_id_  dup key: { : ObjectId('55c4c48c7c738904a2390306') }"
	}
})
sf0:PRIMARY> db.e.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "sffs03.e"
	},
...

The indexes are visible on replica:

sf0:SECONDARY> db.e.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "sffs03.e"
	},
...

If I create a new database, the replica does not complain, so it seems there is something off with the old databases, even though they were created with default settings and the _id index was never dropped.

sf0:PRIMARY> use uniqe_index
switched to db uniqe_index
sf0:PRIMARY> db.e.insert({'ala': 1, 'kot': 2})
WriteResult({ "nInserted" : 1 })
sf0:PRIMARY> db.e.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"name" : "_id_",
		"ns" : "uniqe_index.e"
	}
]
sf0:PRIMARY> db.e.find()
{ "_id" : ObjectId("561505583903c78e99111195"), "ala" : 1, "kot" : 2 }
sf0:PRIMARY> db.e.insert({"_id" : ObjectId("561505583903c78e99111195"), 'ala': 2, 'kot': 3})
WriteResult({
	"nInserted" : 0,
	"writeError" : {
		"code" : 11000,
		"errmsg" : "insertDocument :: caused by :: 11000 E11000 duplicate key error index: uniqe_index.e.$_id_  dup key: { : ObjectId('561505583903c78e99111195') }"
	}
})

I'm not sure whether any of this matters, but I'm including all of the information for completeness.

There is a lot of data in the DBs:

sf0:SECONDARY> show dbs
sf          1539.202GB
sffs03       157.877GB
sffs05      1407.267GB
sffs06_new   451.733GB
admin          (empty)
local         50.054GB
test           (empty)

and setting up the replica fails sometimes (probably because the oplog is too small).

Replication configuration:

sf0:PRIMARY> rs.config()
{
	"_id" : "sf0",
	"version" : 4,
	"members" : [
		{
			"_id" : 0,
			"host" : "host07:27017"
		},
		{
			"_id" : 1,
			"host" : "host06:27017",
			"votes" : 0,
			"priority" : 0
		}
	]
}

Replication status:

sf0:PRIMARY> rs.printReplicationInfo()
configured oplog size:   47797.726318359375MB
log length start to end: 257988secs (71.66hrs)
oplog first event time:  Sun Oct 04 2015 11:22:16 GMT-0400 (EDT)
oplog last event time:   Wed Oct 07 2015 11:02:04 GMT-0400 (EDT)
now:                     Wed Oct 07 2015 11:02:24 GMT-0400 (EDT)

Master is running:

db version v2.6.9
2015-10-07T11:12:25.873-0400 git version: df313bc75aa94d192330cb92756fc486ea604e64

Replica is running:

db version v2.6.10
2015-10-07T11:12:19.453-0400 git version: 5901dbfb49d16eaef6f2c2c50fba534d23ac7f6c



 Comments   
Comment by Wan Bachtiar [ 13/Dec/15 ]

Hi Marcin,

We haven’t heard back from you for some time, so I’m going to mark this ticket as resolved. If this is still an issue for you, please feel free to re-open the ticket and provide additional information.

Regards,

Wan.

Comment by Wan Bachtiar [ 25/Nov/15 ]

Hi Marcin,

As Ramon previously mentioned, the warnings appear after an unclean shutdown. The member received termination signal 15 (SIGTERM):

2015-09-14T09:32:46.361-0400 [signalProcessingThread] got signal 15 (Terminated), will terminate after current cmd ends
2015-09-14T09:32:46.464-0400 [repl writer worker 3] ERROR: exception:  :: caused by :: 11600 interrupted at shutdown on: { ts: Timestamp 1441107063000|88, h: -4869797789310936012, v: 2, op: "i", ns: "sfsf06_new.e", o: { _id: ObjectId('55e58c757c7389334e59d610'), size: 199, tsize: -1, pino: 5227818899, uid: 85894, depth: 13, nf: -1, ct: 1440656894, mt: 1440656894, ext: "", ino: 5276240806, at: 1440656894, path: dir/file, blck: 85, gid: 2085894, type: 32768, fn: "17-8049072_contig", mode: 436 } }
2015-09-14T09:32:46.474-0400 [repl writer worker 3] Fatal Assertion 16361
2015-09-14T09:32:46.474-0400 [repl writer worker 3] 
2015-09-14T09:32:58.085-0400 ***** SERVER RESTARTED *****

The member was terminated during oplog synchronisation but before the index-building stage, which left some collections without their indexes.

After the member was restarted, a startup check discovered that some collections existed without the _id index, which triggered the warning messages. The startup process then reset the initial sync by dropping all the databases and restarting the cloning process from the primary:

2015-09-14T09:33:40.846-0400 [rsBackgroundSync] replSet syncing to: host07:27017
2015-09-14T09:33:40.847-0400 [rsBackgroundSync] replset setting syncSourceFeedback to host07:27017
2015-09-14T09:33:50.857-0400 [rsSync] replSet initial sync pending
2015-09-14T09:33:50.857-0400 [rsSync] replSet syncing to: host07:27017
2015-09-14T09:33:50.868-0400 [rsSync] replSet initial sync drop all databases
2015-09-14T09:33:50.869-0400 [rsSync] dropAllDatabasesExceptLocal 5
2015-09-14T09:33:50.951-0400 [rsSync] removeJournalFiles

This time the initial-sync process completed successfully, and the member entered the SECONDARY state.
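
As a side note, a minimal mongo-shell sketch that lists replicated collections missing their _id index (the condition the startup check warns about); run it against the affected member:

// skips the non-replicated 'local' database and system collections
db.getMongo().getDBNames().forEach(function(dbName) {
    if (dbName === "local") return;
    var d = db.getSiblingDB(dbName);
    d.getCollectionNames().forEach(function(collName) {
        if (collName.indexOf("system.") === 0) return;
        var hasIdIndex = d.getCollection(collName).getIndexes().some(function(ix) { return ix.name === "_id_"; });
        if (!hasIdIndex) print(dbName + "." + collName + " lacks an _id index");
    });
});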

setting up the replica fails sometimes (probably because the oplog is too small).

If the primary's oplog were too small, the primary would be vetoed as a sync source and there would be corresponding entries in the log file. In this case, there are no log entries suggesting the primary was vetoed. Such entries would look similar to:

[rsBackgroundSync] replSet not trying to sync from <HOST>:<PORT>, it is vetoed for <NUMBER> more seconds
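
A quick way to check which member a secondary is currently syncing from, as a minimal sketch:

// run from the mongo shell on any member; syncingTo is reported for actively syncing members
rs.status().members.forEach(function(m) {
    if (m.syncingTo) print(m.name + " is syncing from " + m.syncingTo);
});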

We had to apply the workaround described in SERVER-18721 to be able to synchronize with the primary, and the first attempts failed.

Can you upload the primary log between 2015-09-02T02:42:04 and 2015-09-03T06:02:07? This is to confirm whether there are lots of short-lived connections caused by replication updates on deleted documents (as described in SERVER-18721).

Based on the output of your rs.conf(), you have only two members in the replica set, which means a failure of either member will leave the set without a primary. The minimum recommended replica set deployment for a production system is a three-member replica set. Adding a third member to your replica set will provide fault tolerance and high availability. See Three Member Replica Set for more information.
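
A minimal sketch of adding a third data-bearing member (the hostname below is a placeholder, not one of your hosts):

// run on the primary
rs.add("host05:27017")

If carrying a full third copy of the data is not an option, rs.addArb("host05:27017") adds a voting arbiter instead, though an arbiter provides no data redundancy.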

Kind Regards,

Wan.

Comment by Marcin Gozdalik [ 29/Oct/15 ]

  • the warnings are gone
  • we just hit some weird behavior: one of the DBs had a collection of 300M documents. We needed to reboot both the secondary and the primary. After they came back up, the secondary still had 300M documents, but the primary was reduced to 14. Curiously enough, inserting a document on the primary with an ObjectId that was still visible on the replica caused the primary to insert the document and the secondary to overwrite its old document with the same ObjectId. Hope that counts as negative impact. We needed the DB, so we dumped the replica, dropped the collection on the master, and we're currently restoring.
Comment by Ramon Fernandez Marina [ 28/Oct/15 ]

Thanks for uploading the logs, gozdal, and apologies it took a while to go through them. What I see is that these warnings appeared once, after what seems to be an unclean shutdown, so I need to ask you:

  • are you still seeing the warnings?
  • is there any negative impact on your replica set, or are things working correctly?

Thanks,
Ramón.

Comment by Marcin Gozdalik [ 14/Oct/15 ]

ramon.fernandez, is there anything more I can do?

Comment by Marcin Gozdalik [ 08/Oct/15 ]

I'm attaching the logs. They are huge (7.5GB uncompressed) because of many replication errors. We had to apply the workaround described in https://jira.mongodb.org/browse/SERVER-18721 to be able to synchronize with the primary, and the first attempts failed.

Comment by Marcin Gozdalik [ 08/Oct/15 ]

The actual log file is compressed with bzip2 but split into two parts because of Jira's 150MB file size limit.
To uncompress, run:

cat mongod.log-SERVER-20796.log.bz2.1 mongod.log-SERVER-20796.log.bz2.2 > mongod.log-SERVER-20796.log.bz2
bunzip2 mongod.log-SERVER-20796.log.bz2

WARNING: the unpacked file is 7.5GB.

Comment by Marcin Gozdalik [ 08/Oct/15 ]

Here is rs.conf():

sf0:PRIMARY> rs.conf()
{
	"_id" : "sf0",
	"version" : 4,
	"members" : [
		{
			"_id" : 0,
			"host" : "host07:27017"
		},
		{
			"_id" : 1,
			"host" : "host06:27017",
			"votes" : 0,
			"priority" : 0
		}
	]
}

Comment by Ramon Fernandez Marina [ 07/Oct/15 ]

gozdal, can you please send the logs for the secondary that's missing indexes from the time it was started until it became a secondary?

Also, can you please post the output of rs.conf()?

Thanks,
Ramón.
