[SERVER-13953] Duplicate key error causes all secondaries to fassert() and crash in production Created: 15/May/14  Updated: 10/Dec/14  Resolved: 15/May/14

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: Jiangcheng Wu Assignee: Thomas Rueckstiess
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.db.PRIMARY_20140514.tgz     File vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.db.content.20140514.tgz    
Issue Links:
Duplicate
duplicates SERVER-12662 Two background index builds in parall... Closed
Operating System: ALL
Participants:

 Description   

Cluster = 1 PRIMARY + 3 SECONDARY + 1 ARBITER
The PRIMARY and ARBITER are OK; all 3 SECONDARY nodes crashed, and they crash again even after restarting the instances.

I had occasionally seen duplicate key errors in mongodb.log before, but this is the first time the nodes crashed.

mongodb logs:
SECONDARY1:
```
Wed May 14 15:01:58.452 [repl writer worker 1] info: indexing in foreground on this replica; was a background index build on the primary
Wed May 14 15:01:58.452 [repl writer worker 1] build index vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi { dongxiID: 1 }
Wed May 14 15:01:58.457 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi.$dongxiID_1 dup key: { : null } on: { ts: Timestamp 1400050387000|17, h: -1650836744667758, v: 2, op: "i", ns: "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.system.indexes", o: { name: "dongxiID_1", ns: "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi", background: true, unique: true, key: { dongxiID: 1 } } }
Wed May 14 15:01:58.457 [repl writer worker 1] Fatal Assertion 16360
0xde05e1 0xda03d3 0xc28f3c 0xdadf21 0xe28e69 0x7fb81feeae9a 0x7fb81f1fdccd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xda03d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc28f3c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdadf21]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fb81feeae9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb81f1fdccd]
Wed May 14 15:01:58.459 [repl writer worker 1]

***aborting after fassert() failure

Wed May 14 15:01:58.459 Got signal: 6 (Aborted).

Wed May 14 15:01:58.461 Backtrace:
0xde05e1 0x6d0559 0x7fb81f1404a0 0x7fb81f140425 0x7fb81f143b8b 0xda040e 0xc28f3c 0xdadf21 0xe28e69 0x7fb81feeae9a 0x7fb81f1fdccd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0559]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fb81f1404a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fb81f140425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7fb81f143b8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xda040e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc28f3c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdadf21]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fb81feeae9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fb81f1fdccd]
```
SECONDARY2:
```
Wed May 14 14:53:08.718 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi.$dongxiID_1 dup key: { : null } on: { ts: Timestamp 1400050387000|17, h: -1650836744667758, v: 2, op: "i", ns: "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.system.indexes", o: { name: "dongxiID_1", ns: "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi", background: true, unique: true, key: { dongxiID: 1 } } }
Wed May 14 14:53:08.718 [repl writer worker 1] Fatal Assertion 16360
Wed May 14 14:53:08.724 [initandlisten] connection accepted from 10.124.57.3:41607 #15706255 (138 connections now open)
0xde05e1 0xda03d3 0xc28f3c 0xdadf21 0xe28e69 0x7fdfe6683e9a 0x7fdfe59963fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xda03d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc28f3c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdadf21]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fdfe6683e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fdfe59963fd]
Wed May 14 14:53:08.740 [repl writer worker 1]

***aborting after fassert() failure

Wed May 14 14:53:08.748 Got signal: 6 (Aborted).

Wed May 14 14:53:08.750 Backtrace:
0xde05e1 0x6d0559 0x7fdfe58d84a0 0x7fdfe58d8425 0x7fdfe58dbb8b 0xda040e 0xc28f3c 0xdadf21 0xe28e69 0x7fdfe6683e9a 0x7fdfe59963fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0559]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fdfe58d84a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fdfe58d8425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7fdfe58dbb8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xda040e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc28f3c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdadf21]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fdfe6683e9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fdfe59963fd]
```

SECONDARY3:
```
Wed May 14 14:53:07.291 [repl writer worker 13] info: indexing in foreground on this replica; was a background index build on the primary
Wed May 14 14:53:07.291 [repl writer worker 13] build index vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi { dongxiID: 1 }
Wed May 14 14:53:07.353 [repl writer worker 13] ERROR: writer worker caught exception: E11000 duplicate key error index: vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi.$dongxiID_1 dup key: { : null } on: { ts: Timestamp 1400050387000|17, h: -1650836744667758, v: 2, op: "i", ns: "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.system.indexes", o: { name: "dongxiID_1", ns: "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi", background: true, unique: true, key: { dongxiID: 1 } } }
Wed May 14 14:53:07.353 [repl writer worker 13] Fatal Assertion 16360
0xde05e1 0xda03d3 0xc28f3c 0xdadf21 0xe28e69 0x7fbb61e0fe9a 0x7fbb611223fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xda03d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc28f3c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdadf21]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fbb61e0fe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbb611223fd]
Wed May 14 14:53:07.402 [repl writer worker 13]

***aborting after fassert() failure

Wed May 14 14:53:07.409 Got signal: 6 (Aborted).

Wed May 14 14:53:07.421 Backtrace:
0xde05e1 0x6d0559 0x7fbb610644a0 0x7fbb61064425 0x7fbb61067b8b 0xda040e 0xc28f3c 0xdadf21 0xe28e69 0x7fbb61e0fe9a 0x7fbb611223fd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xde05e1]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0559]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7fbb610644a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7fbb61064425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7fbb61067b8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xda040e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc28f3c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdadf21]
/usr/bin/mongod() [0xe28e69]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7fbb61e0fe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7fbb611223fd]
```

I've attached db files of vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.



 Comments   
Comment by Thomas Rueckstiess [ 15/May/14 ]

Hi Jiangcheng,

Is it possible that you attempted to build the index multiple times (e.g., ran the call to ensureIndex() twice)? Our current hypothesis is that you're experiencing a known bug, SERVER-12662, which causes the same errors you are seeing. The issue was fixed in version 2.4.10.

To fix your secondaries, the best option is to resync each of them, one at a time, from the primary. I would also recommend upgrading to the latest stable version in the 2.4 series (currently 2.4.10) before attempting to build the index again.

As this appears to be a duplicate of SERVER-12662, I am marking it as such and resolving the ticket. If you still experience the same issues after resyncing the secondary nodes and upgrading to 2.4.10, please feel free to re-open the ticket.

Regards,
Thomas

Comment by Jiangcheng Wu [ 15/May/14 ]

db data dump from PRIMARY

Comment by Jiangcheng Wu [ 15/May/14 ]

1. Indexes on the PRIMARY:

```
SHARD02:PRIMARY> use vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx;
switched to db vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx
SHARD02:PRIMARY> db.Dongxi.getIndexes()
[
	{
		"v" : 1,
		"key" : {
			"_id" : 1
		},
		"ns" : "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi",
		"name" : "_id_"
	},
	{
		"v" : 1,
		"key" : {
			"createdAt" : NumberLong(1)
		},
		"ns" : "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi",
		"name" : "createdAt_1",
		"background" : true
	},
	{
		"v" : 1,
		"key" : {
			"updatedAt" : NumberLong(1)
		},
		"ns" : "vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi",
		"name" : "updatedAt_1",
		"background" : true
	}
]
```

2. PRIMARY mongodb log:

```
Wed May 14 14:53:07.407 [conn831296] build index vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi { dongxiID: 1 } background
Wed May 14 14:53:07.421 [conn831296] background addExistingToIndex exception E11000 duplicate key error index: vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx.Dongxi.$dongxiID_1  dup key: { : null }
```

3. It was from SECONDARY3, and I have attached data from the PRIMARY.

If you need anything else, please let me know. Thank you.

Comment by Thomas Rueckstiess [ 15/May/14 ]

Hi Jiangcheng,

It looks like you were trying to build a unique index on the field dongxiID, but there are many duplicate values, so the index build failed. However, I'd like to find out why it succeeded on the primary and was replicated via the oplog.
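For context, the "dup key: { : null }" in the logs suggests the collision is between documents that lack the dongxiID field entirely: MongoDB indexes a missing field under the key value null, so two or more such documents violate a unique index. Here is a minimal plain-JavaScript sketch of that indexing behavior (the sample documents and the findDuplicateKeys helper are invented for illustration, not taken from the attached data):

```javascript
// Sample documents; two of them are missing dongxiID entirely.
const docs = [
  { _id: 1, dongxiID: "a" },
  { _id: 2 },               // missing dongxiID -> indexed as null
  { _id: 3 },               // second null key -> E11000 under a unique index
];

// Return the index-key values that occur more than once, treating a
// missing field as null the way MongoDB's index keys do.
function findDuplicateKeys(documents, field) {
  const seen = new Map();
  for (const doc of documents) {
    const key = field in doc ? doc[field] : null;
    seen.set(key, (seen.get(key) || 0) + 1);
  }
  return [...seen.entries()].filter(([, n]) => n > 1).map(([k]) => k);
}

console.log(findDuplicateKeys(docs, "dongxiID")); // -> [ null ]
```

On a real deployment, the offending values could be listed in the mongo shell with an aggregation that groups on dongxiID and keeps groups with a count above 1 (a hedged suggestion, not a step from this ticket), before retrying the unique index build.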

To continue the diagnosis I would need some more information.

1. Can you please run the following commands to get the indexes on the Dongxi collection on the primary (the node that did not crash):

```
use vo5xzp29vrev6q5e49nof23a7lljzgexvqm0u95p19jjysfx
db.Dongxi.getIndexes()
```

2. We also need to look at the log file of the primary node. Can you please attach the log file reaching back to before you started the index build?
3. Can you tell me where the data dump you attached was from? The primary? Or one of the secondaries?

Thanks,
Thomas

Generated at Thu Feb 08 03:33:23 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.