[SERVER-6228] Duplicate key error causes secondaries to fassert() and crash Created: 27/Jun/12  Updated: 11/Jul/16  Resolved: 18/Dec/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.1.2
Fix Version/s: 2.2.0-rc0

Type: Bug Priority: Critical - P2
Reporter: Bernie Hackett Assignee: Eric Milkie
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File rs0.log     File rs1.log     File rs2.log    
Issue Links:
Depends
Operating System: ALL
Participants:

 Description   

Found while testing PyMongo against 2.1.2. I'm still trying to track down the actual sequence of events that causes this. Here's the traceback running log level 5:

Wed Jun 27 14:28:41 [repl writer worker] New namespace: pymongo_test.test.$_id_
Wed Jun 27 14:28:41 [repl writer worker] 	 done building bottom layer, going to commit
Wed Jun 27 14:28:41 [repl writer worker] build index done.  scanned 0 total records. 0 secs
Wed Jun 27 14:28:41 [repl writer worker] New namespace: pymongo_test.test
Wed Jun 27 14:28:41 [repl writer worker] build index pymongo_test.test { i: 1 }
mem info: before index start vsize: 9493 resident: 40 mapped: 4382
Wed Jun 27 14:28:41 [repl writer worker] external sort root: /home/behackett/data/rs1/_tmp/esort.1340832521.50/
mem info: before final sort vsize: 9493 resident: 40 mapped: 4382
Wed Jun 27 14:28:41 [repl writer worker] 		 not using file.  size:35 _compares:0
mem info: after final sort vsize: 9493 resident: 40 mapped: 4382
Wed Jun 27 14:28:41 [repl writer worker] 	 external sort used : 0 files  in 0 secs
Wed Jun 27 14:28:41 [repl writer worker] reset extent was:pymongo_test.test.$keyb_1 now:pymongo_test.test.$i_1
allocExtent pymongo_test.test.$i_1 size 36864 1
Wed Jun 27 14:28:41 [repl writer worker] New namespace: pymongo_test.test.$i_1
Wed Jun 27 14:28:41 [repl writer worker] 	 done building bottom layer, going to commit
Wed Jun 27 14:28:41 [repl writer worker] 	 fastBuildIndex dupsToDrop:0
Wed Jun 27 14:28:41 [repl writer worker] build index done.  scanned 1 total records. 0 secs
Wed Jun 27 14:28:41 [repl writer worker] User Assertion: 11000:E11000 duplicate key error index: pymongo_test.test.$i_1  dup key: { : 2 }
Wed Jun 27 14:28:41 [repl writer worker] ERROR: writer worker caught exception: E11000 duplicate key error index: pymongo_test.test.$i_1  dup key: { : 2 } on: { ts: Timestamp 1340832519000|21, h: -352243138702237989, op: "i", ns: "pymongo_test.test", o: { _id: ObjectId('4feb7b07fba5223bd4d67d72'), i: 2 } }
Wed Jun 27 14:28:41 [repl writer worker]   Fatal Assertion 16360
0x8964e1 0xaa2373 0x8e4b00 0x8a6abd 0x666189 0x7fda040cfd0c 0x7fda03469bed 
 ./mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x8964e1]
 ./mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xaa2373]
 ./mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x120) [0x8e4b00]
 ./mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x8a6abd]
 ./mongod() [0x666189]
 /lib64/libpthread.so.0(+0x7d0c) [0x7fda040cfd0c]
 /lib64/libc.so.6(clone+0x6d) [0x7fda03469bed]
Wed Jun 27 14:28:41 [repl writer worker] 
 
***aborting after fassert() failure
 
 
Wed Jun 27 14:28:41 Got signal: 6 (Aborted).
 
Wed Jun 27 14:28:41 Backtrace:
0x8964e1 0x550cd9 0x7fda033c2bc0 0x7fda033c2b45 0x7fda033c3fc6 0xaa23ae 0x8e4b00 0x8a6abd 0x666189 0x7fda040cfd0c 0x7fda03469bed 
 ./mongod(_ZN5mongo15printStackTraceERSo+0x21) [0x8964e1]
 ./mongod(_ZN5mongo10abruptQuitEi+0x399) [0x550cd9]
 /lib64/libc.so.6(+0x35bc0) [0x7fda033c2bc0]
 /lib64/libc.so.6(gsignal+0x35) [0x7fda033c2b45]
 /lib64/libc.so.6(abort+0x186) [0x7fda033c3fc6]
 ./mongod(_ZN5mongo13fassertFailedEi+0xde) [0xaa23ae]
 ./mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x120) [0x8e4b00]
 ./mongod(_ZN5mongo10threadpool6Worker4loopEv+0x26d) [0x8a6abd]
 ./mongod() [0x666189]
 /lib64/libpthread.so.0(+0x7d0c) [0x7fda040cfd0c]
 /lib64/libc.so.6(clone+0x6d) [0x7fda03469bed]

I can reproduce this consistently running PyMongo's test_collection module, although running tests individually doesn't seem to cause it. I'll attach full logs from each replica set member shortly.



 Comments   
Comment by Gianfranco Palumbo [ 18/Dec/12 ]

scotthernandez Sure I'll create another ticket. Just wondering, I would expect that it shutdowns gracefully instead showing a stacktrace or to stop the replication and log the assertion, though a shutdown might be a better signal to alert of this issue.

Comment by Scott Hernandez (Inactive) [ 18/Dec/12 ]

Gianfranco, please open a new issue for new scenarios like this. This is not the same scenario nor cause as described in this issue.

What would you expect it to do? You broke replication by creating an index on the secondary which was more restrictive than the primary. We can change the error message but shutting down is a safe action if you chose to manually break replication like this – there is no way to automatically recover other than to drop the indexes not on the primary.

Comment by Gianfranco Palumbo [ 18/Dec/12 ]

Tested in 2.2.0 and 2.2.2 - This still causes a crash.

The way to replicate this is:
1) Step down a secondary
2) Start it without --replSet and another port
3) Create unique index
4) Restart as replSet member
5) Insert a document in the primary which would duplicate a key in the index.

mongod 2.2.2

Tue Dec 18 12:05:54 [initandlisten] MongoDB starting : pid=6992 port=40002 dbpath=/Users/gianfranco/data/replset/node2 64-bit host=Gianfranco-10gen.local
Tue Dec 18 12:05:54 [initandlisten] db version v2.2.2, pdfile version 4.5
Tue Dec 18 12:05:54 [initandlisten] git version: d1b43b61a5308c4ad0679d34b262c5af9d664267
Tue Dec 18 12:05:54 [initandlisten] build info: Darwin bs-osx-106-x86-64-1.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_49
Tue Dec 18 12:05:54 [initandlisten] options: { dbpath: "/Users/gianfranco/data/replset/node2", port: 40002, replSet: "myapp" }
Tue Dec 18 12:05:54 [initandlisten] journal dir=/Users/gianfranco/data/replset/node2/journal
Tue Dec 18 12:05:54 [initandlisten] recover : no journal files present, no recovery needed
Tue Dec 18 12:05:54 [websvr] admin web console waiting for connections on port 41002
...
Tue Dec 18 12:06:18 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: test.test.$a_1  dup key: { : 1.0 } on: { ts: Timestamp 1355832378000|1, h: 6182722510085346418, v: 2, op: "i", ns: "test.test", o: { _id: ObjectId('50d05c3a8a41536396de58f9'), a: 1.0 } }
Tue Dec 18 12:06:18 [repl writer worker 1]   Fatal Assertion 16360
0x10059544b 0x100557dc5 0x10043e700 0x1005688ed 0x1005e08e3 0x7fff8e926742 0x7fff8e913181 
 0   mongod                              0x000000010059544b _ZN5mongo15printStackTraceERSo + 43
 1   mongod                              0x0000000100557dc5 _ZN5mongo13fassertFailedEi + 165
 2   mongod                              0x000000010043e700 _ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE + 336
 3   mongod                              0x00000001005688ed _ZN5mongo10threadpool6Worker4loopEv + 621
 4   mongod                              0x00000001005e08e3 thread_proxy + 163
 5   libsystem_c.dylib                   0x00007fff8e926742 _pthread_start + 327
 6   libsystem_c.dylib                   0x00007fff8e913181 thread_start + 13
Tue Dec 18 12:06:18 [repl writer worker 1] 
 
***aborting after fassert() failure
 
 
Tue Dec 18 12:06:18 Got signal: 6 (Abort trap: 6).
 
Tue Dec 18 12:06:18 Backtrace:
0x10059544b 0x10000197b 0x7fff8e9148ea 0x105c98970 0x7fff8e96bdce 0x100557e00 0x10043e700 0x1005688ed 0x1005e08e3 0x7fff8e926742 0x7fff8e913181 
 0   mongod                              0x000000010059544b _ZN5mongo15printStackTraceERSo + 43
 1   mongod                              0x000000010000197b _ZN5mongo10abruptQuitEi + 987
 2   libsystem_c.dylib                   0x00007fff8e9148ea _sigtramp + 26
 3   ???                                 0x0000000105c98970 0x0 + 4392061296
 4   libsystem_c.dylib                   0x00007fff8e96bdce abort + 143
 5   mongod                              0x0000000100557e00 _ZN5mongo13fassertFailedEi + 224
 6   mongod                              0x000000010043e700 _ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE + 336
 7   mongod                              0x00000001005688ed _ZN5mongo10threadpool6Worker4loopEv + 621
 8   mongod                              0x00000001005e08e3 thread_proxy + 163
 9   libsystem_c.dylib                   0x00007fff8e926742 _pthread_start + 327
 10  libsystem_c.dylib                   0x00007fff8e913181 thread_start + 13

mongod 2.2.0

Tue Dec 18 12:11:14 [initandlisten] MongoDB starting : pid=7242 port=40002 dbpath=/Users/gianfranco/data/replset/node2 64-bit host=Gianfranco-10gen.local
Tue Dec 18 12:11:14 [initandlisten] db version v2.2.0, pdfile version 4.5
Tue Dec 18 12:11:14 [initandlisten] git version: f5e83eae9cfbec7fb7a071321928f00d1b0c5207
Tue Dec 18 12:11:14 [initandlisten] build info: Darwin bs-osx-106-x86-64-1.local 10.8.0 Darwin Kernel Version 10.8.0: Tue Jun  7 16:33:36 PDT 2011; root:xnu-1504.15.3~1/RELEASE_I386 i386 BOOST_LIB_VERSION=1_49
Tue Dec 18 12:11:14 [initandlisten] options: { dbpath: "/Users/gianfranco/data/replset/node2", port: 40002, replSet: "myapp" }
Tue Dec 18 12:11:14 [initandlisten] journal dir=/Users/gianfranco/data/replset/node2/journal
Tue Dec 18 12:11:14 [initandlisten] recover : no journal files present, no recovery needed
Tue Dec 18 12:11:14 [websvr] admin web console waiting for connections on port 41002
...
Tue Dec 18 12:12:37 [repl writer worker 1] ERROR: writer worker caught exception: E11000 duplicate key error index: test.test.$a_1  dup key: { : 1.0 } on: { ts: Timestamp 1355832378000|1, h: 6182722510085346418, v: 2, op: "i", ns: "test.test", o: { _id: ObjectId('50d05c3a8a41536396de58f9'), a: 1.0 } }
Tue Dec 18 12:12:37 [repl writer worker 1]   Fatal Assertion 16360
0x10037637b 0x1000aeeb5 0x100351092 0x10004806d 0x1005a7823 0x7fff8e926742 0x7fff8e913181 
 0   mongod                              0x000000010037637b _ZN5mongo15printStackTraceERSo + 43
 1   mongod                              0x00000001000aeeb5 _ZN5mongo13fassertFailedEi + 165
 2   mongod                              0x0000000100351092 _ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE + 306
 3   mongod                              0x000000010004806d _ZN5mongo10threadpool6Worker4loopEv + 621
 4   mongod                              0x00000001005a7823 thread_proxy + 163
 5   libsystem_c.dylib                   0x00007fff8e926742 _pthread_start + 327
 6   libsystem_c.dylib                   0x00007fff8e913181 thread_start + 13
Tue Dec 18 12:12:37 [repl writer worker 1] 
 
***aborting after fassert() failure
 
 
Tue Dec 18 12:12:37 Got signal: 6 (Abort trap: 6).
 
Tue Dec 18 12:12:37 Backtrace:
0x10037637b 0x100001a6b 0x7fff8e9148ea 0x104c9e970 0x7fff8e96bdce 0x1000aeef0 0x100351092 0x10004806d 0x1005a7823 0x7fff8e926742 0x7fff8e913181 
 0   mongod                              0x000000010037637b _ZN5mongo15printStackTraceERSo + 43
 1   mongod                              0x0000000100001a6b _ZN5mongo10abruptQuitEi + 987
 2   libsystem_c.dylib                   0x00007fff8e9148ea _sigtramp + 26
 3   ???                                 0x0000000104c9e970 0x0 + 4375308656
 4   libsystem_c.dylib                   0x00007fff8e96bdce abort + 143
 5   mongod                              0x00000001000aeef0 _ZN5mongo13fassertFailedEi + 224
 6   mongod                              0x0000000100351092 _ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE + 306
 7   mongod                              0x000000010004806d _ZN5mongo10threadpool6Worker4loopEv + 621
 8   mongod                              0x00000001005a7823 thread_proxy + 163
 9   libsystem_c.dylib                   0x00007fff8e926742 _pthread_start + 327
 10  libsystem_c.dylib                   0x00007fff8e913181 thread_start + 13

Comment by auto [ 28/Jun/12 ]

Author:

{u'date': u'2012-06-28T12:56:20-07:00', u'email': u'milkie@10gen.com', u'name': u'Eric Milkie'}

Message: SERVER-6228 treat inserts into .system.indexes as commands

Index builds are not replicated as a 'c' op (command) but instead
as an 'i' op (insert) into the .system.indexes collection. The
insert code detects this and builds the index as part of the insert
of the record.
To make multithreaded replication work, we need to treat index builds
like commands so that they aren't intermixed with other write ops in
a batch.
Branch: master
https://github.com/mongodb/mongo/commit/028db4bb3ec21677033924100cb2dcd96678dadf

Comment by Bernie Hackett [ 27/Jun/12 ]

Logs for replica set members.

Comment by Bernie Hackett [ 27/Jun/12 ]

Also, the secondaries can't be restarted. They just crash again trying to replay the oplog.

Generated at Thu Feb 08 03:11:06 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.