[SERVER-9059] Fatal Assertion on replication Created: 21/Mar/13  Updated: 16/Nov/21  Resolved: 25/Mar/13

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Samuel Clay Assignee: J Rassi
Resolution: Duplicate Votes: 0
Labels: initialSync, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 64-bit 12.04, 32GB on Digital Ocean


Attachments: Text File mongodb.log     Text File mongodb_2nd_try.txt     Text File mongodb_3rd_try.txt    
Issue Links:
Duplicate
duplicates SERVER-9087 Index changes lost during initial syn... Closed
duplicates SERVER-6671 oplog is not strictly idempotent when... Closed
Steps To Reproduce:

rs.add('db03:27017')

...

<wait 6 hours for 100GB of data to replicate>

Participants:

 Description   

Hi, I'm having a hell of a time getting a new replica set member added.

I set up a brand new machine with MongoDB 2.4 and tried to add it as a new member of my replica set. The mongod crashed with a fatal assertion.

I have no idea what's causing this issue, but I got the exact same crash a few months ago when 2.2 came out. I repaired the db and STILL got this problem. And it only happens after the entire initial sync has finished (6 hours later).

I can't make any new copies of the data! And I need to start sharding because my service is exploding in growth. What am I supposed to do if I can't replicate?

Sam

Thu Mar 21 02:01:21.926 [repl writer worker 1] build index done. scanned 9 total records. 0.001 secs
Thu Mar 21 02:01:22.040 [rsSync] warning replset primary is an older version than we are; upgrade recommended
Thu Mar 21 02:01:25.229 [repl writer worker 1] ERROR: exception: E11000 duplicate key error index: newsblur.stories.$story_hash_1 dup key: { : "980568:65f04f" } on: { ts: Timestamp 1363810091000|32, h: -7663483269809030053, op: "i", ns: "newsblur.stories", o: { _id: ObjectId('514a172a0f55a7137913adea'), share_user_ids: {}, story_author_name: "", story_date: new Date(1363809963000), story_tags: {}, story_hash: "980568:65f04f", story_feed_id: 980568, story_permalink: "http://feedproxy.google.com/~r/DrudgeReportFeed/~3/Kn0_WSFPhAI/us-syria-crisis-lebanon-idUSBRE92J0AY20130320", story_title: "More Syrian shells land in Lebanon...", comment_user_ids: {}, story_content_z: BinData, story_guid: "http://www.reuters.com/article/2013/03/20/us-syria-crisis-lebanon-idUSBRE92J0AY20130320" } }
Thu Mar 21 02:01:25.230 [repl writer worker 1] Fatal Assertion 16361
0xdcae01 0xd8ab83 0xc1ecb6 0xd986b1 0xe13709 0x7f1fac48fe9a 0x7f1fab7a2cbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcae01]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8ab83]
/usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x166) [0xc1ecb6]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd986b1]
/usr/bin/mongod() [0xe13709]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f1fac48fe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f1fab7a2cbd]
Thu Mar 21 02:01:25.238 [repl writer worker 1]

***aborting after fassert() failure

Thu Mar 21 02:01:25.238 Got signal: 6 (Aborted).

Thu Mar 21 02:01:25.242 Backtrace:
0xdcae01 0x6ce879 0x7f1fab6e54a0 0x7f1fab6e5425 0x7f1fab6e8b8b 0xd8abbe 0xc1ecb6 0xd986b1 0xe13709 0x7f1fac48fe9a 0x7f1fab7a2cbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcae01]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6ce879]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f1fab6e54a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f1fab6e5425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f1fab6e8b8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8abbe]
/usr/bin/mongod(_ZN5mongo7replset21multiInitialSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x166) [0xc1ecb6]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd986b1]
/usr/bin/mongod() [0xe13709]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f1fac48fe9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f1fab7a2cbd]



 Comments   
Comment by J Rassi [ 07/May/13 ]

I can still create a new ticket, but I won't be able to provide much but the broken server's log file.

That would be a fine start.

If more convenient than uploading to JIRA, you may provide your logs via scp. Use the following command (hit enter at the password prompt):

scp -P 722 <file(s)> newsblur@www.10gen.com:

Comment by Samuel Clay [ 07/May/13 ]

The problem is that I don't have an easy way to get you the logs from a few days ago. The logs are enormous, as they contain millions of "connections" lines. I can still create a new ticket, but I won't be able to provide much but the broken server's log file.

Comment by J Rassi [ 07/May/13 ]

Sam, I don't see any new issues created from your account. Could we still help you diagnose this?

Comment by J Rassi [ 05/May/13 ]

I believe this is unrelated to the original problem encountered above. Would you open a new ticket for this issue? Please include the mongod log for db24 and db25 covering at least the last 24 hours, and the mongod log for db22 covering up to the last two weeks if available. Thanks.

Comment by Samuel Clay [ 05/May/13 ]

I'm getting a new assertion failure on replication, this time after I created a new index on the primary and had it replicate to the secondaries. After the secondaries finished, two of the three came up just fine. One machine did not, and it gave a stack trace. The broken machine is db22.newsblur.com (on MMS), and all machines are running 2.4.3.

Sun May 5 03:47:43.376 [rsSync] replSet still syncing, not yet to minValid optime 5185cf69:1
Sun May 5 03:47:43.524 [rsSync] replSet SECONDARY
Sun May 5 03:47:43.525 [repl prefetch worker] newsblur.stories Assertion failure _a != -1 src/mongo/db/ops/../pdfile.h 607
Sun May 5 03:47:43.525 [repl prefetch worker] newsblur.stories Assertion failure _a != -1 src/mongo/db/ops/../pdfile.h 607
0xdcf361 0xd902bd 0x9cdf38 0x9d6dbf 0xb2c0b4 0xb2cf55 0xc1ca02 0xd9cc01 0xe17cb9 0x7f22bb81be9a 0x7f22bab2bcbd
Sun May 5 03:47:43.525 [repl prefetch worker] newsblur.stories Assertion failure _a != -1 src/mongo/db/ops/../pdfile.h 607
(interleaved raw backtrace addresses from concurrent prefetch workers; the symbolized frames follow below)
Sun May 5 03:47:43.528 [repl prefetch worker] newsblur.stories Assertion failure _a != -1 src/mongo/db/ops/../pdfile.h 607
0xdcf361 0xd902bd 0x9cdf38 0x9d6dbf 0xb2c0b4 0xb2cf55 0xc1ca02 0xd9cc01 0xe17cb9 0x7f22bb81be9a 0x7f22bab2bcbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd902bd]
/usr/bin/mongod(_ZN5mongo18IndexInterfaceImplINS_12BtreeData_V1EE20beginInsertIntoIndexEiRNS_12IndexDetailsENS_7DiskLocERKNS_7BSONObjERKNS_8OrderingEb+0x118) [0x9cdf38]
/usr/bin/mongod(_ZN5mongo19fetchIndexInsertersERSt3setINS_7BSONObjENS_10BSONObjCmpESaIS1_EERNS_14IndexInterface13IndexInserterEPNS_16NamespaceDetailsEiRKS1_NS_7DiskLocEb+0x2cf) [0x9d6dbf]
/usr/bin/mongod(_ZN5mongo18prefetchIndexPagesEPNS_16NamespaceDetailsERKNS_7BSONObjE+0x724) [0xb2c0b4]
/usr/bin/mongod(_ZN5mongo28prefetchPagesForReplicatedOpERKNS_7BSONObjE+0x605) [0xb2cf55]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail10prefetchOpERKNS_7BSONObjE+0x202) [0xc1ca02]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd902bd]
/usr/bin/mongod(_ZN5mongo18IndexInterfaceImplINS_12BtreeData_V1EE20beginInsertIntoIndexEiRNS_12IndexDetailsENS_7DiskLocERKNS_7BSONObjERKNS_8OrderingEb+0x118) [0x9cdf38]
/usr/bin/mongod(_ZN5mongo19fetchIndexInsertersERSt3setINS_7BSONObjENS_10BSONObjCmpESaIS1_EERNS_14IndexInterface13IndexInserterEPNS_16NamespaceDetailsEiRKS1_NS_7DiskLocEb+0x2cf) [0x9d6dbf]
/usr/bin/mongod(_ZN5mongo18prefetchIndexPagesEPNS_16NamespaceDetailsERKNS_7BSONObjE+0x724) [0xb2c0b4]
/usr/bin/mongod(_ZN5mongo28prefetchPagesForReplicatedOpERKNS_7BSONObjE+0x605) [0xb2cf55]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail10prefetchOpERKNS_7BSONObjE+0x202) [0xc1ca02]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd902bd]
/usr/bin/mongod(_ZN5mongo18IndexInterfaceImplINS_12BtreeData_V1EE20beginInsertIntoIndexEiRNS_12IndexDetailsENS_7DiskLocERKNS_7BSONObjERKNS_8OrderingEb+0x118) [0x9cdf38]
/usr/bin/mongod(_ZN5mongo19fetchIndexInsertersERSt3setINS_7BSONObjENS_10BSONObjCmpESaIS1_EERNS_14IndexInterface13IndexInserterEPNS_16NamespaceDetailsEiRKS1_NS_7DiskLocEb+0x2cf) [0x9d6dbf]
/usr/bin/mongod(_ZN5mongo18prefetchIndexPagesEPNS_16NamespaceDetailsERKNS_7BSONObjE+0x724) [0xb2c0b4]
/usr/bin/mongod(_ZN5mongo28prefetchPagesForReplicatedOpERKNS_7BSONObjE+0x605) [0xb2cf55]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail10prefetchOpERKNS_7BSONObjE+0x202) [0xc1ca02]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd902bd]
/usr/bin/mongod(_ZN5mongo18IndexInterfaceImplINS_12BtreeData_V1EE20beginInsertIntoIndexEiRNS_12IndexDetailsENS_7DiskLocERKNS_7BSONObjERKNS_8OrderingEb+0x118) [0x9cdf38]
/usr/bin/mongod(_ZN5mongo19fetchIndexInsertersERSt3setINS_7BSONObjENS_10BSONObjCmpESaIS1_EERNS_14IndexInterface13IndexInserterEPNS_16NamespaceDetailsEiRKS1_NS_7DiskLocEb+0x2cf) [0x9d6dbf]
/usr/bin/mongod(_ZN5mongo18prefetchIndexPagesEPNS_16NamespaceDetailsERKNS_7BSONObjE+0x724) [0xb2c0b4]
/usr/bin/mongod(_ZN5mongo28prefetchPagesForReplicatedOpERKNS_7BSONObjE+0x605) [0xb2cf55]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail10prefetchOpERKNS_7BSONObjE+0x202) [0xc1ca02]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]
Sun May 5 03:47:43.552 [repl writer worker 1] newsblur.stories Assertion failure _a != -1 src/mongo/db/ops/../pdfile.h 607
0xdcf361 0xd902bd 0x9cdf38 0x9d6dbf 0x9d8c22 0xac0009 0xac130f 0xa8b72a 0xa8d787 0xa6956a 0xc1fb13 0xc1f258 0xd9cc01 0xe17cb9 0x7f22bb81be9a 0x7f22bab2bcbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xd902bd]
/usr/bin/mongod(_ZN5mongo18IndexInterfaceImplINS_12BtreeData_V1EE20beginInsertIntoIndexEiRNS_12IndexDetailsENS_7DiskLocERKNS_7BSONObjERKNS_8OrderingEb+0x118) [0x9cdf38]
/usr/bin/mongod(_ZN5mongo19fetchIndexInsertersERSt3setINS_7BSONObjENS_10BSONObjCmpESaIS1_EERNS_14IndexInterface13IndexInserterEPNS_16NamespaceDetailsEiRKS1_NS_7DiskLocEb+0x2cf) [0x9d6dbf]
/usr/bin/mongod(_ZN5mongo24indexRecordUsingTwoStepsEPKcPNS_16NamespaceDetailsENS_7BSONObjENS_7DiskLocEb+0x2d2) [0x9d8c22]
/usr/bin/mongod(_ZN5mongo11DataFileMgr6insertEPKcPKvibbbPb+0x1239) [0xac0009]
/usr/bin/mongod(_ZN5mongo11DataFileMgr16insertWithObjModEPKcRNS_7BSONObjEbb+0x4f) [0xac130f]
/usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x2efa) [0xa8b72a]
/usr/bin/mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0xa8d787]
/usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0x65a) [0xa6956a]
/usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc1fb13]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc1f258]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]
Sun May 5 03:47:43.562 [repl writer worker 1] ERROR: writer worker caught exception: assertion src/mongo/db/ops/../pdfile.h:607 on: { ts: Timestamp 1367723893000|1, h: 256075113332974832, v: 2, op: "i", ns: "newsblur.stories", o: { _id: ObjectId('5185cf75409b612f1631dce1'), share_user_ids: {}, story_author_name: "", story_date: new Date(1367723893192), story_tags: {}, story_hash: "6:388392", story_feed_id: 6, story_permalink: "http://lcamtuf.blogspot.com/2013/05/some-harmless-old-fashioned-fun-with-css.html", story_title: "Some harmless, old-fashioned fun with CSS", comment_user_ids: {}, story_content_z: BinData, story_guid: "http://lcamtuf.blogspot.com/2013/05/some-harmless-old-fashioned-fun-with-css.html" } }
Sun May 5 03:47:43.562 [repl writer worker 1] Fatal Assertion 16360
0xdcf361 0xd8f0d3 0xc1f33c 0xd9cc01 0xe17cb9 0x7f22bb81be9a 0x7f22bab2bcbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd8f0d3]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc1f33c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]
Sun May 5 03:47:43.566 [repl writer worker 1]

***aborting after fassert() failure

Sun May 5 03:47:43.566 Got signal: 6 (Aborted).

Sun May 5 03:47:43.568 Backtrace:
0xdcf361 0x6cf729 0x7f22baa6e4a0 0x7f22baa6e425 0x7f22baa71b8b 0xd8f10e 0xc1f33c 0xd9cc01 0xe17cb9 0x7f22bb81be9a 0x7f22bab2bcbd
/usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdcf361]
/usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6cf729]
/lib/x86_64-linux-gnu/libc.so.6(+0x364a0) [0x7f22baa6e4a0]
/lib/x86_64-linux-gnu/libc.so.6(gsignal+0x35) [0x7f22baa6e425]
/lib/x86_64-linux-gnu/libc.so.6(abort+0x17b) [0x7f22baa71b8b]
/usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd8f10e]
/usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc1f33c]
/usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xd9cc01]
/usr/bin/mongod() [0xe17cb9]
/lib/x86_64-linux-gnu/libpthread.so.0(+0x7e9a) [0x7f22bb81be9a]
/lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f22bab2bcbd]

Comment by J Rassi [ 22/Mar/13 ]

We have identified the root cause of the index lookup failures that occur after a successful initial sync. The details of the problem, along with the plan for a release containing the fix, can be found in SERVER-9087.

In short, you will need to run mongod with --repair on any 2.4.0 nodes that have done an initial sync, to fix the indexing issue. The repair will rebuild the indexes from the clean data, which was not affected by the problem.
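For illustration, a minimal sketch of that repair step, assuming the stock Ubuntu package layout; the service name and dbpath shown here are assumptions and may differ on your nodes:

// Offline repair of an affected node (the mongod lines are run from a system shell, not the mongo shell):
//   sudo service mongodb stop
//   mongod --dbpath /var/lib/mongodb --repair
//   sudo service mongodb start
// Alternatively, a single database can be repaired from a mongo shell connection to that node:
db.getSiblingDB("newsblur").repairDatabase()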

If you'd like any help with this process, please feel free to call us (646-201-9247), or feel free to email me a phone number to reach you at.

Comment by J Rassi [ 22/Mar/13 ]

As an aside, make sure you're running with an odd number of members (if you're still on four, you'll want to add an arbiter).

Comment by J Rassi [ 22/Mar/13 ]

Could you connect to db21 and run the following (to increase the verbosity of the log output):

db.adminCommand({setParameter:1, logLevel:2})

Please post the log output the next time you see that assertion (you can reset the verbosity back to 0 afterwards).
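For reference, a sketch of the reset using the same mechanism (0 is the default verbosity):

db.adminCommand({ setParameter: 1, logLevel: 0 })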

Comment by Samuel Clay [ 22/Mar/13 ]

Yes, the cluster is stable. All four nodes are up, and the three secondaries are replicating. But I'm getting a few of these assertions. They don't seem to be taking down the server, though.

Comment by J Rassi [ 22/Mar/13 ]

Looking into it.

Is db21 still replicating? And db05/db20?

Comment by Samuel Clay [ 22/Mar/13 ]

This server is called db21, syncing from the same db04/db05 (primary/secondary).

Comment by Samuel Clay [ 22/Mar/13 ]

Bad news: I added another member to the replica set and got a different assertion error after replicating. See the attached mongodb_3rd_try.txt.

Assertion: 10334:BSONObj size: -286331154 (0xEEEEEEEE) is invalid. Size must be between 0 and 16793600(16MB) First element: _id: ObjectId('514abf3b8210885b8cdd197a')

Comment by J Rassi [ 22/Mar/13 ]

Glad the node is back up and running. Closing this ticket; feel free to re-open if you have further questions related to this.

Comment by Samuel Clay [ 22/Mar/13 ]

Thank you guys, everything worked.

Comment by Samuel Clay [ 22/Mar/13 ]

Yup, that's exactly what I did. I'm now 40 minutes into a 3 hour replication. (Took 6 hours the first time pre-repair, now takes 3 hours.)

Comment by J Rassi [ 22/Mar/13 ]

This is a known issue that unfortunately still affects the current version. I can suggest an ugly workaround, however: there's a roughly 3-second window in which the new member is in state STARTUP2, after it has a copy of the replication configuration but before it has chosen a sync target. During this window you can run rs.syncFrom().

So: run rs.add('db20:27017') on db04, watch the mongod output on db20, and as soon as you see the line "[rsStart] replSet STARTUP2", run rs.syncFrom('db05:27017') on db20 from your other shell. It's okay if you run the syncFrom too early; just try again.

If you're successful, you should see the line "[rsSync] replSet syncing to: db05:27017 by request" in the log soon after.
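Putting the steps above together, a minimal sketch using the host names from this thread, assuming two mongo shells are open (one connected to db04, one to db20):

// Shell 1, connected to the primary (db04):
rs.add('db20:27017')

// Shell 2, connected to db20, the moment its log prints "[rsStart] replSet STARTUP2"
// (re-run if it errors because it was issued too early):
rs.syncFrom('db05:27017')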

Comment by Samuel Clay [ 22/Mar/13 ]

Now that I wiped the data dir, how do I use rs.syncFrom('db05:27017')? If I use it on db20, I get this message:

{ "startupStatus" : 3, "info" : "run rs.initiate(...) if not yet done for the set", "ok" : 0, "errmsg" : "can't get local.system.replset config from self or any seed (EMPTYCONFIG)" }
Comment by J Rassi [ 22/Mar/13 ]

Yes, please retry your initial sync using an empty directory as the dbpath.

The server does wipe user data before performing an initial sync (and no, there's no way to perform an initial sync that skips this step), but it does not drop the "local" database, which contains the oplog. I believe that the oplog on db20 contains entries generated from a 2.2.0 mongod (the log snippet that you uploaded hints at this), which would result in you running into the linked issue even when running 2.4.0. If you perform a clean 2.4.0 => 2.4.0 initial sync, then the only oplog entries that will be applied will be those that occurred since the start of the initial sync, which will necessarily include no 2.2.0 oplog entries.
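If it helps to verify this before re-syncing, a sketch of how to inspect the oldest entry left in db20's oplog from a mongo shell (on a replica-set member the oplog lives in local.oplog.rs); if the oldest entry's ts predates the 2.4.0 upgrade, 2.2.0-era entries are still present:

// Oldest remaining oplog entry on this member
db.getSiblingDB('local').oplog.rs.find().sort({ $natural: 1 }).limit(1)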

Comment by Samuel Clay [ 22/Mar/13 ]

No, MongoDB deleted everything (or so I think). Should I try again? The data is all there on db20; is there any way to salvage what's already there? This takes 5+ hours each time.

Comment by J Rassi [ 22/Mar/13 ]

When you tried performing the initial sync using 2.4.0, did you start mongod with a completely empty directory as the dbpath?

Comment by J Rassi [ 22/Mar/13 ]

Sent. Feel free to reply with host details in the email, but we'll keep this ticket as the main channel for the issue, assuming that's okay with you.

Comment by Samuel Clay [ 22/Mar/13 ]

They are heavily used servers. The logs are huge, but they are filled to the brim with connection/end connection statements, as well as insertions/deletions/getmores.

If you want to ssh in to read the logs, it's 91GB (whoops, forgot to set up logrotate.d); just send me your ssh key: samuel@newsblur.com

Comment by J Rassi [ 22/Mar/13 ]

Can you post logs for db04 and db05? As much as you have from today (and from other runs of db20, if you have them).

Comment by J Rassi [ 22/Mar/13 ]

We're taking a look at the log.

Comment by Samuel Clay [ 21/Mar/13 ]

Bad news. Same fatal assertion. See the attached mongodb_2nd_try.txt. And just to confirm, every member of the set is running MongoDB 2.4, and I repaired the primary database just yesterday.

Comment by J Rassi [ 21/Mar/13 ]

Can you upload the current mongod log for db20, so we can double-check what's going on?

Comment by Samuel Clay [ 21/Mar/13 ]

Never mind, I found the syncFrom command (http://docs.mongodb.org/manual/reference/method/rs.syncFrom/#rs.syncFrom). It was a bit tricky to figure out, but I went with this method:

  • Run rs.syncFrom('db05:27017') on db20 (forcing it to sync from the secondary)
  • Run rs.add('db20:27017') on db04 while db20 is waiting in the syncFrom call.

Replication has started. I'll check back here in 6 hours. Thanks!

Comment by Samuel Clay [ 21/Mar/13 ]

Yes, that is all correct. I'll give it another try immediately. How can I force sync from another node?

Comment by J Rassi [ 21/Mar/13 ]

Can you confirm the following list of events:

  • db04 and db05 were running mongod 2.2.0
  • db20 was brought up with mongod 2.4.0, which attempted to perform an initial sync
  • db20's initial sync failed
  • db04 and db05 were upgraded to mongod 2.4.0

Assuming that this was the case, you are likely encountering SERVER-6671, which affects 2.2.0 (and was fixed in 2.2.1). Since your cluster is now fully upgraded to 2.4.0, you won't run into this issue during the initial sync.

If you would like help syncing from another node or avoiding an initial sync, let us know.

Comment by Samuel Clay [ 21/Mar/13 ]

They were running 2.2, but they are now both on 2.4. I can't perform another initial sync without waiting 6 hours and severely impacting performance. There used to be an option to choose which member to perform the initial sync from, but that option is gone. If I could sync from the secondary, I wouldn't take as big a performance hit, but my primary is on faster hardware, so it gets selected by default.

Comment by J Rassi [ 21/Mar/13 ]

What version of mongod are db04 and db05 running?
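(For the record, a quick way to check is to connect a mongo shell to each member and run the following; a sketch, to be run against both db04 and db05:)

// reports the server version of the connected mongod
db.version()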

Comment by Samuel Clay [ 21/Mar/13 ]

nbset:PRIMARY> rs.status()
{
    "set" : "nbset",
    "date" : ISODate("2013-03-21T17:43:20Z"),
    "myState" : 1,
    "members" : [
        {
            "_id" : 2,
            "name" : "db05:27017",
            "health" : 1,
            "state" : 2,
            "stateStr" : "SECONDARY",
            "uptime" : 43549,
            "optime" : { "t" : 1363887799, "i" : 99 },
            "optimeDate" : ISODate("2013-03-21T17:43:19Z"),
            "lastHeartbeat" : ISODate("2013-03-21T17:43:19Z"),
            "lastHeartbeatRecv" : ISODate("2013-03-21T17:43:20Z"),
            "pingMs" : 0
        },
        {
            "_id" : 8,
            "name" : "db04:27017",
            "health" : 1,
            "state" : 1,
            "stateStr" : "PRIMARY",
            "uptime" : 43728,
            "optime" : { "t" : 1363887800, "i" : 72 },
            "optimeDate" : ISODate("2013-03-21T17:43:20Z"),
            "self" : true
        }
    ],
    "ok" : 1
}

Comment by J Rassi [ 21/Mar/13 ]

Hi Sam,

I'd like to collect more information to help diagnose this issue:

  • How many members are in your replica set? What version of mongod is each running?
  • Could you paste the output of running rs.status() on the primary?
  • Could you upload the full mongod log for the member that threw the fatal assertion, as an attachment to this ticket?

~ Jason Rassi
