[SERVER-11534] corruption of a secondary member Created: 01/Nov/13  Updated: 11/Jul/16  Resolved: 23/Feb/14

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: 2.4.6
Fix Version/s: None

Type: Bug Priority: Blocker - P1
Reporter: Julien Bachmann Assignee: Bruce Lucas (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 13.04 running on an AWS instance with data on an EBS volume formatted with ext4.


Attachments: Text File primary_logs.txt     Text File secondary_logs.txt    
Operating System: Linux
Participants:

 Description   

I have a replica set with a primary running MongoDB 2.2, a secondary running 2.4.6, and an arbiter.

The secondary just crashed and cannot restart due to corruption.

Here is the error log:

Fri Nov  1 22:20:15.270 [repl prefetch worker] Assertion: 10334:BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO
0xdddd81 0xd9f55b 0xd9fa9c 0x6ed13f 0x90bdd4 0xb304cd 0xb321c2 0xc23602 0xdab721 0xe26609 0x7f5288704f8e 0x7f5287a08e1d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd9f55b]
 /usr/bin/mongod() [0xd9fa9c]
 /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ed13f]
 /usr/bin/mongod(_ZN5mongo7Helpers8findByIdERNS_6ClientEPKcNS_7BSONObjERS5_PbS7_+0xcd4) [0x90bdd4]
 /usr/bin/mongod(_ZN5mongo19prefetchRecordPagesEPKcRKNS_7BSONObjE+0x44d) [0xb304cd]
 /usr/bin/mongod(_ZN5mongo28prefetchPagesForReplicatedOpERKNS_7BSONObjE+0x6e2) [0xb321c2]
 /usr/bin/mongod(_ZN5mongo7replset8SyncTail10prefetchOpERKNS_7BSONObjE+0x202) [0xc23602]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
 /usr/bin/mongod() [0xe26609]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e) [0x7f5288704f8e]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5287a08e1d]
Fri Nov  1 22:20:15.276 [repl writer worker 1] Assertion: 10334:BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO
0xdddd81 0xd9f55b 0xd9fa9c 0x6ed13f 0xa8f587 0xa924c7 0xa72449 0xc273d3 0xc26b18 0xdab721 0xe26609 0x7f5288704f8e 0x7f5287a08e1d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
 /usr/bin/mongod(_ZN5mongo11msgassertedEiPKc+0x9b) [0xd9f55b]
 /usr/bin/mongod() [0xd9fa9c]
 /usr/bin/mongod(_ZNK5mongo7BSONObj14_assertInvalidEv+0x5bf) [0x6ed13f]
 /usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x2067) [0xa8f587]
 /usr/bin/mongod(_ZN5mongo27updateObjectsForReplicationEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0xa924c7]
 /usr/bin/mongod(_ZN5mongo21applyOperation_inlockERKNS_7BSONObjEbb+0xb39) [0xa72449]
 /usr/bin/mongod(_ZN5mongo7replset8SyncTail9syncApplyERKNS_7BSONObjEb+0x713) [0xc273d3]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x48) [0xc26b18]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
 /usr/bin/mongod() [0xe26609]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e) [0x7f5288704f8e]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5287a08e1d]
Fri Nov  1 22:20:15.282 [repl writer worker 1] ERROR: writer worker caught exception: BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO on: { ts: Timestamp 1383334273000|29, h: -7997199756652111606, v: 2, op: "u", ns: "webdoc_production.digest_mails", o2: { _id: ObjectId('5274056f1cbbe237b9000001') }, o: { $set: { sent_at: new Date(1383335279245) } } }
Fri Nov  1 22:20:15.282 [repl writer worker 1]   Fatal Assertion 16360
0xdddd81 0xd9dc13 0xc26bfc 0xdab721 0xe26609 0x7f5288704f8e 0x7f5287a08e1d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xa3) [0xd9dc13]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
 /usr/bin/mongod() [0xe26609]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e) [0x7f5288704f8e]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5287a08e1d]
Fri Nov  1 22:20:15.289 [repl writer worker 1] 
 
***aborting after fassert() failure
 
 
Fri Nov  1 22:20:15.289 Got signal: 6 (Aborted).
 
Fri Nov  1 22:20:15.299 Backtrace:
0xdddd81 0x6d0d29 0x7f52879460b0 0x7f5287946037 0x7f5287949698 0xd9dc4e 0xc26bfc 0xdab721 0xe26609 0x7f5288704f8e 0x7f5287a08e1d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xdddd81]
 /usr/bin/mongod(_ZN5mongo10abruptQuitEi+0x399) [0x6d0d29]
 /lib/x86_64-linux-gnu/libc.so.6(+0x370b0) [0x7f52879460b0]
 /lib/x86_64-linux-gnu/libc.so.6(gsignal+0x37) [0x7f5287946037]
 /lib/x86_64-linux-gnu/libc.so.6(abort+0x148) [0x7f5287949698]
 /usr/bin/mongod(_ZN5mongo13fassertFailedEi+0xde) [0xd9dc4e]
 /usr/bin/mongod(_ZN5mongo7replset14multiSyncApplyERKSt6vectorINS_7BSONObjESaIS2_EEPNS0_8SyncTailE+0x12c) [0xc26bfc]
 /usr/bin/mongod(_ZN5mongo10threadpool6Worker4loopEv+0x281) [0xdab721]
 /usr/bin/mongod() [0xe26609]
 /lib/x86_64-linux-gnu/libpthread.so.0(+0x7f8e) [0x7f5288704f8e]
 /lib/x86_64-linux-gnu/libc.so.6(clone+0x6d) [0x7f5287a08e1d]
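
For context, assertion 10334 fires when a document's 4-byte length prefix fails a sanity check: every BSON document begins with a little-endian int32 holding its total size, so a record whose bytes were zeroed out reads back as size 0. A minimal sketch of that check in Python (an illustration of the mechanism, not MongoDB source; the 16793600 limit is taken from the log message above):

import struct

BSON_SIZE_LIMIT = 16793600  # 16MB plus padding, per the log message

def validate_bson_size(buf: bytes) -> int:
    # The first 4 bytes of a BSON document hold its total length, little-endian.
    size = struct.unpack_from("<i", buf, 0)[0]
    if size < 5 or size > BSON_SIZE_LIMIT:  # 5 = length field + trailing EOO byte
        raise ValueError(f"BSONObj size: {size} is invalid")
    return size

try:
    validate_bson_size(b"\x00" * 16)  # an all-zero record, as in the crash
except ValueError as e:
    print(e)  # -> BSONObj size: 0 is invalid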



 Comments   
Comment by Bruce Lucas (Inactive) [ 23/Feb/14 ]

Hi Julien,

As it has been a while since we've heard from you, I'll close this ticket. Please feel free to re-open it if/when you want to pursue a root cause analysis as outlined above.

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 07/Jan/14 ]

Hi Julien,

Happy New Year! Just checking in with you again to find out if you are still interested in pursuing a root cause for the corruption you saw.

Thanks,
Bruce

Comment by Bruce Lucas (Inactive) [ 24/Dec/13 ]

Hi Julien,

Sorry for the delay in responding. Are you still interested in working with us to investigate the problem? To further the investigation, we could look at any of the following that you can provide:

  • all available mongod logs for that node going as far back before the crash as possible
  • all available syslogs (/var/log/syslog* on Ubuntu) also going as far back before the crash as possible
  • the database files themselves

We can provide a secure, private location for uploading this information; just let us know if you want to proceed.

The messages are consistent with the data at the location the _id index points to containing 0s instead of the expected data. One possible cause we can investigate is whether an I/O error prevented the data from being written. The logs and db files might contain evidence of such an error, or, if not, may suggest a different cause.
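
As a rough first pass before uploading anything, a data file can be scanned for runs of zeroed blocks, which would be consistent with writes that never reached disk. A hedged helper sketch (not an official MongoDB tool; the 4 KB block size and the file path are illustrative assumptions):

def find_zero_blocks(path: str, block_size: int = 4096) -> list[int]:
    # Return the byte offsets of blocks that are entirely zero.
    zero = bytes(block_size)
    offsets = []
    with open(path, "rb") as f:
        offset = 0
        while chunk := f.read(block_size):
            if chunk == zero:
                offsets.append(offset)
            offset += len(chunk)
    return offsets

print(find_zero_blocks("/data/db/webdoc_production.0"))  # hypothetical path

A long run of consecutive zero blocks in the middle of a data file would support the I/O-error theory.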

Happy Holidays!
Bruce

Comment by Julien Bachmann [ 05/Nov/13 ]

I can try running repairDatabase, but that is not the point. I have already added a new secondary from a previous backup, so for now things are fine.

My concern is that I do not want this issue to happen again in the future, as it will require work to repair the problem every time.

I think I should be able to delete a lot of data from a database without any problem. Also, I do not control where an index points; that is mongod's business.

So for me, this problem requires a fix on your side.

Comment by Ranjay Krishna [ 05/Nov/13 ]

This problem might occur if a large amount of data was recently deleted, or if your index is pointing to a location that no longer contains an object.
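
As a toy model of that failure mode (an assumption-laden sketch, not MongoDB's storage engine): the _id index still maps a key to a record location, but the bytes at that location have been zeroed, so the record's length prefix reads back as 0.

# Pretend data file: the record bytes at offset 0 have been zeroed.
records = bytearray(64)
# The _id index still points at the old location (hypothetical structure).
id_index = {"5274056f1cbbe237b9000001": 0}

offset = id_index["5274056f1cbbe237b9000001"]
size = int.from_bytes(records[offset:offset + 4], "little")
assert size == 0  # matches "BSONObj size: 0 (0x00000000) is invalid"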

Please run:

db.repairDatabase() 

on the secondary and let us know if you still encounter the problem.

Comment by Julien Bachmann [ 04/Nov/13 ]

I just uploaded logs of the primary and the secondary. We can see in the primary log that it issued the following info:

info DFM::findAll(): extent c:7fa38000 was empty, skipping ahead. ns:webdoc_production.digest_mails

and in the secondary the crash log says:

ERROR: writer worker caught exception: BSONObj size: 0 (0x00000000) is invalid. Size must be between 0 and 16793600(16MB) First element: EOO on: { ts: Timestamp 1383334273000|29, h: -7997199756652111606, v: 2, op: "u", ns: "webdoc_production.digest_mails", o2: { _id: ObjectId('5274056f1cbbe237b9000001') }, o: { $set: { sent_at: new Date(1383335279245) } } }
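
For readers decoding that entry: op "u" is an update, o2 is the match criterion, and o is the update document. A hedged pymongo equivalent of what the entry represents (illustrative only; the server applies oplog entries internally, not through a driver, and the connection string is hypothetical):

from datetime import datetime, timezone
from bson import ObjectId
from pymongo import MongoClient

client = MongoClient("mongodb://localhost:27017")  # hypothetical connection
client["webdoc_production"]["digest_mails"].update_one(
    {"_id": ObjectId("5274056f1cbbe237b9000001")},  # o2: which document
    {"$set": {"sent_at": datetime.fromtimestamp(1383335279.245, tz=timezone.utc)}},  # o: the change
)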

It looks like they are related, but I am not sure.

Otherwise, the primary's build info is:

"version" : "2.2.2",
"gitVersion" : "d1b43b61a5308c4ad0679d34b262c5af9d664267",
"sysInfo" : "Linux ip-10-2-29-40 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_49",
"versionArray" : [
2,
2,
2,
0
],
"bits" : 64,
"debug" : false,
"maxBsonObjectSize" : 16777216,
"ok" : 1

And the secondary's info is:

db version v2.4.6
version: b9925db5eac369d77a3a5f5d98a145eaaacd9673

The secondary was a fresh install done by following these instructions:
http://docs.mongodb.org/manual/tutorial/install-mongodb-on-ubuntu/
It was then added empty to the replica set, so a full sync was done.

Comment by Julien Bachmann [ 04/Nov/13 ]

The logs of the secondary when it failed.

Comment by Julien Bachmann [ 04/Nov/13 ]

The logs of the primary when the secondary failed.

Comment by Eliot Horowitz (Inactive) [ 02/Nov/13 ]

Can you send the logs from the primary and secondary at this time?
What is the history of the secondary? Is it brand new, or was it upgraded?
What version of 2.2 is the primary on?
