[SERVER-3283] [replica set sync] local.oplog.rs Assertion failure isOk() db/pdfile.h 259 Created: 17/Jun/11  Updated: 30/Mar/12  Resolved: 05/Oct/11

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 1.8.2
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Michael Conigliaro
Assignee: Kristina Chodorow (Inactive)
Resolution: Won't Fix
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Ubuntu on EC2
Operating System: Linux
Participants:

 Description   

I started seeing this assertion error on one of the slaves in one of my replica sets. It seems to have caused replication to stop.

Fri Jun 17 04:15:24 [conn1059] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded68a18601f9f32f826c44') }

reslen:60 300009ms
Fri Jun 17 04:15:25 [conn924] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded637aef4335230f35faf2') }

reslen:60 300007ms
Fri Jun 17 04:15:25 [conn846] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded63529d03464f291491fd') }

reslen:60 300009ms
Fri Jun 17 04:15:25 [conn841] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded633d2179e8c686856149') }

reslen:60 300009ms
Fri Jun 17 04:15:27 [conn837] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded6336eb7379566e2c7ec9') }

reslen:60 300010ms
Fri Jun 17 04:15:52 [conn7822] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4dee9a6cd4822fa37bc008fe') }

reslen:60 300007ms
Fri Jun 17 04:15:56 [conn551] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded55b540adf833f570a90d') }

reslen:60 300008ms
Fri Jun 17 04:16:15 [dur] lsn set 445741544
Fri Jun 17 04:16:27 [conn850] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded635aca11c4f8c54ea25d') }

reslen:60 300009ms
Fri Jun 17 04:17:06 [conn29103] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4df25181d059bd052edf5abd') }

reslen:60 300008ms
Fri Jun 17 04:17:15 [dur] lsn set 445771580
Fri Jun 17 04:17:16 [conn919] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded63638a912a64992a16e3') }

reslen:60 300008ms
Fri Jun 17 04:17:18 [conn840] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded6345ccacebb71829396e') }

reslen:60 300008ms
Fri Jun 17 04:17:18 [conn917] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded636c6b00c1f2c76d1dbe') }

reslen:60 300008ms
Fri Jun 17 04:17:23 [conn833] query admin.$cmd ntoreturn:1 command:

{ writebacklisten: ObjectId('4ded632f2bbe6d881ab63b90') }

reslen:60 300008ms
Fri Jun 17 04:17:44 [replica set sync] local.oplog.rs Assertion failure isOk() db/pdfile.h 259
0x54f08e 0x5606dd 0x64dbbe 0x76f563 0x772768 0x73dc97 0x732135 0x73217a 0x76fe5d 0x710f23 0x7080e5 0x7084a8 0x70852c 0x70b342 0x8baea0 0x7ff247ff69ca 0x7ff2475a570d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xae) [0x54f08e]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfd) [0x5606dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr9getExtentERKNS_7DiskLocE+0x8e) [0x64dbbe]
/usr/bin/mongod(_ZN5mongo11DataFileMgr13_deleteRecordEPNS_16NamespaceDetailsEPKcPNS_6RecordERKNS_7DiskLocE+0x113) [0x76f563]
/usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPKcPNS_6RecordERKNS_7DiskLocEbb+0x198) [0x772768]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x377) [0x73dc97]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x25) [0x732135]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3a) [0x73217a]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x18d) [0x76fe5d]
/usr/bin/mongod(_ZN5mongo11_logOpObjRSERKNS_7BSONObjE+0x253) [0x710f23]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl8syncTailEv+0x1b25) [0x7080e5]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0xc8) [0x7084a8]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x3c) [0x70852c]
/usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x252) [0x70b342]
/usr/bin/mongod(thread_proxy+0x80) [0x8baea0]
/lib/libpthread.so.0(+0x69ca) [0x7ff247ff69ca]
/lib/libc.so.6(clone+0x6d) [0x7ff2475a570d]
Fri Jun 17 04:17:44 [replica set sync] replSet syncThread: 0 assertion db/pdfile.h:259
Fri Jun 17 04:17:55 [replica set sync] local.oplog.rs Assertion failure isOk() db/pdfile.h 259
0x54f08e 0x5606dd 0x64dbbe 0x76f563 0x772768 0x73dc97 0x732135 0x73217a 0x76fe5d 0x710f23 0x7080e5 0x7084a8 0x70852c 0x70b342 0x8baea0 0x7ff247ff69ca 0x7ff2475a570d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xae) [0x54f08e]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfd) [0x5606dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr9getExtentERKNS_7DiskLocE+0x8e) [0x64dbbe]
/usr/bin/mongod(_ZN5mongo11DataFileMgr13_deleteRecordEPNS_16NamespaceDetailsEPKcPNS_6RecordERKNS_7DiskLocE+0x113) [0x76f563]
/usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPKcPNS_6RecordERKNS_7DiskLocEbb+0x198) [0x772768]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x377) [0x73dc97]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x25) [0x732135]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3a) [0x73217a]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x18d) [0x76fe5d]
/usr/bin/mongod(_ZN5mongo11_logOpObjRSERKNS_7BSONObjE+0x253) [0x710f23]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl8syncTailEv+0x1b25) [0x7080e5]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0xc8) [0x7084a8]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x3c) [0x70852c]
/usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x252) [0x70b342]
/usr/bin/mongod(thread_proxy+0x80) [0x8baea0]
/lib/libpthread.so.0(+0x69ca) [0x7ff247ff69ca]
/lib/libc.so.6(clone+0x6d) [0x7ff2475a570d]
Fri Jun 17 04:17:55 [replica set sync] replSet syncThread: 0 assertion db/pdfile.h:259

Restarting the slave has had no effect, as you can see:

Fri Jun 17 15:26:24 [initandlisten] MongoDB starting : pid=27662 port=27018 dbpath=/var/lib/mongodb 64-bit
Fri Jun 17 15:26:24 [initandlisten] db version v1.8.2-rc3, pdfile version 4.5
Fri Jun 17 15:26:24 [initandlisten] git version: 2d7719228787c9c8100456bc70bf860ec2885732
Fri Jun 17 15:26:24 [initandlisten] build sys info: Linux bs-linux64.10gen.cc 2.6.21.7-2.ec2.v1.2.fc8xen #1 SMP Fri Nov 20 17:48:28 EST 2009 x86_64 BOOST_LIB_VERSION=1_41
Fri Jun 17 15:26:24 [initandlisten] journal dir=/var/lib/mongodb/journal
Fri Jun 17 15:26:24 [initandlisten] recover : no journal files present, no recovery needed
Fri Jun 17 15:26:24 [initandlisten] preallocateIsFaster=true 2.44
Fri Jun 17 15:26:24 [initandlisten] preallocateIsFaster=true 2.6
Fri Jun 17 15:26:24 [initandlisten] preallocateIsFaster=true 2.5
Fri Jun 17 15:26:24 [initandlisten] waiting for connections on port 27018
Fri Jun 17 15:26:24 [websvr] web admin interface listening on port 28018
Fri Jun 17 15:26:24 [startReplSets] replSet STARTUP2
Fri Jun 17 15:26:24 [rs Manager] replSet can't see a majority, will not try to elect self
Fri Jun 17 15:26:24 [replica set sync] replSet SECONDARY
Fri Jun 17 15:26:24 [ReplSetHealthPollTask] replSet info mongodb09.example.com:27018 is up
Fri Jun 17 15:26:24 [ReplSetHealthPollTask] replSet member mongodb09.example.com:27018 ARBITER
Fri Jun 17 15:26:24 [rs Manager] replSet info not electing self, we are not freshest
Fri Jun 17 15:26:26 [ReplSetHealthPollTask] replSet info mongodb07.example.com:27018 is up
Fri Jun 17 15:26:26 [ReplSetHealthPollTask] replSet member mongodb07.example.com:27018 PRIMARY
Fri Jun 17 15:26:26 [replica set sync] local.oplog.rs Assertion failure isOk() db/pdfile.h 259
0x54f08e 0x5606dd 0x64dbbe 0x76f563 0x772768 0x73dc97 0x732135 0x73217a 0x76fe5d 0x710f23 0x7080e5 0x7084a8 0x70852c 0x70b342 0x8baea0 0x7f2a569f89ca 0x7f2a55fa770d
/usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0xae) [0x54f08e]
/usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfd) [0x5606dd]
/usr/bin/mongod(_ZN5mongo11DataFileMgr9getExtentERKNS_7DiskLocE+0x8e) [0x64dbbe]
/usr/bin/mongod(_ZN5mongo11DataFileMgr13_deleteRecordEPNS_16NamespaceDetailsEPKcPNS_6RecordERKNS_7DiskLocE+0x113) [0x76f563]
/usr/bin/mongod(_ZN5mongo11DataFileMgr12deleteRecordEPKcPNS_6RecordERKNS_7DiskLocEbb+0x198) [0x772768]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x377) [0x73dc97]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x25) [0x732135]
/usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3a) [0x73217a]
/usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x18d) [0x76fe5d]
/usr/bin/mongod(_ZN5mongo11_logOpObjRSERKNS_7BSONObjE+0x253) [0x710f23]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl8syncTailEv+0x1b25) [0x7080e5]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl11_syncThreadEv+0xc8) [0x7084a8]
/usr/bin/mongod(_ZN5mongo11ReplSetImpl10syncThreadEv+0x3c) [0x70852c]
/usr/bin/mongod(_ZN5mongo15startSyncThreadEv+0x252) [0x70b342]
/usr/bin/mongod(thread_proxy+0x80) [0x8baea0]
/lib/libpthread.so.0(+0x69ca) [0x7f2a569f89ca]
/lib/libc.so.6(clone+0x6d) [0x7f2a55fa770d]
Fri Jun 17 15:26:26 [replica set sync] replSet syncThread: 0 assertion db/pdfile.h:259
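
A quick way to confirm that a member in this state has stopped applying oplog entries is to compare member optimes from the shell. A minimal sketch, assuming the standard shell helpers of this era and the port 27018 shown in the log:

# Run against the stuck secondary: if its optimeDate stops advancing while the
# primary's keeps moving, replication has stalled on this member.
mongo --port 27018 --eval 'printjson(rs.status())'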



 Comments   
Comment by Kristina Chodorow (Inactive) [ 12/Sep/11 ]

You could run repair(), but repair() will remove any corrupted data. This means that if you have a "clean" server, it might have records that this server lacks after the repair: the result won't be an exact clean copy of your data, just a clean copy with some documents missing.

Also, keep in mind that whatever server this data directory came from probably has corruption too, so you should repair that server as well, or resync it from a known clean copy.
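
For reference, a minimal sketch of the repair option on a member like this one. The dbpath matches the startup log above; the service name and the mongodb user are assumptions about the stock Ubuntu package, and repair needs free disk space roughly equal to the data size:

# Stop mongod first, then rewrite the data files in place; anything that cannot
# be read back cleanly is discarded, which is why documents can go missing.
sudo service mongodb stop
sudo -u mongodb mongod --repair --dbpath /var/lib/mongodb
sudo service mongodb start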

Comment by Matt Parlane [ 11/Sep/11 ]

I've just been bitten by this exact bug, and it was (most probably) because of an unclean shutdown. I initially synced the slave with a copy of the data directory rather than using MongoDB's built-in sync, and there is a good chance that the server which was the origin of that data was at one point run without journalling.

Anyway, is there anything I can do to resolve the situation now without resyncing? The replica is on another continent and it takes foreeevveeeerrr to resync.

Also, let me know if you need any info to debug. My backtrace looks almost identical to the one above.

Comment by Kristina Chodorow (Inactive) [ 30/Jun/11 ]

Yes, it will only be detected if something tries to use the corrupted section, so theoretically it could exist forever without being detected.

Comment by Michael Conigliaro [ 29/Jun/11 ]

Well, once 1.8 came out, I enabled journaling everywhere. Could DB corruption really go that long without being detected?

Comment by Kristina Chodorow (Inactive) [ 24/Jun/11 ]

But was this server ever running without journaling?

Comment by Michael Conigliaro [ 23/Jun/11 ]

Yes, that's why I suspected something was wrong with the init script. I brought this up before on IRC, but nobody could reproduce it, and it's never really been a problem (mostly just an annoyance). And yes, I am running with journaling.

Comment by Kristina Chodorow (Inactive) [ 23/Jun/11 ]

That sounds like it's shutting down uncleanly! (which shouldn't actually be a problem if you're running with journaling, but would be a problem if it used to run w/out journaling.)

Comment by Michael Conigliaro [ 23/Jun/11 ]

Oh, not that I know of. Although for the record, I suspect there might be something wrong with the init script on Ubuntu. Often I will do a restart with the init script, check the logs, and see that MongoDB is recovering from the journal. I'm not sure what that's all about, or if it matters at all.
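
One way to check whether the init script is the problem (a sketch; the signal, port, and pidof lookup are assumptions about this host): stop mongod by hand with a clean shutdown and see whether the next startup still reports journal recovery. A clean start logs "recover : no journal files present, no recovery needed", as in the restart log above.

# mongod treats SIGTERM (and SIGINT) as a clean shutdown; SIGKILL is not clean.
kill -TERM $(pidof mongod)
# Or ask the server to shut itself down from the shell:
mongo --port 27018 admin --eval 'db.shutdownServer()'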

Comment by Kristina Chodorow (Inactive) [ 23/Jun/11 ]

No, I mean "at any time in the history of the world," not necessarily right before.

Comment by Michael Conigliaro [ 23/Jun/11 ]

No unclean shutdown that I know of. That log snippet I posted is actually where those messages started, so if there had been a shutdown, I imagine we would see MongoDB startup messages in there.

Comment by Kristina Chodorow (Inactive) [ 21/Jun/11 ]

It looks like corruption in the slave, so probably for the best. Did you have an unclean shutdown?

Comment by Michael Conigliaro [ 17/Jun/11 ]

I ultimately just deleted all the data on the slave and forced a resync.
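
For completeness, a rough sketch of that resync procedure on a member like this one. The dbpath comes from the startup log; the service name and log path are assumptions about the stock Ubuntu package, and the rest of the set must stay up to sync from:

# Stop the broken secondary, wipe its data files (including the journal
# directory), and restart it; on startup it performs a full initial sync
# from another member and eventually reports SECONDARY again.
sudo service mongodb stop
sudo rm -rf /var/lib/mongodb/*
sudo service mongodb start
tail -f /var/log/mongodb/mongodb.log   # watch the initial sync progress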
