[SERVER-7890] local.oplog.rs Assertion failure Created: 10/Dec/12  Updated: 10/Dec/14  Resolved: 02/Jun/13

Status: Closed
Project: Core Server
Component/s: Storage
Affects Version/s: 2.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: John Woakes Assignee: David Hows
Resolution: Done Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Microsoft Windows Azure


Attachments: Text File Mongo.log     Text File Mongo20130419.log     Text File MongoLogs.log     Zip Archive WADLogsTable7.zip     Microsoft Word WADLogsTable_Tejas.csv    
Issue Links:
Related
Operating System: ALL
Steps To Reproduce:

The database was fine on Friday, but was like this Monday morning.

Participants:

 Description   

Getting this error in the logs and returned to our app:

local.oplog.rs Assertion failure !loc.isNull() src\mongo\db\pdfile.cpp 1636



 Comments   
Comment by Tad Marshall [ 19/Apr/13 ]

Hi John,

From looking at the code (assuming this is still version 2.2.2), the assertions are a result of the corruption and not its cause. The code is trying to make room for a new record by deleting the oldest records in the capped collection, but because the firstRecord pointer is null it can't delete anything.

I have been trying to understand why this happens on Azure when we never get similar reports on any other platform. The only way that we should be able to get corruption like this is if the system "lies" to us: it reports that data has been written to disk when it actually has not. All disk operations are written to the journal before they update the data files, so with journaling enabled a hard stop at any point should be fully recoverable.
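
Since full recoverability here depends on journaling actually being enabled, it may be worth confirming that on the affected node. A minimal shell sketch using standard commands (nothing in it is specific to this ticket):

// Run in the mongo shell against the affected mongod.
// serverStatus() only reports a "dur" section when journaling is enabled.
var status = db.serverStatus();
if (status.dur) {
    print("journaling is enabled; commits in last interval: " + status.dur.commits);
} else {
    print("journaling appears to be DISABLED on this node");
}
// The parsed startup options show whether nojournal was set.
printjson(db.adminCommand({ getCmdLineOpts: 1 }).parsed);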

When you had issues in the past, you were able to examine the Azure storage logs and find reports that led to the discovery of the need to block FlushViewOfFile calls while data was being updated (and vice versa). This was SERVER-7378, which was backported to version 2.2.2 in git commit https://github.com/mongodb/mongo/commit/f43972dd586475132045bb11a96baac168f5b8cd .

My understanding is that Azure instances can terminate in a way that is different from a normal crash or power failure. If a disk operation has completed from MongoDB's perspective (i.e. the Windows API call returned without error) but the data has not actually reached disk, because disk I/O on Azure goes over the network and may never complete if the instance is terminated, then that could cause this kind of corruption.

I may be misinformed and I am really just trying to come up with an explanation that makes sense, so I apologize if I'm just confused.

Can you examine the Azure storage logs and see if there is anything that correlates with the timing of these problems?

Tad

Comment by John Woakes [ 19/Apr/13 ]

I added this to the issue:

Hi Tad,

Not sure what you mean by "Have these nodes had a pattern of restarted Azure instances that could suggest why this keeps happening?".

If you hard crash mongo, it should recover when it starts back up. We did take this instance down 24 hours earlier.

This was the primary of our production replica set, and our production system was down because the secondaries did not elect a new primary.

John.

Comment by John Woakes [ 19/Apr/13 ]

Not sure if this helps, but looking at the logs around the time the !loc.isNull() assertions started, the first sign that things were not right was the error "couldn't make room for new record (len: 184) in capped ns local.oplog.rs".

I am including the portion of the logs that contains this.

John.

Comment by John Woakes [ 19/Apr/13 ]

Logs from today's failure.

Comment by Tejas Vora [ 19/Apr/13 ]

Including John ...

Comment by Tad Marshall [ 19/Apr/13 ]

Hi Tejas,

Records (documents) are kept in a doubly linked list within each extent (contiguous area in a data file) and the first and last records in that list are pointed to by pointers in the extent's header.

The error message "in extent 2:2000, firstRecord is null but lastRecord is 2:2f2f737c" indicates that the pointers in the extent header are invalid. If there are no active records in the extent, both pointers should be null; if there are active records in the extent, both pointers should be non-null. Since the firstRecord pointer is null but the lastRecord pointer is not, at least one of these pointers cannot be valid.

Your best recovery path is probably to resync this node.

Have these nodes had a pattern of restarted Azure instances that could suggest why this keeps happening?

Tad
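
For reference, that invariant can be checked directly from the shell against the validate(true) output. A sketch, assuming the 2.2-era output format shown elsewhere in this ticket (where firstRecord/lastRecord are reported as the string "null" when unset):

// Flag any extent whose header violates the "both null or both non-null" rule.
var res = db.getSiblingDB("local").oplog.rs.validate(true);
res.extents.forEach(function (e) {
    var firstNull = (e.firstRecord === "null");
    var lastNull = (e.lastRecord === "null");
    if (firstNull !== lastNull) {
        print("inconsistent extent " + e.loc + ": firstRecord=" + e.firstRecord +
              " lastRecord=" + e.lastRecord);
    }
});
print("valid: " + res.valid);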

Comment by Tejas Vora [ 19/Apr/13 ]

Hi David,

We are seeing the same problem in our production environment this time. As per your instructions, I ran the validate command on the local.oplog.rs collection, and it came back with the following, saying "Collection is corrupt and requires repair", but I don't know why or what the reason behind it could be.

{
        "ns" : "local.oplog.rs",
        "capped" : true,
        "max" : 2147483647,
        "firstExtent" : "2:2000 ns:local.oplog.rs",
        "lastExtent" : "2:2000 ns:local.oplog.rs",
        "extentCount" : 1,
        "extents" : [
                {
                        "loc" : "2:2000",
                        "xnext" : "null",
                        "xprev" : "null",
                        "nsdiag" : "local.oplog.rs",
                        "size" : 990003200,
                        "firstRecord" : "null",
                        "lastRecord" : "2:2f2f737c"
                }
        ],
        "datasize" : 904741068,
        "nrecords" : 5328857,
        "lastExtentSize" : 990003200,
        "padding" : 1,
        "firstExtentDetails" : {
                "loc" : "2:2000",
                "xnext" : "null",
                "xprev" : "null",
                "nsdiag" : "local.oplog.rs",
                "size" : 990003200,
                "firstRecord" : "null",
                "lastRecord" : "2:2f2f737c"
        },
        "objectsFound" : 0,
        "invalidObjects" : 0,
        "bytesWithHeaders" : 0,
        "bytesWithoutHeaders" : 0,
        "deletedCount" : 3,
        "deletedSize" : 244,
        "nIndexes" : 0,
        "keysPerIndex" : {
 
        },
        "valid" : false,
        "errors" : [
                "in extent 2:2000, firstRecord is null but lastRecord is 2:2f2f737c"
        ],
        "advice" : "ns corrupt, requires repair",
        "ok" : 1
}

Comment by Stuart Rench [ 10/Apr/13 ]

Unfortunately, I have already rebuilt that node. But in case you still want to see, the rebuilt node shows:

PRIMARY> db.oplog.rs.validate(true)
{
	"ns" : "local.oplog.rs",
	"capped" : 1,
	"max" : 2147483647,
	"firstExtent" : "2:2000 ns:local.oplog.rs",
	"lastExtent" : "9:2000 ns:local.oplog.rs",
	"extentCount" : 8,
	"extents" : [
		{
			"loc" : "2:2000",
			"xnext" : "3:2000",
			"xprev" : "null",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "2:20b0",
			"lastRecord" : "2:4fcaafb0"
		},
		{
			"loc" : "3:2000",
			"xnext" : "4:2000",
			"xprev" : "2:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "null",
			"lastRecord" : "null"
		},
		{
			"loc" : "4:2000",
			"xnext" : "5:2000",
			"xprev" : "3:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "null",
			"lastRecord" : "null"
		},
		{
			"loc" : "5:2000",
			"xnext" : "6:2000",
			"xprev" : "4:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "null",
			"lastRecord" : "null"
		},
		{
			"loc" : "6:2000",
			"xnext" : "7:2000",
			"xprev" : "5:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "null",
			"lastRecord" : "null"
		},
		{
			"loc" : "7:2000",
			"xnext" : "8:2000",
			"xprev" : "6:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "null",
			"lastRecord" : "null"
		},
		{
			"loc" : "8:2000",
			"xnext" : "9:2000",
			"xprev" : "7:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 2146426864,
			"firstRecord" : "null",
			"lastRecord" : "null"
		},
		{
			"loc" : "9:2000",
			"xnext" : "null",
			"xprev" : "8:2000",
			"nsdiag" : "local.oplog.rs",
			"size" : 194031616,
			"firstRecord" : "null",
			"lastRecord" : "null"
		}
	],
	"datasize" : 1277035988,
	"nrecords" : 3852455,
	"lastExtentSize" : 194031616,
	"padding" : 1,
	"firstExtentDetails" : {
		"loc" : "2:2000",
		"xnext" : "3:2000",
		"xprev" : "null",
		"nsdiag" : "local.oplog.rs",
		"size" : 2146426864,
		"firstRecord" : "2:20b0",
		"lastRecord" : "2:4fcaafb0"
	},
	"cappedOutOfOrder" : 0,
	"objectsFound" : 3852455,
	"invalidObjects" : 0,
	"bytesWithHeaders" : 1338675268,
	"bytesWithoutHeaders" : 1277035988,
	"deletedCount" : 8,
	"deletedSize" : 13880342988,
	"nIndexes" : 0,
	"keysPerIndex" : {
		
	},
	"valid" : true,
	"errors" : [ ],
	"ok" : 1
}

Comment by David Hows [ 10/Apr/13 ]

Hi Stuart,

Can you run the validate that I suggested on the local.oplog.rs collection?

Regards,
David

Comment by Stuart Rench [ 09/Apr/13 ]

Here are a couple of log entries that seem appropriate:

The first error that shows up on the node that was primary at the time of failure:

Mon Apr  8 09:15:01 [conn13686] couldn't make room for new record (len: 500) in capped ns local.oplog.rs
  Extent 0
    magic: 41424344 extent->ns: local.oplog.rs
    fr: null lr: 2:694907f8 extent->len: 2146426864
  Extent 1 (capExtent)
    magic: 41424344 extent->ns: local.oplog.rs
    fr: null lr: 3:4470ab08 extent->len: 1553997824
 local.oplog.rs Assertion failure len * 5 > lastExtentSize db/namespace.cpp 431
0x57a966 0x5854ab 0x861bbc 0x86ca46 0x860965 0x8609aa 0x8a75f1 0x833ca4 0x8318e2 0x88933f 0x88d9e9 0x88fcf5 0xaa2008 0x638aa7 0x3c3800673d 0x3c370d44bd 
 /usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a966]
 /usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5854ab]
 /usr/bin/mongod(_ZNK5mongo16NamespaceDetails13maybeComplainEPKci+0x8ec) [0x861bbc]
 /usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x256) [0x86ca46]
 /usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x25) [0x860965]
 /usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x3a) [0x8609aa]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x161) [0x8a75f1]
 /usr/bin/mongod [0x833ca4]
 /usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x8318e2]
 /usr/bin/mongod(_ZN5mongo14checkAndInsertEPKcRNS_7BSONObjE+0xbf) [0x88933f]
 /usr/bin/mongod(_ZN5mongo14receivedInsertERNS_7MessageERNS_5CurOpE+0x439) [0x88d9e9]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x11c5) [0x88fcf5]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x78) [0xaa2008]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x638aa7]
 /lib64/libpthread.so.0 [0x3c3800673d]
 /lib64/libc.so.6(clone+0x6d) [0x3c370d44bd]

Later, we begin to see a lot like the following:

Mon Apr  8 09:15:01 [conn4151]  local.oplog.rs Assertion failure !loc.isNull() db/pdfile.cpp 1886
0x57a966 0x5854ab 0x8a7806 0x833ca4 0x8318e2 0x88933f 0x88d9e9 0x88fcf5 0xaa2008 0x638aa7 0x3c3800673d 0x3c370d44bd 
 /usr/bin/mongod(_ZN5mongo12sayDbContextEPKc+0x96) [0x57a966]
 /usr/bin/mongod(_ZN5mongo8assertedEPKcS1_j+0xfb) [0x5854ab]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x376) [0x8a7806]
 /usr/bin/mongod [0x833ca4]
 /usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pb+0x42) [0x8318e2]
 /usr/bin/mongod(_ZN5mongo14checkAndInsertEPKcRNS_7BSONObjE+0xbf) [0x88933f]
 /usr/bin/mongod(_ZN5mongo14receivedInsertERNS_7MessageERNS_5CurOpE+0x439) [0x88d9e9]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x11c5) [0x88fcf5]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x78) [0xaa2008]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x287) [0x638aa7]
 /lib64/libpthread.so.0 [0x3c3800673d]
 /lib64/libc.so.6(clone+0x6d) [0x3c370d44bd]

SECONDARY> db.oplog.rs.stats()
{
	"ns" : "local.oplog.rs",
	"count" : 12012005,
	"size" : 3508231904,
	"avgObjSize" : 292.06047649830316,
	"storageSize" : 3700424688,
	"numExtents" : 2,
	"nindexes" : 0,
	"lastExtentSize" : 1553997824,
	"paddingFactor" : 1,
	"flags" : 0,
	"totalIndexSize" : 0,
	"indexSizes" : {
		
	},
	"capped" : 1,
	"max" : 2147483647,
	"ok" : 1
}

Comment by Stuart Rench [ 09/Apr/13 ]

I seem to be having a similar issue:
Mongo version 2.0.9
Centos 5.6

After restarting the mongo process, the instance went into a secondary state, but has the following "errmsg":
"syncTail: 0 assertion db/pdfile.cpp:1886, syncing: { ts: Timestamp 1365448166000|1, h: 6712576203048320232, op: \"i\", ns: [removed]"

I assume I can resolve the current failed state by forcing this replica set member to rebuild itself (delete the data directory), but is there any information you want pulled prior to that?
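
In case it helps, a minimal set of diagnostics worth capturing before wiping such a node, based on the commands requested elsewhere in this ticket (a sketch, run from the mongo shell on the broken member):

// Capture oplog and replication state before deleting the data directory.
var local = db.getSiblingDB("local");
printjson(local.oplog.rs.stats());         // oplog size, extent count, record count
printjson(local.oplog.rs.validate(true));  // full validation, including extent headers
printjson(rs.status());                    // replica set state as seen by this member
db.printReplicationInfo();                 // oplog window: first/last event times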

Comment by David Hows [ 27/Mar/13 ]

Hi John and Tejas,

I'm so sorry that I did not get back to you on this. Thank you for flipping the tag on this ticket over.

I've been looking through the codebase and still cannot find any reason for this problem. It appears that the issue is related to being unable to clear enough space in the oplog to insert a document, but I cannot see why this would fail. Would it be possible for you to run a validate(true) on the system.oplog collection and confirm there is nothing wrong with the layout of the data?

Beyond this, would you be willing to run the new 2.4.1 release with the --enableExperimentalStorageDetailsCmd argument and run the following commands:

use local
db.system.oplog.diskStorageStats()
db.system.oplog.getDiskStorageStats()

These commands will give an overview of what is in the oplog extent and how the data is allocated - and thus what the problem might be.
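
For what it's worth, a sketch of pointing those helpers at the replica-set oplog itself, assuming the intended namespace is local.oplog.rs and that mongod 2.4.1 was started with --enableExperimentalStorageDetailsCmd:

// Both helpers require the experimental storage-details command to be enabled.
var oplog = db.getSiblingDB("local").oplog.rs;
printjson(oplog.diskStorageStats());     // allocation detail for the oplog extent(s)
printjson(oplog.getDiskStorageStats());  // summarized view of the same data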

Regards,
David

Comment by John Woakes [ 25/Mar/13 ]

What are you waiting on from us?

Comment by Tejas Vora [ 05/Mar/13 ]

Is there any news on this issue? We are facing this issue every now and then in our production environment.

Comment by Tejas Vora [ 17/Feb/13 ]

Hi David,

We are seeing the same issue again in our production environment. A couple of days back we deployed our latest code to the Azure production environment. As part of the new deployment, we deleted the old database files. When the Azure solution came alive, it:

  • created a new replica set
  • restored the database from backup

The solution had been running smoothly for the last 3 days and today, all of a sudden, we started seeing the same issue on 2 of the secondary instance machines. Currently, both secondary machines are reporting the same issue, and their oplog time is reported as out of sync with the primary instance.

This is a very critical issue now, because the only choice we are left with is to destroy the database on the secondary machines and do a full re-sync. We are worried that this will happen over and over, and that every single time we will have to do a full re-sync.

Comment by Rob Marscher [ 21/Jan/13 ]

I didn't restore from backup. I shut down the secondary servers, deleted their data dirs, shut down the primary server, deleted its local.* database files, started them up, and ran rs.initiate() and rs.config() again. Essentially the same steps described in "Reconfigure By Breaking The Mirror": http://docs.mongodb.org/manual/tutorial/reconfigure-replica-set-with-unavailable-members/#reconfigure-by-breaking-the-mirror
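
For reference, a rough sketch of that sequence in shell terms (set name and hostnames are illustrative; the file-system steps happen outside the mongo shell and are shown only as comments):

// 1. Stop the secondaries and delete their data directories (outside the shell).
// 2. Stop the primary and delete only its local.* database files (outside the shell).
// 3. Restart the former primary, connect to it, and re-create the set:
rs.initiate({
    _id: "rs0",
    members: [ { _id: 0, host: "db0.example.net:27017" } ]
});
// 4. Restart the wiped secondaries and add them back; they will perform an initial sync.
rs.add("db1.example.net:27017");
rs.add("db2.example.net:27017");
rs.conf();   // confirm the resulting configuration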

Comment by Gregor Macadam [ 21/Jan/13 ]

Hi Rob
I'd like to document this on the mongodb-user thread. When you say you recreated the replica set, do you mean you restored from backup?
Thanks
Gregor

Comment by Rob Marscher [ 18/Jan/13 ]

Hi David,

Since I recreated the replica set, this problem has not popped up again. So far, it seems it was an isolated incident for us.

Thanks again,
Rob

Comment by David Hows [ 11/Jan/13 ]

Hi Tejas,

I've escalated this so more people can investigate.

You mentioned that you did the restore into a new database? Was this a newly created DB within an already running replica set? Or a whole new replica set?

Would you be able to upload the files that you used to restore the database to our fileshare?

You can use the following command to upload to our secure site

scp -P 722 <filename> iqmetrix@www.10gen.com:

I would like to see if the files you used in the backup have any problems.

Thanks,

David

Comment by Tejas Vora [ 08/Jan/13 ]

Hi David,

Unfortunately, this issue is hard to reproduce. It does not surface all the time, and it can surface not only during a restore; it just shows up randomly at any time. Earlier we had the same problem on one of our test environments: it was running fine (we were not running any restore on it) when this problem started showing up.

Anyhow, this time, during the restore, the following errors were thrown just before we started seeing this problem.
I have also attached a log file with some of the initial logging from when this error started happening.

Mon Dec 31 21:16:35 [conn173952] couldn't make room for new record (len: 380) in capped ns local.oplog.rs; TraceSource 'WaWorkerHost.exe' event"
  Extent 0 (capExtent); TraceSource 'WaWorkerHost.exe' event"
    magic: 41424344 extent->ns: local.oplog.rs; TraceSource 'WaWorkerHost.exe' event"
    fr: null lr: 0:25e45e58 extent->len: 990003200; TraceSource 'WaWorkerHost.exe' event"
 local.oplog.rs Assertion failure len * 5 > lastExtentSize src\mongo\db\namespace_details.cpp 444; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\util\stacktrace.cpp(161)                           mongo::printStackTrace+0x3e; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\util\assert_util.cpp(109)                          mongo::verifyFailed+0xdc; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\namespace_details.cpp(444)                      mongo::NamespaceDetails::maybeComplain+0x57b; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\cap.cpp(255)                                    mongo::NamespaceDetails::cappedAlloc+0x46a; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\namespace_details.cpp(280)                      mongo::NamespaceDetails::alloc+0x5d; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\pdfile.cpp(1636)                                mongo::DataFileMgr::fast_oplog_insert+0xa2; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\oplog.cpp(207)                                  mongo::_logOpRS+0x61f; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\instance.cpp(754)                               mongo::checkAndInsert+0xe5; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\instance.cpp(814)                               mongo::receivedInsert+0x790; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\instance.cpp(434)                               mongo::assembleResponse+0x607; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\db.cpp(192)                                     mongo::MyMessageHandler::process+0xf5; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(86)               mongo::pms::threadRun+0x59a; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] kernel32.dll                                                                   BaseThreadInitThunk+0xd; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] insert xq.PricingScheduleModelRevisions keyUpdates:0 exception: assertion src\mongo\db\namespace_details.cpp:444 locks(micros) w:6746453 3393ms; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] couldn't make room for new record (len: 380) in capped ns local.oplog.rs; TraceSource 'WaWorkerHost.exe' event"
  Extent 0 (capExtent); TraceSource 'WaWorkerHost.exe' event"
    magic: 41424344 extent->ns: local.oplog.rs; TraceSource 'WaWorkerHost.exe' event"
    fr: null lr: 0:25e45e58 extent->len: 990003200; TraceSource 'WaWorkerHost.exe' event"
 local.oplog.rs Assertion failure len * 5 > lastExtentSize src\mongo\db\namespace_details.cpp 444; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\util\stacktrace.cpp(161)                           mongo::printStackTrace+0x3e; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\util\assert_util.cpp(109)                          mongo::verifyFailed+0xdc; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\namespace_details.cpp(444)                      mongo::NamespaceDetails::maybeComplain+0x57b; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\cap.cpp(255)                                    mongo::NamespaceDetails::cappedAlloc+0x46a; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\namespace_details.cpp(280)                      mongo::NamespaceDetails::alloc+0x5d; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\pdfile.cpp(1636)                                mongo::DataFileMgr::fast_oplog_insert+0xa2; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\oplog.cpp(207)                                  mongo::_logOpRS+0x61f; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\instance.cpp(754)                               mongo::checkAndInsert+0xe5; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\instance.cpp(814)                               mongo::receivedInsert+0x790; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\instance.cpp(434)                               mongo::assembleResponse+0x607; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\db\db.cpp(192)                                     mongo::MyMessageHandler::process+0xf5; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\mongo\util\net\message_server_port.cpp(86)               mongo::pms::threadRun+0x59a; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    ...\src\third_party\boost\libs\thread\src\win32\thread.cpp(180)  boost::`anonymous namespace'::thread_start_function+0x21; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(314)      _callthreadstartex+0x17; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] mongod.exe    f:\dd\vctools\crt_bld\self_64_amd64\crt\src\threadex.c(292)      _threadstartex+0x7f; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] kernel32.dll                                                                   BaseThreadInitThunk+0xd; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] insert xq.PricingScheduleModelRevisions keyUpdates:0 exception: assertion src\mongo\db\namespace_details.cpp:444 locks(micros) w:1108364 547ms; TraceSource 'WaWorkerHost.exe' event"
Mon Dec 31 21:16:39 [conn173952] couldn't make room for new record (len: 448) in capped ns local.oplog.rs; TraceSource 'WaWorkerHost.exe' event"
  Extent 0 (capExtent); TraceSource 'WaWorkerHost.exe' event"
    magic: 41424344 extent->ns: local.oplog.rs; TraceSource 'WaWorkerHost.exe' event"
    fr: null lr: 0:25e45e58 extent->len: 990003200; TraceSource 'WaWorkerHost.exe' event"
 

Comment by Rob Marscher [ 08/Jan/13 ]

Thanks, David. Sorry for the delay, but I just uploaded a file named mongod.log-oplog-corruption that contains the mongod log from when the local.oplog.rs issue started; I truncated it after about a minute of output, but the same errors continued for several days.

Comment by David Hows [ 07/Jan/13 ]

Hi Rob,

You can upload them with the following command

scp -P 722 <filename> beaffinitive@www.10gen.com:

This will upload them to our private fileshare.

Cheers,

David

Comment by Rob Marscher [ 07/Jan/13 ]

Hi David,

Thanks a lot for checking in on this. For the oplog issue, I found the first error:

:
Wed Jan  2 04:08:18 [conn81258] couldn't make room for new record (len: 248) in capped ns local.oplog.rs
  Extent 0 (capExtent)
    magic: 41424344 extent->ns: local.oplog.rs
    fr: null lr: 2:5dd5dbc extent->len: 990000128
 local.oplog.rs Assertion failure len * 5 > lastExtentSize src/mongo/db/namespace_details.cpp 444
0xaf8c41 0xabf4cd 0x80e808 0x657fba 0x80d197 0x80d488 0x859eee 0x81d97f 0x81a279 0x83bb14 0x83f381 0x841de7 0x7a973f 0x7b2df1 0x56fe42 0xae6ed1 0x7fe4de44f851 0x7fe4ddd1111d
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaf8c41]
 /usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xabf4cd]
 /usr/bin/mongod(_ZNK5mongo16NamespaceDetails13maybeComplainEPKci+0xa48) [0x80e808]
 /usr/bin/mongod(_ZN5mongo16NamespaceDetails11cappedAllocEPKci+0x31a) [0x657fba]
 /usr/bin/mongod(_ZN5mongo16NamespaceDetails6_allocEPKci+0x27) [0x80d197]
 /usr/bin/mongod(_ZN5mongo16NamespaceDetails5allocEPKciRNS_7DiskLocE+0x38) [0x80d488]
 /usr/bin/mongod(_ZN5mongo11DataFileMgr17fast_oplog_insertEPNS_16NamespaceDetailsEPKci+0x6e) [0x859eee]
 /usr/bin/mongod() [0x81d97f]
 /usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pbb+0x49) [0x81a279]
 /usr/bin/mongod() [0x83bb14]
 /usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x3081) [0x83f381]
 /usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0x841de7]
 /usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4cf) [0x7a973f]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x10b1) [0x7b2df1]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x82) [0x56fe42]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0xae6ed1]
 /lib64/libpthread.so.0(+0x7851) [0x7fe4de44f851]
 /lib64/libc.so.6(clone+0x6d) [0x7fe4ddd1111d]

It repeats those several times and then starts issuing the !loc.isNull() errors like this one:

Wed Jan  2 04:08:45 [conn81258]  local.oplog.rs Assertion failure !loc.isNull() src/mongo/db/pdfile.cpp 1580
0xaf8c41 0xabf4cd 0x85a0b3 0x81d97f 0x81a279 0x83bb14 0x83f381 0x841de7 0x7a973f 0x7b2df1 0x56fe42 0xae6ed1 0x7fe4de44f851 0x7fe4ddd1111d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaf8c41]
 /usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xabf4cd]
 /usr/bin/mongod() [0x85a0b3]
 /usr/bin/mongod() [0x81d97f]
 /usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pbb+0x49) [0x81a279]
 /usr/bin/mongod() [0x83bb14]
 /usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x3081) [0x83f381]
 /usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0x841de7]
 /usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4cf) [0x7a973f]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x10b1) [0x7b2df1]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x82) [0x56fe42]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0xae6ed1]
 /lib64/libpthread.so.0(+0x7851) [0x7fe4de44f851]
 /lib64/libc.so.6(clone+0x6d) [0x7fe4ddd1111d]

I have logs that I can send but would need a private channel. Can I email you? As I mentioned in the last post, I'll reach out to our sales rep for support. I've been having some hostname issues as well (each server has a public and a private hostname - the dbs need to talk to each other over the private hostnames, but have to be set to the public hostname, which has been causing replication issues - possibly only since the 2.2.x upgrade). Thanks, Rob

Comment by David Hows [ 07/Jan/13 ]

Hi rmarscher,

Can you attach logs that cover the time when you see these issues along with a few choice examples so we can know where to look?

Cheers,

David

Comment by David Hows [ 07/Jan/13 ]

Hi tejasv@iqmetrix.com,

Sorry for the delay.

From the printReplicationInfo output I see that the oplog appears to have been corrupted. Could you please attach the file you used for the database restore?

Are you able to re-create the issue each time? If so, could you attach the logs that cover the period when you were loading the data? There should be a corresponding log message from the server that would make diagnosing the issue much easier.

Cheers,

David

Comment by Rob Marscher [ 05/Jan/13 ]

Actually... update commands are throwing the !loc.isNull() error as well. I thought they weren't before. Also, we upgraded these servers from 2.0.2 about a month ago, and one of the servers moved to a different IP around that time. We are using the PHP mongo driver 1.2.9 on PHP 5.3.15.

I ended up blowing away the replica set and starting it over again. The !loc.isNull() errors are gone now and it seems we didn't lose any data. However, running mongodump and mongorestore both causes segfaults with "Logstream::get called in uninitialized state".

Will try to contact 10gen for support on this tomorrow.

Comment by Rob Marscher [ 05/Jan/13 ]

We are currently experiencing this on our primary database in a production replica set. Most operations are running fine, but I found that db.[collection].remove() calls are not removing documents. The secondary in the replica set does not have errors in its log. We are running 2.2.1.
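
One way to confirm whether those removes are failing server-side is to check getLastError immediately after the write on the same connection. A sketch using the 2.2-era shell helpers (the "sessions" collection and query are illustrative):

// Issue a remove, then inspect getLastError on the same connection.
db.sessions.remove({ expiresAt: { $lt: new Date() } });
var gle = db.getLastErrorObj();
if (gle.err) {
    print("remove failed server-side: " + gle.err);
} else {
    print("remove acknowledged; documents affected: " + gle.n);
}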

We store web sessions in mongo collections, which need to be removed when they expire - so that's how we finally discovered the problem. Inserts and updates are working fine as far as our client connections are concerned, but in the logs they throw the same error:

Sat Jan  5 17:05:22 [conn116032]  local.oplog.rs Assertion failure !loc.isNull() src/mongo/db/pdfile.cpp 1580
0xaf8c41 0xabf4cd 0x85a0b3 0x81d97f 0x81a279 0x83bb14 0x83f381 0x841de7 0x7a973f 0x7b2df1 0x56fe42 0xae6ed1 0x7fe4de44f851 0x7fe4ddd1111d 
 /usr/bin/mongod(_ZN5mongo15printStackTraceERSo+0x21) [0xaf8c41]
 /usr/bin/mongod(_ZN5mongo12verifyFailedEPKcS1_j+0xfd) [0xabf4cd]
 /usr/bin/mongod() [0x85a0b3]
 /usr/bin/mongod() [0x81d97f]
 /usr/bin/mongod(_ZN5mongo5logOpEPKcS1_RKNS_7BSONObjEPS2_Pbb+0x49) [0x81a279]
 /usr/bin/mongod() [0x83bb14]
 /usr/bin/mongod(_ZN5mongo14_updateObjectsEbPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEPNS_11RemoveSaverEbRKNS_24QueryPlanSelectionPolicyEb+0x3081) [0x83f381]
 /usr/bin/mongod(_ZN5mongo13updateObjectsEPKcRKNS_7BSONObjES4_bbbRNS_7OpDebugEbRKNS_24QueryPlanSelectionPolicyE+0xb7) [0x841de7]
 /usr/bin/mongod(_ZN5mongo14receivedUpdateERNS_7MessageERNS_5CurOpE+0x4cf) [0x7a973f]
 /usr/bin/mongod(_ZN5mongo16assembleResponseERNS_7MessageERNS_10DbResponseERKNS_11HostAndPortE+0x10b1) [0x7b2df1]
 /usr/bin/mongod(_ZN5mongo16MyMessageHandler7processERNS_7MessageEPNS_21AbstractMessagingPortEPNS_9LastErrorE+0x82) [0x56fe42]
 /usr/bin/mongod(_ZN5mongo3pms9threadRunEPNS_13MessagingPortE+0x411) [0xae6ed1]
 /lib64/libpthread.so.0(+0x7851) [0x7fe4de44f851]
 /lib64/libc.so.6(clone+0x6d) [0x7fe4ddd1111d]

rs.status shows everything is fine:

{
	"set" : "affiniti1",
	"date" : ISODate("2013-01-05T22:01:59Z"),
	"myState" : 2,
	"syncingTo" : "affinwww99.pvt.empoweringmedia.net:27017",
	"members" : [
		{
			"_id" : 0,
			"name" : "affinwww00.pvt.empoweringmedia.net:27017",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			"uptime" : 1097444,
			"lastHeartbeat" : ISODate("2013-01-05T22:01:57Z"),
			"pingMs" : 4
		},
		{
			"_id" : 1,
			"name" : "affiniti03.pvt.empoweringmedia.net:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 1097462,
			"optime" : Timestamp(1357117697000, 3),
			"optimeDate" : ISODate("2013-01-02T09:08:17Z"),
			"self" : true
		},
		{
			"_id" : 2,
			"name" : "affinwww99.pvt.empoweringmedia.net:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 1097444,
			"optime" : Timestamp(1357117697000, 3),
			"optimeDate" : ISODate("2013-01-02T09:08:17Z"),
			"lastHeartbeat" : ISODate("2013-01-05T22:01:59Z"),
			"pingMs" : 2
		}
	],
	"ok" : 1
}

db.getReplicationInfo()

{
	"logSizeMB" : 944.1375732421875,
	"usedMB" : 880.1,
	"timeDiff" : 6097596,
	"timeDiffHours" : 1693.78,
	"tFirst" : "Tue Oct 23 2012 15:21:41 GMT-0400 (EDT)",
	"tLast" : "Wed Jan 02 2013 04:08:17 GMT-0500 (EST)",
	"now" : "Sat Jan 05 2013 16:55:30 GMT-0500 (EST)"
}

db.printReplicationInfo()

configured oplog size:   944.1375732421875MB
log length start to end: 6097596secs (1693.78hrs)
oplog first event time:  Tue Oct 23 2012 15:21:41 GMT-0400 (EDT)
oplog last event time:   Wed Jan 02 2013 04:08:17 GMT-0500 (EST)
now:                     Sat Jan 05 2013 16:56:13 GMT-0500 (EST)

The current primary only has 5.4GB of disk space left. Is that involved?

When firing up a client shell, there's an interesting set of messages:

connecting to: affiniti1/affiniti03.pvt.empoweringmedia.net,affinwww99.empoweringmedia.net:27017/test
Sat Jan  5 17:07:11 starting new replica set monitor for replica set affiniti1 with seed of affiniti03.pvt.empoweringmedia.net:27017,affinwww99.empoweringmedia.net:27017
Sat Jan  5 17:07:11 successfully connected to seed affiniti03.pvt.empoweringmedia.net:27017 for replica set affiniti1
Sat Jan  5 17:07:11 changing hosts to { 0: "affiniti03.pvt.empoweringmedia.net:27017", 1: "affinwww99.pvt.empoweringmedia.net:27017" } from affiniti1/
Sat Jan  5 17:07:11 trying to add new host affiniti03.pvt.empoweringmedia.net:27017 to replica set affiniti1
Sat Jan  5 17:07:11 successfully connected to new host affiniti03.pvt.empoweringmedia.net:27017 in replica set affiniti1
Sat Jan  5 17:07:11 trying to add new host affinwww99.pvt.empoweringmedia.net:27017 to replica set affiniti1
Sat Jan  5 17:07:11 successfully connected to new host affinwww99.pvt.empoweringmedia.net:27017 in replica set affiniti1
Sat Jan  5 17:07:11 error connecting to seed affinwww99.empoweringmedia.net:27017 :: caused by :: 15928 couldn't connect to server affinwww99.empoweringmedia.net:27017
Sat Jan  5 17:07:11 Primary for replica set affiniti1 changed to affinwww99.pvt.empoweringmedia.net:27017
Sat Jan  5 17:07:11 replica set monitor for replica set affiniti1 started, address is affiniti1/affiniti03.pvt.empoweringmedia.net:27017,affinwww99.pvt.empoweringmedia.net:27017
Sat Jan  5 17:07:11 [ReplicaSetMonitorWatcher] starting

It does show that there is some type of problem connecting to the "seed". But the client starts up normally and I didn't even notice that error until I read closely.

I just tried stopping the secondary server (we're on CentOS, so I issued /etc/init.d/mongod stop) and this was the output:

/etc/init.d/mongod: line 69: ulimit: open files: cannot modify limit: Operation not permitted
Stopping mongod: 

And it just hangs there. It doesn't stop. In the mongod.log, the only thing happening is accepting and ending connections from the primary and arbiter servers.

Here are the local.oplog.rs stats:

{
	"ns" : "local.oplog.rs",
	"count" : 4197399,
	"size" : 922841276,
	"avgObjSize" : 219.8602696574712,
	"storageSize" : 990000128,
	"numExtents" : 1,
	"nindexes" : 0,
	"lastExtentSize" : 990000128,
	"paddingFactor" : 1,
	"systemFlags" : 0,
	"userFlags" : 0,
	"totalIndexSize" : 0,
	"indexSizes" : {
		
	},
	"capped" : true,
	"max" : 2147483647,
	"ok" : 1
}

Help! We can't afford to lose data on these servers.

Comment by Tejas Vora [ 31/Dec/12 ]

We are still getting this error. Today, I started with a blank database and was trying to restore a database of ~910 MB to it. Midway through the restore, it started throwing the same error: "local.oplog.rs Assertion failure !loc.isNull() src\mongo\db\pdfile.cpp 1636".

I checked the stats on the local.oplog.rs collection and they are as below:

{
        "ns" : "local.oplog.rs",
        "count" : 2387943,
        "size" : 951795844,
        "avgObjSize" : 398.5839879762624,
        "storageSize" : 990003200,
        "numExtents" : 1,
        "nindexes" : 0,
        "lastExtentSize" : 990003200,
        "paddingFactor" : 1,
        "systemFlags" : 0,
        "userFlags" : 0,
        "totalIndexSize" : 0,
        "indexSizes" : {
 
        },
        "capped" : true,
        "max" : 2147483647,
        "ok" : 1
}

db.stats() for local database

{
        "db" : "local",
        "collections" : 3,
        "objects" : 2387946,
        "avgObjSize" : 398.58357433543307,
        "dataSize" : 951796052,
        "storageSize" : 990019584,
        "numExtents" : 3,
        "indexes" : 0,
        "indexSize" : 0,
        "fileSize" : 1073741824,
        "nsSizeMB" : 16,
        "ok" : 1
}

db.printReplicationInfo()

{
        "logSizeMB" : 944.1375732421875,
        "usedMB" : 907.71,
        "errmsg" : "objects not found in local.oplog.$main -- is this a new and empty db instance?",
        "oplogMainRowCount" : 2387943
}

db.getReplicationInfo()

{
        "logSizeMB" : 944.1375732421875,
        "usedMB" : 907.71,
        "errmsg" : "objects not found in local.oplog.$main -- is this a new and empty db instance?",
        "oplogMainRowCount" : 2387943
}

db.getPrevError()

{ "err" : null, "n" : 0, "nPrev" : -1, "ok" : 1 }

rs.status()

{
        "set" : "iq",
        "date" : ISODate("2012-12-31T23:17:58Z"),
        "myState" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "name" : "MongoDB0:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 399734,
                        "optime" : Timestamp(1356992996000, 1),
                        "optimeDate" : ISODate("2012-12-31T22:29:56Z"),
                        "self" : true
                }
        ],
        "ok" : 1
}

rs.config()

{
        "_id" : "iq",
        "version" : 1,
        "members" : [
                {
                        "_id" : 0,
                        "host" : "MongoDB0:27017"
                }
        ]
}
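
Given the "objects not found in local.oplog.$main" errmsg above, it may also be worth checking whether the replica-set oplog can be read at all. A sketch using standard shell queries against the usual replica-set oplog namespace:

// Oldest and newest oplog entries, if any can be read.
var oplog = db.getSiblingDB("local").oplog.rs;
printjson(oplog.find().sort({ $natural: 1 }).limit(1).toArray());   // oldest entry
printjson(oplog.find().sort({ $natural: -1 }).limit(1).toArray());  // newest entry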

Comment by John Woakes [ 19/Dec/12 ]

Full logs

Comment by John Woakes [ 19/Dec/12 ]

This has only happened twice and we run multiple databases all the time. Nothing unusual about the database that this happened to. We have others just like it running the same code. I will attach the log files.

Comment by David Hows [ 18/Dec/12 ]

Hi John,

Are there any more details in your log? Could you attach your full logfile?

Is it occurring regularly? Has anything within your environment changed recently? Any upgrades or code changes?

How do you access your DB? Which drivers do you use and which versions?

Cheers,

David
