[SERVER-6844] Invalid BSONObj size error causing secondary optime freeze Created: 24/Aug/12  Updated: 15/Feb/13  Resolved: 27/Aug/12

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.1.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: John Woakes Assignee: Ben Becker
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Azure Worker Role


Attachments: Text File Secondary.log    
Issue Links:
Duplicate
duplicates SERVER-6311 FlushViewOfFile failures can result i... Closed
Operating System: ALL
Participants:

 Description   

One of our production secondary databases now has an optime that is nearly 24 hours old. Looking at the logs around that optime, I see these assertions:

[rsSync] Assertion: 10334:Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101; TraceSource 'WaWorkerHost.exe' event"
[rsSync] replSet syncTail: 10334 Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101, syncing: { ts: Timestamp 1345750316000|66, h: 1018104994802725908, op: "u", ns: "xq.SessionModel", o2: { _id: "fOLao2S5-cu8sYYVWuAmOr_t77XseH8rN0t7nt-fNUI" }, o: { _id: "fOLao2S5-cu8sYYVWuAmOr_t77XseH8rN0t7nt-fNUI", Expires: new Date(1345752117253), Id: "damcomm", AccountId: BinData, Roles: [ "Dealer" ], AuthApiToken: null } }; TraceSource 'WaWorkerHost.exe' event"
[rsSync] replSet not trying to sync from 10.26.10.41:27017, it is vetoed for 269 more seconds; TraceSource 'WaWorkerHost.exe' event"
[rsSync] replSet syncing to: 10.26.2.151:27017; TraceSource 'WaWorkerHost.exe' event"
[rsSync] Assertion: 10334:Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101; TraceSource 'WaWorkerHost.exe' event"
[rsSync] replSet syncTail: 10334 Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101, syncing: { ts: Timestamp 1345750316000|66, h: 1018104994802725908, op: "u", ns: "xq.SessionModel", o2: { _id: "fOLao2S5-cu8sYYVWuAmOr_t77XseH8rN0t7nt-fNUI" }, o: { _id: "fOLao2S5-cu8sYYVWuAmOr_t77XseH8rN0t7nt-fNUI", Expires: new Date(1345752117253), Id: "damcomm", AccountId: BinData, Roles: [ "Dealer" ], AuthApiToken: null } }; TraceSource 'WaWorkerHost.exe' event"
[rsSync] replSet not trying to sync from 10.26.10.41:27017, it is vetoed for 238 more seconds; TraceSource 'WaWorkerHost.exe' event"
[rsSync] replSet not trying to sync from 10.26.2.151:27017, it is vetoed for 269 more seconds; TraceSource 'WaWorkerHost.exe' event"
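
For context on how large this lag is, a minimal lag check from the mongo shell (a sketch, not part of the original report; it assumes a connection to any member of the set and uses only rs.status()) is to compare each member's optimeDate against the newest optime in the set:

// Hedged sketch: print how far each member's optimeDate trails the newest
// optime reported by rs.status(), in seconds.
var s = rs.status();
var newest = 0;
s.members.forEach(function (m) {
    if (m.optimeDate.getTime() > newest) { newest = m.optimeDate.getTime(); }
});
s.members.forEach(function (m) {
    print(m.name + " (" + m.stateStr + ") lag: ~" +
          (newest - m.optimeDate.getTime()) / 1000 + "s");
});

db.printSlaveReplicationInfo() should give a similar per-secondary summary on builds of this era.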



 Comments   
Comment by John Woakes [ 27/Aug/12 ]

Sounds good to me - looking forward to 2.2

Comment by Ben Becker [ 27/Aug/12 ]

Hi John,

This indeed confirms the issue you encountered was caused by SERVER-6311 and should be resolved with the latest release candidate. I'm going to go ahead and mark this issue as a duplicate of SERVER-6311, but please feel free to reopen if you have any further questions or issues.

Thanks!
-Ben

Comment by John Woakes [ 27/Aug/12 ]

I downloaded the logs for the previous 36 hours and found these messages [I didn't see anything else near these messages that might be related]:

"MongoWorker_IN_1" Wed Aug 22 13:33:32 [DataFileSync] FlushViewOfFile failed 33 file: b:/data/xq.3 // other secondary
"MongoWorker_IN_0" Wed Aug 22 13:33:41 [DataFileSync] FlushViewOfFile failed 33 file: b:/data/xq.3 // primary
"MongoWorker_IN_2" Wed Aug 22 13:33:42 [DataFileSync] FlushViewOfFile failed 33 file: F:/data/xq.3 // secondary that stopped replicating

This is the first Invalid BSONObj error:

"MongoWorker_IN_2", Thu Aug 23 19:31:58 [rsSync] Assertion: 10334:Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101

Instance 0 was the primary at that point in time.

"[2012-08-22 13:30:37,58523]: |INFO | - Id = 0, Name = 10.26.2.151:27017, Health = UP, State Id = 1, State = PRIMARY, Heart Beat = Not Applicable, Optime = 8/22/2012 1:30:37 PM, Ping = Not Applicable
"[2012-08-22 13:30:37,58523]: |INFO | - Id = 1, Name = 10.26.10.41:27017, Health = UP, State Id = 2, State = SECONDARY, Heart Beat = 2012-08-22 13:30 PM, Optime = 8/22/2012 1:30:35 PM, Ping = 3
"[2012-08-22 13:30:37,58523]: |INFO | - Id = 2, Name = 10.26.2.175:27017, Health = UP, State Id = 2, State = SECONDARY, Heart Beat = 2012-08-22 13:30 PM, Optime = 8/22/2012 1:30:35 PM, Ping = 1

We are very much looking forward to the release of 2.2, and "stable" is good.

Comment by Ben Becker [ 27/Aug/12 ]

Hi John,

The full logs since the instance was started would be the most helpful. We're looking for the machine that was primary at the time of corruption, which is unknown. 10.26.2.151:27017 was the primary during the period of the attached logs, and is replica set member 0, so I think this would be the first place to look.

Unfortunately it's hard to say how far back you will need to search, as the corruption may have happened at any point. The main goal is to find the error message(s) to confirm upgrading will prevent the issue going forward.

Version 2.2 is just around the corner and should be stable, but version 2.0.7 also fixes the known issue.

Regards,
Ben

Comment by John Woakes [ 27/Aug/12 ]

I assume the secondary was syncing from the primary [is there a way to tell?]. I can search the logs for the FlushViewOfFile() call. How far back should I search?

Our logs go back one month, but they are on Azure and it would take a long time to download more than a day's worth; it could be done, though.

This particular instance of mongo has been running since Aug 8th, 2012.

We are waiting for the final release of 2.2 before we update. This is my boss's firm wish unless I [you?] can persuade him otherwise.
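
(On the "[is there a way to tell?]" question above: the secondary's own log records its sync source, as in the "replSet syncing to: 10.26.2.151:27017" line in the description. A rough shell check is sketched below; it is an assumption that the running build exposes a syncingTo field in rs.status(), so the code falls back to the log otherwise.)

// Hedged sketch: run on the lagging secondary.
var s = rs.status();
if (s.syncingTo) {
    print("syncing from: " + s.syncingTo);
} else {
    print("no syncingTo field on this build; check the log for 'replSet syncing to:'");
}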

Comment by Ben Becker [ 25/Aug/12 ]

Hmm, I'm just interested in the host this secondary is syncing from. How far back do the logs go on the primary? And how long have these instances been running?

I would also advise upgrading to the latest release candidate to ensure we avoid any future issues like this.

Comment by John Woakes [ 25/Aug/12 ]

The logs above are for the secondary.

I did the search and only found this entry:

[DataFileSync] flushing mmaps took 10296ms for 8 files

Is that what you were after?

Comment by Ben Becker [ 24/Aug/12 ]

Hi John,

There was a known issue prior to v2.2.0-rc0 and v2.0.7 which could cause the journal file to contain invalid data if the FlushViewOfFile() call failed for more than 60 seconds. This was resolved in SERVER-6311.

This appears to have happened on the primary though, not the secondary. Could you search for all occurrences of 'FlushViewOfFile for <some file name> failed with error ...' and paste them into this ticket? A simple grep for FlushViewOfFile would be sufficient to confirm this is the case.

Thanks!
-Ben
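
(For reference, a quick way to check whether a member already runs a build containing the fix described above is to compare db.version() against the 2.0.7 / 2.2.0-rc0 thresholds from this comment. This is only a sketch; run it against each member in turn.)

// Hedged sketch: classify the running version relative to the SERVER-6311 fix.
var parts = db.version().split(".");                  // e.g. "2.0.6" -> ["2", "0", "6"]
var major = parseInt(parts[0], 10);
var minor = parseInt(parts[1], 10);
var patch = parseInt(parts[2], 10);                   // "0-rc0" parses as 0
var fixed = (major > 2) ||
            (major == 2 && minor >= 2) ||             // 2.2.0-rc0 and later
            (major == 2 && minor == 0 && patch >= 7); // 2.0.7 and later
print(db.version() + (fixed ? " includes" : " predates") + " the SERVER-6311 fix");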

Comment by John Woakes [ 24/Aug/12 ]

I ran validate on the primary and got these results.

iq:PRIMARY> db.runCommand({ validate: "SessionModel", full: true, scandata: true })
{
    "ns" : "xq.SessionModel",
    "firstExtent" : "3:2e2a000 ns:xq.SessionModel",
    "lastExtent" : "3:2e96000 ns:xq.SessionModel",
    "extentCount" : 4,
    "extents" : [
        { "loc" : "3:2e2a000", "xnext" : "3:2e36000", "xprev" : "null", "nsdiag" : "xq.SessionModel", "size" : 12288, "firstRecord" : "3:2e2a258", "lastRecord" : "3:2e2a568" },
        { "loc" : "3:2e36000", "xnext" : "3:2e42000", "xprev" : "3:2e2a000", "nsdiag" : "xq.SessionModel", "size" : 49152, "firstRecord" : "3:2e360b0", "lastRecord" : "3:2e3e134" },
        { "loc" : "3:2e42000", "xnext" : "3:2e96000", "xprev" : "3:2e36000", "nsdiag" : "xq.SessionModel", "size" : 196608, "firstRecord" : "3:2e42184", "lastRecord" : "3:2e687fc" },
        { "loc" : "3:2e96000", "xnext" : "null", "xprev" : "3:2e42000", "nsdiag" : "xq.SessionModel", "size" : 786432, "firstRecord" : "3:2e960b0", "lastRecord" : "3:2eaf9d0" }
    ],
    "datasize" : 401924,
    "nrecords" : 2143,
    "lastExtentSize" : 786432,
    "padding" : 1,
    "firstExtentDetails" : { "loc" : "3:2e2a000", "xnext" : "3:2e36000", "xprev" : "null", "nsdiag" : "xq.SessionModel", "size" : 12288, "firstRecord" : "3:2e2a258", "lastRecord" : "3:2e2a568" },
    "objectsFound" : 2143,
    "invalidObjects" : 0,
    "bytesWithHeaders" : 436212,
    "bytesWithoutHeaders" : 401924,
    "deletedCount" : 11,
    "deletedSize" : 607564,
    "nIndexes" : 1,
    "keysPerIndex" : { "xq.SessionModel.$_id_" : 2143 },
    "valid" : true,
    "errors" : [ ],
    "ok" : 1
}
iq:PRIMARY> rs.status()
{
    "set" : "iq",
    "date" : ISODate("2012-08-24T19:27:33Z"),
    "myState" : 1,
    "members" : [
        { "_id" : 0, "name" : "10.26.2.151:27017", "health" : 1, "state" : 1, "stateStr" : "PRIMARY", "uptime" : 1468495, "optime" : Timestamp(1345836453000, 2), "optimeDate" : ISODate("2012-08-24T19:27:33Z"), "self" : true },
        { "_id" : 1, "name" : "10.26.10.41:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 1468205, "optime" : Timestamp(1345836450000, 1), "optimeDate" : ISODate("2012-08-24T19:27:30Z"), "lastHeartbeat" : ISODate("2012-08-24T19:27:32Z"), "pingMs" : 6 },
        { "_id" : 2, "name" : "10.26.2.175:27017", "health" : 1, "state" : 2, "stateStr" : "SECONDARY", "uptime" : 289183, "optime" : Timestamp(1345750316000, 65), "optimeDate" : ISODate("2012-08-23T19:31:56Z"), "lastHeartbeat" : ISODate("2012-08-24T19:27:32Z"), "pingMs" : 1 }
    ],
    "ok" : 1
}
