[SERVER-6844] Invalid BSONObj size error causing secondary optime freeze Created: 24/Aug/12 Updated: 15/Feb/13 Resolved: 27/Aug/12 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | 2.1.0 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | John Woakes | Assignee: | Ben Becker |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Environment: | Azure Worker Role |
| Attachments: | |
| Issue Links: | |
| Operating System: | ALL |
| Participants: | |
| Description |
|
One of our production secondary databases now has an optime that is nearly 24 hours old. Looking at the logs at that optime I see some assertions:

[rsSync] Assertion: 10334:Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101; TraceSource 'WaWorkerHost.exe' event", o: { _id: "fOLao2S5-cu8sYYVWuAmOr_t77XseH8rN0t7nt-fNUI", Expires: new Date(1345752117253), Id: "damcomm", AccountId: BinData, Roles: [ "Dealer" ], AuthApiToken: null } }
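For what it's worth, the reported size looks like string data rather than a real length: the hex printed in the message (0x44 0x65 0x61 0x6C) is ASCII "Deal", which matches the "Dealer" string in the Roles array above, and those four bytes read as a little-endian int32 give exactly 1818322244. A small Python sketch of the decoding (the byte values are copied from the log line above; the interpretation that they sit where the length prefix should be is an assumption):

    import struct

    # Bytes taken from the hex printed in the assertion (0x4465616C); they appear
    # to be what mongod read where the document's int32 length prefix should be.
    raw = bytes([0x44, 0x65, 0x61, 0x6C])

    print(raw.decode('ascii'))          # -> 'Deal' (compare the "Dealer" value in Roles)
    print(struct.unpack('<i', raw)[0])  # -> 1818322244, the size the assertion reports
|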
| Comments |
| Comment by John Woakes [ 27/Aug/12 ] |
|
Sounds good to me - looking forward to 2.2 |
| Comment by Ben Becker [ 27/Aug/12 ] |
|
Hi John,

This indeed confirms that the issue you encountered was caused by the failed FlushViewOfFile() call described in the comments below.

Thanks! |
| Comment by John Woakes [ 27/Aug/12 ] |
|
I downloaded the logs for the previous 36 hours and found these messages [I didn't see anything else near these messages that might be related].

On "MongoWorker_IN_1" (the other secondary):

Wed Aug 22 13:33:32 [DataFileSync] FlushViewOfFile failed 33 file: b:/data/xq.3

This is the first Invalid BSONObj error, on "MongoWorker_IN_2":

Thu Aug 23 19:31:58 [rsSync] Assertion: 10334:Invalid BSONObj size: 1818322244 (0x4465616C) first element: r: ?type=101

Instance 0 was the primary at this point in time:

[2012-08-22 13:30:37,58523]: |INFO | - Id = 0, Name = 10.26.2.151:27017, Health = UP, State Id = 1, State = PRIMARY, Heart Beat = Not Applicable, Optime = 8/22/2012 1:30:37 PM, Ping = Not Applicable

We are looking forward to the release of 2.2 very much, and "stable" is good. |
| Comment by Ben Becker [ 27/Aug/12 ] |
|
Hi John,

The full logs since the instance was started would be the most helpful. We're looking for logs from whichever machine was primary at the time of the corruption, and we don't know exactly when that happened. 10.26.2.151:27017 was the primary during the period covered by the attached logs, and it is replica set member 0, so I think that would be the first place to look. Unfortunately it's hard to say how far back you will need to search, as the corruption may have happened at any point. The main goal is to find the error message(s) to confirm that upgrading will prevent the issue going forward.

Version 2.2 is just around the corner and should be stable, but version 2.0.7 also fixes the known issue.

Regards,
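P.S. On the question below about how to tell which member a secondary is syncing from: rs.status() (the replSetGetStatus command) shows each member's state and optime, and newer server versions also report the sync source directly in that output. A rough PyMongo sketch of reading it (the address is a placeholder, the directConnection option needs a reasonably recent PyMongo, and the sync-source field name varies by server version, so treat both lookups as best-effort):

    from pymongo import MongoClient

    # Connect directly to any reachable replica set member (placeholder address).
    client = MongoClient('mongodb://10.26.2.151:27017/?directConnection=true')
    status = client.admin.command('replSetGetStatus')

    # Use the current primary's optime as the reference point for lag
    # (assumes a primary is up when this runs).
    primary_optime = max(m['optimeDate'] for m in status['members']
                         if m['stateStr'] == 'PRIMARY')

    for m in status['members']:
        lag = (primary_optime - m['optimeDate']).total_seconds()
        # 'syncSourceHost' on newer servers, 'syncingTo' on older ones; may be absent.
        source = m.get('syncSourceHost') or m.get('syncingTo') or 'n/a'
        print('%-22s %-10s lag=%6ds syncing from: %s'
              % (m['name'], m['stateStr'], lag, source))
|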
| Comment by John Woakes [ 27/Aug/12 ] |
|
I assume the secondary was syncing from the primary [is there a way to tell?]. I can search the logs for the FlushViewOfFile() call. How far back should I search? Our logs go back 1 month, but they are on Azure and it would take a long time to download more than a day's worth, though it could be done. This particular instance of mongo has been running since Aug 8th, 2012.

We are waiting for the final release of 2.2 before we update. This is my boss's firm wish unless I [you?] can persuade him otherwise. |
| Comment by Ben Becker [ 25/Aug/12 ] |
|
Hmm, I'm just interested in the host this secondary is syncing from. How far back do the logs go on the primary? And how long have these instances been running? I would also advise upgrading to the latest release candidate to ensure we avoid any future issues like this. |
| Comment by John Woakes [ 25/Aug/12 ] |
|
The logs above are for the secondary. I did the search and only found this entry:

[DataFileSync] flushing mmaps took 10296ms for 8 files

Is that what you were after? |
| Comment by Ben Becker [ 24/Aug/12 ] |
|
Hi John,

There was a known issue prior to v2.2.0-rc0 and v2.0.7 which could cause the journal file to contain invalid data if the FlushViewOfFile() call fails for more than 60 seconds; it is resolved in those releases. This appears to have happened on the primary though, not the secondary. Could you search for all occurrences of 'FlushViewOfFile for <some file name> failed with error ...' and paste them to this ticket? A simple grep for FlushViewOfFile would be sufficient to confirm this is the case.

Thanks!
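P.S. If the logs end up being pulled down from Azure as local files, the equivalent of that grep is easy to script. A rough sketch (Python, with placeholder paths) that also flags the "Invalid BSONObj size" assertions so the two events can be lined up:

    import glob
    import re

    # Lines of interest: failed journal flushes and the resulting assertions.
    pattern = re.compile(r'FlushViewOfFile|Invalid BSONObj size')

    # Placeholder glob -- point this at the downloaded mongod log files.
    for path in sorted(glob.glob('logs/mongod-*.log')):
        with open(path, errors='replace') as f:
            for lineno, line in enumerate(f, 1):
                if pattern.search(line):
                    print('%s:%d: %s' % (path, lineno, line.rstrip()))
|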
| Comment by John Woakes [ 24/Aug/12 ] |
|
I ran validate on the primary and got these results.

iq:PRIMARY> db.runCommand( { validate: "SessionModel", full: true, scandata: true } )
...
{
    "loc" : "3:2e96000",
    "xnext" : "null",
    "xprev" : "3:2e42000",
    "nsdiag" : "xq.SessionModel",
    "size" : 786432,
    "firstRecord" : "3:2e960b0",
    "lastRecord" : "3:2eaf9d0"
}
...
{
    "_id" : 2,
    "name" : "10.26.2.175:27017",
    "health" : 1,
    "state" : 2,
    "stateStr" : "SECONDARY",
    "uptime" : 289183,
    "optime" : Timestamp(1345750316000, 65),
    "optimeDate" : ISODate("2012-08-23T19:31:56Z"),
    "lastHeartbeat" : ISODate("2012-08-24T19:27:32Z"),
    "pingMs" : 1
}
...
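The replica-set member entry at the end of that output is what shows the stuck secondary; comparing its optimeDate with its lastHeartbeat gives roughly the lag described above. A small sketch of the arithmetic (timestamps copied from the output, treated as UTC):

    from datetime import datetime

    optime_date    = datetime(2012, 8, 23, 19, 31, 56)   # optimeDate: last op applied by member 2
    last_heartbeat = datetime(2012, 8, 24, 19, 27, 32)   # lastHeartbeat: most recent contact with member 2

    print(last_heartbeat - optime_date)   # -> 23:55:36, i.e. nearly a full day behind
|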