[SERVER-15265] passes >= maxPasses (capped collection) Created: 16/Sep/14  Updated: 17/Oct/14  Resolved: 25/Sep/14

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 2.4.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Duncan Phillips Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongo.log    
Issue Links:
Duplicate
duplicates SERVER-6981 maxPasses assertion (on allocation fa... Closed
Operating System: ALL
Steps To Reproduce:

I'm unable to reproduce

 Description   

We have a 5-node mongo replica set, and after a network outage, all our apps flushed their data to the mongo master. The master stayed up and replicated across to the other nodes, but at a specific point 3 of the replica set members crashed, and we were unable to recover them from that state. Fortunately, we were able to restore the crashed members from the remaining healthy nodes.

I've attached log files from the master.

A bit of context on what happened when we started the mongo node up: the journal recovered, but then the I/O went through the roof, and when running with -vvvv we saw the bgsync not keeping up. When the fatal exception happens, it is always on the same query against the same capped collection, which I assume the node is trying to replay from the master. I confirmed that the other nodes break on the same query. There is nothing special about the query; the document is a small blob with very few fields, and the capped collection has many just like it.
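For reference, here is how we confirmed the secondaries were falling behind (standard replica-set shell helpers; nothing assumed beyond what the 2.4 shell provides):

    // On the primary: how far behind each secondary's optime is
    rs.printSlaveReplicationInfo()
    // Full member state, including optimes and health
    rs.status()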

The OS is Ubuntu 10.04.3 LTS, the filesystem is XFS, and the machine has 22 GB of memory.



 Comments   
Comment by Daniel Pasette (Inactive) [ 23/Sep/14 ]

Agree that the message could be cleaned up considerably. redbeard0531 is doing some work in that code for 2.8, so we should see some improvements there.

Comment by Duncan Phillips [ 22/Sep/14 ]

Hi Dan, thanks for the feedback and for helping me understand the issue. As a side note, we discovered that the capped collection had been sized down at some point, so it was only 4 MB. I have a feeling that this, together with the large variation in record sizes in that collection, caused the issue. Unfortunately, I can't provide the raw data. As a request, could that error message be improved? Even after reading the source code, it was unclear that this was the situation.
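For anyone else checking for the same misconfiguration, the collection's capped settings are visible from the shell ("mycoll" is a placeholder name, and the exact stats field names vary slightly by server version):

    db.mycoll.isCapped()  // true for a capped collection
    db.mycoll.stats()     // inspect the reported storage size and any "max" setting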

Comment by Daniel Pasette (Inactive) [ 19/Sep/14 ]

Hi Duncan,

This message indicates that the server cannot find a contiguous run of free space large enough to fit a 55272-byte record after deleting 5000 records. Since your average record size is ~450 bytes, that would require at least 100 contiguous average-sized records to be deleted; if it happened to delete smaller records, it could require many more. Would it be possible to get a compressed copy of the raw data files that are exhibiting this error?
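For concreteness, the arithmetic behind that estimate:

    55272 bytes needed / ~450 bytes per average record ≈ 123 contiguous records
    The allocator gives up after 5000 delete passes ("passes >= maxPasses"),
    so heavy fragmentation from mixed record sizes can exhaust the passes
    before a large-enough contiguous run opens up.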

The workaround, unfortunately, is onerous. Because the operation causing the problem is already in the oplog and crashing some secondaries, you have to get past this operation for replication to continue. There are a couple of ways to do this, but the easiest is to resync the secondaries that are having the issue (sketched below).
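A sketch of that resync, assuming a standard packaged install (the service name and dbpath are placeholders; an initial sync copies all data from another member, so allow for the I/O):

    # On each affected secondary:
    sudo service mongodb stop     # stop the crashed/stuck mongod
    rm -rf /var/lib/mongodb/*     # empty the dbpath (placeholder path!)
    sudo service mongodb start    # on restart the member runs a full initial sync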

To be sure, this is a fairly rare occurrence, but it can be seen when inserting documents of vastly different sizes into capped collections. If it is acceptable to your application to cap the number of documents in the collection, you can avoid this situation by setting the maximum document count to < 5000. See an example here: http://docs.mongodb.org/manual/reference/method/db.createCollection/#example
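A minimal sketch of that mitigation (collection name and sizes are placeholders):

    // Capping the document count as well as the byte size keeps the
    // allocator's worst case bounded: with max < 5000 documents, it can
    // always free enough records within its pass limit.
    db.createCollection("events", { capped: true,
                                    size: 64 * 1024 * 1024,  // max bytes
                                    max: 4096 })             // max documents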

Comment by Duncan Phillips [ 19/Sep/14 ]

We've just been hit by this again; it has taken out 4 of our 5 nodes and brought our site down.
