[SERVER-20602] Malformed chunk at end of metrics file Created: 24/Sep/15  Updated: 07/Oct/15  Resolved: 01/Oct/15

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: 3.1.9

Type: Bug Priority: Major - P3
Reporter: Bruce Lucas (Inactive) Assignee: Andrew Morrow (Inactive)
Resolution: Done Votes: 0
Labels: FTDC
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Backwards Compatibility: Fully Compatible
Operating System: ALL
Sprint: Platform A (10/09/15)
Participants:

 Description   

Some metrics files end with a malformed chunk that has an empty reference document but nmetrics > 0. Ideally an empty chunk shouldn't be present at all, but if one is, its nmetrics value should be 0.



 Comments   
Comment by Githook User [ 01/Oct/15 ]

Author:

Andrew Morrow (acmorrow) <acm@mongodb.com>

Message: SERVER-20602 There is never FTDC data to flush if we lack a reference document
Branch: master
https://github.com/mongodb/mongo/commit/a861bf352a9e71ac46e80d19d96f5fa5d9c068e3

Comment by Bruce Lucas (Inactive) [ 29/Sep/15 ]

Took a closer look at the data, and as far as I can see with this fix there are no missing samples.

Comment by Andrew Morrow (Inactive) [ 28/Sep/15 ]

bruce.lucas@mongodb.com - I've confirmed that it is the same issue. The behavior here is that you add a sample, which puts you over the sample limit, so you flush some data and reset the reference document. Flushing that data puts you over the file size limit, so you go to rotate. Rotate closes the file, which calls flush again. At this point, due to the bug identified before, we interpret the non-reset value of _compressedChunkBuffer.len() > 0 as indicating that we have data, even though the reference doc is empty. At that point, we write a malformed chunk. I'm pretty confident that this is the right fix. My only concern, which I share with you, is that there is some other condition, which we are overlooking, that necessitates the second part of that expression. If you could take a closer look at your data and try to establish whether it is complete, I'd be more comfortable proposing this as the fix.

Comment by Bruce Lucas (Inactive) [ 28/Sep/15 ]

The issue I observed occurs when the file size reaches its limit and we roll over to the next file. I don't know whether it also occurs on shutdown - it may. I tried the change you identified, and it does indeed seem to fix the issue I observed on file rollover, and a cursory look shows the files are otherwise still valid (although I haven't looked closely to make sure, for example, that we aren't losing samples at the point where we roll over).

Comment by Andrew Morrow (Inactive) [ 25/Sep/15 ]

bruce.lucas - I believe I have diagnosed this issue. Here is the offending line:

https://github.com/mongodb/mongo/blob/88c34d5fe384f80508c3b489ced5fcf4ae914bd1/src/mongo/db/ftdc/compressor.h#L113

The first part of that expression is fine: if we don't have a reference document, we definitely don't have data. However, the OR'ed-in check on _compressedChunkBuffer.len() is erroneous. After flushing data, the compressor resets the reference document, but it does not reset the other state variables. At shutdown, mongod does a final flush. If that shutdown flush is invoked after a flush driven by the addition of new data, without an intervening sample collection, then the reference document will be empty but _compressedChunkBuffer.len() will be non-zero. This causes us to attempt to write data, with an empty reference document and a non-zero metrics count. This would also explain why only some metrics files have this issue: mongod needs to be terminated right after it has flushed a file and before it has collected another sample.

The fix appears to be simply to remove the " || _compressedChunkBuffer.len() > 0" from that line, and allow the presence or absence of the reference document to be the sole determinant of whether there is data to be flushed. This makes sense: if you have a reference document, you have at least one datum (the reference document itself); if you don't have a reference document, you don't have anything to flush.

Could you please test the above change locally and see if it fixes the issue for you? If so, I will put it up for CR and into the patch queue.

Generated at Thu Feb 08 03:54:42 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.