[SERVER-28801] Enhance the index validation code to output details on what is wrong Created: 14/Apr/17 Updated: 06/Dec/22 Resolved: 02/May/19 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Testing Infrastructure |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Minor - P4 |
| Reporter: | Eric Milkie | Assignee: | Backlog - Storage Execution Team |
| Resolution: | Duplicate | Votes: | 2 |
| Labels: | stm | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||
| Assigned Teams: |
Storage Execution
|
||||||||
| Participants: | |||||||||
| Linked BF Score: | 0 | ||||||||
| Description |
|
Currently, the index validation code only mentions that a corrupt index has too many or too few index entries. But the code has knowledge of which documents are missing index entries, which index entries are missing documents, and which index entries are incorrectly referencing the same document. It would be helpful to display this information when the validation fails.
(It doesn't even mention explicitly why the index is invalid: too many, or too few entries) |
| Comments |
| Comment by Max Hirschhorn [ 02/May/19 ] |
|
Pretty sure this is done thanks to gregory.wlodarek but reassigning to the Storage team to confirm. |
| Comment by Geert Bosch [ 27/Aug/18 ] |
|
Yes, it was always the intent to do another pass, but we never got to schedule that work and was not strictly necessary for the primary goal of detecting corruption. |
| Comment by Eric Milkie [ 21/Aug/18 ] |
|
The code was designed to not hold locks, as part of the background validation project, so it depends on whether you want to break that or not. |
| Comment by Max Hirschhorn [ 21/Aug/18 ] |
Couldn't we just hold the MODE_X lock on the collection for even longer to avoid that being an issue? |
| Comment by Eric Milkie [ 21/Aug/18 ] |
|
Yes, that could be done. Writes to the collection in the interim could throw the results off, but for our automated testing cases that shouldn't be a problem. |
| Comment by Max Hirschhorn [ 16/Aug/18 ] |
|
geert.bosch, milkie, do either of you know whether another pass through the collection or an index could be done to report the documents or index keys which map to the offending hash? |
| Comment by Eric Milkie [ 18/Apr/17 ] |
|
In light of these limitations, I'd like to change the work for this ticket to encompass changing the validate() logic itself. This would be a lot more work, but eventually this sort of diagnosis can be really valuable. |
| Comment by Robert Guo (Inactive) [ 18/Apr/17 ] |
|
1. In the case of having more documents than index entries, validate() would confuse users because it only knows when the number of documents has exceeded the number of index entries, but the "leftover" documents may not be the root cause of the inconsistency; the actual cause is likely that previously scanned documents had missing index entries. Showing the leftover documents could be useful for our tests where we have relatively few documents and there's a higher chance of an invalid document being printed; but in the field where the number of inconsistent documents is typically far fewer than the total number of documents, it's highly unlikely the printed information will be useful. (edit: we're dumping the collection in the JS hook if validate() fails, so we don't really gain additional information in our tests from showing these leftover documents...) 2. For index entries missing documents, we don't currently keep track of the extra entry, or even which index has extra entries (all we know is that a number in a hash table is >0, which could have been caused by any index); so there's no extra information to print. 3. Misordered indexes is one scenario where we do have extra information! In this case, we can actually add the offending entries to the error message. (But I don't think we've ever seen this error in our testing) milkie Given the limitations of points 1 and 2, I don't think there's anything more we can print for users without changing the validate() code to do another pass of the collection and invalid index. |
| Comment by Eric Milkie [ 18/Apr/17 ] |
|
In addition to helping diagnose build failures, I was hoping to use this new code to help diagnose failures in the field for which we have no reproducer in house. I'm not sure having the extra information will confuse users in regard to what they need to do next; they will still either resync, or contact MongoDB for assistance. |
| Comment by Max Hirschhorn [ 18/Apr/17 ] |
|
After some discussion internally with the TIG team, we are likely to only add the additional logging when the server is running with enableTestCommands=1 to avoid any confusion around why in some cases additional details are included in the response. The details themselves won't really instruct a user as to the action they need to take since they'll still want to resync or reIndex their node. milkie, does that seem reasonable to you? |