[SERVER-35050] Don't abort collection clone due to negative document count
Created: 17/May/18 | Updated: 29/Oct/23 | Resolved: 05/Sep/19
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 4.3.1, 4.2.4, 3.6.18, 4.0.17 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Bruce Lucas (Inactive) | Assignee: | Mihai Andrei |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | former-quick-wins, former-robust-initial-sync, neweng |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v4.2, v4.0, v3.6 |
| Sprint: | Repl 2019-09-09 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
In CollectionCloner::_countCallback we abort the clone if the value returned by the count command on the collection is negative. However, as we document, the count command's value is advisory and may not be accurate, e.g. after an unclean shutdown. Since we are (or should be) using the count only for advisory purposes during the clone, a negative count should not abort the clone; as it stands, it can for example prevent initial sync from completing.
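One way to see the advisory nature of the recorded count from the mongo shell is to compare the metadata-based count with an exact scan. This is a minimal sketch; "mydb" and "mycoll" are placeholder names.

```javascript
// Sketch: the count command reads collection metadata, which can drift
// (even below zero) after an unclean shutdown; itcount() iterates the
// actual documents and is accurate, just slower.
var coll = db.getSiblingDB("mydb").getCollection("mycoll");
print("metadata count: " + coll.count());          // advisory; may be negative
print("exact count:    " + coll.find().itcount()); // counts documents returned
```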
| Comments |
| Comment by Githook User [ 27/Feb/20 ] |
Author: {'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com'}
Message: (cherry picked from commit 5ed5b857aaf2e2fbf443588e9b4cbb359fbd1f4d)
| Comment by Githook User [ 27/Feb/20 ] |
Author: {'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com'}
Message: (cherry picked from commit 5ed5b857aaf2e2fbf443588e9b4cbb359fbd1f4d)
| Comment by Githook User [ 27/Feb/20 ] |
Author: {'name': 'Mihai Andrei', 'email': 'mihai.andrei@mongodb.com'}
Message: (cherry picked from commit 5ed5b857aaf2e2fbf443588e9b4cbb359fbd1f4d)
| Comment by Githook User [ 04/Sep/19 ] |
Author: {'email': 'mihai.andrei@mongodb.com', 'name': 'Mihai Andrei'}
Message:
| Comment by Ratika Gandhi [ 25/Jul/19 ] |
We want to remove the check and not allow counts to ever be negative going forward.
| Comment by Systems [ 04/Jan/19 ] |
We are experiencing this exact issue. Here is a quick script which cycles through all DBs and collections and fixes any -1 counts it finds. A few DBs are excluded (admin, local, config). Hope this saves many hours of frustration.
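A minimal mongo shell sketch along the lines this comment describes (not the original script) might look like the following; it skips admin, local, and config and relies on a full validate() to rebuild the stored count:

```javascript
// Sketch: walk every database and collection, and run a full validate()
// on any collection whose recorded count has gone negative.
db.adminCommand({ listDatabases: 1 }).databases.forEach(function (d) {
    if (["admin", "local", "config"].indexOf(d.name) !== -1) return;
    var curDb = db.getSiblingDB(d.name);
    curDb.getCollectionNames().forEach(function (name) {
        if (name.indexOf("system.") === 0) return; // leave system collections alone
        var coll = curDb.getCollection(name);
        if (coll.count() < 0) {
            print("Repairing negative count on " + d.name + "." + name);
            printjson(coll.validate(true)); // full validation recomputes the count
        }
    });
});
```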
| Comment by Scott Glajch [ 19/Dec/18 ] |
Thank you for the insight, Bruce.
| Comment by Bruce Lucas (Inactive) [ 19/Dec/18 ] |
You are correct that validate() can be resource intensive. However, this specific issue (initial sync failing because of a negative count) is most likely to occur only on a collection with a small number of documents, because a collection with a large number of documents is unlikely to have its recorded count skewed far enough to go negative, and the impact of running validate() on a collection with a small number of documents should be minimal. I think the procedure you describe could work, but I believe it would be unnecessary for the reason above.
| Comment by Scott Glajch [ 19/Dec/18 ] |
Thanks for the info, Bruce! The validate() docs mention that it can be resource intensive to run and that it obtains an exclusive lock.
My takeaway is that if the application on top of mongo behaves poorly when collection-level locks are held (which would be the case for at least some of our collections), running validate() would not be a great idea. That said, it might still be useful to know how the collection is unhealthy before trying to repair it manually. Do you think there would be any impact if I took a secondary, made it a hidden secondary, ran validate, moved it back to being a regular secondary, and then did manual steps to try to restore the correct data on the primary (like I did in my original comments)?
| Comment by Bruce Lucas (Inactive) [ 19/Dec/18 ] |
glajchs, thanks for your comment. For your future reference, and for anyone else reading this ticket, an accurate count can also be restored by running validate() on the affected collection. We document that here.
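For a single affected collection, that repair is a one-liner in the mongo shell (placeholder names; as noted in the comment above, a full validation obtains an exclusive lock on the collection while it runs):

```javascript
// Sketch: a full validate() recomputes and stores an accurate document count.
db.getSiblingDB("mydb").getCollection("mycoll").validate(true);
```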
| Comment by Scott Glajch [ 19/Dec/18 ] |
We just hit this issue when trying to initial sync a replacement node for a bad secondary (well, it used to be the primary before it went bad). We are using mongo 3.4.17. The node it's syncing from used to be a primary at one point and was shut down ungracefully (OOM killed), but it's not necessarily the only node in the shard to have had an unsafe shutdown over the shard's history. The actual message (with db/collection/sync source names redacted) was:
I ended up resolving the fact that count was returning -1 by going onto the shard's primary, doing a find() and seeing that it returned only 1 document, doing a mongoexport of that document, a .remove() of the document (which fixed the count so it now returns 0), and then a mongoimport of the document. Hopefully those were the correct steps to repair this collection. I'm going to run a looper that calls count on all of our collections to make sure there aren't any other gems waiting for us further down the initial sync pipeline, but hopefully this info and these steps help someone else in the future.
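For future readers, here is a minimal mongo shell sketch of the same remove-and-reinsert idea, keeping the document in a shell variable rather than round-tripping it through mongoexport/mongoimport; database and collection names are placeholders, and validate() remains the documented way to repair the count.

```javascript
// Sketch: save the collection's document(s), remove them (which resets the
// recorded count to 0), then insert the saved copies back.
var coll = db.getSiblingDB("mydb").getCollection("mycoll");
var saved = coll.find().toArray();                   // the comment reports a single document
coll.remove({});                                     // recorded count returns to 0
saved.forEach(function (doc) { coll.insert(doc); }); // re-insert the saved document(s)
print("count after repair: " + coll.count());
```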