[SERVER-14117] moveChunk should attempt to retry write errors during chunk cleanup Created: 31/May/14 Updated: 10/Dec/14 Resolved: 22/Jul/14
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | David Murphy | Assignee: | Greg Studer |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Participants: | |
| Description |
Currently mongos will report the moveChunk as complete as soon as the chunk move hits an error in phase 6. It should instead retry based on a setting such as config.settings.moveRetries=3. The default would be 0 to preserve the previous behavior, but this would be very helpful for avoiding orphans in the first place. I am aware we have a new function to clean them up, but you can still have logical DB corruption in the meantime. |
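A minimal sketch of how the proposed knob might be stored, assuming a hypothetical moveRetries document in config.settings (no such setting exists today; the shape just mirrors how other cluster settings are kept there):

```
// Hypothetical only: "moveRetries" is the setting proposed in this ticket,
// not an existing one. Other cluster settings live in config.settings as
// documents keyed by _id in the same way.
db.getSiblingDB("config").settings.update(
    { _id: "moveRetries" },
    { $set: { value: 3 } },   // number of extra cleanup attempts
    { upsert: true }
);
```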
| Comments |
| Comment by Greg Studer [ 23/Jul/14 ] |
> If we wait for #2 we are purposefully leaving the system with data that ChunkManager-unsafe commands like count will see, and thus return the wrong data.

This is actually a separate but related issue.

> as who is the primary on a given shard is not actually important

I think there's a misunderstanding here - the chunk cleanup (and all stages of migration) is driven by the primary host of the FROM shard. Mongos just passes the moveChunk along to the shard and receives "ok" when the logical migration is finished. The cleanup may not have happened yet, since that is a heuristic enforced by mongod, and there is nothing mongos knows to retry. Failures during the migration itself mongos often does retry (sometimes indefinitely) when the migration is driven by the balancer, because balancing is deterministic per-collection. N retries would require new state to track "attempted cleanups" on mongod hosts, plus synchronization with replication and lazy metadata loading - at that point you're designing a "background cleanup process" with a prioritized queue (and we basically have this with the RangeDeleter, though it needs some love, if you'd like to look). |
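For reference, the manual counterpart to the RangeDeleter is the cleanupOrphaned command that shipped in 2.6; a small sketch of the documented stoppedAtKey loop for sweeping all orphaned ranges of a collection (the namespace "test.user" is just an example):

```
// Run against the PRIMARY of the shard that may hold orphans, not mongos.
// cleanupOrphaned and stoppedAtKey are the documented 2.6 interface.
var nextKey = {};
while (nextKey != null) {
    var result = db.adminCommand({ cleanupOrphaned: "test.user",
                                   startingFromKey: nextKey });
    if (result.ok != 1) {
        print("cleanupOrphaned failed: " + tojson(result));
        break;
    }
    nextKey = result.stoppedAtKey;  // null once all ranges are examined
}
```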
| Comment by David Murphy [ 22/Jul/14 ] |
True enough, however it still would be best to retry more than a single time. For example, an operation being killed, a network glitch, or an election that re-elects the same primary would all be situations where it could retry and avoid an orphan. The point here is to make a best effort before concluding it is unable to do the delete. In fact, retrying the delete (as who is the primary on a given shard is not actually important) would be the best case, as it would ensure the cleanup phase is smart enough to persist.

I think there are 2 sides to this issue:

1) Make a reasonable effort to prevent the need for more cleanup and/or orphan removal (1 or 2 retries), with
2) the orphan cleanup command as the last-ditch effort.

If we wait for #2 we are purposefully leaving the system with data that ChunkManager-unsafe commands like count will see, and thus return the wrong data. A quick retry loop controlled by a config.settings option seems like very little effort to combat a very real issue that plagues all versions today, with only 2.6 having the start of a solution. Also, a retry does not change anything fundamental the way a new subsystem would, which means it would be easier to implement on all versions going forward until such time as SERVER-6210 can be implemented. This would make our customers feel MongoDB is more stable, rather than question its stability when basic constructs like count seem unreliable. I don't disagree that a sweeper system like the one SERVER-6210 references would be good, just that it is not a complete solution but a repair mechanism for an avoidable issue.

Thanks |
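A rough sketch of the kind of retry loop being proposed, expressed in shell terms purely for illustration (the real change would live in mongod's migration cleanup path; attemptRangeDelete and the moveRetries setting are hypothetical):

```
// Illustrative only - the actual cleanup runs inside mongod, not the shell.
// attemptRangeDelete() is a hypothetical stand-in for the phase-6 delete
// of the migrated range; "moveRetries" is the setting proposed above.
function cleanupWithRetries(attemptRangeDelete) {
    var doc = db.getSiblingDB("config").settings.findOne({ _id: "moveRetries" });
    var retries = doc ? doc.value : 0;    // default 0 preserves old behavior
    for (var attempt = 0; attempt <= retries; attempt++) {
        try {
            attemptRangeDelete();         // delete documents in the moved range
            return true;                  // cleanup succeeded, no orphans left
        } catch (e) {
            print("cleanup attempt " + attempt + " failed: " + e);
        }
    }
    return false;                         // give up; orphans remain until swept
}
```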
| Comment by Greg Studer [ 22/Jul/14 ] |
Got it - but it seems like you're actually describing the more general problem. A retry setting wouldn't necessarily help - in particular, on stepdown it is incorrect and impossible to retry on the now-secondary node. Additional chunk state is needed, or a continually running background process monitoring the unowned ranges on the primary.

EDIT: Also just wanted to clarify that migrations are operations from mongod -> mongod, and are not orchestrated by mongos (though mongos may initiate them). Cleanup always happens after mongod reports success in v2.6 (and in earlier versions if there are any active cursors). |
| Comment by David Murphy [ 17/Jul/14 ] |
Greg,

There are many cases, from a network glitch, to a multi-phase delete timeout, to a stepdown/election occurring. All of these will surface an error on the delete, after which the moveChunk function just returns true and makes no attempt to try a second time. The best case would be something like config.settings.cleanupAttempts defaulting to, say, 2 or 3. We could even leave the default at 0 for 2.4/2.6 but make it a setting someone could choose to change, to make orphans less likely to be created. This is the other side of the orphan question: the cleanup script can remove them, but we should make a best effort to avoid creating them, as they will confuse things until the cleanup command is run.

David |
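One way to see whether a migration's delete phase ran into trouble today is the config.changelog collection, which records an event per migration step; a small query sketch (config.changelog and its what/details fields are real, though the exact detail fields vary by server version):

```
// Run through mongos. config.changelog records migration events such as
// "moveChunk.start", "moveChunk.commit", and "moveChunk.from"; the shape
// of the details document varies by version.
var changelog = db.getSiblingDB("config").changelog;
changelog.find({ what: /moveChunk/ })
         .sort({ time: -1 })
         .limit(5)
         .forEach(function(entry) {
             print(entry.time + "  " + entry.what + "  " + tojson(entry.details));
         });
```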
| Comment by Greg Studer [ 20/Jun/14 ] |
I'm not 100% sure I understand what a "sinkhole delete error" is - is this issue a request to continue migration cleanup even after replica set changes? |
| Comment by David Murphy [ 31/May/14 ] |
It should retry the delete to avoid creating an orphan. On a sinkhole delete error it fails, which means a stepdown will cause orphans. It could retry a couple of times and then give up, putting in a decent effort to reduce this chance.

Sent from my iPhone |
| Comment by Asya Kamsky [ 31/May/14 ] |
Phase 6 is the cleanup - this happens after the chunk has actually been moved and committed. Can you clarify, in terms of cluster state rather than step numbers, when this would kick in? What do you envision for moveRetries? The move has already completed at this point, so all that would be left is to clean up the orphaned documents. |
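For context, a manually issued migration looks like the sketch below; _waitForDelete is a real (internal, documented) moveChunk option that makes the command block until this phase-6 delete finishes rather than returning once the move is committed. The namespace, find document, and shard name here are example values:

```
// Issued against a mongos. "test.user", the find document, and the shard
// name are examples. _waitForDelete: true makes moveChunk wait for the
// phase-6 range delete instead of returning as soon as the move commits.
db.adminCommand({
    moveChunk: "test.user",
    find: { user_id: 12345 },    // identifies the chunk containing this doc
    to: "shard0001",
    _waitForDelete: true
});
```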