[SERVER-27581] shouldRetry() logic in multiInitialSyncApply_noAbort() is over-aggressive Created: 04/Jan/17 Updated: 06/Dec/17 Resolved: 11/Jul/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Replication |
| Affects Version/s: | None |
| Fix Version/s: | 3.4.7, 3.5.10 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Mathias Stearn | Assignee: | Judah Schvimer |
| Resolution: | Done | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Backwards Compatibility: | Fully Compatible | ||||
| Operating System: | ALL | ||||
| Backport Requested: |
v3.4
|
||||
| Sprint: | Repl 2017-05-08, Repl 2017-05-29, Repl 2017-06-19, Repl 2017-07-31 | ||||
| Participants: | |||||
| Description |
|
Currently multiInitialSyncApply_noAbort() calls shouldRetry() on all !OK statuses returned from applying any non-command op. If shouldRetry() returns false, the error is ignored. It should probably only do this for error cases that are expecting the shouldRetry() treatment (such as an update not finding a document) rather than for absolutely all errors. |
| Comments |
| Comment by Githook User [ 17/Jul/17 ] |
|
Author: {u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}Message: (cherry picked from commit f33e9a715cf3eb4d29d8d5c5a926e5edacf9f63d) |
| Comment by Githook User [ 11/Jul/17 ] |
|
Author: {u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}Message: |
| Comment by Mathias Stearn [ 23/Jun/17 ] |
|
That sounds right to me. |
| Comment by Judah Schvimer [ 19/Jun/17 ] |
|
The only time we ignore error statuses is for non-commands since we do propagate errors on commands. We also always catch exceptions and handle them appropriately, so the only thing we need to be concerned about is bad statuses. In SyncApply we can only return a bad status for non-commands in applyOperationInlock or if there is a bad op type. applyOperation_inLock can return a bad status in a few places: if a multi-insert fails , if a single insert fails, or if an update fails. The inserts could fail for any number of reasons. The updates fail if they do not match anything. If they don't, we check a few error criteria. In secondary application the only one that's relevant is the first which we will always enter since all updates in oplog entries have simple query predicates. It appears that on the OperationFailed update errors, we should be retrying but otherwise we probably should be failing more loudly, and restarting initial sync. redbeard0531, does this appropriately explain what you see is wrong here? |
| Comment by Judah Schvimer [ 20/Jan/17 ] |
|
We need to establish if this is a correctness bug. |
| Comment by Spencer Brody (Inactive) [ 05/Jan/17 ] |
|
redbeard0531, what's the impact of leaving things as they are? Is this a correctness issue? |