[SERVER-27581] shouldRetry() logic in multiInitialSyncApply_noAbort() is over-aggressive Created: 04/Jan/17  Updated: 06/Dec/17  Resolved: 11/Jul/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: 3.4.7, 3.5.10

Type: Bug Priority: Major - P3
Reporter: Mathias Stearn Assignee: Judah Schvimer
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.4
Sprint: Repl 2017-05-08, Repl 2017-05-29, Repl 2017-06-19, Repl 2017-07-31
Participants:

 Description   

Currently multiInitialSyncApply_noAbort() calls shouldRetry() on all !OK statuses returned from applying any non-command op. If shouldRetry() returns false, the error is ignored. It should probably only do this for error cases that are expecting the shouldRetry() treatment (such as an update not finding a document) rather than for absolutely all errors.



 Comments   
Comment by Githook User [ 17/Jul/17 ]

Author:

{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-27581 Only fetch missing documents on update oplog entries during initial
sync

(cherry picked from commit f33e9a715cf3eb4d29d8d5c5a926e5edacf9f63d)
Branch: v3.4
https://github.com/mongodb/mongo/commit/cbd74166d8930657f63e8e90791d276e15c3d3cb

Comment by Githook User [ 11/Jul/17 ]

Author:

{u'username': u'judahschvimer', u'name': u'Judah Schvimer', u'email': u'judah@mongodb.com'}

Message: SERVER-27581 Only fetch missing documents on update oplog entries during initial
sync
Branch: master
https://github.com/mongodb/mongo/commit/f33e9a715cf3eb4d29d8d5c5a926e5edacf9f63d

Comment by Mathias Stearn [ 23/Jun/17 ]

That sounds right to me.

Comment by Judah Schvimer [ 19/Jun/17 ]

The only time we ignore error statuses is for non-commands since we do propagate errors on commands. We also always catch exceptions and handle them appropriately, so the only thing we need to be concerned about is bad statuses.

In SyncApply we can only return a bad status for non-commands in applyOperationInlock or if there is a bad op type.

applyOperation_inLock can return a bad status in a few places: if a multi-insert fails , if a single insert fails, or if an update fails. The inserts could fail for any number of reasons. The updates fail if they do not match anything. If they don't, we check a few error criteria. In secondary application the only one that's relevant is the first which we will always enter since all updates in oplog entries have simple query predicates.

It appears that on the OperationFailed update errors, we should be retrying but otherwise we probably should be failing more loudly, and restarting initial sync.

redbeard0531, does this appropriately explain what you see is wrong here?

Comment by Judah Schvimer [ 20/Jan/17 ]

We need to establish if this is a correctness bug.

Comment by Spencer Brody (Inactive) [ 05/Jan/17 ]

redbeard0531, what's the impact of leaving things as they are? Is this a correctness issue?

Generated at Thu Feb 08 04:15:34 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.