[SERVER-35663] Replication recovery does not update the logical clock Created: 18/Jun/18  Updated: 29/Oct/23  Resolved: 17/Jan/19

Status: Closed
Project: Core Server
Component/s: Replication, Sharding
Affects Version/s: 4.0.0-rc5, 4.1.1
Fix Version/s: 4.0.7, 4.1.8

Type: Bug Priority: Major - P3
Reporter: Jack Mulrow Assignee: Jack Mulrow
Resolution: Fixed Votes: 0
Labels: sharding-wfbf-day
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
is related to SERVER-46257 OplogFetcher should run LogicalTimeMe... Closed
is related to SERVER-46308 Investigate dependency between commit... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Sprint: Sharding 2018-08-13, Sharding 2018-09-10, Sharding 2018-09-24, Sharding 2018-10-08, Sharding 2018-10-22, Sharding 2018-11-05, Sharding 2018-12-17, Sharding 2018-12-31, Sharding 2019-01-14, Sharding 2019-01-28
Participants:
Linked BF Score: 64

 Description   

If a node crashes with unapplied oplog entries, then on startup it will apply entries through the end of its oplog via ReplicationRecoveryImpl::recoverFromOplog. This path applies the entries by directly calling SyncTail::multiApply (through an OplogApplier), which, unlike normal secondary application, does not update the logical clock. Then, when starting up its replication coordinator, the node asynchronously schedules ReplicationCoordinatorImpl::_finishLoadLocalConfig, which updates the logical clock only after it has updated the replication coordinator's lastAppliedOpTime to the opTime of the latest oplog entry.

If a request is processed during this window in _finishLoadLocalConfig, then when the node goes to compute logical time metadata for the response, it can hit the invariant that clusterTime >= operationTime, because the operationTime, which is typically the lastAppliedOpTime, will be greater than the latest time in the logical clock.
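
To make the window concrete, here is a minimal standalone sketch; Timestamp, LogicalClock, ReplCoord, and appendLogicalTimeMetadata are simplified hypothetical stand-ins, not the server's actual classes:

{code:cpp}
// Standalone model of the recovery window; simplified stand-ins, not the
// real ReplicationCoordinatorImpl / LogicalClock classes.
#include <cassert>
#include <cstdint>

struct Timestamp {
    uint64_t t = 0;
};

struct LogicalClock {
    Timestamp clusterTime;  // not ticked by recovery's direct multiApply
};

struct ReplCoord {
    Timestamp lastApplied;  // advanced to the top of the oplog before the clock
};

// Stand-in for computing the logical time metadata for a response.
void appendLogicalTimeMetadata(const ReplCoord& repl, const LogicalClock& clock) {
    Timestamp operationTime = repl.lastApplied;
    // The invariant this ticket is about: clusterTime >= operationTime.
    // During the recovery window, lastApplied has advanced but the clock
    // has not, so this fires.
    assert(clock.clusterTime.t >= operationTime.t);
}

int main() {
    LogicalClock clock;
    clock.clusterTime.t = 5;  // stale: recovery applied entries without ticking it
    ReplCoord repl;
    repl.lastApplied.t = 9;   // already at the opTime of the latest oplog entry
    appendLogicalTimeMetadata(repl, clock);  // aborts here, modeling the crash
    return 0;
}
{code}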

Two ways to fix this: have replication recovery update the logical clock as it applies the unapplied oplog entries, or update the global timestamp before updating lastAppliedOpTime when finishing loading the local replica set config.
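
A minimal sketch of the second option, which matches the ordering of the eventual fix (per the commit messages below, "Always update the logical clock before advancing the last applied opTime"); advanceTo and setMyLastApplied are hypothetical stand-ins, not the real methods:

{code:cpp}
// Sketch of the fixed ordering in the config-load path; simplified stand-ins.
#include <cstdint>

struct Timestamp {
    uint64_t t = 0;
};

struct LogicalClock {
    Timestamp clusterTime;
    // Hypothetical stand-in for setGlobalTimestamp-style clock advancement.
    void advanceTo(Timestamp ts) {
        if (ts.t > clusterTime.t) clusterTime = ts;
    }
};

struct ReplCoord {
    Timestamp lastApplied;
    void setMyLastApplied(Timestamp ts) { lastApplied = ts; }
};

// Stand-in for the tail of _finishLoadLocalConfig.
void finishLoadLocalConfig(LogicalClock& clock, ReplCoord& repl, Timestamp topOfOplog) {
    // The fix: tick the clock BEFORE publishing lastApplied, so a request
    // served in between still observes clusterTime >= operationTime.
    clock.advanceTo(topOfOplog);
    repl.setMyLastApplied(topOfOplog);
}

int main() {
    LogicalClock clock;
    ReplCoord repl;
    finishLoadLocalConfig(clock, repl, Timestamp{9});
    return 0;
}
{code}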



 Comments   
Comment by Githook User [ 25/Feb/19 ]

Author:

{'name': 'Jack Mulrow', 'username': 'jsmulrow', 'email': 'jack.mulrow@mongodb.com'}

Message: SERVER-35663 Always update the logical clock before advancing the last applied opTime

(cherry picked from commit bfb0aec447f39633694dcc4418f2a5a5a167937e)
Branch: v4.0
https://github.com/mongodb/mongo/commit/33388105eecf102c9a18138f751ea8a502eb257c

Comment by Githook User [ 17/Jan/19 ]

Author:

{'username': 'jsmulrow', 'email': 'jack.mulrow@mongodb.com', 'name': 'Jack Mulrow'}

Message: SERVER-35663 Always update the logical clock before advancing the last applied opTime
Branch: master
https://github.com/mongodb/mongo/commit/bfb0aec447f39633694dcc4418f2a5a5a167937e

Comment by Judah Schvimer [ 10/Dec/18 ]

convert it to a dassert since the clock will eventually correct itself and this check is mostly for testing

Are there any causal consistency violations that can occur if there is a bug here? Should we add an invariant, wherever we move lastApplied forward, that we don't move it ahead of the logical clock? Or should we move the logical clock forward in the same place where we move lastApplied forward, rather than one after the other everywhere else? Or will that have a significant performance cost?

Otherwise lgtm
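
One hedged reading of the "same place" alternative above, as a standalone sketch rather than the actual ReplicationCoordinator: couple both updates behind one setter and one mutex, so the inconsistent window cannot exist by construction. ReplState and its methods are hypothetical; the performance question is the extra work done inside the critical section on every advance.

{code:cpp}
// Standalone sketch of coupling the two updates; simplified stand-ins.
#include <cstdint>
#include <mutex>
#include <utility>

struct Timestamp {
    uint64_t t = 0;
};

class ReplState {
public:
    void advanceLastApplied(Timestamp ts) {
        std::lock_guard<std::mutex> lk(_mutex);
        if (ts.t > _clusterTime.t) _clusterTime = ts;  // clock first, same section
        _lastApplied = ts;
    }

    // Readers snapshot both values under the same mutex, so the pair is
    // always mutually consistent.
    std::pair<Timestamp, Timestamp> opAndClusterTime() const {
        std::lock_guard<std::mutex> lk(_mutex);
        return {_lastApplied, _clusterTime};
    }

private:
    mutable std::mutex _mutex;
    Timestamp _lastApplied;
    Timestamp _clusterTime;
};
{code}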

Comment by Jack Mulrow [ 03/Dec/18 ]

The underlying problem here seems to be that the invariant mentioned in the description expects a relationship between operationTime and clusterTime (i.e. operationTime <= clusterTime), which creates a dependency between the ReplicationCoordinator (where operationTime is retrieved for reads) and the LogicalClock (where cluster time is retrieved) that isn't always enforced.

For example, from the description: when we finish loading the replica set config (which happens after replication recovery), the ReplicationCoordinator's last applied opTime is advanced, and then the ReplicationCoordinator mutex is unlocked before the LogicalClock is advanced (through the setGlobalTimestamp method). If a request is processed during this window, then when its logical time metadata is computed, it may receive a correct operationTime but a stale clusterTime, which violates the invariant.

To fix this, I think it's sufficient to ensure wherever the last applied opTime can advance we always update the LogicalClock first. The main risk I can see with this approach is an inability to enforce this ordering at compile-time, but re-enabling the invariant should allow us to catch errors during our tests.

My proposed solution:

  1. Re-enable the failing invariant, but convert it to a dassert, since the clock will eventually correct itself and this check is mostly for testing (see the sketch after this list)
  2. Audit each place the ReplicationCoordinator's last applied opTime can be advanced and ensure the LogicalClock is updated first. The places I've currently found are:
    1. When setting the last opTime in initial sync
    2. When the replication config is finished loading
    3. When resetting the last opTimes in the ReplicationCoordinator (e.g. after rollback)
    4. During secondary oplog application (the order is already correct here, so I would just update this comment)
    5. When an oplog entry is committed (this also seems correct, since the time is reserved in the clock before the WriteUnitOfWork is committed)
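
A minimal sketch of step 1, assuming (as the comment above suggests) that a dassert-style check fires only in debug/test builds; the DASSERT macro here is a standalone analogue, not MongoDB's actual dassert from util/assert_util.h:

{code:cpp}
// Standalone analogue of step 1: the check becomes debug-only, so tests catch
// any ordering mistakes missed by step 2's audit, while release builds
// tolerate a transient violation (the clock eventually corrects itself).
#include <cassert>
#include <cstdint>

struct Timestamp {
    uint64_t t = 0;
};

#ifndef NDEBUG
#define DASSERT(cond) assert(cond)  // fires in debug/test builds
#else
#define DASSERT(cond) ((void)0)     // compiled out in release builds
#endif

void appendLogicalTimeMetadata(Timestamp operationTime, Timestamp clusterTime) {
    // Re-enabled relationship check, now debug-only.
    DASSERT(clusterTime.t >= operationTime.t);
    // ... append {operationTime, clusterTime} to the response metadata ...
}
{code}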

misha.tyulenev and judah.schvimer, what do you think of the proposed solution?

Comment by Githook User [ 01/Aug/18 ]

Author:

{'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com', 'username': 'jsmulrow'}

Message: SERVER-35663 Remove clusterTime >= operationTime invariant

(cherry picked from commit bc67f3ea909bc1c13d46aa439f44cb88cbf67374)
Branch: v3.6
https://github.com/mongodb/mongo/commit/19523a841554dd16580afa3059acd58bee0ecb9e

Comment by Githook User [ 18/Jun/18 ]

Author:

{'username': 'jsmulrow', 'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com'}

Message: SERVER-35663 Remove clusterTime >= operationTime invariant

(cherry picked from commit bc67f3ea909bc1c13d46aa439f44cb88cbf67374)
Branch: v4.0
https://github.com/mongodb/mongo/commit/a9a9e398d86591c7559ba26580a38c2d439d7835

Comment by Githook User [ 18/Jun/18 ]

Author:

{'username': 'jsmulrow', 'name': 'Jack Mulrow', 'email': 'jack.mulrow@mongodb.com'}

Message: SERVER-35663 Remove clusterTime >= operationTime invariant
Branch: master
https://github.com/mongodb/mongo/commit/bc67f3ea909bc1c13d46aa439f44cb88cbf67374
