[SERVER-62987] Wrong replication logic on refreshes on secondary nodes Created: 26/Jan/22  Updated: 29/Oct/23  Resolved: 06/Jul/23

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 7.0.1, 5.0.20, 6.0.9

Type: Bug Priority: Major - P3
Reporter: Sergi Mateo Bellido Assignee: Allison Easton
Resolution: Fixed Votes: 0
Labels: SSCCL-BUG
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Related
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.0, v5.0
Sprint: Sharding EMEA 2023-07-10
Participants:
Linked BF Score: 0

 Description   

The way we do refreshes on secondaries is defined by two steps:

  1. The secondary node asks the primary node to do a refresh with local write concern.
  2. The secondary node waits until the changes done by the primary are replicated.

The issue is how we implement the second step: we wait on the logical time (just a timestamp, without the term component) associated with the oplog entry generated on the primary node. Note that this entry has not been majority committed, so another node could step up, do some writes, and at some point cause this entry to be rolled back. The secondary node might then fetch an oplog entry with a logical time greater than the one it was waiting for, and it will assume that it has the changes associated with the refresh on the primary. However, that is not true.
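The false positive described above can be illustrated with a small Python toy model. This is only a sketch: `OpTime`, `waited_by_timestamp_only`, `entry_is_present`, and the concrete terms and timestamps are hypothetical names and values chosen for illustration, not the server's actual types or code.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class OpTime:
    """Toy stand-in for an oplog position: election term plus timestamp."""
    term: int
    timestamp: int


def waited_by_timestamp_only(last_applied: OpTime, target: OpTime) -> bool:
    # Models the problematic wait: only the time component is compared.
    return last_applied.timestamp >= target.timestamp


def entry_is_present(oplog: Sequence[OpTime], target: OpTime) -> bool:
    # Ground truth: is the exact entry (term and timestamp) in the oplog?
    return target in oplog


# The old primary (term 1) writes the refresh entry at timestamp 10.
# The entry is not majority committed.
refresh_entry = OpTime(term=1, timestamp=10)

# Another node steps up in term 2 and does some writes; the refresh entry
# is rolled back, so the secondary never ends up holding it.
secondary_oplog = [OpTime(1, 8), OpTime(2, 11), OpTime(2, 12)]
last_applied = secondary_oplog[-1]

# The timestamp-only wait is satisfied even though the entry is missing.
print(waited_by_timestamp_only(last_applied, refresh_entry))
print(entry_is_present(secondary_oplog, refresh_entry))
```

In this model the wait completes because 12 >= 10, while the entry written in term 1 is not in the secondary's oplog at all.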

Under that scenario, the ShardServerCatalogCacheLoader (SSCCL) might return metadata associated with a CollectionVersion older than the one the CatalogCache already knows. The CatalogCache will then try to combine the SSCCL result with its local metadata, creating an inconsistent routing history: it will potentially contain the collection metadata we got from the SSCCL but the chunks we already had in the CatalogCache. Thus, the problem is that the new routing history has stale collection information.
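The inconsistent combination can be sketched in the same toy style. Everything here is an assumption made for illustration: `CollectionInfo`, `RoutingHistory`, `naive_combine`, the integer versions, and the chunk names are hypothetical and do not mirror the real CatalogCache data structures.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class CollectionInfo:
    version: int            # stand-in for the CollectionVersion
    allow_migrations: bool  # one of the mutable config.collections fields


@dataclass(frozen=True)
class RoutingHistory:
    collection: CollectionInfo
    chunks: Tuple[str, ...]
    chunks_version: int     # version the cached chunks correspond to


def naive_combine(cached: RoutingHistory, loaded: CollectionInfo) -> RoutingHistory:
    # Takes the collection metadata from the loader result but keeps the
    # chunks that were already cached, with no check that they match.
    return RoutingHistory(
        collection=loaded,
        chunks=cached.chunks,
        chunks_version=cached.chunks_version,
    )


# The cache already knows version 5, where migrations are disallowed.
cached = RoutingHistory(CollectionInfo(5, False), ("chunk-a", "chunk-b"), 5)

# The loader returns older collection metadata (version 4).
stale = CollectionInfo(version=4, allow_migrations=True)

combined = naive_combine(cached, stale)
is_consistent = combined.collection.version == combined.chunks_version
print(combined.collection.allow_migrations, is_consistent)
```

The combined history pairs version 5 chunks with version 4 collection fields, so a reader of `allow_migrations` sees the stale value.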

We believe that this could be problematic for the two fields we have in config.collections that are mutable and replicated to shards: allowMigrations and reshardingFields.


What is the behavior under this scenario? In the 5.0 binary we would hit one of these two invariants stating that we found different collection information for the same collection version. In 5.1 or more recent versions the CatalogCache is not going to hit an invariant, so users might experience incorrect execution of ongoing DDL operations.

Affected versions: I took a look at 3.6 and it already has this problem.



 Comments   
Comment by Githook User [ 18/Aug/23 ]

Author:

{'name': 'Allison Easton', 'email': 'allison.easton@mongodb.com', 'username': 'allisoneaston'}

Message: SERVER-62987 Secondary refreshes should be interrupted on replication rollback

(cherry picked from commit 2196021e412bb0ad1c470f4ba664551a9bbb56fe)
Branch: v7.0
https://github.com/mongodb/mongo/commit/355e304a60e1fd3783eed52314c8ea0505deb9f9

Comment by Githook User [ 20/Jul/23 ]

Author:

{'name': 'Allison Easton', 'email': 'allison.easton@mongodb.com', 'username': 'allisoneaston'}

Message: SERVER-62987 Secondary refreshes should be interrupted on replication rollback

(cherry picked from commit 2196021e412bb0ad1c470f4ba664551a9bbb56fe)
Branch: v6.0
https://github.com/mongodb/mongo/commit/f6c48222cc1cf0c68935a3414f79067687059d1a

Comment by Githook User [ 20/Jul/23 ]

Author:

{'name': 'Allison Easton', 'email': 'allison.easton@mongodb.com', 'username': 'allisoneaston'}

Message: SERVER-62987 Secondary refreshes should be interrupted on replication rollback

(cherry picked from commit 2196021e412bb0ad1c470f4ba664551a9bbb56fe)
Branch: v5.0
https://github.com/mongodb/mongo/commit/e6e00002c258c9f0177cf37ec196493f05e7ec96

Comment by Githook User [ 06/Jul/23 ]

Author:

{'name': 'Allison Easton', 'email': 'allison.easton@mongodb.com', 'username': 'allisoneaston'}

Message: SERVER-62987 Secondary refreshes should be interrupted on replication rollback
Branch: master
https://github.com/mongodb/mongo/commit/2196021e412bb0ad1c470f4ba664551a9bbb56fe
