[SERVER-82617] Router's fsyncLock command must be resilient to elections Created: 31/Oct/23  Updated: 12/Dec/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Pierlauro Sciarelli Assignee: Nandini Bhartiya
Resolution: Unresolved Votes: 0
Labels: cs-subteam1, sharding-nyc-subteam1
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

When fsyncLock is invoked on a router, it contacts the primary of every shard and makes sure there are no ongoing DDLs, in order to avoid inconsistencies during backups. This protocol is currently not resilient to elections.

Example of a breaking scenario

Let's consider a shard with 3 nodes: n0, n1 and n2. The primary was n0 but just switched to n1.

  1. The router believes n0 is still the primary and asks it to acquire the fsync lock
  2. Since the command is allowed on secondaries, n0 acquires the lock and returns successfully
  3. A DDL starts on n1, since the coordinator document can still be majority committed by replicating to n2
  4. A backup starts from n1 or n2
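
The steps above can be replayed with a small toy model. This is only an illustrative sketch, not MongoDB code: the names `Node`, `fsync_lock`, `can_start_ddl` and `run_scenario` are hypothetical, and the model assumes that a locked node cannot persist new writes and that a DDL only needs an unlocked primary plus a majority of nodes able to persist the coordinator document.

```python
from dataclasses import dataclass


@dataclass
class Node:
    """Hypothetical stand-in for a replica set member."""
    name: str
    is_primary: bool = False
    fsync_locked: bool = False

    def fsync_lock(self) -> bool:
        # Assumption mirroring step 2: the command is accepted on
        # secondaries as well, so it succeeds regardless of the role.
        self.fsync_locked = True
        return True


def can_start_ddl(nodes: list[Node]) -> bool:
    """Return True if a DDL could start on the shard in this toy model."""
    primary = next(n for n in nodes if n.is_primary)
    if primary.fsync_locked:
        return False
    # Assumption: nodes holding the fsync lock cannot persist new writes.
    writable = [n for n in nodes if not n.fsync_locked]
    # The coordinator document must reach a majority of the nodes.
    return len(writable) > len(nodes) // 2


def run_scenario() -> tuple[bool, bool]:
    # n0 used to be primary; an election has just moved the role to n1.
    n0, n1, n2 = Node("n0"), Node("n1", is_primary=True), Node("n2")
    shard = [n0, n1, n2]

    stale_primary = n0                    # step 1: the router's stale view
    lock_ok = stale_primary.fsync_lock()  # step 2: lock taken on a secondary
    ddl_ok = can_start_ddl(shard)         # step 3: n1 + n2 form a majority
    return lock_ok, ddl_ok


if __name__ == "__main__":
    lock_ok, ddl_ok = run_scenario()
    print(f"lock acquired: {lock_ok}, DDL can still start: {ddl_ok}")
```

In this model the router sees a successful lock while the DDL remains possible on n1, which is the inconsistency a backup taken from n1 or n2 (step 4) would be exposed to. If the lock had landed on n1 instead, `can_start_ddl` would return `False`.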
