[SERVER-48679] flushRoutingTableCacheUpdates should block on critical section with kWrite, not kRead
Created: 09/Jun/20  Updated: 29/Oct/23  Resolved: 01/Jul/20

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.18, 4.5.1, 4.0.18, 4.2.7, 4.4.0-rc8
Fix Version/s: 4.4.1, 4.7.0, 4.2.10, 4.0.22

Type: Bug
Priority: Major - P3
Reporter: Esha Maharishi (Inactive)
Assignee: Luis Osta (Inactive)
Resolution: Fixed
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-50898 safe_secondary_reads_causal_consisten... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v4.4, v4.2, v4.0
Sprint: Sharding 2020-06-15, Sharding 2020-06-29
Participants:
Linked BF Score: 0

 Description   

The donor writes the enterCriticalSectionCounter flag
-> which causes secondaries to clear their filtering metadata
-> which causes the next versioned request on the secondary to throw StaleConfig and trigger the secondary to refresh
-> which causes the secondary to send flushRoutingTableCacheUpdates to the primary
-> which blocks behind the critical section only if reads are being blocked

In 4.4 and earlier versions, if reads haven't started being blocked yet, the secondary will finish the refresh and serve reads for stale mongoses even if the migration commits. 
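
The blocking rule described above can be sketched as a small toy model. This is an illustration only, not the server's code: the names `Op`, `Phase`, and `must_wait` are hypothetical, and the model assumes the critical section has just two stages, one that blocks writes only and a later one that blocks reads as well.

```python
from enum import Enum


class Op(Enum):
    """Kind of operation a caller declares when checking the critical section."""
    K_READ = "kRead"
    K_WRITE = "kWrite"


class Phase(Enum):
    """Hypothetical stages of the donor's migration critical section."""
    NOT_IN_CRITICAL_SECTION = 0
    BLOCKING_WRITES = 1             # writes blocked, reads still allowed
    BLOCKING_READS_AND_WRITES = 2   # everything blocked until commit or abort


def must_wait(phase: Phase, op: Op) -> bool:
    """Return True if an operation of kind `op` has to wait in `phase`."""
    if phase is Phase.NOT_IN_CRITICAL_SECTION:
        return False
    if phase is Phase.BLOCKING_WRITES:
        # Only callers that check with kWrite are held back at this stage.
        return op is Op.K_WRITE
    return True


if __name__ == "__main__":
    for phase in Phase:
        print(phase.name, {op.value: must_wait(phase, op) for op in Op})
```

In this model, a flushRoutingTableCacheUpdates that checks with kRead slips through while only writes are blocked, which is the window the description refers to.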

For example:

  • Donor writes enterCriticalSectionSignal at T90
  • Secondary sees the flag, invalidates its filtering metadata
  • Secondary gets versioned read, sends flushRoutingTableCacheUpdates, gets back success
  • Donor starts blocking writes
  • Donor commits the migration, which succeeds at T100
  • Client does a write from mongos1, which contacts donor and gets back StaleConfig, then retries write on recipient, which succeeds at T101
  • Client does afterClusterTime: T101 read from mongos2, which is stale and contacts the donor secondary. >>> That secondary will wait until T101, then serve the read <<<
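
The outcome of the example above can be sketched with a simplified simulation. Everything here is an assumption made for illustration: routing versions are plain integers (1 before the migration commits, 2 after), the flush is assumed to reach the primary after writes are blocked but before reads are, the migration is assumed to commit rather than abort, and the function names are hypothetical.

```python
def refreshed_version(phase_at_flush: str, flush_op: str,
                      version_before: int = 1, version_after: int = 2) -> int:
    """Routing version the secondary holds once its flush returns.

    phase_at_flush: "none", "writes_blocked", or "reads_blocked".
    flush_op: "kRead" or "kWrite", the kind the flush checks with.
    """
    writes_blocked = phase_at_flush in ("writes_blocked", "reads_blocked")
    reads_blocked = phase_at_flush == "reads_blocked"
    waits = writes_blocked if flush_op == "kWrite" else reads_blocked
    # A flush that waits only returns after the (assumed) commit, so it
    # observes the post-migration version; otherwise it sees the old one.
    return version_after if waits else version_before


def serves_stale_read(secondary_version: int, mongos_version: int) -> bool:
    """A versioned read is served only when both sides agree on the version;
    a mismatch would produce StaleConfig instead."""
    return secondary_version == mongos_version


if __name__ == "__main__":
    stale_mongos_version = 1  # mongos2 has not learned about the migration
    for op in ("kRead", "kWrite"):
        version = refreshed_version("writes_blocked", op)
        print(op, "-> secondary version", version,
              "| stale read served:", serves_stale_read(version, stale_mongos_version))
```

Under these assumptions, checking with kRead leaves the secondary on the pre-migration version, so it agrees with the stale mongos and serves the read; checking with kWrite makes the secondary pick up the post-commit version, so the stale request is rejected.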

In 4.5, this happens not to be an issue, because the refresh is done by calling onShardVersionMismatch, which waits for the critical section as long as writes are already being blocked.

Even so, we want to change flushRoutingTableCacheUpdates in all versions to block behind the critical section with kWrite, rather than with kRead as it does today.



 Comments   
Comment by Githook User [ 19/Oct/20 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-48679 flushRoutingTableCacheUpdates should block on critical section with kWrite, not kRead

(cherry picked from commit 9f4b81e5bdcf38f9b10459203a804ba406528770)
Branch: v4.0
https://github.com/mongodb/mongo/commit/c3c59be0a89fcea095f701ff615ccc416fde5001

Comment by Githook User [ 12/Aug/20 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-48679 flushRoutingTableCacheUpdates should block on critical section with kWrite, not kRead

(cherry picked from commit 9f4b81e5bdcf38f9b10459203a804ba406528770)
Branch: v4.2
https://github.com/mongodb/mongo/commit/1dd91dee7c8cb437a893cd7f6ff22d0bd26ee1d4

Comment by Githook User [ 12/Aug/20 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-48679 flushRoutingTableCacheUpdates should block on critical section with kWrite, not kRead

(cherry picked from commit 9f4b81e5bdcf38f9b10459203a804ba406528770)
Branch: v4.4
https://github.com/mongodb/mongo/commit/18e6827bd918c06b5d43d39b2f196102973ea31d

Comment by Githook User [ 01/Jul/20 ]

Author:

{'name': 'Luis Osta', 'email': 'luis.osta@mongodb.com', 'username': 'LuisOsta'}

Message: SERVER-48679 flushRoutingTableCacheUpdates should block on critical section with kWrite, not kRead
Branch: master
https://github.com/mongodb/mongo/commit/9f4b81e5bdcf38f9b10459203a804ba406528770
