- Type: Investigation
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: None
- None
- Tools and Replicator
- 114
Renamed the metadataRefreshInTransactionMaxWaitBehindCritSecMS configuration parameter to metadataRefreshInTransactionMaxWaitMS.
The old name, metadataRefreshInTransactionMaxWaitBehindCritSecMS, remains valid but is deprecated.
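
To illustrate what "valid but deprecated" can mean for a renamed parameter, here is a minimal Python sketch of alias resolution. It is not the server's implementation: the function resolve_max_wait_ms, the params dictionary, and the default_ms argument are hypothetical names introduced only for this example. Only the two parameter names come from the ticket.

```python
import warnings

# Parameter names taken from the ticket.
NEW_NAME = "metadataRefreshInTransactionMaxWaitMS"
DEPRECATED_NAME = "metadataRefreshInTransactionMaxWaitBehindCritSecMS"


def resolve_max_wait_ms(params, default_ms):
    """Return the configured wait limit in milliseconds (hypothetical helper).

    The new name wins if both are present; the deprecated name is still
    accepted but emits a DeprecationWarning.
    """
    if NEW_NAME in params:
        return int(params[NEW_NAME])
    if DEPRECATED_NAME in params:
        warnings.warn(
            f"{DEPRECATED_NAME} is deprecated; use {NEW_NAME}",
            DeprecationWarning,
            stacklevel=2,
        )
        return int(params[DEPRECATED_NAME])
    # Neither name was set: fall back to the caller-supplied default.
    return default_ms
```

With this scheme an existing configuration that still uses the old name keeps working, while new configurations can switch to the shorter name.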
Description of Linked Ticket
As part of SERVER-59965, a “circuit breaker” was introduced to prevent a transaction from deadlocking on the critical section (the issue is described in detail in that ticket).
However, as part of BF-34016 we realised that a transaction can also block when the filtering metadata are UNKNOWN, because the shard won’t serve reads or writes in that state.
This is more problematic when the shard is in the recovery state as part of a step-up: all migrations are recovered and, because the metadata are cleared, reads and writes are blocked for some time.
As explained in several comments on BF-34016, this can produce a deadlock: the transaction may hold a lock that prevents the migration abort (part of migration recovery) from completing, while the migration prevents the transaction from committing because the shard version is UNKNOWN.
The goal of this ticket is to limit the time a transaction spends waiting for shard version recovery, similar to what was done in SERVER-59965 for a transaction waiting on the critical section. In this rare case, the transaction can then abort, allowing the migration abort to complete.
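
The bounded wait described above can be sketched in Python as follows. This is an illustration of the idea only, not the server code: FilteringMetadata, TransactionAborted, and wait_for_refresh_in_transaction are hypothetical names, and the limit is passed in as max_wait_ms rather than read from the real server parameter.

```python
import threading


class TransactionAborted(Exception):
    """Raised when the transaction gives up waiting (hypothetical type)."""


class FilteringMetadata:
    """Toy stand-in for a shard's filtering metadata state."""

    def __init__(self):
        # Unset means the metadata are UNKNOWN (for example, during recovery).
        self._known = threading.Event()

    def mark_recovered(self):
        self._known.set()

    def clear(self):
        self._known.clear()

    def wait_until_known(self, timeout_s):
        # Event.wait returns True if the flag was set, False on timeout.
        return self._known.wait(timeout_s)


def wait_for_refresh_in_transaction(metadata, max_wait_ms):
    """Wait for recovery, but never longer than max_wait_ms.

    On timeout the transaction aborts instead of blocking indefinitely,
    which releases its locks so the migration abort can make progress.
    """
    if not metadata.wait_until_known(max_wait_ms / 1000.0):
        raise TransactionAborted(
            f"metadata still UNKNOWN after {max_wait_ms} ms"
        )
```

The key design point is that the wait is bounded: whichever side is holding up the other, the transaction is the one that yields after the limit expires.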
Note that transactions already have a timeout of 1 minute, and BF-34016 will be implicitly fixed by SERVER-86727. For this reason, the ticket can be treated as an improvement that prevents similar issues from happening again.
- depends on
  - SERVER-92530 Limit the time a transaction waits for a refresh on a shard in recovery state (Closed)