Type: Task
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 8.1.0-rc0, 8.0.10, 7.0.22
Affects Version/s: None
Component/s: Sharding
Labels: None
Assigned Teams: Catalog and Routing
Backwards Compatibility: Fully Compatible
Backport Requested: v8.0, v7.0
Sprint: CAR Team 2025-01-20, CAR Team 2025-02-03, CAR Team 2025-02-17
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

~~SERVER-71823~~ added consistency checks on sharding metadata for a cluster, database, or collection. When issued cluster-wide, the algorithm runs serially as follows:

1. for each DB:
2. acquire the DB DDL lock
3. execute the check for that DB
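
The serial loop above can be sketched as follows. This is a minimal Python model, not the server implementation: `check_cluster_serially`, `ddl_locks`, and `check_db` are hypothetical names, and a plain `threading.Lock` stands in for the DB DDL lock.

```python
import threading
from typing import Callable, Dict, List


def check_cluster_serially(
    db_names: List[str],
    ddl_locks: Dict[str, threading.Lock],
    check_db: Callable[[str], List[str]],
) -> List[str]:
    """Run the per-database check one database at a time (illustrative model)."""
    inconsistencies: List[str] = []
    for db in db_names:  # step 1: for each DB
        # Step 2: acquire the DB DDL lock. This waits with no time limit.
        with ddl_locks[db]:
            # Step 3: execute the check while holding the lock.
            inconsistencies.extend(check_db(db))
    return inconsistencies
```

In this model, step 2 waits for as long as another holder keeps the lock, and the lock stays held for the whole of step 3.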

A customer who uses checkMetadataConsistency to detect problems in a cluster could, under some circumstances, end up blocking other DDL operations for a long time, which is unacceptable for a live cluster. In particular, if a long-running DDL operation that takes a DB lock (such as resharding) is in progress and a checkMetadataConsistency request comes in, step 2 enqueues a database lock request that prevents other DDL operations that also take database locks from executing, in particular:

renameCollection
shardCollection
collMod
dropCollection
dropDatabase
dropIndexes
reshardCollection
movePrimary
createCollection (affected only from v8.0 onward; applies to both implicit and explicit collection creation, but only when done outside of a transaction or retryable write)

~~SERVER-92182~~ partially handled this in master by adding a backoff mechanism in step 2. However, this does not address the full execution of the operation (step 3), nor does it cover older versions. It is currently possible to set maxTimeMS on the command, but because that limits the duration of the entire verification, we want a more fine-grained parameter (specifically for steps 2 and 3).

The purpose of this ticket is to add a new parameter to the command that limits the total amount of time other DDL operations might be blocked (that is, the total duration of steps 2 and 3) and fails the command once the time limit is reached.
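
One way to picture the intended behavior is a single deadline that covers both the lock wait (step 2) and the check itself (step 3). The sketch below is a Python model under stated assumptions, not the server code: the ticket does not name the new parameter, so `budget_secs` is a placeholder, the budget is assumed to apply per database, and `check_db_with_budget`, `check_step`, and `DdlBlockingBudgetExceeded` are hypothetical names.

```python
import threading
import time
from typing import Callable


class DdlBlockingBudgetExceeded(Exception):
    """Raised when steps 2 and 3 together exceed the allowed time (hypothetical)."""


def check_db_with_budget(
    db: str,
    ddl_lock: threading.Lock,
    check_step: Callable[[str], bool],
    budget_secs: float,
) -> None:
    """Acquire the DDL lock and run the check under one shared deadline.

    check_step(db) performs one unit of checking and returns True when done.
    """
    if budget_secs <= 0:
        raise ValueError("budget_secs must be positive")
    deadline = time.monotonic() + budget_secs
    # Step 2: wait for the lock, but never longer than the budget.
    if not ddl_lock.acquire(timeout=budget_secs):
        raise DdlBlockingBudgetExceeded(f"timed out waiting for DDL lock on {db}")
    try:
        # Step 3: check in small units, testing the same deadline between units.
        while True:
            if time.monotonic() >= deadline:
                raise DdlBlockingBudgetExceeded(f"check on {db} exceeded its budget")
            if check_step(db):
                return
    finally:
        # Always release so other DDL operations can proceed.
        ddl_lock.release()
```

Unlike a command-wide maxTimeMS, a budget of this kind bounds only the window in which the check can hold up other DDL operations on that database, so a long cluster-wide run is not cut short by its total duration.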

causes

SERVER-104292 Terminate operation before deadline after DDL lock acquisition

Closed

SERVER-105799 Ensure that failpoints used in check_metadata_consistency_timeout_tests.js are reached on slow machines

Closed

Assignee: Marcos José Grillo Ramirez
Reporter: Marcos José Grillo Ramirez
Participants: Githook User, Marcos José Grillo Ramirez
Votes: 0
Watchers: 4

Created: Jan 15 2025 05:04:52 PM UTC
Updated: Jun 18 2025 08:39:26 PM UTC
Resolved: Feb 05 2025 05:14:36 PM UTC
Confidence Status Last Update: 21/Jan/25 2:57 PM
