  1. Core Server
  2. SERVER-99440

Add timeout parameter for check metadata consistency database operation

    • Type: Task
    • Resolution: Fixed
    • Priority: Major - P3
    • 8.1.0-rc0
    • Affects Version/s: None
    • Component/s: Sharding
    • None
    • Catalog and Routing
    • Fully Compatible
    • v8.0, v7.0
    • CAR Team 2025-01-20, CAR Team 2025-02-03, CAR Team 2025-02-17

      SERVER-71823 added consistency checks on sharding metadata for a cluster, database, or collection. When issued cluster-wide, the check runs serially as follows:

      1. for each DB:
      2. acquire the DB DDL lock
      3. execute the check for that DB
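
      The three steps above can be sketched roughly as follows. This is a toy model, not the server implementation: `check_cluster`, `db_locks`, and `check_db` are hypothetical names, and a plain `threading.Lock` stands in for the DB DDL lock.

```python
import threading
from typing import Callable, Dict, List


def check_cluster(
    db_locks: Dict[str, threading.Lock],
    check_db: Callable[[str], List[str]],
) -> Dict[str, List[str]]:
    """Check every database serially, holding that database's lock meanwhile.

    db_locks maps a database name to a stand-in for its DDL lock (assumption).
    check_db returns the list of inconsistencies found for one database.
    """
    results: Dict[str, List[str]] = {}
    for db_name, ddl_lock in db_locks.items():    # step 1: for each DB
        with ddl_lock:                            # step 2: acquire the DB DDL lock
            results[db_name] = check_db(db_name)  # step 3: run the check
    return results
```

      In this model, any other operation that needs the same lock waits for the whole of step 3, which is the blocking described below.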

      A customer who uses checkMetadataConsistency to detect problems in a cluster could, under some circumstances, end up blocking other DDL operations for a long time, which is unacceptable for a live cluster. In particular, if a long-running DDL operation that takes a DB lock (such as resharding) is in progress and a checkMetadataConsistency request comes in, step 2 would enqueue a database lock request that prevents other DDL operations that also take database locks from executing, in particular:

      • renameCollection
      • shardCollection
      • collMod
      • dropCollection
      • dropDatabase
      • dropIndexes
      • reshardCollection
      • movePrimary
      • createCollection (affected only from v8.0 onward; applies to both implicit and explicit collection creation, but only when done outside of a transaction or retryable write)

      SERVER-92182 partially handled this in master by adding a backoff mechanism in step 2. However, this addresses neither the full execution of the operation (step 3) nor older versions. It is currently possible to set maxTimeMS on the command, but because that limits the duration of the entire verification, we want a more fine-grained parameter (specifically for steps 2 and 3).
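
      For illustration only, one common shape for a backoff on lock acquisition is sketched below. The ticket does not describe how SERVER-92182 implements it, so the function name, the parameters, and the doubling policy are all assumptions, and `threading.Lock` again stands in for the DB DDL lock.

```python
import threading
import time


def acquire_with_backoff(
    lock: threading.Lock,
    attempt_timeout_s: float = 0.05,
    initial_backoff_s: float = 0.01,
    max_backoff_s: float = 0.1,
    max_attempts: int = 5,
    sleep=time.sleep,
) -> bool:
    """Try to take the lock in short attempts instead of queueing indefinitely.

    Between failed attempts the caller is not waiting on the lock at all,
    which gives other operations a window to acquire it. Hypothetical sketch.
    """
    backoff = initial_backoff_s
    for attempt in range(max_attempts):
        if lock.acquire(timeout=attempt_timeout_s):
            return True
        if attempt < max_attempts - 1:
            sleep(backoff)  # stay out of the lock queue for a while
            backoff = min(backoff * 2, max_backoff_s)  # exponential, capped
    return False
```

      Note that a backoff of this kind only bounds the wait in step 2; once the lock is held, step 3 can still run for an arbitrary time.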

      The purpose of this ticket is to add a new parameter to the command that limits the total amount of time other DDL operations might be blocked (that is, the total duration of steps 2 and 3) and fails the command once the time limit is reached.
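
      A minimal sketch of the intended behaviour, assuming a single per-database time budget that covers both the lock wait (step 2) and the check itself (step 3). The names `check_db_with_budget`, `budget_s`, and `DbCheckTimeout` are hypothetical and are not the parameter or error actually added by this ticket. The deadline is checked cooperatively between check steps, whereas a real server would more likely interrupt the operation.

```python
import threading
import time
from typing import Callable, List, Sequence


class DbCheckTimeout(Exception):
    """Raised when steps 2 and 3 together exceed the budget (hypothetical)."""


def check_db_with_budget(
    db_name: str,
    ddl_lock: threading.Lock,
    check_steps: Sequence[Callable[[str], List[str]]],
    budget_s: float,
    clock: Callable[[], float] = time.monotonic,
) -> List[str]:
    # The budget starts before the lock wait, so it bounds steps 2 and 3 together.
    deadline = clock() + budget_s
    remaining = deadline - clock()
    # Step 2: wait for the lock, but never longer than what is left of the budget.
    if remaining <= 0 or not ddl_lock.acquire(timeout=remaining):
        raise DbCheckTimeout(f"timed out waiting for the lock on {db_name}")
    try:
        findings: List[str] = []
        # Step 3: run the check in pieces, failing once the deadline has passed.
        for step in check_steps:
            if clock() >= deadline:
                raise DbCheckTimeout(f"timed out while checking {db_name}")
            findings.extend(step(db_name))
        return findings
    finally:
        ddl_lock.release()  # always unblock other DDL operations
```

      Unlike maxTimeMS, a budget like this resets for each database, so a cluster-wide run can take long overall while no single database lock is held past the limit.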

            Assignee:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Reporter:
            marcos.grillo@mongodb.com Marcos José Grillo Ramirez
            Votes:
            0
            Watchers:
            4
