Core Server / SERVER-92658

Investigate whether step downs may lead to an inconsistent state of internal sessions within the pool

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: None
    • Catalog and Routing
    • ALL
    • CAR Team 2024-08-05, CAR Team 2024-08-19
    • 0

      Consider the following sequence:

      1. The primary node of a shard (nodeA) is coordinating a DDL operation (shardDDL1) and checks out a session, SessionX, from the InternalSessionPool, incrementing its txnNumber up to 10.
      2. nodeA manages to execute the last phase of shardDDL1, but it steps down during the "release coordinator" step. In this case, we may end up returning SessionX to the node's InternalSessionPool (PoolA) without deleting the recovery doc (which will still contain a reference to SessionX with txnNumber = 10).
      3. nodeB steps up and:
        1. resumes shardDDL1, completing it and returning SessionX to its own PoolB (for example, with txnNumber = 12);
        2. later starts executing shardDDL2, checking out SessionX from PoolB and advancing its txnNumber up to 15;
        3. steps down, leaving the recovery document recording that SessionX is checked out at txnNumber 15.
      4. nodeA steps up again and:
        1. starts serving shardDDL3, checking out SessionX from PoolA (where it is still at txnNumber 10)...
        2. ... while also resuming shardDDL2, which has SessionX checked out at txnNumber 15 as well (according to the metadata in the recovery doc).

      Under this assumption, we expect shardDDL3 to encounter TransactionTooOld errors.
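
      The hypothesized sequence can be replayed with a small toy model. This is only a sketch of the scenario as described in this ticket, not the server's actual implementation: `SessionCatalog`, `Node`, `recovery_docs` and `simulate` are hypothetical names, and the model assumes that each node's pool is purely in-memory, that recovery docs and per-session txnNumbers are replicated, and that a txnNumber lower than the highest one already seen is rejected. It deliberately ignores any safeguard the real server may apply on step down or step up.

```python
class TransactionTooOld(Exception):
    """Raised by the toy catalog when a txnNumber is older than the latest seen."""


class SessionCatalog:
    """Hypothetical stand-in for replicated per-session state."""

    def __init__(self):
        self.last_txn = {}  # session id -> highest txnNumber seen so far

    def run(self, session_id, txn_number):
        last = self.last_txn.get(session_id, -1)
        if txn_number < last:
            raise TransactionTooOld(f"{session_id}: {txn_number} < {last}")
        self.last_txn[session_id] = txn_number


class Node:
    """Hypothetical shard node with a non-replicated, in-memory session pool."""

    def __init__(self, name):
        self.name = name
        self.pool = {}  # session id -> txnNumber the session had when returned


def simulate():
    catalog = SessionCatalog()
    node_a, node_b = Node("nodeA"), Node("nodeB")
    recovery_docs = {}  # assumed replicated: DDL name -> (session id, txnNumber)

    # Step 1: nodeA runs shardDDL1 on SessionX up to txnNumber 10.
    catalog.run("SessionX", 10)
    recovery_docs["shardDDL1"] = ("SessionX", 10)

    # Step 2: nodeA steps down mid-release; SessionX goes back to PoolA,
    # but the recovery doc is not deleted.
    node_a.pool["SessionX"] = 10

    # Step 3.1: nodeB resumes shardDDL1 from the recovery doc and finishes at 12.
    session_id, txn = recovery_docs.pop("shardDDL1")
    for n in range(txn + 1, 13):
        catalog.run(session_id, n)
    node_b.pool[session_id] = 12

    # Steps 3.2 and 3.3: nodeB runs shardDDL2 up to 15, then steps down.
    txn = node_b.pool.pop("SessionX")
    for n in range(txn + 1, 16):
        catalog.run("SessionX", n)
    recovery_docs["shardDDL2"] = ("SessionX", 15)

    # Step 4.1: nodeA steps up and serves shardDDL3 from its stale PoolA entry.
    stale_txn = node_a.pool.pop("SessionX") + 1
    try:
        catalog.run("SessionX", stale_txn)
        rejected = False
    except TransactionTooOld:
        rejected = True
    return stale_txn, catalog.last_txn["SessionX"], rejected
```

      In this model, `simulate()` reports the txnNumber that shardDDL3 attempts, the highest txnNumber already recorded for SessionX, and whether the attempt was rejected. The rejection follows only from the stale PoolA entry assumed in step 2.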

            Assignee:
            paolo.polato@mongodb.com Paolo Polato
            Reporter:
            paolo.polato@mongodb.com Paolo Polato
            Votes:
            0
            Watchers:
            8
