SERVER-48679

flushRoutingTableCacheUpdates should block on critical section with kWrite, not kRead

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.6.18, 4.5.1, 4.0.18, 4.2.7, 4.4.0-rc8
    • Fix Version/s: 4.2.10, 4.4.1, 4.7.0, 4.0.22
    • Component/s: Sharding
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Requested:
      v4.4, v4.2, v4.0
    • Sprint:
      Sharding 2020-06-15, Sharding 2020-06-29
    • Linked BF Score:
      12

      Description

      The donor writes the enterCriticalSectionCounter flag
      -> which causes secondaries to clear their filtering metadata
      -> which causes the next versioned request on the secondary to throw StaleConfig and trigger the secondary to refresh
      -> which causes the secondary to send flushRoutingTableCacheUpdates to the primary
      -> which blocks behind the critical section only if reads are being blocked
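      The last step of the chain is the crux. As a simplified model (not the server's actual code), the critical section can be pictured as having two phases: an earlier one in which only writes are blocked, and a later one in which reads are blocked too. The Python sketch below uses hypothetical names (`Phase`, `Op`, `must_wait`) to show when a waiter has to block, depending on whether it waits as kRead or kWrite.

```python
from enum import Enum


class Phase(Enum):
    # Hypothetical names for this sketch; not the server's identifiers.
    NONE = 0      # no critical section active
    CATCH_UP = 1  # writes are blocked, reads are still allowed
    COMMIT = 2    # reads and writes are both blocked


class Op(Enum):
    READ = "kRead"
    WRITE = "kWrite"


def must_wait(phase: Phase, op: Op) -> bool:
    """Return True if a waiter of kind `op` has to block in `phase`."""
    if phase is Phase.NONE:
        return False
    if op is Op.WRITE:
        # A kWrite waiter blocks as soon as writes are being blocked.
        return True
    # A kRead waiter blocks only once reads are being blocked.
    return phase is Phase.COMMIT


if __name__ == "__main__":
    for phase in Phase:
        print(phase.name, {op.value: must_wait(phase, op) for op in Op})
```

      In this model, the window in which a kRead waiter passes straight through while writes are already blocked is the `CATCH_UP` row.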

      In 4.4 and earlier versions, if the donor has not yet started blocking reads, the secondary finishes the refresh and then serves reads from stale mongoses even after the migration commits.

      For example:

      • Donor writes the enterCriticalSectionCounter flag at T90
      • Secondary sees the flag, invalidates its filtering metadata
      • Secondary gets a versioned read, sends flushRoutingTableCacheUpdates, gets back success
      • Donor starts blocking reads
      • Donor commits the migration, which succeeds at T100
      • Client does a write from mongos1, which contacts donor and gets back StaleConfig, then retries write on recipient, which succeeds at T101
      • Client does an afterClusterTime: T101 read from mongos2, which is stale and contacts the donor secondary. >>> That secondary will wait until T101, then serve the read <<<
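      The timeline above can be replayed as a small sketch. This is an illustrative model under stated assumptions, not server code: `replay` and its return values are hypothetical, and it assumes that a flush which blocks returns only after the commit, so the secondary's refresh then picks up the post-migration routing information and rejects the stale mongos.

```python
def replay(flush_waits_as: str) -> str:
    """Replay the example timeline for a flush that waits as 'kRead' or 'kWrite'.

    Returns 'served_stale' if the donor secondary serves the read, or
    'stale_config' if it rejects the stale mongos instead.
    """
    # State on the donor right after T90: the flag has been written and
    # writes are blocked, but reads are not blocked yet.
    blocked = {"kWrite": True, "kRead": False}

    # The secondary's flushRoutingTableCacheUpdates arrives in this window.
    flush_blocked = blocked[flush_waits_as]

    # Assumption: a blocked flush returns only after the commit at T100,
    # so the refresh sees the recipient as owner. An unblocked flush
    # returns early and the refresh still sees the donor as owner.
    owner_seen_by_secondary = "recipient" if flush_blocked else "donor"

    # T100: the migration commits. T101: the write lands on the recipient.
    # mongos2 is stale and sends its afterClusterTime: T101 read to the
    # donor secondary.
    if owner_seen_by_secondary == "donor":
        # The secondary waits until T101 and serves the read, although the
        # T101 write went to the recipient.
        return "served_stale"
    return "stale_config"


if __name__ == "__main__":
    for mode in ("kRead", "kWrite"):
        print(mode, "->", replay(mode))
```

      Under these assumptions, waiting as kRead reproduces the problematic read, while waiting as kWrite makes the flush block in the same window.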

      In 4.5, that happens to not be an issue, since the refresh is done by calling onShardVersionMismatch, which waits for the critical section as long as writes are already being blocked.

      Despite that, we want to change flushRoutingTableCacheUpdates in all versions to block behind the critical section with kWrite rather than kRead, which is what it uses today.

              People

              Assignee:
              luis.osta Luis Osta Lugo (Inactive)
              Reporter:
              esha.maharishi Esha Maharishi
              Votes:
              0
              Watchers:
              5
