Core Server / SERVER-70437

Lost writes to unsharded collection during movePrimary

    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • Fix Version/s: 6.3.0-rc0
    • Affects Version/s: 4.4.0, 5.0.0, 6.0.0
    • Component/s: Sharding
    • None
    • Fully Compatible
    • ALL
    • Execution Team 2022-11-14, Execution Team 2022-12-12, Execution Team 2022-11-28, Execution Team 2022-12-26, Execution Team 2023-01-09
    • 35

      Consider the following interleaving:

      1. [th1] Starts a multi write operation on an unsharded collection and passes the dbVersion check successfully.
      2. [th1] The multi write yields.
      3. [th2] A movePrimary starts and sets the 'move primary in progress' flag on the DatabaseShardingState.
      4. [th2] MovePrimary commits.
      5. [th2] MovePrimary unsets the 'move primary in progress' flag.
      6. [th2] MovePrimary hangs before dropping the old collection from the former primary.
      7. [th1] The multi write now resumes from the yield. Note how the writes never recheck the dbVersion on the op_observers (like we do for the 'shardVersion'). Therefore, the writes don't fail, but are lost because they happened on the old db-primary shard after the ownership change had already committed.
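
The interleaving above can be replayed deterministically with a small model. This is only a sketch under stated assumptions, not server code: `DatabaseShardingState`, `Shard`, `op_observer_check` and `run_interleaving` are hypothetical Python stand-ins, the dbVersion is reduced to an integer, and the two threads are serialized in the order of steps 1 to 7.

```python
from dataclasses import dataclass, field


class MovePrimaryInProgress(Exception):
    """Raised when a write observes the 'move primary in progress' flag."""


@dataclass
class DatabaseShardingState:
    # Hypothetical stand-in for the shard-local state described in the ticket.
    db_version: int = 1
    primary_shard: str = "shardA"
    move_primary_in_progress: bool = False


@dataclass
class Shard:
    name: str
    docs: list = field(default_factory=list)


def op_observer_check(dss: DatabaseShardingState) -> None:
    # The only guard applied at write time in this model: the in-progress flag.
    if dss.move_primary_in_progress:
        raise MovePrimaryInProgress("movePrimary in progress")


def run_interleaving() -> dict:
    dss = DatabaseShardingState()
    old_primary = Shard("shardA", docs=[{"_id": 0}])
    new_primary = Shard("shardB")

    # Step 1: th1 passes the dbVersion check once, before writing.
    version_seen_by_th1 = dss.db_version
    # Step 2: th1 yields.

    # Step 3: th2 sets the flag and clones the collection.
    dss.move_primary_in_progress = True
    new_primary.docs = list(old_primary.docs)
    # Step 4: th2 commits the ownership change.
    dss.primary_shard = new_primary.name
    dss.db_version += 1
    # Step 5: th2 unsets the flag.
    dss.move_primary_in_progress = False
    # Step 6: th2 hangs before dropping the old collection.

    # Step 7: th1 resumes. The flag is already clear, so the check passes
    # and the write lands on the former primary.
    op_observer_check(dss)
    old_primary.docs.append({"_id": 1})
    write_accepted = True

    # th2 eventually drops the collection on the former primary.
    old_primary.docs.clear()

    return {
        "write_accepted": write_accepted,
        "stale_version_used": version_seen_by_th1 != dss.db_version,
        "visible_on_new_primary": {"_id": 1} in new_primary.docs,
    }
```

Calling `run_interleaving()` in this model reports a write that was accepted against a stale dbVersion and is not present on the new primary, which is the lost-write outcome described in step 7.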

      Note that on the op_observer we fail writes if there's a movePrimary operation in progress. This check is what typically prevents lost writes. However, it is only hit if the write restores from yield while the movePrimary is still in progress, which does not happen in the interleaving above. I don't think this interleaving is very likely, since it requires the write to stay yielded for a long time (from the moment the cloning started until after the commit happened).
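
To illustrate the gap, here is a minimal sketch of a write-time guard that also compares the dbVersion the operation started with against the current one, analogous to what the description says is done for the 'shardVersion'. This is an assumption for illustration only and not necessarily the fix that was adopted for this ticket; `check_write_allowed`, `StaleDbVersion` and the integer dbVersion are hypothetical.

```python
from dataclasses import dataclass


class MovePrimaryInProgress(Exception):
    """Raised when the 'move primary in progress' flag is set."""


class StaleDbVersion(Exception):
    """Raised when the database version changed since the operation started."""


@dataclass
class DatabaseShardingState:
    # Hypothetical simplification: the real dbVersion is not a plain integer.
    db_version: int
    move_primary_in_progress: bool = False


def check_write_allowed(dss: DatabaseShardingState, expected_db_version: int) -> None:
    # Existing guard from the description: fail while a movePrimary is running.
    if dss.move_primary_in_progress:
        raise MovePrimaryInProgress("movePrimary in progress")
    # Additional guard (assumption): fail if a movePrimary committed while
    # the operation was yielded, even though the flag is clear again.
    if dss.db_version != expected_db_version:
        raise StaleDbVersion(
            f"expected dbVersion {expected_db_version}, found {dss.db_version}"
        )


# State as th1 would see it at step 7: flag cleared, version already bumped.
dss_at_step_7 = DatabaseShardingState(db_version=2, move_primary_in_progress=False)
try:
    check_write_allowed(dss_at_step_7, expected_db_version=1)
    outcome = "accepted"
except StaleDbVersion:
    outcome = "rejected: stale dbVersion"
```

With the flag check alone, the state at step 7 lets the write through. With the version comparison added, the same state is rejected, so the write would fail instead of being silently lost.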

            Assignee:
            daniel.gomezferro@mongodb.com Daniel Gomez Ferro
            Reporter:
            jordi.serra-torrens@mongodb.com Jordi Serra Torrens
            Votes:
            0
            Watchers:
            3
