Core Server / SERVER-27807

creating a snapshot and registering it in the replcoord is not synchronous


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: 3.2.13, 3.4.3, 3.5.2
    • Component/s: Replication, Storage
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL
    • Backport Completed:
    • Sprint:
      Storage 2017-02-13
    • Linked BF Score:
      0

      Description

      Normally, the replication coordinator keeps track of all WiredTiger snapshots in a vector _uncommittedSnapshots, which is protected by the replcoord mutex. This vector needs to mirror the actual list of snapshots in WiredTiger.
      Dropping all snapshots is also protected by this mutex – the _uncommittedSnapshots vector and the storage engine's list of snapshots are updated at the same time under this mutex lock.
      Creating a new snapshot, however, is not completely protected by this mutex. The snapshot is created in the storage engine outside of the mutex lock, and only afterwards is the replcoord state updated under the mutex lock. Thus, if dropAllSnapshots() is called asynchronously between the time a snapshot is created in WiredTiger and the time it is registered in _uncommittedSnapshots, the system may later attempt to use a snapshot that it thinks exists but is no longer present in WiredTiger.
      Currently, we call dropAllSnapshots() at rollback time, at the beginning of initial sync, and at reconfig time, so any of those actions could trigger an fassert.


      People

      • Votes: 0
      • Watchers: 3
