  WiredTiger / WT-2696

Race condition on unclean shutdown may miss log records with large updates

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: WT2.8.0
    • Fix Version/s: WT2.9.0, 3.2.8, 3.3.9
    • Labels: None

      Description

      Issue Status as of Jul 06, 2016

      ISSUE SUMMARY
      Under extremely rare circumstances, a race condition in the code that updates large records may cause some of those updates to be lost during an unclean shutdown.

      On a production system, the code path containing the race condition is taken only when log records are 128 KB or larger. In MongoDB terms the threshold corresponds to a smaller record size, perhaps around 40 KB, because a single WiredTiger log record contains the inserts into the collection, its indexes, the oplog, and so on.

      Attempts to trigger this race condition with MongoDB using a synthetic workload with compression disabled have produced mixed results. However, attempts to reproduce the issue in MongoDB with default compression (snappy) have been unsuccessful.

      This issue affects only users running with journaling enabled. Users who run with journaling disabled cannot be affected by this bug.

      USER IMPACT
      If the race condition is triggered and the node suffers an unclean shutdown, some updates to large records made since the last checkpoint may be lost. Unfortunately, it is not possible to detect whether the race condition has been triggered.

      AFFECTED VERSIONS
      MongoDB 3.2 versions up to and including MongoDB 3.2.7.

      REMEDIATION
      A fix for this issue is included in the MongoDB 3.2.8 production release. Users whose workloads include updates to large records, and whose nodes may be subject to unclean shutdowns, should upgrade to MongoDB 3.2.8 to avoid exposure to this issue.

      WORKAROUNDS
      Unfortunately, there are no known workarounds for this issue.

      Original description

      Hi!

      After rebuilding WiredTiger with diagnostic enabled, one of our tests started to fail.
      The test checks the ability of the database to recover after an application crash.
      Please see the attached minimized test:

      $ ./recovery-test-mp
      5 writer threads spawned
      killing child
      checking DB...
      no record with key 28363
      no record with key 3689348814741930043
      no record with key 3689348814741983775
      no record with key 7378697629483839817
      no record with key 7378697629483894622
      no record with key 11068046444225735421
      no record with key 14757395258967669044
      no record with key 14757395258967726182
      8 record(s) absent from total of 544769
      

      I was unable to reproduce the problem without diagnostic enabled.

      Thanks!
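
      The check phase of a test like this can be sketched as follows. This is an illustrative Python sketch, not the attached recovery-test-mp.c: the function name report_missing is hypothetical, and it assumes the test knows which keys the writer threads inserted before the child was killed and which keys can be read back after recovery. It also assumes that the "total" in the summary line counts the expected keys.

```python
def report_missing(expected_keys, recovered_keys):
    """Compare the keys the writers inserted with the keys found after recovery.

    Returns (missing, lines): the sorted list of absent keys, and report
    lines shaped like the test output quoted in this ticket.
    """
    expected = set(expected_keys)
    recovered = set(recovered_keys)
    # A key is missing if it was inserted but cannot be read back.
    missing = sorted(expected - recovered)
    lines = [f"no record with key {key}" for key in missing]
    lines.append(
        f"{len(missing)} record(s) absent from total of {len(expected)}"
    )
    return missing, lines


if __name__ == "__main__":
    # Hypothetical data: key 3 was inserted before the crash but is not
    # found once the database has been reopened and recovered.
    missing, lines = report_missing([1, 2, 3, 4], [1, 2, 4])
    print("\n".join(lines))
```

      A non-empty missing list is the failure condition: every record written before the crash is expected to survive recovery.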

        Attachments

        1. check.2696.js (0.6 kB)
        2. insert.2696.js (0.3 kB)
        3. recovery-test-mp.c (5 kB)
        4. run2696.sh (3 kB)
        5. runloop.sh (0.3 kB)

              People

              • Assignee: Sue LoVerso (sue.loverso)
              • Reporter: Dmitri Shubin