Core Server / SERVER-17506

Race between inserts and checkpoints can lose records under WiredTiger

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Critical - P2
    • Resolution: Fixed
    • Affects Version/s: 3.0.0
    • Fix Version/s: 3.0.1, 3.1.0
    • Component/s: WiredTiger
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL

      Description

      Repro:

      collections=1000
      threads=1
       
      # force a checkpoint via the fsync command
      function checkpoint {
          echo checkpointing
          mongo --quiet --eval "printjson(db.adminCommand({fsync:1}))"
      }
       
      # start a fresh mongod and seed $collections collections with one document each
      function setup {
          killall -w mongod
          rm -rf db; mkdir -p db
          mongod --dbpath db --logpath db.log --fork --storageEngine wiredTiger
          sleep 2
          echo initializing $collections collections
          mongo --quiet --eval "
              for (var i=0; i<$collections; i++)
                  db['c'+i].insert({})
          "
          checkpoint
      }
       
      # insert into every collection in parallel with a checkpoint, then
      # take a final checkpoint after the inserts have finished
      function repro {
          (
              for t in $(seq 0 $((threads-1))); do
                  echo inserting $t
                  mongo --quiet --eval "
                      for (var i=$t; i<$collections; i+=$threads)
                          db['c'+i].insert({})
                  " &
              done
              wait
              echo finished inserts
          ) &
          checkpoint &
          wait
          echo finished parallel checkpoint
          checkpoint
          echo finished sequential checkpoint        
      }
       
      # for each collection, compare the _id index count against a full
      # collection scan; a mismatch means records were lost
      function check {
          echo checking
          mongo --quiet --eval "
              for (var i=0; i<$collections; i++) {
                  try {
                      c = db['c'+i]
                      n_id = c.find().hint({_id:1}).itcount()
                      n_c = c.find().hint({\$natural:1}).itcount()
                      if (n_id != n_c)
                          print(i, n_c, n_id, 'OOPS')
                  } catch (e) {
                      print(i, e)
                  }
              }
          "
      }
       
      setup
      # repeatedly: run the repro, restart mongod cleanly, and check for lost records
      while true; do
          repro
          echo stopping
          killall -w mongod
          echo starting
          mongod --dbpath db --logpath db.log --logappend --fork --storageEngine wiredTiger
          sleep 2
          check
      done
      

      This produces the following errors from the application on the second pass through the loop:

      27 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      28 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      29 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      30 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      31 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      32 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      33 Error: error: { "$err" : "Didn't find RecordId in WiredTigerRecordStore", "code" : 28556 }
      ...
      

      And the following in the mongod log:

      2015-03-07T12:53:31.671-0500 E STORAGE  WiredTiger (0) [1425750811:671913][9925:0x7f3a93725700], eviction-server: cache server: exiting with 420 pages in memory and 416 pages evicted
      2015-03-07T12:53:31.671-0500 E STORAGE  WiredTiger (0) [1425750811:671962][9925:0x7f3a93725700], eviction-server: cache server: exiting with 1640 bytes in memory
      2015-03-07T12:53:31.671-0500 E STORAGE  WiredTiger (0) [1425750811:671976][9925:0x7f3a93725700], eviction-server: cache server: exiting with 1058 bytes dirty and 2 pages dirty
      

      2015-03-07T12:53:35.799-0500 I -        [conn1] Assertion: 28556:Didn't find RecordId in WiredTigerRecordStore
      

      Attachments

      1. server_17506.js (2 kB, uploaded by Kamran K.)


          Activity

          Michael Cahill added a comment (edited):

          Bruce Lucas, thanks for the repro! Asya Kamsky, thanks for identifying the patch!

          A fix is here:
          https://github.com/wiredtiger/wiredtiger/pull/1735

          Oleg Rekutin added a comment:

          Michael Cahill, do you have any more information on which parts get corrupted? It looks like this problem corrupts data but does not show itself until later. Is there any way to perform some kind of validation on the data structures to look for this? Is it the indexes or the actual record storage? Or is it something related to the replication log?

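          A check along these lines, distilled from the repro's check function plus the shell's validate helper, can catch the visible symptom (index entries pointing at records that no longer exist). It is only a sketch, and a clean result proves nothing: as the next comment explains, a lost update can also leave no internal inconsistency at all.

              mongo --quiet --eval "
                  db.getCollectionNames().forEach(function (name) {
                      if (name.indexOf('system.') == 0) return  // skip system collections
                      var c = db[name]
                      var nIndex = c.find().hint({_id: 1}).itcount()       // walk the _id index
                      var nScan = c.find().hint({\$natural: 1}).itcount()  // walk the record store
                      if (nIndex != nScan)
                          print(name, 'index count', nIndex, '!= scan count', nScan)
                      var res = c.validate(true)  // full validation of records and indexes
                      if (!res.valid)
                          print(name, 'validate failed: ' + tojson(res))
                  })
              "
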
          Michael Cahill added a comment (edited):

          Oleg Rekutin, it's tough to characterize exactly how this bug might have manifested.

          The bug was triggered if a database page was not in cache at the beginning of a checkpoint, then it was read into cache and dirtied (a document was inserted, updated or deleted) during the checkpoint. The checkpoint would (correctly) skip over the dirty page because the changes were too recent to appear, but the tree containing the dirty page could end up being marked clean. Then if there were no more updates before mongod was shut down, the dirty page would never get written.

          Once the bug has occurred (i.e., mongod has been shut down), there is nothing in the data files that could be used to detect or recover from the lost update. As noted here, this was fixed in MongoDB 3.0.1.
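
          For concreteness, the trigger window described above reduces to roughly the following sequence (a timing-dependent sketch, not a reliable repro on its own: it assumes a running 3.0.0 mongod with WiredTiger and an arbitrary collection c, and the insert must hit a page that was not in cache when the checkpoint started):

              mongo --quiet --eval 'db.adminCommand({fsync: 1})' &  # begin a checkpoint
              mongo --quiet --eval 'db.c.insert({x: 1})'            # dirty a page while the checkpoint runs
              wait                 # checkpoint finishes; the tree may be wrongly marked clean
              killall -w mongod    # clean shutdown: the skipped dirty page is never written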


            People

            • Votes: 0
            • Watchers: 18
