WiredTiger / WT-2009

Recovery failure when there are updates and a drop to re-apply

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.7.0
    • Labels:
      None
    • # Replies:
      7
    • Last comment by Customer:
      true

      Description

      I've seen a problem with recovery, and I'm not sure what the correct solution is.

      The application is doing:

      • Create checkpoint
      • Create a new table
      • Insert some records into the new table
      • Drop the new table

      Followed by a hard system crash.
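
      For reference, that sequence maps onto the WiredTiger C API roughly as
      follows. This is a minimal sketch, not taken from the ticket: the table
      name, formats and record values are placeholders.

        #include <wiredtiger.h>

        int
        repro(WT_CONNECTION *conn)
        {
            WT_CURSOR *cursor;
            WT_SESSION *session;
            int ret;

            if ((ret = conn->open_session(conn, NULL, NULL, &session)) != 0)
                return (ret);

            /* 1. Create a checkpoint. */
            if ((ret = session->checkpoint(session, NULL)) != 0)
                return (ret);

            /* 2. Create a new table. */
            if ((ret = session->create(
                session, "table:new", "key_format=S,value_format=S")) != 0)
                return (ret);

            /* 3. Insert some records into the new table. */
            if ((ret = session->open_cursor(
                session, "table:new", NULL, NULL, &cursor)) != 0)
                return (ret);
            cursor->set_key(cursor, "key1");
            cursor->set_value(cursor, "value1");
            if ((ret = cursor->insert(cursor)) != 0)
                return (ret);
            if ((ret = cursor->close(cursor)) != 0)
                return (ret);

            /* 4. Drop the new table. */
            if ((ret = session->drop(session, "table:new", NULL)) != 0)
                return (ret);

            /* The report has a hard system crash at this point. */
            return (0);
        }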

      Attachments

      1. pl.tgz (4.12 MB), uploaded by Alexander Gorrod

        Issue Links

          Activity

          alexander.gorrod Alexander Gorrod added a comment -

          I don't think it's necessary for the create to be after the checkpoint to trigger the problem, but in the case I've reproduced (see SERVER-19724 for the reproduction steps) it is.

          The problem is that we've done the drop, and there is a log record for it:

            { "lsn" : [2,17437568],
              "hdr_flags" : "",
              "rec_len" : 128,
              "mem_len" : 128,
              "type" : "commit",
              "txnid" : 122720,
              "ops": [
                { "optype": "row_remove",
                  "fileid": "0",
                  "key": "table:index-102--8248197238674022337\u0000"
                }
              ]
            },
          

          But we are doing recovery, and earlier log records want to apply updates to that file, which we have since dropped.

          I don't know of a simple solution to this problem; I can think of two:

          1) We do a first recovery pass looking for file drops, and then on the second pass skip any records that would be applied to files that are dropped later (yuck).

          2) We don't immediately error during recovery if we encounter a log record with a missing file, but save the file the log record applied to, and look out for drops of that file later in the log (a rough sketch of this bookkeeping follows below). At the end of recovery, if there were any saved log records that didn't correspond to a dropped file, we error out. Otherwise recovery went OK.

          Hopefully I'm missing a simpler solution?
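
          To make option (2) concrete, here is a rough sketch of the
          bookkeeping it would need. The helper names are hypothetical, not
          the actual WiredTiger recovery code:

            #include <errno.h>
            #include <stdlib.h>
            #include <string.h>

            /* Files whose log records could not be applied during recovery. */
            struct missing_file {
                char *uri;
                struct missing_file *next;
            };
            static struct missing_file *missing_head = NULL;

            /* A log record targets a file that doesn't exist: remember it. */
            static int
            recovery_track_missing(const char *uri)
            {
                struct missing_file *mf;

                if ((mf = malloc(sizeof(*mf))) == NULL)
                    return (ENOMEM);
                if ((mf->uri = strdup(uri)) == NULL) {
                    free(mf);
                    return (ENOMEM);
                }
                mf->next = missing_head;
                missing_head = mf;
                return (0);
            }

            /* A metadata remove (drop) of a file is replayed: forget it. */
            static void
            recovery_saw_drop(const char *uri)
            {
                struct missing_file **mfp, *mf;

                for (mfp = &missing_head; (mf = *mfp) != NULL; mfp = &mf->next)
                    if (strcmp(mf->uri, uri) == 0) {
                        *mfp = mf->next;
                        free(mf->uri);
                        free(mf);
                        return;
                    }
            }

            /* End of recovery: any file still tracked was never dropped. */
            static int
            recovery_check_missing(void)
            {
                return (missing_head == NULL ? 0 : EINVAL);
            }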

          alexander.gorrod Alexander Gorrod added a comment -

          More information about the failure. The error is:

          2015-07-15T06:53:32.533+0000 W -        [initandlisten] Detected unclean shutdown - /home/vagrant/pt-new/data/wiredTiger/mongod.lock is not empty.
          2015-07-15T06:53:32.533+0000 W STORAGE  [initandlisten] Recovering data from the last clean checkpoint.
          2015-07-15T06:53:32.533+0000 I STORAGE  [initandlisten] wiredtiger_open config: create,cache_size=1G,session_max=20000,eviction=(threads_max=4),statistics=(fast),log=(enabled=true,archive=true,path=journal,compressor=snappy),file_manager=(close_idle_time=100000),checkpoint=(wait=60,log_size=2GB),statistics_log=(wait=0),
          2015-07-15T06:53:33.003+0000 E STORAGE  [initandlisten] WiredTiger (2) [1436943213:3099][1604:0x7fc1ccfa1bc0], file:collection-103--8248197238674022337.wt: /home/vagrant/pt-new/data/wiredTiger/collection-103--8248197238674022337.wt: No such file or directory
          2015-07-15T06:53:33.003+0000 E STORAGE  [initandlisten] WiredTiger (2) [1436943213:3251][1604:0x7fc1ccfa1bc0], file:collection-103--8248197238674022337.wt: Operation failed during recovery: No such file or directory
          2015-07-15T06:53:33.034+0000 I -        [initandlisten] Fatal assertion 28548 NoSuchKey Unable to find metadata for table:collection-93--8248197238674022337
          2015-07-15T06:53:33.035+0000 I -        [initandlisten]
          

          The database directory contains:

          collection-0--148936795720080555.wt     index-120--8248197238674022337.wt
          collection-105--8248197238674022337.wt  index-122--8248197238674022337.wt
          collection-113--8248197238674022337.wt  index-124--8248197238674022337.wt
          collection-115--8248197238674022337.wt  index-126--8248197238674022337.wt
          collection-117--8248197238674022337.wt  index-128--8248197238674022337.wt
          collection-119--8248197238674022337.wt  index-130--8248197238674022337.wt
          collection-121--8248197238674022337.wt  index-2--8248197238674022337.wt
          collection-123--8248197238674022337.wt  index-90--8248197238674022337.wt
          collection-125--8248197238674022337.wt  index-92--8248197238674022337.wt
          collection-127--8248197238674022337.wt  journal
          collection-129--8248197238674022337.wt  _mdb_catalog.wt
          collection-1--8248197238674022337.wt    mongod.lock
          collection-89--8248197238674022337.wt   sizeStorer.wt
          collection-91--8248197238674022337.wt   storage.bson
          index-106--8248197238674022337.wt       WiredTiger
          index-114--8248197238674022337.wt       WiredTiger.basecfg
          index-1--148936795720080555.wt          WiredTiger.lock
          index-116--8248197238674022337.wt       WiredTiger.turtle
          index-118--8248197238674022337.wt       WiredTiger.wt
          

          I'll upload a full printlog output as well.

          alexander.gorrod Alexander Gorrod added a comment -

          printlog from failure

          alexander.gorrod Alexander Gorrod added a comment -

          My analysis above wasn't quite right. There is no remove recorded for collection-103--8248197238674022337; there is a remove recorded for index-102, which also has a missing file.

          It makes me wonder whether we are missing a sync call for the log file, i.e., there was a log record for the remove that didn't make it to disk stably before the file was removed.

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill <michael.cahill@mongodb.com> (michaelcahill)

          Message: WT-2009 Apply tracked metadata operations post-commit. In particular, don't remove dropped files until the metadata changes are durable.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/072399431533994cb74d1c8defb0799c95278860
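
          The commit message above describes the approach; a rough sketch of
          the ordering it enforces follows. The helper names are hypothetical
          stand-ins, not the functions in the actual change:

            struct session;    /* stand-in for the session handle */

            /* Hypothetical helpers for the metadata, log and filesystem steps. */
            extern int metadata_remove(struct session *, const char *uri);
            extern int metadata_commit_and_sync_log(struct session *);
            extern int filesystem_remove(struct session *, const char *filename);

            /* Make the metadata remove durable before unlinking the file. */
            int
            drop_sketch(struct session *s, const char *uri, const char *filename)
            {
                int ret;

                /* 1. Remove the table from the metadata; this logs the remove. */
                if ((ret = metadata_remove(s, uri)) != 0)
                    return (ret);

                /*
                 * 2. Commit the tracked metadata change and sync the log to
                 *    stable storage before touching the filesystem.
                 */
                if ((ret = metadata_commit_and_sync_log(s)) != 0)
                    return (ret);

                /*
                 * 3. Only now remove the underlying file. A crash before this
                 *    point leaves the file in place for recovery to replay
                 *    against; a crash after it leaves a durable remove record
                 *    telling recovery the file was deliberately dropped.
                 */
                return (filesystem_remove(s, filename));
            }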

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill <michael.cahill@mongodb.com> (michaelcahill)

          Message: WT-2009 Check for errors syncing the metadata before removing files.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/1222e6a2913d898b3488c5ad2f73ae6ca98ce973

          xgen-internal-githook Githook User added a comment -

          Author: Alex Gorrod <alexg@wiredtiger.com> (agorrod)

          Message: WT-2009 Be paranoid about syncing the log after meta tracking.

          The log sync flags are complex and can be updated at various
          times in the life-cycle. It's always required for metadata updates to
          be stable on disk - so paranoia is justified.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/a813dbfd98983ca6e588bf46f3e369bc3f17f5be


            People

            • Votes: 0
            • Watchers: 3

              Dates

              • Created:
              • Updated:
              • Resolved:
              • Days since reply: 1 year, 43 weeks, 4 days ago
              • Date of 1st Reply: