Uploaded image for project: 'WiredTiger'
  1. WiredTiger
  2. WT-1944

wiredtiger_open() fails with WT_NOTFOUND

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: WT2.6.0
    • Fix Version/s: WT2.6.1
    • Labels:
      None
    • # Replies:
      6
    • Last comment by Customer:
      true

      Description

      Hi!

      We somehow managed to get DB that we're unable to open due to wiredtiger_open() returns WT_NOTFOUND with no additional information.

      I'm attaching the DB itself and small program that opens it.

      Output I get when using WiredTiger 2.6.0 ('master'):

      $ ./open_db 
      wiredtiger_open(WT_HOME, NULL, "log=(enabled)", &conn) at (open_db.c:27) failed: WT_NOTFOUND: item not found [-31803]
      $ ./open_db 
      196 records in table 'main'
      $ ./open_db 
      196 records in table 'main'
      $ ./open_db 
      196 records in table 'main'
      

      I.e. it fails the first time, but can open and read it on next attempts.

      1. open_db.c
        0.9 kB
        Dmitri Shubin
      2. wt.tar.gz
        60 kB
        Dmitri Shubin

        Issue Links

          Activity

          Hide
          Dmitri Shubin Dmitri Shubin added a comment -

          I though it's similar to WT-1936, but 'develop' (7b3ad36318ddf82f11dd06d861158518622ddd17) seems to have the same problem. Even worse it started to print another errors after several runs:

          $ ./open_db 
          wiredtiger_open(WT_HOME, NULL, "log=(enabled)", &conn) at (open_db.c:27) failed: WT_NOTFOUND: item not found [-31803]
          $ ./open_db 
          196 records in table 'main'
          $ ./open_db 
          196 records in table 'main'
          $ ./open_db 
          [1432593816:665483][26848:0x7f4760b91700]: wt/WiredTigerLog.0000000006: No such file or directory
          [1432593816:665623][26848:0x7f4760b91700], WT_CURSOR.search: wt/WiredTigerLog.0000000006: No such file or directory
          [1432593816:665639][26848:0x7f4760b91700]: wt/WiredTigerLog.0000000006: No such file or directory
          196 records in table 'main'
          

          Show
          Dmitri Shubin Dmitri Shubin added a comment - I though it's similar to WT-1936 , but 'develop' (7b3ad36318ddf82f11dd06d861158518622ddd17) seems to have the same problem. Even worse it started to print another errors after several runs: $ ./open_db wiredtiger_open(WT_HOME, NULL, "log=(enabled)", &conn) at (open_db.c:27) failed: WT_NOTFOUND: item not found [-31803] $ ./open_db 196 records in table 'main' $ ./open_db 196 records in table 'main' $ ./open_db [1432593816:665483][26848:0x7f4760b91700]: wt/WiredTigerLog.0000000006: No such file or directory [1432593816:665623][26848:0x7f4760b91700], WT_CURSOR.search: wt/WiredTigerLog.0000000006: No such file or directory [1432593816:665639][26848:0x7f4760b91700]: wt/WiredTigerLog.0000000006: No such file or directory 196 records in table 'main'
          Hide
          keith.bostic Keith Bostic added a comment -

          Sue LoVerso, I took a quick look (thought it might be in the stuff I was looking at today).

          What's happening here is that metafile->ckpt_lsn is larger than any log file in the home directory:

          #1  0x000000000048cf3b in __wt_txn_recover (session=0x801430580)
              at src/txn/txn_recover.c:466
          466				ret = __wt_log_scan(session,
          (gdb) p metafile->ckpt_lsn
          $1 = {file = 6, offset = 213248}
          (gdb) 
          Suspended
          pixiebob:WT_1944 {71} ls wt
          WiredTiger			WiredTigerLog.0000000003
          WiredTiger.lock			WiredTigerLog.0000000004
          WiredTiger.turtle		WiredTigerLog.0000000005
          WiredTiger.wt			main.wt
          WiredTigerLog.0000000001	main_idx-1407.wti
          WiredTigerLog.0000000002
          

          and __wt_log_scan returns WT_NOFOUND in that case:

          1256				/* Offsets must be on allocation boundaries. */
          1257				if (lsnp->offset % allocsize != 0 ||
          1258				    lsnp->file > log->fileid)
          1259					return (WT_NOTFOUND);
          

          Show
          keith.bostic Keith Bostic added a comment - Sue LoVerso , I took a quick look (thought it might be in the stuff I was looking at today). What's happening here is that metafile->ckpt_lsn is larger than any log file in the home directory: #1 0x000000000048cf3b in __wt_txn_recover (session=0x801430580) at src/txn/txn_recover.c:466 466 ret = __wt_log_scan(session, (gdb) p metafile->ckpt_lsn $1 = {file = 6, offset = 213248} (gdb) Suspended pixiebob:WT_1944 {71} ls wt WiredTiger WiredTigerLog.0000000003 WiredTiger.lock WiredTigerLog.0000000004 WiredTiger.turtle WiredTigerLog.0000000005 WiredTiger.wt main.wt WiredTigerLog.0000000001 main_idx-1407.wti WiredTigerLog.0000000002 and __wt_log_scan returns WT_NOFOUND in that case: 1256 /* Offsets must be on allocation boundaries. */ 1257 if (lsnp->offset % allocsize != 0 || 1258 lsnp->file > log->fileid) 1259 return (WT_NOTFOUND);
          Hide
          sue.loverso Sue LoVerso added a comment -

          Thanks! I think there are two separate issues here.
          1. The first error is that the checkpoint LSN is in the metadata file but that log file does not exist on disk. I have an idea about how that might occur and I am going to test out the theory.
          2. The second error is that running the program multiple times advances the log and the checkpoint LSN but doesn't write the checkpoint LSN onto disk and that log file is eventually archived. I understand that issue and am thinking about various solutions.

          Show
          sue.loverso Sue LoVerso added a comment - Thanks! I think there are two separate issues here. 1. The first error is that the checkpoint LSN is in the metadata file but that log file does not exist on disk. I have an idea about how that might occur and I am going to test out the theory. 2. The second error is that running the program multiple times advances the log and the checkpoint LSN but doesn't write the checkpoint LSN onto disk and that log file is eventually archived. I understand that issue and am thinking about various solutions.
          Hide
          sue.loverso Sue LoVerso added a comment -

          Michael Cahill I have a pull request open to address the second issue. I have also put a separate commit on that branch to address the first issue. The description for the first issue is below.

          Although I haven't been able to force the situation, what I think happened in the first piece of this issue (checkpoint LSN in log file 6, only log files up to 5 exist after crash) is this:

          We saved the checkpoint LSN, which is 6,N (where N is pretty small, early in log file 6). Log file 6, having recently been created, is not synced. That is neither its contents nor its entry in the parent directory. The LSN gets written to the metadata file for the table(s) and the metadata itself. The metadata and turtle files are sync'ed to disk. Then, before the checkpoint log record gets written (with SYNC set), the crash happens. The metadata on disk is in log file 6, but the logging sync hasn't happened yet, so log file 6 does not exist.

          My fix is to force a sync of the logging directory and log file at the beginning of the checkpoint, where we record the ckpt_lsn so that once the checkpoint starts, we're guaranteed the log file in the metadata exists.

          Show
          sue.loverso Sue LoVerso added a comment - Michael Cahill I have a pull request open to address the second issue. I have also put a separate commit on that branch to address the first issue. The description for the first issue is below. Although I haven't been able to force the situation, what I think happened in the first piece of this issue (checkpoint LSN in log file 6, only log files up to 5 exist after crash) is this: We saved the checkpoint LSN, which is 6,N (where N is pretty small, early in log file 6). Log file 6, having recently been created, is not synced. That is neither its contents nor its entry in the parent directory. The LSN gets written to the metadata file for the table(s) and the metadata itself. The metadata and turtle files are sync'ed to disk. Then, before the checkpoint log record gets written (with SYNC set), the crash happens. The metadata on disk is in log file 6, but the logging sync hasn't happened yet, so log file 6 does not exist. My fix is to force a sync of the logging directory and log file at the beginning of the checkpoint, where we record the ckpt_lsn so that once the checkpoint starts, we're guaranteed the log file in the metadata exists.
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'}

          Message: Only skip metadata checkpoint if we really skipped it, not force.
          WT-1944
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/440b54a6f6792a7c658936b30cae8c13b4cca7d2

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'} Message: Only skip metadata checkpoint if we really skipped it, not force. WT-1944 Branch: develop https://github.com/wiredtiger/wiredtiger/commit/440b54a6f6792a7c658936b30cae8c13b4cca7d2
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'}

          Message: Force a sync of log files at the beginning of a checkpoint. WT-1944
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/7df667de9cfa68cb1d3ed397a2a9119b4a83f549

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'} Message: Force a sync of log files at the beginning of a checkpoint. WT-1944 Branch: develop https://github.com/wiredtiger/wiredtiger/commit/7df667de9cfa68cb1d3ed397a2a9119b4a83f549

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              5 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 42 weeks ago
                Date of 1st Reply: