WiredTiger / WT-2074

test/checkpoint failure in automated testing

    Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT2.7.0
    • Labels:
      None
    • # Replies:
      10
    • Last comment by Customer:
      true

      Description

      test/checkpoint is failing reliably in Jenkins. The failures started after the merge of the LAS changes:
      http://build.wiredtiger.com:8080/job/wiredtiger-test-checkpoint/1661/

      The failure is of the form:

      nice ./test/checkpoint/t -t m -n 1000000 -k 5000000 -C cache_size=100MB
      t: 2nd cursor didn't find 1st key
      : WT_NOTFOUND: item not found
      t: verify_checkpoint - mismatching data: Bad address
      t: process 21958
          1: 1 workers, 3 tables
      checkpointer thread starting: tid: 21958:0x7f5b66867700
      worker thread starting: tid: 21958:0x7f5b5ffff700
      Finished a checkpoint
      Finished verifying a checkpoint with 3 tables and 0 keys
      Finished a checkpoint
      Finished verifying a checkpoint with 3 tables and 43 keys
      Finished a checkpoint
      Finished verifying a checkpoint with 3 tables and 699 keys
      Finished a checkpoint
      Finished verifying a checkpoint with 3 tables and 1429 keys
      <snip>
      Finished a checkpoint
      Finished verifying a checkpoint with 3 tables and 425649 keys
      Finished a checkpoint
      Key mismatch 2597678 from a COL table is not 2597687 from a ROW table
      Ran workers for: 11.779459 seconds
      

          Activity

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: WT-2074: Exempt files that aren't checkpointed from the eviction test
          whether or not a running checkpoint blocks using the lookaside file.

          This isn't necessary at the moment: the only file not checkpointed is
          the lookaside table, and you can't write records from the lookaside
          table into the lookaside table (reconciliation checks and blocks it),
          but it's the right test for the future.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/46331aa9234551b0084a00b11e35e9ce31263062

          michael.cahill Michael Cahill added a comment -

          I think there is still a race with lookaside eviction running when the checkpoint starts.

          In particular, in this code from txn_ckpt.c:

              /*
               * Bump the global checkpoint generation, used to figure out whether
               * checkpoint has visited a tree.  There is no need for this to be
               * atomic: it is only written while holding the checkpoint lock.
               *
               * We do need to update it before clearing the checkpoint's entry out
               * of the transaction table, or a thread evicting in a tree could
               * ignore the checkpoint's transaction.
               */
              ++txn_global->checkpoint_gen;
              WT_STAT_FAST_CONN_SET(session,
                  txn_checkpoint_generation, txn_global->checkpoint_gen);

              /*
               * Start a snapshot transaction for the checkpoint.
               *
               * Note: we don't go through the public API calls because they have
               * side effects on cursors, which applications can hold open across
               * calls to checkpoint.
               */
              WT_ERR(__wt_txn_begin(session, txn_cfg));
          

          If we read checkpoint_gen during eviction and decide to proceed with LAS eviction, a checkpoint transaction could start in the meantime, and if there are enough updates to the page, we could still end up writing the wrong value.

          Maybe we need to maintain a count of LAS evictions in progress, and wait for that to go to zero after bumping checkpoint_gen?
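
For illustration only, the counting idea above could be sketched roughly as follows. This is not WiredTiger code: `struct ckpt_state`, `las_evict_enter`, `las_evict_leave`, `checkpoint_bump`, `las_evict_drained` and `checkpoint_wait_drain` are hypothetical names, and the sketch assumes C11 atomics with the default sequentially consistent ordering. The evicting thread registers itself before sampling the generation, and the checkpoint bumps the generation before reading the count, so either the evictor observes the new generation or the checkpoint observes the evictor and waits.

```c
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical shared state; these are not WiredTiger's real structures. */
struct ckpt_state {
    atomic_uint_fast64_t checkpoint_gen;   /* bumped once per checkpoint */
    atomic_uint_fast32_t las_evict_active; /* LAS evictions in flight */
};

static void
ckpt_state_init(struct ckpt_state *s)
{
    atomic_init(&s->checkpoint_gen, 1);
    atomic_init(&s->las_evict_active, 0);
}

/* Eviction side: register first, then sample the generation. */
static uint64_t
las_evict_enter(struct ckpt_state *s)
{
    atomic_fetch_add(&s->las_evict_active, 1);
    return ((uint64_t)atomic_load(&s->checkpoint_gen));
}

/* Eviction side: called once the LAS eviction has finished. */
static void
las_evict_leave(struct ckpt_state *s)
{
    atomic_fetch_sub(&s->las_evict_active, 1);
}

/* Checkpoint side: bump the generation and return the new value. */
static uint64_t
checkpoint_bump(struct ckpt_state *s)
{
    return ((uint64_t)atomic_fetch_add(&s->checkpoint_gen, 1) + 1);
}

/* True when no LAS eviction is currently registered. */
static bool
las_evict_drained(struct ckpt_state *s)
{
    return (atomic_load(&s->las_evict_active) == 0);
}

/*
 * Checkpoint side: after bumping the generation, wait for registered
 * evictions to finish before starting the snapshot transaction.
 */
static void
checkpoint_wait_drain(struct ckpt_state *s)
{
    while (!las_evict_drained(s))
        ; /* busy-wait; a real implementation would yield or sleep */
}
```

In this sketch the checkpoint would call `checkpoint_bump`, then `checkpoint_wait_drain`, and only then begin its transaction. The cost is an extra shared counter on the eviction path and a wait in the checkpoint path.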

          keith.bostic Keith Bostic added a comment -

          Michael Cahill, yes, I agree.

          I hate to maintain a count of LAS evictions in progress, and wait for that to drain.

          What do you think of checking the generation values before/after the reconciliation?

          I've pushed a branch (https://github.com/wiredtiger/wiredtiger/pull/2168).

          xgen-internal-githook Githook User added a comment -

          Author: Keith Bostic (keithbostic) <keith@wiredtiger.com>

          Message: WT-2074: there's still a race in 46331aa, if we read the global checkpoint
          generation before starting eviction, decide to proceed with LAS eviction,
          then a checkpoint transaction starts in the meantime, if there are enough
          updates to the page, we could still end up writing the wrong value. Copy
          the btree and system checkpoint generations before reconciling the page and
          fail reconciliation if a checkpoint collides.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/28b6294cc7beea24bcd6b288a4b8c80fd33821dd
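
The check described in the commit message above (copy both checkpoint generations before reconciling, fail if a checkpoint collided) might be sketched roughly as below. This is a simplified, single-threaded illustration and not the code from the commit: `struct ckpt_gens`, `las_reconcile_begin` and `las_reconcile_end` are hypothetical names, and returning `EBUSY` on a collision is an assumption about how the failure could be reported.

```c
#include <errno.h>
#include <stdint.h>

/* Hypothetical pair of generations; field names are illustrative only. */
struct ckpt_gens {
    uint64_t btree_gen;  /* per-tree checkpoint generation */
    uint64_t system_gen; /* global (system) checkpoint generation */
};

/* Copy both generations before reconciling the page. */
static struct ckpt_gens
las_reconcile_begin(const struct ckpt_gens *live)
{
    struct ckpt_gens before;

    before.btree_gen = live->btree_gen;
    before.system_gen = live->system_gen;
    return (before);
}

/*
 * After reconciliation, compare against the live values. If either
 * generation moved, a checkpoint collided with the reconciliation, so
 * report failure (EBUSY here, by assumption) instead of keeping the result.
 */
static int
las_reconcile_end(const struct ckpt_gens *before, const struct ckpt_gens *live)
{
    if (before->btree_gen != live->btree_gen ||
        before->system_gen != live->system_gen)
        return (EBUSY);
    return (0);
}
```

A caller would invoke `las_reconcile_begin`, reconcile the page, then call `las_reconcile_end` and discard the reconciliation on a non-zero return. Compared with counting evictions in progress, this keeps the checkpoint from waiting on eviction, at the price of occasionally wasted reconciliation work.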

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

          Message: Merge pull request #2168 from wiredtiger/wt-2074-checkpoint-test

          WT-2074: fix a race between lookaside table reconciliation and checkpoints.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/87592eccb8bcef7391192bc6ec97a7391ea2090e

