Details

    • # Replies:
      20
    • Last comment by Customer:
      true

      Description

      The wiredtiger-perf-ckpt-lsm job is failing.

      It follows the patches for:
      SERVER-18829 Have pages start in the middle of the LRU queue for eviction. (commit: 32144696f78cf726b8b1df8becca0a86d870efa3)
      Only update the oldest read generation once we have some pages in the queue. (commit: 21c88f54c24045fef963dbb413bae1ca97f312a9)

      The issue in question is a panic due to:

      max latency exceeded: threshold 2000000 read max 16361 insert max 853 update max 16734347 Error: WT_PANIC: WiredTiger library panic
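As a reading aid, the panic line can be parsed to see which operation crossed the limit. This is a minimal sketch, not part of wtperf: the helper name `exceeded_ops` is hypothetical, and it assumes all values are in microseconds, so that 2000000 matches the 2 second latency limit mentioned in the comments.

```python
import re

# Message text copied from the panic line above.
# Assumption: all values are microseconds, so 2000000 is the 2-second limit.
MESSAGE = ("max latency exceeded: threshold 2000000 read max 16361 "
           "insert max 853 update max 16734347")

def exceeded_ops(message):
    """Return {operation: max_latency} for operations over the threshold."""
    threshold = int(re.search(r"threshold (\d+)", message).group(1))
    maxima = {
        op: int(value)
        for op, value in re.findall(r"(read|insert|update) max (\d+)", message)
    }
    return {op: value for op, value in maxima.items() if value > threshold}

offenders = exceeded_ops(MESSAGE)
print(offenders)
```

Under that assumption only the update path is over the limit; the read and insert maxima are well below it.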
      

          Activity

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

          Message: WT-1962 Fix error handling of hotbackup unlock.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/bfea5498e41c43453f3a13203d1b3a78e95d41f2

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill (michaelcahill) <michael.cahill@mongodb.com>

          Message: Merge pull request #2054 from wiredtiger/backup-rwlock

          WT-1962 Make the hot_backup_lock a read/write lock.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/3866fa663b7244fec7fdb3930885ee5314736c45

          sue.loverso Sue LoVerso added a comment -

          After adding a bunch of timing calls and other printfs to the code, here's a distilled sequence of events that causes the update-checkpoint-lsm.wtperf configuration to readily fail with a > 2 second latency.

          NOTE: This one is specific to LSM. I'm also able to cause a failure with update-checkpoint-btree.wtperf for long latency. I haven't gone through that output yet.

          The left column represents time in seconds.

          53.7  CKPT Full checkpoint started  Starts to get locks and call ckpt_list
          53.8  LSM-worker checkpoints chunk 4.  Final fsync takes 321 ms with SCHEMA
          53.9  CKPT Checkpoint wt_checkpoint_list completes.  Write leaf pages.
          54.4  LSM acquires SCHEMA to switch to chunk 7.
          54.7  Different LSM worker flushing chunk 5 waits for SCHEMA.
          55.2  LSM switch, fsync of chunk 7 takes 637 ms
          55.9  CKPT Calls checkpoint_sync w/o SCHEMA.
          56.3  CKPT flushes chunk 5.  Fsync takes 328 ms (initial flush w/o SCHEMA)
          56.3  CKPT waits to acquire SCHEMA.
          56.49 LSM switch releases SCHEMA, held 2036 ms.
          56.49 CKPT acquires SCHEMA
          56.49 USER thread enters LSM cursor update.  waits to get SCHEMA.
          56.49 CKPT calls __wt_checkpoint
          56.8  CKPT calls __wt_checkpoint_sync
          58.1  CKPT Fsync of chunk 6 takes 1315 ms
          58.3  CKPT Fsync of chunk 5 takes 193 ms
          58.3  CKPT releases SCHEMA
          58.3  USER LSM cursor update took 1860 ms to get SCHEMA
          58.3  CKPT Completes in 4634 ms
          58.3  LSM Took 3595 ms to acquire SCHEMA
          58.3  LSM worker completes checkpoint of chunk 5.
          58.3  USER thread detects > 2 second latency
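The pattern in the timeline above, where one thread holds a lock across slow I/O while another thread's update waits on the same lock, can be sketched in a few lines. This is an illustrative simulation only, not WiredTiger code: `schema_lock`, `checkpoint`, and `measure_update_wait` are hypothetical names, and `time.sleep` stands in for the fsync calls.

```python
import threading
import time

schema_lock = threading.Lock()        # hypothetical stand-in for the SCHEMA lock
holder_has_lock = threading.Event()   # signals that the holder owns the lock

def checkpoint(hold_seconds):
    """Hold the lock across a slow operation, as the timeline shows for fsync."""
    with schema_lock:
        holder_has_lock.set()
        time.sleep(hold_seconds)      # stands in for the slow fsync calls

def measure_update_wait(hold_seconds):
    """Return how long a simulated update waits for the lock while it is held."""
    holder_has_lock.clear()
    holder = threading.Thread(target=checkpoint, args=(hold_seconds,))
    holder.start()
    holder_has_lock.wait()            # make sure the checkpoint owns the lock first
    start = time.monotonic()
    with schema_lock:                 # the user update blocks here
        waited = time.monotonic() - start
    holder.join()
    return waited

waited = measure_update_wait(0.2)
print(f"update waited {waited * 1000:.0f} ms for the lock")
```

The wait seen by the update is roughly the time the holder spends in I/O under the lock, which is the same shape as the 1860 ms the USER thread spent waiting while CKPT ran its fsync calls with SCHEMA held.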
          

          david.hows David Hows added a comment -

          Hey Sue LoVerso,

          What's the status of this one? I've just seen a latency failure on update-checkpoint-btree.wtperf, which is the btree checkpoint task. http://build.wiredtiger.com:8080/job/wiredtiger-perf-checkpoint/22/console

          Could this issue still be hanging about?

          sue.loverso Sue LoVerso added a comment -

          Yes, both of these issues still exist. Unfortunately, I have not been able to reproduce the btree failure since adding timing debugging to understand where the time is being spent. With the instrumentation I have been able to get other btree tests to fail (all 500m-btree insert/update tests), so hopefully WT-2124 and this one (for btree) have a common cause.

