Details

    • Type: Task
    • Status: Closed
    • Resolution: Fixed
    • Affects Version/s: None
    • Fix Version/s: WT1.6.6
    • Labels:

      Description

      In stress testing, a test/format LSM job appeared to be stuck. On investigation, all application threads were sleeping due to throttling.

      (gdb) print *lsm_tree
      ...
        throttle_sleep = 14038831697358, chunk_fill_ms = 3074457346313,
      ...
        chunk_alloc = 160, nchunks = 7, last = 30, old_chunks = 0x7f297d0c2ce0,
        old_alloc = 80, nold_chunks = 0, flags = 28}
      (gdb) print *lsm_tree->chunk[0]
      $4 = {id = 10, generation = 1, uri = 0x7f297d051040 "file:wt-000010.lsm",
        bloom_uri = 0x0, count = 99248, create_ts = {tv_sec = 1381426119,
          tv_nsec = 928171662}, refcnt = 0, txnid_max = 0, flags = 20}
      (gdb) print *lsm_tree->chunk[1]
      $5 = {id = 20, generation = 1, uri = 0x7f291cc09180 "file:wt-000020.lsm",
        bloom_uri = 0x7f297d132ca0 "file:wt-000020.bf", count = 115791, create_ts = {
          tv_sec = 1381426132, tv_nsec = 460307811}, refcnt = 0, txnid_max = 0,
        flags = 21}
      (gdb) print *lsm_tree->chunk[2]
      $6 = {id = 29, generation = 1, uri = 0x7f298c4251c0 "file:wt-000029.lsm",
        bloom_uri = 0x7f297d0b6ac0 "file:wt-000029.bf", count = 79525, create_ts = {
          tv_sec = 1381426141, tv_nsec = 677431242}, refcnt = 0, txnid_max = 0,
        flags = 21}
      (gdb) print *lsm_tree->chunk[3]
      $7 = {id = 26, generation = 0, uri = 0x7f296f416a40 "file:wt-000026.lsm",
        bloom_uri = 0x0, count = 25180, create_ts = {tv_sec = 1381426137,
          tv_nsec = 278737985}, refcnt = 0, txnid_max = 307242, flags = 0}
      (gdb) print *lsm_tree->chunk[4]
      $8 = {id = 27, generation = 0, uri = 0x7f291e1434a0 "file:wt-000027.lsm",
        bloom_uri = 0x0, count = 25368, create_ts = {tv_sec = 1381426138,
          tv_nsec = 465214936}, refcnt = 0, txnid_max = 314017, flags = 0}
      (gdb) print *lsm_tree->chunk[5]
      $9 = {id = 28, generation = 0, uri = 0x7f291e1433c0 "file:wt-000028.lsm",
        bloom_uri = 0x0, count = 15186, create_ts = {tv_sec = 1381426141,
          tv_nsec = 670444784}, refcnt = 0, txnid_max = 325815, flags = 0}
      (gdb) print *lsm_tree->chunk[6]
      $10 = {id = 30, generation = 0, uri = 0x7f29343e70a0 "file:wt-000030.lsm",
        bloom_uri = 0x0, count = 1800, create_ts = {tv_sec = 1381426144,
          tv_nsec = 19573522}, refcnt = 0, txnid_max = 327238, flags = 0}
      

          Activity

          sueloverso Sue Loverso added a comment -

          I hit it again! Process 21318 on AWS HD if anyone wants to look at it. Stack is the same.

          michael.cahill Michael Cahill added a comment -

          For the record, the problem here is a tree where the newest 3+ chunks are in-memory, then we have a merge chunk. For the time calculation to work (of how far behind checkpoints are), we need to find the last on-disk, generation zero chunk.
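The search described in this comment can be sketched roughly as below. This is a minimal illustration only, not WiredTiger's code: `struct chunk_info`, its `on_disk` field, and `find_last_ondisk_gen0` are hypothetical names, and treating generation 0 as "not the product of a merge" is an assumption drawn from the dump in the description.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, simplified view of an LSM chunk (not WiredTiger's real struct). */
struct chunk_info {
    unsigned generation; /* assumed: 0 = not produced by a merge, >0 = merge result */
    bool on_disk;        /* assumed flag: the chunk has been written to disk */
};

/*
 * Scan from the newest chunk to the oldest and return the index of the
 * newest chunk that is both on disk and generation zero, or -1 if no
 * such chunk exists.
 */
int
find_last_ondisk_gen0(const struct chunk_info *chunks, size_t nchunks)
{
    assert(chunks != NULL || nchunks == 0);

    for (size_t i = nchunks; i > 0; i--) {
        const struct chunk_info *c = &chunks[i - 1];
        if (c->on_disk && c->generation == 0)
            return (int)(i - 1);
    }
    return -1; /* only in-memory or merged chunks were found */
}
```

For a tree shaped like the one in the description (three generation-1 chunks followed by four generation-0 chunks, assuming the latter are still in memory), this search finds nothing and returns -1. A caller would need to handle that case explicitly instead of basing the time calculation on a merge chunk. How the actual fix handles it is not shown in this ticket.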

          agorrod Alex Gorrod added a comment -

          I've verified this is fixed by running the configuration 500 times without failure.

          ramon.fernandez Ramon Fernandez added a comment -

          Additional ticket information from GitHub

          This ticket was referenced in the following commits:
          • 0f06074e6af6f36be20b3fb5ad485ab5075bfb3f
          • 5fbde1a68afc620254f95348f9f6282e7426436f
          xgen-internal-githook Githook User added a comment -

          Author: Alex Gorrod (agorrod) <alexg@wiredtiger.com>

          Message: Fix a deadlock related to LSM. There are cases where closing a file with
          an existing checkpoint could self deadlock.

          Check in the meta tracking whether we've already visited a checkpoint handle.

          Refs WT-716
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/3e254079484ce35a3cb70c48478c69defdb8f012

