Core Server / SERVER-19954

Extended stall during checkpoint with large number of dirty tables under WiredTiger

    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.0.5
    • Fix Version/s: 3.2.0-rc3
    • Component/s: WiredTiger
    • Labels:
      None
    • Backwards Compatibility:
      Fully Compatible
    • Operating System:
      ALL

      Description

      A test that creates 50k dirty tables causes a stall of several minutes during the checkpoint.

      Shell script that calls the functions in test.js:

      # create a bunch of dirty tables and keep them dirty
      # first iteration through creates the tables
      # subsequent iterations keep them dirty
      # issue does not repro with a smaller number of iterations 
      # maybe tables need to become dirty after checkpoint starts?
      for i in $(seq 20); do
          (
              # insert on 20 threads in parallel into a total of 50k collections
              for t in $(seq 20); do
                  mongo --eval "load('test.js'); insert($t, 2500)" &
              done
              wait
          )
      done
       
      # now try to insert some data to make stall evident
      (
          for t in $(seq 20); do
              mongo --eval "load('test.js'); load()" &
          done
          wait
      )
      

      test.js:

      function insert(t, n) {
          // insert one document into each of n collections for this thread;
          // the first pass creates the collections, later passes re-dirty them
          for (var i = 0; i < n; i++) {
              var c = db['c' + t + '.' + i];
              c.insert({});
              if (i % 100 == 0)
                  print(t, i);
          }
      }
       
      function load() {
          // bulk-insert 1M documents into a single collection, 10k per batch;
          // a checkpoint stall shows up as a long pause between printed counts
          var count = 1000000;
          var every = 10000;
          for (var i = 0; i < count; ) {
              var bulk = db.c.initializeUnorderedBulkOp();
              for (var j = 0; j < every; j++, i++)
                  bulk.insert({});
              bulk.execute();
              print(i);
          }
      }
      

      Points A through D below refer to the attached throughput graph (image not reproduced here).

      • From A to B is the first iteration through the loop; it is slower than subsequent iterations because it is creating the 50k collections.
      • From B to C is the remaining 19 iterations of inserting one document into each of the 50k collections.
      • At C we begin trying to insert a bunch of data, but the inserts stall for several minutes, until D.
      • At D the checkpoint ends and inserts become unblocked.
      • Based on "cursor prev" calls, I think the single-document inserts into the 50k collections are also stalling during checkpoints. I suspect those stalls are shorter because, while the tables are still being created, they can only be dirtied as fast as they are created, so each checkpoint sees fewer dirty tables; from B to C the tables already exist and are dirtied much faster, so the checkpoint where the lengthy stall occurs sees many more dirty tables.

      This is the active thread during the stall:

      Thread 31 (Thread 0x7f5949f32700 (LWP 5161)):
      #0  __strcmp_ssse3 () at ../sysdeps/x86_64/multiarch/../strcmp.S:210
      #1  0x000000000134759a in __wt_meta_track_find_handle (session=session@entry=0x279d8c0, name=0xa8b9c30 "file:collection-6000-3939942762316306140.wt",
          checkpoint=checkpoint@entry=0x5e630aa0 "WiredTigerCheckpoint.1") at src/third_party/wiredtiger/src/meta/meta_track.c:245
      #2  0x000000000137994d in __wt_session_lock_checkpoint (session=session@entry=0x279d8c0, checkpoint=0x5e630aa0 "WiredTigerCheckpoint.1") at src/third_party/wiredtiger/src/session/session_dhandle.c:454
      #3  0x0000000001382885 in __checkpoint_worker (is_checkpoint=1, cfg=<optimized out>, session=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:924
      #4  __wt_checkpoint (session=<optimized out>, cfg=<optimized out>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:1127
      #5  0x0000000001381bf7 in __checkpoint_apply (session=0x279d8c0, cfg=0x7f5949f31a30, op=0x13823c0 <__wt_checkpoint>) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:184
      #6  0x0000000001383785 in __wt_txn_checkpoint (session=session@entry=0x279d8c0, cfg=cfg@entry=0x7f5949f31a30) at src/third_party/wiredtiger/src/txn/txn_ckpt.c:501
      #7  0x0000000001376966 in __session_checkpoint (wt_session=0x279d8c0, config=<optimized out>) at src/third_party/wiredtiger/src/session/session_api.c:919
      #8  0x000000000130e38a in __ckpt_server (arg=0x279d8c0) at src/third_party/wiredtiger/src/conn/conn_ckpt.c:95
      #9  0x00007f594e1fb182 in start_thread (arg=0x7f5949f32700) at pthread_create.c:312
      #10 0x00007f594d2fc47d in clone () at ../sysdeps/unix/sysv/linux/x86_64/clone.S:111
      

          Activity

          michael.cahill Michael Cahill added a comment -

          For the record, there is a change in WT-147 that should eliminate the need for the linear scan through the list of dirty handles during a checkpoint that is indicated as the main issue here. Once WT-147 is merged, we will retest to make sure this workload performs well and backport the necessary part of the change to the 3.0 branch.

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill <michael.cahill@mongodb.com> (michaelcahill)

          Message: SERVER-19954 Don't scan tracked handles during checkpoints.

          The change in WT-147 to allow a session to lock the same handle exclusive multiple times makes this unnecessary.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/215156b8545a42b8ae878019f1b830debe6f5fa0

          xgen-internal-githook Githook User added a comment -

          Author: Alex Gorrod <alexander.gorrod@mongodb.com> (agorrod)

          Message: Merge pull request #2247 from wiredtiger/SERVER-19954

          SERVER-19954 Don't scan tracked handles during checkpoints.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/46b4ad57ecbb935af2c1ae6fd546497dc620e64e

          xgen-internal-githook Githook User added a comment -

          Author: Michael Cahill <michael.cahill@mongodb.com> (michaelcahill)

          Message: WT-2207 Track whenever a session has a handle exclusive.

          Previously, this was only tracked for newly-opened handles, but the change for
          SERVER-19954 relies on always detecting when a session has a handle exclusive,
          regardless of whether it was the first to open the handle.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/54557dd24139fa0f903d83735855ffea35377867

          michael.cahill Michael Cahill added a comment -

          Bruce Lucas, Martin Bligh, do you have any suggestions on how to re-test this?

          Once there are more than 10K tables, I now see all the time being spent opening and closing sessions and cursors, because the caching layer bounds how many we keep open since 3.0.6. I can code around that, but then I'm not testing how the system actually behaves.

          The original issue Bruce Lucas identified here (excessive time in __wt_meta_track_find_handle) has been fixed, maybe we can open a separate 3.3 ticket for "improve performance with WT with more than 10K active tables"?

          pasette Dan Pasette added a comment -

          Spun off a new ticket for follow up work: SERVER-21629

