Details

    • Type: Bug
    • Status: Resolved
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: WT2.6.1
    • Fix Version/s: WT2.7.0
    • Labels:
      None
    • # Replies:
      7
    • Last comment by Customer:
      true

      Description

      In http://build.wiredtiger.com:8080/job/wiredtiger-test-unit/4411/console we had this:

      ERROR: test_txn14.test_txn14.test_log_flush(bg) (subunit.RemotedTestCase)
      test_txn14.test_txn14.test_log_flush(bg)
      

      Digging into the core dump:

      #4  0x00007f404bfe6cb7 in __log_file_server (arg=0x2c6f170) at ../src/conn/conn_log.c:344
      344		WT_ASSERT(session, __wt_log_cmp(
      345			&close_end_lsn, &log->sync_lsn) >= 0);
      (gdb) print close_end_lsn
      $1 = {file = 14, offset = 0}
      (gdb) print log->sync_lsn
      $2 = {file = 14, offset = 83584}
      

        Issue Links

          Activity

          Hide
          michael.cahill Michael Cahill added a comment - - edited

          Sue LoVerso, a couple more things I found when digging into this...

          I am suspicious of this code in conn_log.c:

          353                 /*
          354                  * If a later thread asked for a background sync, do it now.
          355                  */
          356                 if (__wt_log_cmp(&log->bg_sync_lsn, &log->sync_lsn) > 0) {
          357                         /*
          358                          * Save the latest write LSN which is the minimum
          359                          * we will have written to disk.
          360                          */
          361                         min_lsn = log->write_lsn;
          362                         /*
          363                          * We have to wait until the LSN we asked for is
          364                          * written.  If it isn't signal the wrlsn thread
          365                          * to get it written.
          366                          */
          367                         if (__wt_log_cmp(&log->bg_sync_lsn, &min_lsn) <= 0) {
          368                                 WT_ERR(__wt_fsync(session, log->log_fh));
          369                                 __wt_spin_lock(session, &log->log_sync_lock);
          370                                 locked = true;
          371                                 /*
          372                                  * The sync LSN could have advanced while we
          373                                  * were writing to disk.
          374                                  */
          375                                 if (__wt_log_cmp(
          376                                     &log->sync_lsn, &min_lsn) <= 0) {
          377                                         log->sync_lsn = min_lsn;
          

          In particular, I think this could allow sync_lsn to skip into a new file because it doesn't check log->bg_sync_lsn.file == log->sync_lsn.file as we do in other places that are waiting for the sync_lsn. Of course it's a little different here because this is also the thread that will move sync_lsn into the new file.

          One other thing, I noticed that __log_file_server sleeps for 1 second between iterations. Our condition waits are imprecise, so we can lose signals under some (rare) conditions. I'd feel more comfortable if the file server woke up every 100ms or more frequently to reduce the chance of stalling synchronous commits after switching log files. Does that make sense?

          Show
          michael.cahill Michael Cahill added a comment - - edited Sue LoVerso , a couple more things I found when digging into this... I am suspicious of this code in conn_log.c : 353 /* 354 * If a later thread asked for a background sync, do it now. 355 */ 356 if (__wt_log_cmp(&log->bg_sync_lsn, &log->sync_lsn) > 0) { 357 /* 358 * Save the latest write LSN which is the minimum 359 * we will have written to disk. 360 */ 361 min_lsn = log->write_lsn; 362 /* 363 * We have to wait until the LSN we asked for is 364 * written. If it isn't signal the wrlsn thread 365 * to get it written. 366 */ 367 if (__wt_log_cmp(&log->bg_sync_lsn, &min_lsn) <= 0) { 368 WT_ERR(__wt_fsync(session, log->log_fh)); 369 __wt_spin_lock(session, &log->log_sync_lock); 370 locked = true; 371 /* 372 * The sync LSN could have advanced while we 373 * were writing to disk. 374 */ 375 if (__wt_log_cmp( 376 &log->sync_lsn, &min_lsn) <= 0) { 377 log->sync_lsn = min_lsn; In particular, I think this could allow sync_lsn to skip into a new file because it doesn't check log->bg_sync_lsn.file == log->sync_lsn.file as we do in other places that are waiting for the sync_lsn . Of course it's a little different here because this is also the thread that will move sync_lsn into the new file. One other thing, I noticed that __log_file_server sleeps for 1 second between iterations. Our condition waits are imprecise, so we can lose signals under some (rare) conditions. I'd feel more comfortable if the file server woke up every 100ms or more frequently to reduce the chance of stalling synchronous commits after switching log files. Does that make sense?
          Hide
          sue.loverso Sue LoVerso added a comment -

          Michael Cahill do you have the tarball/core dump stashed somewhere? I went to tinderbox and it is not one of the saved tarballs and the Jenkins workspace directory for that test is a more recent, successful run. There are things I'd like to look at in the core. Or if you could paste *log from some appropriate frame (without all the slots) for all the rest of the LSNs and other log settings.

          Show
          sue.loverso Sue LoVerso added a comment - Michael Cahill do you have the tarball/core dump stashed somewhere? I went to tinderbox and it is not one of the saved tarballs and the Jenkins workspace directory for that test is a more recent, successful run. There are things I'd like to look at in the core. Or if you could paste *log from some appropriate frame (without all the slots) for all the rest of the LSNs and other log settings.
          Hide
          michael.cahill Michael Cahill added a comment -

          Sue LoVerso, I'm sorry, I don't have the tarball stashed. I'll try to reproduce this and let you know if I can.

          Show
          michael.cahill Michael Cahill added a comment - Sue LoVerso , I'm sorry, I don't have the tarball stashed. I'll try to reproduce this and let you know if I can.
          Hide
          sue.loverso Sue LoVerso added a comment -

          I am able to reproduce this on the HDD instance (but it has run > 1000 iterations on the SSD).

          Show
          sue.loverso Sue LoVerso added a comment - I am able to reproduce this on the HDD instance (but it has run > 1000 iterations on the SSD).
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'}

          Message: WT-2137 Check the sync_lsn is in the correct file before moving it
          forward.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/b5a916d3fdd550700258082804272dc04dff935e

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'} Message: WT-2137 Check the sync_lsn is in the correct file before moving it forward. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/b5a916d3fdd550700258082804272dc04dff935e
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'}

          Message: WT-2137 Reword comment for s_string message.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/b4857ae2f71fe019ef7a45b8ac7967cc8ec9fb17

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'sueloverso', u'name': u'Susan LoVerso', u'email': u'sue@wiredtiger.com'} Message: WT-2137 Reword comment for s_string message. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/b4857ae2f71fe019ef7a45b8ac7967cc8ec9fb17
          Hide
          xgen-internal-githook Githook User added a comment -

          Author:

          {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'}

          Message: Merge pull request #2226 from wiredtiger/wt-2137

          WT-2137 Check the sync_lsn is in the correct file before moving it forward.
          Branch: develop
          https://github.com/wiredtiger/wiredtiger/commit/aab81018d799a4bde3e297e03eaa9679673ae89a

          Show
          xgen-internal-githook Githook User added a comment - Author: {u'username': u'michaelcahill', u'name': u'Michael Cahill', u'email': u'michael.cahill@mongodb.com'} Message: Merge pull request #2226 from wiredtiger/wt-2137 WT-2137 Check the sync_lsn is in the correct file before moving it forward. Branch: develop https://github.com/wiredtiger/wiredtiger/commit/aab81018d799a4bde3e297e03eaa9679673ae89a

            People

            • Votes:
              0 Vote for this issue
              Watchers:
              2 Start watching this issue

              Dates

              • Created:
                Updated:
                Resolved:
                Days since reply:
                1 year, 38 weeks, 2 days ago
                Date of 1st Reply: