WiredTiger / WT-11602

Hide expected eviction failures from the application and don't rollback in case of errors

    • TheMoon-StorEng - 2023-09-19
    • v7.0, v6.0, v5.0, v4.4, v4.2

      Here is the relevant portion of __wt_cache_eviction_worker, the function an application thread calls when it is forced to perform eviction:

      for (initial_progress = cache->eviction_progress;; ret = 0) {
          /* Evict a page. */
          switch (ret = __evict_page(session, false)) {
          case 0:
              if (busy)
                  goto err;
          /* FALLTHROUGH */
          case EBUSY:
              break;
          /* ... */
          }
          /* Stop if we've exceeded the time out. */
          if (time_start != 0 && cache_max_wait_us != 0) {
              time_stop = __wt_clock(session);
              if (session->cache_wait_us + WT_CLOCKDIFF_US(time_stop, time_start) > cache_max_wait_us)
                  goto err;
          }
      }

      err:
          if (time_start != 0) {
              time_stop = __wt_clock(session);
              elapsed = WT_CLOCKDIFF_US(time_stop, time_start);
              WT_STAT_CONN_INCRV(session, application_cache_time, elapsed);
              WT_STAT_SESSION_INCRV(session, cache_time, elapsed);
              session->cache_wait_us += elapsed;
              if (cache_max_wait_us != 0 && session->cache_wait_us > cache_max_wait_us) {
                  WT_TRET(__wt_txn_rollback_required(session, WT_TXN_ROLLBACK_REASON_CACHE_OVERFLOW));
                  WT_STAT_CONN_INCR(session, cache_timed_out_ops);
                  __wt_verbose_notice(session, WT_VERB_TRANSACTION, "%s", session->txn->rollback_reason);
              }
          }

      The for loop resets the return value (ret = 0) on every iteration, so in general an EBUSY from an eviction attempt is not leaked to the caller. This holds even when every attempt to evict a page returns EBUSY before the function returns. This is intended: callers of __wt_cache_eviction_worker might not check for an EBUSY return, so a caller such as page_in could pass EBUSY on to the application. We do not want an application call to fail just because eviction did not succeed.
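The sketch below is a simplified, hypothetical model of that loop, not WiredTiger code. The function name model_eviction_loop and the attempts[] array are assumptions: attempts[] scripts what each eviction attempt returns (standing in for __evict_page), and running out of scripted attempts stands in for "eviction is no longer needed". It shows that because the for loop's increment expression resets ret, a run of EBUSY results still ends with ret == 0, while any other error is returned.

```c
#include <errno.h>
#include <stddef.h>

/*
 * Hypothetical model of the eviction loop in __wt_cache_eviction_worker.
 * attempts[] stands in for successive __evict_page() results; once all
 * n_attempts have been consumed, we treat eviction as no longer needed.
 */
static int
model_eviction_loop(const int *attempts, size_t n_attempts)
{
    int ret = 0;
    size_t i = 0;

    /* As in the real loop, the increment expression resets ret on every pass. */
    for (;; ret = 0) {
        if (i == n_attempts)
            break; /* Eviction no longer needed: ret was just reset to 0. */

        switch (ret = attempts[i++]) {
        case 0:
        case EBUSY:
            break; /* Keep trying; an EBUSY is discarded by "ret = 0". */
        default:
            goto err; /* Any other error is returned to the caller. */
        }
    }
err:
    return (ret);
}
```

For example, a script of three EBUSY results returns 0, whereas a script whose second result is EIO returns EIO.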

      However, at the end of the loop there is a check that keeps the application thread from waiting too long: once the timeout period (cache_max_wait_us) is exceeded, we stop trying to evict and jump to err. If the last eviction attempt failed with EBUSY, that error is still in ret and is not suppressed. This is a bug: the application call fails instead of going through with the operation.
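To make the timeout path concrete, here is a hypothetical model (not WiredTiger code) that leaves aside the rollback handling after err:. Time is modeled as a count of attempts: once max_attempts attempts have been made, the "timeout" fires. The names model_eviction_with_timeout and clear_on_timeout are assumptions. With clear_on_timeout false it mirrors the reported behavior; with it true it applies one possible way to hide the expected failure (discard the EBUSY before jumping to err), which is a sketch and not necessarily the committed fix.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/*
 * Hypothetical model of the timeout check at the end of the eviction loop.
 * attempts[] stands in for __evict_page() results; max_attempts stands in
 * for cache_max_wait_us.
 */
static int
model_eviction_with_timeout(
  const int *attempts, size_t n_attempts, size_t max_attempts, bool clear_on_timeout)
{
    int ret = 0;
    size_t i = 0;

    for (;; ret = 0) {
        if (i == n_attempts)
            break; /* Eviction no longer needed: ret is 0. */

        switch (ret = attempts[i++]) {
        case 0:
        case EBUSY:
            break;
        default:
            goto err;
        }

        /* Stop if we've exceeded the (modeled) time out. */
        if (i >= max_attempts) {
            /* Without this reset, the last EBUSY escapes through err:. */
            if (clear_on_timeout && ret == EBUSY)
                ret = 0;
            goto err;
        }
    }
err:
    return (ret);
}
```

With four EBUSY results and a timeout after two attempts, the reported behavior returns EBUSY while the modified version returns 0; with a generous timeout, both return 0.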

      On the other hand, if an application thread cannot get space in the cache and has to fail, the failure should be reported as a WT_ROLLBACK error. Look at the code after the err: label, specifically the following line:

                  WT_TRET(__wt_txn_rollback_required(session, WT_TXN_ROLLBACK_REASON_CACHE_OVERFLOW));

      If the last eviction attempt before the timeout failed with EBUSY and we then decide that the transaction must be rolled back, we return EBUSY instead of WT_ROLLBACK. This happens because WT_TRET does not replace an EBUSY that is already stored in ret, so the WT_ROLLBACK from __wt_txn_rollback_required is dropped. The application then receives EBUSY with a rollback reason set, which is an inconsistent API return.
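The masking can be reproduced with a small hypothetical model, not WiredTiger code. MODEL_TRET is a simplified stand-in for WT_TRET that only records a new error when ret is 0 (the real macro has a few extra special cases that are not modeled), MODEL_ROLLBACK is a stand-in for WT_ROLLBACK with an illustrative value, and model_rollback_required mimics __wt_txn_rollback_required by setting a reason and returning the rollback error. The "fixed" variant is only a sketch in the spirit of the issue title: hide the expected EBUSY, and request a rollback only if no other error is being returned.

```c
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Stand-in for WT_ROLLBACK; the value is illustrative only. */
#define MODEL_ROLLBACK (-31800)

/* Simplified WT_TRET: record the new error only if no error is already set. */
#define MODEL_TRET(a)                         \
    do {                                      \
        int model_tret_ = (a);                \
        if (model_tret_ != 0 && ret == 0)     \
            ret = model_tret_;                \
    } while (0)

/* Hypothetical session state: just the rollback reason. */
struct model_session {
    const char *rollback_reason;
};

/* Mimics __wt_txn_rollback_required: set a reason, return a rollback error. */
static int
model_rollback_required(struct model_session *s, const char *reason)
{
    s->rollback_reason = reason;
    return (MODEL_ROLLBACK);
}

/* The err: path as reported: an existing EBUSY masks the rollback error. */
static int
model_err_path_reported(struct model_session *s, int ret, bool timed_out)
{
    if (timed_out)
        MODEL_TRET(model_rollback_required(s, "cache overflow"));
    return (ret);
}

/* Hypothetical fix: hide EBUSY, and don't roll back if a real error is set. */
static int
model_err_path_fixed(struct model_session *s, int ret, bool timed_out)
{
    if (ret == EBUSY)
        ret = 0;
    if (timed_out && ret == 0)
        ret = model_rollback_required(s, "cache overflow");
    return (ret);
}
```

In the reported variant, a timed-out call that entered err: with EBUSY returns EBUSY yet leaves rollback_reason set; in the sketched fix it returns MODEL_ROLLBACK, and a genuine error such as EIO is returned without setting a rollback reason.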

            sulabh.mahajan@mongodb.com Sulabh Mahajan