WiredTiger / WT-11948

Document the intricacies of the skip list ordering bug and its solution

    • Type: Documentation
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      In this ticket, we document the intricacies of the fix to the skip list code for the memory ordering bug, WT-10461. We consulted industry experts to confirm our understanding and ensure the correctness of the fix. It is worth capturing some of the details here.

      Relevant code sample and explanation

      WiredTiger uses a lock-free skip list data structure to store newly inserted pairs in memory. The following are simplified code snippets to illustrate the bug and the solution:

      For reference:
      WT_INSERT - A node representing the K/V pair inserted into the skip list
      WT_INSERT_HEAD - The structure containing the head and tail pointers for the various skip list levels.
      __wt_search_insert() - The function that searches the skip list for the given key
      Note: The skip list is up to 10 levels deep.

      /* Skip list head for the WT_INSERTs. */
      struct __wt_insert_head {
          WT_INSERT *head[WT_SKIP_MAXDEPTH]; /* first item on skiplists */
          WT_INSERT *tail[WT_SKIP_MAXDEPTH]; /* last item on skiplists */
      };
      
      /* A WT_INSERT. */
      struct __wt_insert {
          struct {
              uint32_t offset; /* row-store key data start */
              uint32_t size;   /* row-store key data size */
          } key;
          WT_INSERT *next[0]; /* forward-linked skip list */
      };
      /* Note: the next array can span a cache line - a relevant detail in reproducing the bug. */
      
      #define WT_SKIP_MAXDEPTH 10
      
      #define WT_READ_BARRIER()                           \
          do {                                            \
              __asm__ volatile("dmb ishld" ::: "memory"); \
          } while (0)
      
      /*
       * Lexicographic comparison routine, skipping the first *matchp bytes.
       * Sets *cmpp to: < 0 if user_item is < tree_item, 0 if user_item is = tree_item,
       * > 0 if user_item is > tree_item.
       */
      static inline int
      __wt_compare_skip(WT_SESSION_IMPL *session, const WT_ITEM *user_item,
        const WT_ITEM *tree_item, int *cmpp, size_t *matchp);
      
      /*
       * Search through the skip list to find the WT_INSERT closest matching the srch_key.
       */
      int
      __wt_search_insert(
        WT_SESSION_IMPL *session, WT_CURSOR_BTREE *cbt, WT_INSERT_HEAD *ins_head, WT_ITEM *srch_key)
      {
          WT_INSERT *ins, **insp, *last_ins;
          WT_ITEM key;
          size_t match, skiphigh, skiplow;
          int cmp, i;
      
          cmp = 0; /* -Wuninitialized */
      
          /*
           * Start at the highest skip level, then go as far as possible at each level
           * before stepping down to the next.
           */
          match = skiphigh = skiplow = 0;
          ins = last_ins = NULL;
          for (i = WT_SKIP_MAXDEPTH - 1, insp = &ins_head->head[i]; i >= 0;) {
              /* 
               * Note that the skip list can also be modified in parallel. Get a copy of the
               * head node so that we can work with the immutable local copy.
               */
              ins = *insp;
              // WT_READ_BARRIER();  <<  The fix made
      
              /* The insert may be NULL while finding the first entry in the skip list. */
            if (ins == NULL) {
                insp--;
                i--;
                continue;
            }
      
              /*
               * Comparisons may be repeated as we drop down skiplist levels; don't repeat
               * comparisons, they might be expensive.
               */
              if (ins != last_ins) {
                  last_ins = ins;
                  key.data = WT_INSERT_KEY(ins);
                  key.size = WT_INSERT_KEY_SIZE(ins);
                  /*
                   * Optimize the number of bytes we need to compare during the search, by
                   * skipping the prefix of the search key that already matches the upper
                   * level.
                   */
                  match = WT_MIN(skiplow, skiphigh);
                  WT_RET(__wt_compare_skip(session, srch_key, &key, &cmp, &match));
              }
      
              if (cmp > 0) { /* Keep going at this level */
                  insp = &ins->next[i];
                  skiplow = match;
            } else if (cmp < 0) { /* Drop down a level */
                insp--;
                i--;
                skiphigh = match;
              } else {
                  /* Found */
                  break;
              }
          }
          return (0);
      }
      

      A brief explainer of the above code:

      • The search function takes in a key and tries to find the WT_INSERT node in the skip list that matches the key or returns the closest match. The actual code saves the found key in the cbt structure as an output parameter - that’s been stripped out here to simplify the example. The search through the skip list involves going through higher levels and then stepping down as needed. This relies on the invariant we maintain that if a node is found at any given level in the skip list, the lower levels are guaranteed to contain the node as well. There are optimisations to avoid re-comparing key prefix matches because larger prefixes of the keys match as we step down the skip list.
      • The search code works in parallel with other threads that could be inserting new nodes into the skip list. There are no locks taken. The insertion of a node into the skip list at a given level happens using an atomic compare and swap operation on the list pointers. The insertion is also done from the bottom-most level upwards. The search operation relies on this directionality because it starts the search at upper levels and steps down as needed. In this manner, both insertion and search can happen in the same skip list without taking any locks.
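      The bottom-up, CAS-based insertion described above can be illustrated with a small, self-contained sketch. This is hypothetical C11 code, not WiredTiger's actual insert path; the pred[] array (predecessor per level, NULL meaning the list head), the retry loop, and the fixed DEPTH are assumptions made for illustration. The key point it shows is the directionality: the node is linked at level 0 first, so a concurrent search descending from the top can never find it at level i without it already being present at level i-1.

```c
#include <stdatomic.h>
#include <stddef.h>

#define DEPTH 3 /* illustrative; WiredTiger uses WT_SKIP_MAXDEPTH (10) */

typedef struct node {
    int key;
    _Atomic(struct node *) next[DEPTH]; /* forward-linked skip list */
} node_t;

typedef struct list {
    _Atomic(node_t *) head[DEPTH];
} list_t;

/*
 * Link a fully initialized node into levels 0..levels-1, bottom level
 * first, via CAS. The release ordering on the successful CAS publishes
 * the node's contents before the node becomes reachable.
 */
static void
insert_bottom_up(list_t *l, node_t *n, node_t *pred[], int levels)
{
    for (int i = 0; i < levels; i++) {
        _Atomic(node_t *) *slot =
          (pred[i] == NULL) ? &l->head[i] : &pred[i]->next[i];
        node_t *succ;
        do {
            succ = atomic_load_explicit(slot, memory_order_relaxed);
            atomic_store_explicit(&n->next[i], succ, memory_order_relaxed);
        } while (!atomic_compare_exchange_weak_explicit(
          slot, &succ, n, memory_order_release, memory_order_relaxed));
    }
}
```

      Because each level is published with a release CAS, a searcher that observes the node at any level is guaranteed (given correctly ordered loads on its side, which is what this ticket is about) to also find it at every lower level.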


      Note that it is vital to keep this invariant in mind when reasoning about data dependencies in the search loop. The CPU could speculatively resolve the loads for a lower level before the loads for the current level; if the current level's loads later resolve to consistent values, the processor concludes it did not mis-speculate and keeps the speculative results.

      The bug's cause and the fix

      • The bug presents itself as a wrongly positioned node at one of the levels in the skip list. The same search function is also used by the insert code to find the location where a new node should go. The bug is resolved by introducing a read barrier, dmb ishld, just after taking a local copy of the head pointer. Note that this is done on each iteration of the loop as we walk down the levels of the skip list.
      • The incorrect insertion can only happen if a stale value of the head pointer is used during a search racing with an insert. To prevent this, each loop iteration takes a local copy of the head pointer before proceeding further. That alone is sufficient on x86, but not on ARM; adding the read barrier resolves the bug there.
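      For illustration, the constraint the barrier imposes can also be expressed with C11 acquire loads instead of an explicit dmb. The following is a hypothetical single-level sketch, not WiredTiger code (search_geq and snode_t are invented names); the acquire load on each pointer plays the role of WT_READ_BARRIER(), ordering that load before every later load in the traversal so a stale pointer is never paired with fresher data.

```c
#include <stdatomic.h>
#include <stddef.h>

typedef struct snode {
    int key;
    _Atomic(struct snode *) next;
} snode_t;

/* Find the first node with key >= target in a sorted, lock-free list. */
static snode_t *
search_geq(_Atomic(snode_t *) *headp, int target)
{
    /* Acquire: this load is ordered before all subsequent loads below. */
    snode_t *cur = atomic_load_explicit(headp, memory_order_acquire);
    while (cur != NULL && cur->key < target)
        cur = atomic_load_explicit(&cur->next, memory_order_acquire);
    return cur;
}
```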

      Intricate details and learnings

      • An out-of-order read into the local copy of the head pointer appears to have happened here when the reads are compared across loop iterations. Without the read barrier, the logic assumes that the read for level N happens before the read for level N-1, as dictated by the loop's execution order (the loop walks the levels downward). This holds on x86, where TSO maintains load -> load ordering. On ARM, which can perform loads out of order, a read barrier is required to guarantee the expected order. The fix adds a read barrier at the start of each loop iteration to enforce ordering between iterations.
      • In the above code snippet, there are several places where we read the head pointer and store it into the local copy insp. Even on ARM, with its weaker memory ordering, not all re-orderings are allowed. The ARM architecture recognizes data dependencies and control dependencies between loads and stores, and orders a younger instruction after the older one it directly depends on. An example of a data dependency is the assignment ins = *insp, which is respected by the later read insp = &ins->next[i]. Similarly, there is a control dependency on cmp, which determines whether insp is set to &ins->next[i] or decremented.


        It is crucial here to note that there is no data dependency between the assignment ins = *insp and decrement insp--.

      • Now, consider the processor's speculative execution. Since this bug presents itself as inconsistent/unordered reads across different loop iterations, could the CPU speculatively execute future loop iterations and then decide the results are valid and keep them? x86 also speculates, but if the values change out from under it, it discards the speculative work and retries, so we see no issue there. ARM, conversely, is free to keep the results of speculative execution as long as no data or control dependency was violated. It is easier to reason about which re-orderings can occur if the loop is imagined unrolled. To obtain the local copy of the insert head, ins = *insp, when staying at the current depth in the skip list traversal, there is a data (load) dependency, insp = &ins->next[i], that must resolve before the processor can evaluate further iterations. But when the traversal drops down a level, insp-- has no dependency on ins = *insp. This is precisely where we see the bug:
            for (...) {
                ins = *insp;
                ...
                if (cmp > 0) { /* Keep going at this level */
                    insp = &ins->next[i];   <<<< data dependency on ins, does not allow speculative execution
                    skiplow = match;
                } else if (cmp < 0) { /* Drop down a level */
                    insp--;     <<<< No data dependency on ins, allows speculative execution
                    skiphigh = match;
                } else { ... }
            }

      Any proof of the above theory

      We were able to write a test that reliably reproduced the original issue (ref: WT-10561). As the final fix, we made a coarse intervention: a read barrier, dmb ishld, that orders the loads between loop iterations. This has fixed the bug in practice.

          for (i = WT_SKIP_MAXDEPTH - 1, insp = &ins_head->head[i]; i >= 0;) {
              /* 
               * Note that the skip list can also be modified in parallel. Get a copy of the
               * head node so that we can work with the immutable local copy.
               */
              ins = *insp;
              // WT_READ_BARRIER();  <<<<<<<<<
            ...
          }
      

      On the other hand, we experimented with a finer placement of the read barrier along the various loop control paths (loc. 1 and loc. 2):

              if (cmp > 0) { /* Keep going at this level */
                  insp = &ins->next[i];
                  // WT_READ_BARRIER();  <<<<<<<<< loc. 1
                  skiplow = match;
              } else if (cmp < 0) { /* Drop down a level */
                  insp--;
                  // WT_READ_BARRIER();  <<<<<<<<< loc. 2
                  skiphigh = match;
              } else {
                  /* Found */
                  break;
              }
      

      A read barrier only at location 1 does not resolve the issue, whereas one at location 2 does. This validates the above theory: in practice, we only need to impose a memory ordering requirement when dropping down a level in the skip list iteration.

      Miscellaneous

      • At some point we wondered whether this bug could be caused by compiler optimizations. What if the compiler optimized away the local copy, replacing uses of ins with *insp, so that a single loop iteration observed inconsistent values for the head pointer? We argued that if this were a compiler optimization, we would also have seen the issue on x86. But it is essential to note that compiler optimizations can be architecture-dependent: an optimization not happening on one architecture is NOT evidence that it cannot happen on another.
      • We reviewed the documentation on the ARM data barriers, and we noticed the reference to preventing the reordering of “explicit data access”: "The Data Memory Barrier (DMB) prevents the reordering of specified explicit data accesses across the barrier instruction. All explicit data load or store instructions, which are executed by the PE in program order before the DMB, are observed by all Observers within a specified Shareability domain before the data accesses after the DMB in program order." What is explicit or implicit in data accesses, and do we need to consider them when using barriers?

        Explicit memory accesses are any accesses your program itself performs (loads or stores). Executing code can also create implicit memory accesses: for example, ARM allows hardware updates of the page-table dirty and access flags that the OS maintains, and these accesses are implicit in reading (or writing) a page that has not yet been accessed (or written). In short, anything in your program is an explicit access, and unless you are writing an operating system, you do not need to worry about implicit accesses.
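      As a concrete illustration of explicit accesses and load ordering, here is a minimal, self-contained C11 message-passing sketch (not WiredTiger code). A writer publishes a payload and then sets a flag with release semantics; the reader's acquire load of the flag is ordered before its subsequent load of the payload. This is the same load -> load constraint the skip list search needs, expressed with paired release/acquire operations rather than a standalone barrier.

```c
#include <stdatomic.h>
#include <pthread.h>
#include <stddef.h>

static int payload;      /* plain data published by the writer */
static atomic_int ready; /* flag guarding the payload */

static void *
writer(void *arg)
{
    (void)arg;
    payload = 42;
    /* Release store: the payload write is ordered before the flag write. */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *
reader(void *arg)
{
    /*
     * Acquire load: once ready == 1 is observed, the payload load below
     * cannot be satisfied by a stale value, even on ARM.
     */
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;
    *(int *)arg = payload;
    return NULL;
}
```

      With relaxed ordering on both sides, the reader on ARM would be permitted to observe ready == 1 yet still load a stale payload, which is the same class of reordering that produced the skip list bug.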

            Assignee: Sulabh Mahajan (sulabh.mahajan@mongodb.com)
            Reporter: Sulabh Mahajan (sulabh.mahajan@mongodb.com)