  1. WiredTiger
  2. WT-12730

Consider WT_RET and friends storing more error information

    • Type: Improvement
    • Resolution: Unresolved
    • Priority: Major - P3
    • None
    • Affects Version/s: None
    • Component/s: Not Applicable
    • Labels:
      None
    • Storage Engines
    • StorEng - 2024-06-11

      This came out of a HELP ticket + Slack thread (links in comments).

      There are many functions where a non-zero return value could come from a number of places, for example __reconcile:

          WT_ERR(__rec_write_wrapup(session, r, page));
          __rec_write_page_status(session, r);
          WT_ERR(__reconcile_post_wrapup(session, r, page, flags, page_lockedp));
          // snip...
          if (__wt_ref_is_root(ref)) {
              WT_WITH_PAGE_INDEX(session, ret = __rec_root_write(session, page, flags));
              if (ret != 0)
                  goto err;
              return (0);
          }
          // snip...
          WT_ERR(__wt_page_parent_modify_set(session, ref, true));
          // snip...
      
      err:
          if (ret != 0)
              WT_RET_PANIC(session, ret, "reconciliation failed after building the disk image");
      

      If we see this message in the wild, it's impractical to tell which of these calls produced the error. There's a similar gap for functions using WT_ERR, where we often can't tell which "leaf" function call returned non-zero.

      What makes this fixable is that almost all of our error handling code goes through WT_RET, WT_ERR, or one of their related macros.

      So it would be possible, without intrusive changes, to record more accurate failure information. For example, WT_RET could, for non-zero return values, store a tuple of (retval, line, function) in a small per-session circular buffer.
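
      To make the idea concrete, here is a minimal sketch of what the recording side might look like. Every name in it (DEMO_SESSION, DEMO_RET, err_trace_record, ERR_TRACE_SLOTS) is hypothetical and invented for illustration; none of it is existing WiredTiger code, and the real session structure and macros would differ.

```c
#include <stdint.h>

/* Hypothetical names throughout; nothing here exists in WiredTiger. */
#define ERR_TRACE_SLOTS 8 /* Power of two, so the index wraps with a mask. */

typedef struct {
    int ret;          /* The non-zero return value. */
    int line;         /* __LINE__ at the failing call site. */
    const char *func; /* __func__ at the failing call site. */
} ERR_TRACE_ENTRY;

/* Stand-in for a session: only the circular buffer is modelled. */
typedef struct {
    ERR_TRACE_ENTRY entries[ERR_TRACE_SLOTS];
    uint32_t next; /* Count of entries recorded so far. */
} DEMO_SESSION;

/* Overwrite the oldest slot with the newest failure. */
static inline void
err_trace_record(DEMO_SESSION *session, int ret, int line, const char *func)
{
    ERR_TRACE_ENTRY *e;

    e = &session->entries[session->next++ & (ERR_TRACE_SLOTS - 1)];
    e->ret = ret;
    e->line = line;
    e->func = func;
}

/* Same shape as a return-on-error macro, but records the failure first. */
#define DEMO_RET(session, a)                                             \
    do {                                                                 \
        int demo_ret_;                                                   \
        if ((demo_ret_ = (a)) != 0) {                                    \
            err_trace_record((session), demo_ret_, __LINE__, __func__); \
            return (demo_ret_);                                          \
        }                                                                \
    } while (0)

/* A "leaf" call that always fails, for demonstration. */
static int
leaf_fails(void)
{
    return (-1);
}

/* A caller whose failure site we want to be able to identify later. */
static int
caller(DEMO_SESSION *session)
{
    DEMO_RET(session, leaf_fails());
    return (0);
}
```

      After a failed call to caller, the newest slot holds -1, the line of the DEMO_RET invocation, and the string "caller". Storing only an int and a pointer to a string literal keeps the failure path to a handful of stores, with no allocation or formatting.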

      Exposing this is also possible - a WT_PANIC could dump it using the verbose system, and we could expose a wt_dump_crash_diagnostics API for callers that capture segmentation faults and attempt to print their own diagnostics (e.g. MongoDB).
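
      As a rough illustration of the dump side, the sketch below walks such a buffer oldest-first and prints each entry. The structure layout and the function name demo_dump_crash_diagnostics are assumptions made up for this example, not an existing or agreed API. It uses fprintf for brevity, which is not async-signal-safe, so a version meant to run inside a signal handler would need a different output path.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical layout; nothing here exists in WiredTiger. */
#define ERR_TRACE_SLOTS 8 /* Power of two, so the index wraps with a mask. */

typedef struct {
    int ret;          /* The non-zero return value. */
    int line;         /* Line of the failing call site. */
    const char *func; /* Function containing the failing call site. */
} ERR_TRACE_ENTRY;

typedef struct {
    ERR_TRACE_ENTRY entries[ERR_TRACE_SLOTS];
    uint64_t next; /* Count of entries recorded so far. */
} DEMO_SESSION;

/*
 * demo_dump_crash_diagnostics --
 *     Print the recorded failures, oldest first. Returns the number of
 *     entries printed.
 */
static int
demo_dump_crash_diagnostics(const DEMO_SESSION *session, FILE *out)
{
    const ERR_TRACE_ENTRY *e;
    uint64_t count, first, i;

    /* Until the buffer has wrapped, only the first "next" slots are valid. */
    count = session->next < ERR_TRACE_SLOTS ? session->next : ERR_TRACE_SLOTS;
    first = session->next - count;
    for (i = 0; i < count; ++i) {
        e = &session->entries[(first + i) & (ERR_TRACE_SLOTS - 1)];
        fprintf(out, "error %d returned at %s:%d\n", e->ret, e->func, e->line);
    }
    return ((int)count);
}
```

      A panic path could call this with the session's buffer before aborting, so the log shows the chain of non-zero returns that led up to the panic instead of only the final message.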

      The most likely sticking point is performance - some non-zero return values (e.g. WT_NOTFOUND for cursors) are frequent and potentially on the hot path, so recording every one of them could add measurable cost.

      This ticket is just a discussion point for whether we want to do something like this, and a record of the decision we make + why.

            Assignee:
            backlog-server-storage-engines [DO NOT USE] Backlog - Storage Engines Team
            Reporter:
            will.korteland@mongodb.com Will Korteland
            Votes:
            0
            Watchers:
            7
