dist/s_mentions hangs and writes multi-GB tmp file on wt-NNNN branches


    • Type: Bug
    • Resolution: Fixed
    • Priority: Major - P3
    • WT12.0.0, 9.0.0-rc0
    • Affects Version/s: None
    • Component/s: Tools
    • None
    • Storage Engines - Transactions
    • 481.187
    • SE Transactions - 2026-06-05
    • 1

      Symptom

      dist/s_fast (and dist/s_all) hang on any branch whose name matches the wt-NNNN pattern. The script dist/s_mentions never returns and silently grows dist/__s_all_tmp_-s_mentions-PID to many GB (60 GB was observed on a developer machine before the process was killed).

      Root cause

      dist/s_all runs each step with stdout and stderr redirected to a tmp file inside dist/:

      local t="${t_pfx}-${name}-$$"
      $cmd > $t 2>&1
      

      For s_mentions this gives dist/__s_all_tmp_-s_mentions-PID.
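      As a rough illustration of how that name is assembled, the sketch below rebuilds it from its parts. The value of t_pfx is inferred from the file name quoted in this ticket, and tmp_name is a hypothetical helper, not a function taken from dist/s_all.

```shell
# Assumption: t_pfx is inferred from the observed file name, not quoted from dist/s_all.
t_pfx="__s_all_tmp_"

# Hypothetical helper: build the per-step tmp file name the way the snippet above does.
tmp_name() {
    name="$1"
    echo "${t_pfx}-${name}-$$"
}

t=$(tmp_name s_mentions)
echo "$t"    # __s_all_tmp_-s_mentions-<PID of this shell>
```

      Because the file lives in dist/ and has no excluded extension or directory, any recursive search rooted at ../ will encounter it.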

      dist/s_mentions then runs (from dist/):

      grep -Iinr --exclude-dir=.git --exclude-dir=BUILD ticket-id ../
      

      The recursion walks back into dist/, reads __s_all_tmp_-s_mentions-PID, and finds the ticket id in lines grep itself just wrote there. Every matched line is written back to the same file and still contains the ticket id, so grep keeps appending to the file and re-reading what it appended: a self-sustaining feedback loop that runs until the disk fills or the process is killed.
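      The walk back into dist/ can be shown safely in a scratch tree. This is a minimal sketch under stated assumptions: the directory layout, the ticket id wt-1234, and the PID 999 are made up, and the output is deliberately written outside the tree so the demo cannot feed itself.

```shell
# Scratch layout; nothing here touches a real checkout.
work=$(mktemp -d)
mkdir -p "$work/dist" "$work/src"
echo "fix for wt-1234" > "$work/src/file.c"

# Stand-in for the in-flight tmp file, seeded with a line grep "already wrote".
echo "../src/file.c:1:fix for wt-1234" > "$work/dist/__s_all_tmp_-s_mentions-999"

# Same grep as in the ticket, run from dist/, but with output kept OUTSIDE the tree.
out=$(mktemp)
( cd "$work/dist" && grep -Iinr --exclude-dir=.git --exclude-dir=BUILD wt-1234 ../ ) > "$out" 2>&1

# Count the matches that came from the tmp file itself.
hits=$(grep -c '__s_all_tmp_' "$out" || true)
echo "matches read back from the tmp file: $hits"

rm -rf "$work" "$out"
```

      In the real wiring the output lands in that same tmp file, so each line read back becomes another line to read.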

      It does not reproduce on develop because the branch name does not match the wt-NNNN regex and s_mentions exits early.
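      The early exit can be pictured with the sketch below. The pattern (wt- followed by digits, case-insensitive) and the helper name ticket_id_from_branch are assumptions for illustration; the actual regex in dist/s_mentions may differ.

```shell
# Hypothetical helper: print the ticket id embedded in a branch name, or nothing.
# Assumption: the id looks like "wt-" followed by digits, matched case-insensitively.
ticket_id_from_branch() {
    printf '%s\n' "$1" | grep -Eio 'wt-[0-9]+' | head -n 1
}

for branch in wt-17236-fix-mentions develop; do
    id=$(ticket_id_from_branch "$branch")
    if [ -z "$id" ]; then
        echo "$branch: no ticket id, nothing to search for"
    else
        echo "$branch: would search for $id"
    fi
done
```

      Only the first branch name yields an id, which is why the recursive grep, and therefore the feedback loop, never starts on develop.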

      Why this is only surfacing now

      The bug has been latent since the recursive grep was introduced (Nov 2021, WT-8448) and survived the tmp-file naming change in Dec 2023 (WT-12195). Whether a given run blows up or completes depends on a race:

      • If grep visits dist/ early in the walk, the tmp file is still small; grep reads to EOF before the s_all redirect has written enough new bytes to extend it, and the loop terminates.
      • If grep visits dist/ late, the tmp file is already large and growing; grep reads forever.

      readdir order on APFS is directory-entry order (not alphabetical), so which case you get depends on worktree layout. Two other factors raise match volume and tilt the race toward the bad case:

      • Stale state in the worktree root (for example WT_TEST/ from earlier test runs) that is not covered by the build-folder exclude loop.
      • High match density across src/, test/, comments, and generated files for the specific ticket id (this branch has many wt-17236 references; less-touched ticket ids would produce fewer matches and finish in time).

      Same script, same s_all wiring; this worktree's content profile is simply the first to tip the race into the bad case.
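      To see which directories dominate the match volume in a given worktree, something like the following sketch may help. match_volume is a hypothetical helper, the scratch tree and the id wt-1234 are invented, and the tmp-file exclude is included so the measurement itself stays safe.

```shell
# Hypothetical helper: count matching lines for a ticket id under each
# top-level directory of a tree.
match_volume() {
    root="$1"; id="$2"
    for d in "$root"/*/; do
        n=$(grep -Iinr --exclude-dir=.git --exclude='__s_all_tmp_*' "$id" "$d" | wc -l | tr -d ' ')
        echo "$(basename "$d") $n"
    done
}

# Scratch tree: one mention in src/, three in stale test output.
work=$(mktemp -d)
mkdir -p "$work/src" "$work/WT_TEST"
echo "fix for wt-1234" > "$work/src/file.c"
printf 'wt-1234 a\nwt-1234 b\nwt-1234 c\n' > "$work/WT_TEST/log.txt"

report=$(match_volume "$work" wt-1234)
echo "$report"

rm -rf "$work"
```

      A directory such as WT_TEST/ showing up with a large count is a hint that stale state is inflating the number of lines written to the tmp file.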

      Fix

      Add --exclude=__s_all_tmp_* (or equivalent) to the grep invocation in dist/s_mentions so it cannot read the in-flight tmp file. This closes the feedback loop regardless of walk order and match volume.
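      The effect of the exclude can be checked in a scratch tree. This is a sketch, not the committed patch: the layout, the ticket id wt-1234, and the PID 999 are made up, and the glob is quoted here only to keep the shell from expanding it.

```shell
# Scratch layout with a seeded stand-in for the in-flight tmp file.
work=$(mktemp -d)
mkdir -p "$work/dist" "$work/src"
echo "fix for wt-1234" > "$work/src/file.c"
echo "../src/file.c:1:fix for wt-1234" > "$work/dist/__s_all_tmp_-s_mentions-999"

# grep as in the ticket plus the proposed exclude; output kept outside the tree.
out=$(mktemp)
( cd "$work/dist" &&
  grep -Iinr --exclude-dir=.git --exclude-dir=BUILD \
       --exclude='__s_all_tmp_*' wt-1234 ../ ) > "$out" 2>&1

total=$(wc -l < "$out" | tr -d ' ')
from_tmp=$(grep -c '__s_all_tmp_' "$out" || true)
echo "total matches: $total, read back from tmp file: $from_tmp"

rm -rf "$work" "$out"
```

      Only the genuine mention in src/ is reported; the tmp file is never opened, so its size and the order of the walk no longer matter.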

      Reproduction

      The race triggers when match volume outruns grep's read of the tmp file. To force it deterministically:

      1. Check out a branch named wt-NNNN-anything.
      2. In the worktree root, leave directories that are not excluded by the build-folder loop (for example WT_TEST/) and contain or path-match the ticket id.
      3. Run dist/s_fast from the worktree root.
      4. Observe ps showing a long-running grep -Iinr ticket-id ../ rooted in dist/ and dist/__s_all_tmp_-s_mentions-PID growing unboundedly. Kill the process before it fills the disk.
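      For step 4, unbounded growth can be confirmed by sampling the file size twice. The helper below is a hypothetical sketch (growth is not part of the dist/ scripts); the demo uses a scratch file with a background writer rather than a real runaway tmp file.

```shell
# Hypothetical helper: print how many bytes a file grew over an interval.
growth() {
    f="$1"; secs="${2:-1}"
    a=$(wc -c < "$f" | tr -d ' ')
    sleep "$secs"
    b=$(wc -c < "$f" | tr -d ' ')
    echo $((b - a))
}

# Demo: a background writer appends 5 bytes ("more" plus newline) after 1 second.
f=$(mktemp)
( sleep 1; echo more >> "$f" ) &
grown=$(growth "$f" 2)
wait
echo "grew by $grown bytes"

rm -f "$f"
```

      Pointed at dist/__s_all_tmp_-s_mentions-PID during a hung run, a result that stays positive across repeated samples indicates the feedback loop is active.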

      A clean worktree (no stale test output, no extra build dirs, few ticket-id mentions) will usually complete – but the race is still present and remains a footgun until the exclude is in place.

            Assignee:
            Haribabu Kommi
            Reporter:
            Haribabu Kommi
