Create concurrency workload that performs retryable writes during an index build


    • Type: Engineering Test
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Storage Execution
    • Storage Execution 2026-01-19, Storage Execution 2026-02-02, Storage Execution 2026-02-16, Storage Execution 2026-04-13

      Goal / Objective

      • Create a new concurrency workload that issues retryable writes concurrently with an index build, to exercise the correctness and durability of index builds while side writes are being applied
      • Ensure retryable write semantics are preserved when:
          - An index build is in progress
          - An index build succeeds or fails (including interruption/stepdown scenarios, if covered by the suite)
      • Integrate the workload into the existing concurrency/resmoke infrastructure so it can run in CI as part of the standard suite(s) for index build / retryable write coverage. To do this, look at other FSM concurrency tests and make sure the new FSM concurrency test is added to the appropriate suites.

      Context & Background

      • The ticket calls for a concurrency workload specifically focused on retryable writes during an index build
      • Gap: there is no targeted workload that:
          - Runs multiple threads of retryable writes (insert/update/delete/findAndModify) using sessions + `txnNumber` while one or more index builds are actively running (foreground/background, possibly on multiple indexes/fields).
          - And validates both:
            - Retryable write semantics (retries do not double-apply or lose writes).
            - Index correctness (index contents match the collection after the build, including across retries and failures).
      • This workload is intended to:
          - Increase confidence in the behavior of primary-driven index builds, and of any other index builds, under retryable writes.
          - Provide a regression target for bugs where retryable writes interact poorly with index building or catalog transitions.
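
      The retry property described above (a retry with the same `lsid`/`txnNumber` must not double-apply a write) can be illustrated with a toy in-memory model. This is only a sketch for reasoning about what the workload should assert; `ToyServer` and `retryableInc` are hypothetical names and do not reflect how the server implements retryable writes.

```javascript
// Toy model only: an in-memory "server" that remembers the result of each
// (lsid, txnNumber) pair, so a retry replays the stored result instead of
// re-applying the write. None of these names come from the MongoDB codebase.
class ToyServer {
    constructor() {
        this.docs = new Map();      // _id -> document
        this.executed = new Map();  // "<lsid>:<txnNumber>" -> stored result
    }

    // Increments `counter` on the document with the given _id (creating it if
    // absent), at most once per (lsid, txnNumber) pair.
    retryableInc(lsid, txnNumber, id, inc) {
        const key = `${lsid}:${txnNumber}`;
        if (this.executed.has(key)) {
            // Retry: return the recorded result without touching the document.
            return {...this.executed.get(key), replayed: true};
        }
        const doc = this.docs.get(id) ?? {_id: id, counter: 0};
        doc.counter += inc;
        this.docs.set(id, doc);
        const result = {n: 1, replayed: false};
        this.executed.set(key, result);
        return result;
    }
}

const server = new ToyServer();
const first = server.retryableInc("session-A", 1, "doc0", 5);
const retry = server.retryableInc("session-A", 1, "doc0", 5);  // same lsid/txnNumber
const next = server.retryableInc("session-A", 2, "doc0", 5);   // new txnNumber
```

      In this model the counter reflects two logical writes even though three requests were sent, which is the shape of the invariant a workload could check per document after the index build finishes.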

      Acceptance Criteria

      • Workload behavior:
          - Implements two FSM concurrency tests that:
            - Open logical sessions and issue retryable writes (using `lsid` + monotonically increasing `txnNumber`).
            - Randomize operation types across:
              - Inserts (new documents, including documents that will or will not satisfy the index key pattern).
              - Updates (in-place and potentially moving documents into/out of the indexed key range).
              - Deletes.
              - `findAndModify` / upserts (if supported in the concurrency framework).
            - Randomly choose whether to *retry* a subset of these writes with the same `lsid`/`txnNumber` and identical command, to exercise replay behavior.
            - Run concurrently with one or more threads that:
              - Start index builds using `createIndexes` against the same collection(s) being modified.
              - Optionally vary index options (e.g., unique vs non-unique, different key patterns).
      • Index build interaction:
          - The workload must guarantee overlapping windows where:
            - At least some retryable writes run while an index build is in progress on the same collection.
            - Some retryable write retries occur:
              - During an in-progress build.
              - After build completion.
          - The workload should allow configuration of:
            - Number of concurrent writer threads.
            - Frequency/number of index builds.
      • Correctness properties (checked implicitly or via post-hooks/assertions):
          - The suite completes without server crashes, invariants, or fasserts.
          - No retryable write is applied more than once to the logical document state, even if retried mid-build.
          - No successful retryable write is silently lost.
          - Final collection state is consistent with the index (I think this will be covered by the ValidateCollections hook).
          - Two FSM concurrency workloads should be created: a base workload and a variant in which the index being built is unique. Reduce code duplication by having the variant extend the base, following other FSM concurrency tests that extend an existing workload. For unique indexes, duplicate key errors occur only when logically appropriate (no spurious duplicates from retries).
      • Integration:
          - The new workload is:
            - Implemented using the *existing concurrency test framework* (same style/pattern as other concurrency workloads).
            - Registered in at least one standard concurrency or index-build-related *resmoke suite*.
            - Documented in the suite YAML (name, tags, rough description).
      • CI / Evergreen:
          - At least one Evergreen task is updated or added to run this workload regularly (e.g., as part of an existing concurrency or index build suite).
          - Workload is stable (no known flakes) under typical concurrency suite runtime constraints.
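
      As a rough illustration of the workload behavior listed above, the sketch below builds the command documents a writer thread and an index-builder thread might send, and derives a unique-index variant from shared base settings. The helper names (`makeRetryableInsert`, `makeCreateIndexes`, `Writer`), the settings objects, and the collection name are all hypothetical. Only the command field names (`insert`, `documents`, `lsid`, `txnNumber`, `createIndexes`, `indexes`, `key`, `name`, `unique`) are actual command fields. The object spread is a stand-in for the FSM framework's own workload-extension mechanism, which this sketch does not use.

```javascript
// Hypothetical helpers: plain JavaScript that only constructs command
// documents. Nothing here talks to a server.
function makeRetryableInsert(collName, lsid, txnNumber, doc) {
    // In the mongo shell, txnNumber would be sent as NumberLong(txnNumber).
    return {insert: collName, documents: [doc], lsid: lsid, txnNumber: txnNumber};
}

function makeCreateIndexes(collName, keyPattern, indexName, unique) {
    const spec = {key: keyPattern, name: indexName};
    if (unique) {
        spec.unique = true;
    }
    return {createIndexes: collName, indexes: [spec]};
}

// Per-thread writer state. txnNumber increases monotonically for each new
// logical write; a retry re-sends the previous command object unchanged.
class Writer {
    constructor(lsid, collName) {
        this.lsid = lsid;
        this.collName = collName;
        this.txnNumber = 0;
        this.lastCmd = null;
    }

    nextInsert(doc) {
        this.txnNumber += 1;
        this.lastCmd = makeRetryableInsert(this.collName, this.lsid, this.txnNumber, doc);
        return this.lastCmd;
    }

    retryLast() {
        return this.lastCmd;
    }
}

// Base settings (all values are placeholders), plus a unique-index variant
// that overrides only the field that differs.
const baseSettings = {
    collName: "retryable_writes_during_index_build",  // hypothetical name
    keyPattern: {a: 1},
    indexName: "a_1",
    unique: false,
    writerThreads: 4,
    indexBuilds: 2,
};
const uniqueSettings = {...baseSettings, unique: true};
```

      A real workload would send these documents with the framework's command-running facilities and decide at random whether to call `retryLast()` before moving on to the next write.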

      Constraints & Out of Scope

      • Topology:
          - In scope: replica sets, single-node topologies, and the sharded cluster fixtures used by existing concurrency suites (whichever is standard for index build concurrency tests).
      • Feature coverage:
          - Focuses on retryable writes, not multi-statement transactions; transaction behavior is out of scope.
          - Does not need to cover every index option variant (TTL, partial, wildcard, text, etc.); a small representative set is sufficient (e.g., one standard non-unique index, optionally one unique index).
      • Product behavior changes:
          - No server feature changes are required; any server changes discovered as necessary should be spun out into separate SERVER tickets.

      Testing Instructions

      • Local / developer runs:
          - Add the new workload file(s) under the standard concurrency workload location (matching existing naming and structure).
          - Register the workload in a suitable concurrency or index-build-focused resmoke suite YAML.
          - Run the workload locally with resmoke using a representative suite, e.g.:
            - `buildscripts/resmoke.py run --suites <concurrency_suite_with_new_workload> --repeatSuites=10`
              - Adjust suite name and repeat count to:
                - Ensure the workload actually overlaps retryable writes with index builds.
                - Shake out basic flakiness.
      • Verification steps:
          - Confirm:
            - All test runs pass without crashes, fasserts, or invariant failures.
            - No unexpected `OutOfDiskSpace`, `WriteConflict`, or retryable write protocol assertion errors appear in the logs (beyond expected, internally-handled transient errors).
          - Where supported:
            - Run with *increased log verbosity* for index build and retryable write components to visually confirm:
              - Interleaving of index build phases with retryable writes and their retries.
      • Evergreen:
          - Add or update an Evergreen task to:
            - Run the suite containing the new workload at least once per patch and on mainline (e.g., in an existing concurrency or index build task group).
          - Ensure:
            - Task runtime stays within acceptable bounds.
            - No intermittent failures after multiple mainline runs.

            Assignee:
            Jess Balint
            Reporter:
            Stephanie Eristoff
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: