Introduce white-box integration tests for cluster-wide change stream v2 shard targeting


    • Type: Task
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: None
    • Component/s: None
    • Query Execution
    • QE 2026-03-16

      Summary

      SERVER-111381 implemented AllDatabasesChangeStreamShardTargeterImpl with C++ unit tests and JS smoke tests, but did not add white-box integration tests that verify observable shard-targeting behavior: which shards have open cursors after each lifecycle event, and that placement history is consulted at the right times.

      Approach

      Two jstest files using two observability mechanisms:

      1. $currentOp with {idleCursors: true} — inspects which shards have open change stream cursors. Cursors are identified via a unique comment per test case.

      2. Log offset snapshots — before each operation, snapshot checkLog.getGlobalLog(mongos).length. After the operation, search only the new log entries for expected LOGV2 IDs and attributes. This avoids false matches from prior test steps.
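      In isolation, the two mechanisms reduce to two small helpers. The sketch below is illustrative only: the function names are made up here, and the exact $currentOp and LOGV2 document shapes are assumptions.

```javascript
// Mechanism 1: filter $currentOp output (run with {idleCursors: true}) down to
// idle change stream cursors tagged with this test case's unique comment.
// The {type: "idleCursor", cursor: {originatingCommand: ...}} shape is assumed.
function cursorsForComment(currentOpDocs, comment) {
    return currentOpDocs.filter(doc =>
        doc.type === "idleCursor" &&
        doc.cursor?.originatingCommand?.comment === comment);
}

// Mechanism 2: given a log snapshot offset taken before the operation, parse
// only the lines appended afterwards and keep those with the expected LOGV2 id.
function newLogEntriesWithId(allLogLines, snapshotOffset, expectedId) {
    return allLogLines.slice(snapshotOffset)
        .map(line => { try { return JSON.parse(line); } catch (e) { return null; } })
        .filter(entry => entry && entry.id === expectedId);
}
```

      In the tests, the input arrays would come from the $currentOp aggregation on each shard and from checkLog.getGlobalLog(mongos), respectively.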

      Shared fixture: A single ShardingTest is created in before(); beforeEach()/afterEach() only clean up databases/collections. Verbose query logging is enabled on all nodes via logComponentVerbosity: {query: {verbosity: 3}}.
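      A possible shape for that fixture (a sketch that only runs inside the server's jstest harness, so the details are assumptions; verbosity is shown on the mongos only, though the plan applies it to all nodes):

```javascript
let st;
before(() => {
    // One ShardingTest for the whole file; individual tests must not tear it down.
    st = new ShardingTest({shards: 3});
    assert.commandWorked(st.s.adminCommand(
        {setParameter: 1, logComponentVerbosity: {query: {verbosity: 3}}}));
});
afterEach(() => {
    // Clean up only user databases/collections between test cases.
    st.s.getDBNames()
        .filter(name => !["admin", "config", "local"].includes(name))
        .forEach(name => st.s.getDB(name).dropDatabase());
});
after(() => st.stop());
```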

      Test Files

      • jstests/sharding/query/change_streams/change_stream_all_databases_v2_strict_whitebox.js — 3 shards
      • jstests/sharding/query/change_streams/change_stream_all_databases_v2_ignore_removed_shards_whitebox.js — 4 shards (needs spare after removals)

      File 1: Strict Mode (change_stream_all_databases_v2_strict_whitebox.js)

      Fixture: 3 shards. Tags: featureFlagChangeStreamPreciseShardTargeting, requires_sharding, uses_change_streams, assumes_balancer_off.


      Test 1.1: Initialize with data on multiple shards

      Setup: Create DB (primary shard0). Shard collection, split chunks across shard0 and shard1. Insert documents on both.

      Steps:

      1. Snapshot log offset on mongos.
      2. Open cluster-wide v2 change stream.
      3. Assert cursors: shard0 (has data), shard1 (has data). No cursor on shard2 (no data). Config cursor open.
      4. Assert log: ID 11138104 on mongos with shards listing shard0 and shard1.
      5. Insert a document, verify event received.

      Test 1.2: Initialize on empty cluster — config server cursor only

      Setup: No user databases.

      Steps:

      1. Snapshot log offset.
      2. Open cluster-wide v2 change stream.
      3. Assert cursors: No data shard cursors. Config cursor open.
      4. Assert log: ID 11138104 with empty shards.
      5. Snapshot log offset again.
      6. Create database + collection (triggers DatabaseCreatedControlEvent).
      7. Assert cursors: Data shard cursor now open on the DB's primary shard. Config cursor still open.
      8. Assert log: ID 11138117 ("Handling placement refresh") with updated shard set.
      9. Insert document, verify event received.

      Test 1.3: DatabaseCreated on a different shard opens cursor on that shard

      Setup: DB1 with primary on shard0, unsharded collection with data.

      Steps:

      1. Open change stream. Assert cursors on shard0 + config.
      2. Snapshot log offset.
      3. Create DB2 with primaryShard: shard1. Create collection in DB2.
      4. Assert cursors: Cursor now also open on shard1. shard2 still no cursor. Config still open.
      5. Assert log: ID 11138117 with shards including shard1.
      6. Insert into DB2, verify event received.

      Test 1.4: MoveChunk triggers placement refresh

      Setup: Collection sharded across shard0 and shard1 (data on both).

      Steps:

      1. Open change stream. Assert cursors on shard0, shard1 + config.
      2. Snapshot log offset.
      3. moveChunk all chunks from shard0 to shard2.
      4. Assert cursors: Cursor opened on shard2. Cursor on shard0 closed (no data). shard1 unchanged. Config still open.
      5. Assert log: ID 11138117 with updated shard set.
      6. Insert targeting shard2, verify event.

      Test 1.5: MovePrimary triggers placement refresh

      Setup: Unsharded collection on shard0 (shard0 is DB primary).

      Steps:

      1. Open change stream. Assert cursor on shard0 + config.
      2. Snapshot log offset.
      3. movePrimary to shard1.
      4. Assert cursors: Cursor on shard1 opened. Cursor on shard0 closed. Config still open.
      5. Assert log: ID 11138117.
      6. Insert, verify event.

      Test 1.6: NamespacePlacementChanged via reshardCollection triggers placement refresh

      Setup: Collection sharded across shard0 and shard1 (key {_id: 1}).

      Steps:

      1. Open change stream. Assert cursors on shard0, shard1 + config.
      2. Snapshot log offset.
      3. reshardCollection with new key {a: 1}, distributing chunks to shard1 and shard2.
      4. Assert cursors: Cursor opened on shard2. Cursor on shard0 closed. shard1 unchanged. Config still open.
      5. Assert log: ID 11138117 with updated shard set showing shard1, shard2.
      6. Insert, verify event with new document key shape.

      Test 1.7: Multiple databases on different shards

      Setup: DB1 (primary shard0), DB2 (primary shard1), DB3 (primary shard2). Each with an unsharded collection.

      Steps:

      1. Open change stream.
      2. Assert cursors: shard0, shard1, shard2 all have cursors. Config cursor open.
      3. Insert one document into each DB, verify all 3 events received.

      File 2: Ignore Removed Shards Mode (change_stream_all_databases_v2_ignore_removed_shards_whitebox.js)

      Fixture: 4 shards (shard3 is kept as a spare so a shard survives the removals). Tags: as in File 1, plus config_shard_incompatible and resource_intensive.

      Test 2.1: Multi-database setup, single shard removed — bounded then unbounded

      Goal: Verify the core IRS lifecycle with a multi-database whole-cluster placement. A bounded (degraded) segment targets only the surviving shard, then an unbounded (normal) segment opens a config server cursor.

      Setup:

      1. Create DB1 with primary on shard0. Create an unsharded collection in DB1, insert doc A.
      2. Create DB2 with primary on shard1. Shard a collection in DB2 with key {_id: 1}, split chunks across shard0 and shard1. Insert doc B on shard0, doc C on shard1.
      3. Record startAtOperationTime = T1.
      4. Insert doc D on shard1 (into DB2's collection — provides an event in the bounded segment).
      5. Move DB2's chunks off shard0 to shard1. Move DB1's primary to shard1 (so shard0 is fully drained).
      6. Remove shard0.

      Segment analysis:

      • At T1: whole-cluster placement = [shard0, shard1] (DB1 on shard0, DB2 on shard0+shard1). shard0 removed → shards=[shard1], bounded [T1, T_drain).
      • At T_drain: placement = [shard1], no removed shard → unbounded.
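      The segment derivations in these analyses (including the empty-placement skip in Test 2.2 and the new-shard discovery in Test 2.3) can be sanity-checked with a toy model. This is purely an illustration of the expected behavior, not the server's fetcher code; the data shapes are invented here.

```javascript
// Toy model: derive expected IRS segments from placement history points
// ({at, shards}) and the set of removed shard ids.
function deriveSegments(history, removedShards) {
    const segments = [];
    for (let i = 0; i < history.length; i++) {
        const {at, shards} = history[i];
        const surviving = shards.filter(s => !removedShards.includes(s));
        // Every shard holding data at this point was removed: the fetcher
        // skips forward to the next placement change (Test 2.2).
        if (surviving.length === 0) continue;
        const degraded = surviving.length < shards.length;
        const next = i + 1 < history.length ? history[i + 1].at : null;
        segments.push({
            openCursorAt: at,
            shards: surviving,
            // Bounded (degraded) segments carry the next boundary; normal
            // segments are unbounded and open a config server cursor instead.
            nextPlacementChangedAt: degraded ? next : null,
        });
    }
    return segments;
}
```

      Feeding in Test 2.1's history ([shard0, shard1] at T1, [shard1] at T_drain, shard0 removed) yields a bounded segment on shard1 followed by an unbounded one, matching the analysis above.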

      Steps:

      1. Snapshot log offset on mongos.
      2. Open cluster-wide v2 IRS stream from T1 with comment: "test_2_1".
      3. Assert log (first segment): ID 11138108 on mongos with:
        • shards containing only shard1.
        • nextPlacementChangedAt set (bounded).
      4. Assert cursors (bounded segment): Cursor on shard1 only. No cursor on shard0 (removed), shard2, shard3. No config server cursor.
      5. Verify events from bounded segment: doc D from shard1 arrives. Doc A (DB1 on shard0) and doc B (DB2 chunk on shard0) are lost.
      6. Snapshot log offset.
      7. Wait for the segment transition past T_drain.
      8. Assert log (second segment): ID 11138108 with nextPlacementChangedAt absent (unbounded).
      9. Assert cursors (unbounded segment): Cursor on shard1. Config server cursor now open.
      10. Insert doc E on shard1, verify event received.
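      The bounded/unbounded distinction asserted in steps 3 and 8 reduces to the presence of nextPlacementChangedAt in the log entry. A minimal classifier, assuming the usual LOGV2 JSON shape ({id, attr: {...}}; field names here are assumptions):

```javascript
// Classify an ID 11138108 log entry: bounded segments carry
// nextPlacementChangedAt in attr, unbounded segments omit it.
function segmentKind(logEntry) {
    const attr = logEntry.attr || {};
    return "nextPlacementChangedAt" in attr ? "bounded" : "unbounded";
}
```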

      Test 2.2: All original data shards removed — segment skips to new placement

      Goal: Verify that when all shards that had data at T1 are removed, the fetcher's internal loop skips forward to the first timestamp where a surviving shard has data. The openCursorAt value advances past T1, and events from the skipped range are lost.

      Setup:

      1. Create DB with primary on shard0. Create unsharded collection. Insert doc A on shard0.
      2. Record startAtOperationTime = T1.
      3. Move primary to shard3 (spare shard) at time ~T_move. This moves data to shard3.
      4. Remove shard0.

      Segment analysis:

      • At T1: placement = [shard0]. shard0 removed → surviving = [] (empty). Fetcher loops: skips to T_move.
      • At T_move: placement = [shard3], no removed shard → single unbounded segment starting at T_move.
      • Range [T1, T_move) is silently skipped.

      Steps:

      1. Snapshot log offset on mongos.
      2. Open cluster-wide v2 IRS stream from T1 with comment: "test_2_2".
      3. Assert log: ID 11138108 with:
        • openCursorAt = T_move (NOT T1 — demonstrating the skip).
        • shards containing shard3.
        • nextPlacementChangedAt absent (unbounded).
      4. Assert cursors: Cursor on shard3. Config server cursor open. No cursors on shard0, shard1, shard2.
      5. Verify no events from [T1, T_move) are returned (those were on removed shard0).
      6. Insert doc B on shard3, verify event received.

      Test 2.3: Segments discover new shard — data migrated after T1 is not missed

      Goal: Prove that reading in bounded segments is necessary for correctness. Without segments, data that migrated to a new shard (shard2) after the stream's start time would be invisible because shard2 wasn't in the original placement. The segment boundary forces re-evaluation, discovering shard2.

      Setup:

      1. Create DB1 with primary on shard0. Create unsharded collection, insert doc A on shard0.
      2. Create DB2 with primary on shard0. Shard a collection in DB2 with key {_id: 1}, split chunks across shard0 and shard1. Insert doc B on shard0, doc C on shard1.
      3. Record startAtOperationTime = T1.
      4. Insert doc D on shard1 (into DB2's collection).
      5. Move DB2's chunks from shard1 to shard2 at time ~T_move. This creates a placement change: shard1 drops out of DB2's placement, shard2 enters.
      6. Insert doc E on shard2 (into DB2's collection, post-move).
      7. Remove shard1.

      Segment analysis:

      • At T1: whole-cluster placement = [shard0, shard1] (DB1 on shard0, DB2 on shard0+shard1). shard1 removed → shards=[shard0], bounded [T1, T_move).
      • At T_move: placement = [shard0, shard2] (DB1 on shard0, DB2 on shard0+shard2). No removed shard → shards=[shard0, shard2], unbounded.

      The key point: shard2 was NOT in the placement at T1. If cursors were opened only based on T1's placement minus removed shards, we'd have [shard0] and never see doc E on shard2. The segment boundary at T_move forces re-evaluation, discovering shard2.
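      The contrast can be made concrete with toy data (illustrative only, not server code): targeting from the T1 placement minus removed shards never includes shard2, while re-evaluating at the segment boundary does.

```javascript
// Placements from Test 2.3's setup; shard1 is removed after T_move.
const placementAtT1 = ["shard0", "shard1"];
const removed = ["shard1"];
const placementAtTMove = ["shard0", "shard2"];

// Naive targeting: T1 placement minus removed shards. shard2 is invisible.
const naiveTargets = placementAtT1.filter(s => !removed.includes(s));

// Segment-based targeting: union of the shards targeted per segment.
const segmentTargets = [...new Set([...naiveTargets, ...placementAtTMove])];
```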

      Steps:

      1. Snapshot log offset.
      2. Open cluster-wide v2 IRS stream from T1 with comment: "test_2_3".
      3. Assert log (segment 1): ID 11138108 with shards=[shard0], nextPlacementChangedAt ~= T_move.
      4. Assert cursors (segment 1): Cursor on shard0 only. No cursor on shard1 (removed), shard2 (not yet in placement). No config cursor (bounded).
      5. Verify segment 1 events: doc A from shard0 (DB1). Doc D from shard1 lost (shard1 removed).
      6. Snapshot log offset.
      7. Wait for the segment transition at T_move.
      8. Assert log (segment 2): ID 11138108 with shards=[shard0, shard2], nextPlacementChangedAt absent.
      9. Assert cursors (segment 2): Cursors on shard0 AND shard2. Config cursor open.
      10. Verify doc E from shard2 is received — proving that segment-based reading discovered shard2.
      11. Insert doc F on shard2, verify event.

      Test 2.4: Multiple databases on different shards — only surviving shard's events returned

      Goal: Verify all-databases-specific behavior in IRS mode: with databases on different shards, removing one shard causes only that shard's database events to be lost, while other databases' events on surviving shards are preserved.

      Setup:

      1. Create DB1 with primary on shard0. Create unsharded collection, insert doc A.
      2. Create DB2 with primary on shard1. Create unsharded collection, insert doc B.
      3. Record startAtOperationTime = T1.
      4. Insert doc C into DB2 (on shard1).
      5. Move DB1's primary to shard1 (so shard0 is drained). Remove shard0.

      Segment analysis:

      • At T1: whole-cluster placement = [shard0, shard1] (DB1 on shard0, DB2 on shard1). shard0 removed → shards=[shard1], bounded.
      • After boundary: only shard1 in placement, unbounded.

      Steps:

      1. Snapshot log offset.
      2. Open cluster-wide v2 IRS stream from T1 with comment: "test_2_4".
      3. Assert log: ID 11138108 with shards=[shard1].
      4. Assert cursors (bounded segment): Cursor on shard1 only. No config cursor.
      5. Verify only DB2's events returned: doc B and doc C. Doc A (DB1 on shard0) lost.
      6. Wait for the transition to the unbounded segment.
      7. Assert cursors: shard1 + config cursor.
      8. Insert doc D into DB2, verify event received.

            Assignee:
            Lyublena Antova
            Reporter:
            Denis Grebennicov
            Votes:
            0
            Watchers:
            1
