[SERVER-45830] Add failpoint to allow InitialSyncTest fixture to pause initial syncing node after cloning some documents Created: 28/Jan/20  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: William Schultz (Inactive) Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 0
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-32903 Ambiguous field name error should be ... Closed
is related to SERVER-45827 Expand initial sync fuzzer grammar to... Backlog
Assigned Teams:
Replication
Participants:

 Description   

The initial sync fuzzer currently pauses initial sync before running the 'listDatabases', 'listCollections', and 'listIndexes' commands for each database or collection that is being cloned. It does not, however, pause the syncing node at any point while documents are actually being fetched inside the CollectionCloner. This can prevent it from deterministically reproducing certain bugs that may occur during the collection cloning process. For example, if the sync source contains a document {_id: 1}, which is cloned by the initial syncing node, and then the sync source deletes {_id: 1} and re-inserts it before the clone has finished for that collection, the syncing node may clone the document a second time. Being able to deterministically reproduce cases like this would be a helpful improvement to our initial sync test infrastructure.
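
The double-clone scenario above can be illustrated with a small toy model. This is only a sketch and not MongoDB code: ToySyncSource, clone_collection, and the pause_after_batch hook are hypothetical names, and the model assumes that a re-inserted document lands later in the scan order than the position the cloner has already passed. The hook stands in for a failpoint that pauses the cloner after each batch so the test can mutate the sync source at a known point.

```python
from typing import Callable, Dict, List, Optional, Tuple


class ToySyncSource:
    """Toy stand-in for a sync source collection scanned in insertion order."""

    def __init__(self) -> None:
        self._next_record_id = 1
        self._records: Dict[int, dict] = {}  # record id -> document

    def insert(self, doc: dict) -> None:
        # Assumption: every insert gets a fresh, larger record id, so a
        # re-inserted document appears at the end of the scan order.
        self._records[self._next_record_id] = dict(doc)
        self._next_record_id += 1

    def delete(self, doc_id: int) -> None:
        for record_id, doc in list(self._records.items()):
            if doc["_id"] == doc_id:
                del self._records[record_id]

    def fetch_batch(self, after_record_id: int, batch_size: int) -> List[Tuple[int, dict]]:
        # Return up to batch_size documents located after the given record id.
        record_ids = sorted(r for r in self._records if r > after_record_id)
        return [(r, self._records[r]) for r in record_ids[:batch_size]]


def clone_collection(
    source: ToySyncSource,
    batch_size: int,
    pause_after_batch: Optional[Callable[[int], None]] = None,
) -> List[dict]:
    """Clone all documents in batches, calling the hook after each batch."""
    cloned: List[dict] = []
    last_record_id = 0
    batch_number = 0
    while True:
        batch = source.fetch_batch(last_record_id, batch_size)
        if not batch:
            return cloned
        for record_id, doc in batch:
            cloned.append(doc)
            last_record_id = record_id
        batch_number += 1
        if pause_after_batch is not None:
            # Hypothetical pause point: the test regains control here.
            pause_after_batch(batch_number)


source = ToySyncSource()
source.insert({"_id": 1})
source.insert({"_id": 2})


def delete_and_reinsert(batch_number: int) -> None:
    # While the cloner is "paused" after the first batch, delete {_id: 1}
    # on the sync source and insert it again.
    if batch_number == 1:
        source.delete(1)
        source.insert({"_id": 1})


cloned = clone_collection(source, batch_size=1, pause_after_batch=delete_and_reinsert)
cloned_ids = [doc["_id"] for doc in cloned]
print(cloned_ids)
```

Because the pause point is fixed (after the first batch), the interleaving is the same on every run, which is the kind of reproducibility the proposed failpoint is meant to give the fuzzer. In this toy model the document with _id 1 is cloned twice.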



 Comments   
Comment by William Schultz (Inactive) [ 31/Jan/20 ]

judah.schvimer It's not strictly required for us to be able to catch the bug described in SERVER-32903 in the initial sync fuzzer (you can see that the original bugs were produced in BF-7257), but, as discussed with samy.lanka, this additional failpoint would make more types of initial sync bugs deterministically reproducible. For example, to repro the bug (2) described here we must pause while fetching documents during collection clone (after we have run listIndexes). So, I would say this is a debuggability improvement, i.e., we do have suites that can catch bugs like those in SERVER-32903, but it is harder to debug and reproduce them since they are non-deterministic.

Comment by Judah Schvimer [ 31/Jan/20 ]

william.schultz, is this ticket required for the initial sync fuzzer to catch bugs like SERVER-32903?

Comment by William Schultz (Inactive) [ 28/Jan/20 ]

Implementing both this change and SERVER-45827 would hopefully allow the initial sync fuzzer to give us both the operational diversity of the existing jstestfuzz_replication_initsync suites and a high degree of deterministic reproducibility.

samy.lanka noted that controlling the index building process in initial sync could be another source of potential non-determinism, but I believe that for bugs like the one described in SERVER-32903, the state of an index build during initial sync cloning doesn't affect whether we throw an error when inserting particular keys. That might be a detail of how the CollectionBulkLoader works, though, and I would have to verify my understanding there.

Comment by William Schultz (Inactive) [ 28/Jan/20 ]

This improvement should allow the initial sync fuzzer to deterministically reproduce bugs like number (2) mentioned here in SERVER-32903. Note that the initialSyncHangCollectionClonerAfterHandlingBatchResponse failpoint may already provide the necessary server functionality for this improvement.
