[SERVER-42308] Improve synchronization between two fail points Created: 19/Jul/19  Updated: 08/Jan/24

Status: Backlog
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Jason Chan Assignee: Backlog - Replication Team
Resolution: Unresolved Votes: 1
Labels: former-quick-wins
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-42471 Add the new waitForFailpoint mechanis... Closed
Duplicate
is duplicated by SERVER-43748 Convenient synchronization between te... Closed
is duplicated by SERVER-39165 Add waitForFailpoint command Closed
Assigned Teams:
Replication
Participants:

 Description   

Currently, to synchronize two fail points we have to call checkLog, and sometimes clearLog, to detect when a fail point has been reached. Only then do we know it is safe to run test assertions that validate the intermediate states of an operation.

The existing syntax is often verbose and unintuitive. We propose extending the current configureFailPoint command so that relationships between fail points can be specified explicitly: a fail point will be able to signal other fail points, and to wait for specific signals to be broadcast before unblocking itself.

Proposed syntax:

{configureFailPoint: "failpoint1", mode: "<mode>", sync: {signals: [signal1, signal2], waitFor: [signal3, signal4], timeout: 100, clearSignal: <true/false>}}

With the above syntax, failpoint1 will emit signal1 and signal2, then block until signal3 and signal4 are broadcast or until it times out after 100 seconds. The clearSignal boolean indicates whether signal3 and signal4 should be cleared once they are consumed.
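To make the intended semantics concrete, here is a minimal Python sketch of the emit-then-wait behavior described above. It is a toy model under stated assumptions, not the server implementation: SignalRegistry and run_fail_point are hypothetical names, signals are modeled as a shared set of strings, and the timeout is taken to be in seconds as in the description.

```python
import threading


class SignalRegistry:
    """Toy model of named signals shared between fail points (hypothetical)."""

    def __init__(self):
        self._cond = threading.Condition()
        self._signals = set()

    def emit(self, names):
        # Record the signals and wake every waiter so it can re-check.
        with self._cond:
            self._signals.update(names)
            self._cond.notify_all()

    def wait_for(self, names, timeout, clear_signal=False):
        # Block until all named signals are present or `timeout` seconds
        # elapse. Returns True on success and False on timeout.
        wanted = set(names)
        with self._cond:
            ok = self._cond.wait_for(lambda: wanted <= self._signals, timeout)
            if ok and clear_signal:
                # Models clearSignal: consumed signals are removed.
                self._signals -= wanted
            return ok

    def is_set(self, name):
        with self._cond:
            return name in self._signals


def run_fail_point(registry, signals, wait_for, timeout, clear_signal):
    # Same order as the description: emit first, then block.
    registry.emit(signals)
    return registry.wait_for(wait_for, timeout, clear_signal)


if __name__ == "__main__":
    registry = SignalRegistry()
    result = {}

    def failpoint1():
        result["unblocked"] = run_fail_point(
            registry, ["signal1", "signal2"], ["signal3", "signal4"],
            timeout=5, clear_signal=True)

    worker = threading.Thread(target=failpoint1)
    worker.start()
    # The "test" side waits until failpoint1 has been reached, then releases it.
    registry.wait_for(["signal1", "signal2"], timeout=5)
    registry.emit(["signal3", "signal4"])
    worker.join()
    print(result["unblocked"], registry.is_set("signal3"))
```

In this model a second fail point, or the test itself, can wait on signal1 and signal2 instead of polling the log, which is the synchronization the proposal aims to replace checkLog with.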

Reference: https://mariadb.com/kb/en/the-debug-sync-facility/



 Comments   
Comment by Siyuan Zhou [ 25/Oct/19 ]

judah.schvimer, yes, but not urgent. I expect we'll get a clearer picture of the use cases to guide our design after Lingzhi reviews the test changes of SERVER-39165. SERVER-39165 is about the synchronization between the client and the server. It's possible to synchronize two fail points in the server with a parallel shell. That is something to watch out for.

Once we have concrete cases where fail point synchronization in the server is helpful, we should design this ticket mainly around them.

Comment by Judah Schvimer [ 25/Oct/19 ]

lingzhi.deng, jason.chan, and siyuan.zhou, is there further desire for this after SERVER-39165?

Comment by Mira Carey [ 02/Aug/19 ]

judah.schvimer, I think that makes sense. I'll reopen SERVER-39165, and see if we end up getting to it before you do.

Comment by Judah Schvimer [ 02/Aug/19 ]

mira.carey@mongodb.com, what is your estimate for the time to do this work? Should we re-open SERVER-39165 so you can prioritize it in parallel since we don't want to do it as currently designed now?

Comment by Jason Chan [ 02/Aug/19 ]

We have a better idea for the design of this ticket after an initial review from mira.carey@mongodb.com. Putting it back to Needs Triage to better prioritize this work.

Generated at Thu Feb 08 05:00:09 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.