[SERVER-39165] Add waitForFailpoint command Created: 23/Jan/19  Updated: 29/Oct/23  Resolved: 25/Oct/19

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 4.3.1, 4.2.2

Type: Task Priority: Major - P3
Reporter: Matthew Saltz (Inactive) Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 3
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
is depended on by SERVER-42471 Add the new waitForFailpoint mechanis... Closed
Duplicate
duplicates SERVER-42308 Improve synchronization between two f... Backlog
Related
related to SERVER-48171 replace OperationContext with Interru... Closed
is related to SERVER-44251 Replace setFailPoint and clearFailPoi... Closed
Backwards Compatibility: Fully Compatible
Backport Requested:
v4.2
Sprint: Sharding 2019-02-11, Sharding 2019-10-21, Sharding 2019-11-04
Participants:
Linked BF Score: 50

 Description   

Right now, if we want to wait for a failpoint to be hit from a jstest, we have to parse log output and wait until the failpoint emits a particular log message. It would be much easier if we had a waitForFailpoint command that blocks until the failpoint is hit. This would require some barrier-like functionality on the failpoint, which could be useful in its own right.
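
For context, a minimal sketch of the log-scraping workaround this describes, assuming the shell's checkLog helper is available; the fail point name and log message below are illustrative, not taken from this ticket:

// Current workaround: enable a fail point, then poll the server log until
// the message it happens to emit when hit shows up. Fragile, because the
// test now depends on the exact wording of a log line.
const conn = MongoRunner.runMongod({});
assert.commandWorked(conn.adminCommand(
    {configureFailPoint: "hangAfterCollectionInserts", mode: "alwaysOn"}));
// Polls the log until the message appears or the helper times out.
checkLog.contains(conn, "hangAfterCollectionInserts fail point enabled");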



 Comments   
Comment by Githook User [ 31/Oct/19 ]

Author: Cheahuychou Mao (cheahuychou, cheahuychou.mao@mongodb.com)

Message: SERVER-39165 Add waitForFailpoint command and update non-repl tests

(cherry picked from commit f79498e9aa27765fc363fcb85a65f47af6e20ee1)
Branch: v4.2
https://github.com/mongodb/mongo/commit/f8fadc135f6c8737d206a1660a0bd9cb66c97a42

Comment by Githook User [ 25/Oct/19 ]

Author: Cheahuychou Mao (cheahuychou, cheahuychou.mao@mongodb.com)

Message: SERVER-39165 Add waitForFailpoint command and update other repl tests
Branch: master
https://github.com/mongodb/mongo/commit/8fc28f0773ca1efb0a43cc5590b9ca8b9e50559e

Comment by Githook User [ 25/Oct/19 ]

Author: Cheahuychou Mao (cheahuychou, cheahuychou.mao@mongodb.com)

Message: SERVER-39165 Add waitForFailpoint command and update initial sync tests
Branch: master
https://github.com/mongodb/mongo/commit/99d90208c67351469dae5983ed233638e61cf732

Comment by Githook User [ 25/Oct/19 ]

Author: Cheahuychou Mao (cheahuychou, cheahuychou.mao@mongodb.com)

Message: SERVER-39165 Add waitForFailpoint command and update non-repl tests
Branch: master
https://github.com/mongodb/mongo/commit/f79498e9aa27765fc363fcb85a65f47af6e20ee1

Comment by Siyuan Zhou [ 11/Oct/19 ]

I feel that having configureFailPoint always reset the counter to 0 is more intuitive, since the fail point then behaves more like a "semaphore" after the reset. As for the first concern, I cannot think of a case where waiting on the same fail point more than once without resetting it is necessary.

Here are the three use cases lingzhi.deng and I discussed, assuming we extend the design in SERVER-42308 and expose the counter:

  1. One thread goes through fail points A and B while another thread goes through fail points C and D, and we want C-D to happen between A and B. This is straightforward: have C wait on the counter of A, and have B wait on the counter of D (see the sketch after this list).
  2. Use two fail points A and B as a barrier: when both of them have been hit, the program can continue. The two fail points need to wait on each other. With the counter design this is not straightforward unless we set A, then B, then A again.
  3. Have two code sections A and B in different threads proceed in lock step, or with B always after an A; in regular-expression terms, (AB)+ or (A+B)+. More than two fail points may be needed, and this cannot be supported by exposing counters alone. I cannot think of a concrete use case right now, though.
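
A minimal sketch of use case 1, assuming configureFailPoint reports the current hit count in a "count" reply field and waitForFailPoint takes a timesEntered argument, as proposed in the design comment below; the fail point names A and D and the two-thread structure are illustrative:

// Both fail points are configured up front; each reply carries the number
// of times its fail point has already been entered.
const aRes = assert.commandWorked(
    conn.adminCommand({configureFailPoint: "A", mode: "alwaysOn"}));
const dRes = assert.commandWorked(
    conn.adminCommand({configureFailPoint: "D", mode: "alwaysOn"}));

// Thread 2: hold C back until A has been hit once more.
assert.commandWorked(conn.adminCommand(
    {waitForFailPoint: "A", timesEntered: aRes.count + 1}));
// ... thread 2 runs through C and then D ...

// Thread 1: hold B back until D has been hit once more.
assert.commandWorked(conn.adminCommand(
    {waitForFailPoint: "D", timesEntered: dRes.count + 1}));
// ... thread 1 runs through B ...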

My point is that we should try to find an interface that doesn't conflict with SERVER-42308. But I also agree that, without an audit of the existing complex uses of fail points, making design choices is mostly guesswork. This ticket is definitely a step forward either way.

Comment by Mira Carey [ 09/Oct/19 ]

There are a couple of things that returning a count helps avoid:

  1. Making the call racy. I.e., if you only start waiting for a thread to enter the fail point after calling wait, you can miss a "notification" that lands between calling configure and calling wait (see the sketch after this list).
  2. Making the call stateful. I'd thought about letting you reset the counters (then you just look for non-zero entry counts), but I think the design is slightly less flexible and more error-prone in that model.
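
To make point 1 concrete, here is a sketch of the race-free pattern; the "count" reply field, the fail point name "fp", and the startWorkThatHitsFailPoint helper are assumptions for illustration:

// The target count is fixed as soon as configure returns, so a hit that
// lands between configure and wait still satisfies the wait:
// waitForFailPoint checks "entered at least N times", not "entered after
// this call was issued".
const res = assert.commandWorked(conn.adminCommand(
    {configureFailPoint: "fp", mode: "alwaysOn"}));
const target = res.count + 1;      // "count" field assumed in the reply
startWorkThatHitsFailPoint();      // hypothetical; may hit "fp" immediately
assert.commandWorked(conn.adminCommand(
    {waitForFailPoint: "fp", timesEntered: target}));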

I'm also interested in making the minimum diff necessary here. There was a lot of complexity in that signaling design and no obvious way forward. I think adding this count, and leaving its orchestration more or less to the client, is least likely to conflict with whatever you end up pulling together. And if we decide later on that reusing this counter (but with zeroing) is the right thing to do, I'd rather pay that cost then than engineer for the possibility now (because I'm guessing the cost won't be all that high).

Comment by Lingzhi Deng [ 08/Oct/19 ]

Is there any particular reason we want to expose the count to the client? Or are there particular reasons or use cases for not resetting the count on configureFailPoint? I ask because I still hope SERVER-42308 can be (re)implemented on top of your waitForFailPoint design, based on the new atomic counter.

Comment by Mira Carey [ 07/Oct/19 ]

Laying out current thoughts on design here:

  1. Introduce a new atomic on every fail point: a counter of how many times the fail point has actually been entered. It should be incremented whenever _slowShouldFailOpenBlock() returns slowOn.
  2. Change configureFailPoint to return this value every time it's called.
  3. Add a waitForFailPoint command that requires a count argument. The command returns once the specified fail point has been entered that many times.

That would then give a flow like:

// configureFailPoint's reply is a document; the entry count is assumed
// here to be in its "count" field.
let res = db.adminCommand({configureFailPoint: failPoint, mode: "alwaysOn"});
let count = res.count;
doStuffBeforeFailPoint();
// Blocks until the fail point has been entered one more time.
db.adminCommand({waitForFailPoint: failPoint, timesEntered: count + 1});
doStuffAfterFailPoint();
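
One way such a flow might be wrapped for reuse in jstests (a hypothetical helper sketch, not a shipped interface; the "count" reply field is again an assumption):

function configureAndWait(conn, failPointName) {
    // Configure the fail point and capture how many times it has already
    // been entered ("count" field assumed in the reply).
    const res = assert.commandWorked(conn.adminCommand(
        {configureFailPoint: failPointName, mode: "alwaysOn"}));
    return {
        // Block until the fail point has been entered once more.
        wait: function() {
            assert.commandWorked(conn.adminCommand(
                {waitForFailPoint: failPointName,
                 timesEntered: res.count + 1}));
        },
        // Turn the fail point off when the test is done with it.
        off: function() {
            assert.commandWorked(conn.adminCommand(
                {configureFailPoint: failPointName, mode: "off"}));
        }
    };
}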

Comment by Mira Carey [ 02/Aug/19 ]

Re-opening, as SERVER-42308 looks to be a little more complicated to implement. We may still do this first (or in addition to that ticket).

Comment by Mira Carey [ 24/Jul/19 ]

SERVER-42308 looks like a slightly more comprehensive ticket that subsumes this one.

Given that, I'm going to close this out as a duplicate and try to pick up the related functionality over in that ticket.
