[SERVER-47343] Add/update failpoints to make cancelCommand integration tests more robust Created: 06/Apr/20  Updated: 29/Oct/23  Resolved: 07/Apr/20

Status: Closed
Project: Core Server
Component/s: Internal Code
Affects Version/s: None
Fix Version/s: 4.4.0-rc0, 4.7.0

Type: Bug Priority: Major - P3
Reporter: Cheahuychou Mao Assignee: Cheahuychou Mao
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.4
Sprint: Service arch 2020-04-20
Participants:
Linked BF Score: 23

 Description   
  • It is incorrect for CancelRemotelyTimeout to wait for the _killOperations command issued by cancelCommand to be killed when it times out, since we don't kill operations without operation keys (i.e. without hedged options). The test rarely fails or hangs because the currentOp check and the _killOperations command both have a timeout of 1 second. However, there is a race between the start of the _killOperations operation and the start of the currentOp check: the currentOp check can return without waiting because no _killOperations operation is running yet, and the test then fails because the number of timed-out commands is zero. To solve this, we should remove the currentOp check and instead use a failpoint to wait for the _killOperations command to time out. That failpoint should be placed at the end of NetworkInterfaceTL::tryFinish and have a predicate that returns true only when the command name and error code it is configured with match the request's command name and the response status.
  • The failpoint "networkInterfaceAfterAcquireConn" was added to ensure that cancelCommand does not start running until the command has acquired a connection (otherwise, no _killOperations command will be issued). However, it does not check the command name, so other commands running in the background can also enter this failpoint if they happen to run while the test is in this block. To avoid this, we should replace the failpoint with a function similar to waitForCommand that runs currentOp repeatedly until a matching operation is running.
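
The two proposals above can be sketched in standalone C++ roughly as follows. This is only an illustration under stated assumptions, not the code from the actual commit: the types ErrorCode, FailPointFilter and FinishedCommand, and the functions shouldPauseAtTryFinish and waitForMatchingOp, are hypothetical stand-ins and do not use the server's real failpoint or currentOp APIs. The first function shows a predicate that matches on both command name and error code; the second shows a polling loop that keeps checking until a matching operation appears.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <string>
#include <thread>

// Hypothetical stand-in for the server's error codes (illustration only).
enum class ErrorCode { OK, NetworkInterfaceExceededTimeLimit, CallbackCanceled };

// Hypothetical data a test would configure the failpoint with.
struct FailPointFilter {
    std::string cmdName;
    ErrorCode errorCode;
};

// Hypothetical summary of a command that has just finished.
struct FinishedCommand {
    std::string cmdName;  // command name taken from the request
    ErrorCode status;     // status taken from the response
};

// Predicate for a failpoint at the end of a tryFinish-like function:
// true only when BOTH the command name and the error code match, so
// unrelated background commands never trigger it.
bool shouldPauseAtTryFinish(const FailPointFilter& filter, const FinishedCommand& cmd) {
    return filter.cmdName == cmd.cmdName && filter.errorCode == cmd.status;
}

// Polling helper in the spirit of the proposed waitForCommand-like function.
// `countMatchingOps` stands in for "run currentOp with a filter and count
// the matching operations". Returns true once at least one match is seen,
// or false after `maxAttempts` polls without a match.
bool waitForMatchingOp(const std::function<int()>& countMatchingOps,
                       int maxAttempts,
                       std::chrono::milliseconds interval) {
    assert(maxAttempts > 0);
    for (int attempt = 0; attempt < maxAttempts; ++attempt) {
        if (countMatchingOps() > 0) {
            return true;
        }
        std::this_thread::sleep_for(interval);  // back off before polling again
    }
    return false;
}
```

In this sketch, a test would configure the filter with the _killOperations command name and the timeout error code, then block until the predicate has fired. The polling helper would be called with a callback that filters currentOp output by command name, which avoids relying on a failpoint that any command can enter.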


 Comments   
Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'cheahuychou.mao@mongodb.com', 'username': 'cheahuychou'}

Message: SERVER-47343 Add/update failpoints to make cancelCommand integration tests more robust

(cherry picked from commit e03fe6ff798f50382122edd469743677d401e001)
Branch: v4.4
https://github.com/mongodb/mongo/commit/234f73aa0da80d40100fab6cafcac49b5d787853

Comment by Githook User [ 07/Apr/20 ]

Author:

{'name': 'Cheahuychou Mao', 'email': 'cheahuychou.mao@mongodb.com', 'username': 'cheahuychou'}

Message: SERVER-47343 Add/update failpoints to make cancelCommand integration tests more robust
Branch: master
https://github.com/mongodb/mongo/commit/e03fe6ff798f50382122edd469743677d401e001

Generated at Thu Feb 08 05:13:54 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.