[SERVER-59687] Investigate failing parallel shell in jstests/replsets/rollback_waits_for_bgindex_completion.js Created: 31/Aug/21 Updated: 29/Oct/23 Resolved: 30/Sep/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 5.1.0-rc0 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Richard Samuels (Inactive) | Assignee: | Adi Zaimi |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Steps To Reproduce: | Remove checkExitSuccess: false from jstests/replsets/rollback_waits_for_bgindex_completion.js:94 and run it. |
||||||||||||||||||||
| Sprint: | Repl 2021-10-04 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Description |
|
As a part of this ticket, we modified the jstests/replsets/rollback_waits_for_bgindex_completion.js to run the function, which checks the exit code of the parallel shell, and throws an error if it's not 0. It began to fail with the following assertion:
The parallel shell is launched in jstests/replsets/rollback_waits_for_bgindex_completion.js:93. (Off commit e502f2d3965ac4147d303e956a582b7c4eef8232) Here's the whole stack trace of when the parallel shell is launched:
STM has set the checkExitSuccess flag to false on the cleanup function of the parallel shell to prevent the error from causing these tests to go red and to preserve existing semantics. We'd like someone to investigate if the parallel shell failure is expected (in which case checkExitSuccess should remain false), or if it's unexpected and the test needs to be modified. |
| Comments |
| Comment by Vivian Ge (Inactive) [ 06/Oct/21 ] | ||||||||||||||||||||||||||||||
|
Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you! | ||||||||||||||||||||||||||||||
| Comment by Githook User [ 27/Sep/21 ] | ||||||||||||||||||||||||||||||
|
Author: {'name': 'Adi Zaimi', 'email': 'adizaimi@yahoo.com', 'username': 'adizaimi'}Message: | ||||||||||||||||||||||||||||||
| Comment by Adi Zaimi [ 23/Sep/21 ] | ||||||||||||||||||||||||||||||
|
According to https://docs.mongodb.com/v4.4/core/index-creation/#build-failure-and-recovery the shell command should expect to get the "InterruptedDueToReplStateChange" error and as such we need to call IndexBuildTest.startIndexBuild() with ErrorCodes.InterruptedDueToReplStateChange. Whether the test is correct in the way rollback waits/does not wait for the index build to be completed, is a matter for a further discussion. | ||||||||||||||||||||||||||||||
| Comment by Adi Zaimi [ 23/Sep/21 ] | ||||||||||||||||||||||||||||||
|
From the previous comment, "rollback intends to wait for indexbuild to be completed (aborted?), but the abort has already occurred, so there is nothing to wait for". Not sure if this is the intended behavior--maybe the abort happens too quickly so the wait does nothing here. Maybe we should put a sleep in the abort process so the wait code actually would get exercised. | ||||||||||||||||||||||||||||||
| Comment by Adi Zaimi [ 23/Sep/21 ] | ||||||||||||||||||||||||||||||
|
This is in the log right before the bg shell prints the above error:
Which means that the primary put the index build in background upon stepdown. Then we get the rollback activity which apparently triggers the abort of the indexbuild (look for "Stopping index builds before rollback"):
Notice the last two lines, rollback intends to wait for indexbuild to be completed (aborted?), but the abort has already occurred, so there is nothing to wait for. The next primary seems to want to start index build (notice the timestamp, its prior to the above block):
However the original node finds the previous attempt and initiates the resume:
Later indexbuild starts and completes in the other nodes. | ||||||||||||||||||||||||||||||
| Comment by Adi Zaimi [ 22/Sep/21 ] | ||||||||||||||||||||||||||||||
|
Here is the output of that background job:
Output at line 11: | ||||||||||||||||||||||||||||||
| Comment by Adi Zaimi [ 22/Sep/21 ] | ||||||||||||||||||||||||||||||
|
Documenting for self reference: Key info is that startIndexBuild calls startParallelShell() to run the index build operation in the background. Here is what is occuring: uncaught exception: Error: [0] != [252] are not equal : encountered an error in the parallel shell We perform the assert if options.checkExitSuccess would be set to true from caller in that's an object created at line 43 bgIndexThread = IndexBuildTest.startIndexBuild(node, testColl.getFullName(), {x: 1}); Changing the parameter checkExitSuccess: true causes my local test to fail same as the tkt. Added debug to figure out what happens to the background index builder pid. However turns out that's unnecessary because that shell prints out to the same log info prepended with sh12345 (12345 being its PID). |