[SERVER-60428] Investigate effectiveness of test jstests/replsets/rollback_waits_for_bgindex_completion.js Created: 04/Oct/21 Updated: 09/Nov/21 Resolved: 09/Nov/21 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Adi Zaimi | Assignee: | Adi Zaimi |
| Resolution: | Duplicate | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||||||
| Sprint: | Repl 2021-10-18, Repl 2021-11-01, Replication 2021-11-15 | ||||||||||||||||||||||||
| Participants: | |||||||||||||||||||||||||
| Description |
|
Test jstests/replsets/rollback_waits_for_bgindex_completion.js runs an index build in the background while a transaction is waiting to be committed, leader steps down, and rollsback the txn. Test verifies that rollback should wait for the index build to be aborted but it seems from the logs that there is no wait happening--rollback proceeds immediately because abort of index build has occurred earlier. We should investigate if the current behavior makes for a correct test, and possibly introduce a delay in aborting the index build so rollback effectively waits for it. Please refer to ticket https://jira.mongodb.org/browse/SERVER-59687 for more information. Relevant log portion:
Notice the last two lines, rollback intends to wait for indexbuild to be completed (aborted?), but the abort has already occurred, so there is nothing to wait for. |
| Comments |
| Comment by Adi Zaimi [ 09/Nov/21 ] | |||||||||||
|
Closing this ticked and any followup will be done as part of | |||||||||||
| Comment by Adi Zaimi [ 09/Nov/21 ] | |||||||||||
|
It seems that single-phase index build may no longer be a use case we intend to support: I asked on the storage execution channel and Louis Williams informed me that '“Single-phase” index builds replicate a single createIndexes oplog entry and they’re really only used internally for index builds on empty collections. Also for tenant migrations, but we really don’t want to continue supporting that use case. ' I created | |||||||||||
| Comment by Adi Zaimi [ 15/Oct/21 ] | |||||||||||
|
Discussed further with Samy and there is one detail that may be the key to this issue, in rollback_impl.cpp:
where stopIndexBuildsForRollback() checks
From the log we have a two-phase index build (which is the default): [js_test:rollback_waits_for_bgindex_completion] d23280| 2021-08-31T08:44:43.472+00:00 I STORAGE 20650 [BackgroundSync] "Active index builds","attr":{"context":"IndexBuildsCoordinator::stopIndexBuildsForRollback","builds":[{"buildUUID":{"uuid":{"$uuid":"5ecf792a-24b2-4b5f-bfef-6aa5c164d317"}},"collectionUUID":{"uuid":{"$uuid":"0e8e269d-2a5b-4127-8939-71bfa6036243"}},"indexNames":["x_1"],"protocol":"two phase"}]} And the code after we print "Waiting for all background operations to complete before starting rollback" waits for index builds to finish only in the one phase case. So for the default two-phase case ,the next line printed immediately after "Finished waiting for background operations to complete before rollback" is the correct code behavior. It seems that the remaining issue we need to address is whether this test is valid to run with two-phase index build as the default. Or perhaps we can change the default for this test to run with one phase index build. | |||||||||||
| Comment by Adi Zaimi [ 11/Oct/21 ] | |||||||||||
|
Looked at this more carefully. Added a new failpoint to stop/pause right after we print "Attempting to abort index build" in abortIndexBuildByBuildUUID(). I tried simply sleeping for 3 (or 5) sec at that point and it was still the case that the rollback code waited the 3 (or 5) extra seconds and the "Finished waiting for background..." line is still printed immediately after "Waiting for all background..." – that is we don't wait there the 3 seconds. I also tried using pauseWhileSet() on that new failpoint in abortIndexBuildByBuildUUID() so the rollback would wait until we turn that failpoint off in the test, however that approach resulted in a infinite loop in the server waiting for another failpoint to be turned off ("ctx":"BackgroundSync","msg":"bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled."). |