[SERVER-60428] Investigate effectiveness of test jstests/replsets/rollback_waits_for_bgindex_completion.js Created: 04/Oct/21  Updated: 09/Nov/21  Resolved: 09/Nov/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Adi Zaimi Assignee: Adi Zaimi
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
is duplicated by SERVER-60427 Investigate effectiveness of test jst... Closed
is duplicated by SERVER-60819 remove jstests/replsets/rollback_wait... Closed
Related
related to SERVER-59687 Investigate failing parallel shell in... Closed
related to SERVER-60819 remove jstests/replsets/rollback_wait... Closed
Operating System: ALL
Sprint: Repl 2021-10-18, Repl 2021-11-01, Replication 2021-11-15
Participants:

 Description   

Test jstests/replsets/rollback_waits_for_bgindex_completion.js runs an index build in the background while a transaction is waiting to be committed, leader steps down, and rollsback the txn. Test verifies that rollback should wait for the index build to be aborted but it seems from the logs that there is no wait happening--rollback proceeds immediately because abort of index build has occurred earlier. We should investigate if the current behavior makes for a correct test, and possibly introduce a delay in aborting the index build so rollback effectively waits for it.

Please refer to ticket https://jira.mongodb.org/browse/SERVER-59687 for more information.

Relevant log portion:

[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":20655,   "ctx":"BackgroundSync","msg":"Index build: joined after abort","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"waitResult":{"code":0,"codeName":"OK"},"status":{"code":276,"codeName":"IndexBuildAborted","errmsg":"collection scan stopped. totalRecords: 1; durationMillis: 0ms; phase: collection scan; collectionScanPosition: RecordId(1); readSource: kMajorityCommitted :: caused by :: operation was interrupted"}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":20347,   "ctx":"BackgroundSync","msg":"Index build: aborted without cleanup","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"namespace":"dbWithBgIndex.coll"}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":4656002, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"unable to log operation","attr":{"error":{"code":276,"codeName":"IndexBuildAborted","errmsg":"operation was interrupted"}}}
[js_test:rollback_waits_for_bgindex_completion] d20032| {"t":{"$date":"2021-09-22T23:06:42.613+00:00"},"s":"I",  "c":"REPL",     "id":21079,   "ctx":"BackgroundSync","msg":"bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled."}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.617+00:00"},"s":"I",  "c":"INDEX",    "id":4841502, "ctx":"BackgroundSync","msg":"Index build: wrote resumable state to disk","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"namespace":"dbWithBgIndex.coll","details":{"_id":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"},"phase":"collection scan","collectionUUID":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"},"collectionScanPosition":1,"indexes":[{"fileName":"extsort-index.0-1004796179216808253","numKeys":1,"ranges":[{"startOffset":0,"endOffset":13,"checksum":4272028543}],"sideWritesTable":"internal-43--4107655658741736919","spec":{"v":2,"key":{"x":1},"name":"x_1"},"isMultikey":false,"multikeyPaths":[{"multikeyComponents":[]}]}]}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.617+00:00"},"s":"I",  "c":"ROLLBACK", "id":21595,   "ctx":"BackgroundSync","msg":"Waiting for all background operations to complete before starting rollback"}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.617+00:00"},"s":"I",  "c":"ROLLBACK", "id":21597,   "ctx":"BackgroundSync","msg":"Finished waiting for background operations to complete before rollback"}

Notice the last two lines, rollback intends to wait for indexbuild to be completed (aborted?), but the abort has already occurred, so there is nothing to wait for.



 Comments   
Comment by Adi Zaimi [ 09/Nov/21 ]

Closing this ticked and any followup will be done as part of SERVER-60819.

Comment by Adi Zaimi [ 09/Nov/21 ]

It seems that single-phase index build may no longer be a use case we intend to support: I asked on the storage execution channel and Louis Williams informed me that '“Single-phase” index builds replicate a single createIndexes oplog entry and they’re really only used internally for index builds on empty collections. Also for tenant migrations, but we really don’t want to continue supporting that use case. '

I created SERVER-60819 for storage execution to determine how to proceed with this test.

Comment by Adi Zaimi [ 15/Oct/21 ]

Discussed further with Samy and there is one detail that may be the key to this issue, in rollback_impl.cpp:

 350     // Aborts all active, two-phase index builds.
 351     [[maybe_unused]] auto stoppedIndexBuilds =
 352         IndexBuildsCoordinator::get(opCtx)->stopIndexBuildsForRollback(opCtx);

where stopIndexBuildsForRollback() checks

1349         if (IndexBuildProtocol::kSinglePhase == replState->protocol) {
1350             LOGV2(20659,
1351                   "Not stopping single phase index build",
1352                   "buildUUID"_attr = replState->buildUUID);
1353             return;
1354         }

From the log we have a two-phase index build (which is the default):

[js_test:rollback_waits_for_bgindex_completion] d23280| 2021-08-31T08:44:43.472+00:00 I STORAGE 20650 [BackgroundSync] "Active index builds","attr":{"context":"IndexBuildsCoordinator::stopIndexBuildsForRollback","builds":[{"buildUUID":{"uuid":{"$uuid":"5ecf792a-24b2-4b5f-bfef-6aa5c164d317"}},"collectionUUID":{"uuid":{"$uuid":"0e8e269d-2a5b-4127-8939-71bfa6036243"}},"indexNames":["x_1"],"protocol":"two phase"}]}

And the code after we print "Waiting for all background operations to complete before starting rollback" waits for index builds to finish only in the one phase case. So for the default two-phase case ,the next line printed immediately after "Finished waiting for background operations to complete before rollback" is the correct code behavior.

It seems that the remaining issue we need to address is whether this test is valid to run with two-phase index build as the default. Or perhaps we can change the default for this test to run with one phase index build.

Comment by Adi Zaimi [ 11/Oct/21 ]

Looked at this more carefully. Added a new failpoint to stop/pause right after we print "Attempting to abort index build" in abortIndexBuildByBuildUUID(). I tried simply sleeping for 3 (or 5) sec at that point and it was still the case that the rollback code waited the 3 (or 5) extra seconds and the "Finished waiting for background..." line is still printed immediately after "Waiting for all background..." – that is we don't wait there the 3 seconds.

I also tried using pauseWhileSet() on that new failpoint in abortIndexBuildByBuildUUID() so the rollback would wait until we turn that failpoint off in the test, however that approach resulted in a infinite loop in the server waiting for another failpoint to be turned off ("ctx":"BackgroundSync","msg":"bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled.").

Generated at Thu Feb 08 05:49:47 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.