[SERVER-59687] Investigate failing parallel shell in jstests/replsets/rollback_waits_for_bgindex_completion.js Created: 31/Aug/21  Updated: 29/Oct/23  Resolved: 30/Sep/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Richard Samuels (Inactive) Assignee: Adi Zaimi
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Gantt Dependency
has to be done after SERVER-58200 Asserting clauses do not cause a jste... Closed
Related
related to SERVER-59688 Investigate failing parallel shell in... Closed
is related to SERVER-60428 Investigate effectiveness of test jst... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

Remove checkExitSuccess: false from jstests/replsets/rollback_waits_for_bgindex_completion.js:94 and run it.

Sprint: Repl 2021-10-04
Participants:

 Description   

SERVER-58200 modified startParallelShell to ensure that whenever a parallel shell is started, the cleanup function returned by startParallelShell must be called, or the whole test will fail.

As a part of this ticket, we modified jstests/replsets/rollback_waits_for_bgindex_completion.js to run the cleanup function, which checks the exit code of the parallel shell and throws an error if it is not 0.

It began to fail with the following assertion:

[js_test:rollback_waits_for_bgindex_completion] uncaught exception: Error: [0] != [252] are not equal : encountered an error in the parallel shell :
[js_test:rollback_waits_for_bgindex_completion] doassert@src/mongo/shell/assert.js:20:14
[js_test:rollback_waits_for_bgindex_completion] assert.eq@src/mongo/shell/assert.js:179:9
[js_test:rollback_waits_for_bgindex_completion] startParallelShell/<@src/mongo/shell/servers_misc.js:182:13
[js_test:rollback_waits_for_bgindex_completion] @jstests/replsets/rollback_waits_for_bgindex_completion.js:93:1
[js_test:rollback_waits_for_bgindex_completion] @jstests/replsets/rollback_waits_for_bgindex_completion.js:11:2
[js_test:rollback_waits_for_bgindex_completion] failed to load: jstests/replsets/rollback_waits_for_bgindex_completion.js
[js_test:rollback_waits_for_bgindex_completion] exiting with code -3

The parallel shell is launched at jstests/replsets/rollback_waits_for_bgindex_completion.js:93 (as of commit e502f2d3965ac4147d303e956a582b7c4eef8232). Here is the full stack trace from when the parallel shell is launched:

[js_test:rollback_waits_for_bgindex_completion] startParallelShell/<@src/mongo/shell/servers_misc.js:182:13
[js_test:rollback_waits_for_bgindex_completion] @jstests/replsets/rollback_waits_for_bgindex_completion.js:93:1
[js_test:rollback_waits_for_bgindex_completion] @jstests/replsets/rollback_waits_for_bgindex_completion.js:11:2
[js_test:rollback_waits_for_bgindex_completion] failed to load: jstests/replsets/rollback_waits_for_bgindex_completion.js

STM has set the checkExitSuccess flag to false on the cleanup function of the parallel shell to keep the error from turning this test red and to preserve the existing semantics. We'd like someone to investigate whether the parallel shell failure is expected (in which case checkExitSuccess should remain false), or whether it is unexpected and the test needs to be modified.
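
For reference, a minimal sketch of the pattern involved, paraphrased from the file/line references in this ticket (not verbatim test code):

    // bgIndexThread is the cleanup/join function obtained via startParallelShell()
    // (rollback_waits_for_bgindex_completion.js:43 on the cited commit).
    const bgIndexThread = IndexBuildTest.startIndexBuild(node, testColl.getFullName(), {x: 1});

    // ... rollback scenario runs here ...

    // STM's workaround (line 94): skip the exit-code assertion, because the spawned
    // shell currently exits with code 252 when the index build is interrupted.
    bgIndexThread({checkExitSuccess: false});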



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixVersion since branching activities occurred yesterday. This ticket will be in rc0 when it is triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 27/Sep/21 ]

Author:

{'name': 'Adi Zaimi', 'email': 'adizaimi@yahoo.com', 'username': 'adizaimi'}

Message: SERVER-59687 Expect InterruptedDueToReplStateChange in test
Branch: master
https://github.com/mongodb/mongo/commit/ffb73e57e447a1c4496b327de797fad1d6c7fb1a

Comment by Adi Zaimi [ 23/Sep/21 ]

According to https://docs.mongodb.com/v4.4/core/index-creation/#build-failure-and-recovery the shell command should expect to get the "InterruptedDueToReplStateChange" error and as such we need to call IndexBuildTest.startIndexBuild() with ErrorCodes.InterruptedDueToReplStateChange.
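
A minimal sketch of that change, assuming the expected error codes are accepted as a trailing argument to IndexBuildTest.startIndexBuild() (the parameter position here is an assumption; see the commit linked in the Githook comment above for the actual change):

    // Sketch only, not copied from the commit: declare that the index build run by the
    // parallel shell may fail with InterruptedDueToReplStateChange, so the shell exits 0
    // and the exit-code check in the cleanup function can stay enabled.
    const bgIndexThread = IndexBuildTest.startIndexBuild(
        node,
        testColl.getFullName(),
        {x: 1},
        {},                                             // createIndexes options (assumed)
        [ErrorCodes.InterruptedDueToReplStateChange]);  // expected failure codes (assumed position)

    // ... rollback scenario ...
    bgIndexThread();  // checkExitSuccess can now default to true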

Whether the test correctly exercises the way rollback waits (or does not wait) for the index build to complete is a matter for further discussion.

Comment by Adi Zaimi [ 23/Sep/21 ]

From the previous comment, "rollback intends to wait for the index build to be completed (aborted?), but the abort has already occurred, so there is nothing to wait for". I'm not sure whether this is the intended behavior; maybe the abort happens so quickly that the wait does nothing here. Maybe we should put a sleep in the abort process so the wait code would actually get exercised.
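
One way to do that without a literal sleep would be to pause the abort path behind a fail point so rollback genuinely has to wait. A rough sketch, where the fail point name is hypothetical and would need to exist (or be added) in the abort code:

    // Hypothetical sketch: hold the index-build abort open so that rollback's
    // "Waiting for all background operations to complete" path has work to do.
    load("jstests/libs/fail_point_util.js");

    const fp = configureFailPoint(rollbackNode, "hangDuringIndexBuildAbort");  // hypothetical fail point name
    // ... trigger the rollback here ...
    fp.wait();   // abort is paused; rollback must actually wait on the index build
    // ... verify the waiting behavior ...
    fp.off();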

Comment by Adi Zaimi [ 23/Sep/21 ]

This is in the log right before the bg shell prints the above error:

[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.461+00:00"},"s":"I",  "c":"REPL",     "id":21358,   "ctx":"conn2","msg":"Replica set state transition","attr":{"newState":"SECONDARY","oldState":"PRIMARY"}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.461+00:00"},"s":"I",  "c":"INDEX",    "id":20441,   "ctx":"conn56","msg":"Index build: received interrupt signal","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"signal":{"code":11602,"codeName":"InterruptedDueToReplStateChange","errmsg":"operation was interrupted"}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.461+00:00"},"s":"I",  "c":"INDEX",    "id":4760400, "ctx":"conn56","msg":"Index build: ignoring interrupt and continuing in background","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}}}}

Which means that the primary put the index build in background upon stepdown. Then we get the rollback activity which apparently triggers the abort of the indexbuild (look for "Stopping index builds before rollback"):

[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.608+00:00"},"s":"I",  "c":"ROLLBACK", "id":21607,   "ctx":"BackgroundSync","msg":"Rollback common point","attr":{"commonPointOpTime":{"ts":{"$timestamp":{"t":1632352002,"i":3}},"t":1}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"REPL",     "id":21532,   "ctx":"BackgroundSync","msg":"Incremented the rollback ID","attr":{"rbid":2}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":20658,   "ctx":"BackgroundSync","msg":"Stopping index builds before rollback"}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":20650,   "ctx":"BackgroundSync","msg":"Active index builds","attr":{"context":"IndexBuildsCoordinator::stopIndexBuildsForRollback","builds":[{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"indexNames":["x_1"],"protocol":"two phase"}]}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":4656010, "ctx":"BackgroundSync","msg":"Attempting to abort index build","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":4656003, "ctx":"BackgroundSync","msg":"Aborting index build","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"error":"rollback"}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":465611,  "ctx":"BackgroundSync","msg":"Cleaned up index build after abort. ","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"INDEX",    "id":4984704, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Index build: collection scan stopped","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"totalRecords":1,"durationMillis":0,"phase":"collection scan","collectionScanPosition":"1","readSource":"kMajorityCommitted","error":{"code":276,"codeName":"IndexBuildAborted","errmsg":"operation was interrupted"}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":20655,   "ctx":"BackgroundSync","msg":"Index build: joined after abort","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"waitResult":{"code":0,"codeName":"OK"},"status":{"code":276,"codeName":"IndexBuildAborted","errmsg":"collection scan stopped. totalRecords: 1; durationMillis: 0ms; phase: collection scan; collectionScanPosition: RecordId(1); readSource: kMajorityCommitted :: caused by :: operation was interrupted"}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":20347,   "ctx":"BackgroundSync","msg":"Index build: aborted without cleanup","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"namespace":"dbWithBgIndex.coll"}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.609+00:00"},"s":"I",  "c":"STORAGE",  "id":4656002, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"unable to log operation","attr":{"error":{"code":276,"codeName":"IndexBuildAborted","errmsg":"operation was interrupted"}}}
[js_test:rollback_waits_for_bgindex_completion] d20032| {"t":{"$date":"2021-09-22T23:06:42.613+00:00"},"s":"I",  "c":"REPL",     "id":21079,   "ctx":"BackgroundSync","msg":"bgsync - stopReplProducer fail point enabled. Blocking until fail point is disabled."}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.617+00:00"},"s":"I",  "c":"INDEX",    "id":4841502, "ctx":"BackgroundSync","msg":"Index build: wrote resumable state to disk","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"namespace":"dbWithBgIndex.coll","details":{"_id":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"},"phase":"collection scan","collectionUUID":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"},"collectionScanPosition":1,"indexes":[{"fileName":"extsort-index.0-1004796179216808253","numKeys":1,"ranges":[{"startOffset":0,"endOffset":13,"checksum":4272028543}],"sideWritesTable":"internal-43--4107655658741736919","spec":{"v":2,"key":{"x":1},"name":"x_1"},"isMultikey":false,"multikeyPaths":[{"multikeyComponents":[]}]}]}}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.617+00:00"},"s":"I",  "c":"ROLLBACK", "id":21595,   "ctx":"BackgroundSync","msg":"Waiting for all background operations to complete before starting rollback"}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.617+00:00"},"s":"I",  "c":"ROLLBACK", "id":21597,   "ctx":"BackgroundSync","msg":"Finished waiting for background operations to complete before rollback"}

Notice the last two lines: rollback intends to wait for the index build to be completed (aborted?), but the abort has already occurred, so there is nothing to wait for.

The next primary seems to want to start the index build (notice the timestamp; it is prior to the above block):

[js_test:rollback_waits_for_bgindex_completion] d20031| {"t":{"$date":"2021-09-22T23:06:42.501+00:00"},"s":"I",  "c":"STORAGE",  "id":20650,   "ctx":"OplogApplier-0","msg":"Active index builds","attr":{"context":"IndexBuildsCoordinator::onStepUp","builds":[{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"indexNames":["x_1"],"protocol":"two phase"}]}}

However, the original node finds the previous attempt and initiates the resume:

[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.811+00:00"},"s":"I",  "c":"STORAGE",  "id":4916301, "ctx":"BackgroundSync","msg":"Found unfinished index build to resume","attr":{"buildUUID":{"uuid":{"$uuid":"39d281db-25d4-46c3-9f3b-20133f58bff4"}},"collectionUUID":{"uuid":{"$uuid":"8cec4cde-51df-432e-8bd2-2a2f8a54593d"}},"phase":"collection scan"}}
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.811+00:00"},"s":"I",  "c":"STORAGE",  "id":22253,   "ctx":"BackgroundSync","msg":"Found index from unfinished build"...
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.814+00:00"},"s":"I",  "c":"STORAGE",  "id":4841700, "ctx":"BackgroundSync","msg":"Index build: resuming"...
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:42.815+00:00"},"s":"I",  "c":"INDEX",    "id":20384,   "ctx":"BackgroundSync","msg":"Index build: starting"....
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:43.168+00:00"},"s":"I",  "c":"INDEX",    "id":20391,   "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Index build: collection scan done"....
[js_test:rollback_waits_for_bgindex_completion] d20030| {"t":{"$date":"2021-09-22T23:06:43.177+00:00"},"s":"I",  "c":"STORAGE",  "id":3856203, "ctx":"IndexBuildsCoordinatorMongod-0","msg":"Index build: waiting for next action before completing final phase"..

Later, the index build starts and completes on the other nodes.

Comment by Adi Zaimi [ 22/Sep/21 ]

Here is the output of that background job:

  1 [js_test:rollback_waits_for_bgindex_completion] sh13195| MongoDB shell version v5.1.0-alpha-1011-g6f895e6
  2  connecting to: mongodb://localhost:20020/?compressors=disabled&gssapiServiceName=mongodb
  3  Implicit session: session { "id" : UUID("046ce85e-60a6-43e1-86d3-66c6a9ff9fe2") }
  4  MongoDB server version: 5.1.0-alpha-1011-g6f895e6
  5  uncaught exception: Error: command did not fail with any of the following codes [ ] {
  6         "topologyVersion" : {
  7                 "processId" : ObjectId("614bb6e0e6bfbe190860c4fd"),
  8                 "counter" : NumberLong(12)
  9         },
 10         "ok" : 0,
 11         "errmsg" : "Index build failed: 39d281db-25d4-46c3-9f3b-20133f58bff4: Collection dbWithBgIndex.coll ( 8cec4cde-51df-432e-8bd2-2a2f8a54593d ) :: caused by :: operation was interrupted",
 12         "code" : 11602,
 13         "codeName" : "InterruptedDueToReplStateChange",
 14         "$clusterTime" : {
 15                 "clusterTime" : Timestamp(1632352002, 7),
 16                 "signature" : {
 17                         "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
 18                         "keyId" : NumberLong(0)
 19                 }
 20         },
 21         "operationTime" : Timestamp(1632352002, 7)
 22  } :
 23  _getErrorWithCode@src/mongo/shell/utils.js:24:13
 24  doassert@src/mongo/shell/assert.js:18:14
 25  _assertCommandFailed@src/mongo/shell/assert.js:805:21
 26  assert.commandFailedWithCode@src/mongo/shell/assert.js:851:16
 27  commandWorkedOrFailedWithCode@src/mongo/shell/assert.js:822:20
 28 @(shell eval):65:17
 29 @(shell eval):47:33
 30  exiting with code -4

The output at line 11:
"errmsg" : "Index build failed: 39d281db-25d4-46c3-9f3b-20133f58bff4: Collection dbWithBgIndex.coll ( 8cec4cde-51df-432e-8bd2-2a2f8a54593d ) :: caused by :: operation was interrupted",
and at line 13: "codeName" : "InterruptedDueToReplStateChange",
leads me to conclude that this type of index build failure is expected; however, I am looking at the intended behavior to confirm.
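
For context, here is a paraphrase of the check that trips inside the spawned shell (per the commandWorkedOrFailedWithCode frame in the stack above; not a verbatim copy of the helper). The expected-failure list is empty, so the InterruptedDueToReplStateChange response counts as a test error:

    // Paraphrased sketch of what the spawned shell effectively asserts. With an empty
    // expected-failure list, the 11602 (InterruptedDueToReplStateChange) response from
    // createIndexes fails the assertion and the shell exits non-zero.
    const res = testDB.runCommand(
        {createIndexes: "coll", indexes: [{key: {x: 1}, name: "x_1"}]});
    assert.commandWorkedOrFailedWithCode(res, /* expectedFailures */[]);
    // -> "command did not fail with any of the following codes [ ]"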

Comment by Adi Zaimi [ 22/Sep/21 ]

Documenting for self reference:

The key info is that startIndexBuild calls startParallelShell() to run the index build operation in the background.
Before the test exits, we check the exit code of the parallel shell and make sure it is zero, but in this case it isn't.

Here is what is occurring:

uncaught exception: Error: [0] != [252] are not equal : encountered an error in the parallel shell
The above error is raised by the assertion in the cleanup function returned by startParallelShell():
servers_misc.js:181 assert.eq(0, exitCode, "encountered an error in the parallel shell");

We only perform the assert if the caller leaves options.checkExitSuccess set to true:
rollback_waits_for_bgindex_completion.js:94 bgIndexThread({checkExitSuccess: false});
(NB: in the test we now pass false, so the check does not fail the way this ticket shows.)

bgIndexThread is created at line 43: bgIndexThread = IndexBuildTest.startIndexBuild(node, testColl.getFullName(), {x: 1});
and IndexBuildTest.startIndexBuild() in turn spawns a parallel shell:
index_build.js:39 return startParallelShell(funWithArgs(func, args), conn.port);

Changing the parameter to checkExitSuccess: true causes my local run to fail the same way as this ticket shows.
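
A paraphrased sketch of that guard, based on the line references above (not a verbatim copy of servers_misc.js):

    // Approximate shape of the cleanup function returned by startParallelShell(): it
    // joins the spawned shell and asserts on the exit code unless the caller passes
    // checkExitSuccess: false. 'pid' stands for the spawned shell's process id captured
    // by startParallelShell().
    const cleanup = function({checkExitSuccess = true} = {}) {
        const exitCode = waitProgram(pid);  // join the spawned mongo shell process
        if (checkExitSuccess) {
            // servers_misc.js:181 in the trace above
            assert.eq(0, exitCode, "encountered an error in the parallel shell");
        }
        return exitCode;
    };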

I added debug logging to figure out what happens to the background index builder's PID. However, it turns out that's unnecessary, because that shell prints to the same log with its output prefixed by sh12345 (12345 being its PID).
