[SERVER-59927] Resharding's RecipientStateMachine::_restoreMetrics() doesn't retry on transient errors, leading to fassert() on stepdown Created: 13/Sep/21  Updated: 29/Oct/23  Resolved: 12/Oct/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc1

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Brett Nawrocki
Resolution: Fixed Votes: 0
Labels: PM-234-M3, PM-234-T-lifecycle
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
  Backports
  Depends
    is depended on by SERVER-53351 Add resharding fuzzer task with step-... Closed
  Problem/Incident
    is caused by SERVER-53912 ReshardingRecipientService instances ... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.1, v5.0
Sprint: Sharding 2021-09-20, Sharding 2021-10-04, Sharding 2021-10-18
Participants:
Story Points: 1

 Description   

The changes from fee0349 as part of SERVER-53912 introduced a RecipientStateMachine::_restoreMetrics() method. It runs when the recipient state machine starts running again and recalculates the number of documents it cloned, oplog entries it fetched, and oplog entries it applied. These read operations may be interrupted if the primary steps down shortly after stepping up. Because the reads are not retried, the resulting InterruptedDueToReplStateChange error is treated as unrecoverable and triggers an fassert() (see the log excerpt below). The call to RecipientStateMachine::_restoreMetrics() should be placed in a resharding::WithAutomaticRetry() block so that any transient errors are automatically retried and the retry loop is synchronized with cancellation of the stepdown token.
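
To illustrate the intended behavior, here is a minimal standalone sketch of a retry loop that is synchronized with a cancellation flag. It is not the server's resharding::WithAutomaticRetry() implementation, whose interface is not shown in this ticket. Every name below (withAutomaticRetrySketch, StepdownFlag, TransientError, OperationCanceled, RestoredMetrics, FlakyMetricsReader) is hypothetical, and the stepdown token is modeled as a plain atomic flag.

```cpp
#include <atomic>
#include <stdexcept>

// Hypothetical stand-ins for illustration; none of these are MongoDB types.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct OperationCanceled : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Models the stepdown cancellation token as a plain atomic flag.
using StepdownFlag = std::atomic<bool>;

// Calls `body` until it returns. A TransientError triggers another attempt
// unless the flag is set, in which case OperationCanceled is thrown so the
// caller sees a cancellation rather than the raw interruption.
template <typename F>
auto withAutomaticRetrySketch(const StepdownFlag& steppedDown, F&& body) -> decltype(body()) {
    for (;;) {
        if (steppedDown.load()) {
            throw OperationCanceled("stepdown token canceled");
        }
        try {
            return body();
        } catch (const TransientError&) {
            // Fall through to the next iteration, which re-checks the flag.
        }
    }
}

// Hypothetical metrics recovered on restart.
struct RestoredMetrics {
    long long documentsCloned;
    long long oplogEntriesFetched;
    long long oplogEntriesApplied;
};

// Simulated read path: the first `failures` calls are interrupted.
struct FlakyMetricsReader {
    explicit FlakyMetricsReader(int f) : failures(f), attempts(0) {}

    RestoredMetrics read() {
        if (attempts++ < failures) {
            throw TransientError("operation was interrupted");
        }
        return RestoredMetrics{100, 40, 35};
    }

    int failures;
    int attempts;
};
```

With this shape, a call such as `withAutomaticRetrySketch(flag, [&] { return reader.read(); })` survives interrupted reads while the node remains primary, and ends with a cancellation error instead of a fatal one once the flag is set.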

As a bonus on this ticket, we should see whether the resharding code can enforce, as an invariant, that every use of CancelableOperationContextFactory occurs within a resharding::WithAutomaticRetry() block.
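
One possible way to enforce such an invariant is to hide the context factory behind a wrapper whose only public entry point runs the caller's body inside a retry loop. The standalone sketch below assumes that design. It is not the interface of the RetryingCancelableOperationContextFactory named in the fix's commit message, which this ticket does not show. OpCtxSketch, RetryingOpCtxFactorySketch, TransientError, and OperationCanceled are hypothetical names.

```cpp
#include <atomic>
#include <stdexcept>
#include <utility>

// Hypothetical stand-ins for illustration; none of these are MongoDB types.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};
struct OperationCanceled : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical operation context bound to a cancellation flag.
class OpCtxSketch {
public:
    explicit OpCtxSketch(const std::atomic<bool>& canceled) : _canceled(canceled) {}

    // Throws a transient interruption once the flag has been set.
    void checkForInterrupt() const {
        if (_canceled.load()) {
            throw TransientError("operation was interrupted");
        }
    }

private:
    const std::atomic<bool>& _canceled;
};

// Hypothetical wrapper: the only way to obtain an OpCtxSketch is through
// withAutomaticRetry(), so no call site can use one outside a retry loop.
class RetryingOpCtxFactorySketch {
public:
    explicit RetryingOpCtxFactorySketch(const std::atomic<bool>& canceled)
        : _canceled(canceled) {}

    template <typename F>
    auto withAutomaticRetry(F&& body) const
        -> decltype(body(std::declval<OpCtxSketch&>())) {
        for (;;) {
            if (_canceled.load()) {
                throw OperationCanceled("stepdown token canceled");
            }
            OpCtxSketch opCtx(_canceled);  // fresh context for each attempt
            try {
                return body(opCtx);
            } catch (const TransientError&) {
                // Retry; the next iteration re-checks the cancellation flag.
            }
        }
    }

private:
    const std::atomic<bool>& _canceled;
};
```

Because the wrapper never hands out a context directly, a read that is interrupted by a stepdown is either retried or converted into a cancellation, and a new call site cannot bypass the retry loop by accident.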


[2021/09/02 13:54:14.598] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.598+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
[2021/09/02 13:54:14.603] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.602+00:00 I  REPL     21358   [ReplCoord-7] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}
[2021/09/02 13:54:14.615] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.615+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
[2021/09/02 13:54:14.823] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.823+00:00 I  REPL     21402   [conn4] "Stepping down from primary, because a new term has begun","attr":{"term":6}
...
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  RESHARD  5551101 [ReshardingRecipientService-5] "Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  ASSERT   23089   [ReshardingRecipientService-5] "Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":404}

https://evergreen.mongodb.com/lobster/evergreen/test/mongodb_mongo_master_enterprise_rhel_80_64_bit_resharding_fuzzer_stepup_1_enterprise_rhel_80_64_bit_patch_23f9d2a53917d63fc3d3b8c8646f40f2bc4caa2f_6130c5d561837d6514713be6_21_09_02_12_44_55/0/6130e1232fd552933c3a0c9a#bookmarks=0%2C26712%2C26783%2C34982%2C35005%2C35160%2C35363%2C35372%2C35373%2C35629%2C153752%2C153798&f~=000~d20021%5C%7C&f~=100~%5C%5BResharding.%2AService&f~=010~%28REPL_HB%7CELECTION%29&f~=011~REPL_HB&l=1



 Comments   
Comment by Githook User [ 18/Oct/21 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-59927 Add retry to _restoreMetrics()

RecipientStateMachine::_restoreMetrics() performs a number of read
operations to calculate the number of documents it cloned, oplog entries
it fetched, and oplog entries it applied at the beginning of starting to
run again. These read operations may be interrupted if the primary steps
down shortly after having been stepped up, which eventually leads to an
fassert(). Therefore, perform _restoreMetrics() in a
resharding::WithAutomaticRetry() block so any transient errors can be
automatically retried and synchronized with the stepdown token being
canceled.

Furthermore, refactor RecipientStateMachine to use new
RetryingCancelableOperationContextFactory to ensure that all usages of
CancelableOperationContextFactory occur within a
resharding::WithAutomaticRetry() block.

Additionally, add a test case that will cover the _restoreMetrics() read
operations being interrupted.

(cherry picked from commit b9e2784da82fef8e45b95b88e4ac1443649a5b0c)
Branch: v5.1
https://github.com/mongodb/mongo/commit/8797d34dae0f520c2cb19ff5c40ee3835ef0cc4c

Comment by Githook User [ 18/Oct/21 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-59927 Add retry to _restoreMetrics()

(cherry picked from commit b9e2784da82fef8e45b95b88e4ac1443649a5b0c)
Branch: v5.0
https://github.com/mongodb/mongo/commit/3ecfccd26a758127b18087160bc2bcf54d6c058a

Comment by Githook User [ 11/Oct/21 ]

Author: Brett Nawrocki <brett.nawrocki@mongodb.com> (brettnawrocki)

Message: SERVER-59927 Add retry to _restoreMetrics()

Branch: master
https://github.com/mongodb/mongo/commit/b9e2784da82fef8e45b95b88e4ac1443649a5b0c
