Type: Bug
Resolution: Fixed
Priority: Major - P3
Fix Version/s: 5.2.0, 5.0.4, 5.1.0-rc1
Affects Version/s: 5.0.0
Component/s: Sharding
Labels:
- PM-234-M3
- PM-234-T-lifecycle

Backwards Compatibility:
Fully Compatible
Operating System:
ALL
Backport Requested:
v5.1, v5.0
Sprint:
Sharding 2021-09-20, Sharding 2021-10-04, Sharding 2021-10-18
Story Points:
1
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

The changes from fee0349 as part of ~~SERVER-53912~~ introduced a RecipientStateMachine::_restoreMetrics() method that, when the recipient state machine resumes running, recalculates the number of documents it cloned, oplog entries it fetched, and oplog entries it applied. These read operations may be interrupted if the primary steps down shortly after stepping up. The call to RecipientStateMachine::_restoreMetrics() should be placed in a resharding::WithAutomaticRetry() block so that transient errors are automatically retried and the retry loop stays synchronized with the stepdown token being canceled.
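To make the proposed behavior concrete, here is a minimal, self-contained sketch of a retry loop that is tied to a cancellation flag. It is not the real resharding::WithAutomaticRetry() API: the names TransientError, CanceledError, CancelFlag, withAutomaticRetry, and the maxAttempts bound are all hypothetical stand-ins chosen for illustration, and the real code would also need backoff and proper error classification.

```cpp
#include <atomic>
#include <cassert>
#include <stdexcept>

// Hypothetical stand-in for a retryable failure such as
// InterruptedDueToReplStateChange.
struct TransientError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical error surfaced once the stepdown token has been canceled.
struct CanceledError : std::runtime_error {
    using std::runtime_error::runtime_error;
};

// Hypothetical stand-in for the stepdown cancellation token.
struct CancelFlag {
    std::atomic<bool> canceled{false};
    void cancel() { canceled.store(true); }
    bool isCanceled() const { return canceled.load(); }
};

// Runs body() until it succeeds. A TransientError triggers another attempt,
// but only while the token is not canceled and attempts remain. Any other
// exception propagates immediately.
template <typename F>
auto withAutomaticRetry(const CancelFlag& token, int maxAttempts, F&& body)
    -> decltype(body()) {
    assert(maxAttempts > 0);
    for (int attempt = 1;; ++attempt) {
        // Check the token before every attempt so that a stepdown ends the
        // loop with a cancellation error instead of another retry.
        if (token.isCanceled()) {
            throw CanceledError("operation canceled by stepdown");
        }
        try {
            return body();
        } catch (const TransientError&) {
            if (attempt >= maxAttempts) {
                throw;  // Out of attempts: surface the last transient error.
            }
        }
    }
}
```

In this sketch, a read that is interrupted while the token is still live is simply attempted again, whereas an interruption that coincides with the token being canceled ends the loop with CanceledError, which the caller can treat as an ordinary stepdown rather than an unrecoverable error.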

As a bonus on this ticket, we should see if it is possible to have the resharding code invariant that all usages of CancelableOperationContextFactory occur within a resharding::WithAutomaticRetry() block.
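One possible shape for such a check, offered only as a sketch: a scope guard marks the current thread as being inside a retry block, and the factory refuses to hand out an operation context when the mark is absent. Every name here (RetryScope, inRetryBlock, GuardedOpCtxFactory, OperationContextStub) is hypothetical and does not correspond to existing server code, and the sketch throws std::logic_error where the server would presumably use invariant().

```cpp
#include <stdexcept>

// Per-thread nesting depth of active retry blocks (hypothetical).
thread_local int retryDepth = 0;

// RAII guard that a retry helper would construct around each attempt.
class RetryScope {
public:
    RetryScope() { ++retryDepth; }
    ~RetryScope() { --retryDepth; }
    RetryScope(const RetryScope&) = delete;
    RetryScope& operator=(const RetryScope&) = delete;
};

inline bool inRetryBlock() {
    return retryDepth > 0;
}

// Placeholder for the real operation context type.
struct OperationContextStub {};

// Hypothetical wrapper around the factory that enforces the rule
// "operation contexts are only created inside a retry block".
class GuardedOpCtxFactory {
public:
    OperationContextStub makeOperationContext() const {
        if (!inRetryBlock()) {
            throw std::logic_error(
                "operation context requested outside a retry block");
        }
        return OperationContextStub{};
    }
};
```

A thread-local counter only works if the retried body runs on the thread that entered the scope; code that hops between executors would need the marker carried along with the task instead.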

[2021/09/02 13:54:14.598] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.598+00:00 I  COMMAND  21581   [conn1] "Received replSetStepUp request"
[2021/09/02 13:54:14.603] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.602+00:00 I  REPL     21358   [ReplCoord-7] "Replica set state transition","attr":{"newState":"PRIMARY","oldState":"SECONDARY"}
[2021/09/02 13:54:14.615] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.615+00:00 I  REPL     21331   [OplogApplier-0] "Transition to primary complete; database writes are now permitted"
[2021/09/02 13:54:14.823] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.823+00:00 I  REPL     21402   [conn4] "Stepping down from primary, because a new term has begun","attr":{"term":6}
...
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  RESHARD  5551101 [ReshardingRecipientService-5] "Unrecoverable error occurred past the point recipient was prepared to complete the resharding operation","attr":{"error":"InterruptedDueToReplStateChange: operation was interrupted"}
[2021/09/02 13:54:14.825] [js_test:resharding_fuzzer-120e1-1630590374586-8] d20021| 2021-09-02T13:54:14.825+00:00 F  ASSERT   23089   [ReshardingRecipientService-5] "Fatal assertion","attr":{"msgid":5551101,"file":"src/mongo/db/s/resharding/resharding_recipient_service.cpp","line":404}

https://evergreen.mongodb.com/lobster/evergreen/test/mongodb_mongo_master_enterprise_rhel_80_64_bit_resharding_fuzzer_stepup_1_enterprise_rhel_80_64_bit_patch_23f9d2a53917d63fc3d3b8c8646f40f2bc4caa2f_6130c5d561837d6514713be6_21_09_02_12_44_55/0/6130e1232fd552933c3a0c9a#bookmarks=0%2C26712%2C26783%2C34982%2C35005%2C35160%2C35363%2C35372%2C35373%2C35629%2C153752%2C153798&f~=000~d20021%5C%7C&f~=100~%5C%5BResharding.%2AService&f~=010~%28REPL_HB%7CELECTION%29&f~=011~REPL_HB&l=1

is caused by

SERVER-53912 ReshardingRecipientService instances to load metrics state upon instantiation

Closed

is depended on by

SERVER-53351 Add resharding fuzzer task with step-ups enabled for shards

Closed

related to

SERVER-105848 Refractor ReshardingDonorService to use RetryingCancelableOperationContextFactory

Closed

Assignee: Brett Nawrocki
Reporter: Max Hirschhorn
Participants: Brett Nawrocki, Githook User, Max Hirschhorn
Votes: 0
Watchers: 2

Created: Sep 13 2021 11:08:18 PM UTC
Updated: Jun 03 2025 04:14:57 PM UTC
Resolved: Oct 12 2021 02:51:50 PM UTC
Confidence Status Last Update: 24/Sep/21 3:29 PM
