[SERVER-59775] ReshardingDonorOplogIterator triggers an fassert() when it continues to run in member state SECONDARY following a stepdown Created: 03/Sep/21  Updated: 29/Oct/23  Resolved: 08/Sep/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 5.0.0
Fix Version/s: 5.0.4, 5.1.0-rc0

Type: Bug Priority: Major - P3
Reporter: Max Hirschhorn Assignee: Max Hirschhorn
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
is depended on by SERVER-53351 Add resharding fuzzer task with step-... Closed
Related
related to SERVER-80280 Consider introducing concept of drain... Closed
related to SERVER-79802 Allow resharding donor oplog iterator... Closed
related to SERVER-79955 Need a more complete mechanism for in... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested: v5.0
Sprint: Sharding 2021-09-06, Sharding 2021-09-20
Participants:
Story Points: 1

 Description   

By design, when a node steps down, PrimaryOnlyService cancels the cancellation token for its Instances and shuts down their task executor. However, a task that is already running can continue to run (briefly) in member state SECONDARY. ReshardingDonorOplogIterator reads from the oplog buffer collection locally using the default RecoveryUnit::ReadSource of kNoTimestamp. This leads to the node hitting the fassert() with msgid 4728700 ("Reading from replicated collection on a secondary without read timestamp or PBWM lock") in AutoGetCollectionForReadBase.

Moreover, ReshardingDonorOplogIterator relies on the guarantee that, after being notified via awaitInsert(), it reads the write committed by the ReshardingOplogFetcher thread. This means RecoveryUnit::ReadSource::kNoOverlap isn't a suitable alternative. Instead, we'll have ReshardingDonorOplogIterator use ShouldNotConflictWithSecondaryBatchApplicationBlock.
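
The interaction described above can be modelled in a few lines of standalone C++. This is only a simplified sketch, not the server code: OpState, NoConflictBlock, and wouldTripReadCheck are hypothetical stand-ins for the operation's locker/recovery unit state, ShouldNotConflictWithSecondaryBatchApplicationBlock, and the check in AutoGetCollectionForReadBase. The exact condition is an assumption reconstructed from the fatal log message, and the real check may consider more state.

```cpp
#include <cassert>

// Hypothetical, simplified stand-ins for the server's types.
enum class ReadSource { kNoTimestamp, kNoOverlap };
enum class MemberState { kPrimary, kSecondary };

// Models the per-operation state the read check is assumed to consult.
struct OpState {
    ReadSource readSource = ReadSource::kNoTimestamp;
    bool conflictsWithSecondaryBatchApplication = true;
};

// RAII guard playing the role of ShouldNotConflictWithSecondaryBatchApplicationBlock:
// clears the flag for the guard's lifetime and restores the prior value on exit.
class NoConflictBlock {
public:
    explicit NoConflictBlock(OpState& op)
        : _op(op), _original(op.conflictsWithSecondaryBatchApplication) {
        _op.conflictsWithSecondaryBatchApplication = false;
    }
    ~NoConflictBlock() {
        _op.conflictsWithSecondaryBatchApplication = _original;
    }
    NoConflictBlock(const NoConflictBlock&) = delete;
    NoConflictBlock& operator=(const NoConflictBlock&) = delete;

private:
    OpState& _op;
    bool _original;
};

// Returns true when this model would hit the fatal assertion: an untimestamped
// read of a replicated collection on a secondary by an operation that has not
// opted out of conflicting with secondary batch application.
bool wouldTripReadCheck(const OpState& op, MemberState state, bool replicatedCollection) {
    return state == MemberState::kSecondary && replicatedCollection &&
        op.readSource == ReadSource::kNoTimestamp &&
        op.conflictsWithSecondaryBatchApplication;
}
```

In this model, a default OpState passes the check while the node is PRIMARY and trips it once the node is SECONDARY. Constructing a NoConflictBlock around the read suppresses the check without changing the read source, so the untimestamped read of the most recent write is preserved.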


[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:22.607+00:00 I  REPL     21358   [ReplCoord-1] "Replica set state transition","attr":{"newState":"SECONDARY","oldState":"PRIMARY"}
...
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:22.608+00:00 F  STORAGE  4728700 [ReshardingRecipientService-1] "Reading from replicated collection on a secondary without read timestamp or PBWM lock","attr":{"collection":"config.localReshardingOplogBuffer.cea06672-2ba3-4d95-8b23-a4cfc596f4df.shard1"}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:22.608+00:00 F  ASSERT   23089   [ReshardingRecipientService-1] "Fatal assertion","attr":{"msgid":4728700,"file":"src/mongo/db/db_raii.cpp","line":334}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:22.608+00:00 F  ASSERT   23090   [ReshardingRecipientService-1] "\n\n***aborting after fassert() failure\n\n"
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:22.608+00:00 F  CONTROL  4757800 [ReshardingRecipientService-1] "Writing fatal message","attr":{"message":"Got signal: 6 (Aborted).\n"}
...
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:22.614+00:00 I  REPL     5123007 [ReplCoord-1] "Interrupting PrimaryOnlyService due to stepDown","attr":{"service":"ReshardingRecipientService","numInstances":1,"numOperationContexts":3}
...
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"557020E99237","b":"55700D006000","o":"13E93237","s":"_ZN5mongo25fassertFailedWithLocationEiPKcj","s+":"D7"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F13DA0B","b":"55700D006000","o":"12137A0B","s":"_ZN5mongo28AutoGetCollectionForReadBaseINS_25AutoGetCollectionLockFreeENS_32AutoGetCollectionForReadLockFree13EmplaceHelperEEC1EPNS_16OperationContextERKS3_b","s+":"15FB"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F13F8D2","b":"55700D006000","o":"121398D2","s":"_ZN5boost15optional_detail13optional_baseIN5mongo28AutoGetCollectionForReadBaseINS2_25AutoGetCollectionLockFreeENS2_32AutoGetCollectionForReadLockFree13EmplaceHelperEEEE9constructIJRPNS2_16OperationContextERS6_RbEEEvNS_11optional_ns15in_place_init_tEDpOT_","s+":"42"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F135399","b":"55700D006000","o":"1212F399","s":"_ZN5mongo12_GLOBAL__N_138acquireCollectionAndConsistentSnapshotIZNS_32AutoGetCollectionForReadLockFreeC1EPNS_16OperationContextERKNS_21NamespaceStringOrUUIDENS_25AutoGetCollectionViewModeENS_6Date_tEE3$_1ZNS2_C1ES4_S7_S8_S9_E3$_2ZNS2_C1ES4_S7_S8_S9_E3$_3EEDaS4_bRNS_24CollectionCatalogStasherET_T0_T1_","s+":"199"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F134DD4","b":"55700D006000","o":"1212EDD4","s":"_ZN5mongo32AutoGetCollectionForReadLockFreeC1EPNS_16OperationContextERKNS_21NamespaceStringOrUUIDENS_25AutoGetCollectionViewModeENS_6Date_tE","s+":"1F4"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F13ECFF","b":"55700D006000","o":"12138CFF","s":"_ZN5mongo35AutoGetCollectionForReadCommandBaseINS_32AutoGetCollectionForReadLockFreeEEC2EPNS_16OperationContextERKNS_21NamespaceStringOrUUIDENS_25AutoGetCollectionViewModeENS_6Date_tENS_16AutoStatsTracker7LogModeE","s+":"4F"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F1402EF","b":"55700D006000","o":"1213A2EF","s":"_ZN5boost15optional_detail13optional_baseIN5mongo39AutoGetCollectionForReadCommandLockFreeEE9constructIJRPNS2_16OperationContextERKNS2_21NamespaceStringOrUUIDERNS2_25AutoGetCollectionViewModeERNS2_6Date_tERNS2_16AutoStatsTracker7LogModeEEEEvNS_11optional_ns15in_place_init_tEDpOT_","s+":"4F"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F1368FB","b":"55700D006000","o":"121308FB","s":"_ZN5mongo44AutoGetCollectionForReadCommandMaybeLockFreeC2EPNS_16OperationContextERKNS_21NamespaceStringOrUUIDENS_25AutoGetCollectionViewModeENS_6Date_tENS_16AutoStatsTracker7LogModeE","s+":"8B"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701E4493CF","b":"55700D006000","o":"114433CF","s":"_ZN5boost15optional_detail13optional_baseIN5mongo44AutoGetCollectionForReadCommandMaybeLockFreeEE9constructIJRPNS2_16OperationContextERKNS2_21NamespaceStringOrUUIDENS2_25AutoGetCollectionViewModeENS2_6Date_tENS2_16AutoStatsTracker7LogModeEEEEvNS_11optional_ns15in_place_init_tEDpOT_","s+":"4F"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.157+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701E43C9C7","b":"55700D006000","o":"114369C7","s":"_ZN5mongo28CommonMongodProcessInterface40attachCursorSourceToPipelineForLocalReadEPNS_8PipelineE","s+":"4F7"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701E58982B","b":"55700D006000","o":"1158382B","s":"_ZN5mongo17shardVersionRetryIZNS_19sharded_agg_helpers22attachCursorToPipelineEPNS_8PipelineENS_20ShardTargetingPolicyEN5boost8optionalINS_7BSONObjEEEE3$_4EEDaPNS_16OperationContextEPNS_12CatalogCacheENS_15NamespaceStringENS_10StringDataEOT_","s+":"37B"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701E5890BD","b":"55700D006000","o":"115830BD","s":"_ZN5mongo19sharded_agg_helpers22attachCursorToPipelineEPNS_8PipelineENS_20ShardTargetingPolicyEN5boost8optionalINS_7BSONObjEEE","s+":"53D"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701E3FA8DF","b":"55700D006000","o":"113F48DF","s":"_ZN5mongo27ShardServerProcessInterface28attachCursorSourceToPipelineEPNS_8PipelineENS_20ShardTargetingPolicyEN5boost8optionalINS_7BSONObjEEE","s+":"5F"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F488150","b":"55700D006000","o":"12482150","s":"_ZN5mongo20DocumentSourceLookUp13buildPipelineERKNS_8DocumentE","s+":"E90"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F48581B","b":"55700D006000","o":"1247F81B","s":"_ZN5mongo20DocumentSourceLookUp12unwindResultEv","s+":"AAB"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F483B56","b":"55700D006000","o":"1247DB56","s":"_ZN5mongo20DocumentSourceLookUp9doGetNextEv","s+":"F6"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701CFC28BC","b":"55700D006000","o":"FFBC8BC","s":"_ZN5mongo14DocumentSource7getNextEv","s+":"21C"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701F53C6EE","b":"55700D006000","o":"125366EE","s":"_ZN5mongo8Pipeline7getNextEv","s+":"DE"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701D698F3C","b":"55700D006000","o":"10692F3C","s":"_ZN5mongo28ReshardingDonorOplogIterator10_fillBatchERNS_8PipelineE","s+":"AC"}}
[js_test:resharding_fuzzer-79234-1630149184498-1] d20022| 2021-08-28T11:16:23.158+00:00 I  CONTROL  31445   [ReshardingRecipientService-1] "Frame","attr":{"frame":{"a":"55701D699E2C","b":"55700D006000","o":"10693E2C","s":"_ZN5mongo28ReshardingDonorOplogIterator12getNextBatchESt10shared_ptrINS_8executor12TaskExecutorEENS_17CancellationTokenENS_33CancelableOperationContextFactoryE","s+":"54C"}}

https://evergreen.mongodb.com/lobster/build/c6979c6e3c82b5fa2586cea47ff21636/test/612a1acec2ab686fd51b1f68#bookmarks=0%2C37457%2C37460%2C37506%2C37797%2C160261%2C160414&f~=100~d20022%5C%7C



 Comments   
Comment by Vivian Ge (Inactive) [ 06/Oct/21 ]

Updating the fixversion since branching activities occurred yesterday. This ticket will be in rc0 when it’s been triggered. For more active release information, please keep an eye on #server-release. Thank you!

Comment by Githook User [ 20/Sep/21 ]

Author: Max Hirschhorn (max.hirschhorn@mongodb.com, username: visemet)

Message: SERVER-59775 Make ReshardingDonorOplogIterator safe to run as secondary.

(cherry picked from commit 55ed7abbf0cb8afa13f66234f099a4b02fc6e8c6)
Branch: v5.0
https://github.com/mongodb/mongo/commit/7c40f936935ea61d336a310bac1599eda2aa6c7a

Comment by Githook User [ 07/Sep/21 ]

Author: Max Hirschhorn (max.hirschhorn@mongodb.com, username: visemet)

Message: SERVER-59775 Make ReshardingDonorOplogIterator safe to run as secondary.
Branch: master
https://github.com/mongodb/mongo/commit/55ed7abbf0cb8afa13f66234f099a4b02fc6e8c6
