[SERVER-53454] Return an error future from ReshardingOplogFetcher::awaitInsert if the fetcher has been shut down Created: 18/Dec/20  Updated: 27/Oct/23  Resolved: 20/Jul/21

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Blake Oler Assignee: Max Hirschhorn
Resolution: Gone away Votes: 0
Labels: PM-234-M3, PM-234-T-oplog-fetch
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File BF.diff    
Issue Links:
Backports
Depends
Operating System: ALL
Backport Requested:
v5.0
Sprint: Sharding 2021-07-12, Sharding 2021-07-26
Participants:
Linked BF Score: 23
Story Points: 1

 Description   

The reference is in the linked BF comments. The diff to reproduce the failure is attached. It may be useful for writing a test that verifies this behavior has been fixed, and will definitely be useful for verifying locally that the fix works.



 Comments   
Comment by Max Hirschhorn [ 20/Jul/21 ]

I took another look at SERVER-53454 and the original failure. I don't see any reason to believe awaitInsert() was being called before the previously returned future had become ready, either then or now. My best guess for the cause of the original failure is that the hack we were using to signal that the ReshardingOplogApplier should abort via its corresponding ReshardingOplogFetcher wasn't safe against the ReshardingOplogFetcher being destroyed before the ReshardingOplogApplier had received the notification.

The changes from 67ff845 as part of SERVER-53931 made it so we now use cancellation tokens for that purpose instead, and the changes from 518af1a as part of SERVER-55813 guarantee that the ReshardingOplogFetcher and its associated member variables stay alive the entire time the ReshardingOplogApplier would be running.
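
To illustrate why a cancellation token is safer than signaling through the fetcher object itself, here is a minimal sketch. ToyCancellationSource and ToyCancellationToken are hypothetical stand-ins written for this example, not the server's actual cancellation classes. The point is only that the token shares ownership of the cancellation state, so the consumer can still observe the signal after the object that issued it has been destroyed.

```cpp
#include <atomic>
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical toy token: holds shared ownership of the cancellation flag.
class ToyCancellationToken {
public:
    explicit ToyCancellationToken(std::shared_ptr<std::atomic<bool>> flag)
        : _flag(std::move(flag)) {}

    bool isCanceled() const {
        return _flag->load();
    }

private:
    std::shared_ptr<std::atomic<bool>> _flag;
};

// Hypothetical toy source: the side that requests cancellation.
class ToyCancellationSource {
public:
    ToyCancellationSource() : _flag(std::make_shared<std::atomic<bool>>(false)) {}

    void cancel() {
        _flag->store(true);
    }

    ToyCancellationToken token() const {
        return ToyCancellationToken(_flag);
    }

private:
    std::shared_ptr<std::atomic<bool>> _flag;
};

// Returns what a token holder observes once the source no longer exists.
// If `cancelFirst` is true, the source cancels before being destroyed.
bool observedAfterSourceDestroyed(bool cancelFirst) {
    auto source = std::make_unique<ToyCancellationSource>();
    ToyCancellationToken token = source->token();
    if (cancelFirst) {
        source->cancel();
    }
    source.reset();  // The source is gone; the token still owns the flag.
    return token.isCanceled();
}
```

In this sketch the holder of the token never dereferences the source, so the order in which the two sides are torn down no longer matters for reading the signal.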

I filed SERVER-58702 to fix the comment in ReshardingDataReplication and update the declaration order for the member variables. Closing this ticket as "Gone away".
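
For context on why declaration order matters here: C++ destroys non-static data members in the reverse of their declaration order. The sketch below only demonstrates that language rule. ToyDataReplication, Tracer, and the member names are hypothetical and are not meant to reflect the actual layout of ReshardingDataReplication.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Records its name in a shared log when it is destroyed.
struct Tracer {
    Tracer(std::string n, std::vector<std::string>* l) : name(std::move(n)), log(l) {}
    ~Tracer() {
        log->push_back(name);
    }

    std::string name;
    std::vector<std::string>* log;
};

// Hypothetical owner with two members. Members are destroyed in reverse
// declaration order, so `applier` is destroyed before `fetcher`.
struct ToyDataReplication {
    explicit ToyDataReplication(std::vector<std::string>* log)
        : fetcher("fetcher", log), applier("applier", log) {}

    Tracer fetcher;  // declared first, destroyed last
    Tracer applier;  // declared last, destroyed first
};

// Builds and destroys a ToyDataReplication, returning the destruction order.
std::vector<std::string> destructionOrder() {
    std::vector<std::string> log;
    {
        ToyDataReplication replication(&log);
    }
    return log;
}
```

Declaring the object that is depended upon first means it outlives the members that refer to it during destruction.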

Comment by Max Hirschhorn [ 22/Dec/20 ]

Looking at the TestAwaitInsertErrors test case in the attached diff, it seems like Blake demonstrated an issue with referencing the moved-from ReshardingOplogApplier::_onInsertFuture when the caller violates the contract by calling awaitInsert() before the future returned by an earlier call to awaitInsert() has become ready?
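
A rough sketch of that contract violation, using toy types rather than the server's Future or the real fetcher: ToyFuture and ToyFetcher are hypothetical and assume the fetcher hands out its stored future by moving it. A second awaitInsert() call made before any insert notification then returns a moved-from handle with no state behind it.

```cpp
#include <cassert>
#include <memory>
#include <utility>

// Hypothetical one-shot future: a handle to shared state. Moving from it
// leaves the source without any state.
class ToyFuture {
public:
    ToyFuture() = default;
    explicit ToyFuture(std::shared_ptr<bool> state) : _state(std::move(state)) {}
    ToyFuture(ToyFuture&&) = default;
    ToyFuture& operator=(ToyFuture&&) = default;

    bool valid() const {
        return _state != nullptr;
    }

    bool isReady() const {
        return valid() && *_state;
    }

private:
    std::shared_ptr<bool> _state;
};

// Hypothetical fetcher that stores one future and moves it out on request.
class ToyFetcher {
public:
    ToyFetcher() {
        _arm();
    }

    // Contract: call again only after the previously returned future is ready.
    ToyFuture awaitInsert() {
        return std::move(_onInsertFuture);
    }

    // Fulfills the outstanding future and prepares a fresh one.
    void notifyInsert() {
        *_state = true;
        _arm();
    }

private:
    void _arm() {
        _state = std::make_shared<bool>(false);
        _onInsertFuture = ToyFuture(_state);
    }

    std::shared_ptr<bool> _state;
    ToyFuture _onInsertFuture;
};
```

In this toy model, a caller that respects the contract always receives a valid future, while a caller that calls awaitInsert() twice in a row gets a second future for which valid() is false.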

Do we know that it is possible for ReshardingDonorOplogIterator to do that? ReshardingOplogApplier::_scheduleNextBatch() only ever calls ReshardingDonorOplogIterator::getNext() once due to the current setting for the reshardingBatchLimitOperations server parameter. And the call to ReshardingOplogApplier::_scheduleNextBatch() is responsible for setting up the next call to ReshardingOplogApplier::_scheduleNextBatch() in sequence. It isn't clear to me how we'd have multiple calls to awaitInsert() outstanding at the same time. Could there be something else going on in the Evergreen failure?
