[SERVER-27280] repl failpoints that block until they're turned off should check for shutdown at intervals Created: 05/Dec/16  Updated: 06/Dec/22  Resolved: 20/Jul/17

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: 3.4.0
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Esha Maharishi (Inactive) Assignee: Backlog - Replication Team
Resolution: Won't Fix Votes: 0
Labels: neweng
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to SERVER-26928 Check for shutdown in pauseRsBgSyncPr... Closed
related to SERVER-27227 Disable fail points on test failures ... Closed
Assigned Teams:
Replication
Operating System: ALL
Participants:

 Description   

The following places use the

if (failpoint) {
    while (failpoint) {
        sleepsecs(1);
    }
}

pattern, without checking for shutdown inside the while loop:

SyncTail::getMissingDoc()
https://github.com/mongodb/mongo/blob/d9d8ac1e8ae85a07369274eaf12c0ce555e08b2e/src/mongo/db/repl/sync_tail.cpp#L948-L954

rs_rollback.cpp::_syncRollback()
https://github.com/mongodb/mongo/blob/d9d8ac1e8ae85a07369274eaf12c0ce555e08b2e/src/mongo/db/repl/rs_rollback.cpp#L912-L919

CollectionCloner::_insertDocumentsCallback
https://github.com/mongodb/mongo/blob/d9d8ac1e8ae85a07369274eaf12c0ce555e08b2e/src/mongo/db/repl/collection_cloner.cpp#L496-L506

rs_initialsync.cpp::_initialSync
https://github.com/mongodb/mongo/blob/d9d8ac1e8ae85a07369274eaf12c0ce555e08b2e/src/mongo/db/repl/rs_initialsync.cpp#L310-L316

Cloner()
https://github.com/mongodb/mongo/blob/d9d8ac1e8ae85a07369274eaf12c0ce555e08b2e/src/mongo/db/cloner.cpp#L285-L295

If some test turns on the failpoint but forgets to turn it off, a node that hits the failpoint will hang in shutdown, causing failures like the ones addressed by SERVER-26928.

They should all be changed to check for shutdown inside the loop (possibly through the inShutdown() function).
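
A minimal sketch of the proposed change, for illustration only. It does not use the server's actual fail point or shutdown APIs: failPointEnabled, shuttingDown, and blockWhileFailPointEnabled are hypothetical stand-ins, and the polling interval is a parameter rather than the fixed one-second sleep in the pattern above.

```cpp
#include <atomic>
#include <chrono>
#include <thread>

// Hypothetical stand-ins for the fail point state and the shutdown flag.
// These are assumptions for the sketch, not the server's real interfaces.
std::atomic<bool> failPointEnabled{false};
std::atomic<bool> shuttingDown{false};

// Blocks while the fail point is enabled, re-checking at every interval.
// Returns true if the fail point was turned off, and false if shutdown
// was observed while the fail point was still enabled.
bool blockWhileFailPointEnabled(std::chrono::milliseconds interval) {
    while (failPointEnabled.load()) {
        if (shuttingDown.load()) {
            return false;  // stop waiting so shutdown can proceed
        }
        std::this_thread::sleep_for(interval);
    }
    return true;
}
```

With this shape, a test that forgets to turn the fail point off delays shutdown by at most one polling interval instead of hanging the node.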



 Comments   
Comment by Spencer Brody (Inactive) [ 20/Jul/17 ]

We've been fixing these as we come across them and as they cause issues, but we aren't planning a dedicated pass to clean up all failpoints in the repl code.

Comment by Andy Schwerin [ 05/Dec/16 ]

Except if the failpoint is supposed to block shutdown? In any event, I would like to discourage ever calling the global inShutdown() function. It makes shutdown coordination difficult.
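
One possible reading of the concern above, sketched under assumptions: instead of polling a global function, the blocked code waits on state that its owner signals explicitly. PauseGate and its methods are hypothetical names for this sketch, not server classes.

```cpp
#include <condition_variable>
#include <mutex>

// Hypothetical helper (not a server class). The owner decides when to stop
// the waiter, so nothing here consults a process-wide shutdown flag.
class PauseGate {
public:
    // Turns the simulated fail point on or off and wakes any waiter.
    void setFailPoint(bool on) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _failPointOn = on;
        }
        _cv.notify_all();
    }

    // Called by the owner when the component is being stopped.
    void requestStop() {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _stopRequested = true;
        }
        _cv.notify_all();
    }

    // Blocks while the fail point is on and no stop has been requested.
    // Returns false if a stop has been requested, true otherwise.
    bool waitWhilePaused() {
        std::unique_lock<std::mutex> lk(_mutex);
        _cv.wait(lk, [this] { return !_failPointOn || _stopRequested; });
        return !_stopRequested;
    }

private:
    std::mutex _mutex;
    std::condition_variable _cv;
    bool _failPointOn = false;
    bool _stopRequested = false;
};
```

A fail point that is meant to block shutdown would simply never have requestStop() called on its gate, which keeps that decision with the component's owner.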
