[SERVER-54441] Long Oplog Recovery times after SigAbort failures Created: 10/Feb/21  Updated: 31/Aug/23

Status: Blocked
Project: Core Server
Component/s: Storage
Affects Version/s: 4.9.0-alpha4
Fix Version/s: None

Type: Question
Priority: Major - P3
Reporter: James O'Leary
Assignee: Backlog - Replication Team
Resolution: Unresolved
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to WT-7079 Long recovery time after unclean shut... Closed
Assigned Teams:
Replication
Participants:

 Description   

Running the eMRCf_runner.sh tests with more than 7 growth iterations and enableMajorityReadConcern set to true results in a SIGABRT when shutting down the primary.

The test deliberately shuts down the only secondary in a PSA replica set with enableMajorityReadConcern set to true and then runs a large, update-heavy workload (10 growth phases involve roughly 6,000,000 updates).

 

In this scenario the Oplog Recovery phase takes a significant amount of time (~108 minutes):

 
{"t":{"$date":"2021-01-11T02:42:11.819+00:00"},"s":"I",  "c":"REPL",     "id":21545,   "ctx":"initandlisten","msg":"Starting recovery oplog application at the stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1610324416,"i":1}}}}
 
...
 
{"t":{"$date":"2021-01-11T04:30:53.247+00:00"},"s":"I",  "c":"REPL",     "id":21536,   "ctx":"initandlisten","msg":"Completed oplog application for recovery","attr":{"numOpsApplied":114391580,"numBatches":22879,"applyThroughOpTime":{"ts":{"$timestamp":{"t":1610331306,"i":2}},"t":2}}} 

 

Because this is a PSA configuration, the replica set is unavailable for the duration of this recovery. Is a recovery time of this length expected in this case?
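
For reference, the recovery duration and a rough apply rate can be derived from the two log lines quoted above. The following is a minimal sketch, not part of the test harness: the function name `recovery_stats` and the returned keys are hypothetical, and the log lines are trimmed to the fields actually used. It assumes the "t" field inside a `$timestamp` is seconds since the Unix epoch.

```python
import json
from datetime import datetime

# The two log lines from this ticket, trimmed to the fields used below.
START_LINE = (
    '{"t":{"$date":"2021-01-11T02:42:11.819+00:00"},"id":21545,'
    '"attr":{"stableTimestamp":{"$timestamp":{"t":1610324416,"i":1}}}}'
)
END_LINE = (
    '{"t":{"$date":"2021-01-11T04:30:53.247+00:00"},"id":21536,'
    '"attr":{"numOpsApplied":114391580,"numBatches":22879,'
    '"applyThroughOpTime":{"ts":{"$timestamp":{"t":1610331306,"i":2}},"t":2}}}'
)


def recovery_stats(start_line, end_line):
    """Hypothetical helper: summarize oplog recovery from its start/end log lines."""
    start = json.loads(start_line)
    end = json.loads(end_line)

    # Wall-clock time between "Starting recovery ..." and "Completed oplog application ...".
    t0 = datetime.fromisoformat(start["t"]["$date"])
    t1 = datetime.fromisoformat(end["t"]["$date"])
    elapsed = (t1 - t0).total_seconds()

    ops = end["attr"]["numOpsApplied"]
    batches = end["attr"]["numBatches"]

    # Span of oplog replayed, assuming "t" is seconds since the epoch.
    stable_t = start["attr"]["stableTimestamp"]["$timestamp"]["t"]
    through_t = end["attr"]["applyThroughOpTime"]["ts"]["$timestamp"]["t"]

    return {
        "elapsed_seconds": elapsed,
        "elapsed_minutes": elapsed / 60,
        "ops_per_second": ops / elapsed,
        "ops_per_batch": ops / batches,
        "oplog_span_seconds": through_t - stable_t,
    }


if __name__ == "__main__":
    for key, value in recovery_stats(START_LINE, END_LINE).items():
        print(f"{key}: {value:.3f}")
```

With the values in this ticket, the elapsed time works out to about 6,521 seconds, the apply rate to roughly 17,500 operations per second, and the replayed oplog span to 6,890 seconds.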



 Comments   
Comment by Bruce Lucas (Inactive) [ 11/Feb/21 ]

FTR, the SIGABRT was due to OOM.

Generated at Thu Feb 08 05:33:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.