Long oplog recovery times after SIGABRT failures


    • Type: Question
    • Resolution: Unresolved
    • Priority: Major - P3
    • Affects Version/s: 4.9.0-alpha4
    • Component/s: Storage
    • Replication
    • 3

      Running the eMRCf_runner.sh tests with more than 7 growth iterations and enableMajorityReadConcern set to true results in a SIGABRT when shutting down the primary.

      The test deliberately shuts down the only secondary in a PSA replica set with enableMajorityReadConcern set to true, then runs a large, update-heavy workload (10 growth phases involve roughly 6,000,000 updates).

       

      In this scenario the Oplog Recovery phase takes a significant amount of time (~108 minutes):

      
      {"t":{"$date":"2021-01-11T02:42:11.819+00:00"},"s":"I",  "c":"REPL",     "id":21545,   "ctx":"initandlisten","msg":"Starting recovery oplog application at the stable timestamp","attr":{"stableTimestamp":{"$timestamp":{"t":1610324416,"i":1}}}}
      
      ...
      
      {"t":{"$date":"2021-01-11T04:30:53.247+00:00"},"s":"I",  "c":"REPL",     "id":21536,   "ctx":"initandlisten","msg":"Completed oplog application for recovery","attr":{"numOpsApplied":114391580,"numBatches":22879,"applyThroughOpTime":{"ts":{"$timestamp":{"t":1610331306,"i":2}},"t":2}}} 
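      The elapsed time and apply rate can be cross-checked from the two log lines above. The following is a minimal sketch, not part of the original report: the helper name `recovery_stats` is hypothetical, and the log lines are trimmed copies that keep only the fields used here.

```python
import json
from datetime import datetime

# Trimmed copies of the two log lines quoted above (only the fields used here).
START_LINE = (
    '{"t":{"$date":"2021-01-11T02:42:11.819+00:00"},"id":21545,'
    '"msg":"Starting recovery oplog application at the stable timestamp"}'
)
END_LINE = (
    '{"t":{"$date":"2021-01-11T04:30:53.247+00:00"},"id":21536,'
    '"msg":"Completed oplog application for recovery",'
    '"attr":{"numOpsApplied":114391580,"numBatches":22879}}'
)


def recovery_stats(start_line: str, end_line: str) -> dict:
    """Hypothetical helper: derive duration and throughput from two log lines."""
    start = json.loads(start_line)
    end = json.loads(end_line)

    # The "$date" values are ISO 8601 timestamps with millisecond precision.
    t0 = datetime.fromisoformat(start["t"]["$date"])
    t1 = datetime.fromisoformat(end["t"]["$date"])
    elapsed_s = (t1 - t0).total_seconds()

    ops = end["attr"]["numOpsApplied"]
    batches = end["attr"]["numBatches"]
    return {
        "elapsed_s": elapsed_s,
        "elapsed_min": elapsed_s / 60,
        "ops_per_s": ops / elapsed_s,
        "ops_per_batch": ops / batches,
    }


stats = recovery_stats(START_LINE, END_LINE)
for name, value in stats.items():
    print(f"{name}: {value:.1f}")
```

      For these two lines the sketch gives an elapsed time of just under 109 minutes, roughly 17,500 applied operations per second, and just under 5,000 operations per batch.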

       

       Given that this is a PSA configuration, the replica set will not be available during this recovery. Is this amount of time expected for this case?

              Assignee:
              [DO NOT USE] Backlog - Replication Team
              Reporter:
              James O'Leary
              Votes:
              0
              Watchers:
              12