Replication lag on resharding donors can lead to critical section timeout


    • Cluster Scalability
    • Fully Compatible
    • ClusterScalability Apr28-May09, ClusterScalability May12-May25
    • None
    • 3

      Currently, to enter the critical section, a donor needs to do a shard version refresh and process the resharding fields. The former involves doing a noop write with writeConcern "majority" with a timeout of 60 seconds. The latter involves persisting the configTime (most recent majority timestamp on the CSRS) to the config.vectorClock collection with writeConcern "majority" with a timeout of 60 seconds.

      For this reason, majority replication lag on a donor can cause it to fail to transition to the critical section within the critical section timeout, or to transition too late for the recipients to finish fetching and applying oplog entries within the critical section timeout.

      Please note that the state transition writes on a donor don't involve waiting for writeConcern "majority".
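
      To illustrate the timing argument above, here is a rough toy model in Python. It is not server code. It assumes each majority wait takes about as long as the donor's current majority replication lag, and `simulate_donor` and `time_budget_s` (the time the donor has left before the critical section timeout would expire) are hypothetical names introduced only for this sketch.

```python
from dataclasses import dataclass

# From the description: each majority wait has a 60 second timeout.
MAJORITY_WAIT_TIMEOUT_S = 60.0


@dataclass
class DonorOutcome:
    entered_critical_section: bool
    elapsed_s: float
    reason: str


def simulate_donor(majority_lag_s: float, time_budget_s: float) -> DonorOutcome:
    """Toy model of a donor trying to enter the critical section.

    Assumption (not from the ticket): each majority wait takes roughly
    `majority_lag_s` seconds, and the two waits happen one after the other.
    """
    elapsed = 0.0
    steps = (
        "shard version refresh (noop write)",
        "resharding fields (config.vectorClock write)",
    )
    for step in steps:
        if majority_lag_s > MAJORITY_WAIT_TIMEOUT_S:
            # The wait gives up after 60 seconds and the step fails.
            elapsed += MAJORITY_WAIT_TIMEOUT_S
            return DonorOutcome(False, elapsed, f"{step} hit the majority wait timeout")
        elapsed += majority_lag_s
        if elapsed > time_budget_s:
            return DonorOutcome(False, elapsed, f"{step} finished after the time budget")
    # The state transition write itself does not wait for majority,
    # so it is modeled here as adding no delay.
    return DonorOutcome(True, elapsed, "entered the critical section in time")
```

      Under this model, a donor with 4 seconds of majority lag and a 5 second budget already misses the budget on the second wait, even though neither wait comes close to its own 60 second timeout.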

              Assignee:
              Cheahuychou Mao
              Reporter:
              Cheahuychou Mao