Replication lag on resharding donors can lead to critical section timeout


    • Cluster Scalability
    • Fully Compatible
    • ClusterScalability Apr28-May09, ClusterScalability May12-May25
    • 3

      Currently, to enter the critical section, a donor must refresh its shard version and process the resharding fields. The refresh performs a no-op write with writeConcern "majority" and a timeout of 60 seconds. Processing the resharding fields persists the configTime (the most recent majority timestamp on the CSRS) to the config.vectorClock collection, also with writeConcern "majority" and a timeout of 60 seconds.

      For this reason, majority replication lag on a donor can cause it to fail to transition to the critical section within the critical section timeout, or to transition too late for the recipients to finish fetching and applying oplog entries within that timeout.
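
To make the timing argument concrete, here is a rough back-of-the-envelope sketch, not the server's actual logic. It assumes a steady majority replication lag, so each of the two majority waits costs about one lag interval and the two waits run back to back. The function names, the `recipient_catchup_s` parameter, and the model itself are hypothetical. The critical section timeout is passed in as a parameter because its value is not stated here.

```python
from typing import Optional

# From the description: each majority write waits up to 60 seconds.
MAJORITY_WAIT_TIMEOUT_S = 60.0

# The two majority waits a donor performs before entering the critical section.
DONOR_STEPS = (
    "shard version refresh (no-op write)",
    "persist configTime to config.vectorClock",
)


def donor_entry_delay(majority_lag_s: float,
                      wait_timeout_s: float = MAJORITY_WAIT_TIMEOUT_S) -> Optional[float]:
    """Estimate seconds a donor spends in the two sequential majority waits.

    Assumption (hypothetical model): with steady lag, each majority wait
    takes roughly `majority_lag_s`. Returns None if a wait would exceed
    its own timeout.
    """
    total = 0.0
    for _step in DONOR_STEPS:
        if majority_lag_s > wait_timeout_s:
            return None  # the majority wait itself times out
        total += majority_lag_s
    return total


def fits_in_critical_section(majority_lag_s: float,
                             critical_section_timeout_s: float,
                             recipient_catchup_s: float = 0.0) -> bool:
    """True if donor entry plus recipient catch-up fits within the timeout."""
    delay = donor_entry_delay(majority_lag_s)
    if delay is None:
        return False
    return delay + recipient_catchup_s <= critical_section_timeout_s


if __name__ == "__main__":
    # Example lags checked against an illustrative 5-second timeout.
    for lag in (0.5, 3.0, 61.0):
        print(lag, donor_entry_delay(lag), fits_in_critical_section(lag, 5.0))
```

Under this model, a lag of only a few seconds per majority wait can already use up a short critical section timeout before the recipients have had any time to catch up.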

      Please note that the state transition writes on a donor don't involve waiting for writeConcern "majority".

            Assignee:
            Cheahuychou Mao
            Reporter:
            Cheahuychou Mao
