[SERVER-45095] Measure the running time of safe reconfig in a healthy replset Created: 12/Dec/19  Updated: 19/Mar/20  Resolved: 19/Mar/20

Status: Closed
Project: Core Server
Component/s: Replication
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Siyuan Zhou Assignee: Siyuan Zhou
Resolution: Done Votes: 0
Labels: safe-reconfig-testing
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-46907 Speed up config replication acknowled... Closed
Sprint: Repl 2020-03-23
Participants:

 Comments   
Comment by Siyuan Zhou [ 19/Mar/20 ]

I audited the log messages suggested by william.schultz in a few replica_sets tests on the latest commit. Secondaries learn the latest config immediately, since the primary restarts heartbeats as part of the reconfig and secondaries send heartbeats back as soon as they learn of a newer config. As a result, most reconfig commands wait for one round of heartbeats to learn the states of the secondaries, which is the slowest part.

The replset startup phase sets heartbeat intervals to 200ms temporarily, so we can observe a consistent 200ms configWaitDuration and an extra oplogWaitDuration ranging from 0 to 200ms depending on the speed of initial sync and whether the majority number changes.

In the testing phase, configWaitDuration closely matches the 2000ms heartbeat interval in the following tests:

  • reconfig.js (2000ms)
  • reconfig_tags.js (2001ms)
  • reconfig_add_remove_arbiter.js (2001ms)

oplogWaitDuration is 0 in the above tests because they don't need to wait for any extra data replication. In production, oplogWaitDuration should be smaller than the latency of a write with majority write concern in the new config.
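
To make the arithmetic behind these numbers explicit, here is a rough sketch in Python. It assumes the two waits happen one after the other and simply add up, and that configWaitDuration is about one heartbeat interval, which is what the logs above suggest but is not guaranteed by the server. The class and function names are hypothetical and exist only for illustration.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class ReconfigTiming:
    """Hypothetical container for the two durations reported in the log message."""
    config_wait_ms: int  # configWaitDuration
    oplog_wait_ms: int   # oplogWaitDuration

    @property
    def total_ms(self) -> int:
        # Assumption: the waits are sequential, so the total is their sum.
        return self.config_wait_ms + self.oplog_wait_ms


def estimate_bounds(heartbeat_interval_ms: int, max_oplog_wait_ms: int) -> Tuple[int, int]:
    """Return (best, worst) total wait in ms.

    Assumption: configWaitDuration is roughly one heartbeat interval, and
    oplogWaitDuration falls somewhere between 0 and max_oplog_wait_ms.
    """
    best = ReconfigTiming(heartbeat_interval_ms, 0)
    worst = ReconfigTiming(heartbeat_interval_ms, max_oplog_wait_ms)
    return best.total_ms, worst.total_ms


# Startup phase: 200ms heartbeat interval, oplog wait observed between 0 and 200ms.
startup = estimate_bounds(200, 200)

# Testing phase: 2000ms heartbeat interval, oplog wait observed as 0.
testing = estimate_bounds(2000, 0)
```

Under these assumptions the heartbeat interval dominates the total in the testing phase, which matches the observation that waiting for a heartbeat round is the slowest part.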

We could make configWaitDuration shorter by letting the primary learn of the states of secondaries faster. The work is tracked by SERVER-46907.

Comment by William Schultz (Inactive) [ 17/Mar/20 ]

Note that the log message stats we print here should help us get a sense of how long safe reconfigs take in common cases.

Generated at Thu Feb 08 05:07:52 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.