As part of PM-2423 we have been measuring the performance of the migration protocol. We saw some weird numbers related to the cloning of the sessions that would be interesting to understand.
3-shard Sharded cluster running 5.1 binaries.
We create a sharded collection with an initial pre-split of 1K chunks. The shard key is a hashed random number. After that we insert as bulk of 1K documents using retryable writes.
After that we execute a few thousand random migrations. There are not CRUD operations during the execution of this phase.
You can check the Genny worload we executed here.
You can find our results here.
We are plotting two different variables:
- the total execution time spent holding the critical section during the migration (catch-up phase of the migration).
- the total execution time spent holding the critical section blocking reads and writes (commit phase of the migration).
The interesting time is the first one, the second one is more or less constant. We can see an slow down of 30x between the first move chunks and the lasts ones on my machine, after 4K moveChunks. We also got some numbers on EVG, you see them on the different tabs.