Recently, we have been running several sharded clusters on version 4.0.26. Sometimes an update operation is extremely slow, taking from tens of seconds to a few minutes.
After in-depth analysis, I think this is a bug.
First, when a moveChunk happens, a chunk moves from shard A to shard B. B first cleans up any existing data in that chunk's range, and it waits for that cleanup to finish so that the new chunk data cannot be deleted by an older cleanup task that is still pending. As a result, a moveChunk can take a very long time, up to 15 minutes (rangeDeleterBatchDelayMS).
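To see which delays are actually in effect on a shard, the cleanup-related server parameters can be read with getParameter. The following is a minimal sketch, not a definitive diagnostic. It assumes a pymongo-style client connected directly to a shard mongod. The helper names are hypothetical, and the inclusion of orphanCleanupDelaySecs (whose documented default, as far as I know, is 900 seconds) is my own assumption about which parameter is relevant alongside rangeDeleterBatchDelayMS.

```python
# Parameters to inspect. rangeDeleterBatchDelayMS is the one named above;
# orphanCleanupDelaySecs is added here as an assumption about relevance.
CLEANUP_PARAMETERS = ("rangeDeleterBatchDelayMS", "orphanCleanupDelaySecs")


def build_get_parameter_command(names=CLEANUP_PARAMETERS):
    """Build a getParameter command document asking for the given parameters."""
    # The command name must be the first key; dicts keep insertion order.
    command = {"getParameter": 1}
    for name in names:
        command[name] = 1
    return command


def read_cleanup_parameters(client, names=CLEANUP_PARAMETERS):
    """Return {parameter name: value} as reported by the server.

    `client` is expected to behave like pymongo.MongoClient, i.e. to expose
    `client.admin.command(<command document>)`. Parameters missing from the
    reply are reported as None.
    """
    reply = client.admin.command(build_get_parameter_command(names))
    return {name: reply.get(name) for name in names}
```

With pymongo installed, this could be called as `read_cleanup_parameters(MongoClient("mongodb://shard-host:27018"))`, where the host and port are placeholders. The values are per mongod, so each shard member has to be queried separately.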
Secondly, there is a Jira ticket, https://jira.mongodb.org/browse/SERVER-56779. From 4.0.26, MongoDB no longer uses the collection distributed lock for chunk merges and uses the ActiveMigrationsRegistry instead. But this causes a new scenario:
a split is blocked by a moveChunk until that moveChunk has ended.
Lastly, in 4.0.26 the auto-split is also triggered by mongos and runs as part of the update operation.
So sometimes the following scenario occurs: a chunk is moved from shard A to shard B, and then it is moved back from shard B to shard A. The second moveChunk is blocked for up to 15 minutes. The update operation is then blocked by splitChunk, and splitChunk is waiting for that last moveChunk to finish.
From 4.2, auto-split is triggered by mongod and is an asynchronous task, so this problem only affects 4.0.26.
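The wait chain described above (update waits on splitChunk, splitChunk waits on moveChunk, moveChunk waits on range cleanup) can be illustrated with a toy latency model. This is only a sketch of the reasoning, not MongoDB code: the class and function names are hypothetical, all durations are made-up inputs, and it assumes the worst case where the update arrives just as the second moveChunk starts.

```python
from dataclasses import dataclass


@dataclass
class Scenario:
    """Durations in milliseconds; every value is an illustrative input."""
    cleanup_wait_ms: int  # time the second moveChunk waits for the pending cleanup
    chunk_copy_ms: int    # time to copy the chunk once the cleanup has finished
    split_ms: int         # time spent in splitChunk itself
    update_ms: int        # time for the write itself


def update_latency_ms(scenario: Scenario, auto_split_inline: bool) -> int:
    """Client-visible latency of an update that triggers an auto-split
    while a moveChunk on the same collection is in progress."""
    move_chunk_ms = scenario.cleanup_wait_ms + scenario.chunk_copy_ms
    if auto_split_inline:
        # 4.0.26 as described above: the split is part of the update and
        # cannot start until the active moveChunk has ended.
        return move_chunk_ms + scenario.split_ms + scenario.update_ms
    # 4.2 and later as described above: the split runs asynchronously,
    # so the update does not wait for it.
    return scenario.update_ms


worst_case = Scenario(cleanup_wait_ms=900_000, chunk_copy_ms=30_000,
                      split_ms=500, update_ms=10)
print(update_latency_ms(worst_case, auto_split_inline=True))   # 930510
print(update_latency_ms(worst_case, auto_split_inline=False))  # 10
```

With a 15 minute cleanup wait, the inline case yields an update latency of roughly 15.5 minutes in this model, while the asynchronous case is unaffected by the migration.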