Type: Bug
Resolution: Unresolved
Priority: Minor - P4
Affects Version/s: 6.0.19
We have a sharded cluster under a very heavy write load, with CPU regularly spiking to 100%. Every document is processed once after it is written and is then subject to removal by a TTL index.
Sometimes, however, some shard primaries become so heavily loaded that the background TTL removal process falls behind, and the number of documents pending removal grows. Initially the documents are distributed perfectly evenly across the shards, but once TTL removal runs at different speeds on different shards, the balancer decides that it should rebalance the larger shards.
This actually makes things worse. The balancer picks the shards holding the most documents and tries to move documents from them to shards holding fewer. But the imbalance is caused solely by degraded TTL removal performance, so the balancer activity adds even more load and contention on shards that are already heavily loaded.
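For anyone trying to confirm the same pattern, here is a rough sketch of how the TTL backlog could be measured: count the documents whose TTL field is already past its expiry cutoff. It assumes PyMongo and a TTL index on a date field. The names `expired_filter` and `ttl_backlog` and the field name `createdAt` are made up for illustration and are not from our setup.

```python
from datetime import datetime, timedelta, timezone


def expired_filter(ttl_field, expire_after_seconds, now=None):
    """Build a filter matching documents that are already past their TTL.

    A document is eligible for TTL removal once
    ttl_field + expire_after_seconds <= now.
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(seconds=expire_after_seconds)
    return {ttl_field: {"$lte": cutoff}}


def ttl_backlog(collection, ttl_field, expire_after_seconds, now=None):
    """Count expired documents that the TTL monitor has not deleted yet.

    `collection` is expected to behave like a pymongo Collection
    (only count_documents is used).
    """
    return collection.count_documents(
        expired_filter(ttl_field, expire_after_seconds, now)
    )


# Hypothetical usage (host, database, collection and field are placeholders):
#   from pymongo import MongoClient
#   coll = MongoClient("mongodb://shard0-primary.example:27017")["mydb"]["events"]
#   print(ttl_backlog(coll, "createdAt", 3600))
```

Run through mongos, this gives the backlog for the whole cluster. To compare shards, it would have to be run against each shard separately. The count is itself extra work for a loaded shard, so it is probably best sampled sparingly.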
We worked around the problem by setting an activity window for the balancer that covers the time when the load is relatively low. But would it be possible for the balancer to pause its activity when it sees that the TTL removal backlog is considerably large?
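For reference, the workaround amounts to setting `activeWindow` on the `balancer` document in `config.settings`, which is the documented way to schedule the balancing window. Below is a minimal sketch with PyMongo. The helper names and the 02:00 to 06:00 window are illustrative assumptions, not the values we actually use.

```python
def balancer_window_update(start, stop):
    """Return (filter, update) for config.settings limiting balancing to a window.

    start and stop must be "HH:MM" strings with two digits each.
    """
    for value in (start, stop):
        hours, minutes = value.split(":")  # raises ValueError if there is no single ":"
        if not (len(hours) == 2 and len(minutes) == 2
                and 0 <= int(hours) <= 23 and 0 <= int(minutes) <= 59):
            raise ValueError(f"expected HH:MM, got {value!r}")
    update = {"$set": {"activeWindow": {"start": start, "stop": stop}}}
    return {"_id": "balancer"}, update


def set_balancer_window(settings, start, stop):
    """Apply the window; `settings` should be the config.settings collection."""
    flt, update = balancer_window_update(start, stop)
    return settings.update_one(flt, update, upsert=True)


# Hypothetical usage against a mongos (host name is a placeholder):
#   from pymongo import MongoClient
#   client = MongoClient("mongodb://mongos.example:27017")
#   set_balancer_window(client["config"]["settings"], "02:00", "06:00")
```

The window has to be chosen by hand and stays fixed, so it does not react when TTL removal falls behind at other times. That is why a backlog-aware pause in the balancer itself would be preferable.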