[SERVER-71627] Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks Created: 26/Nov/22  Updated: 18/Dec/23  Resolved: 11/Jul/23

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: 7.1.0-rc0, 4.2.25, 7.0.1, 6.0.10, 5.0.21, 4.4.25

Type: Improvement Priority: Major - P3
Reporter: y yz Assignee: Tommaso Tocci
Resolution: Fixed Votes: 1
Labels: balancer-round-perf, chunkmap-improvements
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PDF File MongoDB Routing Info Refresh Optimization-1.pdf     PDF File MongoDB Routing Info Refresh Optimization.pdf     PNG File image-2022-11-26-16-21-42-695.png     PNG File image-2022-11-26-16-23-01-172.png     PNG File image-2022-11-26-16-23-18-848.png     PNG File image-2022-11-26-16-23-53-761.png     PNG File image-2022-11-26-16-24-26-994.png     PNG File image-2022-11-26-16-24-40-471.png     PNG File image-2022-11-26-16-24-57-224.png     PNG File image-2022-11-26-16-25-41-574.png     PNG File image-2022-11-26-16-28-47-370.png     PNG File image-2022-11-26-16-34-25-167.png    
Issue Links:
Backports
Depends
depends on SERVER-76828 Increase test coverage for RoutingTab... Closed
is depended on by SERVER-67529 Resharding silently skips documents w... Closed
is depended on by SERVER-52776 IDL-ize ChunkType Open
is depended on by SERVER-77090 Prevent ChunkMap to be constructed wi... Closed
is depended on by SERVER-78495 Throw out vestiges of enableFinerGrai... In Code Review
Documented
is documented by DOCS-16257 [SERVER] Investigate changes in SERVE... Closed
Problem/Incident
causes SERVER-81966 Avoid modification of previous ChunkM... Closed
causes SERVER-80596 Avoid useless initial iteration in Ch... Closed
Related
Assigned Teams:
Sharding EMEA
Backwards Compatibility: Fully Compatible
Backport Requested:
v7.0, v6.0, v5.0, v4.4, v4.2
Sprint: Sharding EMEA 2023-01-09, Sharding EMEA 2023-01-23, Sharding EMEA 2023-02-06, Sharding EMEA 2023-02-20, Sharding EMEA 2023-03-06, Sharding EMEA 2023-03-20, Sharding EMEA 2023-04-03, Sharding EMEA 2023-04-17, Sharding EMEA 2023-05-01, Sharding EMEA 2023-05-15, Sharding EMEA 2023-05-29, Sharding EMEA 2023-06-12, Sharding EMEA 2023-06-26, Sharding EMEA 2023-07-10, Sharding EMEA 2023-07-24
Participants:
Case:
Linked BF Score: 5

 Description   

Refreshing routing info happens under a lot of circumstances on mongos & mongod, e.g. splitting & moving chunks & shard version Check(when routing requests for read/write queries), etc. Efficiency of refreshing is crucial to MongoDB sharded cluster’s core functionalities.
In production clusters, chunk number grows rapidly with data keeps flowing in, resulting longer refreshing duration, all client requests are blocked. Although the sql of client requests is simple and the system load (CPU, MEM, IO) is low, client request jitter time has high latency during the route refreshing. For example, a cluster with 1 million chunks, it’d take seconds to do the refresh, severely blocking all client queries.



 Comments   
Comment by Githook User [ 21/Aug/23 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks
Branch: v6.0
https://github.com/mongodb/mongo/commit/00ae536924d887841bc5decd691ed46c92b1f7a6

Comment by Githook User [ 21/Aug/23 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks
Branch: v4.4
https://github.com/mongodb/mongo/commit/24c9b7498f68eef681d6154bbbda73d3d4f6d9b9

Comment by Githook User [ 21/Aug/23 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks
Branch: v5.0
https://github.com/mongodb/mongo/commit/1bfdeba2f93bf65327a73badd3a50f5b856e27c1

Comment by Githook User [ 17/Aug/23 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks
Branch: v7.0
https://github.com/mongodb/mongo/commit/58a9e97df59b7ed86821555d86097f224886eb71

Comment by Githook User [ 15/Aug/23 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks

BACKPORT-16609
Branch: v4.2
https://github.com/mongodb/mongo/commit/aed3561d4cf3463e78801ef73fbf563919b1397c

Comment by Githook User [ 11/Jul/23 ]

Author:

{'name': 'Tommaso Tocci', 'email': 'tommaso.tocci@mongodb.com', 'username': 'toto-dev'}

Message: SERVER-71627 Refreshed cached collection route info will severely block all client request when a cluster with 1 million chunks

Many thanks to yangyazhou, demonyang, ycycyyc, pengzhenyi2015, wujunyuyuyu, lakeleisu and jerrygxx for the original idea on how to significantly improve performance of updating the routing table.
Branch: master
https://github.com/mongodb/mongo/commit/502ee4edd68cf833bd5b2b5f98c4538a6d9ce6eb

Comment by Tommaso Tocci [ 11/Jul/23 ]

Hi 1147952115@qq.com !

I would like to thank you, and the entire Tencent MongoDB team, again for the detailed report and the code change proposal. We really appreciate the big effort.

We reviewed and tested thoroughly the code you submitted, and we really like the idea of Two-Dimensional Sorting & Search. The performance improvements for incremental refreshes are excellent.
On the other hand, we found some issues with the proposed implementation. In particular, we discovered:

  • A correctness bug in ChunkMap::forEachOverlappingChunk that keeps processing chunks even if they are not overlapping with the given range. In particular, the problem is caused by this break that is only interrupting the inner for-loop but not the outer one.
  • A performance regression in ChunkMap::_findIntersectingChunk caused by a vector copy that can be avoided. This is regression is particularly relevant because this function is on the hot path executed by most data queries.
  • A correctness bug in ChunkMap::makeUpdated. The algorithm is producing a wrong chunk map in case updated chunks contain merged chunks that span across chunk vectors boundaries.

Thus, we ended up changing the code quite a bit to fix the bug and the performance regressions. In particular, we re-wrote entirely the merging algorithm to support all possible chunk operations.

Comment by y yz [ 13/Feb/23 ]

hi, matt.panton@mongodb.com

It's a great honor to get your approval,  we have signed the agreement and email to you.

thanks.

Comment by Matt Panton [ 09/Feb/23 ]

Hello 1147952115@qq.com!

I am Matt, a Product Manager at MongoDB. After reviewing your pull request, we want to include the optimization in future releases of MongoDB!

Before we can include the optimization in a future release, we need you and all of the members of your team that worked on this optimization to sign the Contributor Agreement. You can email the signed agreement(s) to me at matt.panton@mongodb.com 

After filling out and returning the signed contributor agreement, MongoDB can begin merging the code.

Thank you for your contribution! I am thrilled to see the future performance benefits this optimization will deliver to you, your organization, and customers worldwide.

Comment by Chris Kelly [ 26/Nov/22 ]

Thank you for your incredibly detailed summary with supporting documentation! 

I'm going to pass this on to the relevant team to look into.

Comment by y yz [ 26/Nov/22 ]

MongoDB Routing Info Refresh Optimization.pdf

Comment by y yz [ 26/Nov/22 ]

"MongoDB routing refresh mechanism limitation analysis" and "Proposed MongoDB Incremental Routing Info Refresh Method: Two-Dimensional Sorting & Search" please refer to the attached PDF(MongoDB Routing  Info Refresh Optimization.pdf).

 

The optimization code push: https://github.com/mongodb/mongo/pull/1506

 

Comment by y yz [ 26/Nov/22 ]

Tencent MongoDB team have solved the problem thoroughly by Two-Dimensional Sorting & Search. After optimization, latency consumption remained at 2ms regardless of the data-size of the shrading cluster.

  • Performance comparison before and after optimization
MongoDB Version Total Data Size(TB) Total Chunk Number(M) Elapsed Time of queries(ms) Elapsed Time after optimization (ms)
3.6 80 4.5 4500 2
4.0 1.2 0.25 300 2
4.2 25 1.5 1200 2
5.0 30 2 910 2
5.0 80 5 2600 2

After optimization, refreshing incremental routing info’s time cost is around 2ms, and most of the elapsed time is spent on retrieving changed chunks from the Config Server, while generating the new ChunkMap only takes a very short period (< 1ms)

      instructions:Here, 5.0 version's shard key is int id(ranging from 0 to 100,000,000), so it take less time.  In fact, the shard key in product is more complex, so it take more time than id shard key.

      The optimization code address: https://github.com/mongodb/mongo/pull/1506

  • Logs before and after optimization(5M chunk size)

Logs before optimization:

  1. {"t": \{"$date": "2022-10-17T11: 15: 56.209+08: 00"}

    ,"s": "I",  "c": "SH_REFR",  "id": 4619901, "ctx": "CatalogCache-3","msg": "Refreshed cached collection","attr": {"namespace": "test.test2","lookupSinceVersion": {"0": {"$timestamp": {"t": 49188,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"newVersion": {"chunkVersion": {"0": {"$timestamp": {"t": 49189,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"forcedRefreshSequenceNum": 15,"epochDisambiguatingSequenceNum": 17},"timeInStore": {"chunkVersion": {"0": {"$timestamp": {"t": 49189,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"forcedRefreshSequenceNum": 15,"epochDisambiguatingSequenceNum": 16},"durationMillis": 2442}}  

Logs after optimization:

  1. {"t": \{"$date": "2022-10-17T15: 40: 01.742+08: 00"}

    ,"s": "I",  "c": "SH_REFR",  "id": 4619901, "ctx": "CatalogCache-6","msg": "Refreshed cached collection","attr": {"namespace": "test.test2","lookupSinceVersion": {"0": {"$timestamp": {"t": 49185,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"newVersion": {"chunkVersion": {"0": {"$timestamp": {"t": 49186,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"forcedRefreshSequenceNum": 27,"epochDisambiguatingSequenceNum": 29},"timeInStore": {"chunkVersion": {"0": {"$timestamp": {"t": 49186,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"forcedRefreshSequenceNum": 27,"epochDisambiguatingSequenceNum": 28},"durationMillis": 2}}  

 

 

Comment by y yz [ 26/Nov/22 ]

I'm terribly sorry,this is an improved feature, but the jira platform changed it as a fault type.

 

I am from tencent cloud mongodb team, Our company is a partner with yours.

In the past several years, Tencent MongoDB team has noticed unusual slow queries on sharded clusters out of blue while there’s no sign of any system resource shortage (CPU, RAM, I/O, etc). Further looking into this symptom, the team figured out retrieving incremental routing info would take a lot of time when total chunk number exceeds certain threshold on shared clusters. For instance, a sharded cluster with 250k chunks requires around 300ms to refresh routing info; For larger clusters, like a cluster with 1 million chunks, it’d take seconds to do the refresh, severely blocking all client queries.

Other than that, refreshing routing info could consume more CPU, when a cluster has multiple shard tables, CPU jitter is more obvious when refreshing routing info at the same time. it increased the cost of business development and limits the distributed function

Below are some of the cases we found in production & testing environment.

1.   Background

Case 1: v4.0 Product Cluster with 250k chunks

  • Cluster Info

Data size:  5.5 billion docs, 1.2TB in total size;

Chunk number: 250k;

Refreshing routing info duration:  200ms for mongos, 300ms for mongod.

  • Related mongos logs{}

  • Thu Oct  6 11: 28: 42.556 I SH_REFR  [ConfigServerCatalogCacheLoader-85148] Refresh forcollection orderSchedule.OrderDispatchLogDetail from version 102961|686||62d157722a3a66acadc3b7a4 to version 102961|701||62d157722a3a66acadc3b7a4 took 190 ms  
  • Thu Oct  6 11: 28: 44.914 I SH_REFR  [ConfigServerCatalogCacheLoader-85148] Refresh forcollection orderSchedule.OrderDispatchLogDetail from version 102961|701||62d157722a3a66acadc3b7a4 to version 102961|704||62d157722a3a66acadc3b7a4 took 183 ms  
  • Thu Oct  6 11: 29: 41.923 I SH_REFR  [ConfigServerCatalogCacheLoader-85149] Refresh forcollection orderSchedule.OrderDispatchLogDetail from version 102961|704||62d157722a3a66acadc3b7a4 to version 102961|707||62d157722a3a66acadc3b7a4 took 194 ms  
  • Thu Oct  6 11: 32: 02.121 I SH_REFR  [ConfigServerCatalogCacheLoader-85151] Refresh forcollection orderSchedule.OrderDispatchLogDetail from version 102961|707||62d157722a3a66acadc3b7a4 to version 102961|723||62d157722a3a66acadc3b7a4 took 198 ms  
  • Related mongod logs

  • Thu Oct  6 11: 24: 11.358 I SH_REFR  [ConfigServerCatalogCacheLoader-191078] Refresh for collection orderSchedule.OrderDispatchLogDetail from version 102961|603||62d157722a3a66acadc3b7a4 to version 102961|628||62d157722a3a66acadc3b7a4 took 262 ms  
  • Thu Oct  6 11: 24: 21.306 I SH_REFR  [ConfigServerCatalogCacheLoader-191078] Refresh for collection orderSchedule.OrderDispatchLogDetail from version 102961|628||62d157722a3a66acadc3b7a4 to version 102961|631||62d157722a3a66acadc3b7a4 took 285 ms  
  • Thu Oct  6 11: 24: 45.905 I SH_REFR  [ConfigServerCatalogCacheLoader-191078] Refresh for collection orderSchedule.OrderDispatchLogDetail from version 102961|631||62d157722a3a66acadc3b7a4 to version 102961|634||62d157722a3a66acadc3b7a4 took 265 ms  
  • Thu Oct  6 11: 25: 20.979 I SH_REFR  [ConfigServerCatalogCacheLoader-191078] Refresh for collection OrderDispatchLogDetailfrom version 102961|634||62d157722a3a66acadc3b7a4 to version 102961|644||62d157722a3a66acadc3b7a4 took 252 ms  

Case 2: v4.2 Product Cluster with 1.5m chunks

  • Cluster Info{}

Data size: 15.5 billion docs,22.5TB in total size;

Chunk number: 1.5 million;

Refreshing routing info duration: 800ms for mongos, 1.2s for mongod.

  • Related mongos logs

  • 2022-10-05T04: 18: 41.359+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-136639] Refresh for collection wukong.actions from version 102910|1||626ba80f5fa7cb632d7bf264 to version 102910|4||626ba80f5fa7cb632d7bf264 took 788 ms  
  • 2022-10-05T04: 18: 50.800+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-136639] Refresh for collection wukong.actions from version 102910|4||626ba80f5fa7cb632d7bf264 to version 102910|7||626ba80f5fa7cb632d7bf264 took 780 ms  
  • 2022-10-05T04: 19: 18.546+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-136639] Refresh for collection wukong.actions from version 102910|7||626ba80f5fa7cb632d7bf264 to version 102911|1||626ba80f5fa7cb632d7bf264 took 778 ms  
  • 2022-10-05T04: 20: 01.105+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-136640] Refresh for collection wukong.actions from version 102911|1||626ba80f5fa7cb632d7bf264 to version 102912|1||626ba80f5fa7cb632d7bf264 took 781 ms 
  • Related mongod logs

  • 2022-10-06T10: 54: 49.219+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-141236] Refresh for collection wukong.actions from version 103200|584||626ba80f5fa7cb632d7bf264 to version 103200|593||626ba80f5fa7cb632d7bf264 took 1001 ms  
  • 2022-10-06T10: 57: 42.071+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-141237] Refresh for collection wukong.actions from version 103200|593||626ba80f5fa7cb632d7bf264 to version 103200|608||626ba80f5fa7cb632d7bf264 took 1200 ms  
  • 2022-10-06T11: 00: 36.781+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-141240] Refresh for collection wukong.actions from version 103200|608||626ba80f5fa7cb632d7bf264 to version 103200|623||626ba80f5fa7cb632d7bf264 took 1146 ms  
  • 2022-10-06T11: 03: 34.142+0800 I  SH_REFR  [ConfigServerCatalogCacheLoader-141241] Refresh for collection wukong.actions from version 103200|623||626ba80f5fa7cb632d7bf264 to version 103200|632||626ba80f5fa7cb632d7bf264 took 1129 ms  

Case 3: v3.6 Product Cluster with 4.3m chunks

  • Cluster Info

Data size: 120 billion docs, 80TB data size in total;

Chunk number: 430 million;

Refreshing routing info duration: 4s for mongos, 4.6s for mongod.

Case 4: v5.0 Test Cluster with 2m chunks

  • Cluster Info

Set up a v5.0 sharded cluster with 2 million chunks – Use “id” field as shard key, ranging from 0 to 100,000,000, generate 2m chunks by pre-splitting chunks.

instructions:Here, 5.0 version's shard key is int id(ranging from 0 to 100,000,000), so it take less time.  In fact, the shard key in product is more complex, so it take more time than the simple id shard key.

  • Related mongos logs

//Refreshing routing info

{"t":

{"$date": "2022-10-06T17: 46: 25.479+08: 00"}

,"s": "I",  "c": "SH_REFR",  "id": 4619901, "ctx": "CatalogCache-2","msg": "Refreshed cached collection","attr": {"namespace": "test.test2","lookupSinceVersion": {"0": {"$timestamp": {"t": 49182,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2": {"$timestamp":

{"t": 1651140151,"i": 6}

}},"newVersion": {"chunkVersion": {"0": {"$timestamp": {"t": 49182,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2": {"$timestamp":

{"t": 1651140151,"i": 6}

}},"forcedRefreshSequenceNum": 21,"epochDisambiguatingSequenceNum": 18},"timeInStore": {"chunkVersion": : {"0": {"$timestamp": {"t": 49182,"i": 1}},"1": {"$oid": "626a663821072b82d9059209"},"2": {"$timestamp": {"t": 1651140151,"i": 6}},"forcedRefreshSequenceNum": 20,"epochDisambiguatingSequenceNum": 17},"durationMillis": 896}} 

Case 5: v5.0 Test Cluster with 5m chunks

  • Cluster Info

Set up a v5.0 sharded cluster with 5 million chunks – Use “id” field as shard key, generate 5m chunks by pre-splitting chunks.

instructions:Here, 5.0 version's shard key is int id(ranging from 0 to 100,000,000), so it take less time.  In fact, the shard key in product is more complex, so it take more time than the simple id shard key.

  • Related mongod logs
  1. {"t": \{"$date":"2022-10-17T16:15:56.209+08:00"}

    ,"s":"I",  "c":"SH_REFR",  "id":4619901, "ctx":"CatalogCache-3","msg":"Refreshed cached collection","attr":{"namespace":"test.test2","lookupSinceVersion":{"0":{"$timestamp":{"t":49188,"i":1}},"1":{"$oid":"626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"newVersion":{"chunkVersion":{"0":{"$timestamp":{"t":49189,"i":1}},"1":{"$oid":"626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"forcedRefreshSequenceNum":15,"epochDisambiguatingSequenceNum":17},"timeInStore":{"chunkVersion":{"0":{"$timestamp":{"t":49189,"i":1}},"1":{"$oid":"626a663821072b82d9059209"},"2":
    Unknown macro: {"$timestamp"}
    ,"forcedRefreshSequenceNum":15,"epochDisambiguatingSequenceNum":16},"durationMillis":2442}}  

2.   Issue Impact

More chunks and more complex shardkey,more time it takes. If Refreshing routing info takes too long, the main effects are as follows: 

  • Cluster becomes un-responsive

All requests will be blocked when mongos/mongod is retrieving incremental routing info, henceforth the bigger the cluster is, the more un-responsive it could be.

  • Uneven data distribution among shards

A sharded cluster with 1.4 million chunks had to disable balancer except for several hours during midnight(The  client request qps is very low, CPU & IO & MEM are very low load), due to the frequent slow queries caused by refreshing routing info. However the balancing progress made during midnights never caught up the gap and the shards imbalance ends up like below, and still worsening:

 After analysis, the cluster jitter is caused by route refreshing, the system load is low during this period.

  • increasing the cost of business development and limits the distributed function

In order to avoid serious service jitter caused by routing problems, many users strictly limit the data amount of a collection(data of a single collection should not exceed 4T).

When the amount of data in a collection exceeds 4T, the user needs to separate the collection. In this way, we lose the core distributed advantage that we have.

  • High CPU consumption

Multiple ChunkVector iterator, a lot of CPU resources are used, Especially if there's a lot of collection and a lot of chunks.

 

Generated at Thu Feb 08 06:19:32 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.