[SERVER-38929] Refresh of RoutingTableHistory causes large rise in memory usage
Created: 10/Jan/19  Updated: 06/Dec/22  Resolved: 15/Jan/19

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 4.0.1
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Danny Hatcher (Inactive)
Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2019-01-08 at 5.10.45 PM.png    
Issue Links:
Duplicate
duplicates SERVER-36443 Long-running queries should not cause... Closed
Assigned Teams:
Sharding
Operating System: ALL
Participants:

Description

In a sharded cluster, a long-running query can cause a shard to refresh the routing table history multiple times. If the sharded cluster is very large, the routing table history can consume a large amount of memory and eventually lead to an out-of-memory (OOM) condition.

The attached snapshot of call stacks shows 6.5 GB being used solely to update the routing table history.

Here is the balancer information:

balancer:
        Currently enabled:  yes
        Currently running:  yes
        Collections with active migrations: 
                buildlogs.logs started at Tue Jan 08 2019 21:57:37 GMT+0000 (UTC)
        Failed balancer rounds in last 5 attempts:  0
        Migration Results for the last 24 hours: 
                2990 : Success
                1 : Failed with error 'aborted', from logkeeperdb-shard_26 to logkeeperdb-shard_24
                1 : Failed with error 'aborted', from logkeeperdb-shard_26 to logkeeperdb-shard_21
                2 : Failed with error 'aborted', from logkeeperdb-shard_14 to logkeeperdb-rs0
                1 : Failed with error 'aborted', from logkeeperdb-shard_9 to logkeeperdb-shard_18
                1 : Failed with error 'aborted', from logkeeperdb-shard_15 to logkeeperdb-shard_21
                1 : Failed with error 'aborted', from logkeeperdb-shard_15 to logkeeperdb-shard_17
                1 : Failed with error 'aborted', from logkeeperdb-shard_15 to logkeeperdb-shard_22
                1 : Failed with error 'aborted', from logkeeperdb-shard_17 to logkeeperdb-shard_12
                1 : Failed with error 'aborted', from logkeeperdb-shard_14 to logkeeperdb-shard_4
                1 : Failed with error 'aborted', from logkeeperdb-shard_22 to logkeeperdb-shard_13
                1 : Failed with error 'aborted', from logkeeperdb-shard_13 to logkeeperdb-shard_8
                1 : Failed with error 'aborted', from logkeeperdb-shard_17 to logkeeperdb-shard_20

mongos> db.chunks.find({ns: "buildlogs.logs"}).count()
1303476
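
To put the chunk count in perspective, here is a rough sizing sketch in plain JavaScript. It only does arithmetic on the two figures reported in this ticket (1303476 chunks and 6.5 GB). The function names, the 500-byte per-entry footprint, and the number of retained copies are hypothetical placeholders chosen for illustration, not measurements from the server.

```javascript
// Back-of-envelope sizing sketch. Inputs other than the chunk count and the
// 6.5 GB figure are hypothetical, not measured values.

// Memory held if `copies` routing-table copies are alive at the same time.
function estimateRoutingTableBytes(chunkCount, bytesPerChunk, copies) {
  return chunkCount * bytesPerChunk * copies;
}

// Aggregate bytes attributable to each chunk, summed over all retained copies.
function aggregateBytesPerChunk(totalBytes, chunkCount) {
  return totalBytes / chunkCount;
}

const chunkCount = 1303476;   // from the db.chunks count above
const observedBytes = 6.5e9;  // 6.5 GB, assuming decimal gigabytes

console.log(
  Math.round(aggregateBytesPerChunk(observedBytes, chunkCount)) +
    " bytes per chunk in aggregate"
);

// ASSUMPTION: 500 bytes per entry and 4 retained copies are placeholder values.
console.log(estimateRoutingTableBytes(chunkCount, 500, 4) + " bytes");
```

The point of the sketch is that the footprint scales with the chunk count multiplied by the number of routing-table copies kept alive, so with over a million chunks even a few retained copies can plausibly reach gigabytes.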


Generated at Thu Feb 08 04:50:29 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.