[SERVER-85590] Mongos startup time increases linearly with the number of databases in the cluster Created: 23/Jan/24  Updated: 05/Feb/24  Resolved: 05/Feb/24

Status: Closed
Project: Core Server
Component/s: Admin, Sharding
Affects Version/s: 5.0 Required, 4.4 Required, 6.0 Required
Fix Version/s: None

Type: Bug Priority: Minor - P4
Reporter: Anthony Brodard Assignee: Unassigned
Resolution: Done Votes: 0
Labels: mongos
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Mongodb CE 4.4.27 - 5.0.24 - Ubuntu 22.04


Operating System: ALL
Steps To Reproduce:
  1. Create an empty cluster
  2. Restart a mongos, note the time diff between `"msg":"Build Info"` and `"msg":"Waiting for connections"` events
  3. Create large batch of databases
  4. Restart a mongos, note the time diff between `"msg":"Build Info"` and `"msg":"Waiting for connections"` events
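The time difference in steps 2 and 4 can be computed directly from the structured log with a short script (a sketch; the `startup_seconds` helper is ours, and it assumes one JSON document per log line, as mongos emits since 4.4):

```python
import json
from datetime import datetime

def startup_seconds(log_lines):
    """Seconds between the 'Build Info' and 'Waiting for connections'
    events in a mongos structured log (one JSON document per line)."""
    stamps = {}
    for line in log_lines:
        doc = json.loads(line)
        if doc.get("msg") in ("Build Info", "Waiting for connections"):
            # mongos logs ISO-8601 dates with an explicit UTC offset
            stamps[doc["msg"]] = datetime.fromisoformat(doc["t"]["$date"])
    return (stamps["Waiting for connections"] - stamps["Build Info"]).total_seconds()

# Using the 4.4.22 timestamps reported below (other fields elided):
log = [
    '{"t":{"$date":"2024-01-23T14:18:32.675+00:00"},"msg":"Build Info"}',
    '{"t":{"$date":"2024-01-23T14:20:31.197+00:00"},"msg":"Waiting for connections"}',
]
print(startup_seconds(log))  # 118.522 seconds, i.e. roughly 2 minutes
```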
Participants:

 Description   

Hello,

On startup, mongos runs one query per database against the `config.collections` collection.

On clusters with a large number of databases, this has a significant impact on process startup time.

Moreover, as mongos uses `readConcern: majority` to communicate with the config replica set, the startup time also varies with the network latency between the config members and the mongos.

Here are the kinds of requests sent by mongos to the config RS on startup:

```
{"t":{"$date":"2024-01-22T13:14:58.987+00:00"},"s":"D4", "c":"ASIO",     "id":22596,   "ctx":"mongosMain","msg":"startCommand","attr":{"request":"RemoteCommand 939 – target:[cfg-1.internal:37001] db:config expDate:2024-01-22T13:15:28.987+00:00 cmd:{ find: \"collections\", filter: { _id: /^db_20240119160851\\-800\\./ }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1705929298, 1), t: 2 } }, maxTimeMS: 30000 }"}}
{"t":{"$date":"2024-01-22T13:14:58.989+00:00"},"s":"D4", "c":"ASIO",     "id":22596,   "ctx":"mongosMain","msg":"startCommand","attr":{"request":"RemoteCommand 940 – target:[cfg-1.internal:37001] db:config expDate:2024-01-22T13:15:28.989+00:00 cmd:{ find: \"collections\", filter: { _id: /^db_20240119160851\\-801\\./ }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1705929298, 1), t: 2 } }, maxTimeMS: 30000 }"}}
{"t":{"$date":"2024-01-22T13:14:58.991+00:00"},"s":"D4", "c":"ASIO",     "id":22596,   "ctx":"mongosMain","msg":"startCommand","attr":{"request":"RemoteCommand 941 – target:[cfg-2.internal:37001] db:config expDate:2024-01-22T13:15:28.991+00:00 cmd:{ find: \"collections\", filter: { _id: /^db_20240119160851\\-802\\./ }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1705929298, 1), t: 2 } }, maxTimeMS: 30000 }"}}
{"t":{"$date":"2024-01-22T13:14:59.004+00:00"},"s":"D4", "c":"ASIO",     "id":22596,   "ctx":"mongosMain","msg":"startCommand","attr":{"request":"RemoteCommand 942 – target:[cfg-3.internal:37001] db:config expDate:2024-01-22T13:15:29.004+00:00 cmd:{ find: \"collections\", filter: { _id: /^db_20240119160851\\-803\\./ }, readConcern: { level: \"majority\", afterOpTime: { ts: Timestamp(1705929298, 1), t: 2 } }, maxTimeMS: 30000 }"}}
```
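Each request filters `config.collections` with a regex anchored on a single database name, which is why the query count scales with the database count. A minimal sketch of building and matching such a filter (illustrative only, not MongoDB's actual implementation; the `collections_filter` helper is ours):

```python
import re

def collections_filter(db_name: str) -> str:
    # Matches every _id of the form "<db>.<collection>" for one database,
    # escaping regex metacharacters in the database name (hence the
    # escaped '-' visible in the log lines above).
    return "^" + re.escape(db_name) + "\\."

pattern = re.compile(collections_filter("db_20240119160851-800"))
print(bool(pattern.match("db_20240119160851-800.users")))   # True
print(bool(pattern.match("db_20240119160851-8000.users")))  # False
```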

In our case, we are operating a cluster with more than 450k databases. The config servers are distributed across three sites:

  • Mongos <-> Config1: 1ms
  • Mongos <-> Config2: 3ms
  • Mongos <-> Config3: 10ms

When the mongos starts, as the queries are distributed equally across all config members, it takes approximately 30 minutes to start, which the maths confirms: `150k * 1ms + 150k * 3ms + 150k * 10ms = 35 minutes`
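That back-of-the-envelope estimate can be checked in a few lines (a sketch; it assumes a perfectly even round-robin across members and that round-trip latency dominates each sequential query):

```python
# One sequential round trip per database, spread evenly across the
# three config members (latencies from this report, in milliseconds).
latency_ms = {"cfg-1": 1, "cfg-2": 3, "cfg-3": 10}
databases = 450_000
per_member = databases / len(latency_ms)  # ~150k queries per member

total_ms = sum(per_member * ms for ms in latency_ms.values())
print(total_ms / 1000 / 60)  # 35.0 minutes
```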

We tried to set a lower `--localThreshold`, but it apparently only applies to communications between the mongos and the data shards.

We are able to reproduce this behaviour by creating 25k databases in a testing environment:

  • it takes 2 minutes for a mongos to start when the config RS members are distributed across the three sites
  • it takes 25s to start if all config RS members are located on the same site

We were able to reproduce it on 4.4.22 and 5.0.24, but not on 6.0.13:

  • 4.4.22: 2 minutes

```
{"t":{"$date":"2024-01-23T14:18:32.675+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"mongosMain","msg":"Build Info","attr":{"buildInfo":{"version":"4.4.22","gitVersion":"fc832685b99221cffb1f5bb5a4ff5ad3e1c416b2","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distmod":"ubuntu2004","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2024-01-23T14:20:31.197+00:00"},"s":"I",  "c":"NETWORK",  "id":23016,   "ctx":"listener","msg":"Waiting for connections","attr":{"port":57000,"ssl":"off"}}
```

  • 5.0.24: 2 minutes

```
{"t":{"$date":"2024-01-23T15:01:52.077+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"mongosMain","msg":"Build Info","attr":{"buildInfo":{"version":"5.0.24","gitVersion":"f034f0c51b3dffef4b8c9452d77ede9888f28f66","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distmod":"ubuntu2004","distarch":"x86_64","target_arch":"x86_64"}}}}
{"t":{"$date":"2024-01-23T15:03:53.645+00:00"},"s":"I",  "c":"NETWORK",  "id":23016,   "ctx":"listener","msg":"Waiting for connections","attr":{"port":57000,"ssl":"off"}}
```

  • 6.0.13: 7s

```
{"t":{"$date":"2024-01-23T15:53:10.296+00:00"},"s":"I",  "c":"CONTROL",  "id":23403,   "ctx":"mongosMain","msg":"Build Info","attr":{"buildInfo":{"version":"6.0.13","gitVersion":"3b13907f9bdf6bd3264d67140d6c215d51bbd20c","openSSLVersion":"OpenSSL 1.1.1f  31 Mar 2020","modules":[],"allocator":"tcmalloc","environment":{"distmod":"ubuntu2004","distarch":"x86_64","target_arch":"x86_64"}}}}
```

Also, would it be possible to backport the 6.0.13 behaviour to upcoming 4.4 and 5.0 versions?

Thank you!



 Comments   
Comment by Eric Sedor [ 05/Feb/24 ]

Hi anthony@sendinblue.com, thanks for your patience. Specific improvements are backported where reasonable, but it's pretty unlikely that we'll backport to earlier versions the breadth of recent improvements for handling large numbers of namespaces. Hopefully moving to MongoDB 6.0 is an option for you!

Sincerely,
Eric

Generated at Thu Feb 08 06:58:07 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.