[SERVER-57776] Sharded collection becomes inaccessible when it becomes too big Created: 17/Jun/21  Updated: 19/Jul/21  Resolved: 19/Jul/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Nicolai Ødum Assignee: Eric Sedor
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Steps To Reproduce:

Create a sharded setup where the config.chunks collection contains 40 million entries, half of which relate to a single collection.

Try to start a MongoS
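
To check whether a cluster meets the condition in the first step, the number of chunk entries per collection can be read from config.chunks. The following is a minimal sketch and not part of the original report. It assumes the MongoDB 4.4 schema, in which each chunk document stores its collection namespace in an "ns" field, and the function name chunks_per_namespace_pipeline is hypothetical. It only builds the aggregation pipeline, so it runs without a cluster.

```python
def chunks_per_namespace_pipeline(limit=10):
    """Build an aggregation pipeline that counts config.chunks entries per namespace.

    Assumes the MongoDB 4.4 config.chunks schema, where each chunk
    document stores its collection namespace in the "ns" field.
    """
    return [
        # Count chunk documents for each namespace.
        {"$group": {"_id": "$ns", "chunks": {"$sum": 1}}},
        # Largest collections first.
        {"$sort": {"chunks": -1}},
        # Keep only the top entries.
        {"$limit": limit},
    ]


if __name__ == "__main__":
    print(chunks_per_namespace_pipeline())
```

With PyMongo, the returned list could be passed to the aggregate() method of the chunks collection in the config database. In the scenario described here, one namespace would be expected to account for roughly 20 million of the 40 million entries.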

Participants:

 Description   

I have a large sharded (+500TB) collection that right now is inaccessible because of a timeout in the synchronization between MongoC and MongoS.

Both MongoS and MongoC are run on enterprise class servers with 10Gbit network with <0.1ms latency.

The sharded collection has more than 20 million entries in the config.chunks collection on the MongoC, and config.chunks contains more than 40 million entries in total. When the MongoS starts, there appears to be a (hardcoded?) limit of 1 minute per collection for syncing config.chunks from the MongoC to the MongoS. If the sync fails, the MongoS does not start at all.
I have tried adding loadRoutingTableOnStartup: false to the mongos config. With that setting, the mongos starts and all other collections are accessible, but I am still unable to access the large sharded collection.

Is there a way to change that timeout in the MongoS?
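
For a sense of scale, the figures in this description imply a demanding load rate. The sketch below is a back-of-the-envelope calculation only. It assumes the one-minute limit the reporter suspects, which is not confirmed anywhere in this ticket, and the function name required_chunks_per_second is hypothetical.

```python
def required_chunks_per_second(chunk_count, time_limit_seconds):
    """Return the minimum sustained load rate needed to finish within the limit."""
    if time_limit_seconds <= 0:
        raise ValueError("time_limit_seconds must be positive")
    return chunk_count / time_limit_seconds


if __name__ == "__main__":
    # Figures from the report: more than 20 million chunk entries for one
    # collection, and a suspected (unconfirmed) limit of 60 seconds.
    rate = required_chunks_per_second(20_000_000, 60)
    print(f"{rate:,.0f} chunk documents per second")
```

Under these assumptions, the mongos would need to fetch and process about 333,333 chunk documents per second for the large collection to finish loading in time.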



 Comments   
Comment by Eric Sedor [ 19/Jul/21 ]

Thanks for clarifying nicolai@niro-it.dk. We do understand this is sensitive information and appreciate your care. Unfortunately, without logs or diagnostic data, we aren't able to investigate this report here in the SERVER project. But we will be on the lookout for similar reports. If you do end up able to provide logs showing the mongos startup failure, let us know and we can reopen this ticket.

Sincerely,
Eric

Comment by Nicolai Ødum [ 15/Jul/21 ]

Sorry, I am not able to provide you with un-obfuscated logs. I have used https://github.com/rueckstiess/fruitsalad, but I am not sure whether it can handle the new JSON format.

 

Regards

Nicolai

Comment by Eric Sedor [ 15/Jul/21 ]

Hi nicolai@niro-it.dk,

Are you able to provide un-obfuscated logs? We are definitely interested in investigating the details of what you're reporting.

Sincerely,
Eric

Comment by Eric Sedor [ 28/Jun/21 ]

Hi nicolai@niro-it.dk,

It looks like the logs have been fully obfuscated. Are you at all able to provide either partially obfuscated or un-redacted versions of these logs to the same upload portal? To clarify, files uploaded here will only be visible to MongoDB employees actively involved in this investigation.

If that's not possible, could you provide manually redacted lines that preserve the system-related information in each line? We're particularly interested in the log messages that are occurring on the mongos and config server primary at the time the mongos is failing to start.

Gratefully,
Eric

Comment by Nicolai Ødum [ 18/Jun/21 ]

I have uploaded a mongos log and a mongoc log. Because of company policy, I am not able to upload binary files.

Comment by Nicolai Ødum [ 18/Jun/21 ]

OS: CentOS Linux release 7.9.2009

MongoDB --version

Build Info: {
    "version": "4.4.6",
    "gitVersion": "72e66213c2c3eab37d9358d5e78ad7f5c1d0d0d7",
    "openSSLVersion": "OpenSSL 1.0.1e-fips 11 Feb 2013",
    "modules": [],
    "allocator": "tcmalloc",
    "environment": {
        "distmod": "rhel70",
        "distarch": "x86_64",
        "target_arch": "x86_64"
    }
}

 

Comment by Eric Sedor [ 17/Jun/21 ]

Hi nicolai@niro-it.dk, can you clarify the MongoDB version and provide some additional information?

I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time.

We'd like information from the following nodes:

  • A mongos that has failed to start
  • The primary member of the config server replica set

For each of these nodes spanning a time period that includes a failed restart attempt, would you please archive (tar or zip) and upload to that link:

  • the mongos/d logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Thank you,
Eric

Generated at Thu Feb 08 05:42:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.