[SERVER-73975] There will be a freeze when executing the enableSharding or shardCollection command Created: 14/Feb/23  Updated: 27/Mar/23

Status: Investigating
Project: Core Server
Component/s: None
Affects Version/s: 4.4.14, 4.4.17, 4.4.18
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: 鹏 刘 Assignee: Alan Zheng
Resolution: Unresolved Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: JPEG File POPO20230315-174850.jpg     JPEG File POPO20230315-174901.jpg     PNG File image.png     JPEG File mongocfg.jpg     JPEG File upload.jpg    
Issue Links:
Related
related to SERVER-58721 processReplSetInitiate does not set a... Closed
Assigned Teams:
Replication
Operating System: ALL
Steps To Reproduce:

1. Deploy a cluster on version 4.4.14 or higher.

2. Log in to mongos and execute the shardCollection command.

Participants:

 Description   

When I deploy a cluster on version 4.4.14 or higher, I find that there is a noticeable freeze when executing the enableSharding or shardCollection command. But when I restart the config server primary node, the lag disappears. The log looks like this.

I find that when executing the shardCollection command, the shard server nodes read several collections from the config server's config database, such as config.collections, and these operations take a long time. I added some logging to the code to record the time the config server spends executing the following functions in service_entry_point_common.cpp:

behaviors.waitForReadConcern(opCtx, invocation.get(), request);
behaviors.setPrepareConflictBehaviorForReadConcern(opCtx, invocation.get()); 
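
Judging from the custom log ids 1133440 and 1133441 that appear in the output below, the added statements wrapped these two calls roughly as follows (a simplified sketch of the patch, not the exact change):

// Sketch of the timing log statements added around the readConcern wait in
// service_entry_point_common.cpp; the ids and messages mirror the entries in the
// log excerpt below.
LOGV2(1133440, "Reach there to start waiting for read Concern!!!!");
behaviors.waitForReadConcern(opCtx, invocation.get(), request);
behaviors.setPrepareConflictBehaviorForReadConcern(opCtx, invocation.get());
LOGV2(1133441, "Reach there to end waiting for read Concern!!!!");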

I got the following log.

{"t":{"$date":"2023-02-10T18:45:25.552+08:00"},"s":"I",  "c":"COMMAND",  "id":1133440, "ctx":"conn29","msg":"Reach there to start waiting for read Concern!!!!"}
{"t":{"$date":"2023-02-10T18:45:27.452+08:00"},"s":"I",  "c":"COMMAND",  "id":1133441, "ctx":"conn29","msg":"Reach there to end waiting for read Concern!!!!"}
{"t":{"$date":"2023-02-10T18:45:27.453+08:00"},"s":"I",  "c":"COMMAND",  "id":51803,   "ctx":"conn29","msg":"Slow query","attr":{"type":"command","ns":"config.collections","command":{"find":"collections","filter":{"_id":"netease.ffcccccciiii"},"readConcern":{"level":"majority","afterOpTime":{"ts":{"$timestamp":{"t":1676028592,"i":4}},"t":1}},"limit":1,"maxTimeMS":30000,"$readPreference":{"mode":"nearest"},"$replData":1,"$clusterTime":{"clusterTime":{"$timestamp":{"t":1676028592,"i":4}},"signature":{"hash":{"$binary":{"base64":"xEAj8PuCgaD58iDzrXetLftM4nU=","subType":"0"}},"keyId":7198487474404851735}},"$configServerState":{"opTime":{"ts":{"$timestamp":{"t":1676028592,"i":4}},"t":1}},"$db":"config"},"planSummary":"IDHACK","keysExamined":1,"docsExamined":1,"cursorExhausted":true,"numYields":0,"nreturned":1,"reslen":769,"locks":{"ReplicationStateTransition":{"acquireCount":{"w":1}},"Global":{"acquireCount":{"r":1}},"Database":{"acquireCount":{"r":1}},"Collection":{"acquireCount":{"r":1}},"Mutex":{"acquireCount":{"r":1}}},"readConcern":{"level":"majority","afterOpTime":{"ts":{"$timestamp":{"t":1676028592,"i":4}},"t":1},"provenance":"clientSupplied"},"storage":{},"protocol":"op_msg","durationMillis":1900}} 



 Comments   
Comment by Yuan Fang [ 20/Mar/23 ]

Hi liupeng9711@gmail.com, Thank you for your patience. Apologies for the slow response. I am passing this along to the team for investigation. Please continue to watch for updates.

Comment by 鹏 刘 [ 20/Mar/23 ]

This is a bug in the MongoDB implementation. The majority commit point (mcp) timestamp is not set when the replica set is initialized, so the mcp timestamp on every node of every replica set is { ts: Timestamp(0, 0), t: -1 }. When a node uses find to pull the oplog for the first time, the mcp timestamp in the response is also { ts: Timestamp(0, 0), t: -1 }, and the mcp timestamp it then stores locally is also { ts: Timestamp(0, 0), t: -1 }.

Later, the node pulls the oplog with the getMore command. Before issuing the command, it sets the lastKnownCommittedOpTime field of the getMore to its local mcp timestamp.

When the sync source node receives the getMore command, it assigns clientsLastKnownCommittedOpTime from the lastKnownCommittedOpTime field of the getMore. When clientsLastKnownCommittedOpTime is empty, the getMore command enters the waiting state.

In version 5.0, code was added to set the mcp when initializing the replica set. It is recommended to backport these changes.
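
As a simplified, standalone illustration of the condition described above (this is not the actual server code; the OpTime struct and function names are invented for the example, and the real check is an ordered comparison rather than a plain equality test):

// Simplified, standalone illustration of the wait condition described above.
#include <iostream>

struct OpTime {
    long long ts;    // seconds component of the Timestamp
    long long term;  // election term; -1 means "never set"
    bool operator==(const OpTime& other) const {
        return ts == other.ts && term == other.term;
    }
};

const OpTime kNullOpTime{0, -1};  // models { ts: Timestamp(0, 0), t: -1 }

// On the sync source: may an awaitData getMore on the oplog return an empty batch
// immediately? An empty batch is only worth returning early when it can report a
// commit point that differs from the one the client last knew about.
bool canReturnEmptyBatchNow(const OpTime& lastCommitted,
                            const OpTime& clientsLastKnownCommittedOpTime) {
    return !(lastCommitted == clientsLastKnownCommittedOpTime);
}

int main() {
    // Commit point never initialized: both sides still hold the null optime,
    // so every empty getMore blocks for the full awaitData wait.
    std::cout << std::boolalpha
              << canReturnEmptyBatchNow(kNullOpTime, kNullOpTime) << "\n";            // false -> wait
    // Once the commit point has been set and advances, the getMore can return
    // right away so the new commit point propagates to the tailing node.
    std::cout << canReturnEmptyBatchNow(OpTime{1676028592, 1}, kNullOpTime) << "\n";  // true -> return
}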

Comment by 鹏 刘 [ 15/Mar/23 ]

I found that starting from version 3.4, when a secondary node opens a tailable cursor to fetch the oplog from its sync source, it sets a wait time, usually 5 seconds. When the cluster starts for the first time the wait takes effect, but after I restart the primary node the wait no longer applies: the oplog is fetched many times within a short period, and most fetches return 0 oplog entries. The attached picture shows the result of two oplog pulls by the same node within 20 ms.
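
As a rough toy model of the two timing regimes described above (not the real fetcher or sync-source code; the 5000 ms value is just the usual 5-second wait mentioned here, and 1 ms is only a stand-in for "returns almost immediately"):

// Toy model of how long an empty oplog getMore takes in the two observed regimes.
#include <chrono>
#include <iostream>

// Pretend sync source: with no new oplog entries to send, it returns at once only
// when it can report a commit point different from the one the client last knew;
// otherwise it holds the getMore for the full awaitData wait.
std::chrono::milliseconds emptyBatchLatency(bool commitPointNewerThanClientKnows,
                                            std::chrono::milliseconds awaitDataWait) {
    return commitPointNewerThanClientKnows ? std::chrono::milliseconds(1) : awaitDataWait;
}

int main() {
    const auto awaitDataWait = std::chrono::milliseconds(5000);  // the ~5 second wait

    // Fresh cluster, commit point stuck at the null optime: ~5 s per empty batch.
    std::cout << "fresh cluster:                "
              << emptyBatchLatency(false, awaitDataWait).count() << " ms per empty batch\n";

    // After restarting the primary the wait no longer applies (as observed above),
    // modeled here as the commit point now looking newer to the sync source --
    // empty batches come back within milliseconds, like the two pulls within 20 ms.
    std::cout << "after restarting the primary: "
              << emptyBatchLatency(true, awaitDataWait).count() << " ms per empty batch\n";
}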

Comment by 鹏 刘 [ 01/Mar/23 ]

I have uploaded the log file and diagnostic.data file of the config node.

Comment by 鹏 刘 [ 01/Mar/23 ]

When I execute the "shardCollection" command, the shardsvr node reads some collections in the config database of the config server, such as config.collections; the slow queries above are those read operations. It's very easy to reproduce: deploy a cluster, execute the shardCollection command on mongos, then restart the config server primary node and execute the shardCollection command again; the difference in elapsed time between the two runs is very large. I'll give you the logs right away.

Comment by Yuan Fang [ 28/Feb/23 ]

Hi liupeng9711@gmail.com,

Thank you for providing the diagnostic datasets. I have reviewed both the log and FTDC, and while I can see a few records of slow queries, it does not seem that any operations of "enableSharding" or "shardCollection" were performed before the lagging occurred. In order to better answer the original question, could you please either:

  • Upload the log and FTDC that span a time period including the original incident (on 2/13/2023),
  • Or attempt to reproduce the issue and upload the log and FTDC files that encompass the time before, during, and after the operations (i.e. "enableSharding" or "shardCollection") and the lagging

to the upload portal?

Thank you!
Yuan

 

Comment by 鹏 刘 [ 27/Feb/23 ]

The 'freeze' means that when I execute the shardCollection operation, there will be an obvious lag. I have uploaded the log file and diagnostic.data file of the config node again.

Comment by Yuan Fang [ 24/Feb/23 ]

Hi liupeng9711@gmail.com,

Unfortunately, I couldn't find any files in the folder linked to the upload portal. Could you try to upload them to the upload portal again? Feel free to let me know if you encounter any issues while uploading.

Additionally, could you also address my clarification question that was asked in my last comment? Thank you!

Regards,
Yuan

 

Comment by 鹏 刘 [ 23/Feb/23 ]

I have uploaded the log file and diagnostic.data file of the config node; please have a look.

Comment by Yuan Fang [ 22/Feb/23 ]

Hi liupeng9711@gmail.com,

Thank you for your report; we are currently working on reproducing the issue you encountered. However, could you please provide some clarification regarding the behavior of 'freeze' in the context of:

I find that there will be a freeze when executing the enableSharding or shardCollection command.

Specifically, when you say 'freeze', are you referring to the situation where the mongod hangs and cannot take any new operations? 

Additionally, it would be helpful if you could provide the complete mongod log and FTDC. I've created a secure upload portal for you. Files uploaded to this portal are hosted on Box, are visible only to MongoDB employees, and are routinely deleted after some time.

For each node in the replica set spanning a time period that includes the incident, would you please archive (tar or zip) and upload to that link:

  • the mongod logs
  • the $dbpath/diagnostic.data directory (the contents are described here)

Thank you!

Regards,
Yuan

Comment by 鹏 刘 [ 15/Feb/23 ]

I think the reason is that the config server has no new operations, so the mcp timestamp has not been able to move forward. When I perform continuous insert operations on the config server node, the lag disappears.
