[SERVER-23041] Shards starting or entering primary mode may get stuck if no CSRS config hosts are available Created: 10/Mar/16  Updated: 21/Mar/19  Resolved: 31/Mar/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.2.4
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Spencer Brody (Inactive)
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File ismaster_csrs_configsvr_not_restart.js     File ismaster_csrs_configsvr_restart.js    
Issue Links:
Related
related to SERVER-23398 Cannot start mongos with secondary co... Closed
Operating System: ALL
Participants:

 Description   

If a shard node using a CSRS config server has ever been a chunk donor in a migration, that node will have a minOpTime document stored with the config server's optime from the last migration.

Upon startup or becoming primary, if there is a minOpTime document, the shard starts initializing the ShardingState machinery in order to prime it with the minimal config server optime.

This initialization never completes if none of the CSRS hosts are available, and we keep retrying indefinitely. The initialization gets stuck because we try to reload the list of shards.

We should change the code so that initialization only sets the min optime, and defer reloading the list of shards until it actually becomes necessary.
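
For context, the recovery document that triggers this startup path lives in admin.system.version (see the workaround comment below). The following is a minimal mongo shell sketch for checking whether a shard node carries such a document; the _id value is the one named in that comment, and any other fields in the document are version-dependent and not confirmed by this ticket:

    // Check whether this shard has a sharding-state recovery document.
    // A null result means no recovery document is present, so this startup
    // path is not taken.
    var recoveryDoc = db.getSiblingDB("admin").system.version.findOne(
        { _id: "minOpTimeRecovery" }
    );
    printjson(recoveryDoc);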



 Comments   
Comment by Jeff Tharp [ 21/Mar/19 ]

For anyone else encountering this issue, I found a workaround that may help. I was in the process of cloning one of our production clusters starting from EBS snapshots I had restored. Since the cluster was meant to be a clone of the live cluster, the config server hosts were different and a security group enforced that the restored nodes could not talk to the production cluster's CSRS (nor did I want them to). So once the restored shard nodes finished recovery, they fell into this endless loop of trying to reach the production cluster's CSRS and would not accept new connections (so I couldn't get a shell to delete the offending minOpTimeRecovery doc).

My workaround was to stop the shard nodes and then restart with the replication configuration commented out and sharding.clusterRole set to configsvr, not shardsvr. This seems to bypass the check for the minOpTimeRecovery doc in admin.system.version. Once mongod started, I was able to connect with a mongo shell and delete the doc. I then stopped mongod and restarted with the proper configuration (replication enabled and sharding.clusterRole = shardsvr) and proceeded with the remaining steps of restoring the cluster.
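
For reference, the delete step looks roughly like this in the mongo shell, run while the node is started outside its normal shardsvr role as described above (the only identifier confirmed here is the minOpTimeRecovery _id in admin.system.version):

    // Remove the stale recovery document so the node stops trying to reach
    // the unreachable (old) CSRS on its next restart with the proper config.
    db.getSiblingDB("admin").system.version.remove({ _id: "minOpTimeRecovery" });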

Comment by Andy Schwerin [ 31/Mar/16 ]

While this works as designed when all the config servers are down, when one config server is up but has never spoken to a config server primary, the behavior matches SERVER-23398. Further work will happen on that ticket.

Comment by Spencer Brody (Inactive) [ 10/Mar/16 ]

I don't really see this as a problem. If all the config servers are down, your cluster isn't going to be usable anyway.
