[SERVER-50696] Shard restore procedure leaving cluster in hung or inconsistent state. Created: 02/Sep/20 Updated: 27/Oct/23 Resolved: 21/Sep/20 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Todd Vernick | Assignee: | Dmitry Agranat |
| Resolution: | Community Answered | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Steps To Reproduce: | shard config
configdb config
Note: The original cluster has replica sets (3 members per shard and 3 configdbs); the restored cluster has 1 replica each, in a PRIMARY state. |
| Participants: |
| Description |
|
CentOS Linux release 7.8.2003 (Core), mongoD version 3.6.16, mongoS version 3.6.16.
When restoring a shard from snapshots, mongoS cannot pull data consistently. Shard info can be pulled via the sh.status() command through mongoS, but running simple commands like "show collections" will hang and fail after the 30s timeout with "NetworkInterfaceExceededTimeLimit". All ports can be reached from the shards to the configdb and from the configdb to the shards. Types of messages seen in one of the shard logs:
Types of messages seen in the configdb logs:
Sometimes restarting the configdb once or twice fixes the issue and mongoS starts pulling data again, but this is very inconsistent: restarting the configdb only works some of the time.
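For illustration, the behaviour described above looks roughly like this from a mongo shell connected to mongoS (the database name is hypothetical):

```
sh.status()          // shard and chunk metadata is returned normally

use mydb             // any sharded database; the name here is hypothetical
show collections     // hangs, then fails after the 30s timeout with
                     // NetworkInterfaceExceededTimeLimit
```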
|
| Comments |
| Comment by Todd Vernick [ 23/Sep/20 ] | |||
|
Hi @dagranat - Can we reopen this ticket? It's definitely not a network issue so I'd like to see if there is anything else we can explore. | |||
| Comment by Todd Vernick [ 21/Sep/20 ] | |||
|
Also note that I've gone through the community boards and nothing there has been helpful for this issue thus far. | |||
| Comment by Todd Vernick [ 21/Sep/20 ] | |||
|
The network errors occur during the restarts, when the restored data is mounted to each shard. After everything restarts, no network errors are seen other than queries timing out, as mentioned in my original issue. I've already ruled out network issues: all hosts can connect to their respective ports, and I've even run tcpdumps to verify that nothing is being blocked. | |||
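For illustration, the sort of connectivity check described here can be done from a shard host with a mongo shell pointed at the config server (host name and port below are hypothetical):

```
// From a shard host, against the restored config server
// (hypothetical host/port):
//   mongo --host restored-configdb.example.net --port 27019
db.adminCommand({ ping: 1 })    // { "ok" : 1 } confirms the TCP path is open
db.serverStatus().connections   // current vs. available connection counts
```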
| Comment by Dmitry Agranat [ 21/Sep/20 ] | |||
|
Hi tvernick@squarespace.com, thank you for uploading all the requested information. I've noticed multiple errors which might indicate an issue with your network connectivity.
In addition, this error (besides the network connectivity possibility mentioned above) might also indicate that the balancer was not fully stopped during the backup process:
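The error text itself is not reproduced in this export. For reference, whether the balancer is fully stopped before snapshots are taken can be checked from a mongo shell connected to a mongoS of the source cluster (an illustrative sketch, not a statement about the procedure actually used):

```
sh.stopBalancer()        // disables the balancer and waits for it to stop
sh.getBalancerState()    // should return false once disabled
sh.isBalancerRunning()   // should report that no balancing round is in progress
```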
If you need further assistance troubleshooting, I encourage you to ask our community by posting on the MongoDB Developer Community Forums or on Stack Overflow with the mongodb tag. Regards, | |||
| Comment by Todd Vernick [ 14/Sep/20 ] | |||
|
Hi Dmitry - I have uploaded the files you requested. Please let me know if you need anything else. | |||
| Comment by Dmitry Agranat [ 14/Sep/20 ] | |||
|
Hi tvernick@squarespace.com, we'll need to gather some data in order to understand what's going on.
You can combine all the executed commands and their output into one file per procedure, for example backup_file.txt and restore_file.txt.
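As an illustration only (the specific commands requested are not reproduced in this export), output can be captured to such a file by invoking the mongo shell non-interactively and redirecting it:

```
mongo --host <configServerHost> --port <configServerPort> \
      --eval 'printjson(db.adminCommand({ replSetGetStatus: 1 }))' \
      >> restore_file.txt
```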
Note: please change <configServerHost> and <configServerPort> to any config server. I've created a secure upload portal for you. Files uploaded to this portal are visible only to MongoDB employees and are routinely deleted after some time. Thanks, | |||
| Comment by Todd Vernick [ 13/Sep/20 ] | |||
|
Yes, I have tried this command, but unfortunately it does not help. | |||
| Comment by Dmitry Agranat [ 13/Sep/20 ] | |||
|
Hi tvernick@squarespace.com, before requesting all the needed data to investigate this, could you please clarify if you have tried flushRouterConfig from my last comment? | |||
| Comment by Todd Vernick [ 10/Sep/20 ] | |||
|
Hi Dmitry - I have set up the restored cluster with the additional replicas, like the original cluster setup, but I'm still seeing the same issue with command timeouts. | |||
| Comment by Todd Vernick [ 09/Sep/20 ] | |||
|
I'm going to try to make it more of a mirror of the original. Would there potentially be an issue if a sharded cluster has only a single replica set member per shard instead of 3? | |||
| Comment by Dmitry Agranat [ 06/Sep/20 ] | |||
|
Thank you for providing all the requested information, it was very helpful. Based on the above, it appears that you are trying to do a partial restore of your original cluster. Custom restore procedures are out of scope for the SERVER project. If the restored cluster is identical to the original one (or has more members, as documented here) and you still experience the reported issues, we'd be happy to take a look. One thing that might help with the mongoS inconsistency (although this entirely depends on the specific steps in this custom procedure) is to clear the cached routing table with flushRouterConfig. Thanks, | |||
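For reference, clearing the cached routing table is done with an admin command against each mongoS of the restored cluster (a minimal sketch):

```
// Run against each mongoS of the restored cluster:
db.adminCommand({ flushRouterConfig: 1 })   // drops the cached routing table;
                                            // it is reloaded from the config
                                            // servers on the next operation
```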
| Comment by Todd Vernick [ 02/Sep/20 ] | |||
|
Also, these are the config options running on mongoS:
| |||
| Comment by Todd Vernick [ 02/Sep/20 ] | |||
|
Hi Dmitry - I followed the doc here: https://docs.mongodb.com/v3.6/tutorial/restore-sharded-cluster/. Note this is actually happening on two different clusters: similar data sets, but two environments (production and staging). Staging has just a fraction of the total data stored and 3 fewer shards, so I'll use staging as the example here.
Original cluster: 3 shards with 6 replica set members in each shard. 3 additional hidden secondaries are hosted in GCP, used primarily as the "snapshot" source data members (note these hosts are in sync, so no lag here). 1 configdb replica set with 6 members. mongoS clients connecting to the 1 configdb member.
Restored cluster: 3 shards with 1 primary member on each shard. 1 configdb replica set with 1 member. mongoS clients connecting to the 1 configdb member. The original configdb server is not touched during any part of the process other than stopping the mongoD process before snapshotting the data directory. | |||
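One detail that often matters when the restored shards run as smaller, differently hosted replica sets than the originals is that the shard metadata on the restored config server still has to point at the restored members. This is a hedged sketch only, not a diagnosis; the shard ID and host string below are hypothetical:

```
// On the PRIMARY of the restored config server replica set:
use config
db.shards.find().pretty()   // inspect each shard's "host" string

// If a "host" field still lists the original replica set members,
// point it at the restored single-member set (hypothetical names):
db.shards.updateOne(
  { _id: "shard01" },
  { $set: { host: "shard01/restored-shard01.example.net:27018" } }
)
```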
| Comment by Dmitry Agranat [ 02/Sep/20 ] | |||
|
Hi tvernick@squarespace.com, I'd like to clarify a couple of points:
Thanks, |