[SERVER-60916] CPS Restores failed with a snapshot with documents in reshardingOperation

Created: 22/Oct/21  Updated: 29/Oct/23  Resolved: 28/Oct/21

| Status: | Closed |
| Project: | Core Server |
| Component/s: | Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 5.2.0, 5.0.4, 5.1.0-rc3 |
| Type: | Bug |
| Priority: | Major - P3 |
| Reporter: | Chuck Zhang |
| Assignee: | Max Hirschhorn |
| Resolution: | Fixed |
| Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Backwards Compatibility: | Fully Compatible |
| Operating System: | ALL |
| Backport Requested: | v5.1, v5.0 |
| Sprint: | Sharding 2021-11-01 |
| Story Points: | 1 |
| Description |

Manually tested below:

- reshardCollection
- Mongo config server fassert error
| Comments |
| Comment by Githook User [ 28/Oct/21 ] |

Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: The ARS (AsyncRequestsSender) calls ShardRegistry::getShardNoReload() and
(cherry picked from commit dd41a99e2c177d5bd3a38229d85458117480c5d9)
| Comment by Githook User [ 28/Oct/21 ] |

Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: The ARS (AsyncRequestsSender) calls ShardRegistry::getShardNoReload() and
(cherry picked from commit dd41a99e2c177d5bd3a38229d85458117480c5d9)
| Comment by Githook User [ 27/Oct/21 ] |

Author: Max Hirschhorn (visemet) <max.hirschhorn@mongodb.com>
Message: The ARS (AsyncRequestsSender) calls ShardRegistry::getShardNoReload() and
| Comment by Max Hirschhorn [ 27/Oct/21 ] |

pierlauro.sciarelli and I discussed this issue some more over Slack. The plan is to

We expect the combination of these two changes to address the issue reported by the Automation team in

(We can separately evaluate undoing these workarounds after
| Comment by Chuck Zhang [ 25/Oct/21 ] |

As discussed offline, I am working on the data files.
| Comment by Pierlauro Sciarelli [ 25/Oct/21 ] |

The problem could be that the ARS did not wait for the reload because it actually relies on getShardNoReload instead of getShard.
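To make the distinction concrete, here is a minimal standalone sketch (a toy model with made-up names, not the actual ShardRegistry code) of why a lookup that only consults the local cache can surface ShardNotFound right after startup, while a lookup that waits for the first (re)load would not:

```cpp
#include <condition_variable>
#include <map>
#include <mutex>
#include <optional>
#include <string>

// Hypothetical, simplified stand-in for the real ShardRegistry (the real class
// loads config.shards with read concern "majority" and caches the result).
class ToyShardRegistry {
public:
    // Analogous to getShardNoReload(): only consults the in-memory cache.
    // Right after startup, before the first successful reload, the cache is
    // empty and every lookup looks like "shard not found".
    std::optional<std::string> getShardNoReload(const std::string& shardId) {
        std::lock_guard<std::mutex> lk(_mutex);
        auto it = _connStringByShardId.find(shardId);
        return it == _connStringByShardId.end() ? std::nullopt
                                                : std::optional<std::string>(it->second);
    }

    // Analogous to getShard(): waits until at least one reload has completed
    // before consulting the cache, so a transiently empty cache cannot turn
    // into a spurious ShardNotFound.
    std::optional<std::string> getShard(const std::string& shardId) {
        std::unique_lock<std::mutex> lk(_mutex);
        _loadedCV.wait(lk, [&] { return _hasLoadedOnce; });
        auto it = _connStringByShardId.find(shardId);
        return it == _connStringByShardId.end() ? std::nullopt
                                                : std::optional<std::string>(it->second);
    }

    // Called by the (simulated) periodic reload once config.shards is readable.
    void finishReload(std::map<std::string, std::string> shards) {
        {
            std::lock_guard<std::mutex> lk(_mutex);
            _connStringByShardId = std::move(shards);
            _hasLoadedOnce = true;
        }
        _loadedCV.notify_all();
    }

private:
    std::mutex _mutex;
    std::condition_variable _loadedCV;
    bool _hasLoadedOnce = false;
    std::map<std::string, std::string> _connStringByShardId;
};
```

In this model, a coordinator brought up from a restored snapshot that calls getShardNoReload() before finishReload() has run sees an empty cache and fails, even though the shard exists in config.shards.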
| Comment by Max Hirschhorn [ 25/Oct/21 ] |

Kal and I discussed this issue over Zoom. He made a point that ShardRegistry::getShard() would have waited for the topology time in the ShardRegistry to have advanced. So even though ShardRegistry::_periodicReload() hit the ReadConcernMajorityNotAvailableYet error, the ShardRegistry::getShard() from the ARS must not have. Otherwise we would have seen the ReadConcernMajorityNotAvailableYet error instead of the ShardNotFound error.
| Comment by Pierlauro Sciarelli [ 25/Oct/21 ] |

Confirming that the DDL coordinators are not saving the list of shards at any time. All DDL operations (maybe not resharding?) perform broadcasts to all shards present in the shard registry, including rename (just to clarify: that's not relevant at the moment).

We only handle (retry) the kinds of failures that may happen within nodes belonging to shards, but, as pointed out in the scope's non-goals, we have not taken into consideration the scenario in which a shard may be added/removed during DDLs.

We don't currently checkpoint the vector clock; I believe we didn't take this solution into account for the same reason mentioned above. However, in
| Comment by Kaloian Manassiev [ 25/Oct/21 ] |

The Sharding DDL coordinators do not remember the list of shards, if I am not mistaken, because we will either always broadcast (for drop) or we will reconstruct the list of shards from the routing information (for rename). So in that sense, I don't think we can encounter a ShardNotFound error from the ShardRegistry, unless during recovery. I am not sure what we use for the DDL coordinators; I believe that we persist the vector clock, which in turn stores the topologyOpTime, which ensures that ShardRegistry::getShard will refresh at least up to that topologyOpTime. pierlauro.sciarelli/jordi.serra-torrens: Can you guys confirm that my understanding above is correct?
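As a purely illustrative model of the topologyOpTime point above (made-up types, not the actual VectorClock or ShardRegistry code), a lookup gated on a persisted topology time distinguishes "my cache is too stale to answer" from "the shard really does not exist":

```cpp
#include <cstdint>
#include <map>
#include <string>

// Hypothetical model: the shard cache is stamped with the topology time it was
// loaded at, and a recovered coordinator that persisted the topology time it
// last observed refuses to trust a "shard not found" answer from an older cache.
enum class LookupResult { kStaleCache, kNotFound, kFound };

struct ToyTopologyCache {
    std::uint64_t topologyTime = 0;  // stand-in for the vector clock's topologyTime component
    std::map<std::string, std::string> connStringByShardId;
};

class ToyRegistry {
public:
    explicit ToyRegistry(ToyTopologyCache initial) : _cache(std::move(initial)) {}

    // Simulates a reload from config.shards producing a newer cache.
    void reloadTo(ToyTopologyCache newer) {
        if (newer.topologyTime > _cache.topologyTime)
            _cache = std::move(newer);
    }

    // A lookup that must not answer from a cache older than the caller's
    // persisted topology time: a stale cache means "refresh and retry", not
    // "the shard does not exist".
    LookupResult getShardAtLeast(std::uint64_t requiredTopologyTime,
                                 const std::string& shardId,
                                 std::string* connStringOut) const {
        if (_cache.topologyTime < requiredTopologyTime)
            return LookupResult::kStaleCache;
        auto it = _cache.connStringByShardId.find(shardId);
        if (it == _cache.connStringByShardId.end())
            return LookupResult::kNotFound;
        *connStringOut = it->second;
        return LookupResult::kFound;
    }

private:
    ToyTopologyCache _cache;
};
```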
| Comment by Max Hirschhorn [ 25/Oct/21 ] |

Assigned this to Sharding EMEA to reconcile the PrimaryOnlyService and ShardRegistry initialization order. Feel free to pass back to Sharding NYC if you don't feel like there's a general solution to take here.
| Comment by Chuck Zhang [ 22/Oct/21 ] |

1. Yes. atlas-4wnh1p-shard-0 is one of the destShardName values of the restored cluster.
| Comment by Max Hirschhorn [ 22/Oct/21 ] |

Yes, I think the issue could be framed that way. Would we also face a similar issue in the Sharding DDL Coordinators? Or how do those primary-only service Instances handle ShardNotFound?
| Comment by Kaloian Manassiev [ 22/Oct/21 ] |

It looks like we have waitForShardRegistryReload(), with which we block opening the connections before the ShardRegistry has performed a refresh. Is it by any chance a problem of replication's step-up actions executing before that method has completed?
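To illustrate the ordering question (a standalone sketch with made-up names, not the real startup code): blocking external connections on the first registry load does not, by itself, order replication's step-up work after that load, so a step-up hook can still run against an empty registry:

```cpp
#include <atomic>
#include <chrono>
#include <iostream>
#include <thread>

// Hypothetical startup model: loadShardRegistry() stands in for the initial
// ShardRegistry reload, acceptClientConnections() for the path gated on it,
// and onStepUp() for replication's step-up actions (e.g. rebuilding
// primary-only service instances), which run on an independent path.
std::atomic<bool> registryLoaded{false};

void loadShardRegistry() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));  // simulated slow "majority" read
    registryLoaded = true;
}

void acceptClientConnections() {
    while (!registryLoaded)  // this path does wait for the first load
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    std::cout << "accepting external connections\n";
}

void onStepUp() {
    // Nothing orders this after loadShardRegistry(), so it may run first and
    // observe an empty registry -- the spurious ShardNotFound in this model.
    std::cout << (registryLoaded ? "step-up saw a loaded registry\n"
                                 : "step-up ran before the registry load\n");
}

int main() {
    std::thread loader(loadShardRegistry);
    std::thread stepUp(onStepUp);  // races with the loader
    std::thread listener(acceptClientConnections);
    loader.join();
    stepUp.join();
    listener.join();
}
```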
| Comment by Max Hirschhorn [ 22/Oct/21 ] |

It looks like the ShardRegistry may be in an unqueryable state because it uses read concern level "majority" internally. Assuming that is expected, kaloian.manassiev, tommaso.tocci, how should other sharding components wait for the first reload of the ShardRegistry to have succeeded?
| Comment by Max Hirschhorn [ 22/Oct/21 ] |

I see the cause of the original fassert in the new logs that were uploaded, thanks. The resharding coordinator on the config server primary is unexpectedly getting ShardNotFound when attempting to contact the atlas-4wnh1p-shard-0 shard. chuck.zhang, could you please confirm/answer the following for me:
| Comment by Chuck Zhang [ 22/Oct/21 ] |

Sorry for not being clear on the ask. I wasn't aware the mongod logs were missing. I've reattached the mongod logs, which cover the time range of the restore. The file atlas-4wnh1p-config-00-02.qzt8l.mmscloudteam.com_2021-10-22T00_00_00_2021-10-22T03_30_00_mongodb seems most relevant to the root cause.
| Comment by Max Hirschhorn [ 22/Oct/21 ] |

chuck.zhang, I didn't realize you were asking specifically about the restart attempt at 2021-10-22T00:33:24.672. Is there a reason mongodb.log doesn't cover that time range when logAppend: true is set in the automation configuration? It looks like there was an fassert() at that time too. Unfortunately, the logs from mongod in the automation log are truncated, so I couldn't say whether the original fassert was location code 34418 or not. I suspect it probably was, though. Would it be possible to rerun the test/procedure with the maxLogSizeKB server parameter set to a larger value to avoid this log truncation?

--setParameter maxLogSizeKB=20
| Comment by Chuck Zhang [ 22/Oct/21 ] |

max.hirschhorn Thanks for looking.

The logs above are actually from an hour and a half after the backup restore was initiated. It seems the mongod took two hours to print its initial log lines, and they only last for about a minute.

As commented in the above link, we haven't seen any issues caused by this so far, since it's existing behavior from before reshardCollection was introduced.
| Comment by Max Hirschhorn [ 22/Oct/21 ] |

Pulling out the error from the attached log file for ease of viewing and searchability. Is there some part of the restore procedure which would explain why the oplog collection doesn't exist on the config server node? At first glance I'm not sure how this is connected to taking a backup while there's a resharding operation.