[SERVER-33797] sharding metadata command should make sure that ShardRegistry is up to date Created: 09/Mar/18 Updated: 29/Oct/23 Resolved: 29/Mar/18 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 3.7.2 |
| Fix Version/s: | 3.7.4 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Randolph Tan | Assignee: | Matthew Saltz (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Operating System: | ALL | ||||||||||||
| Sprint: | Sharding 2018-04-09 | ||||||||||||
| Participants: | |||||||||||||
| Linked BF Score: | 15 | ||||||||||||
| Description |
|
... before accessing it. For example, shardCollection makes sure that there are shards to put the new sharded collection. It is possible that the shard already exists in storage but not yet loaded to the ShardRegistry, so it can mistakenly think that no shard exists. |
| Comments |
| Comment by Githook User [ 29/Mar/18 ] | |
|
Author: {'email': 'matthew.saltz@mongodb.com', 'name': 'Matthew Saltz'}Message: | |
| Comment by Matthew Saltz (Inactive) [ 27/Mar/18 ] | |
|
Code review: https://mongodbcr.appspot.com/190780002/ | |
| Comment by Matthew Saltz (Inactive) [ 23/Mar/18 ] | |
|
Okay, that makes sense. So basically, it doesn't matter that the caller does the reload, it just matters that the reload has happened since the caller started its operation. If some other update sneaks in afterwards it doesn't matter because we just need to be sure we're seeing what we were supposed to see when we got called. ('we' being the caller). | |
| Comment by Randolph Tan [ 23/Mar/18 ] | |
|
When ShardRegistry::reload returns false, it just means that the current thread did not perform the reload. This is a scenario when some other thread was already reloading and to avoid repeated work, it just piggybacked on the work of another thread. Here is an example of why reload needs to be only called one extra time and not in a loop: Scenario: enableSharding does call reload when selecting shards. The issue in the build failure is that path is not taken because the old primary has already created the database entry so the new primary just skips the reload completely. | |
| Comment by Matthew Saltz (Inactive) [ 23/Mar/18 ] | |
|
It's also possible I'm looking in the entirely wrong places, in which case please tell me where I should be looking instead | |
| Comment by Matthew Saltz (Inactive) [ 23/Mar/18 ] | |
|
It seems like every call to getShard() on the ShardRegistry (which is what ShardingCatalogManager::shardCollection and does attempt a reload: However, from looking at the reload() implementation, despite what the comment says in the header file (that you should call it "one more time" if it returns false), it seems to me that it's possible that several repeated calls to reload could be necessary as long as another thread keeps interfering and doing the reload itself. I might be misunderstanding since there's clearly a lot of context I'm missing but it seems like the two calls to reload() inside getShard() should be replaced with something along the lines of:
(You could do it in another way for style reasons, but this is the gist.) If I'm correct in my understanding, it could also be worth adding either an additional function or additional parameter to reload to specify whether the caller needs to truly perform the reload or if another thread doing the reload is sufficient. Also, I'm curious to explore a bit more to see if there are situations where we're actually okay with the reload being done by some other thread or if in most cases we need to call it repeatedly until it returns true anyways. For ShardingCatalogManager::enableSharding, on the other hand, the only time it touches the ShardRegistry is to do createDatabase when it goes to pick a primary shard id for the new database, in which case ShardRegistry::getAllShardIds() is called instead, which doesn't do any reloading internally. So then the reload only happens in the caller (_selectShardForNewDatabase()) if getAllShardIds() comes back empty. Is it possible that there's a case where all shard IDs does not come back empty but where the ShardRegistry is out of date and should be reloaded anyways? (Is that what this JIRA is really referring to?) Also in this particular case the call-reload-till-it-returns-true protocol isn't followed - is that okay? janna.golden Tagging you since this is related to your comment in the build failure associated with this. |