[SERVER-67128] Pre-populate ShardRegistry during startup at Config shard Created: 08/Jun/22 Updated: 29/Oct/23 Resolved: 27/Jun/22 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | 6.1.0-rc0 |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Andrew Shuvalov (Inactive) | Assignee: | Andrew Shuvalov (Inactive) |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | sharding-nyc-subteam2, sharding-nyc-subteam2-catalog-poc | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
| Backwards Compatibility: | Fully Compatible | ||||||||
| Sprint: | Sharding 2022-06-27, Sharding 2022-07-11 | ||||||||
| Participants: | |||||||||
| Comments |
| Comment by Andrew Shuvalov (Inactive) [ 22/Jun/22 ] | |||||||||||
|
Here is why I think it is all failing (a race, so not deterministic): ShardRegistry::_lookup calls ShardRegistryData::createFromCatalogClient to refresh the empty cache, and the client picks the nearest replica. If a secondary picks another secondary as nearest, the request fails with ShardNotFound because the shard registry is empty. The primary can usually load, because more often than not it picks itself over a loopback connection, and a connection to itself works. The replicas come online with some delay, which makes the tests flaky with a variety of failures. The graph of those nearest choices is somewhat sticky because it changes only after the next round of hellos, so a test may fail before the graph becomes more favorable. Finding a new nearest takes some time, and I observe a replica repeatedly getting host not found while trying a secondary. The fix for this problem is to pre-populate the ShardRegistry from the local DB, which could itself be delayed because of replication. | |||||||||||
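The race described in this comment can be illustrated with a toy model. This is only a sketch under stated assumptions, not the server code: `Node`, `refresh_from`, `prepopulate_from_local`, and the connection string are hypothetical, and the model assumes one reading of the comment, namely that the node chosen as nearest is the one that rejects the request while its registry is still empty.

```python
class ShardNotFound(Exception):
    """Raised by a toy node whose registry is still empty."""


class Node:
    def __init__(self, name, local_shards):
        self.name = name
        # Hypothetical stand-in for locally replicated shard data (may lag).
        self.local_shards = dict(local_shards)
        # The in-memory cache starts out empty at startup.
        self.registry = {}

    def serve_shard_list(self):
        # Assumption: a node with an empty registry cannot serve the request.
        if not self.registry:
            raise ShardNotFound(f"{self.name}: registry is empty")
        return dict(self.registry)

    def refresh_from(self, nearest):
        # Models the remote refresh path: ask whichever node is "nearest".
        self.registry = nearest.serve_shard_list()

    def prepopulate_from_local(self):
        # Models the proposed fix: seed the cache from local data only.
        self.registry = dict(self.local_shards)


# Hypothetical shard entry, used only for illustration.
shards = {"config": "configRS/host1:27019,host2:27019,host3:27019"}
secondary_a = Node("secondary_a", shards)
secondary_b = Node("secondary_b", shards)

# Without pre-population, a secondary asking another secondary fails.
try:
    secondary_a.refresh_from(secondary_b)
    remote_refresh_ok = True
except ShardNotFound:
    remote_refresh_ok = False

# With pre-population on the target, the same request succeeds.
secondary_b.prepopulate_from_local()
secondary_a.refresh_from(secondary_b)
```

In this model the failure depends only on which node happens to be chosen and whether its cache has been seeded yet, which is consistent with the flakiness described above. The model does not cover replication lag of the local data.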
| Comment by Andrew Shuvalov (Inactive) [ 22/Jun/22 ] | |||||||||||
|
This has to be reopened as the fix for | |||||||||||
| Comment by Kaloian Manassiev [ 09/Jun/22 ] | |||||||||||
|
Just to be clear, andrew.shuvalov@mongodb.com, do we need to block the port from opening until the shard registry data is loaded at all? It is not like the key manager where it is actually required for incoming requests. I guess what I am asking is, if we just remove the waiting for load from the bootstrap, what fails? | |||||||||||
| Comment by Andrew Shuvalov (Inactive) [ 09/Jun/22 ] | |||||||||||
|
kaloian.manassiev@mongodb.com thanks, confirmed. In previous experiments I checked that ShardRegistryData is actually loaded during bootstrap, but I tried your suggestion that it doesn't have to be. So I inserted this code:
It spins once on every server, because the first read indeed comes before startup is complete. But then the completion is triggered and the "done" log is printed. So there will be a delay waiting for the port to open, but it will work. I'm resolving the ticket. | |||||||||||
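The behavior described in this comment (a first read that arrives before loading completes, waits once, and then proceeds) can be modeled with a small sketch. This is not the code referenced in the comment; `LazyRegistry`, `mark_loaded`, `get`, and the sample data are hypothetical names chosen for illustration, and the model uses a plain Python `threading.Event` rather than anything from the server.

```python
import threading
import time


class LazyRegistry:
    """Toy cache whose readers block until the initial load completes."""

    def __init__(self):
        self._loaded = threading.Event()
        self._data = {}
        self.waits = 0  # how many reads had to wait for the load

    def mark_loaded(self, data):
        # Called once the (hypothetical) startup load has finished.
        self._data = dict(data)
        self._loaded.set()

    def get(self, timeout=5.0):
        if not self._loaded.is_set():
            self.waits += 1
            # Block until the load completes or the timeout expires.
            if not self._loaded.wait(timeout):
                raise TimeoutError("registry was not loaded in time")
        return dict(self._data)


registry = LazyRegistry()
results = []

# The first read starts before the load has completed.
reader = threading.Thread(target=lambda: results.append(registry.get()))
reader.start()

# Wait until the reader is actually blocked, then complete the load.
while registry.waits == 0:
    time.sleep(0.001)
registry.mark_loaded({"config": "configRS/host1:27019"})
reader.join()

# A later read finds the data already loaded and does not wait.
second = registry.get()
```

Only the first read pays the waiting cost in this sketch; every read after the load completes returns immediately.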
| Comment by Kaloian Manassiev [ 09/Jun/22 ] | |||||||||||
|
andrew.shuvalov@mongodb.com, for the ShardRegistry, do we really need to prepopulate it, or should we just let the first access to it do the fetch? Is there something before the ports are open (or on some path where we shouldn't block) that relies on it being populated? For the KeyManager I kind of see the need, but for the ShardRegistry I can't think of one. |