[SERVER-79311] Investigate if LogicalSessionCache refresher and reaper truly need to force refresh the routing info for config.system.sessions Created: 25/Jul/23 Updated: 07/Sep/23 Resolved: 05/Sep/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Cheahuychou Mao | Assignee: | Cheahuychou Mao |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | sharding-nyc-subteam3 | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||
| Assigned Teams: | Sharding NYC |
||||
| Sprint: | Sharding NYC 2023-08-21, Sharding NYC 2023-09-04 | ||||
| Participants: | |||||
| Story Points: | 3 | ||||
| Description |
|
The LogicalSessionCache refresher and reaper each currently include a step that checks that the config.system.sessions collection exists (here and here). Under the hood, this check performs a force refresh of the routing info for the collection. On a secondary shardsvr mongod, each routing info refresh involves making the primary refresh by running a _flushRoutingTableCacheUpdates command against the primary and then waiting for the opTime that the command returns. From code inspection, this wait has no timeout, so the time spent waiting after each _flushRoutingTableCacheUpdates command depends on the replication lag. When the lag is large, the refresh takes proportionally long to complete (HELP-48060), and the refresher and reaper can consequently run less frequently than scheduled. It is unclear why such a force refresh is necessary, i.e. why the refresher or reaper, acting as a client, does not simply retry its upsert/delete/find commands later if it gets a StaleConfig error. |
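The alternative raised at the end of the description (retry on StaleConfig instead of force refreshing up front) can be sketched as below. This is a minimal illustration in Python, not the server's C++ implementation: `StaleConfigError`, `run_with_stale_config_retry`, and the `op` / `refresh_routing_info` callables are all hypothetical names introduced here, and the retry limit and backoff are assumptions.

```python
import time


class StaleConfigError(Exception):
    """Hypothetical stand-in for the server's StaleConfig error."""


def run_with_stale_config_retry(op, refresh_routing_info,
                                max_attempts=3, backoff_secs=0.0):
    """Run op() without refreshing routing info up front.

    Routing info is refreshed only after op() reports that it was stale,
    and the operation is then retried. Both callables are assumptions
    supplied by the caller; the attempt limit is illustrative.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return op()
        except StaleConfigError:
            if attempt == max_attempts:
                # Give up and surface the error to the caller.
                raise
            # Refresh lazily, only because the operation told us to.
            refresh_routing_info()
            time.sleep(backoff_secs)
```

In this shape, the common case (routing info already current) pays no refresh cost at all, and the cost of a refresh is incurred only on the runs that actually hit stale routing info. Whether that is safe for the refresher and reaper is exactly the open question in this ticket.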
| Comments |
| Comment by Jason Zhang [ 01/Aug/23 ] |
|
From our discussion, running listCollections directly against the primary shard could bypass the wait for replication, but whether that is feasible needs further investigation. |