Type: Improvement
Resolution: Unresolved
Priority: Major - P3
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels:
None

Assigned Teams:

Cluster Scalability
Sprint:
Cluster Scalability 2024-09-02, Cluster Scalability 2024-10-14, Cluster Scalability 2024-10-28, Cluster Scalability 2024-11-11, Cluster Scalability 2024-12-23, Cluster Scalability 2025-01-20
Confidence Status:
None
Work Order:
3
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

Currently, the periodic thread runs the following sequence:

1. Purge the sessions pending refresh that were ended with the endSessionsFromClient command.
2. Perform the refresh. This updates the ping on the sessions collection for each currently active in-memory session.
3. Scan the sessions collection and check which sessions still exist. Sessions that no longer exist in the collection are treated as "expired" sessions, and we call kill cursors on them.
4. Once this is finished, clear the list of "sessions pending to be refreshed".

However, any assertion that occurs aborts the entire sequence. This means that if a single shard keeps causing step 2 to assert, the thread never reaches step 4, so the list of "sessions pending to be refreshed" is never cleared and can keep accumulating.
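
The accumulation described above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not the server's implementation: `SessionCacheSketch`, `RefreshError`, `run_once`, and `run_periodically` are hypothetical names, the sessions collection is modeled as a plain set, and step 1 is assumed to mean "drop client-ended sessions from the pending list".

```python
class RefreshError(Exception):
    """Hypothetical stand-in for an assertion raised during the refresh step."""


class SessionCacheSketch:
    """Toy model of the four-step periodic sequence (all names are assumptions)."""

    def __init__(self):
        self.pending = set()        # lsids pending to be refreshed
        self.ended = set()          # lsids ended by clients
        self.collection = set()     # stand-in for the sessions collection
        self.open_cursors = set()   # lsids that own open cursors
        self.refresh_fails = False  # simulate a shard that keeps erroring

    def use_session(self, lsid):
        self.pending.add(lsid)
        self.open_cursors.add(lsid)

    def run_once(self):
        # Step 1: purge pending sessions that clients have ended.
        self.pending -= self.ended
        self.ended.clear()
        # Step 2: refresh. A failure here aborts everything below.
        if self.refresh_fails:
            raise RefreshError("write to sessions collection failed")
        self.collection |= self.pending
        # Step 3: sessions missing from the collection are treated as expired.
        expired = self.open_cursors - self.collection
        self.open_cursors -= expired  # models killing their cursors
        # Step 4: clear the pending list (never reached if step 2 raised).
        self.pending.clear()
        return expired


def run_periodically(cache, rounds):
    """Run the sequence `rounds` times, adding one new session per round."""
    sizes = []
    for lsid in range(rounds):
        cache.use_session(lsid)
        try:
            cache.run_once()
        except RefreshError:
            pass  # the error is swallowed and the job waits for its next run
        sizes.append(len(cache.pending))
    return sizes
```

With `refresh_fails` left at `False`, the pending list is empty after every round. With `refresh_fails = True`, it grows by one entry per round because step 4 is never reached.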

As a concrete example, imagine this setup:
session collection chunk distribution:
shard0: lsid: 0->10
shard1: lsid: 10->20
shard2: lsid: 20->30

lsid in memory:
shard10: 0, 10, 20
shard11: 1, 11, 21
shard12: 2, 12, 22

Note that shard10, shard11, and shard12 each have to target shard0 when performing the session refresh, since one of its lsids falls in the chunk that shard0 owns. So, if the write to shard0 is causing errors, then shard10, shard11, and shard12 won't be able to purge expired sessions. Also note that it is not unusual for multiple shards to have the same lsid in memory, because some operations can hit multiple shards. In an extreme case, if we perform a broadcast query with lsid: 4, then all shards will have lsid: 4 in memory. Using the previous example, all shards will then have to target shard0 when performing the logical session cache refresh.
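
The targeting in this example can be worked through in a few lines. This is a sketch only: lsids are modeled as integers as in the example, chunk ranges are treated as half-open (min inclusive, max exclusive), and `owner`, `refresh_targets`, and `blocked_by` are hypothetical helper names.

```python
# Chunk distribution of the sessions collection, taken from the example.
CHUNKS = {"shard0": (0, 10), "shard1": (10, 20), "shard2": (20, 30)}

# lsids each shard holds in memory, taken from the example.
IN_MEMORY = {
    "shard10": [0, 10, 20],
    "shard11": [1, 11, 21],
    "shard12": [2, 12, 22],
}


def owner(lsid, chunks):
    """Return the shard whose chunk range [lo, hi) contains this lsid."""
    for shard, (lo, hi) in chunks.items():
        if lo <= lsid < hi:
            return shard
    raise KeyError(lsid)


def refresh_targets(lsids, chunks):
    """Shards that a refresh must write to for the given in-memory lsids."""
    return {owner(lsid, chunks) for lsid in lsids}


def blocked_by(failing_shard, in_memory, chunks):
    """Shards whose refresh sequence aborts if writes to failing_shard error."""
    return sorted(
        shard
        for shard, lsids in in_memory.items()
        if failing_shard in refresh_targets(lsids, chunks)
    )
```

Here `blocked_by("shard0", IN_MEMORY, CHUNKS)` returns all three of shard10, shard11, and shard12. Adding lsid 4 to any other shard's in-memory list, as a broadcast query would, puts that shard in the blocked set as well.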

related to

SERVER-94571 Increase write concern timeout for refreshing sessions collection to 60 seconds

Closed

Assignee: Unassigned
Reporter: Randolph Tan
Participants: Randolph Tan
Votes: 0
Watchers: 14

Created: Aug 23 2024 07:37:52 PM UTC
Updated: Feb 25 2025 06:37:06 PM UTC
Confidence Status Last Update: 06/Sep/24 5:18 PM
