[SERVER-31565] Add a `serverStatus` section with runtime information for the sessions cache Created: 13/Oct/17  Updated: 30/Oct/23  Resolved: 08/Nov/17

Status: Closed
Project: Core Server
Component/s: Diagnostics, Sharding
Affects Version/s: None
Fix Version/s: 3.6.0-rc4

Type: Improvement Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Mira Carey
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Documented
is documented by DOCS-10686 Docs for SERVER-28301: Add stats abou... Closed
Backwards Compatibility: Fully Compatible
Sprint: Platforms 2017-10-23, Platforms 2017-11-13
Participants:

 Description   

With the way that sessions management works, a periodic maintenance task runs on each node in a replica set or sharded cluster, which synchronizes the in-memory sessions state with what's persisted in the config.system.sessions collection.

To improve product supportability, we should include FTDC metrics for sessions management as part of the serverStatus output. This will allow interesting server behaviour changes to be correlated with executions of the sessions state maintenance task.

I propose that the following metrics be reported, under a section called sessions. All these metrics are individual for the node:

  • activeSessionsCount - The number of active/cached sessions
  • sessionsCollectionRefreshCount - A number, which is incremented by one every time the in-memory sessions state has been persisted
  • lastSessionsCollectionRefreshDurationMicros - The duration of the last sessions collection refresh
  • lastSessionsCollectionRefreshTimestamp - The wall-clock time of when the last sessions collection refresh happened
  • lastSessionsCollectionRefreshDurationEntriesRefreshed - How many entries were refreshed during the last refresh round.
  • sessionsCollectionCleanupCount - A number, which is incremented by one every time sessions are being cleaned up
  • lastSessionsCollectionCleanupDurationMicros - The duration of the last sessions cleanup
  • lastSessionsCollectionCleanupTimestamp - The wall-clock time of when the last sessions cleanup happened
  • lastSessionsCollectionCleanupEntriesCleanedUp - How many entries were cleaned up during the last round

NOTE: These statistics are particularly useful for the sharding case, where there is cross-node communication and a potentially large number of sessions could be refreshed each round, so it is acceptable for them to be present only in a sharded cluster.
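
As a rough illustration of the correlation use case described above, the sketch below scans a series of periodic samples and reports which ones saw the refresh counter advance. It assumes each sample is a plain dict carrying the proposed sessionsCollectionRefreshCount field; the function name and the sample data are hypothetical, and the final field names in serverStatus may differ.

```python
from typing import Dict, List


def refresh_sample_indexes(samples: List[Dict[str, int]]) -> List[int]:
    """Return the indexes of samples in which the refresh counter advanced.

    Each sample is assumed to be a dict holding the proposed
    sessionsCollectionRefreshCount field (a hypothetical layout).
    """
    key = "sessionsCollectionRefreshCount"
    hits = []
    for i in range(1, len(samples)):
        # A higher count than in the previous sample means at least one
        # refresh completed between the two samples.
        if samples[i][key] > samples[i - 1][key]:
            hits.append(i)
    return hits


# Made-up samples for illustration only.
samples = [
    {"sessionsCollectionRefreshCount": 3},
    {"sessionsCollectionRefreshCount": 3},
    {"sessionsCollectionRefreshCount": 4},
    {"sessionsCollectionRefreshCount": 4},
]
print(refresh_sample_indexes(samples))  # [2]
```

The indexes returned can then be lined up against other metrics captured in the same samples.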



 Comments   
Comment by Githook User [ 08/Nov/17 ]

Author:

{'name': 'Jason Carey', 'username': 'hanumantmk', 'email': 'jcarey@argv.me'}

Message: SERVER-31565 Add stats about logical sessions background jobs to serverStatus
Branch: master
https://github.com/mongodb/mongo/commit/2c87a5e7d90a6dd998d098f85e5db486a555fe42

Comment by Ian Whalen (Inactive) [ 03/Nov/17 ]

Revert to fix failure in transaction_reaper.js.

Comment by Githook User [ 03/Nov/17 ]

Author:

{'name': 'Ian Whalen', 'username': 'IanWhalen', 'email': 'ian.whalen@gmail.com'}

Message: Revert "SERVER-31565 Add stats about logical sessions background jobs to serverStatus"

This reverts commit 7cd8508b06e1574bea211dff054855b70b7cc20e.
Branch: master
https://github.com/mongodb/mongo/commit/bbbafc93ab4ab5b36c8f297a158cd218ab0638f9

Comment by Githook User [ 01/Nov/17 ]

Author:

{'name': 'samantharitter', 'username': 'samantharitter', 'email': 'samantha.ritter@10gen.com'}

Message: SERVER-31565 Add stats about logical sessions background jobs to serverStatus
Branch: master
https://github.com/mongodb/mongo/commit/7cd8508b06e1574bea211dff054855b70b7cc20e

Comment by Githook User [ 01/Nov/17 ]

Author:

{'name': 'samantharitter', 'username': 'samantharitter', 'email': 'samantha.ritter@10gen.com'}

Message: SERVER-31565 Remove unused logical session cache method
Branch: master
https://github.com/mongodb/mongo/commit/ef8db41490338502892d2e546e9a745d529ad614

Comment by Samantha Ritter (Inactive) [ 16/Oct/17 ]

It's a great idea to add some sessions-related statistics to server status, but I have a few comments/suggestions.

I am assuming that "duration" means we time the background jobs from start to finish, and report how long they take to run.

I am assuming that "sessions collection cleanup" means the part of the refresh background job where we remove records that have been ended via endSessions from config.system.sessions. This cleanup happens during the regular background job, which also does refreshing, among other things, so it doesn't make sense to report its timing separately from the refresh timing. The cleanup of records that expire naturally, without an explicit user call to endSessions, happens via a TTL index, so we can't report stats on that from within the session cache. In my proposed metrics below, these stats are aggregated into one "sessionsCollectionBackgroundJob" group.

There is a separate cleanup task that we should also report stats on: the transaction reaper. This second background job is responsible for clearing records out of the transaction table if their parent sessions have ended or expired, and it runs on a schedule independent of the refresh background job. However, it runs with the same frequency as the refresh background job, which is once every logicalSessionRefreshMinutes, or every 5 minutes by default.

Given those things, I propose the following set of metrics, which adds to Kal's original set but uses names that I think are more clear:

  • activeSessionsCount - The number of records currently in the cache, which exists now as a field called "records". This is the number of records that have been used since the last time the refresh job ran.
  • sessionsCollectionBackgroundJobCount - The number of times the logical sessions cache _refresh job has run. This task refreshes active records, removes records ended via endSessions, and closes local zombie cursors whose sessions have ended or expired.
  • lastSessionsCollectionBackgroundJobDurationMicros - The duration of the last sessions collection background job
  • lastSessionsCollectionBackgroundJobTimestamp - The wall-clock time of when the last sessions collection refresh job happened
  • lastSessionsCollectionBackgroundJobEntriesRefreshed - How many entries were refreshed during the last refresh round.
  • lastSessionsCollectionBackgroundJobEntriesEnded - How many sessions were explicitly ended during the last refresh round.
  • lastSessionsCollectionBackgroundJobCursorsClosed - How many zombie cursors were closed by the last background job run.
  • transactionReaperJobCount - A number, which is incremented by one every time transaction records are cleaned up.
  • lastTransactionReaperJobDurationMicros - The duration of the last transaction reaper run.
  • lastTransactionReaperJobTimestamp - The wall-clock time of when the last transaction reaper run happened.
  • lastTransactionReaperJobEntriesCleanedUp - How many entries from the transaction table were cleaned up during the last round.

We've already added a section to serverStatus called "logicalSessionRecordCache," which currently only reports the number of active records in the cache. I'd like to add the new metrics to that existing section.
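
To make the shape of the proposed metrics concrete, here is a minimal sketch of a per-job stats holder. It is not the server's implementation: the JobStats class, its run and report methods, and the choice to publish all values together once the job finishes are assumptions made for illustration. Only the emitted field names come from the list above.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Union


@dataclass
class JobStats:
    """Hypothetical holder for one background job's last-run statistics."""

    job_count: int = 0
    last_duration_micros: int = 0
    last_timestamp: float = 0.0
    last_entries: int = 0

    def run(self, job: Callable[[], int]) -> None:
        """Time the job from start to finish, then publish its stats.

        The job is assumed to return the number of entries it processed.
        """
        wall_start = time.time()        # wall-clock time of this run
        start = time.perf_counter()     # monotonic clock for the duration
        entries = job()
        elapsed = time.perf_counter() - start
        # Assumption: all fields are updated together, after the job ends.
        self.job_count += 1
        self.last_duration_micros = int(elapsed * 1_000_000)
        self.last_timestamp = wall_start
        self.last_entries = entries

    def report(self, name: str, entries_suffix: str) -> Dict[str, Union[int, float]]:
        """Build field names in the style proposed above from a job name."""
        count_key = name[0].lower() + name[1:] + "Count"
        return {
            count_key: self.job_count,
            "last" + name + "DurationMicros": self.last_duration_micros,
            "last" + name + "Timestamp": self.last_timestamp,
            "last" + name + entries_suffix: self.last_entries,
        }


reaper = JobStats()
reaper.run(lambda: 7)  # stand-in job that "cleans up" 7 entries
print(reaper.report("TransactionReaperJob", "EntriesCleanedUp"))
```

Keeping one holder per job matches the split into a refresh group and a transaction reaper group, since the two jobs run on independent schedules.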

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

OK.

In the interest of specificity, when are the counts and other metrics updated - all at the same time, at the end of the corresponding activity (refresh, cleanup)?

Comment by Kaloian Manassiev [ 13/Oct/17 ]

bruce.lucas: No, the frequency is much lower than that - on the order of every 5 minutes and should span over many FTDC rounds. So being able to infer a rate of cleanup is not really meaningful.

Comment by Bruce Lucas (Inactive) [ 13/Oct/17 ]

How often does the cleanup happen?

If it's very frequent (say once a second or more) then it might be better to make the numbers cumulative so they can be differentiated to produce a rate, i.e. entries cleaned up per second, entries refreshed per second, etc.

If infrequent the format you suggest looks ok.
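
For reference, the cumulative-counter approach mentioned above amounts to dividing the change in a counter by the time between two samples. The sketch below shows that arithmetic only; the function name and the sample values are hypothetical and do not correspond to any existing tool.

```python
def per_second_rate(prev_value: int, prev_time: float,
                    curr_value: int, curr_time: float) -> float:
    """Rate of change of a cumulative counter between two samples.

    Times are in seconds. Both samples are assumed to come from the same
    process lifetime, so the counter has not been reset in between.
    """
    elapsed = curr_time - prev_time
    if elapsed <= 0:
        raise ValueError("samples must be taken at increasing times")
    return (curr_value - prev_value) / elapsed


# Made-up example: the counter grows from 1000 to 1600 over 2 seconds.
print(per_second_rate(1000, 10.0, 1600, 12.0))  # 300.0
```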

Generated at Thu Feb 08 04:27:28 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.