[SERVER-80722] Rationalize Catalog Cache statistics Created: 05/Sep/23  Updated: 26/Oct/23

Status: Backlog
Project: Core Server
Component/s: None
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Antonio Fuschetto Assignee: Backlog - Catalog and Routing
Resolution: Unresolved Votes: 0
Labels: oldshardingemea, shardingemea-qw
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File image-2023-09-05-14-31-48-034.png    
Issue Links:
Depends
is depended on by SERVER-80724 Add database refresh statistics to Ca... Blocked
is depended on by SERVER-80868 Complete TODO listed in SERVER-34164 Blocked
Related
is related to SERVER-34164 [support] add stats for database refr... Closed
Assigned Teams:
Catalog and Routing
Participants:
Story Points: 2

 Description   

The serverStatus command and the FTDC data files report general statistics on the server. As part of this, there is a section dedicated to the Catalog Cache:

 shardingStatistics : {
    ...
    catalogCache : {
       numDatabaseEntries : Long("<num>"),
       numCollectionEntries : Long("<num>"),
       countStaleConfigErrors : Long("<num>"),
       totalRefreshWaitTimeMicros : Long("<num>"),
       numActiveIncrementalRefreshes : Long("<num>"),
       countIncrementalRefreshesStarted : Long("<num>"),
       numActiveFullRefreshes : Long("<num>"),
       countFullRefreshesStarted : Long("<num>"),
       countFailedRefreshes : Long("<num>")
    }
    ...
 }
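
To make the shape of this section concrete, here is a minimal Python sketch that pulls the catalogCache sub-document out of a serverStatus-style result and computes how much the cumulative counters changed between two samples. The helper names (catalog_cache_section, counter_deltas) are hypothetical, and treating the count*/total* fields as monotonically increasing counters is an assumption based only on their names. The input is a plain dictionary, such as the one a driver returns for the serverStatus command.

```python
from typing import Any, Dict

# Assumption (from the field names only): these fields are cumulative counters,
# while the num* fields are point-in-time gauges and are therefore not diffed.
COUNTER_FIELDS = (
    "countStaleConfigErrors",
    "totalRefreshWaitTimeMicros",
    "countIncrementalRefreshesStarted",
    "countFullRefreshesStarted",
    "countFailedRefreshes",
)


def catalog_cache_section(server_status: Dict[str, Any]) -> Dict[str, Any]:
    """Return a copy of shardingStatistics.catalogCache, or {} if it is absent."""
    sharding = server_status.get("shardingStatistics", {})
    return dict(sharding.get("catalogCache", {}))


def counter_deltas(earlier: Dict[str, Any], later: Dict[str, Any]) -> Dict[str, int]:
    """Difference of each counter field between two serverStatus samples."""
    before = catalog_cache_section(earlier)
    after = catalog_cache_section(later)
    return {
        field: int(after[field]) - int(before[field])
        for field in COUNTER_FIELDS
        if field in before and field in after
    }
```

Diffing two samples this way is roughly what a time-series view of the counters would show per interval; the gauges would be plotted as-is.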

For FTDC files, T2 can represent this information graphically (see the attached image).

One goal of this ticket is to identify metrics that are useful and clear for investigating the behavior of the Catalog Cache (i.e., one shouldn't need to be a sharding expert to interpret these metrics).

Ideally, for both collection and database metadata, we would need (TBD):

  • number of entries in the cache (TBD: stale and non-stale?)
  • number of cache misses (TBD: and cache hits?)
  • number of incremental refreshes started/completed (only for collection metadata)
  • number of full refreshes started/completed
  • number of failed refreshes
  • time spent waiting for refreshes (in milliseconds)
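
As a rough illustration of the list above, the proposed per-cache metrics could be modeled as follows. This is only a sketch: the class name, field names, and defaults are hypothetical and do not reflect any existing server structure, and the open TBD items (stale entries, cache hits) are deliberately left out. The only behavior shown is the conversion of the wait time from microseconds, as reported today, to milliseconds.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class CacheRefreshStats:
    """Hypothetical container for the metrics proposed for one cache
    (database metadata or collection metadata)."""

    num_entries: int = 0
    cache_misses: int = 0
    # Incremental refreshes apply only to collection metadata, so these
    # stay None for the database cache.
    incremental_refreshes_started: Optional[int] = None
    incremental_refreshes_completed: Optional[int] = None
    full_refreshes_started: int = 0
    full_refreshes_completed: int = 0
    failed_refreshes: int = 0
    # Stored in microseconds, matching the current totalRefreshWaitTimeMicros.
    refresh_wait_time_micros: int = 0

    @property
    def refresh_wait_time_millis(self) -> float:
        """Wait time converted to milliseconds (1 ms = 1000 microseconds)."""
        return self.refresh_wait_time_micros / 1000
```

For example, CacheRefreshStats(refresh_wait_time_micros=2500).refresh_wait_time_millis evaluates to 2.5.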

Separately, T2 should be fixed and/or improved to represent this information in the best way (for example, it currently shows threads as the unit of the totalRefreshWaitTimeMicros metric, which is definitely a bug). Consequently, some tickets for the Server Triage & Release team should be created as part of this work.


Generated at Thu Feb 08 06:44:24 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.