[SERVER-28670] Add sharding metadata refresh metrics section to serverStatus Created: 07/Apr/17  Updated: 06/Jun/18  Resolved: 24/Jan/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.5.5
Fix Version/s: 3.4.15, 3.6.4, 3.7.2

Type: Improvement Priority: Major - P3
Reporter: Kaloian Manassiev Assignee: Kaloian Manassiev
Resolution: Done Votes: 1
Labels: SWDI
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Documented
is documented by DOCS-11282 Add sharding metadata refresh metrics... Closed
Duplicate
is duplicated by SERVER-11784 better migration stats Closed
is duplicated by SERVER-29788 Log moveChunk counts in changelog Closed
Related
Backwards Compatibility: Fully Compatible
Backport Requested:
v3.6, v3.4
Sprint: Sharding 2018-01-29
Participants:
Case:

 Description   

Sharding metadata refreshes have occasionally been demonstrated to cause throughput stalls. Currently, there is no visibility into when these are happening other than looking at the server log and trying to match them with FTDC data.

In order to improve diagnosability we should introduce metadata refresh metrics to serverStatus so they can also be recorded in FTDC. All the proposed metrics should be under a section called shardingStatistics and will behave like this:

  • shardingStatistics
    • countStaleConfigErrors - Counts how many times threads hit a stale config exception (which is what triggers metadata refreshes)
    • countDonorMoveChunkStarted - Cumulative, always-increasing counter of how many chunk donations this node has started (whether they succeeded or not)
    • totalDonorMoveChunkTimeMillis - Cumulative, always-increasing counter of how much time the entire move chunk operation took (excluding range deletion)
    • totalDonorChunkCloneTimeMillis - Cumulative, always-increasing counter of how much time the clone phase took on the donor node, before it was appropriate to enter the critical section
    • totalCriticalSectionCommitTimeMillis - Cumulative, always-increasing counter of how much time the critical section's commit phase took. This is the period of time when all operations on the collection are blocked, not just the reads (from 3.6 onward)
    • totalCriticalSectionTimeMillis - Cumulative, always-increasing counter of how much time the entire critical section took. It includes the time the recipient took to fetch the latest modifications from the donor and persist them plus the critical section commit time. The value of totalCriticalSectionTimeMillis - totalCriticalSectionCommitTimeMillis gives the duration of the catch-up phase of the critical section (where the last mods are transferred from the donor to the recipient).
  • shardingStatistics.catalogCache
    • numDatabaseEntries - Tracks how many database entries in total are currently in the catalog cache
    • numCollectionEntries - Tracks how many collection entries (in total across all databases) are currently in the catalog cache
    • countStaleConfigErrors - Counts how many times threads hit a stale config exception (which is what triggers metadata refreshes)
    • totalRefreshWaitTimeMicros - Cumulative, always-increasing counter of how much time all threads combined have spent waiting for a refresh
    • numActiveIncrementalRefreshes - Tracks how many incremental refreshes are currently waiting to complete
    • countIncrementalRefreshesStarted - Cumulative, always-increasing counter of how many incremental refreshes have been kicked off
    • numActiveFullRefreshes - Tracks how many full refreshes are currently waiting to complete
    • countFullRefreshesStarted - Cumulative, always-increasing counter of how many full refreshes have been kicked off
    • countFailedRefreshes - Cumulative, always-increasing counter of how many full or incremental refreshes failed for whatever reason
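
Because the donor metrics above are cumulative, a consumer would typically diff two samples to see what happened in an interval, and then derive the catch-up duration as totalCriticalSectionTimeMillis - totalCriticalSectionCommitTimeMillis. The following is a minimal sketch of that arithmetic, assuming the two samples are plain dictionaries shaped like the proposed shardingStatistics section. The helper names (counter_deltas, catch_up_millis) and all sample values are hypothetical and made up for illustration; they are not part of the server or of any driver.

```python
from typing import Dict, Mapping

# Cumulative donor-side counters proposed in this ticket.
DONOR_COUNTERS = (
    "countDonorMoveChunkStarted",
    "totalDonorMoveChunkTimeMillis",
    "totalDonorChunkCloneTimeMillis",
    "totalCriticalSectionCommitTimeMillis",
    "totalCriticalSectionTimeMillis",
)


def counter_deltas(earlier: Mapping[str, int], later: Mapping[str, int]) -> Dict[str, int]:
    """Return how much each cumulative counter grew between two samples."""
    return {name: later[name] - earlier[name] for name in DONOR_COUNTERS}


def catch_up_millis(stats: Mapping[str, int]) -> int:
    """Catch-up phase duration: total critical section time minus commit time."""
    return (
        stats["totalCriticalSectionTimeMillis"]
        - stats["totalCriticalSectionCommitTimeMillis"]
    )


# Invented sample values standing in for two shardingStatistics readings.
earlier = {
    "countDonorMoveChunkStarted": 3,
    "totalDonorMoveChunkTimeMillis": 9000,
    "totalDonorChunkCloneTimeMillis": 6000,
    "totalCriticalSectionCommitTimeMillis": 300,
    "totalCriticalSectionTimeMillis": 900,
}
later = {
    "countDonorMoveChunkStarted": 5,
    "totalDonorMoveChunkTimeMillis": 15000,
    "totalDonorChunkCloneTimeMillis": 10000,
    "totalCriticalSectionCommitTimeMillis": 500,
    "totalCriticalSectionTimeMillis": 1500,
}

interval = counter_deltas(earlier, later)
print(interval["countDonorMoveChunkStarted"])  # migrations started in the interval
print(catch_up_millis(interval))               # catch-up time spent in the interval
```

The same subtraction applies to a single sample if the lifetime totals are of interest rather than a per-interval view.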


 Comments   
Comment by Githook User [ 29/Mar/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-28670 Add sharding CatalogCache and donor metrics to serverStatus

Includes metrics for refresh, clone and migration critical section
duration.

(cherry picked from commit c4142a8e0b486f3642b700c9efb208f909e3ff1d)
Branch: v3.4
https://github.com/mongodb/mongo/commit/8393b19ba5d0e9d79588aae42225b5af22acaccf

Comment by Githook User [ 16/Mar/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-28670 Add sharding CatalogCache and donor metrics to serverStatus

Includes metrics for refresh, clone and migration critical section
duration.

(cherry picked from commit bc433b50e0205dfd0a8bfb6906393d841fd8193a)
Branch: v3.6
https://github.com/mongodb/mongo/commit/c4142a8e0b486f3642b700c9efb208f909e3ff1d

Comment by Githook User [ 24/Jan/18 ]

Author: Kaloian Manassiev <kaloian.manassiev@mongodb.com> (kaloianm)

Message: SERVER-28670 Add sharding CatalogCache and donor metrics to serverStatus

Includes metrics for refresh, clone and migration critical section
duration.
Branch: master
https://github.com/mongodb/mongo/commit/bc433b50e0205dfd0a8bfb6906393d841fd8193a

Generated at Thu Feb 08 04:18:46 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.