Type: Improvement
Resolution: Done
Priority: Major - P3
Fix Version/s: 3.4.15, 3.6.4, 3.7.2
Affects Version/s: 3.5.5
Component/s: Sharding
Labels: SWDI
Backwards Compatibility: Fully Compatible
Backport Requested: v3.6, v3.4
Sprint: Sharding 2018-01-29
Case:
Confidence Status: None
Work Order: 3
CAR Domain/s: None
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Name(s): None
Goal Link: None

Sharding metadata refreshes have occasionally been demonstrated to cause throughput stalls. Currently, there is no visibility into when these are happening other than looking at the server log and trying to match them with FTDC data.

In order to improve diagnosability we should introduce metadata refresh metrics to serverStatus so they can also be recorded in FTDC. All the proposed metrics should be under a section called shardingStatistics and will behave like this:

shardingStatistics
- countStaleConfigErrors - Counts how many times threads hit a stale config exception (which is what triggers metadata refreshes)
- countDonorMoveChunkStarted - Cumulative, always-increasing counter of how many chunks this node started donating (whether or not the donation succeeded)
- totalDonorMoveChunkTimeMillis - Cumulative, always-increasing counter of how much time the entire move chunk operation took (excluding range deletion)
- totalDonorChunkCloneTimeMillis - Cumulative, always-increasing counter of how much time the clone phase took on the donor node, before it was appropriate to enter the critical section
- totalCriticalSectionCommitTimeMillis - Cumulative, always-increasing counter of how much time the critical section's commit phase took. This is the period of time when all operations on the collection are blocked, not just the reads (from 3.6 onward).
- totalCriticalSectionTimeMillis - Cumulative, always-increasing counter of how much time the entire critical section took. It includes the time the recipient took to fetch the latest modifications from the donor and persist them, plus the critical section commit time. The value of totalCriticalSectionTimeMillis - totalCriticalSectionCommitTimeMillis gives the duration of the catch-up phase of the critical section (where the latest modifications are transferred from the donor to the recipient).
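The subtraction described for the catch-up phase can be sketched as follows. This is a minimal illustration that operates on a plain dict shaped like the proposed shardingStatistics section; the function name, the output keys (catchUpMillis, commitMillis, cloneMillis, avgMoveChunkMillis) and the sample values are assumptions made for the example, not part of the proposal.

```python
def donor_migration_summary(stats):
    """Derive phase totals from a dict shaped like the proposed shardingStatistics section.

    The output key names are hypothetical and exist only for this example.
    """
    critical_total = stats["totalCriticalSectionTimeMillis"]
    critical_commit = stats["totalCriticalSectionCommitTimeMillis"]
    started = stats["countDonorMoveChunkStarted"]

    summary = {
        # Catch-up phase = whole critical section minus its commit phase.
        "catchUpMillis": critical_total - critical_commit,
        "commitMillis": critical_commit,
        "cloneMillis": stats["totalDonorChunkCloneTimeMillis"],
    }
    # Rough average only: the started counter also includes donations that did not succeed.
    summary["avgMoveChunkMillis"] = (
        stats["totalDonorMoveChunkTimeMillis"] / started if started else None
    )
    return summary


# Made-up sample values, not taken from a real server.
sample = {
    "countDonorMoveChunkStarted": 4,
    "totalDonorMoveChunkTimeMillis": 12000,
    "totalDonorChunkCloneTimeMillis": 9000,
    "totalCriticalSectionCommitTimeMillis": 250,
    "totalCriticalSectionTimeMillis": 900,
}
summary = donor_migration_summary(sample)
```

Because all of the inputs are cumulative, the result covers every donation since the node started; applying the same function to the difference between two samples would give the figures for a specific time window.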

shardingStatistics.catalogCache
- numDatabaseEntries - Tracks how many database entries in total are currently in the catalog cache
- numCollectionEntries - Tracks how many collection entries (in total across all databases) are currently in the catalog cache
- countStaleConfigErrors - Counts how many times threads hit a stale config exception (which is what triggers metadata refreshes)
- totalRefreshWaitTimeMicros - Cumulative, always-increasing counter of the combined time that threads have spent waiting for a refresh
- numActiveIncrementalRefreshes - Tracks how many incremental refreshes are currently waiting to complete
- countIncrementalRefreshesStarted - Cumulative, always-increasing counter of how many incremental refreshes have been kicked off
- numActiveFullRefreshes - Tracks how many full refreshes are currently waiting to complete
- countFullRefreshesStarted - Cumulative, always-increasing counter of how many full refreshes have been kicked off
- countFailedRefreshes - Cumulative, always-increasing counter of how many full or incremental refreshes failed for any reason
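The list above mixes two kinds of metrics: cumulative counters, which only become meaningful for a time window once two samples are subtracted, and point-in-time values, which are read directly. A minimal sketch of that distinction, assuming two dicts shaped like the proposed shardingStatistics.catalogCache section, is shown below. The function name, the grouping of fields into the two tuples, and the sample values are assumptions for illustration.

```python
# Fields described above as cumulative or as counting events: compare two samples.
CUMULATIVE_FIELDS = (
    "countStaleConfigErrors",
    "totalRefreshWaitTimeMicros",
    "countIncrementalRefreshesStarted",
    "countFullRefreshesStarted",
    "countFailedRefreshes",
)

# Fields described above as tracking a current value: read the latest sample.
CURRENT_VALUE_FIELDS = (
    "numDatabaseEntries",
    "numCollectionEntries",
    "numActiveIncrementalRefreshes",
    "numActiveFullRefreshes",
)


def catalog_cache_window(before, after):
    """Summarize catalogCache activity between two samples (hypothetical helper)."""
    return {
        "delta": {name: after[name] - before[name] for name in CUMULATIVE_FIELDS},
        "current": {name: after[name] for name in CURRENT_VALUE_FIELDS},
    }


# Made-up sample values, not taken from a real server.
earlier = {
    "numDatabaseEntries": 3, "numCollectionEntries": 10,
    "countStaleConfigErrors": 5, "totalRefreshWaitTimeMicros": 40000,
    "numActiveIncrementalRefreshes": 0, "countIncrementalRefreshesStarted": 7,
    "numActiveFullRefreshes": 0, "countFullRefreshesStarted": 2,
    "countFailedRefreshes": 1,
}
later = {
    "numDatabaseEntries": 3, "numCollectionEntries": 12,
    "countStaleConfigErrors": 9, "totalRefreshWaitTimeMicros": 95000,
    "numActiveIncrementalRefreshes": 1, "countIncrementalRefreshesStarted": 10,
    "numActiveFullRefreshes": 0, "countFullRefreshesStarted": 3,
    "countFailedRefreshes": 1,
}
window = catalog_cache_window(earlier, later)
```

A jump in the totalRefreshWaitTimeMicros delta during a window that also shows a throughput stall would be the kind of correlation the ticket aims to make visible in FTDC.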

is duplicated by

SERVER-11784 better migration stats

Closed

SERVER-29788 Log moveChunk counts in changelog

Closed

Assignee: Kaloian Manassiev
Reporter: Kaloian Manassiev
Participants: Githook User, Kaloian Manassiev
Votes: 1
Watchers: 11

Created: Apr 07 2017 02:35:52 PM UTC
Updated: Jun 06 2018 04:38:23 PM UTC
Resolved: Jan 24 2018 06:49:18 PM UTC
