Core Server / SERVER-28670

Add sharding metadata refresh metrics section to serverStatus

    Details

    • Type: Improvement
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Fixed
    • Affects Version/s: 3.5.5
    • Fix Version/s: 3.4.15, 3.6.4, 3.7.2
    • Component/s: Sharding
    • Labels:
    • Backwards Compatibility:
      Fully Compatible
    • Backport Requested:
      v3.6, v3.4
    • Sprint:
      Sharding 2018-01-29
    • Case:

      Description

      Sharding metadata refreshes have occasionally been demonstrated to cause throughput stalls. Currently, there is no visibility into when these are happening other than looking at the server log and trying to match them with FTDC data.

      In order to improve diagnosability we should introduce metadata refresh metrics to serverStatus so they can also be recorded in FTDC. All the proposed metrics should be under a section called shardingStatistics and will behave like this:

      • shardingStatistics
        • countStaleConfigErrors - Counts how many times threads hit a stale config exception (which is what triggers metadata refreshes)
        • countDonorMoveChunkStarted - Cumulative, always-increasing counter of how many chunks this node started donating (whether they succeeded or not)
        • totalDonorMoveChunkTimeMillis - Cumulative, always-increasing counter of how much time the entire move chunk operation took (excluding range deletion)
        • totalDonorChunkCloneTimeMillis - Cumulative, always-increasing counter of how much time the clone phase took on the donor node, before it was appropriate to enter the critical section
        • totalCriticalSectionCommitTimeMillis - Cumulative, always-increasing counter of how much time the critical section's commit phase took. This is the period of time when all operations on the collection are blocked, not just the reads (from 3.6 onward).
        • totalCriticalSectionTimeMillis - Cumulative, always-increasing counter of how much time the entire critical section took. It includes the time the recipient took to fetch the latest modifications from the donor and persist them, plus the critical section commit time. The value of totalCriticalSectionTimeMillis - totalCriticalSectionCommitTimeMillis gives the duration of the catch-up phase of the critical section (where the latest modifications are transferred from the donor to the recipient).
      • shardingStatistics.catalogCache
        • numDatabaseEntries - Tracks how many database entries in total are currently in the catalog cache
        • numCollectionEntries - Tracks how many collection entries (in total across all databases) are currently in the catalog cache
        • countStaleConfigErrors - Counts how many times threads hit stale config exception (which is what triggers metadata refreshes)
        • totalRefreshWaitTimeMicros - Cumulative, always-increasing counter of how much time threads have spent waiting for a refresh, combined across all threads
        • numActiveIncrementalRefreshes - Tracks how many incremental refreshes are currently waiting to complete
        • countIncrementalRefreshesStarted - Cumulative, always-increasing counter of how many incremental refreshes have been kicked off
        • numActiveFullRefreshes - Tracks how many full refreshes are currently waiting to complete
        • countFullRefreshesStarted - Cumulative, always-increasing counter of how many full refreshes have been kicked off
        • countFailedRefreshes - Cumulative, always-increasing counter of how many full or incremental refreshes failed for whatever reason
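
      Because most of the proposed fields are cumulative counters, a consumer would typically diff two samples to see what happened in an interval. The following is a minimal Python sketch of that idea, assuming the section is shaped exactly as listed above. The helper names (counter_deltas, catch_up_millis, summarize_interval) and the output keys are hypothetical and not part of the server; only the metric names come from the proposal.

```python
from typing import Dict, Mapping, Sequence

# Fields the proposal describes as cumulative, always-increasing counters.
MIGRATION_COUNTERS = (
    "countDonorMoveChunkStarted",
    "totalDonorMoveChunkTimeMillis",
    "totalDonorChunkCloneTimeMillis",
    "totalCriticalSectionCommitTimeMillis",
    "totalCriticalSectionTimeMillis",
)
CATALOG_CACHE_COUNTERS = (
    "totalRefreshWaitTimeMicros",
    "countIncrementalRefreshesStarted",
    "countFullRefreshesStarted",
    "countFailedRefreshes",
)


def counter_deltas(
    prev: Mapping[str, int], curr: Mapping[str, int], keys: Sequence[str]
) -> Dict[str, int]:
    """Change of each cumulative counter between two samples."""
    return {key: curr[key] - prev[key] for key in keys}


def catch_up_millis(stats: Mapping[str, int]) -> int:
    """Catch-up phase duration: whole critical section minus its commit phase."""
    return (
        stats["totalCriticalSectionTimeMillis"]
        - stats["totalCriticalSectionCommitTimeMillis"]
    )


def summarize_interval(
    prev: Mapping[str, object], curr: Mapping[str, object]
) -> Dict[str, int]:
    """Summarize activity between two shardingStatistics samples.

    Both arguments are assumed to be the shardingStatistics document of a
    serverStatus reply, including its catalogCache sub-document.
    """
    migration = counter_deltas(prev, curr, MIGRATION_COUNTERS)
    cache = counter_deltas(
        prev["catalogCache"], curr["catalogCache"], CATALOG_CACHE_COUNTERS
    )
    return {
        "migrationsStarted": migration["countDonorMoveChunkStarted"],
        "criticalSectionCommitMillis": migration[
            "totalCriticalSectionCommitTimeMillis"
        ],
        # The subtraction also holds for deltas, since both terms are sums.
        "criticalSectionCatchUpMillis": catch_up_millis(migration),
        "refreshesStarted": cache["countIncrementalRefreshesStarted"]
        + cache["countFullRefreshesStarted"],
        "refreshesFailed": cache["countFailedRefreshes"],
        "refreshWaitMicros": cache["totalRefreshWaitTimeMicros"],
    }
```

      The gauge-style fields (numDatabaseEntries, numCollectionEntries, numActiveIncrementalRefreshes, numActiveFullRefreshes) are left out of the diff on purpose, since they describe the current state rather than an accumulated total. With PyMongo, the input documents could presumably be obtained from client.admin.command("serverStatus")["shardingStatistics"] on a build that includes this change.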
