[SERVER-29189] Top and op latency histogram counters are incorrect for getMore on an awaitData cursor Created: 12/May/17  Updated: 14/Jun/17  Resolved: 14/Jun/17

Status: Closed
Project: Core Server
Component/s: Diagnostics, Querying
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Storch Assignee: David Storch
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-29304 Exclude time spent blocking for await... Closed
Related
Operating System: ALL
Steps To Reproduce:

The following script demonstrates the behavioral differences between top, the per-collection latency histograms, and the global latency histogram. Note that both the top command and the per-collection latency histograms are affected by this issue.

(function() {
    "use strict";
 
    let c = db.c;
    c.drop();
    assert.commandWorked(db.createCollection("c", {capped: true, size: 1024}));
    assert.writeOK(db.c.insert({_id: 1}));
    assert.writeOK(db.c.insert({_id: 2}));
    assert.writeOK(db.c.insert({_id: 3}));
 
    function getGlobalLatencyStats() {
        return db.serverStatus().opLatencies.reads;
    }
 
    function getCollectionLatencyStats(coll) {
        return coll.latencyStats().next().latencyStats.reads;
    }
 
    function getTop(coll) {
        return db.adminCommand({top: 1}).totals[coll.getFullName()].getmore;
    }
 
    // Global latency histogram from serverStatus.
    let oldGlobalLatency = getGlobalLatencyStats();
    assert.eq(3, db.c.find().tailable(true).itcount());
    let newGlobalLatency = getGlobalLatencyStats();
 
    let globalLatencyOpsDelta = newGlobalLatency.ops - oldGlobalLatency.ops;
    print("global latency hist ops delta: " + globalLatencyOpsDelta);
    let globalLatencyDelta = newGlobalLatency.latency - oldGlobalLatency.latency;
    print("global latency hist latency delta (micros): " + globalLatencyDelta);
 
    // Per-collection latency histogram.
    let oldCollLatency = getCollectionLatencyStats(db.c);
    assert.eq(3, db.c.find().tailable(true).itcount());
    let newCollLatency = getCollectionLatencyStats(db.c);
 
    let collLatencyOpsDelta = newCollLatency.ops - oldCollLatency.ops;
    print("collection latency hist ops delta: " + collLatencyOpsDelta);
    let collLatencyDelta = newCollLatency.latency - oldCollLatency.latency;
    print("collection latency hist latency delta (micros): " + collLatencyOpsDelta);
 
    // Top.
    let oldTop = getTop(db.c);
    assert.eq(3, db.c.find().tailable(true).itcount());
    let newTop = getTop(db.c);
 
    let topOpsDelta = newTop.count - oldTop.count;
    print("top ops delta: " + topOpsDelta);
    let topLatencyDelta = newTop.time - oldTop.time;
    print("top latency delta (micros): " + topOpsDelta);
}());

Sprint: Query 2017-06-19
Participants:

 Description   

The getMore path uses the AutoGetCollectionForReadCommand RAII helper, which on destruction updates Top and the appropriate per-collection latency histogram. It updates the stats by

  1. bumping the number of recorded reads, and
  2. recording the wall clock time between construction and destruction of the RAII object.

For getMore on awaitData cursors, however, we release the AutoGetCollectionForReadCommand while blocking. This is to ensure that we do not hold collection locks while sleeping on a condition variable for potentially long periods of time. After waking up, the AutoGetCollectionForReadCommand is reacquired. The result is that we double-count each getMore operation in the Top diagnostic output. Furthermore, the time spent blocking for awaitData is not recorded. This is inconsistent with the global operation latency histogram reported by serverStatus(), which incorporates time spent blocking for awaitData.



 Comments   
Comment by David Storch [ 14/Jun/17 ]

This was fixed by SERVER-29304 and is now tested by getmore_awaitdata_opcounters.js. Resolving as a duplicate.

Comment by David Storch [ 19/May/17 ]

The changes related to reporting of latency for getMore on an awaitData cursor will be handled by related ticket SERVER-29304. Therefore, the scope of this ticket is now narrowed to cover the op counters problem only.

Generated at Thu Feb 08 04:20:08 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.