[SERVER-29435] Latencies in Top exclude lock acquisition time Created: 02/Jun/17  Updated: 14/Jun/17  Resolved: 14/Jun/17

Status: Closed
Project: Core Server
Component/s: Diagnostics
Affects Version/s: None
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: David Storch Assignee: David Storch
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-29304 Exclude time spent blocking for await... Closed
Operating System: ALL
Sprint: Query 2017-06-19
Participants:

 Description   

Due to the introduction of the AutoStatsTracker for reporting latencies to Top in SERVER-22541 (in particular, see commit 584ca76de9e), the latencies reported to Top now exclude time spent acquiring locks. The issue can be demonstrated by applying the following patch, which adds a sleep to simulate a very long lock acquisition:

diff --git a/src/mongo/db/db_raii.cpp b/src/mongo/db/db_raii.cpp
index 9098e3b..977d291 100644
--- a/src/mongo/db/db_raii.cpp
+++ b/src/mongo/db/db_raii.cpp
@@ -107,6 +107,9 @@ AutoStatsTracker::~AutoStatsTracker() {
 AutoGetCollectionForRead::AutoGetCollectionForRead(OperationContext* opCtx,
                                                    const NamespaceString& nss,
                                                    AutoGetCollection::ViewMode viewMode) {
+    // Simulate a 2 second lock acquisition.
+    sleepmillis(2000);
+
     _autoColl.emplace(opCtx, nss, MODE_IS, MODE_IS, viewMode);
 
     // Note: this can yield.

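The ordering problem can be sketched in isolation. This is a simplified stand-in, not MongoDB's actual classes: the function names, sleep durations, and the explicit steady_clock timestamps are all hypothetical, standing in for the AutoStatsTracker/Top timer and the CurOp timer respectively.

```cpp
#include <chrono>
#include <thread>

using Clock = std::chrono::steady_clock;

static long long elapsedMillis(Clock::time_point start) {
    return std::chrono::duration_cast<std::chrono::milliseconds>(
               Clock::now() - start).count();
}

// Buggy ordering: the Top timer starts *after* the (simulated) lock
// acquisition, so the reported latency covers only the query work.
long long buggyTopLatency() {
    std::this_thread::sleep_for(std::chrono::milliseconds(200));  // lock wait
    auto topTimerStart = Clock::now();   // AutoStatsTracker's timer starts here
    std::this_thread::sleep_for(std::chrono::milliseconds(10));   // query work
    return elapsedMillis(topTimerStart);
}

// CurOp-style ordering: the timer starts before lock acquisition,
// so the reported latency includes the lock wait.
long long curOpLatency() {
    auto curOpTimerStart = Clock::now(); // CurOp::ensureStarted() equivalent
    std::this_thread::sleep_for(std::chrono::milliseconds(200));  // lock wait
    std::this_thread::sleep_for(std::chrono::milliseconds(10));   // query work
    return elapsedMillis(curOpTimerStart);
}
```

With a 200ms simulated lock wait, the first function reports on the order of 10ms while the second reports at least 210ms, mirroring the gap between the $collStats numbers and the slow query log line below.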
Then run the following:

> db.c.drop()
true
> db.c.insert({_id: 1})
WriteResult({ "nInserted" : 1 })
> db.c.aggregate([{$collStats: {latencyStats: {}}}]).pretty()
{
	"ns" : "test.c",
	"localTime" : ISODate("2017-06-02T21:17:22.267Z"),
	"latencyStats" : {
		"reads" : {
			"latency" : NumberLong(163),
			"ops" : NumberLong(2)
		},
		"writes" : {
			"latency" : NumberLong(143438),
			"ops" : NumberLong(1)
		},
		"commands" : {
			"latency" : NumberLong(0),
			"ops" : NumberLong(0)
		}
	}
}
> db.c.find()
{ "_id" : 1 }
> db.c.aggregate([{$collStats: {latencyStats: {}}}]).pretty()
{
	"ns" : "test.c",
	"localTime" : ISODate("2017-06-02T21:17:35.243Z"),
	"latencyStats" : {
		"reads" : {
			"latency" : NumberLong(401),
			"ops" : NumberLong(4)
		},
		"writes" : {
			"latency" : NumberLong(143438),
			"ops" : NumberLong(1)
		},
		"commands" : {
			"latency" : NumberLong(0),
			"ops" : NumberLong(0)
		}
	}
}

You should observe that the find commands take a few seconds to return, but the cumulative read latency reported in $collStats only increases by tens or hundreds of microseconds. Note that the latency reported in slow query log lines is not affected by this bug. Reusing the example above, the log line for the find command reports a latency of 2000ms, which is dominated by the simulated 2-second lock acquisition:

2017-06-02T17:18:01.340-0400 I COMMAND  [conn1] command test.c appName: "MongoDB Shell" command: find { find: "c", filter: {}, $db: "test" } planSummary: COLLSCAN keysExamined:0 docsExamined:1 cursorExhausted:1 numYields:0 nreturned:1 reslen:100 locks:{ Global: { acquireCount: { r: 2 } }, Database: { acquireCount: { r: 1 } }, Collection: { acquireCount: { r: 1 } } } protocol:op_msg 2000ms



 Comments   
Comment by David Storch [ 14/Jun/17 ]

I ended up fixing this as part of the change for SERVER-29304. The fix was to use the CurOp timer for Top, so that our various debug mechanisms that report latencies all use the same timer. The CurOp timer is now used by the slow query logs, the profiler, db.currentOp(), Top, and the operation latency histograms. I'm resolving this ticket as a duplicate.
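The shape of the fix described above can be sketched as follows. This is a hypothetical simplification, not MongoDB's actual code: `CurOpSketch`, `recordToTop`, and `recordToSlowLog` are invented names illustrating the idea of one operation-scoped timer read by every reporting mechanism.

```cpp
#include <chrono>
#include <thread>

// One timer per operation, started when the operation begins (before any
// lock acquisition). All reporters read elapsed time from this one source.
struct CurOpSketch {
    std::chrono::steady_clock::time_point start =
        std::chrono::steady_clock::now();

    long long elapsedMillis() const {
        return std::chrono::duration_cast<std::chrono::milliseconds>(
                   std::chrono::steady_clock::now() - start).count();
    }
};

// Each reporting mechanism consults the same timer, so Top, the slow query
// log, the profiler, etc. can no longer disagree about an op's latency.
long long recordToTop(const CurOpSketch& op) { return op.elapsedMillis(); }
long long recordToSlowLog(const CurOpSketch& op) { return op.elapsedMillis(); }
```

Because both reporters derive their value from the same start point, lock acquisition time is included (or excluded) consistently everywhere.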

Comment by David Storch [ 05/Jun/17 ]

charlie.swanson, pretty sure, yeah. Unfortunately CurOp::ensureStarted() is used for reporting latency in the log lines, but we have a separate Timer instance for reporting latencies in Top. Prior to this patch, the Timer was constructed before the AutoGetCollectionForRead.

I'm not sure if there is a good reason why we can't use the same timing code for both CurOp and Top, but I wasn't planning on changing it as part of fixing this issue.

Comment by Charlie Swanson [ 05/Jun/17 ]

david.storch are you sure it was that commit? It looks like both before and after that commit we acquire the collection lock before calling CurOp::ensureStarted(), which, as far as I can tell, is responsible for the latency reporting?
