[SERVER-33156] Report lock statistics per collection in collstats Created: 07/Feb/18  Updated: 06/Dec/22  Resolved: 16/Mar/18

Status: Closed
Project: Core Server
Component/s: Diagnostics, Storage
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: David Bartley Assignee: Backlog - Storage Execution Team
Resolution: Done Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: inprog.txt, serverStatus_locks.txt
Issue Links:
Related
related to SERVER-33155 Export/report lock-held time statistics Closed
Assigned Teams:
Storage Execution
Participants:

 Description   
Original Summary

Export/report per-collection/per-database lock statistics

Original Description

Global statistics are provided in serverStatus, but it would be even more useful if these were per-collection and per-database, particularly coupled with SERVER-33155. I can provide a patch that we've been running in production for several months if useful.
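For reference, the existing global-only view that this ticket proposes to break down per database and per collection can be pulled from the mongo shell as below (a minimal sketch; the exact fields vary by server version and storage engine):

> db.serverStatus().locks             // acquireCount / acquireWaitCount / timeAcquiringMicros per lock type
> db.serverStatus().locks.Collection  // a single "Collection" bucket, not split out per collection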



 Comments   
Comment by Ian Whalen (Inactive) [ 16/Mar/18 ]

Resolving this ticket as it has been in Waiting For User Input for 2 weeks.

Comment by Geert Bosch [ 28/Feb/18 ]

Hi bartle,

Rereading this ticket and in particular SERVER-33155, it appears that you are asking for lock-held times as a proxy for operation times. While we don't record lock-held times, we do record operation latency statistics, and these can be queried using the latencyStats MongoDB shell helper:

> db.x.latencyStats().next()
{
	"ns" : "test.x",
	"host" : "geertsmbp.local:27017",
	"localTime" : ISODate("2018-02-28T16:06:38.248Z"),
	"latencyStats" : {
		"reads" : {
			"latency" : NumberLong(5875),
			"ops" : NumberLong(20)
		},
		"writes" : {
			"latency" : NumberLong(34409325),
			"ops" : NumberLong(100001)
		},
		"commands" : {
			"latency" : NumberLong(37526),
			"ops" : NumberLong(3)
		}
	}
}

or directly using the $collStats aggregation stage:

> db.x.aggregate([{$collStats: {latencyStats: {histograms: true}}}]).next()
{
	"ns" : "test.x",
	"host" : "geertsmbp.local:27017",
	"localTime" : ISODate("2018-02-28T16:11:15.412Z"),
	"latencyStats" : {
		"reads" : {
			"histogram" : [
				{
					"micros" : NumberLong(128),
					"count" : NumberLong(2)
				},
				{
					"micros" : NumberLong(256),
					"count" : NumberLong(19)
				}
			],
			"latency" : NumberLong(6151),
			"ops" : NumberLong(21)
		},
		"writes" : {
			"histogram" : [
				{
					"micros" : NumberLong(256),
					"count" : NumberLong(97484)
				},
				{
					"micros" : NumberLong(512),
					"count" : NumberLong(2498)
				},
				{
					"micros" : NumberLong(1024),
					"count" : NumberLong(12)
				},
				{
					"micros" : NumberLong(2048),
					"count" : NumberLong(3)
				},
				{
					"micros" : NumberLong(3072),
					"count" : NumberLong(2)
				},
				{
					"micros" : NumberLong(4096),
					"count" : NumberLong(1)
				},
				{
					"micros" : NumberLong(6144),
					"count" : NumberLong(1)
				}
			],
			"latency" : NumberLong(34409325),
			"ops" : NumberLong(100001)
		},
		"commands" : {
			"histogram" : [
				{
					"micros" : NumberLong(4096),
					"count" : NumberLong(1)
				},
				{
					"micros" : NumberLong(6144),
					"count" : NumberLong(1)
				},
				{
					"micros" : NumberLong(24576),
					"count" : NumberLong(1)
				}
			],
			"latency" : NumberLong(37526),
			"ops" : NumberLong(3)
		}
	}
}

In all cases the metrics are cumulative, so you can get statistics over a given period by computing the difference between the results at the beginning and the end of the period.
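For example, one could take two samples and subtract them in the shell (an illustrative sketch only; deltaLatencyStats is a helper made up for this example):

function deltaLatencyStats(before, after) {
	var delta = {};
	["reads", "writes", "commands"].forEach(function(cat) {
		delta[cat] = {
			latency: after.latencyStats[cat].latency - before.latencyStats[cat].latency,
			ops: after.latencyStats[cat].ops - before.latencyStats[cat].ops
		};
	});
	return delta;
}

var before = db.x.latencyStats().next();
sleep(60 * 1000);                               // measure over one minute
var after = db.x.latencyStats().next();
deltaLatencyStats(before, after);               // latency/ops accumulated during that minute

Note that the shell coerces the NumberLong counters to doubles for the subtraction, which is fine for counters of this magnitude.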

I believe this should address your use case, but please let us know if you need additional information.

-Geert

Comment by Ian Whalen (Inactive) [ 23/Feb/18 ]

geert.bosch to investigate complexity of implementation in next sprint.

Comment by Kelsey Schubert [ 21/Feb/18 ]

Hi bartle,

I've updated the ticket summary to better describe this improvement following discussion on this ticket and SERVER-33155. Please continue to watch it for updates.

Kind regards,
Kelsey

Comment by David Bartley [ 14/Feb/18 ]

I've attached output from the "locks" section of serverStatus and the output for a single entry from inprog.

We also get lock stats in the logs:

[2018-02-14 21:05:34.751335] 2018-02-14T21:05:34.750+0000 I COMMAND  [conn87875] command test.test command: insert { insert: "test", documents: [ { _id: ObjectId('5a84a49e1aa7e58261946b87'), foo: "aldsj" } ], ordered: true } ninserted:1 keyUpdates:0 writeConflicts:0 numYields:0 reslen:65 locks:{ Global.ParallelBatchWriterMode: { acquireCount: { r: 2 }, timeHeldMicros: { r: 153 } }, Global.Global: { acquireCount: { w: 2 }, timeHeldMicros: { w: 150 } }, Global: { acquireCount: { r: 2, w: 2 }, timeHeldMicros: { r: 153, w: 150 } }, GlobalLockStatsCount: 2, MMAPV1JournalLockStatsCount: 0, Database.test: { acquireCount: { w: 1 }, timeHeldMicros: { w: 148 } }, Database.local: { acquireCount: { w: 1 }, timeHeldMicros: { w: 100 } }, Database: { acquireCount: { w: 2 }, timeHeldMicros: { w: 248 } }, DatabaseLockStatsCount: 2, Collection.test.test: { acquireCount: { w: 1 }, timeHeldMicros: { w: 144 } }, Collection: { acquireCount: { w: 1 }, timeHeldMicros: { w: 144 } }, CollectionLockStatsCount: 1, Metadata.11529215046068469765: { acquireCount: { w: 1 }, timeHeldMicros: { w: 96 } }, Metadata: { acquireCount: { w: 1 }, timeHeldMicros: { w: 96 } }, MetadataLockStatsCount: 1, ResourceIdCount: 37, oplog: { acquireCount: { w: 1 }, timeHeldMicros: { w: 99 } } } protocol:op_command 179µs

Comment by David Bartley [ 14/Feb/18 ]

We have O(1000s) of collections and O(100s) of databases, so it's been OK for our purposes. I think it'd be fine not to include the expanded lock stats in serverStatus by default (or to cap the output somehow), but to emit them in collStats and dbStats.
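For illustration, per-collection output could then be gathered by iterating the collections in a database rather than inflating serverStatus. A sketch using the existing latencyStats option of $collStats; per-collection lock statistics, if added, would presumably be requested the same way:

var perCollection = {};
db.getCollectionNames().forEach(function(name) {
	var stats = db.getCollection(name)
	              .aggregate([{$collStats: {latencyStats: {}}}])
	              .next();
	perCollection[stats.ns] = stats.latencyStats;
});
perCollection;  // one entry per collection; avoids one huge serverStatus document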

Comment by Ramon Fernandez Marina [ 14/Feb/18 ]

bartle, adding per-collection or per-database metrics in serverStatus has the potential to make the output very large, which can be undesirable and may have negative ramifications. Perhaps if they were not provided by default, but only under a flag, that could be a reasonable addition. Do you have a sample output of these stats? Also, have you tested what happens when there are millions of databases/collections present (e.g., when the size of the output exceeds 16MB)?

Thanks,
Ramón.
