[SERVER-39934] CurOp::completeAndLogOperation should not hang waiting for global lock Created: 01/Mar/19  Updated: 29/Oct/23  Resolved: 21/Mar/19

Status: Closed
Project: Core Server
Component/s: Concurrency
Affects Version/s: None
Fix Version/s: 4.0.9, 4.1.10

Type: Bug Priority: Major - P3
Reporter: Matthew Russotto Assignee: Sulabh Mahajan
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Backports
Depends
Problem/Incident
is caused by SERVER-39361 Synchronise collecting storage engine... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v4.0
Sprint: Storage Engines 2019-03-11, Storage Engines 2019-03-25
Participants:
Linked BF Score: 32
Story Points: 8

 Description   

When logging a command (either slow or when forced logging is enabled) CurOp::completeAndLogOperation attempts to take a global lock to obtain storage statistics. If something else has a Global X lock (or an enqueued Global X lock), this lock acquisition will stall behind that operation. This introduces an undesirable dependency on the global lock for otherwise lock-free operations such as $currentOp.

We should give this acquisition a very short deadline and elide the storage stats when it is not available.



 Comments   
Comment by Githook User [ 03/Apr/19 ]

Author:

{'email': 'sulabh.mahajan@mongodb.com', 'name': 'Sulabh Mahajan', 'username': 'sulabhM'}

Message: SERVER-38239 Added getOperationStatistics() API to fetch storage stats from WiredTiger

(cherry picked from commit ba3894493a94ed3c18458f391ff181d57475f010)

SERVER-38240 Added storage statistics information into the slowop log messages.

(cherry picked from commit 86b6aca9fa1940e85bba87261d1494ef2c208a4a)

SERVER-38240 work around uint64_t conversion on s390x in WiredTigerOperationStats::fetchStats()

(cherry picked from commit 82161eec79fc74652dc07b1c83fe500dc4f95e79)

SERVER-38243 Test presence of the storage stats in slowop logs and system.profile.

(cherry picked from commit 3b4c6a689a3fdaa923d427ae112ea599513ef8ce)

SERVER-39026 Use correct type for retrieving WiredTiger stats

(cherry picked from commit 6a9a5855048df1f4796a4032276d01318c398691)

SERVER-39488 Look for storage stats in the find command's profiled entry.

The find command will always have the storage stats because the
test has created a collection with btree spanning multiple pages. Scanning
this collection after a server restart will trigger read from the disk
and have the disk read stats.

(cherry picked from commit 6125b5fb078a316854f0299b96b7d16eacb944de)

SERVER-39361 Synchronise collecting storage engine stats with shutdown

(cherry picked from commit bacb6b67706a2c057fcd0f76a38f416b225aa69a)

SERVER-39061 Fix the wt_operation_stats test to wait for the operation log to appear.

(cherry picked from commit 371197e4bab715a83272a4472e118ee5c5cbbf7c)

SERVER-39934 Fix locking for slow ops storage stats
SERVER-39654 Make slow ops storage stats work with transactions

(cherry picked from commit 23ca771f76f85638f23bca2a4a6ac196a81fdc21)
Branch: v4.0
https://github.com/mongodb/mongo/commit/57edce7396271531dde4499458b22c9cde1f03d4

Comment by Sulabh Mahajan [ 25/Mar/19 ]

bruce.lucas, this will impact slow operation logging and profiler. No impact to serverStatus.

Comment by Bruce Lucas (Inactive) [ 25/Mar/19 ]

sulabh.mahajan, can you confirm that this impacts only currentOp and slow operation logging, and not serverStatus? I want to make sure we aren't doing anything that might create frequent schema changes in serverStatus, which would have a bad effect on ftdc compression efficiency.

Comment by Githook User [ 21/Mar/19 ]

Author:

{'email': 'sulabh.mahajan@mongodb.com', 'name': 'Sulabh Mahajan', 'username': 'sulabhM'}

Message: SERVER-39934 Fix locking for slow ops storage stats
SERVER-39654 Make slow ops storage stats work with transactions
Branch: master
https://github.com/mongodb/mongo/commit/23ca771f76f85638f23bca2a4a6ac196a81fdc21

Comment by Eric Milkie [ 05/Mar/19 ]

Yeah, that tracker sounds better to use.

Comment by Matthew Russotto [ 05/Mar/19 ]

I believe lockerInfo won't tell you about locks which have been released, so you'll end up missing the storage engine stats more often than necessary.

There's also the GlobalAcqusitionLockTracker, which tells if a global lock has ever been acquired in X, IX, or S (even if subsequently released); you could add IS to that and use it.

Comment by Eric Milkie [ 05/Mar/19 ]

matthew.russotto, I don't think we should use the proposed solution for the problem as presented. I believe if we remove the global lock acquisition for otherwise lock-free operations, that would solve your issue, correct?
I think we could do this by looking in the lockerInfo to see if the vector of locks has a size greater than 0. If it is zero, we will skip fetching storage engine stats completely.

Comment by Bruce Lucas (Inactive) [ 04/Mar/19 ]

Can you clarify what user-visible content (e.g. log lines, currentOp output, serverStatus, FTDC output) is impacted, and what exactly "storage stats" refers to (ideally in terms of the affected user-visible artifacts)?

Comment by Kelsey Schubert [ 04/Mar/19 ]

It's not ideal, especially since when things are slow that's generally when you most want diagnostics. But, I'm not sure I have a better idea. It sounds like this is our standard way of resolving this issue until PM-1242 lands. bruce.lucas, feel free to chime in if I'm missing anything.

Comment by Judah Schvimer [ 04/Mar/19 ]

kelsey.schubert, is this change acceptable?

Generated at Thu Feb 08 04:53:33 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.