[SERVER-47998] Operations that time out from maxTimeMS before constructing an AutoStatsTracker don't get profiled Created: 06/May/20 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Querying |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Spencer Brody (Inactive) | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 0 |
| Labels: | qexec-team | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Assigned Teams: |
Query Execution
|
|||||||||||||||||||||||
| Operating System: | ALL | |||||||||||||||||||||||
| Steps To Reproduce: |
Notice that only the first two inserts show up in system.profile (the inserts of {a:1} and {b:1}, which both complete successfully). The insert of {c:1}, which times out, doesn't show up in the profiler. |
|||||||||||||||||||||||
| Sprint: | Query 2020-06-01 | |||||||||||||||||||||||
| Participants: |
| Comments |
| Comment by David Storch [ 03/Jun/20 ] |
|
After investigation, this is not a duplicate but a novel issue report for a bug that has likely existed for quite some time. The good news is that the bug is not as severe as the original title implies. Generally speaking, operations which fail due to maxTimeMS expiration in fact do get profiled most of the time. However, this ticket demonstrates a scenario in which they can fail to get profiled. The profiling implementation involves reading the user's configured profiling level out of the catalog's Database object and then passing it to CurOp via CurOp::raiseDbProfileLevel(). It is the CurOp object's profiling level that is later consulted when determining whether to write to the system.profile collection. The call to raiseDbProfileLevel() for most commands happens around the same time that locks are acquired. This is one of the purposes of the AutoStatsTracker RAII helper which is acquired alongside locks as part of AutoGetCollectionForReadCommand. However, there are interrupt checks early in the command dispatch path before the AutoStatsTracker is created. If maxTimeMS expires at this early point, then the CurOp is left without knowledge of the profiling level, and the command fails to be profiled. In practice, most commands are likely to time out after they have done real work. So encountering this issue in a real deployment is unlikely, which could explain why the issue was not previously discovered. I'm returning this ticket to the "Needs Scheduling" queue so that it can be triaged by the query team. I recommend either a "Backlog" fixVersion or a "Won't Fix" resolution. |
| Comment by Craig Homa [ 12/May/20 ] |
|
Moving into investigating as we think there is a dupe out there for this. |
| Comment by Spencer Brody (Inactive) [ 06/May/20 ] |
|
I have not since it's tricky to trigger without the failpoint. But the failpoint is pretty well targeted to exactly the place we normally check for timeout, so I'd be surprised if it's the culprit. |
| Comment by Charlie Swanson [ 06/May/20 ] |
|
spencer have you observed this without using the failpoint? I'm wondering if this is just a problem where the failpoint is too early in the codepath. |
| Comment by Spencer Brody (Inactive) [ 06/May/20 ] |
|
Repro'd on master (4.5.1-pre), haven't tried on any other branches |