[SERVER-35967] $sample with explain(true) hangs Created: 05/Jul/18  Updated: 29/Oct/23  Resolved: 30/Jul/18

Status: Closed
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: 3.6.0
Fix Version/s: 3.6.7

Type: Bug Priority: Major - P3
Reporter: Josef Ahmad Assignee: Ian Boros
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-29421 Aggregation executionStats or allPlan... Closed
is related to SERVER-35973 Explain execution stats for $sample a... Closed
is related to SERVER-19758 Add "executionStats" and "allPlansExe... Closed
Backwards Compatibility: Fully Compatible
Operating System: ALL
Steps To Reproduce:

db.c.drop()
for (i = 0; i < 1000; i++ ) { db.c.insertOne({size:1})}
db.c.explain(true).aggregate({$sample: {size: 10}})

Sprint: Query 2018-07-16, Query 2018-07-30, Query 2018-08-13
Participants:

 Description   

Observed in v3.6. Unable to reproduce on v3.4.

$sample command never returns when run with explain(true). db.currentOps() shows:

   "planSummary":  "MULTI_ITERATOR",

   "numYields": 18619890,

numYields continuously increases.



 Comments   
Comment by Githook User [ 30/Jul/18 ]

Author:

{'name': 'Ian Boros', 'email': 'ian.boros@10gen.com'}

Message: SERVER-35967 Ban explain() on $sample when storage engine supports random cursors
Branch: v3.6
https://github.com/mongodb/mongo/commit/9c3f90d24a36f07072a45dbc8c4694dbe1fc664b

Comment by David Storch [ 05/Jul/18 ]

One more note: In the process of investigating this issue, I found a separate problem with explaining $sample that affects 4.0 and master. See SERVER-35973.

Comment by David Storch [ 05/Jul/18 ]

After investigating, I can confirm that this problem was introduced by SERVER-19758, which added "executionStats" and "allPlansExecution" explain modes for aggregation explain. I can also confirm that it was fixed by SERVER-29421, meaning that this issue affects only the 3.6 branch. The problem occurs when the explain code attempts to run its underlying PlanExecutor in order to gather execution statistics:

https://github.com/mongodb/mongo/blob/6405d65b1d6432e138b44c13085d0c2fe235d6bd/src/mongo/db/pipeline/document_source_cursor.cpp#L190

This PlanExecutor encapsulates an execution plan which simply reads from a random cursor—a RecordCursor obtained by calling RecordStore::getRandomCursor(). Existing random cursor implementations sample via a pseudorandom btree walk. They also never return EOF, but rather will sample infinitely (with replacement) as a caller iterates the cursor. The explain code is doing just this—in its attempt to gather stats from the cursor, it is stuck in an infinite loop here checking for an EOF that will never come.

The problem is that the explain code path iterates the underlying random cursor directly, circumventing the code in DocumentSourceSampleFromRandomCursor which is responsible for duplicate elimination, and, importantly, for signaling EOF for $sample queries. SERVER-29421 fixes the problem by changing our strategy for explain execution. Specifically, this fix consolidates the explain code path with the regular execution code path, such that exec stats are gathered by running the query in much the same way as the regular, non-explain code path.

Comment by David Storch [ 05/Jul/18 ]

I can reproduce this issue on the 3.6 branch but not on the master branch. My initial guess is that this is a regression due to SERVER-19758 that was subsequently fixed by SERVER-29421. Assigning to myself to produce a more concrete root cause diagnosis.

Comment by Ramon Fernandez Marina [ 05/Jul/18 ]

I can confirm this behavior: in 3.4.4 the explain() command returns, in 3.6.5 doesn't.

Generated at Thu Feb 08 04:41:37 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.