[SERVER-31684] QueryPlanKilled (operation exceeded time limit) in $changeStream with updateLookup Created: 23/Oct/17 Updated: 30/Oct/23 Resolved: 18/Dec/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, Replication |
| Affects Version/s: | 3.6.0-rc0 |
| Fix Version/s: | 3.6.2, 3.7.1 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Yves Duhem | Assignee: | Martin Neupauer |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Attachments: |
|
||||||||||||||||||||
| Issue Links: |
|
||||||||||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||||||||||
| Operating System: | ALL | ||||||||||||||||||||
| Backport Requested: |
v3.6
|
||||||||||||||||||||
| Steps To Reproduce: |
|
||||||||||||||||||||
| Sprint: | Query 2017-11-13, Query 2017-12-04, Query 2017-12-18, Query 2018-01-01 | ||||||||||||||||||||
| Participants: | |||||||||||||||||||||
| Linked BF Score: | 0 | ||||||||||||||||||||
| Description |
|
On a 3.6.0-rc0 single node replica set, while applying a CRUD workload with 16 or more threads, a change stream using the option fullDocument: updateLookup eventually fails during a getMore with the following error:
|
| Comments |
| Comment by Githook User [ 03/Jan/18 ] |
|
Author: {'name': 'Martin Neupauer', 'username': 'MartinNeupauer', 'email': 'martin.neupauer@mongodb.com'}Message: The changestream queries used an operation context deadline to track (cherry picked from commit 962c5c61c93776aa4d1a8efb67a1a80cb3bb2ad0) |
| Comment by Githook User [ 18/Dec/17 ] |
|
Author: {'name': 'Martin Neupauer', 'email': 'martin.neupauer@mongodb.com', 'username': 'MartinNeupauer'}Message: The changestream queries used an operation context deadline to track |
| Comment by David Storch [ 15/Dec/17 ] |
|
Agree: this should be backported to 3.6. |
| Comment by Githook User [ 14/Dec/17 ] |
|
Author: {'name': 'Spencer Jackson', 'email': 'spencer.jackson@mongodb.com', 'username': 'spencerjackson'}Message: Revert " This reverts commit b79e5f04ffc79b5892f89c22b9e5f26a297b1185. |
| Comment by Charlie Swanson [ 14/Dec/17 ] |
|
I'm just gonna jump in and vote yes, I checked the backport requested box |
| Comment by James Wahlin [ 14/Dec/17 ] |
|
martin.neupauer / david.storch - should we consider backporting this to 3.6? |
| Comment by Githook User [ 12/Dec/17 ] |
|
Author: {'name': 'Martin Neupauer', 'email': 'martin.neupauer@mongodb.com', 'username': 'MartinNeupauer'}Message: The changestream queries used an operation context deadline to track |
| Comment by Martin Neupauer [ 10/Nov/17 ] |
|
almost there ... the timeout errors are gone but the test hangs. |
| Comment by Martin Neupauer [ 10/Nov/17 ] |
|
oh, in the DocumentSourceLookupChangePostImage::getNext there is pSource->getNext() call not guarded by the OperationContext::DeadlineStash. I'll move the guard to the top of getNext and see what happens. |
| Comment by Martin Neupauer [ 09/Nov/17 ] |
|
somethings not right and I have no idea what - it seems that two DocumentSourceCursors are both marked as normal (i.e. not tailable) hence all timeouts fire as expected. |
| Comment by Charlie Swanson [ 09/Nov/17 ] |
|
Ah yes, martin.neupauer - there are two DocumentSourceCursors at play here, one is the one reading the oplog and generate the change notifications, the other is the one doing the lookup for the post-image of any update operations. For that update lookup cursor, we only generate one document and so expect it to be shortly lived, and is not tailable. Further, we explicitly suspend the maxTimeMS deadline while creating/using that cursor, done here. Can you investigate why that doesn't seem to be working? It looks like somehow there is still a perceived deadline that's causing it to think it has exceeded its time limit. |
| Comment by Martin Neupauer [ 07/Nov/17 ] |
|
It turns out it's not that easy. When I copied tailableMode in the ExpressionContext::copyWith I got a different error from the repro: 2017-11-07T18:36:45.326-0500 I QUERY [thread2] js thread raised js exception: JSInterpreterFailure: Error: getMore command failed: { } |
| Comment by Martin Neupauer [ 07/Nov/17 ] |
|
And the answer turns out to be trivial. We forgot to copy tailableMode filed in the "copy constructor" ExpressionContext::copyWith |
| Comment by Martin Neupauer [ 07/Nov/17 ] |
|
So the code that should ignore the timeout looks like this: if (state == PlanExecutor::IS_EOF && pExpCtx->isTailableAwaitData()) and the definition of isTailableAwaitData is straightforward: return tailableMode == TailableMode::kTailableAndAwaitData; However, the current value of pExpCtx->tailableMode is kNormal (0). The $64k question is why? |
| Comment by Yves Duhem [ 03/Nov/17 ] |
|
The error still occurs after the fix for |
| Comment by Martin Neupauer [ 03/Nov/17 ] |
|
I've confirmed that the test fails before https://github.com/mongodb/mongo/commit/81c5724d9d11a1ff2b16937dc01e6439d9e0f266 commit and works fine afterwards. It's a dup of |
| Comment by Bernard Gorman [ 02/Nov/17 ] |
|
charlie.swanson: no, that was deliberate - I figured that the mongoD lookup would be susceptible to in-pipeline timeouts in exactly the same way as it had been on the mongoS, so it made sense to stash the deadline before the lookup no matter what context it was running in. I didn't realise we already had this ticket for an actual incident of the timeout though - nice bonus! |
| Comment by Charlie Swanson [ 02/Nov/17 ] |
|
Looks like bernard.gorman (accidentally?) fixed this when he refactored the update lookup to stash the deadline within DocumentSourceLookupChangePostImage instead of within mongos' implementation of the lookup: https://github.com/mongodb/mongo/commit/81c5724d9d11a1ff2b16937dc01e6439d9e0f266#diff-77917bcdb3b940573a2fb1311147602eL735 martin.neupauer to confirm with the reproduction script, then close out if that hypothesis is correct. |