[SERVER-31684] QueryPlanKilled (operation exceeded time limit) in $changeStream with updateLookup Created: 23/Oct/17  Updated: 30/Oct/23  Resolved: 18/Dec/17

Status: Closed
Project: Core Server
Component/s: Querying, Replication
Affects Version/s: 3.6.0-rc0
Fix Version/s: 3.6.2, 3.7.1

Type: Bug Priority: Major - P3
Reporter: Yves Duhem Assignee: Martin Neupauer
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File change_streams_repro.js    
Issue Links:
Backports
Depends
Duplicate
duplicates SERVER-31597 Refactor $changeStream post-update lo... Closed
Related
Backwards Compatibility: Fully Compatible
Operating System: ALL
Backport Requested:
v3.6
Steps To Reproduce:
  1. start a single node replica set
  2. cd into mongo repository
  3. download the attached repro script: change_streams_repro.js
  4. run the script:
    > mongo change_streams_repro.js
Sprint: Query 2017-11-13, Query 2017-12-04, Query 2017-12-18, Query 2018-01-01
Participants:
Linked BF Score: 0

 Description   

On a 3.6.0-rc0 single-node replica set, while a CRUD workload with 16 or more threads is running, a change stream opened with the option fullDocument: updateLookup eventually fails during a getMore with the following error:

2017-10-23T16:51:32.883-0400 I QUERY    [thread2] js thread raised js exception: JSInterpreterFailure: Error: getMore command failed: {
	"operationTime" : Timestamp(1508791892, 30247),
	"ok" : 0,
	"errmsg" : "collection or index disappeared when cursor yielded: ExceededTimeLimit: operation exceeded time limit",
	"code" : 175,
	"codeName" : "QueryPlanKilled",
	"$clusterTime" : {
		"clusterTime" : Timestamp(1508791892, 30249),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}
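
For reference, the change-stream side of such a workload looks roughly like this with the C++ driver. This is a sketch only, not the attached change_streams_repro.js (which uses the mongo shell); the connection string, database, and collection names below are assumptions.

#include <iostream>

#include <bsoncxx/json.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/options/change_stream.hpp>
#include <mongocxx/uri.hpp>

int main() {
    mongocxx::instance inst{};
    // Assumed connection string for a local single-node replica set.
    mongocxx::client client{mongocxx::uri{"mongodb://localhost:27017/?replicaSet=rs0"}};
    auto coll = client["change_streams"]["change_streams"];

    mongocxx::options::change_stream opts;
    opts.full_document("updateLookup");  // exercises the post-image lookup path

    auto stream = coll.watch(opts);
    while (true) {
        // Each pass issues getMore under the hood; the failure in this ticket
        // surfaces here as a QueryPlanKilled / ExceededTimeLimit error.
        for (const auto& event : stream) {
            std::cout << bsoncxx::to_json(event) << std::endl;
        }
    }
}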



 Comments   
Comment by Githook User [ 03/Jan/18 ]

Author:

{'name': 'Martin Neupauer', 'username': 'MartinNeupauer', 'email': 'martin.neupauer@mongodb.com'}

Message: SERVER-31684 Fix unexpected "operation exceeded time limit" errors

The changestream queries used an operation context deadline to track
a wait time before returning EOF. This occasionally interfered with
normal operation deadlines, leading to unexpected errors.

(cherry picked from commit 962c5c61c93776aa4d1a8efb67a1a80cb3bb2ad0)
Branch: v3.6
https://github.com/mongodb/mongo/commit/8d2276dfa94673f0ca1480fc87977a6f36c2816a
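
The mechanism described in the commit message, reduced to a standalone toy. None of these types are MongoDB source; this only illustrates how reusing a single operation-context deadline slot for the awaitData wait can masquerade as a maxTimeMS expiry.

#include <chrono>
#include <iostream>
#include <optional>

using Clock = std::chrono::steady_clock;

// Toy stand-in for an operation context with a single deadline slot.
struct ToyOperationContext {
    std::optional<Clock::time_point> deadline;
    bool exceededTimeLimit() const {
        return deadline && Clock::now() >= *deadline;
    }
};

// The tailable/awaitData wait borrows the same slot to bound how long it
// blocks before reporting EOF, clobbering any user-supplied maxTimeMS.
void awaitDataUpTo(ToyOperationContext* opCtx, std::chrono::milliseconds wait) {
    opCtx->deadline = Clock::now() + wait;
    // ... block here waiting for new oplog entries ...
}

int main() {
    ToyOperationContext opCtx;  // the user never set a maxTimeMS deadline
    awaitDataUpTo(&opCtx, std::chrono::milliseconds(0));  // zero wait so the toy trips immediately
    // An unrelated deadline check later in the same operation now fires:
    if (opCtx.exceededTimeLimit())
        std::cout << "ExceededTimeLimit: operation exceeded time limit\n";
    return 0;
}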

Comment by Githook User [ 18/Dec/17 ]

Author:

{'name': 'Martin Neupauer', 'email': 'martin.neupauer@mongodb.com', 'username': 'MartinNeupauer'}

Message: SERVER-31684 Fix unexpected "operation exceeded time limit" errors

The changestream queries used an operation context deadline to track
a wait time before returning EOF. This occasionally interfered with
normal operation deadlines, leading to unexpected errors.
Branch: master
https://github.com/mongodb/mongo/commit/962c5c61c93776aa4d1a8efb67a1a80cb3bb2ad0

Comment by David Storch [ 15/Dec/17 ]

Agree: this should be backported to 3.6.

Comment by Githook User [ 14/Dec/17 ]

Author:

{'name': 'Spencer Jackson', 'email': 'spencer.jackson@mongodb.com', 'username': 'spencerjackson'}

Message: Revert "SERVER-31684 Fix unexpected "operation exceeded time limit" errors"

This reverts commit b79e5f04ffc79b5892f89c22b9e5f26a297b1185.
Branch: master
https://github.com/mongodb/mongo/commit/1f38fb202b9f8696cf28d39e674242e036c0b75c

Comment by Charlie Swanson [ 14/Dec/17 ]

I'm just gonna jump in and vote yes; I checked the "backport requested" box.

Comment by James Wahlin [ 14/Dec/17 ]

martin.neupauer / david.storch - should we consider backporting this to 3.6?

Comment by Githook User [ 12/Dec/17 ]

Author:

{'name': 'Martin Neupauer', 'email': 'martin.neupauer@mongodb.com', 'username': 'MartinNeupauer'}

Message: SERVER-31684 Fix unexpected "operation exceeded time limit" errors

The changestream queries used an operation context deadline to track
a wait time before returning EOF. This occasionally interfered with
normal operation deadlines, leading to unexpected errors.
Branch: master
https://github.com/mongodb/mongo/commit/b79e5f04ffc79b5892f89c22b9e5f26a297b1185

Comment by Martin Neupauer [ 10/Nov/17 ]

Almost there... the timeout errors are gone, but the test hangs.

Comment by Martin Neupauer [ 10/Nov/17 ]

Oh, in DocumentSourceLookupChangePostImage::getNext there is a pSource->getNext() call that is not guarded by the OperationContext::DeadlineStash. I'll move the guard to the top of getNext and see what happens.

Comment by Martin Neupauer [ 09/Nov/17 ]

Something's not right and I have no idea what: it seems that two DocumentSourceCursors are both marked as normal (i.e. not tailable), hence all timeouts fire as expected.
It looks like it is not the lookup cursor (https://github.com/mongodb/mongo/blob/r3.6.0-rc2/src/mongo/db/pipeline/document_source_lookup_change_post_image.cpp#L71-L72) that is failing.

Comment by Charlie Swanson [ 09/Nov/17 ]

Ah yes, martin.neupauer - there are two DocumentSourceCursors at play here: one reads the oplog and generates the change notifications, the other does the lookup for the post-image of any update operations. For that update-lookup cursor we only generate one document, so we expect it to be short-lived, and it is not tailable. Further, we explicitly suspend the maxTimeMS deadline while creating/using that cursor, done here. Can you investigate why that doesn't seem to be working? It looks like somehow there is still a perceived deadline that's causing it to think it has exceeded its time limit.
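
A sketch of the deadline-suspension pattern being described, with toy types; the real guard is the OperationContext::DeadlineStash mentioned in the 10/Nov comment above, and per that comment it has to cover every nested cursor use inside getNext, including the pSource->getNext() call.

#include <chrono>
#include <optional>

using Clock = std::chrono::steady_clock;

struct ToyOperationContext {
    std::optional<Clock::time_point> deadline;  // stands in for the maxTimeMS deadline
};

// RAII guard: clear the deadline while the short-lived, non-tailable
// update-lookup cursor runs, then restore it afterwards.
class ToyDeadlineStash {
public:
    explicit ToyDeadlineStash(ToyOperationContext* opCtx)
        : _opCtx(opCtx), _saved(opCtx->deadline) {
        _opCtx->deadline.reset();
    }
    ~ToyDeadlineStash() {
        _opCtx->deadline = _saved;
    }
private:
    ToyOperationContext* _opCtx;
    std::optional<Clock::time_point> _saved;
};

int main() {
    ToyOperationContext opCtx;
    opCtx.deadline = Clock::now() + std::chrono::milliseconds(50);  // pretend maxTimeMS
    {
        ToyDeadlineStash stash(&opCtx);  // placed at the top of getNext(), per the plan above
        // ... run the nested post-image lookup cursor here, free of the deadline ...
    }                                    // deadline restored on scope exit
    return 0;
}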

Comment by Martin Neupauer [ 07/Nov/17 ]

It turns out it's not that easy. When I copied tailableMode in ExpressionContext::copyWith, I got a different error from the repro:

2017-11-07T18:36:45.326-0500 I QUERY    [thread2] js thread raised js exception: JSInterpreterFailure: Error: getMore command failed: {
	"operationTime" : Timestamp(1510097805, 54),
	"ok" : 0,
	"errmsg" : "error processing query: ns=change_streams.change_streams_32Tree: _id == ObjectId('5a0243868ccce3bdbd7c99ea')\nSort: {}\nProj: {}\n tailable cursor requested on non capped collection",
	"code" : 2,
	"codeName" : "BadValue",
	"$clusterTime" : {
		"clusterTime" : Timestamp(1510097805, 54),
		"signature" : {
			"hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
			"keyId" : NumberLong(0)
		}
	}
}

Comment by Martin Neupauer [ 07/Nov/17 ]

And the answer turns out to be trivial: we forgot to copy the tailableMode field in the "copy constructor" ExpressionContext::copyWith.
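
Schematically, the omission looks like this (toy types, not the real ExpressionContext; note that the follow-up comment above found that simply copying the field was not sufficient and instead produced a "tailable cursor requested on non capped collection" error):

#include <memory>
#include <string>

enum class TailableMode { kNormal, kTailableAndAwaitData };

struct ToyExpressionContext {
    std::string ns;
    TailableMode tailableMode = TailableMode::kNormal;

    // copyWith()-style factory used to build a context for the nested
    // post-image lookup; forgetting a field silently resets it to its default.
    std::shared_ptr<ToyExpressionContext> copyWith(std::string newNs) const {
        auto copy = std::make_shared<ToyExpressionContext>();
        copy->ns = std::move(newNs);
        // Omitted here, as in the suspected bug:
        //   copy->tailableMode = tailableMode;
        // so the copy reports kNormal and the EOF/timeout special-casing is skipped.
        return copy;
    }
};

int main() {
    ToyExpressionContext outer;
    outer.tailableMode = TailableMode::kTailableAndAwaitData;  // the change stream's context
    auto inner = outer.copyWith("change_streams.other");
    // inner->tailableMode is kNormal here, matching what the debugging below observed.
    return (inner->tailableMode == TailableMode::kNormal) ? 0 : 1;
}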

Comment by Martin Neupauer [ 07/Nov/17 ]

So the code that should ignore the timeout looks like this:

if (state == PlanExecutor::IS_EOF && pExpCtx->isTailableAwaitData())

and the definition of isTailableAwaitData is straightforward:

return tailableMode == TailableMode::kTailableAndAwaitData;

However, the current value of pExpCtx->tailableMode is kNormal (0). The $64k question is why?

Comment by Yves Duhem [ 03/Nov/17 ]

The error still occurs after the fix for SERVER-31597. I tested with the repro script on commit 628160ab7a. Note that it can sometimes take a few tries before seeing the error.

Comment by Martin Neupauer [ 03/Nov/17 ]

I've confirmed that the test fails before commit https://github.com/mongodb/mongo/commit/81c5724d9d11a1ff2b16937dc01e6439d9e0f266 and works fine afterwards. It's a duplicate of SERVER-31597.

Comment by Bernard Gorman [ 02/Nov/17 ]

charlie.swanson: no, that was deliberate - I figured that the mongoD lookup would be susceptible to in-pipeline timeouts in exactly the same way as it had been on the mongoS, so it made sense to stash the deadline before the lookup no matter what context it was running in. I didn't realise we already had this ticket for an actual incident of the timeout though - nice bonus!

Comment by Charlie Swanson [ 02/Nov/17 ]

Looks like bernard.gorman (accidentally?) fixed this when he refactored the update lookup to stash the deadline within DocumentSourceLookupChangePostImage instead of within mongos' implementation of the lookup: https://github.com/mongodb/mongo/commit/81c5724d9d11a1ff2b16937dc01e6439d9e0f266#diff-77917bcdb3b940573a2fb1311147602eL735

martin.neupauer to confirm with the reproduction script, then close out if that hypothesis is correct.
