[SERVER-15815] maxTimeMS on initial tailable cursor query and getMore. Created: 27/Oct/14  Updated: 06/Apr/23  Resolved: 02/Feb/16

Status: Closed
Project: Core Server
Component/s: Querying
Affects Version/s: 2.6.5
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dissatisfied Former User Assignee: David Storch
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Duplicate
duplicates SERVER-18184 Add awaitData support to getMore command Closed
Related
is related to DRIVERS-272 Add option maxAwaitTimeMS on getMore ... Closed
Participants:

 Description   

Hello!

I ran into an issue (PYTHON-780) with a deviation from the expected cursor timeout behaviour (defined by SERVER-2212): if await_data=True then maxTimeMS is ignored. This is effectively undocumented (SERVER-2212 leaves it unspecified); I assumed the following behaviour would be correct:

A query that completes in under its time limit will "roll over" its remaining time to the first getmore op (which will then "roll over" its remaining time to the second getmore op and so on, until the time limit is hit).
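The expected semantics can be sketched as one deadline shared across the initial query and every subsequent getmore. This is a minimal illustration of the "roll over" behaviour described above, not server code; the `ops` callables and all names here are hypothetical:

```python
import time

def run_with_budget(ops, max_time_ms):
    """Apply one maxTimeMS budget across an initial query and its getmores.

    `ops` stands in for [initial_query, getmore_1, getmore_2, ...]; each
    callable receives the milliseconds remaining on the shared budget.
    """
    deadline = time.monotonic() + max_time_ms / 1000.0
    results = []
    for op in ops:
        remaining_ms = (deadline - time.monotonic()) * 1000.0
        if remaining_ms <= 0:
            # The server signals this case as error code 50 (ExceededTimeLimit).
            raise TimeoutError("operation exceeded time limit")
        results.append(op(remaining_ms))
    return results
```

Each op sees a strictly shrinking budget; once the shared limit is spent, the next op fails instead of blocking indefinitely.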

Without controllable timeouts on tailing-cursor operations I am forced to resort to polling to implement the timeout myself in Python. This is not a good solution for a wide variety of reasons, not least efficiency and added latency.
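The polling workaround looks roughly like this (an illustrative sketch; `fetch` stands in for a plain non-blocking find() against the capped collection, and the names are mine, not PyMongo API):

```python
import time

def poll_for_result(fetch, timeout=None, poll_interval=0.5):
    """Re-issue a non-blocking query until data arrives or timeout expires."""
    deadline = None if timeout is None else time.monotonic() + timeout
    while True:
        doc = fetch()
        if doc is not None:
            return doc
        if deadline is not None and time.monotonic() >= deadline:
            raise TimeoutError("no result within timeout")
        # Worst case, this sleep adds up to poll_interval of extra latency
        # per result -- the inefficiency complained about above.
        time.sleep(poll_interval)
```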

Use case: awaiting messages from distributed RPC calls over a capped collection used as a message bus. A specific example from my codebase is task.result(timeout=None) — return the result of the RPC call, waiting up to timeout seconds for it. (Indefinitely if unspecified.)

PYTHON-780 contains a minimal test case whose behaviour was expected to match the test_max_time_ms tests (that is, raising ExecutionTimeout vs. the current behaviour of silently finishing iteration). Error code 50 is never returned over the wire protocol in this situation, as far as I can tell, and thus never triggers this exception in PyMongo's _unpack_response code.



 Comments   
Comment by David Storch [ 02/Feb/16 ]

Hi alice@gothcandy.com,

Again, apologies for our delayed response. I believe that MongoDB version 3.2 adds the behavior that you are looking for. In version 3.2, we added new database commands for performing find and getMore operations (see SERVER-15176 for more detail on this). As part of this work, the new getMore command accepts a parameter called maxTimeMS which is legal only for awaitData cursors over capped collections. The getMore command is documented here; in particular, see the documentation for the maxTimeMS parameter. This new feature should allow you to configure the amount of time for which an awaitData cursor will block waiting for capped insertions.

The official MongoDB drivers support this new behavior via an option called maxAwaitTimeMS. See DRIVERS-272 for more details, or refer to the driver-specific documentation. For PyMongo, the documentation of maxAwaitTimeMS can be found here.
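For reference, here is roughly what this looks like from PyMongo, based on its documented find() parameters. The stub collection exists only so the snippet runs without a live server, and the query shape is hypothetical:

```python
try:
    from pymongo import CursorType
except ImportError:  # stub so the sketch runs without PyMongo installed
    class CursorType:
        TAILABLE_AWAIT = 34  # tailable (2) | awaitData (32) wire-protocol flags

def tail_capped(collection, query, await_ms):
    """Open a tailable awaitData cursor whose getMores block at most await_ms."""
    return collection.find(
        query,
        cursor_type=CursorType.TAILABLE_AWAIT,
        max_await_time_ms=await_ms,  # sent as maxTimeMS on each getMore
    )

class _StubCollection:
    """Records find() arguments in place of pymongo.collection.Collection."""
    def find(self, query, **kwargs):
        self.query, self.kwargs = query, kwargs
        return iter(())  # a real call returns a pymongo Cursor

coll = _StubCollection()
tail_capped(coll, {"task": "result"}, await_ms=5000)
```

With a real Collection, iterating the returned cursor blocks for at most roughly await_ms per getMore before yielding control back to the caller.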

I am closing as a duplicate of SERVER-18184, which was the ticket under which this feature was developed in the server. Please let me know if you have any further questions or concerns.

Best,
Dave

Comment by Dissatisfied Former User [ 19/Jan/16 ]

A year later, any word? This issue has technically been blocking release of a pure-MongoDB (and superior, I feel) alternative to the Celery distributed task system for four years now. Hasn't blocked fire-and-forget task support, but anything involving waiting on tasks (producers waiting for results, or generator pipelines) dies a horrible death to starvation because of this. Well, mostly the aforementioned doubling of latency, but you get the idea.

Comment by Ramon Fernandez Marina [ 10/Dec/14 ]

amcgregor, the fixversion went from "debugging with submitter", which means we're working on our end to verify the issue and its impact, to "2.9 Desired", which means we want to include a fix in the next development version. The fixversion will change again to 2.9.X when this ticket is included in the planning of a specific point release.

Comment by Dissatisfied Former User [ 10/Dec/14 ]

I'll be sure to write less detail-filled initial reports in the future. :wry: I notice the Fixversion was updated. Is there hope?

Comment by Ramon Fernandez Marina [ 13/Nov/14 ]

alice@gothcandy.com, apologies for the late reply. The "debugging with submitter" fixversion means we're working to verify whether this is a bug or not and the potential impact on users. It is often the case that we ask reporters for more information, but not always. Since you're a watcher on the ticket you'll receive any updates as they happen.

Comment by Dissatisfied Former User [ 03/Nov/14 ]

A week has passed with the state in “debugging with submitter” and I have had no communication at all.

Comment by Dissatisfied Former User [ 28/Oct/14 ]

While I marked this as an improvement, not operating according to the documentation and expectations provided therein may be “bug-worthy”.

As it stands, falling back on polling to implement timeouts is not a long-term viable solution. A half-second retry delay means a task that takes 0.51 seconds to complete can only return data after one full second, halving throughput and doubling latency. With a badly tuned capped collection (too small for the activity level) it could make the difference between getting the message and not.

Generated at Thu Feb 08 03:39:05 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.