[SERVER-55802] Mongos should respect the client Op timeout without relying on mongod to do so Created: 05/Apr/21  Updated: 06/Dec/22  Resolved: 11/May/21

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 4.0.23
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Andrew Shuvalov (Inactive) Assignee: Backlog - Service Architecture
Resolution: Won't Fix Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: PNG File Screen Shot 2021-04-06 at 7.20.16 PM.png    
Issue Links:
Depends
depends on DRIVERS-797 Support maxTimeMS for write operations Closed
Assigned Teams:
Service Arch
Participants:

 Description   

In the HELP ticket repro, artificial fault injection was used to simulate a disk error on mongod. Mongod was unable to respect the operation timeout because the thread was blocked indefinitely on a disk operation. Additional fault injection was used to simulate the operation timeout at mongos, and that resulted in a much slower connection buildup than without timing out the operations.

It should be assumed that interrupting a thread stuck waiting on a socket reply should be much easier than interrupting a thread stuck on faulty disk I/O, because the TCP connection is still perfectly healthy. In Enterprise binaries, the problem of faulty disk on mongos is solved by Watchdog.
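
As a rough illustration of the point above (an analogy only, not mongos code): a caller waiting on a reply can bound that wait with a deadline and move on, whether or not the remote side ever answers. The class and method names below are hypothetical, and a CompletableFuture stands in for the pending socket reply.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.TimeoutException;

// Hypothetical sketch: bounding the wait for a reply with a deadline.
final class ReplyDeadlineSketch {

    /** Waits for the reply up to deadlineMillis; returns null if none arrives in time. */
    static String awaitReply(CompletableFuture<String> reply, long deadlineMillis) {
        try {
            return reply.get(deadlineMillis, TimeUnit.MILLISECONDS);
        } catch (TimeoutException e) {
            // The waiter gives up and frees its thread; the remote side may still be stuck.
            reply.cancel(true);
            return null;
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt(); // preserve the interrupt status
            return null;
        } catch (ExecutionException e) {
            throw new IllegalStateException("reply failed", e.getCause());
        }
    }

    public static void main(String[] args) {
        // A reply that never arrives, standing in for a shard blocked on disk I/O.
        CompletableFuture<String> neverAnswered = new CompletableFuture<>();
        System.out.println(awaitReply(neverAnswered, 100));

        // A reply that is already available.
        System.out.println(awaitReply(CompletableFuture.completedFuture("ok"), 100));
    }
}
```

The waiting side needs no cooperation from the stuck peer here, which is the asymmetry the description relies on.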



 Comments   
Comment by Lamont Nelson [ 11/May/21 ]

I'm going to close this ticket. If there is a particular use case where maxTimeMS doesn't work, we can provide a test case and reopen.

Comment by Lamont Nelson [ 11/May/21 ]

I'm not sure that I understand this statement. The BSON API is the interface to run commands on MongoDB. Everything else is just syntactic sugar provided (or not) by the drivers in their host language.

Comment by Andrew Shuvalov (Inactive) [ 13/Apr/21 ]

Yes, maxTimeMS works in raw runCommands, but I really don't want to steer users in that direction; it may create compatibility hurdles. At the least, we need to wait to see what design comes out of DRIVERS-555 to have a more consistent long-term strategy. In the medium term, this problem should be partially mitigated by the mongod-side implementation of my thread liveness monitor proposal, PM-2248.

Comment by Lamont Nelson [ 13/Apr/21 ]

Have you actually tried attaching maxTimeMS to the raw command and found that it didn't work? Meaning the BSON representation, not a command through the strongly typed API.

Comment by Lamont Nelson [ 10/Apr/21 ]

I think that in order to enforce maxTimeMS for writes using the Java driver, we need to use the interface that lets you submit a raw BSON command (MongoDatabase.runCommand; equivalent to what this jstest is doing) versus their strongly typed API. I'm not sure why this is, but I've verified with Jeff Yemin that this is the case.
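
A minimal sketch of what such a raw command could look like, assuming the standard `update` command shape. It builds the command body with plain Java collections only; the class name, the values, and the `"majority"` write concern are illustrative assumptions, not taken from the repro. Note that wtimeout sits inside writeConcern, while maxTimeMS is a separate top-level field.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: the body of a raw `update` command carrying maxTimeMS.
final class RawUpdateCommandSketch {

    static Map<String, Object> build(String collection, int studentId, int classId,
                                     long maxTimeMs, long wTimeoutMs) {
        // One update statement: q is the filter, u is the modification.
        Map<String, Object> statement = new LinkedHashMap<>();
        statement.put("q", Map.of("student_id", studentId));
        statement.put("u", Map.of("$set", Map.of("class_id", classId)));

        // wtimeout only bounds the wait for write concern acknowledgement.
        Map<String, Object> writeConcern = new LinkedHashMap<>();
        writeConcern.put("w", "majority"); // illustrative value
        writeConcern.put("wtimeout", wTimeoutMs);

        // Insertion order matters: the command name must be the first key.
        Map<String, Object> command = new LinkedHashMap<>();
        command.put("update", collection);
        command.put("updates", List.of(statement));
        command.put("writeConcern", writeConcern);
        command.put("maxTimeMS", maxTimeMs); // overall operation time limit
        return command;
    }

    public static void main(String[] args) {
        System.out.println(build("grades", 10042, 3, 30_000L, 5_000L));
    }
}
```

With the Java driver on the classpath, the map would presumably be submitted as `sampleTrainingDB.runCommand(new Document(command))`; that call is not exercised here and depends on the driver's default codecs handling the nested maps.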

Comment by Andrew Shuvalov (Inactive) [ 08/Apr/21 ]

Yes, I was able to reproduce that the timeout on a find() operation is respected by mongos, which releases the thread and both connections. However, as discussed above, most other operations don't have a timeout that mongos can handle.

Comment by Andrew Shuvalov (Inactive) [ 08/Apr/21 ]

There is also DRIVERS-555 "Client side operations Timeout", with the notion that maxTimeMS will be deprecated and replaced with a unified timeoutMS. So I'm changing this to blocked on DRIVERS-555.

Comment by Andrew Shuvalov (Inactive) [ 07/Apr/21 ]

There is an open DRIVERS-797 to "Support maxTimeMS for write operations" about that.

Comment by Andrew Shuvalov (Inactive) [ 07/Apr/21 ]

I don't see any way to set maxTimeMS for updateOne(), updateMany() and similar operations in the Java driver, maybe because of DOCS-9823 or maybe because it's indeed impossible. Yes, there is a maxTime() method on the FindIterable<TDocument> returned by find(), but this is not the solution. The timeout should work on all CRUD operations, especially mutations, because it's definitely possible for a mutation to get stuck.

Comment by Matthew Saltz (Inactive) [ 07/Apr/21 ]

There's a difference between wtimeout and maxTimeMS. The way I understand it, wtimeout only applies to the portion of the query that waits for write concern, and does not work as an overall operation timeout. So the query you're issuing doesn't actually have an overall operation timeout on it. I think if you set maxTimeMS then you'll see the behavior you're requesting, where the mongos will stop waiting for the query to complete after the operation deadline. (On 4.0.23 and later, that is.)

Comment by Andrew Shuvalov (Inactive) [ 06/Apr/21 ]

I don't think the fixes above address the update operation timeout. However, please correct me if the way I'm setting the write timeout is not expected to do what I want:

                mongoClient = parent.getClient();
                MongoDatabase sampleTrainingDB = mongoClient.getDatabase("sample_training");
                MongoCollection<Document> gradesCollection = sampleTrainingDB.getCollection("grades")
                    .withWriteConcern(sampleTrainingDB.getWriteConcern().withWTimeout(30000, TimeUnit.MILLISECONDS));
 
                Bson filter = eq("student_id", 10000 + rand.nextInt(1000));
                Bson updateOperation = set("class_id", rand.nextInt(10));
                UpdateResult updateResult = gradesCollection.updateOne(filter, updateOperation);

I just repeated the experiment with the head, the future v5.0 release:

Here the fault injection that blocks all disk ops on the primary is applied at A. The connection buildup starts immediately and none of the 'updateOne' commands ever times out.

Generated at Thu Feb 08 05:37:30 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.