[JAVA-3920] Incorrect read timeout handling causes connection closing Created: 28/Dec/20 Updated: 28/Oct/23 Resolved: 20/Jan/21 |
|
| Status: | Closed |
| Project: | Java Driver |
| Component/s: | Async |
| Affects Version/s: | 4.0.5, 4.1.0 |
| Fix Version/s: | 4.2.0, 3.12.8 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Konrad Kamiński | Assignee: | Valentin Kavalenka |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Backwards Compatibility: | Fully Compatible |
| Documentation Changes: | Not Needed |
| Description |
|
NettyStream handles timeouts using ReadTimeoutHandler. It schedules and cancels timeout operations in various places (e.g. here or here) and does so in an asynchronous way by submitting Tasks to Executors. The code in NettyStream (especially in readAsync method) and in ReadTimeoutHandler assumes that cancellations happen after scheduling a task, but since they are executed concurrently in multiple threads we may sometimes have a situation (as we do in our project) that a scheduled timeout task is not cancelled since the cancellation is effectively done before the task is scheduled (i.e. removeTimeout is called BEFORE scheduleTimeout).
This improper handling of timeouts causes connections to be treated as invalid (STALE) when a "bogus" timeout occurs and because of the driver policy all other connections to the same server will be marked as invalid.
In the near future I'll provide a PR with a solution to that problem. |
| Comments |
| Comment by Githook User [ 20/Jan/21 ] | |||||||||||||||||||||||||||
|
Author: {'name': 'Valentin Kovalenko', 'email': 'valentin.kovalenko@mongodb.com', 'username': 'stIncMale'}Message: Fix the read timeout implementation in NettyStream (#636) Both the new approach and the original one achieve the guarantee that The original approach of achieving the at-most-one property:
The new approach achieves the at-most-one property by using a lock. Other notable changes:
| |||||||||||||||||||||||||||
| Comment by Valentin Kavalenka [ 14/Jan/21 ] | |||||||||||||||||||||||||||
|
Thank you, kaminski.konrad@gmail.com! Your scenario is very helpful, and it is clear that there is a bug in the driver. Below is your scenario with a few additional // VK: comments, which make it more clear at least to me
What makes the reproducibility even worse, is that when the number of available bytes is less than pendingReader.numBytes in the EVENT_LOOP thread, VK3 becomes
which increases the chances of Runnable[1] noticing that RTH.timeout is not null, and thus not scheduling a new timeout task that will never be cancelled.
I agree, there is some excessive memory synchronization code that may be misleading. We'll improve this. | |||||||||||||||||||||||||||
| Comment by Konrad Kamiński [ 13/Jan/21 ] | |||||||||||||||||||||||||||
|
Producing an example would be quite difficult, but we observe this behaviour in our systems. I did some analysis regarding the possible cause and this is the scenario I think we have:
There are also some other issues which potentially mislead you when doing analysis (e.g. all those volatile fields, which are accessed only in synchronized blocks anyway or my favourite volatile timeout in ReadTimeoutHandler with a code like if (timeout != null) timeout.cancel();). | |||||||||||||||||||||||||||
| Comment by Valentin Kavalenka [ 12/Jan/21 ] | |||||||||||||||||||||||||||
|
Hi kaminski.konrad@gmail.com. You are right that if the methods ReadTimeoutHandler.scheduleTimeout and ReadTimeoutHandler.removeTimeout were called concurrently on the same ReadTimeoutHandler object (which is notably not shared between channels), then losing cancellations would be possible. However, these methods are always called from the same EventLoop (the one that handles all the I/O operations for the channel underlying a NettyStream): we check whether the code is currently being executed in the channel's event loop before scheduling/cancelling the read timeout for the channel, and if not, then we submit scheduling/cancellation to be executed in the event loop. Additionally, the ReadTimeoutException is thrown by a ReadTimeoutTask, which is also scheduled in the channel's event loop.1 In other words, I do not see a way for us to lose a timeout cancellation due to concurrency. kaminski.konrad@gmail.com would you please describe in more details how the issue is observed (at the very least you must be observing a MongoSocketReadTimeoutException, please attach the stack trace), and what makes you interpret it as caused by a lost timeout cancellation? We need this information to proceed with debugging the problem. If you would provide a minimal reproducible example, it would be even more useful. 1 In our case ChannelHandlerContext.executor() returns a reference to the same object as ChannelHandlerContext.channel().eventLoop() does because the ReadTimeoutHandler is registered without a handler-specific EventExecutorGroup. | |||||||||||||||||||||||||||
| Comment by Valentin Kavalenka [ 05/Jan/21 ] | |||||||||||||||||||||||||||
|
Thanks for reporting this issue, and we are especially grateful for the PR you created! We will investigate the issue, review your PR and get back to you. Regards, | |||||||||||||||||||||||||||
| Comment by Konrad Kamiński [ 04/Jan/21 ] | |||||||||||||||||||||||||||
|
A PR has been provided. |