[JAVA-4387] MongoDb returns message length larger than the max message length Created: 03/Nov/21 Updated: 27/Oct/23 Resolved: 30/Nov/21

| Status: | Closed |
| Project: | Java Driver |
| Component/s: | None |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | 运军 蒋 | Assignee: | Valentin Kavalenka |
| Resolution: | Gone away | Votes: | 0 |
| Labels: | external-user |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Attachments: | |
| Issue Links: | |
| Case: | (copied to CRM) |
| Documentation Changes: | Not Needed |
| Description |
|
Hi guys, we are facing an issue that happens several times a day. Our Java driver version is 4.0.5 and our MongoDB version is 4.0.9. We have 2 routers and several nodes, and we are using the Reactive Streams driver. This issue only happens in our production environment. It usually first occurs while MongodbHealthIndicator is checking the connections to MongoDB. Please see the stack trace below:
*The responseTo is always smaller than the request id.* The full stack trace is:
I investigated the related source code. My guess is that when the first error happened, MongoDB replied with a message the driver did not recognize. When the driver decoded the message according to the wire protocol, it found that the message length was too big (after reading the first 16 bytes from the stream), threw an exception, and ignored the rest of the bytes. The following requests keep doing the same until the first well-formed reply message arrives, but by then the responseTo in that reply is smaller than the current request id. I can't reproduce this on my laptop. I'm wondering why the driver needs to check that the message size is valid or that the responseTo matches the request id. Is there a concern behind these checks? Why not just check whether the reply message conforms to the wire protocol? |
| Comments |
| Comment by PM Bot [ 30/Nov/21 ] |
|
There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information. |
| Comment by Valentin Kavalenka [ 16/Nov/21 ] |
|
Changing the status to "waiting for reporter". It will remain in this status for some time before being closed automatically if there is no activity. gloojiang86@gmail.com, if you discover evidence that points to a problem in the driver code and the ticket is closed by that time, please clone it again. |
| Comment by 运军 蒋 [ 16/Nov/21 ] |
|
Yeah. I totally understand that. We'll keep investigating this. Thanks for your efforts. |
| Comment by Valentin Kavalenka [ 15/Nov/21 ] |
|
I understand that this problem is a difficult one to investigate, but I do not see a way to make further progress without debugging it directly in the environment where it is reproducible. This is beyond what the driver dev team can do, especially given that, despite all your and our efforts, there is still no compelling evidence that the problem is in the driver code. If you do eventually discover such evidence, please report your findings. |
| Comment by 运军 蒋 [ 05/Nov/21 ] |
|
Thanks for your quick reply on this.
I was hoping I could find some corrupted TCP segments in the pcaps, but failed. You could look into the captured segments along with service_1.log. Thanks for your time. |
| Comment by Valentin Kavalenka [ 04/Nov/21 ] |
|
Thank you for providing more details on the matter. Before discussing them, I'd like to clarify whether you have been able to reproduce the problem with NeuVector disabled. Another question I have is whether you tried enabling TLS in the client. I know that this is not a simple task, but to my understanding doing so could tell us directly whether the network data gets corrupted or not.
1. If my understanding is correct, you found the 16 bytes composing a specific message header <messageLength=975335023, requestId=926299441, responseTo=959721525, opCode=960050487> in TCP segments sent to multiple client sockets. This suggests that the bytes found in those segments do not represent a MongoDB message header, because requestId is unique within a client JVM process. However, even if they all do, then the same message header was sent to different sockets, and thus it is not possible to conclude that the driver read this header from the socket 10.42.19.195:53222 (as per service_1.log). A better experiment that I can think of would be to grab all inbound TCP segments for a problematic client socket and try to match them with the message headers attached to operation events in the history recorded by the debugger (see events with the DECODE_MESSAGE_HEADER code); a sketch of this matching step is included after this comment. I think this experiment can give a more compelling answer as to whether the driver read the same message headers that a network sniffer captured. The advantage of this experiment is that by necessity some headers will be matched with inbound TCP data, confirming that the conversion to bytes is done correctly. However, even the proposed experiment may produce incorrect results depending on how Wireshark searches for data: it is theoretically possible for a MongoDB message header to be split between TCP segments, and if Wireshark searches for a byte string only within each captured segment, then it may not find the header. This is something to keep in mind when a header is not found by Wireshark. The experiment you conducted and the experiment I proposed both answer the following question: "Does the driver read from a socket the data that comes from the network for that socket, and only that data?" If the answer is "yes", then the data is corrupted, and the point at which TCP segments are sniffed tells us that the data was corrupted before this point. If the answer is "no", then it means either that the driver (or a library it uses) has a bug, or that the data was corrupted somewhere between the sniffing point and the driver. The last part is what makes it crucial to put the sniffing point as close to the driver as possible. E.g., if a client is working within a Docker container, then sniffing is better done also within the container, after the point at which NeuVector does its job. At which point was the data sniffed in your experiments?
2. My understanding is that you are still using org.mongodb:mongodb-driver-reactivestreams:4.0.5. Observing multiple consecutive errors on the same socket is a problem that was identified and fixed in 4.4.0 as a result of your help in
followed by the exception message for the same connection
Once the driver fails to successfully read from a stream, it is supposed to close it, but sometimes it doesn't and instead continues reading from an incorrect position in the stream, reading junk again and again. You can see that "Detected an invalid reply message header" events keep accumulating: the message with messageSequenceNumber=2 has 2 such events, while the message with messageSequenceNumber=3 has the same 2 events plus 1 new one. You can also see that the message header is different each time, because the driver keeps reading data that is not a message header and trying to interpret it as one. It is possible for it to eventually encounter a 4-byte sequence whose value, interpreted as messageLength, is neither too high nor too low, in which case the driver will likely complain about responseTo instead, which, I believe, is exactly what you observed. Given that the observations you presented in item 2 are a direct consequence of the aforementioned bug, I do not think it is a good idea to include them in the analysis of the problem. I propose to pay attention only to the first occurrence of reading invalid data from a socket, and to actions related to that same socket before it happened.
If you have commercial support from MongoDB, then you could open a support case; I expect that communication there is more interactive. For a Java Driver Jira ticket, however, communication is intended to happen via Jira. |
| Comment by 运军 蒋 [ 03/Nov/21 ] |
|
@Valentin Kavalenka, we have some new findings that indicate this issue is not caused by corruption. We did several rounds of experiments and captured the TCP packets between our service and MongoDB. The packets indicated that there was no corruption, and we confirmed that TSO is disabled. Furthermore, we found something really interesting in the logs produced by the debugger.
I can't reopen it. Thanks a lot. |