[SERVER-69177] A bug of WritableServerSelector-Timeout transaction committed Created: 26/Aug/22 Updated: 09/Jan/23 Resolved: 09/Jan/23 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | None |
| Affects Version/s: | 4.4.5 |
| Fix Version/s: | None |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Ouyang Tsuna | Assignee: | Chris Kelly |
| Resolution: | Cannot Reproduce | Votes: | 0 |
| Labels: | Bug | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Operating System: | ALL |
| Participants: |
| Description |
|
The core message is
I have a mongo replica-set cluster consisting of 5 nodes. The primary is public-cd-a1.disalg.cn:37017. However, in this exception, they are all secondaries in the client's view. I have checked the mongod.log and there are no network issues or leader re-election. Node a1 is always the primary.
|
| Comments |
| Comment by Chris Kelly [ 09/Jan/23 ] |
|
We haven’t heard back from you for some time, so I’m going to close this ticket. If you're ever able to, please provide additional information and we will reopen the ticket. Christopher |
| Comment by Chris Kelly [ 29/Dec/22 ] |
|
Ouyang, After getting your project running, we found that it defaults to running against MongoDB 4.2.8, which appears to run successfully and with no failures. However, it is unclear what parameters you ran the test with against 4.4.5. I am referring to the GitHub repository you linked. Per the repository's sample test:
Running against 4.4.5 seemed to require a change to db.clj. When running against 4.4.5 after modifying the /src/jepsen/mongodb/db.clj file to hit https://repo.mongodb.org/apt/debian/dists/buster/mongodb-org/4.4/main/binary-amd64/mongodb-org-mongos_4.4.5_amd64.deb
The project gets stuck at
Despite the fact it appears MongoDB is accepting connections from the logs:
I figured this was because the project originally used Java driver 4.0.2, which is not compatible with MongoDB 4.4. However, after changing the driver version to 4.4.1, the problem persisted.
Christopher |
| Comment by Chris Kelly [ 26/Sep/22 ] |
|
Hi Ouyang, We still need additional information to help diagnose the problem. If you're able, please attach some of the requested information to the ticket so we can investigate further. Christopher |
| Comment by Chris Kelly [ 08/Sep/22 ] |
|
Hi Ouyang, Thanks for your report. To start investigating this further, it would be super helpful if you could describe the exact events in sequence, so we have a clear timeline of what takes place leading up to the cluster state issue: specifically when you are doing writes, when they're retrying, how you're confirming that T2 can read the updates of T1, and so on (and also how you cannot verify if they are committed).
Any of the following would also be helpful and appreciated:
Christopher |
| Comment by Ouyang Tsuna [ 01/Sep/22 ] |
|
Hi Chris, This error occurs in almost every test we conduct when the RTT between the clients and the servers is about 30 ms. In fact, we are actively testing MongoDB transactions, and the project can be found in Tsunaou/mongodb at txn-checking-4.4.5 (github.com). The test framework is based on jepsen-io/jepsen: A framework for distributed systems verification, with fault injection (github.com). Our test framework is implemented in Clojure, a functional programming language that runs on the JVM. If you are interested in reproducing this issue, I can write a detailed tutorial to guide you through it. If not, we can try to reproduce it in Java, but that will take some time and I cannot estimate how long. Regards, Young
|
| Comment by Chris Kelly [ 29/Aug/22 ] |
|
Hi Ouyang, Thank you for your report. I have seen this error arise due to network issues before on the community forums. I have also seen speculation that this is due to the bindIp configuration. However, that doesn't sound exactly the same here, since you are stating you're able to read the updates despite the error. I think it's possible there is some network issue at play here, but it is odd that your client is reporting issues if the writes are actually committing and able to be read later. To start investigating:
Regards, |
| Comment by Ouyang Tsuna [ 26/Aug/22 ] |
|
The RTT from client to server is about 30 ms. |