-
Type:
Bug
-
Resolution: Cannot Reproduce
-
Priority:
Major - P3
-
None
-
Affects Version/s: 4.4.5
-
Component/s: None
-
ALL
-
None
-
None
-
None
-
None
-
None
-
None
-
None
MongoDB version : v4.4.5
MongoDB Java Driver Version: mongodb-driver-sync 4.1.2
A transaction T1 failed to commit due to the WritableServerSelector-Timeout exception. However, another transaction T2 read the updates of T1. I use the withTransaction API and T1 retries for many times.
The exception T1 throws are:
2022-08-25 00:35:04,237{GMT} WARN [jepsen worker 46] jepsen.mongodb.rw-register: #error {
:cause Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}]
:via
[{:type com.mongodb.MongoTimeoutException
:message Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}]
:at [com.mongodb.internal.connection.BaseCluster createTimeoutException BaseCluster.java 407]}]
:trace
[[com.mongodb.internal.connection.BaseCluster createTimeoutException BaseCluster.java 407]
[com.mongodb.internal.connection.BaseCluster selectServer BaseCluster.java 118]
[com.mongodb.internal.connection.AbstractMultiServerCluster selectServer AbstractMultiServerCluster.java 52]
[com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource <init> ClusterBinding.java 103]
[com.mongodb.internal.binding.ClusterBinding$ClusterBindingConnectionSource <init> ClusterBinding.java 99]
[com.mongodb.internal.binding.ClusterBinding getWriteConnectionSource ClusterBinding.java 91]
[com.mongodb.client.internal.ClientSessionBinding getWriteConnectionSource ClientSessionBinding.java 93]
[com.mongodb.internal.operation.OperationHelper withReleasableConnection OperationHelper.java 618]
[com.mongodb.internal.operation.CommandOperationHelper executeRetryableCommand CommandOperationHelper.java 705]
[com.mongodb.internal.operation.TransactionOperation execute TransactionOperation.java 69]
[com.mongodb.internal.operation.CommitTransactionOperation execute CommitTransactionOperation.java 133]
[com.mongodb.internal.operation.CommitTransactionOperation execute CommitTransactionOperation.java 54]
[com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor execute MongoClientDelegate.java 195]
[com.mongodb.client.internal.ClientSessionImpl commitTransaction ClientSessionImpl.java 129]
[com.mongodb.client.internal.ClientSessionImpl withTransaction ClientSessionImpl.java 210]
[jdk.internal.reflect.GeneratedMethodAccessor29 invoke nil -1]
[jdk.internal.reflect.DelegatingMethodAccessorImpl invoke DelegatingMethodAccessorImpl.java 43]
[java.lang.reflect.Method invoke Method.java 566]
[clojure.lang.Reflector invokeMatchingMethod Reflector.java 167]
[clojure.lang.Reflector invokeInstanceMethod Reflector.java 102]
[jepsen.mongodb.rw_register.Client$fn__2566 invoke rw_register.clj 137]
[jepsen.mongodb.rw_register.Client invoke_BANG_ rw_register.clj 128]
[jepsen.client.Validate invoke_BANG_ client.clj 83]
[jepsen.generator.interpreter.ClientWorker invoke_BANG_ interpreter.clj 62]
[jepsen.generator.interpreter$spawn_worker$fn__9422$fn__9423 invoke interpreter.clj 137]
[jepsen.generator.interpreter$spawn_worker$fn__9422 invoke interpreter.clj 120]
[clojure.core$binding_conveyor_fn$fn__5739 invoke core.clj 2030]
[clojure.lang.AFn call AFn.java 18]
[java.util.concurrent.FutureTask run FutureTask.java 264]
[java.util.concurrent.ThreadPoolExecutor runWorker ThreadPoolExecutor.java 1128]
[java.util.concurrent.ThreadPoolExecutor$Worker run ThreadPoolExecutor.java 628]
[java.lang.Thread run Thread.java 829]]}
The core message is
Client view of cluster state is {type=REPLICA_SET, servers=[
{address=public-cd-a5.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.4 ms, state=CONNECTED},
{address=public-cd-a1.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=156.8 ms, state=CONNECTED},
{address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=38.3 ms, state=CONNECTED},
{address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.9 ms, state=CONNECTED},
{address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=35.1 ms, state=CONNECTED}
]
I have a mongo replica-set cluster consisting of 5 nodes. The primary is public-cd-a1.disalg.cn:37017. However, in this exception, they are all secondaries in the client's view. I have checked the mongod.log and there are no network issues or leader re-election. Node a1 is always the primary.
Moreover, there are other types of nodes in the client's view except REPLICA_SET_SECONDARY. In this exception, I found REPLICA_SET_OTHER and UNKNOWN.
:cause Timed out after 1000 ms while waiting for a server that matches WritableServerSelector. Client view of cluster state is {type=REPLICA_SET, servers=[ {address=public-cd-a5.disalg.cn:37017, type=, roundTripTime=37.0 ms, state=CONNECTED}, {address=public-cd-a1.disalg.cn:37017, type=UNKNOWN, state=CONNECTING}, {address=public-cd-a4.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=822.4 ms, state=CONNECTED}, {address=public-cd-a2.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=34.1 ms, state=CONNECTED}, {address=public-cd-a3.disalg.cn:37017, type=REPLICA_SET_SECONDARY, roundTripTime=195.9 ms, state=CONNECTED}]
This issue makes us can not determine whether a transaction is successfully committed. Why does the client see a wrong view of the cluster state?