[JAVA-1868] MongoDB Java Client gets stuck after new PRIMARY is elected Created: 19/Jun/15  Updated: 11/May/20  Resolved: 11/May/20

Status: Closed
Project: Java Driver
Component/s: Cluster Management
Affects Version/s: 3.0.2
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Attila Tőkés Assignee: Unassigned
Resolution: Done Votes: 6
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
related to JAVA-3301 Java Driver Hangs when replica s... Closed
is related to JAVA-3070 Retryable Reads Closed
is related to JAVA-3094 How to handle MongoSocketReadException Closed

 Description   

I'm trying to test a fail-over scenario with the MongoDB Java client:

  • connect to replica set with 3 nodes (1 primary + 2 secondary)
  • kill the primary
  • new primary node is elected
  • (expected) driver continues to work after failing over

Actual:

  • The MongoDB Java client logs Rediscovering type of existing primary, then Exception in monitor thread while connecting to server nosql-x64-node-2:27017, and gets stuck:

    20:19:52,352  INFO cluster:71 - Cluster created with settings {hosts=[nosql-x64-node-1:27017], mode=MULTIPLE, requiredClusterType=REPLICA_SET, serverSelectionTimeout='30000 ms', maxWaitQueueSize=500, requiredReplicaSetName='rs1'}
    20:19:52,353  INFO cluster:71 - Adding discovered server nosql-x64-node-1:27017 to client view of cluster
    20:19:52,483  INFO cluster:71 - No server chosen by PrimaryServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, all=[ServerDescription{address=nosql-x64-node-1:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
    20:19:52,487  INFO connection:71 - Opened connection [connectionId{localValue:1, serverValue:125}] to nosql-x64-node-1:27017
    20:19:52,493  INFO cluster:71 - Monitor thread successfully connected to server with description ServerDescription{address=nosql-x64-node-1:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 2]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=1508516, setName='rs1', canonicalAddress=nosql-x64-node-1:27017, hosts=[nosql-x64-node-3:27017, nosql-x64-node-2:27017, nosql-x64-node-1:27017], passives=[], arbiters=[], primary='nosql-x64-node-2:27017', tagSet=TagSet{[]}}
    20:19:52,496  INFO cluster:71 - Adding discovered server nosql-x64-node-3:27017 to client view of cluster
    20:19:52,499  INFO cluster:71 - Adding discovered server nosql-x64-node-2:27017 to client view of cluster
    20:19:52,505  INFO connection:71 - Opened connection [connectionId{localValue:2, serverValue:169}] to nosql-x64-node-3:27017
    20:19:52,507  INFO connection:71 - Opened connection [connectionId{localValue:3, serverValue:161}] to nosql-x64-node-2:27017
    20:19:52,508  INFO cluster:71 - Monitor thread successfully connected to server with description ServerDescription{address=nosql-x64-node-3:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 2]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=1761189, setName='rs1', canonicalAddress=nosql-x64-node-3:27017, hosts=[nosql-x64-node-3:27017, nosql-x64-node-2:27017, nosql-x64-node-1:27017], passives=[], arbiters=[], primary='nosql-x64-node-2:27017', tagSet=TagSet{[]}}
    20:19:52,510  INFO cluster:71 - Monitor thread successfully connected to server with description ServerDescription{address=nosql-x64-node-2:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 2]}, minWireVersion=0, maxWireVersion=3, electionId=55844af54ac7e7537155bd46, maxDocumentSize=16777216, roundTripTimeNanos=1413436, setName='rs1', canonicalAddress=nosql-x64-node-2:27017, hosts=[nosql-x64-node-3:27017, nosql-x64-node-2:27017, nosql-x64-node-1:27017], passives=[], arbiters=[], primary='nosql-x64-node-2:27017', tagSet=TagSet{[]}}
    20:19:52,510  INFO cluster:71 - Discovered replica set primary nosql-x64-node-2:27017
    20:19:52,518  INFO connection:71 - Opened connection [connectionId{localValue:4, serverValue:162}] to nosql-x64-node-2:27017
    ...
    20:21:12,521  INFO cluster:71 - Monitor thread successfully connected to server with description ServerDescription{address=nosql-x64-node-3:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 2]}, minWireVersion=0, maxWireVersion=3, electionId=55844f8645d8c9dd9bbdebfe, maxDocumentSize=16777216, roundTripTimeNanos=1103740, setName='rs1', canonicalAddress=nosql-x64-node-3:27017, hosts=[nosql-x64-node-3:27017, nosql-x64-node-2:27017, nosql-x64-node-1:27017], passives=[], arbiters=[], primary='nosql-x64-node-3:27017', tagSet=TagSet{[]}}
    20:21:13,013  INFO cluster:71 - Monitor thread successfully connected to server with description ServerDescription{address=nosql-x64-node-1:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 2]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=1120967, setName='rs1', canonicalAddress=nosql-x64-node-1:27017, hosts=[nosql-x64-node-3:27017, nosql-x64-node-2:27017, nosql-x64-node-1:27017], passives=[], arbiters=[], primary='nosql-x64-node-3:27017', tagSet=TagSet{[]}}
    20:21:32,525  INFO cluster:71 - Monitor thread successfully connected to server with description ServerDescription{address=nosql-x64-node-3:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 2]}, minWireVersion=0, maxWireVersion=3, electionId=55844f8645d8c9dd9bbdebfe, maxDocumentSize=16777216, roundTripTimeNanos=1063299, setName='rs1', canonicalAddress=nosql-x64-node-3:27017, hosts=[nosql-x64-node-3:27017, nosql-x64-node-2:27017, nosql-x64-node-1:27017], passives=[], arbiters=[], primary='nosql-x64-node-3:27017', tagSet=TagSet{[]}}
    20:21:32,526  INFO cluster:71 - Discovered replica set primary nosql-x64-node-3:27017
    20:21:32,527  INFO cluster:71 - Rediscovering type of existing primary nosql-x64-node-2:27017
    20:21:34,814  INFO cluster:76 - Exception in monitor thread while connecting to server nosql-x64-node-2:27017
    com.mongodb.MongoSocketOpenException: Exception opening socket
    	at com.mongodb.connection.SocketStream.open(SocketStream.java:63)
    	at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:114)
    	at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:127)
    	at java.lang.Thread.run(Thread.java:745)
    Caused by: java.net.NoRouteToHostException: No route to host
    	at java.net.PlainSocketImpl.socketConnect(Native Method)
    	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:339)
    	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:200)
    	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:182)
    	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
    	at java.net.Socket.connect(Socket.java:579)
    	at com.mongodb.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:50)
    	at com.mongodb.connection.SocketStream.open(SocketStream.java:58)
    	... 3 more
    



 Comments   
Comment by Mstthias Müller [ 08/Jan/20 ]

Hi Jeffrey,

this is exactly what I was looking for, thank you! I'll give it a shot.
(Although I'm surprised that nobody mentioned this feature earlier, since it already existed when I wrote my first comment.)

Regards
Matthias

Comment by Jeffrey Yemin [ 07/Jan/20 ]

Hi madness-inc,

See the comment I just added on JAVA-3094 regarding automatic retry of both reads and writes by the driver. This may help in your scenario, although it's still not clear why the driver is getting a socket timeout when attempting to read from the new primary. The driver will only retry once, so unless that timeout is transient, the operation will still end up failing.
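For later readers: the feature referred to here is retryable reads and writes, which 3.11+ drivers can enable explicitly via the connection string. A minimal configuration sketch; the host names and replica set name below are placeholders, not values from this ticket:

```java
import com.mongodb.ConnectionString;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

// Opt in to retryable reads and writes explicitly (driver 3.11+).
// Host names and the replica set name are placeholders.
MongoClient client = MongoClients.create(new ConnectionString(
        "mongodb://localhost:8001,localhost:8002/?replicaSet=appng"
        + "&retryWrites=true&retryReads=true"));
```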

Comment by Mstthias Müller [ 07/Jan/20 ]

Hi folks,

I was hoping to get some kind of feedback here.
Are there any plans to enhance the (connection) recovery features of the Mongo Java Driver?

Regards
Matthias

Comment by Mstthias Müller [ 19/Aug/19 ]

Hi ross.lawley,

the log does not show any errors for that period.
But I have to conclude that the assumptions from my first comment were wrong, even though I was told they were correct.

My assumptions are:
When connecting to a simple replica set (primary, 1 secondary, 1 arbiter), and a new primary is elected (because the old one crashed or was shut down for maintenance), the driver should transparently handle that case by not using the unavailable node any more (until it is available again).

As jeff.yemin stated in JAVA-3094, the Java Driver does not handle a cluster failover completely transparently:

For reads, you'll have to catch the exception and execute the retry in the application.

In other words:
If a MongoSocketException occurs, it is up to the client how to handle it.
There's no automatic failover done by the driver.
But IMHO, a temporary shutdown of a replica set node and the election of a new master is nothing an application developer should worry about.
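The application-side retry described above might look like the following sketch. This is pure Java and not driver API: the helper name is hypothetical, and the plain RuntimeException catch stands in for catching MongoSocketException.

```java
import java.util.function.Supplier;

// Hypothetical application-side retry helper; not part of the MongoDB driver.
// In real code the catch clause would target MongoSocketException, and
// delayMillis would cover the expected election window.
public final class RetryHelper {

    public static <T> T withRetry(Supplier<T> op, int maxAttempts, long delayMillis) {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be >= 1");
        }
        RuntimeException last = null;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return op.get();
            } catch (RuntimeException e) {
                last = e;
                try {
                    Thread.sleep(delayMillis); // wait for the failover to settle
                } catch (InterruptedException ie) {
                    Thread.currentThread().interrupt();
                    throw last;
                }
            }
        }
        throw last;
    }
}
```

An update would then be wrapped as RetryHelper.withRetry(() -> collection.update(...), 3, 2000).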

I think it would really be a cool feature of the driver to support transparently handling such scenarios.
Some other middleware products, like RabbitMQ, do support this:
https://www.rabbitmq.com/api-guide.html#recovery

What do you think?

Regards
Matthias

Comment by Ross Lawley [ 27/Jun/19 ]

Hi madness-inc,

I read that as the driver trying to speak to the correct server, as it has identified it as the primary. However, there was a socket timeout when doing so:

2019-05-21 10:10:01,363 INFO  ["http-nio-8080-exec-5"] org.mongodb.driver.connection: Closed connection [connectionId{localValue:85, serverValue:14}] to localhost:8002 because there was a socket exception raised by this connection.
 
...
 
com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message

There was also an earlier log message from the connection pool with another connection that had failed against 8002:

2019-05-21 10:03:35,324 INFO  ["http-nio-8080-exec-6"] org.mongodb.driver.connection: Closed connection [connectionId{localValue:15, serverValue:8}] to localhost:8002 because there was a socket exception raised by this connection.

Do the mongod logs show anything for that period on localhost:8002? It looks like there are some network issues occurring there.

Ross

Comment by Mstthias Müller [ 27/Jun/19 ]

Hi Jeff,

I'm really sorry to bother you with this.
Unfortunately, we'll have to move to another solution if we can't figure out how to handle the MongoDB cluster failover properly (we're using MongoDB to store the user's HTTP session in a clustered environment).
Any feedback is highly appreciated!
 
Regards
Matthias

Comment by Mstthias Müller [ 04/Jun/19 ]

Hi Jeff,

have you had a chance to look at the logs?
Any hint is welcome here; I just can't figure out what's going wrong.

Regards

Matthias

Comment by Mstthias Müller [ 21/May/19 ]

Hi Jeff,

sorry it took me a while to summarize this.

The scenario is:

  • replica set with PRIMARY, 1 SECONDARY, 1 ARBITER
  • Mongo Java Driver 3.10.2 (we're running with Java 8)

These are my MongoClientOptions, using a socketTimeout, as suggested:

MongoClientOptions options = MongoClientOptions.builder()
	.writeConcern(WriteConcern.MAJORITY)
	.readPreference(ReadPreference.primary())
	.readConcern(ReadConcern.MAJORITY)
	.requiredReplicaSetName("appng")
	.socketTimeout(5000)
	.build();

When starting the cluster everything is fine; rs.status() shows 3 members:

rs.status();
{
	"set" : "appng",
	...
	"members" : [
		{
			"_id" : 0,
			"name" : "localhost:8001",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			...
			"electionTime" : Timestamp(1558424157, 1),
			"electionDate" : ISODate("2019-05-21T07:35:57Z"),
			...
		},
		{
			"_id" : 1,
			"name" : "localhost:8002",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			...
			"syncingTo" : "localhost:8001",
			"syncSourceHost" : "localhost:8001",
			...
		},
		{
			"_id" : 2,
			"name" : "localhost:8003",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			...
		}
	],
	"ok" : 1
}

Now, when starting my application, everything is fine; the configuration is detected correctly:

2019-05-21 09:54:09,100 INFO  ["localhost-startStop-1"] org.mongodb.driver.cluster: No server chosen by com.mongodb.client.internal.MongoClientDelegate$1@3e25dcd6 from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, serverDescriptions=[ServerDescription{address=localhost:8002, type=UNKNOWN, state=CONNECTING}, ServerDescription{address=localhost:8001, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
2019-05-21 09:54:09,107 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.connection: Opened connection [connectionId{localValue:1, serverValue:7}] to localhost:8002
2019-05-21 09:54:09,107 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8001"] org.mongodb.driver.connection: Opened connection [connectionId{localValue:2, serverValue:15}] to localhost:8001
2019-05-21 09:54:09,119 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8002, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=6747329, setName='appng', canonicalAddress=localhost:8002, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='localhost:8001', tagSet=TagSet{[]}, electionId=null, setVersion=3, lastWriteDate=Tue May 21 09:54:09 CEST 2019, lastUpdateTimeNanos=37785965920298}
2019-05-21 09:54:09,121 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.cluster: Adding discovered server localhost:8003 to client view of cluster
2019-05-21 09:54:09,121 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8001"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8001, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=9617219, setName='appng', canonicalAddress=localhost:8001, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='localhost:8001', tagSet=TagSet{[]}, electionId=7fffffff000000000000001c, setVersion=3, lastWriteDate=Tue May 21 09:54:09 CEST 2019, lastUpdateTimeNanos=37785968337383}
2019-05-21 09:54:09,122 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8001"] org.mongodb.driver.cluster: Setting max election id to 7fffffff000000000000001c from replica set primary localhost:8001
2019-05-21 09:54:09,122 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8001"] org.mongodb.driver.cluster: Setting max set version to 3 from replica set primary localhost:8001
2019-05-21 09:54:09,122 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8001"] org.mongodb.driver.cluster: Discovered replica set primary localhost:8001
2019-05-21 09:54:09,125 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8003"] org.mongodb.driver.connection: Opened connection [connectionId{localValue:3, serverValue:15}] to localhost:8003
2019-05-21 09:54:09,126 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8003"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8003, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=1173638, setName='appng', canonicalAddress=localhost:8003, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='localhost:8001', tagSet=TagSet{[]}, electionId=null, setVersion=3, lastWriteDate=Tue May 21 09:53:59 CEST 2019, lastUpdateTimeNanos=37785973691794}
2019-05-21 09:54:09,134 INFO  ["localhost-startStop-1"] org.mongodb.driver.connection: Opened connection [connectionId{localValue:4, serverValue:16}] to localhost:8001

Now I do a regular shutdown of the master (localhost:8001) -> the SECONDARY becomes the new PRIMARY.

Logs of SECONDARY:

2019-05-21T09:58:23.911+0200 I REPL     [replexec-11] Starting an election, since we've seen no PRIMARY in the past 10000ms
2019-05-21T09:58:23.911+0200 I REPL     [replexec-11] conducting a dry run election to see if we could be elected. current term: 28
2019-05-21T09:58:23.911+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to localhost:8001
2019-05-21T09:58:23.912+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:8001 - HostUnreachable: Connection refused
2019-05-21T09:58:23.912+0200 I REPL     [replexec-8] VoteRequester(term 28 dry run) received a yes vote from localhost:8003; response message: { term: 28, voteGranted: true, reason: "", ok: 1.0 }
2019-05-21T09:58:23.912+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:8001 due to failed operation on a connection
2019-05-21T09:58:23.913+0200 I REPL     [replexec-8] dry election run succeeded, running for election in term 29
2019-05-21T09:58:23.962+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Connecting to localhost:8001
2019-05-21T09:58:23.962+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Failed to connect to localhost:8001 - HostUnreachable: Connection refused
2019-05-21T09:58:23.962+0200 I ASIO     [NetworkInterfaceASIO-Replication-0] Dropping all pooled connections to localhost:8001 due to failed operation on a connection
2019-05-21T09:58:23.963+0200 I REPL     [replexec-10] VoteRequester(term 29) failed to receive response from localhost:8001: HostUnreachable: Connection refused
2019-05-21T09:58:23.964+0200 I REPL     [replexec-9] VoteRequester(term 29) received a yes vote from localhost:8003; response message: { term: 29, voteGranted: true, reason: "", ok: 1.0 }
2019-05-21T09:58:23.964+0200 I REPL     [replexec-9] election succeeded, assuming primary role in term 29
2019-05-21T09:58:23.964+0200 I REPL     [replexec-9] transition to PRIMARY from SECONDARY
2019-05-21T09:58:23.964+0200 I REPL     [replexec-9] Resetting sync source to empty, which was :27017
2019-05-21T09:58:23.964+0200 I REPL     [replexec-9] Entering primary catch-up mode.

rs.status() also confirms that everything is fine:

rs.status();
{
	"set" : "appng",
	...
	"members" : [
		{
			"_id" : 0,
			"name" : "localhost:8001",
			"health" : 0,
			"state" : 8,
			"stateStr" : "(not reachable/healthy)",
			...
		},
		{
			"_id" : 1,
			"name" : "localhost:8002",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			...
			"electionTime" : Timestamp(1558425503, 1),
			"electionDate" : ISODate("2019-05-21T07:58:23Z"),
			...
		},
		{
			"_id" : 2,
			"name" : "localhost:8003",
			"health" : 1,
			"state" : 7,
			"stateStr" : "ARBITER",
			...
		}
	],
	"ok" : 1
}

When I now access my application, the following happens:

2019-05-21 09:58:19,182 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8003"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8003, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=1025366, setName='appng', canonicalAddress=localhost:8003, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=3, lastWriteDate=Tue May 21 09:58:09 CEST 2019, lastUpdateTimeNanos=38036029203176}
2019-05-21 09:58:19,186 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8002, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1106506, setName='appng', canonicalAddress=localhost:8002, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=3, lastWriteDate=Tue May 21 09:58:09 CEST 2019, lastUpdateTimeNanos=38036033318648}
2019-05-21 09:58:19,189 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8001"] org.mongodb.driver.cluster: Exception in monitor thread while connecting to server localhost:8001
com.mongodb.MongoSocketOpenException: Exception opening socket
	at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:70)
	at com.mongodb.internal.connection.InternalStreamConnection.open(InternalStreamConnection.java:126)
	at com.mongodb.internal.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:131)
	at java.lang.Thread.run(Thread.java:748)
Caused by: java.net.ConnectException: Connection refused (Connection refused)
	at java.net.PlainSocketImpl.socketConnect(Native Method)
	at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
	at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
	at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
	at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
	at java.net.Socket.connect(Socket.java:589)
	at com.mongodb.internal.connection.SocketStreamHelper.initialize(SocketStreamHelper.java:64)
	at com.mongodb.internal.connection.SocketStream.initializeSocket(SocketStream.java:79)
	at com.mongodb.internal.connection.SocketStream.open(SocketStream.java:65)
	... 3 more
2019-05-21 09:58:29,184 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8003"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8003, type=REPLICA_SET_ARBITER, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=null, roundTripTimeNanos=1021370, setName='appng', canonicalAddress=localhost:8003, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='localhost:8002', tagSet=TagSet{[]}, electionId=null, setVersion=3, lastWriteDate=Tue May 21 09:58:09 CEST 2019, lastUpdateTimeNanos=38046031501403}
2019-05-21 09:58:29,188 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.cluster: Monitor thread successfully connected to server with description ServerDescription{address=localhost:8002, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 6, 12]}, minWireVersion=0, maxWireVersion=6, maxDocumentSize=16777216, logicalSessionTimeoutMinutes=30, roundTripTimeNanos=1068662, setName='appng', canonicalAddress=localhost:8002, hosts=[localhost:8002, localhost:8001], passives=[], arbiters=[localhost:8003], primary='localhost:8002', tagSet=TagSet{[]}, electionId=7fffffff000000000000001d, setVersion=3, lastWriteDate=Tue May 21 09:58:25 CEST 2019, lastUpdateTimeNanos=38046035124501}
2019-05-21 09:58:29,188 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.cluster: Setting max election id to 7fffffff000000000000001d from replica set primary localhost:8002
2019-05-21 09:58:29,188 INFO  ["cluster-ClusterId{value='5ce3aea150c1517f2d165529', description='null'}-localhost:8002"] org.mongodb.driver.cluster: Discovered replica set primary localhost:8002
2019-05-21 10:03:35,324 INFO  ["http-nio-8080-exec-6"] org.mongodb.driver.connection: Closed connection [connectionId{localValue:15, serverValue:8}] to localhost:8002 because there was a socket exception raised by this connection.

So the driver is correctly able to detect that localhost:8002 became the new PRIMARY.

But now the weird thing is: although the new primary was elected, subsequent calls throw an exception:

2019-05-21 10:09:56,355 INFO  ["http-nio-8080-exec-5"] org.mongodb.driver.connection: Opened connection [connectionId{localValue:85, serverValue:14}] to localhost:8002
2019-05-21 10:10:01,363 INFO  ["http-nio-8080-exec-5"] org.mongodb.driver.connection: Closed connection [connectionId{localValue:85, serverValue:14}] to localhost:8002 because there was a socket exception raised by this connection.
...
com.mongodb.MongoSocketReadTimeoutException: Timeout while receiving message
	at com.mongodb.internal.connection.InternalStreamConnection.translateReadException(InternalStreamConnection.java:553)
	at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:444)
	at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:295)
	at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:255)
	at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:99)
	at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:444)
	at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:72)
	at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:200)
	at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:269)
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:131)
	at com.mongodb.operation.MixedBulkWriteOperation.executeCommand(MixedBulkWriteOperation.java:433)
	at com.mongodb.operation.MixedBulkWriteOperation.executeBulkWriteBatch(MixedBulkWriteOperation.java:259)
	at com.mongodb.operation.MixedBulkWriteOperation.access$700(MixedBulkWriteOperation.java:70)
	at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:203)
	at com.mongodb.operation.MixedBulkWriteOperation$1.call(MixedBulkWriteOperation.java:194)
	at com.mongodb.operation.OperationHelper.withReleasableConnection(OperationHelper.java:424)
	at com.mongodb.operation.MixedBulkWriteOperation.execute(MixedBulkWriteOperation.java:194)
	at com.mongodb.operation.BaseWriteOperation.execute(BaseWriteOperation.java:148)
	at com.mongodb.operation.BaseWriteOperation.execute(BaseWriteOperation.java:52)
	at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:193)
	at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:166)
	at com.mongodb.DBCollection.executeWriteOperation(DBCollection.java:347)
	at com.mongodb.DBCollection.update(DBCollection.java:570)
	at com.mongodb.DBCollection.update(DBCollection.java:493)
	at com.mongodb.DBCollection.update(DBCollection.java:469)
	at com.mongodb.DBCollection.update(DBCollection.java:447)
	at com.mongodb.DBCollection.update(DBCollection.java:512)
	at com.mongodb.DBCollection.update(DBCollection.java:527)
	...(call from application)
Caused by: java.net.SocketTimeoutException: Read timed out
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
	at java.net.SocketInputStream.read(SocketInputStream.java:171)
	at java.net.SocketInputStream.read(SocketInputStream.java:141)
	at com.mongodb.internal.connection.SocketStream.read(SocketStream.java:109)
	at com.mongodb.internal.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:570)
	at com.mongodb.internal.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:441)
	... 49 more

So I'm wondering why this exception is thrown, since the PRIMARY localhost:8002 is doing fine.

I hope this makes my problem comprehensible.
Looking forward to your reply, and thank you in advance for your efforts.

Regards
Matthias

Comment by Jeffrey Yemin [ 03/May/19 ]

Sure, that would help, but as these types of issues are often related to the environment in which the application is running, please also share the results of actually running the application in your environment.

Comment by Mstthias Müller [ 03/May/19 ]

Hi Jeff,

how about a small Java project that demonstrates the issue? I could put it on GitHub...

Regards

Matthias

Comment by Jeffrey Yemin [ 03/May/19 ]

Hi madness-inc,

Your assumption is correct, but without more information, I can't say whether you're hitting the same issue as the reporter. Are you also seeing application threads stuck in socket read, as described in the first comment? If not, you're probably encountering a different issue and we'd need to see some stack traces from your application to diagnose further.

Comment by Mstthias Müller [ 03/May/19 ]

Hi Jeff and MongoDB team,

sorry to bother you with this old ticket, but I still can't figure out how a cluster failover/new primary can be handled using the Java driver.

My assumptions are:

When connecting to a simple replica set (primary, 1 secondary, 1 arbiter), and a new primary is elected (because the old one crashed or was shut down for maintenance), the driver should transparently handle that case by not using the unavailable node any more (until it is available again).

But this does not happen; the driver continues trying to use the unavailable node, leading to application errors.

So the question is: Are my assumptions wrong? Or is it just a question of correctly configuring the driver?

PS: Can anybody fix the typo in my profile's full name?

Regards

Matthias

Comment by Jeffrey Yemin [ 19/Jun/15 ]

Great, I updated the relevant metadata on the ticket. Good luck, and thanks for reporting this issue. It's something I had not previously considered.

Comment by Attila Tőkés [ 19/Jun/15 ]

Ok, I think it would be useful to keep open. Thanks!

Comment by Jeffrey Yemin [ 19/Jun/15 ]

I suppose "expected" is in the eye of the beholder. What I meant is that it's the behavior I expect given what I know of the overall design of the driver. If you'd like to keep this issue open, I will, but I will change the issue type from Bug to Improvement.

Comment by Attila Tőkés [ 19/Jun/15 ]

Ok, I will set some timeout, but I don't think this is expected behavior. The driver apparently knows that the node is dead and that the primary was changed. The correct behavior would be to automatically cut the old connection and try to execute the operation on the new primary.

Comment by Jeffrey Yemin [ 19/Jun/15 ]

In that case I'd see this as the expected behavior. The driver is blocked waiting for a response from the server, and it's not getting one. Eventually it will time out according to the value of the MongoClientOptions.socketTimeout property, which defaults to not timing out at all. So a couple of suggestions:

  • For planned maintenance windows, first kill the mongod process, then shut down the VM. This gives the mongod server a chance to close all open sockets, and then the driver will throw an immediate exception.
  • To handle unplanned loss of a VM, set MongoClientOptions.socketTimeout to a value that is reasonable for your application. It will force the socket to time out on that read instead of waiting forever.
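The second suggestion, expressed with the 3.x options builder. The values are illustrative only, not recommendations; pick what fits your application's latency profile:

```java
import com.mongodb.MongoClientOptions;

// Illustrative values only. socketTimeout bounds how long a blocked read can
// hang (default 0 = no timeout); connectTimeout bounds connection attempts.
MongoClientOptions options = MongoClientOptions.builder()
        .socketTimeout(5000)
        .connectTimeout(10000)
        .build();
```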
Comment by Attila Tőkés [ 19/Jun/15 ]

I killed the VM (nosql-x64-node-2) running the primary. It is expected to be unreachable.

Comment by Jeffrey Yemin [ 19/Jun/15 ]

The java.net.NoRouteToHostException exception suggests that you didn't simply kill the primary. As the javadoc for that exception says: "Typically, the remote host cannot be reached because of an intervening firewall, or if an intermediate router is down."

Can you explain exactly what you did to kill the primary?

Comment by Attila Tőkés [ 19/Jun/15 ]

Application thread stuck in socket read:

java.lang.Thread.State: RUNNABLE
	at java.net.SocketInputStream.socketRead0(Native Method)
	at java.net.SocketInputStream.read(SocketInputStream.java:152)
	at java.net.SocketInputStream.read(SocketInputStream.java:122)
	at com.mongodb.connection.SocketStream.read(SocketStream.java:85)
	at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:491)
	at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:221)
	at com.mongodb.connection.UsageTrackingInternalConnection.receiveMessage(UsageTrackingInternalConnection.java:102)
	at com.mongodb.connection.DefaultConnectionPool$PooledConnection.receiveMessage(DefaultConnectionPool.java:416)
	at com.mongodb.connection.WriteCommandProtocol.receiveMessage(WriteCommandProtocol.java:184)
	at com.mongodb.connection.WriteCommandProtocol.execute(WriteCommandProtocol.java:76)
	at com.mongodb.connection.UpdateCommandProtocol.execute(UpdateCommandProtocol.java:63)
	at com.mongodb.connection.UpdateCommandProtocol.execute(UpdateCommandProtocol.java:37)
	at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:155)
	at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:219)
	at com.mongodb.connection.DefaultServerConnection.updateCommand(DefaultServerConnection.java:120)
	at com.mongodb.operation.UpdateOperation.executeCommandProtocol(UpdateOperation.java:76)
	at com.mongodb.operation.BaseWriteOperation$1.call(BaseWriteOperation.java:111)
	at com.mongodb.operation.BaseWriteOperation$1.call(BaseWriteOperation.java:106)
	at com.mongodb.operation.OperationHelper.withConnectionSource(OperationHelper.java:186)
	at com.mongodb.operation.OperationHelper.withConnection(OperationHelper.java:177)
	at com.mongodb.operation.BaseWriteOperation.execute(BaseWriteOperation.java:106)
	at com.mongodb.operation.BaseWriteOperation.execute(BaseWriteOperation.java:58)
	at com.mongodb.Mongo.execute(Mongo.java:747)
	at com.mongodb.Mongo$2.execute(Mongo.java:730)
	at com.mongodb.DBCollection.executeWriteOperation(DBCollection.java:327)
	at com.mongodb.DBCollection.update(DBCollection.java:461)
	at com.mongodb.DBCollection.update(DBCollection.java:426)
	at com.mongodb.DBCollection.update(DBCollection.java:485)
	at com.mongodb.DBCollection.update(DBCollection.java:500)

Generated at Thu Feb 08 08:55:43 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.