[JAVA-2157] Java driver stops connecting to one server after cluster status update Created: 22/Mar/16  Updated: 03/Jun/22  Resolved: 04/Oct/16

Status: Closed
Project: Java Driver
Component/s: Async
Affects Version/s: 3.2.2
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Francisco Alejandro Lozano López Assignee: Unassigned
Resolution: Duplicate Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Java 8, Linux, AWS


Issue Links:
Duplicate
duplicates JAVA-2302 Memory leak on Java async driver + Ne... Closed
 Description   

This issue was originally reported in the JAVA project but was moved to support (as https://jira.mongodb.org/browse/SUPPORT-1538). Now I've been asked to re-open it in the JAVA project again:

We have a 3-node replica set and some Java clients connecting to it.

Servers are:

  • 0001: development-trantor-rs01-0001 - ip-10-1-xxx-49.trantor-southwest-1.compute.internal
  • 0002: development-trantor-rs01-0002 - ip-10-1-xxx-203.trantor-southwest-1.compute.internal
  • 0003: development-trantor-rs01-0003 - ip-10-1-xxx-227.trantor-southwest-1.compute.internal

Something happened in our primary 0001:

2015-12-14T02:37:18.863+0000 I -        [conn517493] Assertion: 28537:Demoted from primary while removing from foo.bar
2015-12-14T02:37:18.990+0000 I QUERY    [conn320297] getmore local.oplog.rs query: { ts: { $gt: Timestamp 1449128999000|2 } } cursorid:6554465004290 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:7 nreturned:4 reslen:544 locks:{ Global: { acquireCount: { r: 20 }, acquireWaitCount: { r: 6 }, timeAcquiringMicros: { r: 90655565 } }, Database: { acquireCount: { r: 10 } }, oplog: { acquireCount: { r: 10 } } } user_key_comparison_count:618 block_cache_hit_count:14 block_read_count:0 block_read_byte:0 internal_key_skipped_count:11 internal_delete_skipped_count:0 get_from_memtable_count:0 seek_on_memtable_count:7 seek_child_seek_count:21 27265ms
2015-12-14T02:37:18.990+0000 I QUERY    [conn119] getmore local.oplog.rs query: { ts: { $gte: Timestamp 1447900202000|2 } } cursorid:6554135393530 ntoreturn:0 keyUpdates:0 writeConflicts:0 numYields:5 nreturned:4 reslen:544 locks:{ Global: { acquireCount: { r: 14 }, acquireWaitCount: { r: 6 }, timeAcquiringMicros: { r: 43403120 } }, Database: { acquireCount: { r: 7 } }, oplog: { acquireCount: { r: 7 } } } user_key_comparison_count:442 block_cache_hit_count:10 block_read_count:0 block_read_byte:0 internal_key_skipped_count:9 internal_delete_skipped_count:0 get_from_memtable_count:0 seek_on_memtable_count:5 seek_child_seek_count:15 15791ms
2015-12-14T02:37:19.005+0000 I NETWORK  [conn566148] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.1.xxx.208:54911]
2015-12-14T02:37:19.005+0000 I NETWORK  [conn566145] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.1.xxx.243:44347]
2015-12-14T02:37:19.015+0000 I NETWORK  [conn566146] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.1.xxx.84:36672]
2015-12-14T02:37:19.051+0000 I NETWORK  [conn566147] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.1.xxx.10:47207]
2015-12-14T02:37:19.066+0000 I -        [conn562910] Assertion: 28537:Demoted from primary while removing from foo.bar
2015-12-14T02:37:19.066+0000 I -        [conn517498] Assertion: 28537:Demoted from primary while removing from foo.bar
2015-12-14T02:37:19.091+0000 I -        [conn496978] Assertion: 28537:Demoted from primary while removing from foo.bar
2015-12-14T02:37:19.091+0000 I -        [conn497062] Assertion: 28537:Demoted from primary while removing from foo.bar
2015-12-14T02:37:19.253+0000 I NETWORK  [conn119] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.1.xxx.227:57257]
2015-12-14T02:37:19.253+0000 I NETWORK  [conn320297] SocketException handling request, closing client connection: 9001 socket exception [SEND_ERROR] server [10.1.xxx.176:42329]
2015-12-14T02:37:19.253+0000 I NETWORK  [conn566150] end connection 10.1.xxx.176:42718 (10 connections now open)
2015-12-14T02:37:19.630+0000 I QUERY    [conn566149] query local.oplog.rs planSummary: COLLSCAN ntoreturn:1 ntoskip:0 nscanned:0 nscannedObjects:1 keyUpdates:0 writeConflicts:0 numYields:1 nreturned:1 reslen:117 locks:{ Global: { acquireCount: { r: 4 } }, Database: { acquireCount: { r: 2 }, acquireWaitCount: { r: 1 }, timeAcquiringMicros: { r: 43811 } }, oplog: { acquireCount: { r: 2 } } } user_key_comparison_count:110 block_cache_hit_count:3 block_read_count:1 block_read_byte:16106 internal_key_skipped_count:1 internal_delete_skipped_count:0 get_from_memtable_count:0 seek_on_memtable_count:2 seek_child_seek_count:6 197ms
2015-12-14T02:37:19.657+0000 I NETWORK  [initandlisten] connection accepted from 10.1.xxx.227:44285 #566151 (10 connections now open)
2015-12-14T02:37:19.778+0000 I NETWORK  [initandlisten] connection accepted from 127.0.0.1:43518 #566152 (11 connections now open)
2015-12-14T02:37:19.787+0000 I NETWORK  [conn566152] end connection 127.0.0.1:43518 (10 connections now open)
2015-12-14T02:37:20.031+0000 I CONTROL  [conn517493]
 0x108a4a2 0x1027129 0x100ba6e 0x100bb1c 0xabae7c 0xc93603 0xc9378b 0xc93adb 0xc941bd 0xa6c563 0xa6ccb0 0xa6d366 0xa6fe9c 0xa91e31 0xa92e3c 0xa93b0b 0xc6128b 0xb74472 0x8b36d5 0x103bac9 0x7fef860d6df5 0x7fef84b881ad

after which 0002 became the new primary. 0001 recovered properly (with some rollbacks) and transitioned to secondary.
mongod.log from 0001:

2015-12-14T02:38:07.456+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-1-xxx-203.trantor-southeast-1.compute.internal:27017
2015-12-14T02:38:07.472+0000 I REPL     [rsBackgroundSync] replSet our last op time fetched: Dec 14 02:37:07:1
2015-12-14T02:38:07.472+0000 I REPL     [rsBackgroundSync] replset source's GTE: Dec 14 02:38:04:1
2015-12-14T02:38:07.473+0000 I REPL     [rsBackgroundSync] beginning rollback
2015-12-14T02:38:07.473+0000 I REPL     [rsBackgroundSync] rollback 0
2015-12-14T02:38:07.474+0000 I REPL     [ReplicationExecutor] transition to ROLLBACK
2015-12-14T02:38:07.474+0000 I REPL     [rsBackgroundSync] rollback 1
2015-12-14T02:38:07.475+0000 I REPL     [rsBackgroundSync] rollback 2 FindCommonPoint
2015-12-14T02:38:07.507+0000 I REPL     [SyncSourceFeedback] replset setting syncSourceFeedback to ip-10-1-xxx-203.trantor-southeast-1.compute.internal:27017
2015-12-14T02:38:07.569+0000 I REPL     [rsBackgroundSync] replSet info rollback our last optime:   Dec 14 02:37:07:1
2015-12-14T02:38:07.569+0000 I REPL     [rsBackgroundSync] replSet info rollback their last optime: Dec 14 02:38:04:1
2015-12-14T02:38:07.569+0000 I REPL     [rsBackgroundSync] replSet info rollback diff in end of log times: -57 seconds
2015-12-14T02:38:07.570+0000 I REPL     [rsBackgroundSync] replSet rollback found matching events at Dec 14 02:36:24:2
2015-12-14T02:38:07.570+0000 I REPL     [rsBackgroundSync] replSet rollback findcommonpoint scanned : 6
2015-12-14T02:38:07.570+0000 I REPL     [rsBackgroundSync] replSet rollback 3 fixup
2015-12-14T02:38:07.583+0000 I REPL     [rsBackgroundSync] rollback 3.5
2015-12-14T02:38:07.585+0000 I REPL     [rsBackgroundSync] rollback 4 n:4
2015-12-14T02:38:07.585+0000 I REPL     [rsBackgroundSync] replSet minvalid=Dec 14 02:38:04 566e2b8c:1
2015-12-14T02:38:07.616+0000 I REPL     [rsBackgroundSync] rollback 4.6
2015-12-14T02:38:07.616+0000 I REPL     [rsBackgroundSync] rollback 4.7
2015-12-14T02:38:07.668+0000 I REPL     [rsBackgroundSync] rollback 5 d:4 u:0
2015-12-14T02:38:07.668+0000 I REPL     [rsBackgroundSync] rollback 6
2015-12-14T02:38:07.711+0000 I REPL     [rsBackgroundSync] rollback done
2015-12-14T02:38:07.711+0000 I REPL     [ReplicationExecutor] transition to RECOVERING
2015-12-14T02:38:07.711+0000 I REPL     [rsBackgroundSync] rollback finished
2015-12-14T02:38:07.711+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-1-xxx-203.trantor-southeast-1.compute.internal:27017
2015-12-14T02:38:07.711+0000 I REPL     [ReplicationExecutor] syncing from: ip-10-1-xxx-203.trantor-southeast-1.compute.internal:27017
2015-12-14T02:38:07.712+0000 I NETWORK  [conn566149] end connection 10.1.157.227:44284 (5 connections now open)
2015-12-14T02:38:07.717+0000 I NETWORK  [conn566151] end connection 10.1.157.227:44285 (4 connections now open)
2015-12-14T02:38:07.813+0000 I REPL     [ReplicationExecutor] transition to SECONDARY

But after that, 0001 (now a secondary) was already unreachable from the Java driver:

[2015-12-14T02:37:25.235Z] [DEBUG] [cluster-ClusterId{value='566a2b1badbb126de1a555c2', description='null'}-ip-10-1-xxx-227.trantor-southeast-1.compute.internal:27017] org.mongodb.driver.cluster:? Updating cluster description to  {type=REPLICA_SET, servers=[{address=ip-10-1-xxx-203.trantor-southeast-1.compute.internal:27017, type=REPLICA_SET_PRIMARY, roundTripTime=4.8 ms, state=CONNECTED}, {address=ip-10-1-xxx-227.trantor-southeast-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, roundTripTime=2.0 ms, state=CONNECTED}, {address=ip-10-1-xxx-49.trantor-southeast-1.compute.internal:27017, type=UNKNOWN, state=CONNECTING}]

We ran rs.stepDown() on 0002, after which 0001 became primary again, but the Java driver still couldn't connect to it; we were getting messages like this in the Java client:

[2015-12-15T06:28:39.451Z] [INFO] [Thread-5970] org.mongodb.driver.cluster:? No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, all=[ServerDescription{address=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=3210551, setName='rs01', canonicalAddress=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017', tagSet=TagSet{[]}}, ServerDescription{address=ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, electionId=null, maxDocumentSize=16777216, roundTripTimeNanos=1189727, setName='rs01', canonicalAddress=ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017', tagSet=TagSet{[]}}, ServerDescription{address=ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out

All nodes agree on the primary:

pdsh -w "development-trantor-mongo-rs01-000[1-3]" "mongo --eval 'printjson(db.isMaster().primary)'"
development-trantor-mongo-rs01-0001: MongoDB shell version: 3.0.5
development-trantor-mongo-rs01-0002: MongoDB shell version: 3.0.5
development-trantor-mongo-rs01-0003: MongoDB shell version: 3.0.5
development-trantor-mongo-rs01-0001: "ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017"
development-trantor-mongo-rs01-0002: "ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017"
development-trantor-mongo-rs01-0003: "ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017"

And we can connect with the command-line client from the same host that runs the failing Java code:

# mongo --host ip-10-1-xxx-49.trantor-southwest-1.compute.internal
MongoDB shell version: 3.0.4
connecting to: ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017/test
rs01:PRIMARY>

The connection string looks like:

mongodb://development-trantor-mongo-rs01-0001,development-trantor-mongo-rs01-0002,development-trantor-mongo-rs01-0003/?w=majority&journal=true

(these hostnames are just CNAMEs for the real IPs; we can connect to them from the machine too)
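
For reference, here is a minimal sketch of constructing an async client from a connection string of this shape (a sketch assuming the 3.x async driver API; the ticket does not show how the application actually builds its client):

import com.mongodb.async.client.MongoClient;
import com.mongodb.async.client.MongoClients;

public class ClientFromUri {
    public static void main(String[] args) {
        // Same shape of URI as above; the hostnames are CNAMEs resolved on the client host.
        MongoClient client = MongoClients.create(
                "mongodb://development-trantor-mongo-rs01-0001,"
                        + "development-trantor-mongo-rs01-0002,"
                        + "development-trantor-mongo-rs01-0003"
                        + "/?w=majority&journal=true");
        // ... application code ...
        client.close();
    }
}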

The thread dump shows a thread like this:

"cluster-ClusterId{value='566a2b1badbb126de1a555c2', description='null'}-ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017" #107390 daemon prio=5 os_prio=0 tid=0x00007ff9f4243800 nid=0x5a41 waiting on condition [0x00007ff9ed8f7000]
   java.lang.Thread.State: WAITING (parking)
    at sun.misc.Unsafe.park(Native Method)
    - parking to wait for  <0x00000000efce9b88> (a java.util.concurrent.CountDownLatch$Sync)
    at java.util.concurrent.locks.LockSupport.park(LockSupport.java:175)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.parkAndCheckInterrupt(AbstractQueuedSynchronizer.java:836)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.doAcquireSharedInterruptibly(AbstractQueuedSynchronizer.java:997)
    at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireSharedInterruptibly(AbstractQueuedSynchronizer.java:1304)
    at java.util.concurrent.CountDownLatch.await(CountDownLatch.java:231)
    at com.mongodb.connection.netty.NettyStream$FutureAsyncCompletionHandler.get(NettyStream.java:330)
    at com.mongodb.connection.netty.NettyStream.read(NettyStream.java:160)
    at com.mongodb.connection.InternalStreamConnection.receiveResponseBuffers(InternalStreamConnection.java:491)
    at com.mongodb.connection.InternalStreamConnection.receiveMessage(InternalStreamConnection.java:221)
    at com.mongodb.connection.CommandHelper.receiveReply(CommandHelper.java:134)
    at com.mongodb.connection.CommandHelper.receiveCommandResult(CommandHelper.java:121)
    at com.mongodb.connection.CommandHelper.executeCommand(CommandHelper.java:32)
    at com.mongodb.connection.InternalStreamConnectionInitializer.initializeConnectionDescription(InternalStreamConnectionInitializer.java:83)
    at com.mongodb.connection.InternalStreamConnectionInitializer.initialize(InternalStreamConnectionInitializer.java:43)
    at com.mongodb.connection.InternalStreamConnection.open(InternalStreamConnection.java:115)
    at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:127)
    - locked <0x00000000efce9cb0> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
    at java.lang.Thread.run(Thread.java:745)

whereas the threads for the other two nodes look like this:

"cluster-ClusterId{value='566a2b1badbb126de1a555c2', description='null'}-ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017" #144394 daemon prio=5 os_prio=0 tid=0x00007ffa20cbb000 nid=0x3b0d waiting on condition [0x00007ff9ed9f8000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:225)
    at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:167)
    - locked <0x00000000eff7fc00> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
    at java.lang.Thread.run(Thread.java:745)
 
"cluster-ClusterId{value='566a2b1badbb126de1a555c2', description='null'}-ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017" #144373 daemon prio=5 os_prio=0 tid=0x00007ffa20e83000 nid=0x3ac9 waiting on condition [0x00007ff9e747a000]
   java.lang.Thread.State: TIMED_WAITING (sleeping)
    at java.lang.Thread.sleep(Native Method)
    at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.waitForNext(DefaultServerMonitor.java:225)
    at com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable.run(DefaultServerMonitor.java:167)
    - locked <0x00000000eff19fc0> (a com.mongodb.connection.DefaultServerMonitor$ServerMonitorRunnable)
    at java.lang.Thread.run(Thread.java:745)
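
The trace for 0001 above is the telling one: the monitor thread is parked in FutureAsyncCompletionHandler.get() on a CountDownLatch with no timeout, so if the Netty read callback is never delivered, nothing ever counts the latch down and the thread stays parked. A hypothetical sketch of that sync-over-async pattern (the class and method names below are illustrative, not the driver's actual code):

import java.io.IOException;
import java.util.concurrent.CountDownLatch;

final class LatchCompletionHandler<T> {
    private final CountDownLatch latch = new CountDownLatch(1);
    private volatile T result;
    private volatile Throwable error;

    // Invoked by the async transport on success.
    void completed(T t) {
        result = t;
        latch.countDown();
    }

    // Invoked by the async transport on failure.
    void failed(Throwable t) {
        error = t;
        latch.countDown();
    }

    // Blocking accessor with no timeout: if neither completed() nor failed()
    // is ever invoked (e.g. the channel dies without a callback), await()
    // parks the calling thread indefinitely, which matches the
    // WAITING (parking) state of the 0001 monitor thread above.
    T get() throws IOException {
        try {
            latch.await();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IOException("Interrupted while waiting for completion", e);
        }
        if (error != null) {
            throw new IOException("Operation failed", error);
        }
        return result;
    }
}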

tcpdump shows no activity from the driver trying to connect (but activity does appear when we connect with the command-line client, so the tcpdump arguments are fine).

IP addresses and hostnames are unchanged.

We upgraded to 3.2.2 hoping that would fix it, but it didn't; it just took a while before it happened again:

[2016-03-21T14:34:55.862Z] [INFO] [Thread-4870] org.mongodb.driver.cluster:? No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, all=[ServerDescription{address=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=2897612, setName='rs01', canonicalAddress=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=null, setVersion=3}, ServerDescription{address=ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=802719, setName='rs01', canonicalAddress=ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=null, setVersion=3}, ServerDescription{address=ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out

rs.status() on the primary:

{
	"set" : "rs01",
	"date" : ISODate("2016-03-21T14:46:02.002Z"),
	"myState" : 1,
	"members" : [
		{
			"_id" : 0,
			"name" : "ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 1231298,
			"optime" : Timestamp(1458354484, 1),
			"optimeDate" : ISODate("2016-03-19T02:28:04Z"),
			"electionTime" : Timestamp(1458354624, 1),
			"electionDate" : ISODate("2016-03-19T02:30:24Z"),
			"configVersion" : 3,
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 216934,
			"optime" : Timestamp(1458354484, 1),
			"optimeDate" : ISODate("2016-03-19T02:28:04Z"),
			"lastHeartbeat" : ISODate("2016-03-21T14:46:01.081Z"),
			"lastHeartbeatRecv" : ISODate("2016-03-21T14:46:01.704Z"),
			"pingMs" : 2,
			"configVersion" : 3
		},
		{
			"_id" : 2,
			"name" : "ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 1230888,
			"optime" : Timestamp(1458354484, 1),
			"optimeDate" : ISODate("2016-03-19T02:28:04Z"),
			"lastHeartbeat" : ISODate("2016-03-21T14:46:01.659Z"),
			"lastHeartbeatRecv" : ISODate("2016-03-21T14:46:01.571Z"),
			"pingMs" : 3,
			"configVersion" : 3
		}
	],
	"ok" : 1
}

However, this time (with 3.2.2), the driver managed to recover after the stepDown():

The log on the app server was:

[2016-03-21T14:48:26.914Z] [INFO] [Thread-5293] org.mongodb.driver.cluster:? No server chosen by WritableServerSelector from cluster description ClusterDescription{type=REPLICA_SET, connectionMode=MULTIPLE, all=[ServerDescription{address=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=4268169, setName='rs01', canonicalAddress=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=3}, ServerDescription{address=ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_SECONDARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=739004, setName='rs01', canonicalAddress=ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='null', tagSet=TagSet{[]}, electionId=null, setVersion=3}, ServerDescription{address=ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, type=UNKNOWN, state=CONNECTING}]}. Waiting for 30000 ms before timing out
[2016-03-21T14:48:26.997Z] [INFO] [cluster-ClusterId{value='56ea96e31847a572e963fa73', description='null'}-ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017] org.mongodb.driver.cluster:? Monitor thread successfully connected to server with description ServerDescription{address=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, type=REPLICA_SET_PRIMARY, state=CONNECTED, ok=true, version=ServerVersion{versionList=[3, 0, 5]}, minWireVersion=0, maxWireVersion=3, maxDocumentSize=16777216, roundTripTimeNanos=3998244, setName='rs01', canonicalAddress=ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, hosts=[ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017, ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017], passives=[], arbiters=[], primary='ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017', tagSet=TagSet{[]}, electionId=56f009baca60543a10792068, setVersion=3}
[2016-03-21T14:48:26.997Z] [INFO] [cluster-ClusterId{value='56ea96e31847a572e963fa73', description='null'}-ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017] org.mongodb.driver.cluster:? Setting max election id to 56f009baca60543a10792068 from replica set primary ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017
[2016-03-21T14:48:26.997Z] [INFO] [cluster-ClusterId{value='56ea96e31847a572e963fa73', description='null'}-ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017] org.mongodb.driver.cluster:? Discovered replica set primary ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017

and the new status is:

 
rs01:SECONDARY> rs.status()
{
	"set" : "rs01",
	"date" : ISODate("2016-03-21T14:53:17.164Z"),
	"myState" : 2,
	"syncingTo" : "ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017",
	"members" : [
		{
			"_id" : 0,
			"name" : "ip-10-1-xxx-49.trantor-southwest-1.compute.internal:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 1231733,
			"optime" : Timestamp(1458571985, 1),
			"optimeDate" : ISODate("2016-03-21T14:53:05Z"),
			"syncingTo" : "ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017",
			"configVersion" : 3,
			"self" : true
		},
		{
			"_id" : 1,
			"name" : "ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017",
			"health" : 1,
			"state" : 1,
			"stateStr" : "PRIMARY",
			"uptime" : 217370,
			"optime" : Timestamp(1458571985, 1),
			"optimeDate" : ISODate("2016-03-21T14:53:05Z"),
			"lastHeartbeat" : ISODate("2016-03-21T14:53:15.828Z"),
			"lastHeartbeatRecv" : ISODate("2016-03-21T14:53:16.861Z"),
			"pingMs" : 5,
			"electionTime" : Timestamp(1458571706, 1),
			"electionDate" : ISODate("2016-03-21T14:48:26Z"),
			"configVersion" : 3
		},
		{
			"_id" : 2,
			"name" : "ip-10-1-xxx-227.trantor-southwest-1.compute.internal:27017",
			"health" : 1,
			"state" : 2,
			"stateStr" : "SECONDARY",
			"uptime" : 1231323,
			"optime" : Timestamp(1458571985, 1),
			"optimeDate" : ISODate("2016-03-21T14:53:05Z"),
			"lastHeartbeat" : ISODate("2016-03-21T14:53:16.411Z"),
			"lastHeartbeatRecv" : ISODate("2016-03-21T14:53:15.957Z"),
			"pingMs" : 0,
			"syncingTo" : "ip-10-1-xxx-203.trantor-southwest-1.compute.internal:27017",
			"configVersion" : 3
		}
	],
	"ok" : 1
}



 Comments   
Comment by Jeffrey Yemin [ 04/Oct/16 ]

This is very likely a duplicate of JAVA-2302, which has been fixed on master.

Comment by Jeffrey Yemin [ 16/Sep/16 ]

Thanks for providing feedback on whether you were able to reproduce with the default stream factory. Re-opening so that we can investigate further.

Comment by Francisco Alejandro Lozano López [ 16/Sep/16 ]

Yes, we haven't been able to reproduce it with the normal non-Netty stream factory, which is working well for us. For now we're OK with not using SSL for this, so ... it's fine.

Comment by Jeffrey Yemin [ 16/Sep/16 ]

As we have not heard back from you in some time regarding our request to test with the default stream factory, we are closing this issue. If you can provide additional information, we will be happy to re-open it.

Comment by Ross Lawley [ 24/Mar/16 ]

Hi Francisco,

It looks like the primary went down and the driver wasn't able to re-evaluate its status in the server monitoring after the primary was restarted. It looks like an Assertion: 28537:Demoted from primary while removing from foo.bar happened, and that led to a stack overflow:

2015-12-14T02:37:20.031+0000 I CONTROL  [conn517493]
 0x108a4a2 0x1027129 0x100ba6e 0x100bb1c 0xabae7c 0xc93603 0xc9378b 0xc93adb 0xc941bd 0xa6c563 0xa6ccb0 0xa6d366 0xa6fe9c 0xa91e31 0xa92e3c 0xa93b0b 0xc6128b 0xb74472 0x8b36d5 0x103bac9 0x7fef860d6df5 0x7fef84b881ad

It is possible that no callback was triggered by Netty in this scenario. I'll try to replicate it by killing a server, but previous attempts have failed to reproduce this issue. It would be useful if you could attempt to reproduce this with the default StreamFactoryFactory (which uses NIO.2) rather than NettyStreamFactoryFactory. Is that a possibility?
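
For reference, a minimal sketch of the two configurations being compared, the NettyStreamFactoryFactory setup versus the default NIO.2-based stream factory (a sketch assuming the 3.x async client settings API; exact builder methods can differ between driver versions):

import com.mongodb.ConnectionString;
import com.mongodb.async.client.MongoClient;
import com.mongodb.async.client.MongoClientSettings;
import com.mongodb.async.client.MongoClients;
import com.mongodb.connection.ClusterSettings;
import com.mongodb.connection.netty.NettyStreamFactoryFactory;

public class StreamFactoryComparison {
    public static void main(String[] args) {
        ConnectionString uri = new ConnectionString(
                "mongodb://development-trantor-mongo-rs01-0001,"
                        + "development-trantor-mongo-rs01-0002,"
                        + "development-trantor-mongo-rs01-0003"
                        + "/?w=majority&journal=true");

        // Netty-based streams: the configuration the hang was observed with.
        MongoClient nettyClient = MongoClients.create(
                MongoClientSettings.builder()
                        .clusterSettings(ClusterSettings.builder()
                                .applyConnectionString(uri)
                                .build())
                        .streamFactoryFactory(new NettyStreamFactoryFactory())
                        .build());

        // Default stream factory (NIO.2): simply do not set one.
        MongoClient defaultClient = MongoClients.create(uri);

        nettyClient.close();
        defaultClient.close();
    }
}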

Comment by Ross Lawley [ 22/Mar/16 ]

Thanks flozano,

Just reviewing and will add some feedback when I have reassessed the issue.

Ross
