[JAVA-2945] Mongo Driver with bad connection hanging forever and not timing out Created: 24/Aug/18  Updated: 27/Oct/23  Resolved: 28/Aug/18

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: 3.5.0
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: RameshPitani Assignee: Unassigned
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File ErrorLog1.txt     Text File ThreadDump.txt    

 Description   

Hi,

We are having a weird issue with Mongo java driver 3.5.0. We have following error initially returned by Mongo and application is handling error smoothly. But after this error, whatever next request picks mongo connection, Next request is hanging forever and not timing out after a certain interval.

 

  1. client-options
    mongo.connectionsPerHost=20
    mongo.threadsAllowedToBlockForConnectionMultiplier=2
    mongo.connectTimeout=30000
    mongo.maxWaitTime=10000
    mongo.socketKeepAlive=true
    mongo.socketTimeout=20000
    mongo.maxConnectionIdleTime=300000
    mongo.maxConnectionLifeTime=43200000
    mongo.heartbeatFrequency=3600000
    mongo.minHeartbeatFrequency=500
    mongo.heartbeatConnectTimeout=20000
    mongo.heartbeatSocketTimeout=20000

 

Caused by: com.mongodb.MongoQueryException: Query failed with error code 35 and error message '' on server localhost:27017Caused by: com.mongodb.MongoQueryException: Query failed with error code 35 and error message '' on server localhost:27017 at com.mongodb.connection.ProtocolHelper.getQueryFailureException(ProtocolHelper.java:131) at com.mongodb.connection.GetMoreProtocol.execute(GetMoreProtocol.java:88) at com.mongodb.connection.GetMoreProtocol.execute(GetMoreProtocol.java:49) at com.mongodb.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:159) at com.mongodb.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:289) at com.mongodb.connection.DefaultServerConnection.getMore(DefaultServerConnection.java:254) at com.mongodb.operation.QueryBatchCursor.getMore(QueryBatchCursor.java:218) at com.mongodb.operation.QueryBatchCursor.hasNext(QueryBatchCursor.java:105) at com.mongodb.MongoBatchCursorAdapter.hasNext(MongoBatchCursorAdapter.java:46) at com.mongodb.DBCursor.hasNext(DBCursor.java:145) at com.mongodb.DBCursor.fillArray(DBCursor.java:910) at com.mongodb.DBCursor.toArray(DBCursor.java:655) at com.mongodb.DBCursor.toArray(DBCursor.java:643)

 



 Comments   
Comment by RameshPitani [ 27/Aug/18 ]

Hi Jeff, No. I have informed about this on 08/24. IBM exposed mongo interface on top of TPF mainframe system. This allows java application to talk to TPF Mainframe system using NoSQL. Initially, when I raised this defect, I thought 2nd Issue(Connection hung) is related to Java Mongo driver issue. Today IBM confirmed that there is a bug on mainframe system (Closing Mongo Connection during cursor close) for 2nd error.

For 1st error((Query failed with error code 35 and error message)), IBM Mongo server can not handle if Find and GetMore come on different Connection. This is the limitation of IBM mongo server.

https://www.ibm.com/support/knowledgecenter/en/SSB23S_1.1.0.15/gtpd5/cmdbifcontainer.html 

Your prompt responses helped us to narrow down the problem. There is no problem with mongo java driver. Thank you for all your help. We can close this JIRA. 

Comment by Jeffrey Yemin [ 27/Aug/18 ]

pitani can you clarify? Are you not running against a MongoDB binary that was built by MongoDB, Inc? If not, where is the binary from?

Comment by RameshPitani [ 27/Aug/18 ]

The version of MongoDB on TPF is compatible with 2.6, however there are, certain 2.6 commands are not supported.

Comment by Jeffrey Yemin [ 27/Aug/18 ]

pitani for question 1, I'm asking what version of the MongoDB server that you are connecting to, not the driver version. Driver and server are versioned independently of each other.

Comment by RameshPitani [ 27/Aug/18 ]

Here I attached Thread Dump file when request hung.

Comment by RameshPitani [ 27/Aug/18 ]

Jeff,

We have observed an interesting clue in the log about Connection Hang Issue in the log (2nd Issue after Query failed with error code 35 and error message).
In our application, We are not Closing DBCursor manually. Cursor Closing is delegated to Spring MongoClientOptions.Builder.cursorFinalizerEnabled. This is the reason for "CleanCursors-2-thread-1" thread in the log file. CleanCursors thread is running after we have this error "Query failed with error code 35 and error message" for a different request on Connection Id(localValue:23). After "CleanCursors" killed cursor for ConnectionId(localValue:23), New request 10499772172419394 used same Mongo Connection(localValue:23). This 10499772172419394 request hang and never got the response back. What could be the problem? Is CleanCursors thread also closing Mongo Connection and Request hang for ever?

 

2018-08-27 12:52:23,333|DEBUG|||CleanCursors-2-thread-1|org.mongodb.driver.protocol.killcursor||Killing cursors [1] on connection [connectionId\{localValue:23, serverValue:0}] to server mongodb.prod.sabre.com:27017
2018-08-27 12:52:23,334|TRACE|||CleanCursors-2-thread-1|org.mongodb.driver.connection||Checked in connection [connectionId\{localValue:23, serverValue:0}] to server mongodb.prod.sabre.com:27017
2018-08-27 12:52:24,684|TRACE|10499772172419394|AS|ckiFromTG--MOM-Consumer-Thread-6|org.mongodb.driver.connection||Checked out connection [connectionId\{localValue:23, serverValue:0}] to server mongodb.prod.sabre.com:27017
2018-08-27 12:52:24,684|DEBUG|10499772172419394|AS|ckiFromTG--MOM-Consumer-Thread-6|org.mongodb.driver.protocol.query||Sending query of namespace tpfdf.LF22AA on connection [connectionId\{localValue:23, serverValue:0}] to server mongodb.prod.sabre.com:27017

 

Comment by RameshPitani [ 27/Aug/18 ]

Thank you Jeff.

1. This is our version. Please let us know if you see any problem with version I am using.

<spring-data-mongodb.version>1.8.4.RELEASE</spring-data-mongodb.version>
<mongo-java-driver.version>3.5.0</mongo-java-driver.version>

2. Yes. We are getting Error "Query failed with error code 35 and error message" even when we tested against the standalone server.  

mongo.host=ACPSTSB.vpars.sabre.com
mongo.port=27017
mongo.credentials=UserName:Password@tpfdf
mongo.connectTimeout=10000
mongo.maxWaitTime=10000
mongo.socketKeepAlive=true
mongo.socketTimeout=10000
mongo.maxConnectionIdleTime=300000
mongo.maxConnectionLifeTime=43200000
mongo.heartbeatFrequency=3600000
mongo.minHeartbeatFrequency=500
mongo.heartbeatConnectTimeout=20000
mongo.heartbeatSocketTimeout=20000

Spring Context file:
<mongo:mongo-client host="#{ckiMongoProperties['mongo.host']}" port="#{ckiMongoProperties['mongo.port']}" credentials="#{ckiMongoProperties['mongo.credentials']}">
<mongo:client-options connections-per-host="#{ckiMongoProperties['mongo.connectionsPerHost']}" threads-allowed-to-block-for-connection-multiplier="#{ckiMongoProperties['mongo.threadsAllowedToBlockForConnectionMultiplier']}" connect-timeout="#{ckiMongoProperties['mongo.connectTimeout']}" max-wait-time="#{ckiMongoProperties['mongo.maxWaitTime']}" socket-keep-alive="#{ckiMongoProperties['mongo.socketKeepAlive']}" socket-timeout="#{ckiMongoProperties['mongo.socketTimeout']}" max-connection-idle-time="#{ckiMongoProperties['mongo.maxConnectionIdleTime']}" max-connection-life-time="#{ckiMongoProperties['mongo.maxConnectionLifeTime']}" heartbeat-frequency="#{ckiMongoProperties['mongo.heartbeatFrequency']}" min-heartbeat-frequency="#{ckiMongoProperties['mongo.minHeartbeatFrequency']}" heartbeat-connect-timeout="#{ckiMongoProperties['mongo.heartbeatConnectTimeout']}" heartbeat-socket-timeout="#{ckiMongoProperties['mongo.heartbeatSocketTimeout']}"/>
</mongo:mongo-client>

<mongo:db-factory dbname="tpfdf" mongo-ref="mongo"/>

<mongo:mapping-converter id="mongoConverter" base-package="com.sabre.checkin.connector.tpfdf.converter.mongo">
<mongo:custom-converters base-package="com.sabre.checkin.connector.tpfdf.converter.mongo"/></mongo:mapping-converter>

<bean id="mongoTemplate" class="org.springframework.data.mongodb.core.MongoTemplate"> <constructor-arg name="mongoDbFactory" ref="mongoDbFactory"/> <constructor-arg ref="mongoConverter"/></bean>

 

Comment by Jeffrey Yemin [ 27/Aug/18 ]

Ramesh, the only way to force it would be to set the maximum connection pool size to one, but that would mean MongoClient wouldn't have any concurrency. A few further questions:

  1. What server version are you using? Looking at the error code in master, error code 35 maps to something that doesn't make sense in this context: https://docs.google.com/a/10gen.com/document/d/14hoL-vVUHqXYr0eRXqFdEE3iAV3oXZpT8hUqpS1mfGw/edit?pli=1&disco=AAAACCkmsf4, so I want to look at the error codes defined for the server version that you're actually using.
  2. Can you confirm that you can observe this error even when not using a load balancer? If so, what is the server topology (sharded, replset, standalone) and connection string you're using to test that scenario?
Comment by RameshPitani [ 27/Aug/18 ]

Hi Jeff, Reg. "Query failed with error code 35 and error message". I have one question..

Z/TPF Mongo Team claims that above error happens if the find and get more are on different connections. Is there a way to force Original Find and GetMore use same mongo connection? Will @Transactional work or any other solution?

Comment by Jeffrey Yemin [ 27/Aug/18 ]

If a killCursors message is being sent from a CleanCursors thread, it means that the cursor was not closed by the application (e.g., via a call to DBCursor.close).

Comment by RameshPitani [ 27/Aug/18 ]

We have configured only one server(F5 Load Balancer) in spring context XML file. There is no replica Set. This F5 load balancer server  talks to 2 ZTPF mongo servers behind. Even if we hit single ZTPF mongo server directly with out going through F5 Load Balancer, We have this error. 

<mongo:mongo-client host="#{ckiMongoProperties['mongo.host']}" port="#{ckiMongoProperties['mongo.port']}" credentials="#{ckiMongoProperties['mongo.credentials']}">
<mongo:client-options connections-per-host="#{ckiMongoProperties['mongo.connectionsPerHost']}"
threads-allowed-to-block-for-connection-multiplier="#{ckiMongoProperties['mongo.threadsAllowedToBlockForConnectionMultiplier']}"
connect-timeout="#{ckiMongoProperties['mongo.connectTimeout']}"
max-wait-time="#{ckiMongoProperties['mongo.maxWaitTime']}"
socket-keep-alive="#{ckiMongoProperties['mongo.socketKeepAlive']}"
socket-timeout="#{ckiMongoProperties['mongo.socketTimeout']}"
max-connection-idle-time="#{ckiMongoProperties['mongo.maxConnectionIdleTime']}"
max-connection-life-time="#{ckiMongoProperties['mongo.maxConnectionLifeTime']}"
heartbeat-frequency="#{ckiMongoProperties['mongo.heartbeatFrequency']}"
min-heartbeat-frequency="#{ckiMongoProperties['mongo.minHeartbeatFrequency']}"
heartbeat-connect-timeout="#{ckiMongoProperties['mongo.heartbeatConnectTimeout']}"
heartbeat-socket-timeout="#{ckiMongoProperties['mongo.heartbeatSocketTimeout']}"/>
</mongo:mongo-client>

Comment by Jeffrey Yemin [ 27/Aug/18 ]

Yes, there is no requirement that a GetMore request uses the same ConnectionId as the previous ones for a given query. The only requirement is that it routes it to the same server. Are you seeing any evidence that the GetMore is routed to a different server address?

Comment by RameshPitani [ 26/Aug/18 ]

Hi Jeff, I have observed something in my log file(ErrorLog1.txt) related to error "Query failed with error code 35 and error message ''.

This  ErrorLog1.txt contains log messages related to the same transaction. If you look at lines 2 and 4. It has different ConnectionId Value. 1st line is Mongo query with In clause of 150 addresses. Mongo driver cursor decided to use Get More documents multiple times. 1st time, requestErrorLog1.txt used Connection Id:12, 2nd-time GetMore used Connection Id:14.

When I saw successful transactions, It used same connection Id value for 1st time as well as 2nd time(Get More), I am not seeing any problem.

If the 2nd time (Get More) used different ConnectionId value, then we have the error "Query failed with error code 35 and error message ''. In TPF Mainframe logs error is "MONG0035E 09.51.26 Cursor ID 1 passed on the request is not valid or is not an active cursor."

Do you know why 2nd time (GetMore) is using different ConnectionId value compared to 1st one?

 

Comment by Jeffrey Yemin [ 24/Aug/18 ]

Hi Ramesh,

I understand your question, and my requests are attempting to prove your hypotheses. The command monitoring events, in particular the ConnectionId, will tell us that the two requests are actually using the same connection. The thread dump will prove that the request is actually waiting for a socket read to complete, and the server logs may shed light on whether the second request actually arrived at the server.

Generally, if a driver receives a response to a request, as it appeared to here (it's reporting error code 35), then it assumes the socket is still in a good state and just returns it to the pool. It's only on an IOException that the connection is forcibly closed by the driver.

Comment by RameshPitani [ 24/Aug/18 ]

Hi Jeff,

We will try command listener with mongo java driver 3.5.0.

ZTPF has mongo interface and the Client application(Java) is using Mongo Java driver(3.5.0) to communicate to ZTPF system using Mongo interface. So the 1st error(Query failed with error code 35 and error message '' on server) might be related to ZTPF system.

My question mainly about 2nd request. 2nd request hung forever and mongo java driver is not timing out after 10 seconds. If Mongo Socket connection went bad after 1st error, why driver is not closing bad connection and why 2nd request using bad mongo connection? How can close this bad connection forcebly? Is there a way to close it?

 

  

Comment by Jeffrey Yemin [ 24/Aug/18 ]

Sorry to hear that you're having issues with MongoDB using the Java driver. Are you able to reproduce this error? If so, a few things that would help us to investigate further:

  • Add a command listener to your MongoClient and use it to print out the commands that are sent to the server and the replies that are received
  • Take several thread dump, one minute apart, of the application when it's in a hanging state
  • Check the server logs for any anomolies

If you can share all that information with us, we can proceed with an investigation.

Generated at Thu Feb 08 08:58:26 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.