[JAVA-4554] Java OOM Exception leads to MongoInternalException with responseTo / requestId mismatch Created: 29/Mar/22  Updated: 04/May/22  Resolved: 07/Apr/22

Status: Closed
Project: Java Driver
Component/s: Connection Management
Affects Version/s: 4.5.0
Fix Version/s: None

Type: Bug Priority: Unknown
Reporter: Simon Schabel Assignee: Jeffrey Yemin
Resolution: Cannot Reproduce Votes: 0
Labels: external-user
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified


 Description   

We're facing a connection issue between our Java application and our MongoDB replica set, and I do not know where to start looking for the problem.

 

The setup: MongoDB 5.0.6 is running on all four of our Debian 11 root servers (no Docker or any other containerization).

One server runs a separate MongoDB instance in standalone mode.

The other three servers form a replica set: one primary, one secondary, one arbiter.

 

The output of rs.status() is:

 

{
        "set" : "wisdom",
        "date" : ISODate("2022-03-29T20:22:00.525Z"),
        "myState" : 2,
        "term" : NumberLong(1),
        "syncSourceHost" : "h1.example.com:27017",
        "syncSourceId" : 0,
        "heartbeatIntervalMillis" : NumberLong(2000),
        "majorityVoteCount" : 2,
        "writeMajorityCount" : 2,
        "votingMembersCount" : 3,
        "writableVotingMembersCount" : 2,
        "optimes" : {
                "lastCommittedOpTime" : {
                        "ts" : Timestamp(1648585320, 1),
                        "t" : NumberLong(1)
                },
                "lastCommittedWallTime" : ISODate("2022-03-29T20:22:00.228Z"),
                "readConcernMajorityOpTime" : {
                        "ts" : Timestamp(1648585320, 1),
                        "t" : NumberLong(1)
                },
                "appliedOpTime" : {
                        "ts" : Timestamp(1648585320, 1),
                        "t" : NumberLong(1)
                },
                "durableOpTime" : {
                        "ts" : Timestamp(1648585320, 1),
                        "t" : NumberLong(1)
                },
                "lastAppliedWallTime" : ISODate("2022-03-29T20:22:00.228Z"),
                "lastDurableWallTime" : ISODate("2022-03-29T20:22:00.228Z")
        },
        "lastStableRecoveryTimestamp" : Timestamp(1648585298, 17),
        "members" : [
                {
                        "_id" : 0,
                        "name" : "h1.example.com:27017",
                        "health" : 1,
                        "state" : 1,
                        "stateStr" : "PRIMARY",
                        "uptime" : 337679,
                        "optime" : {
                                "ts" : Timestamp(1648585319, 3),
                                "t" : NumberLong(1)
                        },
                        "optimeDurable" : {
                                "ts" : Timestamp(1648585319, 3),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2022-03-29T20:21:59Z"),
                        "optimeDurableDate" : ISODate("2022-03-29T20:21:59Z"),
                        "lastAppliedWallTime" : ISODate("2022-03-29T20:21:59.397Z"),
                        "lastDurableWallTime" : ISODate("2022-03-29T20:21:59.397Z"),
                        "lastHeartbeat" : ISODate("2022-03-29T20:21:59.889Z"),
                        "lastHeartbeatRecv" : ISODate("2022-03-29T20:21:59.970Z"),
                        "pingMs" : NumberLong(25),
                        "lastHeartbeatMessage" : "",
                        "syncSourceHost" : "",
                        "syncSourceId" : -1,
                        "infoMessage" : "",
                        "electionTime" : Timestamp(1648246078, 2),
                        "electionDate" : ISODate("2022-03-25T22:07:58Z"),
                        "configVersion" : 4,
                        "configTerm" : 1
                },
                {
                        "_id" : 1,
                        "name" : "h2.example.com:27017",
                        "health" : 1,
                        "state" : 2,
                        "stateStr" : "SECONDARY",
                        "uptime" : 338620,
                        "optime" : {
                                "ts" : Timestamp(1648585320, 1),
                                "t" : NumberLong(1)
                        },
                        "optimeDate" : ISODate("2022-03-29T20:22:00Z"),
                        "lastAppliedWallTime" : ISODate("2022-03-29T20:22:00.228Z"),
                        "lastDurableWallTime" : ISODate("2022-03-29T20:22:00.228Z"),
                        "syncSourceHost" : "h2.example.com:27017",
                        "syncSourceId" : 0,
                        "infoMessage" : "",
                        "configVersion" : 4,
                        "configTerm" : 1,
                        "self" : true,
                        "lastHeartbeatMessage" : ""
                },
                {
                        "_id" : 2,
                        "name" : "h3.example.com:27017",
                        "health" : 1,
                        "state" : 7,
                        "stateStr" : "ARBITER",
                        "uptime" : 121552,
                        "lastHeartbeat" : ISODate("2022-03-29T20:21:59.116Z"),
                        "lastHeartbeatRecv" : ISODate("2022-03-29T20:21:59.733Z"),
                        "pingMs" : NumberLong(25),
                        "lastHeartbeatMessage" : "",
                        "syncSourceHost" : "",
                        "syncSourceId" : -1,
                        "infoMessage" : "",
                        "configVersion" : 4,
                        "configTerm" : 1
                }
        ],
        "ok" : 1,
        "$clusterTime" : {
                "clusterTime" : Timestamp(1648585320, 1),
                "signature" : {
                        "hash" : BinData(0,"AAAAAAAAAAAAAAAAAAAAAAAAAAA="),
                        "keyId" : NumberLong(0)
                }
        },
        "operationTime" : Timestamp(1648585320, 1)
}

 

 

Our main application runs on other Debian servers within Docker and uses tomcat:8.5.73-jdk11-corretto as its base image. We use the following MongoDB Java driver version:

 

<dependency>
    <groupId>org.mongodb</groupId>
    <artifactId>mongodb-driver-sync</artifactId>
    <version>4.5.0</version>
</dependency>

Within our application we have two database connection managers, each holding its own MongoClient. One connects to the standalone MongoDB instance, the other to the replica set. Both use MongoDB URIs that look like this:

 

 

mongodb://<IP-HERE>:27017/?waitQueueMultiple=10
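
For illustration, a minimal sketch (not our actual code) of how such a two-client setup is typically wired up; the class name is hypothetical and the host placeholders mirror the URI above:

import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoClients;

// Hypothetical sketch: two independent MongoClient instances, one per connection
// manager. The URIs mirror the ones above; the host placeholders are not real values.
public class DatabaseConnectionManagers {

    // client used by the connection manager for the standalone MongoDB instance
    private final MongoClient standaloneClient =
            MongoClients.create("mongodb://<STANDALONE-IP>:27017/?waitQueueMultiple=10");

    // client used by the connection manager for the replica set
    private final MongoClient replicaSetClient =
            MongoClients.create("mongodb://<REPLICA-SET-IP>:27017/?waitQueueMultiple=10");

    public MongoClient standalone() {
        return standaloneClient;
    }

    public MongoClient replicaSet() {
        return replicaSetClient;
    }
}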

 

 

Today our application suddenly had a memory hiccup and an OOM error occurred:

 

Exception in thread "Thread-130174" java.lang.OutOfMemoryError: Java heap space

 

 

Just 5 seconds later, the application was no longer able to talk to the replica set, and the following error messages were logged over and over again (here are two example log entries that directly followed each other):

 

Exception in thread "Thread-129358" com.mongodb.MongoInternalException: The responseTo (13430453) in the response does not match the requestId (13430461) in the request
	at com.mongodb.internal.connection.ReplyMessage.<init>(ReplyMessage.java:65)
	at com.mongodb.internal.connection.ReplyMessage.<init>(ReplyMessage.java:43)
	at com.mongodb.internal.connection.InternalStreamConnection.getCommandResult(InternalStreamConnection.java:540)
	at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:425)
	at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:342)
	at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:116)
	at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:643)
	at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
	at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:240)
	at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:226)
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:126)
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:116)
	at com.mongodb.internal.connection.DefaultServer$OperationCountTrackingConnection.command(DefaultServer.java:345)
	at com.mongodb.internal.operation.CommandOperationHelper.createReadCommandAndExecute(CommandOperationHelper.java:230)
	at com.mongodb.internal.operation.FindOperation.lambda$execute$1(FindOperation.java:666)
	at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$2(OperationHelper.java:575)
	at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:600)
	at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$3(OperationHelper.java:574)
	at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:600)
	at com.mongodb.internal.operation.OperationHelper.withSourceAndConnection(OperationHelper.java:573)
	at com.mongodb.internal.operation.FindOperation.lambda$execute$2(FindOperation.java:661)
	at com.mongodb.internal.async.function.RetryingSyncSupplier.get(RetryingSyncSupplier.java:65)
	at com.mongodb.internal.operation.FindOperation.execute(FindOperation.java:692)
	at com.mongodb.internal.operation.FindOperation.execute(FindOperation.java:86)
	at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:191)
	at com.mongodb.client.internal.MongoIterableImpl.execute(MongoIterableImpl.java:135)
	at com.mongodb.client.internal.MongoIterableImpl.iterator(MongoIterableImpl.java:92)
	at com.mongodb.client.internal.MongoIterableImpl.forEach(MongoIterableImpl.java:121)
	at com.mongodb.client.internal.MappingIterable.forEach(MappingIterable.java:59)
	at com.mongodb.client.internal.MappingIterable.into(MappingIterable.java:69)
	at com.example.persistence.readManyAsList(SxDao.java:113)

and

 

 

Exception in thread "Thread-129766" com.mongodb.MongoInternalException: The responseTo (13430461) in the response does not match the requestId (13430469) in the request
	at com.mongodb.internal.connection.ReplyMessage.<init>(ReplyMessage.java:65)
	at com.mongodb.internal.connection.ReplyMessage.<init>(ReplyMessage.java:43)
	at com.mongodb.internal.connection.InternalStreamConnection.getCommandResult(InternalStreamConnection.java:540)
	at com.mongodb.internal.connection.InternalStreamConnection.receiveCommandMessageResponse(InternalStreamConnection.java:425)
	at com.mongodb.internal.connection.InternalStreamConnection.sendAndReceive(InternalStreamConnection.java:342)
	at com.mongodb.internal.connection.UsageTrackingInternalConnection.sendAndReceive(UsageTrackingInternalConnection.java:116)
	at com.mongodb.internal.connection.DefaultConnectionPool$PooledConnection.sendAndReceive(DefaultConnectionPool.java:643)
	at com.mongodb.internal.connection.CommandProtocolImpl.execute(CommandProtocolImpl.java:71)
	at com.mongodb.internal.connection.DefaultServer$DefaultServerProtocolExecutor.execute(DefaultServer.java:240)
	at com.mongodb.internal.connection.DefaultServerConnection.executeProtocol(DefaultServerConnection.java:226)
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:126)
	at com.mongodb.internal.connection.DefaultServerConnection.command(DefaultServerConnection.java:116)
	at com.mongodb.internal.connection.DefaultServer$OperationCountTrackingConnection.command(DefaultServer.java:345)
	at com.mongodb.internal.operation.CommandOperationHelper.createReadCommandAndExecute(CommandOperationHelper.java:230)
	at com.mongodb.internal.operation.FindOperation.lambda$execute$1(FindOperation.java:666)
	at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$2(OperationHelper.java:575)
	at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:600)
	at com.mongodb.internal.operation.OperationHelper.lambda$withSourceAndConnection$3(OperationHelper.java:574)
	at com.mongodb.internal.operation.OperationHelper.withSuppliedResource(OperationHelper.java:600)
	at com.mongodb.internal.operation.OperationHelper.withSourceAndConnection(OperationHelper.java:573)
	at com.mongodb.internal.operation.FindOperation.lambda$execute$2(FindOperation.java:661)
	at com.mongodb.internal.async.function.RetryingSyncSupplier.get(RetryingSyncSupplier.java:65)
	at com.mongodb.internal.operation.FindOperation.execute(FindOperation.java:692)
	at com.mongodb.internal.operation.FindOperation.execute(FindOperation.java:86)
	at com.mongodb.client.internal.MongoClientDelegate$DelegateOperationExecutor.execute(MongoClientDelegate.java:191)
	at com.mongodb.client.internal.MongoIterableImpl.execute(MongoIterableImpl.java:135)
	at com.mongodb.client.internal.MongoIterableImpl.iterator(MongoIterableImpl.java:92)
	at com.mongodb.client.internal.MongoIterableImpl.forEach(MongoIterableImpl.java:121)
	at com.mongodb.client.internal.MappingIterable.forEach(MappingIterable.java:59)
	at com.mongodb.client.internal.MappingIterable.into(MappingIterable.java:69)
	at com.example.persistence.readManyAsList(SxDao.java:113)

 

Please note the responseTo and requestId numbers in these two consecutive messages:

 

responseTo (13430453) in the response does not match the requestId (13430461)
responseTo (13430461) in the response does not match the requestId (13430469)

 

The responseTo of each subsequent log message always referenced the requestId of the previous one.

From that point on, this type of error message was logged for every attempt to read or write data in the replica set via its database connection manager.
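
For intuition only, here is a self-contained toy model of how a single unread reply on a pooled connection would produce exactly this off-by-one pattern. This is an assumption about the failure mode, not the driver's code; the id values and the increment of 8 are taken from the log excerpts above purely for illustration:

import java.util.ArrayDeque;
import java.util.Queue;

// Hypothetical toy model, NOT the driver's implementation: if one reply is left
// unread on a pooled connection (for example because the reading thread died with
// an OutOfMemoryError), every later operation on that connection reads the reply
// belonging to the previous request, matching the pattern in the logs above.
public class ResponseToMismatchSketch {

    public static void main(String[] args) {
        Queue<Integer> repliesOnSocket = new ArrayDeque<>(); // buffered replies, identified by their responseTo
        int requestId = 13430453;

        // Request 13430453 is sent and its reply arrives on the socket...
        repliesOnSocket.add(requestId);
        // ...but the reply is never consumed (simulated OOM in the reader), and the
        // connection goes back to the pool with the stale reply still buffered.

        for (int i = 0; i < 3; i++) {
            requestId += 8;                           // ids advance by several per round trip in the logs
            repliesOnSocket.add(requestId);           // the server's reply to the new request
            int responseTo = repliesOnSocket.poll();  // but the oldest buffered reply is read first
            if (responseTo != requestId) {
                System.out.printf(
                        "The responseTo (%d) in the response does not match the requestId (%d) in the request%n",
                        responseTo, requestId);
            }
        }
    }
}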

 

Interestingly, the other database connection manager, which connects to the standalone MongoDB instance, wasn't affected by the OOM error. It was still reading from and writing to that instance without any problem.

In the end we restarted our application and everything worked again. But of course we need to find out what the root cause is.

 

 

I assume it has something to do with our replica set and the OOM hiccup our app had. Does anyone have a hint as to what we need to look into to get this fixed?



 Comments   
Comment by Jeffrey Yemin [ 07/Apr/22 ]

Thanks for your understanding. This is indeed a tricky one. Hopefully we'll be able to get to the bottom of it at some point.

Comment by Simon Schabel [ 07/Apr/22 ]

Sorry, unfortunately I can't provide more information.

 

Thank you for your work on this. Since you were not able to reproduce the issue, this ticket can be closed. If we see it occurring again with more useful details in the logs, I will report them here.

Comment by Jeffrey Yemin [ 06/Apr/22 ]

I attempted to reproduce the issue with the following program:

import com.mongodb.MongoClientSettings;
import com.mongodb.client.MongoClients;
import org.bson.BsonReader;
import org.bson.BsonWriter;
import org.bson.codecs.Codec;
import org.bson.codecs.DecoderContext;
import org.bson.codecs.EncoderContext;
import org.bson.codecs.configuration.CodecRegistries;
import org.bson.types.ObjectId;

import java.util.concurrent.Executors;
import java.util.concurrent.atomic.AtomicLong;

public class JAVA4554 {
    public static void main(String[] args) throws InterruptedException {
        var client = MongoClients.create();
        var collection = client.getDatabase("test").getCollection("JAVA-4554", Person.class)
                .withCodecRegistry(
                        CodecRegistries.fromRegistries(
                                CodecRegistries.fromCodecs(new PersonCodec()),
                                MongoClientSettings.getDefaultCodecRegistry()));
 
        var executor = Executors.newFixedThreadPool(5);
        for (int i = 0; i < 5; i++) {
            executor.submit(() -> {
                for (; ; ) {
                    var person = new Person();
                    person.id = new ObjectId();
                    person.name = "Jeff";
 
                    try {
                        collection.insertOne(person);
                    } catch (OutOfMemoryError e) {
                        System.out.println(e.getClass());
                    } catch (Throwable t) {
                        t.printStackTrace();
                    }
                }
            });
        }
    }
 
    static class Person {
        public ObjectId id;
        public String name;
    }
 
    static class PersonCodec implements Codec<Person> {
        private static AtomicLong counter = new AtomicLong();
 
        @Override
        public Person decode(final BsonReader bsonReader, final DecoderContext decoderContext) {
            throw new UnsupportedOperationException();
        }
 
        @Override
        public void encode(final BsonWriter writer, final Person person, final EncoderContext encoderContext) {
            if (counter.incrementAndGet() % 10 == 0) {
                // force OutOfMemoryError
                byte[] b = new byte[Integer.MAX_VALUE];
            }
 
            writer.writeStartDocument();
            writer.writeObjectId("_id", person.id);
            writer.writeString("name", person.name);
            writer.writeEndDocument();
        }
 
        @Override
        public Class<Person> getEncoderClass() {
            return Person.class;
        }
    }
}

and was unsuccessful.

I see two possible ways forward:

  1. If you can provide us with a reliable reproducer, that would be the most straightforward way to find the root cause.
  2. If you can provide us with DEBUG-level application logs that cover the time period of the exception, we might be able to figure it out by analyzing the logs (a sketch of how to turn up driver logging follows below).
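
For reference, a minimal sketch of turning up the driver's log level, assuming java.util.logging is the active backend (the driver falls back to it when no SLF4J binding is on the classpath); with an SLF4J backend such as Logback or Log4j2, the equivalent is setting the "org.mongodb.driver" logger to DEBUG in that backend's configuration:

import java.util.logging.ConsoleHandler;
import java.util.logging.Level;
import java.util.logging.Logger;

// Hypothetical sketch: the MongoDB Java driver's loggers are all named under the
// "org.mongodb.driver" prefix, and FINE roughly corresponds to DEBUG.
public class EnableDriverDebugLogging {

    // hold a strong reference so the configured logger is not garbage collected
    private static final Logger DRIVER_LOGGER = Logger.getLogger("org.mongodb.driver");

    public static void main(String[] args) {
        ConsoleHandler handler = new ConsoleHandler();
        handler.setLevel(Level.FINE);

        DRIVER_LOGGER.setLevel(Level.FINE);
        DRIVER_LOGGER.addHandler(handler);

        // ... create the MongoClient and run the application after this point
    }
}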
Comment by Jeffrey Yemin [ 04/Apr/22 ]

Without some sort of reliable reproducer, I'm not sure how to proceed. We'll mull it over, but I'm not optimistic. OOMs usually indicate some systemic problem that typically requires a restart anyway, so I'm not surprised that this hasn't been reported before.

Comment by Simon Schabel [ 04/Apr/22 ]

Unfortunately not; I did not find any other related stack trace in the logs.

The other threads only logged messages like:

Exception in thread "Thread-130127" java.lang.OutOfMemoryError: Java heap space

...without a proper stack trace. If something with more information comes up, I will share it with you.

Comment by Jeffrey Yemin [ 04/Apr/22 ]

That doesn't seem like the full stack trace. Is there more available, perhaps?

Comment by Simon Schabel [ 04/Apr/22 ]

Hey @Jeffrey Yemin,

Thank you for looking into this. Yes, it came with a stack trace; please see below:

Exception in thread "Thread-130180" java.lang.OutOfMemoryError: Java heap space
	at org.bson.Document.<init>(Document.java:79)
	at com.example.products.model.Attribute.serialize(Attribute.java:43)
	at com.example.products.model.InputAttribute.serialize(InputAttribute.java:72)
	at com.example.products.model.Product.serialize(Product.java:191)

No, we do not use TLS connections.

Additionally, I wanted to mention that no error message or any other suspicious or related log entry was found on the server side.

 

I hope this helps; please let me know if I can help with more information.

Comment by Jeffrey Yemin [ 04/Apr/22 ]

Also, please let us know if you're using TLS connections.

Comment by Jeffrey Yemin [ 02/Apr/22 ]

Hi simon.schabel@semknox.com

Did that OOM come with a stack trace? If so, please share it.

P.S. I ask because I want to know whether the OOM is coming from inside the driver (e.g. allocating a buffer) or from somewhere completely separate from the driver in the application.
