[GODRIVER-1750] memory leak in topology.Server Created: 16/Sep/20  Updated: 28/Oct/23  Resolved: 30/Apr/21

Status: Closed
Project: Go Driver
Component/s: Server Discovery and Monitoring
Affects Version/s: 1.4.0
Fix Version/s: 1.5.2

Type: Bug Priority: Major - P3
Reporter: Pierre Durand Assignee: Divjot Arora (Inactive)
Resolution: Fixed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux, Go 1.15


Attachments: PNG File 2021-02-22 09-57-34 的螢幕擷圖.png     PNG File Capture d’écran du 2020-09-16 09-47-32.png     PNG File image-2021-03-11-17-22-19-452.png    

 Description   

I've noticed that my application is leaking memory in `topology.Server`. I think it's not calling the context cancel function properly. See the attached screenshot (it's from DataDog continuous profiling).

It started happening when I upgraded to v1.4.0 of the Go MongoDB driver. It was perfectly fine with v1.3.x.

I'm always using the same MongoDB client object: I create it at application startup and keep using that one object. I'm not creating thousands of client objects.

My application is a monitoring tool that fetches the "current op" periodically. It calls `RunCommand()` with `{"currentOp": 1, "active": true}`. I suspect the leak could be related to this, because I don't see similar issues in my other apps (which run normal "find" commands).
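For reference, a minimal sketch of this usage pattern (the URI, database name, and 10-second interval are assumptions, not taken from the report):

package main

import (
	"context"
	"log"
	"time"

	"go.mongodb.org/mongo-driver/bson"
	"go.mongodb.org/mongo-driver/mongo"
	"go.mongodb.org/mongo-driver/mongo/options"
)

func main() {
	ctx := context.Background()

	// One client for the lifetime of the application.
	client, err := mongo.Connect(ctx, options.Client().ApplyURI("mongodb://localhost:27017"))
	if err != nil {
		log.Fatal(err)
	}
	defer client.Disconnect(ctx)

	// Periodically fetch "currentOp" with the same client.
	for range time.Tick(10 * time.Second) {
		cmd := bson.D{{Key: "currentOp", Value: 1}, {Key: "active", Value: true}}
		var result bson.M
		if err := client.Database("admin").RunCommand(ctx, cmd).Decode(&result); err != nil {
			log.Println("currentOp failed:", err)
		}
	}
}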



 Comments   
Comment by Githook User [ 30/Apr/21 ]

Author: Divjot Arora <divjot.arora@10gen.com> (username: divjotarora)

Message: GODRIVER-1750 Ensure contexts are always cancelled during server monitoring (#654)
Branch: release/1.5
https://github.com/mongodb/mongo-go-driver/commit/f28d69f15d68afcbda6e30c4eec037bbffc78dc7

Comment by Githook User [ 30/Apr/21 ]

Author: Divjot Arora <divjot.arora@10gen.com> (username: divjotarora)

Message: GODRIVER-1750 Ensure contexts are always cancelled during server monitoring (#654)
Branch: master
https://github.com/mongodb/mongo-go-driver/commit/29b591cfc35ee055912f94c857bc1b09300e16ec

Comment by Divjot Arora (Inactive) [ 28/Apr/21 ]

https://github.com/mongodb/mongo-go-driver/pull/654

Comment by Divjot Arora (Inactive) [ 05/Apr/21 ]

I think I've narrowed down the issue - the context may not be cancelled if we fail to create the heartbeat connection. In this case, the next iteration will try to create a new connection and context, which can lead to a context leak. This also explains my inability to reproduce as it would only occur if the target server were unreachable for a long period of time.

noobfriend@gmail.com Thank you for your patience and code samples while we've investigated this ticket. I'm going to mark it with fixVersion 1.5.2 and put it into the "Scheduled" state so that we can release a fix in our next patch release.
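To illustrate, here is a minimal, self-contained Go sketch of the leak pattern described above (names are illustrative; this is not the driver's actual code):

package main

import (
	"context"
	"errors"
)

var errUnreachable = errors.New("server unreachable")

// connect simulates creating a heartbeat connection to an unreachable server.
func connect(ctx context.Context) error {
	return errUnreachable
}

func monitorLoop(globalCtx context.Context, attempts int) {
	for i := 0; i < attempts; i++ {
		// A new cancellable context is derived on every iteration.
		heartbeatCtx, cancel := context.WithCancel(globalCtx)

		if err := connect(heartbeatCtx); err != nil {
			// Leak: moving on to the next iteration without calling
			// cancel() leaves this context's resources live. The fix
			// is to call cancel() on this error path as well.
			continue
		}

		// On success the connection is used, and cancel() is
		// eventually called to release the context.
		cancel()
	}
}

func main() {
	monitorLoop(context.Background(), 3)
}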

Comment by Divjot Arora (Inactive) [ 25/Mar/21 ]

Thank you for the new code changes. I'll continue to look at this today and next week.

Comment by 孟修 N/A [ 25/Mar/21 ]

Hi Divjot Arora,

Thanks for the update. I found that the change also leads to some side effects in connection handling. Here is my quick workaround, which only affects the heartbeat connection, although I think it could also be buggy...
https://github.com/flf2ko/mongo-go-driver/compare/v1.4.4...flf2ko:monitor

Comment by Divjot Arora (Inactive) [ 22/Mar/21 ]

noobfriend@gmail.com,

I wanted to provide a quick update. I don't think we want to call `ProcessHandshakeError` here, as that function is meant for processing errors in pooled connections; errors in monitoring connections are handled separately. In the next few days, I'll dig deeper and try to figure out why calling `ProcessHandshakeError` reduces the memory pressure and how we can get the same effect without actually calling that function.

Comment by Divjot Arora (Inactive) [ 11/Mar/21 ]

Hi noobfriend@gmail.com,

Thanks for investigating so thoroughly. I'm going to put this ticket in an "Investigating" state to give myself some time to look over your comments and the commit you linked. Based on that, we'll try to schedule it to be worked on ASAP so we can fix this in our next patch release.

– Divjot

Comment by 孟修 N/A [ 11/Mar/21 ]

I tested a simple change and ran my server for two days. Memory usage stayed at a fixed amount, so I think my guess is correct.

Change I made:

https://github.com/flf2ko/mongo-go-driver/commit/25a25c3f2f75616f4e318e641c41cf49927f6b56

In-use space: (see attached image-2021-03-11-17-22-19-452.png)

Comment by 孟修 N/A [ 10/Mar/21 ]

Hi, I have a guess about this issue and I'm still working on verifying it. Here is my guess:

1. If `(*connection).processInitializationError` happens in the heartbeat goroutine, it uses `(*connection).config.errorHandlingCallback` to do the cancelling.

go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).processInitializationError
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*connection).connect
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).setupHeartbeatConnection

https://github.com/mongodb/mongo-go-driver/blob/v1.4.6/x/mongo/driver/topology/connection.go#L131

2. In this case, `(*connection).config` comes from `(*Server).createConnection`, which does not include the `withErrorHandlingCallback(s.ProcessHandshakeError)` ConnectionOption.

Note that the pooled-connection path sets the callback in https://github.com/mongodb/mongo-go-driver/blob/v1.4.6/x/mongo/driver/topology/server.go#L178, but in the heartbeat case this ConnectionOption is not included in `(*Server).cfg`.

go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).createConnection
go.mongodb.org/mongo-driver/x/mongo/driver/topology.(*Server).setupHeartbeatConnection

https://github.com/mongodb/mongo-go-driver/blob/v1.4.6/x/mongo/driver/topology/server.go#L579

3. Therefore, `heartbeatCtxCancel` won't be called when `processInitializationError` happens:

s.heartbeatLock.Lock()
s.heartbeatCtx, s.heartbeatCtxCancel = context.WithCancel(s.globalCtx)
s.conn = conn
s.heartbeatLock.Unlock()

s.conn.connect(s.heartbeatCtx)

https://github.com/mongodb/mongo-go-driver/blob/v1.4.6/x/mongo/driver/topology/server.go#L586-L590
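To summarize steps 1-3, here is a much-simplified sketch of the wiring gap (the real types in x/mongo/driver/topology are more involved; names are abbreviated for illustration):

package main

import (
	"context"
	"errors"
	"fmt"
)

type connectionConfig struct {
	errorHandlingCallback func(error)
}

type connection struct {
	config connectionConfig
}

// processInitializationError delegates cancellation to the callback;
// with a nil callback, nothing cancels the heartbeat context.
func (c *connection) processInitializationError(err error) {
	if c.config.errorHandlingCallback != nil {
		c.config.errorHandlingCallback(err)
	}
}

type server struct {
	heartbeatCtx       context.Context
	heartbeatCtxCancel context.CancelFunc
}

func (s *server) setupHeartbeatConnection() {
	s.heartbeatCtx, s.heartbeatCtxCancel = context.WithCancel(context.Background())

	// Heartbeat path: unlike the pooled path, which wires in
	// withErrorHandlingCallback(s.ProcessHandshakeError), no callback is
	// set here, so a failed handshake leaves heartbeatCtxCancel uncalled.
	conn := &connection{config: connectionConfig{}}
	conn.processInitializationError(errors.New("handshake failed"))
}

func main() {
	s := &server{}
	s.setupHeartbeatConnection()
	fmt.Println("handshake error processed; heartbeatCtxCancel was never called")
}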


Comment by Backlog - Core Eng Program Management Team [ 09/Mar/21 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please feel free to comment on this if you're able to provide more information.

Comment by 舜元 施 [ 23/Feb/21 ]

Yes, it's a long-running application with a slow memory leak. We will keep investigating the issue. We have already found that it only happens when the app connects to a replica set cluster that spans different data centers. (There is no issue when connecting to a local DB cluster.)

Comment by Divjot Arora (Inactive) [ 22/Feb/21 ]

Hi b98705002@gmail.com,

Thanks for the attached memory profile. Can you provide more information on how this profile was taken? Is this a snapshot from a long-running application? Additionally, is it possible for you to provide a Go program that we can run which illustrates the memory issue?

– Divjot

Comment by 舜元 施 [ 22/Feb/21 ]

Hi, I'm also seeing this issue in profiling on v1.4.x.

Comment by Esha Bhargava [ 12/Oct/20 ]

There hasn't been any recent activity on this ticket, so we're resolving it. Thanks for reaching out! Please comment on this ticket if you can provide the requested information.

Comment by Divjot Arora (Inactive) [ 22/Sep/20 ]

Hi pierrre,

I'm trying to look into this, but I haven't been able to reproduce it so far. v1.4.0 of the driver did introduce a new heartbeating protocol that's used by the driver's internal monitoring routine. The implementation of that protocol uses contexts and cancellation functions to signal some state changes. However, I believe the cancellation function is always called for the created contexts, so there should not be a memory leak. In your testing, do you see memory constantly increase, or is it that your application is using more memory than it did before but it isn't increasing as the application runs? Also, do you have a small reproducible example that we could run against a MongoDB server to debug this issue?

– Divjot
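For context, the cancellation-as-signal pattern described above looks roughly like this in general Go (a generic sketch, not the driver's code):

package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	done := make(chan struct{})

	// A monitoring goroutine watches ctx.Done() for a stop signal.
	go func() {
		defer close(done)
		for {
			select {
			case <-ctx.Done():
				fmt.Println("monitor: stopping:", ctx.Err())
				return
			case <-time.After(100 * time.Millisecond):
				fmt.Println("monitor: heartbeat tick")
			}
		}
	}()

	time.Sleep(250 * time.Millisecond)
	cancel() // signal the state change; must always be called to avoid a leak
	<-done
}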
