[SERVER-6056] Memory leak: queries time out when sharding. Created: 11/Jun/12  Updated: 06/Dec/22  Resolved: 21/Mar/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.0.5, 2.0.6, 2.2.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: peanutgyz Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Done Votes: 1
Labels: connection, mongos, sharding
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Linux, 3 shards, each a 3-server replica set.


Assigned Teams:
Sharding
Operating System: ALL
Participants:

 Description   

mongos:
ConnectionShardStatus keeps map<DBClientBase*, map<string, unsigned long long> > _map; to track sequence numbers. When checkShardVersion runs, it calls setSequence to populate the map using the key conn = getVersionable(&conn_in); (which, when sharding is used with replica sets, is the connection to the replica set master). But when a ShardConnection is destroyed, resetShardVersion uses the raw connection pointer from the ShardConnection as the key, not the connection to the replica set master. So _map.erase(conn) always returns 0, and the _map size only grows.

Because my web client uses short-lived connections, every time a client closes its connection mongos destroys the ShardConnection, and _map grows. Sometimes mongos creates a new connection to mongod at the same pointer address as an old connection object, so the new connection reads a positive (stale) sequence from _map instead of 0. This makes checkShardVersion run more than twice and causes the web client to wait 3-5 seconds.

I modified the resetShardVersion function in shard_version.cpp: when the connection type is SET, call connectionShardStatus.reset( getVersionable(conn) );. This fixes the problem.



 Comments   
Comment by Gregory McKeon (Inactive) [ 21/Mar/18 ]

Sharded queries no longer use this codepath. Please reopen if this is still an issue.

Comment by Greg Studer [ 13/Aug/13 ]

There's been progress, but we're currently working on refactoring that entire codepath, which should make this better long-term.

The workaround for now is to use connection pooling in the driver to avoid creating lots of new connections to mongos.

Comment by Jin-wook Jeong [ 09/Aug/13 ]

Any progress on this issue?
I have the exact same situation here, even with the latest v2.4.5.
Short-lived connections to mongos cause connectionShardStatus._map to grow without bound, which eventually gets mongos OOM-killed.

After applying peanutgyz's patch to version_manager.cpp in v2.4.5, the memory leak went away.

void VersionManager::resetShardVersionCB( DBClientBase * conn ) {
    if( isVersionableCB( conn ) ) {
        // In addition to peanutgyz's patch: getVersionable()
        // may throw an exception, so we have to catch it.
        try {
            DBClientBase* c = getVersionable( conn );
            if( c ) conn = c;
        } catch ( std::exception& e ) {
            // something meaningful here, if any?
        }
    }
    connectionShardStatus.reset( conn );
}

Comment by Spencer Brody (Inactive) [ 22/Feb/13 ]

Hi peanutgyz. I have taken a closer look at the code and agree with you now that there is a potential memory leak here. I am currently discussing the best way to fix it with my colleagues. Thank you for your help bringing this to our attention!

Comment by peanutgyz [ 23/Jan/13 ]

Can someone give me a response?

Comment by peanutgyz [ 16/Jan/13 ]

1. I found there are two connection pools: one defined in client/connpool.h (DBConnectionPool pool;) and another defined in s/shardconnection.cpp (DBConnectionPool shardConnectionPool;).
When the connPoolStats command runs, it executes the db/commands.cpp PoolStats() : Command( "connPoolStats" ) {} handler, which just calls pool.appendInfo( result );
so the output shows nothing about shardConnectionPool. Is that right?

2. I changed the PoolStats() : Command( "connPoolStats" ) {} handler to call shardConnectionPool.appendInfo( result ); instead, to inspect shardConnectionPool.
When I create 400 concurrent connections, the connPoolStats output contains:

"hosts" : {
    "shard0001/s1:10001,s2:10002,s3:10003::0" : { "available" : 0, "created" : 401 },
    "shard0002/s1:10011,s2:10012,s3:10013::0" : { "available" : 0, "created" : 401 }
},

and the _map size is 802 now.

After I close all client connections, the output changes to:

"hosts" : {
    "shard0001/s1:10001,s2:10002,s3:10003::0" : { "available" : 50, "created" : 401 },
    "shard0002/s1:10011,s2:10012,s3:10013::0" : { "available" : 50, "created" : 401 }
},

but the _map size is still 802. The map size never scales down.

3. I think mongos keeps a connection pool to each shard, but each pool has a _maxPerHost = 50 limit. When many client connections hit mongos at the same time, each one pops a connection from the pool, and the pool becomes empty.
When a client closes its connection, the connection to mongod is pushed back into the pool; since the max pool size is 50, once the pool is full mongos destroys the ShardConnection instead.

The ShardConnection destructor calls versionManager.resetShardVersionCB( _conn ), which runs ConnectionShardStatus::

void reset( DBClientBase * conn ) {
    scoped_lock lk( _mutex );
    _map.erase( conn );
}

_map.erase(conn) always returns 0 and never deletes an item from the map,
because the destructor passes the ShardConnection's raw connection pointer to reset the map,
while setSequence inserted with conn = getVersionable(conn_in), the connection to the master. So the map always grows.

Comment by Spencer Brody (Inactive) [ 15/Jan/13 ]

Mongos keeps a connection pool to all the shards. This pool does not scale down as the number of incoming client connections goes down: all connections opened remain open forever unless there is a problem on that socket. That makes me think this isn't a bug.

You can see information about the connections in the connection pool by running the "connPoolStats" command. As long as the size of _map remains proportional to the number of connections in the connection pool, I don't think there is a memory leak. Can you check the size of the connection pool using connPoolStats and see whether the size of _map scales with the size of the connection pool?

Comment by peanutgyz [ 15/Jan/13 ]

Tue Jan 15 17:24:18 [conn2013] _map size is 3769 conn is 0x7f819c24be60, sequence is 0
Tue Jan 15 17:24:18 [conn2011] _map size is 3770 conn is 0x7f819c1d1200, sequence is 0
Tue Jan 15 17:24:18 [conn2013] _map size is 3770 conn is 0x7f819c1d18c0, sequence is 0
Tue Jan 15 17:24:18 [conn2015] _map size is 3770 conn is 0x7f819c2ba5a0, sequence is 0
Tue Jan 15 17:24:18 [conn2015] _map size is 3771 conn is 0x7f819c24c240, sequence is 0
Tue Jan 15 17:24:18 [conn2012] _map size is 3771 conn is 0x7f8199bd8240, sequence is 0
Tue Jan 15 17:24:18 [conn2012] _map size is 3772 conn is 0x7f819c1d1560, sequence is 0
Tue Jan 15 17:24:18 [conn2014] _map size is 3772 conn is 0x7f8199bdc120, sequence is 0
Tue Jan 15 17:24:18 [conn2014] _map size is 3773 conn is 0x7f819c1d1e60, sequence is 0
Tue Jan 15 17:24:18 [conn2016] _map size is 3773 conn is 0x7f8199bdc5a0, sequence is 0
Tue Jan 15 17:24:18 [conn2016] _map size is 3774 conn is 0x7f819c2485a0, sequence is 0
Tue Jan 15 17:24:18 [conn2018] _map size is 3774 conn is 0x7f8199bdc900, sequence is 0
Tue Jan 15 17:24:18 [conn2018] _map size is 3775 conn is 0x7f819c249320, sequence is 0
Tue Jan 15 17:24:18 [conn2017] _map size is 3775 conn is 0x7f8199bdcfc0, sequence is 0
Tue Jan 15 17:24:18 [conn2017] _map size is 3776 conn is 0x7f819c248900, sequence is 0
Tue Jan 15 17:24:18 [conn2020] _map size is 3776 conn is 0x7f8199bdd320, sequence is 0
Tue Jan 15 17:24:18 [conn2020] _map size is 3777 conn is 0x7f819c249d40, sequence is 0
Tue Jan 15 17:24:18 [conn2019] _map size is 3777 conn is 0x7f8199bdcc60, sequence is 0
Tue Jan 15 17:24:18 [conn2019] _map size is 3778 conn is 0x7f819c2499e0, sequence is 0
Tue Jan 15 17:24:18 [conn1997] _map size is 3778 conn is 0x7f8199bdd680, sequence is 0
Tue Jan 15 17:24:18 [conn1997] _map size is 3779 conn is 0x7f819d96cfc0, sequence is 0
Tue Jan 15 17:24:18 [conn2021] _map size is 3779 conn is 0x7f8199bdd9e0, sequence is 0
Tue Jan 15 17:24:18 [conn2021] _map size is 3780 conn is 0x7f819c24c480, sequence is 0
Tue Jan 15 17:24:18 [conn2008] _map size is 3780 conn is 0x7f8199bddd40, sequence is 0
Tue Jan 15 17:24:18 [conn2008] _map size is 3781 conn is 0x7f819c1e3680, sequence is 0

Comment by peanutgyz [ 15/Jan/13 ]

S getSequence( DBClientBase * conn , const string& ns ) {
    scoped_lock lk( _mutex );
    log() << "_map size is " << _map.size() << endl;
    return _map[conn][ns];
}

I changed shard_version.cpp, adding a log line to watch the _map size.

Then I created 500 concurrent connections to mongos and queried data, and _map grew.
But when I closed all client connections, the _map size never decreased.

Comment by Spencer Brody (Inactive) [ 11/Sep/12 ]

I'm resolving this ticket due to lack of activity.

If you have a test that demonstrates this problem, feel free to re-open.

Comment by Spencer Brody (Inactive) [ 20/Jun/12 ]

Do you have a test that reproduces the problem?
Can you attach the patch you made that fixes it?

Generated at Thu Feb 08 03:10:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.