[CXX-1545] ReplicaSet with a shut-down member Created: 04/Apr/18  Updated: 27/Oct/23  Resolved: 07/Apr/18

Status: Closed
Project: C++ Driver
Component/s: Release
Affects Version/s: 3.2.0-rc0
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Denis Bip Assignee: Unassigned
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

debian 9



 Description   

Hi! I have a problem when I shut down one server of a replica set: the client then seems to try to connect to the unavailable server again and again. The problem goes away once the replica set is reconfigured to remove the unavailable server.



 Comments   
Comment by Denis Bip [ 07/Apr/18 ]

Thank you very much!

Comment by A. Jesse Jiryu Davis [ 07/Apr/18 ]

Right, the methods on mongocxx::pool are thread-safe, including try_acquire. Yes, maxPoolSize can be passed in the URI.

It's true that destroying a pool might wait a few seconds if the pool is trying to reach an unavailable server. I suggest that you reduce this delay by setting connectTimeoutMS to a short number. For example, if you know that connecting to a server should never take more than one second, you could include "connectTimeoutMS=1000".

I see that you use socketTimeoutMS in your URI, but you probably don't need that. socketTimeoutMS is used to limit the duration of a command after the driver has connected to a server, but passing "maxTimeMS" to each command is a much better way to accomplish the same goal, see the documentation for maxTimeMS for the find command, for example. I suggest that you remove the socketTimeoutMS option from your URI.

connectTimeoutMS is how long the driver waits while trying to connect to a server, the default is 10000 (10 seconds). I suggest you include in your URI "connectTimeoutMS=1000" (1 second), or however long you think is appropriate for your network.
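A minimal, standard-library-only sketch of assembling a URI along these lines, with connectTimeoutMS=1000 added and socketTimeoutMS left out. The helper `make_connection_string` is hypothetical, credentials are omitted, and option values are assumed to be URL-safe (nothing is percent-encoded).

```cpp
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper: build "mongodb://host1,host2/db?key1=value1&key2=value2".
// Options are appended in the order given; no escaping is performed.
std::string make_connection_string(
    const std::vector<std::string>& hosts,
    const std::string& database,
    const std::vector<std::pair<std::string, std::string>>& options) {
    std::string uri = "mongodb://";
    for (std::size_t i = 0; i < hosts.size(); ++i) {
        if (i > 0) uri += ',';  // seed list is comma-separated
        uri += hosts[i];
    }
    uri += '/';
    uri += database;
    for (std::size_t i = 0; i < options.size(); ++i) {
        uri += (i == 0 ? '?' : '&');  // first option starts the query string
        uri += options[i].first;
        uri += '=';
        uri += options[i].second;
    }
    return uri;
}
```

The resulting string can be passed to `mongocxx::uri` in the same way as the string literal in the examples in this ticket.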

Full URI options documentation here:

http://mongoc.org/libmongoc/current/mongoc_uri_t.html#connection-options

Comment by Denis Bip [ 07/Apr/18 ]

But how can I destroy the pool without timeouts when a server is unavailable? For example, if I need to restart the program, I have to wait for the program to end plus 10 (or more) seconds while the pool is being destroyed...

Comment by Denis Bip [ 07/Apr/18 ]

Ah, OK! In that case, should I make all pools static or global? Also, pool::try_acquire() must be thread-safe (or else it needs to be synchronized); judging by the code of the C++ and C drivers, it seems to be thread-safe. And the maxPoolSize parameter can be passed through the URI, is that right?

Comment by A. Jesse Jiryu Davis [ 07/Apr/18 ]

Oh. Don't do that then! Follow the pattern I showed in my code: Create a single mongocxx::pool at the beginning of your program and use it through the lifetime of your program. Do not destroy it until the end of the program. This is the correct and efficient way to use mongocxx::pool, both in normal circumstances and when you have an unavailable secondary in your replica set.

I don't have a good hypothesis about why you see timeouts when a secondary is shut down but don't see timeouts when all members are available.

Comment by Denis Bip [ 06/Apr/18 ]

I found a clue. The problem is in the destructor of mongocxx::pool. Try this:

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/pool.hpp>
#include <mongocxx/uri.hpp>
 
using bsoncxx::builder::basic::make_document;
using bsoncxx::builder::basic::kvp;
 
int main() {
    mongocxx::instance inst{};
    mongocxx::uri uri{"mongodb://ergegrbhg934n9:a4ertegerg345t345t9543478423f384980@192.168.212.35:27018,192.168.213.35:27018,192.168.221.35:27018/pult?replicaSet=pult&socketTimeoutMS=10000"};
 
    
    bsoncxx::document::value filter = make_document(kvp("_id", 1));
    bsoncxx::document::value update = make_document(kvp("$set", make_document(kvp("x", 1))));
 
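    while (true) {
        // The pool is constructed and destroyed on every iteration; according
        // to this report, the delay shows up in the pool's destructor.
        mongocxx::pool pool{uri};
        auto client = pool.try_acquire();
        if (client) {
            (*client)->database("test")["collection"].update_one(filter.view(), update.view());
        }
    }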
}

You will see a 10-second timeout.

Comment by A. Jesse Jiryu Davis [ 05/Apr/18 ]

Thanks for the information. I tried to reproduce this bug and could not: I started a replica set with 3 data members and 3 arbiters, and I stopped one of the members with kill -STOP to simulate a machine that does not respond on the network. I continued to execute the update_one method using the C++ Driver, and it kept working without throwing an exception.

Could you please tell me what C Driver version you're using?

Please compile this code and run it while shutting down the secondary and let me know if this code also throws an exception:

#include <bsoncxx/builder/basic/document.hpp>
#include <bsoncxx/builder/basic/kvp.hpp>
#include <mongocxx/client.hpp>
#include <mongocxx/instance.hpp>
#include <mongocxx/pool.hpp>
#include <mongocxx/uri.hpp>
 
using bsoncxx::builder::basic::make_document;
using bsoncxx::builder::basic::kvp;
 
int main() {
    mongocxx::instance inst{};
    mongocxx::uri uri{"mongodb://ergegrbhg934n9:a4ertegerg345t345t9543478423f384980@192.168.212.35:27018,192.168.213.35:27018,192.168.221.35:27018/pult?replicaSet=pult&socketTimeoutMS=10000"};
 
    // A single pool, created once and reused by every iteration of the loop.
    mongocxx::pool pool{uri};
    bsoncxx::document::value filter = make_document(kvp("_id", 1));
    bsoncxx::document::value update = make_document(kvp("$set", make_document(kvp("x", 1))));
 
    while (true) {
        auto client = pool.try_acquire();
        if (client) {
            // try_acquire() returns an empty optional if no client is free.
            (*client)->database("test")["collection"].update_one(filter.view(), update.view());
        }
    }
}

Comment by Denis Bip [ 05/Apr/18 ]

Sure. Here it is:
mongodb://ergegrbhg934n9:a4ertegerg345t345t9543478423f384980@192.168.212.35:27018,192.168.213.35:27018,192.168.221.35:27018/pult?replicaSet=pult&socketTimeoutMS=10000

Comment by A. Jesse Jiryu Davis [ 05/Apr/18 ]

Thanks. Could I see the actual URI that you pass to mongocxx::pool, please? At least, could you tell me whether you include a "replicaSet=" option in the URI?

Comment by Denis Bip [ 05/Apr/18 ]

db.isMaster()
{
    "hosts" : [
        "192.168.212.35:27018",
        "192.168.213.35:27018",
        "192.168.221.35:27018"
    ],
    "arbiters" : [
        "192.168.212.30:27018",
        "192.168.213.30:27018",
        "192.168.221.30:27018"
    ],
    "setName" : "pult",
    "setVersion" : 25,
    "ismaster" : true,
    "secondary" : false,
    "primary" : "192.168.213.35:27018",
    "me" : "192.168.213.35:27018",
    "electionId" : ObjectId("7fffffff0000000000000013"),
    "lastWrite" : {
        "opTime" : { "ts" : Timestamp(1522943296, 83), "t" : NumberLong(19) },
        "lastWriteDate" : ISODate("2018-04-05T15:48:16Z")
    },
    "maxBsonObjectSize" : 16777216,
    "maxMessageSizeBytes" : 48000000,
    "maxWriteBatchSize" : 1000,
    "localTime" : ISODate("2018-04-05T15:48:16.660Z"),
    "maxWireVersion" : 5,
    "minWireVersion" : 0,
    "readOnly" : false,
    "ok" : 1
}

Comment by Denis Bip [ 05/Apr/18 ]

I'm using 6 members: 3 regular (data-bearing) members and 3 arbiters. The URI lists the 3 IPs of the regular servers, with socketTimeoutMS=10000. By the way, if the URI contains only one valid IP (no matter how many invalid ones), it seems to work.
I'm using additional classes to manage dependencies, so I think it is better to show you the scheme.
This is the test scheme:

# create pool
while (test) {
    # try_acquire + get => pointer to mongocxx::client
    # update something
    # this_thread::sleep(1 sec)
}

Now shut down the secondary server (physically, not just the mongod daemon) and wait: I get timeouts (I think these are the 10000 ms I passed in the URI) along with errors. Note that the error does not occur on every loop iteration. This is a big problem for my production applications, because each request has a budget of 50 ms at most, and these long timeouts stall all threads of the application. I run into this problem every time the secondary server goes down.

Comment by A. Jesse Jiryu Davis [ 05/Apr/18 ]

Thank you. A few more questions to help me diagnose:

  • what MongoDB URI do you use?
  • how many members are in your replica set?
  • can you share with me the C++ code that reproduces this error?
  • can you share with me the output of db.isMaster() from the mongo shell connected to the primary while the secondary machine is shut down?

Comment by Denis Bip [ 05/Apr/18 ]

I forgot to mention that I tested this with a secondary server, not the primary.

Comment by Denis Bip [ 05/Apr/18 ]

I noticed that if you shut down only the mongod instance, everything works fine. The problem appears when you physically shut down the server. The client then reports: Failed to send "update" command with database "xxx": socket error or timeout: generic server error. not master: generic server error.

Comment by Denis Bip [ 05/Apr/18 ]

I am using mongocxx::pool

Comment by A. Jesse Jiryu Davis [ 04/Apr/18 ]

Hi, are you using a mongocxx::pool or only a mongocxx::client?

Generated at Wed Feb 07 22:03:00 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.