[CDRIVER-2068] mongocxx::pool mongos failover Created: 15/Feb/17  Updated: 27/Oct/23  Resolved: 08/Mar/17

Status: Closed
Project: C Driver
Component/s: network
Affects Version/s: 1.5.2
Fix Version/s: None

Type: Task
Priority: Minor - P4
Reporter: Aleksander Melnikov
Assignee: Unassigned
Resolution: Works as Designed
Votes: 0
Labels: driver
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment: Windows 7 x64, Visual Studio 2015


Attachments: test.cpp
Issue Links:
Related
is related to CDRIVER-2067 Server selection is not random on Win... Closed

 Description   

We have a sharded cluster with 3 mongos instances for failover purposes. In the future, the mongos instances will be distributed across different hosts (for now they run on the same host on different ports, for testing).
Failover is needed to support backup network channels between the client app and mongos. For this purpose, we provide all mongos addresses in mongocxx::uri and then pass it to mongocxx::pool.
We now see that mongocxx::pool does not fail over to another mongos if we shut down the first one (for example).
But mongocxx::client handles this well (it fails over to a different mongos).

Is this behaviour correct, or do we need additional setup of mongocxx::pool for mongos failover?



 Comments   
Comment by A. Jesse Jiryu Davis [ 08/Mar/17 ]

Thanks, here's what happens:

1. The C Driver should use all mongos servers equally, but it only uses the first mongos in the seed list due to a recent bug, CDRIVER-2067, which I'll fix in the next version.
2. When the first mongos shuts down, you get a series of server errors: the server is refusing to start new operations while it is terminating.
3. Once mongos closes its connection, the C Driver doesn't detect immediately that the connection is closed, due to a well-known Windows bug. We'll investigate a workaround: CDRIVER-2081
4. Since the C Driver doesn't immediately detect the closed connection, it must wait its default socket timeout, which is 5 minutes (we chose a long timeout to allow long-running commands like "createIndexes"). Setting your socket timeout to 10 seconds makes this faster.
5. Once the socket timeout expires, the C Driver re-checks all the mongos servers in its seed list, in parallel. After its connect timeout (default 10 seconds) expires, it knows that the first mongos is unavailable and the others are available, so it uses the second one. (It will use all available servers equally once CDRIVER-2067 is fixed.)

I'm going to close this bug and continue work in the two related bugs. The mongos failover behavior itself is working as expected on Windows, but there is room for improvement. Thanks for beginning this investigation.
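As a rough illustration of steps 4 and 5, the delay before the driver moves to another mongos can be estimated as the socket timeout plus the connect timeout. This is a minimal sketch under that assumption only; `failover_upper_bound` is a hypothetical helper, not a driver API, and it ignores the error period described in step 2.

```cpp
#include <cassert>
#include <chrono>

// Hypothetical helper (not part of libmongoc or mongocxx).
// Estimates the time between mongos closing its connection and the driver
// selecting another mongos, following steps 4 and 5 above: wait out the
// socket timeout, then wait out the connect timeout of the parallel re-check.
inline std::chrono::milliseconds failover_upper_bound(
    std::chrono::milliseconds socket_timeout,
    std::chrono::milliseconds connect_timeout) {
    assert(socket_timeout.count() >= 0 && connect_timeout.count() >= 0);
    return socket_timeout + connect_timeout;
}
```

With the defaults quoted above (5 minutes and 10 seconds) this estimate is 310 seconds; with the socket timeout set to 10 seconds it drops to 20 seconds.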

Comment by Aleksander Melnikov [ 06/Mar/17 ]

Now I have been able to investigate:
Q: Does the same behavior occur if you shut down the last mongos in the seed list, the one on port 27034, instead?
A: No, the app continues to work.

Q: If you keep retrying in a loop for 30 seconds or longer, does your application eventually recover and start using the second mongos?
A: Yes, after a bunch of errors like:

err: write results unavailable from host:27002 :: caused by :: Location17382: Can't use connection pool during shutdown: generic server error
err: write results unavailable from host:27002 :: caused by :: Location17382: Can't use connection pool during shutdown: generic server error
..... a lot of these messages ...
err: write results unavailable from host:27002 :: caused by :: Location17382: Can't use connection pool during shutdown: generic server error
err: Failed to send "insert" command with database "test": Failure during socket delivery: Unknown error (10054): generic server error
err: Failed to connect to target host: 127.0.0.1:27017: generic server error

the app continues to work.
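The retry behaviour discussed in this answer could be sketched roughly as follows. This is not the attached test.cpp; `retry_until` is a hypothetical helper, and it assumes the operation reports failure by throwing an exception derived from `std::exception`.

```cpp
#include <cassert>
#include <chrono>
#include <exception>
#include <functional>
#include <thread>

// Hypothetical helper: calls `op` until it succeeds or `deadline` elapses,
// sleeping `delay` between attempts. Returns true if `op` eventually succeeded.
inline bool retry_until(const std::function<void()>& op,
                        std::chrono::milliseconds deadline,
                        std::chrono::milliseconds delay) {
    assert(deadline.count() >= 0 && delay.count() >= 0);
    const auto stop = std::chrono::steady_clock::now() + deadline;
    for (;;) {
        try {
            op();          // e.g. an insert through a client taken from the pool
            return true;   // the operation went through
        } catch (const std::exception&) {
            // A real application would log the error message here.
        }
        // Give up if the next attempt would start after the deadline.
        if (std::chrono::steady_clock::now() + delay > stop) {
            return false;
        }
        std::this_thread::sleep_for(delay);
    }
}
```

For the scenario in this thread, the deadline would be 30 seconds or longer and the delay 50 ms, matching the values mentioned in the questions and answers.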

Q: How soon after you kill the first mongos do you query the cluster again? Are you in the middle of a query, or do you query within 5 seconds of killing mongos, or more than 5 seconds after?
A: I query (write to) the cluster with a 50 ms delay between operations. I have attached example code.

Q: Does this bug manifest if you change your URI to this?:
A: After the change:

err: write results unavailable from WKS-17364.net.billing.ru:27002 :: caused by :: Location17382: Can't use connection pool during shutdown: generic server error
err: write results unavailable from WKS-17364.net.billing.ru:27002 :: caused by :: Location17382: Can't use connection pool during shutdown: generic server error
err: write results unavailable from WKS-17364.net.billing.ru:27002 :: caused by :: Location17382: Can't use connection pool during shutdown: generic server error
err: could not find host matching read preference { mode: "primary", tags: [ {} ] } for set mon_0: generic server error
err: Failed to send "insert" command with database "test": socket error or timeout: generic server error
err: Failed to connect to target host: 127.0.0.1:27017: generic server error

... then the app continues to work normally.

Comment by Aleksander Melnikov [ 04/Mar/17 ]

I will try to investigate the steps after returning from vacation on March 15.

Comment by A. Jesse Jiryu Davis [ 03/Mar/17 ]

Hi Aleksander, have you had an opportunity to try steps 4,5,6,7 to help us diagnose?

Comment by Aleksander Melnikov [ 25/Feb/17 ]

1) MongoDB version 3.4.2
2) No SSL, no authentication
3) mongos was shut down from a trial version of MongoDB Ops Manager (deployment->processes->mongos->shutdown)
4,5,6,7) I will try to investigate ASAP

Comment by David Golden [ 23/Feb/17 ]

Aleksander, I'm moving this question to the CDRIVER Jira project, as server monitoring for mongocxx is handled by libmongoc. Jesse is going to take the lead from here on out and I'll continue to watch the ticket.

Comment by A. Jesse Jiryu Davis [ 23/Feb/17 ]

Hi, Aleksander.

  • What MongoDB version are you running, please?
  • Do you use SSL and/or authentication?
  • How do you shut down the first mongos?
  • Does the same behavior occur if you shut down the last mongos in the seed list, the one on port 27034, instead?
  • If you keep retrying in a loop for 30 seconds or longer, does your application eventually recover and start using the second mongos?
  • How soon after you kill the first mongos do you query the cluster again? Are you in the middle of a query, or do you query within 5 seconds of killing mongos, or more than 5 seconds after?
  • Does this bug manifest if you change your URI to this?:

"mongodb://host:27017,host:27033,host:27034/?socketTimeoutMS=10000"
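For illustration, a connection string of this shape can be assembled from a seed list and a socket timeout. This is a minimal sketch; `make_mongos_uri` is a hypothetical helper and not part of mongocxx or libmongoc.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: joins a mongos seed list into a connection string and
// appends the socketTimeoutMS option (in milliseconds).
inline std::string make_mongos_uri(const std::vector<std::string>& seeds,
                                   int socket_timeout_ms) {
    assert(!seeds.empty() && socket_timeout_ms > 0);
    std::string uri = "mongodb://";
    for (std::size_t i = 0; i < seeds.size(); ++i) {
        if (i != 0) {
            uri += ',';  // seed hosts are comma-separated
        }
        uri += seeds[i];
    }
    uri += "/?socketTimeoutMS=" + std::to_string(socket_timeout_ms);
    return uri;
}
```

Called with the three hosts from this ticket and 10000, it produces the URI quoted above. In the reporter's setup the resulting string would then be given to mongocxx::uri and passed on to mongocxx::pool.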

Comment by Aleksander Melnikov [ 22/Feb/17 ]

1) libmongoc version 1.5.2
2) uri is "mongodb://host:27017,host:27033,host:27034"
3) All operations fail (I think all connections in the pool fail with one operation)

Comment by David Golden [ 21/Feb/17 ]

Hi, Aleksander. Thanks for the report.

I have some questions:

  • What version of libmongoc are you using?
  • What is the exact URI string you're using? (You can omit/change any confidential details)
  • Does a specific operation fail? Or do all operations fail?

Comment by Aleksander Melnikov [ 15/Feb/17 ]

While using mongocxx::pool, a lot of exceptions occur:
No suitable servers found: `serverSelectionTimeoutMS` expired: [connection refused calling ismaster on 'host:27017'] [connection refused calling ismaster on 'host:27033'] [connection refused calling ismaster on 'host:27034']: generic server error

After that (possibly once all pool entries have been used), data continues to be sent.

Generated at Wed Feb 07 21:14:03 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.