Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: 2.2.13, 2.2.14
Affects Version/s: None
Component/s: None
Labels:
- external-user

I ran into this issue while working with MongoDB Atlas. Twice in the last 10 days there was a 30-minute outage in which the primary went down inexplicably. For reference, the outages themselves are being handled in MMSSUPPORT-12572.

As the client, I'm using node.js 6.5.0 with Mongoose 4.6.8, which ships with the MongoDB native driver 2.2.11. I was also able to reproduce the issue with Mongoose 4.7.2-pre, which ships with the MongoDB native driver 2.2.12. Several processes connect to the database. I'm using the recommended connection string format: it enumerates all members and provides the replica set name.
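To make the connection string format concrete, here is a minimal sketch that builds such a URI. The host names, the database name and the replica set name `rs0` are placeholders, and `buildReplicaSetUri` is a hypothetical helper written for this report, not part of Mongoose or the driver.

```javascript
// Hypothetical helper: builds a replica set connection string that
// enumerates every member and names the replica set.
function buildReplicaSetUri(hosts, database, replicaSet, options) {
  if (!Array.isArray(hosts) || hosts.length === 0) {
    throw new Error('at least one host is required');
  }
  // replicaSet goes first, followed by any extra query options.
  const params = Object.assign({ replicaSet: replicaSet }, options || {});
  const query = Object.keys(params)
    .map((key) => `${encodeURIComponent(key)}=${encodeURIComponent(params[key])}`)
    .join('&');
  return `mongodb://${hosts.join(',')}/${encodeURIComponent(database)}?${query}`;
}

// Placeholder hosts; a real deployment would list its own members.
const uri = buildReplicaSetUri(
  ['a.example.net:27017', 'b.example.net:27017', 'c.example.net:27017'],
  'mydb',
  'rs0'
);
console.log(uri);
```

The resulting string can be passed to `mongoose.connect(uri)` as usual.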

When the primary goes down, the remaining members of the replica set time out when attempting to connect to the primary, run an election, and choose a new primary correctly. However, they keep trying to contact the failing primary, resulting in timeout errors. This would be fine, except that after some time they also cause connecting clients to time out (when connecting to the working members, see the description below). I'm assuming this is the reason, or part of it, because I never get this behavior when all members of the replica set are up.

The client stops working when the primary goes down, and to be honest I'm not sure why. Restarting the client seems to help. What happens is that the client also tries to connect to the failing primary, which results in a timeout error. After this error, the client (my app server) seems to connect to the database through the remaining members of the replica set. But after a short time the database disconnects the client, leaving it unusable for any operation that requires database access.

When trying to reproduce the issue, the key is to simulate a timeout from the failing primary. This is not the same as just killing the primary, because that would result in a much faster “Connection refused” error. What happens during the outage is a timeout error instead.

Here's some insight from my research:

I was able to reproduce the issue on a local replica set, set up as described in https://docs.mongodb.com/v3.2/tutorial/deploy-replica-set-for-testing/. I put the primary behind a TCP proxy (toxiproxy, see https://github.com/Shopify/toxiproxy) and then simulated a network outage like the one that happens on MongoDB Atlas when a primary goes down. The simulation works by adding a timeout setting to both the upstream and the downstream direction.

Steps to set up the environment:

toxiproxy-server  &
toxiproxy-cli create mongod_primary -l localhost:12345 -u localhost:27017

When adding the first member to the replica set, use port 12345, which is the proxy's port.

Then, connect a client to the replica set (make sure to list the primary with the 12345 port as well), and then execute the following:

toxiproxy-cli toxic add mongod_primary -t timeout -a timeout=0 --downstream
toxiproxy-cli toxic add mongod_primary -t timeout -a timeout=0 --upstream

You'll see that the other members elect a new primary, but the MongoDB client becomes unusable: every attempt to use the database simply never returns to the caller.
Even after restarting the client, it often gets disconnected because of a timeout error (against the available members, which is odd, because they are working). The same happens if you remove the failing member from the connection string. It looks as if the remaining members' attempts to connect to the failing primary, which hang until the timeout error occurs, somehow reduce the availability of those members and cause connection drops and timeouts for connecting clients.
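Because the calls never return, a repro script gives no output at the point where it hangs. The sketch below is one way to make the hang visible: it races a database call against a timer and reports which call exceeded its deadline. `withDeadline` is a hypothetical helper for diagnosis only; it does not cancel the underlying operation and it is not a fix for the behavior described here.

```javascript
// Hypothetical diagnostic helper: rejects if `promise` has not settled
// within `ms` milliseconds. The wrapped operation itself keeps running.
function withDeadline(promise, ms, label) {
  let timer;
  const deadline = new Promise((resolve, reject) => {
    timer = setTimeout(() => {
      reject(new Error(`${label} did not return within ${ms} ms`));
    }, ms);
  });
  // Whichever settles first wins; always clear the timer afterwards
  // so a finished call does not keep the process alive.
  return Promise.race([promise, deadline]).then(
    (value) => { clearTimeout(timer); return value; },
    (err) => { clearTimeout(timer); throw err; }
  );
}

// Usage sketch (SomeModel stands for any Mongoose model in the app):
// withDeadline(SomeModel.findOne().exec(), 5000, 'findOne')
//   .then((doc) => console.log('query returned', doc))
//   .catch((err) => console.error(err.message));
```

With this in place, a query issued after the toxics are added should log a "did not return" message instead of blocking silently, which makes it easier to line up client behavior with the replica set logs.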

PS: I found that toxiproxy crashes on macOS Sierra after being used for a short time. I was able to make the crash go away by building the project from source with the latest Go version.

Attachments:
- app_server.log (68 kB, Dec 08 2016 08:31:49 PM UTC)
- replicas.log (68 kB, Dec 08 2016 08:31:48 PM UTC)

Assignee: Christian Amor Kvalheim
Reporter: Gian Franco Zabarino
Reviewers: None
Votes: 0
Watchers: 4

Created: Dec 08 2016 01:51:21 PM UTC
Updated: Dec 23 2021 11:55:51 PM UTC
Resolved: Dec 08 2016 05:56:23 PM UTC
Confidence Status Last Update: 08/Dec/16 5:53 PM
