[SERVER-9788] mongos does not re-evaluate read preference once a valid replica set member is chosen Created: 28/May/13  Updated: 08/Feb/23  Resolved: 30/Jun/14

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.4.3
Fix Version/s: 2.6.4, 2.7.3

Type: Bug Priority: Major - P3
Reporter: Remon van Vliet Assignee: Randolph Tan
Resolution: Done Votes: 4
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

All


Issue Links:
Depends
Documented
is documented by DOCS-4484 Document that mongoS no longer pins c... Closed
Duplicate
is duplicated by SERVER-14781 ReadPreference.secondaryPreferred doe... Closed
is duplicated by SERVER-10904 Possible for _master and _slaveConn t... Closed
is duplicated by SERVER-7629 Make DBClientReplicaSet draw connecti... Closed
is duplicated by SERVER-9984 MongoS spawns new connections to seco... Closed
Related
related to DOCS-6268 No secondary connection pinning in 3.0 Closed
related to CXX-275 Backport server r2.7.2..r2.7.3 change... Closed
related to SERVER-10449 Queries not balanced among the second... Closed
is related to SERVER-4706 when a socket between mongos and mong... Closed
is related to SERVER-14899 Re-evaluate the behavior of connectio... Closed
Tested
Operating System: ALL
Backport Completed:
Steps To Reproduce:

1) Create and start a 3-member replica set (primary, secondary, arbiter)
2) Start mongos
3) Send reads to mongos with the secondaryPreferred read preference; verify they go to the secondary
4) Kill the secondary
5) Send reads to mongos; verify they go to the primary
6) Restart the secondary
7) Send reads to mongos; verify they go to the secondary (they don't).
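
For steps 3, 5 and 7, a minimal read-loop sketch using the legacy (2.x) Java driver API that was current at the time might look like the following; the host, port and the test.data namespace are illustrative assumptions, and which member actually serves the reads can be checked with mongostat or the members' logs:

    import com.mongodb.BasicDBObject;
    import com.mongodb.DBCollection;
    import com.mongodb.MongoClient;
    import com.mongodb.MongoClientOptions;
    import com.mongodb.ReadPreference;
    import com.mongodb.ServerAddress;

    public class ReadPrefRepro {
        public static void main(String[] args) throws Exception {
            // Connect to mongos (not to the replica set directly) and request
            // secondaryPreferred reads for every operation on this client.
            MongoClientOptions opts = MongoClientOptions.builder()
                    .readPreference(ReadPreference.secondaryPreferred())
                    .build();
            MongoClient client = new MongoClient(new ServerAddress("localhost", 27017), opts);

            DBCollection coll = client.getDB("test").getCollection("data");

            // Steps 3, 5 and 7: issue a steady stream of reads and observe on
            // the replica set members which node is serving them.
            for (int i = 0; i < 1000; i++) {
                coll.findOne(new BasicDBObject("_id", i % 100));
                Thread.sleep(100);
            }
            client.close();
        }
    }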

Participants:

 Description   
Issue Status as of Jul 22, 2014

ISSUE SUMMARY
When reading from a sharded cluster via mongos with a specific read preference, mongos never re-evaluates that preference as long as it is connected to a valid member. In certain circumstances this can lead to mongos reading for prolonged periods from nodes that do not match the user's intention and expectation.

Example:

When the "secondaryPreferred" read preference is set, mongos connects to an available secondary on a new connection for reads. If there are no longer any available secondaries, mongos correctly switches to a primary node. However, even when a secondary node is available again, mongos does not switch back to read from the secondary node. The connection is pinned to the primary because under "secondaryPreferred", the primary is a valid target to read from and no re-evaluation is carried out until the the target becomes invalid or unreachable.

USER IMPACT
Reads can go to primary nodes for prolonged periods even though the user specified that they prefer secondary reads. Users may not even be aware of this if they don't closely monitor the state of their replica sets. Depending on the application architecture, this can lead to degraded read and write throughput.

WORKAROUNDS
The only workaround is to forcibly unpin the connection by specifying a different readPreference on said connection.
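
As a sketch only, assuming the legacy (2.x) Java driver and reusing the DBCollection from the sketch under "Steps To Reproduce"; because the driver pools its own connections, the read with the different preference is not guaranteed to travel over the same pinned mongos connection, so this may need to be repeated:

    // The mongos connection is pinned to the primary under secondaryPreferred.
    // Issue one read with a read preference the primary cannot satisfy
    // ("secondary") so mongos has to pick a node again, then switch back.
    coll.setReadPreference(ReadPreference.secondary());
    coll.findOne();
    coll.setReadPreference(ReadPreference.secondaryPreferred());
    coll.findOne();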

AFFECTED VERSIONS
All previous production releases are affected by this issue.

FIX VERSION
The fix is included in the 2.6.4 production release.

RESOLUTION DETAILS

  1. Secondary connections are now drawn from the global pool.
  2. For mongos, the active ReplicaSet connection will release its secondary connection back to the pool at the end of the query/command. This also has the side effect of 'unpinning' the read preference settings. In other words, when this connection is reused, node selection is evaluated again according to the read preference.

As these changes could not be backported to 2.6, a different fix was implemented specifically for 2.6: a new mongos server parameter, internalDBClientRSReselectNodePercentage, was introduced. It can be set to any value from 0 to 100 (default 0) and represents the probability, expressed as a percentage, that a replica set connection in mongos re-evaluates node selection from scratch, regardless of whether the current read preference is compatible with the last-used node. Extra care should be taken because reselecting a replica set node destroys the old connection and creates a new one; in extreme cases (for example, 100%), mongos can end up creating and destroying a connection for every read request.
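
For illustration only (not verified against a 2.6.4 build): a server parameter is normally supplied at mongos startup, e.g. --setParameter internalDBClientRSReselectNodePercentage=5, and, if this particular parameter is also runtime-settable, via a setParameter command against the mongos admin database. A sketch with the legacy Java driver, where client is a MongoClient connected to the mongos and the value 5 is an arbitrary example:

    // Ask this mongos to re-select a replica set node for roughly 5% of reads.
    // Must be run against the admin database of the mongos itself.
    CommandResult res = client.getDB("admin").command(
            new BasicDBObject("setParameter", 1)
                    .append("internalDBClientRSReselectNodePercentage", 5));
    System.out.println(res);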

Original description

During lab tests with 1 primary, 1 secondary and 1 arbiter, I'm running into the following issue when using the Java driver's "secondaryPreferred" read preference:

We start with a healthy 3-member replica set and start our load tests. The load test connects to mongos. All reads go to the secondary member. We kill the secondary and reads are correctly routed to the primary. We restart the secondary, but reads continue to go to the primary indefinitely.

This might be a Java driver issue, since I was not able to reproduce it in the shell due to the lack of support for this read mode there (I think?).



 Comments   
Comment by Michael Paik [ 11/Nov/14 ]

renctan, is removing this bullet point the only applicable change?

Comment by Randolph Tan [ 22/Jul/14 ]

More conservative fix for v2.6:

Added a new mongos server parameter, "internalDBClientRSReselectNodePercentage". It can be set to any value from 0 to 100 (default 0) and represents the probability, expressed as a percentage, that a replica set connection in mongos re-evaluates node selection from scratch, regardless of whether the current read preference is compatible with the currently pinned node. Extra care should be taken since v2.6 doesn't pool secondary connections, so unpinning a node from the replica set connection has the side effect of destroying the connection. This means that in extreme cases (for example, 100%), mongos can be creating and destroying connections for every read request.

Comment by Githook User [ 22/Jul/14 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-9788 mongos does not respect secondary preferred after temporarily unavailable secondary

v2.6 fix: Added a server parameter to tweak how frequently a replica set connection decides to re-evaluate node selection from scratch for a query with a read preference (i.e., decides not to use the cached connection regardless of read preference compatibility).
Branch: v2.6
https://github.com/mongodb/mongo/commit/10f5eb6b0d3eee62fbd7492a2ab4745306e0f54e

Comment by Remon van Vliet [ 14/Jul/14 ]

Great, that sounds like the appropriate fix.

Comment by Randolph Tan [ 30/Jun/14 ]

Changes made:

1. Secondary connections are now drawn from the global pool.
2. For mongos, the active replica set connection will release the secondary connections back to the pool. To be more precise, the thread-local ClientConnection object will do this. This also has the side effect of 'unpinning' the read preference settings. In other words, when this connection is reused, node selection for the read preference will be evaluated again from scratch.

Comment by Githook User [ 30/Jun/14 ]

Author: Randolph Tan (renctan) <randolph@10gen.com>

Message: SERVER-9788 mongos does not respect secondary preferred after temporarily unavailable secondary
Branch: master
https://github.com/mongodb/mongo/commit/09d2bf2a43cbf6e7ac10d4dc89934528001d0b69

Comment by Scott Hernandez (Inactive) [ 28/Jan/14 ]

The Java driver exhibits this behavior because it has a long-lived connection pool, and once a pooled connection maps to a backend, it sticks to that one, providing consistency (see below).

The goal is not to load-balance each individual request/operation across the available replicas, but rather to distribute connections when the sockets are established. This yields the most consistent view of the data because it avoids reads from different replicas across the replication window (so you don't see new data, then old data), which would effectively mean reading out of normal time order.

Remon, replicas are not really good for read-scaling, unfortunately; if you want to scale, read or write, it is best to add more shards, not replicas. There are some exceptions to this, but they are few and far between, and related to over-saturating nodes and/or large node latencies.

If you have a specific use-case it would be good to provide it here so we can suggest what to do.

Comment by Irina Kaprizkina [ 12/Nov/13 ]

We are experiencing the same issue. In our tests it seems to point to the Java driver not being able to use the restarted, available secondary server.

Comment by Vinod Kumar [ 28/Oct/13 ]

Hi, any updates on this? It seems we saw the same behaviour in SERVER-11117.
I am also curious what the Java-driver-specific setting is that you explained above.

Comment by Remon van Vliet [ 19/Jun/13 ]

Any updates on this? I would like to know what the decisions, if any, are regarding this issue since it might mean we'll have to start working on a workaround.

Comment by Remon van Vliet [ 05/Jun/13 ]

1) Ah, yes that pretty much explains it. Probably good to provide a link to that section in the read preferences docs
2) I was not aware of this. Any particular reason the Java driver is an exception to the common implementation that you're aware of?
3) Okay, I would really appreciate a decision one way or the other since it affects whether or not we can reliably use secondaries for read scaling.

Comment by Randolph Tan [ 31/May/13 ]

1) Although it was not clear in the documentation, the pinning behavior was described in the auto-retry section.
2) Actually, the Java driver is a special case where the pinning behavior must be explicitly requested by the user.
3) You have a point. We will discuss this internally and look for ways to solve this issue.

Comment by Remon van Vliet [ 30/May/13 ]

I understand, but it's rather time-consuming to isolate the test as it's currently built on top of some in-house tooling. It's almost certainly the pinning behaviour. I have a test that runs 20 threads that all do random reads from a test collection at maximum throughput. The database configuration is as described.

I would argue this is actually a bug rather than a feature request, for the following reasons:

1) The contract for secondaryPreferred as described in the documentation is "... read from secondary members, but in situations where the set consists of a single primary (and no other members,) the read operation will use the set’s primary.". Currently it does not adhere to this (it will read from a primary in situations where there ARE other members).

2) The behaviour when connecting to a replica set directly is currently not consistent with connecting through mongos. Drivers behave correctly (as in, do as advertised) whereas mongos does not.

3) The current behaviour can lead to prolonged, significantly degraded read and write throughput while a by-then perfectly healthy secondary is available. With sufficiently long cluster uptimes this would almost certainly lead to situations where secondary nodes cannot be counted on to carry read load.

Hope you agree.

Comment by Randolph Tan [ 29/May/13 ]

I just wanted to make sure that what you are experiencing is the pinning behavior and not something else. If it is indeed the pinning behavior, then I will convert this into a feature request to allow unpinning of connections.

Comment by Remon van Vliet [ 29/May/13 ]

I understand the reasoning and it's perfectly valid for various use cases, but I think that's a decision for the developer to make. If they want to avoid that behaviour they should not set the "secondary preferred" read preference, which implies the expectation that reads will switch from the primary back to a secondary when the latter becomes available. In that case the developer clearly prioritizes removing read load from the primary. I also don't think the back-in-time issue is that relevant for scenarios where developers have to take eventual consistency into account anyway (it is no different from switching from one secondary to the next when they are at different positions in the oplog).

I don't have a very practical way to share the entire test unfortunately. Are you not able to reproduce?

Comment by Randolph Tan [ 29/May/13 ]

Hi,

Can you share the test project?

The reasoning behind the pinning logic was to avoid going back in time as much as possible. For example, if doc A was deleted at time T and the client is connected to node0, which has optime > T, we want to avoid the situation where the client switches to node1, which has optime < T, making doc A visible to it again. The jump from the view of the world at optime > T to optime < T is what I was referring to as "going back in time".

Comment by Remon van Vliet [ 29/May/13 ]

Hi,

Yes, that is the behaviour I'm seeing. I would argue that it is not correct behaviour. A secondary preferred read preference should do exactly that and prefer secondary nodes when they are available. It currently does not follow that contract, and there are very valid reasons why you would not want the current behaviour. Additionally, the behaviour isn't consistent with accessing replica sets directly rather than through mongos. My test is multi-threaded, by the way.

Comment by Randolph Tan [ 28/May/13 ]

Hi,

The mongos pins the node chosen unless the node becomes unreachable or the read preference setting becomes incompatible with the selected node. In addition, mongos uses pooled connections, so if your test is single-threaded, it is very likely using the same pinned connection from the pool.

Comment by Remon van Vliet [ 28/May/13 ]

Note that it works perfectly fine if the driver connects directly to the repset rather than through mongos.
