[SERVER-41258] Router config remains stale for secondary readPreferences after moveChunk Created: 21/May/19  Updated: 27/Oct/23  Resolved: 28/Jun/19

Status: Closed
Project: Core Server
Component/s: None
Affects Version/s: 3.6.8, 4.0.9
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Jon Hyman Assignee: Randolph Tan
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: File SERVER41258.py     File test.js    
Issue Links:
Related
is related to DOCS-12797 Improve documentation around doing se... Closed
Operating System: ALL
Steps To Reproduce:

Setup (cluster and data):

mlaunch init --dir clusters/shardedcounttest --shards 2 --replicaset --nodes 3 --mongos 2
mgeneratejs '{"country" : "US","payload":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur auctor, risus vitae molestie varius, tellus massa sagittis diam, non convallis augue magna quis tortor. Quisque dictum lacus quis faucibus fermentum. Fusce lobortis eros pretium, rhoncus ipsum at, mattis libero. Duis ut enim nunc. Fusce orci metus, egestas et arcu a, sollicitudin luctus enim. Maecenas enim sapien, condimentum vel ornare eget, feugiat vel velit. Vestibulum quis orci lectus. Nullam rhoncus massa ut ligula mattis volutpat. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nam cursus vitae elit eget tempor. Etiam vulputate luctus arcu, vel lobortis nulla laoreet id. Maecenas ac mi metus. Nullam luctus consectetur pharetra. Vivamus orci enim, congue eget interdum non, semper sed lacus."}' -n 10000 | mongoimport -d mydb -c countries

Setup (sharding and config):

Note that setting a low orphanCleanupDelaySecs helps ensure that the default delay of 15 minutes does not hold up back-to-back migrations.

mongo --port 27017
sh.enableSharding("mydb")
use mydb
db.countries.createIndex({"country":1})
sh.shardCollection("mydb.countries",{"country":1})
while (true) { print(new ISODate() + ' - ' + db.countries.count({country:"US"}));sleep(1000) }
 
mongo --port 27019
db.adminCommand( { setParameter: 1, orphanCleanupDelaySecs: 2 } )
 
mongo --port 27022
db.adminCommand( { setParameter: 1, orphanCleanupDelaySecs: 2 } )

Testing:

Note that the Python script (SERVER41258.py) connects to one router, while the moveChunk commands should be run against the other. A sketch of what the script likely does is shown after these testing steps.

python SERVER41258.py

Then, while the script is running, on the other router:

mongo --port 27018
sh.moveChunk("mydb.countries",{country:"US"},"shard02")
sh.moveChunk("mydb.countries",{country:"US"},"shard01")

(repeat as necessary)
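The attached SERVER41258.py is not reproduced in the ticket text. The following is a sketch of what it likely does, based on the original reproduction script further below and the timestamped output (assumptions: pymongo and the mydb.countries namespace from the setup above):

# Sketch only: the real script is the SERVER41258.py attachment.
import datetime
import time

import pymongo

# Connect to the first router with secondaryPreferred reads.
client = pymongo.MongoClient('mongodb://localhost:27017/admin',
                             readPreference='secondaryPreferred')
db = client['mydb']

while True:
    # Legacy count with a query predicate and the driver's default readConcern.
    n = db.countries.count({'country': 'US'})
    print('{} {}'.format(datetime.datetime.now(), n))
    time.sleep(1)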

Results:

The Python script shows the count dropping alongside range deleter activity, and it does not recover...

2019-06-04 15:11:32.560431 10000
2019-06-04 15:11:33.566062 10000
2019-06-04 15:11:34.574102 8719
2019-06-04 15:11:35.579170 3728
2019-06-04 15:11:36.584497 0
2019-06-04 15:11:37.586690 0

2019-06-04T08:11:36.401-0700 I SHARDING [Collection Range Deleter] No documents remain to delete in mydb.countries range [{ country: MinKey }, { country: MaxKey })
2019-06-04T08:11:36.402-0700 I SHARDING [Collection Range Deleter] Waiting for majority replication of local deletions in mydb.countries range [{ country: MinKey }, { country: MaxKey })
2019-06-04T08:11:36.418-0700 I SHARDING [Collection Range Deleter] Finished deleting documents in mydb.countries range [{ country: MinKey }, { country: MaxKey })

...unless flushRouterConfig is run on the mongos that the Python script is connected to (port 27017 in this test)

2019-06-04 15:14:17.252945 0
2019-06-04 15:14:18.254547 0
2019-06-04 15:14:19.264117 10000
2019-06-04 15:14:20.273053 10000

Amending the Python script to use readPreference primary prevents the issue. The issue also does not occur if both the Python script and the moveChunk commands are run against the same router.
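A minimal sketch of the recovery step described above, assuming pymongo and the mongos on port 27017 (the ticket does not record the exact invocation used):

import pymongo

# Connect directly to the mongos that the monitoring script uses (port 27017)
# and clear its cached routing table; equivalent to { flushRouterConfig: 1 }.
mongos = pymongo.MongoClient('mongodb://localhost:27017/admin')
mongos.admin.command('flushRouterConfig')

After this, the count reported by the monitoring script returns to 10000, as shown in the output above.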

original reproduction

1. Create a 2 replica-set sharded cluster
2. Create a collection foo with two fields, _id and country, sharded on _id. We inserted 200,000 records, which made 4 chunks.

db.foo.find({"country" : "US"}).count() should return 200,000 documents.

3. Run this Python script to print the count from a secondaryPreferred query every 60 seconds:

import pymongo
import time
import datetime
from pymongo.errors import ConnectionFailure, AutoReconnect
from bson.objectid import ObjectId
 
constring = 'mongodb://localhost:27017/admin'
print(constring)
client = pymongo.MongoClient(constring, readPreference='secondaryPreferred')
db1=client['foo']
while True:
      count = db1.foo.find({"country" : "UK"}).count()
      print(count)
      time.sleep(60)

Here's the timeline:

200000 -> script started
200000
200000
200000 -> balancer enabled
200000 -> results may look consistent, but only because the rangeDeleter hasn't started yet. A router with a refreshed chunk map would actually return 300,000 documents at this point, since secondaries don't filter orphans. The count is still accessing only shard1.
200000
200000
199616 -> rangeDeleter started removing migrated chunks
175725 -> The script is still accessing only shard1, so the count decreases
152233
128447
105075
100000 -> rangeDeleter finished removing migrated chunks
100000 -> The script returns inconsistent results as it still targets only shard1
100000
100000
100000
200000 -> A find executed against the primaries reloaded the collection version, and results are consistent again
200000 -> The count now accesses both shards (shard1 & shard2), so it returns consistent results again.
200000
200000

Sprint: Sharding 2019-06-17, Sharding 2019-07-01
Participants:

 Description   
original description

We had a production incident last week where we got inconsistent results when reading from secondaries for a collection: a secondary read wasn't returning a document because of a chunk move and a stale chunk map. Running flushRouterConfig fixed the issue.

We have a collection that has very little read and write activity, with the vast majority of reads occurring on secondaries. The collection is sharded on the key

{k: 1}

and the vast majority of find queries are findOne calls on

{k: "someValue"}.

What appears to be the case is that when a chunk migration happens, the mongos reloads the collection version only when an action is performed against the primary (read or write). If there is no action against the primary, the mongos will still use the old chunk map. This is problematic when adding shards and rebalancing, which is what we had done.

On a read-only collection, or on a collection that rarely receives writes like ours, it is possible for secondary reads to miss documents after a chunk migration to a shard that wasn't previously in the chunk map.

It appears that if the collection receives a read or write to the primary, the mongos that issued the command will update its chunk map but other mongos will not.

We're running 3.6.8 with the MMAPv1 storage engine.
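A minimal illustration of the reporter's observation above, assuming pymongo and a hypothetical mydb.mycoll collection sharded on {k: 1}; that a primary read refreshes the router's chunk map is the reporter's claim, not a documented guarantee:

import pymongo

# Any read routed to the primary through the affected mongos, e.g. a findOne
# with readPreference primary, is what the reporter observed triggering a
# refresh of that router's chunk map.
client = pymongo.MongoClient('mongodb://localhost:27017/admin',
                             readPreference='primary')
client['mydb']['mycoll'].find_one({'k': 'someValue'})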



 Comments   
Comment by Randolph Tan [ 03/Jan/23 ]

Note that with SERVER-56488 (in v5.0), default readConcern is now local.

Comment by Scott Glajch [ 19/Dec/22 ]

How can this be "works as designed"? The default read concern fails to find all of the data when reading?
This doesn't seem reasonable.

We are experiencing this issue as well in both mongo 4.2 and 4.4.
Chunk moves done on the command line against one mongos are breaking other mongos' views of those collections. (We have to do these chunk moves because MongoDB's "Primary Database" model means that collections with fewer chunks than shards create an extremely imbalanced cluster, and mongo's built-in balancer doesn't help solve this.)

It turns out we need to do a flushRouterConfig at the database level (doing it at the collection level doesn't fix the issue).
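A sketch of that scope distinction, written with pymongo for consistency with the reproduction script and using the mydb.countries namespace from the reproduction steps as a stand-in (the commenter's actual commands are not included in the ticket; the per-database and per-collection forms require a server version that supports them, such as the 4.2/4.4 versions mentioned here):

import pymongo

mongos = pymongo.MongoClient('mongodb://localhost:27017/admin')

# Database-level flush: { flushRouterConfig: "mydb" } -- the form the commenter
# reports as actually resolving the stale view.
mongos.admin.command('flushRouterConfig', 'mydb')

# Collection-level flush: { flushRouterConfig: "mydb.countries" } -- reported
# here as insufficient.
mongos.admin.command('flushRouterConfig', 'mydb.countries')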

What's worse is that the 4.2 documentation around this claimed that "clearing jumbo chunks" and running "movePrimary" are the only use cases requiring a manual flush of the router config, so moveChunk isn't mentioned at all.
And in 4.4 this documentation was removed (since the jumbo chunk/movePrimary cases were fixed), so there's no doc on this now.
https://www.mongodb.com/docs/v4.2/reference/command/flushRouterConfig/#considerations
https://www.mongodb.com/docs/v4.4/reference/command/flushRouterConfig/

 

Comment by Randolph Tan [ 12/Jun/19 ]

I believe you are hitting this because the count you are sending is probably using the default read concern of "available". I have also attached a JavaScript snippet (test.js) demonstrating this issue. If you set the readConcern to something other than available, you should start getting the expected results.

P.S. Also note the count command doesn't filter out "orphaned" documents unless it has a query predicate and mongod version >= 4.0.
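A minimal pymongo sketch of that suggestion, using the mydb.countries collection from the reproduction steps (the attached test.js is not reproduced here); the readConcern level shown is one possible choice:

import pymongo
from pymongo import ReadPreference

client = pymongo.MongoClient('mongodb://localhost:27017/admin')
db = client['mydb']

# Run the count command with a query predicate and an explicit readConcern,
# rather than the default "available" level, while still reading from a secondary.
result = db.command(
    'count', 'countries',
    query={'country': 'US'},
    readConcern={'level': 'local'},
    read_preference=ReadPreference.SECONDARY_PREFERRED,
)
print(result['n'])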

Comment by Eric Sedor [ 04/Jun/19 ]

jonhyman, I'm going to pass this on to an appropriate team to investigate further. Thanks for your patience so far.

Comment by Eric Sedor [ 04/Jun/19 ]

I believe we can rule out SERVER-3645. Not only is that ticket explicitly about higher-than-expected results, the following code uses only find and still produces the zeroed-out count when the chunk leaves the node:

cursor = db1.countries.find({"country" : "US"})
count = 0
for doc in cursor:
    count += 1

Comment by Eric Sedor [ 29/May/19 ]

jonhyman we can confirm that we can reproduce the drop in the count result you are reporting, but are still investigating the reason. Thanks for your patience.
