Core Server / SERVER-41258

Router config remains stale for secondary readPreferences after moveChunk

    • Type: Bug
    • Resolution: Works as Designed
    • Priority: Major - P3
    • Fix Version/s: None
    • Affects Version/s: 3.6.8, 4.0.9
    • Component/s: None
    • Labels: None
    • Operating System: ALL

      Setup (cluster and data):

      mlaunch init --dir clusters/shardedcounttest --shards 2 --replicaset --nodes 3 --mongos 2
      mgeneratejs '{"country" : "US","payload":"Lorem ipsum dolor sit amet, consectetur adipiscing elit. Curabitur auctor, risus vitae molestie varius, tellus massa sagittis diam, non convallis augue magna quis tortor. Quisque dictum lacus quis faucibus fermentum. Fusce lobortis eros pretium, rhoncus ipsum at, mattis libero. Duis ut enim nunc. Fusce orci metus, egestas et arcu a, sollicitudin luctus enim. Maecenas enim sapien, condimentum vel ornare eget, feugiat vel velit. Vestibulum quis orci lectus. Nullam rhoncus massa ut ligula mattis volutpat. Pellentesque habitant morbi tristique senectus et netus et malesuada fames ac turpis egestas. Nam cursus vitae elit eget tempor. Etiam vulputate luctus arcu, vel lobortis nulla laoreet id. Maecenas ac mi metus. Nullam luctus consectetur pharetra. Vivamus orci enim, congue eget interdum non, semper sed lacus."}' -n 10000 | mongoimport -d mydb -c countries
      

      Setup (sharding and config):

      Note that setting a low orphanCleanupDelaySecs is very helpful in ensuring that the default of 15 minutes doesn't delay back-to-back migrations.

      mongo --port 27017
      sh.enableSharding("mydb")
      use mydb
      db.countries.createIndex({"country":1})
      sh.shardCollection("mydb.countries",{"country":1})
      while (true) { print(new ISODate() + ' - ' + db.countries.count({country:"US"}));sleep(1000) }
      
      mongo --port 27019
      db.adminCommand( { setParameter: 1, orphanCleanupDelaySecs: 2 } )
      
      mongo --port 27022
      db.adminCommand( { setParameter: 1, orphanCleanupDelaySecs: 2 } )
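
      If scripting the setup, the same parameter can also be set from Python; a minimal sketch, assuming (as in the shell steps above) that the shard primaries listen on ports 27019 and 27022:

      import pymongo

      # Lower the orphan cleanup delay on both shard primaries so the range
      # deleter runs shortly after each migration (default is 900 seconds).
      for port in (27019, 27022):
          client = pymongo.MongoClient('localhost', port)
          client.admin.command({'setParameter': 1, 'orphanCleanupDelaySecs': 2})
          client.close()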
      

      Testing:

      Note that the Python script (SERVER41258.py) connects to one router, while the moveChunk commands should be run against the other.
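
      The attached SERVER41258.py is not reproduced inline; judging by the output format under Results below, it presumably looks something like this minimal sketch (not the attached script itself):

      import datetime
      import time

      import pymongo

      # Connect through the first mongos (port 27017) with a secondary read
      # preference, mirroring the shell monitoring loop above.
      client = pymongo.MongoClient('mongodb://localhost:27017',
                                   readPreference='secondaryPreferred')
      coll = client['mydb']['countries']

      while True:
          # Print a timestamp and the count as seen via the secondaries.
          print(datetime.datetime.now(), coll.find({'country': 'US'}).count())
          time.sleep(1)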

      python SERVER41258.py
      

      Then, in a second shell, against the other router:

      mongo --port 27018
      sh.moveChunk("mydb.countries",{country:"US"},"shard02")
      sh.moveChunk("mydb.countries",{country:"US"},"shard01")
      

      (repeat as necessary)
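
      The back-and-forth migrations can also be scripted rather than typed into the shell; a sketch using pymongo against the second mongos (port 27018), assuming the mlaunch shard names shard01/shard02 used above:

      import time

      import pymongo
      from bson.son import SON

      # Drive the migrations through the *other* mongos (port 27018).
      client = pymongo.MongoClient('mongodb://localhost:27018')

      for destination in ('shard02', 'shard01', 'shard02', 'shard01'):
          # moveChunk is an admin command; this bounces the chunk owning
          # {country: "US"} between the two shards.
          client.admin.command(SON([('moveChunk', 'mydb.countries'),
                                    ('find', {'country': 'US'}),
                                    ('to', destination)]))
          # Leave time for the range deleter (orphanCleanupDelaySecs: 2) to run.
          time.sleep(10)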

      Results:

      The Python script shows the count dropping alongside range deleter activity, and it does not recover...

      2019-06-04 15:11:32.560431 10000
      2019-06-04 15:11:33.566062 10000
      2019-06-04 15:11:34.574102 8719
      2019-06-04 15:11:35.579170 3728
      2019-06-04 15:11:36.584497 0
      2019-06-04 15:11:37.586690 0
      
      2019-06-04T08:11:36.401-0700 I SHARDING [Collection Range Deleter] No documents remain to delete in mydb.countries range [{ country: MinKey }, { country: MaxKey })
      2019-06-04T08:11:36.402-0700 I SHARDING [Collection Range Deleter] Waiting for majority replication of local deletions in mydb.countries range [{ country: MinKey }, { country: MaxKey })
      2019-06-04T08:11:36.418-0700 I SHARDING [Collection Range Deleter] Finished deleting documents in mydb.countries range [{ country: MinKey }, { country: MaxKey })
      

      ...unless flushRouterConfig is run on the mongos that the Python script is connected to (port 27017 in this test)

      2019-06-04 15:14:17.252945 0
      2019-06-04 15:14:18.254547 0
      2019-06-04 15:14:19.264117 10000
      2019-06-04 15:14:20.273053 10000
      

      Amending the Python script to use readPreference primary prevents the issue. The issue also does not occur if the Python script and the moveChunk commands are run against the same router.
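
      Both the recovery step and the workaround can be applied from pymongo; a minimal sketch, assuming the same router (port 27017) and collection as above:

      import pymongo
      from pymongo import ReadPreference

      client = pymongo.MongoClient('mongodb://localhost:27017',
                                   readPreference='secondaryPreferred')
      coll = client['mydb']['countries']

      # Recovery: flush the cached routing table on the stale mongos.
      client.admin.command('flushRouterConfig')

      # Workaround: route the count through the primaries instead; per the
      # results above, this avoids the stale routing entirely.
      primary_coll = coll.with_options(read_preference=ReadPreference.PRIMARY)
      print(primary_coll.find({'country': 'US'}).count())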

      Original reproduction

      1. Create a sharded cluster with 2 replica-set shards.
      2. Create a collection foo with two fields [_id, country], sharded on _id. We inserted 200,000 records, which made 4 chunks.

      db.foo.find({"country" : "US"}).count() should return 200,000 documents.

      3. Run this Python script to print the number of matching documents using a secondary read preference:

      import time

      import pymongo

      constring = 'mongodb://localhost:27017/admin'
      print(constring)

      # Connect through the mongos with a secondary read preference.
      client = pymongo.MongoClient(constring, readPreference='secondaryPreferred')
      db1 = client['foo']

      while True:
          # Count the matching documents via the secondaries once a minute.
          count = db1.foo.find({"country": "UK"}).count()
          print(count)
          time.sleep(60)
      

      Here's the timeline:

      200000 -> script started
      200000
      200000
      200000 -> balancer enabled
      200000 -> results may look consistent, but that's only because the rangeDeleter hasn't started yet. A correctly routed count would return 300,000 documents, since the secondaries don't filter orphans. The count is still accessing only shard1
      200000
      200000
      199616 -> rangeDeleter started removing migrated chunks. 
      175725 -> The script is still accessing only shard1, so the count decreases
      152233
      128447
      105075
      100000 -> rangeDeleter finishes removing migrated chunks
      100000 -> script returns inconsistent results as it only points to shard1
      100000
      100000
      100000
      200000 -> A find was executed against the primaries, which reloaded the version, and results are consistent again
      200000 -> The count now accesses both shards (shard1 & shard2) so it returns consistent results again.
      200000
      200000
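
      To see which shard(s) the stale router targets for the count, the query can be explained through the same mongos; a sketch (the exact shape of the explain output varies by version, so it is simply printed):

      import pprint

      import pymongo
      from pymongo import ReadPreference

      client = pymongo.MongoClient('mongodb://localhost:27017')
      db1 = client['foo']

      # Explain the count through the same (possibly stale) mongos, keeping the
      # secondary read preference so the explain itself doesn't touch a primary.
      explain = db1.command({'explain': {'count': 'foo',
                                         'query': {'country': 'UK'}},
                             'verbosity': 'queryPlanner'},
                            read_preference=ReadPreference.SECONDARY_PREFERRED)

      # On a sharded cluster, the explain output indicates which shard(s)
      # the router targeted for the query.
      pprint.pprint(explain)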
      
    • Sprint: Sharding 2019-06-17, Sharding 2019-07-01

      Original description

      We had a production incident last week where we had inconsistent results reading from secondaries for a collection, and a secondary wasn't returning a document due to a chunk move and a stale chunk map. Running flushRouterConfig fixed the issue.

      We have a collection that has very little read and write activity, with the vast majority of reads occurring on secondaries. This collection was sharded on the field {k: 1}, with the vast majority of find queries being a findOne on {k: "someValue"}.

      What appears to be the case is that when a chunk migration happens, the mongos reloads the collection version only when an action is performed against the primary (read or write). If there is no action against the primary, the mongos will still use the old chunk map. This is problematic when adding shards and rebalancing, which is what we had done.

      On a read-only collection, or on a collection that rarely receives writes like ours, it is possible for secondary reads to miss documents after a chunk migration to a shard that didn't previously exist in the chunk map.

      It appears that if the collection receives a read or write to the primary, the mongos that issued the command will update its chunk map but other mongos will not.

      We're running 3.6.8 on MMAP.
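
      Since flushRouterConfig resolved the incident, one pragmatic mitigation after adding shards or rebalancing is to flush the routing table cache on every mongos; a sketch (the router addresses below are hypothetical):

      import pymongo

      # Hypothetical list of every mongos router in the cluster.
      ROUTERS = ['mongos1.example.net:27017', 'mongos2.example.net:27017']

      for address in ROUTERS:
          # Flush the cached chunk map so subsequent secondary reads on this
          # router target the post-migration chunk distribution.
          client = pymongo.MongoClient(address)
          client.admin.command('flushRouterConfig')
          client.close()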

      Attachments:

        1. SERVER41258.py (0.4 kB)
        2. test.js (1 kB)

            Assignee: Randolph Tan (randolph@mongodb.com)
            Reporter: Jon Hyman (jonhyman)
            Votes: 0
            Watchers: 10

              Created:
              Updated:
              Resolved: