[SERVER-38080] Reading using secondaryPreferred fails without calling flushRouterConfig when chunks are moved to different shards Created: 12/Nov/18  Updated: 27/Oct/23  Resolved: 14/Nov/18

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.6.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: Milind Vaidya [X] Assignee: Danny Hatcher (Inactive)
Resolution: Works as Designed Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Operating System: ALL
Participants:

 Description   

We have a production sharded cluster that was recently upgraded from 3.4 to 3.6. It initially had 4 shards, each with 1 primary and 1 secondary, served by 5 mongos routers. We then scaled it up by adding 2 more shards, so there are now 6 shards, each with 1 primary and 1 secondary. It was then time to rebalance the collections, which moved some chunks from the 4 original shards to the 2 newly added ones.

After this rebalancing there were many cases where data was not found when some of the mongos routers were queried with the secondaryPreferred option, even though the data is actually present if you look in the db directly. After issuing flushRouterConfig on all the mongos instances, things were back to normal.

But this workaround is not reliable, because balancing will keep running as data comes in and gets modified. It is critical that all mongos routers stay up to date with changes in the data distribution.

We also use an AWS Route53 record where the name 'prod-mongos' points to multiple mongos instances. With 'prod-mongos' pointing to 2 mongos instances, x and y, the following experiments were run using PHP.

Step 1: With 4 consecutive executions, a result was returned in some cases and not in others.

Connection string: $conn = new MongoDB\Client('mongodb://prod-mongos:27017', array("readPreference" => "secondaryPreferred", "socketTimeoutMS" => 5000));

 

$ php balancer_bug_test.php -i id
$ php balancer_bug_test.php -i id
MongoDB\Model\BSONDocument Object
(
    [storage:ArrayObject:private] => Array
        (
            record details
        )
)
$ php balancer_bug_test.php -i id
$ php balancer_bug_test.php -i id
MongoDB\Model\BSONDocument Object
(
    [storage:ArrayObject:private] => Array
        (
            record details
        )
)
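For reference, here is a minimal sketch of what a test script like balancer_bug_test.php could look like. The actual script was not attached to the ticket, so the database, collection, and field names (mydb, mycoll, id) are assumptions:

<?php
// balancer_bug_test.php (sketch) -- look up one document by id via the
// 'prod-mongos' DNS name, reading from a secondary when possible.
require 'vendor/autoload.php';

$opts   = getopt('i:');   // -i <id> on the command line
$client = new MongoDB\Client(
    'mongodb://prod-mongos:27017',
    array('readPreference' => 'secondaryPreferred', 'socketTimeoutMS' => 5000)
);

// mydb, mycoll, and id are placeholders for the real namespace and field.
$doc = $client->mydb->mycoll->findOne(array('id' => $opts['i']));
if ($doc !== null) {
    print_r($doc);        // produces the BSONDocument dump shown above
}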

 

Step 2: 'prod-mongos' pointing only to mongos x.

 

$ php balancer_bug_test.php -i id
$ php balancer_bug_test.php -i id
$ php balancer_bug_test.php -i id
$ php balancer_bug_test.php -i id
No result

 

Step 3: Connected to the mongos x shell and executed:

db.adminCommand("flushRouterConfig")

$ php balancer_bug_test.php -i id
MongoDB\Model\BSONDocument Object
(
    [storage:ArrayObject:private] => Array
        (
            record details
        )
)
$ php balancer_bug_test.php -i id
MongoDB\Model\BSONDocument Object
(
    [storage:ArrayObject:private] => Array
        (
            record details
        )
)
$ php balancer_bug_test.php -i id
MongoDB\Model\BSONDocument Object
(
    [storage:ArrayObject:private] => Array
        (
            record details
        )
)
$ php balancer_bug_test.php -i id
MongoDB\Model\BSONDocument Object
(
    [storage:ArrayObject:private] => Array
        (
            record details
        )
)

A result was obtained each time.
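Since 'prod-mongos' round-robins across routers via Route53, the flush has to reach every mongos, not just x. A sketch of doing that from PHP by addressing each router directly (the host names here are placeholders):

<?php
// flush_all_routers.php (sketch) -- issue flushRouterConfig on each mongos.
require 'vendor/autoload.php';

$routers = array('mongos-x:27017', 'mongos-y:27017'); // list every router here
foreach ($routers as $host) {
    $client = new MongoDB\Client("mongodb://$host");
    $client->selectDatabase('admin')->command(array('flushRouterConfig' => 1));
    echo "flushed router config on $host\n";
}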

 

Step 4: To rule out the possibility of a driver issue, similar experiments were run using Python and pymongo. The observations were exactly the same.

 

Step 5: Similar commands were tried from the mongos shell using db.getMongo().setReadPref('secondaryPreferred').

The result was the same: data was not fetched correctly unless 'flushRouterConfig' was used.

 

Version details:

MongoDB: MongoDB shell version v3.6.8
Mongo config server OS: CentOS release 6.10 (Final)
Mongos OS: CentOS release 6.9 (Final), Linux 2.6.32-696.13.2.el6.x86_64
MongoDB shard and replica set host OS: CentOS Linux release 7.5.1804 (Core), Linux 4.18.1-1.el7.elrepo.x86_64

Reference: https://jira.mongodb.org/browse/SERVER-5931. We thought this would have been fixed in version 3.6.8.



 Comments   
Comment by Danny Hatcher (Inactive) [ 14/Nov/18 ]

Hello Milind,

It is possible for "available" to return orphaned documents even after a chunk migration has finished, as that read concern checks neither the shard primary nor the config servers for updated metadata.

Yes, the recommended solution is to use a read concern other than "available" when performing secondary reads against a sharded cluster.

As this behavior is expected, I will close this ticket.

Thank you,

Danny

Comment by Milind Vaidya [X] [ 13/Nov/18 ]

Hi Danny,

Thanks for the response.  

I tried reproducing the issue and was able to. I also tried setting the read concern to "local" in both the PHP and Python scripts. It did immediately return a result where it had not initially, but I will have to test a little more. The test collection that was sharded was very small, so I could drop, restore, and reshard it to reproduce the bug.

 

One more thing I noticed was that after some time (not sure how much), the queries started returning results even without a read concern set. So is it safe to assume that reads "can return orphaned documents if chunk migrations are occurring", but not once the chunk migration is complete and the data is well distributed?

But the recommended solution seems to be to use a read concern other than "available" when secondaryPreferred is used.
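For example, in PHP the read concern can be pinned per query (a sketch, reusing the hypothetical namespace and field from the description):

$doc = $client->mydb->mycoll->findOne(
    array('id' => $id),
    // overrides the default "available" read concern for secondary reads
    array('readConcern' => new MongoDB\Driver\ReadConcern(MongoDB\Driver\ReadConcern::LOCAL))
);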

 

 

Comment by Danny Hatcher (Inactive) [ 12/Nov/18 ]

Hello Milind,

Thank you for the detailed description; it has helped us identify the situation immediately.

In MongoDB 3.6, read concern "available" is used by default for any secondary reads. In sharded clusters, "available" can return orphaned documents if chunk migrations are occurring. To prevent this from happening, we recommend specifying a read concern with your queries or associating them with causally consistent sessions. Either option will ensure that read concern "available" is not used, and you should not experience this situation again.
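In the PHP library, a causally consistent session might look like this (a sketch; it assumes MongoDB 3.6+ with the mongodb/mongodb library 1.3+ and reuses the hypothetical namespace from the description):

// causalConsistency is the session default; shown explicitly for clarity
$session = $client->startSession(array('causalConsistency' => true));
$doc = $client->mydb->mycoll->findOne(
    array('id' => $id),
    array('session' => $session)   // the session carries causal consistency
);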

If you specify a different read concern or test causally consistent sessions, do you still see the issues?

Thank you,

Danny
