[SERVER-22739] Sharding SecondaryPreferred read commands routed to a primary do not handle StaleConfigException Created: 18/Feb/16  Updated: 11/Jul/18  Resolved: 11/Aug/17

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 3.0.9, 3.2.13, 3.4.4
Fix Version/s: None

Type: Bug
Priority: Major - P3
Reporter: Kaloian Manassiev
Assignee: Esha Maharishi (Inactive)
Resolution: Done
Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Depends
Related
related to SERVER-18671 SecondaryPreferred can end up using u... Closed
Operating System: ALL
Steps To Reproduce:

Run the following JS script:

// Tests that SecondaryPreferred queries routed to the primary of a shard will handle stale metadata
(function() {
'use strict';
 
// Start a sharded cluster with two mongos and two shards, each a single-node replica set
var st = new ShardingTest({ mongos: 2, shards: 2, other: { rs: { nodes: 1 } } });
 
// Shard a collection by the first mongos
assert.commandWorked(st.s0.adminCommand({ enableSharding: 'TestDB' }));
st.ensurePrimaryShard('TestDB', st.shard0.shardName);
assert.commandWorked(st.s0.adminCommand({ shardCollection: 'TestDB.TestColl', key: { Key: 1 } }));
 
// Insert some documents
assert.writeOK(st.s0.getDB('TestDB').TestColl.insert({ Key: 0, Value: 'Value 0' }));
assert.writeOK(st.s0.getDB('TestDB').TestColl.insert({ Key: 1, Value: 'Value 1' }));
 
// Make sure the second mongos has the most up-to-date metadata
assert.eq(2, st.s0.getDB('TestDB').TestColl.find().itcount());
assert.eq(2, st.s1.getDB('TestDB').TestColl.find().itcount());
 
// Make sure the second mongos has cached a versioned connection at the current version,
// which becomes stale after the split and moveChunk below
var slaveOkConnection = new Mongo(st.s1.host);
slaveOkConnection.setSlaveOk();
assert.eq(2, slaveOkConnection.getDB('TestDB').TestColl.distinct('Value').length);
 
// Split the chunk on the first mongos
assert.commandWorked(st.s0.adminCommand({ split: 'TestDB.TestColl', find: { Key: 0 } }));
assert.commandWorked(st.s0.adminCommand({ moveChunk: 'TestDB.TestColl',
                                          find: { Key: 1 },
                                          to: st.shard1.shardName }));
 
// Now do the distinct again and use the same connection as the one on which we ran distinct
// earlier, because sharded/versioned connections are cached per thread.
assert.eq(2, slaveOkConnection.getDB('TestDB').TestColl.distinct('Value').length);
 
st.stop();
 
})();

Sprint: Sharding 11 (03/11/16), Sharding 2017-08-21
Participants:
Case:
Linked BF Score: 0

 Description   

SecondaryPreferred reads may be routed to a primary host when that is deemed the most appropriate target.

Legacy-style queries disable version checking, so they go over unversioned connections. However, other command implementations, such as distinct or count, which use ShardCollection, will fail with a cryptic NodeNotFound error if they get routed to a primary with the SecondaryPreferred preference set, because they do not expect to see StaleConfigException.

See the included repro script for more information.
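
The missing handling described above can be pictured with a minimal, self-contained sketch in plain JavaScript. It is not MongoDB code: `StaleConfigError`, `makeShard`, `runDistinctWithRetry`, and the `router` object are all hypothetical names, and the single toy shard ignores the fact that the real repro spreads data over two shards. It only illustrates the general idea of catching a stale-version error, refreshing the cached version, and retrying, instead of surfacing the error to the caller.

```javascript
// Hypothetical sketch: none of these names are MongoDB internals.
class StaleConfigError extends Error {
  constructor(received, wanted) {
    super(`stale config: received ${received}, wanted ${wanted}`);
    this.name = 'StaleConfigError';
    this.received = received;
    this.wanted = wanted;
  }
}

// A toy shard that rejects requests carrying an outdated major version.
function makeShard(currentVersion, values) {
  return {
    version: currentVersion,
    distinct(requestVersion) {
      if (requestVersion !== this.version) {
        throw new StaleConfigError(requestVersion, this.version);
      }
      return values;
    },
  };
}

// A toy router that sends its cached version and retries after a refresh.
function runDistinctWithRetry(router, shard, maxAttempts = 3) {
  for (let attempt = 1; attempt <= maxAttempts; attempt++) {
    try {
      return shard.distinct(router.cachedVersion);
    } catch (err) {
      // Rethrow anything that is not a stale-version error, or give up
      // once the attempts are exhausted.
      if (!(err instanceof StaleConfigError) || attempt === maxAttempts) {
        throw err;
      }
      // Refresh the cached routing metadata, then loop to retry.
      router.cachedVersion = err.wanted;
      router.refreshes += 1;
    }
  }
}

// The router cached version 1; the shard has since moved on to version 2.
const shard = makeShard(2, ['Value 0', 'Value 1']);
const router = { cachedVersion: 1, refreshes: 0 };
const result = runDistinctWithRetry(router, shard);
```

In this sketch the first attempt fails with the stale version, the router adopts the version the shard reported, and the second attempt succeeds.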



 Comments   
Comment by Esha Maharishi (Inactive) [ 11/Jul/18 ]

Note that the comment below (dated 11/Aug/17) was made shortly before the distinct command was changed to send shardVersion under SERVER-30698.

So, the distinct command is no longer an exception; I added this comment for posterity.

Comment by Esha Maharishi (Inactive) [ 11/Aug/17 ]

This should be fixed by the Safe Secondary Reads project in 3.6 (PM-256), which sends shardVersions on read requests from mongos to shards (with some known exceptions, such as distinct and geoNear) regardless of the readPreference.

Comment by Kaloian Manassiev [ 18/Feb/16 ]

Confirmed that the bug reproduces in 3.0 with a similar (but different) error message:

2016-02-18T18:08:15.101-0500 E QUERY    Error: distinct failed: {
        "code" : 16379,
        "ok" : 0,
        "errmsg" : "exception: Failed to call findOne, no good nodes in test-rs0, last error: can't findone replica set node kaloianmdesktop:31100:  :: caused by :: 9996 stale config on lazy receive :: caused by :: $err: \"[TestDB.TestColl] shard version not ok: version mismatch detected for TestDB.TestColl, stored major version 2 does not match received 1 ( ns : TestDB....\" ( ns : TestDB.TestColl, received : 1|0||56c64ec7c49a058786d260a8, wanted : 2|0||56c64ec7c49a058786d260a8, recv )" }
    at Error (<anonymous>)
    at DBCollection.distinct (src/mongo/shell/collection.js:1237:15)
    at (shell):1:57 at src/mongo/shell/collection.js:1237

Attaching a repro script in the repro steps section.
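
The "received" and "wanted" values in the error above can be compared with a small helper. This is an illustrative sketch in plain JavaScript, assuming the strings follow a `major|minor||epoch` layout (the log itself only names the major version explicitly); `parseShardVersion` and `isStale` are hypothetical names, not server functions.

```javascript
// Parses strings shaped like "2|0||56c64ec7c49a058786d260a8".
// Assumed layout: major|minor||epoch, with a 24-character hex epoch.
function parseShardVersion(text) {
  const match = /^(\d+)\|(\d+)\|\|([0-9a-f]{24})$/.exec(text);
  if (match === null) {
    throw new Error(`unrecognized shard version: ${text}`);
  }
  return {
    major: Number(match[1]),
    minor: Number(match[2]),
    epoch: match[3],
  };
}

// Treats a version as stale if the epoch differs, or if it is older
// than the wanted version when compared by major, then minor.
function isStale(received, wanted) {
  if (received.epoch !== wanted.epoch) {
    return true;
  }
  if (received.major !== wanted.major) {
    return received.major < wanted.major;
  }
  return received.minor < wanted.minor;
}

// Values copied from the error message above.
const received = parseShardVersion('1|0||56c64ec7c49a058786d260a8');
const wanted = parseShardVersion('2|0||56c64ec7c49a058786d260a8');
const stale = isStale(received, wanted);
```

With the values from the log, both versions share an epoch and the received major version (1) is behind the wanted one (2), which matches the "stored major version 2 does not match received 1" text.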

Comment by Spencer Brody (Inactive) [ 18/Feb/16 ]

How long has this issue existed for? Is it a regression?
