[SERVER-16693] It is possible to read unowned data from the primary after fail-over Created: 30/Dec/14  Updated: 12/Mar/16  Resolved: 12/Mar/16

Status: Closed
Project: Core Server
Component/s: Sharding
Affects Version/s: 2.6.6, 2.8.0-rc4
Fix Version/s: None

Type: Bug Priority: Critical - P2
Reporter: David Hows Assignee: Andy Schwerin
Resolution: Incomplete Votes: 1
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Attachments: Text File mongos.log    
Issue Links:
Related
related to SERVER-16237 Don't check the shard version if the ... Closed
Operating System: ALL
Participants:

 Description   

I was able to reproduce this with the following environment:
2 shards, each a 3-member replica set (named "shard01" and "shard02")
2 mongos instances (MongoS#1 and MongoS#2)
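
For reference, a minimal sketch of an equivalent topology using the mongo shell's ShardingTest helper (an illustration only; ShardingTest assigns its own shard and replica-set names rather than "shard01"/"shard02"):

// Sketch: two shards, each a 3-member replica set, plus two mongos routers.
var st = new ShardingTest({ shards: 2, rs: { nodes: 3 }, mongos: 2 });
var mongos1 = st.s0;  // plays the role of MongoS#1
var mongos2 = st.s1;  // plays the role of MongoS#2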

I then issued the following commands on MongoS#1

sh.enableSharding("test")
sh.shardCollection("test.t2", {x:"hashed"})
// Run the following loop three times to insert some data
for(i=0;i<1000;i++){db.t2.insert({x:i})}
sh.stopBalancer()
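
A hypothetical check at this point (not part of the original steps): running explain() on MongoS#1 shows which shard currently serves the query for x:110 before the chunk is moved; the exact explain output format varies by server version.

// Illustrative only: see which shard the {x: 110} query is routed to pre-migration.
db.t2.find({ x: 110 }).explain()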

Following this, on MongoS#2 I issued the readPref change and the finds shown below in the "Actions on MongoS#2" section.
Then I issued the following on MongoS#1

sh.moveChunk("test.t2", {x:110}, "shard01")
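
For clarity, sh.moveChunk() is a shell wrapper around the moveChunk admin command; because the shard key is hashed, the chunk that owns the hashed value of x:110 is what gets migrated to shard01. An equivalent sketch of the same step:

db.adminCommand({ moveChunk: "test.t2", find: { x: 110 }, to: "shard01" })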

Finally, I created a 60-second outage of the primary of "shard01" with the following command on the shell. The PID should be that of the primary of the "to" shard:

date; kill -STOP 79564; sleep 60; date; kill -CONT 79564
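
As an aside (not part of the original report), one way to find the PID of the shard01 primary is to connect to that mongod directly and read it from serverStatus():

// Run against the shard01 primary; serverStatus() reports the mongod process ID.
db.serverStatus().pid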

Once MongoS#2 detects that the primary of shard01 is down and a new primary has been elected, we can issue further commands to MongoS#2, as shown below.

*Actions on MongoS#2*

MongoDB shell version: 2.6.5
connecting to: 127.0.0.1:27025/test
// Before migration
mongos> db.getMongo().setReadPref('primaryPreferred')
mongos> db.t2.find({x:110})
{ "_id" : ObjectId("5487a24f8cb198ac46491999"), "x" : 110 }
{ "_id" : ObjectId("5487a25c8cb198ac46491d81"), "x" : 110 }
{ "_id" : ObjectId("5487a25f8cb198ac46492169"), "x" : 110 }
// After migration, during the outage
mongos> db.t2.find({x:110})
// One second after the command above returned
mongos> db.t2.find({x:110})
{ "_id" : ObjectId("5487a25f8cb198ac46492169"), "x" : 110 }
{ "_id" : ObjectId("5487a24f8cb198ac46491999"), "x" : 110 }
{ "_id" : ObjectId("5487a25c8cb198ac46491d81"), "x" : 110 }
mongos>

The above is repeatable on the mongos only for the first operation after the outage has occurred, because we read from the primary member before issuing the setShardVersion (SSV) command.
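
A hypothetical way to observe that versioning gap (not part of the original repro) is to compare the collection version the mongos is using with the version cached on the shard primary, using the getShardVersion command on each:

// On MongoS#2:
db.adminCommand({ getShardVersion: "test.t2" })
// Directly against the shard primary (its cached view, if any):
db.adminCommand({ getShardVersion: "test.t2" })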

If you instead SIGSTOP the primary of the "from" shard, the failing find can be repeated far more times (a scripted form of this probe is sketched after the transcript below).

mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
mongos> db.t2.find({x:110})
{ "_id" : ObjectId("5487a25c8cb198ac46491d81"), "x" : 110 }
{ "_id" : ObjectId("5487a24f8cb198ac46491999"), "x" : 110 }
{ "_id" : ObjectId("5487a25f8cb198ac46492169"), "x" : 110 }



 Comments   
Comment by Ramon Fernandez Marina [ 12/Mar/16 ]

david.hows, there has been no activity on this ticket for 5 months now, so I'm going to resolve it. Feel free to reopen if your testing with a more recent release shows any issues.

Comment by Andy Schwerin [ 28/Sep/15 ]

david.hows, I believe that this is resolved by some combination of SERVER-15176 (find command) and (in config server replica set systems) SERVER-19934. Could you re-test once both of those tickets are resolved? I believe they're both targeted for 3.1.9, but SERVER-19934 may land in 3.2.0-rc0.

Comment by David Hows [ 05/Jan/15 ]

I have successfully reproduced this under 2.8.0-RC4. That said, the failure is not as pronounced as under earlier versions: we seem to issue a new SSV more quickly and detect the change to the new primary better. I was only able to issue the failing query twice before shard versioning directed me to the correct shard (whereas under 2.6 and the earlier tested versions of 2.8 the failure persisted until the new primary came back).

Can you repro this without that preference, when only reading from primaries?

No, primaryPreferred is a requirement. Although primaryPreferred sometimes results in a secondary read, this is a case where it is known to be a primary read. The parallel cursor that is created uses the isSecondaryQuery function in DBClientReplicaSet, which treats a primaryPreferred read preference as a secondary read and therefore skips all of the shard versioning checks (as per SERVER-7246).
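
For comparison, a sketch of the control case asked about above (not run as part of this ticket): with an explicit 'primary' read preference the query is not treated as a secondary read, so the shard versioning checks are not skipped and the stale read should not reproduce.

db.getMongo().setReadPref('primary')
db.t2.find({ x: 110 })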

In other words, is this just reading from the member as a secondary before the mongos instance realizes it is now "primary", to meet the readPref? If so, this sounds more like an issue of improving the discovery of the primary on the mongos instance.

No. According to the logs below, the MongoS is aware that this query is reading from a primary.
I suspect the failure is due to the changes in SERVER-7246 in combination with the isSecondaryQuery behaviour described above: we correctly issue the query to the newly known primary, but skip the shard versioning steps. Below is the first query issued after the failure case described above, at log level 5. You can see that we correctly detect the new primary but do not issue an SSV on this connection.

2015-01-05T11:12:13.121+1100 D SHARDING [conn5] Request::process begin ns: test.t2 msg id: 11 op: 2004 attempt: 0
2015-01-05T11:12:13.122+1100 D SHARDING [conn5] query: test.t2 { query: { x: 110.0 }, $readPreference: { mode: "primaryPreferred" } } ntoreturn: 0 options: 0
2015-01-05T11:12:13.122+1100 D NETWORK  [conn5] creating pcursor over QSpec { ns: "test.t2", n2skip: 0, n2return: 0, options: 0, query: { query: { x: 110.0 }, $readPreference: { mode: "primaryPreferred" } }, fields: {} } and CInfo { v_ns: "", filter: {} }
2015-01-05T11:12:13.122+1100 D QUERY    [conn5] [QLOG] Beginning planning...
=============================
Options = NO_TABLE_SCAN
Canonical query:
ns=test.t2 limit=0 skip=0
Tree: x == 110.0
Sort: {}
Proj: {}
=============================
2015-01-05T11:12:13.122+1100 D QUERY    [conn5] [QLOG] Index 0 is kp: { x: "hashed" }
2015-01-05T11:12:13.122+1100 D QUERY    [conn5] [QLOG] Predicate over field 'x'
2015-01-05T11:12:13.123+1100 D QUERY    [conn5] [QLOG] Relevant index 0 is kp: { x: "hashed" }
2015-01-05T11:12:13.123+1100 D QUERY    [conn5] Relevant index 0 is kp: { x: "hashed" }
2015-01-05T11:12:13.123+1100 D QUERY    [conn5] [QLOG] Rated tree:
x == 110.0  || First: 0 notFirst: full path: x
2015-01-05T11:12:13.123+1100 D QUERY    [conn5] [QLOG] Tagging memoID 1
2015-01-05T11:12:13.123+1100 D QUERY    [conn5] [QLOG] Enumerator: memo just before moving:
2015-01-05T11:12:13.124+1100 D QUERY    [conn5] [QLOG] About to build solntree from tagged tree:
x == 110.0  || Selected Index #0 pos 0
2015-01-05T11:12:13.124+1100 D QUERY    [conn5] [QLOG] Planner: adding solution:
FETCH
---filter:
        x == 110.0  || Selected Index #0 pos 0
---fetched = 1
---sortedByDiskLoc = 1
---getSort = [{}, ]
---Child:
------IXSCAN
---------keyPattern = { x: "hashed" }
---------direction = 1
---------bounds = field #0['x']: [-541895413742407152, -541895413742407152]
---------fetched = 0
---------sortedByDiskLoc = 1
---------getSort = [{}, ]
2015-01-05T11:12:13.124+1100 D QUERY    [conn5] [QLOG] Planner: outputted 1 indexed solutions.
2015-01-05T11:12:13.124+1100 D NETWORK  [conn5] initializing over 1 shards required by [test.t2 @ 15|1||5487a0872504a4716776a56c]
2015-01-05T11:12:13.124+1100 D NETWORK  [conn5] initializing on shard shard02:shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023, current connection state is { state: {}, retryNext: false, init: false, finish: false, errored: false }
2015-01-05T11:12:13.124+1100 D NETWORK  [conn5] polling for status of connection to 10.8.1.229:27021, no events
2015-01-05T11:12:13.125+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.125+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.125+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.125+1100 D NETWORK  [conn5] dbclient_rs say using secondary or tagged node selection in shard02, read pref is { pref: "primary pref", tags: [ {} ] } (primary : Pixl.local:27021, lastTagged : [not cached])
2015-01-05T11:12:13.125+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.125+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.126+1100 D NETWORK  [conn5] creating new connection to:Pixl.local:27023
2015-01-05T11:12:13.126+1100 D COMMAND  [ConnectBG] BackgroundJob starting: ConnectBG
2015-01-05T11:12:13.127+1100 D NETWORK  [conn5] connected to server Pixl.local:27023 (10.8.1.229)
2015-01-05T11:12:13.127+1100 D NETWORK  [conn5] connected connection!
2015-01-05T11:12:13.127+1100 D NETWORK  [conn5] dbclient_rs selecting primary node Pixl.local:27023
2015-01-05T11:12:13.127+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.127+1100 D NETWORK  [conn5] initialized query (lazily) on shard shard02:shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023, current connection state is { state: { conn: "shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023", vinfo: "test.t2 @ 15|1||5487a0872504a4716776a56c", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: false, errored: false }
2015-01-05T11:12:13.128+1100 D NETWORK  [conn5] finishing over 1 shards
2015-01-05T11:12:13.128+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:13.128+1100 D NETWORK  [conn5] finishing on shard shard02:shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023, current connection state is { state: { conn: "shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023", vinfo: "test.t2 @ 15|1||5487a0872504a4716776a56c", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: false, errored: false }
2015-01-05T11:12:13.128+1100 D NETWORK  [conn5] finished on shard shard02:shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023, current connection state is { state: { conn: "(done)", vinfo: "test.t2 @ 15|1||5487a0872504a4716776a56c", cursor: "(empty)", count: 0, done: false }, retryNext: false, init: true, finish: true, errored: false }
2015-01-05T11:12:13.128+1100 D SHARDING [conn5] Request::process end ns: test.t2 msg id: 11 op: 2004 attempt: 0 6ms

Later in the logs (after issuing two more queries) we see the SSV being issued correctly: it fails with "first setShardVersion", the mongos reloads the chunk manager, and queries are then routed to the correct shard.

2015-01-05T11:12:14.094+1100 D NETWORK  [conn5] initializing over 1 shards required by [test.t2 @ 15|1||5487a0872504a4716776a56c]
2015-01-05T11:12:14.094+1100 D NETWORK  [conn5] initializing on shard shard02:shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023, current connection state is { state: {}, retryNext: false, init: false, finish: false, errored: false }
2015-01-05T11:12:14.094+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:14.094+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:14.094+1100 D SHARDING [conn5] setting shard version of 15|0||5487a0872504a4716776a56c for test.t2 on shard shard02:shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023
2015-01-05T11:12:14.094+1100 D SHARDING [conn5] last version sent with chunk manager iteration 0, current chunk manager iteration is 7
2015-01-05T11:12:14.095+1100 D SHARDING [conn5]     setShardVersion  shard02 Pixl.local:27023  test.t2  { setShardVersion: "test.t2", configdb: "Pixl.local:27024", shard: "shard02", shardHost: "shard02/Pixl.local:27021,Pixl.local:27022,Pixl.local:27023", version: Timestamp 15000|0, versionEpoch: ObjectId('5487a0872504a4716776a56c') } 7
2015-01-05T11:12:14.095+1100 D SHARDING [conn5]        setShardVersion failed!
{ need_authoritative: true, ok: 0.0, errmsg: "first setShardVersion" }
2015-01-05T11:12:14.095+1100 D NETWORK  [conn5] ReplicaSetMonitor::get shard02
2015-01-05T11:12:14.096+1100 D SHARDING [conn5] loading chunk manager for collection test.t2 using old chunk manager w/ version 15|1||5487a0872504a4716776a56c and 100 chunks
2015-01-05T11:12:14.096+1100 D SHARDING [conn5] major version query from 15|1||5487a0872504a4716776a56c and over 2 shards is { query: { ns: "test.t2", lastmod: { $gte: Timestamp 15000|1 } }, orderby: { lastmod: 1 } }
2015-01-05T11:12:14.096+1100 D SHARDING [conn5] found 3 new chunks for collection test.t2 (tracking 3), new version is 16|1||5487a0872504a4716776a56c
2015-01-05T11:12:14.097+1100 D SHARDING [conn5] loaded 3 chunks into new chunk manager for test.t2 with version 16|1||5487a0872504a4716776a56c
2015-01-05T11:12:14.097+1100 I SHARDING [conn5] ChunkManager: time to load chunks for test.t2: 1ms sequenceNumber: 8 version: 16|1||5487a0872504a4716776a56c based on: 15|1||5487a0872504a4716776a56c

I've attached the full 2.8.0-RC4 mongos log.

Comment by Scott Hernandez (Inactive) [ 30/Dec/14 ]

You are using the primaryPreferred read preference, so from the perspective of the mongos instance issuing the query it would be acceptable to read from a secondary if the primary is unavailable. Can you repro this without that preference, when only reading from primaries?

In other words, is this just reading from the member as a secondary before the mongos instance realizes it is now "primary", to meet the readPref? If so, this sounds more like an issue of improving the discovery of the primary on the mongos instance.

Also, is this reproducible on 2.8RC4? Please be clear on the versions being tested when updating the information.
