Type: Bug
Resolution: Done
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.4.6
Component/s: Sharding
Labels:
- PM-1017
Environment:
Ubuntu servers

Assigned Teams:

Sharding
Operating System:
ALL
Steps To Reproduce:

Unsure. We're spending a little time trying to make a reproduction script, but essentially something along the lines of:

1. Have multiple mongos running (we have 12 or so) and several populated shards (we have 8), with some collections in a database sharded and some collections unsharded.

2. Start draining the shard that is the current primary for the database.

3. While it is draining, run movePrimary to another shard.

4. Query each mongos separately, looking for inconsistent results.
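Step 4 could be scripted roughly as follows. This is only a sketch, not the reproduction script itself: the router addresses and sample data are hypothetical, `fetch_ids` assumes the `pymongo` driver is installed, and the comparison assumes `_id` values are hashable.

```python
from collections import defaultdict


def group_routers_by_result(ids_by_router):
    """Group mongos addresses by the exact set of _id values each returned.

    More than one group means the routers disagree about the collection.
    """
    groups = defaultdict(list)
    for router, ids in ids_by_router.items():
        groups[frozenset(ids)].append(router)
    return sorted(sorted(routers) for routers in groups.values())


def fetch_ids(router, db_name, coll_name):
    """Read every _id in one collection through a single mongos."""
    # Imported lazily so the comparison logic above runs without pymongo.
    from pymongo import MongoClient

    client = MongoClient(router)
    try:
        cursor = client[db_name][coll_name].find({}, {"_id": 1})
        return [doc["_id"] for doc in cursor]
    finally:
        client.close()


# Hypothetical results standing in for fetch_ids() calls against three mongos.
sample = {
    "mongos-a:27017": [1, 2, 3],
    "mongos-b:27017": [1, 2, 3],
    "mongos-c:27017": [1, 2],  # this router still reads from the old primary
}
groups = group_routers_by_result(sample)
for routers in groups:
    print(routers)
```

Against a live cluster, the sample dictionary would be replaced by `{r: fetch_ids(r, db, coll) for r in routers}`; any output with more than one group points at routers holding a stale view of the database's primary shard.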

Sprint:
Sharding 2018-11-19
CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

We saw very confusing behavior: a collection returned one set of documents some of the time and a different set at other times. We deduced (and verified) that this happened because non-sharded collections (in a sharded environment) were being accessed on two different shards. It appears that at least one of our mongos did not get the movePrimary message, so it still believed the data lived on the old shard and kept interacting with it there.

Now, in our situation, we admittedly committed a faux pas: we ran movePrimary while a shard was draining. I realize that the web docs explicitly state not to do this, but it happened.

It seems that movePrimary either:

A) Doesn't move collections atomically
B) Isn't very forceful about having mongos update their routing tables
C) Doesn't play nicely at all with the balancer when a shard is draining

Or something else, I suppose. Either way, it seems reasonable for movePrimary to raise an error (rather than create inconsistencies) if it truly needs to be run only after all chunks have been moved off of a shard. If it does not actually need that, then there is clearly a bug somewhere that leads to very confusing and inconsistent behavior.
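The kind of guard being asked for could look something like the client-side precheck below. It is a sketch under assumptions, not how the server behaves: the function and exception names are hypothetical, and the inputs are plain dictionaries shaped like documents from `config.databases` (`_id`, `primary`), `config.shards` (`_id`, `draining`), and `config.chunks` (`shard`).

```python
class DrainingShardError(Exception):
    """Raised when movePrimary would run against a shard that is still draining."""


def assert_move_primary_safe(db_name, databases, shards, chunks):
    """Return the current primary shard if movePrimary looks safe, else raise.

    "Safe" here means: the primary is not draining, or it is draining but
    no chunks remain on it.
    """
    primary = next((d["primary"] for d in databases if d["_id"] == db_name), None)
    if primary is None:
        raise KeyError(f"database {db_name!r} not found in config.databases")

    draining = {s["_id"] for s in shards if s.get("draining")}
    remaining = sum(1 for c in chunks if c["shard"] == primary)
    if primary in draining and remaining > 0:
        raise DrainingShardError(
            f"{primary} is draining and still owns {remaining} chunk(s)"
        )
    return primary


# Hypothetical metadata: shard0000 is the primary, draining, with one chunk left.
databases = [{"_id": "mydb", "primary": "shard0000"}]
shards = [{"_id": "shard0000", "draining": True}, {"_id": "shard0001"}]
chunks = [{"shard": "shard0000"}, {"shard": "shard0001"}]

try:
    assert_move_primary_safe("mydb", databases, shards, chunks)
except DrainingShardError as exc:
    print("refusing to run movePrimary:", exc)
```

Failing loudly in this state is the point of the sketch: an explicit error is easier to diagnose than routers that silently disagree about where the data lives.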

depends on

SERVER-939 Ability to distribute collections in a single db

Closed

related to

SERVER-8059 After movePrimary, db.getCollectionNames() excludes previously existing one-chunk collections

Closed

Assignee: [DO NOT USE] Backlog - Sharding Team
Reporter: Walt Woods
Participants: [DO NOT USE] Backlog - Sharding Team, David Storch, Esha Maharishi, Greg Studer, Ori Avtalion, Walt Woods
Votes: 1
Watchers: 11

Created: Oct 31 2013 07:31:39 PM UTC
Updated: Dec 06 2022 05:15:16 AM UTC
Resolved: Nov 12 2019 06:50:00 PM UTC
