[SERVER-5931] Secondary reads in sharded clusters need stronger consistency Created: 25/May/12 Updated: 06/Apr/23 Resolved: 31/Jul/17 |
|
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Querying, Replication, Sharding |
| Affects Version/s: | None |
| Fix Version/s: | 3.5.11 |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Kay Agahd | Assignee: | Dianna Hohensee (Inactive) |
| Resolution: | Done | Votes: | 41 |
| Labels: | setShardVersion | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Environment: |
ubuntu lucid 64 bit |
||
| Issue Links: |
|
| Backwards Compatibility: | Fully Compatible | ||
| Sprint: | Sharding 2017-08-21 | ||
| Participants: | | ||
| Case: | (copied to CRM) | ||
| Description |
|
Secondary reads in MongoDB are only eventually consistent - the state of the system will not reflect the latest changes. When balancing, the state of the cluster changes implicitly, so secondary reads are inconsistent. This means that duplicate, stale, or missing data can be observed while balancing operations are active, along with orphaned data from aborted balancer operations. Issues with orphaned data affecting results from primary reads are different problems (see the linked issues).

Original description: Mongo may return too many documents in a sharded system. This may occur when a document is located on more than one shard. We don't know yet why some documents are located on more than one shard, because we never access shards directly; we always access MongoDB through mongos (the router). Perhaps these documents result from a failed chunk migration? In any case, even if these documents exist on more than one shard, mongo should be clever enough to return only those that are tracked by the config servers. Let me show you a test case (documents are sharded by _id):
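(An illustrative mongo shell sketch of the kind of check being described - not the reporter's original test case; the database, collection, and field names are hypothetical.)

```
// Illustrative sketch only - "testdb", "mycoll", and "customerId" are hypothetical.
// The collection is sharded on _id and all access goes through mongos.

// count() does not filter orphaned documents, so it can report more matches
// than logically exist:
db.getSiblingDB("testdb").mycoll.find({ customerId: 42 }).count()    // e.g. 3

// itcount() iterates the cursor; on a primary the orphan filter applies,
// so the number can be lower than count():
db.getSiblingDB("testdb").mycoll.find({ customerId: 42 }).itcount()  // e.g. 2

// With secondary reads (slaveOk), even find() can return duplicate or
// orphaned documents while chunks are being migrated.
```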
|
| Comments |
| Comment by Kaloian Manassiev [ 26/Feb/19 ] | |||||||||||||||||||||||
|
Hello lucasoares, This feature is already available in 3.6.0 and later releases. Please take a look at the documentation for more information. Best regards, | |||||||||||||||||||||||
| Comment by Lucas [ 26/Feb/19 ] | |||||||||||||||||||||||
|
Hello! Will this make it into any upcoming 3.6 release? Thank you. | |||||||||||||||||||||||
| Comment by Spencer Brody (Inactive) [ 17/Sep/15 ] | |||||||||||||||||||||||
|
jblackburn, you may also be interested in watching | |||||||||||||||||||||||
| Comment by Andy Schwerin [ 17/Sep/15 ] | |||||||||||||||||||||||
|
jblackburn, this ticket describes a different feature request from the one in your comment. Your request is more akin to | |||||||||||||||||||||||
| Comment by James Blackburn [ 15/Sep/15 ] | |||||||||||||||||||||||
|
It's currently possible for a SECONDARY to be infinitely stale. Ideally the SECONDARY should move to RECOVERING if it becomes too stale, or be configured to fail altogether. We have a reasonable tolerance for stale secondaries, but not for hours or days... | |||||||||||||||||||||||
| Comment by Adam Flynn [ 06/Nov/14 ] | |||||||||||||||||||||||
|
Wanted to add a +1 to this ticket and describe how it has manifested itself in our deployment (see CS-11107 for more details). We have a workaround now, but I want to make the engineering team aware of a use case where this can be really bad.

First, I understand the complexity of the issues involved in this one and that it can't be fixed haphazardly. It's probably too late for 2.8, but I'd love to see this be a priority in 3.0. I also think it's important to advertise the symptoms of this limitation more widely (internally & externally) so people can avoid painting themselves into corners until this is fixed.

We use tagged secondary reads pretty aggressively in our app. With a high read/write ratio, tolerance for eventual consistency (in the form of replication lag, anyway), and a high redundancy requirement that makes us carry a lot of secondaries anyway, a secondary read preference makes a lot of sense. We also split between analytics queries and real-time queries. Having analytics & real-time loads on the same node causes problems, so we use tags to route to different secondaries. Great feature for that!

But the high read/write ratio means a long time can pass between a moveChunk finishing and the primary being hit (especially in analytics tools, which often do no writes). The behaviour here is that documents seem to disappear and all kinds of queries fail. Until Andrew helped us understand the details (hitting a primary of a shard that knows about the migration refreshes metadata), our only operational fix was flushRouterConfig everywhere (sketched below). So, when we added new shards and had aggressive balancing, every night or two our error rate would spike way up from "missing" documents. Someone would wake up, flush or restart all mongos, error rate goes down, back to sleep. Random data disappearing until you restart mongos is pretty scary.

We have 17 shards (68 mongod) now and add shards monthly, so we're constantly doing a lot of balancing, even with well-distributed writes. Adding shards implied accepting that a bunch of things would be transiently broken for a week or two. That obviously isn't scalable.

We first opened the ticket about this back in March and the support engineer wasn't able to nail down what was happening (and we couldn't reproduce reliably to get logs). After we figured out how to reproduce this and got a high logLevel capture, Andrew was able to nail down that it was a case of this ticket. To be clear: I think MongoDB support is excellent and we've got a reasonable (if messy) workaround from CS-11107. But my concern here is that such a big caveat in MongoDB's consistency semantics isn't mentioned in the docs, and 2 support engineers weren't initially aware of its symptoms. If it had been clear in docs or support discussions that this particular behavior existed, I probably would have made different architectural decisions over the last couple of years... but having it come up as a "gotcha" during fast growth is a big concern. | |||||||||||||||||||||||
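(A sketch of the flushRouterConfig workaround mentioned above; repeat it against every mongos router in the cluster.)

```
// Force a mongos to reload its cached sharding metadata from the config
// servers. Run this against each mongos in the deployment.
db.adminCommand({ flushRouterConfig: 1 })
```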
| Comment by Greg Studer [ 27/Aug/14 ] | |||||||||||||||||||||||
|
This issue has had a lot of discussion - summarizing here:

Secondary reads in MongoDB are only eventually consistent - the state of the system will not reflect the latest changes. When balancing, the state of the cluster is changing implicitly, and so secondary reads are inconsistent. This means that duplicate, stale, or missing data can be observed when balancing operations are active, along with orphaned data from aborted balancer operations. Issues with orphaned data affecting results from primary reads are different problems (see the linked issues).

Filtering results to remove duplicate data is only half of the problem - data that has not yet been replicated to certain secondaries on a TO-shard but has been removed on a FROM-shard may be invisible temporarily. A full fix is nontrivial and requires tracking differing sets of chunk metadata per node, integrated with replication, plus new targeting logic. A partial fix may be possible by forcing migrations to become fully replicated using secondary throttling (see the sketch below) - this would allow filtering to work, if a primary was online, at the cost of slow migrations. We're still considering the options here. | |||||||||||||||||||||||
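(For reference, a sketch of what enabling secondary throttling looked like in the 2.4/2.6-era balancer settings; run via mongos, and treat the exact setting as an assumption about that era rather than a definitive recipe.)

```
// Sketch: make balancer-driven chunk migrations wait for replication to a
// secondary before moving on to the next document (_secondaryThrottle).
// Run through a mongos; the setting lives in the "config" database.
db.getSiblingDB("config").settings.update(
   { _id: "balancer" },
   { $set: { _secondaryThrottle: true } },
   { upsert: true }
)
```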
| Comment by Thomas Rueckstiess [ 29/Jul/14 ] | |||||||||||||||||||||||
|
Hi Nic, to clarify your last question: compact rewrites (and thus compacts) the extents of a collection but does not free up disk space. This is because extents from different collections are kept within the same db file. The space is added to the free list and available to newly inserted data, but you won't see more available space on the OS level after a compact. Regards, | |||||||||||||||||||||||
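(For reference, a sketch of running compact on a single collection; the collection name is illustrative.)

```
// Sketch: compact a single collection. Run against a mongod (not mongos);
// on older versions, running this on a replica set primary also requires
// { force: true }. As described above, the reclaimed space goes to the
// database's free list rather than back to the OS.
db.runCommand({ compact: "mycoll" })
```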
| Comment by Vincent [ 29/Jul/14 ] | |||||||||||||||||||||||
|
@Nic > Probably, but I'm not sure it frees up disk space (the docs only state it's similar to repairDatabase but for a collection). I didn't have enough disk space to repair my big database either, so I "repaired" small databases first to make enough room to repair the big one, but I don't know if you can do this too. Let me know | |||||||||||||||||||||||
| Comment by Nic Cottrell (Personal) [ 29/Jul/14 ] | |||||||||||||||||||||||
|
I don't seem to have the required available disk space for an entire copy of the whole db, so will running compact on individual collections give me the same index space savings? Maybe I should just rebuild the entire secondary from the primary data... | |||||||||||||||||||||||
| Comment by Nic Cottrell (Personal) [ 29/Jul/14 ] | |||||||||||||||||||||||
|
@Vincent - I did run it on the two mongod primaries directly, but didn't run the repairs yet. Will give it a try. Many thanks! | |||||||||||||||||||||||
| Comment by Vincent [ 29/Jul/14 ] | |||||||||||||||||||||||
|
@Nic Cottrell > Did you make sure you ran the script on mongod and not mongos? | |||||||||||||||||||||||
| Comment by Nic Cottrell (Personal) [ 29/Jul/14 ] | |||||||||||||||||||||||
|
@agahd When running the script from the mongodb.org docs page, it sat there running for hours. I didn't see any printouts saying that chunks were cleaned up, but some of our errors disappeared. It may, however, have been a change in read preferences in some subroutine. We have the same problem where we are on the edge of fitting into RAM. I'm a bit wary of running the chunk-checker.js since it's from 2010 and many versions of Mongo ago; I don't want to further mess up sharding. | |||||||||||||||||||||||
| Comment by Kay Agahd [ 27/Jul/14 ] | |||||||||||||||||||||||
|
niccottrell, I've tried the script snippet at http://docs.mongodb.org/manual/reference/command/cleanupOrphaned/ without luck. The script took a long time, but the unreferenced chunks remained on the shard - at least, no space was freed up. It seems to me that balancing and cleaning up moved chunks is even worse in v2.6 than ever before, because it may get stuck and may even crash mongod. Maybe these threads are related: | |||||||||||||||||||||||
| Comment by sam flint [ 22/Jul/14 ] | |||||||||||||||||||||||
|
You can also use explain() to capture the correct count, and this is much faster than itcount(). We put this in our client-side application, calling explain().n. As you can see it is accurate and faster than itcount() - a sketch of the comparison follows. In our case the query hit the "BtreeCursor client_id_1_lists_1_order_1" index, and we compared find(..., {client_id:1}).count(), find(..., {client_id:1}).itcount(), and find(..., {client_id:1}).explain().n. | |||||||||||||||||||||||
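(A cleaned-up sketch of that comparison; the namespace and query are illustrative, and the speed/accuracy claim is the commenter's observation rather than a guarantee.)

```
// Three ways to count matching documents (illustrative namespace and query):
db.lists.find({ client_id: 1 }).count()      // can overcount during migrations or with orphans
db.lists.find({ client_id: 1 }).itcount()    // iterates the cursor; filtered on primaries, but slower
db.lists.find({ client_id: 1 }).explain().n  // "n" from the (pre-3.0) explain output; reported accurate and faster
```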
| Comment by Asya Kamsky [ 21/Jul/14 ] | |||||||||||||||||||||||
|
niccottrell I just realized that your test does not look at documents, but rather it looks at count() which will be wrong whenever there is a migration in progress (or if there are orphans) due to | |||||||||||||||||||||||
| Comment by Asya Kamsky [ 21/Jul/14 ] | |||||||||||||||||||||||
|
niccottrell if your queries are going to the primaries, then it can't be because of this bug, as primaries filter out documents that don't belong to their chunk ranges. It's possible that you are seeing a different bug. If you can confirm the queries are being routed to *primaries* and you're seeing duplicates, could you open a new SERVER bug? If the queries are being routed to *secondaries* when you are requesting primary, then it could be either a Morphia or Java driver bug, or a mongos bug, which we would also like to track, triage and fix. | |||||||||||||||||||||||
| Comment by Nic Cottrell (Personal) [ 21/Jul/14 ] | |||||||||||||||||||||||
|
Thanks @agahd! I guess with MongoDB 2.6+ you could borrow the snippet from http://docs.mongodb.org/manual/reference/command/cleanupOrphaned/ to do the actual cleanup steps (roughly as sketched below). | |||||||||||||||||||||||
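(Roughly the loop from that documentation page; the "test.user" namespace is illustrative, and the command is run via adminCommand against each shard's primary mongod, not through mongos.)

```
// Repeatedly invoke cleanupOrphaned, resuming from the key the previous
// invocation stopped at, until the whole range has been cleaned.
var nextKey = {};
var result;
while (nextKey != null) {
   result = db.adminCommand({ cleanupOrphaned: "test.user", startingFromKey: nextKey });
   if (result.ok != 1) {
      print("Unable to complete at this time: failure or timeout.");
   }
   printjson(result);
   nextKey = result.stoppedAtKey;
}
```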
| Comment by Nic Cottrell (Personal) [ 21/Jul/14 ] | |||||||||||||||||||||||
|
Definitely querying against the primary (at least that's what I'm instructing the Morphia/Java driver to do), and definitely got duplicate objects with the same ObjectId _id field. Ran the cleanupOrphaned loop example from the MongoDB docs and am now no longer getting these duplicates - no changes to our code, so it really looks like a mongos bug remains. | |||||||||||||||||||||||
| Comment by Asya Kamsky [ 21/Jul/14 ] | |||||||||||||||||||||||
|
2.6 supports the cleanupOrphaned command, so you should be using it rather than any scripts. However, if you are querying against primaries only, this ticket does not apply, and if you're seeing incorrect results, it's not because of the issue this ticket is tracking. You might want to post your case on the mongodb-user google group. | |||||||||||||||||||||||
| Comment by Kay Agahd [ 20/Jul/14 ] | |||||||||||||||||||||||
|
Nic, I've used this one, slightly modified: | |||||||||||||||||||||||
| Comment by Nic Cottrell (Personal) [ 20/Jul/14 ] | |||||||||||||||||||||||
|
Unfortunately this seems to be a problem. We have a sharded setup with both nodes and mongos running 2.6.3. The collection has a shard key which doesn't include the _id field at all, and we do a query with readPref=primary. We're using the Morphia interface, i.e. a query plus an assert along these lines:
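(A mongo shell sketch of the described pattern rather than the original Morphia code; database, collection, and field names are hypothetical.)

```
// Query through mongos on a non-shard-key field with primary reads and
// assert that no _id comes back more than once.
db.getMongo().setReadPref("primary");
var seen = {};
var duplicates = 0;
db.getSiblingDB("testdb").mycoll.find({ someField: "someValue" }).forEach(function (doc) {
   var id = doc._id.toString();
   if (seen[id]) duplicates++;
   seen[id] = true;
});
assert.eq(0, duplicates, "duplicate _id returned for a primary read through mongos");
```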
And this assert regularly fails in our unit tests. Is there a nice script in 2.6 to clean away these orphan documents automatically? | |||||||||||||||||||||||
| Comment by Kay Agahd [ 23/Oct/13 ] | |||||||||||||||||||||||
|
Thank you Scott for clarifying. Right now, we have removed all orphans from our db so the export works as expected. As soon as we have new orphans and detect duplicates in our exported dataset, we will let you know. Thanks for your patience. | |||||||||||||||||||||||
| Comment by Scott Hernandez (Inactive) [ 23/Oct/13 ] | |||||||||||||||||||||||
|
If you do any query (find, not count - see below or the linked issues) on the mongos, with or without the shard key, which goes only to the primaries, then all non-owned/orphan docs will be removed from the results. MongoDB has a number of tests which verify this. Here is part of the explain output showing "nChunkSkips", i.e. documents skipped and not returned because they are not owned by that shard – resulting in no orphan docs being returned:
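(An illustrative pre-3.0-style explain() fragment with made-up values, standing in for the output referenced above.)

```
// Illustrative pre-3.0 explain() fragment (values are made up).
// nChunkSkips = documents skipped because this shard does not own their
// chunk range, so orphans never reach the result set.
{
   "cursor" : "BtreeCursor _id_",
   "n" : 1000,
   "nscannedObjects" : 1010,
   "nChunkSkips" : 10,
   "millis" : 4
}
```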
In your example above you are using count(), which has issues similar to secondary reads (via find), but on the primary as well. That issue is here: In your example above, if you use itcount() it will run the query and count the returned documents, which filters out orphans. I have also run mongoexport/dump to verify that orphans are not returned when the primary is used. If you see otherwise, we need to create a new issue and follow up there. You may also be interested in the new cleanupOrphaned command in the next version (2.6), which can be used to remove orphans in cases where a failure leaves them around: | |||||||||||||||||||||||
| Comment by Kay Agahd [ 23/Oct/13 ] | |||||||||||||||||||||||
|
Scott, we are aware that running with slaveOk=true might return inconsistent results. However, what we have experienced and what I've shown above demonstrates that one may receive inconsistent results, such as duplicates, even when using slaveOk=false while going through mongos, if the queried field is NOT the shard key. Please review my steps to reproduce the problem above. The reason I was connecting to a mongod instead of to a mongos was to be able to create some orphan documents, in order to demonstrate how mongodb behaves when you run a query against a system that has orphan documents. If mongodb did NOT create orphan documents (probably resulting from a broken chunk move), OR if mongodb were able to automatically remove orphan documents asap, OR if mongodb read the sharding status upon a query request to know whether the found documents really belong to the shard which returned them, then we would NOT have this problem at all. Just connecting through mongos with the slaveOk=false option does NOT solve the problem. | |||||||||||||||||||||||
| Comment by Scott Hernandez (Inactive) [ 23/Oct/13 ] | |||||||||||||||||||||||
|
agahd, you must use the primary (the default behavior for all languages/drivers, but not all tools), connecting through mongos - not directly to the shards, and not with slaveOk enabled. All direct connections are unaware of sharding state (even on the primary) and are considered administrative/maintenance connections, which allow full access to data independent of sharding state (and therefore return "duplicates"). This is why it is always required to change your application/tools to only connect to the mongos servers and never directly to the shards for user data operations. Did you run mongoexport against mongos with the "--slaveOk false" option? By default it will try to read from a non-primary unless the option is set explicitly.
| |||||||||||||||||||||||
| Comment by Kay Agahd [ 23/Oct/13 ] | |||||||||||||||||||||||
|
Scott, your suggested workaround "that you must use the primary for accurate results" is wrong. Even using the primary returns orphaned documents if the query does not contain the shard key (please see my steps to reproduce above). Btw. this also happens when using mongoexport, which is a big pain for us because we can't always use the shard key to export some datasets, so we often have to fight against duplicates in the exported data. It would be nice if this bug could be fixed. Thanks! | |||||||||||||||||||||||
| Comment by Kay Agahd [ 13/Feb/13 ] | |||||||||||||||||||||||
|
@Holger, during migrations, documents must be on both the source and destination server. They will/should be deleted from the source server only after the migration. If your query hits the servers during a migration, it's normal that it will find both (with slaveOk on). A job which removes orphan documents wouldn't help in this case. | |||||||||||||||||||||||
| Comment by Holger Morch [ 13/Feb/13 ] | |||||||||||||||||||||||
|
We are facing the same issue. Since we are a read-heavy application, we normally read with secondary preferred, and so it happens that users see orphan objects in the response. | |||||||||||||||||||||||
| Comment by Christian Tonhäuser [ 13/Feb/13 ] | |||||||||||||||||||||||
|
Do you think this might make it into 2.3.X? | |||||||||||||||||||||||
| Comment by Scott Hernandez (Inactive) [ 31/May/12 ] | |||||||||||||||||||||||
|
There is no workaround for this until the underlying bug/system is fixed, other than using the primary for accurate results. | |||||||||||||||||||||||
| Comment by Kay Agahd [ 31/May/12 ] | |||||||||||||||||||||||
|
I understand that orphaned documents may exist, but they shouldn't have any impact on query results, even when querying non-primaries (slaveOk). Can you suggest a better workaround than querying primaries (since this would reduce mongoDB performance), or will this issue be fixed soon? Thanks! | |||||||||||||||||||||||
| Comment by Scott Hernandez (Inactive) [ 29/May/12 ] | |||||||||||||||||||||||
|
Yes, this is a known problem and I've linked the count-related issue, which is not related to non-primary queries. In addition, as you have noted, non-primary queries can return documents which are not owned by that shard – orphaned ones. These documents can get there from failed migrations, and of course during migrations, for example, so it is expected that this will happen at times. | |||||||||||||||||||||||
| Comment by Kay Agahd [ 25/May/12 ] | |||||||||||||||||||||||
|
sorry, the formatting is a bit weird |