[SERVER-36871] $sample can loop infinitely on orphaned data  Created: 24/Aug/18  Updated: 29/Oct/23  Resolved: 11/Dec/18
| Status: | Closed |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | 4.1.7 |
| Type: | Bug | Priority: | Major - P3 |
| Reporter: | Matthew Saltz (Inactive) | Assignee: | Bernard Gorman |
| Resolution: | Fixed | Votes: | 0 |
| Labels: | None |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Issue Links: | |
| Backwards Compatibility: | Fully Compatible |
| Backport Requested: | v4.0, v3.6 |
| Sprint: | Query 2018-10-08, Query 2018-11-19, Query 2018-12-03, Query 2018-12-17 |
| Participants: | |
| Case: | (copied to CRM) |
| Description |
|
The following scenario (at least) can cause an infinite loop in $sample: a shard's collection contains more than 100 documents, all (or nearly all) of which are orphans, and a $sample is run whose requested size is 5% or less of that total. The optimized random-cursor $sample plan is chosen, and the ShardingFilter stage then discards every orphan the random cursor returns without ever reaching EOF, so the aggregation never completes.
|
| Comments |
| Comment by Githook User [ 11/Dec/18 ] | |
|
Author: {'name': 'Bernard Gorman', 'email': 'bernard.gorman@gmail.com', 'username': 'gormanb'}
Message: | |
| Comment by Bernard Gorman [ 16/Nov/18 ] | |
|
Hi stuart.hall@masternaut.com - thank you for providing this additional context. Below, I've outlined our understanding of this bug following some further investigation.

To briefly describe the core of the problem: we have an optimization in place for the $sample aggregation stage which, under certain circumstances, replaces the default $sample implementation with an alternative that uses a random-read cursor to pull back results. Specifically, the optimized stage is used if the requested $sample size is 5% or less of the total number of documents in the collection and that total is greater than 100 documents. Because we cannot tell a priori how many orphans are present on the shard, this decision is based on a document count that includes all documents in the collection, whether they are orphans or not. As with any query, we counteract this by adding a ShardingFilter stage to the query plan, which filters out any orphans we encounter.

The problem arises in cases where the ratio of orphans to legitimate documents is very high or, in the worst case, where there are only orphans on the shard but enough of them to trigger the optimized $sample stage. Because the randomized cursor will continue returning documents indefinitely, including duplicates, for as long as more documents are requested, we have logic in the DocumentSourceSampleFromRandomCursor class to track the ids of returned documents and to abort if we see more than 100 duplicates before completing the $sample. However, the ShardingFilter stage of the underlying query plan has no such logic; it assumes that its input stream is finite and deduplicated, and that no matter how many orphans it has to filter, it will eventually hit EOF.

In the scenario discussed here, where there are only orphans on the shard, the result is that every one of the endless stream of documents returned by the random cursor is an orphan that the ShardingFilter discards, control never returns from the query system, and DSSampleFromRandomCursor never gets the opportunity to either complete or terminate the aggregation. If there are enough legitimate documents on the shard to satisfy the $sample but they are dwarfed by the number of orphans, the $sample will eventually complete, but it may take a prohibitively long time.

What made this particularly pathological in your case was the associated NO_YIELD bug described in

As you can see from

I hope this helps to clarify the behaviour you observed, and the actions we are taking to address it. Thank you for bringing it to our attention!

Regards, |
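A minimal sketch, in mongo-shell-style JavaScript, of the behaviour described above; the thresholds (5% of the collection, more than 100 documents, more than 100 duplicates) come from the comment itself, but the function and variable names are illustrative and do not correspond to actual server code.

```javascript
// Illustrative only: the numbers (5% of the collection, 100 documents, 100 duplicates)
// are taken from the comment above; the function and variable names are hypothetical
// and are not the server's internals.

// Whether the optimized random-cursor $sample plan is eligible. The count includes
// orphans, because the shard cannot tell a priori how many orphans it holds.
function usesRandomCursorSample(sampleSize, totalDocsIncludingOrphans) {
    return totalDocsIncludingOrphans > 100 &&
           sampleSize <= 0.05 * totalDocsIncludingOrphans;
}

// Rough shape of the duplicate-tracking guard in DocumentSourceSampleFromRandomCursor:
// it tracks the ids it has already returned and gives up after more than 100 duplicates.
function sampleFromRandomCursor(nextDoc, sampleSize) {
    const seenIds = new Set();
    const results = [];
    let duplicates = 0;
    while (results.length < sampleSize) {
        const doc = nextDoc();  // the random cursor never hits EOF and may repeat documents
        if (seenIds.has(doc._id)) {
            if (++duplicates > 100) {
                throw new Error("too many duplicate documents returned by random cursor");
            }
            continue;
        }
        seenIds.add(doc._id);
        results.push(doc);
    }
    return results;
}
```

In the failure mode described above, the nextDoc() call itself never returns, because the ShardingFilter beneath it discards orphans forever, so this duplicate guard is never reached.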
| Comment by Stuart Hall [ 25/Oct/18 ] | |
To add information to this, we have now experienced what we believe is this problem in our production environment when running $sample while there are orphaned documents on a particular shard. In our case, the code executed was as simple as can be:
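The exact command is not preserved in this export; as a hypothetical reconstruction based on the surrounding description, it would have been a plain $sample aggregation of roughly this shape (the collection name and sample size are placeholders):

```javascript
// Hypothetical reconstruction - the original command is not preserved in this export.
// The collection name and sample size below are placeholders.
db.myCollection.aggregate([{$sample: {size: 100}}]);
```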
In our case, this query partially froze the mongod process with at least one core running 100% and all write queries blocked. We had to kill -9 the offending node to restore the shard to operation. We know that this particular shard has some orphaned data as we have evacuated all data-bearing chunks from this shard, but still have a number of documents remaining in the local collection in shardKey ranges that overlapped with chunks that should have been on different shards. Because this issue resulted in a loss of service to our production cluster, we have raised a support ticket (00526155) but I thought it worth also updating directly here for visibility. | |
| Comment by Charlie Swanson [ 05/Oct/18 ] | |
|
greg.mckeon I have not started investigating this yet; I haven't had time to do so thus far. | |
| Comment by David Storch [ 07/Sep/18 ] | |
|
Assigning to Charlie to figure out what changed in sharding and assess the likelihood of this happening in a real world scenario. | |
| Comment by David Storch [ 28/Aug/18 ] | |
|
This work should include re-enabling the $sample tests in testshard1.js. |
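As a sketch of the kind of assertion such a re-enabled test might make (this is not the actual content of testshard1.js): a $sample over a sharded collection should terminate and return the requested number of distinct documents even when a shard holds orphaned data.

```javascript
// Sketch only - not the actual testshard1.js test. Assumes `db.coll` is a sharded
// collection containing at least three owned (non-orphan) documents.
const results = db.coll.aggregate([{$sample: {size: 3}}]).toArray();
assert.eq(3, results.length);                             // requested sample size returned
assert.eq(3, new Set(results.map(doc => doc._id)).size);  // returned documents are distinct
```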