[SERVER-3322] sharded aggregations should be very tolerant of node failures Created: 23/Jun/11 Updated: 06/Dec/22 |
|
| Status: | Backlog |
| Project: | Core Server |
| Component/s: | Aggregation Framework |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Improvement | Priority: | Major - P3 |
| Reporter: | Dwight Merriman | Assignee: | Backlog - Query Execution |
| Resolution: | Unresolved | Votes: | 1 |
| Labels: | aggregation, mapreduce,, usability | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Issue Links: |
|
||||||||||||
| Assigned Teams: |
Query Execution
|
||||||||||||
| Backwards Compatibility: | Fully Compatible | ||||||||||||
| Participants: | |||||||||||||
| Case: | (copied to CRM) | ||||||||||||
| Description |
|
In large clusters machines will fail frequently. Aggregation operations in sharded environments should be fairly tolerant of this – this is important for jobs that require many hours to run. Two things we want to handle: (1) if a node in one shard which was doing work fails, we want the job to still complete. This applies to both map/reduce and the aggregation framework. When done in one reassign ticket to the other team? This is lower priority than performance optimization and having good concurrency in the frameworks (I'd say do this thereafter). |
| Comments |
| Comment by David Storch [ 18/Jan/18 ] |
|
Fair enough. |
| Comment by Kaloian Manassiev [ 18/Jan/18 ] |
|
Sharding already provides framework for reacting to node failures - retryable writes and the RetryPolicy on the async results merger. Making aggregation resumable after failures requires domain knowledge for what can be resumed and making use of the tools which sharding provides. So I don't think this belongs to sharding. |
| Comment by David Storch [ 18/Jan/18 ] |
|
kaloian.manassiev, this ticket is fairly broad and non-specific, but I think it belongs on the sharding team's backlog. Do you agree? |
| Comment by Antoine Girbal [ 23/Nov/11 ] |
|
Besides fault tolerance, we should also add $partial support for MR |