[SERVER-3322] sharded aggregations should be very tolerant of node failures Created: 23/Jun/11  Updated: 06/Dec/22

Status: Backlog
Project: Core Server
Component/s: Aggregation Framework
Affects Version/s: None
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: Dwight Merriman Assignee: Backlog - Query Execution
Resolution: Unresolved Votes: 1
Labels: aggregation, mapreduce,, usability
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Issue Links:
Related
is related to SERVER-31782 allow aggregation to take an 'allowPa... Backlog
is related to SERVER-17696 Terminate sharded queries immediately... Closed
Assigned Teams:
Query Execution
Backwards Compatibility: Fully Compatible
Participants:
Case:

 Description   

In large clusters machines will fail frequently. Aggregation operations in sharded environments should be fairly tolerant of this – this is important for jobs that require many hours to run. Two things we want to handle:

(1) if a node in one shard which was doing work fails, we want the job to still complete.
(2) if that node's work is completely restarted, #1 is fixed but the time for the job to complete might double. that is suboptimal and should be addressed too.

This applies to both map/reduce and the aggregation framework. When done in one reassign ticket to the other team?

This is lower priority than performance optimization and having good concurrency in the frameworks (I'd say do this thereafter).



 Comments   
Comment by David Storch [ 18/Jan/18 ]

Fair enough.

Comment by Kaloian Manassiev [ 18/Jan/18 ]

Sharding already provides framework for reacting to node failures - retryable writes and the RetryPolicy on the async results merger. Making aggregation resumable after failures requires domain knowledge for what can be resumed and making use of the tools which sharding provides. So I don't think this belongs to sharding.

Comment by David Storch [ 18/Jan/18 ]

kaloian.manassiev, this ticket is fairly broad and non-specific, but I think it belongs on the sharding team's backlog. Do you agree?

Comment by Antoine Girbal [ 23/Nov/11 ]

Besides fault tolerance, we should also add $partial support for MR

Generated at Thu Feb 08 03:02:45 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.