- Type: Improvement
- Resolution: Unresolved
- Priority: Major - P3
- None
- Affects Version/s: None
- Component/s: Aggregation Framework
- Query Execution
- Fully Compatible
In large clusters, machines fail frequently. Aggregation operations in sharded environments should be fairly tolerant of such failures, which matters most for jobs that take many hours to run. There are two things we want to handle:
(1) If a node in one shard fails while doing work, we want the job to still complete.
(2) If that node's work is restarted from scratch, (1) is satisfied, but the time for the job to complete might double. That is suboptimal and should be addressed too.
This applies to both map/reduce and the aggregation framework. Once it is done in one, reassign the ticket to the other team?
This is lower priority than performance optimization and having good concurrency in the frameworks (I'd say do this thereafter).
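To make goals (1) and (2) concrete, here is a minimal, self-contained Python sketch. It is not MongoDB code and does not reflect how the server is implemented: `Checkpoint`, `run_on_node`, `run_shard_task`, and the `fail_at` failure-injection map are all hypothetical names invented for illustration. It assumes the shard's partial aggregate (a simple sum here) is checkpointed somewhere that survives the failing node, so a replacement node resumes from the checkpoint rather than restarting.

```python
from dataclasses import dataclass


class NodeFailure(Exception):
    """Raised when a (simulated) node dies in the middle of its work."""


@dataclass
class Checkpoint:
    # Assumption: this state is stored off the working node, so it
    # survives that node's failure.
    next_index: int = 0  # first input item not yet folded into `partial`
    partial: int = 0     # running aggregate for this shard (a sum here)
    work: int = 0        # total items processed across all attempts


def run_on_node(node, items, ckpt, fail_at=None):
    """Fold `items` into `ckpt`, starting at `ckpt.next_index`.

    `fail_at` maps a node name to the item index at which that node
    crashes; it exists only to simulate failures.
    """
    for i in range(ckpt.next_index, len(items)):
        if fail_at and fail_at.get(node) == i:
            raise NodeFailure(f"{node} failed at item {i}")
        ckpt.partial += items[i]
        ckpt.work += 1
        ckpt.next_index = i + 1


def run_shard_task(replicas, items, fail_at=None):
    """Run one shard's part of the job, surviving node failures."""
    ckpt = Checkpoint()
    for node in replicas:
        try:
            run_on_node(node, items, ckpt, fail_at)
            return ckpt
        except NodeFailure:
            # Goal (1): move on to the next replica instead of failing the job.
            # Goal (2): `ckpt` is kept, so finished work is not repeated.
            continue
    raise RuntimeError("all replicas of the shard failed")


if __name__ == "__main__":
    items = list(range(10))
    ckpt = run_shard_task(["node-a", "node-b"], items, fail_at={"node-a": 6})
    print(ckpt.partial, ckpt.work)
```

In this simulation `node-a` fails after six items and `node-b` picks up the remaining four, so the total work stays equal to the input size instead of growing by the six repeated items. A real implementation would also have to decide how often to checkpoint and how to make the checkpoint itself durable.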
is related to
- SERVER-31782 allow aggregation to take an 'allowPartialResults' option (Backlog)
- SERVER-17696 Terminate sharded queries immediately after a failure (Closed)