Uploaded image for project: 'Core Server'
  1. Core Server
  2. SERVER-3322

sharded aggregations should be very tolerant of node failures

    XMLWordPrintableJSON

Details

    • Query Execution
    • Fully Compatible

    Description

      In large clusters machines will fail frequently. Aggregation operations in sharded environments should be fairly tolerant of this – this is important for jobs that require many hours to run. Two things we want to handle:

      (1) if a node in one shard which was doing work fails, we want the job to still complete.
      (2) if that node's work is completely restarted, #1 is fixed but the time for the job to complete might double. that is suboptimal and should be addressed too.

      This applies to both map/reduce and the aggregation framework. When done in one reassign ticket to the other team?

      This is lower priority than performance optimization and having good concurrency in the frameworks (I'd say do this thereafter).

      Attachments

        Activity

          People

            backlog-query-execution Backlog - Query Execution
            dwight@mongodb.com Dwight Merriman
            Votes:
            1 Vote for this issue
            Watchers:
            10 Start watching this issue

            Dates

              Created:
              Updated: