[SERVER-4364] Method for running large mapReduce operations in a working replica set Created: 23/Nov/11  Updated: 06/Dec/22  Resolved: 04/Feb/22

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 1.8.4, 2.0.1
Fix Version/s: None

Type: Improvement Priority: Major - P3
Reporter: T. Dampier Assignee: Backlog - Query Optimization
Resolution: Done Votes: 7
Labels: mapreduce, replication
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

replica set – e.g., a PRIMARY, a SECONDARY, and an ARBITER – receiving a steady trickle of read & write operations


Issue Links:
Related
Assigned Teams:
Query Optimization
Participants:

 Description   

As a user of a MongoDB database served from an N-node replica set (running in --auth mode) that is responding to a steady trickle of read and write requests, I want to be able to complete a large map/reduce task without material disruption to the operation of the replica set.

By "large," I mean that I cannot guarantee the result set fits in a single 16MB document – i.e., my task is a poor candidate for "out:

{ inline: 1 }

".

By "material disruption," I mean a variety of pathologies I'd like to avoid if possible. For example:

  1. failover to the SECONDARY because the PRIMARY becomes unresponsive
    • (with subsequent failure of the mapReduce – db assertion 13312 – when it cannot commit its result collection because the former PRIMARY is no longer PRIMARY)
  2. error RS102 ("too stale to catch up") on a secondary because of the deluge of a suddenly-appearing result set being replicated;
  3. significant degradation of overall performance re: the latency of servicing other requests;
  4. etc.

For the sake of argument, let's further assume my map/reduce task is not amenable to an incremental application of successive mapReduce calls; and that I would potentially like to preserve the map/reduce output collection as a target for future queries, though it would be acceptable to have to manually move that result data around among hosts.
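
(For reference, the incremental pattern I'm ruling out looks roughly like the following – the ts field and checkpoint value are illustrative, and mapFn/reduceFn are as sketched above:)

    // Incremental map/reduce: process only documents newer than a checkpoint
    // and fold them into the existing output via the "reduce" output mode.
    var lastRun = ISODate("2011-11-01T00:00:00Z");  // checkpoint from the previous run
    db.events.mapReduce(mapFn, reduceFn, {
        query: { ts: { $gt: lastRun } },   // only new documents
        out: { reduce: "mr_results" }      // re-reduce against the prior results
    });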

I've spoken with a couple of members of the 10gen engineering team, and they've been very helpful in brainstorming approaches and workarounds. But for the sake of the product, I want to make sure the use case is captured here, as I don't think any of the approaches I'm currently aware of really addresses it satisfactorily.

Working with current constraints, here's where we get ...

The "large" / no inline requirement today will force us to execute the mapReduce() call on the PRIMARY. The reason being: an output collection will have to be created, and only the PRIMARY is permitted to do these writes.

But today, for a certain class of job, executing on the PRIMARY is a non-starter. It will consume too many system resources, monopolize the one and only JavaScript thread, hold the write lock for extended periods, etc. – risking pathology #1 above. Even if that failure mode is dodged, the job risks pathology #2, as the large result set (created much faster than any more "natural" insertion of data) floods the SECONDARY nodes and overflows their oplogs. And failure mode #3 is always a concern – the PRIMARY is too critical to the overall responsiveness of any app that uses the database.

That's the problem, in a nutshell. Some potential, suggested solutions – all of which involve changes to the core server:

A. If we could relax or finesse the only-the-PRIMARY-can-write constraint for this kind of operation, we could run the mapReduce() call on a SECONDARY – perhaps even a "hidden" node – and leave the overall cluster and any apps almost completely unaffected. We'd then take responsibility for getting the result collection out of that node as a separate task. Currently, though, this does not seem possible – cf. issue SERVER-4264 – as we can't target even the 'local' database there. (And even if we could target 'local', doing so today requires admin/separate privileges.)
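
(For what it's worth, designating such a hidden node is already easy today – the member index below is illustrative. What's missing is permission to write the mapReduce output there:)

    // Existing replica-set mechanism for a hidden, never-primary member.
    var cfg = rs.conf();
    cfg.members[2].priority = 0;   // never eligible to become PRIMARY
    cfg.members[2].hidden = true;  // invisible to normal client reads
    rs.reconfig(cfg);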

B. Alternately, if we could alter the mapReduce() command in such a way as to allow "out:" to target a remote database – perhaps in an entirely disjoint replica set – we could allow a secondary to run the calculation, save the result set, but still not violate the proscription against writing on a secondary.
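
Today's "out" document can already target a different database on the same node; a hypothetical host field – which I'm inventing purely to sketch the idea, it is not a real option – would extend that to a remote deployment:

    // Supported today: write the output into another database on the same node.
    db.events.mapReduce(mapFn, reduceFn,
        { out: { replace: "mr_results", db: "analytics" } });

    // Hypothetical extension (NOT a real option today): a remote target.
    db.events.mapReduce(mapFn, reduceFn,
        { out: { replace: "mr_results", db: "analytics", host: "otherRS/host1:27017" } });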

C. Another possibility is to flag the result set for exclusion from replication – though this leaves us running on the PRIMARY, where failure modes #1 and #3 are still an issue.
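
(The closest existing analogue is the 'local' database, which is never replicated – precisely why SERVER-4264 matters here. The sketch below is what such a flag would approximate, and is currently rejected by the server:)

    // 'local' is excluded from replication, so output written there would
    // never reach the SECONDARYs -- but per SERVER-4264 the server does not
    // currently accept it as a mapReduce output target.
    db.events.mapReduce(mapFn, reduceFn,
        { out: { replace: "mr_results", db: "local" } });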

My intent here, though, is not to describe or attach to any particular solution; rather, I'm just seeking to articulate the problem standing in the way of (what I consider) a rather important use case.

Thanks for reading this far!

TD



 Comments   
Comment by Esha Bhargava [ 04/Feb/22 ]

Closing these tickets as part of the deprecation of mapReduce.

Comment by Eliot Horowitz (Inactive) [ 28/Nov/11 ]

Not sure we'll have a great solution in 2.2 for this – but we know it's an issue and are working on some ideas.
