Type: Improvement
Resolution: Done
Priority: Major - P3
None
Affects Version/s: 1.8.4, 2.0.1
Component/s: MapReduce
Environment: replica set -- e.g., a PRIMARY, a SECONDARY, and an ARBITER -- receiving a steady trickle of read & write operations
Query Optimization
As a user of a MongoDB database served from an N-node replica set (running in --auth mode) that is responding to a steady trickle of read and write requests, I want to be able to complete a large map/reduce task without material disruption to the operation of the replica set.
By "large," I mean that I cannot guarantee the result set fits in a single 16MB document – i.e., my task is a poor candidate for "out:
{ inline: 1 }".
By "material disruption," I mean a variety of pathologies I'd like to avoid if possible. For example:
- failover to the SECONDARY because the PRIMARY becomes unresponsive
- (with subsequent failure of the mapReduce – db assertion 13312 – when it cannot commit its result collection because the former PRIMARY is no longer PRIMARY)
- error RS102 ("too stale to catch up") on a SECONDARY because of the deluge of replication traffic from a suddenly-appearing result set;
- significant degradation of overall performance re: the latency of servicing other requests;
- etc.
For the sake of argument, let's further assume my map/reduce task is not amenable to an incremental application of successive mapReduce calls; and that I would potentially like to preserve the map/reduce output collection as a target for future queries, though it would be acceptable to have to manually move that result data around among hosts.
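For context, the incremental approach being ruled out would look roughly like the sketch below (collection, field, and variable names are hypothetical): each run maps only the new slice of data and re-reduces it into the existing output collection.

    var mapFn = function () { emit(this.userId, { count: 1 }); };
    var reduceFn = function (key, values) {
        var total = 0;
        values.forEach(function (v) { total += v.count; });
        return { count: total };
    };

    db.events.mapReduce(mapFn, reduceFn, {
        query: { ts: { $gt: lastRunTimestamp } },  // hypothetical: only documents added since the last run
        out: { reduce: "mr_results" }              // re-reduce the new results into the prior ones
    });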
I've spoken with a couple members of the 10gen engineering team, and they've been very helpful in brainstorming approaches and workarounds. But for the sake of the product, I want to make sure the use case is captured here, as I don't think any of the approaches I'm currently aware of really addresses it satisfactorily.
Working with current constraints, here's where we get ...
The "large" / no inline requirement today will force us to execute the mapReduce() call on the PRIMARY. The reason being: an output collection will have to be created, and only the PRIMARY is permitted to do these writes.
But today, for a certain class of job, executing on the PRIMARY is a non-starter. It will consume too many system resources, monopolize the one and only javascript thread, hold the write lock for extended periods of time, etc. – risking running afoul of pathology #1 above. If that failure mode is dodged, it runs the risk of pathology #2, as the large result set (created much faster than any more "natural" insertion of data) floods the SECONDARY nodes and overflows their oplogs. And of course failure mode #3 is always a concern – the PRIMARY is too critical to the overall responsiveness of any app that uses the database.
That's the problem, in a nutshell. Some potential, suggested solutions – all of which involve changes to the core server (each sketched more concretely after the list):
A. If we could relax or finesse the only-PRIMARY-can-write constraint for this kind of operation, we could run the mapReduce() call on a SECONDARY – perhaps even a "hidden" node – and leave the overall cluster and any apps almost completely unaffected. Then we'd take responsibility for getting the result collection out of that node as a separate task. Currently, though, this does not seem possible – cf. issue SERVER-4264 – as we can't target even the 'local' collection there. (And even if we could target 'local', today it requires admin/separate privileges to do so.)
B. Alternatively, if we could alter the mapReduce() command in such a way as to allow "out:" to target a remote database – perhaps in an entirely disjoint replica set – we could let a secondary run the calculation and save the result set without violating the proscription against writing on a secondary.
C. Another possibility is to flag the result set for exclusion from replication – though this leaves us running on the PRIMARY, where failure modes #1 and #3 are still an issue.
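To make those three suggestions a bit more concrete, here are sketches (again reusing the placeholder map/reduce functions from above). The hidden-member configuration in (A) uses real replica-set options; the "out:" documents in (B) and (C) are purely hypothetical syntax, invented here just to illustrate the ideas:

    // (A) A hidden, priority-0 member -- invisible to clients and never eligible
    //     to become PRIMARY -- is the kind of node I'd like to run the job on:
    cfg = rs.conf();
    cfg.members[2].priority = 0;   // member index 2 is just an example
    cfg.members[2].hidden = true;
    rs.reconfig(cfg);

    // (B) Hypothetical: let "out:" target a database on some other host, entirely
    //     outside this replica set:
    db.events.mapReduce(mapFn, reduceFn, {
        out: { replace: "mr_results", db: "analytics", host: "other-box.example.com:27017" }  // hypothetical options
    });

    // (C) Hypothetical: flag the result collection's writes for exclusion from
    //     the oplog, i.e. from replication:
    db.events.mapReduce(mapFn, reduceFn, {
        out: { replace: "mr_results", replicate: false }  // hypothetical "replicate" flag
    });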
My intent here, though, is not to describe or attach to any particular solution; rather, I'm just seeking to articulate the problem standing in the way of (what I consider) a rather important use case.
Thanks for reading this far!
TD