[SERVER-699] Support other scripting languages (eg perl) for map/reduce Created: 04/Mar/10 Updated: 06/Dec/22 |
|
| Status: | Open |
| Project: | Core Server |
| Component/s: | Usability |
| Affects Version/s: | None |
| Fix Version/s: | features we're not sure of |
| Type: | New Feature | Priority: | Major - P3 |
| Reporter: | josh rabinowitz | Assignee: | Backlog - Query Optimization |
| Resolution: | Unresolved | Votes: | 21 |
| Labels: | map-reduce |
| Remaining Estimate: | Not Specified |
| Time Spent: | Not Specified |
| Original Estimate: | Not Specified |
| Assigned Teams: | Query Optimization |
| Participants: |
| Description |
|
It would be advantageous to be able to use other scripting languages in map/reduce tasks (for me, perl, though I could see python being a good fit too). This would let developers write map/reduce tasks more easily and give them access to code and libraries in that language that could be useful during those tasks. |
| Comments |
| Comment by josh rabinowitz [ 02/Dec/10 ] |
|
I'm the original poster of this JIRA (not that I was the first to want support for other languages in m/r). It's been interesting to see how the conversation here has evolved. To add my $0.01: +1 to streaming solution. And BSON in/out sounds just fine. |
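As a rough illustration of what "BSON in/out" could mean for the streaming proposal discussed in Mathieu Poumeyrol's comment further down, here is a minimal sketch in Python. It assumes PyMongo's `bson` package and length-prefixed BSON documents on STDIN/STDOUT; the framing and the field names are assumptions, not an existing MongoDB interface. |

```python
#!/usr/bin/env python3
# Hypothetical BSON-over-pipes mapper: each BSON document begins with its own
# little-endian int32 length, so documents can be read back-to-back from STDIN.
# Emitted key/value pairs are written back out as BSON.
# Assumes PyMongo's bson package.
import struct
import sys

import bson  # shipped with the PyMongo distribution


def read_doc(stream):
    header = stream.read(4)
    if len(header) < 4:
        return None  # end of input
    (length,) = struct.unpack("<i", header)
    return bson.decode(header + stream.read(length - 4))


while True:
    doc = read_doc(sys.stdin.buffer)
    if doc is None:
        break
    # Example logic: emit one count per document, keyed by its _id.
    sys.stdout.buffer.write(bson.encode({"key": doc.get("_id"), "value": 1}))
```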
| Comment by Bobby J [ 01/Dec/10 ] |
|
Big priority for us. We chose to use mongodb partly because pymongo integrated so nicely into our python codebase. Now we find ourselves using hadoop for mapreduce jobs just so we can keep our mapper/reducer functionality in python. Thanks for looking into this! |
| Comment by Paul Harvey [ 16/Oct/10 ] |
|
I can appreciate that this task may be a little open-ended; there are some interesting design decisions to make. Turning mongo into a full-blown distributed HPC platform might be asking too much. But we would really appreciate a streaming solution also - no matter how primitive. Although we will be storing raw data in mongodb, the system we are building is only able to exploit mongo for metadata (management of the raw data). As things currently stand, we either have to fund someone to re-work a precious few algorithms into mongo+m/r javascript (costly, unsustainable), relying on sharding to have any hope of reasonable CPU utilisation, or alternatively we build an in-house API to bridge the raw data from mongodb to an entirely separate distributed HPC framework. We work in bioinformatics - many problems fit embarrassingly well into map/reduce, but we rely heavily on libraries to do the bulk of the work (python, perl, ruby - probably in that order - though people use things like R on their workstations). |
| Comment by Valery Khamenya [ 22/Sep/10 ] |
|
+1 to Mathieu Poumeyrol |
| Comment by Eliot Horowitz (Inactive) [ 22/Jun/10 ] |
|
@mathieu @cyril we agree. we haven't gotten to it yet - but it's definitely one of the things we want to support. |
| Comment by Cyril Mougel [ 22/Jun/10 ] |
|
I totally agree with Mathieu. The streaming solution is a really good way to do what we want with map/reduce. In a certain way, it would also let us make map/reduce multi-threaded, because it's our program that would be multi-threaded, not MongoDB. |
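To illustrate that point - the parallelism living in the external program rather than in MongoDB - here is a minimal sketch, assuming the same hypothetical line-oriented JSON streaming interface described in Mathieu's comment below (none of this is an existing MongoDB feature, and the `status` field is made up for the example). |

```python
#!/usr/bin/env python3
# Hypothetical streaming mapper that parallelizes its own work: the server would
# only pipe documents in on STDIN and read emitted pairs from STDOUT; the worker
# pool is entirely this program's concern. The JSON-per-line framing is assumed.
import json
import sys
from multiprocessing import Pool


def map_one(line):
    doc = json.loads(line)
    # Example logic: count documents per (assumed) status field.
    return json.dumps({"key": doc.get("status"), "value": 1})


if __name__ == "__main__":
    with Pool(processes=4) as pool:
        for emitted in pool.imap(map_one, sys.stdin, chunksize=100):
            print(emitted)
```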
| Comment by Mathieu Poumeyrol [ 22/Jun/10 ] |
|
I had a conversation at MongoFR with Matthias, and I think this would be a good place to follow up. I think we need to have something similar to hadoop streaming. The principle is simple: each mapper starts an external process with a command specified by the user, pushes each document to the process STDIN, and gets each emitted value from its STDOUT. And the same for the reducer. That would give m/r support for any language that can read and write json and/or bson, instead of having to pick one or a few languages that will leave most users frustrated and require more and more heavy code maintenance and complex dependencies. The other nice point with this approach is that it is very easy to simulate map/reduce using unix pipes in a development environment. Matthias expressed concern about that feature allowing arbitrary code execution on the server, but that is a risk that can be mitigated: we may want to limit it to some directory where the admin puts the scripts, or even a more defined list, or have a map/reduce worker running as nobody... but that would seriously make the installation more difficult. As far as I'm concerned, I'd prefer the server to let me do whatever I want, with the user mongo is actually running as. My data, my responsibility. For your information, hadoop also manages code transport from wherever the job is launched to the various nodes. The use case I'm investigating is log analysis: I would love to get all my logs into mongo to support real-time collection, long term storage, massive analysis and pinpoint debugging. But in massive analysis, streaming is an absolute must. |
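A minimal sketch of what such a streaming mapper might look like in Python, assuming one JSON document per line on STDIN and emitted (key, value) pairs as JSON lines on STDOUT - the framing, the launch mechanism, and the `tags` field are all assumptions, not an existing MongoDB interface. |

```python
#!/usr/bin/env python3
# Hypothetical streaming mapper: the server would start this command, write one
# JSON document per line to its STDIN, and read emitted (key, value) pairs as
# JSON lines from its STDOUT. A reducer would follow the same convention.
import json
import sys

for line in sys.stdin:
    doc = json.loads(line)
    # Example logic: emit one count per tag in the document.
    for tag in doc.get("tags", []):
        print(json.dumps({"key": tag, "value": 1}))
```

In line with the point about development environments, such a pipeline could be simulated with plain unix pipes, e.g. `cat docs.json | ./mapper.py | sort | ./reducer.py` (where `mapper.py`, `reducer.py`, and `docs.json` are hypothetical local files). |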
| Comment by Evan Wies [ 15/Apr/10 ] |
|
Some notes on Lua. Lua is really fast and designed to be embedded. mstearn said that preserving the ordering of keys is important; Lua doesn't do that. There was a patch in January 2010 that does (http://lua-users.org/lists/lua-l/2010-01/msg00199.html). Since it is a patch rather than a library, it would only be feasible for the server - clients can't be expected to use a patched VM. Lua has an awesome JIT (http://www.luajit.org), which would make a lot of sense for map/reduce. You'd need to port the ordered-keys patch to it, and you'd also want to wait for the Foreign Function Interface (FFI) or raw struct access to be added. Notes on syntax: Lua table syntax is a little cleaner than pure JSON, e.g. { name = "mongo", type = "db" }. Since the native (albeit configurable) numeric type in Lua is a double, there are issues storing 64-bit integers; there are various solutions for this which are web-searchable. |
| Comment by Ben Poweski [ 05/Apr/10 ] |
|
For us, perl with all of its complexity, and well, Perlisms, would be a less desirable language. I think Lua would be a natural fit. |
| Comment by Eliot Horowitz (Inactive) [ 04/Mar/10 ] |
|
agree it would be nice - but non-trivial |