[SERVER-2099] MapReduce does not allow limit as one attribute on a sharded setup Created: 16/Nov/10  Updated: 06/Dec/22  Resolved: 22/May/17

Status: Closed
Project: Core Server
Component/s: MapReduce
Affects Version/s: 1.6.3
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: hc Assignee: Backlog - Query Team (Inactive)
Resolution: Won't Fix Votes: 2
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

OS - Linux hqd-soak-03 2.6.18-164.11.1.el5 #1 SMP Wed Jan 20 07:32:21 EST 2010 x86_64 x86_64 x86_64 GNU/Linux


Issue Links:
Depends
is depended on by CSHARP-2075 C# - MapReduce (MongoDB.Driver) does ... Closed
Assigned Teams:
Query
Backwards Compatibility: Fully Compatible
Operating System: Linux
Participants:

 Description   

We have a sharded setup across 2 machines, with a 120GB collection distributed over them.

We ran a MapReduce job through db.runCommand with a limit attribute to test the map/reduce functions, and it threw an error saying the limit attribute is not allowed.

Specifics

db.runCommand({
    mapreduce: "bigdm",
    map: m,
    reduce: r,
    limit: 100,
    query: { "dynamicRL" : { "$exists" : true } },
    out: "dt",
    verbose: true
});

It gives the following error message:

{
"assertion" : "don't know mr field: limit",
"assertionCode" : 10177,
"errmsg" : "db assertion failure",
"ok" : 0
}
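
A possible workaround, sketched under assumptions: copy a limited subset of the matching documents into a staging collection and run the MapReduce over that. The collection name "bigdm_sample" is invented for illustration; m and r are the map/reduce functions from the command above.

// Stage up to 100 matching documents into a hypothetical scratch collection.
db.bigdm.find({ "dynamicRL" : { "$exists" : true } }).limit(100).forEach(function(doc) {
    db.bigdm_sample.insert(doc);
});

// Run the same map/reduce over the staged subset (no limit field needed).
db.runCommand({
    mapreduce: "bigdm_sample",
    map: m,
    reduce: r,
    out: "dt",
    verbose: true
});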



 Comments   
Comment by Asya Kamsky [ 22/May/17 ]

We're closing this ticket as "Won't Fix": as discussed here, the concept of "limit" doesn't make sense for MapReduce, especially since it wouldn't work well in a sharded environment, as pointed out in the comments.

If the goal is to limit processing to a random subset of documents, the aggregation pipeline provides a $sample stage for exactly that.
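
For example, a minimal sketch using the ticket's collection and query (the $sample size of 100 and the trailing $group stage are illustrative stand-ins for the original map/reduce logic):

// Process a random subset of ~100 matching documents via the aggregation
// pipeline instead of mapReduce's unsupported limit field.
db.bigdm.aggregate([
    { $match: { "dynamicRL" : { "$exists" : true } } },
    { $sample: { size: 100 } },
    { $group: { _id: null, count: { $sum: 1 } } }  // replace with the real aggregation logic
]);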

Comment by Andrew Armstrong [ 21/Apr/11 ]

I don't use MR yet, but perhaps limit would be better named sample size?

Since MR is not ordered, perhaps a sample size argument would mean at most X documents would be processed per shard?

I assume it's useful in MR to have a limit so you can get an approximate answer to see if a query looks right, then run the full query when you're happy?

Comment by Eliot Horowitz (Inactive) [ 03/Jan/11 ]

Still not sure what the right thing to do is.

Comment by Eliot Horowitz (Inactive) [ 21/Nov/10 ]

It's unclear what the semantics of limit are in this case.
We don't want to apply it serially, as with a large limit that could be slow.
But we also don't want to apply it separately on each shard, as then it's not accurate.
