Type: Bug
Resolution: Duplicate
Priority: Major - P3
Fix Version/s: None
Affects Version/s: 2.6.5, 2.7.8
Component/s: MapReduce, Sharding
Labels:
Environment:
Ubuntu 14.10
MongoDB packages from 10gen
PyMongo 2.7.1

Assigned Teams:

Sharding
Operating System:
Linux
Steps To Reproduce:


1. Instantiate a new, clean MongoDB cluster, featuring a single shard server, config server and mongos.
2. Create a new database, dropping it first if it exists already.
3. Create an input collection and an output collection. Both collections are sharded. The output collection has a hashed index on the _id field.
4. Run a simple map reduce job that gets its input from the input collection and outputs into the output collection.
5. All documents produced by the reducer in stage one of the map reduce process get lost in the post-processing stage. The output collection is empty.
6. Repeat steps 2, 3 and 4 using an output collection with a different name. The map reduce process succeeds this time.
7. Repeat steps 2, 3 and 4 using an output collection with the same name as was used in the first map reduce job. It will fail again.

(Python implementation of this test case is attached)
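The attached testcase.py is not reproduced on this page. As a rough illustration only, steps 2 to 4 might be sketched as below. This is not the reporter's script. The function names, the `key` field in the input documents, the ranged shard key on the input collection, and the `merge` output mode are all assumptions, since the report does not specify them. The method names follow recent PyMongo releases rather than the 2.7.1 release listed under Environment.

```python
# Sketch of the reproduction steps; names and option choices are assumptions.
MAP_JS = "function () { emit(this.key, 1); }"
REDUCE_JS = "function (key, values) { return Array.sum(values); }"


def build_commands(db_name, in_coll, out_coll):
    """Return (target database, command document) pairs for steps 3 and 4."""
    return [
        ("admin", {"enableSharding": db_name}),
        # Assumption: the input collection uses a ranged shard key on _id.
        ("admin", {"shardCollection": f"{db_name}.{in_coll}",
                   "key": {"_id": 1}}),
        # From the report: the output collection is sharded on a hashed _id.
        ("admin", {"shardCollection": f"{db_name}.{out_coll}",
                   "key": {"_id": "hashed"}}),
        # Assumption: "merge" output mode; the report does not name the mode.
        (db_name, {
            "mapReduce": in_coll,
            "map": MAP_JS,
            "reduce": REDUCE_JS,
            "out": {"merge": out_coll, "sharded": True},
        }),
    ]


def run_once(client, db_name, in_coll, out_coll, docs):
    """Run steps 2-4 through a mongos and return the output document count."""
    client.drop_database(db_name)  # step 2: start from a clean database
    for target, command in build_commands(db_name, in_coll, out_coll):
        if "mapReduce" in command:
            # Load the input data just before the map reduce job runs.
            client[db_name][in_coll].insert_many(docs)
        client[target].command(command)
    return client[db_name][out_coll].count_documents({})
```

Here `client` would be a `pymongo.MongoClient` connected to the mongos, and `docs` a list of documents that each carry a `key` field. If the behaviour described in this report applies, the first call on a fresh cluster would be expected to return 0, while a later call with a different `out_coll` would return a non-zero count.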

CAR Domain/s:
None

Aha! Reference:
None
Tracking Level:
None
Risk Status:
None
Exec Notes:
None
Goal Name(s):
None
Goal Link:
None

When outputting from a map reduce job into a sharded output collection which features a hashed index on the _id field, no output is produced. The _id field is also the sharding key, so this issue

Extensive testing shows that this happens only for the first map reduce job that is ever run on a MongoDB cluster. It fails to produce output and, in the process, the name of the output collection appears to become 'cursed' somehow: any subsequent map reduce job fails if that same output collection name is used.

This holds even if the collection is re-created, the entire database is dropped and re-created, or a different database is used: the name of the output collection can never be used again. The exact same map reduce job processing the exact same data succeeds only when it outputs into a collection with a different name.

The problem emerges on sharded clusters only, and only when the output collection uses a hashed index.

It is possible to work around this problem by running a dummy map reduce job on newly set up MongoDB clusters, using an output collection that will never be used in regular operations.


Attachments:
log.txt, 40 kB, Dec 19 2014 01:47:05 PM UTC
log-2.8rc3.txt, 40 kB, Dec 22 2014 09:28:33 AM UTC
testcase.py, 3 kB, Dec 22 2014 09:28:33 AM UTC
testcase.py, 3 kB, Dec 19 2014 01:47:05 PM UTC

is duplicated by

SERVER-14324 MapReduce does not respect existing shard key on output:sharded

Closed

related to

SERVER-43467 Complete TODO listed in SERVER-16605

Closed

Assignee: [DO NOT USE] Backlog - Sharding Team
Reporter: D.H.J. Takken
Participants: [DO NOT USE] Backlog - Sharding Team, Asya Kamsky, D.H.J. Takken, Moshe Kaplan [X], Ramon Fernandez
Votes: 1
Watchers: 8

Created: Dec 19 2014 01:47:05 PM UTC
Updated: Dec 06 2022 04:57:54 AM UTC
Resolved: Jun 13 2019 04:00:40 PM UTC
