[SERVER-16605] Mapreduce into sharded collection with hashed index fails Created: 19/Dec/14  Updated: 06/Dec/22  Resolved: 13/Jun/19

Status: Closed
Project: Core Server
Component/s: MapReduce, Sharding
Affects Version/s: 2.6.5, 2.7.8
Fix Version/s: None

Type: Bug Priority: Major - P3
Reporter: D.H.J. Takken Assignee: [DO NOT USE] Backlog - Sharding Team
Resolution: Duplicate Votes: 1
Labels: open_todo_in_code, sharding, todo_in_code
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified
Environment:

Ubuntu 14.10
MongoDB packages from 10gen
PyMongo 2.7.1


Attachments: Text File log-2.8rc3.txt     Text File log.txt     File testcase.py     File testcase.py    
Issue Links:
Duplicate
is duplicated by SERVER-14324 MapReduce does not respect existing s... Closed
Related
related to SERVER-43467 Complete TODO listed in SERVER-16605 Closed
Assigned Teams:
Sharding
Operating System: Linux
Steps To Reproduce:

1. Instantiate a new, clean MongoDB cluster, featuring a single shard server, config server and mongos.
2. Create a new database, dropping it first if it exists already.
3. Create an input collection and an output collection. Both collections are sharded. The output collection has a hashed index on the _id field.
4. Run a simple map reduce job that gets its input from the input collection and outputs into the output collection.
5. All documents produced by the reducer in stage one of the map reduce process gets lost in the post processing stage. Output collection is empty.
6. Repeat steps 2,3 and 4 using an output collection having a different name. The map reduce process succeeds this time.
7. Repeat steps 2,3 and 4 using an output collection having the same name as was used in the first map reduce job. It will fail again.

(Python implementation of this test case is attached)

Participants:

 Description   

When outputting from a map reduce job into a sharded output collection which features a hashed index on the _id field, no output is produced. The _id field is also the sharding key, so this issue

Extensive testing shows that this happens only for the first map reduce that is ever run on a MongoDB cluster. It fails to produce output and in the process, the name of the output collection appears to become 'cursed' somehow: Any subsequent map-reduce job runs fail if that same output collection name is used.

Even if the collection is re-created or the entire database is dropped and re-created, or if a different database is used. The name of the output collection can never be used again. Only when outputting into a collection with a different name, the exact same map reduce job processing the exact same data will succeed.

The problem emerges on sharded clusters only, and only when the output collection uses a hashed index.

It is possible to work around this problem by running a dummy map reduce job on newly setup MongoDB clusters, using an output collection that will never be used in regular operations.



 Comments   
Comment by Asya Kamsky [ 08/May/18 ]

I think this is just an instance of SERVER-14324 where output isn't using _id:"hashed" but using _id:1 instead.

Comment by Asya Kamsky [ 08/May/18 ]

I just verified that this is an issue if you are trying to output into a collection with shard key {_id:"hashed"}

If the sharding of collection is changed to {_id:1} then it works as expected.

 

Comment by Ramon Fernandez Marina [ 19/Feb/16 ]

MosheKaplan, please take a look at SERVER-17397, which contains information on how to eliminate all traces of a database or collection from a sharded cluster. Hope that helps.

Comment by Moshe Kaplan [X] [ 19/Feb/16 ]

We suffer from the same case at 3.0.3.
In our case, it seems that more than a single collection is being cursed, although newer tables are being created.
We tried to remove tables traces from the config servers (locks and collecitons tables in the config database), but it did not resolve this issue.
Can you point us to the location in the sharded cluster, where we can remove the "cursed" table name to recover the cluster?

Thanks
Moshe

Comment by Ramon Fernandez Marina [ 19/May/15 ]

dtakken, looks like we let this ticket fall through the cracks – very sorry about that.

Thanks for the concise script, I'm able to reproduce the behavior we describe and we're investigating. Once thing I've noticed is that it seems to be the "OutputCollectionA" name the only one that doesn't work, not just the first one that's used. There's room for improvement in mapReduce operations with output to sharded collections, but this looks like a very strange bug and I wonder if there are other names that fail in the same manner. Will post updates to this ticket as they become available.

Thanks,
Ramón.

EDIT
After more thorough testing I can confirm that the issue appears with whatever collection name is used for the first mapReduce operation as initially pointed out in this ticket; apologies for the confusion. I also observed that the behavior reproduces even if the hashed index is not created.

Comment by D.H.J. Takken [ 22/Dec/14 ]

Uploaded a new version of the test case, removing an index creation statement that is not relevant to the testcase.

Also, I added the logging output of the testcase run on MongoDB 2.8 RC3.

Comment by D.H.J. Takken [ 22/Dec/14 ]

I just tested on version 2.8 RC3 and the problem reproduces there as well.

Comment by D.H.J. Takken [ 19/Dec/14 ]

That is correct. I use the mongodb-org and mongodb-org-unstable packages by Ernie Hershey. Packages for the 2.8 RC series have not appeared in the repository yet.

Comment by Asya Kamsky [ 19/Dec/14 ]

You'd mentioned on mongodb-dev group that you can reproduce this on 2.6.5 and 2.7.x builds (2.8.0-rc?) is that correct? I'm double-checking since the 10gen packages are all 2.4 I believe.

Generated at Thu Feb 08 03:41:38 UTC 2024 using Jira 9.7.1#970001-sha1:2222b88b221c4928ef0de3161136cc90c8356a66.