  1. Core Server
  2. SERVER-16605

Mapreduce into sharded collection with hashed index fails


    Details

    • Type: Bug
    • Status: Closed
    • Priority: Major - P3
    • Resolution: Duplicate
    • Affects Version/s: 2.6.5, 2.7.8
    • Fix Version/s: None
    • Component/s: MapReduce, Sharding
    • Environment:
      Ubuntu 14.10
      MongoDB packages from 10gen
      PyMongo 2.7.1
    • Operating System:
      Linux
    • Steps To Reproduce:

      1. Instantiate a new, clean MongoDB cluster, featuring a single shard server, config server and mongos.
      2. Create a new database, dropping it first if it exists already.
      3. Create an input collection and an output collection. Both collections are sharded. The output collection has a hashed index on the _id field.
      4. Run a simple map reduce job that gets its input from the input collection and outputs into the output collection.
      5. All documents produced by the reducer in stage one of the map-reduce process get lost in the post-processing stage; the output collection is empty.
      6. Repeat steps 2, 3, and 4 using an output collection with a different name. The map-reduce process succeeds this time.
      7. Repeat steps 2, 3, and 4 using an output collection with the same name as was used in the first map-reduce job. It fails again.

      (Python implementation of this test case is attached)
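      The attached test case is not reproduced here, but the steps above can be sketched in PyMongo roughly as follows. Names (`mr_test`, `run_repro`, the map/reduce bodies) are illustrative, the sketch assumes a mongos on localhost:27017, and it uses the legacy `map_reduce` helper (removed in PyMongo 4):

```python
# Sketch of the repro steps above (hypothetical names; assumes a running
# mongos on localhost:27017 -- the function is not executed here).
MAP_JS = "function () { emit(this._id, this.value); }"
REDUCE_JS = "function (key, values) { return Array.sum(values); }"

def run_repro(out_name="mr_out"):
    # Imports kept local so the sketch reads without PyMongo installed.
    from pymongo import MongoClient
    from bson.code import Code

    client = MongoClient("localhost", 27017)  # connect via mongos
    client.drop_database("mr_test")           # step 2: clean database
    admin = client.admin
    admin.command("enableSharding", "mr_test")
    # Step 3: shard both collections on a hashed _id key, which creates
    # the hashed index on the output collection.
    admin.command("shardCollection", "mr_test.input",
                  key={"_id": "hashed"})
    admin.command("shardCollection", "mr_test." + out_name,
                  key={"_id": "hashed"})

    # Step 4: run a simple map-reduce into the sharded output collection.
    coll = client.mr_test.input
    coll.insert([{"_id": i, "value": 1} for i in range(100)])
    coll.map_reduce(Code(MAP_JS), Code(REDUCE_JS),
                    out={"replace": out_name, "sharded": True})
    # Step 5: on affected versions this count is 0 on the first run.
    return client.mr_test[out_name].count()
```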


      Description

      When a map-reduce job outputs into a sharded output collection that has a hashed index on the _id field, no output is produced. The _id field is also the shard key.

      Extensive testing shows that this happens only for the first map-reduce job ever run on a MongoDB cluster. That job fails to produce output, and in the process the name of its output collection appears to become 'cursed' somehow: any subsequent map-reduce job fails if the same output collection name is used.

      This holds even if the collection is re-created, the entire database is dropped and re-created, or a different database is used; the output collection name can never be used again. Only when outputting into a collection with a different name does the exact same map-reduce job, processing the exact same data, succeed.

      The problem emerges on sharded clusters only, and only when the output collection uses a hashed index.

      It is possible to work around this problem by running a dummy map-reduce job on a newly set up MongoDB cluster, using an output collection name that will never be used in regular operations.
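      A minimal sketch of that warm-up workaround, assuming a reachable mongos (the database and collection names here are hypothetical and chosen so they never collide with regular workloads):

```python
# Hypothetical warm-up: run one throwaway map-reduce on a fresh cluster so
# the 'cursed' output name is a dummy one. Names are illustrative only.
DUMMY_MAP = "function () { emit(this._id, 1); }"
DUMMY_REDUCE = "function (key, values) { return 1; }"

def warm_up_cluster(mongos_uri="mongodb://localhost:27017"):
    # Imports kept local so the sketch reads without PyMongo installed.
    from pymongo import MongoClient
    from bson.code import Code

    client = MongoClient(mongos_uri)
    db = client.mr_warmup
    db.dummy_in.insert({"_id": 0})
    # 'dummy_out' absorbs the first-run failure and is never reused.
    db.dummy_in.map_reduce(Code(DUMMY_MAP), Code(DUMMY_REDUCE),
                           out="dummy_out")
    client.drop_database("mr_warmup")
```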

        Attachments

        1. testcase.py
          3 kB
        2. testcase.py
          3 kB
        3. log-2.8rc3.txt
          40 kB
        4. log.txt
          40 kB


              People

              • Votes: 1
              • Watchers: 8
