Spark Connector / SPARK-94

Default partitioner that uses the aggregation pipeline to calculate partitions

    • Type: New Feature
    • Resolution: Done
    • Priority: Major - P3
    • Affects Version/s: 1.1.0, 2.0.0
    • Component/s: None
    • Labels: None

      The default partitioner strategy ignores the aggregation pipeline completely, producing many empty partitions that slow down the Spark application.

      There should be a partitioner strategy that creates partitions based on the aggregation pipeline that will be executed.
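
      For context, here is a minimal spark-shell-style Scala sketch of the current behaviour. The URI, the test.events collection, and the $match filter are illustrative placeholders, not values from this ticket. The $match stage is pushed down into each partition's query, but the partitions themselves are computed from the whole collection, so most of them match nothing:

          import com.mongodb.spark.MongoSpark
          import com.mongodb.spark.rdd.MongoRDD
          import org.apache.spark.{SparkConf, SparkContext}
          import org.bson.Document

          val conf = new SparkConf()
            .setAppName("pipeline-partition-repro")
            .set("spark.mongodb.input.uri", "mongodb://localhost/test.events")
          val sc = new SparkContext(conf)

          // The pipeline is applied inside every partition's query, but the
          // default partitioner has already split the collection's full key
          // range, so partitions whose range holds no matching documents
          // come back empty.
          val rdd: MongoRDD[Document] = MongoSpark.load(sc)
            .withPipeline(Seq(Document.parse("{ $match: { status: 'rare' } }")))

          println(s"partitions: ${rdd.getNumPartitions}")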

      In the Spark WebUI of my application, I can see a lot of partitions being created and processed even though they are completely empty. My aggregation trims the expected document count from several hundred million down to just hundreds. The current default partitioner seems to ignore this and builds 242 partitions. See the attached images for clarification:

      After calling saveAsTextFile on the RDD, I can see that a lot of these partitions were in fact empty:

      I am using Spark Connector version 1.1.0 because we run Spark 1.6.1, but this feature is also not available in 2.0.0.
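
      Until a pipeline-aware partitioner exists, one possible workaround is to collapse the mostly-empty partitions before writing. This is only a sketch building on the RDD above; the target partition count and output path are arbitrary examples:

          // coalesce avoids a shuffle and stops saveAsTextFile from creating
          // hundreds of empty part files for the empty partitions.
          val compacted = rdd.coalesce(4)
          compacted.map(_.toJson).saveAsTextFile("hdfs:///out/matched-documents")

      The connector also exposes a spark.mongodb.input.partitioner setting for choosing among its built-in partitioners, but none of them consult the pipeline, which is exactly what this ticket requests.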

        Attachments:
          1. mongo-spark3.PNG (38 kB)
          2. partitions.PNG (82 kB)

            Assignee: Unassigned
            Reporter: j9dy F H
            Votes: 0
            Watchers: 2
