Type: Bug
Resolution: Gone away
Priority: Unknown
Fix Version/s: None
Affects Version/s: None
Component/s: None
Labels: None
Assigned Teams: Java Drivers
Aha! Reference: None
Tracking Level: None
Risk Status: None
Exec Notes: None
Goal Link: None
Goal Name(s): None

On a sharded MongoDB 4.0.x cluster, running COLL_STATS_AGGREGATION_PIPELINE returns multiple stats records, one for each shard.

This is problematic when partitioners (SamplePartitioners ...) use this method to create input partitions. Because only the document count of a single shard is used to compute the number of input partitions, the files written by mongo-spark seem to be either too large or too small.

The problematic code seems to be here:

https://github.com/mongodb/mongo-spark/blob/r10.2.1/src/main/java/com/mongodb/spark/sql/connector/read/partitioner/PartitionerHelper.java#L98-L105
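To illustrate the mismatch described above, here is a minimal sketch of summing per-shard stats into one record before any partition math is done. This is not the connector's code: `ShardStatsMerge`, `mergeShardStats`, and the plain-map representation of a stats record are hypothetical, and the sketch assumes each shard's record carries numeric `count` and `size` fields.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public final class ShardStatsMerge {

    // Hypothetical helper: sums the "count" and "size" fields across all
    // per-shard stats records instead of reading only the first record.
    static Map<String, Long> mergeShardStats(final List<Map<String, Long>> perShardStats) {
        Map<String, Long> merged = new LinkedHashMap<>();
        merged.put("count", 0L);
        merged.put("size", 0L);
        for (Map<String, Long> shard : perShardStats) {
            for (String field : List.of("count", "size")) {
                // A shard missing the field contributes 0 (an assumption of this sketch).
                merged.merge(field, shard.getOrDefault(field, 0L), Long::sum);
            }
        }
        return merged;
    }

    public static void main(String[] args) {
        // Made-up numbers for two shards, for illustration only.
        List<Map<String, Long>> perShard = List.of(
                Map.of("count", 1_000L, "size", 64_000L),
                Map.of("count", 3_000L, "size", 192_000L));

        Map<String, Long> total = mergeShardStats(perShard);
        System.out.println(total.get("count")); // prints 4000
        System.out.println(total.get("size"));  // prints 256000
    }
}
```

With these made-up numbers, using only the first record would give a document count of 1000 instead of 4000, which would skew the number of partitions in the way the report describes.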

I know that MongoDB 4.0.x is end of life, but I'm raising this because the connector says it supports version 4.0 or later.

I tested with versions 4.2 and 4.4 and there were no problems.


image-2024-02-16-07-49-10-851.png
78 kB
Feb 16 2024 12:49:13 AM UTC

Assignee: Prakul Agarwal
Reporter: Phong Bui
Reviewers: None
Votes: 0
Watchers: 5

Created: Feb 16 2024 12:54:33 AM UTC
Updated: Apr 25 2024 12:02:11 PM UTC
Resolved: Apr 25 2024 12:02:09 PM UTC
