- Type: Bug
- Resolution: Gone away
- Priority: Unknown
- None
- Affects Version/s: None
- Component/s: None
- Labels: None
- Java Drivers
On a sharded MongoDB 4.0.x cluster, running COLL_STATS_AGGREGATION_PIPELINE returns multiple stats records, one for each shard, rather than a single record.
This is problematic when partitioners (SamplePartitioners ...) use this method to create input partitions. Because only the document count of a single shard is used to determine the number of input partitions, the files written by mongo-spark end up either too large or too small.
The problematic code seems to be here:
I know that MongoDB 4.0.x is end of life, but I'm raising this because the connector says it supports version 4.0 or later.
I tested with versions 4.2 and 4.4 and there were no problems.
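To illustrate the effect described above, here is a minimal sketch of the difference between reading only the first per-shard record and combining all of them. It is not the connector's code. The helper name `combine_shard_stats` and the sample values are hypothetical, and the record shape (a `shard` name plus a `storageStats` sub-document with `count` and `size`) is assumed from the `$collStats` output format.

```python
from typing import Any, Dict, Iterable


def combine_shard_stats(per_shard: Iterable[Dict[str, Any]]) -> Dict[str, float]:
    """Sum count and size over all per-shard records (hypothetical helper)."""
    total_count = 0
    total_size = 0
    for doc in per_shard:
        # Assumed shape: {"shard": ..., "storageStats": {"count": ..., "size": ...}}
        stats = doc.get("storageStats", {})
        total_count += stats.get("count", 0)
        total_size += stats.get("size", 0)
    # Recompute the average object size from the combined totals.
    avg_obj_size = total_size / total_count if total_count else 0
    return {"count": total_count, "size": total_size, "avgObjSize": avg_obj_size}


# Made-up sample data: two shards with uneven document counts.
sample = [
    {"shard": "shardA", "storageStats": {"count": 100, "size": 10_000}},
    {"shard": "shardB", "storageStats": {"count": 300, "size": 60_000}},
]

first_only = sample[0]["storageStats"]   # what a "take the first record" read sees
combined = combine_shard_stats(sample)   # collection-wide totals
```

With this sample, the first record alone reports a quarter of the collection's documents, so any partition count derived from it would be off by the same factor.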