  Spark Connector / SPARK-423

CollStats Aggregation Pipeline doesn't return aggregated stats for Sharded Mongo 4.0.x

    • Type: Bug
    • Resolution: Gone away
    • Priority: Unknown
    • None
    • Affects Version/s: None
    • Component/s: None
    • Labels:
      None
    • Java Drivers

      On a sharded MongoDB 4.0.x cluster, running COLL_STATS_AGGREGATION_PIPELINE returns a separate stats record for each shard instead of a single aggregated record.

      This is problematic when partitioners (SamplePartitioners ...) use this method to create input partitions. Because only the document count of a single shard is used to work out the number of input partitions, the files output by mongo-spark seem to be either too large or too small.

      The problematic code seems to be here:

      https://github.com/mongodb/mongo-spark/blob/r10.2.1/src/main/java/com/mongodb/spark/sql/connector/read/partitioner/PartitionerHelper.java#L98-L105
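
      To illustrate the effect described above, here is a minimal sketch in plain Java. It is not the connector's code: the class ShardStatsSketch, its helper methods, the shard names, the per-shard counts and the docsPerPartition value are all hypothetical. It only assumes that each shard contributes one record with a document count, and compares sizing partitions from the first record alone against sizing them from the sum across shards.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, NOT the connector's implementation.
// Names, counts and the partition sizing rule are assumptions for illustration.
final class ShardStatsSketch {

    // One stats record per shard, as the report describes for sharded 4.0.x.
    static final class ShardStats {
        final String shard;
        final long count;

        ShardStats(String shard, long count) {
            this.shard = shard;
            this.count = count;
        }
    }

    // Behaviour described in the report: only the first record is used.
    static long firstShardCount(List<ShardStats> stats) {
        return stats.isEmpty() ? 0L : stats.get(0).count;
    }

    // One possible alternative: sum the document counts of all shards.
    static long totalCount(List<ShardStats> stats) {
        long total = 0L;
        for (ShardStats s : stats) {
            total += s.count;
        }
        return total;
    }

    // Ceiling division: partitions needed for a target documents-per-partition.
    static int partitions(long docCount, long docsPerPartition) {
        if (docsPerPartition <= 0) {
            throw new IllegalArgumentException("docsPerPartition must be positive");
        }
        return (int) ((docCount + docsPerPartition - 1) / docsPerPartition);
    }

    public static void main(String[] args) {
        List<ShardStats> stats = Arrays.asList(
                new ShardStats("shard0", 100_000L),
                new ShardStats("shard1", 250_000L),
                new ShardStats("shard2", 250_000L));
        long docsPerPartition = 50_000L; // made-up target size

        System.out.println("first shard only: "
                + partitions(firstShardCount(stats), docsPerPartition));
        System.out.println("summed over shards: "
                + partitions(totalCount(stats), docsPerPartition));
    }
}
```

      With these made-up numbers the first-record approach plans 2 partitions for 600,000 documents, while the summed count plans 12, which would be consistent with the oversized output files mentioned above.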

      I know that MongoDB 4.0.x is end of life, but I'm raising this because the connector says it supports version 4.0 or later.

      I tested with versions 4.2 and 4.4 and there were no problems.

            Assignee:
            prakul.agarwal@mongodb.com Prakul Agarwal
            Reporter:
            boush.phong@gmail.com Phong Bui
            Votes:
            0
            Watchers:
            5

              Created:
              Updated:
              Resolved: