Spark Connector / SPARK-376

Use the schema to automatically project fields

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Unknown
    • Fix Version/s: 10.1.0
    • Affects Version/s: None
    • Component/s: None

      Pre-10.x, the Spark connector had two configurations:

      • sql.pipeline.includeNullFilters - Includes null filters in the aggregation pipeline.
      • sql.pipeline.includeFiltersAndProjections - Includes any filters and projections in the aggregation pipeline.

      Both defaulted to true, making the connector automatically filter out any documents that were missing required fields or contained null values. It also added a projection to include only the fields in the schema.
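As an illustration, the stages those two configurations conceptually produced can be sketched as plain aggregation-stage documents. This is not the connector's actual code; the helper names and schema fields below are illustrative only.

```python
# Hypothetical sketch of the pipeline stages the pre-10.x configurations
# conceptually generated for a given schema. Field names are illustrative.

def null_filter_stage(fields):
    # sql.pipeline.includeNullFilters: match only documents where every
    # schema field exists and is not null. (In MongoDB, {"$ne": None} also
    # excludes missing fields; "$exists" is included here for clarity.)
    return {"$match": {f: {"$exists": True, "$ne": None} for f in fields}}

def projection_stage(fields):
    # sql.pipeline.includeFiltersAndProjections: project only schema fields,
    # reducing the data sent across the wire.
    return {"$project": {f: 1 for f in fields}}

schema_fields = ["name", "age"]  # illustrative schema
pipeline = [null_filter_stage(schema_fields), projection_stage(schema_fields)]
```

A pipeline like this would be prepended to whatever the user's query generated, so only complete, schema-shaped documents reached Spark.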

      I think for 10.1.0 an equivalent of the sql.pipeline.includeFiltersAndProjections configuration should be added. This reduces the data sent across the wire and ensures that users don't get a Missing field DataException.

      Testing against null values (e.g. sql.pipeline.includeNullFilters) is not efficient, as it requires a $ne lookup on the field, which can significantly impact read performance due to its poor query selectivity - so it shouldn't be ported with a default of on.

      Having discussed this, silently skipping `null` data / missing fields is not appropriate, as it hides potential data issues - the user should explicitly filter them out. This also ensures that ds.count() (which doesn't use a schema) always matches the number of results returned from the dataset.
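For users who still want the old skipping behaviour, an explicit pipeline can be supplied instead. The sketch below builds such a filter as plain documents; the `aggregation.pipeline` read-option name shown in the comment is an assumption about the 10.x connector, not confirmed by this ticket.

```python
import json

# Sketch: a user who wants the pre-10.x null-skipping behaviour supplies the
# filter explicitly, rather than relying on the connector to inject it.
# The field name "age" is illustrative.
user_pipeline = [{"$match": {"age": {"$ne": None}}}]

# Serialise to JSON so it can be passed as a string option, e.g. (assumed
# option name for the 10.x connector):
#   spark.read.format("mongodb") \
#        .option("aggregation.pipeline", option_value) \
#        .load()
option_value = json.dumps(user_pipeline)
```

Making the filter explicit keeps data problems visible: any null or missing values that reach Spark were deliberately allowed through.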

      Projections should be automatically applied as there is no reason to send extra data.


            Assignee: Ross Lawley (ross@mongodb.com)
            Reporter: Ross Lawley (ross@mongodb.com)
            Votes: 0
            Watchers: 2
