- Type: Improvement
- Resolution: Fixed
- Priority: Unknown
- Affects Version/s: None
- Component/s: None

(copied to CRM)
Pre 10.x the Spark connector had two configurations:
- `sql.pipeline.includeNullFilters` - includes null filters in the aggregation pipeline.
- `sql.pipeline.includeFiltersAndProjections` - includes any filters and projections in the aggregation pipeline.
Both defaulted to true, so the connector automatically filtered out any documents that were missing a required field or contained null values, and added a projection so that only the fields in the schema were returned.
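As a rough illustration of the combined effect of those two options (the helper below is a hypothetical sketch, not actual connector code), the generated aggregation pipeline amounted to a `$match` that drops missing/null required fields plus a `$project` limited to the schema's fields:

```python
# Hypothetical sketch of the pipeline the pre-10.x connector produced
# when both options were true; field names are illustrative only.

def build_pipeline(schema_fields):
    """Build a $match dropping documents whose required fields are
    missing or null, plus a $project limited to the schema fields."""
    match = {field: {"$exists": True, "$ne": None} for field in schema_fields}
    project = {field: 1 for field in schema_fields}
    return [{"$match": match}, {"$project": project}]

pipeline = build_pipeline(["name", "age"])
print(pipeline)
```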
I think for 10.1.0 an equivalent of the `sql.pipeline.includeFiltersAndProjections` configuration should be added. It reduces the data sent across the wire and ensures that users don't hit a "Missing field" `DataException`.
Testing against null values (i.e. `sql.pipeline.includeNullFilters`) is not efficient: it requires a `$ne` match on each field, and `$ne` queries have poor query selectivity, which can significantly impact read performance. So that behaviour shouldn't be ported with a default of on.
Having discussed this, silently skipping `null` data / missing fields is not appropriate, as it hides potential data issues which the user should explicitly filter out. This also ensures that `ds.count()` (which doesn't provide a schema) always matches the number of results returned from the dataset.
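To make the count argument concrete, here is a plain-Python sketch (sample documents invented here): with silent skipping, a schema-driven read returns fewer rows than an unfiltered `ds.count()`, whereas an explicit user filter applies to both sides and keeps them in agreement.

```python
# Invented sample documents; "age" is the required schema field.
docs = [
    {"name": "a", "age": 30},
    {"name": "b", "age": None},   # null value
    {"name": "c"},                # field missing entirely
]

def explicit_filter(documents, field):
    """The filter a user would apply explicitly, mirroring
    {field: {"$exists": true, "$ne": null}} in an aggregation $match."""
    return [d for d in documents if d.get(field) is not None]

# Silent skipping would return 1 row while an unfiltered count reports 3;
# an explicit filter makes the discrepancy visible and intentional.
filtered = explicit_filter(docs, "age")
print(len(docs), len(filtered))
```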
Projections should still be automatically applied, as there is no reason to send extra data across the wire.
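Unlike the null filters, a schema-derived projection only trims fields and cannot change row counts, so applying it automatically is safe. A minimal sketch (the helper name and the `_id` handling are assumptions, not connector code):

```python
def schema_projection(schema_fields):
    """Build a $project stage including only the schema's fields,
    suppressing _id unless the schema explicitly asks for it
    (an assumption for this sketch)."""
    stage = {field: 1 for field in schema_fields}
    if "_id" not in schema_fields:
        stage["_id"] = 0
    return {"$project": stage}

print(schema_projection(["name", "age"]))
```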
See: