Spark Connector / SPARK-376

Use the schema to automatically project fields

    • Type: Improvement
    • Resolution: Fixed
    • Priority: Unknown
    • Fix Version/s: 10.1.0
    • Affects Version/s: None
    • Component/s: None

      Pre-10.x, the Spark connector had two configurations:

      • sql.pipeline.includeNullFilters - Includes null filters in the aggregation pipeline.
      • sql.pipeline.includeFiltersAndProjections - Includes any filters and projections in the aggregation pipeline.

      Both defaulted to true, making the connector automatically filter out any documents that were missing required fields or contained null values. It also added a projection to include only the fields in the schema.
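As an illustration, the stages those two configurations conceptually produced can be sketched as plain aggregation-stage documents. This is not the connector's actual code; the helper names and schema fields below are illustrative only.

```python
# Hypothetical sketch of the pipeline stages the pre-10.x configurations
# conceptually generated for a given schema. Field names are illustrative.

def null_filter_stage(fields):
    # sql.pipeline.includeNullFilters: match only documents where every
    # schema field exists and is not null. (In MongoDB, {"$ne": None} also
    # excludes missing fields; "$exists" is included here for clarity.)
    return {"$match": {f: {"$exists": True, "$ne": None} for f in fields}}

def projection_stage(fields):
    # sql.pipeline.includeFiltersAndProjections: project only schema fields,
    # reducing the data sent across the wire.
    return {"$project": {f: 1 for f in fields}}

schema_fields = ["name", "age"]  # illustrative schema
pipeline = [null_filter_stage(schema_fields), projection_stage(schema_fields)]
```

A pipeline like this would be prepended to whatever the user's query generated, so only complete, schema-shaped documents reached Spark.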

      I think for 10.1.0 an equivalent of the sql.pipeline.includeFiltersAndProjections configuration should be added. This reduces the data sent across the wire and ensures that users don't get a Missing field DataException.

      Testing against null values (e.g. sql.pipeline.includeNullFilters) is not efficient, as it requires a $ne lookup on the field, which can significantly impact read performance due to its poor query selectivity - so it shouldn't be ported with a default of on.

      Having discussed this, silently skipping `null` data / missing fields is not appropriate, as it hides potential data issues - the user should explicitly filter them out. This also ensures that ds.count() (which doesn't use a schema) always matches the number of results returned from the dataset.
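For users who still want the old skipping behaviour, an explicit pipeline can be supplied instead. The sketch below builds such a filter as plain documents; the `aggregation.pipeline` read-option name shown in the comment is an assumption about the 10.x connector, not confirmed by this ticket.

```python
import json

# Sketch: a user who wants the pre-10.x null-skipping behaviour supplies the
# filter explicitly, rather than relying on the connector to inject it.
# The field name "age" is illustrative.
user_pipeline = [{"$match": {"age": {"$ne": None}}}]

# Serialise to JSON so it can be passed as a string option, e.g. (assumed
# option name for the 10.x connector):
#   spark.read.format("mongodb") \
#        .option("aggregation.pipeline", option_value) \
#        .load()
option_value = json.dumps(user_pipeline)
```

Making the filter explicit keeps data problems visible: any null or missing values that reach Spark were deliberately allowed through.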

      Projections should be automatically applied as there is no reason to send extra data.


            Assignee: Ross Lawley (ross@mongodb.com)
            Reporter: Ross Lawley (ross@mongodb.com)
            Votes: 0
            Watchers: 2
