- Type: Bug
- Resolution: Gone away
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: None
I have a dataset that I load in a notebook with PySpark, using the MongoDB connector.
When I try to filter the data with .isNotNull() on a field inside an embedded object, it doesn't work: I keep getting all entries, even the ones where the field is null.
I load the data like this:
from pyspark.sql import functions as F  # for the F.col alias used below

df_assets = (
    spark.read
    .format("mongodb")
    .option("connection.uri", f"mongodb+srv://{mongodb_username}:{mongodb_password}@{mongodb_hostname}")
    .option("database", "fleet-db")
    .option("collection", "assets")
    .load()
    # .cache()  # adding this line "fixes" the problem
)
and then if I try something like
display(df_assets.filter(F.col("status.roboticMowerStatus").isNotNull()))
I just keep getting all the rows.
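For what it's worth, the physical plan shows whether the predicate is pushed down to the source or evaluated by Spark itself; it can be inspected with the standard explain API (the exact plan text depends on the connector version):

# Inspect the physical plan to see where the isNotNull predicate is evaluated
# (pushed down to the MongoDB source vs. applied as a Spark Filter node).
df_assets.filter(F.col("status.roboticMowerStatus").isNotNull()).explain(True)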
If I add a ".cache()" after the ".load()" where I load the data, the filtering is done in Databricks (by Spark itself) and then it works; see the sketch below.
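A minimal sketch of that workaround, assuming the same df_assets as above:

# Workaround: caching materializes the data in Spark, so the filter is
# evaluated by Spark over the cached rows instead of being pushed down
# to the MongoDB source.
df_cached = df_assets.cache()
display(df_cached.filter(F.col("status.roboticMowerStatus").isNotNull()))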
the "status" object is a object with several string attributes, example of a simplified asset object:
{ _id: "uuid", status: { roboticMowerStatus: "MOWING", batteryHealthStatus: "OK", inventoryStatus: null } }
All the attributes inside the "status" object can be null in some documents.
Connector version: org.mongodb.spark:mongo-spark-connector_2.12:10.1.0
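A possible alternative workaround, assuming the 10.x connector's "aggregation.pipeline" read option (option name taken from the connector documentation, not verified against 10.1.0 here): filter the nulls out server-side with a $match stage instead of relying on Spark filter pushdown:

df_assets_nonnull = (
    spark.read
    .format("mongodb")
    .option("connection.uri", f"mongodb+srv://{mongodb_username}:{mongodb_password}@{mongodb_hostname}")
    .option("database", "fleet-db")
    .option("collection", "assets")
    # Assumed option name; applies the pipeline in MongoDB before rows reach Spark.
    # {"$ne": null} matches only documents where the field exists and is not null.
    .option("aggregation.pipeline", '[{"$match": {"status.roboticMowerStatus": {"$ne": null}}}]')
    .load()
)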