- Type: Bug
- Resolution: Gone away
- Priority: Minor - P4
- Affects Version/s: None
- Component/s: None
I have a dataset that I load in a notebook with PySpark, using the MongoDB connector.
When I try to filter the data with .isNotNull() on a field inside an embedded object, it doesn't work: I keep getting all entries, even the ones where the field is null.
I load the data like this:
from pyspark.sql import functions as F  # for the F.col alias used below

df_assets = (
    spark.read
    .format("mongodb")
    .option("connection.uri", f"mongodb+srv://{mongodb_username}:{mongodb_password}@{mongodb_hostname}")
    .option("database", "fleet-db")
    .option("collection", "assets")
    .load()
    # .cache()  # adding this line "fixes" the problem
)
and then if I try something like
display(df_assets.filter(F.col("status.roboticMowerStatus").isNotNull()))
I just keep getting all the rows.
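For what it's worth, the physical plan shows whether the predicate is pushed down to the source or evaluated by Spark itself; it can be inspected with the standard explain API (the exact plan text depends on the connector version):

# Inspect the physical plan to see where the isNotNull predicate is evaluated
# (pushed down to the MongoDB source vs. applied as a Spark Filter node).
df_assets.filter(F.col("status.roboticMowerStatus").isNotNull()).explain(True)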
If I add a ".cache()" after the ".load()" where I load the data, the filtering is done in Databricks (by Spark itself) and then it works; see the sketch below.
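A minimal sketch of that workaround, assuming the same df_assets as above:

# Workaround: caching materializes the data in Spark, so the filter is
# evaluated by Spark over the cached rows instead of being pushed down
# to the MongoDB source.
df_cached = df_assets.cache()
display(df_cached.filter(F.col("status.roboticMowerStatus").isNotNull()))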
the "status" object is a object with several string attributes, example of a simplified asset object:
{ _id: "uuid", status: { roboticMowerStatus: "MOWING", batteryHealthStatus: "OK", inventoryStatus: null } }
All the attributes inside the "status" object can be null in some documents.
Connector version: org.mongodb.spark:mongo-spark-connector_2.12:10.1.0
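A possible alternative workaround, assuming the 10.x connector's "aggregation.pipeline" read option (option name taken from the connector documentation, not verified against 10.1.0 here): filter the nulls out server-side with a $match stage instead of relying on Spark filter pushdown:

df_assets_nonnull = (
    spark.read
    .format("mongodb")
    .option("connection.uri", f"mongodb+srv://{mongodb_username}:{mongodb_password}@{mongodb_hostname}")
    .option("database", "fleet-db")
    .option("collection", "assets")
    # Assumed option name; applies the pipeline in MongoDB before rows reach Spark.
    # {"$ne": null} matches only documents where the field exists and is not null.
    .option("aggregation.pipeline", '[{"$match": {"status.roboticMowerStatus": {"$ne": null}}}]')
    .load()
)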