isNotNull queries not working


    • Type: Bug
    • Resolution: Gone away
    • Priority: Minor - P4
    • Affects Version/s: None
    • Component/s: None

      I have a dataset that I load in a notebook with PySpark, using the MongoDB connector.

      When I try to query the data with .isNotNull() on a field of an embedded object, it doesn't work: I keep getting all entries, even the ones where the field is null.

      I load the data like this: 

      df_assets = (
          spark.read
          .format("mongodb")
          .option("connection.uri", f"mongodb+srv://{mongodb_username}:{mongodb_password}@{mongodb_hostname}")
          .option("database", "fleet-db")
          .option("collection", "assets")
          .load()
          # .cache()  # adding this line "fixes" the problem
      )

      and then if I try something like 

      display(df_assets.filter(F.col("status.roboticMowerStatus").isNotNull())) 

      I just keep getting all the rows. 
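
      For reference, the full filter call looks roughly like this (assuming F is pyspark.sql.functions):

      from pyspark.sql import functions as F

      # Filter on the nested field. Expected: only rows where
      # status.roboticMowerStatus has a value. Actual: every row comes back,
      # including the ones where it is null.
      filtered = df_assets.filter(F.col("status.roboticMowerStatus").isNotNull())
      display(filtered)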

      If I add a ".cache()" after the ".load()" where I load the data, the filtering is done in Databricks (presumably instead of being pushed down to MongoDB) and then it works; see the sketch below.
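
      A minimal sketch of the workaround, assuming the same df_assets as above:

      # Caching materializes the DataFrame in Spark, so the isNotNull filter
      # is evaluated by Spark itself rather than handed to the connector.
      df_cached = df_assets.cache()
      display(df_cached.filter(F.col("status.roboticMowerStatus").isNotNull()))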

      the "status" object is a object with several string attributes, example of a simplified asset object: 

      {
        _id: "uuid",
        status: {
          roboticMowerStatus: "MOWING",
          batteryHealthStatus: "OK",
          inventoryStatus: null
        }
      }
      

      All the attributes inside the "status" object can be null in some documents.
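
      For comparison, a sketch of the equivalent query made directly with PyMongo (same placeholder credentials as in the load snippet); "$ne": None should match only documents where the field is present and not null, which is the result I expect .isNotNull() to give through the connector:

      from pymongo import MongoClient

      # Placeholder connection details, same cluster as in the load snippet.
      client = MongoClient(f"mongodb+srv://{mongodb_username}:{mongodb_password}@{mongodb_hostname}")
      assets = client["fleet-db"]["assets"]

      # "$ne": None matches only documents where the field exists and is not null.
      non_null_assets = list(assets.find({"status.roboticMowerStatus": {"$ne": None}}))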

      Connector version: org.mongodb.spark:mongo-spark-connector_2.12:10.1.0

              Assignee:
              Prakul Agarwal
              Reporter:
              Mats Ekroth
              Votes:
              0
              Watchers:
              4
