Spark Connector / SPARK-401

BSON Type support in Spark SQL queries

    • Type: Spec Change
    • Resolution: Unresolved
    • Priority: Unknown
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      We are using the MongoDB Spark Connector to replicate a MongoDB collection to Databricks Spark. However, when we retrieve the MongoDB documents in Spark, the BSON types are converted to strings, causing us to lose the type information. We tried using the 'OutputExtendedJson' option, which preserves the type information but introduces extra nesting, making it harder to query the data.
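
      For context, here is roughly how we read the collection. This is a minimal sketch, assuming the 10.x connector's "mongodb" format and the outputExtendedJson read option mentioned above; the URI, database, and collection names are placeholders:

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("bson-type-repro")
               .getOrCreate())

      df = (spark.read.format("mongodb")
            .option("connection.uri", "mongodb://localhost:27017")  # placeholder URI
            .option("database", "test")                             # placeholder name
            .option("collection", "events")                         # placeholder name
            .option("outputExtendedJson", "true")                   # keeps BSON types, but nests fields
            .load())

      df.printSchema()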

      We want to preserve the type information because it helps differentiate between documents with similar string representations but different types. For example, consider these two documents:

      Document 1:

      { "_id": ObjectId("6453d5034f28c0d9088645c7"), "Date": ISODate("2016-03-04T08:00:00.000") }

      Document 2:

      { "_id": "6453d5034f28c0d9088645c7", "Date": "2016-03-04T08:00:00.000" }
      Without preserving the type information, Spark would treat these two documents as the same, even though the "_id" field is an ObjectId in Document 1 and a string in Document 2.
      And when the type information is preserved, the documents become hard to query because every typed field is nested, as the sketch below shows.
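
      To illustrate why the nesting hurts: with the extended JSON output, each typed field appears to surface as a struct keyed by its extended JSON marker (for example "$oid" or "$date"), so every query has to reach into those structs. A hedged sketch of flattening them, assuming that struct shape and reusing the df from the read sketch above:

      # Assumption: outputExtendedJson surfaces BSON types as structs shaped like
      # MongoDB extended JSON, e.g. _id = {"$oid": "..."}, Date = {"$date": "..."}.
      # Field names containing "$" must be backtick-quoted in Spark SQL expressions.
      flat = df.selectExpr(
          "_id.`$oid` AS id",      # the ObjectId as its hex string
          "Date.`$date` AS date",  # the timestamp as an ISO-8601 string
      )
      flat.show(truncate=False)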

      We believe others must have encountered a similar issue. Any suggestions on how to handle this so that the data stays easy to query and we do not have to deal with a nested schema? Is there a way in Spark SQL to specify BSON types such as ObjectId?

            Assignee:
            Unassigned
            Reporter:
            Mahesh Ambule (mahesh.ambule@prodigaltech.com)
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: