Spark Connector / SPARK-401

BSON Type support in Spark SQL queries

    • Type: Spec Change
    • Resolution: Unresolved
    • Priority: Unknown
    • Fix Version/s: None
    • Affects Version/s: None
    • Component/s: None
    • Labels: None

      We are using the MongoDB Spark Connector to replicate a MongoDB collection to Databricks Spark. However, when we retrieve the MongoDB documents in Spark, the BSON types are converted to strings, causing us to lose the type information. We tried using the 'OutputExtendedJson' option, which preserves the type information but introduces extra nesting, making it harder to query the data.
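
      For context, here is roughly how we read the collection. This is a minimal sketch, assuming the 10.x connector's "mongodb" format and the outputExtendedJson read option mentioned above; the URI, database, and collection names are placeholders:

      from pyspark.sql import SparkSession

      spark = (SparkSession.builder
               .appName("bson-type-repro")
               .getOrCreate())

      df = (spark.read.format("mongodb")
            .option("connection.uri", "mongodb://localhost:27017")  # placeholder URI
            .option("database", "test")                             # placeholder name
            .option("collection", "events")                         # placeholder name
            .option("outputExtendedJson", "true")                   # keeps BSON types, but nests fields
            .load())

      df.printSchema()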

      We want to preserve the type information because it helps differentiate between documents with similar string representations but different types. For example, consider these two documents:

      Document 1:

      { "_id": ObjectId("6453d5034f28c0d9088645c7"), "Date": ISODate("2016-03-04T08:00:00.000") }

      Document 2:

      { "_id": "6453d5034f28c0d9088645c7", "Date": "2016-03-04T08:00:00.000" }
      Without preserving the type information, Spark would treat these two documents as the same, even though the "_id" field is an ObjectId in Document 1 and a string in Document 2.
      And when the type information is preserved, the documents become hard to query because every typed field is nested, as the sketch below shows.
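
      To illustrate why the nesting hurts: with the extended JSON output, each typed field appears to surface as a struct keyed by its extended JSON marker (for example "$oid" or "$date"), so every query has to reach into those structs. A hedged sketch of flattening them, assuming that struct shape and reusing the df from the read sketch above:

      # Assumption: outputExtendedJson surfaces BSON types as structs shaped like
      # MongoDB extended JSON, e.g. _id = {"$oid": "..."}, Date = {"$date": "..."}.
      # Field names containing "$" must be backtick-quoted in Spark SQL expressions.
      flat = df.selectExpr(
          "_id.`$oid` AS id",      # the ObjectId as its hex string
          "Date.`$date` AS date",  # the timestamp as an ISO-8601 string
      )
      flat.show(truncate=False)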

      We believe others must have encountered a similar issue. Any suggestions on how to handle this so that the data stays easy to query and we do not have to deal with a nested schema? Is there a way in Spark SQL to specify BSON types such as ObjectId?

            Assignee:
            Unassigned
            Reporter:
            Mahesh Ambule (mahesh.ambule@prodigaltech.com)
            Votes:
            0
            Watchers:
            2

              Created:
              Updated: