  1. Spark Connector
  2. SPARK-327

Support for handling Corrupt/Bad records on spark read


Details

    • Type: New Feature
    • Status: Backlog
    • Priority: Major - P3
    • Resolution: Unresolved
    • Affects Version/s: 3.0.1
    • Fix Version/s: None
    • Component/s: Reads, Schema

    Description

      Summary

      A MongoTypeConversionException is thrown during a Spark read when a large collection contains bad or corrupt fields. Adding support for modes such as Permissive or DropMalformed as Mongo Spark options would allow a MongoSpark read to complete successfully.
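
      To illustrate the requested behaviour, here is a minimal sketch in plain Python, not connector code. It assumes the modes would behave like the PERMISSIVE, DROPMALFORMED and FAILFAST modes of Spark's built-in JSON and CSV readers. The names SCHEMA, DOCS, convert, read and TypeConversionError are hypothetical, and "failfast" is included only to represent the current behaviour. As a simplification, permissive mode here nulls every field of a bad record and keeps the raw document in a `_corrupt_record` column.

```python
from typing import Any, Dict, Iterable, List

# Hypothetical explicit schema: field name -> expected Python type.
SCHEMA: Dict[str, type] = {"_id": int, "name": str, "age": int}

# Hypothetical collection contents; the second document has a bad "age".
DOCS: List[Dict[str, Any]] = [
    {"_id": 1, "name": "alice", "age": 30},
    {"_id": 2, "name": "bob", "age": "unknown"},
    {"_id": 3, "name": "carol", "age": 41},
]

VALID_MODES = ("failfast", "dropmalformed", "permissive")


class TypeConversionError(Exception):
    """Stand-in for the connector's MongoTypeConversionException."""


def convert(doc: Dict[str, Any], schema: Dict[str, type]) -> Dict[str, Any]:
    """Convert one document to a row, raising on a type mismatch."""
    row: Dict[str, Any] = {}
    for field, expected in schema.items():
        value = doc.get(field)  # a missing field becomes None (null)
        if value is not None and not isinstance(value, expected):
            raise TypeConversionError(
                f"Cannot cast {value!r} in field {field!r} to {expected.__name__}"
            )
        row[field] = value
    return row


def read(
    docs: Iterable[Dict[str, Any]],
    schema: Dict[str, type],
    mode: str = "failfast",
    corrupt_column: str = "_corrupt_record",
) -> List[Dict[str, Any]]:
    """Read documents against a schema under the given (assumed) mode."""
    if mode not in VALID_MODES:
        raise ValueError(f"Unknown mode: {mode!r}")
    rows: List[Dict[str, Any]] = []
    for doc in docs:
        try:
            row = convert(doc, schema)
        except TypeConversionError:
            if mode == "failfast":
                raise  # current behaviour: one bad record aborts the read
            if mode == "dropmalformed":
                continue  # skip the bad record and keep reading
            # permissive: null out the fields, keep the raw document
            row = {field: None for field in schema}
            row[corrupt_column] = repr(doc)
        else:
            if mode == "permissive":
                row[corrupt_column] = None
        rows.append(row)
    return rows
```

      With DOCS as above, `read(DOCS, SCHEMA, "dropmalformed")` returns only the two well-formed rows, `read(DOCS, SCHEMA, "permissive")` returns all three rows with the bad document preserved in `_corrupt_record`, and `read(DOCS, SCHEMA, "failfast")` raises on the second document.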

      Motivation

      Who is the affected end user?

      Big data management companies

      How does this affect the end user?

      The DataFrame read operation fails when corrupt records are present.

      How likely is it that this problem or use case will occur?

      In any very large Mongo collection holding unstructured data, where scanning the entire collection to infer the schema adds performance overhead.

      Whenever an explicit schema is passed during a Spark DataFrame read.

      If the problem does occur, what are the consequences and how severe are they?

      Failure: the Spark read fails with MongoTypeConversionException even when only one record in a collection of thousands of rows is corrupt.

      Is this issue urgent?

      Yes. This feature would prevent read operations from breaking.

      Is this ticket required by a downstream team?

      Needed by e.g. Atlas, Shell, Compass?

      Is this ticket only for tests?

      No

      Cast of Characters

      Engineering Lead:
      Document Author:
      POCers:
      Product Owner:
      Program Manager:
      Stakeholders:

      Channels & Docs

      Slack Channel

      [Scope Document|some.url]

      [Technical Design Document|some.url]

          People

            Assignee: Unassigned
            Reporter: Santhosh Suresh (santhoshsuresh95@gmail.com)