Priority: Major - P3
Resolution: Works as Designed
Affects Version/s: 2.4.1
Fix Version/s: None
Environment: AWS EMR 5.29.0, application: Spark
Python 3.6.8 (default, Oct 14 2019, 21:22:53)
[GCC 4.8.5 20150623 (Red Hat 4.8.5-28)] on linux
Despite reading the schema with samplingRatio 1.0, and even setting sampleSize to the total number of documents in the pipeline, the MongoDB Spark connector infers the schema incorrectly and throws a cast exception.
The error message does not even indicate which field or document caused the failure.
How I launch the pyspark shell on AWS EMR 5.29.0 with the Spark application:
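The original launch command appears to be missing from the report; a minimal sketch, assuming the Spark 2.4.x / Scala 2.11 build that ships with EMR 5.29.0 and connector version 2.4.1, would look like:

```shell
# Hypothetical invocation; adjust the connector coordinates to match
# the Spark/Scala version on the cluster (EMR 5.29.0 = Spark 2.4.4 / Scala 2.11).
pyspark --packages org.mongodb.spark:mongo-spark-connector_2.11:2.4.1
```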
How I am reading data:
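The read snippet itself appears to be missing from the report; a minimal sketch, with placeholder URI, database, and collection names, of a read that sets the sampling options the report mentions:

```python
# Hypothetical read options; uri/database/collection are placeholders.
read_options = {
    "uri": "mongodb://localhost:27017",  # placeholder URI
    "database": "mydb",                  # placeholder database
    "collection": "mycollection",        # placeholder collection
    "sampleSize": "1000000",             # set to the total document count
    "samplingRatio": "1.0",
}

# With a live SparkSession and a reachable MongoDB this would be:
# df = spark.read.format("mongo").options(**read_options).load()
```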
After that, running a simple action on the DataFrame throws this error.
In this case, the error was in an integer field inside a nested structure, so I flattened the DataFrame and dropped all integer columns, like this:
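The flattening code appears to be missing from the report; a minimal sketch of the idea, using a plain dict as a hypothetical stand-in for the nested DataFrame schema (the real code would walk `df.schema`):

```python
def flatten_paths(schema, prefix=""):
    """Recursively collect (dotted-path, type) pairs from a nested schema.

    `schema` is a plain dict standing in for a DataFrame schema:
    leaf values are type names, nested dicts are struct fields.
    """
    paths = []
    for name, dtype in schema.items():
        full = f"{prefix}.{name}" if prefix else name
        if isinstance(dtype, dict):
            paths.extend(flatten_paths(dtype, full))
        else:
            paths.append((full, dtype))
    return paths

# Example nested schema (hypothetical):
schema = {"name": "string", "stats": {"count": "integer", "label": "string"}}

# Keep only the non-integer leaves, as the report describes:
kept = [p for p, t in flatten_paths(schema) if t != "integer"]
# → ["name", "stats.label"]

# With pyspark this would translate to something like:
# df.select([col(p).alias(p.replace(".", "_")) for p in kept])
```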
but it still throws the same error.