- Type: Bug
- Resolution: Works as Designed
- Priority: Major - P3
- Affects Version/s: 2.2.0
- Component/s: API
- Environment: pyspark 2.2.0
It seems to be impossible to get aggregation queries working against ISODate fields with the Python API. E.g.:

... .option("pipeline", [{"$match": {"time": {"$gt": some_value}}}])
An attempt to use native datetime.datetime objects fails with a BSON error. E.g.:

... .option("pipeline", [{"$match": {"time": {"$gt": datetime.datetime(2017, 8, 7, 17, 42, 0)}}}])

Py4JJavaError: An error occurred while calling o487.load.
: org.bson.json.JsonParseException: JSON reader was expecting a value but found 'datetime'.
    at org.bson.json.JsonReader.readBsonType(JsonReader.java:237)
    at org.bson.codecs.BsonDocumentCodec.decode(BsonDocumentCodec.java:82)
    at org.bson.codecs.BsonDocumentCodec.decode(BsonDocumentCodec.java:41)
    at org.bson.codecs.configuration.LazyCodec.decode(LazyCodec.java:47)
    at org.bson.codecs.BsonDocumentCodec.readValue(BsonDocumentCodec.java:101)
    at org.bson.codecs.BsonDocumentCodec.decode(BsonDocumentCodec.java:84)
Nevertheless, the ISODate fields do seem to be properly marshaled from BSON to datetime.datetime in RDDs. The expected behavior is a proper datetime.datetime <=> BSON conversion in both directions (the way it is done in pymongo), which is why I classify this issue as a bug. Otherwise, please let me know if there is an official way to deal with datetime objects in queries.
Hoping for some sort of schema-based conversion magic to happen, I also tried numeric timestamps (like 1483246800000) and ISO-string values, with no success.
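Concretely, those attempts looked roughly like this (the field name and values are illustrative):

... .option("pipeline", [{"$match": {"time": {"$gt": 1483246800000}}}])
... .option("pipeline", [{"$match": {"time": {"$gt": "2017-08-07T17:42:00Z"}}}])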
I would like to find a workaround for this problem before an official solution is released, if possible. Is there any hack, e.g. creating DateTime objects via Py4J or similar? Any help is appreciated!
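For example, would something along these lines work? This is only a sketch, under the assumption that the pipeline option is parsed as MongoDB extended JSON, so that a BSON date could be written in the {"$date": <epoch millis>} form instead of as a Python object (it reuses the spark session and the placeholder field name from the reproduction above):

import calendar
import datetime
import json

# Assumption: the "pipeline" option is parsed as MongoDB extended JSON, so a BSON date
# can be expressed as {"$date": <epoch milliseconds>}; "time" is a placeholder field.
some_value = datetime.datetime(2017, 8, 7, 17, 42, 0)          # interpreted as UTC below
epoch_millis = calendar.timegm(some_value.timetuple()) * 1000  # naive datetime -> UTC millis

pipeline = json.dumps([{"$match": {"time": {"$gt": {"$date": epoch_millis}}}}])

df = (spark.read
      .format("com.mongodb.spark.sql.DefaultSource")
      .option("pipeline", pipeline)
      .load())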