I want to use the MongoDB Spark Connector with a Spark cluster and a sharded MongoDB deployment.
According to the Spark connector documentation, it has a data locality awareness feature, described as: "The Spark connector is aware of which MongoDB partitions are storing data."
Question 1: What is the expected behavior of this feature? In my test environment, two machines each host a MongoDB shard, and the Spark cluster runs on the same two machines. When I create an RDD with JavaMongoRDD&lt;Document&gt; items = MongoSpark.load(sc, readConfig);
I find that the RDD partitions are not always processed by the worker on the same machine as the shard holding the data; the placement appears to be random.
Question 2: How do I configure this feature? I searched for solutions and tested the configurations below, but neither works:
1. readOverrides.put("readpreference.name", "nearest"); — this appears to be intended for routing reads within a replica set (e.g. failover scenarios), not for Spark task locality.
2. readOverrides.put("partitioner", "MongoShardedPartitioner"); — this did not work either.
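For context, here is the full combination of settings I am experimenting with, written as spark-defaults.conf properties. The option names are my reading of the connector 2.x documentation, the hostnames and database names are placeholders, and the spark.locality.wait value is just a guess to give the scheduler more time to find a node-local slot — none of these are confirmed fixes:

```
# Option names assume MongoDB Spark Connector 2.x; adjust for your version.
# Placeholder URI — point this at a local mongos/shard on each worker.
spark.mongodb.input.uri=mongodb://host1:27017/mydb.items
# Use the sharded partitioner so partitions map to shard chunks.
spark.mongodb.input.partitioner=MongoShardedPartitioner
# Prefer the nearest member when reading.
spark.mongodb.input.readPreference.name=nearest
# Let the scheduler wait longer for a node-local executor (default is 3s).
spark.locality.wait=10s
```

Even with these set, I still see tasks reading from the shard on the other machine, which is why I am asking what the correct configuration is.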