[DOCS-12256] Spark with YARN and Kerberos Created: 03/Jan/18 Updated: 07/Jul/23 Resolved: 07/Jul/23 |
|
| Status: | Closed |
| Project: | Documentation |
| Component/s: | Spark Connector |
| Affects Version/s: | None |
| Fix Version/s: | None |
| Type: | Task | Priority: | Major - P3 |
| Reporter: | Danny Hatcher (Inactive) | Assignee: | Unassigned |
| Resolution: | Won't Do | Votes: | 0 |
| Labels: | None | ||
| Remaining Estimate: | Not Specified | ||
| Time Spent: | Not Specified | ||
| Original Estimate: | Not Specified | ||
| Participants: | |
| Days since reply: | 45 weeks ago |
| Epic Link: | DOCSP-6205 |
| Description |
|
What are the steps to get Spark to work with YARN cluster mode and Kerberos? (It's fine if this doesn't exist anywhere yet; I'm creating it as a tracking ticket for myself to continue investigating when I get a chance.) |
| Comments |
| Comment by Prakul Agarwal [ 30/Mar/23 ] | ||||||
|
chinmay.jog@bnymellon.com Did this help resolve your question? We can also get on a call to go over your use case and dive deeper into the issue. | ||||||
| Comment by Prakul Agarwal [ 07/Mar/23 ] | ||||||
|
chinmay.jog@bnymellon.com The MongoDB Spark Connector doesn't have any configuration of its own for handling auth.

Taking a step back: communication between the Spark executors and the Spark master goes through the Spark driver program, which is responsible for coordinating the execution of Spark applications on the cluster. The driver sends instructions to the Spark master, which forwards them to the appropriate executors running on the worker nodes. So if the initial connection between MongoDB and the Spark master is happening correctly, whether the auth tokens reach the executors likely comes down to configuration on the Spark driver.

Make sure that the JAAS and krb5 config files are accessible to the Spark executors. You can distribute these files using the --files command-line option when launching your Spark application. For example, in your spark-submit command pass the JAAS configuration file and keytab as local resource files using the --files option, and add the JAAS configuration options to the JVM options specified for the driver and executor:
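The command itself did not survive the export, so here is a minimal sketch of what such a spark-submit invocation could look like; the file names (jaas.conf, krb5.conf, user.keytab, app.jar) and the main class are placeholders, not values from this ticket:

    # Hypothetical example: ship the JAAS/Kerberos files to the cluster and point
    # both the driver and executor JVMs at them via standard Java system properties.
    spark-submit \
      --master yarn \
      --deploy-mode cluster \
      --files jaas.conf,krb5.conf,user.keytab \
      --conf "spark.driver.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf -Djava.security.krb5.conf=krb5.conf" \
      --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=jaas.conf -Djava.security.krb5.conf=krb5.conf" \
      --class com.example.MyApp \
      app.jar

Files listed in --files are copied into each YARN container's working directory, which is why the JVM options can refer to them by bare file name.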
The above command will upload the config files to the Spark cluster and make them available to the executors. If that doesn't work, can you verify whether the Spark executors are using the correct Java security properties when authenticating with Kerberos?
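The snippet that followed is missing from the export; assuming it referred to the standard JVM Kerberos/GSS debug properties, it would look roughly like this additional spark-submit option:

    # Hypothetical example: turn on Kerberos and GSS-API debug output in the executor JVMs
    # (append these flags to any spark.executor.extraJavaOptions you are already setting).
    --conf "spark.executor.extraJavaOptions=-Dsun.security.krb5.debug=true -Dsun.security.jgss.debug=true"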
This will enable debugging output for the Java GSS API, which is used for Kerberos authentication. You can then check the executor logs for any Kerberos-related errors. If this turns out to be the issue, we can make sure the Spark executors use the correct Java security properties for Kerberos authentication via the `spark.executor.extraJavaOptions` configuration. | ||||||
| Comment by Chinmay Jog [ 04/Mar/23 ] | ||||||
|
Hi @Prakul Agarwal, it is the latter. We have a Kerberized Hadoop cluster running with YARN (the Cloudera Hadoop Data Platform distribution), and we also have a Kerberized MongoDB Enterprise Edition cluster. We are trying to connect to this MongoDB cluster using the MongoDB Spark Connector. When we run the Spark code with YARN as the cluster manager, the job crashes because the Spark executors are not able to connect to MongoDB (I get the "gssapi failed" error). If I run the Spark job in local mode, the connection works.

My perception of the problem is as follows: when running in local mode, the Spark program is able to create a Kerberos authentication token and successfully use it to connect, since the master and workers run on the same node. But when running in cluster mode, only the master is able to connect to MongoDB, not the executors (somehow the Kerberos token generated by the master during the InferSchema stage is not passed on to the executors). Could you please point me to the right configuration so that I can specify the JAAS config, krb5 config, or any other configuration to the MongoDB Spark Connector?

PS: I tried placing the JAAS config and krb5 config in /etc on each VM allocated to the Spark cluster, and also tried keeping them in HDFS and reading them from there, but nothing works.
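For context, a minimal sketch of the kind of setup being compared (host, principal, database, and jar names are placeholders, and the spark.mongodb.read.connection.uri option name assumes Spark Connector 10.x):

    # Hypothetical example: the same job submitted in the two modes being compared.
    # The GSSAPI URI asks the connector to authenticate against Kerberos ($external).
    MONGO_URI='mongodb://svc-user%40EXAMPLE.COM@mongo01.example.com:27017/mydb?authMechanism=GSSAPI&authSource=$external'

    # Local mode: driver and executors share the same node, so the Kerberos ticket is available to all of them.
    spark-submit --master "local[*]" \
      --conf "spark.mongodb.read.connection.uri=$MONGO_URI" app.jar

    # YARN cluster mode: executors on other nodes fail with the "gssapi failed" error
    # unless Kerberos credentials are made available to them as well.
    spark-submit --master yarn --deploy-mode cluster \
      --conf "spark.mongodb.read.connection.uri=$MONGO_URI" app.jar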
| ||||||
| Comment by Prakul Agarwal [ 04/Mar/23 ] | ||||||
|
chinmay.jog@bnymellon.com Regarding the question you have asked: is this about (1) setting up a Spark cluster with YARN cluster mode and Kerberos, or (2) the use of the MongoDB Spark Connector when the Spark cluster is set up with YARN and Kerberos? If it is the latter, can you give us more context about the issue you are looking to get more information on? | ||||||
| Comment by Chinmay Jog [ 14/Nov/22 ] | ||||||
|
Can anyone please provide any documentation as to how this is achieved? | ||||||
| Comment by Nathan Leniz [ 23/Sep/21 ] | ||||||
|
This ticket hasn't had any activity for three years. Closing for now, happy to reopen if more information can be provided and this is still needed. |