[DOCS-12256] Spark with YARN and Kerberos Created: 03/Jan/18  Updated: 07/Jul/23  Resolved: 07/Jul/23

Status: Closed
Project: Documentation
Component/s: Spark Connector
Affects Version/s: None
Fix Version/s: None

Type: Task Priority: Major - P3
Reporter: Danny Hatcher (Inactive) Assignee: Unassigned
Resolution: Won't Do Votes: 0
Labels: None
Remaining Estimate: Not Specified
Time Spent: Not Specified
Original Estimate: Not Specified

Participants:
Last reply: 45 weeks ago
Epic Link: DOCSP-6205

 Description   

What are the steps to get Spark to work with YARN cluster mode and Kerberos? (It's fine if this doesn't exist anywhere yet; I'm creating it as a tracking ticket for myself to continue investigating when I get a chance.)



 Comments   
Comment by Prakul Agarwal [ 30/Mar/23 ]

chinmay.jog@bnymellon.com Did this help solve your question? We can also get on a call to go over your use case and dive deeper into the issue.

Comment by Prakul Agarwal [ 07/Mar/23 ]

chinmay.jog@bnymellon.com The MongoDB Spark Connector doesn't have any configuration to deal with auth. 

Taking a step back: the communication between the Spark executors and the Spark master is coordinated by the Spark driver program, which is responsible for orchestrating the execution of Spark applications on the cluster. The driver program sends instructions to the Spark master, which then forwards them to the appropriate executors running on the worker nodes. My thinking is that if the initial connection between MongoDB and the Spark master is happening correctly, then whether the auth tokens get passed to the executors comes down to configuration on the Spark driver.
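
For context, the MongoDB side of that initial connection is normally carried entirely by the connection URI passed to the connector, which is where the Kerberos (GSSAPI) auth mechanism is specified. A minimal sketch, assuming connector 10.x option names (the 3.x series uses spark.mongodb.input.uri instead); the host, principal, database, and collection below are placeholders:

# Placeholders throughout; the single quotes keep $external from being expanded by the shell.
spark-submit \
  --conf 'spark.mongodb.read.connection.uri=mongodb://user%40EXAMPLE.COM@mongo-host.example.com:27017/?authMechanism=GSSAPI&authSource=$external' \
  --conf 'spark.mongodb.read.database=mydb' \
  --conf 'spark.mongodb.read.collection=mycoll' \
  ...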

Make sure that the JAAS and krb5 config files are accessible to the Spark executors. In your spark-submit command, pass the JAAS configuration file and keytab as local resource files using the --files option, and point the driver and executor JVM options at the JAAS file. For example:

spark-submit \
  --files key.conf#key.conf,v.keytab#v.keytab \
  --driver-java-options "-Djava.security.auth.login.config=./key.conf" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf" \
  ...

For example, in YARN cluster mode, shipping the JAAS and krb5 files alongside the application:

spark-submit --class my.Main --master yarn --deploy-mode cluster --files /path/to/jaas.conf,/path/to/krb5.conf myapp.jar

The above commands upload the config files to the Spark cluster and make them available to the executors.

source - https://docs.cloudera.com/HDPDocuments/HDP3/HDP-3.0.1/developing-spark-applications/content/running_spark_streaming_jobs_on_a_kerberos-enabled_cluster.html#:~:text=In%20your%20spark%2Dsubmit%20command,conf%23key.
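
The key.conf referenced above is a standard JAAS file. A hedged sketch of what it might contain; the entry name, principal, and keytab path are placeholders and depend on your environment and on how the Kerberos login is triggered:

# Sketch only: adjust the entry name, principal, and keytab path to your setup.
cat > key.conf <<'EOF'
com.sun.security.jgss.krb5.initiate {
  com.sun.security.auth.module.Krb5LoginModule required
  useKeyTab=true
  keyTab="./v.keytab"
  principal="user@EXAMPLE.COM"
  storeKey=true
  doNotPrompt=true;
};
EOF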

If the above doesn't work, can you verify whether the Spark executors are using the correct Java security properties when authenticating with Kerberos? For example, set:

com.sun.security.jgss.debug=true

This will enable debugging output for the Java GSS API, which is used for Kerberos authentication. You can then check the executor logs to see if there are any Kerberos-related errors.

Further, if this is found to be an issue, we can ensure that the Spark executors use the correct Java security properties when authenticating with Kerberos via the `spark.executor.extraJavaOptions` configuration.
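
For example, a sketch that combines the JAAS option shown earlier with the debug flags on both the driver and the executors (sun.security.krb5.debug is an additional, optional JDK flag not mentioned above):

# Same --files upload as before; debug output appears in the driver and executor logs.
spark-submit \
  --files key.conf#key.conf,v.keytab#v.keytab \
  --driver-java-options "-Djava.security.auth.login.config=./key.conf -Dcom.sun.security.jgss.debug=true -Dsun.security.krb5.debug=true" \
  --conf "spark.executor.extraJavaOptions=-Djava.security.auth.login.config=./key.conf -Dcom.sun.security.jgss.debug=true -Dsun.security.krb5.debug=true" \
  ...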

Comment by Chinmay Jog [ 04/Mar/23 ]

Hi @Prakul Agarwal, it is the latter. We have a kerberized Hadoop cluster running with YARN (the Cloudera distribution, Hadoop Data Platform). We also have a kerberized MongoDB Enterprise Edition cluster running for us. We are trying to connect to this MongoDB cluster using the MongoDB Spark Connector. When we run the Spark code using YARN as the cluster manager, the code crashes because the Spark executors are not able to connect to MongoDB (I get the GSSAPI failed error). If I run the Spark job in local mode, the connection works.

My perception of the problem is as follows:

When running in local mode, the Spark program is able to create a Kerberos authentication token and successfully use it to connect, since the master and workers are running on the same node. But when running in cluster mode, only the master is able to connect to MongoDB, not the executors (somehow, the Kerberos token generated by the master during the InferSchema stage is not passed on to the executors for them to connect).

Could you please point me to the right configuration so that I can specify the JAAS config, krb5 config, or any other required configuration for the MongoDB Spark Connector?

 

PS: I tried placing the JAAS config and krb5 config in /etc on each VM allocated to the Spark cluster. I also tried keeping them in HDFS and reading from there, but nothing works.

 

Comment by Prakul Agarwal [ 04/Mar/23 ]

chinmay.jog@bnymellon.com Regarding the question you have asked: is this about (1) setting up a Spark cluster with YARN cluster mode and Kerberos, or (2) the use of the MongoDB Spark Connector when the Spark cluster is set up with YARN and Kerberos?

If it is the latter, can you give us more context about the issue you are looking to get more information on?

Comment by Chinmay Jog [ 14/Nov/22 ]

Can anyone please provide any documentation as to how this is achieved? 

Comment by Nathan Leniz [ 23/Sep/21 ]

This ticket hasn't had any activity for three years. Closing for now, happy to reopen if more information can be provided and this is still needed.
