- Type: New Feature
- Resolution: Won't Fix
- Priority: Major - P3
- Affects Version/s: 2.4.0
- Component/s: Configuration
- Environment: Tested in (but not restricted to) Linux with pyspark (python 3.6.8), mongod (v4.0.6) and spark (v2.4.0)
It appears that a SparkSession object cannot support more than 1 concurrent Kerberos principal. Each spark application can conceptually require up to 3 principals (see the sketch after this list):
- Input URI
- Output URI
- The default or operating security context in which the parent application operates
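For illustration, here is a minimal pyspark sketch of where each of the three principals can surface. The hosts, realms, database names and principal names are hypothetical; the `spark.mongodb.input.uri` / `spark.mongodb.output.uri` keys are the connector's 2.x configuration options.

```python
# Sketch only: hosts, realms and principals below are hypothetical.
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .appName("three-principal-example")
    # Principal 1: carried in the input URI. With authMechanism=GSSAPI the
    # username portion is the Kerberos principal ('@' is URL-encoded as %40).
    .config("spark.mongodb.input.uri",
            "mongodb://alice%40EXAMPLE.COM@mongo-in.example.com/db.src"
            "?authMechanism=GSSAPI")
    # Principal 2: carried in the output URI.
    .config("spark.mongodb.output.uri",
            "mongodb://bob%40EXAMPLE.COM@mongo-out.example.com/db.dst"
            "?authMechanism=GSSAPI")
    .getOrCreate()
)

# Principal 3: the default/operating security context of the parent
# application, e.g. whatever ticket `kinit etl-user@EXAMPLE.COM` placed in
# the default credential cache before spark-submit was invoked.
```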
There are 2 failure modes that arise out of this:
- If the input & output URI credentials differ by principal (typically uncommon)
- If the default context differs from either of the URIs
In the second case, if a spark application (pyspark) is configured to authenticate to an endpoint (such as Hadoop) with a principal that differs from the MongoDB URI credentials, one set of connections will fail to authenticate. Only the connections whose principal matches the active Kerberos security context will succeed.
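As a concrete sketch of that second failure mode (all names hypothetical): the default context is established by kinit as a Hadoop-side principal, while the MongoDB output URI names a different principal, so the MongoDB connections are the ones expected to fail authentication.

```python
# Sketch only; hosts, paths and principals are hypothetical.
#
#   $ kinit etl-user@EXAMPLE.COM   # establishes the default security context
#   $ spark-submit job.py
from pyspark.sql import SparkSession

spark = (
    SparkSession.builder
    .config("spark.mongodb.output.uri",
            "mongodb://mongo-user%40EXAMPLE.COM@mongo.example.com/db.out"
            "?authMechanism=GSSAPI")
    .getOrCreate()
)

# Reads from Kerberized HDFS succeed: they use the default context
# established by kinit (etl-user@EXAMPLE.COM).
df = spark.read.parquet("hdfs://namenode.example.com/data/input")

# The write is expected to fail authentication: the driver initialises its
# GSSAPI context from the same default ticket (etl-user), not from the
# mongo-user principal named in the URI.
df.write.format("com.mongodb.spark.sql.DefaultSource").mode("append").save()
```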
I think there are two layers contributing to this:
- The MongoClient cache pooling, which makes the security context common across threads
- The inherent behaviour of the Java driver, which is limited to accessing the default GSSAPI security context in the JVM; i.e., it is not designed to select a named context by principal
In summary, the Java driver naively relies on the default Kerberos token, from which it initialises the GSSAPI security context inside MongoClient. Because these 3 components share the same default context, it is not possible for any of them to differ by principal.