Comments (10)
@dennishuo It was really impractical when submitting Spark jobs from a Windows client to a Linux cluster: the Spark driver was running on Windows while the Spark cluster was hosted on Linux machines, so it was impossible to use the same credentials path on both Windows and Linux.
from hadoop-connectors.
Unfortunately not; the GCS connector integrates strictly through the Hadoop FileSystem interfaces, which don't have any clear notion of masters/workers or any way to broadcast metadata to all workers. Anything that could be implemented would end up fairly specific to a particular stack, e.g. relying on YARN, Spark, HDFS, or ZooKeeper to do some kind of keyfile distribution.
Was it impractical because you need to specify different credentials per job or something like that?
If it's difficult to sync keyfile directories across your workers continuously, one approach to make it easier would be to use an NFS mount shared across all your nodes to hold the keyfiles.
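For example, the connector configuration could point every node at the same shared keyfile, something like the following (the /nfs/keys path and project id are just placeholders):
google.cloud.auth.service.account.enable=true
google.cloud.auth.service.account.json.keyfile=/nfs/keys/service-account.json
fs.gs.project.id=my-project
As long as the NFS mount exposes the keyfile at the same absolute path on every node, the same properties work for all workers.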
from hadoop-connectors.
through Hadoop FileSystem interfaces, which doesn't have any clear notion of masters/workers or any way to broadcast metadata to be used by all workers.
Hadoop offers a distributed cache that could be used to ship the keyfile to all workers: https://hadoop.apache.org/docs/r2.6.3/api/org/apache/hadoop/filecache/DistributedCache.html
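As a rough sketch (assuming the job's main class supports Hadoop's generic options, and that the connector accepts a keyfile path relative to the container's working directory), the keyfile could be shipped alongside the job and referenced by its localized name:
hadoop jar my-job.jar com.example.MyJob \
  -files /local/path/service-account.json \
  -D google.cloud.auth.service.account.enable=true \
  -D google.cloud.auth.service.account.json.keyfile=service-account.json \
  gs://my-bucket/input gs://my-bucket/output
Here my-job.jar, com.example.MyJob, and the bucket paths are placeholders, not anything specific to the connector.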
from hadoop-connectors.
@dennishuo We have a very similar problem and are solving it in the suggested manner -- making the keyfiles available as local files on workers. However, this approach has three problems as of now:
1) making it work for all types of jobs is quite fragile, as hadoop-core, beeline, pig and spark require different approaches to inject those properties
2) it introduces another level of indirection in a multi-user environment, when the key file used does not correspond to the principal
3) keys may be stolen by unauthorised principals if DefaultContainerExecutor is used
While 3) seems to be resolvable by using LinuxContainerExecutor, 1) and 2) do not seem to have an easy solution at this time. I'm thinking about a mechanism that would provide different sets of auth properties depending on the principal name. Something like this:
fs.gs.principals.names=myuser,otheruser
fs.gs.principals.props.myuser.google.cloud.auth.service.account.enable=true
fs.gs.principals.props.myuser.google.cloud.auth.service.account.json.keyfile=/var/run/keys/myuser.json
fs.gs.principals.props.otheruser.google.cloud.auth.service.account.enable=true
fs.gs.principals.props.otheruser.fs.gs.project.id=other-google-project-name
fs.gs.principals.props.otheruser.google.cloud.auth.service.account.json.keyfile=/var/run/keys/otheruser.json
Do you believe this type of functionality could become part of the upstream driver?
from hadoop-connectors.
@dennishuo We have a very similar problem. I wanted to set up a Dataproc cluster for multiple users. The Compute Engine VMs use the default or a custom service account's credentials to connect to the storage bucket, and those credentials have no relation to the user principals who submit the jobs (or I couldn't find an option to control it). This makes the Dataproc cluster insecure and creates the problem mentioned by @chemikadze: it introduces another level of indirection in a multi-user environment, when the key file used does not correspond to the principal.
Is there any workaround or solution available?
from hadoop-connectors.
@krishnabigdata In my case, we've solved that indirection by implementing a wrapper around the GCS Hadoop driver which maps users to keys according to a configured mapping. Users are mapped to groups, and groups are mapped to particular "group" service accounts.
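For anyone curious, a minimal sketch of such a wrapper could look roughly like this (the class name and the per-user property prefix are made up for illustration; only the standard Hadoop and connector APIs are assumed):
import java.io.IOException;
import java.net.URI;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.security.UserGroupInformation;
import com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem;

// Hypothetical wrapper: picks a service-account keyfile for the current
// Hadoop user before delegating to the stock GCS connector.
public class PerUserGoogleHadoopFileSystem extends GoogleHadoopFileSystem {
  @Override
  public void initialize(URI path, Configuration conf) throws IOException {
    // Resolve the principal running the job, e.g. "myuser".
    String user = UserGroupInformation.getCurrentUser().getShortUserName();
    // Illustrative property name; an admin would provision these mappings.
    String keyfile = conf.get("fs.gs.principals.props." + user + ".keyfile");
    if (keyfile != null) {
      conf = new Configuration(conf);
      conf.set("google.cloud.auth.service.account.enable", "true");
      conf.set("google.cloud.auth.service.account.json.keyfile", keyfile);
    }
    super.initialize(path, conf);
  }
}
The wrapper would then be registered as the implementation for the gs:// scheme via fs.gs.impl in core-site.xml, so all jobs pick it up transparently.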
from hadoop-connectors.
@chemikadze Thanks for your reply. In my case we are submitting the jobs using gcloud dataproc jobs submit hadoop, because the idea is to control access to the Dataproc cluster using IAM roles. However, during job submission the user principal is not forwarded to the Hadoop cluster, and gcloud does not perform any access validation on storage buckets on the client side, so the job is always executed as the root user. How can I map users to their service accounts -- do you have any solution for this case?
from hadoop-connectors.
All we need is that Hadoop MapReduce jobs submitted by users via gcloud dataproc jobs submit hadoop should only be able to use the storage buckets or folders the submitting user has access to.
Current:
gcloud dataproc jobs (IAM - user principal) -> Dataproc Cluster (IAM - user principal) -> (SA Default/custom) -> Storage Bucket
If a user has access to submit jobs to the Dataproc cluster, they can use any storage bucket the service account has access to.
Required:
gcloud dataproc jobs (IAM - user principal) -> Dataproc Cluster (IAM - user principal) -> (IAM - user principal) -> Storage Bucket
A user who has access to submit jobs to the Dataproc cluster should only be able to use the storage buckets the user account has access to.
So far I couldn't find a way to do it. Can you please help me with it? Is there any workaround or solution available for this problem?
from hadoop-connectors.
@krishnabigdata You can use the GCP Token Broker in conjunction with Kerberos to secure a Dataproc cluster for the multi-user use case with per-user GCS authentication.
from hadoop-connectors.
Hi Medb, do you have any ideas on connecting PySpark running on-prem to a GCP bucket, so that I can get the bucket data on-prem?
from hadoop-connectors.
Related Issues (20)
- BQ storage library blocked on update to grpc v1.56 HOT 1
- GoogleCloudStorageFileSystem#delete recursive does not page
- Memory issues while running Apache Spark streaming applications on Google Dataproc cluster | OutOfMemoryError Java heap space
- flume sink hdfs to gcs, all gcs write threads blocked
- how to transfer file from local to gcs bucket using dataproc hadoop in intellij
- GCS Connector fails with StackOverflowError during accessing hadoop credentials
- GhfsStorageStatistics cannot be cast ERROR HOT 9
- Support disabling automatic decompression of gzip files in GCS connector
- gcs-connector 3.0 not working with pyspark HOT 5
- gcs-connector:3.0.0 failing due to certificate when accessing to GCS from Github runner with WIF configuration HOT 7
- Feature request: automatic identity deduction a la google.auth.default()
- gcs-connector-3.0.0-shaded CVEs HOT 1
- How can I sink GCS connector metrics into GCP Cloud Monitor? HOT 2
- globStatus should prioritize server-side filtering over listing all files and performing local matches
- Conversion from InputStream -> ByteBuffer on gRPC writes creates many byte[] allocations. HOT 2
- Bug in `GoogleCloudStorageReadChannel` can cause an infinite loop
- hadoop3-2.2.22 and hadoop3-2.2.23 throws NoSuchMethodError at ServiceOptions.getService
- gcs-connector- CVE
- GCS connector throws rate limit errors
- Could not initialize class com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystemConfiguration HOT 1