helmchart's Introduction

Pachyderm – Automate data transformations with data versioning and lineage

Pachyderm is cost-effective at scale, enabling data engineering teams to automate complex pipelines with sophisticated data transformations across any type of data. Our unique approach provides parallelized processing of multi-stage, language-agnostic pipelines with data versioning and data lineage tracking. Pachyderm delivers the ultimate CI/CD engine for data.

Features

  • Data-driven pipelines automatically trigger based on detecting data changes.
  • Immutable data lineage with data versioning of any data type.
  • Autoscaling and parallel processing built on Kubernetes for resource orchestration.
  • Uses standard object stores for data storage with automatic deduplication.
  • Runs across all major cloud providers and on-premises installations.

Getting Started

To start deploying end-to-end version-controlled data pipelines, run Pachyderm locally, or deploy it on AWS, GCE, or Azure in about 5 minutes.

You can also refer to our complete documentation to see tutorials, check out example projects, and learn about advanced features of Pachyderm.

If you'd like to see some examples and learn about core use cases for Pachyderm:

Documentation

Official Documentation

Community

Keep up to date and get Pachyderm support via:

  • Follow us on Twitter.
  • Join our community Slack Channel to get help from the Pachyderm team and other users.

Contributing

To get started, sign the Contributor License Agreement.

You should also check out our contributing guide.

Send us PRs; we would love to see what you do! Our GitHub issues labeled "help-wanted" are a good place to start. We're sometimes bad about keeping that label up to date, so if you don't see any, just let us know.

Usage Metrics

Pachyderm automatically reports anonymized usage metrics. These metrics help us understand how people are using Pachyderm and make it better. They can be disabled by setting the env variable METRICS to false in the pachd container.
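
As a rough sketch, disabling metrics could look like the following fragment of the pachd container spec. Only the METRICS variable name and its value come from the text above; where this fragment lives (a Deployment manifest or a chart value) depends on how Pachyderm was deployed.

```yaml
# Fragment of the pachd container spec (surrounding Deployment fields omitted).
containers:
  - name: pachd
    env:
      - name: METRICS
        value: "false"   # env var values must be strings, so quote the boolean
```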

helmchart's People

Contributors

avigil, chainlink, echohack, eytanhanig, nadegepepin, philwinder, robert-uhl, tybritten

helmchart's Issues

Move from implicit Postgres configured Host/Port to explicit (external PG Support)

Right now, Postgres is configured implicitly through a specially named service, which corresponds to a Kubernetes-generated env var (POSTGRES_SERVICE_HOST). This makes using an external Postgres difficult and awkward (i.e. it requires overriding an autogenerated env var). External services didn't work, as they require a DNS name, which won't always be available for databases.
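
One possible shape for an explicit configuration, sketched as chart values. Every key name here is hypothetical and does not exist in the chart today; the point is only that host and port become ordinary values rather than being derived from a service name.

```yaml
# Hypothetical values.yaml fragment, not the chart's actual schema.
postgresql:
  enabled: false                 # assumption: skip deploying the bundled Postgres
  host: "db.example.internal"    # placeholder; could equally be a bare IP address
  port: 5432                     # Postgres default port
```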

TODO

Helm bug causing partSize to be converted to a float

Hi all.

This bug took me much longer to track down than I'd like to admit!

Recreate:
Use an "AMAZON" setup. E.g. https://github.com/pachyderm/helmchart/blob/master/examples/aws-values.yaml

Run that code (or just helm template if you don't want to recreate).

pachd will crash with this error:

error setting up External Pachd GRPC Server: error setting up PFS API GRPC Server: cannot parse: strconv.ParseInt: parsing "5.24288e+06": invalid syntax

Reason:
This bug: helm/helm#1707

The partSize setting is being converted into a float by the toString method in this line: https://github.com/pachyderm/helmchart/blob/pachyderm-0.4.1-rc.1/pachyderm/templates/pachd/storage-secret.yaml#L35

This happens because the number is big, so it gets rendered in engineering notation.

If you use a smaller number, then the toString works, but unfortunately minio requires a value of at least 5242880 to work (the default value).

I think you can work around this issue by accepting a string in your JSON schema, not an int. The string will be passed through to the k8s env var and interpreted as an int by pach.
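
A minimal sketch of that workaround from the user's side, assuming the chart's JSON schema has been relaxed to accept a string. The key path below is illustrative and may not match the chart's actual values layout.

```yaml
# Hypothetical values.yaml fragment; the key path is an assumption.
pachd:
  storage:
    amazon:
      # Quoted, so Helm keeps it as a string and toString has nothing to reformat.
      # Unquoted, 5242880 is parsed as a float and rendered as 5.24288e+06.
      partSize: "5242880"
```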

Node Affinity

For shared clusters it's essential that Pachyderm not compete for resources with other production applications. There are a number of methods for doing this, all of which require additions to the Pod's spec. This is currently impossible to do with this Helm chart.

Please make it possible to restrict Pachyderm to specific nodes as per this official documentation. The quickest implementation of this would be to allow users to specify additional yaml that should be added to the spec.
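
For illustration, these are the standard Kubernetes Pod spec fields such a feature would need to emit. The label and taint names are placeholders, and how the chart would expose them (dedicated values or a free-form block) is left open.

```yaml
# Pod spec fragment using standard Kubernetes scheduling fields.
spec:
  nodeSelector:
    workload: pachyderm          # placeholder label; schedule only onto matching nodes
  tolerations:
    - key: dedicated             # placeholder taint key
      operator: Equal
      value: pachyderm
      effect: NoSchedule         # tolerate the taint that keeps other workloads off
```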

Set up CI

Set the project up in CircleCI to run tests.

Add more-informative description

The description in helm search repo pachyderm is just ‘A Helm chart for Kubernetes,’ which is hardly useful. Add in some more useful verbiage here.

Allow for configurable annotations in ingress

Change the ingress annotations section to a toYaml so you can pass in whatever you'd like (in our case Traefik, but folks might want to add their own ingress, cert manager, etc.).

See here for an example
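
A sketch of what that could look like in the ingress template, using the common Helm idiom of with, toYaml, and nindent. The value path ingress.annotations and the resource name are assumptions about how the chart names things.

```yaml
# Fragment of an Ingress template; .Values.ingress.annotations is an assumed path.
metadata:
  name: dash   # placeholder resource name
  {{- with .Values.ingress.annotations }}
  annotations:
    {{- toYaml . | nindent 4 }}
  {{- end }}
```

With this in place, any map supplied under that key in values.yaml is copied verbatim into the rendered annotations, and the annotations block is omitted entirely when the map is empty.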

Issue with prometheus port name: "must be no more than 15 characters"

Hey Kyle!

Quick issue. The name for the prom port is invalid. Annoyingly it doesn't validate properly on the local side, and passes. But when you apply it (I'm on 1.20 Minikube atm) it fails.

Error: Service "pachd" is invalid: spec.ports[8].targetPort: Invalid value: "prometheus-metrics": must be no more than 15 characters

https://github.com/pachyderm/helmchart/blob/master/pachyderm/templates/pachd/service.yaml#L67

Renaming to prom-metrics should fix it. Will create a PR.
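
For reference, a sketch of the renamed port in the Service spec. Kubernetes limits port names to 15 characters, so the 18-character prometheus-metrics is rejected while prom-metrics (12 characters) passes. The port number is a placeholder, and the matching containerPort name in the pachd pod spec would need the same rename.

```yaml
# Fragment of the pachd Service; the port number is a placeholder.
ports:
  - name: prom-metrics
    port: 9091
    targetPort: prom-metrics   # must match the named containerPort on the pod
```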

Separate TLS configuration for Dash and Pach

Currently, Dash and Pachd share the same TLS configuration. It would be good to make these separate (with the option to use the same secret), as folks sometimes host dash and pach on different domains or have other reasons to use different certificates.

TLS configuration Changes [ High prio, blocks chart installation ]

Right now, we half-enable TLS by default by specifying a secret name and hoping the user specifies the cert. The default should be either A) fully disabled or B) enabled, but with an error message when the cert is missing.

Also, specifying no TLS is a bit awkward:

tls:
  certName: "" # must be an empty string
  create: null # must be null

We should change this to be more straightforward.
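
One possible shape, sketched with hypothetical keys: a single explicit switch, with the secret name only consulted when TLS is on.

```yaml
# Hypothetical values.yaml fragment, not the chart's current schema.
tls:
  enabled: false      # assumption: one flag fully disables TLS
  secretName: ""      # only read when enabled is true
```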

cc @robert-uhl

Pick up Pachyderm version from chart appVersion

The app version is used once in Chart.yaml and once in values.yaml, which is repetition and leads to mistakes. Better would be for the templates to default to the former and only use the latter as an override.

Should be pretty straightforward.
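
A sketch of that defaulting in a template, using Helm's built-in .Chart.AppVersion and the default function. The value path pachd.image.tag and the image repository are assumptions.

```yaml
# Fragment of a pod template; .Values.pachd.image.tag is an assumed path.
containers:
  - name: pachd
    # Falls back to appVersion from Chart.yaml when no tag override is set.
    image: "pachyderm/pachd:{{ .Values.pachd.image.tag | default .Chart.AppVersion }}"
```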

Dashboard ingress not working

$ helm -n pachyderm template ./pachyderm -f ./pachyderm/values.yaml \
--set ingress.enabled=true        
Error: template: pachyderm/templates/dash/ingress.yaml:12:20: executing "pachyderm/templates/dash/ingress.yaml" at <.Pach.DashURL>: nil pointer evaluating interface {}.DashURL

Use --debug flag to render out invalid YAML

here:

- host: {{ .Pach.DashURL }}

IAM Roles for EKS Pods

This chart currently supports IAM roles for GKE but not IAM roles for pods on EKS. This is blocking my company's adoption.

Here are two easy methods to add this functionality:

  1. Add Helm values for specifying the IAM role name for the two service accounts, which is as simple as adding the annotation eks.amazonaws.com/role-arn: ...
  2. A more universal solution is to allow users to specify additional annotations for each of the two service accounts. There are examples of this in a plethora of other popular Helm charts.
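
Either option would render something like the following. The annotation key is the one named above; the service account name, account ID, and role name are placeholders.

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: pachyderm                 # placeholder service account name
  annotations:
    # Placeholder ARN; substitute the role created for this cluster.
    eks.amazonaws.com/role-arn: arn:aws:iam::111122223333:role/example-pachyderm-role
```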

etcd storageclass not able to provision gp3 disk

To provision a gp3 disk for etcd on EKS, we need to use the AWS EBS CSI driver as the provisioner.

The following change has to be made in the storageclass:

# Source: pachyderm/templates/etcd/storageclass.yaml
allowVolumeExpansion: true
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  labels:
    app: etcd
    suite: pachyderm
  name: etcd-storage-class
parameters:
  type: gp3
provisioner: ebs.csi.aws.com  # I changed this to make it work.
