hubcap's Introduction

Hubcap

Hubcap is the script that generates the pages of the dbt package registry site, hub.getdbt.com.

Each hour, hubcap.py runs and checks whether there are new releases for any of the repositories listed in hub.json. If a new release has been created, or a new package has been added to the list, a Pull Request is opened against the hub.getdbt.com repository to update the registry site to reflect this. PRs are approved by a member of the Fishtown Analytics team, typically within one business day.

Caveats and Gotchas

Assorted constraints:

  • project must be hosted on GitHub
  • your project must have a dbt_project.yml with a name: tag in the yaml
  • if used by your project, a packages.yml must live at the root level of your project repository
  • only release names that use semantic versioning will be picked up by hubcap — both 0.1.0 and v0.1.0 will be picked up, but first-release will not.
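As a rough illustration of the last constraint, the sketch below filters release names with a deliberately simplified pattern. The pattern, the optional leading "v" handling, and the name is_semver_release are assumptions made for this example; they are not hubcap's actual implementation, and pre-release suffixes are ignored here.

```python
import re

# Simplified pattern (an assumption, not hubcap's real regex):
# an optional leading "v" followed by MAJOR.MINOR.PATCH.
RELEASE_NAME = re.compile(r"^v?\d+\.\d+\.\d+$")


def is_semver_release(name: str) -> bool:
    """Return True if a release name looks like a plain semantic version."""
    return RELEASE_NAME.match(name) is not None


releases = ["0.1.0", "v0.1.0", "first-release"]
picked_up = [name for name in releases if is_semver_release(name)]
print(picked_up)
```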

Adding your package to hubcap

Currently, only packages hosted on a GitHub repo are supported.

To add your package, open a PR that adds your repository to hub.json. A dbt Labs team member will review your PR and provide a cursory check of your new package against best practices.

hubcap's People

Contributors

absorbb, aescay, agnessnowplow, amychen1776, aneiderhiser, bill-warner, bruno-szdl, cdussud, clrcrl, danielpdwalker, dbeatty10, drewbanin, entechlog, fivetran-joemarkiewicz, guilhermealc, il-dat, joellabes, jtcohen6, kristin-bagnall, milo157, mjirv, niallrees, oleg-solovyev, saras-daton, stephen986, tnightengale, tomshelllby, versusfacit, wilson-urdaneta, yu-iskw

hubcap's Issues

Feature: Add docs and readme

I added some comments to hubcap during the refactor but that's just the basics. I think it would be swell to add how the script works, notes about the ecosystem, user directions that dovetail.

Documentation is endless, so here's my proposed scope for this issue:

  • briefly describe the overall workflow architecture to help new users understand the package ecosystem, especially those who want to contribute packages
  • a link to notes about adding a repo
  • notes on adding a new package to the hub
  • contribution notes for the repo itself

Enable range of database hosts to empower CI for maintainers of community packages

Goal

As a maintainer of a community package for dbt, I want to run automated testing through a CI service so that quality assurance is automatically included in the workflow for a wide range of databases.

Solved

✅ Both GitHub Actions (GHA) and Circle CI have free plans available to cover the continuous integration (CI) piece.

Problem

❌ Compute costs. Lack of free hosts for the full range of dbt database adapters (Postgres, BigQuery, Snowflake, etc).

Options discussed thus far

  1. Reimburse compute costs through an expense reporting mechanism or stipend
  2. Clone these repos on instances internal to dbt Labs and execute tests
  3. Enable access to database instances internal to dbt Labs for community maintainers

Primary negative trade-off for each option

  1. Complicated on the finance side of things
  2. Disempowering for community maintainers and asynchronous delays not conducive to typical development workflows
  3. Complicated on the security side of things

Update the CODEOWNERS file

It sounds like someone on the dbt Labs engineering team will come to own this stack. If/when that is decided, the relevant parties should be added to the CODEOWNERS file.

Until then, update the file to reflect the current dedicated team members.

Use the FishtownBuildBot service account for automated pull requests

Problem

Currently, all the pull requests for hub.getdbt.com look like they are coming from Drew.

Proposed solution

Use the FishtownBuildBot service account for automated pull requests instead.

Implementation

  1. Login to GitHub using the FishtownBuildBot service account
  2. Generate a personal access token (PAT)
  3. Login to Heroku
  4. Update all the hubcap applications (both production and non-production) to reflect the new PAT
  5. Also update the user name and email address like this

Renamed GitHub repo did not generate new Hub entry

Overview

As mentioned here, https://github.com/tuva-health/core was renamed to https://github.com/tuva-health/core.

When the script ran on Heroku, it successfully cloned the repo:

2022-11-17 15:03:57 INFO Drawing down tuva-health's the_tuva_project
2022-11-17 15:03:57 INFO cloning https://github.com/tuva-health/the_tuva_project.git to /app/target/tuva-health_the_tuva_project
2022-11-17 15:03:59 INFO pulling main at /app/target/tuva-health_the_tuva_project

But it did not collect tags and write index.json when appropriate (like it successfully did for terminology and data_profiling):

...
2022-11-17 15:04:48 INFO no new tags for terminology. Skipping...
2022-11-17 15:04:48 INFO collecting tags for data_profiling
2022-11-17 15:04:48 INFO pkg hub tags: []
2022-11-17 15:04:48 INFO pkg remote tags: ['0.1.0']
2022-11-17 15:04:48 INFO creating task to add new tags ['0.1.0'] to data_profiling
...
2022-11-17 15:04:50 INFO writing index.json to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/index.json
2022-11-17 15:04:50 INFO
2022-11-17 15:04:50 INFO downloading: https://codeload.github.com/tuva-health/data_profiling/tar.gz/0.1.0
2022-11-17 15:04:50 INFO SHA1: e3c25af1078d845d3ba1065249eb76676120888b
2022-11-17 15:04:50 INFO writing spec to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/versions/0.1.0.json
2022-11-17 15:04:50 INFO hubcap: Adding tag 0.1.0 for tuva-health/data_profiling

Next steps

Try to reproduce and troubleshoot this locally (rather than via Heroku).

Fall-back if all else fails

Manually create a pull request within https://github.com/dbt-labs/hub.getdbt.com that adds the appropriate files.

cc @tuvaforrest for visibility

Testing framework

As a contributor, I'd like a testing framework so that I can be more confident that my changes work as expected (and don't introduce bugs).

pytest seems like a logical choice.

Remove cron.sh

Allow hubcap.py to be stand-alone without needing to be called by cron.sh.

To fully deprecate cron.sh, do the following:

  • Update the Heroku Scheduler to be $ python3 hubcap/hubcap.py instead of $ ./cron.sh
  • Update from cron.sh to python3 hubcap/hubcap.py within the documentation
  • remove the ENV environment variable and cron.sh from the documentation
  • remove the ENV environment variable in Heroku
  • remove cron.sh

[Urgent] Bug: Build package configs from main branch (or master)

We need to ensure the hubcap script can handle packages whose default branch is main, falling back to master only when there is no main branch. Currently, master is hardcoded into the script. We need conditional logic that tries main, then master, when looking for the commit SHAs that go into package specs. Otherwise, newer versions won't be added to hub.getdbt.com; the script will just break for any repo that does not retain a master branch.
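A minimal sketch of that fallback, assuming the script can obtain the list of branch names for a cloned repository. The function name pick_default_branch and the PREFERRED_BRANCHES tuple are hypothetical and do not appear in hubcap.

```python
from typing import Iterable, Optional, Sequence

# Order matters: earlier names win. Both names are taken from the issue text.
PREFERRED_BRANCHES = ("main", "master")


def pick_default_branch(
    branch_names: Iterable[str],
    preferred: Sequence[str] = PREFERRED_BRANCHES,
) -> Optional[str]:
    """Return the first preferred branch that exists, or None if none do."""
    available = set(branch_names)
    for candidate in preferred:
        if candidate in available:
            return candidate
    return None


print(pick_default_branch(["dev", "master"]))
print(pick_default_branch(["main", "master"]))
```

Returning None rather than raising leaves it to the caller to decide whether to skip the package or log a warning.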

Good news though! Because package specs depend only on commit shas, nothing has been changed for previous versions of packages.

This will partially address issue #69 .

Feature: Make hubcap less dependent on config

This script requires a carefully crafted environment variable to be present in order to run. It would be nice to wrap the configuration so that the script can run even when that variable isn't present, which would help with test runs. Right now, it takes an unreasonable amount of hacking to run the main script outside of a Heroku cluster (and in time, it seems we'll want to move away from Heroku in general).

Use the configured repo name as-is

The temporary folder of git repositories currently looks something like this:

...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub
...

So it is more obvious to a new developer that https://github.com/dbt-labs/hub.getdbt.com has been cloned, I'd rather it look like this:

...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub.getdbt.com
...

Remove `__all__`

Proposal

Since the code within the hubcap repo isn't meant to be published as a package and it doesn't have any instances of from {module} import *, we can remove instances of __all__.

TL;DR for __all__

  • If you have __all__ in a Python module, then from {module} import * will import everything listed in __all__. Otherwise, it will import everything that does not start with an underscore.
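The behavior described above can be checked with a small self-contained experiment. The two in-memory modules (with_all and without_all) and the helper names are made up for this demonstration.

```python
import sys
import types


def make_module(name: str, source: str) -> types.ModuleType:
    """Build an in-memory module and register it so it can be imported."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


def star_import(name: str) -> set:
    """Return the set of names that `from <name> import *` brings in."""
    namespace = {}
    exec(f"from {name} import *", namespace)
    # exec adds __builtins__ to the namespace; it is not part of the import.
    return {key for key in namespace if key != "__builtins__"}


BODY = "public = 1\nalso_public = 2\n_private = 3\n"
make_module("with_all", "__all__ = ['public']\n" + BODY)
make_module("without_all", BODY)

print(sorted(star_import("with_all")))
print(sorted(star_import("without_all")))
```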

Preserve the git target directory by default

Proposal

  • rename default target directory for git clone steps from "git-tmp" to "target"
  • default to preserving the target directory rather than deleting it
  • introduce a make clean for cleaning out the target directory (edit: not needed for now -- can be re-requested if needed)
  • drop support for the GIT_TMP environment variable

Register a dbt package from a sub directory of a repository

Hi friends,

We are about to release a new dbt package and we want to keep everything as a monorepo in our project. That would mean that instead of pointing to a GitHub repository, we would have to point to a subdirectory in our existing repository, for example https://github.com/fal-ai/fal/feature-store.

I don't think this is possible right now, but we are happy to contribute if this is something you would like to see implemented.

I just had a quick glance at the code, and it looks like a change like this would be self-contained in the hubcap repository, so I wouldn't have to touch the logic for how dbt-core downloads dependencies.

Helpful error message if not able to list pull requests

Background

Currently open pull requests are discovered with a URL like:
https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open

But if this API has an error response, the script will raise an exception without any indication of the cause.

Proposal

Catch any exceptions, log the exception message, and also suggest the primary causes of error:

  • The repository is not visible to GitHub user specified by the token
  • The token is lacking the applicable scopes (repo, workflow)
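One possible shape for this proposal is sketched below. The function names, the logger name, and the injectable fetch parameter are assumptions for illustration and are not taken from hubcap. The sketch logs the exception and the likely causes, then re-raises so the caller still sees the failure.

```python
import json
import logging
import urllib.request
from typing import Callable, List

logger = logging.getLogger("hubcap.sketch")  # hypothetical logger name

LIKELY_CAUSES = (
    "the repository is not visible to the GitHub user specified by the token",
    "the token is lacking the applicable scopes (repo, workflow)",
)


def fetch_json(url: str, token: str):
    """Fetch a URL with a token header and decode the JSON response."""
    request = urllib.request.Request(
        url, headers={"Authorization": f"token {token}"}
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


def list_open_pull_requests(
    org_name: str,
    package_name: str,
    token: str,
    fetch: Callable = fetch_json,
) -> List[dict]:
    """List open pull requests, logging likely causes when the call fails."""
    url = f"https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open"
    try:
        return fetch(url, token)
    except Exception as exc:
        logger.error("Could not list open pull requests at %s: %s", url, exc)
        for cause in LIKELY_CAUSES:
            logger.error("Possible cause: %s", cause)
        raise
```

Passing fetch as a parameter keeps the error handling testable without network access.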

Get package dependencies for specific release tag

(Similar to #21)

Hubcap uses the default branch, rather than the specific release, to get information about packages (dependencies). Instead, for each tagged release it's adding, it should check out that tag before introspecting packages.

This is causing issues right now for fivetran_utils. They can't change the default branch (it's pointed to by some older versions of other packages), so they want to cut releases from non-default branches instead. Everything works except for the packages dict created by Hubcap.
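As a sketch of one alternative to checking out each tag, git can print a file as it existed at a given tag with git show <tag>:<path>, which leaves the working tree untouched. The helper names below are hypothetical, and this swaps in git show for the checkout described above.

```python
import subprocess
from typing import List, Optional


def git_show_command(repo_dir: str, tag: str, path: str) -> List[str]:
    """Build the git command that prints `path` as it existed at `tag`."""
    return ["git", "-C", repo_dir, "show", f"{tag}:{path}"]


def read_file_at_tag(
    repo_dir: str, tag: str, path: str = "packages.yml"
) -> Optional[str]:
    """Return the file contents at the tag, or None if git reports an error
    (for example, when the file does not exist at that tag)."""
    result = subprocess.run(
        git_show_command(repo_dir, tag, path),
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        return None
    return result.stdout
```

The returned text could then be parsed per tag, so releases cut from non-default branches would get their own dependency information.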

Use `main` as the default branch

Background

The default branch for https://github.com/dbt-labs/hub.getdbt.com is currently master. It is hard-coded within Python in a place or two within this repo (https://github.com/dbt-labs/hubcap). Use git grep "master" ./ to discover those locations. It is unknown if this branch name is also specified within Netlify.

Next steps

  1. Discover if the branch name is hard-coded within Netlify configuration -- update the following steps depending on the discoveries
  2. Copy master to main within https://github.com/dbt-labs/hub.getdbt.com
  3. Update hubcap to point to main branch
  4. Redeploy hubcap
  5. Drop master branch in hub.getdbt.com

hubcap IndexError: list index out of rangeExceptionFatal

We got the following error which first appeared at 2023-01-11T18:05:23.005324+00:00 (which I believe was the first run after merging #222):

File "/app/hubcap/hubcap.py", line 70, in <module>
    new_branches = package.commit_version_updates_to_hub(
  File "/app/hubcap/package.py", line 125, in commit_version_updates_to_hub
    branch_name, org_name, package_name = task.run(hub_dir_path, pr_strategy)
  File "/app/hubcap/records.py", line 104, in run
    new_index_entry = self.make_index(
  File "/app/hubcap/records.py", line 178, in make_index
    latest_version = version_numbers[-1]
IndexError: list index out of rangeExceptionFatal

One of the packages that was added only has a single tag, and it is 0.1.0-b1. We do have other packages on the hub with similar tags, dbt_utils 1.0.0-b2 being one example.

Regardless of whether this is expected or not, this doesn't seem like something that should cause the script to error out.
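A guard along the following lines would avoid the IndexError. This is only a sketch: the function name, its arguments, and the choice to fall back to the newest pre-release are assumptions, not the logic in records.py. Both inputs are assumed to be sorted in ascending order.

```python
from typing import Sequence


def latest_version(
    final_versions: Sequence[str],
    prerelease_versions: Sequence[str] = (),
) -> str:
    """Pick the latest version without indexing into an empty list.

    Prefers the newest final release, falls back to the newest pre-release,
    and returns an empty string when there are no versions at all.
    """
    if final_versions:
        return final_versions[-1]
    if prerelease_versions:
        return prerelease_versions[-1]
    return ""


print(latest_version([], ["0.1.0-b1"]))
```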

ad hoc executions of the hubcap script

Add documentation on how to do ad hoc executions of the hubcap script (rather than waiting for the frequency specified within the Heroku Scheduler).

Final ad hoc step is something like this:

heroku run ./cron.sh

Or in the near future, just:

heroku run python3 hubcap/hubcap.py

Ignore commits in the blame view on GitHub

Be able to run automatic formatters without affecting the display of what revision and author last modified each line of a file (i.e., git blame).

Once a .git-blame-ignore-revs file is in place, the following can be run locally:

git blame --ignore-revs-file .git-blame-ignore-revs

This file will also be taken into account by GitHub with the git blame view.

Package on the Hub with only prerelease tags

When all the tagged versions for a package are prereleases, then hubcap will generate a PR, but Netlify won't be able to build it, so it can't auto-merge. This will hold up all other packages until it is resolved.

This is caused by a confluence of issues:

  • hubcap will generate PRs whenever there is a tag with a valid semantic version (includes both final releases as well as pre-releases)
  • hubcap will assign the latest final version when it can find one and "" otherwise
  • hub.getdbt.com requires that latest is a valid version (and will break otherwise)

Easiest solution:

  • Remove pre-release only packages from the Hub when they are discovered (ideally not adding them to hub.json until there is a tag with a final release)

Other solutions:

  • make hubcap.py robust in the face of packages with only pre-release tags
  • make hub.getdbt.com robust in the face of a blank latest version

I'm taking the easiest solution in the short-term and have a PR ready that implements the first "other" solution.

A problem with using the easiest solution long-term is that simple human error can trigger this situation again in the future. I'd rather prevent it entirely.

Support prerelease identifiers without hyphens

Prompted by @NiallRees's exciting 1.0.0b1 release of dbt_artifacts: https://github.com/brooklyn-data/dbt_artifacts/releases/tag/1.0.0b1

hubcap/hubcap/version.py

Lines 14 to 15 in d82a8a2

# regex taken from official SEMVER documentation site
match = re.match('^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$',

I think that's SemVer-official, but Python/pip actually doesn't support the hyphen. From PEP 440:

Semantic versions containing a hyphen (pre-releases - clause 10) or a plus sign (builds - clause 11) are not compatible with this PEP and are not permitted in the public version field.

So, it's pretty common for folks to use the hyphenless prerelease identifier, even if it's not "real" semver. The Core team stumbled across this inconsistency a few months ago (dbt-labs/dbt-core#4741), and decided to let it be. The dbt-core semver logic supports both:

>>> from dbt.semver import VersionSpecifier
>>> VersionSpecifier.from_version_string("1.0.0b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)
>>> VersionSpecifier.from_version_string("1.0.0-b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)

IMO:

  • We should aim for consistency between Hubcap + dbt-core (though it's a very good thing that Hubcap removed its dependency on dbt-core!)
  • We should accept both 1.0.0b1 and 1.0.0-b1 as valid semantic version identifiers, with prerelease suffix b1.
  • It's just a matter of adding one little qmark (?) to the regex string
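To make the last point concrete, here is the same pattern with the hyphen before the prerelease group made optional (-? instead of -). The constant name and the helper prerelease_of are made up for this sketch, and the pattern is written as raw strings.

```python
import re

# SemVer pattern from the snippet above, with the hyphen made optional.
SEMVER_OPTIONAL_HYPHEN = re.compile(
    r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-?(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def prerelease_of(version: str):
    """Return the prerelease part of a version string, or None if absent."""
    match = SEMVER_OPTIONAL_HYPHEN.match(version)
    if match is None:
        raise ValueError(f"not a recognized version: {version!r}")
    return match.group("prerelease")


for candidate in ("1.0.0b1", "1.0.0-b1", "1.0.0"):
    print(candidate, prerelease_of(candidate))
```

One side effect worth testing before adopting this: with the hyphen optional, a string such as 1.0.01 also matches, read as patch 0 with prerelease 1.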

Feature: Codify package config guidelines

Hey team, we have been discussing there being rules for how a plugin project should look for it to be added to the hub. Here's what I've got so far:

  • has a dbt_project.yml with a name
  • if packages.yml exists, it lives at the root dir of the package
  • the package repo should not be private
  • prefer main to master for parsing out commits (perhaps have a way for packages to specify the branch of their choice)

Note: I had originally framed this to myself as requiring a main branch, but on second thought, I think it's better to prioritize main over master in the script logic, or perhaps even to allow a config file in the user's package repo as an override if they want to specify the exact branch for us to use when considering new versions. main is already prioritized by GitHub, and we can document that master has been deprecated but is still supported (since any branch can be used). That way, we don't frustrate package maintainers that haven't yet made the switch (plenty of shops are still slowly but surely transitioning over).

What other things should be added to docs about what a basic package should look like?

Lock down `requirements.txt`

Related to #108

Overview

We have wide-open versions listed in requirements.txt. This could be problematic at an inopportune time if the cache is dropped and an incompatible version of a package is installed.

The Heroku documentation states:

It’s recommended to specify explicit dependency versions in your requirements.txt file. To update this file, you can use the pip freeze command in your active virtual environment:

pip freeze > requirements.txt

But this post explains why that is still not enough.

It goes on to explain how to use pip-compile (from pip-tools) to create a locked-down requirements.txt. I believe this makes it akin to a Pipfile.lock file or poetry.lock file from Pipenv and Poetry, respectively.

Feature

  1. Make sure the local version of Python matches that of runtime.txt precisely
  2. Add pip-tools to the dev-requirements.txt
  3. Rename the existing requirements.txt to requirements.in
  4. Run pip-compile to create requirements.txt
  5. Check-in all of the above changes to git

Use project git config instead of global

git config is updated globally here in a bash script. However, we can use something like this using Python instead. It can be placed after the clone here.

Advantages

  • Scope the config specific to where it is needed rather than globally
  • No need to clean-up afterwards if it fails in the middle of execution
  • The code should be shorter overall
  • More people know how to read and write Python than bash
  • One step closer to removing the bash script altogether

Implementation details

The config.example.json should be updated to something like the following:

{
    "user": {
        "name": "dbt-hubcap",
        "email": "[email protected]",
        "token": "pe4s0n@l-@cce$$-t0k3n"
    },
    "org": "dbt-labs",
    "repo": "hub.getdbt.com",
    "push_branches": true,
    "one_branch_per_repo": true
}

Then the name and email keys should be used to set the project config for the git repo.
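A minimal sketch of that last step, assuming the git CLI is available and the repository has already been cloned. The helper names are hypothetical, and the email in the usage lines is a placeholder. Using git config --local scopes the identity to the one repository instead of the global config.

```python
import json
import subprocess
from typing import Dict, List


def git_config_commands(repo_dir: str, user: Dict[str, str]) -> List[List[str]]:
    """Build the commands that set user.name and user.email for one repo."""
    return [
        ["git", "-C", repo_dir, "config", "--local", "user.name", user["name"]],
        ["git", "-C", repo_dir, "config", "--local", "user.email", user["email"]],
    ]


def configure_repo_identity(repo_dir: str, user: Dict[str, str]) -> None:
    """Apply the identity to the cloned repository's local git config."""
    for command in git_config_commands(repo_dir, user):
        subprocess.run(command, check=True)


# Placeholder values for illustration only.
config = json.loads('{"user": {"name": "dbt-hubcap", "email": "bot@example.com"}}')
for command in git_config_commands("target/hub.getdbt.com", config["user"]):
    print(" ".join(command))
```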

Feature: improve logging and monitoring

We need a way to get build logs that's at least a touch better than heroku logs -a dbt-hubcap. This may involve a simple monitoring setup for Heroku, or it may evolve into migrating the project entirely.

Generate project specification without dbt

Problem

dbt expectations cannot merge the newest version of their package to hub.getdbt.com because the dbt version employed by this repo's build script requires an incompatible version of core.

The script errors out at setup.

Background

To generate project specifications (i.e. the information used to generate project pages on https://hub.getdbt.com/), hubcap.py uses dbt itself to run shell commands and extract project information.

Among other things, this requires a phony dbt profile and a specific dbt version (see /requirements.txt). This introduces possible conflict with any packages using a require-dbt-version configuration tag in their dbt_project.yml.

As mentioned, dbt expectations is blocked from pushing their newest release because their pinned dbt version is not compatible with the version of dbt we use in the hubcap build script.

Solution

All functionality performed by dbt can be done using familiar Python system libraries and yaml parsing files. Let's remove Core from this build script.

Thankfully, the two files which hold this information are both yaml files located by design at root level of a dbt project--dbt_project.yml and packages.yml. Yaml parsing libraries should do the trick to put together a specification and a dependency chain.
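A rough sketch of that approach follows. It assumes PyYAML (a third-party library) for parsing, and the function names and the keys of the returned dictionary are hypothetical, not the format hubcap writes.

```python
from pathlib import Path
from typing import Any, Dict, Optional


def load_yaml(path: Path) -> Optional[Dict[str, Any]]:
    """Parse a YAML file, returning None when the file does not exist."""
    import yaml  # PyYAML, third-party (assumed to be installed)

    if not path.exists():
        return None
    with path.open() as handle:
        return yaml.safe_load(handle)


def build_spec(
    project: Dict[str, Any], packages: Optional[Dict[str, Any]]
) -> Dict[str, Any]:
    """Combine the parsed dbt_project.yml and packages.yml into one mapping."""
    return {
        "name": project["name"],
        "require_dbt_version": project.get("require-dbt-version"),
        # packages.yml is optional, so default to no dependencies.
        "packages": (packages or {}).get("packages", []),
    }


def spec_for_project(project_dir: str) -> Dict[str, Any]:
    """Read both files from the project root and build the specification."""
    root = Path(project_dir)
    project = load_yaml(root / "dbt_project.yml")
    if project is None:
        raise FileNotFoundError(f"no dbt_project.yml found in {root}")
    return build_spec(project, load_yaml(root / "packages.yml"))
```

Because nothing here imports dbt, a package's require-dbt-version setting is recorded as data and cannot conflict with the build script's own environment.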

As a bonus, any modularization of these components will be appreciated in the now and the future.

Outcomes

  • we unblock the dbt expectations team
  • we insulate ourselves from this danger

Feature: Create a dbt-labs ci user and have them make commits

Currently, Drew's user account is used to access private repos and run the script. Among other things, an API token is tied to his name.

My gut tells me this should be a dedicated CI user rather than something tied to Drew himself. If nothing else, it's a bit odd to see his face on the hub package version commits. More seriously, a dedicated user makes it easier to act securely, without needing to bother Drew in the future if other API tokens and the like are needed. We can also democratize access to the pipeline across the team.

Remove overzealous filtering of prereleases

All prerelease versions are removed from hubcap whereas we should only block those that do not pass the semver regex. That said, we should also retain the behavior that the latest version goes to the latest stable version as opposed to a prerelease.

Release instructions

As a maintainer, I'd like release instructions so that I know how to perform production deployments.
