dbt-labs / hubcap

This app adds modules to the hubsite at hub.getdbt.com

Python 100.00%

hubcap's Introduction

Hubcap

Hubcap is the script that generates the pages of the dbt package registry site, hub.getdbt.com.

Each hour, hubcap.py runs and checks whether there are new releases for any of the repositories listed in hub.json. If a new release has been created, or a new package has been added to the list, a Pull Request is opened against the hub.getdbt.com repository to update the registry site to reflect this. PRs are approved by a member of the Fishtown Analytics team, typically within one business day.
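The hourly comparison can be pictured as a difference between the tags already recorded on the hub and the tags found on the remote repository. The following is a minimal sketch, assuming tags are plain strings; the function name find_new_tags is hypothetical and is not taken from hubcap's source.

```python
def find_new_tags(hub_tags, remote_tags):
    """Return remote tags that the hub does not know about yet, in remote order."""
    known = set(hub_tags)
    return [tag for tag in remote_tags if tag not in known]


# A package that is new to the hub has no recorded tags yet.
print(find_new_tags([], ["0.1.0"]))  # ['0.1.0']
# Nothing to do when every remote tag is already recorded.
print(find_new_tags(["0.1.0"], ["0.1.0"]))  # []
```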

Caveats and Gotchas

Assorted constraints:

project must be hosted on GitHub
your project must have a dbt_project.yml with a name: tag in the yaml
if used by your project, a packages.yml must live at the root level of your project repository
only release names that use semantic versioning will be picked up by hubcap — both 0.1.0 and v0.1.0 will be picked up, but first-release will not.
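The release-name constraint can be illustrated with a small check. This is a sketch only: the pattern below is a simplified approximation of semantic versioning with an optional leading v, and both SEMVER_TAG and is_release_tag are hypothetical names, not hubcap's actual implementation.

```python
import re

# Simplified approximation of a semantic-version check (assumption: hubcap's
# real pattern is stricter about prerelease and build identifiers).
SEMVER_TAG = re.compile(
    r"^v?(0|[1-9]\d*)\.(0|[1-9]\d*)\.(0|[1-9]\d*)"
    r"(?:-[0-9A-Za-z.-]+)?(?:\+[0-9A-Za-z.-]+)?$"
)


def is_release_tag(tag):
    """Return True if the tag looks like a semantic version, with or without a leading 'v'."""
    return SEMVER_TAG.match(tag) is not None


for tag in ["0.1.0", "v0.1.0", "first-release"]:
    print(tag, is_release_tag(tag))
```

With this pattern, 0.1.0 and v0.1.0 are accepted and first-release is rejected, matching the behavior described above.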

Adding your package to hubcap

Currently, only packages hosted on a GitHub repo are supported.

To add your package, open a PR that adds your repository to hub.json. A dbt Labs team member will review your PR and provide a cursory check of your new package against best practices.


hubcap's Issues

Feature: Add docs and readme

I added some comments to hubcap during the refactor but that's just the basics. I think it would be swell to add how the script works, notes about the ecosystem, user directions that dovetail.

Documentation is endless, so here's my proposed scope for this issue:

briefly describe the overall workflow architecture to help new users understand the package ecosystem, especially those who want to contribute packages
a link to notes about adding a repo
notes on adding a new package to the hub
contribution notes for the repo itself

Enable range of database hosts to empower CI for maintainers of community packages

Goal

As a maintainer of a community package for dbt, I want to run automated testing through a CI service so that quality assurance is automatically included in the workflow for a wide range of databases.

Solved

✅ Both GitHub Actions (GHA) and Circle CI have free plans available to cover the continuous integration (CI) piece.

Problem

❌ Compute costs. Lack of free hosts for the full range of dbt database adapters (Postgres, BigQuery, Snowflake, etc).

Options discussed thus far

Reimburse compute costs through an expense reporting mechanism or stipend
Clone these repos on instances internal to dbt Labs and execute tests
Enable access to database instances internal to dbt Labs for community maintainers

Primary negative trade-off for each option

Complicated on the finance side of things
Disempowering for community maintainers and asynchronous delays not conducive to typical development workflows
Complicated on the security side of things

Rename `setup.py`

Having a script named setup.py has historically meant something specific within Python projects.

To avoid confusion and surprise, let's just rename it.

Feature: Update primary branch from master to main

Let's bring this repo up to date with other dbt-labs repos in this small but nonetheless important way.

Flake8 took down the gitlab repository in favor of github

An estimated 20K(!) CI pipelines in GitHub are affected by this 🤯

See example solution here:
dbt-labs/dbt-core#6252

Here's what the error looks like in our CI pipeline:

Update the CODEOWNERS file

It sounds like someone on the dbt Labs engineering team will come to own this stack. If/when that is decided, the relevant parties should be added to the CODEOWNERS file.

Until then, update the file to reflect the current dedicated team members.

Use the FishtownBuildBot service account for automated pull requests

Problem

Currently, all the pull requests for hub.getdbt.com look like they are coming from Drew. Example here:

Proposed solution

We use the FishtownBuildBot service account for automated pull requests like this:

dbt-labs/dbt-core#6112

Implementation

Login to GitHub using the FishtownBuildBot service account
Generate a personal access token (PAT)
Login to Heroku
Update all the hubcap applications (both production and non-production) to reflect the new PAT
Also update the user name and email address like this

Fix any issues raised via flake8

#185 added a pre-commit hook for flake8.

Use pre-commit run --all-files to find any issues and then fix them.

Remove authentication when cloning dbt packages

Remove personal access token (PAT) authentication when cloning git URLs for dbt packages.

Renamed GitHub repo did not generate new Hub entry

Overview

As mentioned here, https://github.com/tuva-health/core was renamed to https://github.com/tuva-health/core.

When the script ran on Heroku, it successfully cloned the repo:

2022-11-17 15:03:57 INFO Drawing down tuva-health's the_tuva_project
2022-11-17 15:03:57 INFO cloning https://github.com/tuva-health/the_tuva_project.git to /app/target/tuva-health_the_tuva_project
2022-11-17 15:03:59 INFO pulling main at /app/target/tuva-health_the_tuva_project

But it did not collect tags and write index.json when appropriate (like it successfully did for terminology and data_profiling):

...
2022-11-17 15:04:48 INFO no new tags for terminology. Skipping...
2022-11-17 15:04:48 INFO collecting tags for data_profiling
2022-11-17 15:04:48 INFO pkg hub tags: []
2022-11-17 15:04:48 INFO pkg remote tags: ['0.1.0']
2022-11-17 15:04:48 INFO creating task to add new tags ['0.1.0'] to data_profiling
...
2022-11-17 15:04:50 INFO writing index.json to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/index.json
2022-11-17 15:04:50 INFO
2022-11-17 15:04:50 INFO downloading: https://codeload.github.com/tuva-health/data_profiling/tar.gz/0.1.0
2022-11-17 15:04:50 INFO SHA1: e3c25af1078d845d3ba1065249eb76676120888b
2022-11-17 15:04:50 INFO writing spec to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/versions/0.1.0.json
2022-11-17 15:04:50 INFO hubcap: Adding tag 0.1.0 for tuva-health/data_profiling

Next steps

Try to reproduce and troubleshoot this locally (rather than via Heroku).

Fall-back if all else fails

Manually create a pull request within https://github.com/dbt-labs/hub.getdbt.com that adds the appropriate files.

cc @tuvaforrest for visibility

Testing framework

As a contributor, I'd like a testing framework so that I can be more confident that my changes work as expected (and don't introduce bugs).

pytest seems like a logical choice.

Hubcap fails if package does not have a `master` branch

Note that this is becoming more problematic since GitHub defaults to using main now, so more packages will be created without the master branch

Relevant code.

Remove cron.sh

Allow hubcap.py to be stand-alone without needing to be called by cron.sh.

To fully deprecate cron.sh, do the following:

Update the Heroku Scheduler to be $ python3 hubcap/hubcap.py instead of $ ./cron.sh
Update from cron.sh to python3 hubcap/hubcap.py within the documentation
remove the ENV environment variable and cron.sh from the documentation
remove the ENV environment variable in Heroku
remove cron.sh

[Urgent] Bug: Build package configs from main branch (or master)

We need to ensure the hubcap script can handle packages whose primary branch is main rather than master. Currently, master is hardcoded into the script. We need conditional logic that tries main first and then falls back to master when looking for commit SHAs to turn into package specs. Otherwise, newer versions won't be added to hub.getdbt.com; the script will just break for any repo that does not retain a master branch.

Good news though! Because package specs depend only on commit shas, nothing has been changed for previous versions of packages.

This will partially address issue #69 .
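The main-then-master preference described in this issue could look roughly like the sketch below. It assumes the list of branch names has already been collected from the cloned repository; pick_branch and PREFERRED_BRANCHES are hypothetical names used only for illustration.

```python
PREFERRED_BRANCHES = ("main", "master")


def pick_branch(available_branches, preferred=PREFERRED_BRANCHES):
    """Return the first preferred branch that exists, or None if none do."""
    available = set(available_branches)
    for name in preferred:
        if name in available:
            return name
    return None


print(pick_branch(["main", "master", "dev"]))  # main
print(pick_branch(["master"]))                 # master
print(pick_branch(["develop"]))                # None
```

Returning None instead of raising leaves it to the caller to decide whether to skip the package or report an error.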

Feature: Make hubcap less dependent on config

This script requires a carefully made environment variable be present to run. It'd be nice to wrap variables in such a manner that permits using this script even when that env variable isn't present. That would be helpful for test runs. Right now, it requires an unreasonable amount of hacking to run the main script outside of a heroku cluster (and in time, we'll want to move away from Heroku in general, it seems).

Use the configured repo name as-is

The temporary folder of git repositories currently looks something like this:

...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub
...

So it is more obvious to a new developer that https://github.com/dbt-labs/hub.getdbt.com has been cloned, I'd rather it look like this:

...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub.getdbt.com
...

Contributing instructions

As a contributor, I'd like instructions so that I know how to do development and test my work.

Remove `__all__`

Proposal

Since the code within the hubcap repo isn't meant to be published as a package and it doesn't have any instances of from {module} import *, we can remove instances of __all__.

TL;DR for `__all__`

If you have __all__ in a Python module, then from {module} import * will import everything listed in __all__. Otherwise, it will import everything that does not start with an underscore.
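The difference can be seen with two small in-memory modules. This is an illustrative sketch; the module names with_all and without_all and the helpers make_module and star_import are made up for the demonstration.

```python
import sys
import types


def make_module(name, source):
    """Build an in-memory module from source text and register it for import."""
    module = types.ModuleType(name)
    exec(source, module.__dict__)
    sys.modules[name] = module
    return module


make_module("with_all", "__all__ = ['a']\na = 1\nb = 2\n_c = 3\n")
make_module("without_all", "a = 1\nb = 2\n_c = 3\n")


def star_import(module_name):
    """Return the sorted names that `from module_name import *` brings in."""
    namespace = {}
    exec(f"from {module_name} import *", namespace)
    return sorted(name for name in namespace if name != "__builtins__")


print(star_import("with_all"))     # ['a']
print(star_import("without_all"))  # ['a', 'b']
```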

Preserve the git target directory by default

Proposal

rename default target directory for git clone steps from "git-tmp" to "target"
default to preserving the target directory rather than deleting it
~~introduce a make clean for cleaning out the target directory~~ (edit: not needed for now -- can be re-requested if needed)
drop support for the GIT_TMP environment variable

Register a dbt package from a sub directory of a repository

Hi friends,

We are about to release a new dbt package and we wanted to keep everything as a monorepo in our project. That would mean that instead of pointing to a GitHub repository, we would have to point it to a subdirectory in our existing repository, for example: https://github.com/fal-ai/fal/feature-store.

I don't think this is possible right now, but we are happy to contribute if this is something you would like to see implemented.

I just had a quick glance at the code and looks like a change like this is self contained in the hubcap repository and I don't have to touch the logic how dbt-core downloads dependencies.

Reconnect Heroku to GitHub to automatically deploy

Heroku can be configured to deploy automatically from GitHub whenever a specific branch is pushed to:

Deploy > Deployment method > GitHub > Connect to GitHub

I'm guessing we could use the same FishtownBuildBot user in #152.

Helpful error message if not able to list pull requests

Background

Currently open pull requests are discovered with a URL like:
https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open

But if this API has an error response, the script will raise an exception without any indication of the cause.

Proposal

Catch any exceptions, log the exception message, and also suggest the primary causes of error:

The repository is not visible to GitHub user specified by the token
The token is lacking the applicable scopes (repo, workflow)
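One way to surface these hints is to wrap the request and re-raise with a descriptive message. The sketch below uses only the standard library; list_open_pull_requests, LIKELY_CAUSES, and the injectable opener parameter are hypothetical and are not hubcap's actual code.

```python
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger("hubcap-sketch")

LIKELY_CAUSES = (
    "the repository is not visible to the GitHub user specified by the token",
    "the token is lacking the applicable scopes (repo, workflow)",
)


def list_open_pull_requests(org_name, package_name, token, opener=urllib.request.urlopen):
    """Return open pull requests, raising a RuntimeError that names likely causes on failure."""
    url = f"https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open"
    request = urllib.request.Request(url, headers={"Authorization": f"token {token}"})
    try:
        with opener(request) as response:
            return json.load(response)
    except urllib.error.URLError as exc:  # HTTPError is a subclass of URLError
        hints = "; ".join(LIKELY_CAUSES)
        message = f"Unable to list pull requests at {url}: {exc}. Likely causes: {hints}"
        logger.error(message)
        raise RuntimeError(message) from exc
```

The opener argument exists only so the function can be exercised without network access; in normal use the default urllib.request.urlopen is called.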

Get package dependencies for specific release tag

(Similar to #21)

Hubcap uses the default branch, rather than the specific release, to get information about packages (dependencies). Instead, for each tagged release it's adding, it should check out that tag before introspecting packages.

This is causing issues right now for fivetran_utils. They can't change the default branch (it's pointed to by some older versions of other packages), so they want to cut releases from non-default branches instead. Everything works except for the packages dict created by Hubcap.

Authenticated HTTPS URLs using The Simplest Bullet™

Use the The Simplest Bullet™ strategy to authenticate pushes to the https://github.com/dbt-labs/hub.getdbt.com repo.

i.e., handle all 3 git URL styles as authenticated HTTPS URLs using a personal access token (PAT).
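A rough sketch of such a normalization follows. The issue does not spell out the three URL styles, so the ones handled here (HTTPS, SCP-like SSH, and ssh://) are an assumption, as are the names GITHUB_URL_PATTERNS and authenticated_https_url.

```python
import re

# Assumed URL styles (not confirmed by the issue): HTTPS, SCP-like SSH, and ssh://.
GITHUB_URL_PATTERNS = (
    re.compile(r"^https://github\.com/(?P<org>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"),
    re.compile(r"^git@github\.com:(?P<org>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?$"),
    re.compile(r"^ssh://git@github\.com/(?P<org>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?$"),
)


def authenticated_https_url(git_url, token):
    """Rewrite a GitHub URL into an HTTPS URL that embeds a personal access token."""
    for pattern in GITHUB_URL_PATTERNS:
        match = pattern.match(git_url)
        if match:
            return f"https://{token}@github.com/{match['org']}/{match['repo']}.git"
    raise ValueError(f"Unrecognized GitHub URL: {git_url}")


print(authenticated_https_url("git@github.com:dbt-labs/hub.getdbt.com.git", "TOKEN"))
```

Because the resulting URL contains the token, it should be kept out of log output.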

Use `main` as the default branch

Background

The default branch for https://github.com/dbt-labs/hub.getdbt.com is currently master. It is hard-coded within Python in a place or two within this repo (https://github.com/dbt-labs/hubcap). Use git grep "master" ./ to discover those locations. It is unknown if this branch name is also specified within Netlify.

Next steps

Discover if the branch name is hard-coded within Netlify configuration -- update the following steps depending on the discoveries
Copy master to main within https://github.com/dbt-labs/hub.getdbt.com
Update hubcap to point to main branch
Redeploy hubcap
Drop master branch in hub.getdbt.com

Bump version of Python

Bump version of Python. Currently 3.9.11:

https://devcenter.heroku.com/articles/python-support#supported-runtimes

Update this file:

runtime.txt

`pre-commit` hooks

Proposal

Use pre-commit hooks to perform automated checking before a local commit is even allowed.

Implementation

add a minimal .pre-commit-config.yaml configuration file
include installation and usage instructions in CONTRIBUTING

Upgrading to the Latest Heroku Stack

Log output

This app is using the Heroku-20 stack, however a newer stack is available.
To upgrade to Heroku-22, see:
https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack

Resources

https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack

hubcap IndexError: list index out of rangeExceptionFatal

We got the following error which first appeared at 2023-01-11T18:05:23.005324+00:00 (which I believe was the first run after merging #222):

File "/app/hubcap/hubcap.py", line 70, in <module>
    new_branches = package.commit_version_updates_to_hub(
  File "/app/hubcap/package.py", line 125, in commit_version_updates_to_hub
    branch_name, org_name, package_name = task.run(hub_dir_path, pr_strategy)
  File "/app/hubcap/records.py", line 104, in run
    new_index_entry = self.make_index(
  File "/app/hubcap/records.py", line 178, in make_index
    latest_version = version_numbers[-1]
IndexError: list index out of rangeExceptionFatal

One of the packages that was added only has a single tag, and it is 0.1.0-b1. We do have other packages on the hub with similar tags, dbt_utils 1.0.0-b2 being one example.

Regardless of whether this is expected or not, this doesn't seem like something that should cause the script to error out.
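A defensive way to pick the latest version is to return a sentinel when no final release exists, rather than indexing into a possibly empty list. This is a sketch under assumptions: FINAL_RELEASE and latest_final_version are hypothetical names, and the real make_index in records.py may be structured differently.

```python
import re

# Final releases only: MAJOR.MINOR.PATCH with no prerelease suffix.
FINAL_RELEASE = re.compile(r"^v?(\d+)\.(\d+)\.(\d+)$")


def latest_final_version(tags):
    """Return the highest final release among tags, or None if there is none."""
    finals = []
    for tag in tags:
        match = FINAL_RELEASE.match(tag)
        if match:
            finals.append((tuple(int(part) for part in match.groups()), tag))
    if not finals:
        return None  # avoids indexing into an empty list
    return max(finals)[1]


print(latest_final_version(["0.1.0-b1"]))                    # None
print(latest_final_version(["0.9.0", "0.10.0", "1.0.0-b2"]))  # 0.10.0
```

The caller can then decide whether a package with no final release should be skipped or reported, instead of the whole run failing.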

`pytest` within GitHub Actions

Use GitHub Actions to run the existing pytest suite on pull requests.

ad hoc executions of the hubcap script

Add documentation on how to do ad hoc executions of the hubcap script (rather than waiting for the frequency specified within the Heroku Scheduler).

Final ad hoc step is something like this:

heroku run ./cron.sh

Or in the near future, just:

heroku run python3 hubcap/hubcap.py

Code static type checker pre-commit hook

Use mypy for static type checking within a pre-commit hook.

Example configs:

Hubcap creates a very large number of branches on the hub.getdbt.com repo

https://github.com/fishtown-analytics/hub.getdbt.com/branches

This feels like too many branches to me :)

Ignore commits in the blame view on GitHub

Be able to run automatic formatters without affecting the display of what revision and author last modified each line of a file (i.e., git blame).

Package on the Hub with only prerelease tags

When all the tagged versions for a package are prereleases, then hubcap will generate a PR, but Netlify won't be able to build it, so it can't auto-merge. This will hold up all other packages until it is resolved.

This is caused by a confluence of issues:

hubcap will generate PRs whenever there is a tag with a valid semantic version (includes both final releases as well as pre-releases)
hubcap will assign the latest final version when it can find one and "" otherwise
hub.getdbt.com requires that latest is a valid version (and will break otherwise)

Easiest solution:

Remove pre-release only packages from the Hub when they are discovered (ideally not adding them to hub.json until there is a tag with a final release)

Support prerelease identifiers without hyphens

Prompted by @NiallRees's exciting 1.0.0b1 release of dbt_artifacts: https://github.com/brooklyn-data/dbt_artifacts/releases/tag/1.0.0b1

hubcap/hubcap/version.py

Lines 14 to 15 in d82a8a2

    
    # regex taken from official SEMVER documentation site
    match = re.match('^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$',

I think that's SemVer-official, but Python/pip actually doesn't support the hyphen. From PEP 440:

Semantic versions containing a hyphen (pre-releases - clause 10) or a plus sign (builds - clause 11) are not compatible with this PEP and are not permitted in the public version field.

So, it's pretty common for folks to use the hyphenless prerelease identifier, even if it's not "real" semver. The Core team stumbled across this inconsistency a few months ago (dbt-labs/dbt-core#4741), and decided to let it be. The dbt-core semver logic supports both:

>>> from dbt.semver import VersionSpecifier
>>> VersionSpecifier.from_version_string("1.0.0b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)
>>> VersionSpecifier.from_version_string("1.0.0-b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)

IMO:

We should aim for consistency between Hubcap + dbt-core (though it's a very good thing that Hubcap removed its dependency on dbt-core!)
We should accept both 1.0.0b1 and 1.0.0-b1 as valid semantic version identifiers, with prerelease suffix b1.
It's just a matter of adding one little qmark (?) to the regex string
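As a sketch of what that change might look like, the pattern below is the SemVer-documented regex with the hyphen before the prerelease group made optional. The constant name SEMVER is hypothetical, and the exact placement of the question mark in hubcap's version.py is an assumption.

```python
import re

# SemVer-documented pattern, with "-" before the prerelease changed to "-?".
SEMVER = re.compile(
    r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-?(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)

for version in ["1.0.0b1", "1.0.0-b1", "1.0.0"]:
    match = SEMVER.match(version)
    print(version, match.group("prerelease"))
```

Both 1.0.0b1 and 1.0.0-b1 yield the prerelease b1, while 1.0.0 yields None. One side effect worth noting: with the hyphen optional, a string such as 1.0.01 also matches (patch 0, prerelease 1), so a stricter variant might require a hyphenless prerelease to start with a letter.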

Code formatting pre-commit hook

Use black for code formatting within a pre-commit hook

Feature: Codify package config guidelines

Hey team, we have been discussing there being rules for how a plugin project should look for it to be added to the hub. Here's what I've got so far:

has a dbt_project.yml with a name
if packages.yml exists, it lives at the root dir of the package
the package repo should not be private
prefer main to master for parsing out commits (perhaps have a way for packages to specify the branch of their choice)

Note: I had originally framed this to myself as requiring a main branch, but on second thought, I think it's better to prioritize main over master in the script logic, or perhaps even to support some kind of config file in the user's package repo as an override if they want to specify the exact branch for us to use when considering new versions. main is already prioritized by GitHub, and we can document that master has been deprecated but is still supported (since any branch can be used). That way, we don't frustrate package maintainers who haven't yet made the switch (plenty of shops are still slowly but surely transitioning over).

What other things should be added to docs about what a basic package should look like?

Feature: Add actions workflow to this repository

Code linting pre-commit hook

Use flake8 for style checking within a pre-commit hook

Lock down `requirements.txt`

Related to #108

Overview

We have wide-open versions listed in requirements.txt. This could be problematic at an inopportune time if the cache is dropped and an incompatible version of a package is installed.

The Heroku documentation states:

It’s recommended to specify explicit dependency versions in your requirements.txt file. To update this file, you can use the pip freeze command in your active virtual environment:

pip freeze > requirements.txt

But this post explains why that is still not enough.

It goes on to explain how to use pip-compile (from pip-tools) to create a locked-down requirements.txt. I believe this makes it akin to a Pipfile.lock file or poetry.lock file from Pipenv and Poetry, respectively.

Feature

Make sure the local version of Python matches that of runtime.txt precisely
Add pip-tools to the dev-requirements.txt
Rename the existing requirements.txt to requirements.in
Run pip-compile to create requirements.txt
Check-in all of the above changes to git

Default value if the `GIT_TMP` environment variable is not set

Proposal

provide a default value if the GIT_TMP environment variable is not set

Use project git config instead of global

git config is updated globally here in a bash script. However, we can use something like this using Python instead. It can be placed after the clone here.

Advantages

Scope the config specific to where it is needed rather than globally
No need to clean-up afterwards if it fails in the middle of execution
The code should be shorter overall
More people know how to read and write Python than bash
One step closer to removing the bash script altogether

Implementation details

The config.example.json should be updated to something like the following:

{
    "user": {
        "name": "dbt-hubcap",
        "email": "[email protected]",
        "token": "pe4s0n@l-@cce$$-t0k3n"
    },
    "org": "dbt-labs",
    "repo": "hub.getdbt.com",
    "push_branches": true,
    "one_branch_per_repo": true
}

Then the name and email keys should be used to set the project config for the git repo.
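A minimal sketch of setting the identity per repository from Python follows. It assumes git is available on the PATH and that the user dictionary mirrors the "user" block of the config above; the function names and the example.com email address are placeholders for illustration.

```python
import subprocess


def project_git_config_commands(repo_dir, user):
    """Build git commands that set user.name and user.email for one repository only."""
    return [
        ["git", "-C", str(repo_dir), "config", "user.name", user["name"]],
        ["git", "-C", str(repo_dir), "config", "user.email", user["email"]],
    ]


def apply_project_git_config(repo_dir, user):
    """Run the commands; without --global, git writes to the repository's own config."""
    for command in project_git_config_commands(repo_dir, user):
        subprocess.run(command, check=True)


# Placeholder values for illustration only.
example_user = {"name": "dbt-hubcap", "email": "hubcap@example.com"}
for command in project_git_config_commands("target/hub.getdbt.com", example_user):
    print(" ".join(command))
```

Since nothing is written to the global configuration, there is nothing to restore if the script fails partway through.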

Feature: improve logging and monitoring

We need a way to get build logs that's at least a touch better than heroku logs -a dbt-hubcap. This may involve a simple monitoring setup for heroku or evolve into migrating the project entirely.

Provide a useful error message if unable to push branches

If the GitHub user tied to the personal access token (PAT) is unable to push branches, the resulting stack trace is practically undecipherable.

Remove test to confirm that the GitHub token is authorized to push

Undo #164 now that #66 and #152 are confirmed to be working.

Generate project specification without dbt

Problem

dbt expectations cannot merge the newest version of their package to hub.getdbt.com because the dbt version employed by this repo's build script requires an incompatible version of core.

The script errors out at setup.

Background

To generate project specifications (i.e. the information used to generate project pages on https://hub.getdbt.com/), hubcap.py uses dbt itself to run shell commands and extract project information.

Among other things, this requires a phony dbt profile and a specific dbt version (see /requirements.txt). This introduces possible conflict with any packages using a require-dbt-version configuration tag in their dbt_project.yml.

As mentioned, dbt expectations is blocked from pushing their newest release because their pinned dbt version is not compatible with the version of dbt we use in the hubcap build script.

Solution

All functionality performed by dbt can be done using familiar Python system libraries and yaml parsing files. Let's remove Core from this build script.

Thankfully, the two files which hold this information are both yaml files located by design at root level of a dbt project--dbt_project.yml and packages.yml. Yaml parsing libraries should do the trick to put together a specification and a dependency chain.

As a bonus, any modularization of these components will be appreciated in the now and the future.

Outcomes

we unblock the dbt expectations team
we insulate ourselves from this danger

Feature: Create a dbt-labs ci user and have them make commits

Currently, we have Drew's user account supporting the access of private repos and running the script. He has an api token fixed to his name among other things.

My gut tells me this should actually be a dedicated ci user rather than something to Drew himself. If nothing else, it's a bit odd to see his face on the hub package version commits. More seriously, having a dedicated user makes it easier to act securely and without needing to bother Drew about such things in the future case other API tokens and such are needed. We can also democratize access to the pipeline across the team.

Remove overzealous filtering of prereleases

All prerelease versions are removed from hubcap whereas we should only block those that do not pass the semver regex. That said, we should also retain the behavior that the latest version goes to the latest stable version as opposed to a prerelease.

Release instructions

As a maintainer, I'd like release instructions so that I know how to perform production deployments.
