dbt-labs / hubcap

This app adds modules to the hubsite at hub.getdbt.com

Python 100.00%

hubcap's Issues

Use `main` as the default branch

Background

The default branch for https://github.com/dbt-labs/hub.getdbt.com is currently master. It is hard-coded within Python in a place or two within this repo (https://github.com/dbt-labs/hubcap). Use git grep "master" ./ to discover those locations. It is unknown if this branch name is also specified within Netlify.

Next steps

Discover if the branch name is hard-coded within Netlify configuration -- update the following steps depending on the discoveries
Copy master to main within https://github.com/dbt-labs/hub.getdbt.com
Update hubcap to point to main branch
Redeploy hubcap
Drop master branch in hub.getdbt.com

Feature: improve logging and monitoring

We need a way to get build logs that's at least a touch better than heroku logs -a dbt-hubcap. This may involve a simple monitoring setup for heroku or evolve into migrating the project entirely.

Use the FishtownBuildBot service account for automated pull requests

Problem

Currently, all the pull requests for hub.getdbt.com look like they are coming from Drew. Example here:

Proposed solution

We use the FishtownBuildBot service account for automated pull requests like this:

dbt-labs/dbt-core#6112

Implementation

Login to GitHub using the FishtownBuildBot service account
Generate a personal access token (PAT)
Login to Heroku
Update all the hubcap applications (both production and non-production) to reflect the new PAT
Also update the user name and email address like this

Feature: Add docs and readme

I added some comments to hubcap during the refactor but that's just the basics. I think it would be swell to add how the script works, notes about the ecosystem, user directions that dovetail.

Documentation is endless, so here's my proposed scope for this issue:

briefly describe the overall workflow architecture to help new users understand the package ecosystem, especially those who want to contribute packages
a link to notes about adding a repo
notes on adding a new package to the hub
contribution notes for the repo itself

Release instructions

As a maintainer, I'd like release instructions so that I know how to perform production deployments.

Upgrading to the Latest Heroku Stack

Log output

This app is using the Heroku-20 stack, however a newer stack is available.
To upgrade to Heroku-22, see:
https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack

Resources

https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack

Feature: Add actions workflow to this repository

Feature: Codify package config guidelines

Hey team, we have been discussing there being rules for how a plugin project should look for it to be added to the hub. Here's what I've got so far:

has a dbt_project.yml with a name
if packages.yml exists, it lives at the root dir of the package
the package repo should not be private
prefer main to master for parsing out commits (perhaps have a way for packages to specify the branch of their choice)

Note: I had originally framed this to myself as requiring a main branch, but on second thought, I think it's better to prioritize main over master in the script logic, or perhaps even to allow some kind of config file in the user's package repo as an override if they want to specify the exact branch for us to use when considering new versions. main is already prioritized by GitHub and we can document that master has been deprecated but is still supported (since any branch can be used). That way, we don't frustrate package maintainers that haven't yet made the switch (plenty of shops are still slowly but surely transitioning over).

What other things should be added to docs about what a basic package should look like?

Support prerelease identifiers without hyphens

Prompted by @NiallRees's exciting 1.0.0b1 release of dbt_artifacts: https://github.com/brooklyn-data/dbt_artifacts/releases/tag/1.0.0b1

hubcap/hubcap/version.py

Lines 14 to 15 in d82a8a2

    
# regex taken from official SEMVER documentation site
match = re.match('^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)(?:-(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$',

I think that's SemVer-official, but Python/pip actually doesn't support the hyphen. From PEP 440:

Semantic versions containing a hyphen (pre-releases - clause 10) or a plus sign (builds - clause 11) are not compatible with this PEP and are not permitted in the public version field.

So, it's pretty common for folks to use the hyphenless prerelease identifier, even if it's not "real" semver. The Core team stumbled across this inconsistency a few months ago (dbt-labs/dbt-core#4741), and decided to let it be. The dbt-core semver logic supports both:

>>> from dbt.semver import VersionSpecifier
>>> VersionSpecifier.from_version_string("1.0.0b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)
>>> VersionSpecifier.from_version_string("1.0.0-b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)

IMO:

We should aim for consistency between Hubcap + dbt-core (though it's a very good thing that Hubcap removed its dependency on dbt-core!)
We should accept both 1.0.0b1 and 1.0.0-b1 as valid semantic version identifiers, with prerelease suffix b1.
It's just a matter of adding one little qmark (?) to the regex string
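To make the "one little qmark" idea concrete, here is a minimal sketch. It assumes the same pattern quoted above with only the hyphen before the prerelease group made optional; the names SEMVER_PATTERN and parse_version are hypothetical and are not taken from hubcap/version.py.

```python
import re

# Assumption: identical to the pattern quoted above, except that the hyphen
# before the prerelease group is optional ("-?").
SEMVER_PATTERN = re.compile(
    r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-?(?P<prerelease>(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*)"
    r"(?:\.(?:0|[1-9]\d*|\d*[a-zA-Z-][0-9a-zA-Z-]*))*))?"
    r"(?:\+(?P<buildmetadata>[0-9a-zA-Z-]+(?:\.[0-9a-zA-Z-]+)*))?$"
)


def parse_version(tag):
    """Return the named groups for a version tag, or None if it does not match."""
    match = SEMVER_PATTERN.match(tag)
    return match.groupdict() if match else None


# Both spellings yield the same prerelease suffix.
print(parse_version("1.0.0b1")["prerelease"])
print(parse_version("1.0.0-b1")["prerelease"])
```

One side effect worth weighing: with a bare optional hyphen, a tag such as 1.0.01 would also match, parsed as patch 0 with prerelease 1. If that is undesirable, the hyphenless form could be restricted to prereleases that start with a letter.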

Preserve the git target directory by default

Proposal

rename default target directory for git clone steps from "git-tmp" to "target"
default to preserving the target directory rather than deleting it
~~introduce a make clean for cleaning out the target directory~~ (edit: not needed for now -- can be re-requested if needed)
drop support for the GIT_TMP environment variable

Enable range of database hosts to empower CI for maintainers of community packages

Goal

As a maintainer of a community package for dbt, I want to run automated testing through a CI service so that quality assurance is automatically included in the workflow for a wide range of databases.

Solved

✅ Both GitHub Actions (GHA) and Circle CI have free plans available to cover the continuous integration (CI) piece.

Problem

❌ Compute costs. Lack of free hosts for the full range of dbt database adapters (Postgres, BigQuery, Snowflake, etc).

Options discussed thus far

Reimburse compute costs through an expense reporting mechanism or stipend
Clone these repos on instances internal to dbt Labs and execute tests
Enable access to database instances internal to dbt Labs for community maintainers

Primary negative trade-off for each option

Complicated on the finance side of things
Disempowering for community maintainers and asynchronous delays not conducive to typical development workflows
Complicated on the security side of things

`pytest` within GitHub Actions

Use GitHub Actions to run the existing pytest suite on pull requests.

Flake8 took down the gitlab repository in favor of github

An estimated 20K(!) CI pipelines in GitHub are affected by this 🤯

See example solution here:
dbt-labs/dbt-core#6252

Here's what the error looks like in our CI pipeline:

Rename `setup.py`

Having a script named setup.py has historically meant something specific within Python projects.

To avoid confusion and surprise, let's just rename it.

Remove authentication when cloning dbt packages

Remove personal access token (PAT) authentication when cloning git URLs for dbt packages.

Helpful error message if not able to list pull requests

Background

Currently open pull requests are discovered with a URL like:
https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open

But if this API has an error response, the script will raise an exception without any indication of the cause.

Proposal

Catch any exceptions, log the exception message, and also suggest the primary causes of error:

The repository is not visible to GitHub user specified by the token
The token is lacking the applicable scopes (repo, workflow)
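A rough sketch of what this proposal could look like follows. It is written against the standard library (urllib) rather than whatever HTTP client hubcap actually uses, and the function name and the injectable opener parameter are hypothetical, added only so the behavior can be exercised without network access.

```python
import json
import logging
import urllib.error
import urllib.request


def list_open_pull_requests(org_name, package_name, token,
                            opener=urllib.request.urlopen):
    """Return the open pull requests for a repo, logging likely causes on failure."""
    url = f"https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open"
    request = urllib.request.Request(
        url, headers={"Authorization": f"token {token}"}
    )
    try:
        with opener(request) as response:
            return json.load(response)
    except urllib.error.URLError as error:  # HTTPError is a subclass of URLError
        logging.error("Unable to list open pull requests at %s: %s", url, error)
        logging.error(
            "Likely causes: the repository is not visible to the GitHub user "
            "specified by the token, or the token is lacking the applicable "
            "scopes (repo, workflow)."
        )
        raise
```

Re-raising after logging keeps the failure visible to the caller; returning an empty list instead would be a different design choice and might hide real problems.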

Provide a useful error message if unable to push branches

If the GitHub user tied to the personal access token (PAT) is unable to push branches, the resulting stack trace is practically undecipherable.

Ignore commits in the blame view on GitHub

Be able to run automatic formatters without affecting the display of what revision and author last modified each line of a file (i.e., git blame).

Remove overzealous filtering of prereleases

All prerelease versions are removed from hubcap whereas we should only block those that do not pass the semver regex. That said, we should also retain the behavior that the latest version goes to the latest stable version as opposed to a prerelease.
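The intended behavior could be sketched roughly as below. The pattern here is a deliberately simplified stand-in for the real semver regex (it only recognizes prereleases that start with a letter), and select_versions is a hypothetical helper, not a function from hubcap.

```python
import re

# Assumption: a simplified stand-in for the full semver regex, for illustration only.
SIMPLE_VERSION = re.compile(
    r"^(\d+)\.(\d+)\.(\d+)(?:-?([A-Za-z][0-9A-Za-z.-]*))?$"
)


def select_versions(tags):
    """Keep every tag that parses; pick the latest among final releases only."""
    parsed = []
    for tag in tags:
        match = SIMPLE_VERSION.match(tag)
        if match:
            major, minor, patch, prerelease = match.groups()
            parsed.append(((int(major), int(minor), int(patch)), prerelease, tag))

    # Prereleases stay in the list of versions...
    all_versions = [tag for _, _, tag in parsed]
    # ...but only final releases compete for "latest".
    stable = sorted((key, tag) for key, prerelease, tag in parsed if prerelease is None)
    latest = stable[-1][1] if stable else None
    return all_versions, latest


versions, latest = select_versions(["0.9.0", "1.0.0-b1", "1.0.0", "1.1.0b1", "junk"])
```

In this sketch latest is None when a package has no final release at all, which leaves the caller to decide how to handle that case.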

Update the CODEOWNERS file

It sounds like someone on the dbt Labs engineering team will come to own this stack. If/when that is decided, the relevant parties should be added to the CODEOWNERS file.

Until then, update the file to reflect the current dedicated team members.

`pre-commit` hooks

Proposal

Use pre-commit hooks to perform automated checking before a local commit is even allowed.

Implementation

add a minimal .pre-commit-config.yaml configuration file
include installation and usage instructions in CONTRIBUTING

Code linting pre-commit hook

Use flake8 for style checking within a pre-commit hook

ad hoc executions of the hubcap script

Add documentation on how to do ad hoc executions of the hubcap script (rather than waiting for the frequency specified within the Heroku Scheduler).

Final ad hoc step is something like this:

heroku run ./cron.sh

Or in the near future, just:

heroku run python3 hubcap/hubcap.py

[Urgent] Bug: Build package configs from main branch (or master)

We need to ensure the hubcap script can handle packages whose default branch is main, and fall back to master only when main does not exist. Currently, master is hardcoded into the script. We need conditional logic that tries main first and then master when looking for commit SHAs to turn into package specs. Otherwise, newer versions won't be added to hub.getdbt.com; the script will just break for any repo that does not retain a master branch.

Good news though! Because package specs depend only on commit shas, nothing has been changed for previous versions of packages.

This will partially address issue #69 .
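The main-then-master fallback described in this issue might look something like the sketch below. choose_branch is a hypothetical helper, and how the list of available branches is obtained from the cloned repo is left out.

```python
def choose_branch(available_branches, preferred=("main", "master")):
    """Return the first preferred branch that exists, or None if none do."""
    for name in preferred:
        if name in available_branches:
            return name
    return None


# A repo that only has main, one that only has master, and one with neither.
print(choose_branch(["main", "feature/x"]))
print(choose_branch(["master"]))
print(choose_branch(["develop"]))
```

Keeping the preference order as a parameter would also leave room for a per-package override later, if that idea is pursued.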

Renamed GitHub repo did not generate new Hub entry

Overview

As mentioned here, https://github.com/tuva-health/core was renamed to https://github.com/tuva-health/core.

When the script ran on Heroku, it successfully cloned the repo:

2022-11-17 15:03:57 INFO Drawing down tuva-health's the_tuva_project
2022-11-17 15:03:57 INFO cloning https://github.com/tuva-health/the_tuva_project.git to /app/target/tuva-health_the_tuva_project
2022-11-17 15:03:59 INFO pulling main at /app/target/tuva-health_the_tuva_project

But it did not collect tags and write index.json when appropriate (like it successfully did for terminology and data_profiling):

...
2022-11-17 15:04:48 INFO no new tags for terminology. Skipping...
2022-11-17 15:04:48 INFO collecting tags for data_profiling
2022-11-17 15:04:48 INFO pkg hub tags: []
2022-11-17 15:04:48 INFO pkg remote tags: ['0.1.0']
2022-11-17 15:04:48 INFO creating task to add new tags ['0.1.0'] to data_profiling
...
2022-11-17 15:04:50 INFO writing index.json to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/index.json
2022-11-17 15:04:50 INFO
2022-11-17 15:04:50 INFO downloading: https://codeload.github.com/tuva-health/data_profiling/tar.gz/0.1.0
2022-11-17 15:04:50 INFO SHA1: e3c25af1078d845d3ba1065249eb76676120888b
2022-11-17 15:04:50 INFO writing spec to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/versions/0.1.0.json
2022-11-17 15:04:50 INFO hubcap: Adding tag 0.1.0 for tuva-health/data_profiling

Next steps

Try to reproduce and troubleshoot this locally (rather than via Heroku).

Fall-back if all else fails

Manually create a pull request within https://github.com/dbt-labs/hub.getdbt.com that adds the appropriate files.

cc @tuvaforrest for visibility

Get package dependencies for specific release tag

(Similar to #21)

Hubcap uses the default branch, rather than the specific release, to get information about packages (dependencies). Instead, for each tagged release it's adding, it should check out that tag before introspecting packages.

This is causing issues right now for fivetran_utils. They can't change the default branch (it's pointed to by some older versions of other packages), so they want to cut releases from non-default branches instead. Everything works except for the packages dict created by Hubcap.

Reconnect Heroku to GitHub to automatically deploy

It is possible to configure Heroku to deploy automatically from GitHub whenever a specific branch is pushed to:

Deploy > Deployment method > GitHub > Connect to GitHub

I'm guessing we could use the same FishtownBuildBot user in #152.

Hubcap creates a very large number of branches on the hub.getdbt.com repo

https://github.com/fishtown-analytics/hub.getdbt.com/branches

This feels like too many branches to me :)

Remove cron.sh

Allow hubcap.py to be stand-alone without needing to be called by cron.sh.

To fully deprecate cron.sh, do the following:

Update the Heroku Scheduler to be $ python3 hubcap/hubcap.py instead of $ ./cron.sh
Update from cron.sh to python3 hubcap/hubcap.py within the documentation
remove the ENV environment variable and cron.sh from the documentation
remove the ENV environment variable in Heroku
remove cron.sh

Code formatting pre-commit hook

Use black for code formatting within a pre-commit hook

Lock down `requirements.txt`

Related to #108

Overview

We have wide-open versions listed in requirements.txt. This could be problematic at an inopportune time if the cache is dropped and an incompatible version of a package is installed.

The Heroku documentation states:

It’s recommended to specify explicit dependency versions in your requirements.txt file. To update this file, you can use the pip freeze command in your active virtual environment:

pip freeze > requirements.txt

But this post explains why that is still not enough.

It goes on to explain how to use pip-compile (from pip-tools) to create a locked-down requirements.txt. I believe this makes it akin to a Pipfile.lock file or poetry.lock file from Pipenv and Poetry, respectively.

Feature

Make sure the local version of Python matches that of runtime.txt precisely
Add pip-tools to the dev-requirements.txt
Rename the existing requirements.txt to requirements.in
Run pip-compile to create requirements.txt
Check-in all of the above changes to git

Generate project specification without dbt

Problem

dbt expectations cannot merge the newest version of their package to hub.getdbt.com because the dbt version employed by this repo's build script requires an incompatible version of core.

The script errors out at setup.

Background

To generate project specifications (i.e. the information used to generate project pages on https://hub.getdbt.com/), hubcap.py uses dbt itself to run shell commands and extract project information.

Among other things, this requires a phony dbt profile and a specific dbt version (see /requirements.txt). This introduces possible conflict with any packages using a require-dbt-version configuration tag in their dbt_project.yml.

As mentioned, dbt expectations is blocked from pushing their newest release because their pinned dbt version is not compatible with the version of dbt we use in the hubcap build script.

Solution

All functionality performed by dbt can be done using familiar Python system libraries and yaml parsing files. Let's remove Core from this build script.

Thankfully, the two files which hold this information are both yaml files located by design at root level of a dbt project--dbt_project.yml and packages.yml. Yaml parsing libraries should do the trick to put together a specification and a dependency chain.

As a bonus, any modularization of these components will be appreciated in the now and the future.

Outcomes

we unblock the dbt expectations team
we insulate ourselves from this danger

Package on the Hub with only prerelease tags

When all the tagged versions for a package are prereleases, then hubcap will generate a PR, but Netlify won't be able to build it, so it can't auto-merge. This will hold up all other packages until it is resolved.

This is caused by a confluence of issues:

hubcap will generate PRs whenever there is a tag with a valid semantic version (includes both final releases as well as pre-releases)
hubcap will assign the latest final version when it can find one and "" otherwise
hub.getdbt.com requires that latest is a valid version (and will break otherwise)

Easiest solution:

Remove pre-release only packages from the Hub when they are discovered (ideally not adding them to hub.json until there is a tag with a final release)

Remove test to confirm that the GitHub token is authorized to push

Undo #164 now that #66 and #152 are confirmed to be working.

Code static type checker pre-commit hook

Use mypy for static type checking within a pre-commit hook.

Example configs:

Feature: Create a dbt-labs ci user and have them make commits

Currently, we have Drew's user account supporting the access of private repos and running the script. He has an API token fixed to his name, among other things.

My gut tells me this should actually be a dedicated ci user rather than something tied to Drew himself. If nothing else, it's a bit odd to see his face on the hub package version commits. More seriously, having a dedicated user makes it easier to act securely, without needing to bother Drew in case other API tokens and such are needed in the future. We can also democratize access to the pipeline across the team.

Remove `__all__`

Proposal

Since the code within the hubcap repo isn't meant to be published as a package and it doesn't have any instances of from {module} import *, we can remove instances of __all__.

TL;DR for `__all__`

If you have __all__ in a Python module, then from {module} import * will import everything listed in __all__. Otherwise, it will import everything that does not start with an underscore.
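A small self-contained demonstration of that rule is below. It builds a throwaway in-memory module instead of touching any hubcap file; the helper star_import_names and the module name demo_module are made up for this illustration.

```python
import sys
import types


def star_import_names(module_source, module_name="demo_module"):
    """Return the names that `from <module> import *` would bring in."""
    module = types.ModuleType(module_name)
    exec(module_source, module.__dict__)
    sys.modules[module_name] = module  # make the module importable by name
    namespace = {}
    try:
        exec(f"from {module_name} import *", namespace)
    finally:
        del sys.modules[module_name]
    namespace.pop("__builtins__", None)  # added by exec, not by the import
    return sorted(namespace)


with_all = star_import_names(
    "__all__ = ['public']\npublic = 1\nother = 2\n_private = 3\n"
)
without_all = star_import_names("public = 1\nother = 2\n_private = 3\n")
```

Here with_all contains only the name listed in `__all__`, while without_all contains every name that does not start with an underscore.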

Use the configured repo name as-is

The temporary folder of git repositories currently looks something like this:

...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub
...

So it is more obvious to a new developer that https://github.com/dbt-labs/hub.getdbt.com has been cloned, I'd rather it look like this:

...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub.getdbt.com
...

Testing framework

As a contributor, I'd like a testing framework so that I can be more confident that my changes work as expected (and don't introduce bugs).

pytest seems like a logical choice.

Fix any issues raised via flake8

#185 added a pre-commit hook for flake8.

Use pre-commit run --all-files to find any issues and then fix them.

Authenticated HTTPS URLs using The Simplest Bullet™

Use The Simplest Bullet™ strategy to authenticate pushes to the https://github.com/dbt-labs/hub.getdbt.com repo.

i.e., handle all 3 git URL styles as authenticated HTTPS URLs using a personal access token (PAT).
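As a loose illustration only: the issue does not spell out which three URL styles are meant, so the sketch below assumes https://, git://, and git@ (SSH shorthand) URLs for github.com. The function name and the token-in-URL form are likewise assumptions, not the strategy's actual implementation.

```python
import re

# Assumption: the three styles are https://, git://, and git@github.com:org/repo.
GITHUB_URL = re.compile(
    r"^(?:https://|git://|git@)github\.com[:/]"
    r"(?P<org>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def authenticated_https_url(url, token):
    """Rewrite a GitHub URL as an HTTPS URL that embeds a personal access token."""
    match = GITHUB_URL.match(url)
    if match is None:
        raise ValueError(f"Unrecognized GitHub URL: {url}")
    return f"https://{token}@github.com/{match['org']}/{match['repo']}.git"


# "TOKEN" is a placeholder; a real PAT should come from configuration, not source.
print(authenticated_https_url("git@github.com:dbt-labs/hub.getdbt.com.git", "TOKEN"))
```

Since the token ends up inside the URL, any logging of the rewritten URL would need to mask it.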

Hubcap fails if package does not have a `master` branch

Note that this is becoming more problematic since GitHub defaults to using main now, so more packages will be created without the master branch

Relevant code.

Bump version of Python

Bump version of Python. Currently 3.9.11:

https://devcenter.heroku.com/articles/python-support#supported-runtimes

Update this file:

runtime.txt

hubcap IndexError: list index out of rangeExceptionFatal

We got the following error which first appeared at 2023-01-11T18:05:23.005324+00:00 (which I believe was the first run after merging #222):

File "/app/hubcap/hubcap.py", line 70, in <module>
    new_branches = package.commit_version_updates_to_hub(
  File "/app/hubcap/package.py", line 125, in commit_version_updates_to_hub
    branch_name, org_name, package_name = task.run(hub_dir_path, pr_strategy)
  File "/app/hubcap/records.py", line 104, in run
    new_index_entry = self.make_index(
  File "/app/hubcap/records.py", line 178, in make_index
    latest_version = version_numbers[-1]
IndexError: list index out of rangeExceptionFatal

One of the packages that was added only has a single tag, and it is 0.1.0-b1. We do have other packages on the hub with similar tags, dbt_utils 1.0.0-b2 being one example.

Regardless of whether this is expected or not, this doesn't seem like something that should cause the script to error out.
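One way to keep the script from erroring out is to guard the lookup shown in the traceback. This is only a sketch: latest_or_skip is a hypothetical helper, and it assumes version_numbers is the already-filtered list that make_index indexes with [-1].

```python
import logging

logger = logging.getLogger("hubcap-sketch")  # hypothetical logger name


def latest_or_skip(package_name, version_numbers):
    """Return the last version, or None (with a warning) if the list is empty."""
    if not version_numbers:
        logger.warning(
            "no eligible versions found for %s; skipping index update",
            package_name,
        )
        return None
    return version_numbers[-1]


# An empty list no longer raises IndexError; the caller can skip the package.
result = latest_or_skip("some_package", [])
```

Whether skipping is the right outcome, as opposed to treating the prerelease as the latest version, is a separate decision for the maintainers.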

Default value if the `GIT_TMP` environment variable is not set

Proposal

provide a default value if the GIT_TMP environment variable is not set
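A minimal sketch of such a default is below. The fallback name "git-tmp" comes from the current default directory mentioned in another issue in this list; the helper name and the environ parameter are hypothetical.

```python
import os
from pathlib import Path


def git_tmp_dir(environ=os.environ, default="git-tmp"):
    """Return the clone directory, falling back when GIT_TMP is unset or empty."""
    return Path(environ.get("GIT_TMP") or default)


# Passing a plain dict makes the behavior easy to check without touching the
# real environment.
print(git_tmp_dir({}))
print(git_tmp_dir({"GIT_TMP": "/tmp/hubcap"}))
```

Treating an empty string the same as an unset variable is a choice made here for convenience; a stricter version could use environ.get("GIT_TMP", default) instead.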

Register a dbt package from a sub directory of a repository

Hi friends,

We are about to release a new dbt package and we want to keep everything as a monorepo in our project. That would mean that instead of pointing to a GitHub repository, we would have to point to a subdirectory in our existing repository, for example: https://github.com/fal-ai/fal/feature-store.

I don't think this is possible right now, but we are happy to contribute if this is something you would like to see implemented.

I just had a quick glance at the code, and it looks like a change like this would be self-contained in the hubcap repository, so I wouldn't have to touch the logic for how dbt-core downloads dependencies.

Contributing instructions

As a contributor, I'd like instructions so that I know how to do development and test my work.

Use project git config instead of global

git config is updated globally here in a bash script. However, we can use something like this using Python instead. It can be placed after the clone here.

Advantages

Scope the config specific to where it is needed rather than globally
No need to clean-up afterwards if it fails in the middle of execution
The code should be shorter overall
More people know how to read and write Python than bash
One step closer to removing the bash script altogether

Implementation details

The config.example.json should be updated to something like the following:

{
    "user": {
        "name": "dbt-hubcap",
        "email": "[email protected]",
        "token": "pe4s0n@l-@cce$$-t0k3n"
    },
    "org": "dbt-labs",
    "repo": "hub.getdbt.com",
    "push_branches": true,
    "one_branch_per_repo": true
}

Then the name and email keys should be used to set the project config for the git repo.
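Setting the project-level config from Python could be sketched as follows. Running git config without --global writes to the repository's own config, so the calls only need to run inside the cloned directory. The function names are hypothetical, and the config dict is assumed to have the shape of the example JSON above.

```python
import subprocess


def git_config_commands(config):
    """Build the git commands that set the repo-local user name and email."""
    user = config["user"]
    return [
        ["git", "config", "user.name", user["name"]],
        ["git", "config", "user.email", user["email"]],
    ]


def apply_git_config(repo_dir, config):
    """Run the commands inside the cloned repo so only that repo is affected."""
    for command in git_config_commands(config):
        subprocess.run(command, cwd=repo_dir, check=True)


# Placeholder values for illustration only.
example = {"user": {"name": "dbt-hubcap", "email": "bot@example.com"}}
commands = git_config_commands(example)
```

Because nothing is written globally, there is nothing to clean up if the script fails partway through.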

Feature: Make hubcap less dependent on config

This script requires a carefully made environment variable be present to run. It'd be nice to wrap variables in such a manner that permits using this script even when that env variable isn't present. That would be helpful for test runs. Right now, it requires an unreasonable amount of hacking to run the main script outside of a heroku cluster (and in time, we'll want to move away from Heroku in general, it seems).

Feature: Update primary branch from master to main

Let's bring this repo up to date with other dbt-labs repos in this small but nonetheless important way.
