dbt-labs / hubcap
This app adds modules to the hubsite at hub.getdbt.com
The default branch for https://github.com/dbt-labs/hub.getdbt.com is currently master. It is hard-coded within Python in a place or two within this repo (https://github.com/dbt-labs/hubcap). Use git grep "master" ./ to discover those locations. It is unknown if this branch name is also specified within Netlify.
- master to main within https://github.com/dbt-labs/hub.getdbt.com
- main branch
- master branch in hub.getdbt.com

Currently, all the pull requests for hub.getdbt.com look like they are coming from Drew. Example here:
We use the FishtownBuildBot service account for automated pull requests like this:
I added some comments to hubcap during the refactor but that's just the basics. I think it would be swell to add how the script works, notes about the ecosystem, user directions that dovetail.
Documentation is endless, so here's my proposed scope for this issue:
As a maintainer, I'd like release instructions so that I know how to perform production deployments.
This app is using the Heroku-20 stack, however a newer stack is available.
To upgrade to Heroku-22, see:
https://devcenter.heroku.com/articles/upgrading-to-the-latest-stack
Hey team, we have been discussing there being rules for how a plugin project should look for it to be added to the hub. Here's what I've got so far:
Note: I had originally framed this to myself as requiring a main branch, but on second thought, I think it's better to prioritize main over master in the script logic, or perhaps even just have some kind of config file in the user package repos as a possible override if they want to specify the exact branch for us to use when considering new versions. main is already prioritized by GitHub and we can document that master has been deprecated but is still supported (since any branch can be used). That way, we don't frustrate package maintainers that haven't yet made the switch (plenty of shops are still slowly but surely transitioning over).
What other things should be added to docs about what a basic package should look like?
Prompted by @NiallRees's exciting 1.0.0b1 release of dbt_artifacts: https://github.com/brooklyn-data/dbt_artifacts/releases/tag/1.0.0b1
Lines 14 to 15 in d82a8a2
I think that's SemVer-official, but Python/pip actually doesn't support the hyphen. From PEP 440:
Semantic versions containing a hyphen (pre-releases - clause 10) or a plus sign (builds - clause 11) are not compatible with this PEP and are not permitted in the public version field.
So, it's pretty common for folks to use the hyphenless prerelease identifier, even if it's not "real" semver. The Core team stumbled across this inconsistency a few months ago (dbt-labs/dbt-core#4741), and decided to let it be. The dbt-core semver logic supports both:
>>> from dbt.semver import VersionSpecifier
>>> VersionSpecifier.from_version_string("1.0.0b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)
>>> VersionSpecifier.from_version_string("1.0.0-b1")
VersionSpecifier(major='1', minor='0', patch='0', prerelease='b1', build=None, matcher=<Matchers.EXACT: '='>)
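The tolerance shown above (an optional hyphen before the prerelease suffix) could be expressed with a regex along these lines. This is only a minimal sketch: the pattern and the `parse_version` helper are hypothetical and are not hubcap's actual regex, and build metadata is ignored for brevity.

```python
import re

# Hypothetical pattern: accepts "1.0.0", "1.0.0-b1" and the hyphenless "1.0.0b1".
# A hyphenless suffix must start with a letter so "1.0.01" is not misread.
VERSION_PATTERN = re.compile(
    r"^(?P<major>0|[1-9]\d*)\.(?P<minor>0|[1-9]\d*)\.(?P<patch>0|[1-9]\d*)"
    r"(?:-(?P<hyphenated>[0-9A-Za-z][0-9A-Za-z.-]*)|(?P<bare>[A-Za-z][0-9A-Za-z.]*))?$"
)


def parse_version(tag):
    """Return (major, minor, patch, prerelease) or None if the tag does not match."""
    match = VERSION_PATTERN.match(tag)
    if match is None:
        return None
    prerelease = match.group("hyphenated") or match.group("bare")
    return (
        int(match.group("major")),
        int(match.group("minor")),
        int(match.group("patch")),
        prerelease,
    )
```

Both spellings then yield the same prerelease suffix, so downstream code does not need to care which one a package maintainer used.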
IMO:
- 1.0.0b1 and 1.0.0-b1 as valid semantic version identifiers, with prerelease suffix b1.
- ?) to the regex string

- make clean for cleaning out the target directory
- GIT_TMP environment variable

As a maintainer of a community package for dbt, I want to run automated testing through a CI service so that quality assurance is automatically included in the workflow for a wide range of databases.
✅ Both GitHub Actions (GHA) and Circle CI have free plans available to cover the continuous integration (CI) piece.
❌ Compute costs. Lack of free hosts for the full range of dbt database adapters (Postgres, BigQuery, Snowflake, etc).
Use GitHub Actions to run the existing pytest suite on pull requests.
An estimated 20K(!) CI pipelines in GitHub are affected by this 🤯
See example solution here:
dbt-labs/dbt-core#6252
Here's what the error looks like in our CI pipeline:
Having a script named setup.py has historically meant something specific within Python projects.
To avoid confusion and surprise, let's just rename it.
Remove personal access token (PAT) authentication when cloning git URLs for dbt packages.
Currently open pull requests are discovered with a URL like:
https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open
But if this API has an error response, the script will raise an exception without any indication of the cause.
Catch any exceptions, log the exception message, and also suggest the primary causes of error:
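A minimal sketch of that behaviour, using only the standard library, might look like the following. The function names, the `opener` parameter (there to make the call testable), and the hint text are all assumptions; in particular, `LIKELY_CAUSES` is a placeholder and not the list of causes from the issue.

```python
import json
import logging
import urllib.error
import urllib.request

logger = logging.getLogger("hubcap_sketch")

# Placeholder hint text (an assumption, not the issue's actual list of causes).
LIKELY_CAUSES = "check the access token, the org/package names, and API rate limits"


def open_pulls_url(org_name, package_name):
    """Build the URL used to discover currently open pull requests."""
    return f"https://api.github.com/repos/{org_name}/{package_name}/pulls?state=open"


def fetch_open_pulls(org_name, package_name, opener=urllib.request.urlopen):
    """Return the parsed JSON response, or None after logging the failure."""
    url = open_pulls_url(org_name, package_name)
    try:
        with opener(url) as response:
            return json.loads(response.read().decode("utf-8"))
    except (urllib.error.URLError, ValueError) as exc:
        # HTTPError is a URLError; a malformed JSON body raises a ValueError.
        logger.error("Could not list open pull requests at %s: %s", url, exc)
        logger.error("Possible causes: %s", LIKELY_CAUSES)
        return None
```

Returning None keeps the sketch short; a real script might prefer to re-raise after logging so the run still fails visibly.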
If the GitHub user tied to the personal access token (PAT) is unable to push branches, the resulting stack trace is practically undecipherable.
Be able to run automatic formatters without affecting the display of what revision and author last modified each line of a file (i.e., git blame).
Read more here:
Once this file is in place, the following can be run locally:
git blame --ignore-revs-file .git-blame-ignore-revs
This file will also be taken into account by GitHub with the git blame view.
All prerelease versions are removed from hubcap whereas we should only block those that do not pass the semver regex. That said, we should also retain the behavior that the latest version goes to the latest stable version as opposed to a prerelease.
It sounds like someone on the dbt Labs engineering team will come to own this stack. If/when that is decided, the relevant parties should be added to the CODEOWNERS file.
Until then, update the file to reflect the current dedicated team members.
Use pre-commit hooks to perform automated checking before a local commit is even allowed.
- .pre-commit-config.yaml configuration file

Use flake8 for style checking within a pre-commit hook
Add documentation on how to do ad hoc executions of the hubcap script (rather than waiting for the frequency specified within the Heroku Scheduler).
Final ad hoc step is something like this:
heroku run ./cron.sh
Or in the near future, just:
heroku run python3 hubcap/hubcap.py
We need to ensure the hubcap script is prepared to handle packages with main, then prioritize that if there's no master branch. Currently, master is hardcoded into the script. We need main conditional logic that uses main then master when looking for commit shas to change into package specs. Otherwise, newer versions won't be added to hub.getdbt.com; the script will just break for any repo that does not retain a master branch.
Good news though! Because package specs depend only on commit shas, nothing has been changed for previous versions of packages.
This will partially address issue #69 .
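The "main, then master" priority described above can be sketched as a small helper. The function name and signature are hypothetical and are not taken from the hubcap code base.

```python
def choose_branch(remote_branches, preferred=("main", "master")):
    """Return the first preferred branch that exists, or None if neither does."""
    available = set(remote_branches)
    for name in preferred:
        if name in available:
            return name
    return None
```

For example, `choose_branch(["master", "dev"])` falls back to `"master"`, while a repo that has both branches resolves to `"main"`.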
As mentioned here, https://github.com/tuva-health/the_tuva_project was renamed to https://github.com/tuva-health/core.
When the script ran on Heroku, it successfully cloned the repo:
2022-11-17 15:03:57 INFO Drawing down tuva-health's the_tuva_project
2022-11-17 15:03:57 INFO cloning https://github.com/tuva-health/the_tuva_project.git to /app/target/tuva-health_the_tuva_project
2022-11-17 15:03:59 INFO pulling main at /app/target/tuva-health_the_tuva_project
But it did not collect tags and write index.json when appropriate (like it successfully did for terminology and data_profiling):
...
2022-11-17 15:04:48 INFO no new tags for terminology. Skipping...
2022-11-17 15:04:48 INFO collecting tags for data_profiling
2022-11-17 15:04:48 INFO pkg hub tags: []
2022-11-17 15:04:48 INFO pkg remote tags: ['0.1.0']
2022-11-17 15:04:48 INFO creating task to add new tags ['0.1.0'] to data_profiling
...
2022-11-17 15:04:50 INFO writing index.json to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/index.json
2022-11-17 15:04:50 INFO
2022-11-17 15:04:50 INFO downloading: https://codeload.github.com/tuva-health/data_profiling/tar.gz/0.1.0
2022-11-17 15:04:50 INFO SHA1: e3c25af1078d845d3ba1065249eb76676120888b
2022-11-17 15:04:50 INFO writing spec to /app/target/hub.getdbt.com/data/packages/tuva-health/data_profiling/versions/0.1.0.json
2022-11-17 15:04:50 INFO hubcap: Adding tag 0.1.0 for tuva-health/data_profiling
Try to reproduce and troubleshoot this locally (rather than via Heroku).
Manually create a pull request within https://github.com/dbt-labs/hub.getdbt.com that adds the appropriate files.
cc @tuvaforrest for visibility
(Similar to #21)
Hubcap uses the default branch, rather than the specific release, to get information about packages (dependencies). Instead, for each tagged release it's adding, it should check out that tag before introspecting packages.
This is causing issues right now for fivetran_utils. They can't change the default branch (it's pointed to by some older versions of other packages), so they want to cut releases from non-default branches instead. Everything works except for the packages dict created by Hubcap.
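One way to read a file as it existed at a given tag is sketched below. Note the swap: instead of checking the tag out, this uses `git show <tag>:<path>`, which reads the file from the tagged commit without touching the working tree. The helper name and the injectable `runner` parameter are assumptions made for illustration.

```python
import subprocess


def show_file_at_tag(repo_dir, tag, path="packages.yml", runner=subprocess.run):
    """Return the text of `path` as committed at `tag`, or None if git reports an error."""
    result = runner(
        ["git", "-C", str(repo_dir), "show", f"{tag}:{path}"],
        capture_output=True,
        text=True,
        check=False,
    )
    if result.returncode != 0:
        # For example, the file does not exist at that tag.
        return None
    return result.stdout
```

Because the working tree is never modified, the same clone can be inspected for several tags in a row without checkouts in between.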
It is possible to configure automatic deploys from GitHub to deploy automatically whenever a specific branch is pushed to:
I'm guessing we could use the same FishtownBuildBot user in #152.
https://github.com/fishtown-analytics/hub.getdbt.com/branches
This feels like too many branches to me :)
Allow hubcap.py to be stand-alone without needing to be called by cron.sh.
To fully deprecate cron.sh, do the following:
- $ python3 hubcap/hubcap.py instead of $ ./cron.sh
- cron.sh to python3 hubcap/hubcap.py within the documentation
- ENV environment variable and cron.sh from the documentation
- ENV environment variable in Heroku
- cron.sh
Use black for code formatting within a pre-commit hook
Related to #108
We have wide-open versions listed in requirements.txt. This could be problematic at an inopportune time if the cache is dropped and an incompatible version of a package is installed.
The Heroku documentation states:
It’s recommended to specify explicit dependency versions in your requirements.txt file. To update this file, you can use the pip freeze command in your active virtual environment:
pip freeze > requirements.txt
But this post explains why that is still not enough.
It goes on to explain how to use pip-compile (from pip-tools) to create a locked-down requirements.txt. I believe this makes it akin to a Pipfile.lock file or poetry.lock file from Pipenv and Poetry, respectively.
- runtime.txt precisely
- pip-tools to the dev-requirements.txt
- requirements.txt to requirements.in
- pip-compile to create requirements.txt
dbt expectations cannot merge the newest version of their package to hub.getdbt.com because the dbt version employed by this repo's build script requires an incompatible version of core.
The script errors out at setup.
To generate project specifications (i.e. the information used to generate project pages on https://hub.getdbt.com/), hubcap.py uses dbt itself to run shell commands and extract project information.
Among other things, this requires a phony dbt profile and a specific dbt version (see /requirements.txt). This introduces possible conflict with any packages using a require-dbt-version configuration tag in their dbt_project.yml.
As mentioned, dbt expectations is blocked from pushing their newest release because their pinned dbt version is not compatible with the version of dbt we use in the hubcap build script.
All functionality performed by dbt can be done using familiar Python system libraries and yaml parsing files. Let's remove Core from this build script.
Thankfully, the two files which hold this information are both yaml files located by design at root level of a dbt project--dbt_project.yml and packages.yml. Yaml parsing libraries should do the trick to put together a specification and a dependency chain.
As a bonus, any modularization of these components will be appreciated in the now and the future.
When all the tagged versions for a package are prereleases, then hubcap will generate a PR, but Netlify won't be able to build it, so it can't auto-merge. This will hold up all other packages until it is resolved.
This is caused by a confluence of issues:
- latest final version when it can find one and "" otherwise
- latest is a valid version (and will break otherwise)

Easiest solution:

Other solutions:
- latest version

I'm taking the easiest solution in the short-term and have a PR ready that implements the first "other" solution.
A problem with using the easiest solution long-term is that simple human error can trigger this situation again in the future. I'd rather prevent it entirely.
Use mypy for static type checking within a pre-commit hook.
Example configs:
Currently, we have Drew's user account supporting the access of private repos and running the script. He has an api token fixed to his name among other things.
My gut tells me this should actually be a dedicated ci user rather than something to Drew himself. If nothing else, it's a bit odd to see his face on the hub package version commits. More seriously, having a dedicated user makes it easier to act securely and without needing to bother Drew about such things in the future case other API tokens and such are needed. We can also democratize access to the pipeline across the team.
Since the code within the hubcap repo isn't meant to be published as a package and it doesn't have any instances of from {module} import *, we can remove instances of __all__.
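As background, the effect of `__all__` on a star import can be observed with a small self-contained experiment. The helper below is purely illustrative (the `demo_module` name is made up): it builds a throwaway module from source text and reports which names a star import would bring in.

```python
import sys
import types


def star_import_names(source, module_name="demo_module"):
    """Return the names that `from <module> import *` would import for `source`."""
    module = types.ModuleType(module_name)
    exec(source, module.__dict__)
    sys.modules[module_name] = module  # register temporarily so the import resolves
    try:
        namespace = {}
        exec(f"from {module_name} import *", namespace)
    finally:
        del sys.modules[module_name]
    # exec() adds __builtins__ to the namespace; it is not part of the import.
    return sorted(name for name in namespace if name != "__builtins__")
```

With no `__all__`, only names without a leading underscore are imported; with `__all__` defined, exactly the listed names are imported, even underscore-prefixed ones.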
- __all__
- __all__ in a Python module, then from {module} import * will import everything listed in __all__. Otherwise, it will import everything that does not start with an underscore.

The temporary folder of git repositories currently looks something like this:
...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub
...
So it is more obvious to a new developer that https://github.com/dbt-labs/hub.getdbt.com has been cloned, I'd rather it look like this:
...
fivetran_dbt_recurly
fivetran_dbt_recurly_source
hub.getdbt.com
...
As a contributor, I'd like a testing framework so that I can be more confident that my changes work as expected (and don't introduce bugs).
pytest seems like a logical choice.
#185 added a pre-commit hook for flake8.
Use pre-commit run --all-files to find any issues and then fix them.
Use The Simplest Bullet™ strategy to authenticate pushes to the https://github.com/dbt-labs/hub.getdbt.com repo.
i.e., handle all 3 git URL styles as authenticated HTTPS URLs using a personal access token (PAT).
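A rough sketch of that normalization follows. The issue does not spell out which three URL styles are meant, so the ones handled here (HTTPS, `ssh://git@...`, and scp-style `git@host:org/repo`) are an assumption, as are the pattern and the function name.

```python
import re

# Assumed URL styles: https://github.com/org/repo, ssh://git@github.com/org/repo,
# and git@github.com:org/repo, each with an optional ".git" suffix.
GITHUB_URL = re.compile(
    r"^(?:https://|ssh://git@|git@)github\.com[:/]"
    r"(?P<org>[^/]+)/(?P<repo>[^/]+?)(?:\.git)?/?$"
)


def authenticated_https_url(git_url, token):
    """Rewrite a GitHub URL as an HTTPS URL that carries a personal access token."""
    match = GITHUB_URL.match(git_url)
    if match is None:
        raise ValueError(f"Unrecognized GitHub URL: {git_url}")
    return f"https://{token}@github.com/{match['org']}/{match['repo']}.git"
```

Since the token ends up inside the URL, such URLs should be kept out of log output.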
Note that this is becoming more problematic since GitHub defaults to using main now, so more packages will be created without the master branch
Bump version of Python. Currently 3.9.11:
Update this file:
We got the following error which first appeared at 2023-01-11T18:05:23.005324+00:00 (which I believe was the first run after merging #222):
File "/app/hubcap/hubcap.py", line 70, in <module>
new_branches = package.commit_version_updates_to_hub(
File "/app/hubcap/package.py", line 125, in commit_version_updates_to_hub
branch_name, org_name, package_name = task.run(hub_dir_path, pr_strategy)
File "/app/hubcap/records.py", line 104, in run
new_index_entry = self.make_index(
File "/app/hubcap/records.py", line 178, in make_index
latest_version = version_numbers[-1]
IndexError: list index out of range
ExceptionFatal
One of the packages that was added only has a single tag, and it is 0.1.0-b1. We do have other packages on the hub with similar tags, dbt_utils 1.0.0-b2 being one example.
Regardless of whether this is expected or not, this doesn't seem like something that should cause the script to error out.
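A defensive way to pick the latest version, so that an empty list of final releases cannot raise IndexError, is sketched below. The helper is hypothetical and not the code in records.py; it assumes the tags are already sorted oldest to newest, and falling back to the newest prerelease is only one possible policy.

```python
import re

# Assumption: a "final" release is exactly three dot-separated numbers.
FINAL_RELEASE = re.compile(r"^\d+\.\d+\.\d+$")


def pick_latest(sorted_tags):
    """Return the newest final release, else the newest prerelease, else None."""
    if not sorted_tags:
        return None
    finals = [tag for tag in sorted_tags if FINAL_RELEASE.match(tag)]
    return finals[-1] if finals else sorted_tags[-1]
```

With this shape, a package whose only tag is 0.1.0-b1 yields that tag rather than an exception, and a package with any stable release still resolves to its latest stable version.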
- GIT_TMP environment variable is not set

Hi friends,
We are about to release a new dbt package and we wanted to keep everything as a monorepo in our project. That would mean that, instead of pointing to a GitHub repository, we would have to point it to a subdirectory in our existing repository, for example: https://github.com/fal-ai/fal/feature-store.
I don't think this is possible right now, but we are happy to contribute if this is something you would like to see implemented.
I just had a quick glance at the code and it looks like a change like this is self-contained in the hubcap repository, so I don't have to touch the logic of how dbt-core downloads dependencies.
As a contributor, I'd like instructions so that I know how to do development and test my work.
git config is updated globally here in a bash script. However, we can use something like this using Python instead. It can be placed after the clone here.
The config.example.json should be updated to something like the following:
{
    "user": {
        "name": "dbt-hubcap",
        "email": "[email protected]",
        "token": "pe4s0n@l-@cce$$-t0k3n"
    },
    "org": "dbt-labs",
    "repo": "hub.getdbt.com",
    "push_branches": true,
    "one_branch_per_repo": true
}
Then the name and email keys should be used to set the project config for the git repo.
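One possible shape for this, sketched with the standard library only, is shown below. The helper names and the injectable `runner` parameter are assumptions, and the sketch shells out to `git config` rather than using whichever Python library the linked example relies on. Without `--global`, `git config` writes to the repository's own config.

```python
import subprocess


def git_identity_commands(repo_dir, user):
    """Build the git commands that set user.name and user.email for one repo."""
    return [
        ["git", "-C", str(repo_dir), "config", "user.name", user["name"]],
        ["git", "-C", str(repo_dir), "config", "user.email", user["email"]],
    ]


def configure_git_identity(repo_dir, user, runner=subprocess.run):
    """Apply the identity from the `user` section of the config to the cloned repo."""
    for command in git_identity_commands(repo_dir, user):
        runner(command, check=True)
```

It would be called once right after the clone, with `user` taken from the "user" object of the JSON config.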
This script requires a carefully made environment variable to be present in order to run. It'd be nice to wrap variables in a manner that permits using this script even when that env variable isn't present. That would be helpful for test runs. Right now, it requires an unreasonable amount of hacking to run the main script outside of a Heroku cluster (and in time, we'll want to move away from Heroku in general, it seems).
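A small wrapper along these lines would let the script start without the variable being set. The variable name `CONFIG`, the function name, and the idea of falling back to a caller-supplied default are all assumptions made for illustration.

```python
import json
import os


def load_config(environ=os.environ, var_name="CONFIG", default=None):
    """Parse JSON config from an environment variable, or fall back to `default`."""
    raw = environ.get(var_name)
    if raw is None:
        # Variable not set: return a copy of the fallback (empty if none was given).
        return dict(default or {})
    return json.loads(raw)
```

A test run could then call `load_config(default={"push_branches": False})` and never touch the production settings.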
Let's bring this repo up to date with other dbt-labs repos in this small but nonetheless important way.