Comments (15)

jtcohen6 commented on August 16, 2024

We use "sparse checkout" for package subdirectories installed via the git method: https://docs.getdbt.com/docs/building-a-dbt-project/package-management#project-subdirectories

The installation mechanism for the package method (= Hub registry packages) doesn't actually use git, though; it just downloads tarball contents from the URL supplied by GitHub's API (codeload.github.com).
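For illustration, a minimal sketch of that download flow (illustrative only, not hubcap's or dbt-core's actual code):

import io
import tarfile

import requests

# Illustrative sketch of the "package method" flow described above; not
# hubcap's or dbt-core's actual code. GitHub's codeload endpoint serves a
# gzipped tarball of the entire repository at a given tag.
def download_registry_package(org: str, repo: str, version: str, dest: str) -> None:
    url = f"https://codeload.github.com/{org}/{repo}/tar.gz/{version}"
    response = requests.get(url, timeout=30)
    response.raise_for_status()
    # The archive always contains everything in the repo; this endpoint has
    # no way to request only a subdirectory.
    with tarfile.open(fileobj=io.BytesIO(response.content), mode="r:gz") as tar:
        tar.extractall(dest)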

elongl commented on August 16, 2024

@jtcohen6 Yes, of course! It also makes much more sense to me with a slight change.
I think it's better design to keep the organization within the key rather than specify it every time.

{
    "organizations": {
        "elementary-data": {
            "packages": [
                {
                    "repo_name": "dbt-data-reliability",
                    "subdirectory": "dbt_project"
                }
            ]
        }
    }
}

elongl commented on August 16, 2024

We'd really love this feature as well!
Will also like to contribute if possible.

We're an open-source project maintaining two repositories: one is the dbt package itself, and the other is a Python package that relies on it. Managing the two repositories is quite challenging, because we have to sync features and logic between them instead of having a single feature branch that affects both.
Currently we often end up with two branches with the same name in both repos that need to be merged at the same time, and we basically have to do everything twice: CI/CD, creating a release, and so on.

We'd love to merge those repos if it were possible.
Hubcap is the sole reason we haven't done it already.

joellabes commented on August 16, 2024

@turbo1912 and @elongl thanks for opening this, and sorry that it took a while to get back to you!

Conceptually, I'm very supportive of this, especially if it doesn't impact how dbt-core downloads things.

The thing giving me pause is that we don't have any testing around the project at the moment (other than a couple of semver tests that @dbeatty10 added in #133), so we can't really guarantee that changes won't cause downstream issues.

I would say go for it, with the disclaimer that code review might be slow as we try to rustle up someone who knows how hubcap works. We'll also probably wind up dragging in someone from @dbt-labs/core to double check that there aren't any flow-on effects.

elongl commented on August 16, 2024

Awesome! Glad to hear you and the team are on board.
I'll post a rough design of how I'm planning to implement it here in a couple of days, so I can get your approval and start working on it. Thanks a lot.

jtcohen6 commented on August 16, 2024

I definitely see the value here!

I think this may be tricky for us to do in the current implementation of Hubcap + the Hub site (hub.getdbt.com), though not impossible. The Hub site does not actually store/mirror specific files; it's just a pointer to a GitHub tarball URL, containing a zipped version of all files from the repo. E.g. for dbt_utils version 0.8.6: https://codeload.github.com/dbt-labs/dbt-utils/tar.gz/0.8.6

If we wanted to proceed with this as an extension of the current implementation, I think it would need to include:

  • A new Hub API field, subdirectory
  • An update to some methods in dbt-core's deps logic (download_and_untar, untar_package) to accept a subdirectory parameter, pull + rename only that subdirectory if a subdirectory argument is supplied, and delete the remaining files (this will require careful testing on any OS with odd file permissions, a.k.a. Windows)

Note that this will still require downloading the entire contents of the repo, and then quickly deleting the contents we don't care about. That could still pose a risk on containerized / disk-limited file systems, if the overall repo is truly massive.
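A rough sketch of what a subdirectory-aware extraction step could look like (a hypothetical helper, not dbt-core's actual download_and_untar / untar_package):

import tarfile
from typing import Optional

# Hypothetical helper, not dbt-core's actual untar logic. As noted above, the
# whole tarball still has to be downloaded; the filter just avoids writing
# (and later deleting) files outside the requested subdirectory.
def untar_subdirectory(tar_path: str, dest_dir: str, subdirectory: Optional[str] = None) -> None:
    with tarfile.open(tar_path, "r:gz") as tar:
        members = tar.getmembers()
        # codeload tarballs contain a single top-level folder, e.g. "dbt-utils-0.8.6/"
        top_level = members[0].name.split("/")[0]
        if subdirectory:
            prefix = f"{top_level}/{subdirectory.strip('/')}/"
            members = [m for m in members if m.name.startswith(prefix)]
        tar.extractall(dest_dir, members=members)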

A longer-term answer probably looks like the Hub graduating to support its own file-hosting capabilities, rather than using GitHub as a backend. In that future, something like this should be much simpler to implement, and better: the filtering can happen during package registration / upload, rather than in every single package download.

Related issue: dbt-labs/dbt-core#4868

elongl commented on August 16, 2024

Hi @jtcohen6
Really happy to see that you support the idea.
I think I'll begin working on it this weekend and submit a PR.

Too bad GitHub doesn't provide a way to download only a specific file path within a repository 😞
Where should the user specify the subdirectory within the API, especially the hub.json?

I imagined it as something like this:

    "<organization>": [
        "<repo>/<subdirectory>"
    ]

For instance,

    "elementary-data": [
        "dbt-data-reliability/dbt_project"
    ]

So basically, if there's a subdirectory, the entry specifies a path within the repository, with the repo name and subdirectory separated by a /.
What do you think?
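For illustration, parsing such an entry could be a one-liner (hypothetical helper, not existing hubcap code):

# Hypothetical parsing of a "<repo>/<subdirectory>" entry; not existing hubcap code.
def split_entry(entry: str):
    repo_name, _, subdirectory = entry.partition("/")
    return repo_name, (subdirectory or None)

split_entry("dbt-data-reliability/dbt_project")  # -> ("dbt-data-reliability", "dbt_project")
split_entry("dbt-utils")                         # -> ("dbt-utils", None)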

lostmygithubaccount commented on August 16, 2024

Would using git-sparse-checkout help? https://git-scm.com/docs/git-sparse-checkout. It effectively allows you to download only a subdirectory; I'm not sure how easy it is to use, though.

elongl commented on August 16, 2024

Would using git-sparse-checkout help? https://git-scm.com/docs/git-sparse-checkout. It effectively allows you to download only a subdirectory; I'm not sure how easy it is to use, though.

Interesting suggestion! I don't think we'll be able to entirely solve the problem with it, since in order to use it you need to have the repository already cloned, at which point you've already downloaded all the files. But it might be a cleaner way to exclude the unnecessary files than deleting everything else.
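For reference, the suggested flow looks roughly like this when driven from Python (a sketch only, not part of hubcap or dbt-core; assumes git >= 2.25):

import subprocess

# Sketch of the git-sparse-checkout flow suggested above; not hubcap or
# dbt-core code. Note that without a partial-clone filter the full pack is
# still fetched, which is the limitation discussed in this comment.
def sparse_clone(repo_url: str, subdirectory: str, dest: str) -> None:
    subprocess.run(["git", "clone", "--depth", "1", "--no-checkout", repo_url, dest], check=True)
    subprocess.run(["git", "-C", dest, "sparse-checkout", "init", "--cone"], check=True)
    subprocess.run(["git", "-C", dest, "sparse-checkout", "set", subdirectory], check=True)
    # Populate the working tree with only the files matched by the sparse patterns.
    subprocess.run(["git", "-C", dest, "checkout"], check=True)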

joellabes commented on August 16, 2024

A pleasant side effect: if this ticket were done, experimental packages (such as insert_by_period) could stay on the Hub instead of needing to be specified as git subdirectories.

elongl commented on August 16, 2024

@jtcohen6 @joellabes
Bumping this before I start working on it.
Would appreciate your confirmation, thanks!

jtcohen6 commented on August 16, 2024

@elongl Thanks for the bump!

That feels doable, but we'd end up splitting on the / and storing the repo name separately from the subdirectory name. At the risk of much more verbose JSON, would it be better for us to change the data structure in hub.json?

[
    {
        "org_name": "<org_name>",
        "packages": [
            {
                "repo_name": "<repo_name>",
                "subdirectory": "<subdirectory>"
            }
        ]
    }
]

So for instance:

[
    {
        "org_name": "elementary-data",
        "packages": [
            {
                "repo_name": "dbt-data-reliability",
                "subdirectory": "dbt_project"
            }
        ]
    }
]

jtcohen6 commented on August 16, 2024

@elongl Fair point! I can't think of another org-level property that we'd need to specify.

We could opt for as much conciseness as possible, while still allowing for structure where we need it. The org name could be the key, and its value would be a list of packages, with type List[Union[str, Dict[str, str]]]:

{
    "elementary-data": [
        "dbt-repo-non-subdirectory",
        {
            "repo_name": "dbt-data-reliability",
            "subdirectory": "dbt_project"
        }
    ]
}

I'm realizing this would also offer a (roundabout) way of resolving dbt-labs/dbt-core#4868 (a way to ignore/exclude large unnecessary files). Package maintainers could move the "essential" components of the package to a subdirectory, and then specify that subdirectory in hub.json. In effect, dbt-core would ignore (not copy) / delete the other files. It's not a perfect solution, but it feels like a reasonable workaround, until we have our own dedicated infrastructure for hosting package files.
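To make the shape concrete, a hypothetical normalization of that structure (not existing hubcap code) might look like:

from typing import Dict, List, Optional, Tuple, Union

PackageSpec = Union[str, Dict[str, str]]

# Hypothetical normalization of the proposed hub.json shape; not existing
# hubcap code. Plain strings are whole-repo packages; dict entries carry an
# explicit "subdirectory".
def iter_packages(hub: Dict[str, List[PackageSpec]]) -> List[Tuple[str, str, Optional[str]]]:
    result = []
    for org_name, packages in hub.items():
        for spec in packages:
            if isinstance(spec, str):
                result.append((org_name, spec, None))
            else:
                result.append((org_name, spec["repo_name"], spec.get("subdirectory")))
    return result

iter_packages({
    "elementary-data": [
        "dbt-repo-non-subdirectory",
        {"repo_name": "dbt-data-reliability", "subdirectory": "dbt_project"},
    ]
})
# -> [("elementary-data", "dbt-repo-non-subdirectory", None),
#     ("elementary-data", "dbt-data-reliability", "dbt_project")]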

domenic-donato commented on August 16, 2024

We're also looking to set up this tool in our monorepo. Is this now possible? If so, what are the steps we should follow?

dbeatty10 commented on August 16, 2024

Thanks for letting us know your interest @domenic-donato.

This is still a feature request that hasn't been implemented or released.
