Comments (10)

korthout commented on May 20, 2024

Wow, I'm really surprised about those results. It's completely the opposite of what I'd expect.

I'll need some time to dive into this. Thanks again for reporting it and collecting all this data. 🙇

> are these workflow file URLs returning the page on GitHub for the file

👍 They appear with syntax highlighting in my browser (Brave).

korthout commented on May 20, 2024

I haven't tested it yet, but I think all that's missing is specifying a fetch depth for the fetches the action executes. I suspect it currently fetches all ancestor commits of the fetched commit, which is really unnecessary. This would explain why it takes so long 😌

See --depth in https://linux.die.net/man/1/git-fetch
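
For illustration, a depth-limited fetch of a single ref (a sketch with a placeholder ref) looks like:

git fetch origin <ref> --depth=1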

korthout commented on May 20, 2024

@winterqt Thanks for reporting this.

I've previously pinpointed the problem in nixpkgs to the change of the default ref that's checked out. Please try again without the non-default ref.

However, I don't understand how it could lead to such a speed regression. I'd actually expect that it wouldn't work. Do you have the logs of such a run for me? Perhaps these shed some light on what's going on. I might also be able to use the data of an actual run to reproduce the issue locally.

winterqt commented on May 20, 2024

Here are a few logs:

  1. v0.0.5, which corresponds to this workflow file
  2. v0.0.8, which corresponds to this workflow file. I assume this is what you meant by it expectedly failing?
  3. v0.0.8 w/ default ref, which corresponds to this workflow file. I noticed that we were using a non-default ref before opening this issue, so I took a stab at it to see if it would fix anything. This took nearly 20 minutes to run 🤯

(Side note: are these workflow file URLs returning the page on GitHub for the file (with syntax highlighting etc.), or are they returning the raw content of the file? GitHub seemingly can't make up its mind :/ might be a browser thing.)

korthout commented on May 20, 2024

EDIT: 🤔 I'm not sure how to determine the number of commits to fetch on the base branch (the PR target); perhaps we should simply fetch 1,000 commits to start. That's already a factor-of-400 improvement for nixpkgs. We can improve it further if the need arises.

Copied for visibility from #268 (review)

korthout commented on May 20, 2024

This week I've read a lot about shallow repos and grafted commits. I've played around with some fetching strategies that could work for the backport-action, and I've thought about optimizing the action for performance on larger repos.

TL;DR

We can optimize the action for fetch speed, but until then please consider cloning with the entire history (untested but may work):

- uses: actions/checkout@v3
  with:
    fetch-depth: 0

The current problems

At the moment, the action uses the git history to determine the relevant commits using git merge-base. This means that we need to know the entire history between the pull request's base and head refs. Currently, we simply fetch the entire history of each ref. This makes sure that all commits we're interested in are available to the action, and it works in all cases (on pull_request and pull_request_target, as well as in merge, rebase, and squash scenarios). But it's a change from before and is the reason you're now seeing this speed regression caused by fetching.
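
For context, a minimal sketch of what that history-based detection needs (placeholders, not the action's exact commands):

git merge-base <base> <head>             # common ancestor; requires the history to be present
git log --oneline <merge-base>..<head>   # the PR's commits that would be cherry-picked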

In the logs you shared, we see that fetching the base and head refs each takes about 10 minutes, while fetching the target only takes ~50s.

2022-07-25T03:39:51.7562163Z Detected labels on PR: backport backporttttttttt
2022-07-25T03:48:38.1331105Z From https://github.com/winterqt/nixpkgs
2022-07-25T03:48:38.1344220Z  * branch                    b0e7423bf3a9a1b1b2202b56c6a0975598ef969f -> FETCH_HEAD
2022-07-25T03:59:10.9378798Z From https://github.com/winterqt/nixpkgs
2022-07-25T03:59:10.9462819Z  * branch                      refs/pull/1/head -> FETCH_HEAD
2022-07-25T03:59:11.0342065Z Working on label backport backporttttttttt
2022-07-25T03:59:11.0351785Z Found target in label: backporttttttttt
2022-07-25T04:00:00.7954800Z From https://github.com/winterqt/nixpkgs
2022-07-25T04:00:00.8010867Z  * branch                      backporttttttttt -> FETCH_HEAD
2022-07-25T04:00:00.8011544Z  * [new branch]                backporttttttttt -> origin/backporttttttttt

If we don't want to fetch the entire history but use a shallow repo instead, then we need to fetch an arbitrary number of commits from the base branch. This number would depend on how quickly the repo adds new commits. For large repos like Nixpkgs, the base is growing very rapidly (I think it's at ~900 commits per week for Nixpkgs now), so this may not be the best approach anymore.
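
For illustration, such a fixed-depth fetch of the base would look something like this (1,000 is an arbitrary starting depth, as noted above):

git fetch origin <base> --depth=1000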

A different approach

Instead of determining the commits to cherry-pick from the repo's history (i.e. git merge-base), we can ask GitHub to tell us about the commits on the PR. Both the REST API and the GraphQL API provide endpoints for this.

Note that these are both paged and limited to a maximum of 250 commits (a limit I'm okay with).
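
As a sketch, the REST endpoint for this is GET /repos/{owner}/{repo}/pulls/{pull_number}/commits; for example, via the gh CLI (placeholder owner, repo, and PR number):

gh api repos/<owner>/<repo>/pulls/<#>/commits --paginate --jq '.[].sha'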

When we know the exact commit SHAs, we can shallowly fetch only the relevant parts. This should provide optimal performance for repos with large histories, and it doesn't negatively impact young repos.

Cloning the repo shallowly

A current design choice is that the action can backport a PR to multiple targets. This means it needs to check out each target's latest commit and then cherry-pick the relevant commits from the PR on top of it. So we cannot simply clone a single target branch shallowly; we'll need to start from the PR head instead.

Sadly, it is not possible with git clone to directly check out refs/pull/<#>/head as a branch. However, it seems possible with the checkout action, so we can look into that further optimization later on. For now, we'll need to clone the repo's default branch shallowly to get started:

git clone <URL> --depth=1

Fetch only the relevant commits to cherry-pick

Now we can fetch the relevant commits. To retrieve those, we can fetch refs/pull/<#>/head. However, if we were to fetch this shallowly, the deepest commit would be a grafted commit that contains all changes up to that point. This would effectively remove the ability to cherry-pick it, because it contains many more changes than the original commit on origin.

To work around this, we can simply fetch 1 commit deeper than the number of commits on the PR (n):

git fetch origin refs/pull/<#>/head --depth=<n+1>
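
For example, for a hypothetical PR number 42 with 3 commits on it:

git fetch origin refs/pull/42/head --depth=4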

(for each target:) Fetch the target and cherry-pick the commits

The target can again be shallowly fetched as a grafted commit, because we don't really care about that latest commit's history. We just want to cherry-pick on top of it.

git fetch origin <target>:<target> --depth=1
git switch -c backport-<#>-to-<target> <target>

We can verify whether the commits that we want to cherry-pick are available to us:

git cat-file -t <sha1>
git cat-file -t <sha2>
etc
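
As a usage note, git cat-file -t prints the object's type (commit) when it is available and fails otherwise, so a small sketch to check all shas at once could be:

for sha in <sha1> <sha2>; do
  git cat-file -t "$sha" || echo "missing commit: $sha"
done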

And now we can cherry-pick them.

git cherry-pick -x <sha1> <sha2> etc

In my experiments, I was unable to cherry-pick a range of commits this way. Perhaps this is because of the grafted commits 🤷 But for now it's fine to reference them individually. If the command becomes too long, we can even cherry-pick them one at a time.

git cherry-pick -x <sha1>
git cherry-pick -x <sha2>
etc
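
Putting these steps together, a hedged end-to-end sketch of the whole strategy (using the same placeholders; this is not the action's actual implementation):

git clone <URL> repo --depth=1
cd repo

# n = the number of commits on the PR, as reported by the GitHub API;
# fetch 1 deeper so the oldest PR commit is not a grafted commit
git fetch origin refs/pull/<#>/head --depth=<n+1>

# for each target:
git fetch origin <target>:<target> --depth=1
git switch -c backport-<#>-to-<target> <target>
git cherry-pick -x <sha1> <sha2>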

What's blocking us from doing this?

The tests are too coupled to the implementation and don't really allow such a refactoring. Recently, I've been working on https://github.com/korthout/backport-action-test as a way to test only the behavior. I don't want to make the old tests work with this new way. Instead, I'd prefer to replace them with more cases in backport-action-test. Sadly these tests are still a bit brittle and slow. I hope to improve that soon.

But for now, there aren't really many test cases in place that support these changes. So we'd have to test these changes more or less in production. That may be okay, but:

Is there anything I can try in the meantime?

I think we can try checking out the repo differently and potentially see performance similar to v0.0.5.

If we clone with fetch-depth 0 (entire history), I think that the following fetches should complete pretty much instantly.

- uses: actions/checkout@v3
  with:
    fetch-depth: 0

@winterqt Could you please give this a try?

winterqt commented on May 20, 2024

Firstly, thank you for all the research and work you've put into this; I greatly appreciate it.

Secondly, sure -- I assume you want me to bump the action to v0.0.8 and try keeping our fetch-depth at 0, right?

korthout commented on May 20, 2024

> I assume you want me to bump the action to v0.0.8 and try keeping our fetch-depth at 0, right?

👍 Yes, that's correct. Curious about your results 🙇

> I greatly appreciate it.

❤️ Thanks, that's great to hear.

korthout commented on May 20, 2024

It's taken me some time, but I'm happy to report that I was able to run a lot of successful tests with a new version of the backport-action that implements the approach I described above. 🎉

That means:

  • a true shallow clone (1 commit deep) using:

    - uses: actions/checkout@v3
      with:
        fetch-depth: 1

  • the action fetching only the necessary commits containing the changes (using git fetch --depth=<x+1>)
  • the action fetching only a single commit for each target (using git fetch --depth=1)

I was able to test this:

  • as a PR created locally in the repo that is backported on pull_request[closed]
  • as a PR created in a fork of the repo that is backported on pull_request_target[closed]

@winterqt I'd love to receive some feedback on this from you. Does it really speed up the checkout and the backporting in a large project like Nixpkgs?

Please try it out by switching the checkout action's fetch-depth to 1 and using a special version of the backport action:

- uses: actions/checkout@v3
  with:
    fetch-depth: 1
- name: Backport
  uses: zeebe-io/backport-action@korthout-fetch-depth

If these changes solve the problem as expected, this will form the basis of v1.0 of this action.

korthout commented on May 20, 2024

@winterqt My team has switched to v1-rc1 to further battle-test this version before releasing v1 (see #289). We've already seen a dramatic performance boost on our repo with just 17k commits.

Backporting a pull request to 2 branches went from ~1m on v0.0.9 to just 11s on v1-rc1 in the camunda/zeebe repo 🌱

#296 (comment)

Of course, I'm curious about the effect this release will have on Nixpkgs (430k commits), but I don't recommend switching the repository to this release candidate just yet, as it hasn't been fully battle-tested. I just wanted to keep you in the loop.
