machulav / ec2-github-runner
On-demand self-hosted AWS EC2 runner for GitHub Actions
License: MIT License
Hi,
About 10 hours ago our build started failing with the timeout error out of the blue. I double-checked and confirmed that no infrastructure or code changes were made.
I also tried upgrading from 2.1.0 to 2.2.0, to no effect.
I confirmed networking is working fine by manually creating an instance with the same networking setup and running the same commands the action runs. I was able to register a runner with GitHub.
I watched the runners page and saw that runners were coming up and being registered fine. They were sitting in Idle while the action was polling for them. This screenshot shows my test instance, as well as the two self-hosted instances our build runs usually start up before they are terminated after the 5-minute window.
Has anyone else observed this in the last day or two? Any ideas how to proceed with this?
Thanks!
All examples I have seen use a Docker image, which has a user parameter.
But I'm not using Docker. How do I then tell the action runner to run as non-root (ubuntu in this case)?
I tried many different ways, but no matter what I do the current user remains root:
- name: Who Am I?
  run: |
    sudo su - ubuntu
    whoami
- name: Who Am I?
  run: |
    sudo -u ubuntu bash
    whoami
- name: Who Am I?
  shell: bash -l {0}
  run: |
    su - ubuntu
    whoami
I can't find anything on the EC2 side that will let me change the default user. When I connect via SSH, the prompt shows the root@ip address.
I have everything already installed and configured under ubuntu.
If this is not the right place to ask, please let me know where I can find this information; I have spent many hours searching and can't find anything.
Thank you!
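One likely explanation for the snippets above, offered as a sketch rather than a confirmed diagnosis: `sudo su - ubuntu` and `sudo -u ubuntu bash` each open a separate shell for ubuntu, that shell exits, and the following `whoami` line runs in the original shell as root. Passing the command to sudo in the same invocation avoids this. The helper name `run_as` and the `DRY_RUN` switch are hypothetical, introduced only for illustration.

```shell
# Run a command as another user in a single sudo invocation.
# run_as and DRY_RUN are illustrative names, not part of the action.
run_as() {
  target_user="$1"
  shift
  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Print the command instead of running it (handy for checking the call).
    echo "sudo -u $target_user -- $*"
  else
    sudo -u "$target_user" -- "$@"
  fi
}
```

In a workflow step this would be called as `run_as ubuntu whoami`, which should print ubuntu when the step itself runs as root and the ubuntu user exists on the instance.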
Hi!
I tried to test your example workflow: https://github.com/snussik/ec2-github-runner
But on commits to master nothing happens at all.
And the VS Code linter shows an error:
Hello, I'm having an issue with the "start" mode of this action, as the runner software on my EC2 instance does not register with GitHub. I have a feeling that this issue is caused by the instance's lack of internet connection as the action hangs at "Checking every 10 seconds to see if the runner is registered".
The instance is started and stopped correctly by the Action. This is my first application of this action in my work, so I have not implemented this action before.
I've taken the following steps so far:
The security group which is attached to the instance allows all outbound traffic
A NAT gateway was added to the VPC/Subnet containing the instance
The route table of the VPC/subnet points internet traffic to the NAT gateway
Regenerating the GitHub token used in this repository
Has anyone else encountered this issue, if so, how did you resolve it?
I'd like to be able to SSH into the machine manually to verify things. Is there a way to specify which SSH pair the EC2 instance uses?
Sorry if this was answered in the doc, I didn't see anything in there.
Is this really intended? I don't think PRs should be pushing the dist on main. Perhaps back on the branch?
Hello!
I made a fork and added Windows support + multiple runners
Still needs some fixes like de-registering the runners but spawning seems to work well!
Would be nice to fix it and integrate it with this main project
Maybe with a "os:windows" or "os:linux" input
https://github.com/yatima1460/ec2-multiple-github-runners/tree/windows-experimental
The current release uses a runner version hardcoded in aws.js, and that version is older than the latest. It would be great to get the latest release programmatically.
The GitHub API provides this information: curl -s -X GET 'https://api.github.com/repos/actions/runner/releases/latest'
The response could be parsed directly, with a jq script, or with Octokit.
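As a rough sketch of the "parsed directly" option, the function below pulls the `tag_name` field out of the JSON returned by the endpoint above using only grep and sed. The function name `parse_runner_version` is hypothetical, and the sketch assumes the tag has the form `v<version>`.

```shell
# Extract the release version from the releases/latest JSON read on stdin.
# Prints the tag_name value without its leading "v".
# parse_runner_version is an illustrative name, not part of the action.
parse_runner_version() {
  grep -o '"tag_name"[[:space:]]*:[[:space:]]*"[^"]*"' \
    | head -n 1 \
    | sed -e 's/.*:[[:space:]]*"//' -e 's/"$//' -e 's/^v//'
}
```

It would be combined with the request from the issue as `curl -s 'https://api.github.com/repos/actions/runner/releases/latest' | parse_runner_version`. A jq filter such as `.tag_name` is the sturdier choice where jq is available on the instance.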
How can I assign a public IP to the EC2 instance?
Unless I'm missing something it doesn't look like this supports multiple security group IDs when starting instances. Is that right? If so, would it be possible to add that feature?
I needed to run this for a t4g.micro instance with Amazon Linux 2:
yum install libicu60
Without it, the config step fails with the error:
Cannot get symbol ucol_setMaxVariable_50 from libicui18n
Error: /lib64/libicui18n.so.50: undefined symbol: ucol_setMaxVariable_50
👋 Internally we have a somewhat similar approach for a GCE/GCP self-hosted runner, and we are considering open-sourcing it. Do you think it would be a good idea to make this action "cloud agnostic" and add support for GCE-based self-hosted runners here, or would you suggest making a separate action?
Would it make sense to have a version fixed in our AMI instead? or maybe a download URI with a default? My organization is a bit wary about downloading things off of the web for every build. Maybe a separate enhancement?
Let's keep it separate and collect feedback from the others.
By default it takes the instance type's default storage capacity. Can we add an option to define the storage capacity?
In relation to #64, can functionality be added to allow the runner to run as a non-root user? We have a requirement for this, which I'm sure is not all that unusual, where we need a non-root user to do build tasks. We are not using Docker to build, but running directly on the EC2 instance.
I understand that the startup script is run as root, but there must be a way to tell the script to run the GitHub runner service as a different user.
It would be great if we could configure the quietPeriodSeconds and retryIntervalSeconds values. I see my EC2 instance (running Amazon Linux 2) is usually booted up and registered in GitHub much earlier than detected in the script.
Would it be possible to expose these as configurable settings within the action so we can adjust the timings? 🙂
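To make the request concrete, here is a minimal sketch of a polling loop whose quiet period, retry interval and timeout are all parameters. It is not the action's code; `wait_for` and its argument order are assumptions made for illustration.

```shell
# wait_for QUIET INTERVAL TIMEOUT COMMAND [ARGS...]
# Sleeps QUIET seconds, then runs COMMAND every INTERVAL seconds until it
# succeeds (returns 0) or TIMEOUT seconds of retrying have passed (returns 1).
# wait_for is an illustrative helper, not part of the action.
wait_for() {
  wf_quiet="$1"
  wf_interval="$2"
  wf_timeout="$3"
  shift 3
  sleep "$wf_quiet"
  wf_waited=0
  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$wf_waited" -ge "$wf_timeout" ]; then
      return 1
    fi
    sleep "$wf_interval"
    wf_waited=$((wf_waited + wf_interval))
  done
}
```

For example, `wait_for 5 3 300 runner_is_registered` would start checking after 5 seconds and retry every 3 seconds, where `runner_is_registered` stands for whatever check the caller supplies. A shorter quiet period detects fast-booting instances sooner at the cost of more API requests.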
I have spent a few days troubleshooting this issue, and it seems to be an issue with the GitHub runner.
The action involves
The following is a simplification of the code. Many packages are pre-installed in the AMI image (Docker, database client, etc.).
name: RPA deep test
on:
  push:
    branches:
      - dev
jobs:
  start-runner:
    name: Instantiate temporary EC2 instance
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.RPA_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.RPA_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.RPA_REGION }}
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.RPA_PERSONAL_ACCESS_TOKEN }}
          ec2-image-id: ${{ secrets.AMI_IMAGE }}
          ec2-instance-type: ${{ secrets.INSTANCE_TYPE }}
          subnet-id: ${{ secrets.SUBNET }}
          security-group-id: ${{ secrets.SECURITY_GROUP }}
  do-the-job:
    name: Perform tests
    needs: start-runner # required to start the main job when the runner is ready
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    steps:
      - name: Enable docker permissions
        id: docker-permission
        run: sudo chmod 666 /var/run/docker.sock
      - name: Clone repo
        id: git-clone-repo
        run: git clone https://${{secrets.GH_USER}}:${{ secrets.GH_PERSONAL_ACCESS_TOKEN }}@github.com/foo-project/foo.git --branch=${GITHUB_REF##*/}
      - name: Build frontend image
        id: build-frontend-image
        run: cd foo && docker build -t frontend-test:latest frontend/.
      - name: Build backend image
        id: build-backend-image
        run: cd sq-web-app && docker build -t backend-test:latest backend/.
      - name: Run frontend image
        id: run-frontend-image
        run: docker run --name frontend -p 3000:80 --env BACKEND=http://localhost:8080/api -d frontend-test:latest
      - name: Run backend image
        id: run-backend-image
        run: docker run --name backend -d -p 8080:8080 backend-test:latest
      - name: Run tests
        id: robot-test
        run: |
          cd sq-web-app
          sudo apt install -y python3-pip
          pip3 install robotframework
          pip3 install robotframework-browser
          python3 -m Browser.entry init
          python3 -m robot test/0*.robot
  stop-runner:
    name: Terminate EC2 Instance
    needs:
      - start-runner # required to get output from the start-runner job
      - do-the-job # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.RPA_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.RPA_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.RPA_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.RPA_PERSONAL_ACCESS_TOKEN }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}
When this is executed in GitHub actions, the robot script cannot find the backend server.
When I manually execute these commands with identical settings, everything works flawlessly.
Why does the GitHub runner not execute commands the same way as if I were to SSH into an instance and execute the commands manually? (as you would expect)
This issue should improve two things:
See more details in this discussion: #18
I may be mistaken but I didn't find any options to specify a timeout.
It seems to me that 2 types of timeouts could be implemented.
Stopping the job after a given amount of time. This could be achieved in a run statement, but having an argument like "job-timeout" could be useful.
Terminating the EC2 instance after a given amount of time. This would reduce the risk of paying for an unused EC2 instance if some edge case left it running. Having the EC2 instance terminate itself might be a way to implement this behavior. An argument like "ec2-timeout" might then be appropriate.
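One way the self-terminating idea could look, as a sketch under two assumptions that the action does not necessarily satisfy: the fragment is included in the instance's user data, and the instance is launched with its instance-initiated shutdown behavior set to terminate, so that a shutdown from inside the OS terminates the instance instead of just stopping it. The function name `self_terminate_user_data` is hypothetical.

```shell
# Print a user-data fragment that powers the instance off after a deadline.
# Assumes the instance was launched with instance-initiated shutdown
# behavior "terminate"; otherwise the instance is only stopped.
# self_terminate_user_data is an illustrative helper name.
self_terminate_user_data() {
  minutes="$1"
  echo '#!/bin/bash'
  echo "# Safety net: halt after $minutes minutes even if the stop job never runs."
  echo "shutdown -h +$minutes"
}
```

Calling `self_terminate_user_data 60` prints a script that schedules a halt 60 minutes after boot. The deadline acts as a cost ceiling, so it should be longer than the longest expected job.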
In my case, I cache a lot of intermediate compilation objects on EBS and I don't want to lose them when a workflow is finished.
So I'm wondering if we can provide an option that only stops a machine without terminating it, and also an option to support restarting from the previously stopped instance.
mode: start_existing
instances:
  - i-123
  - i-124
  - i-125
mode: stop_no_terminate
instances:
  - i-123
  - i-124
  - i-125
Related AWS APIs:
Another approach could be just to reuse the available EBS volumes that were not deleted after the previous ec2 was terminated. https://aws.amazon.com/premiumsupport/knowledge-center/deleteontermination-ebs/ I do not know which option is technically easier to implement.
Thanks.
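For comparison, the stop-without-terminate behavior proposed above can be approximated outside the action with the AWS CLI, which has `stop-instances` and `start-instances` subcommands. This is only a sketch: the wrapper names and the `AWS_BIN` override are hypothetical, and it leaves open how the runner on a restarted instance would register with GitHub again.

```shell
# Thin wrappers around the AWS CLI for stopping and restarting instances.
# EBS volumes attached to a stopped instance are kept, unlike on termination.
# stop_runners, start_runners and AWS_BIN are illustrative names.
stop_runners() {
  "${AWS_BIN:-aws}" ec2 stop-instances --instance-ids "$@"
}

start_runners() {
  "${AWS_BIN:-aws}" ec2 start-instances --instance-ids "$@"
}
```

With the instance IDs from the example, this would be `stop_runners i-123 i-124 i-125` at the end of a workflow and `start_runners i-123 i-124 i-125` at the start of the next one.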
I have configured this runner for use but when it tries to run the start-runner job it exits with
Error: GitHub Registration Token receiving error
Error: HttpError: Not Found
Error: Not Found
It seems to be falling over here:
Lines 24 to 32 in 21b339b
so I guess that /repos/{owner}/{repo}/actions/runners/registration-token is the URL which is Not Found.
I've tried ensuring that my Access Token is correct and it seems to be.
I can't imagine that github would be blocking access to their own API from the runners.
Does anyone know what I've missed?
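To separate a token problem from a problem in the action, the same endpoint can be called by hand. The sketch below assumes a personal access token is exported as `GITHUB_PAT`; that variable and both function names are hypothetical. GitHub commonly answers Not Found rather than Forbidden when a token cannot see a repository or lacks the required permission, so a 404 here would point at the token's scope or at the owner/repo values.

```shell
# Build the registration-token URL for a repository.
registration_token_url() {
  printf 'https://api.github.com/repos/%s/%s/actions/runners/registration-token\n' "$1" "$2"
}

# Request a registration token by hand (POST) using the token in GITHUB_PAT.
# GITHUB_PAT and both function names are illustrative.
request_registration_token() {
  curl -s -X POST \
    -H "Authorization: token $GITHUB_PAT" \
    -H "Accept: application/vnd.github+json" \
    "$(registration_token_url "$1" "$2")"
}
```

Running `request_registration_token OWNER REPO` with real values should return JSON containing a token if the personal access token is sufficient.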
I had been pushing frequently while iterating on a workflow and ran into an issue which I think is due to hitting the usage limit of 1,000 API calls per hour per repository.
Notice in the image that although the code reached the timeout, this error continued to repeat and the job didn't exit even after the 10-minute timeout (I manually canceled the workflow). The instance was up and running, but perhaps not in time. It never succeeded in registering, so the limit must have been hit before it could try to register.
So something in the reject/exit path is not quite working? Also, it seems likely that the interval used to check whether the runner is created should be increased or made configurable.
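A back-of-the-envelope calculation shows how quickly polling can use up such a budget. The sketch counts only the registration checks of start attempts that run until the timeout, ignoring every other API call, so real usage would be higher. The function names are hypothetical; the 10-second interval and 5-minute timeout used in the example are the values shown in the action's log messages.

```shell
# Number of registration checks one start attempt makes before timing out.
checks_per_start() {
  timeout_seconds="$1"
  interval_seconds="$2"
  echo $(( timeout_seconds / interval_seconds ))
}

# Rough number of timed-out start attempts that fit into an hourly budget.
starts_per_hour() {
  hourly_budget="$1"
  checks="$2"
  echo $(( hourly_budget / checks ))
}
```

With those values, `checks_per_start 300 10` gives 30 checks per attempt, and `starts_per_hour 1000 30` gives about 33 failing attempts per hour before a 1,000-request budget is gone. Doubling the interval roughly doubles that headroom.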
Hi colleagues,
After the instance is created, the job cannot be run on the EC2 machine.
Instances can be created and terminated in my VPC, and the security group allows outbound traffic on port 443.
What could be the cause, and where should I start troubleshooting?
Thanks, Artem.
AWS EC2 instance i-xxxxxxxx is up and running
Waiting 30s for the AWS EC2 instance to be registered in GitHub as a new self-hosted runner
Checking every 10s if the GitHub self-hosted runner is registered
Checking...
Checking...
Checking... (x N times)
Error: GitHub self-hosted runner registration error
Error: A timeout of 5 minutes is exceeded. Your AWS EC2 instance was not able to register itself in GitHub as a new self-hosted runner.
I am using a multi-account structure for my environments, and would like to be able to pass in an (optional) profile name so that I can use a simpler pattern in getting the aws credentials to this action.
As of today I'm receiving the following deprecation warning for this action
Node.js 12 actions are deprecated. For more information see: https://github.blog/changelog/2022-09-22-github-actions-all-actions-will-begin-running-on-node16-instead-of-node12/. Please update the following actions to use Node.js 16: machulav/ec2-github-runner
We are using instance types that require lots of resources. Because of this, from time to time it is impossible to start an instance of a given type in a certain region and AZ.
It would be great if this action could iterate over a list of regions/AZs (subnets and security groups) and start the instance wherever its type is available.
I'm sure that this is a configuration issue on my end but I'm not sure what the problem is.
Link to my yml:
https://github.com/choderalab/super-duper-guacamole/blob/27b75e91103c5804e1d056908cb4c97110b5a7eb/.github/workflows/self-hosted-test.yml
Link to github action log:
https://github.com/choderalab/super-duper-guacamole/runs/2273873236?check_suite_focus=true
I made the inbound traffic rules wide open to help troubleshoot this:
Inbound rules
Type | Protocol | Port range | Source | Description - optional
-- | -- | -- | -- | --
All traffic | All | All | 0.0.0.0/0 | wide open for testing
All traffic | All | All | ::/0 | wide open for testing
Outbound rules
Port range | Protocol | Destination | Security groups
-- | -- | -- | --
443 | TCP | 0.0.0.0/0 | GitHubActionSelfHostedRunner
I'm able to SSH onto the EC2 instance that gets spun up -- any other ideas on how to test?
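Since SSH access works, one simple test is to check from the instance itself whether GitHub answers over HTTPS. The function below is a minimal sketch; `can_reach` is a hypothetical name, and it treats any HTTP response as proof that the network path works.

```shell
# Return 0 if an HTTPS request to the URL gets any HTTP response within 10 s.
# can_reach is an illustrative helper name.
can_reach() {
  url="${1:-https://api.github.com}"
  curl -s -o /dev/null -m 10 "$url"
}
```

On the instance, `can_reach https://api.github.com && can_reach https://github.com && echo reachable` printing nothing would suggest a routing, NAT or DNS problem instead of a problem with the token or the action.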
Is it possible to specify the EC2 instance that is started? I don't want to spin up a new EC2 instance each time a workflow is run.
Our OSS repo has grown over the last few months to the point where we are actually seeing some issues with the GitHub API rate limits on registering a runner: https://github.com/airbytehq/airbyte/runs/5406131335?check_suite_focus=true#step:3:244
I'm wondering if anyone has seen this issue in their builds and has any recommendations on how to tackle it.
I've added another PAT from another user for the time being as a temporary solution. Thanks!
What happens if a workflow is cancelled, or GitHub goes down?
The action only uses plain RunInstances call, so launched instances will never get terminated if the stop instance step fails to run?
Hashicorp packer has similar issues because it also uses RunInstances.
https://binx.io/blog/2020/03/27/how-to-terminate-packer-instances-on-aws/
To ensure that no more than one instance is running across several consecutive pushes, the idea is to cancel the previous still-running workflows on a new push, so we added a global setting:
concurrency: # cancel previous build on a new push
  group: ${{ github.ref }} # https://docs.github.com/en/actions/reference/context-and-expression-syntax-for-github-actions#github-context
  cancel-in-progress: true
Now when I'm testing this, the stop job quite often fails to stop the instance and leaves it running, which is very expensive!
Run machulav/ec2-github-runner@v2
Error: Error: Not all the required inputs are provided for the 'stop' mode
Error: Not all the required inputs are provided for the 'stop' mode
Error: TypeError: Cannot read property 'mode' of undefined
Error: Cannot read property 'mode' of undefined
Here is the log from the start job:
with:
mode: start
github-token: ***
ec2-image-id: ami-03540b272db1624b7
ec2-instance-type: p3.8xlarge
security-group-id: sg-f2a4e2fc
subnet-id: subnet-b7533b96
aws-resource-tags: [
{"Key": "Name", "Value": "ec2-github-runner"},
{"Key": "GitHubRepository", "Value": "bigscience-workshop/Megatron-DeepSpeed"}
]
env:
AWS_DEFAULT_REGION: us-east-1
AWS_REGION: us-east-1
AWS_ACCESS_KEY_ID: ***
AWS_SECRET_ACCESS_KEY: ***
GitHub Registration Token is received
AWS EC2 instance i-038eeed014c994b48 is started
Error: The operation was canceled.
so it looks like it didn't set the vars it was supposed to set because it was cancelled.
Here is the full workflow for context:
https://github.com/bigscience-workshop/Megatron-DeepSpeed/blob/7c636d7555e915f1f426984172f73840b2168313/.github/workflows/main.yml
If there are other solutions I'm all ears.
Thank you!
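One possible fallback, sketched here as an assumption and not as something the action provides: because the instances carry the Name tag shown in the log above, a cleanup step could look them up by tag and terminate them even when the start job was cancelled before it set its outputs. The function name and the `AWS_BIN` override are hypothetical, and the approach terminates every matching instance, including ones that belong to other workflow runs still in progress.

```shell
# Terminate all pending or running instances whose Name tag matches.
# terminate_tagged_runners and AWS_BIN are illustrative names.
terminate_tagged_runners() {
  tag_value="$1"
  ids="$("${AWS_BIN:-aws}" ec2 describe-instances \
    --filters "Name=tag:Name,Values=$tag_value" \
              "Name=instance-state-name,Values=pending,running" \
    --query 'Reservations[].Instances[].InstanceId' \
    --output text)" || return 1
  # Nothing to clean up.
  [ -n "$ids" ] || return 0
  # Word splitting is intended here: one argument per instance ID.
  "${AWS_BIN:-aws}" ec2 terminate-instances --instance-ids $ids
}
```

A final job with `if: ${{ always() }}` could call `terminate_tagged_runners ec2-github-runner`. Adding a per-run value to the tags would narrow the match to a single workflow run.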
Continuation of the discussion started by @jpalomaki in #62 (comment)
Related issues that may be covered using the EC2 launch template approach:
Feature | Issue | Can be covered by EC2 launch template |
---|---|---|
Re-use runner | #4 | ? |
Spot instances | #5 | yes |
Parallel processing | #8 | ? |
Public IP | #52 | yes |
Custom storage | #53 | yes, block device mappings, e.g. larger root volume |
Re-use storage | #59 | ? |
Multiple regions/AZ support | #60 | ? |
Multiple security groups | #68 | yes |
Tags | #3 (implemented) | yes |
IAM role | #6 (implemented) | yes, via instance profile |
EC2 keypair | #74 | yes |
really like the idea of on-demand runners.
spot support will make it more appealing :)
The issue is resolved in this fork: main...theopolis:main by @theopolis
Run machulav/ec2-github-runner@v2
GitHub Registration Token is received
AWS EC2 instance i-0eeae9ef28dcd04e9 is started
AWS EC2 instance i-0eeae9ef28dcd04e9 is up and running
Waiting 30s for the AWS EC2 instance to be registered in GitHub as a new self-hosted runner
Checking every 10s if the GitHub self-hosted runner is registered
Checking...
.
.
.
Checking...
Error: GitHub self-hosted runner registration error
Checking...
Error: A timeout of 5 minutes is exceeded. Your AWS EC2 instance was not able to register itself in GitHub as a new self-hosted runner.
This is the error I receive for about 30% of my runners.
What could cause this, and how can I increase the percentage of successful instantiations?
I was wondering if there were any plans to support MacOS instance types for the runners?
When I try to run actions/checkout without supplying my personal access token (PAT) on my EC2 runner, I receive an error: "remote: Repository not found." When I supply my PAT via token, the checkout is successful.
do-the-job:
  name: Do the job on the runner
  needs: start-runner # required to start the main job when the runner is ready
  runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
  steps:
    - name: clone repo
      uses: actions/checkout@v3
      with:
        token: ${{secrets.GH_PERSONAL_ACCESS_TOKEN}} #without this line it fails
On a GitHub-hosted runner I don't need to supply the token, I think because the action finds it at ${{ github.token }}, so my self-hosted runner must not be receiving it. I inspected the github context on the runner with echo and all looks well, except that I obviously can't confirm the token (it's censored).
Others are seemingly able to use actions/checkout on their EC2 runner without supplying a token, example: here.
This may be a problem with how I prepared my runner EC2 image; however, I'm not sure how to diagnose this.
When using the standard EC2 runner setup from the documentation, it seems as if the GitHub runners aren't being cleaned up. The returned message is "GitHub self-hosted runner with label <random id> is not found, so the removal is skipped", but when I look at the list of runners, I see the runner with that label as offline. The AWS side seems to clean up properly. In any case, thanks for this great tool!
See more details and examples in the following PR #31
As you can see in this failed workflow run, my task is able to create an EC2 image, but then fails to connect with GitHub to register.
I made a token with the repo credentials and set it as a secret. So I don't think that's it.
I worry that I've messed up my VPC or security group configuration. Are there any screenshots or more detailed examples of how to set it up? Here's how I made my security group. Is this right?
I have followed all the instructions that are mentioned in the README.md but the hello world job results in failure for me.
Here is the output from the EC2 Serial Console:
Here is the output from the GitHub action (it stays stuck here):
Here is what's in my .github/workflows/aws-ec2-job.yml file:
name: aws-ec2-job
on: pull_request
jobs:
  start-runner:
    name: Start self-hosted EC2 runner
    runs-on: ubuntu-latest
    outputs:
      label: ${{ steps.start-ec2-runner.outputs.label }}
      ec2-instance-id: ${{ steps.start-ec2-runner.outputs.ec2-instance-id }}
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Start EC2 runner
        id: start-ec2-runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: start
          github-token: ${{ secrets.REPO_SCOPE_PAT }}
          ec2-image-id: ${{ secrets.EC2_IMAGE_ID }}
          ec2-instance-type: t3.xlarge
          subnet-id: ${{ secrets.SUBNET_ID }}
          security-group-id: ${{ secrets.SECURITY_GROUP_ID }}
  aws-ec2-job:
    name: run the benchmarks on the runner
    needs: start-runner # required to start the main job when the runner is ready
    runs-on: ${{ needs.start-runner.outputs.label }} # run the job on the newly created runner
    steps:
      - name: Hello World
        run: echo 'Hello World!'
  stop-runner:
    name: Stop self-hosted EC2 runner
    needs:
      - start-runner # required to get output from the start-runner job
      - aws-ec2-job # required to wait when the main job is done
    runs-on: ubuntu-latest
    if: ${{ always() }} # required to stop the runner even if the error happened in the previous jobs
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v1
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
          aws-region: ${{ secrets.AWS_REGION }}
      - name: Stop EC2 runner
        uses: machulav/ec2-github-runner@v2
        with:
          mode: stop
          github-token: ${{ secrets.REPO_SCOPE_PAT }}
          label: ${{ needs.start-runner.outputs.label }}
          ec2-instance-id: ${{ needs.start-runner.outputs.ec2-instance-id }}