fduran / sadservers

SadServers: Linux & DevOps Troubleshooting Scenarios SaaS

Python 11.72% Shell 11.88% HCL 76.41%

devops interview-practice interview-preparation interview-test linux sre-infra troubleshooting

sadservers's Introduction

SadServers

SadServers is a SaaS where users can test their Linux and DevOps troubleshooting skills on real Linux servers in a "Capture the Flag" fashion.

Make Money Creating SadServers Scenarios

Table of Contents:

What
Why
When
How Does It Look
Architecture
Site Priorities
- User Experience
- Security
Code
Issues
Roadmap
Collaboration
Scenarios
Contact

What

SadServers is a SaaS where users can test their Linux (Docker, Kubernetes...) troubleshooting skills on real Linux servers in a "Capture the Flag" fashion.

There's a collection of scenarios, each with a description of what's wrong and a test to check if the issue has been solved. The servers are spun up on the spot; users get an "SSH" shell via a browser window to an ephemeral server (destroyed after the allotted time for solving the challenge) and can then try to solve the problem.

Problems involve common software that runs on Linux, such as databases or web servers, although detailed knowledge of the specific application is not necessarily required. There are also scenarios where you do need to be familiar with the technology at fault, for example a Docker scenario. The scenarios are for the most part real-world ones, in that they are similar to issues that we have encountered.

SadServers is aimed primarily at users that are professional Software Developers (possibly), System Administrators, DevOps engineers, SREs, and related positions that require server debugging and troubleshooting skills.

In particular, SadServers aims to test these professionals (or people aspiring to these jobs) in a way that is useful for the troubleshooting part of a job interview.

Why

To scratch a personal itch and because there's nothing like this that I'm aware of. There are/were some sandbox solutions like Katacoda (shut down in May 2022) but nothing that gives you a specific problem with a condition of victory on a real server.

It's also my not-so-secret hope that a sophisticated enough version of SadServers could be used by tech companies (or by companies that carry out job interviews on their behalf) to automate or facilitate the Linux troubleshooting interview section.

An annoyance I found during my interviews is that sometimes, instead of helping, the interviewer unintentionally misleads you, or you feel like you are in a TV game show where you have to maximize some arbitrary points and come up with a game strategy that doesn't reflect real incident situations (do I keep trying to solve this problem or do I move to the next one, which one is better?).

When

SadServers was "launched" on Hacker News in October 2022, reaching the #2 position that day and yes, it suffered from the "HN hug of death".

How does it look?

Architecture

See diagram:

Users interact via HTTPS only with a web server and a proxy server connecting to the scenario VMs. The rest of the communications are internal between VPCs or AWS services. Each scenario VM resides in a VPC with no Internet-facing incoming access and limited egress access.

Web server

The website is powered by Django and Python 3, with Bootstrap and plain JavaScript at the front.

In front of Django there's an Nginx server and Gunicorn WSGI server. The SSL certificate is generously provided by Let's Encrypt and its certbot, the best thing to happen to the Internet since Mosaic.

Task Queue

New server requests are queued and processed in the background. On the front-end I'm using Celery Progress Bar for Django. The tasks are managed asynchronously by Celery with a RabbitMQ back-end and with task results saved to the main database (and yes, maybe there should be a simpler but still robust stack instead of this).

Instances are requested on AWS using Boto3, based on scenario images. A Celery beat scheduler checks for expired instances and kills them.
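The expiry check described above can be sketched roughly as follows. This is only an illustration under assumptions: the `ScenarioInstance` record, its field names and the instance IDs are hypothetical and not taken from the SadServers code, and the Celery and Boto3 wiring is left out so that the selection logic stands alone.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone


@dataclass(frozen=True)
class ScenarioInstance:
    """Hypothetical record of a running scenario VM (not the real schema)."""
    instance_id: str
    launched_at: datetime
    time_limit: timedelta


def find_expired(instances, now):
    """Return the IDs of instances whose allotted time has run out."""
    return [
        inst.instance_id
        for inst in instances
        if now >= inst.launched_at + inst.time_limit
    ]


# Made-up data: one VM past its 45-minute limit, one still within 15 minutes.
now = datetime(2023, 1, 15, 12, 0, tzinfo=timezone.utc)
running = [
    ScenarioInstance("i-aaa", now - timedelta(minutes=50), timedelta(minutes=45)),
    ScenarioInstance("i-bbb", now - timedelta(minutes=10), timedelta(minutes=15)),
]
expired_ids = find_expired(running, now)
print(expired_ids)
```

In a setup like the one described, a periodic task would presumably run a check of this kind and hand the resulting IDs to the EC2 API (for instance Boto3's `terminate_instances`).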

Permanent Storage

A PostgreSQL database is the permanent storage backend; it's my first choice for an RDBMS. SQLite is a valid alternative for sites without a high rate of (concurrent) writes.

Proxy Server

In the initial proof of concept, I had the users connect directly to the VMs' public IPs. For security reasons such as terminating SSL, being able to use rate limiting, logging access and especially keeping the VMs on private IPs only, it's a good idea to route access to the scenario instances through a reverse web proxy server.

Since the scenario instances are created on demand (at least some of them), I needed a way to dynamically inject the route mappings into the web server configuration, i.e., using code against an API to configure the web server and reload it. The configuration for proxying a VM would be like proxy.sadservers.com:port/somestring -> (proxy passes to upstream server) -> VM ip address:port. (Using a path string is one option; other options could be passing a ...?parameter in the URL or in the HTTP headers.)

This was an interesting learning experience since, unlike with the rest of the stack, I had never faced this situation before. After considering some alternatives, I almost made it work with Traefik but I hit a wall, and in the end it didn't seem to be a good solution for this case. A friend of mine suggested using HashiCorp Consul, with the Django server connecting and writing to its key/value store, and Consul-template, which monitors Consul, writes the key/values (string and IP) into the Nginx configuration (Nginx does the actual SSL and proxying) and reloads it. After figuring out production settings (certificates, tokens) it turned out to work very well.
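To make the mapping idea concrete, here is a small Python sketch of the rendering step only. In the setup described above that job is done by Consul-template, not by Python; the function name, the path token `k3x9` and the upstream address are all hypothetical, and a real terminal proxy would likely need more directives than a bare `proxy_pass`.

```python
def render_locations(routes):
    """Render one Nginx location block per (path token -> upstream) entry.

    `routes` stands in for the key/value pairs that would live in Consul.
    """
    blocks = []
    for token, upstream in sorted(routes.items()):
        blocks.append(
            f"location /{token}/ {{\n"
            f"    proxy_pass http://{upstream}/;\n"
            f"}}"
        )
    return "\n".join(blocks)


# Hypothetical mapping: random path string -> private VM address and port.
routes = {"k3x9": "10.0.1.17:8080"}
print(render_locations(routes))
```

Whenever a key is added or removed, the template is rendered again and Nginx is reloaded, which is what keeps the proxy in step with the VMs that currently exist.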

Scenario Instances

On the VM instances, Gotty provides a terminal with a shell over HTTP(S). An agent built with Golang and Gin provides a REST API to the main server, so solutions can be checked and commands can be sent to the scenario instance or data extracted from it.

Replay System

For the Linux World Cup I wanted to have a way to record the user command line sessions and be able to show them publicly. I mean, what good is a World Cup or any competition if people cannot see what the participants are doing?

I looked at several options and ended up implementing asciinema which does the heavy lifting. You can see the results at https://replay.sadservers.com.

Asciinema supports AWS S3 as a backend storage, but I decided that a better option for my case is to have S3 in front of the asciinema server, in part due to security and also because it gives me more control. The implementation of this "S3" is done with Minio in a self-hosted server.

Diagram:

The workflow is as follows:

1. The user (cloud icon) goes to SadServers.com and creates an AWS instance VM.
2. The VM offers a shell-like web interface using Gotty (traffic goes through a proxy not shown here). Gotty calls Bash as login, which calls /etc/profile and in there I call asciinema record to a file (with a conditional test so it doesn't create an infinite loop).
3. A Minio client mc periodically ships the cast file (up to a size) to the Minio storage server (Minio has an mc mirror option to constantly sync file changes but it didn't work for me). If the user does a shutdown, using systemd I'm able to synchronize the last command changes and send them. There's also a maximum file size limit in the storage server.
4. When the VM is destroyed, the web server sends a request to the storage server, where an agent I wrote:
   - uploads the cast file to the asciinema server (replay server) and
   - returns the URL in the replay server to the web server so it's saved to the database and shown in the user's dashboard as a link to the replay server for the scenario they tried.
5. The user (or anyone, it's public) can see their session replay.

In the call to the agent (step 4), I can decide via a database flag for the scenario if I want to upload the screencast from the storage server to the public replay server. This allows me the control to decide if some scenarios have their sessions made public or not, while still retaining in the storage server all the casts.

This feature was used in the Linux World Cup for example, where I wanted to show everyone how people solved (or tried to solve) the challenges but I didn't want the sessions to be public until the event was over.

Resumable VMs

For users who are doing guided learning at their own pace (rather than solving a "challenge"), for example with the Linux Upskill Challenge, SadServers offers "resumable" VMs, i.e., servers that the user can stop and restart without losing their changes.

Currently this feature is limited to one VM per (registered) user. Also, these VMs can be destroyed after a number of days of inactivity or after a total number of days.

From the user's point of view, the lifecycle of their resumable instance is the same as that of any (EBS-backed, not spot) instance in AWS (shown below), with the difference that the transitional states are not shown to the user. So the instance can be either Running or Stopped, and from either of those two states it can be Terminated.

Once the resumable instance is terminated, the user can choose to create another one.
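The user-visible lifecycle just described amounts to a three-state machine. The sketch below is an illustration only; the enum, the `ALLOWED` table and the `transition` helper are hypothetical names, not part of the SadServers code.

```python
from enum import Enum


class State(Enum):
    RUNNING = "running"
    STOPPED = "stopped"
    TERMINATED = "terminated"


# User-visible transitions only; the AWS transitional states (pending,
# stopping, shutting-down) are deliberately left out, as the text describes.
ALLOWED = {
    State.RUNNING: {State.STOPPED, State.TERMINATED},
    State.STOPPED: {State.RUNNING, State.TERMINATED},
    State.TERMINATED: set(),
}


def transition(current, target):
    """Return the new state, or raise ValueError if the move is not allowed."""
    if target not in ALLOWED[current]:
        raise ValueError(f"cannot go from {current.value} to {target.value}")
    return target


state = State.RUNNING
state = transition(state, State.STOPPED)     # user stops the VM
state = transition(state, State.RUNNING)     # user resumes it
state = transition(state, State.TERMINATED)  # VM is destroyed
print(state.value)
```

Terminated has no outgoing transitions, which matches the text: after termination the only option is to create a new instance.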

API

SadServers offers an API, built with the Django REST framework and with a web interface at https://sadservers.com/api/.

See the full API Documentation.

Other Infrastructure & Services

Without going into much detail, there are quite a few auxiliary services needed to run a public service in a decent "production-ready" state. These include notification services (AWS SES for email, for example), a logging service, an external uptime monitoring service, scheduled backups, error logging (like Sentry) and infrastructure as code (HashiCorp Terraform and Packer).

Site Priorities

There are two main objectives: 1) to provide a good user experience with value and 2) security.

User Experience

I'm not a UX expert, as anyone can see, but I'm trying to make the site as simple and unconfusing as possible. As Seth Godin says in The Big Red Fez, "show me the banana" (make it evident where to click). The "happy paths" are so far one or two clicks away.

Security

Security starts with threat modeling, which is a fancy way of saying "think what can go terribly wrong and what's most likely to go wrong". (Sidebar: Infosec is full of these big fancy expressions like "blast radius", "attack vector", "attack surface" or my favourite one, "non-zero"; unless it ends the sentence you can just omit it, try it with "there's a non-zero chance of blah").

For this project I see two types of issues that adversarial ("bad hacker") agents could possibly inflict, focusing first on financial incentives and then on assholery ones:

Monetary-based: there are free computing resources, so attackers could try to use them for things like mining crypto or as a platform to launch malware or spam attacks (at an ISP I worked for, a VM maxing out its CPU was frequently a compromised one sending spam or malware).
Monetary-based: AWS account credentials need to be managed for the queuing service that calls the AWS API. If these credentials are compromised, then I could be stuck with a big AWS bill.
Nastiness-based: general attacks like DoS on public endpoints from the outside or internal or "sibling" attacks from scenario VMs to other VMs.

Mitigation

An incomplete list of things to do in general or that I've done in this case:

(Principle of least privilege) Create a cloud account with permissions just to perform what you need. In my case, to be able to create only EC2 nodes, of specific type(s), in specific subnets. Given the type of instance (nano, also using "spot" ones) and the size of the subnet(s), and therefore the number of VMs, there's a known cap on the maximum expense that this account could incur during a period of time.
Monitor all the things and alert. Budgets and threshold alerts in your cloud provider are a way to detect anomalous costs.
Access and application logs are also helpful in detecting malicious behaviour.
In my case, instances spun up normally from the website are garbage-collected after 15-45 minutes and are not powerful, which is a disincentive to running malicious or opportunistic programs on them.
Scenario VMs are isolated within their VPC. The only ingress network traffic allowed is from the web server to the agent and from the proxy server to the shell-to-web tool. The only egress traffic allowed is ICMP and, indirectly (via a local name server), DNS. This eliminates in principle the risk of these instances being used to launch attacks on other servers on the Internet.
From the outside Internet there's network access only to an HTTPS port on both the web server and the proxy server; there are also automatic rate-limiting measures at these public entry points.
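The "known cap" on spending from the first mitigation item can be estimated with simple arithmetic: usable addresses in the subnet, times the hourly instance price, times the length of the period. The sketch below uses made-up inputs (a single /24 subnet and a price of 0.005 USD per instance-hour are assumptions, not SadServers figures) and ignores storage and data transfer costs.

```python
from decimal import Decimal

AWS_RESERVED_PER_SUBNET = 5  # AWS reserves 5 addresses in every subnet


def max_spend(prefix_len, hourly_price, hours):
    """Upper bound on spend if every usable subnet address hosts one instance."""
    usable = 2 ** (32 - prefix_len) - AWS_RESERVED_PER_SUBNET
    return usable * hourly_price * hours


# Hypothetical inputs: one /24 subnet, a made-up price of 0.005 USD per
# instance-hour, over a 24-hour window.
cap = max_spend(24, Decimal("0.005"), 24)
print(cap)
```

With these assumed numbers the bound comes out at roughly 30 USD per day, which is the kind of figure a budget alert threshold could be set against.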

Code

This project may become Open Source at some point but for now the code is not publicly available. One reason is that showing the solutions to the scenarios defeats the purpose, and another is that, for security reasons, I'd rather not expose details of how things are set up. I'll be happy to chat about technical aspects of the project if someone is curious.

Issues

See Issues

Roadmap

~~Save & replay user command history.~~ DONE (with some limitations like replay file size, see Replay Server)
~~Instances with public IPs where the user's public SSH key is added so they can use any SSH client.~~ DONE for selected users.
~~Code to run competitions~~ DONE for the Linux World Cup
~~Guided scenarios with stop-and-resume VMs.~~ DONE
~~User comment system.~~ DONE piggy-backing on Github
~~Add tags to scenarios~~ DONE
~~Blog or article system.~~ DONE
~~API~~ DONE
Multi-VM scenarios.
OS package repository cache/proxy server.
A system for users to upload their scenarios.
Downloadable scenario VMs (OVA).
Translation of texts to multiple languages.
Guided learning system.
Login using Github or Gmail account.
Look into WebAssembly (WASM) so users can run (some) scenarios in the browser.
Look into alternative hosting methods:
- Kubernetes for Dockerized scenarios.
- Firecracker.

Collaboration

I'm not looking for web development (SadServers.com front-end and back-end) help at the moment. The help I'd appreciate most is:

Feedback on the scenarios and general website user experience.
Creation of scenarios.

Currently there's a pay-for-scenario bounty request; see the details at Make Money Creating SadServers Scenarios.

Scenarios

Contact

Any feedback is appreciated, please email [email protected]


sadservers's Issues

Scenario "Tokyo" : localhost bypasses firewall

curl localhost works despite the iptables firewall rule because IPv6 is not blocked

How to copy/paste in the terminal?

Hello,

I would like to know if there is any hotkey I can use for copying and pasting in the terminal.
Thanks

11 "Lisbon": etcd SSL cert troubles feedback on solution

As this problem was created in January, the hint probably worked only for that month of 2023. The SSL certificate is valid for only one month (Dec 2022 to end of Jan 2023), which means the hint (setting the time back exactly one year) does not work anymore.

What worked was using timedatectl to set it directly:

timedatectl set-time "2023-01-15 00:00"

Requirements to write/record and publish scenarios solving content

Is there any problem/requirement on writing/recording the scenarios being solved in a Youtube Channel or Medium article? Thanks in advance and well done, SadServers is amazing!

Add a Donation Link

Have you considered adding a donation link in the project's documentation or README? This way, users and supporters could express their gratitude through small contributions.

be able to search sad servers scenarios

add a basic search to find sad servers scenarios

Replays

Hey! Are replays intended to work, or were they implemented just for the LWC and then not removed?

I ask because I tried a scenario and it generated the following link:

https://sadservers.com/accounts/View%20the%20recording%20at:

Just in case

The scenario was that and what I've done was nothing weird. Just tried the scenario and got kicked out because no time left.

Error in the Santiago

Hi, help me please to resolve my issue.

admin@i-068844a34e82f4dae:~$ grep -rwc "Alice" /home/admin/*.txt
/home/admin/11-0.txt:397
/home/admin/1342-0.txt:1
/home/admin/1661-0.txt:12
/home/admin/84-0.txt:0

admin@i-068844a34e82f4dae:~$ grep -rwc "Alice" /home/admin/*.txt | awk -F ':' '{total += $2} END {print total}'
410

admin@i-068844a34e82f4dae:~$ grep -l "Alice" /home/admin/1342-0.txt | xargs awk '/Alice/ {getline; print $1}'
156

The answer "410156" is not the solution, but the "411156" is.
Could you explain where I have the error?

Contribution : Packer template

Hello,

I would like to contribute to this cool project by providing Packer code that I have written for different providers (vSphere, Proxmox).

The code can create Ubuntu 20.04 and 22.04 templates with different preinstalled packages and a user account with sudo permissions.

I think I just need to adapt it to the AWS provider.

Can you provide me with a list of the requirements you want in your image?

"Check my Solution" Fails Sometimes

"Check my Solution" button fails to properly connect to the scenario VM and run the solution script. As a result, users see "invalid solution" when clicking the button, even if solution may be valid.

Curated problem solutions

Do you know where we can see other users' scripts for the different problems? It would be helpful if there were a place to discuss problem solutions. I can, for instance, solve a problem in a very roundabout fashion without grasping the true essence of the lesson. Seeing how others completed the different exercises after submitting a correct solution would be helpful.

Scenario "Buenos Aires" : it's "solved" at the beginning

For a few seconds while everything spins up in k8s, the pods pass the check (solution "correct")

case sensitive emails are not supported as user authentication

Email RFC allows for case-sensitive addresses on the non-domain part; ie [email protected] can be a different mailbox than [email protected].

Still, at the moment, SadServers only uses all lowercase email addresses. It converts all the address string into lowercase, so you can register and log in as [email protected] but you cannot have a separate ("duplicated") johnsmith@ (or any case variation since they are treated the same) and the password reset will be sent to [email protected]
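The behaviour described in this issue can be illustrated with a few lines of Python. This is a sketch, not the site's actual authentication code: `normalize_email`, `register` and the in-memory `registered` set are hypothetical, and the example.com addresses are placeholders.

```python
def normalize_email(address):
    """Lowercase the whole address, local part included."""
    return address.lower()


registered = set()


def register(address):
    """Return True if the normalized address is new, False if already taken."""
    key = normalize_email(address)
    if key in registered:
        return False
    registered.add(key)
    return True


print(register("JohnSmith@example.com"))  # first registration succeeds
print(register("johnsmith@example.com"))  # same mailbox after lowercasing
```

Because every case variant collapses to the same key, the second call is treated as a duplicate, which is the trade-off the issue describes.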

Scenario "Santiago": confusing wording

This scenario description is confusing some users (difference between occurrences of a string and number of lines where that string is present in a file). Also the way to present two solutions in one file is confusing.

Want to re-write this scenario into a similar one about more advanced grep usage.
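The distinction that trips people up here, lines containing a string versus total occurrences of it, can be shown with a short Python sketch. The sample text is made up, and `\b` word boundaries are used as a rough stand-in for grep's `-w` option.

```python
import re

# Made-up sample: the first line contains the word twice.
text = "Alice met Alice.\nBob waved.\nThen Alice left.\n"

pattern = re.compile(r"\bAlice\b")

# Number of lines containing the word at least once (what `grep -c` counts).
matching_lines = sum(1 for line in text.splitlines() if pattern.search(line))

# Total number of occurrences across the whole text.
occurrences = len(pattern.findall(text))

print(matching_lines, occurrences)
```

The two numbers differ whenever a line holds the string more than once, so a count based on `grep -c` can come out lower than the number of occurrences.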

Scenario "Salta": too restrictive solution script

The regex for checking the Docker published port can be relaxed, as we don't really care if the node app is not started on :8888, as long as the port is mapped correctly when executing "docker run".

Using wasm linux

Hi. Thanks for the great project.
Was it considered to use wasm linux in browser?
It would reduce AWS bill and it would be possible to use it offline.

Not working "Run" because CSRF verification failed

"Run" in any scenario does not work.

Forbidden (403) CSRF verification failed.

The browsers I'm using (Safari, Google Chrome (v119.0.6045.159)) allow cookies, but I got a 403 with the above error.

You might find an article about this error that says there is no CSRF token in the form.

Could you investigate this or give me any solutions?

Lisbon etcd cluster is unavailable or misconfigured

Hello, dear @fduran
I've got a bug in "Lisbon" task. After generating new self signed certificate and restarting ETCD daemon, the error is etcd cluster is unavailable or misconfigured, but test script says "OK". See the output below.

root@i-08012af3bb4b932ca:~# etcdctl get foo
Error:  client: etcd cluster is unavailable or misconfigured; error #0: EOF
; error #1: dial tcp 127.0.0.1:4001: connect: connection refused

error #0: EOF
error #1: dial tcp 127.0.0.1:4001: connect: connection refused

root@i-08012af3bb4b932ca:~# 
root@i-08012af3bb4b932ca:~# curl -k https://localhost:2379/v2/keys/foo
{"action":"get","node":{"key":"/foo","value":"bar","modifiedIndex":4,"createdIndex":4}}
root@i-08012af3bb4b932ca:~# 
root@i-08012af3bb4b932ca:~# curl https://localhost:2379/v2/keys/foo
curl: (60) SSL certificate problem: self signed certificate
More details here: https://curl.se/docs/sslcerts.html

curl failed to verify the legitimacy of the server and therefore could not
establish a secure connection to it. To learn more about this situation and
how to fix it, please visit the web page mentioned above.
root@i-08012af3bb4b932ca:~# 
root@i-08012af3bb4b932ca:~# 
root@i-08012af3bb4b932ca:~# cd /home/admin/agent && ./check.sh
OK

root@i-08012af3bb4b932ca:/home/admin/agent# GODEBUG=x509ignoreCN=0 etcdctl get foo
Error:  client: etcd cluster is unavailable or misconfigured; error #0: EOF
; error #1: dial tcp 127.0.0.1:4001: connect: connection refused

error #0: EOF
error #1: dial tcp 127.0.0.1:4001: connect: connection refused

root@i-08012af3bb4b932ca:/home/admin/agent# GODEBUG=x509ignoreCN=0 ETCDCTL_ENDPOINT=https://localhost:2379 etcdctl get foo
bar
