lablup / backend.ai-jail

A programmable security sandbox for Backend.AI kernels

License: GNU Lesser General Public License v3.0

Makefile 2.26% Rust 94.66% Roff 0.37% Shell 2.70%

backendai jail sandboxing

backend.ai-jail's People

hephaex jopemachine

backend.ai-jail's Issues

chmod not working under jail even when the policy allows it

This is causing problems with compiling kernels.

Child count gets too much increased due to missing exit tracking

TensorFlow code often spawns many threads, and the jail reports "too many" threads even though the actual number of threads is within the configured limit.

Potential solutions:

- Directly read "/proc/{pid}/status" to get the actual number of threads from the OS. This may incur some overhead when the child spawns new processes or threads.
- Guard the childCount variable with explicit locks.
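
The first option could be sketched as follows. This is only an illustration in Python, not the jail's actual code: the helper names `parse_thread_count` and `thread_count` are hypothetical, and it assumes a Linux procfs where the status file contains a "Threads:" field.

```python
import os


def parse_thread_count(status_text):
    """Return the value of the 'Threads:' field from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("Threads:"):
            return int(line.split(":", 1)[1].strip())
    raise ValueError("no 'Threads:' field found")


def thread_count(pid):
    """Read the current thread count of one process from procfs (Linux only)."""
    with open(f"/proc/{pid}/status") as f:
        return parse_thread_count(f.read())


if __name__ == "__main__" and os.path.exists("/proc/self/status"):
    # Thread count of the current interpreter process.
    print(thread_count(os.getpid()))
```

Note that this counts the threads of a single process only, so a jail tracking several child processes would presumably have to sum the values over all of them.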

Even so, TensorFlow seems to keep increasing the number of threads when we call regressors repeatedly.
We need to find a good solution for this.

NOTE:
Even the following code produces many more threads than the number of CPU cores allocated to the container:

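import tensorflow as tf  # TensorFlow 1.x API

# Restrict TensorFlow to a single intra-op and inter-op thread on one CPU device.
config = tf.ConfigProto(intra_op_parallelism_threads=1, inter_op_parallelism_threads=1,
                        allow_soft_placement=True, device_count={'CPU': 1})
session = tf.Session(config=config)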

Transform hard-coded policies into configurations

So that we don't have to bake the jail binaries and the base kernel images every time we update the policies.
The configuration format should be easy to write and understand, so JSON, YAML, or TOML would be good choices. We need to figure out which is most accessible from the Go ecosystem.

Prerequisite knowledge:

seccomp filter, Linux system call argument format in the amd64 architecture
Go language
Docker

To make it configurable, we should implement:

- Read the list of system call filters, resource limits, whitelisted filesystem operation paths, and preserved environment keys from the config file.
- Keep only the differences in per-policy configs and the shared parts in the "default" config (e.g., additionally allowed or blocked syscalls should go into per-policy configs).
  - Note: Traced syscalls are currently shared across all policies.
- Prevent the child process from accessing the configuration files.
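
The "default plus per-policy differences" idea might look roughly like the sketch below. It uses plain Python dictionaries instead of a parsed config file, and every key name (`allowed_syscalls`, `extra_allowed_syscalls`, `blocked_syscalls`, `limits`) is a made-up placeholder rather than an actual jail setting.

```python
def merge_policy(default, override):
    """Combine the shared default config with one per-policy config."""
    merged = dict(default)
    # Start from the shared syscall list, then apply the per-policy differences.
    allowed = set(default.get("allowed_syscalls", []))
    allowed |= set(override.get("extra_allowed_syscalls", []))
    allowed -= set(override.get("blocked_syscalls", []))
    merged["allowed_syscalls"] = sorted(allowed)
    # Scalar settings such as resource limits are overridden key by key.
    merged["limits"] = {**default.get("limits", {}), **override.get("limits", {})}
    return merged


# Hypothetical shared defaults and one hypothetical per-policy config.
DEFAULT = {
    "allowed_syscalls": ["read", "write", "open"],
    "limits": {"max_children": 32, "timeout": 60},
}
EXAMPLE_POLICY = {
    "extra_allowed_syscalls": ["chmod"],
    "blocked_syscalls": ["open"],
    "limits": {"max_children": 64},
}

if __name__ == "__main__":
    print(merge_policy(DEFAULT, EXAMPLE_POLICY))
```

Keeping the per-policy files limited to differences means a change to the shared defaults does not have to be repeated in every policy.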

Merge or take advantage of Docker's default apparmor profile

A recent investigation of unexpected jail failures by @tlqaksqhr finally identified the root cause: an interaction between the docker-default apparmor profile and our jail's seccomp+ptrace.
(Yes, I thought apparmor was deprecated, but it is still in use!)

References:

Since apparmor simplifies some parts of our jail policy implementation, such as path-based access controls, let's combine its advantage with our jail.

Could we translate the path-based access control part of policy.yml to apparmor profile? Or, could we do the reverse (importing the docker-default apparmor profile to the base policy.yml)?
- If we use apparmor in addition to jail:
  - Modify the agent to auto-generate & load the apparmor profile from the container's policy.yml when starting containers, and unload the profile when containers terminate. (one profile per container)
- If we merge apparmor profile into jail:
  - Set apparmor=unconfined security options when starting containers in the agents.

Generic stdin support

It is difficult to have language-by-language stdin overrides.
Could we just override the read() syscall with stdin (file descriptor zero)?

Things to test/consider:

- Check if overriding read() with fd zero really works.
- What happens if the stdin file descriptor is duplicated via dup() or dup2()? Is there any language that uses them? (i.e., should we keep track of such syscalls and their target/returned file descriptors?)
- Check if redirection interferes with read() overriding. Maybe we need to check isatty() for the stdin fd?
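
If duplicated descriptors do need tracking, the bookkeeping could be as small as the sketch below. The class `StdinAliasTracker` and its methods are hypothetical, and the sketch assumes the tracer already sees the arguments and return values of dup(), dup2(), and close(); other duplication paths such as fcntl() are not covered.

```python
class StdinAliasTracker:
    """Track which file descriptors currently refer to the original stdin."""

    def __init__(self):
        # File descriptor zero is stdin when the child starts.
        self.aliases = {0}

    def on_dup(self, oldfd, newfd):
        """Record dup(oldfd) -> newfd or dup2(oldfd, newfd)."""
        if oldfd in self.aliases:
            self.aliases.add(newfd)
        else:
            # newfd now refers to something else (e.g., stdin was redirected).
            self.aliases.discard(newfd)

    def on_close(self, fd):
        self.aliases.discard(fd)

    def should_override_read(self, fd):
        """Return True if a read() on fd targets the original stdin."""
        return fd in self.aliases


if __name__ == "__main__":
    tracker = StdinAliasTracker()
    tracker.on_dup(0, 5)   # child called dup(0) and got fd 5
    print(tracker.should_override_read(5))
```

Whether a descriptor that was redirected over fd zero should still be overridden is a design question that this sketch leaves open; it simply stops treating such a descriptor as stdin.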

Build or embed a DNS server to filter allowed external hostnames

It is non-trivial to manage outbound security rules using IP addresses, as many external websites rely on load balancers and volatile IP addresses on top of clouds.

Let's build a DNS server that provides transparent access to whitelisted domains (e.g., github.com) from user kernel sessions but returns "unresolved" results for other domains.
This would not be perfect but will provide a good starting point.
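
The core decision such a DNS server has to make can be sketched without any networking code. The function `is_allowed` below is hypothetical, and treating subdomains of a whitelisted domain as allowed is an assumption, not something the proposal specifies.

```python
def is_allowed(hostname, allowed_domains):
    """Return True if hostname equals or is a subdomain of an allowed domain."""
    # DNS names are case-insensitive and may carry a trailing root dot.
    name = hostname.rstrip(".").lower()
    for domain in allowed_domains:
        domain = domain.rstrip(".").lower()
        # Match on label boundaries so "evilgithub.com" does not pass.
        if name == domain or name.endswith("." + domain):
            return True
    return False


if __name__ == "__main__":
    whitelist = ["github.com"]
    for host in ["github.com", "api.github.com", "example.org"]:
        print(host, is_allowed(host, whitelist))
```

A real server would answer queries that fail this check with a negative response and forward the rest to an upstream resolver.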

Update seccomp profiles

Reference: https://docs.docker.com/engine/release-notes/ (20.10 series)

seccomp: Whitelist clock_adjtime. CAP_SYS_TIME is still required for time adjustment moby/moby#40929

seccomp: Add openat2 and faccessat2 to default seccomp profile moby/moby#41353

seccomp: allow ‘rseq’ syscall in default seccomp profile moby/moby#41158

seccomp: allow syscall membarrier moby/moby#40731

seccomp: whitelist io-uring related system calls moby/moby#39415

Fix seccomp profile for clone syscall moby/moby#39308

Add watch mode

The current debug mode (enabled via -debug flag) dumps too much information.
We should have a "watch" mode that transparently allows all system calls but logs the ones that the currently designated policy would block. This will be useful for updating our filter sets when we encounter a new application that does not work with Sorna jail but works well without it.
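
In terms of decision logic, the proposed mode amounts to something like the sketch below. The function `decide`, the logger name, and the representation of a policy as a set of syscall names are all simplifications made up for illustration.

```python
import logging

logger = logging.getLogger("jail.watch")


def decide(syscall, allowed, watch_mode=False):
    """Return True if the syscall may proceed under the given policy."""
    if syscall in allowed:
        return True
    if watch_mode:
        # Watch mode: let the call through, but record what would be blocked.
        logger.warning("policy would block syscall: %s", syscall)
        return True
    return False


if __name__ == "__main__":
    logging.basicConfig(level=logging.WARNING)
    policy = {"read", "write"}
    print(decide("chmod", policy))                   # enforced
    print(decide("chmod", policy, watch_mode=True))  # logged and allowed
```

Unlike the debug mode, only the calls that the policy rejects produce output, so the log maps directly to candidate additions to the filter set.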

Audit ARM64 port

#29 did a basic port to ARM64, but Linux on different architectures (such as x86_64 and ARM64) can execute different syscalls for the same code, and the jail needs to take that into account.

At least one such case is known. For the following Python code:

import os
os.access('/', os.F_OK)

x86_64 executes access but ARM64 executes faccessat. Currently, the path argument is checked for access but not for faccessat.
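
One way to keep such pairs in sync is to record where each syscall carries its path argument, since the "at" variants take a directory file descriptor first. The table and helper below are a hypothetical sketch, not the jail's data structures, and the table is deliberately incomplete.

```python
# Zero-based position of the pathname argument for a few syscalls.
PATH_ARG_INDEX = {
    "access": 0,     # access(pathname, mode)
    "faccessat": 1,  # faccessat(dirfd, pathname, mode, ...)
    "open": 0,       # open(pathname, flags, mode)
    "openat": 1,     # openat(dirfd, pathname, flags, mode)
}

AT_FDCWD = -100  # Linux constant: resolve relative to the current directory


def path_argument(syscall, args):
    """Return the pathname argument of a traced syscall, or None if untracked."""
    index = PATH_ARG_INDEX.get(syscall)
    if index is None:
        return None
    return args[index]


if __name__ == "__main__":
    print(path_argument("access", ["/", 0]))
    print(path_argument("faccessat", [AT_FDCWD, "/", 0]))
```

For the "at" variants, a relative path would additionally have to be resolved against the directory that dirfd refers to before applying a path-based rule.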

Finding the path of an executable file is missing while rewriting in Rust

During the Rust rewrite, the functionality of Go's exec.LookPath function seems to have been lost.

For example,

# The command below panics with a "python3 not found" error message.
$ target/debug/backendai-jail python3

# The command below works as expected.
$ target/debug/backendai-jail /usr/bin/python3

I'm not sure if this is intentional, but I think it could be useful to include this feature.

We can call the which command directly, or maybe it would be better to use the which crate.

It seems the "which" lookup could be inserted here: https://github.com/lablup/backend.ai-jail/blob/main/src/jail.rs#L767
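
The expected behavior can be illustrated in Python, where the standard library's `shutil.which` performs the same kind of PATH search. The wrapper `resolve_executable` is hypothetical; the Rust code would use its own lookup, and this only shows the intended semantics.

```python
import os
import shutil


def resolve_executable(command, search_path=None):
    """Resolve a command name to a path, searching PATH for bare names."""
    if os.sep in command:
        # Commands containing a path separator are used as given.
        return command
    found = shutil.which(command, path=search_path)
    if found is None:
        raise FileNotFoundError(f"{command} not found in PATH")
    return found


if __name__ == "__main__":
    # A bare name is searched in PATH; the result depends on the system.
    try:
        print(resolve_executable("python3"))
    except FileNotFoundError as exc:
        print(exc)
```

With a lookup like this in place, both invocations shown above would end up executing the same binary whenever python3 is on the PATH.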

Inconsistent golang versions in different Dockerfiles

It seems we are using inconsistent golang versions in Dockerfile.builder-musllinux, Dockerfile.builder-manylinux and Dockerfile.

Trying to build the development container by following the readme file causes the following error.

#6 1.656 src/github.com/seccomp/libseccomp-golang/seccomp_internal.go:698: cannot use req.data.arch (type C.__u32) as type C.uint32_t in argument to archFromNative
------
executor failed running [/bin/sh -c go get github.com/seccomp/libseccomp-golang &&     go get github.com/fatih/color &&     go get github.com/gobwas/glob &&     go get gopkg.in/yaml.v2]: exit code: 2
make: *** [prepare-dev] Error 1

Since Dockerfile.builder-musllinux and Dockerfile.builder-manylinux use golang 1.11 and Dockerfile uses golang 1.8, I think the golang version in Dockerfile should be updated to 1.11.

Use seccomp's add_rule_conditional

Currently, if we need to hook, say, ioctl request 42, all ioctl requests are trapped to user space, because we use seccomp's add_rule method, like add_rule(ioctl).

We can do better. If we use seccomp's add_rule_conditional method instead, like add_rule_conditional(ioctl, arg2 == 42), only ioctl request 42 is trapped to user space, because the comparison is done in kernel space. This may improve performance.
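
The difference can be shown with a toy model that counts how many calls reach the user-space tracer under each kind of rule. This is plain Python and does not use libseccomp; the function `trapped_calls` and the sample call list are invented for illustration. The request number is ioctl's second argument, which is index 1 when counting from zero.

```python
def trapped_calls(calls, syscall, arg_index=None, value=None):
    """Count calls that a trap rule would hand to the user-space tracer."""
    trapped = 0
    for name, args in calls:
        if name != syscall:
            continue
        # Unconditional rule: every call traps. Conditional rule: only matches.
        if arg_index is None or args[arg_index] == value:
            trapped += 1
    return trapped


# Hypothetical trace: (syscall name, (fd, request or other argument)).
CALLS = [
    ("ioctl", (3, 42)),
    ("ioctl", (3, 7)),
    ("ioctl", (4, 7)),
    ("read", (0, 0)),
]

if __name__ == "__main__":
    print(trapped_calls(CALLS, "ioctl"))         # unconditional rule
    print(trapped_calls(CALLS, "ioctl", 1, 42))  # conditional rule
```

Every trapped call costs a round trip between the kernel and the tracer, so the fewer calls that match, the larger the expected saving.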

Add test cases

Write at least a minimal set of test cases.