auxoncorp / ferros

A Rust-based userland which also adds compile-time assurances to seL4 development.
Home Page: https://ferros.auxon.io/
License: Apache License 2.0
A VMM will typically want to register an IRQ handler for every IRQ exposed to the guest.
The number of IRQs is typically not known at compile time, and often is enumerated from a device-tree blob.
In this case, it is common to use the AEP (asynchronous endpoint) binding feature of the seL4 kernel to listen for both synchronous IPC and IRQ notification events. The VMM could badge each IRQ notification with `(1 << IRQ_NUM)`.
With this, we will also need a way to ACK (`seL4_IRQHandler_Ack()`) a given IRQ at an arbitrary later point. For example, in a VMM, IRQs are not ACK'd until the guest performs the ACK operation: the guest access causes the seL4 kernel to deliver a `VGICMaintenanceFault` to the VMM's fault handler, which then calls `seL4_IRQHandler_Ack()`.
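The badge decoding on the VMM side might look like this minimal sketch. Everything here is plain Rust arithmetic; `irqs_from_badge` is an illustrative helper, not a ferros API:

```rust
// Sketch: decode a notification badge where IRQ n was badged with (1 << n).
// The badge word and the helper name are placeholders, not ferros APIs.
fn irqs_from_badge(badge: u64) -> impl Iterator<Item = u32> {
    (0u32..64).filter(move |n| badge & (1u64 << n) != 0)
}

fn main() {
    // Suppose a wait on the bound AEP returned a badge with IRQs 3 and 7 set.
    let pending: Vec<u32> = irqs_from_badge((1 << 3) | (1 << 7)).collect();
    assert_eq!(pending, vec![3, 7]);
    println!("pending IRQs: {:?}", pending);
}
```

Note that a one-bit-per-IRQ badge caps the scheme at one word's worth of IRQs per notification object.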
The VMM will need to map specific regions into the IPA for the guest, so `VSpace` will need something akin to `VSpace::map_region_at_addr`:
```rust
pub fn map_region_at_addr<SizeBits: Unsigned>(
    &mut self,
    vaddr: Vaddr,
    region: UnmappedRegion<SizeBits>,
    rights: CapRights,
) -> Result<(), VSpaceError> {
    if self.overlaps_with_previously_mapped(vaddr) {
        Err(VSpaceError::RegionOverlaps)
    } else {
        self.map_layer(...)
    }
}
```
Where `VSpace::overlaps_with_previously_mapped` uses a bit of new state added to `VSpace`:
```rust
// Maybe this comes from a configuration item in sel4.toml?
const NUM_SPECIFIC_REGIONS: usize = ...;

struct VSpace<VSS> {
    // ...
    // (start address, size in bytes) of each specifically mapped region
    mapped_regions: [(Vaddr, usize); NUM_SPECIFIC_REGIONS],
}

impl VSpace<...> {
    fn overlaps_with_previously_mapped(&self, vaddr: Vaddr) -> bool {
        for &(head, size) in self.mapped_regions.iter() {
            if vaddr >= head && vaddr < head + size {
                return true;
            }
        }
        false
    }
}
```
Some caveats:
- `map_region_at_specific_addr` is in the fault path for the VMM, the hottest part w.r.t. meeting a guest's expectations on memory accesses. It is likely that we don't want to do the overlapping check on every call, but instead to elide it in our code altogether, allowing the `map_region_at_specific_addr` call to fail in the kernel, amortizing that cost.
- We likely do want the `map_region` functions to make use of the new state, as the cost of letting the kernel tell us about an overlap, one page at a time, is in its worst case more than what could be reasonably amortized. Imagine failing through a region of a few hundred megabytes, or even a gigabyte, one page at a time.

This rolls up into https://github.com/auxoncorp/engineering-planning/issues/3 and is blocked by #9.
Uses separate binaries and the new process spawning support
Implement a macro (possibly procedural) fulfilling the harness-content-specification pattern described in the seL4 unit testing proposal:
Something like:
sel4_test_main!(sel4_harness, [resource_usage_test, child_process_test])
OR
sel4_test_main!(resource_usage_test, child_process_test)
The main point is to gather up all of the test case instances (via their setup methods generated by the test-attribute macro) and run them sequentially while reporting output somewhere. The setup methods should give nearly everything necessary to run the tests as child processes, and the harness should await the report of success / failure / a fault for those child processes.
Report output should most likely be by UART, though fallback to DebugPutChar when supported may be acceptable.
Most of this code can live in a test-supporting subcrate, and the test harness macro just pulls together a few of the more tedious bits. It is theoretically possible that all of the grunt work of the test harness will be feasibly non-generated, in which case we can eliminate the macro aspect of this whole task (at the potential cost of more per-harness boilerplate).
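The non-generated grunt work could be as small as this sketch. It is a std-Rust stand-in: `TestOutcome`, the test names, and the `println!` reporting mimic, but are not, the ferros test-support API:

```rust
// Sketch of the harness core: run test fns in order and tally outcomes.
// `TestOutcome` mirrors ferros::test_support::TestOutcome in spirit only.
#[derive(Debug, PartialEq)]
enum TestOutcome {
    Success,
    Failure,
}

fn run_all(tests: &[(&str, fn() -> TestOutcome)]) -> (usize, usize) {
    let (mut passed, mut failed) = (0, 0);
    for (name, test) in tests {
        let outcome = test();
        // In ferros this report would go out over UART or DebugPutChar.
        println!("test {} ... {:?}", name, outcome);
        match outcome {
            TestOutcome::Success => passed += 1,
            TestOutcome::Failure => failed += 1,
        }
    }
    (passed, failed)
}

fn main() {
    let (p, f) = run_all(&[
        ("resource_usage_test", (|| TestOutcome::Success) as fn() -> TestOutcome),
        ("child_process_test", || TestOutcome::Failure),
    ]);
    println!("test result: {} passed; {} failed", p, f);
    assert_eq!((p, f), (1, 1));
}
```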
In order to be able to run a series of functions (e.g. tests) that in sum take up more resources than may be available on a system at a given time, we must be able to reuse Untyped memory and CNode Slots.
For the present, simple reuse of general-purpose memory (rather than device memory) and local CNode slots ought to suffice.
The proposed approach is to add methods to `Cap` and `CNodeSlots` that accept a closure taking an owned alias-copy of the self-struct, do some work within it, and then clean up any derivative resource usage in order to allow re-use of the struct.
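The shape of that API can be sketched with a toy resource pool. Nothing here is the ferros API; `Pool` merely stands in for `Cap`/`CNodeSlots`, and "cleanup" is modeled as restoring a counter:

```rust
// Sketch of closure-scoped reuse: the closure gets the resource for its
// duration, and cleanup runs before the original can be used again.
struct Pool {
    free: usize,
}

impl Pool {
    // Lend the pool to `f`; restore the free count afterwards so the
    // same capacity can be reused by the next caller.
    fn with_temporary<R>(&mut self, f: impl FnOnce(&mut Pool) -> R) -> R {
        let saved = self.free;
        let r = f(self);
        self.free = saved; // "revoke" derivatives, allowing reuse
        r
    }

    fn alloc(&mut self, n: usize) -> bool {
        if self.free >= n {
            self.free -= n;
            true
        } else {
            false
        }
    }
}

fn main() {
    let mut pool = Pool { free: 8 };
    pool.with_temporary(|p| assert!(p.alloc(8))); // uses everything...
    pool.with_temporary(|p| assert!(p.alloc(8))); // ...yet it can be reused
    assert_eq!(pool.free, 8);
    println!("ok");
}
```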
Prove out the design of ferros-test, the product of https://github.com/auxoncorp/engineering-planning/blob/c5c3563b6a291176000c854ddb5b09563a183dcb/sel4_unit_testing_framework.md, by testing ferros with it.
There's some anecdotal evidence that our type-safe allocator pattern chews up a lot of stack. I believe that the optimizer mitigates this. Do some experiments to find out what's really going on.
This will likely involve the use of GenericArray or heapless.
Add all available fields for the `Fault` enum inner types:
- `UnknownSyscall` (architecture specific)
- `UserException` (architecture specific)

Add the additional inner fault types (only when hyp support is enabled? `KernelArmHypervisorSupport`):
- `VGICMaintenance`
- `VCPUFault`

The additional context/fields are needed for a VMM to discern what is happening when a guest VM fault occurs; specifically, our VMM depends on them.
The register initialization for aarch32 has local unit test coverage; this should exist for the aarch64 implementation as well.
This is follow on work for #9.
By "simultaneous" I mean that `VSpace` should not provide both modes of mapping at the same time. When it does, we must keep track of the regions created in uses of `map_region_at_addr` to prevent overlaps in usages of `map_region`.
Regarding the VMM's use of this: the VM's VSpace should always be in the `map_region_at_addr` mode. This is counter to a typical process, which starts in `map_region_at_addr` mode to service the user-image mapping but then transitions to the "map a region, I don't care where" mode. A `future_next_addr` member could be added to the "map at" state, which gets moved when any region exceeds it.
In an aarch64 build, the child processes don't seem to be able to use their VSpaces. The access at the program counter faults before the process even starts, and the thread also cannot dump its stack:
```
vm fault on code at address 0x44ae98 with status 0x82000006
in thread 0xff800fff2c00 "child of: 'rootserver'" at address 0x44ae98

With stack:
0x312000: INVALID
0x312008: INVALID
0x312010: INVALID
0x312018: INVALID
0x312020: INVALID
0x312028: INVALID
0x312030: INVALID
0x312038: INVALID
0x312040: INVALID
0x312048: INVALID
0x312050: INVALID
0x312058: INVALID
0x312060: INVALID
0x312068: INVALID
0x312070: INVALID
0x312078: INVALID
```
My best effort at translating what that status (`0x82000006`) means, gathered from some ARM documentation on the instruction fault status register:

`b000110`: access flag fault, page

This leads me to wonder whether there are some broken details w.r.t. rights, flags, or settings for page mapping when using aarch64.
Here's some information on the meaning of access flags: http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0211k/Caceaije.html
However, I'm unsure what those are or how they tie into the use of `CapRights` in seL4.
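For quick decoding while debugging, the arithmetic can be sketched as follows, assuming (as the `b000110` reading above suggests) that the status code occupies the low six bits of the reported word; the helper name is made up:

```rust
// Sketch: pull the fault status code out of the reported fault word.
// Assumes the code sits in the low six bits; a decoding aid only,
// not a ferros API.
fn fault_status(status_word: u32) -> u32 {
    status_word & 0b11_1111
}

fn main() {
    let code = fault_status(0x8200_0006);
    assert_eq!(code, 0b000110); // access flag fault, page
    println!("status code: {:#08b}", code);
}
```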
The VMM process will be self-hosted in the sense that it is managing its own memory. The main goal of this work is to separate the creation of a process from the use of that process's VSpace. In the existing process creation, `ReadyProcess::new` takes a mutable reference to the process's VSpace; this borrow prevents placing a VSpace into a child's process parameters.
This issue proposes a new type of process, one that is self-hosted.
In `userland/process`, the existing implementation becomes `StandardProcess` and a new process type, `SelfHosted`, is added to the module.
`SelfHosted`'s constructor takes a stack addr for the child and the local region for stack setup:
```rust
pub fn new<T: RetypeForSetup>(
    child_stack_addr: usize,
    cspace: LocalCap<ChildCNode>,
    parent_mapped_region: MappedMemoryRegion<StackPageCount, shared_status::Exclusive>,
    parent_cnode: &LocalCap<LocalCNode>,
    function_descriptor: extern "C" fn(T) -> (),
    process_parameter: SetupVer<T>,
    ipc_buffer_ut: LocalCap<Untyped<PageBits>>,
    tcb_ut: LocalCap<Untyped<<ThreadControlBlock as DirectRetype>::SizeBits>>,
    slots: LocalCNodeSlots<PrepareThreadCNodeSlots>,
    priority_authority: &LocalCap<ThreadPriorityAuthority>,
    fault_source: Option<crate::userland::FaultSource<role::Child>>,
) -> Result<Self, ProcessSetupError>;
```
At its call site, the stack regions should already be configured:
```rust
let (unmapped_stack_pages, _) =
    parent_mapped_region.share(page_slots, parent_cnode, CapRights::RW)?;
let mapped_stack_pages =
    vspace.map_shared_region_and_consume(unmapped_stack_pages, CapRights::RW)?;
let stack_pointer =
    mapped_stack_pages.vaddr() + mapped_stack_pages.size() - param_size_on_stack;

let params = SelfHostedParams { vspace };

// Now we set up the process _without_ using the vspace.
let proc = SelfHosted::new(
    stack_pointer,
    child_cnode,
    local_mapped_region,
    &cnode,
    child_main,
    params,
    ut,
    ut,
    slots,
    tpa,
    Some(fault_source),
)?;
```
See comment below for an update to this API.
`ProcType` (optional): There's likely some code reuse to be had, both in funneling into the overlap with standard process creation and in the common behavior. Consider a

```rust
impl<P: ProcType> P {
    // Common things
}
```

solution to capture some of the overlap, where `SelfHosted` and `StandardProcess` both implement `ProcType`.
This was papered over while trying to maintain forward progress on the vmm.
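As a side note, `impl<P: ProcType> P { … }` isn't accepted by rustc (an inherent impl can't target a bare type parameter); a trait with provided methods is one way to get the same effect. A hypothetical sketch, with all names and the placeholder behavior illustrative only:

```rust
// Hypothetical sketch: share common behavior between process kinds via a
// trait with provided methods (the real overlap would be the shared setup
// steps, elided here).
trait ProcType {
    fn kind(&self) -> &'static str;

    // "Common things" live here as provided methods.
    fn describe(&self) -> String {
        format!("{} process", self.kind())
    }
}

struct StandardProcess;
struct SelfHosted;

impl ProcType for StandardProcess {
    fn kind(&self) -> &'static str { "standard" }
}

impl ProcType for SelfHosted {
    fn kind(&self) -> &'static str { "self-hosted" }
}

fn main() {
    assert_eq!(SelfHosted.describe(), "self-hosted process");
    println!("{}", StandardProcess.describe());
}
```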
Some current interrupt controller implementations support up to 10 bits' worth of interrupts. Our representation and handling of IRQ values is currently limited to a `u8`, and thus needs to be expanded or made more flexible.
See also: section 2.2.1 of https://www.cl.cam.ac.uk/research/srg/han/ACS-P35/zynq/arm_gic_architecture_specification.pdf
ferros/src/cap/fault_reply_endpoint.rs
Lines 22 to 24 in 60ccc39
So we can create and map device memory in the same path as other general MemoryRegions.
These three types grew at distinct times in the development of VSpace 2: The Address Space Odyssey and they could be contracted into a single type which serves the needs of each use case.
This is a follow up to #9.
We need to hand a lot, but not all, of the `IRQControl` slots to the VMM process, while still maintaining some of them for local device drivers (a UART to talk to the control plane, for example).
Off the cuff:
Where the size differences are delegated to the conditional inclusion of an arch-specific module.
There are cases where weak regions are needed, e.g. mapping in user images, as their size is not known at compile time.
In support of tracking short-running child processes, a ferros process ought to be able to start a child process and then wait for it to either fault or send an "I am finished" signal via an endpoint or notification.
```diff
 UserTestFnOutput::Result => parse_quote! {{
     match #call {
         Ok(_) => ferros::test_support::TestOutcome::Success,
-        Err(_) => ferros::test_support::TestOutcome::Failure,
+        Err(e) => {
+            ferros::debug_println!("Test failed:\n {:#?}\n", e);
+            ferros::test_support::TestOutcome::Failure
+        }
     }
 }},
```
Also update the tests.
When running `cargo test` with feature `vspace_map_region_at_addr` enabled, some tests are failing on the aarch64 virt machine.
All errors are the same, and seem to indicate we are attempting to re-use/overwrite a resource:
```
Test failed:
ProcessSetupError(
    VSpaceError(
        SeL4Error(
            PageMap(
                DeleteFirst,
            ),
        ),
    ),
)
...
test result: FAILED. 4 passed; 12 failed;
```
In 64-bit architectures, a CNode slot is 32 (2^5) bytes rather than 16 (2^4). This means that `retype_cnode` will need arch-specific treatment.
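The size arithmetic behind this can be sketched as follows; the function and constant names are illustrative, not ferros items:

```rust
// Sketch: a CNode with 2^radix slots needs an untyped of
// radix + slot_bits size bits, where slot_bits is 5 on 64-bit
// targets and 4 on 32-bit ones.
const fn cnode_untyped_size_bits(radix: u8, slot_bits: u8) -> u8 {
    radix + slot_bits
}

fn main() {
    assert_eq!(1usize << 5, 32); // 64-bit slot size in bytes
    assert_eq!(1usize << 4, 16); // 32-bit slot size in bytes
    // The same 4096-slot CNode needs twice the memory on 64-bit:
    assert_eq!(cnode_untyped_size_bits(12, 5), 17);
    assert_eq!(cnode_untyped_size_bits(12, 4), 16);
    println!("ok");
}
```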
Now that we have `WUTBuddy::alloc_strong`, we can provide the same API that `Allocator::get_untyped` gives us, but with `WUTBuddy` underneath! We can also use something buddy-like for tracking device memory; however, it will need to be supplemented with the ability to look things up by physical address, which I'll get into later in this issue.
Where one face looks at general memory, the other at device memory. This results in the effective erasure of `Allocator` as we know it; from its ashes rise a `WUTBuddy`-backed general memory allocator and a split-based allocator for device memory.
```rust
pub struct UntypedAllocator {
    uts: WUTBuddy<role::Local, memory_kind::General>,
}

pub struct DeviceUntypedAllocator {
    device_uts: ArrayVec<[WUntyped<memory_kind::Device>; MAX_DEVICE_UTS]>,
}
```
Both of these allocators could join the `BootInfo` family, making there be a single bootstrap call which returns these two objects as part of the boot info structure1.

- `WUntyped`s do not have memory kinds associated with them. This will likely be the most amount of work, as it'll result in threading it into all of the `WUntyped` usages.
- A `memory_kind` sorting hat to separate device from general memory.
- `weak_ut_buddy` to be able to be constructed around a set of untypeds rather than just one.
- `get_untyped`: Its API could remain the same, but I think it will want to return the error from `WUTBuddy::alloc_strong` for cases where splitting is involved and we get into syscall-land.
- `micro_alloc.rs`: make `get_untyped` delegate to `alloc_strong`.

For device memory, we need something like a buddy allocator, but our lookups aren't about sizes exclusively; they're also about physical addresses. We also have the desire to take only the relevant chunk of some oversized "device region", as mentioned in #30. For this use case, we want our implementation to do 4 things:
Our existing buddy allocator implementations use size as the sorting criterion; we want a slightly different implementation which uses physical addresses as the search criterion, where size only comes into play after we've located the target region.
Before I get into a suggested implementation I'd like to mention a conjecture w/r/t the salience of the type-level tracking of the size of device untypeds: I don't think it matters. We need size like an afterthought so we know how to split—we're not trying to prevent an over-allocation on device memory. Its depletion as a resource is binary, not gradual; the capability to the desired device is there or it is not.
How I imagine this works is as a simple `ArrayVec` holding `WUntyped`s whose `paddr` we're also tracking, à la `seL4_UntypedDesc`. Lookups search through the vector to find the region the desired device resides in, and it's extracted from the list. If it needs to be split, it is—buddy style—and then the new caps are pushed into the vector2.
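Under those assumptions, the lookup-and-split could be sketched like this, with `(paddr, size_bits)` pairs standing in for the device `WUntyped` caps; everything here is illustrative, not the ferros API:

```rust
// Sketch of a paddr-keyed device allocator. Splitting halves a region
// buddy-style until a chunk of `want_bits` containing `paddr` is isolated;
// the buddies go back into the pool.
fn take_device_chunk(
    regions: &mut Vec<(usize, u8)>,
    paddr: usize,
    want_bits: u8,
) -> Option<(usize, u8)> {
    // Find the region containing the requested physical address.
    let idx = regions
        .iter()
        .position(|&(base, bits)| paddr >= base && paddr < base + (1usize << bits))?;
    let (mut base, mut bits) = regions.swap_remove(idx);
    // Split in half repeatedly; keep the half containing paddr.
    while bits > want_bits {
        bits -= 1;
        let half = 1usize << bits;
        if paddr < base + half {
            regions.push((base + half, bits)); // buddy goes back in the pool
        } else {
            regions.push((base, bits));
            base += half;
        }
    }
    Some((base, bits))
}

fn main() {
    let mut pool = vec![(0x4000_0000, 16)]; // one 64 KiB device region
    let got = take_device_chunk(&mut pool, 0x4000_5000, 12).unwrap();
    assert_eq!(got, (0x4000_5000, 12)); // the 4 KiB chunk holding the device
    println!("pool now holds {} buddies", pool.len());
}
```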
2. … `WUntyped`s
3. … `memory_kind::Device` untypeds to reduce intrusion?

We learned in debugging for #28 that without `extern "C"`, Rust does not abide by the aarch32 calling convention but instead puts the parameters on the stack and a pointer to them in `r0`. With that in mind, consider one of the subtleties of aarch64's calling convention: if a "composite type" is greater than or equal to 16 bytes (2 words) in size, then it goes onto the stack with a pointer to it in `x0`; otherwise, it can go into `x0` and `x1`. Without `extern "C"`, does Rust follow these rules? Or does it always do the pointer-into-the-stack method?
This is part of the work described in section I in #39.
When a fault occurs, the seL4 kernel generates a reply capability in a special TCB slot that can be used to restart/resume the faulting thread.
A VMM will need access to that reply capability in order for it to resume a guest, after servicing a data fault for example.
This will require some number of CSpace slots to be allocated during initialization, which can be filled at runtime with the seL4 syscall `seL4_CNode_SaveCaller()`.
Implement a procedural macro for test function annotation that would support the usage pattern described in the seL4 unit testing proposal.
Suggested proc-macro implementation is to generate:
As the proc-macro will need to be highly aware of ferros types for resource injection, this makes some sense to do as a subcrate in this repo.
By the time the VSpace redux lands in master, we will have support for both aarch32 and aarch64. We should be able to run our tests on both architectures.
This is a follow up to #9.
The VMM needs—in a debug build—nearly a MB of stack; this should not be required for most Ferros processes.
Currently device untypeds track paddr in their `MemoryKind`. Move this up to `{W}Untyped` so it applies to all untypeds, and add a public `paddr` method to obtain the information.
Rationale: We need the physical address of regular memory to set up DMA transfers. We currently get this by asking the kernel about a page address. This is not ideal because it's another syscall that isn't strictly needed. There are also cases where we might want to reason about general untypeds with knowledge of their physical address before we turn them into pages, perhaps to take advantage of platform NUMA characteristics.
This change also cleans up some General/Device memory distinctions, making the code a bit easier to understand.
Anti-Rationale: the kernel is tracking this and making it available, so we shouldn't re-track it ourselves.
... or make a different struct to do the same thing. We need to take a large device untyped and carve it up on demand to get a single page device untyped at a requested address.
We need this when mapping device memory.
The seL4 kernel supports running on an ARMv8 ISA, application profile; so should ferros.
It's worth noting that the AArch32 execution state of ARMv8-A is compatible with an ARMv7-A implementation that includes the Virtualization Extensions.
This allows us to run a 32-bit guest at EL0/EL1 in a 64-bit VMM at EL2 (assuming the exception handling paths exist).
Also note that GNU and Linux documentation (except for Redhat and Fedora distributions) sometimes refers to AArch64 as ARM64.
AArch64 supports three different translation granules. These define the block size at the lowest level of translation table and control the size of the translation tables in use.
seL4 has picked a translation granule of 4 KB, with the following page sizes:
- `seL4_PageBits` = 4 KB at level 3
- `seL4_LargePageBits` = 2 MB at level 2
- `seL4_HugePageBits` = 1 GB at level 1

In the AArch64 state, there are four levels of paging structures:
- `PageGlobalDirectory` at level 0
- `PageUpperDirectory` at level 1
- `PageDirectory` at level 2
- `PageTable` at level 3

where the VSpace is realized as a `PageGlobalDirectory`.
All paging structures are indexed by 9 bits of the virtual address, therefore each level can address 2^9 = 512 slots.
4 KB pages
In the case of a 4 KB granule, the hardware can use a 4-level lookup process. The 48-bit address has nine address bits for each level translated (that is, 512 entries each), with the final 12 bits selecting a byte within the 4 KB page coming directly from the original address.
Bits [47:39] of the virtual address index into the 512 entry L0 table. Each of these table entries spans a 512 GB range and points to an L1 table. Within that 512 entry L1 table, bits [38:30] are used as index to select an entry and each entry points to either a 1 GB block or an L2 table.
Bits [29:21] index into a 512 entry L2 table and each entry points to a 2 MB block or next table level. At the last level, bits [20:12] index into a 512 entry L3 table and each entry points to a 4 KB block.
```
+--------+--------+--------+--------+--------+--------+--------+--------+
|63    56|55    48|47    40|39    32|31    24|23    16|15     8|7      0|
+--------+--------+--------+--------+--------+--------+--------+--------+
 |                 |         |        |        |        |
 |                 |         |        |        |        +-> [11:0] in-page offset
 |                 |         |        |        +----------> [20:12] L3 index
 |                 |         |        +-------------------> [29:21] L2 index
 |                 |         +----------------------------> [38:30] L1 index
 |                 +--------------------------------------> [47:39] L0 index
 +--------------------------------------------------------> [63] TTBR0/1
```
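The index extraction the diagram describes is plain shift-and-mask arithmetic; a sketch (the helper name is illustrative):

```rust
// Sketch: extract the 4 KB-granule translation-table indices, mirroring
// the [47:39]/[38:30]/[29:21]/[20:12]/[11:0] split shown above.
fn translation_indices(vaddr: u64) -> (u64, u64, u64, u64, u64) {
    let l0 = (vaddr >> 39) & 0x1ff;
    let l1 = (vaddr >> 30) & 0x1ff;
    let l2 = (vaddr >> 21) & 0x1ff;
    let l3 = (vaddr >> 12) & 0x1ff;
    let offset = vaddr & 0xfff;
    (l0, l1, l2, l3, offset)
}

fn main() {
    // Build an address with known indices, then recover them.
    let vaddr = (1u64 << 39) | (2 << 30) | (3 << 21) | (4 << 12) | 0xa10;
    assert_eq!(translation_indices(vaddr), (1, 2, 3, 4, 0xa10));
    println!("ok");
}
```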
When mapping memory, consider the cacheability:
- `cacheable == true`: use `seL4_ARCH_Default_VMAttributes`
- `cacheable == false`: use `seL4_ARCH_Uncached_VMAttributes`

- an `arm64` module with all of the respective architecture `const`s
- `aarch64`-specific translation and paging constructors

```rust
// IN
#[derive(ABI)]
struct Foo {
    x: MyThing
}

// OUT
#[repr(C)]
struct Foo {
    x: MyThing
}

impl ABI for Foo {
    fn _hidden_static_asserts() {
        assert_impl!(MyThing, ABI);
    }
}
```
In the initial VSpace implementation, pages had a type-level distinction regarding whether or not they were backed by device memory, the `MemoryKind`. This should be added to memory regions in the new VSpace implementation.
This is a follow up to #9.
A seL4 VCPU object enables a thread to perform instructions and operations as if it were running at a higher privilege level. Higher privilege levels typically have access to additional machine registers and other pieces of state, the VCPU object acts as storage for that state.
Note that the `VCpuRegister` variants are different for each architecture.
There is some opportunity to have distinct `Bound` and `Unbound` VCPU types, where the process of binding an unbound VCPU to a TCB yields a bound VCPU type.
We should wrap the underlying capability to expose these methods:
```rust
/// Bind a TCB to a virtual CPU.
///
/// When a TCB has a bound VCpu it is allowed to
/// have the mode field of the cpsr register set
/// to values other than user; it is allowed to
/// have any value other than hypervisor.
pub fn bind_tcb(&mut self, tcb: &mut LocalCap<ThreadControlBlock>) -> Result<(), SeL4Error>;

/// Inject an IRQ to a virtual CPU.
pub fn inject_irq(
    &mut self,
    virq: u16,
    priority: u8,
    group: u8,
    index: u8,
) -> Result<(), SeL4Error>;

/// Read a virtual CPU register.
pub fn read_register(&self, reg: VCpuRegister) -> Result<usize, SeL4Error>;

/// Write a virtual CPU register.
pub fn write_register(&mut self, reg: VCpuRegister, value: usize) -> Result<(), SeL4Error>;
```
Using `insert_sorted` inside a loop can lead to bad performance. Instead, it should insert everything and then sort once at the end.
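A sketch of the push-then-sort replacement, using a std `Vec` for illustration (on no_std the same `sort_unstable` is available on slices); the helper name is made up:

```rust
// Sketch: instead of insert_sorted inside a loop (quadratic from the
// per-insert shifting), push everything and sort once at the end.
fn collect_sorted(items: impl IntoIterator<Item = u64>) -> Vec<u64> {
    let mut v: Vec<u64> = items.into_iter().collect();
    v.sort_unstable(); // one O(n log n) pass replaces n sorted inserts
    v
}

fn main() {
    assert_eq!(collect_sorted([3, 1, 2]), vec![1, 2, 3]);
    println!("ok");
}
```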
The world in which it was needed is no longer the current timeline
It's currently heavily pointer based and very unsafe, but it doesn't have to be.
Not unlike the distinction between Child and Local for capabilities, we should also include a VSpace state which denotes that it is being prepared for a child's usage.
Consider as an example the creation of a child process in a child process:
ferros/qemu-test/test-project/src/grandkid_process_runs.rs
Lines 65 to 72 in ad3807b
It is of course standard to be operating on a child's VSpace; what is not standard here is that this child intends to create a child of its own. That's why we map pages into its address space and move the capabilities to the child's CSpace. The child will use this mapped region as scratch for creating its own child's stack. It is for this use case that we'd like to have a state in `vspace_state` for a process's VSpace whose intent is to beget children.1

1. This state shall also add type-level guardrails to prevent a developer from mixing local and child capabilities, &c. See comment below regarding a `CNodeRole` index.
Presently qemu-test uses manual tracking of the number of expected tests we want to pass inside the seL4 instance (where the ferros-test based driver is generating output).
It would be nice to have better integration that obviated the need for manually incrementing the expected number of tests to pass. Three approaches spring to mind:
This issue rolls up into https://github.com/auxoncorp/engineering-planning/issues/3
When interacting with a VSpace, the address space should be considered first-class and any interactions should be on the basis of virtual addresses. The reason for this is twofold:
We need representations of the seL4 error constants in Ferros, likely as an enum with a `From` implementation for the existing libsel4-defined `usize` constants.
This is borne from the review on #9, but is not limited to a VSpace requirement.