
tytso / xfstests-bld

Creates a file system / storage test appliance which can be run using KVM, GCE, and Android

License: GNU General Public License v2.0

Makefile 3.88% Shell 29.96% C 38.87% Perl 0.65% TeX 8.90% Yacc 0.22% Awk 0.37% M4 0.48% HTML 0.01% Python 1.77% Roff 10.09% sed 0.10% Dockerfile 0.08% Go 4.60%
Topics: blktests linux test-appliance xfstests

xfstests-bld's Issues

FR: have LTM parse serial console of test VMs

Having the LTM monitor the test VMs' serial consoles would give us more information about the status of each VM.

We could monitor for panic/oops messages. This would allow us to catch hung tasks earlier instead of having to wait out the full one-hour test timeout. Additionally, we probably want to reboot the VM whenever we see these messages, since a panic or oops may compromise the integrity of subsequent tests.

Right now we update the test status via the test VM's GCE metadata. Since updating a VM's metadata too frequently is discouraged, we limit the VM to updating its metadata once a minute. Using the metadata as a means of communication between the test VM and the LTM server can be inconvenient at times. If we monitor the serial console, we could stop using GCE metadata for test status reporting.
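A minimal sketch of what this monitoring could look like, assuming the LTM polls the GCE compute API's GetSerialPortOutput call from Go; the regexp, poll interval, and project/zone/instance names are illustrative, not the actual LTM code:

package main

import (
    "context"
    "fmt"
    "log"
    "regexp"
    "time"

    compute "google.golang.org/api/compute/v1"
)

// panicRE matches the kernel messages we care about on the serial console.
var panicRE = regexp.MustCompile(`(?i)kernel panic|oops:|hung task`)

// watchSerialConsole polls a test VM's serial port output and returns as
// soon as a panic/oops/hung-task message shows up, instead of waiting for
// the fixed one-hour status timeout.
func watchSerialConsole(ctx context.Context, svc *compute.Service,
    project, zone, instance string) error {
    var start int64
    for {
        out, err := svc.Instances.GetSerialPortOutput(project, zone, instance).
            Start(start).Context(ctx).Do()
        if err != nil {
            return err
        }
        if panicRE.MatchString(out.Contents) {
            return fmt.Errorf("%s: panic/oops seen on serial console", instance)
        }
        start = out.Next // only fetch new output on the next poll
        select {
        case <-ctx.Done():
            return ctx.Err()
        case <-time.After(30 * time.Second):
        }
    }
}

func main() {
    ctx := context.Background()
    svc, err := compute.NewService(ctx)
    if err != nil {
        log.Fatal(err)
    }
    err = watchSerialConsole(ctx, svc, "my-project", "us-central1-a", "xfstests-test-vm")
    if err != nil {
        log.Print(err)
    }
}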

LTM test sharding to handle varying CPU/MEM/SSD requirements for a child test VM

The sharding logic in the LTM currently assumes that all child test appliances use 2 CPUs. The PD-SSD quota limits are not involved in the calculation at all (although the default 500 GB PD-SSD quota is unlikely to be hit, since the scratch disk size change brought the PD-SSD size of most configs down to around 16 GB, except for the few LARGE configs requiring up to 56 GB).

For this reason, the "ext4/dax" config, and all variations in NR_CPU/MEM for the test appliance, are disallowed. The sharding logic in the LTM could be updated to shard test appliances more intelligently and take these factors into account.
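A rough sketch of what resource-aware sharding might look like, assuming the LTM tracks per-config CPU and PD-SSD requirements against the remaining regional quota; all type names, field names, and numbers below are illustrative:

package main

import "fmt"

// shardReq describes what one child test appliance needs. The DAX and
// LARGE configs would carry larger values here.
type shardReq struct {
    Config string
    CPUs   int
    MemGB  int
    SSDGB  int
}

// regionQuota is the remaining quota in one GCE region.
type regionQuota struct {
    CPUs  int
    SSDGB int
}

// fits reports whether a shard with the given requirements can still be
// placed in the region, instead of assuming every shard uses 2 CPUs.
func (q *regionQuota) fits(r shardReq) bool {
    return q.CPUs >= r.CPUs && q.SSDGB >= r.SSDGB
}

// place reserves the quota for the shard.
func (q *regionQuota) place(r shardReq) {
    q.CPUs -= r.CPUs
    q.SSDGB -= r.SSDGB
}

func main() {
    quota := regionQuota{CPUs: 24, SSDGB: 500}
    shards := []shardReq{
        {Config: "ext4/4k", CPUs: 2, MemGB: 7, SSDGB: 16},
        {Config: "ext4/dax", CPUs: 4, MemGB: 26, SSDGB: 16},
        {Config: "xfs/large", CPUs: 2, MemGB: 7, SSDGB: 56},
    }
    for _, s := range shards {
        if quota.fits(s) {
            quota.place(s)
            fmt.Printf("%s: launch in local region\n", s.Config)
        } else {
            fmt.Printf("%s: defer or shard to another region\n", s.Config)
        }
    }
}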

FR: support --no-email option for ltm commands

When running LTM commands for selftests, we likely don't want to receive emails. Right now, --no-email is only supported for plain gce-xfstests commands (gce-xfstests -c ...) but not for LTM commands (gce-xfstests ltm -c ...). If you attempt to use --no-email with an LTM command, the tests will run, but sharder.emailReport() will panic because there is no email address, which in turn causes the results not to be packed and uploaded to GCE.

sharder.go:

func (sharder *ShardScheduler) finish() {
    sharder.log.Debug("Finishing sharder")

    sharder.aggResults()
    sharder.createInfo()
    sharder.createRunStats()
    sharder.genResultsSummary()

    if !sharder.reportKCS {
        // panics here because there is nowhere to send the email
        sharder.emailReport()
    }

    sharder.packResults()
}
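A minimal fix could be to guard the call when no destination address is configured, along these lines (the reportReceiver field name is hypothetical):

    if !sharder.reportKCS && sharder.reportReceiver != "" {
        // Skip the email report entirely when --no-email was given,
        // so the results still get packed and uploaded.
        sharder.emailReport()
    }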

FR: pretty prints for LTM commands

LTM commands such as ltm-info, as well as the commands that launch tests, directly return the JSON response from the LTM server. When many tests and/or configs are running, the ltm-info output can get very long. It would be nice to have a summarized version of the test statuses, and we might as well clean up the other commands that also print the JSON.

Note that some functions rely on parsing the JSON directly to get specific fields. If we make pretty printing the default, we would want to add a --json flag that returns the raw JSON for these cases.
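A sketch of what a summarized ltm-info output could look like, assuming a hypothetical JSON shape for the per-shard status; the real field names likely differ:

package main

import (
    "encoding/json"
    "fmt"
    "log"
)

// shardStatus is a guess at the shape of one entry in the ltm-info JSON;
// the real field names may differ.
type shardStatus struct {
    Config string `json:"config"`
    Test   string `json:"current_test"`
    Done   int    `json:"tests_done"`
    Total  int    `json:"tests_total"`
}

func main() {
    raw := []byte(`[{"config":"ext4/4k","current_test":"generic/013","tests_done":42,"tests_total":600}]`)
    var shards []shardStatus
    if err := json.Unmarshal(raw, &shards); err != nil {
        log.Fatal(err)
    }
    // Print a one-line summary per shard instead of dumping the raw JSON;
    // a --json flag would bypass this and print the response verbatim.
    for _, s := range shards {
        fmt.Printf("%-12s %4d/%-4d running %s\n", s.Config, s.Done, s.Total, s.Test)
    }
}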

Speed up LTM server launch by combining the files read from GCS

The shell script /usr/local/lib/gce-fetch-gs-files takes 20 seconds to run:

Jun 04 21:18:34 xfstests-ltm systemd[1]: Starting GCE self-signed cert fetch from GCS...
Jun 04 21:18:44 xfstests-ltm root[1841]: fetching cert completed, gsutil returned 0
Jun 04 21:18:50 xfstests-ltm root[2213]: fetching config completed, gsutil returned 0
Jun 04 21:18:54 xfstests-ltm systemd[1]: Started GCE self-signed cert fetch from GCS.

This is because it calls gsutil multiple times, and each invocation takes a few seconds. It should be possible to combine the files needed to configure the LTM server into a single tar file so that only one gsutil fetch is needed.

ext4/304 passed with Ubuntu Disco

It looks like ext4/304 is not failing anymore; it passed with Ubuntu Disco (5.0.0-29-generic):

04:10:45 DEBUG| [stdout] ext4/303	 75s
04:11:20 DEBUG| [stdout] ext4/304	 35s
04:11:23 DEBUG| [stdout] ext4/305	 3s

Perhaps it can be removed from the kvm-xfstests/test-appliance/files/root/fs/ext4/exclude exclusion file?

--email should have an equivalent --no-email option

The --email option currently allows passing an empty string as an argument, which ends up disabling email sending for that test run. Disabling emails for a test run should be changed to an explicit --no-email option.

The LTM can also benefit from this and pass --no-email instead of --email ''.

fio failed to build due to gettid define conflict

Issue found on Ubuntu Eoan 19.10, with gcc (Ubuntu 9.2.1-8ubuntu1) 9.2.1 20190909, glibc (Ubuntu GLIBC 2.30-0ubuntu1) 2.30

It looks like newer glibc ships a gettid() declaration in /usr/include/x86_64-linux-gnu/bits/unistd_ext.h.

This conflicts with fio's own gettid definition and breaks the build.

FR: validate group name

Similar to validate_test_name, we should validate the group name (auto, quick, etc.) in gce-xfstests. Using an incorrect group name results in the VM launching but then failing with an "Unable to generate test summary report" email.

LTM should allow launching child VMs with DAX config

The DAX config is explicitly disallowed and ignored by the LTM server. This is to avoid the test sharding logic miscalculating quota restrictions and attempting to launch a child test VM into a region without sufficient CPU quota. The sharding logic in the LTM needs to handle varying test VM resource requirements before the DAX config can be allowed to run.

This requires fixing #11 as a prerequisite.

Allow passing LTM a maximum number of shards

The LTM has an argument that allows passing a maximum number of shards to create when only the local region is being used (--no-region-shard).

However, this option isn't exposed to the user at all. It could be made available via a command-line option.
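A sketch of how such a maximum could feed into the shard calculation; the function name and the idea of a --max-shards option forwarded in the JSON request are hypothetical:

    // shardCount caps the number of shards at maxShards; a maxShards of 0
    // means "no limit". The value could come from a new command-line
    // option that is forwarded in the JSON request to the LTM.
    func shardCount(nTests, testsPerShard, maxShards int) int {
        n := (nTests + testsPerShard - 1) / testsPerShard
        if maxShards > 0 && n > maxShards {
            n = maxShards
        }
        return n
    }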

Build failed (fsverity-utils.git is gone)

When trying to build xfstests-bld, get-all will fail with:

Running './get-all'

Cloning into 'fio'...
warning: redirecting to https://git.kernel.dk/fio.git/
Checking out fio fio-3.15 (previously was fio-3.33-96-gded6cce8)

Cloning into 'quota'...
Checking out quota 6e631074330a (previously was v4.05-53-gd90b7d5)

Cloning into 'xfsprogs-dev'...
Checking out xfsprogs-dev v5.2.0 (previously was v6.1.1)

Cloning into 'xfstests-dev'...

Cloning into 'fsverity'...
fatal: repository 'https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git/' not found
Failed to clone fsverity from https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git/

It looks like https://git.kernel.org/pub/scm/linux/kernel/git/ebiggers/fsverity-utils.git/ does not exist anymore.

LTM to use/handle preemptible instances

Because the LTM is already monitoring the test VMs, it should be possible to run the child VMs as preemptible instances.

This will require the LTM to be able to detect the difference between preemption and successful test completion, and to re-launch the child VM with all unfinished tests, starting from the last known completed test. This will also require an improvement to results aggregation, since the results from the preempted child test appliance will need to be kept and combined with the results of the resumed test appliance.

This may require a change to xfstests (an upstream package) to add the ability to "resume" an FSTESTSET from a given point, or this "resume tests" behavior could be implemented separately using custom FSTESTSETs. See: #9
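One possible way to tell preemption apart from a normal shutdown, assuming the LTM queries the instance state via the GCE compute API; the testsFinished flag stands in for whatever completion signal the LTM already tracks:

package ltm

import (
    "context"

    compute "google.golang.org/api/compute/v1"
)

// preempted reports whether a child VM went away because GCE preempted it
// rather than because the test appliance shut itself down after finishing.
// A preempted preemptible instance ends up in the TERMINATED state without
// the shard having reported completion.
func preempted(ctx context.Context, svc *compute.Service,
    project, zone, name string, testsFinished bool) (bool, error) {
    inst, err := svc.Instances.Get(project, zone, name).Context(ctx).Do()
    if err != nil {
        return false, err
    }
    isPreemptible := inst.Scheduling != nil && inst.Scheduling.Preemptible
    return isPreemptible && inst.Status == "TERMINATED" && !testsFinished, nil
}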

Touch is generating unexpected output

I've noticed a problem, possibly related to the touch version, while running xfstests using kvm-xfstests.

For test cases generic/634 and generic/635, the .bad files contain the following output:

QA output created by 634
touch: invalid date format 'Feb 22 22:22:22 UTC 2222'
Silence is golden.
QA output created by 635
touch: invalid date format 'Feb 22 22:22:22 UTC 2222'

Maybe updating touch in the VM could solve this.

LTM support running on f1-micro instance

The memory consumption of the LTM currently is too high to run on an f1-micro instance.

To reduce memory consumption, the LTM may need to use threads instead of processes for shards; the TestRunManager can still run in a separate process. For this to work, shard threads must be changed to not modify any process-level state (stdout/stderr and the root logger). This will require changing the logging setup within a shard, since shards currently replace the handlers on the root logger instead of using their own logger instances, and redirect stdout/stderr to the log file.

Additional consideration may be necessary to coordinate access to shared variables by shard threads.

Approximate current memory consumption:
Test run with 3 shards: spikes to 200 MB, stabilizes at ~100 MB
Test run with 13 shards: spikes to 600 MB, stabilizes at ~300 MB
Approximate memory consumption per shard: initial spike ~50 MB, stable ~20 MB
One test run process consumes ~50 MB on its own.

LTM shard monitoring timeouts on a per-test basis

The shard monitoring in the LTM currently assumes a fixed timeout of 1 hour between status updates from the child test appliance. If the last status update occurred more than an hour ago, the monitor process assumes that the test appliance crashed or wedged, and creates a serial port dump in place of test results.
The reason for the fixed one-hour timeout is that generic/027 and a few other tests in xfstests are very IOPS-bound and take a while (some runs take longer than 3000 seconds).

If the test appliance were more diligent in reporting the latest test being run, custom timeouts could be set for each test in xfstests, and a kernel crash would be detected much sooner. For example, generic/001 is usually quite fast to run, so if the LTM is aware that the test appliance is running generic/001, the timeout could be somewhere in the range of 20-30 seconds rather than the fixed hour.

Alternatively, tests could be grouped into several size categories, e.g. "xsmall", "small", "medium", "large", and "xlarge", each with its own timeout (sketched below).

To be even more sophisticated, the timeouts could be modified based on the number of CPUs/size of the scratch disk of that particular test appliance, and whether the test is more CPU/IOPS bound.
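A sketch of the category-based timeout lookup; the categories come from the paragraph above, but the timeout values and function names are illustrative:

package ltm

import "time"

// categoryTimeout is an illustrative per-size timeout table; the real
// values would be derived from historical runtimes and possibly scaled by
// the shard's CPU count and disk IOPS.
var categoryTimeout = map[string]time.Duration{
    "xsmall": 2 * time.Minute,
    "small":  5 * time.Minute,
    "medium": 15 * time.Minute,
    "large":  30 * time.Minute,
    "xlarge": 60 * time.Minute,
}

// timeoutFor returns the monitoring timeout for the test currently being
// run, falling back to the existing fixed one-hour limit for tests whose
// size category is unknown.
func timeoutFor(category string) time.Duration {
    if t, ok := categoryTimeout[category]; ok {
        return t
    }
    return time.Hour
}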

gce-xfstests to use custom FSTESTSETS (and LTM)

gce-xfstests passes the FSTESTSET to the kexec'd kernel on the kernel command line. The issue with passing this information there is that an extensive custom set of tests, separate from the built-in groups of xfstests, will not fit (e.g. generic/001 generic/002 generic/003 generic/004 ...etc).
Directly modifying the group files in xfstests is a workaround, but it requires re-creating the xfstests tarball.

This change would allow custom FSTESTSETs to be created and passed to the kexec'd image without specifying the entire test set on the kernel command line.
The LTM should also be made aware of custom FSTESTSET files, and should be able to pass the custom FSTESTSET from its JSON arguments to all of its child test appliances.

Details:
This may involve uploading a custom FSTESTSET file (or passing the custom FSTESTSET through the GCE instance metadata from a file), downloading the custom set into a file on the test appliance, kexec'ing, and having the kexec'd test appliance read the custom set from that file rather than exclusively pulling the test set from the kernel command line.
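A sketch of the metadata-based hand-off, assuming the LTM (or gce-xfstests) attaches the custom FSTESTSET to the child instance at creation time; the "fstestset" metadata key is made up for illustration:

package ltm

import compute "google.golang.org/api/compute/v1"

// addFstestsetMetadata attaches the contents of a custom FSTESTSET file to
// a child VM's instance metadata so the test appliance can read it after
// kexec, instead of squeezing the whole test list onto the kernel command
// line. The "fstestset" key name is made up for illustration; GCE metadata
// values are limited to 256 KB, which should be plenty for a test list.
func addFstestsetMetadata(inst *compute.Instance, fstestset string) {
    if inst.Metadata == nil {
        inst.Metadata = &compute.Metadata{}
    }
    inst.Metadata.Items = append(inst.Metadata.Items, &compute.MetadataItems{
        Key:   "fstestset",
        Value: &fstestset,
    })
}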

build-all failed because "config.guess: unable to guess system type"

With Ubuntu 18.04 Bionic (4.15.0-60-generic) running on PowerPC, the ./build-all step invoked by make failed with:

./build-all
----------------- 2019-09-04 08:11:33: Starting build of extended attribute library
checking build system type... Makefile:31: recipe for target 'all' failed
stderr:
configure: WARNING: unrecognized options: --disable-nls
./config.guess: unable to guess system type

This script, last modified 2012-06-10, has failed to recognize
the operating system you are using. It is advised that you
download the most up to date version of the config scripts from

http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.guess;hb=HEAD
and
http://git.savannah.gnu.org/gitweb/?p=config.git;a=blob_plain;f=config.sub;hb=HEAD

If the version you run (./config.guess) is already up to date, please
send the following data and any information you think might be
pertinent to config-patches@gnu.org in order to provide the needed
information to handle your system.

config.guess timestamp = 2012-06-10

uname -m = ppc64le
uname -r = 4.15.0-60-generic
uname -s = Linux
uname -v = #67-Ubuntu SMP Thu Aug 22 16:54:48 UTC 2019

/usr/bin/uname -p =
/bin/uname -X =

hostinfo =
/bin/universe =
/usr/bin/arch -k =
/bin/arch =
/usr/bin/oslevel =
/usr/convex/getsysinfo =

UNAME_MACHINE = ppc64le
UNAME_RELEASE = 4.15.0-60-generic
UNAME_SYSTEM = Linux
UNAME_VERSION = #67-Ubuntu SMP Thu Aug 22 16:54:48 UTC 2019
configure: error: cannot guess build type; you must specify one
make: *** [all] Error 1

The config.guess files in the tree are:

./acl/build-aux/config.guess
./attr/config.guess
./e2fsprogs-libs/config/config.guess
./popt/config.guess

Replacing the config.guess / config.sub in attr/ with the suggested new versions fixes this issue.

FR: ltm-auto-resume improvements

Right now, crashed tests get assigned a time of 0 seconds. See if there is a way to approximate how much time was spent on the crashed test and record that in the results (see the sketch below).

Also, have the LTM always save the serial console.
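A sketch of the time approximation, assuming the LTM records when the crashed test started (from the last status update) and when the monitor gave up on the shard; names are illustrative:

package ltm

import "time"

// crashedTestTime estimates how long the crashed test ran before the VM
// went away, instead of recording 0 seconds: the gap between the start of
// the crashed test (taken from the last status update) and the moment the
// monitor gave up on the shard.
func crashedTestTime(lastStatusUpdate, declaredDead time.Time) time.Duration {
    d := declaredDead.Sub(lastStatusUpdate)
    if d < 0 {
        return 0
    }
    return d
}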

FR: upload XML file with aggregate results separate from results.tar.gz

For LTM runs, create a top-level results.xml file which combines the per-config results.xml files. Upload this separately from the results.tar.gz file, similarly to how we handle the summary file. Then, convert the results-processing scripts to download just this XML file and use it for processing.
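A sketch of the aggregation step, assuming the per-config results.xml files are junit-style documents with a single top-level <testsuite> element; the schema fields below are deliberately minimal and may not match the real files exactly:

package main

import (
    "encoding/xml"
    "fmt"
    "log"
)

// testsuite is a minimal slice of a junit-style schema; only the fields
// needed for aggregation are declared, the rest is carried verbatim.
type testsuite struct {
    XMLName xml.Name `xml:"testsuite"`
    Name    string   `xml:"name,attr"`
    Inner   string   `xml:",innerxml"`
}

// testsuites wraps the per-config suites into one top-level document.
type testsuites struct {
    XMLName xml.Name    `xml:"testsuites"`
    Suites  []testsuite `xml:"testsuite"`
}

// combine merges the per-config results.xml documents into a single
// top-level results.xml that could be uploaded next to results.tar.gz.
func combine(perConfig [][]byte) ([]byte, error) {
    var all testsuites
    for _, doc := range perConfig {
        var ts testsuite
        if err := xml.Unmarshal(doc, &ts); err != nil {
            return nil, err
        }
        all.Suites = append(all.Suites, ts)
    }
    return xml.MarshalIndent(all, "", "  ")
}

func main() {
    a := []byte(`<testsuite name="ext4/4k"><testcase name="generic/001"/></testsuite>`)
    b := []byte(`<testsuite name="ext4/1k"><testcase name="generic/001"/></testsuite>`)
    out, err := combine([][]byte{a, b})
    if err != nil {
        log.Fatal(err)
    }
    fmt.Println(string(out))
}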

x86_64-config-4.14 does not work with 5.0-rc2 kernel anymore

The latest sample kernel config in kernel-configs is currently for the 4.14 kernel on x86_64. It worked fine for building and launching the 4.20 kernel, but somewhere between 4.20 and 5.0-rc2 it stopped working.

How to reproduce

Check out the v4.20 tag, copy x86_64-config-4.14 to .config, run make olddefconfig, then build and run -- it works fine.

Then repeat the same steps with the v5.0-rc2 tag -- it panics because the root device is absent:

[    1.042661] VFS: Cannot open root device "vda" or unknown-block(0,0): error -6
[    1.043315] Please append a correct "root=" boot option; here are the available partitions:
[    1.044047] Kernel panic - not syncing: VFS: Unable to mount root fs on unknown-block(0,0)
[    1.044766] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.0.0-rc2-xfstests #1
[    1.045378] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.1-1ubuntu1 04/01/2014
[    1.046155] Call Trace:
[    1.046381]  dump_stack+0x67/0x90
[    1.046677]  panic+0x100/0x2b3
[    1.046952]  mount_block_root+0x214/0x2be
[    1.047308]  ? do_early_param+0x8e/0x8e
[    1.047649]  prepare_namespace+0x130/0x166
[    1.048012]  kernel_init_freeable+0x316/0x325
[    1.048397]  ? rest_init+0x24c/0x24c
[    1.048715]  kernel_init+0xa/0x104
[    1.049018]  ret_from_fork+0x3a/0x50
[    1.049420] Kernel Offset: 0xcc00000 from 0xffffffff81000000 (relocation range: 0xffffffff80000000-0xffffffffbfffffff)
[    1.050347] Rebooting in 5 seconds..

I use Ubuntu 18.10 and its packaged version of QEMU:

$ qemu-system-x86_64 --version
QEMU emulator version 2.12.0 (Debian 1:2.12+dfsg-3ubuntu8.2)
Copyright (c) 2003-2017 Fabrice Bellard and the QEMU Project developers

dmesg-v5.0-rc2.log
