
unifyfs's Introduction

UnifyFS: A User-Level File System for Supercomputers

Node-local storage is becoming an indispensable hardware resource on large-scale supercomputers to buffer the bursty I/O from scientific applications. However, there is a lack of software support for node-local storage to be used efficiently by applications that use shared files.

UnifyFS is an ephemeral, user-level file system under active development. UnifyFS addresses a major usability factor of current and future systems because it enables applications to gain performance advantages from distributed storage devices on the system while being as easy to use as a center-wide parallel file system.

Documentation

UnifyFS documentation is at https://unifyfs.readthedocs.io.

For instructions on how to build and install UnifyFS, see Build UnifyFS.

Build Status

Status of the UnifyFS development branch (dev) is reported by the project's Build Status and Read the Docs badges.

UnifyFS Citation

We recommend that you use this citation for UnifyFS:

  • Michael Brim, Adam Moody, Seung-Hwan Lim, Ross Miller, Swen Boehm, Cameron Stanavige, Kathryn Mohror, Sarp Oral, “UnifyFS: A User-level Shared File System for Unified Access to Distributed Local Storage,” 37th IEEE International Parallel & Distributed Processing Symposium (IPDPS 2023), St. Petersburg, FL, May 2023.

Contribute and Develop

If you would like to help, please see our contributing guidelines.

unifyfs's People

Contributors

adammoody, boehms, camstan, craigsteffen, hariharan-devarajan, ivadmitry, jenest, jgmoore-or, kathrynmohror, michaelbrim, nedbass, rgmiller, roblatham00, sandrain, tonyhutter, violinteng, wangvsa


unifyfs's Issues

Improved management of configuration settings

Describe the problem you're observing

The current design for managing configuration settings is inconsistent, and sometimes redundant.

In the current design, the various components act as follows:

  • unifycr (the job launch helper) - reads the sysconf file, environment, and CLI arguments to define the “run configuration”, which it writes to the runstate file (default location: /var/run/unifycr/unifycr-runstate.conf).
  • unifycrd (the server) - reads the environment to initialize its configuration settings; eventually will read the runstate file (see Pull Request #109)
  • libunifycr (the client library) - reads the environment to initialize its configuration settings; eventually will use PMIx to find server connection information

This results in several problems:

  • Only unifycr supports CLI options to update configuration settings. There is no way to pass configuration settings to the server via the command line.
  • The client library uses getenv() exclusively for setting its configuration, but there is no component responsible for setting up that environment.
  • The runstate file is created in the local filesystem of the launch node (i.e., the node where unifycr is executed), and thus is not globally visible. It could be created on a shared file system, but then there is a possibility for a file access storm when all servers attempt to read the file during startup on large job allocations.
  • The runstate file uses a slightly different format from the sysconf file, and thus requires separate (redundant) methods for parsing and configuration updates.

Proposed Resolution

Server configuration will no longer use the runstate file. Servers will read their local copy of the sysconf file, and will support updates to configuration settings via environment variables and CLI options. This supports both methods of launching servers from Issue #86:

  1. unifycr - passes configuration settings via server CLI options
  2. RM-integrated launch scripts - can set environment variables or use server CLI options

Client library configuration will occur in two phases:

  1. The client will query PMIx to obtain the information necessary to connect to its local server
  2. The client will receive all other configuration settings from the server.
  • Option 1: all settings passed via local IPC
  • Option 2: the server generates a local runstate file using the same format as the sysconf file, and passes the path to this file to the client via local IPC. The client then uses the same method as the server for reading/parsing the file to set its initial configuration
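The precedence implied by the proposal (sysconf file, overridden by environment variables, overridden by CLI options) can be sketched as below. The helper and the sample setting name are illustrative, not UnifyCR code:

```c
#include <stdlib.h>
#include <string.h>

/* Sketch of the proposed precedence chain. Each source that is
 * present overrides the one below it. */
static const char *resolve_setting(const char *sysconf_val,
                                   const char *env_name,
                                   const char *cli_val)
{
    const char *val = sysconf_val;      /* lowest precedence: sysconf file */
    const char *env = getenv(env_name);
    if (env != NULL)
        val = env;                      /* environment overrides the file */
    if (cli_val != NULL)
        val = cli_val;                  /* CLI option overrides both */
    return val;
}

/* usage (hypothetical): resolve_setting("268435456", "UNIFYCR_CHUNK_MEM", cli_arg) */
```

Both the server and the client library could then share this one resolution path instead of the current mix of runstate file, environment, and CLI parsing.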

test_write_static dies with floating point exception

I get a floating point exception if I do the following:

$ salloc -N1
$ ./server.sh
$ srun -N1 -n1 ./client/tests/test_write_static -s1 -b1 -t1 -f /tmp/foo
srun: error: catalyst1: task 0: Floating point exception (core dumped)

Where server.sh contains:

#!/bin/bash
export UNIFYCR_META_SERVER_RATIO=1
export UNIFYCR_META_DB_NAME=unifycr_db
export UNIFYCR_CHUNK_MEM=0
basedir=$(dirname "$0")
srun -N 1 -n 1 $basedir/server/src/unifycrd &

GDB backtrace:

Core was generated by `./test_write_static -f /tmp/foobar -b 1 -t 1 -s 1'.
Program terminated with signal 8, Arithmetic exception.
#0  0x000000000040d8e0 in unifycr_split_index (cur_idx=<optimized out>, index_set=<optimized out>, slice_range=<optimized out>) at unifycr-fixed.c:292
292         long cur_slice_start = cur_idx->file_pos / slice_range * slice_range;
Missing separate debuginfos, use: debuginfo-install infinipath-psm-3.3-25_g326b95a_open.1.el7.x86_64 libibverbs-13-7.el7.x86_64 libnl3-3.2.28-4.el7.x86_64 libuuid-2.23.2-43.el7.x86_64 numactl-libs-2.0.9-6.el7_2.x86_64 openssl-libs-1.0.2k-8.el7.x86_64 zlib-1.2.7-17.el7.x86_64
(gdb) bt
#0  0x000000000040d8e0 in unifycr_split_index (cur_idx=<optimized out>, index_set=<optimized out>, slice_range=<optimized out>) at unifycr-fixed.c:292
#1  unifycr_logio_chunk_write (fid=0, pos=390226556, meta=0x0, chunk_id=0, chunk_offset=4241984, buf=0x3f7ff, count=1) at unifycr-fixed.c:392
#2  0x000000000040d559 in unifycr_fid_store_fixed_write (fid=0, meta=0x1742627c, pos=0, buf=0x0, count=4241984) at unifycr-fixed.c:685
#3  0x000000000040711c in unifycr_fd_write (fd=-975858628, pos=<optimized out>, buf=<optimized out>, count=<optimized out>) at unifycr-sysio.c:506
#4  __wrap_pwrite (fd=1, buf=0x1742627c, count=0, offset=0) at unifycr-sysio.c:1446
#5  0x0000000000405126 in __wrap_write (fd=1025, buf=0x1742627c, count=0) at unifycr-sysio.c:833
#6  0x000000000040332b in main (argc=9, argv=0x7fffffffb1a8) at test_write.c:132
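The faulting line divides by slice_range, so one plausible cause is slice_range being zero (an uninitialized or zeroed configuration value). A minimal sketch of the expression with a guard added; the fallback value is illustrative, not the actual fix:

```c
#include <stdio.h>

/* Sketch of the expression from unifycr-fixed.c:292. When
 * slice_range is 0, the division raises SIGFPE; guard it first. */
static long slice_start(long file_pos, long slice_range)
{
    if (slice_range <= 0) {
        fprintf(stderr, "invalid slice_range %ld, using 1MiB default\n",
                slice_range);
        slice_range = 1048576;  /* hypothetical default */
    }
    return file_pos / slice_range * slice_range;
}
```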

client: read request overflows

When the number of read requests in lio_listio exceeds the threshold (UNIFYCR_MAX_READ_CNT), or
the read requests overflow the shared read request buffer, split the read requests into
multiple transfers (TODO in unifycr-sysio.c).
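The splitting described above can be sketched as follows. This is illustrative, not UnifyCR code; process_batch() stands in for the existing single-transfer path, and the threshold value is an example:

```c
#include <stdio.h>

#define UNIFYCR_MAX_READ_CNT 16  /* example threshold */

typedef struct { long offset; long length; } read_req_t;

static int g_batches;  /* count of transfers issued, for illustration */

static int process_batch(read_req_t *reqs, int count)
{
    /* stand-in for issuing one bounded transfer to the delegator */
    g_batches++;
    printf("transfer of %d requests\n", count);
    return 0;
}

/* Walk the full request list in chunks of at most UNIFYCR_MAX_READ_CNT. */
static int process_all(read_req_t *reqs, int total)
{
    int done = 0;
    while (done < total) {
        int n = total - done;
        if (n > UNIFYCR_MAX_READ_CNT)
            n = UNIFYCR_MAX_READ_CNT;
        if (process_batch(reqs + done, n) != 0)
            return -1;
        done += n;
    }
    return 0;
}
```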

Standardize coding conventions

We currently have a mix of formatting styles in the code. Let's choose a standard for things like spaces vs. tabs, level of indentation, column width, formatting of comment blocks, placement of curly braces, etc. The convention can ultimately be checked and/or enforced with a style checking script run from a make target and as part of an automated CI test suite. There is a placeholder in the contributing guidelines for a link to style guidelines once they are written.
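As one example of what an enforced standard could look like, the choices could be captured in a machine-checkable file such as a .clang-format. All of the specific values below (spaces, 4-space indent, 80 columns, Linux-style braces) are placeholders until the project settles on its guidelines:

```yaml
# Example only: one possible starting point for an enforced style.
BasedOnStyle: LLVM
UseTab: Never
IndentWidth: 4
ColumnLimit: 80
BreakBeforeBraces: Linux
```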

First 24 bytes of read transfer are corrupt

System information

Type Version/Name
Operating System RHEL
OS Version 7.4
Architecture X86-64
UnifyCR Version multiple

Describe the problem you're observing

First 24 bytes of read transfer are corrupt

Describe how to reproduce the problem

  1. Instrumented the "test_write" program to set the contents of each 4k transfer write to an incrementing byte value: 0,1,2,... (i.e., all zeros in the first write, all 1s in the second write, etc.)
  2. Completed 128 transfers in the test program
  3. Instrumented test_read to verify the content of reads; found that the first 24 bytes of each transfer did not contain the values written. Instead these 24 bytes were constant for all transfers, except for the byte at offset 9 (zero relative), which changes with each transfer.

Additional testing

Put in tracing to verify that the metadata updates for the writes and metadata lookups for the reads are both correct.

Implement or remove unifycr_print_chunk_list()

Commit 75543ec removed the commented out code from unifycr_print_chunk_list() shown below. This left a function stub behind that should be fully implemented or removed if it's not needed.

/* debug function to print list of chunks constituting a file
 * and to test above function*/
void unifycr_print_chunk_list(char *path)
{
#if 0
    chunk_list_t *chunk_list;
    chunk_list_t *chunk_element;

    chunk_list = unifycr_get_chunk_list(path);

    fprintf(stdout, "-------------------------------------\n");
    LL_FOREACH(chunk_list, chunk_element) {
        printf("%d,%d,%p,%ld\n", chunk_element->chunk_id,
               chunk_element->location,
               chunk_element->chunk_offset,
               chunk_element->spillover_offset);
    }

    LL_FOREACH(chunk_list, chunk_element) {
        free(chunk_element);
    }
    fprintf(stdout, "\n");
    fprintf(stdout, "-------------------------------------\n");
#endif
}

range_server_bget_op() inconsistent logic for op == MDHIM_GET_NEXT

On this line we have if (op != MDHIM_GET_NEXT) {

https://github.com/LLNL/UnifyCR/blob/2cc38254e3bf25a07c2619bf5fdc577725568869/meta/src/range_server.c#L1029

Yet inside the if body there is code to handle op == MDHIM_GET_NEXT:

https://github.com/LLNL/UnifyCR/blob/2cc38254e3bf25a07c2619bf5fdc577725568869/meta/src/range_server.c#L1042
and
https://github.com/LLNL/UnifyCR/blob/2cc38254e3bf25a07c2619bf5fdc577725568869/meta/src/range_server.c#L1055

I'd like to restructure this function to reduce the nesting level and improve readability. Resolving the above inconsistency will help guide the restructuring. Should the handlers for MDHIM_GET_NEXT within the if body be removed, or is the if conditional statement just wrong? Is there a way to test whether this operation is being handled correctly?

SIGBUS accessing shared memory

@jgmoore-or reported a SIGBUS error accessing shared memory on the unifycr mailing list. Opening an issue here for bug tracking purposes. Email thread quoted below.

Adam,

I still can't get past this error:

Program received signal SIGBUS, Bus error.

0x00007ffff05973cb in __memcpy_ssse3_back () from /lib64/libc.so.6

Missing separate debuginfos, use: debuginfo-install glibc-2.17-157.el7.x86_64
libgcc-4.8.5-11.el7.x86_64 libgfortran-4.8.5-11.el7.x86_64
libquadmath-4.8.5-11.el7.x86_64 mpich-3.2-3.2-2.el7.x86_64
numactl-libs-2.0.9-6.el7_2.x86_64 openssl-libs-1.0.1e-60.el7.x86_64
zlib-1.2.7-17.el7.x86_64

(gdb) bt

#0  0x00007ffff05973cb in __memcpy_ssse3_back () from /lib64/libc.so.6
#1  0x0000000000409926 in unifycr_logio_chunk_write (fid=1, pos=49152, meta=0x7fffddcf443c, chunk_id=0, chunk_offset=49152, buf=0x66280c0, count=4096)
    at unifycr-fixed.c:378
#2  0x000000000040a574 in unifycr_fid_store_fixed_write (fid=1, meta=0x7fffddcf443c, pos=49152, buf=0x66280c0, count=4096) at unifycr-fixed.c:699
#3  0x0000000000405280 in unifycr_fid_write (fid=1, pos=49152, buf=0x66280c0, count=4096) at unifycr.c:746
#4  0x000000000040b1aa in unifycr_fd_write (fd=1, pos=49152, buf=0x66280c0, count=4096) at unifycr-sysio.c:541
#5  0x000000000040d45c in __wrap_pwrite (fd=1, buf=0x66280c0, count=4096, offset=49152) at unifycr-sysio.c:1463
#6  0x000000000040bc01 in __wrap_write (fd=1025, buf=0x66280c0, count=4096) at unifycr-sysio.c:865
#7  0x00000000004043a1 in main (argc=13, argv=0x7fffffffe358) at test_write.c:143

Hence I can only successfully write a file up to 12 4096 byte pages.  Anything
larger than this, independent of block/transfer sizes, will fail.

I've checked, and the shared memory for the superblock, chunks, etc. is being
allocated correctly using: shm_open, ftruncate, mmap. It's there, the correct
size, and when the server attaches it agrees.  However, I cannot access
(dereference) the memory in the chunk region beyond offset 49152 in the first
chunk of the chunk segment.  The remainder of the allocated shared memory,
including addresses past the chunk segment, are accessible.

Apparently, this is a common problem with mmap of shared memory (often caused
by failure to size the segment with ftruncate), and according to a google
search, catching signal 7 should allow the program to continue.

I'm perplexed as to why I'm seeing this. I'm running RHEL 7.3 on Intel
X86 (64 bit, 64GB of memory), which is probably different than your
environment.  Any ideas would be welcomed.

Thanks.
Joseph
From: "Moody, Adam T."
Date: Thursday, December 21, 2017 at 3:01 PM
To: "Moore, Joseph G."
Subject: Re: Shared memory issue

That's strange.  I'll look through the code to get some ideas.

Can you tell whether the ftruncate is succeeding?  How about the other calls setting up the segment?

-Adam

From: "Moore, Joseph G."
Date: Thursday, December 21, 2017 at 9:45 AM
To: "Moody, Adam T."
Subject: Shared memory issue
Adam,

The setup for the segment all runs fine.  However, I think it's mostly just
assigning addresses from the segment to in-process data structures.  The
ftruncate call is validated, and the calls to mmap and the init_pointers
function are skipped if it fails.  I'm sure init_pointers runs, so
ftruncate is succeeding.  This is at the bottom of unifycr_superblock_shmget()
in unifycr.c.  We're using the code in the else for the

if  (fs_type != UNIFYCR_LOG)

at the top of the function.  (Our fs_type is UNIFYCR_LOG.)

The bus error is always triggered by accessing the range of the chunk storage
beyond 49,152. This is at an offset of 237,568 to 268,623,872 bytes within the
shared-memory region. (The chunk region is almost all of the shared memory.)
I've tested access to the remainder of the SM segment (at the beginning and
end) and it's always okay.

I'm guessing it's the OS trying to keep me from accessing memory that
may be invalid.  I'm not an expert on this stuff, however.  The weird thing
is the hard boundary at 49,152 into the chunk store.  This never varies.

-Joseph
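The thread above points at the usual cause of SIGBUS on a shared mapping: touching the mapping beyond the file's actual size, which is why the size passed to ftruncate() must cover every byte that will be accessed. A minimal sketch of the correct setup sequence; it uses a plain temp file instead of shm_open() for portability, but the principle is identical for the UnifyCR superblock segment (the real bug here may of course lie elsewhere):

```c
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/types.h>
#include <unistd.h>

/* Map a shared segment of the given size, sizing the file with
 * ftruncate() BEFORE mapping so every byte is backed. */
static void *map_segment(const char *path, size_t size)
{
    int fd = open(path, O_RDWR | O_CREAT, 0600);
    if (fd < 0)
        return NULL;
    if (ftruncate(fd, (off_t)size) != 0) {
        close(fd);
        return NULL;
    }
    void *addr = mmap(NULL, size, PROT_READ | PROT_WRITE,
                      MAP_SHARED, fd, 0);
    close(fd);  /* the mapping persists after close */
    return (addr == MAP_FAILED) ? NULL : addr;
}
```

If ftruncate is skipped (or sized too small), the first access past the file size raises exactly the SIGBUS reported above.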

client: attribute management for open and fsync

Avoid conflicts on file ids. The current implementation generates file ids from a hash of the file name; a conflict occurs when two file names hash to the same file id.

Provide atomic updates on file ids. When a file is opened with O_CREAT, the client first
checks whether the file exists, then creates it if it does not, or returns an error
if O_EXCL is set and it does. These operations need to be atomic.

During fsync, the final file size should be the largest size (typical in checkpoint workloads)
among the metadata sent by all clients. The current implementation takes the last file size received as the final file size for fsync, which may not be the correct one.
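The atomicity requirement above is the same guarantee POSIX open() gives with O_CREAT|O_EXCL: the existence check and the creation are a single atomic step, so two racing clients cannot both succeed. A sketch using local open() for illustration; the UnifyCR client would need the equivalent guarantee through its metadata service:

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Atomically create a file, failing if it already exists. */
static int create_exclusive(const char *path)
{
    int fd = open(path, O_CREAT | O_EXCL | O_WRONLY, 0600);
    if (fd < 0 && errno == EEXIST)
        return -1;  /* another client created it first */
    return fd;
}
```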

Address --Wstrict-aliasing warnings for type-punning

There are a number of compiler warnings related to the use of type-punning to reinterpret the memory type of message buffers:

unifycr.c: In function ‘unifycr_get_global_fid’:
unifycr.c:942:5: warning: dereferencing type-punned pointer will break
strict-aliasing rules [-Wstrict-aliasing]
     *gfid = *((int *)md);
     ^

This is considered unsafe as it can lead to bugs at higher compiler optimization levels. PR #69 proposes to disable those warnings until a proper fix can be implemented. Perhaps those buffers should be represented by structs or unions rather than raw byte arrays.
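One conventional fix for warnings like the one above is to replace the type-punned dereference (`*gfid = *((int *)md);`) with memcpy(), which the compiler must handle correctly regardless of aliasing or alignment. The buffer layout in this sketch is illustrative:

```c
#include <string.h>

/* Read an int out of a raw message buffer without a type-punned
 * dereference; memcpy is safe under strict aliasing. */
static int read_gfid(const unsigned char *md)
{
    int gfid;
    memcpy(&gfid, md, sizeof(gfid));
    return gfid;
}
```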

Clarify that the current version requires MPICH and does not work with OpenMPI

The current version of UnifyCR adopts the mdhim key-value store, which strictly requires

An MPI distribution that supports MPI_THREAD_MULTIPLE and per-object locking of critical sections (this excludes OpenMPI)

as specified in the project github.

Until we provide an alternative metadata store, this should be explained in the documentation, such as on the 'How to build UnifyCR' page.

client: set proper defaults for client env variables

All of the variables below should be checked for proper default values. Most of them are hard-coded for the catalyst cluster, and some, like UNIFYCR_USE_SPILLOVER, should default to being turned on.

The following variables are configurable in the client:

UNIFYCR_CHUNK_MEM: the maximum amount of allocated shared memory to
store the data.

UNIFYCR_INDEX_BUF_SIZE: the size of the shared memory buffer that stores the unifycr
indices (index spillover is not yet supported).

UNIFYCR_ATTR_BUF_SIZE: the size of shared memory buffer to store the
file attribute.

SHM_REQ_SIZE: size of the request shared memory (the client will put all
its read requests in this buffer and notify its delegator to fetch its requested
data).

SHM_RECV_SIZE: size of the receive shared memory (the delegator will put
the fetched data in this buffer).
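A small getenv helper with an explicit default would keep any of the settings above from silently falling back to a machine-specific value. The helper is a sketch; the default shown for UNIFYCR_USE_SPILLOVER reflects the issue's request that spillover default to on:

```c
#include <stdlib.h>

/* Read an integer setting from the environment, falling back to an
 * explicit default when the variable is unset or empty. */
static long env_long(const char *name, long default_val)
{
    const char *s = getenv(name);
    return (s != NULL && *s != '\0') ? strtol(s, NULL, 10) : default_val;
}

/* usage: spillover enabled unless explicitly disabled */
/* long use_spillover = env_long("UNIFYCR_USE_SPILLOVER", 1); */
```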

Metadata API

Metadata API current usage

Client initiated Metadata operations

The following client operations will create a META command:

  • Call to get_global_file_meta() will retrieve unifycr_fattr_t from the server (flag/type is 1).
  • Call to set_global_file_meta() will send serialized unifycr_fattr_t to the server, which stores it in MDHIM(flag/type is 2).
  • Call to fsync() on the client will send a "fsync" message to the server (flag/type is 3).

Call to unifycr_fd_logreadlist() will create "READ" command for the server.

The functions set_global_file_meta() and get_global_file_meta() are called from unifycr_fid_open().

Synchronization to Delegator

In unifycr_sync_to_del(), the client side information is transferred to the server. The unifycr_sync_to_del() function is called from unifycr_mount() (Metadata operation is COMM_MOUNT).

Server Metadata ops

The server handles Metadata operations initiated by the client in the delegator_handle_command() function (unifycr_cmd_handler.c). Depending on the operation, meta_process_fsync(), meta_process_attr_set(), meta_process_attr_get() are called.
Note: the sub commands are not defined in an enum.
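Gathering the flag/type values described above into an enum, as the note suggests, could look like this. The constant names are hypothetical; the numeric values (1 = attr get, 2 = attr set, 3 = fsync) come from the descriptions above:

```c
/* Hypothetical enum for the META sub commands. */
enum unifycr_meta_cmd {
    META_ATTR_GET = 1,  /* get_global_file_meta() */
    META_ATTR_SET = 2,  /* set_global_file_meta() */
    META_FSYNC    = 3,  /* client fsync() */
};
```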

From the request manager, rm_read_remote_data() calls meta_batch_get(). The request manager dispatches read requests to different threads based on the requesting client.

meta_process_attr_set() stores file attributes into mdhim.
meta_process_attr_get() retrieves file attributes from mdhim.

meta_process_fsync() stores all file attributes and extents in mdhim.
meta_batch_get() retrieves extents from mdhim.

superblock

The superblock is allocated in the client in unifycr_superblock_shmget(). If the superblock already exists, the client attaches to the existing superblock.

Pointers into the superblock are set up in unifycr_init_pointers(). The following data lives in the superblock:

  • a header (uint32_t)
  • free_fid_stack
  • unifycr_filelist
  • unifycr_chunkmetas
  • spillover padding?
  • free_chunk_stack
  • spillover padding
  • unifycr_chunks (only if unifycr_use_memfs is set)
  • if fs_type is UNIFYCR_LOG
    • unifycr_indices.ptr_num_entries
    • unifycr_indices.index_entry
    • unifycr_fattrs.ptr_num_entries
    • unifycr_fattrs.meta_entry

Metadata

File attributes

The server defines the following type in unifycr_global.h:

typedef int fattr_key_t;

typedef struct {
    char fname[ULFS_MAX_FILENAME];
    struct stat file_attr;
} fattr_val_t;
typedef struct {
    int fid;
    int gfid;
    char filename[ULFS_MAX_FILENAME];
    struct stat file_attr;
} unifycr_file_attr_t;

fattr_val_t is set using meta_process_attr_set() and retrieved using meta_process_attr_get(). In meta_process_fsync() fattr_val_t and unifycr_val_t are exchanged. All unifycr_val_t are retrieved with a call to meta_batch_get().

The client defines the following type in unifycr_internal.h:

typedef struct {
    int fid;
    int gfid;
    char filename[UNIFYCR_MAX_FILENAME];
    struct stat file_attr;
} unifycr_fattr_t;

The types unifycr_file_attr_t and unifycr_fattr_t are identical. The client and server use the unifycr_f(ile_)attr_t type to exchange file attributes over the socket. The server in turn uses fattr_key_t and fattr_val_t to store the file attributes in mdhim.
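The server-side conversion from the exchanged type to the stored key/value pair can be sketched as below. The types come from the definitions above; the helper itself is illustrative (not UnifyCR code), and it assumes the gfid serves as the key, and a placeholder value for ULFS_MAX_FILENAME:

```c
#include <string.h>
#include <sys/stat.h>

#define ULFS_MAX_FILENAME 128  /* placeholder value */

typedef int fattr_key_t;

typedef struct {
    char fname[ULFS_MAX_FILENAME];
    struct stat file_attr;
} fattr_val_t;

typedef struct {
    int fid;
    int gfid;
    char filename[ULFS_MAX_FILENAME];
    struct stat file_attr;
} unifycr_file_attr_t;

/* Sketch of the attribute conversion before storing into MDHIM:
 * the gfid becomes the key; the name and stat data become the value. */
static void file_attr_to_kv(const unifycr_file_attr_t *in,
                            fattr_key_t *key, fattr_val_t *val)
{
    *key = in->gfid;
    memset(val, 0, sizeof(*val));
    strncpy(val->fname, in->filename, ULFS_MAX_FILENAME - 1);
    val->file_attr = in->file_attr;
}
```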

Types defined by client and server

typedef struct {
    off_t file_pos;
    off_t mem_pos;
    size_t length;
    int fid;
} unifycr_index_t;

The unifycr_index_t type is defined in unifycr-internal.h for the client and unifycr_metadata.h for the server.

File extents?

typedef struct {
    unsigned long fid;
    unsigned long offset;
} unifycr_key_t;

typedef struct {
    unsigned long delegator_id;
    unsigned long len;
    unsigned long addr;
    unsigned long app_rank_id; /*include both app and rank id*/
} unifycr_val_t;

unifycr_key_t and unifycr_val_t are set in meta_process_fsync. The data is copied from the superblock and stored in a global array for the keys and values (unifycr_keys and unifycr_vals). After the data is copied, it is stored in the metadata server (using mdhimBPut()).
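The per-entry copy described above can be sketched as follows. The types come from the definitions above; the mapping of addr to mem_pos and the delegator/rank parameters are assumptions, since their derivation is not shown in this document:

```c
#include <stddef.h>
#include <sys/types.h>

typedef struct {
    off_t file_pos;
    off_t mem_pos;
    size_t length;
    int fid;
} unifycr_index_t;

typedef struct {
    unsigned long fid;
    unsigned long offset;
} unifycr_key_t;

typedef struct {
    unsigned long delegator_id;
    unsigned long len;
    unsigned long addr;
    unsigned long app_rank_id; /* include both app and rank id */
} unifycr_val_t;

/* Sketch of turning one index entry into one key/value pair for
 * mdhimBPut(); not actual UnifyCR code. */
static void index_to_kv(const unifycr_index_t *idx,
                        unsigned long delegator_id,
                        unsigned long app_rank_id,
                        unifycr_key_t *key, unifycr_val_t *val)
{
    key->fid          = (unsigned long)idx->fid;
    key->offset       = (unsigned long)idx->file_pos;
    val->delegator_id = delegator_id;
    val->len          = (unsigned long)idx->length;
    val->addr         = (unsigned long)idx->mem_pos;  /* assumed mapping */
    val->app_rank_id  = app_rank_id;
}
```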

File and chunk meta

The following types are defined by the client in unifycr-internal.h

typedef struct {
    off_t size;                     /* current file size */
    off_t real_size;                /* real size of the file for logio*/
    int is_dir;                     /* is this file a directory */
    pthread_spinlock_t fspinlock;   /* file lock variable */
    enum flock_enum flock_status;   /* file lock status */

    int storage;                    /* FILE_STORAGE specifies file data management */

    off_t chunks;                   /* number of chunks allocated to file */
    unifycr_chunkmeta_t *chunk_meta; /* meta data for chunks */

} unifycr_filemeta_t;
typedef struct {
    int location; /* CHUNK_LOCATION specifies how chunk is stored */
    off_t id;     /* physical id of chunk in its respective storage */
} unifycr_chunkmeta_t;

There is an array in the superblock storing unifycr_max_files unifycr_filemeta_t entries and unifycr_max_chunks unifycr_chunkmeta_t entries.
In the client, unifycr_filemeta_t is retrieved by a call to unifycr_get_meta_from_fid().

Functions reading unifycr_filemeta_t:

  • unifycr_fid_is_dir
  • unifycr_fid_size
  • unifycr_fid_stat
  • unifycr_fid_read
  • unifycr_fid_write (calls unifycr_fid_store_fixed_write)
  • unifycr_fid_shrink (calls unifycr_fid_store_fixed_shrink)
  • unifycr_fid_store_fixed_write

Functions writing unifycr_filemeta_t:

  • unifycr_fd_write
  • unifycr_fid_store_alloc
  • unifycr_fid_create_file
  • unifycr_fid_create_directory
  • unifycr_fid_extend
  • unifycr_fid_truncate
  • unifycr_fid_open

open questions

Do we really need to store the entire stat structure in the K-V store?
Does fsync need to store file attributes?

Proposed new Metadata API

shared metadata types

Datatypes shared by the client and server need to be defined in a common directory.

Common datatypes:

  • unifycr_file_attr_t and unifycr_fattr_t (need to rename one of the types)
  • unifycr_index_t

proposed API functions

/*
 * Store the stat attributes for the given file.
 */
int unifycr_set_file_attribute (const char* const filename, struct stat file_stat);
/*
 * Retrieve the stat attributes for the given file.
 */
int unifycr_get_file_attribute (const char* const filename, struct stat *file_stat);
/*
 * Record num_extents extents for the given file.
 */
int unifycr_set_file_extents (const char* const filename, unsigned int num_extents, unifycr_index_t *extents);
/*
 * Record extents for num_files files in one call.
 */
int unifycr_bulk_set_file_extents (unsigned int num_files, const char** const filename, unsigned int *num_extents, unifycr_index_t **extents);
/*
 * Retrieve the extents recorded for the given file.
 */
int unifycr_get_file_extents (const char* const filename, unsigned int *num_extents, unifycr_index_t **extents);
/*
 * Retrieve the extents recorded for num_files files in one call.
 */
int unifycr_bulk_get_file_extents (unsigned int num_files, const char* const filename, unsigned int **num_extents, unifycr_index_t ***extents);

openning file /l/ssd/spill_1_0.log failure: No such file or directory

I note this error when running test_write_gotcha:

$ srun -n 1 -N 1 ./test_write_gotcha -f foo -b 1024 -t 1024 -s 1024
rank:0, openning file /l/ssd/spill_0_0.log failure: No such file or directory
This function name failed to be wrapped: stat
Aggregate Write BW is 530.503979MB/s, Min Write BW is 530.503979MB/s

Resource manager integration

This issue is to develop additional software components that allow UnifyCR to interact with cluster resource managers such as SLURM and LSF. Such interactions are necessary to launch and terminate the UnifyCR daemon with user-specified configurations. In addition, a 'unifycr' command line utility will be developed to support environments without a resource manager.

Expected functionalities

  1. Reading the system-wide options if any, e.g., shared memory size.
  2. Recognizing the user-specified configuration options, which will be fed to the UnifyCR daemon.
  3. Launching the UnifyCR daemon across a job allocation, i.e., a set of compute nodes, with the given options
  4. Performing any initialization and cleanup tasks, e.g., preloading/draining data files from/to PFS

User-specified options

The following is the initial list of the options that will be supported:

  1. mount=<name>: Specifies the mount point name to be used by UnifyCR. The job environment variable
    UNIFYCR_MT is set to the value of name.
  2. transfer_in=<path or filename>: Specifies a file name or directory to transfer into the UnifyCR instance at the beginning of the job.
  3. transfer_out=<path or filename>: Specifies a file name or directory to transfer out of the UnifyCR instance at the end of the allocation.
  4. cleanup: Specifies whether UnifyCR should clean up any storage used at the end of the allocation.
  5. consistency_model: Planned option to allow the user to specify the desired consistency model among those supported by UnifyCR. The default is “laminated consistency.”
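A hypothetical invocation composed from the options listed above. Neither the utility's subcommands nor the exact flag syntax exists yet; this only illustrates how the options might compose on a command line:

```shell
#!/bin/sh
# All flag names below are placeholders derived from the option list.
MOUNT="/unifycr"
STAGE_IN="$HOME/input"
STAGE_OUT="$HOME/output"

build_cmd() {
    echo "unifycr start" \
         "--mount=$MOUNT" \
         "--transfer-in=$STAGE_IN" \
         "--transfer-out=$STAGE_OUT" \
         "--cleanup"
}

build_cmd
```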

Implementation

Currently, three software components are planned:

  1. unifycr command line utility.
  2. Resource manager-specific hooks, e.g., SLURM plugin, LSF prolog/epilog scripts.
  3. Initial template for configuration files, e.g., /etc/unifycr.conf

New source code directories

The following directories will be appended to the existing source code tree:

  1. /etc: configuration file template
  2. /rm: resource manager-specific plugins or scripts
  3. /util: command-line tools including the unifycr utility

Resources

  1. The planned design
  2. SLURM SPANK plugins
  3. SLURM SPANK plugin examples 1
  4. SLURM SPANK plugin examples 1
  5. LSF document

Reorganizing test programs

Currently, testing programs in /client/tests are not fully integrated into the build system. Furthermore, it is not clear whether all test programs work as expected.

IMHO, it would be better to have a separate directory (e.g., /examples) containing example programs (most programs in the current /client/tests, plus more such as hdf5, ...), while enhancing the testing framework in /t.

Any opinions are welcome.

Implement or remove unifycr_get_chunk_list()

Commit 75543ec removed the commented out code from unifycr_get_chunk_list() shown below. This left a function stub behind that should be fully implemented or removed if it's not needed.

 /* get a list of chunks for a given file (useful for RDMA, etc.) */
 chunk_list_t *unifycr_get_chunk_list(char *path)
 {
#if 0
    if (unifycr_intercept_path(path)) {
        int i = 0;
        chunk_list_t *chunk_list = NULL;
        chunk_list_t *chunk_list_elem;

        /* get the file id for this file descriptor */
        /* Rag: We decided to use the path instead.. Can add flexibility to support both */
        //int fid = unifycr_get_fid_from_fd(fd);
        int fid = unifycr_get_fid_from_path(path);
        if (fid < 0) {
            errno = EACCES;
            return NULL;
        }

        /* get meta data for this file */
        unifycr_filemeta_t *meta = unifycr_get_meta_from_fid(fid);
        if (meta) {

            while (i < meta->chunks) {
                chunk_list_elem = (chunk_list_t *)malloc(sizeof(chunk_list_t));

                /* get the chunk id for the i-th chunk and
                 * add it to the chunk_list */
                unifycr_chunkmeta_t *chunk_meta = &(meta->chunk_meta[i]);
                chunk_list_elem->chunk_id = chunk_meta->id;
                chunk_list_elem->location = chunk_meta->location;

                if (chunk_meta->location == CHUNK_LOCATION_MEMFS) {
                    /* update the list_elem with the memory address of this chunk */
                    chunk_list_elem->chunk_offset = unifycr_compute_chunk_buf(meta, chunk_meta->id,
                                                    0);
                    chunk_list_elem->spillover_offset = 0;
                } else if (chunk_meta->location == CHUNK_LOCATION_SPILLOVER) {
                    /* update the list_elem with the offset of this chunk in the spillover file*/
                    chunk_list_elem->spillover_offset = unifycr_compute_spill_offset(meta,
                                                        chunk_meta->id, 0);
                    chunk_list_elem->chunk_offset = NULL;
                } else {
                    /*TODO: Handle the container case.*/
                }

                /* currently using macros from utlist.h to
                 * handle link-list operations */
                LL_APPEND(chunk_list, chunk_list_elem);
                i++;
            }
            return chunk_list;
        } else {
            return NULL;
        }
    } else {
        /* file not managed by UNIFYCR */
        errno = EACCES;
        return NULL;
    }
#endif
    return NULL;
 }

UnifyCR runtime debugging/monitoring/profiling utility

It is not easy to check the UnifyCR runtime status without debugging tools and skills. It would be convenient to provide a utility program that provides:

  1. current daemon status including the memory consumption across all nodes.
  2. file system statistics, such as the number of files, space consumption, etc.
  3. a small shell-like environment where a user can interactively explore the namespace (cd, ls, ...)
  4. moving files between unifycr volume and any other mountpoint (e.g., /lustre, /mnt/xfs, ...)

It has previously been discussed to include this feature in a future release.

HDF5 dataset creation support

Testing environment

The test has been conducted in our testbed with three x86 nodes:

  • Intel Xeon E5-2603
  • 64 GB RAM
  • mpi/mpich-3.2-x86_64

The write_test_static seems to run fine.

HDF5 dataset creation test

The basic HDF5 file creation is not working as it is supposed to. The sequence of HDF5 calls is as follows:

H5Fcreate
H5Screate_simple
H5Dcreate2
H5Dclose
H5Sclose
H5Fclose

The full testing program was taken from the C examples in the HDF5 source code (the create example).
It has been modified to use MPI as follows:

/*
 *  This example illustrates how to create a dataset that is a 4 x 6 
 *  array.  It is used in the HDF5 Tutorial.
 */

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <string.h>
#include <mpi.h>
#include "hdf5.h"
#define FILE "dset.h5"

int main(int argc, char **argv) {

    int rank, rank_num;

    hid_t       file_id, dataset_id, dataspace_id;  /* identifiers */
    hsize_t     dims[2];
    herr_t      status = 0;

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &rank_num);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    unifycr_mount("/tmp", rank, rank_num, 0, 1);

    MPI_Barrier(MPI_COMM_WORLD);

    if (rank == 0) {
        /* Create a new file using default properties. */
        file_id = H5Fcreate(FILE, H5F_ACC_TRUNC, H5P_DEFAULT, H5P_DEFAULT);
        printf("H5Fcreate: %d\n", file_id);

        /* Create the data space for the dataset. */
        dims[0] = 4; 
        dims[1] = 6; 
        dataspace_id = H5Screate_simple(2, dims, NULL);
        printf("H5Screate_simple: %d\n", dataspace_id);

        /* Create the dataset. */
        dataset_id = H5Dcreate2(file_id, "/tmp/dset", H5T_STD_I32BE,
                dataspace_id, H5P_DEFAULT, H5P_DEFAULT,
                H5P_DEFAULT);
        printf("H5Dcreate2: %d\n", status); /* note: prints status, not dataset_id */

        /* End access to the dataset and release resources used by it. */
        status = H5Dclose(dataset_id);
        printf("H5Dclose: %d\n", status);

        /* Terminate access to the data space. */ 
        status = H5Sclose(dataspace_id);
        printf("H5Sclose: %d\n", status);

        /* Close the file. */
        status = H5Fclose(file_id);
        printf("H5Fclose: %d\n", status);
    }

    MPI_Barrier(MPI_COMM_WORLD);

    MPI_Finalize();
    return 0;
}

The output of the program execution is:

unifycr: unifycr.c:1771: unifycr_init: are we using spillover? 1
unifycr: unifycr.c:1875: unifycr_init: FD limit for system = 1024
unifycr: unifycr.c:1444: unifycr_superblock_shmget: Key for superblock = 0
unifycr: unifycr.c:1771: unifycr_init: are we using spillover? 1
unifycr: unifycr.c:1771: unifycr_init: are we using spillover? 1
unifycr: unifycr.c:1875: unifycr_init: FD limit for system = 1024
unifycr: unifycr.c:1444: unifycr_superblock_shmget: Key for superblock = 0
unifycr: unifycr.c:1402: unifycr_init_structures: Meta-stacks initialized!
unifycr: unifycr.c:1402: unifycr_init_structures: Meta-stacks initialized!
unifycr: unifycr.c:1875: unifycr_init: FD limit for system = 1024
unifycr: unifycr.c:1444: unifycr_superblock_shmget: Key for superblock = 0
unifycr: unifycr.c:1402: unifycr_init_structures: Meta-stacks initialized!
unifycr: unifycr.c:668: unifycr_fid_alloc: unifycr_stack_pop() gave 0
unifycr: unifycr.c:704: unifycr_fid_create_file: Filename /tmp got unifycr fd 0
unifycr: unifycr.c:668: unifycr_fid_alloc: unifycr_stack_pop() gave 0
unifycr: unifycr.c:704: unifycr_fid_create_file: Filename /tmp got unifycr fd 0
unifycr: unifycr.c:668: unifycr_fid_alloc: unifycr_stack_pop() gave 0
unifycr: unifycr.c:704: unifycr_fid_create_file: Filename /tmp got unifycr fd 0

H5Fcreate: 16777216
H5Screate_simple: 67108866
HDF5-DIAG: Error detected in HDF5 (1.8.12) thread 0:
  #000: ../../src/H5D.c line 170 in H5Dcreate2(): unable to create dataset
    major: Dataset
    minor: Unable to initialize object
  #001: ../../src/H5Dint.c line 439 in H5D__create_named(): unable to create and link to dataset
    major: Dataset
    minor: Unable to initialize object
  #002: ../../src/H5L.c line 1638 in H5L_link_object(): unable to create new link to object
    major: Links
    minor: Unable to initialize object
  #003: ../../src/H5L.c line 1882 in H5L_create_real(): can't insert link
    major: Symbol table
    minor: Unable to insert object
  #004: ../../src/H5Gtraverse.c line 861 in H5G_traverse(): internal path traversal failed
    major: Symbol table
    minor: Object not found
  #005: ../../src/H5Gtraverse.c line 755 in H5G_traverse_real(): component not found
    major: Symbol table
    minor: Object not found
H5Dcreate2: 0
HDF5-DIAG: Error detected in HDF5 (1.8.12) thread 0:
  #000: ../../src/H5D.c line 391 in H5Dclose(): not a dataset
    major: Invalid arguments to routine
    minor: Inappropriate type
H5Dclose: -1
H5Sclose: 0
H5Fclose: 0

It seems that the H5Dcreate2() function fails when it tries to create a named object. Further debugging would be necessary to figure out exactly which call is problematic (lstat, lseek, or write, based on the strace result below).

UnifyCR was configured as follows:

$ env | grep UNIFYCR
UNIFYCR_DEBUG=8
UNIFYCR_META_DB_NAME=unifycr_db
UNIFYCR_SERVER_DEBUG_LOG=/tmp/unifycr/unifycrd_debug.31028
UNIFYCR_EXTERNAL_META_DIR=/tmp/unifycr/ssd
UNIFYCR_META_SERVER_RATIO=1
UNIFYCR_EXTERNAL_DATA_DIR=/tmp/unifycr/ssd
UNIFYCR_CHUNK_MEM=0
UNIFYCR_META_DB_PATH=/tmp/unifycr/ssd

The original example (without modification) works as follows:

$ cc -o h5_create_dataset h5_create_dataset.c -lhdf5
$ ./h5_create_dataset
$ h5dump dset.h5
HDF5 "dset.h5" {
GROUP "/" {
   DATASET "dset" {
      DATATYPE  H5T_STD_I32BE
      DATASPACE  SIMPLE { ( 4, 6 ) / ( 4, 6 ) }
      DATA {
      (0,0): 0, 0, 0, 0, 0, 0,
      (1,0): 0, 0, 0, 0, 0, 0,
      (2,0): 0, 0, 0, 0, 0, 0,
      (3,0): 0, 0, 0, 0, 0, 0
      }
   }
}
}
$ strace ./h5_create_dataset
...
open("dset.h5", O_RDWR)                 = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=1400, ...}) = 0
close(3)                                = 0
open("dset.h5", O_RDWR|O_CREAT|O_TRUNC, 0666) = 3
fstat(3, {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
mmap(NULL, 528384, PROT_READ|PROT_WRITE, MAP_PRIVATE|MAP_ANONYMOUS, -1, 0) = 0x7fb7c0536000
getcwd("/autofs/nccs-svm1_techint/home/hs2/projects/__tmp/h5", 1024) = 53
lstat("dset.h5", {st_mode=S_IFREG|0644, st_size=0, ...}) = 0
lseek(3, 0, SEEK_SET)                   = 0
write(3, "\211HDF\r\n\32\n\0\0\0\0\0\10\10\0\4\0\20\0\0\0\0\0\0\0\0\0\0\0\0\0"..., 96) = 96
write(3, "\1\0\1\0\1\0\0\0\30\0\0\0\0\0\0\0\21\0\20\0\0\0\0\0\210\0\0\0\0\0\0\0"..., 1304) = 1304
munmap(0x7fb7c0536000, 528384)          = 0
close(3)                                = 0
exit_group(0)                           = ?
+++ exited with 0 +++

Basically, it uses the following system calls:

open
fstat
close
lstat
lseek
write
close

report error to client when read request fails

Describe the problem you're observing

The client hangs after writing data and then attempting to read it back in the same program. It works if you close the file in between, but at minimum it should return an error if the data has not been synced yet.

Describe how to reproduce the problem

Run a client application that writes data to a UnifyCR-intercepted path, then try to read that data back without closing the file handle before the read. When you do this, the client simply hangs. It should instead report an error if the read lookup fails.

UNIFYCR_LOG enum constant

UNIFYCR_LOG is defined in client/src/unifycr-internal.h, together with UNIFYCR and UNIFYCR_STRIPE. The file system type should be passed as the last parameter when calling unifycr_mount(). But are there any cases where we need to pass something other than UNIFYCR_LOG? As far as I remember, the other types are not even functional.

In client/src/unifycr_sysio.c, __wrap_pwrite() does not check whether the type is UNIFYCR_LOG or something else, while __wrap_write() does.

ssize_t UNIFYCR_WRAP(pwrite)(int fd, const void *buf, size_t count,
                             off_t offset)
{
    /* equivalent to write(), except that it writes into a given
     * position without changing the file pointer */
    /* check whether we should intercept this file descriptor */
    if (unifycr_intercept_fd(&fd)) {
        /* get pointer to file descriptor structure */
        unifycr_fd_t *filedesc = unifycr_get_filedesc_from_fd(fd);
        if (filedesc == NULL) {
            /* ERROR: invalid file descriptor */
            errno = EBADF;
            return (ssize_t) (-1);
        }

        /* write data to file */
        int write_rc = unifycr_fd_write(fd, offset, buf, count);
        if (write_rc != UNIFYCR_SUCCESS) {
            errno = unifycr_err_map_to_errno(write_rc);
            return (ssize_t) (-1); 
        }

        /* return number of bytes written */
        return (ssize_t) count;
    } else {
        MAP_OR_FAIL(pwrite);
        ssize_t ret = UNIFYCR_REAL(pwrite)(fd, buf, count, offset);
        return ret;
    }
}

/* ... */

ssize_t UNIFYCR_WRAP(write)(int fd, const void *buf, size_t count)
{
    ssize_t ret;

    /* check whether we should intercept this file descriptor */
    if (unifycr_intercept_fd(&fd)) {
        /* get pointer to file descriptor structure */
        unifycr_fd_t *filedesc = unifycr_get_filedesc_from_fd(fd);
        if (filedesc == NULL) {
            /* ERROR: invalid file descriptor */
            errno = EBADF;
            return (ssize_t) (-1);
        }

        if (fs_type != UNIFYCR_LOG) {
            /* write data to file */
            int write_rc = unifycr_fd_write(fd, filedesc->pos, buf, count);
            if (write_rc != UNIFYCR_SUCCESS) {
                errno = unifycr_err_map_to_errno(write_rc);
                return (ssize_t) (-1);
            }
            ret = count;
        } else {
            fd += unifycr_fd_limit;
            ret = pwrite(fd, buf, count, filedesc->pos);
            /* pwrite() will set errno on error for us */
            if (ret < 0)
                return -1;
        }
        /* update file position */
        filedesc->pos += ret;

    } else {
        MAP_OR_FAIL(write);
        ret = UNIFYCR_REAL(write)(fd, buf, count);
    }

    return ret;
}

Also, it seems wrong to call unifycr_fd_write() when the type is NOT UNIFYCR_LOG. If my understanding is correct, there are three places in unifycr-sysio.c where the if statements are written incorrectly.

Please correct or confirm. If nothing beyond UNIFYCR_LOG is working or planned, I would suggest discarding the enum definition and cleaning up the code, including the unifycr_mount() arguments.

static config.h in server/src

Describe the problem you're observing

The build for unifycrd (code in server/src) is using a static version of config.h that exists in the repo, rather than the one generated from config.h.in by configure.

Describe how to reproduce the problem

Make source code changes that depend on defines in config.h, reconfigure and rebuild, bang head against wall wondering why things aren't working as expected.

The fix is to git rm server/src/config.h

Static client initialization and finalization

Description

In order to make the use of UnifyCR more seamless, it is desirable to eliminate the need for client applications to directly call unifycr_mount() and unifycr_unmount(). Thus, we would like to call these functions as part of the client library's static initialization and finalization. This issue is to document the roadblocks to doing such static setup/teardown.

Initialization roadblocks

  • MPI is used extensively during unifycr_mount(), but an application will not call MPI_Init() until after the library is initialized. Similarly, the MPI rank space is used as the UnifyCR global rank space, and the local rank and size of MPI_COMM_WORLD are passed as arguments to unifycr_mount().

Finalization roadblocks

  • No problems identified yet.

Daemonize function and server/client connection

I'm using SLURM, mvapich, and srun to launch the server and client processes.

When the daemonize() function is used and the client write test is run, the unifycr_mount call fails.

The stack trace looks like this:

unifycr_mount -> unifycrfs_mount -> unifycr_init_socket -> connect

The connect call fails in unifycr_init_socket with a return code of (-1).

Provide way to enable memcpy variants for testing

Commit 75543ec removed the commented-out function unifycr_memcpy() shown below. @adammoody commented in #51 that it was there to facilitate performance testing of different memcpy variants with different compilers and architectures. If this is still needed, we should provide either a build-time or run-time method to enable such variants.

 /* simple memcpy which compilers should be able to vectorize
  * from: http://software.intel.com/en-us/articles/memcpy-performance/
  * icc -restrict -O3 ... */
 static inline void *unifycr_memcpy(void *restrict b, const void *restrict a,
                                    size_t n)
 {
     char *s1 = b;
     const char *s2 = a;
     for (; 0 < n; --n) {
         *s1++ = *s2++;
     }
     return b;
 }

Improve error handling if /dev/shm too small

In #59 it was reported that an application crashed with SIGBUS if /dev/shm was configured with insufficient space. On the mailing list it was suggested that this case could be detected and reported more gracefully if we replace the use of ftruncate() with fallocate(), although that would sacrifice portability since fallocate() is Linux-specific. @adammoody also pointed out that we'd need to examine how memory pages are assigned to NUMA banks with fallocate().

Enable GOTCHA via LD_PRELOAD

To enable pre-built applications to run without needing to recompile or relink, it would be nice to have LD_PRELOAD support. Such apps will not call unifycr_mount/unmount, so the library will need to initialize another way. We could perhaps intercept /unifycr by default, and read an environment variable to acquire the desired mount point if the user wants something different.

An example of how to LD_PRELOAD with GOTCHA is available here:
https://github.com/LLNL/GOTCHA-tracer

check for all elements of wrapped function in gotcha

Make sure there is a warning if any of the required elements for a wrapped function is missing. For instance, if a prototype for a wrapped function is declared but the function is not added to the gotcha struct, it will not be intercepted.

Things to check:

  • Each wrapped function prototype also has a pointer to the original function defined and visible to the gotcha struct
  • The number of wrapped functions should match the number of functions defined in the gotcha struct

UNIFYCR_MAX_FILENAME vs. ULFS_MAX_FILENAME

There are two *_MAX_FILENAME definitions in the source. ULFS_MAX_FILENAME is defined to 256 and UNIFYCR_MAX_FILENAME is defined to 128.
Is there a reason for the two definitions?

Make an annotated tag for v0.1.0

@kathrynmohror I'd like to suggest that we make an annotated git tag for release 0.1.0, retroactively applied to commit 4c5be05 (the current tip of the master branch where the license was added). This will allow git describe to work correctly, which is useful for generating correctly versioned tarballs with make dist. To do this, you'd run

git tag -a -m "Tag version v0.1.0" v0.1.0 4c5be05
git push origin --tags v0.1.0

update unifycr_list in client

Need to update unifycr_list.txt in client/maint/check_fns/ and regenerate the unifycr_gotcha_map header file. This keeps the list of supported functions current and only requires updating the function prototype, whereas updating the generated gotcha header also requires updating the struct it builds.

Support of stat family syscalls

The current version of UnifyCR does not fully support the stat family of system calls, which is required to support major applications (#114). The following system calls are expected to be implemented:

lstat
fstat
fxstat

This function name failed to be wrapped: stat

I noticed this error when running test_write_gotcha:

$ srun -n 1 -N 1 ./test_write_gotcha -f foo -b 1024 -t 1024 -s 1024
rank:0, openning file /l/ssd/spill_0_0.log failure: No such file or directory
This function name failed to be wrapped: stat
Aggregate Write BW is 530.503979MB/s, Min Write BW is 530.503979MB/s

missing autoconf check for openssl/md5.h

Testing out #22 I got a build-time error on my Ubuntu 16.04 system:

unifycr.c:67:25: fatal error: openssl/md5.h: No such file or directory

We should add an autoconf check for this dependency. In this case I needed to install the libssl-dev package.

Possible memory leak in UNIFYCR_WRAP(lio_listio)

In client/src/unifycr-sysio.c, it looks like the error path on line 887 should call free(glb_read_reqs).

 877 int UNIFYCR_WRAP(lio_listio)(int mode, struct aiocb *const aiocb_list[],
 878                              int nitems, struct sigevent *sevp)
 879 {
 880
 881     int ret = 0, i;
 882     read_req_t *glb_read_reqs = malloc(nitems * sizeof(read_req_t));
 883
 884     for (i = 0; i < nitems; i++) {
 885         if (aiocb_list[i]->aio_lio_opcode != LIO_READ) {
 886             //does not support write operation currently
 887             return -1;
 888         }
 889         glb_read_reqs[i].fid = aiocb_list[i]->aio_fildes;
 890         glb_read_reqs[i].buf = (char *)aiocb_list[i]->aio_buf;
 891         glb_read_reqs[i].length = aiocb_list[i]->aio_nbytes;
 892         glb_read_reqs[i].offset = aiocb_list[i]->aio_offset;
 893
 894     }
 895
 896     ret = unifycr_fd_logreadlist(glb_read_reqs, nitems);
 897     free(glb_read_reqs);
 898     return ret;
 899 }

garbage collection

It is undetermined when garbage collection should be conducted. We recommend performing it lazily, when DRAM and SSD usage reaches a high watermark.
