seaflow-uw / seaflowpy
Python package with library and CLI tool for analyzing SeaFlow data
License: GNU General Public License v3.0
Sometimes SeaFlow datasets are messy and have EVT files that fall outside the normal day-of-year fan-out directory layout. These files currently get filtered and end up in OPP results; they should be ignored instead.
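One way to skip the stray files is to match paths against the standard fan-out pattern. The layout assumed below (`YYYY_DOY/YYYY-MM-DDTHH-MM-SS+00-00`, as seen in the filtering logs) is an inference from example paths, not a documented spec:

```python
import re

# Hypothetical filter: keep only files inside the standard day-of-year
# fan-out, e.g. '2017_177/2017-06-26T22-36-10+00-00'
FANOUT_RE = re.compile(
    r"\d{4}_\d{1,3}/\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}[+-]\d{2}-\d{2}"
)

def in_fanout(path):
    """Return True if the path contains a standard fan-out segment."""
    return bool(FANOUT_RE.search(path))
```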
input:
LAT LON
2120.601 -15816.43
output:
21.3433 -157.7262
The LON value is wrong; it should be approximately -158.27. The issue is probably in the ddm2dd function.
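The reported output is consistent with the minutes fraction being added to the signed degrees rather than applied after taking the absolute value (-158 + 16.43/60 = -157.7262). A minimal sketch of a correct degree-decimal-minute conversion follows; the real ddm2dd signature and input handling may differ:

```python
def ddm2dd(ddm):
    """Convert a degree decimal minute string like '-15816.43'
    (DDDMM.MMMM) to decimal degrees."""
    value = float(ddm)
    sign = -1.0 if value < 0 else 1.0
    value = abs(value)
    degrees = int(value // 100)      # e.g. 158
    minutes = value - degrees * 100  # e.g. 16.43
    # Apply the sign to the whole result, not just the degrees part
    return sign * (degrees + minutes / 60.0)
```

With this handling, `ddm2dd("2120.601")` gives about 21.3434 and `ddm2dd("-15816.43")` gives about -158.2738.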
Both `db create` and `db import-filter-params` should create a new popcycle database; currently only `db create` does. Both should be able to take `--cruise` and `--serial`, or check for existing values in the db file if not provided. `db create` should be renamed `db import-sfl`.
Importing SFL files with CRLF line endings misses the final flow rate column; files with LF line endings import the flow rate correctly.
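A likely mechanism, sketched under the assumption that the parser strips only `'\n'` before splitting on tabs: with CRLF endings a stray `'\r'` stays attached to the last column name, so lookups for the flow rate column fail.

```python
line = "FILE\tDATE\tFLOW RATE\r\n"

# Stripping only '\n' leaves '\r' glued to the final column name
bad = line.rstrip("\n").split("\t")
# Stripping both CR and LF yields the intended column name
good = line.rstrip("\r\n").split("\t")
```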
For some cruises, remote filtering fails because an expected OPP file is not present at gzip time. With commit 1985d23:
command=seaflowpy_filter --s3 -d MESO_SCOPE.db -p 16 -o MESO_SCOPE_opp
real_command=/bin/bash -l -c "cd /mnt/ramdisk/MESO_SCOPE >/dev/null && seaflowpy_filter --s3 -d MESO_SCOPE.db -p 16 -o MESO_SCOPE_opp"
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Defined parameters:
{
"version": "0.7.1a0",
"s3": true,
"opp_dir": "MESO_SCOPE_opp",
"resolution": 10.0,
"cruise": "MESO_SCOPE",
"db": "MESO_SCOPE.db",
"process_count": 16
}
Filtering 15180 EVT files. Progress every 10% (approximately)
Error: b'gzip: MESO_SCOPE_opp/97.5/2017_177/2017-06-26T22-36-10+00-00.opp: No such file or directory\n'
Parallel filtering performance is worse with numpy + MKL, and much better with plain numpy installed from PyPI. seaflowpy should perform fine in either case. Figure out how to get seaflowpy to play nice with MKL.
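One common mitigation, offered as an assumption rather than a verified fix for seaflowpy: pin the MKL/OpenBLAS thread pools to one thread per worker process, so multiprocessing parallelism does not fight internal BLAS parallelism. These variables must be set before numpy is first imported to take effect.

```python
import os

# Hypothetical mitigation: one BLAS thread per worker process.
# Set these BEFORE the first `import numpy` anywhere in the process.
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"
```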
SFL files with only header lines cause sfl commands to crash.
Traceback (most recent call last):
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/seaflow/miniconda3/envs/seaflowpy/bin/seaflowpy", line 11, in <module>
sys.exit(cli())
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/cli/commands/sfl_cmd.py", line 204, in sfl_validate_cmd
errors = sfl.check(df)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/sfl.py", line 62, in check
errors.extend(check_date(df))
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/sfl.py", line 94, in check_date
bad_dates = df[~df["date"].map(lambda d: check_date_string(d))]["date"]
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'
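A possible guard, sketched over plain dicts rather than the pandas DataFrame the real sfl.check code uses: return no errors when there are no data rows, instead of assuming the "date" column exists. The date pattern below is illustrative, not seaflowpy's actual check_date_string.

```python
import re

# Illustrative timestamp pattern, not seaflowpy's real check
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?\+00:00$")

def check_date(records):
    """records: list of dicts, one per SFL data row.
    A header-only SFL yields no records, so report no errors rather
    than crashing on a missing 'date' column."""
    if not records:
        return []
    return [r["date"] for r in records if not DATE_RE.match(r.get("date", ""))]
```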
Part of validating/fixing SFL files should be checking that `file_duration` is a positive number. It should be a required field in the same sense that lat and lon are required.
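Such a check might look like the following; the function name is hypothetical:

```python
def valid_file_duration(value):
    # file_duration must be present and parse as a positive number
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False
```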
Timestamp strings should always be consistent, e.g. the same fractional-seconds precision and the same time zone format, and should match popcycle.
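One way to canonicalize, assuming ISO 8601 inputs; the target form here (whole seconds, explicit +00:00 offset) is an assumption, not popcycle's documented choice:

```python
from datetime import datetime, timezone

def normalize_date(s):
    # Parse an ISO 8601 timestamp (accepting a trailing 'Z') and re-emit
    # it in one canonical form: whole seconds, explicit +00:00 offset.
    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
```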
An error message generated while filtering to a read-only DB raised a KeyError because the key "file" was not present in the work dictionary.
seaflow@de31b0f741e5:/data$ seaflowpy filter -o HOT310_opp-1core -d HOT310.db-1core -p 1 -e HOT310_evt/
Run parameters and information:
{
"delta": false,
"evt_dir": "HOT310_evt/",
"db": "HOT310.db-1core",
"opp_dir": "HOT310_opp-1core",
"process_count": 1,
"resolution": 10.0,
"version": "5.1.0",
"cruise": "HOT310"
}
Getting lists of files to filter
sfl=1558 evt=1559 intersection=1558
Filtering 1558 EVT files. Progress for 50th quantile every ~ 10.0%
Process Process-2:
Traceback (most recent call last):
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 332, in executemany
con.executemany(sql, values)
sqlite3.OperationalError: attempt to write a readonly database
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/filterevt.py", line 234, in do_save
db.save_opp_to_db(work["opp_vals"], work["dbpath"])
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 88, in save_opp_to_db
executemany(dbpath, sql_insert, vals)
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 334, in executemany
raise errors.SeaFlowpyError("An error occurred when executing SQL queries: {!s}".format(e))
seaflowpy.errors.SeaFlowpyError: An error occurred when executing SQL queries: attempt to write a readonly database
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/util.py", line 131, in wrapper
f(*args, **kwargs)
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/filterevt.py", line 239, in do_save
work["errors"].append("Unexpected error when saving file {} to db: {}".format(work["file"], e))
KeyError: 'file'
^CTraceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
send_bytes(obj)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes
self._send(buf)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
The `create_date_field` command in `sds2sfl` fails with the CMOP cruises because computerUTC has a four-digit year instead of two.
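A tolerant parser could branch on string length; the `YYMMDDHHMMSS` vs `YYYYMMDDHHMMSS` layouts below are assumptions about the computerUTC field, not a confirmed spec:

```python
from datetime import datetime

def parse_computer_utc(s):
    # Accept both two-digit (YYMMDDHHMMSS) and four-digit (YYYYMMDDHHMMSS)
    # year prefixes; assumed layouts, adjust to the real field format.
    fmt = "%Y%m%d%H%M%S" if len(s) == 14 else "%y%m%d%H%M%S"
    return datetime.strptime(s, fmt)
```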
The method for producing the event rate found in a raw SFL file has varied over the life of the SeaFlow project. For consistency it would be good to recalculate this event rate based on the `DURATION` column in an SFL file and the actual number of events found in EVT files.
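The recalculation itself is simple division; sketched here with hypothetical names:

```python
def recalc_event_rate(event_count, duration):
    """Events per second from the EVT event count and the SFL DURATION
    column (seconds). Returns None when duration is missing or invalid."""
    if not duration or duration <= 0:
        return None
    return event_count / duration
```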
The first 4 bytes of an EVT file record the event count as a uint32. In some cases, though, the file has been truncated, so in addition to reading the first 4 bytes of every EVT file it's necessary to check the expected file size of `4-byte event count header + (event count * (4 spacer bytes + 20 data bytes))`.
This check should probably be performed as soon as the cruise raw data is received.
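That check can be expressed as a pure function over the header bytes and observed file size; little-endian byte order for the uint32 count is an assumption here:

```python
import struct

def evt_size_ok(header_bytes, file_size):
    """True if file_size matches the expected EVT layout:
    4-byte uint32 event count + event_count * (4 spacer + 20 data) bytes.
    Little-endian count is an assumption."""
    if file_size < 4 or len(header_bytes) < 4:
        return False
    (n,) = struct.unpack("<I", header_bytes[:4])
    return file_size == 4 + n * 24
```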
The "install Anaconda" link doesn't work.
CLI tools help text should show default values.
Scaling across multiple cores quickly becomes non-linear. This is caused by cache contention once too many processes are scheduled on the same socket. For example, here is the output of profiling scripts/perftest.py using the Linux tool perf on 1 core.
perf stat -e task-clock,cycles,instructions,context-switches,cpu-migrations,page-faults,cache-references,cache-misses python ~/git/seaflowpy/scripts/perftest.py
17532.427294 task-clock (msec) # 0.977 CPUs utilized
52,455,060,802 cycles # 2.992 GHz (49.94%)
36,262,158,676 instructions # 0.69 insns per cycle (74.93%)
4,182 context-switches # 0.239 K/sec
0 cpu-migrations # 0.000 K/sec
1,444,517 page-faults # 0.082 M/sec
775,817,000 cache-references # 44.250 M/sec (75.08%)
61,825,323 cache-misses # 7.969 % of all cache refs (74.98%)
17.948233815 seconds time elapsed
and here is the output using 8 cores on a 2 socket, 4 cores per socket machine.
perf stat -e task-clock,cycles,instructions,context-switches,cpu-migrations,page-faults,cache-references,cache-misses python ~/git/seaflowpy/scripts/perftest.py
29198.622016 task-clock (msec) # 0.977 CPUs utilized
87,280,802,723 cycles # 2.989 GHz (50.02%)
36,564,930,299 instructions # 0.42 insns per cycle (74.98%)
7,415 context-switches # 0.254 K/sec
711 cpu-migrations # 0.024 K/sec
1,444,523 page-faults # 0.049 M/sec
791,715,505 cache-references # 27.115 M/sec (74.96%)
146,640,175 cache-misses # 18.522 % of all cache refs (75.03%)
29.897445148 seconds time elapsed
It is still faster to use all available cores in most cases, but the difference between using 4 cores and 8 cores in this case is minimal.
I'm not sure there's a fix for this since it's caused by numpy's highly optimized use of the CPU cache. Operating on a single large dataframe representing multiple files and making use of either OpenBLAS/MKL parallelization or dask/modin may be an improvement, but I suspect not enough of one to justify the significant code changes.
It's also worth noting that allowing numpy with mkl to use all available cores during filtering is actually much slower than limiting to 1 core and simply starting multiple filtering processes.
The remote filtering CLI tool starts, and presumably provisions an instance, even when the DB list is empty, which is wasteful. It should abort early if no DBs are given on the command line.
The README refers to `seaflowpy --help` as the starting place for CLI docs. There should be a more comprehensive and approachable source of documentation, maybe using Sphinx and Read the Docs.
On Linux I'm seeing this warning
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Web searches indicate this is because some module (maybe pandas) was compiled against an older version of numpy than the one installed. Apparently it's harmless.
To save space on the remote filtering instance, cruise results should be erased after they've been transferred back to the client.