seaflow-uw / seaflowpy

Python package with library and CLI tool for analyzing SeaFlow data

License: GNU General Public License v3.0

Python 99.49% Dockerfile 0.32% Shell 0.20%
flow-cytometry python seaflow oceanography reproducible-research

seaflowpy's Issues

bug in geo.py

Input:
LAT LON
2120.601 -15816.43

Output:
21.3433 -157.7262

LON is wrong; it should be -158.28. The issue is probably in the ddm2dd function.
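
A minimal sketch of what a correct degrees-decimal-minutes to decimal-degrees conversion could look like (an illustrative reimplementation, not seaflowpy's actual ddm2dd):

```python
def ddm2dd(ddm):
    """Convert a GGA-style degrees-decimal-minutes value (e.g. -15816.43,
    meaning 158 degrees 16.43 minutes West) to decimal degrees."""
    sign = -1 if ddm < 0 else 1
    ddm = abs(ddm)
    degrees = int(ddm // 100)      # everything left of the two minute digits
    minutes = ddm - degrees * 100  # remaining decimal minutes
    return sign * (degrees + minutes / 60)
```

With this version, -15816.43 converts to approximately -158.274 rather than -157.7262.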

db import-filter-params and db create changes

Both db create and db import-filter-params should create a new popcycle database; currently only db create does. Both should accept --cruise and --serial, or fall back to values already present in the db file when those options aren't provided. db create should be renamed db import-sfl.

gzip opp file fails

For some cruises, remote filtering fails because an expected opp file is not present at gzip time. As of commit 1985d23:

command=seaflowpy_filter --s3 -d MESO_SCOPE.db -p 16 -o MESO_SCOPE_opp
real_command=/bin/bash -l -c "cd /mnt/ramdisk/MESO_SCOPE >/dev/null && seaflowpy_filter --s3 -d MESO_SCOPE.db -p 16 -o MESO_SCOPE_opp"
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)
Defined parameters:
{
  "version": "0.7.1a0",
  "s3": true,
  "opp_dir": "MESO_SCOPE_opp",
  "resolution": 10.0,
  "cruise": "MESO_SCOPE",
  "db": "MESO_SCOPE.db",
  "process_count": 16
}


Filtering 15180 EVT files. Progress every 10% (approximately)
Error: b'gzip: MESO_SCOPE_opp/97.5/2017_177/2017-06-26T22-36-10+00-00.opp: No such file or directory\n'

empty sfl file

SFL files with only header lines cause sfl commands to crash.

Traceback (most recent call last):
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/seaflow/miniconda3/envs/seaflowpy/bin/seaflowpy", line 11, in <module>
    sys.exit(cli())
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 722, in __call__
    return self.main(*args, **kwargs)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 697, in main
    rv = self.invoke(ctx)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 895, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 535, in invoke
    return callback(*args, **kwargs)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/cli/commands/sfl_cmd.py", line 204, in sfl_validate_cmd
    errors = sfl.check(df)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/sfl.py", line 62, in check
    errors.extend(check_date(df))
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/sfl.py", line 94, in check_date
    bad_dates = df[~df["date"].map(lambda d: check_date_string(d))]["date"]
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/frame.py", line 2688, in __getitem__
    return self._getitem_column(key)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
    values = self._data.get(item)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/internals.py", line 4115, in get
    loc = self.items.get_loc(item)
  File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'
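
One way to avoid the crash is a guard clause before the per-row date checks. This is only a sketch of the shape of the fix, not seaflowpy's actual sfl.check_date logic:

```python
import pandas as pd

def check_date(df):
    """Return a list of error strings for bad dates. Header-only SFL files
    parse to a frame with zero rows (or no 'date' column at all), so bail
    out early instead of raising KeyError (sketch only)."""
    errors = []
    if "date" not in df.columns or len(df) == 0:
        errors.append("date column missing or file has no data rows")
        return errors
    # ... per-row date validation would go here ...
    return errors
```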

Check file_duration in SFL

Part of validating/fixing SFL files should be checking that file_duration is a positive number. It should be a required field in the same sense that lat and lon are required.
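
A sketch of what that validation could look like (hypothetical helper; the real SFL checks live in seaflowpy/sfl.py):

```python
import pandas as pd

def check_file_duration(df):
    """Require a file_duration column whose values are all positive numbers."""
    errors = []
    if "file_duration" not in df.columns:
        errors.append("file_duration column is required")
        return errors
    durations = pd.to_numeric(df["file_duration"], errors="coerce")
    bad = df.index[durations.isna() | (durations <= 0)]
    errors.extend(
        "row {}: file_duration must be a positive number".format(i) for i in bad
    )
    return errors
```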

timestamp strings consistency

Timestamp strings should always be consistent, for example the same fractional-seconds precision and the same time zone format, and should match popcycle.
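
For example, one canonical form could be second precision with an explicit +00:00 UTC offset. The exact format chosen here is an assumption for illustration; the real requirement is just that seaflowpy and popcycle agree on one:

```python
from datetime import datetime, timezone

def normalize_timestamp(ts):
    """Re-emit an ISO 8601 timestamp string in one canonical form: UTC,
    second precision, explicit +00:00 offset. Naive inputs are assumed
    to already be UTC."""
    dt = datetime.fromisoformat(ts)
    if dt.tzinfo is None:
        dt = dt.replace(tzinfo=timezone.utc)
    return dt.astimezone(timezone.utc).isoformat(timespec="seconds")
```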

KeyError with error message during filtering

Generating an error message while filtering to a read-only DB raised a KeyError, because the key "file" was not present in the work dictionary.

seaflow@de31b0f741e5:/data$ seaflowpy filter -o HOT310_opp-1core -d HOT310.db-1core -p 1 -e HOT310_evt/
Run parameters and information:
{
  "delta": false,
  "evt_dir": "HOT310_evt/",
  "db": "HOT310.db-1core",
  "opp_dir": "HOT310_opp-1core",
  "process_count": 1,
  "resolution": 10.0,
  "version": "5.1.0",
  "cruise": "HOT310"
}

Getting lists of files to filter
sfl=1558 evt=1559 intersection=1558

Filtering 1558 EVT files. Progress for 50th quantile every ~ 10.0%
Process Process-2:
Traceback (most recent call last):
  File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 332, in executemany
    con.executemany(sql, values)
sqlite3.OperationalError: attempt to write a readonly database

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/filterevt.py", line 234, in do_save
    db.save_opp_to_db(work["opp_vals"], work["dbpath"])
  File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 88, in save_opp_to_db
    executemany(dbpath, sql_insert, vals)
  File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 334, in executemany
    raise errors.SeaFlowpyError("An error occurred when executing SQL queries: {!s}".format(e))
seaflowpy.errors.SeaFlowpyError: An error occurred when executing SQL queries: attempt to write a readonly database

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
    self.run()
  File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/util.py", line 131, in wrapper
    f(*args, **kwargs)
  File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/filterevt.py", line 239, in do_save
    work["errors"].append("Unexpected error when saving file {} to db: {}".format(work["file"], e))
KeyError: 'file'
^CTraceback (most recent call last):
  File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
    send_bytes(obj)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes
    self._send(buf)
  File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
    n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
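
A defensive version of the error reporting in do_save could fall back to a placeholder when "file" is missing. The helper name below is hypothetical; only the .get fallback is the point:

```python
def report_save_error(work, e):
    """Append a save-error message to work["errors"] without assuming
    the "file" key is present in the work dict."""
    filename = work.get("file", "<unknown file>")
    work.setdefault("errors", []).append(
        "Unexpected error when saving file {} to db: {}".format(filename, e)
    )
```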

sds2sfl fails with CMOP cruises

The create_date_field command in sds2sfl fails with the CMOP cruises because computerUTC has a four-digit year instead of a two-digit one.
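
A fix could branch on the width of the year field. The DDMMYY / DDMMYYYY layout below is an assumption for illustration, not the actual computerUTC format:

```python
from datetime import datetime

def parse_stamp(datestr):
    """Parse a date stamp whose year may be two or four digits by
    choosing the strptime year directive from the field width."""
    fmt = "%d%m%Y" if len(datestr) == 8 else "%d%m%y"
    return datetime.strptime(datestr, fmt)
```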

Add feature to calculate true event rate

The method for producing the event rate found in a raw SFL file has varied over the life of the SeaFlow project. For consistency it would be good to recalculate this event rate based on the DURATION column in an SFL file and the actual number of events found in EVT files.

The first 4 bytes of an EVT file record the event count as a uint32. In some cases, though, the file has been truncated, so in addition to reading the first 4 bytes of every EVT file it's necessary to check that the file size matches the expected size: 4-byte event count header + (event count * (4 spacer bytes + 20 data bytes)).

This should probably be performed as soon as the raw cruise data is received.
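
The truncation check described above can be sketched like this (the function name and little-endian byte order are my assumptions):

```python
import os
import struct

def evt_event_count(path):
    """Read the uint32 event count from an EVT file header and verify the
    file size matches 4 + count * (4 spacer bytes + 20 data bytes).
    Returns the count, or None if the file looks truncated or corrupt."""
    with open(path, "rb") as fh:
        header = fh.read(4)
    if len(header) < 4:
        return None
    (count,) = struct.unpack("<I", header)
    expected_size = 4 + count * (4 + 20)
    if os.path.getsize(path) != expected_size:
        return None
    return count
```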

cli defaults

CLI tools help text should show default values.
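
With click (which seaflowpy's CLI uses, per the tracebacks above), this is a one-liner per option:

```python
import click

@click.command()
@click.option("--process-count", default=1, show_default=True,
              help="Number of worker processes.")
def filter_cmd(process_count):
    """Toy command: show_default=True makes --help render '[default: 1]'."""
    click.echo(process_count)
```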

sub-linear scaling with increasing cores

Scaling across multiple cores quickly becomes sub-linear. This is caused by cache contention once too many processes are scheduled on the same socket. For example, here is the output of profiling scripts/perftest.py with the Linux tool perf on 1 core.

perf stat -e task-clock,cycles,instructions,context-switches,cpu-migrations,page-faults,cache-references,cache-misses python ~/git/seaflowpy/scripts/perftest.py

      17532.427294      task-clock (msec)         #    0.977 CPUs utilized          
    52,455,060,802      cycles                    #    2.992 GHz                      (49.94%)
    36,262,158,676      instructions              #    0.69  insns per cycle          (74.93%)
             4,182      context-switches          #    0.239 K/sec                  
                 0      cpu-migrations            #    0.000 K/sec                  
         1,444,517      page-faults               #    0.082 M/sec                  
       775,817,000      cache-references          #   44.250 M/sec                    (75.08%)
        61,825,323      cache-misses              #    7.969 % of all cache refs      (74.98%)

      17.948233815 seconds time elapsed

and here is the output using 8 cores on a 2 socket, 4 cores per socket machine.

perf stat -e task-clock,cycles,instructions,context-switches,cpu-migrations,page-faults,cache-references,cache-misses python ~/git/seaflowpy/scripts/perftest.py

      29198.622016      task-clock (msec)         #    0.977 CPUs utilized          
    87,280,802,723      cycles                    #    2.989 GHz                      (50.02%)
    36,564,930,299      instructions              #    0.42  insns per cycle          (74.98%)
             7,415      context-switches          #    0.254 K/sec                  
               711      cpu-migrations            #    0.024 K/sec                  
         1,444,523      page-faults               #    0.049 M/sec                  
       791,715,505      cache-references          #   27.115 M/sec                    (74.96%)
       146,640,175      cache-misses              #   18.522 % of all cache refs      (75.03%)

      29.897445148 seconds time elapsed

It is still faster to use all available cores in most cases, but the difference between using 4 cores and 8 cores in this case is minimal.

I'm not sure there's a fix for this, since it's caused by numpy's highly optimized use of the CPU cache. It's possible that operating on a single large dataframe representing multiple files and making use of openblas/mkl parallelization, or of dask/modin, etc., may be an improvement, but I suspect not enough of one to justify the significant code changes.

It's also worth noting that allowing numpy with MKL to use all available cores during filtering is actually much slower than limiting each process to 1 core and simply starting multiple filtering processes.
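
Pinning the BLAS/OpenMP thread pools per process is typically done with environment variables set before numpy is imported; which variable actually applies depends on how numpy was built:

```python
import os

# One BLAS/OpenMP thread per process; parallelism then comes from running
# multiple filtering processes. Must be set before numpy is imported.
for var in ("OMP_NUM_THREADS", "MKL_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"

# import numpy as np  # only after the variables above are set
```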

Better docs on CLI tools

The README refers to seaflowpy --help as the starting place for CLI docs. There should be a more comprehensive and approachable source of documentation, maybe using Sphinx and Read the Docs.

numpy warning concerning dtypes and binary compatibility

On Linux I'm seeing this warning

/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
  return f(*args, **kwds)

Some web searching indicates this is because some module (maybe pandas) was compiled against an older version of numpy than the one installed. Apparently it's harmless.
