seaflow-uw / seaflowpy
Python package with library and CLI tool for analyzing SeaFlow data
License: GNU General Public License v3.0
Sometimes SeaFlow datasets are messy and have EVT files that fall outside the normal day-of-year fan-out directory layout. These files currently get filtered and end up in OPP results; they should be ignored instead.
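One way to skip the stray files is to match paths against the standard fan-out pattern. The layout assumed below (`YYYY_DOY/YYYY-MM-DDTHH-MM-SS+00-00`, as seen in the filtering logs) is an inference from example paths, not a documented spec:

```python
import re

# Hypothetical filter: keep only files inside the standard day-of-year
# fan-out, e.g. '2017_177/2017-06-26T22-36-10+00-00'
FANOUT_RE = re.compile(
    r"\d{4}_\d{1,3}/\d{4}-\d{2}-\d{2}T\d{2}-\d{2}-\d{2}[+-]\d{2}-\d{2}"
)

def in_fanout(path):
    """Return True if the path contains a standard fan-out segment."""
    return bool(FANOUT_RE.search(path))
```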
input:
LAT LON
2120.601 -15816.43
output:
21.3433 -157.7262
The LON value is wrong; it should be approximately -158.27. The issue is probably in the ddm2dd function.
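The reported output is consistent with the minutes fraction being added to the signed degrees rather than applied after taking the absolute value (-158 + 16.43/60 = -157.7262). A minimal sketch of a correct degree-decimal-minute conversion follows; the real ddm2dd signature and input handling may differ:

```python
def ddm2dd(ddm):
    """Convert a degree decimal minute string like '-15816.43'
    (DDDMM.MMMM) to decimal degrees."""
    value = float(ddm)
    sign = -1.0 if value < 0 else 1.0
    value = abs(value)
    degrees = int(value // 100)      # e.g. 158
    minutes = value - degrees * 100  # e.g. 16.43
    # Apply the sign to the whole result, not just the degrees part
    return sign * (degrees + minutes / 60.0)
```

With this handling, `ddm2dd("2120.601")` gives about 21.3434 and `ddm2dd("-15816.43")` gives about -158.2738.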
Both `db create` and `db import-filter-params` should create a new popcycle database; currently only `db create` does. Both should be able to take `--cruise` and `--serial`, or check for existing values in the db file if not provided. `db create` should be renamed `db import-sfl`.
Importing SFL files with CRLF line endings misses the final flow rate column; files with LF line endings import the flow rate correctly.
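A likely mechanism, sketched under the assumption that the parser strips only `'\n'` before splitting on tabs: with CRLF endings a stray `'\r'` stays attached to the last column name, so lookups for the flow rate column fail.

```python
line = "FILE\tDATE\tFLOW RATE\r\n"

# Stripping only '\n' leaves '\r' glued to the final column name
bad = line.rstrip("\n").split("\t")
# Stripping both CR and LF yields the intended column name
good = line.rstrip("\r\n").split("\t")
```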
For some cruises, remote filtering fails because an expected OPP file is not present at gzip time. With commit 1985d23:
command=seaflowpy_filter --s3 -d MESO_SCOPE.db -p 16 -o MESO_SCOPE_opp
real_command=/bin/bash -l -c "cd /mnt/ramdisk/MESO_SCOPE >/dev/null && seaflowpy_filter --s3 -d MESO_SCOPE.db -p 16 -o MESO_SCOPE_opp"
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Defined parameters:
{
"version": "0.7.1a0",
"s3": true,
"opp_dir": "MESO_SCOPE_opp",
"resolution": 10.0,
"cruise": "MESO_SCOPE",
"db": "MESO_SCOPE.db",
"process_count": 16
}
Filtering 15180 EVT files. Progress every 10% (approximately)
Error: b'gzip: MESO_SCOPE_opp/97.5/2017_177/2017-06-26T22-36-10+00-00.opp: No such file or directory\n'
Parallel filtering performance is worse with numpy + MKL, and much better with plain numpy installed from PyPI. seaflowpy should perform fine in either case. Figure out how to get seaflowpy to play nice with MKL.
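One common mitigation, offered as an assumption rather than a verified fix for seaflowpy: pin the MKL/OpenBLAS thread pools to one thread per worker process, so multiprocessing parallelism does not fight internal BLAS parallelism. These variables must be set before numpy is first imported to take effect.

```python
import os

# Hypothetical mitigation: one BLAS thread per worker process.
# Set these BEFORE the first `import numpy` anywhere in the process.
for var in ("MKL_NUM_THREADS", "OMP_NUM_THREADS", "OPENBLAS_NUM_THREADS"):
    os.environ[var] = "1"
```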
SFL files with only header lines cause sfl commands to crash.
Traceback (most recent call last):
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3078, in get_loc
return self._engine.get_loc(key)
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/seaflow/miniconda3/envs/seaflowpy/bin/seaflowpy", line 11, in <module>
sys.exit(cli())
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 722, in __call__
return self.main(*args, **kwargs)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 697, in main
rv = self.invoke(ctx)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
return _process_result(sub_ctx.command.invoke(sub_ctx))
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 895, in invoke
return ctx.invoke(self.callback, **ctx.params)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/click/core.py", line 535, in invoke
return callback(*args, **kwargs)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/cli/commands/sfl_cmd.py", line 204, in sfl_validate_cmd
errors = sfl.check(df)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/sfl.py", line 62, in check
errors.extend(check_date(df))
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/seaflowpy/sfl.py", line 94, in check_date
bad_dates = df[~df["date"].map(lambda d: check_date_string(d))]["date"]
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/frame.py", line 2688, in __getitem__
return self._getitem_column(key)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/frame.py", line 2695, in _getitem_column
return self._get_item_cache(key)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/generic.py", line 2489, in _get_item_cache
values = self._data.get(item)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/internals.py", line 4115, in get
loc = self.items.get_loc(item)
File "/Users/seaflow/miniconda3/envs/seaflowpy/lib/python3.7/site-packages/pandas/core/indexes/base.py", line 3080, in get_loc
return self._engine.get_loc(self._maybe_cast_indexer(key))
File "pandas/_libs/index.pyx", line 140, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/index.pyx", line 162, in pandas._libs.index.IndexEngine.get_loc
File "pandas/_libs/hashtable_class_helper.pxi", line 1492, in pandas._libs.hashtable.PyObjectHashTable.get_item
File "pandas/_libs/hashtable_class_helper.pxi", line 1500, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 'date'
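A possible guard, sketched over plain dicts rather than the pandas DataFrame the real sfl.check code uses: return no errors when there are no data rows, instead of assuming the "date" column exists. The date pattern below is illustrative, not seaflowpy's actual check_date_string.

```python
import re

# Illustrative timestamp pattern, not seaflowpy's real check
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}(?:\.\d+)?\+00:00$")

def check_date(records):
    """records: list of dicts, one per SFL data row.
    A header-only SFL yields no records, so report no errors rather
    than crashing on a missing 'date' column."""
    if not records:
        return []
    return [r["date"] for r in records if not DATE_RE.match(r.get("date", ""))]
```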
Part of validating/fixing SFL files should be checking that `file_duration` is a positive number. It should be a required field in the same sense that lat and lon are required.
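Such a check might look like the following; the function name is hypothetical:

```python
def valid_file_duration(value):
    # file_duration must be present and parse as a positive number
    try:
        return float(value) > 0
    except (TypeError, ValueError):
        return False
```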
Timestamp strings should always be consistent, e.g. the same fractional-seconds precision and the same time zone format, and should match popcycle.
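One way to canonicalize, assuming ISO 8601 inputs; the target form here (whole seconds, explicit +00:00 offset) is an assumption, not popcycle's documented choice:

```python
from datetime import datetime, timezone

def normalize_date(s):
    # Parse an ISO 8601 timestamp (accepting a trailing 'Z') and re-emit
    # it in one canonical form: whole seconds, explicit +00:00 offset.
    dt = datetime.fromisoformat(s.replace("Z", "+00:00"))
    return dt.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%S+00:00")
```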
An error message generated while filtering to a read-only DB raised a KeyError because the key "file" was not present in the work dictionary.
seaflow@de31b0f741e5:/data$ seaflowpy filter -o HOT310_opp-1core -d HOT310.db-1core -p 1 -e HOT310_evt/
Run parameters and information:
{
"delta": false,
"evt_dir": "HOT310_evt/",
"db": "HOT310.db-1core",
"opp_dir": "HOT310_opp-1core",
"process_count": 1,
"resolution": 10.0,
"version": "5.1.0",
"cruise": "HOT310"
}
Getting lists of files to filter
sfl=1558 evt=1559 intersection=1558
Filtering 1558 EVT files. Progress for 50th quantile every ~ 10.0%
Process Process-2:
Traceback (most recent call last):
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 332, in executemany
con.executemany(sql, values)
sqlite3.OperationalError: attempt to write a readonly database
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/filterevt.py", line 234, in do_save
db.save_opp_to_db(work["opp_vals"], work["dbpath"])
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 88, in save_opp_to_db
executemany(dbpath, sql_insert, vals)
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/db.py", line 334, in executemany
raise errors.SeaFlowpyError("An error occurred when executing SQL queries: {!s}".format(e))
seaflowpy.errors.SeaFlowpyError: An error occurred when executing SQL queries: attempt to write a readonly database
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 315, in _bootstrap
self.run()
File "/usr/local/lib/python3.8/multiprocessing/process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/util.py", line 131, in wrapper
f(*args, **kwargs)
File "/home/seaflow/.local/lib/python3.8/site-packages/seaflowpy/filterevt.py", line 239, in do_save
work["errors"].append("Unexpected error when saving file {} to db: {}".format(work["file"], e))
KeyError: 'file'
^CTraceback (most recent call last):
File "/usr/local/lib/python3.8/multiprocessing/queues.py", line 245, in _feed
send_bytes(obj)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 405, in _send_bytes
self._send(buf)
File "/usr/local/lib/python3.8/multiprocessing/connection.py", line 368, in _send
n = write(self._handle, buf)
BrokenPipeError: [Errno 32] Broken pipe
The `create_date_field` command in `sds2sfl` fails with the CMOP cruises because computerUTC has a four-digit year instead of two.
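A tolerant parser could branch on string length; the `YYMMDDHHMMSS` vs `YYYYMMDDHHMMSS` layouts below are assumptions about the computerUTC field, not a confirmed spec:

```python
from datetime import datetime

def parse_computer_utc(s):
    # Accept both two-digit (YYMMDDHHMMSS) and four-digit (YYYYMMDDHHMMSS)
    # year prefixes; assumed layouts, adjust to the real field format.
    fmt = "%Y%m%d%H%M%S" if len(s) == 14 else "%y%m%d%H%M%S"
    return datetime.strptime(s, fmt)
```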
The method for producing the event rate found in a raw SFL file has varied over the life of the SeaFlow project. For consistency it would be good to recalculate this event rate based on the `DURATION` column in an SFL file and the actual number of events found in EVT files.
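The recalculation itself is simple division; sketched here with hypothetical names:

```python
def recalc_event_rate(event_count, duration):
    """Events per second from the EVT event count and the SFL DURATION
    column (seconds). Returns None when duration is missing or invalid."""
    if not duration or duration <= 0:
        return None
    return event_count / duration
```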
The first 4 bytes of an EVT file record the event count as a uint32. In some cases, though, the file has been truncated, so in addition to reading the first 4 bytes of every EVT file it's necessary to check the expected file size of `4-byte event count header + (event count * (4 spacer bytes + 20 data bytes))`.
This check should probably be performed as soon as the cruise raw data is received.
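That check can be expressed as a pure function over the header bytes and observed file size; little-endian byte order for the uint32 count is an assumption here:

```python
import struct

def evt_size_ok(header_bytes, file_size):
    """True if file_size matches the expected EVT layout:
    4-byte uint32 event count + event_count * (4 spacer + 20 data) bytes.
    Little-endian count is an assumption."""
    if file_size < 4 or len(header_bytes) < 4:
        return False
    (n,) = struct.unpack("<I", header_bytes[:4])
    return file_size == 4 + n * 24
```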
The "install Anaconda" link doesn't work.
CLI tools help text should show default values.
Scaling across multiple cores quickly becomes non-linear. This is caused by cache contention once too many processes are scheduled on the same socket. For example, here is the output of profiling scripts/perftest.py using the Linux tool perf on 1 core.
perf stat -e task-clock,cycles,instructions,context-switches,cpu-migrations,page-faults,cache-references,cache-misses python ~/git/seaflowpy/scripts/perftest.py
17532.427294 task-clock (msec) # 0.977 CPUs utilized
52,455,060,802 cycles # 2.992 GHz (49.94%)
36,262,158,676 instructions # 0.69 insns per cycle (74.93%)
4,182 context-switches # 0.239 K/sec
0 cpu-migrations # 0.000 K/sec
1,444,517 page-faults # 0.082 M/sec
775,817,000 cache-references # 44.250 M/sec (75.08%)
61,825,323 cache-misses # 7.969 % of all cache refs (74.98%)
17.948233815 seconds time elapsed
and here is the output using 8 cores on a 2 socket, 4 cores per socket machine.
perf stat -e task-clock,cycles,instructions,context-switches,cpu-migrations,page-faults,cache-references,cache-misses python ~/git/seaflowpy/scripts/perftest.py
29198.622016 task-clock (msec) # 0.977 CPUs utilized
87,280,802,723 cycles # 2.989 GHz (50.02%)
36,564,930,299 instructions # 0.42 insns per cycle (74.98%)
7,415 context-switches # 0.254 K/sec
711 cpu-migrations # 0.024 K/sec
1,444,523 page-faults # 0.049 M/sec
791,715,505 cache-references # 27.115 M/sec (74.96%)
146,640,175 cache-misses # 18.522 % of all cache refs (75.03%)
29.897445148 seconds time elapsed
It is still faster to use all available cores in most cases, but the difference between using 4 cores and 8 cores in this case is minimal.
I'm not sure there's a fix for this since it's caused by numpy's highly optimized use of the CPU cache. Operating on a single large dataframe representing multiple files and making use of either OpenBLAS/MKL parallelization or dask/modin may be an improvement, but I suspect not enough of one to justify the significant code changes.
It's also worth noting that allowing numpy with mkl to use all available cores during filtering is actually much slower than limiting to 1 core and simply starting multiple filtering processes.
The remote filtering CLI tool starts, and presumably provisions an instance, even when the DB list is empty, which is wasteful. It should abort early if no DBs are given on the command line.
The README refers to `seaflowpy --help` as the starting place for CLI docs. There should be a more comprehensive and approachable source of documentation, maybe using Sphinx and Read the Docs.
On Linux I'm seeing this warning
/usr/lib/python3.5/importlib/_bootstrap.py:222: RuntimeWarning: numpy.dtype size changed, may indicate binary incompatibility. Expected 96, got 88
return f(*args, **kwds)
Web searches indicate this is because some module (maybe pandas) was compiled against an older version of numpy than the one installed. Apparently it's harmless.
To save space on the remote filtering instance, cruise results should be erased after they've been transferred back to the client.