fermi-ad / datalogger-to-ml
Scripts for requesting AD Controls data logger data and transforming those to the desired ML output format and destination.
To limit the number of writes to disk, we should store data in memory before writing to disk.
This is acceptable because we know the size of each request, but it will become memory-hungry as the request size grows.
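As a rough sketch of that idea (the callback and buffer names here are hypothetical, not the current dpmData.py internals), readings accumulate in memory and are flushed once per request:

# Hypothetical sketch of buffering readings in memory before a single write.
# The device/timestamp fields and the output path are placeholders.
import pandas as pd

buffer = []  # grows with the request size, hence the memory concern

def on_reading(device, timestamp, value):
    # Called once per reading; only appends to the in-memory buffer.
    buffer.append({'device': device, 'timestamp': timestamp, 'value': value})

def flush(output_file):
    # One write to disk for the whole request instead of one per reading.
    pd.DataFrame(buffer).to_hdf(output_file, key='data', mode='w')
    buffer.clear()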
This change is affected by a bug in the Python DPM data logger DAQ: the second-to-last block of data is sometimes returned after the final empty data message that signals there are no more data.
Data returned for all devices in the list results in a 17 -43 error, DPM_BAD_DATASOURCE_FORMAT (the fields provided with a data source are incorrect).
We should validate that the request going to the DPM is correct and possibly ask @charlieking65 to verify the request at the DPM.
There's potential for the DAQ loop to never meet its exit condition. Nanny should time out and clean up before the next script runs. I think that 30 minutes is plenty of time for DAQ and is well before the next hour, when a new script is launched.
Use add_entries to add devices in bulk. This should speed up list initialization.
Data files should be written to a month -> day folder structure.
Adding a CLI argument to set start_time will allow users to start DAQ at an arbitrary time for testing, debugging, or regenerating data. Using this in conjunction with --run-once will allow the production of one file.
The really nice thing about the new library is that we can use pip to install things.
https://cdcvs.fnal.gov/redmine/projects/py/wiki
There are some asynchronous programming ideas here that may not be intuitive but the API is meant to be similar to the existing one. Let me know if you need assistance.
TOML seems like a nicer solution for config files than other options.
YAML is ok but the whitespace sensitivity can be annoying.
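For example, a minimal config could be read with the standard-library tomllib (Python 3.11+); the file name and keys below are placeholders, not the project's actual config schema:

# Sketch of reading a TOML config; 'config.toml' and its keys are hypothetical.
import tomllib

with open('config.toml', 'rb') as config_file:  # tomllib requires binary mode
    config = tomllib.load(config_file)

request_list = config.get('device_request_file', 'devices.txt')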
Python typically closes files on exit but there is the potential for leaving a file open if we don't handle exits appropriately. The Pythonic way to handle this is with a context manager.
We should review every instance where a file is opened and make sure that the with keyword is being used so that when the scope is left, the file is closed.
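A minimal example of the pattern we want everywhere (the file name is illustrative):

# The context manager closes the file even if an exception is raised inside the block.
with open('device_request_list.txt') as request_file:
    devices = [line.strip() for line in request_file if line.strip()]
# request_file is closed here, whether or not the loop above raised.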
Despite having access in the past, https://www-bd.fnal.gov/pip3 is no longer publicly available. Our builds will not work until this is changed.
Create a script to iterate over generated data files and verify that they can be opened.
I'm thinking it will be like the h5_dump.py script.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from glob import glob
from os import path
import sys

import pandas as pd


def main(output_path):
    # Glob allows the use of the * wildcard to match every HDF5 output file
    h5_outputs = path.join(output_path, '*.h5')
    files = glob(h5_outputs)

    for file in files:
        try:
            # Opening the store read-only is enough to confirm the file parses
            with pd.HDFStore(file, mode='r') as hdf:
                print(f'{file} was successfully read')
        except OSError as error:
            print(f'Could not open {file} with error: {error}')


if __name__ == '__main__':
    main(sys.argv[1])
The datalogger-to-ml command could provide another sub-command, init, that creates a config file with sensible defaults.
The Nanny script is responsible for calling the DAQ script with certain parameters per the L-CAPE project.
This script allows concerns to be separated. Instead of dpmData.py being a do-everything script, it stays generic, and the Nanny script passes arguments to the DAQ script to make it specifically useful for L-CAPE. The idea is that dpmData.py could be useful for other projects in the future.
When that day comes we should move it out into its own project.
The Nanny script should be responsible for generating the device requests list and naming the output files. It can be responsible for coordinating runs of the DAQ script also, although the current plan is to use a CronJob.
The old data collector script should know when to stop collecting data.
Further discussion is needed on handling scenarios where intermittent errors stop the script from collecting older data.
I instinctively used positional arguments with dump and it didn't work.
If the script fails to generate a data file, the next run of the script only generates a data file for the previous hour. This has happened enough times that we are currently 9 hours behind.
I propose having nanny.py generate files from the calculated start time, i.e. the end time of the last file found, until now.
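A sketch of that idea, assuming hourly files whose names end with an ISO-like timestamp; the file-naming convention here is an assumption, not the one nanny.py currently uses:

# Hypothetical sketch: find the end time of the last file and yield hourly
# windows from there until now, so missed hours get regenerated.
from datetime import datetime, timedelta
from pathlib import Path

def missing_hours(output_dir, default_start):
    files = sorted(Path(output_dir).glob('*.h5'))
    if files:
        # Assumes file stems end with 'YYYY-MM-DDTHH' (an assumed convention).
        last_end = datetime.strptime(files[-1].stem[-13:], '%Y-%m-%dT%H')
    else:
        last_end = default_start
    start = last_end
    while start + timedelta(hours=1) <= datetime.now():
        yield start, start + timedelta(hours=1)
        start += timedelta(hours=1)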
Consider:
if path.exists(output_file):
    os.remove(output_file)

with pd.HDFStore(output_file) as hdf:
    ...
We remove an existing file because otherwise we would append to the existing file. If an existing process holds the output_file open and a new process attempts to remove it, the removal should fail. Consider failing gracefully.
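One way to fail gracefully, continuing the fragment above (output_file, os, and path are taken from that snippet):

# Sketch: skip this run instead of crashing if the file cannot be removed.
try:
    if path.exists(output_file):
        os.remove(output_file)
except OSError as error:
    print(f'Could not remove {output_file}: {error}; skipping this run')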
For some reason, when the build process runs pip install --extra-index-url https://www-bd.fnal.gov/pip3 -r requirements.txt, it fails with the error ERROR: Could not find a version that satisfies the requirement acsys==0.10.0 (from versions: none). (This file version exists in the package repository.)
We should understand what pip is looking for to determine the version number. I thought it was the file name, which includes the version number, but maybe it's unpacking each package to find the version number?
The current log for the deployed application is 2.2 GB. We should cap it at something more reasonable.
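One standard-library option is a rotating handler; the log file name and size limits below are only suggestions:

# Sketch: cap the log at ~10 MB with a few backups instead of letting it grow to GBs.
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler('datalogger.log', maxBytes=10_000_000, backupCount=3)
logging.basicConfig(level=logging.INFO, handlers=[handler])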
The README should tell users and developers how to install all the requirements.
@jasonstjohn reports that files are not being produced after restarting the nanny with a specified start date in the config file.
Running it "in hand" revealed that the nanny was attempting to read the files in the output directory, which I thought was incorrect given that we gave it a strict start time.
I realize now that it should consider the files to deduce which time to start.
Through the DPM updates and progress in reliability at every part of the system, we believe that things are more reliable now. We should consider rebuilding the files that we have to ensure consistency.
https://docs.python.org/3/library/pathlib.html
pathlib provides a Path class instead of the plain path strings used by os.path.
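For example, the month -> day folder structure mentioned earlier falls out naturally; the output root and layout below are illustrative assumptions:

# Sketch: build <root>/<YYYY-MM>/<DD>/ with pathlib.
from datetime import datetime
from pathlib import Path

def output_dir(root, when=None):
    when = when or datetime.now()
    directory = Path(root) / when.strftime('%Y-%m') / when.strftime('%d')
    directory.mkdir(parents=True, exist_ok=True)
    return directory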
The old data collector script should support older versions of the device request list and handle different versions of the h5 files.
Tests were prompted by missing basic functionality: python dpmData.py --help ran the application rather than printing the help page.
Other basic functionality tests should be written to prevent regressions.
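A minimal regression test for the --help case could shell out to the script (pytest-style; the expected "usage" text and the 30-second timeout are assumptions):

# Sketch of a basic-functionality test: --help should print usage and exit 0,
# not start data acquisition.
import subprocess
import sys

def test_help_prints_usage():
    result = subprocess.run(
        [sys.executable, 'dpmData.py', '--help'],
        capture_output=True, text=True, timeout=30
    )
    assert result.returncode == 0
    assert 'usage' in result.stdout.lower()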
We want to specify that the final file be written to two destinations: one archived to tape (but not readily readable), and one readily readable (but not as robustly backed up).
Distinct functional parts of the code should be timed to determine where optimizations may be made.
timeit seems to be the accepted method for profiling.
https://docs.python.org/3/library/timeit.html
Document the results in the README.
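A sketch of timing one functional part with timeit; the function being timed is a placeholder, not real project code:

# Sketch: time a single part of the pipeline; build_request_list is hypothetical.
import timeit

def build_request_list():
    # Placeholder for the real work we want to time.
    return [f'device_{index}' for index in range(1000)]

elapsed = timeit.timeit(build_request_list, number=10)
print(f'build_request_list: {elapsed / 10:.4f} s per call')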
When our code successfully builds, GitHub can automatically deploy it to our system.
This script needs to know what data request version to use and will look at existing files to determine what date to end collection.
I proposed that we make requests incrementally until we get errors suggesting that there are no more data.
If no files for this version exist then this script would collect data from now back as far as it can go.
PNFS is sensitive to modifications, and the default for pandas.HDFStore(<file>) (which we use to validate HDF parsing) is to open in append mode. We should always read from PNFS in 'r', read-only mode.
We need a tape backup strategy. Yujun Wu will be a good resource.
I imagine the process of using this project would be as follows:
In response to a ticket about the pnfs directory not responding, the service desk says
The situation usually happens when you or other people open a file under /pnfs and try to write into it. /pnfs is not a full POSIX filesystem. Once a file is in /pnfs, it shouldn't be modified. You can delete a file and re-copy/upload it, but not modify it directly.
So, we should have the nanny script write to the local directory and move the file to the given location.
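A sketch of the write-then-move flow (the paths and the helper name are placeholders):

# Sketch: write the HDF5 file locally, then move it into /pnfs in one step,
# so the file is never modified in place under /pnfs.
import shutil
from pathlib import Path

def publish(local_file, pnfs_dir):
    destination = Path(pnfs_dir) / Path(local_file).name
    shutil.move(str(local_file), str(destination))
    return destination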
We should provide stats on each run for troubleshooting purposes. These don't need to be versioned so we included them in the .gitignore. I think we should output the stats of each run to a file named by its date-time. We should also have a cleanup function that only allows a week of stats to persist.
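A sketch of the stats naming and the week-long retention; the stats/ directory and the JSON format are assumptions:

# Sketch: one stats file per run, named by date-time, and prune anything older
# than a week.
import json
import time
from datetime import datetime
from pathlib import Path

STATS_DIR = Path('stats')

def write_stats(stats):
    STATS_DIR.mkdir(exist_ok=True)
    name = datetime.now().strftime('%Y-%m-%dT%H%M%S') + '.json'
    (STATS_DIR / name).write_text(json.dumps(stats, indent=2))

def cleanup_stats(max_age_days=7):
    cutoff = time.time() - max_age_days * 24 * 3600
    for stats_file in STATS_DIR.glob('*.json'):
        if stats_file.stat().st_mtime < cutoff:
            stats_file.unlink()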
Things to time:
Things to log:
We should gracefully handle the case where someone kills the process or presses ctrl-c. Ideally, we break the DAQ loop and output logs for the data we have.
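A sketch of that pattern; the loop body and the output call are placeholders for the real DAQ loop:

# Sketch: translate SIGTERM into KeyboardInterrupt so `kill` and ctrl-c take the
# same clean-up path; the except block stands in for the real output/logging.
import signal

def raise_keyboard_interrupt(signum, frame):
    raise KeyboardInterrupt

signal.signal(signal.SIGTERM, raise_keyboard_interrupt)

try:
    while True:
        pass  # placeholder for the DAQ loop
except KeyboardInterrupt:
    print('Interrupted; flushing collected data before exit')
    # write_output(...)  # hypothetical call to output the data we have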
Check out argparse for documentation.
Variables to parameterize: the options will be START_TIME and END_TIME, or DURATION.
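A sketch of the argparse wiring for those options; --run-once matches the flag mentioned earlier, while the other flag names and types are assumptions:

# Sketch: parse START_TIME plus either END_TIME or DURATION, and --run-once.
import argparse

parser = argparse.ArgumentParser(description='DAQ from the AD Controls data logger')
parser.add_argument('--start-time', help='ISO date-time to start DAQ from')
group = parser.add_mutually_exclusive_group()
group.add_argument('--end-time', help='ISO date-time to stop DAQ at')
group.add_argument('--duration', type=float, help='length of the request in seconds')
parser.add_argument('--run-once', action='store_true', help='produce one file and exit')
args = parser.parse_args()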