fermi-ad / datalogger-to-ml
Scripts for requesting AD Controls data logger data and transforming those to the desired ML output format and destination.
To limit the number of writes to disk, we should store data in memory before writing to disk.
This is acceptable because we know the size of each request, but it will become memory-hungry as the request size grows.
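As a rough sketch of that idea (the callback and buffer names here are hypothetical, not the current dpmData.py internals), readings accumulate in memory and are flushed once per request:

# Hypothetical sketch of buffering readings in memory before a single write.
# The device/timestamp fields and the output path are placeholders.
import pandas as pd

buffer = []  # grows with the request size, hence the memory concern

def on_reading(device, timestamp, value):
    # Called once per reading; only appends to the in-memory buffer.
    buffer.append({'device': device, 'timestamp': timestamp, 'value': value})

def flush(output_file):
    # One write to disk for the whole request instead of one per reading.
    pd.DataFrame(buffer).to_hdf(output_file, key='data', mode='w')
    buffer.clear()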
This change is affected by a bug in the Python DPM data logger DAQ: the second-to-last block of data is sometimes returned after the final empty data message that signals there are no more data.
Data returned for all devices in the list results in a 17 -43 error, DPM_BAD_DATASOURCE_FORMAT (the fields provided with a data source are incorrect).
We should validate that the request going to the DPM is correct and possibly ask @charlieking65 to verify the request at the DPM.
There's potential for the DAQ loop to never meet its exit condition. Nanny should time out and clean up before the next script runs. I think that 30 minutes is plenty of time for DAQ and is well before the next hour, when a new script is launched.
Use add_entries to add devices in bulk. This should speed up list initialization.
Data files should be written to a month -> day folder structure.
Adding a CLI argument to set start_time will allow users to start DAQ at an arbitrary time for testing, debugging, or regenerating data. Using this in conjunction with --run-once will allow the production of one file.
The really nice thing about the new library is that we can use pip to install things.
https://cdcvs.fnal.gov/redmine/projects/py/wiki
There are some asynchronous programming ideas here that may not be intuitive but the API is meant to be similar to the existing one. Let me know if you need assistance.
TOML seems like a nicer solution for config files than other options.
YAML is ok but the whitespace sensitivity can be annoying.
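For example, a minimal config could be read with the standard-library tomllib (Python 3.11+); the file name and keys below are placeholders, not the project's actual config schema:

# Sketch of reading a TOML config; 'config.toml' and its keys are hypothetical.
import tomllib

with open('config.toml', 'rb') as config_file:  # tomllib requires binary mode
    config = tomllib.load(config_file)

request_list = config.get('device_request_file', 'devices.txt')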
Python typically closes files on exit but there is the potential for leaving a file open if we don't handle exits appropriately. The Pythonic way to handle this is with a context manager.
We should review every instance where a file is opened and make sure that the with keyword is being used so that when the scope is left, the file is closed.
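A minimal example of the pattern we want everywhere (the file name is illustrative):

# The context manager closes the file even if an exception is raised inside the block.
with open('device_request_list.txt') as request_file:
    devices = [line.strip() for line in request_file if line.strip()]
# request_file is closed here, whether or not the loop above raised.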
Despite having access in the past, https://www-bd.fnal.gov/pip3 is no longer publicly available. Our builds will not work until this is changed.
Create a script to iterate over generated data files and verify that they can be opened.
I'm thinking it will be like the h5_dump.py script.
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
from glob import glob
from os import path
import sys

import pandas as pd


def main(output_path):
    # Glob allows the use of the * wildcard to match every HDF5 output file
    h5_outputs = path.join(output_path, '*.h5')
    files = glob(h5_outputs)

    for file in files:
        try:
            # Opening the store read-only is enough to confirm the file parses
            with pd.HDFStore(file, mode='r') as hdf:
                print(f'{file} was successfully read')
        except OSError as error:
            print(f'Could not open {file} with error: {error}')


if __name__ == '__main__':
    main(sys.argv[1])
The datalogger-to-ml command could provide another sub-command, init, that creates a config file with sensible defaults.
The Nanny script is responsible for calling the DAQ script with certain parameters per the L-CAPE project.
This script allows concerns to be separated. Instead of dpmData.py being a do-everything script, it stays generic, and the Nanny script passes arguments to the DAQ script to make it specifically useful for L-CAPE. The idea is that dpmData.py could be useful for other projects in the future.
When that day comes we should move it out into its own project.
The Nanny script should be responsible for generating the device requests list and naming the output files. It can be responsible for coordinating runs of the DAQ script also, although the current plan is to use a CronJob.
The old data collector script should know when to stop collecting data.
Further discussion is needed on handling scenarios where intermittent errors stop the script from collecting older data.
I instinctively used positional arguments with dump and it didn't work.
If the script fails to generate a data file, the next run of the script only generates a data file for the previous hour. This has happened enough times that we are currently 9 hours behind.
I propose having nanny.py generate files from the calculated start time, i.e. the end time of the last file found, until now.
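A sketch of that idea, assuming hourly files whose names end with an ISO-like timestamp; the file-naming convention here is an assumption, not the one nanny.py currently uses:

# Hypothetical sketch: find the end time of the last file and yield hourly
# windows from there until now, so missed hours get regenerated.
from datetime import datetime, timedelta
from pathlib import Path

def missing_hours(output_dir, default_start):
    files = sorted(Path(output_dir).glob('*.h5'))
    if files:
        # Assumes file stems end with 'YYYY-MM-DDTHH' (an assumed convention).
        last_end = datetime.strptime(files[-1].stem[-13:], '%Y-%m-%dT%H')
    else:
        last_end = default_start
    start = last_end
    while start + timedelta(hours=1) <= datetime.now():
        yield start, start + timedelta(hours=1)
        start += timedelta(hours=1)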
Consider:
if path.exists(output_file):
    os.remove(output_file)

with pd.HDFStore(output_file) as hdf:
    ...
We remove an existing file because otherwise we would append to the existing file. If an existing process holds the output_file open and a new process attempts to remove it, the removal should fail. Consider failing gracefully.
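One way to fail gracefully, continuing the fragment above (output_file, os, and path are taken from that snippet):

# Sketch: skip this run instead of crashing if the file cannot be removed.
try:
    if path.exists(output_file):
        os.remove(output_file)
except OSError as error:
    print(f'Could not remove {output_file}: {error}; skipping this run')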
For some reason, when the build process runs pip install --extra-index-url https://www-bd.fnal.gov/pip3 -r requirements.txt, it fails with the error ERROR: Could not find a version that satisfies the requirement acsys==0.10.0 (from versions: none). (This file version exists in the package repository.)
We should understand what pip is looking for to determine the version number. I thought it was the file name, which includes the version number, but maybe it's unpacking each package to find the version number?
The current log for the deployed application is 2.2 GB. We should cap it at something more reasonable.
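One standard-library option is a rotating handler; the log file name and size limits below are only suggestions:

# Sketch: cap the log at ~10 MB with a few backups instead of letting it grow to GBs.
import logging
from logging.handlers import RotatingFileHandler

handler = RotatingFileHandler('datalogger.log', maxBytes=10_000_000, backupCount=3)
logging.basicConfig(level=logging.INFO, handlers=[handler])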
The README should tell users and developers how to install all the requirements.
@jasonstjohn reports that files are not being produced after restarting the nanny with a specified start date in the config file.
Running it "in hand" revealed that the nanny was attempting to read the files in the output directory, which I thought was incorrect given that we gave it a strict start time.
I realize now that it should consider the files to deduce which time to start.
Through the DPM updates and progress in reliability at every part of the system, we believe that things are more reliable now. We should consider rebuilding the files that we have to ensure consistency.
https://docs.python.org/3/library/pathlib.html
pathlib provides a Path class instead of the plain path strings used by os.path.
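For example, the month -> day folder structure mentioned earlier falls out naturally; the output root and layout below are illustrative assumptions:

# Sketch: build <root>/<YYYY-MM>/<DD>/ with pathlib.
from datetime import datetime
from pathlib import Path

def output_dir(root, when=None):
    when = when or datetime.now()
    directory = Path(root) / when.strftime('%Y-%m') / when.strftime('%d')
    directory.mkdir(parents=True, exist_ok=True)
    return directory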
The old data collector script should support older versions of the device request list and handle different versions of the h5 files.
Tests were prompted by missing basic functionality: python dpmData.py --help ran the application rather than printing the help page.
Other basic functionality tests should be written to prevent regressions.
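A minimal regression test for the --help case could shell out to the script (pytest-style; the expected "usage" text and the 30-second timeout are assumptions):

# Sketch of a basic-functionality test: --help should print usage and exit 0,
# not start data acquisition.
import subprocess
import sys

def test_help_prints_usage():
    result = subprocess.run(
        [sys.executable, 'dpmData.py', '--help'],
        capture_output=True, text=True, timeout=30
    )
    assert result.returncode == 0
    assert 'usage' in result.stdout.lower()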
We want to specify that the final file be written to two destinations: one archived to tape (but not readily readable), and one readily readable (but not as robustly backed up).
Distinct functional parts of the code should be timed to determine where optimizations may be made.
timeit seems to be the accepted method for profiling.
https://docs.python.org/3/library/timeit.html
Document the results in the README.
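A sketch of timing one functional part with timeit; the function being timed is a placeholder, not real project code:

# Sketch: time a single part of the pipeline; build_request_list is hypothetical.
import timeit

def build_request_list():
    # Placeholder for the real work we want to time.
    return [f'device_{index}' for index in range(1000)]

elapsed = timeit.timeit(build_request_list, number=10)
print(f'build_request_list: {elapsed / 10:.4f} s per call')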
When our code successfully builds, GitHub can automatically deploy it to our system.
This script needs to know what data request version to use and will look at existing files to determine what date to end collection.
I proposed that we make requests incrementally until we get errors suggesting that there are no more data.
If no files for this version exist then this script would collect data from now back as far as it can go.
PNFS is sensitive to modifications, and the default for pandas.HDFStore(<file>) (which we use to validate HDF parsing) is to open in append mode. We should always read from PNFS in 'r', read-only mode.
We need a tape backup strategy. Yujun Wu will be a good resource.
I imagine the process of using this project would be as follows:
In response to a ticket about the pnfs directory not responding, the service desk says
The situation usually happens when you or other people open a file under /pnfs and try to write into it. /pnfs is not a full POSIX filesystem. Once a file is in /pnfs, it shouldn't be modified. You can delete a file and re-copy/upload it, but not modify it directly.
So, we should have the nanny script write to the local directory and move the file to the given location.
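A sketch of the write-then-move flow (the paths and the helper name are placeholders):

# Sketch: write the HDF5 file locally, then move it into /pnfs in one step,
# so the file is never modified in place under /pnfs.
import shutil
from pathlib import Path

def publish(local_file, pnfs_dir):
    destination = Path(pnfs_dir) / Path(local_file).name
    shutil.move(str(local_file), str(destination))
    return destination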
We should provide stats on each run for troubleshooting purposes. These don't need to be versioned so we included them in the .gitignore. I think we should output the stats of each run to a file named by its date-time. We should also have a cleanup function that only allows a week of stats to persist.
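A sketch of the stats naming and the week-long retention; the stats/ directory and the JSON format are assumptions:

# Sketch: one stats file per run, named by date-time, and prune anything older
# than a week.
import json
import time
from datetime import datetime
from pathlib import Path

STATS_DIR = Path('stats')

def write_stats(stats):
    STATS_DIR.mkdir(exist_ok=True)
    name = datetime.now().strftime('%Y-%m-%dT%H%M%S') + '.json'
    (STATS_DIR / name).write_text(json.dumps(stats, indent=2))

def cleanup_stats(max_age_days=7):
    cutoff = time.time() - max_age_days * 24 * 3600
    for stats_file in STATS_DIR.glob('*.json'):
        if stats_file.stat().st_mtime < cutoff:
            stats_file.unlink()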
Things to time:
Things to log:
We should gracefully handle the case where someone kills the process or presses ctrl-c. Ideally, we break the DAQ loop and output logs for the data we have.
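A sketch of that pattern; the loop body and the output call are placeholders for the real DAQ loop:

# Sketch: translate SIGTERM into KeyboardInterrupt so `kill` and ctrl-c take the
# same clean-up path; the except block stands in for the real output/logging.
import signal

def raise_keyboard_interrupt(signum, frame):
    raise KeyboardInterrupt

signal.signal(signal.SIGTERM, raise_keyboard_interrupt)

try:
    while True:
        pass  # placeholder for the DAQ loop
except KeyboardInterrupt:
    print('Interrupted; flushing collected data before exit')
    # write_output(...)  # hypothetical call to output the data we have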
Check out argparse for documentation.
Variables to parameterize: the options will be START_TIME and END_TIME, or DURATION.
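A sketch of the argparse wiring for those options; --run-once matches the flag mentioned earlier, while the other flag names and types are assumptions:

# Sketch: parse START_TIME plus either END_TIME or DURATION, and --run-once.
import argparse

parser = argparse.ArgumentParser(description='DAQ from the AD Controls data logger')
parser.add_argument('--start-time', help='ISO date-time to start DAQ from')
group = parser.add_mutually_exclusive_group()
group.add_argument('--end-time', help='ISO date-time to stop DAQ at')
group.add_argument('--duration', type=float, help='length of the request in seconds')
parser.add_argument('--run-once', action='store_true', help='produce one file and exit')
args = parser.parse_args()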