
License: GNU Lesser General Public License v3.0


hltd's Introduction

hltd

The Fcube hlt daemon

Documentation links:

https://twiki.cern.ch/twiki/bin/view/CMS/FCubeMainPage

https://twiki.cern.ch/twiki/bin/view/CMS/FileBasedEvfHLTDaemon

https://twiki.cern.ch/twiki/bin/view/CMS/FFFConfigurationPlan

Building:

On a (CC7) build machine, the prerequisite packages need to be installed:

yum install -y python-devel libcap-devel rpm-build python-six python-setuptools

Note: the Python 3.4 equivalent is:

yum install -y python34-devel libcap-devel rpm-build python34-six python34-setuptools

Building the hltd library RPM:

scripts/libhltdrpm.sh

Building the hltd executable RPM:

scripts/hltdrpm.sh

Optionally, to read parameters only from the cache:

scripts/hltdrpm.sh --batch # or -b

fffmeta RPM is now merged with hltd RPM and should no longer be built or installed.

Note: provide the param cache file containing the last used values as the last command line parameter. If it does not exist, the file will be created. "scripts/paramcache.template" is available with default values (note that you need to provide the correct password). If no name is given to the script, the default name is "paramcache". The "env":"vm" parameter value is now obsolete; "prod" covers all use cases.

hltd's People

Contributors

danduggan, emeschi, mommsen, smorovic, zazasa


hltd's Issues

Dual BU mounts not properly unmounted

When using more than one mount point on the FUs for the BU disks, hltd only unmounts the first mount point. It then fails to remount the 2nd mount point:

INFO:2014-08-26 18:59:59 - cleanup_mountpoints: found following mount points
INFO:2014-08-26 18:59:59 - ['/fff/BU0']
INFO:2014-08-26 18:59:59 - trying umount of /fff/BU0
INFO:2014-08-26 18:59:59 - found BU to mount at bu-c2e18-27-01.daq2fus1v0.cms
INFO:2014-08-26 18:59:59 - trying to mount bu-c2e18-27-01.daq2fus1v0.cms:/ /fff/BU0/ramdisk
INFO:2014-08-26 18:59:59 - trying to mount bu-c2e18-27-01.daq2fus1v0.cms: /fff/BU0/output
INFO:2014-08-26 18:59:59 - found BU to mount at bu-c2e18-27-01.daq2fus1v1.cms
INFO:2014-08-26 18:59:59 - trying to mount bu-c2e18-27-01.daq2fus1v1.cms:/ /fff/BU1/ramdisk
ERROR:2014-08-26 18:59:59 - Command '['mount', '-t', 'nfs4', '-o', 'rw,noatime,vers=4,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,noac', 'bu-c2e18-27-01.daq2fus1v1.cms:/fff/ramdisk', '/fff/BU1/ramdisk']' returned non-zero exit status 32
Traceback (most recent call last):
File "/opt/hltd/python/hltd.py", line 185, in cleanup_mountpoints
os.path.join('/'+conf.bu_base_dir+str(i),conf.ramdisk_subdirectory)]
File "/usr/lib64/python2.6/subprocess.py", line 505, in check_call
raise CalledProcessError(retcode, cmd)
CalledProcessError: Command '['mount', '-t', 'nfs4', '-o', 'rw,noatime,vers=4,rsize=65536,wsize=65536,namlen=255,hard,proto=tcp,timeo=600,retrans=2,sec=sys,noac', 'bu-c2e18-27-01.daq2fus1v1.cms:/fff/ramdisk', '/fff/BU1/ramdisk']' returned non-zero exit status 32
CRITICAL:2014-08-26 18:59:59 - Unable to mount ramdisk - exiting.

Allow to stop/unmount FUs for a given BU

It would be useful to be able to tell the FUs of a given BU to unmount its disks. The use case: if you need to reboot a BU, you first need to unmount the BU disks on the FUs to avoid stale NFS mounts. It would be nice if the hltd on the BU could take care of that, either via a dedicated command (e.g. 'hltd umount') or by shutting down the hltd on the BU.

Remi

Uniform log system

It could be useful to define some rules about log files.
There are 2 main questions to settle:

1- Log destination (I suggest /var/log/hltd/)
2- Format:

In my python script I used the logging library with this configuration:

import logging

logging.basicConfig(filename="/tmp/anelastic.log",
                    level=logging.INFO,
                    format='%(levelname)s:%(asctime)s-%(name)s.%(funcName)s - %(message)s',
                    datefmt='%Y-%m-%d %H:%M:%S')

and each class creates its own logger:

def __init__(self):
 self.logger = logging.getLogger(self.__class__.__name__)

This results in entries like the following in the log file (type-date-class-function-message):

INFO:2014-03-19 17:51:44-LumiSectionHandler.processDATFile - *message*

multiple instances of hltd service

There are requirements to save on hardware resources by running, for example, multiple minidaq systems on common hardware. Since hltd is a system service, it does not support instantiation in its current form.

This schema is proposed:

  • Running multiple instances of hltd is possible on the BU. An FU only runs one instance at a time.
  • Instances are set in an input file (later possibly the HwCfg database) shipped with the meta rpm, which takes care of setting up the software.
  • The list of installed instances is written to /etc/hltd.instances by the meta rpm.
  • The "main" instance, if installed, keeps using the same structure as the current version of hltd.
  • Other instances use a modified config file, /etc/hltd-$instance.conf, and their logs go to /var/log/hltd/$instance. The pid and lock files also get separate names (see the path-resolution sketch below).
  • The port is modified (in the range [9000,9010]) based on the input file.
  • A non-main instance reads from a ramdisk subdirectory which is a loopback file system mounting an image file of predefined size. In this way, the ramdisk quota of each instance is enforced.
    - FUs have the instance name specified in hltd.conf and read input from the remote ramdisk subdirectory. Boxinfo files are written to the same subdirectory. Output is written either to the normal BU NFS output or to a subdirectory of the NFS mount (this is not yet decided).
  • Apart from the differences described above, FUs behave like a main instance. A single FU is assigned to one and only one BU hltd instance, also as described in the input config file.
  • Neither the local nor the central elasticsearch clusters are divided into instances. This assumes no duplicate run numbers between the systems (i.e. they use the same RunInfo DB).
  • Init scripts need improvement to allow starting/stopping a selected instance.

Experimental support for this feature is implemented here:
https://github.com/smorovic/hltd/tree/multi-instances%2Bcloud
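
For illustration, a minimal Python sketch of how an instance-aware daemon could resolve its per-instance paths and port. The exact format of /etc/hltd.instances and the port assignment rule are assumptions, not the implemented scheme:

import os

INSTANCE_FILE = '/etc/hltd.instances'   # written by the meta rpm (assumed format: one name per line)

def read_instances(path=INSTANCE_FILE):
    if not os.path.exists(path):
        return ['main']
    with open(path) as f:
        return [ln.strip() for ln in f if ln.strip()]

def instance_paths(name, index):
    # "main" keeps the current layout; other instances get suffixed config/log files
    if name == 'main':
        return '/etc/hltd.conf', '/var/log/hltd', 9000
    return ('/etc/hltd-%s.conf' % name,
            os.path.join('/var/log/hltd', name),
            9000 + index)               # kept within the proposed [9000,9010] range

for i, inst in enumerate(read_instances()):
    conf, logdir, port = instance_paths(inst, i)
    print(inst, conf, logdir, port)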

enable switch in hltd conf

The hltd service should not be started until it has been configured by the fffmeta rpm. A switch will be added that disables the service by default until it is modified by the configuration script in the meta package.

HLT and cloud switchover interface

Availability of HLT nodes for the cloud requires cooperation between the mechanism for starting CMSSW jobs built into hltd and a service which runs VM instances on the same machines.

An external tool, possibly integrated with the LevelZero FM, will be used to control which FU nodes should stop HLT and switch to cloud mode. The tool will then directly contact hltd on those nodes using the cgi interface.

From hltd version 1.6.0, an API is available for taking an FU out of HLT. Activation works similarly to new-run notifications: a cgi script creates a file in the watch directory (standard location '/fff/data'). The request needs to be sent to hltd (by default on port 9000) using a URL of the following form: http://host:9000/cgi-bin/exclude_cgi.py
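
For example, the exclusion request could be issued with any HTTP client; a minimal sketch using the Python 3 standard library (the host name is a placeholder):

import urllib.request

# ask hltd on an FU (default port 9000) to take the node out of HLT
url = 'http://fu-hostname:9000/cgi-bin/exclude_cgi.py'
with urllib.request.urlopen(url, timeout=10) as resp:
    print(resp.getcode(), resp.read())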

FUs will then determine the last lumisection completed on the BU and signal the CMSSW processes to finish within two additional lumisections. During this time, hltd switches into "activatingCloud" mode and stops accepting new run start events from the BU hltd. The available resources are masked in the box file so that the BU stops requesting data for machines in switch-over mode. Once the CMSSW jobs and local merge scripts have finished, all core resources are moved to /etc/appliance/resources/cloud and the FU finally switches to "cloud" mode, at which point virtual machines can run.

Since a recent conclusion was to activate VM startup through hltd, one possibility is to run a script which signals the local cloud service ("cloud tool") that VMs can be started (once the CMSSW processes have finished).

An "include" interface is also provided in hltd 1.6.0. However, currently this only returns core files in their usual place and allows hltd to accept new HLT runs.

I propose to modify this interface so that it executes a script/command telling the cloud tool to stop VMs ("include_cgi.py") before the switchover to HLT is completed. The script called by hltd can be synchronous, i.e. it returns only when the VMs are shut down (note that the cgi calls themselves remain asynchronous: they create a file which triggers an action in hltd and return immediately).

In addition, hltd will update a file providing the name of the mode it is currently in. This file could be polled for mode changes by the cloud service (if necessary). I propose the file location "/fff/data/mode", with the following modes possible: "HLT", "activatingCloud", "deactivatingCloud", "cloud".
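
A sketch of how the cloud service could poll the proposed mode file; the file location and mode names are the ones proposed above:

import time

MODE_FILE = '/fff/data/mode'
MODES = ('HLT', 'activatingCloud', 'deactivatingCloud', 'cloud')

def watch_mode(interval=5):
    last = None
    while True:
        try:
            with open(MODE_FILE) as f:
                mode = f.read().strip()
        except IOError:
            mode = None
        if mode != last and mode in MODES:
            print('hltd switched to mode', mode)   # the cloud service would react here
            last = mode
        time.sleep(interval)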

  • Monitoring: the hltd mode can be monitored through elasticsearch. The mode can be written to the box info files which are filled into the central ES index (every 5 seconds per machine). In addition, a separate monitoring chain could be implemented through cloud services by polling the content of the "/fff/data/mode" file.
  • Handling of runs during cloud mode: new run requests will be ignored, but hltd can cache the last such request to allow joining the ongoing run once cloud mode is switched off. Otherwise the FU would not join HLT until the next run is started after switching back.

Resource assignment to multi-threaded CMSSW processes

At CMSSW process startup, a single CPU core resource is currently assigned to a process. Changes are required to configure how many cores are assigned to a single multi-threaded process and to set the CMSSW multi-threading level correspondingly.

This can be done by waiting for the appropriate number of CPU resource core files to appear in the "idle" directory, moving them to "online", and spawning a process with the appropriate "number of threads" parameter passed on the cmsRun command line. The parameter modifies the process options in the CMSSW python configuration, setting numberOfStreams and numberOfThreads. The default will be 1 thread/stream (single-threaded behavior).
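
A rough sketch of that logic. The resource directory paths and the name of the cmsRun thread option are assumptions (only /etc/appliance/resources/cloud is named explicitly elsewhere in this document):

import os
import shutil
import subprocess

IDLE = '/etc/appliance/resources/idle'      # assumed location of idle core files
ONLINE = '/etc/appliance/resources/online'  # assumed location of cores assigned to running jobs

def try_start_process(nthreads, cmsrun_args):
    cores = sorted(os.listdir(IDLE))
    if len(cores) < nthreads:
        return None                         # not enough idle cores yet; retry on the next event
    taken = cores[:nthreads]
    for core in taken:
        shutil.move(os.path.join(IDLE, core), os.path.join(ONLINE, core))
    # the option name is hypothetical; it would set numberOfThreads/numberOfStreams in the cfg
    cmd = ['cmsRun'] + list(cmsrun_args) + ['nThreads=%d' % nthreads]
    return subprocess.Popen(cmd), taken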

Output adler32 checksums

A pending CMSSW release will add a checksum field to the output json and a ".checksum" field for the micro-merged dat file, which will be included in the merged json file.
With output_adler32=true in the [General] section of hltd.conf, the anelastic service will verify this checksum against the memory buffer used to move the file to the output directory.
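
A minimal sketch of that verification, assuming the expected checksum has already been read from the output json:

import zlib

def adler32_matches(buf, expected):
    # recompute adler32 on the in-memory buffer before the file is moved to the output directory
    return (zlib.adler32(buf) & 0xffffffff) == expected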

Using templates for dynamic creation of indices in appliance ES cluster

In some cases index creation fails and the index is later created dynamically when the first document is indexed.
This creates the index without the necessary mapping and settings.
By specifying a default template, all of this can be taken care of by elasticsearch automatically whenever an index is created. FU IP allocation still needs to be applied later because it is specific to the machine which indexes the document.
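
For illustration, a default template could be registered roughly as follows (Python 3 standard library; the index pattern, settings and mappings shown are placeholders, and the "template" key corresponds to the elasticsearch 1.x template API):

import json
import urllib.request

template = {
    "template": "run*",                     # indices matching this pattern get the settings below
    "settings": {"number_of_shards": 1},
    "mappings": {}                          # the actual hltd mappings would go here
}

req = urllib.request.Request(
    'http://localhost:9200/_template/hltd_default',
    data=json.dumps(template).encode(),
    headers={'Content-Type': 'application/json'},
    method='PUT')
urllib.request.urlopen(req)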

Prevent hltd from starting on zombie FUs

After yesterday's hltd update, 3 FUs which should not be used came back alive and created box info files. The hltd on the BU failed to properly talk to them. One should devise a way to inhibit the start of hltd on "blacklisted" FUs.

Logging improvements

Logging of hltd and spawned scripts is managed using logrotate. All logs from the hltd package are moved to the /var/log/hltd directory.
stdout and stderr are captured by the logger in both the main hltd and the spawned scripts.

HLT logs are now found in /var/log/hltd/pid and have the run number appended, for easier distinction between job logs from different runs.

Resource summary file on BU

Presently, information on the state of CPU resource usage is available through box info files updated by each FU in the ramdisk. It was proposed that the BU hltd should instead summarize this into a number of available resources and provide it to consumers (the BU application).

In the updated version, a JSON file /fff/ramdisk/appliance/resource_summary is written, which also contains other summarized information (taking care that it is taken only from box files updated within the last 10s). For example (a consumer-side sketch follows the field descriptions below):
{
  "ramdisk_occupancy": 0.32000000000000001,
  "active_resources": 1,
  "activeFURun": 127042,
  "activeRunNumQueuedLS": 0,
  "broken": 0,
  "idle": 0,
  "used": 1,
  "cloud": 0
}

  • ramdisk_occupancy is the ratio between the used and total size of the ramdisk partition
  • active_resources - sum of idle and used resources on FUs
  • activeFURun: most recent run found across all active_runs boxinfo files
  • activeRunNumQueuedLS - worst-case number of lumisections of data sitting in the anelastic.py queue on FUs.
    This is the number of EoLS files found in the anelastic.py queue, which is used to store inotify file events before they are handled by the script. A high value can indicate problems with disk IO or NFS file copying to the BU. The value is -1 if there is no active FU run or the script is not initialized yet. The value is only taken from FUs whose last active run matches the one indicated in the summary.
  • broken/idle/used/cloud summarize core resources in more detail
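
A consumer-side sketch, assuming the file layout shown in the example above:

import json

with open('/fff/ramdisk/appliance/resource_summary') as f:
    summary = json.load(f)

print('active resources:', summary['active_resources'])
print('ramdisk occupancy: %.0f%%' % (100 * summary['ramdisk_occupancy']))
if summary['activeRunNumQueuedLS'] > 10:   # threshold chosen arbitrarily for illustration
    print('FUs are queueing lumisections for run', summary['activeFURun'])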

global elasticsearch run index

For aggregated information on runs, and for monitoring that is not run dependent, we are introducing an elasticsearch index stored on a separate cluster.
At present it collects information on: run directory creation and EoR file appearance on the BU, system monitoring of the BU and FUs, and end-of-lumi appearance on the BU.

All documents are tagged with a per-BU (or in some cases per-FU) id so that information can be tracked per appliance or box.

The index is created and/or filled from hltd on the BU. A dedicated configuration parameter, elastic_runindex_url, has been added to hltd.conf.
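
For illustration only, a document could be pushed to the configured cluster roughly like this (Python 3 standard library; the index name, document type and fields are placeholders, not hltd's actual schema):

import json
import urllib.request

elastic_runindex_url = 'http://runindex-cluster:9200'   # value of elastic_runindex_url from hltd.conf

doc = {'runNumber': 127042, 'appliance': 'bu-c2e18-27-01', 'event': 'EoR'}
req = urllib.request.Request(
    elastic_runindex_url + '/runindex/doc/',
    data=json.dumps(doc).encode(),
    headers={'Content-Type': 'application/json'})        # POST; id assigned by elasticsearch
urllib.request.urlopen(req)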

Port to more efficient inotify python library

The currently used pyinotify library is not very performant.
hltd could be ported to a different library, python-inotify.
A blog post by its author outlines the performance problems seen with pyinotify:
http://www.serpentine.com/blog/2008/01/04/why-you-should-not-use-pyinotify/comment-page-1/

In tests it was observed that, with a large number of files written and deleted (~10k/sec), pyinotify uses approximately 50% of a CPU, while the equivalent python-inotify code used less than 20%. This was done with the hltd code, with a return placed at the point where events collected by the library would be processed by hltd (anelastic.py).
The only addition needed on top of the new library is a lightweight wrapper (running a thread waiting for events), as the library is more low-level than pyinotify.

The license of python-inotify is LGPLv2, so it is fine to use from the legal standpoint.

Note: the new library had a memory leak in its C bindings, which was found and fixed.

FFF configuration meta rpm

To automatically configure the DAQ cluster, a separate "meta" rpm package has been created. It depends on hltd and elasticsearch, and will trigger reconfiguration when those are updated.

The rpm build and integrated configuration scripts are presently found in the hltd git repository. Versioning of the rpm follows the hltd versioning scheme.
The script contained in the rpm can detect whether it is executed on a BU or an FU, and on which cluster (daqval, prod) - presently this is based on hostname naming conventions. On FUs, the script connects to the HWCfg DB to retrieve the DNS name of the BU data interface used for mounting the NFS ramdisk/output area. This is currently only supported on daqval until the production HWCfg DB is ready.

The package also sets up the elasticsearch parameters accordingly: an appliance cluster where the BU is a master without data and the FUs are slaves with data. Unicast is used for discovery between master and slaves in the appliance.

With the package, the full cluster should be properly configured without manual intervention. For this, the requirement is that the BU machine is booted when the script runs on an FU. Also, the equipment set with the proper information must be present in the database, or configuration will fail.

In addition, the package includes an init script which runs the configuration script at each boot ('refresh'), prior to starting the hltd and elasticsearch services.

The package is built by running scripts/metarpm.sh

CMSSW log collection

CMSSW stderr and stdout are redirected to log files located in /var/log/hltd/pid.
A new script has been developed to scan this directory and parse the output of appearing log files. The script is started as a child process by hltd.

Messages are parsed into json documents which contain the category, severity, module name, instance, function call, framework state, timestamp and message content of the logs. Recognized messages are those produced by the MessageLogger (DEBUG, INFO, WARNING, ERROR and FATAL) as well as stack traces from a crash (considered FATAL). Framework report information is currently ignored.

Messages are also scanned to calculate a "lexicalId" hash which can be used, for example, for rate reduction of similar messages later in the chain.
Depending on the es_cmssw_log_level parameter in hltd.conf, a threshold is set for the minimum log level to store in elasticsearch.
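
A hypothetical example of one parsed document with the fields listed above (field names and values are illustrative only, not the exact schema used by the script):

parsed_log_document = {
    "severity": "ERROR",
    "category": "FedRawDataInputSource",      # MessageLogger category
    "module": "hltGetRaw",
    "instance": "pid23117",
    "fn": "getNextEvent",                     # function call
    "fwkState": "Running",                    # framework state
    "@timestamp": "2014-08-26T18:59:59",
    "message": "...",
    "lexicalId": "0x3a91bc",                  # hash used for rate reduction of similar messages
}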

Presently the same index as for other run-related information is used, however this can be changed if necessary.

A tool "test/logprint.py" is also provided, doing time-window queries in elasticsearch and printing collected messages in a way similar to the Handsaw tool.

Micro-merging by hltd

Currently micro-merging is done by each CMSSW process appending its output to the merged file at the CMSSW end-of-lumi event. As events are not buffered by the process but written to disk during the lumisection, this requires reading the data back from disk and writing it to the merged file, which creates additional disk I/O.

This step could be skipped by delegating the merging to the hltd esCopy function. Internally it opens a file for writing and appends the contents of the single merged file from the local disk. Instead, the function could be modified to read the contents of multiple files and merge them to the BU mount point on the fly (similar to option "B" in the mini/macro merger).
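
A sketch of the modified behaviour, appending several per-process files directly to the file on the BU mount point. Function and parameter names are placeholders, not the actual esCopy code:

import shutil

def merge_to_bu(input_files, bu_output_path, chunk=4 * 1024 * 1024):
    # append each per-process dat file straight to the merged file on the BU NFS mount,
    # skipping the intermediate locally merged file and the read-back it requires
    with open(bu_output_path, 'ab') as out:
        for path in input_files:
            with open(path, 'rb') as src:
                shutil.copyfileobj(src, out, chunk)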

Catch file move exception and retry

Fix for a problem occasionally seen in daqval where a file move to the output directory would fail. The fix captures the thrown exception and retries.
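
A sketch of such a retry wrapper, assuming shutil.move is the call that fails:

import logging
import shutil
import time

def move_with_retry(src, dst, attempts=3, delay=0.5):
    for attempt in range(1, attempts + 1):
        try:
            shutil.move(src, dst)
            return True
        except (IOError, OSError):
            logging.exception('move %s -> %s failed (attempt %d)', src, dst, attempt)
            time.sleep(delay)
    return False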

Monitoring HLT rates

New development in CMSSW enables output of HLT path rates in json files (with a jsd definition). This requires implementing the parsing of these files and their insertion into elasticsearch by the elastic monitor on the FUs.

Merged box info for appliance

To reduce the number of box info documents, FU information is aggregated and injected into the central ES as part of the BU boxinfo document. An additional field is added (a mapping update is needed for it to become effective) listing the hosts from which the information is collected. Additionally, a unique "boxinfo_last" document can be switched on for each BU; this document is replaced each time an update is made from the same BU.

Properly handle crashed CMSSW processes

Handle leftover files from crashed CMSSW processes. The meta-information should enable the mergers to handle the rest of the successfully processed events. The input file of the crashed processes should be put into a special place, to be defined.

avoid message "storms" in logCollector.py

In the case where the HLT generates a very large number of error log messages, this can bring ES into a state where it is no longer able to handle the transactions. As a result, the appliance cluster stops responding to other requests and the logCollector accumulates messages in memory (resident sizes over 2GB have been seen). We need to implement a mechanism to a) stop logging the same message when it repeats more than n times and b) drop log messages if we get an error on the transaction. It might also be advisable to handle log messages with a bulk insert to minimise the impact on cpu/memory.
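
A sketch of point a), suppressing a message once its lexicalId has been seen more than n times (the counter reset policy and the bulk insert of point b) are left out; names are placeholders):

from collections import defaultdict

class RepeatSuppressor:
    def __init__(self, max_repeats=100):
        self.max_repeats = max_repeats
        self.counts = defaultdict(int)

    def should_index(self, lexical_id):
        # index the first n occurrences of a message, drop the rest of the "storm"
        self.counts[lexical_id] += 1
        return self.counts[lexical_id] <= self.max_repeats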

Make complete checks part of the river task and store the result

Since the complete check at all levels of merging is information we will want to query constantly and routinely, it makes sense to calculate the complete flags (or percentages) for all streams systematically at the time we run the collection of other statistics, in the river plugin. This will at the same time lighten the task of the GUI and/or the server by providing pre-calculated information. For clarity, we might want to run the complete check in a separate river plugin since, unlike the microstate-and-stream-rate plugin, it will have to access the central server for both read and write.

streamError

We need to create "fake stream" json files that store information about events not processed due to process crashes.

To allow the merger group to handle this stream as closely to a normal stream as possible, we need to generate the following files:

A json file for each LS containing the number of events processed, the number of events not processed and the list of unprocessed raw files (a sketch is given below).

A proper .INI file.
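
A hypothetical per-lumisection json document for such an error stream (field names are placeholders; the actual ones would be agreed with the merger group):

stream_error_ls = {
    "processed": 1130,                                # events successfully processed in this LS
    "errorEvents": 24,                                # events lost due to process crashes
    "rawFiles": ["run127042_ls0042_index000021.raw",
                 "run127042_ls0042_index000037.raw"]  # raw input files left unprocessed
}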
