alphatwirl / alphatwirl

A Python library for summarizing event data into multivariate categorical data

License: BSD 3-Clause "New" or "Revised" License

data-analysis root-cern pandas r data-frame alphatwirl

alphatwirl's Introduction




A Python library for summarizing event data into multivariate categorical data

Description

AlphaTwirl is a Python library that summarizes event data into multivariate categorical data as data frames. Event data, input to AlphaTwirl, are data with one entry (or row) for one event: for example, data in ROOT TTrees with one entry per collision event of an LHC experiment at CERN. Event data are often large—too large to be loaded in memory—because they have as many entries as events. Multivariate categorical data, the output of AlphaTwirl, have one row for one category. They are usually small—small enough to be loaded in memory—because they only have as many rows as categories. Users can, for example, import them as data frames into R and pandas, which usually load all data in memory, and can perform categorical data analyses with a rich set of data operations available in R and pandas.
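The flavour of this reduction can be illustrated with plain pandas (a conceptual illustration of the input and output shapes only, not the AlphaTwirl API; the variables are made up):

import pandas as pd

# Event data: one row per event.
events = pd.DataFrame({
    'njets': [2, 3, 2, 5, 3, 2],
    'met':   [30.0, 120.0, 45.0, 210.0, 95.0, 15.0],
})

# Multivariate categorical data: one row per (njets, met-bin) category.
met_bin = pd.cut(events['met'], bins=[0, 50, 100, 200, 1000])
summary = events.groupby(['njets', met_bin], observed=True).size().rename('n').reset_index()
print(summary)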


Quick start

The quick start can be launched on Binder.


Publication


License

  • AlphaTwirl is licensed under the BSD license.

Contact

alphatwirl's People

Contributors

benkrikler, kreczko, taisakuma


alphatwirl's Issues

update test for run.py

  • to test SIGINT is actually ignored:

    import signal
    signal.signal(signal.SIGINT, signal.SIG_IGN)

  • to test log level and handler are in effect:

    import gzip
    import json
    import logging
    import os
    import sys

    def setup_logging():
        path = 'logging_levels.json.gz'
        if not os.path.isfile(path):
            return
        with gzip.GzipFile(path, 'r') as f:
            loglevel_dict = json.loads(f.read().decode('utf-8'))
        for name, level in loglevel_dict.items():
            logger = logging.getLogger(name)
            logger.setLevel(level)
        handler = logging.StreamHandler(sys.stdout)
        formatter = logging.Formatter('%(asctime)s - %(name)s - %(levelname)s - %(message)s')
        handler.setFormatter(formatter)
        logging.getLogger('').addHandler(handler)

Unit tests package dependencies not documented

I was trying to understand why the unit tests were failing when I ran pytest, and it turned out to be because I didn't have the pytest-console-scripts package installed. I realised these were needed by looking in the Travis CI config file, but it would be good to list pytest, pytest-mock, pytest-cov, and pytest-console-scripts either in the requirements.txt file or to mention them explicitly in the README.

Add support for SGE batch systems

I've no idea how much work this would involve, but to bring alphatwirl to more sites, support for batch systems beyond HTCondor would be useful for running components in parallel. In particular, Sun Grid Engine (SGE) batch systems, although somewhat old-fashioned, are still common, for example on lxplus or the Imperial College batch system.

don't skip a test for Summarizer for Python 3

@pytest.mark.skipif(sys.version_info >= (3, 0), reason="requires python 2")
def test_to_tuple_list_key_not_tuple(obj):
    obj.add('A', (12, ))  # the keys are not a tuple
    obj.add(2, (20, ))
    assert [(2, 20), ('A', 12)] == obj.to_tuple_list()

It is skipped because int and str cannot be compared in Python 3:

keys_sorted = sorted(self._results.keys())
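One possible way to make the sort work for mixed key types in Python 3 (a sketch, not the current Summarizer code) is to sort on a type-aware key:

keys = [2, 'A']
keys_sorted = sorted(keys, key=lambda k: (type(k).__name__, k))
print(keys_sorted)   # [2, 'A'], because 'int' < 'str'; values are compared only within a type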

ValueError with duplicate datasets

Release version

0.16.0 (the issue is expected to be present in 0.20.0 as well)

Issue

A ValueError is raised at loop/merge.py#L7 when there are duplicate datasets (two or more dataset entries with the same name attribute).

Description

Duplicate datasets can appear by user error. The final duplicate dataset overwrites all the others in loop/EventDatasetReader.py#L74, but loop/EventDatasetReader.py#L68 is filled with runids from all of the duplicates.

When the code reaches loop/merge.py#L7, it tries to access runids from duplicate datasets that have already been overwritten in loop/EventDatasetReader.py#L74.

Resolution

Everything works fine if there are no duplicate datasets.

However, this could be checked or enforced somewhere earlier, for example as sketched below.
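A minimal sketch of such a check (a hypothetical helper, not existing alphatwirl code), assuming each dataset exposes a name attribute as described above:

from collections import Counter

def assert_unique_dataset_names(datasets):
    # Raise early, before any events are read, if two or more datasets share a name.
    counts = Counter(dataset.name for dataset in datasets)
    duplicates = [name for name, n in counts.items() if n > 1]
    if duplicates:
        raise ValueError('duplicate dataset names: {}'.format(duplicates))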

EventLoop doesn't provide Reader a way to terminate execution elegantly

Problem

The code in the EventLoop class handles the reader (typically a composite reader) and provides it with events from the tree(s) by:

  1. Call reader.begin() method
  2. For each event:
    a. Call reader.event()
  3. Call reader.end()

In each case, the return value of the reader's method is ignored. This makes it awkward for a reader to tell the event loop to terminate early, or, in the case of begin(), not to run at all. begin() can of course raise an exception, but in some cases that is more extreme than really needed.

Solution

Have the return value of each call to the reader's begin, event, and end methods checked, and halt execution if the return value does not evaluate to true.
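A minimal sketch of the idea (not the actual EventLoop code; it assumes that None, the implicit return of existing readers, still means "continue" and that only an explicit False stops the loop):

def run_with_early_termination(reader, events):
    # reader.begin() returning False means "do not run at all"
    if reader.begin(events) is False:
        return reader.end()
    for event in events:
        # reader.event() returning False means "stop looping over events"
        if reader.event(event) is False:
            break
    return reader.end()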

Specific example

I'm running over a large heppy dataset and including the existing RA1 AlphaTools sequence as a single Reader. The heppy dataset is split up using AlphaTwirl and submitted to the Bristol HTCondor batch system. One of the datasets in this input is not supposed to be handled by AlphaTools, and AlphaTools knows this, so it would normally just ignore it. However, this instead causes an exception to be raised, crashing the batch job, which is then re-submitted. To let things finish elegantly, I would prefer to be able to signal to the EventLoop, in the begin() method of the reader that wraps AlphaTools, that the job should not proceed.

test gets stuck in python 3.6 in some circumstances

At commit 7405535, the test gets stuck in some circumstances.

example of how to reproduce

in the top directory of alphatwirl,

pytest -s tests

the test starts normally.

============================================ test session starts =============================================
platform darwin -- Python 3.6.3, pytest-3.4.0, py-1.5.2, pluggy-0.6.0
rootdir: #######, inifile:
plugins: mock-1.6.3, cov-2.5.1, console-scripts-0.1.4, hypothesis-3.38.5
collected 619 items                                                                                          

and all tests pass.

tests/unit/summary/test_Summarizer.py .....s....                                                       [  0%]
tests/unit/summary/test_WeightCalculatorOne.py .                                                       [  0%]
tests/unit/summary/test_convert_key_vals_dict_to_tuple_list.py ......                                  [  0%]
tests/unit/summary/test_parse_indices_config.py .

=================================== 579 passed, 40 skipped in 5.09 seconds =================================

but the test doesn't finish; the shell prompt won't appear.

Hitting Ctrl-C gives:

Error in atexit._run_exitfuncs:
Traceback (most recent call last):
  File "/Users/sakuma/anaconda3/envs/py3_6-pd0_20/lib/python3.6/multiprocessing/popen_fork.py", line 29, in poll
    pid, sts = os.waitpid(self.pid, flag)
KeyboardInterrupt
$

the problem doesn't always happen

The same problem doesn't happen, for example, if

  • the -s option is not used:
    pytest tests
  • a trailing slash is given, as in tests/:
    pytest -s tests/

The same problem doesn't happen in Python 2.7:

platform darwin -- Python 2.7.12, pytest-3.3.2, py-1.5.2, pluggy-0.6.0
rootdir: #######, inifile: pytest.ini
plugins: mock-1.6.3, cov-2.5.1, console-scripts-0.1.3

HTCondorJobSubmitter crashes if problem submitting

When trying to run on lxplus, with my working directory under /eos I observed the following traceback:

Traceback (most recent call last):
  File "/afs/cern.ch/user/b/bkrikler/.local/bin/fast_carpenter", line 11, in <module>
    load_entry_point('fast-carpenter', 'console_scripts', 'fast_carpenter')()
  File "/eos/user/b/bkrikler/CHIP/fast-carpenter/fast_carpenter/__main__.py", line 67, in main
    return process.run(datasets, sequence)
  File "/afs/cern.ch/user/b/bkrikler/.local/lib/python2.7/site-packages/atuproot/atuproot_main.py", line 49, in run
    return self._run(loop)
  File "/afs/cern.ch/user/b/bkrikler/.local/lib/python2.7/site-packages/atuproot/atuproot_main.py", line 101, in _run
    result = loop()
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/datasetloop/loop.py", line 55, in __call__
    self.reader.read(dataset)
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/datasetloop/reader.py", line 27, in read
    reader.read(dataset)
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/loop/EventDatasetReader.py", line 66, in read
    runids = self.eventLoopRunner.run_multiple(eventLoops)
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/loop/MPEventLoopRunner.py", line 93, in run_multiple
    return self.communicationChannel.put_multiple(eventLoops)
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/concurrently/CommunicationChannel.py", line 131, in put_multiple
    return self.dropbox.put_multiple(packages)
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/concurrently/TaskPackageDropbox.py", line 60, in put_multiple
    runids = self.dispatcher.run_multiple(self.workingArea, pkgidxs)
  File "/eos/user/b/bkrikler/CHIP/alphatwirl/alphatwirl/concurrently/HTCondorJobSubmitter.py", line 129, in run_multiple
    njobs = int(regex.search(stdout).groups()[0])
AttributeError: 'NoneType' object has no attribute 'groups'

The underlying problem in this case was that HTCondor submission on lxplus doesn't work below /eos, only under /afs; however, the traceback hid this somewhat. It might be helpful if the match object from the regular expression on line 127 were first checked to be valid and not None. If it is None, then the stdout and stderr could be printed and an exception raised. I suspect that would help a user understand more immediately why the jobs weren't submitted.
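A possible guard, sketched here (this is not the current HTCondorJobSubmitter code; regex, stdout, and stderr are the local names used around line 127):

match = regex.search(stdout)
if match is None:
    # Surface the submission output so the real cause (e.g. submitting from
    # /eos on lxplus) is visible instead of an AttributeError.
    raise RuntimeError(
        'could not parse condor_submit output\n'
        'stdout:\n{}\nstderr:\n{}'.format(stdout, stderr)
    )
njobs = int(match.groups()[0])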

RuntimeWarning in travis

# assert '' == ret.stderr
## commented out because of "RuntimeWarning" in travis
## https://travis-ci.org/alphatwirl/alphatwirl/jobs/409237073
## E AssertionError: assert '' == '/home/travis/miniconda/envs...ok( name,
## *args, **kwds )\n'
## E + /home/travis/miniconda/envs/testenv/lib/ROOT.py:301: RuntimeWarning:
## numpy.dtype size changed, may indicate binary incompatibility. Expected
## 96, got 88

https://travis-ci.org/alphatwirl/alphatwirl/jobs/409237073

handle TChain with no files

events = self.build_events()
self.nevents = len(events)
self._report_progress(0)
self.reader.begin(events)

self.reader.begin(events) can fail if events are built from a TChain with no files.

This could happen if no files are verified here:

if self.config['check_files']:
    paths = self._verify_files(paths, self.config['skip_error_files'])
chain = ROOT.TChain(self.config['tree_name'])
for path in paths:
    chain.Add(path)
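One possible guard, sketched here as an assumption rather than existing code, is to check whether any file was actually added before building the events:

# Sketch: fail with a clear message (or skip the dataset) if the chain is empty.
if chain.GetNtrees() == 0:
    raise RuntimeError('no valid files were added to the TChain "{}"'.format(self.config['tree_name']))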

Extend the contents used for DataFrame filenames

This is based on a series of posts first made to Slack. But for the TL;DR:

  • The TableFileNameComposer class currently creates filenames that contain only the binning scheme for the DFs.
  • Sometimes the same binning scheme will be used while other variables are changed to create a DF (e.g., for systematics, where we change the weighting or corrections).
  • I suggest we have TableFileNameComposer.__call__() receive the full DF config specification, so that any generic filename composer one would like to create has the full amount of information available to it. So on this line, you would just pass in ret:
ret['outFileName'] = self.createOutFileName(ret)

In the meantime, I have a hacky solution to my problem, which can be used instead of TableFileNameComposer:

class WithInsertTableFileNameComposer():
    def __init__(self, composer, inserts):
        self.inserts = inserts
        self.composer = composer
        self.frame_idx = 0

    def __call__(self, columnNames, indices, **kwargs):
        this_insert = self.inserts[self.frame_idx]
        suffix = kwargs.get("suffix", self.composer.default_suffix)
        kwargs["suffix"] = "--{}.{}".format(this_insert, suffix)
        self.frame_idx += 1
        return self.composer(columnNames, indices, **kwargs)

change the way to give a progressReporter to a task

At the moment, a progressReporter is given to a task as an argument:

def _run_task(self, package):
    try:
        result = package.task(progressReporter=self.progressReporter, *package.args, **package.kwargs)
    except TypeError:
        result = package.task(*package.args, **package.kwargs)
    return result

This potentially causes a name conflict if a task already has an argument named progressReporter for a different purpose.

Instead of the Worker giving the reporter to a task, the task should find the reporter if it wants one.

A possible solution: a worker registers a reporter in a registry, e.g., a ProgressReporterAgency or ProgressReporterOffice defined in the module progressbar, and a task gets the reporter from the registry.
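A minimal sketch of that idea (the module and function names here are hypothetical, not the alphatwirl API):

# progressbar/registry.py (hypothetical): a module-level registry.
_reporter = None

def register_progress_reporter(reporter):
    # called by the worker once, instead of passing the reporter to every task
    global _reporter
    _reporter = reporter

def find_progress_reporter():
    # called by a task only if it wants a reporter; None if none was registered
    return _reporter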

change the default datasetColumnName to 'dataset'

at either

avoid mutable default

$ find ./ -name \*.py | xargs grep 'def' | grep '\[ *\]'
./parallel/build.py:def build_parallel(parallel_mode, quiet=True, processes=4, user_modules=[ ],
./concurrently/HTCondorJobSubmitter.py:    def __init__(self, job_desc_extra=[ ], job_desc_dict={}):
./selection/modules/with_count.py:    def __init__(self, name='All', selections=[ ]):
./selection/modules/with_count.py:    def __init__(self, name='Any', selections=[ ]):
./selection/modules/basic.py:    def __init__(self, name='All', selections=[ ]):
./selection/modules/basic.py:    def __init__(self, name='Any', selections=[ ]):
./loop/ReaderComposite.py:    def __init__(self, readers=[]):
$ find ./ -name \*.py | xargs grep 'def' | grep '{ *}'
./parallel/build.py:        logger.warning('unknown parallel_mode "{}", use default "{}"'.format(
./concurrently/HTCondorJobSubmitter.py:    def __init__(self, job_desc_extra=[ ], job_desc_dict={}):
./selection/factories/expand.py:def expand_path_cfg(path_cfg, alias_dict={ }, overriding_kargs={ }):

Possibly more: arguments can be listed across multiple lines, so grep may miss some. The usual fix is sketched below.
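For reference, the usual fix is to default to None and create a fresh list inside; the class and attribute names below only illustrate the pattern:

class All(object):
    def __init__(self, name='All', selections=None):
        if selections is None:
            selections = []   # a fresh list per instance, not one list shared by all calls
        self.name = name
        self.selections = selections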

optimize the sending results from worker nodes for data frames

Currently, the package concurrently uses pickle to send the results from the worker nodes.

Pickle files with data frames tend to become large, and loading large pickles is slow.

Develop an alternative implementation specifically optimized for data frames, for example by using dask, so that loading the results at the interactive node is fast.
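One possible direction, sketched as an assumption rather than an existing alphatwirl feature: each worker writes its summary to a columnar file (Parquet here) in the working area, and the interactive node loads them all lazily with dask. The paths and column names are made up for illustration.

import dask.dataframe as dd
import pandas as pd

# Worker side: write the per-task summary instead of pickling it back.
result = pd.DataFrame({'category': ['a', 'b'], 'n': [10, 3]})
result.to_parquet('workingarea/results/task_0001.parquet')

# Interactive node: read all task results lazily and combine them.
results = dd.read_parquet('workingarea/results/*.parquet')
combined = results.groupby('category')['n'].sum().compute()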

update tests for TaskPackageDropbox

  • update tests for TaskPackageDropbox
    • the current tests only cover limited situations
    • what to vary
      • the number of tasks, e.g., 0, 1, or 5
      • the number of fails before success for each task, for example, [0, 0, 0, 0, 0], [0, 0, 2, 0, 0], [0, 1, 1, 0, 1] if the number of tasks is 5.
      • sequences of runs in poll
        • permutation
        • the number of runs in each poll (a parametrization sketch follows this list)
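One possible way to organize these variations (a sketch of a hypothetical test, not the existing test module):

import pytest

# Each list gives the number of failures before success per task, so its
# length is the number of tasks (0, 1, or 5 as suggested above).
@pytest.mark.parametrize('fails_before_success', [
    [],                   # 0 tasks
    [0],                  # 1 task, succeeds on the first run
    [0, 0, 0, 0, 0],      # 5 tasks, all succeed on the first run
    [0, 0, 2, 0, 0],      # 5 tasks, one fails twice before succeeding
    [0, 1, 1, 0, 1],
])
def test_put_multiple(fails_before_success):
    # build a mock dispatcher whose poll() reports failures and successes
    # according to fails_before_success, then exercise TaskPackageDropbox
    pass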

organize requirements

pointed out in #30 (comment)

We need at least three separate lists of requirements:

  • to use alphatwirl
  • to run the tests
  • to compile the docs

They can be specified in install_requires in setup.py and in requirements files for pip; a possible layout is sketched below.
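A possible layout, sketched under the assumption of setuptools extras; the test dependencies are those mentioned in the issue above, and the docs dependency is only a guess:

from setuptools import setup, find_packages

setup(
    name='alphatwirl',
    packages=find_packages(),
    install_requires=[],   # runtime dependencies would be listed here
    extras_require={
        'tests': ['pytest', 'pytest-mock', 'pytest-cov', 'pytest-console-scripts'],
        'docs': ['sphinx'],   # assumption: Sphinx-built docs
    },
)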

absorb collectors into readers

Related to the comment at https://github.com/alphatwirl/alphatwirl-interface/pull/7#issuecomment-369357995.

This will be a set of surgical changes.

It will help to solve #10.

steps:

should min be applied on bin boundaries rather than values?

In version 0.18.3, the following code produces awkward results:

from qtwirl import qtwirl
filepath = 'https://github.com/alphatwirl/qtwirl/raw/v0.03.1/tests/data/sample_chain_01.root'
results = qtwirl(
    file=filepath, tree_name='tree',
    reader_cfg=dict(keyAttrNames='met', binnings=RoundLog(0.2, 100, min=10, underflow_bin=0)))
print results
  met n nvar
0.000000 102 102
6.309573 0 0
10.000000 53 53
15.848932 77 77
25.118864 100 100
39.810717 141 141
63.095734 153 153
100.000000 166 166
158.489319 114 114
251.188643 78 78
398.107171 14 14
630.957344 2 2
1000.000000 0 0

The 2nd row, i.e. the row with met = 6.309573, shouldn't be there.

It is also misleading: it suggests that there are no events with 6.309573 <= met < 10.0, which is wrong.

This happens because, in RoundLog (and in Round as well), the bin next to the underflow bin is the bin for the min.

An example

>>> b = RoundLog(0.2, 100, min=10, underflow_bin=0)
>>> b(10)
6.309573444801936
>>> b(8)
0

Here is why this happens. Because 10 is greater than or equal to min (in fact, it is exactly the min), it returns the lower edge of the bin. However, because of floating-point rounding, 10 doesn't fall into the interval [10, 10^1.2) but into the one below, [10^0.8, 10). So it returns 10^0.8 = 6.30957. On the other hand, because 8 is below min, it returns underflow_bin.

This is a feature, but it is very uncomfortable. 8 is greater than 6.3; if there is a bin [10^0.8, 10), 8 should fall in that bin.

As long as the bin is determined after the value is compared with min, this can happen.

This uncomfortable situation can be avoided if the minimum is taken to be the lower edge of the bin that the argument min falls in.
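A sketch of that behaviour (not the actual RoundLog implementation; the bin width 0.2 and the boundary 100 match the example above):

import math

def lower_edge(value, width=0.2, boundary=100):
    # lower edge of the log10 bin of the given width, with edges aligned to `boundary`
    x = math.log10(value)
    b = math.log10(boundary)
    return 10 ** (b + math.floor((x - b) / width) * width)

def bin_with_min_on_edge(value, min_val=10, underflow_bin=0):
    edge = lower_edge(value)
    if edge < lower_edge(min_val):   # compare bin edges, not raw values
        return underflow_bin
    return edge

# A value goes to the underflow bin only if its whole bin lies below the bin
# containing min, so a bin can never appear in the output while values that
# belong to it are sent to underflow.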

make it possible to include lambdas in tasks

Currently, it is impossible for a task in concurrently to include a lambda. This is because lambdas aren't picklable.

There are several solutions for serializing lambdas, including dill.

Note that, unlike results, tasks are generally small; serialization and deserialization don't need to be very fast.
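For example, a quick check (a sketch, assuming dill is installed) that dill can round-trip a lambda where the standard pickle module cannot:

import dill

task = lambda x: x * 2          # not picklable with the standard pickle module
payload = dill.dumps(task)      # dill serializes the lambda's code object
restored = dill.loads(payload)
assert restored(21) == 42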

"can't pickle _thread.lock" in ResumableDatasetLoop with SubprocessRunner in Python 3

at 8db62fc, a few commits after v0.24.2

Traceback (most recent call last):
  File "./bdphi-scripts/bdphiROC/twirl_mktbl_heppy.py", line 721, in <module>
    main()
  File "./bdphi-scripts/bdphiROC/twirl_mktbl_heppy.py", line 62, in main
    run(reader_collector_pairs)
  File "./bdphi-scripts/bdphiROC/twirl_mktbl_heppy.py", line 609, in run
    treeName='tree'
  File "./bdphi-scripts/external/atheppy/atheppy/fw.py", line 110, in run
    self._run(loop)
  File "./bdphi-scripts/external/atheppy/atheppy/fw.py", line 285, in _run
    loop()
  File "./bdphi-scripts/external/alphatwirl/alphatwirl/datasetloop/loop.py", line 60, in __call__
    pickle.dump(self.reader, f, protocol=pickle.HIGHEST_PROTOCOL)
TypeError: can't pickle _thread.lock objects

The error occurs in ResumableDatasetLoop:

def __call__(self):
    self.reader.begin()
    for dataset in self.datasets:
        self.reader.read(dataset)
        path = os.path.join(self.workingarea.path, 'reader.p.gz')
        with gzip.open(path, 'wb') as f:
            pickle.dump(self.reader, f, protocol=pickle.HIGHEST_PROTOCOL)
    return self.reader.end()

  • This happens only in Python 3, not in Python 2.7.
  • This happens only in the parallel mode subprocess, not in htcondor.

The proc in SubprocessRunner might not be picklable in Python 3:

proc = subprocess.Popen(
    args,
    stdout=subprocess.PIPE if self.pipe else None,
    stderr=subprocess.PIPE if self.pipe else None,
    cwd=taskdir
)
self.running_procs.append(proc)

  • can try to pickle before self.reader.read(dataset) or self.reader.begin()
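As a quick check of that suspicion (a sketch, not part of alphatwirl): pickling a Popen object directly in Python 3 reproduces the same error, because Popen keeps an internal wait lock:

import pickle
import subprocess
import sys

proc = subprocess.Popen([sys.executable, '--version'],
                        stdout=subprocess.PIPE, stderr=subprocess.PIPE)
proc.communicate()
try:
    pickle.dumps(proc)
except TypeError as e:
    print(e)   # in Python 3.6: "can't pickle _thread.lock objects"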
