cfe-lab / kive

Archival and automation of bioinformatic pipelines and data

Home Page: https://cfe-lab.github.io/Kive

License: BSD 3-Clause "New" or "Revised" License

Topics: bioinformatics-pipeline, python, version-control

kive's People

Contributors: artpoon, dependabot-preview[bot], dependabot[bot], dmacmillan, donaim, donkirkby, emartin-cfe, jamesnakagawa, jjh13, nathanielknight, rhliang, rmcclosk, tnguyencfe, wrpscott
kive's Issues

Refactor summarize_CSV

CompoundDatatype.summarize_CSV has a bug in it somewhere, but the method is quite long, which makes the bug difficult to locate. I'm going to split it up into several smaller functions.

How immutable are Datatypes?

Once you create a Datatype, can you modify it? Can you add another supertype later on, or change the supertype?

I ask because I can see ways to make things more efficient if they are totally immutable. For example, get_all_regexps could traverse the supertypes the first time it is called, collect all their regexps, and store them in a variable self.all_regexps; all subsequent calls would return self.all_regexps without redoing the traversal. Not critical (I suppose the trees won't get very big, so the traversal won't be expensive anyway), but it would be nice.
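If Datatypes do turn out to be immutable, the caching idea could look something like this (a minimal pure-Python sketch; the class and attribute names are illustrative, not the actual Kive models):

```python
class Datatype:
    """Hypothetical sketch of a Datatype that caches its regexp collection.

    Assumes each Datatype knows its own regexps and its supertypes.
    The cache is only safe if Datatypes never change after creation.
    """
    def __init__(self, regexps=(), supertypes=()):
        self.regexps = list(regexps)
        self.supertypes = list(supertypes)
        self._all_regexps = None  # filled in lazily on first call

    def get_all_regexps(self):
        # Traverse the supertype tree once, then serve the cached result.
        if self._all_regexps is None:
            collected = list(self.regexps)
            for parent in self.supertypes:
                collected.extend(parent.get_all_regexps())
            self._all_regexps = collected
        return self._all_regexps
```

Subsequent calls return the cached list without re-walking the tree, which is exactly why mutation (say, adding a supertype later) would silently break it.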

Checking basic constraints takes a LONG time

Something to think about if the backend team has any spare time: checking the basic constraints on even a moderately sized dataset (600 entries) causes a noticeable delay, on the order of a few seconds. I'm a bit worried about what would happen if we have a data file with a million entries.

Allowed BasicConstraints should be based on Shipyard atomic types

Allowed basic constraints (i.e. minlen, maxlen, regexp, etc.) on a Datatype should be the intersection of the allowed basic constraints on the atomic Shipyard types which the Datatype restricts (even indirectly).

For example, suppose datatype A restricts both float and bool. Then since floats can have minval, maxval, and regexp, but bools can only have regexp, datatype A can only have a regexp.

Any Datatype restricting a Datatype with a datetimeformat constraint may not define another datetimeformat constraint to "override" the supertype. It must assume the supertype's datetimeformat.
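A sketch of the intersection rule, assuming a hypothetical mapping from atomic types to their allowed constraint kinds (the mapping below is illustrative, based only on the examples in this issue, not the actual Kive definitions):

```python
# Hypothetical table of which basic constraints each atomic type allows.
# From the issue: floats allow minval/maxval/regexp, bools only regexp.
ALLOWED = {
    "string": {"minlen", "maxlen", "regexp", "datetimeformat"},
    "integer": {"minval", "maxval", "regexp"},
    "float": {"minval", "maxval", "regexp"},
    "bool": {"regexp"},
}

def allowed_constraints(restricted_atomics):
    """Intersect the allowed constraint sets of every atomic type
    that the Datatype restricts (directly or indirectly)."""
    sets = [ALLOWED[name] for name in restricted_atomics]
    return set.intersection(*sets) if sets else set()
```

With the example from above, a Datatype restricting both float and bool would be left with only regexp.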

Everyone update your nukeDB.expect

I made a small fix to nukeDB_default.expect. It now nukes the Logs/ directory, which stores all of the output logs produced while running Pipelines. Everyone should update their nukeDB.expect accordingly.

Creation of "nonsense" basic constraints

Currently, it is possible to create Datatypes with "nonsense" basic constraints, e.g. min length > max length, more than one min length, etc. We need code in clean() to disallow these kinds of conflicts.
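The check could look roughly like this (a standalone sketch; the real version would live in the Django model's clean(), and the (ruletype, value) representation here is hypothetical):

```python
def clean_basic_constraints(constraints):
    """Reject "nonsense" combinations of basic constraints.

    `constraints` is a list of (ruletype, value) pairs, e.g.
    [("minlen", 1), ("maxlen", 10)]; names are illustrative.
    """
    by_type = {}
    for ruletype, value in constraints:
        by_type.setdefault(ruletype, []).append(value)

    # At most one constraint of each kind.
    for ruletype, values in by_type.items():
        if len(values) > 1:
            raise ValueError("more than one %s constraint" % ruletype)

    # Bounds must not be contradictory.
    if "minlen" in by_type and "maxlen" in by_type:
        if by_type["minlen"][0] > by_type["maxlen"][0]:
            raise ValueError("min length exceeds max length")
    if "minval" in by_type and "maxval" in by_type:
        if by_type["minval"][0] > by_type["maxval"][0]:
            raise ValueError("min value exceeds max value")
```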

Tests for prototypes

When defining a Datatype with custom constraints, the user may optionally supply a "prototype", which is essentially a unit test for the custom constraint's verification method. There are currently no tests for this functionality.

Tests for VerificationLog

Code has been written to create a VerificationLog each time a column of data is checked against its CustomConstraint. No tests have been written for this yet.

New name for Run*

We have been referring to Runs and their component parts (RunSICs, RunOutputCables, and RunSteps) in the code as either "instances" or "records". There are problems with both names: we have another thing called a record (ExecRecord), and "instance" means something in the OOP context. We need either to change how we refer to Run*, or to rename ExecRecord. Suggestions are welcome.

The most parsimonious thing might be just to call them runs (e.g. PipelineOutputCable.poc_runs.create(...) instead of the current PipelineOutputCable.poc_instances.create(...)). However, this might also get confusing, because we already have a thing called a Run. We could also go with "runrecord", though it's a bit long and the clash with "record" is still there.

Alternatively, what about "context"? Since a Run* does provide context for execution. It would be PipelineOutputCable.poc_contexts.create(...), and in the execute code, variables would be called curr_context and so on.

Should regexps match the whole string?

I defined a basic constraint "[A-Za-z]+" on a Datatype, and was wondering why the value "l1ve" did not trigger a cell error. It was because we are calling re.match on the string, so the "l" at the beginning was matched and that was good enough.

  1. Is this the expected behaviour? Should we tell the user they need to enclose their pattern in "^$" to match the whole string?

  2. If so, should we really be preferentially matching at the beginning of the string only (with re.match), or searching the whole string for a partial match (re.search)?
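For the record, here is the behaviour in question (Python 3 shown; re.fullmatch did not exist when this issue was filed, but the anchoring behaviour of re.match is the same):

```python
import re

pattern = "[A-Za-z]+"
value = "l1ve"

# re.match anchors only at the start: the leading "l" matches,
# so no cell error is raised.
assert re.match(pattern, value) is not None
assert re.match(pattern, value).group() == "l"

# re.search would accept a partial match anywhere in the string.
assert re.search(pattern, value) is not None

# Anchoring the pattern in "^$" (or using re.fullmatch) rejects the value.
assert re.match("^" + pattern + "$", value) is None
assert re.fullmatch(pattern, value) is None
```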

Helper for Run*Cable: keeps_output()

There needs to be a helper function for RunSICs and RunOutputCables called keeps_output(), which tells whether or not to keep the Dataset produced by the cable. This is currently implemented inline at lines 180-190 of sandbox/execute.py.

  • for a RunSIC, it should just return PSIC.keep_output
  • for a RunOutputCable, it should return True if the ROC is part of a top-level Run; otherwise, it should return whether the corresponding PipelineStep keeps the output (i.e. whether the output is absent from the PipelineStep's outputs_to_delete)
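A sketch of the proposed helper, with the model attributes stood in by plain parameters (names are illustrative; the real version would be a method on the Run*Cable classes):

```python
def keeps_output(is_runsic, psic_keep_output=False, run_is_top_level=False,
                 output_name=None, outputs_to_delete=()):
    """Decide whether the Dataset produced by a cable should be kept."""
    if is_runsic:
        # A RunSIC keeps its output exactly when its PSIC says to.
        return psic_keep_output
    if run_is_top_level:
        # A RunOutputCable in a top-level Run always keeps its output.
        return True
    # Otherwise, keep it unless the corresponding PipelineStep deletes it.
    return output_name not in outputs_to_delete
```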

Why is Run*.ExecLog a GenericRelation, and not a OneToOneField?

As usual, trying to do one thing leads to 12 others...

As far as I can see, a Run(Step|SIC|OutputCable) can only have one ExecLog associated with it. In fact, there is an explicit check for this (although I was the one who put that in, so I may have been in error). Can we just make it a nullable OneToOneField?

Verification methods for CustomConstraints can output large row numbers

I created a verification method which outputs

failed_row
1000

for all inputs, and used it to summarize a 2-row CSV. The resulting summary had the following key:

'failing_cells': {(1000, 1): [<CustomConstraint: CustomConstraint object>]

If I'm reading the code right, this would get saved to the DB as a CellError.

One solution would be to add a BasicConstraint (maxval = num_rows) to VERIF_OUT when we check the output of a verification method, do the check, and then delete the BasicConstraint; however, this would not be thread safe.
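A thread-safe alternative might be to validate the reported row numbers directly in Python, without touching VERIF_OUT's constraints at all (a sketch; the function name and signature are hypothetical):

```python
def check_failed_rows(failed_rows, num_rows):
    """Reject a verification method's output if it names rows that
    don't exist in the CSV being summarized (e.g. row 1000 of a
    2-row file), instead of temporarily attaching a maxval
    BasicConstraint to VERIF_OUT.
    """
    bad = [row for row in failed_rows if not 1 <= row <= num_rows]
    if bad:
        raise ValueError(
            "verification method reported nonexistent rows: %s" % bad)
```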

CompoundDatatypeMember clean() doesn't clean the underlying Datatype

CompoundDatatype.clean() recursively cleans all the members, but the members don't actually have a clean function, so this just does the basic Django validation on the fields. Should we clean the underlying Datatype from CDTM.clean(), or is it assumed to be clean before we even create the CDT?

String representations of CDT's

We should remove the numerical indices from string and unicode representations of CompoundDataTypes, because they are always 1,...,n. Create a representation which shows only the column names and column types. Remove or escape angle brackets in the representations, as these cause problems with the HTML interface.

IntegrityError on revising method

Exception value: columns content_type_id, object_id, dataset_idx are not unique

/Users/apoon/git/shipyard/shipyard/transformation/models.py in create_input, at dataset_idx=dataset_idx)

Local vars:
  dataset_idx       1
  max_row           None
  self              <Method: Method g2pScorer g2pScorer>
  compounddatatype  <CompoundDatatype: (string: to_test)>
  min_row           None
  dataset_name      u'an input'

I have not changed any of this input's values, but it should be able to create a distinct TransformationXput because it is associated with a different Method (Transformation).

failure to execute loaddata

$ python2.7 manage.py loaddata initial_data

Problem installing fixture '/Users/art/git/Shipyard/shipyard/metadata/fixtures/initial_data.json': Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/commands/loaddata.py", line 196, in handle
    obj.save(using=using)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/serializers/base.py", line 165, in save
    models.Model.save_base(self.object, using=using, raw=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/base.py", line 529, in save_base
    rows = manager.using(using).filter(pk=pk_val)._update(values)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/query.py", line 560, in _update
    return query.get_compiler(self.db).execute_sql(None)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/sql/compiler.py", line 986, in execute_sql
    cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/sql/compiler.py", line 818, in execute_sql
    cursor.execute(sql, params)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/backends/util.py", line 40, in execute
    return self.cursor.execute(sql, params)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/backends/sqlite3/base.py", line 344, in execute
    return Database.Cursor.execute(self, query, params)
IntegrityError: Could not load auth.Permission(pk=1): UNIQUE constraint failed: auth_permission.content_type_id, auth_permission.codename

Why are start/endtime in ExecLog, but returncode/output/error in MethodOutput?

I added a bit of functionality to the wrapper around running code, which can "fill in" a passed-in log (i.e. set the return code, start time, and end time, and upload the logs). I was hoping this would save us some lines, since the code for filling out VerificationLogs and ExecLogs is quite similar.

After I did this, I realised that it's not quite the same, because while VerificationLogs have all those things (ret code, start/end time, logs), ExecLogs have only start/end time, and MethodOutputs have the return code and the logs. So I can't pass either one of these classes in as the log to fill out.

I am wondering why we have these things split between two classes. Why do we need to time and record things other than the execution of user-supplied methods? We aren't timing any other internal Shipyard code, nor are we recording its execution in logs. And I can't imagine stuff like executing cables is going to take enough time for the user to be worried about logging it - certainly not longer than, say, checking basic constraints, which we're not logging either.

If this is best answered in person feel free to not respond here, I just wanted to get it written down while I'm thinking about it.

Prototypes are never checked (?)

As far as I can tell, all we currently do with prototypes is make sure they have the right CompoundDatatype and aren't raw. We never actually run the verification method against them.

New class for "things that take time"

We have a number of classes with start_time and end_time attributes (all the Run* and *Log classes), which necessitates many checks that start_time <= end_time in clean functions. We should create a new class with start_time and end_time attributes, and have all these classes hold a foreign key to it. That way, the time-coherence checks, and any future helpers like computing elapsed time, only have to be written once.
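In plain Python (ignoring the Django model machinery), the shared class might look like this (names are illustrative; in Kive it would be a model that the Run* and *Log classes reference):

```python
from datetime import datetime

class TimedEvent:
    """Sketch of a shared "thing that takes time"."""

    def __init__(self, start_time=None, end_time=None):
        self.start_time = start_time
        self.end_time = end_time

    def clean(self):
        # The one place the time-coherence check has to live.
        if (self.start_time is not None and self.end_time is not None
                and self.end_time < self.start_time):
            raise ValueError("end_time precedes start_time")

    def elapsed(self):
        # Example of a helper that all timed classes would get for free.
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time
```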

Constraint checking for Datatypes

Recursive checking of basic constraints, for Datatypes, will need to be moved to a new complete_clean() function. This will prevent ValueErrors of the following form when calling clean() before save():

ValueError: "<Datatype: test>" needs to have a value for field "from_datatype" before this many-to-many relationship can be used.

complete_clean() is now a function which (by convention) can only be called after the object has been saved. This was implicit before, because of the contexts in which complete_clean() was called; it is now explicit.

Remove PythonType

We are removing the PythonType attribute from Datatypes. All Datatypes must now select a Datatype to restrict when they are created, possibly one of the four atomic Shipyard Datatypes: integer, string, bool, float. In the backend, we need to:

  • add a check in clean() (or possibly complete_clean()?) that the Datatype has a supertype
  • rewrite or remove all integrity checks based on Python type

Method.create_input fails with IntegrityError despite unique inputs

raised at /Users/art/git/Shipyard/shipyard/transformation/models.py in create_input
IntegrityError: columns content_type_id, object_id, dataset_idx are not unique

>>> for foo in TransformationInput.objects.all():
...     print foo.content_type_id, foo.object_id, foo.dataset_idx
...
4 1 1
4 1 2
4 2 1
4 2 2
4 3 1
4 3 2
4 4 1
4 4 2
4 5 1
4 5 2
4 6 1
4 6 2

Some proposed changes to Run.clean

I would like to make these changes to Run.clean, in the interest of (eventually) running several steps concurrently. Aside from some changes to the tests, I'm pretty sure it won't break anything.

Currently:
  • if there is an incomplete RunStep, all preceding RunSteps (by step_num) must be complete
  • if not all RunSteps are present and complete, there are no RunOutputCables

Proposed change:
  • if there is an incomplete RunStep, all RunSteps which feed into it must be complete
  • if a particular RunStep is not present or not complete, there are no RunOutputCables from that RunStep

Better error messages setup

The error_messages dict in constants.py is a hideous monstrosity (sorry, it seemed like a good idea at the time). I need to find either a better way of storing messages, or just get rid of the thing altogether and go back to writing them inline.

check_file_contents doesn't raise any errors

You can create SymbolicDatasets with bad headers, or from an empty file. All it does is warn you, but it still creates the SD (which I'm sure will create problems down the line). I'm going to fix this.

Failed Runs are not complete

In the course of trying to fix my tests, I created a Method which doesn't work. It's a python script, and on its stderr I get

Traceback (most recent call last):
  File "/tmp/userbob_run1_Lk3lCg/step1/reverse.sh", line 3, in <module>
    with open(sys.argv[1]) as infile, open(sys.argv[2]) as outfile:
IOError: [Errno 2] No such file or directory: '/tmp/userbob_run1_Lk3lCg/step1/output_data/step1_reversed_words.csv'

After the execution of the pipeline is finished, sandbox.run.complete_clean() raises a ValidationError. Upon further inspection, it seems that the second RunStep is missing, because the first failed. This causes Run.clean() to return False.

  • is this expected behaviour, or is something wrong with my test?
  • if it is expected behaviour, is it what we want?

ContentCheckLog/IntegrityCheckLog and their relationship to ExecLog

I've been tinkering with the Run* classes and fixing up their clean() functions, as well as hammering out what their completeness and success criteria are (i.e. what makes a Run* complete and what makes it successful). I realized that we hadn't explicitly dealt with the role of the CCLs and ICLs in completeness/success. This is probably because CCLs and ICLs aren't directly related to Run* classes; they're tied to ExecLogs.

However, I'm realizing that the CCLs and ICLs shouldn't have any bearing on the completeness or success of the ExecLogs they're tied to, as they represent a different step of the procedure. I'm going to remove the check for CCL and ICL failures from ExecLog.is_successful() and move it to Run*.successful_execution(). Moreover, I'm going to add checks for CCL and ICL completion to Run*.is_complete().

CCLs and ICLs will remain pointing to ExecLog, but philosophically the CCLs and ICLs are more closely tied to their grandparent Run* to which the parent ExecLog belongs. As such, I'm going to put calls to CCL.clean() and ICL.clean() in the Run*.clean()s but not in ExecLog.clean().

Let me know if you have any objections.

addition of @transaction.atomic decorator in transformation.models breaks support for Django 1.4

Validating models...

metadata.models
Unhandled exception in thread started by <bound method Command.inner_run of <django.contrib.staticfiles.management.commands.runserver.Command object at 0x1017c4c10>>
Traceback (most recent call last):
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/commands/runserver.py", line 91, in inner_run
    self.validate(display_num_errors=True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/base.py", line 266, in validate
    num_errors = get_validation_errors(s, app)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/validation.py", line 30, in get_validation_errors
    for (app_name, error) in get_app_errors().items():
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/loading.py", line 158, in get_app_errors
    self._populate()
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/loading.py", line 64, in _populate
    self.load_app(app_name, True)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/loading.py", line 88, in load_app
    models = import_module('.models', app_name)
  File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/utils/importlib.py", line 35, in import_module
    __import__(name)
  File "/Users/art/git/Shipyard/shipyard/archive/models.py", line 20, in <module>
    import method.models
  File "/Users/art/git/Shipyard/shipyard/method/models.py", line 14, in <module>
    import file_access_utils, transformation.models
  File "/Users/art/git/Shipyard/shipyard/transformation/models.py", line 43, in <module>
    class Transformation(models.Model):
  File "/Users/art/git/Shipyard/shipyard/transformation/models.py", line 108, in Transformation
    @transaction.atomic
AttributeError: 'module' object has no attribute 'atomic'

ExecRecord reuse is broken

Tests in sandbox.tests_rm pertaining to reusing an ExecRecord (test_execute_pipeline_fill_in_ER, test_execute_pipeline_reuse, and test_execute_pipeline_reuse_within_different_pipeline) are all failing. ExecRecords are not being reused, despite the presence of others which should be compatible.

failure to nuke DB

when executing nukeDB.bash, there's an error attempting to load initial_data.json

Traceback (most recent call last):
  File "./manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 415, in handle
    return self.handle_noargs(**options)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/commands/syncdb.py", line 112, in handle_noargs
    emit_post_sync_signal(created_models, verbosity, interactive, db)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/sql.py", line 216, in emit_post_sync_signal
    interactive=interactive, db=db)
  File "/usr/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 185, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/usr/local/lib/python2.7/site-packages/django/contrib/auth/management/__init__.py", line 82, in create_permissions
    ctype = ContentType.objects.db_manager(db).get_for_model(klass)
  File "/usr/local/lib/python2.7/site-packages/django/contrib/contenttypes/models.py", line 47, in get_for_model
    defaults = {'name': smart_text(opts.verbose_name_raw)},
  File "/usr/local/lib/python2.7/site-packages/django/db/models/manager.py", line 154, in get_or_create
    return self.get_queryset().get_or_create(**kwargs)
  File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 388, in get_or_create
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 379, in get_or_create
    with transaction.atomic(using=self.db):
  File "/usr/local/lib/python2.7/site-packages/django/db/transaction.py", line 277, in __enter__
    connection._start_transaction_under_autocommit()
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 436, in _start_transaction_under_autocommit
    self.cursor().execute("BEGIN")
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/util.py", line 69, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/util.py", line 53, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 99, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/util.py", line 51, in execute
    return self.cursor.execute(sql)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 448, in execute
    return Database.Cursor.execute(self, query)
django.db.utils.OperationalError: cannot start a transaction within a transaction

Revisions for the same CodeResource do not need distinct names

I have used the interface to create three CodeResourceRevisions for the same CodeResource, all with the same name. When I go to create a Method, and select the relevant CodeResource, the "Revisions" dropdown is populated with three identical choices. I can't tell which version is which.

I can see two choices:

  1. enforce that different revisions of the same CodeResource have distinct names
  2. in the Revisions dropdown, display something else identifying the CodeResourceRevision, like a timestamp, or a version number (which we would have to calculate based on when it was created relative to other Revisions)
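For option 2, the version number could be derived from creation order, along these lines (a sketch; the (name, created) representation is hypothetical, standing in for CodeResourceRevision fields):

```python
def revision_labels(revisions):
    """Build dropdown labels that distinguish same-named revisions.

    `revisions` is a list of (name, created) pairs; the version number
    is the revision's rank by creation time.
    """
    ordered = sorted(revisions, key=lambda rev: rev[1])
    return ["%d: %s (%s)" % (index + 1, name, created)
            for index, (name, created) in enumerate(ordered)]
```

Three identically-named revisions would then render as distinct choices like "1: script.py (2013-09-01)", "2: script.py (2013-09-02)", and so on.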
