cfe-lab / kive
Archival and automation of bioinformatic pipelines and data
Home Page: https://cfe-lab.github.io/Kive
License: BSD 3-Clause "New" or "Revised" License
CompoundDatatype.summarize_CSV has a bug in it somewhere, but the function is quite long, which makes the bug difficult to find. I'm going to split it up into several smaller functions.
Once you create a Datatype, can you modify it? Can you add another supertype later on, or change the supertype?
I ask because I can see ways to make things more efficient if they are totally immutable. For example, get_all_regexps would (the first time it's called) traverse the supertypes and collect all their regexps, and store these in a variable self.all_regexps. On all subsequent calls, it would return self.all_regexps without (re)doing the whole traversal. Not critical (I suppose the trees won't get very big, so this won't be expensive anyway), just would be nice.
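If Datatypes really are immutable once created, the caching idea above can be sketched in plain Python. This is a stand-in class, not the real model; only the method name `get_all_regexps` comes from the issue:

```python
class Datatype(object):
    """Stand-in for the real model: supertypes and regexps are plain attributes."""

    def __init__(self, regexps, supertypes=()):
        self.regexps = list(regexps)
        self.supertypes = list(supertypes)
        self._all_regexps = None  # cache; only valid if Datatypes are immutable

    def get_all_regexps(self):
        # First call: traverse the supertypes and collect all their regexps.
        if self._all_regexps is None:
            collected = list(self.regexps)
            for sup in self.supertypes:
                collected.extend(sup.get_all_regexps())
            self._all_regexps = collected
        # Subsequent calls: return the cached list without redoing the traversal.
        return self._all_regexps
```

The cache is never invalidated, which is exactly why the question of mutability matters: adding or changing a supertype after the first call would silently return stale results.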
If I make a new Code Resource, it appears in the database and the resources.html summary page but does not appear in the drop-down generated by the ModelForm CodeResourcePrototypeForm in method/templates/method/resouces_add.html (which uses method/static/method/resources.js)
The error logs of methods which explicitly output to stderr are empty.
Something to think about if the backend team has any spare time - checking the basic constraints on even a moderately sized dataset (600 entries) causes a noticeable delay, on the order of a few seconds. I'm a bit worried about what would happen if we have a data file with a million entries.
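One cheap mitigation, sketched below under the assumption that the slow part is per-cell regexp handling: compile each column's pattern once per file rather than once per cell. The function name and shape are hypothetical, not the current backend code:

```python
import re

def check_column(values, pattern):
    """Return 1-based row numbers whose value fails the regexp constraint.

    Compiling the pattern once per column, instead of per cell, is one
    easy way to keep basic-constraint checks tolerable on large CSVs.
    """
    compiled = re.compile(pattern)
    return [row for row, value in enumerate(values, start=1)
            if compiled.search(value) is None]
```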
I thought we had decided to only allow regexp restrictions on Boolean types. Permission to change?
Allowed basic constraints (i.e. minlen, maxlen, regexp, etc.) on a Datatype should be the /intersection/ of allowed basic constraints on the atomic Shipyard types which the Datatype restricts (even indirectly).
For example, suppose datatype A restricts both float and bool. Then since floats can have minval, maxval, and regexp, but bools can only have regexp, datatype A can only have a regexp.
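The intersection rule is easy to state as code. The per-type sets below are illustrative (the float and bool entries follow the example above; the int and string entries are assumptions):

```python
# Hypothetical table: which basic constraints each atomic type allows.
ALLOWED_CONSTRAINTS = {
    "int": {"minval", "maxval", "regexp"},
    "float": {"minval", "maxval", "regexp"},
    "string": {"minlen", "maxlen", "regexp", "datetimeformat"},
    "bool": {"regexp"},
}

def allowed_for(restricted_types):
    """Intersect the allowed-constraint sets of every restricted atomic type."""
    sets = [ALLOWED_CONSTRAINTS[t] for t in restricted_types]
    allowed = sets[0].copy()
    for s in sets[1:]:
        allowed &= s
    return allowed
```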
Any Datatype restricting a Datatype with a datetimeformat constraint may not define another datetimeformat constraint to "override" the supertype. It must assume the supertype's datetimeformat.
I made a small fix to nukeDB_default.expect. It now nukes the Logs/ directory which stores all of the output logs produced during the running of Pipelines. Everyone should update your nukeDB.expect accordingly.
This should pre-populate forms with values associated with the previous Method.
New Method will then have previous one as its parent, and belong to the same MethodFamily.
Python type is redundant given restricts.
Make restricts a required field with 'string' as default value.
Currently, it is possible to create Datatypes with "nonsense" basic constraints, eg. min length > max length, more than one min length, etc. We need code in clean() to disallow these kinds of conflicts.
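A sketch of the checks clean() would need, written against a plain list of (ruletype, value) pairs rather than the real BasicConstraint objects (the helper name and input shape are hypothetical):

```python
def conflicting_constraints(constraints):
    """Return a list of problems found in a set of basic constraints.

    `constraints` is a list of (ruletype, value) pairs, e.g.
    [("minlen", 5), ("maxlen", 3)].
    """
    problems = []
    by_type = {}
    for ruletype, value in constraints:
        by_type.setdefault(ruletype, []).append(value)
    # At most one of each numeric bound (multiple regexps are fine).
    for ruletype, values in by_type.items():
        if ruletype != "regexp" and len(values) > 1:
            problems.append("more than one %s" % ruletype)
    # Lower bounds must not exceed the matching upper bounds.
    for low, high in (("minlen", "maxlen"), ("minval", "maxval")):
        if low in by_type and high in by_type and by_type[low][0] > by_type[high][0]:
            problems.append("%s > %s" % (low, high))
    return problems
```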
When defining a Datatype with custom constraints, the user may optionally supply a "prototype", which is essentially a unit test for the custom constraint's verification method. There are currently no tests for this functionality.
summarize_CSV is failing at line 1034:
with open(output_path, "rb") as test_out:
because output_path does not exist. It is supposed to have been written to by the verification method.
Code has been written to create a VerificationLog each time a column of data is checked against its CustomConstraint. No tests have been written for this yet.
Runs, RunSteps, RunStepInputCables, and RunOutputCables all have a start_time attribute, but no end_time. We should add an end_time.
We have been referring to Runs and their component parts (RunSIC's, RunOutputCables, and RunSteps) in the code as either "instances" or "records". There are problems with both names - we have another thing called record (ExecRecord), and instance means something in the OOP context. We need to either change how we refer to Run*, or rename ExecRecord. Suggestions are welcome.
The most parsimonious thing to do might be just to call them runs (eg. PipelineOutputCable.poc_runs.create(...), instead of the current PipelineOutputCable.poc_instances.create(...)). However, this might also get confusing, because we have a thing called a Run. We could also go with "runrecord", though it's a bit long and the clash with record is still there.
Alternatively, what about "context"? Since a Run* does provide context for execution. It would be PipelineOutputCable.poc_contexts.create(...), and in the execute code, variables would be called curr_context and so on.
I defined a basic constraint "[A-Za-z]+" on a Datatype, and was wondering why the value "l1ve" did not trigger a cell error. It turns out we are calling re.match on the string, so the "l" at the beginning was matched and that was good enough.
Is this the expected behaviour? Should we tell the user they need to enclose their pattern in "^$" to match the whole string?
If so, should we really be preferentially matching at the beginning of the string only (with re.match), or searching the whole string for a partial match (re.search)?
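The behaviour described falls straight out of the re module: match anchors only at the start of the string, search matches anywhere, and neither requires the whole string to match unless the pattern itself is anchored:

```python
import re

pattern = "[A-Za-z]+"

# re.match anchors at the start only: the leading "l" of "l1ve" is enough.
assert re.match(pattern, "l1ve") is not None

# re.search accepts a match anywhere, e.g. the "ve" in "11ve"; re.match does not.
assert re.search(pattern, "11ve") is not None
assert re.match(pattern, "11ve") is None

# Only an explicitly anchored pattern rejects "l1ve" outright.
assert re.match("^" + pattern + "$", "l1ve") is None
```

So whichever of re.match or re.search we settle on, a user who wants whole-string semantics must write "^...$" (or we must add the anchors for them).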
I can't run the server. The CSS doesn't get loaded and the AJAX calls don't seem to be working either. Is there anything in settings.py I need to set? Environment variables?
There needs to be a helper function for RunSIC's and RunOutputCables called keeps_output(), which tells whether or not to keep the Dataset produced by the cable. This is currently implemented in lines 180-190 in sandbox/execute.py.
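A minimal shape for the proposed helper might look like the following. The attribute names here (`keep_output`, `is_trivial`) are illustrative guesses, not the real criteria; those are whatever lines 180-190 of sandbox/execute.py currently consult:

```python
def keeps_output(run_cable):
    """Sketch of the proposed helper for RunSICs and RunOutputCables.

    Returns whether the Dataset produced by this cable should be kept.
    The attributes consulted here are hypothetical stand-ins for the
    logic currently inlined in sandbox/execute.py.
    """
    return run_cable.keep_output and not run_cable.is_trivial
```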
As usual, trying to do one thing leads to 12 others...
As far as I can see, a Run(Step|SIC|OutputCable) can only have one ExecLog associated. In fact, there is an explicit check for this (although I was the one who put that in, so I may have been in error). Can we just make it a nullable OneToOneField?
I created a verification method which outputs
failed_row
1000
for all inputs, and used it to summarize a 2-row CSV. The resulting summary had the following key:
'failing_cells': {(1000, 1): [<CustomConstraint: CustomConstraint object>]}
If I'm reading the code right, this would get saved to the DB as a CellError.
One solution would be, when we check the output of a verification method, we add a BasicConstraint (maxval = num_rows) to VERIF_OUT, do the check, and then delete the BasicConstraint. This would not be thread safe.
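A thread-safe alternative is to range-check the reported failed-row numbers in Python after reading the verification method's output, instead of temporarily mutating VERIF_OUT's constraints in the database. Sketch only; the function name is hypothetical:

```python
def out_of_range_failed_rows(failed_rows, num_rows):
    """Return reported failed-row numbers that cannot refer to any input row.

    Equivalent to a transient maxval = num_rows constraint on failed_row,
    but thread safe, because nothing is written to the database.
    """
    return [row for row in failed_rows if not 1 <= row <= num_rows]
```

In the example above (a 2-row CSV whose verification method reports row 1000), this check would flag 1000 before it ever became a bogus CellError.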
CompoundDatatype.clean() recursively cleans all the members, but the members don't actually have a clean function, so this just does the basic Django validation on the fields. Should we clean the underlying Datatype from CDTM.clean(), or is it assumed to be clean before we even create the CDT?
We should remove the numerical indices from string and unicode representations of CompoundDataTypes, because they are always 1,...,n. Create a representation which shows only the column names and column types. Remove or escape angle brackets in the representations, as these cause problems with the HTML interface.
Functions which add one or more objects to the database, but which must add /zero/ objects if they fail at any point, should be marked with the "@transaction.atomic" decorator. This can replace nested try/except blocks.
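In Django this is just `from django.db import transaction` plus the `@transaction.atomic` decorator on the function. The all-or-nothing semantics it replaces the nested try/excepts with can be illustrated with a toy, database-free version:

```python
def atomic(store):
    """Toy illustration of all-or-nothing semantics (NOT Django's implementation).

    Apply a batch of additions to `store`; if the wrapped function raises,
    discard every object it added. Django's @transaction.atomic provides
    the real, database-backed version of this behaviour.
    """
    def decorator(func):
        def wrapper(*args, **kwargs):
            snapshot = list(store)  # remember the state before we start
            try:
                return func(*args, **kwargs)
            except Exception:
                store[:] = snapshot  # roll back everything we added
                raise
        return wrapper
    return decorator
```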
This way it would be the same as PipelineStepInputCable, and we could call cable.custom_wires for both, rather than having to check the type.
It seems to me that every Method should have at least one output, unless it is the results box.
Exception value: columns content_type_id, object_id, dataset_idx are not unique
/Users/apoon/git/shipyard/shipyard/transformation/models.py in create_input
    dataset_idx=dataset_idx) ...
Local vars:
    dataset_idx       1
    max_row           None
    self              <Method: Method g2pScorer g2pScorer>
    compounddatatype  <CompoundDatatype: (string: to_test)>
    min_row           None
    dataset_name      u'an input'
I have not changed any of this input's values, but it should be able to create a distinct TransformationXput because it is associated with a different Method (Transformation).
$ python2.7 manage.py loaddata initial_data
Problem installing fixture '/Users/art/git/Shipyard/shipyard/metadata/fixtures/initial_data.json': Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/commands/loaddata.py", line 196, in handle
obj.save(using=using)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/serializers/base.py", line 165, in save
models.Model.save_base(self.object, using=using, raw=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/base.py", line 529, in save_base
rows = manager.using(using).filter(pk=pk_val)._update(values)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/query.py", line 560, in _update
return query.get_compiler(self.db).execute_sql(None)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/sql/compiler.py", line 986, in execute_sql
cursor = super(SQLUpdateCompiler, self).execute_sql(result_type)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/sql/compiler.py", line 818, in execute_sql
cursor.execute(sql, params)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/backends/util.py", line 40, in execute
return self.cursor.execute(sql, params)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/backends/sqlite3/base.py", line 344, in execute
return Database.Cursor.execute(self, query, params)
IntegrityError: Could not load auth.Permission(pk=1): UNIQUE constraint failed: auth_permission.content_type_id, auth_permission.codename
I added a bit of functionality to the wrapper around running code, which can "fill in" a passed-in log (i.e. set the return code, start time, end time, and upload the logs). I was hoping this would save us some lines, since the code for filling out VerificationLogs and ExecLogs is quite similar.
After I did this, I realised that it's not quite the same, because while VerificationLogs have all those things (ret code, start/end time, logs), ExecLogs have only start/end time, and MethodOutputs have the return code and the logs. So I can't pass either one of these classes in as the log to fill out.
I am wondering why we have these things split between two classes. Why do we need to time and record things other than the execution of user-supplied methods? We aren't timing any other internal Shipyard code, nor are we recording its execution in logs. And I can't imagine stuff like executing cables is going to take enough time for the user to be worried about logging it - certainly not longer than, say, checking basic constraints, which we're not logging either.
If this is best answered in person feel free to not respond here, I just wanted to get it written down while I'm thinking about it.
As far as I can tell, all we currently do with prototypes is make sure they have the right CompoundDatatype and aren't raw. We never actually run the verification method against them.
We have a number of classes with start_time and end_time attributes (all the Run* and *Log classes), which necessitates many checks that start_time <= end time in clean functions. We should create a new class which has a start_time and end_time attribute, and have all these classes have a foreign key to one. That way, the time coherence checks, and any future helpers like getting the time elapsed, will only have to be written once.
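A database-free sketch of the proposed shared class (the name `TimeSpan` and the method names are invented; in the real models this would presumably be a Django model that the Run* and *Log classes point at via a foreign key):

```python
import datetime

class TimeSpan(object):
    """Sketch of the proposed shared start/end-time class.

    The Run* and *Log models would each reference one of these instead of
    carrying their own start_time/end_time fields and coherence checks.
    """

    def __init__(self, start_time=None, end_time=None):
        self.start_time = start_time
        self.end_time = end_time

    def clean(self):
        # The one place the start <= end coherence check would live.
        if (self.start_time is not None and self.end_time is not None
                and self.end_time < self.start_time):
            raise ValueError("end_time precedes start_time")

    def elapsed(self):
        # Example of a shared helper that comes for free.
        if self.start_time is None or self.end_time is None:
            return None
        return self.end_time - self.start_time
```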
Recursive checking of basic constraints, for Datatypes, will need to be moved to a new complete_clean() function. This will prevent ValueErrors of the following form when calling clean() before save():
ValueError: "<Datatype: test>" needs to have a value for field "from_datatype" before this many-to-many relationship can be used.
complete_clean() is now a function which (by convention) can only be called after the object has been saved. This was implicit before, because of the contexts in which complete_clean() was called - it is now explicit.
We are removing the PythonType attribute from Datatypes. All Datatypes must now select a Datatype to restrict when they are created, possibly one of the four atomic Shipyard Datatypes: integer, string, bool, float. In the backend, we need to:
raised at /Users/art/git/Shipyard/shipyard/transformation/models.py in create_input
IntegrityError: columns content_type_id, object_id, dataset_idx are not unique
>>> for foo in TransformationInput.objects.all():
...     print foo.content_type_id, foo.object_id, foo.dataset_idx
...
4 1 1
4 1 2
4 2 1
4 2 2
4 3 1
4 3 2
4 4 1
4 4 2
4 5 1
4 5 2
4 6 1
4 6 2
I would like to make these changes to Run.clean, in the interest of (eventually) running several steps concurrently. Aside from some changes to the tests, I'm pretty sure it won't break anything.
| currently | proposed change |
|---|---|
| if there is an incomplete RunStep, all preceding RunSteps (by step_num) are complete | if there is an incomplete RunStep, all RunSteps which feed into it are complete |
| if all RunSteps are not present and complete, there are no RunOutputCables | if a particular RunStep is not present or not complete, there are no RunOutputCables from that RunStep |
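The proposed RunStep rule can be sketched as a check over a dependency map (step number mapped to the set of step numbers feeding into it; the function and parameter names are hypothetical, not Run.clean's actual signature):

```python
def incomplete_steps_ok(feeds_into, complete):
    """Proposed Run.clean rule, sketched.

    Every incomplete RunStep may only be waiting on its own feeders,
    and all of those feeders must themselves be complete.

    feeds_into: dict mapping step_num -> set of step_nums it consumes from
    complete:   set of step_nums whose RunSteps are complete
    """
    for step, feeders in feeds_into.items():
        if step not in complete and not feeders <= complete:
            return False
    return True
```

Unlike the current step_num-ordered rule, this allows steps 2 and 3 to run concurrently when both depend only on step 1.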
The error_messages dict in constants.py is a hideous monstrosity (sorry, it seemed like a good idea at the time). I need to find either a better way of storing messages, or just get rid of the thing altogether and go back to writing them inline.
You can create SymbolicDatasets with bad headers, or from an empty file. All it does is warn you, but it still creates the SD (which will create problems down the line I'm sure). I'm going to fix this.
In the course of trying to fix my tests, I created a Method which doesn't work. It's a python script, and on its stderr I get
Traceback (most recent call last):
File "/tmp/userbob_run1_Lk3lCg/step1/reverse.sh", line 3, in <module>
with open(sys.argv[1]) as infile, open(sys.argv[2]) as outfile:
IOError: [Errno 2] No such file or directory: '/tmp/userbob_run1_Lk3lCg/step1/output_data/step1_reversed_words.csv'
After the execution of the pipeline is finished, sandbox.run.complete_clean() raises a ValidationError. Upon further inspection, it seems that the second RunStep is missing, because the first failed. This causes Run.clean() to return False.
Allow user to enter comma-delimited (double-quote enclosed) regex patterns.
Parse in view function to populate multiple RE forms.
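The parsing half is essentially free with the csv module, which already handles commas inside the double-quoted patterns. A minimal sketch (the function name is invented, and where this lands in the view is still to be decided):

```python
import csv

def parse_patterns(line):
    """Split a comma-delimited, double-quote-enclosed list of regex
    patterns into a list of strings, one per RE form to populate.

    The csv module handles commas that occur inside the quoted patterns.
    """
    return next(csv.reader([line]))
```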
I've been tinkering with the Run* classes and fixing up their clean() functions, as well as hammering out what their completeness and success criteria are (i.e. what makes a Run* complete and what makes it successful). I realized that we hadn't explicitly dealt with the role of the CCLs and ICLs in completeness/success. This is probably because CCLs and ICLs aren't directly related to Run* classes; they're tied to ExecLogs.
However I'm realizing that the CCLs and ICLs shouldn't have any bearing on the completeness or success of the ExecLogs they're tied to, as they represent a different step of the procedure. I'm going to remove the check for CCL and ICL failures from ExecLog.is_successful() and move them to Run*.successful_execution(). Moreover, I'm going to add checks for CCL and ICL completion into Run*.is_complete().
CCLs and ICLs will remain pointing to ExecLog, but philosophically the CCLs and ICLs are more closely tied to their grandparent Run* to which the parent ExecLog belongs. As such, I'm going to put calls to CCL.clean() and ICL.clean() in the Run*.clean()s but not in ExecLog.clean().
Let me know if you have any objections.
Validating models...
metadata.models
Unhandled exception in thread started by <bound method Command.inner_run of <django.contrib.staticfiles.management.commands.runserver.Command object at 0x1017c4c10>>
Traceback (most recent call last):
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/commands/runserver.py", line 91, in inner_run
self.validate(display_num_errors=True)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/base.py", line 266, in validate
num_errors = get_validation_errors(s, app)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/core/management/validation.py", line 30, in get_validation_errors
for (app_name, error) in get_app_errors().items():
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/loading.py", line 158, in get_app_errors
self._populate()
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/loading.py", line 64, in _populate
self.load_app(app_name, True)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/db/models/loading.py", line 88, in load_app
models = import_module('.models', app_name)
File "/System/Library/Frameworks/Python.framework/Versions/2.6/lib/python2.6/site-packages/django/utils/importlib.py", line 35, in import_module
__import__(name)
File "/Users/art/git/Shipyard/shipyard/archive/models.py", line 20, in <module>
import method.models
File "/Users/art/git/Shipyard/shipyard/method/models.py", line 14, in <module>
import file_access_utils, transformation.models
File "/Users/art/git/Shipyard/shipyard/transformation/models.py", line 43, in <module>
class Transformation(models.Model):
File "/Users/art/git/Shipyard/shipyard/transformation/models.py", line 108, in Transformation
@transaction.atomic
AttributeError: 'module' object has no attribute 'atomic'
Tests in sandbox.tests_rm pertaining to reusing an ExecRecord (test_execute_pipeline_fill_in_ER, test_execute_pipeline_reuse, and test_execute_pipeline_reuse_within_different_pipeline) are all failing. ExecRecords are not being reused, despite the presence of others which should be compatible.
when executing nukeDB.bash, there's an error attempting to load initial_data.json
Traceback (most recent call last):
  File "./manage.py", line 10, in <module>
    execute_from_command_line(sys.argv)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 399, in execute_from_command_line
    utility.execute()
  File "/usr/local/lib/python2.7/site-packages/django/core/management/__init__.py", line 392, in execute
    self.fetch_command(subcommand).run_from_argv(self.argv)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 242, in run_from_argv
    self.execute(*args, **options.__dict__)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 285, in execute
    output = self.handle(*args, **options)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/base.py", line 415, in handle
    return self.handle_noargs(**options)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/commands/syncdb.py", line 112, in handle_noargs
    emit_post_sync_signal(created_models, verbosity, interactive, db)
  File "/usr/local/lib/python2.7/site-packages/django/core/management/sql.py", line 216, in emit_post_sync_signal
    interactive=interactive, db=db)
  File "/usr/local/lib/python2.7/site-packages/django/dispatch/dispatcher.py", line 185, in send
    response = receiver(signal=self, sender=sender, **named)
  File "/usr/local/lib/python2.7/site-packages/django/contrib/auth/management/__init__.py", line 82, in create_permissions
    ctype = ContentType.objects.db_manager(db).get_for_model(klass)
  File "/usr/local/lib/python2.7/site-packages/django/contrib/contenttypes/models.py", line 47, in get_for_model
    defaults = {'name': smart_text(opts.verbose_name_raw)},
  File "/usr/local/lib/python2.7/site-packages/django/db/models/manager.py", line 154, in get_or_create
    return self.get_queryset().get_or_create(**kwargs)
  File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 388, in get_or_create
    six.reraise(*exc_info)
  File "/usr/local/lib/python2.7/site-packages/django/db/models/query.py", line 379, in get_or_create
    with transaction.atomic(using=self.db):
  File "/usr/local/lib/python2.7/site-packages/django/db/transaction.py", line 277, in __enter__
    connection._start_transaction_under_autocommit()
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 436, in _start_transaction_under_autocommit
    self.cursor().execute("BEGIN")
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/util.py", line 69, in execute
    return super(CursorDebugWrapper, self).execute(sql, params)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/util.py", line 53, in execute
    return self.cursor.execute(sql, params)
  File "/usr/local/lib/python2.7/site-packages/django/db/utils.py", line 99, in __exit__
    six.reraise(dj_exc_type, dj_exc_value, traceback)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/util.py", line 51, in execute
    return self.cursor.execute(sql)
  File "/usr/local/lib/python2.7/site-packages/django/db/backends/sqlite3/base.py", line 448, in execute
    return Database.Cursor.execute(self, query)
django.db.utils.OperationalError: cannot start a transaction within a transaction
I have used the interface to create three CodeResourceRevisions for the same CodeResource, all with the same name. When I go to create a Method, and select the relevant CodeResource, the "Revisions" dropdown is populated with three identical choices. I can't tell which version is which.
I can see two choices: