samstudio8 / chitin
chitin: an awful shell for awful bioinformaticians
Home Page: https://samnicholls.net/2016/11/16/disorganised-disaster/
License: MIT License
Could be useful to take a note of what one was working on in the shell before giving up to drink tea?
Seeing as one can't change directory in this shell, it could be considered as a manager for a given top level analysis directory. We could store the JSON (soon to be sqlite schema) in the same directory and switch to relative paths?
[See #4 ]
As chitin now buffers stdout and stderr, commands that need user input (such as bgzip, when it asks whether to overwrite a file) hang. Nice one, @SamStudio8.
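One possible way around the hang, sketched on the assumption that commands are launched through Python's subprocess module (the `run_captured` helper is hypothetical, not chitin's actual code): give the child an empty stdin, so a prompt reads EOF straight away and the command fails fast instead of waiting forever.

```python
import subprocess
import sys


def run_captured(argv, timeout=None):
    """Run argv with stdout/stderr captured and stdin closed.

    Hypothetical helper: with stdin set to DEVNULL, a program that
    prompts for input sees EOF immediately rather than blocking.
    """
    proc = subprocess.run(
        argv,
        stdin=subprocess.DEVNULL,
        stdout=subprocess.PIPE,
        stderr=subprocess.PIPE,
        timeout=timeout,
    )
    return proc.returncode, proc.stdout, proc.stderr


if __name__ == "__main__":
    # Stand-in for an interactive tool: it tries to read a line from stdin.
    code, out, err = run_captured([sys.executable, "-c", "input('overwrite? ')"])
    print("exit code:", code)
```

The trade-off is that the user never gets to answer the prompt; passing the terminal's stdin through instead would keep prompts usable, though a prompt written to the buffered stdout would still not be visible until the command ends.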
Could we "load in" and parse a script such that we can read all the commands it contains? Do we need to?
For some stupid reason I thought adding multiprocessing to the mix was a great idea.
Now there is a ton of wonky shit going on simultaneously:
We could "drop" (hide/archive) command history for files that are modified after being deleted first.
I've just had to re-run an experiment, I don't need to generate new data, but rather just have an umbrella for a "new set of runs". It would be helpful if there was a CLI/API/Web option to request a new UUID to do this. Bonus points if we held a "parent" experiment or something.
We might need an intermediate class where Experiments have RunGroups with Runs, rather than Runs immediately belonging to an experiment.
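A minimal sketch of that hierarchy, assuming plain Python dataclasses rather than whatever models chitin really uses. The names `Experiment`, `RunGroup` and `Run` come from the text above; the fields (`uid`, `parent`, `groups`, `runs`) and the `new_group` method are made up for illustration.

```python
import uuid
from dataclasses import dataclass, field
from typing import List, Optional


def new_uuid() -> str:
    """Return a fresh random UUID as a string."""
    return str(uuid.uuid4())


@dataclass
class Run:
    uid: str = field(default_factory=new_uuid)


@dataclass
class RunGroup:
    """An umbrella for a "new set of runs" inside one experiment."""
    uid: str = field(default_factory=new_uuid)
    runs: List[Run] = field(default_factory=list)


@dataclass
class Experiment:
    uid: str = field(default_factory=new_uuid)
    parent: Optional["Experiment"] = None  # the optional "parent" experiment
    groups: List[RunGroup] = field(default_factory=list)

    def new_group(self) -> RunGroup:
        """Request a new UUID-labelled group without generating new data."""
        group = RunGroup()
        self.groups.append(group)
        return group
```

A CLI, API or web endpoint would then only need to call `new_group()` and hand back `group.uid`.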
Not urgent but it might make sense to have a queue responsible for database queries (or at least just those that cause insertions) to prevent application code having to deal with locked tables.
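As a rough illustration of the idea, here is a single-writer queue built on the standard library's `queue`, `threading` and `sqlite3` modules. The `WriteQueue` class and its methods are hypothetical; the point is only that one thread owns the connection, so callers never contend for the write lock.

```python
import queue
import sqlite3
import sys
import threading


class WriteQueue:
    """Hypothetical single-writer queue: one thread performs all insertions."""

    def __init__(self, path):
        self._path = path
        self._jobs = queue.Queue()
        self._thread = threading.Thread(target=self._worker, daemon=True)
        self._thread.start()

    def _worker(self):
        # The connection is created and used only inside the writer thread.
        conn = sqlite3.connect(self._path)
        while True:
            job = self._jobs.get()
            if job is None:  # sentinel: stop the worker
                break
            sql, params = job
            try:
                conn.execute(sql, params)
                conn.commit()
            except sqlite3.Error as exc:
                # Report and carry on so one bad statement doesn't kill the queue.
                print("write failed:", exc, file=sys.stderr)
        conn.close()

    def submit(self, sql, params=()):
        """Queue a statement; returns immediately."""
        self._jobs.put((sql, params))

    def close(self):
        """Flush the remaining jobs, then stop the worker."""
        self._jobs.put(None)
        self._thread.join()
```

Application code calls `submit()` and moves on; reads can still use their own connections.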
It would be kinda cool if chitin could enforce a sane working environment at all times:
As part of 2adff2d, I cobbled together a function to hash large files that linearly subsamples bytes to reduce time. I feel this is going to turn out to be a terrible idea, so I'm going to open an issue in advance. Sorry.
Just so you know.
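For anyone wondering what this looks like, here is a sketch of the general technique. It is not the function from 2adff2d; `sampled_hash`, its parameters and the choice of SHA-256 are all assumptions. Small files are hashed in full, and large files are hashed from evenly spaced chunks plus the file size.

```python
import hashlib
import os


def sampled_hash(path, n_samples=64, chunk_size=4096, small_file_limit=1 << 20):
    """Hash a file, reading only evenly spaced chunks of large files.

    Hypothetical sketch: n_samples must be at least 2.
    """
    size = os.path.getsize(path)
    h = hashlib.sha256()
    h.update(str(size).encode())  # mix in the size so truncation is noticed
    with open(path, "rb") as fh:
        if size <= max(small_file_limit, n_samples * chunk_size):
            # Small file: hash everything.
            for chunk in iter(lambda: fh.read(1 << 16), b""):
                h.update(chunk)
        else:
            # Large file: n_samples chunks at a fixed stride, first to last.
            step = (size - chunk_size) // (n_samples - 1)
            for i in range(n_samples):
                fh.seek(i * step)
                h.update(fh.read(chunk_size))
    return h.hexdigest()
```

The reason this may well be a terrible idea: a change of the same length that lands entirely between two sampled chunks leaves the hash untouched.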
Hashing all files in all adjacent directories is not working out
It's not me, it's definitely you. Stop it right now.
It's really annoying
yields hilarious results
find and tar can work in an unexpected fashion when paths are replaced with their absolute counterparts and inserted back into the command string before execution.
We could make aliases work by forcing a source of the bashrc before feeding our commands to the shell. Probably.
Getting an error when launching for the first time. Install (as --user) went smoothly.
Traceback (most recent call last):
  File "/homes/ccole/.local/bin/chitin", line 9, in <module>
    load_entry_point('chitin==0.0.1', 'console_scripts', 'chitin')()
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 542, in load_entry_point
    return get_distribution(dist).load_entry_point(group, name)
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2569, in load_entry_point
    return ep.load()
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2229, in load
    return self.resolve()
  File "/sw/opt/python/2.7.3/lib/python2.7/site-packages/pkg_resources/__init__.py", line 2235, in resolve
    module = __import__(self.module_name, fromlist=['__name__'], level=0)
  File "build/bdist.linux-x86_64/egg/chitin/__init__.py", line 18, in <module>
    'BufrStubImagePlugin',
  File "build/bdist.linux-x86_64/egg/chitin/util.py", line 11, in <module>
  File "build/bdist.linux-x86_64/egg/chitin/record.py", line 6, in <module>
  File "/cluster/gjb_lab/ccole/.local/lib/python2.7/site-packages/Flask_SQLAlchemy-2.1-py2.7.egg/flask_sqlalchemy/__init__.py", line 25, in <module>
    from sqlalchemy import orm, event, inspect
ImportError: cannot import name inspect
Could be a dependency issue. What version requirements are there?
I've been generating big runs of data in directories with UUIDs, and a lookup file that contains the UUID to parameters-what-were-used-to-generate-those-files. This has been pretty handy because all the files are uniquely identifiable and don't feature parameters that become unhelpful, or deprecated, etc. later. This also means I'm not messing about with stupid folder hierarchies: the filesystem is a crap abstraction for the representation of experiment properties.
Because of this hierarchy, I've found myself just applying operations to lists of UUID-named directories, so why not make this a part of chitin? We could just provide a script (or series of commands) and a bunch of UUIDs.
Currently chitin switches ALL paths to an absolute path, but if you wanted to wrap up and throw your workspace on another system, all your paths are suddenly incorrect...
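One way this could work, sketched with nothing but `os.path`: store paths relative to a workspace root and resolve them against whatever the root happens to be on the current machine. The helper names `to_relative` and `to_absolute`, and the idea of a single workspace root, are assumptions for illustration.

```python
import os


def to_relative(path, root):
    """Return path relative to root, or unchanged (absolute) if outside root."""
    abs_path = os.path.abspath(path)
    abs_root = os.path.abspath(root)
    if os.path.commonpath([abs_path, abs_root]) != abs_root:
        return abs_path  # outside the workspace: nothing sensible to relativise
    return os.path.relpath(abs_path, abs_root)


def to_absolute(stored, root):
    """Resolve a stored path against the workspace root on this machine."""
    if os.path.isabs(stored):
        return stored
    return os.path.normpath(os.path.join(os.path.abspath(root), stored))
```

Moving the workspace then only means pointing `root` somewhere else; files outside the workspace stay absolute and will still break on another system.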
I'm not even sure if we want a multiprocessed shell. Sure, it can run jobs in parallel and is trivial to script, and also needs no configuration. But does anyone want this? Is it because I just don't want to hammer at GNU Parallel? I don't know.
Directory Items are somewhat useless. The hash of an Item that is also a directory was designed to detect changes in directories outside of chitin, but serves little purpose outside of the integrity check.
It would be much more useful if we could correspond the hash of a directory to a group of items. So I propose something like an ItemSet object that can represent a directory (or even a group of files belonging to a project, etc.). An Item could easily be in multiple ItemSets.
My ideas for "protecting" files could be instead applied to ItemSets (i.e. flat out prevent clobbering).
An example of where this would be much more useful is tar: where we already capture the directory hash, but cannot easily work out what the file state was at that particular hash.
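To make the proposal concrete, here is a hypothetical ItemSet that derives its hash from the hashes of its members and remembers which members produced each hash. Only the name `ItemSet` comes from the text above; the fields, the methods and the use of SHA-256 are assumptions.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict


@dataclass
class ItemSet:
    """A named group of items, e.g. a directory or a project's files."""
    name: str
    items: Dict[str, str] = field(default_factory=dict)  # path -> item hash
    snapshots: Dict[str, Dict[str, str]] = field(default_factory=dict)

    def add(self, path: str, item_hash: str) -> None:
        """Add an item, or update the hash of an existing one."""
        self.items[path] = item_hash

    def set_hash(self) -> str:
        """Combine member hashes in sorted path order, so insertion order is irrelevant."""
        h = hashlib.sha256()
        for path in sorted(self.items):
            h.update(path.encode("utf-8"))
            h.update(b"\0")
            h.update(self.items[path].encode("utf-8"))
            h.update(b"\n")
        return h.hexdigest()

    def snapshot(self) -> str:
        """Record the current members under the current set hash."""
        digest = self.set_hash()
        self.snapshots[digest] = dict(self.items)
        return digest
```

With something like this, the hash captured before a tar could be looked up in `snapshots` to recover the file state it corresponds to.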
File metadata is stored per user, in their local database. Changes to files are monitored via use of chitin. So we have two points:
chitin users cause changes in the file system?
The first point falls in line with the future development of permitting the database to be on a server instead of just local. I suspect we may have some work to do to ensure that history is processed and stored in the correct order if multiple users do things at once, but I think this will be fine. Caching may also be necessary so users have some history data for when they are offline? But we are a while from this right now anyway.
The second is unlikely to be addressable in a fashion I would like. Right now, chitin will always raise a warning about files that have changed outside of its knowledge, which is reasonable. After all, that is what I care about more than the history: now a user will know if somebody has messed with this file. A potential way to catch this (on a shared computer at least) is to have an additional daemon or kernel module that captures some information - but the reason chitin works the way it does is because it seemed to be a rather easy way of getting this data in the first place! ;)
I would love to try and make a ZFS extension for this, but that's a long time away and possibly beyond my time and ability anyway.
In case you hadn't noticed, I trashed the entire chitin repo to make chitin2. It's around half as much garbage as last time and moves a little away from the idea of replacing your shell, towards wrapping a script to keep track of what happens inside. I got overexcited in the last version, and made chitin a clever, parallelised shell that sent commands to a remote machine and allowed any chitin-capable shell to download and process the jobs. At this point, I realised I'd made a grid engine, so I've nuked the code base and started over: this time trying to remember that the goal of chitin was to be a watchful guardian of your filesystem.
For a reminder of my November 2016 tirade that caused chitin to come into existence, check my blog.
Pretty much all the cool features of chitin1 are missing, but I plan to bring them back:
I've finally made a business decision about the metadata storage part of chitin. I don't like sqlite: the database gets big, slow and locked. Originally the metadata was to be presented in the terminal (and it was), but we've outgrown this by necessity (commands and resources are linked together and I want you to click on them to find out stuff). Thus we're in your browser. The current version of chitin2 has an integrated webserver using Flask and SQLAlchemy but this is troublesome for migration, and it was never my intention to bundle the shell-part and web-part together. Thus my roadmap includes:
chitin-server repo soon. Django is definitely OTT for this, but it's also wonderfully crafted, extendable, well-supported and has an excellent database migration system.
chitin-server
Additional ideas of things that are to come:
Early 2019 Stories
Late 2019 Stories
%history, or create tar archives
Seeing as we can now run bash scripts, it would be nice to group all of the Event objects (that is, the commands executed individually) together under some container. An EventSet seems like a reasonable solution. We can attach the input parameters, name, path, MD5 (or even a copy?) of the script in question to the EventSet such that it is available to every ItemEvent.
We could also have total_wall metadata and such, too.
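A sketch of what such a container might hold, assuming plain dataclasses. `EventSet`, `Event` and `total_wall` are named in the text; every field and the `from_script` constructor are hypothetical.

```python
import hashlib
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class Event:
    """One individually executed command (hypothetical fields)."""
    cmd: str
    wall_seconds: float = 0.0


@dataclass
class EventSet:
    """Groups the Events produced by one script run."""
    script_path: str
    script_md5: str
    params: Dict[str, str] = field(default_factory=dict)
    events: List[Event] = field(default_factory=list)

    @classmethod
    def from_script(cls, script_path: str, script_text: str,
                    params: Optional[Dict[str, str]] = None) -> "EventSet":
        """Build an EventSet, fingerprinting the script text with MD5."""
        digest = hashlib.md5(script_text.encode("utf-8")).hexdigest()
        return cls(script_path, digest, dict(params or {}))

    def add(self, event: Event) -> None:
        self.events.append(event)

    @property
    def total_wall(self) -> float:
        """Sum of the wall-clock times of all member events."""
        return sum(e.wall_seconds for e in self.events)
```

Each ItemEvent would then only need a reference to its EventSet to reach the script's parameters and checksum.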
This is really a tracking issue for me that outlines the issues I have using chitin with my own workflow, but please feel free to add your own.
Were output files borked etc
For example, a run of FASTQC produces a small set of reports - we could attach these to a given FQ file.
Bonus points if we ever switch to Flask and automatically host such metafiles to be served later.