Find me on the web, talk to me on Twitter.
Current open source focus: OMEN - Collaborative Annotation Platform for NLP
OMEN - A dockerized, collaborative, annotation platform.
License: MIT License
The inspect view should include a feature to accept all annotations by another annotator.
Potential implementations:
Should work fine with a simple implementation that maintains the information in the dataset's metadata field.
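A minimal sketch of the metadata-based approach. All names here (Sample, Dataset, accept_all_from, the metadata dict shape) are hypothetical and only illustrate the idea; the real OMEN entities will differ:

```python
# Hypothetical, simplified models to sketch the "accept all" feature.
class Sample:
    def __init__(self):
        self.annotations = {}  # annotator_id -> tag

class Dataset:
    def __init__(self, samples):
        self.samples = samples
        self.metadata = {}  # persisted JSON metadata field

    def accept_all_from(self, curator_id, source_id):
        """Copy every annotation by source_id to curator_id and keep an
        audit trail of the bulk accept in the dataset's metadata field."""
        accepted = 0
        for sample in self.samples:
            if source_id in sample.annotations:
                sample.annotations[curator_id] = sample.annotations[source_id]
                accepted += 1
        self.metadata.setdefault("bulk_accepts", []).append(
            {"by": curator_id, "from": source_id, "count": accepted}
        )
        return accepted
```

Storing the audit log in the existing metadata field avoids a schema migration while still recording who accepted what.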
Python 3.8 is available for both the development and prod base images.
Describe the bug
The search functionality during curation behaves unexpectedly, only showing case-sensitive matches.
Expected behavior
Case-insensitive search would make more sense.
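For in-memory filtering, the fix can be as small as comparing casefolded strings (on the database side, `ilike()` or wrapping both sides in `lower()` would achieve the same in a SQLAlchemy query). A hedged sketch, not OMEN's actual search code:

```python
def matches(content, query):
    """Case-insensitive substring match for curation search.
    casefold() also normalizes non-ASCII cases that lower() misses
    (e.g. German eszett)."""
    return query.casefold() in content.casefold()
```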
Current state: The menu option for this is already integrated into the appropriate tab and restricted to numeric data attributes present in the dataset.
Details:
attribute < value
and another attribute >= value
Targets:
Visibility:
Meta-issue to prepare the system for external ML components #27
A CSV export already exists but should be improved upon for the integration of external ML components.
Make sure this is flexible enough to also support CoNLL-U later on for token-level tasks (this might require hard-coding a few assumptions on preprocessing and separating tokenization from generic feature extraction).
All implementations should allow for bulk downloads, as well as API access with pagination.
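The pagination contract could look like the following sketch (hypothetical helper and response shape, independent of the export format), so clients can iterate until `has_next` is false:

```python
def paginate(items, page, per_page=100):
    """Slice an export into pages; returns the rows plus paging
    metadata so API clients can fetch the dataset incrementally."""
    start = (page - 1) * per_page
    rows = items[start:start + per_page]
    return {
        "page": page,
        "per_page": per_page,
        "total": len(items),
        "has_next": start + per_page < len(items),
        "rows": rows,
    }
```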
Filters:
DatasetContent: fields to store preprocessed values (e.g. tokenized text)
Annotation: sub-class that stores model predictions and confidence values (or uncertainty measures); might require a separate entity if we also want to store topic model or clustering results

User reports being shown the same sequence of already annotated samples (small dataset of N<=10, reproduced after all samples have been annotated once).
The flask_secret option key in config.json is currently set manually during deployment and bears the risk of being copied between multiple deployments of the software. It would be nicer to auto-generate this for the user (only if it has not been set at the time the server is started). The docs suggest simply using os.urandom(16) to generate the key, which might be good enough for a strong default. Make sure to include a default value for this in config.json.

Maintaining both doesn't make sense and leaves the main readme in a relatively useless state. While at it, also change the gh-pages template; the current one seems tedious to read.
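A sketch of the auto-generation logic (function name and config handling are assumptions; only the os.urandom(16) suggestion comes from the docs):

```python
import json
import os

def ensure_flask_secret(config_path):
    """Generate flask_secret on first server start if it is not already
    set, so the key is never copied between deployments."""
    with open(config_path) as fh:
        config = json.load(fh)
    if not config.get("flask_secret"):
        # 16 random bytes, hex-encoded for safe storage in JSON
        config["flask_secret"] = os.urandom(16).hex()
        with open(config_path, "w") as fh:
            json.dump(config, fh, indent=2)
    return config["flask_secret"]
```

Subsequent starts read the stored value back instead of regenerating it, so sessions survive restarts.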
Note: Investigate whether this is blocked by the reverse proxy first; otherwise, Popper.js should already be included correctly.
The package built through GitHub's Packages feature based on Dockerfile.prod works locally but has permission issues when pulled:
omen_1 | Traceback (most recent call last):
omen_1 | File "/usr/local/bin/gunicorn", line 8, in <module>
omen_1 | sys.exit(run())
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in run
omen_1 | WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 228, in run
omen_1 | super().run()
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 72, in run
omen_1 | Arbiter(self).run()
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 58, in __init__
omen_1 | self.setup(app)
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 93, in setup
omen_1 | self.log = self.cfg.logger_class(app.cfg)
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/glogging.py", line 195, in __init__
omen_1 | self.setup(cfg)
omen_1 | File "/usr/local/lib/python3.7/site-packages/gunicorn/glogging.py", line 207, in setup
omen_1 | self.logfile = open(cfg.errorlog, 'a+')
omen_1 | PermissionError: [Errno 13] Permission denied: '/home/omenuser/logs/error_log.log'
The above was performed with the following service definition in examples/docker-compose.yml:
version: '3'
services:
  omen:
    image: docker.pkg.github.com/frankgrimm/omen/omen-prod:latest
    restart: always
    ports:
      - "5000:5000"
    volumes:
      - ${PWD}/app:/home/omenuser/app/app
      - ${PWD}/logs:/home/omenuser/logs
      - ${PWD}/config.json:/home/omenuser/app/config.json
    environment:
      - LOGLEVEL=debug
      - WORKERS_PER_CORE=1
Performance for datasets > 100k samples suffers, and the dataframe abstraction leads to odd structures in several places. It would be better to refactor the persistence layer to separate dataset metadata and content more cleanly.
To prepare support for larger datasets and optimize performance, the sample selection logic and eager data loading from views.py::annotate should be moved to the Dataset class.
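A hypothetical sketch of what the moved selection logic could look like (names and data layout are assumptions, not OMEN's actual API). Keeping only sample ids in memory avoids eager loading, and returning None instead of cycling would also address the reported repetition of already annotated samples:

```python
class Dataset:
    def __init__(self, sample_ids, annotated_by):
        self.sample_ids = sample_ids      # ids only; content loaded lazily
        self.annotated_by = annotated_by  # user -> set of annotated ids

    def next_sample_id(self, user):
        """Pick the next unannotated sample for a user without loading
        the full dataframe; returns None once the user is done."""
        done = self.annotated_by.get(user, set())
        for sid in self.sample_ids:
            if sid not in done:
                return sid
        return None
```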
JWT examples (flask-smorest #36) and descriptor
optional: include swagger-ui-dist or ReDoc
The current color palette for the charts in the inspect view has some issues:
The activity overview after logging in is currently limited to one's own events. There's also an issue with some of the filtering/limiting being applied in application code after the database query instead of within the query itself.
Refactoring should include:
Display inter-annotator agreement, overall and per tag.
A good place might be next to or below the tag overview chart.
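Cohen's kappa is one candidate metric for the pairwise agreement display; a self-contained sketch (the aggregation across annotator pairs and per-tag breakdown would be built on top of this):

```python
def cohens_kappa(a, b):
    """Cohen's kappa between two annotators' label sequences:
    observed agreement corrected for chance agreement."""
    assert len(a) == len(b) and a
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    labels = set(a) | set(b)
    # chance agreement from each annotator's label distribution
    expected = sum((a.count(l) / n) * (b.count(l) / n) for l in labels)
    if expected == 1:
        return 1.0
    return (observed - expected) / (1 - expected)
```

Per-tag agreement can reuse the same function on binarized sequences (tag vs. not-tag).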
While we have a progress indicator for overall annotation progress on datasets, we currently have no place for users to get an overview on progress in individual work packages.
Since we might get quite a bit of information here (e.g. a large dataset with 10 work packages and 5-10 annotators each) it might be prudent to put it next to the annotation overview in the curation view:
Alternatively, we might want to use this feature as a reason to separate the annotation overview and work package progress into their own "analytics" view; the metadata and inter-annotator agreement calculations seem quite hidden in the curation view.
Some annotation tasks might benefit from hiding other annotators' tags from users with owner/curator roles while annotating.
This should be introduced as a setting for the overall dataset (or potentially task if implemented after those models are separated).
The annotation view is currently a bottleneck in the UX flow since updates are rather slow.
It would be better to gracefully upgrade all interactions to use fetch instead of a full page reload.
Separating the dataset from tasks defined on it would
The sqlalchemy relationships for datasets require proper configuration of cascade behaviour.
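A sketch of explicit cascade configuration (model names are assumptions; only the cascade mechanics matter), so that deleting a dataset also removes its dependent rows instead of leaving orphans or failing on foreign key constraints:

```python
from sqlalchemy import Column, ForeignKey, Integer, create_engine
from sqlalchemy.orm import Session, declarative_base, relationship

Base = declarative_base()

class Dataset(Base):
    __tablename__ = "datasets"
    id = Column(Integer, primary_key=True)
    annotations = relationship(
        "Annotation",
        cascade="all, delete-orphan",  # delete children with the parent
        back_populates="dataset",
    )

class Annotation(Base):
    __tablename__ = "annotations"
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, ForeignKey("datasets.id"))
    dataset = relationship("Dataset", back_populates="annotations")
```

With `delete-orphan`, detaching an annotation from its dataset also deletes it, which is usually what an annotation platform wants.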
(possibly also by annotator)
The current behaviour is confusing since no indication is given to the user when an annotator for a particular subtask already has dataset-wide privileges.