frankgrimm / omen

OMEN - A dockerized, collaborative annotation platform.

License: MIT License

CSS 21.22% JavaScript 53.50% HTML 7.11% Python 14.40% Shell 0.09% Roff 3.65% Mako 0.03%
docker flask nlp webapp


omen's People

Contributors

dependabot[bot], frankgrimm


omen's Issues

inspect view / curator role: batch accept annotations

The inspect view should include a feature to accept all annotations by another annotator.

Potential implementations:

  • Accept all annotations by user X for all samples (overwriting existing).
  • Accept all annotations by user X for samples that are not tagged yet.
  • Multiselect on the left-hand side of the inspect view, combined with both of the above options for the selection (a sketch of the first two modes follows this list).
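A minimal sketch of the first two modes, assuming annotations are keyed by sample and user; all names here are hypothetical, not the actual OMEN data model:

def batch_accept(annotations, curated, source_user, overwrite=False):
    """Accept all annotations by source_user as curated tags.

    annotations: sample_id -> {user: tag}; curated: sample_id -> accepted tag.
    """
    for sample_id, votes in annotations.items():
        if source_user not in votes:
            continue
        if sample_id in curated and not overwrite:
            continue  # mode 2: only fill samples that are not tagged yet
        curated[sample_id] = votes[source_user]  # mode 1: overwrite existing
    return curated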

dismissible notifications

  • give notifications a sticky position
  • optional stacking
  • dismiss success/info/default notifications after 5s, others (error/warning) after 10s
  • pause dismissal timer on hover, restart on blur
  • add icon to dismiss notification immediately

multi-label support

  • dataset setting
    • backend
    • UI
  • change sample/annotation model to accommodate multiple labels per sample
  • change user interaction flow (fetch-based setter route)
  • change inter-annotator agreement to reflect multi-label scenario

work package editor: split by value

Current state: The menu option for this is already integrated into the appropriate tab and restricted to numeric data attributes present in the dataset.

Details:

  • The UX needs a modal that, ideally, displays the actual value range present in the selected numeric attribute.
  • The backend implementation should be straightforward: create one work package where attribute < value and another where attribute >= value (see the sketch below).
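A sketch of the split under the assumption that dataset content is still held in a pandas dataframe; the function and column names are illustrative:

import pandas as pd

def split_by_value(df, column, value):
    """Split sample indices into two work packages on a numeric attribute:
    one where df[column] < value, one where df[column] >= value."""
    below = df.index[df[column] < value].tolist()
    at_or_above = df.index[df[column] >= value].tolist()
    return below, at_or_above

# example: split on the median of a hypothetical "score" column
df = pd.DataFrame({"text": list("abcd"), "score": [0.1, 0.4, 0.6, 0.9]})
print(split_by_value(df, "score", df["score"].median()))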

curator role: import annotations

  • upload CSV file, specify ID and tag/annotation column
  • option to either overwrite existing annotations (this should carry a severe warning label) or only insert new ones (see the sketch below)
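A rough sketch of the import logic, assuming the upload is parsed with pandas and existing annotations are available as a sample_id -> tag mapping; all names are hypothetical:

import pandas as pd

def import_annotations(csv_path, id_column, tag_column, existing, overwrite=False):
    """Merge annotations from an uploaded CSV into existing (sample_id -> tag)."""
    df = pd.read_csv(csv_path, dtype=str)
    imported, skipped = 0, 0
    for sample_id, tag in zip(df[id_column], df[tag_column]):
        if sample_id in existing and not overwrite:
            skipped += 1  # insert-only mode keeps existing annotations
            continue
        existing[sample_id] = tag
        imported += 1
    return imported, skipped  # counts for a summary shown to the curator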

Introduce comments and notes

Targets:

  • Dataset (e.g. to discuss and clarify annotation guidelines)
  • Individual Samples (e.g. to discuss issues with a particular decision).

Visibility:

  • Private (notes that are only visible to the current user account)
  • Public (anybody with rights to the particular item can see your comments)

Dataset Preprocessing, additional Export Formats and Options

Meta-issue to prepare the system for external ML components #27

A CSV export already exists but should be improved upon for the integration of external ML components.

Additional export formats

  • metadata (JSON)
    • task metadata description
    • vocabulary table
  • JSON
  • ARFF
  • libsvm

Make sure this is flexible enough to support CoNLL-U later on for token-level tasks (this might require hard-coding a few assumptions about preprocessing and separating tokenization from generic feature extraction).

All implementations should allow for bulk downloads, as well as API access with pagination (see the route sketch after the filter list below).

Filters:

  • all annotations vs. curated gold annotations only
  • all fields vs. sample_index, id column, text column, gold annotation only
  • preprocessed vs raw text vs both
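A minimal sketch of a paginated export route in Flask; the URL, parameter names, and in-memory sample store are assumptions for illustration, not the actual OMEN API:

from flask import Flask, jsonify, request

app = Flask(__name__)
SAMPLES = [{"sample_index": i, "text": "sample %d" % i} for i in range(1000)]

@app.route("/api/datasets/<int:dataset_id>/export")
def export_dataset(dataset_id):
    """JSON export with 1-based pages; page size is capped to bound responses."""
    page = max(int(request.args.get("page", 1)), 1)
    size = min(int(request.args.get("size", 100)), 500)
    start = (page - 1) * size
    return jsonify({
        "dataset": dataset_id,
        "page": page,
        "total": len(SAMPLES),
        "items": SAMPLES[start:start + size],
    })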

Data model changes

  • add DatasetContent fields to store preprocessed values (e.g. tokenized text)
  • add a separate table for a feature vocabulary (in combination with the above this builds a simple feature store); include metrics like term frequency (tf) and document frequency (df)
  • add an inherited Annotation sub-class that stores model predictions and confidence values (or uncertainty measures); this might require a separate entity if we also want to store topic-model or clustering results (see the sketch after this list)
  • figure out what to do with model artifacts for continuous learning scenarios
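A sketch of the inherited sub-class using SQLAlchemy single-table inheritance (1.4+ style); the column set is an assumption, not the actual OMEN schema:

from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Annotation(Base):
    __tablename__ = "annotations"
    id = Column(Integer, primary_key=True)
    sample_id = Column(Integer, nullable=False)
    tag = Column(String, nullable=False)
    kind = Column(String, nullable=False, default="human")
    __mapper_args__ = {"polymorphic_on": kind, "polymorphic_identity": "human"}

class ModelAnnotation(Annotation):
    """Prediction rows share the table but carry extra measures."""
    confidence = Column(Float)   # e.g. predicted probability of the tag
    uncertainty = Column(Float)  # e.g. entropy of the label distribution
    model_ref = Column(String)   # pointer to the producing model artifact
    __mapper_args__ = {"polymorphic_identity": "model"}

Base.metadata.create_all(create_engine("sqlite:///:memory:"))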

autogenerate flask_secret if not configured

The flask_secret option key in config.json is currently set manually during deployment and bears the risk of being copied between multiple deployments of the software.

It would be nicer to auto-generate this for the user (only if it has not been set at the time the server is started). The docs suggest simply using os.urandom(16) to generate the key, which might be good enough for a strong default.
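A minimal sketch of the startup check, assuming config.json is a flat JSON object; secrets.token_hex(16) is the stdlib convenience wrapper that draws 16 bytes from os.urandom, matching the suggested default:

import json
import secrets

def ensure_flask_secret(config_path="config.json"):
    """Generate and persist flask_secret only if it is not already set."""
    with open(config_path) as fh:
        config = json.load(fh)
    if not config.get("flask_secret"):
        config["flask_secret"] = secrets.token_hex(16)
        with open(config_path, "w") as fh:
            json.dump(config, fh, indent=4)
    return config["flask_secret"]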

Merge project README.md and docs main page

Maintaining both doesn't make sense and leaves the main README in a relatively useless state. While at it, also change the gh-pages template; the current one seems tedious to read.

gh-packages based deployment

The package built through GitHub's Packages feature, based on Dockerfile.prod, works locally but has permission issues when pulled:

omen_1  | Traceback (most recent call last):
omen_1  |   File "/usr/local/bin/gunicorn", line 8, in <module>
omen_1  |     sys.exit(run())
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in run
omen_1  |     WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 228, in run
omen_1  |     super().run()
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 72, in run
omen_1  |     Arbiter(self).run()
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 58, in __init__
omen_1  |     self.setup(app)
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 93, in setup
omen_1  |     self.log = self.cfg.logger_class(app.cfg)
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/glogging.py", line 195, in __init__
omen_1  |     self.setup(cfg)
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/glogging.py", line 207, in setup
omen_1  |     self.logfile = open(cfg.errorlog, 'a+')
omen_1  | PermissionError: [Errno 13] Permission denied: '/home/omenuser/logs/error_log.log'

The above was performed with the following service definition in examples/docker-compose.yml:

version: '3'
services:
    omen:
        image: docker.pkg.github.com/frankgrimm/omen/omen-prod:latest
        restart: always
        ports:
            - "5000:5000"
        volumes:
            - ${PWD}/app:/home/omenuser/app/app
            - ${PWD}/logs:/home/omenuser/logs
            - ${PWD}/config.json:/home/omenuser/app/config.json
        environment:
            - LOGLEVEL=debug
            - WORKERS_PER_CORE=1

persistence: materialize dataset content in own table

Performance for datasets > 100k samples suffers, and the dataframe abstraction leads to odd structures in several places. It would be better to refactor the persistence layer to separate dataset metadata and content more cleanly.

  • introduce a new model DatasetContent (including a generated index and the additional columns from the original upload); see the sketch after this list
  • make sure an Alembic migration that adds the new table exists and bump the application version
  • change dataset editor: add / replace content
  • pagination and queries via that entity instead of pandas
  • view updates (backend)
  • view updates (frontend)
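A sketch of what the new entity and a database-side pagination query could look like; the column names beyond sample_index are assumptions:

from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DatasetContent(Base):
    """One row per sample instead of a serialized dataframe."""
    __tablename__ = "datasetcontent"
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, nullable=False, index=True)
    sample_index = Column(Integer, nullable=False)  # generated row index
    text = Column(Text, nullable=False)
    extra = Column(Text)  # additional upload columns, e.g. stored as JSON

def page_of_samples(session, dataset_id, page, size=50):
    """Paginate in SQL instead of slicing a pandas dataframe in memory."""
    return (session.query(DatasetContent)
            .filter_by(dataset_id=dataset_id)
            .order_by(DatasetContent.sample_index)
            .offset((page - 1) * size)
            .limit(size)
            .all())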

move sample selection logic

To prepare support for larger datasets and optimize performance, the sample selection logic and eager data loading in views.py::annotate should be moved to the Dataset class.

Learning Component

  • interface that, optionally, allows outsourcing the computation to another node
  • mlflow support (local and remote)
  • system- and dataset-wide configuration for a) manual and b) active learning
  • active learning config: minimum time delta to the last finished training run, minimum annotation delta to the previous training run (see the trigger sketch after this list)
  • active learning: how do we best handle invalidation and full retraining if dataset was replaced or annotation was changed?
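A sketch of the trigger check combining both configured deltas; the default thresholds are placeholders:

from datetime import datetime, timedelta

def should_retrain(last_run_at, annotations_at_last_run, annotations_now,
                   min_time_delta=timedelta(minutes=30), min_annotation_delta=50):
    """Gate active-learning retraining on both configured minimum deltas."""
    if last_run_at is None:
        return True  # no finished training run yet
    waited = datetime.utcnow() - last_run_at >= min_time_delta
    enough_new = annotations_now - annotations_at_last_run >= min_annotation_delta
    return waited and enough_new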

inspect view: chart, support dynamic color scheme for N>6 tags

The current color palette for the charts in the inspect view has some issues:

  • Only six predefined colors, which are easily exhausted. A generated palette should be added (maybe even configurable via the tag metadata); see the sketch below.
  • Fallback palette colors assigned to tags that do not define their own may take over colors that are explicitly used by tags later in the sequence.
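A sketch of a dynamic palette that keeps the predefined colors and fills the remainder with evenly spaced hues; the hex values here stand in for the actual predefined palette:

import colorsys

PREDEFINED = ("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b")

def tag_palette(n):
    """Return n colors: predefined ones first, then evenly spaced HLS hues."""
    colors = list(PREDEFINED[:n])
    extra = n - len(colors)
    for i in range(extra):
        r, g, b = colorsys.hls_to_rgb(i / extra, 0.5, 0.75)
        colors.append("#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255)))
    return colors

print(tag_palette(10))  # six predefined colors plus four generated ones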

Activities: Include activity from other accounts

The activity overview after logging in is currently limited to one's own events. There's also an issue with some of the filtering/limiting being enforced after the database query.

Refactoring should include:

  • limit relevant activity types and visibility/scope at query time (see the query sketch after this list)
  • show comment activity for all users on datasets with at least a curator role
  • show work package / annotation completion events on datasets with at least a curator role for all annotators (requires a new event type)
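A sketch of moving the filtering and limiting into the query itself; the Activity model shown here is hypothetical:

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Activity(Base):
    __tablename__ = "activities"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, nullable=False)
    dataset_id = Column(Integer, nullable=False)
    event_type = Column(String, nullable=False)  # e.g. "comment", "wp_complete"
    created = Column(DateTime, nullable=False)

def recent_activity(session, dataset_ids, event_types, limit=20):
    """Apply type/scope filters and the limit in SQL, not in Python."""
    return (session.query(Activity)
            .filter(Activity.dataset_id.in_(dataset_ids))
            .filter(Activity.event_type.in_(event_types))
            .order_by(Activity.created.desc())
            .limit(limit)
            .all())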

work packages: progress overview

While we have a progress indicator for overall annotation progress on datasets, we currently have no place for users to get an overview on progress in individual work packages.

Since we might get quite a bit of information here (e.g. a large dataset with 10 work packages and 5-10 annotators each), it might be prudent to put it next to the annotation overview in the curation view.

Alternatively, we might want to use this feature as a reason to separate the annotation overview and work package progress out into their own "analytics" view; the metadata and inter-annotator agreement calculation seem quite hidden there anyway.

role changes: Owner, Curator, Annotator

  • do not implicitly give the dataset owner annotation and curation privileges
  • annotation and curation privileges are currently mutually exclusive (and curation implies annotation); move these to separate privilege flags instead (see the sketch below)
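A minimal sketch of the separated flags; the names are illustrative:

from dataclasses import dataclass

@dataclass
class DatasetRole:
    """Independent privilege flags instead of mutually exclusive roles."""
    is_owner: bool = False
    can_annotate: bool = False
    can_curate: bool = False  # no longer implies can_annotate

# an owner who neither annotates nor curates is now expressible:
role = DatasetRole(is_owner=True)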

introduce dataset level setting to hide votes during annotation

Some annotation tasks might benefit from hiding other annotators' tags from users with owner/curator roles while they are annotating.

This should be introduced as a setting for the overall dataset (or potentially task if implemented after those models are separated).

Asynchronous Annotation Task

The annotation view is currently a bottleneck in the UX flow since updates are rather slow.

It would be better to gracefully upgrade all interactions to use fetch instead of a full page reload.
