frankgrimm / omen

OMEN - A dockerized, collaborative annotation platform.

License: MIT License

CSS 21.22% JavaScript 53.50% HTML 7.11% Python 14.40% Shell 0.09% Roff 3.65% Mako 0.03%
docker flask nlp webapp


omen's People

Contributors

dependabot[bot], frankgrimm


omen's Issues

inspect view / curator role: batch accept annotations

The inspect view should include a feature to accept all annotations by another annotator.

Potential implementations:

  • Accept all annotations by user X for all samples (overwriting existing).
  • Accept all annotations by user X for samples that are not tagged yet.
  • Multiselect on the left-hand side of the inspect view, combined with both of the above options for the selection (a sketch of the first two modes follows this list).
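A minimal sketch of the first two modes, assuming annotations are keyed by sample and user; all names here are hypothetical, not the actual OMEN data model:

def batch_accept(annotations, curated, source_user, overwrite=False):
    """Accept all annotations by source_user as curated tags.

    annotations: sample_id -> {user: tag}; curated: sample_id -> accepted tag.
    """
    for sample_id, votes in annotations.items():
        if source_user not in votes:
            continue
        if sample_id in curated and not overwrite:
            continue  # mode 2: only fill samples that are not tagged yet
        curated[sample_id] = votes[source_user]  # mode 1: overwrite existing
    return curated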

dismissible notifications

  • give notifications a sticky position
  • optional stacking
  • dismiss success/info/default notifications after 5s, others (error/warning) after 10s
  • pause dismissal timer on hover, restart on blur
  • add icon to dismiss notification immediately

multi-label support

  • dataset setting
    • backend
    • UI
  • change sample/annotation model to accommodate multiple labels per sample
  • change user interaction flow (fetch-based setter route)
  • change inter-annotator agreement to reflect multi-label scenario

work package editor: split by value

Current state: The menu option for this is already integrated into the appropriate tab and restricted to numeric data attributes present in the dataset.

Details:

  • The UX needs a modal that, ideally, displays the actual value range present in the selected numeric attribute.
  • The backend implementation should be straightforward: create one work package where attribute < value and another where attribute >= value (see the sketch below).
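A sketch of the split under the assumption that dataset content is still held in a pandas dataframe; the function and column names are illustrative:

import pandas as pd

def split_by_value(df, column, value):
    """Split sample indices into two work packages on a numeric attribute:
    one where df[column] < value, one where df[column] >= value."""
    below = df.index[df[column] < value].tolist()
    at_or_above = df.index[df[column] >= value].tolist()
    return below, at_or_above

# example: split on the median of a hypothetical "score" column
df = pd.DataFrame({"text": list("abcd"), "score": [0.1, 0.4, 0.6, 0.9]})
print(split_by_value(df, "score", df["score"].median()))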

curator role: import annotations

  • upload CSV file, specify ID and tag/annotation column
  • option to either overwrite existing annotations (this should carry a severe warning label) or only insert new ones (see the sketch below)
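A rough sketch of the import logic, assuming the upload is parsed with pandas and existing annotations are available as a sample_id -> tag mapping; all names are hypothetical:

import pandas as pd

def import_annotations(csv_path, id_column, tag_column, existing, overwrite=False):
    """Merge annotations from an uploaded CSV into existing (sample_id -> tag)."""
    df = pd.read_csv(csv_path, dtype=str)
    imported, skipped = 0, 0
    for sample_id, tag in zip(df[id_column], df[tag_column]):
        if sample_id in existing and not overwrite:
            skipped += 1  # insert-only mode keeps existing annotations
            continue
        existing[sample_id] = tag
        imported += 1
    return imported, skipped  # counts for a summary shown to the curator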

Introduce comments and notes

Targets:

  • Dataset (e.g. to discuss and clarify annotation guidelines)
  • Individual Samples (e.g. to discuss issues with a particular decision).

Visibility:

  • Private (notes that are only visible to the current user account)
  • Public (anybody with rights to the particular item can see your comments)

Dataset Preprocessing, additional Export Formats and Options

Meta-issue to prepare the system for external ML components #27

A CSV export already exists but should be improved upon for the integration of external ML components.

Additional export formats

  • metadata (JSON)
    • task metadata description
    • vocabulary table
  • JSON
  • ARFF
  • libsvm

Make sure this is flexible enough to support CoNLL-U later on for token-level tasks (this might require hard-coding a few assumptions about preprocessing and separating tokenization from generic feature extraction).

All implementations should allow for bulk downloads, as well as API access with pagination (see the route sketch after the filter list below).

Filters:

  • all annotations vs. curated gold annotations only
  • all fields vs. sample_index, id column, text column, gold annotation only
  • preprocessed vs raw text vs both
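A minimal sketch of a paginated export route in Flask; the URL, parameter names, and in-memory sample store are assumptions for illustration, not the actual OMEN API:

from flask import Flask, jsonify, request

app = Flask(__name__)
SAMPLES = [{"sample_index": i, "text": "sample %d" % i} for i in range(1000)]

@app.route("/api/datasets/<int:dataset_id>/export")
def export_dataset(dataset_id):
    """JSON export with 1-based pages; page size is capped to bound responses."""
    page = max(int(request.args.get("page", 1)), 1)
    size = min(int(request.args.get("size", 100)), 500)
    start = (page - 1) * size
    return jsonify({
        "dataset": dataset_id,
        "page": page,
        "total": len(SAMPLES),
        "items": SAMPLES[start:start + size],
    })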

Data model changes

  • add DatasetContent fields to store preprocessed values (e.g. tokenized text)
  • add a separate table for a feature vocabulary (in combination with the above this builds a simple feature store); include metrics like term frequency (tf) and document frequency (df)
  • add an inherited Annotation sub-class that stores model predictions and confidence values (or uncertainty measures); this might require a separate entity if we also want to store topic-model or clustering results (see the sketch after this list)
  • figure out what to do with model artifacts for continuous learning scenarios
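A sketch of the inherited sub-class using SQLAlchemy single-table inheritance (1.4+ style); the column set is an assumption, not the actual OMEN schema:

from sqlalchemy import Column, Float, Integer, String, create_engine
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Annotation(Base):
    __tablename__ = "annotations"
    id = Column(Integer, primary_key=True)
    sample_id = Column(Integer, nullable=False)
    tag = Column(String, nullable=False)
    kind = Column(String, nullable=False, default="human")
    __mapper_args__ = {"polymorphic_on": kind, "polymorphic_identity": "human"}

class ModelAnnotation(Annotation):
    """Prediction rows share the table but carry extra measures."""
    confidence = Column(Float)   # e.g. predicted probability of the tag
    uncertainty = Column(Float)  # e.g. entropy of the label distribution
    model_ref = Column(String)   # pointer to the producing model artifact
    __mapper_args__ = {"polymorphic_identity": "model"}

Base.metadata.create_all(create_engine("sqlite:///:memory:"))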

autogenerate flask_secret if not configured

The flask_secret option key in config.json is currently set manually during deployment and bears the risk of being copied between multiple deployments of the software.

It would be nicer to auto-generate this for the user (only if it has not been set at the time the server is started). The docs suggest simply using os.urandom(16) to generate the key, which might be good enough for a strong default.
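A minimal sketch of the startup check, assuming config.json is a flat JSON object; secrets.token_hex(16) is the stdlib convenience wrapper that draws 16 bytes from os.urandom, matching the suggested default:

import json
import secrets

def ensure_flask_secret(config_path="config.json"):
    """Generate and persist flask_secret only if it is not already set."""
    with open(config_path) as fh:
        config = json.load(fh)
    if not config.get("flask_secret"):
        config["flask_secret"] = secrets.token_hex(16)
        with open(config_path, "w") as fh:
            json.dump(config, fh, indent=4)
    return config["flask_secret"]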

Merge project README.md and docs main page

Maintaining both doesn't make sense and leaves the main README in a relatively useless state. While at it, also change the gh-pages template; the current one seems tedious to read.

gh-packages based deployment

The package built through GitHub's Packages feature, based on Dockerfile.prod, works locally but has permission issues when pulled:

omen_1  | Traceback (most recent call last):
omen_1  |   File "/usr/local/bin/gunicorn", line 8, in <module>
omen_1  |     sys.exit(run())
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/app/wsgiapp.py", line 58, in run
omen_1  |     WSGIApplication("%(prog)s [OPTIONS] [APP_MODULE]").run()
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 228, in run
omen_1  |     super().run()
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/app/base.py", line 72, in run
omen_1  |     Arbiter(self).run()
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 58, in __init__
omen_1  |     self.setup(app)
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/arbiter.py", line 93, in setup
omen_1  |     self.log = self.cfg.logger_class(app.cfg)
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/glogging.py", line 195, in __init__
omen_1  |     self.setup(cfg)
omen_1  |   File "/usr/local/lib/python3.7/site-packages/gunicorn/glogging.py", line 207, in setup
omen_1  |     self.logfile = open(cfg.errorlog, 'a+')
omen_1  | PermissionError: [Errno 13] Permission denied: '/home/omenuser/logs/error_log.log'

The above was performed with the following service definition in examples/docker-compose.yml:

version: '3'
services:
    omen:
        image: docker.pkg.github.com/frankgrimm/omen/omen-prod:latest
        restart: always
        ports:
            - "5000:5000"
        volumes:
            - ${PWD}/app:/home/omenuser/app/app
            - ${PWD}/logs:/home/omenuser/logs
            - ${PWD}/config.json:/home/omenuser/app/config.json
        environment:
            - LOGLEVEL=debug
            - WORKERS_PER_CORE=1

persistence: materialize dataset content in own table

Performance for datasets > 100k samples suffers, and the dataframe abstraction leads to odd structures in several places. It would be better to refactor the persistence layer to separate dataset metadata and content more cleanly.

  • introduce a new model DatasetContent (including a generated index and the additional columns from the original upload); see the sketch after this list
  • make sure an Alembic migration that adds the new table exists and bump the application version
  • change dataset editor: add / replace content
  • pagination and queries via that entity instead of pandas
  • view updates (backend)
  • view updates (frontend)
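A sketch of what the new entity and a database-side pagination query could look like; the column names beyond sample_index are assumptions:

from sqlalchemy import Column, Integer, Text
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class DatasetContent(Base):
    """One row per sample instead of a serialized dataframe."""
    __tablename__ = "datasetcontent"
    id = Column(Integer, primary_key=True)
    dataset_id = Column(Integer, nullable=False, index=True)
    sample_index = Column(Integer, nullable=False)  # generated row index
    text = Column(Text, nullable=False)
    extra = Column(Text)  # additional upload columns, e.g. stored as JSON

def page_of_samples(session, dataset_id, page, size=50):
    """Paginate in SQL instead of slicing a pandas dataframe in memory."""
    return (session.query(DatasetContent)
            .filter_by(dataset_id=dataset_id)
            .order_by(DatasetContent.sample_index)
            .offset((page - 1) * size)
            .limit(size)
            .all())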

move sample selection logic

To prepare support for larger datasets and optimize performance, the sample selection logic and eager data loading in views.py::annotate should be moved to the Dataset class.

Learning Component

  • interface that, optionally, allows outsourcing the computation to another node
  • mlflow support (local and remote)
  • system- and dataset-wide configuration for a) manual and b) active learning
  • active learning config: minimum time delta to the last finished training run, minimum annotation delta to the previous training run (see the trigger sketch after this list)
  • active learning: how do we best handle invalidation and full retraining if dataset was replaced or annotation was changed?
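A sketch of the trigger check combining both configured deltas; the default thresholds are placeholders:

from datetime import datetime, timedelta

def should_retrain(last_run_at, annotations_at_last_run, annotations_now,
                   min_time_delta=timedelta(minutes=30), min_annotation_delta=50):
    """Gate active-learning retraining on both configured minimum deltas."""
    if last_run_at is None:
        return True  # no finished training run yet
    waited = datetime.utcnow() - last_run_at >= min_time_delta
    enough_new = annotations_now - annotations_at_last_run >= min_annotation_delta
    return waited and enough_new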

inspect view: chart, support dynamic color scheme for N>6 tags

The current color palette for the charts in the inspect view has some issues:

  • Only six predefined colors, which are easily exhausted. A generated palette should be added (maybe even configurable via the tag metadata); see the sketch below.
  • Fallback palette colors assigned to tags that do not define their own may take over colors that are explicitly used by tags later in the sequence.
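A sketch of a dynamic palette that keeps the predefined colors and fills the remainder with evenly spaced hues; the hex values here stand in for the actual predefined palette:

import colorsys

PREDEFINED = ("#1f77b4", "#ff7f0e", "#2ca02c", "#d62728", "#9467bd", "#8c564b")

def tag_palette(n):
    """Return n colors: predefined ones first, then evenly spaced HLS hues."""
    colors = list(PREDEFINED[:n])
    extra = n - len(colors)
    for i in range(extra):
        r, g, b = colorsys.hls_to_rgb(i / extra, 0.5, 0.75)
        colors.append("#%02x%02x%02x" % (int(r * 255), int(g * 255), int(b * 255)))
    return colors

print(tag_palette(10))  # six predefined colors plus four generated ones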

Activities: Include activity from other accounts

The activity overview after logging in is currently limited to one's own events. There's also an issue with some of the filtering/limiting being enforced after the database query.

Refactoring should include:

  • limit relevant activity types and visibility/scope at query time (see the query sketch after this list)
  • show comment activity for all users on datasets with at least a curator role
  • show work package / annotation completion events on datasets with at least a curator role for all annotators (requires a new event type)
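A sketch of moving the filtering and limiting into the query itself; the Activity model shown here is hypothetical:

from sqlalchemy import Column, DateTime, Integer, String
from sqlalchemy.orm import declarative_base

Base = declarative_base()

class Activity(Base):
    __tablename__ = "activities"
    id = Column(Integer, primary_key=True)
    user_id = Column(Integer, nullable=False)
    dataset_id = Column(Integer, nullable=False)
    event_type = Column(String, nullable=False)  # e.g. "comment", "wp_complete"
    created = Column(DateTime, nullable=False)

def recent_activity(session, dataset_ids, event_types, limit=20):
    """Apply type/scope filters and the limit in SQL, not in Python."""
    return (session.query(Activity)
            .filter(Activity.dataset_id.in_(dataset_ids))
            .filter(Activity.event_type.in_(event_types))
            .order_by(Activity.created.desc())
            .limit(limit)
            .all())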

work packages: progress overview

While we have a progress indicator for overall annotation progress on datasets, we currently have no place for users to get an overview on progress in individual work packages.

Since we might get quite a bit of information here (e.g. a large dataset with 10 work packages and 5-10 annotators each), it might be prudent to put it next to the annotation overview in the curation view.

Alternatively, we might want to use this feature as a reason to separate the annotation overview and work package progress out into their own "analytics" view; the metadata and inter-annotator agreement calculation seem quite hidden there anyway.

role changes: Owner, Curator, Annotator

  • do not implicitly give the dataset owner annotation and curation privileges
  • annotation and curation privileges are currently mutually exclusive (and curation implies annotation); move these to separate privilege flags instead (see the sketch below)
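A minimal sketch of the separated flags; the names are illustrative:

from dataclasses import dataclass

@dataclass
class DatasetRole:
    """Independent privilege flags instead of mutually exclusive roles."""
    is_owner: bool = False
    can_annotate: bool = False
    can_curate: bool = False  # no longer implies can_annotate

# an owner who neither annotates nor curates is now expressible:
role = DatasetRole(is_owner=True)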

introduce dataset level setting to hide votes during annotation

Some annotation tasks might benefit from hiding other annotators' tags from users with owner/curator roles while they are annotating.

This should be introduced as a setting for the overall dataset (or potentially task if implemented after those models are separated).

Asynchronous Annotation Task

The annotation view is currently a bottleneck in the UX flow since updates are rather slow.

It would be better to gracefully upgrade all interactions to use fetch instead of a full page reload.
