universaldatatool / universal-data-tool

Collaborate & label any type of data, images, text, or documents, in an easy web interface or desktop app.

Home Page: https://universaldatatool.com

License: MIT License

JavaScript 98.34% HTML 1.31% CSS 0.03% Dockerfile 0.03% Shell 0.12% Singularity 0.17%

computer-vision annotate-images entity-recognition desktop classification dataset annotation-tool deep-learning text-annotation named-entity-recognition

universal-data-tool's Introduction

Universal Data Tool

Try it out at udt.dev, download the desktop app or run on-premise.

Docs • Website • Playground • Library Usage • On-Premise

The Universal Data Tool is a web/desktop app for editing and annotating images, text, audio and documents, and for viewing and editing any data defined in the extensible .udt.json and .udt.csv standard.

Supported Data

Image Segmentation • Image Classification • Text Classification • Named Entity Recognition • Named Entity Relations / Part of Speech Tagging • Audio Transcription • Data Entry • Video Segmentation • Landmark / Pose Annotation

Recent Updates

Follow our development on Youtube!

Features

Collaborate with others in real time, no sign up!
Usable on the web or as a Windows, Mac or Linux desktop application
Configure your project with an easy-to-use GUI
Easily create courses to train your labelers
Download/upload as easy-to-use CSV (sample.udt.csv) or JSON (sample.udt.json)
Support for Images, Videos, PDFs, Text, Audio Transcription and many other formats
Can be easily integrated into a React application
Annotate images or videos with classifications, tags, bounding boxes, polygons and points
Fast Automatic Smart Pixel Segmentation using WebWorkers and WebAssembly
Import data from Google Drive, Youtube, CSV, Clipboard and more
Annotate NLP datasets with Named Entity Recognition (NER), classification and Part of Speech (PoS) tagging.
Easily load into pandas or use with fast.ai
Runs with Docker: docker run -p 3000:3000 universaldatatool/universaldatatool
Runs with Singularity: singularity run universaldatatool/universaldatatool

Installation

Web App

Just visit universaldatatool.com!

Trying to run the web app locally? Run npm install then npm run start after cloning this repository to start the web server.

Desktop Application

Download the latest release from the releases page and run the executable you downloaded.

Contributing

(Optional) Say hi in the Slack channel!
Read this guide to get started with development.

Contributors ✨

Thanks goes to these wonderful people (emoji key):

Severin Ibarluzea 💻 📖 👀	Puskuruk 💻 👀	CedricJean 💻	beru 💻	Marc 💻 📖	Wafaa-arbash 📖	Pierre Grimaud 📖
sreevardhanreddi 💻	Mohammed Eldadah 💻	x213212 💻	hysios 💻	Cong Dao 💻	Renato Junior 🌍	Rick 🌍 💻
anaplian 💻	Miguel Carvalho 🌍	Kyle OBrien 💻	Hakkı Yağız ERDİNÇ 💻	João Victor Davim 💻

This project follows the all-contributors specification. Contributions of any kind welcome!


universal-data-tool's Issues

Feature: Suggest Interface Type (fix "undefined" not supported)

We should suggest a dataset type to use based on the sample the user selected. Then show buttons allowing them to configure that interface.

Feature: PDF Image Annotation

Support PDFs in Image Segmentation and Image Classification. Please thumbs up if you want it.

Import JSON/CSV doesn't Import CSVs that don't strictly comply with UDT CSV Format

People should be able to upload a csv like this:

imageUrl
https://example.com/image1.jpg
https://example.com/image2.jpg
https://example.com/image3.jpg

But this currently doesn't parse with the UDT parser
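
A lenient importer could treat the first row as column names and each later row as one sample. The following is a minimal sketch, not the UDT parser: `parseSimpleCsv` is a hypothetical helper, and it assumes the CSV has no quoted fields or embedded commas.

```javascript
// Hypothetical helper, not the UDT parser: turns a plain "header + rows" CSV
// into an array of sample objects. Assumes no quoted fields or embedded commas.
function parseSimpleCsv(text) {
  const lines = text
    .split(/\r?\n/)
    .map((line) => line.trim())
    .filter((line) => line.length > 0);
  if (lines.length === 0) return [];

  // The first non-empty line names the columns, e.g. "imageUrl".
  const header = lines[0].split(",").map((name) => name.trim());

  // Every following line becomes one sample keyed by those column names.
  return lines.slice(1).map((line) => {
    const cells = line.split(",");
    const sample = {};
    header.forEach((name, i) => {
      sample[name] = (cells[i] || "").trim();
    });
    return sample;
  });
}

const samples = parseSimpleCsv(
  "imageUrl\nhttps://example.com/image1.jpg\nhttps://example.com/image2.jpg\n"
);
console.log(samples);
```

Each resulting object has an `imageUrl` key, which is the shape an image sample would presumably need before being handed to the rest of the import flow.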

Exclusive image classification Output should be string, not array

Steps to reproduce:

Create an image classification task with some image samples (use the cat toy dataset) and click "No" to allow multiple classifications
Complete some samples
Examine the JSON via the "Edit JSON" button on the settings page. The taskOutput contains a single-item array for each sample when it should contain a string
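
Going by the issue title, the desired output for an exclusive classification is a plain string. A minimal sketch of that normalization follows; `normalizeExclusiveOutput` is a hypothetical name, and the sketch assumes each output is an array of label strings.

```javascript
// Hypothetical normalization, assuming each output is an array of label strings.
// When multiple classifications are not allowed, a single-item array becomes
// a plain string; anything else is returned unchanged.
function normalizeExclusiveOutput(output, allowMultiple) {
  if (!allowMultiple && Array.isArray(output) && output.length === 1) {
    return output[0];
  }
  return output;
}

const fixedOutputs = [["cat"], ["dog"]].map((output) =>
  normalizeExclusiveOutput(output, false)
);
console.log(fixedOutputs);
```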

Feature: Speaker Identification Audio Transcription

Requires visualized waveform

Import from CSV / JSON should prompt user to see if they'd like to import the interface if the interface is different

Open-Source Collaboration Server

use-server.js makes requests to an API server which is hosted at udt-collaboration.workaround.now.sh. Are there plans to make the code for the server open source so that users can self-host the entire stack for this tool?

NER splits on letters with accents

I do not think the NER interface should split on letters with accents. Here are some really common in French "ùûàâçéèêëïîô"
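
One way to avoid splitting on accented letters is to tokenize with Unicode property escapes instead of ASCII-only character classes. This is a sketch under the assumption that the interface splits text with a regular expression; how the NER interface tokenizes today is not stated here, and `tokenize` is a hypothetical function.

```javascript
// Sketch only: splits text into word tokens and single punctuation tokens.
// \p{L} matches any Unicode letter (accented ones included), \p{M} matches
// combining marks, and \p{N} matches digits, so a word like "été" is one token.
function tokenize(text) {
  return text.match(/[\p{L}\p{M}\p{N}]+|[^\s\p{L}\p{M}\p{N}]/gu) || [];
}

console.log(tokenize("L'été à Paris, ça coûte cher."));
```

The `u` flag is required for `\p{...}` to be recognized; without it the pattern would not match Unicode categories.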

Bugs

Played with your tool a little; here are a few things I found:

Settings won't pop up when in full screen mode
Resizing a box so that the right side becomes the left side (or the opposite) is not possible; it simply redoes the original box. Same for top/bottom. I think this should not be the expected behavior.

Here are some suggestions:

Make Task description optional (ReactImageAnnotate component)
Make the "selector" always present (it should be the default cursor, I think). Pressing a key like "W" from the cursor should get us into drawing mode and, once completed, back to selector mode.
Divider between the always present and optional tools on the left side bar
Center the image by default in the Pan
For bounding boxes, a single click shouldn't create a 1 pixel box. Either make it so we can draw a box (click -> drag -> click) or simply do nothing, since it is probably a misclick
Put the label inside the annotation if the annotation of the bounding box is 100% of the image height
1 color per class; I think this should be the default. Having different colors for every annotation could be optional.
Make it possible to have a "default class" just like LabelImg. All LabelImg is missing is being able to change the default class using hotkeys!
Make it impossible to put the corners outside of the image, in the gray area.
Make it possible to go to the next image without coming back to the selector every time

React Usage Documentation

Document how to use the Universal Data Tool in a React project.

Audio Transcription Interface Bugs: Doesn't load existing settings

Feature: Multiple Classifications or Tags per Region

Feature: 3D Bounding Boxes

We believe that people may need "3D Bounding Boxes" and we want to work on that. 🎉

If you'd like this functionality, please let us know by leaving a thumbs up 🤘

Feature: Hotkeys for Composite Tasks + "Automatic Next" on Composite Tasks

Composite tasks currently take a lot of clicking to complete. This update should introduce hotkeys for each subinterface in a composite task and allow the user to configure the completion of a composite task to open the next subtask automatically. I.e. after completing the first task within the composite task you complete the next.

feature: import of NER JSON

Hi there. Thank you for a great tool.
I am curious whether you are considering support for importing pre-annotated text for NER?
This is a very common task in an active learning setup or as a post-regex cleanup step.

Collaborative Session Bugs

If you see any bugs related to collaborative sessions, you can report them here! 🎉

SQLite Collaboration Server

First reported in #16. The collaboration server is currently written with a scalable serverless architecture hosted on zeit now. We want to have a different codebase for the local one. Because the zeit now code was built for a commercial project, we can't open-source the code. But we can build a new version that implements the API.

Here is the full specification:

Universal Data Tool Collaborative Editing Server

Goals

Users should be able to collaborate with other users to complete the labeling of a dataset together
Users should receive notifications as work is completed or started by other users
Users should receive "updates" from other users in less than 500ms
The "Settings" should be able to be edited by any user
New data uploaded should be supported by any user
Collaborative links should be shareable
The first time someone enters collaboration mode a dialog should explain how to share the link etc.

Out of Scope

Should not require any login
Collaborative editing on a per-sample basis
- Collisions should take "last person who submitted edit"
Completion time estimate

Key Technologies

fast-json-patch is used to send patches
object-hash is used to hash objects to produce hashOfLatestState
micro is used for endpoints
ava is used for testing
sqlite is used as the database
better-sqlite3 is an npm module that makes the connection to sqlite very fast and simple

Architecture

The following endpoints are used...

POST /udt/session: Creates a link to a UDT session. Whoever initiates collaboration mode calls this. It is called exactly once to start a session. A session lasts indefinitely. Returns the url to the session.
GET /udt/session/<session_id>: Gets the latest version of the UDT JSON file by getting the latest session_state (see DB Architecture)
GET /udt/session/<session_id>/diffs: Gets recent diffs for the JSON file
- The requestor must provide the querystring parameter since=<ISODATE> indicating that they would like the diffs since the last time they polled.
- The UDT will poll this every 250-500ms. Most of the time it'll return an empty array of patches.
- Responds with { patches: Array<JSONDiffPatch>, hashOfLatestState, latestVersion }
PATCH /udt/session/<session_id>: Sends a JSONDiffPatch object with changes
- Request contains { patch, mySessionStateId }
  - patch is applied against the latest session state to generate a new session state.
  - mySessionStateId isn't used (for now)
- Should return { hashOfLatestState, latestVersion }
PATCH /udt/session/<session_id>/sample: Creates, modifies or deletes a sample
- This endpoint should be used instead of the /udt/session/<session_id> endpoint for updating, creating or deleting samples because it can handle certain edge cases better.
- A request contains { operation, sampleIndex, [newInput], [newOutput], [previousInput] }
  - operation can be "DELETE", "CREATE", "UPDATE"
  - newInput is the taskData[sampleIndex] that the UDT observes when it sends the request
    - If "UPDATE" or "DELETE", use previousInput to find the true sample index. (i.e. do a deep comparison to find the sampleIndex using the latest version of the state).
  - newOutput is the new output for "UPDATE" operations. It is optional because the user may not want
  - The sampleIndex provided by the requestor may not be used.
- Should return { hashOfLatestState, latestVersion }
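
The deep comparison described for the sample endpoint can be sketched as follows. This is an illustration only: `deepEqual` and `findTrueSampleIndex` are hypothetical helpers, and the sample objects are made up for the example.

```javascript
// Structural equality for JSON-like values (objects, arrays, primitives).
function deepEqual(a, b) {
  if (a === b) return true;
  if (typeof a !== "object" || typeof b !== "object" || a === null || b === null) {
    return false;
  }
  if (Array.isArray(a) !== Array.isArray(b)) return false;
  const keysA = Object.keys(a);
  const keysB = Object.keys(b);
  if (keysA.length !== keysB.length) return false;
  return keysA.every(
    (key) =>
      Object.prototype.hasOwnProperty.call(b, key) && deepEqual(a[key], b[key])
  );
}

// Hypothetical helper: locate the sample the client was looking at inside the
// latest server-side taskData. Returns -1 when no sample matches.
function findTrueSampleIndex(latestTaskData, previousInput) {
  return latestTaskData.findIndex((sample) => deepEqual(sample, previousInput));
}

// The client saw image2 at index 2, but another user deleted index 1 meanwhile,
// so the sample now sits at a different index in the latest state.
const latestTaskData = [
  { imageUrl: "https://example.com/image0.jpg" },
  { imageUrl: "https://example.com/image2.jpg" },
];
const trueIndex = findTrueSampleIndex(latestTaskData, {
  imageUrl: "https://example.com/image2.jpg",
});
console.log(trueIndex);
```

Matching on content rather than position is what lets an "UPDATE" or "DELETE" land on the right sample after concurrent edits have shifted indices.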

Example

Let's look at a typical collaborative workflow to see how these endpoints work:

After User1 engages collaboration mode, an API request is sent to POST /udt/session. User1's editor parses the response and creates a link for them to share.
User1 shares the link with their team (only User2) and begins to edit
User2 uses the link to join the session. They get the latest version of the UDT JSON by calling GET /udt/session/<session_id>. They know the session_id because it's embedded in the link.
User2 edits something in the settings. The UDT makes a request to PATCH /udt/session/<session_id> with a JSONDiffPatch containing their changes.
User1 polls GET /udt/session/<session_id>/diffs?since=<last_version> to get the latest patches. User1's editor sees that there's a patch to apply from User2. They apply the patch, and display a notification for the user.
User1 begins to edit a sample. This triggers a request to PATCH /udt/session/<session_id>/sample changing the taskData[sampleIndex].isBeingEdited to true.
User1 finishes editing a sample. This triggers a request to PATCH /udt/session/<session_id>/sample changing the taskData[sampleIndex].isBeingEdited to false and taskOutput[sampleIndex] to their newOutput
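
The patch-and-poll part of this workflow can be modelled in memory. The sketch below is not the server: `SessionStore` is a hypothetical class, patches are reduced to shallow key/value merges rather than the JSON patches sent with fast-json-patch, hashOfLatestState is left out, and polling is keyed by version number (an assumption, since the specification mentions both since=<ISODATE> and since=<last_version>).

```javascript
// In-memory stand-in for the collaboration server, for illustration only.
class SessionStore {
  constructor(initialUdtJson) {
    this.state = initialUdtJson;
    this.latestVersion = 0;
    this.history = []; // entries of { version, patch }
  }

  // Roughly the role of PATCH /udt/session/<session_id>: apply, then bump version.
  applyPatch(patch) {
    this.state = { ...this.state, ...patch };
    this.latestVersion += 1;
    this.history.push({ version: this.latestVersion, patch });
    return { latestVersion: this.latestVersion };
  }

  // Roughly the role of GET .../diffs: return patches newer than sinceVersion.
  diffsSince(sinceVersion) {
    const patches = this.history
      .filter((entry) => entry.version > sinceVersion)
      .map((entry) => entry.patch);
    return { patches, latestVersion: this.latestVersion };
  }
}

const store = new SessionStore({ interface: { type: "image_classification" } });

// User1 starts with a copy of the current state and remembers its version.
let user1State = { ...store.state };
let user1Version = store.latestVersion;

// User2 edits a setting.
store.applyPatch({ interface: { type: "image_segmentation" } });

// User1 polls and applies whatever arrived since their last known version.
const { patches, latestVersion } = store.diffsSince(user1Version);
for (const patch of patches) {
  user1State = { ...user1State, ...patch };
}
user1Version = latestVersion;
console.log(user1State, user1Version);
```

Most polls return an empty patch list, which is why the client only needs to remember the last version it has applied.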

Database Architecture

One table called session_state representing each state of the JSON file. It contains the following columns:

session_state_id uuid randomly generated
short_id text randomly generated: represents the session id
udt_json jsonb: The state of the UDT file
patch jsonb: The patch that created this version from the previous version
previous_session_state_id uuid: Identifier for previous state
version integer: Integer identifying the revision number
created_at timestamptz: Timestamp on creation

The database will have the following constraints applied

UNIQUE previous_session_state_id
- Each session state can only have one subsequent state. This prevents certain race conditions.

The database will have the following SQL triggers:

Delete session_states that are older than 1 hour AND not the latest state
- Triggered when a session state is inserted.
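
The cleanup rule can be expressed as a filter over session states. This is an in-memory sketch rather than an SQL trigger: `pruneSessionStates` is a hypothetical function, the field names mirror the session_state columns, and created_at is assumed to be epoch milliseconds for simplicity.

```javascript
const ONE_HOUR_MS = 60 * 60 * 1000;

// Keeps a state if it is the latest version of its session (short_id) or if it
// was created within the last hour; everything else would be deleted.
function pruneSessionStates(states, nowMs) {
  const latestVersionBySession = new Map();
  for (const state of states) {
    const best = latestVersionBySession.get(state.short_id);
    if (best === undefined || state.version > best) {
      latestVersionBySession.set(state.short_id, state.version);
    }
  }
  return states.filter(
    (state) =>
      state.version === latestVersionBySession.get(state.short_id) ||
      nowMs - state.created_at <= ONE_HOUR_MS
  );
}

const now = 3 * ONE_HOUR_MS;
const kept = pruneSessionStates(
  [
    { short_id: "a", version: 1, created_at: 0 }, // old and superseded
    { short_id: "a", version: 2, created_at: now - 30 * 60 * 1000 }, // recent
    { short_id: "a", version: 3, created_at: now - 60 * 1000 }, // latest
    { short_id: "b", version: 1, created_at: 0 }, // old but latest of its session
  ],
  now
);
console.log(kept.map((state) => `${state.short_id}:${state.version}`));
```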

Embedded PDF Viewer

When task data contains pdfUrl, directly embed a pdf.

stuck on github page in MacOS app -- cannot navigate back

I clicked Github button in the upper right corner in the Mac App, and now cannot get back to the labelling interface. There is a button in Navigate menu, but it does nothing:

Feature: Import Samples from UDT CSV

As discussed in #32, we should add an import dialog for CSV data. Basically it would import the sample portion of a *.udt.csv file, which is structured as shown below....

path	.	document	output
interface	`{ ... }`
samples.0		This strainer makes a great...	`{ "entities": [ { "label": "hat", "start": ... } ]}`
samples.1		Boy spaghetti is sure tasty...	`{"entities": [ { "label": "food", "start": ... } ]}`

Upgrade notice should state the latest version

Doccano-style Data Importing

Support importing of doccano-style imports/exports.

Feature: DEXTR-based "Magic Segmentation"

See https://youtu.be/6WJxzKsIFKA?t=2936

We can use an algorithm like DEXTR to do this quickly http://people.ee.ethz.ch/~cvlsegmentation/dextr/

Thanks to @Ownmarc for the research into implementation.

Ideally, there are two tools, one that outputs a pixel level segmentation #59, and one that outputs a polygon segmentation.

Keyboard Shortcuts

Need to have keyboard shortcuts for common tasks.

Could this be used to identify materials, not animals?

Quick question.

Do you think this could be used to identify materials? Like to determine a color or type of metal?

Desktop Application "Recent Files" doesn't recognize when file is deleted

It may also be storing the file in local storage, either way it should be a path. Instead of storing the files in local storage (if that is how they are stored) we should just store the paths when the application is in desktop mode.

Feature: Sample Assignment

Edit: The original title was "Chat Box"

Image Segmentation Settings Box doesn't appear if in full screen mode

Originally reported in #19. This is probably a bug with the underlying library react-image-annotate.

Add Separate Dependency List for React Library

We need to create a separate dependencies list for React usage. Many React users won't need the youtube-dl or ffmpeg libraries and we don't want things to be super bloated if they're using it as an npm module.

Paste URLs doesn't recognize images with GET Parameters

Steps to reproduce:

Paste image url like the one below:
https://scontent-lga3-1.cdninstagram.com/v/t51.2885-15/e35/c0.180.1440.1440a/s480x480/92319440_155591232435058_5851057296458538518_n.jpg?_nc_ht=scontent-lga3-1.cdninstagram.com&_nc_cat=110&_nc_ohc=pbVh20DKP50AX-QKGFD&oh=fd6b34bf2605f170104db599b1131d84&oe=5EB4AD12
Sample is not imported
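
A likely fix is to check the file extension on the URL's path only, ignoring the query string. The sketch below assumes the importer currently decides by extension, which is not confirmed here; `looksLikeImageUrl` and the extension list are hypothetical.

```javascript
// Assumed list of image extensions; adjust to whatever the importer supports.
const IMAGE_EXTENSIONS = [".jpg", ".jpeg", ".png", ".gif", ".webp", ".bmp"];

// Decides whether a pasted URL points at an image by inspecting only the
// path, so anything after "?" or "#" is ignored.
function looksLikeImageUrl(rawUrl) {
  let url;
  try {
    url = new URL(rawUrl.trim());
  } catch (err) {
    return false; // not a parseable absolute URL
  }
  const path = url.pathname.toLowerCase();
  return IMAGE_EXTENSIONS.some((ext) => path.endsWith(ext));
}

console.log(looksLikeImageUrl("https://example.com/photos/cat.jpg?size=480&token=abc"));
```

Using the `URL` class avoids hand-written string splitting and handles fragments and encoded characters consistently.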

Slow Collaborative Sessions when hashes not equal

Logs this in the console:

when getting diffs, hashes were not equal! getting latest version from server...
after patch, hashes were not equal! getting latest version from server...

Should support composite GUI configuration

There should be a recursive GUI configuration for composite type tasks. The following story would show a basic interface...

import React from "react"

import { storiesOf } from "@storybook/react"
import { action } from "@storybook/addon-actions"

import CompositeConfiguration from "./"

storiesOf("CompositeConfiguration", module).add("Basic", () => (
  <CompositeConfiguration
    onSaveTaskOutputItem={action("CompositeConfiguration")}
    interface={{
      type: "composite",
      fields: [
        {
          fieldName: "Field 1",
          interface: { type: "image_segmentation" },
        },
        {
          fieldName: "Field 2",
          interface: { type: "audio_transcription" },
        },
      ],
      description: "This is an **audio transcription** description.",
    }}
  />
))

I'm not sure what taskData or taskOutput is supposed to look like, other than the keys of each taskOutput should be the field name e.g. taskOutput[0]["Field 1"] is the output from Field 1.

README Images

This issue is just used to upload images for usage in the README.

Collaborative Session should Sync with Recent Item

When people exit a collaborative session, they're expecting the file to retain the same name it did before when it was stored in local storage.

We should link the session the user is working in to a local storage item. If they created a collaborative session while working in a local storage item it should be linked to that item.

Should remember decision to "Hide Description" for next sample

Add contributing instruction

To make contributing clearer, I suggest a similar approach:

https://github.com/facebook/draft-js/blob/a9316a723f9e918afde44dea68b5f9f39b7d9b00/CONTRIBUTING.md

Also, for clearer issue templates I recommend: https://github.com/GRAAL-Research/poutyne/tree/master/.github/ISSUE_TEMPLATE

I wanted to propose a PR but some points were not clear (linter, yapf).

Feature: Delete All Samples

Empty Labeling State

The empty labeling state should tell the user what to do (i.e. if the interface is "empty")

Authentication and managing users

Is this something on your roadmap ?

Would be great to have per user annotations (as suggested in UniversalDataTool/udt-format#2)

Would open the doors for many great things

Should support data_entry GUI Configuration

Feature: Transform Convert to Web URLs

File server powered by https://transfer.sh code will go up on https://files.universaldatatool.com. We'll then need a transform dialog to allow users to convert their local files into web urls.

Feature: Pixel-Level Segmentation

Feature: Sample Colors / States

Many people have a pipeline where a machine learning sample can be in an "incomplete" "in review" and "complete" state. This feature should give flexibility for the user to determine the sample's state when saving in an efficient manner.

Editing Settings in Collaborative Session is annoying

This is because changes are immediately sent to the server, then during the reconciliation some immediate user changes are reverted.

Try editing a text field for a setting while in a collaborative session. You'll see that many patches are sent and it's difficult to type.

The easiest solution is to implement a "Save" button on the Settings page. This will also make it more difficult for a team member to accidentally mess up the settings. The Save button will appear only when in a collaborative session.

UDT Walkthrough for First-Time Users

These libraries look good:

Continuous Deployment

We should use https://github.com/semantic-release/semantic-release, I originally proposed something a bit more complicated

Why only allow file uploads from desktop version?

I launched the React app (since I would like to deploy this as a service), but I can't seem to upload a file with samples. The button is greyed out and reads "DESKTOP ONLY".

Why add this arbitrary limitation to a web app?

Feature: Video Annotation

We've heard some people ask for this, please thumbs up if you're interested!

Video annotation has some unique challenges. Labels should automatically interpolate between frames, e.g. if the user annotates frames 1 and 5 of a moving object, the bounding box in frames 2, 3 and 4 should automatically move to follow the object's movement.
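
The simplest form of this is linear interpolation between two annotated keyframes. The sketch below is an assumption about how it could work, not a description of the tool: `interpolateBox`, the `{ x, y, w, h }` box shape and the keyframe format are all hypothetical.

```javascript
// Linearly interpolates a bounding box between two annotated keyframes.
// Assumes keyA.frame < keyB.frame and keyA.frame <= frame <= keyB.frame.
function interpolateBox(keyA, keyB, frame) {
  const t = (frame - keyA.frame) / (keyB.frame - keyA.frame);
  const lerp = (a, b) => a + (b - a) * t;
  return {
    x: lerp(keyA.box.x, keyB.box.x),
    y: lerp(keyA.box.y, keyB.box.y),
    w: lerp(keyA.box.w, keyB.box.w),
    h: lerp(keyA.box.h, keyB.box.h),
  };
}

// The user annotated frames 1 and 5; frame 3 lies halfway between them.
const frame1 = { frame: 1, box: { x: 0, y: 0, w: 10, h: 10 } };
const frame5 = { frame: 5, box: { x: 40, y: 20, w: 10, h: 10 } };
const frame3Box = interpolateBox(frame1, frame5, 3);
console.log(frame3Box);
```

Linear interpolation only approximates real motion, so a user would still need to add keyframes wherever the object changes speed or direction.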

Interface Previews

Preview the interface on sample data for each interface.

bug: unable to import text on Mac App

I am not able to import Directories or Text Snippets.

When I go the Directory route, I can select a directory, but then it puts me in "Grid" view with nothing there.

If I press "Text Snippets", I can type, but I cannot paste any text from the system's buffer (Cmd+V does not work, and there is no contextual menu).

I am using the latest version, downloaded 2020-03-23