digitallinguistics / tags2dlx

A JavaScript (Node.js) library that converts a tagged (monolinear) text to DLx JSON format

Home Page: https://developer.digitallinguistics.io/tags2dlx

License: MIT License

JavaScript 100.00%
digital-linguistics dlx corpus-linguistics linguistics corpora corpus

tags2dlx's Introduction

DLx Organization Profile

This repository contains the code for the Digital Linguistics (DLx) profile page.

tags2dlx's People

Contributors

dependabot[bot], dwhieb

tags2dlx's Issues

option: compact

Add a Boolean compact option that makes the JSON output compact, with no unnecessary white space.

Internally, this will set the space argument of JSON.stringify to 0 rather than 2.
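As a rough sketch of how this could look, the snippet below wraps JSON.stringify in a small helper. The helper name serialize and the shape of the sample object are assumptions made for illustration; only the compact option and the 0 / 2 space values come from this issue.

```javascript
// Hypothetical helper: `serialize` is an illustrative name, not the library's API.
function serialize(text, { compact = false } = {}) {
  // A space argument of 0 emits no insignificant white space; 2 pretty-prints.
  return JSON.stringify(text, null, compact ? 0 : 2);
}

// Sample data invented for this example.
const sample = { utterances: [{ transcription: 'the dog' }] };

const compactOutput = serialize(sample, { compact: true });
const prettyOutput = serialize(sample);
```

Defaulting compact to false keeps the current pretty-printed output unless the user opts in.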

option: tag name

The user should be able to pass an option to the library (tagSetName) which provides the name of the tag set. This will be used as the property name in the tags object of the DLx text:

"tags": {
  "penn": "N"
}

If this option is not provided, the tags will be added individually with their value set to true:

"tags": {
  "N": true
}
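The two behaviors above could be sketched as follows. The function name buildTags is hypothetical; tagSetName is the option named in this issue.

```javascript
// Hypothetical helper: builds the tags object for a single word token.
function buildTags(tag, tagSetName) {
  // With a tag set name, the tag becomes the value of that property.
  if (tagSetName) return { [tagSetName]: tag };
  // Without one, the tag itself is the property name, set to true.
  return { [tag]: true };
}

const named = buildTags('N', 'penn');
const unnamed = buildTags('N');
```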

option: postprocessors

The user should be able to pass a postprocessors hash as an option to tags2dlx which allows them to specify postprocessors at the text, utterance, and word level.

The library itself should use this functionality for its own postprocessing.
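One possible shape for this, as a sketch only: the hash keys (text, utterance, word), the function name applyPostprocessors, and the bottom-up order (words first, then utterances, then the text) are all assumptions, as is the utterances / words layout of the text object.

```javascript
// Hypothetical sketch: each postprocessor receives an object and returns
// the (possibly modified) object. Missing levels fall back to identity.
function applyPostprocessors(text, postprocessors = {}) {
  const identity = value => value;
  const {
    text: onText = identity,
    utterance: onUtterance = identity,
    word: onWord = identity,
  } = postprocessors;

  // Assumed order: words first, then their utterance, then the whole text.
  const utterances = text.utterances.map(utterance => onUtterance({
    ...utterance,
    words: utterance.words.map(word => onWord(word)),
  }));

  return onText({ ...text, utterances });
}

// Example: lowercase every word's transcription.
const processed = applyPostprocessors(
  { utterances: [{ words: [{ transcription: 'Dog' }] }] },
  { word: word => ({ ...word, transcription: word.transcription.toLowerCase() }) },
);
```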

set up project board

Set up a project board for managing tasks in this repo, with the following (automated) columns:

  • wish list
  • to do
  • in progress
  • done

option: NDJSON

Add a Boolean NDJSON option which allows the user to specify that the output should be in NDJSON format, where each line in the resulting NDJSON file represents an utterance.
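A minimal sketch of the serialization step, assuming the converted text holds its utterances in an utterances array. The function name toNDJSON is hypothetical.

```javascript
// Hypothetical helper: one JSON document per line, one line per utterance.
function toNDJSON(text) {
  // JSON.stringify without a space argument never emits raw newlines,
  // so each utterance is guaranteed to occupy exactly one line.
  return text.utterances.map(utterance => JSON.stringify(utterance)).join('\n');
}

const ndjson = toNDJSON({
  utterances: [
    { transcription: 'the dog barks' },
    { transcription: 'the cat sleeps' },
  ],
});
```

Note that text-level properties have no obvious home in this layout, so how to carry them would need a separate decision.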

set up GitHub Pages

  • set up DLx subdomain
  • change repository settings
  • update URL of repo
  • add link to DLx developer documentation

set up Zenodo

  • enable Zenodo for repository
  • make a release
  • add Zenodo badge and citation to readme

publish library to npm

  • set up automated deployment based on GitHub releases
  • remove GitHub releases / version badge(s) from readme
  • add npm version badge to readme
  • make the first GitHub release

words should be DLx Word Token objects

Each word object in the returned text should adhere to the DLx Word Token schema.

  • transcription
  • translation (empty)
  • tags (with the tag attached to that token)
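The three properties listed above could be assembled like this. The function name createWordToken is hypothetical, and representing the empty translation as an empty string is an assumption; the DLx Word Token schema is the authority on the exact types.

```javascript
// Hypothetical factory for a word token with the three listed properties.
function createWordToken(transcription, tag) {
  return {
    transcription,
    translation: '',       // assumed representation of "empty"
    tags: { [tag]: true }, // the tag attached to this token
  };
}

const token = createWordToken('dog', 'N');
```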

update readme

Go through the readme checklist

  • introduction
  • example of a converted sentence
  • links to DLx JSON format
  • basic usage
  • version - latest stable release of Node
  • issues
  • author
  • license
  • badges
    • Travis CI status
    • GitHub issue count
    • license
    • GitHub stars
  • running tests
  • contributing

option: tag separator

The tags2dlx function should accept a tagSeparator option, which allows the user to specify one or more separators that delimit the tag at the end of the word from the word itself.
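A sketch of the splitting logic, assuming tagSeparator may be a single string or an array of strings and that the tag follows the last separator in the token. The function name splitTag and the '_' default are assumptions.

```javascript
// Hypothetical helper: splits a token such as "dog_N" into word and tag.
function splitTag(token, tagSeparator = '_') {
  const separators = Array.isArray(tagSeparator) ? tagSeparator : [tagSeparator];

  // Find the rightmost occurrence of any separator, since the tag sits
  // at the end of the word and the word itself may contain a separator.
  let best = { index: -1, length: 0 };
  for (const separator of separators) {
    const index = token.lastIndexOf(separator);
    if (index > best.index) best = { index, length: separator.length };
  }

  // No separator found: treat the whole token as an untagged word.
  if (best.index === -1) return { transcription: token, tag: null };

  return {
    transcription: token.slice(0, best.index),
    tag: token.slice(best.index + best.length),
  };
}

const simple = splitTag('dog_N');
const slashed = splitTag('and/or/CC', ['_', '/']);
```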

set up Jasmine

  • tests should run on Node stable release
  • add to Travis CI

option: preprocessors

The user should be able to pass a preprocessors hash as an option to tags2dlx that allows them to specify preprocessors for the text, utterance, and word.

The library itself should use this functionality for its own preprocessing.

option: metadata

The user should be able to pass a metadata option to this library that includes other metadata about the text, such as the title.

add transcription property to utterance

Extract the original transcript / transcription (which? both?) for each utterance, and include it in the transcription / transcript property of the utterance.

You'll probably need to keep an in-memory copy of the text and progressively trim each utterance from its beginning, using a regular expression to find the next instance of the current utterance separator.
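The trimming approach described here might look roughly like the following. The function name extractUtterances, the newline default for the separator, and the choice of transcript as the property name are all assumptions (the issue leaves the property name open).

```javascript
// Hypothetical sketch: repeatedly cut the next utterance off the front of
// an in-memory copy of the text. The separator must be a non-global RegExp
// that matches at least one character.
function extractUtterances(rawText, utteranceSeparator = /\r?\n+/u) {
  const utterances = [];
  let remaining = rawText;

  while (remaining.length > 0) {
    const match = utteranceSeparator.exec(remaining);
    if (match && match[0] === '') {
      throw new Error('The utterance separator must match at least one character.');
    }

    // The utterance runs up to the next separator, or to the end of the text.
    const end = match ? match.index : remaining.length;
    const transcript = remaining.slice(0, end);
    if (transcript.trim() !== '') utterances.push({ transcript });

    // Trim the utterance and its separator from the beginning of the text.
    remaining = match ? remaining.slice(end + match[0].length) : '';
  }

  return utterances;
}

const extracted = extractUtterances('the_DT dog_N\nbarks_V\n');
```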

support different languages

Technically, this library currently supports only English corpora, because the output sets the transcription property as a simple string. According to the DLx format, when the transcription field is just a string, it should be interpreted as English.

Add a language option to the tags2dlx function which allows the user to specify the language of the corpus. English is assumed by default.

option: word separator

The tags2dlx library should accept a wordSeparator option which allows the user to specify one or more characters to use as word separators.
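A sketch of how such an option could be applied, assuming wordSeparator is a string or an array of strings and that runs of consecutive separators count as one. The names splitWords and escapeRegExp, and the space / tab default, are assumptions.

```javascript
// Escapes characters that have special meaning in a regular expression.
function escapeRegExp(string) {
  return string.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Hypothetical helper: splits an utterance into words on any of the
// given separators, treating consecutive separators as a single break.
function splitWords(utterance, wordSeparator = [' ', '\t']) {
  const separators = Array.isArray(wordSeparator) ? wordSeparator : [wordSeparator];
  const pattern = new RegExp(`(?:${separators.map(escapeRegExp).join('|')})+`, 'u');
  // Leading or trailing separators produce empty strings, which are dropped.
  return utterance.split(pattern).filter(word => word !== '');
}

const words = splitWords('the_DT  dog_N|barks_V', [' ', '|']);
```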

add command line interface

The user should be able to run this library from the command line. The command line interface should accept 1 required argument: the path to the file or folder to convert.

The command line script should either:

  • convert the provided single file to JSON and save the new file alongside the original
  • recurse the provided directory and convert each file as above
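The two cases above could be handled by one recursive function, sketched below with Node's built-in fs and path modules. The convert function is a stand-in for the library's real conversion, and the names convertPath and main, the .json output extension, and the decision to skip existing .json files are assumptions.

```javascript
const fs = require('fs');
const path = require('path');

// Hypothetical stand-in for the library's conversion function.
function convert(taggedText) {
  return JSON.stringify({ raw: taggedText }, null, 2);
}

// Converts a single file, or recurses into a directory and converts each file.
function convertPath(target) {
  if (fs.statSync(target).isDirectory()) {
    for (const entry of fs.readdirSync(target)) {
      convertPath(path.join(target, entry));
    }
    return;
  }

  // Assumption: skip JSON files so earlier output is not converted again.
  if (path.extname(target) === '.json') return;

  // Save the new file alongside the original, with a .json extension.
  const { dir, name } = path.parse(target);
  const output = convert(fs.readFileSync(target, 'utf8'));
  fs.writeFileSync(path.join(dir, `${name}.json`), output);
}

// Entry point taking the command line arguments (without node and script).
function main([target]) {
  if (!target) throw new Error('Expected 1 argument: the path to a file or folder.');
  convertPath(target);
}

// In an executable script, this would be called as:
// main(process.argv.slice(2));
```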

option: punctuation

The user should be able to provide a set of punctuation characters to ignore.
