Giter Site home page Giter Site logo

traindata's Introduction

# Traindata

Orthography-to-phonology networks require training data that
has some special care taken to consider aspects of phonological
and orthographic processing when specifying to the training 
representations used during learning. This repo provides some
functionality to that end.

## Installation

This project now supports dependency management through Poetry. To get started, first [install Poetry](https://python-poetry.org/docs/#installation) if you haven't already.

Clone the repository to your local machine:
```bash
git clone https://github.com/MCooperBorkenhagen/Traindata.git
```

If you're developing in VS Code, consider creating the virtual environment in the project directory
as VS Code natively detects it and uses its kernel

```bash
poetry config virtualenvs.in-project true
```

Install dependencies and activate the virtual environment. 

```bash
poetry install
poetry shell
```

If you're in VS code, and have installed the *.venv* folder locally, then you do not need to activate the poetry shell.

## Usage

To include this repository as a dependency in another project, add the following to your project's pyproject.toml file:

```toml
[tool.poetry.dependencies]
python = ">=3.9,<3.13"
traindata = { git = "https://github.com/MCooperBorkenhagen/Traindata.git", branch = "main" }
```

Then, run _*poetry install*_ to install the `traindata` package along with its dependencies.

Alternatively, you can build and install this project using pip

```plaintext
pip install git+https://github.com/MCooperBorkenhagen/Traindata.git@main#egg=traindata
```

by adding that URL (along with the git+ prefix) to a requirements.txt file

```bash
pip install -r requirements.txt
```

or by cloning the repository and building it from the *setup.py* file


```bash
python setup.py bdist_wheel
pip install dist/traindata-0.1.0-py3-none-any.whl
```

Phonology
---------
The phonological structure contained/ assumed in Traindata is
opinionated and is based on the ARPAbet. For more information
on this representational scheme see below. The specific version
of this scheme is maintained locally in this repository, but is
derivative of what is contained in the MCB repository linked below.

ARPAbet Documentation:
https://en.wikipedia.org/wiki/ARPABET
http://www.speech.cs.cmu.edu/cgi-bin/cmudict
https://github.com/MCooperBorkenhagen/ARPAbet

The phonological representations are extensions of previous
connectionist implementations, namely from Harm & Seidenberg (1999).
In essence, the classical distinctive features framework is used,
originally formulated in Chomsky and Halle (1968). Originally,
the framework used in this repository codes for binary features,
but other codings are possible.


Orthography
-----------
The orthographic representations used here assume a one-hot
coding over letters, with each represented fundamentally as 
a 26-unit vector where the nth element in the vector corresponds
to the position of that letter in the alphabet. Originally, the
letters specified are all lowercase, but that could change in
the future.

Terminal Segments
-----------------
An important detail about the way that orthography and phonology
are represented in the method contained here concerns terminal
segments. When processing time-varying representations of language,
including words, the boundaries of any given language form (sentence,
word, etc.) often needs to be explicitly identified. The featural
representations contained here allow for this, with a feature
representing the start of the word (labeled SOS) and a feature
encoding the end of the word (labeled EOS). The labels "SOS" and
"EOS" are adopted from the machine learning literature and implementations
of sequential ANNs. 

traindata's People

Contributors

mathnathan avatar mcooperborkenhagen avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.