Dialectal Varieties

Corpus:

  • Also contains the EU documents
  • Each document has up to 66 sentences (about 2000 words)
  • Each line of a *.chnk file holds one data point, with sentences separated by '#%'
  • The original indices before shuffling are stored in *.orig_idx
  • Shuffled documents are found in *.shf
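
As a minimal sketch of how the *.chnk layout described above could be read, the helper below splits each line into sentences on the '#%' separator. The function names and the UTF-8 encoding are assumptions made for illustration; they are not taken from the repository's code.

```python
from typing import List

SENTENCE_SEP = "#%"  # separator named in the corpus description


def parse_chunk_line(line: str) -> List[str]:
    """Split one data point (one line of a *.chnk file) into its sentences."""
    parts = line.rstrip("\n").split(SENTENCE_SEP)
    # Drop empty pieces that a trailing separator would produce.
    return [part.strip() for part in parts if part.strip()]


def read_chunks(path: str) -> List[List[str]]:
    """Read a *.chnk file as a list of documents, each a list of sentences.

    Assumes UTF-8 text (hypothetical; adjust if the files use another encoding).
    """
    with open(path, encoding="utf-8") as handle:
        return [parse_chunk_line(line) for line in handle if line.strip()]
```

Calling `read_chunks` on a chunk file would then give one list of sentences per document, which makes it easy to check the sentence count per document.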

Each test set contains 40 documents from each class, and the training sets have at most 197 documents per class (the number available for Scotland, the smallest class).

Running the code

1. Install requirements

Install the requirements. spaCy is only needed to obtain PoS tags.

pip3 install -r requirements.txt

Download spacy models for English and French.

# you may download other models if you wish
python3 -m spacy download en_core_web_trf
python3 -m spacy download fr_dep_news_trf # for PoS tagging
python3 -m spacy download fr_core_news_lg # for NER

Optional - split raw files into chunks

This is not needed, as the splits are already released in this repository. If one wishes to split the files into chunks and to do the train-test split, run:

python3 split_extract.py

Optional - machine translate English documents to French

We used a readily available En-Fr MT model implemented in fairseq-py. The fairseq model is available for download and can be loaded into fairseq by following the tutorial in the official repository. The MT model only serves to show that dialect markers from the source language are preserved in the MT output.

2. Generate PoS-tagged and no-entity files

The annotated train-test splits used in the experiments are available here. This script generates PoS-tagged copies of the text files, as well as copies in which all annotated entities are replaced. It is intended to be used with the English, French, and MT French train-test splits. Check the source code if you wish to run it on a custom directory.

cd src && python3 make_pos_dirs.py
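
To illustrate the two kinds of copies described above, here is a rough sketch, not the logic of make_pos_dirs.py itself. It assumes tokens that expose the spaCy `Token` attributes `text`, `pos_`, `ent_type_` and `ent_iob_`. The function name and the choice to replace each entity with its label are hypothetical.

```python
from typing import Iterable, Tuple


def pos_and_masked(tokens: Iterable) -> Tuple[str, str]:
    """Return (PoS-tag sequence, text with each entity replaced by its label).

    `tokens` is any iterable of objects with the spaCy Token attributes
    `text`, `pos_`, `ent_type_` and `ent_iob_` (for example a spaCy Doc).
    Using the entity label as the placeholder is an assumption.
    """
    pos_tags = []
    masked = []
    for tok in tokens:
        pos_tags.append(tok.pos_)
        if tok.ent_iob_ == "B":
            # First token of an entity: emit one placeholder for the whole span.
            masked.append(tok.ent_type_)
        elif tok.ent_iob_ == "I":
            # Continuation of an entity: already covered by the placeholder.
            continue
        else:
            masked.append(tok.text)
    return " ".join(pos_tags), " ".join(masked)


# With spaCy and a model installed, usage could look like:
#   import spacy
#   nlp = spacy.load("en_core_web_trf")
#   pos_text, masked_text = pos_and_masked(nlp("Some sentence to annotate."))
```

Keeping the function independent of how the tokens were produced means the same code applies to the English and French models.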

3. Generate some tables from text classification experiments

This will run all the configurations in the *_classifications.py scripts. Currently, they classify documents using function words, pronouns, PoS n-grams, and word n-grams.

cd src && \
python3 en_classifications.py && \
python3 fr_classifications.py
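
For readers unfamiliar with PoS n-gram features, the sketch below shows one common way to derive them from a tag sequence. The default n-gram length and the relative-frequency normalisation are illustrative assumptions, not necessarily the settings used by the classification scripts.

```python
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def ngrams(tags: Sequence[str], n: int) -> List[Tuple[str, ...]]:
    """Return all contiguous n-grams of a tag sequence, in order."""
    return [tuple(tags[i:i + n]) for i in range(len(tags) - n + 1)]


def ngram_frequencies(tags: Sequence[str], n: int = 3) -> Dict[Tuple[str, ...], float]:
    """Map each n-gram to its relative frequency in the document."""
    counts = Counter(ngrams(tags, n))
    total = sum(counts.values())
    if total == 0:
        # Document shorter than n tags: no features.
        return {}
    return {gram: count / total for gram, count in counts.items()}
```

Word n-grams can be obtained with the same functions by passing a list of tokens instead of a list of tags.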

The classification scores are obtained by running multiple experiments on downsampled data. Each class is split into batches of equal size, and for each split we train and test a classifier.
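
One possible reading of this downsampling scheme is sketched below: every class is shuffled and cut into batches of the same size, and each round pairs one batch per class so that the classifier always sees balanced data. The function name, the seed, and the rule that the smallest class limits the number of rounds are assumptions for illustration; the scripts in src may organise the splits differently.

```python
import random
from typing import Dict, Iterator, List


def balanced_batches(
    docs_by_class: Dict[str, List[str]],
    batch_size: int,
    seed: int = 0,
) -> Iterator[Dict[str, List[str]]]:
    """Yield rounds of balanced data: `batch_size` documents per class.

    Documents are shuffled once per class and never reused across rounds.
    The number of rounds is limited by the smallest class (an assumption).
    """
    rng = random.Random(seed)
    shuffled = {}
    for label, docs in docs_by_class.items():
        docs = list(docs)  # copy so the caller's lists are left untouched
        rng.shuffle(docs)
        shuffled[label] = docs
    n_rounds = min(len(docs) for docs in shuffled.values()) // batch_size
    for r in range(n_rounds):
        start, end = r * batch_size, (r + 1) * batch_size
        yield {label: docs[start:end] for label, docs in shuffled.items()}
```

Each yielded dictionary could then be split into train and test portions and passed to a classifier, with the reported score being the average over rounds.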

Directories:

  • corpus/* : directories of texts produced by native and non-native MEPs
  • corpus/train_test_split: the train-test splits for comparing classification similarities across languages
  • analogies: csv files of misaligned words across corpora
  • feature_selection: accuracy, F1 scores, and confusion matrices saved for the different classification scenarios; filenames follow the pattern FEATURE_LANGUAGE_PAIRS.ftrs
  • features: lists of word and PoS features
  • src: source file directory

Maximum nr of sents / doc: 66

  • Scotland chunks: 237
  • England chunks: 911
  • Ireland chunks: 476
  • EU chunks: 449
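
The training cap of 197 documents follows directly from these counts and the 40 test documents held out per class. A short check, with variable names chosen for illustration:

```python
# Chunk counts per class, as listed in this README.
chunks = {"Scotland": 237, "England": 911, "Ireland": 476, "EU": 449}
TEST_DOCS_PER_CLASS = 40

# The smallest class limits how many training documents every class can use.
smallest = min(chunks, key=chunks.get)
train_cap = chunks[smallest] - TEST_DOCS_PER_CLASS

print(smallest, train_cap)  # Scotland 197
```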

Results

Original English Data

| setup | feature | En vs. Ir | En vs. Sc | Ir vs. Sc | 3-way |
|---|---|---|---|---|---|
| English Originals | function_words_en | 0.9 | 0.91 | 0.85 | 0.8 |
| English Originals | pronouns_en | 0.63 | 0.76 | 0.69 | 0.57 |
| English Originals | PoS n-grams | 0.91 | 0.87 | 0.91 | 0.83 |
| English Originals | selected_pos_ngrams_en | 0.88 | 0.85 | 0.86 | 0.78 |
| English Originals | selected_pos_ngrams_fr | 0.82 | 0.71 | 0.77 | 0.64 |
| English no Entities | Word n-grams | 0.91 | 0.89 | 0.92 | 0.83 |

French Human and Machine Translated Data

| setup | feature | En vs. Ir | En vs. Sc | Ir vs. Sc | 3-way |
|---|---|---|---|---|---|
| French Translations | function_words_fr | 0.84 | 0.87 | 0.78 | 0.71 |
| French Translations | pronouns_fr | 0.82 | 0.8 | 0.72 | 0.66 |
| French Translations | PoS n-grams | 0.89 | 0.82 | 0.76 | 0.74 |
| French Translations | selected_pos_ngrams_fr | 0.78 | 0.76 | 0.62 | 0.59 |
| French Translations | selected_pos_ngrams_en | 0.8 | 0.76 | 0.71 | 0.59 |
| French no Entities | Word n-grams | 0.97 | 0.91 | 0.95 | 0.9 |
| French MT | function_words_fr | 0.88 | 0.84 | 0.81 | 0.72 |
| French MT | pronouns_fr | 0.85 | 0.85 | 0.74 | 0.71 |
| French MT | PoS n-grams | 0.94 | 0.87 | 0.84 | 0.78 |
| French MT | selected_pos_ngrams_fr | 0.83 | 0.73 | 0.77 | 0.66 |
| French MT | selected_pos_ngrams_en | 0.78 | 0.79 | 0.72 | 0.62 |
| French MT no Entities | Word n-grams | 0.99 | 0.91 | 0.95 | 0.9 |
