superstyl's Introduction

SUPERvised STYLometry

Installing

You will need Python 3.8 or later, virtualenv, and pip.

git clone https://github.com/SupervisedStylometry/SuperStyl.git
cd SuperStyl
virtualenv -p python3.8 env
source env/bin/activate
pip install -r requirements.txt
# And get the model for language prediction
mkdir -p superstyl/preproc/models
wget https://dl.fbaipublicfiles.com/fasttext/supervised-models/lid.176.bin -P ./superstyl/preproc/models/

Workflow

FIXME: look inside the scripts, or do

python main.py --help
python train_svm.py --help

for full documentation on the main functionalities of the CLI: data generation (main.py) and SVM training (train_svm.py).

For more specific data-processing tasks (splitting and merging datasets), see also:

python split.py --help
python merge_datasets.csv.py --help

Get feats

With or without a preexisting feature list:

# without a feature list
python main.py -s path/to/docs/* -t chars -n 3
# with a feature list
python main.py -s path/to/docs/* -f feature_list.json -t chars -n 3
# There are several other available options
# See --help
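To make the `-t chars -n 3` options concrete, here is a minimal Python sketch of counting character 3-grams and projecting them onto a feature list. It is an illustration only, not SuperStyl's code: the function names `char_ngrams` and `relative_freqs` are hypothetical, and normalising counts to relative frequencies is an assumption about what a feature table might contain.

```python
from collections import Counter


def char_ngrams(text, n=3):
    """Count overlapping character n-grams in a string."""
    return Counter(text[i:i + n] for i in range(len(text) - n + 1))


def relative_freqs(counts, feature_list):
    """Relative frequency of each listed feature (0.0 if absent).

    Assumption: frequencies are relative to all n-grams in the text.
    """
    total = sum(counts.values())
    return {f: (counts[f] / total if total else 0.0) for f in feature_list}


counts = char_ngrams("abracadabra", n=3)
# A feature list is simply a fixed set of n-grams to measure in every document.
feature_list = [gram for gram, _ in counts.most_common(2)]
row = relative_freqs(counts, feature_list)
```

Reusing one feature list across runs is what keeps the columns identical between a training set and data processed later.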

Alternatively, you can build samples out of the data, for a given number of verses or words:

# words from txt
python main.py -s data/psyche/train/* -t chars -n 3 -x txt --sampling --sample_units words --sample_size 1000
# verses from TEI encoded docs
python main.py -s data/psyche/train/* -t chars -n 3 -x tei --sampling --sample_units verses --sample_size 200
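As a rough picture of what sampling by words means, the sketch below cuts a text into consecutive samples of a fixed number of words. The helper `word_samples` is hypothetical, and two choices are assumptions rather than documented behaviour of main.py: words are split on whitespace, and a trailing incomplete sample is dropped.

```python
def word_samples(text, sample_size):
    """Cut a text into consecutive, non-overlapping samples of sample_size words."""
    words = text.split()  # assumption: whitespace tokenisation
    last_start = len(words) - sample_size
    # Assumption: the final, shorter-than-sample_size remainder is discarded.
    return [words[i:i + sample_size] for i in range(0, last_start + 1, sample_size)]


samples = word_samples("one two three four five", sample_size=2)
```

Sampling by verses follows the same idea, with lines of verse in place of words.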

Optional: Do a fixed split

You can either perform k-fold cross-validation (including leave-one-out), in which case this step is unnecessary, or do a classical random train/test split.

To do an initial random split:

python split.py feats_tests.csv

To split according to an existing JSON file:

python split.py feats_tests.csv -s split.json

Other options are available (see --help), e.g.:

python split.py feats_tests.csv -m langcert_revised.csv -e wilhelmus_train.csv
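The idea of a random train/test split can be sketched in a few lines of Python. This is not how split.py is implemented: the function `random_split`, the `test_ratio` parameter, and the fixed seed are all assumptions made for illustration.

```python
import random


def random_split(ids, test_ratio=0.1, seed=42):
    """Shuffle sample ids and return (train_ids, test_ids)."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    shuffled = list(ids)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_ratio)
    return shuffled[n_test:], shuffled[:n_test]


train_ids, test_ids = random_split(range(10), test_ratio=0.2)
```

Saving the resulting ids to a file is what makes a split reusable later, which is the purpose of splitting according to an existing file.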

Train SVM

It's quite simple really,

python train_svm.py path-to-train-data.csv [--test_path TEST_PATH] [--cross_validate {leave-one-out,k-fold}] [--k K] [--dim_reduc {pca}] [--norms] [--balance {class_weight,downsampling,Tomek,upsampling,SMOTE,SMOTETomek}] [--class_weights] [--kernel {LinearSVC,linear,polynomial,rbf,sigmoid}] [--final] [--get_coefs]

For instance, using leave-one-out or 10-fold cross-validation:

# e.g.
python train_svm.py data/feats_tests_train.csv --norms --cross_validate leave-one-out
python train_svm.py data/feats_tests_train.csv --norms --cross_validate k-fold --k 10
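The difference between the two cross-validation modes comes down to how samples are partitioned into held-out folds. The sketch below generates the index partitions only, with hypothetical helper names; it assumes contiguous, unshuffled folds and says nothing about how train_svm.py builds its folds internally.

```python
def k_fold_indices(n_samples, k):
    """Yield (train_idx, test_idx) pairs for k contiguous folds."""
    # Spread the remainder over the first folds so sizes differ by at most one.
    sizes = [n_samples // k + (1 if i < n_samples % k else 0) for i in range(k)]
    start = 0
    for size in sizes:
        stop = start + size
        test_idx = list(range(start, stop))
        train_idx = [i for i in range(n_samples) if i < start or i >= stop]
        yield train_idx, test_idx
        start = stop


def leave_one_out_indices(n_samples):
    """Leave-one-out is k-fold with one sample per fold."""
    return k_fold_indices(n_samples, n_samples)


folds = list(k_fold_indices(10, 3))
loo = list(leave_one_out_indices(4))
```

With leave-one-out, the model is retrained once per sample, so it is the most expensive option on large datasets.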

Or a train/test split

# e.g.
python train_svm.py data/feats_tests_train.csv --test_path test_feats.csv --norms

And for a final analysis, applied to unseen data:

# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --final

With a few more options:

# e.g.
python train_svm.py data/feats_tests_train.csv --test_path unseen.csv --norms --class_weights --final --get_coefs

Sources

Cite this repository

You can cite it using the CITATION.cff file (and GitHub's citation feature), as follows:

@software{Camps_SUPERvised_STYLometry_SuperStyl_2021,
  author = {Camps, Jean-Baptiste},
  doi = {...},
  month = {...},
  title = {{SUPERvised STYLometry (SuperStyl)}},
  version = {...},
  year = {2021}
}

FastText models

FastText

If you use these models, please cite the following papers:

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification

@article{joulin2016bag,
  title={Bag of Tricks for Efficient Text Classification},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1607.01759},
  year={2016}
}

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing text classification models

@article{joulin2016fasttext,
  title={FastText.zip: Compressing text classification models},
  author={Joulin, Armand and Grave, Edouard and Bojanowski, Piotr and Douze, Matthijs and J{\'e}gou, H{\'e}rve and Mikolov, Tomas},
  journal={arXiv preprint arXiv:1612.03651},
  year={2016}
}

superstyl's People

Contributors

jean-baptiste-camps, ponteineptique, floriancafiero, malichot, etienneferrandi, dependabot[bot]
