This repository contains the code and configuration files for transforming the child language acquisition corpora into the ACQDIV database.
If you use the database in your research, please cite as follows:
Jancso, Anna, Steven Moran, and Sabine Stoll.
"The ACQDIV Corpus Database and Aggregation Pipeline."
Proceedings of The 12th Language Resources and Evaluation Conference. 2020.
Download the ACQDIV database (only public corpora):
To request access to the full database including the private corpora (for research purposes only!), please contact Sabine Stoll. For technical questions, please open an issue on this repository.
Our full database consists of the following corpora:
Corpus | ISO | Public | # Words |
---|---|---|---|
Chintang Language Corpus | ctn | no | 987'673 |
Cree Child Language Acquisition Study (CCLAS) Corpus | cre | yes | 44'751 |
English Manchester Corpus | eng | yes | 2'016'043 |
MPI-EVA Jakarta Child Language Database | ind | yes | 2'489'329 |
Allen Inuktitut Child Language Corpus | ike | no | 71'191 |
MiiPro Japanese Corpus | jpn | yes | 1'011'670 |
Miyata Japanese Corpus | jpn | yes | 373'021 |
Ku Waru Child Language Socialization Study | mux | yes | 65'723 |
Sarvasy Nungon Corpus | yuw | yes | 19'659 |
Qaqet Child Language Documentation | byx | no | 56'239 |
Stoll Russian Corpus | rus | no | 2'029'704 |
Demuth Sesotho Corpus | sot | yes | 177'963 |
Tuatschin Corpus | roh | no | 118'310 |
Koç University Longitudinal Language Development Database | tur | no | 1'120'077 |
Pfeiler Yucatec Child Language Corpus | yua | no | 262'382 |
Total | | | 10'843'735 |
For Windows users, follow the installation/run instructions here: https://github.com/acqdiv/acqdiv/wiki/Installation-Run-instructions-for-Windows
For Mac and Linux users, continue here to run the pipeline yourself:
Create a virtual environment [optional]:
python3 -m venv venv
source venv/bin/activate
You can install the package from PyPI or directly from source:
PyPI
pip install acqdiv
From source
# Clone Repository
git clone git@github.com:acqdiv/acqdiv.git
cd acqdiv
# Install package (for users!)
pip install .
# Developer mode (for developers!)
pip install -r requirements.txt
Run the following script to download the public corpora:
python util/download_public_corpora.py
The corpora are in the folder corpora.
For the private corpora, either place the session files in corpora/<corpus_name>/{cha|toolbox}/ and the metadata files (only Toolbox corpora) in corpora/<corpus_name>/imdi/, or edit the paths to those files in the config.ini (also see below).
Get the configuration file src/acqdiv/config.ini and specify the absolute paths (without trailing slashes) for the corpora directory (corpora_dir) and the directory where the database should be written to (db_dir):
[.global]
# directory containing corpora
corpora_dir = /absolute/path/to/corpora/dir
# directory where the database is written to
db_dir = /absolute/path/to/database/dir
...
Optionally adapt the paths for the individual corpora (sessions and metadata_dir).
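Since the paths must be absolute and must not end in a slash, it can help to check them before starting the pipeline. The following is a minimal sketch, not part of the acqdiv package: the helper name check_global_paths is hypothetical, and it assumes the file can be read by Python's standard configparser and that the two keys live in the [.global] section as shown above.

```python
import configparser
from pathlib import PurePosixPath


def check_global_paths(config_text):
    """Return a list of problems found in the [.global] path settings.

    Hypothetical helper for illustration; acqdiv itself may validate
    its configuration differently.
    """
    parser = configparser.ConfigParser()
    parser.read_string(config_text)

    problems = []
    for key in ("corpora_dir", "db_dir"):
        # fallback=None also covers a missing [.global] section
        value = parser.get(".global", key, fallback=None)
        if value is None:
            problems.append(f"{key}: missing")
            continue
        if not PurePosixPath(value).is_absolute():
            problems.append(f"{key}: not an absolute path")
        if len(value) > 1 and value.endswith("/"):
            problems.append(f"{key}: has a trailing slash")
    return problems


if __name__ == "__main__":
    sample = (
        "[.global]\n"
        "# directory containing corpora\n"
        "corpora_dir = /absolute/path/to/corpora/dir\n"
        "# directory where the database is written to\n"
        "db_dir = /absolute/path/to/database/dir\n"
    )
    print(check_global_paths(sample))
```

To check a real file, read its contents first and pass the text to the function; an empty list means both paths satisfy the two rules stated above.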
Run the pipeline specifying the absolute path to the configuration file:
acqdiv load -c /absolute/path/to/config.ini
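Once the pipeline has finished, a quick way to confirm that the database was written is to list its tables and their row counts. The sketch below assumes the output is an SQLite file and makes no assumption about the table names; the function name table_row_counts is hypothetical and not part of the acqdiv package.

```python
import sqlite3


def table_row_counts(conn):
    """Map each table name in the database to its number of rows."""
    cursor = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    names = [row[0] for row in cursor.fetchall()]

    counts = {}
    for name in names:
        # Quote the identifier so unusual table names do not break the query
        quoted = '"' + name.replace('"', '""') + '"'
        counts[name] = conn.execute(
            f"SELECT COUNT(*) FROM {quoted}"
        ).fetchone()[0]
    return counts


if __name__ == "__main__":
    # Demonstration on an in-memory database with a made-up table
    demo = sqlite3.connect(":memory:")
    demo.execute("CREATE TABLE demo (id INTEGER)")
    demo.executemany("INSERT INTO demo VALUES (?)", [(1,), (2,)])
    print(table_row_counts(demo))
    demo.close()
```

For the real database, open it with sqlite3.connect using the path of the file that the pipeline wrote into db_dir and pass the connection to the function.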
Install the R dependencies:
$ R
> install.packages("RSQLite")
> install.packages("rlang")
Navigate to src/acqdiv/database and run:
Rscript sqlite_to_r.R /absolute/path/to/sqlite-DB
Run the unit tests:
pytest tests/unittests
Run the integrity tests on the database:
pytest tests/systemtests