(note (2015-10-27): this app is under heavy development and the readme will probably be out of date.)
corpex (corpus explorer) is for searching through large (billions of tokens) linguistic corpora. It is intended to be useful for both language instructors and researchers by providing them an easy interface for linguistic corpora which otherwise might have remained out of technical reach.
corpex was developed for the HindMonoCorp corpus, but it can be adapted for other corpora as well.
Demo
Go to corpex.lgessler.com.
Right now corpex works by matching regular expressions against raw text. So to find all the instances of चाहता at the end of a sentence, we'd enter चाहता[\s+]?[.!?।] into the search bar and let it rip!
(Note: the inverted index has not been implemented yet (2015-10-30), so searches will be absurdly slow.)
Components used
corpex uses Meteor.js with Bootstrap 3 and some supporting Python scripts. It was built on top of Differential's Meteor Boilerplate Lite.
Installation with HindMonoCorp
Linux/OSX
First, install Meteor:
curl https://install.meteor.com/ | sh
Clone repo:
git clone https://github.com/lgessler/corpex.git
cd corpex
Launch Meteor:
cd app
meteor
In a new terminal, download HindMonoCorp. As of 2015-10-29, only small amounts of data are supported, so pull out a sample.
cd data
wget https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-6260-A/hindmonocorp05.plaintext.gz?sequence=2&isAllowed=y
gunzip hindmonocorp05.plaintext.gz
head -n 10000 hindmonocorp05.plaintext > hmcsample.txt
Find your Meteor port (default should be 3001):
cd ../app
meteor mongo -U
(result: mongodb://127.0.0.1:3001/meteor)
Install dependencies and run python script to populate MongoDB:
pip3 install pymongo tqdm
python3 populate_db.py data/hmcsample.txt 3001
You should now be able to navigate to localhost:3000
and begin querying.
Windows
Contact me if you need help and we can figure it out together to fill this section out :^)
Licensing
MIT