corpex's Introduction

(Note, 2015-10-27: this app is under heavy development, and the README will probably be out of date.)

corpex (corpus explorer) is a tool for searching large linguistic corpora (billions of tokens). It is intended for both language instructors and researchers, giving them an easy interface to corpora that might otherwise have remained out of technical reach.

corpex was developed for the HindMonoCorp corpus, but it can be adapted for other corpora as well.

Demo

Go to corpex.lgessler.com.

Right now, corpex works by matching regular expressions against raw text. To find every instance of चाहता at the end of a sentence, for example, enter चाहता\s*[.!?।] into the search bar and let it rip! (\s* allows optional whitespace before the sentence-final punctuation.)
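
To see how a pattern of this kind behaves, here is a minimal sketch using Python's re module. It uses \s* for the optional whitespace. The sample lines are made up for illustration, and corpex's own regex engine may differ from Python's in small ways.

```python
import re

# The word, optional whitespace, then one sentence-final punctuation
# mark (the character class includes the Devanagari danda).
SENTENCE_FINAL = re.compile(r"चाहता\s*[.!?।]")

# Hypothetical sample lines, invented for this example only.
sample_lines = [
    "वह घर जाना चाहता है।",  # word is mid-sentence: no match
    "मैं यही चाहता ।",       # word, space, danda: match
    "क्या वह यही चाहता?",    # word, question mark: match
]

matching_lines = [line for line in sample_lines if SENTENCE_FINAL.search(line)]
print(len(matching_lines))
```

Only the lines in which चाहता is directly followed by sentence-final punctuation are kept.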

(Note: as of 2015-10-30, the inverted index has not been implemented yet, so searches will be absurdly slow.)
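
For context, an inverted index maps each token to the places where it occurs, so a lookup does not have to scan the raw text. The sketch below only illustrates the general idea and is not corpex's implementation (which does not exist yet). The whitespace tokenization, the function name, and the toy corpus are all assumptions.

```python
from collections import defaultdict


def build_inverted_index(lines):
    """Map each whitespace-separated token to the set of line numbers containing it."""
    index = defaultdict(set)
    for line_number, line in enumerate(lines):
        for token in line.split():
            index[token].add(line_number)
    return index


# Hypothetical three-line corpus, invented for this example only.
corpus = ["मैं यही चाहता ।", "वह घर गया ।", "वह यही चाहता ?"]
index = build_inverted_index(corpus)

# Looking up a token is a dictionary access rather than a full scan.
hits = sorted(index.get("चाहता", set()))
print(hits)
```

An index like this answers exact-token queries quickly, but arbitrary regular expressions would still need extra work on top of it.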

Components used

corpex uses Meteor.js with Bootstrap 3 and some supporting Python scripts. It was built on top of Differential's Meteor Boilerplate Lite.


Installation with HindMonoCorp

Linux/OSX

First, install Meteor:

curl https://install.meteor.com/ | sh

Clone repo:

git clone https://github.com/lgessler/corpex.git
cd corpex

Launch Meteor:

cd app
meteor

In a new terminal, download HindMonoCorp. As of 2015-10-29, only small amounts of data are supported, so pull out a sample.

cd data
wget -O hindmonocorp05.plaintext.gz "https://lindat.mff.cuni.cz/repository/xmlui/bitstream/handle/11858/00-097C-0000-0023-6260-A/hindmonocorp05.plaintext.gz?sequence=2&isAllowed=y"
gunzip hindmonocorp05.plaintext.gz
head -n 10000 hindmonocorp05.plaintext > hmcsample.txt
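
As an alternative to the gunzip and head steps, the sample can be taken straight from the compressed file, which avoids decompressing the whole corpus. This is a minimal sketch, not part of corpex: the function name is hypothetical, and it assumes the corpus is UTF-8 text.

```python
import gzip
from itertools import islice


def write_sample(gz_path, out_path, n_lines=10000):
    """Copy the first n_lines lines of a gzipped text file to out_path.

    Returns the number of lines actually written.
    """
    written = 0
    # "rt" decompresses on the fly and decodes to text (UTF-8 is assumed).
    with gzip.open(gz_path, "rt", encoding="utf-8") as src, \
            open(out_path, "w", encoding="utf-8") as dst:
        for line in islice(src, n_lines):
            dst.write(line)
            written += 1
    return written
```

Calling write_sample("hindmonocorp05.plaintext.gz", "hmcsample.txt") from the data directory would produce the same sample file as the commands above, provided the .gz file is still present.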

Find the port of Meteor's MongoDB instance (the default should be 3001):

cd ../app
meteor mongo -U
(result: mongodb://127.0.0.1:3001/meteor)
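
If you would rather not read the port off by eye, it can be pulled out of the printed URL with the standard library. A small sketch, with a hypothetical helper name:

```python
from urllib.parse import urlparse


def mongo_port(mongo_url):
    """Return the port number from a mongodb:// URL, or None if it has none."""
    return urlparse(mongo_url).port


port = mongo_port("mongodb://127.0.0.1:3001/meteor")
print(port)
```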

Install the dependencies and run the Python script to populate MongoDB:

pip3 install pymongo tqdm
python3 populate_db.py data/hmcsample.txt 3001
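
For readers curious what a population step of this kind involves, here is a rough sketch. It is not the actual populate_db.py: the one-document-per-line layout, the "text" field, and both function names are assumptions made for illustration. The collection argument is anything with an insert_many method, such as a pymongo collection.

```python
from itertools import islice


def batched_documents(lines, batch_size=1000):
    """Yield lists of {"text": ...} documents, at most batch_size per list."""
    iterator = iter(lines)
    while True:
        batch = [{"text": line.rstrip("\n")} for line in islice(iterator, batch_size)]
        if not batch:
            return
        yield batch


def populate(collection, path, batch_size=1000):
    """Insert every line of the file at path into collection, batch by batch.

    Returns the number of documents inserted.
    """
    inserted = 0
    with open(path, encoding="utf-8") as corpus_file:
        for batch in batched_documents(corpus_file, batch_size):
            # insert_many sends the whole batch in one round trip.
            collection.insert_many(batch)
            inserted += len(batch)
    return inserted
```

With pymongo installed, a collection could be obtained with MongoClient("127.0.0.1", 3001)["meteor"]["sentences"], where "sentences" is a hypothetical collection name; check populate_db.py for the name corpex really uses.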

You should now be able to navigate to localhost:3000 and begin querying.

Windows

Contact me if you need help and we can figure it out together to fill this section out :^)


Licensing

MIT
