meta-toolkit / meta Goto Github PK

A Modern C++ Data Sciences Toolkit

License: MIT License

CMake 1.68% Python 0.16% Ruby 0.01% Shell 0.14% C 0.28% C++ 97.74%

nlp nlp-parsing search-engine inverted-index pos-tag text-analysis text-analytics text-classification language-modeling graph-algorithms

meta's People

Contributors

Stargazers

Watchers

Forkers

wavelets respu jilinhuang koreanfoodcomics booleanbanana hazimehh skystrife basilrormose ethanccy klperez armorleon seanxiangcheng coloratto pradyath jediz syeoryn abhayprakash retrojoshua gredoy jumarko howecoursera richardwxn cuissai chengjiun chengat1314 mbaroudi sinb yisiliu vishnur sunnybingome leigaosearch yangwenca ducphuong cybernetics zhushun0008 zgamez mishakakkar programmer102 uloma07 canoefzh henfiber ibrahimibrahim christopherok alexocculate jeffzfw tecposter lamwolf2010 nitishpareek7 karmine yiwitzki jiachengpan joker-jerome adsar paolotof rygbee irenewang sunsure gef756 esparza83 shkdidrlf jerry-chee sorhawell xianaiyang thebearer696 jgsogo amallia bhargavaramm anukat2015 domarps judit-b shawnkx palashahuja mkreiser mgkhkhd tjhubert giacomini hanjinda xuanhan863 fatmas1982 sanj24home nakols kingsea2016 mrwill84 siddshuk eektguj tktangazure gabrielelanaro louisegan514 djmott gress2 gembree1 solertis fchandra09 shlemas danlg ubuntuevangelist greenfigo2015 pranav-a minghao2016 aalind0

meta's Issues

Phrase Analyzer

Use the output of a chunker to create features based on strings of words. This will be particularly useful when combined with the topic modeling algorithms to create phrase-based models.

reformulate language model rankers

It may be possible to simplify the formula for the language model retrieval methods. ~~Additionally, we should look into a fast approximate log implementation~~.

Update topic modeling to use forward_index and config.toml

Add unit-tests to travis-ci integration.

Right now, travis-ci builds with gcc 4.8 and clang 3.3. This is a great start, but it would be great to extend this to also run the unit tests.

There area few problems that need tackled to do this. First, we need some way of getting the dataset(s) used in the unit-tests onto the travis-ci machine. A public webserver that hosts these would probably be a good idea (can we put them on timan103 or another department server?), since then we can just wget them.

Second, the current unit-test framework reports the success/failure of the test only on the command line via text. It ignores the return code of the child processes that are running each individual test. Ideally, we'd like the unit-test executable to also return 0 on success and 1 on failure.

Add register_ranker() function.

Print timeout when time exceeded in unit-test

Stop and start LDA inference

Investigate neural networks

Streamline installation instructions

The top of the README/front page of the website should be modified to

Contain concrete minimum system requirements for Ubuntu as well as OSX
Contain instructions on installing the ICU dependency on the latest Ubuntu as well as OSX.
Contain instructions on getting CMake >= 3.0 on Ubuntu as well as OSX
Make it more immediately clear how to check out from git with all submodules (e.g. git clone https://github.com/meta-toolkit/meta.git --recursive)
Make it more immediately clear how to choose between g++ and clang++ as a compiler (basically being Linux == g++, OSX == clang++ unless you really know what you're doing)

"Favoritism" shown for Ubuntu/OSX because those seem to be the most common compatible platforms we're getting installation questions about. People not on those distros are probably more likely to be able to figure out how to install the dependencies based on the instructions given for Ubuntu (but we might eventually want to include a section for other distros...)

Add ensemble methods

add option in unit_test.h to not fork the process to allow debugging

Doxygen all the files

Make sure everything is commented correctly.

Sean:

Chase:

feature scaling for classifiers

Option to scale feature vectors in a predefined range (e.g. [0,1]). This should reduce SGD's convergence rate as well as make distance calculations proportional.

ranker_tests are comparing doubles with ASSERT_EQUAL

They should instead by ASSERT_BINOP with some approximately equal binary op.

Alternatively, we could add ASSERT_APPROX_EQUAL or ASSERT_FP_EQUAL that special cases for floating point "equality".

Relevance judgements using meta::index

meta-stanford-preprocessor on github in meta-toolkit namespace?

I think it might be beneficial to move meta-stanford-preprocessor into the meta-toolkit namespace on Github, and maybe even add it as a submodule.

Documentation for meta::analyzers

Recreate feature selection with the current indexes

Will need to discuss how we want this implemented.

Need to take into account large datasets, unlike the previous version
Runtime feature selection or index-creation time?

Reduce POSIX/Linux assumptions

We make some strong assumptions currently about the underlying system's capabilities. We should try to relax and/or encapsulate these assumptions so that we can build on multiple platforms (e.g., Windows and BSD).

Things I can remember that are assuming POSIX or Linux:

file descripors
mmap()
system() in unit-test for deleting a directory recursively
fork()/waitpid() in unit-test (Windows is probably going to be a pain here...)
endian-ness assumptions in the index files (e.g., disk_vector)

Investigate margin methods for perceptron

Adjust unit-test timeouts for Debug mode

We could get fancy and look at #ifdef NDEBUG, or just increase these across the board.

These are the unit tests that fail for me in debug mode because of timeouts:

 winnow-cv-file                                   [ FAIL ] Time limit exceeded
 winnow-split-file                                [ FAIL ] Time limit exceeded

 winnow-cv-line                                   [ FAIL ] Time limit exceeded
 winnow-split-line                                [ FAIL ] Time limit exceeded

 ranker-dirichlet-prior                           [ FAIL ] Time limit exceeded
 ranker-jelinek-mercer                            [ FAIL ] Time limit exceeded
 ranker-okapi-bm25                                [ FAIL ] Time limit exceeded
 ranker-pivoted-length                            [ FAIL ] Time limit exceeded

Unit tests for compressed_file_{reader,writer}

We need tests for these: read uint64_ts and strings with various "encodings".

nearest_centroid classifier

This will be a good complement to knn.

Documentation for meta::index

Check for malformed index on load?

Right now, if the index structure was created (I think all it takes is for the folder to exist?) but not finished, you get an exception thrown about not being able to open the postings file for the index.

This is programmatically fine, but maybe we should be doing something on that exception in the example applications (such as just forcing a re-indexing)?

Investigate support for spaces in path names for the config file and file corpus list

Documentation for meta::classify

Remove zip/tarball downloads from website

Since Github currently doesn't package things recursively, they're worse than useless and seem to be causing confusion.

Unfortunately, there will still be the zip download option on the repository main page (and I don't see an immediate way of disabling that), so we will still have to make it clear that you need to check out from git to compile from source.

Decision trees

Separate learning algorithms from models

As a general refactoring, I think it would be good to separate the learning (or inference) algorithms from the models themselves.

For example, sgd could be separated out into a linear classifier model + the sgd learning algorithm. Once the model is learned, we don't really care about the algorithm that was used to learn it. This also will help separate out the logic for learning from the logic for doing things with a learned model.

This can likely also apply to the topic models, where fundamentally we have lda as a model, and then four different inference algorithms for it.

Use fastapprox for fast math calculations

http://fastapprox.googlecode.com/svn/trunk/fastapprox/src/fastonebigheader.h

We definitely want to use fast log in the rankers. We also probably want to use fast exp in some learning algorithms.

Add ability to load classifiers from model files

Currently some of the classifiers save model files (like sgd), but they don't have a way of initializing from a pre-trained model. For sgd specifically, the constructor calls reset() immediately.

We should add another constructor that allows for loading a pre-trained classifier from its model file. We'll need to change the way that the model file is stored most likely, but this should be a relatively easy change (for sgd anyway).

dashes versus underscores in documentation

Make sure those are consistent

Generalize document tokenization with analyzer refactoring

In progress on "analyzers" branch.

I think this is the last major coding blocker for JMLR?

Winnow broken under gcc

 winnow-cv-file                                   [ FAIL ] Time limit exceeded
 winnow-split-file                                acc: 0.119048
[ FAIL ] Assertion failed: mtx.accuracy() > min_accuracy (/home/chase/projects/meta/test/classifier_test.h:46)

and

 winnow-cv-line                                   [ FAIL ] Time limit exceeded
 winnow-split-line                                acc: 0.119048
[ FAIL ] Assertion failed: mtx.accuracy() > min_accuracy (/home/chase/projects/meta/test/classifier_test.h:46)

(I added a printout of the accuracy of the classifier to the unit tests, which shows it's clearly confused about something.)

Is clang/libc++ giving us some behavior that's not guaranteed by the standard that we're depending on?

This is with everything compiled with g++ in release mode.

Clean up feature select interface

Reduce amount of full-scans required for forward_index creation from libsvm data

The current implementation of forward_index scans through the libsvm formatted file at least three times:

To get the number of documents in the index
To set the document byte positions for the index
To initialize all document-level metadata for the index

When the data file is huge, this is really overkill. I think we ought to be able to combine these all into one pass through the file, and then just initialize all of the disk_vectors after we have created their files for them. Since we're single-threaded and just going through the data file in sequential order, we should be able to just write the numbers/doubles in binary to the appropriate file and then read them back in with disk_vector, I think. This would be a pretty big improvement for, say, the mnist8m dataset.

Use CTest for unit testing

This may address some of our concerns with making our current unit-test framework cross-platform.

CTest is part of CMake and is configured within CMakeLists.txt. Here's an example:

enable_testing()
add_test(name-of-test executable argument1 argument2 argument3)
set_tests_properties(name-of-test PROPERTIES TIMEOUT 5) # 5 second time limit

Then, to run, you can issue either make test or ctest on the command line. For MSVC, it will add a target for building the tests.

My proposal, then, is basically to eliminate the fork() calls in unit_test.cpp (make debug mode default) and convert our existing run_test calls to add_test() in the CMakeLists.txt. We'd need to enhance the harness to allow for a single test to run, but I think this can be done elegantly.

This does not deprecate the unit testing framework as a whole as we still need something to do the actual unit testing for us, but it would deprecate the timeout/signal handling parts.

Thoughts? You can test this right now by doing something like:

enable_testing()
add_test(classifier-tests unit-test classifiers)

in CMakeLists.txt, regenerating, and running ctest or make test.

Use adjusted term score for displaying top words in each topic

term_score_{k,v}= ^{\beta_{k,v}}\log\left(\frac{^{\beta_{k,v}}}{(\prod_{j=1}^K\ ^{\beta_{j,v}})^{\frac{1}{K}}}\right)

Documentation for meta::topics

Consider using a git submodule for the liblinear and slda dependencies.

Codify our style guide for clang-format

Partition data more explicitly during indexing?

Based on our observations for the performance of indexing speed with the reddit dataset, I wonder whether or not we would get better performance if we pre-split the data before passing the documents off to the analyzer in each thread as opposed to the current situation where the threads all potentially compete for the mutex that surrounds the shared queue.

Basically, what I'm imagining is lazily-loading the document's content instead of loading it when it's read from the corpus, and then just creating a huge vector of all of the documents, partitioning it into num_threads parts, and then having each thread tokenize just that segment. Perhaps that can eliminate the contention for the mutex? This is likely to be a bigger concern when documents are super small.

Option to evenly split labels in classifier

Also in a config option, something under [classifier], like even-split = true (false by default). Finds the label with the lowest number of documents and randomly truncates the rest to be that amount.

It should be split during classifier runtime (still index the whole corpus).

unit-test ASSERT() doesn't print values of expression

Thinking something like what monad does, where it actually will print the left and right hand sides of the assertion failure when doing something like

ASSERT(x == y);

Investigate split-ordered lists for concurrent hash maps

Add progress reporting to filesystem::copy_file()

Alternatively, we could have another function, filesystem::copy_file_with_progress() or something.

Basically, there's what seems like a long "hang" when making a forward_index from several gigabyte files of libsvm data because it has to copy over the file first. It would be nice if we could give progress output while this is happening.

This can be done with filesystem::file_size() to get the number of total bytes, reading the file in with ifstream::read(), using ifstream::gcount() to get the actual number of bytes read in each chunk-read, and then using printing::progress::operator() appropriately to signal the current number of bytes that have been processed (or, perhaps better, the number of "chunks").

Allow line-corpus format corpora to be segmented

For example, our yelp dataset has two segments: score (which is everything) and sentiment (which is just the extreme reviews). These subsets can be independently indexed if using file_corpus, but not when using line_corpus.

Not sure if this is worth it or not, but it could potentially be useful for the same reasons it was useful for file_corpus.

ngram_pos_analyzer using CRF

Change ngram_pos_analyzer to use MeTA's CRF for POS tagging. Trained model can be specified in the config file.

Might also want to look into a general analyzer function extract_sentences since diff_analyzer (and the future tree_analyzer) requires this method. extract_sentences could return either a sequence::sequence or lm::sentence. It would also make sense to convert between the two.