Giter Site home page Giter Site logo

sujayhegde / salm Goto Github PK

View Code? Open in Web Editor NEW

This project forked from moses-smt/salm

0.0 2.0 0.0 378 KB

SALM: Suffix Array and its Applications in Empirical Language Processing by Joy

License: GNU General Public License v2.0

Makefile 6.00% C++ 94.00%

salm's Introduction

SALM: Suffix Array tool kit for empirical Language Manipulations.
By Joy, [email protected]

1) Download the source code from: http://projectile.is.cs.cmu.edu/research/public/tools/salm/salm.htm or http://www.sourceforge.net/projects/salm
2) Build binaries:
	a) For Linux platform:
		cd Distribution/Linux
		make allO32 (for 32-bit platform)
		or
		make allO64 (for 64-bit platform)
		
		Binaries are created under Bin/Linux
		
	b) For Win32 platform
		open project files under Distribution/Win32 and use Visual C++ to build executables.
		Executables are placed under Bin/Win32
		
3) Index a corpus.
	The first step is to index a corpus using IndexSA program.
	There is no limitation to the size of the corpus as long as there is enough RAM.
	A corpus of N words requires 9N bytes memory during indexing.
	
	Another constraint is that no sentence can have more than 254 words.
	
	Synposis of IndexSA:
		IndexSA corpusFileName [existingIDVocabularyFile]
	
	Optional existingIDVocabularyFile can be used to specify an existing vocabulary.
	It will be updated if the words in the corpus are new to the exising vocabulary.
	This is useful if several corpora want to share a common vocabulary.
	

4) Applications
	The key functions to suffix array applications are provided in class C_SuffixArraySearchApplicationBase and C_SuffixArrayScanningBase
	Please check the documentation and API for more details.
	
	Sample programs such as:
	
		FrequencyOfNgrams: 
			Output the frequency of an n-gram in the training corpus
			
		NGramMatchingStat4TestSet		
			Output the n-gram token matching statistics of a testing data
	
		NgramTypeInTestSetMatchedInCorpus
			Output the n-gram type matching statistics of a testing data
			
		NgramMatchingFreq4Sent
			Output the frequencies of all the embedded n-grams in a sentence
			
		NgramMatchingFreqAndNonCompositionality4Sent
			Output the non-compositionalities of the embedded n-grams in a sentence
			
		FilterDuplicatedSentences
			Filter out duplicated sentences in the training corpus and output the unique ones
					
		CollectNgramFreqCount
			Given a list of n-grams and a list of traing corpus indexed by their suffix array, collect counts of n-grams in these corpus. E.g. given a Chinese word list, one can collect the frequency of these words (as character n-grams) from several large corpora (segmented into characters).

		CalcCountOfCounts
			Output the count-of-counts information of a corpus
			
		OutputHighFreqNgram
			Specified by a configuration file, output the n-gram types that have frequencies higher than the threshold
	
		TypeTokenFreqInCorpus
			Output the type/token statistics of the corpus

5) Questions, comments and suggestions?
Please email [email protected]
   

salm's People

Contributors

hieuhoang avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.