Giter Site home page Giter Site logo

topwords's Introduction

topWords

Most of the files to be run by topWords are going to be TSV files from divitext, so it's optimized for that.

For a very basic example, assume that the input file is in your working directory (the directory you're running R from) and you are leaving all the parameters as their defaults. The only three parameters you NEED to specify are the input file, number of clades to cut the dendrogram into, and number of top distance values you want (this will default to a Kruskal-Wallis test), e.g.:

topWords(input.file = "DAN450_AZ.tsv", n.clades = 2, n.dist.values = 1)

If you have the parameters in that order, it can be even shorter:

topWords("DAN450_AZ.tsv", 2, 1)

Both of these will create an output directory with the default name "dist-output" in your working directory with the summary text files, color coded dendrogram, and graphs of the relative frequencies of the most distinct words across the top 2 clades.

Obviously you may want to an input file from another directory, in which case you just append the pathname to the file name, e.g.:

topWords("/home/documents/DAN450_AZ.tsv", 2, 1)

You may specify the output folder's name, but it will always be in the working directory [which you can change in R via the command: setwd("directory")]. The data set name is also used in the titles of the graphs and in the text files.

topWords("DAN450_AZ.tsv", 2, 1, dataset.name="DAZ_KW_2C_1L")

The above line will cause all the output to be in a folder named "DAZ_KW_2C_1L" with that string as part of the graph titles & summary files. This is obviously useful for keeping track of what files and folders are part of what analysis.

If you want to create the dendro with different metrics and linkages, the parameters are "metric", "p", and "linkage", which take the same values as dist() and hclust().

The program defaults to treating input as TSV files, but you may wish to use any other whitespace delimited files. For a text file (from the Lexomics command line Perl scripts, for example) you could call the function as:

topWords("DAZ.txt", 2, 1, is.tab.sep=FALSE, dataset.name="DAZ")

topWords only works when the data is in a format where the chunks are rows and the words are the columns. Because of this the function automatically detects the dimensions of the data being read in and if there are more rows than columns, tranposes it. This only works if there are more words than chunks, but except in a few boundary cases this will almost always be the case. If you need to run a file with more chunks than words, you will need to edit the code.

The above examples have all assumed the default distance metric, Kruskal-Wallis. As of now, there are two other options, Beta-Binomial Anova-F and Absolute Distance. Distance metrics are specified using a numeric code 1-3. 1 is Kruskal-Wallis, 2 is Beta-Binomial, and 3 is the Absoulte Distance. The metric used is also included in the graphs and summary files to make referencing them easier. Because the Beta-Binomial will require some explaining, we'll start with the Absolute Distance.

To run the function using the Absolute Distance metric:

topWords("DAN450_AZ.tsv", 2, 1, dataset.name="DAZ_AD_2C_1L", d.metric=3).

The output structure will still be the same as if it had been Kruskal-Wallis, but with the values and words calculated with the different metric.

One caveaut of the Absolte-Distance metric is words that appear in one clade but none of the others cause a divide-by-zero error. To account for this, the code will give you a warning when this happens and assign another score to these words. The score used is the next highest+1.

The Beta-Binomial Anova-F statistic is computationally intensive (much moreso than the other metrics), so topWords has functionality to make use of parallel processing to come up with faster results (if you have the "multicore" package installed). The parameters are defaulted for sequential analysis, so to run it sequentially:

topWords("DAN450_AZ.tsv", 2, 1, dataset.name="DAZ_BB_2C_1L", d.metric=2)

To run the Beta-Binomial Anova-F statistic in parallel, the run.betabin.par flag needs to be set to TRUE. The function will run on two cores by default when running parallel, but you can run it across as many cores as your system has. For example:

topWords("DAN450_AZ.tsv", 2, 1, dataset.name="DAZ_BB_2C_1L", d.metric=2, num.cores=12)

topwords's People

Contributors

raffled avatar

Watchers

James Cloos avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.