This is a suite of Machine Learning tools for identifying the language in which a piece of code is written. It could potentially be used for text editors, code hosting websites, and much more.
A seasoned programmer could quickly tell you that this program is written in C:
#include <stdio.h>
int main(int argc, const char ** argv) {
printf("Hello, world!");
}
The goal of whichlang is to teach a program to do the same. By showing a Machine Learning algorithm a large amount of code, you can teach it to identify programming languages on its own.
There are four steps to using whichlang:
- Configure Go and download whichlang.
- Fetch code samples from Github or some other source.
- Train a classifier with the code samples.
- Use the whichlang API or server with the classifier you trained.
First, follow the instructions on this page to set up Go. Once Go is set up and you have a GOPATH configured, run this set of commands:
$ go get github.com/unixpickle/whichlang
$ cd $GOPATH/src/github.com/unixpickle/whichlang
Now you have downloaded whichlang and are sitting in its root source folder.
To fetch samples from Github, you must have a Github account (having more than one Github account may be beneficial, as well). You should decide how many samples you want for each programming language. I have found that 180 is more than enough.
You can fetch samples and save them to a directory as follows:
$ mkdir /path/to/samples
$ go run cmd/fetchlang/*.go /path/to/samples 180
In the above example, I specified 180 samples per language. This will prompt you for your Github credentials (to get around strict API rate limits). If you specify a large number of samples (where 180 counts as a large number), you may hit Github's API rate limits several times during the fetching process. If this occurs, delete the partially-downloaded source directories (they are subdirectories of your sample directory and contain fewer than 180 samples), then wait an hour before re-running fetchlang. The fetchlang sub-command automatically skips any source directories that are already present, making it relatively easy to resume paused or rate-limited downloads.
With whichlang, you can train a number of different kinds of classifiers on your data. Currently, you can use the following classifiers:
- ID3
- K-nearest neighbors
- Artificial Neural Networks
- Support Vector Machines
Out of these algorithms, I have found that Support Vector Machines are the simplest to train and work very well. Artificial Neural Networks are a close second, but they have more hyper-parameters and are thus harder to tune well. In this document, I will describe how to train both of these classifiers, leaving out ID3 and K-nearest neighbors.
For any classifier you use, you must choose a "ubiquity" value. Since whichlang works by extracting keywords from source files, it is important to distinguish potentially important keywords from file-specific keywords like variable names or embedded strings. To do this, keywords which appear in fewer than N files are ignored during training and classification, where N is the "ubiquity". I have found that a ubiquity of 10-20 works when you have roughly 100 source files.
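To make the ubiquity idea concrete, here is a small sketch of the filtering step. It is not whichlang's actual implementation: the function names `fileCounts` and `ubiquitous` are hypothetical, and each file is assumed to be already split into a list of keywords.

```go
package main

import (
	"fmt"
	"sort"
)

// fileCounts returns, for each keyword, the number of files that contain it
// at least once. Repeated occurrences within one file count only once.
func fileCounts(files [][]string) map[string]int {
	counts := map[string]int{}
	for _, file := range files {
		seen := map[string]bool{}
		for _, kw := range file {
			if !seen[kw] {
				seen[kw] = true
				counts[kw]++
			}
		}
	}
	return counts
}

// ubiquitous returns, in sorted order, the keywords that appear in at least
// n files. Keywords below that threshold are dropped.
func ubiquitous(files [][]string, n int) []string {
	var kept []string
	for kw, c := range fileCounts(files) {
		if c >= n {
			kept = append(kept, kw)
		}
	}
	sort.Strings(kept)
	return kept
}

func main() {
	// Three toy "files", each already reduced to its keywords.
	files := [][]string{
		{"int", "main", "printf", "argc"},
		{"int", "main", "return", "counter"},
		{"int", "printf", "printf", "buf"},
	}
	// With a ubiquity of 2, file-specific names such as "argc",
	// "counter", and "buf" are filtered out.
	fmt.Println(ubiquitous(files, 2))
}
```

Here the threshold of 2 keeps `int`, `main`, and `printf`, each of which appears in at least two of the three files. Note that `printf` occurring twice in the last file does not raise its count, since ubiquity is measured per file.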
The easiest way to train a Support Vector Machine is to allow whichlang to select all the hyper-parameters for you. Note, however, that this option is very slow, so you may want to keep reading.
$ go run cmd/trainer/*.go svm 15 /path/to/samples /path/to/classifier.json
In the above command, I specified a ubiquity of 15 files. This will train an SVM on the given sample directory, outputting a classifier to /path/to/classifier.json. As this command runs, it will go through many different possible SVM configurations, choosing the one which performs the best on new samples (as measured via cross-validation). Since this command has to try many possible configurations, it will take a long time to run (perhaps hours or even days). I have already gone through the trouble of finding good parameters, and I will now share my results.
I have found that a linear SVM works fairly well for programming language classification. In particular, I've gotten a linear SVM to reach a 93% success rate on new samples, and most of those mistakes were reasonable (e.g. mistaking C for C++, or mistaking Ruby for CoffeeScript). To train a linear SVM, you can set the SVM_KERNEL environment variable before running the trainer sub-command:
$ export SVM_KERNEL=linear
If you want verbose output during training, you can specify another environment variable:
$ export SVM_VERBOSE=1
Once you have trained a linear SVM, you can perform a special compression step which will make the classifier faster and smaller. This is a technique which only works for linear SVMs! Run the following command:
$ go run cmd/svm-shrink/*.go /path/to/classifier.json /path/to/optimized.json
This will create a classifier file at /path/to/optimized.json which is the optimized version of /path/to/classifier.json. Remember, this only works for linear SVMs.
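One plausible reason such a compression is specific to linear SVMs is that a linear decision function, a weighted sum of dot products with the support vectors, can be folded into a single weight vector. The sketch below illustrates that general identity; it is an assumption about the idea behind the compression step, not a description of what svm-shrink actually does. The `supportVector` type and the function names are hypothetical.

```go
package main

import "fmt"

// supportVector is a hypothetical term of a linear SVM decision function:
// a coefficient (multiplier times label) and the sample it belongs to.
type supportVector struct {
	coeff  float64
	sample []float64
}

// dot returns the dot product of two equal-length vectors.
func dot(a, b []float64) float64 {
	var sum float64
	for i := range a {
		sum += a[i] * b[i]
	}
	return sum
}

// decide evaluates bias + sum_i coeff_i * <sample_i, x>, which costs one
// dot product per support vector.
func decide(svs []supportVector, bias float64, x []float64) float64 {
	sum := bias
	for _, sv := range svs {
		sum += sv.coeff * dot(sv.sample, x)
	}
	return sum
}

// collapse folds all support vectors into w = sum_i coeff_i * sample_i, so
// that the decision function becomes <w, x> + bias.
func collapse(svs []supportVector, dim int) []float64 {
	w := make([]float64, dim)
	for _, sv := range svs {
		for j, v := range sv.sample {
			w[j] += sv.coeff * v
		}
	}
	return w
}

func main() {
	svs := []supportVector{
		{0.5, []float64{1, 0, 2}},
		{-1.5, []float64{0, 1, 1}},
		{0.25, []float64{4, 4, 0}},
	}
	bias := 0.5
	x := []float64{2, 1, 3}

	w := collapse(svs, 3)
	// Both expressions compute the same decision value.
	fmt.Println(decide(svs, bias, x), dot(w, x)+bias)
}
```

After collapsing, only the weight vector and bias need to be stored, and classification takes a single dot product regardless of how many support vectors there were. Non-linear kernels do not admit this rewrite in general, because the kernel value is no longer a plain dot product.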
For other SVM environment variables, you can check out this list.
While whichlang does allow you to train ANNs without specifying any hyper-parameters (via grid search), doing so will take a tremendous amount of time. It is highly recommended that you manually specify the parameters for your neural network. I will give one example of training an ANN, but it is up to you to tweak these parameters:
$ export NEURALNET_VERBOSE=1
$ export NEURALNET_VERBOSE_STEPS=1
$ export NEURALNET_STEP_SIZE=0.01
$ export NEURALNET_MAX_ITERS=100
$ export NEURALNET_HIDDEN_SIZE=150
$ go run cmd/trainer/*.go neuralnet 15 /path/to/samples /path/to/classifier.json
For more ANN environment variables, you can check out this list.
Using a classifier is as simple as loading in a file. You can check out the classify command for a very simple (15-line) example.