src-d / ml

sourced.ml is a library and a set of command line tools to build and apply machine learning models on top of Universal Abstract Syntax Trees.

License: Other

Python 99.89% Dockerfile 0.05% Makefile 0.06%
mloncode machine-learning ast word2vec

ml's Introduction

MLonCode research playground

This project is no longer maintained; it has evolved into several other projects.

Below goes the original README.

This project is the foundation for MLonCode research and development. It abstracts feature extraction and model training, allowing you to focus on higher-level tasks.

Currently, the following models are implemented:

  • BOW - a weighted bag of x, where x can be any of several extracted feature types.
  • id2vec - source code identifier embeddings.
  • docfreq - feature document frequencies (part of TF-IDF).
  • topic modeling over source code identifiers.

It is written in Python 3 and has been tested on Linux and macOS. source{d} ml is tightly coupled with source{d} engine and delegates all feature extraction parallelization to it.

Here is the list of proof-of-concept projects which are built using sourced.ml:

  • vecino - finding similar repositories.
  • tmsc - listing topics of a repository.
  • snippet-ranger - topic modeling of source code snippets.
  • apollo - source code deduplication at scale.

Installation

Whether you wish to include Spark in your installation or would rather use an existing installation, sourced-ml needs a few native libraries; e.g., on Ubuntu you must first run apt install libxml2-dev libsnappy-dev. Tensorflow is also a requirement, and both the CPU and GPU versions are supported. To select one, change the package name in the next section to either sourced-ml[tf] or sourced-ml[tf-gpu]; if you install plain sourced-ml, neither Tensorflow version will be installed.

With Apache Spark included

pip3 install sourced-ml
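
For example, to pull in the CPU build of Tensorflow together with sourced-ml (the quotes keep the shell from expanding the brackets):

pip3 install "sourced-ml[tf]"

Use sourced-ml[tf-gpu] instead if you want the GPU build.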

Use existing Apache Spark

If you already have Apache Spark installed and configured in your environment at $SPARK_HOME, you can reuse it and avoid downloading about 200 MB by using pip "editable installs":

pip3 install -e "$SPARK_HOME/python"
pip3 install sourced-ml

In both cases, you will need to have the native libraries installed, e.g. on Ubuntu: apt install libxml2-dev libsnappy-dev. Some parts also require Tensorflow.

Usage

This project exposes two interfaces: a Python API and a command line. The command line entry point is:

srcml --help

Docker image

docker run -it --rm srcd/ml --help

If this first command fails with

Cannot connect to the Docker daemon. Is the docker daemon running on this host?

and you are sure that the daemon is running, then you need to add your user to the docker group; refer to the documentation.

Contributions

...are welcome! See CONTRIBUTING and CODE_OF_CONDUCT.md.

License

Apache 2.0

Algorithms

Identifier embeddings

We build the source code identifier co-occurrence matrix for every repository.

  1. Read Git repositories.

  2. Classify files using enry.

  3. Extract UAST from each supported file.

  4. Split and stem all the identifiers in each tree.

  5. Traverse UAST, collapse all non-identifier paths and record all identifiers on the same level as co-occurring. Besides, connect them with their immediate parents.

  6. Write the global co-occurrence matrix.

  7. Train the embeddings using Swivel (requires Tensorflow). Interactively view the intermediate results in Tensorboard using --logs.

  8. Write the identifier embeddings model.

Steps 1-5 are performed with the repos2coocc command, step 6 with id2vec_preproc, step 7 with id2vec_train, and step 8 with id2vec_postproc.
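
Put together, the pipeline looks roughly as follows; the arguments are illustrative placeholders rather than the exact flags of any particular release, so check srcml <command> --help for the real options:

srcml repos2coocc -r ./repos -o coocc          # steps 1-5: per-repository co-occurrences
srcml id2vec_preproc -o swivel_input coocc     # step 6: global co-occurrence matrix for Swivel
srcml id2vec_train --logs ./logs swivel_input  # step 7: train the embeddings, watch in Tensorboard
srcml id2vec_postproc swivel_output id2vec     # step 8: write the identifier embeddings model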

Weighted Bag of X

We represent every repository as a weighted bag-of-vectors, provided we have document frequencies ("docfreq") and identifier embeddings ("id2vec").

  1. Clone or read the repository from disk.
  2. Classify files using enry.
  3. Extract UAST from each supported file.
  4. Extract various features from each tree, e.g. identifiers, literals or node2vec-like structural fingerprints.
  5. Group by repository, file or function.
  6. Set the weight of each such feature according to TF-IDF.
  7. Write the BOW model.

Steps 1-7 are performed with the repos2bow command.
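
An example invocation, adapted from one of the bug reports further down this page (the paths are illustrative; run srcml repos2bow --help for the full set of options):

srcml repos2bow --repositories ./siva-repos --bow data/BOW --docfreq data/docfreq --feature lit --languages Java --persist MEMORY_AND_DISK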

Topic modeling

See here.

Glossary

See here.

ml's People

Contributors

agarwalrounak, akshitsarin, bzz, dependabot-support, dependabot[bot], fineguy, fulaphex, galdude33, gy741, hugovk, irinakhismatullina, marnovo, mcuadros, r0maink, rahulkumaran, strikerrus, vmarkovtsev, zurk


ml's Issues

Create terms glossary for sourced.ml

We constantly confuse terms ourselves, let alone other developers.
I do not want to make the glossary exhaustive, just a starting point.

Here is the list of terms to explain in the first iteration:

  1. Bag-of-words
  2. Weighted bag-of-words
  3. Model
  4. Algorithm
  5. Transformer
  6. Document
  7. Features
    1. identifier
    2. token
    3. literal
    4. graphlet

Googleable terms we may also comment on:

  1. quantization
  2. TF-IDF
  3. topic
  4. co-occurrence matrix

@src-d/machine-learning please take a look and add any confusing terms you remember.

ImportError: cannot import name 'generate_meta'

I'm starting from a fresh venv running Python 3.5.0

Here are my steps:
docker run -d --name bblfshd --privileged -p 9432:9432 bblfsh/bblfshd
Then
docker exec -it bblfshd bblfshctl driver install --all

Output:
Installing python driver language from "docker://bblfsh/python-driver:latest"... Done
Installing java driver language from "docker://bblfsh/java-driver:latest"... Done

After that I install ast2vec
pip3 install ast2vec
Successfully installed PyStemmer-1.3.0 args-0.1.0 asdf-1.3.1 ast2vec-0.3.5a0 astropy-2.0.2 bblfsh-2.6.1 cachetools-2.0.1 certifi-2017.11.5 chardet-3.0.4 clint-0.5.1 docker-2.6.1 docker-pycreds-0.2.1 google-api-core-0.1.1 google-auth-1.2.1 google-cloud-core-0.28.0 google-cloud-storage-1.6.0 google-resumable-media-0.3.1 googleapis-common-protos-1.5.3 grpcio-1.7.0 grpcio-tools-1.7.0 idna-2.6 jsonschema-2.6.0 lz4-0.10.1 modelforge-0.4.0a0 netifaces-0.10.6 numpy-1.13.3 protobuf-3.4.0 py-1.4.34 pyasn1-0.3.7 pyasn1-modules-0.1.5 pytest-3.2.3 python-dateutil-2.6.1 pyyaml-3.12 requests-2.18.4 rsa-3.4.2 scipy-0.19.1 semantic-version-2.6.0 six-1.11.0 urllib3-1.22 websocket-client-0.44.0

Trying to reproduce https://github.com/src-d/ast2vec/blob/master/topic_modeling.md

ast2vec enry

I get the error message:
Traceback (most recent call last):
  File "/Users/melodywolk/env2/bin/ast2vec", line 7, in <module>
    from ast2vec.__main__ import main
  File "/Users/melodywolk/env2/lib/python3.5/site-packages/ast2vec/__init__.py", line 3, in <module>
    from ast2vec.bow import BOW, NBOW
  File "/Users/melodywolk/env2/lib/python3.5/site-packages/ast2vec/bow.py", line 4, in <module>
    from modelforge import generate_meta
ImportError: cannot import name 'generate_meta'

Am I missing something?
Thank you!

Usage recommendations for performance

It would be nice to document some performance hints.

For example, should I use --persist MEMORY_AND_DISK? Are there other parameters related to performance?

line 581, in read_int raise EOFError

Spark 2.2.1, the latest srcml
(b7066789@cs-hpc06-mgmt):~/go_wspace/github_data/siva/latest/00 $ srcml preprocrepos -r temp -o test

/usr/local/lib64/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
INFO:spark:Starting preprocess_repos-a48a3453-43a0-4eb3-a900-c0afddb9cb2d on local[*]
Ivy Default Cache set to: /home/b7066789/.ivy2/cache
The jars for the packages stored in: /home/b7066789/.ivy2/jars
:: loading settings :: url = jar:file:/usr/local/spark-2.2.1-bin-hadoop2.7/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
tech.sourced#engine added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found tech.sourced#engine;0.6.4 in central
found io.netty#netty-all;4.1.17.Final in central
found org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r in central
found com.jcraft#jsch;0.1.54 in central
found com.googlecode.javaewah#JavaEWAH;1.1.6 in central
found org.apache.httpcomponents#httpclient;4.3.6 in central
found org.apache.httpcomponents#httpcore;4.3.3 in central
found commons-logging#commons-logging;1.1.3 in central
found commons-codec#commons-codec;1.6 in central
found org.slf4j#slf4j-api;1.7.2 in central
found tech.sourced#siva-java;0.1.3 in central
found org.bblfsh#bblfsh-client;1.8.2 in central
found com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 in central
found com.thesamet.scalapb#lenses_2.11;0.7.0-test2 in central
found com.lihaoyi#fastparse_2.11;1.0.0 in central
found com.lihaoyi#fastparse-utils_2.11;1.0.0 in central
found com.lihaoyi#sourcecode_2.11;0.1.4 in central
found com.google.protobuf#protobuf-java;3.5.0 in central
found commons-io#commons-io;2.5 in central
found io.grpc#grpc-netty;1.10.0 in central
found io.grpc#grpc-core;1.10.0 in central
found io.grpc#grpc-context;1.10.0 in central
found com.google.code.gson#gson;2.7 in central
found com.google.guava#guava;19.0 in central
found com.google.errorprone#error_prone_annotations;2.1.2 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found io.opencensus#opencensus-api;0.11.0 in central
found io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 in central
found io.netty#netty-codec-http2;4.1.17.Final in central
found io.netty#netty-codec-http;4.1.17.Final in central
found io.netty#netty-codec;4.1.17.Final in central
found io.netty#netty-transport;4.1.17.Final in central
found io.netty#netty-buffer;4.1.17.Final in central
found io.netty#netty-common;4.1.17.Final in central
found io.netty#netty-resolver;4.1.17.Final in central
found io.netty#netty-handler;4.1.17.Final in central
found io.netty#netty-handler-proxy;4.1.17.Final in central
found io.netty#netty-codec-socks;4.1.17.Final in central
found com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 in central
found io.grpc#grpc-stub;1.10.0 in central
found io.grpc#grpc-protobuf;1.10.0 in central
found com.google.protobuf#protobuf-java;3.5.1 in central
found com.google.protobuf#protobuf-java-util;3.5.1 in central
found com.google.api.grpc#proto-google-common-protos;1.0.0 in central
found io.grpc#grpc-protobuf-lite;1.10.0 in central
found org.rogach#scallop_2.11;3.0.3 in central
found org.apache.commons#commons-pool2;2.4.3 in central
found tech.sourced#enry-java;1.6.3 in central
found org.xerial#sqlite-jdbc;3.21.0 in central
found com.groupon.dse#spark-metrics;2.0.0 in central
found io.dropwizard.metrics#metrics-core;3.1.2 in central
:: resolution report :: resolve 879ms :: artifacts dl 38ms
:: modules in use:
com.google.api.grpc#proto-google-common-protos;1.0.0 from central in [default]
com.google.code.findbugs#jsr305;3.0.0 from central in [default]
com.google.code.gson#gson;2.7 from central in [default]
com.google.errorprone#error_prone_annotations;2.1.2 from central in [default]
com.google.guava#guava;19.0 from central in [default]
com.google.protobuf#protobuf-java;3.5.1 from central in [default]
com.google.protobuf#protobuf-java-util;3.5.1 from central in [default]
com.googlecode.javaewah#JavaEWAH;1.1.6 from central in [default]
com.groupon.dse#spark-metrics;2.0.0 from central in [default]
com.jcraft#jsch;0.1.54 from central in [default]
com.lihaoyi#fastparse-utils_2.11;1.0.0 from central in [default]
com.lihaoyi#fastparse_2.11;1.0.0 from central in [default]
com.lihaoyi#sourcecode_2.11;0.1.4 from central in [default]
com.thesamet.scalapb#lenses_2.11;0.7.0-test2 from central in [default]
com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 from central in [default]
com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 from central in [default]
commons-codec#commons-codec;1.6 from central in [default]
commons-io#commons-io;2.5 from central in [default]
commons-logging#commons-logging;1.1.3 from central in [default]
io.dropwizard.metrics#metrics-core;3.1.2 from central in [default]
io.grpc#grpc-context;1.10.0 from central in [default]
io.grpc#grpc-core;1.10.0 from central in [default]
io.grpc#grpc-netty;1.10.0 from central in [default]
io.grpc#grpc-protobuf;1.10.0 from central in [default]
io.grpc#grpc-protobuf-lite;1.10.0 from central in [default]
io.grpc#grpc-stub;1.10.0 from central in [default]
io.netty#netty-all;4.1.17.Final from central in [default]
io.netty#netty-buffer;4.1.17.Final from central in [default]
io.netty#netty-codec;4.1.17.Final from central in [default]
io.netty#netty-codec-http;4.1.17.Final from central in [default]
io.netty#netty-codec-http2;4.1.17.Final from central in [default]
io.netty#netty-codec-socks;4.1.17.Final from central in [default]
io.netty#netty-common;4.1.17.Final from central in [default]
io.netty#netty-handler;4.1.17.Final from central in [default]
io.netty#netty-handler-proxy;4.1.17.Final from central in [default]
io.netty#netty-resolver;4.1.17.Final from central in [default]
io.netty#netty-transport;4.1.17.Final from central in [default]
io.opencensus#opencensus-api;0.11.0 from central in [default]
io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 from central in [default]
org.apache.commons#commons-pool2;2.4.3 from central in [default]
org.apache.httpcomponents#httpclient;4.3.6 from central in [default]
org.apache.httpcomponents#httpcore;4.3.3 from central in [default]
org.bblfsh#bblfsh-client;1.8.2 from central in [default]
org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r from central in [default]
org.rogach#scallop_2.11;3.0.3 from central in [default]
org.slf4j#slf4j-api;1.7.2 from central in [default]
org.xerial#sqlite-jdbc;3.21.0 from central in [default]
tech.sourced#engine;0.6.4 from central in [default]
tech.sourced#enry-java;1.6.3 from central in [default]
tech.sourced#siva-java;0.1.3 from central in [default]
:: evicted modules:
com.google.protobuf#protobuf-java;3.5.0 by [com.google.protobuf#protobuf-java;3.5.1] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 51 | 0 | 0 | 1 || 50 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 50 already retrieved (0kB/18ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/08 12:50:33 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/08 12:50:33 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
18/10/08 12:50:34 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
INFO:engine:Initializing engine on temp
INFO:ParquetSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> FieldsSelector -> ParquetSaver
18/10/08 12:50:41 ERROR PythonRunner: Python worker exited unexpectedly (crashed)
org.apache.spark.api.python.PythonException: Traceback (most recent call last):
File "/usr/local/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/worker.py", line 162, in main
is_sql_udf = read_int(infile)
File "/usr/local/spark-2.2.1-bin-hadoop2.7/python/lib/pyspark.zip/pyspark/serializers.py", line 581, in read_int
raise EOFError
EOFError

    at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:193)
    at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:234)
    at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:152)
    at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:63)
    at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:323)
    at org.apache.spark.rdd.RDD.iterator(RDD.scala:287)
    at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:87)
    at org.apache.spark.scheduler.Task.run(Task.scala:108)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:338)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:745)

Caused by: io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:518)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: tech.sourced.engine.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:9432
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at tech.sourced.engine.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at tech.sourced.engine.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at tech.sourced.engine.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at tech.sourced.engine.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
... 11 more
18/10/08 12:50:41 ERROR PythonRunner: This may have been caused by a prior exception:
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:518)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: tech.sourced.engine.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:9432
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at tech.sourced.engine.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at tech.sourced.engine.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at tech.sourced.engine.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at tech.sourced.engine.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
... 11 more
18/10/08 12:50:41 ERROR Executor: Exception in task 0.0 in stage 0.0 (TID 0)
io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:518)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: tech.sourced.engine.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:9432
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at tech.sourced.engine.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at tech.sourced.engine.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at tech.sourced.engine.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at tech.sourced.engine.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
... 11 more
18/10/08 12:50:41 WARN TaskSetManager: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:518)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: tech.sourced.engine.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:9432
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at tech.sourced.engine.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at tech.sourced.engine.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at tech.sourced.engine.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at tech.sourced.engine.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
... 11 more

18/10/08 12:50:41 ERROR TaskSetManager: Task 0 in stage 0.0 failed 1 times; aborting job
Traceback (most recent call last):
File "/home/b7066789/.local/bin/srcml", line 11, in
sys.exit(main())
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/main.py", line 354, in main
return handler(args)
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/utils/engine.py", line 87, in wrapped_pause
return func(cmdline_args, *args, **kwargs)
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/cmd/preprocess_repos.py", line 24, in preprocess_repos
.link(ParquetSaver(save_loc=args.output))
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/transformer.py", line 114, in execute
head = node(head)
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/basic.py", line 292, in call
rdd.toDF().write.parquet(self.save_loc)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 58, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 582, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 380, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 351, in _inferSchema
first = rdd.first()
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/rdd.py", line 1361, in first
rs = self.take(1)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/rdd.py", line 1343, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/context.py", line 992, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/home/b7066789/.local/lib/python3.6/site-packages/py4j/java_gateway.py", line 1133, in call
answer, self.gateway_client, self.target_id, self.name)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/home/b7066789/.local/lib/python3.6/site-packages/py4j/protocol.py", line 319, in get_return_value
format(target_id, ".", name), value)
py4j.protocol.Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost, executor driver): io.grpc.StatusRuntimeException: UNAVAILABLE: io exception
at io.grpc.stub.ClientCalls.toStatusRuntimeException(ClientCalls.java:221)
at io.grpc.stub.ClientCalls.getUnchecked(ClientCalls.java:202)
at io.grpc.stub.ClientCalls.blockingUnaryCall(ClientCalls.java:101)
at gopkg.in.bblfsh.sdk.v1.protocol.generated.ProtocolServiceGrpc$ProtocolServiceBlockingStub.parse(ProtocolServiceGrpc.scala:61)
at org.bblfsh.client.BblfshClient.parse(BblfshClient.scala:30)
at tech.sourced.engine.util.Bblfsh$.extractUAST(Bblfsh.scala:80)
at tech.sourced.engine.udf.ExtractUASTsUDF$class.extractUASTs(ExtractUASTsUDF.scala:17)
at tech.sourced.engine.udf.ExtractUASTsUDF$.extractUASTs(ExtractUASTsUDF.scala:24)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:395)
at tech.sourced.engine.package$EngineDataFrame$$anon$2.call(package.scala:377)
at org.apache.spark.sql.catalyst.expressions.GeneratedClass$GeneratedIterator.processNext(Unknown Source)
at org.apache.spark.sql.execution.BufferedRowIterator.hasNext(BufferedRowIterator.java:43)
at org.apache.spark.sql.execution.WholeStageCodegenExec$$anonfun$8$$anon$1.hasNext(WholeStageCodegenExec.scala:395)
at scala.collection.Iterator$$anon$11.hasNext(Iterator.scala:408)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.hasNext(SerDeUtil.scala:117)
at scala.collection.Iterator$class.foreach(Iterator.scala:893)
at org.apache.spark.api.python.SerDeUtil$AutoBatchedPickler.foreach(SerDeUtil.scala:112)
at org.apache.spark.api.python.PythonRDD$.writeIteratorToStream(PythonRDD.scala:518)
at org.apache.spark.api.python.PythonRunner$WriterThread$$anonfun$run$3.apply(PythonRDD.scala:333)
at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1948)
at org.apache.spark.api.python.PythonRunner$WriterThread.run(PythonRDD.scala:269)
Caused by: tech.sourced.engine.shaded.io.netty.channel.AbstractChannel$AnnotatedConnectException: Connection refused: localhost/127.0.0.1:9432
at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method)
at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:717)
at tech.sourced.engine.shaded.io.netty.channel.socket.nio.NioSocketChannel.doFinishConnect(NioSocketChannel.java:323)
at tech.sourced.engine.shaded.io.netty.channel.nio.AbstractNioChannel$AbstractNioUnsafe.finishConnect(AbstractNioChannel.java:340)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKey(NioEventLoop.java:633)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeysOptimized(NioEventLoop.java:580)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.processSelectedKeys(NioEventLoop.java:497)
at tech.sourced.engine.shaded.io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:459)
at tech.sourced.engine.shaded.io.netty.util.concurrent.SingleThreadEventExecutor$5.run(SingleThreadEventExecutor.java:858)
at tech.sourced.engine.shaded.io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:138)
at java.lang.Thread.run(Thread.java:745)
Caused by: java.net.ConnectException: Connection refused
... 11 more

Driver stacktrace:
at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1517)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1505)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1504)
at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:48)
at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1504)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:814)
at scala.Option.foreach(Option.scala:257)
at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:814)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1732)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1687)
at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1676)
at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:630)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2029)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2050)
at org.apache.spark.SparkContext.runJob(SparkContext.scala:2069)
at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:455)
at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:497)
at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
at py4j.Gateway.invoke(Gateway.java:280)
at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
at py4j.commands.CallCommand.execute(CallCommand.java:79)
at py4j.GatewayConnection.run(GatewayConnection.java:214)
at java.lang.Thread.run(Thread.java:745)

Add Output and Input types to Transformers

We want to be sure that only compatible Transformers will be linked together.
So it is a good idea to add to Transformer class something like

INPUT_FORMAT = Rdd[Row["cname1", "cname2"]]
OUTPUT_FORMAT = ...

And check compatibility at the linking stage.
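
A rough sketch of the idea, assuming hypothetical INPUT_FORMAT/OUTPUT_FORMAT class attributes and a simplified link() - this is not the current Transformer implementation:

class Transformer:
    # Hypothetical format declarations; the concrete format objects are illustrative.
    INPUT_FORMAT = None   # None means "accepts anything"
    OUTPUT_FORMAT = None  # None means "unspecified"

    def __init__(self):
        self.children = []

    def link(self, other: "Transformer") -> "Transformer":
        # Refuse to build incompatible pipelines at construction time
        # instead of failing deep inside a Spark job.
        if (self.OUTPUT_FORMAT is not None and other.INPUT_FORMAT is not None
                and self.OUTPUT_FORMAT != other.INPUT_FORMAT):
            raise TypeError("%s produces %s, but %s expects %s" % (
                type(self).__name__, self.OUTPUT_FORMAT,
                type(other).__name__, other.INPUT_FORMAT))
        self.children.append(other)
        return other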

@vmarkovtsev please assign me.

Integrate the neural token splitter

Based on the paper and the existing code we should have an ability to parse identifiers with ML, not with heuristics.

The model has been partially written by @warenlg. I don't remember where it is; can you please find Waren.

The splitting should be batched for performance reasons.
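
On the batching point, a minimal sketch of what I mean; split_batch(model, batch) below is a hypothetical stand-in for the real model interface:

def iter_batches(items, batch_size=4096):
    """Yield consecutive chunks of at most batch_size items."""
    batch = []
    for item in items:
        batch.append(item)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:
        yield batch

# Hypothetical usage: feed identifiers to the splitter batch by batch
# instead of one identifier per call.
# for batch in iter_batches(identifiers):
#     splits.extend(split_batch(model, batch))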

Option to avoid downloading Apache Spark on every install

Sorry if I'm missing something here, but right now the sourced.ml library has what seems to be a mandatory dependency on Apache Spark through pip.

As a user of src-d/ml, I would expect this library to let me avoid downloading 188Mb of Apache Spark as a dependency, and instead let me choose my local Apache Spark installation with PySpark through $SPARK_HOME.

The same way it does not fetch Tensorflow and libxml2 and just mentions them in https://github.com/src-d/ml#installation
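
For reference, the workaround described in the Installation section above (reusing an existing Spark through pip's editable install) is:

pip3 install -e "$SPARK_HOME/python"
pip3 install sourced-ml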

Update Spark to 2.3

@vmarkovtsev @zurk

Just saw this PR in engine, I think we should also upgrade when it is merged and stable. I don't think there are many differences in the Python API, so it should not require too much refactoring on our side, and there are some pretty cool new features in 2.2 and 2.3.

Broken installation

First I create a virtualenv:

virtualenv -p /usr/local/bin/python3 test
source test/bin/activate

Then I run the installation steps:

pip3 install git+https://github.com/bblfsh/client-python

pip3 install ast2vec doesn't work at the moment, so I run:

python3 setup.py install

Which produces the following output:

Searching for tensorflow<2.0,>=1.0
Reading https://pypi.python.org/simple/tensorflow/
No local packages or working download links found for tensorflow<2.0,>=1.0
error: Could not find suitable distribution for Requirement.parse('tensorflow<2.0,>=1.0')

And pip3 stops working:

>>pip3 --version
...
AttributeError: '_NamespacePath' object has no attribute 'sort'

docker instructions issues on Ubuntu 16.04

(Ubuntu 16.04)

$ docker build -t srcd/ast2vec .
Cannot connect to the Docker daemon. Is the docker daemon running on this host?
$ sudo docker build -t srcd/ast2vec .
Sending build context to Docker daemon 59.96 MB
...
==> appears to need to run as root?

docker run -d --privileged -p 9432:9432 --name bblfsh --rm bblfsh/server
==> Conflicting options: --rm and -d
==> moby/moby#27812
==> need docker 1.13 or later (Ubuntu 16.04 defaults to 1.12.6, build 78d1802)

ValueError: RDD is empty

I have installed the module as suggested and run the command:
srcml preprocrepos -m 50G,50G,50G -r siva --output ./test
where siva is the directory containing all the siva files. The memory parameters do not change anything.
My Spark is very old (1.3) - could that be the reason? Is it runnable in pyspark (the latest one)?

/usr/local/lib64/python3.6/site-packages/h5py/__init__.py:36: FutureWarning: Conversion of the second argument of issubdtype from float to np.floating is deprecated. In future, it will be treated as np.float64 == np.dtype(float).type.
from ._conv import register_converters as _register_converters
INFO:spark:Starting preprocess_repos-424fe007-f0db-48b7-863b-5a5b90ce5f63 on local[*]
Ivy Default Cache set to: /home/b7066789/.ivy2/cache
The jars for the packages stored in: /home/b7066789/.ivy2/jars
:: loading settings :: url = jar:file:/home/b7066789/.local/lib/python3.6/site-packages/pyspark/jars/ivy-2.4.0.jar!/org/apache/ivy/core/settings/ivysettings.xml
tech.sourced#engine added as a dependency
:: resolving dependencies :: org.apache.spark#spark-submit-parent;1.0
confs: [default]
found tech.sourced#engine;0.6.4 in central
found io.netty#netty-all;4.1.17.Final in central
found org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r in central
found com.jcraft#jsch;0.1.54 in central
found com.googlecode.javaewah#JavaEWAH;1.1.6 in central
found org.apache.httpcomponents#httpclient;4.3.6 in central
found org.apache.httpcomponents#httpcore;4.3.3 in central
found commons-logging#commons-logging;1.1.3 in central
found commons-codec#commons-codec;1.6 in central
found org.slf4j#slf4j-api;1.7.2 in central
found tech.sourced#siva-java;0.1.3 in central
found org.bblfsh#bblfsh-client;1.8.2 in central
found com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 in central
found com.thesamet.scalapb#lenses_2.11;0.7.0-test2 in central
found com.lihaoyi#fastparse_2.11;1.0.0 in central
found com.lihaoyi#fastparse-utils_2.11;1.0.0 in central
found com.lihaoyi#sourcecode_2.11;0.1.4 in central
found com.google.protobuf#protobuf-java;3.5.0 in central
found commons-io#commons-io;2.5 in central
found io.grpc#grpc-netty;1.10.0 in central
found io.grpc#grpc-core;1.10.0 in central
found io.grpc#grpc-context;1.10.0 in central
found com.google.code.gson#gson;2.7 in central
found com.google.guava#guava;19.0 in central
found com.google.errorprone#error_prone_annotations;2.1.2 in central
found com.google.code.findbugs#jsr305;3.0.0 in central
found io.opencensus#opencensus-api;0.11.0 in central
found io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 in central
found io.netty#netty-codec-http2;4.1.17.Final in central
found io.netty#netty-codec-http;4.1.17.Final in central
found io.netty#netty-codec;4.1.17.Final in central
found io.netty#netty-transport;4.1.17.Final in central
found io.netty#netty-buffer;4.1.17.Final in central
found io.netty#netty-common;4.1.17.Final in central
found io.netty#netty-resolver;4.1.17.Final in central
found io.netty#netty-handler;4.1.17.Final in central
found io.netty#netty-handler-proxy;4.1.17.Final in central
found io.netty#netty-codec-socks;4.1.17.Final in central
found com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 in central
found io.grpc#grpc-stub;1.10.0 in central
found io.grpc#grpc-protobuf;1.10.0 in central
found com.google.protobuf#protobuf-java;3.5.1 in central
found com.google.protobuf#protobuf-java-util;3.5.1 in central
found com.google.api.grpc#proto-google-common-protos;1.0.0 in central
found io.grpc#grpc-protobuf-lite;1.10.0 in central
found org.rogach#scallop_2.11;3.0.3 in central
found org.apache.commons#commons-pool2;2.4.3 in central
found tech.sourced#enry-java;1.6.3 in central
found org.xerial#sqlite-jdbc;3.21.0 in central
found com.groupon.dse#spark-metrics;2.0.0 in central
found io.dropwizard.metrics#metrics-core;3.1.2 in central
:: resolution report :: resolve 1148ms :: artifacts dl 44ms
:: modules in use:
com.google.api.grpc#proto-google-common-protos;1.0.0 from central in [default]
com.google.code.findbugs#jsr305;3.0.0 from central in [default]
com.google.code.gson#gson;2.7 from central in [default]
com.google.errorprone#error_prone_annotations;2.1.2 from central in [default]
com.google.guava#guava;19.0 from central in [default]
com.google.protobuf#protobuf-java;3.5.1 from central in [default]
com.google.protobuf#protobuf-java-util;3.5.1 from central in [default]
com.googlecode.javaewah#JavaEWAH;1.1.6 from central in [default]
com.groupon.dse#spark-metrics;2.0.0 from central in [default]
com.jcraft#jsch;0.1.54 from central in [default]
com.lihaoyi#fastparse-utils_2.11;1.0.0 from central in [default]
com.lihaoyi#fastparse_2.11;1.0.0 from central in [default]
com.lihaoyi#sourcecode_2.11;0.1.4 from central in [default]
com.thesamet.scalapb#lenses_2.11;0.7.0-test2 from central in [default]
com.thesamet.scalapb#scalapb-runtime-grpc_2.11;0.7.1 from central in [default]
com.thesamet.scalapb#scalapb-runtime_2.11;0.7.1 from central in [default]
commons-codec#commons-codec;1.6 from central in [default]
commons-io#commons-io;2.5 from central in [default]
commons-logging#commons-logging;1.1.3 from central in [default]
io.dropwizard.metrics#metrics-core;3.1.2 from central in [default]
io.grpc#grpc-context;1.10.0 from central in [default]
io.grpc#grpc-core;1.10.0 from central in [default]
io.grpc#grpc-netty;1.10.0 from central in [default]
io.grpc#grpc-protobuf;1.10.0 from central in [default]
io.grpc#grpc-protobuf-lite;1.10.0 from central in [default]
io.grpc#grpc-stub;1.10.0 from central in [default]
io.netty#netty-all;4.1.17.Final from central in [default]
io.netty#netty-buffer;4.1.17.Final from central in [default]
io.netty#netty-codec;4.1.17.Final from central in [default]
io.netty#netty-codec-http;4.1.17.Final from central in [default]
io.netty#netty-codec-http2;4.1.17.Final from central in [default]
io.netty#netty-codec-socks;4.1.17.Final from central in [default]
io.netty#netty-common;4.1.17.Final from central in [default]
io.netty#netty-handler;4.1.17.Final from central in [default]
io.netty#netty-handler-proxy;4.1.17.Final from central in [default]
io.netty#netty-resolver;4.1.17.Final from central in [default]
io.netty#netty-transport;4.1.17.Final from central in [default]
io.opencensus#opencensus-api;0.11.0 from central in [default]
io.opencensus#opencensus-contrib-grpc-metrics;0.11.0 from central in [default]
org.apache.commons#commons-pool2;2.4.3 from central in [default]
org.apache.httpcomponents#httpclient;4.3.6 from central in [default]
org.apache.httpcomponents#httpcore;4.3.3 from central in [default]
org.bblfsh#bblfsh-client;1.8.2 from central in [default]
org.eclipse.jgit#org.eclipse.jgit;4.9.0.201710071750-r from central in [default]
org.rogach#scallop_2.11;3.0.3 from central in [default]
org.slf4j#slf4j-api;1.7.2 from central in [default]
org.xerial#sqlite-jdbc;3.21.0 from central in [default]
tech.sourced#engine;0.6.4 from central in [default]
tech.sourced#enry-java;1.6.3 from central in [default]
tech.sourced#siva-java;0.1.3 from central in [default]
:: evicted modules:
com.google.protobuf#protobuf-java;3.5.0 by [com.google.protobuf#protobuf-java;3.5.1] in [default]
---------------------------------------------------------------------
| | modules || artifacts |
| conf | number| search|dwnlded|evicted|| number|dwnlded|
---------------------------------------------------------------------
| default | 51 | 0 | 0 | 1 || 50 | 0 |
---------------------------------------------------------------------
:: retrieving :: org.apache.spark#spark-submit-parent
confs: [default]
0 artifacts copied, 50 already retrieved (0kB/18ms)
Using Spark's default log4j profile: org/apache/spark/log4j-defaults.properties
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
18/10/03 15:50:55 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
18/10/03 15:50:55 WARN SparkConf: In Spark 1.0 and later spark.local.dir will be overridden by the value set by the cluster manager (via SPARK_LOCAL_DIRS in mesos/standalone and LOCAL_DIRS in YARN).
18/10/03 15:50:58 WARN Utils: Service 'SparkUI' could not bind on port 4040. Attempting port 4041.
INFO:engine:Initializing engine on siva
INFO:ParquetSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> FieldsSelector -> ParquetSaver
Traceback (most recent call last):
File "/home/b7066789/.local/bin/srcml", line 11, in
sys.exit(main())
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/main.py", line 354, in main
return handler(args)
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/utils/engine.py", line 87, in wrapped_pause
return func(cmdline_args, *args, **kwargs)
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/cmd/preprocess_repos.py", line 24, in preprocess_repos
.link(ParquetSaver(save_loc=args.output))
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/transformer.py", line 114, in execute
head = node(head)
File "/home/b7066789/.local/lib/python3.6/site-packages/sourced/ml/transformers/basic.py", line 292, in call
rdd.toDF().write.parquet(self.save_loc)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 58, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 582, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 380, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/sql/session.py", line 351, in inferSchema
first = rdd.first()
File "/home/b7066789/.local/lib/python3.6/site-packages/pyspark/rdd.py", line 1364, in first
raise ValueError("RDD is empty")
ValueError: RDD is empty

Add "single-shot" mode to the token parser

Our heuristic token parser sometimes generates several variants of the split. E.g.
wdSize becomes ["wd", "size", "wdsize"].

We need to add an "aggressive" mode which does not generate repetitions and use it to train our neural identifier splitter. Besides, I want the following in this mode:

loooong_sh_loooong_sh -> loooong, sh, loooong, sh
SmallIdFooo -> small, id, foo
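
A minimal sketch of such a single-shot split, assuming a plain regex heuristic rather than the actual TokenParser code (names are illustrative):

import re

def single_shot_split(identifier):
    """Split an identifier into lowercase tokens without emitting merged variants."""
    tokens = []
    for part in re.split(r"_+", identifier):
        # Insert a boundary before each capital letter that starts a new word.
        part = re.sub(r"(?<=[a-z0-9])(?=[A-Z])", " ", part)
        tokens.extend(piece.lower() for piece in part.split())
    return tokens

print(single_shot_split("wdSize"))                 # ['wd', 'size'] - no merged 'wdsize'
print(single_shot_split("loooong_sh_loooong_sh"))  # ['loooong', 'sh', 'loooong', 'sh']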

KeyError when looking for engine version

In the last release we switched from a static definition of the VERSION param of the EngineDefault class in utils/engine to a dynamic one:
VERSION = {pkg.key: pkg.version for pkg in pip.get_installed_distributions()}["sourced-engine"]

The problem is that when using Spark in standalone mode, since we zip the sourced-engine package it is not possible to get the version info through pip, which raises the following error:

apache.spark.api.python.PythonException: Traceback (most recent call last):
  File "/opt/spark/python/pyspark/worker.py", line 166, in main
    func, profiler, deserializer, serializer = read_command(pickleSer, infile)
  File "/opt/spark/python/pyspark/worker.py", line 55, in read_command
    command = serializer._read_with_length(file)
  File "/opt/spark/python/pyspark/serializers.py", line 169, in _read_with_length
    return self.loads(obj)
  File "/opt/spark/python/pyspark/serializers.py", line 451, in loads
    return pickle.loads(obj, encoding=encoding)
  File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
  File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1191, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1161, in _load_backward_compatible
  File "/tmp/pip-build-ynmrc1i8/sourced-ml/sourced/ml/transformers/__init__.py", line 1, in <module>
  File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
  File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1191, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1161, in _load_backward_compatible
  File "/tmp/pip-build-ynmrc1i8/sourced-ml/sourced/ml/transformers/basic.py", line 7, in <module>
  File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
  File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1191, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1161, in _load_backward_compatible
  File "/tmp/pip-build-ynmrc1i8/sourced-ml/sourced/ml/transformers/transformer.py", line 3, in <module>
  File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
  File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1191, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1161, in _load_backward_compatible
  File "/tmp/pip-build-ynmrc1i8/sourced-ml/sourced/ml/utils/__init__.py", line 4, in <module>
  File "<frozen importlib._bootstrap>", line 2237, in _find_and_load
  File "<frozen importlib._bootstrap>", line 2226, in _find_and_load_unlocked
  File "<frozen importlib._bootstrap>", line 1191, in _load_unlocked
  File "<frozen importlib._bootstrap>", line 1161, in _load_backward_compatible
  File "/usr/local/lib/python3.4/dist-packages/sourced/ml/utils/engine.py", line 3, in <module>
  File "/usr/local/lib/python3.4/dist-packages/pip/__init__.py", line 45, in <module>
    from pip.vcs import git, mercurial, subversion, bazaar  # noqa
  File "/usr/local/lib/python3.4/dist-packages/pip/vcs/mercurial.py", line 9, in <module>
    from pip.download import path_to_url
  File "/usr/local/lib/python3.4/dist-packages/pip/download.py", line 40, in <module>
    from pip._vendor import requests, six
  File "/usr/local/lib/python3.4/dist-packages/pip/_vendor/requests/__init__.py", line 98, in <module>
    from . import packages
  File "/usr/local/lib/python3.4/dist-packages/pip/_vendor/requests/packages.py", line 12, in <module>
    sys.modules['pip._vendor.requests.packages.' + mod] = sys.modules["pip._vendor." + mod]
KeyError: 'pip._vendor.urllib3.contrib'

In order to get rid of this we can simply revert back to the static definition, and add a TODO to update it.

repos2bow fails if --docfreq points to a file in non-existing directory

Running: srcml repos2bow --repositories ~/dev/working/siva-good --bow data/BOW --feature lit --languages Java --docfreq data/docfreq --persist MEMORY_AND_DISK

Directory data did not exist. The job failed pretty late with:

INFO:repos2bow:Writing docfreq to data/docfreq                                  
INFO:docfreq:Ordering the keys...
Traceback (most recent call last):
  File "ENV/bin/srcml", line 11, in <module>
    load_entry_point('sourced-ml', 'console_scripts', 'srcml')()
  File "/home/smola/dev/go/src/github.com/src-d/ml/sourced/ml/__main__.py", line 187, in main
    return handler(args)
  File "/home/smola/dev/go/src/github.com/src-d/ml/sourced/ml/cmd_entries/repos2bow.py", line 80, in repos2bow_entry
    return repos2bow_entry_template(args)
  File "/home/smola/dev/go/src/github.com/src-d/ml/sourced/ml/utils/engine.py", line 73, in wrapped_pause
    return func(cmdline_args, *args, **kwargs)
  File "/home/smola/dev/go/src/github.com/src-d/ml/sourced/ml/cmd_entries/repos2bow.py", line 68, in repos2bow_entry_template
    .save(args.docfreq)
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/site-packages/modelforge/model.py", line 269, in save
    self._write_tree(tree, output)
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/site-packages/modelforge/model.py", line 285, in _write_tree
    asdf.AsdfFile(final_tree).write_to(output, all_array_compression=ARRAY_COMPRESSION)
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/site-packages/asdf/asdf.py", line 890, in write_to
    with generic_io.get_file(fd, mode='w') as fd:
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/site-packages/asdf/generic_io.py", line 1186, in get_file
    fd = atomicfile.atomic_open(realpath, realmode)
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/site-packages/asdf/extern/atomicfile.py", line 139, in atomic_open
    delete=False)
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/tempfile.py", line 549, in NamedTemporaryFile
    (fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
  File "/home/smola/dev/go/src/github.com/src-d/ml/ENV/lib/python3.6/tempfile.py", line 260, in _mkstemp_inner
    fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: 'data/.___atomic_write25q7n3vo'

I would expect srcml either to fail before starting the job if data does not exist, or to create the directory automatically when saving the model.
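
A minimal sketch of the second option, assuming a small helper is called right before the model is saved (the helper name is made up):

import os


def ensure_parent_dir(path):
    """Create the parent directory of `path` if it does not exist yet."""
    parent = os.path.dirname(path)
    if parent:
        os.makedirs(parent, exist_ok=True)


# e.g. call ensure_parent_dir(args.docfreq) right before .save(args.docfreq)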

Improve TokenParser for identifiers containing abbreviations

While using TokenParser to correct typos in identifiers, I constantly bump into mistakes like HTMLElement -> htmle lement.

In that case (several uppercase letters in a row) it looks better to attach the last uppercase letter to the next token. I've seen many cases where this would be the right call, and almost none where it would break the logic.

E.g. the token 'lement' is one of the most frequent mis-split ones. Here is where it comes from (top examples):

data[data.token_split.str.contains(" lement")]
pos  num_occ    num_repos    identifier    token_split    num_files
3993    66995    4764    HTMLElement    htmle lement    13079
14139    16425    103    NSXMLElement    nsxmle lement    1741
47404    4496    85    JAXBElement    jaxbe lement    453
64825    3276    16    HTMLElementEventMap    htmle lement event map    42
66583    3182    41    IHTMLElement    ihtmle lement    209
86788    2389    471    SVGSVGElement    svgsvge lement    784
107285    1895    653    HTMLLIElement    htmllie lement    967
123871    1618    548    HTMLHRElement    htmlhre lement    811
126724    1579    551    HTMLBRElement    htmlbre lement    825
128322    1556    418    SVGGElement    svgge lement    718
144583    1365    19    BSONElement    bsone lement    198
150084    1309    33    IXMLDOMElement    ixmldome lement    178

And here are the correct splits for comparison:

data[data.token_split.str.contains(" element")]
pos    num_occ    num_repos    identifier    token_split    num_files
194    1608035    27484    createElement    create element    185424
458    740521    22    as_fusion_element    as fusion element    628
604    568326    19962    documentElement    document element    90360
618    555927    20933    getElementsByTagName    get elements by tag name    91772
794    407035    22788    getElementById    get element by id    97313
1888    155867    12876    getElementsByClassName    get elements by class name    29477
2182    131254    13040    activeElement    active element    37437
2538    111209    3936    getElement    get element    19493
3153    87404    137    FieldElement    field element    449
3221    85380    1091    _currentElement    current element    2370
3306    83096    6811    parentElement    parent element    18498
3550    76270    1698    domElement    dom element    5811
3765    71809    1496    buttonElement    button element    2572
3843    69912    12145    srcElement    src element    35165

TL;DR: can I add this case to TokenParser? It would be possible to switch it off at first, and I would like to try it for typo correction.
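
A minimal sketch of the heuristic I have in mind, written as a standalone regex splitter rather than the actual TokenParser code:

import re


def split_identifier(name):
    """Split a camelCase identifier, attaching the last letter of an uppercase
    run to the following lowercase token: HTMLElement -> html element."""
    # Break between an acronym and the next capitalized word: HTMLElement -> HTML Element.
    step1 = re.sub(r"([A-Z]+)([A-Z][a-z])", r"\1 \2", name)
    # Break between a lowercase/digit character and an uppercase letter: getElement -> get Element.
    step2 = re.sub(r"([a-z\d])([A-Z])", r"\1 \2", step1)
    return step2.lower().split()


assert split_identifier("HTMLElement") == ["html", "element"]
assert split_identifier("getElementById") == ["get", "element", "by", "id"]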

Wrong input passed to UastRow2Document

When running repos2roleids, or other commands involving UastRow2Document like repos2id_sequence or repos2id_distance:

docker run -it --rm --link bblfshd -e LD_PRELOAD= --privileged srcd/ml-core bash
git clone https://github.com/src-d/ml.git
cd ml
pip3 install -r requirements.txt
pip3 install -e .
python3 -m sourced.ml repos2idseq -r sourced/ml/tests/parquet/test.parquet --parquet -o /tmp/output

returned:

INFO:parquet:Initializing on sourced/ml/tests/parquet/test.parquet
INFO:CsvSaver:ParquetLoader -> Identity -> UastRow2Document -> UastDeserializer -> Uast2BagFeatures -> Rower -> CsvSaver
Traceback (most recent call last):
  File "/usr/lib/python3.4/runpy.py", line 170, in _run_module_as_main
    "__main__", mod_spec)
  File "/usr/lib/python3.4/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/root/srcd/ml/sourced/ml/__main__.py", line 358, in <module>
    sys.exit(main())
  File "/root/srcd/ml/sourced/ml/__main__.py", line 354, in main
    return handler(args)
  File "/root/srcd/ml/sourced/ml/utils/engine.py", line 87, in wrapped_pause
    return func(cmdline_args, *args, **kwargs)
  File "/root/srcd/ml/sourced/ml/cmd/repos2id_sequence.py", line 28, in repos2id_sequence
    .execute()
  File "/root/srcd/ml/sourced/ml/transformers/transformer.py", line 114, in execute
    head = node(head)
  File "/root/srcd/ml/sourced/ml/transformers/uast2bag_features.py", line 13, in __call__
    assert isinstance(rows, RDD)
AssertionError
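
Judging by the assertion, Uast2BagFeatures receives a Spark DataFrame instead of an RDD when the --parquet path is used. A hypothetical adapter like the one below (the class name is made up) shows one way to bridge the two, although the proper fix probably belongs in UastRow2Document or the parquet loader:

from pyspark import RDD
from pyspark.sql import DataFrame


class Rows2RDD:
    """Hypothetical adapter: convert a DataFrame to an RDD of Rows so that
    downstream transformers asserting isinstance(rows, RDD) keep working."""

    def __call__(self, rows):
        if isinstance(rows, DataFrame):
            return rows.rdd
        assert isinstance(rows, RDD)
        return rows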

Split the sourced-ml package into algorithms and data collection parts

Dependent projects such as https://github.com/src-d/style-analyzer need only the algorithms part of sourced-ml: https://github.com/src-d/ml/tree/master/sourced/ml/algorithms

The data collection part uses the deprecated jgit-spark-connector, which depends on old packages. This leads to unpleasant dependency conflicts: https://github.com/src-d/style-analyzer/pull/719/files#diff-354f30a63fb0907d4ad57269548329e3R30

That is why we should split the package into two parts.

Generate cmd arguments from the Transformers we use

Most of the cmd arguments we use come from Transformers.
I think we can declare them on the Transformers themselves and assemble the list of cmd arguments from a pipeline.

This makes the arguments easier to maintain and avoids manual updates in many places when an argument used by several cmd subcommands changes.
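
Roughly what I have in mind, with hypothetical names rather than the current API:

import argparse


class Transformer:
    """Base class: each transformer declares the cmd arguments it needs."""

    @classmethod
    def add_args(cls, parser):
        pass  # overridden by concrete transformers


class LanguageSelector(Transformer):
    @classmethod
    def add_args(cls, parser):
        parser.add_argument("--languages", nargs="+", default=["Java"],
                            help="Languages to keep.")


def build_parser(pipeline):
    """Assemble a single argparse parser from every transformer in a pipeline."""
    parser = argparse.ArgumentParser()
    for transformer_cls in pipeline:
        transformer_cls.add_args(parser)
    return parser


args = build_parser([LanguageSelector]).parse_args(["--languages", "Java", "Python"])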

I will try to implement it when I have enough time, so we can discuss a concrete proposal.

@vmarkovtsev please assign me.

error while running how_to_use_ast2vec.ipynb

I've installed ml with pip3 install -r requirements.txt.

Now I'm executing the example:
jupyter-nbconvert --execute how_to_use_ast2vec.ipynb

But I get an error. Any idea how to fix it?

Traceback (most recent call last):
  File "/usr/lib/python3/dist-packages/nbconvert/nbconvertapp.py", line 393, in export_single_notebook
    output, resources = self.exporter.from_filename(notebook_filename, resources=resources)
  File "/usr/lib/python3/dist-packages/nbconvert/exporters/exporter.py", line 174, in from_filename
    return self.from_file(f, resources=resources, **kw)
  File "/usr/lib/python3/dist-packages/nbconvert/exporters/exporter.py", line 192, in from_file
    return self.from_notebook_node(nbformat.read(file_stream, as_version=4), resources=resources, **kw)
  File "/usr/lib/python3/dist-packages/nbconvert/exporters/html.py", line 85, in from_notebook_node
    return super(HTMLExporter, self).from_notebook_node(nb, resources, **kw)
  File "/usr/lib/python3/dist-packages/nbconvert/exporters/templateexporter.py", line 280, in from_notebook_node
    nb_copy, resources = super(TemplateExporter, self).from_notebook_node(nb, resources, **kw)
  File "/usr/lib/python3/dist-packages/nbconvert/exporters/exporter.py", line 134, in from_notebook_node
    nb_copy, resources = self._preprocess(nb_copy, resources)
  File "/usr/lib/python3/dist-packages/nbconvert/exporters/exporter.py", line 311, in _preprocess
    nbc, resc = preprocessor(nbc, resc)
  File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/base.py", line 47, in __call__
    return self.preprocess(nb, resources)
  File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 262, in preprocess
    nb, resources = super(ExecutePreprocessor, self).preprocess(nb, resources)
  File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/base.py", line 69, in preprocess
    nb.cells[index], resources = self.preprocess_cell(cell, resources, index)
  File "/usr/lib/python3/dist-packages/nbconvert/preprocessors/execute.py", line 286, in preprocess_cell
    raise CellExecutionError.from_cell_and_msg(cell, out)
nbconvert.preprocessors.execute.CellExecutionError: An error occurred while executing the following cell:
------------------
# setup logging
from ast2vec import setup_logging
setup_logging(level="DEBUG")

# setup linguist - mandatory to launch first time to build enry.
# after this you can specify path to enry file.
from ast2vec import install_enry
install_enry()

# check bblfsh server
from ast2vec import ensure_bblfsh_is_running_noexc
ensure_bblfsh_is_running_noexc()
------------------

---------------------------------------------------------------------------
ImportError                               Traceback (most recent call last)
<ipython-input-1-8dceed0627d5> in <module>()
      1 # setup logging
----> 2 from ast2vec import setup_logging
      3 setup_logging(level="DEBUG")
      4 
      5 # setup linguist - mandatory to launch first time to build enry.

ImportError: cannot import name 'setup_logging'


Installing fails with Python 3.7

Due to snowballstem/pystemmer#18, it is not possible to install the current sourced-ml package smoothly via pip3 install sourced-ml on Python 3.7 unless the cython package is already installed.

It is hard to fix here because we cannot specify the installation order in setup(...).
However, putting Cython before PyStemmer in requirements.txt seems to work, so at least you can run:

pip3 install -r requirements.txt
pip3 install .

I will create a PR with the requirements.txt update for now.
When I have time, I will try to fix the issue in PyStemmer itself.

A related issue in style-analyzer: src-d/style-analyzer#175.

@vmarkovtsev please assign me.

pip3 install failure: missing libxml

Ubuntu 16.04.3

pip3 install ast2vec

In file included from bblfsh/libuast/uast.c:2:0:
bblfsh/libuast/uast_private.h:8:25: fatal error: libxml/tree.h: No such file or directory
compilation terminated.
error: command 'x86_64-linux-gnu-gcc' failed with exit status 1

"sudo apt-get install libxml2-dev" appears to fix this.

Change pep8 -> pycodestyle in the Travis CI config

pep8 has been renamed to pycodestyle (GitHub issue #466)
Use of the pep8 tool will be removed in a future release.
Please install and use `pycodestyle` instead.

We should change the Travis config accordingly.

@vmarkovtsev I'm happy to do it.
