src-d / apollo
Advanced similarity and duplicate source code proof of concept for our research efforts.
License: GNU General Public License v3.0
Same as src-d/ml#280.
In the DB schema, the primary key contains only sha1 & repo:
https://github.com/src-d/apollo/blob/master/apollo/cassandra_utils.py#L72
If there is more than one file with the same hash, earlier rows will be overwritten.
Proposed solution: the primary key should also include the file path to avoid this.
More detailed explanation in gemini: src-d/gemini#15
Fix in gemini: https://github.com/src-d/gemini/pull/18/files#diff-e9f7644d3a67c0182ad50efdd8f49f7a
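A minimal sketch of the proposed fix, expressed as CQL executed from Python (the table and column names are illustrative assumptions, not the actual apollo schema):

from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect()
# Hypothetical table: with "path" in the primary key, two identical files
# (same sha1) in the same repo map to distinct rows instead of clashing.
session.execute("""
CREATE TABLE IF NOT EXISTS apollo.content (
    sha1 text,
    repo text,
    path text,
    PRIMARY KEY (sha1, repo, path)
)
""")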
Currently wondering what this is.
When resetting the cassandra DB, I often get this error:
root@rom-gpu-dbc68df59-kf6qw:/# apollo resetdb --cassandra cassandra
INFO:cassandra:Connecting to cassandra
DROP KEYSPACE IF EXISTS apollo;
Traceback (most recent call last):
File "/usr/local/bin/apollo", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.4/dist-packages/apollo/__main__.py", line 228, in main
return handler(args)
File "/usr/local/lib/python3.4/dist-packages/apollo/cassandra_utils.py", line 67, in reset_db
cql("DROP KEYSPACE IF EXISTS %s" % args.keyspace)
File "/usr/local/lib/python3.4/dist-packages/apollo/cassandra_utils.py", line 64, in cql
db.execute(cmd)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.2.13.74': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.2.13.74
I'll look into it when I get time. It's a relatively minor problem, since simply retrying the command usually does the trick, but I would like to fix it if possible.
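The error message itself points at Session.execute's timeout parameter, so a per-request override in cassandra_utils.py may be enough; a minimal sketch (the 30-second value is an arbitrary assumption):

from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect()
# DROP KEYSPACE can legitimately take longer than the driver's default
# request timeout, so raise it for this one statement.
session.execute("DROP KEYSPACE IF EXISTS apollo", timeout=30)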
Not everyone has GPUs, and in some cases hashing time may not be the bottleneck, so it would be nice to have a mode that does not require CUDA and friends.
AFAIK, having such an option alongside the high-performance GPU one would only require small changes in hasher.py, to allow using something like https://github.com/ekzhu/datasketch instead of libMHCUDA.
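A minimal sketch of what the CPU path could look like, assuming weighted MinHash over the same bag-of-features vectors that libMHCUDA consumes (the dimension, weights and sample count are arbitrary):

import numpy as np
from datasketch import WeightedMinHashGenerator

dim = 1024  # vocabulary size of the bags (arbitrary here)
gen = WeightedMinHashGenerator(dim, sample_size=128, seed=7)

bag = np.zeros(dim)
bag[[3, 42, 777]] = [1.0, 2.0, 0.5]  # a tiny fake weighted bag
wmh = gen.minhash(bag)  # CPU-only replacement for libMHCUDA's hashes
print(wmh.hashvalues.shape)  # (128, 2)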
The readme contains instructions for apollo urls, but apparently that command was removed in df3fbd8.
I'm not submitting a PR because I'm not sure whether the command should just be removed from the readme or replaced with something else.
As discovered in #56 (comment), on macOS the command mount -o bind does not work, because bind mounts do not exist there; however, you can achieve the same result with the method described here. It might be a good idea to add that link, as well as the alternate commands (e.g. mount localhost:/path/to/sourced-engine bundle/engine), to the installation part of the doc.
The class BagsSaver in bags.py has not been used since the refactoring of ml was leveraged to update apollo, hence the bags are no longer saved to the bags table. We either need to modify ml's repos2bow function in order to add this transformer at the end of the pipe (sketched below), or we need to put the logic here, as was the case before.
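A minimal sketch of the first option, reusing the link() chaining that sourced.ml transformers already use (BagsSaver's constructor arguments are guesses here, not its actual signature):

from apollo.bags import BagsSaver  # existing, currently-unused transformer

def attach_bags_saver(pipeline_head, keyspace, table):
    # Hypothetical hook: chain BagsSaver onto whatever repos2bow built,
    # so the computed bags get written to the DB again.
    return pipeline_head.link(BagsSaver(keyspace, table))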
Hi,
the spark cassandra connector / scylla setup fails when you attempt to run the hash step with a big number of files. I tried 1M files: it always fails. With 300k files it is unstable; some experiments can be completed, but after several trials it fails with an error. Before each new run of the hash step I used reset_db, but memory is not released by scylla (I'm not sure if that is correct behaviour for the DB).
We expect that nothing will change, but we need to check.
Relates to #1197
Add a LICENSE file, and mention the license in the readme. See the src-d/guide licensing policy.
Right now, if one runs
git clone https://github.com/src-d/apollo.git; cd apollo
virtualenv -p python3 .venv-py3
source .venv-py3/bin/activate
pip install git+https://github.com/src-d/ml.git@develop
pip3 install -e .
and then apollo --help, it results in:
$ apollo --help
Traceback (most recent call last):
File "./src-d/apollo/.venv-py3/bin/apollo", line 11, in <module>
load_entry_point('apollo', 'console_scripts', 'apollo')()
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2755, in load_entry_point
return ep.load()
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2408, in load
return self.resolve()
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2414, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "./src-d/apollo/apollo/__main__.py", line 12, in <module>
from apollo.bags import preprocess_source, source2bags
File "./src-d/apollo/apollo/bags.py", line 8, in <module>
from sourced.ml.transformers import UastExtractor, Transformer, Cacher, UastDeserializer, Engine, \
ImportError: cannot import name 'Documents2BOW'
From https://github.com/src-d/ml/pull/160/files#diff-9bdc53996c12b1f5ff9117bb4bf0ae23R7 it seems that Repo2WeightedSet was renamed in sourced.ml, but apollo still imports the old names in bags.py (line 9 in 5faf5a5).
I want to run apollo on the k8s staging cluster, so I wanted to test it out locally on minikube first. I used helm charts to bring up a local Spark cluster, a scylla DB and bblfshd. I then created an image for apollo, available here, as well as a k8s service so it would connect to ports 7077, 9042 and 9432. After creating the pod I ran the resetdb command, and it worked. I cloned the engine repo in order to get example siva files, which I put in io/siva. Then I tried to run the bags command; Spark launches and registers the job (I checked the logs on the master and worker pods, as well as the UI) and then I got this error:
INFO:engine:Initializing on io/siva
INFO:MetadataSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> Cacher -> MetadataSaver
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1062, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 908, in send_command
response = connection.send_command(command)
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1067, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/usr/local/bin/apollo", line 11, in <module>
load_entry_point('apollo', 'console_scripts', 'apollo')()
File "/packages/apollo/apollo/__main__.py", line 230, in main
return handler(args)
File "/packages/apollo/apollo/bags.py", line 94, in source2bags
cache_hook=lambda: MetadataSaver(args.keyspace, args.tables["meta"]))
File "/packages/sourced/ml/utils/engine.py", line 147, in wrapped_pause
return func(cmdline_args, *args, **kwargs)
File "/packages/sourced/ml/cmd_entries/repos2bow.py", line 35, in repos2bow_entry_template
uast_extractor.link(cache_hook()).execute()
File "/packages/sourced/ml/transformers/transformer.py", line 95, in execute
head = node(head)
File "/packages/apollo/apollo/bags.py", line 46, in __call__
rows.toDF() \
File "/spark/python/pyspark/sql/session.py", line 58, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/spark/python/pyspark/sql/session.py", line 582, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/spark/python/pyspark/sql/session.py", line 380, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/spark/python/pyspark/sql/session.py", line 351, in _inferSchema
first = rdd.first()
File "/spark/python/pyspark/rdd.py", line 1361, in first
rs = self.take(1)
File "/spark/python/pyspark/rdd.py", line 1343, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/spark/python/pyspark/context.py", line 992, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.5/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1062, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
The exact commands I ran were:
helm install scylla --name=scylla
helm install spark --name=anything
helm install bblfshd --name=babel
kubectl create -f service.yaml
kubectl run -ti --image=r0maink/apollo apollo-test
kubectl exec -it anything-master /bin/bash
then:
export PYSPARK_PYTHON=python3
export PYSPARK_PYTHON_DRIVER=python3
apollo resetdb --cassandra scylla:9042
apt update
apt install git
git clone https://github.com/src-d/engine
mkdir io
mkdir io/bags
cp engine/examples/siva_files io/siva
And finally:
apollo bags -r io/siva --bow io/bags/bow.asdf --docfreq io/bags/docfreq.asdf -f id -f lit -f uast2seq --uast2seq-seq-len 4 -l Java --min-docfreq 5 --bblfsh babel-bblfshd --cassandra scylla:9042 --persist MEMORY_ONLY -s spark://anything-master:7077
Any ideas?
When trying to run the hash or cmd commands with Spark in cluster mode, we get the same problem we used to have with ml, because the workers do not have the apollo lib and it is not added to the Spark session using addPyFile (a sketch of doing so is below).
I think we should either modify the way the --dep-zip flag functions in order to add it, or change the logic:
- if the -s flag is used to specify a master that is not local, we should add ml, engine and all other dependencies, and, if the call is not made by a command of the ml library itself (e.g. it comes from apollo), that caller and its dependencies as well;
- the --dep-zip flag should be used to add ml's dependencies. It will be of no use for us, since our workers use the ml-core image and already have them, but it will be useful for other users.
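A minimal sketch of shipping the package to the workers with addPyFile (the zipping helper and paths are illustrative assumptions, not sourced.ml's actual --dep-zip machinery):

import os
import shutil
import tempfile

from pyspark import SparkContext

def ship_package(sc: SparkContext, package_dir: str):
    # Zip a Python package and distribute it to every Spark worker,
    # so that "import apollo" works inside executor processes.
    base = os.path.join(tempfile.mkdtemp(), os.path.basename(package_dir))
    archive = shutil.make_archive(base, "zip", os.path.dirname(package_dir),
                                  os.path.basename(package_dir))
    sc.addPyFile(archive)

# e.g. ship_package(sc, "/packages/apollo/apollo")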
I ran apollo's bags command with test siva files I found on science 3. In order for it to work, I had to change a couple of lines in ml's batch_transform.py file and in apollo's bags.py file. These changes included the ones in this open PR, as well as a couple more that had been omitted, all involving the refactoring of the old model variable into the new docfreq_model, plus commenting out a line referencing quant_model.
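A schematic sketch of the kind of edits described (only the model -> docfreq_model rename and the quant_model comment-out come from this report; the real code in batch_transform.py differs):

class BagsBatcher:  # stand-in name for the transformer holding the model
    def __init__(self, docfreq_model):
        # the old attribute was called "model"; the refactoring renamed it
        self.docfreq_model = docfreq_model
        # self.quant_model = ...  # commented out, no longer referenced

    def save_docfreq(self, path):
        self.docfreq_model.save(path)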
The exact command was: docker run -it --rm -v /home/romain/io:/io --link bblfshd --link scylla src-d/apollo-rom bags -r /io/siva --batches /io/bags --docfreq /io/bags/docfreq.asdf -f id -f lit -f uast2seq --uast2seq-seq-len 4 -l Java -s 'local[*]' --min-docfreq 5 --bblfsh bblfshd --cassandra scylla --persist MEMORY_ONLY --config spark.executor.memory=4G --config spark.driver.memory=10G --config spark.driver.maxResultSize=4G
It seemed to work properly (the log says apollo detected 128 documents, with an average bag length of 398.7 and a vocabulary size of 5395) until it started writing the docfreq to /io/bags/docfreq.asdf, at which point I got this error:
Traceback (most recent call last):
File "/usr/local/bin/apollo", line 11, in <module>
load_entry_point('apollo', 'console_scripts', 'apollo')()
File "/packages/apollo/apollo/__main__.py", line 258, in main
return handler(args)
File "/packages/apollo/apollo/bags.py", line 126, in source2bags
batcher.docfreq_model.save(args.docfreq)
File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 270, in save
write_model(self._meta, tree, output)
File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 409, in write_model
asdf.AsdfFile(final_tree).write_to(output, all_array_compression=ARRAY_COMPRESSION)
File "/usr/local/lib/python3.5/dist-packages/asdf/asdf.py", line 890, in write_to
with generic_io.get_file(fd, mode='w') as fd:
File "/usr/local/lib/python3.5/dist-packages/asdf/generic_io.py", line 1186, in get_file
fd = atomicfile.atomic_open(realpath, realmode)
File "/usr/local/lib/python3.5/dist-packages/asdf/extern/atomicfile.py", line 139, in atomic_open
delete=False)
File "/usr/lib/python3.5/tempfile.py", line 688, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/lib/python3.5/tempfile.py", line 399, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/io/bags/.___atomic_writedyrniw3x'
The problem seems to be linked to modelforge's saving process. From the error log it looks like science 3 does not have the latest version of the repo, but I couldn't see how that would affect this when looking at this diff.
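For what it's worth, a FileNotFoundError raised from tempfile at that point usually means the directory targeted by the atomic write does not exist. A minimal pre-flight check that could be added before the save (its placement in bags.py is an assumption):

import os

def ensure_parent_dir(output_path):
    # asdf writes through a temp file in the same directory,
    # so that directory must exist before model.save() is called.
    os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)

# e.g. ensure_parent_dir(args.docfreq) before batcher.docfreq_model.save(args.docfreq)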