src-d / apollo
Advanced similarity and duplicate source code proof of concept for our research efforts.
License: GNU General Public License v3.0
Same as src-d/ml#280.
In the DB schema, the primary key contains only sha1 & repo:
https://github.com/src-d/apollo/blob/master/apollo/cassandra_utils.py#L72
If there is more than one file with the same hash, earlier rows will be overwritten.
Proposed solution: the primary key should also include the file path to avoid this.
More detailed explanation in gemini: src-d/gemini#15
Fix in gemini: https://github.com/src-d/gemini/pull/18/files#diff-e9f7644d3a67c0182ad50efdd8f49f7a
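A minimal sketch of the proposed fix, expressed as CQL executed from Python (the table and column names are illustrative assumptions, not the actual apollo schema):

from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect()
# Hypothetical table: with "path" in the primary key, two identical files
# (same sha1) in the same repo map to distinct rows instead of clashing.
session.execute("""
CREATE TABLE IF NOT EXISTS apollo.content (
    sha1 text,
    repo text,
    path text,
    PRIMARY KEY (sha1, repo, path)
)
""")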
Currently wondering what this is.
When resetting the cassandra DB, I often get this error:
root@rom-gpu-dbc68df59-kf6qw:/# apollo resetdb --cassandra cassandra
INFO:cassandra:Connecting to cassandra
DROP KEYSPACE IF EXISTS apollo;
Traceback (most recent call last):
File "/usr/local/bin/apollo", line 11, in <module>
sys.exit(main())
File "/usr/local/lib/python3.4/dist-packages/apollo/__main__.py", line 228, in main
return handler(args)
File "/usr/local/lib/python3.4/dist-packages/apollo/cassandra_utils.py", line 67, in reset_db
cql("DROP KEYSPACE IF EXISTS %s" % args.keyspace)
File "/usr/local/lib/python3.4/dist-packages/apollo/cassandra_utils.py", line 64, in cql
db.execute(cmd)
File "cassandra/cluster.py", line 2141, in cassandra.cluster.Session.execute
File "cassandra/cluster.py", line 4033, in cassandra.cluster.ResponseFuture.result
cassandra.OperationTimedOut: errors={'10.2.13.74': 'Client request timeout. See Session.execute[_async](timeout)'}, last_host=10.2.13.74
I'll look into it when I get time. It's a relatively minor problem, since simply retrying the command usually does the trick, but I would like to fix it if possible.
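The error message itself points at Session.execute's timeout parameter, so a per-request override in cassandra_utils.py may be enough; a minimal sketch (the 30-second value is an arbitrary assumption):

from cassandra.cluster import Cluster

session = Cluster(["cassandra"]).connect()
# DROP KEYSPACE can legitimately take longer than the driver's default
# request timeout, so raise it for this one statement.
session.execute("DROP KEYSPACE IF EXISTS apollo", timeout=30)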
Not everyone has GPUs, and in some cases hashing time may not be the bottleneck, so it would be nice to have a mode that does not require CUDA and friends.
AFAIK, having such an option alongside the high-performance GPU one would only require small changes in hasher.py, to allow using something like https://github.com/ekzhu/datasketch instead of libMHCUDA.
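A minimal sketch of what the CPU path could look like, assuming weighted MinHash over the same bag-of-features vectors that libMHCUDA consumes (the dimension, weights and sample count are arbitrary):

import numpy as np
from datasketch import WeightedMinHashGenerator

dim = 1024  # vocabulary size of the bags (arbitrary here)
gen = WeightedMinHashGenerator(dim, sample_size=128, seed=7)

bag = np.zeros(dim)
bag[[3, 42, 777]] = [1.0, 2.0, 0.5]  # a tiny fake weighted bag
wmh = gen.minhash(bag)  # CPU-only replacement for libMHCUDA's hashes
print(wmh.hashvalues.shape)  # (128, 2)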
The readme contains instructions for apollo urls, but apparently that command was removed in df3fbd8.
I'm not submitting a PR because I'm not sure whether the command should just be removed from the readme or replaced with something else.
As discovered in #56 (comment), on macOS the command mount -o bind does not work, because bind mounts do not exist there; however, you can achieve the same result with the method described here. It might be a good idea to add that link, as well as the alternate commands (e.g. mount localhost:/path/to/sourced-engine bundle/engine), to the installation part of the doc.
The class BagsSaver in bags.py has not been used since the refactoring of ml was leveraged to update apollo, hence the bags are no longer saved to the bags table. We either need to modify ml's repos2bow function in order to add this transformer at the end of the pipe (sketched below), or we need to put the logic here, as was the case before.
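A minimal sketch of the first option, reusing the link() chaining that sourced.ml transformers already use (BagsSaver's constructor arguments are guesses here, not its actual signature):

from apollo.bags import BagsSaver  # existing, currently-unused transformer

def attach_bags_saver(pipeline_head, keyspace, table):
    # Hypothetical hook: chain BagsSaver onto whatever repos2bow built,
    # so the computed bags get written to the DB again.
    return pipeline_head.link(BagsSaver(keyspace, table))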
Hi,
the spark cassandra connector / scylla setup fails when you attempt to run the hash step with a big number of files. I tried 1M files: it always fails. With 300k files it is unstable; some experiments can be completed, but after several trials it fails with an error. Before each new run of the hash step I used reset_db, but memory is not released by scylla (I'm not sure if that is correct behaviour for the DB).
We expect that nothing will change, but we need to check.
Relates to #1197
Add a LICENSE file, and mention the license in the readme. See the src-d/guide licensing policy.
Right now, if one runs
git clone https://github.com/src-d/apollo.git; cd apollo
virtualenv -p python3 .venv-py3
source .venv-py3/bin/activate
pip install git+https://github.com/src-d/ml.git@develop
pip3 install -e .
and then apollo --help, it results in:
$ apollo --help
Traceback (most recent call last):
File "./src-d/apollo/.venv-py3/bin/apollo", line 11, in <module>
load_entry_point('apollo', 'console_scripts', 'apollo')()
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 572, in load_entry_point
return get_distribution(dist).load_entry_point(group, name)
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2755, in load_entry_point
return ep.load()
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2408, in load
return self.resolve()
File "./src-d/apollo/.venv-py3/lib/python3.6/site-packages/pkg_resources/__init__.py", line 2414, in resolve
module = __import__(self.module_name, fromlist=['__name__'], level=0)
File "./src-d/apollo/apollo/__main__.py", line 12, in <module>
from apollo.bags import preprocess_source, source2bags
File "./src-d/apollo/apollo/bags.py", line 8, in <module>
from sourced.ml.transformers import UastExtractor, Transformer, Cacher, UastDeserializer, Engine, \
ImportError: cannot import name 'Documents2BOW'
From https://github.com/src-d/ml/pull/160/files#diff-9bdc53996c12b1f5ff9117bb4bf0ae23R7 it seems that Repo2WeightedSet was renamed in sourced.ml, but apollo still imports the old names in bags.py (line 9 in 5faf5a5).
I want to run apollo on the k8s staging cluster, so I wanted to test it out locally on minikube first. I used helm charts to bring up a local Spark cluster, a scylla DB and bblfshd. I then created an image for apollo, available here, as well as a k8s service so it would connect to ports 7077, 9042 and 9432. After creating the pod I ran the resetdb command, and it worked. I cloned the engine repo in order to get example siva files, which I put in io/siva. Then I tried to run the bags command; Spark launches and registers the job (I checked the logs on the master and worker pods, as well as the UI) and then I got this error:
INFO:engine:Initializing on io/siva
INFO:MetadataSaver:Ignition -> DzhigurdaFiles -> UastExtractor -> Moder -> Cacher -> MetadataSaver
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1062, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 908, in send_command
response = connection.send_command(command)
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1067, in send_command
"Error while receiving", e, proto.ERROR_ON_RECEIVE)
py4j.protocol.Py4JNetworkError: Error while receiving
Traceback (most recent call last):
File "/usr/local/bin/apollo", line 11, in <module>
load_entry_point('apollo', 'console_scripts', 'apollo')()
File "/packages/apollo/apollo/__main__.py", line 230, in main
return handler(args)
File "/packages/apollo/apollo/bags.py", line 94, in source2bags
cache_hook=lambda: MetadataSaver(args.keyspace, args.tables["meta"]))
File "/packages/sourced/ml/utils/engine.py", line 147, in wrapped_pause
return func(cmdline_args, *args, **kwargs)
File "/packages/sourced/ml/cmd_entries/repos2bow.py", line 35, in repos2bow_entry_template
uast_extractor.link(cache_hook()).execute()
File "/packages/sourced/ml/transformers/transformer.py", line 95, in execute
head = node(head)
File "/packages/apollo/apollo/bags.py", line 46, in __call__
rows.toDF() \
File "/spark/python/pyspark/sql/session.py", line 58, in toDF
return sparkSession.createDataFrame(self, schema, sampleRatio)
File "/spark/python/pyspark/sql/session.py", line 582, in createDataFrame
rdd, schema = self._createFromRDD(data.map(prepare), schema, samplingRatio)
File "/spark/python/pyspark/sql/session.py", line 380, in _createFromRDD
struct = self._inferSchema(rdd, samplingRatio)
File "/spark/python/pyspark/sql/session.py", line 351, in _inferSchema
first = rdd.first()
File "/spark/python/pyspark/rdd.py", line 1361, in first
rs = self.take(1)
File "/spark/python/pyspark/rdd.py", line 1343, in take
res = self.context.runJob(self, takeUpToNumLeft, p)
File "/spark/python/pyspark/context.py", line 992, in runJob
port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1160, in __call__
answer, self.gateway_client, self.target_id, self.name)
File "/spark/python/pyspark/sql/utils.py", line 63, in deco
return f(*a, **kw)
File "/usr/local/lib/python3.5/dist-packages/py4j/protocol.py", line 328, in get_return_value
format(target_id, ".", name))
py4j.protocol.Py4JError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob
ERROR:root:Exception while sending command.
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/py4j/java_gateway.py", line 1062, in send_command
raise Py4JNetworkError("Answer from Java side is empty")
py4j.protocol.Py4JNetworkError: Answer from Java side is empty
The exact commands I ran were:
helm install scylla --name=scylla
helm install spark --name=anything
helm install bblfshd --name=babel
kubectl create -f service.yaml
kubectl run -ti --image=r0maink/apollo apollo-test
kubectl exec -it anything-master /bin/bash
then:
export PYSPARK_PYTHON=python3
export PYSPARK_PYTHON_DRIVER=python3
apollo resetdb --cassandra scylla:9042
apt update
apt install git
git clone https://github.com/src-d/engine
mkdir io
mkdir io/bags
cp engine/examples/siva_files io/siva
And finally:
apollo bags -r io/siva --bow io/bags/bow.asdf --docfreq io/bags/docfreq.asdf -f id -f lit -f uast2seq --uast2seq-seq-len 4 -l Java --min-docfreq 5 --bblfsh babel-bblfshd --cassandra scylla:9042 --persist MEMORY_ONLY -s spark://anything-master:7077
Any ideas?
When trying to run the hash or cmd commands with Spark in cluster mode, we get the same problem we used to have with ml, because the workers do not have the apollo lib and it is not added to the Spark session using addPyFile (a sketch of doing so is below).
I think we should either modify the way the --dep-zip flag functions in order to add it, or change the logic:
- if the -s flag is used to specify a master that is not local, we should add ml, engine and all other dependencies, and, if the call is not made by a command of the ml library itself (e.g. it comes from apollo), that caller and its dependencies as well;
- the --dep-zip flag should be used to add ml's dependencies. It will be of no use for us, since our workers use the ml-core image and already have them, but it will be useful for other users.
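A minimal sketch of shipping the package to the workers with addPyFile (the zipping helper and paths are illustrative assumptions, not sourced.ml's actual --dep-zip machinery):

import os
import shutil
import tempfile

from pyspark import SparkContext

def ship_package(sc: SparkContext, package_dir: str):
    # Zip a Python package and distribute it to every Spark worker,
    # so that "import apollo" works inside executor processes.
    base = os.path.join(tempfile.mkdtemp(), os.path.basename(package_dir))
    archive = shutil.make_archive(base, "zip", os.path.dirname(package_dir),
                                  os.path.basename(package_dir))
    sc.addPyFile(archive)

# e.g. ship_package(sc, "/packages/apollo/apollo")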
I ran apollo's bags command with test siva files I found on science 3. In order for it to work, I had to change a couple of lines in ml's batch_transform.py file and in apollo's bags.py file. These changes included the ones in this open PR, as well as a couple more that had been omitted, all involving the refactoring of the old model variable into the new docfreq_model, plus commenting out a line referencing quant_model.
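A schematic sketch of the kind of edits described (only the model -> docfreq_model rename and the quant_model comment-out come from this report; the real code in batch_transform.py differs):

class BagsBatcher:  # stand-in name for the transformer holding the model
    def __init__(self, docfreq_model):
        # the old attribute was called "model"; the refactoring renamed it
        self.docfreq_model = docfreq_model
        # self.quant_model = ...  # commented out, no longer referenced

    def save_docfreq(self, path):
        self.docfreq_model.save(path)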
The exact command was: docker run -it --rm -v /home/romain/io:/io --link bblfshd --link scylla src-d/apollo-rom bags -r /io/siva --batches /io/bags --docfreq /io/bags/docfreq.asdf -f id -f lit -f uast2seq --uast2seq-seq-len 4 -l Java -s 'local[*]' --min-docfreq 5 --bblfsh bblfshd --cassandra scylla --persist MEMORY_ONLY --config spark.executor.memory=4G --config spark.driver.memory=10G --config spark.driver.maxResultSize=4G
It seemed to work properly (the log says apollo detected 128 documents, with an average bag length of 398.7 and a vocabulary size of 5395) until it started writing the docfreq to /io/bags/docfreq.asdf, at which point I got this error:
Traceback (most recent call last):
File "/usr/local/bin/apollo", line 11, in <module>
load_entry_point('apollo', 'console_scripts', 'apollo')()
File "/packages/apollo/apollo/__main__.py", line 258, in main
return handler(args)
File "/packages/apollo/apollo/bags.py", line 126, in source2bags
batcher.docfreq_model.save(args.docfreq)
File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 270, in save
write_model(self._meta, tree, output)
File "/usr/local/lib/python3.5/dist-packages/modelforge/model.py", line 409, in write_model
asdf.AsdfFile(final_tree).write_to(output, all_array_compression=ARRAY_COMPRESSION)
File "/usr/local/lib/python3.5/dist-packages/asdf/asdf.py", line 890, in write_to
with generic_io.get_file(fd, mode='w') as fd:
File "/usr/local/lib/python3.5/dist-packages/asdf/generic_io.py", line 1186, in get_file
fd = atomicfile.atomic_open(realpath, realmode)
File "/usr/local/lib/python3.5/dist-packages/asdf/extern/atomicfile.py", line 139, in atomic_open
delete=False)
File "/usr/lib/python3.5/tempfile.py", line 688, in NamedTemporaryFile
(fd, name) = _mkstemp_inner(dir, prefix, suffix, flags, output_type)
File "/usr/lib/python3.5/tempfile.py", line 399, in _mkstemp_inner
fd = _os.open(file, flags, 0o600)
FileNotFoundError: [Errno 2] No such file or directory: '/io/bags/.___atomic_writedyrniw3x'
The problem seems to be linked to modelforge's saving process. From the error log it looks like science 3 does not have the latest version of the repo, but I couldn't see how that would affect this when looking at this diff.
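For what it's worth, a FileNotFoundError raised from tempfile at that point usually means the directory targeted by the atomic write does not exist. A minimal pre-flight check that could be added before the save (its placement in bags.py is an assumption):

import os

def ensure_parent_dir(output_path):
    # asdf writes through a temp file in the same directory,
    # so that directory must exist before model.save() is called.
    os.makedirs(os.path.dirname(os.path.abspath(output_path)), exist_ok=True)

# e.g. ensure_parent_dir(args.docfreq) before batcher.docfreq_model.save(args.docfreq)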