reiz.io's People

Contributors

hugovk, isidentical, pre-commit-ci[bot]


reiz.io's Issues

reizql: META(filename=...) matcher

Currently, there is no way to search within a specific project while using Reiz. Considering the intended dataset size (thousands of packages), it might be useful to add a META() matcher for searching by filename. Here is an example:

Call(
    Name('something'),
    __metadata__ = META(
        filename='requests/%'
    )
)

This would match only files under the requests/ prefix, which internally means only the requests repo.
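As a rough illustration of the intended semantics, the SQL-LIKE-style `%` wildcard in the filename pattern could be translated into an fnmatch-style filter. The `matches_filename` helper below is hypothetical, not part of the real Reiz API:

```python
# Hypothetical sketch: evaluating a META(filename=...) pattern against a
# file path. The SQL-LIKE '%' wildcard is converted to fnmatch's '*',
# which matches across '/' as well.
from fnmatch import fnmatch

def matches_filename(pattern: str, filename: str) -> bool:
    """Return True if `filename` matches the LIKE-style `pattern`."""
    return fnmatch(filename, pattern.replace("%", "*"))

matches_filename("requests/%", "requests/sessions.py")    # True
matches_filename("requests/%", "urllib3/poolmanager.py")  # False
```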

ReizQL

A general discussion of the query language we'll expose this DB through.

reiz.serialization: Cache project metadata

We currently insert new project metadata every time we start processing a project, but if the serialize command runs multiple times, there may be duplicates in the DB. To resolve this, we should create a reference pool of already-inserted projects and reuse them when they are available.
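The reference pool could look something like the sketch below. `ProjectPool` and the `insert_project` callback are hypothetical stand-ins for the real serializer's DB insert:

```python
# Hypothetical sketch of a reference pool that avoids duplicate project
# metadata inserts: look up an already-inserted project before creating
# a new record via the supplied insert callback.
class ProjectPool:
    def __init__(self):
        self._cache = {}

    def get_or_insert(self, name, insert_project):
        """Return the cached project record, inserting it only once."""
        if name not in self._cache:
            self._cache[name] = insert_project(name)
        return self._cache[name]

pool = ProjectPool()
first = pool.get_or_insert("requests", lambda name: {"name": name})
second = pool.get_or_insert("requests", lambda name: {"name": name})
assert first is second  # the second insert is skipped
```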

Website down

IsItUp says that reiz.io is down - any idea when it will be back up again?

reiz.reizql: High level frontend for ReizQL

This is just an idea that has come up a few times; nothing to implement yet, mostly for keeping track of things. It might be worthwhile to have some sort of high-level language (built on top of Python) to represent the queries in a more human-friendly manner (something like semgrep's format):

def ...(...): 
   for ... in ...:
       ...
   return ...

would roughly transpile to:

FunctionDef(
    body=[
        For(
            body=[
                ...
            ]
        ),
        Return(value = not None)
    ]
)

The implementation isn't as simple as it might seem. It requires special behavior for each node type to define what ... represents, and probably many different constructs to define it properly. I would avoid starting work on this until we figure out how to implement aliases.

sampling: Licenses

We need to research software licenses that don't conflict with our use case, and keep only the projects that use one of those licenses.
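The filtering step itself is straightforward once the research settles on an allowlist. A minimal sketch, where the license set below is purely illustrative and would come out of the proposed research:

```python
# Hypothetical sketch: keep only projects whose license is on an
# allowlist. The concrete allowlist here is an assumption for
# illustration, not the outcome of the license research.
ALLOWED_LICENSES = {"MIT", "BSD-3-Clause", "Apache-2.0", "ISC"}

def filter_by_license(projects):
    """Drop any project whose license is not on the allowlist."""
    return [p for p in projects if p.get("license") in ALLOWED_LICENSES]

projects = [
    {"name": "six", "license": "MIT"},
    {"name": "mystery", "license": "Proprietary"},
]
filter_by_license(projects)  # only 'six' survives
```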

database: Wrap EdgeQLConnection objects

Since we separated the EdgeDB-bound logic from the compiler/serializer, maybe we can do the same for the database connections.

  • Introduce a reiz.database package with
    • get_connection(backend: str) -> BaseConnection and connection: BaseConnection = get_connection(config.ir.backend)
    • reiz.database.base: a BaseConnection class, which would host two main methods (maybe more), get_blocking_connection() and get_async_connection(); it might also host a connection pool
    • reiz.database.edgeql: a module that implements these methods
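The proposed layout could be sketched as follows. Class and function names follow the issue text; the method bodies are placeholders, not the real EdgeDB client calls:

```python
# Hypothetical sketch of the proposed reiz.database package: an abstract
# BaseConnection plus a get_connection() factory keyed by backend name.
from abc import ABC, abstractmethod

class BaseConnection(ABC):
    @abstractmethod
    def get_blocking_connection(self): ...

    @abstractmethod
    def get_async_connection(self): ...

class EdgeQLConnection(BaseConnection):
    # Placeholder return values; the real implementation would hand out
    # (possibly pooled) EdgeDB connections.
    def get_blocking_connection(self):
        return "blocking-edgeql-connection"

    def get_async_connection(self):
        return "async-edgeql-connection"

_BACKENDS = {"edgeql": EdgeQLConnection}

def get_connection(backend: str) -> BaseConnection:
    """Resolve a backend name (e.g. from config) to a connection object."""
    return _BACKENDS[backend]()
```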

reiz.reizql.compiler: Extend list-matcher

Currently, the list matcher can take either a ReizQLIgnore or a ReizQLMatch; it would be great to have it also work with ReizQLNot and ReizQLLogicOperator.
Example queries:

Module(body=[not Expr(Constant())])
Module(body=[Expr(Constant()) | Pass()])

reiz.reizql: Support length matchers

Call(args=LEN(min=3))
Call(args=LEN(max=3))
Call(args=LEN(min=1, max=3))

will be translated to the appropriate count(<pointer>) >=/<= expressions.
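The translation step could be sketched like this. The generated EdgeQL fragments are illustrative; the real compilation would happen inside the ReizQL compiler:

```python
# Hypothetical sketch of compiling a LEN(min=..., max=...) matcher into
# count() comparisons on a pointer.
def compile_len(pointer, min=None, max=None):
    """Build the count() filter fragment for a LEN matcher."""
    parts = []
    if min is not None:
        parts.append(f"count({pointer}) >= {min}")
    if max is not None:
        parts.append(f"count({pointer}) <= {max}")
    return " AND ".join(parts)

compile_len(".args", min=1, max=3)
# "count(.args) >= 1 AND count(.args) <= 3"
```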

Search engine hanging forever

The engine is started in the following manner:

ubuntu@ip-172-31-63-88:~/reiz.io$ docker-compose up --build --remove-orphans
Building reiz
Step 1/8 : FROM python:3.8.2-slim
 ---> e7d894e42148
Step 2/8 : RUN apt-get update  && apt-get upgrade -y  && apt-get install git apt-utils bash -y
 ---> Using cache
 ---> 882941d1de25
Step 3/8 : WORKDIR /app
 ---> Using cache
 ---> ae7e60c190f3
Step 4/8 : COPY requirements.txt requirements.txt
 ---> Using cache
 ---> 6a970b289647
Step 5/8 : RUN python -m pip install -r requirements.txt
 ---> Using cache
 ---> 395c80b3ae8f
Step 6/8 : COPY . .
 ---> b3d4d8bdf508
Step 7/8 : RUN python -m pip install -e .
 ---> Running in 91727f3854c6
Obtaining file:///app
  Installing build dependencies: started
  Installing build dependencies: finished with status 'done'
  Getting requirements to build wheel: started
  Getting requirements to build wheel: finished with status 'done'
    Preparing wheel metadata: started
    Preparing wheel metadata: finished with status 'done'
Installing collected packages: reiz
  Running setup.py develop for reiz
Successfully installed reiz
WARNING: You are using pip version 20.1; however, version 21.1.1 is available.
You should consider upgrading via the '/usr/local/bin/python -m pip install --upgrade pip' command.
Removing intermediate container 91727f3854c6
 ---> 6135f850cf1a
Step 8/8 : ENTRYPOINT ["/bin/bash", "./scripts/docker_bootstrap.sh"]
 ---> Running in 75c35cd4e076
Removing intermediate container 75c35cd4e076
 ---> 99501f174e58

Successfully built 99501f174e58
Successfully tagged reizio_reiz:latest
Building web
Step 1/5 : FROM python:3.8.2-slim
 ---> e7d894e42148
Step 2/5 : RUN apt-get update  && apt-get upgrade -y  && apt-get install git apt-utils -y
 ---> Using cache
 ---> e065aae1ac6d
Step 3/5 : RUN git clone https://github.com/treescience/search.tree.science
 ---> Using cache
 ---> 35db27c91c06
Step 4/5 : WORKDIR search.tree.science
 ---> Using cache
 ---> 4e636879d5de
Step 5/5 : ENTRYPOINT ["python", "-m", "http.server", "8080"]
 ---> Using cache
 ---> 9ef76d9cae2a

Successfully built 9ef76d9cae2a
Successfully tagged reizio_web:latest
Starting reizio_edgedb_1 ... done
Recreating reizio_reiz_1 ... done
Recreating reizio_web_1  ... done
Attaching to reizio_edgedb_1, reizio_reiz_1, reizio_web_1
edgedb_1  | WARNING: no logs are available with the 'none' log driver
reiz_1    | + mkdir -p /app/tmp/data
reiz_1    | + mkdir -p /root/.local/
reiz_1    | + cp static/configs/docker_config.json /root/.local/reiz.json
reiz_1    | + pre_db
reiz_1    | + python -m reiz.sampling.get_dataset --limit 10 /app/tmp/dataset.json
reiz_1    | [2021-05-24 05:18:11,043] get_pypi_dataset --- Adding six to the dataset
reiz_1    | [2021-05-24 05:18:11,056] get_pypi_dataset --- Adding urllib3 to the dataset
reiz_1    | [2021-05-24 05:18:11,060] get_pypi_dataset --- Adding requests to the dataset
reiz_1    | [2021-05-24 05:18:11,066] get_pypi_dataset --- Adding certifi to the dataset
reiz_1    | [2021-05-24 05:18:11,074] get_pypi_dataset --- Adding setuptools to the dataset
reiz_1    | [2021-05-24 05:18:11,076] get_pypi_dataset --- Adding idna to the dataset
reiz_1    | [2021-05-24 05:18:11,099] get_pypi_dataset --- Adding botocore to the dataset
reiz_1    | [2021-05-24 05:18:11,101] get_pypi_dataset --- Adding chardet to the dataset
reiz_1    | [2021-05-24 05:18:11,104] get_pypi_dataset --- Adding s3transfer to the dataset
reiz_1    | [2021-05-24 05:18:11,113] get_pypi_dataset --- Adding pip to the dataset
reiz_1    | [2021-05-24 05:18:11,168] get_pypi_dataset --- 10 repositories have been added to the /app/tmp/dataset.json
reiz_1    | + python -m reiz.sampling.fetch_dataset /app/tmp/dataset.json /app/tmp/rawdata
reiz_1    | [2021-05-24 05:18:11,973] fetch           --- 'six' has been checked at 3974f0c4f6700a5821b451abddff8b3ba6b2a04f revision
reiz_1    | [2021-05-24 05:18:11,974] fetch           --- 'certifi' has been checked at c5311ad0533e78240f34b0e35451794d0e261d54 revision
reiz_1    | [2021-05-24 05:18:12,073] fetch           --- 'urllib3' has been checked at 97a16d74f287ce84dcb14aa90bf28c9088579257 revision
reiz_1    | [2021-05-24 05:18:12,123] fetch           --- 'requests' has been checked at f6d43b03fbb9a1e75ed63a9aa15738a8fce99b50 revision
reiz_1    | [2021-05-24 05:18:12,249] fetch           --- 'idna' has been checked at b0ef4bf7fe79174e9ddd8d89e2f48b2e6dc3e721 revision
reiz_1    | [2021-05-24 05:18:12,576] fetch           --- 's3transfer' has been checked at 59e968d05288092948284001710c416677102266 revision
reiz_1    | [2021-05-24 05:18:12,603] fetch           --- 'setuptools' has been checked at a5131f0b82e098da6c07a03a47f36f3a52f73fb6 revision
reiz_1    | [2021-05-24 05:18:12,706] fetch           --- 'chardet' has been checked at e6488640addf2af696e1d602e9007d9720d9fd7d revision
reiz_1    | [2021-05-24 05:18:13,493] fetch           --- 'pip' has been checked at 0a49dd913f673b8abb7c597d58f0e26d24dc34bc revision
reiz_1    | [2021-05-24 05:18:18,498] fetch           --- 'botocore' has been checked at b9cdb88f404c67acfb7088dbbbe97279baa1aa88 revision
reiz_1    | + python -m reiz.sampling.sanitize_dataset /app/tmp/dataset.json /app/tmp/rawdata /app/tmp/data --ignore-tests
reiz_1    | [2021-05-24 05:18:18,713] sanitize_dataset --- 'six' has been sanitized
reiz_1    | [2021-05-24 05:18:18,714] sanitize_dataset --- 'idna' has been sanitized
reiz_1    | [2021-05-24 05:18:18,714] sanitize_dataset --- 'certifi' has been sanitized
reiz_1    | [2021-05-24 05:18:18,714] sanitize_dataset --- 's3transfer' has been sanitized
reiz_1    | [2021-05-24 05:18:18,714] sanitize_dataset --- 'requests' has been sanitized
reiz_1    | [2021-05-24 05:18:18,714] sanitize_dataset --- 'urllib3' has been sanitized
reiz_1    | [2021-05-24 05:18:18,715] sanitize_dataset --- 'chardet' has been sanitized
reiz_1    | [2021-05-24 05:18:18,715] sanitize_dataset --- 'setuptools' has been sanitized
reiz_1    | [2021-05-24 05:18:18,715] sanitize_dataset --- 'pip' has been sanitized
reiz_1    | [2021-05-24 05:18:18,715] sanitize_dataset --- 'botocore' has been sanitized
reiz_1    | + sleep 15
reiz_1    | + post_db
reiz_1    | + python /app/scripts/create_db.py
reiz_1    | [2021-05-24 05:18:34,006] create_db       --- database exits, doing nothing...
reiz_1    | + python -m reiz.serialization.insert --fast --limit 75 --project-limit 7 /app/tmp/dataset.json
reiz_1    | + python -m reiz.web.api
reiz_1    | [2021-05-24 05:18:35 +0000] [140] [INFO] Goin' Fast @ http://0.0.0.0:8000
reiz_1    | [2021-05-24 05:18:35,211] _helper         --- Goin' Fast @ http://0.0.0.0:8000
reiz_1    | [2021-05-24 05:18:35 +0000] [140] [INFO] Starting worker [140]
reiz_1    | [2021-05-24 05:18:35,482] serve           --- Starting worker [140]

The search engine can be visited at port 8080. However, no new log lines are printed and no search result is ever returned.

New data fetching / linking system

Currently, we use a raw list of the most popular Python packages and process the source distribution files hosted on PyPI. It has been requested before that it would be much better if we had some way to link the results to the source on GitHub with one click, so that the user can explore the code before/after the fragment. We can't do that right now, since we don't know which tree reference the source dists on PyPI belong to.

This is a multi-stage issue that we should start implementing right away. The stages:

  • Listing a possible dataset of projects (collecting homepage information from the PyPI list we have)
  • Checking out all projects with git clone
  • Keeping only the commit hash and the source files (*.py), and removing everything else (such as the .git folder, readmes, documentation, etc.)
  • The commit hashes should be listed in info.json in the clean directory (raw/ -> all checked-out repositories with full information, clean/ -> only Python source files).
  • Modifying the schema to have a new node Project(string name, string commit_ref) and adapting Module(..., Project project).
  • In reiz.fetch, returning the project's commit_ref alongside the filename and position.
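The "clean" stage of the steps above could be sketched as follows. The paths and the info.json layout are assumptions based on the issue text:

```python
# Minimal sketch of the clean stage: copy only *.py files from a
# checked-out repository into clean/ and record the commit hash in
# info.json next to them.
import json
import shutil
from pathlib import Path

def clean_checkout(raw_dir: Path, clean_dir: Path, commit_ref: str) -> None:
    """Keep only Python sources and the commit hash of a checkout."""
    clean_dir.mkdir(parents=True, exist_ok=True)
    for source in raw_dir.rglob("*.py"):
        target = clean_dir / source.relative_to(raw_dir)
        target.parent.mkdir(parents=True, exist_ok=True)
        shutil.copy2(source, target)
    (clean_dir / "info.json").write_text(json.dumps({"commit_ref": commit_ref}))
```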

Central logging system

Currently, the reiz.io toolchain uses raw print() calls to emit exceptions/logs. A better approach is keeping a central logger object in reiz.utilities and letting the tools log at debug/info/exception etc. levels.
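A minimal sketch of such a central logger, assuming it lives in reiz.utilities; the format string mimics the `[timestamp] funcname --- message` shape seen in the log output above:

```python
# Hypothetical sketch of a central logger for reiz.utilities: configure
# one logging.Logger that every tool imports instead of calling print().
import logging

logger = logging.getLogger("reiz")
if not logger.handlers:  # configure only once, even on repeated import
    handler = logging.StreamHandler()
    handler.setFormatter(
        logging.Formatter("[%(asctime)s] %(funcName)-15s --- %(message)s")
    )
    logger.addHandler(handler)
    logger.setLevel(logging.DEBUG)

logger.info("serialization started")
try:
    raise ValueError("demo failure")
except ValueError:
    # logger.exception attaches the traceback automatically
    logger.exception("processing failed")
```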

reiz.reizql: LEN matcher on empty sets

Assign(targets=[Tuple(LEN(max=1))])

would generate the following query:

SELECT ast::Assign FILTER count(.targets) = 1 AND (WITH _sequence_36f99674 := array_agg((SELECT .targets ORDER BY @index)) SELECT count(_sequence_36f99674[0][IS ast::Tuple].elts) <= 1)

which should actually guard that _sequence_36f99674[0][IS ast::Tuple].elts EXISTS in the first place. Otherwise, the result set will contain both Tuple()s and other node types.

reiz.web: Cache to redis

We could potentially cache the query results in redis, keyed with the format {hash(reiz_ql_str) ^ hash(offset): [result1, ..., resultN]}.
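A sketch of the proposed key scheme, with a plain dict standing in for redis; `cached_run` and `run_query` are hypothetical names:

```python
# Hypothetical sketch of the cache-key scheme described above: XOR the
# hashes of the query string and the offset, and store result lists
# under that key (a dict plays the role of redis here).
def cache_key(reiz_ql: str, offset: int) -> int:
    return hash(reiz_ql) ^ hash(offset)

cache = {}

def cached_run(query: str, offset: int, run_query):
    """Return cached results, running the query only on a cache miss."""
    key = cache_key(query, offset)
    if key not in cache:
        cache[key] = run_query(query, offset)
    return cache[key]
```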

reiz.ir.optimizer: General

This is something I intend to do, both for speed and for proper EdgeQL generation. Each comment on this issue will correspond to a single operation in the optimizer.

Test cases, tons of them

We need hundreds of different query test cases collected in one place, with E2E integrity testing.

reiz.db.insert: starred dictionary items don't work

We are currently unable to describe starred keys (the key is None, the value is the name). A better approach would be creating a DoubleStarred(expr value) node under expr = sum in the ASDL, and transforming the AST accordingly.

{**d}
{a:b, c:d, **e, **f, g: h}

(we currently ignore those keys)
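The transformation could be sketched as below. `DoubleStarred` is the hypothetical new node, modeled here as a plain dataclass rather than a real ASDL addition:

```python
# Sketch of the proposed transform: Python's ast.Dict represents a
# **-unpacked item as a None key, so wrap those items in an explicit
# DoubleStarred marker instead of dropping them.
import ast
from dataclasses import dataclass

@dataclass
class DoubleStarred:
    value: ast.expr

def replace_starred_keys(node: ast.Dict):
    """Return the dict's keys with **-items wrapped in DoubleStarred."""
    return [
        DoubleStarred(value) if key is None else key
        for key, value in zip(node.keys, node.values)
    ]

tree = ast.parse("{a: b, **e}", mode="eval").body
keys = replace_starred_keys(tree)  # [Name 'a', DoubleStarred(Name 'e')]
```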
