
lynxkite / lynxkite


The complete graph data science platform

Home Page: https://lynxkite.com/

License: GNU Affero General Public License v3.0

data-science complex-networks graph-algorithms graph-visualization machine-learning hacktoberfest

lynxkite's Introduction

LynxKite

LynxKite is a complete graph data science platform for very large graphs and other datasets. It seamlessly combines the benefits of a friendly graphical interface and a powerful Python API.

  • Hundreds of scalable graph operations, including graph metrics like PageRank, embeddedness, and centrality, machine learning methods including GCNs, graph segmentations like modular clustering, and various transformation tools like aggregations on neighborhoods.
  • The two main data types are graphs and relational tables. Switch back and forth between the two as needed to describe complex logical flows. Run SQL on both.
  • A friendly web UI for building powerful pipelines of operation boxes. Define your own custom boxes to structure your logic.
  • Tight integration with Python lets you implement custom transformations or create whole workflows through a simple API (see the sketch after this list).
  • Integrates with the Hadoop ecosystem. Import and export from CSV, JSON, Parquet, ORC, JDBC, Hive, or Neo4j.
  • Fully documented.
  • Proven in production on large clusters and real datasets.
  • Fully configurable graph visualizations and statistical plots. Experimental 3D and ray-traced graph renderings.
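
A minimal sketch of what a Python API session could look like, assuming a local LynxKite listening on http://localhost:2200. The lynx.kite module path and the camelCase method names below (createExampleGraph, sql, df) follow the API's convention of exposing boxes as methods, but treat them as assumptions and check the Python API documentation for the exact names.

# Sketch only: module path, constructor arguments and box-method names are
# assumptions based on the documented box-name-to-method convention.
from lynx.kite import LynxKite

lk = LynxKite(address='http://localhost:2200')  # or set LYNXKITE_ADDRESS

# Build a tiny pipeline: create the example graph, run SQL on it,
# and fetch the result as a pandas DataFrame.
df = lk.createExampleGraph().sql('select name, age from vertices').df()
print(df)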

LynxKite is under active development. Check out our Roadmap to see what we have planned for future releases.

Getting started

Quick try:

docker run --rm -p2200:2200 lynxkite/lynxkite

Setup with persistent data:

docker run \
  -p 2200:2200 \
  -v ~/lynxkite/meta:/metadata -v ~/lynxkite/data:/data \
  -e KITE_MASTER_MEMORY_MB=1024 \
  --name lynxkite lynxkite/lynxkite

Contributing

If you find any bugs, or have any questions, feature requests, or comments, please file an issue or email us at [email protected].

Containerized build

If you build LynxKite with Earthly, you don't have to install anything on your system except Earthly itself, and you get highly reproducible builds.

  1. Install Earthly.
  2. Run earthly +run to build and run LynxKite. See the Earthfile for other targets.

Native build

You can install LynxKite's dependencies (Scala, Node.js, Go) locally with Conda.

Before the first build:

tools/git/setup.sh # Sets up pre-commit hooks.
conda env create --name lk --file conda-env.yml
conda activate lk
cp conf/kiterc_template ~/.kiterc

We use make for building the whole project.

make

LynxKite can be run as a fat jar started with spark-submit. See run.sh for an example of this.

Tests

We have test suites for the different parts of the system:

  • Backend tests are unit tests for the Scala code. They can also be executed with Sphynx as the backend. If you run make backend-test it will do both. Or you can start sbt and run testOnly *SomethingTest to run just one test. Run ./test_backend.sh -si to start sbt with Sphynx as the backend.

  • Frontend tests use Playwright to simulate a user's actions on the UI. make frontend-test will build everything, start a temporary LynxKite instance, and run the tests against that. If you already have a running LynxKite instance, run npm test in the web directory. You can start up a dev server that proxies backend requests to LynxKite with npm start.

  • Python API tests are started with make remote_api-test. If you already have a running LynxKite that is okay to test on, run python/remote_api/test.sh. This script can also run a subset of the test suite: python/remote_api/test.sh -p *something*

License

LynxKite is licensed under the GNU Affero General Public License v3.0.

lynxkite's People

Contributors

asa10e, borsijulcsi, borsim, bramrodenburg, darabos, ddkatona, dependabot[bot], dolphyvn, erbenpeter, forevian, gaborfeher, gsvigruha, hannagabor, huncros, jfcg, jmlizano, korommatyi, lacca0, lynx-jenkins, lynx-steven-xufan, malnapolya, sergiykolesnikov, shuheng-lynx, szmate1618, vinchang2015, wafle, xandrew, xandrew-lynx, yenonn, zskatona


lynxkite's Issues

Replace Google Maps?

We use Google Maps for the geographic visualization. But this API is limited to 1,000 requests per day (I think). Maybe there’s a better solution for an open-source project?

Vector support in SQL or Array support in derive boxes

Putting a Vector[number] through SQL turns it into Array[number]. How do I turn it back? I tried v.toVector in a derive box, but I get Unsupported type TypeTag[Array[Double]] for input parameter v in expression v.toVector.

How do I add edges?

There are operations that create new edges, like "Connect vertices on attribute". But these (always?) replace the existing edge bundle instead of adding to it. Maybe we could reconsider that.

But let's say they replace the edge bundle. How do I combine it with the preexisting edges? "Graph union" followed by "Merge vertices by attribute" kind of works. But it discards all the vertex attributes! I guess I can add them back with SQL...

[screenshot]

Is there a better way? If there isn't, shouldn't there be?

No "vertices" table when there are no vertex attributes

The vertices table does not exist, but the edges table does.

[screenshot: sql-on-networkit-graph]

My first guess is that the "Create Erdős–Rényi graph" operation runs outside the Spark domain (I'm not sure it's the correct terminology), so the vertices table is not created automatically for the Spark SQL version.

After running an edge attribute computation, the edges table is created properly and I can run SQL on it.

[screenshot: edges-of-networkit-graph]

If I add a "Compute degree" operation, the vertices table appears.

Add GRAPE

https://github.com/maxiaoba/GRAPE is a state-of-the-art feature imputation / label prediction method. It would be a good fit for LynxKite because it's an excuse to use a graph neural network in a setting where you don't even really have a graph. Also, it uses PyTorch Geometric, which will hopefully make the integration not too hard.

Compute modularity

Why don't we emit the modularity as a scalar from Find modular clustering? It's pretty complicated to compute it manually.
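
For reference, a sketch of the manual route this issue wants to avoid: export the graph and the cluster assignment, then compute modularity with networkx (an external library, used here purely for illustration).

import networkx as nx
from networkx.algorithms.community import modularity

edges = [(0, 1), (1, 2), (2, 0), (3, 4)]             # exported edge list
clusters = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b'}  # vertex -> cluster id

G = nx.Graph(edges)
communities = [
    {v for v, c in clusters.items() if c == cid}
    for cid in set(clusters.values())
]
print(modularity(G, communities))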

Add language drop-down to derive attribute boxes

We could allow using SQL and Python in addition to Scala.

When Python is selected, we search for attribute names in the expression the same way as for Scala. If we find x and y, and the output is supposed to be z, we essentially just run the Compute in Python box with input vs.x, vs.y, output vs.z: float or vs.z: str, and code x = vs['x']; y = vs['y']; vs['z'] = <user code>. This would be a slight convenience over the Compute in Python box.

When SQL is selected we run select !id, <user code> as output from vertices and put it back as an attribute similarly to how Filter with SQL works. This would be a nice convenience over using SQL and Use table as vertex attribute.
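
To make the Python case above concrete, here is a runnable sketch of the wrapper code the box could generate around the user's expression. The vs DataFrame stands in for the vertex table the Compute in Python box provides, and x + y stands in for <user code>.

import pandas as pd

# Stand-in for the vertex table the box would provide.
vs = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

# Hypothetical generated wrapper, per the description above.
# Inputs found in the expression: x, y; declared output: z (float).
x = vs['x']
y = vs['y']
vs['z'] = x + y  # <user code> goes here
print(vs)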

Parameter analysis

It's often the case that you want to tweak a box parameter (like damping in PageRank) and see how a scalar (like an error metric or the size of a subgraph) changes with it. You probably want to specify a range and step size for the parameter, and you want to look at a chart of the corresponding scalar values. You'd also want to have the results in a table so you can create custom charts or something.

How could we do this? Can we fit it into the existing box network, or do we need something very special?

[screenshot]

Table support in Compute in Python

This would be the simplest thing. Just a DataFrame in and a DataFrame out.

It could work around the limitation that the box cannot change the graph structure. Just input the old edge list and output a new edge list. Maybe there's a better solution for this. But writing complex table manipulation code in Python would still be useful.
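
A sketch of what user code in such a box might look like, assuming the box exposed the input and output tables as pandas DataFrames. The names edges_in and edges_out are hypothetical.

import pandas as pd

# Hypothetical input: the old edge list handed to the box as a DataFrame.
edges_in = pd.DataFrame({'src': [0, 1, 2], 'dst': [1, 2, 0], 'weight': [1.0, 2.0, 0.5]})

# Arbitrary table manipulation: keep heavy edges and add their reverses.
heavy = edges_in[edges_in['weight'] >= 1.0]
reverse = heavy.rename(columns={'src': 'dst', 'dst': 'src'})
edges_out = pd.concat([heavy, reverse], ignore_index=True)
print(edges_out)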

Opening box search is slow

Is it just me or has it really become slow to open the search box? It takes a few seconds!

Looks like it's due to the help icons for each box.

Visible outputs embedded in the workspace

Just a wild idea. A box that can show a scalar, plot, or visualization right there in the workspace. It's a normal box, but instead of an icon, it displays something. You can click on it to adjust the parameters and set what to display.

[screenshot]

Remove attribute name conflict errors

[screenshot]

So annoying! It's especially dumb when it's the key attribute. Just keep the old value.

But even for other attributes I think it would be better to not report an error for this. Just take the old or new value. I'm not sure which one is more helpful.

Guided tour is broken in 4.0.0

Reported by Tamas Hankusz.

When you press "Next" the first time and it tries to highlight an element, it breaks. The console shows a relevant exception:

Uncaught TypeError: r.hasOwnProperty is not a function
    at t.l.getOptions (bootstrap.js:1508)
    at t.l.init (bootstrap.js:1471)
    at new t (bootstrap.js:1980)
    at HTMLButtonElement.<anonymous> (bootstrap.js:2067)
    at Function.each (jquery.js:381)
    at x.fn.init.each (jquery.js:203)
    at x.fn.init.e.fn.popover (bootstrap.js:2061)
    at i._showPopover (bootstrap-tourist.js:1866)
    at i._showPopoverAndOverlay (bootstrap-tourist.js:1718)
    at i.<anonymous> (bootstrap-tourist.js:1384)

Add support for reading VCF, BGEN and Plink file formats

VCF, BGEN, and Plink are common file formats in genomics. The open-source project Glow adds support for loading datasets in these formats into Spark DataFrames. I don't know how common it is for people working with genomics data to also use graph algorithms, but it seems quite easy to add support for these file formats to LynxKite.
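
A rough sketch of the Glow route in PySpark. The glow.register call and the 'vcf'/'bgen' data source names are based on Glow's documentation; treat the details as assumptions and check the Glow docs for your version.

# Sketch only: requires the Glow Spark package (io.projectglow) and the glow.py
# Python bindings to be installed; the paths are placeholders.
from pyspark.sql import SparkSession
import glow

spark = SparkSession.builder.getOrCreate()
spark = glow.register(spark)

# Read genomics files as ordinary Spark DataFrames.
vcf_df = spark.read.format('vcf').load('/path/to/genotypes.vcf')
bgen_df = spark.read.format('bgen').load('/path/to/genotypes.bgen')
vcf_df.printSchema()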

Clear error reporting when LynxKite is down

I got this screenshot from a user:
[screenshot]

I'm pretty sure LynxKite is down. This causes some requests to fail, which is not handled correctly by some piece of JavaScript, so there's a JavaScript error, which we try to send to the backend for logging. But that request fails too, and this time we handle it by popping up these alerts.

  • Let's not report ajax/jsError failures. It's never useful.
  • Make sure we display a clear error when LynxKite is not accessible from the frontend.

Python error handling

In case of errors we should

  • print all output, so we can add print() calls for debugging (see the sketch after this list)
  • handle it right for the "Create graph in Python" box too.
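
A generic sketch of the first point: capture everything the user code prints so it can be returned even when the code fails. This is only an illustration of the idea, not LynxKite's actual execution code; run_user_code is a hypothetical helper.

import contextlib
import io
import traceback

def run_user_code(code, env):
    """Run user code, returning whatever it printed plus any error."""
    buf = io.StringIO()
    error = None
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception:
        error = traceback.format_exc()
    return buf.getvalue(), error

output, error = run_user_code("print('debugging'); 1 / 0", {})
print(output)  # the printed output survives even though the code failed
print(error)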

Add aggregations for vectors

For lat/long vectors it totally makes sense to take the average. Some other aggregations may also be useful on some datasets. (E.g. element-wise min/max.)
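
For illustration, element-wise aggregation of fixed-length vectors is cheap once they are stacked into a matrix. A numpy sketch (not LynxKite code):

import numpy as np

# A few lat/long vectors collected from the vertices being aggregated.
vectors = [[47.49, 19.04], [47.50, 19.05], [47.51, 19.03]]

stacked = np.stack(vectors)                       # shape: (n_vectors, 2)
print(stacked.mean(axis=0))                       # element-wise average
print(stacked.min(axis=0), stacked.max(axis=0))   # element-wise min/max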

New aggregation: distribution sketch

I have more than a million edges between a few thousand nodes. The edges have a bunch of attributes. I merge the parallel edges. I wish I could now click on an edge and see a histogram of an attribute. It would show the distribution of the attribute values across all the parallel edges that were merged into this one.

I think there are two parts to it, both moderately complex:

  • A new aggregation method for strings and numbers that builds a distribution "sketch". There is an Apache library for this (Apache DataSketches); see the code sketch after this list.
  • Code for visualizations to be able to show the sketch as a histogram.
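
A sketch of the first part using the Python bindings of Apache DataSketches (the datasketches package). The class and method names below (kll_floats_sketch, update, get_pmf) are my best guess at its KLL quantile sketch API, so treat them as assumptions.

# Assumed API of the Apache DataSketches Python bindings.
from datasketches import kll_floats_sketch

sketch = kll_floats_sketch(200)                 # k trades accuracy for size
for value in [0.1, 0.5, 0.5, 0.9, 1.2, 3.0]:    # attribute values of the parallel edges
    sketch.update(value)

# The merged edge would store the serialized sketch; the UI could later render
# an approximate histogram from bucket masses like these.
print(sketch.get_pmf([0.5, 1.0, 2.0]))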

Upgrade to Spark 3.0

It seems that despite the new major version, "No major code changes are required to adopt this version of Apache Spark."

It seems to have quite a few improvements. It would also allow for GPU acceleration, as pointed out by Gyorgy Mezo.

Links in comments cannot be clicked

We've disabled mouse events on the comment text, so that it doesn't block you from interacting with boxes "under" the comment. But this means links in the comment don't work either.

Might be a simple CSS fix.

Make aggregation suffixes optional in merges

I almost always find it annoying that a _sum suffix is added. Of course, it's necessary if the user selects multiple aggregations for the same attribute, but otherwise the semantics is more often than not that the user is telling the system the "natural" way to aggregate an attribute.

Maybe a checkbox or something? If the user opts for no suffix then it's an error to choose more than one aggregation per field.

Include GPL JDBC drivers

We used to not include GPL JDBC drivers (notably MySQL's) in LynxKite because LynxKite itself wasn't GPL-licensed software. But now it is! Let's add MySQL and whatever else!

Delaunay triangulation from positions

This could quickly give a reasonable graph for location datasets where "neighboring" points are connected. It would be a useful approximation of road networks.
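
A quick sketch of how the edge list could be derived from vertex positions with scipy, purely to illustrate the idea:

import numpy as np
from scipy.spatial import Delaunay

# Vertex positions (e.g. projected coordinates).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

tri = Delaunay(points)
# Every triangle contributes its three sides; deduplicate them into an edge list.
edges = set()
for a, b, c in tri.simplices:
    for u, v in ((a, b), (b, c), (c, a)):
        edges.add((min(u, v), max(u, v)))
print(sorted(edges))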

Create test for guided tour

We didn't notice when a jQuery upgrade killed it. (#30) It's a bit tricky to test, because we disable it for testing so it doesn't confuse the other tests.

Visualization without edges

With a small, dense graph of geographic data I find myself adding Discard edges before every visualization box. We already have a drop-down for directed or undirected edges. It probably wouldn't take much to add a "no edges" option as well.
