
lynxkite / lynxkite


The complete graph data science platform

Home Page: https://lynxkite.com/

License: GNU Affero General Public License v3.0

data-science complex-networks graph-algorithms graph-visualization machine-learning hacktoberfest

lynxkite's Introduction

LynxKite

LynxKite is a complete graph data science platform for very large graphs and other datasets. It seamlessly combines the benefits of a friendly graphical interface and a powerful Python API.

  • Hundreds of scalable graph operations, including graph metrics like PageRank, embeddedness, and centrality, machine learning methods including GCNs, graph segmentations like modular clustering, and various transformation tools like aggregations on neighborhoods.
  • The two main data types are graphs and relational tables. Switch back and forth between the two as needed to describe complex logical flows. Run SQL on both.
  • A friendly web UI for building powerful pipelines of operation boxes. Define your own custom boxes to structure your logic.
  • Tight integration with Python lets you implement custom transformations or create whole workflows through a simple API (see the sketch after this list).
  • Integrates with the Hadoop ecosystem. Import and export from CSV, JSON, Parquet, ORC, JDBC, Hive, or Neo4j.
  • Fully documented.
  • Proven in production on large clusters and real datasets.
  • Fully configurable graph visualizations and statistical plots. Experimental 3D and ray-traced graph renderings.
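
A minimal sketch of what a Python API session could look like, assuming a local LynxKite listening on http://localhost:2200. The lynx.kite module path and the camelCase method names below (createExampleGraph, sql, df) follow the API's convention of exposing boxes as methods, but treat them as assumptions and check the Python API documentation for the exact names.

# Sketch only: module path, constructor arguments and box-method names are
# assumptions based on the documented box-name-to-method convention.
from lynx.kite import LynxKite

lk = LynxKite(address='http://localhost:2200')  # or set LYNXKITE_ADDRESS

# Build a tiny pipeline: create the example graph, run SQL on it,
# and fetch the result as a pandas DataFrame.
df = lk.createExampleGraph().sql('select name, age from vertices').df()
print(df)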

LynxKite is under active development. Check out our Roadmap to see what we have planned for future releases.

Getting started

Quick try:

docker run --rm -p2200:2200 lynxkite/lynxkite

Setup with persistent data:

docker run \
  -p 2200:2200 \
  -v ~/lynxkite/meta:/metadata -v ~/lynxkite/data:/data \
  -e KITE_MASTER_MEMORY_MB=1024 \
  --name lynxkite lynxkite/lynxkite

Contributing

If you find any bugs, or have any questions, feature requests, or comments, please file an issue or email us at [email protected].

Containerized build

If you build LynxKite with Earthly, you don't have to install anything on your system except Earthly itself, and you get highly reproducible builds.

  1. Install Earthly.
  2. Run earthly +run to build and run LynxKite. See the Earthfile for other targets.

Native build

You can install LynxKite's dependencies (Scala, Node.js, Go) locally with Conda.

Before the first build:

tools/git/setup.sh # Sets up pre-commit hooks.
conda env create --name lk --file conda-env.yml
conda activate lk
cp conf/kiterc_template ~/.kiterc

We use make for building the whole project.

make

LynxKite can be run as a fat jar started with spark-submit. See run.sh for an example of this.

Tests

We have test suites for the different parts of the system:

  • Backend tests are unit tests for the Scala code. They can also be executed with Sphynx as the backend. If you run make backend-test it will do both. Or you can start sbt and run testOnly *SomethingTest to run just one test. Run ./test_backend.sh -si to start sbt with Sphynx as the backend.

  • Frontend tests use Playwright to simulate a user's actions on the UI. make frontend-test will build everything, start a temporary LynxKite instance, and run the tests against that. If you already have a running LynxKite instance, run npm test in the web directory. You can start up a dev server that proxies backend requests to LynxKite with npm start.

  • Python API tests are started with make remote_api-test. If you already have a running LynxKite that is okay to test on, run python/remote_api/test.sh. This script can also run a subset of the test suite: python/remote_api/test.sh -p *something*

License

LynxKite is licensed under the GNU Affero General Public License v3.0.

lynxkite's People

Contributors

asa10e, borsijulcsi, borsim, bramrodenburg, darabos, ddkatona, dependabot[bot], dolphyvn, erbenpeter, forevian, gaborfeher, gsvigruha, hannagabor, huncros, jfcg, jmlizano, korommatyi, lacca0, lynx-jenkins, lynx-steven-xufan, malnapolya, sergiykolesnikov, shuheng-lynx, szmate1618, vinchang2015, wafle, xandrew, xandrew-lynx, yenonn, zskatona


lynxkite's Issues

Replace Google Maps?

We use Google Maps for the geographic visualization. But this API is limited to 1,000 requests per day (I think). Maybe there’s a better solution for an open-source project?

Vector support in SQL or Array support in derive boxes

Putting a Vector[number] through SQL turns it into Array[number]. How do I turn it back? I tried v.toVector in a derive box, but I get Unsupported type TypeTag[Array[Double]] for input parameter v in expression v.toVector.

How do I add edges?

There are operations that create new edges, like "Connect vertices on attribute". But these (always?) replace the existing edge bundle instead of adding to it. Maybe we could reconsider that.

But let's say they replace the edge bundle. How do I combine it with the preexisting edges? "Graph union" followed by "Merge vertices by attribute" kind of works. But it discards all the vertex attributes! I guess I can add them back with SQL...

[screenshot]

Is there a better way? If there isn't, shouldn't there be?

No "vertices" table when there are no vertex attributes

The vertices table does not exist, but the edges table does.

[screenshot: sql-on-networkit-graph]

My first guess is that the "Create Erdős–Rényi graph" operation runs outside the Spark domain (I'm not sure it's the correct terminology), so the vertices table is not created automatically for the Spark SQL version.

After running an edge attribute computation, the edges table is created properly and I can run SQL on it.

[screenshot: edges-of-networkit-graph]

If I add a "Compute degree" operation, the vertices table appears.

Add GRAPE

https://github.com/maxiaoba/GRAPE is a state-of-the-art feature imputation / label prediction method. It would be a good fit for LynxKite because it's an excuse to use a graph neural network in a setting where you don't even really have a graph. Also, it uses PyTorch Geometric, which will hopefully make the integration not too hard.

Compute modularity

Why don't we emit the modularity as a scalar from Find modular clustering? It's pretty complicated to compute it manually.
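
For reference, a sketch of the manual route this issue wants to avoid: export the graph and the cluster assignment, then compute modularity with networkx (an external library, used here purely for illustration).

import networkx as nx
from networkx.algorithms.community import modularity

edges = [(0, 1), (1, 2), (2, 0), (3, 4)]             # exported edge list
clusters = {0: 'a', 1: 'a', 2: 'a', 3: 'b', 4: 'b'}  # vertex -> cluster id

G = nx.Graph(edges)
communities = [
    {v for v, c in clusters.items() if c == cid}
    for cid in set(clusters.values())
]
print(modularity(G, communities))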

Add language drop-down to derive attribute boxes

We could allow using SQL and Python in addition to Scala.

When Python is selected, we search for attribute names in the expression the same way as for Scala. If we find x and y, and the output is supposed to be z, we essentially just run the Compute in Python box with input vs.x, vs.y, output vs.z: float or vs.z: str, and code x = vs['x']; y = vs['y']; vs['z'] = <user code>. This would be a slight convenience over the Compute in Python box.

When SQL is selected we run select !id, <user code> as output from vertices and put it back as an attribute similarly to how Filter with SQL works. This would be a nice convenience over using SQL and Use table as vertex attribute.
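
To make the Python case above concrete, here is a runnable sketch of the wrapper code the box could generate around the user's expression. The vs DataFrame stands in for the vertex table the Compute in Python box provides, and x + y stands in for <user code>.

import pandas as pd

# Stand-in for the vertex table the box would provide.
vs = pd.DataFrame({'x': [1.0, 2.0], 'y': [3.0, 4.0]})

# Hypothetical generated wrapper, per the description above.
# Inputs found in the expression: x, y; declared output: z (float).
x = vs['x']
y = vs['y']
vs['z'] = x + y  # <user code> goes here
print(vs)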

Parameter analysis

It's often the case that you want to tweak a box parameter (like damping in PageRank) and see how a scalar (like an error metric or the size of a subgraph) changes with it. You probably want to specify a range and step size for the parameter, and you want to look at a chart of the corresponding scalar values. You'd also want to have the results in a table so you can create custom charts or something.

How could we do this? Can we fit it into the existing box network, or do we need something very special?

[screenshot]

Table support in Compute in Python

This would be the simplest thing. Just a DataFrame in and a DataFrame out.

It could work around the limitation that the box cannot change the graph structure. Just input the old edge list and output a new edge list. Maybe there's a better solution for this. But writing complex table manipulation code in Python would still be useful.
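
A sketch of what user code in such a box might look like, assuming the box exposed the input and output tables as pandas DataFrames. The names edges_in and edges_out are hypothetical.

import pandas as pd

# Hypothetical input: the old edge list handed to the box as a DataFrame.
edges_in = pd.DataFrame({'src': [0, 1, 2], 'dst': [1, 2, 0], 'weight': [1.0, 2.0, 0.5]})

# Arbitrary table manipulation: keep heavy edges and add their reverses.
heavy = edges_in[edges_in['weight'] >= 1.0]
reverse = heavy.rename(columns={'src': 'dst', 'dst': 'src'})
edges_out = pd.concat([heavy, reverse], ignore_index=True)
print(edges_out)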

Opening box search is slow

Is it just me or has it really become slow to open the search box? It takes a few seconds!

Looks like it's due to the help icons for each box.

Visible outputs embedded in the workspace

Just a wild idea. A box that can show a scalar, plot, or visualization right there in the workspace. It's a normal box, but instead of an icon, it displays something. You can click on it to adjust the parameters and set what to display.

[screenshot]

Remove attribute name conflict errors

[screenshot]

So annoying! It's especially dumb when it's the key attribute. Just keep the old value.

But even for other attributes I think it would be better to not report an error for this. Just take the old or new value. I'm not sure which one is more helpful.

Guided tour is broken in 4.0.0

Reported by Tamas Hankusz.

When you press "Next" the first time and it tries to highlight an element, it breaks. The console shows a relevant exception:

Uncaught TypeError: r.hasOwnProperty is not a function
    at t.l.getOptions (bootstrap.js:1508)
    at t.l.init (bootstrap.js:1471)
    at new t (bootstrap.js:1980)
    at HTMLButtonElement.<anonymous> (bootstrap.js:2067)
    at Function.each (jquery.js:381)
    at x.fn.init.each (jquery.js:203)
    at x.fn.init.e.fn.popover (bootstrap.js:2061)
    at i._showPopover (bootstrap-tourist.js:1866)
    at i._showPopoverAndOverlay (bootstrap-tourist.js:1718)
    at i.<anonymous> (bootstrap-tourist.js:1384)

Add support for reading VCF, BGEN and Plink file formats

VCF, BGEN, and Plink are common file formats in genomics. The open-source project Glow adds support for loading datasets in these formats into Spark DataFrames. I don't know how common it is for people working with genomics data to also use graph algorithms, but it seems quite easy to add support for these file formats to LynxKite.
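
A rough sketch of the Glow route in PySpark. The glow.register call and the 'vcf'/'bgen' data source names are based on Glow's documentation; treat the details as assumptions and check the Glow docs for your version.

# Sketch only: requires the Glow Spark package (io.projectglow) and the glow.py
# Python bindings to be installed; the paths are placeholders.
from pyspark.sql import SparkSession
import glow

spark = SparkSession.builder.getOrCreate()
spark = glow.register(spark)

# Read genomics files as ordinary Spark DataFrames.
vcf_df = spark.read.format('vcf').load('/path/to/genotypes.vcf')
bgen_df = spark.read.format('bgen').load('/path/to/genotypes.bgen')
vcf_df.printSchema()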

Clear error reporting when LynxKite is down

I got this screenshot from a user:
[screenshot]

I'm pretty sure LynxKite is down. This causes some requests to fail, which is not handled correctly by some piece of JavaScript, so there's a JavaScript error, which we try to send to the backend for logging. But that request fails too, and this time we handle it by popping up these alerts.

  • Let's not report ajax/jsError failures. It's never useful.
  • Make sure we display a clear error when LynxKite is not accessible from the frontend.

Python error handling

In case of errors we should

  • print all output, so we can add print() calls for debugging (see the sketch after this list)
  • handle it right for the "Create graph in Python" box too.
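
A generic sketch of the first point: capture everything the user code prints so it can be returned even when the code fails. This is only an illustration of the idea, not LynxKite's actual execution code; run_user_code is a hypothetical helper.

import contextlib
import io
import traceback

def run_user_code(code, env):
    """Run user code, returning whatever it printed plus any error."""
    buf = io.StringIO()
    error = None
    try:
        with contextlib.redirect_stdout(buf):
            exec(code, env)
    except Exception:
        error = traceback.format_exc()
    return buf.getvalue(), error

output, error = run_user_code("print('debugging'); 1 / 0", {})
print(output)  # the printed output survives even though the code failed
print(error)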

Add aggregations for vectors

For lat/long vectors it totally makes sense to take the average. Some other aggregations may also be useful on some datasets. (E.g. element-wise min/max.)
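
For illustration, element-wise aggregation of fixed-length vectors is cheap once they are stacked into a matrix. A numpy sketch (not LynxKite code):

import numpy as np

# A few lat/long vectors collected from the vertices being aggregated.
vectors = [[47.49, 19.04], [47.50, 19.05], [47.51, 19.03]]

stacked = np.stack(vectors)                       # shape: (n_vectors, 2)
print(stacked.mean(axis=0))                       # element-wise average
print(stacked.min(axis=0), stacked.max(axis=0))   # element-wise min/max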

New aggregation: distribution sketch

I have more than a million edges between a few thousand nodes. The edges have a bunch of attributes. I merge the parallel edges. I wish I could now click on an edge and see a histogram of an attribute. It would show the distribution of the attribute values across all the parallel edges that were merged into this one.

I think there are two parts to it, both moderately complex:

  • A new aggregation method for strings and numbers that builds a distribution "sketch". There is an Apache library for this (Apache DataSketches); see the code sketch after this list.
  • Code for visualizations to be able to show the sketch as a histogram.
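
A sketch of the first part using the Python bindings of Apache DataSketches (the datasketches package). The class and method names below (kll_floats_sketch, update, get_pmf) are my best guess at its KLL quantile sketch API, so treat them as assumptions.

# Assumed API of the Apache DataSketches Python bindings.
from datasketches import kll_floats_sketch

sketch = kll_floats_sketch(200)                 # k trades accuracy for size
for value in [0.1, 0.5, 0.5, 0.9, 1.2, 3.0]:    # attribute values of the parallel edges
    sketch.update(value)

# The merged edge would store the serialized sketch; the UI could later render
# an approximate histogram from bucket masses like these.
print(sketch.get_pmf([0.5, 1.0, 2.0]))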

Upgrade to Spark 3.0

It seems that despite the new major version, "No major code changes are required to adopt this version of Apache Spark."

It seems to have quite a few improvements. It would also allow for GPU acceleration, as pointed out by Gyorgy Mezo.

Links in comments cannot be clicked

We've disabled mouse events on the comment text, so that it doesn't block you from interacting with boxes "under" the comment. But this means links in the comment don't work either.

Might be a simple CSS fix.

Make aggregation suffixes optional in merges

I almost always find it annoying that a _sum suffix is added. Of course, it's necessary if the user selects multiple aggregations for the same attribute, but otherwise the semantics is more often than not that the user is telling the system the "natural" way to aggregate an attribute.

Maybe a checkbox or something? If the user opts for no suffix then it's an error to choose more than one aggregation per field.

Include GPL JDBC drivers

We used to not include GPL JDBC drivers (notably MySQL's) in LynxKite because LynxKite itself wasn't GPL-licensed software. But now it is! Let's add MySQL and whatever else!

Delaunay triangulation from positions

This could quickly give a reasonable graph for location datasets where "neighboring" points are connected. It would be a useful approximation of road networks.
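
A quick sketch of how the edge list could be derived from vertex positions with scipy, purely to illustrate the idea:

import numpy as np
from scipy.spatial import Delaunay

# Vertex positions (e.g. projected coordinates).
points = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])

tri = Delaunay(points)
# Every triangle contributes its three sides; deduplicate them into an edge list.
edges = set()
for a, b, c in tri.simplices:
    for u, v in ((a, b), (b, c), (c, a)):
        edges.add((min(u, v), max(u, v)))
print(sorted(edges))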

Create test for guided tour

We didn't notice when a jQuery upgrade killed it. (#30) It's a bit tricky to test, because we disable it for testing so it doesn't confuse the other tests.

Visualization without edges

With a small, dense graph of geographic data I find myself adding Discard edges before every visualization box. We already have a drop-down for directed or undirected edges. It probably wouldn't take much to add a "no edges" option as well.
