lukashuebner / hyperphylo

Judicious Graph Partitioning

License: GNU General Public License v3.0

Python 46.24% C++ 45.60% CMake 1.00% Shell 4.93% R 2.23%
partitioning partitioning-algorithms graphs loadbalancing graph-partitioning

hyperphylo's People

Contributors

adrianzap, lukashuebner, poettig

hyperphylo's Issues

Use KaHyPar's hypergraph DS

By using KaHyPar's hypergraph DS we can iterate over nodes/edges/incidentEdges/pins etc. Furthermore, it will make things easier if we use HyperPhylo as an initial partitioner.
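
A minimal Python sketch of the iteration interface such a hypergraph data structure provides (the names mirror the operations listed above; this is only an illustration, not KaHyPar's actual C++ API):

```python
# Sketch of a KaHyPar-style hypergraph DS iteration interface.
# Names (nodes, edges, incident_edges, pins) mirror the operations mentioned
# above; this is an illustration, not KaHyPar's actual C++ API.

class Hypergraph:
    def __init__(self, num_nodes, hyperedges):
        # hyperedges: list of lists of node ids (the pins of each hyperedge)
        self._pins = [list(e) for e in hyperedges]
        self._incident = [[] for _ in range(num_nodes)]
        for e_id, pins in enumerate(self._pins):
            for v in pins:
                self._incident[v].append(e_id)

    def nodes(self):
        return range(len(self._incident))

    def edges(self):
        return range(len(self._pins))

    def incident_edges(self, v):
        return self._incident[v]

    def pins(self, e):
        return self._pins[e]


hg = Hypergraph(4, [[0, 1, 2], [2, 3]])
for v in hg.nodes():
    print(v, list(hg.incident_edges(v)))
for e in hg.edges():
    print(e, list(hg.pins(e)))
```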

Work through Sebastian's answer to the dual graph criticism

Hi,

I just had a look at your examples.

Lukas Hübner wrote:
> Hello all,
>
> we thought about the dual graph approach. It seems to minimize the
> number of replicated vertices instead of the number of split repeat
> classes. Are these two metrics somehow related to one another?
> We came up with an example in which we are trying to demonstrate how
> reducing the cut size in the dual graph may result in a sub-optimal
> cut in the original hypergraph (a). Numbers and "n" next to a
> hyperedge denote an edge with the respective weight or n distinct
> hyperedges spanning the same vertices.
I think the dual hypergraph you drew for example (a) is slightly wrong. I've attached a corrected
version to this mail.
Nevertheless, I currently think that it is necessary to use uniformly weighted hyperedges in the dual
hypergraph to get correct results for the original hypergraph (otherwise the min cut in the dual hypergraph would
be indeed the one you drew).
If two or more hyperedges in the original hypergraph are connected by more than one hypernode,
this would lead to hyperedges with a weight > 1 in the dual hypergraph (if we collapse parallel hyperedges into a weighted one).
Since our initial goal is to balance hyperedges, this information might not be necessary in the dual hypergraph (since it only encodes
how many vertices will be cut after we transform the solution of the dual problem back to the original problem).

In the example attached to this email, it would then be necessary to decide to which block the upper three vertices of hyperedge a
(or the lower three in case of the other cut) should be assigned such that the cut in our original hypergraph is minimized.
This should then lead to the ideal cut you also drew in your picture.
> We are unsure how the balancing constraint of the original hypergraph
> is transferred to the balancing constraint in the dual graph.
If, in the original hypergraph, our balance constraint is something like (1+e) |E|/k, then the dual hypergraph will have a
balance constraint of (1+e) |V|/k --> by balancing the nodes in the dual hypergraph, we balance the edges in
the original hypergraph.
> A scenario like the one shown in (b) may lead to a sub-optimal balancing
> of the cut in the original graph if the respective cut in the dual
> graph is allowed.
Using the above balance constraint, a cut like the one you drew in the dual hypergraph would not be feasible,
since it violates the balance constraint.

Best
Sebastian
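
For our own reference, a minimal sketch of the dual construction discussed in this thread, under the assumptions stated above (all dual hyperedges keep weight 1 and parallel dual hyperedges are not collapsed into weighted ones); this is an illustration, not Sebastian's implementation:

```python
# Sketch: build the dual hypergraph with uniformly weighted (weight 1)
# hyperedges. Original hyperedges become dual vertices; every original
# vertex v becomes a dual hyperedge spanning the hyperedges incident to v.
# Parallel dual hyperedges are deliberately NOT collapsed into weighted ones.

def dual_hypergraph(hyperedges, num_vertices):
    # hyperedges: list of lists of vertex ids (pins) of the original hypergraph
    incident = [[] for _ in range(num_vertices)]
    for e_id, pins in enumerate(hyperedges):
        for v in pins:
            incident[v].append(e_id)
    # Each original vertex with at least one incident hyperedge yields a
    # dual hyperedge of weight 1.
    dual_edges = [sorted(inc) for inc in incident if inc]
    dual_edge_weights = [1] * len(dual_edges)
    return dual_edges, dual_edge_weights

# Balance mapping: (1+e)*|E|/k on the original hyperedges corresponds to
# (1+e)*|V_dual|/k on the dual vertices, since |V_dual| = |E|.
```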

Initial solution for KaHyPar

It might be that Sebastian's partitioner will require a good initial solution to give good results. Think about ways to compute a good initial solution, ideally with a randomized component that generates alternative good initial solutions, so that the partitioner can start optimizing from different good starting points and is more likely to escape local optima.

Think about an algorithm for optimally assigning dangling vertices.

So far, I have found out that for partitioning into k=2 blocks (and using the traditional balance definition on vertices), the assignment can be solved optimally using a matching-based algorithm. For k>2 there exists
a solution (I think for the cut metric) via solving a min-cost max-flow problem. The relevant references are discussed on page 7 in [1]. If it turns out that the min-cost max-flow approach does not work in our case,
we could still use the matching-based approach and compute a k-way partition via KaHyPar in recursive bisection mode.
[1] https://pdfs.semanticscholar.org/e199/8253f365cb55bdc97c0f759d9aea5b17730d.pdf
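
Until the matching/flow-based approaches are evaluated, a purely hypothetical greedy baseline could assign each dangling vertex to the block that already contains most of its incident hyperedges. This is explicitly not the optimal algorithm from [1], just a placeholder sketch:

```python
from collections import Counter

# Hypothetical greedy baseline: assign each dangling vertex to the block in
# which most of its incident hyperedges already lie. This is NOT the optimal
# matching / min-cost max-flow approach from the reference above.

def assign_dangling(dangling, incident_edges, edge_block, k):
    # dangling: iterable of dangling vertex ids
    # incident_edges: dict vertex -> list of incident hyperedge ids
    # edge_block: dict hyperedge id -> block id (from the dual partition)
    assignment = {}
    for v in dangling:
        counts = Counter(edge_block[e] for e in incident_edges[v]
                         if e in edge_block)
        # fall back to block 0 (arbitrary) if no incident edge is assigned yet
        assignment[v] = counts.most_common(1)[0][0] if counts else 0
    return assignment
```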

Hypergraph representation

Extend with a note that the balance constraint is not necessary if hyperedges with only a single vertex are allowed.
Extend with a note that, in case hyperedges of length 1 are not allowed, the balance constraint has to be adjusted, since otherwise vertices whose edge is split are weighted the same as vertices without any edge at all.
Finally, send the version to the mailing list.

Think about representations for E and S

Depending on the operations we actually need, we could represent E and S either as

  • bitvectors of size |# hyperedges| bits
  • by saving the incident hyperedges directly

Possible bit vector implementations:
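
Independent of the concrete bitvector implementation we pick, the two candidate representations could look roughly like this (Python ints stand in for fixed-size bitvectors; names are placeholders, just a sketch):

```python
# Two candidate representations of an E/S element. A Python int serves as the
# bitvector here; in C++ this would be a fixed-size bitset of |#hyperedges| bits.

NUM_HYPEREDGES = 8

# (1) Bitvector of size |#hyperedges| bits: bit i set <=> hyperedge i is contained.
def to_bitvector(hyperedge_ids):
    bv = 0
    for e in hyperedge_ids:
        bv |= 1 << e
    return bv

# (2) Saving the incident hyperedges directly as a sorted list.
def to_incidence_list(hyperedge_ids):
    return sorted(hyperedge_ids)

s_elem = {1, 3, 4}
print(format(to_bitvector(s_elem), f"0{NUM_HYPEREDGES}b"))  # 00011010
print(to_incidence_list(s_elem))                            # [1, 3, 4]
```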

Use a different, possibly inexact, hypergraph formulation of the problem that can be more easily handled by Sebastian's framework.

A few notes on that:
- I got in touch with the developer of PaToH and it turns out that support for edge balancing is only experimental, likely does not work (i.e., produces imbalanced solutions) for non-powers of 2,
and only balances internal edges. Internal edge balancing is something that's already possible in a feature branch of KaHyPar.
- I talked to Prof. Sanders today and he proposed a simple formulation that should give reasonably balanced solutions for k=2:
He proposed to use the standard hypergraph model (sites = vertices, repeat classes = hyperedges) and then assign each vertex its degree as weight (see the sketch after this list).
While this does not exactly model our problem, it produced reasonably balanced bisections in my initial tests. However, this approach does not work well for k > 2.
- Another ad hoc solution for assigning the 'dangling' vertices is the following: after partitioning the dual hypergraph, all non-dangling vertices are assigned to their
respective block in the original hypergraph and then marked as fixed. Then, KaHyPar is called again in fixed vertex partitioning mode and tries to find an assignment
for the remaining vertices such that the cut is minimized. The only problem I currently see is that we might then end up with imbalanced solutions.
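
Regarding the degree-as-weight formulation mentioned in the second note above, a minimal sketch of the weighting step (only an illustration of the idea under the standard hypergraph model, not a full pipeline):

```python
# Sketch: standard hypergraph model (sites = vertices, repeat classes =
# hyperedges), with each vertex weighted by its degree.

def degree_weights(hyperedges, num_vertices):
    # hyperedges: list of lists of vertex (site) ids
    weights = [0] * num_vertices
    for pins in hyperedges:
        for v in pins:
            weights[v] += 1
    return weights

# These weights would then be written into the partitioner's input file so
# that balancing vertex weights approximates balancing per-block work.
print(degree_weights([[0, 1, 2], [2, 3]], 4))  # [1, 1, 2, 1]
```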

Review email to Sabine

Hi,
we do see the problem and think you could fix that by adding the number of edge-cuts implied by the partitioning process to the numerator of the bound.
We found another small problem with not using a multi-hypergraph formulation, since interior nodes of the tree can contain repeat classes consisting of the same sites.
It should not be a problem to extend the formulation to a multi-hypergraph, though.
Best wishes
Team 1

Split one partition naively

We suggest that you now focus on splitting one single MSA partition into 2 or, more generally, k equal parts.
Start with the most naive approach, i.e., splitting the partition into k sub-partitions with the same number of sites each.
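
A minimal sketch of this naive baseline, assuming sites are identified by consecutive indices:

```python
# Naive baseline: split one MSA partition into k sub-partitions with (almost)
# the same number of sites each, ignoring repeats entirely.

def naive_split(num_sites, k):
    base, rest = divmod(num_sites, k)
    blocks, start = [], 0
    for i in range(k):
        size = base + (1 if i < rest else 0)
        blocks.append(range(start, start + size))
        start += size
    return blocks

print([list(b) for b in naive_split(10, 3)])  # [[0..3], [4..6], [7..9]]
```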

Consider Alexis's two-step approach

If the general problem formulation is too complex to solve via hypergraph partitioning, use a two-step approach for a practical solution/implementation:
a) use the algorithm of https://link.springer.com/chapter/10.1007/978-3-662-44753-6_16, but using the actual computational cost including repeats as weights (not the partition length; Benoit can help you with this), to get the partition distribution;
b) use hypergraph partitioning to optimally assign the sites of partitions that need to be split up to cores.

Only if the direct approach does not work.

Check Sebastian's alternative idea

Yesterday evening I thought about solving the problem via partitioning of the
dual hypergraph.

So far I have only looked into the first one. At first glance, it looks as if
one would arrive at your graph construction for the vertex separator
from the hypergraph with the following transformation:

Hypergraph ---> Dual Hypergraph ---> Net Intersection Graph
(clique-net transformation of the dual hypergraph)
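
A minimal sketch of the last step of this chain, assuming the clique-net transformation of the dual hypergraph simply connects two original hyperedges whenever they share a vertex (i.e., the net intersection graph of the original hypergraph):

```python
from itertools import combinations

# Sketch of the chain: hypergraph -> dual hypergraph -> net intersection
# graph, where the last step is the clique-net transformation of the dual
# hypergraph (two original hyperedges become adjacent iff they share at
# least one vertex).

def net_intersection_graph(hyperedges):
    # hyperedges: list of sets of vertex ids of the original hypergraph
    n = len(hyperedges)
    adjacency = {e: set() for e in range(n)}
    for a, b in combinations(range(n), 2):
        if hyperedges[a] & hyperedges[b]:
            adjacency[a].add(b)
            adjacency[b].add(a)
    return adjacency

print(net_intersection_graph([{0, 1, 2}, {2, 3}, {4}]))
# {0: {1}, 1: {0}, 2: set()}
```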

Verify Sebastian's objection

I thought a little bit about the problem definition of team 2 and the complexity/capacity constraint that is used to balance the edges.
Currently, I see the following problem:

The idea is based on [1]. However, in their problem, the authors bound the size of each block (i.e., in terms of the number of edges) by some constant C.
While this works, I think it does not generalize to the bound of (1+e) (|E| / k) proposed by team 2.
This upper bound might not be feasible, because as soon as a hyperedge is cut (i.e., connects more than one block), the weight of the edge is accounted for multiple times.

For example, if an edge connects 3 blocks, then its weight is added to each of the three blocks. Thus the weight of all blocks (which in the formulation of
team 2 corresponds to the sum of the weights of the edges that connect the block) is very likely to be larger than (1+e) (|E| / k).

Please correct me if I'm wrong!
So in order to use the proposed approach, we would need to come up with a reasonable upper bound for the block weights.

Test in practice on empirical test datasets (let Benoit know if you need more) whether it matters and how many of those dangling vertices will actually occur.

This is a good point. Today, I've played around a little bit with a couple of my instances. In my preliminary experiments, the number of dangling vertices was on the order of thousands for hypergraphs with
~100000 vertices and relatively large hyperedges. Since I assume this depends on the structure of the hypergraphs, further experiments using your instances are certainly necessary.

When working with the dual approach, I think we have to keep the following things in mind:
1.) In order to produce correct results, the input hypergraph has to have uniform hypernode and hyperedge weights, otherwise the assignment of the dangling vertices might not lead to the correct solution.
2.) Hyperedges of the dual hypergraph should also have a weight of 1. In the example Lukas posted to the mailing list on Tuesday, the hyperedges were weighted, because they collapsed parallel hyperedges into
one. However, for partitioning the dual hypergraph, this does not make sense (i.e., this information is actually misleading for our problem formulation) and leads to suboptimal cuts (as pointed out in the example).
3.) Parallel hyperedges should be removed from the dual hypergraph for the same reason as in 2.
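
These three points translate into a small preprocessing step before partitioning the dual hypergraph; a sketch under exactly those assumptions:

```python
# Preprocessing sketch for points 1.) to 3.) above: force uniform (unit)
# hyperedge weights and drop parallel hyperedges from the dual hypergraph
# before handing it to the partitioner.

def preprocess_dual(dual_edges):
    # dual_edges: list of lists of dual-vertex ids (pins)
    seen, deduplicated = set(), []
    for pins in dual_edges:
        key = tuple(sorted(pins))
        if key not in seen:                     # remove parallel hyperedges (3.)
            seen.add(key)
            deduplicated.append(list(key))
    edge_weights = [1] * len(deduplicated)      # uniform weight 1 (2.)
    return deduplicated, edge_weights
```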

Compare quality of naive vs clever splitting

Then you should compare the quality of the naive and the more advanced approaches' results (load balancing, repeats losses, worst-core RCC).

I would be interested in knowing how "bad" the naive solution is and how much better the more advanced solutions are.
I would also be interested in how fast the hypergraph tools can run on our problems.
That's also a good way of checking that your approaches work!
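
For the load-balancing part of this comparison, a small helper like the following could be used ("weight" stands for whatever per-core cost measure we compare on; just a sketch):

```python
# Quantify load balance as max block weight divided by the average block
# weight (1.0 = perfectly balanced). "Weight" is whatever per-core cost
# measure we compare on; this is only a sketch of the metric.

def imbalance(block_weights):
    avg = sum(block_weights) / len(block_weights)
    return max(block_weights) / avg

print(imbalance([10, 10, 10]))  # 1.0
print(imbalance([18, 6, 6]))    # 1.8
```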

Upper Bound Sebastian

Hi,

maybe the problem to come up with reasonable upper bounds for the block weights isn't that complicated.
If we look at the overall problem and the 2-phase approach, then we will first use the bin
packing algorithm to distribute the first couple of MSA partitions. Whenever it then decides
to split an MSA partition, it knows the remaining capacities of the CPUs it wants to assign
the split partition to. Thus, if we model sites without repeats as hypernodes incident to a
single-node hyperedge, then we actually get the upper bounds on the block weights from the
bin packing algorithm and could just use these.

What do you think?

Best
Sebastian
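
A sketch of how these upper bounds could be picked up from the bin packing phase (pure capacity bookkeeping; remaining_capacity and split_targets are hypothetical names):

```python
# Sketch: when the bin packing phase decides to split an MSA partition, the
# remaining capacities of the target CPUs directly give per-block upper
# bounds for the hypergraph partitioner. Sites without repeats are modelled
# as hypernodes incident to a single-node hyperedge, so they count towards
# these bounds as well. Variable names here are placeholders.

def block_upper_bounds(remaining_capacity, split_targets):
    # remaining_capacity: dict cpu_id -> capacity still free on that CPU
    # split_targets: list of cpu_ids the split partition is assigned to
    return [remaining_capacity[cpu] for cpu in split_targets]

bounds = block_upper_bounds({0: 120, 1: 95, 2: 300}, [0, 1])
print(bounds)  # [120, 95] -- used instead of the generic (1+e)*|E|/k bound
```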

Test the approach with PaToH

Could be problematic, since PaToH apparently computes incorrect results:

I again played around with PaToH and used their edge balancing feature [2]. Unfortunately, both the solution quality AND the reported block sizes (in terms of
hyperedges per block) seem to be wrong. I double-checked the PaToH result both with my tool and with hMetis (and both tools report the same results, which
are significantly different from the results PaToH reports for its partition).
I'm on vacation tomorrow, but I'll contact the main author/developer of PaToH on Monday and ask whether I'm doing something wrong or edge-balancing
is actually broken in PaToH.

If there is no reply, ask Sebastian!

Fix judicious partitioning runtime/memory consumption

Find a way to avoid the excessive memory consumption of the T set so that we can use it on non-tiny datasets.

Possible strategies:
  • find another algorithm for the set cover problem
  • just calculate a part of T --> check the paper for this (might not be possible)
  • don't calculate T at all but use some implicit representation of it (probably not feasible)
  • use hypergraph coarsening (probably memory consumption will still be far too high)

Use (bipartite) Graph as main set cover DS

Use a (bipartite) adjacency array data structure to represent the bipartite graph G = (E ∪ S).

Each node of the graph should contain:

  • the bitvector representing the E/S element
  • start index into the edge array
  • size / degree

Using this DS we can:

  • sort S elements by degree in the greedy algorithm
  • remove S_j - P edges easily (by supporting edge deletions as I explained at the whiteboard)
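
A minimal sketch of this adjacency array layout and the edge-deletion trick (names and details are illustrative only):

```python
# Sketch of the (bipartite) adjacency array described above. Each node stores
# the bitvector of its E/S element, its start index into the shared edge
# array, and its current degree; deleting an edge swaps it behind the node's
# valid range and decrements the degree.

class BipartiteAdjacencyArray:
    def __init__(self, nodes):
        # nodes: list of (bitvector, neighbour_list) pairs for E ∪ S
        self.bitvector, self.start, self.degree = [], [], []
        self.edges = []
        for bv, neighbours in nodes:
            self.bitvector.append(bv)
            self.start.append(len(self.edges))
            self.degree.append(len(neighbours))
            self.edges.extend(neighbours)

    def neighbours(self, v):
        s = self.start[v]
        return self.edges[s:s + self.degree[v]]

    def delete_edge(self, v, target):
        # swap the deleted edge to the end of v's range and shrink the degree
        s, d = self.start[v], self.degree[v]
        i = self.edges.index(target, s, s + d)
        self.edges[i], self.edges[s + d - 1] = self.edges[s + d - 1], self.edges[i]
        self.degree[v] -= 1

    def by_degree_desc(self, node_ids):
        # e.g. sort the S elements by degree for the greedy set cover step
        return sorted(node_ids, key=lambda v: self.degree[v], reverse=True)
```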
