Giter Site home page Giter Site logo

Comments (14)

ekg avatar ekg commented on August 20, 2024

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than half
of the in-memory protobuf Graph representation. And probably matches a good
chunk of the memory used by the various hash_map indexes (if we assume 2 *
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of it
if possible, or moving it into Graph-level maps - ie instead of having an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn


Reply to this email directly or view it on GitHub
#58.

from vg.

glennhickey avatar glennhickey commented on August 20, 2024

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume 2
*
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn


Reply to this email directly or view it on GitHub
#58.


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg avatar ekg commented on August 20, 2024

There is a way to make a dynamic version of xg. It isn't pretty to code but
would theoretically work. We can discuss. This seems like an easy win
though. I am not attached to the info objects.
On Jul 22, 2015 8:04 PM, "Glenn Hickey" [email protected] wrote:

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected]
wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2
*
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I
don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn


Reply to this email directly or view it on GitHub
#58.


Reply to this email directly or view it on GitHub
#58 (comment).


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

glennhickey avatar glennhickey commented on August 20, 2024

It just occurred to me that there is a simple hack to turn the maps into
pointers: Just wrap the map in a new message... Ex:

message Metadata {
map<string, Info> info = 1;
}

message Node {
string sequence = 1; // sequence
string name = 2; // a name provides an identifier
int64 id = 3; // ids are critical, and unique within a graph
bytes data = 4; // arbitrary data
Metadata metadata = 5;
}

Shaves 80 bytes off node as metadata will be stored as a pointer.

On Wed, Jul 22, 2015 at 2:04 PM, Glenn Hickey [email protected]
wrote:

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2 *
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn


Reply to this email directly or view it on GitHub
#58.


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

glennhickey avatar glennhickey commented on August 20, 2024

Oops, now see that Adam proposed doing exactly this in an email thread
yesterday that I somehow didn't read all of at the time.

On Wed, Jul 22, 2015 at 3:04 PM, Glenn Hickey [email protected]
wrote:

It just occurred to me that there is a simple hack to turn the maps into
pointers: Just wrap the map in a new message... Ex:

message Metadata {
map<string, Info> info = 1;
}

message Node {
string sequence = 1; // sequence
string name = 2; // a name provides an identifier
int64 id = 3; // ids are critical, and unique within a graph
bytes data = 4; // arbitrary data
Metadata metadata = 5;
}

Shaves 80 bytes off node as metadata will be stored as a pointer.

On Wed, Jul 22, 2015 at 2:04 PM, Glenn Hickey [email protected]
wrote:

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected]
wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2 *
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid
of it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I
don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn


Reply to this email directly or view it on GitHub
#58.


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg avatar ekg commented on August 20, 2024

I'm not sure I was on that thread!

In any case, I like this solution. The info is only used in a handful of places and it will be easy to make the switch. Have you had a chance to test it?

So on the wire, protobuf is nice and small because these structures don't end up getting serialized at all. My mistake was to not profile the in-memory performance.

from vg.

glennhickey avatar glennhickey commented on August 20, 2024

Yeah, wrapping it in a message does indeed turn it into a pointer, at least
on my system. This is what happens to the Position inside Mapping, as
well. So easy switch.

The protobuf in-memory performance is a little surprising compared to how
great it does on disk. There seems to be 24 bytes of (constant,
unavoidable) overhead per message object. 8 bytes of which comes from a
bool ( is_default_instance) that they stick on top of the generated class
definitions. If it grouped user-defined bools beside it, it would get
aligned out in many cases (like Edge), but the protobuf compiler aligns
user-defined bools together at the end of the definition no matter what
(at least on my system). Would be nice to have a minimal in-memory
interface perhaps at the cost of I/O performance, but I'd guess they'd
argue that's on the client to deal with.

On Thu, Jul 23, 2015 at 6:18 AM, Erik Garrison [email protected]
wrote:

I'm not sure I was on that thread!

In any case, I like this solution. The info is only used in a handful of
places and it will be easy to make the switch. Have you had a chance to
test it?

So on the wire, protobuf is nice and small because these structures don't
end up getting serialized at all. My mistake was to not profile the
in-memory performance.


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg avatar ekg commented on August 20, 2024

This is not very nice. Switching to another in-memory format might make sense. If I had known about it before, I probably would have used https://capnproto.org/.

I hadn't intended vg::VG as an interface for very large graphs. Pretty much everything I'm doing with it is chunked or over very small regions of the graph. This said, the memory heaviness certainly can't help performance.

from vg.

ekg avatar ekg commented on August 20, 2024

@glennhickey have you tried moving vg to using the non-embedded maps? If not then I'll work on this.

from vg.

glennhickey avatar glennhickey commented on August 20, 2024

No, haven't started that yet.
thanks!

On Fri, Jul 24, 2015 at 2:51 PM, Erik Garrison [email protected]
wrote:

@glennhickey https://github.com/glennhickey have you tried moving vg to
using the non-embedded maps? If not then I'll work on this.


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg avatar ekg commented on August 20, 2024

I should be able to test quickly. There is only one place where the info
field is used. It is for path annotation.
On Jul 24, 2015 9:08 PM, "Glenn Hickey" [email protected] wrote:

No, haven't started that yet.
thanks!

On Fri, Jul 24, 2015 at 2:51 PM, Erik Garrison [email protected]
wrote:

@glennhickey https://github.com/glennhickey have you tried moving vg
to
using the non-embedded maps? If not then I'll work on this.


Reply to this email directly or view it on GitHub
#58 (comment).


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg avatar ekg commented on August 20, 2024

Just now working through Adam's cycle PR. I should get to this next.

Then, xg.

from vg.

ekg avatar ekg commented on August 20, 2024

Should resolve this.

Before:

test git:(master) ✗ vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg
➜  test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
        Maximum resident set size (kbytes): 251080

After:

test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
        Maximum resident set size (kbytes): 126672

Nice.

from vg.

glennhickey avatar glennhickey commented on August 20, 2024

Awesome!

On Tue, Aug 11, 2015 at 11:34 AM, Erik Garrison [email protected]
wrote:

Should resolve this.

Before:

➜ test git:(master) ✗ vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg
➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 251080

After:

➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 126672

Nice.


Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.