Every VG protobuf message has a "map<string, Info> info" field. On my system, t

<a class="user-mention notranslate" data-hovercard-type="user" data-hovercard-url="/us

Protobuf "info" map fields take too much memory,about vgteam/vg

Comments (14)

ekg commented on August 20, 2024

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than half
of the in-memory protobuf Graph representation. And probably matches a good
chunk of the memory used by the various hash_map indexes (if we assume 2 *
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of it
if possible, or moving it into Graph-level maps - ie instead of having an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn

—
Reply to this email directly or view it on GitHub
#58.

from vg.

glennhickey commented on August 20, 2024

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume 2
*
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn

—
Reply to this email directly or view it on GitHub
#58.

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg commented on August 20, 2024

There is a way to make a dynamic version of xg. It isn't pretty to code but
would theoretically work. We can discuss. This seems like an easy win
though. I am not attached to the info objects.
On Jul 22, 2015 8:04 PM, "Glenn Hickey" [email protected] wrote:

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected]
wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2
*
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I
don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn

—
Reply to this email directly or view it on GitHub
#58.

—
Reply to this email directly or view it on GitHub
#58 (comment).

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

glennhickey commented on August 20, 2024

It just occurred to me that there is a simple hack to turn the maps into
pointers: Just wrap the map in a new message... Ex:

message Metadata {
map<string, Info> info = 1;
}

message Node {
string sequence = 1; // sequence
string name = 2; // a name provides an identifier
int64 id = 3; // ids are critical, and unique within a graph
bytes data = 4; // arbitrary data
Metadata metadata = 5;
}

Shaves 80 bytes off node as metadata will be stored as a pointer.

On Wed, Jul 22, 2015 at 2:04 PM, Glenn Hickey [email protected]
wrote:

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2 *
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn

—
Reply to this email directly or view it on GitHub
#58.

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

glennhickey commented on August 20, 2024

Oops, now see that Adam proposed doing exactly this in an email thread
yesterday that I somehow didn't read all of at the time.

On Wed, Jul 22, 2015 at 3:04 PM, Glenn Hickey [email protected]
wrote:

It just occurred to me that there is a simple hack to turn the maps into
pointers: Just wrap the map in a new message... Ex:

message Metadata {
map<string, Info> info = 1;
}

message Node {
string sequence = 1; // sequence
string name = 2; // a name provides an identifier
int64 id = 3; // ids are critical, and unique within a graph
bytes data = 4; // arbitrary data
Metadata metadata = 5;
}

Shaves 80 bytes off node as metadata will be stored as a pointer.

On Wed, Jul 22, 2015 at 2:04 PM, Glenn Hickey [email protected]
wrote:

Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.

Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).

I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.

On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:

Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.

I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected]
wrote:

Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2 *
8 * 1.78 bytes per dense_map entry).

The info field doesn't seem super necessary. I'd suggest getting rid
of it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I
don't
really know the intuition behind these info fields...

sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144

thanks
-Glenn

—
Reply to this email directly or view it on GitHub
#58.

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg commented on August 20, 2024

I'm not sure I was on that thread!

In any case, I like this solution. The info is only used in a handful of places and it will be easy to make the switch. Have you had a chance to test it?

So on the wire, protobuf is nice and small because these structures don't end up getting serialized at all. My mistake was to not profile the in-memory performance.

from vg.

glennhickey commented on August 20, 2024

Yeah, wrapping it in a message does indeed turn it into a pointer, at least
on my system. This is what happens to the Position inside Mapping, as
well. So easy switch.

The protobuf in-memory performance is a little surprising compared to how
great it does on disk. There seems to be 24 bytes of (constant,
unavoidable) overhead per message object. 8 bytes of which comes from a
bool ( is_default_instance) that they stick on top of the generated class
definitions. If it grouped user-defined bools beside it, it would get
aligned out in many cases (like Edge), but the protobuf compiler aligns
user-defined bools together at the end of the definition no matter what
(at least on my system). Would be nice to have a minimal in-memory
interface perhaps at the cost of I/O performance, but I'd guess they'd
argue that's on the client to deal with.

On Thu, Jul 23, 2015 at 6:18 AM, Erik Garrison [email protected]
wrote:

I'm not sure I was on that thread!

In any case, I like this solution. The info is only used in a handful of
places and it will be easy to make the switch. Have you had a chance to
test it?

So on the wire, protobuf is nice and small because these structures don't
end up getting serialized at all. My mistake was to not profile the
in-memory performance.

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg commented on August 20, 2024

This is not very nice. Switching to another in-memory format might make sense. If I had known about it before, I probably would have used https://capnproto.org/.

I hadn't intended vg::VG as an interface for very large graphs. Pretty much everything I'm doing with it is chunked or over very small regions of the graph. This said, the memory heaviness certainly can't help performance.

from vg.

ekg commented on August 20, 2024

@glennhickey have you tried moving vg to using the non-embedded maps? If not then I'll work on this.

from vg.

glennhickey commented on August 20, 2024

No, haven't started that yet.
thanks!

On Fri, Jul 24, 2015 at 2:51 PM, Erik Garrison [email protected]
wrote:

@glennhickey https://github.com/glennhickey have you tried moving vg to
using the non-embedded maps? If not then I'll work on this.

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg commented on August 20, 2024

I should be able to test quickly. There is only one place where the info
field is used. It is for path annotation.
On Jul 24, 2015 9:08 PM, "Glenn Hickey" [email protected] wrote:

No, haven't started that yet.
thanks!

On Fri, Jul 24, 2015 at 2:51 PM, Erik Garrison [email protected]
wrote:

@glennhickey https://github.com/glennhickey have you tried moving vg
to
using the non-embedded maps? If not then I'll work on this.

—
Reply to this email directly or view it on GitHub
#58 (comment).

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

ekg commented on August 20, 2024

Just now working through Adam's cycle PR. I should get to this next.

Then, xg.

from vg.

ekg commented on August 20, 2024

Should resolve this.

Before:

➜  test git:(master) ✗ vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg
➜  test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
        Maximum resident set size (kbytes): 251080

After:

➜  test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
        Maximum resident set size (kbytes): 126672

Nice.

from vg.

glennhickey commented on August 20, 2024

Awesome!

On Tue, Aug 11, 2015 at 11:34 AM, Erik Garrison [email protected]
wrote:

Should resolve this.

Before:

➜ test git:(master) ✗ vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg
➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 251080

After:

➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 126672

Nice.

—
Reply to this email directly or view it on GitHub
#58 (comment).

from vg.

Protobuf "info" map fields take too much memory about vg HOT 14 CLOSED

Comments (14)

Related Issues (20)

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent