Comments (14)
Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.
I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:
Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than half
of the in-memory protobuf Graph representation. And probably matches a good
chunk of the memory used by the various hash_map indexes (if we assume 2 *
8 * 1.78 bytes per dense_map entry).The info field doesn't seem super necessary. I'd suggest getting rid of it
if possible, or moving it into Graph-level maps - ie instead of having an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144thanks
-Glenn—
Reply to this email directly or view it on GitHub
#58.
from vg.
Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.
Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).
I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.
On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:
Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume 2
*
8 * 1.78 bytes per dense_map entry).The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...sizeof( ::google::protobuf::internal::MapField<::std::string, ::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144thanks
-Glenn—
Reply to this email directly or view it on GitHub
#58.—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
There is a way to make a dynamic version of xg. It isn't pretty to code but
would theoretically work. We can discuss. This seems like an easy win
though. I am not attached to the info objects.
On Jul 22, 2015 8:04 PM, "Glenn Hickey" [email protected] wrote:
Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected]
wrote:Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2
*
8 * 1.78 bytes per dense_map entry).The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I
don't
really know the intuition behind these info fields...sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144thanks
-Glenn—
Reply to this email directly or view it on GitHub
#58.—
Reply to this email directly or view it on GitHub
#58 (comment).—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
It just occurred to me that there is a simple hack to turn the maps into
pointers: Just wrap the map in a new message... Ex:
message Metadata {
map<string, Info> info = 1;
}
message Node {
string sequence = 1; // sequence
string name = 2; // a name provides an identifier
int64 id = 3; // ids are critical, and unique within a graph
bytes data = 4; // arbitrary data
Metadata metadata = 5;
}
Shaves 80 bytes off node as metadata will be stored as a pointer.
On Wed, Jul 22, 2015 at 2:04 PM, Glenn Hickey [email protected]
wrote:
Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected] wrote:Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2 *
8 * 1.78 bytes per dense_map entry).The info field doesn't seem super necessary. I'd suggest getting rid of
it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I don't
really know the intuition behind these info fields...sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144thanks
-Glenn—
Reply to this email directly or view it on GitHub
#58.—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
Oops, now see that Adam proposed doing exactly this in an email thread
yesterday that I somehow didn't read all of at the time.
On Wed, Jul 22, 2015 at 3:04 PM, Glenn Hickey [email protected]
wrote:
It just occurred to me that there is a simple hack to turn the maps into
pointers: Just wrap the map in a new message... Ex:message Metadata {
map<string, Info> info = 1;
}message Node {
string sequence = 1; // sequence
string name = 2; // a name provides an identifier
int64 id = 3; // ids are critical, and unique within a graph
bytes data = 4; // arbitrary data
Metadata metadata = 5;
}Shaves 80 bytes off node as metadata will be stored as a pointer.
On Wed, Jul 22, 2015 at 2:04 PM, Glenn Hickey [email protected]
wrote:Yeah, surprised me too, but when a 40M vg file was taking ~4G memory just
to load the protobuf something seemed fishy.Not sure if there's another way. Would be nice if the maps were pointers
in memory, but doesn't seem to be the case. Protobuf doc says you can't
make map fields optional or required either (don't know enough about
protobuf to know if that would help anyway).I guess not having an "id" field for every record makes implementing
graph-level maps more of hassle but should still be doable in many cases
(ie a mapping could be its index in the path list (which would need to be
transformed as graphs streamed in and out) etc.). But even adding id
fields seems like a win to get rid of the maps.On Wed, Jul 22, 2015 at 1:50 PM, Erik Garrison [email protected]
wrote:Woah! I'm surprised that they take space at all when they are empty. But
this is in-memory, so I guess the implementation requires it.I think you are right. If everything has an ID then we can selectively
annotate pieces of the graph efficiently. But is there another way?
On Jul 22, 2015 7:47 PM, "Glenn Hickey" [email protected]
wrote:Every VG protobuf message has a "map info" field. On my system, this
field, takes 88 bytes in memory when empty, and amounts to more than
half
of the in-memory protobuf Graph representation. And probably matches a
good
chunk of the memory used by the various hash_map indexes (if we assume
2 *
8 * 1.78 bytes per dense_map entry).The info field doesn't seem super necessary. I'd suggest getting rid
of it
if possible, or moving it into Graph-level maps - ie instead of having
an
info record per Node, have a map in Graph that maps node id's to info
records. This seems to me to be a low-hanging-fruit way to cut memory
usage
by about half. Something I can do, but would like feedback on as I
don't
really know the intuition behind these info fields...sizeof( ::google::protobuf::internal::MapField<::std::string,
::vg::Info,
::google::protobuf::internal::WireFormatLite::TYPE_STRING,
::google::protobuf::internal::WireFormatLite::TYPE_MESSAGE, 0 >) = 88
sizeof(vg::Node) = 152
sizeof(vg::Edge) = 144thanks
-Glenn—
Reply to this email directly or view it on GitHub
#58.—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
I'm not sure I was on that thread!
In any case, I like this solution. The info
is only used in a handful of places and it will be easy to make the switch. Have you had a chance to test it?
So on the wire, protobuf is nice and small because these structures don't end up getting serialized at all. My mistake was to not profile the in-memory performance.
from vg.
Yeah, wrapping it in a message does indeed turn it into a pointer, at least
on my system. This is what happens to the Position inside Mapping, as
well. So easy switch.
The protobuf in-memory performance is a little surprising compared to how
great it does on disk. There seems to be 24 bytes of (constant,
unavoidable) overhead per message object. 8 bytes of which comes from a
bool ( is_default_instance) that they stick on top of the generated class
definitions. If it grouped user-defined bools beside it, it would get
aligned out in many cases (like Edge), but the protobuf compiler aligns
user-defined bools together at the end of the definition no matter what
(at least on my system). Would be nice to have a minimal in-memory
interface perhaps at the cost of I/O performance, but I'd guess they'd
argue that's on the client to deal with.
On Thu, Jul 23, 2015 at 6:18 AM, Erik Garrison [email protected]
wrote:
I'm not sure I was on that thread!
In any case, I like this solution. The info is only used in a handful of
places and it will be easy to make the switch. Have you had a chance to
test it?So on the wire, protobuf is nice and small because these structures don't
end up getting serialized at all. My mistake was to not profile the
in-memory performance.—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
This is not very nice. Switching to another in-memory format might make sense. If I had known about it before, I probably would have used https://capnproto.org/.
I hadn't intended vg::VG
as an interface for very large graphs. Pretty much everything I'm doing with it is chunked or over very small regions of the graph. This said, the memory heaviness certainly can't help performance.
from vg.
@glennhickey have you tried moving vg to using the non-embedded maps? If not then I'll work on this.
from vg.
No, haven't started that yet.
thanks!
On Fri, Jul 24, 2015 at 2:51 PM, Erik Garrison [email protected]
wrote:
@glennhickey https://github.com/glennhickey have you tried moving vg to
using the non-embedded maps? If not then I'll work on this.—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
I should be able to test quickly. There is only one place where the info
field is used. It is for path annotation.
On Jul 24, 2015 9:08 PM, "Glenn Hickey" [email protected] wrote:
No, haven't started that yet.
thanks!On Fri, Jul 24, 2015 at 2:51 PM, Erik Garrison [email protected]
wrote:@glennhickey https://github.com/glennhickey have you tried moving vg
to
using the non-embedded maps? If not then I'll work on this.—
Reply to this email directly or view it on GitHub
#58 (comment).—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
Just now working through Adam's cycle PR. I should get to this next.
Then, xg.
from vg.
Should resolve this.
Before:
➜ test git:(master) ✗ vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg
➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 251080
After:
➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 126672
Nice.
from vg.
Awesome!
On Tue, Aug 11, 2015 at 11:34 AM, Erik Garrison [email protected]
wrote:
Should resolve this.
Before:
➜ test git:(master) ✗ vg construct -r 1mb1kgp/z.fa -v 1mb1kgp/z.vcf.gz -m 50 >z.vg
➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 251080After:
➜ test git:(master) ✗ /usr/bin/time -v vg stats z.vg 2>&1 | grep Maximum
Maximum resident set size (kbytes): 126672Nice.
—
Reply to this email directly or view it on GitHub
#58 (comment).
from vg.
Related Issues (20)
- Bump necessary magic numbers for long-read index types HOT 1
- vg autoindex --workflow mpmap HOT 1
- GAF nodes do not correspond to VCF nodes HOT 3
- Fastq analysis in euka HOT 2
- Protobuf errors mapping HOT 4
- vg augment crashes in combination with giraffe .gam outputs HOT 5
- when i do the code :vg stats -a var1.gam > var1.stats ,show me a bug HOT 2
- vg surject and pack fail on GraphAligner produced .gam HOT 4
- About the usage of vg deconstruct HOT 5
- Right method to annotate genes on pangenome graph HOT 1
- vg call report missing allele HOT 2
- vg call error HOT 2
- How do different methods affect precision? HOT 4
- Giraffe alignment is very slow and produces warning[vg::Watchdog] messages unless rescue is disabled HOT 12
- Giraffe needs to better deal with systems where multiple processes opening the distance index causes slowdowns
- Any option like BWA MEM '-C'?
- vg convert failes while converting gbz to xg with signal 6 error HOT 6
- Release vg v1.60.0
- `vg convert --vg-algorithm` loses start coordinates of paths in its W lines HOT 3
- The Amazing Uninjectable SAM
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from vg.