Comments (8)
@eblondel
great - I would be interested in this. I like the idea of allowing for a choice of parsing method. I can send my contact info via email.
from rsdmx.
Part of the enhancements required here are currently being handled in #36. I welcome any test case (especially on Statistics Canada data) that causes an issue; I will look further into the performance, especially for GenericData.
from rsdmx.
Hi Emmanuel,
I'm currently working on some Statistics Canada SDMX files. Most of them are huge (e.g., between 100MB and 1GB).
Looking specifically at an SDMX data set whose 'Generic' file is 112MB (about 250,000 records; catalogue number 98-313-XCB2011022, via this link), the performance is far worse than what I would naively extrapolate from the figure mentioned in issue #47 (~20 min for a 50MB file).
It took nearly 6 hours to run (whereas I'd expect ~40 min). The code I'm using is fairly simple and follows what was suggested in the wiki and the Issues on this repo. Here it is:
library(rsdmx)
sdmxobj.g <- readSDMX(file = 'Generic_98-313-XCB2011022.xml', isURL = FALSE)
df.g <- as.data.frame(sdmxobj.g)
(I tested it on the same SDMX file with most of the records stripped out, so that the file is ~1MB, and it takes just seconds.)
This is running on a cluster where I have plenty of memory (so I guess that's not the issue), and the CPU runs at 2.1GHz. The rsdmx version is 0.5.5.
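To make the gap concrete, here is a rough back-of-the-envelope sketch in base R. It assumes the timing from issue #47 and the timing above are comparable (different machines and files, so this is only indicative), and the variable names are purely illustrative. It computes the linear extrapolation and the power-law exponent that the observed run time would imply.

```r
# Reference timing quoted from issue #47 and the run reported above.
ref_size_mb <- 50    # file size in issue #47
ref_time_min <- 20   # run time in issue #47
obs_size_mb <- 112   # Generic_98-313-XCB2011022.xml
obs_time_min <- 6 * 60  # "nearly 6 hours"

# Linear extrapolation: time grows in proportion to file size.
linear_min <- ref_time_min * obs_size_mb / ref_size_mb

# If time ~ size^k, the two data points imply this exponent k.
# (Assumption: a simple power law; two points cannot confirm it.)
k <- log(obs_time_min / ref_time_min) / log(obs_size_mb / ref_size_mb)

cat(sprintf("linear estimate: %.1f min\n", linear_min))
cat(sprintf("implied exponent: %.2f\n", k))
```

An exponent well above 1 would suggest that some step (parsing or the data.frame conversion) scales worse than linearly with the number of records, rather than the machine simply being slow.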
Is there something I'm missing?
Thanks.
from rsdmx.
Hello @davidchampredon, I will try to inspect this request and compare it with other, smaller data files to see what the issue could be.
Note: at the moment rsdmx handles the XML as a tree in R, which means that we consume twice the memory once it is converted into a data.frame. Another option is to let rsdmx use the Simple API for XML (SAX), to avoid storing the XML source as a tree within R.
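As an illustration of the SAX idea, here is a minimal sketch using `xmlEventParse()` from the XML package. It is not how rsdmx is implemented: the function name `parse_obs_sax` and the tiny sample document (a simplified, namespace-free imitation of an SDMX 2.1 generic message) are assumptions made for the example. The handler picks values out of element attributes as the parser streams through the document, so no XML tree is kept in memory.

```r
library(XML)

# Simplified stand-in for a generic data message (assumption: real
# SDMX-ML files use namespaces and carry more dimensions).
xml_txt <- '<DataSet>
  <Series>
    <SeriesKey><Value id="GEO" value="CA"/></SeriesKey>
    <Obs><ObsDimension value="2010"/><ObsValue value="1.5"/></Obs>
    <Obs><ObsDimension value="2011"/><ObsValue value="2.5"/></Obs>
  </Series>
</DataSet>'

parse_obs_sax <- function(xml_text) {
  geo <- character(); time <- character(); value <- character()
  current_geo <- NA_character_
  current_time <- NA_character_

  # Drop a namespace prefix in case element names arrive qualified.
  local_name <- function(x) sub("^.*:", "", x)

  handlers <- list(
    startElement = function(name, attrs, ...) {
      tag <- local_name(name)
      if (tag == "Value") {
        current_geo <<- attrs[["value"]]       # series-level key
      } else if (tag == "ObsDimension") {
        current_time <<- attrs[["value"]]      # observation period
      } else if (tag == "ObsValue") {
        # One output row per observation value encountered.
        n <- length(value) + 1L
        geo[n] <<- current_geo
        time[n] <<- current_time
        value[n] <<- attrs[["value"]]
      }
    }
  )

  xmlEventParse(xml_text, handlers = handlers, asText = TRUE)
  data.frame(geo = geo, time = time, value = as.numeric(value),
             stringsAsFactors = FALSE)
}

df <- parse_obs_sax(xml_txt)
print(df)
```

For a file on disk one would pass the path and drop `asText = TRUE`. A real handler would also need to cover attributes, multiple dimensions and the other SDMX message types, and would likely preallocate its output.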
from rsdmx.
Hi @eblondel, thanks a lot for all your work on rsdmx. I've come across this issue as well, as I often work with medium to large SDMX files.
I’ve had a look at addressing this via the readsdmx package where SDMX-ML data files are parsed via the RapidXml C++ library. It also solves the issue raised in #137.
Perhaps we could look into importing this into the rsdmx package? For the moment it doesn't have the same coverage of SDMX data messages (I plan to add these later on), but the most resource-intensive messages (e.g. generic) are already included.
from rsdmx.
Hi @mdequeljoe, I would be keen to put effort into alternate parsing for larger files. I made some successful attempts in a staging area, but unfortunately I haven't yet had the opportunity to dedicate time to this, for lack of resources. Hopefully some institution implementing SDMX standards could sponsor rsdmx development for that in the near future.
I will look further at what you did with the RapidXML wrapper. This is interesting indeed and could be a nice extension to rsdmx. However, we need to carefully check the RapidXML license; I'm not sure it is compatible with the current rsdmx license. There might be other parsing alternatives that I need to evaluate as candidates for integration into rsdmx.
Best,
Emmanuel
from rsdmx.
@mdequeljoe I will look further at what you did with the RapidXML wrapper. This is interesting indeed and could be a nice extension to rsdmx. However, we need to carefully check the RapidXML license; I'm not sure it is compatible with the current rsdmx license.
from rsdmx.
@mdequeljoe I've seen that you release it under GPL. This could fit into rsdmx. If you are interested, let's set up an rsdmx branch to test the integration.
I had drafted a skeleton in the past to introduce the possibility of choosing the parsing method in rsdmx (I was trying the SAX approach). I could introduce the same approach, and in that case offer the possibility to select a "rapidxml" parsing method that would point to the rapidxml handlers you developed. And if it's fully implemented for all rsdmx SDMX-ML objects, we may even consider setting it as the default parsing method. If you wish, we could set up a phone call to discuss how to integrate this into the rsdmx object-oriented model.
Let me know
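For illustration only, a selectable parsing method could be sketched in base R as below. None of these names exist in rsdmx: `parsers`, `read_sdmx_with` and the `method` values are hypothetical, and the backends are stubs that only report which method was chosen instead of parsing anything.

```r
# Hypothetical registry of parsing backends (stubs, not rsdmx API).
# Each backend takes a file path and would return the parsed object.
parsers <- list(
  dom      = function(path) list(method = "dom", file = path),
  sax      = function(path) list(method = "sax", file = path),
  rapidxml = function(path) list(method = "rapidxml", file = path)
)

# Hypothetical reader: validates the requested method and dispatches.
# The first entry of `method` ("dom") acts as the default.
read_sdmx_with <- function(path, method = c("dom", "sax", "rapidxml")) {
  method <- match.arg(method)
  parsers[[method]](path)
}

res_default <- read_sdmx_with("Generic_98-313-XCB2011022.xml")
res_rapid <- read_sdmx_with("Generic_98-313-XCB2011022.xml",
                            method = "rapidxml")
print(res_default$method)
print(res_rapid$method)
```

With this shape, changing the default later only means reordering the `method` vector, and an unknown method fails early through `match.arg()`.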
from rsdmx.