Comments (11)
Thanks @khughitt. I need to generalise it so that it works for any Parquet data.
from dataframe-go.
Parquet importing is now supported (experimental): @CeciliaCoelho @khughitt @space55
from dataframe-go.
The function returns a *dataframe.DataFrame object. You can see examples in the README.
However, looking at the code, it doesn't load the data into the DataFrame efficiently. I need to understand the Parquet package better before I can improve it.
from dataframe-go.
In the past I've had lots of people asking for exporting to Parquet, which I implemented.
You're the first to ask about importing, but I had put it on my TODO list in May.
I won't have time to implement it soon. However, you can submit a PR.
from dataframe-go.
Hmmm. I noticed in my TODO list (#17), there had been 3 thumbs up for that request.
from dataframe-go.
In case it helps, here is some code I wrote to read a parquet file into a DataFrame that you may be able to adapt in the meantime:
package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

// loadEntriesParquet reads the "entries" parquet file into a DataFrame.
// It returns nil if the file cannot be opened or the reader cannot be created.
func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)
	if err != nil {
		log.Println("Can't open file", err)
		return nil
	}
	defer entriesFr.Close()

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))
	if err != nil {
		log.Println("Unable to create parquet reader", err)
		return nil
	}
	defer entriesPr.ReadStop()

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form (read errors are ignored here for brevity)
	var paths, titles, bodies, accesscounts, accessdates, createddates, deadlines, priorities, archived []interface{}
	paths, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.path", numRows)
	titles, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.title", numRows)
	bodies, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.body", numRows)
	accesscounts, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.accesscount", numRows)
	accessdates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.lastaccess", numRows)
	createddates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.datecreated", numRows)
	deadlines, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.deadline", numRows)
	priorities, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.priority", numRows)
	archived, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.archived", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("path", nil, paths...),
		dataframe.NewSeriesString("title", nil, titles...),
		dataframe.NewSeriesString("body", nil, bodies...),
		dataframe.NewSeriesInt64("accesscount", nil, accesscounts...),
		dataframe.NewSeriesInt64("lastaccess", nil, accessdates...),
		dataframe.NewSeriesInt64("datecreated", nil, createddates...),
		dataframe.NewSeriesInt64("deadline", nil, deadlines...),
		dataframe.NewSeriesInt64("priority", nil, priorities...),
		dataframe.NewSeriesInt64("archived", nil, archived...),
	)

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}
	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}
A few comments:
- I can't claim that it is the most efficient approach, and feedback is welcome, but it should at least do the job.
- The function loads a parquet dataframe containing "entries", with an expected format. I left a lot of the file-specific logic in there to provide examples of how to handle different variable types.
- I also left some logic at the bottom that sorts the dataframe once it's been loaded, in case that is useful.
Cheers.
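For anyone trying to generalise the snippet above: the columns come back as []interface{}, so one way to pick a Series type per column is a type switch over the values. The following is only a minimal sketch using the standard library. The names columnKind and inferKind are hypothetical, and the set of Go types listed in the switch is an assumption about what the parquet reader hands back, not something taken from either library.

```go
package main

import "fmt"

// columnKind is a hypothetical label for the Series type a column could map to.
type columnKind string

const (
	kindString  columnKind = "string"
	kindInt64   columnKind = "int64"
	kindFloat64 columnKind = "float64"
	kindOther   columnKind = "other"   // mixed or unsupported value types
	kindUnknown columnKind = "unknown" // empty column, or only nil values
)

// inferKind inspects every non-nil value in a column and reports one kind
// for the whole column. The value types handled here are an assumption.
func inferKind(values []interface{}) columnKind {
	kind := kindUnknown
	for _, v := range values {
		var k columnKind
		switch v.(type) {
		case nil:
			continue // nil values carry no type information
		case string:
			k = kindString
		case int32, int64:
			k = kindInt64
		case float32, float64:
			k = kindFloat64
		default:
			return kindOther
		}
		if kind == kindUnknown {
			kind = k
		} else if kind != k {
			return kindOther
		}
	}
	return kind
}

func main() {
	names := []interface{}{"a", nil, "b"}
	spill := []interface{}{int64(1), int64(2)}
	fmt.Println(inferKind(names), inferKind(spill))
}
```

The result could then decide whether a column is passed to NewSeriesString or NewSeriesInt64, as in the code above, instead of hard-coding the type per column.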
from dataframe-go.
@pjebs Did you manage to generalise it? I can't get it to work; I'm getting this error.
from dataframe-go.
@CeciliaCoelho can you show me your code?
I was actually waiting for a response to these Qs: xitongsys/parquet-go#360
from dataframe-go.
> @CeciliaCoelho can you show me your code.
> I was actually waiting for a response to these Qs: xitongsys/parquet-go#360
I'm getting a new error now. This was a CSV that I converted to Parquet using Python, but I wanted to open and use it in Go for efficiency.
The CSV was like this:
I have this code:
package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)
	if err != nil {
		log.Println("Can't open file")
	}
	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))
	if err != nil {
		log.Println("Unable to create parquet reader", err)
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var id, name, res, spill, turb, pump []interface{}
	id, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.id", numRows)
	name, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.name", numRows)
	res, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.res", numRows)
	spill, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.spill", numRows)
	turb, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.turb", numRows)
	pump, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.pump", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("id", nil, id...),
		dataframe.NewSeriesString("name", nil, name...),
		dataframe.NewSeriesString("res", nil, res...),
		dataframe.NewSeriesInt64("spill", nil, spill...),
		dataframe.NewSeriesInt64("turb", nil, turb...),
		dataframe.NewSeriesInt64("pump", nil, pump...),
	)
	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}
	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

func main() {
	loadEntriesParquet("cascades2.parquet")
}
Now the error I'm getting is this:
from dataframe-go.
The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
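One way to catch this kind of mismatch before calling Sort is to compare the sort keys against the column names first. This is just a sketch using the standard library: missingKeys is a hypothetical helper, not part of dataframe-go, and the column list is copied by hand from the code above.

```go
package main

import "fmt"

// missingKeys returns the sort keys that do not match any column name,
// in the order they were given. It is a hypothetical helper.
func missingKeys(columns, sortKeys []string) []string {
	known := make(map[string]struct{}, len(columns))
	for _, c := range columns {
		known[c] = struct{}{}
	}
	var missing []string
	for _, k := range sortKeys {
		if _, ok := known[k]; !ok {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// Series names used when building the DataFrame above.
	columns := []string{"id", "name", "res", "spill", "turb", "pump"}

	if m := missingKeys(columns, []string{"datecreated"}); len(m) > 0 {
		fmt.Println("cannot sort, unknown columns:", m)
	}
}
```

With the columns above, the check reports "datecreated" as unknown, so the sort key needs to be one of the six Series names (or the sort removed).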
from dataframe-go.
> The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
Oh right, I didn't notice that. It's running now. Thanks :)
How do I print or access the DataFrame? (I bet it's a stupid question; sorry, I'm a Golang newbie.)
from dataframe-go.