Comments (11)

pjebs commented on August 11, 2024 2

Thanks @khughitt. I need to generalise it so that it works for any parquet data.

from dataframe-go.

pjebs commented on August 11, 2024 2

Parquet importing is now supported (experimental): @CeciliaCoelho @khughitt @space55

pjebs commented on August 11, 2024 1

The function returns a *dataframe.DataFrame object. You can see examples in the Readme.

However, when I look at the code, it doesn't load the data into the DataFrame efficiently. I need to understand the Parquet package better before I can improve the code.

pjebs commented on August 11, 2024

In the past, lots of people have asked for exporting to parquet, which I implemented.
You're the first to ask about importing, but I had put it on my todo list in May.
I won't have time to implement it soon. However, you can submit a PR.

pjebs commented on August 11, 2024

Hmmm. I noticed in my TODO list (#17), there had been 3 thumbs up for that request.

khughitt commented on August 11, 2024

In case it helps, here is some code I wrote to read a parquet file into a DataFrame that you may be able to adapt in the meantime:

package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)

	if err != nil {
		// without an open file there is nothing to read, so stop here
		log.Println("Can't open file", err)
		return nil
	}

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

	if err != nil {
		log.Println("Unable to create parquet reader", err)
		entriesFr.Close()
		return nil
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var paths, titles, bodies, accesscounts, accessdates, createddates, deadlines, priorities, archived []interface{}

	paths, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.path", numRows)
	titles, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.title", numRows)
	bodies, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.body", numRows)
	accesscounts, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.accesscount", numRows)
	accessdates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.lastaccess", numRows)
	createddates, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.datecreated", numRows)
	deadlines, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.deadline", numRows)
	priorities, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.priority", numRows)
	archived, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.archived", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("path", nil, paths...),
		dataframe.NewSeriesString("title", nil, titles...),
		dataframe.NewSeriesString("body", nil, bodies...),
		dataframe.NewSeriesInt64("accesscount", nil, accesscounts...),
		dataframe.NewSeriesInt64("lastaccess", nil, accessdates...),
		dataframe.NewSeriesInt64("datecreated", nil, createddates...),
		dataframe.NewSeriesInt64("deadline", nil, deadlines...),
		dataframe.NewSeriesInt64("priority", nil, priorities...),
		dataframe.NewSeriesInt64("archived", nil, archived...),
	)

	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}

	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

A few comments:

  1. I can't claim that it is the most efficient approach, and feedback is welcome, but it should at least do the job.
  2. The function loads a parquet file containing "entries" with an expected format. I left a lot of the file-specific logic in there to provide examples of how to handle different variable types.
  3. I also left some logic at the bottom to sort the dataframe once it's been loaded, in case that is useful.
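
As an illustration of handling different variable types, here is a minimal sketch of normalising one column's values before passing them to a series constructor. It assumes the column arrives as a []interface{} (the type ReadColumnByPath is assigned to in the code above) and that its elements are int32, int64, or nil. That element-type assumption may not hold for every file. toInt64s is a hypothetical helper and is not part of dataframe-go or parquet-go.

```go
package main

import "fmt"

// toInt64s is a hypothetical helper: it converts a column read as
// []interface{} into []*int64, keeping nil entries as nil pointers.
// It returns an error on the first value of an unexpected type.
func toInt64s(vals []interface{}) ([]*int64, error) {
	out := make([]*int64, len(vals))
	for i, v := range vals {
		switch x := v.(type) {
		case nil:
			// leave out[i] as nil to mark a missing value
		case int32:
			n := int64(x)
			out[i] = &n
		case int64:
			n := x
			out[i] = &n
		default:
			return nil, fmt.Errorf("row %d: unexpected type %T", i, v)
		}
	}
	return out, nil
}

func main() {
	// assumed sample data standing in for one parquet column
	col := []interface{}{int32(1), int64(2), nil}
	vals, err := toInt64s(col)
	if err != nil {
		fmt.Println("conversion failed:", err)
		return
	}
	for _, p := range vals {
		if p == nil {
			fmt.Println("nil")
			continue
		}
		fmt.Println(*p)
	}
}
```

Returning an error for an unexpected type makes a column that was declared with the wrong series type fail early, instead of silently producing wrong values.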

Cheers.

CeciliaCoelho commented on August 11, 2024

@pjebs Did you manage to generalise it? Can't get it to work, getting this error.

[image: screenshot of the error]

pjebs commented on August 11, 2024

@CeciliaCoelho can you show me your code?

I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

CeciliaCoelho commented on August 11, 2024

> @CeciliaCoelho can you show me your code.
>
> I was actually waiting for a response to these Qs: xitongsys/parquet-go#360

Getting a new error now. This was a CSV that I converted to parquet using Python, but I wanted to open and use it in Go for efficiency.
The CSV was like this:
[image: screenshot of the CSV contents]

I have this code:

package main

import (
	"context"
	"log"
	"runtime"

	dataframe "github.com/rocketlaunchr/dataframe-go"
	"github.com/xitongsys/parquet-go-source/local"
	"github.com/xitongsys/parquet-go/reader"
)

func loadEntriesParquet(inputParquet string) *dataframe.DataFrame {
	// create local parquet/reader instances
	entriesFr, err := local.NewLocalFileReader(inputParquet)

	if err != nil {
		log.Println("Can't open file")
	}

	entriesPr, err := reader.NewParquetColumnReader(entriesFr, int64(runtime.NumCPU()))

	if err != nil {
		log.Println("Unable to create parquet reader", err)
	}

	// determine number of rows in input parquet file
	numRows := int64(entriesPr.GetNumRows())

	// read columns from parquet and use them to construct a DataFrame instance of the
	// same form
	var id, name, res, spill, turb, pump []interface{}

	id, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.id", numRows)
	name, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.name", numRows)
	res, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.res", numRows)
	spill, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.spill", numRows)
	turb, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.turb", numRows)
	pump, _, _, _ = entriesPr.ReadColumnByPath("parquet_go_root.pump", numRows)

	entries := dataframe.NewDataFrame(
		dataframe.NewSeriesString("id", nil, id...),
		dataframe.NewSeriesString("name", nil, name...),
		dataframe.NewSeriesString("res", nil, res...),
		dataframe.NewSeriesInt64("spill", nil, spill...),
		dataframe.NewSeriesInt64("turb", nil, turb...),
		dataframe.NewSeriesInt64("pump", nil, pump...),
	)

	entriesPr.ReadStop()
	entriesFr.Close()

	// sort entries by date of creation
	sortKey := []dataframe.SortKey{
		{Key: "datecreated", Desc: true},
	}

	ctx := context.Background()
	entries.Sort(ctx, sortKey)

	return entries
}

func main() {
	loadEntriesParquet("cascades2.parquet")
}

Now the error I'm getting is this:
[image: screenshot of the error]

pjebs commented on August 11, 2024

The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.
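
One way to catch this kind of mismatch before sorting is to compare the sort keys against the column names first. The following is only a sketch under stated assumptions: missingKeys is a hypothetical helper, the column names are copied from the code posted above, and nothing here uses the dataframe-go API.

```go
package main

import "fmt"

// missingKeys is a hypothetical helper: it returns every sort key that
// does not match one of the given column names.
func missingKeys(columns, sortKeys []string) []string {
	known := make(map[string]bool, len(columns))
	for _, c := range columns {
		known[c] = true
	}
	var missing []string
	for _, k := range sortKeys {
		if !known[k] {
			missing = append(missing, k)
		}
	}
	return missing
}

func main() {
	// column names taken from the DataFrame built in the posted code
	columns := []string{"id", "name", "res", "spill", "turb", "pump"}
	if m := missingKeys(columns, []string{"datecreated"}); len(m) > 0 {
		fmt.Println("cannot sort, unknown series:", m)
	}
}
```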

CeciliaCoelho commented on August 11, 2024

> The "new" error is because you are trying to sort based on a Series called "datecreated" but you don't have such a field.

Oh right, didn't notice that. It's running now. Thanks :)
How do I print or access the dataframe? (bet it's a stupid question, sorry I'm a Golang newbie)
