
csvDataFrame rewrite (dataframes.jl, 19 comments, CLOSED)

juliadata commented:
csvDataFrame rewrite

Comments (19)

johnmyleswhite commented:

I've set up a branch called csv with very rough experimental support for reading CSV files according to the RFC 4180 conventions. You can see the additions at https://github.com/HarlanH/JuliaData/compare/csv

This branch reflects my (perhaps misguided) hope that the CSV reader we ultimately build can also be used for streaming data processing. That flexibility brings with it some potential problems, which I describe briefly below:

Pros:

  • New approach is correct for CSV files with standard quoted fields
  • New approach provides a viable mechanism for constructing DataStreams, which need:
    • Mechanism for parsing lines separate from parsing files
    • Explicit control over types on a line-by-line basis

Cons:

  • New approach is about twice as slow as old method, which was already slow
  • New approach is not yet correct for quotes-inside-of-quotes
  • New approach is not yet correct for newlines-inside-of-quotes

I think the major performance problems come from the frequent use I've made of cbind and rbind to construct DataFrames. Other ideas would be very welcome.
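
For a sense of what RFC 4180-style splitting involves, here is a minimal sketch of a quote-aware field splitter (a hypothetical illustration, not the code in the csv branch; it handles doubled quotes inside quoted fields but not embedded newlines):

# Split one CSV line into fields, honoring quoted fields and doubled quotes.
# Hypothetical sketch; newlines inside quoted fields are not handled here.
function split_csv_line(line::AbstractString)
    fields = String[]
    buf = IOBuffer()
    inquotes = false
    i = firstindex(line)
    while i <= lastindex(line)
        c = line[i]
        if inquotes
            if c == '"'
                nxt = nextind(line, i)
                if nxt <= lastindex(line) && line[nxt] == '"'
                    write(buf, '"')          # doubled quote -> literal quote
                    i = nxt
                else
                    inquotes = false         # closing quote
                end
            else
                write(buf, c)
            end
        elseif c == '"'
            inquotes = true                  # opening quote
        elseif c == ','
            push!(fields, String(take!(buf)))
        else
            write(buf, c)
        end
        i = nextind(line, i)
    end
    push!(fields, String(take!(buf)))
    return fields
end

split_csv_line("1,\"a, \"\"quoted\"\" field\",2")   # -> ["1", "a, \"quoted\" field", "2"]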

doobwa commented:

Though I've tried, I haven't been able to do anything that's convincingly better than _jl_dlm_readrow. Because of that, I've been using the setup below for reading data. The state is often just a dictionary, and updater() is an arbitrary function (which often just pushes different elements of the row to different places).

# Assumes infile has a header row
function streamer(infile, updater, state, dlm, eol, max_rows)
    io = open(infile)
    cn = _jl_dlm_readrow(io, dlm, eol)       # header row: the column names
    for i = 1:max_rows
        if i % 10000 == 0 println("$i/$max_rows"); end
        vals = _jl_dlm_readrow(io, dlm, eol) # one data row, already split on dlm
        updater(state, vals, cn, i)          # caller decides what to keep and how
    end
    close(io)
    return state
end

Like you, I want explicit control over types on a line-by-line basis, which the updater() function can do. Let me add another pro to your list: explicit control over which elements to keep. This can be handy when some of the columns have a lot of data you want to ignore (e.g. text fields).

If I want to construct a DataFrame, I use the elements of the dictionary I constructed. That way I avoid all the copying required when using rbind repeatedly. Also, there is some overhead involved with rbinding in the presence of PooledDataVecs, so make sure to watch out for that as well.
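
As a concrete (hypothetical) example of that pattern, the state can be a Dict of column vectors and the updater can append only the columns it cares about, parsing each into its intended type:

# Hypothetical updater: keep only selected columns, parse them, and append to
# vectors held in a Dict keyed by column name. cn is the header row.
function my_updater(state, vals, cn, i)
    keep = Dict("x" => Float64, "label" => String)    # columns to keep and their types
    for (j, name) in enumerate(cn)
        haskey(keep, name) || continue                 # ignore everything else
        T = keep[name]
        col = get!(state, name, T[])                   # create the column vector on first use
        push!(col, T == String ? String(vals[j]) : parse(T, vals[j]))
    end
    return state
end

state = Dict{String,Any}()
# state = streamer("data.csv", my_updater, state, ',', '\n', 1_000_000)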

StefanKarpinski commented:

Btw, guys, I'm going to revamp the I/O and string subsystems soon using a zero-copy approach and hopefully will be able to make this kind of thing really fly.

doobwa commented:

Awesome! I can't wait. I've been banging my head against this a bit. Can you give a little preview of how you plan to do this?

StefanKarpinski commented:

The basic strategy is to minimize copying and syscalls as much as possible.

When doing string input (the I half of I/O), we use the readv syscall to have the kernel write as much data as it has into heap-allocated byte arrays of some fixed size. The strings that are actually presented to the user are then substrings and rope strings built out of those buffers. Because neither substrings nor rope strings require copying, this is a zero-copy operation.

When doing things like CSV parsing, rather than copying all the string data, you'll just be making new substrings and rope strings referencing the same buffers. Any buffers that become entirely unreferenced will get garbage collected. That means that if you take small substrings of huge input strings, everything will work just fine and memory will get freed as one would like it to.
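
In present-day Julia terms, the parsing side of that idea looks roughly like this (an illustrative sketch only, not the planned implementation):

# Illustrative only: SubStrings are views into the original buffer, so
# splitting a chunk into fields copies none of the underlying byte data.
function split_fields(buf::String)
    fields = SubString{String}[]
    lo = firstindex(buf)
    for i in eachindex(buf)
        if buf[i] == ',' || buf[i] == '\n'
            push!(fields, SubString(buf, lo, prevind(buf, i)))   # a view, not a copy
            lo = nextind(buf, i)
        end
    end
    return fields
end

split_fields("alpha,beta,gamma\n")   # -> ["alpha", "beta", "gamma"], no bytes copied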

When printing a big, complicated rope string that's constructed this way, the key, again, is to minimize data copying and syscalls. On the output side (the O half of I/O), for a given sequence of substrings and rope strings to output, we construct the appropriate vector of iovec structs, which is also zero-copy, and then just make a single call to writev, which writes out the full sequence of bytes.

There are some issues. One major concern is that indexing into a rope string is going to be much slower than indexing into a byte string. Therefore, I may not want to actually use rope strings but rather do something like a "super string" that instead keeps a vector of constituent string objects together with a table of lengths, allowing faster indexing and, even more importantly, fast iteration (the state object might become something other than just a byte index, however :-|).
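
A toy version of that "super string" idea (a hypothetical sketch, using character indexing over ASCII chunks for simplicity):

# Hypothetical sketch: a vector of chunks plus cumulative lengths, so finding
# the chunk that holds index i is a binary search rather than a rope walk.
struct SuperString
    chunks::Vector{String}
    offsets::Vector{Int}          # offsets[k] = total length of chunks 1..k
end
SuperString(chunks::Vector{String}) = SuperString(chunks, cumsum(length.(chunks)))

function Base.getindex(s::SuperString, i::Int)
    k = searchsortedfirst(s.offsets, i)              # which chunk holds index i
    prev = k == 1 ? 0 : s.offsets[k-1]
    return s.chunks[k][i - prev]                     # assumes 1-byte characters
end

s = SuperString(["hello", ", ", "world"])
s[7], s[8]   # -> (' ', 'w')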

Another issue is that PCRE regular expressions will not work on complicated structures like rope strings. The current approach to that is to transcode transparently and then apply the operation to the resulting byte string. A possible future approach would be to write a regex engine in pure Julia which could operate on arbitrary string types. Such an engine could get automatic JIT power because writing r"^a+b+$" could generate actual code for that regex (generic code at compile time but once it was applied to a specific string type, it would be jitted to type-specific machine code). We could also follow the lead of Google's RE2 library and limit our semantics to real regular expressions which can only use a fixed amount of stack space and are very fast.

HarlanH commented:

Interesting. For most CSV use cases with DFs, most (but not all) of the columns will be numeric, and will thus be converted to numeric values and the strings gc'ed. And at least some of the string columns will presumably be pooled, so only one heap-allocated substring will need to be stored per value. I wonder whether the overhead and complexity of the optimization you describe, Stefan, is worth it in our case?

johnmyleswhite commented:

It's probably worth pointing out that there are at least three things that slow my code down that won't benefit from Stefan's promising improvements on I/O:

  • Need to run state-machine when splitting lines into pieces. In principle, this is a one-pass algorithm and should be nearly as fast as splitting, but could plausibly not be at present. I'll time it soon to find out.
  • Creation of a single-row DataFrame from DataVecs. This needs to be done much more cleanly than I've done it so far. I'm sure this step is wasteful.
  • Repeated binding of rows into cumulative DataFrame. This step is probably the real disaster for the current system.
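
To make the third point concrete, here is a rough sketch of why repeated appending hurts (hypothetical, using plain vectors rather than DataFrames):

# Appending row-by-row copies the ever-growing accumulator on each step (O(n^2)
# data movement overall), while a preallocated container touches each row once.
function grow_by_append(rows)
    acc = Vector{Int}[]
    for r in rows
        acc = vcat(acc, [r])                # copies everything accumulated so far
    end
    return acc
end

function fill_preallocated(rows)
    acc = Vector{Vector{Int}}(undef, length(rows))
    for (k, r) in enumerate(rows)
        acc[k] = r                          # O(1) per row once the space exists
    end
    return acc
end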

StefanKarpinski commented:

@HarlanH: those are good points. Data frames are not a poster child for these optimizations, so perhaps I shouldn't even have brought it up here. It would be interesting to see, however.

@johnmyleswhite: you're probably right about the performance issues. It would probably make sense to try to avoid creating one-row data frames, and it would make even more sense to construct full data vectors before creating the data frame object. These kinds of things are pretty tricky, largely because you don't know at the outset how many records you'll end up with unless you do a scan through the document in advance, which might actually make sense for better performance (using wc -l is pretty effective).
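
In Julia, that pre-scan can be as cheap as counting lines and allocating the columns once (an illustrative sketch with hypothetical column names):

# Count data rows up front (header excluded) so each column is allocated exactly
# once and then filled in on the real parsing pass.
nrows = countlines("data.csv") - 1
x    = Vector{Float64}(undef, nrows)
name = Vector{String}(undef, nrows)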

HarlanH commented:

I was thinking a two-pass algorithm might make sense, as you could figure out data types and such on the first pass, then allocate memory (or disk files, if you're converting to a memory-mapped DF), then re-read and do conversions efficiently on the second pass.

johnmyleswhite commented:

There's a tension here that may have no clean solution: to handle streaming data, I want to create row-level data frames by default so that algorithms can process one row at a time, but those single rows should probably be moved one-by-one into a preformed empty DataFrame that can be sized using wc -l along with the metadata. We may need to abandon this for reading a whole DataFrame, but I'll keep working on it, since it would be much nicer to have a coherent architecture for both online and batch processing.

johnmyleswhite commented:

I definitely think the two-pass algorithm is right. My plan for tomorrow is to create guess_metadata and calculate_metadata functions that pass through the data to compute types, row and column counts, and header-based column names. You need a guess_metadata function if you want to use the same architecture to process an infinite stream of data.
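
A minimal sketch of what the exact, full-pass variant could look like; the name matches the plan above but the actual functions may well differ, and quoting and NA handling are ignored here:

# Hypothetical first pass: count the data rows and guess one type per column,
# demoting Int -> Float64 -> String as non-conforming fields are seen.
function calculate_metadata(path; dlm=',')
    open(path) do io
        header   = split(readline(io), dlm)
        coltypes = fill(Int, length(header))       # start optimistic, demote as needed
        nrows    = 0
        for line in eachline(io)
            nrows += 1
            for (j, field) in enumerate(split(line, dlm))
                if coltypes[j] === Int && tryparse(Int, field) === nothing
                    coltypes[j] = Float64
                end
                if coltypes[j] === Float64 && tryparse(Float64, field) === nothing
                    coltypes[j] = String
                end
            end
        end
        return header, coltypes, nrows
    end
end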

StefanKarpinski commented:

Another thing you can do is read some rows and use the file size and average row size to guess at the necessary amount of memory. I do that here in my 300-line TSV static-data key-value server; it works pretty well and cuts down on array resizing while taking only one pass through the data. I still suspect that the two-pass method is better.
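
That heuristic is easy to sketch (illustrative only; estimate_rows is a hypothetical helper, and the 10% padding is an arbitrary choice):

# Sample the first few rows, estimate the total row count from the file size,
# and over-allocate slightly so most files never need a resize.
function estimate_rows(path; sample=100)
    nbytes = filesize(path)
    open(path) do io
        lens = [ncodeunits(readline(io)) + 1 for _ in 1:sample if !eof(io)]
        isempty(lens) && return 0
        avg = sum(lens) / length(lens)
        return ceil(Int, 1.1 * nbytes / avg)         # pad the guess by 10%
    end
end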

johnmyleswhite commented:

I just pushed some new code that attempts to pre-allocate an empty data frame and then fill it in row by row. The three variants I have all perform about equally, and all are slower than csvDataFrame, which makes me think the problems are the calls to cbind, the type-conversion steps, or the state-machine splitter. Will keep working on this over the week.

johnmyleswhite commented:

Just pushed a change that removes the excessive calls to cbind. This new parser is now faster than the old one, while being both closer to the CSV standard and usable for streaming data. I'll clean up the splitter function and then would like to discuss a new API for reading DataFrames from delimited data files that uses this new backend.

I'll also finally be able to code up more substantive examples of streaming data processing, including streaming variance, entropy and correlations, along with an SGD-based linear regression solver that Stefan and I worked on previously without a DataFrame backend. Before the end of the month I'm hoping to have a clean Formula-based API for doing simple regression and classification on 10-1000 GB data sets in Julia.
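
As one example of the kind of streaming statistic this enables, a running variance can be updated one observation at a time (Welford's algorithm; an illustrative sketch, not the code referred to above):

# Online (Welford) variance: consume one value per row, never storing the data.
mutable struct RunningVariance
    n::Int
    mean::Float64
    m2::Float64
end
RunningVariance() = RunningVariance(0, 0.0, 0.0)

function update!(s::RunningVariance, x::Real)
    s.n += 1
    d = x - s.mean
    s.mean += d / s.n
    s.m2 += d * (x - s.mean)
    return s
end

variance(s::RunningVariance) = s.n < 2 ? NaN : s.m2 / (s.n - 1)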

tshort commented:

Here are some ideas for an API for reading delimited files. I like the idea of using dlmread and variants to fill in an AbstractDataFrame, something like:

dlmread(io::IO, df::AbstractDataFrame, options)
csvread(io::IO, df::AbstractDataFrame, options)

That way, you can preallocate and specify the column types and the container type. By using an AbstractDataFrame, the user could specify a DataFrame, a DistributedDataFrame, an OnDiskDataFrame, or something else. Because df specifies many of the normal options for reading delimited files, options may only need to specify the delimiter, the number of rows to skip, quoting options, and NA specifications.

We will need to supply helper functions to fill in defaults. We can create a type that holds the arguments to dlmread given above:

type DataFrameFile
    io::IO
    df::AbstractDataFrame
    options
end

dlmread(x::DataFrameFile)

Now, we need some methods to fill in defaults for a DataFrame:

# create an uninitialized DataFrame from the given file and options:
DataFrameFile(io::IO)
DataFrameFile(io::IO, options)
    # Take one pass through the file and calculate the number of rows
    # and figure out column types.

Here are some ways you might read delimited files:

d = csvread(DataFrameFile("test.csv"))
d = dlmread(DataFrameFile("test.tsv", @options(sep := '\t', header := true)))

N = Base.countlines("test.csv") - 1
# make a DataFrame with a mixture of column types, including Vectors and DataVecs
df = DataFrame({PooledDataVec(zeros(Int64, N)), PooledDataVec(fill("", N)),
                zeros(N), DataVec(zeros(N)), fill("", N), fill("", N)},
               ["n", "name1", "x", "y", "name2", "name3"])
d = csvread("test.csv", df, @options(skip := 1))

johnmyleswhite commented:

This is great, Tom. I'm on board with basically everything and will chime in with some comments tomorrow morning.

tautologico commented:

I've used the CSV reading from the csv branch on a few occasions now (following the examples in test/parser.jl) and, aside from having to specify column names and types, it worked well for the files I needed (while csvDataFrame didn't work). There were quotes inside quotes, but they were simply omitted from the resulting DataFrame. Since I didn't need those values to be kept intact, the current behavior was adequate.

johnmyleswhite commented:

Good to know. I've done about half of the work on the inference system for column names and types. I'm going to try to finish it tomorrow.

I need to think a lot more carefully about how to handle quotes-inside-quotes correctly. Less clear to me is how one ought to handle quotes inside an unquoted string, e.g. a line like 12,3131,ac"cd"a,ada. Currently the system just removes those quotes.
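
For reference, RFC 4180 only defines quotes inside quoted fields, where they are escaped by doubling, so a conforming writer would emit the third field of that line as:

12,3131,"ac""cd""a",ada

A bare quote inside an unquoted field falls outside the RFC grammar, so the behavior there is up to the implementation.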

johnmyleswhite commented:

This issue is substantially out of date. Closing.
