Giter Site home page Giter Site logo

tafia / calamine Goto Github PK

View Code? Open in Web Editor NEW
1.5K 28.0 148.0 5.75 MB

A pure Rust Excel/OpenDocument SpreadSheets file reader: rust on metal sheets

License: MIT License

Rust 100.00%
rust excel opendocument-spreadsheet parser vba serde deserializer

calamine's People

Contributors

andy128k avatar bbigras avatar dimastbk avatar dodomorandi avatar dskkato avatar evilpie avatar forgemo avatar grissiom avatar gzsombor avatar hbina avatar hiraokatakuya avatar hsivonen avatar jamesmcm avatar jfcha avatar jqnatividad avatar lucatrv avatar lukapeschke avatar masklinn avatar mjvl avatar pfernie avatar pleasedont avatar robyoung avatar roloedits avatar saks avatar sergeilem avatar sjmurdoch avatar softdevca avatar striezel avatar tafia avatar tiibun avatar

Stargazers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

Watchers

 avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar  avatar

calamine's Issues

Feature/API request: Open sheets from memory buffers

At the moment, the open API seems to be the only way to open a new workbook. However, it would be good to have also a way to open sheets directly from a memory buffer, without accessing the hard disk.

Imagine that an user has uploaded a smallish XLSX file to your web service, and the backend is written with Rust. You don't want to first save the file to disk (an operation that might have a worst case latency of tens of milliseconds), and then load it there to open it up โ€“ instead, you want to directly open the file from a &[u8].

get_value returns Empty for non-empty cells

If I open this XLS and try to read cell 6,2 (C7) of the sheet "Boys", it returns Some(Empty) when in Excel it contains the number 9. It doesn't seem to be a formula either.

Code to read it:

let mut workbook: Xls<_> = open_workbook("names.xls").unwrap();
let range = workbook.worksheet_range("Boys").unwrap().unwrap();
println!("{:?}", range.get_value((6,2)));

Prints Some(Empty).

Add examples

Can be straightforward but helps a lot when starting using a lib.

Ideas:
Conversion

  • Excel to csv
  • Excel to Json

Bulk check for excel errors

  • error in cells
  • errors in broken references

Don't clone on returning the sheet names

The sheet_names method clones the vector of the sheet names. I get that this isn't likely to be on the hot path, but cloning and allocating should usually left to the user when possible. (Nothing wrong with convenience methods though). As an alternative, it would be a good practice to provide an iterator over the sheet names: sheet_names_iter, with Item=&str. This avoids both allocation and cloning, leaving the maximum flexibility in the hands of the user.

getting the name of the author

Hello I was wondering if there was a way to find the name of the author of a sheet in the same way the file command:
file.xls: Composite Document File V2 Document, Little Endian, Os: Windows, Version 6.1, Code page: 1252, Author: AUTHOR_NAME, Last Saved By: AUTHOR_NAME, Last Printed: Wed Apr 10 07:55:25 2019, Create Time/Date: Wed Apr 10 07:57:54 2019, Last Saved Time/Date: Wed Apr 10 07:57:54 2019, Security: 0

I don't find anything in the documentation :/

Clean the project and publish to crates.io

Out of my mind:

  • license
  • better readme (a status about xlsb and xls management out of vba) (#20)
  • documentation (will then probably rely on doc.rs
  • more tests
  • travis and appveyor badges
  • clippy (#18)
  • rustfmt (#21)
  • use stable toolchain per default (#15 #19)

Deserialize cells with numbers as strings

I am writing some software that deserializes a excel file as a Vec<HashMap<String, Value>> where Value is my own internal value type. I do this using the RangeDeserializer, with the headers setting set to true. However I am running into some problems when my header contains numbers, is deserialized as such, and I get an error about expecting string and getting a number. I guess a simple fix would to use my internal value type which can be deserialized as both a floating point and a string and then manually converting it to a string, however if the header is parsed as a boolean I would have problem there as well. I don't know if this is considered a bug or not, but is there any workaround other than using Range::rows() and manually implementing the deserializing.

Serde serialization and deserialization?

Hello!

Have you considered adding support for serde support for serialization and deserialization, at least for the DataType type? It seems like it would be pretty straightforward. Are there any downsides to doing this that I'm not thinking of?

Loading sheets with different locale depends on sheet name

As an intro into Rust, I set a task for myself to convert some Excel files into JSON. My Excel files only have one sheet, they're basically csvs with numbers and strings,

I run into the issue that I need to enter the sheet name to open it i.e. workbook.worksheet_range("Sheet1"). When I tried running the code above I got an error "Sheet \'Sheet1\' does not exist"'. I opened the Excel to confirm it had the sheet, and realized the name was "Tabelle1" as it's created with the German locale. I realized that the actual user defined sheet name has to be used for workbook.worksheet_range() if I understand the code in read_sheets_names() correctly.
I was hoping to find out more and dug through the calamine code to figure out if there is another option. I didn't find any other way to reference the sheets. This feels like a good point for an enhancement request. The good thing is that workbook.worksheet_range("Tabelle1") opens the file correctly and I get the data.

After unzipping the excel file, I believe a solution might be something like:
The opened workbook.xml has a sheetId property and the actual sheet xml name still stays sheet1.xml in my case. I'm not sure if there are differences between the format revisions. I only tested on Excel2010. So I'm guessing that the path or the sheetId can be used for another reference. For instance when fetching sheets by name in read_sheets_names() we add the sheetId of the sheet to the resulting sheets object. Then a new function could take advantage of this and get the sheet by passing in it's number, i.e. "1".

That's my suggestion, I'd submit a pull request, as I said this is my first attempt at using rust, so I'll leave it at the suggestion level.
Thanks for the work on the wrapper, it's way over my head, but feels good to read the code and try and understand it ๐Ÿ˜ƒ. Happy new year!

bug when some rows has empty cells at the end

I have empty cells on some rows and it causes problems since office considers every row being the same length.

I wanted to fix it but I'm not sure how you would prefer it. Office could have a notion of rows in Range.inner. I'm not sure if we could add empty DataTypes in Range.inner since we would need to know the longest row's length and we would need to make 2 pass on the XML.

I can work on it if you lack the time.

I'll link a test case in which I also test for empty lines. I can split them in 2 test cases and I can also add them to issues.xlsx if you prefer.

<sheetData>
    <row r="1" spans="1:4" s="4" customFormat="1">
        <c r="A1" s="1" t="s">
            <v>0</v>
        </c>
        <c r="B1" s="2" t="s">
            <v>5</v>
        </c>
        <c r="C1" s="3" t="s">
            <v>10</v>
        </c>
        <c r="D1" s="4" t="s">
            <v>15</v>
        </c>
    </row>
    <row r="2" spans="1:4" s="4" customFormat="1">
        <c r="A2" s="1" t="s">
            <v>1</v>
        </c>
        <c r="B2" s="2" t="s">
            <v>6</v>
        </c>
        <c r="C2" s="3" t="s">
            <v>11</v>
        </c>
        <c r="D2" s="4" t="s">
            <v>16</v>
        </c>
    </row>
    <row r="3" spans="1:4" s="4" customFormat="1">
        <c r="A3" s="1" t="s">
            <v>2</v>
        </c>
        <c r="B3" s="2" t="s">
            <v>7</v>
        </c>
        <c r="C3" s="3" t="s">
            <v>12</v>
        </c>
    </row>
    <row r="4" spans="1:4" s="4" customFormat="1">
        <c r="A4" s="1" t="s">
            <v>3</v>
        </c>
        <c r="B4" s="2" t="s">
            <v>8</v>
        </c>
        <c r="C4" s="3" t="s">
            <v>13</v>
        </c>
        <c r="D4" s="4" t="s">
            <v>17</v>
        </c>
    </row>
    <row r="5" spans="1:4" s="4" customFormat="1">
        <c r="A5" s="1" t="s">
            <v>4</v>
        </c>
        <c r="B5" s="2" t="s">
            <v>9</v>
        </c>
        <c r="C5" s="3" t="s">
            <v>14</v>
        </c>
    </row>
</sheetData>

Extracting dates from Excel

In my excel sheet, I've got a column with dates. Unfortunately, calamine::DateType has no variant for dates and when I handle a date as a string I get for 29.08.2012 -> 41181.

It would be nice to have built-in date conversion

Usability of finding a header row and skipping rows

I have to read excel sheets with headers where the header row is not the first row.
I currently try to solve this by first using range.rows().enumerate().find(...) to get a row that contains the header names I expect and then using the index of that row to get a subrange of my original range.

I introduced a bug by using the index I got from that process as the new start and therefore started at e.g. row 3 instead of 26. I fixed that by adding the found header index to the current range start.
I guess this is kind of hard to use :/

Reader with data types casting on all columns or specific columns in range

Hi
First of all, thank you so much for making this library! I have been trying to get an xlsb read into pandas dataframe in python and your library has been a godsend.

My issue is that my .xlsb file has some strings that look like numbers: 00000000231, 00000000249, 00000000746. However the reader reads those values as Integers 231, 249, 746.

Is there anyway to force the entire range to be read as a string vs each data type? Sorry I am new to Rust programming so I couldn't find anything on the readthedocs site under the following links:

pub fn read_xlsb(path: &str)->Result<Vec<Vec<String>>, Error>{
    let mut excel: Xlsb<_> = open_workbook(path).expect("Incorrect path");
    if let Some(Ok(r)) = excel.worksheet_range("Default"){
        let width = r.get_size().1-1;
        let mut buffered = Vec::new();
        let rows = r.rows().take(10);
        for row in rows{
            let mut string = Vec::new();
            for value in row{
                match *value{
                    DataType::Empty=>string.push(String::from(" ")),
                    DataType::String(ref s)=>string.push(String::from(format!("{}",s))),
                    DataType::Int(ref i)=>string.push(String::from(format!("{}", i))),
                    DataType::Float(ref f)=>string.push(String::from(format!("'{}',", f))),
                    DataType::Bool(ref b)=>string.push(String::from(format!("'{}',",b))),
                    DataType::Error(ref e)=>string.push(String::from(format!("'{:?}',", e)))
                }
            }
            buffered.push(string);
        }
        Ok(buffered)
    }
    else{
        return Err(Error::new(ErrorKind::NotFound, "File NOT FOUND"));
    }
    
}

Find a new name

As per reddit thread the name is really not good.

@BrunoQC do you have any idea for the new name?
I'll settle tomorrow (the sooner the better but I need some time to decide).

Some research on the excel verb:
http://www.merriam-webster.com/thesaurus/excel

Some other candidates

  1. xlrs (eXceLRuSt)
  2. excellent / excel-lent (weird meaning in french as it is not necessarily slow)
  3. exceller (there is a wikipedia page about a "exceller" horse, I like the image https://en.wikipedia.org/wiki/Exceller)
  4. xlsr: variant of 1. but starting with "xls" like excel files
  5. excel-reader

I tend to like 1. the most

Proposed by reddit users:

  • Excellerator (killercup)

Time is recognized as Float.

time.xlsx

time.xlsx is composed of one cell "00:15:00".
This cell recognized as Float(0.0104166666666667).
I expect as String("00:15:00").
Is this the intended behavior?
The test code is below.

#[test]
fn time() {
    setup();

    let path = format!("{}/tests/time.xlsx", env!("CARGO_MANIFEST_DIR"));
    let mut excel: Xlsx<_> = open_workbook(&path).unwrap();

    let range = excel.worksheet_range("Sheet1").unwrap().unwrap();
    range_eq!(range, [[String("00:15:00".to_string())]]);
}

Documentation enhancement: How do I query a range of cells?

The examples I found all use the whole of sheet1 as the range. I have an Excel file that has a header row, followed by data rows grouped into sets of 4 columns each.

For external reasons I can't use export to CSV, or change the spreadsheet in any way.

I'd like to be able to find an example in the documentation describing how to convert the string b"A2:D91" to a Range that can be used within the Serde Deserialization example, instead of getting the range via workbook.worksheet_range("Sheet1")

Random Read

I have a a xlsx file that has a lot of meta data. and i am trying to read specific cells using the get_value function but i seem to be having a little trouble. Is there an example that uses the range.get_value function.

creating nested ranges.

how can i make a range within a range . I have a complicated xlsx file that contains a lot of header and meta data. surrounding that actual data. I was wondering how I can create a nested range to focus on the actual data.

currently I am simply using the worksheet_range function. which creates a range for the whole sheet.

`get_value` returns incorrect value when max. index is overflown

Consider Calamine 0.14.0, and an XLSX spreadsheet with one worksheet covered by range.
Given range.end() == (m_max, n_max) where m_max >= 0 and n_max >= 0,
let a = range.get_value((m, n)); where n == n_max + 1,
let b = range.get_value((m, n)); where n == 0,
then a == b.

This is incorrect. None should be returned when the index of any dimension is overflown wrt. to the maximum index.

Bug(s) in indexing worksheets

#113 contains code aimed at debugging indexing bugs.

let range = workbook.worksheet_range(name_worksheet.as_str());
if range.is_empty() {
    continue
}
let (row_max, col_max) = (row_max_usize as u32, col_max_usize as u32);
trace!("{}x{} matrix", row_max, col_max);
for index_col in 0..col_max {
    for index_row in 0..row_max {
        // panics
        let cell = range.get_value((index_row, index_col));
    }
}

calamine::xls::parse_dimensions() panics for specific workbook on debug build

fn main() {
    use calamine::{open_workbook, Xls};
    let mut path = std::path::PathBuf::from(env!("CARGO_MANIFEST_DIR"));
    path.push("car_test.xls");
    let _excel: Xls<_> = open_workbook(&path).unwrap();
}

Run this code on debug mode with car_test.xls from car_test.zip panics here:

thread 'main' panicked at 'attempt to subtract with overflow', C:\--snip--\calamine\src\xls.rs:450:32
stack backtrace:
  11: calamine::xls::parse_dimensions
             at C:\--snip--\calamine\src\xls.rs:450

Runing this code on release mode doesn't panic, because release builds doesn't check overflow. But I'm pretty sure potential bad things will come if u32 overflow happens.

One minute, I'll provide a pull request very soon.

XLSX with passwords

It seems that no Rust crate at the moment supports encrypted spreadsheets but is there any plans? It seems that password support is by no means a trivial feature, since there are many different schemes to encrypt an XLSX file. I don't even know if some are more common than others: http://poi.apache.org/encryption.html

I think that the Apache POI library is the gold standard standard for supporting stuff: https://github.com/apache/poi

As one step towards a better direction, it would be nice if it could recognize and report the error correctly, as it does with the XLS format.

xlsm: rows not splitted correctly

| 1 | a |
| 2 | b |
| 3 | c |

output

0,0: Int(1)
0,1: String("a")
1,0: String("b")
1,1: Int(3)
row: [Int(1), String("a"), Int(2)]
row: [String("b"), Int(3), String("c")]

code

extern crate office;
use office::Excel;

fn main() {
    let mut excel = Excel::open("test.xlsm").unwrap();
    let r = excel.worksheet_range("Feuil1").unwrap();
    println!("0,0: {:?}", r.get_value(0, 0));
    println!("0,1: {:?}", r.get_value(0, 1));
    println!("1,0: {:?}", r.get_value(1, 0));
    println!("1,1: {:?}", r.get_value(1, 1));

    for row in r.rows() {
        println!("row: {:?}", row);
    }   
}

xlsm: no sheet

There seems to be no sheet if I only have one line in a xlsm file.

| 1 | a |

thread 'main' panicked at 'called `Result::unwrap()` on an `Err` value: Error(Msg("Sheet \'Feuil1\' does not exist"), (None, None))', ../src/libcore/result.rs:799
extern crate office;
use office::Excel;

fn main() {
    let mut excel = Excel::open("test2.xlsm").unwrap();
    let r = excel.worksheet_range("Feuil1").unwrap();
    println!("0,0: {:?}", r.get_value(0, 0));
    println!("0,1: {:?}", r.get_value(0, 1));

    for row in r.rows() {
        println!("row: {:?}", row);
    }
}

ODS partial richtext string issue

Richtext in ODS cells works as long as the richtext part of the string is at the beginning. If it starts inside the string, the cell value is cut where it starts.

See the following pull request for a related testcase #122.

---- partial_richtext_ods stdout ----
thread 'partial_richtext_ods' panicked at 'assertion failed: (left == right)
left: String("a"),
right: String("abc"): Mismatch at position (0, 0)', tests/test.rs:272:5

Sorry, I had to create another ODS file as it seems editing those in the repository broke other tests.

Provide a lazy iterator over the rows

Today the entire range is first saved in memory.
Depending on the need we could provide for both xlsx and xlsm a Row or even a Cell lazy iterator ... and collect them into a Range if needed.

Improve readme

Adds several missing parts

  • Examples
    • Example for reading cells
    • Example for reading vba, and checking references
  • Better description of the library
  • Benchmarks?
  • Current limitations
    • cannot read xlsb cells yet
    • cannot read xls cells yet
    • inconsistent vba string parsing (missing a proper MBSC encoding)
    • plus many missing functions
    • no writing, only read

Better Vba management

Using the api is not nice atm because of the Cow.
Maybe doing all the work at once (expecting the vba code to be pretty low anyway) should probably be a better solution.
And then directly call get_references and get_module_names and get_module_(raw) from Excel.

lifetime issue when itering over the `sheet_names()`

Here is the code:

    let mut xl = Excel::open(&sce).unwrap();
    for stn in xl.sheet_names().unwrap() {
        let range = xl.worksheet_range(stn).unwrap();
        write_range(&mut dest, range).unwrap();
    }

I want to iter over the sheets with sheet names. But when I compile it, I got lifetime issue:

error[E0499]: cannot borrow `xl` as mutable more than once at a time
  --> src/main.rs:28:21
   |
27 |     for stn in xl.sheet_names().unwrap() {
   |                -- first mutable borrow occurs here
28 |         let range = xl.worksheet_range(stn).unwrap();
   |                     ^^ second mutable borrow occurs here
29 |         write_range(&mut dest, range).unwrap();
30 |     }
   |     - first borrow ends here

error: aborting due to previous error

Is there any way to fix it easily?

Performance/speed

Hi! I really love the work that you did on this tool, thank you!

I've been looking for a high-performance xlsx parser that works well on large files

I tried to parse a 270,000-row, 13-column, 21MB file and convert it to a csv, using both calamine (the sample code you provided in examples/) and the fastest xlsx library I know of, https://github.com/dilshod/xlsx2csv in python.

xlsx2csv took 1m42s while Calamine converted this in 3m39s.

I was wondering if speed and performance is a goal of this library, and if you have any ideas for speeding things up?

read_shared_strings is broken where there's RichTextRun cells

I have a big (14M) excel file with about 20 sheets. I only need to read one sheet and I can validate everything but on the last line some of the strings doesn't match. I have the same row count.

I'll check my Go implementation if I have the same problem and if I did something special.

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.