doyaaaaaken / kotlin-csv

Pure Kotlin CSV Reader/Writer

Home Page: https://kenta-koyama-biz.gitbook.io/kotlin-csv/

License: Apache License 2.0

Kotlin 100.00%
kotlin csv dsl kotlin-csv kotlin-multiplatform

kotlin-csv's Issues

Error AsSequence

val file = File("file.csv") // format: ZonedDateTime,BigDecimal
val values = csvReader().open(file) {
    val listStr = readAll()
    val size = listStr.map { ZonedDateTime.parse(it[0]) to BigDecimal(it[1]) }.size
    println(size)
    readAllAsSequence().map { ZonedDateTime.parse(it[0]) to BigDecimal(it[1]) }.toMap()
}

println(values.size)

output:

1490
0

PS: Sorry, I found my mistake: the file has to be reopened before it is read a second time.

Skip mismatched row fields number option?

Is it possible to add an option to skip (or even read) rows whose number of fields differs from the other rows, instead of throwing an exception? I'm willing to submit a PR myself if you're open to adding this option.

Let me know what your thoughts are on this

Ability to read line by line file with header

It is very common for CSV files to have a header row. However, I can't always store the whole file in memory, nor do I always need to. Right now it is possible either to read a file line by line or to read all the data with the header. It would be really useful to be able to do both at the same time.
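A minimal sketch of what is being asked for, using only the Kotlin standard library. The helper name readWithHeaderLazily is hypothetical and not part of kotlin-csv, and the naive split(',') does not handle quoted fields; the point is only that the header is consumed once while the remaining rows stay lazy.

```kotlin
// Hypothetical helper, not part of kotlin-csv.
// Reads the first line as the header, then lazily pairs every later line with it.
// Assumption: fields contain no quotes or embedded delimiters (naive split).
fun readWithHeaderLazily(lines: Sequence<String>): Sequence<Map<String, String>> = sequence {
    val iterator = lines.iterator()
    if (!iterator.hasNext()) return@sequence // empty input: no header, no rows
    val header = iterator.next().split(',')
    for (line in iterator) {
        // Only one row is materialized at a time
        yield(header.zip(line.split(',')).toMap())
    }
}

fun main() {
    val csv = "name,age\nalice,30\nbob,25"
    readWithHeaderLazily(csv.lineSequence()).forEach { println(it) }
}
```

With a real file, the same function could be fed from File.useLines { ... }, so that only the current line is held in memory.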

Handle duplicated headers

Hey @doyaaaaaken thank you very much for this library!

val duplicated = findDuplicate(headers)
if (duplicated != null) throw MalformedCSVException("header '$duplicated' is duplicated")

When reading all lines with headers, a duplicate check is performed and MalformedCSVException thrown.
Since columns in a row are accessed by their index, headers could simply be deduplicated by appending something like an occurrence indicator to the header name.

Example:

just   some   example   example   headers
A      B      C         D         E

The header example appears twice. According to the suggestion the second occurrence could be named example_01.
Main benefit would be to not have a runtime exception thrown and not needing to rename columns in the original file to have it processed.

I know this would introduce some "magic" and I hope that wouldn't collide with your goal of having a simple library.
The required functionality would be just a few lines of code and I'd very gladly do a PR myself, just first wanted to get your thoughts on this.
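The suggestion above could look roughly like the following sketch. The function name deduplicateHeaders and the two-digit suffix format are assumptions based on the example_01 naming in this issue, not anything that exists in the library.

```kotlin
// Hypothetical helper, not part of kotlin-csv.
// Appends an occurrence suffix (_01, _02, ...) to every repeated header name.
fun deduplicateHeaders(headers: List<String>): List<String> {
    val seen = mutableMapOf<String, Int>()
    return headers.map { name ->
        val count = seen[name] ?: 0 // how many times this name has appeared so far
        seen[name] = count + 1
        if (count == 0) name else name + "_" + count.toString().padStart(2, '0')
    }
}

fun main() {
    println(deduplicateHeaders(listOf("just", "some", "example", "example", "headers")))
}
```

Note that this sketch does not guard against a collision with an existing header that is already named example_01; a real implementation would have to decide how to handle that case.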

java.lang.NoClassDefFoundError: mu/KotlinLogging

Describe the problem

It's not quite a bug, but I'm stuck on this exception:

Caused by: java.lang.NoClassDefFoundError: mu/KotlinLogging
	at com.github.doyaaaaaken.kotlincsv.client.CsvFileReader.<init>(CsvFileReader.kt:21)
	at com.github.doyaaaaaken.kotlincsv.client.CsvReader.open(CsvReader.kt:129)
	at com.github.doyaaaaaken.kotlincsv.client.CsvReader.readAll(CsvReader.kt:48)

It happens only on Ubuntu. MacOS is still fine.

Environment

  • kotlin-csv version: 0.11.0
  • java version: java8
  • kotlin version: 1.4.10
  • OS: Ubuntu 18.04.5

Any suggestion on this? Thank you!

Skip empty line option on reading

According to the CSV specification, an empty line between CSV rows is not allowed.
However, there is demand for reading that kind of file.

So we want to make this a csvReader option, like below.

val reader = csvReader {
    skipEmptyLine = true
}

val str = """a,b,c

d,e,f
"""
//can read csv containing empty line.
reader.read(str)

Write Without Header

With the current implementation, whenever we write rows, the first row is automatically added as a header. There should be an option to say that no header is needed.

Problem with parsing CSV file with spaces and colon

Good morning,

I'm trying to use kotlin-csv on a comma delimited file (input stream) but I think there is a problem with managing the spaces and colon.

In particular, these are the first lines of the file:

Device serial,Date ,Temperature 51 (Medium) °C
11869,2021-02-09 00:14:59,7.2
11869,2021-02-09 00:30:01,7.1
11869,2021-02-09 00:44:59,7.2
11869,2021-02-09 00:59:59,7.4
11869,2021-02-09 01:14:58,7.5
11869,2021-02-09 01:29:58,7.5
11869,2021-02-09 01:44:58,7.3
11869,2021-02-09 01:59:58,7.2
11869,2021-02-09 02:14:58,7.2
11869,2021-02-09 02:29:58,7.2
11869,2021-02-09 02:44:58,7.2
11869,2021-02-09 02:59:57,7.3
11869,2021-02-09 03:14:57,7.2
11869,2021-02-09 03:29:57,7.3
11869,2021-02-09 03:44:57,7.2
11869,2021-02-09 03:59:57,7.4

...while this is the script:

import com.github.doyaaaaaken.kotlincsv.dsl.csvReader
import java.io.InputStream

data class DataClass(
    val FirstColumn: String,
    val SecondColumn: String,
    val ThirdColumn: String
)

fun parse(data: InputStream): List<DataClass>? {
    val list = mutableListOf<DataClass>()
    try {
        // Read every row into memory as a list of string fields
        val rows: List<List<String>> = csvReader().readAll(data)
        for (i in rows) {
            val firstColumn: String = i[0]
            val secondColumn: String = i[1]
            val thirdColumn: String = i[2]
            list.add(DataClass(FirstColumn = firstColumn, SecondColumn = secondColumn, ThirdColumn = thirdColumn))
        }
    } catch (e: Exception) {
        e.printStackTrace()
    }
    return list
}

Unfortunately, only the first column of each row is correctly identified in the output. For example (first row):
first column: 11869
second column: 2021-02-09
third column: 0
No other columns detected.

Where is my mistake?
Thank you

Functions that use lambdas should be inlined where possible

I noticed that functions using lambdas aren't utilizing Kotlin's inline keyword. This could have avoidable performance impacts.

Take, for example, this function:

fun csvReader(init: CsvReaderContext.() -> Unit = {}): CsvReader {
    val context = CsvReaderContext().apply(init)
    return CsvReader(context)
}

The JVM doesn't have higher-order functions, so Kotlin must generate a class (a "SAM type") with the lambda's code in a single method. This doesn't matter too much if Kotlin can generate a singleton object, but in this case it can't, as CsvReaderContext() is captured in the closure. So, every time this function is invoked the lambda's class must be instantiated with CsvReaderContext() in a field, invoked, then garbage collected right after. (Correct me if I'm wrong)

*Small correction: this is a bad example. Looking at the bytecode, the reader context is passed as a method parameter.

I'm not sure how this works on other platforms, but on the JVM this impacts performance.

To avoid this, Kotlin provides inline functions, which inline the function's bytecode at each call site. This mitigates the performance issues above at the expense of larger bytecode. If internal types or functions are used, you can add the @PublishedApi annotation to make them accessible from the inline function: whatever it is applied to becomes public in the bytecode, while Kotlin still enforces the internal visibility in source code.

A more impactful example would be the open functions in CsvReader.kt and CsvWriter.kt

Now, whether this is that big of a deal in this case is debatable.
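As a rough illustration of the proposal, here is a self-contained sketch of an inline DSL entry point that needs @PublishedApi to reach an internal constructor. ReaderContext, Reader and reader are simplified stand-ins invented for this example, not the library's actual classes.

```kotlin
// Simplified stand-ins for illustration only; these are not kotlin-csv's real types.
class ReaderContext {
    var delimiter: Char = ','
}

// The constructor is internal, so a public inline function may only call it
// because it is marked @PublishedApi.
class Reader @PublishedApi internal constructor(val delimiter: Char)

// `inline` copies this body (and the lambda body) into each call site,
// so no function object has to be allocated for `init`.
inline fun reader(init: ReaderContext.() -> Unit = {}): Reader {
    val context = ReaderContext().apply(init)
    return Reader(context.delimiter)
}

fun main() {
    val r = reader { delimiter = ';' }
    println(r.delimiter)
}
```

The trade-off is the one described above: every call site carries its own copy of the function body, and anything marked @PublishedApi effectively becomes part of the binary API.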

Feature request: Differentiate between empty string and null value

readAllWithHeader yields a List<Map<String,String>> and hence empty columns are being read as empty strings, so that we get "" for both col1 and col2 in the following example:

"col1","col2"
"",

I'd really like to get null for col2 here (this of course only makes sense if all strings are quoted, otherwise it wouldn't be clear how to interpret empty columns). I understand that you can't change the result to List<Map<String,String?>> now, but maybe you could add a nullCode option for reading as it already exists for writing. The default value is an empty string "" (=current behavior). I could then simply do

val nullCode = "NULL"
val rows = csvReader(nullCode=nullCode).readAllWithHeader(inputStream)
    .map { row -> row.mapValues { col -> if (col.value == nullCode) null else col.value } }

At first glance it seems that it only requires changing

                    delimiter -> {
                        flushField()
                        state = ParseState.DELIMITER
                    }
                    '\n', '\u2028', '\u2029', '\u0085' -> {
                        flushField()
                        state = ParseState.END
                    }
                    '\r' -> {
                        if (nextCh == '\n') pos += 1
                        flushField()
                        state = ParseState.END
                    }

to

                    delimiter -> {
                        field.append(nullCode)
                        flushField()
                        state = ParseState.DELIMITER
                    }
                    '\n', '\u2028', '\u2029', '\u0085' -> {
                        field.append(nullCode)
                        flushField()
                        state = ParseState.END
                    }
                    '\r' -> {
                        if (nextCh == '\n') pos += 1
                        field.append(nullCode)
                        flushField()
                        state = ParseState.END
                    }

and the same for

                    delimiter -> {
                        flushField()
                        state = ParseState.DELIMITER
                    }
                    '\n', '\u2028', '\u2029', '\u0085' -> {
                        flushField()
                        state = ParseState.END
                    }
                    '\r' -> {
                        if (nextCh == '\n') pos += 1
                        flushField()
                        state = ParseState.END
                    }

but I didn't check it thoroughly.
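To make the distinction concrete, here is a small standalone sketch of a single-line parser that returns null for an unquoted empty field and "" for a quoted empty one. It is not the library's parser; parseLineWithNulls and its parameters are made up for this illustration, and it ignores multi-line fields.

```kotlin
// Hypothetical single-line parser, not kotlin-csv's implementation.
// An empty field that was never quoted becomes null; a quoted empty field stays "".
fun parseLineWithNulls(line: String, delimiter: Char = ',', quote: Char = '"'): List<String?> {
    val fields = mutableListOf<String?>()
    val field = StringBuilder()
    var inQuotes = false
    var wasQuoted = false // true once the current field has seen an opening quote

    fun flush() {
        fields.add(if (field.isEmpty() && !wasQuoted) null else field.toString())
        field.setLength(0)
        wasQuoted = false
    }

    var i = 0
    while (i < line.length) {
        val ch = line[i]
        when {
            // An escaped quote ("") inside a quoted field yields one literal quote
            inQuotes && ch == quote && i + 1 < line.length && line[i + 1] == quote -> {
                field.append(quote)
                i++
            }
            inQuotes && ch == quote -> inQuotes = false
            inQuotes -> field.append(ch)
            ch == quote -> {
                inQuotes = true
                wasQuoted = true
            }
            ch == delimiter -> flush()
            else -> field.append(ch)
        }
        i++
    }
    flush() // the last field has no trailing delimiter
    return fields
}

fun main() {
    println(parseLineWithNulls("\"\","))
}
```

Tracking "was this field quoted" in the parser avoids the need for a sentinel string such as "NULL", but it would change the element type to String?, which is exactly the compatibility concern raised above.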

CsvFileReader#readAllAsSequence fails on non-equal sized rows

Describe the bug
If rows are not equal-sized an exception is thrown:
com.github.doyaaaaaken.kotlincsv.util.CSVFieldNumDifferentException: Fields num seems to be 4 on each row, but on 2th csv row, fields num is 3.

To Reproduce

        csvWriter().open(csvFile) {
            writeRow(listOf("a"))
            writeRow(listOf("a", "b"))
        }
        csvReader().open(csvFile) {
           readAllAsSequence()
        }

Expected behavior
Missing cells are treated as nulls or empty strings.

Environment

  • kotlin-csv version 0.11.0
  • OS: Android 10

Screenshots
N/A

number of fields in a row has to be based on the header

To reproduce

Have a csv file with header row with 3 columns and two data rows, first data row with two columns, second - with three. Like this:

First name   Last name   Citizenship
John         Bobkins
Michael      Pepkins     US

When 'readAllWithHeaderAsSequence' is invoked on this file, a CSVFieldNumDifferentException is thrown saying that two columns are expected but three were found. This happens because the 'fieldsNum' variable in CsvFileReader.kt is initialized from the first data row, whereas it should be initialized from the header row.

Expected behavior
The following code has to return two rows:

csvReader().open(filePath) {
    readAllWithHeaderAsSequence().forEach {
        . . .
    }
}

Environment

  • kotlin-csv version 0.11.1
  • java version - java8
  • OS: Windows 10
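The expected behavior in this report (field count taken from the header, short rows tolerated) could be sketched as follows. rowToMap and its missing parameter are hypothetical names used only for this example; the library itself may resolve the issue differently.

```kotlin
// Hypothetical helper, not part of kotlin-csv.
// Uses the header as the source of truth for the number of fields and
// fills cells that are absent from a short row with `missing`.
fun rowToMap(header: List<String>, row: List<String>, missing: String = ""): Map<String, String> {
    require(row.size <= header.size) {
        "row has ${row.size} fields but header has only ${header.size}"
    }
    return header.mapIndexed { index, name -> name to row.getOrElse(index) { missing } }.toMap()
}

fun main() {
    val header = listOf("First name", "Last name", "Citizenship")
    println(rowToMap(header, listOf("John", "Bobkins")))
    println(rowToMap(header, listOf("Michael", "Pepkins", "US")))
}
```

Rows longer than the header are still rejected here, since there is no column name to assign the extra values to.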

Long-running write

Please allow writing to a CSV file without having to close it.
I have a streaming scenario where I need to write each record to the CSV as I receive it. Closing and reopening the file after each batch would be suboptimal.

Thanks
David

Reuse config between reader and writer

When reading/writing you usually want to use the same config, i.e. charset, quoteChar, etc.

It would be great if we could write this config once and then reuse it for both reading and writing.

Something like:

val context = CsvContext {
    charset = "UTF-8"
}

csvReader(context)...
csvWriter(context)...

java.lang.NoClassDefFoundError: com/github/doyaaaaaken/kotlincsv/dsl/CsvReaderDslKt

Describe the bug
Cannot find the dsl

To Reproduce

plugins {
	kotlin("jvm") version "1.6.10"
        ...
}
...
dependencies {
	implementation("com.github.doyaaaaaken:kotlin-csv-jvm:1.2.0")
        ...
}
...
val rows = csvReader().readAll(inputStream) // throws error

java.lang.NoClassDefFoundError: com/github/doyaaaaaken/kotlincsv/dsl/CsvReaderDslKt

Expected behavior
A clear and concise description of what you expected to happen.

Environment

  • kotlin-csv version: 1.6.10
  • java version: 11.0.3
  • OS: MacOS


Remove logger 3rd party library

Quickly looking at the code it seems like there's only one log statement:

logger.info { "skip miss matched row. [csv row num = ${idx + 1}, fields num = ${row.size}, fields num of first row = $fieldsNumInRow]" }

Do we really need to pull in an entire library for logging?

implementation("io.github.microutils:kotlin-logging:2.0.11")

I'm an Android user and currently that log would go basically nowhere.

EROFS (Read-only file system)

I'm getting this error, please help:
Caused by: java.io.FileNotFoundException: test.csv: open failed: EROFS (Read-only file system)

Improvement: Read one row at a time

Currently the only way to interact with a CSV is to parse all rows. Two use-cases that this does not cover are:

  • Reading only the header. This is useful if you wish to provide a breakdown of what is included in the file. While it should be trivial to do without a library, the existence of this library and its parsing logic supports the position that this is a non-trivial task.
  • Reading row-by-row, which is arguably a superset of the former use-case. This would be helpful when interacting with asynchronous workflows. One could attempt to read a single row from a piped input stream, and the library throws an exception when another line cannot be read in its entirety (as it does now with the full text). The producer can then continue to populate the input stream as data becomes available. The end result would be an asynchronous stream of rows (which I am not suggesting should be included in this library, but these changes would make this possible).

adding kotlin-csv-jvm dependency pulls testing libraries into runtime

Describe the bug
When building an application (using gradle distZip from the application plugin) that depends on kotlin-csv, test libraries are pulled into the created application artefact.

To Reproduce
Create an empty basic Kotlin project.
Add the dependency implementation("com.github.doyaaaaaken:kotlin-csv-jvm:0.10.1")

Expected behavior
No testing libraries in artefact. To check:

  • run gradle dependencies

    • runtimeClasspath should not contain kotlin-test library
  • run gradle distZip with application plugin

    • created zip should not contain testing libraries

Environment

  • kotlin-csv: 0.10.1
  • java version: java8
  • gradle: 6.5
  • OS: Win

ASCII NULL characters between CSV strings

Hey!
I've parsed a CSV and found out that there is an ASCII NULL character between every char.

I've used an HTTP request with a ByteArrayInputStream as follows:

val result = httpClient.get<HttpStatement> {
  url(blobUrl)
}.execute() { response: HttpResponse ->
  val channel: ByteReadChannel = response.receive()
  val byteIn = ByteArrayInputStream(channel.toByteArray())
  csvReader {
    delimiter = ','
    skipEmptyLine = true
    skipMissMatchedRow = true
  }.readAll(byteIn)
}

The output is a List<List<String>>, which is totally correct.

// when i go through all elements like:
map { list ->
    list[0].forEach { c: Char ->
    println(c.toInt())
     }
}
// the output is:
0
50
0
48
0
50
0
49
0
45
0
48
0
49
0
45
0
48
0
49
0

This only happens with one CSV response, and I'm not sure why. It's a report from the Google Play store.
The same code works perfectly well with other CSV files.

I've solved it by replacing the NULL char manually, like:

list[0].replace(Char.MIN_VALUE.toString(), "")

// e.g.
list[0].replace(Char.MIN_VALUE.toString(), "").forEach { c: Char ->
  println(c.toInt())
}

which returns:

50
48
50
49
45
48
49
45
48
49

I'm not sure whether it would be interesting to handle this out of the box?

have a nice day!
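One plausible explanation, which is an assumption and not confirmed anywhere in this issue, is that the report is UTF-16 encoded and is being decoded as a single-byte-compatible charset such as UTF-8. The sketch below reproduces the interleaved NUL pattern with the standard library only; decode is a trivial helper written for this example.

```kotlin
import java.nio.charset.Charset

// Trivial helper for this example: decode raw bytes with an explicit charset.
fun decode(bytes: ByteArray, charset: Charset): String = String(bytes, charset)

fun main() {
    val original = "2021-01-01"
    // UTF-16LE stores each of these ASCII characters as two bytes: the char, then 0x00
    val bytes = original.toByteArray(Charsets.UTF_16LE)

    // Decoding with the wrong charset turns every 0x00 byte into a NUL character
    val misread = decode(bytes, Charsets.UTF_8)
    println(misread.map { it.code }) // Char.code needs Kotlin 1.5+

    // Decoding with the matching charset restores the text without NULs
    println(decode(bytes, Charsets.UTF_16LE) == original)
}
```

If that assumption holds, decoding the stream with the matching charset would be a cleaner fix than stripping NUL characters afterwards, because stripping would also corrupt any non-ASCII text in the file.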

use suspend function inside lambda of `open` method

In the code below, a compile error occurs because a suspend function cannot be called inside the lambda of the open function.
So make it callable.

suspend fun processRow(row: List<String>): List<String> {
    return row.map { "prefix-$it" }
}

val rows: List<List<String>> = csvReader().open("test.csv") {
    readAllAsSequence()
        .map { row -> processRow(row) } // Compile ERROR!! processRow is suspend function so cannot call inside lambda
        .toList()
}

Discussion: https://kotlinlang.slack.com/archives/CMAL3470A/p1601651001001000

Introduce BOM for Microsoft applications

Hey there,

thank you very much for this great project.

Microsoft applications, for some reason, seem to require a BOM to parse UTF-8 files correctly, even though UTF-8, unlike UTF-16/32, has no byte order. Even with the csvWriter configured with Charsets.UTF_8.name(), a created CSV file does not open correctly there, so I suggest adding this special BOM (for UTF-8 it is the three bytes 0xEF, 0xBB and 0xBF at the start of the file).

Why this is undocumented and why Excel seems to require a BOM for UTF-8 I don't know; might be good questions for Excel team at Microsoft.

What do you think or do you have any suggestion to solve this problem?
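Until the library offers something built in, one possible workaround is to write the three BOM bytes to the stream before any CSV content. The sketch below uses only the standard library; writeCsvWithBom is a hypothetical helper, and its row formatting is deliberately naive (no quoting or escaping).

```kotlin
import java.io.ByteArrayOutputStream
import java.io.OutputStream

// The UTF-8 byte order mark: 0xEF, 0xBB, 0xBF
val UTF8_BOM = byteArrayOf(0xEF.toByte(), 0xBB.toByte(), 0xBF.toByte())

// Hypothetical helper, not part of kotlin-csv.
// Writes the BOM first, then the rows as UTF-8 text.
// Assumption: fields need no quoting or escaping.
fun writeCsvWithBom(out: OutputStream, rows: List<List<String>>) {
    out.write(UTF8_BOM)
    val writer = out.writer(Charsets.UTF_8)
    rows.forEach { row -> writer.write(row.joinToString(",") + "\r\n") }
    writer.flush()
}

fun main() {
    val buffer = ByteArrayOutputStream()
    writeCsvWithBom(buffer, listOf(listOf("name", "city"), listOf("Zoë", "Köln")))
    println(buffer.toByteArray().take(3).map { it.toInt() and 0xFF })
}
```

The same idea should apply to a FileOutputStream: emit the BOM once, then hand the still-open stream to whatever writes the CSV rows.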
