
csvreader's Introduction

CSV Reader

The CsvReader library is an extended version of Sébastien Lorion's fast CSV Reader project and provides fast parsing and reading of CSV files.


It is a straight drop-in replacement for the existing NuGet package LumenWorks.Framework.IO, but with additional capabilities. The other rationale for the project is that the original code is not available elsewhere in a public source repository, making it difficult to extend or contribute to.

Contributions from anyone are welcome.

You can see the version history here.

Build the project

  • Windows: Run build.cmd

The tooling should be installed automatically by Paket/FAKE. The default build will compile and test the project, and also produce a NuGet package.

The library supports .NET 2.0, 3.5, 4.5 and 4.6.1, as well as .NET Standard 1.3 and 2.0. The netstandard1.3 build does not contain CachedCsvReader since the necessary interfaces are not available on that platform.

Library License

The library is available under the MIT License; for more information see the License file in the GitHub repository.

Getting Started

A good starting point is to look at Sébastien's article on Code Project.

A basic use of the reader looks something like this:

using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadCsv()
{
    // open the file "data.csv" which is a CSV file with headers
    using (var csv = new CachedCsvReader(new StreamReader("data.csv"), true))
    {
        // Field headers will automatically be used as column names
        myDataGrid.DataSource = csv;
    }
}

Having said that, there are some extensions built into this version of the library that are worth mentioning.

Additional Features

Columns

One addition is a Columns collection, which holds the names and types of the data in the CSV file. If no headers are present, the column names default to Column1, Column2, etc.; this can be overridden by setting the DefaultColumnHeader property, e.g.

using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadCsv()
{
    // open the file "data.csv" which is a CSV file without headers
    using (var csv = new CachedCsvReader(new StreamReader("data.csv"), false))
    {
        csv.DefaultColumnHeader = "Fred";

        // Field headers will now be Fred1, Fred2, etc
        myDataGrid.DataSource = csv;
    }
}

You can specify the columns yourself if there are none in the file, and also specify the expected type; this is especially important when used with SqlBulkCopy, which we will come back to later.

using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadCsv()
{
    // open the file "data.csv" which is a CSV file without headers
    using (var csv = new CachedCsvReader(new StreamReader("data.csv"), false))
    {
        csv.Columns.Add(new Column { Name = "PriceDate", Type = typeof(DateTime) });
        csv.Columns.Add(new Column { Name = "OpenPrice", Type = typeof(decimal) });
        csv.Columns.Add(new Column { Name = "HighPrice", Type = typeof(decimal) });
        csv.Columns.Add(new Column { Name = "LowPrice", Type = typeof(decimal) });
        csv.Columns.Add(new Column { Name = "ClosePrice", Type = typeof(decimal) });
        csv.Columns.Add(new Column { Name = "Volume", Type = typeof(int) });

        // Field headers will now be picked from the Columns collection
        myDataGrid.DataSource = csv;
    }
}

SQL Bulk Copy

One use of CSV Reader is to have a nice .NET way of using SQL Bulk Copy (SBC) rather than bcp for bulk loading of data into SQL Server.

A couple of issues arise when using SBC:

  1. SBC wants the data presented as the correct type rather than as a string.
  2. You need to map between the destination table columns and the CSV columns if the order does not match exactly.

Below is an example using the Columns collection to set up the correct metadata for SBC:

public void Import(string fileName, string connectionString)
{
    using (var reader = new CsvReader(new StreamReader(fileName), false))
    {
        reader.Columns = new List<LumenWorks.Framework.IO.Csv.Column>
        {
            new LumenWorks.Framework.IO.Csv.Column { Name = "PriceDate", Type = typeof(DateTime) },
            new LumenWorks.Framework.IO.Csv.Column { Name = "OpenPrice", Type = typeof(decimal) },
            new LumenWorks.Framework.IO.Csv.Column { Name = "HighPrice", Type = typeof(decimal) },
            new LumenWorks.Framework.IO.Csv.Column { Name = "LowPrice", Type = typeof(decimal) },
            new LumenWorks.Framework.IO.Csv.Column { Name = "ClosePrice", Type = typeof(decimal) },
            new LumenWorks.Framework.IO.Csv.Column { Name = "Volume", Type = typeof(int) },
        };

        // Now use SQL Bulk Copy to move the data
        using (var sbc = new SqlBulkCopy(connectionString))
        {
            sbc.DestinationTableName = "dbo.DailyPrice";
            sbc.BatchSize = 1000;

            sbc.AddColumnMapping("PriceDate", "PriceDate");
            sbc.AddColumnMapping("OpenPrice", "OpenPrice");
            sbc.AddColumnMapping("HighPrice", "HighPrice");
            sbc.AddColumnMapping("LowPrice", "LowPrice");
            sbc.AddColumnMapping("ClosePrice", "ClosePrice");
            sbc.AddColumnMapping("Volume", "Volume");

            sbc.WriteToServer(reader);
        }
    }
}

The AddColumnMapping method is an extension I wrote to simplify adding mappings to SBC:

public static class SqlBulkCopyExtensions
{
    public static SqlBulkCopyColumnMapping AddColumnMapping(this SqlBulkCopy sbc, int sourceColumnOrdinal, int targetColumnOrdinal)
    {
        var map = new SqlBulkCopyColumnMapping(sourceColumnOrdinal, targetColumnOrdinal);
        sbc.ColumnMappings.Add(map);

        return map;
    }

    public static SqlBulkCopyColumnMapping AddColumnMapping(this SqlBulkCopy sbc, string sourceColumn, string targetColumn)
    {
        var map = new SqlBulkCopyColumnMapping(sourceColumn, targetColumn);
        sbc.ColumnMappings.Add(map);

        return map;
    }
}

One other issue arose recently where we wanted to use SBC but some of the data was not in the file itself; it was metadata that needed to be included on every row. The solution was to amend the CSV reader and the Columns collection to allow default values to be provided for columns that are not in the data.

The additional columns should be added at the end of the Columns collection to avoid interfering with the parsing, see the amended example below...

public void Import(string fileName, string connectionString)
{
    using (var reader = new CsvReader(new StreamReader(fileName), false))
    {
        reader.Columns = new List<LumenWorks.Framework.IO.Csv.Column>
        {
            ...
            new LumenWorks.Framework.IO.Csv.Column { Name = "Volume", Type = typeof(int) },
            // NB Fake column so bulk import works
            new LumenWorks.Framework.IO.Csv.Column { Name = "Ticker", Type = typeof(string) },
        };

        // Fix up the column defaults with the values we need
        reader.UseColumnDefaults = true;
        reader.Columns[reader.GetFieldIndex("Ticker")].DefaultValue = Path.GetFileNameWithoutExtension(fileName);

        // Now use SQL Bulk Copy to move the data
        using (var sbc = new SqlBulkCopy(connectionString))
        {
            ...
            sbc.AddColumnMapping("Ticker", "Ticker");

            sbc.WriteToServer(reader);
        }
    }
}

To give an idea of performance, using SBC took a naive sample app using an ORM from 2m 27s down to 1.37s, and the full import took just over 11 minutes to import 9.8m records.

Null Byte Removal StreamReader

Use NullRemovalStreamReader when CSV files contain a large number of null bytes and you do not control how the CSV files are generated.

If you have experienced a System.OutOfMemoryException or long processing times with such files, you will most likely get a huge performance gain from NullRemovalStreamReader.

public void Process(string path, bool addMark)
{
    using (StreamReader stream = new StreamReader(path))
    using (CsvReader csv = new CsvReader(stream.BaseStream, false, stream.CurrentEncoding, addMark))
    // or using (CsvReader csv = new CsvReader(File.OpenRead(path), false, Encoding.UTF8, addMark))
    {
        while (csv.ReadNextRecord())
        {
            for (int i = 0; i < csv.FieldCount; i++)
            {
                string data = csv[i];
                // do stuff with the field value
            }
        }
    }
}

When addMark is true, consecutive null bytes will be replaced by [removed x null bytes] to indicate the removal; you can see this in the benchmark output below.

Performance differences when tested with 20 million null bytes (20MB in storage):

CsvReader -     without using NullRemovalStreamReader : 536968 ticks, 0.2120 sec., 94.3518 MB/sec.

CsvReader - with NullRemovalStreamReader without mark : 191137 ticks, 0.0755 sec., 265.0660 MB/sec.
AddMark =(False) LastCell =(cell63 followed by 20971520 null bytes)

CsvReader - with NullRemovalStreamReader with    mark : 168819 ticks, 0.0666 sec., 300.1079 MB/sec.
AddMark =(True) LastCell =(cell63 followed by 20971520 null bytes[removed 20971520 null bytes])

Adjust the number of null bytes in the benchmark to see how much memory/time you can save:

X:\Path\CsvReader\build\Debug\CsvReaderBenchmarks\net461>CsvReaderBenchmarks.exe NullRemoval

Performance

One of the main reasons for using this library is its excellent performance on reading/parsing raw data; here's a recent run of the benchmark (which is included in the source):

Test pass #1 - All fields

CsvReader - No cache      : 3134597 ticks, 1.2374 sec., 35.5582 MB/sec.
CachedCsvReader - Run 1   : 7452030 ticks, 2.9418 sec., 14.9571 MB/sec.
CachedCsvReader - Run 2   : 4525 ticks, 0.0018 sec., 24632.1821 MB/sec.
TextFieldParser           : 31568009 ticks, 12.4617 sec., 3.5308 MB/sec.
Regex                     : 11273590 ticks, 4.4503 sec., 9.8869 MB/sec.

Test pass #1 - Field #72 (middle)

CsvReader - No cache      : 2358656 ticks, 0.9311 sec., 47.2560 MB/sec.
CachedCsvReader - Run 1   : 7119186 ticks, 2.8104 sec., 15.6564 MB/sec.
CachedCsvReader - Run 2   : 325 ticks, 0.0001 sec., 342955.7662 MB/sec.
TextFieldParser           : 31171440 ticks, 12.3052 sec., 3.5757 MB/sec.
Regex                     : 5793093 ticks, 2.2869 sec., 19.2403 MB/sec.


Test pass #2 - All fields

CsvReader - No cache      : 2941954 ticks, 1.1614 sec., 37.8866 MB/sec.
CachedCsvReader - Run 1   : 7204077 ticks, 2.8439 sec., 15.4719 MB/sec.
CachedCsvReader - Run 2   : 314 ticks, 0.0001 sec., 354970.1401 MB/sec.
TextFieldParser           : 31213609 ticks, 12.3218 sec., 3.5709 MB/sec.
Regex                     : 11095897 ticks, 4.3802 sec., 10.0452 MB/sec.

Test pass #2 - Field #72 (middle)

CsvReader - No cache      : 2186909 ticks, 0.8633 sec., 50.9672 MB/sec.
CachedCsvReader - Run 1   : 7131654 ticks, 2.8153 sec., 15.6290 MB/sec.
CachedCsvReader - Run 2   : 296 ticks, 0.0001 sec., 376556.1622 MB/sec.
TextFieldParser           : 31381026 ticks, 12.3879 sec., 3.5518 MB/sec.
Regex                     : 5151353 ticks, 2.0335 sec., 21.6372 MB/sec.


Test pass #3 - All fields

CsvReader - No cache      : 2693834 ticks, 1.0634 sec., 41.3762 MB/sec.
CachedCsvReader - Run 1   : 7105358 ticks, 2.8049 sec., 15.6868 MB/sec.
CachedCsvReader - Run 2   : 326 ticks, 0.0001 sec., 341903.7546 MB/sec.
TextFieldParser           : 31323784 ticks, 12.3653 sec., 3.5583 MB/sec.
Regex                     : 11303752 ticks, 4.4622 sec., 9.8605 MB/sec.

Test pass #3 - Field #72 (middle)

CsvReader - No cache      : 2177773 ticks, 0.8597 sec., 51.1810 MB/sec.
CachedCsvReader - Run 1   : 7326816 ticks, 2.8923 sec., 15.2127 MB/sec.
CachedCsvReader - Run 2   : 328 ticks, 0.0001 sec., 339818.9756 MB/sec.
TextFieldParser           : 31168390 ticks, 12.3040 sec., 3.5761 MB/sec.
Regex                     : 5134853 ticks, 2.0270 sec., 21.7067 MB/sec.


Done

This was run on a high-spec machine (Xeon E5-2640, 32GB RAM and a 1TB M.2 SSD; you have to have some toys!), so the overall throughput would be good, but CsvReader performs at 10x the speed of TextFieldParser and 5x faster than Regex.

csvreader's People

Contributors

andrewpsy, forki, fretje, gap777, jakubmaguza, jonreis, joshgo, kiran94, martin005, milcondoin, molopony, phatcher, sebcriteo, slorion, spintronic


csvreader's Issues

Compare with CsvHelper

Can you compare this library with CsvHelper?
What are advantages/ disadvantages to use one library or another?
As a new user, why should I choose this library?
It would be good to add the answer to readme.md.

Privacy issues in exception handling

_csvReader.ParseError.
When checking my error log file the following is found
"OnCsvParseError The CSV appears to be corrupt near record '3174' field '1120 at position '1120'. Current raw data : "
this is then followed by many lines of the contents of the csv file which contains data about users.
It was brought to my attention that this might cause legal issues and I was asked to fix it.
Is there any way to configure csv reader so as not to dump the data as part of the exception?
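
One way to avoid dumping the data is to handle parse errors yourself rather than letting the exception message (which embeds the raw record) reach the log. Below is a hedged sketch, not an official recipe; it assumes the ParseErrorEventArgs.Error and ParseErrorAction members used elsewhere in these issues, and that logging only the record/field indexes is enough for your purposes.

using System;
using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadWithSanitisedErrors(string fileName)
{
    using (var csv = new CsvReader(new StreamReader(fileName), true))
    {
        // Raise an event instead of throwing, so we control what gets logged
        csv.DefaultParseErrorAction = ParseErrorAction.RaiseEvent;
        csv.ParseError += (sender, args) =>
        {
            // Log the location only - deliberately not args.Error.RawData
            Console.Error.WriteLine("CSV parse error near record {0}, field {1}.",
                args.Error.CurrentRecordIndex, args.Error.CurrentFieldIndex);

            // Skip the bad line and carry on
            args.Action = ParseErrorAction.AdvanceToNextLine;
        };

        while (csv.ReadNextRecord())
        {
            // process the record as normal
        }
    }
}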

[Question] Faster way to read the csv

Hey guys and thanks for the awesome work!

Could a ReadAllLines() method, followed by a Parallel.For loop to process all the lines in parallel, be faster than the single StreamReader currently used?

Start parsing at specific row

Hi, I need the capability to start parsing at a specific row, as some of the CSV files contain information at the top that is not the column headers. Therefore I want to skip the first few rows (the row count varies). Can this be included in the library, or what would be the best approach to this?

Thanks!
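
There is no skip-rows constructor argument documented above, but one simple approach is to discard the leading lines on the underlying StreamReader before handing it to CsvReader. A minimal sketch, assuming the number of rows to skip is known (skipRows is a hypothetical parameter):

using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadCsvSkippingLeadingRows(string fileName, int skipRows)
{
    var stream = new StreamReader(fileName);

    // Throw away the preamble lines that precede the real header row
    for (int i = 0; i < skipRows; i++)
    {
        stream.ReadLine();
    }

    // The next line the reader sees is now treated as the header row
    using (var csv = new CsvReader(stream, true))
    {
        while (csv.ReadNextRecord())
        {
            // process the record
        }
    }
}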

Index out of bound when reading TSV w/o header

The problem seems to lie in the fact that the Columns list is initialized to an empty list, but in object FieldValue(int i) it is only checked for null, not for being empty; this causes an attempt to use the Columns list for data conversion and a failure to find the column, since the list is empty.

Sample code:

using (var streamReader = new StreamReader(_sourceBcpFilePath))
{
    using (var bcpReader = new CsvReader(streamReader, false, _delimiter, _quote, _escape, _comment, ValueTrimmingOptions.All))
    {
        // ...
        bcpReader.GetValue(0);
        // ...
    }
}

Problems "without headers" using DataTable in Version 3.9.1.0

I see an issue "Without Header " was resolved for dgv
However, I'm getting that same error with DataTable
"Field index must be included in [0, FieldCount[. Specified field index was : '0'. "

I used your test code, switched to a DataTable:

var csv = new CachedCsvReader(new StringReader("12345678;Hello\r\n78945612;World"), false, ';');
DataTable dt = new DataTable();
dt.Load(csv);

( I have no issues with headers=true)
Thanks
CSVReader is great!

how to skip first n lines

A lot of machine-generated data has some leading lines that are not comments (no # prefix). Can we call the function as csvreader(filename, lines skipped, delimiter, xxxx)?

Many thanks

Difficult to find invalid data when using SqlBulkCopy

A CSV file with content that is invalid (example: a field is too long compared to the column it's being inserted into) is difficult to error check, as it does not give any information on the row, position or column names.

It would be nice to be able to add a restriction to the Column, such as MaxLength, or for the error message to give a better hint regarding the location of the invalid data.

DateTime serializes to Local Time

The CsvReader serializer is transforming the string "2017-01-09T00:01:47.6145940Z" into local time, but I need to keep the records in UTC.

There is no way to select the DateTimeStyle in the configuration

using (var reader = new StreamReader(lakeStream))
using (var csv = new CsvReader(reader, new CsvConfiguration()
{
    HasHeaderRecord = hasHeaderRecord,
    Delimiter = delimiter,
    IgnoreBlankLines = true,
    IsHeaderCaseSensitive = false,
    // Missing: DateTimeStyle = DateTimeStyle.AssumeUniversal
}))
{
    rows = csv.GetRecords().ToList();
}

Unit test fails in debug mode at CsvReaderIDataReaderTest.GetSchemaTableWithoutHeadersTest()

Hi,

I was trying to run the unit tests in AppVeyor, but it seems to hang indefinitely at a certain step. See AppVeyor log here: https://ci.appveyor.com/project/joshgo/csvreader/build/1.0.0.9. As you can see, the build gets automatically killed/cancelled by AppVeyor after 1 hour.

I looked into this a little more, and I discovered that the CsvReaderIDataReaderTest.GetSchemaTableWithoutHeader() causes a Debug.Assert() call to be triggered (at CsvReader.cs:831). A dialog box pops up with the stacktrace, which is why the build never finishes. When the unit tests are invoked on the release build, this assertion does not fail or do anything. Below are the log messages/call stack:

at CsvReader.EnsureInitialize()  C:\...\CsvReader\code\LumenWorks.Framework.IO\Csv\CsvReader.cs(831)
at CsvReader.System.Data.IDataReader.GetSchemaTable()  C:\...CsvReader\code\LumenWorks.Framework.IO\Csv\CsvReader.cs(2075)
at CsvReaderIDataReaderTest.GetSchemaTableWithoutHeadersTest()  C:\...\CsvReader\code\LumenWorks.Framework.Tests.Unit\IO\Csv\CsvReaderIDataReaderTest.cs(138)

Here is the offending line:
Debug.Assert(Columns.Count > 0 || (Columns.Count == 0 && _fieldHeaderIndexes == null));

This fails because _fieldHeaderIndexes is not null, so the condition evaluates to false.

The fix should be:

Debug.Assert(Columns.Count > 0 || (Columns.Count == 0 && (_fieldHeaderIndexes == null || _fieldHeaderIndexes.Count == 0)));

This handles both the null and not null case.

SBC and MetaData example throws exception

Code based on the example provided in documentation

using (SqlConnection connection = new SqlConnection(connectionString))
{
    SqlBulkCopy sbc = new SqlBulkCopy(connection);
    sbc.DestinationTableName = "CsvReaderDeaultValueTest";

    using (LumenWorks.Framework.IO.Csv.CsvReader reader = new LumenWorks.Framework.IO.Csv.CsvReader(new StreamReader(fileName), true))
    {
        reader.Columns.Add(new LumenWorks.Framework.IO.Csv.Column { Name = "Meta", Type = typeof(String) });
        reader.UseColumnDefaults = true;
        reader.Columns[reader.Columns.Count - 1].DefaultValue = Path.GetFileName(fileName);

        string columnHeader = String.Empty;
        string pattern = @"[\s\(\)\-\/\:]+";
        Regex regex = new Regex(pattern);
        foreach (var column in reader.Columns)
        {
            columnHeader = regex.Replace(column.Name, String.Empty);
            sbc.ColumnMappings.Add(columnHeader, columnHeader);
        }

        connection.Open();
        sbc.BatchSize = 10000;
        sbc.WriteToServer(reader); // write base rows to db
    }
}

I get the error

'Meta' field header not found.
Parameter name: name

I found that when we do sbc.WriteToServer(reader) it makes a call to LumenWorks.Framework.IO.Csv.CsvReader.IDataRecord.GetOrdinal(string name) and for the column that was added ("Meta" in this example) it throws an ArgumentException when trying to do _fieldHeaderIndexes.TryGetValue(name, out index)

How did you guys do the SBC using the CsvReader?
I noticed that you had an sbc.AddColumnMapping() method, which is not native to the SqlBulkCopy class. Did you write an extension method to make this work?
Thanks.

Static Fields?

Is there a way to add static fields?
EX: When loading from file and putting into the DB, we have a FileId that we generate before the insert, so we would want to add this, but it doesn't exist in the file. The code seems to blow up if I try to do it.

Field index must be included in [0, FieldCount[. Specified field index was : '16'.
Parameter name: field
Actual value was 16.

ref:

using (var reader = new CsvReader(new StreamReader(blobFile.OpenRead()), false, '|'))
{
    reader.MissingFieldAction = MissingFieldAction.ReplaceByNull;

    reader.ParseError += (sender, args) =>
    {
        var error = args.Error;
    };

    var cols = new List<Column>();
    cols.Add(new Column { Name = "FileLogId", Type = typeof(string), OverrideValue = fileLogId.ToString(), DefaultValue = fileLogId.ToString() });

    foreach (var field in fileDefinitionFields.Where(x => x.FieldOrder > 0))
    {
        cols.Add(new Column { Name = field.ColumnName, Type = typeof(string) });
    }

    reader.Columns = cols;

    // Now use SQL Bulk Copy to move the data
    using (var sbc = bulkCopy)
    {
        sbc.WriteToServer(reader);
    }
}

Can I easily trim tabs?

Is there an option to easily trim tabs? For example, string.Trim() trims tabs as well as spaces, but csvReader.Configuration.TrimOptions = TrimOptions.Trim; only trims spaces.

Quote standards

RFC 4180 says:

  • Fields containing a line-break, double-quote, and/or commas should be quoted. (If they are not, the file will likely be impossible to process correctly).
  • A (double) quote character in a field must be represented by two (double) quote characters.

This means that a field containing only one double-quote character is represented in a CSV file as """" (4 double-quote characters). Unfortunately, CsvReader throws an exception in this case.

OutOfMemoryException with large Files?

Hi,
with small files up to 200MB it works fine, but when I try to load a file of 400MB or larger
I receive an OutOfMemoryException. Am I missing something, or do I have to set a possible file size before loading a CSV into a DataTable?
In my case there is a file with more than 250000 lines and over 400MB in size.
Does anyone have the same issue?
I didn't define any schema.ini file beforehand; is that what I'm missing?

Thanks
MirsoH

Permitting line breaks within field

I need line breaks to be treated the same way quoted commas are: if a line break occurs within a field (e.g. inside quotes), do not treat it as end-of-record (EOR).

Possible?

thx
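
A hedged sketch of one thing to try, assuming the SupportsMultiline property (referenced in another issue further down) is what controls whether a quoted field may span line breaks:

using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadMultilineFields(string fileName)
{
    using (var csv = new CsvReader(new StreamReader(fileName), true))
    {
        // Assumption: with multiline support enabled, a line break inside a
        // quoted field stays part of the field instead of ending the record
        csv.SupportsMultiline = true;

        while (csv.ReadNextRecord())
        {
            // process the record
        }
    }
}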

Porting to .net portable

I am a Xamarin developer. Would you be able to port this package to PCL?

This is the error I am getting, trying to add the nuget package to my PCL project

Error - Could not install package 'LumenWorksCsvReader 3.9.1'. You are trying to install this package into a project that targets '.NETPortable,Version=v4.5,Profile=Profile111', but the package does not contain any assembly references or content files that are compatible with that framework. For more information, contact the package author.

ContainsFieldHeader Method for CsvReader

Currently, if you pass a field header that does not exist into the string indexer, an ArgumentException is thrown.

The feature request is for a method to check whether a field header exists in the current instance of the CsvReader object to avoid this, much like IDictionary.ContainsKey() in .NET.
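
Until such a method exists, a small workaround is possible with GetFieldHeaders(), which is used elsewhere in this document. Below is a hedged sketch of a hypothetical extension method, not part of the library; note the comparison is a simple case-sensitive one.

using System;
using LumenWorks.Framework.IO.Csv;

public static class CsvReaderExtensions
{
    // Hypothetical helper: true if the header row contains the given name
    public static bool ContainsFieldHeader(this CsvReader csv, string name)
    {
        return Array.IndexOf(csv.GetFieldHeaders(), name) >= 0;
    }
}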

Null versus Empty String fields

Hi

First, thanks for an awesome piece of code. Very useful.

I am testing for null and empty string values. e.g.

Row 1 - "A",,"C" is returned by CSVReader as the string array [A","","C"]
Row 2 - "A","","C" is also returned as [A","","C"]

There doesn't seem to be any distinction between a null string and an empty string.

I would like row 1 to return as ["A",null,"C"]

Empty string and null can have different meanings in some applications.

Question 1. Is there a way to affect that behavior now ?
Question 2. If not, is it possible to add a constructor parameter or property, with some logic, to change the behavior.

Thanks - Adam

CsvReader.ReadNextRecord throws IndexOutOfRangeException

CsvReader.ReadNextRecord throws an IndexOutOfRangeException when reading the file from the attached archive. You can run the following code to reproduce it (the path variable should contain the path to the extracted file):

using (var file = File.OpenText(path))
using (var csv = new CsvReader(file, false))
{
  while (csv.ReadNextRecord()) { }
}

Found via SharpFuzz.

Troubleshooting Column Limits in Bulk Copy

Hello,

I have been using CSVReader successfully for two years now in a database import program I run, however I am running into an issue I need some help troubleshooting.

I have a CSV file I am importing from a third party with an obscene amount of columns. The current total column count in this file is 196. I realize this is an extreme edge case, but I cannot alter this file and the import is not optimal, but it's what I have to work with.

Right now, I am easily importing 50-60 columns with no problem, but at about column 150 the import seems to break and throws a malformed exception. The file itself is clean, and is not malformed and I have verified in another program that I can get the data into the database without error.

I am thinking that this is a buffer issue (stream issue), but I am wondering if you have any hard limits on Column count?

Also, do you have any advice for such an odd use case, or troubleshooting tips for a large column and field set? This may also be relevant to other developers, so I wanted to post here.

Thank you for your work on this project!

Replacement for FastCsvReader from its author

Hello, I am the original author of FastCsvReader. It's been a long while since I wrote that library and I had a replacement which was on Codeplex that I recently put on github. Upon seeing your project, I thought it might interest you.

You can take a look at https://github.com/slorion/nlight for the library. The CSV reader is https://github.com/slorion/nlight/blob/master/src/NLight/IO/Text/DelimitedRecordReader.cs. You can go to https://github.com/slorion/nlight/tree/master/src/NLight.Tests.Unit/IO/Text to see it in action or https://github.com/slorion/nlight/tree/master/src/NLight.Tests.Benchmarks to benchmark it (including against my old reader).

I plan to release the library on nuget later this week, after adding some more unit tests.

p.s. My name is Sébastien Lorion; you misspelled it as "Sebastian Lorien" in a couple of places :)

Error in parsing when " is the first char in the row text and " is a part of the column text

this row is parsed correctly:

6224613;SINUS POLSKA, MIEDZYRZECZ;

this is not parsed correctly:
6224613;"SINUS POLSKA", MIEDZYRZECZ;

but this is parsed correctly too:

6224613;SINUS "POLSKA", MIEDZYRZECZ;

This means that if a column contains " as its first character then I get an error.
The CSV is created by an external source and I can't change this.

CsvReader.GetValue(s) method does not work as expected, but GetXXX works perfectly

Hello! I have some project which uses your library for loading CSV data into SQLite storage (you can see code at https://github.com/shmutalov/System.Data.Csv)

Firstly I tried to load row values with a single call to CsvReader's GetValues/GetValue method. I don't know why, but it doesn't work as expected.

Then I looked into your source (not very deeply) and found that the GetXXX methods have the same behaviour as GetString, which gets the value by accessing the list item through "this[i] => string". But the GetValue method calls FieldValue(int field), which will throw an error if Columns were not initialized.

CsvReader throws MalformedCsvException when only some columns are mapped to SqlBulkCopy

Hi,

The behavior of the CsvReader and CachedCsvReader differ from each other when only some columns are mapped to SqlBulkCopy column collection. The CsvReader class throws MalformedCsvException while CachedCsvReader works as expected. The following (simplified from real project) code snippets should reproduce the behavior:

// throws MalformedCsvException at character position 2092
using (CsvReader parser = new CsvReader(new StreamReader(@"C:\Temp\Sample.txt"), true, '\t')) {
    using (SqlBulkCopy loader = new SqlBulkCopy("...")) {
        loader.DestinationTableName = "Sample";
        loader.ColumnMappings.Add("ColA", "ColA");
        loader.WriteToServer(parser);
    }
}

// works
using (CsvReader parser = new CachedCsvReader(new StreamReader(@"C:\Temp\Sample.txt"), true, '\t')) {
    using (SqlBulkCopy loader = new SqlBulkCopy("...")) {
        loader.DestinationTableName = "Sample";
        loader.ColumnMappings.Add("ColA", "ColA");
        loader.WriteToServer(parser);
    }
}

Is this a bug or by design?

The sample input file contains two tab-separated text columns (ColA, ColB) with headers.
The sample table contains a single text column (ColA).

CsvReader and SqlBulkCopy is expected to pump the content of the first column (ColA) from the input file to the table.

Kind regards,
Olli-Pekka Isola

CurrentRecordIndex does not show the right index value

Hi,

I'm testing this to apply the Custom Error Handling Using Events scenario following http://www.codeproject.com/Articles/9258/A-Fast-CSV-Reader

When I extract the value from ParseErrorEventArgs e.Error.CurrentRecordIndex, the index number does not come from the total input lines, but rather from the parsed records.

E.g. if there were problems on lines 3 and 5, CurrentRecordIndex is reported as 2 and 3.
The field number provided by the error argument is also not correct.

Regards
Joe

How to read smart quotes using csvreader?

Hi,

I want to use CSV Reader to read a CSV file that contains foreign characters ("iso-8859-1") and also smart quotes (http://practicaltypography.com/straight-and-curly-quotes.html).

For example the file looks something like this with header.

Column 1,Column 2,Column 3
"""This"" is ""My"" test ""Title""",Screenplay 'by' 'Will',"""This"" is ""My"" test ""Title"" notes notes notes"
“This” is my “Second” test “Title”,Second ‘Screenplay’ by ‘Will’,“This” is my “Second” test “Title”

Smart Quotes.pdf

Thanks,
Vivek
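
Smart quotes are just characters as far as the parser is concerned, so the key is opening the file with the correct encoding. A minimal sketch, assuming the file really is ISO-8859-1 (curly quotes usually come from Windows-1252 or UTF-8, so adjust the encoding to match the file):

using System.IO;
using System.Text;
using LumenWorks.Framework.IO.Csv;

void ReadCsvWithEncoding(string fileName)
{
    // Assumption: the source file is ISO-8859-1; swap in Encoding.UTF8 or
    // Encoding.GetEncoding(1252) if the curly quotes come from Windows-1252
    var encoding = Encoding.GetEncoding("iso-8859-1");

    using (var csv = new CsvReader(new StreamReader(fileName, encoding), true))
    {
        while (csv.ReadNextRecord())
        {
            // curly quotes arrive as ordinary characters in the field values
        }
    }
}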

Problems without headers and Version 3.9.0

I have a problem with the version 3.9.0.

When I read a CSV like:

CSV File:

12345678;Hello
78945612;World

C# Code:

using (CachedCsvReader csv = new CachedCsvReader(new StreamReader(csvFile, Encoding.Default), false, ';'))
myDataGridView.DataSource = csv;

The following error message appears:

Field index must be included in [0, FieldCount[. Specified field index was : '0'.

But in version 3.8.3 it works perfectly fine.

FieldCount not increased when Adding custom column

While I can get this sample code to print out the value of my custom column correctly, it does not work with SBC. SBC will not see my new column.

Working sample code:

using (StreamReader file = new StreamReader(_workingFile))
{
    using (CsvReader csv = new CsvReader(file, true, '\t'))
    {
        csv.GetFieldHeaders();

        csv.Columns.Add(new LumenWorks.Framework.IO.Csv.Column { Name = "Record_Date_UTC", Type = typeof(string) });
        csv.UseColumnDefaults = true;
        csv.Columns.Where(x => x.Name == "Record_Date_UTC").FirstOrDefault().DefaultValue = "Blah";
        int fieldCount = csv.FieldCount;
        fieldCount++; // Having to manually increase this by 1 to account for custom column

        string[] headers = csv.GetFieldHeaders();

        while (csv.ReadNextRecord())
        {
            for (int i = 0; i < fieldCount; i++)
                Console.Write(string.Format("{0} = {1};", headers[i], csv[i]));
            Console.WriteLine();
        }
    }
}

SBC code throws "System.ArgumentException: 'Record_Date_UTC' field header not found.":

using (CsvReader csv = new CsvReader(file, true, '\t'))
{
    csv.GetFieldHeaders();
    csv.Columns.Add(new LumenWorks.Framework.IO.Csv.Column { Name = "Record_Date_UTC", Type = typeof(DateTime), DefaultValue = fileDate.ToShortDateString() });
    csv.UseColumnDefaults = true;
    var fieldCount = csv.FieldCount;
}

Only processing one record?

When I parse my CSV file, my CachedCsvReader only returns one Record in the Record property.

But if I look under the Raw View, Non-Public members, _items, I see all 314 records.

Customizable CultureInfo

Hello phatcher,

Firstly thanks for the nice tool.

Now, I am using the CsvReader for importing data in the Czech environment, where the CultureInfo prescribes using a comma as the decimal separator. On the other hand, for CSV files it is often the case that the data is international and uses a dot as the decimal separator.

I am having issues with reading a file with Decimal values with dot (described at
https://stackoverflow.com/questions/53830123/lumenworks-csvreader-how-to-read-decimal-using-invariant-culture).

Peeking in the source, I found code like:

        decimal IDataRecord.GetDecimal(int i)
        {
            ValidateDataReader(DataReaderValidations.IsInitialized | DataReaderValidations.IsNotClosed);
            return decimal.Parse(this[i], CultureInfo.CurrentCulture);
        }

Could the culture info be replaced by CultureInfo.InvariantCulture, or better (say), could it be replaced by a property that could be changed at the instance level?

Thanks and wishing you nice Christmas time and Happy new year.
P.

Exception parsing double-quoted fields

Exception: "MailformedCSVException" while parsing the following file:

h1,h2,h3
aa,""bb"",cc

I expect to be able to parse this file with the value of the second column being ""bb"" or bb, depending on the trimming options.

GetFieldIndex returns -1 when setting custom columns and specifying hasHeaders = true

Given I use hasHeaders: true in the CsvReader constructor and want to manually specify columns with the correct types, _fieldHeaderIndexes doesn't get populated, which results in a broken GetFieldIndex method.

This is because, given hasHeaders = true, _fieldHeaderIndexes gets populated (line 1592) only when no Columns have been added beforehand.

My workaround is to initialize CsvReader with hasHeaders: false and then call csv.ReadNextRecord() once before the while (csv.ReadNextRecord()) { ... } loop; see the sketch below.
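
A hedged sketch of that workaround, with hypothetical column names; the point is simply to supply Columns yourself, tell the reader there are no headers, and consume the physical header row before the main loop:

using System;
using System.Collections.Generic;
using System.IO;
using LumenWorks.Framework.IO.Csv;

void ReadWithManualColumns(string fileName)
{
    // hasHeaders: false, so the manually supplied Columns drive the field lookups
    using (var csv = new CsvReader(new StreamReader(fileName), false))
    {
        csv.Columns = new List<Column>
        {
            new Column { Name = "PriceDate", Type = typeof(DateTime) },
            new Column { Name = "OpenPrice", Type = typeof(decimal) },
        };

        // Consume the physical header row so it is not treated as data
        csv.ReadNextRecord();

        while (csv.ReadNextRecord())
        {
            // process the record
        }
    }
}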

Error: 'FileId' field header not found. Parameter name: name

Hi

I am new to LumenWorks, my apologies if I have missed something in the documentation.

I have a CSV file with 200-odd columns. Typically there would be close to 65000 rows. Importing these rows into a SQL Database Table was fine, until I added a new field in the SQL Database Table called "FileId" - which does not exist in the CSV File. I wish to inject this field and the relevant value.

How do I do this please?

const string fileToWorkWith = @"C:\Data\Fidessa ETP Files\Import\2019\myCsvFile.csv";
Output.WriteLn($"Working with file {fileToWorkWith}.");

const string databaseConnectionString = "Server=MyServer;Database=DB;User Id=sa; Password = xyz;";

Output.WriteLn($"Checking if working file exists.");

if (new System.IO.FileInfo(fileToWorkWith).Exists == false)
{
    Output.WriteLn("Working file does not exist.", Output.WriteTypes.Error);
    return;
}

Output.WriteLn("Reading file.");
using (var reader = new CsvReader(new StreamReader(fileToWorkWith), true, char.Parse(",") ))
{
    reader.Columns = new List<LumenWorks.Framework.IO.Csv.Column>
    {
        new LumenWorks.Framework.IO.Csv.Column { Name = "FileId", Type = typeof(int), DefaultValue = "1" },
    };
    reader.UseColumnDefaults = true;
    
    Output.WriteLn("Checking fields in file exist in the Database.");
    foreach (var fieldName in reader.GetFieldHeaders())
    {
        
        if (Fields.IsValid(fieldName.Replace(" ","_")) == false)
        {
            Output.WriteLn($"A new field named {fieldName} has been found in the file that does not exist in the database.", Output.WriteTypes.Error);
            return;
        }
    }

    using (var sbc = new SqlBulkCopy(databaseConnectionString))
    {
        sbc.DestinationTableName = "FidessaETP.tableARC_EventsOrderAndFlow_ImportTest";
        sbc.BatchSize = 1000;

        Output.WriteLn("Mapping available Csv Fields to DB Fields");
        foreach (var field in reader.GetFieldHeaders().ToArray())
        {
            sbc.ColumnMappings.Add(field, field.Replace(" ", "_"));
        }

        sbc.WriteToServer(reader);
    }

}

The Error Details

Message: 'FileId' field header not found. Parameter name: name

Source:  LumenWorks.Framework.IO

Stack Trace:
System.ArgumentException: 'FileId' field header not found.
Parameter name: name
at LumenWorks.Framework.IO.Csv.CsvReader.System.Data.IDataRecord.GetOrdinal(String name)
at System.Data.SqlClient.SqlBulkCopy.WriteRowSourceToServerCommon(Int32 columnCount)
at System.Data.SqlClient.SqlBulkCopy.WriteRowSourceToServerAsync(Int32 columnCount, CancellationToken ctoken)
at System.Data.SqlClient.SqlBulkCopy.WriteToServer(IDataReader reader)
at Haitong.Test.CsvImporter.Program.Main(String[] args) in C:\Development\Workspaces\UK OPS Data Warehouse\UK OPS Data Warehouse\Haitong.Test.CsvImporter\Program.cs:line 86

Tab Delimited doesn't support quotes and newlines

From CodeProject comments...

Just playing around with this and thought you should know this will throw exceptions if there are unescaped quotes and/or newlines (not CR/LF, just LF) in tab-delimited CSV. These seem like fairly common CSV traits to me; maybe this should be fixed, right?

Problem related to CsvReader.SkipEmptyLines in 3.9.1.0

If SkipEmptyLines is false, the CopyCurrentRecordTo call in the code shown below works for me on file A, but throws the LumenWorks.Framework.IO.Csv.MissingFieldCsvException exception on file B. If SkipEmptyLines is true, the CopyCurrentRecordTo call works as expected for both files.

CSV File A:

Field1,Field2
Value1,Value2

CSV File B:

"Field1","Field2"
"Value1","Value2"

C# Code:

using (CsvReader reader = new CsvReader(new StreamReader(fileName), true)) {
  reader.SkipEmptyLines = false;
  reader.ReadNextRecord();
  string[] values = new string[2];
  reader.CopyCurrentRecordTo(values);
}

AppVeyor failure

AppVeyor does not seem to be completely configured. It does not have the NUnit binaries, therefore it can't run the tests or detect some of the NUnit-related classes. Even though AppVeyor is failing, the project is actually fine because all the unit tests pass when run via nunit-console / GUI.

Possible fixes are:

  1. add a "nuget restore" call in AppVeyor
  2. call the supplied build scripts (**)

**I have used the build scripts on my machine, and they work fine. I have not had a chance to try in appveyor and verify they work there too.

Why is CsvReader forward only?

More of a question, initially thought it was a bug.

I need to know how many lines are in my csv file.
So I did a myCsvReader.Count() (Linq) to get it. Of course, this moves the cursor forward to the end as it has to iterate it. But then I called myCsvReader.GetEnumerator().Reset() or even myCsvReader.MoveTo(-1) and they have no effect.

I found this line of code in MoveTo:
if (record < _currentRecordIndex) return false;
which confirms my thought that this is a forward-only reader.

This msdn page (https://msdn.microsoft.com/en-us/library/65zzykke(v=vs.100).aspx) says that

Iterators do not support the IEnumerator.Reset method. To re-iterate from the beginning, you must obtain a new iterator.

But I've tried to do var en2 = myCsvReader.GetEnumerator(); and that doesn't return a new iterator. It returns the existing iterator which points to the end.

Why? Why can't I go back to the beginning?
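
Given the forward-only design, the usual pattern is to create a brand-new reader (over a brand-new StreamReader) for each pass over the file. A minimal sketch:

using System.IO;
using System.Linq;
using LumenWorks.Framework.IO.Csv;

void CountThenProcess(string fileName)
{
    int recordCount;

    // First pass: count the records (this exhausts the forward-only reader)
    using (var csv = new CsvReader(new StreamReader(fileName), true))
    {
        recordCount = csv.Count();
    }

    // Second pass: a fresh reader starts from the beginning again
    using (var csv = new CsvReader(new StreamReader(fileName), true))
    {
        while (csv.ReadNextRecord())
        {
            // process the record, with recordCount known in advance
        }
    }
}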

Exception in CsvReader.cs

Hi,
In CsvReader.cs, in the function private string ReadField(int field, bool initializing, bool discardValue), the line

throw new ArgumentOutOfRangeException(nameof(field), field, string.Format(CultureInfo.InvariantCulture, ExceptionMessage.FieldIndexOutOfRange, field));

throws an error:

Method not found: 'System.String System.String.Format(System.IFormatProvider, System.String, System.Object)

According to this, it is an issue with the string.Format overload, depending on the target and installed framework version. Is there a way to fix this in the project?

Missing field not detected

I'm using version 3.9.0.0.

In my little test program, the method CsvReader.CopyCurrentRecordTo(string[] array) throws LumenWorks.Framework.IO.Csv.MissingFieldCsvException in the case of example CSV 1 (below), but not in the case of example CSV 2 (below). In the case of CSV 2, as far as I can tell, the reader acts as though the third line in the file were this:

"3",""

Should I not be getting a missing field exception in CSV 2?

I'm using these settings:

CsvReader reader = new CsvReader(new StreamReader(fileName), true, ',', '"', '"', '~', ValueTrimmingOptions.None)

DefaultParseErrorAction = ParseErrorAction.ThrowException;
MissingFieldAction = MissingFieldAction.ParseError;
SkipEmptyLines = false;
SupportsMultiline = false;

Example CSV 1:

A,B
1,2
3
5,6

Example CSV 2:

"A","B"
"1","2"
"3"
"5","6"

CurrentRecordIndex Does not Update if SkipEmptyLines is True and a row is empty

For example

If you have a CSV file with 10 lines and lines 3 and 4 are empty, then with SkipEmptyLines == true, when it reads record 5 it will have an index of 3.

I'm trying to write logic to ignore the last X records using RowCount-CurrentRecordIndex to determine this but it falls down when the counter is not incremented for blank rows.

I'm happy to look at a pull request if this is a change you would accept into CsvReader.

Exception thrown with duplicated column names

Hi,

I have a csv file with duplicated column names.
When I try to read the csv the following exception is thrown "System.ArgumentException : An item with the same key has already been added."
Any ideas how I can avoid this?

Thanks
