Utf8StreamReader

Utf8 based StreamReader for high performance text processing.

Avoiding unnecessary string allocation is a fundamental aspect of recent .NET performance improvements. Given that most file and network data is in UTF8, features like JsonSerializer and IUtf8SpanParsable, which operate on UTF8-based data, have been added. More recently, methods like Split, which avoids allocations, have also been introduced.

However, for the most common use case of parsing strings delimited by newlines, only the traditional StreamReader is provided, which generates a new String for each line, resulting in a large amount of allocations.

[Figure: benchmark reading a simple text file of 1,000,000 lines]

Incredibly, there is a 240,000-fold difference!

While it is possible to process data in UTF8 format using standard classes like PipeReader and SequenceReader, they are generic libraries, so handling newlines properly requires considerable effort (handling the BOM and multiple types of newline characters).
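To give a sense of that effort, the following is a minimal sketch of manual line splitting with SequenceReader<byte> over data that is already fully buffered. The names ManualLineSplitter, SplitLines, and CopyWithoutTrailingCr are hypothetical, and each line is copied to a byte[] for simplicity. A real PipeReader loop would additionally have to deal with lines that span several reads.

```csharp
using System;
using System.Buffers;
using System.Collections.Generic;

public static class ManualLineSplitter
{
    private static readonly byte[] Utf8Bom = { 0xEF, 0xBB, 0xBF };

    // Splits a fully buffered UTF-8 payload into lines, accepting LF and CRLF.
    public static List<byte[]> SplitLines(ReadOnlySequence<byte> buffer)
    {
        var lines = new List<byte[]>();
        var reader = new SequenceReader<byte>(buffer);

        // Skip a leading UTF-8 BOM if one is present.
        reader.IsNext(Utf8Bom, advancePast: true);

        // Read up to each LF; the delimiter is consumed but not returned.
        while (reader.TryReadTo(out ReadOnlySequence<byte> line, (byte)'\n'))
        {
            lines.Add(CopyWithoutTrailingCr(line));
        }

        // Whatever remains is a final line without a terminator.
        if (!reader.End)
        {
            lines.Add(CopyWithoutTrailingCr(buffer.Slice(reader.Position)));
        }

        return lines;
    }

    // Drops the CR of a CRLF pair, then copies the line out of the buffer.
    private static byte[] CopyWithoutTrailingCr(ReadOnlySequence<byte> line)
    {
        if (line.Length > 0 && line.Slice(line.Length - 1).FirstSpan[0] == (byte)'\r')
        {
            line = line.Slice(0, line.Length - 1);
        }
        return line.ToArray();
    }
}
```

Even this simplified version needs separate handling for the BOM, the CRLF pair, and the unterminated last line.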

Utf8StreamReader provides a familiar API similar to StreamReader, making it easy to use, while its ReadLine-specific implementation maximizes performance.

Getting Started

This library is distributed via NuGet and supports .NET Standard 2.1, .NET 6 (.NET 7), and .NET 8 or above.

PM> Install-Package Utf8StreamReader

The basic API involves using var streamReader = new Utf8StreamReader(stream); and then ReadOnlyMemory<byte> line = await streamReader.ReadLineAsync();. When enumerating all lines, you can choose from three styles:

using System;
using System.IO;
using System.Text.Json;
using System.Threading.Tasks;
using Cysharp.IO; // namespace of Utf8StreamReader

public async Task Sample1(Stream stream)
{
    using var reader = new Utf8StreamReader(stream);

    // Most performant style, similar to System.Threading.Channels
    while (await reader.LoadIntoBufferAsync())
    {
        while (reader.TryReadLine(out var line))
        {
            // line is ReadOnlyMemory<byte>, deserialize UTF8 directly.
            _ = JsonSerializer.Deserialize<Foo>(line.Span);
        }
    }
}

public async Task Sample2(Stream stream)
{
    using var reader = new Utf8StreamReader(stream);

    // Classical style, same as StreamReader
    ReadOnlyMemory<byte>? line = null;
    while ((line = await reader.ReadLineAsync()) != null)
    {
        _ = JsonSerializer.Deserialize<Foo>(line.Value.Span);
    }
}

public async Task Sample3(Stream stream)
{
    using var reader = new Utf8StreamReader(stream);

    // Easiest style, using async streams
    await foreach (var line in reader.ReadAllLinesAsync())
    {
        _ = JsonSerializer.Deserialize<Foo>(line.Span);
    }
}

For performance reasons, Utf8StreamReader provides only asynchronous APIs.

Theoretically, the highest performance can be achieved by combining LoadIntoBufferAsync and TryReadLine in a double while loop. This is similar to the combination of WaitToReadAsync and TryRead in Channels.

ReadLineAsync, like StreamReader.ReadLine, returns null to indicate that the end has been reached.

ReadAllLinesAsync returns an IAsyncEnumerable<ReadOnlyMemory<byte>>. Although there is a performance difference, it is minimal, so this API is ideal when ease of use is the priority.

All asynchronous methods accept a CancellationToken and support cancellation.
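As a minimal sketch of passing a token, the example below counts lines and gives up once a timeout elapses. The class and method names (CancellationSample, CountLinesAsync) and the timeout parameter are hypothetical. Only LoadIntoBufferAsync and TryReadLine come from the API described above.

```csharp
using System;
using System.IO;
using System.Threading;
using System.Threading.Tasks;
using Cysharp.IO;

public static class CancellationSample
{
    // Counts lines, cancelling the read if it takes longer than the timeout.
    public static async Task<int> CountLinesAsync(Stream stream, TimeSpan timeout)
    {
        // The token is signalled automatically when the timeout elapses.
        using var cts = new CancellationTokenSource(timeout);
        using var reader = new Utf8StreamReader(stream);

        var count = 0;
        while (await reader.LoadIntoBufferAsync(cts.Token))
        {
            // TryReadLine is synchronous, so it takes no token.
            while (reader.TryReadLine(out _))
            {
                count++;
            }
        }
        return count;
    }
}
```

The same token could also be passed to ReadLineAsync or ReadAllLinesAsync when one of the other two styles is used.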

For a real-world usage example, refer to StreamMessageReader.cs in Cysharp/Claudia, a C# SDK for Anthropic Claude, which parses server-sent events.

Buffer Lifetimes

The ReadOnlyMemory<byte> returned from ReadLineAsync or TryReadLine is only valid until the next call to LoadIntoBufferAsync or ReadLineAsync. Since the data is shared with the internal buffer, it may be overwritten, moved, or returned on the next call, so the safety of the data cannot be guaranteed. The received data must be promptly parsed and converted into a separate object. If you want to keep the data as is, use ToArray() to convert it to a byte[].
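A minimal sketch of keeping lines beyond the next read follows. The names BufferLifetimeSample and ReadAllLinesCopiedAsync are hypothetical. The point is only that each line is copied with ToArray() before the reader is advanced again.

```csharp
using System.Collections.Generic;
using System.IO;
using System.Threading.Tasks;
using Cysharp.IO;

public static class BufferLifetimeSample
{
    // Reads every line and keeps an independent copy of each one.
    public static async Task<List<byte[]>> ReadAllLinesCopiedAsync(Stream stream)
    {
        using var reader = new Utf8StreamReader(stream);
        var lines = new List<byte[]>();

        while (await reader.LoadIntoBufferAsync())
        {
            while (reader.TryReadLine(out var line))
            {
                // Storing `line` itself would be unsafe: it points into the
                // reader's internal buffer. ToArray() makes a private copy.
                lines.Add(line.ToArray());
            }
        }
        return lines;
    }
}
```

Copying every line reintroduces one allocation per line, so this is only worthwhile when the bytes really must outlive the read loop.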

This design is similar to System.IO.Pipelines.

Optimizing FileStream

Similar to StreamReader, Utf8StreamReader has the ability to open a FileStream by accepting a string path.

public Utf8StreamReader(string path)
public Utf8StreamReader(string path, int bufferSize)
public Utf8StreamReader(string path, FileStreamOptions options)
public Utf8StreamReader(string path, FileStreamOptions options, int bufferSize)

Unfortunately, the FileStream used by StreamReader is not optimized for modern .NET. For example, when using FileStream with asynchronous methods, it should be opened with useAsync: true for optimal performance. However, since StreamReader has both synchronous and asynchronous methods in its API, false is specified. Additionally, although StreamReader has its own buffer and therefore does not need FileStream's, the FileStream buffer is still used.

Strictly speaking, FileStream underwent a major overhaul in .NET 6. The behavior is controlled by an internal FileStreamStrategy. For instance, on Windows, SyncWindowsFileStreamStrategy is used when useAsync is false, and AsyncWindowsFileStreamStrategy is used when useAsync is true. Moreover, if bufferSize is set to 1, the FileStreamStrategy is used directly, and it reads directly into the buffer passed via ReadAsync(Memory<byte>). If any other value is specified, it is wrapped in a BufferedFileStreamStrategy.

Based on these observations of the internal behavior, Utf8StreamReader generates a FileStream with the following options:

new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1, useAsync: true)

For overloads that accept FileStreamOptions, the above settings are not reflected, so please adjust them manually.
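As a sketch of that manual adjustment, the options below mirror the FileStream settings shown above. FileStreamOptions is available on .NET 6 or later, and the helper name CreateUnbufferedAsyncReadOptions is hypothetical.

```csharp
using System.IO;

public static class FileStreamOptionsSample
{
    // Builds options equivalent to:
    // new FileStream(path, FileMode.Open, FileAccess.Read, FileShare.Read, bufferSize: 1, useAsync: true)
    public static FileStreamOptions CreateUnbufferedAsyncReadOptions() => new FileStreamOptions
    {
        Mode = FileMode.Open,
        Access = FileAccess.Read,
        Share = FileShare.Read,
        BufferSize = 1,                     // 0 or 1 disables FileStream's own buffering
        Options = FileOptions.Asynchronous, // counterpart of useAsync: true
    };
}
```

The result could then be passed to one of the overloads listed above, for example new Utf8StreamReader(path, FileStreamOptionsSample.CreateUnbufferedAsyncReadOptions()).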

Reset

Utf8StreamReader is a class that supports reuse. By calling Reset(), the Stream and internal state are released. Using Reset(Stream), it can be reused with a new Stream.
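A minimal sketch of reuse follows, assuming Reset(Stream) leaves the reader ready to read the new Stream as described above. The names ResetSample, CountLinesAsync, and CountAsync are hypothetical.

```csharp
using System.IO;
using System.Threading.Tasks;
using Cysharp.IO;

public static class ResetSample
{
    // Counts the lines of several streams with a single reader instance.
    public static async Task<int> CountLinesAsync(Stream first, params Stream[] rest)
    {
        using var reader = new Utf8StreamReader(first);
        var total = await CountAsync(reader);

        foreach (var next in rest)
        {
            // Release the previous Stream and state, then attach the next Stream.
            reader.Reset(next);
            total += await CountAsync(reader);
        }
        return total;
    }

    private static async Task<int> CountAsync(Utf8StreamReader reader)
    {
        var count = 0;
        while (await reader.LoadIntoBufferAsync())
        {
            while (reader.TryReadLine(out _))
            {
                count++;
            }
        }
        return count;
    }
}
```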

Options

The constructor accepts int bufferSize and bool leaveOpen as parameters.

int bufferSize defaults to 4096, but if the data per line is large, changing the buffer size may improve performance. When the buffer size and the size per line are close, frequent buffer copy operations occur, leading to performance degradation.

bool leaveOpen determines whether the internal Stream is also disposed when the object is disposed. The default is false, which means the Stream is disposed.

Additionally, there are init properties that allow changing the option values for ConfigureAwait and SkipBom.

bool ConfigureAwait { init; } allows you to specify the value for ConfigureAwait(bool continueOnCapturedContext) when awaiting asynchronous methods internally. The default is false.

bool SkipBom { init; } determines whether to identify and skip the BOM (Byte Order Mark) included at the beginning of the data during the first read. The default is true, which means the BOM is skipped.
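The options above can be combined as in the following sketch. It assumes a constructor overload that takes both bufferSize and leaveOpen. The names OptionsSample and ReadFirstLineRawAsync and the 64 KiB buffer size are chosen only for illustration.

```csharp
using System;
using System.IO;
using System.Threading.Tasks;
using Cysharp.IO;

public static class OptionsSample
{
    // Reads the first line without skipping a BOM and leaves the Stream open.
    public static async Task<byte[]> ReadFirstLineRawAsync(Stream stream)
    {
        using var reader = new Utf8StreamReader(stream, bufferSize: 65536, leaveOpen: true)
        {
            ConfigureAwait = false, // the documented default, shown explicitly
            SkipBom = false,        // keep a leading BOM as part of the first line
        };

        ReadOnlyMemory<byte>? line = await reader.ReadLineAsync();

        // Copy before the reader is disposed, since the memory is shared.
        return line.HasValue ? line.Value.ToArray() : Array.Empty<byte>();
    }
}
```

Because leaveOpen is true here, disposing the reader should not dispose the Stream, so the caller remains responsible for it.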

Currently, this is not configurable: Utf8StreamReader recognizes only CRLF (\r\n) and LF (\n) as newline characters. Since environments that use CR (\r) alone are now extremely rare, the CR check is omitted for performance reasons. If you need this functionality, please let us know by creating an Issue. We will consider adding it as an option.
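Until such an option exists, one possible workaround is to split each returned line again on any CR bytes it still contains. The sketch below is not part of the library. LoneCrSplitter and SplitOnCr are hypothetical names, and the code uses only standard .NET types.

```csharp
using System;
using System.Collections.Generic;

public static class LoneCrSplitter
{
    // Splits one LF/CRLF-delimited line further on any CR bytes inside it.
    public static IEnumerable<ReadOnlyMemory<byte>> SplitOnCr(ReadOnlyMemory<byte> line)
    {
        while (true)
        {
            int index = line.Span.IndexOf((byte)'\r');
            if (index < 0)
            {
                // No CR left: the remainder is the last segment.
                yield return line;
                yield break;
            }

            yield return line.Slice(0, index);
            line = line.Slice(index + 1);

            // A CR at the very end terminates the last segment; stop here
            // instead of emitting an extra empty one.
            if (line.IsEmpty)
            {
                yield break;
            }
        }
    }
}
```

The segments are slices of the original memory, so they are subject to the same lifetime rules as the line they were taken from.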

Unity

Unity, which supports .NET Standard 2.1, can run this library. Since the library is only provided through NuGet, it is recommended to use NuGetForUnity for installation.

For detailed instructions on using NuGet libraries in Unity, please refer to the documentation of Cysharp/R3 and other similar resources.

License

This library is under the MIT License.