dmitry-brazhenko / sharptoken

SharpToken is a C# library for tokenizing natural language text. It's based on the tiktoken Python library and designed to be fast and accurate.

Home Page: https://www.nuget.org/packages/SharpToken

License: MIT License

Languages: C# 53.64%, Python 1.10%, HTML 45.26%

Topics: cl100kbase, csharp, gpt, gpt-3, gpt-4, openai, tokenizer

sharptoken's People

Contributors

anthonypuppo, dmitry-brazhenko, dmytrostruk, ericstj, ian-cameron, okhosting, vwilson

sharptoken's Issues

Incorrect token count with Cyrillic

The online OpenAI tokenizer (https://platform.openai.com/tokenizer) counts 549 tokens for the text below:

В цепочках поставок кейс-стадии, когда называются одна или несколько сторон, страдают от серьезных конфликтов интересов. Компании и их поддерживающие поставщики (программное обеспечение, консалтинг) имеют заинтересованность в представлении результата в положительном свете. Кроме того, фактические цепочки поставок обычно получают пользу или пострадают от случайных условий, которые никак не связаны с качеством их исполнения. Персонажи цепочки поставок - это методологический ответ на эти проблемы.

However, SharpToken counts 219 tokens. Something seems to be wrong.

GptEncoding thread-safe?

Is the SharpToken GptEncoding thread-safe, or should we create a new instance for every request?
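
Until thread safety is confirmed, a conservative middle ground between one shared instance and one instance per request is one instance per thread. The sketch below is a minimal Python illustration of that pattern only; `StubEncoder` and `get_encoder` are hypothetical names and say nothing about how GptEncoding actually behaves.

```python
import threading


class StubEncoder:
    """Hypothetical stand-in for an encoder whose thread safety is unknown."""

    def encode(self, text: str) -> list:
        # Placeholder logic: one "token" per UTF-8 byte.
        return list(text.encode("utf-8"))


# Each thread sees its own attributes on this object.
_local = threading.local()


def get_encoder() -> StubEncoder:
    # Build the encoder once per thread, then reuse it on that thread.
    if not hasattr(_local, "encoder"):
        _local.encoder = StubEncoder()
    return _local.encoder
```

This avoids paying the construction cost on every request while never sharing an instance across threads.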

Missing model name for 3.5 turbo Azure Deployment model

Nice job on this, thanks for doing it!

The one thing I noticed is that the Azure deployment model name for the gpt-35-turbo model was not included in Model.cs. It looks like it was added later in TikToken. I have an easy-enough work-around by testing for the value in my code and making the switch, but I suspect that you'll want to get that fixed before there is truly widespread adoption of this port.

Again, thanks for putting this together. I think it will be valuable.

Edited to make it clear that the name I'm referring to was added after-the-fact in TikToken.
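
The workaround mentioned above (test for the Azure name and switch it) could look roughly like the following. This is a minimal Python sketch of the idea, not code from SharpToken or tiktoken; `AZURE_ALIASES` and `normalize_model_name` are hypothetical names.

```python
# Hypothetical alias table: maps the Azure deployment model name from this
# issue to the model name the lookup already knows.
AZURE_ALIASES = {
    "gpt-35-turbo": "gpt-3.5-turbo",
}


def normalize_model_name(name: str) -> str:
    """Return the canonical name for a known alias, else the name unchanged."""
    return AZURE_ALIASES.get(name, name)
```

The normalized name would then be passed to the usual model-to-encoding lookup.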

Add Microsoft.SourceLink.GitHub to publish repo info/commit to nuget.org

Right now, there's no repo info in nuget.org about the project.

Would be nice to get that as well as commit info. The above package can provide that.

Just set the following project properties:

    <PackageProjectUrl>[if different from repo]</PackageProjectUrl>
    <PublishRepositoryUrl>true</PublishRepositoryUrl>

Anthropic (claude) support

Can we use SharpToken for Anthropic models? I could not find out whether Claude uses "cl100k_base" or another encoding.

Performance compared to TiktokenSharp, Tokenizer

SharpToken seems to be actively maintained, and we are evaluating a possible switch from TiktokenSharp. I'm interested in how SharpToken's performance compares to TiktokenSharp and Tokenizer. From a quick look through the codebase, the core BPE does not yet take advantage of SSE, AVX2/AVX-512, or other SIMD instruction sets (such as ARM NEON).

Do you know if any of this is planned?

Implement MODEL_PREFIX_FOR_ENCODING

Would you be willing to implement model-prefix-to-encoding support, as TikToken does in https://github.com/openai/tiktoken/blob/5d970c1100d3210b42497203d6b5c1e30cfda6cb/tiktoken/model.py#L63-L69, for model names like gpt-3.5-turbo-0401, etc.? Or would you accept a PR for it?

That would support this kind of model name. Currently it gives {"Could not automatically map gpt-3.5-turbo-0301 to a tokenizer. Please use GetEncoding to explicitly get the tokenizer you expect."}

We could probably add it to Model.cs with a similar approach: when a model is not found in the dictionary, fall back to checking whether the model name starts with a key from a model-prefix dictionary.
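
The fallback described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: the two tables are small illustrative excerpts, and `encoding_name_for_model` is a hypothetical helper, not the actual contents of Model.cs.

```python
# Illustrative excerpts only; a real implementation would carry full tables.
MODEL_TO_ENCODING = {
    "gpt-4": "cl100k_base",
    "gpt-3.5-turbo": "cl100k_base",
}
MODEL_PREFIX_TO_ENCODING = {
    "gpt-4-": "cl100k_base",
    "gpt-3.5-turbo-": "cl100k_base",
}


def encoding_name_for_model(model: str) -> str:
    # Step 1: exact match in the model dictionary.
    if model in MODEL_TO_ENCODING:
        return MODEL_TO_ENCODING[model]
    # Step 2: fall back to the first prefix the model name starts with.
    for prefix, encoding in MODEL_PREFIX_TO_ENCODING.items():
        if model.startswith(prefix):
            return encoding
    # Step 3: no match at all, so fail loudly.
    raise KeyError(f"Could not automatically map {model} to a tokenizer.")
```

With these tables, a dated or suffixed name such as gpt-3.5-turbo-0301 resolves through the prefix step instead of raising.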

gpt-3.5-turbo-16k Not Supported

Can you please add support for gpt-3.5-turbo-16k? I get errors when I try to do:
var encoding = GptEncoding.GetEncodingForModel("gpt-3.5-turbo-16k");
