Comments (16)

watzon commented on June 12, 2024

I'm happy to announce that I just finished adding a (pretty much API-complete) replica of the pragmatic tokenizer gem to Cadmium. You can check the docs here. This was a lot of work, but hopefully it does everything that's needed and more.

from crystal-libraries-needed.

watzon commented on June 12, 2024

@HCLarsen I can see that. I may split the library apart into several different shards, as is common practice with a lot of bigger JS libraries, but for now I'm going to work on completing Cadmium as a whole.

GrgDev commented on June 12, 2024

I am going to take a quick stab at this if no one else has tackled it. I gave the PragmaticTokenizer source a quick look, and the only unmet dependency I saw for porting it to Crystal almost as is is the CGI call for unescaping HTML text. Crystal has an HTML module for that instead of CGI.

johnjansen commented on June 12, 2024

@GrgDev it's not really the CGI dep that is the issue; there are a couple of hurdles (from my sketchy memory):

  1. The regexes (there are a lot of them, and they're complex) won't work as is.
  2. The organization of the Ruby code can't be directly duplicated, so it's a re-engineering job in that respect.

Otherwise I'll be interested in what you find, and may (time permitting) be able to offer some assistance.

GrgDev commented on June 12, 2024

Yeah, I see that the project setup and file organization would need to change a bit. I'll dig into the regex differences and see what I find.

johnjansen commented on June 12, 2024

It's the metaprogramming I'm more worried about ... it's a bit tricky to untangle (unless you have a clear head and are locked in a room in silence).

bew commented on June 12, 2024

Can you add a link to the tokenizer you're proposing to 'duplicate'? (So people coming here won't have to search for it to see what you mean.)

johnjansen commented on June 12, 2024

https://github.com/diasks2/pragmatic_tokenizer

GrgDev commented on June 12, 2024

I'm busy at work right now, but I went ahead and stubbed out a quick empty repo here. Please excuse the corny name.

https://github.com/GrgDev/crystalized_tokenizer

If I get around to this, the work will be there.

GrgDev commented on June 12, 2024

Not done yet. Putting a comment here in case I drop this for some reason so someone else can learn what I found already.

I ran into the metaprogramming issues, but they don't seem to be too bad. The only two I've found so far are:

  1. It extends String inline to add new custom methods. I just converted them to non-destructive methods that you pass the string to instead.
  2. It checks at runtime whether certain methods are #defined? in a language module, which we replace with #responds_to?. So far I've only found this in the check for SingleQuotes, but that's a class/struct, not a method, so it might be worth the hack of just adding a constant Bool to each language module saying whether it has one.
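
A minimal sketch of the two workarounds above, written in Ruby for illustration; a Crystal port would presumably take a similar shape. All names here (StringHelpers, normalize_quotes, HAS_SINGLE_QUOTES, handle_single_quotes?) are hypothetical and are not taken from PragmaticTokenizer or from the port.

```ruby
# Hypothetical helper module: instead of extending String with custom
# methods, expose functions that take the string as an argument.
module StringHelpers
  # Returns a new string with curly single quotes replaced by a plain
  # apostrophe. The input string is left untouched (non-destructive).
  def self.normalize_quotes(text)
    text.tr("\u2018\u2019", "''")
  end
end

# Hypothetical language modules: each one carries an explicit Bool
# constant instead of being probed for a nested SingleQuotes class.
module Languages
  module English
    HAS_SINGLE_QUOTES = true
  end

  module Common
    HAS_SINGLE_QUOTES = false
  end
end

# Reads the flag from whichever language module is passed in.
def handle_single_quotes?(language)
  language.const_get(:HAS_SINGLE_QUOTES)
end
```

The flag trades a runtime existence check for a constant that every language module has to declare, so a missing declaration fails loudly rather than silently skipping the single-quote handling.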

The regex so far has been a non-issue in terms of difficulty, just an annoyance. You just replace the \u escapes with \x. I also went through and converted the inline non-ASCII characters to their proper Unicode escape form.
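
The escape rewrite described above could be automated along these lines. This is only a sketch in Ruby: the helper name convert_unicode_escapes is hypothetical, and it assumes the target spelling is the braced \x{XXXX} form that PCRE (the engine behind Crystal's Regex) accepts, which may not be exactly what was done by hand in the port.

```ruby
# Hypothetical helper: rewrites four-digit \uXXXX escapes in a regex
# source string into the braced \x{XXXX} form.
# Assumption: the input is the raw pattern text, and every "\u" in it
# starts an escape (a literal backslash followed by "u" is not handled).
def convert_unicode_escapes(source)
  source.gsub(/\\u(\h{4})/) { "\\x{#{$1}}" }
end

# Single-quoted strings keep the backslashes literal, so this prints the
# converted pattern text rather than the characters themselves.
puts convert_unicode_escapes('[\u2018\u2019]')
```

The helper only touches the pattern text; whether each converted pattern still matches what the original did would need to be checked against the port's own specs.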

watzon commented on June 12, 2024

My project Cadmium has a number of tokenizers built in. None of them are quite as advanced as PragmaticTokenizer, but they should be sufficient for most needs.

johnjansen commented on June 12, 2024

beautiful!

HCLarsen commented on June 12, 2024

My understanding is that Cadmium is an NLP engine. Since people may want this pragmatic tokenizer functionality without necessarily wanting the NLP as well, wouldn't it be a good idea to extract that into a separate shard?

watzon commented on June 12, 2024

@HCLarsen I believe that Crystal ignores unused code when compiling, so importing the whole library shouldn't hurt you.

HCLarsen commented on June 12, 2024

@watzon I do believe that's true. However, my reasoning isn't about the size of the executable. It's more about things like users being able to find a library that does this, or whether an update concerns a project that uses it as a dependency. Look at it as the Single Responsibility Principle applied to libraries. Software (and its shard.yml file) is much more concise and easier to understand if the dependencies are also concise.
