Comments (16)
I'm happy to announce that I just finished adding a (pretty much API-complete) replica of the pragmatic tokenizer gem to Cadmium. You can check the docs here. This was a lot of work, but hopefully it does everything that's needed and more.
from crystal-libraries-needed.
@HCLarsen I can see that. I may split the library apart into several different shards, as is common practice with a lot of bigger JS libraries, but for now I'm going to work on completing Cadmium as a whole.
I am going to take a quick stab at this if no one has tackled it. I gave the PragmaticTokenizer source a quick look over, and the only unmet dependency I saw for porting it to Crystal almost as it is now is the CGI call for unescaping HTML text, for which Crystal has an HTML module instead of CGI.
@GrgDev it's not really the CGI dep that is the issue. There are a couple of hurdles (from my sketchy memory):
- the regexes (there are a lot of them, and they are complex) won't work as is.
- the organization of the Ruby code cannot be directly duplicated, so it's a re-engineering job in that respect.
Otherwise I'll be interested in what you find, and may (time permitting) be able to offer some assistance.
Yeah, I see that the project setup and file organization would need to change a bit. I'll dig into the regex differences and see what I find.
It's the metaprogramming I'm more worried about... it's a bit tricky to untangle (unless you have a clear head and are locked in a room in silence).
Can you add a link to the tokenizer you're proposing to 'duplicate'? (So people coming here won't have to search for it to see what you mean.)
https://github.com/diasks2/pragmatic_tokenizer
I'm busy at work right now, but I went ahead and stubbed out a quick empty repo here. Please excuse the corny name.
https://github.com/GrgDev/crystalized_tokenizer
If I get around to this, the work will be there.
Not done yet. Putting a comment here in case I drop this for some reason so someone else can learn what I found already.
I ran into the metaprogramming issues, but they don't seem to be too bad. The only two found so far are:
- It does an inline extend of String to add new custom methods. I just converted them to non-destructive methods that you pass the string to instead.
- It checks whether certain methods are #defined? in a language module at runtime, which we replace with the #responds_to? macro. So far I have only found this with the check for SingleQuotes, but that's a class/struct, not a method, so it might be worth the hack to just throw a constant Bool into the language modules to say whether it has it.
The regex so far has been a non-issue in terms of difficulty. Just annoyance. You just replace the \u escapes with \x. Also, I went through and replaced the inline non-ASCII characters with their proper Unicode escape form.
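For anyone following along, the two metaprogramming workarounds above can be sketched roughly like this in Ruby, the gem's own language. This is only an illustration under assumptions: Squeezable, TextHelpers, English, German, and HAS_SINGLE_QUOTES are made-up names, not taken from the PragmaticTokenizer source.

```ruby
# Hypothetical helper module, for illustration only.
module Squeezable
  # Collapses runs of whitespace into single spaces and trims the ends.
  def squeeze_spaces
    gsub(/\s+/, " ").strip
  end
end

# Ruby-only pattern: extend a single String instance at runtime.
text = "Hello   world ".dup
text.extend(Squeezable)
puts text.squeeze_spaces

# Port-friendly pattern: a non-destructive function that takes the
# string as an argument and returns a new string.
module TextHelpers
  def self.squeeze_spaces(text)
    text.gsub(/\s+/, " ").strip
  end
end
puts TextHelpers.squeeze_spaces("Hello   world ")

# Hypothetical language modules. English defines a SingleQuotes type,
# German does not; both carry an explicit Boolean flag.
module English
  SingleQuotes = Struct.new(:text)
  HAS_SINGLE_QUOTES = true
end

module German
  HAS_SINGLE_QUOTES = false
end

# Runtime lookup (the style that is awkward to port directly)...
puts English.const_defined?(:SingleQuotes, false)
# ...versus reading a constant flag that every language module declares.
puts German::HAS_SINGLE_QUOTES
```

The function form and the constant flag both avoid runtime changes to objects or modules, which is why they carry over to a compiled language more easily.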
My project Cadmium has a number of tokenizers built in. None of them are quite as advanced as PragmaticTokenizer, but they should be sufficient for most needs.
beautiful!
My understanding is that Cadmium is an NLP engine. Since people may want this pragmatic tokenizer functionality without necessarily wanting the NLP as well, wouldn't it be a good idea to extract that into a separate shard?
@HCLarsen I believe that Crystal ignores unused code when compiling, so importing the whole library shouldn't hurt you.
@watzon I do believe that's true. However, my reasoning isn't about the code size of the executable. It's more related to things like users being able to find a library that does this, or whether an update concerns a project that uses it as a dependency. Look at it as a matter of the Single Responsibility Principle, but applied to libraries. Software (and the shard.yml file) is much more concise and easier to understand if the dependencies are also concise.