assem-ch / arabicstemmer

Assem's Arabic Light Stemmer is a Snowball-based stemming algorithm for Arabic, aimed mainly at improving search.

Home Page: http://arabicstemmer.com

License: Other

Python 58.31% Makefile 41.69%
arabic language snowball snowball-framework stemmer

arabicstemmer's Introduction

I like to build modern backend APIs. I have experience in these domains: e-commerce, delivery, ride hailing, search and Arabic NLP, and clinical trials.

Open for external consultancy in:

  • Building MVPs for startups

The stack I prefer to work with: FastAPI, Django, Node.js, React-native. You may check my starred lists here.

Active projects that I maintain

django-fast-api
(Experimental)
A few hacks to speed up defining APIs based on Django REST framework, inspired by FastAPI.
Please give it a try and let me know your feedback.

django-jet-reboot: a Django admin based on django-jet that actually supports Django 3.0 and Django 4.0.
arabicstemmer: an Arabic Light Stemmer aimed mainly at improving search.

For students and fresh graduates

  • Basic training for React-Native mobile dev: check Moumene's readme
  • Basic training for Django backend dev: Check link

arabicstemmer's Issues

Stemming "فاطفالهم"

First of all, I do not speak any Arabic whatsoever, but we're using Snowball in one of our products and we're now adding support for Arabic.

According to the README (and Google Translate seems to agree with this), the stem of "فاطفالهم" should be "اطفال". However, both our code, after linking the current master of this project, and http://arabicstemmer.com/ stem this word as "فاطفال".

Is this a bug or am I doing something wrong?

Arabic light stemming

I imported a txt file on the site http://arabicstemmer.com for stemming. After light stemming, repeated words in each line of text are collapsed and only one stem is returned, but I need the number of repetitions of each word per line. Is there a way to keep the repeated stems instead of clearing them?
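
One way to keep the repetition counts is to stem locally and count per line instead of relying on a de-duplicated result list. The following is a minimal sketch, not the site's implementation: toy_stem is a hypothetical stand-in that only strips a leading "ال", and any real stemming function can be passed in its place.

```python
from collections import Counter

def toy_stem(word):
    """Hypothetical stand-in for a real stemmer: strips a leading 'ال' only."""
    return word[2:] if word.startswith("ال") and len(word) > 3 else word

def stem_counts_per_line(text, stem=toy_stem):
    """Return one Counter per line, mapping each stem to its number of occurrences."""
    return [Counter(stem(w) for w in line.split()) for line in text.splitlines()]

sample = "الكتاب كتاب الكتاب\nقلم"
counts = stem_counts_per_line(sample)
for line_no, counter in enumerate(counts, start=1):
    print(line_no, dict(counter))
```

Each Counter preserves how many times a stem occurred on its line, so repeated words are no longer lost.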

Improving stemmer - milestone 2

  • Clear prefixes first, clear suffixes second
    • al kal fal bal bb should be marked first, and set is_noun
    • aa ww ff should be marked first
  • Greedy to choose between nouns suffixes and verb suffixes: طالبات
  • الزمان
  • والشمس
  • لمعالجة
  • أفنلزمكموها
  • the prefix س attaches only to present-tense (imperfect) verbs
  • Detect the است prefix, use it to decide whether the word is a noun or a verb, and increase the size condition by 3: نسنعين
  • in suffixes, the sound masculine plural rarely has a stem shorter than 4 letters
  • for present-tense verbs, suffix removal must leave a size of 4, because the present tense has a one-letter prefix
  • study the case of والأمر
  • make suffixes to set/unset is_noun, is_verb
  • don't stem if it contains a number or English number or size = odd
  • define regions before starting to stem; test everything, then perform stemming
  • blacklist: ignore some predefined words, or is it worth it?
  • remove feminine marks and study feminine patterns
  • remove broken plural infixes: أطفال، كواسر ،نُمور
  • consider vocalization when exists:
  • tanween means a noun
  • detect and process_vocalized texts
  • Study patterns and guess it before stemming
  • Verb conjugation prefixes a, t, y, n: if the word has a suffix, remove the prefix along with it
  • Rename routines to better-explaining names
  • study Alef-tanween
  • study idgham
  • Calculate probability of being noun or being verb
  • Prefix confusion
  • 2 letters words
  • improve from ISRI ideas
  • improve from Khoja ideas
  • improve from Tashaphyne ideas
  • optimize performance
  • filter stop words
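
As an illustration of one item from the list above ("don't stem if it contains a number"), here is a minimal sketch. It covers only the digit check and assumes that "number or English number" means Arabic-Indic or Western digits; the "size = odd" condition is not interpreted here. The function names are hypothetical and not part of the stemmer.

```python
import unicodedata

def contains_digit(word):
    """True if any character is a decimal digit in any script, e.g. '3' or '٣'."""
    return any(unicodedata.category(ch) == "Nd" for ch in word)

def stem_unless_numeric(word, stem):
    """Return the word unchanged when it contains a digit, else apply stem()."""
    return word if contains_digit(word) else stem(word)

# str.upper stands in for a real stemming function in this demonstration.
print(stem_unless_numeric("abc1", str.upper))
print(stem_unless_numeric("abc", str.upper))
```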

Arabic Stoplist issue

  • add translation comments
  • filter based on frequency, keep only high frequency words
  • save them as regular expressions supported by lists of prefixes and affixes
    • process them while loading.
  • enrich it
  • classify words into categories
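
The frequency-based filtering item above could look roughly like this. It is only a sketch: the min_count threshold, the function name, and the tiny corpus are assumptions for illustration, not values from the project.

```python
from collections import Counter

def high_frequency_stopwords(candidates, corpus_tokens, min_count):
    """Keep only the candidate stop words seen at least min_count times in the corpus."""
    freq = Counter(corpus_tokens)
    return [w for w in candidates if freq[w] >= min_count]

corpus = "في البيت في المدرسة من البيت في".split()
kept = high_frequency_stopwords(["في", "من", "على"], corpus, min_count=2)
print(kept)
```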

Makefile

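make build_root_based_stemmer has path errors: the cp destination paths of the build files need to be changed, because the directories are created under dist_rooter/ while the files are copied to dist/.

From: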
dist_rooter: build_root_based_stemmer
	@echo "Compiling the root-based stemming algorithm to available programming languages"
	@cd $(SNOWBALL); make dist
	@mkdir -p "dist_rooter/python/"; cp $(SNOWBALL)dist/snowballstemmer-*.tar.gz "dist/python/"
	@mkdir -p "dist_rooter/java/";cp $(SNOWBALL)"dist/libstemmer_java.tgz" "dist/java/"
	@mkdir -p "dist_rooter/c/";cp $(SNOWBALL)"dist/libstemmer_c.tgz" "dist/c/"
	@mkdir -p "dist_rooter/jsx/";cp $(SNOWBALL)"dist/jsxstemmer.tgz" "dist/jsx/"

To:
dist_rooter: build_root_based_stemmer
	@echo "Compiling the root-based stemming algorithm to available programming languages"
	@cd $(SNOWBALL); make dist
	@mkdir -p "dist_rooter/python/"; cp $(SNOWBALL)dist/snowballstemmer-*.tar.gz "dist_rooter/python/"
	@mkdir -p "dist_rooter/java/";cp $(SNOWBALL)"dist/libstemmer_java.tgz" "dist_rooter/java/"
	@mkdir -p "dist_rooter/c/";cp $(SNOWBALL)"dist/libstemmer_c.tgz" "dist_rooter/c/"
	@mkdir -p "dist_rooter/jsx/";cp $(SNOWBALL)"dist/jsxstemmer.tgz" "dist_rooter/jsx/"

Adding JS bundle to npm

The JS library is not published on npm, so clients must manually download the source code and add it to their project, which of course means they miss any later updates.

الحي

According to the result on the website, it gave me الح

windows errors


arabicstemmer>python run_test.py
make: *** ../snowball/: No such file or directory.  Stop.
The system cannot accept the time entered.
Enter the new time: 14
A required privilege is not held by the client.
Traceback (most recent call last):
  File "run_test.py", line 12, in <module>
    f = open("tests/wordlist.out")
IOError: [Errno 2] No such file or directory: 'tests/wordlist.out'

Spread it

  • within snowball
  • within xapian
  • within pystemmer
  • within snowball_py
  • within nltk
  • within whoosh
  • Snowball for Go language snowball link
  • php pecl stem library link
  • perl Lingua::Stem::Snowball library link

المهيمن

This word should stem to مهيمن, but the result it gave was مهيم

"Stemming algorithm 'arabic' not found" in python!

The Arabic language is not included in Assem's Arabic Light Stemmer.
I tried this code (after I downloaded it):
from snowballstemmer import stemmer
ar_stemmer = stemmer("arabic")
ar_stemmer.stemWord(u"فسميتموها")
I get this error message:

raise KeyError("Stemming algorithm '%s' not found" % lang)

KeyError: "Stemming algorithm 'arabic' not found"
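
A caller can guard against this error instead of crashing. The sketch below is an illustration only: load_stemmer and fake_factory are hypothetical names, and fake_factory merely imitates the KeyError shown in the traceback so the example runs without any third-party package.

```python
def load_stemmer(factory, lang):
    """Call factory(lang); return None instead of raising if lang is unknown."""
    try:
        return factory(lang)
    except KeyError:
        return None

def fake_factory(lang):
    """Hypothetical stand-in that mimics the KeyError from the traceback above."""
    known = {"english": str.lower}
    if lang not in known:
        raise KeyError("Stemming algorithm '%s' not found" % lang)
    return known[lang]

if load_stemmer(fake_factory, "arabic") is None:
    print("Arabic is not available in this build")
```

With the real package, passing snowballstemmer.stemmer as the factory should behave the same way, and snowballstemmer.algorithms() lists the language names the installed build supports, which shows whether "arabic" was compiled in.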

Website improvements

  • Rename the button browse to upload, and remove any space between it and "stem" button
  • Increase the width of the "input" so it matches the width of the results
  • For resulted words, better change it to an "inline" list, a list in one line.
  • Make the footer smaller
  • Add a "fork me on github" ribbon at the right, I prefer red or black: https://github.com/blog/273-github-ribbons
  • if you are using twitter bootstrap, I prefer using class="well" for Results
  • Use this font for arabic writing: http://www.amirifont.org/
  • Create a bash script to generate the code for C++, Java, Python, C
  • Test the generated code for each programming language and include all needed libraries. For example, Python needs 2 files: basestemmer.py and among.py. It is better to bundle them as a tar.gz so users can download them at once.
  • Make a script to update the JS snowball build that we use on the website, so the next time we update the stemmer we can do the update in one line
  • Add this code of Analytics to the bottom of the page
<script>
  (function(i,s,o,g,r,a,m){i['GoogleAnalyticsObject']=r;i[r]=i[r]||function(){
  (i[r].q=i[r].q||[]).push(arguments)},i[r].l=1*new Date();a=s.createElement(o),
  m=s.getElementsByTagName(o)[0];a.async=1;a.src=g;m.parentNode.insertBefore(a,m)
  })(window,document,'script','//www.google-analytics.com/analytics.js','ga');

  ga('create', 'UA-73444954-1', 'auto');
  ga('send', 'pageview');

</script>
  • Add the text: "Welcome to the Arabic light stemmer made for Snowball, it is fast and supports many programming languages!" and fix typos in the other text:
    Type some Arabic text and press "Stem!" button or Upload to upload a txt file.

  • Show Results and its div only when there are results

  • Show the original text words mapped to the stemmed text words, in a table with centered cells like this:

    مكتبة | لمعالجة | الكلمات | العربية | وتجذيعها
    مكتب | لمعالج | كلم | عرب | تجذيع

  • Read the text file using javascript/html5 instead of uploading it using php

  • Add (beta) to the header
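
The word-to-stem table requested in the list above can be prototyped in a few lines. This is a sketch, not the site's code: the function names are hypothetical, and the lookup dictionary simply reuses the five word and stem pairs from the example so that no real stemmer is needed to run it.

```python
def word_stem_rows(text, stem):
    """Split text on whitespace and return (words, stems) as parallel lists."""
    words = text.split()
    return words, [stem(w) for w in words]

def format_table(words, stems, sep=" | "):
    """Render two aligned rows: original words on top, stems underneath."""
    return sep.join(words) + "\n" + sep.join(stems)

# Pairs copied from the example in the list above; a real stemmer would replace this lookup.
example_stems = {
    "مكتبة": "مكتب",
    "لمعالجة": "لمعالج",
    "الكلمات": "كلم",
    "العربية": "عرب",
    "وتجذيعها": "تجذيع",
}
words, stems = word_stem_rows("مكتبة لمعالجة الكلمات العربية وتجذيعها", example_stems.get)
print(format_table(words, stems))
```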

C++ Stemming example

The library is excellent. I tried the Python version and it worked straightforwardly. The only thing missing from the documentation website now is examples for C++.

It would be great if they could be added.
