![](https://img.shields.io/badge/language-Hindi-red.svg?style=for-the-badge)
sangitanlp / sangita
A Natural Language Toolkit for Indian Languages
License: Apache License 2.0
Task List for getting Started
Implement Tokeniser for Bengali
Find Datasets for the Language to move forward with.
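As a starting point for the Bengali tokeniser, here is a minimal regex-based sketch. It is not the project's implementation: the function name `tokenize` and the splitting rules are assumptions made for illustration. It treats any run of characters from the Bengali Unicode block (U+0980 to U+09FF) as one token and emits every other non-space character, such as the danda "।", as its own token.

```python
import re

# Hypothetical sketch, not the sangita API.
# One token per run of Bengali-block characters (U+0980-U+09FF),
# per run of ASCII digits or Latin letters, or per other single
# non-space character (punctuation such as the danda, U+0964).
_TOKEN_RE = re.compile(r"[\u0980-\u09FF]+|[0-9]+|[A-Za-z]+|\S")


def tokenize(text):
    """Split Bengali text into word and punctuation tokens."""
    return _TOKEN_RE.findall(text)


if __name__ == "__main__":
    print(tokenize("আমি ভাত খাই।"))
```

A real tokeniser would also need to handle abbreviations, numbers with separators, and joiner characters, which this sketch ignores.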
The POS Tagger has no documentation yet.
Have a look at postagger.py and add documentation for it.
Currently some of the data is housed under Corpora in this repository.
We need to transfer this data, along with the rest of the incoming data, to Sangita Data.
The reason is that PyPI limits the size of the packages that can be uploaded.
This will require refactoring the code, changing file types, and writing an installation script that installs the data files automatically.
Here is a task list for the issue
Change the file type to something more usable and smaller in size.
Change the directory structure at Sangita Data.
Transfer the files and refactor code to reflect the change in location
Write installation Script.
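The installation script in the last task could look roughly like the sketch below. Everything specific here is an assumption: the function name `install_data`, the default directory `~/sangita_data`, and the idea of fetching one file per call are illustrative choices, not the project's design. It uses only the standard library.

```python
import os
import urllib.request

# Hypothetical default location for downloaded data files.
DATA_DIR = os.path.join(os.path.expanduser("~"), "sangita_data")


def install_data(url, filename, data_dir=DATA_DIR):
    """Download `url` into `data_dir/filename` unless it already exists.

    Returns the path of the local copy.
    """
    os.makedirs(data_dir, exist_ok=True)
    target = os.path.join(data_dir, filename)
    if not os.path.exists(target):
        # urlretrieve copies the remote resource to the target path.
        urllib.request.urlretrieve(url, target)
    return target
```

Skipping files that are already present keeps repeated installs cheap. A fuller script would probably also verify checksums and report download failures.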
Extraction of Word, Lemma pairs from the BenLem dataset.
Citation: A. Chakrabarty and U. Garain (2015): BenLem (a Bengali Lemmatizer) and its Role in WSD, in ACM Trans. Asian and Low-Resource Language Information Processing (TALIIP).
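The extraction step could be sketched as follows, assuming the pairs have first been exported to a tab-separated layout with the word in the first column and the lemma in the second. That layout is an assumption for illustration and may not match how BenLem is actually distributed; the function name `read_pairs` and the sample words are likewise hypothetical.

```python
import csv
import io


def read_pairs(stream):
    """Read (word, lemma) tuples from a tab-separated text stream.

    Blank lines and rows with fewer than two columns are skipped.
    """
    pairs = []
    reader = csv.reader(stream, delimiter="\t", quoting=csv.QUOTE_NONE)
    for row in reader:
        if len(row) >= 2 and row[0].strip():
            pairs.append((row[0].strip(), row[1].strip()))
    return pairs


if __name__ == "__main__":
    # Illustrative sample only, not taken from the dataset.
    sample = io.StringIO("করেছে\tকরা\n\nছেলেরা\tছেলে\n")
    for word, lemma in read_pairs(sample):
        print(word, lemma)
```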
New language - Telugu
Find a data corpus for the Telugu language. Add links in the comments on this issue.
Implement a tokenizer for Telugu.
Create a lemmatizer for Telugu.
Create a POS Tagger for Telugu.
The repo currently doesn't have a specific Hindi corpus to work on. We are looking for a corpus that satisfies the following points:
Some of the datasets that can be used might be available with the LTRC committee in IIT-B.
This issue is about discovering good Hindi Corpora for this project. Participants and contributors can search for and create PRs adding the datasets and links to the datasets.
Guidelines before sending Pull Requests:
Here is a rough outline of the requirements.
WordNet
Word, Lemma Pairs
Word, POS pairs
Others
The (word, gender) tuple is currently available here
In accordance with Issue #9 we will move this file to Sangita Data
We will also create a new repository for Hindi Word Vectors and one for machine learning models. These will be referenced in a separate issue.
Along with this we will remove the dependencies for Scikit Learn and work only with Keras.
The task list is given below
Move the gender.py to Sangita Data - Cakewalk.
Create a fresh set of word vectors and store it in a new repository dedicated to word vectors. - Pro.
Train the word vectors against the gender tags, and store the model under a separate repository. - Intermediate.
Refactor the code here, to accommodate these changes. - Intermediate.
Checklist for the Project:
Use a white background
Choose a nice font.
Top menu bar with the items Code, Corpora, Documentation, Blog
Code should redirect to GitHub repo sangita
Corpora should redirect to the Organisation Page
Documentation should redirect to Readme
Blog should be a separate page with the Quora blog page posts embedded on it. Instructions are given here.
Project should be pushed to this repo
setup.py is a Python file. Its presence usually tells you that the module or package you are about to install has been packaged and distributed with Distutils, historically the standard tool for distributing Python modules. This makes Python packages easy to install. Often it's enough to write:
python setup.py install
and the module will install itself.
Write the setup.py file for this package.
Have a look at the code in stemmer.py.
Then try to improve on it or propose an alternative approach.
Requirements:
This issue involves three steps:
The pull request must contain a detailed outline of how the developer sought to tackle the problem.
It must also state the source of the data and its licensing terms. Along with this, there must be enough examples to demonstrate that the stemmer works.
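One possible alternative approach is a rule-based suffix stripper, sketched below for Hindi. This is not the code in stemmer.py: the suffix list is a small hypothetical sample of plural endings, and the names `SUFFIXES`, `stem`, and `min_stem_len` are invented for illustration. A usable stemmer would need a much larger, linguistically reviewed list.

```python
# Hypothetical sample of Hindi plural endings, longest first so that
# the longest matching suffix is the one that gets removed.
SUFFIXES = sorted(["ियों", "ियाँ", "ों", "ें"], key=len, reverse=True)


def stem(word, min_stem_len=2):
    """Strip the longest known suffix, keeping at least `min_stem_len`
    code points of the word; return the word unchanged otherwise."""
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem_len:
            return word[: -len(suffix)]
    return word


if __name__ == "__main__":
    for w in ["किताबें", "किताबों", "घर"]:
        print(w, "->", stem(w))
```

The minimum stem length is counted in Unicode code points, not visible syllables, so it is only a rough guard against over-stripping short words.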