
camel-guidelines's Introduction

camel-lab

CAMeL Guidelines

In the Computational Approaches to Modeling Language (CAMeL) Lab, we develop a wide range of Arabic and Arabic dialect resources (tools, corpora and lexicons). One goal we hold high is to follow consistent standards for all of our resources. Working with Arabic dialects comes with many challenges, as they are resource-poor and have no official standards. Our overall approach to annotation guidelines for Arabic and its dialects is to create common standards that are compatible with Modern Standard Arabic (MSA) but can be extended easily and naturally to the various dialects.

On this site, we provide our guidelines for representing:

The guidelines are versioned and backed up on GitHub. We invite you to check them out and give your feedback. Each guideline section includes a discussion of the high-level philosophy as well as specific details, along with links to publications on the guidelines and to publications and projects that use them.

camel-guidelines's People

Contributors

a455bcd9, fadhleryani, nizarhabash1, slkh


camel-guidelines's Issues

inti and final i

The spelling of inti (meaning "you") is controversial. The CODA* Guidelines say that they are motivated by the MSA guidelines, yet they transcribe this personal pronoun as إنتي.

In this particular case, إنتِ seems to be both phonologically motivated and morphologically similar to the MSA form.

The same question applies to the past tense: كتبتِ or كتبتي.

Dialectal consonants

Several consonants in the Arabic dialects do not exist in MSA, for example [p], [g] and [v] in Tunisian. Figure 2 in "Unified Guidelines and Resources for Arabic Dialect Orthography" provides several insights on this topic. However, it does not explain explicitly why [g] should be written as q in Tunisian Arabic and as k in Moroccan Arabic.

Shaddah in the beginning of the word

In MSA, a word cannot begin with a shaddah. In Arabic dialects, however, it can.

The CODA* Guidelines do not seem to consider this issue. Example: مّالح (salted olives and vegetables in Tunisian).

I think that this case should be covered in the CODA* Guidelines.

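The MSA rule is related to the maxim "العرب لا تبدأ إلا بمتحرك ولا تقف إلا على ساكن" ("The Arabs begin only with a vowelled letter and pause only on an unvowelled one").

This rule does not hold for the Arabic dialects.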
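To make the case concrete, here is a minimal sketch of how a tool could detect a word-initial shaddah in text. It is not part of the CODA* Guidelines or of any CAMeL tool; the function name `has_initial_shaddah` is hypothetical, and the only assumption is that the shaddah is encoded as the Unicode combining mark U+0651 placed after the first letter.

```python
import unicodedata

SHADDAH = "\u0651"  # ARABIC SHADDA

def has_initial_shaddah(word: str) -> bool:
    """Return True if the first letter of `word` carries a shaddah.

    Hypothetical helper: it scans the combining marks that directly
    follow the first character and stops at the next base letter.
    """
    for ch in word[1:]:
        if unicodedata.combining(ch) == 0:
            break  # reached the second base letter
        if ch == SHADDAH:
            return True
    return False

# meem + shaddah + alef + lam + hah (the Tunisian example above)
mmaleh = "\u0645\u0651\u0627\u0644\u062D"
# seen + lam + shaddah + meem (shaddah on the second letter)
sallem = "\u0633\u0644\u0651\u0645"

print(has_initial_shaddah(mmaleh))
print(has_initial_shaddah(sallem))
```

A check like this could flag the words that an MSA-oriented validator would otherwise reject or silently normalize.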

Prepositions and Clitics

There are several issues with the transcription of prepositions and clitics in the CODA* Guidelines. After a series of observations, we found that these items are easier to process and to write when they are separated by a space from the nouns that follow them.
We consequently came up with a new method for transcribing prepositions in Tunisian Arabic: م (+ال), من, لي, ع (+ال), على, في, بي, لي and كي.

We also propose keeping و (the conjunction) separated from the word that directly follows it.

I should credit Hager Ben Ammar (https://tn.linkedin.com/in/ben-ammar-hager-9b670ab7) for this proposal, which worked well in practice although it differs from MSA conventions.

Etymology for spelling words

There is a guideline in Maghrebi CODA that has not been given enough consideration. Following this brief guideline, we propose that Egyptian ha-nekteb should be written as حنكتب and not as هنكتب, because ح developed etymologically from رح. It would be reasonable to adopt this guideline at a large scale for Arabic dialects, as it resolves such controversies.

iA in Tunisian Arabic

In the Maghrebi CODA guidelines, iA is used to differentiate [a:] from [e:]. We propose keeping this convention and using Zwarakay (U+0659) + Alef Madd to note this variant in the Arabic script. This is similar to the Pashto script: https://en.wikipedia.org/wiki/File:Harakat_pashto.svg. The main reason is that a kasra placed before Alef Madd does not look good visually.
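As a rough illustration of the proposed sequence at the code-point level, the sketch below builds consonant + Zwarakay + alef. It rests on one assumption that the proposal does not spell out: that "Alef Madd" means the plain lengthening alef (U+0627). The helper name `with_zwarakay_alef` is hypothetical.

```python
import unicodedata

ZWARAKAY = "\u0659"  # the combining mark named in the proposal
ALEF = "\u0627"      # assumption: "Alef Madd" = plain lengthening alef

def with_zwarakay_alef(consonant: str) -> str:
    """Return consonant + Zwarakay + alef (hypothetical spelling of iA)."""
    return consonant + ZWARAKAY + ALEF

# Example with the letter beh (U+0628) as the consonant.
sequence = with_zwarakay_alef("\u0628")

print(unicodedata.name(ZWARAKAY))
print(unicodedata.category(ZWARAKAY))
print([f"U+{ord(ch):04X}" for ch in sequence])
```

Because U+0659 is a nonspacing mark, it attaches to the preceding consonant, so the alef itself stays undecorated in this encoding.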

Alif Maqsura

We are a group of researchers who tested the CODA guidelines, among other Arabic script conventions, on real users from Tunisia with the contribution of the Derja Association. We held three demo sessions in late 2019. Given that developing a large-scale writing convention for Arabic dialects is more important than developing a convention for Tunisian Arabic alone, we decided to share our findings with you so that they can be taken into consideration when enriching the CODA* Guidelines.

In "Unified guidelines and resources for Arabic dialect orthography", you specified this:
"Alif Maqsura: The MSA rules for spelling the Alif Maqsura (ى ý), which are sometimes based on roots and sometimes on patterns, apply in CODA*."

This is not explicit enough to serve as a rule. We propose to decide the spelling of Alif Maqsura in verbs according to their present-tense form.
Example:
جاء (to come) becomes جا in Tunisian Arabic. We propose to write it as جى, since its present-tense form is يجي.

The use of Haraka in undiacritized Arabic text

Sometimes, a Shaddah is needed to disambiguate between lexemes:
سلّم: to say hello to someone (Tunisian)
سلم: to be safe (Tunisian)
I think that the Shaddah should be written in such cases.
Likewise, a Haraka can help to differentiate between "Al-" and an Alif Madda followed by an l within a word:
بالْغة: Pubescent
باليمين: On the right
Adding a Sukun to the l in this situation seems helpful.
Another case where a Sukun can be useful in undiacritized text is the distinction between two types of noun phrases:
كلمة باهية, قول باهي: Good word (noun and adjective)
كلمةْ حق, قولْ حقيقة: True word (genitive construct)
I think that this is worth mentioning in the guidelines.
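The idea of keeping only the disambiguating marks can be sketched as a partial undiacritization step. This is a minimal sketch, not a CODA* rule or an existing CAMeL function: the name `partially_undiacritize` and the choice to keep Shaddah and Sukun by default are assumptions made for illustration.

```python
SHADDAH = "\u0651"
SUKUN = "\u0652"

# Tanween, short vowels, shaddah and sukun: U+064B through U+0652.
HARAKAT = {chr(cp) for cp in range(0x064B, 0x0653)}

def partially_undiacritize(text: str, keep=(SHADDAH, SUKUN)) -> str:
    """Remove harakat from `text`, except the marks listed in `keep`.

    Hypothetical helper: which marks are worth keeping is exactly the
    open question raised above, so `keep` is left configurable.
    """
    kept = set(keep)
    return "".join(ch for ch in text if ch not in HARAKAT or ch in kept)

# seen + fatha, lam + shaddah + fatha, meem (fully diacritized sallem)
full = "\u0633\u064E\u0644\u0651\u064E\u0645"

print(partially_undiacritize(full))           # shaddah survives
print(partially_undiacritize(full, keep=()))  # everything is stripped
```

With the default setting, the two Tunisian lexemes above stay distinct after stripping, whereas removing every mark collapses them into the same string.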
