database's Introduction

Diacritics Database

The purpose of this repository is to collect diacritics together with their associated ASCII characters in a structured form. It should be the central place for projects that need diacritics mappings.

As there is no single, trustworthy and complete source, all information needs to be collected manually by contributors. However, parts of the data are generated automatically.

Example mapping:

Schön => Schoen
Schoen => Schön
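
The example mapping above can be sketched in a few lines of JavaScript. This is only an illustration: the hand-written `germanDecompose` table and the helper names `toAscii` and `toSearchPattern` are assumptions made for this sketch and are not part of the database or its tooling.

```javascript
// Hypothetical, hand-written table for a few German diacritics.
// The real database may structure and name these entries differently.
const germanDecompose = new Map([
  ['ä', 'ae'], ['ö', 'oe'], ['ü', 'ue'],
  ['Ä', 'Ae'], ['Ö', 'Oe'], ['Ü', 'Ue'],
  ['ß', 'ss'],
]);

// Reverse lookup: ASCII sequence -> diacritic.
const asciiToDiacritic = new Map(
  Array.from(germanDecompose, ([diacritic, ascii]) => [ascii, diacritic])
);

// Schön => Schoen: replace every mapped diacritic with its ASCII form.
function toAscii(text) {
  return Array.from(text, (ch) => germanDecompose.get(ch) || ch).join('');
}

// Schoen <=> Schön: build a pattern that accepts both spellings, so that
// a search for one form also finds the other.
function toSearchPattern(word) {
  // Escape regular expression metacharacters in the ASCII form first.
  const escaped = toAscii(word).replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const sequences = new RegExp(Array.from(asciiToDiacritic.keys()).join('|'), 'g');
  const source = escaped.replace(
    sequences,
    (ascii) => `(?:${ascii}|${asciiToDiacritic.get(ascii)})`
  );
  return new RegExp(source);
}

console.log(toAscii('Schön'));
console.log(toSearchPattern('Schoen').test('Schön'));
```

Note that the reverse direction is handled as a search pattern rather than a blind replacement: rewriting every "oe" to "ö" would also change words that legitimately contain "oe".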

This repository contains the database of the project.

Documentation

The specification can be found in the spec folder.

Contributing

Thanks for contributing 🎉 Most data is user-contributed, so your help is very much appreciated.
To get started, please read the documentation. It should give you the basic idea of the database structure. Then you can start reviewing existing, unvalidated language files or create new ones in the source folder. Unvalidated languages contain a comment in the header of the file to indicate that they need to be reviewed by a native speaker.

database's People

Contributors

greenkeeper[bot], hadithmv, julkue, mottie

database's Issues

Turkish diacritics

Ş > S
ş > s
Ç > C
ç > c
Ğ > G
ğ > g
Ü > U
ü > u
İ > I
ı > i
Ö > O
ö > o

In Turkish, however, diacritics change the meaning of words, unlike in Arabic.
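
As a rough sketch, the list above could be expressed in the `mapping.base` shape used elsewhere in this project and applied to a string. The object name `turkishEntries` and the helper `toBase` are hypothetical, and the snippet assumes that each diacritic is a single precomposed character after NFC normalization.

```javascript
// Base mappings copied from the list in this issue, written in an assumed
// { mapping: { base } } shape. Names are illustrative only.
const turkishEntries = {
  'Ş': { mapping: { base: 'S' } }, 'ş': { mapping: { base: 's' } },
  'Ç': { mapping: { base: 'C' } }, 'ç': { mapping: { base: 'c' } },
  'Ğ': { mapping: { base: 'G' } }, 'ğ': { mapping: { base: 'g' } },
  'Ü': { mapping: { base: 'U' } }, 'ü': { mapping: { base: 'u' } },
  'İ': { mapping: { base: 'I' } }, 'ı': { mapping: { base: 'i' } },
  'Ö': { mapping: { base: 'O' } }, 'ö': { mapping: { base: 'o' } },
};

// Replace each listed character with its base; leave everything else as-is.
function toBase(text) {
  // NFC composes sequences such as "I" + combining dot into a single "İ".
  return Array.from(text.normalize('NFC'), (ch) =>
    Object.prototype.hasOwnProperty.call(turkishEntries, ch)
      ? turkishEntries[ch].mapping.base
      : ch
  ).join('');
}

console.log(toBase('İstanbul'));
console.log(toBase('Iğdır'));
```

Because the mapping is lossy, a consumer that cares about meaning would presumably only use it for matching or searching, not for rewriting stored text.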

Concept & Specification

The purpose of this repository is to collect diacritics together with their associated ASCII characters in a structured form. It should be the central place for projects that need diacritics mappings.

As there is no single, trustworthy and complete source, all information needs to be collected manually by users.

Example mapping:

Schön => Schoen
Schoen => Schön

User Requirements

Someone using diacritics mapping information.

It should be possible to:

  1. Output diacritics mapping information in a CLI and web interface
  2. Output diacritics mapping information for various programming languages, e.g. as a JavaScript array/object
  3. Fetch diacritics mapping information in builds
  4. Filter diacritics mapping information by:
    • Diacritic
    • Mapping value
    • Language
    • Continent
    • Alphabet (e.g. Latin)
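
The filter requirement in point 4 could look roughly like the following. The flat entry shape (`diacritic`, `base`, `language`, `continent`, `alphabet`), the sample values and the function name `filterEntries` are assumptions made for this sketch; the actual database structure is defined in the spec.

```javascript
// Hypothetical, flattened sample entries. Field names and values are
// placeholders and do not reflect the real database layout.
const sampleEntries = [
  { diacritic: 'ö', base: 'o', language: 'de', continent: 'EU', alphabet: 'Latn' },
  { diacritic: 'ü', base: 'u', language: 'de', continent: 'EU', alphabet: 'Latn' },
  { diacritic: 'đ', base: 'd', language: 'vi', continent: 'AS', alphabet: 'Latn' },
];

// Keep only the entries that match every given criterion exactly.
function filterEntries(entries, criteria) {
  return entries.filter((entry) =>
    Object.keys(criteria).every((key) => entry[key] === criteria[key])
  );
}

// All German entries:
console.log(filterEntries(sampleEntries, { language: 'de' }));
// All entries that map to "d":
console.log(filterEntries(sampleEntries, { base: 'd' }));
```

Passing several criteria at once combines them with a logical AND, which seems the most natural reading of the requirement.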

Contributor Requirements

Someone providing diacritics mapping information.

We assume every contributor has a GitHub account and is familiar with Git.

Providing information should be:

  1. Easy to collect
  2. Possible without manual invitations
  3. Possible without registration (an exception is: "Register with GitHub")
  4. Done in one place
  5. Easy to check for structural correctness
  6. Checkable before acceptance by another contributor familiar with the language
  7. Possible without a Git clone

System Specification

There are two ways to realize this:

  1. Create a JSON database in this GitHub repository, as this fits user and contributor requirements.

  2. Create a database in a third-party service that fits the user and contributor requirements.

    Tested:

    • Transifex: Doesn't fit requirements. It would allow providing mapping information, but not metadata.
    • Contentful: Doesn't fit requirements. It would require a manual invitation and registration.

Because we're not familiar with any other third-party services that could fit the user and contributor requirements, we'll proceed with the first option.

System Requirements

See the documentation and pull request.

Build & Distribution

Build

According to the contributor requirements, it should be possible to compile source files without a Git clone. This means that we can't require users to run e.g. $ grunt dist at the end, since this would require them to clone the repository, install dependencies and run the build. Instead, we'll implement a build bot that runs our build on Travis CI and commits changes directly to a dist branch in this repository. Once you merge something or commit something yourself, the dist branch will be updated automatically. Some people already do this to update their gh-pages branch when something changes in the master branch (e.g. this script).

Since we'll use a server-side component to filter and serve actual mapping information we just need to generate one diacritics.json file containing all data.

To make parsing easier and to encode diacritics as Unicode numbers in production, we need a build that minifies the files and encodes diacritics. This should be done using Grunt.
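
A minimal sketch of such a build step is shown below. It assumes that "Unicode numbers" means `\uXXXX` escape sequences, which is one possible reading; the function names `encodeNonAscii` and `buildDist` are hypothetical. The functions are plain JavaScript that a Grunt task could call, and no Grunt API is used here.

```javascript
// Replace every non-ASCII UTF-16 code unit with a \uXXXX escape sequence.
// Characters outside the BMP end up as two escaped surrogates, which is
// still valid JSON.
function encodeNonAscii(json) {
  return json.replace(/[\u0080-\uffff]/g, (ch) =>
    '\\u' + ch.charCodeAt(0).toString(16).padStart(4, '0')
  );
}

// Minify (JSON.stringify without indentation) and encode in one step.
function buildDist(database) {
  return encodeNonAscii(JSON.stringify(database));
}

const dist = buildDist({ 'ö': { mapping: { base: 'o' } } });
console.log(dist);
// Parsing the encoded file yields the original data again.
console.log(JSON.parse(dist));
```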

Integrity

In order to ensure integrity and consistency we need the following in our build process:

  • A JSON validator that validates database files (must work with comments)
  • A code style guideline, e.g. .jsbeautify
  • A linter for JSON files that makes sure the database is formatted according to the code style
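
The first point, a validator that tolerates comments, could be approached as sketched here: strip comments while leaving string contents untouched, then hand the result to `JSON.parse`. The function names are made up for this illustration, and an existing, well-tested package may be a better choice in the real build.

```javascript
// Remove // line comments and /* block */ comments that appear outside of
// JSON strings. String contents (e.g. "http://...") are left untouched.
function stripJsonComments(text) {
  let out = '';
  let inString = false;
  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    const next = text[i + 1];
    if (inString) {
      out += ch;
      if (ch === '\\' && i + 1 < text.length) {
        out += next; // copy the escaped character verbatim
        i++;
      } else if (ch === '"') {
        inString = false;
      }
    } else if (ch === '"') {
      inString = true;
      out += ch;
    } else if (ch === '/' && next === '/') {
      // Skip to the last character before the line break.
      while (i + 1 < text.length && text[i + 1] !== '\n') i++;
    } else if (ch === '/' && next === '*') {
      const end = text.indexOf('*/', i + 2);
      if (end === -1) throw new SyntaxError('Unterminated block comment');
      i = end + 1;
    } else {
      out += ch;
    }
  }
  return out;
}

// Report whether a database file is valid JSON once comments are removed.
function validateDatabaseFile(text) {
  try {
    return { valid: true, data: JSON.parse(stripJsonComments(text)) };
  } catch (error) {
    return { valid: false, error: error.message };
  }
}

console.log(validateDatabaseFile('/* needs review */ { "ö": { "base": "o" } }'));
```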

Distribution

To provide diacritics mappings according to the User Requirements, it's necessary to run a custom server-side component that can sort, limit and filter the information and output it in different ways (e.g. as a JS object or array). This component should be realized using Node.js, as it handles JS/JSON files natively, whereas PHP would require a lot more serializing/deserializing.
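
The core of such a component might be a pure function like the one below, which a Node.js request handler could call. The option names (`sortBy`, `limit`, `output`), the flat entry shape and the function name `formatEntries` are assumptions made for this sketch and do not describe an existing API.

```javascript
// Sort, limit and shape a list of entries for output.
// Entries are assumed to look like { diacritic: 'ö', base: 'o', ... }.
function formatEntries(entries, { sortBy = 'diacritic', limit = Infinity, output = 'array' } = {}) {
  // Plain code-unit comparison keeps the order independent of the locale.
  const sorted = [...entries].sort((a, b) =>
    a[sortBy] < b[sortBy] ? -1 : a[sortBy] > b[sortBy] ? 1 : 0
  );
  const limited = sorted.slice(0, limit);
  if (output === 'object') {
    // Key the result by diacritic and keep the remaining fields as value.
    return Object.fromEntries(
      limited.map(({ diacritic, ...rest }) => [diacritic, rest])
    );
  }
  return limited;
}

const entries = [
  { diacritic: 'ü', base: 'u' },
  { diacritic: 'ö', base: 'o' },
  { diacritic: 'đ', base: 'd' },
];
console.log(formatEntries(entries, { sortBy: 'base', limit: 2, output: 'object' }));
```

Keeping this logic free of HTTP details would make it easy to reuse the same function for the CLI output mentioned in the user requirements.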

Next Steps

  • Finalize system requirements
  • Create a spec .md file that specifies the entire database structure in detail
  • Implement the basics according to the system requirements
  • Once the basics exist, start collecting repositories that use diacritics and invite owners and stargazers to share their country-specific mapping information. It's in their interest to drive development forward.

This comment is updated continuously during the discussion

An in-range update of simple-git is breaking the build 🚨

Version 1.95.0 of simple-git was just published.

Branch Build failing 🚨
Dependency simple-git
Current Version 1.94.0
Type devDependency

This version is covered by your current version range and after updating it in your project the build failed.

simple-git is a devDependency of this project. It might not break your production code or affect downstream projects, but probably breaks your build or test tools, which may prevent deploying or publishing.

Status Details
  • continuous-integration/travis-ci/push The Travis CI build could not complete due to an error Details

Commits

The new version differs by 11 commits.

See the full diff

FAQ and help

There is a collection of frequently asked questions. If those don’t help, you can always ask the humans behind Greenkeeper.


Your Greenkeeper Bot 🌴

Case Sensitive / Capital Characters

Currently we have an output like e.g.:

"ü": {
  "mapping": {
    "base": "u",
    "decompose": {
      "titleCase": "Ue",
      "upperCase": "UE",
      "lowerCase": "ue"
    }
  }
},

In order to keep the behavior of mark.js' regular expression creation, we need to differentiate between capital and non-capital characters. For the above-mentioned example, I could imagine an output like:

"ü": {
  "capital": false,
  "mapping": {
    "base": "u",
    "decompose": {
      "lowerCase": "ue"
    }
  }
}

So, a new property capital was added, and titleCase as well as upperCase were removed. For the opposite -- the capital character:

"Ü": {
  "mapping": {
    "base": "U",
    "decompose": {
      "titleCase": "Ue",
      "upperCase": "UE",
      "lowerCase": "ue"
    }
  }
}

it could be:

"Ü": {
  "capital": true,
  "mapping": {
    "base": "U",
    "decompose": {
      "titleCase": "Ue",
      "upperCase": "UE"
    }
  }
}
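
To illustrate why the split helps, here is a sketch of how a consumer could turn the proposed entries into a regular expression. This is not mark.js' actual implementation; the helper names `alternativesFor` and `toPattern` are made up for this example.

```javascript
// The two proposed entries from above.
const proposed = {
  'ü': {
    capital: false,
    mapping: { base: 'u', decompose: { lowerCase: 'ue' } },
  },
  'Ü': {
    capital: true,
    mapping: { base: 'U', decompose: { titleCase: 'Ue', upperCase: 'UE' } },
  },
};

// Collect everything a diacritic may be written as. Longer alternatives
// come first so that "ue" is preferred over a bare "u".
function alternativesFor(diacritic, entry) {
  const { base, decompose = {} } = entry.mapping;
  return [diacritic, ...Object.values(decompose), base];
}

// Build a pattern for one diacritic. The values here are plain letters,
// so no escaping of regular expression metacharacters is done.
function toPattern(diacritic, entry, caseSensitive) {
  const source = `(?:${alternativesFor(diacritic, entry).join('|')})`;
  return new RegExp(source, caseSensitive ? '' : 'i');
}

console.log(toPattern('ü', proposed['ü'], true).source);
console.log(toPattern('Ü', proposed['Ü'], true).source);
```

With separate entries per case, a case-sensitive search for "Ü" no longer matches "ue", while a case-insensitive search can still use the `i` flag.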

What do you think @Mottie?

Btw.: I've noticed that the equivalents generation is inconsistent. We have always used camelCase (e.g. "languageNative") but in the equivalents part we have e.g. html_decimal.
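
One possible way to make the generated keys consistent is to convert them to camelCase during the build. In this sketch only the key html_decimal is taken from the text above; the html_hex key, the sample values and the function names are placeholders.

```javascript
// Convert a snake_case key such as "html_decimal" to camelCase.
function toCamelCase(key) {
  return key.replace(/_([a-z0-9])/g, (match, letter) => letter.toUpperCase());
}

// Return a copy of a flat object with all keys converted to camelCase.
function camelCaseKeys(object) {
  return Object.fromEntries(
    Object.entries(object).map(([key, value]) => [toCamelCase(key), value])
  );
}

// Hypothetical equivalents block for "ü" (U+00FC, decimal 252).
const equivalents = { html_decimal: '&#252;', html_hex: '&#xfc;' };
console.log(camelCaseKeys(equivalents));
```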
