
Concept & Specification · database (67 comments, closed)

diacritics commented on June 19, 2024
Concept & Specification

Comments (67)

andrewplummer commented on June 19, 2024

Paging @Kimtaro! My above-mentioned friend is not only a Unicode nut but also a linguistics nut. He's also Swedish, and they love their diacritics.

Mottie commented on June 19, 2024

I don't know of any other third-party services, but I'm sure there are more. I'll keep an eye out.

I like what you have so far. I do have a few points I wanted to add:

Also, I would love to hear what ideas @mathiasbynens might have on this subject.

andrewplummer commented on June 19, 2024

Hello,

The Sugar diacritics were based purely on my own research with a healthy dose of consulting with my European friends/followers. The goal was simply to provide an "80% good" sort order for most languages compared to the default Unicode code point sort in raw Javascript. It's not intended to be anything close to a complete collation algorithm, let alone an authority on diacritic mappings.

I definitely agree that there is a need for this. For my part, I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it.

Thanks for working on this!

julmot commented on June 19, 2024

@Mottie

A combining diaeresis can be used in combination with just about any letter. It's not language-specific; it's a Unicode mechanism for creating characters that visibly appear identical.

Thanks, just learned something new. I think adding those equivalents will be up to the authors most of the time, as users probably don't know much about them.

I've invited the author of shapecatcher to participate in this discussion.

@andrewplummer
Thank you for participating! 👍

I would probably not add it as a dependency to Sugar but instead make sure that a project that adds it could quickly set up the sorting to work with it.

I see two ways of distribution:

  • Offering as dependency, e.g. via Bower. This would allow developers to use diacritics mapping within their applications very easily.
  • Offering as build integration. This would allow developers of libraries like you (and me) to use the mapping inside their builds, e.g. by replacing a placeholder like <% diacritics %>. You could also offer this as an additional add-in that integrates into your library – by creating a separate file that would overwrite the method that maps diacritics. In this case you don't need to have any production dependency, just one build helper to integrate diacritics mapping.

Would the latter be something you'd be interested in using? I'm asking because it's important to know whether library authors could imagine integrating it this way. If not, what would be your preferred way?
@Mottie What do you think about this distribution?

Mottie commented on June 19, 2024

Vietnamese is going to be fun... there are a lot of diacritics with multiple combining marks which may be added in different orders.

ẫ = a + ̃  + ̂  OR a + ̂  + ̃ 

Which means the equivalents array would need to include each order combination.

"ẫ" : [
    "a\u0303\u0302", // a + ̃  + ̂ 
    "a\u0302\u0303", // a + ̂  + ̃ 
    "\u1eab"         // ẫ
]
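
For what it's worth, a quick console check confirms that normalization alone won't unify the two orders (both marks have combining class 230, so Unicode never reorders them), which is why both orders belong in the array:

console.log("a\u0302\u0303".normalize("NFC") === "\u1eab"); // true:  a + ̂ + ̃ composes to ẫ
console.log("a\u0303\u0302".normalize("NFC") === "\u1eab"); // false: a + ̃ + ̂ stays decomposed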

andrewplummer commented on June 19, 2024

@julmot To answer your above question, the src files I use can be parsed directly, which is something I wouldn't like to lose, so I probably wouldn't use the <% diacritics %> idea. Instead, I would probably prefer to modify my code to "speak" the format that you guys are deciding upon here. Support for this format could be added in a minor version and my format would be deprecated in a major version. The user would then link these at runtime.

If I might add to the discussion, I read through the thread and had the same idea as @Mottie that the filenames should include the territory, i.e. de_AT.js. If there are no dialects or the dialects are all equivalent, it could just be de.js.

Lastly, I can tell you that Chinese and Korean should not have diacritic marks. Japanese has two: the voiced and semi-voiced sound marks. Unicode reserves a combining form for both, but I've never seen them used, as the pre-combined forms all have their own code points (the number of combinations is rather limited).
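
For example, the pre-combined forms and the combining marks are canonically equivalent (U+3099 voiced, U+309A semi-voiced), which a quick console check confirms:

console.log("\u304b\u3099".normalize("NFC") === "\u304c"); // true: か + dakuten → が
console.log("\u306f\u309a".normalize("NFC") === "\u3071"); // true: は + handakuten → ぱ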

gromo commented on June 19, 2024

@julmot

  1. Are those all the diacritic characters in Russian and Uzbek?

In Russian, I believe, yes. Also, according to https://en.wikipedia.org/wiki/Diacritic:

Belarusian and Russian have the letter ё. In Russian, this letter is usually replaced by е, although it has a different pronunciation. The use of е instead of ё does not affect the pronunciation. Ё is always used in children's books and in dictionaries. A minimal pair is все (vs'e, "everybody" pl.) and всё (vs'o, "everything" n. sg.). In Belarusian the replacement by е is a mistake, in Russian, it is permissible to use either е or ё for ё but the former is more common in everyday writing (as opposed to instructional or juvenile writing).

There is only one diacritic letter – Ёё.

According to https://en.wikipedia.org/wiki/Uzbek_alphabet there are the diacritic letters Oʻ oʻ & Gʻ gʻ in the modern Latin-based alphabet. And in Uzbek Cyrillic: Ҳ ҳ => Х х, Қ қ => К к, Ў ў => У у, Ғ ғ => Г г.

When the Uzbek language is written using the Latin script, the letters Oʻ (Cyrillic Ў) and Gʻ (Cyrillic Ғ) are properly rendered using the character U+02BB ʻ MODIFIER LETTER TURNED COMMA.[5] However, since this character is absent from most keyboard layouts and many fonts, most Uzbek websites – including some operated by the Uzbek government[2] – use either U+2018 ‘ LEFT SINGLE QUOTATION MARK or straight (typewriter) single quotes to represent these letters.

gromo commented on June 19, 2024

@Mottie In this case Wikipedia contains more detailed information than I have, living in Uzbekistan. If you look at my previous message, I mentioned the same letters based on my knowledge. The only difference is the letter Йй in the Cyrillic alphabet – I'm not sure whether it counts as a diacritic of the letter Ии, but I've seen systems where Й was replaced by И plus a Unicode combining character after it.

julmot commented on June 19, 2024

@Mottie Thanks for this information. We definitely need to investigate this for more languages.

I've thought about this again and came to the conclusion that even if the decompose property is unnecessary in almost every language (German and Norwegian being exceptions), the database still makes sense. Users can't use the ASCIIFolding class directly, as it's not possible for them to integrate it (it's a Java class). Our project would make it possible for them to use it. We're also providing metadata for all languages that allows users to filter them by their needs, and processes to integrate the data into their projects.

Mottie commented on June 19, 2024

Hey @julmot!

I'll clean up what I have in the works and post it in about 4 hours (after I go to the gym)... it is still incomplete, but it'll give you an idea of where things are now.

Mottie commented on June 19, 2024

As I said, it's still a work-in-progress... https://github.com/diacritics/node-diacritics-transliterator/tree/initial

julmot commented on June 19, 2024

@Mottie Would you mind submitting a PR? That would allow us to have a conversation about it directly in the repository.

julmot commented on June 19, 2024

@Mottie Do you know any other third-party services worth mentioning?
Do you agree with the specified requirements or do you have any other ideas or concerns to share?

julmot commented on June 19, 2024

Thanks for sharing your ideas, @Mottie.

The file names should also include the territory, just as the CLDR database is set up. For example, the German language should include all of these files [...]

Why do you think this is necessary? In the case of German there aren't any differences between e.g. the Austrian or Swiss variants.

Normalization of the code is still important as a diacritic can be represented in more than one way

What would be your solution approach here?

Collation rules might help with diacritic mapping for each language

How would you integrate these rules into the creation process of diacritics mapping?

Btw: As a collaborator you're allowed to update the specifications too.

Mottie commented on June 19, 2024

Why do you think this is necessary? In the case of German there aren't any differences between e.g. the Austrian or Swiss variants.

It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.

What would be your solution approach here?

Well, I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases... maybe? If I use the example from this page for the Latin capital letter A with ring above, the data section would need to look something like this:

"data":{
  // Latin capital letter A with ring above (U+00C5)
  "Å":{
    "mapping":"A",
    "equivalents" : [
      "Å", // Angstrom sign (U+212B)
      "Å", // A (U+0041) + Combining Ring above (U+030A)

      // maybe include the key that wraps this object as well?
      "Å" // Latin capital letter A with ring above (U+00C5)
    ]
  }
}
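
For what it's worth, those three forms can be verified mechanically (assuming a JS runtime with String.prototype.normalize): they all share the same canonical decomposition.

const forms = ["\u00c5", "\u212b", "A\u030a"]; // precomposed Å, Angstrom sign, A + combining ring
console.log(forms.every(f => f.normalize("NFD") === "A\u030a")); // true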

Btw: As a collaborator you're allowed to update the specifications too.

I know I am, but we're still discussing the layout 😉

julmot commented on June 19, 2024

It's more of a "just-in-case there are differences" matter. I'm not saying duplicate the file for all territories.

That makes sense. Adding additional language variants would be optional. I've added this to the SysReq.

Well, I don't think we'd need to do the normalization ourselves, but we would need to cover all the bases

Good catch! I didn't know what you meant here at first glance – because I'm not familiar with any language that has such equivalents. Added this to the SysReq too.

Mottie commented on June 19, 2024

I've updated the spec example. A combining diaeresis can be used in combination with just about any letter. It's not language-specific; it's a Unicode mechanism for creating characters that visibly appear identical.

There is another useful site I found, but it's not working for me at the moment - http://shapecatcher.com/

Mottie commented on June 19, 2024

Thanks @andrewplummer! you 🚀!

@julmot

  • Yes, distribution via Bower and npm is pretty much a given. As long as we provide optimized data (e.g. based on the diacritic, language, etc.) I think we'll be fine. I'm sure the users will let us know if we need to add more.

  • I'm not sure that "continent" is needed in the data, or what should be done if the language isn't associated with one, e.g. Esperanto. Would "null" be appropriate?

  • I think adding "native" (or equivalent) to the metadata would also be beneficial

    "metadata": {
      "alphabet": "Latn",
      "continent": "EU",
      "language": "German",
      "native": "Deutsch"
    }
    

    mostly for selfish reasons as it is easier to search for "Deutsch" than it is to type out "German language" 😁

  • Mapping should provide both the character with the accent removed and the decomposed form. If you look at this section on dealing with umlauts, you'll see that there are three ways to deal with them.

  • While attempting to construct a few files, I found it difficult to determine whether an equivalent was a single Unicode character or a combination. I know you like to see the actual character, but maybe for the equivalents it would be best to use the Unicode value. I'm just not sure how to make editing equivalents easier.

  • So this is what I have so far:

    {
      "metadata": {
        "alphabet": "Latn",
        "continent": "EU",
        "language": "German",
        "native": "Deutsch"
        // there could be more
      },
      // Sources:
      // diacritic list: https://en.wikipedia.org/wiki/German_orthography#Special_characters
      // mapping: https://en.wikipedia.org/wiki/German_orthography#Sorting
      "data": {
        "ü": {
          "mapping": {
            "base": "u",
            "decompose": "ue"
          },
          "equivalents": [
            "u\u0308", // u + Combining diaeresis (U+0308)
            "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
          ]
        },
        "ö": {
          "mapping": {
            "base": "o",
            "decompose": "oe"
          },
          "equivalents": [
            "o\u0308", // o + Combining diaeresis (U+0308)
            "\u04e7",  // Cyrillic small letter o with diaeresis (U+04E7)
            "\u00f6"   // Latin small letter o with diaeresis (U+00F6)
          ]
        },
        "ä": {
          "mapping": {
            "base": "a",
            "decompose": "ae"
          },
          "equivalents": [
            "a\u0308", // a + Combining diaeresis (U+0308)
            "\u04d3", // Cyrillic small letter a with diaeresis (U+04D3)
            "\u00e4"  // Latin small letter a with diaeresis (U+00E4)
          ]
        },
        "ß": {
          "mapping": {
            "base": "\u00df",  // unchanged
            "decompose": "ss"
          }
        }
      }
    }
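
As a rough sketch of how a consumer might apply this mapping (the fold helper below is hypothetical, not part of the spec; data is a trimmed copy of the object above):

const data = {
  "ü": { "mapping": { "base": "u", "decompose": "ue" } },
  "ß": { "mapping": { "base": "\u00df", "decompose": "ss" } }
};

function fold(str, useDecompose) {
  return Array.from(str).map(ch => {
    const entry = data[ch];
    if (!entry) return ch;
    const m = entry.mapping;
    return useDecompose && m.decompose ? m.decompose : (m.base || ch);
  }).join("");
}

console.log(fold("Müller"));       // "Muller"
console.log(fold("Müller", true)); // "Mueller"
console.log(fold("Straße", true)); // "Strasse"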

Mottie commented on June 19, 2024

Next question. To allow adding comments to the source files, would you prefer to make them:

  • Plain .js files (using grunt to convert them to JSON),
  • YAML (using grunt-yaml to convert it to JSON),
  • Hjson (using grunt-hjson to convert it to JSON),
  • or something else?

julmot commented on June 19, 2024

@Mottie

Yes, distribution via Bower and npm is pretty much a given. As long as we provide optimized data (e.g. based on the diacritic, language, etc.) I think we'll be fine. I'm sure the users will let us know if we need to add more.

When I was talking about distribution using Bower, I didn't mean distributing the actual data. I meant a build helper that then fetches the data from this repository. This way we can have a specific version for our build helper, but our users will always get the latest diacritics mapping information.
I see two ways here:

  • We distribute the actual data, as you've mentioned, into e.g. a dist folder. This will cause many variants based on the filter criteria in the User Requirements. We could reduce these criteria, but I'm quite sure users will request them in the future. The build helper could then simply load one existing file from the GitHub server.
  • We build a server-side service that fetches the data from this repository and provides it in the requested format. The build helper could then simply send a request and get the result as one file.

While I personally don't like the idea of creating a server-side component, I also see that there would otherwise be many file variants. If we opt for the former, we'd need to specify a good dist structure to make things easy to find.

What do you think?

I'm not sure that "continent" is needed in the data, or what should be done if the language isn't associated with one, e.g. Esperanto. Would "null" be appropriate?

No, it's not needed, but it would be a nice-to-have. Imagine a customer distributing an application to a whole continent, e.g. Europe. Then all the mapping information could be included without selecting every European country manually.
In case a country is associated with multiple continents, like Russia, we'd need to specify them inside an array.
I don't know of any accepted language that isn't associated with a country. Esperanto seems like an idea of hippies, I'd vote for just ignoring it as there'll probably be no significant demand. But if we include it, I'd just add every continent inside an array, as it's globally used.

I think adding "native" (or equivalent) to the metadata would also be beneficial

Great idea. It would then be possible to select country-specific diacritic mapping information by native language spellings. But it would be another variant to consider in the distribution (see above).

Mapping should provide both the character with the accent removed and the decomposed form. If you look at this section on dealing with umlauts, you'll see that there are three ways to deal with them.

Based on this article I agree with you, and I'd vote for using it like you did: having a base and decompose property when available, otherwise a simple string.

While attempting to construct a few files, I found it difficult to determine whether an equivalent was a single Unicode character or a combination. I know you like to see the actual character, but maybe for the equivalents it would be best to use the Unicode value. I'm just not sure how to make editing equivalents easier.

I agree with you. It's also hard to review when there is no visual difference. Would you mind updating the system requirements with this information?

Another open question about equivalents for me is: who will collect them? We can't expect users to do this, and in that case, how do we integrate it into the workflow? When a user submits a pull request containing a new language, we'd need to merge it and then add the equivalents on the master branch.

Next question. To allow adding comments to the source files, would you prefer to make them

I'd prefer using strict JSON in .js files, to allow code formatting (it won't work with atom-beautify otherwise) and to avoid editors flagging errors when comments are added. We'd need to integrate a JSON validator into the build. We'd also need to integrate a component that makes sure all database files are correctly formatted (according to a code style). And finally, we need to create a few code style files beforehand (e.g. .jsbeautify).

Mottie commented on June 19, 2024

I meant a build helper that then fetches the data from this repository.

That sounds like a good idea; there will likely be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.

Esperanto seems like an idea of hippies, I'd vote for just ignoring it as there'll probably be no significant demand.

LOL, that's fine

Another open question about equivalents for me is: who will collect them?

That's when shapecatcher and FileFormat.info become useful! I can help work on the initial data collection. And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain. We can then reference these files in the build process.

src/
├── de/
│   ├── de.js
├── equivalents/
│   ├── ü.js
│   ├── ö.js
│   ├── ä.js

I'm not sure if using the actual character would fare well when accessing the file, so maybe using the Unicode value would be better (e.g. u-00fc.js instead of ü.js)?

Inside of the file:

/**
 * Visual equivalents of ü
 */
[
    "u\u0308", // u + Combining diaeresis (U+0308)
    "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
]

If, for some unique reason, a character has a different equivalent, we could define it in the language file and then concatenate the equivalents values? Or not concatenate at all, depending on what we discover. Actually, now that I think about it, I remember reading somewhere that some fonts provide custom characters in the Unicode private use areas, but let's not worry about that unless it comes up.

We'd need to integrate a JSON validator into the build.

The Grunt file uses grunt.file.readJSON, and within the process of building the files we'll end up using JSON.parse and JSON.stringify, which will throw errors if the JSON isn't valid. I think it would be difficult to validate the JSON before the comments are stripped out.
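
A minimal sketch of that strip-then-validate step (Node.js; the comment stripping below is deliberately naive and assumes "//" never appears inside a string value – a real build would use a proper parser such as Hjson):

const fs = require("fs");

function isValidSource(path) {
  const stripped = fs.readFileSync(path, "utf8")
    .replace(/^\s*\/\/.*$/gm, "")     // drop full-line comments
    .replace(/\s\/\/[^"\n]*$/gm, ""); // drop trailing comments (none inside strings)
  try {
    JSON.parse(stripped);
    return true;
  } catch (e) {
    console.error(path + ": " + e.message);
    return false;
  }
}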

As for beautifying the JSON, JSON.stringify would do that:

JSON.stringify({a:"b"}, null, 4);
/* result:
{
    "a": "b"
}
*/

julmot commented on June 19, 2024

That sounds like a good idea; there will likely be many variants. But sadly, my knowledge of pretty much all things server-side is severely lacking, so I won't be of much help there.

I'm quite sure you can help here, you just don't know it yet 😆 If we decide to implement a server-side component then we'll set it up using Node.js, as we're handling only JS/JSON files and using it makes things a lot easier than e.g. PHP. While you might not be familiar with it in detail, if I set up the basics you'll probably understand it quickly.
Anyway, to come to a conclusion at this point: I think we need to build a server-side component. Otherwise many variants would be necessary, and it might be confusing to have that many files in a dist folder.

And since visual equivalents won't change for a given character, a separate folder that contains the equivalents data would be much easier to maintain.

Sorry, I didn't understand the benefit of this if we're going to collect them using the Unicode number. Could you help me understand the benefit by explaining it a little more?

JSON.parse and JSON.stringify, which will throw errors if the JSON isn't valid.

That would be enough.

As for beautifying the JSON, JSON.stringify would do that

I didn't mean to beautify them in the build; I meant to implement a build integration that checks whether they are correctly formatted inside the src folder. Beautifying won't be necessary for the output.

What do you think of my question in your PR?

Wouldn't it make sense to provide the sources in the metadata object instead of in comments? If they were entered manually by users without sources, we could fill in "provided by users" or something similar.

Mottie commented on June 19, 2024

Could you help me understand the benefit by explaining it a little more?

Well, when it comes to normalization, there are a limited number of visual equivalents for any given character. When we list the equivalents for a ü, we'll be repeating the same values across multiple languages. I was proposing centralizing these values in one place, then adding them to the language file during a build, but only if the "equivalents" value is undefined in the language file and there is an existing equivalents file for the character.

Example language file:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        }
        // no equivalents added here
    },
    ...

equivalents file

/**
 * Visual equivalents of ü
 */
[
    "u\u0308", // u + Combining diaeresis (U+0308)
    "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
]

Resulting file:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        },
        "equivalents": [
            "u\u0308",
            "\u00fc"
        ]
    },
    ...

I hope that explains my idea better.

provide the sources in the metadata object instead of in comments?

Yes, that is a better idea. I guess I missed that question in the PR. I'll update the spec.

julmot commented on June 19, 2024

@Mottie I understand this. But what I still don't understand is the benefit of saving them in a separate file.

I was proposing centralizing these values in one place

Saving them in the "equivalents" property would be one central place too?

Mottie commented on June 19, 2024

Saving them in the "equivalents" property would be one central place too?

Yes, that would work too. Where would that value be placed within the file?

julmot commented on June 19, 2024

Like you've specified, in the equivalents property.

        "ü": {
            "mapping": {
                "base": "u",
                "decompose": "ue"
            },
            "equivalents": [
                "u\u0308", // u + Combining diaeresis (U+0308)
                "\u00fc"   // Latin small letter u with diaeresis (U+00FC)
            ]
        }

I'm quite sure we're misunderstanding each other at some point, but I'm not sure where.

Mottie commented on June 19, 2024

What I'm saying is, for example, if you look at this list of alphabets, under "Latin", you'll see there are a lot of languages that use the á diacritic. Instead of maintaining a list of visual equivalents for that one diacritic within each language file, we centralize it in one place and add it to each file during the build process.

Mottie commented on June 19, 2024

Did that clarify things? And what are we going to do about uppercase characters?

julmot commented on June 19, 2024

@Mottie

Did that clarify things?

Yes, thanks! I think I understand you now. You meant to extract them into separate files to avoid redundant information in the mapping files.

Seems like a good idea. Let's talk about the filenames. Naming them after the diacritic itself may cause issues on some operating systems, but naming them after the Unicode number would make it impossible to find them quickly. Maybe we could map them by giving them a unique ID? Or do you see any alternatives?
Example:

"data": {
    "ü": {
        "mapping": {
            "base": "u",
            "decompose": "ue"
        },
        "equivalents": 1 // 1 would be a filename: ../equivalents/1.js
    },
    ...

Not the most beautiful variant though.

And what are we going to do about uppercase characters?

I've replied to this question here

There may be diacritics that only exist as upper case characters. To play it safe, I'd include upper case diacritics as well. Don't you think so?

Mottie commented on June 19, 2024

Maybe we could map them by giving them a unique ID? Or do you see any alternatives?

Now that I've counted how many diacritics are listed just in the "Latin" table (246), I think it might be a better idea to group them a little LOL. I thought about grouping by the "base" letter (with a fallback to the "decompose" value), so there could be around 26 files (not counting characters that need to be encoded), but we haven't even considered languages like Arabic, Chinese and Japanese, and I have no clue where to begin with those. Should we even worry about non-Latin-based languages at this stage?

If the "base" value was a character that needed encoding (e.g. ß, then I think the unicode value would be the best ID for the file. Something like u-00df.js?.

upper case characters

Including both upper and lower case would be the best idea then.

julmot commented on June 19, 2024

I'll come back to this tomorrow with a clear head. Good night 🌙

julmot commented on June 19, 2024

I thought about grouping by the "base" letter (with a fallback to the "decompose" value), so there could be around 26 files (not counting characters that need to be encoded)

Could you update the spec with this?

but we haven't even considered languages like Arabic, Chinese and Japanese

Absolutely right. Before we start implementing the database, we should have a layout that works for all languages.
I've tried to find out whether there are any cases that wouldn't work with our current schema, but wasn't successful. We'll need someone familiar with these languages...

I'd like to ask @gromo if you can help us out. We'd like to know whether the Arabic alphabet contains diacritics like e.g. Latin does, and whether they can be mapped to a "base" character (e.g. "u" when the diacritic is "ü"). Hopefully you're familiar with this alphabet as someone living in Uzbekistan. I'd appreciate your answer!

Mottie commented on June 19, 2024

Could you update the spec with this?

Done. I've updated the spec (and PR). Let me know what you think.

Also, I think ligatures (æ decomposes to ae) need to be mentioned in the spec, since they aren't "officially" considered diacritics.

gromo commented on June 19, 2024

@julmot The Uzbek language uses the Cyrillic/Latin alphabets, so I cannot help you with this.

Mottie commented on June 19, 2024

@gromo we could still use some feedback 😁

@julmot I forgot to ask: does ß only map to SS (uppercase)? If I use the following JavaScript, it gives interesting results:

console.log('ß'.toLowerCase(), 'ß'.toUpperCase());
// result: ß SS

julmot commented on June 19, 2024

@Mottie

Done. I've updated the spec (and PR). Let me know what you think.

Thank you, well done!

I have a few questions about the equivalents spec now:

  1. If we add an equivalents file for every base character, it would also include equivalents from different languages that share the same base character. Am I right? If so, this wouldn't allow output for a specific language only. With that in mind, does it still make sense?
  2. If yes, assume the following situation: an equivalents file exists for a base character, but you don't want to include it and also don't want to overwrite it with manual equivalents. What would be necessary to exclude the equivalents file?
  3. Do you expect that overwriting an equivalents file manually will happen often? If so, it might become confusing over time, as there would no longer be one centralized place for them – which defeats the idea behind it.
  4. Is there a common use case for including HTML codes? Browsers will render them to the actual character, and non-JS languages don't need them.
  5. It might be important to note how to determine a filename like u1eb9u0301.js.

does ß only map to SS (uppercase)?

ß is a lower case character; there is no ß in upper case text. An upper case character does exist (capital ẞ), but it isn't used in the German language (it isn't officially approved), just in Unicode. For the mapping that means:

ß in lower case => ss

ß doesn't map to SS.
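
For completeness, JavaScript's built-in case mapping follows exactly this official rule, and the Unicode-only capital form can be seen too:

console.log("\u00df".toUpperCase()); // "SS" – the official upper case replacement
console.log("\u1e9e");               // "ẞ" – U+1E9E LATIN CAPITAL LETTER SHARP S (Unicode only)
console.log("\u1e9e".toLowerCase()); // "ß"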

@gromo Thanks for your quick reply. I'm sorry to hear that; I thought so because this page lists "Uzbek" under "Arabic". Anyway, could you answer the same questions for the Cyrillic alphabet?

gromo commented on June 19, 2024

@julmot
It's not clear to me what you're working on, but I think the letters you're looking for are (letters => base letters):

Russian:
Ёё => Ее
Йй => Ии

Uzbek:
Gʻgʻ => Gg
Oʻoʻ => Oo

Mottie commented on June 19, 2024

also include equivalents from different languages that share the same base character

Yes. I'm not sure it would matter, though, because we're providing a list of visual equivalents. The user can choose to use the list or not. Am I mistaken here? We're just providing data; we aren't manipulating anything.

what would be necessary to exclude the equivalents file?

I was envisioning that if an equivalents value was defined in the language file (even an empty array or string), then the equivalents would not be added. If we're providing a list to an end user, then I think they can choose whether or not to include the data.

Do you expect that overwriting an equivalents file manually will happen often?

I doubt it. I was thinking that it should be an option though.

Is there a common use case for including HTML codes?

I was actually thinking that a user might be parsing an HTML file for other purposes. If it does sound like a good idea, what about URI encoding? encodeURI('ä') => "%C3%A4". Too much? I do tend to go overboard on ideas LOL.

It might be important to note how to determine a filename like u1eb9u0301.js

I was actually playing around with the idea of building the equivalents JSON - see this demo - it's just a preliminary idea. A third cross-reference of actual equivalents would need to be created and included in the build process (e.g. "ö" = "\u04e7" // Cyrillic small letter o with diaeresis (U+04E7)). This way, we wouldn't need to manually edit the JSON. This might even change my idea of having an equivalents folder: we could just create a temporary JSON file for cross-referencing into the main language file during the build. What do you think?

julmot commented on June 19, 2024

@gromo Thanks for your feedback. Just two more questions:

  1. Are those all the diacritic characters in Russian and Uzbek?
  2. Are those the mappings to the real meaning behind the diacritics, or just the ASCII equivalents? To make my question clearer: "ö" in German would map to "o" in ASCII, but the real meaning behind it is "oe".

@andrewplummer Thank you for your answer!

Support for this format could be added in a minor version and my format would be deprecated in a major version. The user would then link these at runtime.

Sorry, but I don't fully understand you here. What would a user link? A file that overwrites a method? If so, then this file would also need the diacritics mapping information, which would mean that at least in this file a <% diacritics %> placeholder (or something similar) would be necessary.

Lastly, I can tell you that Chinese and Korean should not have diacritic marks. Japanese has two

Thank you very much for this information. @Mottie I guess this makes things simpler and we don't need to spend much time on it.

@Mottie

I was actually thinking that a user might be parsing an HTML file for other purposes. If it does sound like a good idea, what about URI encoding? encodeURI('ä') => "%C3%A4". Too much? I do tend to go overboard on ideas LOL.

😆 If things are generated automatically, I don't see a disadvantage. Otherwise, no, I wouldn't include this.

I was actually playing around with the idea of building the equivalents JSON

and

Yes. I'm not sure it would matter, though, because we're providing a list of visual equivalents. The user can choose to use the list or not. Am I mistaken here? We're just providing data; we aren't manipulating anything.

I have to spend more time on this, understanding the equivalents idea and the automatic generation. I don't have much time at the moment, and starting next week I'll be on vacation. But I'll let you know when I've made progress. In the meantime I'd like to let you know that I'll convert the diacritics project into an organization. This has a few benefits:

  1. It makes clear that this doesn't rest solely on my shoulders. Even if it was my idea, it makes clear that many people are necessary to build this. And since you're spending as much time on it as me, I think it's fair.
  2. We will need multiple repositories – one for the database (this one), one for the server-side component, at least one Node.js build helper, and additionally a Grunt task – and an organization conveys a clear togetherness.
  3. We can create teams of reviewers, maybe even per language. Of course we need to find volunteers first :bowtie:

I've bought the domain diacritics.io already; it temporarily redirects to this repository as long as we don't have a website.

andrewplummer commented on June 19, 2024

Sorry, but I don't fully understand you here.

So a specific example would be something like:

Sugar.Array.setOption('collateEquivalents', obj);

Where obj is a JavaScript object following your above format, essentially mapping every entry in data to its equivalent in decompose for the purpose of collation. Sugar doesn't handle this option per locale, as it is not that advanced yet, but that support could theoretically be added.
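
To make that concrete, a hypothetical obj derived from the German data discussed earlier (the shape is illustrative, not a documented Sugar format):

const obj = {
  "ü": "ue",
  "ö": "oe",
  "ä": "ae",
  "ß": "ss"
};
// Sugar.Array.setOption('collateEquivalents', obj);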

FWIW, I checked with my Unicode-obsessed friend, and the combined diacritic forms may possibly be used in Japanese in some very rare cases.

andrewplummer commented on June 19, 2024

@Mottie Not sure about the above notification? I just added a comment... didn't mean to unassign you??

julmot commented on June 19, 2024

@andrewplummer

Not sure about the above notification? I just added a comment... didn't mean to unassign you??

Whoops, I think we've just encountered a GitHub bug. I removed Rob from the contributors, as I've invited him to join the new diacritics organization. Until he accepts that invitation, he's not available for assignment. GitHub probably detected that, and since you were the first user to act in this issue after I removed him, it attributed the action to you. Strange...

So a specific example would be something like

I understand now. So basically you wouldn't include the diacritics in your files, but would refactor the structure internally to allow users to overwrite them. This is a new use case, as it means that the actual data (mapping information) needs to be available locally, not just fetched in builds. It also means that every library author would need to implement such a method to allow overwriting.
I think that might be a good approach for your specific library, but not generally. It depends on the setup. For example, I'd like to implement the mapping information in mark.js, where I would just set a placeholder in a build template to include the mapping object. Accessing my source files in production isn't allowed, so that wouldn't be a problem.
@Mottie What do you think of this?

andrewplummer commented on June 19, 2024

Wow ok nice bug :)

Yeah, to be honest I think that my use case may not accurately represent the most common use case you will likely encounter. It's a bit of an outlier. I would definitely get some opinions from libraries that would make better use of the data.

julmot commented on June 19, 2024

@andrewplummer

I would definitely get some opinions from libraries that would make better use of the data.

Thanks for the feedback!
Do you have something in mind?

Mottie commented on June 19, 2024

Chinese and Korean should not have diacritic marks. Japanese has two

❤️ Thanks for letting us know @andrewplummer!

If things are generated automatically, I don't see a disadvantage.

So far, I've collected data from several sites, and this demo contains the current result. Here is a snippet showing just the á entry:

    "á": [
        // U+00E1 - LATIN SMALL LETTER A WITH ACUTE
        "\u00e1",
        "&#225;", // HTML decimal code
        "&#x00e1;" // HTML hex code
        "&aacute;", // HTML common entity code
        `a${ACUTE}`
    ]

The code is very messy, but I'll get it cleaned up.

I'll convert the diacritics project into an organization.

Awesome!

I just added a comment... didn't mean to unassign you??

OMG @andrewplummer, why?!

LOL, I left GitHub a little message to let them know.

according to https://en.wikipedia.org/wiki/Diacritic

@gromo As much as I love Wikipedia, would you consider those entries accurate? Are there any other sources you've found that support the information? Either way, thank you for the update!

Mottie commented on June 19, 2024

I found this valuable resource! http://www.unicode.org/reports/tr30/datafiles/DiacriticFolding.txt (see the "precomposed" section halfway down). The main issue is that it was never approved and was deprecated (ref). So, even though it is directly related to our database, would it still be a good idea to use it?

Secondly, I saw this post on the Elasticsearch blog about ASCII (diacritic) folding... they use the ASCII Folding Filter from the Lucene Java library. I'm not sure where to find that Java code, but I suspect they are doing the same thing as in the DiacriticFolding.txt file... I will search for the resource later (after some sleep).

Update: https://github.com/GVdP-Travix/ASCIIFoldingFilter/blob/master/ASCIIFoldingFilter.js

julmot commented on June 19, 2024

@Mottie Thanks for this idea.

First off, the deprecated DiacriticFolding won't be something we can use, as we need to guarantee correctness.

I've had a look at the Elasticsearch site you're referring to, but wasn't able to find the original "ASCIIFolding" project (mapping database). So I've only had a look at the JS mapping you've provided.

From my point of view this would only be a solution for the base property, as the decomposed value isn't covered. For example, I searched for "ü" and only found a mapping to "u". On the other hand, "ß" is mapped to "ss", which is contradictory.

Therefore I have the following questions:

  1. Is this a trustworthy source?
  2. Does the data cover all necessary base mappings? (they specify covered Unicode blocks)
  3. Does the data cover the correct base mappings? (for example, we've defined ß as the mapping for ß, while they define ss)

Mottie commented on June 19, 2024

I don't know the specifics, but the DiacriticFolding was created by a member of the Unicode group. So that may not guarantee correctness, but it might be about as close as we can get right now.

And yeah, I agree that the "ASCIIFolding" should only be used for the base mapping.

Is this a trustworthy source?

I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.

Does the data cover all necessary base mappings?

It looks like they map only by Unicode blocks and not by language. But in their case the ASCII folding doesn't differ per language; it looks like they are basically stripping away any diacritic, which is essentially what the DiacriticFolding file appears to do.

Does the data cover the correct base mappings?

I'm not sure how to answer this question. ß isn't really a diacritic, so stripping away any diacritics from the character doesn't apply; I think that's why we chose to leave it unchanged. I guess the question should be: how should we define the base mapping of a character? Should it be the character without any diacritic(s), or, as in the case of the ASCIIFolding, should it convert the character to accommodate searches from U.S. keyboards?

julmot commented on June 19, 2024

So that may not guarantee correctness, but it might be about as close as we can get right now.

As "ASCIIFolding" seems to contain the same information, I think we should focus on that?

I think so. Elasticsearch is an open source RESTful search engine with almost 20k stars. Users are relying on the ASCII folding to perform searches.

I know Elasticsearch, but as I couldn't find the database they're using, I assume they're getting it from a third party too? In that case, we don't need to find out whether Elasticsearch is trustworthy, but whether the third party ("ASCIIFoldingFilter.js") is. We also need to make sure that we can use their database from a legal point of view.

Should it be the character without any diacritic(s), or, as in the case of the ASCIIFolding, should it convert the character to accommodate searches from U.S. keyboards?

We can make this decision easy: if we're going to use their database, we need to use what they provide.

Mottie commented on June 19, 2024

It looks like they use an Apache license (http://www.apache.org/licenses/LICENSE-2.0)

Source: http://svn.apache.org/repos/asf/lucene/java/tags/lucene_solr_4_5_1/lucene/analysis/common/src/java/org/apache/lucene/analysis/miscellaneous/ASCIIFoldingFilter.java

What I'm actually saying is that I think this code essentially implements the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two.


Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license.

julmot commented on June 19, 2024

It looks like they use an Apache license

I'm not a lawyer, but according to the license it allows usage with a copyright notice. However, in users' end products there can't be such a notice, since the result will just be e.g. a regex (no logic). I'm not sure whether providing their copyright in our database is sufficient. To guarantee we're allowed to use it, we need to contact them.

Thanks for providing the Java file, that helped.

What I'm actually saying is that I think this code essentially implements the "DiacriticFolding" data from the Unicode organization; but I can't say for sure until I actually compare the two.

Interesting. We should investigate this and find out whether we can use the database. If so, I'd agree to use it to automatically generate the base property. But we especially need to document what happens with characters like ß that aren't diacritics. The special thing about ß is that when writing something in upper case it's replaced by SS, otherwise ss.

Interestingly, this port to Go (https://github.com/hkulekci/ascii-folding) has an MIT license.

@hkulekci Can we assume that this is a mistake?

hkulekci commented on June 19, 2024

@julmot Yeah, you can. I am not good at licensing things; I was only trying to give an example in Go. :) I guess, in this case, I must choose the Apache license. If you know, please correct me on which license I should choose.

julmot commented on June 19, 2024

@hkulekci No, sorry, I don't know either. But since this project is released under the MIT license and we'd like to use this database, this is of interest to us too.

@Mottie If you have time, could you please find one of the owners of the Java library and contact them regarding the usage (and cc me, please)? There's another question they can probably answer. I've been asking myself: is the mapping e.g. ü => u common in all languages except German, where it could also be mapped to ue? I mean, if German is the only language that needs the decompose property, and all other languages just have a base, then Houston, we have a problem. The entire database would then be pointless, as everything is already covered by the ASCIIFolding project.

Mottie commented on June 19, 2024

Sorry, I've had a busy day; I'm just now checking GitHub.

I do know there is at least one additional language that needs diacritics decomposed... I found this issue (OpenRefine/OpenRefine#650 (comment)) about the Norwegian language:

  • 'æ' is replaced with 'ae'
  • 'ø' is replaced with 'oe'
  • 'å' is replaced with 'aa'

julmot commented on June 19, 2024

First off, I haven't received an answer from the Lucene team regarding automatically generating the base property yet. Hopefully we'll have an answer soon.

Anyway, as soon as the API is merged, the next step is to implement a process that allows users to integrate the diacritics project. We have several kinds of projects:

  • JavaScript projects that are using a build
  • JavaScript projects that serve source files directly, without a build (like e.g. @andrewplummer's)
  • Projects with other languages (e.g. C, C++, C#, Java, ...)

I'd like to start by discussing JavaScript projects. We need an npm module that replaces placeholders with diacritics data. This module will use the API to fetch live data. There should be two possible placeholder types:

  • Those that replace the placeholder with an array or object containing the diacritics mapping information in a useful structure. This is helpful for those who want to implement custom iterators over this mapping information.
  • Those that replace the placeholder with a method like this one, where no manual iterator is necessary. The user can then use this method to generate a regex to compare diacritic strings or replace characters.

While a placeholder syntax like <% diacritics %> would make sense, it's probably not the best idea. Why? Because there might be projects using the source files in development, like mark.js: it tests with the source files and only runs unit tests against compiled files. With the above syntax, an error would be thrown. To avoid this, we need a placeholder syntax that can simply be replaced but is also valid without the replacement. An example could be:

const x = [/* diacritics: /?language=DE */];

[/* diacritics: /?language=DE */] would be the placeholder. As the actual information is placed within a comment, this would be valid even without the replacement. diacritics would be the actual keyword here; everything following the : would be an optional filter URL that is passed to the API.
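
A rough sketch of how the npm module could perform that replacement (the regex and variable names are assumptions, not a final design):

const source = "const x = [/* diacritics: /?language=DE */];";
// stand-in for the data the module would fetch from the API
const fetched = { "/?language=DE": ["\u00fc", "\u00f6", "\u00e4", "\u00df"] };

const replaced = source.replace(
  /\[\/\* diacritics: (\S+) \*\/\]/g,
  (match, filterUrl) => JSON.stringify(fetched[filterUrl] || [])
);
console.log(replaced); // const x = ["ü","ö","ä","ß"];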

This is just an idea, not set in stone. I'm open to other ideas. Anyone?

Okay, projects that aren't using a build would need to create a module that overwrites a method in their project by using the npm module in a build. There won't be a way to use the diacritics project without this module (or without a build), as the data is fetched dynamically from the API.

@Mottie What do you think?

Mottie commented on June 19, 2024

Doesn't the API also need to indicate the type & format of the output?

/?diacritic=%C3%BC&output=base,decompose&format=string
  • output would indicate which data entries to return
  • format should be either a string, array or object

I'm not yet clear on how we would get the API to return only the first equivalent, or a specific equivalent if there is one, or that specific equivalent's data (e.g. unicode).


In the case of the mark.js repo, if you added a placeholder for, say, u using /?base=u, we'd need the API to return a string of all equivalents[0].raw values to create the desired output of uùúûüůū.
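
Roughly like this, using the uùúûüůū example (the variable names are mine, not part of the API):

const equivalents = "u\u00f9\u00fa\u00fb\u00fc\u016f\u016b"; // "uùúûüůū"
const uPattern = new RegExp("[" + equivalents + "]", "g");
console.log("Müller".replace(uPattern, "u")); // "Muller"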

julmot commented on June 19, 2024

You're right, there should be some parameters to ignore some data, e.g. a value to ignore equivalents, or just some of the equivalents (by name, e.g. unicode). Ignoring either base or decompose makes no sense in my opinion, as both are optional and both are mapping information. Some diacritics have a base and no decompose, and vice versa.

In the case of the mark.js repo, if you added a placeholder for, say, u using /?base=u, we'd need the API to return a string of all equivalents[0].raw values to create the desired output of uùúûüůū.

Yes, though that format parameter wouldn't be part of the API, in my opinion. This is something you'd need to specify in the placeholder, but it would be handled by the npm module.
In the case of mark.js that would be an entire array, not just the entries for e.g. u.

julmot commented on June 19, 2024

I've thought about this again, and making the format parameter part of the API has one benefit: it would allow access to these formats outside the npm module. This is especially helpful if we want to show the code on the website – the array or the entire method. Users could then just copy and paste the code into their applications, which would be another good solution for projects without a build. So I'm open to this option.

If we introduce an option to specify the output structure (non-JSON), then in my opinion it shouldn't be a parameter (e.g. ?output=js-array). Everything under the route / currently generates JSON. So the cleanest approach would be to introduce a new route, e.g. /js-array/?language=DE, where js-array is the output structure and the parameters are the same as for the / route.

@Mottie What do you think?

Mottie commented on June 19, 2024

Sorry for not responding earlier!

Do you still think this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.

As an aside, I have started to work out what the npm module will provide, and I've gotten stuck on how to deal with characters that are not going to be included under any language... like what happens when someone tries to remove diacritics from a crazy string like Iлtèrnåtïonɑlíƶatï߀ԉ? So I think the solution would be to create an en entry in the database that covers all the non-language-specific diacritics. It's going to be huge.

julmot commented on June 19, 2024

Do you still think this will be necessary? I think the npm module would be able to output any format you need. I am starting to think that adding a new route would probably complicate the use of the API.

I think it has one big advantage: it would allow copy and paste directly from the website. I'm currently imagining how our (later coming) website could look. I imagine a website with a table full of diacritics, with their metadata and mapping information. You can filter and sort everything, and finally you can just click "get code", select a language and structure (e.g. JavaScript and object or array), and get the code. This could be done using an option of the API. If it were part of the npm module, we would have to either create redundant code (npm module and website) or just not implement such a button.

Interesting point in your second paragraph. Why would you call it "en"? And how would you map all kinds of Unicode characters?

Mottie commented on June 19, 2024

I think it has one big advantage: it would allow copy and paste directly from the website.

Ok, sounds good then!

Why would you call it "en"?

Well, English doesn't really include diacritics, and even when we do, we ignore them all the time. I didn't want to name it something like default and make an exception in the spec. So, the format will follow the spec like all the other languages.

One block of entries will remove all combining diacritics... so the base would be an empty string:

    "data": {
        // combining diacritics -> convert to empty string
        "\u0301": { "mapping": { "base": "" } },
        "\u0300": { "mapping": { "base": "" }},
        "\u0306": { "mapping": { "base": "" }},
        "\u0302": { "mapping": { "base": "" }},
        "\u030c": { "mapping": { "base": "" }},
        ...
    }
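
In effect these entries strip the combining-diacritics block after decomposition, which also shows the gap this file has to fill (a quick sketch):

console.log("Iлtèrnåtïonɑlíƶatï߀ԉ".normalize("NFD").replace(/[\u0300-\u036f]/g, ""));
// The Latin accents are gone, but л, ɑ, ƶ, ߀ and ԉ survive; those need
// their own singleton entries in this file.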

Then we could include the decomposing of other symbols, like ① into (1)...

julmot commented on June 19, 2024

@Mottie I think that including special characters that aren't diacritics makes sense (e.g. "①"), but we can't call them "diacritics".

I'd say we should decide whether to include them depending on the effort. Is there any existing database, like the one for the HTML entities? If so, then we can continue by creating a new file in the build folder and adding them to the generated diacritics.json. A new API option should also allow excluding them.
If there's no database and it's a lot of effort, I don't think we can continue with it, at least not at the current time. In my opinion we should focus on creating the npm module, hopefully before the new year. If it takes too much time, it may be better to discuss this later when we have time. In that case I'd personally find it confusing to name it "en" when English doesn't contain diacritics. I think the cleanest solution would be to just create a single JSON file directly in the src folder.

  1. Is there an existing database?
  2. How much time will it take to create that mapping information?
  3. In case there's no existing database: what do you think of the naming?

julmot commented on June 19, 2024

Btw.: Is the "Write" tab (textarea) also delayed for you while typing?

Mottie commented on June 19, 2024

I'm not having any issues with the textarea.

I started with a bunch of characters and plugged them into node-unidecode, which stripped out the diacritics, and then added them to the data set... although some results ended up as [?]. The list I was working on is nowhere near complete.
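
For reference, the experiment boils down to something like this (assuming the npm unidecode package):

const unidecode = require("unidecode");
console.log(unidecode("Iлtèrnåtïonɑlíƶatï߀ԉ"));
// ASCII-only approximation; anything unidecode can't map comes out as "[?]"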

In the meantime, I'll put this part on hold and continue working on the npm module.

julmot commented on June 19, 2024

is nowhere near complete

How do you know that? And where did you get the data from?

julmot commented on June 19, 2024

Ping @Mottie. What's the current status of the npm module spec?

julmot commented on June 19, 2024

Finally, we're in the final phase and going live soon.
