Giter Site home page Giter Site logo

Phonetic Similarity about fluentai HOT 18 OPEN

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024
Phonetic Similarity

from fluentai.

Comments (18)

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Outline of the process:
image
image

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024
  • Need to have a dataset of IPA pronunciations for the languages that we want to learn.
  • Use https://github.com/xinjli/transphone to do word/graphene to phoneme (IPA), for the language that we want to learn, assume we always use English as base language

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Phonetic Similarity
Use ALINE for phonetic similarity

We use a Levenshtein distance, now with a weight matrix defining the cost of substituting one phone for another, to
compute the phonetic distance between two words. We search for an alignment (mapping of phones in one word to those of
another) such that the Levenshtein distance is minimized.

We simplify the ALINE algorithm by omitting phone expansions (i.e. no special costs for matching two phones to one,
or one phone to two). We handle matching of syllable breaks by adding a constant match score Csep = 20, and a mismatch
cost Cskip = −10, as defined by ALINE. The phone feature set used by ALINE is not complete—a few phones are considered equivalent (e.g. /i/, /ɪ/), so we add an extra weight for preferring exact phone matches.

C_skip is defined in ALINE, but not C_sep. Maybe see the Scala implementation for this

ALINE output values are divided by 100 to make them comparable with similarities from other components.

Other methods for calculating

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Could also consider scraping wiktionary for all IPA pronunciations: https://en.wiktionary.org/wiki/Category:Indonesian_terms_with_IPA_pronunciation

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

https://linguistics.stackexchange.com/questions/43492/finding-phonetic-similarity-of-names-in-different-languages

First, let's create a phoneme similarity function as you described. For this purpose, we will use a dictionary to map phonemes to their categories, and then define a function to calculate the similarity based on these categories.

# Create phoneme categories
phoneme_categories = {
    "aehijouwæɑɒɔəɛɜɪʊʌʔ": 0,  # vowels, semivowels, and weak consonants
    "bfpv": 1,  # labial consonants
    "ksxzɡʃʒ": 2,  # velars and sibilant consonants
    "dtðθ": 3,  # dental consonants
    "l": 4,  # lateral consonant
    "mnŋ": 5,  # nasals
    "ɹ": 6,  # rhotic
}

# Flatten the dictionary to map each phoneme to its category
phoneme_to_category = {
    phoneme: category
    for category, phonemes in phoneme_categories.items()
    for phoneme in phonemes
}

def phoneme_similarity(phoneme1, phoneme2):
    """Calculates the similarity between two phonemes."""
    if phoneme1 == phoneme2:
        return 1
    elif phoneme_to_category.get(phoneme1) == phoneme_to_category.get(phoneme2):
        return 0.5
    else:
        return 0

Next, we'll use this phoneme similarity function to calculate the similarity between two words. We'll use a dynamic programming approach similar to the Longest Common Subsequence problem.

def word_similarity(word1, word2):
    """Calculates the similarity between two words based on their phonemes."""
    len1 = len(word1)
    len2 = len(word2)
    match_table = [[0] * (len2 + 1) for _ in range(len1 + 1)]
    for i in range(1, len1 + 1):
        for j in range(1, len2 + 1):
            match_table[i][j] = max(
                match_table[i-1][j-1] + phoneme_similarity(word1[i-1], word2[j-1]),
                match_table[i-1][j],
                match_table[i][j-1]
            )
    return match_table[len1][len2] / (len1 + len2) * 2.0

You can now use word_similarity to calculate the similarity between two words, given their phoneme sequences. For example, to compare the words "hello" and "world", you could call word_similarity('hɛ.ˈloʊ', 'wɜɹld').

Remember, this is a basic implementation, and there are many ways to improve and refine it, as you've mentioned. For example, handling the words' consonant and vowel structures separately, handling swapped phonemes, or using a more sophisticated phoneme similarity function.

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

To measure the similarity between two IPA strings, you could use a string distance metric such as the Levenshtein distance, which is simple to understand and implement. It counts the minimum number of edits (insertions, deletions, substitutions) needed to transform one string into the other.

However, if you want to account for the phonetic similarity between different phonemes, you could consider using a phonetic alignment algorithm like Needleman-Wunsch or Smith-Waterman, with a substitution matrix that reflects the phonetic similarity between different phonemes.

Keep in mind that this is a relatively simple approach and might not account for all the nuances of phonetic similarity. Also, because IPA is a broad phonetic transcription system, the same sound can be transcribed differently depending on the level of detail you want to include, and different languages might use different symbols to represent similar sounds. Therefore, you might need to preprocess your IPA strings to ensure they use a consistent set of symbols.

Here is a simplified example of how you might use the Levenshtein distance to compare two IPA strings in Python, using the python-Levenshtein package:

import Levenshtein as lev

ipa1 = "hɛ.ˈloʊ"  # English "hello"
ipa2 = "hə.ˈlo"  # Hypothetical Indonesian word

distance = lev.distance(ipa1, ipa2)

print(distance)

Remember to install the package with pip install python-Levenshtein if you haven't already.

The output is the minimum number of edits needed to transform ipa1 into ipa2. The lower the number, the more similar the two strings are.

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Sure, the Needleman-Wunsch algorithm is a standard algorithm for sequence alignment in bioinformatics, typically used for DNA or protein sequence alignment. The idea is to find the best alignment (including gaps) between two sequences, given a substitution matrix that scores the similarity between each pair of elements.

In the case of IPA strings, you could define a substitution matrix that scores the phonetic similarity between different phonemes. This would be a complex task in itself, as you would need to define a similarity score for each pair of IPA symbols, but let's assume you have such a matrix.

Here's a simple implementation of the Needleman-Wunsch algorithm in Python, adapted to work with strings of IPA symbols:

import numpy as np

def needleman_wunsch(seq1, seq2, sub_matrix, gap_penalty=-1):
    """Performs the Needleman-Wunsch sequence alignment."""
    n, m = len(seq1), len(seq2)

    # Initialize score matrix with zeros
    score_matrix = np.zeros((n+1, m+1))

    # Initialize first row and column with gap penalty
    score_matrix[0, :] = np.arange(m+1) * gap_penalty
    score_matrix[:, 0] = np.arange(n+1) * gap_penalty

    # Fill in the rest of the matrix
    for i in range(1, n+1):
        for j in range(1, m+1):
            match = score_matrix[i-1, j-1] + sub_matrix[seq1[i-1], seq2[j-1]]
            delete = score_matrix[i-1, j] + gap_penalty
            insert = score_matrix[i, j-1] + gap_penalty
            score_matrix[i, j] = max(match, delete, insert)

    return score_matrix[n, m]

# Example usage
ipa1 = "hɛ.ˈloʊ"  # English "hello"
ipa2 = "hə.ˈlo"  # Hypothetical Indonesian word

# Define a simple substitution matrix where identical symbols score 1
# and different symbols score -1
sub_matrix = {
    a: {b: 1 if a == b else -1 for b in set(ipa1 + ipa2)}
    for a in set(ipa1 + ipa2)
}

score = needleman_wunsch(ipa1, ipa2, sub_matrix)

print(score)

This code performs a simple global sequence alignment of the two IPA strings, scoring 1 for identical symbols and -1 for different symbols. It also includes a gap penalty of -1. The output is the maximum score of any alignment of the two strings.

Please note that this is a basic example and there are many ways to improve and refine this approach, such as using a more complex substitution matrix, adjusting the gap penalty, or implementing local alignment (Smith-Waterman) instead of global alignment (Needleman-Wunsch). Also, this simple implementation doesn't actually construct the alignment itself, just the score of the optimal alignment. Constructing the alignment would require additional code to backtrack through the score matrix.

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

nltk's ALINE seems to run into a lot of issues, because some characters do not exist in its feature matrix

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Could try https://abydos.readthedocs.io/en/latest/abydos.distance.html#abydos.distance.ALINE and add the newer / missing IPA characters as a dict to it. Seems like an outdated library...

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Could copy source-code of https://abydos.readthedocs.io/en/latest/_modules/abydos/distance/_aline.html#ALINE and add missing characters to it.

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

See https://github.com/msavva/transphoner/tree/553c0d5923af1d5055bbccb34d8cbbf0fdd1d8ef/libtransphone/src/main/scala/org/babysherlock/transphoner/phonetics of what to do with c_skip etc.

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Rewrite NLTK's or abydos' ALINE code and use scala implementation as guidance. Also make sure all (new) IPA characters are supported, including breaks. See Wikipedia page for the possible characters

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Also include special characters found in https://github.com/msavva/transphoner/blob/master/libtransphone/src/main/scala/org/babysherlock/transphoner/phonetics/Phone.scala#L218

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

Also use files found in folder "data for transphoner"

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

https://www.reddit.com/r/linguistics/comments/ovewb3/is_there_a_tool_that_can_convert_a_language_to/

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/g2p.html#
https://github.com/NVIDIA/NeMo/tree/main/examples/tts/g2p

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024

https://github.com/lingjzhu/CharsiuG2P
https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100

from fluentai.

StephanAkkerman avatar StephanAkkerman commented on August 14, 2024
import panphon.distance
dst = panphon.distance.Distance().feature_edit_distance
# add all English words here
WORDS = ["closing", "cucci", "gnocchi", "kissing"]
closest_word = min(WORDS, key=lambda x: dst("kucing", x))
# cucci
print(closest_word)

Better approach: embed all words into vectors and then perform maximum-inner-product-search.

from fluentai.

Related Issues (17)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.