Comments (18)
from fluentai.
- Need to have a dataset of IPA pronunciations for the languages that we want to learn.
- Use https://github.com/xinjli/transphone to do word/graphene to phoneme (IPA), for the language that we want to learn, assume we always use English as base language
from fluentai.
Phonetic Similarity
Use ALINE for phonetic similarity
We use a Levenshtein distance, now with a weight matrix defining the cost of substituting one phone for another, to
compute the phonetic distance between two words. We search for an alignment (mapping of phones in one word to those of
another) such that the Levenshtein distance is minimized.
We simplify the ALINE algorithm by omitting phone expansions (i.e. no special costs for matching two phones to one,
or one phone to two). We handle matching of syllable breaks by adding a constant match score Csep = 20, and a mismatch
cost Cskip = −10, as defined by ALINE. The phone feature set used by ALINE is not complete—a few phones are considered equivalent (e.g. /i/, /ɪ/), so we add an extra weight for preferring exact phone matches.
C_skip is defined in ALINE, but not C_sep. Maybe see the Scala implementation for this
ALINE output values are divided by 100 to make them comparable with similarities from other components.
Other methods for calculating
- Metaphone
- Doublemetaphone and example here
from fluentai.
Could also consider scraping wiktionary for all IPA pronunciations: https://en.wiktionary.org/wiki/Category:Indonesian_terms_with_IPA_pronunciation
from fluentai.
First, let's create a phoneme similarity function as you described. For this purpose, we will use a dictionary to map phonemes to their categories, and then define a function to calculate the similarity based on these categories.
# Create phoneme categories
phoneme_categories = {
"aehijouwæɑɒɔəɛɜɪʊʌʔ": 0, # vowels, semivowels, and weak consonants
"bfpv": 1, # labial consonants
"ksxzɡʃʒ": 2, # velars and sibilant consonants
"dtðθ": 3, # dental consonants
"l": 4, # lateral consonant
"mnŋ": 5, # nasals
"ɹ": 6, # rhotic
}
# Flatten the dictionary to map each phoneme to its category
phoneme_to_category = {
phoneme: category
for category, phonemes in phoneme_categories.items()
for phoneme in phonemes
}
def phoneme_similarity(phoneme1, phoneme2):
"""Calculates the similarity between two phonemes."""
if phoneme1 == phoneme2:
return 1
elif phoneme_to_category.get(phoneme1) == phoneme_to_category.get(phoneme2):
return 0.5
else:
return 0
Next, we'll use this phoneme similarity function to calculate the similarity between two words. We'll use a dynamic programming approach similar to the Longest Common Subsequence problem.
def word_similarity(word1, word2):
"""Calculates the similarity between two words based on their phonemes."""
len1 = len(word1)
len2 = len(word2)
match_table = [[0] * (len2 + 1) for _ in range(len1 + 1)]
for i in range(1, len1 + 1):
for j in range(1, len2 + 1):
match_table[i][j] = max(
match_table[i-1][j-1] + phoneme_similarity(word1[i-1], word2[j-1]),
match_table[i-1][j],
match_table[i][j-1]
)
return match_table[len1][len2] / (len1 + len2) * 2.0
You can now use word_similarity to calculate the similarity between two words, given their phoneme sequences. For example, to compare the words "hello" and "world", you could call word_similarity('hɛ.ˈloʊ', 'wɜɹld').
Remember, this is a basic implementation, and there are many ways to improve and refine it, as you've mentioned. For example, handling the words' consonant and vowel structures separately, handling swapped phonemes, or using a more sophisticated phoneme similarity function.
from fluentai.
To measure the similarity between two IPA strings, you could use a string distance metric such as the Levenshtein distance, which is simple to understand and implement. It counts the minimum number of edits (insertions, deletions, substitutions) needed to transform one string into the other.
However, if you want to account for the phonetic similarity between different phonemes, you could consider using a phonetic alignment algorithm like Needleman-Wunsch or Smith-Waterman, with a substitution matrix that reflects the phonetic similarity between different phonemes.
Keep in mind that this is a relatively simple approach and might not account for all the nuances of phonetic similarity. Also, because IPA is a broad phonetic transcription system, the same sound can be transcribed differently depending on the level of detail you want to include, and different languages might use different symbols to represent similar sounds. Therefore, you might need to preprocess your IPA strings to ensure they use a consistent set of symbols.
Here is a simplified example of how you might use the Levenshtein distance to compare two IPA strings in Python, using the python-Levenshtein package:
import Levenshtein as lev
ipa1 = "hɛ.ˈloʊ" # English "hello"
ipa2 = "hə.ˈlo" # Hypothetical Indonesian word
distance = lev.distance(ipa1, ipa2)
print(distance)
Remember to install the package with pip install python-Levenshtein if you haven't already.
The output is the minimum number of edits needed to transform ipa1 into ipa2. The lower the number, the more similar the two strings are.
from fluentai.
Sure, the Needleman-Wunsch algorithm is a standard algorithm for sequence alignment in bioinformatics, typically used for DNA or protein sequence alignment. The idea is to find the best alignment (including gaps) between two sequences, given a substitution matrix that scores the similarity between each pair of elements.
In the case of IPA strings, you could define a substitution matrix that scores the phonetic similarity between different phonemes. This would be a complex task in itself, as you would need to define a similarity score for each pair of IPA symbols, but let's assume you have such a matrix.
Here's a simple implementation of the Needleman-Wunsch algorithm in Python, adapted to work with strings of IPA symbols:
import numpy as np
def needleman_wunsch(seq1, seq2, sub_matrix, gap_penalty=-1):
"""Performs the Needleman-Wunsch sequence alignment."""
n, m = len(seq1), len(seq2)
# Initialize score matrix with zeros
score_matrix = np.zeros((n+1, m+1))
# Initialize first row and column with gap penalty
score_matrix[0, :] = np.arange(m+1) * gap_penalty
score_matrix[:, 0] = np.arange(n+1) * gap_penalty
# Fill in the rest of the matrix
for i in range(1, n+1):
for j in range(1, m+1):
match = score_matrix[i-1, j-1] + sub_matrix[seq1[i-1], seq2[j-1]]
delete = score_matrix[i-1, j] + gap_penalty
insert = score_matrix[i, j-1] + gap_penalty
score_matrix[i, j] = max(match, delete, insert)
return score_matrix[n, m]
# Example usage
ipa1 = "hɛ.ˈloʊ" # English "hello"
ipa2 = "hə.ˈlo" # Hypothetical Indonesian word
# Define a simple substitution matrix where identical symbols score 1
# and different symbols score -1
sub_matrix = {
a: {b: 1 if a == b else -1 for b in set(ipa1 + ipa2)}
for a in set(ipa1 + ipa2)
}
score = needleman_wunsch(ipa1, ipa2, sub_matrix)
print(score)
This code performs a simple global sequence alignment of the two IPA strings, scoring 1 for identical symbols and -1 for different symbols. It also includes a gap penalty of -1. The output is the maximum score of any alignment of the two strings.
Please note that this is a basic example and there are many ways to improve and refine this approach, such as using a more complex substitution matrix, adjusting the gap penalty, or implementing local alignment (Smith-Waterman) instead of global alignment (Needleman-Wunsch). Also, this simple implementation doesn't actually construct the alignment itself, just the score of the optimal alignment. Constructing the alignment would require additional code to backtrack through the score matrix.
from fluentai.
nltk's ALINE seems to run into a lot of issues, because some characters do not exist in its feature matrix
from fluentai.
Could try https://abydos.readthedocs.io/en/latest/abydos.distance.html#abydos.distance.ALINE and add the newer / missing IPA characters as a dict to it. Seems like an outdated library...
from fluentai.
Could copy source-code of https://abydos.readthedocs.io/en/latest/_modules/abydos/distance/_aline.html#ALINE and add missing characters to it.
from fluentai.
See https://github.com/msavva/transphoner/tree/553c0d5923af1d5055bbccb34d8cbbf0fdd1d8ef/libtransphone/src/main/scala/org/babysherlock/transphoner/phonetics of what to do with c_skip etc.
from fluentai.
Rewrite NLTK's or abydos' ALINE code and use scala implementation as guidance. Also make sure all (new) IPA characters are supported, including breaks. See Wikipedia page for the possible characters
from fluentai.
Also include special characters found in https://github.com/msavva/transphoner/blob/master/libtransphone/src/main/scala/org/babysherlock/transphoner/phonetics/Phone.scala#L218
from fluentai.
Also use files found in folder "data for transphoner"
from fluentai.
https://www.reddit.com/r/linguistics/comments/ovewb3/is_there_a_tool_that_can_convert_a_language_to/
from fluentai.
https://docs.nvidia.com/nemo-framework/user-guide/latest/nemotoolkit/tts/g2p.html#
https://github.com/NVIDIA/NeMo/tree/main/examples/tts/g2p
from fluentai.
https://github.com/lingjzhu/CharsiuG2P
https://huggingface.co/charsiu/g2p_multilingual_byT5_small_100
from fluentai.
import panphon.distance
dst = panphon.distance.Distance().feature_edit_distance
# add all English words here
WORDS = ["closing", "cucci", "gnocchi", "kissing"]
closest_word = min(WORDS, key=lambda x: dst("kucing", x))
# cucci
print(closest_word)
Better approach: embed all words into vectors and then perform maximum-inner-product-search.
from fluentai.
Related Issues (17)
- Establish Anki connection HOT 1
- Find open source LLM to generate verbal cue HOT 4
- Find open source text to image model HOT 3
- Create GUI
- Fix .apkg HOT 1
- Add functionality for word order HOT 1
- support more languages
- Word Imageability HOT 6
- Use FrequencyWords instead of 625 list HOT 1
- Use "known" words of language that learning HOT 1
- Other related papers
- Build Instructions HOT 1
- Semantic similarity HOT 2
- Orthographic similarity
- Look into flux image gen models HOT 1
- Vocabulary flow
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google ❤️ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from fluentai.