
Addition (ancient-greek-word2vec, open)

ryderwishart commented on July 23, 2024
Addition

from ancient-greek-word2vec.

Comments (2)

ryderwishart commented on July 23, 2024

Hi @l0d0v1c, you're right that the classic example is not directly transferable to the Ancient Greek data, but I think there is more going on with the Greek data than might first appear. For one thing, the Greek data is more morphologically complex. In addition, I get interesting results depending on how I mix and match the vectors, on the corpus, and on the model type and settings (skip-gram versus CBOW, FastText versus Word2Vec, window size, etc.). For instance, with this model, ft_papyri&corpus_cbow_hs_2_to_5_size300_window5_mincount2.model (FastText; papyri and literary texts; CBOW; hierarchical softmax; character n-grams of length 2 to 5; vector size 300; window of 5; a vocabulary item must occur at least 2 times in the corpus),

# βασιλεύς + γυνή - ἀνήρ

word_set_1 = ['βασιλεύς', 'γυνή']
word_set_2 = ['ἀνήρ']

# Finding the most similar words using vector arithmetic
similar_words = model.most_similar_cosmul(positive=word_set_1, negative=word_set_2, topn=10)

# Print the most similar words and their similarity scores
for word, similarity in similar_words:
    print(word, similarity)

Yields

φιλοβασιλεύς 0.8599472045898438 # φιλοβασιλεύς = 'royalist' or... maybe, 'king lover'? That seems a bit like a queen in some way.
βασιλειάω 0.8345795273780823 # βασιλειάω = 'to reign'
βασιλίσκος 0.8285424709320068 # βασιλίσκος = 'little king'
βασιλίζω 0.8045151829719543 # βασιλίζω = 'to rule as queen'
γαμβρός 0.7965137958526611 # γαμβρός = 'son-in-law'
βασιλίς 0.7935773134231567 # βασιλίς = 'queen'
βασιλεύω 0.7927768230438232 # βασιλεύω = 'to reign'
βασιληΐς 0.7872297763824463 # βασιληΐς = 'queen'
βασίλη 0.7827581763267517 # βασίλη = 'queen'
συμβασιλεύω 0.7825499176979065 # συμβασιλεύω = 'to reign jointly'

However, if I use the Word2Vec nov2022 model,

# βασιλεύς + γυνή - ἀνήρ

word_set_1 = ['βασιλεύς', 'γυνή']
word_set_2 = ['ἀνήρ']

# Finding the most similar words using vector arithmetic
similar_words = model.most_similar_cosmul(positive=word_set_1, negative=word_set_2, topn=10)

# Print the most similar words and their similarity scores
for word, similarity in similar_words:
    print(word, similarity)

Yields the expected results

βασίλισσα 0.8620691299438477 # NOTE: βασίλισσα = 'queen'
παιδίσκη 0.8539372682571411 # παιδίσκη = 'maid'
ἡρώδης 0.8510062098503113 # ἡρώδης = 'Herod'
ἰσραηλίτης 0.8509696125984192 # ἰσραηλίτης = 'Israelite'
ἀαρών 0.8409229516983032 # ἀαρών = 'Aaron'
κλεοπάτρα 0.8370845913887024 # κλεοπάτρα = 'Cleopatra'
ἀδελφή 0.8362259864807129 # ἀδελφή = 'sister'
βασιλίς 0.8341107964515686 # βασιλίς = 'queen'
ἀριστόβουλος 0.8217142224311829 # ἀριστόβουλος = 'Aristobulus'
γύναιον 0.8093928694725037 # γύναιον = 'woman'
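
For anyone wondering what `most_similar_cosmul` is ranking by: as far as I understand it, gensim uses the multiplicative combination objective (3CosMul) proposed by Levy and Goldberg, where each cosine similarity is shifted into the range 0 to 1 and the positive terms are multiplied and then divided by the negative terms. Here is a minimal pure-Python sketch of that idea. The 3-dimensional vectors and the helper names (`cosine`, `cosmul_score`, `rank`) are made up for illustration and have nothing to do with the values in either model above.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosmul_score(candidate, positive, negative, eps=1e-6):
    """3CosMul-style score: product of shifted similarities to the positive
    vectors, divided by the same product for the negative vectors."""
    num = math.prod((1 + cosine(candidate, p)) / 2 for p in positive)
    den = math.prod((1 + cosine(candidate, n)) / 2 for n in negative)
    return num / (den + eps)

# Hypothetical toy vectors (NOT from any trained model):
# axis 0 ~ "royalty", axis 1 ~ "female", axis 2 ~ "male"
toy = {
    "βασιλεύς":  [1.0, 0.0, 1.0],
    "βασίλισσα": [1.0, 1.0, 0.0],
    "ἀνήρ":      [0.0, 0.0, 1.0],
    "γυνή":      [0.0, 1.0, 0.0],
    "παῖς":      [0.0, 0.1, 1.0],
}

def rank(positive_words, negative_words):
    """Score every word that is not part of the query, best first."""
    pos = [toy[w] for w in positive_words]
    neg = [toy[w] for w in negative_words]
    query = set(positive_words) | set(negative_words)
    scored = [(w, cosmul_score(v, pos, neg))
              for w, v in toy.items() if w not in query]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# βασιλεύς + γυνή - ἀνήρ on the toy vectors
ranking = rank(["βασιλεύς", "γυνή"], ["ἀνήρ"])
for word, score in ranking:
    print(word, round(score, 4))
```

Because the negative term sits in the denominator, a candidate that is very close to ἀνήρ gets pushed down sharply, which is the intended difference from plain vector addition and subtraction.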

In other words, model hyperparameters matter A LOT. Because the FastText model breaks words down into character n-grams, it finds a lot more similarity between words that share a derivational core (like 'βασιλ'). It's interesting to observe and ponder how drastically the results change with a bit of hyperparameter tweaking. With transformer models you lose this transparency in the relationship between the algorithm and 'semantic similarity'. Everything is just 'attention'. That's part of the reason I find these more basic algorithms extremely important, and I suspect they still have a key role to play in lexical modelling.
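
To make the n-gram point concrete, here is a rough sketch of how much surface material the βασιλ- words share. It assumes FastText-style boundary markers (`<` and `>`) and the 2 to 5 n-gram range from the model name; the helper names and the Jaccard overlap measure are my own illustration. The real model hashes n-grams into buckets and sums their learned vectors, so this only shows the overlap in raw n-grams, not actual similarity scores.

```python
def char_ngrams(word, min_n=2, max_n=5):
    """Character n-grams of a word wrapped in '<' and '>' boundary markers."""
    wrapped = f"<{word}>"
    return {
        wrapped[i:i + n]
        for n in range(min_n, max_n + 1)
        for i in range(len(wrapped) - n + 1)
    }

def jaccard(a, b):
    """Share of n-grams two words have in common (0 = none, 1 = identical)."""
    return len(a & b) / len(a | b)

base = char_ngrams("βασιλεύς")
for other in ["βασιλεύω", "βασιλίς", "βασίλισσα", "γυνή"]:
    print(other, round(jaccard(base, char_ngrams(other)), 3))
```

Note that accented and unaccented vowels count as different characters in this sketch, so the overlap also depends on how the corpus text is normalized.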

By the way, I love the web app you made to view the graph of similar words! Are you doing any academic work on Greek lexical semantics? Would love to talk more.


l0d0v1c commented on July 23, 2024

Hi, thanks a lot! That's clear. You get the point: hyperparameters are the key. That said, in the dataset I used, βασίλισσα appears only 4 times, which may explain why I get παῖς instead, even with your hyperparameters.

With your nov2022 model, παῖς and βασίλισσα are closer, but in mine παῖς is closer to βασιλεύς.
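
A quick way to check whether low frequency is the culprit is to count the lemmas before training. This is only a sketch: the four toy sentences below are invented, not my dataset, and the helper names and the threshold are placeholders to be set to whatever cutoff matters for the model in question.

```python
from collections import Counter

# Hypothetical lemmatized corpus: each inner list is one tokenized sentence.
sentences = [
    ["βασιλεύς", "παῖς", "ἀνήρ"],
    ["βασίλισσα", "γυνή"],
    ["βασιλεύς", "παῖς"],
    ["παῖς", "γυνή", "ἀνήρ"],
]

def lemma_counts(tokenized_sentences):
    """Count how often each lemma occurs across all sentences."""
    return Counter(token for sentence in tokenized_sentences for token in sentence)

def rare_lemmas(counts, targets, threshold):
    """Return the target lemmas that occur fewer than `threshold` times."""
    return {lemma: counts[lemma] for lemma in targets if counts[lemma] < threshold}

counts = lemma_counts(sentences)
rare = rare_lemmas(counts, ["βασιλεύς", "βασίλισσα", "παῖς"], threshold=2)
print(rare)
```

A word that clears the minimum count but is still very rare gets a vector trained on only a handful of contexts, so its nearest neighbours can be unstable.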

I didn't get your email, but I saw you are on ResearchGate, so I'll send you more details about my projects that way.

PS: Greek is complex, but it's nothing compared to the 4000 rules of Pāṇini's grammar of Sanskrit ;-)
