ishefi / semantle-he
A Hebrew version of Semantle.
License: Other
Reloading the page subtracts 1 from the guess counter index.
When the puzzle is solved, the total number of guesses is correct, but the guess number shown for the found word is smaller by 1.
Reproduce by:
Guess a word
Reload page
Guess another word
Both words will have 1 as their guess number.
Additional reloads do not affect the guess counter any further.
python scripts/semantle.py
File "/Users/[email protected]/IdeaProjects/semantle-he/scripts/semantle.py", line 21
secret = await logic.secret_logic.get_secret()
^
SyntaxError: 'await' outside async function
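The error above is what Python raises when `await` appears at module level or inside a plain function. A minimal sketch of the usual fix, assuming the script can wrap its body in a coroutine and start it with `asyncio.run`; `get_secret` here is a hypothetical stand-in for `logic.secret_logic.get_secret()`, not the project's code:

```python
import asyncio


async def get_secret() -> str:
    # Hypothetical stand-in for logic.secret_logic.get_secret().
    await asyncio.sleep(0)
    return "secret"


async def main() -> str:
    # 'await' is legal here because main() is a coroutine function.
    secret = await get_secret()
    return secret


if __name__ == "__main__":
    # asyncio.run creates an event loop, runs main() to completion, and closes the loop.
    print(asyncio.run(main()))
```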
Right now, words such as ג'קט are not accepted by the algorithm, while גקט is.
We should either:
Great application, thank you!
Would be nice to be able to share the results of incomplete guesses in an analogous format to that of a complete guess.
Often, the secret word is difficult and I would have liked to share how close I was. For example, when the word was גלגלת my top guess was ידית 999/1000. A possible text could be:
לא פתרתי היום את סמנטעל #99. לאחר 731 ניחושים הגעתי ל-999/1000:
(In English: "I did not solve Semantle #99 today. After 731 guesses I reached 999/1000:")
https://semantle-he.herokuapp.com
In Semantle #74, the score of the word "כרבול" is 99.99, and an icon of two people hugging appears.
The game said it does not recognize the word וניל; my guess is that it may have been split into ו+ניל.
In Wikipedia, words are written with niqqud (vowel marks), if I'm not mistaken, so this could be taken into account either in the word-splitting process or in general.
faq.html, line 77:
<p> ת: קוד מקור?</p>
The 'ת' (answer) prefix needs to be changed to 'ש' (question), because this line is a question.
just gone a few days ago, a day after the horrible "איש" list that made 0 sense.
It would be nice to be able to share guesses with one or more collaborators on the web, so that each one can guess words and see the others' results.
I do not have experience with such interfaces, so I only have a vague idea of how this could be implemented (which may not be realistic), and I realize that this would require numerous additions. I think it would be better not to keep any data on the server that scales with the number of players. If each player that connects to the server has a unique id, the web page in the player's browser can keep a list of the player ids with which the guesses will be shared.
Any chance of getting some statistics about the game?
How often do people succeed, and in how many guesses?
How does the score change over the course of the guessing?
Hi Itamar,
My name is Itay Nakash, and I'm an MSc student studying natural language processing at Technion. I find the game you developed very intriguing, and I believe its statistics could offer valuable insights for creating personalized word embeddings.
If you're interested in utilizing this platform and people's responses from the game, I'd love to collaborate with you on this project, at any scale you prefer.
It will require some changes in the code to collect the statistics, and some NLP work to try to match the new word embedding.
In addition, I believe that framing this game as an NLP task, with a significant dataset and a new task/goal that utilizes this data, could be a great contribution to the community.
Before I begin implementing and developing the idea, I would like to check with you whether you are open to integrating it into the platform, given your background as an NLP researcher.
Thank you,
Itay
After finishing the #48 semantle, I got the following share text:
פתרתי את סמנטעל #48 ב־70 ניחושים!
https://semantle-he.herokuapp.com
🟩🟩🟩🟩🟩 70 (1000/1000)
🟩🟩🟩🟩⬜ 67 (991/1000)
🟩🟩🟩🟩⬜ 50 (894/1000)
🟩🟩⬜⬜⬜ 52 (570/1000)
⬜⬜⬜⬜⬜ 49
⬜⬜⬜⬜⬜ 40
However, the game I played looked like this:
Is this maybe a bug?
A recognizable word with an apostrophe (or apostrophes) added anywhere is treated as a distinct word, but gets the same closeness value as the word without the apostrophes.
For example, all of the following words were accepted as distinct words, and they all had the exact same closeness value:
צבע
צבע'
'צבע
צ'בע
צב'ע
צב'ע'
צ''''בע
More correct behavior would probably be to either reject those words or not count them as distinct from the original.
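The second option (not counting such variants as distinct) could be sketched as follows. This is only an illustration under assumptions: `canonical` and `add_guess` are hypothetical names, and the set of characters treated as apostrophes is a guess. Note that stripping apostrophes this way would also make ג'קט and גקט the same guess, which may or may not be desired.

```python
# ASCII apostrophe, right single quotation mark, Hebrew geresh (assumed set).
APOSTROPHES = "'\u2019\u05f3"


def canonical(guess: str) -> str:
    """Return the guess with every apostrophe-like character removed."""
    return "".join(ch for ch in guess if ch not in APOSTROPHES)


def add_guess(seen: set, guess: str) -> bool:
    """Record a guess; return False if it duplicates an earlier one."""
    key = canonical(guess.strip())
    if key in seen:
        return False
    seen.add(key)
    return True


seen_guesses = set()
first = add_guess(seen_guesses, "צבע")       # new guess
repeat = add_guess(seen_guesses, "צ''''בע")  # same word once apostrophes are stripped
```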
The readme says the db (word2vec.db) is part of the repo, but it's not there.
Hi,
I've been playing around with Word2Vec and the model linked here, and I can't seem to reproduce the same distances.
For example:
Python 3.11.2 (main, Feb 12 2023, 00:48:52) [GCC 12.2.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import gensim
>>> model = gensim.models.Word2Vec.load('./wiki_tokenized_model/model.mdl')
>>> model.wv.similar_by_word('אשליה')
[('אשליית', 0.7949888110160828), ('אשלייתי', 0.7358855605125427), ('תחושה', 0.7196317911148071), ('סימולקרה', 0.7147767543792725), ('מתעתעת', 0.7013854384422302), ('השתקפות', 0.6864952445030212), ('אסטרלית', 0.6836147308349609), ('אשלייתית', 0.6831943392753601), ('אילוזיה', 0.6829365491867065), ('סיראנית', 0.6813762784004211)]
Note the distances.
However, the distance Semantle gives is different:
Am I doing anything wrong? I'd love some feedback!
Just a thought about how you could improve the precision and help your model better understand the semantic relationships between words: why not just use English? You could add a translation layer before generating the embedding of each guess. Assuming that Hebrew-to-English translation is reliable, you would benefit from the abundance of work that has been done on English word2vec or any other word embedding technique. :)
large package, will reduce RAM. shouldn't be too hard
There is an unlikely but possible XSS vulnerability. If someone is convinced to paste a guess, an attacker can execute arbitrary JS in the victim's browser.
How to reproduce:
היי&"><iframe src="/" onload="alert('PWND')" width="0px" height="0px" />"
There are two parts to exploiting the vulnerability:
In order to do that, we can just write an actual word, e.g. היי, add the & character in order to make the server think it's a different parameter, and add arbitrary text afterwards.
When we send היי&Malicious code here, we get a response from the server as if we had only sent היי.
Sanitizing the input before executing the following lines of code should solve this problem.
const url = "/api/distance" + '?word=' + word;
const response = await fetch(url);
innerHTML in function guessRow, specifically here:
return `<tr><td>${guessNumber}</td>
<td style="color:${color}" onclick="select('${oldGuess}', secretVec);">${oldGuess}</td>
<td align="right" dir="ltr">${similarity.toFixed(2)}</td>
<td class="${cls}">${percentileText}${progress}
</td></tr>`;
Using an alternative to innerHTML or escaping the input should also help prevent this attack.
Combining these two vulnerabilities, when we enter the malicious input, the following dangerous HTML is generated:
<tr>
<td>1</td>
<td style="color:#c0c" onclick="select('היי&">
<iframe src="/" onload="alert('PWND')" width="0px" height="0px">"', secretVec);">היי&">
<iframe src="/" onload="alert('PWND')" width="0px" height="0px" />
"
</td>
<td align="right" dir="ltr">24.84</td>
<td class="">(רחוק)
</td>
</tr>
<tr><td colspan=4><hr></td></tr></iframe></td></tr>
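The two mitigations suggested in this report (encoding the query parameter and escaping the value before it reaches the markup) can be illustrated in Python. This is a sketch of the principle only, not the project's client code: in the browser the analogous tools would be `encodeURIComponent` and `textContent`. It also covers only the HTML text context; a value interpolated into an inline `onclick` handler would need a different approach, such as attaching the handler from script.

```python
from html import escape
from urllib.parse import urlencode

# The payload from the report, slightly shortened.
payload = 'היי&"><iframe src="/" onload="alert(\'PWND\')" />'

# 1. Percent-encode the guess so '&' cannot start a second query parameter.
url = "/api/distance?" + urlencode({"word": payload})

# 2. Escape the guess before placing it in markup, so '<', '>', '&' and
#    quotes are rendered as text instead of being parsed as HTML.
cell = f"<td>{escape(payload, quote=True)}</td>"
```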
In handlers.py:133 there are the following lines:
if api_key != request.app.state.api_key:
    raise HTTPException(status_code=status.HTTP_403_FORBIDDEN)
This piece of code is vulnerable to a timing side-channel attack and should be replaced with the constant-time comparison method hmac.compare_digest.
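A minimal sketch of the suggested replacement, using the standard library's `hmac.compare_digest`. The helper name `api_key_is_valid` is hypothetical, and how it would be wired into the handler is left out:

```python
import hmac


def api_key_is_valid(supplied: str, expected: str) -> bool:
    # compare_digest is designed so that its running time does not reveal
    # where the two inputs first differ.
    # Encoding to bytes avoids the ASCII-only restriction on str arguments.
    return hmac.compare_digest(supplied.encode("utf-8"), expected.encode("utf-8"))
```

The handler would then raise the 403 error when `api_key_is_valid(...)` returns False, instead of comparing with `!=`.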
We can use the /secrets page, and change the current behavior to require both an API key and future=true.
Show the number of the last guess (how many guesses so far)
The How To Play page says that a guess can be "מילה או ביטוי קצר". To date, I have not found a guess with more than one word that was accepted. Examples: ראש ממשלה, עמוד שדרה, קרוב משפחה.
I haven't looked at the code but I suspect there is no mechanism to ever add these kinds of words to the database. In that case, the How To Play text should be updated.
I'd like to have an option to view my progress while guessing.
For example, to see the greatest breakthroughs, when I was stuck the most, etc.
I suggest adding a progress button named "התקדמות" ("Progress").
The button will present the guesses as follows:
It will present guesses in the order you guessed them, but it will only present the guesses which got closer to the secret word.
For example the secret word is "צרידות"
And I guessed:
The text on the button should be ״נכנעתי״ ("I gave up").
It should appear after GIVEUP_THRESH (env var) good guesses (i.e., not nonexistent words).
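One way the threshold check could look, as a sketch only. `GIVEUP_THRESH` is the environment variable named above; the function name and the fallback default of 50 are assumptions made for illustration:

```python
import os


def should_show_give_up(valid_guess_count: int, default_thresh: int = 50) -> bool:
    """Return True once enough valid (recognized) guesses have been made."""
    # The default of 50 is a placeholder, used only when the variable is unset.
    thresh = int(os.environ.get("GIVEUP_THRESH", default_thresh))
    return valid_guess_count >= thresh
```

Only guesses that the game recognizes should be counted in `valid_guess_count`, so that nonexistent words do not bring the button closer.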
Hebrew Wikipedia is very uneven in its coverage of words. Is it the best freely available Hebrew data set out there? What about the ynet news archive, for example?
To prevent mixup of yesterday's and today's guesses
The game used to refresh (new secret word) at 2:00am Israel time (GMT+2).
Since the switch to DST (Daylight Saving Time / שעון קיץ, GMT+3) it now refreshes at 3:00am.
If possible (and accepted), I suggest changing the refresh time to midnight, 12:00am (in line with all the Wordle variations), or at least back to 2:00am.
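The one-hour shift described here is what one would see if the secret word changes at a fixed UTC time. That is an assumption, since the report does not say how the server schedules the change. A small sketch with fixed offsets shows the same UTC midnight landing on 2:00am in winter and 3:00am in summer:

```python
from datetime import datetime, timedelta, timezone

IST = timezone(timedelta(hours=2), "IST")  # Israel Standard Time (winter)
IDT = timezone(timedelta(hours=3), "IDT")  # Israel Daylight Time (summer)

# Assumed reset moment: 00:00 UTC, on one winter date and one summer date.
winter_reset = datetime(2022, 1, 15, 0, 0, tzinfo=timezone.utc).astimezone(IST)
summer_reset = datetime(2022, 4, 15, 0, 0, tzinfo=timezone.utc).astimezone(IDT)

print(winter_reset.strftime("%H:%M %Z"))
print(summer_reset.strftime("%H:%M %Z"))
```

Resetting at local midnight all year round would instead require computing the date in the "Asia/Jerusalem" time zone, for example with Python's `zoneinfo` module, which tracks the DST transitions automatically.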
If someone solved Semantle in fewer than 20 attempts, it was probably a lucky guess, so it could be arranged that if you succeed too quickly you get a new puzzle with a random word.
Yesterday (the word was "Joke"), the word "GILUY" got a negative score.
Is it a bug or a feature?
Thanks,
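A negative score is possible if the score is derived from cosine similarity, which ranges from -1 to 1. The sketch below assumes the score is 100 times the cosine similarity between two word vectors; this page does not confirm that, and the two-dimensional vectors are made up for illustration:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


secret = [1.0, 0.0]  # made-up vector for the secret word
guess = [-1.0, 1.0]  # made-up vector pointing mostly the opposite way

score = 100 * cosine_similarity(secret, guess)  # negative for these vectors
```

Under that assumption, a negative score simply means the two vectors point in roughly opposite directions, so it would be expected behavior rather than a bug.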
A couple of days ago the solution was "דעה". I guessed "דיעה" and it got only 996/1000 (66.54).
Hi,
I suggest marking breakthrough words, that is, guesses which were the closest to the target when first tried.
This might help in advancing towards the goal, and once the goal is achieved, it will provide a nice view of the road to victory.
I'm attaching an example; be aware that it might spoil today's riddle (#31, although so far I didn't solve the damn thing).
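The "breakthrough" idea amounts to keeping a running best score and keeping each guess that beats it. A minimal sketch, assuming guesses are available as (word, score) pairs in the order they were made; the function name and the sample data are hypothetical:

```python
def breakthroughs(guesses):
    """Return (guess_number, word, score) for each guess that set a new best score."""
    best = float("-inf")
    result = []
    for number, (word, score) in enumerate(guesses, start=1):
        if score > best:
            best = score
            result.append((number, word, score))
    return result


# Made-up history: only guesses 1, 2 and 4 improve on everything before them.
history = [("alpha", 10.5), ("beta", 24.84), ("gamma", 18.0), ("delta", 66.54)]
marked = breakthroughs(history)
```

The same list could drive either marking rows in the guess table or a separate progress view.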
In the word for 20220320, which was "קרקס", what seems to be the majority of the close words were proper names of people and fictional characters. Such words should, generally, not appear in the word list in the first place. As removing them may be an annoying issue (I suspect there should be an easy way to filter them reasonably with the pipeline), it is at least worth verifying that the list is not overrun by such words when selecting the daily words, as it can be very frustrating to guess such words.
Hi,
Could you provide more detailed instructions on how to generate the word2vec db? E.g., at what part and how to use the HebPipe you mentioned in the faq.
Thanks
It would be nice to share my solve story with the actual guesses I tried. To avoid spoilers, the "spoiler" markdown in Telegram can be used. Since this feature is available only on Telegram, a separate share button should be added.
The message should look like this:
Telegram markdown for a spoiler is a leading and trailing '||'.
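A sketch of how such a share message could be assembled, wrapping each guessed word in the '||' spoiler markers. The function names and the line layout are assumptions, and the escaping that Telegram's markdown requires for other special characters is not handled here:

```python
def telegram_spoiler(text: str) -> str:
    """Wrap text in Telegram's spoiler markers."""
    return f"||{text}||"


def spoiler_share_text(puzzle_number: int, guesses) -> str:
    """Build a share message with one hidden guess per line.

    guesses: list of words in the order they were guessed.
    """
    lines = [f"Semantle #{puzzle_number}"]
    for number, word in enumerate(guesses, start=1):
        lines.append(f"{number} {telegram_spoiler(word)}")
    return "\n".join(lines)


message = spoiler_share_text(48, ["first", "second"])
```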
Something like what was done in BERT. The standard gensim Word2Vec does not take the order of words in the sentence into consideration.
Adding positional encodings might improve consistency by giving some weight to order.
Step to reproduce: