Consider the following French sentence:
“Le président Emmanuel Macron assure le peuple canadien que le gouvernement français va continuer à défendre le Canada contre la menace américaine.”
Any English speaker can make sense of this even if they don't know any French. My goal with this project is to generate these sorts of sentences automatically. Furthermore, I provide a gamified interface where the user can attempt to translate the provided sentences and receive feedback in near real time. Currently, I'm using GPT-3.5-Turbo along with a heuristic function and beam search to filter for sentences that are both grammatically correct and have a high "interpretability" score.
The heuristic function takes the following factors into account:
- cognate_ratio: The ratio of cognates to total words in a sentence. Any sentence with a low cognate ratio (< 0.20) is automatically given a score of 0.
- avg_gap_between_consecutive_cognates: The average length of the gaps between clusters of cognates. A "gap" is defined as a stretch of two or more consecutive non-cognate words. In a good sentence, the gaps are short; ideally the largest gap is no more than 3 or 4 words.
- biggest_gap: Any sentence whose largest gap is larger than 5 words is automatically given a score of 0. (If an English speaker sees 5 words in a row that they don't understand, they may give up.)
- avg_non_cognate_length: Sentences are penalized for having non-cognate words which are too long.
- total_score: A combination of the above factors
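The factors above can be sketched roughly as follows. This is a simplified, hypothetical version with made-up weights, written just to illustrate the scoring idea; the real formula and weights live in the project's `get_score_breakdown` function.

```python
def score_breakdown(words, cognates):
    """Hypothetical sketch of the heuristic; the real weights differ.

    words    -- list of tokens in the candidate sentence
    cognates -- set of tokens judged to be cognates
    """
    if not words:  # guard against zero-length outputs
        return {"total_score": 0.0}

    flags = [w in cognates for w in words]
    cognate_ratio = sum(flags) / len(words)

    # A "gap" is a run of two or more consecutive non-cognate words.
    gaps, run = [], 0
    for is_cog in flags + [True]:  # sentinel flushes the final run
        if not is_cog:
            run += 1
        else:
            if run >= 2:
                gaps.append(run)
            run = 0
    biggest_gap = max(gaps, default=0)
    avg_gap = sum(gaps) / len(gaps) if gaps else 0.0

    non_cognates = [w for w, f in zip(words, flags) if not f]
    avg_nc_len = (sum(map(len, non_cognates)) / len(non_cognates)
                  if non_cognates else 0.0)

    # Hard filters described above: low cognate ratio or a huge gap -> 0.
    if cognate_ratio < 0.20 or biggest_gap > 5:
        total = 0.0
    else:
        # Illustrative combination only; not the project's actual formula.
        total = cognate_ratio - 0.05 * avg_gap - 0.02 * avg_nc_len

    return {"cognate_ratio": cognate_ratio,
            "avg_gap": avg_gap,
            "biggest_gap": biggest_gap,
            "avg_non_cognate_length": avg_nc_len,
            "total_score": total}
```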
You can check it out for yourself in the `openai_beam_search.py` file, under the `get_score_breakdown` function. Here's how cognate identification works:
- Take a given French word, e.g. "augmentation"
- Translate it into English; in this example, "augmentation" → "increase"
- Do the translation and the original word have a low edit ratio (i.e., lots of characters in common)? If so, the word is a cognate.
- Otherwise, try iterating through the synonyms of the English word to see if there's a match. For example, French "augmentation" has few characters in common with English "increase", but if we look through words related to "increase" using WordNet then "augmentation" is likely to be found somewhere in there. (This part is still a little buggy.)
- If neither of these conditions is met then the word is not considered a cognate.
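A minimal sketch of this procedure, using a plain Levenshtein edit ratio and with the synonym list passed in explicitly (the real code pulls translations from Google Translate and synonyms from WordNet; the `0.5` threshold here is illustrative):

```python
def edit_distance(a, b):
    """Standard Levenshtein distance via dynamic programming."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def edit_ratio(a, b):
    """Edit distance normalized by the longer word's length."""
    return edit_distance(a, b) / max(len(a), len(b), 1)

def is_cognate(french_word, translation, synonyms=(), threshold=0.5):
    # Step 1: compare the French word against its direct translation.
    if edit_ratio(french_word, translation) <= threshold:
        return True
    # Step 2: fall back to synonyms of the translation
    # (WordNet supplies these in the real code).
    return any(edit_ratio(french_word, s) <= threshold for s in synonyms)
```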
Here's an overview of the main files in the repo:
- `production_data/fr_en_dict.json`: Contains a few thousand French words and their English translations. Speeds up the cognate identification process severalfold, since commonly used words don't have to be run through Google Translate (which takes >= 0.5 seconds per word).
- `app.py`: The web app, which you can try out for yourself at app.vkethana.com.
- `backend.py`: The most important file in the list. Contains the heuristic function, many calls to GPT-3.5 and GPT-4 for beam search and sentence scoring, and some helper functions. It also has the cognate identification function, some sentence starters for the model, and the parameters that control how the model is invoked. The GPT-3.5 calls ask the model to extend an already-existing sentence by 5 to 10 words using the v1 completions endpoint, which outputs 6 choices, from which we keep 3 (using the simple heuristic). From those 3 choices, we narrow down to just one by asking ChatGPT which of the three is most readable. At no point does the model generate a sentence totally from scratch; it always has at least one word as a starter. If `use_seed_words` is enabled, the model is also given 2 cognates as starter words.
- `utils.py`: Contains the `node` class, a wrapper around sentences that allows for easy comparison and storage of scoring information. Handles the calls to NLTK, WordNet, Google Translate, and the edit distance function.
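The 6 → 3 → 1 narrowing step in `backend.py` can be sketched like this. Both scoring functions below are toy stand-ins: `heuristic_score` would be the heuristic's total score, and `pick_most_readable` would be the ChatGPT readability call.

```python
def narrow_completions(completions, heuristic_score, pick_most_readable, keep=3):
    """Sketch of one beam-search step: several completions in, one out.

    completions        -- candidate sentence extensions from the model
    heuristic_score    -- callable scoring a candidate (higher is better)
    pick_most_readable -- stand-in for the GPT readability judgment
    """
    # Keep the top `keep` candidates by heuristic score...
    finalists = sorted(completions, key=heuristic_score, reverse=True)[:keep]
    # ...then let the (stubbed) model pick the single most readable one.
    return pick_most_readable(finalists)

# Toy example: "score" = length, "readability" = shortest of the finalists.
candidates = ["a b", "a b c d", "a", "a b c", "a b c d e", "x"]
best = narrow_completions(
    candidates,
    heuristic_score=len,
    pick_most_readable=lambda finalists: min(finalists, key=len),
)
```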
If the `use_seed_words` setting is enabled in `openai_beam_search.py`, GPT-3.5 will occasionally output underscores in place of the actual seed word. It may also make grammatical errors. For example, consider the following prompt:
You are about to receive a sentence in French.
Please complete the sentence in that language as coherently as possible.
Please include at least one of the following words in your response: abondantes., caractérisé.
You may include additional sentences afterward.
Please try to generate human-like text.
Above all, please do not write sentences in English (loanwords OK).
Avoid including random underscores in your response.
The sentence is:
La
One incorrect output for this prompt that exhibits the bug is the following:
La forêt amazonienne est __________ par sa biodiversité abondantes. Les
Interestingly, if we replace the underscores with the provided seed word "caractérisé", the sentence makes sense: "The Amazon forest is characterized by its abundants [sic] biodiversity."
The model also makes grammatical errors in some cases; note, for example, the incorrect plural adjective in "biodiversité abondantes" (it should be "abondante").
Sometimes we get outputs of zero length. This is a problem when there are lines of code like
cognate_ratio = len(cognates) / len(words)
because it causes a division by zero when `words` is empty. Some checks have been implemented to prevent this, but it still happens occasionally.
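One straightforward guard (a hypothetical sketch; the project's existing checks may differ) is to short-circuit before dividing:

```python
def safe_cognate_ratio(cognates, words):
    # Return 0.0 for zero-length outputs instead of raising ZeroDivisionError.
    return len(cognates) / len(words) if words else 0.0
```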
Next steps for the project:
- Fine-tune GPT-3.5 on a subset of examples with high scores (done)
- Find a way to compare the fine-tuned model with the original model. Can we prove that the fine-tuned model outputs more cognateful sentences?
- How do we prove that the fine-tuned model is not simply overfitting?
- How can we generate sentence-starters programmatically instead of grabbing from a predetermined list?
- How can we choose cognate "seed words" for a sentence without causing hallucinations (see above)? Also, how can we choose the cognate words programmatically instead of selecting from a predetermined list of cognates?
- Would RL help with this project?