This repository contains the project outline for the Code-Mixed Machine Translation Project undertaken as part of the ANLP (Advanced Natural Language Processing) course at IIIT Hyderabad by Team The Triad: Yash Bhaskar, Sankalp Bahad, and Utsav Shekhar.
- Introduction
- Scope of the Project
- Challenges
- Dataset Exploration
- Evaluation Metrics
- Literature Review
- Research Gap
- GitHub Repository Links for Reference Implementations
- Project Implementation Plan
- Conclusion
- Niche problem: Code-Mixed Machine Translation
- Code-mixed data is prevalent in the daily lives of multilingual individuals.
- Machine translation is crucial for bridging the linguistic gap in code-mixed data.
- Limited research exists in the field of code-mixed machine translation.
- Address the challenges of code-mixed machine translation, specifically from Hinglish to English.
- Improve upon baseline results using various techniques.
- Achieve better scores on MT evaluation metrics.
- Informal nature of code-mixing.
- Lack of formal sources for code-mixed languages.
- Variability in transliteration.
- Code-switching patterns.
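
To make the transliteration challenge concrete, here is a minimal sketch of how romanized spelling variants might be collapsed onto one canonical form using character-level similarity. The lexicon, the example words, the threshold, and the function names are all hypothetical and chosen only for illustration; this is not the project's preprocessing pipeline.

```python
from difflib import SequenceMatcher

# Hypothetical canonical spellings for a few romanized Hindi words.
LEXICON = ["nahi", "kya", "acha"]


def similarity(a: str, b: str) -> float:
    """Character-level similarity in [0, 1] from difflib."""
    return SequenceMatcher(None, a, b).ratio()


def canonicalize(token: str, lexicon=LEXICON, threshold: float = 0.7) -> str:
    """Return the closest canonical spelling, or the token unchanged
    when nothing in the lexicon is at least `threshold` similar."""
    best, best_score = token, threshold
    for canon in lexicon:
        score = similarity(token.lower(), canon)
        if score >= best_score:
            best, best_score = canon, score
    return best


def normalize_sentence(sentence: str) -> str:
    """Canonicalize every whitespace-separated token in a sentence."""
    return " ".join(canonicalize(tok) for tok in sentence.split())


if __name__ == "__main__":
    for word in ["nahin", "nhi", "kyaa", "accha", "the"]:
        print(word, "->", canonicalize(word))
```

A fixed threshold is a blunt instrument: set too low, it merges unrelated short words, and set too high, it misses heavily abbreviated forms, which is part of why transliteration variability is listed as a challenge.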
- Hinglish-TOP dataset
  - Size: 180,000 utterances in total (10,000 human-annotated and 170,000 synthetically generated)
  - Related publication: CST5: Code-Switched Semantic Parsing using T5
  - Description: Code-switched semantic parsing dataset with human-annotated and synthetically generated utterances.
- Dakshina Dataset
  - Collection of text in 12 South Asian languages.
  - Includes native script Wikipedia text and parallel data in native script and Latin alphabet.
- Hinglish-TOP dataset is a valuable resource for code-switched semantic parsing tasks.
- Diverse data examples with human-annotated and synthetic utterances.
- Connection to CST5 technique suggests its relevance and potential impact.
- BLEU Score: Measures n-gram overlap between model output and reference translations, combining modified n-gram precision with a brevity penalty.
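- CoMeT Score: A learned neural metric trained on human translation-quality judgments. It is a general-purpose MT metric rather than one built specifically for code-mixed text.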
- chrF++ (character n-gram F-score extended with word unigrams and bigrams): Suitable for evaluating code-mixed translations because character-level matching tolerates spelling variation.
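
As an illustration of the character-level idea behind chrF, the sketch below computes a simplified sentence-level character n-gram F-score in pure Python. It is an assumption-laden teaching example, not the official metric: it omits the word n-grams that distinguish chrF++ from chrF, strips all whitespace, and averages only over n-gram orders that both strings can support. For reported results, an established implementation such as sacreBLEU would be the safer choice.

```python
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams after removing whitespace."""
    chars = "".join(text.split())
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def simple_chrf(hypothesis: str, reference: str,
                max_n: int = 6, beta: float = 2.0) -> float:
    """Simplified chrF: F-beta over averaged character n-gram
    precision and recall for n = 1..max_n (recall weighted by beta)."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short for this n-gram order
        overlap = sum((hyp & ref).values())  # clipped matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    b2 = beta ** 2
    return (1 + b2) * p * r / (b2 * p + r)


if __name__ == "__main__":
    # A spelling variant still earns partial credit at the character level,
    # whereas exact word matching would score it as a complete miss.
    print(simple_chrf("nahin", "nahi"))
    print(simple_chrf("i do not know", "i don't know"))
```

The default `beta = 2.0` weights recall more heavily than precision, which matches the common chrF configuration; the value is exposed as a parameter so other weightings can be tried.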
- Challenges of code-mixing in informal conversations.
- Data preparation using the Hinglish-TOP dataset.
- Model fine-tuning and post-processing.
- Evaluation metrics: BLEU, CoMeT, chrF++.
- Addressing the challenge of code-switched semantic parsing.
- Data preparation and preprocessing.
- Evaluation metrics: Naturalness, Semantic Equivalence.
- Impact of CST5 on semantic parsing performance.
- Identification of gaps in current research:
  - Informal nature of code-mixing.
  - Lack of formal sources.
  - Variability in transliteration.
  - Code-switching patterns.
- Code-Mixed Machine Translation Dataset: Contains raw data and preprocessed versions for training and evaluation.
- Hinglish-TOP Dataset: A valuable resource for code-mixed language research.
- Indic-Trans: Transliteration library for Hindi to Roman conversion.
- Fairseq: Toolkit for sequence-to-sequence tasks, including machine translation.
- Overview of the Proposed Approach
- Tentative Timeline
- Expected Challenges and Mitigation Strategies
- Summary of the Project Outline
- Importance of Addressing the Code-Mixed Machine Translation Problem
- Anticipated Project Outcomes
Note: The information presented here is part of the project outline and may be subject to updates and modifications during the project's progress.