
👩🏻‍💻 Meet Ayaka: A Passionate Researcher & Open-Source Contributor

Hi there! I am Ayaka, a 24-year-old computer science, historical linguistics, and mathematics researcher.

I have made significant contributions to the open-source community. I have created numerous open-source projects on GitHub and have hosted several websites and web services at my own expense. My open-source contributions span various fields, including deep learning, natural language processing, language conservation, historical linguistics, and computational linguistics.

📚 Proficiency in Deep Learning

My expertise in deep learning is reflected in my familiarity with JAX and Google Cloud TPU. I actively submit bug reports, participate in feature discussions and answer questions in the JAX and Google Cloud TPU community. In addition, I created TPU Starter, a comprehensive guide that has helped many people to get started with JAX and Google Cloud TPU. The guide has been translated into Korean and Chinese. Moreover, to enhance the user experience of JAX, I developed jax-smi, a tool that enables the monitoring of real-time memory usage of JAX programs, providing a similar experience to that of nvidia-smi. My significant contributions led to the honour of receiving the 2023 Google Open Source Peer Bonus Award.

💬 Natural Language Processing Expertise

In natural language processing, I have contributed to the Hugging Face Transformers library and released several NLP models. Besides, I have reimplemented the BART and Llama 2 models, and also collaborated on the reimplementation of the Mistral model, all from scratch using pure JAX. These projects provide high-quality open-source codebases to deep learning researchers and engineers and demonstrate how Transformer models can be implemented using JAX and trained on Google Cloud TPUs. Moreover, I implemented the BERT model from scratch using NumPy, performed in-browser inference using Pyodide, and thereby created TrAVis, a BERT attention visualiser that runs entirely within a browser. The visualiser offers an intuitive visualisation of BERT's attention mechanism for researchers.

I constantly keep up with the most advanced AI technologies. I am an early adopter of ChatGPT, the most advanced large language model today, and have been studying it since its release. I am the co-author of the open-source Better ChatGPT website. Utilising the ChatGPT API, this website offers many advanced features and greatly enhances the ChatGPT user experience. It has garnered over 6,000 stars on GitHub and is being used by millions of users worldwide.

🌏 Language Conservation Efforts

My expertise in NLP also extends to language conservation. I trained the BART model for Cantonese, a low-resource language, and released it on the Hugging Face Hub. Building upon this, I proposed TransCan, an English-to-Cantonese machine translation model, greatly outperforming the state-of-the-art commercial machine translation system by 11.8 BLEU. The model has been released on GitHub, bringing benefits to both Cantonese and the wider low-resource NLP community.

In addition to language models, I have created several datasets. In the LIHKG Scraper project, I circumvented many layers of Cloudflare's restrictions to scrape LIHKG, one of the most popular Cantonese forums in Hong Kong, resulting in a corpus of 172,937,863 unique sentences. I have also created two English-Cantonese parallel corpora, Words.hk and ABC Cantonese.

Moreover, for the conservation of Hainanese and Hakka, I engineered web-scraping programs to regularly fetch the latest TV news of Wenchang and Xingning, which are broadcast in their local dialects.

🕰️ Pioneering Contributions in Historical Linguistics

I have also made considerable contributions to the field of historical linguistics. I founded the open-source organisation, nk2028, attracting a community of experts in historical linguistics. In nk2028, we have conducted pioneering research in the field of Middle Chinese phonology. We innovatively formalised the phonological positions of the Tshet-uinh phonological system as 6-tuples, which allowed us to accurately analyse the sound changes that have happened throughout the history of the Chinese language.

Moreover, in the process of putting this system into practice, we explored different methods of representing the laws of sound changes in computer programs. Initially, we designed a domain-specific language in PureScript and utilised SQLite as the database. In subsequent research, we simplified our approach by designing a novel JavaScript library, which greatly enhanced productivity.
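As a rough illustration of encoding a phonological position as a 6-tuple and expressing sound-change laws as ordinary code, here is a minimal Python sketch. The `Position` type, its field names, the `toy_tone_label` function, and the sample values are all assumptions made for illustration; they are not taken from the nk2028 libraries, and the rules are toys rather than real sound laws.

```python
from typing import Callable, List, NamedTuple, Optional, Tuple


class Position(NamedTuple):
    """A phonological position as a 6-tuple (field names are assumed)."""
    initial: str             # initial consonant category
    rounding: Optional[str]  # open/closed articulation, if applicable
    division: str            # division (grade)
    kind: Optional[str]      # subtype distinction, if applicable
    rhyme: str               # rhyme category
    tone: str                # tone category


# A rule pairs a predicate over positions with an output label.
Rule = Tuple[Callable[[Position], bool], str]

# Toy rules, checked in order; the labels are made up for this sketch.
TOY_TONE_RULES: List[Rule] = [
    (lambda p: p.tone == "entering", "checked"),
    (lambda p: p.tone == "level", "plain"),
]


def toy_tone_label(pos: Position, default: str = "other") -> str:
    """Return the label of the first rule whose predicate matches."""
    for matches, label in TOY_TONE_RULES:
        if matches(pos):
            return label
    return default


# Hypothetical sample position, not a real dictionary entry.
example = Position("p", None, "3", None, "example-rhyme", "entering")
print(toy_tone_label(example))  # prints "checked"
```

Because every position is a plain tuple, a rule is just a function of six fields, which is what makes it practical for contributors to write derivation rules for different languages.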

Based on this, we released the Qieyun Autoderiver website, allowing community members to contribute laws of sound changes for various languages. This website has effectively invigorated the community and attracted many people to this field. To help beginners master the Tshet-uinh phonological system, we also published many tools, such as a tool to automate the process of puonq-tshet, a tool to generate Tshet-uinh Flashcards, and a tool to look up Tshet-uinh phonological positions.

💻 Innovations in Computational Linguistics

In nk2028, I have also made contributions to other aspects of linguistics. In the field of dialectology, we took over the discontinued MCPDict project and released the Chinese Dialect Pronunciation Atlas. Regarding classical Chinese, with the consent of the data provider, the Sou-Yun website, we published ORCHESTRA, a comprehensive dataset of classical Chinese poetry. For phonetics, we created an IPA Online Practice System and a Putonghua IPA Converter.

Besides, I maintained the simplified-traditional Chinese conversion project OpenCC and its successor StarCC. These projects can accurately handle the problem of one-to-many mappings in simplified-traditional Chinese conversion. On top of this, leveraging my in-depth understanding of OpenType font features, I proposed a novel approach for simplified-to-traditional conversion fonts to handle the one-to-many mappings. Based on this approach, I produced two simplified-to-traditional conversion fonts, Fan Wun Ming and Fan Wun Hak. The approach I proposed has also been adopted by other font developers, enhancing the vibrancy of the typographic community.
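To make the one-to-many problem concrete: simplified 发 corresponds to traditional 發 in 发展 (發展) but to 髮 in 头发 (頭髮), so a character-by-character table is not enough. The sketch below uses greedy longest-match lookup against a tiny phrase table. The table contents and the function name `to_traditional` are assumptions for illustration; this is not the actual implementation of OpenCC, StarCC, or the conversion fonts.

```python
# Tiny hypothetical table: phrases disambiguate, single characters are fallbacks.
TABLE = {
    "头发": "頭髮",
    "发展": "發展",
    "头": "頭",
    "发": "發",
}
MAX_LEN = max(len(key) for key in TABLE)


def to_traditional(text: str) -> str:
    """Convert text by always taking the longest table entry at each position."""
    out = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for size in range(min(MAX_LEN, len(text) - i), 0, -1):
            chunk = text[i:i + size]
            if chunk in TABLE:
                out.append(TABLE[chunk])
                i += size
                break
        else:
            # No entry matched: keep the character unchanged.
            out.append(text[i])
            i += 1
    return "".join(out)


print(to_traditional("头发"))  # 頭髮
print(to_traditional("发展"))  # 發展
```

A word missing from the table falls back to the single-character default, which is why real converters rely on large phrase dictionaries.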

For Cantonese, I published cantoseg, an effective Cantonese segmentation tool. I have also created two tools, namely ToJyutping and Inject Jyutping, which aid Cantonese learners in mastering the pronunciation of Chinese characters.

I am an active contributor to the rime input method community. As a member of the CanCLID organisation, I maintain rime-cantonese, a rime input schema for Cantonese. I've also released input schemata for TUPA, Loengfan, Mandarin, and Nüshu. Utilising my C++ and Python knowledge, I developed librime-python, a rime Python plugin that allows users to control the behaviour of the rime program through simple Python scripts. Moreover, I have curated awesome-rime, a comprehensive list of rime schemata and configs, gathering the efforts of the rime community.

🎲 Miscellaneous Endeavours and Contributions

My open-source contributions extend to my other areas of interest as well. With a deep understanding of the x64 instruction set and the Windows PE file format, I crafted the smallest 64-bit PE file on Windows 10 in assembly language. The file is a Windows executable of merely 268 bytes that runs normally and pops up a message box. Moreover, I proposed the Nya Calendar, a lunisolar-mercurial calendar that considers the synodic period of the Earth and Mercury and encompasses several unique properties.

In addition, I have contributed to the Arch Linux community by maintaining several AUR packages. I host several open-source websites and web services at my own expense, including the Online Nushu Dictionary website, a Graphviz server, a Telegram translation bot, and an instance of the Shieldy bot.

If you want to know more about me and explore my other passions and interests, feel free to visit my personal website!

Ayaka's Projects

aur

Ayaka PKGBUILD Repository

awesome-rime

A curated list of Rime IME schemata and configs | Rime 輸入法方案和配置列表

ayaka-site

Personal website deployed at https://ayaka.shn.hk/

ayaka14732

The special repository whose README.md will appear on my public profile

bytevid

Say goodbye to long and boring videos 👋

cantoseg

Cantonese segmentation tool 粵語分詞工具

cdn

GitHub Pages + Cloudflare as a CDN

chatgptapifree

A simple and open-source proxy API that allows you to access OpenAI's ChatGPT API for free!

cs224n-a4

A decent solution to Assignment #4 of CS 224n, Winter 2022 (Cherokee NMT)

einshard

Einsum-like high-level array sharding API for JAX

fanwunhak

A Simplified-Chinese-to-Traditional-Chinese font based on GenYoGothic, which can handle the one-to-many problem | 繁媛黑體是基於源樣黑體開發的簡轉繁字型,能處理一簡對多繁

fanwunming

A Simplified-Chinese-to-Traditional-Chinese font based on GenYoMin, which can handle the one-to-many problem | 繁媛明朝是基於源樣明體開發的簡轉繁字型,能處理一簡對多繁

flax

Flax is a neural network library for JAX that is designed for flexibility.

freechatgpt

Play and chat smarter with BetterChatGPT - an amazing open-source web app with a better UI for exploring OpenAI's ChatGPT API!

furigana

An application to add furigana to Japanese texts

inject-xdi8

A browser extension that adds Xdi8 on Chinese characters
