Comments (6)
Interesting, good catch. I was able to reproduce this. The tokenization looks fine, so the sequence-to-sequence model itself is outputting the HTML characters. The training data (OpenSubtitles, ParaCrawl, UNPC from OPUS) is generally created by scraping real-world data, so I'm guessing some escaped HTML codes made their way into it. It might make sense to add a step in the training scripts that filters out any lines containing HTML escape codes.
from argos-translate.
I ended up removing this filtering because it substantially slowed down training. I'm still looking for a better solution, but I'm hoping that better-quality models in general will do this less.
from argos-translate.
You could also use html.unescape: https://docs.python.org/3/library/html.html#html.unescape
together with a regex or BeautifulSoup to strip tags: https://stackoverflow.com/questions/9662346/python-code-to-remove-html-tags-from-a-string
That would make the cleanup more generic.
from argos-translate.
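For illustration, the clean-up approach suggested above could look roughly like the sketch below. It is not code from argos-translate or its training scripts: `clean_line` and `TAG_RE` are hypothetical names, and the regex only handles simple tags, not comments, scripts, or malformed markup. Tags are stripped before entities are decoded so that text such as `&lt;3` survives as a literal `<3`.

```python
import html
import re

# Hypothetical helper for illustration only; not part of argos-translate.
# Matches simple opening/closing tags such as <i>, </i>, or <a href="...">.
TAG_RE = re.compile(r"</?[A-Za-z][^>]*>")


def clean_line(line: str) -> str:
    """Strip simple HTML tags, then decode HTML entities in what remains."""
    without_tags = TAG_RE.sub("", line)
    return html.unescape(without_tags)


if __name__ == "__main__":
    print(clean_line("<i>Tom &amp; Jerry</i>"))
    print(clean_line("a &lt; b"))
```

Whether cleaning is preferable to discarding the affected lines depends on how trustworthy the remaining text is once the markup has been removed.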
That might make sense but for now I'm just removing data that has one of these strings instead of trying to clean it up.
from argos-translate.
This should be done for any models trained going forward.
html_entities = ['&apos;', '&nbsp;', '&lt;', '&gt;', '&quot;']
html_tags = ['<a>', '<p>', '<h1>', '<i>']
Suggestions for additional things that might be in training data to filter appreciated.
from argos-translate.
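A minimal sketch of the "drop the line instead of cleaning it" filter described above, assuming the entity list refers to escaped forms such as `&lt;` and `&quot;`. The function names, the lower-casing, and the (source, target) pair layout are assumptions made for illustration, not the actual training script.

```python
from typing import Iterable, Iterator, Tuple

# Marker strings assumed from the lists in the comment above.
HTML_ENTITIES = ["&apos;", "&nbsp;", "&lt;", "&gt;", "&quot;"]
HTML_TAGS = ["<a>", "<p>", "<h1>", "<i>"]
MARKERS = tuple(HTML_ENTITIES + HTML_TAGS)


def has_html_residue(text: str) -> bool:
    """Return True if the text contains any of the marker strings."""
    lowered = text.lower()  # assumption: treat <I> and <i> the same
    return any(marker in lowered for marker in MARKERS)


def filter_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield only the sentence pairs where neither side has HTML residue."""
    for source, target in pairs:
        if not (has_html_residue(source) or has_html_residue(target)):
            yield source, target


if __name__ == "__main__":
    sample = [
        ("Hello", "Hola"),
        ("&quot;Hi&quot;", "Hola"),
        ("Bye", "<i>Adiós</i>"),
    ]
    print(list(filter_pairs(sample)))
```

Because the check is a plain substring match, tags with attributes (for example `<a href="...">`) and entities outside the list (for example `&amp;`) would pass through unless more markers are added.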
Still seeing this on models downloaded fresh today, was there perhaps a regression?
https://github.com/Qix-/bellingbot/issues/2 has some sample text; RU->EN.
from argos-translate.