apertium-cat's Introduction

Apertium

Apertium is an open-source rule-based machine translation toolchain and ecosystem. It facilitates the creation of consistent and transparent machine translation systems by relying on deterministic linguistic rules rather than statistical or neural models. Apertium's tools are designed to be language-agnostic and platform-independent, making them suitable for a wide range of languages and applications.

Project Overview

Apertium's framework is based on finite-state transducers, which enable efficient and accurate processing of natural languages. The language data used by Apertium is stored in XML and other human-readable text formats, organized into modular single-language packages and translation pairs. This modularity allows for the reuse of language data across multiple translation systems.
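As a rough illustration of the finite-state idea, the sketch below builds a tiny lookup automaton in Python that maps surface forms to analyses. It is a toy under stated assumptions, not Apertium's implementation: the class `ToyTransducer`, its methods, and the two example entries are hypothetical, and output is attached only to final states to keep the example short.

```python
from typing import Dict, Optional, Tuple


class ToyTransducer:
    """Hypothetical, simplified transducer: a character trie whose
    final states carry an output string (the analysis)."""

    def __init__(self) -> None:
        self.arcs: Dict[Tuple[int, str], int] = {}  # (state, char) -> next state
        self.outputs: Dict[int, str] = {}           # final state -> analysis
        self.n_states = 1                           # state 0 is the start state

    def add(self, surface: str, analysis: str) -> None:
        """Add one surface form, reusing arcs shared with earlier entries."""
        state = 0
        for ch in surface:
            key = (state, ch)
            if key not in self.arcs:
                self.arcs[key] = self.n_states
                self.n_states += 1
            state = self.arcs[key]
        self.outputs[state] = analysis

    def lookup(self, surface: str) -> Optional[str]:
        """Follow one arc per character; return None if the walk fails
        or ends in a non-final state."""
        state: Optional[int] = 0
        for ch in surface:
            state = self.arcs.get((state, ch))
            if state is None:
                return None
        return self.outputs.get(state)


fst = ToyTransducer()
fst.add("casa", "casa<n><f><sg>")   # illustrative entries only
fst.add("cases", "casa<n><f><pl>")
print(fst.lookup("cases"))
```

Because "casa" and "cases" share the prefix "cas", they share the first three arcs, and lookup time depends on the length of the word rather than on the size of the dictionary.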

Features

  • Rule-Based Translation: Consistent and understandable translations based on deterministic rules.
  • Finite-State Transducers: Efficient language processing built on finite-state transducers.
  • Language-Agnostic Tools: Broad applicability across multiple languages.
  • Modular Design: Reusable language packages simplify the development of new translation pairs.

Installation

Apertium provides binaries for several platforms, including Debian, Ubuntu, Fedora, CentOS, OpenSUSE, Windows, and macOS. Both nightly builds and official releases are available. If you are on a supported platform, it is recommended to use the pre-built binaries.

For more information, see the Apertium Installation Guide.

Building from Source

If you need to modify Apertium’s behavior or are on a platform that is not officially supported, follow these steps to build from source.

Requirements

Compiling

$ autoreconf -fvi
$ ./configure
$ make

Usage

Apertium translates text between supported languages. Assuming the relevant language data (here the Spanish-Catalan translator) has been installed, translate a file with the following command:

$ apertium spa-cat input.txt output.txt

The apertium executable can also use piped streams:

$ echo "La casa es roja." | apertium spa-cat

Language data which has been compiled but not installed can be used with the -d flag:

$ echo "La casa es roja." | apertium -d ./apertium-spa-cat spa-cat

Formats other than plaintext can be specified with the -f flag:

$ apertium -f html spa-cat input.html output.html

Data packages may provide modes besides the main translation mode. Use the -l flag to list them.

$ apertium -l
$ apertium -l -d ./apertium-spa-cat
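The invocations above can also be assembled programmatically. The following is a minimal Python sketch that builds the argument list using only the flags shown in this section (-d, -f, -l). The function names `build_command` and `translate` are hypothetical, and `translate` assumes the apertium executable is on PATH and the requested mode is available.

```python
import subprocess
from typing import List, Optional


def build_command(mode: Optional[str] = None,
                  fmt: Optional[str] = None,
                  data_dir: Optional[str] = None,
                  list_modes: bool = False) -> List[str]:
    """Build an argv list for the apertium executable (hypothetical helper)."""
    cmd = ["apertium"]
    if list_modes:
        cmd.append("-l")           # list the modes a data package provides
    if data_dir is not None:
        cmd += ["-d", data_dir]    # compiled but not installed language data
    if fmt is not None:
        cmd += ["-f", fmt]         # input format other than plaintext
    if mode is not None:
        cmd.append(mode)           # e.g. "spa-cat"
    return cmd


def translate(text: str, mode: str, **options) -> str:
    """Pipe text through apertium and return its standard output.
    Raises CalledProcessError if the command exits with an error."""
    result = subprocess.run(build_command(mode, **options), input=text,
                            capture_output=True, text=True, check=True)
    return result.stdout
```

For example, `translate("La casa es roja.", "spa-cat", data_dir="./apertium-spa-cat")` would run the same pipeline as the -d example above.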

Additional Tools

This repository also provides the following executables:

Pipeline Modules

  • apertium-extract-caps, apertium-restore-caps: Handle capitalization
  • apertium-pretransfer: Split compound analyses into separate words for processing by apertium-transfer
  • apertium-posttransfer: Clean up repeated spaces
  • apertium-tagger: Perform statistical part-of-speech tagging
  • apertium-transfer, apertium-interchunk, apertium-postchunk: Structural transfer modules (documentation)
  • apertium-wblank-attach, apertium-wblank-detach, apertium-wblank-mode: Handle word-bound blanks

Build Tools

These programs are used in the process of compiling linguistic data packages.

  • apertium-compile-caps: Compile capitalization-handling rules for use by apertium-restore-caps (documentation)
  • apertium-gen-modes: Process the modes.xml file, which specifies what translation and analysis modes a data package provides
  • apertium-preprocess-transfer: Process structural transfer rule files for use by apertium-transfer
  • apertium-validate-acx, apertium-validate-crx, apertium-validate-dictionary, apertium-validate-interchunk, apertium-validate-modes, apertium-validate-postchunk, apertium-validate-tagger, apertium-validate-transfer: Validators for various XML rule formats

Format Handlers

For each supported file format, there is a deformatter named apertium-des[NAME] (e.g. apertium-deshtml) which reads formatted text from standard input and writes Apertium stream format to standard output. There is also a corresponding set of reformatters which do the reverse and are named apertium-re[NAME] (e.g. apertium-rehtml). These programs rarely need to be invoked directly, since they are handled by the apertium executable.

Most of the format handlers are currently deprecated in favor of Transfuse.
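The naming convention for format handlers can be expressed as a small helper. This is only a sketch: `handler_names` and `find_handlers` are hypothetical names, and whether a given handler is actually installed depends on the local setup.

```python
import shutil
from typing import Optional, Tuple


def handler_names(fmt: str) -> Tuple[str, str]:
    """Return the (deformatter, reformatter) executable names for a format,
    following the apertium-des[NAME] / apertium-re[NAME] convention."""
    return f"apertium-des{fmt}", f"apertium-re{fmt}"


def find_handlers(fmt: str) -> Tuple[Optional[str], Optional[str]]:
    """Look both executables up on PATH; an entry is None if not found."""
    deformatter, reformatter = handler_names(fmt)
    return shutil.which(deformatter), shutil.which(reformatter)


print(handler_names("html"))
```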

License

This project is licensed under the GNU General Public License v2.0. See the COPYING file for details.

For more information, visit Apertium or the Apertium Wiki.

apertium-cat's Issues

missing cat.autopgen-diacritics-vells.bin

Output from trying to build apertium-eng-cat

make: *** No rule to make target `/Users/username/apertium/share/apertium/apertium-cat/cat.autopgen-diacritics-vells.bin', needed by `eng-cat.autopgen-diacritics-vells.bin'.  Stop.

Disambiguate: els fins, els seus fins, els set

Noting this so that it is not overlooked. Disambiguation needs to be fixed in these phrases:

els fins aquí regulats,
els fins, objectius,
aconseguir aquests fins
els set acadèmics.

'haver' + adverb beginning with 'de'

A sentence such as "ha de fet treballat" is analysed as:

$ echo "ha de fet treballat" | apertium -d . cat-disam
"<ha de>"
	"haver# de" vbmod pri p3 sg
"<fet>"
	"fet" n m sg
;	"fer" vblex pp m sg REMOVE:1903
"<treballat>"
	"treballat" adj m sg
	"treballar" vblex pp m sg

It should be "haver" + "de fet". I tried to fix this by adding the following line to the dictionary, but it did not work:

<e r="LR" lm="haver de fet"><i></i><par n="haver__vblex"/><p><l><b/>de<b/>fet</l><r><j/>de<b/>fet<s n="adv"/></r></p></e>

Once this is solved, the same should be done for the many adverbs that begin with "de" ("de co(l)p", "de mica en mica", "de dalt a baix", etc.).

Opening question marks at the start of a question

When translating from Catalan to Spanish:

Teniu concedida alguna beca o ajut del curs 2020-21? > Tenéis concedida alguna beca o ayuda del curso 2020-21?

The output should be "¿Tenéis concedida alguna beca o ayuda del curso 2020-21?"

Spanish requires an opening question mark at the start of an interrogative sentence.
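To make the expected behaviour concrete, here is a naive post-processing sketch in Python. It is not how Apertium handles punctuation and is not a proposed fix for the language data: the function `add_opening_question_marks` is hypothetical, and it assumes that every sentence ending in "?" is a single question whose opening mark belongs at the very start of that sentence.

```python
import re

# Naive sentence pattern: a run of non-terminal characters followed by
# any sentence-final punctuation. Real text needs a proper segmenter.
_SENTENCE = re.compile(r"[^.!?]+[.!?]*")


def add_opening_question_marks(text: str) -> str:
    """Prefix each sentence that ends in '?' with '¿' unless already present."""

    def fix(match: "re.Match[str]") -> str:
        sentence = match.group(0)
        stripped = sentence.lstrip()
        if stripped.endswith("?") and not stripped.startswith("¿"):
            leading = sentence[: len(sentence) - len(stripped)]
            return leading + "¿" + stripped
        return sentence

    return _SENTENCE.sub(fix, text)


print(add_opening_question_marks(
    "Tenéis concedida alguna beca o ayuda del curso 2020-21?"))
```

A string-level pass like this cannot place the mark mid-sentence (for example after an introductory clause), which is one reason a rule inside the translation pipeline would be preferable.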

Error in make test

Running "make test" produces an error:

diff --git a/dev/greptests.txt b/dev/greptests.txt
index a0d444f..f267292 100644
--- a/dev/greptests.txt
+++ b/dev/greptests.txt
@@ -1,3 +1,5 @@
+apertium-cat.cat.metadix: <e lm="van Doesburg"><i>van Doesburg</i><par n="Saussure__np"/></e>
+apertium-cat.cat.metadix: <e lm="van Eesteren"><i>van Eesteren</i><par n="Saussure__np"/></e>
 apertium-cat.cat.metadix:<e lm="que " r="RL">     <i>que<b/></i><par n="que__cnjsub"/></e> <!-- "que" afegit en castellà-->
 apertium-cat.cat.metadix:<e r="RL" lm="no tenir raó"><i>no<b/></i><par n="/tenir__vblex"/><p><l><b/>raó</l><r><g><b/>raó</g></r></p></e>
 apertium-cat.cat.metadix:<e lm="no tenir res a veure"><i>no<b/></i><par n="abs/tenir__vblex"/><p><l><b/>res<b/>a<b/>veure</l><r><g><b/>res<b/>a<b/>veure</g></r></p></e>
make: *** [Makefile:808: test] Error 1

I suspect this is because two surnames with "van" were added to the dictionary but were not added somewhere else.

IEC Nomenclàtor

The IEC's Nomenclàtor mundial has changed the Catalan spelling of some international place names. Both ésAdir and the AVL have begun to follow these criteria. Below is a list (copied from ésAdir) of what appear to be the most significant changes, for when there is time to review them in the Apertium dictionaries.

Astana
Bandaaceh
Bengaluru
Dhaka
Donbàs, el
Guiza
Kirguizstan, el
Luhansk
Mississipi
Montreal
Myanmar
Múnic
Pensilvània
Ramal·lah
Yangon
El Salvador
Eswatini
Kenya
Trinidad i Tobago
Vènet, el
Shanghai
Zúric
Sant Feliu Sasserra

Preferences: avui/hui

If we create a preference for avui/hui, val_uni differs from val_gva only in verb endings, and we are one step closer to eliminating one more compilation for generation.

pense: indicative / subjunctive

Some Valencian verb forms can be ambiguous. Disambiguation could be improved, at least in some clear cases.

"perquè ell pense que està resolt" cat-spa
porque él pienso que está resuelto

"perquè Joan pense que està resolt" cat-spa
porque Joan pienso que está resuelto

"perquè el meu amic pense que està resolt" cat-spa
porque mi amigo pienso que está resuelto
