
Bilingual term extractor

Home Page: https://www.tm2tb.com

License: GNU General Public License v3.0


tm2tb's Introduction

tm2tb

tm2tb is a bilingual term extractor.

It identifies terms in both the source and target languages.

Given a Translation Memory, it extracts a Term Base.

What is a Term Base?

A term base is a collection of terms relevant to a translation project.

It is like having a specialized bilingual dictionary.

It includes terms along with their corresponding translations in the target language.

What is a Translation Memory?

A Translation Memory is a file that stores translations from previous projects.

Typically, it's bilingual, containing pairs of sentences in the source and target languages.

However, it can also include translations in multiple languages.

Where can I use tm2tb?

Translation and localization

  • Bilingual term lists play a crucial role in quality assurance during translation and localization projects.

Machine translation

  • Bilingual terminology is used to fine-tune machine translation models.

Foreign language teaching

  • Bilingual term lists can be used as a teaching resource.

Transcreation

  • Creative, non-literal translations can be extracted from bilingual data.

What is tm2tb's approach?

  1. Extract terms from the source and target sentences

  2. Use an AI model to convert these terms to 'vectors' (sequences of numbers that capture their meaning)

  3. Compare these vectors to find the most similar source and target term matches (see the sketch below)
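
For illustration, here is a minimal sketch of the matching step, assuming a sentence-transformers model such as LaBSE. This is not tm2tb's actual implementation, and the candidate terms are hard-coded for brevity:

from sentence_transformers import SentenceTransformer, util

# Hypothetical, hand-picked candidate terms; tm2tb extracts these automatically.
src_terms = ["giant panda", "red panda", "bamboo"]
tgt_terms = ["panda gigante", "panda rojo", "bambú"]

model = SentenceTransformer("sentence-transformers/LaBSE")
src_emb = model.encode(src_terms, convert_to_tensor=True)
tgt_emb = model.encode(tgt_terms, convert_to_tensor=True)

# Cosine similarity between every source term and every target term.
sim_matrix = util.cos_sim(src_emb, tgt_emb)

# Pair each source term with its most similar target term.
for i, src_term in enumerate(src_terms):
    j = int(sim_matrix[i].argmax())
    print(src_term, "->", tgt_terms[j], round(float(sim_matrix[i][j]), 4))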


Languages supported

Any language supported by spaCy

Bilingual file formats supported

  • .tmx
  • .mqxliff (memoQ)
  • .mxliff (Phrase, formerly Memsource)
  • .csv
  • .xlsx (Excel)

Basic examples

Run these examples in a Google Colab notebook

Extract terms from a sentence

from tm2tb import TermExtractor

en_sentence = (
    "The giant panda, also known as the panda bear (or simply the panda)"
    " is a bear native to South Central China. It is characterised by its"
    " bold black-and-white coat and rotund body. The name 'giant panda'"
    " is sometimes used to distinguish it from the red panda, a neighboring"
    " musteloid. Though it belongs to the order Carnivora, the giant panda"
    " is a folivore, with bamboo shoots and leaves making up more than 99%"
    " of its diet. Giant pandas in the wild will occasionally eat other grasses,"
    " wild tubers, or even meat in the form of birds, rodents, or carrion."
    " In captivity, they may receive honey, eggs, fish, shrub leaves,"
    " oranges, or bananas."
    )
>>> extractor = TermExtractor(en_sentence)  # Instantiate extractor with sentence
>>> terms = extractor.extract_terms()       # Extract terms
>>> print(terms[:10])

            term        pos_tags    rank  frequency
0          panda          [NOUN]  1.0000          6
1    giant panda     [ADJ, NOUN]  0.9462          3
2     panda bear    [NOUN, NOUN]  0.9172          1
3      red panda     [ADJ, NOUN]  0.9152          1
4      Carnivora         [PROPN]  0.6157          1
5  Central China  [PROPN, PROPN]  0.5306          1
6        bananas          [NOUN]  0.4813          1
7        rodents          [NOUN]  0.4218          1
8        Central         [PROPN]  0.3930          1
9    bear native     [NOUN, ADJ]  0.3695          1

We can get terms in other languages as well (the language is detected automatically):

es_sentence = (
    "El panda gigante, también conocido como oso panda (o simplemente panda),"
    " es un oso originario del centro-sur de China. Se caracteriza por su"
    " llamativo pelaje blanco y negro, y su cuerpo rotundo. El nombre 'panda"
    " gigante' se usa en ocasiones para distinguirlo del panda rojo, un"
    " mustélido parecido. Aunque pertenece al orden de los carnívoros, el panda"
    " gigante es folívoro, y más del 99 % de su dieta consiste en brotes y"
    " hojas de bambú. En la naturaleza, los pandas gigantes comen ocasionalmente"
    " otras hierbas, tubérculos silvestres o incluso carne de aves, roedores o"
    " carroña. En cautividad, pueden alimentarse de miel, huevos, pescado, hojas"
    " de arbustos, naranjas o plátanos."
    )
>>> extractor = TermExtractor(es_sentence)  # Instantiate extractor with sentence
>>> terms = extractor.extract_terms()       # Extract terms
>>> print(terms[:10])

                    term        pos_tags    rank  frequency
0          panda gigante    [PROPN, ADJ]  1.0000          3
1             panda rojo  [PROPN, PROPN]  0.9023          1
2                  panda         [PROPN]  0.9013          6
3              oso panda  [PROPN, PROPN]  0.7563          1
4                gigante           [ADJ]  0.4877          3
5               roedores          [NOUN]  0.4641          1
6               plátanos          [NOUN]  0.4434          1
7          pelaje blanco     [NOUN, ADJ]  0.3851          1
8  tubérculos silvestres     [NOUN, ADJ]  0.3722          1
9                  bambú         [PROPN]  0.3704          1

Extracting terms from pairs of sentences

Extract and match source & target terms:

>>> from tm2tb import BitermExtractor

>>> extractor = BitermExtractor((en_sentence, es_sentence)) # Instantiate extractor with sentences
>>> biterms = extractor.extract_terms()                     # Extract biterms
>>> print(biterms[:7])

      src_term        tgt_term  similarity  frequency  biterm_rank
0  giant panda   panda gigante      0.9758          1       1.0000
1    red panda      panda rojo      0.9807          1       0.9385
2        panda           panda      1.0000          1       0.7008
3      oranges        naranjas      0.9387          1       0.3106
4       bamboo           bambú      0.9237          1       0.2911
5        China           China      1.0000          1       0.2550
6  rotund body  cuerpo rotundo      0.9479          1       0.2229

Extracting terms from bilingual documents

Extract and match source & target terms from a bilingual document. For example, take this bitext:

                                                 src                                                trg
0   The giant panda also known as the panda bear (...  El panda gigante, también conocido como oso pa...
1   It is characterised by its bold black-and-whit...  Se caracteriza por su llamativo pelaje blanco ...
2   The name "giant panda" is sometimes used to di...  El nombre "panda gigante" se usa a veces para ...
3   Though it belongs to the order Carnivora, the ...  Aunque pertenece al orden Carnivora, el panda ...
4   Giant pandas in the wild will occasionally eat...  En la naturaleza, los pandas gigantes comen oc...
5   In captivity, they may receive honey, eggs, fi...  En cautiverio, pueden alimentarse de miel, hue...
6   The giant panda lives in a few mountain ranges...  El panda gigante vive en algunas cadenas monta...
7   As a result of farming, deforestation, and oth...  Como resultado de la agricultura, la deforesta...
8   For many decades, the precise taxonomic classi...  Durante muchas décadas, se debatió la clasific...
9   However, molecular studies indicate the giant ...  Sin embargo, los estudios moleculares indican ...
10  These studies show it diverged about 19 millio...  Estos estudios muestran que hace unos 19 millo...
11  The giant panda has been referred to as a livi...  Se ha hecho referencia al panda gigante como u...

>>> from tm2tb import BitextReader
>>> path = 'tests/panda_bear_english_spanish.csv'
>>> bitext = BitextReader(path).read_bitext()   # Read bitext
>>> extractor = BitermExtractor(bitext)         # Instantiate extractor with bitext
>>> biterms = extractor.extract_terms()         # Extract terms
>>> print(biterms[:10])

                   src_term                  tgt_term  similarity  frequency  biterm_rank
0               giant panda             panda gigante      0.9758          8       1.0000
1                     panda                     panda      1.0000          8       0.5966
2                 red panda                panda rojo      0.9807          1       0.1203
3                   Ursidae                   Ursidae      1.0000          2       0.0829
4             prepared food      alimentos preparados      0.9623          1       0.0735
5                     China                     China      1.0000          2       0.0648
6            family Ursidae           familia Ursidae      0.9564          1       0.0632
7  taxonomic classification  clasificación taxonómica      0.9543          1       0.0629
8           common ancestor            ancestro común      0.9478          1       0.0607
9           characteristics           características      0.9885          1       0.0540

More examples with options

Select the terms' length

Select the minimum and maximum length of the terms (in tokens):

>>> extractor = TermExtractor(en_sentence)  
>>> terms = extractor.extract_terms(span_range=(2,3))
>>> print(terms[:10])

                  term               pos_tags    rank  frequency
0          giant panda            [ADJ, NOUN]  1.0000          3
1           panda bear           [NOUN, NOUN]  0.9693          1
2            red panda            [ADJ, NOUN]  0.9672          1
3  South Central China  [PROPN, PROPN, PROPN]  0.5647          1
4          bear native            [NOUN, ADJ]  0.3905          1
5        South Central         [PROPN, PROPN]  0.3902          1
6      order Carnivora          [NOUN, PROPN]  0.3504          1
7          wild tubers            [ADJ, NOUN]  0.3053          1
8        other grasses            [ADJ, NOUN]  0.2503          1
9           bold black             [ADJ, ADJ]  0.1845          1

Use Part-of-Speech tags

Pass a list of part-of-speech tags to restrict the selection of terms.

For example, get only adjectives and nouns:

>>> extractor = TermExtractor(en_sentence)
>>> terms = extractor.extract_terms(incl_pos=['ADJ', 'NOUN'])
>>> print(terms[:10])

            term      pos_tags    rank  frequency
0          panda        [NOUN]  1.0000          6
1    giant panda   [ADJ, NOUN]  0.9462          3
2     panda bear  [NOUN, NOUN]  0.9172          1
3      red panda   [ADJ, NOUN]  0.9152          1
4        bananas        [NOUN]  0.4813          1
5        rodents        [NOUN]  0.4218          1
6    bear native   [NOUN, ADJ]  0.3695          1
7    wild tubers   [ADJ, NOUN]  0.2889          1
8        oranges        [NOUN]  0.2723          1
9  other grasses   [ADJ, NOUN]  0.2368          1

Do the same for bilingual term extraction:

>>> extractor = BitermExtractor((en_sentence, es_sentence))
>>> biterms = extractor.extract_terms(span_range=(2,3))
>>> print(biterms[:10])

      src_term     src_tags  src_rank  ... similarity frequency  biterm_rank
0  giant panda  [ADJ, NOUN]    1.0000  ...     0.9758         1       1.0000
1    red panda  [ADJ, NOUN]    0.9672  ...     0.9807         1       0.9394
2  rotund body  [ADJ, NOUN]    0.1347  ...     0.9479         1       0.2204

>>> extractor = BitermExtractor((en_sentence, es_sentence))
>>> biterms = extractor.extract_terms(incl_pos=['ADJ', 'NOUN'])
>>> print(biterms[:10])

      src_term     src_tags  src_rank  ... similarity frequency  biterm_rank
0  giant panda  [ADJ, NOUN]    0.9462  ...     0.9140         1       1.0000
1        panda       [NOUN]    1.0000  ...     1.0000         1       0.7588
2      oranges       [NOUN]    0.2723  ...     0.9387         1       0.3372
3  rotund body  [ADJ, NOUN]    0.1274  ...     0.9479         1       0.2431

Installation on Linux

  1. Install pipenv and create a virtual environment:

pip install pipenv

pipenv shell

  2. Clone the repository:

git clone https://github.com/luismond/tm2tb

  3. Install the requirements:

pipenv install

This will install the following libraries:

pip==22.1.2
setuptools==62.6.0
wheel==0.37.1
langdetect==1.0.9
pandas==1.4.3
xmltodict==0.12.0
openpyxl==3.0.9
sentence-transformers==2.2.2
tokenizers==0.12.1
spacy==3.3.0

Also, the following spaCy models will be downloaded and installed:

en_core_web_md-3.3.0
es_core_news_md-3.3.0
fr_core_news_md-3.3.0
de_core_news_md-3.3.0
pt_core_news_md-3.3.0
it_core_news_md-3.3.0

spaCy models

By default, tm2tb includes 6 medium spaCy language models, for English, Spanish, German, French, Portuguese, and Italian.

If they are too large for your environment, you can download smaller models, but the Part-of-Speech tagging accuracy will be lower.
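
For example, assuming you want the small English model, it can be installed with spaCy's standard download command:

python -m spacy download en_core_web_sm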

To add more languages, add them to tm2tb/spacy_models.py.

Check the available spaCy language models at https://spacy.io/models.

Sentence transformer models

tm2tb is compatible with the following multilingual models:

  • LaBSE (best model for translated phrase mining, but please note it is almost 2 GB in size)
  • setu4993/smaller-LaBSE (a smaller LaBSE model that supports only 15 languages)
  • distiluse-base-multilingual-cased-v1
  • distiluse-base-multilingual-cased-v2
  • paraphrase-multilingual-MiniLM-L12-v2
  • paraphrase-multilingual-mpnet-base-v2

Please note that there is always a trade-off between speed and accuracy.

  • Smaller models will be faster, but less accurate.
  • Larger models will be slower, but more accurate.

tm2tb.com

A tm2tb web app is available here: www.tm2tb.com

  • Extract biterms from bilingual documents and sentences (file size limit: 2 MB)

Maintainer

Luis Mondragón

License

tm2tb is released under the GNU General Public License v3.0

Credits

Libraries

  • spaCy: Tokenization, Part-of-Speech tagging
  • sentence-transformers: sentence and term embeddings
  • xmltodict: parsing of XML file formats (.mqxliff, .mxliff, .tmx, etc.)

Other credits:

  • KeyBERT: tm2tb takes two concepts from KeyBERT (a toy sketch of the second follows below):
      • Term-to-sentence similarity
      • Maximal Marginal Relevance
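
For reference, a toy sketch of the Maximal Marginal Relevance idea, assuming precomputed similarity arrays. This is neither KeyBERT's nor tm2tb's exact code:

import numpy as np

def mmr(doc_sim, term_sims, top_n=10, diversity=0.5):
    # doc_sim: 1D array, similarity of each candidate term to the document.
    # term_sims: 2D array, pairwise similarities between candidate terms.
    selected = [int(np.argmax(doc_sim))]  # start with the most relevant term
    candidates = [i for i in range(len(doc_sim)) if i != selected[0]]
    while candidates and len(selected) < top_n:
        # Relevance to the document, penalized by redundancy with the selection.
        redundancy = term_sims[np.ix_(candidates, selected)].max(axis=1)
        scores = (1 - diversity) * doc_sim[candidates] - diversity * redundancy
        best = candidates[int(np.argmax(scores))]
        selected.append(best)
        candidates.remove(best)
    return selected  # indices of the chosen terms, most relevant first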


tm2tb's Issues

Truncated results

Hey :) I love this project, but I'm not really skilled with Python, so this might sound like a beginner question.

If I set print(biterms[:200]) to a high number like 200, it seems to truncate the output, skipping a bunch of lines:

          src_term       src_tags  src_rank           trg_term      trg_tags  trg_rank  similarity  frequency  biterm_rank
0       enemy Clan  [NOUN, PROPN]    0.6490        clan ennemi  [NOUN, NOUN]    0.5855      0.9662         16       0.6448
1          Archers        [PROPN]    0.4446            archers        [NOUN]    0.4693      0.9096         28       0.6024
2             hunt         [NOUN]    0.4017             chasse        [NOUN]    0.4042      0.9168          6       0.5911
3         Warriors         [NOUN]    0.2542          guerriers        [NOUN]    0.5313      0.9256         32       0.5899
4          attacks         [NOUN]    0.2252           attaques        [NOUN]    0.5065      0.9658          6       0.5872
..             ...            ...       ...                ...           ...       ...         ...        ...          ...
195  bonus rewards   [NOUN, NOUN]    0.1408  récompenses bonus  [NOUN, NOUN]    0.3058      0.9771          1       0.5398
196         player         [NOUN]    0.1561             joueur        [NOUN]    0.1726      0.9684         17       0.5397
197          level        [PROPN]    0.1620             niveau        [NOUN]    0.1600      0.9875         21       0.5397

[200 rows x 9 columns]

Also, is there any way to save all lines to a file, or even better, directly to a CSV? At the moment I'm just saving the print output to a txt file via:

from contextlib import redirect_stdout

with open('out.txt', 'w') as f:
    with redirect_stdout(f):
        print(biterms[:30])
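
Assuming biterms is a pandas DataFrame (the "[200 rows x 9 columns]" footer suggests it is), the truncation happens only in pandas' printed display, and the full table can be written out directly:

import pandas as pd

# The truncation is only cosmetic; the DataFrame itself holds every row.
pd.set_option('display.max_rows', None)  # print all rows
print(biterms)

# Or write the full table straight to a CSV file:
biterms.to_csv('biterms.csv', index=False)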

Improve bitext reader

  • Solve bare exceptions
  • Add test data in the supported formats

-- .xlsx: use an Office account to create it
-- .mqxliff: reinstall memoQ, import data, export .mqxliff, .tmx
-- .mxliff: use an account, load data, export

Local dictionaries

For those who cannot send data to a third-party API for data confidentiality reasons, it would be cool to use local dictionaries, like Hunspell (which is complete enough to power OS/browser spellchecking).

Biterm extractor fails with "IndexError: list index out of range"

Hi! In the Google Colab notebook, running the biterm extractor on a file above a certain size fails as follows:

[Screenshot: traceback ending in "IndexError: list index out of range", 2024-02-23]

The test_bitext_en_es.tmx test file you supplied works fine. If I truncate my test TMX to less than about 300 lines, it also works fine.

I also tested this on Windows with Python 3.10, 3.11, and 3.12 and the result is the same -- less than 300ish lines works, above 300ish lines fails.

Thanks and love your work!

'Tm2Tb' object has no attribute 'get_ngrams'

Hi there, I was trying to use your tool in Jupyter Lab and got this error, despite using a virtual environment and downloading all the requirements. Any ideas about what's happening here?

App doesn't seem to work - missing 'uploads' directory as well

Hi @luismond ,

Thank you so much for making this available.

I tried to run the app, and it complained about a missing 'uploads' directory. I created it, but the app still throws an error.
Any idea how to fix this?

By the way, I was testing it using the de-en csv file you provided.

This is what the browser shows:

Internal Server Error
The server encountered an internal error and was unable to complete your request. Either the server is overloaded or there is an error in the application.

Here is the console output:

127.0.0.1 - - [25/Mar/2021 05:41:39] "GET /favicon.ico HTTP/1.1" 500 -
tm len: 100
fn to df 0.052005767822265625
preproc 0.15700340270996094
detect 2.0500078201293945
detected src lang: en
detected trg lang: de
get tokens 2.051008701324463
grams 2.4290056228637695
remove stops 2.686998128890991
fn to iterzip 2.804997444152832
prepare 1116 tb cands 2.975860595703125
[2021-03-25 05:51:26,829] ERROR in app: Exception on / [POST]
Traceback (most recent call last):
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\_compat.py", line 39, in reraise
    raise value
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "myproject.py", line 47, in post_file
    return prev(filename)
  File "myproject.py", line 59, in prev
    result_html = Markup(tm2tb_main(filename))
  File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\tm2tb.py", line 145, in tm2tb_main
    tb = tb_to_azure(tb, srcdet, trgdet)
  File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\tb_to_azure.py", line 33, in tb_to_azure
    sst_batches_lu = [get_azure_dict_lookup(src_det, trgdet, l) for l in sst_batches]
  File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\tb_to_azure.py", line 33, in <listcomp>
    sst_batches_lu = [get_azure_dict_lookup(src_det, trgdet, l) for l in sst_batches]
  File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\get_azure_dict_lookup.py", line 36, in get_azure_dict_lookup
    targets = [get_normalizedTargets(d) for d in response]
  File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\get_azure_dict_lookup.py", line 36, in <listcomp>
    targets = [get_normalizedTargets(d) for d in response]
  File "C:\Users\admin\Downloads\Programs\tm2tb_flask_terminology_extraction-main\tm2tb-main\lib\get_azure_dict_lookup.py", line 12, in get_normalizedTargets
    targets = [t['normalizedTarget'] for t in d['translations']]
TypeError: string indices must be integers
127.0.0.1 - - [25/Mar/2021 05:51:26] "POST / HTTP/1.1" 500 -
[2021-03-25 05:51:27,556] ERROR in app: Exception on /favicon.ico [GET]
Traceback (most recent call last):
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 2447, in wsgi_app
    response = self.full_dispatch_request()
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1952, in full_dispatch_request
    rv = self.handle_user_exception(e)
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1821, in handle_user_exception
    reraise(exc_type, exc_value, tb)
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\_compat.py", line 39, in reraise
    raise value
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1950, in full_dispatch_request
    rv = self.dispatch_request()
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 1936, in dispatch_request
    return self.view_functions[rule.endpoint](**req.view_args)
  File "myproject.py", line 69, in get_file
    redirect(url_for('uploaded_file', filename=filename))
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\helpers.py", line 370, in url_for
    return appctx.app.handle_url_build_error(error, endpoint, values)
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\app.py", line 2216, in handle_url_build_error
    reraise(exc_type, exc_value, tb)
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\_compat.py", line 39, in reraise
    raise value
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\flask\helpers.py", line 358, in url_for
    endpoint, values, method=method, force_external=external
  File "C:\Users\admin\Anaconda3\envs\python3.7\lib\site-packages\werkzeug\routing.py", line 2179, in build
    raise BuildError(endpoint, values, method, self)
werkzeug.routing.BuildError: Could not build url for endpoint 'uploaded_file' with values ['filename']. Did you forget to specify values ['filename_tb']?

Feature idea: make it work with non-aligned bilingual data

Problem:

So far, the module can extract terms from parallel, aligned data.
However, a lot of multilingual data out there is not aligned or in a translation file format.
For example, a Wikipedia page about panda bears in English and the same page in Spanish.
The content is similar, but the sentences of both pages are not aligned.

Solution:

Make it work for non-aligned data: extract all source n-grams and all target n-grams, then compare them. Write special filtering and ranking functions if necessary. (A rough sketch of the candidate-generation step follows below.)
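
A rough sketch of the n-gram generation this would need (a hypothetical helper, not part of tm2tb; matching would then proceed by embedding similarity, as in the aligned case):

def ngrams(tokens, n_min=1, n_max=3):
    # Yield all n-grams of length n_min..n_max from a token list.
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            yield " ".join(tokens[i:i + n])

src_candidates = set(ngrams("the giant panda eats bamboo shoots".split()))
tgt_candidates = set(ngrams("el panda gigante come brotes de bambú".split()))
# Every source candidate would then be compared against every target
# candidate by embedding similarity, with filtering and ranking on top.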

PyPI package

Will there be a PyPI package so that other applications can integrate it easily with pip install tm2tb?

How to add stop vectors for new languages?

Hi, thank you for this cool tool! We are trying to extend it with more spaCy languages. Adding the new language models was not an issue. However, when running the extractor, no stop_vectors are found. Any hint on where we can download some or create our own?
Thank you!

Traceback (most recent call last):
  File "/Users/devi/PycharmProjects/tm2tb/de_it_test.py", line 42, in <module>
    terms = extractor.extract_terms(span_range=(1, 3), incl_pos=['ADJ', 'NOUN', 'PROPN', 'ADP'])       # Extract terms
  File "/Users/devi/PycharmProjects/tm2tb/tm2tb/term_extractor.py", line 114, in extract_terms
    stops_embeddings_avg = np.load(os.path.join(file_dir, '..', 'stops_vectors', str(self.emb_dims), f'{self.lang}.npy'))
  File "/Users/devi/PycharmProjects/tm2tb/venv/lib/python3.9/site-packages/numpy/lib/npyio.py", line 390, in load
    fid = stack.enter_context(open(os_fspath(file), "rb"))
FileNotFoundError: [Errno 2] No such file or directory: '/Users/devi/PycharmProjects/tm2tb/tm2tb/../stops_vectors/768/it.npy'

Process finished with exit code 1

Value error when max_stopword_similarity too low in extract_terms method

When the max_stopword_similarity value passed to the extract_terms method is too low, e.g. 0.10, no terms may be found at all. This results in the following error being raised in term_extractor.py, line 124.

raise ValueError(
ValueError: Expected 2D array, got 1D array instead:
array=[].
Reshape your data either using array.reshape(-1, 1) if your data has a single feature or array.reshape(1, -1) if it contains a single sample.

Suggestion:
Check if top_spans actually contains any term candidate by wrapping lines 124-132 in an if condition (the outer if is the added line):

        if len(top_spans) > 0:
            if collapse_similarity is True:
                top_spans = self._collapse_similarity(top_spans)

            for i, span in enumerate(top_spans):
                span._.span_id = i
            top_spans = sorted(top_spans, key=lambda span: span._.span_id)

            if return_as_table is True:
                top_spans = self._return_as_table(top_spans)
        return top_spans

Does this make sense to you?

Loading all SpaCy models slows down extraction

When loading the spaCy models, all models are loaded even if they are not used. See spacy_models.py, line 19 and following.
When adding more languages or using lg models, this might become a bottleneck and slow down the extraction process significantly.

Suggestion:
Check which language is requested and load only the required model, e.g. by changing line 58 and the following to (and removing spacy_model = spacy_models[lang]):

        if lang == 'de':
            spacy_model = de_core_news_md.load()
        elif lang == 'en':
            spacy_model = en_core_web_md.load()
        ...

Then you would also be able to remove lines 19-26.
What do you think? (An alternative lazy-loading sketch follows below.)
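
An alternative sketch that keeps a mapping but defers loading until a language is actually requested, assuming the models are installed as Python packages, as in the README:

import importlib

# Map language codes to installed spaCy model package names.
SPACY_MODEL_NAMES = {
    'en': 'en_core_web_md',
    'es': 'es_core_news_md',
    'fr': 'fr_core_news_md',
    'de': 'de_core_news_md',
    'pt': 'pt_core_news_md',
    'it': 'it_core_news_md',
}

_loaded_models = {}  # cache: each model is imported and loaded at most once

def get_spacy_model(lang):
    # Import and load the spaCy model for `lang` on first use only.
    if lang not in _loaded_models:
        module = importlib.import_module(SPACY_MODEL_NAMES[lang])
        _loaded_models[lang] = module.load()
    return _loaded_models[lang]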

Would sentence alignment be in the scope of the module?

Problem:

Due to segmentation issues or configuration, many bilingual documents have long paragraphs in each segment. It would be nice if we could split the paragraphs and align the sentences within them. Simple rules like splitting on newlines or using regexes wouldn't work; it would be necessary to align the sentences using similarity. Could tm2tb do this?
