CLTK Readers

A corpus-reader extension for CLTK

Version 0.6.7; tested on Python 3.10.8, CLTK 1.1.5, and LatinCy 3.7.2

Installation

pip install -e git+https://github.com/diyclassics/cltk_readers.git#egg=cltk_readers

Usage

>>> from cltkreaders.lat import LatinTesseraeCorpusReader
>>> tess = LatinTesseraeCorpusReader()
>>> print(tess.fileids())
['ammianus.rerum_gestarum.part.14.tess', 'ammianus.rerum_gestarum.part.15.tess', 'ammianus.rerum_gestarum.part.16.tess', 'ammianus.rerum_gestarum.part.17.tess', ...]
>>> print(next(tess.tokenized_sents('vergil.aeneid.part.1.tess', simple=True)))
['Arma', 'virumque', 'cano', ',', 'Troiae', 'qui', 'primus', 'ab', 'oris', 'Italiam', ',', 'fato', 'profugus', ',', 'Laviniaque', 'venit', 'litora', ',', 'multum', 'ille', 'et', 'terris', 'iactatus', 'et', 'alto', 'vi', 'superum', 'saevae', 'memorem', 'Iunonis', 'ob', 'iram', ';']

Corpora supported (so far!)

Change log

  • 0.6.7: Add no annotations parameter to spacy_docs for LatinTesseraeCorpusReader
  • 0.6.6: Add root parameter to LatinTesseraeCorpusReader
  • 0.6.5: Add fileid selector support for pipe (|) delimited metadata
  • 0.6.4: Bump spaCy version
  • 0.6.3: Update fileid selector for Greek corpus readers
  • 0.6.2: Add LatinCy support for LatinPerseusCorpusReader
  • 0.6.1: Miscellaneous fixes to reader, fileid selector
  • 0.6.0: Introduce metadata-based fileid selector
  • 0.5.6: Bump spaCy version
  • 0.5.5: Update CSEL reader; Update spaCy dependency to LatinCy lg model
  • 0.5.4: Update spaCy dependency to LatinCy md model
  • 0.5.3: Update spaCy dependency to md model
  • 0.5.2: Minor fixes
  • 0.5.1: Fix spaCy model installation
  • 0.5.0: Update packaging for PyPI
  • 0.4.6: Add simple parameter to Tesserae tokenized_sents; add pos_sents to Tesserae; update demo notebook
  • 0.4.5: Update spaCy dependency to la_dep_cltk_sm-0.2.0
  • 0.4.4: Add support for Camena
  • 0.4.3: Add support for Open Greek & Latin CSEL files
  • 0.4.2: Update lxml; also update spaCy dependency (now to main spaCy project, as of v. 3.4.2)
  • 0.4.1: Update spaCy dependency
  • 0.4.0: Add support for Latin Library (and similar plaintext collections)
  • 0.3.0: Add support for Perseus-style TEI/XML files; add Latin spaCy support for lemmatization and POS tagging
  • 0.2.4: Add support for Universal Dependencies files
  • 0.2.3: Add support for Perseus AGLDT Treebanks

Coded 2022-2023 by Patrick J. Burns

cltk_readers's People

Contributors

clemsciences, diyclassics

cltk_readers's Issues

Misc. issues from my installation

I have a PR coming with some small things. But here are a few bigger errors that I found.

>>> tess.describe()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kylejohnson/cltk_readers/cltkreaders/readers.py", line 241, in describe
    for sent in self.sents(fileids):
  File "/Users/kylejohnson/cltk_readers/cltkreaders/readers.py", line 145, in sents
    for text in self.texts(fileids):
  File "/Users/kylejohnson/cltk_readers/cltkreaders/readers.py", line 131, in texts
    for doc_row in self.doc_rows(fileids):
  File "/Users/kylejohnson/cltk_readers/cltkreaders/readers.py", line 119, in doc_rows
    k, v = line.split('>', 1)
ValueError: not enough values to unpack (expected 2, got 1)
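
The `ValueError` above means `line.split('>', 1)` returned a single element, which happens whenever a line contains no `>` (for example a blank or stray line). A minimal sketch of a more tolerant parse follows. `parse_tess_line` is a hypothetical helper, not part of cltkreaders, and the sample line assumes the usual Tesserae layout of `<citation>` followed by the text.

```python
from typing import Optional, Tuple


def parse_tess_line(line: str) -> Optional[Tuple[str, str]]:
    """Split a Tesserae-style line into (citation, text).

    Hypothetical helper for illustration only; returns None instead of
    raising when the line has no '>' separator.
    """
    # str.partition always returns three parts, so it cannot trigger the
    # "not enough values to unpack" error seen in the traceback.
    key, sep, value = line.partition(">")
    if not sep:
        return None
    return key.lstrip("<").strip(), value.strip()


# Assumed sample input: one well-formed line, one blank line, one stray line.
lines = ["<verg. aen. 1.1>\tArma virumque cano", "", "stray text"]
rows = [row for row in map(parse_tess_line, lines) if row is not None]
print(rows)
```

Skipping malformed lines silently is only one option; logging them with the fileid would make it easier to find which `.tess` file is at fault.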

Returns nothing:

>>> tess.check_corpus()

The filepath is wrong here. But where is this code defined? I couldn't find it.

>>> tess.citation()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/corpus/reader/api.py", line 151, in citation
    with self.open(self._citation) as f:
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/corpus/reader/api.py", line 231, in open
    stream = self._root.join(file).open(encoding)
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/data.py", line 334, in join
    return FileSystemPathPointer(_path)
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/data.py", line 312, in __init__
    raise OSError("No such file or directory: %r" % _path)
OSError: No such file or directory: '/Users/kylejohnson/cltk_data/lat/text/lat_text_tesserae/texts/citation.bib'
>>> tess.license()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/corpus/reader/api.py", line 144, in license
    with self.open(self._license) as f:
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/corpus/reader/api.py", line 231, in open
    stream = self._root.join(file).open(encoding)
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/data.py", line 334, in join
    return FileSystemPathPointer(_path)
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/compat.py", line 41, in _decorator
    return init_func(*args, **kwargs)
  File "/Users/kylejohnson/.pyenv/versions/cltkReaders/lib/python3.9/site-packages/nltk/data.py", line 312, in __init__
    raise OSError("No such file or directory: %r" % _path)
OSError: No such file or directory: '/Users/kylejohnson/cltk_data/lat/text/lat_text_tesserae/texts/LICENSE'
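
Both tracebacks come from NLTK's `citation()` and `license()` trying to open `citation.bib` and `LICENSE` under the corpus root. The sketch below checks which of those files are present before calling either method. `missing_corpus_files` is a hypothetical helper, the file names are taken from the error messages above, and the root path is an assumption to adjust for your own `cltk_data` location.

```python
from pathlib import Path
from typing import Iterable, List


def missing_corpus_files(
    root: str, names: Iterable[str] = ("citation.bib", "LICENSE")
) -> List[str]:
    """Return the names that do not exist as files under root."""
    root_path = Path(root).expanduser()
    return [name for name in names if not (root_path / name).is_file()]


# Directory taken from the OSError messages, with the home directory
# abbreviated to "~"; adjust to your own setup.
root = "~/cltk_data/lat/text/lat_text_tesserae/texts"
print(missing_corpus_files(root))
```

An empty list means both calls should find their files; any name that is listed will raise the same `OSError` shown above.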
