
BookNLP, a natural language processing pipeline for books

License: MIT License

Python 100.00%
natural-language-processing cultural-analytics digital-humanities computational-social-science

booknlp's Introduction

BookNLP

BookNLP is a natural language processing pipeline that scales to books and other long documents (in English), including:

  • Part-of-speech tagging
  • Dependency parsing
  • Entity recognition
  • Character name clustering (e.g., "Tom", "Tom Sawyer", "Mr. Sawyer", "Thomas Sawyer" -> TOM_SAWYER) and coreference resolution
  • Quotation speaker identification
  • Supersense tagging (e.g., "animal", "artifact", "body", "cognition", etc.)
  • Event tagging
  • Referential gender inference (TOM_SAWYER -> he/him/his)

BookNLP ships with two models, both with identical architectures but different underlying BERT sizes. The larger and more accurate big model is fit for GPUs and multi-core computers; the faster small model is more appropriate for personal computers. See the table below for a comparison of the two, both in terms of overall speed and in accuracy on the tasks that BookNLP performs.

                                      Small   Big
Entity tagging (F1)                    88.2   90.0
Supersense tagging (F1)                73.2   76.2
Event tagging (F1)                     70.6   74.1
Coreference resolution (Avg. F1)       76.4   79.0
Speaker attribution (B3)               86.4   89.9
CPU time, 2019 MacBook Pro (mins.)*     3.6   15.4
CPU time, 10-core server (mins.)*       2.4    5.2
GPU time, Titan RTX (mins.)*            2.1    2.2

*Timings measure the speed of running BookNLP on a sample book, The Secret Garden (99K tokens). To explore running BookNLP in Google Colab on a GPU, see this notebook.

Installation

  • Create and activate a fresh environment:

conda create --name booknlp python=3.7
conda activate booknlp

  • If using a GPU, install pytorch for your system and CUDA version by following the installation instructions on https://pytorch.org.

  • Install booknlp and download the Spacy model:

pip install booknlp
python -m spacy download en_core_web_sm
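
As a quick sanity check that the install succeeded, the import below should run without error (a minimal test; the BookNLP models themselves are downloaded later, on first instantiation):

python -c "from booknlp.booknlp import BookNLP"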

Usage

from booknlp.booknlp import BookNLP

model_params={
    "pipeline":"entity,quote,supersense,event,coref",
    "model":"big"
}

booknlp=BookNLP("en", model_params)

# Input file to process
input_file="input_dir/bartleby_the_scrivener.txt"

# Output directory to store resulting files in
output_directory="output_dir/bartleby/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id="bartleby"

booknlp.process(input_file, output_directory, book_id)

This runs the full BookNLP pipeline. To cut down on computation time, you can run only some elements of the pipeline by specifying them in the pipeline parameter (e.g., to run only entity tagging and event tagging, change model_params above to include "pipeline":"entity,event"), as in the sketch below.
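
For example, a minimal sketch of an entity-and-event-only run (the same API as above; only the pipeline string changes):

from booknlp.booknlp import BookNLP

# Run only the entity tagger and event tagger to reduce computation time
model_params = {
    "pipeline": "entity,event",
    "model": "big"
}

booknlp = BookNLP("en", model_params)
booknlp.process("input_dir/bartleby_the_scrivener.txt", "output_dir/bartleby/", "bartleby")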

This process creates the directory output_dir/bartleby and generates the following files:

  • bartleby/bartleby.tokens -- This encodes core word-level information. Each row corresponds to one token and includes the following information:

    • paragraph ID
    • sentence ID
    • token ID within sentence
    • token ID within document
    • word
    • lemma
    • byte onset within original document
    • byte offset within original document
    • POS tag
    • dependency relation
    • token ID within document of syntactic head
    • event
  • bartleby/bartleby.entities -- This represents the typed entities within the document (e.g., people and places), along with their coreference.

    • coreference ID (unique entity ID)
    • start token ID within document
    • end token ID within document
    • NOM (nominal), PROP (proper), or PRON (pronoun)
    • PER (person), LOC (location), FAC (facility), GPE (geo-political entity), VEH (vehicle), ORG (organization)
    • text of entity
  • bartleby/bartleby.supersense -- This stores information from supersense tagging.

    • start token ID within document
    • end token ID within document
    • supersense category (verb.cognition, verb.communication, noun.artifact, etc.)
  • bartleby/bartleby.quotes -- This stores information about the quotations in the document, along with the speaker. In a sentence like "'Yes', she said", where she -> ELIZABETH_BENNETT, "she" is the attributed mention of the quotation 'Yes', and is coreferent with the unique entity ELIZABETH_BENNETT.

    • start token ID within document of quotation
    • end token ID within document of quotation
    • start token ID within document of attributed mention
    • end token ID within document of attributed mention
    • attributed mention text
    • coreference ID (unique entity ID) of attributed mention
    • quotation text
  • bartleby/bartleby.book

JSON file providing information about all characters mentioned more than once in the book, including their proper/common/pronominal references, referential gender, actions for which they are the agent and patient, objects they possess, and modifiers.

  • bartleby/bartleby.book.html

HTML file containing a.) the full text of the book along with annotations for entities, coreference, and speaker attribution and b.) a list of the named characters and major entity categories (FAC, GPE, LOC, etc.).
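
Since these output files are tab-separated, they can be loaded with standard tools. Below is a minimal sketch using pandas, assuming the paths from the example above; the column names used here (e.g., COREF) are assumptions based on the field lists above, so check the header row of your own files first:

import pandas as pd

# Word-level annotations; quoting=3 (csv.QUOTE_NONE) keeps literary
# quotation marks from being parsed as CSV quote characters
tokens = pd.read_csv("output_dir/bartleby/bartleby.tokens", sep="\t", quoting=3)

# Entity mentions with their coreference IDs
entities = pd.read_csv("output_dir/bartleby/bartleby.entities", sep="\t", quoting=3)

# e.g., the ten most frequently mentioned entities (assuming a COREF column)
print(entities.groupby("COREF").size().sort_values(ascending=False).head(10))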

Annotations

Entity annotations

The entity annotation layer covers six of the ACE 2005 categories in text:

  • People (PER): Tom Sawyer, her daughter
  • Facilities (FAC): the house, the kitchen
  • Geo-political entities (GPE): London, the village
  • Locations (LOC): the forest, the river
  • Vehicles (VEH): the ship, the car
  • Organizations (ORG): the army, the Church

The targets of annotation here include named entities (e.g., Tom Sawyer), common entities (the boy), and pronouns (he). These entities can be nested, as in the following:


For more, see: David Bamman, Sejal Popat and Sheng Shen, "An Annotated Dataset of Literary Entities," NAACL 2019.

The entity tagging model within BookNLP is trained on an annotated dataset of 968K tokens, including the public domain materials in LitBank and a new dataset of ~500 contemporary books, including bestsellers, Pulitzer Prize winners, works by Black authors, global Anglophone books, and genre fiction (article forthcoming).

Event annotations

The event layer identifies events with asserted realis (depicted as actually taking place, with specific participants at a specific time) -- as opposed to events with other epistemic modalities (hypotheticals, future events, extradiegetic summaries by the narrator).

Examples, with the tagged events in braces:

  • My father’s eyes had closed upon the light of this world six months, when mine opened on it. -> {closed, opened} (Dickens, David Copperfield)
  • Call me Ishmael. -> {} (Melville, Moby Dick)
  • His sister was a tall, strong girl, and she walked rapidly and resolutely, as if she knew exactly where she was going and what she was going to do next. -> {walked} (Cather, O Pioneers!)

For more, see: Matt Sims, Jong Ho Park and David Bamman, "Literary Event Detection," ACL 2019.

The event tagging model is trained on event annotations within LitBank. The small model above makes use of a distillation process, training on the predictions made by the big model for a collection of contemporary texts.

Supersense tagging

Supersense tagging provides coarse semantic information for a sentence by tagging spans with 41 lexical semantic categories drawn from WordNet, spanning both nouns (including plant, animal, food, feeling, and artifact) and verbs (including cognition, communication, motion, etc.).

Example:

  • The [station wagons]artifact [arrived]motion at [noon]time, a long shining [line]group that [coursed]motion through the [west campus]location. (DeLillo, White Noise)

The BookNLP tagger is trained on SemCor.


Character name clustering and coreference

The coreference layer covers the six ACE entity categories outlined above (people, facilities, locations, geo-political entities, organizations and vehicles) and is trained on LitBank and PreCo.

Example:

  • One may as well begin with [Helen]x's letters to [[her]x sister]y (Forster, Howards End)

Accurate coreference at the scale of a book-length document is still an open research problem, and attempting full coreference -- where any named entity (Elizabeth), common entity (her sister, his daughter) and pronoun (she) can corefer -- tends to erroneously conflate multiple distinct entities into one. By default, BookNLP addresses this by first carrying out character name clustering (grouping "Tom", "Tom Sawyer" and "Mr. Sawyer" into a single entity), and then allowing pronouns to corefer with either named entities (Tom) or common entities (the boy), but disallowing common entities from co-referring to named entities. To turn off this mode and carry out full coreference, add pronominalCorefOnly=False to the model_params parameters dictionary above (but be sure to inspect the output!), as in the sketch below.
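
A minimal sketch of enabling full coreference with the parameter named above:

from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
    # allow common entities to corefer with named entities -- inspect the output!
    "pronominalCorefOnly": False
}

booknlp = BookNLP("en", model_params)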

For more on the coreference criteria used in this work, see David Bamman, Olivia Lewke and Anya Mansoor (2020), "An Annotated Dataset of Coreference in English Literature", LREC.

Referential gender inference

BookNLP infers the referential gender of characters by associating them with the pronouns (he/him/his, she/her, they/them, xe/xem/xyr/xir, etc.) used to refer to them in the context of the story. This method encodes several assumptions:

  • BookNLP describes the referential gender of characters, and not their gender identity. Characters are described by the pronouns used to refer to them (e.g., he/him, she/her) rather than labels like "M/F".

  • Prior information on the alignment of names with referential gender (e.g., from government records or larger background datasets) can be used to inform this process if desired (e.g., "Tom" is often associated with he/him in pre-1923 English texts). Name information, however, should not be uniquely determinative, but rather should be sensitive to the context in which it is used (e.g., "Tom" in the book "Tom and Some Other Girls", where Tom is aligned with she/her). By default, BookNLP uses prior information on the alignment of proper names and honorifics with pronouns drawn from ~15K works from Project Gutenberg; this prior information can be ignored by setting referential_gender_hyperparameterFile:None in the model_params dictionary. Alternative priors can be used by passing the pathname of a prior file (in the same format as english/data/gutenberg_prop_gender_terms.txt) to this parameter.

  • Users should be free to define the referential gender categories used here. The default set of categories is {he, him, his}, {she, her}, {they, them, their}, {xe, xem, xyr, xir}, and {ze, zem, zir, hir}. To specify a different set of categories, update the referential_gender_cats setting in model_params to define them, as in the sketch below: referential_gender_cats: [ ["he", "him", "his"], ["she", "her"], ["they", "them", "their"], ["xe", "xem", "xyr", "xir"], ["ze", "zem", "zir", "hir"] ]
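
Putting the two settings above together, a sketch of a model_params dictionary that ignores the Gutenberg priors and spells out the default category set explicitly (parameter names as documented above):

from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big",
    # skip the prior alignment of proper names/honorifics with pronouns
    "referential_gender_hyperparameterFile": None,
    # referential gender categories (the defaults, written out explicitly)
    "referential_gender_cats": [
        ["he", "him", "his"],
        ["she", "her"],
        ["they", "them", "their"],
        ["xe", "xem", "xyr", "xir"],
        ["ze", "zem", "zir", "hir"]
    ]
}

booknlp = BookNLP("en", model_params)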

Speaker attribution

The speaker attribution model identifies all instances of direct speech in the text and attributes each to its speaker.

Examples, with the attributed speaker for each quote:

  • — Come up , Kinch ! Come up , you fearful jesuit ! -> Buck_Mulligan-0 (Joyce, Ulysses)
  • ‘ Oh dear ! Oh dear ! I shall be late ! ’ -> The_White_Rabbit-4 (Carroll, Alice in Wonderland)
  • “ Do n't put your feet up there , Huckleberry ; ” -> Miss_Watson-26 (Twain, Huckleberry Finn)

This model is trained on speaker attribution data in LitBank. For more on the quotation annotations, see this paper.

Part-of-speech tagging and dependency parsing

BookNLP uses Spacy for part-of-speech tagging and dependency parsing.

Acknowledgments

BookNLP is supported by the National Endowment for the Humanities (HAA-271654-20) and the National Science Foundation (IIS-1942591).

booknlp's People

Contributors

dbamman

booknlp's Issues

Trying to run the Google Colab notebook as is, getting inundated with an error

/usr/lib/python3.9/encodings/ascii.py in decode(self, input, final)
     24 class IncrementalDecoder(codecs.IncrementalDecoder):
     25     def decode(self, input, final=False):
---> 26         return codecs.ascii_decode(input, self.errors)[0]
     27
     28 class StreamWriter(Codec,codecs.StreamWriter):

UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 149: ordinal not in range(128)

I've tried:

  • changing the input file to my own
  • resaving my own explicitly as utf-8 encoded
  • reading in my file as a df then saving back to TXT with encoding on

None made a difference, please help.

HuggingFace Validation Error

Hello, I've been using BookNLP on Windows on my laptop. Today, when I tried to install it on another computer of mine, I got this error:

using device cpu
{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'small'}
Traceback (most recent call last):
  File "c:\Users\Dan\Dizertatie\test.py", line 8, in <module>
    booknlp=BookNLP("en", model_params)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\booknlp.py", line 14, in __init__
    self.booknlp=EnglishBookNLP(model_params)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\english\english_booknlp.py", line 148, in __init__
    self.entityTagger=LitBankEntityTagger(self.entityPath, tagsetPath)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\english\entity_tagger.py", line 19, in __init__
    self.model = Tagger(freeze_bert=False, base_model=base_model, tagset_flat={"EVENT":1, "O":1}, supersense_tagset=self.supersense_tagset, tagset=self.tagset, device=device)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\booknlp\english\tagger.py", line 58, in __init__
    self.tokenizer = BertTokenizer.from_pretrained(modelName, do_lower_case=False, do_basic_tokenize=False)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\tokenization_utils_base.py", line 1736, in from_pretrained
    resolved_vocab_files[file_id] = cached_file(
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\transformers\utils\hub.py", line 409, in cached_file
    resolved_file = hf_hub_download(
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_validators.py", line 114, in _inner_fn
    validate_repo_id(arg_value)
  File "C:\Users\Dan\AppData\Local\Programs\Python\Python310\lib\site-packages\huggingface_hub\utils\_validators.py", line 172, in validate_repo_id
    raise HFValidationError(
huggingface_hub.utils._validators.HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'C:\Users\Dan\booknlps\entities_google/bert_uncased_L-4_H-256_A-4'.

Any ideas?

Get non-quotes

Hi,
I saw BookNLP can extract quotes, is there any way to extract non-quotes?
For example:

SPEAKER 1: "Test"
NARRATOR: Non-quote
SPEAKER 2: "Test"

Thank you!

can't pip install on Apple Silicon

This may be an upstream Tensorflow issue, but it doesn't appear to be possible to install BookNLP on Apple Silicon.

I'll continue to investigate but wanted to create this issue here to track and also get input from others who have encountered this.

BookNLP crashes without internet access even when models are already downloaded

I've been using BookNLP for the last couple weeks and love it; thanks for such a great package.

I realized while working in the (wifi-less) subway today that, even though I have the models downloaded, BookNLP crashes without internet access. That's unfortunate, since there are of course many real-life situations in which internet access is impossible.

Here's the error (with internet turned off):

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

Here's the full stack trace:

File ~/github/lltk/lltk/model/booknlp.py:436, in get_booknlp(language, pipeline, model, cache, quiet, **kwargs)
    434 if not key in booknlpd:
    435     from booknlp.booknlp import BookNLP
--> 436     booknlpd[key]=BookNLP(
    437         language=language,
    438         model_params=dict(pipeline=pipeline,model=model)
    439     )
    440 return booknlpd[key]

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/booknlp.py:14, in BookNLP.__init__(self, language, model_params)
     11 def __init__(self, language, model_params):
     13     if language == "en":
---> 14         self.booknlp=EnglishBookNLP(model_params)

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/english/english_booknlp.py:148, in EnglishBookNLP.__init__(self, model_params)
    145 self.quoteTagger=QuoteTagger()
    147 if self.doEntities:
--> 148     self.entityTagger=LitBankEntityTagger(self.entityPath, tagsetPath)
    149     aliasPath = pkg_resources.resource_filename(__name__, "data/aliases.txt")
    150     self.name_resolver=NameCoref(aliasPath)

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/english/entity_tagger.py:19, in LitBankEntityTagger.__init__(self, model_file, model_tagset)
     16 base_model=re.sub("google_bert", "google/bert", model_file.split("/")[-1])
     17 base_model=re.sub(".model", "", base_model)
---> 19 self.model = Tagger(freeze_bert=False, base_model=base_model, tagset_flat={"EVENT":1, "O":1}, supersense_tagset=self.supersense_tagset, tagset=self.tagset, device=device)
     21 self.model.to(device)
     22 self.model.load_state_dict(torch.load(model_file, map_location=device))

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/booknlp/english/tagger.py:58, in Tagger.__init__(self, freeze_bert, base_model, tagset, supersense_tagset, tagset_flat, hidden_dim, flat_hidden_dim, device)
     54 self.rev_supersense_tagset[len(supersense_tagset)+1]="O"
     56 self.num_labels_flat=len(tagset_flat)
---> 58 self.tokenizer = BertTokenizer.from_pretrained(modelName, do_lower_case=False, do_basic_tokenize=False)
     59 self.bert = BertModel.from_pretrained(modelName)
     61 self.tokenizer.add_tokens(["[CAP]"], special_tokens=True)

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/transformers/tokenization_utils_base.py:1724, in PreTrainedTokenizerBase.from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
   1722 else:
   1723     try:
-> 1724         resolved_vocab_files[file_id] = cached_path(
   1725             file_path,
   1726             cache_dir=cache_dir,
   1727             force_download=force_download,
   1728             proxies=proxies,
   1729             resume_download=resume_download,
   1730             local_files_only=local_files_only,
   1731             use_auth_token=use_auth_token,
   1732             user_agent=user_agent,
   1733         )
   1735     except FileNotFoundError as error:
   1736         if local_files_only:

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/transformers/file_utils.py:1921, in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
   1917     local_files_only = True
   1919 if is_remote_url(url_or_filename):
   1920     # URL, so get it from the cache (downloading if necessary)
-> 1921     output_path = get_from_cache(
   1922         url_or_filename,
   1923         cache_dir=cache_dir,
   1924         force_download=force_download,
   1925         proxies=proxies,
   1926         resume_download=resume_download,
   1927         user_agent=user_agent,
   1928         use_auth_token=use_auth_token,
   1929         local_files_only=local_files_only,
   1930     )
   1931 elif os.path.exists(url_or_filename):
   1932     # File, and it exists.
   1933     output_path = url_or_filename

File ~/miniforge3/envs/booknlp/lib/python3.10/site-packages/transformers/file_utils.py:2177, in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
   2171                 raise FileNotFoundError(
   2172                     "Cannot find the requested files in the cached path and outgoing traffic has been"
   2173                     " disabled. To enable model look-ups and downloads online, set 'local_files_only'"
   2174                     " to False."
   2175                 )
   2176             else:
-> 2177                 raise ValueError(
   2178                     "Connection error, and we cannot find the requested files in the cached path."
   2179                     " Please try again or make sure your Internet connection is on."
   2180                 )
   2182 # From now on, etag is not None.
   2183 if os.path.exists(cache_path) and not force_download:

ValueError: Connection error, and we cannot find the requested files in the cached path. Please try again or make sure your Internet connection is on.

I turn wifi on and everything works normally.

Error initializing model in demo notebook

Hello! When initializing the BookNLP model in Google Colab, this error occurs:

RuntimeError: Error(s) in loading state_dict for Tagger: Unexpected key(s) in state_dict: "bert.embeddings.position_ids".

It may be a dependency error with the transformers library? I did not get this error when running the notebook in the same environment a month or so ago. Thanks for any help.


Adaptation for another language

Hello!

Can you offer advice or some instructions on how to make this library work with another language, for example Russian?

syntactic_head_ID erroneously references a token in the previous sentence

Hi, thank you for making the great library!
When parsing long documents, the syntactic_head_ID will sometimes reference a token in the previous sentence. For example, in the parsing output in the attached file (dKDD.csv):

0	2	0	28	she	she	122	125	PRON	PRP	nsubj	29	O
0	2	1	29	's	be	126	128	AUX	VBZ	ROOT	29	O
0	2	2	30	not	not	129	132	PART	RB	neg	29	O
0	2	3	31	the	the	133	136	DET	DT	det	32	O
0	2	4	32	one	one	137	140	NOUN	NN	attr	29	O
0	2	5	33	to	to	141	143	PART	TO	aux	34	O
0	2	6	34	write	write	144	149	VERB	VB	relcl	32	O
0	2	7	35	.	.	150	151	PUNCT	.	punct	29	O
0	3	0	36	Yeah	yeah	152	156	INTJ	UH	intj	35	O
0	3	1	37	.	.	157	158	PUNCT	.	punct	36	O

The syntactic_head_ID of token 36 (in sentence 3) is token 35 (sentence 2), which doesn't seem to make sense.
The same happens with tokens 62, 68, 91, 202, 276, 327, 328, 344, 376, 378, 385, 387, 433, 434, 499, 503, 516, 550, 556, 557, 558, 566, 589, 725, 751, 755, 813, 818, 843, 845, 853, 876, 880, 1450, 1502, 1563, 1756, 1881, 1882, 1902, 1926, 1972, 1993, 2054, 2058, 2059, 2086, 2097, 2103, 2488, 2489, 2511.
Is there a way to fix this?
dKDD.csv
dKDD.txt
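
For reference, a minimal sketch of how one might flag such tokens in a tab-separated parse file like the sample above (column positions follow those sample rows, with the syntactic head ID in the second-to-last column; this assumes the file has no header row -- skip the first line if yours has one):

import csv

# Map each document-level token ID to its sentence ID
sent_of = {}
rows = []
with open("dKDD.csv") as f:
    for row in csv.reader(f, delimiter="\t"):
        sent_of[int(row[3])] = int(row[1])
        rows.append(row)

# Report tokens whose syntactic head lies in a different sentence
# (ROOT tokens, whose head is their own ID, are skipped)
for row in rows:
    tok, sent, head = int(row[3]), int(row[1]), int(row[11])
    if head != tok and sent_of.get(head) != sent:
        print(tok, row[4], "-> head", head, "in sentence", sent_of.get(head))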

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1701: character maps to <undefined>

I've been struggling with this for about two days straight and I finally got it figured out.

After everything was set up, it would run through the program before failing part-way through.

(booknlp) %user%>python setup.py
using device cpu
{'pipeline': 'entity,quote,supersense,event,coref', 'model': 'big'}
--- startup: 2.871 seconds ---
--- spacy: 13.651 seconds ---
--- entities: 160.518 seconds ---
--- quotes: 0.162 seconds ---
--- attribution: 0.025 seconds ---
--- name coref: 0.781 seconds ---
Traceback (most recent call last):
  File "setup.py", line 21, in <module>
    booknlp.process(input_file, output_directory, book_id)
  File "%user%.conda\envs\booknlp\lib\site-packages\booknlp\booknlp.py", line 17, in process
    self.booknlp.process(inputFile, outputFolder, idd)
  File "%user%.conda\envs\booknlp\lib\site-packages\booknlp\english\english_booknlp.py", line 426, in process
    genderEM=GenderEM(tokens=tokens, entities=entities, refs=refs, genders=self.gender_cats, hyperparameterFile=self.gender_hyperparameterFile)
  File "%user%.conda\envs\booknlp\lib\site-packages\booknlp\english\gender_inference_model_1.py", line 71, in __init__
    self.read_hyperparams(hyperparameterFile)
  File "%user%.conda\envs\booknlp\lib\site-packages\booknlp\english\gender_inference_model_1.py", line 167, in read_hyperparams
    header=file.readline().rstrip()
  File "%user%.conda\envs\booknlp\lib\encodings\cp1252.py", line 23, in decode
    return codecs.charmap_decode(input,self.errors,decoding_table)[0]
UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1701: character maps to <undefined>

I'm unsure if it's intended this way or not, but I modified the "read_hyperparams" definition in "gender_inference_model_1.py".

def read_hyperparams(self, filename):
    self.hyperparameters={}
    try:
        with open(filename, 'r', encoding='utf-8') as file:   ### <--- This line
            header=file.readline().rstrip()
            gender_mapping={}
            for idx, val in enumerate(header.split("\t")[2:]):
                if val in self.genderID:
                    gender_mapping[self.genderID[val]]=idx+2

        # More Code #

    except UnicodeDecodeError as e:
        print(f"Error occurred at file position: {e.start}")
        print("Attempting to read problematic line...")
        with open(filename, 'rb') as file:
            file.seek(e.start - 20, 0)  # Go back 20 bytes from the error position
            print(file.read(50))  # Read 50 bytes around the error position
        raise e  # Re-raise the exception to halt execution
Explicitly stating the encoding here fixed the problem. The try statement was there to help troubleshoot the problematic characters.

If anyone else is having the same problem, this was my solution!

Problem running run_nlpbook.py

After running:
booknlp=BookNLP("en", model_params)

I get the following;

(It seems to refer to my model location as booknlps, but what is created is booknlp_models, and tacking a local directory path onto the huggingface url also seems like an issue. I'm glad to help and try things here, though my experience with big python code bases is limited.)

404 Client Error: Repository Not Found for url: https://huggingface.co/C:%5CUsers%5Cdenis%5Cbooknlps%5Centities_google/bert_uncased_L-6_H-768_A-12/resolve/main/tokenizer_config.json

RepositoryNotFoundError Traceback (most recent call last)
c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in get_file_from_repo(path_or_repo, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only)
2241 local_files_only=local_files_only,
-> 2242 use_auth_token=use_auth_token,
2243 )

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in cached_path(url_or_filename, cache_dir, force_download, proxies, resume_download, user_agent, extract_compressed_file, force_extract, use_auth_token, local_files_only)
1853 use_auth_token=use_auth_token,
-> 1854 local_files_only=local_files_only,
1855 )

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in get_from_cache(url, cache_dir, force_download, proxies, etag_timeout, resume_download, user_agent, use_auth_token, local_files_only)
2049 r = requests.head(url, headers=headers, allow_redirects=False, proxies=proxies, timeout=etag_timeout)
-> 2050 _raise_for_status(r)
2051 etag = r.headers.get("X-Linked-Etag") or r.headers.get("ETag")

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in _raise_for_status(request)
1970 if error_code == "RepoNotFound":
-> 1971 raise RepositoryNotFoundError(f"404 Client Error: Repository Not Found for url: {request.url}")
1972 elif error_code == "EntryNotFound":

RepositoryNotFoundError: 404 Client Error: Repository Not Found for url: https://huggingface.co/C:%5CUsers%5Cdenis%5Cbooknlps%5Centities_google/bert_uncased_L-6_H-768_A-12/resolve/main/tokenizer_config.json

During handling of the above exception, another exception occurred:

OSError Traceback (most recent call last)
~\AppData\Local\Temp\ipykernel_2352\2094341818.py in
4 }
5
----> 6 booknlp=BookNLP("en", model_params)

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\booknlp.py in __init__(self, language, model_params)
12
13 if language == "en":
---> 14 self.booknlp=EnglishBookNLP(model_params)
15
16 def process(self, inputFile, outputFolder, idd):

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\english\english_booknlp.py in __init__(self, model_params)
146
147 if self.doEntities:
--> 148 self.entityTagger=LitBankEntityTagger(self.entityPath, tagsetPath)
149 aliasPath = pkg_resources.resource_filename(__name__, "data/aliases.txt")
150 self.name_resolver=NameCoref(aliasPath)

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\english\entity_tagger.py in __init__(self, model_file, model_tagset)
17 base_model=re.sub(".model", "", base_model)
18
---> 19 self.model = Tagger(freeze_bert=False, base_model=base_model, tagset_flat={"EVENT":1, "O":1}, supersense_tagset=self.supersense_tagset, tagset=self.tagset, device=device)
20
21 self.model.to(device)

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\booknlp\english\tagger.py in __init__(self, freeze_bert, base_model, tagset, supersense_tagset, tagset_flat, hidden_dim, flat_hidden_dim, device)
56 self.num_labels_flat=len(tagset_flat)
57
---> 58 self.tokenizer = BertTokenizer.from_pretrained(modelName, do_lower_case=False, do_basic_tokenize=False)
59 self.bert = BertModel.from_pretrained(modelName)
60

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\tokenization_utils_base.py in from_pretrained(cls, pretrained_model_name_or_path, *init_inputs, **kwargs)
1662 use_auth_token=use_auth_token,
1663 revision=revision,
-> 1664 local_files_only=local_files_only,
1665 )
1666 if resolved_config_file is not None:

c:\Users\denis\Anaconda3\envs\booknlp\lib\site-packages\transformers\file_utils.py in get_file_from_repo(path_or_repo, filename, cache_dir, force_download, resume_download, proxies, use_auth_token, revision, local_files_only)
2246 logger.error(err)
2247 raise EnvironmentError(
-> 2248 f"{path_or_repo} is not a local folder and is not a valid model identifier "
2249 "listed on 'https://huggingface.co/models'\nIf this is a private repository, make sure to "
2250 "pass a token having permission to this repo with use_auth_token or log in with "

OSError: C:\Users\denis\booknlps\entities_google/bert_uncased_L-6_H-768_A-12 is not a local folder and is not a valid model identifier listed on 'https://huggingface.co/models'
If this is a private repository, make sure to pass a token having permission to this repo with use_auth_token or log in with huggingface-cli login and pass use_auth_token=True

Unable to run booknlp as .py file

When I execute the .py script, I get the error given below: @dbamman

from booknlp.booknlp import BookNLP

model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big"
}

booknlp = BookNLP("en", model_params)

# Input file to process
input_file = "booknlpscr/pdf.txt"

# Output directory to store resulting files in
output_directory = "pdf/"

# Files within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id = "pdf"

booknlp.process(input_file, output_directory, book_id)

print(input_file)
# conda install pytorch torchvision torchaudio cudatoolkit=10.2 -c pytorch

Character count error?

Hi. I cannot match the character count with a simple word-search count in the text file. What alerted me was dracula.txt, in which I get 'id': 230, 'count': 37, 'max_proper_mention': 'Mina', whereas when I do a simple word count (in both TextEdit and MS Word) for 'Mina' I get 260. Why the anomaly?

HFValidation error

I installed the huggingface models as advised in previous issue responses, but I am still getting this error.
HFValidationError: Repo id must use alphanumeric chars or '-', '_', '.', '--' and '..' are forbidden, '-' and '.' cannot start or end the name, max length is 96: 'C:\Users\PC1\booknlps\entities_google/bert_uncased_L-6_H-768_A-12'.

Download speeds very slow on initial startup

Hi, downloading the BERT .model files from the server seems to take 4 hours. Is there a way to wget or curl them into a directory? Also, if one terminates the program, the partially written files remain and cause an unzipping error in pytorch. Is there a plan to mitigate this in the future with tempfile downloads?

minimal example:

import booknlp
from booknlp.booknlp import BookNLP
import spacy
spacy.load('en_core_web_sm')
model_params = {
    "pipeline": "entity,quote,supersense,event,coref",
    "model": "big"
}

booknlp = BookNLP("en", model_params)

# Input file to process
input_file = "input_dir/bartleby.txt"

# Output directory to store resulting files in
output_directory = "output_dir/bartleby/"

# File within this directory will be named ${book_id}.entities, ${book_id}.tokens, etc.
book_id = "bartleby"

booknlp.process(input_file, output_directory, book_id)

https://i.imgur.com/FZIqNsC.png

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1701: character maps to <undefined>

Any idea how to fix this? I am getting the following error:

File "C:/Users/******/bookNLP.py", line 28, in
booknlp.process(input_file, output_directory, book_id)

File "C:\Users*******\Anaconda3\envs\BookNLP\lib\site-packages\booknlp\booknlp.py", line 17, in process
self.booknlp.process(inputFile, outputFolder, idd)

File "C:\Users******\Anaconda3\envs\BookNLP\lib\site-packages\booknlp\english\english_booknlp.py", line 426, in process
genderEM=GenderEM(tokens=tokens, entities=entities, refs=refs, genders=self.gender_cats, hyperparameterFile=self.gender_hyperparameterFile)

File "C:\Users******\Anaconda3\envs\BookNLP\lib\site-packages\booknlp\english\gender_inference_model_1.py", line 71, in init
self.read_hyperparams(hyperparameterFile)

File "C:\Users******\Anaconda3\envs\BookNLP\lib\site-packages\booknlp\english\gender_inference_model_1.py", line 167, in read_hyperparams
header=file.readline().rstrip()

File "C:\Users*******\Anaconda3\envs\BookNLP\lib\encodings\cp1252.py", line 23, in decode
return codecs.charmap_decode(input,self.errors,decoding_table)[0]

UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 1701: character maps to

Choosing spaCy model

Thank you for this fantastic library!

Which spacy model is used when running booknlp.process()? Is this in any way controlled by the "model" parameter ("small", "big"), or does it simply use the model that is currently initialized? For example, could I get it to use en_core_web_trf by running the following before I use booknlp.process():

import spacy
nlp = spacy.load('en_core_web_trf')
