
dbpedia's Introduction

R package ‘dbpedia’ - wrapper for DBpedia Spotlight

License: GPL v3 R-CMD-check codecov Lifecycle: maturing

About

Functionality for Entity Linking from R: Get DBpedia URIs for entities in a corpus using DBpedia Spotlight.

Motivation

The method of Entity Linking is used to disambiguate entities such as Persons, Organizations and Locations in continuous text and link them to entries in an external knowledge graph. At its core, the aim of the dbpedia R package is to integrate Entity Linking with the tool “DBpedia Spotlight” (https://www.dbpedia-spotlight.org) into common workflows for text analysis in R. In particular, it addresses the following needs:

  • facilitate the use of Entity Linking with DBpedia Spotlight from within R
  • prepare and process textual data in an integrated, easy-to-use Entity Linking workflow, supporting R users who rely on different input and output formats
  • use the retrieved Uniform Resource Identifiers (URIs) to fetch additional data from the DBpedia Knowledge Graph and Wikidata by further integrating these steps into the provided analysis pipeline

For the examples in this README, we also load some additional packages.

library(kableExtra)
library(dplyr)
library(quanteda)

First Look

The main motivation is to lower barriers to link textual data with other resources in social science research. As such, it aims to provide a focused way to interact with DBpedia Spotlight from within R. In its most basic application, the package can be used to query DBpedia Spotlight as follows:

library(dbpedia)

doc <- "Berlin is the capital city of Germany."

uri_table <- get_dbpedia_uris(x = doc,
                              language = getOption("dbpedia.lang"),
                              api = getOption("dbpedia.endpoint")
)

DBpedia Spotlight identifies which parts of the text represent entities and decides which resources in the DBpedia knowledge graph they correspond to. The return value of the method is a data.table containing the identified entities along with their respective DBpedia URIs and starting positions in the document.

start  text     dbpedia_uri
1      Berlin   http://de.dbpedia.org/resource/Berlin
31     Germany  http://de.dbpedia.org/resource/Deutschland

Installation and Setup

At this stage, the dbpedia R package is a GitHub-only package. Install it as follows:

devtools::install_github("PolMine/dbpedia", ref = "main")

In a nutshell, the package prepares queries, sends them to an external tool and parses the returned results. This tool, DBpedia Spotlight, runs as a web service - either remotely or locally. The developers of DBpedia Spotlight currently maintain a public endpoint for the service, which the dbpedia package uses by default.

Running DBpedia Spotlight locally - Docker Setup

As an alternative to the public endpoint, it is possible to run the service locally. This can be desirable for performance reasons, because of rate limits of the public endpoint, and for other considerations. The easiest way to realize this is to use the tool within a Docker container prepared by the maintainers of DBpedia Spotlight. The setup is described in some detail in the corresponding GitHub repository: https://github.com/dbpedia-spotlight/spotlight-docker.

As described on the GitHub page, with Docker running, the quick-start command to be used in the terminal to load and run a DBpedia Spotlight model is as follows:

docker run -tid \
  --restart unless-stopped \
  --name dbpedia-spotlight.de \
  --mount source=spotlight-model,target=/opt/spotlight \
  -p 2222:80  \
  dbpedia/dbpedia-spotlight spotlight.sh de

This will initialize the German Docker DBpedia Spotlight model. Other available languages are described in the GitHub repository as well.

Note: In our tests, we noticed that the DBpedia Spotlight Docker containers are not available for all architectures, in particular for Apple silicon. In this case, build the container from the Dockerfile as follows before loading the model:

git clone https://github.com/dbpedia-spotlight/spotlight-docker.git
cd spotlight-docker
docker build -t dbpedia/dbpedia-spotlight:latest .

Note: When run for the first time, the script will download the language model. Depending on the language, this download and the subsequent initialization of the model will take some time. This process is not necessarily obvious in the output of the terminal. If the container is queried before the language model is fully initialized, the download or initialization of the model seems to be interrupted, which will cause errors in later queries. It is thus advisable to wait until the container is idle before querying the service for the first time in a session.

Using the package - A Very Quick Walkthrough with quanteda corpora

This README will use the common quanteda corpus format as input to provide a quick step-by-step overview about the functionality provided by the package. A brief second example will illustrate how the extracted Uniform Resource Identifiers can be mapped back onto the input, using a Corpus Workbench corpus as an example.

Setup - Loading the package

Upon loading the dbpedia package, a start-up message will print information about whether a locally running DBpedia Spotlight service or a public endpoint is used. In addition, the language of the model and the corresponding list of stopwords are shown.

library(dbpedia)

This information is available during the R session and is by default used by the get_dbpedia_uris() method.

getOption("dbpedia.endpoint")
getOption("dbpedia.lang")
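
If a local Docker container is used instead (see the Docker setup above), these options can be overridden accordingly. A minimal sketch, assuming the container from the Docker command in this README listens on port 2222 and exposes the standard /rest/annotate path:

options(dbpedia.endpoint = "http://localhost:2222/rest/annotate")  # assumed local endpoint
options(dbpedia.lang = "de")                                       # German model loaded by the Docker command above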

Data

For the following example, we use the “US presidential inaugural address texts” corpus from the quanteda R package. For illustrative purposes, only speeches since 1970 are used. To create useful chunks of text, we split the corpus into paragraphs.

inaugural_paragraphs <- data_corpus_inaugural |>
  corpus_subset(Year > 1970) |>
  corpus_reshape(to = "paragraphs")

Entity Linking with get_dbpedia_uris()

Using a local endpoint for the DBpedia Spotlight service and the sample corpus from quanteda, identifying and disambiguating entities in documents can be realized with the main worker method of the package: get_dbpedia_uris().

The method accepts the data in different input formats - character vectors, quanteda corpora, Corpus Workbench format, XML - as well as additional parameters, some of which are discussed in more detail in the package’s vignette.

uritab_paragraphs <- get_dbpedia_uris(
  x = inaugural_paragraphs,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.5,
  api = getOption("dbpedia.endpoint"),
  verbose = FALSE,
  progress = FALSE
)

In this case, the text of each document in the corpus is extracted and passed to the DBpedia Spotlight service. The results are then parsed by the method. The return value is a data.table containing the document name as well as the extracted entities along with their starting position in the text and, most importantly, their respective URI in the DBpedia Knowledge Graph (only the first five entities are shown here and the column containing the types of the entities is omitted):

doc           start  text           dbpedia_uri
1973-Nixon.1  1      Mr             http://de.dbpedia.org/resource/Master_of_the_Rolls
1973-Nixon.1  21     Mr             http://de.dbpedia.org/resource/Master_of_the_Rolls
1973-Nixon.1  25     Speaker        http://de.dbpedia.org/resource/Speaker
1973-Nixon.1  34     Mr             http://de.dbpedia.org/resource/Master_of_the_Rolls
1973-Nixon.1  38     Chief Justice  http://de.dbpedia.org/resource/Chief_Justice_of_the_United_States

The package’s vignette provides more details on the approach and its parameters.

Token-Level Annotation with the Corpus Workbench

While approaches that enrich documents with entities are very useful, another important aspect of entity linking is the ability to assign URIs to precise spans within the text and write them back to the corpus. This can be crucial if extracted URIs should be used in subsequent tasks when working with textual data. The Corpus Workbench data format makes it possible to map annotated entities onto the continuous text of the initial corpus. The following quick example should illustrate this.

Data

For this example, we use a single newswire of the REUTERS corpus. The corpus is provided as a Corpus Workbench sample corpus in the RcppCWB R package. To work with CWB corpora in R, the R package polmineR is used. Both RcppCWB and polmineR are dependencies of dbpedia.

library(polmineR)
use("RcppCWB")

To extract an illustrative part of the REUTERS corpus, we create a subcorpus comprising a single document. To do so, we use polmineR’s subset() method for CWB corpus objects.

reuters_newswire <- corpus("REUTERS") |>
  subset(id == 144)

Entity Linking with get_dbpedia_uris()

Like before, we perform Entity Linking with get_dbpedia_uris(). In addition, we map entity types returned by DBpedia Spotlight to a number of entity classes (see the vignette for a more comprehensive explanation).

mapping_vector = c(
  "PERSON" = "DBpedia:Person",
  "ORGANIZATION" = "DBpedia:Organisation",
  "LOCATION" = "DBpedia:Place"
)

reuters_newswire_annotation <- reuters_newswire |>
  get_dbpedia_uris(verbose = FALSE) |>
  map_types_to_class(mapping_vector = mapping_vector)
## ℹ mapping values in column `types` to new column `class`

This results in the following annotations (only the first five entities are shown here and the column of types is omitted):

cpos_left  cpos_right  dbpedia_uri                                                              text     class
92         92          http://de.dbpedia.org/resource/Organisation_erdölexportierender_Länder  OPEC     LOCATION|ORGANIZATION|PERSON
93         93          http://de.dbpedia.org/resource/Brian_May                                 may      LOCATION|ORGANIZATION|PERSON
101        101         http://de.dbpedia.org/resource/June_Carter_Cash                          June     LOCATION|ORGANIZATION|PERSON
102        102         http://de.dbpedia.org/resource/Session_(Schweiz)                         session  LOCATION|ORGANIZATION|PERSON
105        105         http://de.dbpedia.org/resource/Integrated_Truss_Structure                its      LOCATION|ORGANIZATION|PERSON

Mapping the Results to the Corpus

This leaves us with output similar to before. As explained in more detail in the vignette, the output of get_dbpedia_uris() for CWB objects additionally contains the corpus positions of entities within the continuous text. This allows us to map the annotations back to the corpus.

polmineR’s read() method allows us to visualize this mapping interactively, using the classes of the entities to provide some visual clues as well.

read(reuters_newswire,
     annotation = as_subcorpus(reuters_newswire_annotation, highlight_by = "class"))

Advanced Scenarios

This README only offers a first look into the functions of the dbpedia package. Specific parameters as well as other scenarios are discussed in more detail in the vignette of the package. These scenarios include the integration of SPARQL queries in the workflow to further enrich disambiguated entities with additional data from the DBpedia and Wikidata knowledge graphs.
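
For a rough impression of what such a workflow looks like, here is a sketch assembled from the usage examples in the issue reports further down (function and argument names are taken from there; consult the vignette for a tested, up-to-date version):

library(dbpedia)
library(quanteda)
library(magrittr)

options(dbpedia.lang = "en")
options(dbpedia.endpoint = "http://api.dbpedia-spotlight.org/en/annotate")

uritab <- data_char_ukimmig2010 |>
  corpus() |>
  get_dbpedia_uris(progress = TRUE) %>%
  add_wikidata_uris(endpoint = "https://dbpedia.org/sparql/", progress = TRUE, limit = 50) %>%
  wikidata_query(id = "P31")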

Related work

Acknowledgements

We gratefully acknowledge funding from the German National Research Data Infrastructure (Nationale Forschungsdateninfrastruktur / NFDI). Developing the dbpedia package is part of the measure “Linking Textual Data” as part of the consortium KonsortSWD (project number 442494171).

dbpedia's People

Contributors

ablaette, christophleonhardt


dbpedia's Issues

Add parameter `support` to `get_dbpedia_uris()`

get_dbpedia_uris() currently passes the text and the confidence parameter to DBpedia Spotlight. However, there are more parameters which influence the results of the service. These are described in the paper by Mendes et al. (2011) and shown in examples on the DBpedia Spotlight GitHub wiki (https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/Web-service and https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual).

One of those parameters is "support" which sets a threshold of the minimum prominence of an entity in Wikipedia (pp. 3-4). The inclusion of support might be useful. If I am not mistaken, support could be added to the query parameter created for the GET in get_dbpedia_uris().
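
For reference, support is a plain query parameter of the Spotlight annotate endpoint, next to text and confidence. A hedged sketch of the corresponding raw request with httr (not the actual internals of get_dbpedia_uris()):

library(httr)

resp <- GET(
  "http://api.dbpedia-spotlight.org/en/annotate",
  query = list(
    text = "Berlin is the capital city of Germany.",
    confidence = 0.5,
    support = 20  # minimum prominence threshold discussed above
  ),
  add_headers(Accept = "application/json")
)
annotations <- content(resp, as = "parsed")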

Fixed namespace in `dbpedia_get_wikidata_uris()`

In dbpedia_get_wikidata_uris() there is an option to query additional properties from DBpedia. This is implemented like that:

optional <- sprintf('OPTIONAL { ?item dbo:%s ?key . }', optional)

A potential issue which would need to be checked, I guess, is that not all properties necessarily can be queried via the ontology namespace (?), i.e. with dbo:%s. If I look around at the DBpedia page for http://de.dbpedia.org/page/Mannheim, there are other namespaces such as "geo" or "prop-de" which probably would need to be named accordingly in a SPARQL query.

As I am not entirely sure, it should be checked whether I am correct. For example, I would assume that you cannot query latitude ("lat") as an optional attribute right now.
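
For instance, retrieving latitude would presumably require the WGS84 "geo" namespace instead of dbo:. A hypothetical SPARQL snippet (not something dbpedia_get_wikidata_uris() currently supports):

query <- '
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?item ?lat WHERE {
  VALUES ?item { <http://de.dbpedia.org/resource/Mannheim> }
  OPTIONAL { ?item geo:lat ?lat . }
}'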

Checking for `types_src` results in an error if element in `types` column is unnamed

Issue

There are scenarios in which elements in the types column returned by get_dbpedia_uris() are not named lists. This is a) inconsistent and b) results in errors when checking for the types_src which relies on named elements in this column.

Example

See the following example:

library(dbpedia)
library(quanteda)

inaugural_paragraphs <- data_corpus_inaugural |>
  corpus_subset(Year == 2021) |>
  corpus_reshape(to = "paragraphs")

get_dbpedia_uris(
  x = inaugural_paragraphs["2021-Biden.145"],
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.5,
  support = 20,
  types = character(),
  api = getOption("dbpedia.endpoint"), # English endpoint
  verbose = FALSE,
  progress = FALSE
)

This will result in an error:

Error in FUN(X[[i]], ...) : subscript out of bounds

Likely underlying issue

Currently, the way to populate the types column in get_dbpedia_uris() usually results in either an empty list (if there are no types for the entity) or a list of lists containing entity types (if there are types for an entity). The names of the nested lists refer to the source/ontology the type is derived from.

This fails, however, if the document passed to get_dbpedia_uris() has only one entity and only types from one source. In this case, types are added as unnamed list elements to the column. This seems to be happening only if resource_min (the data.table containing entities) has only one row.

Error with types_src

This, in itself, is inconsistent and should be addressed. However, the lack of a name in the column results in an error in the subsequent mechanism to extract and filter the types by their source via the types_src argument. This relies on the elements in types being named.

Potential Solution

I think that when preparing the types for the column, it would be necessary to check if

  • there are only types for a single element
  • these types are all from the same source

In case there is only one type of a single source, e.g. "Person" from "DBpedia", wrapping this value into an additional list() should work.
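
A minimal sketch of that wrapping step (hypothetical: types_raw and type_source stand in for whatever intermediate objects get_dbpedia_uris() actually builds):

# if a single source produced a single unnamed type, re-wrap it in a named list
# so that the types_src filter downstream can rely on names() (assumed internals)
if (length(types_raw) > 0L && is.null(names(types_raw))) {
  types_raw <- setNames(list(types_raw), type_source)  # e.g. type_source <- "DBpedia"
}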

Define endpoint explicitly in README examples

This is a nice example in the README:

inaugural_paragraphs <- data_corpus_inaugural |>
  corpus_subset(Year > 1970) |>
  corpus_reshape(to = "paragraphs")
uritab_paragraphs <- get_dbpedia_uris(
  x = inaugural_paragraphs,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.5,
  api = getOption("dbpedia.endpoint"),
  verbose = FALSE,
  progress = FALSE
)

But if I run this on my local machine with the "German" DBpedia Spotlight container running, I will process the data with a German model, resulting in fuzzy results. On the CI test machines, tests would use the remote English endpoint, so results would be "correct", yet different.

So I suggest setting the endpoint and the language explicitly in this example:

uritab_paragraphs <- get_dbpedia_uris(
  x = inaugural_paragraphs,
  language = "en",
  max_len = 5600L,
  confidence = 0.5,
  api = "http://api.dbpedia-spotlight.org/en/annotate",
  verbose = FALSE,
  progress = FALSE
)

get_dbpedia_uris() does not fail gracefully if endpoint is not accessible

As I have started to process increasingly large corpora (using a local docker container with DBpedia Spotlight), I encountered an ungraceful abort. As it seems, the container was temporarily not available.

I assume that the unavailability of the container was a temporary phenomenon. But get_dbpedia_uris() should be more robust in this scenario. Ideally, get_dbpedia_uris() would retry and abort gracefully after a number of unsuccessful attempts.
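
One way to get this behavior without hand-rolling a retry loop might be httr::RETRY(), which repeats a request with exponential back-off and only fails after a fixed number of attempts. A sketch, not the package's current implementation (endpoint and parameters follow the README examples):

library(httr)

resp <- RETRY(
  "GET",
  url = getOption("dbpedia.endpoint"),
  query = list(text = "Berlin is the capital city of Germany.", confidence = 0.5),
  add_headers(Accept = "application/json"),
  times = 5,       # give up after five unsuccessful attempts
  pause_base = 2   # exponential back-off between attempts
)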

Progressive performance loss when processing large data

When I tried to run get_dbpedia_uris() on the entire GERMAPARL2 corpus, I had to abort because for whatever reason the processing time of paragraphs increased. To fix some observations:

  • The initial progress status message said that processing time would be 3 days. When I returned after a few days, the estimated 'time of arrival' was up to 5 days. At that point, a bit more than half of the entire data (1.8 million of 3.0 million paragraphs) had been processed.
  • Running htop from the shell did not give me any specific insight about the process: cores were used as expected and main memory had not yet been exhausted.
  • There was still about 25 GB of hard disk space left.
  • The information RStudio provides on memory consumption said that 10 GB were used: But I am not entirely sure that this information was correct.

Concerning the logfile:

  • It does not cover the entire data that has been processed: I started the process on April 1, but the first entries in the logfile are on April 4.
  • I would have expected 1.8 million entries in the logfile. But its length is 67212!

As a consequence, it is not possible to analyse when and why the slump of processing speed occurred. Anyway, these are some preliminary insights:

How many paragraphs have been processed per hour? Here, we do not see a decrease. My assumption is that the decrease occurred before the coverage of the logfile.

library(magrittr)
library(lubridate)
library(ggplot2)
library(dplyr)

logfile <- "~/Lab/tmp/entitylinking.log"

logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>%
  lubridate::floor_date(unit = "hour") %>%
  as_tibble() %>%
  group_by(value) %>%
  summarise(N = n()) %>%
  ggplot(aes(x = value, y = N)) +
    geom_line()

(plot: paragraphs processed per hour)

How long did it take to process one paragraph? This is much less telling, quite overloaded.

logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>% 
  diff() %>%
  as_tibble() %>%
  mutate(id = 1L:nrow(.)) %>%
  ggplot(aes(x = id, y = value)) +
  geom_line()

(plot: time needed per paragraph)

What is the distribution of processing time?

logfile %>%
  readLines() %>%
  gsub("^\\[(.*?)\\]\\s.*?$", "\\1", .) %>%
  as.POSIXct() %>% 
  diff() %>%
  as.numeric() %>%
  hist(main = "Distribution of processing time", xlab = "seconds")

(plot: distribution of processing time)

There are quite a few paragraphs that took a long, long time to be processed. We should analyse in further depth what the features of the paragraphs that take so much time are. One possibility: requests that fail, followed by the waiting period until processing the paragraph is retried.

I attach the logfile for further analysis.

entitylinking.log

http error abort

get_dbpedia_uris() aborts when processing the following string.

[2024-03-31 04:44:30.073019] Der Gesetzentwurf Änderung Bürgerlichen Gesetzbuch
Der Gesetzentwurf Änderung Bürgerlichen Gesetzbuchs Gesetze — Drucksache 7/63 — Rechtsausschuß Ausschuß Raumordnung Bauwesen Städtebau Gesetzentwurf Übereinkommen Internationale Fernmeldesatellitenorganisation „ INTELSAT " — Drucksache 7/120 — Auswärtigen Ausschuß — federführend — Ausschuß Forschung Technologie Post Fernmeldewesen Gesetzentwurf Abkommen Parteien Nordatlantikvertrags Rechtsstellung Truppen hinsichtlich Bundesrepublik Deutschland stationierten ausländischen Truppen — Drucksache 7/119 — Auswärtigen Ausschuß — federführend — Verteidigungsausschuß Ausschuß Arbeit Sozialordnung Gesetzentwurf Abkommen Bundesrepublik Deutschland Republik Singapur Vermeidung Doppelbesteuerung Gebiete Steuern Einkommen Vermögen — Drucksache 7/106 — Finanzausschuß Gesetzentwurf Vereinbarung Bundesrepublik Deutschland italienischen Republik Erleichterungen fiskalischen Behandlung grenzüberschreitenden deutsch italienischen Straßengüterverkehrs — Drucksache 7/113 — Finanzausschuß — federführend — Ausschuß Verkehr Gesetzentwurf Abkommen Regierung Bundesrepublik Deutschland Regierung Vereinigten Königreichs Großbritanniens Nordirland steuerliche Behandlung Straßenfahrzeugen internationalen Verkehr — Drucksache 7/107 — Finanzausschuß — federführend — Ausschuß Verkehr Gesetzentwurf Abkommen Bundesrepublik Deutschland Republik Island Vermeidung Doppelbesteuerung Gebiete Steuern Einkommen Vermögen — Drucksache 7/99 — Finanzausschuß Gesetzentwurf Abkommen Bundesrepublik Deutschland Königreich Niederlande Krankenversicherung alte Rentner — Drucksache 7/110 — Ausschuß Arbeit Sozialordnung Gesetzentwurf Übereinkommen Internationalen Arbeitsorganisation Arbeitsaufsicht Landwirtschaft — Drucksache 7/109 — Ausschuß Arbeit Sozialordnung — federführend — Ausschuß Ernährung Landwirtschaft Forsten Gesetzentwurf Vereinbarung Bundesrepublik Deutschland Sozialistischen Föderativen Republik Jugoslawien Durchführung Abkommens Soziale Sicherheit — Drucksache 7/108 — Ausschuß Arbeit Sozialordnung Gesetzentwurf Übereinkommen Internationalen Arbeitsorganisation Schutz Arbeitnehmer ionisierenden Strahlen — Drucksache 7/105 — Ausschuß Arbeit Sozialordnung Gesetzentwurf internationalen Einheits Übereinkommen Suchtstoffe — Drucksache 7/126 — Ausschuß Jugend Familie Gesundheit — federführend — Rechtsausschuß Haushaltsausschuß gemäß § 96 GO Gesetzentwurf Übereinkommen Ausarbeitung Europäischen Arzneibuches — Drucksache 7/125 — Ausschuß Jugend Familie Gesundheit Gesetzentwurf Europäischen Übereinkommen Schutz Tieren beim internationalen Transport Drucksache 7/127 — Ausschuß Ernährung Landwirtschaft Forsten Gesetzentwurf Übereinkommen Schutz Hersteller Tonträgern unerlaubte Vervielfältigung Tonträger — Drucksache 7/121 — Rechtsausschuß Gesetzentwurf Haager Kaufrechtsübereinkommen — Drucksache 7/115 Rechtsausschuß Gesetzentwurf Vertrag Bundesrepublik Deutschland Republik Österreich Führung geschlossenen Zügen Österreichischen Bundesbahnen Strecken Deutschen Bundesbahn Bundesrepublik Deutschland — Drucksache 7/134 — Ausschuß Verkehr Gesetzentwurf Abkommen Bundesrepublik Deutschland Mauritius Förderung gegenseitigen Schutz Kapitalanlagen — Drucksache 7/104 — Ausschuß Wirtschaft — federführend — Ausschuß wirtschaftliche Zusammenarbeit Gesetzentwurf Abkommen Assoziation betreffend Beitritt Mauritius Assoziierungsabkommen Europäischen Wirtschaftsgemeinschaft Gemeinschaft assoziierten afrikanischen Staaten Madagaskar sowie Änderung Internen Abkommens Finanzierung Verwaltung Hilfe Gemeinschaft — Drucksache 7/132 — Ausschuß Wirtschaft — 
federführend — Ausschuß wirtschaftliche Zusammenarbeit Entwurf Konsulargesetzes — Drucksache 7/131 — Auswärtigen Ausschuß — federführend — Rechtsausschuß Entwurf Einheitlichen Gesetzes internationalen Kauf beweglicher Sachen — Drucksache 7/123 — Rechtsausschuß Entwurf einheitlichen Gesetzes Abschluß internationalen Kaufverträgen bewegliche Sachen — Drucksache 7/124 — Rechtsausschuß Gesetzentwurf Änderung Hypothekenbankgesetzes Schiffsbankgesetzes — Drucksache 7/114 — Finanzausschuß — federführuend — Ausschuß Wirtschaft Rechtsausschuß Ausschuß Raumordnung Bauwesen Städtebau Gesetzentwurf Änderung Gesetzes Beaufsichtigung privaten Versicherungsunternehmungen Bausparkassen Drucksache 7/100 — Finanzausschuß federführend — Ausschuß Wirtschaft Rechtsausschuß Ausschuß Raumordnung Bauwesen Städtebau Gesetzentwurf Änderung Börsengesetzes — Drucksache 7/101 — Finanzausschuß — federführend — Ausschuß Wirtschaft Rechtsausschuß Gesetzentwurf Änderung Gesetzes Pfandbriefe verwandte Schuldverschreibungen öffentlich rechtlicher Kreditanstalten — Drucksache 7/112 Finanzausschuß — federführend — Ausschuß Wirtschaft Rechtsausschuß Ausschuß Raumordnung Bauwesen Städtebau Gesetzentwurf Änderung Gesetzes Finanzstatistik — Drucksache 7/98 — Haushaltsausschuß — federführend — Finanzausschuß Innenausschuß Ausschuß Arbeit Sozialordnung Gesetzentwurf Änderung Eichgesetzes — Drucksache 7/103 — Ausschuß Wirtschaft Gesetzentwurf Änderung Gesetzes Einheiten Meßwesen — Drucksache 7/102 — Ausschuß Wirtschaft Entwurf Zweiten Gesetzes Änderung Viehzählungsgesetzes — Drucksache 7/128 — Ausschuß Ernährung Landwirtschaft Forsten — federführend — Innenausschuß Haushaltsausschuß gemäß § 96 GO Gesetzentwurf Beruf Diätassistenten — Drucksache 7/116 — Ausschuß Jugend Familie Gesundheit Gesetzentwurf Änderung Fleischbeschaugesetzes — Drucksache 7/122 — Ausschuß Jugend Familie Gesundheit — federführend — Ausschuß Ernährung Landwirtschaft Forsten Gesetzentwurf Änderung Unterhaltssicherungsgesetzes Arbeitsplatzschutzgesetzes — Drucksache 7/129 — Verteidigungsschuß
request 60 failed, waiting for retry
CONTAINER ID NAME CPU % MEM USAGE / LIMIT MEM % NET I/O BLOCK I/O PIDS
d6b9b798e0e3 dbpedia-spotlight.de 0.33% 7.169GiB / 27.86GiB 25.73% 2.39GB / 14.9GB 2.32GB / 126MB 30
used (Mb) gc trigger (Mb) limit (Mb) max used (Mb)
Ncells 102161828 5456.1 160160022 8553.5 NA 160160022 8553.5
Vcells 839448126 6404.5 1309906545 9993.8 32768 1088650679 8305.8
[2024-03-31 04:44:53.774239] Der Gesetzentwurf Änderung Bürgerlichen Gesetzbuch

Faulty class assignment in polmineR:::as.AnnotatedPlainTextDocument()

I currently cannot run the example in the README.Rmd file and I essentially narrowed it down to this minimal reproducible example:

library(polmineR)

gp_subcorpus <- corpus("GERMAPARL2") |>
  subset(protocol_date == "2009-11-10") |>
  subset(speaker_name == "Angela Merkel")

doc <- polmineR:::as.AnnotatedPlainTextDocument(
  x = gp_subcorpus,
  p_attributes = "word",
  verbose = FALSE
)

The error message states:

assignment of an object of class “NULL” is not valid for @‘s_attribute_strucs’ in an object of class “plpr_subcorpus”; is(value, "character") is not TRUE

Looking further, the following line in polmineR:::as.AnnotatedPlainTextDocument seems to be the culprit, although I don't know where and why this assignment occurs.

ne_sub <- subset(x, ne_type = ".*", regex = TRUE)

I am sure that this is an issue of polmineR and not dbpedia, but since the use case is quite specific, I would leave the issue here for now. If it would be more appropriate in the polmineR repository, I am happy to move it.

Session Info

This happens with the most recent versions of polmineR, dbpedia and GERMAPARL v2.

Keep entity types returned by DBpedia Spotlight

DBpedia Spotlight returns not only URIs for entities but also entity types. In get_dbpedia_uris() these values are currently omitted from the output. If kept, these entity types could be used to classify entities without additional SPARQL queries, for example if the textual data does not contain pre-annotated named entities.

avoid pipes in package code

On linux-oldrel I see the following. Essentially it means that pipes should be avoided in package code as a matter of compatibility with R-oldrel.

  • checking for file ‘.../DESCRIPTION’ ... OK
  • preparing ‘dbpedia’:
  • checking DESCRIPTION meta-information ... OK
  • installing the package to build vignettes
    -----------------------------------
  • installing source package ‘dbpedia’ ...
    ** using staged installation
    ** R
    Error in parse(outFile) :
    pipe placeholder can only be used as a named argument
    ERROR: unable to collate and parse R files for package ‘dbpedia’
  • removing ‘/tmp/RtmpF1KkFy/Rinst190576eec4/dbpedia’

`get_dbpedia_uris()` fails if argument `retry` is logical

For dbpedia v0.1.2.9005 get_dbpedia_uris() will fail if the argument retry is logical (which it is by default). Then this line will result in an error because the object request is not found:

request_max <- if (is.logical(retry)) as.integer(request) else retry

If retry is an integer value, then it will work. Using retry = 0L turns the functionality off which should be in line with the default behavior in prior versions.
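
A possible correction of that line, assuming retry = TRUE should translate into some default number of attempts (the value 3L is an assumption):

request_max <- if (is.logical(retry)) {
  if (isTRUE(retry)) 3L else 0L  # assumed default when retry = TRUE; off when FALSE
} else {
  as.integer(retry)
}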

Replace SPARQL::SPARQL()

We rely on the SPARQL::SPARQL() function. The SPARQL package cannot be installed from CRAN and we need to load RCurl and XML separately. Looking at the code of the function, it should not be too complicated to develop a replacement:

function (url = "http://localhost/", query = "", update = "", 
    ns = NULL, param = "", extra = NULL, format = "xml", curl_args = NULL, 
    parser_args = NULL) 
{
    if (!is.null(extra)) {
        extrastr <- paste("&", sapply(seq(1, length(extra)), 
            function(i) {
                paste(names(extra)[i], "=", URLencode(extra[[i]]), 
                  sep = "")
            }), collapse = "&", sep = "")
    }
    else {
        extrastr <- ""
    }
    tf <- tempfile()
    if (query != "") {
        if (param == "") {
            param <- "query"
        }
        if (format == "xml") {
            tf <- do.call(getURL, append(list(url = paste(url, 
                "?", param, "=", gsub("\\+", "%2B", URLencode(query, 
                  reserved = TRUE)), extrastr, sep = ""), httpheader = c(Accept = "application/sparql-results+xml")), 
                curl_args))
            DOM <- do.call(xmlParse, append(list(tf), parser_args))
            if (length(getNodeSet(DOM, "//s:result[1]", namespaces = sparqlns)) == 
                0) {
                rm(DOM)
                df <- data.frame(c())
            }
            else {
                attrs <- unlist(xpathApply(DOM, paste("//s:head/s:variable", 
                  sep = ""), namespaces = sparqlns, quote(xmlGetAttr(x, 
                  "name"))))
                ns2 <- noBrackets(ns)
                res <- get_attr(attrs, DOM, ns2)
                df <- data.frame(res)
                rm(res)
                rm(DOM)
                n = names(df)
                for (r in 1:length(n)) {
                  name <- n[r]
                  df[name] <- as.vector(unlist(df[name]))
                }
            }
        }
        else if (format == "csv") {
            tf <- do.call(getURL, append(list(url = paste(url, 
                "?", param, "=", gsub("\\+", "%2B", URLencode(query, 
                  reserved = TRUE)), extrastr, sep = "")), curl_args))
            df <- do.call(readCSVstring, append(list(tf, blank.lines.skip = TRUE, 
                strip.white = TRUE), parser_args))
            if (!is.null(ns)) 
                df <- dropNS(df, ns)
        }
        else if (format == "tsv") {
            tf <- do.call(getURL, append(list(url = paste(url, 
                "?", param, "=", gsub("\\+", "%2B", URLencode(query, 
                  reserved = TRUE)), extrastr, sep = "")), curl_args))
            df <- do.call(readTSVstring, append(list(tf, blank.lines.skip = TRUE, 
                strip.white = TRUE), parser_args))
            if (!is.null(ns)) 
                df <- dropNS(df, ns)
        }
        else {
            cat("unknown format \"", format, "\"\n\n", sep = "")
            return(list(results = NULL, namespaces = ns))
        }
        list(results = df, namespaces = ns)
    }
    else if (update != "") {
        if (param == "") {
            param <- "update"
        }
        extra[[param]] <- update
        do.call(postForm, append(list(url, .params = extra), 
            curl_args))
    }
}
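
A much smaller replacement could send the query with httr and request CSV results, which parse directly into a data.frame. A sketch under the assumption that CSV output suffices for the SELECT queries the package needs:

library(httr)

sparql_select <- function(endpoint, query) {
  resp <- GET(
    endpoint,
    query = list(query = query),
    add_headers(Accept = "text/csv")
  )
  stop_for_status(resp)
  read.csv(
    text = content(resp, as = "text", encoding = "UTF-8"),
    stringsAsFactors = FALSE
  )
}

# e.g. sparql_select("http://de.dbpedia.org/sparql", "SELECT * WHERE { ?s ?p ?o } LIMIT 5")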

Get Wikidata ID for DBpedia URI

I would have thought that we access the DBpedia SPARQL endpoint for that purpose, but the SPARQL R package has been archived at CRAN.

There is the package 'WikidataQueryServiceR' to send queries to the Wikidata SPARQL endpoint. But will it work to get Wikidata IDs from DBpedia-URIs this way?
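
Alternatively, the DBpedia endpoints expose the link via owl:sameAs, so the Wikidata ID can be obtained with a query along these lines (a sketch mirroring the SPARQL template used further down in this README; it assumes a SPARQL helper such as the one sketched in the previous issue):

query <- '
SELECT DISTINCT ?item ?wikidata_uri WHERE {
  VALUES ?item { <http://de.dbpedia.org/resource/Berlin> }
  ?item owl:sameAs ?wikidata_uri
  FILTER(regex(str(?wikidata_uri), "www.wikidata.org"))
}'

# e.g. sparql_select("http://de.dbpedia.org/sparql", query)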

`get_dbpedia_uris()` fails if there are no named entities in the document

When processing very short subcorpora (for example paragraphs from GermaParl, see the current example in the README), it is possible that these documents do not contain a single named entity. If named entities are used in the s_attribute argument, get_dbpedia_uris() will fail in this case because merging potentially retrieved URIs with the empty data.table of non-existing named entities does not work.

Error when querying http://dbpedia.org/sparql

In the other examples, we use the German endpoint http://de.dbpedia.org/sparql . Trying to develop an English example, I get an error from the English SPARQL endpoint.

library(dbpedia)
library(quanteda)
library(dplyr)

options(dbpedia.lang = "en")
options(dbpedia.endpoint = "http://api.dbpedia-spotlight.org/en/annotate")

uritab <- data_char_ukimmig2010 |>
  corpus() |>
  get_dbpedia_uris() %>% 
  add_wikidata_uris(endpoint = "x", progress = TRUE) %>% 
  wikidata_query(id = "P31")

Opening and ending tag mismatch: hr line 5 and body
Opening and ending tag mismatch: body line 3 and html
Premature end of data in tag html line 1
Error: 1: Opening and ending tag mismatch: hr line 5 and body
2: Opening and ending tag mismatch: body line 3 and html
3: Premature end of data in tag html line 1

Internally, the error results from this line of the SPARQL::SPARQL() function:

tf <- do.call(
getURL, append(list(url = paste(url, "?", param, "=", gsub("\\+", "%2B", URLencode(query,  reserved = TRUE)), extrastr, sep = ""), httpheader = c(Accept = "application/sparql-results+xml")), 
curl_args))

Consider splitting longer text into chunks instead of truncating it

Longer documents (approximately over 5600 characters) are currently truncated. That means that one part of the document is not passed to DBpedia Spotlight, i.e. not annotated at all.

Instead of truncating, it should be considered to split the text into chunks if it is too long.

chunk_text() of the tokenizers R package could facilitate this. In get_dbpedia_uris() for character vectors this could be implemented along the lines of

if (nchar(x) > max_len) {
  if (verbose) cli_alert_info(
    "input text has length {nchar(x)}, creating chunks of 500 tokens."
  )

  x <- chunk_text(x, chunk_size = 500, lowercase = FALSE, strip_punct = FALSE)
}

x then becomes a list of character vectors which could be sent to DBpedia Spotlight one by one.

Ideally, the output would still refer to the entire input document, so care must be taken when combining the individually returned results from DBpedia Spotlight.
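
A rough sketch of what combining per-chunk results might look like (hypothetical; it assumes that shifting the start positions by the cumulative chunk lengths plus one separating space approximates the original character offsets, which would need to be verified against the actual chunking behavior):

library(dbpedia)
library(tokenizers)
library(data.table)

x <- paste(rep("Berlin is the capital city of Germany.", 200), collapse = " ")

chunks <- unlist(chunk_text(x, chunk_size = 500, lowercase = FALSE, strip_punct = FALSE))
offsets <- cumsum(c(0L, head(nchar(chunks) + 1L, -1L)))  # character offset of each chunk

results <- Map(
  function(chunk, offset) {
    dt <- get_dbpedia_uris(
      x = chunk,
      language = getOption("dbpedia.lang"),
      api = getOption("dbpedia.endpoint")
    )
    if (nrow(dt) > 0L) dt[, start := start + offset]  # shift back to document coordinates
    dt
  },
  chunks, offsets
)

rbindlist(results)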

Undefined global function 'types' in function `map_types_to_class()`

See the following output when running R CMD check on dbpedia v0.1.1.9008

  • checking R code for possible problems ... NOTE
    map_types_to_class: no visible binding for global variable ‘types’
    get_dbpedia_uris,character: no visible binding for global variable
    ‘types’
    get_dbpedia_uris,corpus: no visible binding for global variable ‘types’
    get_dbpedia_uris,subcorpus: no visible binding for global variable
    ‘types’
    Undefined global functions or variables:
    types
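
The usual way to silence this NOTE for column names used in non-standard evaluation is a globalVariables() declaration in the package code, e.g.:

# e.g. in R/zzz.R
utils::globalVariables("types")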

Potentially unintended consequences of `limit` in `dbpedia_get_wikidata_uris()`

The parameter limit might have unintended consequences, at least when following the current documentation of the dbpedia_get_wikidata_uris() function.

Take this as an example:

wikidata_uris <- dbpedia_get_wikidata_uris(
  x = c("http://dbpedia.org/resource/London", "http://dbpedia.org/resource/Washington,_D.C."),
  endpoint = "https://dbpedia.org/sparql/",
  wait = 5,
  limit = 2,
  progress = TRUE
)

In this example, the two queries are processed as one chunk, i.e. in one query sent to the endpoint. Although both items have Wikidata IDs associated with them in DBpedia, only Wikidata IDs for the first item are returned.

At first glance, this might be expected behavior. The limit argument is used as a parameter of the query and controls the number of results returned by the server. If it is set to 2, and the first item in the query has more than one Wikidata ID in the "sameAs" property (which it does in this example), then all returned Wikidata IDs will be for this first item only.

However, limit is also used to split the input vector, i.e. the URIs in x into chunks. This is also how the argument is documented in the package. This is why both URIs are passed in a single SPARQL query which includes the single limit argument for both items.

In the case above, a larger value for limit would solve the problem as it would allow all values for both items to be returned.

But I think that using limit for both purposes - in the query and for chunking the input vector - might be confusing and should be reconsidered.
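
One direction could be to decouple the two roles, e.g. with a separate chunksize argument for splitting the input (as already used by add_wikidata_uris() in other issues here), while limit only caps the number of rows returned per query. A trivial sketch of the splitting part (hypothetical, not current package behavior):

chunksize <- 100L
uri_chunks <- split(x, ceiling(seq_along(x) / chunksize))  # x: vector of DBpedia URIs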

Warnings caused by overlapping annotations when processing CWB corpora

Issue

DBpedia Spotlight can return multiple entity annotations for the same token. In issue #42, I described the general issues with this. In one scenario, the overlapping entities share the same starting position. This is problematic for CWB corpora.

See the following example:

library(polmineR)
library(dbpedia)

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Heinrich von Brentano") |>
  subset(protocol_date == "1960-06-22") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]

get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

There are warnings stating that

Warning: longer object length is not a multiple of shorter object length

Likely Cause

This seems to be due to these lines in get_dbpedia_uris():

dbpedia/R/dbpedia.R

Lines 610 to 620 in f4dc779

tab <- links[,
  list(
    cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]],
    cpos_right = expand_fun(.SD),
    dbpedia_uri = .SD[["dbpedia_uri"]],
    text = .SD[["text"]],
    types = .SD[["types"]]
  ),
  by = "start",
  .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

DBpedia Spotlight adds different URIs to overlapping spans of tokens which share the same starting position. Since the starting position is the same for both annotations, dt[.SD[["start"]] == dt[["start"]]] is true for more than one token in the subcorpus. This causes the warning.

Possible solution

If we do not want to encode overlapping entities (in CWB corpora at least), we have to decide which annotation to keep and which to omit. Some options are already discussed in issue #42.

To circumvent the specific issue here, the first step would be to check whether there are multiple annotations for a single token (span). For this, a check like

if (any(table(resources$start) > 1)) 

could be added before resources is reduced to resources_min in get_dbpedia_uris() for subcorpora.

Then, it would be possible to introduce an argument which describes what to do in these cases.

Discussion

I am not sure about argument names and defaults. In addition, since this happens very rarely (for GermaParl at least), instead of an additional argument for get_dbpedia_uris(), it could also be considered to have an option which sets the default behavior in such cases. But I am not sure if that is good practice.

As discussed in issue #42, there might be other options.

Unexpected return values of `get_dbpedia_uris()` if `s_attribute` is NULL

In the following example, I simply take the first paragraph of the GERMAPARL2 corpus and send it to DBpedia Spotlight (running in Docker). When I use the named entities encoded in the corpus, this works as expected. When I use the default for s_attribute, i.e. s_attribute = NULL, the return value looks odd. There are a lot of rows and each token has a number of identical annotations and corpus positions.

Here is the example:

library(polmineR)
library(dbpedia) # v0.1.1.9002

# get first paragraph
paragraph <- corpus("GERMAPARL2") |>
  subset(protocol_date == "1949-09-07") |>
  split(s_attribute = "p", value = FALSE) |>
  _[[1]]


p_annotated_with_ne <- get_dbpedia_uris(
  x = paragraph,
  language = "de",
  s_attribute = "ne_type",
  verbose = interactive()
)

p_annotated_without_ne <- get_dbpedia_uris(
  x = paragraph,
  language = "de",
  s_attribute = NULL,
  verbose = interactive()
)

(See one of the answers in https://stackoverflow.com/questions/67799890/column-name-equivalent-for-r-base-pipe for the hint concerning the "_")

In the source code of get_dbpedia_uris() I see that everything looks as expected until if (is.null(s_attribute)){ ..., i.e. links and dt contain a reasonable number of rows.

I think, the comparison between links and dt is a bit off. I would assume that in the second element of the comparison, the dt is missing. So maybe (!) it would suffice to change

dbpedia/R/dbpedia.R

Lines 270 to 278 in 01c5d91

tab <- links[,
  list(
    cpos_left = dt[.SD[["start"]] == .SD[["start"]]][["id"]],
    cpos_right = dt[.SD[["end"]] == .SD[["end"]]][["id"]],
    dbpedia_uri = .SD[["dbpedia_uri"]], text = .SD[["text"]]
  ),
  by = "start",
  .SDcols = c("start", "end", "dbpedia_uri", "text")
]

to something like

tab <- links[,
                 list(
                   cpos_left = dt[.SD[["start"]] == dt[["start"]]][["id"]],
                   cpos_right = dt[.SD[["end"]] == dt[["end"]]][["id"]],
                   dbpedia_uri = .SD[["dbpedia_uri"]],
                   text = .SD[["text"]]
                 ),
                 by = "start",
                 .SDcols = c("start", "end", "dbpedia_uri", "text")
    ]

(note the additional dt instead of the .SD in the comparison)

I think that the result looks reasonable, but I did not double check this yet.

NAs in cpos_left in output for `get_dbpedia_uris()` for subcorpora

Issue

As discussed in issue #26, DBpedia Spotlight occasionally annotates entities which do not perfectly align with token spans. The fix discussed in issue #26 is incomplete, however.

Working with GermaParl, it became apparent that there are scenarios in which tokenization can be tricky for the left entity boundary as well. In phrases like "G-8-Gipfel" (which, in GermaParl is often tokenized in two tokens, "G" and "-8-Gipfel"), the entity identified by DBpedia Spotlight is "Gipfel" which starts in the middle of the token. This is an issue when we join tokens and entities based on their starting positions as the offset is different, thus leading to a "NA" value in the left corpus position of the entity.

Potential Solutions

If we want to address this, we could use the same approach as suggested for issue #26: Expand the span to the previous token boundary. For this, we could compare the starting positions of the entity and tokens and choose the previous token using an extended version of the expand_fun() auxiliary function introduced earlier:

expand_fun = function(.SD, direction) {
  if (direction == "right") {
    cpos_right <- dt[.SD[["end"]] == dt[["end"]]][["id"]]
    if (length(cpos_right) == 0 & isTRUE(expand_to_token)) {
      cpos_right <- dt[["id"]][which(dt[["end"]] > .SD[["end"]])[1]]
    } else {
      cpos_right
    }
  } else {
    cpos_left <- dt[.SD[["start"]] == dt[["start"]]][["id"]]
    if (length(cpos_left) == 0 & isTRUE(expand_to_token)) {
      cpos_vec <- which(dt[["start"]] < .SD[["start"]])
      cpos_left <- dt[["id"]][cpos_vec[length(cpos_vec)]]
    } else {
      cpos_left
    }
  }
}

This would make it necessary to adjust the following chunk as well:

tab <- links[,
             list(
               cpos_left = expand_fun(.SD, direction = "left"),
               cpos_right = expand_fun(.SD, direction = "right"),
               dbpedia_uri = .SD[["dbpedia_uri"]],
               text = .SD[["text"]],
               types = .SD[["types"]]
             ),
             by = "start",
             .SDcols = c("start", "end", "dbpedia_uri", "text", "types")
]

The possibility that there are incomplete annotations for "cpos_left" should be considered here as well:

if (isTRUE(drop_inexact_annotations) & any(is.na(tab[["cpos_right"]]))) {

Discussion

As with issue #26, this should be optional and comes with some conceptual considerations, in particular whether it always makes sense to expand the entity span to match the token span.

This might also not be very efficient as this is checked for each entity.

brackets of c() wrong in get_dbpedia_uris,subcorpus-method?

Just a suspicion: c() is meaningless here, but is it because the brackets are misplaced?

if (isTRUE(drop_inexact_annotations) & (any(is.na(tab[["cpos_right"]])) | any(is.na(tab[["cpos_left"]])))) {
    missing_cpos_idx <- unique(
      c(which(is.na(tab[["cpos_right"]])), which(is.na(tab[["cpos_left"]])))
    )
    cli_alert_warning(
      "Cannot map {length(missing_cpos_idx)} entit{?y/ies} exactly to tokenstream. Dropping {?it/them} from the annotation."
    )
    tab <- tab[-missing_cpos_idx, ]
  }

Alignment of entity and token spans (in CWB subcorpora)

When working with a CWB corpus, I noticed that some entities returned by DBpedia Spotlight are not properly mapped to the tokens of the corpus. The issue seems to be that the entity spans of DBpedia Spotlight do not always align with the token spans of the CWB corpus.

In particular, I noticed that get_dbpedia_uris() returns a data.table object with "NA" in the "cpos_right" column for some rows. In my observation, this concerns tokens with apostrophes. Their boundaries seem to be treated differently by DBpedia Spotlight than the tokenization within the CWB would suggest.

Example

To illustrate what happens here, consider the following example:

library(dbpedia)
txt <- "Berlin is Germany's capital city."
db_dt <- get_dbpedia_uris(txt)

It is assumed that "Germany's" is passed as one token to DBpedia Spotlight. DBpedia Spotlight identifies the character sequence "Germany" as an entity. It is returned as such with the correct "start" position.

Consequences for aligning entity spans and token spans in the CWB

This is an issue if the goal is to not only identify entities in the text but also to map them back onto pre-tokenized input data - for example a CWB tokenstream: This match is realized by comparing start and end positions of the entity and token spans. Although the entity and the corresponding token have the same starting position, their lengths differ and thus so do their end positions. Being unable to find the exact token in the tokenstream, "cpos_right" becomes "NA" in the return value of get_dbpedia_uris() when used with CWB input.

Aside from not being able to fully map the entities to tokens, this causes problems in as_subcorpus() which returns a subcorpus of unknown size. When this subcorpus is then used as annotation in polmineR's read() function, this results in an error.

Possible Solution

I think that there are potentially two elements to address this:

  • explicitly drop entities which cannot be matched, i.e. which have "NA" as their cpos_right
  • expand entity spans to match token spans

Discussion

Concerning the first issue: Currently, inexact matches are kept but due to their unknown boundaries, this annotation cannot be used to map them back to the pre-tokenized input. If this is the goal, then removing annotations with "NA" in the cpos_right column and making this explicit with a message can address this. However, there might be scenarios in which mapping entities back to individual tokens is not the goal in the first place. Then maybe I would like to keep the entities even if they are annotated on sub-token level - as I would implicitly when my input is not tokenized beforehand. This could be addressed with a logical argument.

Concerning the second point, I think, for CWB (sub)corpora which are processed without using any pre-annotated named entity spans (i.e. without the argument s_attribute), an option would be to expand the identified entity span to the end of the token in which the entity ends, if necessary. In the example above, the entity annotation of "Germany" could be expanded to the end of the token "Germany's". In CWB subcorpora, this would avoid "NA" in the cpos_right column as the character offsets would match. Preliminary testing suggests that this is feasible. However, I think that this should be regarded as an optional feature instead of the default behavior. This seems to work well for the observed issue with apostrophes but there might be cases in which this expansion does not work.

So, essentially it could be worth considering whether to introduce two additional arguments to get_dbpedia_uris() for subcorpora:

  • drop_inexact_annotations: A logical value - Whether to drop annotations if entity and token spans do not align exactly
  • expand_to_token: A logical value - Whether diverging entity spans are expanded to match the next complete token boundary

Regarding usability, too many arguments should be avoided, however.

HTTP stream was not closed cleanly before end of the underlying stream

On CI I see the following error.

Error: Error: R CMD check found ERRORs
Execution halted

options(dbpedia.lang = "en")
options(dbpedia.endpoint = "http://api.dbpedia-spotlight.org/en/annotate")

httr::set_config(httr::config(ssl_verifypeer = 0L))

uritab <- data_char_ukimmig2010 |>
  corpus() |>
  get_dbpedia_uris(progress = TRUE) %>% 
  add_wikidata_uris(endpoint = "https://dbpedia.org/sparql/", progress = TRUE, chunksize = 100) %>% 
  wikidata_query(id = "P31")

error in evaluating the argument 'x' in selecting a method for function 'wikidata_query': HTTP/2 stream 1 was not closed cleanly before end of the underlying stream

wikidata_query() join issue

library(dbpedia)
library(quanteda)

options(dbpedia.lang = "en")
options(dbpedia.endpoint = "http://api.dbpedia-spotlight.org/en/annotate")

httr::set_config(httr::config(ssl_verifypeer = 0L))

uritab <- data_char_ukimmig2010 |>
  corpus() |>
  get_dbpedia_uris(progress = TRUE) %>% 
  add_wikidata_uris(endpoint = "https://dbpedia.org/sparql/", progress = TRUE, limit = 50) %>% 
  wikidata_query(id = "P31")

Error in h(simpleError(msg, call)) :
error in evaluating the argument 'x' in selecting a method for function 'wikidata_query': Join results in 584 rows; more than 556 = nrow(x)+nrow(i). Check for duplicate key values in i each of which join to the same group in x over and over again. If that's ok, try by=.EACHI to run j for each group to avoid the large allocation. If you are sure you wish to proceed, rerun with allow.cartesian=TRUE. Otherwise, please search for this error message in the FAQ, Wiki, Stack Overflow and data.table issue tracker for advice.

Handling overlapping annotations by DBpedia Spotlight

Issue

Occasionally, DBpedia Spotlight returns overlapping annotations.

Take the following example:

library(dbpedia)

doc <- "Der Deutsche Bundestag tagt in Berlin."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint") # German endpoint
)

In phrases such as

"Der Deutsche Bundestag"

(found in GermaParl) both the entities "Der Deutsche Bundestag" and "Bundestag" are annotated. They might share the same URI but do not need to. Depending on the input format, this might cause different issues. For character vectors, this at least overestimates the number of unique entities (in the example above, there is only one instance of "Bundestag" but if we count the two URIs as two instances, this would not be correct in most cases). For CWB corpora, we currently do not have a way to encode these overlapping annotations.

In this issue, I'll demonstrate three variations of overlapping entity annotations. I think that the technical solution might be similar for all three scenarios. There are conceptual aspects to be discussed. The following considerations follow the assumption that we do not want to keep overlapping annotations but resolve these to a single annotation. Other solutions could be considered here.

Embedded Annotations

In the example above, "Der Deutsche Bundestag", one entity is completely embedded in the other. This could be resolved by controlling for overlapping entities and limiting the output to either the entity included in all annotations ("Bundestag"), the longest entity ("Der Deutsche Bundestag") or, using the scores provided by DBpedia Spotlight, the most "similar" (in terms of confidence) entity. Are there better options? This could be either controlled by an additional argument in get_dbpedia_uris() or maybe an option. I am not sure what constitutes good practice here.

Overlapping Entities

While in the example above, one entity is part of another, there are other examples in which the annotations merely overlap. I found an example for this in a speech by Angela Merkel (PlPr 16/46, page 4479; https://dserver.bundestag.de/btp/16/16046.pdf; abbreviated for this example):

"Die Mauer fiel

In this example, DBpedia Spotlight identifies two entities: "Die Mauer" and "Mauer fiel". They are both referring to the same URI. See the following chunk:

doc <- "Die Mauer fiel"

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint") # German endpoint
)

Similar to the issue above, if we would only count the number of URIs, the number of references to "Berliner Mauer" would be overestimated as it is counted twice although the term only occurs once.

Here, resolving these overlapping entities to one annotation seems to be more complicated than above: Which one is the more correct one? Combining both entities, the entity would be "Die Mauer fiel" which might be artificial. It would also be possible to reduce the entity to the tokens occurring in both overlapping spans (i.e. "Mauer"). Might this be more appropriate? This would be applicable to the embedded entities above, but does this always work as expected?

Interestingly, as_subcorpus() in combination with read() seems to work just fine (at least as long as the URI is the same for both parts of the overlap):

sc <- corpus("GERMAPARL2") |>
  subset(speaker_name == "Angela Merkel") |>
  subset(protocol_date == "2006-09-06") |>
  as.speeches(s_attribute_name = "speaker_name",
              s_attribute_date = "protocol_date",
              gap = 50) |>
  _[[1]]


speech_annotation <- get_dbpedia_uris(
  x = sc,
  language = getOption("dbpedia.lang"),
  max_len = 5600L,
  confidence = 0.35,
  support = 20,
  api = getOption("dbpedia.endpoint"), # German endpoint
  verbose = FALSE,
  expand_to_token = TRUE
)

read(sc,
     annotation = as_subcorpus(speech_annotation))

Overlapping Entities with the same starting position

This is a specific case of the first variation of the issue: It is possible that an entity is embedded in another entity but they both share the same starting position. In the following example (taken from a speech by Heinrich von Brentano in the Bundestag; PlPr. 3/118 page 6801; https://dserver.bundestag.de/btp/03/03118.pdf), this becomes apparent:

doc <- "Ölbild Kaiser Wilhelms I."

uri_table <- get_dbpedia_uris(
  x = doc,
  language = getOption("dbpedia.lang"),
  api = getOption("dbpedia.endpoint")
)

In this case, "Kaiser Wilhelms I." and "Kaiser" are both annotated as entities. They also have different URIs assigned to them.

Since this also results in warnings when applied to CWB corpora, I will create a separate issue for this scenario.

Possible Solution

Assuming that overlapping entities might not be encoded, it becomes necessary to determine how to handle these overlaps. What I can imagine is an option or an argument that states whether the shortest (or the actual overlapping token?), the longest or the most similar entity should be kept. This terminology of "longest" and "shortest" is somewhat inspired by the CWB manual for CQP queries - it probably should be checked how this is handled in other tools as well.

In the examples above, this would mean something like

Entity                      Shortest / Overlapping Entity  Longest Entity          Most Similar Entity
[Der Deutsche [Bundestag]]  Bundestag                      Der Deutsche Bundestag  Der Deutsche Bundestag
[Die [Mauer] fiel]          Mauer                          Die Mauer fiel          Mauer fiel
[[Kaiser] Wilhelms I.]      Kaiser                         Kaiser Wilhelms I.      Kaiser Wilhelms I.

Notes:

  • in the "Entity" column, the separate token spans are represented with pairs of squared brackets, i.e. "Der Deutsche Bundestag" is a span and "Bundestag" is a span, "Die Mauer" is a span and "Mauer fiel" is a span, etc.
  • the entity in the column "Most Similar" is based on the "similarityScore" column in the resources data.table retrieved from DBpedia Spotlight. These values can be very close.

Discussion

The question is how this behavior should be handled.

  • When is this behavior problematic?
  • Should it be addressed as an argument or option for get_dbpedia_uris()?
  • Alternatively, the return value might contain both overlapping annotations, which would have to be filtered later on.
  • What would the arguments and defaults look like?

api.dbpedia-spotlight.org SSL certificate has expired

When running this code ...

library(quanteda)
library(dplyr)

options(dbpedia.lang = "en")
options(dbpedia.endpoint = "http://api.dbpedia-spotlight.org/en/annotate")

uritab <- data_char_ukimmig2010 |>
  corpus() |>
  get_dbpedia_uris()

I get this error:

Error in curl::curl_fetch_memory(url, handle = handle) :
SSL peer certificate or SSH remote key was not OK: [api.dbpedia-spotlight.org] SSL certificate problem: certificate has expired

Minimal example for README or vignette

library(dbpedia)
library(magrittr)
library(dplyr)

txt <- "Christian Drosten arbeitet an der Charité in Berlin."

dbp <- get_dbpedia_uris(txt)

template <- 'SELECT distinct ?item ?wikidata_uri
      WHERE {
      VALUES ?item {%s}
      ?item owl:sameAs ?wikidata_uri
      %s
      FILTER(regex(str(?wikidata_uri), "www.wikidata.org" ) )}
      LIMIT %d'


wiki <- dbp %>% 
  pull(dbpedia_uri) %>% 
  dbpedia_get_wikidata_uris(
    optional = "municipalityCode",
    endpoint = "http://de.dbpedia.org/sparql",
    wait = 0.5,
    limit = 100,
    progress = TRUE
  )

wikidata_query() throws warning

Rows: 28 Columns: 4
── Column specification ─────────────────────────────────────────────────────────────────────────────────────
Delimiter: ","
chr (2): item, label
lgl (2): key, keyLabel

ℹ Use spec() to retrieve the full column specification for this data.
ℹ Specify the column types or set show_col_types = FALSE to quiet this message.
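
The output above is readr's column-specification message rather than an actual warning raised by wikidata_query() itself. A minimal sketch of how it can be silenced, assuming the SPARQL result is parsed with readr::read_csv() somewhere along the way:

library(readr)

# illustrative CSV with the column layout from the message above
csv <- "item,label,key,keyLabel\nhttp://www.wikidata.org/entity/Q64,Berlin,,\n"

# per call: suppress the column specification message
read_csv(I(csv), show_col_types = FALSE)

# or globally, e.g. at the top of a script
options(readr.show_col_types = FALSE)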

Remotes syntax in DESCRIPTION might interfere with installation from GitHub

If I am not mistaken, the naming of dependencies under "Remotes" in the DESCRIPTION file does not follow the syntax suggested in the vignette of the remotes package (https://cran.r-project.org/web/packages/remotes/vignettes/dependencies.html), potentially causing issues.

dbpedia/DESCRIPTION

Lines 32 to 35 in 01c5d91

Remotes:
polmineR=github::PolMine/polmineR@dev,
SPARQL=url::https://cran.r-project.org/src/contrib/Archive/SPARQL/SPARQL_1.16.tar.gz,
GermaParl2=url::https://polmine.github.io/drat/src/contrib/GermaParl2_2.0.0.tar.gz
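
If the package-name prefixes are indeed the problem, the entries would presumably follow the plain [type::]spec form shown in the remotes vignette, i.e. something like:

Remotes:
    github::PolMine/polmineR@dev,
    url::https://cran.r-project.org/src/contrib/Archive/SPARQL/SPARQL_1.16.tar.gz,
    url::https://polmine.github.io/drat/src/contrib/GermaParl2_2.0.0.tar.gz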

dbpedia_links fails if no links found?

This is my interpretation of why the following code fails. There should be an informative message if no links/matches are found.

doc <- corpus("NADIRASZ") %>% 
   subset(article_id == "A24554563")
dbpedia_links <- get_dbpedia_links(x = doc, language = "de")
✔ convert input to `AnnotatedPlainTextDocument` [84ms]
✔ send request to DBpedia Spotlight [293ms]
✔ parse result [41ms]

Warning message:
In stri_sub(str = s, from = start, to = end) :
argument is not an atomic vector; coercing

Perhaps suggest a waiting period when the DBpedia Spotlight Docker container is run for the first time

If you run the Docker container for the very first time as suggested by the README (see chunk below), the container starts by downloading the language model. I think it is important that the container is not queried while it is doing so; otherwise the download seems to get interrupted, leaving an unusable setup. Waiting for a couple of minutes on the first startup worked for me, at least.

dbpedia/README.Rmd

Lines 48 to 53 in 01c5d91

docker run -tid \
--restart unless-stopped \
--name dbpedia-spotlight.de \
--mount source=spotlight-model,target=/opt/spotlight \
-p 2222:80 \
dbpedia/dbpedia-spotlight spotlight.sh de

I am not sure whether this is a general problem, but a hint in the README that the container might only be ready to use a few minutes after its first start could be useful for other users as well.

`entity_types_map()` does not work reliably

This concerns the version of the package on the entity_types branch. The line

type_list <- unlist(el, recursive = FALSE)

which was used in a similar function earlier causes issues in entity_types_map(). This is already anticipated in the comments surrounding this line. Long story short: earlier, a list of lists was passed to the function (as the object corresponding to el here). Now el is already a flat list, so it does not need to be unlisted. If it is unlisted, the names of the resulting character vector probably won't match the mapping_vector. In consequence, all entities in the data.table are assigned to the category defined by the argument other.

To address this, I think it would suffice to omit the line quoted above and to use el directly instead of type_list in the following chunk:

dbpedia/R/entity_types.R

Lines 81 to 87 in f4dc779

types_with_class_raw <- lapply(
seq_along(type_list),
function(i) {
list_name <- names(type_list)[[i]]
list_elements <- type_list[[i]]
paste0(list_name, ":", list_elements)
})

Then, the assignment should work as expected.
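
Applied to the chunk quoted above, the fix would look roughly like this (a sketch, using el directly as proposed; the contents of el below are purely illustrative):

# `el` stands in for the named list of entity types as already available in
# entity_types_map() - illustrative values only
el <- list(
  DBpedia = c("Person", "OfficeHolder"),
  Wikidata = c("Q5")
)

types_with_class_raw <- lapply(
  seq_along(el),
  function(i) {
    # `el` is already a flat named list, so no unlist() is needed
    list_name <- names(el)[[i]]
    list_elements <- el[[i]]
    paste0(list_name, ":", list_elements)
  }
)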

Two notes here:

  • It might be worth considering renaming the resulting column in the data.table from "class" to "category", as "class" is indeed unclear.
  • A similar issue occurs with the "old" map_type_to_class() function. There, it results in a different outcome: all classes are assigned to all entities, as currently visible in the README. This could be addressed in a similar fashion, but if the function is going to be replaced by entity_types_map(), fixing entity_types_map() seems more important.

Parameter `types` for `get_dbpedia_uris()`

Related to Issue #30, there is a parameter types which can be passed to the DBpedia Spotlight service. Its use is shown in an example on the DBpedia Spotlight GitHub wiki: https://github.com/dbpedia-spotlight/dbpedia-spotlight/wiki/User's-manual.

The parameter limits the annotated entities to those of the declared types, i.e. with types=Person, only persons are returned.

This could be added to get_dbpedia_uris(). As in issue #30, I think that adding it to the query parameters of the GET request (as types = "Person,Organisation") should work; a rough sketch follows below.
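
At the HTTP level, this could look roughly like the following sketch (hypothetical; it only mirrors how the additional query parameter might be passed inside get_dbpedia_uris(), using the public English endpoint and the types from the wiki example):

library(httr)

resp <- GET(
  "http://api.dbpedia-spotlight.org/en/annotate",
  query = list(
    text = "Angela Merkel spoke in the Bundestag in Berlin.",
    confidence = 0.35,
    types = "Person,Organisation"  # restrict annotations to these DBpedia ontology types
  ),
  add_headers(Accept = "application/json")
)
content(resp)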

I briefly tested this and I think that the values correspond to the types in the DBpedia ontology. DBpedia Spotlight also returns other types, for example from Wikidata, but passing these to the types parameter does not seem to work.

On a related note, a lot of annotated entities do not have any types.

In a very preliminary test, I also checked whether there was a difference between filtering results by type when querying DBpedia Spotlight (i.e. with "types=Person" in the query) and filtering the results afterwards (i.e. without the types parameter in the query). As far as I can see, the results are the same.

if() conditions comparing class() to string

Running tests on GitHub Actions, I see (macOS):

  • checking R code for possible problems ... [16s/17s] NOTE
    Found if() conditions comparing class() to string:
    File ‘dbpedia/R/dbpedia.R’: if (class(nodes) == "xml_nodeset") ...
    Use inherits() (or maybe is()) instead.
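
A minimal illustration of the change the NOTE asks for (using xml2 directly, not the package's actual code):

library(xml2)

nodes <- xml_find_all(read_xml("<doc><p>Berlin</p></doc>"), ".//p")

# instead of: if (class(nodes) == "xml_nodeset") ...
if (inherits(nodes, "xml_nodeset")) {
  message("nodes is an xml_nodeset")
}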
