ChemBox extractor does not copy the identifiers? about extraction-framework HOT 16 OPEN

dbpedia commented on August 26, 2024

ChemBox extractor does not copy the identifiers?

from extraction-framework.

Comments (16)

jcsahnwaldt commented on August 26, 2024

(Is it possible to run the extractor on a single wikipedia page, removing
the need for a full fledged MW/MySQL installation?)

http://mappings.dbpedia.org/server/extraction/en/

Example:

http://mappings.dbpedia.org/server/extraction/en/extract?title=Azulene

It's only the mapping based extractor and the label extractor though.
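A minimal sketch of building that request URL for an arbitrary page title. The base URL is the one from the example above; the `ExtractUrl` object and `forTitle` helper are hypothetical names, and the space-to-underscore step assumes the usual MediaWiki title convention.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

object ExtractUrl {
  // Base URL copied from the example above; everything else here is illustrative.
  val Base = "http://mappings.dbpedia.org/server/extraction/en/extract"

  // Builds the request URL for one page title.
  // Spaces become underscores (MediaWiki convention), then the title is percent-encoded.
  def forTitle(title: String): String = {
    val normalized = title.replace(' ', '_')
    Base + "?title=" + URLEncoder.encode(normalized, StandardCharsets.UTF_8.name())
  }
}
```

If the server is reachable, something like `scala.io.Source.fromURL(ExtractUrl.forTitle("Azulene")).mkString` should return the extraction result for that single page.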

egonw commented on August 26, 2024

OK, cool. When you say that "only the mapping based extractor..." is supported, that means all in this file:

https://raw.github.com/dbpedia/extraction-framework/live/mappings/Mapping_en.xml

Right?

And can I also run that "extract" script locally easily, with a patched version of this Mapping_en.xml ?

jcsahnwaldt commented on August 26, 2024

Looks like the identifiers were moved to a sub-template: http://en.wikipedia.org/wiki/Template:Chembox_Identifiers
The mapping-based extractor only extracts stuff from the main template.

egonw commented on August 26, 2024

OK, got it. That's not an easy fix, I assume...

jcsahnwaldt commented on August 26, 2024

It would be an easy fix: add mappings for Chembox_Identifiers and a few others to the mappings wiki, and change a few lines in MappingExtractor.scala. Currently, we only extract data from the top-level template and ignore nested templates. It wouldn't be hard to change that. I just don't know what such a change might break.

jcsahnwaldt commented on August 26, 2024

As for running the extraction locally: Mapping_en.xml is a copy of the mappings on the wiki and not up to date, but depending on the configuration, the extractor downloads the current mappings anyway. It's not hard to run the extraction locally. The main problem may be that it's currently not easy to configure the extraction to extract just a few pages from Wikipedia. We usually run the extraction on the whole dump (millions of pages). The configuration is rather inflexible, so it's hard to change the desired page source.

egonw commented on August 26, 2024

"It would be an easy fix:"... oh, then please enable processing the Chembox_Identifiers... we'll see from the existing mappings if that works, and then I could simply focus on the additional mappings.

What I understand is that I should edit the wiki page rather than the Mapping_en.xml? But at the moment I cannot edit the wiki. I just created an account http://mappings.dbpedia.org/index.php/User:Egonw

Additional identifiers I am interested in include the InChIKey, Standard InChI, Standard InChIKey, and ChemSpider ID, but I guess I would just add all of those defined in the Chembox_Identifiers template.

My interest comes from my involvement in BridgeDB and cheminformatics in general.

ninniuz commented on August 26, 2024

This would be an interesting enhancement.

Approaches I see are:

1) Recursively analyse TemplateNode's children and add quads from sub-templates (they would be assigned to the same root resource URI in case the sub-templates are mapped to the same class as the root template node)
2) Create a new PropertyMapping type (e.g. PropertyTemplateMapping) which defines how to map template properties whose value is a template itself
3) Extend the current PropertyMapping with a recurse property which tells the MappingExtractor to look for the specified templateProperty in any of the template's children

1) is very easy to implement but could potentially break something (not 100% sure)
2-3) should take the same effort to develop but 2) would require changes on the mappings server as well (define a new Template)
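
To make the third approach listed above more concrete, here is a rough sketch of what a `recurse` lookup might look like. It uses a toy `Template` case class instead of the framework's real TemplateNode, and the `RecurseLookup.find` helper, the property names, and the placeholder value are all assumptions made for illustration.

```scala
// Toy stand-in for a parsed template; NOT the framework's TemplateNode API.
final case class Template(
  title: String,
  properties: Map[String, String],
  children: List[Template] = Nil
)

object RecurseLookup {
  // Looks up `key` on the template itself. When `recurse` is true and the key
  // is missing there, nested templates are searched depth-first.
  def find(node: Template, key: String, recurse: Boolean): Option[String] =
    node.properties.get(key).orElse {
      if (recurse)
        node.children.iterator
          .map(child => find(child, key, recurse = true))
          .collectFirst { case Some(value) => value }
      else None
    }

  // Hypothetical Chembox with the identifiers in a nested sub-template.
  val chembox = Template(
    "Chembox",
    Map("ImageFile" -> "Azulene.svg"),
    List(Template("Chembox Identifiers", Map("CASNo" -> "placeholder-cas")))
  )
}
```

With `recurse = false` the lookup of `CASNo` on the outer template finds nothing, which mirrors the behaviour reported in this issue; with `recurse = true` it returns the value from the nested template.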

@jcsahnwaldt what do you think?

egonw commented on August 26, 2024

On Mon, Nov 11, 2013 at 5:34 PM, Andrea Di Menna wrote:

1) is very easy to implement but could potentially break something (not 100% sure)
2-3) should take the same effort to develop but 2) would require changes on the mappings server as well (define a new Template)

I'd be more than happy to provide test data for ChemBoxes and child templates.

I do not know the code base, but given the right template, could
possibly contribute to extracting identifiers from these ChemBoxes.
The learning curve into the DBpedia extraction system has stopped me
from making patches so far... that is, I have no clue how to run the
code locally, to test any patch I'd write...

Egon

Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: 0000-0001-7542-0286

ninniuz commented on August 26, 2024

@jimkont what do you think about 1)?

jimkont commented on August 26, 2024

We already implemented something similar to (1) for separate templates, so maybe it makes sense to do it for nested ones too.
I don't think it will break anything, but I'm not 100% sure.

On Tue, Nov 19, 2013 at 7:25 PM, Andrea Di Menna wrote:

@jimkont what do you think about 1)?

Kontokostas Dimitris

ninniuz commented on August 26, 2024

Should we use the main resource URI in case a child template is mapped to the same class as the root template, or should we apply the subclass/superclass approach used in #4?

egonw commented on August 26, 2024

In the case of chemicals, the templates are used to annotate properties of the central class...

E.g. for Azulene, the CAS/KEGG/ChemSpider/PubChem identifiers do not define subclasses, but are "properties" of azulene itself.

ninniuz commented on August 26, 2024

Thanks @egonw :-)
I was wondering how to process such templates when they are included in a main template (the Chembox example fits very well).

@jimkont I think simply replacing

if(graph.isEmpty)
{
  node.children.flatMap(child => extractNode(child, subjectUri, pageContext))
}
else
{
  graph
}

with

graph ++ node.children.flatMap(child => extractNode(child, subjectUri, pageContext))

in org.dbpedia.extraction.mappings.MappingExtractor#extractNode is enough.

We are simply going to collect potentially mappable subtemplates/subtables and delegate the quad creation to the Mapping instance (i.e. TemplateMapping, which already handles subclass/superclass or new URI creation, or TableMapping).
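
The difference between the two variants can be illustrated with a toy model. This is only a sketch: `Node`, `extractOld` and `extractNew` are hypothetical stand-ins, quads are plain strings, and the real `extractNode` also takes a subject URI and page context that are omitted here.

```scala
// Toy model: a node yields "quads" (plain strings here) for its own mapped
// properties and may contain nested nodes. Not the real framework types.
final case class Node(name: String, ownQuads: List[String], children: List[Node] = Nil)

object ExtractNodeSketch {
  // Current behaviour as quoted above: descend into children only when the
  // node itself produced nothing.
  def extractOld(node: Node): List[String] = {
    val graph = node.ownQuads
    if (graph.isEmpty) node.children.flatMap(extractOld) else graph
  }

  // Proposed behaviour: keep the node's own quads and always add the
  // quads extracted from its children.
  def extractNew(node: Node): List[String] =
    node.ownQuads ++ node.children.flatMap(extractNew)

  // Hypothetical Chembox whose identifiers live in a nested sub-template.
  val chembox = Node(
    "Chembox",
    List("Azulene name"),
    List(Node("Chembox Identifiers", List("Azulene casNumber")))
  )
}
```

In this model `extractOld(chembox)` yields only the top-level quad, while `extractNew(chembox)` also includes the quad from the nested identifiers template. Whether the extra quads cause problems for other mapped templates is exactly the open question raised earlier in this thread.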

jimkont commented on August 26, 2024

I think we should try #4

ninniuz commented on August 26, 2024

My previous comment uses #4 (as a side effect of delegating quad building to TemplateMapping).
Can you be more specific? :D
