
Comments (16)

jcsahnwaldt commented on August 26, 2024

(Is it possible to run the extractor on a single wikipedia page, removing
the need for a full fledged MW/MySQL installation?)

http://mappings.dbpedia.org/server/extraction/en/

Example:

http://mappings.dbpedia.org/server/extraction/en/extract?title=Azulene

It's only the mapping-based extractor and the label extractor, though.
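
For scripting this against more than one page, here is a minimal sketch of building the request URL. The base URL and the `title` parameter come from the example above; the `ExtractionUrl` object, the `forTitle` helper, and the space-to-underscore replacement are assumptions made for illustration, not part of the framework.

```scala
import java.net.URLEncoder
import java.nio.charset.StandardCharsets

// Hypothetical helper; only the URL pattern is taken from the thread.
object ExtractionUrl {
  // Base URL as quoted above; this sketch does not check that the server is reachable.
  val base = "http://mappings.dbpedia.org/server/extraction"

  // Builds the extract URL for one page title in the given language edition.
  // Assumption: spaces are written as underscores, as in MediaWiki page URLs.
  def forTitle(lang: String, title: String): String = {
    val encoded = URLEncoder.encode(title.replace(' ', '_'), StandardCharsets.UTF_8.name())
    s"$base/$lang/extract?title=$encoded"
  }
}
```

Fetching the result could then be as simple as `scala.io.Source.fromURL(ExtractionUrl.forTitle("en", "Azulene")).mkString`, assuming the server responds.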

from extraction-framework.

egonw commented on August 26, 2024

OK, cool. When you say that "only the mapping based extractor..." is supported, does that mean everything in this file:

https://raw.github.com/dbpedia/extraction-framework/live/mappings/Mapping_en.xml

Right?

And can I also easily run that "extract" script locally, with a patched version of this Mapping_en.xml?


jcsahnwaldt commented on August 26, 2024

Looks like the identifiers were moved to a sub-template: http://en.wikipedia.org/wiki/Template:Chembox_Identifiers
The mapping-based extractor only extracts stuff from the main template.


egonw commented on August 26, 2024

OK, got it. That's not an easy fix, I assume...


jcsahnwaldt commented on August 26, 2024

It would be an easy fix: add mappings for Chembox_Identifiers and a few others to the mappings wiki, and change a few lines in MappingExtractor.scala. Currently, we only extract data from the top-level template and ignore nested templates. It wouldn't be hard to change that. I just don't know what such a change might break.


jcsahnwaldt commented on August 26, 2024

As for running the extraction locally: Mapping_en.xml is a copy of the mappings on the wiki and not up to date, but depending on the configuration, the extractor downloads the current mappings anyway. It's not hard to run the extraction locally. The main problem may be that it's currently not easy to configure the extraction to extract just a few pages from Wikipedia. We usually run the extraction on the whole dump (millions of pages). The configuration is rather inflexible, so it's hard to change the desired page source.


egonw commented on August 26, 2024

"It would be an easy fix:"... oh, then please enable processing the Chembox_Identifiers... we'll see from the existing mappings if that works, and then I could simply focus on the additional mappings.

What I understand is that I should edit the wiki page rather than the Mapping_en.xml? But at the moment I cannot edit the wiki. I just created an account: http://mappings.dbpedia.org/index.php/User:Egonw

Additional identifiers I am interested in include the InChIKey, Standard InChI, Standard InChIKey, and ChemSpider ID, but I guess I would just add all those defined in the Chembox_Identifiers template.

My interest comes from my involvement in BridgeDB and cheminformatics in general.


ninniuz commented on August 26, 2024

This would be an interesting enhancement.

Approaches I see are:

  1. Recursively analyse TemplateNode's children and add quads from sub-templates (they would be assigned to the same root resource URI in case the sub-templates are mapped to the same class of the root template node)

  2. Create a new PropertyMapping type (e.g. PropertyTemplateMapping) which defines how to map template properties whose value is itself a template

  3. Extend the current PropertyMapping with a recurse property which tells the MappingExtractor to look for the specified templateProperty in any of the template's children

1) is very easy to implement but could potentially break something (not 100% sure).
2-3) should take the same effort to develop, but 2) would require changes on the mappings server as well (define a new Template).
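
To make approach 3 more concrete, here is a rough sketch on a toy model. `Template`, `PropertyRule` and `RecurseSketch` are hypothetical names invented for this illustration and do not mirror the framework's real classes; the property name and value in the usage note are placeholders.

```scala
// Hypothetical, simplified model of a template with nested sub-templates.
final case class Template(
  name: String,
  properties: Map[String, String],
  children: List[Template] = Nil
)

// Hypothetical stand-in for a PropertyMapping carrying the proposed recurse flag.
final case class PropertyRule(templateProperty: String, recurse: Boolean = false)

object RecurseSketch {
  // Looks for the property on the template itself; only if the rule has
  // recurse set does it continue, depth-first, into nested templates.
  def find(t: Template, rule: PropertyRule): Option[String] =
    t.properties.get(rule.templateProperty).orElse(
      if (rule.recurse)
        t.children.iterator.map(child => find(child, rule)).collectFirst { case Some(v) => v }
      else None
    )
}
```

With a `Chembox` template that nests a `Chembox_Identifiers` child holding a (placeholder) `CASNo` value, `find` would return that value only when the rule is created with `recurse = true`; without the flag the lookup stays on the top-level template, which matches the current behaviour described earlier in the thread.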

@jcsahnwaldt what do you think?


egonw commented on August 26, 2024

On Mon, Nov 11, 2013 at 5:34 PM, Andrea Di Menna
[email protected] wrote:

  1. is very easy to implement but could potentially break something (not 100% sure)
    2-3) should take same effort to develop but 2) would require changes on the mappings server as well (define a new Template)

I'd be more than happy to provide test data for ChemBoxes and child templates.

I do not know the code base, but given the right template, could
possibly contribute to extracting identifiers from these ChemBoxes.
The learning curve into the DBPedia extraction system has stopped me
from making patches so far... that is, I have no clue how to run the
code locally, to test any patch I'd write...

Egon

Dr E.L. Willighagen
Postdoctoral Researcher
Department of Bioinformatics - BiGCaT
Maastricht University (http://www.bigcat.unimaas.nl/)
Homepage: http://egonw.github.com/
LinkedIn: http://se.linkedin.com/in/egonw
Blog: http://chem-bla-ics.blogspot.com/
PubList: http://www.citeulike.org/user/egonw/tag/papers
ORCID: 0000-0001-7542-0286


ninniuz commented on August 26, 2024

@jimkont what do you think about 1)?


jimkont commented on August 26, 2024

We already implemented something similar to (1) for separate templates;
maybe it makes sense to do it for nested ones too.
I don't think it will break anything, but I'm not 100% sure.

On Tue, Nov 19, 2013 at 7:25 PM, Andrea Di Menna
[email protected]:

@jimkont https://github.com/jimkont what do you think about 1)?



Kontokostas Dimitris


ninniuz commented on August 26, 2024

Should we use the main resource URI when a child template is mapped to the same class as the root template, or should we apply the subclass/superclass approach used in #4?


egonw commented on August 26, 2024

In the case of chemicals, templates are used to annotate properties of the central class...

E.g. for Azulene, the CAS/KEGG/ChemSpider/PubChem identifiers do not define subclasses, but are "properties" of azulene itself.


ninniuz commented on August 26, 2024

Thanks @egonw :-)
I was wondering how to process such templates when they are included in a main template (the Chembox example fits very well).

@jimkont I think simply replacing

if(graph.isEmpty)
{
  node.children.flatMap(child => extractNode(child, subjectUri, pageContext))
}
else
{
  graph
}

with

graph ++ node.children.flatMap(child => extractNode(child, subjectUri, pageContext))

in org.dbpedia.extraction.mappings.MappingExtractor#extractNode is enough.

We are simply going to collect potentially mappable subtemplates/subtables and delegate the quad creation to the Mapping instance (i.e. TemplateMapping, which already handles subclass/superclass or new URI creation, or TableMapping).
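
To illustrate the difference between the two variants of extractNode, here is a small self-contained sketch on toy data. `Node`, `Quad`, `mapNode`, `extractOld` and `extractNew` are hypothetical stand-ins invented for this example, not the framework's real types or signatures, and the set of "mapped" templates is made up.

```scala
// Toy stand-ins for the framework's node and quad types (hypothetical names).
final case class Node(name: String, children: List[Node] = Nil)
final case class Quad(subject: String, template: String)

object NestedTemplateSketch {
  // Assumption: these templates have a mapping; everything else is unmapped.
  val mapped: Set[String] = Set("Chembox", "Chembox_Identifiers")

  // Stand-in for the per-template Mapping: one quad if the template is mapped.
  def mapNode(node: Node, subjectUri: String): List[Quad] =
    if (mapped(node.name)) List(Quad(subjectUri, node.name)) else Nil

  // Behaviour before the change: descend only when the node itself yields nothing.
  def extractOld(node: Node, subjectUri: String): List[Quad] = {
    val graph = mapNode(node, subjectUri)
    if (graph.isEmpty) node.children.flatMap(child => extractOld(child, subjectUri))
    else graph
  }

  // Proposed behaviour: always descend and append the children's quads.
  def extractNew(node: Node, subjectUri: String): List[Quad] =
    mapNode(node, subjectUri) ++ node.children.flatMap(child => extractNew(child, subjectUri))

  // A mapped root template with one mapped and one unmapped nested template.
  val page: Node = Node("Chembox", List(Node("Chembox_Identifiers"), Node("Unmapped_Child")))
}
```

On this toy page, `extractOld` stops at the mapped root and yields only the `Chembox` quad, while `extractNew` also picks up the quad from the nested `Chembox_Identifiers` node. In this simplified model every quad reuses the root subject URI; in the real code that decision would be left to the Mapping instance, as described above.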


jimkont commented on August 26, 2024

I think we should try #4


ninniuz commented on August 26, 2024

My previous comment uses #4 (as a side effect of delegating quad building to TemplateMapping).
Can you be more specific? :D

