projekt-opal / classification

Classification of DCAT themes using decision trees and TF-IDF (D3.4)

License: GNU Affero General Public License v3.0


OPAL Classification

This component predicts objects of the dcat:theme predicate of dcat:Dataset subjects, based on the dc:description property. The implementation was done using WEKA (https://github.com/Waikato/weka-3.8).

You can build and run the application with the default values using:

    mvn clean install
    mvn exec:java -Dexec.mainClass="tools.Main" -Dexec.args="-c j48 -ngrams 1" -Dexec.cleanupDaemonThreads=false

The results of the cross-validation on the training data and of the evaluation on the test data are printed to the console.

The following arguments can be provided:
-c {naive, j48}, classifier, default is j48
-ngrams {1,...,n}, default is 1
-query, SPARQL query, default is: SELECT * WHERE { ?s a <http://www.w3.org/ns/dcat#Dataset> ; <http://www.w3.org/ns/dcat#theme> ?o ; <http://purl.org/dc/terms/description> ?d FILTER ( lang(?d) = "en" ) } LIMIT 300
-endpoint, SPARQL endpoint, default is: https://www.europeandataportal.eu/sparql
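
As an illustration of how the arguments and their defaults fit together, here is a minimal sketch of an option parser. It is not the project's tools.Main; the class name CliOptionsSketch, the map-based representation, and the validation rules are assumptions made for this example. Only the option names and default values are taken from the list above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical sketch of option handling; not the project's actual tools.Main. */
class CliOptionsSketch {

    /** Returns the documented defaults, overridden by any "-name value" pairs in args. */
    static Map<String, String> parse(String[] args) {
        Map<String, String> options = new LinkedHashMap<>();
        // Defaults as listed in this README.
        options.put("c", "j48");
        options.put("ngrams", "1");
        options.put("query", "SELECT * WHERE { ?s a <http://www.w3.org/ns/dcat#Dataset> ; "
                + "<http://www.w3.org/ns/dcat#theme> ?o ; "
                + "<http://purl.org/dc/terms/description> ?d "
                + "FILTER ( lang(?d) = \"en\" ) } LIMIT 300");
        options.put("endpoint", "https://www.europeandataportal.eu/sparql");

        // Each option is expected as a "-name value" pair.
        for (int i = 0; i < args.length; i += 2) {
            String name = args[i];
            if (!name.startsWith("-") || !options.containsKey(name.substring(1))) {
                throw new IllegalArgumentException("Unknown option: " + name);
            }
            if (i + 1 >= args.length) {
                throw new IllegalArgumentException("Missing value for " + name);
            }
            options.put(name.substring(1), args[i + 1]);
        }

        // Validation rules below are assumptions derived from the documented value ranges.
        String classifier = options.get("c");
        if (!classifier.equals("j48") && !classifier.equals("naive")) {
            throw new IllegalArgumentException("-c must be naive or j48");
        }
        if (Integer.parseInt(options.get("ngrams")) < 1) {
            throw new IllegalArgumentException("-ngrams must be at least 1");
        }
        return options;
    }

    public static void main(String[] args) {
        System.out.println(parse(args));
    }
}
```

Calling parse with an empty array yields the defaults, so the sketch mirrors running the application without -Dexec.args overrides.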

Pre-processing

The following steps were taken:

  1. Removed punctuation
  2. Converted all text to lower case
  3. Tokenized and lemmatized the text
  4. Removed standard English stop words; a word is dropped if either its lemma or its original form matches a stop word
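
The four steps above can be sketched in plain Java as follows. This is only an illustration under stated assumptions: the stop-word list is a tiny sample, the LEMMAS map is a toy stand-in for a real lemmatizer, and keeping the lemma (rather than the original word) in the output is a choice made for this sketch. None of these names come from the project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

/** Hypothetical sketch of the pre-processing steps; not the project's code. */
class PreprocessingSketch {

    // Tiny illustrative stop-word sample (assumption); a real list is much longer.
    static final Set<String> STOP_WORDS = new HashSet<>(
            Arrays.asList("the", "a", "an", "of", "and", "be", "in", "to", "for"));

    // Toy lemma lookup standing in for a real lemmatizer (assumption).
    static final Map<String, String> LEMMAS = new HashMap<>();
    static {
        LEMMAS.put("are", "be");
        LEMMAS.put("was", "be");
        LEMMAS.put("stations", "station");
        LEMMAS.put("datasets", "dataset");
    }

    static List<String> preprocess(String description) {
        // 1. Remove punctuation (replaced by a space so adjacent words stay separate).
        String noPunctuation = description.replaceAll("\\p{Punct}", " ");
        // 2. Convert to lower case.
        String lower = noPunctuation.toLowerCase(Locale.ENGLISH);
        List<String> result = new ArrayList<>();
        // 3. Tokenize on whitespace and look up each token's lemma.
        for (String token : lower.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;
            }
            String lemma = LEMMAS.getOrDefault(token, token);
            // 4. Drop the word if either the original form or the lemma is a stop word.
            if (STOP_WORDS.contains(token) || STOP_WORDS.contains(lemma)) {
                continue;
            }
            result.add(lemma);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(preprocess("The stations are in Paderborn, Germany."));
    }
}
```

In this sketch "are" is removed even though it is not in the stop-word sample, because its lemma "be" is; that is the case step 4 describes.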

Word vectorization

The standard TF-IDF word vectors were computed.
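
For readers unfamiliar with the weighting, the following sketch computes TF-IDF vectors with the textbook formula: raw term count multiplied by ln(N / df), where N is the number of documents and df the number of documents containing the term. The exact variant used by the component (for example WEKA's own transforms) is not stated here, so treat the formula and the class name TfIdfSketch as assumptions.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical sketch of textbook TF-IDF; the component's exact variant may differ. */
class TfIdfSketch {

    /** Returns one term-to-weight map per document, weight = count * ln(N / df). */
    static List<Map<String, Double>> tfIdf(List<List<String>> documents) {
        // Document frequency: in how many documents does each term occur?
        Map<String, Integer> documentFrequency = new HashMap<>();
        for (List<String> document : documents) {
            for (String term : new HashSet<>(document)) {
                documentFrequency.merge(term, 1, Integer::sum);
            }
        }

        int n = documents.size();
        List<Map<String, Double>> vectors = new ArrayList<>();
        for (List<String> document : documents) {
            // Term frequency: raw count of each term in this document.
            Map<String, Integer> counts = new HashMap<>();
            for (String term : document) {
                counts.merge(term, 1, Integer::sum);
            }
            Map<String, Double> vector = new TreeMap<>();
            for (Map.Entry<String, Integer> entry : counts.entrySet()) {
                double idf = Math.log((double) n / documentFrequency.get(entry.getKey()));
                vector.put(entry.getKey(), entry.getValue() * idf);
            }
            vectors.add(vector);
        }
        return vectors;
    }
}
```

A term that occurs in every document gets weight 0 under this formula, which is why very common words contribute little to the classification.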

Results

Because the approach is slow and time was limited, the classifier was trained with only 160 instances. That number might be too small to be representative.
The following accuracy was obtained using 4-fold cross-validation:

Classifier    1-gram     2-gram     3-gram     4-gram
J48           75.625%    59.375%    59.375%    59.375%
NaiveBayes    47.5%      31.875%    36.875%    35%
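
To put these percentages in perspective, they can be converted back into counts of correctly classified instances. The sketch below assumes that every one of the 160 training instances is evaluated exactly once across the 4 folds, as is usual for k-fold cross-validation; the class and method names are hypothetical. Under that assumption the J48 1-gram accuracy corresponds to 121 of 160 instances, and each fold holds 40 instances.

```java
/** Hypothetical helper for reading the cross-validation accuracies as instance counts. */
class AccuracySketch {

    /** Number of correct instances implied by an accuracy percentage, rounded to the nearest integer. */
    static long correctCount(double accuracyPercent, int totalInstances) {
        return Math.round(accuracyPercent / 100.0 * totalInstances);
    }

    /** Accuracy in percent for a given number of correct instances. */
    static double accuracyPercent(long correct, int totalInstances) {
        return 100.0 * correct / totalInstances;
    }

    public static void main(String[] args) {
        int total = 160; // training instances reported in this README
        int folds = 4;   // folds reported in this README
        System.out.println("Instances per fold: " + (total / folds));
        System.out.println("J48, 1-gram: " + correctCount(75.625, total) + " of " + total);
        System.out.println("NaiveBayes, 1-gram: " + correctCount(47.5, total) + " of " + total);
    }
}
```

With so few instances, a single additional correct prediction shifts the accuracy by 0.625 percentage points, which supports the caution about representativeness.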

The following accuracy was obtained for the evaluation of the test data.

Classifier    1-gram     2-gram     3-gram     4-gram
J48           62.07%     50%        59.09%     55.32%
NaiveBayes    28.09%     29.35%     27.59%     28.05%

Note

This component was developed by Ana Alexandra Morim da Silva during the OPAL hackathon. The component was one of the winners of the hackathon.

Credits

Data Science Group (DICE) at Paderborn University

This work has been supported by the German Federal Ministry of Transport and Digital Infrastructure (BMVI) in the project Open Data Portal Germany (OPAL) (funding code 19F2028A).
