scharoun / sde

This project forked from seagatesoft/sde

Structured Data Extractor. An application to extract structured data from web pages. It uses the Data Extraction Based on Partial Tree Alignment (DEPTA) method. (UPDATE: I implemented a newer algorithm: https://github.com/seagatesoft/webdext)

Home Page: http://seagatesoft.blogspot.com

HTML 10.83% Java 89.17%

sde's Introduction

Structured Data Extractor (SDE) is an implementation of DEPTA (Data Extraction based on Partial Tree Alignment), a method for extracting data from web pages (HTML documents). DEPTA was invented by Yanhong Zhai and Bing Liu of the University of Illinois at Chicago and published in their paper "Structured Data Extraction from the Web based on Partial Tree Alignment" (IEEE Transactions on Knowledge and Data Engineering, 2006). Given a web page, SDE detects the data records it contains and extracts them into a table structure (rows and columns). You can download the application from this link: Download Structured Data Extractor.
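To give a feel for the kind of tree comparison that alignment-based extraction depends on, here is a minimal sketch of Simple Tree Matching, a classic measure that counts the largest number of matching node pairs between two tag trees while preserving sibling order. This is an illustration only and is not taken from the SDE source code. The class name SimpleTreeMatchingExample and the Node type are hypothetical, and DEPTA itself adds further steps (data region detection and partial alignment) that are not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SimpleTreeMatchingExample {

    /** Hypothetical minimal tag-tree node: a tag name plus ordered children. */
    public static final class Node {
        final String tag;
        final List<Node> children;

        public Node(String tag, Node... children) {
            this.tag = tag;
            this.children = new ArrayList<>(Arrays.asList(children));
        }
    }

    /**
     * Returns the size of the maximum order-preserving matching between
     * the trees rooted at a and b. Roots with different tags match nothing.
     */
    public static int match(Node a, Node b) {
        if (!a.tag.equals(b.tag)) {
            return 0;
        }
        int m = a.children.size();
        int n = b.children.size();
        // best[i][j] = best matching using the first i children of a
        // and the first j children of b.
        int[][] best = new int[m + 1][n + 1];
        for (int i = 1; i <= m; i++) {
            for (int j = 1; j <= n; j++) {
                int paired = best[i - 1][j - 1]
                        + match(a.children.get(i - 1), b.children.get(j - 1));
                best[i][j] = Math.max(paired, Math.max(best[i - 1][j], best[i][j - 1]));
            }
        }
        // Add 1 for the matched roots themselves.
        return best[m][n] + 1;
    }

    public static void main(String[] args) {
        // Two table rows that might represent similar data records.
        Node rowA = new Node("tr", new Node("td"), new Node("td"), new Node("td"));
        Node rowB = new Node("tr", new Node("td"), new Node("td"));
        System.out.println(match(rowA, rowB)); // prints 3
    }
}
```

A higher score relative to the sizes of the two trees suggests that the subtrees are structurally similar, which is the sort of signal used to decide whether two page fragments are instances of the same record template.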

Usage

  1. Extract sde.zip.
  2. Make sure that the Java Runtime Environment (version 5 or higher) is installed on your computer.
  3. Open a command prompt (Windows) or shell (UNIX).
  4. Go to the directory where you extracted sde.zip.
  5. Run this command: java -jar sde-runnable.jar URI_input path_to_output_file
  6. The URI_input parameter can refer to a local file or a remote file, as long as it is a valid URI. A URI referring to a local file must be preceded by "file:///". For example, in a Windows environment: "file:///D:/Development/Proyek/structured_data_extractor/bin/input/input.html", or in a UNIX environment: "file:///home/seagate/input/input.html".
  7. The path_to_output_file parameter must be a valid path in the host operating system, such as "D:\Data\output.html" (Windows) or "/home/seagate/output/output.html" (UNIX).
  8. The extracted data can be viewed in the output file. The output file is an HTML document in which the extracted data is presented as HTML tables.
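If you want to build the command from another Java program, the sketch below shows one way to turn a local input path into a "file:///" URI and assemble the arguments from step 5. It assumes Java 8 or higher for the helper itself (it uses java.nio.file and String.join), even though SDE only needs version 5. The class name SdeCommandExample and its methods are hypothetical helpers and are not part of SDE.

```java
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;

public class SdeCommandExample {

    /** Converts a local file path to a URI string of the form file:///... */
    public static String toFileUri(String localPath) {
        // toUri() resolves the path against the working directory if it is relative.
        return Paths.get(localPath).toUri().toString();
    }

    /** Builds the argument list for the command shown in step 5. */
    public static List<String> buildCommand(String inputPath, String outputPath) {
        return Arrays.asList(
                "java", "-jar", "sde-runnable.jar",
                toFileUri(inputPath),   // URI_input
                outputPath);            // path_to_output_file
    }

    public static void main(String[] args) {
        if (args.length != 2) {
            System.err.println("Usage: SdeCommandExample <input file> <output file>");
            return;
        }
        // Print the command so it can be checked or copied into a shell.
        System.out.println(String.join(" ", buildCommand(args[0], args[1])));
    }
}
```

To launch SDE directly instead of printing the command, the list can be passed to new ProcessBuilder(command).inheritIO().start(), run from the directory that contains sde-runnable.jar.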

Source Code

The SDE source code is available on GitHub.

Dependencies

SDE was developed using these libraries:

  • Neko HTML Parser by Andy Clark and Marc Guillemot. Licensed under Apache License Version 2.0.
  • Xerces by The Apache Software Foundation. Licensed under Apache License Version 2.0.

License

SDE is licensed under the MIT license.

Author

Sigit Dewanto, sigitdewanto11[at]yahoo[dot]co[dot]uk, 2009.
