vivicai / webspot

This project forked from crawlab-team/webspot

An intelligent web service to automatically detect web content and extract information from it.

Home Page: https://webspot.crawlab.net

License: BSD 3-Clause "New" or "Revised" License

Shell 0.01% JavaScript 3.13% Python 11.66% Go 0.31% CSS 0.65% HTML 0.14% Mako 0.06% Jupyter Notebook 83.83% Dockerfile 0.20%

Webspot

Webspot is an intelligent web service to automatically detect web content and extract information from it.

Demo

Chinese (中文)

Screenshots

Detected Results

Extracted Fields

Extracted Data

Get Started

Docker

Make sure you have installed Docker and Docker Compose.

# clone git repo and enter the project directory
git clone https://github.com/crawlab-team/webspot
cd webspot

# start docker containers
docker-compose up -d

Then you can access the web UI at http://localhost:80.

API Reference

Once you have started Webspot, you can go to http://localhost:80/redoc to view the API reference.
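To check from a script that the API reference is being served, a minimal sketch along the following lines may help. It uses only the Python standard library and the /redoc path given above; the function names `redoc_url` and `api_reference_is_up` are illustrative and not part of Webspot.

```python
from urllib.parse import urljoin
from urllib.request import urlopen


def redoc_url(base_url: str = "http://localhost:80") -> str:
    """Build the API reference URL from the base URL of a running instance."""
    return urljoin(base_url.rstrip("/") + "/", "redoc")


def api_reference_is_up(base_url: str = "http://localhost:80",
                        timeout: float = 5.0) -> bool:
    """Return True if the /redoc page answers with HTTP 200 (illustrative helper)."""
    try:
        with urlopen(redoc_url(base_url), timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # URLError and HTTPError are subclasses of OSError, so this covers
        # refused connections, timeouts, and non-2xx responses.
        return False


if __name__ == "__main__":
    print(redoc_url(), "reachable:", api_reference_is_up())
```

If the containers are still starting, the check returns False; retrying after a few seconds is usually enough.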

Architecture

The following diagram shows the overall process by which Webspot detects meaningful elements in HTML or web pages.

graph LR
    hr[HtmlRequester]
    gl[GraphLoader]
    d[Detector]
    r[Results]

    hr --"html + json"--> gl --"graph"--> d --"output"--> r
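The flow in the diagram can be illustrated with a small, self-contained sketch. This is not Webspot's implementation: the `Node` class, the `GraphLoader` parser, the `detect_lists` function, and the "repeated sibling signature" heuristic are all assumptions chosen to show the load, detect, and output stages. The HtmlRequester stage is replaced by an inline HTML string so the example runs without network access.

```python
from collections import Counter
from dataclasses import dataclass, field
from html.parser import HTMLParser
from typing import Dict, List, Optional, Tuple


@dataclass(eq=False)
class Node:
    """One element in the page graph (hypothetical structure)."""
    tag: str
    classes: Tuple[str, ...]
    parent: Optional["Node"] = field(default=None, repr=False)
    children: List["Node"] = field(default_factory=list, repr=False)


class GraphLoader(HTMLParser):
    """Builds a tree of Node objects from an HTML string."""
    VOID_TAGS = {"br", "hr", "img", "input", "link", "meta"}

    def __init__(self) -> None:
        super().__init__()
        self.root = Node("root", ())
        self._current = self.root

    def handle_starttag(self, tag, attrs):
        classes = tuple((dict(attrs).get("class") or "").split())
        node = Node(tag, classes, parent=self._current)
        self._current.children.append(node)
        if tag not in self.VOID_TAGS:
            self._current = node  # descend into non-void elements

    def handle_endtag(self, tag):
        # Walk up to the nearest open element with this tag, then leave it.
        node = self._current
        while node is not self.root and node.tag != tag:
            node = node.parent
        if node is not self.root:
            self._current = node.parent

    def load(self, html: str) -> Node:
        self.feed(html)
        self.close()
        return self.root


def detect_lists(root: Node, min_items: int = 3) -> List[Dict]:
    """Report containers whose children repeat the same (tag, classes) signature."""
    results = []
    stack = [root]
    while stack:
        node = stack.pop()
        counts = Counter((child.tag, child.classes) for child in node.children)
        for (tag, classes), count in counts.items():
            if count >= min_items:
                results.append({
                    "container": node.tag,
                    "item": tag,
                    "item_classes": list(classes),
                    "count": count,
                })
        stack.extend(node.children)
    return results


SAMPLE_HTML = (
    '<div class="list">'
    '<div class="item"><a>1</a></div>'
    '<div class="item"><a>2</a></div>'
    '<div class="item"><a>3</a></div>'
    "</div>"
)

if __name__ == "__main__":
    graph = GraphLoader().load(SAMPLE_HTML)  # html -> graph
    print(detect_lists(graph))               # graph -> output
```

For the sample HTML, the detector reports a single list: a div container holding three div children with the class "item". A real detector would likely use richer features than tag and class names, so treat this only as a picture of how the stages hand data to each other.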

Development

Follow the steps below to set up a development environment.

Pre-requisites

  • Python >=3.8 and <=3.10
  • Go 1.16 or higher
  • MongoDB 4.2 or higher
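A quick way to confirm that the local interpreter falls inside the supported Python range is a check like the one below. It is only a sketch: the function name `python_version_supported` is illustrative, and "<=3.10" is read here as "any 3.10.x release".

```python
import sys


def python_version_supported(version=sys.version_info) -> bool:
    """Return True if the major.minor version is between 3.8 and 3.10 inclusive."""
    return (3, 8) <= tuple(version[:2]) <= (3, 10)


if __name__ == "__main__":
    print("Python", sys.version.split()[0],
          "supported:", python_version_supported())
```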

Install dependencies

# dependencies
pip install -r requirements.txt

Configure Environment Variables

Database configuration is stored in the .env file. You can copy the example file and modify it.

cp .env.example .env
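To inspect what the copied file contains before starting the server, a small parser for KEY=VALUE lines can be sketched as follows. How Webspot itself loads .env is not stated here, so this is an independent illustration; the helper names `parse_env` and `load_env` are hypothetical, and no specific variable names from the project are assumed.

```python
from pathlib import Path
from typing import Dict


def parse_env(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values: Dict[str, str] = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and one layer of optional quotes.
        values[key.strip()] = value.strip().strip('"').strip("'")
    return values


def load_env(path: str = ".env") -> Dict[str, str]:
    """Read and parse an env file from disk."""
    return parse_env(Path(path).read_text(encoding="utf-8"))


if __name__ == "__main__":
    for key in sorted(load_env()):
        print(key)  # print keys only, so secrets are not echoed
```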

Start web server

# start development server
python main.py web

Code Structure

The core code is located in the webspot directory. The main.py file is the entry point of the web server.

webspot
├── cmd     # command line tools
├── crawler # web crawler
├── data    # data files (html, json, etc.)
├── db      # database
├── detect  # web content detection
├── graph   # graph module
├── models  # models
├── request # request helper
├── test    # test cases
├── utils   # utilities
└── web     # web server

TODOs

Webspot aims to automate web content detection and extraction. It is far from ready for production use. The following features are planned.

  • Table detection
  • Nested list detection
  • Export to spiders
  • Advanced browser request

Community

If you are interested in Webspot, add the author's WeChat account "tikazyq1" with the note "Webspot" to join the discussion group.

Contributors

tikazyq, shiojiang
