dataprismai / gpt-auto-webscraping

Home Page: https://huggingface.co/spaces/CognitiveLabs/GPT-auto-webscraping

License: MIT License

Python 100.00%
beautifulsoup chains gpt langchain webscraper webscraping

gpt-auto-webscraping's Introduction

Web scraper generator with ChatGPT

This project is a tool that generates web scraping code using ChatGPT. The idea is to use the power of the GPT models to write the extraction code for web scraping projects. The tech stack includes langchain, streamlit, and openai.

Try it: Space 🤗

Development

(Using a virtual environment is recommended; see Venv for more information.)

pip install -r requirements.txt

Create a config.ini file with the following content in the project's root directory

Visit OpenAI to get your API Key

[DEFAULT]
API-KEY = {fill the value with your OPENAI API Key}
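
As a rough illustration of how a file in this format can be read, the sketch below uses Python's standard configparser module. The helper name load_api_key is hypothetical and not taken from the project; the app itself may load the key differently.

```python
import configparser


def load_api_key(path="config.ini"):
    """Return the API-KEY value from the [DEFAULT] section of a config file.

    Hypothetical helper for illustration; not part of the project's code.
    """
    config = configparser.ConfigParser()
    # read() returns the list of files it managed to parse.
    if not config.read(path):
        raise FileNotFoundError(f"Could not read {path}")
    # Option lookup is case-insensitive, so "API-KEY" matches the entry above.
    api_key = config["DEFAULT"].get("API-KEY", fallback="").strip()
    if not api_key:
        raise ValueError("API-KEY is missing or empty in [DEFAULT]")
    return api_key
```

Raising early on a missing file or empty key gives a clearer error than a failed API call later on.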

Run the app

streamlit run app.py 

How it works

The idea of the project is to use GPT to automate code generation for web scraping.

  • The tool returns a method to be used in web scraping projects.

  • The first bot (GPT chain) returns a JSON object describing the fields to be extracted.

  • The second bot returns a function called extract_info.

  • The function receives the HTML of the page and returns the information extracted from it.
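
The two-step flow described above can be sketched in plain Python. This is only an outline under stated assumptions: the function names describe_fields and generate_extractor, the prompt wording, and the fake_llm stand-in are all hypothetical, and the real project builds its chains with langchain and the OpenAI API rather than a plain callable.

```python
import json
from typing import Callable, Dict

# Any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


def describe_fields(html_element: str, llm: LLM) -> Dict[str, str]:
    """Step 1: ask the model for a JSON description of the extractable fields."""
    prompt = (
        "Describe the fields that can be extracted from this HTML element "
        "as a JSON object mapping field names to descriptions.\n\n"
        + html_element
    )
    return json.loads(llm(prompt))


def generate_extractor(html_element: str, fields: Dict[str, str], llm: LLM) -> str:
    """Step 2: ask the model for the source code of extract_info."""
    prompt = (
        "Write a Python function extract_info(html) that returns a list of "
        "dicts with these keys: " + json.dumps(sorted(fields)) + "\n\n"
        "Example element:\n" + html_element
    )
    return llm(prompt)


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    if "extract_info" in prompt:
        return "def extract_info(html):\n    return []\n"
    return '{"title": "Product name", "price": "Product price"}'
```

Passing the model as a callable keeps the sketch testable without an API key; a real chain would replace fake_llm with a call to the model.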

Video Demo

Watch the full video on YouTube.

Steps to generate the code:

For now, the workflow has 2 manual steps, but the idea is to automate the process in the future.

Step 1: Get an HTML element from the page you want to extract information from

  • Inspect the element from which you want to get the information
  • Copy the HTML element and paste it into the input of the app
  • Click on generate code

Here the first chain generates a JSON object describing the fields to be extracted. That JSON is then passed to the second chain as the expected output format for the generated code.

Step 2: Get the whole HTML of the page

  • Copy the HTML of the entire page
  • Paste it in the second input of the app to test it
  • Click on test code

If the test succeeds, you will see a table with the information extracted from the page.
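
To make the expected result more concrete, here is a hand-written sketch of what an extract_info function could look like. It is not output from the tool: the class names item, title, and price are hypothetical, and it uses the standard library's html.parser so that it runs without extra dependencies, whereas the repository's topics suggest the generated code relies on BeautifulSoup.

```python
from html.parser import HTMLParser


class _ItemParser(HTMLParser):
    """Collect one row per element with class "item" (hypothetical markup)."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._field = None  # field that the next text node belongs to

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        if "item" in classes:
            self.rows.append({})
        elif self.rows and "title" in classes:
            self._field = "title"
        elif self.rows and "price" in classes:
            self._field = "price"

    def handle_data(self, data):
        text = data.strip()
        if self._field and text:
            self.rows[-1][self._field] = text
            self._field = None


def extract_info(html):
    """Receive the page HTML and return a list of dicts, one per item."""
    parser = _ItemParser()
    parser.feed(html)
    parser.close()
    return parser.rows
```

A list of dicts with the same keys maps directly onto table rows, which matches the table shown after a successful test.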

gpt-auto-webscraping's People

Contributors

dependabot[bot], gianfrancocorrea, ttomas78

gpt-auto-webscraping's Issues

File not found error

Hi

Thank you for the code.
It seems a file is missing that is needed.

FileNotFoundError: No secrets files found. Valid paths for a secrets.toml file are: C:\Users**.streamlit\secrets.toml

Where can I find this file or how can I create it and what exactly should be in the file?

Thank you

Full HTML website

Hello!
First of all, thanks for sharing the code with us!

Do you have any idea how to extend your code to process a whole website?
For example, to extract the content of a website that has ~107,000 tokens.

Thanks,
Alex
