shib1111111 / webtext-analyzer

It is a text analysis tool that performs linguistic analysis on a collection of web pages. It includes sentiment analysis, readability metrics, and other derived variables.

License: MIT License

Python 100.00%
nlp-machine-learning pyphen sentiment-analysis text-analysis text-summarization

WebText Analyzer: Uncover Insights from Web Pages

This project provides a text analysis tool that performs linguistic analysis on a collection of web pages. It includes sentiment analysis, readability metrics, and other derived variables. The tool reads web page URLs from an Input.xlsx file, fetches the content of each URL, and saves the title and descriptions of each page in separate text files. It then performs text analysis on these text files and saves the results in the output.csv file.
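
The extraction step described above can be sketched as follows. This is a minimal illustration, not the project's actual code: it uses the standard library's html.parser as a stand-in for beautifulsoup4, works on an HTML string instead of fetching a URL, and assumes that the "descriptions" are the text of the page's paragraph elements. The names TitleAndParagraphs and extract_title_and_text are hypothetical.

```python
from html.parser import HTMLParser


class TitleAndParagraphs(HTMLParser):
    """Collect the <title> text and the text of every <p> element."""

    def __init__(self):
        super().__init__()
        self.title = ""
        self.paragraphs = []
        self._in_title = False
        self._in_p = False
        self._buffer = []

    def _flush_paragraph(self):
        # Join the collected pieces and normalise internal whitespace.
        text = " ".join("".join(self._buffer).split())
        if text:
            self.paragraphs.append(text)
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "title":
            self._in_title = True
        elif tag == "p":
            if self._in_p:
                # A new <p> implicitly closes an unclosed one.
                self._flush_paragraph()
            self._in_p = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False
        elif tag == "p" and self._in_p:
            self._in_p = False
            self._flush_paragraph()

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        elif self._in_p:
            self._buffer.append(data)


def extract_title_and_text(html):
    """Return (title, paragraphs joined by newlines) for an HTML string."""
    parser = TitleAndParagraphs()
    parser.feed(html)
    parser.close()
    if parser._in_p:
        # Keep the text of a final paragraph that was never closed.
        parser._flush_paragraph()
    return parser.title.strip(), "\n".join(parser.paragraphs)
```

In a real run the HTML string would come from an HTTP response, and the returned title and text would be written to a per-URL text file before analysis.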

Prerequisites

Before running the code, please make sure the following libraries are installed:

  • pandas: For handling data in tabular format.
  • requests: For making HTTP requests to fetch web page content.
  • beautifulsoup4: For parsing HTML content.
  • nltk: The Natural Language Toolkit library for natural language processing.
  • pyphen: For counting syllables in words.

You can install these libraries using pip:
pip install pandas
pip install requests
pip install beautifulsoup4
pip install nltk
pip install pyphen
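
To confirm that the prerequisites are importable before running the analysis, a small check like the one below may help. It is a sketch that is not part of the project; the REQUIRED mapping and the missing_packages helper are hypothetical names. Note that the beautifulsoup4 package is imported as bs4.

```python
from importlib.util import find_spec

# Maps each pip package name to the module name used in import statements.
REQUIRED = {
    "pandas": "pandas",
    "requests": "requests",
    "beautifulsoup4": "bs4",
    "nltk": "nltk",
    "pyphen": "pyphen",
}


def missing_packages(required=REQUIRED):
    """Return the pip names of packages whose module cannot be found."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]


if __name__ == "__main__":
    missing = missing_packages()
    if missing:
        print("Missing packages, install with: pip install " + " ".join(missing))
    else:
        print("All prerequisites are importable.")
```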

Data Files

Make sure the following files are present in the same directory:

analysis.py: The code file.

Input.xlsx: This file contains the URL IDs and URLs.

Output Data Structure.xlsx: This file specifies the output file format.

MasterDictionary/positive-words.txt: A text file containing a list of positive words, one word per line.

MasterDictionary/negative-words.txt: A text file containing a list of negative words, one word per line.

StopWords: A directory containing text files with stop words. Each filename should start with "StopWords" and end with .txt.
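
One way to load these files is sketched below. It assumes UTF-8 text with one entry per line (stated above for the positive and negative lists, but an assumption for the stop word files) and lowercases every entry. The function names load_word_set and load_stop_words are hypothetical and may not match analysis.py.

```python
from pathlib import Path


def load_word_set(path):
    """Read one word per line, lowercased, skipping blank lines."""
    # errors="ignore" is an assumption to tolerate stray non-UTF-8 bytes.
    text = Path(path).read_text(encoding="utf-8", errors="ignore")
    return {line.strip().lower() for line in text.splitlines() if line.strip()}


def load_stop_words(directory):
    """Merge every StopWords*.txt file in the directory into one set."""
    stop_words = set()
    for file in sorted(Path(directory).glob("StopWords*.txt")):
        stop_words |= load_word_set(file)
    return stop_words


if __name__ == "__main__":
    positive = load_word_set("MasterDictionary/positive-words.txt")
    negative = load_word_set("MasterDictionary/negative-words.txt")
    stop_words = load_stop_words("StopWords")
    print(len(positive), len(negative), len(stop_words))
```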

Code Execution

  • Place the code file analysis.py in a directory along with the required data files.

  • Create an Input.xlsx file with two columns:

    • URL_ID: An integer identifier for each URL.

    • URL: The web page URLs to analyze.

  • Create an Output Data Structure.xlsx file with the following columns:

    • URL_ID: The same integer identifier for each URL as in the Input.xlsx file.
    • Additional columns for storing the computed text analysis variables.

  • Run analysis.py with a Python interpreter or from an IDE.

  • After execution, the computed text analysis results are saved in the output.csv file.
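
As a rough illustration of the final step, the sketch below writes one result row per URL to a CSV file, following the output.csv name used in the introduction. It uses only the standard library csv module. The write_results helper and every column name other than URL_ID are hypothetical; the real columns are defined by Output Data Structure.xlsx.

```python
import csv


def write_results(rows, path="output.csv"):
    """Write one row per URL; columns are taken from the first row's keys."""
    if not rows:
        return
    with open(path, "w", newline="", encoding="utf-8") as handle:
        writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
        writer.writeheader()
        writer.writerows(rows)


if __name__ == "__main__":
    # Placeholder values, for illustration only.
    write_results([
        {"URL_ID": 1, "POSITIVE_SCORE": 3, "NEGATIVE_SCORE": 1},
        {"URL_ID": 2, "POSITIVE_SCORE": 0, "NEGATIVE_SCORE": 2},
    ])
```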

Usage

  • Ensure that the code file and data files are set up as described above.
  • Run the code by executing the Python script or using an IDE.
  • The code will fetch the web page content, perform text analysis, and save the results in the output.csv file.
  • You can customize the code and parameters as per your requirements.
  • Refer to the code comments for detailed explanations of each step.
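
To give a feel for the kind of variables such an analysis can produce, here is a minimal sketch under several assumptions. The README does not list the exact metrics, so the polarity score and the Gunning fog index below are illustrative choices. Syllables are counted with a simple vowel-group heuristic rather than pyphen, and stop words are removed before any counting. All function names and result keys are hypothetical.

```python
import re


def tokenize(text, stop_words=frozenset()):
    """Lowercase the text, split it into words, and drop stop words."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in stop_words]


def count_syllables(word):
    """Rough heuristic: count groups of consecutive vowels (at least 1)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))


def analyze(text, positive, negative, stop_words=frozenset()):
    """Compute a few sentiment and readability variables for one text."""
    words = tokenize(text, stop_words)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    pos = sum(w in positive for w in words)
    neg = sum(w in negative for w in words)
    # A word with more than two syllables is treated as "complex".
    complex_words = sum(count_syllables(w) > 2 for w in words)
    n_words = max(1, len(words))
    avg_sentence_length = len(words) / max(1, len(sentences))
    pct_complex = 100 * complex_words / n_words
    return {
        "positive_score": pos,
        "negative_score": neg,
        # The small constant avoids division by zero for neutral texts.
        "polarity": (pos - neg) / (pos + neg + 1e-6),
        "avg_sentence_length": avg_sentence_length,
        "pct_complex_words": pct_complex,
        # Gunning fog index: 0.4 * (words per sentence + % complex words).
        "fog_index": 0.4 * (avg_sentence_length + pct_complex),
    }
```

Replacing count_syllables with a pyphen-based counter, or changing when stop words are removed, will change the readability numbers, so results from this sketch should not be expected to match the project's output.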

License

This project is licensed under the MIT License.

Thank you for viewing this repo! Feel free to reach out with any questions or feedback.

✨ --- Designed & made with Love by Shib Kumar Saraf ✨
