scrape_rd's People

Contributors

jocelynconnor953331 · pilot-to-sky · tyson-che

scrape_rd's Issues

Enhance Comment Tree Fetching and Preservation in Data Collection

Problem Description

The current data collection from Reddit does not efficiently handle posts with a large number of comments (5000+). In addition, the existing method flattens the comments, discarding the hierarchical structure that is crucial for understanding the context of and relationships between comments.

Suggested Enhancements

  1. Efficient Comment Tree Navigation:

    • Implement an algorithm to efficiently navigate and fetch comment trees from Reddit posts, especially for posts with a high number of comments.
    • Ensure that the algorithm respects rate limits and minimizes the number of API calls.
  2. Preserve Comment Hierarchy:

    • Modify the data collection to preserve the parent-child relationships of comments.
    • Structure the data in a way that retains the nested nature of comment threads (a sketch follows this list).
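
A minimal sketch of how both points could look with PRAW (which the scraper presumably already uses); fetch_comment_tree and serialize_comment are illustrative names, not existing project code:

# Sketch only: `submission` is assumed to be a praw.models.Submission.

def fetch_comment_tree(submission):
    # replace_more(limit=None) resolves every "load more comments" stub;
    # each resolution costs one API call, which PRAW rate-limits automatically.
    submission.comments.replace_more(limit=None)
    return [serialize_comment(c) for c in submission.comments]

def serialize_comment(comment):
    # Convert a PRAW Comment into a nested dict, recursing into replies
    # so the parent-child hierarchy is preserved end to end.
    return {
        "id": comment.id,
        "author": str(comment.author),  # "None" for deleted accounts
        "body": comment.body,
        "score": comment.score,
        "replies": [serialize_comment(r) for r in comment.replies],
    }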

Expected Benefits

  • Improved data quality by maintaining the context and flow of conversations.
  • More efficient data collection, especially for posts with large comment sections, reducing the time and resources required for the process.

Potential Approach

  • Research and integrate existing libraries or algorithms that are optimized for this purpose.
  • Redesign the data model to incorporate a tree structure, potentially using adjacency lists or materialized paths (compared in the sketch below).
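
For illustration, the two candidate document shapes could look like this in MongoDB (all field names are hypothetical, not the project's current schema):

# Hypothetical document shapes, for comparison only.

# Adjacency list: each comment stores its direct parent's ID.
adjacency_doc = {
    "_id": "t1_abc",
    "post_id": "t3_xyz",
    "parent_id": "t1_def",  # None for top-level comments
    "body": "...",
}

# Materialized path: each comment stores the full path from the root,
# so an entire subtree can be fetched with a single prefix query, e.g.
# db.comments.find({"path": {"$regex": "^t3_xyz/t1_def/"}})
materialized_doc = {
    "_id": "t1_abc",
    "post_id": "t3_xyz",
    "path": "t3_xyz/t1_def/t1_abc",
    "body": "...",
}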

Additional Context

  • The new approach should be compatible with the existing data transformation and MongoDB insertion processes.
  • Consider the scalability of the solution, as it should handle an increasing amount of data over time.

Validate and Handle Empty Data Before MongoDB Insertion (or Before Saving Locally, Since the Dataset Is Under 3 GB)

Problem Description

There have been instances where empty BSON files were inserted into MongoDB. This issue aims to prevent empty or incomplete data from being inserted and to ensure data integrity.

Modules Affected

  • data_transformation.py
  • fetch_post.py (if applicable)

Expected Changes

  1. Validation Check:

    • A validation function should be implemented to check the integrity and completeness of the data.
    • This function should be called after data transformation and before insertion into MongoDB.
    • If data is found to be incomplete or empty, it should be logged and not inserted.
  2. Logging Mechanism:

    • Expand the current logging to categorize different types of exceptions.
    • Any occurrence of empty data being prepared for insertion should be logged with a unique identifier to ease debugging.
  3. Unit Testing:

    • Add unit tests for the validation function to ensure it correctly identifies incomplete or empty data (a pytest sketch follows the code snippet below).
    • Ensure the tests cover a variety of scenarios, including edge cases.

Suggested Code Snippet

# Inside data_transformation.py

import logging  # needed for the error log in data_transform below

# ... existing code ...

def is_data_valid(post_data):
    # Reject empty/None post_data and posts missing critical fields
    return bool(post_data and post_data.get('title') and post_data.get('author'))

# ... existing code ...

def data_transform(submission):
    # ... existing code ...

    post_data = populate_post_data(submission)
    # ... existing code ...

    if not is_data_valid(post_data):
        logging.error(f"Invalid data detected for post ID: {submission.id}")
        return "", {}  # Return empty values so nothing is inserted

    # ... existing code ...
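
And a pytest sketch for the unit tests requested above; the test module name and cases are illustrative:

# test_data_transformation.py -- illustrative tests only
import pytest

from data_transformation import is_data_valid

def test_valid_post_passes():
    assert is_data_valid({"title": "Hello", "author": "someone"})

def test_empty_or_none_fails():
    assert not is_data_valid({})
    assert not is_data_valid(None)

@pytest.mark.parametrize("missing_field", ["title", "author"])
def test_missing_critical_field_fails(missing_field):
    # A post lacking either critical field should be rejected.
    post = {"title": "Hello", "author": "someone"}
    post[missing_field] = None
    assert not is_data_valid(post)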

Optimize Task Distribution for Scraping and Database Operations

Problem Description

The current setup for scraping and database operations is not optimized for performance. Scraping tasks are not assigned to the regions with the fastest internet connections, and database operations are not centralized in a location that minimizes latency to external services such as OpenAI.

Suggested Solution

  1. Regional Optimization for Scraping:

    • Implement a system that assigns scraping tasks to servers based on the regions with the fastest internet connection.
  2. Database Location Optimization:

    • Set up the database in a strategic location (e.g., NYC) to minimize latency for data ingestion and external API requests.

Requirements

  • A flexible, loosely coupled way to distribute tasks, allowing dynamic allocation based on server performance and location.
  • Scalability to handle task distribution as the number of scraping targets increases.

Potential Approach

  • Develop or integrate a task management system that can assign tasks based on server metrics and geographic location.
  • Create a configuration system that allows for easy specification of database endpoints and can route data accordingly (see the sketch below).
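
A minimal sketch of what such routing might look like; the worker registry, endpoints, and latency figures are all hypothetical placeholders:

# Hypothetical task router: picks the worker region with the lowest
# measured latency, and pairs it with a centrally configured database.

WORKERS = {
    "nyc": {"endpoint": "http://nyc.workers.internal", "latency_ms": 12},
    "sfo": {"endpoint": "http://sfo.workers.internal", "latency_ms": 48},
    "fra": {"endpoint": "http://fra.workers.internal", "latency_ms": 95},
}

DB_ENDPOINTS = {
    # Colocated near external API providers to keep request latency low.
    "primary": "mongodb://nyc.db.internal:27017",
}

def assign_task(task):
    # Route a scraping task to the region with the lowest latency;
    # in practice latency_ms would be refreshed from live server metrics.
    region = min(WORKERS, key=lambda r: WORKERS[r]["latency_ms"])
    return {
        "task": task,
        "worker": WORKERS[region]["endpoint"],
        "db": DB_ENDPOINTS["primary"],
    }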

Expected Outcome

  • Enhanced performance of scraping operations by leveraging faster internet connections.
  • Reduced latency for database operations and API requests, leading to more efficient data processing.
