scrape_rd's People

Contributors

jocelynconnor953331 · pilot-to-sky · tyson-che

scrape_rd's Issues

Enhance Comment Tree Fetching and Preservation in Data Collection

Problem Description

The current data collection from Reddit does not efficiently handle posts with a large number of comments (5000+). In addition, the existing method flattens the comments, discarding the hierarchical structure that is crucial for understanding the context of and relationships between comments.

Suggested Enhancements

  1. Efficient Comment Tree Navigation:

    • Implement an algorithm to efficiently navigate and fetch comment trees from Reddit posts, especially for posts with a high number of comments.
    • Ensure that the algorithm respects rate limits and minimizes the number of API calls.
  2. Preserve Comment Hierarchy:

    • Modify the data collection to preserve the parent-child relationships of comments.
    • Structure the data in a way that retains the nested nature of comment threads (a sketch follows this list).
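
A minimal sketch of how both points could look with PRAW (which the scraper presumably already uses); fetch_comment_tree and serialize_comment are illustrative names, not existing project code:

# Sketch only: `submission` is assumed to be a praw.models.Submission.

def fetch_comment_tree(submission):
    # replace_more(limit=None) resolves every "load more comments" stub;
    # each resolution costs one API call, which PRAW rate-limits automatically.
    submission.comments.replace_more(limit=None)
    return [serialize_comment(c) for c in submission.comments]

def serialize_comment(comment):
    # Convert a PRAW Comment into a nested dict, recursing into replies
    # so the parent-child hierarchy is preserved end to end.
    return {
        "id": comment.id,
        "author": str(comment.author),  # "None" for deleted accounts
        "body": comment.body,
        "score": comment.score,
        "replies": [serialize_comment(r) for r in comment.replies],
    }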

Expected Benefits

  • Improved data quality by maintaining the context and flow of conversations.
  • More efficient data collection, especially for posts with large comment sections, reducing the time and resources required for the process.

Potential Approach

  • Research and integrate existing libraries or algorithms that are optimized for this purpose.
  • Redesign the data model to incorporate a tree structure, potentially using adjacency lists or materialized paths (compared in the sketch below).
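
For illustration, the two candidate document shapes could look like this in MongoDB (all field names are hypothetical, not the project's current schema):

# Hypothetical document shapes, for comparison only.

# Adjacency list: each comment stores its direct parent's ID.
adjacency_doc = {
    "_id": "t1_abc",
    "post_id": "t3_xyz",
    "parent_id": "t1_def",  # None for top-level comments
    "body": "...",
}

# Materialized path: each comment stores the full path from the root,
# so an entire subtree can be fetched with a single prefix query, e.g.
# db.comments.find({"path": {"$regex": "^t3_xyz/t1_def/"}})
materialized_doc = {
    "_id": "t1_abc",
    "post_id": "t3_xyz",
    "path": "t3_xyz/t1_def/t1_abc",
    "body": "...",
}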

Additional Context

  • The new approach should be compatible with the existing data transformation and MongoDB insertion processes.
  • Consider the scalability of the solution, as it should handle an increasing amount of data over time.

Validate and Handle Empty Data Before MongoDB Insertion (or Before Saving Locally, Since the Dataset Is Under 3 GB)

Problem Description

There have been instances where empty BSON files were inserted into MongoDB. This issue aims to prevent empty or incomplete data from being inserted and to ensure data integrity.

Modules Affected

  • data_transformation.py
  • fetch_post.py (if applicable)

Expected Changes

  1. Validation Check:

    • A validation function should be implemented to check the integrity and completeness of the data.
    • This function should be called after data transformation and before insertion into MongoDB.
    • If data is found to be incomplete or empty, it should be logged and not inserted.
  2. Logging Mechanism:

    • Expand the current logging to categorize different types of exceptions.
    • Any occurrence of empty data being prepared for insertion should be logged with a unique identifier to ease debugging.
  3. Unit Testing:

    • Add unit tests for the validation function to ensure it correctly identifies incomplete or empty data (a pytest sketch follows the code snippet below).
    • Ensure the tests cover a variety of scenarios, including edge cases.

Suggested Code Snippet

# Inside data_transformation.py

import logging  # needed for the error log in data_transform below

# ... existing code ...

def is_data_valid(post_data):
    # Reject empty/None post_data and posts missing critical fields
    return bool(post_data and post_data.get('title') and post_data.get('author'))

# ... existing code ...

def data_transform(submission):
    # ... existing code ...

    post_data = populate_post_data(submission)
    # ... existing code ...

    if not is_data_valid(post_data):
        logging.error(f"Invalid data detected for post ID: {submission.id}")
        return "", {}  # Return empty values so nothing is inserted

    # ... existing code ...
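
And a pytest sketch for the unit tests requested above; the test module name and cases are illustrative:

# test_data_transformation.py -- illustrative tests only
import pytest

from data_transformation import is_data_valid

def test_valid_post_passes():
    assert is_data_valid({"title": "Hello", "author": "someone"})

def test_empty_or_none_fails():
    assert not is_data_valid({})
    assert not is_data_valid(None)

@pytest.mark.parametrize("missing_field", ["title", "author"])
def test_missing_critical_field_fails(missing_field):
    # A post lacking either critical field should be rejected.
    post = {"title": "Hello", "author": "someone"}
    post[missing_field] = None
    assert not is_data_valid(post)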

Optimize Task Distribution for Scraping and Database Operations

Problem Description

The current setup for scraping and database operations is not optimized for performance. Scraping tasks are not assigned to the regions with the fastest internet connections, and database operations are not centralized in a location that minimizes latency to external services such as OpenAI.

Suggested Solution

  1. Regional Optimization for Scraping:

    • Implement a system that assigns scraping tasks to servers based on the regions with the fastest internet connection.
  2. Database Location Optimization:

    • Set up the database in a strategic location (e.g., NYC) to minimize latency for data ingestion and external API requests.

Requirements

  • A flexible, loosely coupled way to distribute tasks, allowing dynamic allocation based on server performance and location.
  • Scalability to handle task distribution as the number of scraping targets increases.

Potential Approach

  • Develop or integrate a task management system that can assign tasks based on server metrics and geographic location.
  • Create a configuration system that allows for easy specification of database endpoints and can route data accordingly (see the sketch below).
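
A minimal sketch of what such routing might look like; the worker registry, endpoints, and latency figures are all hypothetical placeholders:

# Hypothetical task router: picks the worker region with the lowest
# measured latency, and pairs it with a centrally configured database.

WORKERS = {
    "nyc": {"endpoint": "http://nyc.workers.internal", "latency_ms": 12},
    "sfo": {"endpoint": "http://sfo.workers.internal", "latency_ms": 48},
    "fra": {"endpoint": "http://fra.workers.internal", "latency_ms": 95},
}

DB_ENDPOINTS = {
    # Colocated near external API providers to keep request latency low.
    "primary": "mongodb://nyc.db.internal:27017",
}

def assign_task(task):
    # Route a scraping task to the region with the lowest latency;
    # in practice latency_ms would be refreshed from live server metrics.
    region = min(WORKERS, key=lambda r: WORKERS[r]["latency_ms"])
    return {
        "task": task,
        "worker": WORKERS[region]["endpoint"],
        "db": DB_ENDPOINTS["primary"],
    }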

Expected Outcome

  • Enhanced performance of scraping operations by leveraging faster internet connections.
  • Reduced latency for database operations and API requests, leading to more efficient data processing.
