
Home Page: https://www.omkar.cloud/botasaurus/

License: MIT License


🤖 Botasaurus 🤖

The All in One Framework to build Awesome Scrapers.

The web has evolved. Finally, web scraping has too.


๐Ÿฟ๏ธ Botasaurus In a Nutshell

How wonderful that of all the web scraping tools out there, you chose to learn about Botasaurus. Congratulations!

And now that you are here, you are in for an exciting, unusual and rewarding journey that will make your web scraping life a lot, lot easier.

Now, let me tell you in bullet points about Botasaurus. (Because as per the marketing gurus, YOU as a member of the Developer Tribe have a VERY short attention span.)

So, what is Botasaurus?

Botasaurus is an all-in-one web scraping framework that enables you to build awesome scrapers in less time, less code, and with more fun.

A Web Scraping Magician has put all his web scraping experience and best practices into Botasaurus to save you hundreds of hours of Development Time!

Now, for the magical powers awaiting you after learning Botasaurus:

  • Convert any Web Scraper to a UI-based Scraper in minutes, which will make your customers sing your praises.

pro-gmaps-demo

  • In terms of humaneness, what Superman is to Man, Botasaurus is to Selenium and Playwright. Easily pass every (Yes, E-V-E-R-Y) bot test, with no need to spend time finding ways to access a website.

solve-bot-detection

  • Easily save hours of Development Time with easy parallelization, profiles, extensions, and proxy configuration. Botasaurus makes asynchronous, parallel scraping child's play.

  • Use Caching, Sitemap, Data cleaning, and other utilities to save hours of time otherwise spent writing and debugging code.

  • Easily scale your scraper to multiple machines with Kubernetes, and get your data faster than ever.

And those are just the highlights. I Mean!

There is so much more to Botasaurus that you will be amazed at how much time you will save with it.

🚀 Getting Started with Botasaurus

Let's dive right in with a straightforward example to understand Botasaurus.

In this example, we will go through the steps to scrape the heading text from https://www.omkar.cloud/.

Botasaurus in action

Step 1: Install Botasaurus

First things first, you need to install Botasaurus. Run the following command in your terminal:

python -m pip install botasaurus

Step 2: Set Up Your Botasaurus Project

Next, let's set up the project:

  1. Create a directory for your Botasaurus project and navigate into it:
mkdir my-botasaurus-project
cd my-botasaurus-project
code .  # This will open the project in VSCode if you have it installed

Step 3: Write the Scraping Code

Now, create a Python script named main.py in your project directory and paste the following code:

from botasaurus.browser import browser, Driver

@browser
def scrape_heading_task(driver: Driver, data):
    # Visit the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")
    
    # Retrieve the heading element's text
    heading = driver.get_text("h1")

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }
     
# Initiate the web scraping task
scrape_heading_task()

Let's understand this code:

  • We define a custom scraping task, scrape_heading_task, decorated with @browser:
@browser
def scrape_heading_task(driver: Driver, data):
  • Botasaurus automatically provides a Humane Driver to our function:
def scrape_heading_task(driver: Driver, data):
  • Inside the function, we:
    • Visit Omkar Cloud
    • Extract the heading text
    • Return the data to be automatically saved as scrape_heading_task.json by Botasaurus:
    driver.get("https://www.omkar.cloud/")
    heading = driver.get_text("h1")
    return {"heading": heading}
  • Finally, we initiate the scraping task:
# Initiate the web scraping task
scrape_heading_task()

Step 4: Run the Scraping Task

Time to run it:

python main.py

After executing the script, it will:

  • Launch Google Chrome
  • Visit omkar.cloud
  • Extract the heading text
  • Save it automatically as output/scrape_heading_task.json.

Botasaurus in action
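If you want to confirm the saved result programmatically, here is a minimal sketch that simply reads the output file back with Python's standard json module (the exact heading text depends on the live page):

import json

# Read the result that Botasaurus saved for scrape_heading_task
with open("output/scrape_heading_task.json") as f:
    print(json.load(f))  # e.g. {"heading": "..."}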

Now, let's explore another way to scrape the heading using the request module. Replace the previous code in main.py with the following:

from botasaurus.request import request, Request
from botasaurus.soupify import soupify

@request
def scrape_heading_task(request: Request, data):
    # Visit the Omkar Cloud website
    response = request.get("https://www.omkar.cloud/")

    # Create a BeautifulSoup object    
    soup = soupify(response)
    
    # Retrieve the heading element's text
    heading = soup.find('h1').get_text()

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }     
# Initiate the web scraping task
scrape_heading_task()

In this code:

  • We scrape the HTML using request, which is specifically designed for making browser-like humane requests.
  • Next, we parse the HTML into a BeautifulSoup object using soupify() and extract the heading.

Step 5: Run the Scraping Task (which makes Humane HTTP Requests)

Finally, run it again:

python main.py

This time, you will observe the exact same result as before, but instead of opening a whole Browser, we are making browser-like humane HTTP requests.

💡 Understanding Botasaurus

What is Botasaurus Driver, And Why should I use it over Selenium and Playwright?

Botasaurus Driver is a web automation driver like Selenium, and the single most important reason to use it is that it is truly humane: you will not, and I repeat NOT, have any issues with accessing any website.

Plus, it is super fast to launch and use, and the API is designed by and for web scrapers, and you will love it.

How do I access Cloudflare-protected pages using Botasaurus?

Cloudflare is the most popular protection system on the web. So, let's see how Botasaurus can help you solve various Cloudflare challenges.

Connection Challenge

This is the single most popular challenge and requires making a browser-like connection with appropriate headers. It's commonly used for:

  • Product Pages
  • Blog Pages
  • Search Result Pages

Example Page: https://www.g2.com/products/github/reviews

What Works?

  • Visiting the website via Google Referrer (which makes it seem as if the user has arrived from a Google search).
from botasaurus.browser import browser, Driver

@browser
def scrape_heading_task(driver: Driver, data):
    # Visit the website via Google Referrer
    driver.google_get("https://www.g2.com/products/github/reviews")
    driver.prompt()
    heading = driver.get_text('.product-head__title [itemprop="name"]')
    return heading

scrape_heading_task()
  • Using the request module. The Request Object is smart and, by default, visits any link with a Google Referrer. Although this works, you will need to use retries.
from botasaurus.request import request, Request

@request(max_retry=10)
def scrape_heading_task(request: Request, data):
    response = request.get('https://www.g2.com/products/github/reviews')
    print(response.status_code)
    response.raise_for_status()
    return response.text

scrape_heading_task()

JS with Captcha Challenge

This challenge requires performing JS computations that differentiate a Chrome controlled by Selenium/Puppeteer/Playwright from a real Chrome. It also involves solving a Captcha. It's used for pages that are only occasionally visited by people, like:

  • 5th Review page
  • Auth pages

Example Page: https://www.g2.com/products/github/reviews.html?page=5&product_id=github

What Does Not Work?

Using @request does not work because although it can make browser-like HTTP requests, it cannot run JavaScript to solve the challenge.

What Works?

Pass the bypass_cloudflare=True argument to the google_get method.

from botasaurus.browser import browser, Driver

@browser
def scrape_heading_task(driver: Driver, data):
    driver.google_get("https://www.g2.com/products/github/reviews.html?page=5&product_id=github", bypass_cloudflare=True)
    driver.prompt()
    heading = driver.get_text('.product-head__title [itemprop="name"]')
    return heading

scrape_heading_task()

What are the benefits of a UI Scraper?

Here are some benefits of creating a scraper with a user interface:

  • Simplify your scraper usage for customers, eliminating the need to teach them how to modify and run your code.
  • Protect your code by hosting the scraper on the web and offering a monthly subscription, rather than providing full access to your code. This approach:
    • Safeguards your Python code from being copied and reused, increasing your customers' lifetime value.
    • Generates monthly recurring revenue via subscriptions from your customers, surpassing a one-time payment.
  • Enable sorting, filtering, and downloading of data in various formats (JSON, Excel, CSV, etc.).
  • Provide access via a REST API for seamless integration.
  • Create a polished frontend, backend, and API integration with minimal code.

How to run a UI-based scraper?

Let's run the Botasaurus Starter Template (the recommended template for greenfield Botasaurus projects), which scrapes the heading of the provided link. Follow these steps:

  1. Clone the Starter Template:

    git clone https://github.com/omkarcloud/botasaurus-starter my-botasaurus-project
    cd my-botasaurus-project
    
  2. Install dependencies (will take a few minutes):

    python -m pip install -r requirements.txt
    python run.py install
    
  3. Run the scraper:

    python run.py
    

Your browser will automatically open up at http://localhost:3000/. Then, enter the link you want to scrape (e.g., https://www.omkar.cloud/) and click on the Run Button.

starter-scraper-demo

After a few seconds, the data will be scraped. starter-scraper-demo-result

Visit http://localhost:3000/output to see all the tasks you have started.

starter-scraper-demo-tasks

Go to http://localhost:3000/about to see the rendered README.md file of the project.

starter-scraper-demo-readme

Finally, visit http://localhost:3000/api-integration to see how to access the Scraper via API.

starter-scraper-demo-api

The API Documentation is generated dynamically based on your Scraper's Inputs, Sorts, Filters, etc., and is unique to your Scraper.

So, whenever you need to run the Scraper via API, visit this tab and copy the code specific to your Scraper.
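For illustration only, running a scraper through such an API typically boils down to a couple of HTTP calls. The endpoint path, port, and payload fields below are placeholders, not the actual Botasaurus API; always copy the real code from your scraper's API Integration tab:

import requests  # pip install requests

# Hypothetical endpoint and payload, shown only to illustrate the idea.
# Replace both with the code displayed in your scraper's API Integration tab.
response = requests.post(
    "http://localhost:3000/api/your-task-endpoint",  # placeholder URL
    json={"data": {"link": "https://www.omkar.cloud/"}},  # placeholder payload
)
print(response.json())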

How to create a UI Scraper using Botasaurus?

Creating a UI Scraper with Botasaurus is a simple 3-step process:

  1. Create your Scraper function
  2. Add the Scraper to the Server using 1 line of code
  3. Define the input controls for the Scraper

To understand these steps, let's go through the code of the Botasaurus Starter Template that you just ran.

Step 1: Create the Scraper Function

In src/scrape_heading_task.py, we define a scraping function which basically does the following:

  1. Receives a data object and extracts the "link".
  2. Retrieves the HTML content of the webpage using the "link".
  3. Converts the HTML into a BeautifulSoup object.
  4. Locates the heading element, extracts its text content and returns it.
from botasaurus.request import request, Request
from botasaurus.soupify import soupify

@request
def scrape_heading_task(request: Request, data):
    # Visit the Link
    response = request.get(data["link"])

    # Create a BeautifulSoup object    
    soup = soupify(response)
    
    # Retrieve the heading element's text
    heading = soup.find('h1').get_text()

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading
    }

Step 2: Add the Scraper to the Server

In backend/scrapers.py, we:

  • Import our scraping function
  • Use Server.add_scraper() to register the scraper
from botasaurus_server.server import Server
from src.scrape_heading_task import scrape_heading_task

# Add the scraper to the server
Server.add_scraper(scrape_heading_task)

Step 3: Define the Input Controls

In backend/inputs/scrape_heading_task.js we:

  • Define a getInput function that takes the controls parameter
  • Add a link input control to it
  • Use comments to enable intellisense in VSCode (Very Very Important)
/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls
        // Render a Link Input, which is required, defaults to "https://www.omkar.cloud/". 
        .link('link', { isRequired: true, defaultValue: "https://www.omkar.cloud/" })
}

Above was a simple example; below is a real-world example with multi-text, number, switch, select, section, and other controls.

/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */


/**
 * @param {Controls} controls
 */
function getInput(controls) {
    controls
        .listOfTexts('queries', {
            defaultValue: ["Web Developers in Bangalore"],
            placeholder: "Web Developers in Bangalore",
            label: 'Search Queries',
            isRequired: true
        })
        .section("Email and Social Links Extraction", (section) => {
            section.text('api_key', {
                placeholder: "2e5d346ap4db8mce4fj7fc112s9h26s61e1192b6a526af51n9",
                label: 'Email and Social Links Extraction API Key',
                helpText: 'Enter your API key to extract email addresses and social media links.',
            })
        })
        .section("Reviews Extraction", (section) => {
            section
                .switch('enable_reviews_extraction', {
                    label: "Enable Reviews Extraction"
                })
                .numberGreaterThanOrEqualToZero('max_reviews', {
                    label: 'Max Reviews per Place (Leave empty to extract all reviews)',
                    placeholder: 20,
                    isShown: (data) => data['enable_reviews_extraction'], defaultValue: 20,
                })
                .choose('reviews_sort', {
                    label: "Sort Reviews By",
                    isRequired: true, isShown: (data) => data['enable_reviews_extraction'], defaultValue: 'newest', options: [{ value: 'newest', label: 'Newest' }, { value: 'most_relevant', label: 'Most Relevant' }, { value: 'highest_rating', label: 'Highest Rating' }, { value: 'lowest_rating', label: 'Lowest Rating' }]
                })
        })
        .section("Language and Max Results", (section) => {
            section
                .addLangSelect()
                .numberGreaterThanOrEqualToOne('max_results', {
                    placeholder: 100,
                    label: 'Max Results per Search Query (Leave empty to extract all places)'
                })
        })
        .section("Geo Location", (section) => {
            section
                .text('coordinates', {
                    placeholder: '12.900490, 77.571466'
                })
                .numberGreaterThanOrEqualToOne('zoom_level', {
                    label: 'Zoom Level (1-21)',
                    defaultValue: 14,
                    placeholder: 14
                })
        })
}

I encourage you to paste the above code into backend/inputs/scrape_heading_task.js and reload the page; you will see a complex set of input controls like the one shown in the image below.

complex-input

Now, to use Botasaurus UI for adding new scrapers, remember these points:

  1. Create a backend/inputs/{your_scraping_function_name}.js file for each scraping function.
  2. Define the getInput function in the file with the necessary controls.
  3. Add comments to enable intellisense in VSCode, as you won't be able to remember all the controls.

Use this template as a starting point for a new scraping function's input controls JS file:

/**
 * @typedef {import('../../frontend/node_modules/botasaurus-controls/dist/index').Controls} Controls
 */

/**
 * @param {Controls} controls
 */
function getInput(controls) {
    // Define your controls here.
}

That's it! With these simple steps, you can create a fully functional UI Scraper using Botasaurus.

Later, you will learn how to add sorts and filters to make your UI Scraper even more powerful and user-friendly.

sorts-filters

What is Botasaurus, and what are its main features?

Botasaurus is an all-in-one web scraping framework designed to achieve two main goals:

  1. Provide common web scraping utilities to solve the pain points of web scraping.
  2. Offer a user interface to make it easy for your non-technical customers to run web scrapers.

To accomplish these goals, Botasaurus gives you 3 decorators:

  • @browser: For scraping web pages using a humane browser.
  • @request: For scraping web pages using lightweight and humane HTTP requests.
  • @task:
    • For scraping web pages using third-party libraries like playwright or selenium.
    • or, for running non-web-scraping tasks, such as data processing (e.g., converting video to audio). Botasaurus is not limited to web scraping; any Python function can be made accessible with a stunning UI and user-friendly API (see the sketch just after this list).
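To illustrate that last point, here is a minimal sketch (the function and data are made up for this example) that wraps an ordinary data-processing function with @task, giving it the same saved output treatment as a scraper:

from botasaurus.task import task

@task
def count_characters(data):
    # A non-scraping task: plain data processing on the input item
    return {"text": data["text"], "length": len(data["text"])}

count_characters({"text": "Botasaurus"})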

In practice, while developing with Botasaurus, you will spend most of your time in the following areas:

  • Configuring your scrapers via decorators with settings like:
    • Which proxy to use
    • How many scrapers to run in parallel, etc.
  • Writing your core web scraping logic using BeautifulSoup (bs4) or the Botasaurus Driver.

Additionally, you will utilize the following Botasaurus utilities for debugging and development:

  • bt: Mainly for writing JSON, EXCEL, and HTML temporary files, and for data cleaning.
  • Sitemap: For accessing the website's links and sitemap.
  • Minor utilities like:
    • LocalStorage: For storing scraper state.
    • soupify: For creating BeautifulSoup objects from Driver, Requests response, Driver Element, or HTML string.
    • IPUtils: For obtaining information (IP, country, etc.) about the current IP address.
    • Cache: For managing the cache.

By simply configuring these three decorators (@browser, @request, and @task) with arguments, you can easily create real-time scrapers and large-scale datasets, thus saving you countless hours that would otherwise be spent writing and debugging code from scratch.

How to use decorators in Botasaurus?

Decorators are the heart of Botasaurus. To use a decorator function, you can call it with:

  • A single item
  • A list of items

If a scraping function is given a list of items, Botasaurus will call it sequentially for each data item.

For example, if you pass a list of three links to the scrape_heading_task function:

from botasaurus.browser import browser, Driver

@browser
def scrape_heading_task(driver: Driver, link):
    driver.get(link)
    heading = driver.get_text("h1")
    return heading

scrape_heading_task(["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/", "https://stackoverflow.com/"]) # <-- list of items

Then, Botasaurus will launch a new browser instance for each item, and the final results will be stored in output/scrape_heading_task.json.

list-demo

How does Botasaurus help me in debugging?

Botasaurus helps you in debugging by:

  • Easily viewing the result of the scraping function, as it is saved in output/{your_scraping_function_name}.json. Say goodbye to print statements.

scraped data

  • Bringing your attention to errors in browser mode with a beep sound and pausing the browser, allowing you to debug the error on the spot.

  • Even if an exception is raised in headless mode, it will still open the website in your default browser, making it easier to debug code in a headless browser. (Isn't it cool?)

headless-error

How to configure the Browser Decorator?

The Browser Decorator allows you to easily configure various aspects of the browser, such as:

  • Blocking images and CSS
  • Setting up proxies
  • Specifying profiles
  • Enabling headless mode
  • Using Chrome extensions
  • Selecting language
  • Passing Arguments to Chrome

Blocking Images and CSS

Blocking images is one of the most important configurations when scraping at scale. It can significantly:

  • Speed up your web scraping tasks
  • Reduce bandwidth usage
  • And save money on proxies. (Best of All!)

For example, a page that originally takes 4 seconds and 12 MB to load might only take one second and 100 KB after blocking images and CSS.

To block images, use the block_images parameter:

@browser(
    block_images=True,
)

To block both images and CSS, use block_images_and_css:

@browser(
    block_images_and_css=True,
)    

Proxies

To use proxies, simply specify the proxy parameter:

@browser(
    proxy="http://username:password@proxy-provider-domain:port"
)    
def visit_ipinfo(driver: Driver, data):
    driver.get("https://ipinfo.io/")
    driver.prompt()

visit_ipinfo()

You can also pass a list of proxies, and the proxy will be automatically rotated:

@browser(
    proxy=[
        "http://username:password@proxy-provider-domain:port", 
        "http://username2:password2@proxy-provider-domain:port"
    ]
)
def visit_ipinfo(driver: Driver, data):
    driver.get("https://ipinfo.io/")
    driver.prompt()

visit_ipinfo() 

Profile

Easily specify the Chrome profile using the profile option:

@browser(
    profile="pikachu"
)    

However, each Chrome profile can become very large (e.g., 100 MB) and can eat up all of your computer's storage.

To solve this problem, use the tiny_profile option, which is a lightweight alternative to Chrome profiles.

When creating hundreds of Chrome profiles, it is highly recommended to use the tiny_profile option because:

  • Creating 1000 Chrome profiles will take at least 100 GB, whereas 1000 tiny profiles will take up only 1 MB of storage, making tiny profiles easy to store and back up.
  • Tiny profiles are cross-platform, meaning you can create profiles on a Linux server, copy the ./profiles folder to a Windows PC, and easily run them.

Under the hood, tiny profiles persist cookies from visited websites, making them extremely lightweight (around 1 KB) while providing the same session persistence.

Here's how to use the tiny profile:

@browser(
    tiny_profile=True, 
    profile="pikachu",
)    

Headless Mode

Enable headless mode with headless=True:

@browser(
    headless=True
)    

Note that using headless mode makes the browser much easier to identify by services like Cloudflare and Datadome. So, use headless mode only when scraping websites that don't use such services.

Chrome Extensions

Botasaurus allows the use of ANY Chrome Extension with just 1 line of code. The example below shows how to use the AdBlocker Chrome Extension:

from botasaurus.browser import browser, Driver
from chrome_extension_python import Extension

@browser(
    extensions=[
        Extension(
            "https://chromewebstore.google.com/detail/adblock-%E2%80%94-best-ad-blocker/gighmmpiobklfepjocnamgkkbiglidom"
        )
    ],
)
def scrape_while_blocking_ads(driver: Driver, data):
    driver.prompt()

scrape_while_blocking_ads()

In some cases, an extension may require additional configuration, such as API keys or credentials. For such scenarios, you can create a custom extension. Learn more about creating and configuring custom extensions here.

Language

Specify the language using the lang option:

from botasaurus.lang import Lang

@browser(
    lang=Lang.Hindi,
)

User Agent and Window Size

To keep the browser truly humane, Botasaurus does not change browser fingerprints by default, because spoofed fingerprints make the browser easy to identify through tests (such as CSS-based checks) that find mismatches between the provided user agent and the actual one.

However, if you need fingerprinting, use the user_agent and window_size options:

from botasaurus.browser import browser, Driver
from botasaurus.user_agent import UserAgent
from botasaurus.window_size import WindowSize

@browser(
    user_agent=UserAgent.RANDOM,
    window_size=WindowSize.RANDOM,
)
def visit_whatsmyua(driver: Driver, data):
    driver.get("https://www.whatsmyua.info/")
    driver.prompt()

visit_whatsmyua()

When working with profiles, you want the fingerprints to remain consistent. You don't want the user's user agent to be Chrome 106 on the first visit and then become Chrome 102 on the second visit.

So, when using profiles, use the HASHED option to generate a consistent user agent and window size based on the profile's hash:

from botasaurus.browser import browser, Driver
from botasaurus.user_agent import UserAgent
from botasaurus.window_size import WindowSize

@browser(
    profile="pikachu",
    user_agent=UserAgent.HASHED,
    window_size=WindowSize.HASHED,
)
def visit_whatsmyua(driver: Driver, data):
    driver.get("https://www.whatsmyua.info/")
    driver.prompt()
    
visit_whatsmyua()

# Every time: same UserAgent and WindowSize
visit_whatsmyua()

Passing Arguments to Chrome

To pass arguments to Chrome, use the add_arguments option:

@browser(
    add_arguments=['--headless=new'],
)

To dynamically generate arguments based on the data parameter, pass a function:

def get_arguments(data):
    return ['--headless=new']

@browser(
    add_arguments=get_arguments,
)

Wait for Complete Page Load

By default, Botasaurus waits for all page resources (DOM, JavaScript, CSS, images, etc.) to load before calling your scraping function with the driver.

However, sometimes the DOM is ready, but JavaScript, images, etc., take forever to load.

In such cases, you can set wait_for_complete_page_load to False to interact with the DOM as soon as the HTML is parsed and the DOM is ready:

@browser(
    wait_for_complete_page_load=False,
)

Reuse Driver

Consider the following example:

from botasaurus.browser import browser, Driver

@browser
def scrape_data(driver: Driver, link):
    driver.get(link)

scrape_data(["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/", "https://stackoverflow.com/"])

If you run this code, the browser will be recreated on each page visit, which is inefficient.

list-demo-omkar

To solve this problem, use the reuse_driver option, which is great for cases like:

  • Scraping a large number of links and reusing the same browser instance for all page visits.
  • Running your scraper in a cloud server to scrape data on demand, without recreating Chrome on each request.

Here's how to use reuse_driver, which will reuse the same Chrome instance for visiting each link:

from botasaurus.browser import browser, Driver

@browser(
    reuse_driver=True
)
def scrape_data(driver: Driver, link):
    driver.get(link)

scrape_data(["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/", "https://stackoverflow.com/"])

Result list-demo-reuse-driver.gif


Also, by default, whenever the program ends or is canceled, Botasaurus smartly closes any open Chrome instances, leaving no instances running in the background.

In rare cases, you may want to explicitly close the Chrome instance. For such scenarios, you can use the .close() method on the scraping function:

scrape_data.close()

This will close any Chrome instances that remain open after the scraping function ends.
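For instance, here is a minimal sketch that reuses one Chrome instance across several links and then closes it explicitly once the work is done:

from botasaurus.browser import browser, Driver

@browser(reuse_driver=True)
def scrape_data(driver: Driver, link):
    driver.get(link)
    return driver.get_text("h1")

scrape_data(["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/"])
scrape_data.close()  # explicitly close the reused Chrome instance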

How to Configure the Browser's Chrome Profile, Language, and Proxy Dynamically Based on Data Parameters?

The decorators in Botasaurus are really flexible, allowing you to pass a function that can derive the browser configuration based on the data item parameter. This is particularly useful when working with multiple Chrome profiles.

You can dynamically configure the browser's Chrome profile and proxy using decorators in two ways:

  1. Using functions to extract configuration values from data:

    • Define functions to extract the desired configuration values from the data parameter.
    • Pass these functions as arguments to the @browser decorator.

    Example:

    from botasaurus.browser import browser, Driver
    
    def get_profile(data):
        return data["profile"]
    
    def get_proxy(data):
        return data["proxy"]
    
    @browser(profile=get_profile, proxy=get_proxy)
    def scrape_heading_task(driver: Driver, data):
        profile, proxy = driver.config.profile, driver.config.proxy
        print(profile, proxy)
        return profile, proxy
    
    data = [
        {"profile": "pikachu", "proxy": "http://142.250.77.228:8000"},
        {"profile": "greyninja", "proxy": "http://142.250.77.229:8000"},
    ]
    
    scrape_heading_task(data)
  2. Directly passing configuration values when calling the decorated function:

    • Pass the profile and proxy values directly as arguments to the decorated function when calling it.

    Example:

    from botasaurus.browser import browser, Driver
    
    @browser
    def scrape_heading_task(driver: Driver, data):
        profile, proxy = driver.config.profile, driver.config.proxy
        print(profile, proxy)
        return profile, proxy
    
    scrape_heading_task(
        profile='pikachu',  # Directly pass the profile
        proxy="http://142.250.77.228:8000",  # Directly pass the proxy
    )

PS: Most Botasaurus decorators allow passing functions to derive configurations from data parameters. Check the decorator's argument type hint to see if it supports this functionality.
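For example, assuming the @request decorator accepts a function for its proxy argument in the same way @browser does (as the PS above suggests), a sketch would look like this:

from botasaurus.request import request, Request

def get_proxy(data):
    return data["proxy"]

# Assumption: the proxy can be derived per data item via a function, as with @browser.
@request(proxy=get_proxy)
def fetch_status(req: Request, data):
    return req.get(data["link"]).status_code

fetch_status([
    {"link": "https://www.omkar.cloud/", "proxy": "http://username:password@proxy-provider-domain:port"},  # TODO: replace with your own proxy
])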

What is the best way to manage profile-specific data, like name and age, across multiple profiles?

To store data related to the active profile, use driver.profile. Here's an example:

from botasaurus.browser import browser, Driver

def get_profile(data):
    return data["profile"]

@browser(profile=get_profile)
def run_profile_task(driver: Driver, data):
    # Set profile data
    driver.profile = {
        'name': 'Amit Sharma',
        'age': 30
    }

    # Update the name in the profile
    driver.profile['name'] = 'Amit Verma'

    # Delete the age from the profile
    del driver.profile['age']

    # Print the updated profile
    print(driver.profile)  # Output: {'name': 'Amit Verma'}

    # Delete the entire profile
    driver.profile = None

run_profile_task([{"profile": "amit"}])

For managing all profiles, use the Profiles utility. Here's an example:

from botasaurus.profiles import Profiles

# Set profiles
Profiles.set_profile('amit', {'name': 'Amit Sharma', 'age': 30})
Profiles.set_profile('rahul', {'name': 'Rahul Verma', 'age': 30})

# Get a profile
profile = Profiles.get_profile('amit')
print(profile)  # Output: {'name': 'Amit Sharma', 'age': 30}

# Get all profiles
all_profiles = Profiles.get_profiles()
print(all_profiles)  # Output: [{'name': 'Amit Sharma', 'age': 30}, {'name': 'Rahul Verma', 'age': 30}]

# Get all profiles in random order
random_profiles = Profiles.get_profiles(random=True)
print(random_profiles)  # Output: [{'name': 'Rahul Verma', 'age': 30}, {'name': 'Amit Sharma', 'age': 30}] in random order

# Delete a profile
Profiles.delete_profile('amit')

Note: All profile data is stored in the profiles.json file in the current working directory. profiles

What are some common methods in Botasaurus Driver?

Botasaurus Driver provides several handy methods for web automation tasks such as:

  • Visiting URLs:

    driver.get("https://www.example.com")
    driver.google_get("https://www.example.com")  # Use Google as the referer [Recommended]
    driver.get_via("https://www.example.com", referer="https://duckduckgo.com/")  # Use custom referer
    driver.get_via_this_page("https://www.example.com")  # Use current page as referer
  • For finding elements:

    from botasaurus.browser import Wait
    search_results = driver.select(".search-results", wait=Wait.SHORT)  # Wait for up to 4 seconds for the element to be present, return None if not found
    all_links = driver.select_all("a")  # Get all elements matching the selector
    search_results = driver.wait_for_element(".search-results", wait=Wait.LONG)  # Wait for up to 8 seconds for the element to be present, raise exception if not found
    hello_mom = driver.get_element_with_exact_text("Hello Mom", wait=Wait.VERY_LONG)  # Wait for up to 16 seconds for an element having the exact text "Hello Mom"
  • Interact with elements:

    driver.type("input[name='username']", "john_doe")  # Type into an input field
    driver.click("button.submit")  # Clicks an element
    element = driver.select("button.submit")
    element.click()  # Click on an element
  • Retrieve element properties:

    header_text = driver.get_text("h1")  # Get text content
    error_message = driver.get_element_containing_text("Error: Invalid input")
    image_url = driver.select("img.logo").get_attribute("src")  # Get attribute value
  • Work with parent-child elements:

    parent_element = driver.select(".parent")
    child_element = parent_element.select(".child")
    child_element.click()  # Click child element
  • Execute JavaScript:

    result = driver.run_js("return document.title")
    text_content = element.run_js("(el) => el.textContent")
  • Working with iframes:

    driver.get("https://www.g2.com/products/github/reviews.html?page=5&product_id=github")
    iframe = driver.select_iframe("#turnstile-wrapper iframe")
    text_content = iframe.select("body label").text
  • Miscellaneous:

    form.type("input[name='password']", "secret_password")  # Type into a form field
    container.is_element_present(".button")  # Check element presence
    page_html = driver.page_html  # Current page HTML
    driver.select(".footer").scroll_into_view()  # Scroll element into view
    driver.close()  # Close the browser

How Can I Pause the Browser to Inspect Website when Developing the Scraper?

To pause the scraper and wait for user input before proceeding, use driver.prompt():

driver.prompt()

How do I configure authenticated proxies with SSL in Botasaurus?

Proxy providers like BrightData, IPRoyal, and others typically provide authenticated proxies in the format "http://username:password@proxy-provider-domain:port".

However, if you use an authenticated proxy with a library like seleniumwire to visit a Cloudflare-protected website like G2.com, you are GUARANTEED to be identified, because you are using a non-SSL connection.

To verify this, run the following code:

First, install the necessary packages:

python -m pip install selenium_wire chromedriver_autoinstaller_fix

Then, execute this Python script:

from seleniumwire import webdriver
from chromedriver_autoinstaller_fix import install

# Define the proxy
proxy_options = {
    'proxy': {
        'http': 'http://username:password@proxy-provider-domain:port', # TODO: Replace with your own proxy
        'https': 'http://username:password@proxy-provider-domain:port', # TODO: Replace with your own proxy
    }
}

# Install and set up the driver
driver_path = install()
driver = webdriver.Chrome(driver_path, seleniumwire_options=proxy_options)

# Visit the desired URL
link = 'https://www.g2.com/products/github/reviews'
driver.get("https://www.google.com/")
driver.execute_script(f'window.location.href = "{link}"')

# Prompt for user input
input("Press Enter to exit...")

# Clean up
driver.quit()

You will SURELY be identified:

identified

However, using proxies with Botasaurus solves this issue. See the difference by running the following code:

from botasaurus.browser import browser, Driver

@browser(proxy="http://username:password@proxy-provider-domain:port") # TODO: Replace with your own proxy 
def scrape_heading_task(driver: Driver, data):
    driver.google_get("https://www.g2.com/products/github/reviews")
    driver.prompt()

scrape_heading_task()    

Result: not identified

Important Note: To run the code above, you will need Node.js installed.

Why am I getting a socket connection error when using a proxy to access a website?

Certain proxy providers like BrightData will block access to specific websites. To determine if this is the case, run the following code:

from botasaurus.browser import browser, Driver

@browser(proxy="http://username:password@proxy-provider-domain:port")  # TODO: Replace with your own proxy
def visit_ipinfo(driver: Driver, data):
    driver.get("https://ipinfo.io/")
    driver.prompt()

visit_ipinfo()

If you can successfully access the ipinfo website but not the website you're attempting to scrape, it means the proxy provider is blocking access to that particular website.

In such situations, the only solution is to switch to a different proxy provider.

Some good proxy providers we personally use are:

  • For Rotating Datacenter Proxies: BrightData Datacenter Proxies, which cost around $0.6 per GB on a pay-as-you-go basis. No KYC is required.
  • For Rotating Residential Proxies: IPRoyal Royal Residential Proxies, which cost around $7 per GB on a pay-as-you-go basis. No KYC is required.

As always, nothing good in life comes free. Proxies are expensive and will account for almost all of your scraping costs.

So, use proxies only when you need them, and prefer request-based scrapers over browser-based scrapers to save bandwidth.

Note: BrightData and IPRoyal have not paid us. We are recommending them based on our personal experience.

Which country should I choose when using proxies for web scraping?

The United States is often the best choice because:

  • The United States has a highly developed internet infrastructure and is home to numerous data centers, ensuring faster internet speeds.
  • Most global companies host their websites in the US, so using a US proxy will result in faster scraping speeds.

Should I use a proxy for web scraping?

ONLY IF you encounter IP blocks.

Sadly, most scrapers use proxies even when they are not needed. Everything looks like a nail when you have a hammer.

We have seen scrapers which can easily access hundreds of thousands of protected pages using the @browser module on home Wi-Fi without any issues.

So, as a best practice, scrape using the @browser module on your home Wi-Fi first. Only resort to proxies when you encounter IP blocks.

This practice will save you a considerable amount of time (as proxies are really slow) and money (as proxies are expensive as well).

How to configure the Request Decorator?

The Request Decorator is used to make humane requests. Under the hood, it uses botasaurus-requests, a library based on hrequests, which incorporates important features like:

  • Using browser-like headers in the correct order.
  • Making a browser-like connection with the correct ciphers.
  • Using a google.com referer by default to make it appear as if the user has arrived from a Google search.

Also, the Request Decorator allows you to configure a proxy as follows:

@request(
    proxy="http://username:password@proxy-provider-domain:port"
)    

What Options Can I Configure in all 3 Decorators?

All 3 decorators allow you to configure the following options:

  • Parallel Execution
  • Caching Results
  • Passing Common Metadata
  • Asynchronous Queues
  • Asynchronous Execution
  • Handling Crashes
  • Configuring Output
  • Exception Handling

Let's dive into each of these options and in later sections we will see their real-world applications.

parallel

The parallel option allows you to scrape data in parallel by launching multiple browser/request/task instances simultaneously. This can significantly speed up the scraping process.

Run the example below to see parallelization in action:

from botasaurus.browser import browser, Driver

@browser(parallel=3, data=["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/", "https://stackoverflow.com/"])
def scrape_heading_task(driver: Driver, link):
    driver.get(link)
    heading = driver.get_text('h1')
    return heading

scrape_heading_task()    

cache

The cache option enables caching of web scraping results to avoid re-scraping the same data. This can significantly improve performance and reduce redundant requests.

Run the example below to see how caching works:

from botasaurus.browser import browser, Driver

@browser(cache=True, data=["https://www.omkar.cloud/", "https://www.omkar.cloud/blog/", "https://stackoverflow.com/"])
def scrape_heading_task(driver: Driver, link):
    driver.get(link)
    heading = driver.get_text('h1')
    return heading

print(scrape_heading_task())
print(scrape_heading_task())  # Data will be fetched from cache immediately 

Note: Caching is one of the most important features of Botasaurus.

metadata

The metadata option allows you to pass common information shared across all data items. This can include things like API keys, browser cookies, or any other data that remains constant throughout the scraping process.

It is commonly used with caching to exclude details like API keys and browser cookies from the cache key.

Here's an example of how to use the metadata option:

from botasaurus.task import task

@task()
def scrape_heading_task(data, metadata):
    print("metadata:", metadata)
    print("data:", data)

data = [
    {"profile": "pikachu", "proxy": "http://142.250.77.228:8000"},
    {"profile": "greyninja", "proxy": "http://142.250.77.229:8000"},
]
scrape_heading_task(
  data, 
  metadata={"api_key": "BDEC26..."}
)
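For instance, here is a minimal sketch that combines metadata with caching; it assumes, as described above, that the cache key is derived from the data items only, so the API key passed via metadata stays out of the cache key:

from botasaurus.task import task

@task(cache=True)
def scrape_item(data, metadata):
    # The API key lives in metadata, so it does not affect the cache key
    api_key = metadata["api_key"]
    return {"item": data, "used_api_key": bool(api_key)}

scrape_item(["item-1", "item-2"], metadata={"api_key": "BDEC26..."})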

async_queue

In the world of web scraping, there are only two types of scrapers:

  1. Dataset Scrapers: These extract data from websites and store it as datasets. Companies like Bright Data use them to build datasets for Crunchbase, Indeed, etc.

  2. Real-time Scrapers: These fetch data from sources in real-time, like SERP APIs that provide Google and DuckDuckGo search results.

When building real-time scrapers, speed is paramount because customers are waiting for requests to complete. The async_queue feature is incredibly useful in such cases.

async_queue allows you to run scraping tasks asynchronously in a queue and gather the results using the .get() method.

A great use case for async_queue is scraping Google Maps. Instead of scrolling through the list of places and then scraping the details of each place sequentially, you can use async_queue to:

  1. Scroll through the list of places.
  2. Simultaneously make HTTP requests to scrape the details of each place in the background.

By executing the scrolling and requesting tasks concurrently, you can significantly speed up the scraper.

Run the code below to see browser scrolling and request scraping happening concurrently (really cool, must try!):

from botasaurus.browser import browser, Driver, AsyncQueueResult
from botasaurus.request import request, Request
import json

def extract_title(html):
    return json.loads(
        html.split(";window.APP_INITIALIZATION_STATE=")[1].split(";window.APP_FLAGS")[0]
    )[5][3][2][1]

@request(
    parallel=5,
    async_queue=True,
    max_retry=5,
)
def scrape_place_title(request: Request, link, metadata):
    cookies = metadata["cookies"]
    html = request.get(link, cookies=cookies, timeout=12).text
    title = extract_title(html)
    print("Title:", title)
    return title

def has_reached_end(driver):
    return driver.select('p.fontBodyMedium > span > span') is not None

def extract_links(driver):
    return driver.get_all_links('[role="feed"] > div > div > a')

@browser()
def scrape_google_maps(driver: Driver, link):
    driver.google_get(link, accept_google_cookies=True)  # accepts google cookies popup

    scrape_place_obj: AsyncQueueResult = scrape_place_title()  # initialize the async queue for scraping places
    cookies = driver.get_cookies_dict()  # get the cookies from the driver

    while True:
        links = extract_links(driver)  # get the links to places
        scrape_place_obj.put(links, metadata={"cookies": cookies})  # add the links to the async queue for scraping

        print("scrolling")
        driver.scroll_to_bottom('[role="feed"]')  # scroll to the bottom of the feed

        if has_reached_end(driver):  # we have reached the end, let's break buddy
            break

    results = scrape_place_obj.get()  # get the scraped results from the async queue
    return results

scrape_google_maps("https://www.google.com/maps/search/web+developers+in+bangalore")

run_async

Similarly, the run_async option allows you to execute scraping tasks asynchronously, enabling concurrent execution.

Similar to async_queue, you can use the .get() method to retrieve the results of an asynchronous task.

Code Example:

from botasaurus.browser import browser, Driver
from time import sleep

@browser(run_async=True)
def scrape_heading(driver: Driver, data):
    sleep(5)
    return {}

if __name__ == "__main__":
    result1 = scrape_heading()  # Launches asynchronously
    result2 = scrape_heading()  # Launches asynchronously

    result1.get()  # Wait for the first result
    result2.get()  # Wait for the second result

close_on_crash

The close_on_crash option determines the behavior of the scraper when an exception occurs:

  • If set to False (default):
    • The scraper will make a beep sound and pause the browser.
    • This makes debugging easier by keeping the browser open at the point of the crash.
    • Use this setting during development and testing.
  • If set to True:
    • The scraper will close the browser and continue with the rest of the data items.
    • This is suitable for production environments when you are confident that your scraper is robust.
    • Use this setting to avoid interruptions and ensure the scraper processes all data items.
from botasaurus.browser import browser, Driver

@browser(
    close_on_crash=False  # Determines whether the browser is paused (default: False) or closed when an error occurs
)
def scrape_heading_task(driver: Driver, data):
    raise Exception("An error occurred during scraping.")

scrape_heading_task()  

output and output_formats

By default, Botasaurus saves the result of scraping in the output/{your_scraping_function_name}.json file. Let's learn about various ways to configure the output.

  1. Change Output Filename: Use the output parameter in the decorator to specify a custom filename for the output.
from botasaurus.task import task

@task(output="my-output")
def scrape_heading_task(data): 
    return {"heading": "Hello, Mom!"}

scrape_heading_task()
  2. Disable Output: If you don't want any output to be saved, set output to None.
from botasaurus.task import task

@task(output=None)
def scrape_heading_task(data): 
    return {"heading": "Hello, Mom!"}

scrape_heading_task()
  3. Dynamically Write Output: To dynamically write output based on data and result, pass a function to the output parameter:
from botasaurus.task import task
from botasaurus import bt

def write_output(data, result):
    json_filename = bt.write_json(result, 'data')
    excel_filename = bt.write_excel(result, 'data')
    bt.zip_files([json_filename, excel_filename]) # Zip the JSON and Excel files for easy delivery to the customer

@task(output=write_output)  
def scrape_heading_task(data): 
    return {"heading": "Hello, Mom!"}

scrape_heading_task()

  4. Save Outputs in Multiple Formats: Use the output_formats parameter to save outputs in different formats like JSON and EXCEL.

from botasaurus.task import task
from botasaurus import bt

@task(output_formats=[bt.Formats.JSON, bt.Formats.EXCEL])  
def scrape_heading_task(data): 
    return {"heading": "Hello, Mom!"}

scrape_heading_task()

PRO TIP: When delivering data to customers, provide the dataset in JSON and Excel formats. Avoid CSV unless the customer asks, because Microsoft Excel has a hard time rendering CSV files with nested JSON.

CSV vs Excel csv-vs-excel

Exception Handling Options

Botasaurus provides various exception handling options to make your scrapers more robust:

  • max_retry: By default, any failed task is not retried. You can specify the maximum number of times to retry scraping when an error occurs using the max_retry option.
  • retry_wait: Specifies the waiting time between retries.
  • raise_exception: By default, Botasaurus does not raise an exception when an error occurs during scraping. Say you are keeping your PC running overnight to scrape 10,000 links; if one link fails, you really don't want to stop the entire scraping process and ruin your morning with an unfinished dataset. Set raise_exception=True to halt on errors instead.
  • must_raise_exceptions: Specifies exceptions that must be raised, even if raise_exception is set to False.
  • create_error_logs: Determines whether error logs should be created when exceptions occur. In production, when scraping hundreds of thousands of links, it's recommended to set create_error_logs to False to avoid using computational resources for creating error logs.
@browser(
    raise_exception=True,  # Raise an exception and halt the scraping process when an error occurs
    max_retry=5,  # Retry scraping a failed task a maximum of 5 times
    retry_wait=10,  # Wait for 10 seconds before retrying a failed task
    must_raise_exceptions=[CustomException],  # Definitely raise CustomException, even if raise_exception is set to False
    create_error_logs=False  # Disable the creation of error logs to optimize scraper performance
)
def scrape_heading_task(driver: Driver, data):
  # ...

What are some examples of common web scraping utilities provided by Botasaurus that make scraping easier?

bt Utility

The bt utility provides helper functions for:

  • Writing and reading JSON, EXCEL, and CSV files
  • Data cleaning

Some key functions are:

  • bt.write_json and bt.read_json: Easily write and read JSON files.
from botasaurus import bt

data = {"name": "pikachu", "power": 101}
bt.write_json(data, "output")
loaded_data = bt.read_json("output")
  • bt.write_excel and bt.read_excel: Easily write and read EXCEL files.
from botasaurus import bt

data = {"name": "pikachu", "power": 101}
bt.write_excel(data, "output")
loaded_data = bt.read_excel("output")
  • bt.write_csv and bt.read_csv: Easily write and read CSV files.
from botasaurus import bt

data = {"name": "pikachu", "power": 101}
bt.write_csv(data, "output")
loaded_data = bt.read_csv("output")
  • bt.write_html and bt.read_html: Write HTML content to a file.
from botasaurus import bt

html_content = "<html><body><h1>Hello, Mom!</h1></body></html>"
bt.write_html(html_content, "output")
  • bt.write_temp_json, bt.write_temp_csv, bt.write_temp_html: Write temporary JSON, CSV, or HTML files for debugging purposes.
from botasaurus import bt

data = {"name": "pikachu", "power": 101}
bt.write_temp_json(data)
bt.write_temp_csv(data)
bt.write_temp_html("<html><body><h1>Hello, Mom!</h1></body></html>")
  • Data cleaning functions like bt.extract_numbers, bt.extract_links, bt.remove_html_tags, and more.
text = "The price is $19.99 and the website is https://www.example.com"
numbers = bt.extract_numbers(text)  # [19.99]
links = bt.extract_links(text)  # ["https://www.example.com"]
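The HTML-cleaning helper mentioned above works on plain strings; here is a minimal sketch, assuming bt.remove_html_tags returns the text with markup stripped:

from botasaurus import bt

html = "<p>The price is <b>$19.99</b></p>"
clean_text = bt.remove_html_tags(html)  # assumption: returns the tag-free text
print(clean_text)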

Local Storage Utility

The Local Storage utility allows you to store and retrieve key-value pairs, which can be useful for maintaining state between scraper runs.

Here's how to use it:

from botasaurus.local_storage import LocalStorage

LocalStorage.set_item("credits_used", 100)
print(LocalStorage.get_item("credits_used", 0))

soupify Utility

The soupify utility creates a BeautifulSoup object from a Driver, Requests response, Driver Element, or HTML string.

from botasaurus.soupify import soupify
from botasaurus.request import request, Request
from botasaurus.browser import browser, Driver

@request
def get_heading_from_request(req: Request, data):
   """
   Get the heading of a web page using the request object.
   """
   response = req.get("https://www.example.com")
   soup = soupify(response)
   heading = soup.find("h1").text
   print(f"Page Heading: {heading}")

@browser
def get_heading_from_driver(driver: Driver, data):
   """
   Get the heading of a web page using the driver object.
   """
   driver.get("https://www.example.com")

   # Get the heading from the entire page
   page_soup = soupify(driver)
   page_heading = page_soup.find("h1").text
   print(f"Heading from Driver's Soup: {page_heading}")

   # Get the heading from the body element
   body_soup = soupify(driver.select("body"))
   body_heading = body_soup.find("h1").text
   print(f"Heading from Element's Soup: {body_heading}")

# Call the functions
get_heading_from_request()
get_heading_from_driver()

IP Utils

IP Utils provide functions to get information about the current IP address, such as the IP itself, country, ISP, and more:

from botasaurus.ip_utils import IPUtils

# Get the current IP address
current_ip = IPUtils.get_ip()
print(current_ip)
# Output: 47.31.226.180

# Get detailed information about the current IP address
ip_info = IPUtils.get_ip_info()
print(ip_info)
# Output: {
#     "ip": "47.31.226.180",
#     "country": "IN",
#     "region": "Delhi",
#     "city": "Delhi",
#     "postal": "110001",
#     "coordinates": "28.6519,77.2315",
#     "latitude": "28.6519",
#     "longitude": "77.2315",
#     "timezone": "Asia/Kolkata",
#     "org": "AS55836 Reliance Jio Infocomm Limited"
# }

Cache Utility

The Cache utility in Botasaurus allows you to manage cached data for your scraper. You can put, get, has, remove, and clear cache data.

Basic Usage

from botasaurus.task import task
from botasaurus.cache import Cache

# Example scraping function
@task
def scrape_data(data):
    # Your scraping logic here
    return {"processed": data}

# Sample data for scraping
input_data = {"key": "value"}

# Adding data to the cache
Cache.put('scrape_data', input_data, scrape_data(input_data))

# Checking if data is in the cache
if Cache.has('scrape_data', input_data):
    # Retrieving data from the cache
    cached_data = Cache.get('scrape_data', input_data)
    print(f"Cached data: {cached_data}")

# Removing specific data from the cache
Cache.remove('scrape_data', input_data)

# Clearing the complete cache for the scrape_data function
Cache.clear('scrape_data')

Advanced Usage for large-scale scraping projects

Count Cached Items

You can count the number of items cached for a particular function, which can serve as a scraping progress bar.

from botasaurus.cache import Cache

Cache.print_cached_items_count('scraping_function')

Filter Cached/Uncached Items

You can filter items that have been cached or not cached for a particular function.

from botasaurus.cache import Cache

all_items = ['1', '2', '3', '4', '5']

# Get items that are cached
cached_items = Cache.filter_items_in_cache('scraping_function', all_items)
print(cached_items)

# Get items that are not cached
uncached_items = Cache.filter_items_not_in_cache('scraping_function', all_items)
print(uncached_items)

Delete Cache The cache for a function is stored in the cache/{your_scraping_function_name}/ folder. To delete the cache, simply delete that folder.

delete-cache

Delete Specific Items

You can delete specific items from the cache for a particular function.

from botasaurus.cache import Cache

all_items = ['1', '2', '3', '4', '5']
deleted_count = Cache.delete_items('scraping_function', all_items)
print(f"Deleted {deleted_count} items from the cache.")

Delete Items by Filter

In some cases, you may want to delete specific items from the cache based on a condition. For example, if you encounter honeypots (mock HTML served to dupe web scrapers) while scraping a website, you may want to delete those items from the cache.

def should_delete_item(item, result):
    if 'Honeypot Item' in result:
        return True  # Delete the item
    return False  # Don't delete the item

all_items = ['1', '2', '3', '4', '5']
# List of items to iterate over, it is fine if the list contains items which have not been cached, as they will be simply ignored.
Cache.delete_items_by_filter('scraping_function', all_items, should_delete_item)

Importantly, be cautious and first use delete_items_by_filter on a small set of items which you want to be deleted. Here's an example:

from botasaurus import bt
from botasaurus.cache import Cache

def should_delete_item(item, result):
    # TODO: Update the logic
    if 'Honeypot Item' in result:
        return True # Delete the item
    return False # Don't delete the item

test_items = ['1', '2'] # TODO: update with target items
scraping_function_name = 'scraping_function' # TODO:  update with target scraping function name
Cache.delete_items_by_filter(scraping_function_name, test_items, should_delete_item)

for item in test_items:
    if Cache.has(scraping_function_name, item):
        bt.prompt(f"Item {item} was not deleted. Please review the logic of the should_delete_item function.")

How to Extract Links from a Sitemap?

In web scraping, it is a common use case to scrape product pages, blogs, etc. But before scraping these pages, you need to get the links to these pages.

Sadly, many developers unnecessarily increase their workload by writing code to visit each page one by one and scrape links, which they could have easily obtained by just looking at the sitemap.

The Botasaurus Sitemap Module makes this process a piece of cake by allowing you to get all links or sitemaps using:

  • The homepage URL (e.g., https://www.omkar.cloud/)
  • A direct sitemap link (e.g., https://www.omkar.cloud/sitemap.xml; see the short sketch after this list)
  • A .gz compressed sitemap
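
For instance, a minimal sketch of the direct-sitemap-link case might look like this (it assumes write_links can be called directly on a Sitemap instance, just as write_sitemaps is in the example further below; the output name 'omkar-links' is illustrative):

from botasaurus.sitemap import Sitemap

# Fetch every link listed in the sitemap and write them to the output folder
links = Sitemap("https://www.omkar.cloud/sitemap.xml").write_links('omkar-links')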

For example, if you're an Angel Investor seeking innovative tech startups to invest in, G2 is an ideal platform for finding them. You can run the following code to fetch over 160K product links from G2:

from botasaurus import bt
from botasaurus.sitemap import Sitemap, Filters, Extractors

links = (
    Sitemap("https://www.g2.com/sitemaps/sitemap_index.xml.gz")
    .filter(Filters.first_segment_equals("products"))
    .extract(Extractors.extract_link_upto_second_segment())
    .write_links('g2-products')
)

Output:

g2-sitemap-links.png

Or, let's say you're in the mood for some reading and are looking for good stories. The following code will get you over 1,000 story links from moralstories26.com:

from botasaurus import bt
from botasaurus.sitemap import Sitemap, Filters

links = (
    Sitemap("https://moralstories26.com/")
    .filter(
        Filters.has_exactly_1_segment(),
        Filters.first_segment_not_equals(
            ["about", "privacy-policy", "akbar-birbal", "animal", "education", "fables", "facts", "family", "famous-personalities", "folktales", "friendship", "funny", "heartbreaking", "inspirational", "life", "love", "management", "motivational", "mythology", "nature", "quotes", "spiritual", "uncategorized", "zen"]
        ),
    )
    .write_links('moral-stories')
)

Output:

moralstories26-sitemap-links.png

Also, before scraping a site, it's useful to identify the available sitemaps. This can be easily done with the following code:

from botasaurus import bt
from botasaurus.sitemap import Sitemap

sitemaps = Sitemap("https://www.omkar.cloud/").write_sitemaps('omkar-sitemaps')

Output:

omkar-sitemap-links.png

To ensure your scrapers run super fast, we cache the Sitemap, but you may want to periodically refresh the cache. To do so, pass the Cache.REFRESH parameter.

from botasaurus import bt
from botasaurus.sitemap import Sitemap, Filters, Extractors
from botasaurus.cache import Cache

links = (
    Sitemap("https://www.g2.com/sitemaps/sitemap_index.xml.gz", cache=Cache.REFRESH) # Refresh the cache
    .filter(Filters.first_segment_equals("products"))
    .extract(Extractors.extract_link_upto_second_segment())
    .write_links('g2-products')
)

How can I filter a list of links, similar to working with Sitemaps?

Filtering links from a webpage is a common requirement in web scraping. For example, you might want to filter out all non-product pages.

Botasaurus's Links module simplifies link filtering and extraction:

from botasaurus.links import Links, Filters, Extractors

# Sample list of links
links = [
    "https://www.g2.com/categories/project-management",
    "https://www.g2.com/categories/payroll", 
    "https://www.g2.com/products/jenkins/reviews", 
    "https://www.g2.com/products/redis-software/pricing"
]

# Filter and extract links
filtered_links = (
    Links(links)
    .filter(Filters.first_segment_equals("products"))
    .extract(Extractors.extract_link_upto_second_segment())
    .write('g2-products')
)

What is the best way to use caching in Botasaurus?

Sadly, when using caching, most developers write a single scraping function that both fetches the HTML and extracts the data from it, like this:

from botasaurus.request import request, Request
from botasaurus.soupify import soupify

@request
def scrape_data(request: Request, data):
    # Visit the Link
    response = request.get(data)
    
    # Create a BeautifulSoup object
    soup = soupify(response)
    
    # Retrieve the heading element's text
    heading = soup.find('h1').get_text()
    
    # Save the data as a JSON file in output/scrape_data.json
    return {"heading": heading}

data_items = [
    "https://www.omkar.cloud/",
    "https://www.omkar.cloud/blog/",
    "https://stackoverflow.com/",
]

scrape_data(data_items)

Now, let's say, after 50% of the dataset has been scraped, what if:

  • Your customer wants to add another data point (which is very likely), or
  • One of your BeautifulSoup selectors happens to be flaky and needs to be updated (which is super likely)?

In such cases, you will have to scrape all the pages again, which is painful as it will take a lot of time and incur high proxy costs.

To resolve this issue, you can:

  1. Write a function that only scrapes and caches the HTML.
  2. Write a separate function that calls the HTML scraping function, extracts data using BeautifulSoup, and caches the result.

Here's a practical example:

from bs4 import BeautifulSoup
from botasaurus.task import task
from botasaurus.request import request, Request
from botasaurus.soupify import soupify

@request(cache=True)
def scrape_html(request: Request, url):
    # Scrape the HTML and cache it
    html = request.get(url).text
    return html

def extract_data(soup: BeautifulSoup):
    # Extract the heading from the HTML
    heading = soup.find("h1").get_text()
    return {"heading": heading}

# Cache the scrape_data task as well
@task(cache=True)
def scrape_data(url):
    # Call the scrape_html function to get the cached HTML
    html = scrape_html(url)
    # Extract data from the HTML using the extract_data function
    return extract_data(soupify(html))

data_items = [
    "https://www.omkar.cloud/",
    "https://www.omkar.cloud/blog/",
    "https://stackoverflow.com/",
]

scrape_data(data_items)

With this approach:

  • If you need to add data points or fix BeautifulSoup bugs, delete the cache/scrape_data folder and re-run the scraper. delete-cache
  • You only need to re-run the BeautifulSoup extraction, not the entire HTML scraping, saving time and proxy costs. Yahoo!

PRO TIP: This approach also makes your extract_data code easier and faster to test, like this:

from bs4 import BeautifulSoup
from botasaurus import bt

def extract_data(soup: BeautifulSoup):
    heading = soup.find('h1').get_text()
    return {"heading": heading}

# Place this at the bottom of the same file that defines the cached scrape_data task shown above.
if __name__ == '__main__':
    # Uses the cached HTML and re-runs only the extract_data function.
    bt.write_temp_json(scrape_data("https://www.omkar.cloud/", cache=False))

What are the recommended settings for each decorator to build a production-ready scraper in Botasaurus?

For websites with minimal protection, use the Request module.

Here's a template for creating production-ready datasets using the Request module:

from bs4 import BeautifulSoup
from botasaurus.task import task
from botasaurus.request import request, Request
from botasaurus.soupify import soupify

@request(
    # proxy='http://username:password@datacenter-proxy-domain:proxy-port', # Uncomment to use Proxy ONLY if you face IP blocking
    cache=True,

    max_retry=20, # Retry up to 20 times, which is a good default

    output=None,

    close_on_crash=True,
    raise_exception=True,
    create_error_logs=False,
)
def scrape_html(request: Request, url):
    # Scrape the HTML and cache it
    response = request.get(url)
    response.raise_for_status()
    return response.text

def extract_data(soup: BeautifulSoup):
    # Extract the heading from the HTML
    heading = soup.find("h1").get_text()
    return {"heading": heading}

# Cache the scrape_data task as well
@task(
    cache=True,
    close_on_crash=True,
    create_error_logs=False,
    parallel=40, # Run 40 requests in parallel, which is a good default
)
def scrape_data(url):
    # Call the scrape_html function to get the cached HTML
    html = scrape_html(url)
    # Extract data from the HTML using the extract_data function
    return extract_data(soupify(html))

data_items = [
    "https://www.omkar.cloud/",
    "https://www.omkar.cloud/blog/",
    "https://stackoverflow.com/",
]

scrape_data(data_items)

For visiting well-protected websites, use the Browser module.

Here's a template for creating production-ready datasets using the Browser module:

from bs4 import BeautifulSoup
from botasaurus.task import task
from botasaurus.browser import browser, Driver
from botasaurus.soupify import soupify

@browser(
    # proxy='http://username:password@datacenter-proxy-domain:proxy-port', # Uncomment to use Proxy ONLY if you face IP blocking

    # block_images_and_css=True, # Uncomment to block images and CSS, which can speed up scraping
    # wait_for_complete_page_load=False, # Uncomment to proceed once the DOM (Document Object Model) is loaded, without waiting for all resources to finish loading. This is recommended for faster scraping of Server Side Rendered (HTML) pages. eg: https://www.g2.com/products/jenkins/reviews.html

    cache=True,
    max_retry=5,  # Retry up to 5 times, which is a good default

    reuse_driver=True,  # Reuse the same driver for all tasks
    
    output=None,

    close_on_crash=True,
    raise_exception=True,
    create_error_logs=False,
)
def scrape_html(driver: Driver, url):
    # Scrape the HTML and cache it
    driver.google_get(
        url,
        bypass_cloudflare=True,  # delete this line if the website you're accessing is not protected by Cloudflare
    )
    return driver.page_html

def extract_data(soup: BeautifulSoup):
    # Extract the heading from the HTML
    heading = soup.select_one('.product-head__title [itemprop="name"]').get_text()
    return {"heading": heading}

# Cache the scrape_data task as well
@task(
    cache=True,
    close_on_crash=True,
    create_error_logs=False,
)
def scrape_data(url):
    # Call the scrape_html function to get the cached HTML
    html = scrape_html(url)
    # Extract data from the HTML using the extract_data function
    return extract_data(soupify(html))

data_items = [
    "https://www.g2.com/products/stack-overflow-for-teams/reviews?page=8",
    "https://www.g2.com/products/jenkins/reviews?page=19",
]

scrape_data(data_items)

What Are Some Tips for accessing Protected sites?

  • Use google_get, use google_get, and use google_get! (See the short sketch after this list.)
  • Don't use headless mode; otherwise, you will surely be identified by Cloudflare, DataDome, and Imperva.
  • Don't use proxies; instead, use your home Wi-Fi connection, even when scraping hundreds of thousands of pages.
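
Here is a minimal sketch of the first tip, based on the Browser template above (the function name and target URL are illustrative):

from botasaurus.browser import browser, Driver

@browser
def scrape_protected_page(driver: Driver, data):
    # Navigate with google_get, as recommended above for protected sites.
    # Add bypass_cloudflare=True (as in the Browser template above) if the site is protected by Cloudflare.
    driver.google_get("https://www.g2.com/products/jenkins/reviews")
    return driver.page_html

scrape_protected_page()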

How Do I Close All Running Chrome Instances?

While developing a scraper, multiple browser instances may remain open in the background (for example, after the scraper is interrupted with CTRL + C). This can cause your computer to hang.

Many Chrome processes running in Task Manager

To prevent your PC from hanging, you can run the following command to close all Chrome instances:

python -m close_chrome

How to Run Scraper in Docker?

To run a Scraper in Docker, use the Botasaurus Starter Template, which includes the necessary Dockerfile and Docker Compose configurations.

Use the following commands to clone the Botasaurus Starter Template, build a Docker image from it, and execute the scraper within a Docker environment.

git clone https://github.com/omkarcloud/botasaurus-starter my-botasaurus-project
cd my-botasaurus-project
docker-compose build
docker-compose up

How to Run Scraper in Gitpod?

Running a scraper in Gitpod offers several benefits:

  • Allows your scraper to use a powerful 8-core machine with 1000 Mbps internet speed
  • Makes it easy to showcase your scraper to customers without them having to install anything, by simply sharing the Gitpod machine link

In this example, we will run the Botasaurus Starter template in Gitpod:

  1. First, visit this link and sign up using your GitHub account.

    Screenshot (148)

  2. Once signed up, open the starter project in Gitpod.

    gp-continue

  3. In the terminal, run the following command:

    python run.py
  4. You will see a popup notification with the heading "A service is available on port 3000". In the popup, click the "Open Browser" button to open the UI Dashboard in your browser.

    open-browser.png

  5. Now, you can press the Run button to get the results.

    starter-photo.png

Note: Gitpod is not suitable for long-running tasks, as the environment will automatically shut down after a short period of inactivity. Use your local machine or a cloud VM for long-running scrapers.

How to Run Scraper in Virtual Machine?

To run your scraper in a Virtual Machine, we will:

  • Create a static IP
  • Create a VM with that IP
  • SSH into the VM
  • Install the scraper

Now, follow these steps to run your scraper in a Virtual Machine:

  1. If you don't already have one, create a Google Cloud Account. You'll receive a $300 credit to use over 3 months. Select-your-billing-country

  2. Visit the Google Cloud Console and click the Cloud Shell button. A terminal will open up. click-cloud-shell-btn

  3. Run the following commands in the terminal:

    python -m pip install bota
    python -m bota create-ip

    You will be asked for a VM name. Enter any name you like, such as "pikachu".

    Name: pikachu

    Then, you will be asked for the region for the scraper. Press Enter to go with the default, which is "us-central1", as most global companies host their websites in the US.

    Region: Default

    Install bota

  4. Now, visit this link and create a deployment from Google Click to Deploy with the following settings:

    zone: us-central1-a # Use the zone from the region you selected in the previous step.
    Series: N1
    Machine Type: n1-standard-2 (2 vCPU, 1 core, 7.5 GB memory)
    Network Interface [External IP]: pikachu-ip # Use the IP name you created in the previous step.
    

    deploy-node

  5. Visit this link and click the SSH button to SSH into the VM. ssh-vm

  6. Now, run the following commands in the terminal, then wait for 5 minutes for the installation to complete:

    curl -sL https://raw.githubusercontent.com/omkarcloud/botasaurus/master/vm-scripts/install-scraper.sh | bash -s -- https://github.com/omkarcloud/botasaurus-starter

    install-scraper

    Note: If you are using a different repo, replace https://github.com/omkarcloud/botasaurus-starter with your repo URL.

    That's it! You have successfully launched the Scraper in a Virtual Machine. When the previous commands are done, you will see a link to your scraper. Visit it to run your scraper.

vm-success

How to delete the scraper and avoid incurring charges?

If you are deleting a custom scraper you deployed, please ensure you have downloaded the results from it.

Next, follow these steps to delete the scraper:

  1. Delete the static IP by running the following command:

    python -m bota delete-ip

    You will be asked for the name you entered when creating the IP (e.g., "pikachu"). Enter the name and press Enter.

    Delete IP

    Note: If you forgot the name of the IP, you can also delete all the IPs by running python -m bota delete-all-ips.

  2. Go to Deployment Manager and delete your deployment.

    Delete deployment

That's it! You have successfully deleted the scraper, and you will not incur any further charges.

How to Run Scraper in Kubernetes?

Visit this link to learn how to run scraper at scale using Kubernetes.

I have a feature request!

We'd love to hear it! Share it on GitHub Discussions.

Do you have a Discord community?

Yes, we have a Discord community where you can connect with other developers, ask questions, and share your experiences. Join our Discord community here.

โ“ Advanced Questions

Congratulations on completing the Botasaurus Documentation! Now, you have all the knowledge needed to effectively use Botasaurus.

You may choose to read the following questions based on your interests:

  1. How to Run Botasaurus in Google Colab?

  2. How can I allow users to filter the scraped data?

  3. How can I allow the user to sort the scraped data?

  4. How can I present the scraped data in different views?

  5. When building a large dataset, customers often request data in different formats like overview and review. How can I do that?

  6. What more can I configure when adding a scraper?

  7. How to control the maximum number of browsers and requests running at any point of time?

  8. How do I change the title, header title, and description of the scraper?

  9. How can I use a database like PostgreSQL with UI Scraper?

  10. Which PostgreSQL provider should I choose among Supabase, Google Cloud SQL, Heroku, and Amazon RDS?

  11. How to create a PostgreSQL database on Supabase?

  12. How to create a PostgreSQL database on Google Cloud?

  13. I am a YouTuber. Should I create YouTube videos about Botasaurus? If so, how can you help me?

Thank You

  • I didn't make Botasaurus for fame or to earn good karma. I created it because I would be really happy if you could use it to successfully complete your project. So, thank you for using Botasaurus!
  • Kudos to the Apify Team for creating the proxy-chain library. The implementation of SSL-based Proxy Authentication wouldn't have been possible without their groundbreaking work on proxy-chain.
  • Shout out to ultrafunkamsterdam for creating nodriver, which inspired the creation of Botasaurus Driver.
  • A big thank you to daijro for creating hrequests, which inspired the creation of botasaurus-requests.
  • A humongous thank you to Cloudflare, DataDome, Imperva, and all bot recognition systems. Had you not been there, we wouldn't be either 😅.

Now, what are you waiting for? 🤔 Go and make something mastastic! 🚀

Become one of our amazing stargazers by giving us a star ⭐ on GitHub!

It's just one click, but it means the world to me.

Stargazers for @omkarcloud/botasaurus

Disclaimer for Botasaurus Project

By using Botasaurus, you agree to comply with all applicable local and international laws related to data scraping, copyright, and privacy. The developers of Botasaurus are not responsible for any misuse of this software. It is the sole responsibility of the user to ensure adherence to all relevant laws regarding data scraping, copyright, and privacy, and to use Botasaurus in an ethical and legal manner.

We take the concerns of the Botasaurus Project very seriously. For any inquiries or issues, please contact Chetan Jain at [email protected]. We will take prompt and necessary action in response to your emails.

Made with ❤️ in Mastastic Bharat 🇮🇳 - Vande Mataram

botasaurus's People

Contributors

allamaris0, bistp, chetan11-dev, chetanjainsirsa, iamdevdiv, noreply

botasaurus's Issues

WebDriver Manager

Python has a wonderful library, webdriver-manager. It allows you to easily control driver versions, install the latest ones, and pick those that fit the current version of the browser. This frees you from thinking about the choice of driver version and is convenient when used across different devices and platforms. How about using it to download the detected drivers, and using this library as a template for the undetected ones? I hope for a positive answer; I really want to help in the development of the project.

Error: connect ECONNREFUSED

Running
☕ JavaScript Error Call to 'launch' failed:

scrape_heading_task()
at (/root/folder/file.py:16)
current_result = run_task(data_item, False, 0)
at wrapper_browser (/usr/local/lib/python3.10/dist-packages/botasaurus/decorators.py:650)
driver = create_driver(data, options, desired_capabilities)
at run_task (/usr/local/lib/python3.10/dist-packages/botasaurus/decorators.py:528)
return do_create_stealth_driver(
at run (/usr/local/lib/python3.10/dist-packages/botasaurus/create_stealth_driver.py:282)
chrome = launch_chrome(start_url, options._arguments)
at do_create_stealth_driver (/usr/local/lib/python3.10/dist-packages/botasaurus/create_stealth_driver.py:229)
instance = ChromeLauncherAdapter.launch(**kwargs)
at launch_chrome (/usr/local/lib/python3.10/dist-packages/botasaurus/create_stealth_driver.py:102)
response = chrome_launcher.launch(kwargs, timeout=300)
at launch (/usr/local/lib/python3.10/dist-packages/botasaurus/chrome_launcher_adapter.py:12)

... across the bridge ...

at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1278:16)
Error: connect ECONNREFUSED 127.0.0.1:39307

^
🌉 Error: connect ECONNREFUSED 127.0.0.1:39307

not work on VPS

When I try to run the hello world script on a VPS (Ubuntu 22):

from botasaurus import *


@browser(headless=True)
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")

    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {"heading": heading}


if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()

I obtain this error:

(venv) root@1941865-hj59931:~/realm_of_python/sandbox# python botasaurus_collector.py
Running
[INFO] Downloading Chrome Driver. This is a one-time process. Download in progress...
Traceback (most recent call last):
  File "/root/realm_of_python/sandbox/botasaurus_collector.py", line 18, in <module>
    scrape_heading_task()
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/decorators.py", line 501, in wrapper_browser
    current_result = run_task(data_item, False, 0)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/decorators.py", line 399, in run_task
    driver = create_selenium_driver(options, desired_capabilities)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/create_driver_utils.py", line 221, in create_selenium_driver
    driver = AntiDetectDriver(
             ^^^^^^^^^^^^^^^^^
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/botasaurus/anti_detect_driver.py", line 33, in __init__
    super().__init__(*args, **kwargs)
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/chrome/webdriver.py", line 69, in __init__
    super().__init__(DesiredCapabilities.CHROME['browserName'], "goog",
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/chromium/webdriver.py", line 92, in __init__
    super().__init__(
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 272, in __init__
    self.start_session(capabilities, browser_profile)
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 364, in start_session
    response = self.execute(Command.NEW_SESSION, parameters)
               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "/root/realm_of_python/sandbox/venv/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.SessionNotCreatedException: Message: session not created: Chrome failed to start: exited normally.
  (session not created: DevToolsActivePort file doesn't exist)
  (The process started from chrome location /opt/google/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.)
Stacktrace:
#0 0x55ba42034fb3 <unknown>
#1 0x55ba41d084a7 <unknown>
#2 0x55ba41d3bc93 <unknown>
#3 0x55ba41d3810c <unknown>
#4 0x55ba41d7aac6 <unknown>
#5 0x55ba41d71713 <unknown>
#6 0x55ba41d4418b <unknown>
#7 0x55ba41d44f7e <unknown>
#8 0x55ba41ffa8d8 <unknown>
#9 0x55ba41ffe800 <unknown>
#10 0x55ba42008cfc <unknown>
#11 0x55ba41fff418 <unknown>
#12 0x55ba41fcc42f <unknown>
#13 0x55ba420234e8 <unknown>
#14 0x55ba420236b4 <unknown>
#15 0x55ba42034143 <unknown>
#16 0x7f1c63bcaac3 <unknown>

Is there any advice?

Botasaurus performance

I would like to know about the performance of the framework. When parsing a large amount of data, even through queries, the program's memory usage exceeded 9 GB; the program did not cache anything, writing everything to the database on its own. I also got the error "[Errno 24] Too many open files". I tried to explore on my own, but I didn't find anything. My assumption is that when running in parallel, the program retains the startup context even after completion, which can lead to such problems. Maybe I'll find something else and I'll definitely add it!

Node version error

I was testing Botasaurus with the stealth driver, but I receive this error in the console.

"Your Node.js version is 12, which is less than 16. To use the stealth and auth proxy features of Botasaurus, you need Node.js 16, Kindly install it by visiting https://nodejs.org/. An exception has occurred, use %tb to see the full traceback.".

This occurs even with the examples below. My Node version is 20.11; I tried other versions using NVM, but the error persists. Can you help me?

Extension path

Extension class should support directory path where local extension is saved.

Found a workaround by creating a custom class given below, but if you can add it to Extension class, it will be cool. Thanks

import os

class Extension:
    def __init__(self, path):
        self.path = path

    def load(self, *args, **kwargs):
        return os.path.abspath(self.path)

Selenium webdriver chrome 115 stopped working

Description

The framework fails on a new installation because it cannot find ChromeDriver: the old download location only goes up to version 114, due to the driver restructuring by the Chromium Team for the new Chrome-for-Testing.

Steps to Reproduce

  • Try using the framework on a newer project with Chrome version of 115 or above.

Additional context

Traceback


Traceback (most recent call last):
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/runpy.py", line 198, in _run_module_as_main
    return _run_code(code, main_globals, None,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/runpy.py", line 88, in _run_code
    exec(code, run_globals)
  File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/__main__.py", line 39, in <module>
    cli.main()
  File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 430, in main
    run()
  File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/adapter/../../debugpy/launcher/../../debugpy/../debugpy/server/cli.py", line 284, in run_file
    runpy.run_path(target, run_name="__main__")
  File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 321, in run_path
    return _run_module_code(code, init_globals, run_name,
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 135, in _run_module_code
    _run_code(code, mod_globals, init_globals,
  File "/Users/shubhamgarg/.vscode/extensions/ms-python.python-2023.14.0/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydevd_bundle/pydevd_runpy.py", line 124, in _run_code
    exec(code, run_globals)
  File "/Users/shubhamgarg/src/google-maps-scraper/main.py", line 18, in <module>
    launch_tasks(*tasks_to_be_run)
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/launch_tasks.py", line 54, in launch_tasks
    current_output = task.begin_task(current_data, task_config)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 214, in begin_task
    final = run_task(False, 0)
            ^^^^^^^^^^^^^^^^^^
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 155, in run_task
    create_directories(self.task_path)
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 99, in create_directories
    _download_driver()
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/base_task.py", line 34, in _download_driver
    download_driver()
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/download_driver.py", line 50, in download_driver
    move_driver()
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/download_driver.py", line 39, in move_driver
    move_chromedriver()
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/envs/gmaps-scraper/lib/python3.11/site-packages/bose/download_driver.py", line 38, in move_chromedriver
    shutil.move(src_path, dest_path)
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/shutil.py", line 845, in move
    copy_function(src, real_dst)
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/shutil.py", line 436, in copy2
    copyfile(src, dst, follow_symlinks=follow_symlinks)
  File "/Users/shubhamgarg/.pyenv/versions/3.11.2/lib/python3.11/shutil.py", line 256, in copyfile
    with open(src, 'rb') as fsrc:
         ^^^^^^^^^^^^^^^
FileNotFoundError: [Errno 2] No such file or directory: 'build/115/chromedriver'

Incompatibility of Selenium's Alert Class with AntiDetectDriver in Botasaurus

Description

When using the AntiDetectDriver from Botasaurus to access websites that generate JavaScript alerts, the standard Selenium Alert class methods do not work. This incompatibility leads to an inability to handle JavaScript alerts, which is a crucial feature for many web scraping tasks. The need to access native Selenium functions and classes while using the modified driver is also highlighted.

Steps to Reproduce

  1. Navigate to a webpage with a JavaScript alert using AntiDetectDriver (e.g., http://www.restaurant-schwabenstuben.de/).
  2. Attempt to handle the alert using Seleniumโ€™s Alert class.
  3. Observe that the Alert class methods do not work with AntiDetectDriver.

Expected Behavior

The Alert class methods should work seamlessly with AntiDetectDriver, allowing users to handle JavaScript alerts on web pages.

Actual Behavior

The Alert class methods are incompatible with AntiDetectDriver, causing an inability to interact with JavaScript alerts on web pages.

Reproduces How Often

100% of the time when encountering JavaScript alerts with AntiDetectDriver.

Additional Context

The inability to use Selenium's native alert handling capabilities with AntiDetectDriver significantly limits the driver's functionality for web scraping tasks that encounter JavaScript alerts. Furthermore, a general integration of native Selenium functions and classes with AntiDetectDriver would enhance its utility.

Error Message and Stack Trace

selenium.common.exceptions.UnexpectedAlertPresentException: Alert Text: [Alert text]
Message: unexpected alert open: {Alert text : [Alert text]}
(Session info: chrome=[version])
Stacktrace:
[Full stack trace]

Suggestions for Improvement

  • Explore and resolve the compatibility issues between Seleniumโ€™s Alert class and AntiDetectDriver.
  • Implement alternative methods within AntiDetectDriver for handling JavaScript alerts.
  • Provide clear documentation or examples for handling JavaScript alerts in Botasaurus.
  • Investigate the feasibility of integrating more native Selenium functions and classes with AntiDetectDriver.

Can not download driver when running parallel

I ran the sample code and this error occurred:

selenium.common.exceptions.WebDriverException: Message: 'chromedriver-122' executable needs to be in PATH. Please see https://chromedriver.chromium.org/home
I think the system does not wait until the driver download finishes, so it raises the exception.

Here Code example:

from botasaurus import *

@browser(parallel=bt.calc_max_parallel_browsers, block_resources=True, block_images=True, data=["https://www.yahoo.com/", "https://www.google.com", "https://stackoverflow.com/"])
def scrape_heading_task(driver: AntiDetectDriver, data):
    # print("metadata:", metadata)
    print("data:", data)
    # Navigate to the Omkar Cloud website
    driver.get(data)

    # Retrieve the heading element's text
    heading = driver.text("h1")
    title = driver.title

    # Save the data as a JSON file in output/scrape_heading_task.json
    return {
        "heading": heading,
        "title": title
    }

if __name__ == '__main__':
    scrape_heading_task()

Not working on some websites

Hello, saw this on a post from undetected-chromedriver and decided to check it out, but couldn't bypass a certain website and I imagine some others with the same tech would also have the same issue.

bet365.com doesn't load the main page, and other pages are super inconsistent (it works about 1 in 20 times), so it does work; I just need to find what pattern makes it consistent. The code I've tried and had success with is the one from the example:

from botasaurus import *
from botasaurus.create_stealth_driver import create_stealth_driver


@browser(
    create_driver=create_stealth_driver(
        start_url="https://www.bet365.com/#/AC/B151/C1/D50/E3/F163/",
        wait=8, # it seems like the wait doesn't matter
    ),
)
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.prompt()
    heading = driver.text('h1')
    return heading


scrape_heading_task()

By the way, this website is only accessible via undetected-chromedriver when using a workaround of disconnecting and reconnecting to the driver, so I imagine on Botasaurus it would be something similar.

Tasks folder empty

I am following the guide here: https://www.omkar.cloud/botasaurus/docs/sign-up-tutorial/

I pasted the following script:

from botasaurus import *

@browser(
    data = lambda: bt.generate_users(3, country=bt.Country.IN),
    block_resources=True,
    profile= lambda account: account['username'],
    tiny_profile= True,
)
def create_accounts(driver: AntiDetectDriver, account):
    name = account['name']
    email = account['email']
    password = account['password']

    def sign_up():
        driver.type('input[name="name"]', name)
        driver.type('input[type="email"]', email)
        driver.type('input[type="password"]', password)
        driver.click('button[type="submit"]')

    def confirm_email():
        link = bt.TempMail.get_email_link_and_delete_mailbox(email)
        driver.get(link)

    driver.google_get("https://www.omkar.cloud/auth/sign-up/")
    sign_up()
    confirm_email()
    bt.Profile.set_profile(account)    

@browser(
    data = lambda: bt.Profile.get_profiles(),
    block_resources=True,
    profile= lambda account: account['username'],
    tiny_profile= True,
)
def take_screenshots(driver: AntiDetectDriver, account):
    username = account['username']
    driver.get("https://www.omkar.cloud/")
    driver.save_screenshot(username)

if __name__ == "__main__":
    create_accounts()
    take_screenshots()

So the example does take 3 screenshots and saves them in output, and each screenshot is saved as the username, but since images are blocked, there isn't much to the screenshots.

There is a create_accounts.json and a take_screenshots.json but they only report 3 null entries. Is this normal?

Also the tasks directory is empty and doesn't contain any metadata on the bot run.

Fail screenshot saving

Description

Google Maps scraper fails on many queries (around 10k and more).

Steps to Reproduce

The work machine is a Windows Server with 4 GB RAM (it's enough for 16 threads, as I tested).

  1. Load many queries (I'm loading just links to websites)
  2. Run and wait

Actual behavior:

Error:

Failed to save screenshot
Closing Browser
Traceback (most recent call last):
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 192, in run_task
    close_driver(driver)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 181, in close_driver
    driver.close()
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\bose_driver.py", line 335, in close
    return super().close()
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
    self.execute(Command.CLOSE)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: disconnected: Unable to receive message from renderer
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: chrome=116.0.5845.141)
Stacktrace:
 GetHandleVerifier [0x005B37C3+48947]
 (No symbol) [0x00548551]
 (No symbol) [0x0044C92D]
 (No symbol) [0x0043E26E]
 (No symbol) [0x0043D09F]
 (No symbol) [0x0043D678]
 (No symbol) [0x0043C695]
 (No symbol) [0x00435811]
 (No symbol) [0x00435AC4]
 (No symbol) [0x0049D688]
 (No symbol) [0x00495053]
 (No symbol) [0x004716C7]
 (No symbol) [0x0047284D]
 GetHandleVerifier [0x007FFDF9+2458985]
 GetHandleVerifier [0x0084744F+2751423]
 GetHandleVerifier [0x00841361+2726609]
 GetHandleVerifier [0x00630680+560624]
 (No symbol) [0x0055238C]
 (No symbol) [0x0054E268]
 (No symbol) [0x0054E392]
 (No symbol) [0x005410B7]
 BaseThreadInitThunk [0x745962C4+36]
 RtlSubscribeWnfStateChangeNotification [0x77191B69+1081]
 RtlSubscribeWnfStateChangeNotification [0x77191B34+1028]`
During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\Admin\Desktop\google-maps-scraper-master\main.py", line 19, in <module>
    launch_tasks(*tasks_to_be_run)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\launch_tasks.py", line 54, in launch_tasks
    current_output = task.begin_task(current_data, task_config)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 219, in begin_task
    final = run_task(False, 0)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 214, in run_task
    close_driver(driver)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\base_task.py", line 181, in close_driver
    driver.close()
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\bose\bose_driver.py", line 335, in close
    return super().close()
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
    self.execute(Command.CLOSE)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\Users\Admin\AppData\Local\Programs\Python\Python311\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.WebDriverException: Message: disconnected: not connected to DevTools
  (failed to check if window was closed: disconnected: not connected to DevTools)
  (Session info: chrome=116.0.5845.141)
Stacktrace:
 GetHandleVerifier [0x005B37C3+48947]
 (No symbol) [0x00548551]
 (No symbol) [0x0044C92D]
 (No symbol) [0x0043D249]
 (No symbol) [0x0043D79A]
 (No symbol) [0x0043D738]
 (No symbol) [0x004326FD]
 (No symbol) [0x00432F8D]
 (No symbol) [0x0049D288]
 (No symbol) [0x00495053]
 (No symbol) [0x004716C7]
 (No symbol) [0x0047284D]
 GetHandleVerifier [0x007FFDF9+2458985]
 GetHandleVerifier [0x0084744F+2751423]
 GetHandleVerifier [0x00841361+2726609]
 GetHandleVerifier [0x00630680+560624]
 (No symbol) [0x0055238C]
 (No symbol) [0x0054E268]
 (No symbol) [0x0054E392]
 (No symbol) [0x005410B7]
 BaseThreadInitThunk [0x745962C4+36]
 RtlSubscribeWnfStateChangeNotification [0x77191B69+1081]
 RtlSubscribeWnfStateChangeNotification [0x77191B34+1028]`

Reproduces how often:

Every time I start it, but only after 30 minutes or more of work.

Additional context

My log looks like that:

[7080:3484:0907/152340.096:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
[7080:3484:0907/152340.107:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
Done: V and B Le Mans
[7080:3484:0907/152342.742:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
[7080:3484:0907/152342.762:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
Done: V and B La Roche Nord
Filtered 5 links from 5.
View written JSON file at output/vandb-fr-in-france.json
View written CSV file at output/vandb-fr-in-france.csv
Closing Browser
Closed Browser
View Final Screenshot at tasks/1112/final.png
View written JSON file at output/all.json
Creating Driver with window_size=1920,1080 and user_agent=Mozilla/5.0 (Windows NT 10.0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36

DevTools listening on ws://127.0.0.1:63583/devtools/browser/54835e3a-595b-4cce-8ce0-c9d1
f0639475
Launched Browser
[6804:3312:0907/152354.671:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
[6804:3312:0907/152354.717:ERROR:gles2_cmd_decoder_passthrough.cc(946)] ContextResult::k
FatalFailure: fail_if_major_perf_caveat + swiftshader
Fetched 5 links.
Creating Driver with window_size=1920,1080 and user_agent=Mozilla/5.0 (Windows NT 10.0)
AppleWebKit/537.36 (KHTML, like Gecko) Chrome/105.0.0.0 Safari/537.36

a doubt

How can I capture network traffic responses using Botasaurus?

OSError: [Errno 8] Exec format error

On Linux (Raspberry Pi), after running the project, this error appears.

File "/home/username/.local/lib/python3.11/site-packages/selenium/webdriver/common/service.py", line 71, in start
self.process = subprocess.Popen(cmd, env=self.env,
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/usr/lib/python3.11/subprocess.py", line 1024, in __init__
self._execute_child(args, executable, preexec_fn, close_fds,
File "/usr/lib/python3.11/subprocess.py", line 1901, in _execute_child
raise child_exception_type(errno_num, err_msg, err_filename)
OSError: [Errno 8] Exec format error: '/home/username/Desktop/projects/aa/bb/cc/build/chromedriver-120'

Bug in example of README

If I run the following code:

from time import sleep

from botasaurus import browser, AntiDetectDriver  # Replace with your actual scraping library

@browser(async_queue=True, close_on_crash=True)
def scrape_data(driver: AntiDetectDriver, data):
    print("Starting a task.")
    print(data)
    sleep(1)  # Simulate a delay, e.g., waiting for a page to load
    print("Task completed.")
    return data

if __name__ == "__main__":
    # Start scraping tasks without waiting for each to finish
    async_queue = scrape_data()  # Initializes the queue

    # Add tasks to the queue
    async_queue.put([1])
    async_queue.put(4)
    async_queue.put([5, 6])

    # Retrieve results when ready
    results = async_queue.get()  # Expects to receive: [1, 2, 3, 4, 5, 6]

It fails because the queue's put method expects a list, and an int is being passed to it. This is solved by changing it to a list.
But my question is: how can I make the program fail? Because right now the program raises an exception but doesn't finish; it just stays on hold.

Thanks for the lib, amazing job!

get_elements_or_none_by_xpath bug - uses CSS_SELECTOR instead of XPATH

I believe there was an oversight issue of using the wrong By. ENUM parameter for this function.

file: anti_detect_driver.py

def get_elements_or_none_by_xpath(self: WebDriver, xpath, wait=Wait.SHORT):
        try:
            if wait is None:
                return self.find_elements(By.XPATH, xpath)
            else:
                WebDriverWait(self, wait).until(
                    EC.presence_of_element_located((By.CSS_SELECTOR, xpath))
                )

                return self.find_elements(By.XPATH, xpath)
        except:
            return None

the line
EC.presence_of_element_located((By.CSS_SELECTOR, xpath))

should be changed to:
EC.presence_of_element_located((By.XPATH, xpath))

Would like to wrap my library using this code

Summary

Hi, congratulations on your work. This is an amazing repository. I was checking the source code and was wondering if it fits a library I'm developing, which is used to run asynchronous requests against WebDrivers/Winium. I didn't find any unit tests to help me check it. This is the library. Let me know if you think it is possible to adapt/extend your code to use it, please. If it is possible, I'd be happy to play around and send some PRs related to it. Thank you!

Motivation

Why are we doing this? What use cases does it support? What is the expected outcome?
I have a fresh library to interact with WebDrivers and Winium, but the code is quite verbose. I think your code may make my library simpler to use.

Describe alternatives you've considered

Use another repository to wrap my code, or develop it by myself.

botasaurus silently fails on nixos

I tried to get Botasaurus running on NixOS, but the "hello world" test script fails:

heading None

expected result

heading "Elementasaurus helps you become a 10x Web Designer"
test.py
# debug log is not helpful

import logging

logging_level = "INFO"
logging_level = "DEBUG"

logging.basicConfig(
    #format='%(asctime)s %(levelname)s %(message)s',
    # also log the logger %(name)s, so we can filter by logger name
    format='%(asctime)s %(name)s %(levelname)s %(message)s',
    level=logging_level,
)

logger = logging.getLogger("test")



# https://github.com/omkarcloud/botasaurus

from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")
    
    # Retrieve the heading element's text
    heading = driver.text("h1")

    # FIXME heading == None
    print("heading", repr(heading))

    # keep browser open
    #import time; time.sleep(9999)

    # Save the data as a JSON file in output/scrape_heading_task.json
    # "return" would write "null" to the output file
    return {
        "heading": heading
    }
     
if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()
example output
$ python pkgs/python3/pkgs/botasaurus/test-botasaurus.py
Running
2024-01-16 17:36:27,932 selenium.webdriver.common.service DEBUG Started executable: `/nix/store/yjq4z3n7p66l8jp06s8cgq647s6iwm7c-chromedriver-117.0.5938.149/bin/chromedriver` in a child process with pid: 1034600
2024-01-16 17:36:28,440 selenium.webdriver.remote.remote_connection DEBUG POST http://localhost:33991/session {"capabilities": {"firstMatch": [{}], "alwaysMatch": {"browserName": "chrome", "pageLoadStrategy": "normal", "goog:chromeOptions": {"excludeSwitches": ["enable-automation"], "useAutomationExtension": false, "extensions": [], "args": ["--start-maximized", "--window-size=1440,900", "--user-agent=Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/115.0.0.0 Safari/537.36", "--disable-blink-features=AutomationControlled", "--disable-site-isolation-trials"]}}}}
2024-01-16 17:36:28,443 urllib3.connectionpool DEBUG Starting new HTTP connection (1): localhost:33991
2024-01-16 17:36:31,726 urllib3.connectionpool DEBUG http://localhost:33991 "POST /session HTTP/1.1" 200 860
2024-01-16 17:36:31,727 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":{"capabilities":{"acceptInsecureCerts":false,"browserName":"chrome","browserVersion":"117.0.5938.149","chrome":{"chromedriverVersion":"117.0.5938.149 (e3344ddefa12e60436fa28c81cf207c1afb4d0a9-refs/branch-heads/5938@{#1539})","userDataDir":"/run/user/1000/.org.chromium.Chromium.xNxfuG"},"fedcm:accounts":true,"goog:chromeOptions":{"debuggerAddress":"localhost:34929"},"networkConnectionEnabled":false,"pageLoadStrategy":"normal","platformName":"linux","proxy":{},"setWindowRect":true,"strictFileInteractability":false,"timeouts":{"implicit":0,"pageLoad":300000,"script":30000},"unhandledPromptBehavior":"dismiss and notify","webauthn:extension:credBlob":true,"webauthn:extension:largeBlob":true,"webauthn:extension:minPinLength":true,"webauthn:extension:prf":true,"webauthn:virtualAuthenticators":true},"sessionId":"2dc31237c852897577d0335baf49b4fb"}} | headers=HTTPHeaderDict({'Content-Length': '860', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:36:31,728 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:36:31,729 selenium.webdriver.remote.remote_connection DEBUG POST http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/url {"url": "https://www.omkar.cloud/"}
2024-01-16 17:37:00,778 urllib3.connectionpool DEBUG http://localhost:33991 "POST /session/2dc31237c852897577d0335baf49b4fb/url HTTP/1.1" 200 14
2024-01-16 17:37:00,781 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:00,782 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:37:00,783 selenium.webdriver.remote.remote_connection DEBUG POST http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/element {"using": "css selector", "value": "h1"}
2024-01-16 17:37:01,478 urllib3.connectionpool DEBUG http://localhost:33991 "POST /session/2dc31237c852897577d0335baf49b4fb/element HTTP/1.1" 200 95
2024-01-16 17:37:01,480 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":{"element-6066-11e4-a52e-4f735466cecf":"407A8AABF752A82E39C6A1463C008FA9_element_32"}} | headers=HTTPHeaderDict({'Content-Length': '95', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:01,482 selenium.webdriver.remote.remote_connection DEBUG Finished Request
heading None
2024-01-16 17:37:01,484 selenium.webdriver.remote.remote_connection DEBUG GET http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/url {}
2024-01-16 17:37:01,631 urllib3.connectionpool DEBUG http://localhost:33991 "GET /session/2dc31237c852897577d0335baf49b4fb/url HTTP/1.1" 200 36
2024-01-16 17:37:01,633 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":"https://www.omkar.cloud/"} | headers=HTTPHeaderDict({'Content-Length': '36', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:01,642 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:37:01,646 selenium.webdriver.remote.remote_connection DEBUG DELETE http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb/window {}
2024-01-16 17:37:02,299 urllib3.connectionpool DEBUG http://localhost:33991 "DELETE /session/2dc31237c852897577d0335baf49b4fb/window HTTP/1.1" 200 12
2024-01-16 17:37:02,301 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":[]} | headers=HTTPHeaderDict({'Content-Length': '12', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:02,302 selenium.webdriver.remote.remote_connection DEBUG Finished Request
2024-01-16 17:37:02,308 selenium.webdriver.remote.remote_connection DEBUG DELETE http://localhost:33991/session/2dc31237c852897577d0335baf49b4fb {}
2024-01-16 17:37:02,368 urllib3.connectionpool DEBUG http://localhost:33991 "DELETE /session/2dc31237c852897577d0335baf49b4fb HTTP/1.1" 200 14
2024-01-16 17:37:02,370 selenium.webdriver.remote.remote_connection DEBUG Remote response: status=200 | data={"value":null} | headers=HTTPHeaderDict({'Content-Length': '14', 'Content-Type': 'application/json; charset=utf-8', 'cache-control': 'no-cache'})
2024-01-16 17:37:02,372 selenium.webdriver.remote.remote_connection DEBUG Finished Request
Written
     output/scrape_heading_task.json

javascript.require

First I suspected that javascript.require fails to find the NPM dependencies and that require would silently fail, but got_adapter and chrome_launcher_adapter are never used.

/lib/python3.10/site-packages/botasaurus/got_adapter.py

got = require("got-scraping-export")
raise Exception(f"botasaurus/got_adapter.py: require got-scraping-export -> {got}")

/lib/python3.10/site-packages/botasaurus/chrome_launcher_adapter.py

chrome_launcher = require("chrome-launcher")
raise Exception(f"botasaurus/chrome_launcher_adapter.py: require chrome-launcher -> {chrome_launcher}")

reproduce

install nix or boot nixos

clone my nur-packages repo

git clone --depth=1 https://github.com/milahu/nur-packages
cd nur-packages

relevant files

start a nix-shell with botasaurus

nix-shell -E '
  let
    pkgs = import <nixpkgs> {};
    nurRepo = import ./. {};
  in
  pkgs.mkShell {
    buildInputs = [
      nurRepo.python3.pkgs.botasaurus
    ];
  }
'

run the botasaurus test script

python pkgs/python3/pkgs/botasaurus/test-botasaurus.py

why

in my aiohttp_chromium im creating something similar to botasaurus
so im curious how similar projects work

ideas...?
i dont want to spend too much time debugging, because i will probably not need this

Browser Visible Mode Doesn't Work: ECONNREFUSED

result = run_parallel(run, used_data, n)
at wrapper_browser (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:664)
parallel_thread.join(0.2) # time out not to block KeyboardInterrupt
at run_parallel (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:166)
raise self._exception
at join (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:152)
self.result = target(*args, **kwargs)
at function (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:143)
return Parallel(n_jobs=n_workers, backend="threading")(
at execute_parallel_tasks (/home/user/.local/lib/python3.10/site-packages/botasaurus/decorators.py:158)
return output if self.return_generator else list(output)
at call (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1952)
yield from self._retrieve()
at _get_outputs (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1595)
self._raise_error_fast()
at _retrieve (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1699)
error_job.get_result(self.timeout)
at _raise_error_fast (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:1734)
return self._return_or_raise()
at get_result (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:736)
raise self._result
at _return_or_raise (/home/user/.local/lib/python3.10/site-packages/joblib/parallel.py:754)

... across the bridge ...

at TCPConnectWrap.afterConnect [as oncomplete] (node:net:1605:16)
Error: connect ECONNREFUSED 127.0.0.1:38387

^
🌉 Error: connect ECONNREFUSED 127.0.0.1:38387

I keep getting this error with headless=False. This is the config:

@browser(
    window_size=bt.WindowSize.REAL, parallel=8,
    create_driver=create_stealth_driver(start_url=lambda data: data["link"], wait=12),
    add_arguments=add_arguments, raise_exception=True, headless=False,
    keep_drivers_alive=True, cache=True, output=None, reuse_driver=True,
    block_resources=True, block_images=True, max_retry=10,
)

It works fine with headless=True, using these arguments:

def add_arguments(data, options):
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')
    options.add_argument('--server')
    options.add_argument('--disable-setuid-sandbox')
    options.add_argument('--no-zygote')
    options.add_argument('--disable-gpu-sandbox')
    options.add_argument('--disable-software-rasterizer')
    options.add_argument('--ignore-certificate-errors')
    options.add_argument('--ignore-ssl-errors')
    options.add_argument('--use-gl=swiftshader')
    options.add_argument('--window-size=1920,1080')

Also tried using only these:

def add_arguments(data, options):
    options.add_argument('--disable-dev-shm-usage')
    options.add_argument('--no-sandbox')
    options.add_argument('--server')
    options.add_argument('--disable-setuid-sandbox')

Tried with both proxy turned on and off, nothing seems to help!

Issue with Google Maps tutorial

So the issue is with the Google Maps tutorial (https://www.omkar.cloud/botasaurus/docs/google-maps-scraping-tutorial/). I don't think anything is wrong with the code (aside from the missing comma on line 57), but I do seem to have issues with Google Maps itself. I'm testing this from Hong Kong using a residential HK IP. My internet speed is 1 Gbps and Google has data centers right in HK, so it is usually a very low-latency experience, not to mention good download speeds on anything hosted by Google. I also have access to a US residential proxy and know how to use botasaurus with it if I need to.

So anyway, I fired up the bot built from the tutorial. It's scrolling..., scrolling..., and then after a few times it looks like it's stuck, as it's not scrolling anymore. I thought there was an issue with the bot at first, so I terminated it and then opened up a Chrome browser in incognito mode, prepared to inspect some elements using Chrome DevTools.

I went to https://www.google.com/maps/search/restaurants+in+delhi/ and noticed that it scroll-loads fine initially (just like with the bot), but after several scroll loads I hit a chokepoint where Google takes forever to load (just like the bot). It actually took 5 to 10 minutes to get through the first chokepoint. The second chokepoint is taking even longer: I've been waiting 30 minutes now to load whatever comes after "Smoke Trailer Grill", but the multi-colored circle is still spinning.

Are you seeing this phenomenon on your end for the "restaurants in delhi" query? I can't reach the actual end of the page, so I am unable to see the element(s) that would indicate to the bot that it has reached the end of the list while scraping.

Docker build error

Hello. When I try to build a project with botasaurus into a Docker image, I get this error.

45.73   ChefBuildError
45.73
45.73   Backend subprocess exited when trying to invoke get_requires_for_build_wheel
45.73
45.73   You do not have node installed on your system, Kindly install it by visiting https://nodejs.org/
45.73
45.73
45.73   at /usr/local/lib/python3.11/site-packages/poetry/installation/chef.py:164 in _prepare
45.74       160│
45.74       161│                 error = ChefBuildError("\n\n".join(message_parts))
45.74       162│
45.74       163│             if error is not None:
45.74     → 164│                 raise error from None
45.74       165│
45.74       166│             return path
45.74       167│
45.74       168│     def _prepare_sdist(self, archive: Path, destination: Path | None = None) -> Path:
45.74
45.74 Note: This error originates from the build backend, and is likely not a problem with poetry but with botasaurus-proxy-authentication (1.0.8) not supporting PEP 517 builds. You 
can verify this by running 'pip wheel --no-cache-dir --use-pep517 "botasaurus-proxy-authentication (==1.0.8)"'.

Is this okay? Can I just run pip wheel --no-cache-dir --use-pep517 "botasaurus-proxy-authentication (==1.0.8)"?

Connect this to a SQL server database

Peace be upon you. I am a beginner in programming and I love projects like these. I would like some help, if possible: I want to link this project to a SQL Server 2008 database. Please help me.
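One way to do it (not part of botasaurus itself, just a rough sketch): return the scraped data from the decorated function and then insert it into SQL Server with pyodbc. The connection string, driver name, and the headings table/column below are placeholders you would replace with your own; older servers such as SQL Server 2008 may need an older ODBC driver name.

import pyodbc
from botasaurus import *

@browser()
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.get("https://www.omkar.cloud/")
    return {"heading": driver.text("h1")}

rows = scrape_heading_task()
# botasaurus may hand back a single dict or a list of dicts depending on the input data
if isinstance(rows, dict):
    rows = [rows]

# placeholder connection string -- adjust server, database, credentials, and driver
conn = pyodbc.connect(
    "DRIVER={ODBC Driver 17 for SQL Server};SERVER=localhost;DATABASE=scraping;UID=sa;PWD=your_password"
)
cursor = conn.cursor()
for row in rows:
    cursor.execute("INSERT INTO headings (heading) VALUES (?)", row["heading"])
conn.commit()
conn.close()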

Chrome failed to start error on ubuntu 22.04 server, p3.10

It works fine locally. I want to move the scraper to my server, and I ran into an error:

Code:

from botasaurus import *

@browser(
    reuse_driver=True,
    keep_drivers_alive=True,
    headless=True,
    block_resources=[
        ".css",
        ".jpg",
        ".jpeg",
        ".png",
        ".svg",
        ".gif",
        ".woff",
        ".pdf",
        ".zip",
    ],
    
)
def get_url_text(driver: AntiDetectDriver, url: str) -> str:
    driver.get(url)
    soup = driver.bs4()
    return soup.get_text()


get_url_text("https://github.com/Nv7-GitHub/googlesearch")

Failed with error:

Running Message: session not created: Chrome failed to start: exited normally. (session not created: DevToolsActivePort file doesn't exist) (The process started from chrome location /opt/google/chrome/chrome is no longer running, so ChromeDriver is assuming that Chrome has crashed.) Stacktrace: #0 0x55bdcb62bd93 <unknown> #1 0x55bdcb30f337 <unknown> #2 0x55bdcb343bc0 <unknown> #3 0x55bdcb33f765 <unknown> #4 0x55bdcb389b7c <unknown> #5 0x55bdcb37d1e3 <unknown> #6 0x55bdcb34d135 <unknown> #7 0x55bdcb34e13e <unknown> #8 0x55bdcb5efe4b <unknown> #9 0x55bdcb5f3dfa <unknown> #10 0x55bdcb5dc6d5 <unknown> #11 0x55bdcb5f4a6f <unknown> #12 0x55bdcb5c069f <unknown> #13 0x55bdcb619098 <unknown> #14 0x55bdcb619262 <unknown> #15 0x55bdcb62af34 <unknown> #16 0x7fc31a837ac3 <unknown>

It seems like something is wrong with chromedriver. The current version is:

chromedriver-121 --version 
ChromeDriver 121.0.6167.85 (3f98d690ad7e59242ef110144c757b2ac4eef1a2-refs/branch-heads/6167@{#1539})
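A common workaround for the DevToolsActivePort error on bare servers (a general Chrome-on-headless-Linux tip, not an official botasaurus fix; it reuses the add_arguments hook shown in other issues here) is to disable the sandbox and /dev/shm usage:

from botasaurus import *

def add_arguments(data, options):
    # flags that often fix "DevToolsActivePort file doesn't exist" on servers without a display
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    options.add_argument("--disable-gpu")

@browser(headless=True, add_arguments=add_arguments)
def get_url_text(driver: AntiDetectDriver, url: str) -> str:
    driver.get(url)
    return driver.bs4().get_text()

get_url_text("https://github.com/Nv7-GitHub/googlesearch")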

Request, Stealth Mode not working

from botasaurus import *

@request(use_stealth=True)
def scrape_heading_task(request: AntiDetectRequests, data):
    response = request.get('ANY_CF_UAM_WEBSITE')
    print(response.status_code)
    return response.text

scrape_heading_task()

This returns output almost instantaneously for all Cloudflare UAM websites and doesn't work. I tried with and without proxies. In the browser alternative, it waits for those 8 seconds, which makes it work. What could be wrong here?

Is there any way to simulate the wait here too? The wait argument doesn't work.

Adding Brave

Hey, Brave is a Chromium browser too, but it still asks me to download Google Chrome. I believe undetected_chromedriver is compatible with Brave as well. I would appreciate it if you could support this.
Thank you!

๐Ÿ™ Guidance on how to integrate Botasaurus in an existing project

This is a great project, but I am having issues integrating it into my existing code.
I was previously using UndetectedChromeDriver and would like to replace it with Botasaurus.
The goals are to handle sign-in, get user profiles, and complete some user flows (fill forms, upload documents, and click buttons).
I have created classes to easily integrate each part into the program.
Here is the code for the helper class:

import subprocess
import os
from pathlib import Path
import logging
# from os import path
# import random
from time import sleep
# import undetected_chromedriver as uc
# from selenium.webdriver.chrome.options import Options
# from selenium.webdriver.chrome.service import Service
# from webdriver_manager.chrome import ChromeDriverManager


# from Tools.Bot.chrome_launcher_adapter import ChromeLauncherAdapter
# from Tools.Bot.create_stealth_driver import create_stealth_driver
from Tools.Bot.chrome_launcher_adapter import ChromeLauncherAdapter

from Tools.Bot.create_stealth_driver import create_stealth_driver
from selenium.webdriver.chrome.options import Options
from chromedriver_autoinstaller import install


from botasaurus import *
# from botasaurus_proxy_authentication import add_proxy_options


logger = logging.getLogger()
# COPIED FROM chrome-launcher code (https://github.com/GoogleChrome/chrome-launcher/blob/main/src/flags.ts), Mostly same but the extensions, media devices etc are not disabled to avoid detection
DEFAULT_FLAGS = [
    #   safe browsing service, upgrade detector, translate, UMA
    "--disable-background-networking",
    # Don't update the browser 'components' listed at chrome://components/
    "--disable-component-update",
    # Disables client-side phishing detection.
    "--disable-client-side-phishing-detection",
    # Disable syncing to a Google account
    "--disable-sync",
    # Disable reporting to UMA, but allows for collection
    "--metrics-recording-only",
    # Disable installation of default apps on first run
    "--disable-default-apps",
    # Disable the default browser check, do not prompt to set it as such
    "--no-default-browser-check",
    # Skip first run wizards
    "--no-first-run",
    # Disable backgrounding renders for occluded windows
    "--disable-backgrounding-occluded-windows",
    # Disable renderer process backgrounding
    "--disable-renderer-backgrounding",
    # Disable task throttling of timer tasks from background pages.
    "--disable-background-timer-throttling",
    # Disable the default throttling of IPC between renderer & browser processes.
    "--disable-ipc-flooding-protection",
    # Avoid potential instability of using Gnome Keyring or KDE wallet. crbug.com/571003 crbug.com/991424
    "--password-store=basic",
    # Use mock keychain on Mac to prevent blocking permissions dialogs
    "--use-mock-keychain",
    # Disable background tracing (aka slow reports & deep reports) to avoid 'Tracing already started'
    "--force-fieldtrials=*BackgroundTracing/default/",
    # Suppresses hang monitor dialogs in renderer processes. This flag may allow slow unload handlers on a page to prevent the tab from closing.
    "--disable-hang-monitor",
    # Reloading a page that came from a POST normally prompts the user.
    "--disable-prompt-on-repost",
    # Disables Domain Reliability Monitoring, which tracks whether the browser has difficulty contacting Google-owned sites and uploads reports to Google.
    "--disable-domain-reliability",
]




class BotasaurusChromeHandler:
    def __init__(self):
        print("๐Ÿ’ก ChromeHandler init")
        sleep(5)
        self._driver = self.launch_chrome("https://ca.yahoo.com/?p=us", [])
        create_stealth_driver()
        print("โœ… UndetectedChromeHandler launched โžก๏ธ (๐ŸŒˆ Google.com)")

    def driver(self): 
        return self._driver

    
    # @browser(profile='Profile 1',)
    def launch_chrome(self,start_url, additional_args):
        # Set Chrome options
        chrome_options = Options(
            # headless=True,
            # add_argument(r"--user-data-dir=/Users/lifen/Library/Application Support/Google/Chrome/Profile 1"),
        )
        chrome_options.add_argument("--remote-debugging-port=9222")
        # chrome_options.add_argument("--no-sandbox")
        # chrome_options.add_argument("--disable-gpu")
        # chrome_options.add_argument("--disable-extensions")
        # chrome_options.add_argument("--disable-dev-shm-usage")
        chrome_options.add_argument("--user-data-dir=/Users/lifen/Library/Application Support/Google/Chrome/Profile 1")
        # add_proxy_options(chrome_options)
        
        unique_flags = list(dict.fromkeys(DEFAULT_FLAGS + additional_args))

        kwargs = {
            "ignoreDefaultFlags": True,
            "chromeFlags": unique_flags,
            
            "userDataDir": "/Users/MacUser/Library/Application Support/Google/Chrome/Profile 1",

            "port": 9222,
            "headless": False,
            "autoClose": True,
        
        }

        if start_url:
            kwargs["startingUrl"] = start_url

        instance = ChromeLauncherAdapter.launch(**kwargs)
        return instance
    

Where the code is used:

import re
import logging
import random
from time import sleep
from configs.configs_model import ConfigsModel
from helpers.jobs_sql import JobsSQL
from helpers.html_page_handler import HTMLPageHandler
from helpers.shared import notification
from models.job_listing import JobListingModel

from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.common.exceptions import NoSuchElementException

from selenium.common.exceptions import TimeoutException
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

from selenium.webdriver.remote.webelement import WebElement

from helpers.botasaurus_chrome_handler import BotasaurusChromeHandler

from botasaurus import *



logger = logging.getLogger()


class IndeedChromeApplier:

    def __init__(self, jobs_sql: JobsSQL, jobs: list):
        print(f"๐Ÿ’ก IndeedChromeApplier init ")
        self.jobs = jobs
        self.chrome = BotasaurusChromeHandler()
        # self.chrome.driver().maximize_window()
        driver = bt.create_driver()
        self.driver = driver
        self.page = HTMLPageHandler(driver=driver)
        self.jobs_sql = jobs_sql

    def get_uid(self):
        configs = ConfigsModel()
        uid = configs.user_id
        return uid

    # @browser
    def check_auth(self):
        # driver = self.chrome.driver()
        driver = self.driver
        driver.get("https://profile.indeed.com/")
        sleep(2)
        url = driver.current_url
        substring = "secure"
        print(f"๐ŸŸข ๐Ÿ”ด {url=}")
        if substring in url:
            print("โŒ Not Logged in")
            # Get input of the user to try again after he logs in
            notification(
                message="Please log in to Indeed.com and try again (y/n): ")
            _input = input("Please log in to Indeed.com and try again (y/n): ")
            _input: str = "" + _input
            if _input.lower().__contains__("y"):
                return self.check_auth()
            elif _input.lower().__contains__("n"):
                return False
            else:
                sleep(20000)
        elif "profile.indeed.com" in url:
            print("โœ… Logged in")
            return True


    def answer_questions(self):
        # Define a WebDriverWait with a timeout of 10 seconds
        wait = WebDriverWait(self.chrome.driver(), 10)

        # Wait for the radio button for commuting/relocation to be clickable and select it
        try:
            commute_option: WebElement = wait.until(
                EC.element_to_be_clickable(
                    (
                        By.XPATH,
                        "//label[@for='input-q_38d8e685bb4b5228c2494ac85bc44d69-0']",
                    )
                )
            )
            commute_option.click()
            sleep(random.uniform(0.7, 2.2))
        except TimeoutException:
            print("Failed to find or click the commute option.")


    def replace_resume(self, job_title):
        print("โฏ๏ธ  replace_resume")
        is_upload_resume = (
            "Upload or build a resume for this application"
            in self.chrome.driver().title
        )
        paths = self.get_paths()
        if is_upload_resume:
            print("โœ… is_upload")
            # Find the "Replace" link using the full link text
            replace_link = self.page.try_find_element(
                driver=self.chrome.driver(),
                name="Replace",
                by=By.CSS_SELECTOR,
                value='[data-testid="ResumeFileInfoCardReplaceButton-button"]',
            )
            sleep(1)
            if replace_link:
                print("โœ… replace_link")
                
                sleep(1)
                # Find the file input element
                file_input: WebElement = WebDriverWait(self.chrome.driver(), 10).until(
                    EC.presence_of_element_located(
                        (By.CSS_SELECTOR, 'input[type="file"]')
                    )
                )

                # Send the file path to the file input element
                file_input.send_keys(
                    f"{paths.output_resumes_pdf_dir}/RalphNduwimana-{job_title}.pdf"
                )
                sleep(random.uniform(0.9, 1.8))
                # self.page.click_to_next_page(name="Continue",by=By.CLASS_NAME,value='ia-continueButton ia-Resume-continue css-vw73h2 e8ju0x51')
                notification(message=f"Resume replaced by {job_title}")
                self.page.click_to_go_to_page(
                    name="Continue",
                    by=By.XPATH,
                    value="//div[contains(text(), 'Continue')]",
                )


    def submit_application(self):
        print("โฏ๏ธ  review_application")
        notification(message="Reviewing application")
        sleep(1.7)
        notification(message="No cover letter required!")
      
        submit = self.page.click_to_go_to_page(
            name="Submit your application",
            by=By.XPATH,
            value="//button[contains(@class, 'ia-continueButton')]",
        )
        if submit:
            notification("Application Submitted")
        else:
            notification("Application Submitted", code=0)

        # submit_application_button.click()
        # Wait for 2 seconds for the submission to be completed
        sleep(2)

        # Check if the page contains "Application Submitted"
        application_submitted = (
            "Application Submitted" in self.chrome.driver().page_source
        )
        # Check if the submission was completed and return True if "Application Submitted" was found
        if application_submitted:
            notification("Application submitted successfully!")
            return True
        else:
            print("Application submission failed.")
            return False
    def click_button(self):
        # Logic to click on buttons 
        pass
    
    def type_text(self):
        # Logic to click on buttons 
        pass

    def run(self):
        print("โฏ๏ธ  IndeedChromeApplier run")
        driver = self.chrome.driver()
        authenticated = self.check_auth()
        jobs_row = self.jobs_sql.load_jobs_by_status(query_status="Generated")
        jobs_data = [job_row for job_row in jobs_row]
        print(f'✅ ✅ {str(jobs_data)[0:200]}')

        if authenticated:
            for data in jobs_data:
                if not data:
                    print(f'🚫 No Data in jobs_data')
                job_data = self.convert_tuple_to_dict(data)
                job = JobListingModel(job_data)
                url = job.jobUrl
                print(f'✅ ✅ ✅ ✅ {job.jobUrl}')
                page_loaded = self.page.go_to_page(url)
                if not page_loaded:
                    print(f"๐Ÿšซ {url} not loaded")
                    # continue

                if page_loaded:
                    print('✅ page_loaded')

                    application_started = self.page.click_to_go_to_page(
                        name="Apply",
                        by=By.ID,
                        value="indeedApplyButton",
                    )
                    data = re.search(
                        "This job has expired on Indeed",
                        driver.page_source,
                    )
                    # Get True of False
                    expired = data is not None
                    print(f"๐Ÿ“• {expired=}")
                    # sleep(10000)
                    sleep(random.uniform(0.2, 0.5))
                    if not application_started:
                        print("๐Ÿšซ Application not started")
                        sleep(1000)
                    if "indeed" not in driver.current_url:
                        print("Cannot apply on company websites (just indeed.com)")
                        sleep(10000)

                    pages = {
                        "questions": False,
                        "resume": False,
                        "review": False,
                        "work-experience": False,
                        "submitted": False,
                    }

                    try:
                        # there is a page that has not been completed
                        while (
                            False
                            in pages.values()
                        ):
                            print('')

                    except NoSuchElementException:
                        print(
                            f"โŒ Failed to get page ")

    def log_in(self, username, password):
        print(f"โฏ๏ธ  Starting log_in {username} {password}")
        page = self.page
        try:
            username_bar = page.try_find_element(
                name="username_bar",
                by=By.ID,
                value="session_key",
                driver=self.driver,
            )
            assert username_bar is not None
            username_bar.send_keys(f"{username}")
            password_bar = page.try_find_element(
                name="password_bar", by=By.ID, value="session_password", driver=self.chrome.driver()
            )
            assert password_bar is not None
            password_bar.send_keys(f"{password}")
            password_bar.send_keys(Keys.ENTER)
            print("โœ… User logged-in")
        except NoSuchElementException:
            print("No such element found")
        except Exception:
            print("Other exception")
        print(f"โน๏ธ  Finished log_in {username} {password}")

    def log_out(self):
        url = self.chrome.driver().current_url
        print(f"โฏ๏ธ  Starting log_out from {url}")
        xpath = (
            "/html/body/div[5]/header/div/nav/ul/li[6]/div/button"
            if "Home" in url
            else "/html/body/header/div/div[2]/div/div/button"
        )
        page = self.page
        icon_button = page.try_find_element(
            driver=self.chrome.driver(),
            name="Log-Out",
            by=By.XPATH,
            value=xpath,
            element_type="button",
        )
        try:
            print(f"{icon_button=}")
            try:
                sign_out_option: WebElement = WebDriverWait(
                    self.chrome.driver(), 10
                ).until(EC.presence_of_element_located((By.LINK_TEXT, "Sign Out")))
                sign_out_option.click()
                print("โœ… User logged-out")
            except:
                print(f"Sign Out not found ")
        except:
            print("Avatar button not found")
        print(f"โน๏ธ  Finished log_out from {url}")


I would appreciate any guidance on how to integrate Botasaurus features in my code.
Thanks in advance!!!
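Not an official recipe, but as a sketch of the usual Botasaurus shape for this kind of flow: rather than launching Chrome manually through ChromeLauncherAdapter, the whole user flow is wrapped in a single @browser-decorated function, and the profile and driver options are passed to the decorator. The profile name and URLs below are placeholders, and your existing page-handling helpers would slot in where the comment indicates:

from botasaurus import *

@browser(
    profile="indeed-profile",      # placeholder profile name; persists cookies between runs
    reuse_driver=True,
    keep_drivers_alive=True,
)
def apply_to_jobs(driver: AntiDetectDriver, data):
    # data is whatever you pass when calling apply_to_jobs(...), e.g. a list of job dicts
    driver.get("https://profile.indeed.com/")
    if "secure" in driver.current_url:
        driver.prompt()            # pause so you can log in manually, as other issues here do
    for job in data:
        driver.get(job["jobUrl"])
        # ... reuse your existing HTMLPageHandler / form-filling helpers here ...
    return "done"

apply_to_jobs([{"jobUrl": "https://www.indeed.com/viewjob?jk=PLACEHOLDER"}])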

unknown error: session deleted because of page crash

When I run the docker service on a real server and make a request, I get the following error. However, when I run it on my local computer, this error does not appear and I see that it works properly.

v12tj Waiting 10 seconds before connecting to Chrome...
v12tj 10.0.0.2 - - [05/Feb/2024 13:56:03] "GET /scrape?url=https://someurl.com HTTP/1.1" 500 -
v12tj INFO:werkzeug:10.0.0.2 - - [05/Feb/2024 13:56:03] "GET /scrape?url=https://semizotomotivburdur.sahibinden.com HTTP/1.1" 500 -
v12tj Traceback (most recent call last):
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1488, in call
v12tj return self.wsgi_app(environ, start_response)
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1466, in wsgi_app
v12tj response = self.handle_exception(e)
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 1463, in wsgi_app
v12tj response = self.full_dispatch_request()
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 872, in full_dispatch_request
v12tj rv = self.handle_user_exception(e)
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 870, in full_dispatch_request
v12tj rv = self.dispatch_request()
v12tj File "/usr/local/lib/python3.9/site-packages/flask/app.py", line 855, in dispatch_request
v12tj return self.ensure_sync(self.view_functions[rule.endpoint])(**view_args) # type: ignore[no-any-return]
v12tj File "/app/main.py", line 20, in scrape
v12tj result = parser(dealer_url)
v12tj File "/app/boto_scraper.py", line 91, in parser
v12tj return scrape_dealer_page()
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/decorators.py", line 633, in wrapper_browser
v12tj current_result = run_task(data_item, False, 0)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/decorators.py", line 484, in run_task
v12tj driver = create_driver(
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 263, in run
v12tj return do_create_stealth_driver(
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 234, in do_create_stealth_driver
v12tj bypass_detection(remote_driver, raise_exception)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 193, in bypass_detection
v12tj wait_till_cloudflare_leaves(driver, previous_ray_id, raise_exception)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 109, in wait_till_cloudflare_leaves
v12tj current_ray_id = get_rayid(driver)
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/create_stealth_driver.py", line 91, in get_rayid
v12tj ray = driver.text(".ray-id code")
v12tj File "/usr/local/lib/python3.9/site-packages/botasaurus/anti_detect_driver.py", line 172, in text
v12tj return el.text
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webelement.py", line 84, in text
v12tj return self._execute(Command.GET_ELEMENT_TEXT)['value']
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webelement.py", line 396, in _execute
v12tj return self._parent.execute(command, params)
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
v12tj self.error_handler.check_response(response)
v12tj File "/usr/local/lib/python3.9/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
v12tj raise exception_class(message, screen, stacktrace)
v12tj selenium.common.exceptions.WebDriverException: Message: unknown error: session deleted because of page crash
v12tj from unknown error: cannot determine loading status
v12tj from tab crashed
v12tj (Session info: chrome=120.0.6099.109)
v12tj Stacktrace:
v12tj #0 0x55ae79f57f83
v12tj #1 0x55ae79c10b2b
v12tj #2 0x55ae79bf816d
v12tj #3 0x55ae79bf7882
v12tj #4 0x55ae79bf6586
v12tj #5 0x55ae79bf644a
v12tj #6 0x55ae79bf47e1
v12tj #7 0x55ae79bf518a
v12tj #8 0x55ae79c0607c
v12tj #9 0x55ae79c1e7c1
v12tj #10 0x55ae79c246bb
v12tj #11 0x55ae79bf592d
v12tj #12 0x55ae79c1e459
v12tj #13 0x55ae79ca9204
v12tj #14 0x55ae79c89e53
v12tj #15 0x55ae79c51dd4
v12tj #16 0x55ae79c531de
v12tj #17 0x55ae79f1c531
v12tj #18 0x55ae79f20455
v12tj #19 0x55ae79f08f55
v12tj #20 0x55ae79f210ef
v12tj #21 0x55ae79eec99f
v12tj #22 0x55ae79f45008
v12tj #23 0x55ae79f451d7
v12tj #24 0x55ae79f57124
v12tj #25 0x7feac06e3044

// my docker-compose.yaml
version: "3"
services:
  botoscrape:
    restart: "no"
    container_name: botasaurus
    shm_size: 2gb
    build:
      dockerfile: Dockerfile
      context: .
    volumes:
      - ./output:/app/output
      - ./tasks:/app/tasks
      - ./profiles:/app/profiles
      - ./profiles.json:/app/profiles.json
      - ./local_storage.json:/app/local_storage.json
    ports:
      - "8191:9090"
    command: ["python", "-u", "main.py"]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:9090/"]
      interval: 1m
      timeout: 10s
      retries: 3

// Dockerfile

FROM chetan1111/botasaurus:latest
ENV PYTHONUNBUFFERED=1

COPY requirements.txt .

RUN python -m pip install -r requirements.txt

RUN mkdir app
WORKDIR /app
COPY . /app

Can't make post request with payload and headers while stealth is on

I am getting this error:
venv\Lib\site-packages\botasaurus\got_adapter.py", line 138, in _convert_to_got_request
raise ValueError(f"{key} is not Supported")
ValueError: json is not Supported

from this line:
driver=bt.create_requests(use_stealth=True)
response=driver.post('https://woovin.com/wp-admin/admin-ajax.php',payload,headers=custom_headers)
The headers and payload are simple Python dictionaries; I am just trying to make a simple POST request with headers and a payload. It works perfectly fine if I remove use_stealth=True, as follows:
driver=bt.create_requests()

Integrating with flask and handling routes

Hi, I want to build a full-stack application scraping Yellow Pages, where the user can enter the Yellow Pages URL they want to scrape and get the scraped data back. I'm having issues integrating Flask and the API routes in the application: there is a conflict because botasaurus's @request decorator has the same name as Flask's request object. Do you have any examples of how this can be done? Thanks in advance.

from botasaurus import *
from flask import Flask, jsonify, request
from scraper.yp_usa_scraper import *
from flask_cors import CORS

app = Flask(__name__)
CORS(app)

@app.route('/scrape/yp-usa', methods=["POST"])
@request(use_stealth=True)
def scrape_heading_task(request: AntiDetectRequests, data):
    data = request.get_json()
    response = request.get('https://www.yell.com/ucs/UcsSearchAction.do?scrambleSeed=1475848896&keywords=hairdressers&location=hatfield%2C+hertfordshire')
    return response.text

if __name__ == "__main__":
    # Run the Flask development server
    app.run(debug=True)
AttributeError: 'AntiDetectRequests' object has no attribute 'get_json'
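A sketch of one way to avoid the name clash (assuming the scraper function can be called directly with a data dict, as in other examples in this thread): keep the botasaurus-decorated scraper separate from the Flask view, and import Flask's request under an alias so it no longer collides with the @request decorator.

from flask import Flask, jsonify, request as flask_request  # alias avoids the clash
from botasaurus import *

app = Flask(__name__)

# botasaurus scraper defined on its own, outside any Flask view
@request(use_stealth=True)
def scrape_yp(req: AntiDetectRequests, data):
    response = req.get(data["url"])
    return response.text

@app.route("/scrape/yp-usa", methods=["POST"])
def scrape_endpoint():
    body = flask_request.get_json()          # Flask's request, via the alias
    html = scrape_yp({"url": body["url"]})   # call the decorated scraper with a data dict
    return jsonify({"html": html})

if __name__ == "__main__":
    app.run(debug=True)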

Cloudflare Failed

When I try to open the form page using a proxy, the Cloudflare challenge fails; otherwise it is solved.

code

import os.path

from botasaurus import *
from botasaurus.create_stealth_driver import create_stealth_driver


@browser(
    # user_agent=bt.UserAgent.REAL,
    # window_size=bt.WindowSize.REAL,
    create_driver=create_stealth_driver(
        start_url="https://dashboard.capsolver.com/passport/login",
    ),
)
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.prompt()
    heading = driver.text('h1')
    return heading


scrape_heading_task()

Thanks

Chrome does not `quit()` properly

Hi omkarcloud, I really appreciate your project for solving Cloudflare detection.

Context

  • I created a web server that calls your library (in my case, the function scrape_cookies) with my proxy providers and particular URLs. However, browsers are not closed properly after finishing.
  • I can't use python -m botasaurus.close as you suggest, because it interrupts concurrent requests.
  • I tried these approaches, but none of them works:
    • driver.close(): closes the browser window but not the Chrome instance.
    • driver.quit(): this method doesn't work at all. Here is my change in decorators.py.
# line 339
        def close_driver(driver: AntiDetectDriver):
            if tiny_profile:
                save_cookies(driver, driver.about.profile)

            try:
                # driver.close()
                driver.quit()
            except WebDriverException as e:
                if "not connected to DevTools" in str(e):
                    print("Unable to close driver due to network issues")
                    # This error occurs due to connectivity issues
                    pass
                else:
                    raise
  • os.kill(driver.service.process.pid, 9)
  • driver.service.process.kill()
  • driver.service.process.terminate()

Do you have any suggestion for me?

Environment 1:

OS: MacOS
Proxy: datacenter IPs.
The library closed the browsers but not the instances.

Environment 2:

OS: Docker on Mac
Dockerfile: Based on this botasaurus-starter.
Proxy: datacenter IPs.
Memory Usage: keeps going up after requests.

My code:

from botasaurus import *
from typing import List
from botasaurus.create_stealth_driver import create_stealth_driver
import json

from pydantic import BaseModel

from close import close_chrome

class CookieResponse(BaseModel):
    heading: str
    cookies: List[dict]
    chromeOptions: dict
    remoteAddress: str

def get_proxy(data):
    return data["proxy"]

class Input(BaseModel):
    proxy: str
    url: str | None = "https://www.instacart.com/"

# I have web APIs that call this function
def scrape_cookies(input: Input) -> CookieResponse:
    pid = None
    @browser(
        create_driver=create_stealth_driver(
            start_url=input["url"],
        ),
        max_retry=3,
        proxy=input["proxy"],
    )
    def scrape_website_args(driver: AntiDetectDriver, data) -> CookieResponse:
        heading = driver.text('h1')
        cookies = driver.get_cookies()
        serialized_data = json.dumps(cookies)
        nonlocal pid
        pid = driver.service.process.pid
        # I tried these three functions but they don't work.
        # driver.service.process.kill()
        # driver.service.process.terminate()
        # driver.quit()


        return {
            "heading": heading,
            "cookies": cookies,
            "chromeOptions": driver.capabilities['goog:chromeOptions'],
        }

    response = scrape_website_args(input)
    print(response)
    # scrape_website_args.close() => this also does not work even with reuse_driver=True and keep_driver_alive=True
    return response
    

if __name__ == "__main__":
    response = scrape_cookies(
        {
            "proxy": "proxy here",
            "url": "https://www.instacart.com/",
        }
    )
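Not a botasaurus API, just a generic fallback sketch: when driver.quit() leaves Chrome processes behind, you can collect the chromedriver process and its children with psutil before quitting and kill whatever survives. psutil, and reading the chromedriver PID from driver.service.process as in the report above, are the assumptions here.

import psutil

def force_close(driver):
    # capture chromedriver and its Chrome children *before* quitting,
    # because after quit() any survivors get re-parented and are hard to find
    service = psutil.Process(driver.service.process.pid)
    children = service.children(recursive=True)
    try:
        driver.quit()
    except Exception:
        pass
    for proc in [service, *children]:
        try:
            proc.kill()
        except psutil.NoSuchProcess:
            pass  # already exited cleanly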

python -m pip install botasaurus returning error

Defaulting to user installation because normal site-packages is not writeable
Obtaining file:///C:/Users/cntow/Downloads/Compressed/botasaurus-master/botasaurus-master
Installing build dependencies ... done
Checking if build backend supports build_editable ... done
Getting requirements to build editable ... error
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> [47 lines of output]
C:\Users\cntow\AppData\Local\Microsoft\WindowsApps\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\python.exe: No module named pip
Traceback (most recent call last):
File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\importlib\metadata_init_.py", line 397, in from_name
return next(cls.discover(name=name))
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
StopIteration

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "<string>", line 77, in install_javascript_package
    File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\importlib\metadata\__init__.py", line 861, in distribution
      return Distribution.from_name(distribution_name)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\importlib\metadata\__init__.py", line 399, in from_name
      raise PackageNotFoundError(name)
  importlib.metadata.PackageNotFoundError: No package metadata was found for javascript

  During handling of the above exception, another exception occurred:

  Traceback (most recent call last):
    File "C:\Users\cntow\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 353, in <module>
      main()
    File "C:\Users\cntow\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 335, in main
      json_out['return_val'] = hook(**hook_input['kwargs'])
                               ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\cntow\AppData\Local\Packages\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\LocalCache\local-packages\Python312\site-packages\pip\_vendor\pyproject_hooks\_in_process\_in_process.py", line 132, in get_requires_for_build_editable
      return hook(config_settings)
             ^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 441, in get_requires_for_build_editable
      return self.get_requires_for_build_wheel(config_settings)
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 325, in get_requires_for_build_wheel
      return self._get_build_requires(config_settings, requirements=['wheel'])
             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
    File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 295, in _get_build_requires
      self.run_setup()
    File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 480, in run_setup
      super(_BuildMetaLegacyBackend, self).run_setup(setup_script=setup_script)
    File "C:\Users\cntow\AppData\Local\Temp\pip-build-env-p_t7bham\overlay\Lib\site-packages\setuptools\build_meta.py", line 311, in run_setup
      exec(code, locals())
    File "<string>", line 87, in <module>
    File "<string>", line 83, in pre_install
    File "<string>", line 79, in install_javascript_package
    File "C:\Program Files\WindowsApps\PythonSoftwareFoundation.Python.3.12_3.12.496.0_x64__qbz5n2kfra8p0\Lib\subprocess.py", line 413, in check_call
      raise CalledProcessError(retcode, cmd)
  subprocess.CalledProcessError: Command '['C:\\Users\\cntow\\AppData\\Local\\Microsoft\\WindowsApps\\PythonSoftwareFoundation.Python.3.12_qbz5n2kfra8p0\\python.exe', '-m', 'pip', 'install', 'javascript']' returned non-zero exit status 1.
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
error: subprocess-exited-with-error

× Getting requirements to build editable did not run successfully.
│ exit code: 1
╰─> See above for output.

Fail close driver on browser crashing

Description

Each selenium.common.exceptions.InvalidSessionIdException error breaks execution of bose.launch_tasks.launch_tasks function.

Steps to Reproduce

  1. Create a task to scan some site; iterate over a few pages (http://site/page-1 http://site/page-2 ... http://site/page-6)
  2. Package the code into a Docker image
  3. Run the container
  4. Get errors with a message like "Message: unknown error: session deleted because of page crash"
  5. Catch InvalidSessionIdException inside task.run(self, driver: BoseDriver, data: any)
  6. Get an error TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'

Expected behavior: a broken task can be finished normally

Actual behavior: a broken task stops the whole process; the next tasks will not be executed

Reproduces how often: for sites with bot detection - 99% of cases

Additional context

Can't reproduce on host machine. Only inside docker container.

Full stack-trace:

Traceback (most recent call last):
2023-07-16T19:02:56.132915200Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 210, in run_task
2023-07-16T19:02:56.132923000Z     close_driver(driver)
2023-07-16T19:02:56.132927600Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 203, in close_driver
2023-07-16T19:02:56.132939700Z     driver.close()
2023-07-16T19:02:56.133008400Z   File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 551, in close
2023-07-16T19:02:56.133173900Z     self.execute(Command.CLOSE)
2023-07-16T19:02:56.133260300Z   File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/webdriver.py", line 429, in execute
2023-07-16T19:02:56.133411400Z     self.error_handler.check_response(response)
2023-07-16T19:02:56.133519800Z   File "/code/venv/lib/python3.11/site-packages/selenium/webdriver/remote/errorhandler.py", line 243, in check_response
2023-07-16T19:02:56.133590700Z     raise exception_class(message, screen, stacktrace)
2023-07-16T19:02:56.133747600Z selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
2023-07-16T19:02:56.133802800Z Stacktrace:
2023-07-16T19:02:56.133812700Z #0 0x55a7b015a233 <unknown>
2023-07-16T19:02:56.133817900Z #1 0x55a7afe89770 <unknown>
2023-07-16T19:02:56.133822800Z #2 0x55a7afeb9589 <unknown>
2023-07-16T19:02:56.133826700Z #3 0x55a7afee4b86 <unknown>
2023-07-16T19:02:56.133830800Z #4 0x55a7afee0dea <unknown>
2023-07-16T19:02:56.133834700Z #5 0x55a7afee0516 <unknown>
2023-07-16T19:02:56.133838800Z #6 0x55a7afe593a3 <unknown>
2023-07-16T19:02:56.133843100Z #7 0x55a7b011a114 <unknown>
2023-07-16T19:02:56.133858100Z #8 0x55a7b011df67 <unknown>
2023-07-16T19:02:56.133863200Z #9 0x55a7b01286b0 <unknown>
2023-07-16T19:02:56.133867700Z #10 0x55a7b011ebb3 <unknown>
2023-07-16T19:02:56.133871100Z #11 0x55a7b00ec95a <unknown>
2023-07-16T19:02:56.133874900Z #12 0x55a7afe57b83 <unknown>
2023-07-16T19:02:56.133878600Z #13 0x7f92a414e18a <unknown>
2023-07-16T19:02:56.133882400Z 
2023-07-16T19:02:56.133886500Z 
2023-07-16T19:02:56.133890300Z During handling of the above exception, another exception occurred:
2023-07-16T19:02:56.133893900Z 
2023-07-16T19:02:56.133898000Z Traceback (most recent call last):
2023-07-16T19:02:56.133902300Z   File "/code/main.py", line 5, in <module>
2023-07-16T19:02:56.133907800Z     launch_tasks(*tasks_to_be_run)
2023-07-16T19:02:56.133919000Z   File "/code/venv/lib/python3.11/site-packages/bose/launch_tasks.py", line 54, in launch_tasks
2023-07-16T19:02:56.134112000Z     current_output = task.begin_task(current_data, task_config)
2023-07-16T19:02:56.134164700Z                      ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134195200Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 237, in begin_task
2023-07-16T19:02:56.134256000Z     final = run_task(False, 0)
2023-07-16T19:02:56.134311100Z             ^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134322600Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 221, in run_task
2023-07-16T19:02:56.134434300Z     end_task(driver)
2023-07-16T19:02:56.134487500Z   File "/code/venv/lib/python3.11/site-packages/bose/base_task.py", line 149, in end_task
2023-07-16T19:02:56.134570900Z     task.end()
2023-07-16T19:02:56.134647300Z   File "/code/venv/lib/python3.11/site-packages/bose/task_info.py", line 38, in end
2023-07-16T19:02:56.134716200Z     self.data["duration"] = format_time_diff(self.data["start_time"],self.data["end_time"])
2023-07-16T19:02:56.134774900Z                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
2023-07-16T19:02:56.134807900Z   File "/code/venv/lib/python3.11/site-packages/bose/task_info.py", line 11, in format_time_diff
2023-07-16T19:02:56.134914700Z     time_diff = end_time - start_time
2023-07-16T19:02:56.134991400Z                 ~~~~~~~~~^~~~~~~~~~~~
2023-07-16T19:02:56.135022200Z TypeError: unsupported operand type(s) for -: 'datetime.datetime' and 'str'

Mobile Proxy Support

This library needs mobile proxy support, which requires running an IP-changer request before the browser opens, etc.

Don't understand how to use any of the ProfileManager functions

Description

The scraping stuff is all clear to me as I have experience with Selenium (and the Beautiful Soup stuff wasn't too bad either), but I am not getting how to properly use anything here: https://github.com/omkarcloud/botasaurus?tab=readme-ov-file#i-have-automated-the-creation-of-user-accounts-now-i-want-to-store-the-user-account-credentials-like-email-and-password-how-to-store-it

Steps to Reproduce

Here's the script; its only purpose is to grasp how to use these functions:

from botasaurus import *

user = bt.generate_user(country=bt.Country.IN)
print(user)
bt.Profile.set_profile(user)
bt.Profile.set_item("api_key", "BDEC26")
profiles = bt.Profile.get_all_profiles()
print(profiles)

Expected behavior: Save user data in a Chrome profile?

Actual behavior:

Traceback (most recent call last):
File "C:\py311botasaurus\btprofiles.py", line 5, in
bt.Profile.set_profile(user)
File "C:\py311botasaurus\py311botasaurus\Lib\site-packages\botasaurus\profile.py", line 146, in set_profile
self.check_profile()
File "C:\py311botasaurus\py311botasaurus\Lib\site-packages\botasaurus\profile.py", line 83, in check_profile
raise Exception('This method can only be run in run method of Task and when you have given the current profile in the Browser Config.')
Exception: This method can only be run in run method of Task and when you have given the current profile in the Browser Config.

Reproduces how often:

Every time. I have no idea how to use anything here: https://github.com/omkarcloud/botasaurus?tab=readme-ov-file#i-have-automated-the-creation-of-user-accounts-now-i-want-to-store-the-user-account-credentials-like-email-and-password-how-to-store-it
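Going by the error message ("This method can only be run in run method of Task and when you have given the current profile in the Browser Config"), a minimal sketch would be to call the Profile helpers from inside a @browser-decorated function that has a profile set, along the lines below. This is pieced together from the error text and the profile option shown in other issues here; exact helper names may differ between botasaurus versions.

from botasaurus import *

@browser(profile="my-profile")        # the profile named here is what bt.Profile writes to
def create_account(driver: AntiDetectDriver, data):
    user = bt.generate_user(country=bt.Country.IN)
    bt.Profile.set_profile(user)      # now runs inside the task, so the check passes
    bt.Profile.set_item("api_key", "BDEC26")

create_account()
print(bt.Profile.get_all_profiles())  # inspect what was stored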

Additional context

Wrapper and Decorator Help

Can you add a C# wrapper?
And can we use this library without decorating a function? If yes, can you give me an example?

I looked at the code, and a lot of the anti-bot code is based on English strings like "Please verify you are a human"; can you add localization support for that?

Enums, Consts could be in another folder

    def short_random_sleep(self):
        sleep_for_n_seconds(uniform(2, 4))

    def long_random_sleep(self):
        sleep_for_n_seconds(uniform(6, 9))

    def sleep_forever(self):
        sleep_forever()
# anti_detect_driver.py #85-92: I think this should be in utils.py, with a reference to the enum in wait.py

accept_cookies_btn = driver.get_element_or_none_by_selector("button#L2AGLb", None)
# accept_google_cookies.py #27: I think the button#L2AGLb selector and others can be changed/randomized by Google. Because of that, I think you could create a JSON-based parser for your library; with it, users could define their own selectors/captcha solvers/clickers without forking the repository.

    def google_get(self, link,  wait=None, accept_cookies=False):
        self.get("https://www.google.com/")
        if accept_cookies:
            accept_google_cookies(self)
        return self.get_by_current_page_referrer(link, wait)

    def get_google(self, accept_cookies=False):
        self.get("https://www.google.com/")
        if accept_cookies:
            accept_google_cookies(self)
        # self.get_element_or_none_by_selector('input[role="combobox"]', Wait.VERY_LONG)
# anti_detect_driver.py #342-352: there are a lot of Google URLs. Firstly, you could add an enum/consts class for "https://www.google.com/", and secondly, you could add support for customising the URL (letting the user select which Google URL to use).

# You could add support for searching on Google (with a keyword) and listing the results. If you add it, I think it should be enumerable, because the user may not want to click/navigate to the pages in the first list (or the second list, etc.)...

# And it looks like there is a spelling error in accept_google_cookies.py #25:
                    raise Exception("Unabe to load Google")

Also, Selenium automatically sets the navigator.webdriver variable to true; you need to change this variable to false.
And you need to add --disable-blink-features=AutomationControlled (I haven't looked at all the code, maybe it has already been added).

I haven't used new versions of Selenium; maybe this has changed.
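For reference, a minimal plain-Selenium sketch of the two tweaks suggested above (this is generic Selenium/CDP usage, not how botasaurus implements it internally):

from selenium import webdriver

options = webdriver.ChromeOptions()
options.add_argument("--disable-blink-features=AutomationControlled")

driver = webdriver.Chrome(options=options)

# hide navigator.webdriver before any page script runs
driver.execute_cdp_cmd(
    "Page.addScriptToEvaluateOnNewDocument",
    {"source": "Object.defineProperty(navigator, 'webdriver', {get: () => undefined})"},
)
driver.get("https://www.omkar.cloud/")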

remove NPM dependency proxy-chain

proxy-chain is required by botasaurus-proxy-authentication but not by botasaurus

$ grep -r -w -h 'require(' botasaurus/botasaurus/
got = require("got-scraping-export")
chrome_launcher = require("chrome-launcher")

currently setup.py will also install proxy-chain

botasaurus/setup.py

Lines 57 to 61 in 34ad082

print("Installing needed npm packages")
# Install each npm package
self.install_npm_package("proxy-chain")
self.install_npm_package("got-scraping-export")
self.install_npm_package("chrome-launcher")

Unable to pass cloudflare challenge

Hey,

I just started using your tool and it is truly amazing!
However, I have some trouble to bypass some cloudflare protections.

from botasaurus import *


@browser()
def scrape_heading_task(driver: AntiDetectDriver, data):
    driver.google_get("https://vulbis.com")
    driver.prompt()
    heading = driver.text('html')
    return heading

if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()

Issue: it keeps refreshing the page and is not able to solve the challenge.

Profile does not work

I used
@browser(
    headless=True,
    profile='my-profile',
    proxy="http://your_proxy_address:your_proxy_port",
    user_agent=bt.UserAgents.user_agent_106
)
but it does not work, and I don't know where the profile was created.
Thanks!

Dockerfile

Hello,

I would like to know where the Dockerfile is located so that I can update the packages on my end.

Thanks.

Getting started with Botasaurus script throws lots of exceptions

Description

I am just seeing a ton of exceptions trying to run the first Selenium scraping task that goes to https://www.omkar.cloud/ and grabs the h1 heading. It's the first Botasaurus script here:

from botasaurus import *

@browser
def scrape_heading_task(driver: AntiDetectDriver, data):
    # Navigate to the Omkar Cloud website
    driver.get("https://www.omkar.cloud/")
    
    # Retrieve the heading element's text
    heading = driver.text("h1")

    # Save the data as a JSON file in output/all.json
    return {
        "heading": heading
    }
     
if __name__ == "__main__":
    # Initiate the web scraping task
    scrape_heading_task()

It's the first script in the "What Is Botasaurus" docs: https://www.omkar.cloud/botasaurus/docs/what-is-botasaurus/

Steps to Reproduce

  1. Run python main.py

Expected behavior:

Scrape the h1 heading and store it as a string called heading which is returned once the function is called (and presumably automatically saved into a json file by the botasaurus framework)

Actual behavior:

Lots of errors:

(py311selenium) C:\py311seleniumbot>python main.py
Running

DevTools listening on ws://127.0.0.1:64985/devtools/browser/6520850b-e749-463b-9c45-8e5ecdea678e
[24816:3140:1224/150501.718:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

[24816:3140:1224/150501.917:ERROR:cert_issuer_source_aia.cc(34)] Error parsing cert retrieved from AIA (as DER):
ERROR: Couldn't read tbsCertificate as SEQUENCE
ERROR: Failed parsing Certificate

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable
Error getting page source: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580980]
        (No symbol) [0x00581F8D]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x008483B8]
        (No symbol) [0x008484DD]
        (No symbol) [0x00835818]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 431, in save_screenshot
    self.get_screenshot_as_file(
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 927, in get_screenshot_as_file
    png = self.get_screenshot_as_png()
          ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 963, in get_screenshot_as_png
    return b64decode(self.get_screenshot_as_base64().encode('ascii'))
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 975, in get_screenshot_as_base64
    return self.execute(Command.SCREENSHOT)['value']
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Failed to save screenshot
Failed for input: None
We've paused the browser to help you debug. Press 'Enter' to close.
Traceback (most recent call last):
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 377, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 250, in close_driver
    driver.quit()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\anti_detect_driver.py", line 470, in quit
    self.close_proxy()
TypeError: 'bool' object is not callable

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\py311seleniumbot\main.py", line 18, in <module>
    scrape_heading_task()
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 443, in wrapper_browser
    current_result = run_task(data_item, False, 0)
                     ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 411, in run_task
    close_driver(driver)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\botasaurus\decorators.py", line 249, in close_driver
    driver.close()
    ^^^^^^^^^^^^^^
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 551, in close
    self.execute(Command.CLOSE)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\webdriver.py", line 429, in execute
    self.error_handler.check_response(response)
  File "C:\py311seleniumbot\py311selenium\Lib\site-packages\selenium\webdriver\remote\errorhandler.py", line 243, in check_response
    raise exception_class(message, screen, stacktrace)
selenium.common.exceptions.InvalidSessionIdException: Message: invalid session id
Stacktrace:
        GetHandleVerifier [0x00916EE3+174339]
        (No symbol) [0x00840A51]
        (No symbol) [0x00556E8A]
        (No symbol) [0x00580862]
        (No symbol) [0x005A6EBA]
        (No symbol) [0x005A2036]
        (No symbol) [0x005A1CC2]
        (No symbol) [0x005370DB]
        (No symbol) [0x005375DE]
        (No symbol) [0x005379EB]
        GetHandleVerifier [0x009B4B1C+820540]
        sqlite3_dbdata_init [0x00A753EE+653550]
        sqlite3_dbdata_init [0x00A74E09+652041]
        sqlite3_dbdata_init [0x00A697CC+605388]
        sqlite3_dbdata_init [0x00A75D9B+656027]
        (No symbol) [0x0084FE6C]
        (No symbol) [0x00536F4C]
        (No symbol) [0x00536AEA]
        (No symbol) [0x006A526C]
        BaseThreadInitThunk [0x76FBFCC9+25]
        RtlGetAppContainerNamedObjectPath [0x774D7C6E+286]
        RtlGetAppContainerNamedObjectPath [0x774D7C3E+238]

Reproduces how often:

It happens every time.

Additional context

I set up a virtual environment with botasaurus.
