
Google Search Results via SERP API pip Python Package

License: MIT License

Topics: python, serp-api, bing-image, google-crawler, google-images, scraping, serpapi, web-scraping

google-search-results-python's Introduction

Google Search Results in Python


This Python package is meant to scrape and parse search results from Google, Bing, Baidu, Yandex, Yahoo, Home Depot, eBay and more, using SerpApi.

The following services are provided:

  • Search API
  • Search Archive API
  • Account API
  • Location API (Google only)

SerpApi provides a script builder to get you started quickly.

Installation

Python 3.7+

pip install google-search-results

Link to the python package page

Quick start

from serpapi import GoogleSearch
search = GoogleSearch({
    "q": "coffee", 
    "location": "Austin,Texas",
    "api_key": "<your secret api key>"
  })
result = search.get_dict()

This example runs a search for "coffee" using your secret API key.
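Continuing the example above, a minimal way to read something out of the returned dictionary (assuming the search succeeded and returned organic results):

# print the title of the first organic result
print(result["organic_results"][0]["title"])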

The SerpApi service (backend):

  • Searches Google using the query: q = "coffee"
  • Parses the messy HTML responses
  • Returns a standardized JSON response

The GoogleSearch class:

  • Formats the request
  • Executes a GET HTTP request against the SerpApi service
  • Parses the JSON response into a dictionary

Et voilà...

Alternatively, you can search:

  • Bing using BingSearch class
  • Baidu using BaiduSearch class
  • Yahoo using YahooSearch class
  • DuckDuckGo using DuckDuckGoSearch class
  • eBay using EbaySearch class
  • Yandex using YandexSearch class
  • HomeDepot using HomeDepotSearch class
  • GoogleScholar using GoogleScholarSearch class
  • Youtube using YoutubeSearch class
  • Walmart using WalmartSearch
  • Apple App Store using AppleAppStoreSearch class
  • Naver using NaverSearch class

See the playground to generate your code.

Summary

Google Search API capability

Source code.

params = {
  "q": "coffee",
  "location": "Location Requested", 
  "device": "desktop|mobile|tablet",
  "hl": "Google UI Language",
  "gl": "Google Country",
  "safe": "Safe Search Flag",
  "num": "Number of Results",
  "start": "Pagination Offset",
  "api_key": "Your SerpApi Key", 
  # "to be matched": type of search (news, images, shopping)
  "tbm": "nws|isch|shop", 
  # "to be searched": custom search criteria
  "tbs": "custom to be searched criteria",
  # allow async request
  "async": "true|false",
  # output format
  "output": "json|html"
}

# define the search
search = GoogleSearch(params)
# override an existing parameter
search.params_dict["location"] = "Portland"
# search format return as raw html
html_results = search.get_html()
# parse results
#  as python Dictionary
dict_results = search.get_dict()
#  as JSON using json package
json_results = search.get_json()
#  as dynamic Python object
object_result = search.get_object()

Link to the full documentation

See below for more hands-on examples.

How to set SerpApi key

You can get an API key here if you don't already have one: https://serpapi.com/users/sign_up

The SerpApi api_key can be set globally:

GoogleSearch.SERP_API_KEY = "Your Private Key"

The SerpApi api_key can be provided for each search:

query = GoogleSearch({"q": "coffee", "serp_api_key": "Your Private Key"})

Example by specification

We love true open source, continuous integration and Test Driven Development (TDD). We are using RSpec to test our infrastructure around the clock to achieve the best Quality of Service (QoS).

The directory test/ includes specification/examples.

Set your API key.

export API_KEY="your secret key"

Run the tests

make test

Location API

from serpapi import GoogleSearch
search = GoogleSearch({})
location_list = search.get_location("Austin", 3)
print(location_list)

This prints the first 3 locations matching Austin (Texas, Texas, Rochester).

[   {   'canonical_name': 'Austin,TX,Texas,United States',
        'country_code': 'US',
        'google_id': 200635,
        'google_parent_id': 21176,
        'gps': [-97.7430608, 30.267153],
        'id': '585069bdee19ad271e9bc072',
        'keys': ['austin', 'tx', 'texas', 'united', 'states'],
        'name': 'Austin, TX',
        'reach': 5560000,
        'target_type': 'DMA Region'},
        ...]

Search Archive API

The search results are stored in a temporary cache. The previous search can be retrieved from the cache for free.

from serpapi import GoogleSearch
search = GoogleSearch({"q": "Coffee", "location": "Austin,Texas"})
search_result = search.get_dictionary()
assert search_result.get("error") == None
search_id = search_result.get("search_metadata").get("id")
print(search_id)

Now let's retrieve the previous search from the archive.

archived_search_result = GoogleSearch({}).get_search_archive(search_id, 'json')
print(archived_search_result.get("search_metadata").get("id"))

This prints the search result from the archive.

Account API

from serpapi import GoogleSearch
search = GoogleSearch({})
account = search.get_account()

This retrieves your account information.

Search Bing

from serpapi import BingSearch
search = BingSearch({"q": "Coffee", "location": "Austin,Texas"})
data = search.get_dict()

This code fetches Bing search results for coffee as a dictionary.

https://serpapi.com/bing-search-api

Search Baidu

from serpapi import BaiduSearch
search = BaiduSearch({"q": "Coffee"})
data = search.get_dict()

This code fetches Baidu search results for coffee as a dictionary. https://serpapi.com/baidu-search-api

Search Yandex

from serpapi import YandexSearch
search = YandexSearch({"text": "Coffee"})
data = search.get_dict()

This code fetches Yandex search results for coffee as a dictionary.

https://serpapi.com/yandex-search-api

Search Yahoo

from serpapi import YahooSearch
search = YahooSearch({"p": "Coffee"})
data = search.get_dict()

This code fetches Yahoo search results for coffee as a dictionary.

https://serpapi.com/yahoo-search-api

Search eBay

from serpapi import EbaySearch
search = EbaySearch({"_nkw": "Coffee"})
data = search.get_dict()

This code fetches eBay search results for coffee as a dictionary.

https://serpapi.com/ebay-search-api

Search Home Depot

from serpapi import HomeDepotSearch
search = HomeDepotSearch({"q": "chair"})
data = search.get_dict()

This code fetches Home Depot search results for chair as a dictionary.

https://serpapi.com/home-depot-search-api

Search Youtube

from serpapi import YoutubeSearch
search = YoutubeSearch({"search_query": "chair"})
data = search.get_dict()

This code fetches YouTube search results for chair as a dictionary.

https://serpapi.com/youtube-search-api

Search Google Scholar

from serpapi import GoogleScholarSearch
search = GoogleScholarSearch({"q": "Coffee"})
data = search.get_dict()

This code fetches Google Scholar search results.

Search Walmart

from serpapi import WalmartSearch
search = WalmartSearch({"query": "chair"})
data = search.get_dict()

This code fetches Walmart search results.

Search Apple App Store

from serpapi import AppleAppStoreSearch
search = AppleAppStoreSearch({"term": "Coffee"})
data = search.get_dict()

This code fetches Apple App Store search results.

Search Naver

from serpapi import NaverSearch
search = NaverSearch({"query": "chair"})
data = search.get_dict()

This code fetches Naver search results.

Generic search with SerpApiClient

from serpapi import SerpApiClient
query = {"q": "Coffee", "location": "Austin,Texas", "engine": "google"}
search = SerpApiClient(query)
data = search.get_dict()

This class enables interaction with any search engine supported by SerpApi.com.
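For example, the same client can target another engine just by changing the engine parameter. A minimal sketch (the YouTube engine expects search_query instead of q, as shown in the YouTube example above; the API key is a placeholder):

from serpapi import SerpApiClient

query = {"engine": "youtube", "search_query": "Coffee", "api_key": "<your secret api key>"}
search = SerpApiClient(query)
data = search.get_dict()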

Search Google Images

from serpapi import GoogleSearch
search = GoogleSearch({"q": "coffee", "tbm": "isch"})
for image_result in search.get_dict()['images_results']:
    link = image_result["original"]
    try:
        print("link: " + link)
        # wget.download(link, '.')
    except Exception:
        pass

This code prints all the image links, and downloads the images if you uncomment the line with wget (a Linux/macOS tool for downloading files).

This tutorial covers more ground on this topic. https://github.com/serpapi/showcase-serpapi-tensorflow-keras-image-training

Search Google News

from serpapi import GoogleSearch
search = GoogleSearch({
    "q": "coffe",   # search search
    "tbm": "nws",  # news
    "tbs": "qdr:d", # last 24h
    "num": 10
})
for offset in [0,1,2]:
    search.params_dict["start"] = offset * 10
    data = search.get_dict()
    for news_result in data['news_results']:
        print(str(news_result['position'] + offset * 10) + " - " + news_result['title'])

This script prints the first 3 pages of the news headlines for the last 24 hours.

Search Google Shopping

from serpapi import GoogleSearch
search = GoogleSearch({
    "q": "coffee",  # search query
    "tbm": "shop",  # shopping
    "tbs": "p_ord:rv", # ordered by review
    "num": 100
})
data = search.get_dict()
for shopping_result in data['shopping_results']:
    print(str(shopping_result['position']) + " - " + shopping_result['title'])

This script prints all the shopping results, ordered by review order.

Google Search By Location

With SerpApi, we can build a Google search from anywhere in the world. This code looks for the best coffee shop for the given cities.

from serpapi import GoogleSearch
for city in ["new york", "paris", "berlin"]:
  location = GoogleSearch({}).get_location(city, 1)[0]["canonical_name"]
  search = GoogleSearch({
      "q": "best coffee shop",   # search query
      "location": location,
      "num": 1,
      "start": 0
  })
  data = search.get_dict()
  top_result = data["organic_results"][0]["title"]
  print("top coffee shop in " + city + ": " + top_result)

Batch Asynchronous Searches

We offer two ways to boost your searches thanks to the async parameter.

  • Blocking - async=false - more compute intensive, because the client needs to maintain many open connections. (default)
  • Non-blocking - async=true - the way to go for large batches of queries. (recommended)

# Operating system
import os

# regular expression library
import re

# safe queue (named Queue in python2)
from queue import Queue

# Time utility
import time

# SerpApi search
from serpapi import GoogleSearch

# store searches
search_queue = Queue()

# SerpApi search
search = GoogleSearch({
    "location": "Austin,Texas",
    "async": True,
    "api_key": os.getenv("API_KEY")
})

# loop through a list of companies
for company in ['amd', 'nvidia', 'intel']:
    print("execute async search: q = " + company)
    search.params_dict["q"] = company
    result = search.get_dict()
    if "error" in result:
        print("oops error: ", result["error"])
        continue
    print("add search to the queue where id: ", result['search_metadata'])
    # add search to the search_queue
    search_queue.put(result)

print("wait until all search statuses are cached or success")

# process the queue: retrieve each search from the archive
while not search_queue.empty():
    result = search_queue.get()
    search_id = result['search_metadata']['id']

    # retrieve search from the archive - blocker
    print(search_id + ": get search from archive")
    search_archived = search.get_search_archive(search_id)
    print(search_id + ": status = " +
          search_archived['search_metadata']['status'])

    # check status
    if re.search('Cached|Success',
                 search_archived['search_metadata']['status']):
        print(search_id + ": search done with q = " +
              search_archived['search_parameters']['q'])
    else:
        # requeue search_queue
        print(search_id + ": requeue search")
        search_queue.put(result)

        # wait 1s
        time.sleep(1)

print('all searches completed')

This code shows how to run searches asynchronously. The search parameters must have {async: True}. This indicates that the client shouldn't wait for the search to be completed. The current thread that executes the search is now non-blocking, which allows it to execute thousands of searches in seconds. The SerpApi backend will do the processing work. The actual search result is deferred to a later call from the search archive using get_search_archive(search_id). In this example the non-blocking searches are persisted in a queue: search_queue. A loop through the search_queue allows it to fetch individual search results. This process can easily be multithreaded to allow a large number of concurrent search requests. To keep things simple, this example only explores search results one at a time (single threaded).
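For illustration, here is a multithreaded sketch of the same idea using concurrent.futures (the polling helper and worker count are assumptions for this example, not part of the library):

import os
import time
from concurrent.futures import ThreadPoolExecutor

from serpapi import GoogleSearch

search = GoogleSearch({
    "location": "Austin,Texas",
    "async": True,
    "api_key": os.getenv("API_KEY")
})

def wait_for_result(search_id):
    # poll the search archive until the search is cached or successful
    while True:
        archived = search.get_search_archive(search_id)
        if archived["search_metadata"]["status"] in ("Cached", "Success"):
            return archived
        time.sleep(1)

# submit the non-blocking searches
search_ids = []
for company in ["amd", "nvidia", "intel"]:
    search.params_dict["q"] = company
    result = search.get_dict()
    search_ids.append(result["search_metadata"]["id"])

# fetch the archived results concurrently
with ThreadPoolExecutor(max_workers=3) as executor:
    for archived in executor.map(wait_for_result, search_ids):
        print(archived["search_parameters"]["q"] + ": done")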

See example.

Python object as a result

The search results can be automatically wrapped in a dynamically generated Python object. This offers a more dynamic, fully object-oriented approach than the regular dictionary / JSON data structure.

from serpapi import GoogleSearch
search = GoogleSearch({"q": "Coffee", "location": "Austin,Texas"})
r = search.get_object()
assert type(r.organic_results) == list
assert r.organic_results[0].title
assert r.search_metadata.id
assert r.search_metadata.google_url
assert r.search_parameters.q, "Coffee"
assert r.search_parameters.engine, "google"

Pagination using iterator

Let's collect links across multiple search results pages.

import os
from serpapi import GoogleSearch

# to get 4 pages of 10 results each
start = 0
end = 40
page_size = 10

# basic search parameters
parameter = {
  "q": "coca cola",
  "tbm": "nws",
  "api_key": os.getenv("API_KEY"),
  # optional pagination parameter
  #  the pagination method can take argument directly
  "start": start,
  "end": end,
  "num": page_size
}

# as a proof of concept, collect the result URLs
urls = []

# initialize a search
search = GoogleSearch(parameter)

# create a python generator using parameter
pages = search.pagination()
# or set custom parameter
pages = search.pagination(start, end, page_size)

# fetch one search result per iteration 
# using a basic python for loop 
# which invokes python iterator under the hood.
for page in pages:
  print(f"Current page: {page['serpapi_pagination']['current']}")
  for news_result in page["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")
    urls.append(news_result['link'])
  
# check if the total number of results is as expected
# note: the exact number varies depending on the search engine backend
if len(urls) == (end - start):
  print("all search results count match!")
if len(urls) == len(set(urls)):
  print("all search results are unique!")

Examples to fetch links with pagination: test file, online IDE

Error management

SerpApi keeps error management simple.

  • Backend service error or search failure
  • Client error

If it's a backend error, a simple error message is returned as a string in the server response.

from serpapi import GoogleSearch
search = GoogleSearch({"q": "Coffee", "location": "Austin,Texas", "api_key": "<secret_key>"})
data = search.get_json()
assert data["error"] == None

In some cases, there are more details available in the data object.

If it's a client error, then a SerpApiClientException is raised.
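Putting both together, a minimal sketch (assuming SerpApiClientException can be imported from the serpapi package, per the 2.4.1 change log entry below):

from serpapi import GoogleSearch, SerpApiClientException

try:
    search = GoogleSearch({"q": "Coffee", "location": "Austin,Texas", "api_key": "<secret_key>"})
    data = search.get_dict()
    if data.get("error"):
        # backend error: reported as a plain string in the response
        print("backend error: " + data["error"])
except SerpApiClientException as e:
    # client error: raised as an exception
    print("client error: " + str(e))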

Change log

2023-03-10 @ 2.4.2

  • Change long description to README.md

2021-12-22 @ 2.4.1

  • add more search engines
    • youtube
    • walmart
    • apple_app_store
    • naver
  • raise SerpApiClientException instead of raw string in order to follow Python guideline 3.5+
  • add more unit error tests for serp_api_client

2021-07-26 @ 2.4.0

  • add page size support using num parameter
  • add youtube search engine

2021-06-05 @ 2.3.0

  • add pagination support

2021-04-28 @ 2.2.0

  • add get_response method to provide raw requests.Response object

2021-04-04 @ 2.1.0

  • Add home depot search engine
  • get_object() returns dynamic Python object

2020-10-26 @ 2.0.0

  • Reduce class name to Search
  • Add get_raw_json

2020-06-30 @ 1.8.3

  • simplify import
  • improve package for python 3.5+
  • add support for python 3.5 and 3.6

2020-03-25 @ 1.8

  • add support for Yandex, Yahoo, Ebay
  • clean-up test

2019-11-10 @ 1.7.1

  • increase engine parameter priority over engine value set in the class

2019-09-12 @ 1.7

  • Change namespace from "from lib import ..." to "from serpapi import GoogleSearch"
  • Support for Bing and Baidu

2019-06-25 @ 1.6

  • New search engine supported: Baidu and Bing

Conclusion

SerpApi supports all the major search engines. Google has the most advanced support, with all the major services available: Images, News, Shopping, and more. To enable a type of search, the field tbm ("to be matched") must be set to:

  • isch: Google Images API.
  • nws: Google News API.
  • shop: Google Shopping API.
  • any other Google service should work out of the box.
  • (no tbm parameter): regular Google search.

The tbs field allows you to customize the search even further.

The full documentation is available here.

google-search-results-python's People

Contributors

ajsierra117, dimitryzub, elizost, gbcfxs, hartator, heyalexej, ilyazub, justinrobertohara, jvmvik, kennethreitz, lf2225, manoj-nathwani, paplorinc


google-search-results-python's Issues

[Feature request] Make `async: True` do everything under the hood

From a user's perspective, the less setup required the better. I personally find the second example (example.py) more user-friendly, especially for less technical users.

The user would just add async: True and wouldn't have to spend another hour or so tinkering and figuring out how Queue or anything else works.

@jvmvik @ilyazub @hartator what do you guys think?

@aliayar @marm123 @schaferyan have you guys noticed similar issues for the users or have any users requested similar things?


What if instead of this:

# async batch requests: https://github.com/serpapi/google-search-results-python#batch-asynchronous-searches

from serpapi import YoutubeSearch
from queue import Queue
import os, re, json

queries = [
    'burly',
    'creator',
    'doubtful'
]

search_queue = Queue()

for query in queries:
    params = {
        'api_key': '...',                 
        'engine': 'youtube',              
        'device': 'desktop',              
        'search_query': query,          
        'async': True,                   # ❗
        'no_cache': 'true'
    }
    search = YoutubeSearch(params)       
    results = search.get_dict()         
    
    if 'error' in results:
        print(results['error'])
        break

    print(f"Add search to the queue with ID: {results['search_metadata']}")
    search_queue.put(results)

data = []

while not search_queue.empty():
    result = search_queue.get()
    search_id = result['search_metadata']['id']

    print(f'Get search from archive: {search_id}')
    search_archived = search.get_search_archive(search_id)
    
    print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

    if re.search(r'Cached|Success', search_archived['search_metadata']['status']):
        for video_result in search_archived.get('video_results', []):
            data.append({
                'title': video_result.get('title'),
                'link': video_result.get('link'),
                'channel': video_result.get('channel').get('name'),
            })
    else:
        print(f'Requeue search: {search_id}')
        search_queue.put(result)

Users can do something like this and we handle everything under the hood:

# example.py
# testable example
# example import: from serpapi import async_search

from async_search import async_search
import json

queries = [
    'burly',
    'creator',
    'doubtful',
    'minecraft' 
]

# or as we typically pass params dict
data = async_search(queries=queries, api_key='...', engine='youtube', device='desktop')

print(json.dumps(data, indent=2))
print('All searches completed')

Under the hood code example:

# async_search.py
# testable example

from serpapi import YoutubeSearch
from queue import Queue
import os, re

search_queue = Queue()

def async_search(queries, api_key, engine, device):
    data = []
    for query in queries:
        params = {
            'api_key': api_key,                 
            'engine': engine,              
            'device': device,              
            'search_query': query,          
            'async': True,                  
            'no_cache': 'true'
        }
        search = YoutubeSearch(params)       
        results = search.get_dict()         
        
        if 'error' in results:
            print(results['error'])
            break

        print(f"Add search to the queue with ID: {results['search_metadata']}")
        search_queue.put(results)

    while not search_queue.empty():
        result = search_queue.get()
        search_id = result['search_metadata']['id']

        print(f'Get search from archive: {search_id}')
        search_archived = search.get_search_archive(search_id)
        
        print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

        if re.search(r'Cached|Success', search_archived['search_metadata']['status']):
            for video_result in search_archived.get('video_results', []):
                data.append({
                    'title': video_result.get('title'),
                    'link': video_result.get('link'),
                    'channel': video_result.get('channel').get('name'),
                })
        else:
            print(f'Requeue search: {search_id}')
            search_queue.put(result)
            
    return data

Is there a specific reason we haven't done it before?

SerpApiClient.get_search_archive fails with format='html'

SerpApiClient.get_search_archive assumes all results must be loaded as a JSON, so it fails when using format='html'

GoogleSearchResults({}).get_search_archive(search_id='5df0db57ab3f5837994cd5a1', format='html')
---------------------------------------------------------------------------
JSONDecodeError                           Traceback (most recent call last)
<ipython-input-8-b6d24cb47bf7> in <module>
----> 1 GoogleSearchResults({}).get_search_archive(search_id='5df0db57ab3f5837994cd5a1', format='html')

C:\ProgramData\Anaconda3\lib\site-packages\serpapi\serp_api_client.py in get_search_archive(self, search_id, format)
78             dict|string: search result from the archive
79         """
---> 80         return json.loads(self.get_results("/searches/{0}.{1}".format(search_id, format)))
81
82     def get_account(self):

C:\ProgramData\Anaconda3\lib\json\__init__.py in loads(s, encoding, cls, object_hook, parse_float, parse_int, parse_constant, object_pairs_hook, **kw)
352             parse_int is None and parse_float is None and
353             parse_constant is None and object_pairs_hook is None and not kw):
--> 354         return _default_decoder.decode(s)
355     if cls is None:
356         cls = JSONDecoder

C:\ProgramData\Anaconda3\lib\json\decoder.py in decode(self, s, _w)
337
338         """
--> 339         obj, end = self.raw_decode(s, idx=_w(s, 0).end())
340         end = _w(s, end).end()
341         if end != len(s):

C:\ProgramData\Anaconda3\lib\json\decoder.py in raw_decode(self, s, idx)
355             obj, end = self.scan_once(s, idx)
356         except StopIteration as err:
--> 357             raise JSONDecodeError("Expecting value", s, err.value) from None
358         return obj, end

JSONDecodeError: Expecting value: line 1 column 1 (char 0)

google scholar pagination skips result 20

When retrieving results from Google Scholar using the pagination() method, the first article on the second page of google scholar is always missing.

I think this is caused by the following snippet in the update() method of google-search-results-python/serpapi/pagination.py:

def update(self):
        self.client.params_dict["start"] = self.start
        self.client.params_dict["num"] = self.num
        if self.start > 0:
            self.client.params_dict["start"] += 1

This seems to mean that for all pages except the first, paginate increases start by 1. So while the first page requests results starting at 0 and ending at 19 (if page_size=20), the second page requests results starting at 21 and ending at 40, skipping result 20.

If I delete the if statement, the code seems to work as intended and I get result 19 back.
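For reference, the update() method with that change applied would look like this (a sketch of the reporter's suggestion, not a confirmed fix):

def update(self):
    # proposed: always forward start/num as-is, without the +1 adjustment
    self.client.params_dict["start"] = self.start
    self.client.params_dict["num"] = self.num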

Knowledge Graph object not being sent in response.

Some queries which return a knowledge graph in both my own google search and when tested in the SerpApi Playground are not returning the 'knowledge_graph' key in my own application.

Code:

params = {
    'q': 'Aspen Pumps Ltd',
    'engine': 'google',
    'api_key': <api_key>,
    'num': 100
  }

result_set = GoogleSearchResults(params).get_dict()

print(result_set.keys())

Evaluation:

dict_keys(['search_metadata', 'search_parameters', 'search_information', 'ads', 'shopping_results', 'organic_results', 'related_searches', 'pagination', 'serpapi_pagination'])

Manual Results:

https://www.google.com
Screenshot 2019-07-31 at 15 48 55

https://serpapi.com/playground
Screenshot 2019-07-31 at 15 49 10

Cannot increase the offset between returned results using pagination

I am trying to use the pagination feature based on the code at (https://github.com/serpapi/google-search-results-python#pagination-using-iterator). I want to request 20 results per API call, but pagination by default iterates by only 10 results instead of 20, meaning my requests end up overlapping.

I think I have found a solution to this. Looking in the package, the pagination.py file has a Pagination class which takes a page_size variable that changes the size of the offset between returned results.
The Pagination class is imported in the serp_api_client.py file within the pagination method starting on line 170, but there the page_size variable wasn't passed through. I just added page_size = 10 on lines 170 and 174, and now I can use the page_size variable if I call search.pagination(page_size = 20). Can this change be made in the code?
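With that patch applied, the call would look something like this (a sketch assuming pagination() forwards a page_size argument as described above; the query and key are placeholders):

import os
from serpapi import GoogleSearch

search = GoogleSearch({"q": "coca cola", "api_key": os.getenv("API_KEY")})
# request 20 results per page instead of the default 10
for page in search.pagination(page_size=20):
    print(page["serpapi_pagination"]["current"])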

KeyError when Calling Answer Box

I've attempted to get the results from the answer box using the documentation here.

I noticed the Playground does not return these results either.

Is there any way to get this URL also?

Output Returned when Attempting to Run the Sample Provided:

from serpapi import GoogleSearch

params = {
  "q": "What's the definition of transparent?",
  "hl": "en",
  "gl": "us",
  "api_key": ""
}

search = GoogleSearch(params)
results = search.get_dict()
answer_box = results['answer_box']
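A defensive variant of the last line (a sketch) that avoids the KeyError when the key is missing from the response:

answer_box = results.get('answer_box')
if answer_box is None:
    print("no answer_box returned for this query")
else:
    print(answer_box)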

Connection issue

Hi,
One of the users of my code gets the following error when creating a client.
image
I suppose it is machine settings related as it doesn't happen to other users.
Thanks for helping
P.S. I am fairly new to coding.

[Discuss] Wrapper longer response times caused by some overhead/additional processing

@jvmvik this issue is for discussion.

I'm not 100% sure what the cause is, but there might be some overhead or additional processing in the wrapper that causes longer response times. Or is it as it should be? Let me know if that's the case.

Results when making 50 requests:

  • Direct requests to serpapi.com/search.json: ~7.192448616027832 seconds
  • Requests to serpapi.com through the API wrapper: ~135.2969319820404 seconds
  • Async batch requests with Queue: ~24.80349826812744 seconds

Making a direct request to serpapi.com/search.json:

import aiohttp
import asyncio
import os
import json
import time

async def fetch_results(session, query):
    params = {
        'api_key': '...',
        'engine': 'youtube',
        'device': 'desktop',
        'search_query': query,
        'no_cache': 'true'
    }
    
    url = 'https://serpapi.com/search.json'
    async with session.get(url, params=params) as response:
        results = await response.json()

    data = []

    if 'error' in results:
        print(results['error'])
    else:
        for result in results.get('video_results', []):
            data.append({
                'title': result.get('title'),
                'link': result.get('link'),
                'channel': result.get('channel').get('name'),
            })

    return data

async def main():
    # 50 queries
    queries = [
        'burly',
        'creator',
        'doubtful',
        'chance',
        'capable',
        'window',
        'dynamic',
        'train',
        'worry',
        'useless',
        'steady',
        'thoughtful',
        'matter',
        'rotten',
        'overflow',
        'object',
        'far-flung',
        'gabby',
        'tiresome',
        'scatter',
        'exclusive',
        'wealth',
        'yummy',
        'play',
        'saw',
        'spiteful',
        'perform',
        'busy',
        'hypnotic',
        'sniff',
        'early',
        'mindless',
        'airplane',
        'distribution',
        'ahead',
        'good',
        'squeeze',
        'ship',
        'excuse',
        'chubby',
        'smiling',
        'wide',
        'structure',
        'wrap',
        'point',
        'file',
        'sack',
        'slope',
        'therapeutic',
        'disturbed'
    ]

    data = []

    async with aiohttp.ClientSession() as session:
        tasks = []
        for query in queries:
            task = asyncio.ensure_future(fetch_results(session, query))
            tasks.append(task)

        start_time = time.time()
        results = await asyncio.gather(*tasks)
        end_time = time.time()

        data = [item for sublist in results for item in sublist]

    print(json.dumps(data, indent=2, ensure_ascii=False))
    print(f'Script execution time: {end_time - start_time} seconds') # ~7.192448616027832 seconds

asyncio.run(main())

Same code but using the YoutubeSearch wrapper (not 100% sure if this is a valid comparison):

import aiohttp
import asyncio
from serpapi import YoutubeSearch
import os
import json
import time

async def fetch_results(session, query):
    params = {
        'api_key': '...',
        'engine': 'youtube',
        'device': 'desktop',
        'search_query': query,
        'no_cache': 'true'
    }
    search = YoutubeSearch(params)
    results = search.get_json()

    data = []

    if 'error' in results:
        print(results['error'])
    else:
        for result in results.get('video_results', []):
            data.append({
                'title': result.get('title'),
                'link': result.get('link'),
                'channel': result.get('channel').get('name'),
            })

    return data

async def main():
    queries = [
        'burly',
        'creator',
        'doubtful',
        'chance',
        'capable',
        'window',
        'dynamic',
        'train',
        'worry',
        'useless',
        'steady',
        'thoughtful',
        'matter',
        'rotten',
        'overflow',
        'object',
        'far-flung',
        'gabby',
        'tiresome',
        'scatter',
        'exclusive',
        'wealth',
        'yummy',
        'play',
        'saw',
        'spiteful',
        'perform',
        'busy',
        'hypnotic',
        'sniff',
        'early',
        'mindless',
        'airplane',
        'distribution',
        'ahead',
        'good',
        'squeeze',
        'ship',
        'excuse',
        'chubby',
        'smiling',
        'wide',
        'structure',
        'wrap',
        'point',
        'file',
        'sack',
        'slope',
        'therapeutic',
        'disturbed'
    ]

    data = []

    async with aiohttp.ClientSession() as session:
        tasks = []
        for query in queries:
            task = asyncio.ensure_future(fetch_results(session, query))
            tasks.append(task)
        
        start_time = time.time()
        results = await asyncio.gather(*tasks)
        end_time = time.time()

        data = [item for sublist in results for item in sublist]

    print(json.dumps(data, indent=2, ensure_ascii=False))
    print(f'Script execution time: {end_time - start_time} seconds') # ~135.2969319820404 seconds

asyncio.run(main())

Using async batch requests with Queue:

from serpapi import YoutubeSearch
from urllib.parse import (parse_qsl, urlsplit)
from queue import Queue
import os, re, json
import time

# 50 queries
queries = [
    'burly',
    'creator',
    'doubtful',
    'chance',
    'capable',
    'window',
    'dynamic',
    'train',
    'worry',
    'useless',
    'steady',
    'thoughtful',
    'matter',
    'rotten',
    'overflow',
    'object',
    'far-flung',
    'gabby',
    'tiresome',
    'scatter',
    'exclusive',
    'wealth',
    'yummy',
    'play',
    'saw',
    'spiteful',
    'perform',
    'busy',
    'hypnotic',
    'sniff',
    'early',
    'mindless',
    'airplane',
    'distribution',
    'ahead',
    'good',
    'squeeze',
    'ship',
    'excuse',
    'chubby',
    'smiling',
    'wide',
    'structure',
    'wrap',
    'point',
    'file',
    'sack',
    'slope',
    'therapeutic',
    'disturbed'
]

search_queue = Queue()

for query in queries:
    params = {
        'api_key': '...',                 
        'engine': 'youtube',              
        'device': 'desktop',              
        'search_query': query,          
        'async': True,                   
        'no_cache': 'true'
    }

    search = YoutubeSearch(params)       # where data extraction happens
    results = search.get_dict()          # JSON -> Python dict
    
    if 'error' in results:
        print(results['error'])
        break

    print(f"Add search to the queue with ID: {results['search_metadata']}")
    search_queue.put(results)

data = []

start_time = time.time()

while not search_queue.empty():
    result = search_queue.get()
    search_id = result['search_metadata']['id']

    print(f'Get search from archive: {search_id}')
    search_archived = search.get_search_archive(search_id)
    
    print(f"Search ID: {search_id}, Status: {search_archived['search_metadata']['status']}")

    if re.search(r'Cached|Success', search_archived['search_metadata']['status']):
        for video_result in search_archived.get('video_results', []):
            data.append({
                'title': video_result.get('title'),
                'link': video_result.get('link'),
                'channel': video_result.get('channel').get('name'),
            })
    else:
        print(f'Requeue search: {search_id}')
        search_queue.put(result)
        
print(json.dumps(data, indent=2))
print('All searches completed')

execution_time = time.time() - start_time
print(f'Script execution time: {execution_time} seconds') # ~24.80349826812744 seconds

how to resolve the Connection aborted error when calling the serpapi

Hi,
A new scraper here.
In my API call, I get the following error. Would you please let me know if I am doing anything wrong here? Thanks a lot.

https://serpapi.com/search
---------------------------------------------------------------------------
ConnectionResetError                      Traceback (most recent call last)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    676                 headers=headers,
--> 677                 chunked=chunked,
    678             )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    977         if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
--> 978             conn.connect()
    979 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
    370             server_hostname=server_hostname,
--> 371             ssl_context=context,
    372         )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data)
    385         if HAS_SNI and server_hostname is not None:
--> 386             return context.wrap_socket(sock, server_hostname=server_hostname)
    387 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    406                          server_hostname=server_hostname,
--> 407                          _context=self, _session=session)
    408 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in __init__(self, sock, keyfile, certfile, server_side, cert_reqs, ssl_version, ca_certs, do_handshake_on_connect, family, type, proto, fileno, suppress_ragged_eofs, npn_protocols, ciphers, server_hostname, _context, _session)
    816                         raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 817                     self.do_handshake()
    818 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self, block)
   1076                 self.settimeout(None)
-> 1077             self._sslobj.do_handshake()
   1078         finally:

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self)
    688         """Start the SSL/TLS handshake."""
--> 689         self._sslobj.do_handshake()
    690         if self.context.check_hostname:

ConnectionResetError: [Errno 104] Connection reset by peer

During handling of the above exception, another exception occurred:

ProtocolError                             Traceback (most recent call last)
/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    448                     retries=self.max_retries,
--> 449                     timeout=timeout
    450                 )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    726             retries = retries.increment(
--> 727                 method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
    728             )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/util/retry.py in increment(self, method, url, response, error, _pool, _stacktrace)
    409             if read is False or not self._is_method_retryable(method):
--> 410                 raise six.reraise(type(error), error, _stacktrace)
    411             elif read is not None:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/packages/six.py in reraise(tp, value, tb)
    733             if value.__traceback__ is not tb:
--> 734                 raise value.with_traceback(tb)
    735             raise value

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
    676                 headers=headers,
--> 677                 chunked=chunked,
    678             )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _make_request(self, conn, method, url, timeout, chunked, **httplib_request_kw)
    380         try:
--> 381             self._validate_conn(conn)
    382         except (SocketTimeout, BaseSSLError) as e:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connectionpool.py in _validate_conn(self, conn)
    977         if not getattr(conn, "sock", None):  # AppEngine might not have  `.sock`
--> 978             conn.connect()
    979 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/connection.py in connect(self)
    370             server_hostname=server_hostname,
--> 371             ssl_context=context,
    372         )

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/urllib3/util/ssl_.py in ssl_wrap_socket(sock, keyfile, certfile, cert_reqs, ca_certs, server_hostname, ssl_version, ciphers, ssl_context, ca_cert_dir, key_password, ca_cert_data)
    385         if HAS_SNI and server_hostname is not None:
--> 386             return context.wrap_socket(sock, server_hostname=server_hostname)
    387 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
    406                          server_hostname=server_hostname,
--> 407                          _context=self, _session=session)
    408 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in __init__(self, sock, keyfile, certfile, server_side, cert_reqs, ssl_version, ca_certs, do_handshake_on_connect, family, type, proto, fileno, suppress_ragged_eofs, npn_protocols, ciphers, server_hostname, _context, _session)
    816                         raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
--> 817                     self.do_handshake()
    818 

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self, block)
   1076                 self.settimeout(None)
-> 1077             self._sslobj.do_handshake()
   1078         finally:

/anaconda/envs/azureml_py36/lib/python3.6/ssl.py in do_handshake(self)
    688         """Start the SSL/TLS handshake."""
--> 689         self._sslobj.do_handshake()
    690         if self.context.check_hostname:

ProtocolError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

During handling of the above exception, another exception occurred:

ConnectionError                           Traceback (most recent call last)
<ipython-input-26-45ac328ca8f8> in <module>
      1 question = 'where to get best coffee'
----> 2 results = performSearch(question)

<ipython-input-25-5bc778bad4e2> in performSearch(question)
     12 
     13     search = GoogleSearch(params)
---> 14     results = search.get_dict()
     15     return results

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_dict(self)
    101             (alias for get_dictionary)
    102         """
--> 103         return self.get_dictionary()
    104 
    105     def get_object(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_dictionary(self)
     94             Dict with the formatted response content
     95         """
---> 96         return dict(self.get_json())
     97 
     98     def get_dict(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_json(self)
     81         """
     82         self.params_dict["output"] = "json"
---> 83         return json.loads(self.get_results())
     84 
     85     def get_raw_json(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_results(self, path)
     68             Response text field
     69         """
---> 70         return self.get_response(path).text
     71 
     72     def get_html(self):

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/serpapi/serp_api_client.py in get_response(self, path)
     57             url, parameter = self.construct_url(path)
     58             print(url)
---> 59             response = requests.get(url, parameter, timeout=self.timeout)
     60             return response
     61         except requests.HTTPError as e:

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/api.py in get(url, params, **kwargs)
     73     """
     74 
---> 75     return request('get', url, params=params, **kwargs)
     76 
     77 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/api.py in request(method, url, **kwargs)
     59     # cases, and look like a memory leak in others.
     60     with sessions.Session() as session:
---> 61         return session.request(method=method, url=url, **kwargs)
     62 
     63 

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/sessions.py in request(self, method, url, params, data, headers, cookies, files, auth, timeout, allow_redirects, proxies, hooks, stream, verify, cert, json)
    540         }
    541         send_kwargs.update(settings)
--> 542         resp = self.send(prep, **send_kwargs)
    543 
    544         return resp

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/sessions.py in send(self, request, **kwargs)
    653 
    654         # Send the request
--> 655         r = adapter.send(request, **kwargs)
    656 
    657         # Total elapsed time of the request (approximately)

/anaconda/envs/azureml_py36/lib/python3.6/site-packages/requests/adapters.py in send(self, request, stream, timeout, verify, cert, proxies)
    496 
    497         except (ProtocolError, socket.error) as err:
--> 498             raise ConnectionError(err, request=request)
    499 
    500         except MaxRetryError as e:

ConnectionError: ('Connection aborted.', ConnectionResetError(104, 'Connection reset by peer'))

Exception not handled on SerpApiClient.get_json

I am experiencing unexpected behaviors when getting thousands of queries. For some reason, sometimes the API returns an empty response. It happens at random (1 time out of 10000 perhaps).

When this situation happens, the method SerpApiClient.get_json does not handle the empty response. As a consequence, json.loads() raises an exception, causing a JSONDecodeError.

I attach an image to clarify the issue.

issue

It seems to be a problem with the API service. I am not sure whether the problem should be solved with exception handling, by handling code 204 (empty response), or whether there is a bug on the servers.

To reproduce the exception:

import json
json.loads('')

Do you recommend any guidelines to handle the problem in the meantime, while you review the issue in the source code?
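In the meantime, one possible client-side workaround is to catch the decode error and retry (a sketch only; get_dict_with_retry is a hypothetical helper, not part of the library):

import json
import time

from serpapi import GoogleSearch

def get_dict_with_retry(params, retries=3, delay=1):
    # retry the search if the backend occasionally returns an empty body
    search = GoogleSearch(params)
    for _ in range(retries):
        try:
            return search.get_dict()
        except json.JSONDecodeError:
            time.sleep(delay)
    raise RuntimeError("empty response from SerpApi after several retries")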

Thanks.

[Version] Update PyPi to include the most up-to-date version

Currently, PyPI allows users to easily install our library using pip. However, the library has been updated to no longer include the print method (screenshot below). The PyPI version still includes it, causing confusion among users. Some of them think that the printed URL should contain their search data and contact us about SerpApi not working, while others simply ask for it to be removed for clarity.

Current state:
image

The user confused about the data not being available in the printed link.
Another user confused about the data not being available in the printed link

The user asking to remove the print method for clarity (they installed it through PyPi)
Another user asking to remove the print method

Link for account details for PyPi

macOS installation issue

When installing the package via pip it fails.

Collecting google-search-results
  Using cached https://files.pythonhosted.org/packages/08/eb/38646304d98db83d85f57599d2ccc8caf325961e8792100a1014950197a6/google_search_results-1.5.2.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/private/var/folders/3m/91gj9l890y71886_7sfndl3r0000gn/T/pip-install-YVqFKL/google-search-results/setup.py", line 7, in <module>
        with open(path.join(here, 'SHORT_README.rst'), encoding='utf-8') as f:
      File "/usr/local/Cellar/python@2/2.7.15_3/Frameworks/Python.framework/Versions/2.7/lib/python2.7/codecs.py", line 898, in open
        file = __builtin__.open(filename, mode, buffering)
    IOError: [Errno 2] No such file or directory: '/private/var/folders/3m/91gj9l890y71886_7sfndl3r0000gn/T/pip-install-YVqFKL/google-search-results/SHORT_README.rst'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in /private/var/folders/3m/91gj9l890y71886_7sfndl3r0000gn/T/pip-install-YVqFKL/google-search-results/

Running macOS catalina and python 2.7

~ ❯❯❯ pip --version
pip 19.0.2 from /usr/local/lib/python2.7/site-packages/pip (python 2.7)
~ ❯❯❯ python --version
Python 2.7.15

{'error':'We couldn't find your API Key.'}

from serpapi.google_search_results import GoogleSearchResults

client = GoogleSearchResults({"q": "coffee", "serp_api_key": "************************"})

result = client.get_dict()

I tried giving my API key from serpstack, yet I am left with this error. Any help would be much appreciated.

You need a valid browser to continue exploring our API

This is the error message you get when you don't supply a private key. I think information on this site should be provided regarding:

  1. How to get an API key
  2. Is a key free or how much does it cost
  3. Are there limits to using the key (hits/hour or whatever)

The service provided by the repo is very valuable, but whether I can use it or not depends on the answers to these questions.

[Google Jobs API] Support for Pagination

Since Google Jobs does not return the serpapi_pagination key but expects the start param to paginate, this version of the library does not support pagination for Google Jobs. Pagination support needs to be added for Google Jobs.

# stop if backend miss to return serpapi_pagination
if not 'serpapi_pagination' in result:
  raise StopIteration

# stop if no next page
if not 'next' in result['serpapi_pagination']:
    raise StopIteration

image
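Until that lands, a manual workaround is to increment start yourself (a sketch; the google_jobs engine, the jobs_results key, the example query, and the page size of 10 are assumptions based on the Google Jobs API):

import os
from serpapi import GoogleSearch

search = GoogleSearch({
    "engine": "google_jobs",
    "q": "python developer",
    "api_key": os.getenv("API_KEY"),
})

start = 0
while True:
    search.params_dict["start"] = start
    results = search.get_dict()
    jobs = results.get("jobs_results", [])
    if not jobs:
        break  # no more pages
    for job in jobs:
        print(job.get("title"))
    start += 10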

Pagination iterator doesn't work for APIs with token-based pagination

For several APIs, parsing the serpapi_pagination.next is the only way to update params_dict with correct values. An increment of params.start won't work for Google Scholar Profiles, Google Maps, YouTube.

# increment page
self.start += self.page_size

Google Scholar Profiles

Google Scholar Profiles API have pagination.next_page_token instead of serpapi_pagination.next.

pagination.next is a next page URI like https://serpapi.com/search.json?after_author=0QICAGE___8J&engine=google_scholar_profiles&hl=en&mauthors=label%3Asecurity where after_author is set to next_page_token.

Google Maps

In Google Maps Local Results API there's only serpapi_pagination.next with a URI like https://serpapi.com/search.json?engine=google_maps&ll=%4040.7455096%2C-74.0083012%2C14z&q=Coffee&start=20&type=search

YouTube

In YouTube Search Engine Results API there's serpapi_pagination.next_page_token similar to Google Scholar Profiles. serpapi_pagination.next is a URI with sp parameter set to next_page_token.

@jvmvik What do you think about parsing serpapi_pagination.next in Pagination#__next__?

- self.start += self.page_size
+ self.client.params_dict.update(dict(parse.parse_qsl(parse.urlsplit(result['serpapi_pagination']['next']).query)))

Here's an example of endless pagination of Google Scholar Authors (scraped 190 pages and manually stopped).
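A minimal sketch of that approach outside the Pagination class, following the next-page URI directly (the google_scholar_profiles parameters and the profiles key are taken from the example above and are assumptions for illustration):

import os
from urllib.parse import parse_qsl, urlsplit

from serpapi import GoogleSearch

search = GoogleSearch({
    "engine": "google_scholar_profiles",
    "mauthors": "label:security",
    "api_key": os.getenv("API_KEY"),
})

while True:
    results = search.get_dict()
    for profile in results.get("profiles", []):
        print(profile.get("name"))

    # Google Scholar Profiles exposes the next page under pagination.next
    next_url = results.get("pagination", {}).get("next")
    if not next_url:
        break
    # carry the next-page query parameters (e.g. after_author) into the next request
    search.params_dict.update(dict(parse_qsl(urlsplit(next_url).query)))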

Provide a more convenient way to paginate via the Python package

Currently, the way to paginate searches is to get the serpapi_pagination.current and increase the offset or start parameters in the loop. Like with regular HTTP requests to serpapi.com/search without an API wrapper.

import os
from serpapi import GoogleSearch

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
results = search.get_dict()

print(f"Current page: {results['serpapi_pagination']['current']}")

for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

while 'next' in results['serpapi_pagination']:
    search.params_dict[
        "start"] = results['serpapi_pagination']['current'] * 10
    results = search.get_dict()

    print(f"Current page: {results['serpapi_pagination']['current']}")

    for news_result in results["news_results"]:
        print(
            f"Title: {news_result['title']}\nLink: {news_result['link']}\n"
        )

A more convenient way for an official API wrapper would be to provide some function like search.paginate(callback: Callable) which will properly calculate offset for the specific search engine and loop through pages until the end.

import os
from serpapi import GoogleSearch

def print_results(results):
  print(f"Current page: {results['serpapi_pagination']['current']}")

  for news_result in results["news_results"]:
    print(f"Title: {news_result['title']}\nLink: {news_result['link']}\n")

params = {
    "engine": "google",
    "q": "coffee",
    "tbm": "nws",
    "api_key": os.getenv("API_KEY"),
}

search = GoogleSearch(params)
search.paginate(print_results)

@jvmvik @hartator What do you think?

get_html() Returns JSON Instead of HTML

A customer reported the get_html() method for this library returns a JSON response instead of the expected HTML.

I may be misunderstanding something about what the get_html method is intended to do, but I checked this locally and the customer's report appears to be correct:

Screenshot 2024-01-05 at 9 34 23 AM Screenshot 2024-01-05 at 9 42 56 AM

SSLCertVerificationError [SSL: CERTIFICATE_VERIFY_FAILED] error

A user reported receiving this error:

SSLCertVerificationError Traceback (most recent call last)
/opt/anaconda3/lib/python3.8/site-packages/urllib3/connectionpool.py in urlopen(self, method, url, body, headers, retries, redirect, assert_same_host, timeout, pool_timeout, release_conn, chunked, body_pos, **response_kw)
698 # Make the request on the httplib connection object.
--> 699 httplib_response = self._make_request(
700 conn,

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: unable to get local issuer certificate (_ssl.c:1125)

The solution for them was to turn off the VPN.

How to get "related articles" links from google scholar via serpapi?

I am using SerpApi to fetch Google Scholar papers. There is always a link called "related articles" under each article, but SerpApi doesn't seem to provide any SERP URL to fetch the data behind those links.

Screenshot 2022-07-14 at 3 05 07 AM

Serp API result :

Screenshot 2022-07-14 at 3 15 16 AM

Can I directly call this URL https://scholar.google.com/scholar?q=related:gemrYG-1WnEJ:scholar.google.com/&scioq=Multi-label+text+classification+with+latent+word-wise+label+information&hl=en&as_sdt=0,21 using serp API?

google scholar pagination not returning final results page

I am using the paginate method with the google scholar engine to return all results for a search term. When I use a for loop to iterate the pagination and put the results in a list, it doesn't return the final page of results, instead stopping at the penultimate page (code snippet and terminal output below).

import serpapi
import os
from loguru import logger
from dotenv import load_dotenv

load_dotenv()

search_string = '"Singer Instruments" PhenoBooth'

# Pagination allows iterating through all pages of results
logger.info("Initialising search through serpapi")
search = serpapi.GoogleSearch(
    {
        "engine": "google_scholar",
        "q": search_string,
        "api_key": os.getenv("SERPAPI_KEY"),
        "as_ylo": 1900,
    }
)
pages = search.pagination(start=0, page_size=20)

# get dict for each page of results and store in list
results_list = []
page_number = 1
for page in pages:
    logger.info(f"Retrieving results page {page_number}")
    results_list.append(page)
    page_number += 1

gscholar_results = results_list[0]["search_information"]["total_results"]
print(f"results reported by google scholar: {gscholar_results}")

paper_count = 0
for page in results_list:
    for paper in page["organic_results"]:
        paper_count += 1

print(f"number of papers in results: {paper_count}")

(Terminal output screenshot omitted.)

If I check my searches on serpapi.com, results are being generated for all pages (see the screenshot below). So the problem is not that the results aren't generated; they're just not coming out of the pagination iterator for some reason.

(Screenshot of the searches on serpapi.com omitted.)
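As a workaround (a sketch, not the library's own fix), the pages can be walked manually by incrementing start until a page comes back without organic_results:

import os
from serpapi import GoogleSearch

# Manual pagination sketch: keep requesting pages until one returns no organic results.
params = {
    "engine": "google_scholar",
    "q": '"Singer Instruments" PhenoBooth',
    "api_key": os.getenv("SERPAPI_KEY"),
    "as_ylo": 1900,
    "num": 20,
    "start": 0,
}

all_papers = []
while True:
    page = GoogleSearch(params).get_dict()
    papers = page.get("organic_results", [])
    if not papers:
        break
    all_papers.extend(papers)
    params["start"] += len(papers)

print(f"number of papers in results: {len(all_papers)}")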

Different results from serpapi (Google Trends) versus Google Trends site

I'm having an issue (which is causing a serious headache and a project problem for us) where the results from SerpApi differ from those returned by Google Trends when querying the website directly.

Below is a simple code snippet to reproduce:

from serpapi import GoogleSearch
import pandas as pd

PARAMS = {'engine': 'google_trends',
          'data_type' : 'RELATED_QUERIES',
          'q' : "health insurance",
          'geo': "IE",
          'date' : "2022-01-01 2022-12-31",
          'hl' : 'en-GB',
          'csv' : True,
          'api_key' : '[Key]'}

search = GoogleSearch(PARAMS) 
results = search.get_dict() 

rel = results['related_queries']['top']
df = pd.DataFrame(rel)

df[["query", "value"]]

df.to_csv("serpapi_results.csv")

Below is an image with the difference in results (screenshot omitted), along with a screenshot of the same query on the Google Trends site (also omitted). The attached files are:

serpapi_results.csv
relatedQueries_google_results.csv

Can you let me know if this is a known issue or if I've made some mistake in my API call?
Thanks,
Ronan

[Pagination] Pagination isn't correct and it skips index by one

(Screenshot omitted.)

Since the start value starts from 0, the correct second page should be 10 and not 11.

This behaviour is also causing pages to be skipped, and customers are getting confusing results (screenshot omitted).

Intercom Link
First recognized by @marm123.

I think this part of the pagination code (screenshot of the relevant lines omitted) needs to be replaced by:

self.client.params_dict['start'] += 0

I don't know whether it would cause any errors on other engines, but it may also fix the issue for every other engine.
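For illustration, the offsets a zero-based start should produce with 10 results per page:

page_size = 10
# Zero-based offsets for consecutive pages: the second page starts at 10, not 11.
offsets = [page * page_size for page in range(5)]
print(offsets)  # [0, 10, 20, 30, 40]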

Python package should not include tests

When installing via pip, the installation includes the tests directory:

mkdir deps
python3 -m pip install --target deps "google-search-results==2.4.2"
ls -1 deps/tests

Outputs:

__init__.py
__pycache__
test_account_api.py
(etc)

Tests should be excluded.
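A common way to do this (a sketch, assuming the project packages with setuptools' find_packages in setup.py) is to exclude the tests package from the built distribution:

# setup.py (sketch, not the project's actual file)
from setuptools import setup, find_packages

setup(
    name="google-search-results",
    packages=find_packages(exclude=["tests", "tests.*"]),
    # ... remaining metadata unchanged
)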

ImportError: cannot import name 'GoogleSearch' from 'serpapi'

After creating a subscriber account on serpapi.com, I was given an API key. I installed the package with pip install google-search-results, but whenever I try to run my Django app I get this error:

File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/threading.py", line 932, in _bootstrap_inner
self.run()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/threading.py", line 870, in run
self._target(*self._args, **self._kwargs)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/utils/autoreload.py", line 64, in wrapper
fn(*args, **kwargs)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/core/management/commands/runserver.py", line 125, in inner_run
autoreload.raise_last_exception()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/utils/autoreload.py", line 87, in raise_last_exception
raise _exception[1]
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/core/management/init.py", line 394, in execute
autoreload.check_errors(django.setup)()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/utils/autoreload.py", line 64, in wrapper
fn(*args, **kwargs)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/init.py", line 24, in setup
apps.populate(settings.INSTALLED_APPS)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/apps/registry.py", line 116, in populate
app_config.import_models()
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/django/apps/config.py", line 269, in import_models
self.models_module = import_module(models_module_name)
File "/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/importlib/init.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "", line 1014, in _gcd_import
File "", line 991, in _find_and_load
File "", line 975, in _find_and_load_unlocked
File "", line 671, in _load_unlocked
File "", line 843, in exec_module
File "", line 219, in _call_with_frames_removed
File "/Users/MyProjects/topsearch/topsearch/searchapp/models.py", line 3, in
from serpapi import GoogleSearch
ImportError: cannot import name 'GoogleSearch' from 'serpapi' (/Users/nazibabdullah/opt/miniconda3/envs/topsearch/lib/python3.8/site-packages/serpapi/init.py)
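When GoogleSearch is missing from serpapi, it is worth checking which module Python is actually importing (for example, a local file named serpapi.py or a different "serpapi" distribution can shadow this package); a diagnostic sketch:

# Diagnostic sketch: confirm the import resolves to the installed package,
# not to a local serpapi.py or to a different "serpapi" distribution.
import serpapi

print(serpapi.__file__)  # should point into site-packages/serpapi/
print(hasattr(serpapi, "GoogleSearch"))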

Python 3.8+, Fatal Python error: Segmentation fault when calling requests.get(URL, params) with docker python-3.8.2-slim-buster/openssl 1.1.1d and python-3.9.10-slim-buster/openssl 1.1.1d

Here's the trace:

Python 3.9.10 (main, Mar  1 2022, 21:02:54) 
[GCC 8.3.0] on linux
Type "help", "copyright", "credits" or "license" for more information.
>>> import requests
>>> requests.get('https://serpapi.com', {"api_key":VALID_API_KEY, "engine": "google_jobs", "q": "Barista"})
Fatal Python error: Segmentation fault

Current thread 0x0000ffff8e999010 (most recent call first):
  File "/usr/local/lib/python3.9/ssl.py", line 1173 in send
  File "/usr/local/lib/python3.9/ssl.py", line 1204 in sendall
  File "/usr/local/lib/python3.9/http/client.py", line 1001 in send
  File "/usr/local/lib/python3.9/http/client.py", line 1040 in _send_output
  File "/usr/local/lib/python3.9/http/client.py", line 1280 in endheaders
  File "/usr/local/lib/python3.9/site-packages/urllib3/connection.py", line 395 in request
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 496 in _make_request
  File "/usr/local/lib/python3.9/site-packages/urllib3/connectionpool.py", line 790 in urlopen
  File "/usr/local/lib/python3.9/site-packages/requests/adapters.py", line 486 in send
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 703 in send
  File "/usr/local/lib/python3.9/site-packages/requests/sessions.py", line 589 in request
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 59 in request
  File "/usr/local/lib/python3.9/site-packages/requests/api.py", line 73 in get
  File "<stdin>", line 1 in <module>
Segmentation fault

This is not specific to one engine; it also happens with google_images if I swap the engine.

Dockerfile:

FROM python:3.9.10-slim-buster

ENV PYTHONUNBUFFERED 1
ENV PYTHONDONTWRITEBYTECODE 1

# OLD: RUN apt-get update && apt-get upgrade -y && apt-get install gcc -y && apt-get install apt-utils -y

# Install build-essential for celery worker otherwise it says gcc not found
RUN apt-get update \
  # dependencies for building Python packages
  && apt-get install -y build-essential \
  # psycopg2 dependencies
  && apt-get install -y libpq-dev \
  # Additional dependencies
  && apt-get install -y telnet netcat \
  # cleaning up unused files
  && apt-get purge -y --auto-remove -o APT::AutoRemove::RecommendsImportant=false \
  && rm -rf /var/lib/apt/lists/*

WORKDIR /app

COPY requirements.txt ./
RUN pip install --no-cache-dir -r requirements.txt

COPY ./compose/local/flask/start /start
RUN sed -i 's/\r$//g' /start
RUN chmod +x /start

# COPY ./compose/local/flask/celery/worker/start /start-celeryworker
# RUN sed -i 's/\r$//g' /start-celeryworker
# RUN chmod +x /start-celeryworker

# COPY ./compose/local/flask/celery/beat/start /start-celerybeat
# RUN sed -i 's/\r$//g' /start-celerybeat
# RUN chmod +x /start-celerybeat

# COPY ./compose/local/flask/celery/flower/start /start-flower
# RUN sed -i 's/\r$//g' /start-flower
# RUN chmod +x /start-flower

COPY . .

# COPY entrypoint.sh /usr/local/bin/
# ENTRYPOINT ["entrypoint.sh"]

docker-compose.yml:

version: "3.9"

services:
  flask_app:
    restart: always
    container_name: flask_app
    image: meder/flask_live_app:1.0.0
    command: /start
    build: .
    ports:
      - "4000:4000"
    volumes:
      - .:/app
    env_file:
      - local.env
    environment:
      - FLASK_ENV=development
      - FLASK_APP=app.py
    depends_on:
      - db
  db:
    container_name: flask_db
    image: postgres:16.1-alpine
    ports:
      - "5432:5432"
    environment:
      - POSTGRES_USER=USER
      - POSTGRES_PASSWORD=PW
      - POSTGRES_DB=DB
    volumes: 
      - postgres_data:/var/lib/postgresql/data
  redis:
    container_name: redis
    image: redis:7.2-alpine
    ports:
      - "6379:6379"
volumes:
  postgres_data: {}

And requirements.txt (I didn't update it after switching from the original 3.8.2 to 3.9.10):

flask==3.0.0
psycopg2-binary==2.9.9
google-search-results==2.4.2

The above trace came from exec-ing into my Docker container and running requests.get after importing it, like so:

docker exec -it flask_app bash

The host machine runs this fine, but it uses LibreSSL 2.8.3 / Python 3.8.16. Based on other tickets/issues here, it seems there may be something on the SSL side of the backend that triggers this. I would appreciate some insight.

Someone ran into this on Stack Overflow and the selected answer was to set a timeout: https://stackoverflow.com/questions/74774784/cheerypy-server-is-timing-out. There's no guarantee this is the same issue; it's just a reference.
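For reference, the same request with an explicit timeout, as the linked answer suggests (a sketch; the API_KEY environment variable is an assumption, and there's no guarantee this avoids the segfault):

import os
import requests

# Same request as in the trace above, but with an explicit timeout per the linked SO answer.
response = requests.get(
    "https://serpapi.com",
    params={"engine": "google_jobs", "q": "Barista", "api_key": os.getenv("API_KEY")},
    timeout=30,
)
print(response.status_code)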
