gscwrapper aims to give SEO professionals working with Python a solid basis for working with Google Search Console's APIs.
It provides an easy way to query and work with data from the following endpoints:
For now, this package is only designed to work with the webmasters.readonly scope from the API.
There are countless GSC libraries available. My favorite (and the one I've been using for years) is available here. That being said, these libraries:
- Are often limited to downloading data and don't offer methods to run common SEO analyses. I would often end up copying my code between notebooks, and I needed a library to centralize the operations I perform most often.
- Are sometimes maintained by non-SEOs and therefore aren't always up to date, especially after an API update. Python is used by many SEO professionals, and yet we often rely on non-SEOs to maintain the libraries we use as an industry.
I've decided to create my own based on my most common needs as an SEO professional. It has also been a fun project to work on :)
DISCLAIMER: this library is not meant to make decisions for you; it just speeds up some repetitive data manipulation tasks we often do. I strongly advise you to read and understand the code behind a method if you intend to make decisions based solely on its output. In most cases, the only library used under the hood is Pandas.
I'm more than happy to receive suggestions or solve issues through GitHub. Nevertheless:
- The code is extensively commented to make it readable for everyone, even if you don't master Python. If you have a question on how a method works under the hood, please have a look at the code first.
- I'm not a developer and this is, by far, the most complex project I've had to work on by myself. I try to stick to concepts I understand, and I won't update my code just because I'm not following a best practice here and there.
- I do it for free, and hence I have to prioritize my (paid) work and my personal life over this library.
First, install the package using:
pip3 install git+https://github.com/antoineeripret/gsc_wrapper
Then, follow these steps:
- Create a new project in the Google Developers Console.
- Enable the Google Search Console API under "APIs & Services".
- Create an "OAuth consent screen" (External). The app name doesn't really matter.
- Add the webmasters.readonly scope.
- Add your e-mail(s) to the test users.
- Next, create credentials under "Credentials" (OAuth client ID).
- Choose "Desktop app" for the "Application type". Again, the name doesn't matter.
- Download the JSON file and save it in your working directory.
If you want more detail about this process, have a look at this video (from 12:00 to 27:00).
After that, executing your first query is as easy as using the following code snippet:
import gscwrapper
#authenticate
account = gscwrapper.generate_auth(
'config/client_secret_mvp.json',
serialize='config/credentials.json'
)
#we choose the website we want
webproperty = account['https://www.exemple.com/']
report = (
webproperty
#we call the query method
.query
#we define the dates
.range(start="2023-01-01", stop="2023-02-01")
#we define the dimensions
.dimensions(['page'])
#we get the data
.get()
)
The above example will use your client configuration file to interactively generate your credentials. You'll then be able to call any available method on the returned object containing your GSC data.
If you're unsure what webproperties are linked to your account, you can run the following code, which will return a DataFrame with your webproperties and their permission levels.
account.list_webproperties()
The first time you run this code, you'll be asked to visit a URL:
- Copy and paste it in your browser
- Select the e-mail address you added as a test user in a previous step
- Click on "Continue"
- Click on "Continue" again
- Copy the authorization code and paste it into the input box that appears in the terminal / notebook where you run the code
If you prefer to use a service account key, the process is easier and you just have to run the following code. No need to save the credentials in that case.
account = (
gscwrapper
.generate_auth(
client_config="service_account.json",
service_account_auth=True
)
)
If you wish to save your credentials, to avoid going through the OAuth consent screen in the future, you can specify a path to save them with serialize='path/to/credentials.json'.
When you want to authenticate a new account, you run:
account = gscwrapper.generate_auth(
'config/client_secret_mvp.json',
serialize='config/credentials.json'
)
This will save your credentials to a file called credentials.json.
From then on, you can authenticate with:
account = gscwrapper.generate_auth(
'config/client_secret_mvp.json',
credentials='config/credentials.json'
)
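The two calls above differ only in the keyword argument used for the credentials file. As a minimal sketch (auth_kwargs is a hypothetical helper, not part of gscwrapper), you could pick the right argument depending on whether the serialized file already exists:

```python
from pathlib import Path


def auth_kwargs(credentials_path):
    """Choose the keyword argument for generate_auth() (hypothetical helper)."""
    if Path(credentials_path).is_file():
        # Credentials were serialized during a previous run: reuse them.
        return {"credentials": credentials_path}
    # First run: go through the consent screen and save the credentials.
    return {"serialize": credentials_path}


kwargs = auth_kwargs("config/credentials.json")
# Assumed usage, relying only on the arguments shown in this README:
# account = gscwrapper.generate_auth("config/client_secret_mvp.json", **kwargs)
```

This keeps a notebook re-runnable without editing the authentication cell after the first run.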
To query data from the search analytics, you can filter the data you retrieve based on:
- range: Unlike Josh's library, you need to explicitly define the dates (YYYY-MM-DD format).
report = (
webproperty
.query
.range(start="2023-01-01", stop="2023-02-01")
.dimensions(["date"])
.get()
)
- dimensions: dimensions need to be passed as a list. You cannot get data without specifying at least one dimension.
report = (
webproperty
.query
.range(start="2023-01-01", stop="2023-02-01")
.dimensions(["date"])
.get()
)
Please be aware that you may be affected by data sampling depending on the number of dimensions you request. I strongly advise you to include only the dimensions you need; otherwise, the data extraction may take longer than necessary.
- filter: you can decide to analyse just a part of your website. You can filter using any dimension or operator included below:
DIMENSIONS = ['country','device','page','query','searchAppearance','date']
OPERATORS = ['equals','notEquals','contains','notContains','includingRegex','excludingRegex']
If you use a regular expression, I strongly advise you to test it in the GSC UI first, because some characters are not supported.
IMPORTANT: you can filter on a dimension that is not included in your report.
report = (
webproperty
.query
.range(start="2023-01-01", stop="2023-02-01")
.filter("page", "blog", "contains")
.dimensions(["date"])
.get()
)
- search_type: we often use the web search type, but others are available.
report = (
webproperty
.query
.range(start="2023-01-01", stop="2023-02-01")
.filter("page", "blog", "contains")
.dimensions(["date"])
.search_type('discover')
.get()
)
- limit: by default, the library will query the API and try to retrieve all the data available. You can decide to retrieve a specific number of results using this method.
report = (
webproperty
.query
.range(start="2023-01-01", stop="2023-02-01")
.filter("page", "blog", "contains")
.dimensions(["date"])
.limit(50)
.get()
)
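To make the chained calls above less of a black box, here is a rough sketch of the request body that the Search Analytics API expects. The field names (startDate, endDate, dimensions, type, dimensionFilterGroups, rowLimit, startRow) come from Google's API. The function build_request_body and the way it maps arguments are my own illustration and an assumption, not the library's actual code.

```python
def build_request_body(start, stop, dimensions, filters=(), search_type="web",
                       row_limit=25000, start_row=0):
    """Build a Search Analytics query body (illustrative helper)."""
    body = {
        "startDate": start,
        "endDate": stop,
        "dimensions": list(dimensions),
        "type": search_type,
        "rowLimit": row_limit,   # the API returns at most 25,000 rows per call
        "startRow": start_row,   # offset used to page through larger results
    }
    if filters:
        # Same argument order as .filter() above: dimension, expression, operator.
        body["dimensionFilterGroups"] = [{
            "filters": [
                {"dimension": dimension, "operator": operator, "expression": expression}
                for dimension, expression, operator in filters
            ]
        }]
    return body


body = build_request_body(
    "2023-01-01", "2023-02-01", ["date"],
    filters=[("page", "blog", "contains")],
    search_type="discover",
    row_limit=50,
)
```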
When you run any of these code snippets, you'll generate a Report object.
A Report object contains all the data you downloaded from GSC. I developed a number of methods you can freely use based on your needs. Depending on the method you use, you'll need some dimensions in your report. For instance, you cannot call the ctr_yield_curve() method without the date and the query dimensions.
If you are not using the API (working with the bulk export for instance), you can load a Report object from any DataFrame using the following code:
(
gscwrapper
.query
.Report(
#the dataframe where you have your data
x,
#the name of the webproperty
"https://www.website.com/",
#min date
"2023-01-01",
#max date
"2023-01-30"
)
)
You'll then be able to use any of the available methods.
Available methods:
- show_data()
- ctr_yield_curve()
- active_pages()
- cannibalization()
- forecast()
- brand_vs_no_brand()
- keyword_gap()
- causal_impact()
- update_urls()
- extract_search_volume()
- find_potential_contents_to_kill()
- find_content_decay()
- winners_losers()
- find_long_tail_keywords()
- find_ctr_outliers()
- abcd()
- pages_lifespan()
- seasonality_per_day()
- replace_query_from_list()
Required dimensions | Required metrics | Output |
---|---|---|
None | None | pd.DataFrame |
This method is pretty straightforward. It returns your data as a Pandas DataFrame. It is useful to check what the API has returned and to perform ad-hoc analyses that are not covered by the other methods of this library.
Required dimensions | Required metrics | Output |
---|---|---|
query / date | clicks / impressions / position | pd.DataFrame |
You can call this method to build a CTR yield curve with your data.
(
report
.ctr_yield_curve()
)
For example, you could get this output.
position | ctr | clicks | impressions | kw_count |
---|---|---|---|---|
1.0 | 19.02 | 807 | 4242 | 1516 |
2.0 | 11.66 | 476 | 4084 | 1026 |
3.0 | 15.1 | 340 | 2252 | 637 |
4.0 | 12.2 | 268 | 2196 | 730 |
5.0 | 7.16 | 126 | 1761 | 551 |
6.0 | 5.49 | 195 | 3552 | 909 |
7.0 | 3.66 | 107 | 2921 | 696 |
8.0 | 3.27 | 102 | 3124 | 930 |
9.0 | 2.17 | 67 | 3087 | 897 |
10.0 | 1.51 | 73 | 4841 | 1250 |
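The figures in the table are consistent with a CTR computed as total clicks divided by total impressions for each rounded position (807 / 4242 gives 19.02%). Below is a plain-Python sketch of that aggregation. It is an illustration under that assumption, not the library's Pandas implementation, and here kw_count simply counts the rows that fall into each position.

```python
from collections import defaultdict


def ctr_yield_curve(rows, max_position=10):
    """Aggregate clicks and impressions per rounded position."""
    totals = defaultdict(lambda: {"clicks": 0, "impressions": 0, "kw_count": 0})
    for row in rows:
        position = round(row["position"])
        if not 1 <= position <= max_position:
            continue
        bucket = totals[position]
        bucket["clicks"] += row["clicks"]
        bucket["impressions"] += row["impressions"]
        bucket["kw_count"] += 1  # number of rows seen at this position

    curve = []
    for position in sorted(totals):
        bucket = totals[position]
        if bucket["impressions"] == 0:
            continue
        ctr = round(100 * bucket["clicks"] / bucket["impressions"], 2)
        curve.append({"position": position, "ctr": ctr, **bucket})
    return curve


sample_rows = [
    {"query": "a", "position": 1.2, "clicks": 10, "impressions": 50},
    {"query": "b", "position": 1.4, "clicks": 5, "impressions": 50},
    {"query": "c", "position": 2.8, "clicks": 3, "impressions": 60},
]
curve = ctr_yield_curve(sample_rows)
```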
Required dimensions | Required metrics | Output |
---|---|---|
date | None | pd.DataFrame |
We often need to compare weekly or monthly data, when GSC only provides daily data. Using this method, you can resample your data.
Accepted periods are the following:
- D: day
- W: week
- M: month
- Q: quarter
- Y: year
report.group_data_by_period('W')
This will output a table where the main metrics (clicks & impressions) are grouped by the new period chosen.
date | clicks | impressions |
---|---|---|
2023-01-01 | 206 | 8109 |
2023-01-08 | 1871 | 69275 |
2023-01-15 | 1998 | 67207 |
2023-01-22 | 1706 | 60436 |
2023-01-29 | 1980 | 67552 |
2023-02-05 | 935 | 29155 |
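For readers who want to see what a weekly regrouping involves, here is a small standard-library sketch. It assumes each week is labelled by its closing Sunday, which matches the dates in the table above. The function name group_by_week is hypothetical and the library itself relies on Pandas.

```python
from collections import defaultdict
from datetime import date, timedelta


def group_by_week(rows):
    """Sum clicks and impressions per week, labelled by the week's Sunday."""
    totals = defaultdict(lambda: {"clicks": 0, "impressions": 0})
    for row in rows:
        day = date.fromisoformat(row["date"])
        # weekday() is 0 for Monday and 6 for Sunday.
        week_end = day + timedelta(days=6 - day.weekday())
        totals[week_end.isoformat()]["clicks"] += row["clicks"]
        totals[week_end.isoformat()]["impressions"] += row["impressions"]
    return [{"date": week, **metrics} for week, metrics in sorted(totals.items())]


weekly = group_by_week([
    {"date": "2023-01-01", "clicks": 5, "impressions": 100},
    {"date": "2023-01-02", "clicks": 7, "impressions": 120},
    {"date": "2023-01-08", "clicks": 3, "impressions": 80},
])
```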
Required dimensions | Required metrics | Output |
---|---|---|
page | clicks / impressions | pd.DataFrame |
We sometimes need to know the percentage of pages that are active from a list of URLs. This method allows you to compare the pages you have in your GSC report to a list of URLs.
You can either:
- pass a manual list of URLs
- provide a sitemap or sitemap index URL and the library will download all the URLs included there.
(
report
.active_pages(
sitemap_url = "https://www.website.com/sitemap_index.xml"
)
)
The outcome: a table telling you which pages from your list are active (based on impressions or clicks).
page | clicks | impressions | active_impression | active_clicks |
---|---|---|---|---|
https://www.website.com/blog/content_1 | 1.0 | 21.0 | True | True |
https://www.website.com/blog/content_2 | 0.0 | 78.0 | True | False |
https://www.website.com/blog/content_3 | 1.0 | 161.0 | True | True |
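The comparison itself is simple. A minimal sketch, assuming a page counts as active as soon as it has at least one impression or one click (my reading of the output above, not a confirmed rule):

```python
def active_pages(urls, report_rows):
    """Flag which URLs from a list appear in the report with traffic."""
    metrics = {row["page"]: row for row in report_rows}
    result = []
    for url in urls:
        row = metrics.get(url, {})
        clicks = row.get("clicks", 0)
        impressions = row.get("impressions", 0)
        result.append({
            "page": url,
            "clicks": clicks,
            "impressions": impressions,
            "active_impression": impressions > 0,
            "active_clicks": clicks > 0,
        })
    return result


pages = active_pages(
    ["https://www.website.com/a", "https://www.website.com/b"],
    [{"page": "https://www.website.com/a", "clicks": 0, "impressions": 78}],
)
share_active = sum(p["active_impression"] for p in pages) / len(pages)
```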
Required dimensions | Required metrics | Output |
---|---|---|
query / page | clicks / impressions | pd.DataFrame |
GSC can be a fabulous tool to find cannibalization at scale. The trick is to remove false positives and cases of good cannibalization. This method can be used to find queries where:
- we have more than one page ranking for a specific query
- we have more than one page representing at least 10% of the clicks for a given query
- this query represents at least 10% of the total clicks of the page
This definition is subjective and you are free to have a look at the source code to create your own based on your projects.
(
report
.cannibalization(brand_variants=['brand','mybrand'])
)
IMPORTANT: you need to provide the common branding structures as a list to remove these cases from the cannibalization analysis. Otherwise, you would end up with a lot of false positives on your branded terms.
This method would return the following table, with a selection of the pages that seem to suffer from cannibalization. You can then define what you want to do with them based on the SEO context.
page | query | clicks_query | impressions | click_pct | clicks_page | click_pct_page |
---|---|---|---|---|---|---|
https://www.website.com/blog/content_1 | xxxxx | 11 | 385 | 57.89 | 44 | 25.00 |
https://www.website.com/blog/content_2 | xxxxx | 8 | 191 | 42.10 | 23 | 34.78 |
https://www.website.com/blog/content_1 | yyyyy | 15 | 387 | 62.50 | 44 | 34.09 |
https://www.website.com/blog/content_2 | yyyyy | 9 | 148 | 37.50 | 23 | 39.13 |
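If you want to adapt the definition to your own projects, the three rules can be sketched in plain Python as follows. This is an illustration of the logic described above, not the library's code. In particular, computing page totals over all queries (branded ones included) is an assumption on my side.

```python
from collections import defaultdict


def find_cannibalization(rows, brand_variants=(), min_share=0.10):
    """Flag queries where several pages each take a meaningful share of clicks."""
    clicks_per_query = defaultdict(int)
    clicks_per_page = defaultdict(int)
    for row in rows:
        clicks_per_query[row["query"]] += row["clicks"]
        clicks_per_page[row["page"]] += row["clicks"]

    candidates = defaultdict(list)
    for row in rows:
        query, page, clicks = row["query"], row["page"], row["clicks"]
        if clicks == 0 or any(brand in query.lower() for brand in brand_variants):
            continue  # nothing to share, or a branded query
        share_of_query = clicks / clicks_per_query[query]
        share_of_page = clicks / clicks_per_page[page]
        if share_of_query >= min_share and share_of_page >= min_share:
            candidates[query].append({
                "page": page,
                "query": query,
                "click_pct": round(100 * share_of_query, 2),
                "click_pct_page": round(100 * share_of_page, 2),
            })
    # Keep only queries where more than one page passed both thresholds.
    return [hit for query in sorted(candidates) if len(candidates[query]) > 1
            for hit in candidates[query]]


hits = find_cannibalization(
    [
        {"page": "p1", "query": "blue shoes", "clicks": 11},
        {"page": "p2", "query": "blue shoes", "clicks": 8},
        {"page": "p1", "query": "red shoes", "clicks": 33},
        {"page": "p2", "query": "red shoes", "clicks": 1},
        {"page": "p1", "query": "mybrand shoes", "clicks": 5},
        {"page": "p2", "query": "mybrand shoes", "clicks": 5},
    ],
    brand_variants=["brand", "mybrand"],
)
```

In this sample, "red shoes" is not flagged because the second page only gets about 3% of its clicks, and the branded query is ignored entirely.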
Required dimensions | Required metrics | Output |
---|---|---|
date | clicks | pd.DataFrame |
This method is used to forecast traffic using Prophet. Using the data available in our Report object, we can use this external library to forecast future clicks.
The function is simple to use, but please note that I do not advise creating a forecast if you don't have at least a decent number of days in your Report object, otherwise the forecast will be inaccurate.
You can specify the number of days you want to forecast as the only accepted parameter for this method.
(
report
.forecast(days=10)
)
This method returns the DataFrame as created by Prophet, and if you want to understand the column names, have a look at their documentation.
Required dimensions | Required metrics | Output |
---|---|---|
query / date | None | pd.DataFrame |
This method allows you to simply compare your branded and non-branded clicks / impressions over time.
(
report
.brand_vs_no_brand(brand_variants=['brand', 'mybrand'])
)
It returns a clean table with your clicks & impressions over time that allows you to see how your traffic is evolving on branded and non-branded terms.
date | clicks_brand | impressions_brand | clicks_no_brand | impressions_no_brand |
---|---|---|---|---|
2023-01-01 | 0.0 | 4.0 | 60 | 4917 |
2023-01-02 | 0.0 | 15.0 | 84 | 6648 |
2023-01-03 | 1.0 | 10.0 | 88 | 5401 |
2023-01-04 | 0.0 | 11.0 | 80 | 5390 |
2023-01-05 | 0.0 | 12.0 | 80 | 5401 |
2023-01-06 | 0.0 | 13.0 | 79 | 5492 |
2023-01-07 | 0.0 | 11.0 | 80 | 5458 |
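The split can be pictured as a simple substring test on each query. A minimal sketch, assuming a query is branded as soon as it contains one of the variants (case-insensitive), which may differ from how the library matches them:

```python
from collections import defaultdict


def brand_vs_no_brand(rows, brand_variants):
    """Sum clicks and impressions per day for branded and non-branded queries."""
    variants = [variant.lower() for variant in brand_variants]
    totals = defaultdict(lambda: {
        "clicks_brand": 0, "impressions_brand": 0,
        "clicks_no_brand": 0, "impressions_no_brand": 0,
    })
    for row in rows:
        is_brand = any(variant in row["query"].lower() for variant in variants)
        suffix = "brand" if is_brand else "no_brand"
        totals[row["date"]][f"clicks_{suffix}"] += row["clicks"]
        totals[row["date"]][f"impressions_{suffix}"] += row["impressions"]
    return [{"date": day, **metrics} for day, metrics in sorted(totals.items())]


daily = brand_vs_no_brand(
    [
        {"date": "2023-01-01", "query": "MyBrand login", "clicks": 1, "impressions": 4},
        {"date": "2023-01-01", "query": "running shoes", "clicks": 60, "impressions": 900},
    ],
    brand_variants=["brand", "mybrand"],
)
```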
Required dimensions | Required metrics | Output |
---|---|---|
query | None | pd.DataFrame |
Some common SEO tools such as Semrush, Ahrefs or Sistrix allow you to perform a keyword gap analysis.
This is the first step to understand what content you may want to create for your projects, if it makes sense.
This method allows you to perform a similar operation by comparing the keywords you have in your Report object with any other list of keywords.
(
report
.keyword_gap(
#the DataFrame where your list of keywords is stored
df,
#the column name where your keywords are stored
column='keyword',
)
)
This method will filter your df to keep only the keywords that are not included in your Report object.
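Conceptually, this is a set difference. A tiny sketch with plain lists, where the lowercase and whitespace normalization is my own assumption:

```python
def keyword_gap(candidate_keywords, report_queries):
    """Return the candidate keywords that do not appear in the report."""
    known = {query.strip().lower() for query in report_queries}
    return [kw for kw in candidate_keywords if kw.strip().lower() not in known]


missing = keyword_gap(
    ["Running Shoes", "trail shoes", "hiking boots"],
    ["running shoes", "hiking boots"],
)
```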
Required dimensions | Required metrics | Output |
---|---|---|
date | clicks | Causal Impact object |
GSC is a great tool to understand whether changes applied to a specific set of pages are having a positive impact.
This method allows you to use Causal Impact to infer the expected effect a given intervention (or any action) had on some response variable by analyzing differences between expected and observed time series data.
(
report
.causal_impact(
intervention_date="2023-01-01",
)
)
To ensure that the results make sense, you need to have at least the same number of days before and after the intervention date in your Report object.
This method will return a ci object. Refer to the documentation to understand how you can explore the results.
Required dimensions | Required metrics | Output |
---|---|---|
page | None | Report object |
This method is the one you need if you're dealing with a migration. When we want to compare the traffic before / after, we need to manually update the URLs returned by GSC based on the redirect mapping we have.
This method simplifies the process:
- you provide a redirect mapping with from and to columns
- the method will update your Report object by replacing the from URLs with the to URLs.
(
report
.update_urls(
#the dataframe where we have the from & to columns
redirect_mapping=redirects
)
)
As the method just updates the object itself, you can then call any other method freely.
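To illustrate the idea outside the library, here is a sketch that uses a plain dictionary as the redirect mapping (old URL to new URL) instead of a DataFrame. It also sums the metrics of URLs that end up on the same destination, which is an assumption about what you would typically want after a migration.

```python
from collections import defaultdict


def apply_redirects(rows, redirect_mapping):
    """Rewrite old URLs to their new location and sum the metrics."""
    totals = defaultdict(lambda: {"clicks": 0, "impressions": 0})
    for row in rows:
        # URLs without a redirect are kept unchanged.
        page = redirect_mapping.get(row["page"], row["page"])
        totals[page]["clicks"] += row["clicks"]
        totals[page]["impressions"] += row["impressions"]
    return [{"page": page, **metrics} for page, metrics in sorted(totals.items())]


migrated = apply_redirects(
    [
        {"page": "https://www.website.com/old", "clicks": 10, "impressions": 100},
        {"page": "https://www.website.com/new", "clicks": 4, "impressions": 50},
        {"page": "https://www.website.com/other", "clicks": 1, "impressions": 20},
    ],
    {"https://www.website.com/old": "https://www.website.com/new"},
)
```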
Required dimensions | Required metrics | Output |
---|---|---|
query | None | pd.DataFrame |
GSC is a goldmine when it comes to keyword discovery, but it is sometimes handy to know the search volume of a given keyword we have in our dataset.
Even if keyword volume is not the only data point you need to assess a content's potential, in some cases it must be used.
This method leverages DataForSEO, an API I often recommend to get search volume at scale, to get the search_volume for your keywords.
Please note that this method is designed to work with any dataset size, but I do not recommend using it if you have more than 100,000 keywords. In that case, use a separate script.
(
report
.extract_search_volume(
#the location code from DataForSEO
## 2250 is France
location_code=2250,
#credentials for the API
client_email="[email protected]",
client_password="xxxxx",
#If you just want to calculate the cost
calculate_cost=True
)
)
To avoid any cost-related issue, you need to explicitly set the calculate_cost parameter to False to run the extraction. For now, search volumes are extracted for Google only.
Required dimensions | Required metrics | Output |
---|---|---|
page | clicks / impressions | pd.DataFrame |
This method is similar to active_pages(): you provide a sitemap URL and the function returns the contents based on your clicks_threshold and/or impressions_threshold.
(
report
.find_potential_contents_to_kill(
#my sitemap
"https://www.website.com/sitemap_index.xml",
#the threshold
clicks_threshold=0,
impressions_threshold=0,
)
)
This code would return a DataFrame like the following one:
loc | clicks | impressions | ctr | position |
---|---|---|---|---|
https://www.website.com/ | 0.0 | 0.0 | 0.0 | 0.0 |
https://www.website.com/blog/2024/01/09/xxx | 0.0 | 0.0 | 0.0 | 0.0 |
https://www.website.com/blog/2024/01/09/yyy | 0.0 | 0.0 | 0.0 | 0.0 |
Please note that you shouldn't just kill contents that have no impressions / clicks, because:
- Some are useful even if they do not generate SEO traffic
- Some of them may have been published some days ago
- Some of them may need to be updated
But it speeds up part of this process.
Required dimensions | Required metrics | Output |
---|---|---|
page / date | clicks | pd.DataFrame |
Content decay is something we always have to investigate to ensure that our best-performing content stays at the top of the SERPs.
It is indeed easier to improve the ranking of existing content than to rank a new one (all things being equal, obviously). This method analyzes your data and returns the content that seems to be suffering from this issue.
(
report
.find_content_decay(
threshold_decay=0.25,
threshold_clicks=100
)
)
This code would return a list of contents with the following characteristics:
- During the last (full) month of available data in the Report object, the page generated at least 25% fewer clicks than during its peak month
- During its peak, the content generated at least 100 clicks
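The two criteria above can be sketched as follows. This is an illustration in plain Python rather than the library's implementation, and it assumes that the data ends on a month boundary, so the latest month in the data is a full one.

```python
from collections import defaultdict


def find_content_decay(rows, threshold_decay=0.25, threshold_clicks=100):
    """Return pages whose last month is well below their best month."""
    monthly = defaultdict(lambda: defaultdict(int))
    for row in rows:
        monthly[row["page"]][row["date"][:7]] += row["clicks"]  # YYYY-MM

    last_month = max(month for months in monthly.values() for month in months)
    decaying = []
    for page, months in monthly.items():
        peak_month, peak_clicks = max(months.items(), key=lambda item: item[1])
        if peak_clicks < threshold_clicks:
            continue  # the page never generated enough clicks to matter
        last_clicks = months.get(last_month, 0)
        decay = round(1 - last_clicks / peak_clicks, 2)
        if decay >= threshold_decay:
            decaying.append({"page": page, "peak_month": peak_month,
                             "peak_clicks": peak_clicks,
                             "last_clicks": last_clicks, "decay": decay})
    return sorted(decaying, key=lambda item: item["decay"], reverse=True)


decayed = find_content_decay([
    {"page": "/a", "date": "2023-01-15", "clicks": 200},
    {"page": "/a", "date": "2023-02-15", "clicks": 100},
    {"page": "/b", "date": "2023-01-15", "clicks": 150},
    {"page": "/b", "date": "2023-02-15", "clicks": 140},
    {"page": "/c", "date": "2023-01-15", "clicks": 50},
    {"page": "/c", "date": "2023-02-15", "clicks": 10},
])
```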
When you get the output, you need to add your industry knowledge to understand what is going on because:
- Seasonality can affect the outcome. Indeed, if your peak month is August and you run the analysis in December, all your contents may be "decaying".
- A new SERP layout can also affect your CTR and hence the output of this method
Still, it's a good way to speed up the process and come up with a smaller list of contents to update to protect your key positions.
Required dimensions | Required metrics | Output |
---|---|---|
page | None | pd.DataFrame |
This method is self-explanatory.
It allows you to find the pages in your GSC Report object that are not included in a sitemap you provide.
Required dimensions | Required metrics | Output |
---|---|---|
page / date | clicks | pd.DataFrame |
This method is especially useful after Google updates. It allows you to quickly see which content is generating fewer / more clicks between two periods.
(
report
.winners_losers(
period_from=['2023-01-01', '2023-01-15'],
period_to=['2023-01-16', '2023-01-31'],
)
)
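The comparison boils down to summing clicks per page in each period and looking at the difference. A minimal sketch, assuming both period bounds are inclusive and dates are ISO strings (which compare correctly as text):

```python
from collections import defaultdict


def winners_losers(rows, period_from, period_to):
    """Compare clicks per page between two [start, stop] periods."""
    def clicks_in(period):
        start, stop = period
        totals = defaultdict(int)
        for row in rows:
            if start <= row["date"] <= stop:
                totals[row["page"]] += row["clicks"]
        return totals

    before, after = clicks_in(period_from), clicks_in(period_to)
    changes = [
        {"page": page, "clicks_from": before[page], "clicks_to": after[page],
         "diff": after[page] - before[page]}
        for page in set(before) | set(after)
    ]
    # Biggest losers first, biggest winners last.
    return sorted(changes, key=lambda change: (change["diff"], change["page"]))


changes = winners_losers(
    [
        {"page": "/a", "date": "2023-01-05", "clicks": 10},
        {"page": "/a", "date": "2023-01-20", "clicks": 4},
        {"page": "/b", "date": "2023-01-10", "clicks": 2},
        {"page": "/b", "date": "2023-01-25", "clicks": 9},
    ],
    period_from=["2023-01-01", "2023-01-15"],
    period_to=["2023-01-16", "2023-01-31"],
)
```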
Required dimensions | Required metrics | Output |
---|---|---|
query | None | pd.DataFrame |
Quickly filter your GSC Report object based on the number of words included in your keywords.
For instance:
(
report
.find_long_tail_keywords(number_of_words=7)
)
This will simply filter your data to include only keywords that are composed of at least 7 words.
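The filter is roughly equivalent to counting whitespace-separated words, as in this short sketch (the exact tokenization used by the library may differ):

```python
def find_long_tail_keywords(queries, number_of_words):
    """Keep queries made of at least `number_of_words` words."""
    return [query for query in queries if len(query.split()) >= number_of_words]


long_tail = find_long_tail_keywords(
    ["running shoes", "best running shoes for flat feet women 2023"],
    number_of_words=7,
)
```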
Required dimensions | Required metrics | Output |
---|---|---|
query / date | clicks / impressions / position | pd.DataFrame |
This method allows you to find CTR outliers.
- It first calls the ctr_yield_curve method to build a custom basis for comparison. I strongly advise you to filter the Report object accordingly to ensure that you are not mixing several SERP layouts.
- It then compares the expected CTR (based on weighted average position) and the real CTR (based on clicks & impressions) to find outliers.
(
report
.find_ctr_outliers()
)
Required dimensions | Required metrics | Output |
---|---|---|
None | None | pd.DataFrame |
This method allows you to assign an ABCD rank to a metric based on cumulative percentage contribution.
- A: belong to the top 50%
- B: between 50 & 75%
- C: between 75 and 90%
- D: between 90 and 100%
For instance, if we run the following code:
(
report
.abcd('clicks')
)
we could get this table as the output:
country | clicks | abcd |
---|---|---|
mex | 5955 | A |
ecu | 1936 | B |
ven | 1447 | B |
esp | 765 | C |
col | 716 | C |
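Here is a small sketch of how such a ranking can be computed. It assumes that the rank is based on the cumulative share once the row itself is included, which is one possible reading of the thresholds above. The library may draw the boundaries differently, so the sample below is not meant to reproduce the table.

```python
def abcd(rows, metric):
    """Assign an A/B/C/D rank based on the cumulative share of a metric."""
    total = sum(row[metric] for row in rows)
    ranked = []
    cumulative = 0
    for row in sorted(rows, key=lambda row: row[metric], reverse=True):
        cumulative += row[metric]
        share = cumulative / total if total else 1.0
        if share <= 0.50:
            rank = "A"
        elif share <= 0.75:
            rank = "B"
        elif share <= 0.90:
            rank = "C"
        else:
            rank = "D"
        ranked.append({**row, "abcd": rank})
    return ranked


ranked = abcd(
    [
        {"country": "aaa", "clicks": 40},
        {"country": "bbb", "clicks": 30},
        {"country": "ccc", "clicks": 15},
        {"country": "ddd", "clicks": 10},
        {"country": "eee", "clicks": 5},
    ],
    "clicks",
)
```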
Required dimensions | Required metrics | Output |
---|---|---|
date / page | None | pd.DataFrame |
This method was designed with Google Discover in mind. You can easily find out how many pages have appeared per day using this method.
(
report
.pages_per_day()
)
This would return a table with the number of pages per day, as the name of the method suggests.
date | page |
---|---|
2024-01-01 | 1901 |
2024-01-02 | 2544 |
2024-01-03 | 2761 |
2024-01-04 | 2853 |
2024-01-05 | 2473 |
2024-01-06 | 2281 |
2024-01-07 | 2796 |
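In other words, the method counts the distinct pages seen on each date. A plain-Python sketch of that count (an illustration, not the library's code):

```python
from collections import defaultdict


def pages_per_day(rows):
    """Count the number of distinct pages that appear on each date."""
    pages = defaultdict(set)
    for row in rows:
        pages[row["date"]].add(row["page"])
    return [{"date": day, "page": len(urls)} for day, urls in sorted(pages.items())]


per_day = pages_per_day([
    {"date": "2024-01-01", "page": "/a"},
    {"date": "2024-01-01", "page": "/b"},
    {"date": "2024-01-01", "page": "/a"},
    {"date": "2024-01-02", "page": "/a"},
])
```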
Required dimensions | Required metrics | Output |
---|---|---|
date / page | None | pd.DataFrame |
Similar to pages_per_day(), but it tells you the average lifespan of a page in your dataset.
Required dimensions | Required metrics | Output |
---|---|---|
date | clicks / impressions | pd.DataFrame |
This method allows you to quickly understand the weekly seasonality you have in your data.
(
report
.seasonality_per_day()
)
It would return a table as the following:
date | click | impressions |
---|---|---|
Monday | 2218 | 352148 |
Tuesday | 2456 | 399690 |
Wednesday | 2532 | 414933 |
Thursday | 2312 | 381517 |
Friday | 1081 | 195331 |
Saturday | 827 | 142410 |
Sunday | 1267 | 210677 |
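A comparable breakdown can be sketched with the standard library by mapping each date to its weekday name. This version sums the metrics per weekday. Whether the library sums or averages them is something I have not assumed beyond this sketch.

```python
from datetime import date

WEEKDAYS = ["Monday", "Tuesday", "Wednesday", "Thursday",
            "Friday", "Saturday", "Sunday"]


def seasonality_per_day(rows):
    """Sum clicks and impressions per weekday, Monday first."""
    totals = {name: {"clicks": 0, "impressions": 0} for name in WEEKDAYS}
    for row in rows:
        # weekday() returns 0 for Monday and 6 for Sunday.
        name = WEEKDAYS[date.fromisoformat(row["date"]).weekday()]
        totals[name]["clicks"] += row["clicks"]
        totals[name]["impressions"] += row["impressions"]
    return [{"date": name, **metrics} for name, metrics in totals.items()]


by_weekday = seasonality_per_day([
    {"date": "2023-01-01", "clicks": 3, "impressions": 30},   # a Sunday
    {"date": "2023-01-02", "clicks": 5, "impressions": 50},   # a Monday
    {"date": "2023-01-09", "clicks": 7, "impressions": 70},   # a Monday
])
```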
Required dimensions | Required metrics | Output |
---|---|---|
query | None | pd.DataFrame |
When you are working on a project where pages are created at scale, you don't want to understand what the most common keywords are, but what the most common structures are.
To achieve this objective, this method allows us to replace any occurrence of an element of a list in our query column. For instance, if you have a travel website, you won't have "flight paris barcelona" but "flight element element", assuming that you provide a list of cities.
This will considerably speed up your analysis to optimize your templates.
(
report
.replace_query_from_list(word_list)
)
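The replacement can be pictured as a whole-word regular expression substitution. The sketch below is my own illustration: the placeholder "element" comes from the example above, while the whole-word and case-insensitive matching are assumptions.

```python
import re


def replace_query_from_list(query, word_list, placeholder="element"):
    """Replace every listed word found in the query with a placeholder."""
    if not word_list:
        return query
    # Longest entries first, so "new york" is matched before "york".
    ordered = sorted(word_list, key=len, reverse=True)
    pattern = r"\b(?:" + "|".join(re.escape(word) for word in ordered) + r")\b"
    return re.sub(pattern, placeholder, query, flags=re.IGNORECASE)


template = replace_query_from_list("flight paris barcelona", ["paris", "barcelona"])
```

Grouping your queries by the resulting template then shows which structures drive the most traffic.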
You can also query the Sitemaps by creating a Sitemap object.
import gscwrapper
#authenticate
account = gscwrapper.generate_auth(
'config/client_secret_mvp.json',
serialize='config/credentials.json'
)
#we choose the website we want
webproperty = account['https://www.exemple.com/']
#we create the sitemap object
sitemap = (webproperty.sitemap)
We can easily list the sitemaps that are included in GSC:
(
sitemap
.list_sitemaps()
)
This would return a DataFrame similar to the following one:
path | lastSubmitted | isPending | isSitemapsIndex | lastDownloaded | warnings | errors | contents | type |
---|---|---|---|---|---|---|---|---|
https://www.website.com/sitemap_index.xml | 2023-12-23T23:41:05.453Z | False | True | 2024-01-09T15:07:27.851Z | 0 | 0 | [{'type': 'web', 'submitted': '475', 'indexed': '0'}, {'type': 'image', 'submitted': '457', 'indexed': '0'}] | |
https://www.website.com/page-sitemap.xml | 2019-10-15T15:44:16.831Z | False | False | 2024-01-04T09:48:28.500Z | 3 | 0 | [{'type': 'web', 'submitted': '14', 'indexed': '0'}] | sitemap |
This method will check if the sitemaps we have in GSC are all returning a 200 response code. If you have a strict IT policy in place, please note that the response code this method gets and the one Google would get may be different.
(
sitemap
.check_sitemaps()
)
path | response_code |
---|---|
https://www.website.com/sitemap_index.xml | 200 |
https://www.website.com/page-sitemap.xml | 200 |
You can also query the URL Inspection API by creating an Inspect object.
import gscwrapper
#authenticate
account = gscwrapper.generate_auth(
'config/client_secret_mvp.json',
serialize='config/credentials.json'
)
#we choose the website we want
webproperty = account['https://www.exemple.com/']
#we create the inspect object
inspect = (webproperty.inspect)
This method allows you to add URLs to the Inspect object. These are the URLs you'll send to the API in the execute() call to retrieve their data.
(
inspect
.add_urls([
'https://www.website.com/page',
'https://www.website.com/other-page'
]
)
)
Same logic, but to remove URLs from the Inspect object.
This will call the API for every unique URL you added using the add_urls() method. Depending on how many you have, the extraction can take a moment.
(
inspect
.execute()
)
If for whatever reason the execution fails during the process, you can retrieve the results that have already been generated by inspecting inspect.results.