
serp's Introduction

serp

This module lets you get the results of a Google search for a given keyword.

It provides several options for scraping the Google results, called SERP (Search Engine Result Page):

  • delay between requests
  • retry on error
  • with or without a proxy, a list of proxies, or a scrape API.

This module uses Playwright in order to accept the cookie consent popup before making a new search.

Installation

$ npm install serp -S

Simple usage

const serp = require("serp");

var options = {
  host : "google.fr",   // Google domain to use
  qs : {
    q : "test",         // the keyword to search for
    filter : 0,         // 0 = do not filter similar/duplicate results
    pws : 0             // 0 = disable personalized search
  },
  num : 100             // number of results to retrieve
};

const links = await serp.search(options);
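The snippets in this README use top-level await for brevity; in a plain CommonJS script, wrap the call in an async function. Judging from the issues below, each entry in links exposes url and title properties. A minimal runnable sketch:

(async () => {
  const serp = require("serp");

  const links = await serp.search({
    host : "google.fr",
    qs : { q : "test", filter : 0, pws : 0 },
    num : 10
  });

  // Each result appears to expose a title and a url (see the issues below).
  links.forEach((link, i) => console.log(`${i + 1}. ${link.title} -> ${link.url}`));
})();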

Understanding the options structure:

  • host : the Google domain to search on (e.g. google.fr; default: google.com).
  • qs : the query-string parameters sent to Google (q is the keyword; filter and pws are standard Google parameters).
  • num : the number of results to retrieve.
  • delay : the delay between requests, in ms (see below).
  • retry : the number of retries on error (see below).
  • numberOfResults : if true, return the number of results instead of the links.
  • proxy / proxyList / scrapeApiUrl : proxy options (see the sections below).
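Putting these together, a consolidated sketch of the options object (all values are illustrative and combine fields documented in the sections below):

const serp = require("serp");

var options = {
  host : "google.fr",   // Google domain to use
  qs : {
    q : "test",         // the keyword to search for
    filter : 0,
    pws : 0
  },
  num : 100,            // number of results to retrieve
  delay : 2000,         // delay between requests, in ms
  retry : 3             // retries on error
  // numberOfResults, proxy, proxyList, scrapeApiUrl: see the sections below
};

const links = await serp.search(options);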

Delay between requests

It is possible to add a delay between each request made on Google with the delay option (value in ms). The delay is also applied when the tool reads the next result page on Google.

const serp = require("serp");

var options = {
  qs : {
    q : "test"
  },
  num : 100,
  delay : 2000 // in ms
};

const links = await serp.search(options);

Retry on error

If an error occurs (timeout, network issue, invalid HTTP status, ...), it is possible to retry the same request on Google. If a proxyList is set in the options, another proxy will be used for the retry.

const serp = require("serp");

var options = {
  qs : {
    q : "test"
  },
  num : 100,
  retry : 3,            // retry the request up to 3 times on error
  proxyList : proxyList // optional: another proxy is picked for each retry
};

const links = await serp.search(options);

Get the number of results

You can get the number of indexed pages in Google by using the following code.

const serp = require("serp");

var options = {
  host : "google.fr",
  numberOfResults : true,   // return the result count instead of the links
  qs : {
    q : "site:yoursite.com"
  },
  proxyList : proxyList     // optional
};

const numberOfResults = await serp.search(options);

With proxy

You can set the proxy in the options:

const serp = require("serp");

var options = {
  qs : {
    q : "test"
  },
  proxy : {
    server: 'host:port',
    username: 'username',
    password: 'password'
  }
};

const links = await serp.search(options);

With multiple proxies

You can also use the module simple-proxies if you have several proxies (see https://github.com/christophebe/simple-proxies). In this case, a different proxy (chosen randomly) will be used for each serp.search call.

See this unit test for the complete code. The proxies have to be in a text file, one line per proxy, with the following structure: host:port:user:password. A loading sketch follows.
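A minimal loading sketch, assuming the proxyfileloader API from the simple-proxies project (the exact calls may differ; the unit test linked above is the authoritative reference):

const serp = require("serp");
const proxyLoader = require("simple-proxies/lib/proxyfileloader");

// Assumption: proxies.txt contains one proxy per line (host:port:user:password).
const config = proxyLoader.config()
  .setProxyFile("./proxies.txt")
  .setCheckProxies(false)
  .setRemoveInvalidProxies(false);

const proxyList = await proxyLoader.loadProxyFile(config);

const links = await serp.search({
  qs : { q : "test" },
  num : 100,
  proxyList : proxyList
});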

const serp = require("serp");

var options = {
  qs : {
    q : "test",
  },
  proxyList : proxyList
};

const links = await serp.search(options);

With a scrape API

This module can use a scrape API instead of a list of proxies.

This is an example with scraperapi.com:

const serp = require("serp");

const options = {
  num: 10,
  qs: {
    q: 'test'
  },
  scrapeApiUrl: `http://api.scraperapi.com/?api_key=${ accessKey }`
};

try {
  const links = await serp.search(options);
  console.log(links);
} catch (e) {
  console.log('Error', e);
}

Proxies or Scrape API ?

If you make many requests at the same time or over a limited period of time, Google may ban your IP address. This can happen even faster if you use special search operators such as intitle:, inurl:, site:, ...

It is therefore recommended to use proxies. The SERP module supports two solutions:

  • Datacenter proxies, such as those offered by Mexela. Shared proxies are more than enough.

  • Scrape APIs such as scraperapi.com.

What to choose: datacenter proxies or a scrape API?

It all depends on what you are looking for. Datacenter proxies provide the best performance and are generally very reliable. You can use the retry option for even more reliability. They also offer a good quality/price ratio, but they require more development effort, especially for proxy rotation. If you want to use rotation with datacenter proxies, see this unit test; a minimal sketch follows.
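A minimal rotation sketch under the same assumptions (proxyList built with simple-proxies as in the "With multiple proxies" section; on error, serp retries with another randomly picked proxy):

const serp = require("serp");

// Assumption: proxyList was loaded with simple-proxies (see above).
const options = {
  qs : { q : "test" },
  num : 100,
  retry : 3,             // retry failed requests up to 3 times...
  proxyList : proxyList  // ...picking a different random proxy each time
};

const links = await serp.search(options);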

Although slower, scrape APIs offer other features such as IP geolocation over a larger number of countries and the ability to scrape dynamic pages. Using such an API can also simplify the code. Unfortunately, this solution is often more expensive than datacenter proxies. So scrape APIs become interesting if you have other scraping needs.

In all cases, run a test with shared proxies in order to check that they are sufficient for your use case. Those proxies are really cheap.

serp's People

Contributors

christophebe, dependabot[bot]


serp's Issues

proxy not working

I've created a sample app

const options = {
  qs: {
    q: "silicon+valley",
    filter: 0,
    pws: 0
  },
  num: 100,
  proxy: "http://username:password@ip:port"
};

const links = await serp.search(options);

Without a proxy it worked for a few minutes, but then I got this error:

UnhandledPromiseRejectionWarning: StatusCodeError: 429

(The response body is Google's CAPTCHA page: "Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. ... The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. ... Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly."
IP address: 196.xx.xx.xx
Time: 2019-11-13T13:42:18Z
URL: https://www.google.com/search?q=silicon%20valley&filter=0&pws=0)

So according to the docs I added this line:

proxy: "http://username:password@ip:port"

But I'm still getting the same error.

P.S. I've tested my proxy and it works.
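A note for readers of this issue today: the README above documents the proxy option as an object with server, username and password fields (the Playwright form), while this report uses the older URL-string form. The equivalent object form would presumably be:

proxy: {
  server: "ip:port",
  username: "username",
  password: "password"
}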

triggeruncaughtexception

I'm extremely new to Node.js. When I run this code:
async function search() {
  // Run asynchronous code
  const serp = require("serp");

  var options = {
    host : "google.com",
    qs : {
      q: "seo",
      filter: 0,
      pws: 0
    },
    num : 100
  };

  const links = await serp.search(options);
  links.then((value) => {
    console.log(value);
  })
}

// Run the function
search();

I receive the following error:

[Screenshot of the error]
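A likely cause, judging from the snippet: await serp.search(options) already resolves the promise, so links is the plain result array, and calling links.then(...) throws a TypeError that surfaces as the uncaught exception. A corrected sketch:

async function search() {
  const serp = require("serp");

  const options = {
    host : "google.com",
    qs : { q : "seo", filter : 0, pws : 0 },
    num : 100
  };

  // await already unwraps the promise: links is an array, not a promise,
  // so there is no .then() to call on it.
  const links = await serp.search(options);
  console.log(links);
}

search();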

Error when searching on google.com

Hi, I have this code:
const serp = require("serp");

const keyword = "bici sportiva";

var options = {
  host: "google.com",
  qs: {
    q: keyword,
    filter: 0,
    pws: 0,
  },
  num: 100,
};

const getLinks = async () => {
  console.log("fetching google...");
  const links = await serp.search(options);
  console.log("links", links);
};

getLinks();
It works on google.it but not on google.com (I also tried removing the host option and setting num to 10). The resulting error is:

(node:64929) UnhandledPromiseRejectionWarning: Error: Invalid HTTP status code on undefined
at requestFromBrowser (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:125:11)
at runNextTicks (internal/process/task_queues.js:58:5)
at processImmediate (internal/timers.js:434:9)
at async execRequest (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:110:19)
at async doRequest (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:90:18)
at async Object.search (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:38:20)
at async getLinks (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/index.js:20:17)
(Use node --trace-warnings ... to show where the warning was created)
(node:64929) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:64929) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

All result URLs are unreachable or wrong. Here is my code:

const Discord = require("discord.js");
const serp = require("serp");
const request = require("request");

// CLIENT
const client = new Discord.Client();
const config = {
  "token": "your token in here",
  "prefix": "!"
};
client.login(config.token);
client.on("ready", () => console.log("BOT IS READY"));
client.on("message", msg => {
  if (!msg.guild || msg.author.bot || !msg.content.startsWith(config.prefix)) return;
  const args = msg.content.slice(config.prefix.length).trim().split(/ +/g);
  const cmd = args.shift().toLowerCase();
  if (cmd === "ping") return msg.reply(`My Ping is: ${client.ws.ping}ms`);
  else if (cmd === "google") {
    if (!args[0]) return msg.reply(`please add your search term: ${config.prefix}google Hello world`);
    var options = {
      host : "google.com",
      qs : {
        q : args.join(" "),
      },
      num : 10,
      retry : 3,
    };
    serp.search(options).then(links => {
      console.log(links[0].url.replace("/url?esrc=s&q=&rct=j&sa=U&url=", ""));

      msg.reply(new Discord.MessageEmbed()
        .setDescription(`${links.map((link, index) => `**${index}. **[${link.title}](${link.url.replace("/url?esrc=s&q=&rct=j&sa=U&url=", "")})`).join("\n")}`.substr(0, 2000))
        .setFooter("PICK YOUR RESULT!", msg.author.displayAvatarURL({ dynamic: true }))
      ).then(ms => {
        ms.channel.awaitMessages(m => m.author.id == msg.author.id, { time: 50000, max: 1, errors: ["time"] }).then(coll => {
          if (Number(coll) > 10 || Number(coll) < 0) return msg.reply("NOT A VALID INPUT ERROR!");
          msg.reply(`**${links[coll.first().content].title}**\n\n${links[coll.first().content].url.replace("/url?esrc=s&q=&rct=j&sa=U&url=", "")}`.substr(0, 2000));
        }).catch(e => {
          msg.reply("NOT ANSWERED IN TIME / INVALID INPUT");
        });
      });
    });
  }
  else return msg.reply(`UNKNOWN CMD! These are all of my cmds:\n${config.prefix}ping | --> shows you my latency\n${config.prefix}google | --> googles smt`);
});

Error getting multiple pages (new google serps)

If you ask for 100 results, you don't always get them. Google is changing the search results: sometimes a button that says "Show more" appears and the typical pagination does not appear.

This package is prepared to follow the pagination links, but not the "Show more" button.

proxyList.pick() not a function

I investigated the source code and cannot find a pick function for the call options.proxy = options.proxyList.pick().getUrl();

UnhandledPromiseRejectionWarning: TypeError: options.proxyList.pick is not a function
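A note for readers: judging from that call, proxyList is presumably expected to be the ProxyList object built by simple-proxies (which provides pick()), as in the "With multiple proxies" section above, not a plain array of proxy strings; passing a plain array would raise exactly this TypeError.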

Getting google suggested result instead of 'No results' for some queries

let q1 = "atista holololof gusanmk"
Suppose I search for q1. Google gives no results for this search; instead, by default, it lists the results of a suggested query q2 = "atista holololo gussa nmk".


Double quoting the query works as expected(gives no search results) in case of q1.
But in some cases even quoting the query doesn't work as expected, and gives suggested results. For example: "Khaccha payyu"


There should be an option to get the results of the exact query q1, and not the suggested query q2.

I've opened a pull request to address this case: #10.

util.promisify is not a function

/node_modules/serp/index.js:4
const delay = util.promisify(setTimeout);
^
TypeError: util.promisify is not a function

I got this error after npm install today. A few days ago it was OK.
Node v7.8.0.
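For context: util.promisify was added in Node.js 8.0.0, so it is undefined on Node 7.8.0; upgrading Node resolves this error.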
