
serp's Introduction

serp

This module lets you get the results of a Google search for a given keyword.

It provides several options for scraping the Google results, called SERP (Search Engine Result Page):

  • delay between requests
  • retry on error
  • with or without a proxy, a list of proxies, or a scrape API.

This module uses Playwright in order to accept the cookie consent popup before making a new search.

Installation

$ npm install serp -S

Simple usage

const serp = require("serp");

var options = {
  host : "google.fr",   // Google domain to use
  qs : {
    q : "test",         // the keyword to search for
    filter : 0,         // 0 = do not filter similar/duplicate results
    pws : 0             // 0 = disable personalized search
  },
  num : 100             // number of results to retrieve
};

const links = await serp.search(options);
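The snippets in this README use top-level await for brevity; in a plain CommonJS script, wrap the call in an async function. Judging from the issues below, each entry in links exposes url and title properties. A minimal runnable sketch:

(async () => {
  const serp = require("serp");

  const links = await serp.search({
    host : "google.fr",
    qs : { q : "test", filter : 0, pws : 0 },
    num : 10
  });

  // Each result appears to expose a title and a url (see the issues below).
  links.forEach((link, i) => console.log(`${i + 1}. ${link.title} -> ${link.url}`));
})();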

Understanding the options structure:

  • host : the Google domain to search on (e.g. google.fr; default: google.com).
  • qs : the query-string parameters sent to Google (q is the keyword; filter and pws are standard Google parameters).
  • num : the number of results to retrieve.
  • delay : the delay between requests, in ms (see below).
  • retry : the number of retries on error (see below).
  • numberOfResults : if true, return the number of results instead of the links.
  • proxy / proxyList / scrapeApiUrl : proxy options (see the sections below).
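Putting these together, a consolidated sketch of the options object (all values are illustrative and combine fields documented in the sections below):

const serp = require("serp");

var options = {
  host : "google.fr",   // Google domain to use
  qs : {
    q : "test",         // the keyword to search for
    filter : 0,
    pws : 0
  },
  num : 100,            // number of results to retrieve
  delay : 2000,         // delay between requests, in ms
  retry : 3             // retries on error
  // numberOfResults, proxy, proxyList, scrapeApiUrl: see the sections below
};

const links = await serp.search(options);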

Delay between requests

It is possible to add a delay between each request made on Google with the delay option (value in ms). The delay is also applied when the tool reads the next result page on Google.

const serp = require("serp");

var options = {
  qs : {
    q : "test"
  },
  num : 100,
  delay : 2000 // in ms
};

const links = await serp.search(options);

Retry on error

If an error occurs (timeout, network issue, invalid HTTP status, ...), it is possible to retry the same request on Google. If a proxyList is set in the options, another proxy will be used for the retry.

const serp = require("serp");

var options = {
  qs : {
    q : "test"
  },
  num : 100,
  retry : 3,            // retry the request up to 3 times on error
  proxyList : proxyList // optional: another proxy is picked for each retry
};

const links = await serp.search(options);

Get the number of results

You can get the number of indexed pages in Google by using the following code.

const serp = require("serp");

var options = {
  host : "google.fr",
  numberOfResults : true,   // return the result count instead of the links
  qs : {
    q : "site:yoursite.com"
  },
  proxyList : proxyList     // optional
};

const numberOfResults = await serp.search(options);

With proxy

You can set the proxy in the options:

const serp = require("serp");

var options = {
  qs : {
    q : "test"
  },
  proxy : {
    server: 'host:port',
    username: 'username',
    password: 'password'
  }
};

const links = await serp.search(options);

With multiple proxies

You can also use the module simple-proxies if you have several proxies (see https://github.com/christophebe/simple-proxies). In this case, a different proxy (chosen randomly) will be used for each serp.search call.

See this unit test for the complete code. The proxies have to be in a text file, one line per proxy, with the following structure: host:port:user:password. A loading sketch follows.
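A minimal loading sketch, assuming the proxyfileloader API from the simple-proxies project (the exact calls may differ; the unit test linked above is the authoritative reference):

const serp = require("serp");
const proxyLoader = require("simple-proxies/lib/proxyfileloader");

// Assumption: proxies.txt contains one proxy per line (host:port:user:password).
const config = proxyLoader.config()
  .setProxyFile("./proxies.txt")
  .setCheckProxies(false)
  .setRemoveInvalidProxies(false);

const proxyList = await proxyLoader.loadProxyFile(config);

const links = await serp.search({
  qs : { q : "test" },
  num : 100,
  proxyList : proxyList
});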

const serp = require("serp");

var options = {
  qs : {
    q : "test",
  },
  proxyList : proxyList
};

const links = await serp.search(options);

With a scrape API

This module can use a scrape API instead of a list of proxies.

This is an example with scraperapi.com:

const serp = require("serp");

const options = {
  num: 10,
  qs: {
    q: 'test'
  },
  scrapeApiUrl: `http://api.scraperapi.com/?api_key=${ accessKey }`
};

try {
  const links = await serp.search(options);
  console.log(links);
} catch (e) {
  console.log('Error', e);
}

Proxies or Scrape API ?

If you make many requests at the same time or over a limited period of time, Google may ban your IP address. This can happen even faster if you use special search operators such as intitle:, inurl:, site:, ...

It is therefore recommended to use proxies. The SERP module supports two solutions:

  • Datacenter proxies, such as those offered by Mexela. Shared proxies are more than enough.

  • Scrape APIs such as scraperapi.com.

What to choose: datacenter proxies or a scrape API?

It all depends on what you are looking for. Datacenter proxies provide the best performance and are generally very reliable. You can use the retry option for even more reliability. They also offer a good quality/price ratio, but they require more development effort, especially for proxy rotation. If you want to use rotation with datacenter proxies, see this unit test; a minimal sketch follows.
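A minimal rotation sketch under the same assumptions (proxyList built with simple-proxies as in the "With multiple proxies" section; on error, serp retries with another randomly picked proxy):

const serp = require("serp");

// Assumption: proxyList was loaded with simple-proxies (see above).
const options = {
  qs : { q : "test" },
  num : 100,
  retry : 3,             // retry failed requests up to 3 times...
  proxyList : proxyList  // ...picking a different random proxy each time
};

const links = await serp.search(options);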

Although slower, scrape APIs offer other features such as IP geolocation over a larger number of countries and the ability to scrape dynamic pages. Using such an API can also simplify the code. Unfortunately, this solution is often more expensive than datacenter proxies. So scrape APIs become interesting if you have other scraping needs.

In all cases, run a test with shared proxies in order to check that they are sufficient for your use case. Those proxies are really cheap.

serp's People

Contributors

christophebe, dependabot[bot]


serp's Issues

proxy not working

I've created a sample app

const options = {
  qs: {
    q: "silicon+valley",
    filter: 0,
    pws: 0
  },
  num: 100,
  proxy: "http://username:password@ip:port"
};

const links = await serp.search(options);

Without a proxy it worked for a few minutes, but then I got this error:

UnhandledPromiseRejectionWarning: StatusCodeError: 429

(The response body is Google's CAPTCHA page: "Our systems have detected unusual traffic from your computer network. This page checks to see if it's really you sending the requests, and not a robot. ... The block will expire shortly after those requests stop. In the meantime, solving the above CAPTCHA will let you continue to use our services. ... Sometimes you may be asked to solve the CAPTCHA if you are using advanced terms that robots are known to use, or sending requests very quickly."
IP address: 196.xx.xx.xx
Time: 2019-11-13T13:42:18Z
URL: https://www.google.com/search?q=silicon%20valley&filter=0&pws=0)

So according to the docs I added this line:

proxy: "http://username:password@ip:port"

But I'm still getting the same error.

P.S. I've tested my proxy and it works.
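A note for readers of this issue today: the README above documents the proxy option as an object with server, username and password fields (the Playwright form), while this report uses the older URL-string form. The equivalent object form would presumably be:

proxy: {
  server: "ip:port",
  username: "username",
  password: "password"
}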

triggeruncaughtexception

I'm extremely new to Node.js. When I run this code:
async function search() {
  // Run asynchronous code
  const serp = require("serp");

  var options = {
    host : "google.com",
    qs : {
      q: "seo",
      filter: 0,
      pws: 0
    },
    num : 100
  };

  const links = await serp.search(options);
  links.then((value) => {
    console.log(value);
  })
}

// Run the function
search();

I receive the following error:

[Screenshot of the error]
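A likely cause, judging from the snippet: await serp.search(options) already resolves the promise, so links is the plain result array, and calling links.then(...) throws a TypeError that surfaces as the uncaught exception. A corrected sketch:

async function search() {
  const serp = require("serp");

  const options = {
    host : "google.com",
    qs : { q : "seo", filter : 0, pws : 0 },
    num : 100
  };

  // await already unwraps the promise: links is an array, not a promise,
  // so there is no .then() to call on it.
  const links = await serp.search(options);
  console.log(links);
}

search();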

Error when searching on google.com

Hi, I have this code:
const serp = require("serp");

const keyword = "bici sportiva";

var options = {
  host: "google.com",
  qs: {
    q: keyword,
    filter: 0,
    pws: 0,
  },
  num: 100,
};

const getLinks = async () => {
  console.log("fetching google...");
  const links = await serp.search(options);
  console.log("links", links);
};

getLinks();
It works on google.it but not on google.com (I also tried removing the host option and setting num to 10). The resulting error is:

(node:64929) UnhandledPromiseRejectionWarning: Error: Invalid HTTP status code on undefined
at requestFromBrowser (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:125:11)
at runNextTicks (internal/process/task_queues.js:58:5)
at processImmediate (internal/timers.js:434:9)
at async execRequest (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:110:19)
at async doRequest (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:90:18)
at async Object.search (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/node_modules/serp/index.js:38:20)
at async getLinks (/Users/fabriziocoppolecchia/Desktop/Progetti/Freelance sharing/Tools Fabio/serpPositionSearch/index.js:20:17)
(Use node --trace-warnings ... to show where the warning was created)
(node:64929) UnhandledPromiseRejectionWarning: Unhandled promise rejection. This error originated either by throwing inside of an async function without a catch block, or by rejecting a promise which was not handled with .catch(). To terminate the node process on unhandled promise rejection, use the CLI flag --unhandled-rejections=strict (see https://nodejs.org/api/cli.html#cli_unhandled_rejections_mode). (rejection id: 1)
(node:64929) [DEP0018] DeprecationWarning: Unhandled promise rejections are deprecated. In the future, promise rejections that are not handled will terminate the Node.js process with a non-zero exit code.

All result URLs are unreachable or wrong. Here is my code:

const Discord = require("discord.js");
const serp = require("serp");
const request = require("request");

// CLIENT
const client = new Discord.Client();
const config = {
  "token": "your token in here",
  "prefix": "!"
};
client.login(config.token);
client.on("ready", () => console.log("BOT IS READY"));
client.on("message", msg => {
  if (!msg.guild || msg.author.bot || !msg.content.startsWith(config.prefix)) return;
  const args = msg.content.slice(config.prefix.length).trim().split(/ +/g);
  const cmd = args.shift().toLowerCase();
  if (cmd === "ping") return msg.reply(`My Ping is: ${client.ws.ping}ms`);
  else if (cmd === "google") {
    if (!args[0]) return msg.reply(`please add your search term: ${config.prefix}google Hello world`);
    var options = {
      host : "google.com",
      qs : {
        q : args.join(" "),
      },
      num : 10,
      retry : 3,
    };
    serp.search(options).then(links => {
      console.log(links[0].url.replace("/url?esrc=s&q=&rct=j&sa=U&url=", ""));

      msg.reply(new Discord.MessageEmbed()
        .setDescription(`${links.map((link, index) => `**${index}. **[${link.title}](${link.url.replace("/url?esrc=s&q=&rct=j&sa=U&url=", "")})`).join("\n")}`.substr(0, 2000))
        .setFooter("PICK YOUR RESULT!", msg.author.displayAvatarURL({ dynamic: true }))
      ).then(ms => {
        ms.channel.awaitMessages(m => m.author.id == msg.author.id, { time: 50000, max: 1, errors: ["time"] }).then(coll => {
          if (Number(coll) > 10 || Number(coll) < 0) return msg.reply("NOT A VALID INPUT ERROR!");
          msg.reply(`**${links[coll.first().content].title}**\n\n${links[coll.first().content].url.replace("/url?esrc=s&q=&rct=j&sa=U&url=", "")}`.substr(0, 2000));
        }).catch(e => {
          msg.reply("NOT ANSWERED IN TIME / INVALID INPUT");
        });
      });
    });
  }
  else return msg.reply(`UNKNOWN CMD! These are all of my cmds:\n${config.prefix}ping | --> shows you my latency\n${config.prefix}google | --> googles smt`);
});

Error getting multiple pages (new google serps)

If you ask for 100 results, you don't always get them. Google is changing the search results: sometimes a button that says "Show more" appears and the typical pagination does not appear.

This package is prepared to follow the pagination links, but not the "Show more" button.

proxyList.pick() not a function

I investigated the source code and cannot find a pick function for the call options.proxy = options.proxyList.pick().getUrl();

UnhandledPromiseRejectionWarning: TypeError: options.proxyList.pick is not a function
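A note for readers: judging from that call, proxyList is presumably expected to be the ProxyList object built by simple-proxies (which provides pick()), as in the "With multiple proxies" section above, not a plain array of proxy strings; passing a plain array would raise exactly this TypeError.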

Getting google suggested result instead of 'No results' for some queries

let q1 = "atista holololof gusanmk"
Suppose I search for q1. Google gives no results for this search; instead, by default, it lists the results of a suggested query q2 = "atista holololo gussa nmk".


Double quoting the query works as expected(gives no search results) in case of q1.
But in some cases even quoting the query doesn't work as expected, and gives suggested results. For example: "Khaccha payyu"


There should be an option to get the results of the exact query q1, and not the suggested query q2.

I've opened a pull request to address this case: #10.

util.promisify is not a function

/node_modules/serp/index.js:4
const delay = util.promisify(setTimeout);
^
TypeError: util.promisify is not a function

I got this error after npm install today. A few days ago it was OK.
Node v7.8.0.
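For context: util.promisify was added in Node.js 8.0.0, so it is undefined on Node 7.8.0; upgrading Node resolves this error.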
