jaybizzle / crawler-detect

🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent

Home Page: https://crawlerdetect.io

License: MIT License

PHP 100.00%

php user-agent crawler spider bots detect hacktoberfest

crawler-detect's Introduction

crawlerdetect.io

About CrawlerDetect

CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent and http_from header. It can currently detect thousands of bots/spiders/crawlers.

Installation

composer require jaybizzle/crawler-detect

Usage

use Jaybizzle\CrawlerDetect\CrawlerDetect;

$CrawlerDetect = new CrawlerDetect;

// Check the user agent of the current 'visitor'
if($CrawlerDetect->isCrawler()) {
    // true if crawler user agent detected
}

// Pass a user agent as a string
if($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
    // true if crawler user agent detected
}

// Output the name of the bot that matched (if any)
echo $CrawlerDetect->getMatches();

Contributing

If you find a bot/spider/crawler user agent that CrawlerDetect fails to detect, please submit a pull request with the regex pattern added to the $data array in Fixtures/Crawlers.php and add the failing user agent to tests/crawlers.txt.

Failing that, just create an issue with the user agent you have found, and we'll take it from there :)

Laravel Package

If you would like to use this with Laravel, please see Laravel-Crawler-Detect

Symfony Bundle

To use this library with Symfony 2/3/4, check out the CrawlerDetectBundle.

YII2 Extension

To use this library with the YII2 framework, check out yii2-crawler-detect.

ES6 Library

To use this library with NodeJS or any ES6-based application, check out es6-crawler-detect.

Python Library

To use this library in a Python project, check out crawlerdetect.

JVM Library (written in Java)

To use this library in a JVM project (including Java, Scala, Kotlin, etc.), check out CrawlerDetect.

.NET Library

To use this library in a .NET Standard (including .NET Core) based project, check out NetCrawlerDetect.

Ruby Gem

To use this library with Ruby on Rails or any Ruby-based application, check out crawler_detect gem.

Go Module

To use this library with Go, check out the crawlerdetect module.

Parts of this class are based on the brilliant MobileDetect

crawler-detect's Issues

acapbot

Hi! I just detected another bot in my server logs:

Mozilla/5.0 (compatible;acapbot/0.1)

Incorrect agent result

Mozilla/5.0 (Linux; Android 6.0.1; LG-K100 Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/50.0.2661.86 Mobile Safari/537.36 YandexSearch/5.75

NULL user-agent

Hi,

In some cases, the User-Agent is null. While not definitive (it could be faked), isn't that a sign of a programmatic connection? If so, is_bot should be true.

Class 'Jaybizzle\CrawlerDetect\CrawlerDetect' not found

2ip.ru signature

2ip.ru CMS Detector (http://2ip.ru/cms/)

How can I get bot information?

I would like to know which bot visited my website, when it came, and how many times per day.

Can you add these methods?

TIA

Incorrect agent result

masscan/1.0

https://github.com/robertdavidgraham/masscan

headers

{
  "user-agent": "masscan/1.0"
}

Create extension please for YII2

monperrus/crawler-user-agents repository

Look at https://github.com/monperrus/crawler-user-agents

Incorrect agent result - it's a Slack preview bot

Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Slack/1.2.6 Chrome/45.0.2454.85 AtomShell/0.34.3 Safari/537.36 Slack_SSB/1.2.6

Detect google bots

Google now supports crawling AJAX pages, but other bots don't.
I want to check whether a bot is Google or not. How can I do this check?
Thanks

Add an option to make exceptions for bots from Google, Bing, Yandex and other search engines

Reorder regex patterns so they all have a chance of returning a match?

Some regex patterns will never match.

For example...

'Googlebot',
'Googlebot-Image',
'Googlebot-Mobile',

A user-agent that contains Googlebot-Image will never match that specific regex, because the more generic Googlebot regex matches first.

In this instance, should we remove Googlebot-Image and Googlebot-Mobile or keep them in for verbosity and reorder them?
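
The ordering effect can be sketched as follows, assuming the patterns are joined into a single alternation. The helper name `firstMatch` is hypothetical and not part of CrawlerDetect. PCRE tries alternatives from left to right, so placing longer patterns first lets the more specific name win. Sorting by length is only a heuristic that suits literal patterns like these three.

```php
<?php
// Hypothetical helper: join the patterns into one alternation
// and return the text that matched, or null when nothing matched.
function firstMatch(array $patterns, string $userAgent): ?string
{
    $regex = '/(' . implode('|', $patterns) . ')/i';
    return preg_match($regex, $userAgent, $m) === 1 ? $m[0] : null;
}

$patterns = ['Googlebot', 'Googlebot-Image', 'Googlebot-Mobile'];
$ua = 'Googlebot-Image/1.0';

// Original order: the generic pattern is tried first and wins.
$generic = firstMatch($patterns, $ua);

// Longest pattern first: the specific pattern gets its chance.
usort($patterns, fn ($a, $b) => strlen($b) <=> strlen($a));
$specific = firstMatch($patterns, $ua);
```

With the original order `$generic` holds the generic name; after sorting, `$specific` holds the image-specific one.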

Updated & Sorted List

Hi guys,

Awesome list - really appreciate it!
I have analyzed our dataset of incoming calls and merged in some bots that were not in your list.
In addition, I have sorted the list, since I am doing a binary search through it and the list is currently not sorted in ASCII order.

Sorry for not pointing out which entries I have added :( but I was tunnel-visioning and forgot to save them...

Hope it still helps and that you can use these / update your list.

Cheers,
ebbmo
botlist.txt

Amazon AWS Crawlers

I've added the following to my crawler detection:
substr(gethostbyaddr($_SERVER['REMOTE_ADDR']),-14)==".amazonaws.com"

Amazon AWS crawls with all kinds of random spoofed user agents, so this was the only way for me to get rid of them for good.

So maybe that's useful for inclusion despite being based on the remote host, and not the user agent.
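
A slightly more defensive version of the check above might look like this. Both function names are hypothetical. The sketch relies on the documented behaviour of `gethostbyaddr`, which returns the unmodified IP when the lookup fails. Note that this flags every client hosted on AWS, not only crawlers, and that a reverse DNS lookup adds latency to the request.

```php
<?php
// Hypothetical helper: true when the host name ends with ".amazonaws.com".
function isAmazonHost(string $host): bool
{
    $host = strtolower(rtrim($host, '.')); // normalise case and a trailing dot
    $suffix = '.amazonaws.com';
    return substr($host, -strlen($suffix)) === $suffix;
}

// Hypothetical wrapper: reverse-resolve the IP, then test the suffix.
function isAmazonIp(string $ip): bool
{
    // gethostbyaddr() returns the IP unchanged when the lookup fails,
    // and false when the input is malformed.
    $host = gethostbyaddr($ip);
    return is_string($host) && $host !== $ip && isAmazonHost($host);
}
```

In a request handler this could be called as `isAmazonIp($_SERVER['REMOTE_ADDR'])`.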

some other false positives

Hi,
I may have found three more false positives. Here's the list:

Mozilla/5.0 (Linux; Android 4.4.2; ForwardRuby Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.81 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 5.1; Cosmos Build/LMY47I) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/39.0.0.0 Mobile Safari/537.36
Mozilla/5.0 (Linux; U; Android 4.2.1; en-ph; MyPhone Agua Vortex Build/JOP40D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30

What do you think?

Googlebot is not detected

Hi,

I am still parsing my logs, and hits by Googlebot are not detected.
The trick is that it uses a valid User-Agent - Mozilla/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/28.0.1500.71 Safari/537.36 - so you would need to read the HTTP From header, whose value is googlebot(at)googlebot.com.
It is a big modification, but would you consider it?
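
As a rough illustration of the idea, the sketch below inspects both the User-Agent and the From header of a `$_SERVER`-style array. The function names are hypothetical and this is not how CrawlerDetect itself is implemented. Both headers are client-supplied and can be spoofed.

```php
<?php
// Hypothetical helper: concatenate the headers worth inspecting
// from a $_SERVER-style array.
function crawlerSignals(array $server): string
{
    $parts = [];
    foreach (['HTTP_USER_AGENT', 'HTTP_FROM'] as $key) {
        if (!empty($server[$key])) {
            $parts[] = $server[$key];
        }
    }
    return implode(' ', $parts);
}

// Hypothetical check: "googlebot" appears in either header.
function looksLikeGooglebot(array $server): bool
{
    return stripos(crawlerSignals($server), 'googlebot') !== false;
}
```

A call such as `looksLikeGooglebot($_SERVER)` would then catch the request described above, where only the From header reveals the bot.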

Incorrect agent result

A null user agent and an empty string are two cases typical of bad bots. It would be nice to handle them too.

User agent to add

Hi, could you add this User-Agent?

CloudEndure Scanner ([email protected])

Incorrect agent result

User Agent: ping.blo.gs/2.0

Source Url: http://blo.gs/ping.php

Possible Crawler: wpif

We have seen a user agent of Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.21 (KHTML, like Gecko) wpif Safari/537.21 that appears to be non-human. In particular, https://www.webmasterworld.com/search_engine_spiders/4785385.htm appears to agree. Browscap does not. Thoughts?

Split detection regex rules into groups

Might be handy to split the regex rules based on the type of bot we are trying to detect

Then we can have methods such as isSearchEngine(), isLinkChecker(), isValidator(), isLibrary()

isCrawler() would still check for all
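
One possible shape for this, as a sketch only: the class name, the group names, and the patterns assigned to each group are all illustrative assumptions, not the library's actual data.

```php
<?php
// Hypothetical grouping of patterns by bot type; contents are illustrative.
final class GroupedDetector
{
    private const GROUPS = [
        'search_engine' => ['Googlebot', 'bingbot', 'YandexBot'],
        'link_checker'  => ['LinkChecker', 'W3C-checklink'],
        'library'       => ['python-requests', 'Google-HTTP-Java-Client', 'libwww-perl'],
    ];

    public function isSearchEngine(string $ua): bool
    {
        return $this->matchesGroup('search_engine', $ua);
    }

    public function isLinkChecker(string $ua): bool
    {
        return $this->matchesGroup('link_checker', $ua);
    }

    public function isLibrary(string $ua): bool
    {
        return $this->matchesGroup('library', $ua);
    }

    // Still checks every group, as proposed above.
    public function isCrawler(string $ua): bool
    {
        foreach (array_keys(self::GROUPS) as $group) {
            if ($this->matchesGroup($group, $ua)) {
                return true;
            }
        }
        return false;
    }

    private function matchesGroup(string $group, string $ua): bool
    {
        $regex = '~(' . implode('|', self::GROUPS[$group]) . ')~i';
        return preg_match($regex, $ua) === 1;
    }
}
```

Keeping one regex per group means a caller who only cares about search engines does not pay for matching the other groups.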

New Crawler - Cloudflare Always On

Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +https://www.cloudflare.com/always-online) AppleWebKit/534.34

See https://www.cloudflare.com/always-online

Pull request submitted

yahoo, google --> please check

74.6.254.126 ||| Mozilla/5.0 (compatible; Yahoo Ad monitoring; https://help.yahoo.com/kb/yahoo-ad-monitoring-SLN24857.html)
74.6.254.126 ||| Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile (compatible; Yahoo Ad monitoring; https://help.yahoo.com/kb/yahoo-ad-monitoring-SLN24857.html)

66.249.91.34 ||| AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari
209.85.238.93 ||| AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari
66.249.92.17 ||| Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)

Scout user agent

User agent used by scoutapp:

ScoutURLMonitor/5.9.8

Why Slurp and not Yahoo! Slurp

Crawler-Detect has started to detect the crawler as Slurp instead of Yahoo! Slurp.
User agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)

Could you please make this lib report such user agents as the Yahoo! Slurp crawler?

PS: http://www.useragentstring.com/ knows Yahoo! Slurp and does not know Slurp.

Missing "Google favicon" detection?

I see this in my $_SERVER (PHP) variable, with no detection:
"[HTTP_USER_AGENT] => Google favicon"

Could you add detection for it in a future tag?
Thanks

Incorrect agent result

B0t

Add crawler detection via IP address

Hello,
I think it would be great to detect crawlers by client IP address.
Maybe resources like iplists.com or myip.ms/browse/web_bots/ could be used to obtain a nearly complete list of IP addresses.
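
A minimal sketch of IP-based matching, assuming the list is kept as IPv4 CIDR ranges. The function names are hypothetical, and the ranges in the usage note are reserved documentation ranges, not real crawler addresses. The mask arithmetic assumes a 64-bit PHP build.

```php
<?php
// Hypothetical helper: true when an IPv4 address falls inside a CIDR range.
function ipInCidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[1]) || (int) $parts[1] > 32) {
        return false; // not a valid "address/bits" range
    }

    $ipLong = ip2long($ip);
    $subnetLong = ip2long($parts[0]);
    if ($ipLong === false || $subnetLong === false) {
        return false; // not a valid IPv4 address
    }

    $bits = (int) $parts[1];
    // Build a mask with the top $bits bits set.
    $mask = $bits === 0 ? 0 : (-1 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($subnetLong & $mask);
}

// Hypothetical helper: true when the IP is inside any listed range.
function isListedCrawlerIp(string $ip, array $ranges): bool
{
    foreach ($ranges as $range) {
        if (ipInCidr($ip, $range)) {
            return true;
        }
    }
    return false;
}
```

For example, `isListedCrawlerIp($_SERVER['REMOTE_ADDR'], ['192.0.2.0/24', '198.51.100.0/24'])` checks the current visitor against two placeholder ranges. Any such list needs regular updates, since crawler addresses change over time.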

[To Do] Add tests

Move tests from Laravel package to this repo

Assets

Rebelmouse

Hi,

A bot with the User-Agent RebelMouse/0.1 Mozilla/5.0 (compatible; http://rebelmouse.com) Gecko/20100101 Firefox/7.0.1 is not detected. Note that it also sets the Referer to http://rebelmouse.com, which makes no sense, and sets the From header to [email protected].

Could you port this for use with Tuckey UrlRewrite, or at least Java?

The tool I am looking for, and the code I have written so far, are described in this stackoverflow question

Tuckey UrlRewrite is a Java (Tomcat) port of the well-known Apache mod_rewrite module; it can be found here.

Add http: to the generic regex?

See #30

FEVER user agent

Hi,
I found an issue with this UA:

Mozilla/5.0 (Linux; Android 5.1; FEVER Build/LMY47D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.105 Mobile Safari/537.36

I'm not sure this is a crawler. I think it is a valid UA from this company: wikomobile.com.
You can read something here: http://www.tera-wurfl.com/explore/?action=wurfl_id&id=wiko_fever_ver1

What do you think?

Google WebCache

Great script!

Thought I would pass on I have traces of the following bot slipping through the filter:

21.206.197.104.bc.googleusercontent.com

Class 'Jaybizzle\CrawlerDetect\CrawlerDetect' not found

I have created a CrawlerDetect folder in my php app folder and I extracted the github zip in there like so:

raw
src
composer.json
export.php
LICENSE
README.md

I have this at the top of my php script file:

include_once("CrawlerDetect/export.php");
use Jaybizzle\CrawlerDetect\CrawlerDetect;
$CrawlerDetect = new CrawlerDetect;

if($CrawlerDetect->isCrawler()) {
	writeme("we got a crawler!");
}else{
	writeme("we got a human!");
}

When I run my php script in a browser, I get Class 'Jaybizzle\CrawlerDetect\CrawlerDetect' not found. On github for this library it mentions running cmd-line stuff, etc. Can I not just dump the files in my directory and reference it at the top of my script? What am I doing wrong?
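
The error usually means that nothing has loaded the class file. With Composer, including `vendor/autoload.php` handles this. Without Composer, a small autoloader can do the same job. The sketch below assumes that the library's classes live under `CrawlerDetect/src` and that the namespace `Jaybizzle\CrawlerDetect\` maps directly onto that directory; both are assumptions about the extracted layout, and `crawlerDetectPath` is a hypothetical helper.

```php
<?php
// Hypothetical helper: map a class in the assumed namespace prefix
// to a file path below $baseDir, or return null for other classes.
function crawlerDetectPath(string $class, string $baseDir): ?string
{
    $prefix = 'Jaybizzle\\CrawlerDetect\\';
    if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
        return null; // not in the library's namespace
    }
    $relative = substr($class, strlen($prefix));
    return rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
}

// Register the autoloader; the base directory is an assumption.
spl_autoload_register(function (string $class): void {
    $path = crawlerDetectPath($class, __DIR__ . '/CrawlerDetect/src');
    if ($path !== null && is_file($path)) {
        require $path;
    }
});
```

With this registered at the top of the script, `new CrawlerDetect` after the `use` statement should trigger the autoloader instead of failing, provided the files really are where the sketch expects them.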

New bots

Could you please add "OnPageBot" and "Uptrends"? Thanks.

Why is "// Pass a user agent as a string" needed?

Does this detect all user agents?
$CrawlerDetect->isCrawler()

And does this not detect all user agents?
$CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')

// Check the user agent of the current 'visitor'
if($CrawlerDetect->isCrawler()) {
    // true if crawler user agent detected
}

// Pass a user agent as a string
if($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
    // true if crawler user agent detected
}

Incorrect user agent result: HubPages

User Agent: HubPages V0.2.2 (http://hubpages.com/help/crawlingpolicy)
Result: HubPages V0.2.2 (http://hubpages.com/help/crawlingpolicy

I would also like to point out a parsing error with user agents like WordPress/4.8; http://xyz.com or redback/v0-570-g26f8c96: the result always includes the trailing slash, e.g. WordPress/ or Redback/

Cleaning up the result with the code below may help:
trim(preg_replace('/[^A-Za-z0-9\-\_\.]/', ' ', $result))

Empty user agent

Is an empty user agent treated as a crawler? Could we add a simple setting so that every empty UA is treated as a crawler?
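
Until such a setting exists, the behaviour can be approximated with a thin wrapper. This is a sketch: `isCrawlerOrEmpty` and its `$treatEmptyAsCrawler` flag are hypothetical, and `$detector` stands for any callable that classifies a non-empty user agent.

```php
<?php
// Hypothetical wrapper: decide how to treat a missing or blank user agent
// before handing non-empty values to the real detector.
function isCrawlerOrEmpty(?string $userAgent, callable $detector, bool $treatEmptyAsCrawler = true): bool
{
    if ($userAgent === null || trim($userAgent) === '') {
        return $treatEmptyAsCrawler;
    }
    return (bool) $detector($userAgent);
}

// Stand-in detector for illustration only; in practice pass the library's method.
$stub = fn (string $ua): bool => stripos($ua, 'bot') !== false;

$emptyResult = isCrawlerOrEmpty(null, $stub);
$botResult = isCrawlerOrEmpty('Mozilla/5.0 (compatible;acapbot/0.1)', $stub);
```

With the library installed, the call would look like `isCrawlerOrEmpty($_SERVER['HTTP_USER_AGENT'] ?? null, [$CrawlerDetect, 'isCrawler'])`.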

Detect SogouMobileBrowser

Good day! I use your script, very good work! But I have run into a problem. Many users visit our home page with the SogouMobileBrowser browser, and they are certainly not bots. How can I make an exception for them while still using current versions of your repository?

Example:
Mozilla/5.0 (Linux; U; Android 5.1; zh-cn; 1501_M02 Build/LMY47D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30 SogouMSE,SogouMobileBrowser/4.1.0

Google Docs Crawler

I found a new crawler in my server logs:

Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)

Incorrect agent result

Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Clarsentia)

We have a site with several sub-sites. This bot clicked on every ad on each sub-site (ignoring rel=nofollow). It did not fetch robots.txt either.

New crawler

Hi
This crawler visited my site today:

Virusdie crawler/3.0

Hope you can include it.

Thanks.

Google HTTP java client

Hi,

I got several hits from the Google-HTTP-Java-Client/1.17.0-rc (gzip) User-Agent. That one is not considered a bot. Strictly speaking it is not one, since it is only the transport, but it denotes a programmatic connection and therefore possibly a crawler.
What about adding this one?

Googlebot regex collision

When I add the following user agent to the devices.txt test file:

Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)

The tests fail. However, when I add Googlebot to the Crawlers.php file, PHPUnit fails because there is a regex collision.

Bots with a missing user-agent test

The following bots do not have a corresponding user-agent in the test suite. If you know a user-agent that matches any of the following, please add it to tests/crawlers.txt

brainobot
citeseerxbot
findthatfile
g00g1e.net
IOI
lb-spider
lssbot
toplistbot
UsineNouvelleCrawler
web-archive-net.com.bot
wocbot

detect new crawler

I think I found a new crawler.
Its user agent is: AppManager RPT-HTTPClient/0.3-3E
Is that right?

Branch.io Crawler is not in the list

Branch.io has a crawler that scrapes metadata and has a Branch-Passthrough user agent.
