jaybizzle / crawler-detect
🕷 CrawlerDetect is a PHP class for detecting bots/crawlers/spiders via the user agent
Home Page: https://crawlerdetect.io
License: MIT License
It might be handy to split the regex rules based on the type of bot we are trying to detect.
Then we could have methods such as isSearchEngine(), isLinkChecker(), isValidator(), and isLibrary().
isCrawler() would still check for all of them.
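A minimal sketch of how such a split could look. The class name, category keys, and patterns below are hypothetical and are not taken from the library's actual rule list:

```php
<?php
// Hypothetical sketch: category names and patterns are illustrative only.
final class CategorizedDetector
{
    /** @var array<string, string[]> patterns grouped by bot type */
    private array $rules = [
        'searchEngine' => ['Googlebot', 'bingbot', 'Yahoo! Slurp'],
        'linkChecker'  => ['LinkChecker', 'W3C-checklink'],
        'validator'    => ['W3C_Validator'],
        'library'      => ['curl', 'python-requests'],
    ];

    // Build one case-insensitive alternation from literal patterns.
    private function matches(array $patterns, string $userAgent): bool
    {
        $quoted = array_map(fn (string $p): string => preg_quote($p, '/'), $patterns);
        return preg_match('/' . implode('|', $quoted) . '/i', $userAgent) === 1;
    }

    public function isSearchEngine(string $ua): bool
    {
        return $this->matches($this->rules['searchEngine'], $ua);
    }

    public function isLinkChecker(string $ua): bool
    {
        return $this->matches($this->rules['linkChecker'], $ua);
    }

    public function isValidator(string $ua): bool
    {
        return $this->matches($this->rules['validator'], $ua);
    }

    public function isLibrary(string $ua): bool
    {
        return $this->matches($this->rules['library'], $ua);
    }

    // Checks every category at once, as isCrawler() would.
    public function isCrawler(string $ua): bool
    {
        return $this->matches(array_merge(...array_values($this->rules)), $ua);
    }
}
```

Keeping the categories in one map means isCrawler() stays a union of all groups, so adding a category would not change its behaviour for existing patterns.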
The following bots do not have a corresponding user-agent in the test suite. If you know any user-agent that will fulfil any of the following, please add it to test/crawlers.txt
Please create an extension for Yii2.
Great script!
Thought I would pass on that I have traces of the following bot slipping through the filter:
21.206.197.104.bc.googleusercontent.com
User Agent: HubPages V0.2.2 (http://hubpages.com/help/crawlingpolicy)
Result: HubPages V0.2.2 (http://hubpages.com/help/crawlingpolicy
I would also like to point out a parsing error with user agents like WordPress/4.8; http://xyz.com or redback/v0-570-g26f8c96: the result always includes the trailing slash, e.g. WordPress/ or Redback/.
Maybe cleaning the result with the code below will help:
trim(preg_replace('/[^A-Za-z0-9\-\_\.]/', ' ', $result))
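As a rough illustration, the expression above can be wrapped in a helper (the function name cleanMatch is hypothetical) and applied to the results quoted in this report:

```php
<?php
// Hypothetical helper wrapping the expression suggested above.
function cleanMatch(string $result): string
{
    // Replace every character outside A-Z, a-z, 0-9, "-", "_" and "."
    // with a space, then trim leading and trailing whitespace.
    return trim(preg_replace('/[^A-Za-z0-9\-\_\.]/', ' ', $result));
}

echo cleanMatch('WordPress/'), "\n";
echo cleanMatch('Redback/'), "\n";
```

Note that this also replaces punctuation inside the result with spaces, not only the trailing slash. If only the trailing slash is unwanted, `rtrim($result, '/ ')` would be a narrower alternative.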
Hi,
In some cases, the User-Agent is null. While not definitive (it could be faked), isn't that a sign of a programmatic connection? In that case is_bot should be true.
Hi,
A bot with the User-Agent RebelMouse\/0.1 Mozilla\/5.0 (compatible; http:\/\/rebelmouse.com) Gecko\/20100101 Firefox\/7.0.1 is not detected. Note that it also has the Referer set to http://rebelmouse.com, which makes no sense, plus a From header set to [email protected].
Some regex patterns will never match.
For example...
'Googlebot',
'Googlebot-Image',
'Googlebot-Mobile',
A user agent that contains Googlebot-Image will never match that specific regex, because the more generic Googlebot regex matches first.
In this instance, should we remove Googlebot-Image and Googlebot-Mobile, or keep them in for verbosity and reorder them?
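The effect can be reproduced with a small sketch. It assumes the patterns are joined into one alternation, and the helper names firstMatch and longestFirst are hypothetical:

```php
<?php
// Join literal patterns into one alternation and return the matched text.
function firstMatch(array $patterns, string $userAgent): ?string
{
    $quoted = array_map(fn (string $p): string => preg_quote($p, '/'), $patterns);
    $regex = '/' . implode('|', $quoted) . '/i';
    return preg_match($regex, $userAgent, $m) === 1 ? $m[0] : null;
}

// Reorder so that longer (more specific) patterns are tried first.
function longestFirst(array $patterns): array
{
    usort($patterns, fn (string $a, string $b): int => strlen($b) <=> strlen($a));
    return $patterns;
}

$patterns = ['Googlebot', 'Googlebot-Image', 'Googlebot-Mobile'];
$ua = 'Googlebot-Image/1.0';

echo firstMatch($patterns, $ua), "\n";               // Googlebot
echo firstMatch(longestFirst($patterns), $ua), "\n"; // Googlebot-Image
```

PCRE tries alternatives from left to right at each position, so the generic pattern shadows the specific ones unless the specific ones come first. For a plain yes/no isCrawler() answer the order does not matter, only for reporting which pattern matched.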
Branch.io has a crawler that scrapes metadata and has a Branch-Passthrough user agent.
I have created a CrawlerDetect folder in my PHP app folder and extracted the GitHub zip into it like so:
I have this at the top of my php script file:
include_once("CrawlerDetect/export.php");
use Jaybizzle\CrawlerDetect\CrawlerDetect;
$CrawlerDetect = new CrawlerDetect;
if($CrawlerDetect->isCrawler()) {
writeme("we got a crawler!");
}else{
writeme("we got a human!");
}
When I run my PHP script in a browser, I get Class 'Jaybizzle\CrawlerDetect\CrawlerDetect' not found. The GitHub page for this library mentions running command-line tools, etc. Can I not just dump the files in my directory and reference them at the top of my script? What am I doing wrong?
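The error indicates that the file defining the class was never loaded. One possible way to use the files without Composer is a small autoloader. This is a sketch under the assumption that the library's src/ directory follows PSR-4 for the Jaybizzle\CrawlerDetect namespace; the function name and directory path are hypothetical:

```php
<?php
// Assumption: classes under Jaybizzle\CrawlerDetect\ live in $baseDir with
// a PSR-4 layout, e.g. Jaybizzle\CrawlerDetect\Foo\Bar => $baseDir/Foo/Bar.php.
function registerCrawlerDetectAutoloader(string $baseDir): void
{
    $prefix = 'Jaybizzle\\CrawlerDetect\\';

    spl_autoload_register(function (string $class) use ($prefix, $baseDir): void {
        if (strncmp($class, $prefix, strlen($prefix)) !== 0) {
            return; // not a class from this namespace
        }
        $relative = substr($class, strlen($prefix));
        $file = rtrim($baseDir, '/') . '/' . str_replace('\\', '/', $relative) . '.php';
        if (is_file($file)) {
            require $file;
        }
    });
}

// Usage, assuming the zip was extracted into ./CrawlerDetect:
// registerCrawlerDetectAutoloader(__DIR__ . '/CrawlerDetect/src');
// $CrawlerDetect = new \Jaybizzle\CrawlerDetect\CrawlerDetect;
```

The autoloader resolves dependent classes in the same namespace on demand, which avoids listing every file in a chain of include_once calls.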
Hello,
I think it would be great to add detection of crawlers by client IP address.
Maybe resources like iplists.com or myip.ms/browse/web_bots/ could be used to obtain a nearly complete list of IP addresses.
Hi
This crawler visited my site today:
Virusdie crawler/3.0
Hope you can include it.
Thanks.
Hi,
I got several hits from the Google-HTTP-Java-Client\/1.17.0-rc (gzip) User-Agent. That one is not considered a bot. Strictly speaking it is not one, since it is only the HTTP library, but it denotes a programmatic connection and therefore possibly a crawler.
What about adding this one?
Hi, maybe you can add this User-Agent?
CloudEndure Scanner ([email protected])
Hi,
Still parsing my logs: hits by Google's googlebot are not detected.
The trick is that it uses a valid User-Agent (Mozilla\/5.0 (Macintosh; Intel Mac OS X 10_8_4) AppleWebKit\/537.36 (KHTML, like Gecko) Chrome\/28.0.1500.71 Safari\/537.36), but you could read the HTTP From header, whose value is googlebot(at)googlebot.com.
It is a big modification, but would it be considered?
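A minimal sketch of such a check, written as a standalone helper rather than as part of CrawlerDetect. The function name and the default pattern list are hypothetical; HTTP_FROM is the $_SERVER key PHP uses for the From request header:

```php
<?php
// Hypothetical helper, not part of CrawlerDetect.
// $server is expected to look like $_SERVER.
function isBotByFromHeader(array $server, array $patterns = ['googlebot']): bool
{
    $from = $server['HTTP_FROM'] ?? '';
    if ($from === '') {
        return false; // no From header sent
    }
    foreach ($patterns as $pattern) {
        // Case-insensitive substring test against the header value.
        if (stripos($from, $pattern) !== false) {
            return true;
        }
    }
    return false;
}

// Usage: combine with the normal user-agent check.
// $isBot = $CrawlerDetect->isCrawler() || isBotByFromHeader($_SERVER);
```

Like the user agent, the From header is supplied by the client and can be forged, so this only catches bots that identify themselves.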
I think I found a new crawler.
Its user agent is: AppManager RPT-HTTPClient/0.3-3E
Is it really a crawler?
B0t
I found a new crawler in my server logs:
Mozilla/5.0 (compatible; GoogleDocs; apps-spreadsheets; +http://docs.google.com)
Hi
I am interested in knowing which bot came to my website, when it came, and how many times a day.
Can you add methods for this?
TIA
Mozilla/5.0 (compatible; CloudFlare-AlwaysOnline/1.0; +https://www.cloudflare.com/always-online) AppleWebKit/534.34
See https://www.cloudflare.com/always-online
Pull request submitted
A null or empty-string user agent are two cases typical of bad bots. It would be nice to handle them too.
Mozilla/5.0 (Linux; Android 6.0.1; LG-K100 Build/MXB48T; wv) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/50.0.2661.86 Mobile Safari/537.36 YandexSearch/5.75
Mozilla/5.0 (compatible; MSIE 9.0; Windows NT 6.1; Clarsentia)
We have a site with several sub-sites. This bot clicked on every ad on each sub-site (ignoring rel=nofollow). It did not fetch robots.txt either.
hi,
I may have found three more false positives. Here's the list:
Mozilla/5.0 (Linux; Android 4.4.2; ForwardRuby Build/KOT49H) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.81 Mobile Safari/537.36
Mozilla/5.0 (Linux; Android 5.1; Cosmos Build/LMY47I) AppleWebKit/537.36 (KHTML, like Gecko) Version/4.0 Chrome/39.0.0.0 Mobile Safari/537.36
Mozilla/5.0 (Linux; U; Android 4.2.1; en-ph; MyPhone Agua Vortex Build/JOP40D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30
What do you think?
ty
Hi,
I found an issue with this UA:
Mozilla/5.0 (Linux; Android 5.1; FEVER Build/LMY47D) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/49.0.2623.105 Mobile Safari/537.36
I'm not sure this is a crawler. I think this is a valid UA from a device made by this company: wikomobile.com.
You can read something here: http://www.tera-wurfl.com/explore/?action=wurfl_id&id=wiko_fever_ver1
What do you think?
User Agent: ping.blo.gs/2.0
Source Url: http://blo.gs/ping.php
Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Slack/1.2.6 Chrome/45.0.2454.85 AtomShell/0.34.3 Safari/537.36 Slack_SSB/1.2.6
Good day! I use your script, very good work! But I ran into a problem. We get a lot of visitors using the SogouMobileBrowser browser, and they are certainly not bots. How can I make an exception for them while still using the current version of your repository?
Example:
Mozilla/5.0 (Linux; U; Android 5.1; zh-cn; 1501_M02 Build/LMY47D) AppleWebKit/534.30 (KHTML, like Gecko) Version/4.0 Mobile Safari/534.30 SogouMSE,SogouMobileBrowser/4.1.0
See #30
When I add the following user agent to the devices.txt test file:
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)
The tests fail. However, when I add Googlebot to the Crawlers.php file, PHPUnit fails because there is a regex collision.
Hi guys,
Awesome list - really appreciate it!
I have analyzed our dataset of incoming calls and merged in some bots that were not in your list.
In addition, I have sorted the list, since I am doing a binary search through it and the list is currently not sorted in ASCII order.
Sorry for not pointing out which entries I have added :( but I was tunnel-visioning and forgot to save them...
Hope it still helps and that you can use these / update your list.
Cheers,
ebbmo
botlist.txt
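For illustration, a byte-order sort and a matching binary search might look like the sketch below. The function names and the sample list are hypothetical and do not come from the attached botlist.txt:

```php
<?php
// Sort names in byte order (uppercase ASCII letters sort before lowercase).
function sortBytewise(array $names): array
{
    sort($names, SORT_STRING);
    return $names;
}

// Exact-match binary search over a list sorted with sortBytewise().
function binarySearch(array $sorted, string $needle): bool
{
    $lo = 0;
    $hi = count($sorted) - 1;
    while ($lo <= $hi) {
        $mid = intdiv($lo + $hi, 2);
        $cmp = strcmp($sorted[$mid], $needle); // same ordering as SORT_STRING
        if ($cmp === 0) {
            return true;
        }
        if ($cmp < 0) {
            $lo = $mid + 1;
        } else {
            $hi = $mid - 1;
        }
    }
    return false;
}

$list = sortBytewise(['bingbot', 'Googlebot', 'Yandex', 'acapbot']);
print_r($list); // Googlebot, Yandex, acapbot, bingbot
```

The search only works if the comparison used for lookup is the same one used for sorting. It also finds exact names only, so a full user-agent string would first need to be reduced to a bot name.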
User agent used by scoutapp:
ScoutURLMonitor/5.9.8
Why is it necessary to pass a user agent as a string?
Does this call detect every user agent?
$CrawlerDetect->isCrawler()
And does this call not detect every user agent?
$CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')
// Check the user agent of the current 'visitor'
if($CrawlerDetect->isCrawler()) {
// true if crawler user agent detected
}
// Pass a user agent as a string
if($CrawlerDetect->isCrawler('Mozilla/5.0 (compatible; Sosospider/2.0; +http://help.soso.com/webspider.htm)')) {
// true if crawler user agent detected
}
2ip.ru CMS Detector (http://2ip.ru/cms/)
We have seen a user agent of Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.21 (KHTML, like Gecko) wpif Safari/537.21 that appears to be non-human. In particular, https://www.webmasterworld.com/search_engine_spiders/4785385.htm appears to agree. Browscap does not. Thoughts?
Is an empty user agent treated as a crawler? Could we add a simple setting so that every empty UA is treated as a crawler?
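Until such a setting exists, a thin wrapper in application code could cover this case. A sketch with a hypothetical function name, where the detector is passed in as a callable:

```php
<?php
// Hypothetical wrapper, not part of CrawlerDetect.
// $isCrawler is any callable taking a user-agent string and returning bool,
// for example [$CrawlerDetect, 'isCrawler'].
function isCrawlerOrEmpty(
    ?string $userAgent,
    callable $isCrawler,
    bool $treatEmptyAsCrawler = true
): bool {
    // Null, empty, and whitespace-only user agents are handled by the flag.
    if ($userAgent === null || trim($userAgent) === '') {
        return $treatEmptyAsCrawler;
    }
    return (bool) $isCrawler($userAgent);
}

// Usage:
// $ua = $_SERVER['HTTP_USER_AGENT'] ?? null;
// $isBot = isCrawlerOrEmpty($ua, [$CrawlerDetect, 'isCrawler']);
```

Keeping the behaviour behind a flag lets sites that see legitimate clients without a user agent switch it off.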
Move tests from Laravel package to this repo
I've added the following to my crawler detection:
substr(gethostbyaddr($_SERVER['REMOTE_ADDR']),-14)==".amazonaws.com"
Amazon AWS crawls with all kinds of random spoofed user agents, so that this was the only way for me to get rid of them for good.
So maybe that's useful for inclusion despite being based on the remote host, and not the user agent.
Google now supports crawling ajax pages. But other bots don't.
I want to check whether a bot is Google or not. How can I do this check?
Thanks
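One common approach is to combine a user-agent check with a reverse DNS lookup and a forward confirmation. The sketch below is a standalone helper, not part of CrawlerDetect. The function name is hypothetical, and the accepted host suffixes (googlebot.com and google.com) are an assumption that should be checked against Google's current documentation:

```php
<?php
// Hypothetical helper. $reverse and $forward are injectable so the logic
// can be exercised without real DNS; they default to PHP's resolvers.
function isVerifiedGooglebot(
    string $ip,
    string $userAgent,
    ?callable $reverse = null,
    ?callable $forward = null
): bool {
    $reverse = $reverse ?? 'gethostbyaddr';
    $forward = $forward ?? 'gethostbyname';

    // Step 1: the user agent must claim to be Googlebot.
    if (stripos($userAgent, 'Googlebot') === false) {
        return false;
    }

    // Step 2: the reverse DNS name must end in an accepted domain (assumed list).
    $host = $reverse($ip);
    if (!is_string($host) || preg_match('/\.(googlebot|google)\.com$/i', $host) !== 1) {
        return false;
    }

    // Step 3: the host name must resolve back to the same IP address.
    return $forward($host) === $ip;
}

// Usage:
// $isGoogle = isVerifiedGooglebot($_SERVER['REMOTE_ADDR'], $_SERVER['HTTP_USER_AGENT'] ?? '');
```

gethostbyname() only returns IPv4 addresses, so IPv6 clients would need a different forward lookup. DNS lookups are slow, so results are worth caching per IP.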
Hi! I just detected another bot in my server logs:
Mozilla/5.0 (compatible;acapbot/0.1)
I see this in my $_SERVER (PHP) variable, and it is not detected:
"[HTTP_USER_AGENT] => Google favicon"
Could you add detection for it in a future release?
Thx
Crawler-Detect started detecting the Slurp crawler instead of Yahoo! Slurp.
User agent: Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)
Can you please make this lib detect such user agents as the Yahoo! Slurp crawler?
PS: http://www.useragentstring.com/ knows Yahoo! Slurp and does not know Slurp.
74.6.254.126 ||| Mozilla/5.0 (compatible; Yahoo Ad monitoring; https://help.yahoo.com/kb/yahoo-ad-monitoring-SLN24857.html)
74.6.254.126 ||| Mozilla/5.0 (iPhone; CPU iPhone OS 7_1 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile (compatible; Yahoo Ad monitoring; https://help.yahoo.com/kb/yahoo-ad-monitoring-SLN24857.html)
66.249.91.34 ||| AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari
209.85.238.93 ||| AdsBot-Google-Mobile (+http://www.google.com/mobile/adsbot.html) Mozilla (iPhone; U; CPU iPhone OS 3 0 like Mac OS X) AppleWebKit (KHTML, like Gecko) Mobile Safari
66.249.92.17 ||| Mozilla/5.0 (iPhone; CPU iPhone OS 9_1 like Mac OS X) AppleWebKit/601.1.46 (KHTML, like Gecko) Version/9.0 Mobile/13B143 Safari/601.1 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html)
Could you please add "OnPageBot" and "Uptrends"? Thanks.