
safesploitorg / doogle
29 stars · 7 watchers · 14 forks · 274 KB

Doogle is a search engine and web crawler which can search indexed websites and images

Home Page: https://search.safesploit.com/

License: MIT License

PHP 76.01% CSS 15.42% JavaScript 5.13% Hack 3.44%
crawler php search search-engine database-search php-search-engine full-text-search indexing mysql google-clone

doogle's People

Contributors: safesploit

doogle's Issues

How to crawl only the suggested site and not external sites

Discussed in #24

Originally posted by benzvi8888 March 10, 2024
When I checked the database, it seems the crawler follows social links and starts indexing their content: it crawled Facebook, Twitter, LinkedIn, and so on, filling my database with those pages.
Is there any way, option, or code change to limit the crawling to the target site itself?
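One way to do what this issue asks, sketched as a hypothetical helper (not part of Doogle itself): compare each candidate link's host against the seed URL's host before following it, so external domains such as facebook.com are never crawled.

```php
<?php
// Hypothetical helper: returns true only when the candidate link stays on
// the same host as the seed URL, keeping the crawl on-site.
function isSameHost(string $seedUrl, string $candidateUrl): bool
{
    $seedHost = parse_url($seedUrl, PHP_URL_HOST);
    $candidateHost = parse_url($candidateUrl, PHP_URL_HOST);
    // Relative links have no host component; treat them as same-site.
    return $candidateHost === null || strcasecmp($seedHost, $candidateHost) === 0;
}

var_dump(isSameHost('https://example.com/a', 'https://example.com/b'));  // bool(true)
var_dump(isSameHost('https://example.com/a', 'https://facebook.com/x')); // bool(false)
var_dump(isSameHost('https://example.com/a', '/relative/path'));         // bool(true)
```

The crawler's link loop could then skip any URL for which this check returns false. Note this keeps subdomains out as well; relaxing that would need a suffix comparison instead of strict host equality.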

Docker?

Do I need to set up Docker to make this script work?

Joseph

Bug: Search box not displaying on iOS Safari

The search box styled by .mainSection .searchContainer .searchBox is not displayed on iOS 16 Safari.

Believed to be an issue related to
border: none; box-shadow: 0px 2px 2px 0px rgba(0,0,0,0.16), 0px 0px 0px 1px rgba(0,0,0,0.08);
not rendering correctly in mobile Safari.

A possible remedy is to attempt a WebKit translation of the rule or to opt for compatible CSS under @media only screen and (max-width: 700px).

Notes

Update as a patch.

Robots file?

Does this search engine respect the robots.txt file, or can it ignore it and index the site anyway?

I'm trying to index my Blogspot domain, but the robots.txt file prevents it. Is there any way to index the site while ignoring robots.txt?
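For context on what "respecting robots.txt" means in a crawler, here is a minimal, hypothetical sketch (the issue is precisely that Doogle's actual behavior is undocumented). It honors only the `User-agent: *` group's `Disallow` prefix rules, which is a simplification of the full robots.txt convention.

```php
<?php
// Minimal robots.txt check (hypothetical sketch): returns true when $path
// is blocked by a "User-agent: *" Disallow rule in $robotsTxt.
function isDisallowed(string $robotsTxt, string $path): bool
{
    $applies = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            // Only rules in the wildcard group apply to us in this sketch.
            $applies = trim(substr($line, 11)) === '*';
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            // An empty Disallow means "allow everything".
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true;
            }
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /private/";
var_dump(isDisallowed($robots, '/private/page.html')); // bool(true)
var_dump(isDisallowed($robots, '/public/page.html'));  // bool(false)
```

A crawler that wanted a "ignore robots.txt" option could simply make this check configurable.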

HEY THIS SCRIPT IS AMAZING i love it

Is it possible to keep it more up to date? Also, has it been fixed for Safari / iPhone mobile users?

Sometimes the crawl hits a server timeout, but it's a great script.
Thanks!

Checkup ? Question (Sitemap Crawl Functionality | Crawling Description Question)

Hey, I wanted to ask you some questions.

I want to use your search engine to index a website from its sitemap.xml (index and crawl the whole content of the website). That way it is easier to point the engine at exactly the pages it needs to search, and much easier to find the content you are looking for.

I followed your README, but each time Doogle crawls a website I find that it only saves the page title and the site-wide description, e.g. on The Hacker News website: when I index and search for a keyword, the result is almost always the same (site-wide) description; the URL is present but the page title is not.

E.g. when I search for Malware, the result presented is:

title: Malware Strains Targeting Python and JavaScript Developers
description: The Hacker News is the most trusted and popular cybersecurity publication for information security professionals seeking breaking news, actionable insights
https://thehackernews.com/2022/12/malware-strains-targeting-python-and.html

Note that the description uses the main website description instead of the blog page's.

I'm not sure if I'm missing something.
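The sitemap-seeding idea in this issue can be sketched as follows. This is a hypothetical helper, not an existing Doogle feature; the element names (`url`, `loc`) come from the standard sitemaps.org format.

```php
<?php
// Hypothetical sketch: extract the list of page URLs from a sitemap.xml
// document, which could then be fed to the crawler as seed URLs instead
// of (or in addition to) following <a href> links.
function urlsFromSitemap(string $xml): array
{
    $urls = [];
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return $urls; // not valid XML
    }
    foreach ($doc->url as $entry) {
        $urls[] = (string) $entry->loc;
    }
    return $urls;
}

$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/post-1</loc></url>
</urlset>
XML;

print_r(urlsFromSitemap($sitemap));
```

The per-page title/description problem is separate: it would require the crawler to read each page's own `<title>` and meta description rather than reusing the site-wide one.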

Vulnerable to XSS

In search.php, the search term is passed straight through with no processing.

Line 7: $term = $_GET['term'];

Thus line 18:
<?php if(isset($term) && $term != '') echo($term . ' | '); ?>

Line 53:
<input class="searchBox" type="text" name="term" value="<?php echo $term; ?>" autocomplete="off">

Lines 65 and 70 are likewise vulnerable to XSS.

So navigating to "search.php?type=&term=">''"><b><h1>" would result in a broken page.

Is this a big deal? No. But it's bad practice.

https://github.com/safesploit/doogle/blob/main/search.php
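The standard remediation, sketched below with a simulated malicious input: escape the term once with htmlspecialchars and echo the escaped copy everywhere search.php currently echoes $term (lines 18, 53, 65, 70 per this issue).

```php
<?php
// Sketch of the fix: escape the user-supplied term before echoing it back.
$_GET['term'] = '"><script>alert(1)</script>'; // simulated attacker input

$term = isset($_GET['term']) ? $_GET['term'] : '';

// ENT_QUOTES escapes both single and double quotes, so the value is also
// safe inside attribute contexts like value="...".
$safeTerm = htmlspecialchars($term, ENT_QUOTES, 'UTF-8');

echo $safeTerm; // &quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;
```

With this in place, the payload from the issue renders as inert text instead of breaking the page markup.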

Bug: Crawling non-ASCII characters (URL)

When crawling the Japanese Wikipedia page ja.wikipedia.org/wiki/メインページ, the following URL is indexed:
https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
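The percent-encoded form is the correct on-the-wire representation of a UTF-8 URL, so this may be cosmetic rather than a crawl failure. For display or de-duplication, the indexed URL can be decoded back; a sketch (not Doogle's current behavior):

```php
<?php
// rawurldecode() turns the percent-encoded bytes back into the original
// UTF-8 path for display purposes.
$indexed = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8';
echo rawurldecode($indexed); // https://ja.wikipedia.org/wiki/メインページ
```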

Memory Leak - calling itself inside of itself, forever, with a solution

Hi!
I've learned something and I've found a solution

Crawler.php

Line 202 foreach($crawling as $site)

My PHP error logs went haywire after running the script for an extended period of time: 971 nested occurrences of followLinks, at which point it crashes.

Stack trace:
#2 Crawler.php(494): Crawler->followLinks()
#3 Crawler.php(494): Crawler->followLinks()
#4 Crawler.php(494): Crawler->followLinks()
#5 Crawler.php(494): Crawler->followLinks()
. . .

and it goes on and on, up to 971 occurrences.

The issue is that the script has been doing the following:
Gets called for Url1 (and doesn't call getDetails for it?)
Views the first a href on Url1.
Visits that a href URL (call it Url2) and calls getDetails on it.
Visits the first a href on Url2 and calls getDetails.
Visits the first a href of the last a href on Url2 from Url1.
Visits the first a href of the last a href of the last a href on Url2 from Url1.
And so on, forever, UNTIL one subprocess has no a hrefs (or all of its a hrefs were processed), at which point it returns to its parent node.

Meaning

The first original array of a href URLs is not fully processed until the sub-processes finish, and for the sub-processes to finish the sub-sub-processes have to finish, until eventually you reach the point where this happens: Allowed memory size of 16582912000 bytes exhausted (tried to allocate 20480 bytes)

Solution!

Line 166: change function followLinks($url) to function followLinks($url, $depth = 0)
Line 170: insert if ($depth >= 12) { return; } (replace 12 with how deep you want it to go)
Line 203: erase and fill with if (isset($site)) { $this->followLinks($site, $depth + 1); }

Alternative Solution

public function count_method_occurrences($method_name) {
    $backtrace = debug_backtrace();
    $count = 0;
    foreach ($backtrace as $trace) {
        if (isset($trace['class']) && isset($trace['function']) && $trace['function'] === $method_name) {
            $count++;
        }
    }
    return $count;
}

Call it via

if ($this->count_method_occurrences('followLinks') < 12) { /* foreach loop from line 202 */ }

*Please note the occurrence count obtained this way will be one higher than $depth.
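The depth-limit fix described above can be demonstrated stand-alone. The class and numbers below are illustrative only (the real method lives in Crawler.php); the sketch simulates the worst case, a page linking back to itself, and shows the cap bounding the recursion.

```php
<?php
// Illustrative sketch of the depth-limit fix: recursion stops once $depth
// reaches the cap, instead of nesting until memory is exhausted.
class CrawlerSketch
{
    public int $visits = 0;

    public function followLinks(array $links, int $depth = 0): void
    {
        if ($depth >= 12) {
            return; // the cap that prevents the runaway recursion
        }
        foreach ($links as $site) {
            $this->visits++;
            // Each page links back to itself here, simulating the worst case.
            $this->followLinks($links, $depth + 1);
        }
    }
}

$c = new CrawlerSketch();
$c->followLinks(['https://example.com/'], 0);
echo $c->visits; // 12: one visit per depth level, then the cap stops it
```

A breadth-first queue of pending URLs would avoid deep recursion altogether, but the depth parameter is the smallest change to the existing recursive design.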

Indexing problem with Polish characters

Email body below.

Hi, I use your doogle and I have a problem with Polish characters: partially indexed pages display Polish characters normally, but on some pages they do not. Do you know how to solve this? Thank you and best regards.

(image attachment)
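A common cause of this kind of partial mangling is a page served in a legacy encoding (ISO-8859-2 is typical for older Polish sites) being stored as if it were UTF-8. A hedged illustration, not a confirmed diagnosis of this report; mb_convert_encoding is the usual conversion tool (the MySQL connection charset, e.g. utf8mb4, also has to match):

```php
<?php
// Illustration: a Polish pangram round-tripped through ISO-8859-2.
// Converting to UTF-8 before indexing keeps characters like ą/ę/ź intact.
$utf8Original = 'Zażółć gęślą jaźń';
$latin2 = mb_convert_encoding($utf8Original, 'ISO-8859-2', 'UTF-8');
$backToUtf8 = mb_convert_encoding($latin2, 'UTF-8', 'ISO-8859-2');
echo $backToUtf8; // Zażółć gęślą jaźń
```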
