safesploitorg / doogle
Doogle is a search engine and web crawler which can search indexed websites and images
Home Page: https://search.safesploit.com/
License: MIT License
Originally posted by benzvi8888 March 10, 2024
When I checked the database, it seems the crawler follows social links and indexes their content: it crawled Facebook, Twitter, LinkedIn and so on, and filled my database with them.
Is there any way, option, or code to limit crawling to the site itself?
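One way to approach the request above is to compare the host of each discovered link with the host of the start URL before following it. The following is a minimal sketch, not Doogle's own code: the helper name isSameHost and the example.com URLs are assumptions made for illustration, and where such a check would be called inside the crawler is left open.

```php
<?php
// Hypothetical helper (not part of Doogle): returns true when $link stays
// on the same host as $startUrl.
function isSameHost(string $startUrl, string $link): bool
{
    // Skip non-web schemes such as mailto: or javascript:.
    $scheme = parse_url($link, PHP_URL_SCHEME);
    if (is_string($scheme) && !in_array(strtolower($scheme), ['http', 'https'], true)) {
        return false;
    }

    $startHost = parse_url($startUrl, PHP_URL_HOST);
    $linkHost  = parse_url($link, PHP_URL_HOST);

    // Relative links have no host component, so they stay on the site.
    if ($linkHost === null) {
        return true;
    }
    // parse_url() returns false for seriously malformed URLs.
    if (!is_string($startHost) || !is_string($linkHost)) {
        return false;
    }
    // Host names are case-insensitive.
    return strcasecmp($startHost, $linkHost) === 0;
}

// Example link list (assumed data).
$links = [
    'https://example.com/about',
    '/contact',
    'https://www.facebook.com/example',
    'https://twitter.com/example',
];

$internal = array_values(array_filter($links, function ($link) {
    return isSameHost('https://example.com/', $link);
}));

print_r($internal);
```

Note that this comparison is strict: www.example.com and example.com count as different hosts, so a real crawl may need to normalise the host first.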
Do I need to set up Docker to make this script work?
Joseph
The search box (.mainSection .searchContainer .searchBox) is not displaying on iOS 16 Safari.
Believed to be an issue related to
border: none; box-shadow: 0px 2px 2px 0px rgba(0,0,0,0.16), 0px 0px 0px 1px rgba(0,0,0,0.08);
not rendering correctly in mobile Safari.
Possible remedy: attempt a WebKit translation of these rules, or opt for compatible CSS under @media only screen and (max-width: 700px).
Update as a patch.
Does this search engine follow the robots.txt file, or can it ignore it and index the site either way?
I am trying to index my Blogspot domain, but the robots.txt file prevents it. Is there any way to index the site while ignoring robots.txt?
The crawl.php code as it is, at line 160 (line 153 for crawl-manual), could use these bits:
$spacer_size = 8; // increment me until it works
echo str_pad('', (1024 * $spacer_size), "\n");
if(ob_get_level()) ob_end_clean();
It will prevent an inevitable 524 Timeout if you are using Cloudflare services (I ran into this issue myself).
Source of fix: https://github.com/marcialpaulg/Fixing-Cloudflare-Error-524
Example in action:
dd72700
Is it possible to have it updated? Also, is it fixed for Safari and iPhone (mobile) users?
Sometimes the crawl hits a server timeout, but it's a great script.
Thanks.
Originally posted by autodatabases December 11, 2023
Empty query returns all data from tables!
Hey, I wanted to ask you some questions.
I want to use your search engine to index a website using its sitemap.xml (index and crawl the whole content of the website). That way it would be easier to point the engine at the pages it needs to search, and much easier to find the content you are looking for.
I followed your README file, but each time Doogle crawls a website I find that it only saves the page title and the website description. Take The Hacker News website as an example: when I index it and search for a keyword, the description is almost the same for every result, while the URL is present and the title differs.
E.g. when I search for Malware, the result presented is
title: Malware Strains Targeting Python and JavaScript Developers
description: The Hacker News is the most trusted and popular cybersecurity publication for information security professionals seeking breaking news, actionable insights
https://thehackernews.com/2022/12/malware-strains-targeting-python-and.html
See, the description uses the main website description instead of the blog page's.
I am not sure if I am missing something.
I have seen some other versions, but they quit working with newer versions of PHP. So I would like to know whether you are at least using PHP 7 or above.
In search.php, the search term is handed off directly with no processing.
Line 7: $term = $_GET['term'];
Thus line 18:
<?php if(isset($term) && $term != '') echo($term . ' | '); ?>
line 53:
<input class="searchBox" type="text" name="term" value="<?php echo $term; ?>" autocomplete="off">
and lines 65 and 70 are all vulnerable to XSS.
So navigating to "search.php?type=&term=">''"><b><h1>" would result in a broken page.
Is this a big deal? No. But it's bad practice.
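A common remedy for this class of problem is to escape the term at every point where it is written into HTML. The sketch below uses PHP's built-in htmlspecialchars(); the wrapper name escapeHtml is an assumption for illustration, and the term is hard-coded here rather than read from $_GET so the example runs on its own.

```php
<?php
// Hypothetical helper: escape a value for HTML text or a quoted attribute.
function escapeHtml(string $value): string
{
    // ENT_QUOTES also escapes single quotes; ENT_SUBSTITUTE replaces
    // invalid UTF-8 sequences instead of returning an empty string.
    return htmlspecialchars($value, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');
}

// In search.php this value would come from $_GET['term'].
$term = '"><b><h1>';

// Title fragment and input field, with the term escaped on output.
echo escapeHtml($term) . ' | ' . "\n";
echo '<input class="searchBox" type="text" name="term" value="'
    . escapeHtml($term) . '" autocomplete="off">' . "\n";
```

Escaping on output rather than on input keeps the raw term available for the database query, which should be handled separately with bound parameters.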
When crawling the Japanese Wikipedia ja.wikipedia.org/wiki/メインページ
the following URL is indexed
https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
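The two forms identify the same page: the indexed URL is the percent-encoded UTF-8 form of the Japanese path. If the goal is a readable URL in the results (an assumption, since the report does not say), one option is to keep the encoded form in the database and decode it for display only, for example with PHP's rawurldecode():

```php
<?php
// URL as stored by the crawler (taken from the report above).
$indexed = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8';

// Decode the percent-encoded bytes for display purposes only.
$readable = rawurldecode($indexed);

echo $readable, "\n";
```

The decoded string should still be HTML-escaped before it is printed into a page.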
Hi!
I've learned something and I've found a solution
foreach($crawling as $site)
My PHP error logs went haywire after running the script for an extended period of time: the same stack frame appears 971 times, and the script crashes at that point.
Stack trace:
#2 Crawler.php(494): Crawler->followLinks()
#3 Crawler.php(494): Crawler->followLinks()
#4 Crawler.php(494): Crawler->followLinks()
#5 Crawler.php(494): Crawler->followLinks()
. . .
and it goes on and on, up to 971 occurrences.
The issue is that followLinks() calls itself from within itself (Crawler.php line 494); this means that what the script has been doing is:
It gets called for URL1 (and doesn't call getDetails for it?).
It views the first a href on URL1.
It visits that a href URL (let's call it URL2) and calls getDetails on it.
It visits the first a href on URL2 and calls getDetails.
It visits the first a href on the page reached from URL2.
It visits the first a href on the page reached from that one.
And so on, forever, UNTIL one subprocess has no a hrefs left or all of its a hrefs have been processed; then it returns to its parent node.
The original array of a href URLs is not fully processed until the sub-processes finish, and for the sub-processes to finish the sub-sub-processes have to finish. Eventually you reach a point where this happens: Allowed memory size of 16582912000 bytes exhausted (tried to allocate 20480 bytes)
Line 166: change function followLinks($url)
to function followLinks($url, $depth = 0)
Line 170: insert if ($depth >= 12) {return;}
(replace 12 with how deep you want it to go)
Line 203: erase and fill with if(isset($site)){$this->followLinks($site, $depth + 1);}
public function count_method_occurrences($method_name) {
    $backtrace = debug_backtrace();
    $count = 0;
    foreach ($backtrace as $trace) {
        if (isset($trace['class']) && isset($trace['function']) && $trace['function'] === $method_name) {
            $count++;
        }
    }
    return $count;
}
Call it via
if ($this->count_method_occurrences('followLinks') < 12) { foreach(){} }
*Please note that the count returned this way will be one higher than $depth.
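The depth-parameter fix described above can be tried in isolation. The following is a self-contained sketch, not the project's Crawler class: DepthLimitedCrawler, the in-memory $graph standing in for fetched pages, and the visited-URL check are all assumptions added for illustration. Only the $depth parameter and the early return mirror the change proposed in the report.

```php
<?php
// Hypothetical stand-in for the crawler: $graph maps a URL to the links on it.
final class DepthLimitedCrawler
{
    private $graph;
    private $maxDepth;
    private $visited = [];

    public function __construct(array $graph, int $maxDepth)
    {
        $this->graph = $graph;
        $this->maxDepth = $maxDepth;
    }

    public function followLinks(string $url, int $depth = 0): void
    {
        // Stop once the depth limit is reached (the report suggests 12).
        // The visited check is an addition, to avoid re-crawling loops.
        if ($depth >= $this->maxDepth || isset($this->visited[$url])) {
            return;
        }
        $this->visited[$url] = $depth;

        $links = isset($this->graph[$url]) ? $this->graph[$url] : [];
        foreach ($links as $site) {
            $this->followLinks($site, $depth + 1);
        }
    }

    public function visited(): array
    {
        return $this->visited;
    }
}

// Small cyclic link graph (assumed data): a -> b -> c -> a, and c -> d.
$graph = ['a' => ['b'], 'b' => ['c'], 'c' => ['a', 'd'], 'd' => []];

$crawler = new DepthLimitedCrawler($graph, 3);
$crawler->followLinks('a');
print_r($crawler->visited());
```

With a limit of 3, the crawl records a, b and c at depths 0, 1 and 2 and stops before d, so the call stack can never grow beyond the chosen limit.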