
safesploitorg / doogle
29 stars · 7 watchers · 14 forks · 274 KB

Doogle is a search engine and web crawler which can search indexed websites and images

Home Page: https://search.safesploit.com/

License: MIT License

PHP 76.01% CSS 15.42% JavaScript 5.13% Hack 3.44%
crawler php search search-engine database-search php-search-engine full-text-search indexing mysql google-clone

doogle's People

Contributors: safesploit

doogle's Issues

How to crawl only the suggested site and not external sites

Discussed in #24

Originally posted by benzvi8888 March 10, 2024
When I checked the database, it seems the crawler follows social links and starts indexing their content: it crawled Facebook, Twitter, LinkedIn, and so on, filling my database with those pages.
Is there any way, option, or code change to limit the crawling to the target site itself?
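One way to do what this issue asks, sketched as a hypothetical helper (not part of Doogle itself): compare each candidate link's host against the seed URL's host before following it, so external domains such as facebook.com are never crawled.

```php
<?php
// Hypothetical helper: returns true only when the candidate link stays on
// the same host as the seed URL, keeping the crawl on-site.
function isSameHost(string $seedUrl, string $candidateUrl): bool
{
    $seedHost = parse_url($seedUrl, PHP_URL_HOST);
    $candidateHost = parse_url($candidateUrl, PHP_URL_HOST);
    // Relative links have no host component; treat them as same-site.
    return $candidateHost === null || strcasecmp($seedHost, $candidateHost) === 0;
}

var_dump(isSameHost('https://example.com/a', 'https://example.com/b'));  // bool(true)
var_dump(isSameHost('https://example.com/a', 'https://facebook.com/x')); // bool(false)
var_dump(isSameHost('https://example.com/a', '/relative/path'));         // bool(true)
```

The crawler's link loop could then skip any URL for which this check returns false. Note this keeps subdomains out as well; relaxing that would need a suffix comparison instead of strict host equality.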

Docker?

Do I need to set up Docker to make this script work?

Joseph

Bug: Search box not displaying on iOS Safari

The search box styled by .mainSection .searchContainer .searchBox is not displayed on iOS 16 Safari.

Believed to be an issue related to
border: none; box-shadow: 0px 2px 2px 0px rgba(0,0,0,0.16), 0px 0px 0px 1px rgba(0,0,0,0.08);
not rendering correctly in mobile Safari.

A possible remedy is to attempt a WebKit translation of the rule or to opt for compatible CSS under @media only screen and (max-width: 700px).

Notes

Update as a patch.

Robots file?

Does this search engine respect the robots.txt file, or can it ignore it and index the site anyway?

I'm trying to index my Blogspot domain, but the robots.txt file prevents it. Is there any way to index the site while ignoring robots.txt?
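For context on what "respecting robots.txt" means in a crawler, here is a minimal, hypothetical sketch (the issue is precisely that Doogle's actual behavior is undocumented). It honors only the `User-agent: *` group's `Disallow` prefix rules, which is a simplification of the full robots.txt convention.

```php
<?php
// Minimal robots.txt check (hypothetical sketch): returns true when $path
// is blocked by a "User-agent: *" Disallow rule in $robotsTxt.
function isDisallowed(string $robotsTxt, string $path): bool
{
    $applies = false;
    foreach (preg_split('/\r?\n/', $robotsTxt) as $line) {
        $line = trim(preg_replace('/#.*/', '', $line)); // strip comments
        if (stripos($line, 'User-agent:') === 0) {
            // Only rules in the wildcard group apply to us in this sketch.
            $applies = trim(substr($line, 11)) === '*';
        } elseif ($applies && stripos($line, 'Disallow:') === 0) {
            $rule = trim(substr($line, 9));
            // An empty Disallow means "allow everything".
            if ($rule !== '' && strpos($path, $rule) === 0) {
                return true;
            }
        }
    }
    return false;
}

$robots = "User-agent: *\nDisallow: /private/";
var_dump(isDisallowed($robots, '/private/page.html')); // bool(true)
var_dump(isDisallowed($robots, '/public/page.html'));  // bool(false)
```

A crawler that wanted a "ignore robots.txt" option could simply make this check configurable.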

HEY THIS SCRIPT IS AMAZING i love it

Is it possible to keep it more up to date? Also, has it been fixed for Safari / iPhone mobile users?

Sometimes the crawl hits a server timeout, but it's a great script.
Thanks!

Checkup ? Question (Sitemap Crawl Functionality | Crawling Description Question)

Hey, I wanted to ask you some questions.

I want to use your search engine to index a website from its sitemap.xml (index and crawl the whole content of the website). That way it is easier to point the engine at exactly the pages it needs to search, and much easier to find the content you are looking for.

I followed your README, but each time Doogle crawls a website I find that it only saves the page title and the site-wide description, e.g. on The Hacker News website: when I index and search for a keyword, the result is almost always the same (site-wide) description; the URL is present but the page title is not.

E.g. when I search for Malware, the result presented is:

title: Malware Strains Targeting Python and JavaScript Developers
description: The Hacker News is the most trusted and popular cybersecurity publication for information security professionals seeking breaking news, actionable insights
https://thehackernews.com/2022/12/malware-strains-targeting-python-and.html

Note that the description uses the main website description instead of the blog page's.

I'm not sure if I'm missing something.
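The sitemap-seeding idea in this issue can be sketched as follows. This is a hypothetical helper, not an existing Doogle feature; the element names (`url`, `loc`) come from the standard sitemaps.org format.

```php
<?php
// Hypothetical sketch: extract the list of page URLs from a sitemap.xml
// document, which could then be fed to the crawler as seed URLs instead
// of (or in addition to) following <a href> links.
function urlsFromSitemap(string $xml): array
{
    $urls = [];
    $doc = simplexml_load_string($xml);
    if ($doc === false) {
        return $urls; // not valid XML
    }
    foreach ($doc->url as $entry) {
        $urls[] = (string) $entry->loc;
    }
    return $urls;
}

$sitemap = <<<XML
<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/post-1</loc></url>
</urlset>
XML;

print_r(urlsFromSitemap($sitemap));
```

The per-page title/description problem is separate: it would require the crawler to read each page's own `<title>` and meta description rather than reusing the site-wide one.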

Vulnerable to XSS

In search.php, the search term is passed straight through with no processing.

Line 7: $term = $_GET['term'];

Thus line 18:
<?php if(isset($term) && $term != '') echo($term . ' | '); ?>

Line 53:
<input class="searchBox" type="text" name="term" value="<?php echo $term; ?>" autocomplete="off">

Lines 65 and 70 are likewise vulnerable to XSS.

So navigating to "search.php?type=&term=">''"><b><h1>" would result in a broken page.

Is this a big deal? No. But it's bad practice.

https://github.com/safesploit/doogle/blob/main/search.php
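The standard remediation, sketched below with a simulated malicious input: escape the term once with htmlspecialchars and echo the escaped copy everywhere search.php currently echoes $term (lines 18, 53, 65, 70 per this issue).

```php
<?php
// Sketch of the fix: escape the user-supplied term before echoing it back.
$_GET['term'] = '"><script>alert(1)</script>'; // simulated attacker input

$term = isset($_GET['term']) ? $_GET['term'] : '';

// ENT_QUOTES escapes both single and double quotes, so the value is also
// safe inside attribute contexts like value="...".
$safeTerm = htmlspecialchars($term, ENT_QUOTES, 'UTF-8');

echo $safeTerm; // &quot;&gt;&lt;script&gt;alert(1)&lt;/script&gt;
```

With this in place, the payload from the issue renders as inert text instead of breaking the page markup.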

Bug: Crawling non-ASCII characters (URL)

When crawling the Japanese Wikipedia page ja.wikipedia.org/wiki/メインページ, the following URL is indexed:
https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8
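The percent-encoded form is the correct on-the-wire representation of a UTF-8 URL, so this may be cosmetic rather than a crawl failure. For display or de-duplication, the indexed URL can be decoded back; a sketch (not Doogle's current behavior):

```php
<?php
// rawurldecode() turns the percent-encoded bytes back into the original
// UTF-8 path for display purposes.
$indexed = 'https://ja.wikipedia.org/wiki/%E3%83%A1%E3%82%A4%E3%83%B3%E3%83%9A%E3%83%BC%E3%82%B8';
echo rawurldecode($indexed); // https://ja.wikipedia.org/wiki/メインページ
```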

Memory Leak - calling itself inside of itself, forever, with a solution

Hi!
I've learned something and I've found a solution

Crawler.php

Line 202 foreach($crawling as $site)

My PHP error logs went haywire after running the script for an extended period of time: 971 nested occurrences of followLinks, at which point it crashes.

Stack trace:
#2 Crawler.php(494): Crawler->followLinks()
#3 Crawler.php(494): Crawler->followLinks()
#4 Crawler.php(494): Crawler->followLinks()
#5 Crawler.php(494): Crawler->followLinks()
. . .

and it goes on and on, up to 971 occurrences.

The issue is that the script has been doing the following:
Gets called for Url1 (and doesn't call getDetails for it?)
Views the first a href on Url1.
Visits that a href URL (call it Url2) and calls getDetails on it.
Visits the first a href on Url2 and calls getDetails.
Visits the first a href of the last a href on Url2 from Url1.
Visits the first a href of the last a href of the last a href on Url2 from Url1.
And so on, forever, UNTIL one subprocess has no a hrefs (or all of its a hrefs were processed), at which point it returns to its parent node.

Meaning

The first original array of a href URLs is not fully processed until the sub-processes finish, and for the sub-processes to finish the sub-sub-processes have to finish, until eventually you reach the point where this happens: Allowed memory size of 16582912000 bytes exhausted (tried to allocate 20480 bytes)

Solution!

Line 166: change function followLinks($url) to function followLinks($url, $depth = 0)
Line 170: insert if ($depth >= 12) { return; } (replace 12 with how deep you want it to go)
Line 203: erase and fill with if (isset($site)) { $this->followLinks($site, $depth + 1); }

Alternative Solution

public function count_method_occurrences($method_name) {
    $backtrace = debug_backtrace();
    $count = 0;
    foreach ($backtrace as $trace) {
        if (isset($trace['class']) && isset($trace['function']) && $trace['function'] === $method_name) {
            $count++;
        }
    }
    return $count;
}

Call it via

if ($this->count_method_occurrences('followLinks') < 12) { /* foreach loop from line 202 */ }

*Please note the occurrence count obtained this way will be one higher than $depth.
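The depth-limit fix described above can be demonstrated stand-alone. The class and numbers below are illustrative only (the real method lives in Crawler.php); the sketch simulates the worst case, a page linking back to itself, and shows the cap bounding the recursion.

```php
<?php
// Illustrative sketch of the depth-limit fix: recursion stops once $depth
// reaches the cap, instead of nesting until memory is exhausted.
class CrawlerSketch
{
    public int $visits = 0;

    public function followLinks(array $links, int $depth = 0): void
    {
        if ($depth >= 12) {
            return; // the cap that prevents the runaway recursion
        }
        foreach ($links as $site) {
            $this->visits++;
            // Each page links back to itself here, simulating the worst case.
            $this->followLinks($links, $depth + 1);
        }
    }
}

$c = new CrawlerSketch();
$c->followLinks(['https://example.com/'], 0);
echo $c->visits; // 12: one visit per depth level, then the cap stops it
```

A breadth-first queue of pending URLs would avoid deep recursion altogether, but the depth parameter is the smallest change to the existing recursive design.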

Indexing problem with Polish characters

Email body below.

Hi, I use your doogle and I have a problem with Polish characters: partially indexed pages display Polish characters normally, but on some pages they do not. Do you know how to solve this? Thank you and best regards.

(image attachment)
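A common cause of this kind of partial mangling is a page served in a legacy encoding (ISO-8859-2 is typical for older Polish sites) being stored as if it were UTF-8. A hedged illustration, not a confirmed diagnosis of this report; mb_convert_encoding is the usual conversion tool (the MySQL connection charset, e.g. utf8mb4, also has to match):

```php
<?php
// Illustration: a Polish pangram round-tripped through ISO-8859-2.
// Converting to UTF-8 before indexing keeps characters like ą/ę/ź intact.
$utf8Original = 'Zażółć gęślą jaźń';
$latin2 = mb_convert_encoding($utf8Original, 'ISO-8859-2', 'UTF-8');
$backToUtf8 = mb_convert_encoding($latin2, 'UTF-8', 'ISO-8859-2');
echo $backToUtf8; // Zażółć gęślą jaźń
```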
