adamdehaven / fetchurls

A bash script to spider a site, follow links, and fetch urls (with built-in filtering) into a generated text file.

Home Page: https://www.adamdehaven.com/blog/easily-crawl-a-website-and-fetch-all-urls-with-a-shell-script/

License: MIT License

Shell 100.00%
bash-scripting crawl shell-script spider urls website wget

fetchurls's Issues

Question - fetchUrlsForDomain

In fetchUrlsForDomain, grep is used to search for user-excluded extensions after wget has downloaded the files. I couldn't find a written explanation of why extensions are filtered this way rather than with wget --reject or wget --reject-regex. Was this for reliability?

Not sure if this was the right place or way to ask, I'm very new and inexperienced. Thanks!
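
For readers comparing the two approaches, here is a minimal sketch of post-download filtering with grep. It does not reproduce the script's actual code or explain the author's choice; the function name filter_excluded_extensions and the "|"-separated extension list are assumptions made for illustration.

```shell
# Hypothetical helper (not taken from fetchurls.sh): read URLs on stdin and
# drop those whose path ends in one of the excluded extensions.
# $1 is a "|"-separated list such as "css|js|png" (an assumed format).
filter_excluded_extensions() {
    local extensions="$1"
    # -E: extended regex, -i: ignore case, -v: print only non-matching lines.
    # The optional (\?.*)? group also catches URLs with a query string.
    grep -E -i -v "\.(${extensions})(\?.*)?$"
}

# Example: only the first URL is printed.
printf '%s\n' \
    'https://example.com/about' \
    'https://example.com/style.css' \
    'https://example.com/app.js?v=2' |
    filter_excluded_extensions 'css|js'
```

Filtering the finished list this way keeps the crawl and the exclusion rules separate, at the cost of fetching URLs that are discarded afterwards.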

not working on kali linux

Hi,
I hope you're doing well.
fetchurls is not working on Kali Linux, but the same process works fine on cPanel.

Bash Error

Hey Adam!

Really cool concept. Do you know what this error might mean?

fetchurls-master/fetchurls.sh: line 61: read: -i: invalid option

Thanks,
Charlie
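
For context, the -i option of read (which pre-fills the editable line) was added in bash 4, so bash 3.2 rejects it. Below is a minimal sketch of a fallback, assuming a hypothetical helper named prompt_with_default that is not part of fetchurls.sh.

```shell
# Hypothetical helper (not part of fetchurls.sh): ask for a value and fall
# back to a default when the user just presses Enter.
prompt_with_default() {
    local prompt="$1" default="$2" reply
    if (( BASH_VERSINFO[0] >= 4 )); then
        # bash 4+: -i pre-fills the line; it only takes effect together with -e
        read -r -e -p "${prompt}: " -i "${default}" reply
    else
        # bash 3.2: no -i, so show the default inside the prompt instead
        read -r -p "${prompt} [${default}]: " reply
    fi
    # An empty answer means "accept the default"
    printf '%s\n' "${reply:-$default}"
}

# Usage (interactive):
#   SAVEFILENAME=$(prompt_with_default "Save txt file as" "example-com")
```

The result is printed on stdout so the caller can capture it; the prompt itself is only shown when input comes from a terminal.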

v3.2.3 Non-interactive mode not allowing a few attributes

Describe the bug
Version: v3.2.3
Non-interactive mode always overrides the -f, -l, and -e attributes with their defaults.

To Reproduce
./fetchurls.sh -t -d http://example.com -l /tmp -f file -e "css"

Just review the troubleshooting output.

Made the following changes to correct issue.

diff fetchurls.sh fetchurls.sh-ORIG
493c493
< elif [ -z "$USER_SAVE_LOCATION" ] && [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
---
> else
508c508
< elif [ -z "$USER_FILENAME" ] && [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
---
> else
521c521
< elif [ -z "$USER_EXCLUDED_EXTENTIONS" ] && [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
---
> else

Thanks for your efforts.
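
The intent of the change can be sketched as follows. Only USER_SAVE_LOCATION and RUN_NONINTERACTIVE come from the diff above; the function name, the first branch, and the prompt text are assumptions, since the surrounding lines of fetchurls.sh are not shown in this report.

```shell
# Sketch only: resolve the save location so that a value passed on the
# command line wins, and the default is used in non-interactive mode only
# when no value was passed.
resolve_save_location() {
    local default_location="$1" reply
    if [ -n "$USER_SAVE_LOCATION" ]; then
        # Assumed branch: an explicit value is used as-is
        printf '%s\n' "$USER_SAVE_LOCATION"
    elif [ "$RUN_NONINTERACTIVE" -eq 1 ]; then
        # No value and non-interactive: fall back to the default
        printf '%s\n' "$default_location"
    else
        # No value and interactive: ask (hypothetical prompt text)
        read -r -p "Save file to directory [${default_location}]: " reply
        printf '%s\n' "${reply:-$default_location}"
    fi
}

# Usage:
#   RUN_NONINTERACTIVE=1 USER_SAVE_LOCATION=/tmp
#   resolve_save_location "$HOME/Desktop"
```

The same pattern would apply to USER_FILENAME and USER_EXCLUDED_EXTENTIONS.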

Grep problem

I do not have option --max-count in grep (BSD grep) 2.5.1-FreeBSD under MacOS
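
One portable way to stop after the first match without relying on --max-count is to pipe grep into head. This is a sketch under the assumption that the script only needs the first matching line; the function name first_match is hypothetical, and how fetchurls.sh actually uses --max-count is not shown here.

```shell
# Hypothetical helper: print only the first line of a file that matches an
# extended regex, without using grep's --max-count option.
first_match() {
    local pattern="$1" file="$2"
    # head exits after one line, which ends the pipeline early
    grep -E "$pattern" "$file" | head -n 1
}

# Usage:
#   first_match '^https?://' urls.txt
```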

Fixes

I suggest you set a trap in your script, so that if someone presses Ctrl-D or the script fails, the terminal color is reset for the user. Insert this at the top of your script:

resetcolor () { # this is just a rough example; it can be implemented better
echo -e "\033[0m"
}

trap resetcolor INT TERM EXIT

You also need to create the directory that the script suggests. It fails to find the directory because it has not been created:

echo "${COLOR_RESET}# "
read -e -p "# Save txt file as: ${COLOR_CYAN}" -i "${filename}" SAVEFILENAME

mkdir -p $savelocation/$SAVEFILENAME # you need to place this here

savefilename=$SAVEFILENAME

I'd suggest you rewrite it, as it's a bit messy and won't handle syntax errors properly.
What if someone has renamed their Desktop folder, deleted it, or wants to put the file somewhere else? Crazy, I know, but your script will fail.

What if someone enters an incomplete or nonexistent URL, for example http://www.goodfg.comdwdd?
You could handle this in a while true loop and check that the URL exists.

Here is something I threw together as an example:

while true
do
# CHECK IF URL EXISTS AND IS ONLINE
if [[ "$#" -eq 0 ]] ; then
       echo -e "${YELLOW}Example${NC} : checkurl msn.com"
       echo
       break
fi

# First check that the URL answers a HEAD request at all
CHECKURL=""
if curl --output /dev/null --silent --head --fail "$CONVERTCASE"; then
      # Then ask a third-party checker; grep prints the phrase only when the site is reported as up
      CHECKURL=$(lynx -dump "http://downforeveryoneorjustme.com/$CONVERTCASE" | grep -o "It's just you")
fi

if [[ -n "$CHECKURL" ]] ; then
      echo -en "Url exists so lets continue" > /dev/null
      break
else
      echo -e "Url doesn't exist lets try again"
      sleep 2
fi
done

[Feature Request] Add Option for User-Agent

Hey Adam, thank you for your nice little script! :-)

I ran into the problem that the website I'd like to fetch all URLs from blocked the default wget User-Agent (currently "Wget/1.20.3 (linux-gnu)").
To progress with my task I manually changed your (current) script and added a User-Agent string (to the wget command) and it worked very well.

Question: Are you willing to add an option for the User-Agent?

If yes, I would prepare a PR … :-)

KR
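
A minimal sketch of what such an option could look like, assuming the wget arguments are collected in an array before the call. The function name build_wget_args, the WGET_ARGS array, and the base flags are illustrative and not taken from fetchurls.sh; --user-agent itself is a standard wget option.

```shell
# Hypothetical helper: build the wget argument list, adding a custom
# User-Agent only when one was supplied.
build_wget_args() {
    local user_agent="$1"
    # Assumed base flags for a crawl; the real script may use others
    WGET_ARGS=(--spider --recursive --no-verbose)
    if [ -n "$user_agent" ]; then
        WGET_ARGS+=("--user-agent=${user_agent}")
    fi
}

# Usage:
#   build_wget_args "Mozilla/5.0 (X11; Linux x86_64)"
#   wget "${WGET_ARGS[@]}" "https://example.com"
```

Keeping the flag optional means the default wget User-Agent is still sent when the user passes nothing.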

Permission issue to write file

Hey,

I keep getting the following errors, and it never asks me for the location where I want to store the file. It did prompt me to enter the URL of the website I wanted to fetch the URLs from. I'm not sure how to resolve the permission issue so that it can write the file and ask me for the file name.

#    
#    Save file to location
/Users/varunkhanduja/Desktop/fetchurls-master/fetchurls.sh: line 103: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
usage: mkdir [-pv] [-m mode] directory ...
#    
#    Save file as
/Users/varunkhanduja/Desktop/fetchurls-master/fetchurls.sh: line 110: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
#    
#    Fetching URLs for 
#    
/Users/varunkhanduja/Desktop/fetchurls-master/fetchurls.sh: line 64: /.txt: Permission denied
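
The last error is consistent with both the save location and the file name being empty after the failed read calls, which would leave the script trying to write to /.txt. That reading is an assumption based on the output above. A guard along these lines would fail early with a clearer message; the function name build_output_path is hypothetical.

```shell
# Hypothetical helper: join directory and file name into an output path,
# refusing empty parts so the script never tries to write to "/.txt".
build_output_path() {
    local dir="$1" name="$2"
    if [ -z "$dir" ] || [ -z "$name" ]; then
        echo "Save location and file name must not be empty" >&2
        return 1
    fi
    # Strip one trailing slash from the directory to avoid "//"
    printf '%s/%s.txt\n' "${dir%/}" "$name"
}

# Usage:
#   output_file=$(build_output_path "$HOME/Desktop" "example-com") || exit 1
```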

[Bug Report] does not run in Mac OS because of bash v3.2 - workaround

Hello, thank you very much for writing this script and taking the time to make it available. I have been looking for such a thing for quite a while now and lack any of the skills to write it.

I have found and somewhat solved a compatibility issue with Mac OS. Sorry it's so long but I don't know how to do this properly and what can be left out. Also I have to admit I am extremely excited because I have never even halfway solved a programming(ish) problem in my whole life.

Summary

  • Script does not run on Mac OS because Mac OS ships only bash v3.2.
  • How to install bash 5 in Mac OS
  • A way to edit the .sh file so the correct version of bash will be called

Describe the bug

I read in #1 that this script requires bash >v4. So I investigated and found out that even modern up to date versions of Mac OS are running bash 3.2 unless it has been manually upgraded. This has something to do with Apple being unable or unwilling to comply with the GPL requirements for v >4. See this SE thread among other discussions online.

Also of note (something I only learned recently despite being a regular if casual terminal user for many years) that zsh has been the default shell in Mac OS for some time now. Probably because they didn't want to have an out of date shell for the rest of time.

To Reproduce

  1. download and run per instructions
$ ./fetchurls.sh

Fetch a list of unique URLs for a domain.

Enter the full domain URL ( https://example.com )
Domain URL: https://quotes.toscrape.com
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
	[-e pattern] [-f file] [--binary-files=value] [--color=when]
	[--context[=num]] [--directories=action] [--label] [--line-buffered]
	[--null] [pattern] [file ...]
usage: grep [-abcDEFGHhIiJLlmnOoqRSsUVvwxZ] [-A num] [-B num] [-C[num]]
	[-e pattern] [-f file] [--binary-files=value] [--color=when]
	[--context[=num]] [--directories=action] [--label] [--line-buffered]
	[--null] [pattern] [file ...]

Save file to directory
./fetchurls.sh: line 492: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]
usage: mkdir [-pv] [-m mode] directory ...

Save file as
./fetchurls.sh: line 505: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]

Exclude files with matching extensions
./fetchurls.sh: line 518: read: -i: invalid option
read: usage: read [-ers] [-u fd] [-t timeout] [-p prompt] [-a array] [-n nchars] [-d delim] [name ...]

Fetching URLs for

./fetchurls.sh: line 358: /.txt: Permission denied
^Cease wait... [ | ]

Environment

To verify bash version:

$ bash --version
bash --version
GNU bash, version 3.2.57(1)-release (x86_64-apple-darwin18)
Copyright (C) 2007 Free Software Foundation, Inc.

and also

$ which bash
which bash
/bin/bash

Apparently this is where bash 3 lives. Bash 4 lives in /usr/local/bin/bash

Workaround

  1. Install homebrew if not already.

  2. Install up to date bash (version 5 at time of writing)

    brew install bash
  3. bash 3 will still be active, and the Internet says it is better not to replace it completely in case you need it some day. The instructions I looked at all assumed you would like to make bash 5 your default shell. However, I am happy with zsh, so I was able to find this, which I believe makes it the default only when you call bash

    sudo bash -c 'echo /usr/local/bin/bash >> /etc/shells'

So now if I run bash --version or which bash it still returns v 3 as originally. (The Internet said which should return both, but it didn't, strangely.) However if I actually switch to bash (by inputting command bash) and enter those commands it will report v5.

Even when I ran the script from the bash 5 prompt instead of zsh, I still got same errors. (If you are thinking "this guy doesn't know anything about shell scripting" you are correct.) However I changed the top from

#!/bin/shift

to

#!/usr/local/bin/bash

and the script ran. But it was taking so long to run that I thought it was hanging and wouldn't complete, and that's when I noticed the original said shift, not bash, which is the only thing I've ever seen at the top of a script. So I went to find out what that means, and after some dead ends I found something that sounds reasonable but that I don't understand. By the time I had finished reading that page, the script had completed. It ran just fine without shift.

I also tried running it with both shebangs (new word I learned today) included but the result was the same. I ran the script in interactive mode with all defaults so perhaps a problem will arise at a later date. I'm sure you already have a good idea of the answer to that question.

Someone who understands what is going on here would probably be able to find a better solution but this seems to work.
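
One option that avoids hard-coding /usr/local/bin/bash is the common "#!/usr/bin/env bash" shebang, which runs the first bash found on PATH, combined with a version check near the top of the script. The sketch below assumes a hypothetical function named require_bash_major. Whether the newer bash comes first on PATH depends on the machine; in the report above, which bash still returned /bin/bash.

```shell
# Hypothetical guard: stop early with a clear message when the running
# bash is older than the required major version.
require_bash_major() {
    local required="$1"
    if (( BASH_VERSINFO[0] < required )); then
        echo "This script needs bash ${required} or newer (found ${BASH_VERSION})." >&2
        return 1
    fi
}

# Usage near the top of a script:
#   require_bash_major 4 || exit 1
```

With a check like this, bash 3.2 users would see one explanatory line instead of the repeated "read: -i: invalid option" errors.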
