### Confetcher - Go - Concurrent Crawler

Install the Go compiler first: http://golang.org/doc/install

This crawler (creatively named concurrent_crawler.go :P) takes a file containing the URLs to crawl as input.

You can also specify the maximum number of workers/goroutines (the default is 3).

<pre>
go build -o concurrent_crawler concurrent_crawler.go
</pre>

Run it using - 

<pre>
./concurrent_crawler -i input_file -m 4
</pre>

-i is the input file flag and -m is the maximum number of workers flag.
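
For illustration, the two flags could be handled with the standard flag package. This is a minimal sketch, not the code in concurrent_crawler.go: the helper name parseArgs and the validation rules are assumptions, and only the flag names and the default of 3 come from the text above.

```go
package main

import (
	"flag"
	"fmt"
)

// parseArgs is a hypothetical helper; concurrent_crawler.go may parse its
// flags differently. Only the flag names and the default of 3 are taken
// from the README.
func parseArgs(args []string) (inputFile string, maxWorkers int, err error) {
	fs := flag.NewFlagSet("concurrent_crawler", flag.ContinueOnError)
	fs.StringVar(&inputFile, "i", "", "file containing the URLs to crawl")
	fs.IntVar(&maxWorkers, "m", 3, "maximum number of workers")
	if err = fs.Parse(args); err != nil {
		return "", 0, err
	}
	// Assumed validation: an input file is required and at least one
	// worker is needed to make progress.
	if inputFile == "" {
		return "", 0, fmt.Errorf("missing -i input file")
	}
	if maxWorkers < 1 {
		return "", 0, fmt.Errorf("-m must be at least 1, got %d", maxWorkers)
	}
	return inputFile, maxWorkers, nil
}

func main() {
	// A real program would pass os.Args[1:]; a fixed slice keeps this
	// sketch runnable without command-line arguments.
	inputFile, maxWorkers, err := parseArgs([]string{"-i", "input_file", "-m", "4"})
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Printf("input=%s workers=%d\n", inputFile, maxWorkers)
}
```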


Brief Explanation -

Go has solid support for lightweight concurrency, with explicit communication between the concurrently running functions. You start a new goroutine by invoking a function: just prefix the invocation with the word “go”. So if “fetch()” is a function call, “go fetch()” (kind of like how you'd tell your dog) is an invocation of a goroutine, which runs concurrently with the code that called it.
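
As a small illustration of the “go fetch()” form, the sketch below starts a few goroutines and waits for them with sync.WaitGroup. The fetch function here is a stand-in that does no network I/O, and runAll is a hypothetical helper, not something from concurrent_crawler.go.

```go
package main

import (
	"fmt"
	"sync"
)

// fetch is a stand-in for real work: it only records that it ran.
// Each goroutine writes to its own slice element, so no lock is needed.
func fetch(id int, results []string, wg *sync.WaitGroup) {
	defer wg.Done()
	results[id] = fmt.Sprintf("fetched %d", id)
}

// runAll starts n goroutines and returns once all of them have finished.
func runAll(n int) []string {
	results := make([]string, n)
	var wg sync.WaitGroup
	for i := 0; i < n; i++ {
		wg.Add(1)
		go fetch(i, results, &wg) // runs concurrently with this loop
	}
	wg.Wait() // without this, runAll could return before the goroutines run
	return results
}

func main() {
	for _, r := range runAll(3) {
		fmt.Println(r)
	}
}
```

Note that a program exits when main returns, even if goroutines are still running, which is why some form of waiting is needed.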

Once you’ve created a goroutine, the idiomatic way to talk to it is through channels. Channels support two operations: send and receive. On an unbuffered channel, a channel operation blocks until the matching operation is executed by a different goroutine. So sending on a channel blocks until someone receives from it; receiving from a channel blocks until someone sends something for you to read. You can only pass a single type of value over a channel. Goroutines are very cheap and lightweight, and the Go runtime multiplexes them onto a smaller number of OS threads. So you don’t need to worry (much) about creating lots of goroutines. (Creating hundreds of thousands of goroutines is very normal in Go, according to the documentation :P)
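
The blocking behaviour is easiest to see with an unbuffered channel. In this sketch (the names produce and collect are made up for illustration), each send waits for the matching receive, so the two goroutines move in lockstep and the values arrive in the order they were sent.

```go
package main

import "fmt"

// produce sends each URL on out and closes the channel when done.
func produce(urls []string, out chan<- string) {
	for _, u := range urls {
		out <- u // blocks until the receiver is ready (channel is unbuffered)
	}
	close(out) // lets the receiver's range loop terminate
}

// collect receives everything the producer sends, in order.
func collect(urls []string) []string {
	ch := make(chan string) // unbuffered: carries values of type string only
	go produce(urls, ch)
	var got []string
	for u := range ch { // each receive blocks until a value is sent
		got = append(got, u)
	}
	return got
}

func main() {
	for _, u := range collect([]string{"http://example.com/a", "http://example.com/b"}) {
		fmt.Println(u)
	}
}
```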

Here is a brief explanation of some of the functions used in the code - 


<pre>
// Run represents a number of functions running concurrently.
type Run struct 
</pre>

<pre>
// NewRun returns a new parallel instance.  It will run up to maxPar
// functions concurrently.
func NewRun(maxPar int) *Run 
</pre>

<pre>
// Do requests that the given function be run concurrently.  If there are
// already the maximum number of functions running concurrently, it will
// block until one of them has completed.
func (r *Run) Do(f func() error) 
</pre>

<pre>
// Wait waits for all the functions to complete.  If any errors were
// encountered, it returns an Errors value describing all the errors in
// arbitrary order.
func (r *Run) Wait() error 
</pre>
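
One common way to get the behaviour these comments describe is a buffered channel used as a counting semaphore. The sketch below is an assumption about how such a type could be built, not the code in concurrent_crawler.go: the struct fields, the layout of the Errors type, and the demo helper are all made up for illustration.

```go
package main

import (
	"fmt"
	"strings"
	"sync"
	"time"
)

// Errors holds every error returned by the functions passed to Do.
type Errors []error

func (e Errors) Error() string {
	msgs := make([]string, len(e))
	for i, err := range e {
		msgs[i] = err.Error()
	}
	return strings.Join(msgs, "; ")
}

// Run limits how many functions execute at once (assumed layout).
type Run struct {
	limiter chan struct{} // one buffered slot per allowed concurrent function
	wg      sync.WaitGroup
	mu      sync.Mutex
	errs    Errors
}

// NewRun returns a Run that allows up to maxPar functions at a time.
// maxPar must be at least 1, otherwise Do would block forever.
func NewRun(maxPar int) *Run {
	return &Run{limiter: make(chan struct{}, maxPar)}
}

// Do blocks until a slot is free, then runs f in a new goroutine.
func (r *Run) Do(f func() error) {
	r.limiter <- struct{}{} // acquire a slot; blocks while maxPar are running
	r.wg.Add(1)
	go func() {
		defer r.wg.Done()
		defer func() { <-r.limiter }() // release the slot
		if err := f(); err != nil {
			r.mu.Lock()
			r.errs = append(r.errs, err)
			r.mu.Unlock()
		}
	}()
}

// Wait blocks until every function has returned, then reports any errors.
func (r *Run) Wait() error {
	r.wg.Wait()
	if len(r.errs) == 0 {
		return nil // avoid returning a non-nil interface wrapping an empty slice
	}
	return r.errs
}

// demo runs n tasks with at most maxPar at once and reports the highest
// number of tasks seen running together. Odd-numbered tasks fail.
func demo(maxPar, n int) (peak int, err error) {
	var mu sync.Mutex
	running := 0
	r := NewRun(maxPar)
	for i := 0; i < n; i++ {
		i := i // capture the loop variable for the closure
		r.Do(func() error {
			mu.Lock()
			running++
			if running > peak {
				peak = running
			}
			mu.Unlock()
			time.Sleep(10 * time.Millisecond) // simulate work
			mu.Lock()
			running--
			mu.Unlock()
			if i%2 == 1 {
				return fmt.Errorf("task %d failed", i)
			}
			return nil
		})
	}
	err = r.Wait()
	return peak, err
}

func main() {
	peak, err := demo(2, 5)
	fmt.Println("peak concurrency:", peak)
	fmt.Println("errors:", err)
}
```

Acquiring the slot inside Do, before the goroutine starts, is what makes the caller block once the limit is reached.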

Below is the part of the code that handles fetching and saving.

<pre>
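// Fetch the page; return the error so that Wait can report it.
response, err := http.Get(url)
if err != nil {
	return err
}
defer response.Body.Close()
// Read the whole response body into memory.
responseString, err := ioutil.ReadAll(response.Body)
if err != nil {
	return err
}
// Save the body to the output file for this URL.
err = ioutil.WriteFile(outputFile, responseString, 0644)
fmt.Printf("Finished fetching url %s\n", url)
return err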
</pre>

The output is saved in files named file0, file1, file2, etc. in the same directory.

