A web crawler for Wipro interview process.
Here are the instructions for the Buildit Platform Engineer exercise:
This exercise is to examine your technical knowledge, reasoning and engineering principals. There are no tricks or hidden agendas.
We are looking for a demonstration of your experience and skill using current software development technologies and methods. Please make sure your code is clear and demonstrates good practices.
Your solution will form the basis for discussion in subsequent interviews.
Please write a simple web crawler in a language of your choice in a couple of hours – please don’t spend much more than that.
The crawler should be limited to one domain. Given a starting URL – say http://wiprodigital.com - it should visit all pages within the domain, but not follow the links to external sites such as Google or Twitter.
The output should be a simple structured site map (this does not need to be a traditional XML sitemap - just some sort of output to reflect what your crawler has discovered), showing links to other pages under the same domain, links to external URLs and links to static content such as images for each respective page.
Please provide a README.md file that explains how to run / build your solution. Also, detail anything further that you would like to achieve with more time.
Once done, please make your solution available on Github and forward the link. Where possible please include your commit history to give visibility of your thinking and progress.
What you need to share with us:
- Working crawler as per requirements above
- A README.md explaining
- How to build and run your solution
- Reasoning and describe any trade offs
- Explanation of what could be done with more time
- Project builds / runs / tests as per instruction
- Get user input
- Start crawler
- Set user input as top level page
- Maybe read robots.txt
- Maybe apply robots.txt rules
- Skip if page has been visited already
- Download html file
- Parse html file
- Extract title from file
- Extract links from file
- Maybe consider spawning threads to process multiple pages at the same time
- Create page object with title, url and list of links.
- Add page to a map
- Add links a to toVisit list, filtering out links from other websites.
- save to file (consider that links written before, as skipped to avoid circular references)
- maybe format the file
I'll use java, as I was initially approached by the recruiter as a java developer, so it is my understanding that this would allow the review of my test to be more meaningful for the role.
Otherwise, I'd possibly choose Python if this is never meant to be concurrent, as Python is very simple and very quick to code. If this is ever meant to be concurrent, I'd consider Go, as the concurrency model of Go is very robust and it was a language designed for developer productivity.
I chose Java 8 to write the code instead of Java 9, simply because so far, I only read the new features list, but haven't had a chance to work with it yet.
As java is my choice of language, I'll use JUnit for unit tests and Mockito to create mocks. Jsoup was chosen as the lib to parse the HTML.
I also added Google's guava, as the helper methods included in that library are life-saving.
All methods have been created as public to facilitate testing. It is possible to test private methods too, but it requires more code. So in order to keep the time used to write this to a minimum, I'm using this as a shortcut. Ideally, all methods in WebCrawler should be private, except for the crawl method.
Dependencies:
With the dependencies installed, run the command below to clean build and execute tests:
gradle clean build test
To run the web crawler, use the command below:
java -jar <jarfile> <website> <sitemap>
Where:
jarfile
is the full path to the jar filewebsite
is the domain name to crawlsitemap
is the file to save
- Make the crawler multi-threaded
- Allows for distribution
- Generate a pretty site map
- Add more functional tests
- Add wiremock to for an end to end test, with a mocked
- Add proper logging