mrrobot's Introduction

README
Run with:
    $ sbt
    > run $url
- url: starting URI. MUST end with a '/'
- Depth is limited to 3, change by editing Main.scala
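
The depth limit can be pictured as a breadth-first walk that stops after a fixed number of hops. The following is a minimal sketch under assumptions, not the code in Main.scala: `DepthLimitedCrawl`, `crawl` and `linksOf` are hypothetical names, and `linksOf` stands in for "fetch the page and extract its links".

```scala
import scala.annotation.tailrec

object DepthLimitedCrawl {
  // Breadth-first walk from `start`, following links for at most `maxDepth` hops.
  // `linksOf` is a hypothetical stand-in for fetching a page and extracting its links.
  // Returns the set of (page, link) edges discovered.
  def crawl(start: String, maxDepth: Int)(linksOf: String => Seq[String]): Set[(String, String)] = {
    @tailrec
    def loop(frontier: Set[String], seen: Set[String],
             edges: Set[(String, String)], depth: Int): Set[(String, String)] =
      if (depth >= maxDepth || frontier.isEmpty) edges
      else {
        // Every link found on the pages of the current level.
        val found = for (page <- frontier; link <- linksOf(page)) yield (page, link)
        // Only pages not visited before go into the next level.
        val next = found.map(_._2) -- seen
        loop(next, seen ++ next, edges ++ found, depth + 1)
      }
    loop(Set(start), Set(start), Set.empty, 0)
  }
}
```

With a limit of 3, pages up to two hops from the start are fetched and links up to three hops away are recorded.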
Make a pretty picture with:
    dot -Tpdf output.dot > foo.pdf
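
For orientation, `dot` expects a graph description in the DOT language. The sketch below shows one way link pairs could be rendered into that format; it is an illustration only, and `DotSketch`, `render` and the graph name `links` are assumptions rather than what the project actually writes to output.dot.

```scala
object DotSketch {
  // Escape backslashes and double quotes so a URL is a valid DOT string literal.
  private def quote(s: String): String =
    "\"" + s.replace("\\", "\\\\").replace("\"", "\\\"") + "\""

  // Render (from, to) link pairs as a directed graph in DOT syntax.
  // Duplicate edges are dropped so each link is drawn once.
  def render(edges: Seq[(String, String)]): String = {
    val lines = edges.distinct.map { case (from, to) =>
      s"  ${quote(from)} -> ${quote(to)};"
    }
    (Seq("digraph links {") ++ lines ++ Seq("}")).mkString("\n")
  }
}
```

Quoting every node name matters here because URLs contain characters such as `:` and `/` that are not valid in bare DOT identifiers.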

TODO
recognise links outside A.HREF (e.g. image maps)
- very difficult in the general case because JavaScript can load content and navigate the browser. Even regexes aren't sufficient here because the new URL may never occur in one piece in the document
properly parse HTML
Try harder to determine what "one domain" is, e.g. currently the host part of the URI is used, so a DNS name and its IPv4 and IPv6 addresses are considered different.
- is a subdomain equal to its superdomain (e.g. www.google.com == google.com ?)
take multiple base URLs on the command line, process all (in parallel) and name each output file after the URL
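
One possible policy for the "one domain" question is sketched below, purely as an assumption: compare lower-cased hosts and treat a leading `www.` as insignificant. `SameSite`, `siteKey` and `sameSite` are hypothetical names, and the sketch does nothing about the DNS name versus IP address problem, which would need name resolution.

```scala
import java.net.URI
import java.util.Locale
import scala.util.Try

object SameSite {
  // Comparison key for a URL: the lower-cased host with one leading "www." removed.
  // None if the URL does not parse or has no host part (e.g. mailto: links).
  def siteKey(url: String): Option[String] =
    Try(new URI(url)).toOption
      .flatMap(u => Option(u.getHost))
      .map(_.toLowerCase(Locale.ROOT).stripPrefix("www."))

  // Two URLs are on the same site only if both have a host and the keys match.
  def sameSite(a: String, b: String): Boolean =
    (siteKey(a), siteKey(b)) match {
      case (Some(x), Some(y)) => x == y
      case _                  => false
    }
}
```

Under this policy www.google.com and google.com count as the same site, while any other subdomain is still treated as a different one.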

TRADEOFFS
In the interests of production quality:
* Scala version pinned to a known quantity
* Dependencies kept to semver minor or patch ranges

Environment
Runs on Linux - a unikernel cluster might be better suited.

Language
Scala - modern, cool (easy to hire good people), very hard to work with if you don't know it, possibly a bit niche still. Gives us Akka which is a great computation model for this kind of problem - Erlang/Elixir have all the problems of Scala and more.

Libraries
async-http-library - canonical, a bit Java-focussed, pulls in Netty
