
unfurl's People

Contributors

dependabot[bot], jkppr, moshekaplan, nemec, obsidianforensics, olliejc, rafiot, scottwedge, sim4n6, weslambert


unfurl's Issues

Parse GitHub URLs

Description
GitHub URLs have a lot of info in them describing what is being viewed.

Examples

References

  • The above examples are just from me looking at the URLs; some actual references for this would be great!

Add support for clicked Yahoo URLs

Description
Parse out a clicked URL from a Yahoo search:

For example, a URL like the following:

https://r.search.yahoo.com/_ylt=AwrBT8OVIB9aqsoASDZXNyoA;_ylu=X3oDMTByOHZyb21tBGNvbG8DYmYxBHBvcwMxBHZ0aWQDBHNlYwNzcg--/RV=2/RE=1512018198/RO=10/RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html/RK=2/RS=vQW48_o6zXyIDewim5cXq8Np1zo-


from urllib.parse import unquote

RU = [e for e in path.split("/") if e.startswith("RU=")]
# RU == ['RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html']
link_target = unquote(RU[0][3:])
# link_target == 'https://developer.yahoo.com/cocktails/mojito/docs/code_exs/query_params.html'

Add ability for node to have multiple parents

Some nodes may combine to give more info, and thus it would be nice to reference both "parent" nodes from a child node. One example is the ei Google param; one node has full seconds and another has fractional seconds. Combining both these nodes would then yield the complete timestamp.
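As a sketch of the combining step (the node values and function name below are hypothetical; assuming one parent node holds whole seconds and the other the fractional part in microseconds):

```python
def combine_timestamp(seconds: int, microseconds: int) -> float:
    """Merge two parent nodes' values into one complete timestamp."""
    return seconds + microseconds / 1_000_000

combine_timestamp(1587403446, 547000)  # -> 1587403446.547
```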

Add parsing IP addresses (in all their variants)

Description
IP addresses can be represented in different ways. Add support in Unfurl to parse all these variants of IP addresses:

From https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/evasive-urls-in-spam/:
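As a sketch of one common variant, the single-number ("dword") form of an IPv4 address can be converted back to dotted-quad notation like this (decimal and 0x-prefixed hex only; this is my illustration, not Unfurl's implementation):

```python
def dword_to_ip(value: str) -> str:
    """Convert a single-number IP (decimal or 0x-hex) to dotted-quad notation."""
    n = int(value, 0)  # base 0 auto-detects the 0x prefix
    return ".".join(str((n >> shift) & 0xFF) for shift in (24, 16, 8, 0))

dword_to_ip("3627734734")  # '216.58.214.206'
dword_to_ip("0xD83AD6CE")  # '216.58.214.206'
```

Octal forms with bare leading zeros (e.g. 0330.072.…) would need extra normalization, since Python 3's int() rejects them without a 0o prefix.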

References

Add logging

Right now, errors/warnings/etc are printed. Switch to using logging.
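A minimal sketch of the switch using the stdlib logging module (the StringIO handler is only there to make the output inspectable; a real setup would log to stderr or a file):

```python
import io
import logging

log = logging.getLogger("unfurl")
log.setLevel(logging.DEBUG)

buf = io.StringIO()  # capture output for demonstration
handler = logging.StreamHandler(buf)
handler.setFormatter(logging.Formatter("%(levelname)s %(name)s: %(message)s"))
log.addHandler(handler)

# instead of print(...):
log.warning("unknown parameter: %s", "gs_lcp")
```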

The explanation for the date/time parameters is wrong

In the mouseover for the ei parameter, it states that ei is believed to correspond to when the search took place. This is not correct; I have seen ei parameters hours apart from the actual search.

What I have gathered is that ei marks the session start (when the tab was opened) or the last search. If there is only one search, made immediately after opening the tab, it will roughly correspond to the search time, but that is a coincidence.

An example can be seen here:

https://www.google.com/search?sxsrf=ALeKk01OOv3qyJF_Sb7_1WIuBa-cNGDDNg%3A1587403446547&ei=ttqdXsP7IMKZk74Pgv-k6AY&q=third+search+at17%3A28&oq=third+search+at17%3A28&gs_lcp=CgZwc3ktYWIQA1DbihFYqeQRYPnmEWgAcAB4AIABfYgBthOSAQQyNC41mAEAoAEBqgEHZ3dzLXdpeg&sclient=psy-ab&ved=0ahUKEwjDrq_UwvfoAhXCzMQBHYI_CW0Q4dUDCAw&uact=5

I opened a tab and searched for 'First search at 17:22' (At 17:22 naturally).
I then searched for 'Second search at 17:24'
Finally waited and searched for 'third search at 17:28'

So there is no indication of the actual search time.
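For what it's worth, the ei value does embed a timestamp for *some* event: it is urlsafe-base64, and its first 4 bytes are a little-endian Unix timestamp. Decoding the ei from the URL above yields 1587403446, which matches the epoch-milliseconds value visible in the sxsrf parameter (1587403446547):

```python
import base64

def ei_to_epoch(ei: str) -> int:
    """Decode the first 4 bytes of a urlsafe-base64 ei value as little-endian epoch seconds."""
    raw = base64.urlsafe_b64decode(ei + "=" * (-len(ei) % 4))
    return int.from_bytes(raw[:4], "little")

ei_to_epoch("ttqdXsP7IMKZk74Pgv-k6AY")  # 1587403446
```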

Identify and parse (some) URLs generated by Metasploit

Description
Metasploit / CS use some generated URLs that can be decoded. Didier Stevens did a write-up & built a tool to decode them (https://isc.sans.edu/diary/27204). That research is a great base for an Unfurl parser.

Examples

  • hxxp://127.0.0.1:8080/4PGoVGYmx8l6F3sVI4Rc8g1wms758YNVXPczHlPobpJENARSuSHb57lFKNndzVSpivRDSi5VH2U-w-pEq_CroLcB--cNbYRroyFuaAgCyMCJDpWbws/
  • hxxp://12.34.56.78/WjSH

References

Docker Support

I've created a Dockerfile and docker-compose file for standing up unfurl in a Docker container. However, it appears that even though we can change the host/port for unfurl in config.ini, if you are running via Docker etc, the fetch request here refers to the explicit host/port as defined in config.ini.

This becomes an issue when you expose unfurl on 0.0.0.0 inside the Docker container and access it in the browser via the host IP address (something like 192.168.1.199), because unfurl tries to fetch 0.0.0.0:5000 instead of the host IP.

I've got a PR ready (from here) that replaces unfurl_host:unfurl_port with:

window.location.host (host + port of current request)

and provides the necessary files for running unfurl in a Docker container

See here for more details on the host/port change

This seems to work okay from my testing.

Any thoughts on potential issues with this, or how this could be done better to support Docker?

If this all sounds okay, I will go ahead and submit a PR -- just let me know!

Add domain reputation checks

Description
Add a lookup "parser" that takes a domain name and shows basic reputation information; for example, popularity (Alexa 1M?) or whether it has been flagged as malicious (Safe Browsing?).

Examples

  • Show google.com as popular/common and not malicious.
  • Show goooogle.ru as not popular/rare and malicious (made up example).

References

  • TBD

Make unfurl application port configurable

Because the application port is referenced in multiple places, and many folks run multiple services on certain hosts (and the typical default Flask port is 5000 😃 ) the application port should be easily modifiable (pull value from config file, etc). I can assist with a PR when I get a chance.
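A sketch of pulling the value from a config file with the stdlib configparser (the [UNFURL_APP] section name and keys here are hypothetical, not necessarily what unfurl.ini uses):

```python
import configparser

config = configparser.ConfigParser()
config.read_string("""
[UNFURL_APP]
host = 0.0.0.0
port = 5001
""")

# fall back to Flask's default 5000 if the key is absent
port = config.getint("UNFURL_APP", "port", fallback=5000)
host = config.get("UNFURL_APP", "host", fallback="127.0.0.1")
```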

Expand Bing parser

Proofpoint URL decoding

Expand Short URLs

Description
Consider implementing expansion of short URLs.
Pro: Lots of URL shortening services can hide interesting links.
Con: We'd need to reach out to 3rd party sites to do this resolution (the data is not embedded in the URL).

Suggestion: Docker from local code rather than git

After raising #75 and trying to develop a new parser I realised that the code pulls from GitHub rather than building from local files.

A suggestion would be to use the local files rather than git, for example (as well as use a python base image):

FROM python:3-alpine
COPY requirements.txt /unfurl/requirements.txt

RUN cd /unfurl && \
    pip3 install -r requirements.txt

COPY . /unfurl
RUN sed -i 's/^host.*/host = 0.0.0.0/' /unfurl/unfurl.ini

WORKDIR /unfurl
ENTRYPOINT ["python3", "unfurl_app.py"]

The reason for the two COPYs is so that between changes, builds are cached up to the second COPY and then those changes built rather than all the dependencies every time.

I figured I'd raise an issue rather than another PR to get some comments. It would be good to get @weslambert thoughts.

Firebase Push IDs

Description
Firebase push IDs are chronological, 20-character unique IDs. A push ID contains 120 bits of information: the first 48 bits are a timestamp, which both reduces the chance of collision and allows consecutively created push IDs to sort chronologically. The timestamp is followed by 72 bits of randomness.

Examples

  • "-JhLeOlGIEjaIOFHR0xd"
  • "-JhQ76OEK_848CkIFhAq"
  • "-JhQ7APk0UtyRTFO9-TS"
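A sketch of the timestamp extraction, assuming the standard push ID alphabet (64 characters in ASCII order); the first 8 characters encode the 48-bit millisecond timestamp:

```python
PUSH_CHARS = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def push_id_time_ms(push_id: str) -> int:
    """Decode the first 8 base-64 characters as milliseconds since the Unix epoch."""
    ms = 0
    for ch in push_id[:8]:
        ms = ms * 64 + PUSH_CHARS.index(ch)
    return ms

push_id_time_ms("-JhLeOlGIEjaIOFHR0xd")  # 1423088131153 (early February 2015)
```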

References

pytest or unittest for testing?

Description
I would like to create a testing mechanism; I suggest using either pytest or unittest.

Before I dive into working on the PR, could you please choose the one you recommend?
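For reference, a pytest-style test needs no boilerplate class: functions named test_* in files named test_*.py are collected automatically. A minimal sketch, reusing the Yahoo RU extraction from the issue above:

```python
# test_parsers.py -- pytest collects functions named test_* automatically

def test_yahoo_ru_extraction():
    path = "/RV=2/RE=1512018198/RU=https%3a%2f%2fexample.com/RK=2"
    ru = [e for e in path.split("/") if e.startswith("RU=")]
    assert ru == ["RU=https%3a%2f%2fexample.com"]
```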

Add support for mailto URLs

Description
Unfurl should parse mailto URLs. They behave much like http URLs but lack the regular-looking scheme separator (mailto: rather than http://, s3://, etc.), so first take the data before the ? (if present) as the recipient address(es); anything after the ? can be treated like url.query.
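A minimal sketch of that split using only the stdlib (the function name is mine):

```python
from urllib.parse import parse_qs, unquote

def parse_mailto(url: str) -> dict:
    """Split a mailto URL into recipient address(es) and query parameters."""
    rest = url[len("mailto:"):]
    addresses, _, query = rest.partition("?")
    return {
        "to": [unquote(a) for a in addresses.split(",") if a],
        "query": parse_qs(query, keep_blank_values=True),
    }

parse_mailto("mailto:a@example.com?subject=Hi&cc=b@example.com")
# {'to': ['a@example.com'], 'query': {'subject': ['Hi'], 'cc': ['b@example.com']}}
```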

Examples

References

parse_url parser does not handle query parameters with no value

Some websites add keys to the URL query string that have no value, but still affect the way the page is displayed. One (trimmed down) example is the following Facebook URL:

https://www.facebook.com/photo.php?type=3&theater

In this case, "theater" sits on its own and indicates that the photo should be opened in a lightbox. Unfortunately, the parameter is missing entirely from the tree after parsing, as you can see below:
(screenshot: the "theater" query parameter is missing from the tree)

This is a pretty easy fix. Modify line 66 of parse_url.py as below:

-        parsed_qs = urllib.parse.parse_qs(node.value)
+        parsed_qs = urllib.parse.parse_qs(node.value, keep_blank_values=True)

(screenshot: with the fix, the "theater" query parameter appears)

I could submit a pull request to fix this now, however, on lines 94 and 106 I see regexes for parsing similar forms (a=b|c=d|e=f and a=b&c=d&e=f). I'd like to make sure this issue is fixed in those as well, but I haven't yet figured out how to build a test case/example to cover them. Any ideas?
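For a quick repro of the default behavior versus the proposed fix:

```python
from urllib.parse import parse_qs

parse_qs("type=3&theater")
# {'type': ['3']} -- 'theater' is silently dropped

parse_qs("type=3&theater", keep_blank_values=True)
# {'type': ['3'], 'theater': ['']} -- 'theater' is kept with an empty value
```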

Add detection/flagging for possible homoglyph/graph attacks

Description
Unfurl already decodes punycoded URLs (xn--...), but not hxxp://аbc.com (the "а" is Cyrillic). It does in fact handle these (shows them as %-encoded, then translates back to the original homograph domain). It would be good to add some flagging to the %-encoded node.
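One cheap flagging heuristic is to check whether a decoded label mixes scripts, e.g. by bucketing characters by the first word of their Unicode names (a rough proxy for script; a real implementation would use proper script properties or a confusables table):

```python
import unicodedata

def scripts_in(label: str) -> set:
    """Rough script detection: first word of each alphabetic character's Unicode name."""
    return {unicodedata.name(ch).split()[0] for ch in label if ch.isalpha()}

scripts_in("\u0430bc")  # {'CYRILLIC', 'LATIN'} -> mixed scripts, worth flagging
scripts_in("abc")       # {'LATIN'}
```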


Fragments in URLs not working

I noticed that after cloning the repository from master and running with python3 unfurl_app.py, fragments in URLs do not appear in the unfurled graph. This issue is also present on your demo website when deep-linking to the URL, but not when I submit a URL via the form. I'm guessing it's because fragments aren't sent to the server when GETing a URL. Maybe it would be better to base64 the URL in the deep link or something to prevent the browser from misinterpreting the query.

Doesn't work:
http://127.0.0.1:5000/https://www.facebook.com/photo.php?type=3#hello
https://dfir.blog/unfurl/?url=https://www.facebook.com/photo.php?type=3#hello

Works:
Go to: https://dfir.blog/unfurl/
Submit query: https://www.facebook.com/photo.php?type=3#hello

Add more encoding types

Description
Unfurl currently supports basic url-safe b64 decoding, and only if the results are ASCII. This should be expanded to include more encoding types and chains.

Currently supported:

  • url-safe b64 -> ASCII

Encodings to add:

  • url-safe b64 -> gzip -> ASCII
  • url-safe b64 -> protobuf
  • b32 -> ASCII
  • More ...
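A sketch of one such chain (url-safe b64 → gzip → ASCII), returning None when the chain doesn't apply so callers can try the next decoder (names are mine):

```python
import base64
import binascii
import gzip

def try_b64_gzip(data: str):
    """Attempt urlsafe-b64 -> gzip -> ASCII; return None if any step fails."""
    try:
        raw = base64.urlsafe_b64decode(data + "=" * (-len(data) % 4))
        return gzip.decompress(raw).decode("ascii")
    except (binascii.Error, OSError, UnicodeDecodeError):
        return None
```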

Examples

  • TBD

References

Parse YouTube URLs

Description
There is some good data encoded in YouTube URLs. At a minimum, some links point to a particular time in a video.

I'm happy to help/coach/answer questions if anyone wants to work on this issue!

Examples

References

  • The t param looks to be the number of seconds into the video
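A sketch of parsing t, assuming it is either bare seconds or the 1h2m3s style seen in share links (the regex and function name are mine):

```python
import re

def t_to_seconds(t: str) -> int:
    """Convert a YouTube t parameter ('90' or '1h2m3s') to seconds."""
    if t.isdigit():
        return int(t)
    mult = {"h": 3600, "m": 60, "s": 1}
    return sum(int(n) * mult[u] for n, u in re.findall(r"(\d+)([hms])", t))

t_to_seconds("1h2m3s")  # 3723
```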

parsing suggestion: recognize and decode 12-byte Mongo object ID timestamp (first four bytes)

I just read https://techkranti.com/idor-through-mongodb-object-ids-prediction/, and it brought to mind a case I was working last week. Not positive it was actually this, but I'm going to check when I get to work. In any case, it would be useful for unfurl to automatically recognize Mongo object IDs that appear in URLs, and decode the embedded timestamp. (If it already does, I apologize. I wasn't able to find an example ready-to-hand, with which to test, and I don't see any references to Mongo object IDs in the documentation.)
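For reference, an ObjectId is 12 bytes hex-encoded as 24 characters, and its first 4 bytes are a big-endian Unix timestamp, so the extraction is short (the example ID below is made up for illustration):

```python
import datetime

def objectid_time(oid: str) -> datetime.datetime:
    """Decode the first 4 bytes (8 hex chars) of a Mongo ObjectId as epoch seconds."""
    seconds = int(oid[:8], 16)
    return datetime.datetime.fromtimestamp(seconds, tz=datetime.timezone.utc)

objectid_time("5f3366601c4d2c0fd0f95c41")  # a 2020 timestamp, UTC
```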

Parsing misclassification if the password begins with ? or / or #

These URLs cause a misclassification:

The misclassification appears if the password begins with a delimiter defined in RFC3986, section 3.2:

The authority component is preceded by a double slash ("//") and is
terminated by the next slash ("/"), question mark ("?"), or number
sign ("#") character, or by the end of the URI.
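Python's urlsplit follows that rule, which is presumably where the misclassification originates; demonstrating with a made-up credential:

```python
from urllib.parse import urlsplit

parts = urlsplit("http://admin:?secret@example.com/")
parts.netloc  # 'admin:' -- the authority stops at the '?'
parts.query   # 'secret@example.com/' -- the "password" spills into the query
```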

Add support for Yahoo Search subdomains

Description
Yahoo uses the following subdomains for different types of search:

  • Web Search = "search.yahoo.com"
  • Image Search = "images.search.yahoo.com"
  • Video Search = "video.search.yahoo.com"
  • News Search = "news.search.yahoo.com"

What is the best way to add these in at the top-layer?
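One option is a simple hostname-to-description map consulted before the generic domain parsing (the descriptions and function name below are mine):

```python
YAHOO_SEARCH_TYPES = {
    "search.yahoo.com": "Yahoo Web Search",
    "images.search.yahoo.com": "Yahoo Image Search",
    "video.search.yahoo.com": "Yahoo Video Search",
    "news.search.yahoo.com": "Yahoo News Search",
}

def yahoo_search_type(hostname: str):
    """Return a description for known Yahoo search hosts, else None."""
    return YAHOO_SEARCH_TYPES.get(hostname.lower())
```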

Investigate Google gs_lcp parameter

Google Search: bih and biw parameters

Great tool thanks Ryan!

Both the bih and biw parameters in Google Search currently appear as generic 'URL parsing functions'.


However, they appear to be well documented, and testable, as equating to the browser window's height and width. This can be checked using something like https://browsersize.com/


Thanks again!
