obsidianforensics / unfurl
Extract and Visualize Data from URLs using Unfurl
Home Page: http://unfurl.link
License: Apache License 2.0
Description
GitHub URLs have a lot of info in them describing what is being viewed.
Examples
https://github.com/obsidianforensics/unfurl/blob/master/unfurl/parsers/parse_twitter.py#L15
https://github.com/obsidianforensics/unfurl/tree/master/unfurl/parsers
References
Add way to store API keys in a config file. Currently they are stored in code in each parser, which isn't good.
See here for other Google search URLs to include:
https://github.com/randomaccess3/googleURLParser
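A minimal sketch of what the requested config-file approach could look like, using Python's configparser. The `[api_keys]` section name and the use of unfurl.ini for this purpose are assumptions for illustration, not Unfurl's actual config schema:

```python
import configparser

def load_api_key(service, config_path="unfurl.ini"):
    """Look up an API key for a parser from a config file.

    Assumes a hypothetical [api_keys] section, e.g.:
        [api_keys]
        virustotal = abc123
    """
    config = configparser.ConfigParser()
    config.read(config_path)
    # fallback=None lets a parser degrade gracefully when no key is configured
    return config.get("api_keys", service, fallback=None)
```

Parsers would then call `load_api_key("virustotal")` instead of embedding the key in code.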
Description
Parse out the clicked URL from a Yahoo search redirect.
For example, given a URL like the following:
https://r.search.yahoo.com/_ylt=AwrBT8OVIB9aqsoASDZXNyoA;_ylu=X3oDMTByOHZyb21tBGNvbG8DYmYxBHBvcwMxBHZ0aWQDBHNlYwNzcg--/RV=2/RE=1512018198/RO=10/RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html/RK=2/RS=vQW48_o6zXyIDewim5cXq8Np1zo-
from urllib.parse import unquote, urlsplit

path = urlsplit(url).path  # url is the r.search.yahoo.com URL above
RU = [e for e in path.split("/") if e.startswith("RU=")]
# RU = ['RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html']
link_target = unquote(RU[0][3:])
# link_target = 'https://developer.yahoo.com/cocktails/mojito/docs/code_exs/query_params.html'
Some nodes may combine to give more info, and thus it would be nice to reference both "parent" nodes from a child node. One example is the ei Google param; one node has full seconds and another has fractional seconds. Combining both these nodes would then yield the complete timestamp.
Description
IP addresses can be represented in different ways. Add support in Unfurl to parse all these variants of IP addresses:
From https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/evasive-urls-in-spam/:
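As a sketch of the variants involved (whole-address hex, plain decimal, and dotted notation with octal or hex octets), all of which should normalize to the same dotted-quad; the function and structure here are illustrative, not Unfurl's parser:

```python
import ipaddress

def _octet(part):
    # Per-octet base detection: 0x.. is hex, a leading zero means octal,
    # anything else is decimal.
    if part.lower().startswith("0x"):
        return int(part, 16)
    if part.startswith("0") and len(part) > 1:
        return int(part, 8)
    return int(part)

def normalize_ip(raw):
    """Convert common obfuscated IPv4 notations to dotted-quad form."""
    if raw.lower().startswith("0x"):       # whole-address hex: 0xC0A80101
        return str(ipaddress.ip_address(int(raw, 16)))
    if raw.isdigit():                      # plain decimal integer: 3232235777
        return str(ipaddress.ip_address(int(raw)))
    parts = raw.split(".")
    if len(parts) == 4:                    # dotted, possibly octal/hex octets
        value = 0
        for part in parts:
            value = (value << 8) | _octet(part)
        return str(ipaddress.ip_address(value))
    raise ValueError(f"unrecognized IP form: {raw}")
```

For example, `0xC0A80101`, `3232235777`, and `0300.0250.0001.0001` all resolve to `192.168.1.1`.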
References
Right now, errors/warnings/etc. are just printed. Switch to using the logging module.
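A minimal sketch of the switch; the logger name "unfurl" and the example function are assumptions for illustration:

```python
import logging

# Module-level logger replacing bare print() calls; callers/apps decide
# where the output goes by configuring handlers.
logger = logging.getLogger("unfurl")

def parse_int_field(value):
    try:
        return int(value)
    except ValueError:
        # was: print(f"could not parse {value}")
        logger.warning("could not parse %r", value)
        return None
```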
Looks to be time-based, but exact structure needs more research. Also unsure where they are used.
Examples:
References:
Description
Flake IDs are similar to snowflakes or UUIDv1 in that there is interesting info inside them.
Examples
References
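To illustrate the family this issue mentions: here is how the timestamp inside a Twitter snowflake (a close relative of flake IDs) is recovered, using the documented snowflake layout (41-bit millisecond timestamp above a 22-bit worker/sequence field, with a custom epoch). Flake IDs themselves use a different bit layout, so this is a starting point rather than a flake decoder:

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04, the documented snowflake epoch

def snowflake_to_datetime(snowflake):
    """Extract the embedded UTC timestamp from a Twitter snowflake ID."""
    ms = (int(snowflake) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```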
The mouseover for the ei parameter states that it is believed to correspond to when the search took place. This is not correct; I have seen ei parameters hours apart from the actual search.
What I have gathered is that ei corresponds to session start (when the tab was opened) or to the last search. If there is only one search, made immediately after opening the tab, it will roughly correspond to the search time, but that is a coincidence.
An example can be seen here:
I opened a tab and searched for 'First search at 17:22' (At 17:22 naturally).
I then searched for 'Second search at 17:24'
Finally waited and searched for 'third search at 17:28'
So there is no indication of the actual search time.
Description
Add support for identifying URLs which are the results from a Google image search
Examples
Description
Add the protocol and port to the parsed results for a URL. Ideas for child nodes are SSL cert lookup info if the protocol was HTTPS, and common services associated with ports (if the port was specified).
Examples
References
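A rough sketch of what the parsed result could include; the port-to-service table is a tiny illustrative subset, and the function name is invented:

```python
from urllib.parse import urlsplit

# Illustrative mapping only; a real parser would use a fuller IANA table.
COMMON_SERVICES = {21: "ftp", 22: "ssh", 25: "smtp", 80: "http", 443: "https", 3389: "rdp"}
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def describe_endpoint(url):
    """Return scheme, effective port, and a guess at the associated service."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "port": port,
        "port_explicit": parts.port is not None,  # was the port in the URL itself?
        "service": COMMON_SERVICES.get(port),
    }
```

The `port_explicit` flag would let the SSL-cert or service child nodes only appear when they add information.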
Description
Metasploit / Cobalt Strike use some generated URLs that can be decoded. Didier Stevens did a write-up and built a tool to decode them (https://isc.sans.edu/diary/27204). That research is a great base for an Unfurl parser.
Examples
References
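The core of the scheme Didier Stevens describes is a checksum8: the ASCII values of the URI path characters sum to a known constant mod 256. A sketch, where the constant 92 (commonly documented as Metasploit's Windows staged-payload value, URI_CHECKSUM_INITW) should be treated as an assumption to verify against the write-up:

```python
def checksum8(uri_path):
    """Sum of the ASCII values of the path characters, modulo 256."""
    return sum(ord(c) for c in uri_path) % 256

def looks_like_msf_stager(path):
    # 92 is the value commonly cited for Windows staged payloads; other
    # checksum values map to other payload types per the write-up.
    return checksum8(path.strip("/")) == 92
```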
I've created a Dockerfile and docker-compose file for standing up unfurl in a Docker container. However, it appears that even though we can change the host/port for unfurl in config.ini, if you are running via Docker etc., the fetch request here refers to the explicit host/port as defined in config.ini.
This becomes an issue when you are trying to expose unfurl via 0.0.0.0 in the Docker container and access it in the browser via the host IP address (something like 192.168.1.199), because unfurl then tries to fetch 0.0.0.0:5000 instead of the host IP.
I've got a PR ready (from here) that replaces unfurl_host:unfurl_port with window.location.host (the host + port of the current request) and provides the necessary files for running unfurl in a Docker container. See here for more details on the host/port change.
This seems to work okay from my testing.
Any thoughts on potential issues with this, or how this could be done better to support Docker?
If this all sounds okay, I will go ahead and submit a PR -- just let me know!
Description
Zoom (zoom.us) has gotten pretty popular recently. What can be pulled out of Zoom URLs?
Examples
References
The website linked in the repo's about section, https://unfurl.link/, is dead: the server 162.255.119.33 refuses connections on port 443, and port 80 redirects to https://dfir.blog/unfurl/
$ curl https://unfurl.link/
curl: (7) Failed to connect to unfurl.link port 443 after 316 ms: Connection refused
$ curl http://unfurl.link/
<a href='https://dfir.blog/unfurl/'>Found</a>.
Description
Grammar typo: "up" appears twice in "by breaking up a URL up into components".
FR for Unfurl to be able to read in a bunch of URLs from a file (one per line) and optionally write output to a CSV file.
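A sketch of the requested batch mode; `parse_url` here is a stand-in callable for whatever Unfurl API call does the real work, not an actual Unfurl function:

```python
import csv

def run_batch(input_path, output_path, parse_url):
    """Read URLs (one per line) and write one CSV row per parsed URL."""
    with open(input_path) as infile, open(output_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["url", "result"])
        for line in infile:
            url = line.strip()
            if url:  # skip blank lines
                writer.writerow([url, parse_url(url)])
```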
Description
Mastodon is a social network that uses IDs similar to Snowflakes, so we should be able to pull out a timestamp.
Examples
References
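Mastodon's snowflake-like IDs carry a millisecond Unix timestamp in the bits above the low 16 (which hold a sequence value), so a decoder sketch could look like:

```python
from datetime import datetime, timezone

def mastodon_id_to_datetime(status_id):
    """Extract the millisecond Unix timestamp embedded in a Mastodon status ID."""
    ms = int(status_id) >> 16  # low 16 bits are a per-millisecond sequence value
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```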
Find better way to integrate referenced sources (clicking on hover text isn't working).
Description
Add a lookup "parser" that takes a domain name in and shows some basic reputation things; for example popularity (Alexa 1M?) or if it has been flagged as bad (SafeBrowsing?).
Examples
References
Because the application port is referenced in multiple places, and many folks run multiple services on certain hosts (and the typical default Flask port is 5000 😃 ) the application port should be easily modifiable (pull value from config file, etc). I can assist with a PR when I get a chance.
Description
Add support for parsing URLs for the result from a Bing image search
Examples
References
Where is unfurl on PyPI ;-) ?
Description
The Bing parser right now is very minimal; what else can we pull from a Bing search URL?
Examples
References
Description
There is a bunch of data encoded in Proofpoint URL Defense URLs; at this point I'm unsure what beyond the "original" URL can be extracted. Needs exploration.
Examples
From https://help.proofpoint.com/Threat_Insight_Dashboard/API_Documentation/URL_Decoder_API:
References
Unfurl supports multiple timestamps, but we can add more.
References:
Description
Consider implementing expansion of short URLs.
Pro: Lots of URL shortening services can hide interesting links.
Con: We'd need to reach out to 3rd party sites to do this resolution (the data is not embedded in the URL).
Add way for nodes to query their "siblings", not just parents.
Many URLs have a hierarchical structure in the URL path, with one segment value affecting how others are interpreted. We need a way to query "across" the tree, not just "up" it.
Description
Gmail URLs have embedded info in IDs that we can extract.
Examples
References
Description
KSUIDs are time-sortable, so we should parse them into a timestamp and a random component.
Examples
References
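A decode sketch following the segmentio/ksuid layout: a 27-character base62 string that decodes to 20 bytes, where the first 4 bytes are a big-endian timestamp in seconds past a custom epoch of 1400000000 (2014-05-13), followed by 16 random bytes:

```python
from datetime import datetime, timezone

KSUID_EPOCH = 1400000000  # 2014-05-13, per the segmentio/ksuid spec
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def ksuid_timestamp(ksuid):
    """Decode a 27-char base62 KSUID and return its embedded UTC timestamp."""
    value = 0
    for ch in ksuid:
        value = value * 62 + ALPHABET.index(ch)
    raw = value.to_bytes(20, "big")  # 4 timestamp bytes + 16 random bytes
    seconds = int.from_bytes(raw[:4], "big") + KSUID_EPOCH
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```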
Add basic parsing for Medium URLs.
Examples:
References:
After raising #75 and trying to develop a new parser I realised that the code pulls from GitHub rather than building from local files.
A suggestion would be to use the local files rather than git, for example (as well as use a python base image):
FROM python:3-alpine
COPY requirements.txt /unfurl/requirements.txt
RUN cd /unfurl && \
pip3 install -r requirements.txt
COPY . /unfurl
RUN sed -i 's/^host.*/host = 0.0.0.0/' /unfurl/unfurl.ini
WORKDIR /unfurl
ENTRYPOINT ["python3", "unfurl_app.py"]
The reason for the two COPY
s is so that between changes, builds are cached up to the second COPY
and then those changes built rather than all the dependencies every time.
I figured I'd raise an issue rather than another PR to get some comments. It would be good to get @weslambert thoughts.
Description
Firebase's push IDs are chronological, 20-character unique IDs. A push ID contains 120 bits of information. The first 48 bits are a timestamp, which both reduces the chance of collision and allows consecutively created push IDs to sort chronologically. The timestamp is followed by 72 bits of randomness.
Examples
References
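A decoder sketch for the timestamp half, using the 64-character alphabet from Firebase's push-ID write-up; the first 8 characters encode the 48-bit millisecond timestamp, 6 bits per character:

```python
from datetime import datetime, timezone

# Alphabet from Firebase's push-ID write-up, ordered by ASCII value so that
# lexicographic order matches chronological order.
PUSH_CHARS = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def push_id_to_datetime(push_id):
    """Decode the 48-bit millisecond timestamp in the first 8 chars of a push ID."""
    ms = 0
    for ch in push_id[:8]:
        ms = ms * 64 + PUSH_CHARS.index(ch)
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```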
Description
I would like to create a testing mechanism. I suggest using either pytest or unittest.
Before I dive into working on the PR, could you please choose the one you recommend?
Description
Unfurl should parse mailto URLs, which behave very similarly to http URLs but lack the regular-looking scheme part (mailto: rather than http://, s3://, etc.). So first take the data before the ? (if present in the string); that is the to address(es). Anything after the ? can be treated like url.query.
Examples
References
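The split described above can be sketched with the standard library; `parse_mailto` is an illustrative helper, not Unfurl's API:

```python
from urllib.parse import parse_qs

def parse_mailto(url):
    """Split a mailto: URL into recipients and query-style parameters."""
    assert url.startswith("mailto:")
    rest = url[len("mailto:"):]
    # Everything before '?' is the recipient list; the remainder is an
    # ordinary query string (subject, cc, body, ...).
    addresses, _, query = rest.partition("?")
    return {
        "to": addresses.split(",") if addresses else [],
        "query": parse_qs(query),
    }
```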
Some websites add keys to the URL query string that have no value, but still affect the way the page is displayed. One (trimmed down) example is the following Facebook URL:
https://www.facebook.com/photo.php?type=3&theater
In this case, "theater" sits on its own and indicates that the photo should be opened in a lightbox. Unfortunately, the parameter is missing entirely from the tree after parsing, as you can see below:
This is a pretty easy fix. Modify line 66 of parse_url.py
as below:
- parsed_qs = urllib.parse.parse_qs(node.value)
+ parsed_qs = urllib.parse.parse_qs(node.value, keep_blank_values=True)
I could submit a pull request to fix this now; however, on lines 94 and 106 I see regexes for parsing similar forms (a=b|c=d|e=f and a=b&c=d&e=f). I'd like to make sure this issue is fixed in those as well, but I haven't yet figured out how to build a test case/example to cover them. Any ideas?
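The proposed one-line change can be verified in isolation with the standard library alone:

```python
from urllib.parse import parse_qs

query = "type=3&theater"

# Default behavior silently drops the valueless "theater" key...
assert parse_qs(query) == {"type": ["3"]}

# ...while keep_blank_values=True preserves it with an empty-string value.
assert parse_qs(query, keep_blank_values=True) == {"type": ["3"], "theater": [""]}
```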
Description
When selecting a node, add the ability to cmd+c (ctrl+c) to copy the content of the node.
Optional
Add a right click "copy" ability
I noticed that after cloning the repository from master and running with python3 unfurl_app.py, fragments in URLs do not appear in the unfurled graph. This issue is also present on your demo website when deep-linking to the URL, but not when I submit a URL via the form. I'm guessing it's because fragments aren't sent to the server when GETing a URL. Maybe it would be better to base64 the URL in the deep link, or something similar, to prevent the browser from misinterpreting the query.
Doesn't work:
http://127.0.0.1:5000/https://www.facebook.com/photo.php?type=3#hello
https://dfir.blog/unfurl/?url=https://www.facebook.com/photo.php?type=3#hello
Works:
Go to: https://dfir.blog/unfurl/
Submit query: https://www.facebook.com/photo.php?type=3#hello
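A sketch of the suggested base64 workaround, so `#` and `?` in the target URL survive the round trip through the browser; the `url64` parameter name is invented for illustration:

```python
import base64

def encode_deep_link(base, target_url):
    """Pack the target URL into a URL-safe base64 token so fragments survive."""
    token = base64.urlsafe_b64encode(target_url.encode()).decode()
    return f"{base}?url64={token}"  # `url64` is a hypothetical parameter name

def decode_deep_link(token):
    """Reverse of encode_deep_link, recovering the original target URL."""
    return base64.urlsafe_b64decode(token.encode()).decode()
```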
Description
Unfurl currently supports basic url-safe b64 decoding, and only if the results are ASCII. This should be expanded to include more encoding types and chains.
Currently supported:
Encodings to add:
Examples
References
Description
There is some good data encoded in YouTube URLs. At a minimum, some links point to a particular time in a video.
I'm happy to help/coach/answer questions if anyone wants to work on this issue!
Examples
References
I just read https://techkranti.com/idor-through-mongodb-object-ids-prediction/, and it brought to mind a case I was working last week. Not positive it was actually this, but I'm going to check when I get to work. In any case, it would be useful for unfurl to automatically recognize Mongo object IDs that appear in URLs, and decode the embedded timestamp. (If it already does, I apologize. I wasn't able to find an example ready-to-hand, with which to test, and I don't see any references to Mongo object IDs in the documentation.)
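Mongo ObjectIds are 24 hex characters whose first 4 bytes are a big-endian Unix timestamp in seconds, so recognition plus timestamp extraction can be sketched as:

```python
import re
from datetime import datetime, timezone

OBJECTID_RE = re.compile(r"\A[0-9a-fA-F]{24}\Z")

def objectid_timestamp(candidate):
    """If `candidate` looks like a Mongo ObjectId, return its embedded timestamp."""
    if not OBJECTID_RE.match(candidate):
        return None
    # First 4 of the 12 bytes are big-endian Unix seconds.
    seconds = int(candidate[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Note that any 24-char hex string matches the pattern, so (as with other ID parsers) the result is a "possible ObjectId" rather than a certainty.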
These URLs cause a misclassification:

http://test:[email protected]
  test: -> URL network location
  [email protected] -> URL query

http://test:/[email protected]
  test: -> URL network location
  /[email protected] -> URL path

http://test:#[email protected]
  test: -> URL network location
  [email protected] -> URL fragment

The misclassification appears if the password begins with a delimiter defined in RFC 3986, section 3.2:
The authority component is preceded by a double slash ("//") and is
terminated by the next slash ("/"), question mark ("?"), or number
sign ("#") character, or by the end of the URI.
Description
There is a possibility that the Mastodon and Discord snowflake-like IDs overlap.
Examples
Description
Yahoo uses the following subdomains for different types of search:
What is the best way to add these in at the top-layer?
Description
A new Google search parameter (gs_lcp) has appeared, and it looks like it may be the replacement for gs_l, which gave a lot of interesting information on search timing. Unfurl already parses the gs_lcp param as a nested protobuf, so what's needed is some research digging into what those parsed values actually mean.
Examples
References
Great tool thanks Ryan!
Both bih
and biw
parameters in google search currently appear as generic 'URL parsing functions'
However, they appear to be well documented, and it is testable that they equate to the browser window's height and width. This can be checked using something like: https://browsersize.com/
Thanks again!