obsidianforensics / unfurl
Extract and Visualize Data from URLs using Unfurl
Home Page: http://unfurl.link
License: Apache License 2.0
Description
GitHub URLs have a lot of info in them describing what is being viewed.
Examples
https://github.com/obsidianforensics/unfurl/blob/master/unfurl/parsers/parse_twitter.py#L15
https://github.com/obsidianforensics/unfurl/tree/master/unfurl/parsers
References
Add way to store API keys in a config file. Currently they are stored in code in each parser, which isn't good.
See here for other Google search URLs to include:
https://github.com/randomaccess3/googleURLParser
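A minimal sketch of what the requested config-file approach could look like, using Python's configparser. The `[api_keys]` section name and the use of unfurl.ini for this purpose are assumptions for illustration, not Unfurl's actual config schema:

```python
import configparser

def load_api_key(service, config_path="unfurl.ini"):
    """Look up an API key for a parser from a config file.

    Assumes a hypothetical [api_keys] section, e.g.:
        [api_keys]
        virustotal = abc123
    """
    config = configparser.ConfigParser()
    config.read(config_path)
    # fallback=None lets a parser degrade gracefully when no key is configured
    return config.get("api_keys", service, fallback=None)
```

Parsers would then call `load_api_key("virustotal")` instead of embedding the key in code.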
Description
Parse out the clicked URL from a Yahoo search redirect.
For example, given a URL like the following:
https://r.search.yahoo.com/_ylt=AwrBT8OVIB9aqsoASDZXNyoA;_ylu=X3oDMTByOHZyb21tBGNvbG8DYmYxBHBvcwMxBHZ0aWQDBHNlYwNzcg--/RV=2/RE=1512018198/RO=10/RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html/RK=2/RS=vQW48_o6zXyIDewim5cXq8Np1zo-
from urllib.parse import unquote, urlsplit

path = urlsplit(url).path  # url is the r.search.yahoo.com URL above
RU = [e for e in path.split("/") if e.startswith("RU=")]
# RU = ['RU=https%3a%2f%2fdeveloper.yahoo.com%2fcocktails%2fmojito%2fdocs%2fcode_exs%2fquery_params.html']
link_target = unquote(RU[0][3:])
# link_target = 'https://developer.yahoo.com/cocktails/mojito/docs/code_exs/query_params.html'
Some nodes may combine to give more info, and thus it would be nice to reference both "parent" nodes from a child node. One example is the ei Google param; one node has full seconds and another has fractional seconds. Combining both these nodes would then yield the complete timestamp.
Description
IP addresses can be represented in different ways. Add support in Unfurl to parse all these variants of IP addresses:
From https://www.trustwave.com/en-us/resources/blogs/spiderlabs-blog/evasive-urls-in-spam/:
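As a sketch of the variants involved (whole-address hex, plain decimal, and dotted notation with octal or hex octets), all of which should normalize to the same dotted-quad; the function and structure here are illustrative, not Unfurl's parser:

```python
import ipaddress

def _octet(part):
    # Per-octet base detection: 0x.. is hex, a leading zero means octal,
    # anything else is decimal.
    if part.lower().startswith("0x"):
        return int(part, 16)
    if part.startswith("0") and len(part) > 1:
        return int(part, 8)
    return int(part)

def normalize_ip(raw):
    """Convert common obfuscated IPv4 notations to dotted-quad form."""
    if raw.lower().startswith("0x"):       # whole-address hex: 0xC0A80101
        return str(ipaddress.ip_address(int(raw, 16)))
    if raw.isdigit():                      # plain decimal integer: 3232235777
        return str(ipaddress.ip_address(int(raw)))
    parts = raw.split(".")
    if len(parts) == 4:                    # dotted, possibly octal/hex octets
        value = 0
        for part in parts:
            value = (value << 8) | _octet(part)
        return str(ipaddress.ip_address(value))
    raise ValueError(f"unrecognized IP form: {raw}")
```

For example, `0xC0A80101`, `3232235777`, and `0300.0250.0001.0001` all resolve to `192.168.1.1`.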
References
Right now, errors/warnings/etc. are just printed. Switch to using the logging module.
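A minimal sketch of the switch; the logger name "unfurl" and the example function are assumptions for illustration:

```python
import logging

# Module-level logger replacing bare print() calls; callers/apps decide
# where the output goes by configuring handlers.
logger = logging.getLogger("unfurl")

def parse_int_field(value):
    try:
        return int(value)
    except ValueError:
        # was: print(f"could not parse {value}")
        logger.warning("could not parse %r", value)
        return None
```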
Looks to be time-based, but exact structure needs more research. Also unsure where they are used.
Examples:
References:
Description
Flake IDs are similar to snowflakes or UUIDv1 in that there is interesting info inside them.
Examples
References
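To illustrate the family this issue mentions: here is how the timestamp inside a Twitter snowflake (a close relative of flake IDs) is recovered, using the documented snowflake layout (41-bit millisecond timestamp above a 22-bit worker/sequence field, with a custom epoch). Flake IDs themselves use a different bit layout, so this is a starting point rather than a flake decoder:

```python
from datetime import datetime, timezone

TWITTER_EPOCH_MS = 1288834974657  # 2010-11-04, the documented snowflake epoch

def snowflake_to_datetime(snowflake):
    """Extract the embedded UTC timestamp from a Twitter snowflake ID."""
    ms = (int(snowflake) >> 22) + TWITTER_EPOCH_MS
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```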
The mouseover for the ei parameter states that it is believed to correspond to when the search took place. This is not correct; I have seen ei parameters hours apart from the actual search.
What I have gathered is that ei corresponds to session start (when the tab was opened) or to the last search. If there is only one search, made immediately after opening the tab, it will roughly correspond to the search time, but that is a coincidence.
An example can be seen here:
I opened a tab and searched for 'First search at 17:22' (At 17:22 naturally).
I then searched for 'Second search at 17:24'
Finally waited and searched for 'third search at 17:28'
So there is no indication of the actual search time.
Description
Add support for identifying URLs which are the results from a Google image search
Examples
Description
Add the protocol and port to the parsed results for a URL. Ideas for child nodes are SSL cert lookup info if the protocol was HTTPS, and common services associated with ports (if the port was specified).
Examples
References
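A rough sketch of what the parsed result could include; the port-to-service table is a tiny illustrative subset, and the function name is invented:

```python
from urllib.parse import urlsplit

# Illustrative mapping only; a real parser would use a fuller IANA table.
COMMON_SERVICES = {21: "ftp", 22: "ssh", 25: "smtp", 80: "http", 443: "https", 3389: "rdp"}
DEFAULT_PORTS = {"http": 80, "https": 443, "ftp": 21}

def describe_endpoint(url):
    """Return scheme, effective port, and a guess at the associated service."""
    parts = urlsplit(url)
    port = parts.port if parts.port is not None else DEFAULT_PORTS.get(parts.scheme)
    return {
        "scheme": parts.scheme,
        "port": port,
        "port_explicit": parts.port is not None,  # was the port in the URL itself?
        "service": COMMON_SERVICES.get(port),
    }
```

The `port_explicit` flag would let the SSL-cert or service child nodes only appear when they add information.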
Description
Metasploit / Cobalt Strike use some generated URLs that can be decoded. Didier Stevens did a write-up and built a tool to decode them (https://isc.sans.edu/diary/27204). That research is a great base for an Unfurl parser.
Examples
References
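The core of the scheme Didier Stevens describes is a checksum8: the ASCII values of the URI path characters sum to a known constant mod 256. A sketch, where the constant 92 (commonly documented as Metasploit's Windows staged-payload value, URI_CHECKSUM_INITW) should be treated as an assumption to verify against the write-up:

```python
def checksum8(uri_path):
    """Sum of the ASCII values of the path characters, modulo 256."""
    return sum(ord(c) for c in uri_path) % 256

def looks_like_msf_stager(path):
    # 92 is the value commonly cited for Windows staged payloads; other
    # checksum values map to other payload types per the write-up.
    return checksum8(path.strip("/")) == 92
```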
I've created a Dockerfile and docker-compose file for standing up unfurl in a Docker container. However, it appears that even though we can change the host/port for unfurl in config.ini, if you are running via Docker etc., the fetch request here refers to the explicit host/port as defined in config.ini.
This becomes an issue when you are trying to expose unfurl via 0.0.0.0 in the Docker container and access it in the browser via the host IP address (something like 192.168.1.199), because unfurl then tries to fetch 0.0.0.0:5000 instead of the host IP.
I've got a PR ready (from here) that replaces unfurl_host:unfurl_port with window.location.host (the host + port of the current request) and provides the necessary files for running unfurl in a Docker container. See here for more details on the host/port change.
This seems to work okay from my testing.
Any thoughts on potential issues with this, or how this could be done better to support Docker?
If this all sounds okay, I will go ahead and submit a PR -- just let me know!
Description
Zoom (zoom.us) has gotten pretty popular recently. What can be pulled out of Zoom URLs?
Examples
References
The website linked in the repo's about section, https://unfurl.link/, is dead: the server 162.255.119.33 refuses connections on port 443, and port 80 redirects to https://dfir.blog/unfurl/
$ curl https://unfurl.link/
curl: (7) Failed to connect to unfurl.link port 443 after 316 ms: Connection refused
$ curl http://unfurl.link/
<a href='https://dfir.blog/unfurl/'>Found</a>.
Description
Grammar typo: "up" appears twice in "by breaking up a URL up into components".
FR for Unfurl to be able to read in a bunch of URLs from a file (one per line) and optionally write output to a CSV file.
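A sketch of the requested batch mode; `parse_url` here is a stand-in callable for whatever Unfurl API call does the real work, not an actual Unfurl function:

```python
import csv

def run_batch(input_path, output_path, parse_url):
    """Read URLs (one per line) and write one CSV row per parsed URL."""
    with open(input_path) as infile, open(output_path, "w", newline="") as outfile:
        writer = csv.writer(outfile)
        writer.writerow(["url", "result"])
        for line in infile:
            url = line.strip()
            if url:  # skip blank lines
                writer.writerow([url, parse_url(url)])
```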
Description
Mastodon is a social network that uses IDs similar to Snowflakes, so we should be able to pull out a timestamp.
Examples
References
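Mastodon's snowflake-like IDs carry a millisecond Unix timestamp in the bits above the low 16 (which hold a sequence value), so a decoder sketch could look like:

```python
from datetime import datetime, timezone

def mastodon_id_to_datetime(status_id):
    """Extract the millisecond Unix timestamp embedded in a Mastodon status ID."""
    ms = int(status_id) >> 16  # low 16 bits are a per-millisecond sequence value
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```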
Find better way to integrate referenced sources (clicking on hover text isn't working).
Description
Add a lookup "parser" that takes a domain name in and shows some basic reputation things; for example popularity (Alexa 1M?) or if it has been flagged as bad (SafeBrowsing?).
Examples
References
Because the application port is referenced in multiple places, and many folks run multiple services on certain hosts (and the typical default Flask port is 5000 😃 ) the application port should be easily modifiable (pull value from config file, etc). I can assist with a PR when I get a chance.
Description
Add support for parsing URLs for the result from a Bing image search
Examples
References
Where is unfurl on PyPI ;-) ?
Description
The Bing parser right now is very minimal; what else can we pull from a Bing search URL?
Examples
References
Description
There is a bunch of data encoded in Proofpoint URL Defense URLs; at this point I'm unsure what beyond the "original" URL can be extracted. Needs exploration.
Examples
From https://help.proofpoint.com/Threat_Insight_Dashboard/API_Documentation/URL_Decoder_API:
References
Unfurl supports multiple timestamps, but we can add more.
References:
Description
Consider implementing expansion of short URLs.
Pro: Lots of URL shortening services can hide interesting links.
Con: We'd need to reach out to 3rd party sites to do this resolution (the data is not embedded in the URL).
Add way for nodes to query their "siblings", not just parents.
Many URLs have a hierarchical structure in the URL path, with one segment value affecting how others are interpreted. We need a way to query "across" the tree, not just "up" it.
Description
Gmail URLs have embedded info in IDs that we can extract.
Examples
References
Description
KSUIDs are time-sortable, so we should parse them into a timestamp and a random component.
Examples
References
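A decode sketch following the segmentio/ksuid layout: a 27-character base62 string that decodes to 20 bytes, where the first 4 bytes are a big-endian timestamp in seconds past a custom epoch of 1400000000 (2014-05-13), followed by 16 random bytes:

```python
from datetime import datetime, timezone

KSUID_EPOCH = 1400000000  # 2014-05-13, per the segmentio/ksuid spec
ALPHABET = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz"

def ksuid_timestamp(ksuid):
    """Decode a 27-char base62 KSUID and return its embedded UTC timestamp."""
    value = 0
    for ch in ksuid:
        value = value * 62 + ALPHABET.index(ch)
    raw = value.to_bytes(20, "big")  # 4 timestamp bytes + 16 random bytes
    seconds = int.from_bytes(raw[:4], "big") + KSUID_EPOCH
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```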
Add basic parsing for Medium URLs.
Examples:
References:
After raising #75 and trying to develop a new parser I realised that the code pulls from GitHub rather than building from local files.
A suggestion would be to use the local files rather than git, for example (as well as use a python base image):
FROM python:3-alpine
COPY requirements.txt /unfurl/requirements.txt
RUN cd /unfurl && \
pip3 install -r requirements.txt
COPY . /unfurl
RUN sed -i 's/^host.*/host = 0.0.0.0/' /unfurl/unfurl.ini
WORKDIR /unfurl
ENTRYPOINT ["python3", "unfurl_app.py"]
The reason for the two COPY
s is so that between changes, builds are cached up to the second COPY
and then those changes built rather than all the dependencies every time.
I figured I'd raise an issue rather than another PR to get some comments. It would be good to get @weslambert thoughts.
Description
Firebase's push IDs are chronological, 20-character unique IDs. A push ID contains 120 bits of information. The first 48 bits are a timestamp, which both reduces the chance of collision and allows consecutively created push IDs to sort chronologically. The timestamp is followed by 72 bits of randomness.
Examples
References
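A decoder sketch for the timestamp half, using the 64-character alphabet from Firebase's push-ID write-up; the first 8 characters encode the 48-bit millisecond timestamp, 6 bits per character:

```python
from datetime import datetime, timezone

# Alphabet from Firebase's push-ID write-up, ordered by ASCII value so that
# lexicographic order matches chronological order.
PUSH_CHARS = "-0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz"

def push_id_to_datetime(push_id):
    """Decode the 48-bit millisecond timestamp in the first 8 chars of a push ID."""
    ms = 0
    for ch in push_id[:8]:
        ms = ms * 64 + PUSH_CHARS.index(ch)
    return datetime.fromtimestamp(ms / 1000, tz=timezone.utc)
```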
Description
I would like to create a testing mechanism. I suggest using either pytest or unittest.
Before I dive into working on the PR, could you please choose the one you recommend?
Description
Unfurl should parse mailto URLs, which behave very similarly to http URLs but lack the regular-looking scheme part (mailto: rather than http://, s3://, etc.). So first take the data before the ? (if present in the string); that is the to address(es). Anything after the ? can be treated like url.query.
Examples
References
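The split described above can be sketched with the standard library; `parse_mailto` is an illustrative helper, not Unfurl's API:

```python
from urllib.parse import parse_qs

def parse_mailto(url):
    """Split a mailto: URL into recipients and query-style parameters."""
    assert url.startswith("mailto:")
    rest = url[len("mailto:"):]
    # Everything before '?' is the recipient list; the remainder is an
    # ordinary query string (subject, cc, body, ...).
    addresses, _, query = rest.partition("?")
    return {
        "to": addresses.split(",") if addresses else [],
        "query": parse_qs(query),
    }
```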
Some websites add keys to the URL query string that have no value, but still affect the way the page is displayed. One (trimmed down) example is the following Facebook URL:
https://www.facebook.com/photo.php?type=3&theater
In this case, "theater" sits on its own and indicates that the photo should be opened in a lightbox. Unfortunately, the parameter is missing entirely from the tree after parsing, as you can see below:
This is a pretty easy fix. Modify line 66 of parse_url.py
as below:
- parsed_qs = urllib.parse.parse_qs(node.value)
+ parsed_qs = urllib.parse.parse_qs(node.value, keep_blank_values=True)
I could submit a pull request to fix this now; however, on lines 94 and 106 I see regexes for parsing similar forms (a=b|c=d|e=f and a=b&c=d&e=f). I'd like to make sure this issue is fixed in those as well, but I haven't yet figured out how to build a test case/example to cover them. Any ideas?
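The proposed one-line change can be verified in isolation with the standard library alone:

```python
from urllib.parse import parse_qs

query = "type=3&theater"

# Default behavior silently drops the valueless "theater" key...
assert parse_qs(query) == {"type": ["3"]}

# ...while keep_blank_values=True preserves it with an empty-string value.
assert parse_qs(query, keep_blank_values=True) == {"type": ["3"], "theater": [""]}
```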
Description
When selecting a node, add the ability to cmd+c (ctrl+c) to copy the content of the node.
Optional
Add a right click "copy" ability
I noticed that after cloning the repository from master and running with python3 unfurl_app.py, fragments in URLs do not appear in the unfurled graph. This issue is also present on your demo website when deep-linking to the URL, but not when I submit a URL via the form. I'm guessing it's because fragments aren't sent to the server when GETing a URL. Maybe it would be better to base64 the URL in the deep link, or something similar, to prevent the browser from misinterpreting the query.
Doesn't work:
http://127.0.0.1:5000/https://www.facebook.com/photo.php?type=3#hello
https://dfir.blog/unfurl/?url=https://www.facebook.com/photo.php?type=3#hello
Works:
Go to: https://dfir.blog/unfurl/
Submit query: https://www.facebook.com/photo.php?type=3#hello
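A sketch of the suggested base64 workaround, so `#` and `?` in the target URL survive the round trip through the browser; the `url64` parameter name is invented for illustration:

```python
import base64

def encode_deep_link(base, target_url):
    """Pack the target URL into a URL-safe base64 token so fragments survive."""
    token = base64.urlsafe_b64encode(target_url.encode()).decode()
    return f"{base}?url64={token}"  # `url64` is a hypothetical parameter name

def decode_deep_link(token):
    """Reverse of encode_deep_link, recovering the original target URL."""
    return base64.urlsafe_b64decode(token.encode()).decode()
```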
Description
Unfurl currently supports basic url-safe b64 decoding, and only if the results are ASCII. This should be expanded to include more encoding types and chains.
Currently supported:
Encodings to add:
Examples
References
Description
There is some good data encoded in YouTube URLs. At a minimum, some links point to a particular time in a video.
I'm happy to help/coach/answer questions if anyone wants to work on this issue!
Examples
References
I just read https://techkranti.com/idor-through-mongodb-object-ids-prediction/, and it brought to mind a case I was working last week. Not positive it was actually this, but I'm going to check when I get to work. In any case, it would be useful for unfurl to automatically recognize Mongo object IDs that appear in URLs, and decode the embedded timestamp. (If it already does, I apologize. I wasn't able to find an example ready-to-hand, with which to test, and I don't see any references to Mongo object IDs in the documentation.)
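Mongo ObjectIds are 24 hex characters whose first 4 bytes are a big-endian Unix timestamp in seconds, so recognition plus timestamp extraction can be sketched as:

```python
import re
from datetime import datetime, timezone

OBJECTID_RE = re.compile(r"\A[0-9a-fA-F]{24}\Z")

def objectid_timestamp(candidate):
    """If `candidate` looks like a Mongo ObjectId, return its embedded timestamp."""
    if not OBJECTID_RE.match(candidate):
        return None
    # First 4 of the 12 bytes are big-endian Unix seconds.
    seconds = int(candidate[:8], 16)
    return datetime.fromtimestamp(seconds, tz=timezone.utc)
```

Note that any 24-char hex string matches the pattern, so (as with other ID parsers) the result is a "possible ObjectId" rather than a certainty.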
These URLs cause a misclassification:

http://test:[email protected]
  test: -> URL network location
  [email protected] -> URL query

http://test:/[email protected]
  test: -> URL network location
  /[email protected] -> URL path

http://test:#[email protected]
  test: -> URL network location
  [email protected] -> URL fragment

The misclassification appears if the password begins with a delimiter defined in RFC 3986, section 3.2:
The authority component is preceded by a double slash ("//") and is
terminated by the next slash ("/"), question mark ("?"), or number
sign ("#") character, or by the end of the URI.
Description
There is a possibility that the Mastodon and Discord snowflake-like IDs overlap.
Examples
Description
Yahoo uses the following subdomains for different types of search:
What is the best way to add these in at the top-layer?
Description
A new Google search parameter (gs_lcp) has appeared, and it looks like it may be the replacement for gs_l, which gave a lot of interesting information on search timing. Unfurl already parses the gs_lcp param as a nested protobuf, so what's needed is some research digging into what those parsed values actually mean.
Examples
References
Great tool thanks Ryan!
Both bih
and biw
parameters in google search currently appear as generic 'URL parsing functions'
However, they appear to be well documented, and it is testable that they equate to the browser window's height and width. This can be checked using something like: https://browsersize.com/
Thanks again!