unsplash / datasets
🎁 5,400,000+ Unsplash images made available for research and machine learning
Home Page: https://unsplash.com/data
The link in the readme to download the Lite dataset is always the same, and always points to the latest version of the dataset.
This comment seems to indicate that a new snapshot of the dataset is cut when a new version of the contents of this repo is released. It would probably be useful to have links to the various versions of the datasets available, if for nothing more than historical purposes.
I imagine that this could be integrated with #23, so that the CHANGELOG.md, or possibly a releases.json, would include the download link for the Lite dataset and the appropriate integrity check for each released version, including the full dataset's integrity check.
For the record, I tried hitting several variations of https://unsplash.com/data/lite/<version> to see if a link to the v1.0.0 dataset was available. No luck 😄.
I love your API and would like to integrate your commercial images into our product through your API. Would you consider creating an API endpoint for the datasets?
It could work the same way as your existing API, just serving those datasets.
Flickr's Creative Commons images are similar, but I like your API more. 👍
Hi,
Just to confirm, if I train a model on the Lite dataset may I distribute it? The license says I cannot distribute the dataset, but is it correct that no restrictions are placed on models trained on the dataset?
Thanks!
How can I download the images of the Unsplash Lite dataset using "unsplash-research-dataset-lite-latest.zip"?
Describe the bug
Photo with id sEDzxW4NhL4 has errors.
While it can be accessed via its photo_url https://unsplash.com/photos/sEDzxW4NhL4, it cannot be accessed
via its photo_image_url https://images.unsplash.com/photo-1586019496196-bdbea65add07
To Reproduce
https://images.unsplash.com/photo-1586019496196-bdbea65add07
Expected behavior
Additional context
Hey,
I'm trying to calculate the average number of images a user downloads.
As I know from my own photo stats, a lot of downloads are generated via API requests from external applications. You state in your API doc that external applications don't need to authenticate on a user level.
My question: is a single anonymous user id generated for an external application like Trello, or do you have a better approach to distinguish the users "behind" the external application?
Example from the test dataset
Could the user from the first row (942 downloads) really be one person or also a whole logical entity like Trello?
anonymous_user_id | downloads
---|---
5a055748-57d2-45c1-a882-5b9bb9313509 | 942
beb0923e-c17d-4a90-a8db-47b0f45fb0fc | 897
85e5db9c-07c7-49bf-9e08-5cbd1603dd74 | 546
... | ...
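To make the question concrete, here is a rough sketch of the calculation using just the sample rows above (a real analysis would group conversions.tsv000 by anonymous_user_id; the DataFrame here only mirrors the table shown):

```python
import pandas as pd

# Sample rows from the table above -- illustrative, not the full dataset.
downloads = pd.DataFrame({
    "anonymous_user_id": [
        "5a055748-57d2-45c1-a882-5b9bb9313509",
        "beb0923e-c17d-4a90-a8db-47b0f45fb0fc",
        "85e5db9c-07c7-49bf-9e08-5cbd1603dd74",
    ],
    "downloads": [942, 897, 546],
})

# Average downloads per anonymous user over the sample.
average_downloads = downloads["downloads"].mean()
print(average_downloads)
```

Whether each id is one person or a whole logical entity (like Trello) changes how this average should be interpreted, which is exactly the question.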
Hello,
I would like to download the images using the provided urls. Do I risk getting blocked for making too many requests/downloads?
Thank you
Describe the bug
Hello,
I was looking at the data from the Lite dataset this morning and noticed something odd in the 'ai_service_2_confidence' column of the keywords.tsv000 file.
When I computed some statistics on the ai_service columns, 'ai_service_2_confidence' turned out to contain extreme values exceeding 100, which I would expect to be the maximum (taking ai_service_1_confidence as a reference, for example).
To Reproduce
Here is the code to build the stats:
import pandas as pd
dfp_keywords_raw = pd.read_csv('keywords.tsv000', sep='\t', header=0)
dfp_keywords_raw[['ai_service_1_confidence', 'ai_service_2_confidence']].describe()
Steps to reproduce the behavior:
Having a python environment (3.6.13) with pandas 1.1.5 installed
Expected behavior
I expect the values in the 'ai_service_2_confidence' column of the keywords.tsv000 file to lie between 0 and 100, or, if that is not the case, a more precise description of the column (such as its range) in the documentation.
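For reference, here is a sketch of how the out-of-range rows can be flagged, assuming [0, 100] is the expected range (the sample values here are illustrative, not taken from the file):

```python
import pandas as pd

# Illustrative rows standing in for keywords.tsv000.
df = pd.DataFrame({
    "keyword": ["tree", "sky", "odd"],
    "ai_service_2_confidence": [55.2, 99.9, 3276.8],
})

# Keep only rows whose confidence falls outside the expected [0, 100] range.
out_of_range = df[~df["ai_service_2_confidence"].between(0, 100)]
print(out_of_range["keyword"].tolist())
```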
Additional context
I have a list of the keywords that seem to be affected by these extreme values:
unsplash_extreme_value.zip
Hope that it helps with your investigation 🕵️♀️ (and I hope it's not just me missing something).
PS: your dataset is great by the way (really hoping to get access to the full version soon) 👍
I noticed that in the Lite dataset, there is only an AI caption. Is there a reason that the user's submitted caption isn't there?
Describe the bug
The values of the photo_location_latitude and photo_location_longitude entries in photos.tsv are swapped (in both the Lite and Full versions).
To Reproduce
Using the photo with id gXSFnk2a9V4 as an example (currently indexed as 1 in the Lite dataset):
import pandas as pd
df = pd.read_csv('photos.tsv000', sep='\t', header=0)
print({'latitude': df.loc[1]['photo_location_latitude'], 'longitude': df.loc[1]['photo_location_longitude']})
This outputs {'latitude': -123.97116667, 'longitude': 45.4655}. You can already see that this is incorrect, since latitude is measured within [-90, 90].
curl -k 'https://unsplash.com/napi/photos/aerial-photography-of-seashore-gXSFnk2a9V4' \
-H 'Accept: */*' \
-H 'Connection: keep-alive' \
-H 'Sec-Fetch-Dest: empty' \
-H 'Sec-Fetch-Mode: cors' \
-H 'Sec-Fetch-Site: same-origin' \
-H 'User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/124.0.0.0 Safari/537.36 Edg/124.0.0.0' \
-H 'accept-language: en-US' \
-H 'sec-ch-ua: "Chromium";v="124", "Microsoft Edge";v="124", "Not-A.Brand";v="99"' \
-H 'sec-ch-ua-mobile: ?0' | \
python3 -c "import sys, json; resp = json.load(sys.stdin); print(resp['location']['position'])"
which outputs {'latitude': 45.4655, 'longitude': -123.97116667}.
Expected behavior
The entries in the dataset should contain the correct coordinates, meaning the values of the photo_location_latitude and photo_location_longitude columns should be swapped.
Additional context
N/A
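Until the dataset is fixed, a possible workaround (a sketch, assuming the swap affects every row with coordinates) is to swap the two columns right after loading:

```python
import pandas as pd

# One illustrative row with the swapped values reported above.
df = pd.DataFrame({
    "photo_location_latitude": [-123.97116667],
    "photo_location_longitude": [45.4655],
})

# Swap the two columns in place; .to_numpy() avoids pandas aligning
# the assignment by column label (which would be a no-op).
df[["photo_location_latitude", "photo_location_longitude"]] = df[
    ["photo_location_longitude", "photo_location_latitude"]
].to_numpy()
print(df.iloc[0].tolist())
```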
I was checking the cardinality of various columns in the dataset, and unsplash_photos.photo_featured is always true.
unsplash_lite=# select photo_featured, count(*) from unsplash_photos group by 1;
photo_featured | count
----------------+-------
t | 25000
Is this the expected value?
Also, the data type for this column in create-tables.sql is varchar, and I think it should be bool. I did a quick reload of the data to check whether it would still be valid with that change, and it would. Happy to submit a pull request for that if you like.
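For anyone who wants proper booleans before such a change lands, a sketch of the check and conversion in pandas (the sample values mirror the 't' strings as loaded from photos.tsv000):

```python
import pandas as pd

# Illustrative column of 't'/'f' strings as loaded from the TSV.
df = pd.DataFrame({"photo_featured": ["t", "t", "t"]})

# Confirm the column is single-valued, then convert to real booleans.
print(df["photo_featured"].unique())
df["photo_featured"] = df["photo_featured"].map({"t": True, "f": False})
print(df["photo_featured"].all())
```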
Hello,
I'm an English Learning Educator and I have a registered project set up already.
I would like to get random images, but ONLY ones that have a clear or detailed description field, not null or empty values.
Is there a way to do this with your current API, or could it be implemented by your team as a feature request?
something like this:
https://api.unsplash.com/photos/random?description=true&description_min_chars=10&client_id=XXXXXX
params:
description = true (required)
description_min_chars = int (required): minimum description length in characters
Regards
Hugo Barbosa
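Until such parameters exist, filtering client-side over /photos/random responses is a possible workaround. This sketch assumes the response JSON has a description field that may be null; has_detailed_description is a hypothetical helper, and min_chars mirrors the proposed description_min_chars:

```python
def has_detailed_description(photo: dict, min_chars: int = 10) -> bool:
    """Return True if the photo's description is non-null and long enough."""
    description = photo.get("description") or ""
    return len(description.strip()) >= min_chars

# Illustrative payloads shaped like /photos/random responses.
photos = [
    {"id": "a", "description": None},
    {"id": "b", "description": "A quiet beach at sunrise."},
]
kept = [p["id"] for p in photos if has_detailed_description(p)]
print(kept)
```

The downside of the client-side approach is wasted requests against the rate limit, which is why a server-side parameter would be nicer.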
Ticket #21 mentions additional image metadata that should be in the dataset. Are there other image analysis values that Unsplash calculates that could be added? I'm thinking in particular of color statistics: mean/median pixel value, min/max pixel values, etc.
We should provide an integrity check for the Lite and the Full datasets.
We could make the SHA256 hash public.
I got this reply: "Thanks for inquiring about Unsplash Full Dataset. I would recommend you to download the Lite Dataset before using the Full one. The Lite Dataset is meant to be open and allow anyone to experiment. If you believe your experiment or research would need the whole 2M+ images, we are happy to give you access to it then.
The Full Dataset is meant for artificial intelligence and machine learning research mostly when the Lite Dataset is not sufficient enough."
In the Lite dataset, 3 photos have corrupted photo_image_url links. I have fetched the real photo_image_url from each photo_url, in case anyone wants them:
rsJtMXn3p_c -> https://images.unsplash.com/9/vectorbeastcom-grass-sun.jpg
vigsqYux_-8 -> https://images.unsplash.com/reserve/vof4H8A1S02iWcK6mSAd_sarahmachtsachen.com_TheBeach.jpg
9_9hzZVjV8s -> https://images.unsplash.com/reserve/RFDKkrvXSHqBaWMMl4W5_Heavy_company
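A sketch of applying these fixes after loading photos.tsv000 (URL_FIXES and patched_url are illustrative names; the replacement URLs are the ones listed above):

```python
# Map of photo_id -> corrected photo_image_url, from the list above.
URL_FIXES = {
    "rsJtMXn3p_c": "https://images.unsplash.com/9/vectorbeastcom-grass-sun.jpg",
    "vigsqYux_-8": "https://images.unsplash.com/reserve/vof4H8A1S02iWcK6mSAd_sarahmachtsachen.com_TheBeach.jpg",
    "9_9hzZVjV8s": "https://images.unsplash.com/reserve/RFDKkrvXSHqBaWMMl4W5_Heavy_company",
}

def patched_url(photo_id: str, current_url: str) -> str:
    """Return the corrected URL for the three known-bad photos,
    and the original URL for everything else."""
    return URL_FIXES.get(photo_id, current_url)

print(patched_url("rsJtMXn3p_c", "https://example.com/broken"))
```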
In addition, 20 images in the dataset are premium, and the available images have watermarks. It would be awesome for the next version either not to include these or to have a flag disclosing whether they are premium.
PS: I requested access to the full dataset a few days ago and still haven't gotten any response (including acknowledgment that the request was received). How long does it usually take for a request to be reviewed, and is there an email address for confirming that a request was received successfully?
First off, thanks a lot for making the data available - it's a tremendous service to the research community!
@TimmyCarbone, I have a question regarding the relationship between LITE and FULL.
From what I understand, the LITE dataset is a subset of the FULL dataset. How were the 25k images in the first release of the LITE dataset selected? And how did you select the images that were added to replace removed images in subsequent releases?
Thanks!
I am trying to download the Lite version of the Unsplash dataset, and the download link for the Lite version doesn't give me images. Are the images in the download link, or do I need to download them using the API?
I have requested access to the full dataset. However, here is the reply:
"Thanks for inquiring about Unsplash Full Dataset. I would recommend you to download the Lite Dataset before using the Full one. The Lite Dataset is meant to be open and allow anyone to experiment. If you believe your experiment or research would need the whole 2M+ images, we are happy to give you access to it then.
The Full Dataset is meant for artificial intelligence and machine learning research mostly when the Lite Dataset is not sufficient enough."
After I received this message, I wrote multiple emails to further request access to the full dataset. However, I got no response from Victor Ballesteros. I do not understand why access to the full dataset is so difficult to obtain, and I doubt whether Unsplash is truly willing to let others use it.
Hi, Sylvain from the Hugging Face datasets team here.
It would be awesome to have this dataset published on Hugging Face. I discovered it through this blog post: https://huggingface.co/blog/visheratin/nomic-data-cleaning, which relies on a user dataset: https://huggingface.co/datasets/visheratin/unsplash-caption-questions-init.
Having a presence on the HF Hub would make it much easier for ML practitioners to train new models.
You can control the license, terms, and user access (see https://huggingface.co/docs/hub/datasets-gated#gated-datasets).
The keywords data file appears to have an embedded newline in one of the records. I just want to clarify if this is expected or not. It looks like the given psql loading instructions do account for newlines in the TSV file, but if folks are processing the file outside of that without using quote-escaping rules they may process the data incorrectly.
To Reproduce
% wc -l *.tsv000
1646598 collections.tsv000
4075505 conversions.tsv000
2689741 keywords.tsv000
25001 photos.tsv000
8436845 total
Load the data according to the documented instructions:
% psql -h localhost -U jeremy -d unsplash_lite -f load-data-client.sql
COPY 25000
COPY 2689739 # <-- Hmm.. this one is NOT 1 less than keywords.tsv000 above
COPY 1646597
COPY 4075504
Check the db row count
unsplash_lite=# select count(*) from unsplash_keywords;
count
---------
2689739
(1 row)
Expected behavior
I initially expected there to be 1 record for each non-header line of the TSV, but this appears to be an incorrect assumption. It looks like the psql command line parsed the TSV according to quoted-escape rules, so that is good.
I wrote a program to check the keywords file and it reports
% ruby check-tsv.rb keywords.tsv000
Headers: photo_id -- keyword -- ai_service_1_confidence -- ai_service_2_confidence -- suggested_by_user
[1590611 - PF4s20KB678-"fujisan] parts count 2 != 5
[1590612 - mount fuji"-] parts count 4 != 5
lines in file : 2689741
data lines : 2689740
unique row count: 2689740
Then looking at the lines around line 1590610 we see:
% sed -n '1590610,1590615p' keywords.tsv000
PF4s20KB678 night 22.3271160125732 f
PF4s20KB678 "fujisan
mount fuji" t
PF4s20KB678 pier 22.6900939941406 f
PF4s20KB678 viaduct 30.6490669250488 f
PF4s20KB678 architecture 33.084938049316399 f
And the db reports that row and the preceding and following rows correctly loaded.
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%fujisan%';
photo_id | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+------------+-------------------------+-------------------------+-------------------
PF4s20KB678 | fujisan +| | | t
| mount fuji | | |
(1 row)
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%pier%';
photo_id | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
PF4s20KB678 | pier | 22.6900939941406 | | f
(1 row)
unsplash_lite=# select * from unsplash_keywords where photo_id = 'PF4s20KB678' and keyword like '%night%';
photo_id | keyword | ai_service_1_confidence | ai_service_2_confidence | suggested_by_user
-------------+---------+-------------------------+-------------------------+-------------------
PF4s20KB678 | night | 22.3271160125732 | | f
If folks are processing these TSV files simplistically, without quote-escaping logic, they may process them incorrectly. I don't want folks to run into that. And maybe this points to an upstream data input issue: if users are entering newlines in the keyword input, how are they being processed in the main app?
We may just want to document that there can be embedded newlines in the TSV files.
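For anyone processing the files outside psql, Python's csv module applies the same quote-aware rules. A sketch, using a two-line quoted field that mirrors the PF4s20KB678 row above:

```python
import csv

# A header plus one record whose quoted keyword field spans two
# physical lines -- the shape reported above in keywords.tsv000.
sample = (
    "photo_id\tkeyword\tai_service_1_confidence\tai_service_2_confidence\tsuggested_by_user\n"
    'PF4s20KB678\t"fujisan\nmount fuji"\t\t\tt\n'
)

# csv.reader stitches quoted fields back together across newlines.
rows = list(csv.reader(sample.splitlines(True), delimiter="\t"))
print(len(rows))        # 2 records, despite 3 physical lines
print(repr(rows[1][1]))
```

A naive `line.split("\t")` over each physical line would split that record in two, which is exactly the miscount the ruby checker reported.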
Thanks!
I've applied for the full dataset and it was approved, so now I have the 47 GB dataset.
Will I be banned if I download the pictures?
I have a 1 Gb/s server and I'd like to download the pictures listed in 'photos.tsv000' (the URLs begin with 'images.unsplash.com'). Could I get banned for downloading them too quickly? (For my 1 Gb/s server, that's about 5-6 pictures per second.)
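Not an official answer, but a simple pacing helper is cheap insurance. A sketch, assuming a self-imposed limit (Unsplash does not publish a hard threshold here, so per_second is a guess to tune):

```python
import time

class Throttle:
    """Pace calls to roughly `per_second` per second by sleeping
    between them. The limit is an assumption, not a documented quota."""

    def __init__(self, per_second: float):
        self.interval = 1.0 / per_second
        self.last = 0.0

    def wait(self):
        now = time.monotonic()
        delay = self.last + self.interval - now
        if delay > 0:
            time.sleep(delay)
        self.last = time.monotonic()

throttle = Throttle(per_second=5)
# Usage sketch: for each url in photos.tsv000 --
#     throttle.wait()
#     download(url)   # hypothetical download function
```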
Can you include the size of the full dataset in your README.md? It gives the size of the Lite dataset as ~550 MB. I suppose I could extrapolate from there, but it would be nice to know what to expect while I wait to see if I'm approved for access. I'm guessing around 44 GB? Thanks.
I believe this dataset is due for an update soon. Will we receive another link to download the updated version once it is out? What is the procedure for getting it?
Is your feature request related to a problem? Please describe.
Since the Unsplash Slack server, the community's only connection point, isn't as populated as it could be, I think there could be another way of bringing the people of Unsplash together.
Describe the solution you'd like
With a dataset of all Unsplash contributors it would be possible to create a map, giving them a chance to find other motivated photographers nearby. The dataset should contain the name, the location, the number of photos, the URL and maybe the linked website.
Describe alternatives you've considered
Previously, there were local Slack channels on the server for connecting with people from the same country, but AFAIK they have been shut down.
Additional context
Nothing to add
[1] photo_featured: does this include the featured topic categories at the top?
[1.1] Could this field include which topics it is featured in, but also if the photo was submitted and rejected from a topic?
[1.2] Is there historic data for which images were not included as 'searchable' before the search system was replaced for everything being searchable?
[2] suggested_by_user: the description mentions 'a user (human)'. At some point (maybe?) Unsplash was adding tags or keywords to its approved/moderated photos; does (or could) this field distinguish who added the keyword (uploader or staff)?
[4] keyword: is this referring to the search terms used to find the photo?
[4.1] Assuming it is the search keywords, could we add a field for the position at which the photo was displayed on the website (i.e., whether it was in the first row, first column, or an image 30 photos down that a searcher scrolled to find and pick)?
Thanks for releasing the dataset, it's a great contribution to the research community!
Describe the bug
photos.tsv contains only 24,942 items.
Additional context
I have detected the new photo_ids (249 items); attaching them just in case.
When reviewing the data in the Lite dataset, all of the following fields are null in all records.
If all of these are supposed to be null all the time, it may be useful to drop those columns from the dataset completely.
However, if these columns do have data in the full dataset, it makes sense for them to exist; in that case, it may be useful to update the documentation to note that these fields are null in the Lite dataset and have values in the full dataset.
In any case, I'm just checking to make sure that this is the expected behavior.
Is your feature request related to a problem? Please describe.
Unsplash is an awesome dataset with records of anonymous user visits (conversions.tsv). I wonder if other organizations have open-sourced this kind of dataset with anonymous user access? It would be cool if there were, so we could add a Related Projects section to link them.
Describe the solution you'd like
Find and then link similar datasets with records of anonymous user visits to this project.
Describe alternatives you've considered
None.
Additional context
None.
The values in unsplash_photos.photo_location_country and unsplash_photos.photo_location_city appear to be freeform text, probably direct user input, with effectively duplicate entries. For example:
unsplash_lite=# select '>' || photo_location_city || '<' as city, '>' || photo_location_country || '<' as country, count(*) from unsplash_photos where lower(photo_location_city) like '%london%' group by 1,2;
city | country | count
-----------+----------------------+-------
>LONDON < | >United Kingdom < | 1
>London< | >Canada< | 7
>London< | >Egyesült Királyság< | 1
>London< | >England< | 1
>London< | >U.K.< | 2
>London< | >U.K< | 1
>London< | >United Kingdom < | 1
>London< | >United Kingdom< | 73
>London< | | 3
(9 rows)
It looks like these fields need some data cleaning, definitely some whitespace stripping and such. Is it assumed that we should do our own location normalization, possibly adding normalized_photo_location_country and normalized_photo_location_city columns?
Also - over in unsplash_conversion.conversion_country
this appears to be ISO 2 letter country codes. Is this guaranteed to be a valid ISO 2 letter country code? And was this data created based upon a maxmind geoip lookup or something similar?
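A sketch of a minimal client-side cleanup for the country field, based on the variants in the sample output above (the alias table is illustrative, not exhaustive; "Egyesült Királyság" is Hungarian for "United Kingdom"):

```python
# Illustrative alias table built from the London sample above --
# a real normalization pass would need a much larger mapping.
COUNTRY_ALIASES = {
    "U.K.": "United Kingdom",
    "U.K": "United Kingdom",
    "England": "United Kingdom",
    "Egyesült Királyság": "United Kingdom",
}

def normalize_country(raw):
    """Trim whitespace and fold known aliases; pass nulls through."""
    if raw is None:
        return None
    cleaned = raw.strip()
    return COUNTRY_ALIASES.get(cleaned, cleaned)

print(normalize_country("United Kingdom "))
print(normalize_country("U.K."))
```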
Thanks so much for this dataset; I think it is going to be quite useful for demonstration purposes. I hope these questions help increase the quality of an already great dataset.
Describe the bug
URL is bad
To Reproduce
Try downloading it
Expected behavior
Good URL
Additional context
Hi, can you speed up the application process for this dataset?
I'm really eager to run some experiments in related research.
I found it hard to get approved for either research or personal use.
How do I apply for the full dataset?
My email is [email protected] many thanks.
Hello,
Thanks for sharing / publishing this dataset.
Are there any plans to mirror this dataset on Kaggle? If not, can I publish the Lite dataset as a public dataset on Kaggle, linking it back to Unsplash as the source and using the same licensing terms as here?
It would be great to make this data available on Kaggle where a lot of ML research and models can be built.
The dataset is missing the width, height, and aspect ratio of each photo.
These are 3 important elements and should appear as 3 distinct fields.