olehonyshchak / pywikimm

Collects a multimodal dataset of Wikipedia articles and their images

License: GNU General Public License v3.0

Jupyter Notebook 27.52% Python 71.10% Dockerfile 1.39%

wikipedia wikipedia-scraper wikipedia-api wikipedia-bot wikipedia-entries wikipedia-dump wikipedia-search wikipedia-viewer wikipedia-corpus wikipedia-page

pywikimm's People

Contributors

Stargazers

Watchers

Forkers

yashodhank vivian-maes

pywikimm's Issues

Move files into subfolders in repo (src, docker)

Check the problematic image in Naruto article

Do spell-checking on README

Image 'features' property design question

Currently it is calculated as follows:

The output of the fifth convolutional layer of ResNet152 trained on the ImageNet dataset. That output, of shape (19, 24, 2048), is then max-pooled to shape (2048,). Features are extracted from the original images, downloaded in JPEG format with a fixed width of 600 px. In practice, the result is a list of 2048 floats.

But should we max-pool that tensor rather than save the original (19, 24, 2048) output? And if we really want just 2048 values, wouldn't it be better to use ResNet101, which outputs tensors of this shape? (TODO: double-check whether that's true)
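To make the trade-off concrete, here is a minimal sketch of the pooling step in plain Python. It is not the project's code: the function name `global_max_pool` and the toy input are assumptions for illustration, and a tiny 2 x 3 x 4 nested list stands in for the real (19, 24, 2048) tensor. The point is that pooling keeps one value per channel and discards where in the image that value occurred.

```python
def global_max_pool(feature_map):
    """Collapse an H x W x C nested list into a length-C list.

    For each channel, keep the maximum value over all spatial positions.
    (Hypothetical helper, not part of pywikimm.)
    """
    channels = len(feature_map[0][0])
    pooled = [float("-inf")] * channels
    for row in feature_map:
        for cell in row:
            for c, value in enumerate(cell):
                if value > pooled[c]:
                    pooled[c] = value
    return pooled


# Toy stand-in for the (19, 24, 2048) tensor: 2 x 3 spatial grid, 4 channels.
toy_map = [
    [[0.1, 0.5, 0.0, 2.0], [0.3, 0.2, 0.9, 1.0], [0.0, 0.0, 0.0, 0.0]],
    [[0.7, 0.1, 0.4, 0.5], [0.2, 0.8, 0.1, 0.3], [0.6, 0.4, 0.2, 0.1]],
]

features = global_max_pool(toy_map)
print(features)  # one float per channel
```

Saving the unpooled tensor would instead keep 19 * 24 = 456 times as many floats per image, which is the storage cost to weigh against the lost spatial information.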

Update Kaggle version

Constructor Design for the library

Think about design choices that make the library easily extensible. For example, let query accept a list of functions for processing text and images. Text handlers could take the page's HTML and URL as input and return a key-value pair to be added to the dataset.

With that approach, a user who wants to parse an additional field would only need to define a function with the appropriate parsing and pass it as a parameter to the query function, where all the common heavy lifting is done. The user could also select what to download by modifying the list of pre-built handlers for wikitext or caption parsing. We could also design a uniform way to pass cache-related parameters to such functions.

This might be a very good idea, but it requires a lot of work. It will probably be on hold until there is reasonable interest in the script.
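The handler idea described above could look roughly like the sketch below. Everything in it is an assumption made for illustration: the names `title_handler`, `url_handler`, and this `query` signature are hypothetical, not the library's current API, and pages are passed in as already-fetched (url, html) pairs so the example runs without network access.

```python
import re


def title_handler(html, url):
    """Hypothetical text handler: return the page title as a key-value pair."""
    match = re.search(r"<title>(.*?)</title>", html, re.S)
    title = match.group(1).strip() if match else None
    return "title", title


def url_handler(html, url):
    """Hypothetical text handler: record the page URL."""
    return "url", url


def query(pages, text_handlers):
    """Build one record per page by applying every handler to it.

    `pages` is an iterable of (url, html) pairs; downloading and caching
    are left out of this sketch.
    """
    dataset = []
    for url, html in pages:
        record = dict(handler(html, url) for handler in text_handlers)
        dataset.append(record)
    return dataset


pages = [
    ("https://en.wikipedia.org/wiki/Naruto",
     "<html><head><title>Naruto - Wikipedia</title></head></html>"),
]
records = query(pages, [title_handler, url_handler])
print(records)
```

A user who needs an extra field would write one more function with the same (html, url) signature and append it to the list, without touching `query` itself.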

Check in multithreading code for reader.query

[REQUIRED]: In generate_visual_features replace model parameter with generic mapper

The mapper should have a map function that takes an image path as input and returns a JSON-serializable value as output.
As the default mapper, we will create ResNet152Mapper, which will essentially replace the _get_image_features function.
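One possible shape for that interface is sketched below. The base class name `ImageMapper`, the stand-in `FileSizeMapper`, and the simplified `generate_visual_features` signature are all assumptions for illustration; the real function's other parameters are not shown in this issue, and a ResNet152Mapper would need a deep-learning dependency, so a trivial mapper is used here to keep the example runnable.

```python
import json
import os
import tempfile
from abc import ABC, abstractmethod


class ImageMapper(ABC):
    """Hypothetical base class: turns an image path into JSON-serializable data."""

    @abstractmethod
    def map(self, image_path):
        ...


class FileSizeMapper(ImageMapper):
    """Trivial stand-in mapper; a ResNet152Mapper would return a list of floats."""

    def map(self, image_path):
        return {"size_bytes": os.path.getsize(image_path)}


def generate_visual_features(image_paths, mapper):
    """Simplified sketch: apply the mapper to every image path."""
    return {path: mapper.map(path) for path in image_paths}


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "example.jpg")
    with open(path, "wb") as f:
        f.write(b"\xff\xd8\xff\xd9")  # 4 placeholder bytes, not a valid image
    features = generate_visual_features([path], FileSizeMapper())
    serialized = json.dumps(features)  # fails if a mapper returns non-JSON data

print(serialized)
```

Running the result through `json.dumps` is a cheap way to enforce the "JSON-serializable" requirement on any user-supplied mapper.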

[REQUIRED]: Add image size parameter to download

Double-check whether images with "\"" symbol are downloaded correctly

Feedback Thread. Thanks for sharing!
