pimdb

Pimdb is a Python package and command line utility to maintain a local copy of the essential parts of the Internet Movie Database (IMDb) based on the TSV files available from IMDb datasets.

License

The IMDb datasets are only available for personal and non-commercial use. For details, refer to the terms stated on the IMDb datasets page.

Pimdb is open source and distributed under the BSD license. The source code is available from https://github.com/roskakori/pimdb.

Installation

Pimdb is available from PyPI and can be installed using:

$ pip install pimdb

Quick start

Downloading datasets

To download the current IMDb datasets to the current folder, run:

pimdb download all

(This downloads about 1 GB of data and might take a couple of minutes.)

Transferring datasets into tables

To import them in a local SQLite database pimdb.db located in the current folder, run:

pimdb transfer all

(This will take a while. On a reasonably modern laptop with a local database you can expect about 2 hours.)

The resulting database contains one table for each dataset. The table names are PascalCase variants of the dataset names. For example, the data from the dataset title.basics are stored in the table TitleBasics. The column names in the table match the names from the datasets, for example TitleBasics.primaryTitle. A short description of all the datasets and columns can be found at the download page for the IMDb datasets.
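
The naming rule described above can be sketched in a few lines of Python. This is only an illustration of the PascalCase convention, not pimdb's actual code; the function name dataset_to_table_name is made up for this example.

```python
def dataset_to_table_name(dataset_name: str) -> str:
    """Return the PascalCase table name for a dotted IMDb dataset name."""
    # Upper-case only the first letter of each dot-separated part, keep the
    # rest unchanged, then join the parts without a separator.
    return "".join(part[:1].upper() + part[1:] for part in dataset_name.split("."))


print(dataset_to_table_name("title.basics"))  # TitleBasics
print(dataset_to_table_name("name.basics"))  # NameBasics
```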

Optionally you can specify a different database using the --database option with an SQLAlchemy engine configuration.

Querying tables

To query the tables, you can use any database tool that supports SQLite, for example the freely available and platform-independent community edition of DBeaver or the command line shell for SQLite.

For simple queries you can also use pimdb and look at the result as UTF-8 encoded TSV. For example, here are the details of the top 10 oldest people alive according to IMDb:

pimdb query "select * from NameBasics where birthYear is not null and deathYear is null order by birthYear limit 10" >oldest_people_alive.tsv

You can also run an SQL statement stored in a file:

pimdb query --file some.sql
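
If you prefer to run such queries from Python, the standard sqlite3 module is enough. The following is a minimal sketch, not part of pimdb: the helper query_to_tsv is hypothetical, and so that the example runs on its own it uses an in-memory database with a cut-down NameBasics table and invented placeholder rows. To use the real data you would presumably connect to pimdb.db instead.

```python
import csv
import io
import sqlite3

OLDEST_PEOPLE_ALIVE_SQL = """
    select * from NameBasics
    where birthYear is not null and deathYear is null
    order by birthYear
    limit 10
"""


def query_to_tsv(connection: sqlite3.Connection, sql: str) -> str:
    """Run sql and return the result as TSV text with a header row."""
    cursor = connection.execute(sql)
    buffer = io.StringIO()
    writer = csv.writer(buffer, delimiter="\t", lineterminator="\n")
    # cursor.description holds one tuple per column; item 0 is the name.
    writer.writerow(column[0] for column in cursor.description)
    writer.writerows(cursor)  # NULL values end up as empty fields
    return buffer.getvalue()


# Stand-in for sqlite3.connect("pimdb.db"): a tiny table with only some of
# the columns and with placeholder rows that are NOT real IMDb data.
connection = sqlite3.connect(":memory:")
connection.execute(
    "create table NameBasics"
    " (nconst text, primaryName text, birthYear integer, deathYear integer)"
)
connection.executemany(
    "insert into NameBasics values (?, ?, ?, ?)",
    [
        ("nm9999901", "Person A", 1950, None),
        ("nm9999902", "Person B", 1920, 1990),
        ("nm9999903", "Person C", 1930, None),
        ("nm9999904", "Person D", None, None),
    ],
)
tsv_text = query_to_tsv(connection, OLDEST_PEOPLE_ALIVE_SQL)
print(tsv_text)
```

With the placeholder rows, only Person C and Person A remain, in that order, because the other two either have a death year or lack a birth year.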

Building normalized tables

The tables so far are almost verbatim copies of the IMDb datasets with the exception that possible duplicate rows have been removed. This data model already allows you to perform several kinds of queries quite easily and efficiently.

However, the IMDb datasets do not offer a simple way to query N:M relations. For example, the column NameBasics.knownForTitles contains a comma-separated list of tconsts like "tt2076794,tt0116514,tt0118577,tt0086491".

To perform such queries efficiently you can build strictly normalized tables derived from the dataset tables by running:

pimdb build

These tables generally use snake_case names for both tables and columns, for example title_alias.is_original. If you specified a --database option for the transfer command before, you have to specify the same value for build so that it can find the source data.
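
To get a feeling for what normalizing a comma-separated column such as NameBasics.knownForTitles involves, here is a minimal Python sketch. It is not how pimdb implements the build step; the function name, the placeholder nconst and the choice to start the ordering at 1 are assumptions made for illustration.

```python
from typing import List, Optional, Tuple


def split_known_for_titles(
    nconst: str, known_for_titles: Optional[str]
) -> List[Tuple[str, str, int]]:
    """Turn a comma-separated tconst list into (nconst, tconst, ordering) rows."""
    # Assumption: a missing value shows up as None, an empty string, or the
    # literal \N marker used for missing values in the raw TSV files.
    if known_for_titles in (None, "", r"\N"):
        return []
    return [
        (nconst, tconst, ordering)
        # Assumption: ordering starts at 1 and follows the list order.
        for ordering, tconst in enumerate(known_for_titles.split(","), start=1)
    ]


# "nm0000000" is a placeholder ID, not a real IMDb name.
for row in split_known_for_titles(
    "nm0000000", "tt2076794,tt0116514,tt0118577,tt0086491"
):
    print(row)
```

Each resulting row is one entry of the N:M relation, so a plain join can answer questions that would otherwise require string matching on the list.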

Querying normalized tables

N:M relations are stored in tables using the naming template some_to_other, for example name_to_known_for_title. These relation tables contain only the numeric IDs of the respective actual data and a numeric column named ordering that preserves the sort order of the comma-separated list in the IMDb dataset column.

For example, here is an SQL query to list the titles Alan Smithee is known for:

select
    title.primary_title,
    title.start_year
from
    name_to_known_for_title
    join name on
        name.id = name_to_known_for_title.name_id
    join title on
        title.id = name_to_known_for_title.title_id
where
    name.primary_name = 'Alan Smithee'
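
The same query can be tried from Python with the standard sqlite3 module. The sketch below is self-contained and therefore builds a reduced in-memory schema: only the tables and columns mentioned in this section are created, the column types are guesses, and the titles are invented placeholders rather than real IMDb data. The added order by clause is likewise an addition for this example.

```python
import sqlite3

connection = sqlite3.connect(":memory:")
# Reduced stand-in for the normalized tables; the real ones have more columns.
connection.executescript(
    """
    create table name (id integer primary key, primary_name text);
    create table title (
        id integer primary key, primary_title text, start_year integer
    );
    create table name_to_known_for_title (
        name_id integer, title_id integer, ordering integer
    );
    insert into name values (1, 'Alan Smithee'), (2, 'Someone Else');
    insert into title values
        (10, 'Placeholder Title A', 1985),
        (11, 'Placeholder Title B', 1990),
        (12, 'Placeholder Title C', 2000);
    insert into name_to_known_for_title values
        (1, 11, 1), (1, 10, 2), (2, 12, 1);
    """
)

KNOWN_FOR_SQL = """
    select
        title.primary_title,
        title.start_year
    from
        name_to_known_for_title
        join name on
            name.id = name_to_known_for_title.name_id
        join title on
            title.id = name_to_known_for_title.title_id
    where
        name.primary_name = ?
    order by
        name_to_known_for_title.ordering
"""

# Pass the name as a parameter instead of embedding it in the SQL text.
known_for_rows = connection.execute(KNOWN_FOR_SQL, ("Alan Smithee",)).fetchall()
for primary_title, start_year in known_for_rows:
    print(primary_title, start_year)
```

Sorting by the ordering column returns the titles in the same order as the original comma-separated list.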

For more information on which tables are available and how they are related, read the chapter about the pimdb data model.

Where to go from here

Pimdb's online documentation describes all aspects in further detail. You might find the following chapters of particular interest:

  • Usage: all command line options explained
  • Data model: available tables and example SQL queries
  • Contributing: obtaining the source code and building the project locally
