Giter Site home page Giter Site logo

gmail-to-sqlite's Introduction

Gmail to SQLite

This is a script to download emails from Gmail and store them in a SQLite database for further analysis. I find it extremely useful to have all my emails in a database to run queries on them. For example, I can find out how many emails I received per sender, which emails take the most space, and which emails from which sender I never read.

Installation

  1. Clone this repository: git clone https://github.com/marcboeker/gmail-to-sqlite.git.
  2. Install the requirements: pip install -r requirements.txt
  3. Create a Google Cloud project here.
  4. Open Gmail in API & Services and activate the Gmail API.
  5. Open the OAuth consent screen and create a new consent screen. You only need to provide a name and contact data.
  6. Next open Create OAuth client ID and create credentials for a Desktop app. Download the credentials file and save it under credentials.json in the root of this repository.

Here is a detailed guide on how to create the credentials: https://developers.google.com/gmail/api/quickstart/python#set_up_your_environment.

Usage

Sync all emails

  1. Run the script: python main.py sync --data-dir path/to/your/data where --<data-dir> is the path where all data is stored. This creates a SQLite database in <data-dir>/messages.db and stores the user credentials under <data-dir>/credentials.json.
  2. After the script has finished, you can query the database using, for example, the sqlite3 command line tool: sqlite3 <data-dir>/messages.db.
  3. You can run the script again to sync all new messages. Provide --full-sync to force a full sync. However, this will only update the read status, the labels, and the last indexed timestamp for existing messages.

Sync a single message

python main.py sync-message --data-dir path/to/your/data --message-id <message-id>

Commandline parameters

usage: main.py [-h] [--data-dir DATA_DIR] [--update] {sync, sync-message}

Main commands:
sync                    Sync emails from Gmail to the database.
sync-message            Sync a single message from Gmail to the database.

--data-dir DATA_DIR     Path to the directory where all data is stored.
--full-sync             Force a full sync.
--message-id MESSAGE_ID Sync only the message with the given message id.

Schema

CREATE TABLE IF NOT EXISTS "messages" (
    "id" INTEGER NOT NULL PRIMARY KEY, -- internal id
    "message_id" TEXT NOT NULL, -- Gmail message id
    "thread_id" TEXT NOT NULL, -- Gmail thread id
    "sender" JSON NOT NULL, -- Sender as JSON in the form {"name": "Foo Bar", "email": "[email protected]"}
    "recipients" JSON NOT NULL, -- JSON object: {
      -- "to": [{"email": "[email protected]", "name": "Foo Bar"}, ...],
      -- "cc": [{"email": "[email protected]", "name": "Foo Bar"}, ...],
      -- "bcc": [{"email": "[email protected]", "name": "Foo Bar"}, ...]
    --}
    "labels" JSON NOT NULL, -- JSON array: ["INBOX", "UNREAD", ...]
    "subject" TEXT NOT NULL, -- Subject of the email
    "body" TEXT NOT NULL, -- Extracted body either als HTML or plain text
    "size" INTEGER NOT NULL, -- Size reported by Gmail
    "timestamp" DATETIME NOT NULL, -- When the email was sent/received
    "is_read" INTEGER NOT NULL, -- 0=Unread, 1=Read
    "is_outgoing" INTEGER NOT NULL, -- 0=Incoming, 1=Outgoing
    "last_indexed" DATETIME NOT NULL -- Timestamp when the email was last seen on the server
);

Example queries

Get the number of emails per sender

SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
GROUP BY sender->>'$.email'
ORDER BY count DESC

Show the number of unread emails by sender

This is great to determine who is spamming you the most with uninteresting emails.

SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE is_read = 0
GROUP BY sender->>'$.email'
ORDER BY count DESC

Get the number of emails for a specific period

  • For years: strftime('%Y', timestamp)
  • For months in a year: strftime('%m', timestamp)
  • For days in a month: strftime('%d', timestamp)
  • For weekdays: strftime('%w', timestamp)
  • For hours in a day: strftime('%H', timestamp)
SELECT strftime('%Y', timestamp) AS period, COUNT(*) AS count
FROM messages
GROUP BY period
ORDER BY count DESC

Find all newsletters and group them by sender

This is an amateurish way to find all newsletters and group them by sender. It's not perfect, but it's a start. You could also use

SELECT sender->>'$.email', COUNT(*) AS count
FROM messages
WHERE body LIKE '%newsletter%' OR body LIKE '%unsubscribe%'
GROUP BY sender->>'$.email'
ORDER BY count DESC

Show who has sent the largest emails in MB

SELECT sender->>'$.email', sum(size)/1024/1024 AS size
FROM messages
GROUP BY sender->>'$.email'
ORDER BY size DESC

Count the number of emails that I have sent to myself

SELECT count(*)
FROM messages
WHERE EXISTS (
  SELECT 1
  FROM json_each(messages.recipients->'$.to')
  WHERE json_extract(value, '$.email') = '[email protected]'
)
AND sender->>'$.email' = '[email protected]'

List the senders who have sent me the largest total volume of emails in megabytes

SELECT sender->>'$.email', sum(size)/1024/1024 as total_size
FROM messages
WHERE is_outgoing=false
GROUP BY sender->>'$.email'
ORDER BY total_size DESC

Roadmap

  • Detect deleted emails and mark them as deleted in the database.

gmail-to-sqlite's People

Contributors

marcboeker avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.