Giter Site home page Giter Site logo

mosql's Introduction

MoSQL: a MongoDB → SQL streaming translator

At Stripe, we love MongoDB. We love the flexibility it gives us in changing data schemas as we grow and learn, and we love its operational properties. We love replsets. We love the uniform query language that doesn't require generating and parsing strings, tracking placeholder parameters, or any of that nonsense.

The thing is, we also love SQL. We love the ease of doing ad-hoc data analysis over small-to-mid-size datasets in SQL. We love doing JOINs to pull together reports summarizing properties across multiple datasets. We love the fact that virtually every employee we hire already knows SQL and is comfortable using it to ask and answer questions about data.

So, we thought, why can't we have the best of both worlds? Thus: MoSQL.

MoSQL: Put Mo' SQL in your NoSQL

MoSQL

MoSQL imports the contents of your MongoDB database cluster into a PostgreSQL instance, using an oplog tailer to keep the SQL mirror live up-to-date. This lets you run production services against a MongoDB database, and then run offline analytics or reporting using the full power of SQL.

Installation

Install from Rubygems as:

$ gem install mosql

Or build from source by:

$ gem build mosql.gemspec

And then install the built gem.

The Collection Map file

In order to define a SQL schema and import your data, MoSQL needs a collection map file describing the schema of your MongoDB data. (Don't worry -- MoSQL can handle it if your mongo data doesn't always exactly fit the stated schema. More on that later).

The collection map is a YAML file describing the databases and collections in Mongo that you want to import, in terms of their SQL types. An example collection map might be:

mongodb:
  blog_posts:
    :columns:
    - _id: TEXT
    - author: TEXT
    - title: TEXT
    - created: DOUBLE PRECISION
    :meta:
      :table: blog_posts
      :extra_props: true

Said another way, the collection map is a YAML file containing a hash mapping

<Mongo DB name> -> { <Mongo Collection Name> -> <Collection Definition> }

Where a <Collection Definition> is a hash with :columns and :meta fields. :columns is a list of one-element hashes, mapping field-name to SQL type. It is required to include at least an _id mapping. :meta contains metadata about this collection/table. It is required to include at least :table, naming the SQL table this collection will be mapped to. extra_props determines the handling of unknown fields in MongoDB objects -- more about that later.

By default, mosql looks for a collection map in a file named collections.yml in your current working directory, but you can specify a different one with -c or --collections.

Usage

Once you have a collection map. MoSQL usage is easy. The basic form is:

mosql [-c collections.yml] [--sql postgres://sql-server/sql-db] [--mongo mongodb://mongo-uri]

By default, mosql connects to both PostgreSQL and MongoDB instances running on default ports on localhost without authentication. You can point it at different targets using the --sql and --mongo command-line parameters.

mosql will:

  1. Create the appropriate SQL tables
  2. Import data from the Mongo database
  3. Start tailing the mongo oplog, propogating changes from MongoDB to SQL.

After the first run, mosql will store the status of the optailer in the mongo_sql table in your SQL database, and automatically resume where it left off. mosql uses the replset name to keep track of which mongo database it's tailing, so that you can tail multiple databases into the same SQL database. If you want to tail the same replSet, or multiple replSets with the same name, for some reason, you can use the --service flag to change the name mosql uses to track state.

You likely want to run mosql against a secondary node, at least for the initial import, which will cause large amounts of disk activity on the target node. One option is to specify this in your connect URI:

mosql --mongo mongodb://node1,node2,node3?slaveOk=true

(You should be able to specify ?readPreference=secondary, but the Mongo Ruby driver does not appear to support that usage. I've filed a bug with 10gen about this omission).

Advanced usage

For advanced scenarios, you can pass options to control mosql's behavior. If you pass --skip-tail, mosql will do the initial import, but not tail the oplog. This could be used, for example, to do an import off of a backup snapshot, and then start the tailer on the live cluster.

If you need to force a fresh reimport, run --reimport, which will cause mosql to drop tables, create them anew, and do another import.

Schema mismatches and _extra_props

If MoSQL encounters values in the MongoDB database that don't fit within the stated schema (e.g. a floating-point value in a INTEGER field), it will log a warning, ignore the entire object, and continue.

If it encounters a MongoDB object with fields not listed in the collection map, it will discard the extra fields, unless :extra_props is set in the :meta hash. If it is, it will collect any missing fields, JSON-encode them in a hash, and store the resulting text in _extra_props in SQL. It's up to you to do something useful with the JSON. One option is to use plv8 to parse them inside PostgreSQL, or you can just pull the JSON out whole and parse it in application code.

This is also currently the only way to handle array or object values inside records -- specify :extra_props, and they'll get JSON-encoded into _extra_props. There's no reason we couldn't support JSON-encoded values for individual columns/fields, but we haven't written that code yet.

Sharded clusters

MoSQL does not have special support for sharded Mongo clusters at this time. It should be possible to run a separate MoSQL instance against each of the individual backend shard replica sets, streaming into separate PostgreSQL instances, but we have not actually tested this yet.

Development

Patches and contributions are welcome! Please fork the project and open a pull request on github, or just report issues.

MoSQL includes a small but hopefully-growing test suite. It assumes a running PostgreSQL and MongoDB instance on the local host; You can point it at a different target via environment variables; See test/functional/_lib.rb for more information.

mosql's People

Contributors

dpatti avatar nelhage avatar

Watchers

 avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.