sql-machine-learning / sqlflow

Brings SQL and AI together.

Home Page: https://sqlflow.org

License: Apache License 2.0

Dockerfile 0.65% Go 50.02% Yacc 0.77% Shell 4.12% Python 40.59% HTML 0.28% Java 2.46% JavaScript 1.10%
sqlflow sql-syntax ai transpiler deep-learning databases machine-learning

sqlflow's People

Contributors

brightcoder01, dependabot[bot], derek-wds, echo9573, egolearner, ggaaooppeenngg, huangxinv587, hungry1526, joyyoj, lhw362950217, linkerzhang, matrix4284, merlintang, publicwlj, romainr, samplise, shendiaomo, sneaxiy, sperlingxx, terrytangyuan, tonyyang-svail, typhoonzero, uuleon, wangkuiyi, weiguoz, yake0176, yancey1989, yuyicg, ywskycn, zou000


sqlflow's Issues

make SQL parser support select *

const (
	simpleStarSelect = `
SELECT *
FROM irisis;
`
)

func TestSimpleSelect(t *testing.T) {
	assert := assert.New(t)
	assert.NotPanics(func() {
		sqlParse(newLexer(simpleStarSelect))
	})
	assert.False(parseResult.Extended)
	assert.Equal([]string{"*"}, parseResult.fields)
}

Gives

--- FAIL: TestSimpleSelect (0.00s)
    parser_test.go:44:
        	Error Trace:	parser_test.go:44
        	Error:      	Not equal:
        	            	expected: []string{"*"}
        	            	actual  : []string(nil)

        	            	Diff:
        	            	--- Expected
        	            	+++ Actual
        	            	@@ -1,4 +1,2 @@
        	            	-([]string) (len=1) {
        	            	- (string) (len=1) "*"
        	            	-}
        	            	+([]string) <nil>

        	Test:       	TestSimpleSelect

add train phase in executor.go

An executor will execute a training job. It needs to do the following (a rough sketch of the Docker step follows the list):

  1. Before training, generate the training job code using generateTFProgram in codegen.go
  2. Train the model in a docker container, with the working directory mounted via the -v option
  3. After training, upload the trained model file to the MySQL database
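
A minimal sketch of step 2, assuming the generated program has already been written to a file such as train.py in the working directory (the image tag and file name here are illustrative, not settled choices):

package sql

import (
	"fmt"
	"os/exec"
)

// trainInDocker runs the generated training program inside a TensorFlow
// container, mounting the working directory so the trained model lands there.
func trainInDocker(cwd string) error {
	cmd := exec.Command("docker", "run", "--rm",
		"-v", cwd+":/work", "-w", "/work",
		"tensorflow/tensorflow:1.12", "python", "train.py")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("training failed: %v\n%s", err, out)
	}
	return nil
}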

Move Python templates out of codegen.go into standalone runnable and testable .py files

For example, we can copy code out from codegen.go into sql/python/fetch_data.py like

import tensorflow as tf
import mysql.connector

def sql_connect(user, passwd, host, port, database):
    if not database:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port)
    else:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port,
                                       database=database)

def fetch_data(user, passwd, host, port, database, slctStmt):
    cursor = sql_connect(user, passwd, host, port, database).cursor()
    cursor.execute(slctStmt)
    field_names = [i[0] for i in cursor.description]
    columns = map(list, zip(*cursor.fetchall()))
    return field_names, columns

def slice_feature_and_label(field_names, columns, feature_types, feature_names, label_name):
    feature_columns = [
        getattr(tf.feature_column, feature_types[i])(
            key=feature_names[i]) for i in range(len(feature_types))]
    feature_column_names = [
        feature_names[i] for i in range(len(feature_types))]

    X = {nm: columns[field_names.index(nm)] for nm in feature_column_names}
    Y = columns[field_names.index(label_name)]

    return X, Y

and we can have sql/python/test_fetch_data.py like

import unittest
import fetch_data

class TestFetchData(unittest.TestCase):

    def __init__(self, *args, **kwargs):
        super(TestFetchData, self).__init__(*args, **kwargs)
        self.user = 'root'
        self.passwd = 'root'
        self.host = 'localhost'
        self.port = 3306
        self.database = ''

    def test_sql_connect(self):
        self.assertIsNotNone(fetch_data.sql_connect(
            self.user, self.passwd, self.host, self.port, self.database))

And codegen.go could execute the template to generate only a Python __main__ block that calls the above testable Python functions:

if __name__ == "__main__":
    fetch_data(user={{.User}}, passwd={{.Passwd}}, ...
    ....

Refactor the API of codegen.go

There should be a function named something like codegen or generateCode in codegen.go, like parse in parser.go and verify in verifier.go.

Thinking about column type inference from SQL field types

The current ad-hoc solution is to use a map

https://github.com/wangkuiyi/sqlflow/blob/80e1fe77eb5c6e8ed1c0ffd3869671e8574ae7de/sql/codegen.go#L15

It might be too weak to support any case beyond the simplest one, where (1) each field is a feature and (2) all fields are of float type.

For example, users might write the following statement which operates on a single string-typed field f:

SELECT f
FROM table
COLUMN f, hash(f, 100), cross(f, hash(f, 100))

In our Estimator code, the feature list needs three elements:

[
   tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 
   tf.feature_column.categorical_column_with_hash_bucket(
       tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100),
   tf.feature_column.cross_column(
       tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 
       tf.feature_column.categorical_column_with_hash_bucket(
           tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100))
]

The extended SQL syntax for training models using TensorFlow Estimators

Comments are very welcome!

A rough idea in my mind is something like this:

SELECT 
  reviewed_code_lines, 
  contributed_code_lines, 
  performance_eval_level 
FROM employees 
TRAIN DNNClassifier 
PARAMS 
  hidden_units=[10, 10], 
  n_classes=3 
FEATURES 
  reviewed_code_lines, 
  contributed_code_lines, 
  CROSS(reviewed_code_lines, contributed_code_lines) 
LABEL
  performance_eval_level
INTO auto_performance_evaluator;

I have a plan to write a parser using flex/bison to parse the above SQL statement with extended syntax. The parser should generate a TensorFlow estimator program similar to that described in this tutorial, but using a MySQLDataset operator, instead of the TextLineDataset operator.

A key challenge here is how to specify the crossed features, which was described in this document.

Another challenge is how we could save the trained model into a table, e.g., auto_performance_evaluator in the above example.

The remaining work in sqlflow/sql

  • set up the Iris dataset as a MySQL Docker image in example/iris
  • test auto-generated train program
  • test auto-generated evaluate program
  • test auto-generated predict program
  • write SQLFlowRunner and SQLFlowRunnerHandler, whose input is a SQL statement
    1. Parse and translate the SQL statement into a TensorFlow Python training/predicting program.
    2. Call one of the following
      1. Train,
      2. Evaluate, or,
      3. Predict.
  • Jupyter Notebook magic command
  • Set up for single-node regression test.

Add sql/executor.go

An executor will execute the training/evaluation job. It needs to do the following (a sketch of the upload step follows the list):

  1. Download the model files from the MySQL database, if necessary, to the working directory
  2. Copy the generated Python training file to the working directory
  3. Train the model in a docker container, with the working directory mounted via the -v option
  4. After the job is finished, upload the trained model file to the MySQL database, if necessary
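
Steps 1 and 4 amount to streaming a tarball of the working directory between MySQL and the local filesystem. A rough sketch of the upload direction, assuming the destination is an io.Writer backed by something like sqlfs (the helper name is illustrative):

package sql

import (
	"io"
	"os/exec"
)

// tarWorkDir streams a gzipped tar of the working directory to w, which
// could be an sqlfs writer that stores the bytes in a MySQL table.
func tarWorkDir(cwd string, w io.Writer) error {
	cmd := exec.Command("tar", "czf", "-", "-C", cwd, ".")
	cmd.Stdout = w
	return cmd.Run()
}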

Add sql/verifier.go

After the lexer and parser, we need a verifier, which connects to the SQL engine, runs DESCRIBE table_name to retrieve the fields and their types, makes sure that COLUMN expressions are derived from existing columns, and infers the feature column types from the field types.
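
A minimal sketch of the DESCRIBE step, assuming a MySQL-compatible engine whose DESCRIBE output has the usual Field/Type/Null/Key/Default/Extra columns (the function name is illustrative):

package sql

import "database/sql"

// describeTable runs `DESCRIBE table_name` and returns a map from field
// name to its SQL type; checking COLUMN expressions against this map and
// inferring feature column types would come afterwards.
func describeTable(db *sql.DB, table string) (map[string]string, error) {
	rows, err := db.Query("DESCRIBE " + table)
	if err != nil {
		return nil, err
	}
	defer rows.Close()

	fts := map[string]string{}
	for rows.Next() {
		var field, typ string
		var null, key, deflt, extra sql.NullString
		if err := rows.Scan(&field, &typ, &null, &key, &deflt, &extra); err != nil {
			return nil, err
		}
		fts[field] = typ
	}
	return fts, rows.Err()
}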

SQLFlow in Jupyter Notebook

In order to reach the first milestone, we plan to allow users to run the syntax-extended SQL statements in Jupyter Notebook.

Get familiar with the TensorFlow Estimator

  1. How to read inputs.
  2. How to convert inputs into features.
  3. How to train the model.
  4. How to save the parameters.
  5. How to reload the parameters for more training iterations and/or for inference.
  6. How to save the model into a SQL system.

Avoid global variable parseResult

The current parse API uses a global variable parseResult. Is it possible to change it to

parseResult := sqlParse(newLexer("select * ..."))

In the inference phase, we parse two SQL statements: trainSQL and inferSQL. I feel it would be good practice to avoid sharing parseResult between them.
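
One possible shape (a sketch; it assumes the parser is generated with goyacc, that extendedSelect is the result type used in the other issues, and the wrapper name parseSQL is hypothetical to avoid clashing with the generated entry point): have the grammar's top-level action store its result on the lexer instead of in a package-level variable, and return it from a thin wrapper around the generated parse function.

type lexer struct {
	// ... existing scanner state ...
	result *extendedSelect // set by the grammar's top-level action
}

func parseSQL(program string) (*extendedSelect, error) {
	l := newLexer(program)
	if ret := sqlParse(l); ret != 0 { // sqlParse: the generated entry point
		return nil, fmt.Errorf("parseSQL: parser returned %d", ret)
	}
	return l.result, nil
}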

Missing quotation marks on "columns" and "save" contents

When serializing the parsed SQL statement to JSON, a couple of quotation marks are missing:

{
"extended": true,
"train": true,
"standardSelect": "SELECT employee.age, last_name, salary\n FROMemployee\n WHERE employee.age % 10 < (salary / 10000) AND strings.Upper(last_name) = \"WANG\"\n LIMIT 100;",
"trainClause": {
"estimator": "DNNClassifier",
"attrs": {
"hidden_units": "[10, 20]",
"n_classes": "3"
},
"columns": [
employee.name,
bucketize(last_name, 1000),
cross(embedding(emplyoee.name), bucketize(last_name, 1000))
],
"save": my_dnn_model
}
}
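
For reference, a minimal self-contained illustration (the struct names here are hypothetical): letting encoding/json marshal string and []string fields quotes every element automatically, so hand-assembled JSON can be avoided.

package main

import (
	"encoding/json"
	"fmt"
)

type trainClause struct {
	Estimator string            `json:"estimator"`
	Attrs     map[string]string `json:"attrs"`
	Columns   []string          `json:"columns"`
	Save      string            `json:"save"`
}

func main() {
	tc := trainClause{
		Estimator: "DNNClassifier",
		Attrs:     map[string]string{"hidden_units": "[10, 20]", "n_classes": "3"},
		Columns:   []string{"employee.name", "bucketize(last_name, 1000)"},
		Save:      "my_dnn_model",
	}
	b, _ := json.MarshalIndent(tc, "", "  ")
	fmt.Println(string(b)) // every column expression and the save target come out quoted
}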

Possibly remove type connectionConfig in codegen.go

The contents of connectionConfig overlap with those of mysql.Config.

A consequence is that we must make sure the contents of both structs stay consistent with each other. For example, the following code

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen_test.go#L43-L46

reveals that the content in connectionConfig

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen_test.go#L30-L35

must be consistent with those in mysql.Config

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/db_test.go#L17-L21

It looks like we can remove the definition of connectionConfig and change the content of TemplateFiller

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen.go#L22-L34

into using mysql.Config and a WorkDir string field:

type TemplateFiller struct {
   ...
   mysql.Config
   WorkDir string
}

And, change the signature

https://github.com/wangkuiyi/sqlflow/blob/fb6769b55108234d35e1075908a3e2f62eb2a37d/sql/codegen.go#L36

into

func NewTemplateFiller(pr *extendedSelect, fts fieldTypes, cfg *mysql.Config, workdir string)
         (*TemplateFiller, bool) {

[Design] Pipeline like syntax

I am wondering if we could use a pipeline-like syntax. Parsing would be much easier in this case, and the transformation of the data also looks more natural.

select * from my_table | Normalize | Train DNN

Python unable to connect to MySQL Server

db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost:3306")

Gives

mysql.connector.errors.DatabaseError: 2005 (HY000): Unknown MySQL server host 'localhost:3306' (2)

However, if we remove port 3306 from the host string, it connects successfully (mysql.connector takes the port as a separate port argument, as in the fetch_data code above, rather than appended to the host):

db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost")

Add predict in sql/codegen.go

In the predicting phase, generateTFProgram should know four things (a sketch of gathering them follows the list):

  1. parsedResult from parser.go. It contains the standard SELECT.
  2. savedModel from the MySQL database. It contains the estimator's config.
  3. fieldTypes from verifier.go. It contains columns and column types.
  4. mysql.Config. It contains username, passwd, etc.
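
A sketch of gathering these four inputs into a single template filler for the predict path (the field names are illustrative, not a settled design):

type predictFiller struct {
	StandardSelect string        // 1. from the parse result
	ModelDir       string        // 2. where the saved model was untarred
	Estimator      string        // 2. estimator config recovered with the model
	FieldTypes     fieldTypes    // 3. from verifier.go
	Connection     *mysql.Config // 4. username, passwd, host, port, database
	PredictTable   string        // where the predicted Y goes (#115)
	PredictColumn  string
}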

The generated TF Program should do the following

  1. Load X from MySQL
  2. Load model from working directory
  3. Predict Y based on X
  4. Save predicted Y to MySQL (#115)

Add predict in sql/executor.go

A prediction job needs to do the following

  1. Download the model files from the MySQL database to the working directory
  2. Verify the infer clause
  3. Prepare prediction table
  4. Predict

During inference, how to get model config

In the training phase, the SQL statement contains the model config: DNNClassifier, n_classes, and hidden_units

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
TRAIN DNNClassifier
WITH
  n_classes = 3,
  hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL species
INTO my_dnn_model;

However, the infer statement doesn't have it

SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
INFER my_dnn_model;
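
One way to bridge this gap (an assumption about the design, but consistent with the load function in the untar issue below, which gob-decodes a model struct that carries m.parseResult): recover the estimator name and attributes from the model saved at training time rather than from the INFER statement. The field names below follow the JSON dump in the quotation-marks issue above and may not match the real structs exactly.

// Sketch: decode the model saved by the training run, then read the
// estimator config from its parse result instead of the INFER statement.
m := &model{}
if e := m.load(cfg, cwd); e != nil {
	return e
}
estimator := m.parseResult.estimator // e.g. "DNNClassifier"
attrs := m.parseResult.attrs         // e.g. n_classes, hidden_units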

Add sql/codegen.go

Need to design the code generation algorithm, particularly how to map the fieldTypes []string returned by sql/verifier.go to tf.feature_column.* calls.
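
A starting point might be a simple lookup from normalized SQL field types to feature column constructors (a sketch; the mapping choices are assumptions, and string-typed fields will eventually need the COLUMN-clause treatment discussed in the type-inference issue above):

// fieldTypeToFeatureColumn maps a normalized SQL type to the
// tf.feature_column call emitted by the code generator.
var fieldTypeToFeatureColumn = map[string]string{
	"int":     "numeric_column",
	"bigint":  "numeric_column",
	"float":   "numeric_column",
	"double":  "numeric_column",
	"varchar": "categorical_column_with_hash_bucket",
	"text":    "categorical_column_with_hash_bucket",
}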

Test training in codegen_test.go

Test Training Program Autogen

  • Go can run a process while capturing its stderr and stdout. We want to test that the auto-generated TensorFlow/Python program can train the model, so we need to run a Docker container using this technique but with a command like
docker run --rm -it -v $PWD:/work -w /work tensorflow/tensorflow:1.12 python to_be_tested.py

However, we don't want to save the auto-generated file into the filesystem; instead, we want to pipe it to python running in the TensorFlow container. So we can do

echo "print(1)" | docker run --rm -i tensorflow/tensorflow python

To run the above bash program in a command line, we need to run

sh -c 'echo "print(1)" | docker run --rm -i tensorflow/tensorflow python'

We can do this by following ExampleCmd_CombinedOutput in https://golang.org/src/os/exec/example_test.go

However, in our case the "print(1)" text is returned by a Go function in our driving program rather than being a standalone program. To pipe it into the process, we need to

package main

import (
	"fmt"
	"os/exec"
	"strings"
)

func main() {
	r := strings.NewReader("print(1)")
	cmd := exec.Command("docker", "run", "--rm", "-i", "tensorflow/tensorflow", "python")
	cmd.Stdin = r
	o, _ := cmd.CombinedOutput()
	fmt.Println(string(o))
}

Unable to untar the model

func (m *model) load(cfg *mysql.Config, cwd string) (e error) {
	db, e := sql.Open("mysql", cfg.FormatDSN())
	if e != nil {
		return e
	}
	defer db.Close()

	sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.model)
	sqlf, e := sqlfs.Open(db, sqlfn)
	if e != nil {
		return fmt.Errorf("Cannot open sqlfs file %s: %v", sqlfn, e)
	}
	defer func() { sqlf.Close() }()

	if e := gob.NewDecoder(sqlf).Decode(m); e != nil {
		return fmt.Errorf("model.load: gob-decoding model failed: %v", e)
	}

	dir := cwd
	cmd := exec.Command("tar", "Pxzf", "-", "-C", dir)
	cmd.Stdin = sqlf
	return cmd.Run()
}

Gives

tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
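
One possible cause (an assumption, not confirmed in the issue): when the reader passed to gob.NewDecoder does not implement io.ByteReader, the decoder wraps it in its own bufio.Reader, which may read ahead past the end of the gob value; those buffered bytes belong to the tarball but are discarded when Decode returns, so tar sees a stream with its head missing. A sketch of one way to avoid this is to do the buffering ourselves and hand the same reader to tar:

	// Wrap sqlf once; gob uses this reader directly because bufio.Reader
	// implements io.ByteReader, and any bytes it buffers beyond the gob
	// value remain available to the tar command below.
	buffered := bufio.NewReader(sqlf)
	if e := gob.NewDecoder(buffered).Decode(m); e != nil {
		return fmt.Errorf("model.load: gob-decoding model failed: %v", e)
	}

	cmd := exec.Command("tar", "Pxzf", "-", "-C", cwd)
	cmd.Stdin = buffered
	return cmd.Run()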

Save predicted Y to MySQL

The following SQL statement will save the predicted class into iris.prediction_table.class

SELECT *
FROM iris.iris
PREDICT iris.prediction_table.class
USING my_dnn_model;

Logic:

  1. If a selected field's name or type differs from the training phase, it should raise an error.
  2. If iris.prediction_table doesn't exist, we should create it, which requires figuring out the column type of class. If iris.prediction_table already exists, it will be overwritten.

This logic should be implemented in Go.
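
A rough sketch of step 2, assuming the column type of class has already been inferred (table and column names follow the example above; imports of database/sql and fmt are elided):

// createPredictionTable drops and recreates the prediction table, so an
// existing table is overwritten as described above. colType is the
// inferred SQL type of the predicted column, e.g. "INT" or "VARCHAR(255)".
func createPredictionTable(db *sql.DB, table, column, colType string) error {
	if _, e := db.Exec("DROP TABLE IF EXISTS " + table); e != nil {
		return e
	}
	_, e := db.Exec(fmt.Sprintf("CREATE TABLE %s (%s %s)", table, column, colType))
	return e
}

For the example above, this might be called as createPredictionTable(db, "iris.prediction_table", "class", "INT").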

Missing Column in Prediction Clause

The verifier uses the COLUMN and LABEL fields to find data types.

SELECT MonthlyCharges, TotalCharges, tenure
FROM churn.churn
TRAIN DNNClassifier
WITH 
  n_classes = 73,
  hidden_units = [10, 20]
COLUMN MonthlyCharges, TotalCharges
LABEL tenure
INTO my_dnn_model;

However, there is no COLUMN field in the prediction clause

SELECT MonthlyCharges, TotalCharges
FROM churn.churn
PREDICT churn.predict.tenure
USING my_dnn_model;

So how should the verifier get the data types?
