sql-machine-learning / sqlflow
Brings SQL and AI together.
Home Page: https://sqlflow.org
License: Apache License 2.0
Currently, unit tests establish the connection to the test instance of MySQL in the init() function. This prevents us from closing the connection after testing. We should move the connection setup and the closing operation into TestMain, as sketched below.
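A minimal sketch of that refactoring, assuming the tests live in the sql package and share a testDB handle (the variable name and the DSN are illustrative, not the repository's actual identifiers):

package sql

import (
    "database/sql"
    "os"
    "testing"

    _ "github.com/go-sql-driver/mysql"
)

// testDB is opened once in TestMain and closed after all tests have run,
// replacing the connection that used to be established in init().
var testDB *sql.DB

func TestMain(m *testing.M) {
    var e error
    testDB, e = sql.Open("mysql", "root:root@tcp(localhost:3306)/")
    if e != nil {
        os.Exit(1)
    }
    code := m.Run()
    testDB.Close() // closing is now possible, unlike with init()
    os.Exit(code)
}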
Set up a homepage for SQLFlow. One option is to host it on GitHub Pages: https://pages.github.com/
Currently, the deduction rule states that
SELECT *, b ...
is equivalent to
SELECT b ...
However, * after SELECT should mean all fields of the related tables.
Save and load the model into a SQL system, as the first step in https://github.com/wangkuiyi/sqlflow/issues/98. MySQL supports binary data (BLOBs and variations thereof).
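As a rough illustration of storing a serialized model in a BLOB column, here is a hedged sketch (the table name, column names, and schema are hypothetical, not the project's actual layout):

import "database/sql"

// saveModelBlob writes one serialized model into a MySQL LONGBLOB column,
// overwriting any previous model stored under the same name.
func saveModelBlob(db *sql.DB, name string, serialized []byte) error {
    const create = `CREATE TABLE IF NOT EXISTS sqlflow_models (name VARCHAR(255) PRIMARY KEY, model LONGBLOB)`
    if _, e := db.Exec(create); e != nil {
        return e
    }
    _, e := db.Exec(`REPLACE INTO sqlflow_models (name, model) VALUES (?, ?)`, name, serialized)
    return e
}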
const (
    simpleStarSelect = `
SELECT *
FROM irisis;
`
)

func TestSimpleSelect(t *testing.T) {
    assert := assert.New(t)
    assert.NotPanics(func() {
        sqlParse(newLexer(simpleStarSelect))
    })
    assert.False(parseResult.Extended)
    assert.Equal([]string{"*"}, parseResult.fields)
}
Gives
--- FAIL: TestSimpleSelect (0.00s)
parser_test.go:44:
Error Trace: parser_test.go:44
Error: Not equal:
expected: []string{"*"}
actual : []string(nil)
Diff:
--- Expected
+++ Actual
@@ -1,4 +1,2 @@
-([]string) (len=1) {
- (string) (len=1) "*"
-}
+([]string) <nil>
Test: TestSimpleSelect
An executor will execute a training job. It needs to do the following.
generateTFProgram in codegen.go
For example, we can copy code out from codegen.go into sql/python/fetch_data.py, like:
import tensorflow as tf
import mysql.connector


def sql_connect(user, passwd, host, port, database):
    # Connect without selecting a database when none is given.
    if not database:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port)
    else:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port,
                                       database=database)


def fetch_data(user, passwd, host, port, database, slctStmt):
    cursor = sql_connect(user, passwd, host, port, database).cursor()
    cursor.execute(slctStmt)
    field_names = [i[0] for i in cursor.description]
    columns = list(map(list, zip(*cursor.fetchall())))
    return field_names, columns


def slice_feature_and_label(field_names, columns, feature_types, feature_names, label_name):
    feature_columns = [
        getattr(tf.feature_column, feature_types[i])(key=feature_names[i])
        for i in range(len(feature_types))]
    feature_column_names = [feature_names[i] for i in range(len(feature_types))]
    X = {nm: columns[field_names.index(nm)] for nm in feature_column_names}
    Y = columns[field_names.index(label_name)]
    return X, Y
and we can have sql/python/test_fetch_data.py like:

import unittest

import fetch_data


class TestFetchData(unittest.TestCase):
    def __init__(self, *args, **kwargs):
        super(TestFetchData, self).__init__(*args, **kwargs)
        self.user = 'root'
        self.passwd = 'root'
        self.host = 'localhost'
        self.port = 3306
        self.database = ''

    def test_sql_connect(self):
        self.assertIsNotNone(fetch_data.sql_connect(
            self.user, self.passwd, self.host, self.port, self.database))


if __name__ == '__main__':
    unittest.main()
And codegen.go could execute the template to generate only a Python __main__ section which calls the above testable Python functions:

if __name__ == "__main__":
    fetch_data(user={{.User}}, passwd={{.Passwd}}, ...
    ....
There should be a function named something like codegen or generateCode in codegen.go, just like parse in parser.go and verify in verifier.go.
tensorflowCmd in codegen_test.go is a useful abstraction and should be used in other components.
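A hedged guess at what such a reusable helper might look like (the local-python fallback and the exact image tag are assumptions for illustration, not the repository's actual implementation):

import "os/exec"

// tensorflowCmd returns an *exec.Cmd that runs python, preferring a local
// installation and falling back to the tensorflow/tensorflow Docker image.
func tensorflowCmd(cwd string) *exec.Cmd {
    if _, e := exec.LookPath("python"); e == nil {
        cmd := exec.Command("python")
        cmd.Dir = cwd
        return cmd
    }
    return exec.Command("docker", "run", "--rm", "-i",
        "-v", cwd+":/work", "-w", "/work", "tensorflow/tensorflow:1.12", "python")
}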
Using the all-numeric fields data set from TensorFlow: http://download.tensorflow.org/data/iris_training.csv
Following sqlflow/example/churn
The current ad-hoc solution is to use a map. It might be too weak to support any case other than (1) each field is a feature, and (2) all fields are of float type.
For example, users might write the following statement, which operates on a single string-typed field f:
SELECT f
FROM table
COLUMN f, hash(f, 100), cross(f, hash(f, 100))
In our Estimator code, the feature list needs three elements:

[
    tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list),
    tf.feature_column.categorical_column_with_hash_bucket(
        tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100),
    tf.feature_column.crossed_column(
        tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list),
        tf.feature_column.categorical_column_with_hash_bucket(
            tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100))
]
For this reason, a simple field-to-feature map is not expressive enough.
For the context of this issue, please refer to https://github.com/wangkuiyi/sqlflow/issues/64#issuecomment-441339214.
Comments are very welcome!
A rough idea in my mind is something like this:
SELECT
reviewed_code_lines,
contributed_code_lines,
performance_eval_level
FROM employees
TRAIN DNNClassifier
PARAMS
hidden_units=[10, 10],
n_classes=3
FEATURES
reviewed_code_lines,
contributed_code_lines,
CROSS(reviewed_code_lines, contributed_code_lines)
LABEL
performance_eval_level
INTO auto_performance_evaluator;
I have a plan to write a parser using flex/bison to parse the above SQL statement with extended syntax. The parser should generate a TensorFlow estimator program similar to that described in this tutorial, but using a MySQLDataset operator, instead of the TextLineDataset operator.
A key challenge here is how to specify the crossed features, which was described in this document.
Another challenge is how we could save the trained model into a table, e.g., auto_performance_evaluator in the above example.
example/iris
SQLFlowRunner and SQLFlowRunnerHandler, whose input is a SQL statement: Train, Evaluate, or Predict.
The first release should allow users to do the following: the SQLFlow container should be able to take a user's SQL statement as input and either proxy it to MySQL or parse it and run Python training/prediction code in the container.
An executor will execute the training/evaluation job. It needs to do the following.
-v option
After the lexer and parser, we need a verifier, which connects to the SQL engine, runs DESCRIBE table_name to retrieve fields and field types, makes sure that columns are derived from existing columns, and infers the feature column types from field types.
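A minimal sketch of that verifier step, assuming the result is collected into a map from field name to SQL type (the function name and return type are illustrative):

import "database/sql"

// describeTable runs DESCRIBE on the table and returns field name -> SQL type,
// which the verifier can then map to feature column types.
func describeTable(db *sql.DB, table string) (map[string]string, error) {
    rows, e := db.Query("DESCRIBE " + table)
    if e != nil {
        return nil, e
    }
    defer rows.Close()
    ft := make(map[string]string)
    for rows.Next() {
        var field, typ string
        var null, key, deflt, extra sql.NullString
        if e := rows.Scan(&field, &typ, &null, &key, &deflt, &extra); e != nil {
            return nil, e
        }
        ft[field] = typ
    }
    return ft, rows.Err()
}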
In order to make the first milestone, we plan to allow users to run the syntax-extended SQL statements in the Jupyter Notebook.
The current parse API uses a global variable parseResult. Is it possible to change it to

parseResult := sqlParse(newLexer("select * ..."))

In the inference phase, we parse two SQL statements: trainSQL and inferSQL. It would be good practice to avoid sharing parseResult between them.
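A hedged sketch of one possible wrapper that avoids handing the global around (parseSQL and parseMu are hypothetical names; it assumes parseResult is a value of type extendedSelect that the goyacc-generated sqlParse keeps filling internally):

import "sync"

// parseMu serializes calls to the generated parser so that parsing trainSQL
// and inferSQL cannot overwrite each other's parseResult.
var parseMu sync.Mutex

func parseSQL(stmt string) extendedSelect {
    parseMu.Lock()
    defer parseMu.Unlock()
    sqlParse(newLexer(stmt))
    return parseResult // hand back a copy instead of sharing the global
}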
When serializing the parsed SQL statement to JSON, a couple of quotation marks are missing:
{
"extended": true,
"train": true,
"standardSelect": "SELECT employee.age, last_name, salary\n FROMemployee\n WHERE employee.age % 10 < (salary / 10000) AND strings.Upper(last_name) = \"WANG\"\n LIMIT 100;",
"trainClause": {
"estimator": "DNNClassifier",
"attrs": {
"hidden_units": "[10, 20]",
"n_classes": "3"
},
"columns": [
employee.name,
bucketize(last_name, 1000),
cross(embedding(emplyoee.name), bucketize(last_name, 1000))
],
"save": my_dnn_model
}
}
As it represents a filesystem, not a single file.
In the scaffolded TF program, print accuracy on the training data.
Recall https://github.com/wangkuiyi/sqlflow/pull/108/files#r237716408; I think we should follow that logic, and the following should simplify our code.
The content of connectionConfig overlaps with that of mysql.Config.
As a consequence, we must make sure the content of both structs stays consistent; for example, the following code reveals that the content in connectionConfig must be consistent with that in mysql.Config.
It looks to me like we can remove the definition of connectionConfig and change TemplateFiller to use mysql.Config and a WorkDir string field:

type TemplateFiller struct {
    ...
    mysql.Config
    WorkDir string
}

And change the signature into

func NewTemplateFiller(pr *extendedSelect, fts fieldTypes, cfg *mysql.Config, workdir string) (*TemplateFiller, bool) {
Instead of using our customized tar, a more straightforward way is to tar the whole model directory.
I am wondering if we could use pipeline-like syntax. The parsing would be much easier in this case, and the transformation of the data also looks more natural.
select * from my_table | Normalize | Train DNN
verify in verifier.go only needs extendedSelect.standardSelect.
db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost:3306")
Gives
mysql.connector.errors.DatabaseError: 2005 (HY000): Unknown MySQL server host 'localhost:3306' (2)
However, if we remove port 3306 from the host string, it connects successfully (mysql.connector takes the port as a separate port= argument rather than as part of host):
db = mysql.connector.connect(user="root",
passwd="root",
host="localhost")
In the evaluating phase, generateTFProgram should know four things:
parsedResult from parser.go, which contains the standard select.
savedModel from the MySQL database, which contains the estimator's config.
fieldTypes from verifier.go, which contains the columns and column types.
mysql.Config, which contains the username, passwd, etc.
The generated TF program should do the following.
The function generateTemplate doesn't return a template; instead, it returns a filler.
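Given that, a hedged sketch of a filler that gathers the four inputs listed above, and of a generateTFProgram that executes the codegen template with it (the field names and the ModelDir stand-in for savedModel are illustrative; extendedSelect, fieldTypes, and mysql.Config are the types mentioned in the text):

import (
    "io"
    "text/template"

    "github.com/go-sql-driver/mysql"
)

// filler carries everything the template needs to emit a runnable TF program.
type filler struct {
    Parsed     *extendedSelect // from parser.go: standard select plus train/infer clause
    FieldTypes fieldTypes      // from verifier.go: column names and their types
    Config     *mysql.Config   // connection info: user, passwd, host, port
    ModelDir   string          // where the savedModel loaded from MySQL was unpacked
}

func generateTFProgram(w io.Writer, tmpl *template.Template, f *filler) error {
    return tmpl.Execute(w, f)
}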
A prediction job needs to do the following
In the training phase, the SQL statement contains the model config: DNNClassifier, n_classes, and hidden_units:
SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
TRAIN DNNClassifier
WITH
n_classes = 3,
hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL species
INTO my_dnn_model;
However, the infer statement doesn't have it
SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
INFER my_dnn_model;
We need to design the code generation algorithm, particularly how to map the fieldTypes []string returned by sql/verifier.go to tf.feature_column.* calls.
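A hypothetical illustration of the shape such a mapping could take (these rules are placeholders, not the designed algorithm):

import (
    "fmt"
    "strings"
)

// featureColumnCode maps one field name and its SQL type to the Python
// tf.feature_column call that the generated program will contain.
func featureColumnCode(name, sqlType string) string {
    switch strings.ToUpper(sqlType) {
    case "INT", "BIGINT", "FLOAT", "DOUBLE":
        return fmt.Sprintf("tf.feature_column.numeric_column(key=%q)", name)
    case "VARCHAR", "TEXT":
        return fmt.Sprintf(
            "tf.feature_column.categorical_column_with_hash_bucket(key=%q, hash_bucket_size=100)", name)
    default:
        return fmt.Sprintf("tf.feature_column.numeric_column(key=%q)", name)
    }
}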
docker run --rm -it -v $PWD:/work -w /work tensorflow/tensorflow:1.12 python to_be_tested.py
However, we don't want to save the auto-generated file into the filesystem; instead, we want to pipe it to python running in the TensorFlow container. So we can do
echo "print(1)" | docker run --rm -i tensorflow/tensorflow python
To run the above bash program in a command line, we need to run
sh -c 'echo "print(1)" | docker run --rm -i tensorflow/tensorflow python'
We can do this by following ExampleCmd_CombinedOutput in https://golang.org/src/os/exec/example_test.go.
However again, the echo "print(1)" in our case is something returned by a Go function in our driving program, not a standalone program. To pipe something to a process, we need to
package main

import (
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    r := strings.NewReader("print(1)")
    cmd := exec.Command("docker", "run", "--rm", "-i", "tensorflow/tensorflow", "python")
    cmd.Stdin = r
    o, _ := cmd.CombinedOutput()
    fmt.Println(string(o))
}
func (m *model) load(cfg *mysql.Config, cwd string) (e error) {
    db, e := sql.Open("mysql", cfg.FormatDSN())
    if e != nil {
        return e
    }
    defer db.Close()
    sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.model)
    sqlf, e := sqlfs.Open(db, sqlfn)
    if e != nil {
        return fmt.Errorf("Cannot open sqlfs file %s: %v", sqlfn, e)
    }
    defer func() { sqlf.Close() }()
    if e := gob.NewDecoder(sqlf).Decode(m); e != nil {
        return fmt.Errorf("model.load: gob-decoding model failed: %v", e)
    }
    dir := cwd
    cmd := exec.Command("tar", "Pxzf", "-", "-C", dir)
    cmd.Stdin = sqlf
    return cmd.Run()
}
Gives
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
After training, we need to store the model config, such as the model type and attribute values. Those values will be used to scaffold the inference code.
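A sketch of what the save path could look like, mirroring the load function above (sqlfs.Create and the save field of parseResult are assumptions based on the surrounding text, not confirmed APIs):

func (m *model) save(db *sql.DB, cwd string) (e error) {
    sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.save) // assumed field holding the INTO name
    sqlf, e := sqlfs.Create(db, sqlfn) // assumed counterpart of sqlfs.Open
    if e != nil {
        return fmt.Errorf("cannot create sqlfs file %s: %v", sqlfn, e)
    }
    defer func() { sqlf.Close() }()
    // Encode the model struct (model type, attributes, parsed clause) first so
    // that load() can gob-decode it before un-tarring the checkpoint directory.
    if e := gob.NewEncoder(sqlf).Encode(m); e != nil {
        return fmt.Errorf("model.save: gob-encoding model failed: %v", e)
    }
    cmd := exec.Command("tar", "czf", "-", "-C", cwd, ".")
    cmd.Stdout = sqlf
    return cmd.Run()
}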
The following SQL statement will save the predicted class into iris.prediction_table.class:
SELECT *
FROM iris.iris
PREDICT iris.prediction_table.class
USING my_dnn_model;
Logic: if iris.prediction_table doesn't exist, we should create it, and we should figure out the column type of class. If iris.prediction_table already exists, it will be overwritten. This logic should be implemented in Go, as sketched below.
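A rough sketch of that create-or-overwrite step in Go (the label type is assumed to be inferred elsewhere and passed in; the function name is illustrative):

// createPredictionTable drops any existing prediction table and recreates it
// with a single label column of the inferred type, e.g. class BIGINT.
func createPredictionTable(db *sql.DB, table, labelCol, labelType string) error {
    if _, e := db.Exec("DROP TABLE IF EXISTS " + table); e != nil {
        return e
    }
    _, e := db.Exec(fmt.Sprintf("CREATE TABLE %s (%s %s)", table, labelCol, labelType))
    return e
}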
The verifier uses the COLUMN and LABEL fields to find data types:
SELECT MonthlyCharges, TotalCharges, tenure
FROM churn.churn
TRAIN DNNClassifier
WITH
n_classes = 73,
hidden_units = [10, 20]
COLUMN MonthlyCharges, TotalCharges
LABEL tenure
INTO my_dnn_model;
However, there is no COLUMN field in the prediction clause:
SELECT MonthlyCharges, TotalCharges
FROM churn.churn
PREDICT churn.predict.tenure
USING my_dnn_model;
So how should verifier get data types?