sql-machine-learning / sqlflow
Brings SQL and AI together.
Home Page: https://sqlflow.org
License: Apache License 2.0
Currently, unit tests establish the connection to the test instance of MySQL in the init() function. This prevents us from closing the connection after testing. We should move the connection setup and the closing operation into TestMain, as sketched below.
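A minimal sketch of that refactoring, assuming the tests live in the sql package and share a testDB handle (the variable name and the DSN are illustrative, not the repository's actual identifiers):

package sql

import (
    "database/sql"
    "os"
    "testing"

    _ "github.com/go-sql-driver/mysql"
)

// testDB is opened once in TestMain and closed after all tests have run,
// replacing the connection that used to be established in init().
var testDB *sql.DB

func TestMain(m *testing.M) {
    var e error
    testDB, e = sql.Open("mysql", "root:root@tcp(localhost:3306)/")
    if e != nil {
        os.Exit(1)
    }
    code := m.Run()
    testDB.Close() // closing is now possible, unlike with init()
    os.Exit(code)
}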
Set up a homepage for SQLFlow. One option is to host it on GitHub Pages: https://pages.github.com/
Currently, the deduction rule states that
SELECT *, b ...
is equivalent to
SELECT b ...
However, * after SELECT should mean all fields of the related tables.
Save and load the model into a SQL system, as the first step in https://github.com/wangkuiyi/sqlflow/issues/98. MySQL supports binary data (BLOBs and variations thereof).
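As a rough illustration of storing a serialized model in a BLOB column, here is a hedged sketch (the table name, column names, and schema are hypothetical, not the project's actual layout):

import "database/sql"

// saveModelBlob writes one serialized model into a MySQL LONGBLOB column,
// overwriting any previous model stored under the same name.
func saveModelBlob(db *sql.DB, name string, serialized []byte) error {
    const create = `CREATE TABLE IF NOT EXISTS sqlflow_models (name VARCHAR(255) PRIMARY KEY, model LONGBLOB)`
    if _, e := db.Exec(create); e != nil {
        return e
    }
    _, e := db.Exec(`REPLACE INTO sqlflow_models (name, model) VALUES (?, ?)`, name, serialized)
    return e
}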
const (
    simpleStarSelect = `
SELECT *
FROM irisis;
`
)

func TestSimpleSelect(t *testing.T) {
    assert := assert.New(t)
    assert.NotPanics(func() {
        sqlParse(newLexer(simpleStarSelect))
    })
    assert.False(parseResult.Extended)
    assert.Equal([]string{"*"}, parseResult.fields)
}
Gives
--- FAIL: TestSimpleSelect (0.00s)
parser_test.go:44:
Error Trace: parser_test.go:44
Error: Not equal:
expected: []string{"*"}
actual : []string(nil)
Diff:
--- Expected
+++ Actual
@@ -1,4 +1,2 @@
-([]string) (len=1) {
- (string) (len=1) "*"
-}
+([]string) <nil>
Test: TestSimpleSelect
An executor will execute a training job. It needs to do the following.
generateTFProgram in codegen.go
For example, we can copy code out from codegen.go into sql/python/fetch_data.py, like:
import tensorflow as tf
import mysql.connector


def sql_connect(user, passwd, host, port, database):
    # Connect without selecting a database when none is given.
    if not database:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port)
    else:
        return mysql.connector.connect(user=user,
                                       passwd=passwd,
                                       host=host,
                                       port=port,
                                       database=database)


def fetch_data(user, passwd, host, port, database, slctStmt):
    cursor = sql_connect(user, passwd, host, port, database).cursor()
    cursor.execute(slctStmt)
    field_names = [i[0] for i in cursor.description]
    columns = list(map(list, zip(*cursor.fetchall())))
    return field_names, columns


def slice_feature_and_label(field_names, columns, feature_types, feature_names, label_name):
    feature_columns = [
        getattr(tf.feature_column, feature_types[i])(key=feature_names[i])
        for i in range(len(feature_types))]
    feature_column_names = [feature_names[i] for i in range(len(feature_types))]
    X = {nm: columns[field_names.index(nm)] for nm in feature_column_names}
    Y = columns[field_names.index(label_name)]
    return X, Y
and we can have sql/python/test_fetch_data.py like:

import unittest

import fetch_data


class TestFetchData(unittest.TestCase):
    def __init__(self, *args, **kwargs):
        super(TestFetchData, self).__init__(*args, **kwargs)
        self.user = 'root'
        self.passwd = 'root'
        self.host = 'localhost'
        self.port = 3306
        self.database = ''

    def test_sql_connect(self):
        self.assertIsNotNone(fetch_data.sql_connect(
            self.user, self.passwd, self.host, self.port, self.database))


if __name__ == '__main__':
    unittest.main()
And codegen.go could execute the template to generate only a Python __main__ section which calls the above testable Python functions:

if __name__ == "__main__":
    fetch_data(user={{.User}}, passwd={{.Passwd}}, ...
    ....
There should be a function named something like codegen or generateCode in codegen.go, just like parse in parser.go and verify in verifier.go.
tensorflowCmd in codegen_test.go is a useful abstraction and should be used in other components.
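A hedged guess at what such a reusable helper might look like (the local-python fallback and the exact image tag are assumptions for illustration, not the repository's actual implementation):

import "os/exec"

// tensorflowCmd returns an *exec.Cmd that runs python, preferring a local
// installation and falling back to the tensorflow/tensorflow Docker image.
func tensorflowCmd(cwd string) *exec.Cmd {
    if _, e := exec.LookPath("python"); e == nil {
        cmd := exec.Command("python")
        cmd.Dir = cwd
        return cmd
    }
    return exec.Command("docker", "run", "--rm", "-i",
        "-v", cwd+":/work", "-w", "/work", "tensorflow/tensorflow:1.12", "python")
}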
Using the all-numeric fields data set from TensorFlow: http://download.tensorflow.org/data/iris_training.csv
Following sqlflow/example/churn
The current ad-hoc solution is to use a map. It might be too weak to support any case other than (1) each field is a feature, and (2) all fields are of float type.
For example, users might write the following statement, which operates on a single string-typed field f:
SELECT f
FROM table
COLUMN f, hash(f, 100), cross(f, hash(f, 100))
In our Estimator code, the feature list needs three elements:

[
    tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list),
    tf.feature_column.categorical_column_with_hash_bucket(
        tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100),
    tf.feature_column.crossed_column(
        tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list),
        tf.feature_column.categorical_column_with_hash_bucket(
            tf.feature_column.categorical_column_with_vocabulary_list("f", vocab_list), 100))
]
For this reason, a simple field-to-feature map is not expressive enough.
For the context of this issue, please refer to https://github.com/wangkuiyi/sqlflow/issues/64#issuecomment-441339214.
Comments are very welcome!
A rough idea in my mind is something like this:
SELECT
reviewed_code_lines,
contributed_code_lines,
performance_eval_level
FROM employees
TRAIN DNNClassifier
PARAMS
hidden_units=[10, 10],
n_classes=3
FEATURES
reviewed_code_lines,
contributed_code_lines,
CROSS(reviewed_code_lines, contributed_code_lines)
LABEL
performance_eval_level
INTO auto_performance_evaluator;
I have a plan to write a parser using flex/bison to parse the above SQL statement with extended syntax. The parser should generate a TensorFlow estimator program similar to that described in this tutorial, but using a MySQLDataset operator, instead of the TextLineDataset operator.
A key challenge here is how to specify the crossed features, which was described in this document.
Another challenge is how we could save the trained model into a table, e.g., auto_performance_evaluator in the above example.
example/iris
SQLFlowRunner and SQLFlowRunnerHandler, whose input is a SQL statement: Train, Evaluate, or Predict.
The first release should allow users to do the following: the SQLFlow container should be able to take a user's SQL statement as input and either proxy it to MySQL or parse it and run Python training/prediction code in the container.
An executor will execute the training/evaluation job. It needs to do the following.
-v option
After the lexer and parser, we need a verifier, which connects to the SQL engine, runs DESCRIBE table_name to retrieve fields and field types, makes sure that columns are derived from existing columns, and infers the feature column types from field types.
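A minimal sketch of that verifier step, assuming the result is collected into a map from field name to SQL type (the function name and return type are illustrative):

import "database/sql"

// describeTable runs DESCRIBE on the table and returns field name -> SQL type,
// which the verifier can then map to feature column types.
func describeTable(db *sql.DB, table string) (map[string]string, error) {
    rows, e := db.Query("DESCRIBE " + table)
    if e != nil {
        return nil, e
    }
    defer rows.Close()
    ft := make(map[string]string)
    for rows.Next() {
        var field, typ string
        var null, key, deflt, extra sql.NullString
        if e := rows.Scan(&field, &typ, &null, &key, &deflt, &extra); e != nil {
            return nil, e
        }
        ft[field] = typ
    }
    return ft, rows.Err()
}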
In order to make the first milestone, we plan to allow users to run the syntax-extended SQL statements in the Jupyter Notebook.
The current parse API uses a global variable parseResult. Is it possible to change it to

parseResult := sqlParse(newLexer("select * ..."))

In the inference phase, we parse two SQL statements: trainSQL and inferSQL. It would be good practice to avoid sharing parseResult between them.
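A hedged sketch of one possible wrapper that avoids handing the global around (parseSQL and parseMu are hypothetical names; it assumes parseResult is a value of type extendedSelect that the goyacc-generated sqlParse keeps filling internally):

import "sync"

// parseMu serializes calls to the generated parser so that parsing trainSQL
// and inferSQL cannot overwrite each other's parseResult.
var parseMu sync.Mutex

func parseSQL(stmt string) extendedSelect {
    parseMu.Lock()
    defer parseMu.Unlock()
    sqlParse(newLexer(stmt))
    return parseResult // hand back a copy instead of sharing the global
}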
When serializing the parsed SQL statement to JSON, a couple of quotation marks are missing:
{
"extended": true,
"train": true,
"standardSelect": "SELECT employee.age, last_name, salary\n FROMemployee\n WHERE employee.age % 10 < (salary / 10000) AND strings.Upper(last_name) = \"WANG\"\n LIMIT 100;",
"trainClause": {
"estimator": "DNNClassifier",
"attrs": {
"hidden_units": "[10, 20]",
"n_classes": "3"
},
"columns": [
employee.name,
bucketize(last_name, 1000),
cross(embedding(emplyoee.name), bucketize(last_name, 1000))
],
"save": my_dnn_model
}
}
As it represents a filesystem, not a single file.
In the scaffolded TF program, print accuracy on the training data.
Recall https://github.com/wangkuiyi/sqlflow/pull/108/files#r237716408; I think we should follow that logic, and the following should simplify our code.
The content of connectionConfig overlaps with that of mysql.Config.
As a consequence, we must make sure the content of both structs stays consistent; for example, the following code reveals that the content in connectionConfig must be consistent with that in mysql.Config.
It looks to me like we can remove the definition of connectionConfig and change TemplateFiller to use mysql.Config and a WorkDir string field:

type TemplateFiller struct {
    ...
    mysql.Config
    WorkDir string
}

And change the signature into

func NewTemplateFiller(pr *extendedSelect, fts fieldTypes, cfg *mysql.Config, workdir string) (*TemplateFiller, bool) {
Instead of using our customized tar, a more straightforward way is to tar the whole model directory.
I am wondering if we could use pipeline-like syntax. The parsing would be much easier in this case, and the transformation of the data also looks more natural.
select * from my_table | Normalize | Train DNN
verify in verifier.go only needs extendedSelect.standardSelect.
db = mysql.connector.connect(user="root",
                             passwd="root",
                             host="localhost:3306")
Gives
mysql.connector.errors.DatabaseError: 2005 (HY000): Unknown MySQL server host 'localhost:3306' (2)
However, if we remove port 3306 from the host string, it connects successfully (mysql.connector takes the port as a separate port= argument rather than as part of host):
db = mysql.connector.connect(user="root",
passwd="root",
host="localhost")
In the evaluating phase, generateTFProgram should know four things:
parsedResult from parser.go, which contains the standard select.
savedModel from the MySQL database, which contains the estimator's config.
fieldTypes from verifier.go, which contains the columns and column types.
mysql.Config, which contains the username, passwd, etc.
The generated TF program should do the following.
The function generateTemplate doesn't return a template; instead, it returns a filler.
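Given that, a hedged sketch of a filler that gathers the four inputs listed above, and of a generateTFProgram that executes the codegen template with it (the field names and the ModelDir stand-in for savedModel are illustrative; extendedSelect, fieldTypes, and mysql.Config are the types mentioned in the text):

import (
    "io"
    "text/template"

    "github.com/go-sql-driver/mysql"
)

// filler carries everything the template needs to emit a runnable TF program.
type filler struct {
    Parsed     *extendedSelect // from parser.go: standard select plus train/infer clause
    FieldTypes fieldTypes      // from verifier.go: column names and their types
    Config     *mysql.Config   // connection info: user, passwd, host, port
    ModelDir   string          // where the savedModel loaded from MySQL was unpacked
}

func generateTFProgram(w io.Writer, tmpl *template.Template, f *filler) error {
    return tmpl.Execute(w, f)
}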
A prediction job needs to do the following
In the training phase, the SQL statement contains the model config: DNNClassifier, n_classes, and hidden_units:
SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
TRAIN DNNClassifier
WITH
n_classes = 3,
hidden_units = [10, 20]
COLUMN sepal_length, sepal_width, petal_length, petal_width
LABEL species
INTO my_dnn_model;
However, the infer statement doesn't have it
SELECT sepal_length, sepal_width, petal_length, petal_width, species
FROM irisis
INFER my_dnn_model;
We need to design the code generation algorithm, particularly how to map the fieldTypes []string returned by sql/verifier.go to tf.feature_column.* calls.
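A hypothetical illustration of the shape such a mapping could take (these rules are placeholders, not the designed algorithm):

import (
    "fmt"
    "strings"
)

// featureColumnCode maps one field name and its SQL type to the Python
// tf.feature_column call that the generated program will contain.
func featureColumnCode(name, sqlType string) string {
    switch strings.ToUpper(sqlType) {
    case "INT", "BIGINT", "FLOAT", "DOUBLE":
        return fmt.Sprintf("tf.feature_column.numeric_column(key=%q)", name)
    case "VARCHAR", "TEXT":
        return fmt.Sprintf(
            "tf.feature_column.categorical_column_with_hash_bucket(key=%q, hash_bucket_size=100)", name)
    default:
        return fmt.Sprintf("tf.feature_column.numeric_column(key=%q)", name)
    }
}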
docker run --rm -it -v $PWD:/work -w /work tensorflow/tensorflow:1.12 python to_be_tested.py
However, we don't want to save the auto-generated file into the filesystem; instead, we want to pipe it to python running in the TensorFlow container. So we can do
echo "print(1)" | docker run --rm -i tensorflow/tensorflow python
To run the above bash program in a command line, we need to run
sh -c 'echo "print(1)" | docker run --rm -i tensorflow/tensorflow python'
We can do this by following ExampleCmd_CombinedOutput in https://golang.org/src/os/exec/example_test.go.
However again, the echo "print(1)" in our case is something returned by a Go function in our driving program, not a standalone program. To pipe something to a process, we need to
package main

import (
    "fmt"
    "os/exec"
    "strings"
)

func main() {
    r := strings.NewReader("print(1)")
    cmd := exec.Command("docker", "run", "--rm", "-i", "tensorflow/tensorflow", "python")
    cmd.Stdin = r
    o, _ := cmd.CombinedOutput()
    fmt.Println(string(o))
}
func (m *model) load(cfg *mysql.Config, cwd string) (e error) {
    db, e := sql.Open("mysql", cfg.FormatDSN())
    if e != nil {
        return e
    }
    defer db.Close()
    sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.model)
    sqlf, e := sqlfs.Open(db, sqlfn)
    if e != nil {
        return fmt.Errorf("Cannot open sqlfs file %s: %v", sqlfn, e)
    }
    defer func() { sqlf.Close() }()
    if e := gob.NewDecoder(sqlf).Decode(m); e != nil {
        return fmt.Errorf("model.load: gob-decoding model failed: %v", e)
    }
    dir := cwd
    cmd := exec.Command("tar", "Pxzf", "-", "-C", dir)
    cmd.Stdin = sqlf
    return cmd.Run()
}
Gives
tar: Unrecognized archive format
tar: Error exit delayed from previous errors.
After training, we need to store the model config, such as the model type and attribute values. Those values will be used to scaffold the inference code.
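A sketch of what the save path could look like, mirroring the load function above (sqlfs.Create and the save field of parseResult are assumptions based on the surrounding text, not confirmed APIs):

func (m *model) save(db *sql.DB, cwd string) (e error) {
    sqlfn := fmt.Sprintf("sqlflow_models.%s", m.parseResult.save) // assumed field holding the INTO name
    sqlf, e := sqlfs.Create(db, sqlfn) // assumed counterpart of sqlfs.Open
    if e != nil {
        return fmt.Errorf("cannot create sqlfs file %s: %v", sqlfn, e)
    }
    defer func() { sqlf.Close() }()
    // Encode the model struct (model type, attributes, parsed clause) first so
    // that load() can gob-decode it before un-tarring the checkpoint directory.
    if e := gob.NewEncoder(sqlf).Encode(m); e != nil {
        return fmt.Errorf("model.save: gob-encoding model failed: %v", e)
    }
    cmd := exec.Command("tar", "czf", "-", "-C", cwd, ".")
    cmd.Stdout = sqlf
    return cmd.Run()
}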
The following SQL statement will save the predicted class into iris.prediction_table.class:
SELECT *
FROM iris.iris
PREDICT iris.prediction_table.class
USING my_dnn_model;
Logic: if iris.prediction_table doesn't exist, we should create it, and we should figure out the column type of class. If iris.prediction_table already exists, it will be overwritten. This logic should be implemented in Go, as sketched below.
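A rough sketch of that create-or-overwrite step in Go (the label type is assumed to be inferred elsewhere and passed in; the function name is illustrative):

// createPredictionTable drops any existing prediction table and recreates it
// with a single label column of the inferred type, e.g. class BIGINT.
func createPredictionTable(db *sql.DB, table, labelCol, labelType string) error {
    if _, e := db.Exec("DROP TABLE IF EXISTS " + table); e != nil {
        return e
    }
    _, e := db.Exec(fmt.Sprintf("CREATE TABLE %s (%s %s)", table, labelCol, labelType))
    return e
}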
The verifier uses the COLUMN and LABEL fields to find data types:
SELECT MonthlyCharges, TotalCharges, tenure
FROM churn.churn
TRAIN DNNClassifier
WITH
n_classes = 73,
hidden_units = [10, 20]
COLUMN MonthlyCharges, TotalCharges
LABEL tenure
INTO my_dnn_model;
However, there is no COLUMN field in the prediction clause:
SELECT MonthlyCharges, TotalCharges
FROM churn.churn
PREDICT churn.predict.tenure
USING my_dnn_model;
So how should verifier get data types?