mbok / elasticsearch-linear-regression

A machine learning plugin for Elasticsearch providing aggregations to compute multiple linear regression on search results in real-time for predictive analytics.

License: Apache License 2.0

Java 100.00%
elasticsearch linear-regression elasticsearch-plugin machine-learning predictive-analytics

elasticsearch-linear-regression's Introduction

A multiple linear regression plugin for Elasticsearch

The linear regression model has been a mainstay of statistics and machine learning for decades and remains one of the most important tools in supervised learning. It is a powerful technique for predicting the value of a dependent variable y (the response variable) from the values of other, independent variables x = (x1, x2, …, xC) (the explanatory variables), based on a training data set. The prediction of the response variable for given values of the explanatory variables is described by the linear hypothesis function h(x) with

$$h(x) = \theta_0 + \theta_1 x_1 + \theta_2 x_2 + \dots + \theta_C x_C$$

This plugin extends Elasticsearch's query engine with two new aggregations that use the index data matched by a search as training data for estimating a linear regression model. The estimated model exposes information such as a predicted value for the target variable, anomaly detection, and measures of the accuracy (predictiveness) of the model. Estimation uses the OLS (ordinary least squares) approach over the search result set.

Aggregations

Both aggregations are numeric aggregations that estimate the linear regression coefficients $\theta_0, \theta_1, \theta_2, \dots$ from the documents returned by a search query. Each search result document is treated as an observation, and its numeric fields as the variables (explanatory and response) of the linear model.

Aggregation for prediction

The linreg_predict aggregation computes the predicted value of the response variable from the estimated model, given a set of input values for the explanatory variables.

value

The predicted value for the response variable computed using the estimated linear hypothesis function h(x) with x given by C input values for the explanatory variables x = [x1, x2,…​,xC].

coefficients

Estimated coefficients $\theta_0, \theta_1, \theta_2, \theta_3, \dots$ of the linear hypothesis function h(x).

Assuming the data consists of documents representing sold houses, with features such as size and the number of bedrooms and bathrooms, we can predict or validate the price of our house in Morro Bay with 2000 square feet, 4 bedrooms and 2 bathrooms with:

/houses/_search?size=0
{
    "query": {
        "match" : {
            "location" : "Morro Bay"
        }
    },
    "aggs": {
        "house_prices": {
            "linreg_predict": {
                "fields": ["size", "bedrooms", "bathrooms", "price"],
                "inputs": [2000, 4, 2]
            }
        }
    }
}
  1. fields tells the aggregation to build the linear regression model with the house feature fields size, bedrooms and bathrooms as explanatory variables and the price field as the response variable. The fields array has C + 1 entries: C for the explanatory variables, followed by one for the response variable.

  2. inputs passes the feature values of the house whose price we want to predict. The numeric input values must be given as an array, in the same order as the features listed in fields. The inputs array has C entries, one per explanatory variable.
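
The size rule for fields and inputs is easy to get wrong when requests are generated by code. The following is a minimal Python sketch of a request builder that enforces it before anything is sent to Elasticsearch. The helper name build_linreg_predict is hypothetical and not part of the plugin; only the request layout is taken from the example above.

```python
def build_linreg_predict(agg_name, fields, inputs, query=None):
    """Build a search body containing one linreg_predict aggregation.

    fields: C explanatory field names followed by the response field name.
    inputs: C numeric values, in the same order as the explanatory fields.
    """
    if len(fields) < 2:
        raise ValueError("need at least one explanatory field plus the response field")
    if len(inputs) != len(fields) - 1:
        raise ValueError(
            f"expected {len(fields) - 1} inputs for {len(fields)} fields, got {len(inputs)}"
        )
    body = {
        "aggs": {
            agg_name: {
                "linreg_predict": {"fields": list(fields), "inputs": list(inputs)}
            }
        }
    }
    if query is not None:
        body["query"] = query
    return body


# Same request as above: three explanatory fields, price as the response.
body = build_linreg_predict(
    "house_prices",
    ["size", "bedrooms", "bathrooms", "price"],
    [2000, 4, 2],
    query={"match": {"location": "Morro Bay"}},
)
```

The resulting dictionary can be serialized with json.dumps and posted to /houses/_search?size=0 with any HTTP client.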

The response might look like the following, with an estimated price of around $581,458 for our house:

{
    ...
    "aggregations": {
        "house_prices": {
            "value": 581458.3087492324,
            "coefficients": [
                227990.63952712028,
                248.92285661317254,
                -68297.7720278421,
                64406.52205356777
            ]
        }
    }
}
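
The returned value can be reproduced from the returned coefficients. The short Python sketch below assumes that the first coefficient is the intercept and that the remaining ones follow the order of the explanatory fields; the numbers in the response above are consistent with that reading. The function name predict is only illustrative.

```python
def predict(coefficients, inputs):
    """Evaluate h(x) = theta_0 + theta_1*x_1 + ... + theta_C*x_C."""
    intercept, *slopes = coefficients
    if len(slopes) != len(inputs):
        raise ValueError("need exactly one input per explanatory variable")
    return intercept + sum(theta * x for theta, x in zip(slopes, inputs))


# Coefficients copied from the response above; inputs are size, bedrooms, bathrooms.
coefficients = [
    227990.63952712028,
    248.92285661317254,
    -68297.7720278421,
    64406.52205356777,
]
price = predict(coefficients, [2000, 4, 2])
print(round(price, 2))  # 581458.31, matching "value" in the response
```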

Aggregation for linear regression statistics

The linreg_stats aggregation computes statistics for the estimated linear regression model.

rss

Residual sum of squares, a measure of the discrepancy between the data and the estimated model. The lower the rss, the smaller the prediction error and the better the model fits the data.

mse

Mean squared error, that is, rss divided by the number of documents used for model estimation.

r2

Coefficient of determination, denoted R², as a statistical measure of how well the regression model approximates the real data points. R² ranges from 0 to 1, where 1 indicates that the estimated hypothesis function perfectly fits the data. (Available since 5.5.1.2)

coefficients

Estimated coefficients $\theta_0, \theta_1, \theta_2, \theta_3, \dots$ of the linear hypothesis function h(x).

Assuming the data consists of documents representing house prices, we can compute statistics for the best-fitting linear hypothesis function, which predicts house prices from the number of bedrooms, the number of bathrooms and the size, with:

/houses/_search?size=0
{
    "aggs": {
        "house_prices": {
            "linreg_stats": {
                "fields": ["bedrooms", "bathrooms", "size", "price"]
            }
        }
    }
}

The aggregation type is linreg_stats, and the fields setting defines the array of fields used to build the linear model. All fields but the last are the explanatory variables; the last field is the response variable. The request above returns the following response:

{
    ...
    "aggregations": {
        "house_prices": {
            "rss": 49523788338938.75,
            "mse": 63410740510.80505,
            "r2": 0.4788369924642064,
            "coefficients": [
                47553.1873756476,
                -100544.07258945837,
                45981.15827544975,
                309.6013051477474
            ]
        }
    }
}
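
To make the relation between the three statistics concrete, here is a small Python sketch that computes them from observed and predicted values. It uses the textbook definitions (in particular R² = 1 - rss / tss, with tss the total sum of squares around the mean); that the plugin computes r2 in exactly this way is an assumption. The function name and the toy data are made up for illustration.

```python
def regression_stats(observed, predicted):
    """Return rss, mse and r2 for paired observed and predicted values."""
    n = len(observed)
    mean_y = sum(observed) / n
    rss = sum((y - p) ** 2 for y, p in zip(observed, predicted))
    tss = sum((y - mean_y) ** 2 for y in observed)  # total sum of squares
    return {"rss": rss, "mse": rss / n, "r2": 1.0 - rss / tss}


# Toy data, not taken from the house price index.
stats = regression_stats([1.0, 2.0, 3.0, 4.0], [1.1, 1.9, 3.2, 3.8])

# Since mse = rss / N, the number of documents behind the response above
# can be recovered from its two values.
n_docs = round(49523788338938.75 / 63410740510.80505)
print(n_docs)  # 781
```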

Data conditions

Due to algorithmic constraints, both aggregations return an empty response if

  • the number of documents in the search result is less than or equal to the number of explanatory variables, or

  • the values of the explanatory variables in the search result set are linearly dependent (that is, one column can be written as a linear combination of the other columns).
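
Both conditions can be checked on a data sample before running the aggregation. The Python sketch below is one way to do that, under the assumption that linear dependence is judged on the design matrix including a constant column for the intercept. The function names and the tolerance are illustrative choices and are not taken from the plugin.

```python
def matrix_rank(rows, tol=1e-9):
    """Rank of a small dense matrix via Gaussian elimination with partial pivoting."""
    m = [[float(v) for v in row] for row in rows]
    n_rows = len(m)
    n_cols = len(m[0]) if m else 0
    rank = 0
    for col in range(n_cols):
        if rank == n_rows:
            break
        # Pick the row with the largest entry in this column as pivot.
        pivot = max(range(rank, n_rows), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) <= tol:
            continue  # no usable pivot: this column depends on earlier ones
        m[rank], m[pivot] = m[pivot], m[rank]
        for r in range(rank + 1, n_rows):
            factor = m[r][col] / m[rank][col]
            for c in range(col, n_cols):
                m[r][c] -= factor * m[rank][c]
        rank += 1
    return rank


def can_estimate(observations):
    """observations: one row per document, holding its C explanatory values."""
    n = len(observations)
    c = len(observations[0]) if observations else 0
    if n <= c:
        return False  # too few documents for C explanatory variables
    # Assumption: a leading column of ones models the intercept theta_0.
    design = [[1.0, *row] for row in observations]
    return matrix_rank(design) == c + 1


ok = can_estimate([[1000, 2], [1500, 3], [2000, 3], [2500, 5]])
dependent = can_estimate([[1000, 2000], [1500, 3000], [2000, 4000], [2500, 5000]])
too_few = can_estimate([[1000, 2], [1500, 3]])
```

In the second call the second column is twice the first, and in the third call there are only two documents for two explanatory variables, so both return False.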

Algorithm

This implementation is based on a new parallel, single-pass OLS estimation algorithm for multiple linear regression (not yet published). By aggregating over the data only once and in parallel, the algorithm is well suited to large-scale, distributed data sets, and in this respect it has an advantage over most existing multi-pass analytical OLS estimators and iterative optimization algorithms.

The overall complexity of the implemented algorithm to estimate the regression coefficients is O(N C² + C³), where N denotes the size of the training data set (the number of documents in the search result set) and C the number of the indicated explanatory variables (fields).
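
The plugin's own algorithm is unpublished, so the following is not it. As a rough illustration of how a single-pass, mergeable OLS estimate can work in general, the Python sketch below uses the textbook normal-equations approach: each partition accumulates the sums X^T X and X^T y in one pass (O(C²) per document), partial sums are added together, and the resulting small linear system is solved once (O(C³)). This gives the same O(N C² + C³) shape as stated above. All class and method names are hypothetical.

```python
class OlsAccumulator:
    """Collects X^T X and X^T y for rows [1, x_1, ..., x_C] in a single pass."""

    def __init__(self, num_explanatory):
        self.k = num_explanatory + 1  # intercept plus C explanatory variables
        self.xtx = [[0.0] * self.k for _ in range(self.k)]
        self.xty = [0.0] * self.k

    def add(self, x, y):
        """Fold one observation into the running sums."""
        row = [1.0, *x]
        for i in range(self.k):
            self.xty[i] += row[i] * y
            for j in range(self.k):
                self.xtx[i][j] += row[i] * row[j]

    def merge(self, other):
        """Add the partial sums of another accumulator, e.g. from another shard."""
        for i in range(self.k):
            self.xty[i] += other.xty[i]
            for j in range(self.k):
                self.xtx[i][j] += other.xtx[i][j]

    def solve(self):
        """Solve (X^T X) theta = X^T y by Gauss-Jordan elimination."""
        k = self.k
        a = [row[:] + [b] for row, b in zip(self.xtx, self.xty)]
        for col in range(k):
            pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
            if abs(a[pivot][col]) < 1e-12:
                raise ValueError("explanatory variables are linearly dependent")
            a[col], a[pivot] = a[pivot], a[col]
            for r in range(k):
                if r != col:
                    factor = a[r][col] / a[col][col]
                    for c in range(col, k + 1):
                        a[r][c] -= factor * a[col][c]
        return [a[i][k] / a[i][i] for i in range(k)]


# Toy data generated from y = 1 + 2*x1 + 3*x2, split over two "shards".
shard_a, shard_b = OlsAccumulator(2), OlsAccumulator(2)
for x, y in [((1, 2), 9), ((2, 1), 8), ((3, 5), 22)]:
    shard_a.add(x, y)
for x, y in [((4, 3), 18), ((0, 1), 4)]:
    shard_b.add(x, y)
shard_a.merge(shard_b)
theta = shard_a.solve()  # approximately [1.0, 2.0, 3.0]
```

Because the accumulated sums are plain additions, the order in which partitions are merged does not matter, which is what makes this kind of estimator a natural fit for a distributed aggregation.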

Installation

Elasticsearch 5.x

To install this plugin, first pick the plugin version in the compatibility matrix that matches your Elasticsearch version, then use its download link in the following command:

./bin/elasticsearch-plugin install https://github.com/scaleborn/elasticsearch-linear-regression/releases/download/5.5.2.1/elasticsearch-linear-regression-5.5.2.1.zip

The plugin will be installed under the name "linear-regression". Do not forget to restart the node after installing.

Table 1. Compatibility matrix

Plugin version    Elasticsearch version    Release date
5.5.2.1           5.5.2                    Aug 29, 2017
5.5.1.2           5.5.1                    Aug 29, 2017
5.5.1.1           5.5.1                    Jul 27, 2017
5.5.0.1           5.5.0                    Jul 18, 2017
5.3.0.2           5.3.0                    Jul 16, 2017
5.3.0.1           5.3.0                    Jun 30, 2017

Examples

Predicting house prices

The idea is very simple. Our Elasticsearch index holds data on sold houses in our region, with features such as the square footage, the number of bathrooms and the number of bedrooms. Now we want to find out what price we would have to pay for the house of our dreams.

To import the data into Elasticsearch, we use Logstash with the pipeline config house-prices-import.conf:

./bin/logstash -f house-prices-import.conf

The indexed documents will have this form:

{
  "_index": "houses",
  "_type": "prices",
  "_id": "AV0zjVhTomRh2LZNgmfJ",
  "_source": {
      "bathrooms": 3,
      "bedrooms": 4,
      "size": 4168,
      "mls": "140077",
      "price": 1100000,
      "location": "Morro Bay",
      "price_sq_ft": 263.92,
      "status": "Short Sale"
  }
}

We can now query the index for houses in "Morro Bay" and predict the price of our dream house with the desired features: 3 bedrooms, 2 bathrooms and 2000 square feet:

/houses/_search?size=0
{
    "query": {
        "match" : {
            "location" : "Morro Bay"
        }
    },
    "aggs": {
        "dream_house_price": {
            "linreg_predict": {
                "fields": ["size", "bedrooms", "bathrooms", "price"],
                "inputs": [2000, 3, 2]
            }
        }
    }
}

According to the following prediction response, we should expect to pay about $650,000 for the desired house in "Morro Bay".

{
    "aggregations": {
        "dream_house_price": {
            "value": 649918.0709489314,
            "coefficients": [
                228318.6161854365,
                249.02340193904183,
                -68314.4830871133,
                64248.05007337558
            ]
        }
    }
}

By using sub-aggregations we can find the estimated prices per location:

/houses/_search?size=0
{
    "aggs": {
        "locations": {
            "terms": {
                "field": "location.keyword",
                "size": 15
            },
            "aggs": {
                "dream_house_price": {
                    "linreg_predict": {
                        "fields": ["size", "bedrooms", "bathrooms", "price"],
                        "inputs": [2000, 3, 2]
                    }
                }
            }
        }
    }
}

The response reveals that "Arroyo Grande" would be the most expensive region for our dream house:

{
    "aggregations": {
        "locations": {
            "buckets": [
                {
                    "key": "Santa Maria-Orcutt",
                    "doc_count": 265,
                    "dream_house_price": {
                        "value": 256251.9105297585,
                        "coefficients": [
                            26437.192829649313,
                            81.19071633227178,
                            6825.9128627023265,
                            23477.773223729317
                        ]
                    }
                },
                {
                    "key": "Paso Robles",
                    "doc_count": 85,
                    "dream_house_price": {
                        "value": 365620.0386191703,
                        "coefficients": [
                            42958.257094706176,
                            151.7000907380368,
                            6486.477078139843,
                            -98.91559301451247
                        ]
                    }
                },
                ...
                {
                    "key": " Arroyo Grande",
                    "doc_count": 12,
                    "dream_house_price": {
                        "value": 1140196.791331573,
                        "coefficients": [
                            728566.7474390095,
                            1956.6474540196602,
                            -706891.620925945,
                            -690495.0006844609
                        ]
                    }
                }
                ...
            ]
        }
    }
}
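
Picking the most expensive location out of such a response is a few lines of client code. The Python sketch below works on an abridged copy of the response above. The helper name is hypothetical, and skipping buckets without a "value" is a defensive assumption, since the exact shape of an empty aggregation result is not shown in this README.

```python
def most_expensive_location(response, terms_agg="locations", predict_agg="dream_house_price"):
    """Return (location, predicted price, doc_count) of the priciest bucket."""
    buckets = response["aggregations"][terms_agg]["buckets"]
    # Assumption: buckets without enough data carry no usable "value".
    priced = [b for b in buckets if b.get(predict_agg, {}).get("value") is not None]
    best = max(priced, key=lambda b: b[predict_agg]["value"])
    # Keys may carry stray whitespace, as " Arroyo Grande" does above.
    return best["key"].strip(), best[predict_agg]["value"], best["doc_count"]


# Abridged from the response above (coefficients omitted).
response = {
    "aggregations": {
        "locations": {
            "buckets": [
                {"key": "Santa Maria-Orcutt", "doc_count": 265,
                 "dream_house_price": {"value": 256251.9105297585}},
                {"key": "Paso Robles", "doc_count": 85,
                 "dream_house_price": {"value": 365620.0386191703}},
                {"key": " Arroyo Grande", "doc_count": 12,
                 "dream_house_price": {"value": 1140196.791331573}},
            ]
        }
    }
}
location, price, doc_count = most_expensive_location(response)
```

Returning doc_count alongside the price is deliberate: the "Arroyo Grande" model rests on only 12 documents, which is worth keeping in view when comparing it with buckets that have hundreds.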

License

Licensed under the Apache License 2.0.

elasticsearch-linear-regression's Issues

Compatibility with 6.1.3

Hi mate, thank you for your great plugin! It's really nice! Are you thinking of extending compatibility to Elasticsearch 6.x.x? Right now we can't use elasticsearch-linear-regression because it is not compatible with the latest Elasticsearch version.

Use case: predict the time of the next purchase

Hello,
this is probably not the best place to ask for comments on the use case above, but I'll give it a try.
I am trying to come up with an optimal way to forecast the purchase time for Product P and User U.
We currently index these events in ES (pushed by the e-commerce system):

orderId,user,product,quantity,time,days
"order1","U","P",1,"2017-01-01",17167
"order2","U","P",2,"2017-01-29",17195
"order3","U","P",3,"2017-04-02",17258
"order4","U","P",1,"2017-07-06",17353
"order5","U","P",2,"2017-08-03",17381

where days is just an integer giving the number of days since 1.1.1970 for the event time.

What I want to predict is the next time of purchase and the quantity.
The quantity is the last purchased quantity, in this case 2,
and the forecast time should be somewhere in October.

I've played with this plugin:
and it works well if I have an additional calculated field "lag" for each event, which denotes the time period until the NEXT purchase, so the data above should then look like:

orderId,user,product,quantity,time,days,lag
"order1","U","P",1,"2017-01-01",17167,28
"order2","U","P",2,"2017-01-29",17195,63
"order3","U","P",3,"2017-04-02",17258,95
"order4","U","P",1,"2017-07-06",17353,29
"order5","U","P",2,"2017-08-04",17382,?

GET /buying_habit/_search?size=0
{
    "query": {
        "match_all": {}
    },
    "aggs": {
        "demand_p": {
            "linreg_predict": {
                "fields": ["quantity", "lag"],
                "inputs": [2]
            }
        }
    }
}
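
For reference, the lag column in the table above can be derived from the days column on the client side before indexing. This is only a small Python sketch of that calculation, with a made-up function name; it does not answer whether the same can be done at index time.

```python
def with_lags(days):
    """Pair each purchase day with the number of days until the next purchase.

    The most recent purchase has no successor yet, so its lag is None.
    """
    ordered = sorted(days)
    if not ordered:
        return []
    pairs = [(cur, nxt - cur) for cur, nxt in zip(ordered, ordered[1:])]
    pairs.append((ordered[-1], None))
    return pairs


# The days values from the second table above.
lags = with_lags([17167, 17195, 17258, 17353, 17382])
print(lags)
# [(17167, 28), (17195, 63), (17258, 95), (17353, 29), (17382, None)]
```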

Of course, in this index I will have 10K different products and 1M different users.
My first question is how to update this field for the LAST event when a new event comes in?
Is it possible to do it at index time?
Does this make sense at all, or is there a better way?
Btw, in case there is only one purchase, I'd use the default lifecycle of the product in days (it comes from the e-commerce system as well). But for cases where there is a buying pattern (at least 2 events) I'd need to use user-specific data.
I plan to run a forecast query for each User/Product pair every hour to calculate the next forecast time (effectively when the user SHOULD run out of supply).
What would be the way to optimize that (avoid doing this one by one)?

Thanks very much in advance,
Milan

Provide aggregation to indicate breakouts regarding an estimated linear regression "channel"

Breakouts (e.g. documents with a response variable value outside of the upper and lower hyper-planes spaced a specified number of standard deviations above and below the middle linear regression hyper-plane) in time series data may indicate an anomaly.
A concrete concept still has to be defined.
Real-world use cases include e.g. stock markets, see https://www.dailyfx.com/forex/education/trading_tips/daily_trading_lesson/2014/10/24/Trend-Following-with-Regression-Channels.html.
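
Since the concept is still open, the following Python sketch shows only one possible reading of it, not a planned design: take the residuals of each observation against an already estimated model, use their standard deviation as the channel width, and flag observations that fall outside the channel. The function name, the num_std parameter and the toy data are all assumptions.

```python
import math


def find_breakouts(rows, coefficients, num_std=2.0):
    """Return indices of rows whose response lies outside the regression channel.

    rows: list of (x, y) with x the list of explanatory values.
    coefficients: [theta_0, theta_1, ..., theta_C] of an estimated model.
    """
    intercept, *slopes = coefficients
    residuals = [
        y - (intercept + sum(t * v for t, v in zip(slopes, x)))
        for x, y in rows
    ]
    # Assumption: channel width is based on the root mean squared residual.
    sigma = math.sqrt(sum(r * r for r in residuals) / len(residuals))
    return [i for i, r in enumerate(residuals) if abs(r) > num_std * sigma]


# Toy series around the line y = x, with one clear outlier at the end.
rows = [([1], 1.1), ([2], 1.9), ([3], 3.1), ([4], 3.9),
        ([5], 5.1), ([6], 5.9), ([7], 7.1), ([8], 11.0)]
breakouts = find_breakouts(rows, [0.0, 1.0], num_std=2.0)
```

Note that the outlier itself widens the channel here, because it is included in the standard deviation; a real design would have to decide whether that is acceptable.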
