
Generating Data

Introduction

Data analysis often requires analysts to test the efficiency or performance of an algorithm on a certain type of data. In such cases, the focus is not on answering analytical questions, as we have seen earlier, but on testing a machine learning hypothesis, say, comparing two different algorithms to see which one achieves higher accuracy. For this, analysts typically work with synthetic, randomly generated data that they create themselves. This lab and the upcoming lesson will highlight some data generation techniques that you can use later to learn new algorithms without getting too deep into domain knowledge.

Objectives

You will be able to:

  • Identify the reason why data scientists would want to generate datasets
  • Generate datasets for classification problems
  • Generate datasets for regression problems

Practice datasets

Practice datasets allow you to test and debug algorithms and to assess their robustness. They are also used to understand how algorithms behave in response to changes in model parameters, as we shall see with some ML algorithms. Here are some of the reasons why such datasets are often preferred over real-world datasets:

  • Quick and easy generation - save data collection time and efforts
  • Predictable outcomes - have a higher degree of confidence in the result
  • Randomization - datasets can be randomized repeatedly to inspect performance in multiple cases
  • Simple data types - easier to visualize data and outcomes

In this lesson, we shall cover some of the Python functions that can help us generate random datasets.

make_blobs()

The official documentation for this function can be found here. This function generates isotropic Gaussian blobs for clustering and classification problems. We can control how many blobs to generate and how many samples to draw, as well as a host of other properties. Let's see how to import it in a Python environment:

# Import other libraries
import matplotlib.pyplot as plt

import pandas as pd

# Import make_blobs from scikit-learn
from sklearn.datasets import make_blobs

Let's now generate a 2D dataset of samples with three blobs, framed as a multi-class classification problem. Each observation will have two input features and a class label of 0, 1, or 2.

X, y = make_blobs(n_samples=100, centers=3, n_features=2)

Now we can go ahead and visualize the results using this code:

# Plot a scatter plot, color-coded by class label
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue', 2:'green'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
plt.show()

[Scatter plot showing the three generated blobs]

Above we see three distinct classes. We can generate any number of classes by adapting the code above. This dataset can then be fed to a number of classifiers to see how accurately they perform.
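As an example of adapting these parameters, here is a minimal sketch (the values are arbitrary, chosen only for illustration) using two more make_blobs() arguments: cluster_std controls how spread out each blob is, and random_state makes the draw reproducible.

# Four tighter, reproducible blobs: a smaller cluster_std reduces overlap,
# and fixing random_state makes the dataset identical on every run
X, y = make_blobs(n_samples=200, centers=4, n_features=2,
                  cluster_std=0.8, random_state=42)
print(X.shape)   # (200, 2)
print(set(y))    # {0, 1, 2, 3}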

make_moons()

This function is used for binary classification problems and generates two interleaving, moon-shaped clusters. It allows you to specify the level of noise in the data, which helps you make the dataset more complex when you want to test an algorithm's robustness. This is how you import and use it from scikit-learn:

from sklearn.datasets import make_moons
X, y = make_moons(n_samples=100, noise=0.1)

Now we can simply reuse the plotting code from the last example to visualize the data:

# Plot a scatter plot, color-coded by class label
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue', 2:'green'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
plt.show()

[Scatter plot showing the two moon-shaped classes]

The noise parameter controls the spread of the generated points. Try values between 0 and 1 and inspect the outcome: a noise of 0 produces perfect moon shapes, while a value of 1 is essentially all noise with no underlying pattern. Notice also that this pattern is not "linearly separable", i.e., we cannot draw a single straight line that separates the classes. This makes it a useful test case for non-linear classification approaches (for example, models built on non-linear activations such as sigmoid or tanh).
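To see why this matters, here is a minimal sketch (the models and parameter values are our own choices, not part of the original lesson) that compares a linear and a non-linear classifier on the moons data:

from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import train_test_split

# Regenerate the moons with a fixed seed so the comparison is reproducible
X, y = make_moons(n_samples=500, noise=0.2, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# A purely linear decision boundary struggles with the interleaving half-moons
linear_clf = LogisticRegression().fit(X_train, y_train)
print('Logistic regression accuracy:', linear_clf.score(X_test, y_test))

# An RBF-kernel SVM can bend its decision boundary around the moons
nonlinear_clf = SVC(kernel='rbf').fit(X_train, y_train)
print('RBF SVM accuracy:', nonlinear_clf.score(X_test, y_test))

Typically the non-linear model scores noticeably higher here, which is exactly the kind of comparison these generated datasets are meant to support.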

make_circles()

This function further complicates the generated data and creates values in the form of concentric circles. It also features a noise parameter, similar to make_moons(). Below is how you import and use this function:

from sklearn.datasets import make_circles
X, y = make_circles(n_samples=100, noise=0.05)

Bring in the plotting code from previous examples:

# Plot a scatter plot, color-coded by class label
df = pd.DataFrame(dict(x=X[:,0], y=X[:,1], label=y))
colors = {0:'red', 1:'blue', 2:'green'}
fig, ax = plt.subplots()
grouped = df.groupby('label')
for key, group in grouped:
    group.plot(ax=ax, kind='scatter', x='x', y='y', label=key, color=colors[key])
plt.show()

[Scatter plot showing the two concentric circular classes]

This is also suitable for testing complex, non-linear classifiers.
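If you want the two circles closer together or farther apart, make_circles() also accepts a factor argument (the ratio of the inner circle's radius to the outer one). A quick sketch, with values picked only for illustration:

# factor must lie between 0 and 1; smaller values pull the inner circle
# towards the center, larger values leave it close to the outer circle
X, y = make_circles(n_samples=100, noise=0.05, factor=0.3, random_state=1)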

make_regression()

This function allows you to create datasets for testing regression algorithms. Regression can be performed with a number of methods, ranging from ordinary least squares to deep neural networks. We can shape the dataset by setting the number of samples, the number of input features, the level of noise, and much more. Here is how we import and use this function:

from sklearn.datasets import make_regression
X, y = make_regression(n_samples=100, n_features=1, noise=0.1)
# Plot regression dataset
plt.scatter(X,y)
plt.show()

[Scatter plot of the generated regression data]
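As a quick sanity check (this fit is our own addition, not part of the original lesson code), we can run an ordinary least squares model on the generated data and confirm it recovers a near-linear relationship:

from sklearn.linear_model import LinearRegression

# With only mild noise, a plain OLS fit should explain almost all of the variance
reg = LinearRegression().fit(X, y)
print('Estimated slope:', reg.coef_[0])
print('R^2 on the training data:', reg.score(X, y))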

We can further transform the generated target to create non-linear relationships, which call for non-linear regression techniques:

# Generate new y 
y2 = y**2
y3 = y**3

# Visualize this data
plt.scatter(X, y2)
plt.show()
plt.scatter(X, y3)
plt.show()

[Scatter plots of X against the squared and cubed targets]
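One way to handle these curved relationships, sketched below with arbitrarily chosen settings, is to expand the input into polynomial features and fit an ordinary linear model on top of them:

from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Degree-3 polynomial features let a linear model capture the cubic shape of y3
model = make_pipeline(PolynomialFeatures(degree=3), LinearRegression())
model.fit(X, y3)
print('R^2 for the cubic fit:', model.score(X, y3))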

Level up (Optional)

scikit-learn comes with many data generation functions; we have seen a few popular ones above. Visit this link to look at more such functions (along with some real-world datasets).
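One generator worth calling out, since it gives finer control than make_blobs(), is make_classification(); here is a minimal sketch with arbitrarily chosen parameters:

from sklearn.datasets import make_classification

# 100 samples with 5 features, of which only 3 actually carry class information;
# random_state fixes the draw so the dataset is reproducible
X, y = make_classification(n_samples=100, n_features=5, n_informative=3,
                           n_redundant=0, n_classes=2, random_state=0)
print(X.shape, y.shape)   # (100, 5) (100,)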

Summary

In this lesson, we looked at generating random datasets for classification and regression tasks using scikit-learn's built-in functions. We covered some of the parameters used to generate data, and you are encouraged to dig deeper into the official documentation to see what else you can achieve. While learning a new algorithm, these synthetic datasets let you take your focus off the domain and concentrate on the computational and performance aspects of the algorithm.
