data-mining-coursework's Introduction

Data Mining Course at Aston University

KDD (Knowledge Discovery in Databases) Process

  1. Develop an understanding of the domain
  2. Create target data set
  3. Data cleaning and preprocessing, reduction and projection
  4. Method selection (classification, clustering, association analysis)
  5. Extract patterns, models
  6. Interpretation
  7. Consolidating knowledge

Data

| Object | Attribute 1 | Attribute 2 | Attribute 3 |
| --- | --- | --- | --- |
| Object 1 | Attribute value 1 for object 1 | Attribute value 2 for object 1 | Attribute value 3 for object 1 |
| Object 2 | Attribute value 1 for object 2 | Attribute value 2 for object 2 | Attribute value 3 for object 2 |
| Object 3 | Attribute value 1 for object 3 | Attribute value 2 for object 3 | Attribute value 3 for object 3 |
| Object 4 | Attribute value 1 for object 4 | Attribute value 2 for object 4 | Attribute value 3 for object 4 |

Attribute: variable, field, characteristic, feature

Objects: record, case, observation, entity, instance

Types of attributes:

  • Nominal/Categorical
    • {juice, beer, soda, …}
    • Names, Labels
    • Eye color
  • Ordinal
    • Energy efficiency {C, B, A, A+, A++}
    • {bad, average, above average, good}
    • {hot > mild > cool}
  • Interval
    • Temperatures
    • Dates, times
  • Ratio
    • Distance
    • Real numbers
| Discrete | Continuous |
| --- | --- |
| Countable | Infinite (uncountable) |
| Usually integers | Real numbers |
| Zip codes, sets of words, binary | Height, weight, temperature |

Data mining tasks

Classification

  • Predict the class of an object based on its features

Regression

  • Estimate the value of an unknown (continuous) variable for an object based on its attributes

Clustering

  • Group similar objects into subgroups (clusters)

Association Rule Discovery

  • What things go together

Outlier/Anomaly detection

  • Detect significant deviations from normal behavior

Data Exploration

Frequency(attribute value) = proportion of times the value occurs in the data set

Mode(attribute) = the most frequent attribute value

Percentiles

Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value $x_p$ of x such that p% of the observed values of x are smaller than $x_p$.
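NumPy's percentile function implements this idea; its default linear interpolation is just one convention for values of p that fall between observations:

```python
import numpy as np

# Observed values of an ordinal/continuous attribute x
x = np.array([2, 4, 4, 5, 7, 9, 10, 12, 15, 20])

# The 50th percentile is the median; NumPy interpolates linearly by default
p50 = np.percentile(x, 50)
p90 = np.percentile(x, 90)
```

Here half of the values lie below `p50 = 8.0`, the midpoint of the two middle observations.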

Mean(attribute) = $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_i$, where m is the number of observations

Median(attribute) = value in the middle of the sorted observations, or the average of the 2 middle values when m is even. (A trimmed mean discards the top and bottom p% of values before averaging, making it robust like the median.)

Range(attribute) = difference between the largest and the smallest values

Variance(attribute) = $s^2_x = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2$
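The summary statistics above in plain Python (the `statistics` module is in the standard library; note the m − 1 denominator for the sample variance):

```python
import statistics

values = [4.0, 8.0, 6.0, 5.0, 3.0, 7.0]
m = len(values)

mean = sum(values) / m
median = statistics.median(values)                  # average of 2 middle values here
value_range = max(values) - min(values)
# Sample variance divides by m - 1 (Bessel's correction)
variance = sum((v - mean) ** 2 for v in values) / (m - 1)
```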

Visualisation Techniques

Histograms

Distribution of attribute values

How many values fall into each bin of size 10 (or 20).
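The bin counting behind a histogram can be sketched with `numpy.histogram` (the values and bin edges below are illustrative):

```python
import numpy as np

values = np.array([3, 7, 12, 15, 18, 22, 25, 31, 34, 38])

# Count how many values fall into each bin of size 10 over [0, 40]
counts, edges = np.histogram(values, bins=[0, 10, 20, 30, 40])
```

`counts` holds the bar heights a histogram plot would draw for these bins.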

Box plots

Scatter plots

Data quality

Noise, Outliers, Missing values, Duplicate data

Data preprocessing

Sampling

  • Without replacement
    • Each time an item is selected, it is removed from the population
  • With replacement
    • Items are not removed
    • The same object can be picked more than once
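Both sampling schemes are available in Python's standard library `random` module (population and sample size here are illustrative):

```python
import random

random.seed(0)
population = list(range(100))

# Without replacement: each selected item is removed, so no duplicates
without = random.sample(population, 10)

# With replacement: items stay in the pool, one object can appear twice
with_repl = [random.choice(population) for _ in range(10)]
```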

Dimensionality reduction

  • Less resources needed

  • Easier to visualise

  • Eliminate irrelevant features and reduce noise

  • Feature elimination

  • Feature extraction: PCA

PCA

Linear combinations of the original attributes. Ordered in decreasing amount of variance explained. Orthonormal (orthogonal with unit norm), independent. Not easily interpretable.
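PCA as described can be sketched by diagonalising the covariance matrix with NumPy (a sketch on synthetic data; in practice `sklearn.decomposition.PCA` does the same job):

```python
import numpy as np

rng = np.random.default_rng(42)
X = rng.normal(size=(200, 3))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * X[:, 1]   # make two features strongly correlated

# Centre the data, then diagonalise the covariance matrix
Xc = X - X.mean(axis=0)
cov = np.cov(Xc, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)    # eigh returns ascending order

# Reorder components in decreasing order of variance explained
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the first two principal components
scores = Xc @ eigvecs[:, :2]
explained = eigvals / eigvals.sum()
```

The eigenvectors are orthonormal, and `explained` shows the decreasing share of variance each component accounts for.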

Attribute transformation

Apply a function to the attribute values: $x^k$, $\log(x)$, $\sqrt{x}$

Standardisation

Replace each original attribute by a scaled version of the attribute

Scale all data in the range [0,1] or [-1,1]

Normalisation

Zero mean and unit variance:

$x' = \frac{x - \bar{x}}{s_x}$,

where $\bar{x}$ is the mean and $s_x$ the standard deviation of the attribute.
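Using the conventions above (min-max scaling into [0, 1], and zero mean with unit variance), a minimal NumPy sketch:

```python
import numpy as np

x = np.array([10.0, 20.0, 30.0, 40.0, 50.0])

# Min-max scaling into [0, 1]
scaled = (x - x.min()) / (x.max() - x.min())

# Zero mean and unit variance: x' = (x - mean) / std
normalised = (x - x.mean()) / x.std(ddof=1)   # ddof=1 matches the m - 1 variance
```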

Similarity and Dissimilarity

Euclidean distance: $d(x, y) = \sqrt{\sum_i (x_i - y_i)^2}$

Norm ($L_p$): $\|x\|_p = \left(\sum_i |x_i|^p\right)^{1/p}$; the Euclidean norm is the $p = 2$ case

For binary vectors x and y (i.e. with binary attributes), $M_{ab}$ = number of attributes where x has value $a \in \{0,1\}$ and y has value $b \in \{0,1\}$.

Simple Matching Coefficient: $SMC = \frac{M_{11} + M_{00}}{M_{00} + M_{01} + M_{10} + M_{11}}$

Jaccard Similarity Coefficient: $J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$

Cosine similarity: $\cos(x, y) = \frac{x \cdot y}{\|x\|\,\|y\|}$
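Assuming the standard definitions (SMC counts both kinds of matches, Jaccard ignores 0–0 matches), the three similarities can be computed directly:

```python
import math

x = [1, 0, 0, 1, 1, 0, 1, 0]
y = [1, 1, 0, 0, 1, 0, 1, 1]

# M_ab = number of attributes where x has value a and y has value b
m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
m00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)

smc = (m11 + m00) / (m11 + m00 + m10 + m01)
jaccard = m11 / (m11 + m10 + m01)

# Cosine similarity also works for non-binary vectors
dot = sum(a * b for a, b in zip(x, y))
cosine = dot / (math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y)))
```

For these two vectors SMC = 5/8 but Jaccard = 1/2, showing how the 0–0 matches inflate SMC on sparse data.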

Covariance matrix (between features): $s_{ij} = \frac{1}{m-1}\sum_{k=1}^{m}(x_{ki} - \bar{x}_i)(x_{kj} - \bar{x}_j)$

Correlation (between features): $r_{ij} = \frac{s_{ij}}{s_i s_j}$, i.e. covariance rescaled into [-1, 1]

correlation ≠ causation; look for a 3rd (confounding) variable

Here $x_i$ and $x_j$ are features.
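A quick NumPy check of covariance and correlation between two features; the values here are synthetic and chosen so the relationship is perfectly linear:

```python
import numpy as np

# Two features observed over the same m = 4 objects
x_i = np.array([2.0, 4.0, 6.0, 8.0])
x_j = np.array([1.0, 3.0, 5.0, 7.0])

# Sample covariance between features i and j (off-diagonal of the 2x2 matrix)
cov = np.cov(x_i, x_j)[0, 1]

# Correlation rescales covariance into [-1, 1]
corr = np.corrcoef(x_i, x_j)[0, 1]
```

Since `x_j = x_i - 1` exactly, the correlation is 1 even though the covariance depends on the scale of the data.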

Gower’s similarity index (for objects)

Here x, y are objects; the overall index averages per-attribute similarities $s_i(x, y)$, so it can handle mixed attribute types.

For interval/ratio attributes:

$s_i(x, y) = 1 - \frac{|x_i - y_i|}{R_i}$,

where $R_i$ is the range of the i-th attribute in the data.
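A minimal sketch of Gower’s index for mixed attribute types; the objects, attribute layout, and ranges below are all hypothetical:

```python
# Hypothetical objects: (colour: nominal, rating: ordinal, height: ratio)
x = ("red", 3, 170.0)
y = ("blue", 5, 180.0)

# Ranges R_i of the numeric attributes across the whole (assumed) data set
ranges = {1: 4.0, 2: 50.0}

def gower(x, y, nominal=(0,)):
    """Average of per-attribute similarities: exact match/mismatch for
    nominal attributes, 1 - |x_i - y_i| / R_i for interval/ratio ones."""
    scores = []
    for i, (a, b) in enumerate(zip(x, y)):
        if i in nominal:
            scores.append(1.0 if a == b else 0.0)
        else:
            scores.append(1.0 - abs(a - b) / ranges[i])
    return sum(scores) / len(scores)

s = gower(x, y)
```

For these objects the per-attribute similarities are 0, 0.5 and 0.8, averaging to about 0.43.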

??

...

In linear regression, the betas indicate how influential the attributes are.

Higher $r^2$ is better, but $r^2$ can be inflated simply by adding more (even irrelevant) variables; that is why we need the adjusted $r^2_{adj}$.

$r^2 \times 100\%$ is the percentage of variance in the target explained by the model.
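A sketch of $r^2$ and adjusted $r^2$ on synthetic data, using `np.polyfit` for the least-squares fit (variable names and the data-generating setup are mine):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=50)
y = 3.0 * x + 2.0 + rng.normal(scale=1.0, size=50)   # strong linear signal + noise

# Least-squares fit: y ≈ beta1 * x + beta0
beta1, beta0 = np.polyfit(x, y, 1)
pred = beta1 * x + beta0

# r^2: fraction of the variance in y explained by the model
ss_res = np.sum((y - pred) ** 2)
ss_tot = np.sum((y - y.mean()) ** 2)
r2 = 1.0 - ss_res / ss_tot

# Adjusted r^2 penalises extra predictors (m observations, k predictors)
m, k = len(y), 1
r2_adj = 1.0 - (1.0 - r2) * (m - 1) / (m - k - 1)
```

The adjusted value is always at most $r^2$, and falls further behind as useless predictors are added.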

To Learn

Naïve Bayes
Decision tree
Statistics 
Labs answers, md, git
Dim reduction , pca 
R visualisation, tutorials 
Data types
Data similarity, types, covariance
Covariance and dependence 

