KDD (Knowledge Discovery in Databases) Process
- Develop an understanding of the domain
- Create target data set
- Data cleaning and preprocessing, reduction and projection
- Method selection (classification, clustering, association analysis)
- Extract patterns, models
- Interpretation
- Consolidating knowledge
Data
Object | Attribute 1 | Attribute 2 | Attribute 3 |
---|---|---|---|
Object 1 | Attribute value 1 for object 1 | Attribute value 2 for object 1 | Attribute value 3 for object 1 |
Object 2 | Attribute value 1 for object 2 | Attribute value 2 for object 2 | Attribute value 3 for object 2 |
Object 3 | Attribute value 1 for object 3 | Attribute value 2 for object 3 | Attribute value 3 for object 3 |
Object 4 | Attribute value 1 for object 4 | Attribute value 2 for object 4 | Attribute value 3 for object 4 |
Attribute: variable, field, characteristic, feature
Object: record, case, observation, entity, instance
Types of attributes:
- Nominal/Categorical
  - Names, labels
  - Eye color
  - {juice, beer, soda, …}
- Ordinal
  - Energy efficiency: {C, B, A, A+, A++}
  - {bad, average, above average, good}
  - {hot > mild > cool}
- Interval
- Temperatures
- Dates, times
- Ratio
- Distance
- Real numbers
Discrete | Continuous |
---|---|
Countable | Uncountably infinite |
Usually integers | Real numbers |
Zip codes, sets of words, binary | Height, weight, temperature |
Classification
- Predict the class of an object based on its features
Regression
- Estimate the value of an unknown continuous variable for an object based on its attributes
Clustering
- Group similar objects into subgroups (clusters)
Association Rule Discovery
- Find which items frequently occur together
Outlier/Anomaly detection
- Detect significant deviations from normal behavior
Frequency(attribute value) = proportion of time the value occurs in the data set
Mode(attribute value) = most frequent attribute values
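The two definitions above can be sketched in a few lines of standard-library Python (the variable names and the example drink data are illustrative, not from the notes):

```python
# Frequency = proportion of times a value occurs; mode = most frequent value.
from collections import Counter

def frequency(values, v):
    """Proportion of observations equal to v."""
    return values.count(v) / len(values)

def mode(values):
    """Most frequent attribute value."""
    return Counter(values).most_common(1)[0][0]

drinks = ["juice", "beer", "soda", "beer", "beer", "juice"]
print(frequency(drinks, "beer"))  # 0.5
print(mode(drinks))               # beer
```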
Given an ordinal or continuous attribute x and a number p between 0 and 100, the p-th percentile is a value $x_p$ of x such that p% of the observed values of x are smaller than $x_p$.
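A minimal sketch of the percentile definition, using the nearest-rank convention (one of several common conventions; the function name is mine):

```python
import math

def percentile(values, p):
    """p-th percentile by the nearest-rank method: the smallest observed
    value x_p such that at least p% of the values are <= x_p."""
    xs = sorted(values)
    k = max(1, math.ceil(p / 100 * len(xs)))
    return xs[k - 1]

print(percentile(list(range(1, 11)), 50))   # 5
print(percentile(list(range(1, 11)), 100))  # 10
```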
Mean(attribute) = sum(attribute values)/m
Median(attribute) = value in the middle of the sorted observations, or the average of the 2 middle values when m is even (the median can be seen as an extreme trimmed mean, i.e. a mean computed after discarding the smallest and largest values)
Range(attribute) = difference between the largest and the smallest
Variance(attribute) = $s^2_x = \frac{1}{m-1}\sum_{i=1}^{m}(x_i - \bar{x})^2$
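The summary statistics above, sketched with the standard library only (function names are mine; variance uses the m−1 denominator from the formula):

```python
def mean(xs):
    return sum(xs) / len(xs)

def median(xs):
    """Middle value, or average of the two middle values when m is even."""
    s, m = sorted(xs), len(xs)
    return s[m // 2] if m % 2 else (s[m // 2 - 1] + s[m // 2]) / 2

def value_range(xs):
    return max(xs) - min(xs)

def variance(xs):
    """Sample variance with the (m - 1) denominator."""
    mu = mean(xs)
    return sum((x - mu) ** 2 for x in xs) / (len(xs) - 1)

data = [2, 4, 4, 4, 5, 5, 7, 9]
print(mean(data))         # 5.0
print(median(data))       # 4.5
print(value_range(data))  # 7
```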
Distribution of attribute values
How many values fall into each bin of size 10 (or 20) — a histogram.
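Binning for a histogram can be sketched like this (a minimal illustration; the dictionary-of-counts representation is my choice):

```python
def histogram(values, bin_size=10):
    """Count how many values fall into each bin [k*bin_size, (k+1)*bin_size)."""
    counts = {}
    for v in values:
        k = int(v // bin_size)
        counts[k] = counts.get(k, 0) + 1
    return counts

print(histogram([3, 12, 15, 27, 8]))  # {0: 2, 1: 2, 2: 1}
```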

Data quality problems: noise, outliers, missing values, duplicate data
Sampling
- Without replacement
- Each time item is selected it is removed
- With replacement
- Items are not removed
- One object can be picked more than once
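Both sampling schemes map directly onto the standard library (`random.sample` draws without replacement, `random.choices` with replacement; the seed and population here are illustrative):

```python
import random

random.seed(0)                         # for reproducibility
population = list(range(10))

# Without replacement: each selected item is removed, so no repeats.
without = random.sample(population, 5)

# With replacement: items are not removed, so one object can be
# picked more than once.
with_repl = random.choices(population, k=5)

assert len(set(without)) == len(without)  # duplicates are impossible here
```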
Dimensionality reduction
- Fewer resources needed
- Easier to visualize
- Eliminates irrelevant features and reduces noise
- Feature elimination
- Feature extraction: PCA
Principal components are linear combinations of the original attributes, ordered by decreasing amount of variance explained. They are orthonormal (orthogonal with unit norm) and uncorrelated, but not easily interpretable.
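For 2-D data the first principal component can be computed in closed form (eigenvector of the 2×2 sample covariance matrix with the largest eigenvalue), so a standard-library sketch is possible; the function name and example points are mine:

```python
import math

def pca_2d(points):
    """First principal component of 2-D data: the unit eigenvector of the
    sample covariance matrix with the largest eigenvalue (closed form)."""
    m = len(points)
    mx = sum(p[0] for p in points) / m
    my = sum(p[1] for p in points) / m
    cxx = sum((p[0] - mx) ** 2 for p in points) / (m - 1)
    cyy = sum((p[1] - my) ** 2 for p in points) / (m - 1)
    cxy = sum((p[0] - mx) * (p[1] - my) for p in points) / (m - 1)
    # Eigenvalues of [[cxx, cxy], [cxy, cyy]] via the quadratic formula.
    tr, det = cxx + cyy, cxx * cyy - cxy * cxy
    lam = tr / 2 + math.sqrt(tr * tr / 4 - det)   # larger eigenvalue
    if cxy == 0:                                  # axis-aligned special case
        v = (1.0, 0.0) if cxx >= cyy else (0.0, 1.0)
    else:
        vx, vy = lam - cyy, cxy                   # eigenvector for lam
        n = math.hypot(vx, vy)
        v = (vx / n, vy / n)
    return lam, v

# Points on the line y = x: the first component should point along (1, 1).
lam, v = pca_2d([(0, 0), (1, 1), (2, 2), (3, 3)])
```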
Attribute transformations:
- Apply a function to the attribute values: $x^k$, $\log(x)$, $\sqrt{x}$
- Replace each original attribute by a scaled version of the attribute
- Normalization: scale all data into the range [0,1] or [-1,1]
- Standardization: zero mean and unit variance
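The two scaling schemes, as a minimal standard-library sketch (function names are mine; standardization uses the sample standard deviation):

```python
def min_max(xs, lo=0.0, hi=1.0):
    """Scale values linearly into the range [lo, hi]."""
    a, b = min(xs), max(xs)
    return [lo + (x - a) * (hi - lo) / (b - a) for x in xs]

def standardize(xs):
    """Shift and scale to zero mean and unit (sample) variance."""
    m = sum(xs) / len(xs)
    s = (sum((x - m) ** 2 for x in xs) / (len(xs) - 1)) ** 0.5
    return [(x - m) / s for x in xs]

print(min_max([2, 4, 6]))      # [0.0, 0.5, 1.0]
print(standardize([2, 4, 6]))  # [-1.0, 0.0, 1.0]
```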
Similarity of binary vectors: for binary vectors x and y, let $M_{ab}$ = number of attributes where x has value $a \in \{0,1\}$ and y has value $b \in \{0,1\}$. Then:
- Simple Matching Coefficient: $SMC = \frac{M_{11} + M_{00}}{M_{00} + M_{01} + M_{10} + M_{11}}$
- Jaccard coefficient: $J = \frac{M_{11}}{M_{01} + M_{10} + M_{11}}$
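A small sketch of the $M_{ab}$ counts and the two coefficients built from them (function name and example vectors are mine):

```python
def smc_jaccard(x, y):
    """Simple Matching Coefficient and Jaccard coefficient for two
    equal-length binary vectors."""
    m11 = sum(1 for a, b in zip(x, y) if a == 1 and b == 1)
    m00 = sum(1 for a, b in zip(x, y) if a == 0 and b == 0)
    m10 = sum(1 for a, b in zip(x, y) if a == 1 and b == 0)
    m01 = sum(1 for a, b in zip(x, y) if a == 0 and b == 1)
    smc = (m11 + m00) / (m00 + m01 + m10 + m11)
    jac = m11 / (m01 + m10 + m11)
    return smc, jac

x = (1, 0, 0, 0, 0, 0, 0, 0, 0, 0)
y = (0, 0, 0, 0, 0, 0, 1, 0, 0, 1)
print(smc_jaccard(x, y))  # (0.7, 0.0) — SMC is high only due to shared zeros
```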
Correlation ≠ causation — look for a confounding 3rd variable.
Correlation between features: $corr(x_i, x_j) = \frac{cov(x_i, x_j)}{s_{x_i} s_{x_j}}$, where $x_i$ and $x_j$ are features.
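Sample covariance and the correlation built from it, as a minimal sketch (function names are mine; both use the m−1 denominator, consistent with the variance formula above):

```python
def covariance(xs, ys):
    """Sample covariance of two equal-length feature columns."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / (len(xs) - 1)

def correlation(xs, ys):
    """Pearson correlation: covariance divided by the standard deviations."""
    sx = covariance(xs, xs) ** 0.5
    sy = covariance(ys, ys) ** 0.5
    return covariance(xs, ys) / (sx * sy)

print(correlation([1, 2, 3], [2, 4, 6]))  # 1.0  (perfect positive)
print(correlation([1, 2, 3], [6, 4, 2]))  # -1.0 (perfect negative)
```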
Gower's similarity index (for objects x and y): the average of per-attribute similarities,
$S(x, y) = \frac{1}{d}\sum_{i=1}^{d} s_i(x, y)$
For interval/ratio attributes, $s_i(x, y) = 1 - \frac{|x_i - y_i|}{R_i}$, where $R_i$ is the range of the i-th attribute in the data; for nominal attributes, $s_i = 1$ if the values match and 0 otherwise.
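A sketch of Gower's index for mixed attributes, under the convention above (the argument names `ranges` and `kinds` are hypothetical, invented for this illustration):

```python
def gower_similarity(x, y, ranges, kinds):
    """Gower's index: average of per-attribute similarities.
    ranges[i] is R_i (ignored for nominal attributes);
    kinds[i] is 'num' (interval/ratio) or 'nom' (nominal)."""
    scores = []
    for xi, yi, r, k in zip(x, y, ranges, kinds):
        if k == 'num':
            scores.append(1 - abs(xi - yi) / r)   # 1 - |x_i - y_i| / R_i
        else:
            scores.append(1.0 if xi == yi else 0.0)  # exact-match for nominal
    return sum(scores) / len(scores)

# One numeric attribute (range 4) and one nominal attribute:
print(gower_similarity((1, 'red'), (3, 'red'), (4, None), ('num', 'nom')))  # 0.75
```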
??
In linear regression, the betas (coefficients) measure how influential the attributes are.
The higher $r^2$, the better — but it can be inflated by adding irrelevant variables, which is why we need the adjusted $r^2_{adj}$.
$r^2$ is the proportion (×100%) of the variance explained by the model.
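The standard formulas behind these two quantities, $R^2 = 1 - SS_{res}/SS_{tot}$ and $R^2_{adj} = 1 - (1 - R^2)\frac{m-1}{m-k-1}$, as a small sketch (function name is mine; `k` is the number of predictors):

```python
def r_squared(y, y_hat, k=None):
    """R^2 = 1 - SS_res / SS_tot. If k (number of predictors) is given,
    also return the adjusted R^2, which penalizes extra predictors."""
    m = len(y)
    mean_y = sum(y) / m
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    ss_res = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))
    r2 = 1 - ss_res / ss_tot
    if k is None:
        return r2
    r2_adj = 1 - (1 - r2) * (m - 1) / (m - k - 1)
    return r2, r2_adj

print(r_squared([1, 2, 3, 4], [1, 2, 3, 5], k=1))  # (0.8, 0.7)
```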
Naïve Bayes
Decision tree
Statistics
Labs answers, md, git
Dim reduction , pca
R visualisation, tutorials
Data types
Data similarity, types, covariant
Covariance and dependence