: Study and practice various machine learning models
- 1-1 Gaussian Naive Bayes
- 1-2 Multinomial Naive Bayes
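- A minimal sketch of the two variants (assuming scikit-learn is available; the toy data are made up for illustration):

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB, MultinomialNB

y = np.array([0, 0, 1, 1])

# Continuous features -> Gaussian naive Bayes
X_cont = np.array([[1.2, 3.1], [0.9, 2.8], [3.5, 0.4], [3.9, 0.2]])
gnb = GaussianNB().fit(X_cont, y)
print(gnb.predict([[1.0, 3.0]]))   # -> [0]

# Count features (e.g. word counts) -> multinomial naive Bayes
X_counts = np.array([[3, 0, 1], [2, 0, 2], [0, 4, 0], [1, 3, 0]])
mnb = MultinomialNB().fit(X_counts, y)
print(mnb.predict([[0, 3, 1]]))    # -> [1]
```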
- Select a distance d(a, b)
- Categorical variables : Hamming distance
- Continuous variables : Euclidean distance, Manhattan distance
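- A quick sketch of the three distances on made-up vectors (assuming NumPy):

```python
import numpy as np

# Continuous vectors
a = np.array([1.0, 4.0, 2.0])
b = np.array([3.0, 1.0, 2.0])
euclidean = np.sqrt(np.sum((a - b) ** 2))  # sqrt(4 + 9 + 0) ~= 3.61
manhattan = np.sum(np.abs(a - b))          # 2 + 3 + 0 = 5.0

# Categorical vectors: Hamming distance counts mismatched positions
u = np.array(["red", "small", "round"])
v = np.array(["red", "large", "square"])
hamming = np.sum(u != v)                   # 2

print(euclidean, manhattan, hamming)
```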
-
-
3-1. Assumptions
- The observations in each class follow a normal (Gaussian) distribution.
- Every class has a similar (shared) covariance structure.
-
3-2. Characteristics of the decision boundary obtained from LDA
- The axis orthogonal to the boundary : consider the shape of the class distributions when the data are projected onto this axis.
- Maximize only the difference in means? That would just use the difference vector of the two class means, ignoring the spread.
=> The boundary that maximizes the difference in projected means relative to the projected variance.
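- A worked sketch of this criterion (assuming NumPy; the class means and the shared covariance are made up). The discriminant direction is w = Sigma_w^-1 (mu1 - mu0):

```python
import numpy as np

rng = np.random.default_rng(0)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])       # shared covariance (LDA assumption)
X0 = rng.multivariate_normal([0.0, 0.0], cov, 100)
X1 = rng.multivariate_normal([2.0, 1.0], cov, 100)

mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
Sigma_w = 0.5 * (np.cov(X0.T) + np.cov(X1.T))  # pooled within-class covariance
w = np.linalg.solve(Sigma_w, mu1 - mu0)        # direction orthogonal to the boundary

# Projected onto w, the class means separate relative to the within-class spread
proj0, proj1 = X0 @ w, X1 @ w
fisher = (proj1.mean() - proj0.mean()) ** 2 / (proj0.var() + proj1.var())
print(fisher)
```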
-
-
3-3. Advantages
- Unlike the naive Bayes model, it reflects the covariance structure between the explanatory variables.
- Relatively robust even when assumptions are violated.
-
3-4. Disadvantages
- The sample size of the smallest class must be greater than the number of explanatory variables.
- Performs poorly when the data deviate significantly from the normality assumption.
- Cannot reflect cases where the covariance structure differs across the categories of y.
-
3-5. Define and understand QDA
- QDA removes the assumption of a common covariance structure Σ independent of the class k.
- It can be used when different categories of Y have different covariance structures.
-
- Relative advantages of QDA
- Allows a different covariance structure for each category of y.
- Relative disadvantages of QDA
- With a large number of explanatory variables, there are many more parameters to estimate.
- Therefore requires a large sample size.
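- A sketch contrasting LDA and QDA (assuming scikit-learn; the two made-up classes deliberately have different covariances):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(1)
# Class 0: tight, round covariance; class 1: stretched, correlated covariance
X0 = rng.multivariate_normal([0.0, 0.0], [[0.3, 0.0], [0.0, 0.3]], 200)
X1 = rng.multivariate_normal([2.0, 2.0], [[2.0, 1.5], [1.5, 2.0]], 200)
X = np.vstack([X0, X1])
y = np.array([0] * 200 + [1] * 200)

lda = LinearDiscriminantAnalysis().fit(X, y)
qda = QuadraticDiscriminantAnalysis().fit(X, y)

# With unequal covariances, QDA's quadratic boundary usually fits better
print("LDA accuracy:", lda.score(X, y))
print("QDA accuracy:", qda.score(X, y))
```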
-
-
4-1. Background
- When assumptions about the distribution of the data are hard to make, how do we split the data?
- Focus on the boundary : determine the boundary that maximizes the margin.
- Problem : what if some cases are not perfectly separable?
=> Allow a small amount of error and determine the boundary that minimizes it.
-
-
The method divides into two cases based on the form of the dependent variable.
- Categorical variable : support vector classifier (SVM)
- Continuous variable : support vector regression (SVR)
- Key to SVM and SVR : use the margin to distinguish the points that do and do not affect the model cost.
- SVM : points that fall within the margin, or are classified on the wrong side.
- SVR : points that fall outside the margin.
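- A minimal sketch of both variants (assuming scikit-learn; the data are synthetic, and C / epsilon are illustrative values that would normally be tuned):

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(2)
X = rng.normal(size=(100, 2))
# Overlapping classes (label noise), and a continuous target
y_class = (X[:, 0] + X[:, 1] + rng.normal(scale=0.5, size=100) > 0).astype(int)
y_cont = 2.0 * X[:, 0] + rng.normal(scale=0.1, size=100)

clf = SVC(kernel="linear", C=1.0).fit(X, y_class)
reg = SVR(kernel="linear", C=1.0, epsilon=0.1).fit(X, y_cont)

# Only the support vectors affect the fitted model: points inside the margin
# or on the wrong side (SVC), and points outside the epsilon-tube (SVR)
print(len(clf.support_), "of", len(X), "points shape the SVC boundary")
print(len(reg.support_), "of", len(X), "points shape the SVR fit")
```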
-
-
-
4-2. SVM with Kernel
- For non-linear relationships
- The curse of dimensionality
- When fitting data with a non-linear structure, it is necessary to use a kernel.
- However, as the degree d of a polynomial expansion increases, the number of parameters to estimate grows rapidly; above a certain dimensionality this results in higher test error. The kernel trick gives the non-linear fit without constructing these features explicitly, as in the sketch below.
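- A sketch of this point (assuming scikit-learn): a non-linear boundary via the kernel, with no explicit high-degree features.

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no linear boundary can separate them
X, y = make_circles(n_samples=200, noise=0.1, factor=0.4, random_state=0)

linear = SVC(kernel="linear", C=1.0).fit(X, y)
rbf = SVC(kernel="rbf", C=1.0, gamma="scale").fit(X, y)

print("linear kernel accuracy:", linear.score(X, y))  # near chance
print("rbf kernel accuracy:", rbf.score(X, y))        # near 1.0
```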
-
4-3. SVM vs. LDA
- Relative advantages of SVM
- When the distribution of the data is difficult to model, considering the covariance structure is inefficient; only the observations near the boundary need to be considered.
- Higher prediction accuracy.
-
- Relative disadvantages of SVM
- The cost parameter C must be determined (e.g. by cross-validation).
- Takes a long time to fit the model.
-
-
5-1. Definition
: A model that builds splitting criteria from the variables, uses them to partition the sample, and then estimates the properties of each resulting group.
- Advantages: highly interpretable, intuitive, universal.
- Disadvantages: high variance; can be sensitive to the particular sample.
-
5-2. Decision tree terminology
- Node - the location of the variable on which a split is based; the sample is divided here.
- Parent node - a relative concept; the node immediately above.
- Child node - the node immediately below.
- Root node - the top-level node with no parent node.
- Leaf node (tip) - the lowest node, with no children.
- Internal node - a node that is not a leaf node.
- Edge - where the condition that splits the samples is located.
- Depth - the number of edges that must be traversed to reach a particular node from the root node.
- Depending on the response variable :
- Categorical variables : Classification tree
- Continuous variables : Regression tree (estimate y by the mean of the samples in each region)
-
5-3. Entropy
- Entropy is often used as the criterion for selecting the best attribute on which to split a node of the tree.
- The attribute that maximizes the information gain, i.e. the reduction in entropy achieved by splitting the node on that attribute, is chosen as the best attribute.
- The entropy of a set S with respect to a binary classification problem is:
  Entropy(S) = -p(+) log2 p(+) - p(-) log2 p(-),
  where p(+) and p(-) are the proportions of positive and negative examples in S.
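- A minimal sketch of this formula (assuming NumPy):

```python
import numpy as np

def entropy(labels):
    # Entropy of a label array; unique() only returns classes that occur,
    # so no zero-probability terms enter the log
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

print(entropy([1, 1, 0, 0]))  # 1.0   (maximally mixed node)
print(entropy([1, 1, 1, 1]))  # 0.0   (pure node)
print(entropy([1, 1, 1, 0]))  # ~0.811
```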
5-4. Information Gain
- The difference in entropy before and after splitting at a particular node of the decision tree.
- A higher information gain indicates that the attribute splits the dataset into more homogeneous subsets, making the classification task easier. Conversely, a lower information gain indicates that the attribute is less useful for classification.
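- A sketch of the computation (assuming NumPy; the two splits are made up to show the extremes):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def information_gain(parent, children):
    # Gain = entropy before the split minus the size-weighted entropy after
    n = len(parent)
    weighted = sum(len(c) / n * entropy(c) for c in children)
    return entropy(parent) - weighted

parent = np.array([1, 1, 1, 0, 0, 0])
# A split into pure children achieves the maximum gain
print(information_gain(parent, [np.array([1, 1, 1]), np.array([0, 0, 0])]))  # 1.0
# A split whose children are as mixed as the parent gains nothing
print(information_gain(parent, [np.array([1, 0]), np.array([1, 1, 0, 0])]))  # 0.0
```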
-
-
5-5. Classification Tree
- The idea : according to the tree's split conditions, divide the region that X can occupy into blocks.
- Estimate Y from the attributes of the samples in each blocked region (e.g. the majority class), as in the sketch below.
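- A minimal sketch (assuming scikit-learn; max_depth=2 is an illustrative choice):

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)
tree = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# The printed rules are exactly the conditions that divide X-space into blocks
print(export_text(tree, feature_names=["sep_len", "sep_wid", "pet_len", "pet_wid"]))
print(tree.predict(X[:3]))  # majority class of each sample's block
```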
-
-
5-6. Regression Tree
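- A minimal sketch following the definition above (estimate y by the mean within each region), assuming scikit-learn and made-up sine data:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(3)
X = np.sort(rng.uniform(0.0, 5.0, size=(80, 1)), axis=0)
y = np.sin(X).ravel() + rng.normal(scale=0.1, size=80)

reg = DecisionTreeRegressor(max_depth=3, random_state=0).fit(X, y)

# Predictions are piecewise constant: the leaf-wise means of y
print(reg.predict([[0.5], [2.5], [4.5]]))
```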