ML-python-studyML 🤖

: Study and practice of various machine learning models



1. Naive Bayes Model

  • 1-1 Gaussian Naive Bayes
  • 1-2 Multinomial Naive Bayes
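
A minimal sketch comparing the two variants with scikit-learn (the iris data and split below are illustrative choices, not from these notes):

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB, MultinomialNB

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# GaussianNB: assumes each feature is normally distributed within a class,
# so it suits continuous measurements like the iris lengths and widths.
gnb = GaussianNB().fit(X_train, y_train)

# MultinomialNB: intended for non-negative count features (e.g. word counts);
# it runs on iris because the features happen to be non-negative, but it is
# a modelling mismatch shown here only for contrast.
mnb = MultinomialNB().fit(X_train, y_train)

print("Gaussian NB   :", gnb.score(X_test, y_test))
print("Multinomial NB:", mnb.score(X_test, y_test))
```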

2. KNN : k-Nearest Neighbors

  • Select a distance d(a, b)
    • Categorical variables : Hamming distance

    • Continuous variables : Euclidean distance, Manhattan distance
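
In scikit-learn the distance choice is just a parameter; a sketch on made-up data (the blob means and binary encoding are my assumptions):

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

# Continuous features: two Gaussian blobs.
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(3, 1, (100, 2))])
y = np.array([0] * 100 + [1] * 100)

# Minkowski p=2 is the Euclidean distance, p=1 the Manhattan distance.
knn_euclidean = KNeighborsClassifier(n_neighbors=5, p=2).fit(X, y)
knn_manhattan = KNeighborsClassifier(n_neighbors=5, p=1).fit(X, y)

# Categorical (here 0/1-encoded) features: the Hamming distance counts
# the fraction of mismatching coordinates.
Xc = rng.integers(0, 2, size=(200, 4)).astype(float)
yc = (Xc.sum(axis=1) > 2).astype(int)
knn_hamming = KNeighborsClassifier(n_neighbors=5, metric="hamming").fit(Xc, yc)
```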

3. LDA : Linear Discriminant Analysis

  • 3-1. Assumption

    • Each class follows a normal (Gaussian) distribution.
    • All classes share a similar covariance structure.
  • 3-2. The characteristics of the decision boundary obtained as a result of LDA

    • Axis orthogonal to boundary

      • Consider the shape of the distribution when the data is projected onto this axis.
    • Maximizing the difference in means

      • Use the difference vector of the two class means.

    => A boundary that maximizes the separation between the class means relative to the within-class variance

  • 3-3. Advantage

    • Unlike the Naive Bayes model, it reflects the covariance structure between the explanatory variables.
    • Relatively robust even when assumptions are violated.
  • 3-4. Disadvantages

    • The sample size of the smallest class must exceed the number of explanatory variables.
    • Performs poorly when the data deviate significantly from the normality assumption.
    • Fails to reflect cases where the covariance structure differs across the categories of y.
  • 3-5. Define and understand QDA

    • QDA drops the assumption of a common covariance matrix Σ shared across all classes k.
      • It can be utilized when different categories of Y have different covariance structures.
  • 3-6. LDA vs QDA

    • Relative advantages of QDA

      • Allows different covariance structures for different categories of y.
    • Relative disadvantages of QDA

      • With many explanatory variables there are far more parameters to estimate, which requires a large sample size.
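
The contrast shows up on data where the classes genuinely differ in covariance; a synthetic sketch (the means and covariance matrices below are made-up assumptions):

```python
import numpy as np
from sklearn.discriminant_analysis import (
    LinearDiscriminantAnalysis,
    QuadraticDiscriminantAnalysis,
)

rng = np.random.default_rng(0)

# Class 0: spherical covariance. Class 1: strongly correlated covariance --
# this violates LDA's shared-Sigma assumption but matches QDA's model.
X0 = rng.multivariate_normal([0, 0], [[1.0, 0.0], [0.0, 1.0]], size=300)
X1 = rng.multivariate_normal([2, 2], [[1.0, 0.9], [0.9, 1.0]], size=300)
X = np.vstack([X0, X1])
y = np.array([0] * 300 + [1] * 300)

lda = LinearDiscriminantAnalysis().fit(X, y)     # one shared Sigma -> linear boundary
qda = QuadraticDiscriminantAnalysis().fit(X, y)  # per-class Sigma_k -> quadratic boundary

print("LDA:", lda.score(X, y), " QDA:", qda.score(X, y))
```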

4. SVM : Support Vector Machine

  • 4-1. Background

    • When assumptions about the distribution of the data are hard to make, how should the data be split?

      • Focus on the boundary itself.
      • Choose the boundary that maximizes the margin.

    • Problem
      • What if the classes are not perfectly separable?

        => Allow a small amount of error (slack) and choose the boundary that minimizes it

    • Two variants, depending on the type of the dependent variable:

      • Categorical variables
        • Support vector classifier
      • Continuous variables
        • Support vector regression (SVR)

    • Key idea of SVM and SVR
      • The margin separates the points that affect the model cost from those that do not.

        • SVM
          • Points that fall inside the margin or are classified on the wrong side.
        • SVR
          • Points that fall outside the margin (the ε-tube).
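
This "which points carry cost" distinction shows up directly in scikit-learn's `support_` attribute; a sketch on synthetic data (shapes and noise levels are arbitrary assumptions):

```python
import numpy as np
from sklearn.svm import SVC, SVR

rng = np.random.default_rng(0)

# Classification: only points inside the margin or on the wrong side
# become support vectors.
Xc = rng.normal(size=(200, 2))
yc = (Xc[:, 0] + Xc[:, 1] > 0).astype(int)
svc = SVC(kernel="linear", C=1.0).fit(Xc, yc)

# Regression: only points *outside* the epsilon-tube become support vectors.
Xr = np.linspace(0, 5, 100).reshape(-1, 1)
yr = np.sin(Xr).ravel() + 0.1 * rng.normal(size=100)
svr = SVR(kernel="rbf", epsilon=0.2).fit(Xr, yr)

print("SVC support vectors:", len(svc.support_), "of", len(Xc))
print("SVR support vectors:", len(svr.support_), "of", len(Xr))
```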
  • 4-2. SVM with Kernel

    • Used for non-linear relationships.
    • The curse of dimensionality
      • Fitting data with a non-linear structure calls for a kernel (e.g. a degree-d polynomial feature map).
      • However, as the degree d grows, the number of parameters to estimate increases sharply beyond a certain dimensionality, which drives up the test error.
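
The effect of a kernel on non-linear structure can be seen on the classic concentric-circles data (an illustrative sketch):

```python
from sklearn.datasets import make_circles
from sklearn.svm import SVC

# Two concentric rings: no straight line can separate them.
X, y = make_circles(n_samples=300, factor=0.3, noise=0.05, random_state=0)

linear = SVC(kernel="linear").fit(X, y)
rbf = SVC(kernel="rbf").fit(X, y)  # implicit non-linear feature map

print("linear kernel:", linear.score(X, y))
print("RBF kernel   :", rbf.score(X, y))
```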
  • 4-3. SVM vs. LDA

    • Relative Advantages of SVM

      • When the distribution of the data is hard to model, accounting for the full covariance structure is inefficient.

        • SVM only needs to consider observations near the boundary (the support vectors).
      • Often gives higher prediction accuracy.

    • Relative disadvantages of SVM

      • The cost parameter C must be tuned.
      • Building the model can take a long time.

5. Decision Tree

  • 5-1. Definition

    : A model that builds splitting criteria on the variables, uses them to partition the sample, and then estimates the properties of each resulting group.

    • Advantages: highly interpretable, intuitive, broadly applicable.
    • Disadvantages: high variance; can be sensitive to the training sample.
  • 5-2. Decision tree terminology




  • Node - The location of the variable on which the classification is based. Divide the sample based on this.

    • Parent node - a relative concept; the node one level above.
    • Child node - the node one level below.
    • Root node - the top-level node, which has no parent.
    • Leaf node (Tip) - a bottom-level node with no children.
    • Internal node - any node that is not a leaf node.
  • Edge - carries the condition that routes samples from a node to its children.

  • Depth - the number of edges that must be traversed to reach a particular node from the Root node.

  • Depending on the response variable

    • Categorical variables : Classification tree
    • Continuous variables : Regression tree (estimate y by its mean value within each region)



  • 5-3. Entropy

    • Entropy is often used as a criterion to select the best attribute for splitting a node in the tree.

    • The attribute that maximizes the information gain, which is the reduction in entropy achieved by splitting the node according to that attribute, is chosen as the best attribute.

    • The entropy of a set S with respect to a binary classification problem is given by the following formula:
      H(S) = -p₊ log₂(p₊) - p₋ log₂(p₋), where p₊ and p₋ are the proportions of positive and negative examples in S.

  • 5-4. Information Gain

    • The difference in entropy before and after a split at a particular node of the decision tree.

    • A higher information gain indicates that the attribute can split the dataset into more homogeneous subsets, making the classification task easier. Conversely, a lower information gain indicates that the attribute is less useful for classification.

      IG(S, A) = H(S) - Σ_v (|S_v| / |S|) · H(S_v), summing over the subsets S_v produced by splitting S on attribute A.
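
Both quantities are easy to compute by hand; a small sketch (the function names are mine):

```python
import numpy as np

def entropy(labels):
    """H(S) = -sum_k p_k * log2(p_k) over the class proportions p_k."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return float(-(p * np.log2(p)).sum())

def information_gain(parent, children):
    """Entropy of the parent minus the size-weighted entropy of the children."""
    n = len(parent)
    return entropy(parent) - sum(len(c) / n * entropy(c) for c in children)

parent = [1, 1, 1, 0, 0, 0]
left, right = [1, 1, 1], [0, 0, 0]              # a perfectly separating split
print(entropy(parent))                           # 1.0 (maximally mixed)
print(information_gain(parent, [left, right]))   # 1.0 (all uncertainty removed)
```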
  • 5-5. Classification Tree

    • The tree conditions partition the space of values of X into rectangular blocks.

    • Estimate Y from the properties of the samples in each block.

    • For each candidate partition, select the variable and cut-point that give the best value of one of the measures below.

      • Entropy

      • Misclassification rate

        E_m = 1 - max_k(p̂_mk)
      • Gini index

        G_m = Σ_k p̂_mk (1 - p̂_mk)
    • For a determined Rm,

      p̂_mk = (1 / N_m) Σ_{x_i ∈ R_m} I(y_i = k), the proportion of class k among the N_m samples in region R_m.
      • The estimated category of Y:

        ŷ = argmax_k p̂_mk  (the majority class in R_m)
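
The "blocks" view corresponds directly to the axis-aligned rules a fitted tree prints; a minimal sketch on iris:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier, export_text

X, y = load_iris(return_X_y=True)

# criterion="gini" (or "entropy") picks the split minimizing impurity;
# each leaf then predicts the majority class of its region R_m.
tree = DecisionTreeClassifier(criterion="gini", max_depth=2, random_state=0).fit(X, y)
print(export_text(tree))  # the partition of X written as if/else rules
```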
  • 5-6. Regression Tree

    • Estimated value of Y:

      ŷ_{R_m} = (1 / N_m) Σ_{x_i ∈ R_m} y_i  (the mean of y over region R_m)
    • For a determined Rm,

      RSS_m = Σ_{x_i ∈ R_m} (y_i - ŷ_{R_m})², the quantity minimized when choosing the splits.
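
Since each leaf predicts the mean of y over its region, a depth-d tree outputs at most 2^d distinct values; a sketch on synthetic data (the sine curve is an arbitrary choice):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 5, size=(80, 1)), axis=0)
y = np.sin(X).ravel()

# Each leaf predicts the mean of y over its region R_m; the splits are
# chosen to minimize the residual sum of squares within the regions.
reg = DecisionTreeRegressor(max_depth=2, random_state=0).fit(X, y)
print(len(np.unique(reg.predict(X))))  # at most 2**2 = 4 leaf means
```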

6. Neural Network
