

DATAMINING

Content

Table of contents

Unit 1

Practice 1


Test the Law of Large Numbers for N random normally distributed numbers with mean = 0, stdev = 1.

Create an R script that counts how many of these numbers fall between -1 and 1 and divides that count by the total N. You know that the expected proportion is E(X) = 68.2%. Check that Mean(Xn) -> E(X) as you rerun your script while increasing N.

Hint:

  1. Initialize sample size
  2. Initialize counter
  3. loop for(i in rnorm(size))
  4. Check if the iterated variable falls
  5. Increase counter if the condition is true
  6. return a result <- counter / N

First we initialize the sample size of 100

N <- 100

Then we initialize the counter

C <- 0

Loop for with the sample size and we print i

for(i in rnorm(N, mean = 0, sd = 1)){
    print(i)

We check if the iterated variable falls and we increase the counter if the condition is true

    if (i >= -1 && i <= 1){
        C <- C + 1
    }
}

At last we return the result and we print

result <- C/N

print(result)
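
Putting the steps together, a minimal consolidated version of the script (so it is easy to rerun while increasing N) could look like this:

# Proportion of N(0,1) draws falling in [-1, 1]
lln_check <- function(N) {
  C <- 0
  for (i in rnorm(N, mean = 0, sd = 1)) {
    if (i >= -1 && i <= 1) {
      C <- C + 1
    }
  }
  C / N
}

lln_check(100)      # rough estimate
lln_check(1000000)  # should approach 0.682 as N grows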

Practice 2

Functions

Practice: find 20 more functions in R and give an example of each.

  1. Throws an error if the data contain NA values
na.fail()
  2. Returns, for each element of the first vector, the position of its first match in the second vector
match(x, table)
  3. Calculates the number of combinations of n elements taken k at a time, using the formula n! / [k!(n-k)!]
choose(n, k)
  4. Returns the mean of the elements
mean()
  5. Returns the median of the elements
median()
  6. Reverses the order of the elements
rev()
  7. Sorts the elements of x in ascending order
sort()
  8. Returns the ranks of the elements of x
rank()
  9. Returns a similar object with duplicate elements removed
unique()
  10. Plots the data, with y on the vertical axis against x on the horizontal axis
plot()
  11. Allows names to be assigned to the values in a vector
vector <- c(1,2,3)
names(vector) <- c("uno","dos", "tres")
  12. Searches for a data object by name and returns it
get("vector")
  13. Searches for a data object by name and allows specifying what to return if the object is not found
get0("vector", ifnotfound = "no disponible")
  14. Returns the first value in the vector (from the dplyr package)
first(vector)
  15. Returns the last value in the vector (from the dplyr package)
last(vector)
  16. Creates an R expression and stores it in x1
x1 <- expression(2^3)
x1
  17. Checks the class of the data
class(x1)
  18. Evaluates the saved expression
eval(x1)
  19. Creates a reproducible random sample
set.seed(9562421)
x <- rnorm(1000)
y <- rnorm(1000) + .3 * x
  20. Plots a graph with the density of x
plot(density(x))
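
As a quick illustration combining a few of the functions above (the comments show what R returns for this small example vector):

v <- c(3, 1, 2, 2)
sort(v)            # 1 2 2 3
rev(v)             # 2 2 1 3
unique(v)          # 3 1 2
rank(v)            # 4.0 1.0 2.5 2.5
match(c(2, 3), v)  # 3 1  (positions of 2 and 3 in v)
mean(v)            # 2
choose(4, 2)       # 6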

Practice 3

Scenario: You are a Data Scientist working for a consulting firm. One of your colleagues from the Auditing Department has asked you to help them assess the financial statement of organization X. You have been supplied with two vectors of data: monthly revenue and expenses for the financial year in question. Your task is to calculate the following financial metrics:

  • profit for each month
  • profit after tax for each month (the tax rate is 30%)
  • profit margin for each month - equal to profit after tax divided by revenue
  • good months - where the profit after tax was greater than the mean for the year
  • bad months - where the profit after tax was less than the mean for the year
  • the best month - where the profit after tax was the maximum for the year
  • the worst month - where the profit after tax was the minimum for the year

All results need to be presented as vectors. Results for dollar values need to be calculated with $0.01 precision, but presented in units of $1,000 (i.e. 1k) with no decimal points.

Results for the profit margin ratio need to be presented in units of % with no decimal points.

Note: Your colleague has warned you that it is okay for tax for any given month to be negative (in accounting terms, negative tax translates into a deferred tax asset).

Hint 1 Use:

  • round()
  • mean()
  • max()
  • min()
Data
revenue <- c(14574.49, 7606.46, 8611.41, 9175.41, 8058.65, 8105.44, 11496.28, 9766.09, 10305.32, 14379.96, 10713.97, 15433.50)
expenses <- c(12051.82, 5695.07, 12319.20, 12089.72, 8658.57, 840.20, 3285.73, 5821.12, 6976.93, 16618.61, 10054.37, 3803.96)

Solution

Calculate Profit As The Differences Between Revenue And Expenses

profit <- revenue - expenses
profit

Calculate Tax As 30% Of Profit And Round To 2 Decimal Points

tax <- round(0.30 * profit, 2)
tax

Calculate Profit Remaining After Tax Is Deducted

profit.after.tax <- profit - tax
profit.after.tax

Calculate The Profit Margin As Profit After Tax Over Revenue Round To 2 Decimal Points, Then Multiply By 100 To Get %

profit.margin <- round(profit.after.tax / revenue, 2) * 100
profit.margin

Calculate The Mean Profit After Tax For The 12 Months

mean_pat <- mean(profit.after.tax)
mean_pat

Find The Months With Above-Mean Profit After Tax

good.months <- profit.after.tax > mean_pat
good.months

Bad Months Are The Opposite Of Good Months !

bad.months <- !good.months
bad.months

The Best Month Is Where Profit After Tax Was Equal To The Maximum

best.month <- profit.after.tax == max(profit.after.tax)
best.month

The Worst Month Is Where Profit After Tax Was Equal To The Minimum

worst.month <- profit.after.tax == min(profit.after.tax)
worst.month

Convert All Calculations To Units Of One Thousand Dollars

revenue.1000 <- round(revenue / 1000, 0)
expenses.1000 <- round(expenses / 1000, 0)
profit.1000 <- round(profit / 1000, 0)
profit.after.tax.1000 <- round(profit.after.tax / 1000, 0)

Print Results

revenue.1000
expenses.1000
profit.1000
profit.after.tax.1000
profit.margin
good.months
bad.months
best.month
worst.month

BONUS: Preview Of What's Coming In The Next Section Print The Matrix

M <- rbind(
revenue.1000,
expenses.1000,
profit.1000,
profit.after.tax.1000,
profit.margin,
good.months,
bad.months,
best.month,
worst.month
)

M
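
As a small optional touch (not part of the original requirement), the matrix columns can be labelled with month abbreviations using R's built-in month.abb constant:

colnames(M) <- month.abb
M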

Investigation

Pair Coding

In this small investigation we look at the concept of pair coding; the concept is not usually found under that name, as the literature refers to it as pair programming. In the breakdown of the investigation, the benefits of using this type of programming are presented, as well as its difficulties. -- Gutierrez Luna Yuridia Nayeli Full Version

Pair Coding 2

The method known as pair programming (programación en parejas in Spanish) is mainly used in agile software development and, more specifically, in extreme programming (XP). Pair programming specifies that there are always two people working on the code at the same time and that, as far as possible, they sit together. One is in charge of writing the code and the other of supervising it in real time. At the same time, they are constantly exchanging impressions: they discuss problems, find solutions and develop creative ideas. -- Bermudez Ornelas Alberto Full Version

Evaluative Practice

Country_Code <- c("ABW","AFG","AGO","ALB","ARE","ARG","ARM","ATG","AUS","AUT","AZE","BDI","BEL","BEN","BFA","BGD","BGR","BHR","BHS","BIH","BLR","BLZ","BOL","BRA","BRB","BRN","BTN","BWA","CAF","CAN","CHE","CHL","CHN","CIV","CMR","COG","COL","COM","CPV","CRI","CUB","CYP","CZE","DEU","DJI","DNK","DOM","DZA","ECU","EGY","ERI","ESP","EST","ETH","FIN","FJI","FRA","FSM","GAB","GBR","GEO","GHA","GIN","GMB","GNB","GNQ","GRC","GRD","GTM","GUM","GUY","HKG","HND","HRV","HTI","HUN","IDN","IND","IRL","IRN","IRQ","ISL","ITA","JAM","JOR","JPN","KAZ","KEN","KGZ","KHM","KIR","KOR","KWT","LAO","LBN","LBR","LBY","LCA","LKA","LSO","LTU","LUX","LVA","MAC","MAR","MDA","MDG","MDV","MEX","MKD","MLI","MLT","MMR","MNE","MNG","MOZ","MRT","MUS","MWI","MYS","NAM","NCL","NER","NGA","NIC","NLD","NOR","NPL","NZL","OMN","PAK","PAN","PER","PHL","PNG","POL","PRI","PRT","PRY","PYF","QAT","ROU","RUS","RWA","SAU","SDN","SEN","SGP","SLB","SLE","SLV","SOM","SSD","STP","SUR","SVK","SVN","SWE","SWZ","SYR","TCD","TGO","THA","TJK","TKM","TLS","TON","TTO","TUN","TUR","TZA","UGA","UKR","URY","USA","UZB","VCT","VEN","VIR","VNM","VUT","WSM","YEM","ZAF","COD","ZMB","ZWE")
Life_Expectancy_At_Birth_1960 <- c(65.5693658536586,32.328512195122,32.9848292682927,62.2543658536585,52.2432195121951,65.2155365853659,65.8634634146342,61.7827317073171,70.8170731707317,68.5856097560976,60.836243902439,41.2360487804878,69.7019512195122,37.2782682926829,34.4779024390244,45.8293170731707,69.2475609756098,52.0893658536585,62.7290487804878,60.2762195121951,67.7080975609756,59.9613658536585,42.1183170731707,54.2054634146342,60.7380487804878,62.5003658536585,32.3593658536585,50.5477317073171,36.4826341463415,71.1331707317073,71.3134146341463,57.4582926829268,43.4658048780488,36.8724146341463,41.523756097561,48.5816341463415,56.716756097561,41.4424390243903,48.8564146341463,60.5761951219512,63.9046585365854,69.5939268292683,70.3487804878049,69.3129512195122,44.0212682926829,72.1765853658537,51.8452682926829,46.1351219512195,53.215,48.0137073170732,37.3629024390244,69.1092682926829,67.9059756097561,38.4057073170732,68.819756097561,55.9584878048781,69.8682926829268,57.5865853658537,39.5701219512195,71.1268292682927,63.4318536585366,45.8314634146342,34.8863902439024,32.0422195121951,37.8404390243902,36.7330487804878,68.1639024390244,59.8159268292683,45.5316341463415,61.2263414634146,60.2787317073171,66.9997073170732,46.2883170731707,64.6086585365854,42.1000975609756,68.0031707317073,48.6403170731707,41.1719512195122,69.691756097561,44.945512195122,48.0306829268293,73.4286585365854,69.1239024390244,64.1918292682927,52.6852682926829,67.6660975609756,58.3675853658537,46.3624146341463,56.1280731707317,41.2320243902439,49.2159756097561,53.0013170731707,60.3479512195122,43.2044634146342,63.2801219512195,34.7831707317073,42.6411951219512,57.303756097561,59.7471463414634,46.5107073170732,69.8473170731707,68.4463902439024,69.7868292682927,64.6609268292683,48.4466341463415,61.8127804878049,39.9746829268293,37.2686341463415,57.0656341463415,60.6228048780488,28.2116097560976,67.6017804878049,42.7363902439024,63.7056097560976,48.3688048780488,35.0037073170732,43.4830975609756,58.7452195121951,37.7736341463415,59.4753414634146,46.8803902439024,58.6390243902439,35.5150487804878,37.1829512195122,46.9988292682927,73.3926829268293,73.549756097561,35.1708292682927,71.2365853658537,42.6670731707317,45.2904634146342,60.8817073170732,47.6915853658537,57.8119268292683,38.462243902439,67.6804878048781,68.7196097560976,62.8089268292683,63.7937073170732,56.3570487804878,61.2060731707317,65.6424390243903,66.0552926829268,42.2492926829268,45.6662682926829,48.1876341463415,38.206,65.6598292682927,49.3817073170732,30.3315365853659,49.9479268292683,36.9658780487805,31.6767073170732,50.4513658536585,59.6801219512195,69.9759268292683,68.9780487804878,73.0056097560976,44.2337804878049,52.768243902439,38.0161219512195,40.2728292682927,54.6993170731707,56.1535365853659,54.4586829268293,33.7271219512195,61.3645365853659,62.6575853658537,42.009756097561,45.3844146341463,43.6538780487805,43.9835609756098,68.2995365853659,67.8963902439025,69.7707317073171,58.8855365853659,57.7238780487805,59.2851219512195,63.7302195121951,59.0670243902439,46.4874878048781,49.969512195122,34.3638048780488,49.0362926829268,41.0180487804878,45.1098048780488,51.5424634146342)
Life_Expectancy_At_Birth_2013 <- c(75.3286585365854,60.0282682926829,51.8661707317073,77.537243902439,77.1956341463415,75.9860975609756,74.5613658536585,75.7786585365854,82.1975609756098,80.890243902439,70.6931463414634,56.2516097560976,80.3853658536585,59.3120243902439,58.2406341463415,71.245243902439,74.4658536585366,76.5459512195122,75.0735365853659,76.2769268292683,72.4707317073171,69.9820487804878,67.9134390243903,74.1224390243903,75.3339512195122,78.5466585365854,69.1029268292683,64.3608048780488,49.8798780487805,81.4011219512195,82.7487804878049,81.1979268292683,75.3530243902439,51.2084634146342,55.0418048780488,61.6663902439024,73.8097317073171,62.9321707317073,72.9723658536585,79.2252195121951,79.2563902439025,79.9497804878049,78.2780487804878,81.0439024390244,61.6864634146342,80.3024390243903,73.3199024390244,74.5689512195122,75.648512195122,70.9257804878049,63.1778780487805,82.4268292682927,76.4243902439025,63.4421951219512,80.8317073170732,69.9179268292683,81.9682926829268,68.9733902439024,63.8435853658537,80.9560975609756,74.079512195122,61.1420731707317,58.216487804878,59.9992682926829,54.8384146341464,57.2908292682927,80.6341463414634,73.1935609756098,71.4863902439024,78.872512195122,66.3100243902439,83.8317073170732,72.9428536585366,77.1268292682927,62.4011463414634,75.2682926829268,68.7046097560976,67.6604146341463,81.0439024390244,75.1259756097561,69.4716829268293,83.1170731707317,82.290243902439,73.4689268292683,73.9014146341463,83.3319512195122,70.45,60.9537804878049,70.2024390243902,67.7720487804878,65.7665853658537,81.459756097561,74.462756097561,65.687243902439,80.1288780487805,60.5203902439024,71.6576829268293,74.9127073170732,74.2402926829268,49.3314634146342,74.1634146341464,81.7975609756098,73.9804878048781,80.3391463414634,73.7090487804878,68.811512195122,64.6739024390244,76.6026097560976,76.5326585365854,75.1870487804878,57.5351951219512,80.7463414634146,65.6540975609756,74.7583658536585,69.0618048780488,54.641512195122,62.8027073170732,74.46,61.466,74.567512195122,64.3438780487805,77.1219512195122,60.8281463414634,52.4421463414634,74.514756097561,81.1048780487805,81.4512195121951,69.222,81.4073170731707,76.8410487804878,65.9636829268293,77.4192195121951,74.2838536585366,68.1315609756097,62.4491707317073,76.8487804878049,78.7111951219512,80.3731707317073,72.7991707317073,76.3340731707317,78.4184878048781,74.4634146341463,71.0731707317073,63.3948292682927,74.1776341463415,63.1670487804878,65.878756097561,82.3463414634146,67.7189268292683,50.3631219512195,72.4981463414634,55.0230243902439,55.2209024390244,66.259512195122,70.99,76.2609756097561,80.2780487804878,81.7048780487805,48.9379268292683,74.7157804878049,51.1914878048781,59.1323658536585,74.2469268292683,69.4001707317073,65.4565609756098,67.5223658536585,72.6403414634147,70.3052926829268,73.6463414634147,75.1759512195122,64.2918292682927,57.7676829268293,71.159512195122,76.8361951219512,78.8414634146341,68.2275853658537,72.8108780487805,74.0744146341464,79.6243902439024,75.756487804878,71.669243902439,73.2503902439024,63.583512195122,56.7365853658537,58.2719268292683,59.2373658536585,55.633)

We open the data from another file

We start by reading all the data inside an external file and assigning it to a variable

stats1 <- read.csv(file.choose())
stats1

We split the data by year

We use the split function to separate the years and assign the resulting list to a variable. Double brackets [[ ]] are used to extract each year's data frame from that list.

años <- split(stats1, stats1$Year)
año.1960=años[[1]]

We verify that the data is as expected

año.1960

Creating data frames from the new vectors

We create a data frame to manipulate the data within the vectors adding only the 1960 data

mydf.1960 <- data.frame( Code= Country_Code, Life.Expectancy.1960 = Life_Expectancy_At_Birth_1960)

We verify that the new data frame has the correct data

head(mydf.1960)
mydf.1960

Merging the data frames

We combine the data from the csv file with the 1960 data frame we created, joining on Country.Code (from the csv) and Code (from our data frame)

merged.1960 <- merge(año.1960, mydf.1960,  by.x = "Country.Code", by.y = "Code")

We check that the data frames have been combined correctly by checking the head and the whole body

head(merged.1960)
merged.1960

Calling the library necessary to plot the data

Before starting to graph, we load the library needed to create the plots

library(ggplot2)

Visualizing the data in a plot

We perform the plot to visualize the data as a graph by region

ggplot(merged.1960, aes(x = Fertility.Rate, y = Life.Expectancy.1960, color=Region)) +
  geom_point(aes(color = factor(Region))) + geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

And we do the same but now graphing based on countries

ggplot(merged.1960, aes(x = Fertility.Rate, y = Life.Expectancy.1960, 
color=Country.Name)) +
  geom_point(aes(color = factor(Country.Name))) + 
  geom_smooth(method=lm, se=FALSE, fullrange=TRUE)

2013: We split the data by year

The list of years created earlier is used to generate the new variable for the year 2013

año.2013=años[[2]]
año.2013

Creating a new data frame for the new vectors

A data frame is created to manipulate the data found in vectors

mydf2 <- data.frame( Code= Country_Code, 
                     Life.Expectancy.2013 = Life_Expectancy_At_Birth_2013)

To confirm that the data was assigned, we show the first values of the data frame and the data frame

head(mydf2)
mydf2

The data merged

To combine the 2013 data with the corresponding life-expectancy data, a merge is performed, joining on the columns "Country.Code" and "Code"

merged_2013 <- merge(año.2013, mydf2,  by.x = "Country.Code", by.y = "Code")

To confirm that the data was correctly assigned, we show the first values of the new data frame generated by the merge, and then the whole data frame.

head(merged_2013)
merged_2013

Visualizing the data in plot

To visualize the data in the form of a graph, we create a plot to which we assign the merged data created previously: the Fertility Rate data goes on the x-axis, the Life Expectancy data on the y-axis, and to differentiate the points the color attribute is set to separate them by Region.

qplot(data = merged_2013, x =Fertility.Rate , y = Life.Expectancy.2013,
      color = Region, size=I(3), shape=I(19), alpha =I(.4), 
      main = "Fertility for Life Expectancy group by Region 2013")

We do the same, but now the color attribute separates the points by Country Name

qplot(data = merged_2013, x =Fertility.Rate , y = Life.Expectancy.2013,
      color = Country.Name, size=I(3), shape=I(19), alpha =I(.4), 
      main = "Fertility for Life Expectancy group by Country 2013")

Scatter plot by country

Due to the enormous number of countries from which data are extracted when making this grouping, the legend becomes very long and the plot of this relationship appears small

1960

img

2013

img

Scatter plot by regions

1960: Some important information we can take from the graph is that the countries of the African region have a very high fertility rate but a fairly low life expectancy, with the vast majority hovering between 35 and 50 years. On the other hand, we have the European region, where fertility rates are low but life expectancy is higher, ranging roughly from 65 to 75. What we can say is that the higher the socioeconomic level of the region, the higher the life expectancy will be.

img

The scatter plot lets us visualize the data grouped by the colors of the regions. Analyzing the graph, we can see that the region with the highest life expectancy is Europe, whose situation is very different from the rest; compared with Africa we have the two extremes: in Europe families have one or two children, while in Africa they have 4 to 6 or more. Life expectancy shows the same extremes: Europe sits at 70 years or more, while in Africa it does not exceed 70, apart from perhaps a few scattered outliers. img

Years comparison

The biggest change we can see is that by 2013 the fertility rate of all regions had dropped considerably; Africa is still the highest, but a clear reduction is visible. This can be attributed to the level of education and the socioeconomic status of each region. The constant growth of countries makes societies evolve; we can see this clearly in the 1960 graph, where most of the regions are below 60 years of life expectancy with a fairly high fertility rate.

Unit 2

Practice 1

Functions

Practice: find 5 more ggplot functions in R and give an example of each.

Geom bar

This geometry generates a bar chart; on the x-axis we select how the data will be grouped

  1. In order to use this type of plot you need to import the ggplot library
library(ggplot2)
  2. We generate the plot with simple attributes: to ggplot we pass the data frame called año.1960 as the data parameter, and to it we add the geom_bar geometry; in the aesthetics, the x-axis is assigned Region, which we take as our independent variable.
ggplot(data=año.1960) + geom_bar(aes(x=Region))

image

Geom Polygon

This geometry generates a plot of a geometric shape from the data provided; in the following example we show the mapping of a region using latitude and longitude data

  1. In order to use this type of plot you need to import the ggplot library
library(ggplot2)
  2. We generate the plot in black and white, with the axes given by longitude and latitude (see the note after this list for where nz comes from)
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black")
  3. Next we add a map projection so that the plot has an aspect ratio suitable for the area being mapped. coord_quickmap() is a quick approximation that keeps straight lines straight
ggplot(nz, aes(long, lat, group = group)) +
  geom_polygon(fill = "white", colour = "black") +
  coord_quickmap()
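
Note that nz is not defined in the snippet above; presumably it comes from ggplot2's map_data() helper, which relies on the maps package being installed. A minimal sketch under that assumption:

# install.packages("maps")   # map_data() relies on the maps package
library(ggplot2)
nz <- map_data("nz")
head(nz)   # columns: long, lat, group, order, region, subregion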

image

Geom Boxplot

The box plot is used with numerical variables since it shows the median as well as the quartiles and outliers. It creates boxes in which the data can be grouped, colors them, and shows the distribution of the data

  1. In order to use this type of plot you need to import the ggplot library
library(ggplot2)
  2. We create a simple graph where we specify the dataset and add the geom_boxplot geometry, which lets us visualize the data in the form of boxes; to represent the data as required we use aes with the corresponding factor, in this case x is am and y is mpg
ggplot(data = mtcars) + geom_boxplot(aes(x=factor(am), y=mpg))

image

Facet Grid

One way to add additional variables is with aesthetics. Another way, particularly useful for categorical variables, is to divide the graph into facets, that is, sub-graphs each showing a subset of the data

  1. In order to use this type of plot you need to import the ggplot library
library(ggplot2)
  2. ggplot where we specify the data, aes tells how we want to visualize the data, and geom_point defines the geometry type
ggplot(mtcars, aes(mpg, qsec)) + geom_point(aes(size = hp), alpha = 0.4)
  3. To separate the graph into facets according to the combinations of two variables, facet_grid() is added to the code, separating the two factors with a ~ (a combined sketch follows this list)
+ facet_grid(factor(cyl)~factor(am))
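
For clarity, the two pieces above combined into one call would look like this (a sketch using the same mtcars columns):

library(ggplot2)
ggplot(mtcars, aes(mpg, qsec)) +
  geom_point(aes(size = hp), alpha = 0.4) +
  facet_grid(factor(cyl) ~ factor(am))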

image

Geom Violin

Violin charts allow you to visualize the distribution of a numeric variable for one or several groups. A violin is very close to a boxplot, but allows a deeper understanding of the density. Violins are especially suited when the amount of data is huge and it is impossible to show individual observations. Violin graphics are a very convenient way to display data and probably deserve more attention than box plots, which can sometimes hide characteristics of the data

  1. In order to use this type of plot you need to import the ggplot library
library(ggplot2)
  2. We load the data frame, select the axes of the graph and its fill, add the violin geometry with a transparency of 0.6, and modify the plot labels with labs
  ggplot(iris, aes(x=Species, y=Sepal.Width, fill=Species)) +
    geom_violin(alpha=0.6) +
    labs(title="Iris", 
         subtitle="Distribución del ancho del sépalo por especie", 
         caption="Fuente: Edgar Anderson's Iris Data", 
         y="Ancho del sépalo", 
         x="Especie",
         color=NULL) +
    theme_elegante()  # theme_elegante() is a custom theme from an external source, not part of ggplot2

image

Investigation

Grammar of Graphics

In this research we talk about the structure that any graph must have, compared grammatically with a sentence, and we also find tips on how to write such graphs. Composition schemes that relate to some geometric perceptual characteristic, and how to take advantage of them, are also discussed.

- Bermudez Ornelas Alberto

Full Version

Grammar of Graphics 2

In this research task about the grammar of graphics in R with ggplot2, we will see which attributes are taken into account to plot the data, so we can differentiate between data series and obtain information from our data. It is interesting to know that two different types of graphs can be drawn within the same plot, because the data is separated; to do this, different geoms are used.

- Gutierrez Luna Yuridia Nayeli

Full Version

Evaluative Practice

Develop the following problem with R and RStudio, performing the knowledge extraction that the problem requires. The directors of the movie review website are very happy with their previous installment and now they have a new requirement for you. The previous consultant had created a chart for them, which is illustrated in the image below.

image

However, the R code used to create the graph has been lost and cannot be recovered.

Your task is to create the code that will recreate the same chart, making it look as close to the original as possible.

Introduction

In this evaluative practice the above is requested, so in order to comply with the request the data is loaded into a data frame, the data is filtered in two steps, and then the plot is built up piece by piece in order to keep the changes in order. Let's walk through the explanation of our code.

Code

We open the corresponding path where the file with the data is saved. getwd() lets us see the current path in the console, and with setwd() we define the path in which we will be working.

getwd()
setwd("C:/Users/yurid/Documents/DataMining/DATAMINING/Unit_2/Evaluative_Practice")
getwd()

The data is loaded into a data frame for manipulation. We specify the name of the file where we will get our data and add it to a variable.

dataset <- read.csv('Project-Data.csv')

Second option to load the data without having to modify the code. This helps us a lot in case we don't want to modify the original code, and since we are working in pairs it allows each of us to choose the data file from wherever it is stored.

dataset <- read.csv(file.choose())

The data frame is displayed for observation and analysis. With summary() we can see a statistical summary of each column of the dataset.

summary(dataset)

The libraries needed for data manipulation are loaded. ggplot2 allows us to draw our plots and adds many functions we can work with.

library(ggplot2)

For filtering, the dplyr library was used. It is necessary to install this library in order to use it later. This library adds the functions needed for filtering information, used here in conjunction with ggplot2.

# install.packages("dplyr")
library(dplyr)

To change the font we need, we install the extrafont library

# install.packages("extrafont")

The font library is loaded

library(extrafont)

Then we import all the fonts into our system; this may change depending on your OS, and it only needs to be done once

font_import()

We execute the command that lists the fonts so that R can recognize them. This is important because without this line of code the library may not work

fonts()

A new data frame is created which takes the data from "dataset", filtering it to keep only the requested genres. The dplyr filter() function is used together with the %in% operator, which makes this filtering possible

GenreF <- filter(dataset, Genre %in% c("action", "adventure",
                                       "animation", "comedy", "drama"))

From the previous data frame, a new one is created to filter the information to only the requested studios; since it starts from the already-filtered data frame, the result satisfies both filters.

StudioF <- filter(GenreF, Studio %in% c("Buena Vista Studios",
                                        "Fox", "Paramount Pictures",
                                        "Sony", "Universal", "WB"))

A variable is created in which we load the base plot with the data it will contain for the x and y axes, which will be common to the whole plot. We state that the data used will be from StudioF and which columns map to the axes.

u <- ggplot(StudioF, aes(x = Genre, y = Gross...US))

We add the jitter geometry for the studios. Here we say that we want to color by the Studio factor and size the points by the budget

j <- u + geom_jitter(aes(color = Studio, size = Budget...mill.)) +

As the data looked too crowded, we used this function to rescale the point sizes and visualize the data in a better way

  scale_size_continuous(range = c(2, 5),
                        trans = scales::exp_trans(base = 1.2))

Viewing our plot

j

We add a boxplot to group by genre and gross, giving it medium transparency and removing the redundant outlier points from the graph

g <- j + geom_boxplot(alpha = 0.2, outlier.colour = NA)

Viewing our plot

g

We place the title of our plot

t <- g + ggtitle("Domestic Gross% by Genre")

Viewing our plot

t

We put the name of the X and Y axes

e <- t + xlab("Genre") + ylab("Gross% US")

Viewing our plot

e

We add the theme for the labels. The axis titles of X and Y will be purple with a size of 15; the plot title will have a size of 25 and will be centered with hjust = 0.5; all the text within the graph will use the font "Comic Sans MS"; and finally we rename the size legend label to "Budget"

th <- e + theme(axis.title.x = element_text(color = "Purple", size = 15),
                axis.title.y = element_text(color = "Purple", size = 15),
                plot.title = element_text(size = 25, hjust = 0.5),
                text = element_text(family = "Comic Sans MS")
) + labs(size = "Budget")

Viewing our plot

th

Final plot

img

Conclusion

It is important to know the grammar of the graphs in order to be able to manipulate them and to be able to make the plotted data become important information for the end user who views it.

The way in which the data was displayed is quite interesting, the use of colors to make it more descriptive for the end user; the use of jitter with the geom box to observe the data in a friendly way.

Regarding the graph, we can see that most of the movies are action films, because it is one of the safest genres to invest in; we can see that there are different budgets and, even so, all are around an average of 40 million in gross income. The other option that seems safe to invest in is animation and, speaking more specifically of "Buena Vista Studios", we can see that they have a good gross income within the box; even with the WB studio we can see that they have good results in terms of income.

On the other hand, drama and adventure are genres in which it is risky to make an investment, because they have a fairly marked split between acceptance and rejection, since a film may or may not please; these genres are therefore a wheel of fortune in which, of course, other kinds of variables tied to the end user's experience also determine success or rejection.

Unit 3


Practice 1

BackwardElimination

Analyze the following "backwardElimination" function

Code

The function is named and two parameters (x and sl) are assigned to it.

backwardElimination <- function (x, sl) {

The variable numVars is added and its value will be the length of x, that is, the number of columns in the dataset. x represents the dataset.

  numVars = length (x)

The for loop says that for each column index from 1 to numVars the following operations will be performed.

  for (i in c (1: numVars)) {

We start by initializing the regressor variable, which will contain the fitted linear regression model together with its formula; the formula says that the Profit column is modeled on all the other columns, and the data will be taken from x.

    regressor = lm (formula = Profit ~., data = x)

In turn, the variable maxVar is created, which will hold the maximum p-value among the model coefficients reported by summary(regressor): it takes rows 2 to numVars of the coefficient matrix (skipping the intercept) and the "Pr(>|t|)" column.

    maxVar = max(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"])

Once this value is obtained, an if checks whether the maximum p-value is greater than sl (our parameter, which represents the significance level we are looking for); if so, the index of the coefficient whose p-value equals maxVar is found and saved in j.

    if (maxVar > sl) {
      j = which(coef(summary(regressor))[c(2:numVars), "Pr(>|t|)"] == maxVar)

Then we remove column j from our dataset x.

      x = x [, -j]
    }

Finally, we subtract one unit from numVars (the number of remaining columns) so that the loop can continue, and once the loop finishes we return the summary of the final model, which keeps only the columns with the desired p-value.

    numVars = numVars - 1
  }
  return (summary (regressor))
}
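
A hypothetical usage sketch (assuming a dataset like the 50_Startups.csv used in a later practice, which has a Profit column, and a significance level of 0.05):

# the categorical State column is encoded as a factor first, as in the later practice
dataset <- read.csv('50_Startups.csv')
dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1, 2, 3))
backwardElimination(dataset, 0.05)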

Conclusion

With this small piece of code we can search more efficiently for the best predictors for our model. It is probably faster to simply remove the variables we see by hand, but when we are talking about thousands of factors within a database it becomes very exhausting to do it one by one. For this reason this function is extremely useful, as it will save us hours of work.

Practice 1 -1

Simple Linear Regression

In this practice we will explain how to build and display a simple linear regression.

Code

We load the data into our dataset variable

getwd()
setwd("C:/Users/yurid/Documents/RepoDataMining/DataMining/MachineLearning/SimpleLinearRegression")
getwd()

dataset <- read.csv('Salary_Data.csv')

We import the caTools library and set the seed to create reproducible random numbers; sample.split marks the values as booleans, and we create our training and test variables: ⅔ of the values will go to training and the remaining ⅓ to test

library(caTools)
set.seed(123)
split <- sample.split(dataset$Salary, SplitRatio = 2/3)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

We create a variable where we load the regression fitted on our dataset, modeling Salary as a function of YearsExperience, and we print a summary to inspect the model

regressor = lm(formula = Salary ~ YearsExperience,
               data = dataset)
summary(regressor)

We assign to the variable y_pred the predictions obtained by applying the model trained on the training data to the new test data

y_pred = predict(regressor, newdata = test_set)
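
As a small optional check, the predictions can be placed side by side with the actual test-set salaries:

comparison <- data.frame(Actual = test_set$Salary, Predicted = y_pred)
comparison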

We import the ggplot2 library, plot the points of our training data from the YearsExperience and Salary columns in red, draw the prediction line in blue (obtained from the training data), and finally we title the plot and name the axes.

library(ggplot2)
ggplot() +
  geom_point(aes(x=training_set$YearsExperience, y=training_set$Salary),
             color = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Salary vs Experience (Training Set)') +
  xlab('Years of experience') +
  ylab('Salary')

We make the same plot, with the only change being that the points are now taken from the ⅓ of the data that was left as the test set.

ggplot() +
  geom_point(aes(x=test_set$YearsExperience, y=test_set$Salary),
             color = 'red') +
  geom_line(aes(x = training_set$YearsExperience, y = predict(regressor, newdata = training_set)),
            color = 'blue') +
  ggtitle('Salary vs Experience (Test Set)') +
  xlab('Years of experience') +
  ylab('Salary')

Conclusion

This kind of practical homework helps us (in particular) to review everything covered in class, since sometimes we are not at 100% or we get distracted. Prediction is useful for many things, and having tools that make this kind of task easier helps a lot. We can see that the trend line follows the data in the right direction.

Practice 2

Logistic Regression

In this practice will explain the data visualization process for logistic regression

Code

The data is loaded into a variable; then only columns 3 to 5 of those values are kept

getwd()
setwd("C:/Users/yurid/Documents/RepoDataMining/DataMining/MachineLearning/LogisticRegression")
getwd()

dataset <- read.csv('Social_Network_Ads.csv')
dataset <- dataset[, 3:5]

The caTools library that we will use is imported; this library helps us use various basic utility functions without rounding errors.

a. We set the seed that will be used
b. We do a split to mark our values in boolean format: 75% is assigned to training and 25% to test data
c. From the split variable created previously, the dataset is filtered so that each set only includes the corresponding data

library(caTools)
set.seed(123)
split <- sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

For our training and test data we scale only columns 1 and 2, so they are easier to interpret later

training_set[, 1:2] <- scale(training_set[, 1:2])
test_set[, 1:2] <- scale(test_set[, 1:2])

We create a variable where we fit the regression: in the formula we give it our Purchased column as the response, the family is binomial, and we pass the training data

classifier = glm(formula = Purchased ~ .,
                 family = binomial,
                 data = training_set)

We create a variable and assign it the prediction of our model fitted above, which we print; then each probability is converted with an ifelse to 1 if it is greater than 0.5 and 0 otherwise, and we print the result

prob_pred = predict(classifier, type = 'response', newdata = test_set[-3])
prob_pred
y_pred = ifelse(prob_pred > 0.5, 1, 0)
y_pred

We create a variable in which we place the confusion matrix comparing the test-set labels with the prediction data; printed, it shows an accuracy of about 83%

cm = table(test_set[, 3], y_pred)
cm
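
The reported accuracy can be computed directly from the confusion matrix: correct predictions sit on the diagonal, so dividing their sum by the total gives the proportion of correct classifications (around the 83% mentioned above):

accuracy <- sum(diag(cm)) / sum(cm)
accuracy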

We load the ggplot2 library to plot our data frames, both training and test

a. The first plot is done with the Estimated Salary and Purchased data
b. The second plot is done with Age and Purchased

library(ggplot2)
ggplot(training_set, aes(x=EstimatedSalary, y=Purchased)) + geom_point() + 
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
 
ggplot(training_set, aes(x=Age, y=Purchased)) + geom_point() + 
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)

A plot is created with the values of the test dataset, taking the data from EstimatedSalary and Purchased

a. A plot is created with the test dataset values, taking Age and Purchased

ggplot(test_set, aes(x=EstimatedSalary, y=Purchased)) + geom_point() + 
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)
 
ggplot(test_set, aes(x=Age, y=Purchased)) + geom_point() + 
  stat_smooth(method="glm", method.args=list(family="binomial"), se=FALSE)

We import the ElemStatLearn library; this library contains datasets and functions that accompany The Elements of Statistical Learning and are useful for learning data mining

library(ElemStatLearn)

We create a variable where we will load the training data, and we create two variables (X1 and X2) that go from the minimum minus 1 to the maximum plus 1 of the first and second columns, in steps of 0.01.

a. We create a new dataset (grid_set) and assign it all the combinations of our previous variables
b. We set the header names that our data frame will have
c. We create a variable where we save the prediction of our classifier on the new grid_set dataset
d. In a new variable we convert the predicted probabilities with an ifelse: values greater than 0.5 become 1, otherwise 0
e. We plot the dataset columns we created earlier, give the plot a title, name the axes, and set the value ranges for each axis
f. contour() creates a line dividing our two regions, using the matrix built from the 0.5 threshold over the lengths of X1 and X2
g. With points() we differentiate our values: the first call colors the grid combinations from columns 1 and 2, and the second colors the original training data according to column 3

set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
     main = 'Logistic Regression (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

For the test data it is the same; we only change the dataset that provides the data and generate everything else in the same way

set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
prob_set = predict(classifier, type = 'response', newdata = grid_set)
y_grid = ifelse(prob_set > 0.5, 1, 0)
plot(set[, -3],
     main = 'Logistic Regression (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Conclusion

I liked learning more about groupings and the types that exist; even more important, with this new professor the material has been more approachable than with those who previously taught its fundamentals, so this subject and its functions become interesting instead of tedious

Practice 3

Multiple Linear Regression

In this practice will explain the data visualization process for Multiple linear regression

Code

We make the location of our workspace and load the data in our dataset

getwd()
setwd("C:/Users/yurid/Documents/RepoDataMining/DataMining/MachineLearning/MultipleLinearRegression")
getwd()
 
# Importing the dataset
dataset <- read.csv('50_Startups.csv')

We transform the State column from strings to a factor with numeric labels and show our dataset

dataset$State = factor(dataset$State,
                       levels = c('New York', 'California', 'Florida'),
                       labels = c(1,2,3))
dataset

We import the caTools library and set the seed, transform the split flags to boolean, and separate 80% of the data for training and 20% for testing

library(caTools)
set.seed(123)
split <- sample.split(dataset$Profit, SplitRatio = 0.8)
training_set <- subset(dataset, split == TRUE)
test_set <- subset(dataset, split == FALSE)

The multiple linear regression model is fitted on our training data, and we get the summary to see the contents of the variable to which we assign the fitted model

regressor = lm(formula = Profit ~ .,
               data = training_set )
 
summary(regressor)

We make the prediction of the test dataset with the regressor method and we print

y_pred = predict(regressor, newdata = test_set)
y_pred

We perform multiple linear regression following the backward elimination model. The first fit shows all the columns of the original dataset; in the second, the State column is removed, and the summary is printed to see the p-values of the coefficients

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = dataset )
summary(regressor)
 
regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend,
               data = dataset )
summary(regressor)

Because we are using an elimination model, further columns are removed one at a time to redo the regression, and the summary of each reduced model is printed

regressor = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = dataset )
summary(regressor)
 
regressor = lm(formula = Profit ~ R.D.Spend,
               data = dataset )
summary(regressor)

We make the prediction of the regressor variable made with the elimination model and we print

y_pred = predict(regressor, newdata = test_set)
y_pred

Conclusion

Analyzing the data, we realized that the points are very close to the trend line; this means that there is a relationship between these variables, and analyzing the coefficient of determination R² confirms that the analysis is correct.

Practice 4

Decision Tree

Decision trees are used day to day, even though we may often think otherwise. In this practice we will explain, by means of code, how decisions are made within the trees and what their visualization looks like

Code

We import the data, assign it to the dataset, and filter the columns that we will use; in this case they are columns 3 to 5

getwd()
setwd("/home/chris/Documents/itt/Enero_Junio_2020/Mineria_de_datos/DataMining/MachineLearning/DesicionThree")
getwd()
# Importing the dataset
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]

We transform the Purchased column to a factor, assigning binary values to make the data easier to handle

dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

We import the caTools library, set the seed, and create our split taking the data from Purchased, assigning 75% of the data to training and 25% to test

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

We scale the columns that are not the target (everything except column 3) in both sets

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])

We import the rpart library, which lets us build the tree model that produces the decision boundaries; we create our classifier variable, tell it what we want to predict, and give it the data it will learn from

library(rpart)
classifier = rpart(formula = Purchased ~ .,
                   data = training_set)

We then create the variable y_pred, in which we make the prediction with the fitted model on the new data (the test set without column 3), asking for the result as a class

y_pred = predict(classifier, newdata = test_set[-3], type = 'class')
y_pred

We assign a table to our confusion matrix variable, which compares the values of column 3 with the prediction variable, and we print the variable cm

cm = table(test_set[, 3], y_pred)
cm

We import ElemStatLearn to display the grouping produced by the tree model. We assign the training data to set, then define the grid for our groups marking the min and max of each column, assign the names of the columns on which the decision is made, and add the prediction so that this data can be plotted later. We give plot() the columns of set we want it to take, place a title on the plot and names on its axes; with contour() we draw the decision boundary, the first points() call colors the regions of the space separated by that boundary, and the last points() call colors the data points according to their class in column 3 of set

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3],
     main = 'Decision Tree Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

We do the same for the test data: we import ElemStatLearn to display the grouping produced by the tree model, assign the test data to set, define the grid marking the min and max, assign the names of the columns on which the decision is made, and add the prediction in order to plot this data. We give plot() the columns of set we want it to take, put a title on the plot and names on its axes; with contour() we draw the decision boundary, the first points() call colors the regions of the space separated by that boundary, and the last one colors the points according to their class in column 3 of set

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set, type = 'class')
plot(set[, -3], main = 'Decision Tree Classification (Test set)',
     xlab = 'Age' , ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

These two lines display the decisions made by the tree to perform the grouping, taking into account our classifier variable, which contains the model and the training data.

plot(classifier)
text(classifier, cex=0.6)

Conclusion

This practice shows the multiple ways there are to visualize data when grouping is needed, and lets us see the flow of decisions made by the tree, so that automatic groupings can be performed and, if needed, a desk check (with few leaves) can be done by hand.

Practice 5

Random Forest Classification

In this practice, an explanation of the code will be made to visualize the data classified with the random forest model

Code

To get started with this visualization you need the ElemStatLearn package. Note: this package has been archived for newer versions of R, so a different installation method may be needed

install.packages('ElemStatLearn')
library(ElemStatLearn)
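
Note that the classifier used in the predictions below is not created in this snippet; presumably it was built beforehand with the randomForest package, along these lines (a sketch, assuming the same training_set used in the previous practices):

# install.packages('randomForest')
library(randomForest)
set.seed(123)
classifier = randomForest(x = training_set[-3],
                          y = training_set$Purchased,
                          ntree = 10)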

we grab the training set and add it to a variable

set = training_set

Next, the ranges of the sequences for our filtered data set are defined: they are delimited using the min and max functions (extended by 1 on each side), the step of the sequences is set to 0.01, and they are saved in the variables X1 and X2, which lets us define the background of our graph.

X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)

expand.grid() lets us build every combination of the variables provided; it creates a grid data frame covering all these combinations of the two sequences

grid_set = expand.grid(X1, X2)

to the frame of combinations that we just created we add a column with their respective names

colnames(grid_set) = c('Age', 'EstimatedSalary')

a prediction is made with our data classifier and our already delimited model background

y_grid = predict(classifier, grid_set)

With the plot function we generate a plot of the data in our set data frame without column 3. We add a title and axis labels, and set the axis limits to the ranges of X1 and X2 created previously

plot(set[, -3],
     main = 'Random Forest Classification (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))

contour() adds a line to the existing graph, which will be our division between green and red, using the numeric matrix of our predictions from the data classification

contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)

With grid_set we take all the points of the background grid so we can assign a color to each part of the space

points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))

And with this line we color the points of our data set using the ifelse function

points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

In the following code exactly the same is done as before, but with the difference that now the test set is used instead of the training set

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, grid_set)
plot(set[, -3], main = 'Random Forest Classification (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Conclusion

This type of procedure is very effective for a large enough data set, since it manages to classify more precisely; it is also much more efficient at handling this amount of data compared to the KNN method, so much so that, even without timing the code, a difference in the display time of the data was noticeable. Finally, this algorithm can maintain its precision even when a large proportion of the data is missing.

Practice 6

KNN

In this practice, an explanation of the code will be made to visualize the data using the Knn function.

Code

To get started with this visualization you need the ElemStatLearn package. Note: this package has been archived for newer versions of R, so a different installation method may be needed

install.packages('ElemStatLearn')
library(ElemStatLearn)

we grab the training set and add it to a variable

set = training_set

Next, the ranges of the sequences for our filtered data set are defined: they are delimited using the min and max functions (extended by 1 on each side), the step of the sequences is set to 0.01, and they are saved in the variables X1 and X2, which lets us define the background of our graph.

X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)

expand.grid() lets us build every combination of the variables provided; it creates a grid data frame covering all these combinations of the two sequences

grid_set = expand.grid(X1, X2)

to the frame of combinations that we just created we add a column with their respective names

colnames(grid_set) = c('Age', 'EstimatedSalary')

What this function does is, for each row of our grid data frame, find the k nearest vectors in the training set and classify the row by majority vote. cl gives the factor of true classifications (the labels of our training set), and with k we set the number of neighbors that are considered

library(class)  # knn() comes from the class package
y_grid = knn(train = training_set[, -3],
             test = grid_set,
             cl = training_set[, 3],
             k = 5)

With the plot function we generate a plot of the data in our set data frame without column 3. We add a title and axis labels, and set the axis limits to the ranges of X1 and X2 created previously

plot(set[, -3],
     main = 'K-NN Classifier (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))

contour() adds a line to the existing graph, which will be our division between green and red

contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)

With grid_set we pull all the limits and ranges of the background to be able to assign a color to their space

points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))

And with this line we color the points of our data set using the ifelse function

points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

In the following code exactly the same is done as before, but with the difference that now the test set is used instead of the training set

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = knn(train = training_set[, -3],
             test = grid_set,
             cl = training_set[, 3],
             k = 5)
plot(set[, -3],
     main = 'K-NN Classifier (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))
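
The conclusion below is based on how the two plots look; a quick numeric check of the same idea is sketched here (it is not part of the original script and assumes the same training_set and test_set objects created earlier): classify the test observations directly and summarize the result in a confusion matrix.

# Sketch: predict the test set itself (instead of the background grid)
# and measure how many observations land in the correct class.
library(class)
y_pred = knn(train = training_set[, -3],
             test = test_set[, -3],
             cl = training_set[, 3],
             k = 5)
cm = table(test_set[, 3], y_pred)   # rows: true class, columns: prediction
cm
sum(diag(cm)) / sum(cm)             # overall accuracy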

Conclusion In my opinion it is very interesting to see how the different functions are combined to build this complex data visualization. As you can see, the training and test plots are quite similar; the resemblance is clearest in the background grid and in the contour that was drawn. The points do not have the same concentration of data, but their positions within the test plot are similar, which suggests that the model generalizes correctly.

Practice 7

SVM

In this practice we explain the code used to visualize the data with the SVM classifier.

Code

To get started with this visualization you need the ElemStatLearn package. Note: this library is archived, so on newer versions of R it has to be installed in a different way (see the workaround sketched in Practice 6).

install.packages('ElemStatLearn')
library(ElemStatLearn)

We grab the training set and assign it to a variable.

set = training_set

Next we build the grid that will form the background of the plot. For each of the two feature columns we create a sequence from slightly below its minimum to slightly above its maximum (using the min and max functions, extended by 1 on each side) in steps of 0.01, and save the sequences in the variables X1 and X2.

X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)

expand.grid builds a data frame with every combination of the X1 and X2 values, that is, every point of the background grid.

grid_set = expand.grid(X1, X2)

We then give the two columns of the grid we just created the same names as the features in the training set.

colnames(grid_set) = c('Age', 'EstimatedSalary')

A prediction is made with the fitted classifier for every point of the background grid we just built.

y_grid = predict(classifier, newdata = grid_set)

With the plot function we plot the data of our data frame set without column 3 (the class label). We add a title and axis labels and set the axis limits to the ranges of X1 and X2 created earlier.

plot(set[, -3],
     main = 'SVM (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))

The contour function adds a line to the existing plot, built from the matrix of numeric predictions; this line will be our division between the green and red regions.

contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)

With grid_set we go over every point of the background grid and color it according to the class that the classifier predicted for it.

points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))

And with this line we color the points of our data set using the ifelse function

points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

What differentiates this code from the others is that an SVM is used to classify the data, as you can see below. The svm function can be used for classification, regression and density estimation; here we use C-classification with a linear kernel. Note that this classifier has to be fitted before the prediction step shown above can run.

library(e1071)  # svm() is provided by the e1071 package
classifier = svm(formula = Purchased ~ .,
                 data = training_set,
                 type = 'C-classification',
                 kernel = 'linear')
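
As a hedged extra (not shown in the script above, but using only the objects it already defines), the fitted SVM can also be evaluated on the held-out test set with a confusion matrix:

# Sketch: predict the test set with the fitted SVM and compare the
# predictions against the true labels in column 3.
y_pred = predict(classifier, newdata = test_set[-3])
cm = table(test_set[, 3], y_pred)   # confusion matrix
cm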

In the following code exactly the same is done as before, except that now the test set is used instead of the training set.

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'SVM (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Conclusion

In conclusion, this type of method works very well when there are two classes (1 and 2) because it maximizes the separation, or margin, between them. It also takes the distance between the boundaries of each class into account, which can lead to a better interpretation of the data.

Homework

Machine Learning

This homework consists of three questions about the main topic and the analysis of a visual representation of simple linear regression.

- Gutierrez Luna Yuridia Nayeli

Full Version

Machine Learning

In this Homework, some basic questions about linear regression and machine learning were answered and a comparison of the graphs was made.

- Bermudez Ornelas Alberto

Full Version

Evaluative Practice

Introduction

In this evaluative practice we will implement classification with the Naive Bayes model. This documentation is written both to serve as a record and to demonstrate the knowledge acquired from our professor.

Code

As we have seen in previous practices, the data is loaded; in this case only columns 3 to 5 are used, and the Purchased column is converted into a factor with the levels 0 and 1.

getwd()
setwd("C:/Users/yurid/OneDrive/Documentos/Escuela/2_Mineria de datos/Unit_3")
getwd()
 
dataset = read.csv('Social_Network_Ads.csv')
dataset = dataset[3:5]
 
dataset$Purchased = factor(dataset$Purchased, levels = c(0, 1))

The caTools library is imported, a random seed is set and the data is split on the Purchased column, assigning 75% of the observations to the training set and the rest to the test set.

library(caTools)
set.seed(123)
split = sample.split(dataset$Purchased, SplitRatio = 0.75)
training_set = subset(dataset, split == TRUE)
test_set = subset(dataset, split == FALSE)

The feature columns (every column except the class label in column 3) are scaled in both the training and the test set.

training_set[-3] = scale(training_set[-3])
test_set[-3] = scale(test_set[-3])
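
A small optional sanity check (a sketch, assuming the two assignments above ran without error): after scaling, each feature column should have a mean of roughly 0 and a standard deviation of roughly 1.

# The scaled feature columns (everything except the label in column 3)
# should now be centered near 0 with standard deviation near 1.
colMeans(training_set[-3])
apply(training_set[-3], 2, sd)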

We fit a Naive Bayes classifier that predicts Purchased from all the other columns of the training set (the type and kernel arguments in the original call belong to svm and are not used by naiveBayes, so they are dropped here). We then predict the classes of the test set, excluding the label column, and finally build the confusion matrix with the table function, comparing column 3 of the test set against the predictions.

library(e1071)
# type and kernel are svm() arguments; naiveBayes() only needs the formula and data
classifier = naiveBayes(formula = Purchased ~ .,
                        data = training_set)
 
y_pred = predict(classifier, newdata = test_set[-3])
y_pred
 
cm = table(test_set[, 3], y_pred)
cm
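
The accuracy quoted later in the interpretation of the results can be read off this confusion matrix; a short sketch of that calculation, using the cm object just created:

# Proportion of test observations on the diagonal of the confusion
# matrix, i.e. the classification accuracy.
accuracy = sum(diag(cm)) / sum(cm)
accuracy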

We import ElemStatLearn for the visualization of the Naive Bayes classification. We assign the training data to set, build the background grid from the min and max of each feature, name its columns and predict a class for every grid point. We then plot the two feature columns of set, give the plot a title and axis names, draw the decision boundary with contour, color the background regions with the first points call, and finally color the observed points according to their true class with the second points call.

library(ElemStatLearn)
set = training_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3],
     main = 'Naive Bayes (Training set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Exactly the same is done for the test set: we assign the test data to set, rebuild the background grid, predict a class for every grid point, plot the points, draw the decision boundary with contour, color the background regions and color the observed points according to their true class.

library(ElemStatLearn)
set = test_set
X1 = seq(min(set[, 1]) - 1, max(set[, 1]) + 1, by = 0.01)
X2 = seq(min(set[, 2]) - 1, max(set[, 2]) + 1, by = 0.01)
grid_set = expand.grid(X1, X2)
colnames(grid_set) = c('Age', 'EstimatedSalary')
y_grid = predict(classifier, newdata = grid_set)
plot(set[, -3], main = 'Naive Bayes (Test set)',
     xlab = 'Age', ylab = 'Estimated Salary',
     xlim = range(X1), ylim = range(X2))
contour(X1, X2, matrix(as.numeric(y_grid), length(X1), length(X2)), add = TRUE)
points(grid_set, pch = '.', col = ifelse(y_grid == 1, 'springgreen3', 'tomato'))
points(set, pch = 21, bg = ifelse(set[, 3] == 1, 'green4', 'red3'))

Data visualization

(image: Naive Bayes (Training set) plot)

(image: Naive Bayes (Test set) plot)

Interpretation of the results

We obtained a classification accuracy of 86%, which means that the remaining 14% corresponds to red or green points that were not assigned to the correct group.

Another point worth noting is that, from a business point of view, the customers we want to target to sell a product are those in the green area, so an advertising campaign could be aimed at that population.

Conclusion

It was honestly very interesting to realize, while writing this documentation, that the problem is almost identical to the previous ones we have been working on; in most cases only the model or the classification method changes, but in essence the scripts are structured in a similar way. This makes them easy to learn and understand, since you do not have to memorize much code, only the elementary parts. It is important to emphasize that each model has a different probability of error, depending on the percentage and amount of data assigned to training.

We can say that the naiveBayes function is an elegant way to implement this algorithm, and it also helps us a lot to visualize the data in a consistent way.

Unit 4

Introduction

In this evaluative practice we present a data visualization using the k-means algorithm, which groups the observations into clusters with similar features so that we can compare the data.

Code

The data is loaded. As we have already explained before, to avoid path errors while programming in pairs we read the file with choose.files(), which lets each of us pick the file interactively.

getwd()
setwd("C:/Users/yurid/Documents/DataMining/DATAMINING/Unit_4")
getwd()
 
dataset = read.csv(choose.files())

Once we have read the data from our csv into the variable dataset, we keep only columns 1 to 4, since k-means does not work with non-numeric data such as the species name.

dataset = read.csv('iris.csv')
dataset = dataset[1:4]

We start by setting a random seed and then compute the WCSS (within-cluster sum of squares), which adds up the squared distance from each point to the centroid of its cluster; plotting it for different numbers of clusters is known as the elbow method.

set.seed(6)
wcss = vector()

We run a for loop that, for each number of clusters from 1 to 10, fits k-means to our data and stores the total within-cluster sum of squares ($withinss).

for (i in 1:10) wcss[i] = sum(kmeans(dataset, i)$withinss)

and at the end we make our plot, indicating the values to plot, the plot type ('b', points joined by lines) and the title and axis labels.

plot(1:10,
     wcss,
     type = 'b',
     main = paste('The Elbow Method'),
     xlab = 'Number of clusters',
     ylab = 'WCSS')
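
Because kmeans starts from random centroids, single runs can give a slightly noisy WCSS curve. A variant sketch of the same loop (the nstart argument is not used in the script above) asks for several random starts per number of clusters and keeps the best one, which usually gives a smoother elbow:

# Same elbow loop, but each kmeans() call tries 10 random starts and
# keeps the solution with the lowest total within-cluster sum of squares.
set.seed(6)
wcss = vector()
for (i in 1:10) wcss[i] = sum(kmeans(dataset, centers = i, nstart = 10)$withinss)
plot(1:10, wcss, type = 'b',
     main = 'The Elbow Method (10 starts per k)',
     xlab = 'Number of clusters', ylab = 'WCSS')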

With everything done this gives us a graph from which we can choose the number of clusters. This is usually done by looking for the breaking point of the curve: once the following values no longer show much difference, that is the number we are looking for, in this case 3.

(image: The Elbow Method plot)

In the same way we set another random seed and call the kmeans function, indicating the dataset and the number of centroids that we deduced from the graph above; from the fitted object we then extract the cluster assignment of every observation.

set.seed(29)
kmeans = kmeans(x = dataset, centers = 3)
y_kmeans = kmeans$cluster
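
Before plotting, a quick optional look at what the fitted model contains (a sketch using the objects defined just above; note that the result was assigned to a variable that shadows the kmeans function name):

# Centroid coordinates of the 3 clusters for the 4 features, and the
# number of observations assigned to each cluster.
kmeans$centers
table(y_kmeans)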

For the last steps we load the necessary library. Once loaded, we plot with the clusplot function, which draws the clusters in two dimensions; we pass it our full dataset together with the cluster assignments we just extracted. The other parameters only control how the data is displayed and are optional, and finally we add the title and axis labels.

library(cluster)
clusplot(dataset,
         y_kmeans,
         lines = 0,
         shade = TRUE,
         color = TRUE,
         labels = 2,
         plotchar = FALSE,
         span = TRUE,
         main = paste('Classification of iris'),
         xlab = 'features',
         ylab = 'Clusters')
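
As a final hedged check (this assumes the fifth column of iris.csv holds the species label that was dropped at the start, which is how the standard iris data is usually laid out), the clusters can be cross-tabulated against the true species to see how well they line up:

# Hypothetical check: compare the k-means clusters with the species
# column that was removed before clustering. The column index 5 is an
# assumption about the layout of iris.csv.
species = read.csv('iris.csv')[, 5]
table(y_kmeans, species)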

Data visualization

(image: Classification of iris cluster plot)

Conclusion

This algorithm has different implementations that vary between libraries, but they all share the same advantages. Among them, it is much faster on massive amounts of data, since it does not keep as much information in memory, and its implementation is very simple. It also has problems: it is strongly affected by points that fall outside any group (also called noise), and its quality depends heavily on the similarity measure used between the data. In conclusion, this method is quite good for grouping data into clusters whose members are very similar to each other.
