Python-Project-2018

Background Information on the Iris Dataset

The Iris Dataset is a multivariate data set. It includes three species: Iris Setosa, Iris Versicolor and Iris Virginica. Each species has 50 samples, giving 150 samples in total. For each sample, four measurements were taken: sepal length, sepal width, petal length and petal width. There are no missing data points and all measurements are in centimetres.

Project Plan

My aim in analyzing the Iris Dataset is to determine if there is any relationship between the four different measurements and the three different species. I am looking for patterns that may indicate a relationship between variables, or that distinguish between species.

I intend to:

  • Calculate various measures of central tendency and spread for the four measurements in my dataset.
  • Compare pairs of measurements on scatter diagrams to determine if there is any relationship between them, and calculate the correlation coefficient.
  • Compare pairs of measurements on scatter diagrams, differentiated by species.

As this is my first statistical analysis project, I did not have any concrete goals in mind, beyond determining correlation. As I worked my way through this project, I understood the limitations of my initial plan and expanded upon it.

  • After running my program Calculations.py, I realised I needed a graphical representation of distribution for the four measurements.
    • I placed data for each measurement into a histogram to get a graph of the distribution.
    • I then viewed these distributions along the same intervals of the x and y axes.
  • After running my program Scatter.py, it was clear that petal length and petal width correlate strongly. I began to consider whether it would be possible to get an equation for length with respect to width. So,
    • I calculated r-squared and p-values to get a better indication of correlation.
    • I got the slope of the line of best fit, and used the slope and y-intercept to find its equation.
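The slope, y-intercept and correlation coefficient mentioned above can be sketched in plain Python. This is a minimal illustration, not the code in Scatter.py: the function name `fit_line` and the sample points are made up for the example, and the points are not Iris data.

```python
from math import sqrt


def fit_line(xs, ys):
    """Return (slope, intercept, r) for the least-squares line y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and cross-deviations from the means
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    r = sxy / sqrt(sxx * syy)  # Pearson correlation coefficient
    return slope, intercept, r


# Made-up points lying exactly on y = 1 + 2x (hypothetical, not Iris data)
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 5.0, 7.0, 9.0]
slope, intercept, r = fit_line(xs, ys)
print(f"y = {intercept:.2f} + {slope:.2f}x, r = {r:.4f}, r squared = {r * r:.4f}")
```

Because the sample points sit exactly on a line, the fit recovers that line and r is 1. Real measurements would give a value of r between -1 and 1.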

How to Run my Code

Clone or download this repository to the desired directory on your machine. You can now open each program in Visual Studio Code or run it from Command Prompt, provided Python is on your PATH.

If you're using Visual Studio Code: Open the folder, then open the file. Open the Integrated Terminal using Ctrl+`. The Integrated Terminal may not open in the relevant directory. If so, in the VS Code Explorer, right-click a file in the required folder and open it in Command Prompt. Now you can run a program called NAME.py by typing python NAME.py in the Integrated Terminal and pressing Enter.

If you're using Command Prompt: Navigate to the relevant folder using the cd command, and run dir to get a list of files in the folder. Any Python file in this folder can now be run by typing python NAME.py and pressing Enter.

Note: Any calculations will be printed in the terminal you ran the program from. Graphical representations of the data open in individual windows, which you must close to move on to the next graph. Please ensure you close the final window to complete the program.

  • Calculations.py - Calculates the max value, min value, range, interquartile range, mean, median, mode, standard deviation and variance for each of the four measurements in the Iris Dataset. This program prints the calculations in your terminal.
  • Hist.py - This program displays the distribution of the four measurements in the Iris Dataset. There are 8 graphs in total. Graphs 5 - 8 display this data along the same intervals of the x and y axes.
  • Hist2.py - This program displays each of the four measurements on 3 different subplots in order to compare the distribution across species. There are 4 graphs in total.
  • Scatter1.py - This program displays 6 scatter graphs combining the measurements from the Iris Dataset. On each graph the line of best fit is also plotted. This program outputs the slope, y-intercept and equation of each line of best fit. It also prints calculations related to correlation: the correlation coefficient, r-squared value and p-value.
    When running this program, the relevant calculations appear in your terminal when the image of the corresponding scatter diagram opens. Please close that window to move on to the next scatter diagram and its calculations.
  • Scatter2.py - This program displays the same 6 scatter diagrams as Scatter1.py, but these are colored to distinguish between species.
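As a rough idea of the kind of summary Calculations.py produces, the statistics listed above can be computed with Python's standard `statistics` module. This is a sketch under assumptions, not the project's actual code: the function name `summarize` and the sample values are hypothetical, and the population forms of standard deviation and variance (`pstdev`, `pvariance`) are assumed because the source does not say which form the program uses.

```python
import statistics


def summarize(values):
    """Return the summary statistics listed for Calculations.py as a dict."""
    # Quartiles with linear interpolation between data points
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    return {
        "min": min(values),
        "max": max(values),
        "range": max(values) - min(values),
        "iqr": q3 - q1,
        "mean": statistics.mean(values),
        "median": statistics.median(values),
        "mode": statistics.mode(values),
        "std": statistics.pstdev(values),       # population standard deviation (assumed)
        "variance": statistics.pvariance(values),  # population variance (assumed)
    }


# Made-up values for illustration only (not Iris data)
sample = [1.0, 2.0, 2.0, 3.0, 4.0]
for name, value in summarize(sample).items():
    print(f"{name}: {value:.4f}")
```

To use it on the real data, pass in one column of the dataset (for example, all 150 petal lengths) as a list of floats.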

My Analysis of the Iris Dataset

Central Tendency and Distribution

All values are in centimetres (variance in cm²).

| Statistic | Sepal Length | Sepal Width | Petal Length | Petal Width |
|---|---|---|---|---|
| Minimum Value | 4.3 | 2.0 | 1.0 | 0.1 |
| Maximum Value | 7.9 | 4.4 | 6.9 | 2.5 |
| Range | 3.6 | 2.4 | 5.9 | 2.4 |
| Interquartile Range | 1.3 | 0.5 | 3.5 | 1.5 |
| Mean | 5.8433 | 3.0540 | 3.7587 | 1.1987 |
| Median | 5.8 | 3.0 | 4.35 | 1.3 |
| Mode | 5.0 | 3.0 | 1.5 | 0.2 |
| Standard Deviation | 0.8253 | 0.4321 | 1.7585 | 0.7606 |
| Variance | 0.6811 | 0.1868 | 3.0924 | 0.5785 |

  • Petal Length is the most spread-out measurement, with the highest range, IQR and standard deviation. The length varies from 1 cm to 6.9 cm in this set of data. The wide range of values and presence of outliers resulted in three very different measures of central tendency (mean 3.8, median 4.4 and mode 1.5).

  • Sepal Width and Petal Width share the smallest range, 2.4. Sepal Width is more affected by outliers, with an IQR of 0.5. Meanwhile, the IQR of Petal Width is 1.5, reflecting the greater spread of values in that set.

  • The variation across the four measurements is quite likely to be due to the variation across species. Using subplots to plot these measurements according to species appears to verify this.

[Figures: distributions of Sepal Length, Sepal Width, Petal Length and Petal Width, plotted by species]

  • Petal Length and Petal Width appear to provide the clearest differences between species. Setosa has length less than 2 cm and width less than 1 cm. Versicolor has length between 3 and 5 cm and width between 1 and 2 cm. Virginica has length between approximately 4.5 and 7 cm and width between approximately 1.5 and 2.5 cm. While there is some overlap between Virginica and Versicolor, these measurements could form the basis for identifying species from measurements.
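To illustrate how these ranges could be used to identify species, here is a small rule-based sketch. It is not part of the project's programs: the names `RANGES` and `candidate_species` are hypothetical, the ranges are the approximate ones read off the graphs above, and the lower bound of 0 for Setosa is an assumption. Because Versicolor and Virginica overlap, the function returns every species whose ranges contain the measurement rather than a single answer.

```python
# Approximate (low, high) petal ranges in cm, taken from the observations above.
# The lower bound of 0.0 for setosa is an assumption made for this sketch.
RANGES = {
    "setosa": {"length": (0.0, 2.0), "width": (0.0, 1.0)},
    "versicolor": {"length": (3.0, 5.0), "width": (1.0, 2.0)},
    "virginica": {"length": (4.5, 7.0), "width": (1.5, 2.5)},
}


def candidate_species(petal_length, petal_width):
    """Return every species whose approximate ranges contain both measurements."""
    matches = []
    for species, limits in RANGES.items():
        low_l, high_l = limits["length"]
        low_w, high_w = limits["width"]
        if low_l <= petal_length <= high_l and low_w <= petal_width <= high_w:
            matches.append(species)
    return matches


print(candidate_species(1.4, 0.2))  # falls only in the setosa ranges
print(candidate_species(4.8, 1.8))  # falls in the versicolor/virginica overlap
print(candidate_species(6.0, 2.3))  # falls only in the virginica ranges
```

A measurement in the overlap region returns two candidates, which shows where a simple range check is not enough and a proper classification method would be needed.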

Correlation

In each pair, x is the first measurement and y is the second.

| Pair | Slope of the best fit line | Y-intercept | Equation | R Value | R² Value | P Value |
|---|---|---|---|---|---|---|
| Sepal Length vs. Sepal Width | -0.0573 | 3.38863737949 | y = 3.39 - 0.06x | -0.1094 | 0.012 | 0.1827652153 |
| Petal Length vs. Petal Width | 0.4164 | -0.36651404521 | y = -0.37 + 0.42x | 0.9628 | 0.9269 | 0.0, to 10dp |
| Sepal Width vs. Petal Width | -0.6275 | 3.115192687 | y = 3.12 - 0.63x | -0.3565 | 0.1271 | 0.0000075239 |
| Sepal Length vs. Petal Length | 1.8575 | -7.0953814783 | y = -7.10 + 1.86x | 0.8718 | 0.76 | 0.0, to 10dp |
| Sepal Length vs. Petal Width | 0.7538 | -3.2062768960 | y = -3.21 + 0.75x | 0.818 | 0.669 | 0.0, to 10dp |
| Sepal Width vs. Petal Length | -1.7112 | 8.98467570546 | y = 8.98 - 1.71x | -0.4205 | 0.1768 | 0.0000000843 |

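The regression figures can be cross-checked against the summary statistics, because the least-squares slope equals r multiplied by the ratio of the standard deviations, and the line passes through the point of means. The sketch below does this for Petal Length vs. Petal Width using the rounded values reported in this analysis. The variable names are my own, and small differences in the last decimal place are expected because the inputs are rounded.

```python
# Rounded summary values copied from this analysis (petal length is x, petal width is y)
mean_x, std_x = 3.7587, 1.7585  # petal length: mean, standard deviation
mean_y, std_y = 1.1987, 0.7606  # petal width: mean, standard deviation
r = 0.9628                      # correlation coefficient for this pair

# Least-squares slope and intercept expressed through r and the summary statistics
slope = r * std_y / std_x
intercept = mean_y - slope * mean_x

print(f"slope = {slope:.4f}")
print(f"intercept = {intercept:.4f}")
print(f"r squared = {r ** 2:.4f}")
```

The results should agree with the reported slope (0.4164), y-intercept (about -0.3665) and r-squared (0.9269) to roughly three decimal places.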
  • The strongest and most statistically significant correlation is between Petal Length and Petal Width. This pair has the highest correlation coefficient (0.9628) and the highest r-squared value (0.9269). Its p-value is 5.78 x 10^-86, which I rounded to zero above. This is the probability of seeing a correlation at least this strong if there were no relationship between the variables.

  • P-values indicate that every pair except Sepal Length vs. Sepal Width has a statistically significant relationship (p < 0.05).

  • Despite having small p-values, some of these pairs have low r-squared values. These are the Sepal Width vs. Petal Length (0.1768) and Sepal Width vs. Petal Width (0.1271) relationships. This suggests that even though there is a statistically significant relationship, the equation of the line of best fit would not be a good model in these cases.

  • While data scientists often employ different methods of linear regression than I did here, they often arrive at very similar equations to model the relationship between measurements. Comparing my work to theirs allowed me to identify a very simple mathematical error in my own calculations.
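The significance results above can be sanity-checked without a statistics library by converting each correlation coefficient into a t-statistic, t = r * sqrt(n - 2) / sqrt(1 - r²), and comparing it with the two-tailed 5% critical value. This is a sketch, not how the project's programs compute p-values: the function name `t_statistic` is hypothetical, and the critical value of roughly 1.976 for 148 degrees of freedom is an approximation taken from standard t-tables.

```python
from math import sqrt


def t_statistic(r, n):
    """t-statistic for testing whether a Pearson correlation r from n samples differs from zero."""
    return r * sqrt(n - 2) / sqrt(1 - r * r)


N = 150             # number of samples in the Iris Dataset
T_CRITICAL = 1.976  # approximate two-tailed 5% critical value, 148 degrees of freedom

# Correlation coefficients as reported in this analysis
r_values = {
    "Sepal Length vs. Sepal Width": -0.1094,
    "Petal Length vs. Petal Width": 0.9628,
    "Sepal Width vs. Petal Width": -0.3565,
    "Sepal Length vs. Petal Length": 0.8718,
    "Sepal Length vs. Petal Width": 0.818,
    "Sepal Width vs. Petal Length": -0.4205,
}

for pair, r in r_values.items():
    t = t_statistic(r, N)
    verdict = "significant" if abs(t) > T_CRITICAL else "not significant"
    print(f"{pair}: t = {t:.2f} ({verdict} at p < 0.05)")
```

Only Sepal Length vs. Sepal Width falls below the critical value, which matches the p-values reported above.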

Further plans

During my analysis of the Iris Dataset, I avoided reading analyses conducted by data scientists so that their work would not influence my own. My initial plan to plot the data on scatter diagrams was, however, heavily influenced by the graphical representations of this data on Wikipedia.

This approach helped me to appreciate the importance of using programming languages like Python to conduct data analysis. My own analysis identified errors in P.S. Hoey's mathematical analysis of this set, including an incorrectly identified minimum value.

Following the completion of my own analysis, I consulted analysis of the dataset conducted by data scientists.

  • Many data scientists use methods of linear regression different to the one I used. I hope to gain a better understanding of linear regression methods that might be more effective than the line of best fit equation.
  • Many data scientists used methods I'm unfamiliar with to classify an iris using its measurements and the pre-existing Iris data in this set. I would like to understand clustering methods and the applications used to create predictive models, and apply those to this set.
