jorisvandenbossche / ds-python-data-analysis

Data manipulation, analysis and visualisation in Python - specialist course Doctoral schools of Ghent University

Home Page: https://jorisvandenbossche.github.io/DS-python-data-analysis/

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 96.73% Shell 0.13% Python 3.14%

ds-python-data-analysis's Introduction

Data manipulation, analysis and visualisation in Python

Introduction

This course is intended for researchers that have at least basic programming skills in Python. It targets researchers that want to enhance their general data manipulation and analysis skills in Python.

The course does not aim to provide a course in statistics or machine learning. It aims to provide researchers the means to effectively tackle commonly encountered data handling tasks in order to increase the overall efficiency of the research.

The course has been developed as a specialist course for the Doctoral schools of Ghent University, but can be taught to others upon request (and the material is freely available to re-use).

Getting started

The course uses Python 3 and some data analysis packages such as Pandas, Numpy and Matplotlib. To install the required libraries, we highly recommend Anaconda or Miniconda (https://www.anaconda.com/download/) or another Python distribution that includes the scientific libraries (this recommendation applies to all platforms: Windows, Linux and Mac).

For detailed instructions to get started on your local machine, see the setup instructions.

In case you do not want to install everything and just want to try out the course material, use the environment set up by Binder and open the notebooks right away.

Contributing

Found a typo or have a suggestion? See how to contribute.

Meta

Authors: Joris Van den Bossche, Stijn Van Hoey

ds-python-data-analysis's People

Contributors

beramos, daanvanhauwermeiren, jdcla, jorisvandenbossche, stijnvanhoey

ds-python-data-analysis's Issues

Some small typos

01-basic.ipynb

test out the nxt two lines one at a time --> next
If their not yet a string --> they are

02-control_flow.ipynb

Different exceptions can be combined, rasingin different type of errors --> raising

03-functions.ipynb

It is just the same al all the other objects we worked with! --> as

05-numpy.ipynb

We use matplotlib to sho histogram --> show

python_rehearsal.ipynb

first written as 'builtin', a bit later as 'built-in'

FutureWarning when concatenating multiple dataframes

In notebook pandas_05_combining_datasets, when using the concat function, the following future warning is provided:

FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.

During the course, we should use this to explain what a FutureWarning actually is. In the future, we can adjust this and turn it into a note.
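As a minimal sketch of what the warning is about (the two frames below are made-up data, not the ones from the notebook), passing `sort` explicitly makes the intent clear and avoids relying on the default:

```python
import pandas as pd

# Two illustrative frames whose columns are not aligned (hypothetical data).
df1 = pd.DataFrame({"a": [1, 2], "b": [1, 2]}, columns=["b", "a"])
df2 = pd.DataFrame({"a": [4, 5]})

# sort=False keeps the column labels in order of appearance.
unsorted_result = pd.concat([df1, df2], sort=False)

# sort=True sorts the labels of the non-concatenation axis.
sorted_result = pd.concat([df1, df2], sort=True)

print(list(unsorted_result.columns))
print(list(sorted_result.columns))
```

Rows from `df2` get NaN in column `b` in both cases; only the column order differs.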

np.isclose instead of ==

in: notebooks/python_recap/python_rehearsal.ipynb

maybe it is a good idea to change the value check to np.isclose

EXERCISE: Convert all values -99 of the array AR3 into Nan-values (Note that Nan values can be provided in float arrays as `np.nan`)

original:

AR3[AR3 == -99.] = np.nan

new proposal:

AR3[np.isclose(AR3, -99)] = np.nan
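A small sketch of the proposal, assuming an illustrative `AR3` (the values below are made up). It also shows why a tolerance-based check can be safer than `==` for floats:

```python
import numpy as np

# Hypothetical float array using -99 as a missing-value marker.
AR3 = np.array([1.5, -99.0, 3.2, -99.0, 7.0])

# Replace every value that is (approximately) -99 by NaN.
AR3[np.isclose(AR3, -99)] = np.nan
n_missing = int(np.isnan(AR3).sum())

# Exact comparison can fail after floating point arithmetic;
# np.isclose compares within a tolerance instead.
exact_match = (0.1 + 0.2) == 0.3
close_match = bool(np.isclose(0.1 + 0.2, 0.3))

print(n_missing, exact_match, close_match)
```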

paths

paths are not OS-independent.

os.path.join and os.path.split should alleviate this
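For illustration, a sketch with a hypothetical file location (the folder and file names are assumptions, not paths from the course material):

```python
import os

# Build the path from its parts; the separator is chosen by the OS.
data_path = os.path.join("data", "airquality", "station.csv")

# Split it again without hard-coding '/' or '\\'.
folder, filename = os.path.split(data_path)
name, extension = os.path.splitext(filename)

print(folder, filename, name, extension)
```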

heatmap creation at case2_biodiversity_analysis

The question is: Create a table, called heatmap_prep, based on the survey_data DataFrame with in the row index the years, in the column the months and as values of the table, the counts for each of these year/month combinat

As this is a groupby-exercise, it does not make much sense to use plotnine, starting from the heatmap_prep variable (as it isn't tidy anymore). Shouldn't we just introduce seaborn.heatmap for this purpose?

If yes, would this notebook be the proper timing to link the students to the seaborn documentation?
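A minimal sketch of how such a table could be built with groupby, assuming a tiny stand-in for survey_data with hypothetical `year` and `month` columns (the real column names may differ):

```python
import pandas as pd

# Made-up stand-in for survey_data; one row per observation.
survey_data = pd.DataFrame({
    "year": [2000, 2000, 2000, 2001, 2001],
    "month": [1, 1, 2, 1, 2],
    "species": ["a", "b", "a", "c", "a"],
})

# Count the records per year/month combination and move the months
# to the columns, giving years as rows and months as columns.
heatmap_prep = (survey_data
                .groupby(["year", "month"])
                .size()
                .unstack("month"))

print(heatmap_prep)
```

A wide table like this is the shape that `seaborn.heatmap` accepts directly, which is what makes it a natural fit here.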

typo in pandas_03_selecting_data.ipynb

ïloc --> iloc

REMEMBER:

Advanced indexing with loc and ïloc

  • **loc**: select by label: `df.loc[row_indexer, column_indexer]`
  • **iloc**: select by position: `df.iloc[row_indexer, column_indexer]`
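As a short illustration of the two indexers, using a made-up `countries` frame (the values are placeholders):

```python
import pandas as pd

# Illustrative frame with the country name as index.
countries = pd.DataFrame(
    {"population": [11.3, 64.3, 81.3],
     "capital": ["Brussels", "Paris", "Berlin"]},
    index=["Belgium", "France", "Germany"],
)

# loc selects by label: row 'France', column 'capital'.
by_label = countries.loc["France", "capital"]

# iloc selects by position: second row, second column.
by_position = countries.iloc[1, 1]

print(by_label, by_position)
```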

Case 2: custom matching function

Exercise says to use species and genus separately in input, solution uses the concatenated string as input (but says to give separate strings as input in the function description).

Error message opening Anacoda Navigator

Traceback (most recent call last):
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil_pswindows.py", line 620, in wrapper
return fun(self, *args, **kwargs)
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil_pswindows.py", line 690, in cmdline
ret = cext.proc_cmdline(self.pid)
PermissionError: [WinError 5] Access is denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\anaconda_navigator\exceptions.py", line 75, in exception_handler
return_value = func(*args, **kwargs)
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\anaconda_navigator\app\start.py", line 108, in start_app
if misc.load_pid() is None: # A stale lock might be around
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\anaconda_navigator\utils\misc.py", line 384, in load_pid
cmds = process.cmdline()
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil_init_.py", line 701, in cmdline
return self._proc.cmdline()
File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil_pswindows.py", line 623, in wrapper
raise AccessDenied(self.pid, self._name)
psutil.AccessDenied: psutil.AccessDenied (pid=1208)

Typo pandas_03

small typo in 'remember box':
Advanced indexing with loc and ïloc

(should be iloc, not ï)

Notes from June 2019 workshop

General:

  • first morning: mention that people need to download latest version of the material (idea: tag to make a release before the course)
  • is %matplotlib inline still needed?

Slides:

  • update conda page (environment, source activate -> activate, ..)
  • Ipython -> IPython
  • IDE's update with Atom / VSCode

Pandas 1:

  • column names in overview table titanic wrong
  • no .values for dataframe
  • html boxes

Pandas 3 selecting:

  • and -> & to combine conditions => add tip for this

Pandas 3b indexing:

  • update setting with copy warning section with more explanation (temporary variable, boolean selection on population, show both and that the one does not work countries.loc[countries['population'] > 50, 'population'] = 50)
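The assignment mentioned in the note above can be sketched as follows, on made-up data. The chained form is only shown in a comment, since it assigns to a temporary object and does not update `countries`:

```python
import pandas as pd

# Hypothetical data; populations in millions.
countries = pd.DataFrame({
    "country": ["Belgium", "France", "Germany"],
    "population": [11.3, 64.3, 81.3],
})

# Chained indexing such as
#     countries[countries['population'] > 50]['population'] = 50
# first creates a temporary selection and assigns to that.

# A single .loc call with a boolean mask and a column label
# assigns on the original frame.
countries.loc[countries["population"] > 50, "population"] = 50

print(countries)
```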

Pandas 4:

  • Ways to shorten this a bit?

Pandas 6 - groupby:

  • The color=C0 in the last exercise is not needed anymore

pandas 7 reshape:

  • last exercise pivot_table mean -> median

Visualization - plotnine:

  • the by in the first pandas example not doing anything?

Visualization - landscape:

  • update: Bokeh now supports pdf

Case 1 - bike count:

  • have raw data as backup

Case 2 - biodiversity processing:

  • groupby doesn't count NaNs, value_counts does -> add a note to explain that difference?

Case 2 - biodiversity analysis:

  • don't use built-in sum
  • subselection_sex -> make subtask to create this variable

Case 3 bacterial resistance:

  • initial tidying: we lose the "experiment id" or "repetition id" in the original data (multiple repetitions for same phage / genotype, which now is a single row) -> that information is lost
  • creation of density_mean -> select column 'optical_density' before .mean() -> no mean of survival etc .. (in solution)

Collection of errors / suggestions 2017

  • air quality processing: split('/') -> os.path.split()
  • matplotlib notebook:
    • "Essential stuff; 2. creating objects" -> ax -> ax1
    • "Best of both worlds" example of tick_params does not have any effect when using "seaborn-whitegrid"
  • First exercise in "pandas_02_basic_operations.ipynb" -> mention to use population series (and not the countries dataframe; not yet seen how to select single element of dataframe)

alternative solution str split exercise in pandas_03a_selecting_data

Currently, for the exercise

Split the 'Name' column on the , extract the first part (the surname), and add this as new column 'Surname' .

The provided solution is df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0]) whereas df["Surname"] = df["Name"].str.split(",").str.get(0) probably makes more sense as a solution at this stage of the course?
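Both variants side by side, as a sketch on two made-up names in the style of the Titanic data:

```python
import pandas as pd

df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris",
                            "Heikkinen, Miss. Laina"]})

# Vectorised string methods: split on the comma, take the first part.
df["Surname"] = df["Name"].str.split(",").str.get(0)

# Equivalent result with apply and a lambda function.
surname_apply = df["Name"].apply(lambda x: x.split(",")[0])

print(df["Surname"].tolist())
```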

Spelling mistakes/typos

00-jupyter_introduction.ipynb > 2.7 Trouble... > "you're cell" should be "your cell" and "you're notebook" should be "your notebook"

change alias of plotnine

Quote from the plotnine developer: For an alias of plotnine, you should consider p9 instead of pn. I have seen plotnine abbreviated that way in a couple of places.

missing answers in pandas_03_selecting_data.ipynb

The following two exercises do not have the correct answers

EXERCISE:

  • Change the capital of the UK to Cambridge
len(titles[titles['year'] // 10 == 195])

proposed answer:

countries.loc['United Kingdom', 'capital'] = 'Cambridge'
EXERCISE:
  • Select all countries whose population density is between 100 and 300 people/km²
# %load _solutions/pandas_03_selecting_data42.py
inception = cast[cast['title'] == 'Inception']

proposed answer:

countries[(countries['density'] < 300) & (countries['density'] > 100)]
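A runnable sketch of the second proposed answer on made-up densities. It also shows why `&` with parentheses is needed: the Python keyword `and` raises an error on a Series.

```python
import pandas as pd

# Hypothetical densities in people/km².
countries = pd.DataFrame({"density": [50, 150, 250, 350]},
                         index=["A", "B", "C", "D"])

# Combine element-wise conditions with & and wrap each in parentheses.
selected = countries[(countries["density"] < 300)
                     & (countries["density"] > 100)]

# Using 'and' tries to reduce each Series to a single True/False.
try:
    countries[(countries["density"] < 300) and (countries["density"] > 100)]
    and_failed = False
except ValueError:
    and_failed = True

print(selected.index.tolist(), and_failed)
```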

SHIFT+TAB instead of CTRL+TAB

In notebook "case2_biodiversity_processing (autosaved)", instead of

" # get help of the function by CTRL + TAB"

should be

" # get help of the function by SHIFT + TAB"

typo python_recap/01-basic

Exercise: With the dir(list) command, all the methods of the list type are printed. However, where ... -> we're

Show months as discrete integers in heatmap plotnine

In the case2_biodiversity_analysis notebook, the months on the x-axis of the plotnine heatmap are shown as continuous values (i.e. 0, 2.5, 5, ... ). It would be nice if they are shown as discrete values in the solution (i.e 1, 2, 3, ...) by setting the x-labels as a factor.

Bug in python_rehearsal when loading _solutions/python_rehearsal4.py

There is a line in the python_rehearsal notebook that throws an error. Not sure if this is done on purpose to illustrate something or if this really is a bug.

First time running the cell gives this:

# %load _solutions/python_rehearsal4.py
from barometric_formula import barometric_formula

The second time running the cell yields the following error trace

# %load _solutions/python_rehearsal4.py
from barometric_formula import barometric_formula

ModuleNotFoundError Traceback (most recent call last)
in ()
1 # %load _solutions/python_rehearsal4.py
----> 2 from barometric_formula import barometric_formula

ModuleNotFoundError: No module named 'barometric_formula'

Strange ax behaviour in matplotlib

When making a bar plot, I forgot/didn't know to define a second ax object, but I still get the correct output.

Code:

unique_species_by_plot = survey_data.groupby(["verbatimLocality"])["name"].nunique()

fig, ax = plt.subplots(figsize=(8,8))
unique_species_by_plot.plot(kind = "barh", color = "pink")
ax.set_ylabel('Plot ID number');
ax.set_xlabel('Number unique species');

Output:
image
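A likely explanation, as far as I understand the pandas behaviour: when no `ax` is passed, `Series.plot` draws on the current matplotlib axes, which here is the one just created by `plt.subplots`. The sketch below uses made-up counts and the non-interactive Agg backend (both assumptions) to check this, and shows the explicit alternative:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for unique_species_by_plot.
unique_species_by_plot = pd.Series([3, 5, 2],
                                   index=["plot 1", "plot 2", "plot 3"])

# Without ax=..., the Series plot ends up on the current axes.
fig, ax = plt.subplots(figsize=(8, 8))
implicit_ax = unique_species_by_plot.plot(kind="barh", color="pink")
reused = implicit_ax is ax

# Passing the axes explicitly does not depend on the "current axes" state.
fig2, ax2 = plt.subplots(figsize=(8, 8))
unique_species_by_plot.plot(kind="barh", color="pink", ax=ax2)
ax2.set_ylabel("Plot ID number")
ax2.set_xlabel("Number unique species")

print(reused)
```

Passing `ax=ax` explicitly is the more robust habit, especially in figures with several subplots.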

Collection of errors / suggestions

@jorisvandenbossche talking:

  • matplotlib notebook: example of fig vs ax -> 'ax' -> 'ax1'
  • autofmt_xdate -> can take rotation keyword!
  • bike data -> drop_duplicates:
    • can also use index.duplicated
    • or for drop_duplicates -> use subset to drop based on the datetimes
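The difference between these options can be sketched on a tiny made-up bike count table (the column and index names are assumptions):

```python
import pandas as pd

# Two records share the same timestamp; two share the same count.
index = pd.DatetimeIndex(
    ["2013-06-01 10:00", "2013-06-01 10:00", "2013-06-01 11:00"],
    name="timestamp",
)
bike = pd.DataFrame({"count": [5, 7, 5]}, index=index)

# drop_duplicates compares column values only and ignores the index,
# so it drops the last row (same count as the first one).
by_values = bike.drop_duplicates()

# index.duplicated flags repeated timestamps instead.
by_index = bike[~bike.index.duplicated(keep="first")]

# Alternative: move the index to a column and use subset.
by_subset = (bike.reset_index()
             .drop_duplicates(subset="timestamp")
             .set_index("timestamp"))

print(by_values, by_index, by_subset, sep="\n\n")
```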

Notes from Bordeaux workshop

Notes from Bordeaux workshop:

  • pandas intro notebooks can use a re-work:

    • data structures -> check the pandas-tutorial ones and compare (first dataframe, then series?)
    • data structures -> already do exercise with titanic? (loading the data, seeing the first 5 rows, plotting one column?)
    • basic operations -> countries["capital"].apply(lambda x: len(x)) is bad example, as doing apply(len) is the same
    • basic operations -> exercises with titanic ?
    • basic operations -> leave out alignment example? It only adds cognitive load, and does not occur in the examples (that's maybe for in the "advanced indexing" one?)
  • pandas indexing / selecting data:

    • use titanic for exercises (instead of countries) ?
    • split in two parts? basic (selecting column(s) + filtering rows) and advanced (actual non-default index, loc/iloc, assignment, index/multi-index)
      check: do we use loc/iloc in some of the case studies? (I assume we might use it for a combined boolean mask + column selection)
  • new notebook: working with missing data ?

  • reshaping: too much ? (only unstack and not stack?)

  • bike_count case study:

    • don't use dayfirst as the solution, only show the comparison
    • more hints (eg "use value_counts", since they haven't seen that function yet)
      maybe work with hidden hint html?
    • drop_duplicates -> does not take into account the index!
  • matplotlib visualization notebook:

    • ax -> ax1
    • set_axis_bgcolor no longer exists (g.ax_joint.set_axis_bgcolor('0.1') in one of the seaborn examples)
  • general: more hints on what to use

  • provide good cheatsheet

python intro: too little too fast if you don't know python yet, boring / too slow if you already know Python

Exercise numpy

Second-to-last exercise -> Change all even positions of matrix AR to 30

case3 bacterial resistance: variable median/mean

Exercise on optical density: create density_mean:

In the assignment:
Calculate for each combination of Bacterial_genotype, Phage_t and experiment_time_h the mean optical_density and store the result as a dataframe called density_mean
==> mean of optical density is asked

In your solution:
density_mean = (tidy_experiment
.groupby(['Bacterial_genotype','Phage_t', 'experiment_time_h'])
.median().reset_index())
==> median optical density and survival are given
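One possible way to match the assignment, sketched on a made-up `tidy_experiment` (the values are placeholders): select the `optical_density` column before aggregating and use the mean.

```python
import pandas as pd

# Hypothetical tidy data with two repetitions per genotype.
tidy_experiment = pd.DataFrame({
    "Bacterial_genotype": ["WT", "WT", "MUT", "MUT"],
    "Phage_t": ["C", "C", "C", "C"],
    "experiment_time_h": [0, 0, 0, 0],
    "optical_density": [0.1, 0.3, 0.2, 0.6],
    "survival": [1, 1, 0, 1],
})

# Selecting the column first means only optical_density is averaged.
density_mean = (tidy_experiment
                .groupby(["Bacterial_genotype", "Phage_t",
                          "experiment_time_h"])["optical_density"]
                .mean()
                .reset_index())

print(density_mean)
```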

More efficient solution for python_rehearsal119.py

Found a much faster solution for python_rehearsal119.py:

%%timeit
np.count_nonzero(AR>10)
1.44 µs ± 90.8 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

Instead of:
%%timeit
sum(AR>10)
45.2 µs ± 11.6 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)

%%timeit
np.sum(AR>10)
4.66 µs ± 830 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
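The three variants give the same count, which can be checked on a made-up array (the contents of `AR` below are an assumption):

```python
import numpy as np

# Hypothetical integer array standing in for AR.
rng = np.random.default_rng(0)
AR = rng.integers(0, 20, size=1000)

# Built-in sum iterates over the boolean array in Python.
n_builtin = sum(AR > 10)

# The NumPy functions work on the whole array at once.
n_np_sum = np.sum(AR > 10)
n_count = np.count_nonzero(AR > 10)

print(n_builtin, n_np_sum, n_count)
```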

incorrect solutions

EXERCISE:

  • Using groupby(), calculate the average age for each sex.

given answer:

# %load _solutions/pandas_06_groupby_operations10.py
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])

proposed answer:

df.groupby('Sex')['Age'].aggregate(np.mean)
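An equivalent spelling with the `mean` method, sketched on four made-up passengers:

```python
import pandas as pd

# Hypothetical subset in the style of the Titanic data.
df = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Age": [22.0, 38.0, 30.0, 26.0],
})

# Group by sex, select the Age column, take the mean per group.
mean_age = df.groupby("Sex")["Age"].mean()

print(mean_age)
```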

wrong tip in time series notebook

When updating the notebooks, I added a tip about the link to agg in groupby, whereas that notebook is later in the course.

Note remember the agg when using groupby to derive multiple statistics at the same time?

We should:

  • explain the usage of agg earlier, e.g. in pandas_02_basic_operations notebook
  • remove/adjust the reference to groupby
