jorisvandenbossche / ds-python-data-analysis

Data manipulation, analysis and visualisation in Python - specialist course Doctoral schools of Ghent University

Home Page: https://jorisvandenbossche.github.io/DS-python-data-analysis/

License: BSD 3-Clause "New" or "Revised" License

Jupyter Notebook 96.73% Shell 0.13% Python 3.14%

ds-python-data-analysis's Introduction

Data manipulation, analysis and visualisation in Python

Introduction

This course is intended for researchers who have at least basic programming skills in Python. It targets researchers who want to enhance their general data manipulation and analysis skills in Python.

The course does not aim to provide a course in statistics or machine learning. It aims to provide researchers with the means to tackle commonly encountered data handling tasks effectively, in order to increase the overall efficiency of their research.

The course has been developed as a specialist course for the Doctoral schools of Ghent University, but can be taught to others upon request (and the material is freely available to re-use).

Getting started

The course uses Python 3 and some data analysis packages such as Pandas, Numpy and Matplotlib. To install the required libraries, we highly recommend Anaconda or miniconda (https://www.anaconda.com/download/) or another Python distribution that includes the scientific libraries (this recommendation applies to all platforms: Windows, Linux and Mac).

For detailed instructions to get started on your local machine, see the setup instructions.

In case you do not want to install everything and just want to try out the course material, use the environment set up by Binder and open the notebooks right away.

Contributing

Found a typo or have a suggestion? See how to contribute.

ds-python-data-analysis's Issues

typo case1_bike_count

add logic operators to python_recap

In the recap notebook python_rehearsal, & and | are not introduced. It would be good to show them when working on boolean operators.
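A minimal sketch of the distinction this issue asks for, assuming NumPy is available; the array name `values` and its contents are made up for illustration. `and` / `or` work on single booleans, whereas `&` / `|` combine boolean arrays element-wise and need parentheses around each comparison.

```python
import numpy as np

# Plain Python booleans: use `and` / `or`
a, b = True, False
both = a and b      # False
either = a or b     # True

# Boolean arrays: use `&` / `|` (element-wise).
# Each comparison needs its own parentheses because & and | bind tighter than < and >.
values = np.array([1, 5, 10, 15])   # hypothetical example data
in_range = (values > 2) & (values < 12)
outside = (values < 2) | (values > 12)
```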

update link of the python date format reference

In notebook pandas_04_time_series_data.ipynb, the reference to the Python documentation on the format of strings is outdated (Python 3.5); it should refer to the latest Python version.

Typo in case2_biodiversity_processing

(text after solve_double_field_entry)

The function takes a DataFrame as input, splits the record into separate rows and returnd ... -> returns

Some small typos

01-basic.ipynb

test out the nxt two lines one at a time --> next
If their not yet a string --> they are

02-control_flow.ipynb

Different exceptions can be combined, rasingin different type of errors --> raising

03-functions.ipynb

It is just the same al all the other objects we worked with! --> as

05-numpy.ipynb

We use matplotlib to sho histogram --> show

python_rehearsal.ipynb

first 'builtin', a bit later 'built-in'

FutureWarning when concatenating multiple dataframes

In notebook pandas_05_combining_datasets, when using the concat function, the following future warning is provided:

FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=False'.

To retain the current behavior and silence the warning, pass 'sort=True'.

  """Entry point for launching an IPython kernel.

During the course, we should use this to explain what a FutureWarning actually is. In the future, we can adjust this and turn it into a note.
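The warning quoted above comes from an older pandas release. A small sketch of what the `sort` keyword of `pd.concat` controls, using two made-up frames (`df1`, `df2`) whose columns are not aligned; passing `sort` explicitly avoids depending on the default.

```python
import pandas as pd

# Two hypothetical frames whose columns are not in the same order
df1 = pd.DataFrame({"b": [1, 2], "a": [3, 4]})
df2 = pd.DataFrame({"a": [5, 6], "c": [7, 8]})

# sort=False: keep the column labels in order of appearance
unsorted = pd.concat([df1, df2], sort=False)

# sort=True: sort the non-concatenation axis (here the column labels)
sorted_cols = pd.concat([df1, df2], sort=True)
```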

np.isclose instead of ==

in: notebooks/python_recap/python_rehearsal.ipynb

maybe it is a good idea to change the value check to np.isclose

EXERCISE: Convert all values -99 of the array AR3 into Nan-values (Note that Nan values can be provided in float arrays as `np.nan`)

original:

AR3[AR3 == -99.] = np.nan

new proposal:

AR3[np.isclose(AR3, -99)] = np.nan
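To illustrate why the proposal can matter, here is a hedged sketch: the array contents are invented and only stand in for AR3 from the exercise. Exact comparison works for a literal -99, but it can miss values that differ by floating-point rounding, which `np.isclose` tolerates.

```python
import numpy as np

# Exact comparison can miss values that differ only by rounding error
computed = np.array([0.1 + 0.2, -99.0])
reference = np.array([0.3, -99.0])
exact_match = computed == reference            # first element is False
close_match = np.isclose(computed, reference)  # both elements are True

# Hypothetical array standing in for AR3 from the exercise
AR3 = np.array([1.0, -99.0, 3.5, -99.0])
AR3[np.isclose(AR3, -99)] = np.nan
```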

paths

paths are not OS-independent.

os.path (for example os.path.join and os.path.split) should alleviate this
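A short sketch of OS-independent path handling with the standard library; the directory and file names are hypothetical and not taken from the notebooks.

```python
import os

# Build a path without hard-coding "/" or "\\"; the names are hypothetical
data_file = os.path.join("data", "airbase", "BETR801.csv")

# Split it again into the directory part and the file name
directory, filename = os.path.split(data_file)
```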

heatmap creation at case2_biodiversity_analysis

The question is: Create a table, called heatmap_prep, based on the survey_data DataFrame with in the row index the years, in the column the months and as values of the table, the counts for each of these year/month combinat

As this is a groupby exercise, it does not make much sense to use plotnine starting from the heatmap_prep variable (as it isn't tidy anymore). Shouldn't we just introduce seaborn.heatmap for this purpose?

If yes, would this notebook be the proper timing to link the students to the seaborn documentation?
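A possible shape for the heatmap_prep table, sketched with invented data; the column name `eventDate` and the helper columns `year` and `month` are assumptions, not necessarily those used in the notebook.

```python
import pandas as pd

# Hypothetical stand-in for survey_data with one datetime column
survey_data = pd.DataFrame({
    "eventDate": pd.to_datetime(
        ["2019-01-05", "2019-01-20", "2019-03-02", "2020-01-11"]
    )
})
survey_data["year"] = survey_data["eventDate"].dt.year
survey_data["month"] = survey_data["eventDate"].dt.month

# Count records per year/month and move the months to the columns
heatmap_prep = (
    survey_data.groupby(["year", "month"])
    .size()
    .unstack(fill_value=0)
)
```

A wide table like this is no longer tidy, which is why it could be handed directly to `seaborn.heatmap(heatmap_prep)` rather than to plotnine.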

error in solution pandas_07_reshaping_data4.py

Exercise is to make a table of medians, but the aggregation method of the solution is mean.
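The difference is easy to see on a small example. This sketch uses made-up data and column names (`Sex`, `Fare`); it is not the solution file itself, only an illustration of switching `aggfunc` in `pivot_table`.

```python
import pandas as pd

# Hypothetical data where mean and median differ
df = pd.DataFrame({
    "Sex": ["male", "male", "male", "female", "female"],
    "Fare": [10.0, 30.0, 80.0, 20.0, 50.0],
})

# What the exercise asks for: a table of medians
medians = df.pivot_table(index="Sex", values="Fare", aggfunc="median")

# What a mean-based solution would return instead
means = df.pivot_table(index="Sex", values="Fare", aggfunc="mean")
```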

nbtutor needs to be activated and checked for visualisation notebooks

rendering of html exercise sections not properly displayed

Due to the increased indentation (TAB), the list is interpreted as a code block and not properly rendered

potential data set for course

http://www.eandis.be/nl/open-data-over-de-energiemarkt

typo in pandas_03_selecting_data.ipynb

ïloc --> iloc

REMEMBER:

Advanced indexing with loc and ïloc

**loc**: select by label: `df.loc[row_indexer, column_indexer]`
**iloc**: select by position: `df.iloc[row_indexer, column_indexer]`
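A small sketch of the two indexers from the REMEMBER box, using an invented `countries` frame (the values are illustrative only).

```python
import pandas as pd

# Hypothetical frame with a label-based index
countries = pd.DataFrame(
    {"capital": ["Brussels", "Paris", "Berlin"],
     "population": [11.3, 64.3, 81.3]},
    index=["Belgium", "France", "Germany"],
)

by_label = countries.loc["France", "capital"]   # select by label
by_position = countries.iloc[1, 0]              # select by position

# loc slices include the end label, iloc slices exclude the end position
label_slice = countries.loc["Belgium":"France", "population"]
position_slice = countries.iloc[0:2, 1]
```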

Case 2: custom matching function

Exercise says to use species and genus separately in input, solution uses the concatenated string as input (but says to give separate strings as input in the function description).

typo in pandas_04_time_series_data.ipynb notebook

s = pd.Series(['2016-12-09 10:00:00', '2016-12-09, 11:00:00', '2016-12-09 12:00:00']) has a redundant ,
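With the redundant comma removed, the series can be parsed with a single explicit format string. A minimal sketch; passing `format` is a choice made here for illustration and is not taken from the notebook.

```python
import pandas as pd

# Corrected series: no stray comma in the second entry
s = pd.Series(['2016-12-09 10:00:00', '2016-12-09 11:00:00', '2016-12-09 12:00:00'])

# An explicit format makes the expected layout of the strings visible
timestamps = pd.to_datetime(s, format="%Y-%m-%d %H:%M:%S")
```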

Error message opening Anaconda Navigator

Traceback (most recent call last):
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 620, in wrapper
    return fun(self, *args, **kwargs)
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 690, in cmdline
    ret = cext.proc_cmdline(self.pid)
PermissionError: [WinError 5] Access is denied

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\anaconda_navigator\exceptions.py", line 75, in exception_handler
    return_value = func(*args, **kwargs)
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\anaconda_navigator\app\start.py", line 108, in start_app
    if misc.load_pid() is None: # A stale lock might be around
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\anaconda_navigator\utils\misc.py", line 384, in load_pid
    cmds = process.cmdline()
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil\__init__.py", line 701, in cmdline
    return self._proc.cmdline()
  File "C:\Users\iochatzi\AppData\Local\Continuum\Anaconda3\lib\site-packages\psutil\_pswindows.py", line 623, in wrapper
    raise AccessDenied(self.pid, self._name)
psutil.AccessDenied: psutil.AccessDenied (pid=1208)

typo in solutions/python_rehearsal2.py

The function uses the variable pressure_hPa instead of pressure_sea_level, which is the variable named in the function definition

Typo pandas_03

small typo in 'remember box':
Advanced indexing with loc and ïloc

(should be iloc, not ï)

update documentation link time series

As the urls of the documentation changed, we have to update this in the documentation: http://pandas.pydata.org/pandas-docs/stable/timeseries.html#offset-aliases

Notes from June 2019 workshop

General:

first morning: mention that people need to download latest version of the material (idea: tag to make a release before the course)
is %matplotlib inline still needed?

Slides:

update conda page (environment, source activate -> activate, ..)
Ipython -> IPython
IDE's update with Atom / VSCode

Pandas 1:

column names in overview table titanic wrong
no .values for dataframe
html boxes

Pandas 3 selecting:

and -> & to combine conditions => add tip for this

Pandas 3b indexing:

update setting with copy warning section with more explanation (temporary variable, boolean selection on population, show both and that the one does not work countries.loc[countries['population'] > 50, 'population'] = 50)
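A sketch of the two variants mentioned in the Pandas 3b note, on an invented `countries` frame. The chained form is only shown as a comment, since its outcome depends on the pandas version; the `.loc` form assigns in a single step on the original frame.

```python
import pandas as pd

# Hypothetical data; population in millions
countries = pd.DataFrame({
    "country": ["Belgium", "France", "Germany"],
    "population": [11.3, 64.3, 81.3],
})

# Chained indexing assigns to a temporary object and may leave `countries`
# unchanged (this is what the SettingWithCopyWarning is about):
# countries[countries['population'] > 50]['population'] = 50

# A single .loc call with a boolean mask and a column label works on the original
countries.loc[countries['population'] > 50, 'population'] = 50
```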

Pandas 4:

Ways to shorten this a bit?

Pandas 6 - groupby:

The color=C0 in the last exercise is not needed anymore

pandas 7 reshape:

last exercise pivot_table mean -> median

Visualization - plotnine:

the by in the first pandas example not doing anything?

Visualization - landscape:

update: Bokeh now supports pdf

Case 1 - bike count:

have raw data as backup

Case 2 - biodiversity processing:

groupby doesn't count NaNs, value_counts does -> add a note to explain that difference?

Case 2 - biodiversity analysis:

don't use built-in sum
subselection_sex -> make subtask to create this variable

Case 3 bacterial resistance:

initial tidying: we lose the "experiment id" or "repetition id" in the original data (multiple repetitions for the same phage / genotype, which now is a single row) -> that information is lost
creation of density_mean -> select column 'optical_density' before .mean() -> no mean of survival etc .. (in solution)

Typo in groupby notebook

"younger that"

+ exercise says "younger than 25" but solution says <=25

Collection of errors / suggestions 2017

air quality processing: split('/') -> os.path.split()
matplotlib notebook:
- "Essential stuff; 2. creating objects" -> ax -> ax1
- "Best of both worlds" example of tick_params does not have any effect when using "seaborn-whitegrid"
First exercise in "pandas_02_basic_operations.ipynb" -> mention to use population series (and not the countries dataframe; not yet seen how to select single element of dataframe)

alternative solution str split exercise in pandas_03a_selecting_data

Currently, for the exercise

Split the 'Name' column on the , extract the first part (the surname), and add this as new column 'Surname' .

The provided solution is df['Surname'] = df['Name'].apply(lambda x: x.split(',')[0]) whereas probably df["Surname"] = df["Name"].str.split(",").str.get(0) makes more sense as a solution at this stage of the course?
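Both variants give the same surnames, as this small sketch with two sample names shows; the frame here is a stand-in for the Titanic data, not the full dataset.

```python
import pandas as pd

# Two sample names in the "Surname, Title. First names" layout
df = pd.DataFrame({"Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina"]})

# Current solution: a Python-level function applied to every element
with_apply = df["Name"].apply(lambda x: x.split(",")[0])

# Proposed alternative: the vectorised string accessor
with_str = df["Name"].str.split(",").str.get(0)

df["Surname"] = with_str
```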

Typo in visualization_02_plotnine

Similar to Pandas handling above, we can set up a matplotlib Figure wit plotnine. -> with

missing link in biodiversity processing notebook

"in depth description" link is not a link

Spelling mistakes/typos

00-jupyter_introduction.ipynb > 2.7 Trouble... > "you're cell" should be "your cell" and "you're notebook" should be "your notebook"

change alias of plotnine

Quote from the plotnine developer: For an alias of plotnine, you should consider p9 instead of pn. I have seen plotnine abbreviated that way in a couple of places.

missing answers in pandas_03_selecting_data.ipynb

The following two exercises do not have the correct answers

EXERCISE:

Change the capital of the UK to Cambridge

len(titles[titles['year'] // 10 == 195])

proposed answer:

countries.loc['United Kingdom', 'capital'] = 'Cambridge'

EXERCISE:

Select all countries whose population density is between 100 and 300 people/km²

# %load _solutions/pandas_03_selecting_data42.py
inception = cast[cast['title'] == 'Inception']

proposed answer:

countries[(countries['density'] < 300) & (countries['density'] > 100)]
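The two proposed answers can be checked on a toy frame. The `countries` data below is invented (the density values are placeholders, not real statistics) and only mirrors the column names used in the exercises.

```python
import pandas as pd

# Hypothetical data indexed by country name; densities are placeholder values
countries = pd.DataFrame(
    {"capital": ["Brussels", "London", "Paris"],
     "density": [370.0, 265.0, 95.0]},
    index=["Belgium", "United Kingdom", "France"],
)

# First exercise: change a single value by row label and column label
countries.loc['United Kingdom', 'capital'] = 'Cambridge'

# Second exercise: combine two conditions with & and parentheses
selected = countries[(countries['density'] < 300) & (countries['density'] > 100)]
```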

SHIFT+TAB instead of CTRL+TAB

In notebook "case2_biodiversity_processing (autosaved)", instead of

" # get help of the function by CTRL + TAB"

should be

" # get help of the function by SHIFT + TAB"

typo python_recap/01-basic

Exercise: With the dir(list) command, all the methods of the list type are printed. However, where ... -> we're

case2_biodiversity_analysis: difference in dataframe name between exercise and solutions

In the first exercise (reading the data), it is asked to save the data as survey_data, while in the subsequent solutions, surver_data_processed is used.

Show months as discrete integers in heatmap plotnine

In the case2_biodiversity_analysis notebook, the months on the x-axis of the plotnine heatmap are shown as continuous values (i.e. 0, 2.5, 5, ...). It would be nice if they were shown as discrete values in the solution (i.e. 1, 2, 3, ...) by setting the x-labels as a factor.

Bug in python_rehearsal when loading _solutions/python_rehearsal4.py

There is a line in the python_rehearsal notebook that throws an error. Not sure if this is done on purpose to illustrate something or if this really is a bug

First time running the cell gives this:

# %load _solutions/python_rehearsal4.py
from barometric_formula import barometric_formula

The second time running the cell yields the following error trace

# %load _solutions/python_rehearsal4.py
from barometric_formula import barometric_formula

ModuleNotFoundError Traceback (most recent call last)
in ()
1 # %load _solutions/python_rehearsal4.py
----> 2 from barometric_formula import barometric_formula

ModuleNotFoundError: No module named 'barometric_formula'

Strange ax behaviour in matplotlib

When making a bar plot, I forgot/didn't know to define a second ax object, but I still get the correct output.

Code:

unique_species_by_plot = survey_data.groupby(["verbatimLocality"])["name"].nunique()

fig, ax = plt.subplots(figsize=(8,8))
unique_species_by_plot.plot(kind = "barh", color = "pink")
ax.set_ylabel('Plot ID number');
ax.set_xlabel('Number unique species');

Output:
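A likely explanation, offered as a sketch rather than a definitive answer: when a pandas Series is plotted without an `ax` argument, pandas appears to draw on the currently active matplotlib axes, which here is the one just created by `plt.subplots`. Passing `ax=` explicitly makes this visible in the code. The data below is invented.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical stand-in for the groupby result (a Series)
unique_species_by_plot = pd.Series([3, 5, 2], index=["1", "2", "3"])

# Implicit: no ax passed, the Series plot ends up on the current axes
fig, ax = plt.subplots(figsize=(8, 8))
implicit_ax = unique_species_by_plot.plot(kind="barh", color="pink")

# Explicit: state which axes to draw on
fig2, ax2 = plt.subplots(figsize=(8, 8))
explicit_ax = unique_species_by_plot.plot(kind="barh", color="pink", ax=ax2)
ax2.set_ylabel("Plot ID number")
ax2.set_xlabel("Number unique species")
```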

exercise solution python_rehearsal108.py

The exercise solution python_rehearsal108.py suggests evaluating whether a variable equals 0. This could be done by using the variable itself as the evaluator
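A sketch of the two styles, with hypothetical function names. Note that they are not fully equivalent: truthiness also treats `None`, empty strings and empty containers as false.

```python
def is_zero_explicit(value):
    """Compare against 0 explicitly."""
    return value == 0


def is_zero_truthy(value):
    """Use the value itself: 0 is falsy, so `not value` is True for 0.

    This is also True for None, "" and empty containers.
    """
    return not value
```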

Use `DataFrame.explode` instead of custom function

In biodiversity processing, a custom function is written to transform 2 entries in 1 column/cell into 2 rows. In recent pandas versions, this can be handled with the explode method.
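A hedged sketch of the idea on invented data; the column names and the " and " separator are assumptions and may differ from the actual notebook.

```python
import pandas as pd

# Hypothetical records where one cell holds two entries
df = pd.DataFrame({
    "species": ["DM and SH", "PF"],
    "plot": [1, 2],
})

# Split the cell into a list, then give each list element its own row
exploded = (
    df.assign(species=df["species"].str.split(" and "))
    .explode("species")
    .reset_index(drop=True)
)
```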

Collection of errors / suggestions

@jorisvandenbossche talking:

matplotlib notebook: example of fig vs ax -> 'ax' -> 'ax1'
autofmt_xdate -> can take rotation keyword!
bike data -> drop_duplicates:
- can also use index.duplicated
- or for drop_duplicates -> use subset to drop based on the datetimes
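The options listed for the bike data can be sketched as follows, on a tiny invented frame with a repeated timestamp in the index; the names `bike`, `count` and `timestamp` are illustrative.

```python
import pandas as pd

# Hypothetical counts with a duplicated timestamp in the index
idx = pd.to_datetime(["2017-01-01 00:00", "2017-01-01 00:00", "2017-01-01 01:00"])
bike = pd.DataFrame({"count": [5, 7, 3]}, index=idx)

# drop_duplicates compares column values only and ignores the index,
# so nothing is dropped here because the counts differ
by_values = bike.drop_duplicates()

# Option 1: filter on duplicated index entries
by_index = bike[~bike.index.duplicated(keep="first")]

# Option 2: move the datetimes to a column and use `subset`
by_subset = (
    bike.rename_axis("timestamp")
    .reset_index()
    .drop_duplicates(subset="timestamp")
    .set_index("timestamp")
)
```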

Notes from Bordeaux workshop

Notes from Bordeaux workshop:

pandas intro notebooks can use a re-work:
- data structures -> check the pandas-tutorial ones and compare (first dataframe, then series?)
- data structures -> already do exercise with titanic? (loading the data, seeing the first 5 rows, plotting one column?)
- basic operations -> countries["capital"].apply(lambda x: len(x)) is bad example, as doing apply(len) is the same
- basic operations -> exercises with titanic ?
- basic operations -> leave out alignment example? It only adds cognitive load, and does not occur in the examples (that's maybe for in the "advanced indexing" one?)
pandas indexing / selecting data:
- use titanic for exercises (instead of countries) ?
- split in two parts? basic (selecting column(s) + filtering rows) and advanced (actual non-default index, loc/iloc, assignment, index/multi-index)
  check: do we use loc/iloc in some of the case studies? (I assume we might use it for a combined boolean mask + column selection)
new notebook: working with missing data ?
reshaping: too much ? (only unstack and not stack?)
bike_count case study:
- don't use dayfirst as the solution, only show the comparison
- more hints (eg "use value_counts", since they haven't seen that function yet)
  maybe work with hidden hint html?
- drop_duplicates -> does not take into account the index!
matplotlib visualization notebook:
- ax -> ax1
- set_axis_bgcolor no longer exists (g.ax_joint.set_axis_bgcolor('0.1') in one of the seaborn examples)
general: more hints on what to use
provide good cheatsheet

python intro: too little too fast if you don't know python yet, boring / too slow if you already know Python
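Regarding the basic-operations note above on `apply(lambda x: len(x))`: a short sketch with made-up capitals showing that the lambda, the bare function and the string accessor give the same result.

```python
import pandas as pd

# Hypothetical series of capital names
capitals = pd.Series(["Brussels", "Paris", "Berlin"], name="capital")

with_lambda = capitals.apply(lambda x: len(x))  # the example from the notebook
with_function = capitals.apply(len)             # same result, no lambda needed
with_accessor = capitals.str.len()              # vectorised string method
```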

typo in case2_biodiversity_cleaning

Scenario:
You are interested in occurrence data for a number of species in Flanders.

Exercise numpy

Second-to-last exercise -> Change all even positions of matrix AR to 30

explain `aggregate` explicitly and drop the `fig, ax` in exercise time series

The fig, ax = plt.subplots comes too early (in the time series notebook) and can easily be avoided by including the aggregate way of calculating multiple aggregated statistics in the notebooks
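One way this could look in the time series notebook, sketched on an invented series named `flow`: a single `agg` call after `resample` returns several statistics as columns, so no separate axes are needed to compare them.

```python
import pandas as pd

# Hypothetical measurements, two per day
idx = pd.to_datetime([
    "2016-12-09 10:00", "2016-12-09 22:00",
    "2016-12-10 10:00", "2016-12-10 22:00",
])
flow = pd.Series([1.0, 3.0, 10.0, 20.0], index=idx)

# Several aggregated statistics per day in one call
daily_stats = flow.resample("D").agg(["min", "max", "mean"])
```

Calling `daily_stats.plot()` would then draw all three statistics in one figure.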

case3 bacterial resistance: variable median/mean

Exercise on optical density: create density_mean:

In the assignment:
Calculate for each combination of Bacterial_genotype, Phage_t and experiment_time_h the mean optical_density and store the result as a dataframe called density_mean
==> mean of optical density is asked

In your solution:
density_mean = (tidy_experiment
.groupby(['Bacterial_genotype','Phage_t', 'experiment_time_h'])
.median().reset_index())
==> median optical density and survival are given
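A possible corrected version, shown on invented data: select `optical_density` before aggregating and use `mean`, so that only the mean optical density is returned. The values and the extra `survival` column are placeholders.

```python
import pandas as pd

# Hypothetical tidy data with three repetitions for one combination
tidy_experiment = pd.DataFrame({
    "Bacterial_genotype": ["WT", "WT", "WT", "MUT"],
    "Phage_t": ["T4", "T4", "T4", "T4"],
    "experiment_time_h": [0, 0, 0, 0],
    "optical_density": [0.1, 0.2, 0.6, 0.4],
    "survival": [1, 1, 0, 1],
})

# Select the column first, then take the mean per combination
density_mean = (
    tidy_experiment
    .groupby(['Bacterial_genotype', 'Phage_t', 'experiment_time_h'])['optical_density']
    .mean()
    .reset_index()
)
```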

incorrect solutions

EXERCISE:

Using groupby(), calculate the average age for each sex.

given answer:

# %load _solutions/pandas_06_groupby_operations10.py
df25 = df[df['Age'] <= 25]
df25['Survived'].sum() / len(df25['Survived'])

proposed answer:

df.groupby('Sex')['Age'].aggregate(np.mean)
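The proposed answer can also be written with the built-in `mean` method, which avoids passing a NumPy function. A sketch on invented data with the same column names as the exercise:

```python
import pandas as pd

# Hypothetical passengers
df = pd.DataFrame({
    "Sex": ["male", "male", "female", "female"],
    "Age": [20.0, 40.0, 30.0, 50.0],
})

# Average age for each sex
mean_age = df.groupby("Sex")["Age"].mean()
```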

Move to jupyterlab instead of jupyter notebook

If we want to switch to jupyterlab instead of jupyter notebook, some things we need to do:

Update environment.yml file
Update the setup instructions (https://github.com/jorisvandenbossche/DS-python-data-analysis/blob/master/setup.md#3-starting-jupyter-notebook)
Ensure that everything works in jupyterlab:
- I think %matplotlib notebook does not work

Add a note to remember agg when using groupby to derive multiple statistics at the same time?

We should:

explain the usage of agg earlier, e.g. in pandas_02_basic_operations notebook
remove/adjust the reference to groupby
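If `agg` is introduced in the basic-operations notebook, it could be shown without any groupby. A minimal sketch on an invented `population` series:

```python
import pandas as pd

# Hypothetical population values (in millions)
population = pd.Series(
    [11.3, 64.3, 81.3, 16.9],
    index=["Belgium", "France", "Germany", "Netherlands"],
)

# Several statistics at once, no groupby involved
summary = population.agg(["min", "max", "mean"])
```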

Typo in notebook 00-jupyter_introduction

Very small typo in the module below line [14]: "to s stop editing," --> to stop editing
