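1 Lab Content

Install Jupyter Notebook and the related Python environment; installing via Anaconda is recommended.
Follow the tutorial to complete the lab, which covers the following:
- Understand the basic principles of the Notebook tool
- Learn basic Python syntax and write a selection sort program
- Work through a Python data analysis example
Share the completed Jupyter Notebook on GitHub.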

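2 Lab Record

2.1 Creating a new Notebook

A notebook can be created in several ways, for example from the command line or inside VS Code. The VS Code approach is as follows.

In VS Code, click the search bar and type Jupyter Notebook:.. to generate an Untitled.ipynb file.

A notebook can also be created in PyCharm.

In PyCharm, click File->New->Jupyter Notebook, then enter a file name to create the corresponding .ipynb file.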

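There are two key elements in a Notebook:

cell
kernel

Cell

There are two main types:

code type: contains code that the kernel can execute; the output is displayed below the cell after execution.
markdown type: a cell written in the Markdown markup language.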

print('Hello World!')

Hello World!

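After the code runs, the label to the left of the cell changes from In [ ] to In [1]. In stands for input, and the number in brackets is the order in which the kernel executed the cell. In [*] indicates that the cell is still running. The following example shows In [*] briefly.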

import time
time.sleep(3)

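Cell modes

There are two modes: edit mode and command mode.

Edit mode: entered with the Enter key; green outline
Command mode: entered with the Esc key; blue outline

Common shortcuts (in command mode, unless noted otherwise)

Up and Down arrow keys move between cells
A or B inserts a cell above or below
M converts a code cell into a Markdown cell
Y sets the active cell to a code cell
D+D (pressed twice) deletes the cell
Z undoes the deletion
Ctrl + Shift + - (in edit mode) splits the cell in two at the cursor.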

Kernel

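Every notebook runs on a kernel. When a code cell is executed, the code runs inside the kernel and the result is displayed on the page. The kernel's state persists across the whole document and spans all cells, which means that a function or variable defined in one cell of a Notebook can be used in any other cell. For example: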

import numpy as np
def square(x):
    return x * x

After running the code cell above, later cells can use np and square.

x = np.random.randint(1, 10)
y = square(x)
print('%d squared is %d' % (x, y))

3 squared is 9

2.2 A simple Python program example

Define the selection_sort function to perform selection sort.

def selection_sort(arr):
    n = len(arr)
    for i in range(n-1):
        minIndex = i
        for j in range(i+1, n):
            if arr[j] < arr[minIndex]:
                minIndex = j
        arr[i], arr[minIndex] = arr[minIndex], arr[i]
    return arr

Define a test function for testing.

It sets up the input data, calls selection_sort, and prints the result.

def test():
    arr = [3,2,5,1,7,8,6,4,9]
    selection_sort(arr)
    print(arr)

test()

[1, 2, 3, 4, 5, 6, 7, 8, 9]
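A single hand-picked list only exercises one case. As a rough sketch, the same algorithm can be compared against Python's built-in sorted() on random inputs. The helper name check_selection_sort and its parameters are illustrative assumptions, not part of the original lab.

```python
import random


def selection_sort(arr):
    """Sort arr in place in ascending order and return it (same algorithm as above)."""
    n = len(arr)
    for i in range(n - 1):
        min_index = i
        for j in range(i + 1, n):
            if arr[j] < arr[min_index]:
                min_index = j
        arr[i], arr[min_index] = arr[min_index], arr[i]
    return arr


def check_selection_sort(trials=100, seed=0):
    """Hypothetical helper: compare selection_sort with sorted() on random lists."""
    rng = random.Random(seed)
    for _ in range(trials):
        # Lists of length 0 to 20, so the empty and single-element cases are covered too.
        data = [rng.randint(-50, 50) for _ in range(rng.randint(0, 20))]
        if selection_sort(list(data)) != sorted(data):
            return False
    return True


print(check_selection_sort())
```

Passing a copy (list(data)) matters here because selection_sort modifies its argument in place.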

2.3 A data analysis example

Setup

Import the required libraries.

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

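pandas is used for data processing, matplotlib for plotting, and seaborn to make the plots look nicer. The first line is not a Python statement but a so-called line magic. A single % marks a line magic, which applies to one line; %% marks a cell magic, which applies to the whole cell. Here %matplotlib inline tells the notebook to render matplotlib figures inline, directly below the cell.

Next, load the dataset.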

df = pd.read_csv('../res/fortune500.csv')

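Inspecting the dataset

The df object produced by the code above is a DataFrame, a commonly used pandas data structure that can be thought of as a data table.
df.head() shows the first 5 rows of the table by default.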

df.head()

	Year	Rank	Company	Revenue (in millions)	Profit (in millions)
0	1955	1	General Motors	9823.5	806
1	1955	2	Exxon Mobil	5661.4	584.8
2	1955	3	U.S. Steel	3250.4	195.4
3	1955	4	General Electric	2959.1	212.6
4	1955	5	Esmark	2510.8	19.1

df.tail() shows the last 5 rows of the table by default.

df.tail()

	Year	Rank	Company	Revenue (in millions)	Profit (in millions)
25495	2005	496	Wm. Wrigley Jr.	3648.6	493
25496	2005	497	Peabody Energy	3631.6	175.4
25497	2005	498	Wendy's International	3630.4	57.8
25498	2005	499	Kindred Healthcare	3616.6	70.6
25499	2005	500	Cincinnati Financial	3614.0	584

Rename the columns so they are easier to access later.

df.columns = ['year', 'rank', 'company', 'revenue', 'profit']

len() returns the number of rows in the table, which can be used to check whether the data loaded completely.

len(df)

df.dtypes shows the data type of each column.

df.dtypes

year         int64
rank         int64
company     object
revenue    float64
profit      object
dtype: object

The profit column is expected to be of type float, but dtypes reports object, so it probably contains non-numeric values. A regular expression is used to check.

non_numberic_profits = df.profit.str.contains('[^0-9.-]')
df.loc[non_numberic_profits].head()

	year	rank	company	revenue	profit
228	1955	229	Norton	135.0	N.A.
290	1955	291	Schlitz Brewing	100.0	N.A.
294	1955	295	Pacific Vegetable Oil	97.9	N.A.
296	1955	297	Liebmann Breweries	96.0	N.A.
352	1955	353	Minneapolis-Moline	77.4	N.A.
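To make the regular expression check easier to follow, here is a minimal sketch on a handful of made-up values. Only the 'N.A.' string comes from the output above; the other values and the variable names are assumptions for illustration.

```python
import pandas as pd

# Hypothetical sample values; only 'N.A.' is taken from the real dataset output.
profits = pd.Series(['806', '584.8', 'N.A.', '-12.5', '19.1'])

# True where the string contains any character other than a digit, '.' or '-'.
non_numeric = profits.str.contains(r'[^0-9.-]')

print(non_numeric.tolist())   # one flag per value
print(int(non_numeric.sum())) # number of flagged values
```

The resulting boolean Series can be used as a mask, which is how df.loc selects the offending rows in the real dataset.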

Use len() to count how many such records there are.

len(df.profit[non_numberic_profits])

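To see more directly how these invalid records are distributed across the dataset, plot a histogram of the number of invalid records per year.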

bin_sizes, _, _ = plt.hist(df.year[non_numberic_profits], bins=range(1955, 2006))

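Every individual year has fewer than 25 invalid records, i.e. under 5% of the 500 records for that year. This is within an acceptable range, so these records are dropped.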

df = df.loc[~non_numberic_profits]
df.profit = df.profit.apply(pd.to_numeric)
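As an alternative sketch of the same cleaning step, pd.to_numeric accepts errors='coerce', which turns unparseable strings into NaN so that they can be dropped afterwards. This is not the approach used above, and the small table below is hypothetical.

```python
import pandas as pd

# Hypothetical mini table; the column names follow the renamed columns above.
sample = pd.DataFrame({
    'year': [1955, 1955, 1956],
    'profit': ['806', 'N.A.', '19.1'],
})

# errors='coerce' converts strings that cannot be parsed into NaN instead of raising.
sample['profit'] = pd.to_numeric(sample['profit'], errors='coerce')

# Drop the rows whose profit could not be parsed.
cleaned = sample.dropna(subset=['profit'])

print(len(cleaned))
print(cleaned['profit'].dtype)
```

This variant skips the regular expression, at the cost of also discarding any value that is numeric in spirit but formatted unexpectedly.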

Check the number of records and the column data types again.

len(df)

df.dtypes

year         int64
rank         int64
company     object
revenue    float64
profit     float64
dtype: object

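The data types show that the invalid records have been cleaned out.

Plotting with matplotlib

Next, plot the average profit and revenue grouped by year. First define the variables and a helper function.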

group_by_year = df.loc[:, ['year', 'revenue', 'profit']].groupby('year')
avgs = group_by_year.mean()
x = avgs.index
y1 = avgs.profit
def plot(x, y, ax, title, y_label):
    ax.set_title(title)
    ax.set_ylabel(y_label)
    ax.plot(x, y)
    ax.margins(x=0, y=0)
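To illustrate what the groupby and mean calls above produce, here is a small sketch with made-up numbers chosen so the averages can be checked by hand. The values are assumptions, not data from fortune500.csv.

```python
import pandas as pd

# Hypothetical values chosen so the per-year averages are easy to verify by hand.
sample = pd.DataFrame({
    'year': [1955, 1955, 1956, 1956],
    'revenue': [100.0, 200.0, 300.0, 500.0],
    'profit': [10.0, 30.0, 20.0, 40.0],
})

# One row per year; each column holds the mean over that year's records.
avgs = sample.groupby('year').mean()

print(avgs.index.tolist())      # the years become the index, used as the x axis
print(avgs['profit'].tolist())
print(avgs['revenue'].tolist())
```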

Plot the average profit with matplotlib.

fig, ax = plt.subplots()
plot(x, y1, ax, 'Increase in mean Fortune 500 company profits from 1955 to 2005', 'Profit (millions)')

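The curve looks roughly exponential at first, but it has sharp dips that correspond to the early-1990s recession and the later dot-com bubble.
Plot the average revenue with matplotlib.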

y2 = avgs.revenue
fig, ax = plt.subplots()
plot(x, y2, ax, 'Increase in mean Fortune 500 company revenues from 1955 to 2005', 'Revenue (millions)')

Compute the standard deviation for each year and show it on the plots as a shaded band around the mean.

def plot_with_std(x, y, stds, ax, title, y_label):
    ax.fill_between(x, y - stds, y + stds, alpha=0.2)
    plot(x, y, ax, title, y_label)
fig, (ax1, ax2) = plt.subplots(ncols=2)
title = 'Increase in mean and std Fortune 500 company %s from 1955 to 2005'
stds1 = group_by_year.std().profit.values
stds2 = group_by_year.std().revenue.values
plot_with_std(x, y1.values, stds1, ax1, title % 'profits', 'Profit (millions)')
plot_with_std(x, y2.values, stds2, ax2, title % 'revenues', 'Revenue (millions)')
fig.set_size_inches(14, 4)
fig.tight_layout()
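The shaded band drawn by fill_between runs from mean - std to mean + std for each year. A minimal sketch of how those bounds come out, using hypothetical numbers rather than the real dataset:

```python
import pandas as pd

# Hypothetical profits: 1955 is spread out, 1956 has no spread at all.
sample = pd.DataFrame({
    'year': [1955, 1955, 1955, 1956, 1956, 1956],
    'profit': [0.0, 2.0, 4.0, 10.0, 10.0, 10.0],
})

grouped = sample.groupby('year')['profit']
mean = grouped.mean()
std = grouped.std()  # sample standard deviation (ddof=1), the pandas default

# Lower and upper edges of the band for each year.
lower = (mean - std).tolist()
upper = (mean + std).tolist()

print(lower)
print(upper)
```

A year with identical values gets a band of zero width, while a year with widely varying values gets a wide band.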

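2.4 Exporting Notebooks

Notebooks are generally shared in one of two ways: as a static, non-interactive document like this one (HTML, Markdown, PDF, etc.), or collaboratively through Git version control or Google Colab.

Preparation before sharing

A shared Notebook should include the output of its code. To make sure the results are as expected, do the following: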

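Click "Cell > All Output > Clear"
Click "Kernel > Restart & Run All"
Wait for all the code to finish running

This ensures the Notebook contains no stale intermediate results and produces stable results by running the cells in order.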
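The output-clearing step can also be sketched outside the menu, assuming the .ipynb file follows the nbformat 4 JSON layout, in which each code cell carries an outputs list and an execution_count. The function name clear_outputs and the minimal notebook dict below are hypothetical; this is an illustration, not a replacement for the menu commands.

```python
import json


def clear_outputs(notebook):
    """Hypothetical helper: empty the outputs of every code cell, modifying the dict in place."""
    for cell in notebook.get('cells', []):
        if cell.get('cell_type') == 'code':
            cell['outputs'] = []
            cell['execution_count'] = None
    return notebook


# Hypothetical minimal notebook in the nbformat 4 layout.
nb = {
    'nbformat': 4,
    'nbformat_minor': 4,
    'metadata': {},
    'cells': [
        {'cell_type': 'markdown', 'metadata': {}, 'source': ['# Title']},
        {'cell_type': 'code', 'metadata': {}, 'execution_count': 3,
         'source': ["print('Hello World!')"],
         'outputs': [{'output_type': 'stream', 'name': 'stdout',
                      'text': ['Hello World!\n']}]},
    ],
}

cleaned = clear_outputs(nb)
print(json.dumps(cleaned['cells'][1]['outputs']))
print(cleaned['cells'][1]['execution_count'])
```

For a real file, the dict would be loaded with json.load and written back with json.dump. Re-running all cells still has to happen in the Notebook itself.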

Exporting a Markdown document

Open the Notebook to be shared in the official Notebook server page and click File->Download as. Several export formats are available, including Markdown.
