guipsamora / pandas_exercises
Practice your pandas skills!
License: BSD 3-Clause "New" or "Revised" License
Hi - I am new to this site and am attempting to complete the exercises; however, the lines where I'd enter code are not editable. Am I missing something? Can you please help me figure out why I cannot add answers to those lines?
Step 8 proposes the following solution to count how many Veggie Salad Bowl there are.
chipo_salad = chipo[chipo.item_name == "Veggie Salad Bowl"]
len(chipo_salad)
However this doesn't seem like a general solution to the problem. Wouldn't it be better:
chipo[chipo.item_name == "Veggie Salad Bowl"].quantity.sum()
So that quantities get taken into account?
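A tiny illustrative frame (hypothetical values, not the real chipo data) shows how the two approaches differ as soon as one order line has quantity greater than 1:

```python
import pandas as pd

# Hypothetical mini-version of the chipo data: one order line contains
# two Veggie Salad Bowls at once.
chipo = pd.DataFrame({
    "item_name": ["Veggie Salad Bowl", "Veggie Salad Bowl", "Chicken Bowl"],
    "quantity": [1, 2, 1],
})

rows = len(chipo[chipo.item_name == "Veggie Salad Bowl"])             # counts order lines
total = chipo[chipo.item_name == "Veggie Salad Bowl"].quantity.sum()  # counts bowls

print(rows)   # 2 order lines
print(total)  # 3 bowls in total
```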
Link in 06_Stats/Wind_Stats not working. It currently is:
https://github.com/guipsamora/pandas_exercises/blob/master/Stats/Wind_Stats/wind.data
It should be:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/Wind_Stats/wind.data
From steps 12 to 14, we are asked to downsample the records to a yearly/monthly/weekly frequency for each location.
The provided solution is like below:
data.groupby(data.index.to_period('A')).mean()
I think it would be simpler to use resample function as below:
data.resample('AS').mean()
data.resample('M').mean()
data.resample('W').mean()
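As a quick sanity check, here is a sketch with synthetic daily data standing in for the wind dataset (it only assumes the frame has a DatetimeIndex, as it does after the parsing step; note that newer pandas, 2.2+, spells these offset aliases 'YS' and 'ME' instead of 'AS' and 'M'):

```python
import numpy as np
import pandas as pd

# Two years of synthetic daily observations with a DatetimeIndex.
idx = pd.date_range("1961-01-01", periods=730, freq="D")
data = pd.DataFrame({"RPT": np.arange(730, dtype=float)}, index=idx)

yearly = data.resample("AS").mean()   # one row per year start
monthly = data.resample("M").mean()   # one row per month end
weekly = data.resample("W").mean()    # one row per week

print(len(yearly), len(monthly))  # 2 years, 24 months
```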
According to the link below, the ix method is deprecated.
http://pandas-docs.github.io/pandas-docs-travis/whatsnew.html#deprecate-ix
Solutions for step 17 could be, for example (just what I came up with):
army.loc[['Arizona'], army.columns[3]]
and for step 18:
army.iloc[2, army.columns.tolist().index('deaths')]
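For illustration, a small stand-in frame (names and values are made up, not the exercise data) shows the ix-free selections; `columns.get_loc` is a slightly tidier way to get a column position than `columns.tolist().index`:

```python
import pandas as pd

# Illustrative stand-in for the 'army' frame used in the exercise.
army = pd.DataFrame(
    {"regiment": ["Nighthawks", "Dragoons", "Scouts"],
     "deaths": [523, 52, 62],
     "battles": [5, 42, 4]},
    index=["Arizona", "California", "Texas"],
)

# Label-based selection replaces army.ix[...]:
print(army.loc["Arizona", "deaths"])                  # scalar by labels
# Position-based selection, mixing a row position with a column name:
print(army.iloc[2, army.columns.get_loc("deaths")])   # row 2, 'deaths' column
```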
Also, in step 3, the note "Don't forget to include the column names" is confusing, since pd.DataFrame builds the structure fine without them, just not in the column order expected by the solutions to the following exercises.
Thanks for this great collection of exercises, it's a real treasure.
TypeError                                 Traceback (most recent call last)
in <module>()
----> 1 prices = [float(value[1:-1]) for value in chipo.item_price]
      2
      3 # reassign the column values with the updated values
      4 chipo.item_price = prices
      5

in <listcomp>(.0)
----> 1 prices = [float(value[1:-1]) for value in chipo.item_price]
      2
      3 # reassign the column values with the updated values
      4 chipo.item_price = prices
      5

TypeError: 'float' object is not subscriptable
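This error usually means the conversion cell was run twice: after the first run, item_price already holds floats, so `value[1:-1]` fails. A minimal sketch of an idempotent version (the two-row frame here is illustrative, not the real dataset):

```python
import pandas as pd

# Illustrative column in its raw state: '$' prefix and trailing space,
# as in the Chipotle item_price strings.
chipo = pd.DataFrame({"item_price": ["$2.39 ", "$10.98 "]})

# Strip the '$' only when the value is still a string, so re-running
# the cell leaves already-converted floats untouched.
chipo.item_price = chipo.item_price.apply(
    lambda v: float(v[1:-1]) if isinstance(v, str) else v
)
print(chipo.item_price.tolist())  # [2.39, 10.98]
```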
When I open the notebook in the browser (Chrome), it seems to render properly except that the cells are all grayed out. I was expecting that I would be able to type into them and run the code I write. Is that not what was intended? Thanks
Great exercises, thank you so much. Not really an issue, but a suggestion: you could use nbgrader to check the user's answers against the solution automatically. Here's a link
Hey,
I came across your pandas exercises. I am new to GitHub; how do I write code in your pandas exercises and execute it? Right now the cells are greyed out and I am unable to edit them.
Hi, thanks for the exercises provided. Very helpful.
I think the result for 06_Stats/Wind_Stats/Step 8 is incorrect.
I believe we should skip NA values when calculating the mean, but mean() already excludes NA by default.
So just using the following should be fine:
data.mean().mean()
data.shape[0] - data.isnull().sum()
or better data.notnull().sum()
data.fillna(0).values.flatten().mean()
describe(percentiles=[])
data.loc[data.index.month == 1].mean()
data.groupby(data.index.to_period('A')).mean()
data.groupby(data.index.to_period('M')).mean().head()
Overall, 11 out of 16 are either wrong or misleading. On top of that, the bulk of this notebook belongs under "Time series analysis".
In
pandas_exercises-master\09_Time_Series\Getting_Financial_Data\Exercises_with_solutions_and_code
"step 4" does not work!
Could you please share the code for step 8.
Step number 9 is not clear. What is the threshold at which you label someone as a legal drinker?
Just fixed lots of typos and errors in the "Exercise_with_solutions" parts, and some in the "Exercise" and "Solution" parts.
For the "Exercise" and "Solution" parts, I couldn't fix all of the links or titles so far.
It might be a little confusing for readers, so please grep the URLs or titles and fix them.
For some parts the commits are messy, because I had to execute some cells and refresh the results.
Rather than reviewing the fixes on GitHub, for some parts it would be better to check them in a Jupyter notebook.
Step 18. What is the age with least occurrence?
current version:
users.age.value_counts().tail(1)
Age 7 has the least occurrence (1).
However, ages 11, 10, 73, and 66 have an occurrence of 1 as well.
So the correct version:
users.age.value_counts().tail()
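To return every age tied for the minimum count, regardless of how many there are, one option is a boolean filter on the counts (the users frame below is illustrative, not the real dataset):

```python
import pandas as pd

# Illustrative ages: 25 occurs twice, every other age occurs once.
users = pd.DataFrame({"age": [25, 25, 30, 7, 11, 10, 73, 66]})

counts = users.age.value_counts()
least = counts[counts == counts.min()]  # all ages tied for the minimum count

print(sorted(least.index.tolist()))  # [7, 10, 11, 30, 66, 73]
```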
When I tried using the following code, the ordered item quantity is different:
chipo['item_name'].value_counts().head(1)
Out[48]:
Chicken Bowl 726
Name: item_name, dtype: int64
But when I try your method it gives a different value:
chipo.groupby('item_name').sum().sort_values(['quantity'], ascending=False).head(1)
              order_id  quantity
item_name
Chicken Bowl    713926       761
Can you please help me with this? An explanation would be very much appreciated.
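The two numbers differ because value_counts() counts order lines, while the groupby sums the quantity column, i.e. the number of items actually ordered. A tiny hypothetical frame makes this visible:

```python
import pandas as pd

# Hypothetical data: the second Chicken Bowl line has quantity 3.
chipo = pd.DataFrame({
    "item_name": ["Chicken Bowl", "Chicken Bowl", "Chips"],
    "quantity": [1, 3, 1],
})

lines = chipo["item_name"].value_counts()["Chicken Bowl"]             # order lines
items = chipo.groupby("item_name")["quantity"].sum()["Chicken Bowl"]  # items ordered

print(lines, items)  # 2 4
```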
In step 5, the exercise asks to create a function to capitalize strings. The function applied is x.upper(). Isn't it supposed to be x.capitalize(), since capitalizing a string means making the first character uppercase and the rest lowercase?
Step 5. Create a lambda function that captalize strings.
captalizer = lambda x: x.upper()
Expected:
Step 5. Create a lambda function that captalize strings.
captalizer = lambda x: x.capitalize()
Step 12. How many different occupations there are in this dataset?
current answer is:
len(users.occupation.unique())
the 'len' function is redundant, try:
users['occupation'].nunique()
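One caveat: nunique() drops missing values by default, while unique() keeps them, so the two can disagree on data with NaNs (the frame below is illustrative):

```python
import pandas as pd

# Illustrative occupations with one missing value.
users = pd.DataFrame({"occupation": ["engineer", "artist", None, "engineer"]})

print(len(users.occupation.unique()))             # 3 - None counts as a value
print(users["occupation"].nunique())              # 2 - NaN excluded by default
print(users["occupation"].nunique(dropna=False))  # 3 - NaN counted explicitly
```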
Ex2 - Getting and Knowing your Data
Step 10. How many items were ordered?
The answer is the same as step 9, but I think the right answer is chipo['quantity'].sum(). Do I misunderstand the question?
I think the steps 9-12, dealing with slicing with loc, need to be reworked. Some solutions do not quite match the exercise text. For example, step 12 asks for columns 3-7 (five columns), but the solution retrieves columns 5-7 (three columns)
I've found these exercises great. Thanks for putting them together
The data for Euro 12 is not there within the Filtering and Sorting Exercise
Hello, thanks for this nice repo!
round(discipline['Yellow Cards'].mean())
I guess it's the overall average rather than grouped by team:
discipline.groupby("Team").agg({"Yellow Cards": "mean"})
The links in Step 2 of 05_Merge are incorrect.
They're:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars1.csv
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Merge/Auto_MPG/cars2.csv
They should be:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars1.csv
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/05_Merge/Auto_MPG/cars2.csv
The link in Step 2 of 06_Stats is incorrect. Currently it's :
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Stats/US_Baby_Names/US_Baby_Names_right.csv
It should be:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/06_Stats/US_Baby_Names/US_Baby_Names_right.csv
Good exercises
hello,
These codes were very helpful in studying the pandas for me.
I want to post on a blog for studying data qualification exams in Korea.
Since the address points to a file page in your repository and not to the raw data itself, read_csv imports the HTML of the page rather than the data at that address.
In 02-Filtering_and_Sorting/Chipotle, step 4 and 5,
chipo['item_price'] = chipo['item_price']/chipo['quantity']
chipo['quantity'] = 1 #Dividing item_price by quantity, therefore let quantity be 1
chipo.drop_duplicates(['item_name'], keep='first', inplace=True)
chipo.sort_values(by='item_price', ascending=False, inplace=True)
display(chipo[['item_name', 'item_price']])
I'm also a beginner at Pandas, please let me know about any stupid thing that I missed. Thanks.
Link expired; please use this instead.
In my view the solution for step 10 is wrong, because the question asks how many distinct items were ordered.
So the solution should be :- len(chipo.groupby("item_name"))
add requirements.txt file to make it easy to install dependencies with pip
In a few of the early Chipotle examples, item_price is treated as the cost of a single item. It looks to me like it's actually the cost of all items of the type in that order - the most expensive item is "2 steak burritos" at $22, but a single steak burrito only costs $11.
Hi! I can't reach the train.csv file for https://github.com/guipsamora/pandas_exercises/blob/master/07_Visualization/Titanic_Desaster/Exercises_code_with_solutions.ipynb .
Thank you!
You ask: 'What is the average amount per order?'
Your solution:
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['item_price']
output: 18.81
But if we're talking about average amount per order, I assume that would mean the average revenue per order (quantity * price, what was computed in question 14):
order_grouped = chipo.groupby(by=['order_id']).sum()
order_grouped.mean()['rev']
output: 21.39
Just a matter of semantics really. This practice set is awesome btw!
In the chipotle dataset I believe the total revenue is incorrect.
It assumes that the quantity must be multiplied by the price to get the total.
(chipo['quantity']* chipo['item_price']).sum()
However, I believe that the quantity is already included in the price, as can be seen by examining
chipo[chipo['item_name'] == '6 Pack Soft Drink']
I think the following is sufficient:
chipo['item_price'].sum()
pandas_exercises/04_Apply/Students_Alcohol_Consumption/exercises
refers to a dataset that no longer exists on UCI
"04_Apply/US_Crime_Rates/Exercises_with_solutions.ipynb" has some questions on the 'Chipotle' data, which is not related to the US Crime rate data. See the questions from 'Step 11'.
Maybe these questions should be part of '04_Apply/US_Crime_Rates/Chipotle'.
We have both 'Chips and Roasted Chili-Corn Salsa' and 'Chips and Roasted Chili Corn Salsa' in the Chipotle exercise (the same item spelled two ways).
In step 10, we want to multiply all numerical values by 10.
The provided solution is:
df.applymap(times10).head(10)
But this is very slow, because it runs a regular python function on every element in the dataframe.
Better is to test each column's type, and then use pandas built in multiplication on the whole column:
for colname, coltype in df.dtypes.to_dict().items():
    if coltype.name in ['int64']:
        df[colname] = df[colname] * 10
I used %%timeit to test the two solutions. On this small dataset, my solution is 5x as fast (1.1 ms vs 5.8 ms). The difference would get larger with a larger dataset.
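An even shorter vectorised alternative is to let select_dtypes pick the numeric columns rather than looping over dtypes (a sketch on an illustrative frame; the exercise itself only targets int64 columns):

```python
import pandas as pd

# Illustrative frame with two numeric columns and one string column.
df = pd.DataFrame({"a": [1, 2], "b": [3, 4], "name": ["x", "y"]})

# Multiply every numeric column at once; string columns are untouched.
num = df.select_dtypes(include="number").columns
df[num] = df[num] * 10

print(df)
```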
Links redirect to this address:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/Visualization/Online_Retail/Online_Retail.csv
The working link right now is:
https://raw.githubusercontent.com/guipsamora/pandas_exercises/master/07_Visualization/Online_Retail/Online_Retail.csv
Hello. I want to translate the repo to Korean for Korean learners.
I am a Korean student studying data science. I think it is a very useful resource for practising pandas, so I want to share it with Korean learners.
I know it is open source under the BSD license, and I have already forked it, but I thought it better to notify you. Thanks for your work.
Hi! Earlier you accepted corrections from @maxim5 in this issue.
But I think one of them is wrong:
- The solution to step 8 is wrong: the mean value does not equal the mean of means. The right solution is:
data.fillna(0).values.flatten().mean()
because when you fill NA values with 0 you distort the entire data. I think there is no reason to pick 0 or 5 or -100 to replace NA; they should simply be skipped, just as the rest of the project does when using functions like .mean() and .sum(), which skip NA values by default.
So the solution must be something like this:
data.sum().sum() / data.notna().sum().sum()
or this:
data.values.flatten()[~np.isnan(data.values.flatten())].mean()
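The proposed formulas can be sanity-checked on a tiny frame with one missing value; data.stack().mean() is a third NaN-skipping way to take the overall mean (the frame below is illustrative, not the wind data):

```python
import numpy as np
import pandas as pd

# Small frame with one NaN to check that the formulas agree.
data = pd.DataFrame({"x": [1.0, np.nan], "y": [3.0, 5.0]})

m1 = data.sum().sum() / data.notna().sum().sum()   # sum over count of non-NA
flat = data.values.flatten()
m2 = flat[~np.isnan(flat)].mean()                  # drop NaN, then mean
m3 = data.stack().mean()                           # stack drops NaN for free

print(m1, m2, m3)  # all 3.0
```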
Anyway, I like your project and I learned a lot from it. Thank you!
It would be nice if these exercises would have a license, so one knows under which conditions one can make use of them.
I don't have any particular license in mind myself, and of course that's not my call to make, though in the name of reducing license proliferation I would suggest using the same license as pandas itself: https://github.com/pandas-dev/pandas/blob/master/LICENSE .
In the baby names exercise, in task no. 5, there is a small mistake:
#deletes Unnamed: 0
del baby_names['Unnamed: 0']
#deletes Unnamed: 0
del baby_names['Id']
instead of:
#deletes Unnamed: 0
del baby_names['Unnamed: 0']
#deletes **Id**
del baby_names['Id']
Steps 4 and 5 assume that each item has a unique price.
However,
chipo[(chipo['item_name'] == 'Chicken Bowl') & (chipo['quantity'] == 1)].item_price.unique()
returns
array([10.98, 11.25, 8.75, 8.49, 8.19, 10.58, 8.5 ])
This is covered by the drop_duplicates, but it is still misleading, as the rows aren't true duplicates once price is taken into account.
The solution to the steps need to be updated since the data size has changed.
For reference:
Given solution dataset info:
>>> food.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 65503 entries, 0 to 65502
Columns: 159 entries, code to nutrition_score_uk_100g
dtypes: float64(103), object(56)
memory usage: 79.5+ MB
Current Dataset info:
>>> food.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 356027 entries, 0 to 356026
Columns: 163 entries, code to water-hardness_100g
dtypes: float64(107), object(56)
memory usage: 442.8+ MB