
effective_pandas_book's People

Contributors

aj-white, mattharrison

effective_pandas_book's Issues

Typo

Page 193, section 23.1: "The .sort_values method will you sort the rows" -> the "you" should not be there.

TypeError: Int64 when running `jb2.pivot_table`

I have been following the code in your book in a Jupyter notebook.
There are several places where the code leads to errors. The errors also show up in the GitHub version of the code.
For example:
In your notebook Chapters 16-30 (page 302 in the physical book),
inputs 106 and 107 lead to errors. Have they been corrected?

(jb2
 .pivot_table(index='country_live', columns='employment_status',
     values='age', aggfunc='mean')
)
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-8-c65017e0f276> in <module>
      1 # run code
----> 2 (jb2
      3  .pivot_table(index='country_live', columns='employment_status',
      4      values='age', aggfunc='mean')
      5 )

~/envs/menv/lib/python3.8/site-packages/pandas/core/frame.py in pivot_table(self, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
   8036         from pandas.core.reshape.pivot import pivot_table
   8037 
-> 8038         return pivot_table(
   8039             self,
   8040             values=values,

~/envs/menv/lib/python3.8/site-packages/pandas/core/reshape/pivot.py in pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
     93         return table.__finalize__(data, method="pivot_table")
     94 
---> 95     table = __internal_pivot_table(
     96         data,
     97         values,

~/envs/menv/lib/python3.8/site-packages/pandas/core/reshape/pivot.py in __internal_pivot_table(data, values, index, columns, aggfunc, fill_value, margins, dropna, margins_name, observed, sort)
    185                     #  agged.columns is a MultiIndex and 'v' is indexing only
    186                     #  on its first level.
--> 187                     agged[v] = maybe_downcast_to_dtype(agged[v], data[v].dtype)
    188 
    189     table = agged

~/envs/menv/lib/python3.8/site-packages/pandas/core/dtypes/cast.py in maybe_downcast_to_dtype(result, dtype)
    275     if not isinstance(dtype, np.dtype):
    276         # enforce our signature annotation
--> 277         raise TypeError(dtype)  # pragma: no cover
    278 
    279     converted = maybe_downcast_numeric(result, dtype, do_round)

TypeError: Int64
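A workaround that sidesteps the Int64 downcast path (sketched on a toy frame, since the survey data isn't reproduced here) is to cast the nullable column to a NumPy dtype before pivoting:

```python
import pandas as pd

# Hypothetical small frame standing in for jb2; the workaround is the same:
# cast the nullable Int64 column to a NumPy dtype before pivot_table runs.
jb2 = pd.DataFrame({
    'country_live': ['US', 'US', 'DE', 'DE'],
    'employment_status': ['ft', 'pt', 'ft', 'pt'],
    'age': pd.array([25, 30, 40, None], dtype='Int64'),
})

result = (jb2
    .assign(age=jb2.age.astype('float64'))   # avoid maybe_downcast_to_dtype(Int64)
    .pivot_table(index='country_live', columns='employment_status',
                 values='age', aggfunc='mean')
)
print(result)
```

Newer pandas releases handle nullable dtypes in pivot_table, so upgrading may also resolve it.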

typos in chapter 6, Table 6.1 on page 41

  • s.gt(s2) operator should be s > s2
  • s.ge(s2) operator should be s >= s2
  • s.lt(s2) operator should be s < s2
  • s.le(s2) operator should be s <= s2

Happy to help reviewing Edition 2 before release ;-)
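For reference, the corrected operator pairings can be checked directly:

```python
import pandas as pd

s = pd.Series([1, 2, 3])
s2 = pd.Series([3, 2, 1])

# The comparison methods are equivalent to the corrected operators:
assert (s.gt(s2) == (s > s2)).all()
assert (s.ge(s2) == (s >= s2)).all()
assert (s.lt(s2) == (s < s2)).all()
assert (s.le(s2) == (s <= s2)).all()
```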

[Chapter 27] Couple of possible improvements

Hi Matt, thanks for your book, really enjoying it. I'd suggest a couple of changes in chapter 27:

  1. In the description of the parameters of pandas.DataFrame.groupby() on page 322, I'd change the specification of dropna. It works differently from the same parameter in pandas.DataFrame.pivot_table() or pandas.crosstab(), where it applies to values (and therefore << [...] dropna=False will keep columns that have no values >>). In pandas.DataFrame.groupby(), dropna applies to the group keys instead, so the description quoted above (which appears in the book) is no longer valid. For this reason, the DataFrame on page 305 should have 8 columns rather than 4.
  2. On pages 313-314 of the book, the per-column aggregations are not applied to numeric columns only; that behavior can be obtained with jb2.groupby('country_live')[[col for col in jb2.select_dtypes('number').columns]].agg(['min', 'max']).

pg. 296 : code doesn't result in what's in book

(jb
.filter(like=r'job.role.*t')
.where(jb.isna(), 1)
)
results in a single col with the indexes 1...54461

(jb
.filter(like=r'job.role.*t')
.where(jb.isna(), 1)
.fillna(0)
)
ditto

Thereafter seems to be ok

reordering columns

Hi!

So I'm using the "recipe style" of working on a dataframe and assigning some new columns as part of that process (which works great).

One of the last steps I'd like to do is put all the columns in a specific order. In this case, by "all" I mean some of the original columns as well as some of the newly created columns.

I understand (or at least think I understand) that since I want access to the new columns, which are in the intermediate df, I'll need to use a lambda.

Looking through Effective Pandas (p.229) Matt does a column rename:

.rename(columns=lambda c: c.replace('.', '_'))

But this is doing the same thing to all the columns so I couldn't figure out how to apply this concept to a simple reorder. If I was doing this outside of the recipe, I can simply do:

df[cols in my order] # cols include old and new columns

But using the following inside the function/recipe

[cols in my order] # cols include old and new columns

Naturally fails since the new cols don't exist here.

It's not a huge deal to simply do the ordering after the recipe function is called, just wondering if it's something I can do as part of the recipe?

Thanks!
Dan
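One chain-friendly way to do this (a sketch with made-up column names) is to drop into .pipe, which hands the intermediate dataframe, new columns included, to a callable:

```python
import pandas as pd

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})

ordered = (df
    .assign(c=lambda df_: df_.a + df_.b)       # a new column created mid-chain
    .pipe(lambda df_: df_[['c', 'a', 'b']])    # reorder, including the new column
)
print(list(ordered.columns))  # ['c', 'a', 'b']
```

The list passed inside .pipe can mix original and newly assigned columns, because .pipe receives the intermediate frame after the .assign step.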

.filter with regex

Hi Matt, I'm working through your Effective Pandas book and might have found a typo at the start of the chapter Reshaping Dataframes with Dummies.

You write:

>>> (jb
...  .filter(like=r'job.role.*t')
...  .where(jb.isna(), 1)
... )

but with pandas 1.4.3 that doesn't work. I can leave as .filter(like="job.role") and get the 13 columns as intended, or I can use .filter(regex=r"job.role.*t") and get the 8 columns that have a "t" in the job title.
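The difference is that like does a literal substring match while regex interprets the pattern; a minimal sketch on made-up column names:

```python
import pandas as pd

# A toy frame standing in for jb: .filter(like=...) treats the pattern as a
# literal substring, while .filter(regex=...) interprets it as a regex.
jb = pd.DataFrame(columns=['job.role.analyst', 'job.role.architect',
                           'job.role.manager', 'other'])

print(list(jb.filter(like='job.role').columns))        # all three job.role columns
print(list(jb.filter(regex=r'job\.role.*t').columns))  # only those with a 't' after the prefix
```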

Digging into apply() for strings

I know we generally want to avoid apply(), especially for any numerical operations. I just often find myself working with and parsing a variety of text (usually coming from csv, which in turn is coming from open textbox data, aka ugly).

Just wondering if Matt or anyone here knows of good resources to dig into using apply? I can try to be more specific, but as an example, today I'm trying to run some sentiment analysis over 2 columns/series in a dataframe, turning that text into scores. In this very specific case I'm using NRCLex and getting back a dict (like this: {'fear': 2, 'positive': 1, 'negative': 4, 'anticipation': 1}). I'm then trying to create columns from that dict, where each dict value becomes the column value. So for this record, column "fear" would have 2.

Anyway, not expecting a direct answer to this specific question (though that would be fine too! ha!) just more where I can look into the apply method and trying to learn how to better work with it.
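For the dict-per-row case specifically, one sketch (with made-up scores) avoids .apply entirely by expanding the series of dicts into a dataframe:

```python
import pandas as pd

# Hypothetical scores like those NRCLex returns, one dict per row.
scores = pd.Series([
    {'fear': 2, 'positive': 1, 'negative': 4, 'anticipation': 1},
    {'joy': 3, 'positive': 2},
])

# pd.DataFrame on a list of dicts unions the keys into columns;
# missing keys become NaN, which we fill with 0.
wide = pd.DataFrame(scores.tolist()).fillna(0).astype(int)
print(wide)
```

The result can then be joined back to the original frame (e.g. via .assign(**wide) or pd.concat) to get one column per emotion.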

Thanks!

Chapter 23 Jetbrains Python Survey

The Jetbrains Python survey used in chapter 23 and subsequent chapters is very problematic. I ran into numerous problems when trying to make the jb2 DataFrame on page 233. The first problem was that the number ranges under 'company_size' (e.g. 2-10) were not interpreted correctly by Excel. The hyphen between the two numbers was changed into very strange looking three-character symbols. I had to go into the Excel file and manually change them back into hyphens using Ctrl-H. But that made new problems.

Once the hyphens were inserted, Excel regarded some of the number ranges as dates. For example, 2-10 was turned into 10-Feb. Changing the column format had no effect. After many hours of frustration, I finally discovered that adding a leading space prevented Excel from treating the range as a date.

But then Python had trouble recognizing other number range strings. I kept getting the error "ValueError: invalid literal for int() with base 10: '51-500' ", and others like it. After more frustration I found that many of the string entries in the CSV file had extra spaces, or whitespace. I tried to remove the whitespace all in one sweep using pd.read_csv(jb, delim_whitespace=True), but I only got the following error: ParserError: Error tokenizing data. C error: Expected 194 fields in line 961, saw 215

I had to use Ctrl-H to replace each whitespace-padded number range with the same range without whitespace. As for the ranges that Excel thought were dates, I had to modify the Python code to account for the needed leading space.

But that still was not the end. After fixing the strings, the code would not recognize "company_size" as an attribute. It gave me the following error: "AttributeError: 'DataFrame' object has no attribute 'company_size'". Again, it took me a few hours, but I finally figured out that the attribute 'company_size' had an extra leading and trailing space, making Python unable to recognize it, since it technically did not match the code.

Bottom line: the Jetbrains survey is not ready to use out of the box, so to speak. Translating the file into CSV creates strange symbols that must be changed internally. Additionally, there is a lot of whitespace surrounding the data entries; without knowing what the whitespace is, it is impossible to make Python read it. Finally, some of the entries need whitespace so that they are not changed into dates, and the Python code must reflect the same thing.

I am still frustrated about this because it took me approximately three days to figure out what was going on.
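For what it's worth, a sketch of how the whitespace problems might be handled in pandas itself (on a toy CSV with the same symptoms), rather than by editing the file in Excel:

```python
import io
import pandas as pd

# A toy CSV showing the same symptoms: stray spaces around headers and values.
raw = io.StringIO("name, company_size \nalice,  2-10\nbob, 51-500 \n")

df = (pd.read_csv(raw, skipinitialspace=True)   # drop spaces after delimiters
        .rename(columns=str.strip))             # strip header whitespace

# Strip remaining whitespace in string columns instead of editing the file:
df['company_size'] = df['company_size'].str.strip()
print(df['company_size'].tolist())  # ['2-10', '51-500']
```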

%matplotlib inline no longer needed

In Chapter 14 on plotting you wrote "To leverage it in Jupyter, make sure you include the
following cell magic to tell Jupyter to display the plots in the browser:
%matplotlib inline", which no longer holds (for 2+ years at least), especially if you import pyplot or pandas (which Effective Pandas is all about): https://github.com/ipython/ipython/issues/12190
I just felt that a book written in 2021 should explain that to its readers.

Setting values on the intermediate dataframe

I have a df that is mostly a bunch of columns that contain numbers (dtypes are Int). The index is a datetime type, but I don't think that's important for my question. Here is my function:

def tally_emotion_scores(input_df):
    pos_e = ['anticipation', 'surprise', 'joy', 'trust']
    neg_e = ['fear', 'anger', 'disgust', 'sadness']
    all_e = pos_e + neg_e
    return(input_df
            .assign(**pd.DataFrame(input_df.total_scores.to_list()).fillna(0).astype('Int64'))
            .drop(columns=['total_scores'])
            .assign(pos_neg_val= lambda df_: df_['positive'] - df_['negative'])
            .set_index('date')
            .sort_index()
        )

What I'd like to do is make changes to columns based on the pos_neg_val.

I can do it on the resulting df (new2_df is what's being returned from the function above). So the following is what I want to do and it works, I'm just trying to figure out how to get this into my function.

new2_df.loc[new2_df['pos_neg_val'] > 0, neg_e] = 0
new2_df.loc[new2_df['pos_neg_val'] <= 0, all_e] = 0

I thought I could use a lambda to access the intermediate df and tried several ways (trying to remember some):

When I tried (on the line right after assigning pos_neg_val):

.loc[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0

I got:
SyntaxError: cannot assign to subscript here. Maybe you meant '==' instead of '='?

I think I tried adding another assign with versions of:

.assign(
    pos_neg_val= lambda df_: df_['positive'] - df_['negative'],
    [[lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0] # Try one
    [lambda df_: df_['pos_neg_val'] > 0, neg_e] = 0 # Try two
)

Neither of which looked right, but I tried anyway.

So I'm wondering how do I access and set the values on multiple columns based on a new created value on the intermediate df?

Thanks!
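One possible sketch (with made-up columns): .assign can't do .loc-style assignment, but .where is an expression, so it can do the conditional zeroing inside .assign, using a default argument to capture each column name:

```python
import pandas as pd

neg_e = ['fear', 'anger']  # stand-ins for the negative-emotion columns

df = pd.DataFrame({'fear': [2, 1], 'anger': [0, 3], 'pos_neg_val': [5, -1]})

# Zero each negative-emotion column wherever pos_neg_val is positive.
# The c=c default binds the column name inside the comprehension.
out = (df
    .assign(**{c: lambda df_, c=c: df_[c].where(df_['pos_neg_val'] <= 0, 0)
               for c in neg_e})
)
print(out)
```

.where keeps values where the condition is True and substitutes 0 elsewhere, so this mirrors `df.loc[df['pos_neg_val'] > 0, neg_e] = 0` without leaving the chain.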

query engine fails p182

The code at the top of p182:
(jb2
.query("team_size.isna()")
.employment_status
.value_counts(dropna=False)
)
It can fail with: "TypeError: unhashable type: 'Series'".
Running Python v3.9.7 and pandas 1.3.4 with the latest Anaconda install.
Cause: 'numexpr' is the default query engine when it is installed, and it is installed with Anaconda.
See (ref).
Solution: add engine='python' to the query arguments.
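A minimal reproduction of the suggested fix (on a toy frame standing in for jb2):

```python
import pandas as pd

jb2 = pd.DataFrame({'team_size': pd.array([1, None, 3], dtype='Int64'),
                    'employment_status': ['ft', 'pt', 'ft']})

counts = (jb2
    .query("team_size.isna()", engine='python')  # sidestep the numexpr engine
    .employment_status
    .value_counts(dropna=False)
)
print(counts)
```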

[Question - Chapter 4] dtype='int64' is not np.int64

Hi Matt! First off, thank you for your amazing book! :)
I'm going through Chapter 4. I totally understand the discussion behind the nullable integer type.
Instead, I'm wondering why this sentence from Pandas documentation on Nullable Integer data type

Or the string alias "Int64" (note the capital "I", to differentiate from NumPy’s 'int64' dtype)

does not find confirmation in the following (songs2.dtype is np.int64 gives False):

songs2 = pd.Series(
    [145, 142, 133, 19],
    name='counts'
)

print(songs2.dtype is np.int64)

What am I missing and misunderstanding?

Thank you for your help!
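One likely explanation: .dtype returns a np.dtype instance, while np.int64 is a scalar type, so identity fails even though equality holds:

```python
import numpy as np
import pandas as pd

songs2 = pd.Series([145, 142, 133, 19], name='counts')

# .dtype is a np.dtype *instance*; np.int64 is the scalar *type*,
# so `is` compares two different objects while `==` coerces for comparison.
print(songs2.dtype is np.int64)            # False
print(songs2.dtype == np.int64)            # True
print(isinstance(songs2.dtype, np.dtype))  # True
```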

Typos on a few pdf pages

  • Page 12: > Namely, that a dataframe can have on or many series.
  • Page 24: there's an image of city.<TAB> that doesn't match the text (the dataframe in the text is city_mpg).
  • pg 83 -- "Note the .loc attribute cap? pulling out..."?
  • pg 85 -- "Also, you can only put an expression in it, you can have a statement." -- I'm not clear on what this sentence means, I suspect it's missing a word.
  • pg 90 -- "(city has a numeric index that is unique): city_mpg.reindex([0,0, 10, 20, 2_000_000])" -- I think you mean to refer to city_mpg in the text again.
  • pg 127 -- "This makes it easy to do things like calculate the percentage of quarterly snowfall the fell in a day:" -- quarterly snowfall that fell in a day.
  • pg 167 -- "if frac > 1, my specify replace=True" (perhaps must specify replace=True?)

Why no .mobi or .epub version?

Most of my reading is done on an iPad Pro or an iPhone 11Pro in Kindle or iBooks. PDF formatted books don't really work for that. When will there be .mobi and .epub versions?

.loc on Page 75

On page 75 there is a method on the picture "s.loc[-2:]". It gives an error because it cannot do slice indexing on Index of type int.
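If the goal is the last two rows, position-based .iloc works regardless of the index labels (a minimal sketch):

```python
import pandas as pd

s = pd.Series([10, 20, 30, 40])

# .loc slices by *label*, so negative labels only make sense if they exist
# in the index; .iloc slices by *position* and supports negative indices.
print(s.iloc[-2:].tolist())  # [30, 40]
```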

Typo

In figure 3.2 caption, page 12: "a dataframe can have on or many series." should be "one or many".

Typo [Chapter 19] pdf p 164

41 20.818182
42 14.636364
43 30.363636
44 15.818182
45 39.772727

should be:
40 20.818182
41 14.636364
42 30.363636
43 15.818182
44 39.772727

Typesetting issue on chap 29 PDF version

I'm really enjoying working through the Effective Pandas book! It's great. However, in the PDF version, there's a typesetting error at the end of Chapter 29. At least in the PDF I have, the summary and exercises go off the end of the page. Just wanted to let you know. Thanks!

Creating dummy columns

In chapter 26 - Reshaping DataFrames with Dummies, we wanted to turn values in the "job.role" columns into a categorical series, which we would then reshape into a dummy matrix.

That's the code of the book:

job = (jb
    .filter(like=r'job.role')
    .where(jb.isna(), 1)
    .fillna(0)
    .idxmax(axis='columns')
    .str.replace('job.role.', '', regex=False))

job

However, many rows have multiple jobs, and the above code only captures the first one.

I think the following code captures all jobs and converts them into a dummy matrix.

(jb
     .filter(like='job.role')
     .fillna('')
     .apply(lambda ser: ','.join([i for i in ser if i]), axis=1)
     .str.get_dummies(sep=',')
)
