
gesistsa / sweater


👚 Speedy Word Embedding Association Test & Extras using R

License: GNU General Public License v3.0

R 89.56% C++ 8.43% Python 0.36% Shell 1.64%
r wordembedding bias-detection textanalysis

sweater's People

Contributors

chainsawriot, cmaimone


sweater's Issues

documentation: what is the preferred workflow?

It is recommended to use the function query() to make a query and calculate_es() to calculate the effect size. You can also use the functions listed below.

I'm confused about why there are both general functions and specific ones -- it seems like you should either specify the method as an argument to query() or use a query function specific to each method, not both options.

Are mac(), rnd(), nas(), etc. functions that create a query? If so, it would be great to get "query" into their names to make that clear.
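
For reference, this is how I understand the two workflows (a sketch; S4, T4, A4, B4 are the word sets from the plotting example later in this thread, and weat() is one of the method-specific functions, assuming it takes the same arguments):

res <- query(glove_math, S4, T4, A4, B4)   # general: method is guessed from the inputs
calculate_es(res)                          # effect size from the query object
res2 <- weat(glove_math, S4, T4, A4, B4)   # specific: one function per method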

re: openjournals/joss-reviews#4036

implementing relational inner product association (RIPA)

This paper claims that one can hack WEAT by cherry-picking words in A and B. RIPA can protect against such hacking.

The RIPA method does not appear to be difficult to implement, but the fact that the paper doesn't publish any data bothers me.

Words in A and B must be paired, e.g.

A <- c("man", "men", "king")
B <- c("woman", "women", "queen")

paper suggestion

Suggestion: the intro to the paper would be stronger/clearer with an example of what implicit word biases are -- that neutral-seeming words can be more associated with one gender than the other, or with one ethnic group over another. Which can lead to...

Just a few sentences.

Not absolutely needed, but the paper jumps pretty quickly into social science theory without a generally accessible example of what it is.

re: openjournals/joss-reviews#4036

speed claims?

I'm not completely clear on what the speed claims are here. What is sweater faster than exactly?

I think benchmark.md is showing that implementing one method with Rcpp is faster than implementing the same method in pure R or in R calling C code?

That doesn't seem sufficient for a general speed claim, especially with speed baked into the package name.

It's very possible I'm missing something though.

I'll note that, other than what's implied by the package name, the paper doesn't make specific speed/performance claims. Dropping the last sentence of the second paragraph of the paper would, I think, remove all reference to speed. If you don't want to go into full comparisons and benchmarking, it may just be enough to talk about how long sweater takes to do common things, and refrain from saying whether that's fast or slow. Having that information -- expected run times -- is very useful just by itself.
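
For example, reporting expected run times could be as simple as (a sketch; S4, T4, A4, B4 defined as in the plotting example below, timings will vary by machine):

library(sweater)
# S4, T4, A4, B4: math/arts target words and male/female attribute words
system.time(query(glove_math, S4, T4, A4, B4))  # elapsed time for a typical query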

re: openjournals/joss-reviews#4036

implementing Embedding Quality Test

It is like a variant of SemAxis.

https://arxiv.org/pdf/1901.07656.pdf

Embedding Quality Test is also possible, and the concept is fun. But it is quite difficult to reproduce because the procedure involves NLTK's WordNet to generate plurals and synonyms. WordNet is available for R, but making this package dependent on rJava is no laughing matter.

all zero vectors will generate `NaN`

In general, cosine is not a well-defined similarity measure for all-zero vectors. But we can't change that.

https://github.com/chainsawriot/sweater/blob/6aebf710d813033c6d07f0268f12bd3e6badaee5/src/weat.cpp#L14

This will generate a "divide by zero" problem: deno_* is zero, so the sqrt of deno_* is also zero, and therefore the denominator is zero.

A simple solution is to imitate PyTorch and use an eps (PyTorch uses 1e-8). The denominator is otherwise non-negative (due to the squaring and then rooting), so clamping it at eps keeps it always positive.
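
A sketch of the idea in R terms (the actual fix would go into the C++ code; cosine_eps is a hypothetical name):

cosine_eps <- function(x, y, eps = 1e-8) {
  # clamp the denominator away from zero, as PyTorch does
  sum(x * y) / max(sqrt(sum(x^2)) * sqrt(sum(y^2)), eps)
}
cosine_eps(rep(0, 3), c(1, 2, 3))  # returns 0 instead of NaN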

code structure question

Are effect sizes expensive to compute? It looks like a call to query() will compute the effect size automatically and print it out, so my guess is no. Is there a reason not to store the computed effect size in the sweater object (so it can be referenced later)? With the effect size functions, I'd have to store the effect size for a query in a separate variable from the query object, which makes it easy to get things mixed up in the code -- it would be nice to have it as part of the result object.

You could still have the es functions for convenient access to that component of the object if you wanted; something like the sketch below.
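
From the user's side it could look like this (res$es is a hypothetical slot, not sweater's current API; word sets as in the plotting example below):

res <- query(glove_math, S4, T4, A4, B4)
res$es <- calculate_es(res)  # keep the effect size with the query object
res$es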

re: openjournals/joss-reviews#4036

error printing sweater object

I ran the first example, and then tried to print the mac_neg object:

> mac_neg

── sweater object ──────────────────────────────────────────────────
Test type:  mac 
Effect size:  0.1375856 

── Functions ───────────────────────────────────────────────────────
Error in base::nchar(wide_chars$test, type = "width") : 
  lazy-load database '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/cli/R/sysdata.rdb' is corrupt
In addition: Warning messages:
1: In base::nchar(wide_chars$test, type = "width") :
  restarting interrupted promise evaluation
2: In base::nchar(wide_chars$test, type = "width") :
  internal error -3 in R_decompress1


> packageVersion("cli")
[1] ‘3.1.1’

Not sure if this is just my system? I had the same issue when trying to print other sweater objects that were the result of running the examples.

SemAxis

SemAxis by An et al. (2018) is a variant of RND.

The only tricky part is augmenting A and B.

https://github.com/ghdi6758/SemAxis/blob/master/code/semaxis.py

I think I will use the notation in the paper, i.e. l instead of the k in the Python code.
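
A rough sketch of the augmentation step, assuming wv holds word vectors as rows with words as row names (augment is a hypothetical helper, not the final implementation):

augment <- function(wv, seeds, l = 3) {
  # cosine similarity via row-normalized vectors
  norm <- wv / sqrt(rowSums(wv^2))
  unique(c(seeds, unlist(lapply(seeds, function(s) {
    sims <- drop(norm %*% norm[s, ])
    # take the l nearest neighbors, skipping the seed itself
    names(sort(sims, decreasing = TRUE))[2:(l + 1)]
  }))))
}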

Reference:

@article{an2018semaxis,
  title={SemAxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment},
  author={An, Jisun and Kwak, Haewoon and Ahn, Yong-Yeol},
  journal={arXiv preprint arXiv:1806.05521},
  year={2018}
}

Implementing `print.sweater` and `plot.sweater`

As a sweater result is an S3 object, it would be great to have a print method and a plot method. By doing so, one only needs to run query() and it will instantly print the results (say, the effect size).

The plot method is basically the same as plot_bias.
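
A minimal sketch of the two methods (the bodies are assumptions; the real print method would be fancier):

print.sweater <- function(x, ...) {
  cat("Test type: ", class(x)[2], "\n")    # e.g. "weat", "mac"
  cat("Effect size: ", calculate_es(x), "\n")
  invisible(x)
}
plot.sweater <- function(x, ...) plot_bias(x)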

Organize functions by query types

set(s) of target words   set(s) of attribute words   Algo         Ref
1                        1                           MAC          Manzini
1                        2                           RNSB / RND
2                        2                           WEAT
2                        2                           WEFE

Do we have it?

  • MAC
  • RNSB
  • RND
  • WEAT
  • WEFE

names for vectors in output object?

Can the S_diff and T_diff components of the output for the method have names? I think each value corresponds to an input term, yes? They would be more useful as named vectors; see the sketch after the output below.

> sw$S_diff
[1]  0.003158583  0.003242220  0.001271607  0.031652155  0.003074379  0.016247332  0.035000510 -0.010817083
> sw$T_diff
[1] -0.0265718087  0.0054876842 -0.0523231481 -0.0117847993 -0.0369267966  0.0224587349 -0.0167662057
[8]  0.0003334358
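
For illustration, the naming could be done on the user's side like this (assuming the order of S_diff matches S_words, and likewise for T):

names(sw$S_diff) <- sw$S_words
names(sw$T_diff) <- sw$T_words
sw$S_diff["math"]  # now addressable by word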

re: openjournals/joss-reviews#4036

paper citations

Looks like the paper cites the R word2vec package, but not the original dataset. Should the original dataset be cited? The same goes for small_reddit and glove_math.

Perhaps list them in the paper as sample word embeddings that are included in the package?

re: openjournals/joss-reviews#4036

plotting error

> S4 <- c("math", "algebra", "geometry", "calculus", "equations", "computation", "numbers", "addition")
> T4 <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama", "sculpture")
> A4 <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
> B4 <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter")
> sw <- query(glove_math, S4, T4, A4, B4)
> 
> names(sw)
[1] "S_diff"  "T_diff"  "S_words" "T_words" "A_words" "B_words"
> class(sw)
[1] "sweater" "weat"    "list"   
> plot(sw)
Error in plot_bias(x) : No P slot in the input object x.FALSE

^ I ran the example, then tried to plot. Should this be giving this error?

re: openjournals/joss-reviews#4036

Paper: questions on Query section

For this:

sweater uses the concept of query [@badilla2020wefe] to study the biases in $w$. A query contains two or more sets of seed words with at least one set of target words and one set of attribute words. sweater uses the $\mathcal{S}\mathcal{T}\mathcal{A}\mathcal{B}$ notation from @brunet2019understanding to form a query.

  • Need: concept of a query (missing a)
  • Why is STAB in mathematical notation?
  • Are target words and attribute words types of seed words? I think so, but that could be clearer

I would also find a little more info on target and attribute sets helpful. When you're supposed to supply two different sets, what is each supposed to be? What should be in S and what in T? I appreciate the references, and realize this may be complicated. Some type of brief summary here would help though. For example, for A and B, it seems each should be a set of words relevant to a group? Or the endpoints of a scale?

When you say target words shouldn't have bias, does that mean they are the words you're testing for bias?

Range of effect sizes?

It would be really helpful if the documentation pages for the different methods (or the effect size function pages?) included information on the scale/range/direction of the effect size output. For example, are the values between 0 and 1? What does a lot of bias vs. no bias look like?

re: openjournals/joss-reviews#4036

installation instructions

Suggesting people use remotes::install_github("chainsawriot/sweater") instead of devtools::install_github("chainsawriot/sweater") will probably result in fewer issues, as remotes is easier for users to install successfully.
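
That is, the README could suggest:

install.packages("remotes")
remotes::install_github("chainsawriot/sweater")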

I would also consider rewording "Or the slightly outdated version from CRAN" -- if you're going to offer the package via CRAN, that version should be kept reasonably up to date, and the GitHub version should be considered the development one.

re: openjournals/joss-reviews#4036

Issue of pooled SD in `weat_es()`

Hi, I'm the author of the PsychWordVec package (an integrated toolkit for word embedding research). Your sweater package inspired me a lot when I developed test_WEAT() and test_RND() for my package.

Recently I browsed the source code of your sweater package and found that you used a different method to compute the pooled SD in weat_es() when the sample sizes are not balanced (n1 != n2). In particular, I found this line of code (https://github.com/chainsawriot/sweater/blob/master/R/sweater.R#L87):

pooled_sd <- sqrt(((n1 -1) * S_var) + ((n2 - 1) * T_var)/(n1 + n2 + 2))

I'm sure that this is the method usually used in statistics to compute the pooled SD (https://www.statisticshowto.com/pooled-standard-deviation/), whether or not the sample sizes are balanced. However, there are two issues to be addressed:

  1. For balanced sample sizes (n1 == n2), the pooled SD calculated by this method is inconsistent with the one from Caliskan et al.'s (2017) sd(c(S_diff, T_diff)) approach. How could we reconcile them? In my test_WEAT(), to avoid such inconsistency, I just follow Caliskan et al.'s approach regardless of whether n1 equals n2.
  2. Indeed, the computation in the code is incorrect, with a misplaced pair of parentheses ) + ( before n2 - 1 and a wrong sign in n1 + n2 + 2 (which should be n1 + n2 - 2). This can produce substantially wrong results because it actually computes the square root of the sum of (n1 - 1) * S_var and (n2 - 1) * T_var / (n1 + n2 + 2). The correct version should be
    pooled_sd <- sqrt(((n1 - 1) * S_var + (n2 - 1) * T_var) / (n1 + n2 - 2))
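
A quick numeric check of the difference, with toy values for illustration only:

S_var <- 0.04; T_var <- 0.09; n1 <- 8; n2 <- 8
sqrt(((n1 - 1) * S_var) + ((n2 - 1) * T_var) / (n1 + n2 + 2))  # current code: ~0.561
sqrt(((n1 - 1) * S_var + (n2 - 1) * T_var) / (n1 + n2 - 2))    # corrected:   ~0.255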

Best,
Bruce

"guess" method

I found it confusing in the documentation how the method is determined. The examples are listed for specific methods, but the query code doesn't specify the method. I see that the default is to guess, which I assume is based on the combination of STAB inputs provided? But if the examples are supposed to show a specific method, it would probably be clearer to show code that invokes that method explicitly -- what if you change the "guess" behavior in the future?
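
For example, for the WEAT example (assuming query() accepts the method argument its documentation describes):

sw <- query(glove_math, S4, T4, A4, B4, method = "weat")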

re: openjournals/joss-reviews#4036

paper: other R packages?

Are there any other R packages available for detecting bias in word embeddings, or more generally for bias in text corpora? If not, you can state that. If there are, it would be good to reference them.

The statement that sweater brings together methods that were only reported in papers' supplemental materials is clear -- I get the utility of the package. I just don't know if there's anything else out there or not.

re: openjournals/joss-reviews#4036

documentation: first example

The first example uses googlenews without introducing it first -- not even noting that it contains word embeddings.

Output from the first example isn't explained. I found it confusing.

#> Effect size:  0.1375856
#> 
#> ── Functions ──────────────────────────────────
#> • <calculate_es()>: Calculate effect size
#> • <plot()>: Plot the bias of each individual word

Why are the function names in <>? Why are these listed? Are these next steps that I should take? Do I supply the output of the query function to them?

What is mac_neg$P?

re: openjournals/joss-reviews#4036

dependency on `text2vec`

text2vec might be archived on Dec 5. The only place this package uses text2vec is to calculate cosine similarity for SemAxis softening, so it should be possible to trim this dependency.
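
A sketch of a drop-in replacement for that computation (cos_sim is a hypothetical name; x and y are matrices with word vectors as rows):

cos_sim <- function(x, y) {
  # pairwise cosine similarity between rows of x and rows of y
  (x %*% t(y)) / (sqrt(rowSums(x^2)) %o% sqrt(rowSums(y^2)))
}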

Allow S to be a quanteda dictionary - rnsb

It would be better to allow S to be a dictionary:

require(quanteda)
S <- dictionary(list(japanese = c("Japaner", "Japanerin"),
                     korean = c("Koreaner", "Koreanerin")))

And then calculate the bias per word (i.e. Japaner, Japanerin), but aggregate by category (i.e. japanese) when calculating the multinomial distribution P; a sketch below.
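
A sketch of the aggregation step, with a plain named list standing in for the dictionary (P is assumed to be the named word-level probability vector from rnsb()):

S_list <- list(japanese = c("Japaner", "Japanerin"),
               korean   = c("Koreaner", "Koreanerin"))
P_by_cat <- vapply(S_list, function(w) sum(P[w]), numeric(1))  # aggregate P by category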
