gesistsa / sweater Goto Github PK

View Code? Open in Web Editor NEW

27.0 4.0 4.0 23.98 MB

👚 Speedy Word Embedding Association Test & Extras using R

License: GNU General Public License v3.0

R 89.56% C++ 8.43% Python 0.36% Shell 1.64%

r wordembedding bias-detection textanalysis

sweater's People

Contributors

Stargazers

Watchers

Forkers

cmaimone crsl4 rnaimehaom yhliu2022

sweater's Issues

documentation: what is the preferred workflow?

It is recommended to use the function query() to make a query and calculate_es() to calculate the effect size. You can also use the functions listed below.

I'm confused why there are both general functions and specific ones -- seems like you should either specify the method as an argument to query() or have to use a query function specific to each method -- not both options.

Are functions mac(), rnd(), nas() etc functions that create a query? If so, it would be great to get "query" in their names to make that clear.

re: openjournals/joss-reviews#4036

community guidelines

Add something about how people seeking support with the software should get it -- where can/should people ask questions? It's not very clear from the current contributing section.

re: openjournals/joss-reviews#4036

implementing relational inner product association (RIPA)

This paper claims that one can hack WEAT by cherry-picking words in A and B. The RIPA can protect against such hacking.

The method RIPA does not appear to be difficult to implement. But the fact that the paper doesn't publish any data bothers me.

Words in A and B must be in pair, e.g.

A <- c("man", "men", "king")
B <- c("woman", "women", "queen")

paper suggestion

Suggestion: intro to the paper would be stronger/clearer with an example of what implicit word biases are -- that neutral seeming words can be more associated with one gender than the other, or one ethnic group over another. Which can lead to...

Just a few sentences.

Not absolutely needed, but it jumps pretty quickly into social science theory without a generally accessible example of what it is.

re: openjournals/joss-reviews#4036

speed claims?

I'm not completely clear on what the speed claims are here. What is sweater faster than exactly?

I think benchmark.md is showing that implementing one method with rcpp is faster than implementing the same method in pure R or R running C code?

That doesn't seem sufficient for a speed claim in general, especially with that in the package name.

It's very possible I'm missing something though.

I'll note that, other than what's implied by the package name, the paper doesn't make specific speed/performance claims. Dropping the last sentence of the second paragraph of the paper would, I think, remove all reference to speed. If you don't want to go into full comparisons and benchmarking, it may just be enough to talk about how long sweater takes to do common things, and refrain from saying whether that's fast or slow. Having that information -- expected run times -- is very useful just by itself.

re: openjournals/joss-reviews#4036

implementing Embedding Quality Test

It is like a variant of SemAxis.

https://arxiv.org/pdf/1901.07656.pdf

Embedding Quality Test is also possible. And the concept is fun. But it is quite difficult to reproduce because the procedure involves NLTK's WordNet to generate plurals and Synonyms. Wordnet is available for R. But making this package dependent on rJava is no laughing business.

all zero vectors will generate `NaN`

In general, cosine is not a good distance measurement for all-zero vectors. But we can't change that.

https://github.com/chainsawriot/sweater/blob/6aebf710d813033c6d07f0268f12bd3e6badaee5/src/weat.cpp#L14

This will generate "divide by zero" problem because deno_* is zero, the sqrt of deno_* is also zero, and the denominator is zero.

A simple solution is to imitate pytorch to use eps (pytorch uses 1e-8). The denominator should always be positive (due to the squaring and then rooting).

code structure question

Are effect sizes expensive to compute? Looks like a call to query() will compute the effect size automatically and print it out, so my guess is no? Is there a reason not to store the computed effect size in the sweater objects (so it can be referenced later)? With the effect size functions, I'd have to store the effect size for a query in a separate variable from the query object, which makes it easy to get things messed up in the code -- would be nice to have it as part of the result object.

Could still have es functions for convenience of accessing that component of the object if you wanted.

re: openjournals/joss-reviews#4036

error printing sweater object

I ran the first example, and then tried to print the mac_neg object:

> mac_neg

── sweater object ─────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Test type:  mac 
Effect size:  0.1375856 

── Functions ──────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────────
Error in base::nchar(wide_chars$test, type = "width") : 
  lazy-load database '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/cli/R/sysdata.rdb' is corrupt
In addition: Warning messages:
1: In base::nchar(wide_chars$test, type = "width") :
  restarting interrupted promise evaluation
2: In base::nchar(wide_chars$test, type = "width") :
  internal error -3 in R_decompress1


> packageVersion("cli")
[1] ‘3.1.1’

Not sure if this is just my system? I had the same issue when trying to print other sweater objects that were the result of running the examples.

SemAxis

SemAxis by An et al (2018) is a variant of RND.

The only tricky part is augmenting A and B.

https://github.com/ghdi6758/SemAxis/blob/master/code/semaxis.py

I think I will use the notation in the paper, i.e. l instead of k in the python code.

Reference:

@article{an2018semaxis,
  title={SemAxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment},
  author={An, Jisun and Kwak, Haewoon and Ahn, Yong-Yeol},
  journal={arXiv preprint arXiv:1806.05521},
  year={2018}
}

Implement a speedy version of Relative Norm Distance

As in https://www.pnas.org/content/115/16/E3635

Implement Bias Silhouette Analysis

https://webis.de/downloads/publications/papers/spliethoever_2021.pdf

Implementing `print.sweater` and `plot.sweater`

As sweater is an S3 object, it would be great to have a print method and a plot method. By doing so, one only needs to do query and it will instantly print the results (let's say effect size)

The plot method is basically the same as plot_bias.

Organize functions by query types

set(s) of target words	set(s) of attribute words	Algo	Ref
1	1	MAC	Manzini
1	2	RNSB / RND
2	2	WEAT
2	2	WEFE

Do we have it?

Some methods dont work when `A_words` and `B_words` with length = 1

TODO

nas
ect

rnsb, nas, and semaxis work.

Migration sundries

handles
Github actions
Default branch

Implement Analogy Tasks (and maybe include the data too) to evaluate the general quality of word embeddings

The 3CosAdd, 3CosMul, LRCor etc. see this paper.

Probably include the data also. See the ACLwiki

names for vectors in output object?

Can the S_diff and T_diff components of the output for the method have names? I think each value corresponds to an input term, yes? Would be more useful as named vectors.

> sw$S_diff
[1]  0.003158583  0.003242220  0.001271607  0.031652155  0.003074379  0.016247332  0.035000510 -0.010817083
> sw$T_diff
[1] -0.0265718087  0.0054876842 -0.0523231481 -0.0117847993 -0.0369267966  0.0224587349 -0.0167662057
[8]  0.0003334358

re: openjournals/joss-reviews#4036

paper citations

Looks like the paper cites the R word2vec package, but not the original dataset. Should original dataset be cited? Same for small_reddit, glove_math?

Perhaps list them in the paper as sample word embeddings that are included in the package?

re: openjournals/joss-reviews#4036

plotting error

> S4 <- c("math", "algebra", "geometry", "calculus", "equations", "computation", "numbers", "addition")
> T4 <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama", "sculpture")
> A4 <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
> B4 <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter")
> sw <- query(glove_math, S4, T4, A4, B4)
> 
> names(sw)
[1] "S_diff"  "T_diff"  "S_words" "T_words" "A_words" "B_words"
> class(sw)
[1] "sweater" "weat"    "list"   
> plot(sw)
Error in plot_bias(x) : No P slot in the input object x.FALSE

^ I ran the example, then tried to plot. Should this be giving this error?

re: openjournals/joss-reviews#4036

Paper: questions on Query section

For this:

sweater uses the concept of query [@badilla2020wefe] to study the biases in $w$. A query contains two or more sets of seed words with at least one set of target words and one set of attribute words. sweater uses the $\mathcal{S}\mathcal{T}\mathcal{A}\mathcal{B}$ notation from @brunet2019understanding to form a query.

Need: concept of a query (missing a)
Why is STAB in mathematical notation?
Are target words and attribute words types of seed words? I think so, but that could be clearer

I would also find a little more info on target and attribute sets helpful. When you're supposed to supply two different sets, what is each supposed to be? What should be in S and what in T? I appreciate the references, and realize this may be complicated. Some type of brief summary here would help though. For example, for A and B, it seems each should be a set of words relevant to a group? Or the endpoints of a scale?

When you say target words shouldn't have bias, does that mean they are the words you're testing for bias?

Range of effect sizes?

It would be really helpful if the documentation pages for the different methods (or the effect size function pages?) included information on the scale/range/directional of the effect size output. For example, are the values between 0 and 1? What is an example of a lot of bias vs. no bias?

re: openjournals/joss-reviews#4036

installation instructions

Suggesting people use remotes::install_github("chainsawriot/sweater") instead of devtools::install_github("chainsawriot/sweater") will probably result in fewer issues, as remotes is easier for users to install successfully.

Would also consider rewording: "Or the slightly outdated version from CRAN" -- if you're going to offer via CRAN, that should be kept reasonably up to date, and the github version should be considered the developmental one.

re: openjournals/joss-reviews#4036

Issue of pooled SD in `weat_es()`

Hi, I'm the author of the PsychWordVec package (an integrated toolkit of word embedding research). Your sweater package inspired me a lot when I developed test_WEAT() and test_RND() for my package.

Recently I browsed the source code of your sweater package and found that you used a different method to compute pooled SD in weat_es() if the sample sizes are not balanced (n1 != n2). Particularly, I found this line of code (https://github.com/chainsawriot/sweater/blob/master/R/sweater.R#L87):

pooled_sd <- sqrt(((n1 -1) * S_var) + ((n2 - 1) * T_var)/(n1 + n2 + 2))

I'm sure that in statistics this method is usually used to compute the pooled SD (https://www.statisticshowto.com/pooled-standard-deviation/), no matter whether the sample sizes are balanced or not. However, there are two issues to be addressed:

For balanced sample sizes (n1 == n2), the pooled SD calculated by this method would be inconsistent with that by Caliskan et al.'s (2017) sd(c(S_diff, T_diff)) approach. How could we reconcile them? In my test_WEAT(), to avoid such inconsistency, I just follow Caliskan et al.'s approach regardless of whether n1 is equal to n2.
Indeed, the code of computation is incorrect, with a misplaced pair of parentheses ) + ( before n2 - 1 and a wrong sign in n1 + n2 + 2 (which should be n1 + n2 - 2 instead). This could produce substantially wrong results because it actually computes the square root of the sum of (n1 - 1) * S_var and (n2 - 1) * T_var / (n1 + n2 + 2). The correct one should be
pooled_sd <- sqrt(((n1 - 1) * S_var + (n2 - 1) * T_var) / (n1 + n2 - 2))

Best,
Bruce

"guess" method

I found it confusing in the documentation how the method was being determined. The examples are listed for specific methods, but the query code doesn't specify the method. I see that the default is to guess, which I assume it does based on the combination of STAB inputs provided? But if the examples are supposed to be of a specific method, it would probably be clearer to show code that invokes that method specifically -- what if you change "guess" behavior in the future?

re: openjournals/joss-reviews#4036

NA behaviours

Entirely not tested!

paper: other R packages?

Are there any other R packages available for bias in word embeddings, or more generally for bias in text corpora? If not, you can state that. If there are, would be good to reference.

The statement that sweater brings together methods that were only reported in papers' supplemental materials is clear -- I get the utility of the package. I just don't know if there's anything else out there or not.

re: openjournals/joss-reviews#4036

documentation: citation links

All of the citations to papers in the documentation are great! They'd be even more useful if you linked them, so people could click on the reference to go to it.

re: openjournals/joss-reviews#4036

Benchmark cosine similarity

To show how quick and slow this basic operation is (vs Python or else).

documentation: first example

The first example uses googlenews without introducing it first -- not even that it has word embeddings in it.

Output from the first example isn't explained. I found it confusing.

#> Effect size:  0.1375856
#> 
#> ── Functions ─────────────────────────────────
#> • <calculate_es()>: Calculate effect size
#> • <plot()>: Plot the bias of each individual word

Why are the function names in <>? Why are these listed? Are these next steps that I should take? Do I supply the output of the query function to them?

What is mac_neg$P?

re: openjournals/joss-reviews#4036

dependency on `text2vec`

text2vec might be archived on Dec 5. The only instance this package uses text2vec is for calculating cosine similarity for SemAxis softening. It is possible to trim this dependency.

Allow S to be a quanteda dictionary - rnsb

It would be better to allow S to be a dictionary:

require(quanteda)
S <- dictionary(list(japanese = c("Japaner". "Japanerin"),
                          korean = c("Koreaner", "Koreanerin")))

And then calculation the bias per word, i.e. Japaner/Japanerin; but aggregate to calculate the multinominal distribution of P by categories (i.e. japanese).

gesistsa / sweater Goto Github PK

sweater's People

Contributors

Stargazers

Watchers

Forkers

sweater's Issues

Recommend Projects

Recommend Topics

Recommend Org