gesistsa / sweater
Speedy Word Embedding Association Test & Extras using R
License: GNU General Public License v3.0
It is recommended to use the function query() to make a query and calculate_es() to calculate the effect size. You can also use the functions listed below.
I'm confused why there are both general functions and specific ones -- seems like you should either specify the method as an argument to query() or have to use a query function specific to each method -- not both options.
Are functions mac(), rnd(), nas() etc functions that create a query? If so, it would be great to get "query" in their names to make that clear.
Add something about how people seeking support with the software should get it -- where can/should people ask questions? It's not very clear from the current contributing section.
This paper claims that one can hack WEAT by cherry-picking words in A and B. The RIPA can protect against such hacking.
The RIPA method does not appear to be difficult to implement. But the fact that the paper doesn't publish any data bothers me.
Words in A and B must be paired, e.g.
A <- c("man", "men", "king")
B <- c("woman", "women", "queen")
Suggestion: intro to the paper would be stronger/clearer with an example of what implicit word biases are -- that neutral seeming words can be more associated with one gender than the other, or one ethnic group over another. Which can lead to...
Just a few sentences.
Not absolutely needed, but it jumps pretty quickly into social science theory without a generally accessible example of what it is.
I'm not completely clear on what the speed claims are here. What is sweater faster than exactly?
I think benchmark.md is showing that implementing one method with Rcpp is faster than implementing the same method in pure R or R calling C code?
That doesn't seem sufficient for a speed claim in general, especially with that in the package name.
It's very possible I'm missing something though.
I'll note that, other than what's implied by the package name, the paper doesn't make specific speed/performance claims. Dropping the last sentence of the second paragraph of the paper would, I think, remove all reference to speed. If you don't want to go into full comparisons and benchmarking, it may just be enough to talk about how long sweater takes to do common things, and refrain from saying whether that's fast or slow. Having that information -- expected run times -- is very useful just by itself.
It is like a variant of SemAxis.
https://arxiv.org/pdf/1901.07656.pdf
Embedding Quality Test is also possible, and the concept is fun. But it is quite difficult to reproduce because the procedure involves NLTK's WordNet to generate plurals and synonyms. WordNet is available for R, but making this package dependent on rJava is no laughing matter.
In general, cosine is not a good distance measure for all-zero vectors, but we can't change that. This will generate a "divide by zero" problem: deno_* is zero, the sqrt of deno_* is also zero, and so the denominator is zero. A simple solution is to imitate PyTorch and use an eps (PyTorch uses 1e-8). The denominator will then always be positive (due to the squaring and then rooting).
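A minimal base-R sketch of the eps-guarded cosine (the eps value follows PyTorch's default; `cosine_eps()` is a hypothetical helper, not sweater's API):

```r
# Sketch: cosine similarity guarded against all-zero vectors.
# `eps` mirrors PyTorch's default of 1e-8; cosine_eps() is a
# hypothetical helper, not part of sweater.
cosine_eps <- function(x, y, eps = 1e-8) {
  deno <- sqrt(sum(x^2)) * sqrt(sum(y^2))
  sum(x * y) / max(deno, eps)
}

cosine_eps(c(0, 0, 0), c(1, 2, 3))  # 0 instead of NaN
```

With the clamp, an all-zero input yields a similarity of 0 rather than NaN.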
Are effect sizes expensive to compute? Looks like a call to query() will compute the effect size automatically and print it out, so my guess is no? Is there a reason not to store the computed effect size in the sweater objects (so it can be referenced later)? With the effect size functions, I'd have to store the effect size for a query in a separate variable from the query object, which makes it easy to get things messed up in the code -- would be nice to have it as part of the result object.
Could still have es functions for convenience of accessing that component of the object if you wanted.
I ran the first example, and then tried to print the mac_neg object:
> mac_neg
── sweater object ──────────────────────────────────────────────────────────
Test type: mac
Effect size: 0.1375856
── Functions ───────────────────────────────────────────────────────────────
Error in base::nchar(wide_chars$test, type = "width") :
lazy-load database '/Library/Frameworks/R.framework/Versions/4.1/Resources/library/cli/R/sysdata.rdb' is corrupt
In addition: Warning messages:
1: In base::nchar(wide_chars$test, type = "width") :
restarting interrupted promise evaluation
2: In base::nchar(wide_chars$test, type = "width") :
internal error -3 in R_decompress1
> packageVersion("cli")
[1] '3.1.1'
Not sure if this is just my system? I had the same issue when trying to print other sweater objects that were the result of running the examples.
SemAxis by An et al. (2018) is a variant of RND. The only tricky part is augmenting A and B.
https://github.com/ghdi6758/SemAxis/blob/master/code/semaxis.py
I think I will use the notation in the paper, i.e. l instead of the k in the Python code.
Reference:
@article{an2018semaxis,
title={SemAxis: A lightweight framework to characterize domain-specific word semantics beyond sentiment},
author={An, Jisun and Kwak, Haewoon and Ahn, Yong-Yeol},
journal={arXiv preprint arXiv:1806.05521},
year={2018}
}
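A rough base-R sketch of the core SemAxis scoring (the toy vectors and function names are illustrative assumptions; the nearest-neighbour augmentation of A and B, the tricky part noted above, is omitted):

```r
# Sketch of SemAxis scoring (An et al. 2018): project a word onto the
# axis running from the centroid of pole B to the centroid of pole A.
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

semaxis_score <- function(word, A, B, emb) {
  axis <- colMeans(emb[A, , drop = FALSE]) - colMeans(emb[B, , drop = FALSE])
  cos_sim(emb[word, ], axis)
}

# Toy embedding, words as rownames
emb <- rbind(
  good  = c(1, 0.2),
  bad   = c(-1, 0.1),
  happy = c(0.9, 0.3)
)
semaxis_score("happy", A = "good", B = "bad", emb)  # positive: leans toward "good"
```

The softening step in the paper would replace `A` and `B` above with their augmented versions before taking the centroids.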
As sweater is an S3 object, it would be great to have a print method and a plot method. By doing so, one only needs to call query() and it will instantly print the results (say, the effect size).
The plot method is basically the same as plot_bias.
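A hypothetical sketch of such a print method (the `es` slot and the class vector below are assumptions for illustration, not sweater's actual internals):

```r
# Hypothetical print method: surfaces the effect size whenever the
# object is printed, so query() alone shows the result.
print.sweater <- function(x, ...) {
  cat("-- sweater object --\n")
  cat("Test type:   ", class(x)[2], "\n")
  cat("Effect size: ", x$es, "\n")
  invisible(x)
}

# Toy object standing in for the result of query()
mac_neg <- structure(list(es = 0.1375856), class = c("sweater", "mac", "list"))
print(mac_neg)
```

Because S3 auto-printing calls `print()` on the result of a top-level expression, typing the object's name at the console would then display the effect size immediately.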
| set(s) of target words | set(s) of attribute words | Algo | Ref |
|---|---|---|---|
| 1 | 1 | MAC | Manzini |
| 1 | 2 | RNSB / RND | |
| 2 | 2 | WEAT | |
| 2 | 2 | WEFE | |
Do we have it?
TODO
Check that rnsb, nas, and semaxis work.
For 3CosAdd, 3CosMul, LRCor, etc., see this paper.
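For reference, a toy base-R sketch of 3CosAdd (the toy vectors and helper names are made up for illustration):

```r
# 3CosAdd for the analogy a:b :: c:?, scored as
# cos(d, b) - cos(d, a) + cos(d, c) over candidate words d.
cos_sim <- function(x, y) sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))

three_cos_add <- function(a, b, c, candidates, emb) {
  scores <- sapply(candidates, function(d)
    cos_sim(emb[d, ], emb[b, ]) - cos_sim(emb[d, ], emb[a, ]) +
      cos_sim(emb[d, ], emb[c, ]))
  names(which.max(scores))
}

emb <- rbind(
  man   = c(1, 0),
  king  = c(1, 1),
  woman = c(0, 1),
  queen = c(0.2, 1.2),
  apple = c(-1, -0.5)
)
three_cos_add("man", "king", "woman", c("queen", "apple"), emb)  # "queen"
```

3CosMul replaces the sum with a ratio of similarities, which tends to be more robust in practice.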
Probably include the data also. See the ACLwiki
Can the S_diff and T_diff components of the output for the method have names? I think each value corresponds to an input term, yes? They would be more useful as named vectors.
> sw$S_diff
[1] 0.003158583 0.003242220 0.001271607 0.031652155 0.003074379 0.016247332 0.035000510 -0.010817083
> sw$T_diff
[1] -0.0265718087 0.0054876842 -0.0523231481 -0.0117847993 -0.0369267966 0.0224587349 -0.0167662057
[8] 0.0003334358
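Until then, users can attach names themselves, assuming S_diff is aligned with S_words (which the output order suggests; the toy values below stand in for a real query() result):

```r
# Workaround sketch: name the per-word components manually.
sw <- list(
  S_diff  = c(0.0032, 0.0032, 0.0013),
  S_words = c("math", "algebra", "geometry")
)
names(sw$S_diff) <- sw$S_words
sw$S_diff["math"]  # now addressable by word
```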
Looks like the paper cites the R word2vec package, but not the original dataset. Should the original dataset be cited? Same for small_reddit and glove_math.
Perhaps list them in the paper as sample word embeddings that are included in the package?
> S4 <- c("math", "algebra", "geometry", "calculus", "equations", "computation", "numbers", "addition")
> T4 <- c("poetry", "art", "dance", "literature", "novel", "symphony", "drama", "sculpture")
> A4 <- c("male", "man", "boy", "brother", "he", "him", "his", "son")
> B4 <- c("female", "woman", "girl", "sister", "she", "her", "hers", "daughter")
> sw <- query(glove_math, S4, T4, A4, B4)
>
> names(sw)
[1] "S_diff" "T_diff" "S_words" "T_words" "A_words" "B_words"
> class(sw)
[1] "sweater" "weat" "list"
> plot(sw)
Error in plot_bias(x) : No P slot in the input object x.FALSE
^ I ran the example, then tried to plot. Should this be giving this error?
For this:
"sweater uses the concept of query [@badilla2020wefe] to study the biases in ..."
I would also find a little more info on target and attribute sets helpful. When you're supposed to supply two different sets, what is each supposed to be? What should be in S and what in T? I appreciate the references, and realize this may be complicated. Some type of brief summary here would help though. For example, for A and B, it seems each should be a set of words relevant to a group? Or the endpoints of a scale?
When you say target words shouldn't have bias, does that mean they are the words you're testing for bias?
It would be really helpful if the documentation pages for the different methods (or the effect size function pages?) included information on the scale/range/direction of the effect size output. For example, are the values between 0 and 1? What is an example of a lot of bias vs. no bias?
Suggesting people use remotes::install_github("chainsawriot/sweater") instead of devtools::install_github("chainsawriot/sweater") will probably result in fewer issues, as remotes is easier for users to install successfully.
Would also consider rewording: "Or the slightly outdated version from CRAN" -- if you're going to offer via CRAN, that should be kept reasonably up to date, and the github version should be considered the developmental one.
Hi, I'm the author of the PsychWordVec package (an integrated toolkit for word embedding research). Your sweater package inspired me a lot when I developed test_WEAT() and test_RND() for my package.
Recently I browsed the source code of your sweater package and found that you used a different method to compute the pooled SD in weat_es() if the sample sizes are not balanced (n1 != n2). Particularly, I found this line of code (https://github.com/chainsawriot/sweater/blob/master/R/sweater.R#L87):
pooled_sd <- sqrt(((n1 -1) * S_var) + ((n2 - 1) * T_var)/(n1 + n2 + 2))
I'm sure that in statistics this method is usually used to compute the pooled SD (https://www.statisticshowto.com/pooled-standard-deviation/), no matter whether the sample sizes are balanced or not. However, there are two issues to be addressed:
1. Caliskan et al.'s original implementation uses the sd(c(S_diff, T_diff)) approach. How could we reconcile them? (In my test_WEAT(), to avoid such inconsistency, I just follow Caliskan et al.'s approach regardless of whether n1 is equal to n2.)
2. The line of code above is missing a ) + ( before n2 - 1 and has a wrong sign in n1 + n2 + 2 (which should be n1 + n2 - 2 instead). This could produce substantially wrong results because it actually computes the square root of the sum of (n1 - 1) * S_var and (n2 - 1) * T_var / (n1 + n2 + 2). The correct one should be:
pooled_sd <- sqrt(((n1 - 1) * S_var + (n2 - 1) * T_var) / (n1 + n2 - 2))
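To illustrate how far off the misplaced parenthesis can be, with made-up variances and sample sizes:

```r
# Illustrative values only, not from any real query
S_var <- 0.04; T_var <- 0.09; n1 <- 8; n2 <- 6

# As written in the source (misplaced parenthesis, wrong sign):
buggy   <- sqrt(((n1 - 1) * S_var) + ((n2 - 1) * T_var) / (n1 + n2 + 2))
# Standard pooled SD:
correct <- sqrt(((n1 - 1) * S_var + (n2 - 1) * T_var) / (n1 + n2 - 2))

c(buggy = buggy, correct = correct)  # the two disagree substantially
```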
Best,
Bruce
I found it confusing in the documentation how the method was being determined. The examples are listed for specific methods, but the query code doesn't specify the method. I see that the default is to guess, which I assume it does based on the combination of STAB inputs provided? But if the examples are supposed to be of a specific method, it would probably be clearer to show code that invokes that method specifically -- what if you change "guess" behavior in the future?
Entirely not tested!
Are there any other R packages available for bias in word embeddings, or more generally for bias in text corpora? If not, you can state that. If there are, would be good to reference.
The statement that sweater brings together methods that were only reported in papers' supplemental materials is clear -- I get the utility of the package. I just don't know if there's anything else out there or not.
All of the citations to papers in the documentation are great! They'd be even more useful if you linked them, so people could click on the reference to go to it.
To show how fast or slow this basic operation is (vs. Python or other tools).
The first example uses googlenews without introducing it first -- not even that it has word embeddings in it.
Output from the first example isn't explained. I found it confusing.
#> Effect size: 0.1375856
#>
#> ── Functions ─────────────────────────────────
#> • <calculate_es()>: Calculate effect size
#> • <plot()>: Plot the bias of each individual word
Why are the function names in <>? Why are these listed? Are these next steps that I should take? Do I supply the output of the query function to them?
What is mac_neg$P?
text2vec might be archived on Dec 5. The only place this package uses text2vec is for calculating cosine similarity for SemAxis softening. It is possible to trim this dependency.
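A base-R replacement for the cosine-similarity call is small enough to inline (sketch; the function name is made up):

```r
# Pairwise cosine similarity between rows of a matrix, without text2vec.
cos_sim_mat <- function(m) {
  m_norm <- m / sqrt(rowSums(m^2))  # normalise each row to unit length
  tcrossprod(m_norm)                # dot products of normalised rows
}

m <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1))
round(cos_sim_mat(m), 3)
```

`tcrossprod()` on row-normalised vectors gives exactly the cosine matrix, so dropping the dependency would not change results.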
It would be better to allow S to be a dictionary:
require(quanteda)
S <- dictionary(list(japanese = c("Japaner", "Japanerin"),
                     korean = c("Koreaner", "Koreanerin")))
And then calculate the bias per word (i.e. Japaner / Japanerin), but aggregate to calculate the multinomial distribution of P by category (i.e. japanese).
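A sketch of the per-category aggregation this would enable (the bias values are made up for illustration):

```r
# Aggregate per-word bias values by dictionary category.
bias <- c(Japaner = 0.2, Japanerin = 0.3, Koreaner = -0.1, Koreanerin = 0.0)
groups <- list(japanese = c("Japaner", "Japanerin"),
               korean   = c("Koreaner", "Koreanerin"))
sapply(groups, function(w) mean(bias[w]))  # one value per category
```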