wordvectors's Issues

R Session Aborted

When I run train_word2vec() R crashes immediately. The file to be imported is 200 novels run through prep_word2vec() which results in a 120 MB .txt file. I've tried on Mac 10.9 and 10.10 as well as R 3.1 and 3.2. Same result in all cases. I'm guessing you've run this on much larger data. Any ideas?

wrap glove training

I've been using this package a bit to explore the standard GloVe model distributed by Stanford.

It might be useful to let this package train GloVe models as well using text2vec.

I suspect this isn't why people install it, but I see the major advantage of this package being the syntax and function wrapping; text2vec's creator wants to keep a minimal feature set, so I think there's relatively little overlap between the two packages.

Installation of Package under Windows 7 64-bit raises errors

Hi guys,

I am trying to install the package under Windows 7, 64-bit, with R 3.2.4 and RTools 3.3. Unfortunately, I get an error when trying to install the package (see the error trace below). Are there any ideas on how to fix it?

Thank you.

P.S. Below is the error trace:

`> devtools::install_github("bmschmidt/wordVectors")
Downloading GitHub repo bmschmidt/wordVectors@master
from URL https://api.github.com/repos/bmschmidt/wordVectors/zipball/master
Installing wordVectors
"C:/R/R-32~1.4RE/bin/i386/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD INSTALL  \
  "C:/Users/sqladmin/AppData/Local/Temp/Rtmp6bqZZH/devtools21f05e8c6da7/bmschmidt-wordVectors-7f1914c"  \
  --library="C:/R/R-3.2.4revised/library" --install-tests 

* installing *source* package 'wordVectors' ...
** libs

*** arch - i386
gcc -m32 -I"C:/R/R-32~1.4RE/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local323/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w   -O3 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
gcc -m32 -I"C:/R/R-32~1.4RE/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local323/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w   -O3 -Wall  -std=gnu99 -mtune=core2 -c word2phrase.c -o word2phrase.o
gcc -m32 -shared -s -static-libgcc -o wordVectors.dll tmp.def tmcn_word2vec.o word2phrase.o -pthread -Ld:/RCompile/r-compiling/local/local323/lib/i386 -Ld:/RCompile/r-compiling/local/local323/lib -LC:/R/R-32~1.4RE/bin/i386 -lR
installing to C:/R/R-3.2.4revised/library/wordVectors/libs/i386

*** arch - x64
gcc -m64 -I"C:/R/R-32~1.4RE/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local323/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result -w   -O2 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s: Assembler messages:
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1135: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1154: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1158: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1163: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1168: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1173: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1178: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1183: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1193: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1197: Error: no such instruction: `vfmadd312ss (%rbx,%r13,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1201: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1205: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1209: Error: no such instruction: `vfmadd312ss (%rbx,%r13,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1213: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1217: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1220: Error: no such instruction: `vfmadd312ss (%rbx,%r13,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1229: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1248: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1252: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1257: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1262: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1267: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1272: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1277: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1287: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1291: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1295: Error: no such instruction: `vfmadd312ss (%rax,%r13,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1299: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1303: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1307: Error: no such instruction: `vfmadd312ss (%rax,%r13,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1311: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1314: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1878: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1897: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1901: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1906: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1911: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1916: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1921: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1926: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1936: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1940: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1944: Error: no such instruction: `vfmadd312ss (%rbx,%r10,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1948: Error: no such instruction: `vfmadd312ss (%rbx,%r15,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1952: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1956: Error: no such instruction: `vfmadd312ss (%rbx,%r10,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1960: Error: no such instruction: `vfmadd312ss (%rbx,%r15,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1963: Error: no such instruction: `vfmadd312ss (%rbx,%r14,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1972: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1991: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:1995: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2000: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2005: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2010: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2015: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2020: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2029: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2033: Error: no such instruction: `vfmadd312ss (%rax,%r15,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2037: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2041: Error: no such instruction: `vfmadd312ss (%rax,%r10,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2045: Error: no such instruction: `vfmadd312ss (%rax,%r15,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2051: Error: no such instruction: `vfmadd312ss (%rax,%r14,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2054: Error: no such instruction: `vfmadd312ss (%rax,%r10,4),%xmm0,%xmm1'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2057: Error: no such instruction: `vfmadd312ss (%rax,%r15,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2238: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2257: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2261: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2266: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2271: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2276: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2281: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2286: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2296: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2299: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2303: Error: no such instruction: `vfmadd312ss (%rbx,%r9,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2306: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2311: Error: no such instruction: `vfmadd312ss (%rbx,%r9,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2314: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2320: Error: no such instruction: `vfmadd312ss (%rbx,%r9,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2323: Error: no such instruction: `vfmadd312ss (%rbx,%rbp,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2332: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2351: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2355: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2360: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2365: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2370: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2375: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2380: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2389: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2392: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2396: Error: no such instruction: `vfmadd312ss (%rax,%rbp,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2400: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2404: Error: no such instruction: `vfmadd312ss (%rax,%rbp,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2408: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2414: Error: no such instruction: `vfmadd312ss (%rax,%rbp,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2417: Error: no such instruction: `vfmadd312ss (%rax,%r9,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2475: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2494: Error: no such instruction: `vfmadd312ss 4(%rbx),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2498: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2503: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2508: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2513: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2518: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2523: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2533: Error: no such instruction: `vfmadd312ss (%rbx,%rdx,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2536: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2540: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2543: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2548: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2551: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2557: Error: no such instruction: `vfmadd312ss (%rbx,%r8,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2560: Error: no such instruction: `vfmadd312ss (%rbx,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2569: Error: no such instruction: `vfmadd312ss (%rax),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2588: Error: no such instruction: `vfmadd312ss 4(%rax),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2592: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2597: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2602: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2607: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2612: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2617: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2627: Error: no such instruction: `vfmadd312ss (%rax,%rdx,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2630: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2634: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm7'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2637: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm6'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2642: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm5'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2645: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm4'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2651: Error: no such instruction: `vfmadd312ss (%rax,%r11,4),%xmm0,%xmm8'
C:\Users\sqladmin\AppData\Local\Temp\ccSxQ6f1.s:2654: Error: no such instruction: `vfmadd312ss (%rax,%r8,4),%xmm0,%xmm7'
make: *** [tmcn_word2vec.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/R/R-32~1.4RE/etc/x64/Makeconf" -f "C:/R/R-32~1.4RE/share/make/winshlib.mk" SHLIB="wordVectors.dll" WIN=64 TCLBIN=64 OBJECTS="tmcn_word2vec.o word2phrase.o"' had status 2
ERROR: compilation failed for package 'wordVectors'
* removing 'C:/R/R-3.2.4revised/library/wordVectors'
Error: Command failed (1)`

Error when using reject()

I'm experimenting with the kind of vector rejection described here: http://bookworm.benschmidt.org/posts/2015-10-30-rejecting-the-gender-binary.html

After creating my model:

ff_vectors = train_word2vec("data/processed_tweets.txt")

I try:

beast = ff_vectors[["beast"]] %>% reject(ff_vectors[["points"]])

and get:

Error in crossprod(t(matrix %*% b)/as.vector((b %*% b)), b) : 
  non-conformable arguments

Any help would be appreciated. I'm very interested in working more with the package.
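
For reference, vector rejection itself is simple to state: the rejection of a from b is a minus its projection onto b. Below is a minimal base-R sketch, independent of the package (`reject_vec` is a hypothetical helper, not the package's `reject()`). It treats both inputs as plain numeric vectors, which sidesteps the matrix-shape question that the "non-conformable arguments" message points at.

```r
# Hypothetical helper: remove from `a` its component along `b`.
reject_vec <- function(a, b) {
  a <- as.numeric(a)
  b <- as.numeric(b)
  stopifnot(length(a) == length(b), any(b != 0))
  a - (sum(a * b) / sum(b * b)) * b
}

a <- c(3, 4, 0)
b <- c(1, 0, 0)
r <- reject_vec(a, b)
sum(r * b)  # zero: r is orthogonal to b
```

If the same arithmetic works on `as.numeric(ff_vectors[["beast"]])` and `as.numeric(ff_vectors[["points"]])`, the problem is more likely in how the two 1-row objects are shaped than in the vectors themselves.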

include vignette?

Hello,

I figured out how to run install_github in such a way as to compile the vignette, but it's taking a long time. Can we include a PDF vignette in the repo and link to it from README.md? It could even be included in a separate branch if you don't want it to pollute the revision history.

I was surprised not to be able to find a PDF vignette of this project on Google; it would be very useful. I'm trying to figure out how to make a 2-d scatter plot from a given subset of the word vectors. It looks like that is being done in the vignette, but it would be useful to be able to see the example plot output first.

Thanks!
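
One generic way to get a 2-d scatter plot from a subset of word vectors is to project the rows onto their first two principal components. The sketch below uses a random matrix with made-up row names as a stand-in for a subset of a trained model; it is not necessarily what the vignette does.

```r
set.seed(1)
# Stand-in for a subset of word vectors: 6 "words" x 20 dimensions.
words <- c("salt", "pepper", "sugar", "flour", "butter", "milk")
vecs <- matrix(rnorm(length(words) * 20), nrow = length(words),
               dimnames = list(words, NULL))

# Project onto the first two principal components.
pca <- prcomp(vecs, center = TRUE, scale. = FALSE)
coords <- pca$x[, 1:2]

# Draw an empty frame, then label each point with its word.
plot(coords, type = "n", xlab = "PC1", ylab = "PC2")
text(coords, labels = rownames(coords))
```

With a real model, `vecs` would be replaced by the rows for the words of interest.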

problem to call library

When I use: library(wordVectors-2.0)
it says:
Error in library(wordVectors-2.0) :
there is no package called ‘wordVectors-2.0’

The same happens for wordVector-master or any other branch of your project on GitHub.

I'm new to RStudio. I checked the web and, following some advice, manually deleted the folder and then tried:

install.packages("D:/Folder/wordVectors-2.0.zip")
Installing package into ‘C:/Users/Admin/Documents/R/win-library/3.3’
(as ‘lib’ is unspecified)
Warning in install.packages :
package ‘D:/Folder/wordVectors-2.0.zip’ is not available (for R version 3.3.2)
This time it did not create any folder, so it seems to be a compatibility problem, because of "package ‘D:/Folder/wordVectors-2.0.zip’ is not available (for R version 3.3.2)"?

Faster version of function prep_word2vec

As you note in the README, prep_word2vec is slow. But it seems to be unnecessarily slow because it uses some base R functions and has to do some splitting of long lines. If you are willing to take a dependency on the tokenizers package (and thus stringi, which you already check for), then the function can probably go quite a bit faster. Here is a sketch of a function that takes roughly a tenth of the time on the cookbook corpus. (On my machine, 52.7 seconds for prep_word2vec and 5.8 seconds for prep_word2vec_alt.) This function does not include the bundle_ngrams option. I can write this up properly, including bundle_ngrams, and send it as a PR if you are interested.

The resulting files can't be tested for identical results, since prep_word2vec introduces NA in places for reasons I don't understand.

# Presumably these will be available as imports
# require(readr)
# require(stringr)
# require(tokenizers)
library(magrittr)
library(wordVectors)

prep_word2vec_alt <- function(origin, destination, lowercase = TRUE) {
  files <- list.files(origin, recursive = TRUE, full.names = TRUE)
  Map(prep_single_file, files, destination, lowercase)
  invisible(destination)
}

prep_single_file <- function(file_in, file_out, lowercase) {
  message("Prepping ", file_in)

  text <- file_in %>%
    readr::read_file() %>%
    tokenizers::tokenize_words(lowercase = lowercase, simplify = TRUE) %>%
    stringr::str_c(collapse = " ")

  stopifnot(length(text) == 1)
  readr::write_lines(text, file_out, append = TRUE)
  return(TRUE)
}

original_time <- system.time({
  cookbooks <- prep_word2vec("cookbooks", "cookbook.txt", lowercase = TRUE)
  })
alternate_time <- system.time({
  cookbooks_alt <- prep_word2vec_alt("cookbooks", "cookbook-alt.txt",
                                                lowercase = TRUE)
  })

original_time
alternate_time
> original_time
   user  system elapsed 
 25.402  27.369  52.729 
> alternate_time
   user  system elapsed 
  4.903   0.364   5.809 

subscript out of bounds

Hi Ben,

I'm playing with wordvectors again, trying to replicate your genderless post on Melodee Beals' colonial newspaper database. Everything else is working fine, but when I do this:

genderless_cnd = cnd %>% reject(cnd[["he"]] - cnd[["she"]]) %>% reject(cnd[["man"]] - cnd[["woman"]])
#gendered CND:
cnd %>% nearest_to(cnd[["she"]],20) %>% names
#genderless CND:
genderless_cnd %>% nearest_to(genderless_cnd[["she"]],20) %>% names

I get the error:

Error in genderless_cnd[["she"]] : subscript out of bounds

I've tried stepping through the error with options(error=recover), but I'm afraid it's beyond me. I wondered what you thought: have I messed something up?

Fatal Error on 1 MB file

Congratulations, the package is great and thanks for developing it.

I've tested it with some standard datasets and it works great (including the 50 MB cookbooks). However, when using it with a personal 1 MB dataset written in Brazilian Portuguese, R crashes every single time. I've already removed punctuation and excess white space, tried with 1/2/4/8 threads, 100/200/500 vectors, and with/without removing stopwords, but got no better result. Do you have any idea what the reason for this crash could be?

Help

Allow variables in formulas

For testing quantities, it would be nice to allow variables.

Currently, this code fails with "Error in tree[[1]] : object of type 'symbol' is not subsettable". It looks like a parse error, not a namespace one, so it should be an easy fix.

dist = 1.2
form = ~ "king" + dist * ("woman" - "man")
glove %>% closest_to(form)
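
Until variables are supported directly, one possible workaround is to splice the value into the formula before it is passed on, so that the formula contains only literals. This is a base-R sketch using `bquote()`; whether `closest_to()` then accepts the result depends on it handling numeric literals, which is assumed here rather than checked.

```r
dist <- 1.2

# bquote() substitutes .(dist) with its current value; eval() turns the
# resulting call into an actual formula object.
form <- eval(bquote(~ "king" + .(dist) * ("woman" - "man")))

form
all.vars(form)  # no free variables remain in the formula
```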

float?

Hi - it's mentioned that single precision support isn't available in R. Perhaps that has changed?

When I run install.packages('float'), a package that supports single precision is installed.

Use binary format by default

I just added code to read the binary format. It's about a third the size and takes about a third less time to read in, so I see no reason not to ultimately use it as the standard data interchange format instead of the text representation. Worth making sure that unicode token labels are making it through the gauntlet first, though.

Interestingly, the text version gzips down potentially a little smaller than the binary ones. If space is the only thing that matters, maybe we'd want to look at reading gzips. But faster read/write times are important, too.
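
For readers unfamiliar with the binary layout, here is a base-R round-trip sketch. It assumes the layout written by the original word2vec C tool: a text header with vocabulary size and dimension, then for each word the token, a space, the values as 4-byte little-endian floats, and a newline. `write_w2v_bin` and `read_w2v_bin` are hypothetical helpers, not the package's own reader.

```r
# Write a matrix (rows = words) in the assumed word2vec binary layout.
write_w2v_bin <- function(m, path) {
  con <- file(path, "wb")
  on.exit(close(con))
  writeBin(charToRaw(sprintf("%d %d\n", nrow(m), ncol(m))), con)
  for (i in seq_len(nrow(m))) {
    writeBin(charToRaw(paste0(rownames(m)[i], " ")), con)
    writeBin(as.numeric(m[i, ]), con, size = 4, endian = "little")
    writeBin(charToRaw("\n"), con)
  }
}

# Read the same layout back into a matrix with words as row names.
read_w2v_bin <- function(path) {
  con <- file(path, "rb")
  on.exit(close(con))
  # Collect bytes up to (not including) a delimiter byte.
  read_until <- function(delim) {
    out <- raw(0)
    repeat {
      b <- readBin(con, "raw", n = 1)
      if (length(b) == 0 || b == delim) break
      out <- c(out, b)
    }
    out
  }
  header <- as.integer(strsplit(rawToChar(read_until(charToRaw("\n"))), " ")[[1]])
  m <- matrix(0, nrow = header[1], ncol = header[2])
  words <- character(header[1])
  for (i in seq_len(header[1])) {
    # Drop the newline left over from the previous record, if any.
    words[i] <- sub("^\n", "", rawToChar(read_until(charToRaw(" "))))
    m[i, ] <- readBin(con, "numeric", n = header[2], size = 4, endian = "little")
  }
  Encoding(words) <- "UTF-8"  # token bytes are assumed to be UTF-8
  rownames(m) <- words
  m
}

m <- matrix(c(0.5, -1.25, 2, 0.25, 1, -3), nrow = 2, byrow = TRUE,
            dimnames = list(c("king", "queen"), NULL))
path <- tempfile(fileext = ".bin")
write_w2v_bin(m, path)
back <- read_w2v_bin(path)
```

Reading byte by byte is slow and is only meant to make the layout explicit; the example values are exactly representable as single-precision floats, so the round trip is lossless here.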

doc2vec

Thanks for developing this package. Do you have any plans to implement a doc2vec model in the future?

Add function to 'improve' models

This article spells out a pretty simple pair of tricks that supposedly makes pre-trained embeddings perform better on most benchmarks. I've implemented it in R, but don't have the benchmarks locally to be sure it's useful. I will try to bundle it up into this package at some point.

plot function not working

After training a model, I get:

Vocab size (unigrams + bigrams): 4655
Words in train file: 14208
Starting training using file temp.prep
Vocab size: 343
Words in train file: 4333

But when I try to use plot, I get this error:

> plot(model)
Error in xy.coords(x, y, xlabel, ylabel, log) :
  'x' and 'y' lengths differ
In addition: Warning message:
In matrix(nextup, ncol = j) :
  data length [343] is not a sub-multiple or multiple of the number of rows [172]

Any ideas would be appreciated. Thanks!

is wordVectors on cran?

I get error from install of:
package 'wordVectors' is not available (for R version 3.4.2)

Is it available for different R version?

Quick start - error in type.convert

Problem:
Running through the _Quick Start_ instructions in README.md, the process dies with an error in type.convert.

> model = train_word2vec("cookbooks.txt",output="cookbooks.vectors",threads = 3,vectors = 100,window=12)
Starting training using file /home/brandon/repo/stack/data/cookbooks.txt
Vocab size: 32421
Words in train file: 10577282
Alpha: 0.000195  Progress: 99.24%  Words/thread/sec: 18.39k  
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals,  : 
  invalid multibyte string at '<f6>(<83>;<a4><d0>�;��{<bb>{<d4>V<bb><b8>�<b3>:q<fd>E;ףv:<9a><99>]9<f6>(l<bb><d7>c�;'

I tried to retrain model on a small subset of cookbooks and that failed similarly.

> model = train_word2vec("cookbooks.txt",output="cookbooks.vectors",threads = 3,vectors = 100,window=12, force=T)
Starting training using file /home/brandon/repo/stack/data/cookbooks.txt
Vocab size: 5331
Words in train file: 345615
Alpha: 0.000073  Progress: 100.34%  Words/thread/sec: 19.73k  
Error in type.convert(data[[i]], as.is = as.is[i], dec = dec, numerals = numerals,  : 
  invalid multibyte string at '<f6>(<83>;<a4><d0>�;��{<bb>{<d4>V<bb><b8>�<b3>:q<fd>E;ףv:<9a><99>]9<f6>(l<bb><d7>c�;'

It appears to be choking on an unusual character or unexpected byte. Was there a change in the way the cookbook data was initially saved versus how it is currently processed? The following warnings are also shown:

In addition: Warning messages:
1: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 1 appears to contain embedded nulls
2: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 2 appears to contain embedded nulls
3: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 5 appears to contain embedded nulls
4: In utils::read.table(filename, header = F, skip = 1, nrows = 1,  :
  line 1 appears to contain embedded nulls

Additional system details:
OS - Ubuntu 14.04
R - [1] "R version 3.2.3 (2015-12-10)"
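
The embedded-null warnings and the unprintable bytes in the error message would be consistent with the vectors file containing binary data while being parsed as text. That reading is an assumption, not something confirmed in this report. A quick base-R check for NUL bytes at the start of a file (`looks_binary` is a hypothetical helper):

```r
# Hypothetical helper: TRUE if the first `n` bytes contain a NUL byte,
# which plain-text vector files should never have.
looks_binary <- function(path, n = 4096) {
  bytes <- readBin(path, what = "raw", n = n)
  any(bytes == as.raw(0))
}

# Two small demo files: one text, one holding 4-byte floats.
txt <- tempfile()
writeLines(c("1 3", "king 0.5 -1.25 2"), txt)
bin <- tempfile()
writeBin(c(0.5, -1.25, 2), bin, size = 4)

is_txt_binary <- looks_binary(txt)
is_bin_binary <- looks_binary(bin)
```

Running `looks_binary("cookbooks.vectors")` on the trained output would show which kind of file the reader is actually being handed.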

examples of reject

I've been having a lot of fun playing with this. I have a feature request relating to the 'reject' function.

I noticed that your examples in the "?reject" help page have us applying it to a full model, but README.md has a "bank" example where you apply it to the vector you are querying.

Is it possible to summarize the difference between these two ways of querying? The second way is much faster of course. They seem to produce results which are similar but not equivalent.

Also, what is the meaning of "/" in a formula argument to "closest_to"? Does it do actual division? I would have expected it to do something like the "reject" operation in the "bank" example.

I guess my request/suggestion is (1) expand the documentation for "?reject" to have both modes of usage and (2) add syntactic sugar for the "reject" operation.

Thank you.
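
The "similar but not equivalent" observation is what the linear algebra predicts. Below is a base-R sketch with random data (all names are illustrative; none are package functions). Once the query is orthogonal to the rejected direction, its dot products with the original rows and with the rejected rows are identical, but rejecting from every row also shrinks the row norms, so the cosine similarities, and potentially the rankings, differ.

```r
set.seed(42)
M <- matrix(rnorm(5 * 4), nrow = 5)   # stand-in "model": 5 words x 4 dims
b <- rnorm(4)                          # direction to reject
q <- M[1, ]                            # query vector

reject_vec  <- function(a, b) a - (sum(a * b) / sum(b * b)) * b
reject_rows <- function(M, b) M - (M %*% b) %*% t(b) / sum(b * b)
cosine_rows <- function(M, v) {
  as.vector(M %*% v) / (sqrt(rowSums(M^2)) * sqrt(sum(v^2)))
}

q_rej <- reject_vec(q, b)
M_rej <- reject_rows(M, b)

# Mode 1: reject from the query only, compare against the original rows.
sim_query_only <- cosine_rows(M, q_rej)
# Mode 2: reject from the whole model, then query within it.
sim_full_model <- cosine_rows(M_rej, M_rej[1, ])

# Dot products agree across the two modes; cosines do not in general.
all.equal(as.vector(M %*% q_rej), as.vector(M_rej %*% q_rej))
```

Mode 1 touches one vector, which is why it is much faster; mode 2 renormalises every word after the rejection.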

trouble using wordVectors in Rscript.

Pasted from an e-mail I received for tracking.

I’m having trouble using wordVectors in Rscript.
A minimal case:

library(magrittr)
library(wordVectors)

model <- read.vectors('foo.bin')
model %>% closest_to('man') %>% print()

This code works fine in an interactive R session, but it fails when run via Rscript:


Error in context[[formula]] : subscript out of bounds
Calls: %>% ... <Anonymous> -> closest_to -> cosineSimilarity -> sub_out_formula
Execution halted

(I get the same failure if I rewrite it to use traditional(function(nesting(syntax))) instead of magrittr, BTW.)

Code like

cosineSimilarity(model[['man']],model[['woman']]) %>% print()

also fails similarly in Rscript but works when stepped through or source()’d in an interactive R session.

Back in 2015 I used wordVectors extensively in Rscript with no problems, so whatever’s going on here seems to be connected to changes since then.
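
One thing worth ruling out: before R 3.5.0, `Rscript` did not attach the `methods` package at startup, while interactive sessions did, and code that relies on S4 classes could behave differently as a result. Whether that is the cause here is an assumption. Attaching `methods` explicitly at the top of the script is a cheap test; `DemoVSM` below is a hypothetical stand-in class used only to show that the S4 machinery is available.

```r
# Attach methods explicitly; harmless when it is already attached.
library(methods)

# S4 machinery is now available regardless of how the script was launched.
setClass("DemoVSM", contains = "matrix")
x <- new("DemoVSM", matrix(1:4, nrow = 2))
is(x, "matrix")
```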

Add function to align different models

This Stanford paper describes the most promising method I've seen so far for aligning multiple different models; it would be a useful addition here.

In order to compare word vectors from different time-periods we must ensure that the vectors are aligned to the same coordinate axes. Explicit PPMI vectors are naturally aligned, as each column simply corresponds to a context word. Low-dimensional embeddings will not be naturally aligned due to the non-unique nature of the SVD and the stochastic nature of SGNS. In particular, both these methods may result in arbitrary orthogonal transformations, which do not affect pairwise cosine-similarities within-years but will preclude comparison of the same word across time. Previous work circumvented this problem by either avoiding low-dimensional embeddings (e.g., Gulordava and Baroni, 2011; Jatowt and Duh, 2014) or by performing heuristic local alignments per word (Kulkarni et al., 2014).
We use orthogonal Procrustes to align the learned low-dimensional embeddings. Defining $W^{(t)} \in \mathbb{R}^{d \times |V|}$ as the matrix of word embeddings learned at year $t$, we align across time-periods while preserving cosine similarities by optimizing:
$$R^{(t)} = \arg\min_{Q^\top Q = I} \left\| W^{(t)} Q - W^{(t+1)} \right\|_F, \tag{4}$$
with $R^{(t)} \in \mathbb{R}^{d \times d}$. The solution corresponds to the best rotational alignment and can be obtained efficiently using an application of SVD (Schönemann, 1966).
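The SVD solution mentioned in the quoted passage can be sketched in a few lines of base R. This is only an illustration, not code from wordVectors: the function names `procrustes_rotation` and `align_to` are made up here, and the sketch assumes both models are plain numeric matrices with one row per word (same words, same order), which is the transpose of the $d \times |V|$ layout in the quote.

```r
# Orthogonal Procrustes: find the orthogonal Q minimizing ||A %*% Q - B||_F.
# Assumption: A and B are numeric matrices with one row per word (same words,
# same order) and one column per embedding dimension.
procrustes_rotation <- function(A, B) {
  s <- svd(crossprod(A, B))   # SVD of t(A) %*% B
  s$u %*% t(s$v)              # Q = U V^T
}

# Express model A in the coordinate axes of model B.
align_to <- function(A, B) {
  A %*% procrustes_rotation(A, B)
}

# Toy check: rotate a random "model" and recover the original.
set.seed(42)
B  <- matrix(rnorm(50 * 5), nrow = 50)
Q0 <- qr.Q(qr(matrix(rnorm(25), nrow = 5)))  # a random orthogonal matrix
A  <- B %*% t(Q0)                            # B expressed in rotated axes
aligned   <- align_to(A, B)
max_error <- max(abs(aligned - B))           # close to zero
```

With real models the two vocabularies would first have to be restricted to their shared words, in the same row order, before calling `align_to`.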

Test training on travis

I've removed the training tests from Travis because I can't figure out how to get them to work. Something seems to break when I try to write to tmp file.

They should be reactivated.

is there a way to get the token frequencies

Great tool!

I couldn't figure out how words are sorted by frequency if the frequencies are not part of the .bin file or the VectorSpaceModel. I guess the frequencies are tracked in the code which does the training, but left out of the trained vector file? Maybe I'll use 1/rank (Zipf's law) to approximate the frequency, but it would be good to have this documented somewhere. Thanks!
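The 1/rank idea from the post can be written down as a small base R helper. This is a rough sketch under two assumptions: the rows of the model really are ordered from most to least frequent, and Zipf's law with exponent 1 is a good enough fit. The name `zipf_weights` is hypothetical, and the result is an approximation, not the counts seen during training.

```r
# Approximate relative frequencies from rank alone (Zipf's law, exponent 1).
# Assumption: rank 1 is the most frequent word in the model.
zipf_weights <- function(vocab_size) {
  w <- 1 / seq_len(vocab_size)
  w / sum(w)                  # normalize so the weights sum to 1
}

weights <- zipf_weights(5)
```

For a loaded model, `zipf_weights(nrow(model))` would give one weight per row, again assuming the rows are in frequency order.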

Working with date-time format, cant handle POSIXct. (Error in as.POSIXlt.numeric(x) : 'origin' must be supplied)

Hello,

I have a date-time column in my database in a format like "2017-01-02 8:27". I want to add 10 minutes to these date-times.

dat$EventTime=as.POSIXct(strptime( dat$EventTime, "%Y-%m-%d %H:%M"), tz = "", origin = '1970-01-01 00:00')

##date-time format becomes 2017-01-02 08:27:00 which is ok, however when I try to add 10 minutes

dat$EventTime[1]+minute(10)

I get this error:

Error in as.POSIXlt.numeric(x) : 'origin' must be supplied

Could you please help me with that issue?

Thanks,
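A minimal base R sketch of adding ten minutes to a POSIXct value, for reference. It does not use wordVectors at all. The likely cause of the error, assuming `minute()` here comes from lubridate, is that `minute()` extracts the minute component of a date-time, whereas `minutes()` builds a ten-minute period; calling `minute(10)` therefore tries to convert the number 10 to a date-time without an origin. The time zone "UTC" below is chosen only to make the example reproducible.

```r
# Parse the timestamp; %H accepts the single-digit hour in "8:27".
event_time <- as.POSIXct("2017-01-02 8:27", format = "%Y-%m-%d %H:%M", tz = "UTC")

# POSIXct counts seconds, so either form adds ten minutes.
later     <- event_time + as.difftime(10, units = "mins")
later_alt <- event_time + 10 * 60

format(later, "%Y-%m-%d %H:%M")
```

The same arithmetic works on a whole column, for example `dat$EventTime + 10 * 60` once the column is POSIXct.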

Training progress report and speed

Hi,

I installed the latest version recently.
There are two differences compared to the previous version:

  1. There is no progress indicator.
  2. Training seems slower.

Can I get the progress indicator back in the latest version?
And is there any reason for the apparent slowdown?

Sentence Window/context truncation

Great package! Quick question: It seems that the train_word2vec function only takes a single txt file as input. The original word2vec code can take inputs in paragraph/sentence format (which I guess would be a list of lists in R) and automatically truncates the window so there is no overlap in contexts across sentences. Is there a way to do that with wordVectors? I think not but wanted to ask.

Windows 8 (but not Windows 7 or 10?) compilation fails

Hi there,

I am having trouble installing this on Windows 8 64-bit. I do have Rtools installed as well.

> devtools::install_github("bmschmidt/wordVectors")
Downloading GitHub repo bmschmidt/wordVectors@master
Installing wordVectors
"C:/PROGRA~1/R/R-32~1.1/bin/x64/R" --no-site-file --no-environ --no-save  \
  --no-restore CMD INSTALL  \
  "C:/Users/MotoBot/AppData/Local/Temp/Rtmpq836VL/devtools21f45c5a6f54/bmschmidt-wordVectors-cfd14a5"  \
  --library="C:/Users/MotoBot/Documents/R/win-library/3.2" --install-tests 

* installing *source* package 'wordVectors' ...
** libs

*** arch - i386
gcc -m32 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O3 -Wall  -std=gnu99 -mtune=core2 -c tmcn_distance.c -o tmcn_distance.o
In file included from tmcn_distance.c:2:0:
distance.h: In function 'distance':
distance.h:40:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:40:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:41:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:41:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:42:3: warning: implicit declaration of function 'malloc' [-Wimplicit-function-declaration]
distance.h:42:19: warning: incompatible implicit declaration of built-in function 'malloc' [enabled by default]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: too many arguments for format [-Wformat-extra-args]
distance.h:84:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:84:5: warning: too many arguments for format [-Wformat-extra-args]
gcc -m32 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O3 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
In file included from tmcn_word2vec.c:3:0:
word2vec.h: In function 'LearnVocabFromTrainFile':
word2vec.h:280:7: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:280:7: warning: format '%c' expects argument of type 'int', but argument 2 has type 'long long int' [-Wformat]
word2vec.h:280:7: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:292:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:292:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:293:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:293:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'SaveVocab':
word2vec.h:302:3: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:302:3: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'ReadVocab':
word2vec.h:321:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:321:5: warning: format '%c' expects argument of type 'char *', but argument 3 has type 'long long int *' [-Wformat]
word2vec.h:321:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:326:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:326:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:327:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:327:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'TrainModelThread':
word2vec.h:367:36: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
word2vec.h:373:50: warning: cast from pointer to integer of different size [-Wpointer-to-int-cast]
word2vec.h: In function 'TrainModel':
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: too many arguments for format [-Wformat-extra-args]
tmcn_word2vec.c: In function 'tmcn_word2vec':
tmcn_word2vec.c:12:9: warning: assignment makes pointer from integer without a cast [enabled by default]
tmcn_word2vec.c: In function 'TrainModelThread':
word2vec.h:530:1: warning: control reaches end of non-void function [-Wreturn-type]
gcc -m32 -shared -s -static-libgcc -o wordVectors.dll tmp.def tmcn_distance.o tmcn_word2vec.o -pthread -Ld:/RCompile/r-compiling/local/local320/lib/i386 -Ld:/RCompile/r-compiling/local/local320/lib -LC:/PROGRA~1/R/R-32~1.1/bin/i386 -lR
installing to C:/Users/MotoBot/Documents/R/win-library/3.2/wordVectors/libs/i386

*** arch - x64
gcc -m64 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O2 -Wall  -std=gnu99 -mtune=core2 -c tmcn_distance.c -o tmcn_distance.o
In file included from tmcn_distance.c:2:0:
distance.h: In function 'distance':
distance.h:40:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:40:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:41:3: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:41:3: warning: too many arguments for format [-Wformat-extra-args]
distance.h:42:3: warning: implicit declaration of function 'malloc' [-Wimplicit-function-declaration]
distance.h:42:19: warning: incompatible implicit declaration of built-in function 'malloc' [enabled by default]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:45:5: warning: too many arguments for format [-Wformat-extra-args]
distance.h:84:5: warning: unknown conversion type character 'l' in format [-Wformat]
distance.h:84:5: warning: too many arguments for format [-Wformat-extra-args]
gcc -m64 -I"C:/PROGRA~1/R/R-32~1.1/include" -DNDEBUG     -I"d:/RCompile/r-compiling/local/local320/include"  -lm -pthread -O3 -march=native -Wall -funroll-loops -Wno-unused-result    -O2 -Wall  -std=gnu99 -mtune=core2 -c tmcn_word2vec.c -o tmcn_word2vec.o
In file included from tmcn_word2vec.c:3:0:
word2vec.h: In function 'LearnVocabFromTrainFile':
word2vec.h:280:7: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:280:7: warning: format '%c' expects argument of type 'int', but argument 2 has type 'long long int' [-Wformat]
word2vec.h:280:7: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:292:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:292:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:293:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:293:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'SaveVocab':
word2vec.h:302:3: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:302:3: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'ReadVocab':
word2vec.h:321:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:321:5: warning: format '%c' expects argument of type 'char *', but argument 3 has type 'long long int *' [-Wformat]
word2vec.h:321:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:326:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:326:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h:327:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:327:5: warning: too many arguments for format [-Wformat-extra-args]
word2vec.h: In function 'TrainModel':
word2vec.h:544:84: warning: cast to pointer from integer of different size [-Wint-to-pointer-cast]
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: unknown conversion type character 'l' in format [-Wformat]
word2vec.h:549:5: warning: too many arguments for format [-Wformat-extra-args]
tmcn_word2vec.c: In function 'tmcn_word2vec':
tmcn_word2vec.c:12:9: warning: assignment makes pointer from integer without a cast [enabled by default]
tmcn_word2vec.c: In function 'TrainModelThread':
word2vec.h:530:1: warning: control reaches end of non-void function [-Wreturn-type]
C:\Users\MotoBot\AppData\Local\Temp\cc88Kdj9.s: Assembler messages:
C:\Users\MotoBot\AppData\Local\Temp\cc88Kdj9.s:1094: Error: no such instruction: `vfmadd312ss (%rbx),%xmm0,%xmm8'

[... edited out by BMS--dozens of other assembler errors]

make: *** [tmcn_word2vec.o] Error 1
Warning: running command 'make -f "Makevars.win" -f "C:/PROGRA~1/R/R-32~1.1/etc/x64/Makeconf" -f "C:/PROGRA~1/R/R-32~1.1/share/make/winshlib.mk" SHLIB="wordVectors.dll" WIN=64 TCLBIN=64 OBJECTS="tmcn_distance.o tmcn_word2vec.o"' had status 2
ERROR: compilation failed for package 'wordVectors'
* removing 'C:/Users/MotoBot/Documents/R/win-library/3.2/wordVectors'

My environment:

> sessionInfo()
R version 3.2.1 (2015-06-18)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 8 x64 (build 9200)

locale:
[1] LC_COLLATE=English_South Africa.1252  LC_CTYPE=English_South Africa.1252   
[3] LC_MONETARY=English_South Africa.1252 LC_NUMERIC=C                         
[5] LC_TIME=English_South Africa.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

loaded via a namespace (and not attached):
 [1] httr_1.0.0     R6_2.1.1       magrittr_1.5   tools_3.2.1    curl_0.9.3    
 [6] memoise_0.2.1  stringi_1.0-1  knitr_1.11     stringr_1.0.0  digest_0.6.8  
[11] devtools_1.9.1

Any idea how I can fix this?

Thanks

Does it accept Arabic (or any non-ASCII) in general?

I'm having difficulty vectorizing an Arabic text; I can't seem to get anything useful.

The word2vec function is only extracting funny characters (like emojis and so on) from a text file of about 200k Arabic words. It also seems to convert these characters to codepoint values.

I would like to have a nice, normal-looking word2vec model for my Arabic text.

Any comments or workarounds?

Fallback to read.binary.vectors if read.vectors fails

One problem with the switch to writing in the binary format is that users who upgrade (which is basically forced with every R version bump) may have code that gets broken if they use the suffix '.vectors'. (They can still read in an old model; but now if they train a new one with the old suffix, it chokes on read.)

One easy way to solve this would be to try-catch the read.vectors function: if it fails to read something as text, we could just take a shot at reading it in binary format. If the results are plausible (how to test? Simply by length, probably; or else valid unicode-ness of the row names), return them.
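The try-catch idea could look roughly like the following base R sketch. It is not the package's implementation: `read_with_fallback` is a hypothetical name, the two readers are passed in as arguments so the example runs without any model file, and plausibility is checked only by row count, the simplest of the tests suggested above.

```r
# Try a text reader first; if it raises an error, fall back to a binary reader
# and accept the result only if it looks plausible (here: enough rows).
# read_text and read_binary are hypothetical stand-ins supplied by the caller.
read_with_fallback <- function(path, read_text, read_binary, min_rows = 1) {
  tryCatch(
    read_text(path),
    error = function(e) {
      result <- read_binary(path)
      if (NROW(result) < min_rows) {
        stop("binary fallback gave an implausible result for ", path)
      }
      result
    }
  )
}

# Stub readers so the sketch runs on its own.
failing_text_reader <- function(path) stop("not a text file")
stub_binary_reader  <- function(path) matrix(0, nrow = 3, ncol = 2)
fallback_result <- read_with_fallback("model.vectors",
                                      failing_text_reader, stub_binary_reader)
```

In the package, the two stubs would presumably be replaced by the existing text and binary readers.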

Training on bigrams?

Hey! Wonderful package. Is there currently support for training the word2vec model on bigrams as well as unigrams? I've had some nice success using bigrams via gensim, and was wondering if I'm missing the way to include that here.

Thanks! Again, great work.

Vocabulary size is much smaller than it ought to be

"Vocab size" is way off. See the attached screen shot: 1.4 billion words of PubMedCentral author manuscripts, and the vocabulary size is 1,307, according to the status message output. Doesn't seem likely.

I'm using whatever version of wordVectors was on GitHub as of mid-January 2015. Not sure what version of RStudio--I think those 1.4 billion words of text have choked my laptop to death... OS X.

screenshot 2016-01-19 19 48 53

Restore printing of status updates

In switching from printf to the CRAN-approved Rprintf, I've hit a problem. The C code
wants to print out status updates to the console, here. It used to work fine; but when these lines are uncommented,
R crashes with an error that C stack usage exceeds 261600796060 or something of the sort. The place at which the crash comes is proportional to how often the loop is run (if I change the counter here to once every 1000 lines, it crashes ten times sooner, and ten times later if it's once every 100,000 lines). So apparently calling Rprintf creates a memory leak. I can't find any solutions to this online.

For now I've turned off printing, but this is a process that can take hours, and visual feedback is extremely useful.

input text file

I was using the train_word2vec function to train vectors.
But I was wondering what the best input text format is.
Should I separate sentences line by line? Would that interfere with training?

How does the algorithm deal with spaces and line breaks?

Thank you.

How to get plot points

Hi!

I am able to see the plot on my screen using R Studio, but I'm unable to get the x- and y- coordinates of the words I plotted.

plot(model, perplexity=50)

Also, how do I write those plot points (x- and y- coordinates, not the vectors) to a csv or txt file?
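One way to get coordinates that can be written to a file is to compute a 2-D projection yourself. The sketch below swaps in PCA from base R (`prcomp`) instead of whatever layout `plot()` draws (the `perplexity` argument suggests t-SNE), so the points will not match the on-screen plot. The toy matrix stands in for a model and is an assumption of this example.

```r
# Toy stand-in for a model: a numeric matrix with words as row names.
set.seed(1)
vectors <- matrix(rnorm(5 * 4), nrow = 5,
                  dimnames = list(c("der", "die", "und", "in", "von"), NULL))

# Project to two dimensions with PCA (not t-SNE) and keep the coordinates.
coords <- prcomp(vectors)$x[, 1:2]
points <- data.frame(word = rownames(vectors),
                     x = coords[, 1], y = coords[, 2],
                     row.names = NULL)

# Write the x/y coordinates, not the vectors, to a CSV file.
out_file <- tempfile(fileext = ".csv")
write.csv(points, out_file, row.names = FALSE)
```

The resulting file has one row per word with columns `word`, `x`, and `y`.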

Extract Network Weights

Hey---great work here. Sorry if this is obtuse, but is there a way to extract the neural network weight matrices after training?

Thanks!

How to convert a data frame or table to a VectorSpaceModel?

I have a word vector data frame created outside of wordVectors that currently looks like
V1 V2 V3 V4 V5 .............
1 der -0.1292338 1.41541564 0.72683984 -0.08601953
2 die -0.7408874 1.23070979 1.60728443 0.21427894
3 und 0.1368700 0.21688898 0.09194378 -0.42764056
4 in -0.9566143 1.17804027 0.13917272 1.63949668
5 von -1.2693109 0.92857528 -0.88062751 1.41522074
6 den -0.8766794 0.45545051 1.42592216 -1.87232220
7 des 0.8585002 0.80657679 2.12942553 -1.49346220
8 im -1.8885295 0.35904437 0.97661573 -0.38748211
9 mit -0.5756816 -1.57236266 -2.10877585 1.33090031
10 das -0.9001577 -0.02004211 1.45430076 0.93866318
...
and would like to use wordVectors operations. I see that there is a as.VectorSpaceModel(matrix) coercion function, but I don't know the form of the matrix that is required for the coercion to work.
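A sketch of the reshaping step in base R, using the first three rows printed above. The assumption, not confirmed in this thread, is that the coercion function expects a numeric matrix whose row names are the words; the final call is therefore left as a comment.

```r
# Rebuild the first rows of the data frame shown above.
df <- data.frame(
  V1 = c("der", "die", "und"),
  V2 = c(-0.1292338, -0.7408874, 0.1368700),
  V3 = c(1.41541564, 1.23070979, 0.21688898),
  V4 = c(0.72683984, 1.60728443, 0.09194378),
  V5 = c(-0.08601953, 0.21427894, -0.42764056),
  stringsAsFactors = FALSE
)

# Keep only the numeric columns and move the words into the row names.
mat <- as.matrix(df[, -1])
rownames(mat) <- df$V1

# Assumption: with wordVectors loaded, the coercion mentioned above is then
# model <- as.VectorSpaceModel(mat)
```

The key point is that the word column must not stay inside the matrix, since that would turn every value into a character string.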

rword2vec, R session aborted

Hello,

I just installed rword2vec, but the session was aborted when I tried to use it. My code is below. Although this question has been asked before, this issue still remained unsolved. Could you please help on this issue?

library(rword2vec)
model <- word2vec(
train_file = "text8",
output_file = "vec.bin",
binary=1,
num_threads=3,
debug_mode=1)

error in scan

Hi Ben,

I accidentally reinstalled wordVectors, and now, on a file that worked fine before, I get this kind of error:

  scan() expected 'a real', got 'm��9&x�9�����&e����8x0S9���8�_l7m�{�*��9DD��ܲ�86�R��҅�c�g�ƒ��0O49�|S9P�O9�V�8���8y鄹G(�����9/݄9��X9��(�z[�9!i'

In addition: Warning messages:

1: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 4 appears to contain embedded nulls

2: In utils::read.table(filename, header = F, skip = 1, colClasses = c("character",  :
  line 5 appears to contain embedded nulls

...so I was thinking that maybe it was an encoding issue, but file is in UTF-8, all seems tickety-boo. Wondered what you thought.
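The "embedded nulls" warnings suggest the file is being parsed as text although it holds binary data. A quick heuristic check in base R is sketched below; `looks_binary` is a hypothetical helper, and a file without NUL bytes in its first few kilobytes could still be binary, so treat the result as a hint only.

```r
# Heuristic: a text vectors file should not contain NUL bytes.
looks_binary <- function(path, n_bytes = 4096) {
  bytes <- readBin(path, what = "raw", n = n_bytes)
  any(bytes == as.raw(0))
}

# Two small temporary files to exercise the check.
text_file <- tempfile(fileext = ".txt")
writeLines(c("2 3", "der 0.1 0.2 0.3", "die 0.4 0.5 0.6"), text_file)

bin_file <- tempfile(fileext = ".bin")
# "der " followed by the four bytes of a little-endian 32-bit float (1.0).
writeBin(as.raw(c(0x64, 0x65, 0x72, 0x20, 0x00, 0x00, 0x80, 0x3f)), bin_file)

is_text_binary <- looks_binary(text_file)
is_bin_binary  <- looks_binary(bin_file)
```

If the check returns TRUE for a model file, reading it with a binary reader instead of a text reader is the natural next thing to try.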

Error on loading large model

I am trying to load the standard google news model, but I received this error:

Error in validObject(.Object) : invalid class “VectorSpaceModel” object: Error : cannot allocate vector of size 6.7 Gb

I tried on my macbook and my linux workstation, neither worked. I have no trouble loading this model in gensim.

Also, these are my R stats:

platform x86_64-pc-linux-gnu
arch x86_64
os linux-gnu
system x86_64, linux-gnu
status
major 3
minor 3.1
year 2016
month 06
day 21
svn rev 70800
language R
version.string R version 3.3.1 (2016-06-21)
nickname Bug in Your Hair
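The 6.7 Gb in the error message matches a back-of-the-envelope estimate for this model. The Google News vectors cover 3 million words and phrases in 300 dimensions, and R stores numeric values as 8-byte doubles, so a single copy of the matrix needs:

```r
n_words          <- 3e6   # words and phrases in the Google News model
n_dims           <- 300   # dimensions per vector
bytes_per_double <- 8     # R numeric values are 8-byte doubles

size_gib <- n_words * n_dims * bytes_per_double / 2^30
round(size_gib, 1)
```

Any step that copies the matrix needs that much again, which may explain why the allocation fails on machines that can hold the model in other tools; that last point is a guess, not something established in this thread.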

Can we generate .vector files

Hello Ben,

The "*.bin" output vector files are not human-readable. Is there any way to generate a ".vectors" file, as was possible in earlier releases?

I am asking because ".vectors" files are easier to convert to TensorFlow's ".bytes" format for visualising them in the TensorFlow projector. The ".vectors" file format matched Mikolov's original text format for storing vectors.

I want to be able to generate output vector files that are readable.

Thanks for great work in developing this package.

Regards
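As a workaround sketch, a matrix of vectors can be written out in the plain-text word2vec layout with base R: a header line with the number of rows and columns, then one line per word. `write_text_vectors` is a hypothetical helper, not a wordVectors function, and it assumes the model behaves like a numeric matrix with words as row names.

```r
# Write a matrix in the text word2vec layout:
# first line "n_words n_dims", then "word v1 v2 ..." per row.
write_text_vectors <- function(m, path) {
  con <- file(path, open = "w")
  on.exit(close(con))
  writeLines(paste(nrow(m), ncol(m)), con)
  body <- paste(rownames(m), apply(m, 1, paste, collapse = " "))
  writeLines(body, con)
}

# Toy matrix standing in for a trained model.
m <- matrix(c(0.5, -1, 2, 1.5, 0.25, -3), nrow = 3,
            dimnames = list(c("der", "die", "und"), NULL))
vec_file <- tempfile(fileext = ".vectors")
write_text_vectors(m, vec_file)
lines <- readLines(vec_file)
```

For a large vocabulary this builds every line in memory before writing, which is simple but not the most economical approach.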

Error in train_word2vec: "Error in if (binary) { : argument is of length zero"

I'm trying to train a model, and ran this in R using a plain text file* that I'd prepared using prep_word2vec:
girl.model <- train_word2vec("girls.txt", output_file = "~/MOC_Project/Data/girl_vectors", vectors = 300, threads = 2, window = 12, classes = 0, min_count = 5, iter = 5, force = TRUE, negative_samples = 5)
I got these messages, which seemed OK:

Starting training using file /Users/ella/Desktop/Capstone/CSStuff/MOC_Project/Scripts/girls.txt
Vocab size: 18998
Words in train file: 1006509

And then, after about a minute, I got this error:

Error in if (binary) { : argument is of length zero

I have no idea what caused this, but it seems like I can't train the model.

*Of a bunch of old social media profiles, which probably isn't relevant

n-grams greater than 2

I was looking to use trigrams because there are significant three-word phrases in my corpus (e.g. "economies in transition" to refer to developing countries). I used the following code in R.

statements <- prep_word2vec(basePath,
"docs.txt",
lowercase=T, bundle_ngrams = 3, threshold = 50)

w2v <- train_word2vec("docs.txt",
output="./stat_vecs.bin",
threads=detectCores(),
vectors=100,
window=7,
force=TRUE)

It worked as expected with the exception that I got some four word phrases (e.g. "so_that_they_can"). I'm curious why this is happening. Thanks!
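One possible explanation, offered as an assumption rather than a confirmed description of prep_word2vec: if `bundle_ngrams = 3` is implemented as two successive pair-merging passes, the second pass can join two tokens that are each already a bigram, giving a four-word phrase. The toy base R sketch below shows only that mechanism; `merge_pass` and the phrase lists are invented for illustration.

```r
# One pass: greedily join adjacent tokens whose pair is in the phrase list.
merge_pass <- function(tokens, phrases) {
  out <- character(0)
  i <- 1
  while (i <= length(tokens)) {
    if (i < length(tokens) && paste(tokens[i], tokens[i + 1]) %in% phrases) {
      out <- c(out, paste(tokens[i], tokens[i + 1], sep = "_"))
      i <- i + 2
    } else {
      out <- c(out, tokens[i])
      i <- i + 1
    }
  }
  out
}

tokens <- c("so", "that", "they", "can", "vote")
pass1 <- merge_pass(tokens, c("so that", "they can"))  # two bigrams
pass2 <- merge_pass(pass1, "so_that they_can")         # bigram + bigram
```

Under this reading, two passes produce phrases of up to four words, and trigrams arise when a bigram is joined with a single word.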

Delimiter of text context

Hi,
thanks for the library. I would like to ask whether it's possible to define a delimiter for the sliding window. For example, I want to put each sentence on a new line and, as the context for each word, use only the words on the same row.

Example:
I have a dream.
Cat eats a hotdog.

So I would like the context of "dream" to be just {I, have, a}, not Cat...

Is it possible to do that?
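To make the request concrete, here is a base R sketch of sentence-bounded contexts: each line is tokenized separately, so a window never reaches into the neighbouring line. It illustrates the desired behaviour only and says nothing about what train_word2vec does; `sentence_contexts` is a hypothetical name, and the tokenizer (lowercase, punctuation stripped) is a simplification.

```r
# Collect, for every word, the context words within `window` positions,
# without crossing line (sentence) boundaries.
sentence_contexts <- function(lines, window = 2) {
  out <- list()
  for (line in lines) {
    tokens <- strsplit(tolower(gsub("[[:punct:]]", "", line)), "\\s+")[[1]]
    tokens <- tokens[nzchar(tokens)]
    for (i in seq_along(tokens)) {
      lo <- max(1, i - window)
      hi <- min(length(tokens), i + window)
      out[[length(out) + 1]] <- list(word = tokens[i],
                                     context = tokens[setdiff(lo:hi, i)])
    }
  }
  out
}

pairs <- sentence_contexts(c("I have a dream.", "Cat eats a hotdog."), window = 5)
dream <- Filter(function(p) p$word == "dream", pairs)[[1]]
dream$context
```

Even with a window of five, the context of "dream" stays inside the first sentence.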

Unable to compile with gcc 10 | Arch Linux solution = downgrade

When installing the package on Arch Linux, it fails to compile with the following error:

ccache gcc -shared -L/usr/lib64/R/lib -Wl,-O1,--sort-common,--as-needed,-z,relro,-z,now -o wordVectors.so tmcn_word2vec.o word2phrase.o -L/usr/lib64/R/lib -lR
/usr/bin/ld: word2phrase.o:(.bss+0x18): multiple definition of `vocab'; tmcn_word2vec.o:(.bss+0x78): first defined here
collect2: error: ld returned 1 exit status
make: *** [/usr/share/R//make/shlib.mk:6: wordVectors.so] Error 1
ERROR: compilation failed for package ‘wordVectors’

This seems to be an error raised by gcc 10 and should be fixable quite easily.

Temporary Fix (see here)

Downgrade gcc and related packages (in my case gcc-libs and gcc-fortran) simultaneously to version 9.3.0-1

Either downgrade if you still have the old cached version or download them from
https://archive.archlinux.org/packages/g/gcc/gcc-9.3.0-1-x86_64.pkg.tar.zst
https://archive.archlinux.org/packages/g/gcc-libs/gcc-libs-9.3.0-1-x86_64.pkg.tar.zst and
https://archive.archlinux.org/packages/g/gcc-fortran/gcc-fortran-9.3.0-1-x86_64.pkg.tar.zst

Install using sudo pacman -U gcc-9.3.0-1-x86_64.pkg.tar.zst gcc-libs-9.3.0-1-x86_64.pkg.tar.zst gcc-fortran-9.3.0-1-x86_64.pkg.tar.zst. It's important to downgrade them at the same time because one depends on the other.

To prevent automated updates of gcc until the issue is resolved, add the three packages to IgnorePkg in /etc/pacman.conf

prep_word2vec requires R >= 3.2.0

Hi, I am getting this error when trying to parse the text:

Beginning tokenization to text file at cookbooks.txt
Error in prep_word2vec("cookbooks", "cookbooks.txt", lowercase = T) :

could not find function "dir.exists"
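`dir.exists()` was added to base R in version 3.2.0, which is why older installations cannot find it. Upgrading R is the simplest fix; for scripts that must run on older versions, a stand-in can be built from `file.info()`. The name `dir_exists_compat` is made up for this sketch.

```r
# Fallback for R < 3.2.0: TRUE only for paths that exist and are directories.
dir_exists_compat <- function(path) {
  info <- file.info(path)
  !is.na(info$isdir) & info$isdir
}

existing <- dir_exists_compat(tempdir())
missing  <- dir_exists_compat(file.path(tempdir(), "no_such_dir_xyz"))
```

This only helps in your own code; the call inside prep_word2vec still needs a version of R that provides `dir.exists()`.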

Cache magnitudes to improve repeated cosine similarity queries performance

After reading binary file of 320895 rows and 100 columns, it takes around 15-20 seconds to get result of "nearest_to" call. Is this normal behaviour or rather some quirk? The gensim word2vec implementation is rather quick on the exactly same model (it takes less than 1s to get similar vectors).

The code is the following:
model <- read.vectors("model.bin")
nearest_to(model,model[["word"]])
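The caching idea in the title can be sketched in base R: normalize every row to unit length once, after which each cosine-similarity query is a single matrix-vector product. This is an illustration on a toy matrix, not the package's code; `normalize_rows` and `closest_rows` are hypothetical names.

```r
# Divide each row by its Euclidean length (done once, then reused).
normalize_rows <- function(m) m / sqrt(rowSums(m^2))

# Cosine similarity of every row to the query, highest first.
closest_rows <- function(unit_m, query, n = 3) {
  q <- query / sqrt(sum(query^2))
  sims <- drop(unit_m %*% q)          # dot products of unit vectors
  head(sort(sims, decreasing = TRUE), n)
}

set.seed(7)
m <- matrix(rnorm(6 * 4), nrow = 6,
            dimnames = list(c("man", "woman", "king", "queen", "apple", "pear"), NULL))
unit_m <- normalize_rows(m)           # cached magnitudes
top <- closest_rows(unit_m, m["king", ])
```

Repeated queries then skip recomputing the row magnitudes; whether that accounts for the whole 15-20 seconds reported above is not established here.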

Chinese is not available

The format of my corpus is:
[image]
I use the following procedure for training:
model = train_word2vec("cookbooks.txt","cookbook_vectors.bin",vectors=200,threads=4,window=12,iter=5,negative_samples=0)
[image]
The result is:
[image]

install wrong

R version 3.4.0 (2017-04-21) -- "You Stupid Darkness"
Copyright (C) 2017 The R Foundation for Statistical Computing
Platform: x86_64-w64-mingw32/x64 (64-bit)

devtools::install_github("bmschmidt/wordVectors")
Downloading GitHub repo bmschmidt/wordVectors@master
from URL https://api.github.com/repos/bmschmidt/wordVectors/zipball/master
Error: running command '"C:/Program Files/R/R-3.4.0/bin/x64/R" --no-site-file --no-environ --no-save --no-restore --quiet CMD config CC' had status 2
