finalfusion / finalfusion-rust
finalfusion embeddings in Rust
Home Page: https://finalfusion.github.io/
License: Other
We now turn every error into a black box using `failure`. However, this makes it very hard for downstream users to give appropriate error messages. We should switch to one or more crate-specific error types.
This is a trivial change, but it should go in a separate PR.
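For illustration, a crate-specific error type could be a plain enum implementing `std::error::Error`; a minimal sketch, with variant names that are assumptions rather than the actual design:

```rust
use std::fmt;
use std::io;

// A minimal sketch of a crate-specific error type; variants are illustrative.
#[derive(Debug)]
pub enum Error {
    // Underlying I/O failure while reading a chunk.
    Io(io::Error),
    // The file did not match the expected format.
    Format(String),
}

impl fmt::Display for Error {
    fn fmt(&self, f: &mut fmt::Formatter) -> fmt::Result {
        match self {
            Error::Io(err) => write!(f, "I/O error: {}", err),
            Error::Format(desc) => write!(f, "format error: {}", desc),
        }
    }
}

impl std::error::Error for Error {}

impl From<io::Error> for Error {
    fn from(err: io::Error) -> Self {
        Error::Io(err)
    }
}
```

Unlike an opaque `failure` error, downstream users can match on such variants to produce their own messages.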
I'm implementing stand-alone reading and writing for the different chunks in my Python fork, and it seems we never added the `NdNorms` chunk identifier to the list in the header.
I think adding the identifier to the header is not a breaking change. Inside `finalfusion`, we don't use the chunk identifiers for storage, vocab, and norms in any way, IIRC.
Edit: copied the list from the other comment to get an indicator for the list ticks.
Going through the API:

`prelude.rs`: #105
- `prelude` re-exports `SimpleVocab` and `SubwordVocab`, but not the aliases such as `FinalfusionSubwordVocab`. I think it makes more sense to re-export the aliases and leave `SubwordVocab` in `chunks::vocab`.
- `MmapQuantizedArray` is the only storage not re-exported.
- `NdNorms` is the only public chunk not being re-exported.

`chunks::storage::mod.rs`:

`chunks::storage::array.rs`: #93
- `NdArray` derives `Debug`, but `MmapArray` does not.
- `NdArray` should derive `Clone`.

`chunks::storage::wrappers.rs`:
- `StorageViewWrap` implies that it wraps a view, while it actually wraps a viewable storage. Think about renaming?

`chunks::mod.rs`: finalfusion chunks

`chunks::io.rs`: #92
- `typeid_impl` is lacking docs; the choices `1` for `u8` and `10` for `f32` seem arbitrary. IIRC, this was in order to leave room for other int and float types. It could still use some docs to make that clear once this is forgotten.

`chunks::metadata.rs`: #91
- `Value`: do we keep `toml` as our choice? The upside of `toml` is that we get easy serialization and heterogeneous collections. The downside is that we always need `Value`s to construct.

`chunks::norms.rs`: #90
- `NdNorms`' inner array: impl `Index` for `NdNorms`, since `Norms` seems to be just that but without the `[]`-indexing.
- `Index::index` returns references. I'd still like to make `norm` a method directly on `NdNorms`, since `NdNorms` is entirely useless without importing `Norms` otherwise.

`chunks::vocab.rs`: #89
- Split into `vocab::mod.rs`, `vocab::subword.rs`, `vocab::simple.rs`.
- Remove `Clone` from `Vocab`'s requirements and from the other places where it pops up because of this requirement (e.g. `Indexer` bounds).

`compat::fasttext`
`compat::{text.rs, word2vec.rs}`
`embeddings.rs`
`io.rs`
`lib.rs`
`similarity.rs`
`similarity::analogy.rs` and `similarity::similarity.rs`
`subword.rs`: `Indexer`s could live in their own `indexer.rs`.
`util.rs`
What needs to be done until then?
- `chunks::Vocab::{NgramIndices, SubwordIndices}`: consolidate into one trait? Related to that, bracketing words by default forces us to collect the indices in both methods and return them as `Vec`.

Add support for pruning embeddings, where N embeddings are retained. Words for which embeddings are removed are mapped to their nearest neighbor.
This should provide more or less the same functionality as pruning in spaCy:
https://spacy.io/api/vocab#prune_vectors
I encourage some investigation here. Some ideas:
The most basic version could simply retain the embeddings of the N most frequent words and map all the remaining words to the nearest neighbor in the N embeddings that are retained.
Select vectors such that the similarities to the pruned vectors are maximized. The challenge here is making it tractable.
An approach similar to quantization, where k-means clustering is performed with N clusters. The embedding matrix is then replaced by the cluster centroid matrix. Each word maps to the cluster it is in. (This could reuse the KMeans stuff from reductive, which is already a dependency of finalfusion).
I would focus on (1) and (3) first.
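For (1), a minimal sketch, assuming the embedding matrix rows are sorted by word frequency and l2-normalized (all names here are illustrative, not crate API):

```rust
use ndarray::{s, ArrayView2, Axis};

/// For each pruned row (offset by n_keep), return the index of the most
/// similar retained row. With l2-normalized rows, the dot product is the
/// cosine similarity.
fn prune_mapping(embeds: ArrayView2<f32>, n_keep: usize) -> Vec<usize> {
    let retained = embeds.slice(s![..n_keep, ..]);
    embeds
        .slice(s![n_keep.., ..])
        .axis_iter(Axis(0))
        .map(|pruned| {
            retained
                .axis_iter(Axis(0))
                .enumerate()
                .map(|(idx, row)| (idx, row.dot(&pruned)))
                .max_by(|a, b| a.1.partial_cmp(&b.1).unwrap())
                .map(|(idx, _)| idx)
                .unwrap()
        })
        .collect()
}
```

Approach (3) would replace the brute-force nearest-neighbor search here with k-means cluster assignment.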
Benefits:
While reimplementing finalfusion-inspector in Rust, I bumped into a small annoyance. The `analogy` method takes an array of `&str`:

query: [&str; 3]

However, oftentimes you have a `[String; 3]`. We should relax this type to:

query: [impl AsRef<str>; 3]

This should not break the API, since it allows a superset of the types of the original signature.
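For illustration, the relaxed signature can equivalently be written with an explicit type parameter; a quick stand-alone sketch showing it accepts both argument types:

```rust
// A generic bound accepts both [&str; 3] and [String; 3].
fn analogy_query<S: AsRef<str>>(query: [S; 3]) {
    for word in &query {
        println!("{}", word.as_ref());
    }
}

fn main() {
    analogy_query(["berlin", "germany", "paris"]);
    analogy_query(["berlin".to_string(), "germany".to_string(), "paris".to_string()]);
}
```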
This constructor is not documented (and non-trivial).
Hi! This may sound stupid, but I have been wondering about this... I tried to reconstruct the whole `QuantizedArray` storage into `NdArray` storage and wrap it in `StorageViewWrap` so that I can compute the accuracy of the analogy test, but why don't we have a function for this so we can compute accuracy directly? (Or do we have one that I failed to find?)
Trying to read the GoogleNews-vectors-negative300.bin word2vec model triggers this assert:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/chunks/vocab/simple.rs#L28
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `3000000`,
right: `2999997`: words contained duplicate entries.'
(When constructing a new simple vocabulary, the number of indices (3,000,000) ends up different from the number of words (2,999,997).)
After some investigation, I removed this word trimming and it worked fine afterwards:
https://github.com/finalfusion/finalfusion-rust/blob/main/src/compat/word2vec.rs#L98
I assume the model contains tokens that get trimmed into the same words.
Should I create a pull request to remove this line? Or is there something I'm doing wrong?
The model I used is from: https://code.google.com/archive/p/word2vec
Code:
use std::fs::File;
use std::io::BufReader;
use finalfusion::prelude::*;

let mut reader = BufReader::new(File::open("GoogleNews-vectors-negative300.bin").unwrap());
let model = Embeddings::read_word2vec_binary(&mut reader).unwrap();
When running the following code sample on Windows, it crashes:
use std::io::{BufReader, Read};
use std::fs::File;
use finalfusion::prelude::*;
use finalfusion::similarity::WordSimilarity;
fn main() {
let mut reader = BufReader::new(File::open("resources/english-skipgram-mincount-50-ctx-10-ns-5-dims-300.fifu").unwrap());
// Read the embeddings.
let embeddings: Embeddings<VocabWrap, StorageViewWrap> =
Embeddings::read_embeddings(&mut reader)
.unwrap();
}
with the following error:
thread 'main' panicked at 'capacity overflow', src\liballoc\raw_vec.rs:750:5
stack backtrace:
0: core::fmt::write
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libcore\fmt\mod.rs:1063
1: std::io::Write::write_fmt<std::sys::windows::stdio::Stderr>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\io\mod.rs:1426
2: std::sys_common::backtrace::_print
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\sys_common\backtrace.rs:62
3: std::sys_common::backtrace::print
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\sys_common\backtrace.rs:49
4: std::panicking::default_hook::{{closure}}
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:204
5: std::panicking::default_hook
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:224
6: std::panicking::rust_panic_with_hook
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:470
7: std::panicking::begin_panic_handler
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\panicking.rs:378
8: core::panicking::panic_fmt
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libcore\panicking.rs:85
9: core::panicking::panic
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libcore\panicking.rs:52
10: alloc::raw_vec::capacity_overflow
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\liballoc\raw_vec.rs:750
11: alloc::raw_vec::{{impl}}::allocate_in::{{closure}}<f32,alloc::alloc::Global>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\raw_vec.rs:79
12: core::result::Result<(), alloc::collections::TryReserveError>::unwrap_or_else<(),alloc::collections::TryReserveError,closure-1>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libcore\result.rs:851
13: alloc::raw_vec::RawVec<f32, alloc::alloc::Global>::allocate_in<f32,alloc::alloc::Global>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\raw_vec.rs:79
14: alloc::raw_vec::RawVec<f32, alloc::alloc::Global>::with_capacity_zeroed<f32>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\raw_vec.rs:147
15: finalfusion::util::padding<f32>
at C:\Users\Roland\.cargo\registry\src\github.com-1ecc6299db9ec823\finalfusion-0.12.1\src\util.rs:66
16: alloc::vec::from_elem<f32>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\liballoc\vec.rs:1730
17: std::sys::windows::alloc::{{impl}}::dealloc
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\sys\windows\alloc.rs:48
18: std::alloc::__default_lib_allocator::__rdl_dealloc
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\alloc.rs:270
19: core::slice::{{impl}}::index<u8>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libcore\slice\mod.rs:2891
20: core::ops::function::FnOnce::call_once<fn(finalfusion::chunks::vocab::subword::SubwordVocab<finalfusion::subword::HashIndexer<fnv::FnvHasher>>) -> finalfusion::chunks::vocab::wrappers::VocabWrap,(finalfusion::chunks::vocab::subword::SubwordVocab<finalfusi
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libcore\ops\function.rs:232
21: finalfusion::chunks::storage::wrappers::{{impl}}::read_chunk<std::io::buffered::BufReader<std::fs::File>>
at C:\Users\Roland\.cargo\registry\src\github.com-1ecc6299db9ec823\finalfusion-0.12.1\src\chunks\storage\wrappers.rs:250
22: finalfusion::embeddings::{{impl}}::read_embeddings<finalfusion::chunks::vocab::wrappers::VocabWrap,finalfusion::chunks::storage::wrappers::StorageViewWrap,std::io::buffered::BufReader<std::fs::File>>
at C:\Users\Roland\.cargo\registry\src\github.com-1ecc6299db9ec823\finalfusion-0.12.1\src\embeddings.rs:404
23: stuff::main
at .\src\bin\stuff.rs:12
24: std::rt::lang_start::{{closure}}<()>
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\src\libstd\rt.rs:67
25: std::rt::lang_start_internal::{{closure}}::{{closure}}
at /rustc/8d69840ab92ea7f4d323420088dd8c9775f180cd\/src\libstd\rt.rs:52
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.
error: process didn't exit successfully: `target\debug\stuff.exe` (exit code: 101)
On Linux, the same code runs fine. Is there a problem with finalfusion on Windows?
Compare the disk/memory use and speed of the different storage formats, and add this to the README to give people an idea of the trade-offs:
I have already done parts of this, so I can take this one.
I think that for downstream users it would be great if we had a first release. Due to semver, once we release 1.0.0, we cannot break APIs anymore in 1.x.y. Basically, once someone puts
finalfusion = "1"
in their Cargo.toml
, it should always work for 1.x.y. Of course, we can extend the API.
This issue is for discussing what still needs to be done before 1.0.0, and which sore thumbs are sticking out. We should also check whether there are any public APIs that need to be made private or crate-private.
I was wondering if there are reasons against exposing a method to get tuples of `(ngram, idx)` for both in-vocabulary and out-of-vocabulary words.
It could be nice to inspect the similarity of the subword vectors through the Python module. Although we probably can't get all the components that make up in-vocabulary words, because we precompute their representations.
Not sure whether this should go in a dot-release or in a new version...
Add a method `Embeddings::embedding_into` that constructs the embedding in-place in an `ArrayViewMut1`. Obviously, this would be a copy for a known word, but for unknown words this avoids an allocation in case one already has memory allocated (e.g. in downstream software where you allocate memory for the embeddings of all words in a sentence).
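A rough sketch of the idea as a free function (names assumed; note that, unlike the proposed method, this sketch still allocates inside `embedding` for unknown words, whereas the real method would write the subword average directly into the destination):

```rust
use finalfusion::prelude::*;
use ndarray::ArrayViewMut1;

/// Copy the embedding of `word` into `dest`, returning whether the word
/// (or any of its subwords) was found.
fn embedding_into(
    embeds: &Embeddings<VocabWrap, StorageWrap>,
    word: &str,
    mut dest: ArrayViewMut1<f32>,
) -> bool {
    match embeds.embedding(word) {
        Some(embedding) => {
            dest.assign(&embedding);
            true
        }
        None => false,
    }
}
```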
This, if I understand it correctly, is outdated, right? Since we already have `Embeddings<_, MmapQuantizedArray>`.
If some of the quantizer parameters are wrong, the finalfusion file could be malformed. We should return `Err` in such cases, rather than relying on assertions in `reductive`.
This issue is for tracking the addition of reading functionality for fastText embeddings.
- `FastTextIndexer`. PR: #29
- Make `SubwordVocab` generic over the indexer. PR: #30
- `SubwordVocab`. PR: #31
- `FastTextReader` trait. PR: #32
- Extend `finalfusion-utils` to support the fastText file format.
- Extend `finalfusion-python` to read fastText embeddings.

@sebpuetz I added you as an assignee for the reviewing. I have nearly all functionality implemented already.
I think it would be nice to have a small utility data structure to fetch pretrained embeddings. I don't think this needs to be part of the `finalfusion` crate, since it is not really core functionality. The basic idea is:
We'd have a repository `finalfusion-fetcher` with some metadata file (probably JSON), mapping embedding file identifiers to URLs. E.g. `fasttext.wiki.nl.fifu` could map to http://www.sfs.uni-tuebingen.de/a3-public-data/finalfusion-fasttext/wiki/wiki.nl.fifu
A small crate (possibly in the same repo) would provide a data structure `Fetcher` with a constructor that retrieves the metadata and gives a fetcher:
let fetcher = Fetcher::fetch_metadata().unwrap();
A user could then open embeddings:
let dutch_embeddings = fetcher.open("fasttext.wiki.nl.fifu").unwrap();
This method would check whether the embeddings are already available. If not, it would fetch them and store them in a standard XDG location. Then it would open the embeddings stored in this location.
Similarly, `Fetcher::mmap` could be used to memory-map an embedding after downloading.
After this is implemented, the functionality could also be exposed in `finalfusion-python`.
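A rough sketch of the surface such a crate could have (all names here are hypothetical):

```rust
use std::collections::HashMap;
use std::error::Error;
use std::path::PathBuf;

use finalfusion::prelude::*;

/// Hypothetical fetcher: maps embedding identifiers to URLs and caches
/// downloads in a standard XDG data directory.
pub struct Fetcher {
    metadata: HashMap<String, String>, // identifier -> URL
    cache_dir: PathBuf,
}

impl Fetcher {
    /// Download and parse the JSON metadata file.
    pub fn fetch_metadata() -> Result<Self, Box<dyn Error>> {
        unimplemented!()
    }

    /// Open embeddings by identifier, downloading them into the cache
    /// directory first if they are not present yet.
    pub fn open(&self, identifier: &str) -> Result<Embeddings<VocabWrap, StorageWrap>, Box<dyn Error>> {
        unimplemented!()
    }
}
```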
This time hopefully without soundness errors ;).
We should be very conservative with adding new chunks. But I think there is a place for a new vocabulary chunk that solves three problems of the current vocab and (bucket) subword vocab chunks:
The proposed vocabulary format would be structured similarly to a fully loaded hash table.
The data would be structured as follows (after the chunk identifier and length). Given a vocabulary of size `vocab_len`:

1. `vocab_len` pairs (see notes 1 and 2 below for refinements) of the word's hash (`u64`) and the word itself;
2. `vocab_len` indexes (`u64`), mapping a storage index to a pair from (1).

Lookup of a storage index `i`: get the `i`-th element of the storage->string link table. This gives the corresponding hash/string pair, and thus the word.
Lookup of a word `w`: hash `w` and map the hash `h` to the vocabulary space, `h_vocab`. Start linear probing at `h_vocab` until a matching `h` is found. Then verify that the strings match (otherwise continue probing). A sketch of this lookup follows after the notes.

Notes:
1. In practice we'd want to use another number than the vocabulary length, e.g. one of the next powers of two: first, to make the outcomes of hashing uniformly distributed; second, to avoid degenerate cases in linear probing. However, without these blank slots, the storage->string link table would not be necessary (since we could sort the storage by hash table order).
2. The birthday paradox square estimation puts the hash collision probability of 0.5 at 2^32 items, so in practice the first actual string match would be a hit.
3. If the table is constructed in word frequency order, the amount of linear probing is a function of the word rank/frequency, since when the most frequent words are inserted, most pairs will still be empty.
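A minimal sketch of the word lookup described above (hash function, table layout, and names are all assumptions for illustration; the blank slots from note 1 are what guarantee probing terminates):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

/// (hash, word) pairs in hash-table order; blank slots hold None.
struct HashVocab {
    slots: Vec<Option<(u64, String)>>,
}

impl HashVocab {
    fn idx(&self, word: &str) -> Option<usize> {
        let mut hasher = DefaultHasher::new();
        word.hash(&mut hasher);
        let h = hasher.finish();

        // Map the hash into the vocabulary space and probe linearly.
        let mut slot = (h as usize) % self.slots.len();
        loop {
            match &self.slots[slot] {
                // A blank slot means the word is not in the vocabulary.
                None => return None,
                // Hashes match: verify the strings, otherwise keep probing.
                Some((slot_hash, slot_word))
                    if *slot_hash == h && slot_word.as_str() == word =>
                {
                    return Some(slot);
                }
                _ => slot = (slot + 1) % self.slots.len(),
            }
        }
    }
}
```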
I guess that adding this chunk entails adding three new chunks, since we have combined the vocabs with subword vocabs. Also, it probably requires a redesign of the explicit n-gram chunk, since otherwise it would not be memory mappable.
We discussed earlier that it may have been a design mistake not to store storage indices with words, relying on the order of the words instead. However, while thinking about this new chunk, I realized that doing this is actually unsound in our current setup: e.g. if one does a similarity or analogy query, the indices of the top-n results do not map to individual words anymore. We would need to change the APIs to return multiple words for a given index.
...using real-world embeddings. This would allow us to detect performance regressions.
For analogies, we have `Analogy` and `AnalogyBy`. However, `Similarity` has both `similarity` and `similarity_by` methods.
I was converting some files from the text(dims) format to our format and ran into trouble with tokens containing whitespace. There were a number of tokens with non-space whitespace characters inside, which threw off our splitting through `.split_whitespace()`.
Also, there were a bunch of tokens that consisted of just a whitespace character (e.g. NEL `U+0085` and NBSP `U+00A0`), which got reduced to empty strings by the `.trim()` call we do on the words. The second part was particularly frustrating to debug, since `words` ended up with the correct length and the right number of embeddings was created, but an incorrect number of vocabulary entries was created. This happened because `trim()`ming reduced a couple hundred tokens to `""`.
finalfusion-rust/src/compat/text.rs, lines 161 to 170 in c9bd67c
This took an awfully long time to debug, mostly due to the stupid format and the unnormalized corpus that the embeddings were trained on, but still, I wonder if we can do better here. E.g.:
let word = parts
    .next()
    .ok_or_else(|| ErrorKind::Format(String::from("Spurious empty line")))?
    .trim_matches(|c| c == ' ' || c == '\t');
What do you think about targeted trimming and splitting on `'\t'` and `' '` rather than on any whitespace? It might reduce the headaches of potential users.
While debugging this, I also found that the following assertion leads to killing the Python process when loading a malformed file through our Python library.
finalfusion-rust/src/embeddings.rs, lines 50 to 54 in c9bd67c
It's kind of ugly to get this kind of output for a malformed file. There's an issue at PyO3/pyo3#492 about stopping panics at the FFI boundary, but for now we would have to deal with that ourselves. Maybe it's possible to return a `Result` here?
thread '<unnamed>' panicked at 'assertion failed: `(left == right)`
left: `199872`,
right: `200000`: Vocab and norms do not have the same length', /data/rust_projects/finalfusion-rust/src/embeddings.rs:50:9
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace.
fatal runtime error: failed to initiate panic, error 5
[1] 8362 abort (core dumped) ipython
Every `SubwordVocab` is also a `Vocab`. This has the benefit that when you have a `dyn SubwordVocab`, the `Vocab` methods are also available.
This should be a safe API change (since it only adds methods accessible through `SubwordVocab`).
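A minimal sketch of the supertrait pattern this describes (simplified trait bodies, not the actual finalfusion definitions):

```rust
trait Vocab {
    fn idx(&self, word: &str) -> Option<usize>;
}

// With Vocab as a supertrait, a dyn SubwordVocab also exposes Vocab's methods.
trait SubwordVocab: Vocab {
    fn subword_indices(&self, word: &str) -> Vec<usize>;
}

fn lookup(vocab: &dyn SubwordVocab, word: &str) -> Option<usize> {
    // Vocab::idx is callable through the SubwordVocab trait object.
    vocab.idx(word)
}
```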
Hey again. So the API is confusing for me here: when I try to get an embedding and it is not found, what is the proper way of using the ngram module to get the embedding? Is there a pre-built fn to do it, or an example somewhere in the test code?

#[inline]
pub fn get_embedding(&self, kmer: &str) -> Embedding {
    match self.embeddings.embedding_with_norm(&kmer) {
        Some(x) => x.embedding.into_owned(),
        // Probably have to use the ngrams module to get indices, and then average...
        None => panic!("Unable to get embeddings for {}", kmer),
    }
}

I'm using an alternative vocab, but it is working from the NGram bones (with very few changes):

impl From<KmerVocab<NGramConfig, ExplicitIndexer>> for VocabWrap {
    fn from(v: KmerVocab<NGramConfig, ExplicitIndexer>) -> Self {

Thanks
finalfusion-rust is currently licensed under the following licenses (user's choice):
We are trying to relicense the finalfusion ecosystem to something that is more canonical for the Rust ecosystem, namely (user's choice):
I, @sebpuetz, and SfS ISCL have agreed with relicensing all finalfusion projects in this way. However, finalfusion-rust has two additional contributors:
If you are ok with this relicensing, could you please reply to this issue confirming this? Thanks a lot!
Some code snippets could be nice. Also, a synopsis of what you can do with the crate.
The README also states that the crate is new and the API will change. While the API might still change, I think we're approaching a somewhat stable public API.
Proposed solution: change `analogy` to return a `Result<..., [bool; 3]>`.
See also: finalfusion/finalfusion-utils#8
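Under the proposed signature, a call site might look like this (a sketch; the exact `Err` semantics, presumably flagging which query words are present in the vocabulary, are an assumption):

```rust
use finalfusion::prelude::*;
use finalfusion::similarity::Analogy;

fn print_analogy(embeddings: &Embeddings<VocabWrap, StorageViewWrap>) {
    match embeddings.analogy(["berlin", "germany", "france"], 10) {
        Ok(results) => {
            // Use the returned nearest neighbors.
            let _ = results;
        }
        // The [bool; 3] would indicate, per query word, whether it was found.
        Err(present) => eprintln!("query words present: {:?}", present),
    }
}
```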
Either they are both subwords, or they are both n-grams; the current naming suggests that there is a difference.
We could get a lot of additional pretrained embeddings for 'free' if we convert fastText embeddings from:
https://fasttext.cc/docs/en/crawl-vectors.html
Any opinions?
This is an awesome project. Sadly, I cannot compile it to wasm for native browser use because of the `memmap` dependency.
It seems to be used in:
src/chunks/storage/array.rs:8:5
src/chunks/storage/quantized.rs:6:5
Is there a way to remove `memmap`, or replace it with something that can target wasm?
Thanks
With range 3-6 these fail:
fn empty_string_idx(voc: BucketSubwordVocab) {
assert!(voc.idx("").is_none())
}
fn empty_string_embed(e: Embeddings<BucketSubwordVocab, NdArray>) {
assert!(e.embedding("").is_none())
}
I'd expect the same for other setups, e.g. range 4-6 and length 1 inputs will probably behave the same.
Hi @danieldk, thanks for this great library. Can you guide me through the steps to include support for sentence embeddings using sentence-transformers? Which traits should I use? I am especially interested in storing embeddings for similarity queries on phrases.
These do not have to go into the default PR tests (to avoid long waits). But it would be nice to at least test the master branch and release branches against the BSDs that are supported by Rust.
I tried to use this library with pretrained models from https://github.com/sdadas/polish-nlp-resources?tab=readme-ov-file#word2vec and found out that these are in the keyed vectors format, which is currently not supported.
Looking at the losses of OPQ, most of the vectors are reconstructed nearly perfectly in terms of cosine similarity, but the euclidean distances are larger. Consequently, not all vectors of known words are necessarily unit vectors.
To do:
The `reductive` dependency should be updated sometime soon:
- So that the `attempts` option of `ff-quantize` actually has an effect.
- Update `ndarray-linalg` to 0.11. The version that we currently use (0.10) forces static linking of OpenBLAS, which fails on systems that do not provide a static library (when the `system` feature is used). Without the `system` feature, the compiled OpenBLAS library sometimes misses `LAPACKE_` symbols. (Not sure yet what the circumstances are; possibly it does not fail if there is a static system library, because it links against that first.)

This update is blocked by yet another issue. `reductive` switched to `rand` 0.7. However, `rand` 0.7 makes use of crate renaming, which seems to be used seldom enough that it is not supported by Nix's `buildRustCrate`. Since I use Nix to build Docker images for sticker (which uses finalfusion), I would like to see that fixed before making the leap. I have already submitted a PR to nixpkgs to add support for crate renames.
@sebpuetz @NianhengWu
We have accumulated quite some changes since June:
Maybe it's time to do a new release branch of `finalfusion-rust` and `finalfusion-utils`? Are there any small changes we need to get in before that? The primary thing off the top of my head is a unit test using toy fastText embeddings.
I have the norms layer mostly done, but I was thinking about what the API should look like:

1. Provide a separate method `fn norm(&self, word: &str) -> Option<f32>`. Benefit: no cluttering of `embedding`'s signature; most people will just want an embedding. Downside: more expensive if someone wants both the embedding and the original norm, especially with subword units.
2. Extend the return type of `embedding` to `Option<(CowArray1<f32>, f32)>`, where the embedding and norm are returned as a tuple. Advantage: faster than (1). Disadvantage: tuple.
3. Extend the return type of `embedding` to `Option<Embedding>`, where `Embedding` is a type that holds the `CowArray1` and the norm, with two getters: `embedding`/`norm`. Advantages: faster than (1), clearer than (2). Disadvantages: double wrapping of the actual vector (`Embedding` -> `CowArray`); getting the embedding is ugly (`embeds.embedding("blah").embedding()`); the `embedding` getter returns a reference, and `&CowArray1` is not a very useful type in case the array is owned.
4. Like (2), but as a separate method `fn embedding_with_norm(&self, word: &str) -> Option<(CowArray1<f32>, f32)>`.
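For concreteness, option (3) could look roughly like this (a sketch with assumed names and a simplified stand-in for the owned/borrowed array type, not a final design):

```rust
use ndarray::{Array1, ArrayView1};

// Simplified stand-in for the crate's owned-or-borrowed array type.
pub enum CowArray1<'a> {
    Borrowed(ArrayView1<'a, f32>),
    Owned(Array1<f32>),
}

// Option (3): one type holding both the vector and its original norm.
pub struct Embedding<'a> {
    inner: CowArray1<'a>,
    norm: f32,
}

impl<'a> Embedding<'a> {
    pub fn embedding(&self) -> &CowArray1<'a> {
        &self.inner
    }

    pub fn norm(&self) -> f32 {
        self.norm
    }
}
```

The `&CowArray1` return type of the `embedding` getter is exactly the disadvantage mentioned under (3).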
Any opinions @sebpuetz ?