jina-ai / finetuner
:dart: Task-oriented embedding tuning for BERT, CLIP, etc.
Home Page: https://finetuner.jina.ai
License: Apache License 2.0
Have an evaluation function with the following interface
def get_evaluation(queries: DocumentSequence, catalog: Union[DocumentArray, DocumentArrayMemmap], model: AnyDNN, metrics: List[str] = None):
where
- queries are the to-be-scored Documents, with either positive results as matches or class information
- catalog are the potential results
- model is any model
- metrics are the names of the metrics to compute, where None means all

The evaluation function should compute the embeddings and compute/output the requested metrics.
Furthermore, we should have a demo evaluation for the FMNIST dataset as an integration test (potentially sub-sampled for faster execution).
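A minimal sketch of such an evaluation function, using plain numpy stand-ins for DocumentSequence/DocumentArray (the pair-based data layout and the precision@k metric are illustrative assumptions, not the Finetuner API):

```python
from typing import Callable, Dict, List, Optional
import numpy as np

def get_evaluation(
    queries,            # sequence of (vector, relevant_ids) pairs (stand-in for DocumentSequence)
    catalog,            # sequence of (doc_id, vector) pairs (stand-in for DocumentArray)
    model: Callable[[np.ndarray], np.ndarray],   # embedding model: ndarray -> ndarray
    metrics: Optional[List[str]] = None,
) -> Dict[str, float]:
    """Embed queries and catalog, rank by cosine similarity, report metrics."""
    if metrics is None:
        metrics = ['precision@1']  # None means "all"; one metric here for brevity
    cat_ids = [d[0] for d in catalog]
    cat_emb = model(np.stack([d[1] for d in catalog]))
    q_emb = model(np.stack([q[0] for q in queries]))
    # cosine similarity between every query and every catalog item
    q_norm = q_emb / np.linalg.norm(q_emb, axis=1, keepdims=True)
    c_norm = cat_emb / np.linalg.norm(cat_emb, axis=1, keepdims=True)
    sims = q_norm @ c_norm.T
    results = {}
    for m in metrics:
        k = int(m.split('@')[1])
        hits = 0.0
        for (_, relevant), row in zip(queries, sims):
            top_k = [cat_ids[i] for i in np.argsort(-row)[:k]]
            hits += len(set(top_k) & set(relevant)) / k
        results[m] = hits / len(queries)
    return results
```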
Refactor the pipeline as suggested in Tadej's slides.
Below, "new" refers to a dataset without tuples/triplets; "old" refers to the user manually creating tuples/triplets.
- Dataset: support new- and old-style datasets. - @tadejsv
- Dataset. - @tadejsv
- We start from PyTorch, then move to Paddle and Keras, and finish by making abstraction classes.
Tailor unit testing is in good shape; the same is not true for tuner and labeler.
Hi there!
I am trying to tailor a vision transformer (DINO) by adding a projection layer on top of the whole transformer architecture.
When I run finetuner.tailor.to_embedding_model(model, output_dim=100, input_size=(3, 224, 224)) I get the following error.
I get the model from vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8').
Code snippet to reproduce the error:
import torch
import finetuner

vits8 = torch.hub.load('facebookresearch/dino:main', 'dino_vits8')
finetuner.tailor.to_embedding_model(vits8, output_dim=100, input_size=(3, 224, 224))
Following the Keras example in the docs raises a shape incompatibility using Jina 2.1.11.
It would be useful to implement checkpointing and, related to this, to allow users to resume training from the latest checkpoint.
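A minimal sketch of what save/resume could look like in plain PyTorch (the function names and checkpoint layout are hypothetical, not Finetuner API):

```python
import torch

def save_checkpoint(model, optimizer, epoch, path='ckpt.pt'):
    # Persist everything needed to resume: model weights, optimizer state, epoch.
    torch.save({'epoch': epoch,
                'model_state': model.state_dict(),
                'optimizer_state': optimizer.state_dict()}, path)

def resume_from_checkpoint(model, optimizer, path='ckpt.pt'):
    # Restore weights and optimizer state in place; return the epoch to continue from.
    state = torch.load(path)
    model.load_state_dict(state['model_state'])
    optimizer.load_state_dict(state['optimizer_state'])
    return state['epoch'] + 1
```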
Pokedex front-to-back search should work out of the box.
The next_batch function checks this condition but doesn't take into account that there might already be requests in progress, because of its async nature.
There is no reason we should bind Euclidean distance to TripletLayer and cosine to CosineLayer - there should be a parameter in the fit() function to allow setting the distance function to use.
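A sketch of what a pluggable distance could look like, assuming fit() grows a distance parameter (the names and registry here are hypothetical, not the current API):

```python
import numpy as np

def euclidean(x, y):
    return np.linalg.norm(x - y, axis=-1)

def cosine(x, y):
    x = x / np.linalg.norm(x, axis=-1, keepdims=True)
    y = y / np.linalg.norm(y, axis=-1, keepdims=True)
    return 1.0 - np.sum(x * y, axis=-1)

DISTANCES = {'euclidean': euclidean, 'cosine': cosine}

def triplet_loss(anchor, pos, neg, distance='euclidean', margin=1.0):
    # The loss layer looks up the distance by name instead of hard-coding it.
    d = DISTANCES[distance]
    return np.maximum(0.0, d(anchor, pos) - d(anchor, neg) + margin)
```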
Currently, we have image and text matches tables.
If we want to extend this to more mimetypes, we should make sure that it doesn't get hard to maintain the HTML file as a whole. Refactoring out components for them could be a good way to simplify it.
Currently the implementation for Paddle and PyTorch is error-prone: we get names from the model, we apply hooks to modules, and we zip(names, summary) based on the assumption that they should be identical. A better choice is to rename the module in the hook directly.
This example (adapted from the documentation) will crash:
import torch
import finetuner
from finetuner.toydata import generate_fashion_match
embed_model = torch.nn.Sequential(
torch.nn.Flatten(),
torch.nn.Linear(
in_features=28 * 28,
out_features=128,
),
torch.nn.ReLU(),
torch.nn.Linear(in_features=128, out_features=32),
)
finetuner.fit(embed_model, train_data=generate_fashion_match, interactive=False)
Traceback (most recent call last):
File "/home/tadej/projects/finetuner/test.py", line 16, in <module>
finetuner.fit(embed_model, train_data=generate_fashion_match, interactive=False)
File "/home/tadej/projects/finetuner/finetuner/__init__.py", line 66, in fit
return fit(*args, **kwargs)
TypeError: fit() got an unexpected keyword argument 'interactive'
This is because of this:
finetuner/finetuner/__init__.py
Lines 62 to 65 in 91587d8
interactive should be removed. Will create a PR.
Write a unit test that checks whether batching vs. non-batching produces the same results for a model that has a BatchNormalization or Dropout layer, for: https://github.com/jina-ai/finetuner/blob/ae8e3990080681a760f465b29c381ffe0e4b0245/finetuner/embedding.py
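Such a test could be sketched like this, assuming the embedding call puts the model into eval() mode (the model and function names are illustrative, not the existing test suite):

```python
import torch

def test_batching_consistency():
    # In eval() mode, BatchNorm uses its running statistics and Dropout is a
    # no-op, so per-sample outputs must not depend on how samples are batched.
    torch.manual_seed(0)
    model = torch.nn.Sequential(
        torch.nn.Linear(8, 16),
        torch.nn.BatchNorm1d(16),
        torch.nn.Dropout(p=0.5),
        torch.nn.Linear(16, 4),
    )
    model.eval()
    data = torch.randn(10, 8)
    with torch.no_grad():
        full = model(data)                                              # one big batch
        chunks = torch.cat([model(data[i:i + 2]) for i in range(0, 10, 2)])  # small batches
    assert torch.allclose(full, chunks, atol=1e-6)
```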
At the very least, learning rate should be customizable.
We should also consider supporting multiple optimizers, e.g. Adam (which should be the default, as it is the most commonly used one), AdamW (used for transformers) and SGD, and setting their parameters.
The best way to do this is not clear - either we have some selection of optimizers that users can choose from and implement it in all frameworks, or we allow passing the optimizer object.
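The first option could be sketched as a name-to-class registry in PyTorch (a hypothetical interface, not the current Finetuner API):

```python
import torch

OPTIMIZERS = {
    'adam': torch.optim.Adam,    # sensible default, most commonly used
    'adamw': torch.optim.AdamW,  # common for transformers
    'sgd': torch.optim.SGD,
}

def get_optimizer(model, name='adam', learning_rate=1e-3, **kwargs):
    # Look up the optimizer class by name and forward any extra parameters.
    try:
        cls = OPTIMIZERS[name.lower()]
    except KeyError:
        raise ValueError(f'unknown optimizer {name!r}, choose from {sorted(OPTIMIZERS)}')
    return cls(model.parameters(), lr=learning_rate, **kwargs)
```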
It should cancel the entire training instead.
We have dense, simple_cnn, vgg and lstm; add bert to the test models.
Add a save button to the frontend and enable saving of the model via the Flow.
We want an evaluation pipeline that measures the quality
The task is to figure out how to use the Labeler in Google Colab.
allow CLIP model fine-tuning
Pre-processing should be done asynchronously, so that it does not become a bottleneck.
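A generic asyncio pattern for this (not Finetuner internals): run the blocking transform in an executor so items are processed concurrently instead of serializing the pipeline.

```python
import asyncio

async def preprocess_all(items, transform):
    # Offload each (blocking) transform call to the default thread executor;
    # gather preserves input order in the results.
    loop = asyncio.get_running_loop()
    tasks = [loop.run_in_executor(None, transform, it) for it in items]
    return await asyncio.gather(*tasks)
```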
Say I want to quickly label many examples of a single class. Is there a way I can use a seed example of that class, and use a combination of the nearest-neighbour technique and my input, to quickly label several hundred?
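One possible workflow (a sketch, not an existing Finetuner feature): embed everything, rank unlabeled items by cosine similarity to the seed, and propose the top-k for one-click confirmation.

```python
import numpy as np

def propose_labels(seed_emb, embeddings, k=100):
    # Normalize so the dot product equals cosine similarity, then return the
    # indices of the k items most similar to the seed example.
    seed = seed_emb / np.linalg.norm(seed_emb)
    embs = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    sims = embs @ seed
    return np.argsort(-sims)[:k]
```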
Use labeler UI to mimic user interactions, make sure that it's working end-to-end.
This will give us more confidence that the changes we make don't break existing functionality.
I'm thinking about using https://www.cypress.io/ for this.
Instead of name/kwargs, keep the default; also add an argument for the learning rate.
We have very few docstrings right now - it would be good to have more.
Currently the interactive labeler only supports Image and Text; is there some kind of middle-layer API or similar that I can modify to allow interactive Audio labeling?
I want to label similar speakers based on audio clips.
Thanks
The fix is to check with something like isinstance(model, torch.nn.Module) - however, this requires care to first check which frameworks are even installed.
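A sketch of doing that safely: probe which frameworks are installed before touching their classes, so an uninstalled framework never gets imported (the framework list here is illustrative).

```python
import importlib
import importlib.util

def get_framework(model):
    # find_spec only checks availability; import happens only when installed.
    for module_name, base_class_path in [
        ('torch', 'torch.nn.Module'),
        ('paddle', 'paddle.nn.Layer'),
        ('tensorflow', 'tensorflow.keras.layers.Layer'),
    ]:
        if importlib.util.find_spec(module_name) is None:
            continue  # framework not installed; skip its isinstance check
        mod = importlib.import_module(module_name)
        cls = mod
        for attr in base_class_path.split('.')[1:]:
            cls = getattr(cls, attr)  # walk e.g. torch -> nn -> Module
        if isinstance(model, cls):
            return module_name
    raise ValueError('model does not belong to any installed framework')
```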
Takes input embeddings and labels, and outputs a list of tuples/triplets. The implementation could look like:
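One possible sketch of such a function (a naive all-pairs miner over indices; real implementations would sample or use hard mining, and the name is hypothetical):

```python
import numpy as np
from itertools import combinations

def build_triplets(embeddings, labels):
    # For every same-class (anchor, positive) pair, pair it with every
    # item of a different class as the negative.
    labels = np.asarray(labels)
    triplets = []
    for a, p in combinations(range(len(labels)), 2):
        if labels[a] != labels[p]:
            continue
        for n in range(len(labels)):
            if labels[n] != labels[a]:
                triplets.append((a, p, n))
    return triplets
```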
"working" should log as "loading data" at the very first stage.
epochs is not passed into the front-end.
For image paths like img/, I expect finetuner to automatically fill the datauri / do the conversion.
Currently the labeler does the following:
let end_idx = app.labeler_config.start_idx + (app.labeler_config.example_per_view - app.cur_batch.length)
if (end_idx === app.labeler_config.start_idx) {
return
}
let start_idx = app.labeler_config.start_idx
app.labeler_config.start_idx = end_idx
Thus, when I only have 2 queries but a bigger catalog, I only get labelable data once.
In my opinion, choosing new data should be done in the backend, not in the frontend.
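A backend-side sketch (a hypothetical helper, not the current labeler code): keep the cursor server-side and return an empty batch when the data is exhausted, so the frontend never has to compute indices.

```python
class BatchCursor:
    """Server-side batch cursor: the frontend just asks for the next batch."""

    def __init__(self, data, batch_size):
        self.data = data
        self.batch_size = batch_size
        self.pos = 0

    def next_batch(self):
        # Returns up to batch_size items; an empty list signals exhaustion.
        batch = self.data[self.pos:self.pos + self.batch_size]
        self.pos += len(batch)
        return batch
```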
We have shortcuts for selecting matches (0 - 9), inverting selection (i), submitting changes (space)
It'd be good to make this configurable.
It would be something like the configuration screens we see in games.
The difference I want to make in the UI is to use the UI itself rather than text. I'll paste a mockup later.
and make it trainable
Provide a .plot() function that leverages mermaid.js and keras.summary, paddle.summary, and torchinfo.summary (PyTorch has no native summary function; this is a 3rd-party package, and inspired by it we have already reimplemented our own similar thing in Tailor). This will provide the same look & feel as in Jina Core.
pip install . does not install the requirements stated in requirements.txt.
i.e.
from finetuner import fit
fit(..., to_embedding_model=True)
This will call tailor + tuner.