yandex-research / tabular-dl-tabr

The implementation of "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning"

Home Page: https://arxiv.org/abs/2307.14338

License: MIT License

Python 82.72% Jupyter Notebook 17.28%
deep-learning machine-learning paper pytorch research tabular-data

tabular-dl-tabr's Introduction

TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning

This is the official implementation of the paper "TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning" (arXiv).

Table of Contents:

  • The main results
  • How to reproduce the results
  • Understanding the repository
  • Adding new datasets and metrics
  • How to cite

The main results

After setting up the environment, use the notebooks/results.ipynb notebook to browse the main results (for now, you can scroll to its last cell to get an idea of what the results look like).

How to reproduce the results

Set up the environment

Software

For this project, we highly recommend using a conda-like environment manager instead of pip in order to set up the CUDA-dependent libraries correctly, especially Faiss. The available options:

  • mamba is a fast replacement for conda
  • (we used this) micromamba avoids any conflicts with your current setup: it is a single binary that does not require any "installation" (see its documentation)
  • conda is a valid option, but setting up the environment can become extremely slow (or even impossible)

Then, run the following commands (replace micromamba with mamba or conda if needed):

git clone https://github.com/yandex-research/tabular-dl-tabr
cd tabular-dl-tabr
micromamba create -f environment.yaml
micromamba activate tabr

If the micromamba create command fails, try using environment-simple.yaml instead of environment.yaml. If your machine does not have GPUs, use environment-simple.yaml, but replace faiss-gpu with faiss-cpu and remove pytorch-cuda.
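As a quick sanity check of the installed libraries (a minimal sketch; it assumes the environment created above contains PyTorch and Faiss as specified in environment.yaml), you can run:

import faiss
import torch

print("PyTorch:", torch.__version__, "| CUDA available:", torch.cuda.is_available())
# faiss.get_num_gpus() exists only in the GPU build (faiss-gpu); the CPU-only build lacks it.
print("Faiss GPUs:", getattr(faiss, "get_num_gpus", lambda: 0)())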

Data

(License: we do not impose any new license restrictions in addition to the original licenses of the datasets used. See the paper to learn about the dataset sources.)

Navigate to the repository root and run the following commands:

wget https://huggingface.co/datasets/puhsu/tabular-benchmarks/resolve/main/data.tar -O tabular-dl-tabr.tar.gz
tar -xvf tabular-dl-tabr.tar.gz

After that, the data/ directory should appear.

Environment variables

When running scripts, the environment variable CUDA_VISIBLE_DEVICES must be set explicitly. So we assume that you run the following command before any other commands:

export CUDA_VISIBLE_DEVICES="0"
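To verify that the variable is picked up (a minimal sketch, assuming the PyTorch installation from the environment above), exactly one device should be visible:

import os
import torch

print("CUDA_VISIBLE_DEVICES =", os.environ.get("CUDA_VISIBLE_DEVICES"))
print("Visible CUDA devices:", torch.cuda.device_count())  # expected: 1 with the export above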

Quick test

To check that the environment is configured correctly, run the following command and wait for the training to finish (in this experiment, the hyperparameters and results are extremely suboptimal; the run is needed only to test the environment):

python bin/ffn.py exp/debug/0.toml --force

The last line of the output log should look like this:

[<<<] exp/debug/0 | <date & time>

Tutorial

Here, we reproduce the results for MLP on the California Housing dataset (in the paper, this dataset is referred to as "CA"). Reproducing the results for other algorithms and datasets is very similar, with rare exceptions that are covered in the following sections.

The detailed description of the repository is provided later in the "Understanding the repository" section. Until then, simply copying and pasting the instructions should just work.

Technically, reproducing the results for MLP on the California Housing dataset means reproducing the content of these directories:

  1. exp/mlp/california/0-tuning is the result of the hyperparameter tuning
  2. exp/mlp/california/0-evaluation is the result of evaluation of the tuned configuration from the previous step. This configuration is evaluated under 15 random seeds, which produces 15 single models.
  3. exp/mlp/california/0-ensemble-5 is the result of ensembles of the single models from the previous step (three disjoint ensembles each consisting of five models).

To reproduce the above results, run the following commands (this takes roughly 30-60 minutes on a single GPU):

cp exp/mlp/california/0-tuning.toml exp/mlp/california/0-reproduce-tuning.toml
python bin/go.py exp/mlp/california/0-reproduce-tuning.toml

In fact, 0-reproduce-tuning is an arbitrary name and you can choose a different one, but it must end with -tuning. Once the run is finished, the following directories should appear:

  • exp/mlp/california/0-reproduce-tuning
  • exp/mlp/california/0-reproduce-evaluation
  • exp/mlp/california/0-reproduce-ensemble-5

After that, you can go to notebooks/results.ipynb and view your results (see the instructions just before the last cell of that notebook).

Note that bin/go.py is just a shortcut and the above commands are equivalent to this:

cp exp/mlp/california/0-tuning.toml exp/mlp/california/0-reproduce-tuning.toml
python bin/tune.py exp/mlp/california/0-reproduce-tuning.toml
python bin/evaluate.py exp/mlp/california/0-reproduce-tuning
python bin/ensemble.py exp/mlp/california/0-reproduce-evaluation
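If you prefer to drive the same pipeline from Python (for example, to script several datasets), a minimal sketch that simply wraps the commands above with subprocess could look like this; it is not an official API of the repository (the scripts can also be imported as modules, see the "Technical notes" section below):

import shutil
import subprocess

# Copy the tuning config and run the three steps shown above.
src = "exp/mlp/california/0-tuning.toml"
dst = "exp/mlp/california/0-reproduce-tuning.toml"
shutil.copyfile(src, dst)
subprocess.run(["python", "bin/tune.py", dst], check=True)
subprocess.run(["python", "bin/evaluate.py", "exp/mlp/california/0-reproduce-tuning"], check=True)
subprocess.run(["python", "bin/ensemble.py", "exp/mlp/california/0-reproduce-evaluation"], check=True)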

Reproducing other results

General comments:

  • To reiterate, for most models, the pipeline for reproducing the results is the same as for MLP in the above tutorial. Here, we only cover exceptions from this pattern.
  • The last cell of notebooks/results.ipynb covers many (but not all) results from the paper with their locations in exp/.

Evaluating specific configurations without tuning. To evaluate a specific set of hyperparameters without tuning, you can use bin/go.py (to evaluate single models and ensembles) or bin/evaluate.py (to evaluate only single models). For example, this is how you can reproduce the results for the default XGBoost on the California Housing dataset:

mkdir exp/xgboost_/california/default2-reproduce-evaluation
cp exp/xgboost_/california/default2-evaluation/0.toml exp/xgboost_/california/default2-reproduce-evaluation/0.toml
python bin/go.py exp/xgboost_/california/default2-reproduce-evaluation --function bin.xgboost_.main

Note that now we have to explicitly pass the function that is being evaluated (--function bin.xgboost_.main). Again, default2-reproduce-evaluation is an arbitrary name; the only requirement is that it ends with -evaluation.

Custom versions of TabR. In bin/, there are several versions of the model. Each of them has a corresponding directory in exp/ with configs and results. See "Code overview" to learn more.

k Nearest Neighbors. To reproduce the results on the California Housing dataset:

cp exp/neighbors/california/0.toml exp/neighbors/california/0-reproduce.toml
python bin/neighbors.py exp/neighbors/california/0-reproduce.toml

mkdir exp/knn/california/0-reproduce-evaluation
cp exp/knn/california/0-evaluation/0.toml exp/knn/california/0-reproduce-evaluation/0.toml
python -c "
path = 'exp/knn/california/0-reproduce-evaluation/0.toml'
with open(path) as f:
    config = f.read()
with open(path, 'w') as f:
    f.write(config.replace(
        ':exp/neighbors/california/0',
        ':exp/neighbors/california/0-reproduce'
    ))
"
python bin/knn.py exp/knn/california/0-reproduce-evaluation/0.toml

DNNR. First, you need to run bin/dnnr_precompute_scaling.py and obtain results similar to exp/dnnr/precomputed_scaling ("loo" and "ohe" differ only in how the categorical features are encoded; the better of the two approaches is chosen in the next step based on the performance on the validation set). Then, you need to run bin/dnnr.py; the corresponding configs are located in exp/dnnr/<dataset name>.

NPT. To evaluate NPT, we use the official repository with modifications to allow using our datasets and preprocessing.

Understanding the repository

Read this if you are going to do more experiments/research in this repository.

Code overview

  • bin contains high-level scripts which produce the main results
    • Models
      • tabr.py is the "main" implementation of TabR with many useful technical comments inside
      • tabr_scaling.py is the version of tabr.py with support for the "context freeze" technique described in the paper
      • tabr_design.py is the version of tabr.py with more options for testing various design decisions and doing ablation studies
      • tabr_add_candidates_after_training.py is the version of tabr.py for evaluating the addition of new unseen candidates after the training as described in the paper
      • ffn.py implements the general "feed-forward network" approach (currently, only the MLP backbone is available, but adding new backbones is simple)
      • ft_transformer.py implements FT-Transformer from the "Revisiting Deep Learning Models for Tabular Data" paper
      • xgboost_.py implements XGBoost
      • lightgbm_.py implements LightGBM
      • catboost_.py implements CatBoost
      • neighbors.py + knn.py implement k Nearest Neighbors
      • dnnr_precompute_scaling.py + dnnr.py implement DNNR from the "DNNR: Differential Nearest Neighbors Regression" paper
      • saint.py implements SAINT from the "SAINT: Improved Neural Networks for Tabular Data via Row Attention and Contrastive Pre-Training" paper
      • anp.py implements the model from the "Attentive Neural Processes" paper
      • dkl.py implements the model from the "Deep Kernel Learning" paper
    • Infrastructure
      • tune.py tunes hyperparameters
      • evaluate.py evaluates a given config over multiple (by default, 15) random seeds
      • ensemble.py ensembles predictions produced by evaluate.py
      • go.py is a shortcut combining [tune.py + evaluate.py + ensemble.py]
  • notebooks contains Jupyter notebooks
  • lib contains common tools used by the scripts in bin and the notebooks in notebooks
  • exp contains experiment configs and results (metrics, tuned configurations, etc.)
    • usually, for a given script in bin, there is a corresponding directory in exp. However, this is just a convention, and you can have any layout in exp.

Running scripts

For most scripts in bin, the pattern is as follows:

python bin/some_script.py exp/a/b/c.toml

When the run is successfully finished, the result will be the exp/a/b/c folder. In particular, the exp/a/b/c/DONE file will be created. Usually, the main part of the result is the exp/a/b/c/report.json file.
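For example, the main score of a finished run can be read from report.json like this (a minimal sketch; it assumes the evaluation run directory from the tutorial above and the metrics/test/score layout of report.json, which may differ for some scripts):

import json
from pathlib import Path

run_dir = Path("exp/mlp/california/0-reproduce-evaluation/0")  # example run directory
report = json.loads((run_dir / "report.json").read_text())
print("test score:", report["metrics"]["test"]["score"])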

If you want to run the script with the same config again and overwrite the existing results, use the --force flag:

python bin/some_script.py exp/a/b/c.toml --force

Some scripts (bin/tune.py and bin/go.py) support the --continue flag.

The following scripts have a command-line interface instead of configs:

  • bin/go.py
  • bin/evaluate.py
  • bin/ensemble.py

Technical notes

  • (IMPORTANT) For most algorithms, the configs are expected to have the data section which describes the input dataset
    • For regression problems, always set y_policy = "standard" unless you are absolutely sure that you need another value
    • Unless a given deep learning algorithm is special in some way, for a given dataset, the data section should be copied from the MLP config for the same dataset. For example, for the California Housing dataset, the "source of truth" for deep learning algorithms is the exp/mlp/california/0-tuning.toml config.
  • (IMPORTANT) For deep learning algorithms, for each dataset, the batch size is predefined. As in the previous bullet, the MLP configs are the source of truth.
  • For saving and loading configs programmatically, use the lib.dump_config and lib.load_config functions (defined in lib/util.py) instead of bare TOML libraries.
  • In many configs, you can see that path-like values (e.g. a path to a dataset) start with ":". This means "relative to the repository root" and is handled by the lib.get_path function (defined in lib/env.py); see the illustrative sketch after this list.
  • The scripts in bin can be used as modules if needed: import bin.ffn. For example, this is used by bin/evaluate.py and bin/tune.py.
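For illustration only, the ":" convention can be pictured as the following resolution rule (a hypothetical sketch, not the actual lib.get_path implementation):

from pathlib import Path

REPO_ROOT = Path(".").resolve()  # assumption: the script is launched from the repository root

def resolve_path(value: str) -> Path:
    # A leading ":" means "relative to the repository root".
    return REPO_ROOT / value[1:] if value.startswith(":") else Path(value)

print(resolve_path(":data/california"))  # -> <repository root>/data/california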

Adding new datasets and metrics

How to add a new dataset

To apply the scripts from this repository to your custom dataset, you need to create a new directory in the data/ directory and use the same file names and data types as in our datasets. A good example is the data/adult dataset, where all supported feature types are present (numerical, binary and categorical). The .npy files are NumPy arrays saved with the np.save function (see its documentation).

Let's say your dataset is called my-dataset. Then, create the data/my-dataset directory with the following content:

  • If the dataset has numerical (i.e. continuous) features
    • Files: X_num_train.npy, X_num_val.npy, X_num_test.npy
    • NumPy data type: np.float32
  • If the dataset has binary features
    • Files: X_bin_train.npy, X_bin_val.npy, X_bin_test.npy
    • NumPy data type: np.float32
    • All values must be 0.0 and 1.0
  • If the dataset has categorical features
    • Files: X_cat_train.npy, X_cat_val.npy, X_cat_test.npy
    • NumPy data type: np.str_ (yes, the values must be strings)
  • Labels
    • Files: Y_train.npy, Y_val.npy, Y_test.npy
    • NumPy data type: np.float32 for regression, np.int64 for classification
    • For classification problems, the labels must form the range [0, ..., n_classes - 1].
  • info.json -- a JSON file with the following keys:
    • "task_type": one of "regression", "binclass", "multiclass"
    • (optional) "name": any string (a "pretty" name for your dataset, e.g. "My Dataset")
    • (optional) "id": any string (must be unique among all "id" keys of all info.json files of all datasets in data/)
  • READY -- just an empty file

At this point, your dataset is ready to use!
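For a concrete example, here is a minimal sketch that creates such a directory for a synthetic regression dataset (the name my-dataset and the feature counts are arbitrary; only the file names, dtypes and info.json keys follow the rules above):

import json
from pathlib import Path

import numpy as np

rng = np.random.default_rng(0)
out = Path("data/my-dataset")
out.mkdir(parents=True, exist_ok=True)

for part, n in {"train": 800, "val": 100, "test": 100}.items():
    x_num = rng.normal(size=(n, 4)).astype(np.float32)                # numerical features
    x_bin = rng.integers(0, 2, size=(n, 2)).astype(np.float32)        # binary features: 0.0 / 1.0
    x_cat = rng.choice(["a", "b", "c"], size=(n, 1)).astype(np.str_)  # categorical features: strings
    y = (x_num.sum(1) + x_bin.sum(1)).astype(np.float32)              # regression labels
    np.save(out / f"X_num_{part}.npy", x_num)
    np.save(out / f"X_bin_{part}.npy", x_bin)
    np.save(out / f"X_cat_{part}.npy", x_cat)
    np.save(out / f"Y_{part}.npy", y)

(out / "info.json").write_text(json.dumps({"task_type": "regression", "name": "My Dataset"}))
(out / "READY").touch()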

How to optimize a custom metric

The "main" metric which is optimized in this repository is referred to as "score". Score is always maximized. By default:

  • for regression problems, the score is negative RMSE
  • for classification problems, the score is accuracy

In the _SCORE_SHOULD_BE_MAXIMIZED dictionary in lib/data.py, you can find other supported scores. To use any of them, set the "score" field in the [data] section of a config:

...

[data]
seed = 0
path = ":data/california"
...
score = "r2"

...

To implement a custom metric, add its name to the _SCORE_SHOULD_BE_MAXIMIZED dictionary and compute it in the lib/metrics.py:calculate_metrics function.
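For example, a macro-averaged F1 score for classification could be computed as follows (an illustrative sketch only; macro_f1 is a hypothetical helper, and how exactly it is wired into _SCORE_SHOULD_BE_MAXIMIZED and lib/metrics.py:calculate_metrics depends on the existing code):

import numpy as np
from sklearn.metrics import f1_score

def macro_f1(y_true: np.ndarray, y_pred_labels: np.ndarray) -> float:
    # Higher is better, so this score "should be maximized".
    return float(f1_score(y_true, y_pred_labels, average="macro"))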

How to add a new task type

We do not provide instructions for that. While adding new task types is definitely possible, the code is written without other task types in mind. For example, there may be places where the code implicitly assumes that the task is either regression or classification. So adding a new task type will require carefully reviewing the whole codebase to find the places where the new task type should be taken into account.

How to cite

@article{gorishniy2023tabr,
    title={TabR: Unlocking the Power of Retrieval-Augmented Tabular Deep Learning},
    author={
        Yury Gorishniy and
        Ivan Rubachev and
        Nikolay Kartashev and
        Daniil Shlenskii and
        Akim Kotelnikov and
        Artem Babenko
    },
    journal={arXiv},
    volume={2307.14338},
    year={2023},
}

tabular-dl-tabr's People

Contributors

samoed, yura52


tabular-dl-tabr's Issues

How to evaluate the performance of MLP on regression-cat-medium-0-OnlineNewsPopularity?

Hello,
I checked the evaluation report.json of MLP on regression-cat-medium-0-OnlineNewsPopularity in your code. The best-epoch metrics are as follows:
"n_parameters": 495793,
"prediction_type": null,
"best_epoch": 26,
"metrics": {
"train": {
"rmse": 0.8142614908186779,
"mae": 0.5985163409301961,
"r2": 0.23417308630867428,
"score": -0.8142614908186779
},
"val": {
"rmse": 0.844946381250874,
"mae": 0.6250255374955493,
"r2": 0.15331082859100664,
"score": -0.844946381250874
},
"test": {
"rmse": 0.8618776869166989,
"mae": 0.6317802140393205,
"r2": 0.14868952248971512,
"score": -0.8618776869166989
}
}
Generally, lower values for RMSE and MAE are desirable, and an R² closer to 1 indicates better explanatory power of the model. Based on the provided results, the model performs relatively poorly on the validation and test sets, and the R² values suggest limited explanatory capability. Is further optimization of the model or consideration of alternative improvement strategies still necessary? In addition, TensorBoard is provided in the project. How can we analyze this model based on the provided TensorBoard?
Thanks~

Expected 2d tensor for the single feature of such type, got 1d

Hi! Thank you for your interesting work.
I faced some problems because of this function:

def to_torch(self, device=None) -> 'Dataset[Tensor]':

I have a dataset with only one binary feature; it is flattened to a 1d tensor, but a 2d tensor is expected later. Writing torch.atleast_2d(torch.as_tensor(value)).to(device) instead of torch.as_tensor(value).to(device) solved this problem.

Datasets origin

Hi, congratulations for the great work!

Just a few questions about the "why" datasets from https://github.com/LeoGrin/tabular-benchmark (https://huggingface.co/datasets/inria-soda/tabular-benchmark). I noticed that the online news dataset is not in the original benchmark.

Also, classif-cat-medium-0-compass, which I imagine is compas-two-years, is very different.

The one you provide in https://huggingface.co/datasets/puhsu/tabular-benchmarks/resolve/main/data.tar has these characteristics:
{
    "name": "classif-cat-medium-1-compass",
    "id": "classif-cat-medium-1-compass",
    "train_size": 10000,
    "val_size": 1993,
    "test_size": 4651,
    "n_num_features": 8,
    "n_cat_features": 7,
    "n_bin_features": 2
}

while the original (https://www.openml.org/search?type=data&sort=runs&id=45039&status=active) has 11 features and 4,966 rows in total.

I'm comparing with my own algorithm and using your results on the "why" benchmark as a reference. Thanks in advance!

The change of the candidate set during training

Thank you for your interesting work; it inspires me a lot. There are a couple of questions that have been bugging me.
The work initially uses the entire training set as the fixed set of candidates for all objects.
  1. The function apply_model removes the current batch from the candidates.
  2. However, when computing the forward output, it adds the current batch back to the candidate set and predicts the output. Should the current batch be added to the candidate set only after predicting the output?
  3. Additionally, when adding to the candidate set and retrieving context samples, it is guaranteed to retrieve samples that match the target. The related index is then removed from the obtained indices.
I can't understand the role of and connection between these three operations; could you provide some suggestions?

inference

Hello, I trained your model on my dataset; thank you a lot for this brilliant work.
But I don't understand how to make predictions on my X_test without y_test. (I put 50% of the validation set in place of the real X_test.)

add a new dataset

When I prepared the new dataset in the required format, I tried to apply the TabR model to it. This is the bug I met when running the command "python bin/go.py exp/tabr/albert/0-tuning.toml --force":

File "/mnt/user/yezhen/tabular-dl-tabr/lib/data.py", line 122, in
part: torch.as_tensor(value).to(device)
TypeError: can't convert np.ndarray of type numpy.str_. The only supported types are: float64, float32, float16, complex64, complex128, int64, int32, int16, int8, uint8, and bool.

RuntimeError: mat1 and mat2 must have the same dtype

Thanks very much for this great work.

I am trying to understand the code and use it in my research.

I encountered an error and don't know how to fix it. Any suggestions would be greatly appreciated.

Here is the code:

# %%
data = {
    "X_num": {
        "train": X_train,
        "val": X_test,
    },
    "Y": {
        "train": y_train,
        "val": y_test,
    },
}

# %%
dataset = Dataset(
    data=data,
    task_type=TaskType.REGRESSION,
    score='rmse',
    y_info=None,
    _Y_numpy=None,
)

seed = 42
model = {
    'num_embeddings': None,  # Example embedding configuration
    'd_main': 64,
    'd_multiplier': 1.0,
    'encoder_n_blocks': 2,
    'predictor_n_blocks': 2,
    'mixer_normalization': False,
    'context_dropout': 0.1,
    'dropout0': 0.1,
    'dropout1': 0.1,
    'normalization': 'BatchNorm1d',
    'activation': 'ReLU',
}

# define Config
config = Config(
    seed=seed,
    data=dataset,
    model=model,
    context_size=5,
    optimizer={'type': 'Adam', 'lr': 0.001},
    batch_size=64,
    patience=10,
    n_epochs=10,
)

# %%
output_path = "./output"
force = True
report = main(config, output_path, force=force)

The error details are as follows:


RuntimeError Traceback (most recent call last)
File /Users/hjyu/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/tabr_test.py:4
2 output_path = "./output"
3 force = True
----> 4 report = main(config, output_path, force=force)

File ~/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/bin/tabr.py:508, in main(config, output, force)
503 epoch_losses = []
504 for batch_idx in tqdm(
505 lib.make_random_batches(train_size, C.batch_size, device),
506 desc=f'Epoch {epoch}',
507 ):
--> 508 loss, new_chunk_size = lib.train_step(
509 optimizer,
510 lambda idx: loss_fn(apply_model('train', idx, True), Y_train[idx]),
511 batch_idx,
512 chunk_size or C.batch_size,
513 )
514 epoch_losses.append(loss.detach())
515 if new_chunk_size and new_chunk_size < (chunk_size or C.batch_size):

File ~/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/lib/deep.py:447, in train_step(optimizer, step_fn, batch, chunk_size)
445 optimizer.zero_grad()
446 if batch_size <= chunk_size:
--> 447 loss = step_fn(batch)
448 loss.backward()
449 else:

File ~/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/bin/tabr.py:510, in main.<locals>.<lambda>(idx)
503 epoch_losses = []
504 for batch_idx in tqdm(
505 lib.make_random_batches(train_size, C.batch_size, device),
506 desc=f'Epoch {epoch}',
507 ):
508 loss, new_chunk_size = lib.train_step(
509 optimizer,
--> 510 lambda idx: loss_fn(apply_model('train', idx, True), Y_train[idx]),
511 batch_idx,
512 chunk_size or C.batch_size,
513 )
514 epoch_losses.append(loss.detach())
515 if new_chunk_size and new_chunk_size < (chunk_size or C.batch_size):

File ~/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/bin/tabr.py:436, in main.<locals>.apply_model(part, idx, training)
428 candidate_indices = candidate_indices[~torch.isin(candidate_indices, idx)]
429 candidate_x, candidate_y = get_Xy(
430 'train',
431 # This condition is here for historical reasons, it could be just
432 # the unconditional candidate_indices.
433 None if candidate_indices is train_indices else candidate_indices,
434 )
--> 436 return model(
437 x_=x,
438 y=y if is_train else None,
439 candidate_x_=candidate_x,
440 candidate_y=candidate_y,
441 context_size=C.context_size,
442 is_train=is_train,
443 ).squeeze(-1)

File ~/anaconda3/envs/tabr/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/bin/tabr.py:243, in Model.forward(self, x_, y, candidate_x_, candidate_y, context_size, is_train)
212 def forward(
213 self,
214 *,
(...)
221 ) -> Tensor:
222 # >>>
223 with torch.set_grad_enabled(
224 torch.is_grad_enabled() and not self.memory_efficient
225 ):
(...)
240 # performed without gradients.
241 # Later, it is recomputed with gradients only for the context objects.
242 candidate_k = (
--> 243 self.encode(candidate_x)[1]
244 if self.candidate_encoding_batch_size is None
245 else torch.cat(
246 [
247 self.encode(x)[1]
248 for x in delu.iter_batches(
249                         candidate_x, self.candidate_encoding_batch_size
250 )
251 ]
252 )
253 )
254 x, k = self.encode(x)
255 if is_train:
256 # NOTE: here, we add the training batch back to the candidates after the
257 # function apply_model removed them. The further code relies
258 # on the fact that the first batch_size candidates come from the
259 # training batch.

File ~/Library/Mobile Documents/com~apple~CloudDocs/Code/Transfer_Learning_Tabular/TabR/bin/tabr.py:206, in Model._encode(failed resolving arguments)
203 assert x  # assert that the list x is not empty; this is probably to ensure the correctness of the input data
204 x = torch.cat(x, dim=1)
--> 206 x = self.linear(x)
207 for block in self.blocks0:
208 x = x + block(x)

File ~/anaconda3/envs/tabr/lib/python3.9/site-packages/torch/nn/modules/module.py:1194, in Module._call_impl(self, *input, **kwargs)
1190 # If we don't have any hooks, we want to skip the rest of the logic in
1191 # this function, and just call forward.
1192 if not (self._backward_hooks or self._forward_hooks or self._forward_pre_hooks or _global_backward_hooks
1193 or _global_forward_hooks or _global_forward_pre_hooks):
-> 1194 return forward_call(*input, **kwargs)
1195 # Do not call functions when jit is used
1196 full_backward_hooks, non_full_backward_hooks = [], []

File ~/anaconda3/envs/tabr/lib/python3.9/site-packages/torch/nn/modules/linear.py:114, in Linear.forward(self, input)
113 def forward(self, input: Tensor) -> Tensor:
--> 114 return F.linear(input, self.weight, self.bias)

RuntimeError: mat1 and mat2 must have the same dtype

Disable time-dependent leaks during training

I now have the first results for TabR on my custom dataset! Thanks for your repo so far!

However, I still have a problem with the current implementation of TabR.
This is what the paper states:
"Figure 4: A simplified illustration of the retrieval module R, introduced in Figure 2. For the target object's representation x˜, the module takes the m nearest neighbors among the candidates {x˜i} according to the similarity module S and aggregates their values produced by the value module V"

This approach is good for non-time-dependent datasets like the Titanic dataset, where each element is independent of the others.

However, we have data from an auto-completion use case where the column "CREATIONDATE" is a date column that massively affects the results: knowledge from future dates leaks into elements of the past. This is why the train and test split is done in the following way:

df_train = df[(df.CREATIONDATE >= '20190101') & (df.CREATIONDATE <= '20191231')]
df_test = df[(df.CREATIONDATE >= '20200101') & (df.CREATIONDATE <= '20200229')]

As you can see, the test set is strictly after the train set on the timeline. Without the model learning this constraint during training, the test results are not very good.

We somehow also need to enforce this inside the train set during training. That means when predicting the class of one row at train time, we need to make sure that only elements of the train set from the past are used as candidates (i.e. with CREATIONDATE_candidates < CREATIONDATE_train_element_we_want_to_predict).

Where in the code do I need to change this logic in the best possible way?

micromamba environment setup issue

Hi, thanks for sharing this repo.

I tried to set up an environment by following the instructions in the README: micromamba create -f environment.yaml
However, I got the following errors.
I was able to resolve the issues with cudatoolkit, panel and bokeh by modifying the versions, but not for pytorch.
Could you help me address this issue?

nvidia/linux-64                                               No change
nvidia/noarch                                                 No change
conda-forge/noarch                                            No change
conda-forge/linux-64                                          No change
pytorch/noarch                                                No change
pytorch/linux-64                                              No change
pyviz/linux-64                                                No change
pyviz/noarch                                                  No change
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
warning  libmamba Problem type not implemented SOLVER_RULE_STRICT_REPO_PRIORITY
error    libmamba Could not solve for environment specs
    The following packages are incompatible
    ├─ bokeh 3.0.3**  is requested and can be installed;
    ├─ cudatoolkit 11.8.0**  is not installable because it conflicts with any installable versions previously reported;
    ├─ panel 0.10.3**  is not installable because there are no viable options
    │  ├─ panel 0.10.3 would require
    │  │  └─ bokeh >=2.2,<2.3 , which conflicts with any installable versions previously reported;
    │  └─ panel 0.10.3 conflicts with any installable versions previously reported;
    └─ pytorch 1.13.1*  is not installable because it conflicts with any installable versions previously reported.
critical libmamba Could not solve for environment specs
