
evo's People

Contributors

artificial-paul, athms, brianhie, eltociear, exnx, garykbrixi, vasa-develop, zymrael


evo's Issues

Fine-tuning on a sequence classification task

I am interested in using this model on some classification tasks, but when I tried to set up the fine-tuning script using AutoModelForSequenceClassification, I got an error (see below). I assume this is because StripedHyena is a new type of model. Do you have any suggestions?

Thank you.
LeAnn

Traceback (most recent call last):
  File "/home/llindsey1/CHPC/evo_finetune.py", line 32, in <module>
    seq_classification_model = AutoModelForSequenceClassification.from_config(config)
  File "/home/llindsey1/.conda/envs/EVO/lib/python3.9/site-packages/transformers/models/auto/auto_factory.py", line 437, in from_config
    raise ValueError(
ValueError: Unrecognized configuration class <class 'transformers_modules.togethercomputer.evo-1-131k-base.8eb9480ea22de5f86eeebc1199a76b63b42d7170.configuration_hyena.StripedHyenaConfig'> for this kind of AutoModel: AutoModelForSequenceClassification.
Model type should be one of AlbertConfig, BartConfig, BertConfig, BigBirdConfig, BigBirdPegasusConfig, BioGptConfig, BloomConfig, CamembertConfig, CanineConfig, LlamaConfig, ConvBertConfig, CTRLConfig, Data2VecTextConfig, DebertaConfig, DebertaV2Config, DistilBertConfig, ElectraConfig, ErnieConfig, ErnieMConfig, EsmConfig, FalconConfig, FlaubertConfig, FNetConfig, FunnelConfig, GemmaConfig, GPT2Config, GPT2Config, GPTBigCodeConfig, GPTNeoConfig, GPTNeoXConfig, GPTJConfig, IBertConfig, LayoutLMConfig, LayoutLMv2Config, LayoutLMv3Config, LEDConfig, LiltConfig, LlamaConfig, LongformerConfig, LukeConfig, MarkupLMConfig, MBartConfig, MegaConfig, MegatronBertConfig, MistralConfig, MixtralConfig, MobileBertConfig, MPNetConfig, MptConfig, MraConfig, MT5Config, MvpConfig, NezhaConfig, NystromformerConfig, OpenLlamaConfig, OpenAIGPTConfig, OPTConfig, PerceiverConfig, PersimmonConfig, PhiConfig, PLBartConfig, QDQBertConfig, Qwen2Config, ReformerConfig, RemBertConfig, RobertaConfig, RobertaPreLayerNormConfig, RoCBertConfig, RoFormerConfig, SqueezeBertConfig, StableLmConfig, T5Config, TapasConfig, TransfoXLConfig, UMT5Config, XLMConfig, XLMRobertaConfig, XLMRobertaXLConfig, XLNetConfig, XmodConfig, YosoConfig.
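A hedged workaround until StripedHyena is registered for AutoModelForSequenceClassification: load the causal LM and attach a small classification head on top of its output. A minimal sketch — pooling the vocabulary logits is a stand-in, and the .logits field and config.vocab_size attribute are assumptions based on what the Hugging Face wrapper reports elsewhere in this list:

    import torch
    from transformers import AutoConfig, AutoModelForCausalLM

    model_name = 'togethercomputer/evo-1-131k-base'
    config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    backbone = AutoModelForCausalLM.from_pretrained(model_name, config=config, trust_remote_code=True)

    class EvoForSequenceClassification(torch.nn.Module):
        # Hedged sketch: mean-pool per-position outputs, then classify.
        # Ideally one would pool an internal hidden state instead (see the
        # embedding-extraction issues further down this list).
        def __init__(self, backbone, vocab_size, num_labels):
            super().__init__()
            self.backbone = backbone
            self.head = torch.nn.Linear(vocab_size, num_labels)

        def forward(self, input_ids):
            logits = self.backbone(input_ids).logits  # (batch, length, vocab)
            pooled = logits.float().mean(dim=1)       # (batch, vocab)
            return self.head(pooled)                  # (batch, num_labels)

    clf = EvoForSequenceClassification(backbone, config.vocab_size, num_labels=2)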

Fine-tune Evo for sequence-level prediction

Could you supply a script or notebook demonstrating how to perform regression analysis using two sequences as inputs? The input format would be "[CLS]GCTTAGCGAGACTAAATTATATAGCAGCT[SEP]CTAACTGCAGCCCGCCCGTAT", and the prediction should utilize the hidden state associated with the "[CLS]" token.

Thank you,
Jinglie

Training code availability?

Hello Evo team, thanks for an exciting release. Is there any chance we will ever see dataset creation and training code released? I believe this would be great for open science and open source, and encourage other machine learners to join this super exciting direction you've started. :)

PEFT Prompt Tuning: forward() got an unexpected keyword argument 'inputs_embeds'

Has anyone successfully applied PEFT prompt tuning to Evo?

With the HF SFT Trainer and the following PEFT config:

    peft_config = PromptTuningConfig(
        task_type=TaskType.CAUSAL_LM,
        num_virtual_tokens=128,
        tokenizer_name_or_path='togethercomputer/evo-1-131k-base'
    )
    peft_model = get_peft_model(model, peft_config)
    peft_model.print_trainable_parameters()

I get the following error:

Traceback (most recent call last):
  File "sft.py", line 227, in <module>
    trainer.train()
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/trl/trainer/sft_trainer.py", line 361, in train
    output = super().train(*args, **kwargs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/trainer.py", line 1859, in train
    return inner_training_loop(
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/trainer.py", line 2203, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/trainer.py", line 3138, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/transformers/trainer.py", line 3161, in compute_loss
    outputs = model(**inputs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/peft/peft_model.py", line 1177, in forward
    return self.base_model(inputs_embeds=inputs_embeds, **kwargs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1532, in _wrapped_call_impl
    return self._call_impl(*args, **kwargs)
  File "/home/mleske/anaconda3/envs/py38/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1541, in _call_impl
    return forward_call(*args, **kwargs)
TypeError: forward() got an unexpected keyword argument 'inputs_embeds'

Any pointer on what I am doing wrong would be greatly appreciated.
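One hedged diagnostic before wiring up PEFT: prompt tuning works by injecting learned embeddings through inputs_embeds, so the base model's forward() must accept that keyword, which StripedHyena's apparently does not. A quick check:

    import inspect

    # If this prints False, PEFT prompt tuning cannot call the model as-is.
    print('inputs_embeds' in inspect.signature(model.forward).parameters)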

More prompt-scheme documentation?

Could you provide more prompt-scheme documentation beyond what is in the example Colab notebook using the "greengenes-style lineage strings"? Are there other possible natural-language-like prompts, or even a zero-shot prompting possibility?
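For reference, a hedged sketch of the lineage-string prompt format quoted in the "Scoring DNA with lineages" issue further down this list; the exact delimiters should be checked against the example Colab notebook:

    # Greengenes-style lineage string used as a generation prompt (format as
    # quoted elsewhere in these issues; treat the delimiters as an assumption).
    prompt = '|d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus subtilis|'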

Questions about the performance of evo

Given the example "ACGT" as a prompt, the output from the 8k-base model seems strange:

[screenshot: generated output dominated by long runs of T]

Is it possible to find such a DNA sequence in the real world? I assume no real sequence has this many Ts in a row. Thanks.

AttributeError: 'InferenceParams' object has no attribute 'fused_ft_kernel'

When I try to run generate.py with the prompt "ACGT":

(evo) qj@supermicro-a526:~/python-project/evo/evo-main$ python -m scripts.generate --model-name 'evo-1-131k-base' --prompt ACGT --n-samples 10 --n-tokens 100 --temperature 1. --top-k 4 --device cuda:3
/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/huggingface_hub/file_download.py:1132: FutureWarning: resume_download is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use force_download=True.
warnings.warn(
Loading checkpoint shards: 100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 3/3 [00:00<00:00, 3.47it/s]
Generated sequences:
Traceback (most recent call last):
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/runpy.py", line 194, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/runpy.py", line 87, in _run_code
    exec(code, run_globals)
  File "/home/qj/python-project/evo/evo-main/scripts/generate.py", line 67, in <module>
    main()
  File "/home/qj/python-project/evo/evo-main/scripts/generate.py", line 50, in main
    output_seqs, output_scores = generate(
  File "/home/qj/python-project/evo/evo-main/evo/generation.py", line 71, in generate
    output_ids, logits = g.generate(
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/stripedhyena/generation.py", line 111, in generate
    logits, inference_params_dict_out = self.model(
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/stripedhyena/model.py", line 362, in forward
    x, inference_params_dict_out = self.stateful_forward(
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/stripedhyena/model.py", line 377, in stateful_forward
    x, _ = block(x, inference_params=inference_params)
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/stripedhyena/model.py", line 73, in forward
    self.inner_mha_cls(
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1190, in _call_impl
    return forward_call(*input, **kwargs)
  File "/home/qj/anaconda3/envs/evo/lib/python3.8/site-packages/flash_attn/modules/mha.py", line 498, in forward
    if (not inference_params.fused_ft_kernel) or inference_params.sequence_len_offset == 0:
AttributeError: 'InferenceParams' object has no attribute 'fused_ft_kernel'
How can I solve this?
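A hedged diagnosis: the installed flash-attn expects a fused_ft_kernel attribute that stripedhyena's InferenceParams does not define, i.e. the two packages are at incompatible versions; reinstalling the versions pinned by the evo repository's requirements usually resolves this kind of mismatch. Checking what is installed:

    import flash_attn, torch
    from importlib.metadata import version

    print('flash-attn:', flash_attn.__version__)
    print('torch:', torch.__version__)
    print('stripedhyena:', version('stripedhyena'))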

Release of OpenGenome

Hi,
Thanks for this great work! May we kindly ask when the pre-training dataset OpenGenome will be released? And will the datasets for the downstream tasks be released as well?

ModuleNotFoundError: No module named 'transformers_modules.togethercomputer.evo-1-131k-base.9562f3fdc38f09b92594864c5e98264f1bfbca33.tokenizer'

Hi all and thanks for open sourcing this interesting model!

I managed to install flash-attention and all the other packages, so I am able to import the evo package.
But I am stuck with the following error:
ModuleNotFoundError: No module named 'transformers_modules.togethercomputer.evo-1-131k-base.9562f3fdc38f09b92594864c5e98264f1bfbca33.tokenizer'

This happens regardless of whether I use the source code

from evo import Evo
import torch
device = 'cuda:0'
evo_model = Evo('evo-1-131k-base') # here it crashes

or trying to load directly from HF

from transformers import AutoConfig, AutoModelForCausalLM
model_name = 'togethercomputer/evo-1-131k-base'
model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
model_config.use_cache = True
model = AutoModelForCausalLM.from_pretrained(model_name, config=model_config, trust_remote_code=True) # here it crashes

The error points to transformers_modules.togethercomputer.evo-1-131k-base regardless of which Evo checkpoint I select, and I have tried updating transformers both to the latest version and to "4.36.2", as shown in https://huggingface.co/togethercomputer/evo-1-131k-base/blob/main/generation_config.json

Any clue on how to solve this error please? Thanks!
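One hedged workaround for stale transformers_modules errors is to clear the dynamic-module cache so transformers re-fetches the remote code on the next load:

    import os, shutil

    # Default cache location for remote code; adjust if HF_HOME is set.
    cache = os.path.expanduser('~/.cache/huggingface/modules/transformers_modules/togethercomputer')
    shutil.rmtree(cache, ignore_errors=True)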

OSError: togethercomputer/evo-1-131k-base does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

Hi,

I just run the example command:

from evo import Evo
import torch

device = 'cuda:1'

evo_model = Evo('evo-1-131k-base')
model, tokenizer = evo_model.model, evo_model.tokenizer
model.to(device)
model.eval()

sequence = 'ACGT'
input_ids = torch.tensor(
    tokenizer.tokenize(sequence),
    dtype=torch.int,
).to(device).unsqueeze(0)
logits, _ = model(input_ids) # (batch, length, vocab)

print('Logits: ', logits)
print('Shape (batch, length, vocab): ', logits.shape)

But it shows the error below, even when I use the Hugging Face version directly. Does anyone know what the problem is?
OSError: togethercomputer/evo-1-131k-base does not appear to have a file named pytorch_model.bin, tf_model.h5, model.ckpt or flax_model.msgpack.

Thanks,
Jinglie
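A hedged note on this error: the checkpoint is published as sharded safetensors (other issues in this list show "Loading checkpoint shards: 3/3"), so an old transformers or a missing safetensors package can make from_pretrained look only for pytorch_model.bin. A quick environment check:

    # Both must import cleanly; as noted above, the repo's
    # generation_config.json references transformers 4.36.2.
    import transformers, safetensors

    print(transformers.__version__)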

Estimate Required RAM for inference

Hello,

Could someone please provide an estimate of the required RAM with respect to the sequence length for inference? I tried running the example script with 16GB, and it ran out of memory.

Thank you.
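For back-of-envelope sizing: Evo is a 7B-parameter model (see the Mamba-comparison issue below), so the weights alone occupy roughly 14 GB in bfloat16, before activations and inference state that grow with sequence length — which already exceeds what a 16 GB card leaves free:

    params = 7e9
    bytes_per_param = 2  # bfloat16
    print(params * bytes_per_param / 1e9, 'GB')  # ~14 GB for weights alone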

Require for GWAS(genome wide association studies)

Thank you for the amazing work! Can Evo be used to predict variant sites in rice genes, similar to performing genome-wide association analysis? Are there any tutorials available for reference?
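There is no GWAS tutorial in this list, but variant-effect scoring with a likelihood model is usually framed as the difference in log-likelihood between the variant and reference sequences. A minimal sketch using the tokenizer/model calls from the example elsewhere in this list; the windowing around the variant site is an assumption:

    import torch

    def seq_logprob(model, tokenizer, seq, device='cuda:0'):
        # Sum of next-token log-probabilities of `seq` under the model.
        ids = torch.tensor(tokenizer.tokenize(seq), dtype=torch.long).unsqueeze(0).to(device)
        with torch.no_grad():
            logits, _ = model(ids)  # (1, length, vocab)
        logp = torch.log_softmax(logits[0, :-1].float(), dim=-1)
        return logp.gather(-1, ids[0, 1:].unsqueeze(-1)).sum().item()

    # Hypothetical ref_window / alt_window: sequences centered on the variant site.
    # score = seq_logprob(model, tokenizer, alt_window) - seq_logprob(model, tokenizer, ref_window)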

gene essentiality

Can evo-1-8k-base be used for scoring gene essentiality, or only evo-1-131k-base? Thank you!

issue on model.to("cuda") with device_map="auto"

Hi,

I am getting the error below while trying to load the model on my 2x RTX 3060 setup with the device_map="auto" parameter:

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1395, in check_device_map(model, device_map)
1393 if len(all_model_tensors) > 0:
1394 non_covered_params = ", ".join(all_model_tensors)
-> 1395 raise ValueError(
1396 f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
1397 )
ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight

my code is:

In [2]: from transformers import AutoConfig, AutoModelForCausalLM
   ...:
   ...: model_name = 'togethercomputer/evo-1-8k-base'
   ...: #model_name = "togethercomputer/evo-1-131k-base"
   ...:
   ...: model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
   ...: model_config.use_cache = True
   ...:
   ...: model = AutoModelForCausalLM.from_pretrained(
   ...:     model_name,
   ...:     config=model_config,
   ...:     trust_remote_code=True,
   ...:     revision="1.1_fix",
   ...:     cache_dir="/llms/evo",
   ...:     low_cpu_mem_usage=True,
   ...:     device_map="auto",  # only changed here from the repo code, so that the weights are distributed across multiple GPUs
   ...: )

What is the root cause here, and what are possible solutions?

Any help is much appreciated. Thanks

Here you can check out the whole stderr output:

Loading checkpoint shards: 100%|████████████████| 3/3 [00:03<00:00, 1.11s/it]
Some weights of StripedHyenaModelForCausalLM were not initialized from the model checkpoint at togethercomputer/evo-1-8k-base and are newly initialized: ['backbone.unembed.weight']
You should probably TRAIN this model on a down-stream task to be able to use it for predictions and inference.
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
Cell In[1], line 9
      6 model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True, revision="1.1_fix")
      7 model_config.use_cache = True
----> 9 model = AutoModelForCausalLM.from_pretrained(
     10     model_name,
     11     config=model_config,
     12     trust_remote_code=True,
     13     revision="1.1_fix",
     14     cache_dir="/media/raid/llms/evo",
     15     low_cpu_mem_usage=True,
     16     device_map="auto"
     17 )

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/models/auto/auto_factory.py:558, in _BaseAutoModelClass.from_pretrained(cls, pretrained_model_name_or_path, *model_args, **kwargs)
556 else:
557 cls.register(config.__class__, model_class, exist_ok=True)
--> 558 return model_class.from_pretrained(
559 pretrained_model_name_or_path, *model_args, config=config, **hub_kwargs, **kwargs
560 )
561 elif type(config) in cls._model_mapping.keys():
562 model_class = _get_model_class(config, cls._model_mapping)

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/transformers/modeling_utils.py:3820, in PreTrainedModel.from_pretrained(cls, pretrained_model_name_or_path, config, cache_dir, ignore_mismatched_sizes, force_download, local_files_only, token, revision, use_safetensors, *model_args, **kwargs)
3818 device_map_kwargs["force_hooks"] = True
3819 if not is_fsdp_enabled() and not is_deepspeed_zero3_enabled():
-> 3820 dispatch_model(model, **device_map_kwargs)
3822 if hf_quantizer is not None:
3823 hf_quantizer.postprocess_model(model)

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/big_modeling.py:351, in dispatch_model(model, device_map, main_device, state_dict, offload_dir, offload_index, offload_buffers, skip_keys, preload_module_classes, force_hooks)
317 """
318 Dispatches a model according to a given device map. Layers of the model might be spread across GPUs, offloaded on
319 the CPU or even the disk.
(...)
348 single device.
349 """
350 # Error early if the device map is incomplete.
--> 351 check_device_map(model, device_map)
353 # for backward compatibility
354 is_bnb_quantized = (
355 getattr(model, "is_quantized", False) or getattr(model, "is_loaded_in_8bit", False)
356 ) and getattr(model, "quantization_method", "bitsandbytes") == "bitsandbytes"

File /home/uad/sandbox/evo/venv/lib/python3.10/site-packages/accelerate/utils/modeling.py:1419, in check_device_map(model, device_map)
1417 if len(all_model_tensors) > 0:
1418 non_covered_params = ", ".join(all_model_tensors)
-> 1419 raise ValueError(
1420 f"The device_map provided does not give any device for the following parameters: {non_covered_params}"
1421 )

ValueError: The device_map provided does not give any device for the following parameters: backbone.unembed.weight
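A hedged reading of the warning above: backbone.unembed.weight is reported as newly initialized, and accelerate's auto device map never assigns it a device. Until multi-GPU dispatch works for this model, one fallback is to pin the whole model to a single device:

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        config=model_config,
        trust_remote_code=True,
        revision="1.1_fix",
        low_cpu_mem_usage=True,
        device_map={"": 0},  # place every parameter on GPU 0 instead of sharding
    )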

Finetune script

Could you provide a script or notebook demo showing how to fine-tune this model?

GPU memory usage

I'd like to ask how much GPU memory the "evo-1-8k-base" and "evo-1-131k-base" models use. Could you consider two situations, inference and fine-tuning?
Do you have any suggested graphics cards for running Evo?

Limiting attention radius and extracting embeddings

Hello,

Is it possible to alter the model's attention radius, such that the model only applies attention within a certain window in the input?

A second question: can you explain how I might extract embeddings from the model? I am using the model out of the box, so the final output is a tensor of shape batch × input length × vocab size, but I'd like to access the internal latent-space representation of my sequence as well.

Thank you!
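On the second question, a hedged sketch: intermediate activations can be captured with a standard PyTorch forward hook, without modifying the model. The module path below is hypothetical; run print(model) to find the real block names:

    activations = {}

    def save_hidden(module, inputs, output):
        # Some blocks return tuples; keep the tensor part.
        activations['hidden'] = output[0] if isinstance(output, tuple) else output

    # Hypothetical submodule path -- inspect print(model) for the actual names.
    handle = model.backbone.blocks[-1].register_forward_hook(save_hidden)
    logits, _ = model(input_ids)
    handle.remove()
    print(activations['hidden'].shape)  # expected: (batch, length, hidden_dim)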

Mamba training loss at 7B parameters

Dear authors,

Very impressive work! I have just finished reading your paper and am very curious about the training loss curve for a 7B Mamba model, alongside those shown in Figure S4 of the paper for Transformer++, Hyena, and StripedHyena. Do you happen to have it?

Best wishes.

Scoring Question

Hello. Thank you for creating the EVO tool. I just wanted to ask a general question.

When generating sequences, does a more positive value/score (e.g., -0.5 instead of -1.2) mean that the generated sequence is more probable, and likewise that the next nucleotide or token the model predicts is more probable?

Thanks
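Assuming the score is a mean per-token natural-log probability (an assumption; check scripts/score.py for the exact definition), yes on both counts: a less negative score means a more probable sequence. Converting to per-token probabilities makes the gap concrete:

    import math

    print(math.exp(-0.5))  # ~0.61 per-token probability
    print(math.exp(-1.2))  # ~0.30 per-token probability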

T4 GPUs support? Any recommendation

Hello,

First and foremost, commendations on your work and paper. I've been attempting to run Evo locally on T4 GPUs, but I ran into FlashAttention 2.0 not being supported on them. I have a few questions regarding this:

1. Do you have any plans to support T4 GPUs in the near future?
2. Will a single 16 GB T4 GPU be sufficient for inference? If not, can we apply some optimization (e.g., with DeepSpeed) to the Hugging Face models?
3. Is there a way to use FlashAttention 1.x, or can we disable FlashAttention entirely?
4. Is it possible to use float16 rather than bfloat16?

Thank you,
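For the hardware questions, a quick local check helps: FlashAttention 2 requires an Ampere-or-newer GPU (compute capability 8.0+), while the T4 is 7.5 and has no native bfloat16 support:

    import torch

    print(torch.cuda.get_device_capability())  # T4 reports (7, 5); flash-attn 2 needs (8, 0) or higher
    print(torch.cuda.is_bf16_supported())      # False on T4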

Memory usage?

Hi everyone,

I was trying to run the example scripts/score.py file and hit this error: "torch.cuda.OutOfMemoryError: CUDA out of memory. Tried to allocate 96.00 MiB."
I'm using a desktop with an RTX 3070 (8 GB).
Do you think GPU memory is the issue?

Error starting model on HuggingFace inference endpoint

Endpoint encountered an error.
You can try restarting it using the "pause" button above. Check [logs](https://ui.endpoints.huggingface.co/artificial-paul/endpoints/evo-1-131k-base-artificial/logs) for more details.
Server message: Endpoint failed to start.

2024-02-28 05:58:09,130 | INFO | No custom pipeline found at /repository/handler.py
2024-02-28 05:58:09,130 | INFO | Using device GPU
Loading /repository requires to execute some code in that repo, you can inspect the content of the repository at https://hf.co//repository. You can dismiss this prompt by passing `trust_remote_code=True`.
Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 705, in lifespan
    async with self.lifespan_context(app) as maybe_state:
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 584, in __aenter__
    await self._router.startup()
  File "/opt/conda/lib/python3.9/site-packages/starlette/routing.py", line 682, in startup
    await handler()
  File "/app/webservice_starlette.py", line 57, in some_startup_task
    inference_handler = get_inference_handler_either_custom_or_default_handler(HF_MODEL_DIR, task=HF_TASK)
  File "/app/huggingface_inference_toolkit/handler.py", line 45, in get_inference_handler_either_custom_or_default_handler
    return HuggingFaceHandler(model_dir=model_dir, task=task)
  File "/app/huggingface_inference_toolkit/handler.py", line 17, in __init__
    self.pipeline = get_pipeline(model_dir=model_dir, task=task)
  File "/app/huggingface_inference_toolkit/utils.py", line 261, in get_pipeline
    hf_******** = pipeline(task=task, model=model_dir, device=device, **kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/pipelines/__init__.py", line 705, in pipeline
    config = AutoConfig.from_pretrained(model, _from_pipeline=task, **hub_kwargs, **model_kwargs)
  File "/opt/conda/lib/python3.9/site-packages/transformers/models/auto/configuration_auto.py", line 986, in from_pretrained
    trust_remote_code = resolve_trust_remote_code(
  File "/opt/conda/lib/python3.9/site-packages/transformers/dynamic_module_utils.py", line 538, in resolve_trust_remote_code
    answer = input(
EOFError: EOF when reading a line
Application startup failed. Exiting. Do you accept? [y/N]

TORCH_HOME not used when downloading models

Models are downloaded to $HOME/.cache by default. Is there a way to change this? Setting TORCH_HOME does not work.

We will deploy Evo on a cluster, and having each user download the models would be a waste of space.
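The checkpoints are fetched through the Hugging Face Hub rather than torch.hub, so the Hub cache variables apply instead of TORCH_HOME. For example (set before importing transformers; '/shared/hf-cache' is a placeholder path):

    import os
    os.environ['HF_HOME'] = '/shared/hf-cache'  # or HF_HUB_CACHE for just the downloads

    from transformers import AutoModelForCausalLM  # import only after setting the variable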

RNA sequence

First of all, thank you for this excellent work! I would like to ask whether this model is suitable for RNA sequences.

Inquiry Regarding Local Model Usage and Importing Model Files

Due to connectivity issues with my server, I cannot access the Hugging Face Hub directly. As a result, I ran into difficulties when importing models via the standard methods, such as Evo('evo-1-131k-base'). Even when attempting to load the models from a local path, I have not been successful.

Could you please provide guidance on how to import them successfully in a local environment?

Thank you very much for your attention to this matter. I am eager to receive your guidance and assistance.
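A hedged sketch for fully offline loading, assuming the model repository was downloaded in full on a connected machine (including the large weight files; see the git-lfs caveat in the firewalled-environment issue below):

    from transformers import AutoConfig, AutoModelForCausalLM

    local_dir = '/path/to/evo-1-131k-base'  # hypothetical local snapshot of the HF repo
    config = AutoConfig.from_pretrained(local_dir, trust_remote_code=True, local_files_only=True)
    model = AutoModelForCausalLM.from_pretrained(
        local_dir,
        config=config,
        trust_remote_code=True,
        local_files_only=True,
    )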

Pretrain

When will the pretraining data be available?

Extracting EVO representations rather than logits

Hi, thanks for your amazing work!

How can I extract representations rather than logits from the model?

I am using the Hugging Face version, and I see the model returns 'logits' and 'past_key_values'. Could you please explain what is in 'past_key_values' and whether any of it can be used as a sequence representation? Or maybe you can suggest other ways to access the model's representations?

NameError: name 'MHA' is not defined

Trying to load the model from Hugging Face across several environments (Jupyter Notebook, locally on macOS, all Python 3.10 with the latest version of transformers) yields the error:

self.inner_mha_cls = MHA(
NameError: name 'MHA' is not defined

The code I'm using is the sample code from your HuggingFace example!
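MHA is imported from flash-attn inside the remote StripedHyena code (the import path is visible in other tracebacks in this list), so the NameError usually means flash-attn is missing or failed to import; note that flash-attn needs an NVIDIA CUDA toolchain and does not install on macOS. A direct check:

    # If this import fails, the remote code falls back to MHA = None
    # and the NameError follows.
    from flash_attn.modules.mha import MHA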

Max Seq length for inference

May I ask what the proper range of input sequence lengths is for inference with the evo-1-131k-base model?
I tried using a single A100 and got CUDA out of memory when inputting a single sequence longer than 1000 tokens.
Thank you!

Scoring DNA with lineages

Hi,

First of all, great work with Evo and for sharing the models.

Since you can prompt the model with lineages, I am wondering whether it is possible to calculate scores or log-likelihoods for a DNA sequence given a specific lineage. For example, would a prompt in the following style work?

|d__Bacteria;p__Firmicutes;c__Bacilli;o__Bacillales;f__Bacillaceae;g__Bacillus;s__Bacillus subtilis | ATGA..<DNA>

Many thanks

Does anyone have an idea about issues such as "'MHA' is not defined"?

I guess flash_attn has some problems. When I load models from Hugging Face with evo_model = Evo('evo-1-8k-base'), I get either "self.inner_mha_cls = MHA( ... NameError: name 'MHA' is not defined" or "assert RotaryEmbedding is not None, 'rotary_emb is not installed' ... AssertionError: rotary_emb is not installed".
My environment: flash_attn 2.5.6; CUDA compilation tools release 11.6, V11.6.124; PyTorch 1.13.1.
Does anyone have a solution?

Together model not found

Hi!

Excellent work and papers!

But when I tried to use the Together API, following the example in the README documentation, I got this error:

openai.NotFoundError: Error code: 404 - {'error': {'message': 'Unable to access model togethercomputer/evo-1-131k-base. Please visit https://api.together.xyz to see the list of supported models or contact the owner to request access.', 'type': 'invalid_request_error', 'param': None, 'code': 'model_not_found'}}

Message information:

chat_completion = client.chat.completions.create(
    messages=[
        {"role": "system", "content": ""},
        {
            "role": "user",
            "content": "ACGT",  # Prompt the model with a sequence.
        },
    ],
    model="togethercomputer/evo-1-131k-base",
    max_tokens=128,  # Sample some number of new tokens.
    logprobs=True,
)

Is there something missing?

Thanks!

Prompt for virus and plasmid generation

The example notebook shows the use of a Greengenes-style lineage as a prompt for generation. However, it does not address how to prompt for non-chromosomal sequences. For viruses, was the ICTV taxonomy from IMG/VR used during training? Additionally, concerning plasmids, if their host lineage was used, how can I direct Evo to generate a plasmid?

Running the model in a firewalled environment

Hi! Thank you very much for the model, the pre-print looks fantastic!

I'd like to use your Hugging Face model on our A100 GPUs, but unfortunately we have to work in a firewalled environment, which complicates everything. I'm allowed to access the internet through the "submit" machines, which do not have GPUs, and then I have to switch to offline GPU machines to run the model.

For now, I have attempted to split the process in two.
First, use the submit machine to download the model locally:

git clone [email protected]:togethercomputer/evo-1-131k-base
Cloning into 'evo-1-131k-base'...
remote: Enumerating objects: 134, done.
remote: Counting objects: 100% (131/131), done.
remote: Compressing objects: 100% (130/130), done.
remote: Total 134 (delta 54), reused 0 (delta 0), pack-reused 3
Receiving objects: 100% (134/134), 58.39 KiB | 4.49 MiB/s, done.
Resolving deltas: 100% (54/54), done.

Then I switch to my GPU machine and load the model with something like:

# load_evo_gpu.py
from transformers import AutoConfig, AutoModelForCausalLM
import torch

if torch.cuda.is_available():
    print('Connected to a GPU\n')
else:
    print('Not connected to a GPU\n')

from psutil import virtual_memory
ram_gb = virtual_memory().total / 1e9
print('Your runtime has {:.1f} gigabytes of available RAM\n'.format(ram_gb))

model_name = 'evo-1-131k-base'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    local_files_only=True,
)

but this time it fails miserably and spits out this error:

Loading checkpoint shards:   0%|                                           | 0/3 [00:00<?, ?it/s]
Traceback (most recent call last):
  File "/pasteur/zeus/projets/p01/MDM/Users/ernest/load_evo_gpu.py", line 15, in <module>
    model = AutoModelForCausalLM.from_pretrained(
            ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pasteur/appa/homes/ermordre/miniconda3/envs/evo/lib/python3.11/site-packages/transformers/models/auto/auto_factory.py", line 556, in from_pretrained
    return model_class.from_pretrained(
           ^^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pasteur/appa/homes/ermordre/miniconda3/envs/evo/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3502, in from_pretrained
    ) = cls._load_pretrained_model(
        ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pasteur/appa/homes/ermordre/miniconda3/envs/evo/lib/python3.11/site-packages/transformers/modeling_utils.py", line 3903, in _load_pretrained_model
    state_dict = load_state_dict(shard_file)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^^
  File "/pasteur/appa/homes/ermordre/miniconda3/envs/evo/lib/python3.11/site-packages/transformers/modeling_utils.py", line 505, in load_state_dict
    with safe_open(checkpoint_file, framework="pt") as f:
         ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
safetensors_rust.SafetensorError: Error while deserializing header: HeaderTooLarge

Any idea why this might be the case?
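A hedged reading: HeaderTooLarge on a git-cloned checkpoint usually means the *.safetensors files are git-lfs pointer stubs rather than the real weights, and the clone above transferred only 58.39 KiB, which supports that. Running git lfs pull inside the repo on the submit machine should fetch the real shards; a quick size check:

    import os

    repo = 'evo-1-131k-base'
    for name in os.listdir(repo):
        if name.endswith('.safetensors'):
            # Real shards are gigabytes; LFS pointer files are ~130 bytes.
            print(name, os.path.getsize(os.path.join(repo, name)))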

AssertionError: rotary_emb is not installed

I tried the following code:

    from transformers import AutoConfig, AutoModelForCausalLM

    model_name = 'togethercomputer/evo-1-8k-base'

    model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)
    model_config.use_cache = True

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        config=model_config,
        trust_remote_code=True,
    )

It raises the error: AssertionError: rotary_emb is not installed

However, rotary_emb v0.1 is installed, and flash-attention is also installed.
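The assertion is raised inside flash-attn when its own rotary-embedding import fails, so a separately installed rotary_emb package is not necessarily what it looks for. Reproducing the exact import flash-attn performs surfaces the real missing dependency:

    # If this raises, its traceback names the actual unmet dependency
    # (a compiled extension or triton, depending on the flash-attn version).
    from flash_attn.layers.rotary import RotaryEmbedding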

How can I use multiple graphics cards to fine-tune this model?

Below is the script I am using:

    import os
    GPU_NUMBER = [1, 2, 6, 8, 9]
    os.environ["CUDA_VISIBLE_DEVICES"] = ",".join([str(s) for s in GPU_NUMBER])
    import torch
    from torch.utils.data import DataLoader
    from transformers import AutoConfig, AutoModelForCausalLM, TrainingArguments, Trainer
    from transformers import AutoTokenizer, DataCollatorForLanguageModeling
    from datasets import Dataset, load_dataset, DatasetDict
    from Bio import SeqIO
    from accelerate import Accelerator

    accelerator = Accelerator()

    model_name = 'togethercomputer/evo-1-8k-base'

    model_config = AutoConfig.from_pretrained(model_name, trust_remote_code=True)

    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        config=model_config,
        trust_remote_code=True,
    )

    model = accelerator.prepare(model)

    tokenizer = AutoTokenizer.from_pretrained("togethercomputer/evo-1-8k-base",
                                              trust_remote_code=True)
    tokenizer.pad_token = "N"

    train_ds = load_dataset('csv', data_files='all_fastas.csv')

    def preprocess_function(sample):
        return tokenizer(sample['Seq'], padding="max_length", truncation=True, max_length=500)

    tokenized_ds = train_ds.map(
        preprocess_function,
        batched=True,
        num_proc=4,
    )

    train_testvalid = tokenized_ds['train'].train_test_split(test_size=0.2)

    data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

    training_args = TrainingArguments(
        output_dir="./results",
        evaluation_strategy="epoch",
        learning_rate=2e-5,
        weight_decay=0.01,
        gradient_accumulation_steps=2,
        per_device_train_batch_size=1,
        warmup_steps=10,
        max_steps=10000,  # example only
        logging_steps=10,
        eval_steps=100,
        bf16=True,
    )

    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_testvalid["train"],
        eval_dataset=train_testvalid["test"],
        data_collator=data_collator,
    )

    trainer.train()

    print(trainer.evaluate())

    trainer.save_model('./finetuned.model_all_par')

And below is the error message:

Traceback (most recent call last):
  File "/home/qj/python-project/evo/weitiao.py", line 76, in <module>
    trainer.train()
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/transformers/trainer.py", line 1885, in train
    return inner_training_loop(
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/transformers/trainer.py", line 2216, in _inner_training_loop
    tr_loss_step = self.training_step(model, inputs)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/transformers/trainer.py", line 3238, in training_step
    loss = self.compute_loss(model, inputs)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/transformers/trainer.py", line 3264, in compute_loss
    outputs = model(**inputs)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 171, in forward
    outputs = self.parallel_apply(replicas, inputs, kwargs)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/parallel/data_parallel.py", line 181, in parallel_apply
    return parallel_apply(replicas, inputs, kwargs, self.device_ids[:len(replicas)])
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 89, in parallel_apply
    output.reraise()
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/_utils.py", line 644, in reraise
    raise exception
RuntimeError: Caught RuntimeError in replica 1 on device 1.
Original Traceback (most recent call last):
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/parallel/parallel_apply.py", line 64, in _worker
    output = module(*input, **kwargs)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qj/.cache/huggingface/modules/transformers_modules/togethercomputer/evo-1-131k-base/567369e9825aa08b3de4b122fca34fac6a890602/modeling_hyena.py", line 109, in forward
    logits, past_key_values = self.backbone(
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qj/.cache/huggingface/modules/transformers_modules/togethercomputer/evo-1-131k-base/567369e9825aa08b3de4b122fca34fac6a890602/model.py", line 363, in forward
    x, inference_params_dict_out = self.stateless_forward(x, padding_mask=padding_mask)
  File "/home/qj/.cache/huggingface/modules/transformers_modules/togethercomputer/evo-1-131k-base/567369e9825aa08b3de4b122fca34fac6a890602/model.py", line 382, in stateless_forward
    x, _ = block(x, inference_params=None, padding_mask=padding_mask)
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qj/.cache/huggingface/modules/transformers_modules/togethercomputer/evo-1-131k-base/567369e9825aa08b3de4b122fca34fac6a890602/model.py", line 304, in forward
    z = self.proj_norm_fn(u)
  File "/home/qj/.cache/huggingface/modules/transformers_modules/togethercomputer/evo-1-131k-base/567369e9825aa08b3de4b122fca34fac6a890602/model.py", line 298, in proj_norm
    return self.projections(self.pre_norm(x))
  File "/home/qj/anaconda3/envs/new_evo/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1501, in _call_impl
    return forward_call(*args, **kwargs)
  File "/home/qj/.cache/huggingface/modules/transformers_modules/togethercomputer/evo-1-131k-base/567369e9825aa08b3de4b122fca34fac6a890602/layers.py", line 40, in forward
    return self.scale * y
RuntimeError: Expected all tensors to be on the same device, but found at least two devices, cuda:0 and cuda:1!
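A hedged reading of the trace: the data_parallel.py frames show the Trainer fell back to torch.nn.DataParallel, whose module replication leaves StripedHyena's custom scale parameters on cuda:0. Launching one process per GPU makes the Trainer use DistributedDataParallel instead, with no replication:

    # Hedged sketch: start one process per visible GPU, e.g.
    #   torchrun --nproc_per_node=5 weitiao.py
    # and drop the manual `model = accelerator.prepare(model)` line so each
    # rank loads its own copy of the model on its own device.
    import torch

    if torch.distributed.is_initialized():
        print('rank', torch.distributed.get_rank(), 'on cuda device', torch.cuda.current_device())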
