
uni-fold-jax's People

Contributors

dependabot[bot], ziyaoli


uni-fold-jax's Issues

Questions about running commands

Hi Ziyao,
I found a problem with the running commands in Section 4.1 and Section 4.2, though I am not sure whether what I found is actually an error.
In addition, I would like to ask which graphics card is needed for training. Can an A100 with CUDA 11.4 be used? Thank you.
Finally, I would like to thank dptech for its contributions to this field.

Batch size

Hi, I have a question about setting the batch_size. I tried to change the batch_size where sample is called; it seems batch_size=1 is the default setting, but changing it reports an error like:

     rng, prot_idx = self.sample(rng,4)
ValueError: too many values to unpack (expected 2)

Could you please tell me how I can change the batch_size? Thank you!
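
My guess at what is happening (a minimal sketch; the sample signature here is an assumption, not the actual implementation): if sample returns the new rng followed by one index per batch element, unpacking into exactly two names breaks as soon as batch_size > 1.

  # Hypothetical stand-in for the trainer's sample() method.
  def sample(rng, batch_size=1):
      new_rng = rng + 1                      # placeholder for jax.random.split
      return (new_rng, *range(batch_size))   # rng followed by batch_size indices

  rng, prot_idx = sample(0, 1)   # fine: exactly two values to unpack
  rng, prot_idx = sample(0, 4)   # ValueError: too many values to unpack (expected 2)

If that is the shape of the return value, a starred target (rng, *prot_idxs = self.sample(rng, 4)) would collect all sampled indices.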

Does this use distillation training?

As mentioned in the AlphaFold2 paper, they use a final dataset of 355,993 sequences and apply a self-distillation procedure on unlabelled protein sequences.
Does this implementation use self-distillation, or just straightforward supervised training?

Results vary across different GPUs

Dear users,

We found that in a small number of test cases (e.g. T1038 in CASP14), using different GPUs may lead to different outputs, sometimes with large variances in the predicted PDB structures and their evaluation metrics. The issue is potentially caused by the kernel-fusion strategies of jax.jit, but we are not sure yet.

We apologize for the inconvenience and are working on locating the problem.

As a temporary workaround, we suggest using an NVIDIA V100 / A100 / 3090Ti for model inference on local machines.
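
To help test the jax.jit hypothesis locally, JAX can run with jit disabled, which removes XLA's fusion decisions at a large speed cost (a debugging sketch only; the predict call below is illustrative):

  import jax

  # Option 1: disable jit globally for the whole run (very slow).
  jax.config.update("jax_disable_jit", True)

  # Option 2: disable jit only around the inference call.
  with jax.disable_jit():
      prediction = model_runner.predict(processed_features)  # illustrative call

If outputs then agree across GPUs, differences in XLA fusion are the likely cause.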

About model parameters

Hi Ziyao,
I see that only AlphaFold model parameters are provided in the scripts folder.
Could the Uni-Fold model parameters be released as well? Thanks.

Release Preprocessed Data/Trained Model/Eval Set

Quick question: do you release the preprocessed data, trained model, or eval set? We are trying to run Uni-Fold/OpenFold, but each protein is taking us around 3 minutes, and we want to process as many of them as possible.

Update to newer OpenMM

install_dependencies.sh pins OpenMM to 7.5.1. That is an old release that is no longer supported. Could it be updated to the latest release, or alternatively could the pin be removed? I don't think any code changes are needed, although there are some deprecated module names that could be updated to avoid a deprecation warning (see the example below). The patch in openmm.patch also isn't needed anymore; that change has been merged upstream.
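
For reference, the deprecation in question is the simtk namespace rename: since OpenMM 7.6 the package is imported as openmm directly, and the old simtk paths only remain as deprecated aliases.

  # Deprecated import path (emits a warning on OpenMM >= 7.6):
  from simtk.openmm import app
  from simtk import unit

  # Current import path:
  from openmm import app
  from openmm import unit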

Issues in installation

While installing the current Uni-Fold, I ran into a package conflict.

The conflict is caused by:

    The user requested numpy==1.21.0
    biopython 1.79 depends on numpy
    chex 0.0.7 depends on numpy>=1.18.0
    dm-haiku 0.0.4 depends on numpy>=1.18.0
    jax 0.2.14 depends on numpy>=1.12
    jmp 0.0.2 depends on numpy>=1.19.5
    mpi4jax 0.3.2 depends on numpy
    scipy 1.7.0 depends on numpy<1.23.0 and >=1.16.5
    tensorflow 2.5.3 depends on numpy~=1.19.2

Changing numpy to 1.19.5 in requirements.txt fixes the conflict: tensorflow 2.5.3's numpy~=1.19.2 constraint means >=1.19.2,<1.20, which excludes 1.21.0, while 1.19.5 satisfies every requirement listed above. I hope the numpy version will be changed in the next release of Uni-Fold so that it installs cleanly. Thanks.
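
The corresponding one-line change to requirements.txt (shown as a diff for illustration):

  -numpy==1.21.0
  +numpy==1.19.5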

Report a typo in run_from_fasta.py

In run_from_fasta.py:241, fasta_paths is a list, while predict_from_fasta() takes fasta_path as a string. I suppose that in the for-loop, fasta_path should be passed in rather than fasta_paths, i.e. the call should change from

  if FLAGS.fasta_paths:    # use protein id 'prot_{idx}' of given list.
    protein_dict = {
        f'prot_{idx:05d}': p for idx, p in enumerate(FLAGS.fasta_paths)}
  else:                     # use basename of sub-directories as protein ids.
    fasta_paths = [
        p for p in glob.glob(FLAGS.fasta_dir + '*') if p.endswith('.fasta')]
    protein_dict = {
        pathlib.Path(p).stem: p for p in fasta_paths}

  for id, fasta_path in protein_dict.items():
    try:
      predict_from_fasta(
          fasta_path=fasta_paths,
          name=id,
          output_dir=FLAGS.output_dir,
          data_pipeline=data_pipeline,
          model_runners=model_runners,
          amber_relaxer=amber_relaxer,
          random_seed=random_seed,
          benchmark=FLAGS.benchmark,
          dump_pickle=FLAGS.dump_pickle,
          timings=None)
    except Exception as ex:
      logging.warning(f"failed to predict structure for protein {id} with "
                      f"fasta path {fasta_path}. Error message: \n{ex}")

to

  if FLAGS.fasta_paths:    # use protein id 'prot_{idx}' of given list.
    protein_dict = {
        f'prot_{idx:05d}': p for idx, p in enumerate(FLAGS.fasta_paths)}
  else:                     # use basename of sub-directories as protein ids.
    fasta_paths = [
        p for p in glob.glob(FLAGS.fasta_dir + '*') if p.endswith('.fasta')]
    protein_dict = {
        pathlib.Path(p).stem: p for p in fasta_paths}

  for id, fasta_path in protein_dict.items():
    try:
      predict_from_fasta(
          fasta_path=fasta_path,
          name=id,
          output_dir=FLAGS.output_dir,
          data_pipeline=data_pipeline,
          model_runners=model_runners,
          amber_relaxer=amber_relaxer,
          random_seed=random_seed,
          benchmark=FLAGS.benchmark,
          dump_pickle=FLAGS.dump_pickle,
          timings=None)
    except Exception as ex:
      logging.warning(f"failed to predict structure for protein {id} with "
                      f"fasta path {fasta_path}. Error message: \n{ex}")

About training data

Hi Ziyao,
Could the training data be released? Either the .pkl files or the raw FASTA files would be fine. Thanks.

How to reproduce the complete Uni-Fold model at full scale?

The training configuration file unifold/train/train_config.py allows specifying the features and labels directories for training and evaluation:

"data": {
  "train": {
    "features_dir": "where/training/protein/features/are/stored/",
    "mmcif_dir": "where/training/mmcif/files/are/stored/",
    "sample_weights": "which/specifies/proteins/for/training.json"
  },
  "eval": {
    "features_dir": "where/validation/protein/features/are/stored/",
    "mmcif_dir": "where/validation/mmcif/files/are/stored/",
    "sample_weights": "which/specifies/proteins/for/training.json"
  }
}
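
For concreteness, I assume sample_weights points to a JSON object mapping protein chain IDs to sampling weights, along these lines (the first ID mirrors example_data; the second ID and both weights are made-up values):

  {
    "1ak0_1_A": 1.0,
    "2xyz_1_B": 0.5
  }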

Q1:
For training, what is the complete list of PDB IDs used to reproduce the exact AlphaFold2 model or the full-scale Uni-Fold model?
The example_data directory provides only one PDB ID (i.e. 1ak0) for demonstration purposes: features/1ak0_1_A/features.pkl and mmcif/1ak0.cif. I'm aware that the complete set of CIF files for labels can be obtained from data_dir/pdb_mmcif/mmcif_files, which contains 181,295 CIF files. But how can I obtain the corresponding complete list of FASTA files for generating features (e.g. the fasta/1ak0_1_A.fasta files)?

Q2:
For evaluation, which PDB IDs are used? I know they are from CASP14, but can you give the exact list in the repository?

Q3:
Is it necessary to remove the evaluation PDB IDs from the training set, so that the two datasets do not intersect?

Train in JAX

As far as I can see, your repo is the only one that provides a way to train AlphaFold without moving to another framework (OpenFold, FastFold, etc. all use PyTorch). Will there still be a way to train the official AlphaFold in JAX if you finally decide to move to PyTorch?

Load Params Error

Thanks for your nice repo, but when I run run_from_pkl.py, I get the error below. What should I do?

Traceback (most recent call last):
  File "run_from_pkl.py", line 165, in <module>
    app.run(main)
  File "/home/panfulu/anaconda3/envs/unifold/lib/python3.8/site-packages/absl/app.py", line 312, in run
    _run_main(main, args)
  File "/home/panfulu/anaconda3/envs/unifold/lib/python3.8/site-packages/absl/app.py", line 258, in _run_main
    sys.exit(main(argv))
  File "run_from_pkl.py", line 106, in main
    model_params = load_params("/home/panfulu/data/alphafold/params/params_model_1.npz")
  File "/home/panfulu/project/Uni-Fold/unifold/train/utils.py", line 194, in load_params
    params = load_params_from_npz(model_path)
  File "/home/panfulu/project/Uni-Fold/unifold/train/utils.py", line 186, in load_params_from_npz
    return params['arr_0'].flat[0]
  File "/home/panfulu/anaconda3/envs/unifold/lib/python3.8/site-packages/numpy/lib/npyio.py", line 260, in __getitem__
    raise KeyError("%s is not a file in the archive" % key)
KeyError: 'arr_0 is not a file in the archive'
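
My reading of the mismatch (inferred from the traceback and the layout of the official release, so treat it as an assumption): load_params_from_npz expects an .npz saved as one pickled nested dict, which numpy stores under the default key 'arr_0', whereas the official params_model_1.npz stores one flat array per parameter name, so there is no 'arr_0' entry.

  import numpy as np

  # Uni-Fold-style checkpoint: a single pickled nested dict, i.e. the result
  # of np.savez(path, nested_params_dict); numpy names it 'arr_0'.
  ckpt = np.load("unifold_params.npz", allow_pickle=True)   # hypothetical file
  params = ckpt["arr_0"].flat[0]                            # nested dict of arrays

  # Official AlphaFold release: one flat array per haiku parameter name.
  af_ckpt = np.load("params_model_1.npz")
  print(list(af_ckpt.keys())[:3])   # e.g. 'alphafold/alphafold_iteration/...'

If that is right, the official parameters need to be converted to the nested layout (or loaded through whatever AlphaFold-format loader the repo provides) before load_params can read them.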

Install Error

ERROR: Could not find a version that satisfies the requirement mpi4jax==0.3.2
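
Some generic diagnostics that may help narrow this down (standard pip/conda usage, not repo-specific; note that mpi4jax builds from source and needs an MPI toolchain):

  # Show which mpi4jax versions pip can see for this interpreter (pip >= 21.2):
  pip index versions mpi4jax

  # Make sure an MPI toolchain is available before building, e.g. via conda-forge:
  conda install -c conda-forge openmpi mpi4py
  pip install --no-cache-dir mpi4jax==0.3.2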

How to generate features.pkl when using mmseqs2

Hi Ziyao, when I used MMseqs2 to generate the MSA, I got six files from searching two databases: .a3m, .dbtype, and .index files. I noticed that inference needs a features.pkl. Do you know how to convert the files above into features.pkl?
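
A rough sketch of the conversion, assuming uni-fold-jax keeps the AlphaFold v2.0-era data-pipeline helpers (the module path and signatures below follow that era's alphafold.data API and may differ in this repo; template features are omitted):

  import pickle

  from alphafold.data import parsers, pipeline   # module path is an assumption

  # Parse the MMseqs2 alignment; the v2.0-era parse_a3m returns the aligned
  # sequences together with their deletion matrix.
  with open("query.a3m") as f:
      msa, deletion_matrix = parsers.parse_a3m(f.read())

  query_sequence = msa[0]
  features = {
      **pipeline.make_sequence_features(
          sequence=query_sequence, description="query",
          num_res=len(query_sequence)),
      **pipeline.make_msa_features(
          msas=(msa,), deletion_matrices=(deletion_matrix,)),
  }

  with open("features.pkl", "wb") as f:
      pickle.dump(features, f, protocol=4)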
