chemfiles / chemfiles

Library for reading and writing chemistry files

Home Page: http://chemfiles.org

License: BSD 3-Clause "New" or "Revised" License

CMake 2.46% C++ 88.37% Python 1.21% Shell 0.09% C 7.87%
computational-chemistry library files compchem cheminformatics chemistry hacktoberfest

chemfiles's Issues

Test chemfiles on MSYS

MSYS is a Unix-like environment on Windows. MSYS2 is installed on AppVeyor, and chemfiles should be tested there.

fresh install does not include config.hpp and exports.hpp

I recently cloned from github repo to ~/chemfiles and followed installation instructions, using these options for the cmake command:

 cmake -DCMAKE_BUILD_TYPE=debug -DCHFL_BUILD_TESTS=ON -DCHFL_ENABLE_NETCDF=ON ..

But for some reason both these files weren't present in the ~/chemfiles/include/chemfiles folder, nor in the installed folder /usr/local/include/chemfiles, so when I tried to compile my code (including the flag -I/usr/local/include), the following error appeared:

/usr/local/include/chemfiles.hpp:11:32: fatal error: chemfiles/config.hpp: No such file or directory

The error was fixed by copying these 2 files to /usr/local/include/chemfiles.

Predefined atom groups: water, protein, ...

Using a selection like water or name Na could be nice. This requires predefining a group of atoms as "water".

I do not know how to do this; it may require detecting/classifying the molecules in the system using some kind of graph algorithm.

Add functions for working with PBC

Functions to get the distance/angle/dihedral between particles could be nice to have. Something like

double Frame::distance(size_t i, size_t j);
double Frame::angle(size_t i, size_t j, size_t k);
double Frame::dihedral(size_t i, size_t j, size_t k, size_t m);

int chfl_frame_distance(const CHFL_FRAME* frame, size_t i, size_t j, double* r);
int chfl_frame_angle(const CHFL_FRAME* frame, size_t i, size_t j, size_t k, double* theta);
int chfl_frame_dihedral(const CHFL_FRAME* frame, size_t i, size_t j, size_t k, size_t m, double* phi);

The other solution is to have a function that wraps a vector in the unit cell, and then let users write these functions themselves.
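As an illustration of the second option, here is a minimal sketch of a wrapping function and a distance built on top of it. It assumes an orthorhombic cell (no tilt) described by its three side lengths; the names Vector3D, wrap and pbc_distance are hypothetical and not part of the chemfiles API.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>

using Vector3D = std::array<double, 3>;

// Wrap a displacement vector into its minimum image in an orthorhombic cell.
// Assumption: the cell has no tilt and all side lengths are positive.
Vector3D wrap(const Vector3D& v, const Vector3D& lengths) {
    Vector3D out{};
    for (std::size_t d = 0; d < 3; ++d) {
        assert(lengths[d] > 0.0);
        out[d] = v[d] - lengths[d] * std::round(v[d] / lengths[d]);
    }
    return out;
}

// Distance between two positions under the minimum image convention.
double pbc_distance(const Vector3D& a, const Vector3D& b, const Vector3D& lengths) {
    Vector3D delta{};
    for (std::size_t d = 0; d < 3; ++d) {
        delta[d] = b[d] - a[d];
    }
    const Vector3D w = wrap(delta, lengths);
    return std::sqrt(w[0] * w[0] + w[1] * w[1] + w[2] * w[2]);
}
```

Angles and dihedrals could be built the same way, by wrapping each bond vector before taking dot and cross products.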

BUS_ERROR on OS X with OSX_DEPLOYMENT_TARGET=10.9 and NetCDF

This is weird. On OS X, when compiling the code in release mode with this exact set of flags, the xyz test runs into a bus error:

  • CMAKE_OSX_DEPLOYMENT_TARGET=10.9
  • CMAKE_C_FLAGS="-mmacosx-version-min=10.9"
  • CHFL_ENABLE_NETCDF=ON

The full set of commands to run to reproduce in a fresh build is:

cmake -DCMAKE_OSX_DEPLOYMENT_TARGET=10.9 -DCMAKE_C_FLAGS="-mmacosx-version-min=10.9" -DCHFL_BUILD_TESTS=ON -DCHFL_ENABLE_NETCDF=ON ../.. 
make
./tests/xyz

The Valgrind report is here, and an LLDB session looks like this:

(lldb) target create "tests/xyz"
Current executable set to 'tests/xyz' (x86_64).
(lldb) run
Process 24586 launched: 'tests/xyz' (x86_64)
Process 24586 stopped
* thread #1: tid = 0xcbcd, 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x1000b3130)
    frame #0: 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const
xyz`boost::system::system_category()::system_category_const:
->  0x1000b3130 <+0>: lock
    0x1000b3131 <+1>: repne
    0x1000b3132 <+2>: orb    (%rax), %al
    0x1000b3134 <+4>: addl   %eax, (%rax)
(lldb) bt
* thread #1: tid = 0xcbcd, 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x1000b3130)
  * frame #0: 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const
(lldb)

Since I do not get any backtrace in LLDB, and valgrind reports jumps on uninitialized values, this might be due to a stack corruption.

This was spotted when investigating the need for export MACOSX_DEPLOYMENT_TARGET="" in conda builds for conda-forge/staged-recipes#1571.

case insensitive search for element(type) matching in PDB records

In PDB ATOM records, elements are always in upper case. I haven't noted this before, since biological molecules are almost exclusively formed by CHONP atoms.

Atom::Atom(std::string name, std::string type):
    name_(std::move(name)), type_(std::move(type)) {
    auto periodic = PERIODIC_INFORMATION.find(type_);
    if (periodic != PERIODIC_INFORMATION.end()) {
        mass_ = periodic->second.mass;
    }
}

Maybe the strings in PERIODIC_INFORMATION should be lower case, and std::tolower() should be applied to the query string before any search.

Please confirm and I'll apply this minor change myself.
Cheers!
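A minimal sketch of the proposed lookup, assuming a table keyed by lower-case symbols. The MASSES map below is a hypothetical stand-in for PERIODIC_INFORMATION with only a few entries and approximate masses; to_lower and mass_of are illustrative names.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for PERIODIC_INFORMATION, keyed by lower-case symbols.
static const std::unordered_map<std::string, double> MASSES = {
    {"h", 1.008}, {"c", 12.011}, {"na", 22.990}, {"zn", 65.38},
};

// Return a lower-case copy of the string (ASCII only).
std::string to_lower(std::string s) {
    std::transform(s.begin(), s.end(), s.begin(), [](unsigned char c) {
        return static_cast<char>(std::tolower(c));
    });
    return s;
}

// Case-insensitive mass lookup; returns 0.0 for unknown types.
double mass_of(const std::string& type) {
    const auto it = MASSES.find(to_lower(type));
    return it != MASSES.end() ? it->second : 0.0;
}
```

With this, "NA" from a PDB ATOM record and "Na" from another format resolve to the same entry.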

Lazy loading of Trajectory::nsteps

Getting the number of steps in a trajectory can be done at no cost for some formats (NetCDF), but might be very costly for others (XYZ). For a big XYZ trajectory, just getting the number of steps can take 5s.

We always load the number of steps when opening a trajectory, which imposes a big cost just to open a file.

We should load this number of steps lazily (only when requested) and not systematically. To still be able to indicate when a trajectory is fully read, we can add bool Format::is_done() and use it to implement bool Trajectory::is_done() and chfl_trajectory_is_done.
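The lazy loading could be sketched as a small cache around the expensive count. LazyStepCount is a hypothetical helper, not an existing chemfiles class; the counting function it wraps stands in for whatever the format uses to scan the file.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Caches the result of an expensive step count, computed on first request only.
class LazyStepCount {
public:
    explicit LazyStepCount(std::function<std::size_t()> counter)
        : counter_(std::move(counter)) {}

    // Runs the counter the first time, then returns the cached value.
    std::size_t get() {
        if (!known_) {
            value_ = counter_();
            known_ = true;
        }
        return value_;
    }

    // True once the count has been computed.
    bool is_known() const { return known_; }

private:
    std::function<std::size_t()> counter_;
    std::size_t value_ = 0;
    bool known_ = false;
};
```

Opening a file then costs nothing extra, and only callers that ask for the number of steps pay for the scan.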

Use constants for error handling in C FFI

The C FFI should #define constants for error codes, and use them instead of relying on arbitrary values.

This would allow us to:

  • check directly for error if (chfl_trajectory_open(...) != CHFL_SUCCESS);
  • add the failure causes in documentation;
  • propagate errors to the bindings. In the Rust one, we could remove a lot of Result using this;
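A minimal sketch of what such constants could look like. Only CHFL_SUCCESS is named in this issue; the other two constants, their values, and the example_open and example_strerror functions are hypothetical and chosen for illustration.

```cpp
// Status codes for the C interface. Only CHFL_SUCCESS comes from the issue
// text; the other names and all numeric values are assumptions.
#define CHFL_SUCCESS 0
#define CHFL_FILE_ERROR 1
#define CHFL_FORMAT_ERROR 2

// Hypothetical stand-in for an FFI function that reports its status.
extern "C" int example_open(const char* path) {
    if (path == nullptr || path[0] == '\0') {
        return CHFL_FILE_ERROR;
    }
    return CHFL_SUCCESS;
}

// Hypothetical helper mapping a status code to a message for documentation
// and for error propagation in the bindings.
extern "C" const char* example_strerror(int status) {
    switch (status) {
    case CHFL_SUCCESS:
        return "success";
    case CHFL_FILE_ERROR:
        return "file error";
    case CHFL_FORMAT_ERROR:
        return "format error";
    default:
        return "unknown error";
    }
}
```

Bindings can then translate each named code into their own error type instead of comparing against undocumented integers.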

Spatial zones for selections

Allow selecting atoms inside a sphere/cube/..., or within a given distance of another atom.

What is needed:

  • Decide on syntax;
  • Decide on implementation strategy inside the current implementation;
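Independently of the syntax, the geometric test for the sphere case could be sketched as follows. This ignores periodic boundary conditions, and the names Vector3D and within_sphere are hypothetical.

```cpp
#include <array>
#include <cstddef>
#include <vector>

using Vector3D = std::array<double, 3>;

// Return the indexes of all positions within `radius` of `center`.
// Assumption: no periodic boundary conditions are applied.
std::vector<std::size_t> within_sphere(const std::vector<Vector3D>& positions,
                                       const Vector3D& center, double radius) {
    std::vector<std::size_t> selected;
    const double r2 = radius * radius;
    for (std::size_t i = 0; i < positions.size(); ++i) {
        double d2 = 0.0;
        for (std::size_t d = 0; d < 3; ++d) {
            const double delta = positions[i][d] - center[d];
            d2 += delta * delta;
        }
        // Compare squared distances to avoid a square root per atom
        if (d2 <= r2) {
            selected.push_back(i);
        }
    }
    return selected;
}
```

Selecting atoms within a given distance of another atom is the same test with that atom's position as the center.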

Relicense the code under BSD?

I do not know if this is worth it, as the BSD license does not have that many advantages compared to the MPL. I am opening this to discuss: if you have an opinion or a requirement concerning Chemfiles licensing, please tell us!

Runtime initialization of Molfile path

It would be nice to set the path to the molfile plugins at runtime, and not only at compile time or through the CHRP_MOLFILES environment variable.

On Linux and OS X, we can get the path to the shared library libchemharp.so by using the following:

#include <dlfcn.h>
#include <iostream>

void foo() {
    Dl_info info;
    if (dladdr((void*)foo, &info)){
        std::cout << "foo is at " << info.dli_fname << std::endl;
    }
}

Then the path to molfile plugin can be detected from that.

Another option would be to use the static version of plugins.

Provide a view in the positions/velocities

The FFI functions chfl_frame_positions and chfl_frame_velocities should return a pointer to the Frame::positions and Frame::velocities data. This would avoid copying the data for big files, and remove the need to do

chfl_frame_positions(frame, positions, atoms);
positions[2][0] = 4.0;
chfl_frame_set_positions(frame, positions, atoms);

  • Use contiguous data in Array3D, instead of std::vector<Vector3D>;
  • Provide this data in the FFI.
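A minimal sketch of the view idea, assuming positions are stored as one flat, contiguous array of 3 * natoms doubles. ExampleFrame and example_frame_positions are hypothetical names, not the chemfiles API.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical frame storing positions contiguously as x0 y0 z0 x1 y1 z1 ...
class ExampleFrame {
public:
    explicit ExampleFrame(std::size_t natoms) : positions_(3 * natoms, 0.0) {}

    std::size_t natoms() const { return positions_.size() / 3; }
    double* positions() { return positions_.data(); }

private:
    std::vector<double> positions_;
};

// Hypothetical FFI-style accessor: hands out a pointer into the frame
// instead of copying. Returns 0 on success, 1 on a null argument.
extern "C" int example_frame_positions(ExampleFrame* frame, double** data,
                                       std::size_t* natoms) {
    if (frame == nullptr || data == nullptr || natoms == nullptr) {
        return 1;
    }
    *data = frame->positions();
    *natoms = frame->natoms();
    return 0;
}
```

Writing through the returned pointer modifies the frame directly, so no set_positions call is needed. The pointer would only stay valid as long as the frame is alive and not resized.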

Heuristic to guess residues in a topology

In the same spirit as the guess_bonds function, we could try to guess the residues in a topology. The simplest version could simply assume residue == molecule, and just follow the bond graph. A more elaborate version could try to match sub-graphs to known residues like amino acids.
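The simplest version amounts to finding the connected components of the bond graph. Here is a minimal sketch, assuming bonds are given as pairs of atom indexes; guess_residues is a hypothetical name, not an existing chemfiles function.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Give each atom the index of the connected component (molecule) it belongs
// to, by following bonds with a depth-first traversal.
std::vector<std::size_t> guess_residues(
    std::size_t natoms,
    const std::vector<std::pair<std::size_t, std::size_t>>& bonds) {
    std::vector<std::vector<std::size_t>> neighbors(natoms);
    for (const auto& bond : bonds) {
        neighbors[bond.first].push_back(bond.second);
        neighbors[bond.second].push_back(bond.first);
    }

    // natoms is never a valid residue index, so it can mark unvisited atoms
    const std::size_t unassigned = natoms;
    std::vector<std::size_t> residue(natoms, unassigned);
    std::size_t next = 0;
    for (std::size_t start = 0; start < natoms; ++start) {
        if (residue[start] != unassigned) {
            continue;
        }
        std::vector<std::size_t> stack;
        stack.push_back(start);
        residue[start] = next;
        while (!stack.empty()) {
            const std::size_t atom = stack.back();
            stack.pop_back();
            for (const std::size_t n : neighbors[atom]) {
                if (residue[n] == unassigned) {
                    residue[n] = next;
                    stack.push_back(n);
                }
            }
        }
        ++next;
    }
    return residue;
}
```

For two water molecules with bonds 0-1, 1-2, 3-4 and 4-5, atoms 0 to 2 land in residue 0 and atoms 3 to 5 in residue 1.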

Allowing Netcdf Amber trajectories without box information (Unit Cell)

I think the current way to allow .nc trajectories without box info is not working as expected.

When NCFormat::read_cell() calls dimension and the trajectory has no box information, an error gets thrown by nc::check (called by dimension) before dimension can return, say, size=0.

Maybe the call to dimension in NCFormat::read_cell() could be enclosed in a try{...} catch{}?
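A minimal sketch of the try/catch idea, with a std::map standing in for the NetCDF file. The dimension and has_cell functions here are hypothetical stand-ins rather than the actual chemfiles code; the cell_spatial dimension name follows the AMBER NetCDF convention.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

using FakeFile = std::map<std::string, std::size_t>;

// Hypothetical stand-in for the dimension lookup: throws when it is missing.
std::size_t dimension(const FakeFile& file, const std::string& name) {
    const auto it = file.find(name);
    if (it == file.end()) {
        throw std::runtime_error("missing dimension: " + name);
    }
    return it->second;
}

// Treat a missing cell dimension as "no box information" instead of
// letting the error propagate to the caller.
bool has_cell(const FakeFile& file) {
    try {
        return dimension(file, "cell_spatial") == 3;
    } catch (const std::runtime_error&) {
        return false;
    }
}
```

An alternative that avoids exceptions for control flow would be a separate query that checks whether the dimension exists before reading it.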

Support for residues in selections

With the new Residue class, additional selectors should be added:

  • resname would match the residue name;
  • resid would match the residue id;

I still need to think about how this can interact with multiple selections.

Support for biological molecules

I'm wondering if there's some space for an AminoAcid class. Objects of this class would be linked to each Frame object and 'contain' many objects of the Atom class. That is, a more 'OpenBabel' way of treating biological molecules and a more intuitive one.

I'm currently developing C++ software to use on biological molecules. Chemfiles is the only library that supports the major molecular dynamics trajectory formats and the PDB format at the same time. But the lack of a Residue concept is critical. On the other hand, OpenBabel can't read NetCDF, and my software became considerably slower when I switched away from chemfiles.

PDB format

File Type: text
Topological information: Yes
Positions: Yes
Velocities: ???

Reference: ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_A4.pdf

Remove dependency on netcdfcxx

This prevents easy static linkage of chemharp, and the current version of netcdfcxx does not use C++11 functionalities (move semantics in particular). I should directly call the C library for what is needed, wrapping a few types in C++.

Public release planning

What should be done before the public release of this code:

  • Write interface
    • Text format: XYZ
    • Binary format: NetCDF
  • Bindings
    • C read-only binding
    • Fortran read-only binding
    • Python read-only binding
  • Formats
    • PDB
  • Documentation
    • User manual

Santa Claus wish list (or, what will happen after the public release but before 1.0)

  • Read-Write bindings
  • Julia binding, either from C or C++
  • Developer documentation

Remove the logging framework?

It does not make sense to have a logging framework in a library, and it could be removed. It is mainly used by the C API, to log exceptions at the interface, but they are not strictly needed.

Another option would be to have it silenced by default.

Developer documentation

Add a basic developer documentation before the 1.0 release.

Built with doxygen doc and some hand-written text.

Remove dependency on Boost libs

Boost libraries should be removed, as there is no strong need for them, and this would make it easier to embed chemharp in another application.

Places where Boost is used, and how to remove it:

  • filesystem iteration in tests => see the C++1y standard with header;
  • any type => roll my own version, or inherit ostream in private to get access to the operator<<;

Replace the tests/data submodule by an archive

The tests/data git submodule is not really ergonomic, as submodules are not automatically updated when using git pull.

The test data files are in a separate repository (https://github.com/chemfiles/tests-data) because they are not needed when building the code, and the size of the repository is not negligible.

Just checking in a tar.gz archive and unpacking it with cmake would not work here, because all the tar.gz files would still be in .git/objects. I think the better way is to download the test files at compile time, using cmake ExternalProject.

Updated homebrew bottle

Hello @Luthaf, I'm currently going through my homebrew-juliadeps tap, updating versions of software where appropriate, and wanted to check with you to see if upgrading to v0.4.0 of chemharp would be appropriate for the julia code that uses it. Is that something I should do now, or should I wait until changes have been made on the julia side?

PDBx/mmCIF

File Type: text
Topological information: Yes
Positions: ???
Velocities: ???

PDBx/mmCIF is the chosen replacement for PDB files starting in 2016. A C++ parser exists here: http://mmcif.wwpdb.org/

PBC with tilted cells

I am not sure whether the current PBC code can handle very tilted cells (with angles less than 60° or more than 120°). We should at least add a test for that, and maybe fix the algorithm.

In case the algorithm does not handle it, a new cell type might be needed for these cells.

Fix the size of data types at the C interface

The code for the C API should only use types with a known size, such as int8_t or uint64_t. Otherwise the bindings will rely on assumptions about integer sizes that may not hold, which will create nasty bugs.

Test chemfiles on MinGW

MinGW is a port of the GNU compiler and tools to Windows. It is NOT a UNIX system.

They are available on Appveyor, and should be tested.

Add support for VMD Molfile plugin

Support for VMD molfile plugins will bring Chemharp support for 15 new formats for free. The first steps are in the molfile branch.

Add selections in topology

Users should be able to query a Topology for selections, in VMD style. The API might look like this:

//! Get the atoms matching the selection in select
std::vector<bool> Topology::select(const std::string& selection) const;
//! Get the atoms matching the tuple selection in select
std::vector<std::vector<bool>> Topology::select(const std::vector<std::string>& selections) const;

// C API version
int chrp_topology_select(const CHRP_TOPOLOGY* topology, const char* selection, bool* res, size_t natoms);
int chrp_topology_select_tuple(const CHRP_TOPOLOGY* topology, const char** selections, size_t nselect, bool** res, size_t natoms);

Selections are expressed as strings, and return the list of atoms matching the selection.

Tuple selections allow selecting pairs, triplets, ... of atoms matching a given pair, triplet, ... of selection strings.

Examples

topology = ["H", "O", "H", "H", "O", "H"]

auto s = topology.select("name O");
s == [false, true, false, false, true, false];

auto s = topology.select(std::vector{"name H", "name O"}); 
s == [[true, false], [false, true], [true, false], [true, false], [false, true], [true, false]];
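To make the first example concrete, here is a minimal sketch that supports only the "name <value>" form, with the topology reduced to a list of atom names. The free function select_atoms is hypothetical; a real implementation would go through the DSL parser and AST evaluation listed in the TODO.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Evaluate a selection of the form "name <value>" against a list of atom
// names. Any other selection string matches nothing in this sketch.
std::vector<bool> select_atoms(const std::vector<std::string>& names,
                               const std::string& selection) {
    const std::string prefix = "name ";
    std::vector<bool> result(names.size(), false);
    if (selection.compare(0, prefix.size(), prefix) != 0) {
        return result;
    }
    const std::string wanted = selection.substr(prefix.size());
    for (std::size_t i = 0; i < names.size(); ++i) {
        result[i] = (names[i] == wanted);
    }
    return result;
}
```

A tuple selection could then be assembled by evaluating each selection string separately and combining the per-atom results.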

TODO list

  • Selection DSL specification;
  • Parsing selection string to AST;
  • Evaluating the AST;
  • Testing;

Write tutorials in the documentation

The only usage documentation for now is the API doc, which is a bit rough for a first contact. The tutorials could use the examples and explain them. Other ideas are welcome!

Also, it could be nice to have the same tutorial with code examples in multiple languages.

Add support for atom labels in PDB files

Modify Atom concept to hold atom labels.

Main TODOs:

  • Create label_ variable and read it from columns 13-16 of the PDB format file.
  • Rename name_ variable to element_ and read it from columns 78-80 in the PDB format file.
    Then, for consistency, continue replacing name_ with element_ in the rest of the project.
