chemfiles / chemfiles

Library for reading and writing chemistry files

Home Page: http://chemfiles.org

License: BSD 3-Clause "New" or "Revised" License

CMake 2.46% C++ 88.37% Python 1.21% Shell 0.09% C 7.87%

computational-chemistry library files compchem cheminformatics chemistry hacktoberfest

chemfiles's Issues

Test chemfiles on MSYS

MSYS is a Unix-like environment on Windows. MSYS2 is installed on Appveyor, and chemfiles should be tested there.

fresh install does not include config.hpp and exports.hpp

I recently cloned from github repo to ~/chemfiles and followed installation instructions, using these options for the cmake command:

 cmake -DCMAKE_BUILD_TYPE=debug -DCHFL_BUILD_TESTS=ON -DCHFL_ENABLE_NETCDF=ON ..

But for some reason both these files weren't present in the ~/chemfiles/include/chemfiles folder, nor in the installed folder /usr/local/include/chemfiles, so when I tried to compile my code (including the flag -I/usr/local/include), the following error appeared:

/usr/local/include/chemfiles.hpp:11:32: fatal error: chemfiles/config.hpp: No such file or directory

The error was fixed by copying these 2 files to /usr/local/include/chemfiles.

Chemical Markup Language (CML)

File Type	text/xml
Topological information	Yes
Positions	Yes
Velocities	Yes

Reference: http://www.xml-cml.org

Predefined atom groups: water, protein, ...

Using a selection like water or name Na could be nice. This needs a way to predefine a group of atoms as "water".

I do not know how to do this. It may require detecting and classifying molecules in the system using some kind of graph algorithm.

Add functions for working with PBC

Functions to get the distance/angle/dihedral between particles could be nice to have. Something like

double Frame::distance(size_t i, size_t j);
double Frame::angle(size_t i, size_t j, size_t k);
double Frame::dihedral(size_t i, size_t j, size_t k, size_t m);

int chfl_frame_distance(const CHFL_FRAME* frame, size_t i, size_t j, double* r);
int chfl_frame_angle(const CHFL_FRAME* frame, size_t i, size_t j, size_t k, double* theta);
int chfl_frame_dihedral(const CHFL_FRAME* frame, size_t i, size_t j, size_t k, size_t m, double* phi);

The other solution is to have a function to wrap a vector in the unit cell, and then let users write these functions themselves.
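
A minimal sketch of the second option, assuming an orthorhombic cell only. The names Vec3, wrap_orthorhombic and pbc_distance are hypothetical and not part of chemfiles; a triclinic cell would need the full cell matrix instead of three edge lengths.

```cpp
#include <array>
#include <cmath>
#include <cstddef>

using Vec3 = std::array<double, 3>;

// Wrap a displacement vector to its minimum image in an orthorhombic
// cell with edge lengths `lengths` (hypothetical helper).
Vec3 wrap_orthorhombic(Vec3 delta, const Vec3& lengths) {
    for (std::size_t d = 0; d < 3; d++) {
        delta[d] -= std::round(delta[d] / lengths[d]) * lengths[d];
    }
    return delta;
}

// Minimum image distance between two positions, built on top of the
// wrapping function.
double pbc_distance(const Vec3& a, const Vec3& b, const Vec3& lengths) {
    const Vec3 delta = wrap_orthorhombic(
        Vec3{a[0] - b[0], a[1] - b[1], a[2] - b[2]}, lengths
    );
    return std::sqrt(delta[0] * delta[0] + delta[1] * delta[1] + delta[2] * delta[2]);
}
```

Angles and dihedrals can be written the same way, by wrapping each difference vector before taking dot and cross products.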

BUS_ERROR on OS X with OSX_DEPLOYMENT_TARGET=10.9 and NetCDF

This is weird. On OS X, when compiling the code in release mode with this exact set of flags, the xyz test runs into a bus error:

CMAKE_OSX_DEPLOYMENT_TARGET=10.9
CMAKE_C_FLAGS="-mmacosx-version-min=10.9"
CHFL_ENABLE_NETCDF=ON

The full set of commands to run to reproduce in a fresh build is:

cmake -DCMAKE_OSX_DEPLOYMENT_TARGET=10.9 -DCMAKE_C_FLAGS="-mmacosx-version-min=10.9" -DCHFL_BUILD_TESTS=ON -DCHFL_ENABLE_NETCDF=ON ../.. 
make
./tests/xyz

The Valgrind report is here, and an LLDB session looks like this:

(lldb) target create "tests/xyz"
Current executable set to 'tests/xyz' (x86_64).
(lldb) run
Process 24586 launched: 'tests/xyz' (x86_64)
Process 24586 stopped
* thread #1: tid = 0xcbcd, 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x1000b3130)
    frame #0: 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const
xyz`boost::system::system_category()::system_category_const:
->  0x1000b3130 <+0>: lock
    0x1000b3131 <+1>: repne
    0x1000b3132 <+2>: orb    (%rax), %al
    0x1000b3134 <+4>: addl   %eax, (%rax)
(lldb) bt
* thread #1: tid = 0xcbcd, 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const, queue = 'com.apple.main-thread', stop reason = EXC_BAD_ACCESS (code=2, address=0x1000b3130)
  * frame #0: 0x00000001000b3130 xyz`boost::system::system_category()::system_category_const
(lldb)

Since I do not get any backtrace in LLDB, and valgrind reports jumps on uninitialized values, this might be due to a stack corruption.

This was spotted when investigating the need for export MACOSX_DEPLOYMENT_TARGET="" in conda builds for conda-forge/staged-recipes#1571.

case insensitive search for element(type) matching in PDB records

In PDB ATOM records, elements are always in upper case. I hadn't noticed this before, since biological molecules are almost exclusively formed by CHONP atoms.

Atom::Atom(std::string name, std::string type):
    name_(std::move(name)), type_(std::move(type)) {
    auto periodic = PERIODIC_INFORMATION.find(type_);
    if (periodic != PERIODIC_INFORMATION.end()) {
        mass_ = periodic->second.mass;
    }
}

Maybe the keys in PERIODIC_INFORMATION should be lower case, and std::tolower() should be applied to each character of the query string before any search.

Please confirm and I'll apply this minor change myself.
Cheers!
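
A minimal sketch of the proposed lookup. The MASSES table is a hypothetical stand-in for PERIODIC_INFORMATION with only a few entries, and to_lower and mass_for are illustrative names, not existing chemfiles functions.

```cpp
#include <cctype>
#include <string>
#include <unordered_map>

// Hypothetical stand-in for PERIODIC_INFORMATION, with lower case keys.
static const std::unordered_map<std::string, double> MASSES = {
    {"h", 1.008}, {"c", 12.011}, {"na", 22.990}, {"fe", 55.845},
};

// Lower-case every character. The cast to unsigned char avoids undefined
// behavior in std::tolower for negative char values.
std::string to_lower(std::string s) {
    for (auto& c : s) {
        c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    return s;
}

// Mass for an atom type, or 0 when the type is not in the table.
double mass_for(const std::string& type) {
    const auto it = MASSES.find(to_lower(type));
    return it != MASSES.end() ? it->second : 0.0;
}
```

With this, "NA", "Na" and "na" all resolve to the same entry, so the upper case symbols found in PDB files get a mass too.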

Lazy loading of Trajectory::nsteps

Getting the number of steps in a trajectory can be done at no cost for some formats (NetCDF), but might be very costly for others (XYZ). In a big XYZ trajectory, just getting the number of steps can take 5s.

We always load the number of steps when opening a trajectory, which imposes a big cost just to open a file.

We should load this number of steps lazily (only when requested) and not systematically. To still be able to indicate when a trajectory has been fully read, we can add bool Format::is_done() and use it to implement bool Trajectory::is_done() and chfl_trajectory_is_done.
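
A minimal sketch of the lazy loading, assuming the expensive format-specific scan can be passed in as a callable. LazyTrajectory and counter are hypothetical names used for illustration only.

```cpp
#include <cstddef>
#include <functional>
#include <utility>

// Hypothetical trajectory wrapper: `counter` stands for the expensive
// format-specific scan (for example reading a whole XYZ file).
class LazyTrajectory {
public:
    explicit LazyTrajectory(std::function<std::size_t()> counter)
        : counter_(std::move(counter)) {}

    // The scan only runs on the first call; later calls use the cache.
    std::size_t nsteps() {
        if (!counted_) {
            nsteps_ = counter_();
            counted_ = true;
        }
        return nsteps_;
    }

private:
    std::function<std::size_t()> counter_;
    std::size_t nsteps_ = 0;
    bool counted_ = false;
};
```

Opening the file is then free, and only the code that actually asks for the number of steps pays for the scan.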

Use constants for error handling in C FFI

The C FFI should #define constants for error codes, and use them instead of relying on arbitrary values.

This would allow us to:

check directly for errors with if (chfl_trajectory_open(...) != CHFL_SUCCESS);
document the failure causes;
propagate errors to the bindings. In the Rust one, we could remove a lot of Result using this.
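
A minimal sketch of how such constants could be produced at the interface. Only CHFL_SUCCESS comes from the text above; CHFL_MEMORY_ERROR, CHFL_GENERIC_ERROR, their values and the catch_all helper are assumptions made for illustration.

```cpp
#include <new>
#include <stdexcept>

// Hypothetical names and values, for illustration only.
#define CHFL_SUCCESS 0
#define CHFL_MEMORY_ERROR 1
#define CHFL_GENERIC_ERROR 2

// Run `callable` and convert any C++ exception to a status code, so that
// no exception crosses the C boundary.
template <typename F>
int catch_all(F&& callable) {
    try {
        callable();
        return CHFL_SUCCESS;
    } catch (const std::bad_alloc&) {
        return CHFL_MEMORY_ERROR;
    } catch (...) {
        return CHFL_GENERIC_ERROR;
    }
}
```

Each FFI function body can be wrapped in such a helper, which keeps the mapping from exceptions to codes in a single place that the documentation can refer to.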

Spatial zones for selections

Allow selecting atoms inside a sphere/cube/..., or within a given distance of another atom.

What is needed:

Decide on syntax;
Decide on implementation strategy inside the current implementation;

Relicense the code under BSD?

I do not know if this is worth it, as the BSD license does not have that many advantages compared to the MPL. Opening this to discuss: if you have an opinion or a requirement concerning Chemfiles licensing, please tell us!

Runtime initialization of Molfile path

It would be nice to set the path to molfile plugins at runtime, and not only at compile time or through the CHRP_MOLFILES environment variable.

On Linux and OS X, we can get the path to the shared library libchemharp.so by using the following:

#include <dlfcn.h>
#include <iostream>

void foo() {
    Dl_info info;
    if (dladdr((void*)foo, &info)){
        std::cout << "foo is at " << info.dli_fname << std::endl;
    }
}

Then the path to molfile plugin can be detected from that.

Another option would be to use the static version of plugins.

Provide a view in the positions/velocities

The FFI functions chfl_frame_positions and chfl_frame_velocities should return a pointer to the Frame::positions and Frame::velocities. This would remove the need to copy the data for big files, and the need to do

chfl_frame_positions(frame, positions, atoms);
positions[2][0] = 4.0;
chfl_frame_set_positions(frame, positions, atoms);

Use contiguous data in Array3D, instead of std::vector<Vector3D>;
Provide this data in the FFI.
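
A minimal sketch of such a view, assuming positions are stored as one flat, contiguous array of doubles. The Frame struct and frame_positions_view below are hypothetical and do not match the existing chfl_frame_positions signature.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical frame with contiguous storage: x0, y0, z0, x1, y1, z1, ...
struct Frame {
    std::vector<double> positions;
};

// Hypothetical FFI-style accessor: hands out a pointer into the frame's
// own storage instead of copying. The pointer is invalidated as soon as
// the frame is resized.
int frame_positions_view(Frame* frame, double** data, std::size_t* natoms) {
    if (frame == nullptr || data == nullptr || natoms == nullptr) {
        return 1;
    }
    *data = frame->positions.data();
    *natoms = frame->positions.size() / 3;
    return 0;
}
```

Writing data[3 * i + d] through the returned pointer then modifies the frame directly, so no call to a set_positions function is needed.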

Heuristic to guess residues in a topology

In the same spirit as the guess_bonds function, we could try to guess the residues in a topology. The simplest version could simply assume residues == molecule, and just follow the bonds graph. A more elaborate version could try to match sub-graphs to known residues like amino-acids.
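
A minimal sketch of the simplest version, which labels each connected component of the bond graph as one molecule. The function name guess_molecules and the representation of bonds as index pairs are assumptions, not the chemfiles API.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Assign a molecule index to each atom by following bonds, i.e. by
// computing the connected components of the bond graph.
std::vector<std::size_t> guess_molecules(
    std::size_t natoms,
    const std::vector<std::pair<std::size_t, std::size_t>>& bonds
) {
    std::vector<std::vector<std::size_t>> neighbors(natoms);
    for (const auto& bond : bonds) {
        neighbors[bond.first].push_back(bond.second);
        neighbors[bond.second].push_back(bond.first);
    }

    // Sentinel: valid molecule ids are always smaller than natoms.
    const std::size_t unassigned = natoms;
    std::vector<std::size_t> molecule(natoms, unassigned);
    std::size_t next_id = 0;
    for (std::size_t start = 0; start < natoms; start++) {
        if (molecule[start] != unassigned) {
            continue;
        }
        // Depth-first traversal starting from the first unvisited atom.
        std::vector<std::size_t> stack = {start};
        molecule[start] = next_id;
        while (!stack.empty()) {
            const std::size_t atom = stack.back();
            stack.pop_back();
            for (const std::size_t other : neighbors[atom]) {
                if (molecule[other] == unassigned) {
                    molecule[other] = next_id;
                    stack.push_back(other);
                }
            }
        }
        next_id++;
    }
    return molecule;
}
```

For two water molecules stored as H O H H O H with bonds (0,1), (1,2), (3,4), (4,5), the first three atoms get id 0 and the last three get id 1.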

Allowing Netcdf Amber trajectories without box information (Unit Cell)

I think the current way to allow .nc trajectories without box info is not working as expected.

When NCFormat::read_cell() calls dimension and the trajectory has no box information, nc::check (called by dimension) throws an error before dimension can return, say, size=0.

Maybe the call to dimension in NCFormat::read_cell() could be enclosed in a try {...} catch {...} block?
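
A minimal sketch of that idea, using a std::map as a hypothetical stand-in for the NetCDF file. The dimension function here only mimics the throwing behavior described above, and optional_dimension is an illustrative name.

```cpp
#include <cstddef>
#include <map>
#include <stdexcept>
#include <string>

// Hypothetical stand-in for a NetCDF file: dimension name -> size.
using Dimensions = std::map<std::string, std::size_t>;

// Strict lookup, throwing when the dimension is missing (mimics the
// behavior described above).
std::size_t dimension(const Dimensions& file, const std::string& name) {
    const auto it = file.find(name);
    if (it == file.end()) {
        throw std::runtime_error("missing dimension: " + name);
    }
    return it->second;
}

// Tolerant lookup: a missing dimension is reported as size 0.
std::size_t optional_dimension(const Dimensions& file, const std::string& name) {
    try {
        return dimension(file, name);
    } catch (const std::runtime_error&) {
        return 0;
    }
}
```

An alternative that avoids exceptions for control flow would be to check whether the dimension exists before reading it, and to treat a missing one as an infinite cell.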

Support for residues in selections

With the new Residue class, additional selectors should be added:

resname would match the residue name;
resid would match the residue id;

I still need to think about how this can interact with multiple selections.

Use double instead of floats for positions/velocities?

This uses more memory, but has multiple advantages:

Better support for simulations codes, with no double -> float conversion needed;
Support for restart-style files, like the restart convention for NetCDF.

Remove the `_t` suffix for types ?

It is reserved by POSIX, but it makes the types look nicer.

Use Appveyor to test the windows version

The Windows version of Chemharp is starting to work with both MSVC and MinGW, but it is not yet tested in CI services.

First steps are here, but the build fails when bootstrapping Boost.

Support for biological molecules

I'm wondering if there's some space for an AminoAcid class. Objects of this class would be linked to each Frame object and 'contain' many objects of the Atom class. That is, a more 'OpenBabel' way of treating biological molecules and a more intuitive one.

I'm currently developing C++ software to use on biological molecules. Chemfiles is the only library that supports the major molecular dynamics trajectory formats and the PDB format at the same time, but the lack of a Residue concept is critical. On the other hand, OpenBabel can't read NetCDF, and my software became considerably slower when I switched away from chemfiles.

Enable OSX testing on travis

Since this is available, let's use it! It can be used to build the OSX conda recipe too.

PDB format

File Type	text
Topological information	Yes
Positions	Yes
Velocities	???

Reference: ftp://ftp.wwpdb.org/pub/pdb/doc/format_descriptions/Format_v33_A4.pdf

Remove dependency on netcdfcxx

This prevents easy static linking of chemharp, and the current version of netcdfcxx does not use C++11 features (move semantics in particular). I should call the C library directly for what is needed, wrapping a few types in C++.

Public release planning

What should be done before the public release of this code:

Write interface
- Text format: XYZ
- Binary format: NetCDF
Bindings
- C read-only binding
- Fortran read-only binding
- Python read-only binding
Formats
- PDB
Documentation
- User manual

Santa Claus wish list (or, what will happen after the public release but before 1.0)

Read-Write bindings
Julia binding, either from C or C++
Developer documentation

Remove the logging framework?

It does not make sense to have a logging framework in a library, and it could be removed. It is mainly used by the C API, to log exceptions at the interface, but they are not strictly needed.

Another option would be to have it silenced by default.

Gromacs XTC files

File Type	binary
Topological information	???
Positions	Yes
Velocities	Yes

Reference: http://www.gromacs.org/Developer_Zone/Programming_Guide/XTC_Library

Check relocation of the conda recipe

A conda recipe was introduced in 9c3ccfe, but I should check that the package is effectively relocatable. In particular, I may need to set the CHRP_MOLFILES environment variable. See http://conda.pydata.org/docs/building/recipe.html for the documentation on recipes.

Remove the Trajectory::sync function

It is not really useful, and does not pull its weight. I only use it for tests, where I could just close the file.

Developer documentation

Add a basic developer documentation before the 1.0 release.

Built with doxygen doc and some hand-written text.

Remove dependency on Boost libs

Boost libraries should be removed, as there is no strong need for them, and this would make it easier to embed chemharp in another application.

Places where Boost is used, and how to remove it:

filesystem iteration in tests => see the C++1y standard with header;
any type => roll my own version, or inherit ostream in private to get access to the operator<<;

Replace the tests/data submodule by an archive

The tests/data git submodule is not really ergonomic, as submodules are not automatically updated when using git pull.

The test data files are in a separate repository (https://github.com/chemfiles/tests-data) because they are not needed when building the code, and the size of the repository is not negligible.

Just checking in a tar.gz archive and unpacking it with cmake would not work here, because all the tar.gz files would still be in .git/objects. I think the better way is to download the test files at compile time, using cmake ExternalProject.

Amber NetCDF format

To validate the BinaryFile interface.

File Type	binary
Topological information	Yes
Positions	Yes
Velocities	Yes

Reference: http://ambermd.org/netcdf/nctraj.pdf

Support restart convention for NetCDF

This is roughly the same as standard NetCDF, but using doubles most of the time instead of floats.

Bonding information in PDB from standard tables

The PDB CONECT record is only needed for bonds that are not in the standard connectivity table. Chemfiles should use this table to be able to have all bonds in the system.

Updated homebrew bottle

Hello @Luthaf, I'm currently going through my homebrew-juliadeps tap, updating versions of software where appropriate, and wanted to check with you to see if upgrading to v0.4.0 of chemharp would be appropriate for the julia code that uses it. Is that something I should do now, or should I wait until changes have been made on the julia side?

PDBx/mmCIF

File Type	text
Topological information	Yes
Positions	???
Velocities	???

PDBx/mmCIF is the chosen replacement for PDB files starting in 2016. A C++ parser exists here: http://mmcif.wwpdb.org/

Use an enum for error codes in C API

Using

typedef enum {
    CHFL_SUCCESS,
    CHFL_ERROR,
    ...
} chfl_status;

and updating the function to return chfl_status instead of int.

Document embedding chemfiles in other projects

Using chemfiles from other projects should be easy, and documented!

PBC with tilted cells

I am not sure whether the current PBC code can handle very tilted cells (with angles less than 60° or more than 120°). We should at least add a test for that, and maybe fix the algorithm.

If the algorithm does not handle them, a new cell type might be needed for these cells.

Fix the size of data types at the C interface

The code for the C API should only use types with known size, such as int8_t or uint64_t. Otherwise the bindings will rely on assumptions about integer sizes that may not hold, which will create nasty bugs.

website - dead links

Hi there,

In the main webpage of chemfiles, in section Multiple languages some links are not present or not valid.
See for instance Fortran :
http://github.com/chemfiles/chemfiles.f90
instead of
http://github.com/chemfiles/chemfiles.f03

Test chemfiles on MinGW

MinGW is a port of the GNU compiler and tools to Windows. It is NOT a UNIX system.

They are available on Appveyor, and should be tested.

Add support for VMD Molfile plugin

Support for VMD molfile plugins will bring Chemharp support for 15 new formats for free. The first steps are in the molfile branch.

Fix the molfile DCD reader

The dcd reader fails on 32 bit windows, and maybe the 64 one too.

This may be because of a 32/64bit integer size difference.

Add selections in topology

Users should be able to query a Topology with selections, in VMD style. The API may look like this:

//! Get the atoms matching the selection in select
std::vector<bool> Topology::select(const std::string& selection) const;
//! Get the atoms matching the tuple selection in select
std::vector<std::vector<bool>> Topology::select(const std::vector<std::string>& selections) const;

// C API version
int chrp_topology_select(const CHRP_TOPOLOGY* topology, const char* selection, bool* res, size_t natoms);
int chrp_topology_select_tuple(const CHRP_TOPOLOGY* topology, const char** selections, size_t nselect, bool** res, size_t natoms);

Selections are expressed as strings, and return the list of atoms corresponding to the selection.

Tuple selections allow selecting pairs, triplets, ... of atoms matching a given pair, triplet, ... of selection strings.

Examples

topology = ["H", "O", "H", "H", "O", "H"]

auto s = topology.select("name O");
s == [false, true, false, false, true, false];

auto s = topology.select(std::vector{"name H", "name O"}); 
s == [[true, false], [false, true], [true, false], [true, false], [false, true], [true, false]];
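
A minimal sketch of the single-selection case, supporting only selections of the form name <value>. The free function select_by_name is hypothetical; a real implementation would go through the parser and AST evaluation steps listed in this issue.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Evaluate a single `name <value>` selection against a list of atom
// names. Any other selection yields an all-false mask in this sketch.
std::vector<bool> select_by_name(
    const std::vector<std::string>& names,
    const std::string& selection
) {
    const std::string prefix = "name ";
    std::vector<bool> matched(names.size(), false);
    if (selection.compare(0, prefix.size(), prefix) != 0) {
        return matched;
    }
    const std::string wanted = selection.substr(prefix.size());
    for (std::size_t i = 0; i < names.size(); i++) {
        matched[i] = (names[i] == wanted);
    }
    return matched;
}
```

Applied to the names H, O, H, H, O, H with the selection name O, this gives the same mask as the first example above.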

TODO list

Selection DSL specification;
Parsing selection string to AST;
Evaluation AST;
Testing;

Modify `Atom` concept to hold atom labels.

Main TODOs:

Create label_ variable and read it from columns 13-16 of the PDB format file.
Rename name_ variable to element_ and read it from columns 77-78 in the PDB format file.
Then, for consistency, continue replacing name_ with element_ in the rest of the project.
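
A minimal sketch of reading both fields from a fixed-width ATOM record, assuming the wwPDB format description v3.3 layout (atom name in columns 13-16, element symbol in columns 77-78). pdb_field, AtomNames and parse_atom_names are hypothetical names, not part of chemfiles.

```cpp
#include <cstddef>
#include <string>

// Extract the 1-based, inclusive column range [first, last] from a
// fixed-width record, with surrounding spaces removed. Records that are
// too short give an empty field.
std::string pdb_field(const std::string& line, std::size_t first, std::size_t last) {
    if (line.size() < first) {
        return "";
    }
    const std::string field = line.substr(first - 1, last - first + 1);
    const auto begin = field.find_first_not_of(' ');
    if (begin == std::string::npos) {
        return "";
    }
    const auto end = field.find_last_not_of(' ');
    return field.substr(begin, end - begin + 1);
}

struct AtomNames {
    std::string label;   // atom name, columns 13-16
    std::string element; // element symbol, columns 77-78
};

AtomNames parse_atom_names(const std::string& record) {
    return {pdb_field(record, 13, 16), pdb_field(record, 77, 78)};
}
```

Many PDB files in the wild omit the element columns, so the empty string returned for short records gives callers a chance to fall back to guessing the element from the label.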

Trajectory NG

File Type	binary
Topological information	???
Positions	Yes
Velocities	Yes

Reference: http://dx.doi.org/10.1002/jcc.23495

CIF files

File Type	text
Topological information	Yes
Positions	Yes
Velocities	Maybe

Reference for the format: http://scripts.iucr.org/cgi-bin/paper?S010876739101067X

A library for guessing symmetry is available here: https://github.com/atztogo/spglib
