
ciftools-java's People

Contributors

bobhanson, dependabot[bot], jonstargaryen, josemduarte

ciftools-java's Issues

Provide command line interface to test parsing of any CIF

Currently, Demo.java has a nice command-line interface to test parsing of the 1acj PDB entry:

public static void main(String[] args) throws IOException {
    parseFile();
    System.out.println();
    buildModel();
    System.out.println();
    convertAlphaFold();
}

It would be great if main in Demo.java could accept a filename to parse and fall back on 1acj if no argument is given.

I am doing a comparison of CIF readers and would like to include CIFTools (cod-developers/CIF-parsers#4). To do so, I need a command-line interface that parses a given file. I believe such an interface would serve other purposes as well, for example checking whether a CIF file parses correctly.
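
The requested fallback behaviour can be sketched without touching the parser itself. The class below is hypothetical and not part of ciftools-java; it only shows how main could pick up an optional argument, and the call into Demo is left as a comment because an overload taking a filename is an assumption.

```java
/** Hypothetical helper illustrating the requested behaviour; not part of ciftools-java. */
final class DemoInput {
    /** Entry used when no argument is supplied, matching the current demo. */
    static final String DEFAULT_ENTRY = "1acj";

    /** Returns the first command-line argument, or the default entry if none was given. */
    static String resolve(String[] args) {
        boolean hasArgument = args != null && args.length > 0 && !args[0].isEmpty();
        return hasArgument ? args[0] : DEFAULT_ENTRY;
    }

    public static void main(String[] args) {
        String input = resolve(args);
        System.out.println("Parsing " + input);
        // A call such as parseFile(input) would go here; that overload is assumed, not existing API.
    }
}
```

Run without arguments, this selects 1acj; run with a path, it selects that path instead.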

ciftools is not allowing flexibility in case for attributes

ciftools-java requires exact-case matches for its category and column names. However, the wwPDB documentation at http://mmcif.wwpdb.org/docs/tutorials/mechanics/pdbx-mmcif-syntax.html states:

Data category and data item names are not case sensitive.

corresponding to

https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#case

Case sensitivity

  1. Data names, block and frames codes, and reserved words are case-insensitive. The case of any characters within data values must be respected.

Pull #2 addresses this issue; I thought it would be a separate pull, but I see now that it just attached one push to another. So sorry for that. Included there:

  • the original pull request that covered the lazy initialization of categories (See DemoRead)
  • additional "...Generic" classes that allow bypassing all generated classes and instead using the actual CIF tags for categories and columns (See DemoReadGeneric)
  • code that fixes this capitalization issue -- at least for reading. I didn't implement writing for the generic options and thought perhaps that would be ok anyway, as the point there would be to do the reading specifically generically.
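
As an illustration of the rule quoted above, names can be normalized once on the way in and once on lookup, while data values are left untouched. The class below is a hypothetical sketch, not the approach taken in the pull request and not the ciftools-java API.

```java
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

/** Hypothetical name-keyed container with case-insensitive keys; not the ciftools-java API. */
final class CaseInsensitiveNames<V> {
    private final Map<String, V> byLowerCaseName = new HashMap<>();

    private static String normalize(String name) {
        // Locale.ROOT keeps the result independent of the default locale.
        return name.toLowerCase(Locale.ROOT);
    }

    /** Stores a value under a category or column name, ignoring its case. */
    void put(String name, V value) {
        byLowerCaseName.put(normalize(name), value);
    }

    /** Looks a name up regardless of how it is capitalized; returns null if absent. */
    V get(String name) {
        return byLowerCaseName.get(normalize(name));
    }
}
```

With this, a column stored as atom_site.Cartn_x is found again under ATOM_SITE.CARTN_X or atom_site.cartn_x.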

classes and categories need to be loaded lazily (for SwingJS)

The latest update is appreciated, particularly for adding generic string-based references to CIF tags. However, the purpose of that generic option is to reduce the required download in SwingJS, and we still see a preliminary load for DemoReadGeneric of 789 classes (777 when I remove the StandardCharsets dependency of MessagePack), including 666 from rcsb, resulting in a download of 4.7 MB compressed:

12/05/2019  01:09 PM         4,742,217 core_cifdemo.z.js

The "generic"/properties idea is specifically designed to avoid that. My fork is requiring only 152 classes (just 41 from rcsb), bringing that to a much more respectable 0.8 MB:

12/05/2019  01:05 PM           790,585 core_cifdemo.z.js

This, of course, includes the full extent of the Java(Script) Runtime Environment that is required to load and run DemoReadGeneric. If we check how much the rcsb code contributes, we are down to just 96K for the rcsb classes:

12/05/2019 01:51 PM 95,657 core_cifdemo.z.js

To me, that sounds pretty reasonable for a binaryCIF reader.

I am fairly certain that this efficiency requires use of Properties for two reasons:

First, for generic loading, the problem is a relatively simple one: make sure ModelFactory only creates CATEGORY_MAP and COLUMN_MAP if needed, which is when generic loading is not used or when text CIF is being read (generic or not).

Second, for nongeneric loading, we still do not need to load all 600+ categories and map all 7000+ columns. That is the huge download hit, because it requires loading all those categories (as well as the 350K field-name-class-map.csv file).

The solution thus requires both something like a set of property files (compacted to 130K in my implementation, but deliverable in mostly tiny 1-3 KB sets) as well as reflection in order to avoid the downloading of all the category files when text generic or non-generic reading is done.
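
The "only create the map if needed" part can be sketched as plain lazy initialization. The class, its field, and the two placeholder entries below are hypothetical; they stand in for ModelFactory and CATEGORY_MAP but do not reproduce either, and the property-file and reflection parts are not shown.

```java
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a factory that owns an expensive lookup table. */
final class LazyCategoryMap {
    private static Map<String, String> categoryMap; // stays null until first use
    static int buildCount = 0;                      // exposed only to demonstrate laziness

    private static Map<String, String> build() {
        buildCount++;
        Map<String, String> map = new HashMap<>();
        // Placeholder entries; a real table would cover hundreds of categories.
        map.put("atom_site", "AtomSite");
        map.put("entity", "Entity");
        return map;
    }

    /** Builds the table on first call only. Not thread-safe, which is assumed acceptable here. */
    static String classNameFor(String category) {
        if (categoryMap == null) {
            categoryMap = build();
        }
        return categoryMap.get(category);
    }
}
```

Code paths that never call classNameFor never pay for the table, which is the property the generic reader relies on.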

Also, just to note, even with that, my code is executing at two to five times the speed of rcsb/ciftools in JavaScript reading 3j9m, even when I add the raw column business to ciftools that I propose to speed all column data access. So there is still something there that is an issue.

My reference implementation is at
https://github.com/BobHanson/ciftools-java
with JavaScript site as
https://github.com/BobHanson/ciftools-java/blob/swingjs/dist/ciftools_site.zip?raw=true

I note that with Chrome, now, for 3j9m, I am seeing the following report from DemoReadGeneric in JavaScript:

rcsb/ciftools-java, with additions for rapid column data retrieval:

5 niter=11 processStream BINARY StreamParser
5 niter=11 processStream BINARY StreamParser
5 niter=11 processStream BINARY StreamParser
5 niter=11 processStream BINARY StreamParser
140 niter=10 ------PARSE binary (-GC)
140 niter=10 ------PARSE binary

bobhanson/ciftools-java, with the same:

1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
85 niter=10 ------PARSE binary (-GC)
85 niter=10 ------PARSE binary
Demo complete

compared to Java:

0 niter=11 processStream BINARY StreamParser
0 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
14 niter=10 ------PARSE binary (-GC)
29 niter=10 ------PARSE binary

(where -GC is removing garbage collection from the time for Java)

The 11-iteration time is an average for decoding the cartn_[x,y,z] columns of atom_site and averaging cartn_x.
The 10-iteration time is an average that includes parsing and building the CifFile object.

I do appreciate that even 140 ms for building a CifFile object -- reading 3j9m in its entirety -- is a huge win for binaryCIF! I'm just being greedy here.

unsigned int messagepack method does not return an unsigned int

The following method should return a long, not an int.

private int readUnsignedInt(DataInputStream inputStream) throws IOException {
    return (int) (inputStream.readInt() & 0xFFFFFFFFL);
}

As written, this just returns a number in the normal int range [Integer.MIN_VALUE, Integer.MAX_VALUE]. Larger numbers must be cast as long, not int.
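
A minimal sketch of the suggested change follows. The wrapper class and the decode helper are hypothetical and exist only to make the example self-contained; the point is the long return type and the absence of the narrowing cast.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

/** Hypothetical wrapper class; only readUnsignedInt mirrors the method discussed above. */
final class UnsignedIntDemo {
    /** Reads four big-endian bytes and returns them as a value in [0, 2^32 - 1]. */
    static long readUnsignedInt(DataInputStream inputStream) throws IOException {
        // The long mask widens the int first, so the sign bit becomes an ordinary value bit.
        return inputStream.readInt() & 0xFFFFFFFFL;
    }

    /** Convenience helper for trying the method on an in-memory byte array. */
    static long decode(byte[] fourBytes) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(fourBytes))) {
            return readUnsignedInt(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

For the bytes FF FF FF FF, readInt() yields -1 and the masked result is 4294967295, which the original int-returning version would have narrowed back to -1.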

AlphaFold conversion to BinaryCIF fails

Hi!

Using the library in version 3.0.0 from Maven Central

<dependency>
    <groupId>org.rcsb</groupId>
    <artifactId>ciftools-java</artifactId>
    <version>3.0.0</version>
</dependency>

conversion to BinaryCIF fails with the example code given in the README. The file is not actually converted and contains mmCIF content. Please review the attached files. The gene name of this particular structure on UniProt is strange and contains two quote characters: ''cytochrome P450.

Code used for conversion:

CifFile cifFile = CifIO.readFromPath(cifFilePath);
MmCifFile mmCifFile = cifFile.as(StandardSchemata.MMCIF);

// convert to BinaryCIF representation
byte[] output = CifIO.writeBinary(mmCifFile);

Thank you for checking.
AF-O49373-F1-model_v1.zip

Differences between expected and obtained 1acj.bcif.gz when switching to OpenJDK17

Hello,

I am one of the Debian maintainers of ciftools-java. Recently Debian switched from OpenJDK11 to OpenJDK17 and this caused a failure in WriterTest.writeBinary (full Debian log). I got the error with version 3.0.1 but also with version 4.0.3.

I dug a bit further and compared the bytes of the expected and obtained strings printed in the log. Here are the first four lines of the xxd output:

$ xxd expected | head -n4
00000000: 1fef bfbd 0800 0000 0000 0000 efbf bd5d ...............]
00000010: 0760 53ef bfbd efbf bd3f efbf bdef bfbd .`S......?......
00000020: 0eef bfbd 1214 0aef bfbd 7353 6959 efbf ..........sSiY..
00000030: bd40 43ef bfbd 0eef bfbd 0eef bfbd efbf .@C.............
$ xxd got | head -n4
00000000: 1fef bfbd 0800 0000 0000 00ef bfbd efbf ................
00000010: bd5d 0760 53ef bfbd efbf bd3f efbf bdef .].`S......?....
00000020: bfbd 0eef bfbd 1214 0aef bfbd 7353 6959 ............sSiY
00000030: efbf bd40 43ef bfbd 0eef bfbd 0eef bfbd ...@C...........

As you can see, the difference is that a 0000 in the expected string is replaced by 00ef bfbd in the obtained one.

I strongly suspect this is innocuous, as I met a similar issue with another project
broadinstitute/picard#1840
and the authors explained the Java gzip implementation had changed, so that the change we see is normal. Do you also think so?

I would be happy to provide you with more details if needed. Thanks for your attention!

Best regards,

Pierre
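
If the difference is indeed confined to the gzip container, one way to make such a test independent of the JDK version would be to compare the decompressed payloads rather than the compressed bytes. The sketch below uses only java.util.zip; the class and method names are hypothetical and not taken from WriterTest.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

/** Hypothetical test helper: compares gzip streams by content, not by container bytes. */
final class GzipPayloadCompare {
    static byte[] gzip(byte[] data) {
        try {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            try (GZIPOutputStream gz = new GZIPOutputStream(out)) {
                gz.write(data);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    static byte[] gunzip(byte[] compressed) {
        try (GZIPInputStream gz = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buffer = new byte[8192];
            int n;
            while ((n = gz.read(buffer)) != -1) {
                out.write(buffer, 0, n);
            }
            return out.toByteArray();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** True if both gzip streams decompress to identical bytes. */
    static boolean samePayload(byte[] expected, byte[] actual) {
        return Arrays.equals(gunzip(expected), gunzip(actual));
    }
}
```

Two streams that differ only in a header byte then compare as equal, while any change to the actual BinaryCIF content still fails the comparison.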

LinkedHashMap and functional programming a problem in JavaScript (SwingJS adaptation only)

More in the way of results than an actual issue. For the record...

LinkedHashMap is not necessary, at least for reading, as the order of keys in CIF files is not significant. Switching to a simple JavaScript map doubles performance in JavaScript.

Decoding speed is marginally increased in JavaScript (5-10%) by using native JavaScript array buffer conversions for MessagePack data arrays, at least in the first decoding from byte[] to short[].
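
On the Java side, the closest analogue to a native typed-array view is a bulk conversion through java.nio. The sketch below is illustrative only: the class name is hypothetical, and the little-endian byte order is an assumption about the encoded data rather than something stated above.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

/** Hypothetical helper showing a bulk byte[] to short[] conversion. */
final class ShortDecoding {
    /** Reinterprets consecutive byte pairs as 16-bit signed integers (little-endian assumed). */
    static short[] toShorts(byte[] bytes) {
        short[] out = new short[bytes.length / 2];
        // The short view covers the same bytes; get(out) copies them in one bulk operation.
        ByteBuffer.wrap(bytes).order(ByteOrder.LITTLE_ENDIAN).asShortBuffer().get(out);
        return out;
    }
}
```

A trailing odd byte, if any, is ignored by this sketch.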

In addition, functional programming in general doesn't translate well into JavaScript, with simpler replacements improving performance by as much as 60-fold (250 ms to 4 ms, for example).

So, as much as possible, if alternatives to those can be provided, it would be great.

Also, the problem with reading text CIF and not knowing what BaseColumn subclasses to assign is taken care of by using property files.

// after optimization (DemoReadGeneric)

// baseline data for 3j9m
// JavaScript (ms)
//
//  5    niter=11    processStream BINARY StreamParser
//  4    niter=11    processStream BINARY StreamParser
// 146    niter=10    ------PARSE binary (-GC)
// 146    niter=10    ------PARSE binary
//
// Java (ms)
//
//   4    niter=11    processStream BINARY StreamParser
//   3    niter=11    processStream BINARY StreamParser
//   4    niter=11    processStream BINARY StreamParser
//  52    niter=10    ------PARSE binary (-GC)
//  78    niter=10    ------PARSE binary

"(-GC)" there means excluding garbage collection time.

Those are reports from org.rcsb.cif.DemoReadGeneric. This is a test that uses some methods I added that bypass all generated categories and build the BaseColumn information from a relatively small set of property files instead. The code is:

prior to loop:

inputStream.reset();
CifFileGeneric cifFile = CifIOGeneric.readFromInputStream(inputStream);

in the 11-iteration loop:

BlockGeneric data = cifFile.getFirstBlock();
Category atomSite = data.getCategory("atom_site");
int nAtoms = atomSite.getRowCount();
double[][] aatoms = atomSite.fillFloat(new String[] { "cartn_x", "cartn_y", "cartn_z" });
double sum = 0;
for (int i = nAtoms; --i >= 0;)
    sum += aatoms[0][i];
double ave = (nAtoms == 0 ? 0 : sum / nAtoms);

This loop runs approximately 60 times faster in JavaScript than:

FloatColumn cx = ((FloatColumn) atomSite.getColumn("cartn_x"));
FloatColumn cy = ((FloatColumn) atomSite.getColumn("cartn_y"));
FloatColumn cz = ((FloatColumn) atomSite.getColumn("cartn_z"));
OptionalDouble averageCartnX = cx.values().average();
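
The two styles being compared can be isolated from ciftools in a small stand-alone sketch. The class below is hypothetical and works on a plain double[]; it shows only that both forms compute the same average, not the timing difference, which depends on the JavaScript transpilation.

```java
import java.util.OptionalDouble;
import java.util.stream.DoubleStream;

/** Hypothetical comparison of a stream-based and a loop-based average over a raw array. */
final class AverageDemo {
    /** Functional style, analogous to calling values().average() on a column. */
    static OptionalDouble streamAverage(double[] values) {
        return DoubleStream.of(values).average();
    }

    /** Plain loop over the raw array, analogous to the fillFloat-based version. */
    static double loopAverage(double[] values) {
        double sum = 0;
        for (int i = values.length; --i >= 0;) {
            sum += values[i];
        }
        return values.length == 0 ? 0 : sum / values.length;
    }
}
```

The loop version also avoids the OptionalDouble wrapper by returning 0 for an empty array, as the code above does.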
