rcsb / ciftools-java
A Java library for handling text and BinaryCIF files.
License: MIT License
Currently, Demo.java has a nice command line interface to test parsing of the 1acj PDB entry:
ciftools-java/src/main/java/org/rcsb/cif/Demo.java
Lines 17 to 23 in c589bd5
main in Demo.java could accept a filename to parse, and fall back on 1acj if not given anything.
I am doing a comparison of CIF readers and would like to include CIFTools (cod-developers/CIF-parsers#4). To do so I need a command line interface to parse a given file. I believe having such command line interface would benefit other purposes as well, for example, to see if a CIF file parses correctly.
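A minimal sketch of the requested fallback follows. The class name DemoInput and the default file name "1acj.cif" are hypothetical and not taken from Demo.java; only the argument handling is shown, and the parsing call itself is left as a comment.

```java
import java.nio.file.Path;
import java.nio.file.Paths;

// Hypothetical helper, not part of ciftools-java: picks the file to parse.
final class DemoInput {
    // Assumed default, standing in for the 1acj entry that Demo.java parses today.
    static final String DEFAULT_ENTRY = "1acj.cif";

    // Returns the first command-line argument as a path,
    // or the default entry when no argument is given.
    static Path resolve(String[] args) {
        if (args != null && args.length > 0 && !args[0].isEmpty()) {
            return Paths.get(args[0]);
        }
        return Paths.get(DEFAULT_ENTRY);
    }

    public static void main(String[] args) {
        Path input = resolve(args);
        System.out.println("Would parse: " + input);
        // A real Demo would now hand this path to the reader and report
        // success or failure through the process exit code.
    }
}
```

Returning a non-zero exit code on a parse failure would also make such a command usable as a quick "does this CIF file parse" check in scripts.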
ciftools-java requires exact-case matches for its category and column names. However, http://mmcif.wwpdb.org/docs/tutorials/mechanics/pdbx-mmcif-syntax.html states:
Data category and data item names are not case sensitive.
This corresponds to the "Case sensitivity" section of the CIF 1.1 syntax specification:
https://www.iucr.org/resources/cif/spec/version1.1/cifsyntax#case
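One way to get case-insensitive name lookup in Java is a TreeMap ordered by String.CASE_INSENSITIVE_ORDER. The sketch below is only an illustration of that idea; the class CaseInsensitiveNames is hypothetical and does not reflect how ciftools-java stores categories or columns.

```java
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration, not the ciftools-java data model.
final class CaseInsensitiveNames {
    // With CASE_INSENSITIVE_ORDER as comparator, "Atom_Site" and
    // "atom_site" are treated as the same key.
    static <V> Map<String, V> newNameMap() {
        return new TreeMap<>(String.CASE_INSENSITIVE_ORDER);
    }

    public static void main(String[] args) {
        Map<String, Integer> rowCounts = newNameMap();
        rowCounts.put("atom_site", 3);
        System.out.println(rowCounts.get("ATOM_SITE"));         // 3
        System.out.println(rowCounts.containsKey("Atom_Site")); // true
    }
}
```

An alternative with the same effect is to lower-case every name once, on insertion and on lookup, and keep a plain HashMap.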
Pull #2 addresses this issue; I thought it would be a separate pull, but I see now that it just attached one push to another. So sorry for that. Included there:
Latest update is appreciated, particularly for adding generic string-based references to CIF tags. The purpose of that generic option is to reduce the required download in SwingJS, however, and we still see a preliminary load for DemoReadGeneric of 789 classes (777 when I remove the StandardCharsets dependency of MessagePack), including 666 from rcsb, resulting in a download of 4.7 MB compressed:
12/05/2019 01:09 PM 4,742,217 core_cifdemo.z.js
The "generic"/properties idea is specifically designed to avoid that. My fork is requiring only 152 classes (just 41 from rcsb), bringing that to a much more respectable 0.8 MB:
12/05/2019 01:05 PM 790,585 core_cifdemo.z.js
This, of course, includes the full extent of the Java(Script) Runtime Environment that is required to load and run DemoReadGeneric. If we check how much the rcsb code contributes, we are down to just 96K for the rcsb classes:
12/05/2019 01:51 PM 95,657 core_cifdemo.z.js
To me, that sounds pretty reasonable for a binaryCIF reader.
I am fairly certain that this efficiency requires use of Properties for two reasons:
First, for generic loading, the problem is a relatively simple one: Make sure ModelFactory only creates CATEGORY_MAP and COLUMN_MAP if needed -- that is, when generic loading is not used or when text CIF is being read (generic or not).
Second, for nongeneric loading, we still do not need to load all 600+ categories and map all 7000+ columns. That is the huge download hit, because it requires loading all those categories (as well as the 350K field-name-class-map.csv file).
The solution thus requires both something like a set of property files (compacted to 130K in my implementation, but deliverable in mostly tiny 1-3 KB sets) as well as reflection in order to avoid the downloading of all the category files when text generic or non-generic reading is done.
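The idea of loading small property sets only when a category is first touched can be sketched as follows. Everything here is hypothetical (the class LazyColumnTypes, the "float"/"int"/"str" type labels, and the in-memory stand-in for property files); it does not show the fork's actual implementation and leaves out the reflection step.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import java.util.function.Function;

// Hypothetical sketch: column types come from small per-category
// property sets that are loaded only on first use.
final class LazyColumnTypes {
    private final Function<String, String> source; // category name -> properties text
    private final Map<String, Properties> loaded = new HashMap<>();

    LazyColumnTypes(Function<String, String> source) {
        this.source = source;
    }

    // Returns the declared type of a column, or "str" when nothing is declared.
    String typeOf(String category, String column) {
        Properties props = loaded.computeIfAbsent(category, this::load);
        return props.getProperty(column, "str");
    }

    // Number of categories whose property set has been loaded so far.
    int loadedCategoryCount() {
        return loaded.size();
    }

    private Properties load(String category) {
        Properties props = new Properties();
        String text = source.apply(category);
        if (text != null) {
            try {
                props.load(new StringReader(text));
            } catch (IOException e) {
                throw new IllegalStateException("Unreadable properties for " + category, e);
            }
        }
        return props;
    }

    public static void main(String[] args) {
        // Stand-in for tiny property files; a real loader would read resources.
        Map<String, String> files = new HashMap<>();
        files.put("atom_site", "cartn_x=float\ncartn_y=float\nid=int\n");
        files.put("cell", "length_a=float\n");

        LazyColumnTypes types = new LazyColumnTypes(files::get);
        System.out.println(types.typeOf("atom_site", "cartn_x")); // float
        System.out.println(types.loadedCategoryCount());          // 1
    }
}
```

Only the categories a file actually uses are ever loaded, which is the property that keeps the initial download small in a browser setting.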
Also, just to note, even with that, my code is executing at two to five times the speed of rcsb/ciftools in JavaScript reading 3j9m, even when I add the raw column business to ciftools that I propose to speed all column data access. So there is still something there that is an issue.
My reference implementation is at
https://github.com/BobHanson/ciftools-java
with JavaScript site as
https://github.com/BobHanson/ciftools-java/blob/swingjs/dist/ciftools_site.zip?raw=true
I note that with Chrome, now, for 3j9m, I am seeing the following report from DemoReadGeneric in JavaScript:
rcsb/ciftools-java, with additions for rapid column data retrieval:
5 niter=11 processStream BINARY StreamParser
5 niter=11 processStream BINARY StreamParser
5 niter=11 processStream BINARY StreamParser
5 niter=11 processStream BINARY StreamParser
140 niter=10 ------PARSE binary (-GC)
140 niter=10 ------PARSE binary
bobhanson/ciftools-java, with the same:
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
85 niter=10 ------PARSE binary (-GC)
85 niter=10 ------PARSE binary
Demo complete
compared to Java:
0 niter=11 processStream BINARY StreamParser
0 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
14 niter=10 ------PARSE binary (-GC)
29 niter=10 ------PARSE binary
(where -GC is removing garbage collection from the time for Java)
The 11-iteration time is an average for decoding the cartn_[x,y,z] columns of atom_site and averaging cartn_x.
The 10-iteration time is an average that includes parsing and building the CifFile object.
I do appreciate that even 140 ms for building a CifFile object -- reading 3j3m in its entirety -- is a huge win for binaryCIF! I'm just being greedy here.
The following method should return a long, not an int.
private int readUnsignedInt(DataInputStream inputStream) throws IOException {
return (int) (inputStream.readInt() & 0xFFFFFFFFL);
}
As written, this just returns a number in the normal int range [Integer.MIN_VALUE, Integer.MAX_VALUE]. Larger numbers must be cast as long, not int.
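A corrected version might look like the sketch below. The wrapper class UnsignedIntReader and the helper toUnsigned are hypothetical names added for illustration; the method name and parameter follow the snippet above, with the return type widened to long.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;

final class UnsignedIntReader {
    // Widens a signed 32-bit value to its unsigned interpretation in [0, 2^32 - 1].
    static long toUnsigned(int value) {
        return value & 0xFFFFFFFFL;
    }

    // Reads 4 big-endian bytes and returns them as an unsigned value.
    static long readUnsignedInt(DataInputStream inputStream) throws IOException {
        return toUnsigned(inputStream.readInt());
    }

    public static void main(String[] args) throws IOException {
        byte[] bytes = { (byte) 0xFF, (byte) 0xFF, (byte) 0xFF, (byte) 0xFF };
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            System.out.println(readUnsignedInt(in)); // 4294967295
        }
    }
}
```

Integer.toUnsignedLong(int) from the standard library performs the same widening and could replace the mask.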
Hi!
Using the library in version 3.0.0 from Maven Central
<dependency>
<groupId>org.rcsb</groupId>
<artifactId>ciftools-java</artifactId>
<version>3.0.0</version>
</dependency>
conversion to BinaryCIF fails with the example code given in the README. The file is not actually converted and contains mmCIF content. Please review the files attached. The gene name of this particular structure on UniProt is strange and contains two quote characters: ''cytochrome P450.
Code used for conversion:
CifFile cifFile = CifIO.readFromPath(cifFilePath);
MmCifFile mmCifFile = cifFile.as(StandardSchemata.MMCIF);
// convert to BinaryCIF representation
byte[] output = CifIO.writeBinary(mmCifFile);
Thank you for checking.
AF-O49373-F1-model_v1.zip
Hello,
I am one of the Debian maintainers of ciftools-java. Recently Debian switched from OpenJDK11 to OpenJDK17 and this caused a failure in WriterTest.writeBinary (full Debian log). I got the error with version 3.0.1 but also with version 4.0.3.
I could dig a bit more and compared the bytes of the expected and obtained Strings outputted in the log. Here are the first four lines of the xxd output:
$ xxd expected | head -n4
00000000: 1fef bfbd 0800 0000 0000 0000 efbf bd5d ...............]
00000010: 0760 53ef bfbd efbf bd3f efbf bdef bfbd .`S......?......
00000020: 0eef bfbd 1214 0aef bfbd 7353 6959 efbf ..........sSiY..
00000030: bd40 43ef bfbd 0eef bfbd 0eef bfbd efbf .@C.............
$ xxd got | head -n4
00000000: 1fef bfbd 0800 0000 0000 00ef bfbd efbf ................
00000010: bd5d 0760 53ef bfbd efbf bd3f efbf bdef .].`S......?....
00000020: bfbd 0eef bfbd 1214 0aef bfbd 7353 6959 ............sSiY
00000030: efbf bd40 43ef bfbd 0eef bfbd 0eef bfbd ...@C...........
As you can see, the diff is about removing 0000 in the expected string and adding 00ef bfbd instead.
I strongly suspect this is innocuous, as I met a similar issue with another project (broadinstitute/picard#1840), and the authors explained that the Java gzip implementation had changed, so the change we see is normal. Do you also think so?
I would be happy to provide you with more details if needed. Thanks for your attention!
Best regards,
Pierre
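If each ef bf bd in the dumps above stands for one original non-ASCII byte, the difference falls at byte 9 of the gzip stream, which is the OS field of the header in RFC 1952. A test that is robust to such header differences can compare the decompressed payloads instead of the raw gzip bytes. The sketch below shows that approach with a hypothetical GzipCompare class; it is not the project's WriterTest.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical helper: compares gzip data by content, not by header bytes.
final class GzipCompare {
    static byte[] gzip(byte[] raw) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(raw);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    static byte[] gunzip(byte[] compressed) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPInputStream in = new GZIPInputStream(new ByteArrayInputStream(compressed))) {
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                buffer.write(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    // True when both gzip streams decompress to the same bytes.
    static boolean samePayload(byte[] a, byte[] b) {
        return Arrays.equals(gunzip(a), gunzip(b));
    }

    public static void main(String[] args) {
        byte[] first = gzip("data_test\n_cell.length_a 10.0\n".getBytes(StandardCharsets.UTF_8));
        byte[] second = first.clone();
        second[9] = (byte) (second[9] == 0 ? 0xFF : 0); // alter only the OS header byte
        System.out.println(Arrays.equals(first, second)); // false
        System.out.println(samePayload(first, second));   // true
    }
}
```

Comparing raw bytes also avoids the lossy step of decoding compressed data as a String, which is what produces the ef bf bd replacement sequences in the log.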
More in the way of results than an actual issue. For the record...
LinkedHashMap is not necessary, at least for reading, as the order of keys in CIF files is not significant. Switching to a simple JavaScript map doubles performance in JavaScript.
Decoding speed is marginally increased in JavaScript (5-10%) by using native JavaScript array buffer conversions for MessagePack data arrays, at least in the first decoding from byte[] to short[].
In addition, functional programming in general doesn't translate well into JavaScript, with simpler replacements improving performance by as much as 60-fold (250 ms to 4 ms, for example).
So, as much as possible, if alternatives to those can be provided, it would be great.
Also, the problem with reading text CIF and not knowing what BaseColumn subclasses to assign is taken care of by using property files.
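For the byte[] to short[] decoding step mentioned above, the closest Java counterpart to a JavaScript typed-array view is a ByteBuffer view. The sketch below assumes little-endian 16-bit values; the class Int16Decoder is hypothetical and is not the decoder used in either code base.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical sketch: bulk conversion of raw bytes to 16-bit integers.
final class Int16Decoder {
    // Interprets the input as consecutive little-endian signed 16-bit values.
    static short[] decode(byte[] bytes) {
        if ((bytes.length & 1) != 0) {
            throw new IllegalArgumentException("Length must be a multiple of 2");
        }
        short[] out = new short[bytes.length / 2];
        ByteBuffer.wrap(bytes)
                .order(ByteOrder.LITTLE_ENDIAN)
                .asShortBuffer()
                .get(out); // bulk copy instead of a per-element shift-and-or loop
        return out;
    }

    public static void main(String[] args) {
        byte[] raw = { 0x01, 0x00, (byte) 0xFF, (byte) 0xFF, 0x00, 0x01 };
        short[] values = decode(raw);
        System.out.println(values[0] + " " + values[1] + " " + values[2]); // 1 -1 256
    }
}
```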
// after optimization (DemoReadGeneric)
// baseline data for 3j9m
// JavaScript (ms)
//
// 5 niter=11 processStream BINARY StreamParser
// 4 niter=11 processStream BINARY StreamParser
// 146 niter=10 ------PARSE binary (-GC)
// 146 niter=10 ------PARSE binary
//
// Java (ms)
//
// 4 niter=11 processStream BINARY StreamParser
// 3 niter=11 processStream BINARY StreamParser
// 4 niter=11 processStream BINARY StreamParser
// 52 niter=10 ------PARSE binary (-GC)
// 78 niter=10 ------PARSE binary
"(-GC)" there means excluding garbage collection time.
Those are reports from org.rcsb.cif.DemoReadGeneric. This is a test that uses some methods that I added that bypass all generated categories and build the BaseColumn information from a relatively small set of property files instead. The code is:
prior to loop:
inputStream.reset();
CifFileGeneric cifFile = CifIOGeneric.readFromInputStream(inputStream);
in the 11-iteration loop:
BlockGeneric data = cifFile.getFirstBlock();
Category atomSite = data.getCategory("atom_site");
int nAtoms = atomSite.getRowCount();
double[][] aatoms = atomSite.fillFloat(new String[] { "cartn_x", "cartn_y", "cartn_z" });
double sum = 0;
for (int i = nAtoms; --i >= 0;)
sum += aatoms[0][i];
double ave = (nAtoms == 0 ? 0 : sum / nAtoms);
This loop runs approximately 60 times faster in JavaScript than:
FloatColumn cx = ((FloatColumn) atomSite.getColumn("cartn_x"));
FloatColumn cy = ((FloatColumn) atomSite.getColumn("cartn_y"));
FloatColumn cz = ((FloatColumn) atomSite.getColumn("cartn_z"));
OptionalDouble averageCartnX = cx.values().average();
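The two styles compared above can be reproduced without the library on a plain double[]. The sketch below uses a hypothetical AverageDemo class and only shows that both forms compute the same average; it makes no timing claim, and for general inputs the two results may differ in the last bits because the stream version uses a different summation order.

```java
import java.util.Arrays;

// Hypothetical stand-alone comparison of the two averaging styles.
final class AverageDemo {
    // Plain counted loop over a primitive array, as in the faster variant.
    static double loopAverage(double[] values) {
        double sum = 0;
        for (int i = values.length; --i >= 0;) {
            sum += values[i];
        }
        return values.length == 0 ? 0 : sum / values.length;
    }

    // Stream-based variant, as in the FloatColumn.values().average() call.
    static double streamAverage(double[] values) {
        return Arrays.stream(values).average().orElse(0);
    }

    public static void main(String[] args) {
        double[] cartnX = { 1.0, 2.0, 3.0, 6.0 };
        System.out.println(loopAverage(cartnX));   // 3.0
        System.out.println(streamAverage(cartnX)); // 3.0
    }
}
```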