Comments (8)

JonStargaryen commented on August 22, 2024

LinkedHashMap
I decided to use LinkedHashMap because there are use cases where the order of categories matters. For example, reading a file and writing it back should reproduce the original file. I see that this is not relevant for binary files; however, mmCIF becomes basically unreadable for humans when atom_site records are shuffled. In Java, a linked map performs roughly as fast as a HashMap.
I think it's a good solution to ignore the linked aspect of the map impl in JavaScript if it doesn't matter for your application and increases performance drastically.
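To make the ordering point concrete, here is a minimal, self-contained sketch using only java.util. The class name, the helper methods, and the category names are illustrative assumptions and are not taken from ciftools-java.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class CategoryOrderDemo {
    // Inserts some category names in "file order" (names chosen for illustration only).
    static Map<String, Object> fill(Map<String, Object> categories) {
        for (String name : new String[] { "entry", "cell", "symmetry", "atom_site" }) {
            categories.put(name, new Object());
        }
        return categories;
    }

    // Returns the keys in the order the map iterates over them.
    static List<String> keysInIterationOrder(Map<String, Object> categories) {
        return new ArrayList<>(categories.keySet());
    }

    public static void main(String[] args) {
        // LinkedHashMap iterates in insertion order, so a read/write round trip keeps the layout.
        System.out.println(keysInIterationOrder(fill(new LinkedHashMap<>())));
        // HashMap makes no ordering guarantee; the printed order may differ from insertion order.
        System.out.println(keysInIterationOrder(fill(new HashMap<>())));
    }
}
```

If the order is irrelevant for an application (as for binary files), swapping in an unordered map changes only the iteration order, not the contents.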

Stream API
And yeah, like everything, the Java Stream API comes at a price. It's a tradeoff between performance and (subjective) readability. Cases like IntStream.of(input).mapToDouble(i -> f * i).toArray() from FixedPointCodec are surely overkill and should be plain loops. I'll check for similar occurrences and replace them.
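As a rough sketch of the kind of replacement meant here, the stream expression and an equivalent plain loop are shown side by side. The class and method names are hypothetical, and f is assumed to be a double scaling factor; this is not the actual FixedPointCodec code.

```java
import java.util.Arrays;
import java.util.stream.IntStream;

public class FixedPointDecodeDemo {
    // Stream-based variant, mirroring the expression quoted above.
    static double[] decodeWithStream(int[] input, double f) {
        return IntStream.of(input).mapToDouble(i -> f * i).toArray();
    }

    // Plain-loop variant: same result, no stream pipeline or lambda involved.
    static double[] decodeWithLoop(int[] input, double f) {
        double[] output = new double[input.length];
        for (int i = 0; i < input.length; i++) {
            output[i] = f * input[i];
        }
        return output;
    }

    public static void main(String[] args) {
        int[] input = { 1, 2, 3 };
        // Both variants produce the same array.
        System.out.println(Arrays.equals(decodeWithStream(input, 0.5), decodeWithLoop(input, 0.5)));
    }
}
```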

from ciftools-java.

JonStargaryen commented on August 22, 2024

a6b3a31 replaces a couple of stream-based impls with loops. I don't see any performance implications in Java, but I could imagine that the JavaScript code may benefit from it.

BobHanson commented on August 22, 2024

BobHanson commented on August 22, 2024

BobHanson commented on August 22, 2024

I realize this issue is closed, and I don't think I can reopen it myself, but I'd like to suggest that it really is still an issue. Using functional programming with DemoReadGeneric (times in ms):

(JavaScript)
153 niter=11 processStream BINARY StreamParser
175 niter=11 processStream BINARY StreamParser
173 niter=11 processStream BINARY StreamParser
162 niter=11 processStream BINARY StreamParser
1840 niter=10 ------PARSE binary (-GC)
1840 niter=10 ------PARSE binary

vs. allowing the developer to get the raw data for atom_site.cartn_[x|y|z] directly using

double[][] aatoms = atomSite.fillFloat(new String[] { "cartn_x", "cartn_y", "cartn_z" });

(JavaScript)
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
92 niter=10 ------PARSE binary (-GC)
92 niter=10 ------PARSE binary

So that's an enormous speed-up, on the order of 20- to 200-fold.

The effect is not limited to JavaScript. Java is considerably faster as well:

Using the (currently required) functional methods:

2 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
21 niter=10 ------PARSE binary (-GC)
32 niter=10 ------PARSE binary

vs. allowing delivery of the raw arrays:

0 niter=11 processStream BINARY StreamParser
1 niter=11 processStream BINARY StreamParser
12 niter=10 ------PARSE binary (-GC)
25 niter=10 ------PARSE binary

Roughly a two-fold improvement, depending upon how you count.

(All these tests involve reading 3j9m, requesting columns for atom_site.cartn_x, cartn_y, and cartn_z, and then averaging cartn_x.)

JonStargaryen commented on August 22, 2024

The main issue is the way data of a Column can be accessed, correct?

It's possible to write a loop and access individual values using Column#get(int) by specifying the row. That should already be faster than a stream-based impl. It's also possible to access the binary data directly (i.e., get the int[]/double[]/String[] that a column wraps) via FloatColumn#getBinaryData(), IntColumn#getBinaryData(), and StrColumn#getBinaryData(). These functions are not exposed via the Column interface and will crash and burn when called on mmCIF data.

Something like this should be as fast as it gets:
double[] cartn_x = atom_site.getCartnX().getBinaryData();
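The two access styles described above can be sketched as follows, using the averaging of cartn_x from the benchmark as the workload. The Column and FloatColumn types here are simplified stand-ins written for this example, not the ciftools-java classes; only get(int) and getBinaryData() are named in this thread, and getRowCount() is a hypothetical helper.

```java
public class ColumnAccessDemo {
    // Simplified stand-in for the library's Column interface (assumption).
    interface Column {
        int getRowCount(); // hypothetical helper, not confirmed by the thread
    }

    // Simplified stand-in for a float column that wraps a double[].
    static final class FloatColumn implements Column {
        private final double[] data;

        FloatColumn(double[] data) {
            this.data = data;
        }

        public int getRowCount() {
            return data.length;
        }

        // Row-wise access, analogous to the get(int) call mentioned above.
        double get(int row) {
            return data[row];
        }

        // Direct access to the wrapped array, analogous to getBinaryData().
        double[] getBinaryData() {
            return data;
        }
    }

    // Averages a column by asking for one value per row.
    static double averageByRow(FloatColumn column) {
        double sum = 0;
        for (int row = 0; row < column.getRowCount(); row++) {
            sum += column.get(row);
        }
        return sum / column.getRowCount();
    }

    // Averages the raw array; no per-row method call on the column is needed.
    static double averageFromArray(double[] values) {
        double sum = 0;
        for (double value : values) {
            sum += value;
        }
        return sum / values.length;
    }

    public static void main(String[] args) {
        // When a column is looked up by name, only the general type is known,
        // so the caller casts to the concrete column type before asking for the array.
        Column byName = new FloatColumn(new double[] { 1.0, 2.0, 3.0 });
        FloatColumn cartnX = (FloatColumn) byName;
        System.out.println(averageByRow(cartnX));
        System.out.println(averageFromArray(cartnX.getBinaryData()));
    }
}
```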

BobHanson commented on August 22, 2024

JonStargaryen commented on August 22, 2024

> I can't use .getAtomSite() calls, because that's the 4.7 MB required download.

Oh, I understand. If you access categories/columns by name, you have to know the type and cast appropriately. But that way of directly obtaining the binary data should work for your use case:
double[] cartn_x = ((FloatColumn) block.getCategory("atom_site").getColumn("Cartn_x")).getBinaryData();

> Take a look at what I have on my fork. I did a lot of refactoring, and you will see that all the generated api methods and classes are no longer in my edition. It's way more refactoring than you would ever be interested in, I know, so I'm not proposing creating a pull request for that. I will just leave it there for reference.

Yeah, I skimmed through it and adopted some further changes I liked.

> I presume David (and others? do they exist) can read binaryCIF files that have all-lower-case tags. Right?

The old JavaScript implementation is case-sensitive, and I would assume that is still true for the Mol* implementation.

> The nice thing about the property files is that it would be simple to extend those to other CIF dictionaries, not that I am actually going to do that any time soon. Could the schema generator be tweaked to produce these?

I'll adopt your property-file approach in the future. It's a nice idea to split them by category so that categories that are never accessed can be ignored. During that process, I'll also adapt the schema generation to create these files directly from the dictionaries.

> Sorry, Sebastian, I had to remove most of that, though. Although it is very elegant, it was slowing the process down considerably.

No hard feelings. I'm happy that the library is useful and performs well.

Cheers, Sebastian
