Zorbage: algebraic data types and algorithms for use in numeric processing.
License: BSD 3-Clause "New" or "Revised" License
Developer Info:

How to include zorbage in your Maven project

Add the following dependency to your project's pom.xml:

  <dependency>
    <groupId>io.github.bdezonia</groupId>
    <artifactId>zorbage</artifactId>
    <version>2.0.3.1</version>
  </dependency>

How to include zorbage in a different build system

See https://search.maven.org/artifact/io.github.bdezonia/zorbage/2.0.3.1/jar for instructions on how to reference zorbage in build systems such as Gradle or others.

Project Goals:
- provide a framework for reusable numeric algorithms
- support numeric computing in Java in an efficient manner
- provide code easier to develop and almost as fast as C++
- provide code that is faster and less error prone than Python/R/Matlab
- support very large data sets in an efficient manner
- break limitations of many programming languages in terms of the computable types provided and the extensibility of such types
- do all this with a powerful set of simple abstractions

Contains support for:
- integers, rationals, reals, complex numbers, quaternions, octonions
- numbers, vectors, matrices, and tensors
- various precisions: 1-bit to 128-bit to unbounded (signed and unsigned and float)
- very large datasets (arrays, virtual files, sparse structures, JDBC storage)
- generic programming
- algebraic/group-theoretic algorithms
- procedural java with object oriented comforts
- the definition of your own types while reusing existing algorithms

Can you show me what Zorbage can do?

For some overviews see:
  https://github.com/bdezonia/zorbage/tree/master/src/main/java/example

Once you've covered that, if you have more questions, look at Zorbage's extensive test code at:
  https://github.com/bdezonia/zorbage/tree/master/src/test/java/nom/bdezonia/zorbage

To see a few small stand alone programs have a look here:
  https://github.com/bdezonia/zorbage/tree/master/example

Thanks, I'll look that up later.
Can you describe what I can do with Zorbage?

Define multidimensional data sets with flexible out of bounds data handling procedures.
Zorbage has an excellent abstraction for multidimensional data that is easy to understand and use and is at the same time quite powerful.

Use data views of your multidimensional data source to very rapidly set and get values.
Data views allow one to write any sample visiting algorithm desired. The views are especially good with nested for loop approaches to iterating through data. Their speed is comparable to direct 1-d array access.

Pull out arbitrary planes of your multidimensional data set with ease.

Use many different data types (133 and counting)
- signed integers (bits: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64, 128, unbounded)
- unsigned integers (bits: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 32, 64, 128)
- rational numbers (unbounded)
- floats (bits: 16, 32, 64, 128, unbounded)
- gaussian integers (bits: 8, 16, 32, 64, unbounded)
- complex numbers (bits: 16, 32, 64, 128)
- quaternions (bits: 16, 32, 64, 128)
- octonions (bits: 16, 32, 64, 128)
- Unicode16 characters and variable length strings and fixed length strings
- booleans
- n-dimensional real Points
- ARGB, RGB, and CIE LAB tuples
- vectors and matrices and tensors made of the various types

Define your own data types
Do you have a custom data type (like RNA base pairs)? Can you define operators between elements (like when two base pairs are equal)? Then you can define their algebra and reuse any zorbage algorithms that accept similarly defined types.

A conversion api exists for moving between types accurately and efficiently.

There are types that are compound: complex numbers, quaternions, and octonions based on any of the floating types.

Types can be stored in arrays, files, sparse structures, and JDBC database tables.

You can allocate and use huge (length up to 2^63 elements) data structures.
Data access revolves around 1-d lists that act like arrays. Arrays can be concatenated, trimmed, subsampled, masked, padded, made read-only, and wrapped in other abstractions. At the heart, each multidim data source has one of these arrays behind it. These arrays are indexed by longs rather than ints, which breaks many of the limitations of Java arrays. Array storage can be in native types (for speed of access) or, for many integer types, bit encoded (to save space).

You can use existing or write your own generic algorithms that work with all the types transparently. For instance Zorbage has one Sort algorithm that can sort a list made of any of the above defined types while doing no data conversions.

You can use existing or write your own algorithms that work with numbers, vectors, matrices, and tensors.

Zorbage includes hundreds of predefined algorithms too. It can find roots, find derivatives, and numerically solve differential equations involving arbitrary scalar and vector functions and procedures. It also includes algorithms from linear algebra, signal processing, statistics, set theory, and analysis.

Algorithms include:
- Basic vector and matrix and tensor operations
- Transcendental functions of various precision floats and matrices
- Basic Runge-Kutta ODE solver
- FFT and Inverse FFT
- Convolutions and Correlations
- Resampling algorithms
- LU Decomposition and LU Solving (simultaneous equation solving)
- Basic statistical functions
- Most C++ STL algorithms replicated
- Numerous parallel algorithms are provided for quickly processing data.

Define complex data sampling algorithms from prebuilt sampling components.

Use type safe first class Function and Procedure objects. Pass them as arguments to code that can transform your data quickly and generically. Write methods that return Functions and Procedures if needed.

Compute values from user defined Functions and Procedures.

Use parsers to create Procedures from strings.
These Procedures represent equations that when fed values will compute a return value (the result of applying the inputs to the equation). Equations can return numbers, vectors, matrices, or tensors. The subcomponents of these can be reals, complexes, quaternions, or octonions. Equations can be built out of numbers, input variable references, constants (like E, PI, etc.) and typical functions like sin(), atan(), exp(), log(), etc. For more info see:
  https://github.com/bdezonia/zorbage/blob/master/EQUATION_LANGUAGE

Why Java?
- numerical computing in Java is underrepresented
- with Java there is no need to worry about memory or pointer bugs
- the JRE optimizes many object allocations into stack allocations
- Java is portable: Zorbage runs everywhere the JVM does
- Java is safer at runtime than C++ while still supporting good performance
- Java is faster at runtime than Python/R/Matlab while also avoiding simple "oops, that typo in my code just killed my long run" situations

Why is Zorbage the way it is?
- procedural programming is efficient
- object oriented comforts can be used where needed
- generics allow for reusable algorithms. Often one can write one algorithm and use it for reals, complex numbers, quaternions, matrices, etc.
- there is no telling what one bored developer can get up to when looking for fun

Zorbage breaks Java barriers
- in Zorbage "arrays" are indexed by 64-bit long integers rather than 32-bit integers. Relatedly, data list sizes can reach 2^63.
- in Zorbage, due to JVM optimizations, you can pass primitives by reference
- supports multidimensional arrays
- breaks the limitation to integer types being in byte aligned sizes
- supports unsigned numbers
- supports 128-bit signed and unsigned integers
- supports 128-bit floating point numbers
- supports 16-bit floating point numbers
- supports high precision floating point numbers
- supports unbounded ints
- supports (unbounded) rational numbers
- builds reals, complexes, quaternions, and octonions from any of the floating types. You can write one algorithm and substitute one runtime parameter to calculate in 16 bit, 32 bit, 64 bit, 128 bit or seemingly unbounded accuracy.

Zorbage is fast
- Zorbage can run numeric algorithms in speeds comparable to C++
- Some multithreaded algorithms are provided
  - matrix multiplies
  - resampling
  - convolutions/correlations
  - fills
  - transforms
  - data conversions
  - apodizations

Zorbage is flexible
- supports n-dimensional data sets with flexible out of bounds data handling procedures
- the n-dimensional data sets can use the equational language to carefully and powerfully calibrate their axes.

Is there a way to get data into a Zorbage backed application?

Code has been written to load ECAT scan data using the zorbage ecat library into Zorbage structures. You can find it here: https://github.com/bdezonia/zorbage-ecat

Code has been written to load Java audio data using the zorbage jaudio library into Zorbage structures. You can find it here: https://github.com/bdezonia/zorbage-jaudio

Code has been written to load raster GIS data using the GDAL library into Zorbage structures. You can find it here: https://github.com/bdezonia/zorbage-gdal

Code has been written to load raster GIS data using the NetCDF library into Zorbage structures. You can find it here: https://github.com/bdezonia/zorbage-netcdf

Code has been written to load Nifti scan data using the zorbage nifti library into Zorbage structures.
You can find it here: https://github.com/bdezonia/zorbage-nifti

Code has been written to load NMR data using the zorbage NMR library into Zorbage structures. You can find it here: https://github.com/bdezonia/zorbage-nmr

Code has been written to load common raster data formats using the SCIFIO library into Zorbage structures. You can find it here: https://github.com/bdezonia/zorbage-scifio

Is there a way to view Zorbage data?

Coding is underway on a data viewer. It is available now (alpha ready). You can find it here: https://github.com/bdezonia/zorbage-viewer

Programming notes

Java 11 info
Zorbage has been compiled and tested using Maven with OpenJDK Java 11 on Linux.

Java 8 info
Zorbage was previously compiled and tested using Maven with OpenJDK Java 8 on Linux. The Zorbage code has been scrubbed of API calls later than Java 8 and the Zorbage library API should be compilable with Java 8.

Java 7 info
At one time Zorbage was compiled and tested using Maven with Oracle Java 7 on the Macintosh. Recently the Zorbage code was updated to take advantage of some Java 8 features so it is possible it will not be callable from Java 7. YMMV

Other platforms
Zorbage has been compiled and tested with many versions of Eclipse and some versions of IntelliJ.

Acknowledgements

Thank you Dr. Pepper for your timely and generous contributions to the project. Zorbage would not be the same without you.
Partial bibliography

Books:
  A Book of Abstract Algebra; Pinter
  A Student's Guide to Vectors and Tensors; Fleisch
  Advanced Engineering Mathematics; Wylie, Barrett
  Algorithms for Computer Algebra; Geddes
  Basic Partial Differential Equations; Bleecker
  Calculus and Analytic Geometry; Thomas, Finney
  Compilers; Aho, Sethi, and Ullman
  Complex Variables and Applications; Brown, Churchill
  Complex Variables with Applications; Wunsch
  Computer Algebra and Symbolic Computation (2 vols); Cohen
  Design Patterns; Gang of Four
  Differential Equations; Ross
  Differential Equations and Linear Algebra; Edwards and Penney
  Digital Image Processing; Burger
  Digital Image Processing; Gonzalez, Woods
  Digital Signal Processing; Proakis
  Discrete Time Signal Processing; Oppenheim and Schafer
  Div, Grad, Curl, and All That
  Elements of Modern Algebra; Gilbert
  Elements of Programming; Stepanov
  From Mathematics to Generic Programming; Stepanov
  Functions of Matrices; Higham
  Handbook of Math and Computational Science; Harris, Stocker
  Haskell School of Expression
  Haskell: the Craft of Functional Programming; Thompson
  Introduction to Abstract Algebra; Dubisch
  Introduction to Algorithms; Cormen, Leiserson, Rivest, Stein
  Introduction to Computer Simulation Methods; Gould
  Introduction to Octonion and Other Non-associative Algebras in Physics; Okubo
  Linear Algebra and Analytic Geometry for Physical Sciences; Landi, Zampini
  Matrix Algebra; Robbin
  Modern Mathematical Methods for Physicists and Engineers; Cantrell
  NMR Data Processing; Hoch and Stern
  Numerical Methods for Engineers; Chapra
  Numerical Methods for Engineers and Scientists; Gilat
  Numerical Methods for Scientists and Engineers; Hamming
  Numerical Recipes; Press et al.
  Partial Differential Equations for Scientists and Engineers; Farlow
  Probability and Statistics for Engineering and the Sciences; Devore
  Programming Clojure; Miller, Halloway, Bedra
  Real World Haskell
  Rings, Fields, and Groups; Arnold
  Schaum's Outline of Tensor Calculus; Kay
  Scientific Computing With Python; Fuhrer, Solem, Verdier
  Structure and Interpretation of Computer Programs; Abelson
  Tensors Made Easy; Bernacchi
  Tensor Calculus Made Simple; Sochi
  The Art of Computer Programming; Knuth

Online manuals:
  GNU Scientific Library documentation
  Boost Library documentation
  C++ STL documentation
  Mathematica documentation
  Matlab documentation
  R documentation
  Julia documentation
  Haskell documentation
  Ruby documentation
  Pascal language definitions

Online articles:
  Many articles on Mathworld and Wikipedia and StackExchange
  Many other web pages and pdfs and slide presentations

Paying it forward

If you like Zorbage as a library or as a source of ideas then please visit my daughter's band's BandCamp page, listen to their music, and buy something if you like what you hear. Thanks.
  https://dearmrwatterson.bandcamp.com/
The real based samplings sometimes rely on internal hard coded tolerances. It would be better if the range of inputs to the samplings could be used to calculate relative tolerances.
I am not sure that the SubTensorBridage class is correct or understandable. Maybe I need two arrays instead of three as inputs. The second array is the full coord specification of the origin (a combo of the last two long[] arrays of the inputs of the current implementation). Then also write some 3d tests with 2d slices in various directions (and forw/back orders if desired). Then find bugs and fix issues.
All the tensor classes have their own cut and paste code for indexToLong() and longToIntegerIndex(). There should be one version that all these classes could reference. Note that there is also a version in the IndexUtils class. Maybe we can massage all this into one helper class somewhere.
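As a rough illustration of what one shared helper could look like, here is a minimal sketch. The class name TensorIndexing, the method names, and the axis ordering (axis 0 varies fastest) are all assumptions made for this example; the existing zorbage methods and the IndexUtils version may order axes or name things differently.

```java
import java.util.Arrays;

/** Hypothetical shared helper for illustration; not an existing zorbage class. */
final class TensorIndexing {

    private TensorIndexing() {}

    /** Converts a multidimensional index to a flat position (assumes axis 0 varies fastest). */
    static long indexToLong(long[] dims, long[] index) {
        if (dims.length != index.length)
            throw new IllegalArgumentException("rank mismatch");
        long pos = 0;
        long stride = 1;
        for (int i = 0; i < dims.length; i++) {
            if (index[i] < 0 || index[i] >= dims[i])
                throw new IndexOutOfBoundsException("axis " + i);
            pos += index[i] * stride;
            stride *= dims[i];
        }
        return pos;
    }

    /** Inverse of indexToLong: writes the multidimensional index for pos into result. */
    static void longToIndex(long[] dims, long pos, long[] result) {
        for (int i = 0; i < dims.length; i++) {
            result[i] = pos % dims[i];
            pos /= dims[i];
        }
    }

    public static void main(String[] args) {
        long[] dims = {4, 3, 2};
        long pos = indexToLong(dims, new long[] {1, 2, 1});
        long[] back = new long[3];
        longToIndex(dims, pos, back);
        System.out.println(pos + " " + Arrays.toString(back));
    }
}
```

Each tensor class could then delegate to the two static methods instead of carrying its own copy.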
The file based storage for types would be much improved if it can be based upon nio's MappedByteBuffer. Whatever is implemented needs to be able to handle huge files and to do so in an efficient manner. Maybe all data types will change from working with RandomAccessFile to something else and thus this might be an api breaking change.
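One constraint to plan for: a single MappedByteBuffer cannot exceed Integer.MAX_VALUE bytes, so a huge file has to be covered by several mappings. The sketch below shows one way to hide that behind a long index. The class MappedDoubleStore and its parameters are hypothetical and exist only for this example; it stores doubles only, does no bounds checking, and says nothing about how zorbage's storage classes would actually be restructured.

```java
import java.io.IOException;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

/** Hypothetical sketch: a long-indexed store of doubles backed by chunked memory maps. */
final class MappedDoubleStore implements AutoCloseable {

    private final FileChannel channel;
    private final MappedByteBuffer[] chunks;
    private final long doublesPerChunk;

    MappedDoubleStore(Path file, long numDoubles, long doublesPerChunk) throws IOException {
        if (numDoubles < 0 || doublesPerChunk <= 0
                || doublesPerChunk > Integer.MAX_VALUE / Double.BYTES)
            throw new IllegalArgumentException("bad sizes");
        this.doublesPerChunk = doublesPerChunk;
        this.channel = FileChannel.open(file, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        int n = Math.toIntExact((numDoubles + doublesPerChunk - 1) / doublesPerChunk);
        chunks = new MappedByteBuffer[n];
        for (int i = 0; i < n; i++) {
            long first = i * doublesPerChunk;
            long count = Math.min(doublesPerChunk, numDoubles - first);
            // Each mapping stays below the 2 GB limit of a single buffer.
            chunks[i] = channel.map(FileChannel.MapMode.READ_WRITE,
                    first * Double.BYTES, count * Double.BYTES);
        }
    }

    double get(long index) {
        int offset = (int) (index % doublesPerChunk) * Double.BYTES;
        return chunks[(int) (index / doublesPerChunk)].getDouble(offset);
    }

    void set(long index, double value) {
        int offset = (int) (index % doublesPerChunk) * Double.BYTES;
        chunks[(int) (index / doublesPerChunk)].putDouble(offset, value);
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path file = Files.createTempFile("mapped", ".bin");
        try (MappedDoubleStore store = new MappedDoubleStore(file, 10, 4)) {
            store.set(9, 2.5);
            System.out.println(store.get(9));
        }
    }
}
```

The chunk size here is a constructor argument so that it can be tuned; a real implementation would also need to decide how many chunks to keep mapped at once for very large files.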
The algebras of the various types are very strongly typed. The G.DBL algebra deals with specific float implementations. The bridge classes are less strict. They implement MatrixMember. However they can't be passed to these strict algebras as arguments to methods. Maybe we need to relax the constraints on the algebras. On top of which we might also move to a SparseMatrix type that as it stands now could not be passed to an algebra that works on FloatMatrix elements only. One reason to do this is to allow a matrix multiply to check if it has sparse arguments and if so do an optimized multiply. Investigate how hard it is to relax type constraints in the algebras. If we could then we would be open to supporting specialized implementations. This seems like a somewhat crucial need.
So far I have written cartesian tensor code embedded in euclidean space. We should endeavor to support general tensors in any possible curved space. This will require some research before something can be implemented.
If you allocate a large JDBC based storage item the heap is exhausted when trying to initialize the data to zero. More objects than necessary are created. Break the initialization code into a series of database calls instead of one big one. Think how best to initialize using little memory and few database transactions.
The GDAL standard supports complex valued types made of two 16-bit shorts or two 32-bit ints. My GDAL reader translates these (widened) as complex floating point numbers. Maybe we should define Gaussian integers made of byte/short/int/long/BigInteger components (and gaussian numbers using rational components) and associated algebras for manipulating them. The GDAL reader could use them (if that is what GDAL means them to be). More research is needed here.
The various sort and partition algorithms are not completely fleshed out. Ideally Sort would call Partition as needed and StableSort would call StablePartition as needed. Both would use a helper QuickSort algorithm that could be defined from the current Sort implementation. This would extend the capabilities of zorbage while also cleaning up some untidiness.
There is a subtensor bridge adapter class that treats a tensor of one set of dimensions as a tensor of a smaller set of dimensions. The existing code may not be functional or is too difficult to use. Improve this code. Waiting on the final implementation of the cartesian tensor code before finishing this task.
The float64 tensor classes are not complete. Once completed they will provide templates for the complex, quaternion, and octonion versions. This will also need to propagate to the float16, float32, and high precision subhierarchies. And land in the G class of global algebras.
The key need is for someone to dig into some (cartesian) tensor books and add a few methods to the float64 version (outerProduct, contract, commaDerivative, semicolonDerivative, maybe innerProduct, maybe others). This would affect the type algebra a bit (probably just the TensorLike interface).
The file based storage structures are not careful about how they allocate and traverse their buffers. Things that might need further error checking:
The Search algorithm was implemented as quickly as possible. It should be revisited and improved to use a better algorithm (perhaps Boyer Moore).
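To give a feel for the kind of algorithm meant here, the sketch below implements Boyer-Moore-Horspool, a simplified variant of Boyer-Moore that uses only the bad-character shift table. It works on byte arrays, which is an assumption made to keep the table small; zorbage's Search operates on generic lists, so a real version would need a different way to key the shift table. The class name HorspoolSearch is hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

/** Hypothetical sketch of Boyer-Moore-Horspool search over bytes; not zorbage code. */
final class HorspoolSearch {

    private HorspoolSearch() {}

    /** Returns the first index at which pattern occurs in text, or -1 if absent. */
    static int indexOf(byte[] text, byte[] pattern) {
        int m = pattern.length;
        int n = text.length;
        if (m == 0) return 0;
        if (m > n) return -1;

        // shift[b] = how far the window may slide when its last byte is b.
        int[] shift = new int[256];
        Arrays.fill(shift, m);
        for (int i = 0; i < m - 1; i++)
            shift[pattern[i] & 0xFF] = m - 1 - i;

        int pos = 0;
        while (pos <= n - m) {
            int j = m - 1;
            // Compare the window against the pattern from right to left.
            while (j >= 0 && text[pos + j] == pattern[j]) j--;
            if (j < 0) return pos;
            pos += shift[text[pos + m - 1] & 0xFF];
        }
        return -1;
    }

    public static void main(String[] args) {
        byte[] text = "hello world".getBytes(StandardCharsets.US_ASCII);
        byte[] pattern = "world".getBytes(StandardCharsets.US_ASCII);
        System.out.println(indexOf(text, pattern));
    }
}
```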
Recently I released the last few tensor classes. They were adapted from the Float64 tensor classes. Find some examples in the literature and write some tests that exercise all the tensor classes.
For a mostly sorted list quicksort performance can approach O(n^2). My Sort algorithm is mostly quicksort based. I have found that climate data might start with many zeroes and thus for a while looks perfectly sorted. This kills quicksort, either in performance time or in stack overflows. We need to fix the Sort algo to work around qsort's worst case. Some references mention using a random partition element but quick experiments showed this might not work. For now (a workaround) I am using StableSort instead of Sort where necessary.
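One possibility worth testing: long runs of equal keys (such as the zeroes mentioned above) can defeat a two-way partition even with a random pivot, because every equal element lands on the same side. A three-way ("Dutch national flag") partition groups all keys equal to the pivot in one pass, and recursing only into the smaller side keeps the stack depth logarithmic. The sketch below shows that technique on a plain double[]; it is not the zorbage Sort implementation, the class name is hypothetical, and it assumes the data contains no NaN values.

```java
import java.util.Arrays;
import java.util.concurrent.ThreadLocalRandom;

/** Hypothetical sketch of three-way quicksort; not the zorbage Sort algorithm. */
final class ThreeWayQuickSort {

    private ThreeWayQuickSort() {}

    /** Sorts a in ascending order. Assumes a contains no NaN values. */
    static void sort(double[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(double[] a, int lo, int hi) {
        while (lo < hi) {
            double pivot = a[ThreadLocalRandom.current().nextInt(lo, hi + 1)];
            // Invariant: a[lo..lt-1] < pivot, a[lt..i-1] == pivot, a[gt+1..hi] > pivot.
            int lt = lo, i = lo, gt = hi;
            while (i <= gt) {
                if (a[i] < pivot) swap(a, lt++, i++);
                else if (a[i] > pivot) swap(a, i, gt--);
                else i++;
            }
            // Recurse into the smaller side and loop on the larger one,
            // which bounds the recursion depth at O(log n).
            if (lt - lo < hi - gt) {
                sort(a, lo, lt - 1);
                lo = gt + 1;
            } else {
                sort(a, gt + 1, hi);
                hi = lt - 1;
            }
        }
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        double[] data = {0, 0, 0, 0, 3, -1, 0, 2, 0};
        sort(data);
        System.out.println(Arrays.toString(data));
    }
}
```

With this partition, an input that is entirely zeroes finishes after a single pass, since the "equal" block covers the whole range and nothing is left to recurse into.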
The matrix spectral norm algorithm is currently empty. Implement it. I think it is already linked to for the default norm implementation of the various matrix classes.
See how by defining various interface classes you can run zorbage algorithms on Spark instances.
Mavenization of the project has never been completed. No code is in place to deploy artifacts to repositories. Finish mavenization. Once the code is in place deploy all tagged versions as artifacts.
Support 128-bit floats (IEEE) in software.
Note for anyone interested in working on this that one could use the Float64 types and algebras as a working template for how to design the classes and for which code needs to be implemented. Maybe code from a permissively licensed library for 128-bit float support could be adapted to Java in zorbage.
Maybe I can define JOCL backed types that implement some zorbage interfaces that allow OpenCL data or methods to interoperate with zorbage.
The current file based storage is a second or third generation implementation. It is much faster than it used to be. But I think it can be even faster. The code right now has one buffer it uses. We should make the code configurable so you can specify number of buffers and buffer size and then write a standalone program that tries a few inline algos (like set all values to random numbers and then sum all values), a few random access algos (like Shuffle) and at least one pathological algo (like Reverse.compute(a,a)). It should time that handful of tests. Then write a driver that varies buffer size and number of buffers predictably and maybe can converge on the best combination of these things.
And link to them in the README. A bunch of "How would I do X in Zorbage?" examples. And at least one easily buildable graphical program that exercises some of its capabilities.
The JDBC storage containers take multiple seconds to pass over a few tests of 50 multifield entities. This needs to be improved. See if I'm causing tons of object creations/destructions or if I am doing something else that might be simple to fix.
The javadoc in the source code has hardly been fleshed out. Go through the classes and document the public methods as much as is feasible.
The tensor product classes were derived from each other. We should refactor this. Similar to the vector and matrix classes we could define a bunch of generic tensor algorithms and call them from the tensor classes. This would eliminate a bunch of duplication.
NIFTI format files are used by many in the life sciences. The spec includes types that Java cannot represent. But zorbage can with a little work. Once ticket #34 is done then work on this ticket. The NIFTI reader should become a new project like the GDAL, NETCDF, and SCIFIO readers are.
There is an edge case when parsing signed numbers with signed integer algebras. The MININT values don't parse correctly due to the temporary representation of MININT as -1 * (MAXINT + 1). This overflows. Ideally the parsing would avoid the negate() call. It could represent the parsing as int("-value") rather than -1*int("value"). This ticket has been hatched from an old recollection so there may be some inaccuracies here.
See the commented out test in:
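The same overflow can be reproduced with plain Java ints, which may help when writing a fix. The sketch below contrasts parsing the magnitude and negating afterwards with accumulating the value in the negative range, which is the approach the JDK's own Integer.parseInt takes. The class and method names are hypothetical; the zorbage parser and its negate() call are not shown and may be structured differently.

```java
/** Hypothetical illustration of the MININT parsing problem using Java ints. */
final class MinIntParsing {

    private MinIntParsing() {}

    /** Parses the magnitude, then negates. Fails for "-2147483648". */
    static int parseThenNegate(String s) {
        boolean negative = s.startsWith("-");
        // The magnitude 2147483648 does not fit in an int, so this throws.
        int magnitude = Integer.parseInt(negative ? s.substring(1) : s);
        return negative ? -magnitude : magnitude;
    }

    /** Accumulates in the negative range so that MIN_VALUE is representable. */
    static int parseSigned(String s) {
        boolean negative = s.startsWith("-");
        int start = negative ? 1 : 0;
        if (start == s.length()) throw new NumberFormatException(s);
        int limit = negative ? Integer.MIN_VALUE : -Integer.MAX_VALUE;
        int multMin = limit / 10;
        int result = 0;
        for (int i = start; i < s.length(); i++) {
            int digit = s.charAt(i) - '0';
            if (digit < 0 || digit > 9) throw new NumberFormatException(s);
            if (result < multMin) throw new NumberFormatException(s);
            result *= 10;
            if (result < limit + digit) throw new NumberFormatException(s);
            result -= digit;
        }
        return negative ? result : -result;
    }

    public static void main(String[] args) {
        System.out.println(parseSigned("-2147483648"));
        try {
            parseThenNegate("-2147483648");
        } catch (NumberFormatException e) {
            System.out.println("parse-then-negate overflowed");
        }
    }
}
```

The negative range of a two's complement integer holds one more value than the positive range, so building the number downward from zero never needs an unrepresentable intermediate.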
There are known algorithms for calculating the log, the sqrt, and the nth root of a matrix. Implement these algorithms. They are all interrelated. See Higham's Functions of Matrices book.
The tensor classes have been undergoing accelerated development and in a few days will be much more complete. There will only be one thing missing from their implementations: semicolonDerivative(). I have not yet found a source for an algorithm that can compute this simply for cartesian tensors. This is going to take some investigation. Implement something for all the tensor class implementations when possible.
There is lots of room for expansion in the equation language:
As it stands zorbage supports sparse data storage, but it does not provide methods that accelerate vector / matrix / tensor algorithms on sparse data. Expand the base algorithms to work efficiently with sparse inputs.
Recently I started running a test server that requires the tests to run in a single threaded fashion. I should make changes to the pom.xml that rely on an environment switch such that the test box runs tests single threaded and the dev boxes run tests in parallel.
Set up a CI server. Have it test zorbage with various versions of Java (6-12) via both Oracle and OpenJDK. Connect it to github so that mvn test is launched on the CI server and the build status is marked appropriately upon commits to master.
Right now the octonion algebras derive from SkewField. In reality the octonion algebra satisfies only a largish subset of what SkewField requires (octonion multiplication is not associative). These differences need to be ironed out and reflected in the current algebra hierarchy. This will take some research.
For the 1.0.0 release I refactored the FileStorage classes. The new implementations contain a lot of duplicate code. Refactor these classes to be less redundant.
This has not been done yet and thus creating hashes of these elements might be fundamentally broken. The equals() contract must decide to be strict (exact class type matches and element values match) or loose (a double could be equal to a byte or a short or an int or a long or a float or a double or all kinds of stuff).
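For reference, here is what the strict option looks like on a small example. The Int8Value class is hypothetical and immutable, which is an assumption made to keep the example simple; zorbage's own element types are not shown and may need different treatment. The point of the sketch is that equals() and hashCode() must agree: two objects that compare equal must return the same hash, which is easy under the strict rule and much harder under the loose rule, where a byte, an int, and a double holding the same value would all have to hash identically.

```java
import java.util.HashSet;
import java.util.Set;

/** Hypothetical value type showing a strict equals()/hashCode() pair; not a zorbage class. */
final class Int8Value {

    private final byte v;

    Int8Value(byte v) {
        this.v = v;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        // Strict rule: the other object must be exactly the same class.
        if (o == null || getClass() != o.getClass()) return false;
        return v == ((Int8Value) o).v;
    }

    @Override
    public int hashCode() {
        // Derived from the same field equals() compares, so equal objects hash alike.
        return Byte.hashCode(v);
    }

    public static void main(String[] args) {
        Set<Object> set = new HashSet<>();
        set.add(new Int8Value((byte) 5));
        System.out.println(set.contains(new Int8Value((byte) 5)));
        System.out.println(new Int8Value((byte) 5).equals(Integer.valueOf(5)));
    }
}
```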