Zorbage: algebraic data types and algorithms for use in numeric processing.

License: BSD 3-Clause "New" or "Revised" License


Developer Info:

How to include zorbage in your Maven project

  Add the following dependency to your project's pom.xml:
  
  <dependency>
    <groupId>io.github.bdezonia</groupId>
    <artifactId>zorbage</artifactId>
    <version>2.0.3.1</version>
  </dependency>
  
How to include zorbage in a different build system

  See https://search.maven.org/artifact/io.github.bdezonia/zorbage/2.0.3.1/jar
  for instructions on how to reference zorbage in build systems such as
  Gradle or others.

Project Goals:
  - provide a framework for reusable numeric algorithms
  - support numeric computing in Java in an efficient manner
    - provide code easier to develop and almost as fast as C++
    - provide code that is faster and less error prone than Python/R/Matlab
  - support very large data sets in an efficient manner
  - break limitations of many programming languages in terms of the 
      computable types provided and the extensibility of such types
  - do all this with a powerful set of simple abstractions

Contains support for:
  - integers, rationals, reals, complex numbers, quaternions, octonions
  - numbers, vectors, matrices, and tensors
  - various precisions: 1-bit to 128-bit to unbounded (signed and unsigned and float)
  - very large datasets (arrays, virtual files, sparse structures, JDBC storage)
  - generic programming
  - algebraic/group-theoretic algorithms
  - procedural java with object oriented comforts
  - the definition of your own types while reusing existing algorithms

Can you show me what Zorbage can do?

  For some overviews see:

    https://github.com/bdezonia/zorbage/tree/master/src/main/java/example

  Once you've covered those, if you have more questions, look at Zorbage's
    extensive test code at:

    https://github.com/bdezonia/zorbage/tree/master/src/test/java/nom/bdezonia/zorbage

  To see a few small standalone programs, have a look here:

    https://github.com/bdezonia/zorbage/tree/master/example

Thanks, I'll look that up later. Can you describe what I can do with Zorbage?

  Define multidimensional data sets with flexible out of bounds data
    handling procedures. Zorbage has an excellent abstraction for multi-
    dimensional data that is easy to understand and use and is at the same
    time quite powerful.
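
  One way to picture an out of bounds handling procedure is the sketch below.
    It is not Zorbage's API: the class PaddedSignal and its Policy enum are
    hypothetical names invented for illustration, and it wraps a plain 1-d
    double array rather than a Zorbage data source. Reads outside the array
    return a constant, the nearest edge value, or a mirrored value.

```java
// Hypothetical sketch, not Zorbage code: out of bounds reads on a 1-d array.
final class PaddedSignal {
    enum Policy { CONSTANT, CLAMP, MIRROR }

    private final double[] data;
    private final Policy policy;
    private final double fill; // only used by Policy.CONSTANT

    PaddedSignal(double[] data, Policy policy, double fill) {
        if (data.length == 0) throw new IllegalArgumentException("empty data");
        this.data = data;
        this.policy = policy;
        this.fill = fill;
    }

    // Any long index is legal; the policy decides what lies outside the data.
    double get(long i) {
        int n = data.length;
        if (i >= 0 && i < n) return data[(int) i];
        switch (policy) {
            case CONSTANT:
                return fill;
            case CLAMP:
                return data[i < 0 ? 0 : n - 1];
            default: // MIRROR: reflect at the edges, repeating the edge sample
                long period = 2L * n;
                long m = Math.floorMod(i, period);
                return data[(int) (m < n ? m : period - 1 - m)];
        }
    }

    public static void main(String[] args) {
        PaddedSignal s = new PaddedSignal(new double[] {1, 2, 3}, Policy.MIRROR, 0.0);
        for (long i = -3; i <= 5; i++) {
            System.out.println(i + " -> " + s.get(i));
        }
    }
}
```

  Keeping the policy separate from the data means the same visiting code can
    run near the edges of a data set without special cases.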

  Use data views of your multidimensional data source to very rapidly set
    and get values. Data views allow one to write any sample visiting
    algorithm desired. The views are especially good with nested for loop
    approaches to iterating through data. Their speed is comparable to
    direct 1-d array access. Pull out arbitrary planes of your multi-
    dimensional data set with ease.
        
  Use many different data types (133 and counting)

    - signed integers (bits: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                           14, 15, 16, 32, 64, 128, unbounded)

    - unsigned integers (bits: 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13,
                           14, 15, 16, 32, 64, 128)

    - rational numbers (unbounded)
    
    - floats (bits: 16, 32, 64, 128, unbounded)

    - gaussian integers (bits: 8, 16, 32, 64, unbounded)
    
    - complex numbers (bits: 16, 32, 64, 128)
    
    - quaternions (bits: 16, 32, 64, 128)
    
    - octonions (bits: 16, 32, 64, 128)
    
    - Unicode16 characters and variable length strings and fixed length strings
  
    - booleans
    
    - n-dimensional real Points
    
    - ARGB, RGB, and CIE LAB tuples
    
    - vectors and matrices and tensors made of the various types
  
  Define your own data types
  
    Do you have a custom data type (like RNA base pairs)? Can you define operators
      between elements (like when two base pairs are equal)? Then you can define
      their algebra and reuse any zorbage algorithms that accept similarly defined
      types.
  
  A conversion api exists for moving between types accurately and efficiently.
  
  There are types that are compound: complex numbers, quaternions, and
    octonions based on any of the floating types.
    
  Types can be stored in arrays, files, sparse structures, and JDBC
    database tables.
    
  You can allocate and use huge (length up to 2^63 elements) data structures.
  
  Data access revolves around 1-d lists that act like arrays. Arrays can be
    concatenated, trimmed, subsampled, masked, padded, or made read-only, among
    other abstractions. At the heart, each multidim data source has one of
    these arrays behind it. These arrays are indexed by longs rather than
    ints, which breaks many of the limitations of Java arrays.
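
  To make the idea of long-indexed lists concrete, here is a minimal sketch
    that splits storage into chunks so the total length can exceed what a
    single Java array allows. It is an assumption-laden illustration, not
    Zorbage's storage code: the class LongIndexedDoubles and the chunk size
    are made up for this example.

```java
// Hypothetical sketch, not Zorbage code: a list of doubles indexed by long.
final class LongIndexedDoubles {
    private static final int CHUNK_BITS = 20;               // 2^20 elements per chunk
    private static final int CHUNK_SIZE = 1 << CHUNK_BITS;
    private static final int CHUNK_MASK = CHUNK_SIZE - 1;

    private final double[][] chunks;
    private final long size;

    LongIndexedDoubles(long size) {
        if (size < 0) throw new IllegalArgumentException("negative size");
        this.size = size;
        // With int-indexed chunk tables this sketch tops out near 2^51 elements.
        long numChunks = (size >>> CHUNK_BITS) + ((size & CHUNK_MASK) != 0 ? 1 : 0);
        chunks = new double[Math.toIntExact(numChunks)][];
        long remaining = size;
        for (int i = 0; i < chunks.length; i++) {
            chunks[i] = new double[(int) Math.min(CHUNK_SIZE, remaining)];
            remaining -= chunks[i].length;
        }
    }

    long size() {
        return size;
    }

    double get(long index) {
        check(index);
        return chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)];
    }

    void set(long index, double value) {
        check(index);
        chunks[(int) (index >>> CHUNK_BITS)][(int) (index & CHUNK_MASK)] = value;
    }

    private void check(long index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + size);
        }
    }

    public static void main(String[] args) {
        LongIndexedDoubles list = new LongIndexedDoubles(3_000_000L);
        list.set(2_999_999L, 7.5);
        System.out.println(list.get(2_999_999L));
    }
}
```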
      
  Array storage can be in native types (for speed of access) or for many
    integer types they can be bit encoded (to save space).
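
  Bit encoding can be sketched as follows for unsigned 4-bit values: sixteen
    values share one 64-bit word. The class PackedUInt4List is hypothetical
    and uses an int index for brevity; it only illustrates the space-saving
    idea and does not reproduce how Zorbage lays out its bits.

```java
// Hypothetical sketch, not Zorbage code: unsigned 4-bit values packed into longs.
final class PackedUInt4List {
    private static final int BITS = 4;
    private static final int PER_WORD = 64 / BITS; // 16 values per long

    private final long[] words;
    private final int size;

    PackedUInt4List(int size) {
        if (size < 0) throw new IllegalArgumentException("negative size");
        this.size = size;
        this.words = new long[(size + PER_WORD - 1) / PER_WORD];
    }

    int get(int index) {
        check(index);
        int shift = (index % PER_WORD) * BITS;
        return (int) ((words[index / PER_WORD] >>> shift) & 0xFL);
    }

    void set(int index, int value) {
        check(index);
        if (value < 0 || value > 15) throw new IllegalArgumentException("value out of range: " + value);
        int word = index / PER_WORD;
        int shift = (index % PER_WORD) * BITS;
        // Clear the old 4 bits, then write the new ones.
        words[word] = (words[word] & ~(0xFL << shift)) | ((long) value << shift);
    }

    // Number of longs actually allocated.
    int wordCount() {
        return words.length;
    }

    private void check(int index) {
        if (index < 0 || index >= size) throw new IndexOutOfBoundsException("index " + index);
    }

    public static void main(String[] args) {
        PackedUInt4List list = new PackedUInt4List(100);
        list.set(99, 12);
        System.out.println(list.get(99) + " stored in " + list.wordCount() + " longs");
    }
}
```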
  
  You can use existing or write your own generic algorithms that work with
    all the types transparently. For instance Zorbage has one Sort algorithm
    that can sort a list made of any of the above defined types while doing
    no data conversions.
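
  The general shape of an algorithm written once against an algebra can be
    sketched like this. The interface AddAlgebra and the class GenericSum are
    hypothetical and far simpler than Zorbage's real algebra hierarchy; the
    point is only that sum() never mentions a concrete number type.

```java
import java.math.BigInteger;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch, not Zorbage code: the operations an algorithm needs.
interface AddAlgebra<U> {
    U zero();
    U add(U a, U b);
}

final class GenericSum {
    static final AddAlgebra<Double> DOUBLES = new AddAlgebra<Double>() {
        @Override public Double zero() { return 0.0; }
        @Override public Double add(Double a, Double b) { return a + b; }
    };

    static final AddAlgebra<BigInteger> BIG_INTS = new AddAlgebra<BigInteger>() {
        @Override public BigInteger zero() { return BigInteger.ZERO; }
        @Override public BigInteger add(BigInteger a, BigInteger b) { return a.add(b); }
    };

    // One algorithm, reused for every type that supplies an AddAlgebra.
    static <U> U sum(AddAlgebra<U> algebra, List<U> values) {
        U total = algebra.zero();
        for (U value : values) {
            total = algebra.add(total, value);
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sum(DOUBLES, Arrays.asList(1.5, 2.5)));
        System.out.println(sum(BIG_INTS, Arrays.asList(BigInteger.ONE, BigInteger.TEN)));
    }
}
```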
  
  You can use existing or write your own algorithms that work with numbers,
    vectors, matrices, and tensors. Zorbage includes hundreds of predefined
    algorithms too. It can find roots, find derivatives, and solve differential
    equations numerically involving arbitrary scalar and vector functions and
    procedures. It also includes algorithms from linear algebra, signal
    processing, statistics, set theory, and analysis.
    Algorithms include:
    - Basic vector and matrix and tensor operations
    - Transcendental functions of various precision floats and matrices
    - Basic Runge-Kutta ODE solver
    - FFT and Inverse FFT
    - Convolutions and Correlations
    - Resampling algorithms
    - LU Decomposition and LU Solving (simultaneous equation solving)
    - Basic statistical functions
    - Most C++ STL algorithms replicated
    - Numerous parallel algorithms are provided for quickly processing data.
  
  Define complex data sampling algorithms from prebuilt sampling components.
  
  Use type safe first class Function and Procedure objects. Pass them as
    arguments to code that can transform your data quickly and generically.
    Write methods that return Functions and Procedures if needed. Compute
    values from user defined Functions and Procedures.
 
  Use parsers to create Procedures from strings. These Procedures represent
    equations that when fed values will compute a return value (the result of
    applying the inputs to the equation). Equations can return numbers, vectors,
    matrices, or tensors. The subcomponents of these can be reals, complexes,
    quaternions, or octonions. Equations can be built out of numbers, input
    variable references, constants (like E, PI, etc.) and typical functions
    like sin(), atan(), exp(), log(), etc. For more info see:
    https://github.com/bdezonia/zorbage/blob/master/EQUATION_LANGUAGE

Why Java?

  - numerical computing in Java is underrepresented
  - with Java there is no need to worry about memory or pointer bugs
  - the JVM's JIT compiler can often avoid heap allocation for short-lived
      objects (escape analysis)
  - Java is portable: Zorbage runs everywhere the JVM does
  - Java is safer at runtime than C++ while still supporting good performance
  - Java is faster at runtime than Python/R/Matlab while also avoiding simple
      "oops, that typo in my code just killed my long run" situations

Why is Zorbage the way it is?

  - procedural programming is efficient
  - object oriented comforts can be used where needed
  - generics allow for reusable algorithms. Often one can write one
      algorithm and use it for reals, complex numbers, quaternions,
      matrices, etc.
  - there is no telling what one bored developer can get up to when
      looking for fun
  
Zorbage breaks Java barriers
  
  - in Zorbage "arrays" are indexed by 64-bit long integers rather
      than 32-bit integers. Relatedly, data list sizes can reach
      2^63.
      
  - in Zorbage, due to JVM optimizations, you can pass primitives by reference

  - supports multidimensional arrays
      
  - breaks the limitation to integer types being in byte aligned sizes
    
  - supports unsigned numbers
    
  - supports 128-bit signed and unsigned integers
    
  - supports 128-bit floating point numbers
    
  - supports 16-bit floating point numbers
    
  - supports high precision floating point numbers
    
  - supports unbounded ints
    
  - supports (unbounded) rational numbers
    
  - builds reals, complexes, quaternions, and octonions from any
      of the floating types. You can write one algorithm and substitute
      one runtime parameter to calculate in 16 bit, 32 bit, 64 bit, 128
      bit or seemingly unbounded accuracy.
    
Zorbage is fast
  
  - Zorbage can run numeric algorithms at speeds comparable to C++
    
  - Some multithreaded algorithms are provided
      - matrix multiplies
      - resampling
      - convolutions/correlations
      - fills
      - transforms
      - data conversions
      - apodizations

Zorbage is flexible
  
  - supports n-dimensional data sets with flexible out of bounds data
      handling procedures
      
  - the n-dimensional data sets can use the equation language to
      carefully and powerfully calibrate their axes.

Is there a way to get data into a Zorbage backed application?

  Code has been written to load ECAT scan data using the zorbage
    ecat library into Zorbage structures. You can find it here:
    
    https://github.com/bdezonia/zorbage-ecat

  Code has been written to load Java audio data using the zorbage
    jaudio library into Zorbage structures. You can find it here:

    https://github.com/bdezonia/zorbage-jaudio

  Code has been written to load raster GIS data using the GDAL library
    into Zorbage structures. You can find it here:
  
    https://github.com/bdezonia/zorbage-gdal

  Code has been written to load raster GIS data using the NetCDF library
    into Zorbage structures. You can find it here:
    
    https://github.com/bdezonia/zorbage-netcdf

  Code has been written to load Nifti scan data using the zorbage
    nifti library into Zorbage structures. You can find it here:
    
    https://github.com/bdezonia/zorbage-nifti

  Code has been written to load NMR data using the zorbage NMR
    library into Zorbage structures. You can find it here:

    https://github.com/bdezonia/zorbage-nmr

  Code has been written to load common raster data formats using
    the SCIFIO library into Zorbage structures. You can find it here:
  
    https://github.com/bdezonia/zorbage-scifio
  
Is there a way to view Zorbage data?

  Coding is underway on a data viewer. It is available now (alpha ready).
  You can find it here: https://github.com/bdezonia/zorbage-viewer

Programming notes

  Java 11 info
    
    Zorbage has been compiled and tested using Maven with OpenJDK Java 11 on Linux

  Java 8 info

    Zorbage was previously compiled and tested using Maven with OpenJDK Java 8 on Linux

      The Zorbage code has been scrubbed of API calls later than Java 8 and the
      Zorbage library API should be compilable with Java 8.

  Java 7 info

    At one time Zorbage was compiled and tested using Maven with Oracle Java 7 on the
      Macintosh. Recently the Zorbage code was updated to take advantage of some Java 8
      features so it is possible it will not be callable from Java 7. YMMV

  Other platforms
    
    Zorbage has been compiled and tested with many versions of Eclipse and some versions
    of IntelliJ.

Acknowledgements

  Thank you Dr. Pepper for your timely and generous contributions to the
  project. Zorbage would not be the same without you.

Partial bibliography

  Books:
  
  A Book of Abstract Algebra; Pinter
  A Student's Guide to Vectors and Tensors; Fleisch
  Advanced Engineering Mathematics; Wylie, Barrett
  Algorithms for Computer Algebra; Geddes
  Basic Partial Differential Equations; Bleecker
  Calculus and Analytic Geometry; Thomas, Finney
  Compilers; Aho, Sethi, and Ullman
  Complex Variables and Applications; Brown, Churchill
  Complex Variables with Applications; Wunsch
  Computer Algebra and Symbolic Computation (2 vols); Cohen
  Design Patterns; Gamma, Helm, Johnson, Vlissides
  Differential Equations; Ross
  Differential Equations and Linear Algebra; Edwards and Penney
  Digital Image Processing; Burger
  Digital Image Processing; Gonzalez, Woods
  Digital Signal Processing; Proakis
  Discrete Time Signal Processing; Oppenheim and Schafer
  Div, Grad, Curl, and All That
  Elements of Modern Algebra; Gilbert
  Elements of Programming; Stepanov
  From Mathematics to Generic Programming; Stepanov
  Functions of Matrices; Higham
  Handbook of Math and Computational Science; Harris, Stocker
  Haskell School of Expression
  Haskell: the Craft of Functional Programming; Thompson
  Introduction to Abstract Algebra; Dubisch
  Introduction to Algorithms; Cormen, Leiserson, Rivest, Stein
  Introduction to Computer Simulation Methods; Gould
  Introduction to Octonion and Other Non-associative Algebras in Physics; Okubo 
  Linear Algebra and Analytic Geometry for Physical Sciences; Landi, Zampini
  Matrix Algebra; Robbin
  Modern Mathematical Methods for Physicists and Engineers; Cantrell
  NMR Data Processing; Hoch and Stern
  Numerical Methods for Engineers; Chapra
  Numerical Methods for Engineers and Scientists; Gilat
  Numerical Methods for Scientists and Engineers; Hamming
  Numerical Recipes; Press et al.
  Partial Differential Equations for Scientists and Engineers; Farlow
  Probability and Statistics for Engineering and the Sciences; Devore
  Programming Clojure; Miller, Halloway, Bedra
  Real World Haskell
  Rings, Fields, and Groups; Arnold
  Schaum's Outline of Tensor Calculus; Kay
  Scientific Computing With Python; Fuhrer, Solem, Verdier
  Structure and Interpretation of Computer Programs; Abelson
  Tensors Made Easy; Bernacchi
  Tensor Calculus Made Simple; Sochi
  The Art of Computer Programming; Knuth
 
  Online manuals:
  
  GNU Scientific Library documentation
  Boost Library documentation
  C++ STL documentation
  Mathematica documentation
  Matlab documentation
  R documentation
  Julia documentation
  Haskell documentation
  Ruby documentation
  Pascal language definitions

  Online articles:
  
  Many articles on Mathworld and Wikipedia and StackExchange
  Many other web pages and pdfs and slide presentations

Paying it forward
  
  If you like Zorbage as a library or as a source of ideas then please
  visit my daughter's band's BandCamp page, listen to their music, and
  buy something if you like what you hear. Thanks.
  
    https://dearmrwatterson.bandcamp.com/

zorbage's Issues

The real based sampling algorithms should calculate their own tolerances

The real based samplings sometimes rely on internal hard coded tolerances. It would be better if the range of inputs to the samplings could be used to calculate relative tolerances.

Test SubTensorBridge thoroughly and fix any issues found

I am not sure that the SubTensorBridge class is correct or understandable. Maybe I need two arrays instead of three as inputs. The second array would be the full coord specification of the origin (a combo of the last two long[] arrays of the inputs of the current implementation). Then also write some 3d tests with 2d slices in various directions (and forw/back orders if desired). Then find bugs and fix issues.

Eliminate all the duplicate indexToLong methods

All the tensor classes have their own cut and paste code for indexToLong() and longToIntegerIndex(). There should be one version that all these classes could reference. Note that there is also a version in the IndexUtils class. Maybe we can massage all this into one helper class somewhere.
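
A single shared helper of the kind this ticket asks for might look like the sketch below. It is only an illustration: the class name IndexMath and the method names toLinear/fromLinear are hypothetical, and the layout (first dimension varies fastest) is an assumption that may not match the existing tensor classes.

```java
// Hypothetical sketch, not Zorbage code: one place for index arithmetic.
final class IndexMath {
    private IndexMath() { }

    // Convert an n-d index into a position in the backing 1-d list.
    // Assumes the first dimension varies fastest.
    static long toLinear(long[] dims, long[] index) {
        if (dims.length != index.length) throw new IllegalArgumentException("rank mismatch");
        long linear = 0;
        long stride = 1;
        for (int i = 0; i < dims.length; i++) {
            if (index[i] < 0 || index[i] >= dims[i]) {
                throw new IndexOutOfBoundsException("axis " + i + " index " + index[i]);
            }
            linear += index[i] * stride;
            stride *= dims[i];
        }
        return linear;
    }

    // Inverse of toLinear for the same dims.
    static long[] fromLinear(long[] dims, long linear) {
        long[] index = new long[dims.length];
        for (int i = 0; i < dims.length; i++) {
            index[i] = linear % dims[i];
            linear /= dims[i];
        }
        return index;
    }

    public static void main(String[] args) {
        long[] dims = {3, 4, 5};
        long linear = toLinear(dims, new long[] {2, 1, 3});
        System.out.println(linear + " -> " + java.util.Arrays.toString(fromLinear(dims, linear)));
    }
}
```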

Redesign the file storage to use MappedByteBuffer

The file based storage for types would be much improved if it can be based upon nio's MappedByteBuffer. Whatever is implemented needs to be able to handle huge files and to do so in an efficient manner. Maybe all data types will change from working with RandomAccessFile to something else and thus this might be an api breaking change.
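
As a starting point for discussion, here is a minimal sketch of long-indexed double storage on top of MappedByteBuffer. A single mapping is limited to Integer.MAX_VALUE bytes, so the sketch maps the file as a series of windows. The class MappedDoubleFile and the 1 GiB window size are assumptions for illustration; nothing here reflects the current Zorbage file storage classes.

```java
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

// Hypothetical sketch, not Zorbage code: doubles in a file, indexed by long.
final class MappedDoubleFile implements AutoCloseable {
    // A multiple of 8 so that no double straddles two windows.
    private static final long WINDOW_BYTES = 1L << 30;

    private final FileChannel channel;
    private final MappedByteBuffer[] windows;
    private final long count;

    MappedDoubleFile(Path path, long count) throws IOException {
        if (count < 0) throw new IllegalArgumentException("negative count");
        this.count = count;
        long totalBytes = Math.multiplyExact(count, 8L);
        long numWindows = Math.addExact(totalBytes, WINDOW_BYTES - 1) / WINDOW_BYTES;
        windows = new MappedByteBuffer[Math.toIntExact(numWindows)];
        channel = FileChannel.open(path, StandardOpenOption.CREATE,
                StandardOpenOption.READ, StandardOpenOption.WRITE);
        if (totalBytes > 0 && channel.size() < totalBytes) {
            // Grow the file explicitly by writing one zero byte at the end.
            channel.write(ByteBuffer.wrap(new byte[1]), totalBytes - 1);
        }
        for (int i = 0; i < windows.length; i++) {
            long offset = i * WINDOW_BYTES;
            long length = Math.min(WINDOW_BYTES, totalBytes - offset);
            windows[i] = channel.map(FileChannel.MapMode.READ_WRITE, offset, length);
        }
    }

    double get(long index) {
        long off = byteOffset(index);
        return windows[(int) (off / WINDOW_BYTES)].getDouble((int) (off % WINDOW_BYTES));
    }

    void set(long index, double value) {
        long off = byteOffset(index);
        windows[(int) (off / WINDOW_BYTES)].putDouble((int) (off % WINDOW_BYTES), value);
    }

    private long byteOffset(long index) {
        if (index < 0 || index >= count) throw new IndexOutOfBoundsException("index " + index);
        return index * 8L;
    }

    @Override
    public void close() throws IOException {
        channel.close();
    }

    public static void main(String[] args) throws IOException {
        Path path = Files.createTempFile("mapped-doubles", ".bin");
        try (MappedDoubleFile file = new MappedDoubleFile(path, 1000)) {
            file.set(999, 2.5);
            System.out.println(file.get(999));
        }
    }
}
```

The operating system decides which pages of each window are resident, so the application no longer manages its own buffers. Whether that beats the current RandomAccessFile approach would need measuring.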

The algebras are too strict

The algebras of the various types are very strongly typed. The G.DBL algebra deals with specific float implementations. The bridge classes are less strict. They implement MatrixMember. However they can't be passed to these strict algebras as arguments to methods. Maybe we need to relax the constraints on the algebras. On top of which we might also move to a SparseMatrix type that as it stands now could not be passed to an algebra that works on FloatMatrix elements only. One reason to do this is to allow a matrix multiply to check if it has sparse arguments and if so do an optimized multiply. Investigate how hard it is to relax type constraints in the algebras. If we could then we would be open to supporting specialized implementations. This seems like a somewhat crucial need.

Implement general tensors

So far I have written cartesian tensor code embedded in euclidean space. We should endeavor to support general tensors in any possible curved space. This will require some research before something can be implemented.

JDBC based storage can crash during initialization

If you allocate a large JDBC based storage item the heap is exhausted when trying to initialize the data to zero. More objects than necessary are created. Break the initialization code into a series of database calls instead of one big one. Think how best to initialize using little memory and few database transactions.

Support Gaussian integer types and algebras

The GDAL standard supports complex valued types made of two 16-bit shorts or two 32-bit ints. My GDAL reader translates these (widened) as complex floating point numbers. Maybe we should define Gaussian integers made of byte/short/int/long/BigInteger components (and gaussian numbers using rational components) and associated algebras for manipulating them. The GDAL reader could use them (if that is what GDAL means them to be). More research is needed here.

Standardize Sort/StableSort and Partition/StablePartition

The various sort and partition algorithms are not completely fleshed out. Ideally Sort would call Partition as needed and StableSort would call StablePartition as needed. Both would use a helper QuickSort algorithm that could be defined from the current Sort implementation. This would extend the capabilities of zorbage while also cleaning up some untidiness.

Fix the subtensor bridge code

There is a subtensor bridge adapter class that treats a tensor of one set of dimensions as a tensor of a smaller set of dimensions. The existing code may not be functional or is too difficult to use. Improve this code. Waiting on the final implementation of the cartesian tensor code before finishing this task.

The tensor code is not yet totally functional

The float64 tensor classes are not complete. Once completed they will provide templates for the complex, quaternion, and octonion versions. This will also need to propagate to the float16, float32, and high precision subhierarchies. And land in the G class of global algebras.

The key need is for someone to dig into some (cartesian) tensor books and add a few methods to the float64 version (outerProduct, contract, commaDerivative, semicolonDerivative, maybe innerProduct, maybe others). This would affect the type algebra a bit (probably just the TensorLike interface).

The file based storage structures need enhancements

The file based storage structures are not careful about how they allocate and traverse their buffers. Things that might need further error checking:

  - a U type with zero or negative number of components
  - a U type that has more components than the number of elements in the buffer
  - making sure no U lives partly in one buffer and partly in another
  - the need for allocating a buffer that can contain large type U's rather than a fixed buffer size

The Search algorithm should use a better algorithm

The Search algorithm was implemented as quickly as possible. It should be revisited and improved to use a better algorithm (perhaps Boyer Moore).

Write tests for the tensor classes

Recently I released the last few tensor classes. They were adapted from the Float64 tensor classes. Find some examples in the literature and write some tests that exercise all the tensor classes.

The Sort code is susceptible to quicksort worst case performance

For a mostly sorted list quicksort performance can approach O(n^2). My Sort algorithm is mostly quicksort based. I have found that climate data might start with many zeroes and thus for a while looks perfectly sorted. This kills quicksort, either in performance time or in stack overflows. We need to fix the Sort algo to work around qsort's worst case. Some references mention using a random partition element but quick experiments showed this might not work. For now (a workaround) I am using StableSort instead of Sort where necessary.
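
One commonly used mitigation, offered here only as a sketch and not as what Zorbage's Sort does or will do, is three-way partitioning: elements equal to the pivot are grouped in a single pass and never revisited, so long runs of identical values (such as many zeroes) stop being a problem. The sketch also takes the pivot from the middle of the range, which helps on already sorted input, and recurses only into the smaller side so the stack depth stays logarithmic. It does not remove every adversarial worst case. The class name ThreeWayQuickSort is hypothetical and it sorts a plain double[] assumed to contain no NaN values.

```java
// Hypothetical sketch, not Zorbage code: quicksort with three-way partitioning.
final class ThreeWayQuickSort {
    private ThreeWayQuickSort() { }

    static void sort(double[] a) {
        sort(a, 0, a.length - 1);
    }

    private static void sort(double[] a, int lo, int hi) {
        while (lo < hi) {
            double pivot = a[lo + (hi - lo) / 2];
            int lt = lo;   // a[lo..lt-1]  < pivot
            int i = lo;    // a[lt..i-1]  == pivot
            int gt = hi;   // a[gt+1..hi]  > pivot
            while (i <= gt) {
                if (a[i] < pivot) {
                    swap(a, lt++, i++);
                } else if (a[i] > pivot) {
                    swap(a, i, gt--);
                } else {
                    i++;
                }
            }
            // Recurse into the smaller side, loop on the larger one.
            if (lt - lo < hi - gt) {
                sort(a, lo, lt - 1);
                lo = gt + 1;
            } else {
                sort(a, gt + 1, hi);
                hi = lt - 1;
            }
        }
    }

    private static void swap(double[] a, int i, int j) {
        double t = a[i];
        a[i] = a[j];
        a[j] = t;
    }

    public static void main(String[] args) {
        double[] data = new double[200_000]; // mostly zeroes
        data[5] = 3.0;
        data[199_999] = -1.0;
        sort(data);
        System.out.println(data[0] + " ... " + data[data.length - 1]);
    }
}
```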

Finish the matrix spectral norm algorithm

The matrix spectral norm algorithm is currently empty. Implement it. I think it is already linked to for the default norm implementation of the various matrix classes.

Investigate how zorbage can interface with Apache Spark

See how by defining various interface classes you can run zorbage algorithms on Spark instances.

Finish mavenization so that zorbage jars can be deployed to repositories

Mavenization of the project has never been completed. No code is in place to deploy artifacts to repositories. Finish mavenization. Once the code is in place deploy all tagged versions as artifacts.

Write 128 bit floating point types and algebras

Support 128-bit floats (IEEE) in software.

Note for anyone interested in working on this that one could use the Float64 types and algebras as a working template for how to design the classes and for which code needs to be implemented. Maybe code from a permissively licensed library for 128-bit float support could be adapted to Java in zorbage.

Investigate how zorbage could interface with JOCL/OpenCL and how it can accelerate things

Maybe I can define JOCL backed types that implement some zorbage interfaces that allow OpenCL data or methods to interoperate with zorbage.

Accelerate the file based storage classes

The current file based storage is a second or 3rd generation implementation. It is much faster than it used to be. But I think it can be even faster. The code right now has one buffer it uses. We should make the code configurable so you can specify number of buffers and buffer size and then write a standalone program that tries a few inline algos (like set all values to random numbers and then sum all values), a few random access algos (like Shuffle) and at least one pathological algo (like Reverse.compute(a,a)). It should time those handful of tests and then write a driver that varies buffer size and number of buffers predictably and maybe can converge on the best combination of these things.

Write some nontrivial examples

And link to them in the README. A bunch of "How would I do X in Zorbage?" examples. And at least one easily buildable graphical program that exercises some of its capabilities.

The JDBC storage data structures are too slow

The JDBC storage containers take multiple seconds to pass over a few tests of 50 multifield entities. This needs to be improved. See if I'm causing tons of object creations/destructions or if I am doing something else that might be simple to fix.

The javadoc in the source code needs fleshing out

The javadoc in the source code has hardly been fleshed out. Go through the classes and document the public methods as much as is feasible.

Refactor the tensor classes

The tensor product classes were derived from each other. We should refactor this. Similar to the vector and matrix classes we could define a bunch of generic tensor algorithms and call them from the tensor classes. This would eliminate a bunch of duplication.

Write a NIFTI file reader

NIFTI format files are used by many in the life sciences. The spec includes types that Java cannot represent. But zorbage can with a little work. Once ticket #34 is done then work on this ticket. The NIFTI reader should become a new project like the GDAL, NETCDF, and SCIFIO readers are.

The parsing of min values of signed int classes is broken

There is a limitation edge case with parsing signed numbers with signed integer algebras. The MININT values don't parse correctly due to the temporary representation of -MININT as -1 * (MAXINT + 1). This overflows. Ideally the parsing would avoid the negate() call. It could represent the parsing as int("-value") rather than -1*int("value"). This ticket has been hatched from an old recollection so there may be some inaccuracies here.

See the commented out test in:

https://github.com/bdezonia/zorbage/blob/master/src/test/java/nom/bdezonia/zorbage/procedure/impl/parse/TestRealNumberEquation.java
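
The usual way around the negate() overflow is to accumulate the digits as a negative number, since the negative range of a two's complement integer is one larger than the positive range. The sketch below shows this for a plain 32-bit int; the class MinIntParsing is hypothetical and is not the Zorbage parser.

```java
// Hypothetical sketch, not Zorbage code: parse a signed int without negating MININT.
final class MinIntParsing {
    private MinIntParsing() { }

    static int parseInt(String text) {
        if (text == null || text.isEmpty()) throw new NumberFormatException("empty input");
        boolean negative = text.charAt(0) == '-';
        int start = negative ? 1 : 0;
        if (start == text.length()) throw new NumberFormatException("no digits: " + text);

        int result = 0; // always <= 0 while accumulating
        for (int i = start; i < text.length(); i++) {
            int digit = Character.digit(text.charAt(i), 10);
            if (digit < 0) throw new NumberFormatException("bad digit in: " + text);
            // Would result * 10 - digit drop below Integer.MIN_VALUE?
            if (result < (Integer.MIN_VALUE + digit) / 10) {
                throw new NumberFormatException("out of range: " + text);
            }
            result = result * 10 - digit;
        }
        if (negative) return result;
        if (result == Integer.MIN_VALUE) throw new NumberFormatException("out of range: " + text);
        return -result; // safe: result > Integer.MIN_VALUE here
    }

    public static void main(String[] args) {
        System.out.println(parseInt("-2147483648") == Integer.MIN_VALUE);
        System.out.println(parseInt("2147483647") == Integer.MAX_VALUE);
    }
}
```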

Matrix log / sqrt / cbrt

There are known algorithms for calculating the log, the sqrt, and the nth root of a matrix. Implement these algorithms. They are all interrelated. See Higham's Functions of Matrices book.

The tensor classes do not yet implement semicolonDerivative()

The tensor classes have been undergoing accelerated development and in a few days will be much more complete. There will only be one thing missing from their implementations: semicolonDerivative(). I have not yet found a source for an algorithm that can compute this simply for cartesian tensors. This is going to take some investigation. Implement something for all the tensor class implementations when possible.

Improve the equation language

There is lots of room for expansion in the equation language:

  - vec_or_mat_or_ten raised to the int_power
  - number * vec/mat/ten
  - factorial
  - erf/erfc (Done? If not then easily doable I think)
  - atan2
  - matrix constants whose dims are not 0x0 (imagine declaring it as PI(2) or E(3) in language to give it a square 2x2 or 3x3 shape)
  - parsing and assigning -minint correctly (Done I think)
  - supporting tuples
  - supporting a/rgbs
  - supporting points
  - supporting booleans and also and/or/not/shift
  - like the E(5) and PI(3) we can do rand(7,7) for a 7x7 matrix of random values in the parsing language.
  - ramp function
  - step function
  - etc.

Sparse algorithm support

As it stands zorbage supports sparse data storage only. It does not yet provide methods that accelerate the calculation of vector / matrix / tensor algorithms on sparse data. Expand the base algorithms to work efficiently with sparse inputs.

Edit pom.xml to support multiple testing profiles

Recently I started running a test server that requires the tests to run in a single threaded fashion. I should make changes to the pom.xml that rely on an environment switch such that the test box runs tests singly threaded and the dev boxes run tests in parallel.

Set up a CI server

Set up a CI server. Have it test zorbage with various versions of Java (6-12) via both Oracle and OpenJDK. Connect it to github so that mvn test is launched on the CI server and the build status is marked appropriately upon commits to master.

Refine algebra of Octonions

Right now the octonion algebras derive from SkewField. In fact the octonion algebra in reality is a largish subset of SkewField. These differences need to be ironed out and reflected in the current algebra hierarchy. This will take some research.

Refactor the FileStorage classes to eliminate duplicate code

For the 1.0.0 release I refactored the FileStorage classes. The new implementations contain a lot of duplicate code. Refactor these classes to be less redundant.

Implement hashCode() and equals() for all the numeric types

This has not been done yet and thus creating hashes of these elements might be fundamentally broken. The equals() contract must decide to be strict (exact class type matches and element values match) or loose (a double could be equal to a byte or a short or an int or a long or a float or a double or all kinds of stuff).
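
For reference, the strict option could look like the sketch below. The class StrictFloat64 is a hypothetical stand-in for a numeric member type, not an existing Zorbage class. It compares bit patterns, which matches the behavior of java.lang.Double.equals(): NaN equals NaN and 0.0 does not equal -0.0.

```java
// Hypothetical sketch, not Zorbage code: strict equals()/hashCode() for a numeric holder.
final class StrictFloat64 {
    private double v;

    StrictFloat64(double v) {
        this.v = v;
    }

    double v() {
        return v;
    }

    void setV(double v) {
        this.v = v;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        // Strict: only the exact same class can be equal.
        if (o == null || getClass() != o.getClass()) return false;
        StrictFloat64 other = (StrictFloat64) o;
        return Double.doubleToLongBits(v) == Double.doubleToLongBits(other.v);
    }

    @Override
    public int hashCode() {
        // Consistent with equals(): derived from the same bit pattern.
        return Double.hashCode(v);
    }

    public static void main(String[] args) {
        StrictFloat64 a = new StrictFloat64(1.5);
        StrictFloat64 b = new StrictFloat64(1.5);
        System.out.println(a.equals(b) + " " + (a.hashCode() == b.hashCode()));
    }
}
```

Because such values are mutable, changing one while it sits in a HashSet or is used as a HashMap key would make it unfindable; that caveat applies to any equals() choice made for mutable numeric types.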