Almost Realism Scientific Computing and Machine Learning Libraries

Tools for high-performance scientific computing, generative art, and machine learning in Java with pluggable native acceleration. Currently supporting OpenCL (X86/ARM) and Metal (Aarch64); CUDA support is in progress.


All of these examples can be found in the test directory of the utils module in this repository.

Using Tensor

Tensor is a data structure that is used to represent multi-dimensional arrays of data. Before a tensor can be used in computations, it has to be packed (at which point its shape becomes immutable).

public class MyNativeEnabledApplication implements CodeFeatures {
	// ....

	public void createTensor() {
		// Create the tensor
		Tensor<Double> t = new Tensor<>();
		t.insert(1.0, 0, 0);
		t.insert(2.0, 0, 1);
		t.insert(5.0, 0, 2);

		Tensor<Double> p = new Tensor<>();
		p.insert(3.0, 0);
		p.insert(4.0, 1);
		p.insert(5.0, 2);

		// Prepare the computation
		CollectionProducer<PackedCollection<?>> product = multiply(c(t.pack()), c(p.pack()));

		// Compile the computation and evaluate it
		product.get().evaluate().print();

		// Note that you can also combine compilation and evaluation into
		// one step, if you are not planning to reuse the compiled expression
		// for multiple evaluations.
		product.evaluate().print();
	}
}

Using Variables

Mathematical operations can use both constant values and variable values:

public class MyNativeEnabledApplication implements CodeFeatures {
	// ....

	public void variableMath() {
		// Define argument 0
		Producer<PackedCollection<?>> arg = v(shape(2), 0);

		// Compose the expression
		Producer<PackedCollection<?>> constantOperation = c(7.0).multiply(arg);

		// Compile the expression
		Evaluable<PackedCollection<?>> compiledOperation = constantOperation.get();

		// Evaluate the expression repeatedly
		System.out.println("7 * 3 | 7 * 2 = ");
		compiledOperation.evaluate(pack(3, 2)).print();
		System.out.println("7 * 4 | 7 * 3 = ");
		compiledOperation.evaluate(pack(4, 3)).print();
		System.out.println("7 * 5 | 7 * 4 = ");
		compiledOperation.evaluate(pack(5, 4)).print();
	}
}

This also shows some other features, including the pack() method for reserving memory, and that operations can broadcast over different shapes.

Controlling Execution

You normally do not need to worry about how resources on the underlying system are reserved and managed. If it matters for your use case, two Context concepts give you more control over when and how non-JVM resources are allocated. DataContexts are the top-level Context concept. No memory can be shared between DataContexts, so most applications will use a single DataContext. However, there may be scenarios where it is desirable to wipe all the off-heap data that is in use (given that there is no garbage collection for this data, this becomes quite important for longer-running "service" applications).

DataContexts are global to the JVM; there is only ever one DataContext in effect at a time.

public class MyNativeEnabledApplication implements CodeFeatures {
	// ....

	public void performThreeExperiments() {
		for (int i = 0; i < 3; i++) {
			dc(() -> {
				// Define argument 0
				Producer<PackedCollection<?>> arg = v(shape(2), 0);

				// Compose the expression
				Producer<PackedCollection<?>> constantOperation = c(7.0).multiply(arg);

				// Compile the expression
				Evaluable<PackedCollection<?>> compiledOperation = constantOperation.get();

				// Evaluate the expression repeatedly
				System.out.println("7 * 3 | 7 * 2 = ");
				compiledOperation.evaluate(pack(3, 2)).print();
				System.out.println("7 * 4 | 7 * 3 = ");
				compiledOperation.evaluate(pack(4, 3)).print();
				System.out.println("7 * 5 | 7 * 4 = ");
				compiledOperation.evaluate(pack(5, 4)).print();
			});
		}
	}
}

In this example, our accelerated operations are repeated 3 times. However, after each one, all of the resources are destroyed. This corresponds to terminating the cl_context if you are using OpenCL, or wiping local storage if you are using an external executable, etc.

The other Context concept, ComputeContext, exists within a particular DataContext. A ComputeContext tracks only the compiled InstructionSets, and memory can be shared across different ComputeContexts. Not every kind of ComputeContext can be used with every kind of memory, however: if you are using CLDataContext and memory is stored on a GPU device, there is no way to use it with, for example, an ExternalComputeContext without incurring a lot of costly memory copying, so be careful with these choices.

ComputeContexts are global to the Thread, so a JVM can have multiple ComputeContexts at once if multiple Threads were used to create them.

When creating a ComputeContext, you can instruct the DataContext of your expectations, and it will make a best effort to fulfill them.

Note: It will not fail when your expectations cannot be met; it will just provide computing resources other than what you expected.

public class MyNativeEnabledApplication implements CodeFeatures {
	// ....

	public void useCpuAndGpu() {
		PackedCollection<?> result = new PackedCollection<>(shape(1));

		Producer<PackedCollection<?>> sum = add(c(1.0), c(2.0));
		Producer<PackedCollection<?>> product = multiply(c(3.0), c(2.0));

		cc(() -> a(2, p(result), sum).get().run(), ComputeRequirement.CPU);
		System.out.println("Result = " + result.toArrayString());

		cc(() -> a(2, p(result), product).get().run(), ComputeRequirement.GPU);
		System.out.println("Result = " + result.toArrayString());
	}
}

In this example, we perform addition using some available context that supports the CPU, but we perform multiplication using a (potentially separate) context that supports the GPU. The example also shows some other features, including the a() method for assignment. Assignment produces a Runnable rather than an Evaluable, and the run() method is used to execute it. Be aware that the ComputeContext used for a given Evaluable or Runnable is always the one that was in effect when the Evaluable or Runnable was compiled via the get() method.

Note: The contexts available on a given machine will depend on the hardware.

Parallelization via Accelerator

Although you can use the tool with multiple threads (compiled operations are thread-safe), you may want to leverage parallelization that cannot be accomplished with Java's Thread concept. If you are targeting a GPU with OpenCL or Metal, for example, you'll want to express a collection of operations as a single operation. This works the same way as the earlier examples.

public class MyNativeEnabledApplication implements CodeFeatures {
	// ....

	public void kernelEvaluation() {
		// Define argument 0
		Producer<PackedCollection<?>> arg = v(shape(1), 0);

		// Compose the expression
		Producer<PackedCollection<?>> constantOperation = c(7.0).multiply(arg);

		// Compile the expression
		Evaluable<PackedCollection<?>> compiledOperation = constantOperation.get();

		PackedCollection<?> bank = new PackedCollection<>(shape(3)).traverse();
		bank.set(0, 3.0);
		bank.set(1, 4.0);
		bank.set(2, 5.0);

		PackedCollection<?> results = new PackedCollection<>(shape(3)).traverse();

		// Evaluate the expression with the accelerator deciding how to parallelize it
		compiledOperation.into(results).evaluate(bank);

		System.out.println("7 * 3, 7 * 4, 7 * 5 = ");
		results.print();
	}
}

There are a few nuances here, besides the improved performance of parallelization via SIMD or other kernel features of your available hardware. One is the Evaluable::into method, which allows you to specify the destination of the results. This is useful when you are using (or reusing) a pre-existing data structure. The other is the traverse() method, which is used to adjust the traversal axis of the PackedCollection. More on this in the next section.

Every data structure that deals with one or more collections of numbers has a TraversalPolicy that tells other components how it is expected to be traversed during a computation. Some useful properties of the TraversalPolicy (often called the shape) are shown in the example below.

public class MyNativeEnabledApplication implements CodeFeatures {
	// ....

	public void shapes() {
		// The shape is a 3D array with 10x4x2 elements, and 80 elements in total.
		// However, it will be treated for the purpose of GPU parallelism as one
		// value with 80 elements. In this case, 1 is referred to as the count and
		// 80 is referred to as the size.
		TraversalPolicy shape = shape(10, 4, 2);
		System.out.println("Shape = " + shape.toStringDetail());
		// Shape = (10, 4, 2)[axis=0|1x80]
		//           <dims> [Count x Size]

		// What if we want to operate on groups of elements at once, via SIMD or
		// some other method? We can simply adjust the traversal axis
		shape = shape.traverse();
		System.out.println("Shape = " + shape.toStringDetail());
		// Shape = (10, 4, 2)[axis=1|10x8]
		//           <dims> [Count x Size]
		// --> Now we have 10 groups of 8 elements each, and 10 operations can work on
		// 8 elements each - at the same time.

		shape = shape.traverseEach(); // Move the traversal axis to the innermost dimension
		shape = shape.consolidate(); // Move the traversal axis back by 1 position
		shape = shape.item(); // Pull off just the shape of one item in the parallel group
		System.out.println("Shape = " + shape.toStringDetail());
		// Shape = (2)[axis=0|1x2]
		// --> And that's just one item from the original shape (which contained 40 of them).
	}
}
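The count and size values printed above follow from simple arithmetic on the dimensions: the count is the product of the dimensions before the traversal axis, and the size is the product of the dimensions from the traversal axis onward. The plain-Java sketch below reproduces that arithmetic without using the library; the class and method names (TraversalSketch, count, size) are hypothetical and exist only for illustration.

```java
class TraversalSketch {
    // Product of the dimensions before the traversal axis: the number of
    // items that can be processed independently.
    static int count(int[] dims, int axis) {
        int c = 1;
        for (int i = 0; i < axis; i++) c *= dims[i];
        return c;
    }

    // Product of the dimensions from the traversal axis onward: the number
    // of values in each item.
    static int size(int[] dims, int axis) {
        int s = 1;
        for (int i = axis; i < dims.length; i++) s *= dims[i];
        return s;
    }

    public static void main(String[] args) {
        int[] dims = {10, 4, 2};
        // Walk the traversal axis from the outermost to the innermost position
        for (int axis = 0; axis <= dims.length; axis++) {
            System.out.println("axis=" + axis + " | "
                    + count(dims, axis) + "x" + size(dims, axis));
        }
    }
}
```

For the (10, 4, 2) shape, axis 0 gives 1x80 and axis 1 gives 10x8, matching the output shown in the example above, while axis 2 gives the 40 items of 2 elements each that the final comment refers to.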

More Collection Operations

There are plenty of other operations besides the ones described in the tutorial. They are covered briefly here.


The repeat operation is used to repeat the item (see TraversalPolicy::item, described above) of a collection some finite number of times. This is useful for broadcasting operations in different ways.

public class MyNativeEnabledApplication implements CodeFeatures {
    // ....

	public void repeat() {
		PackedCollection<?> a = pack(2, 3).reshape(2, 1);
		PackedCollection<?> b = pack(4, 5).reshape(2);

This example produces [8, 10, 12, 15]. Notice how the broadcast behavior here is different because of the repeat operation: each item (size 1) of the 2x1 collection is repeated twice, producing [2, 2, 3, 3], which is then multiplied by the other value to produce the result.
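The arithmetic behind that result can be traced with plain arrays. The sketch below does not use the library: repeatItems and multiplyCycling are hypothetical helpers, and cycling through the shorter operand is an assumption about how the broadcast pairs values, chosen because it reproduces the result stated above.

```java
import java.util.Arrays;

class RepeatSketch {
    // Repeat each item (a run of itemSize values) the given number of times.
    static double[] repeatItems(double[] data, int itemSize, int times) {
        double[] out = new double[data.length * times];
        int pos = 0;
        for (int start = 0; start < data.length; start += itemSize) {
            for (int t = 0; t < times; t++) {
                for (int k = 0; k < itemSize; k++) {
                    out[pos++] = data[start + k];
                }
            }
        }
        return out;
    }

    // Multiply element by element, cycling through the shorter operand b
    // (an assumption made for this illustration).
    static double[] multiplyCycling(double[] a, double[] b) {
        double[] out = new double[a.length];
        for (int i = 0; i < a.length; i++) out[i] = a[i] * b[i % b.length];
        return out;
    }

    public static void main(String[] args) {
        double[] a = {2, 3};
        double[] b = {4, 5};
        double[] repeated = repeatItems(a, 1, 2);
        System.out.println(Arrays.toString(repeated));
        System.out.println(Arrays.toString(multiplyCycling(repeated, b)));
    }
}
```

Running main prints [2.0, 2.0, 3.0, 3.0] followed by [8.0, 10.0, 12.0, 15.0].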


The enumerate operation is used to move over a collection some finite number of times, along some axis of that collection. This is useful for performing a lot of common tensor operations, like transpose or convolution. The enumerate() method accepts an axis, a length, and a stride.

public class MyNativeEnabledApplication implements CodeFeatures {
    // ....

	public void enumerate() {
		PackedCollection<?> a =
				pack(2, 3, 4, 5, 6, 7, 8, 9)
						.reshape(2, 4);
		PackedCollection<?> r = c(a).enumerate(1, 2, 2).evaluate();
		// Shape = (2, 2, 2)[axis=1|1x8]

		// [2.0, 3.0]
		// [6.0, 7.0]
		// [4.0, 5.0]
		// [8.0, 9.0]
	}
}

Here the enumerate operation iterates over axis 1, collecting 2 numbers at a time, and then advancing by 2 along the axis. The result is a pair of 2x2 collections, that is, a 2x2x2 tensor. The stride can also be omitted if it is the same as the length, as it is in the example.
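To make the ordering of the result concrete, here is a plain-Java sketch of the same windowing over a flat array, independent of the library. The helper enumerateColumns is hypothetical, and it assumes the 2x4 input is stored row by row.

```java
import java.util.Arrays;

class EnumerateSketch {
    // Slide a window of `length` columns across a rows x cols matrix (stored
    // row by row), advancing by `stride` columns each step. The output is laid
    // out window by window, and row by row within each window.
    static double[] enumerateColumns(double[] data, int rows, int cols,
                                     int length, int stride) {
        int windows = (cols - length) / stride + 1;
        double[] out = new double[windows * rows * length];
        int pos = 0;
        for (int w = 0; w < windows; w++) {
            for (int r = 0; r < rows; r++) {
                for (int k = 0; k < length; k++) {
                    out[pos++] = data[r * cols + w * stride + k];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        double[] a = {2, 3, 4, 5, 6, 7, 8, 9};
        System.out.println(Arrays.toString(enumerateColumns(a, 2, 4, 2, 2)));
    }
}
```

This prints [2.0, 3.0, 6.0, 7.0, 4.0, 5.0, 8.0, 9.0], the same ordering as the pairs listed in the example.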


To take just a slice of a collection, the subset operation can be used. It accepts shape and position information for the slice.

public class MyNativeEnabledApplication implements CodeFeatures {
    // ....

	public void subset3d() {
		int w = 2;
		int h = 4;
		int d = 3;

		int x0 = 4;
		int y0 = 3;
		int z0 = 2;

		PackedCollection<?> a = new PackedCollection<>(shape(10, 10, 10));

		CollectionProducer<PackedCollection<?>> producer = subset(shape(w, h, d), c(a), x0, y0, z0);
		Evaluable<PackedCollection<?>> ev = producer.get();
		PackedCollection<?> subset = ev.evaluate();

		for (int i = 0; i < w; i++) {
			for (int j = 0; j < h; j++) {
				for (int k = 0; k < d; k++) {
					double expected = a.valueAt(x0 + i, y0 + j, z0 + k);
					double actual = subset.valueAt(i, j, k);
					Assert.assertEquals(expected, actual, 0.0001);
				}
			}
		}
	}
}
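The index arithmetic that such a slice implies can be sketched in plain Java. This is an illustration only, not the library's implementation: the subset helper is hypothetical, and it assumes the collection is stored in row-major order (last dimension varying fastest).

```java
class SubsetSketch {
    // Copy a slice of shape sliceDims, starting at position pos, out of a
    // 3D array with shape dims that is stored flat in row-major order.
    static double[] subset(double[] data, int[] dims, int[] sliceDims, int[] pos) {
        int w = sliceDims[0], h = sliceDims[1], d = sliceDims[2];
        double[] out = new double[w * h * d];
        for (int i = 0; i < w; i++) {
            for (int j = 0; j < h; j++) {
                for (int k = 0; k < d; k++) {
                    int src = ((pos[0] + i) * dims[1] + (pos[1] + j)) * dims[2]
                            + (pos[2] + k);
                    out[(i * h + j) * d + k] = data[src];
                }
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Fill a 10x10x10 array so that each value equals its flat index
        double[] a = new double[1000];
        for (int i = 0; i < a.length; i++) a[i] = i;

        double[] s = subset(a, new int[] {10, 10, 10},
                new int[] {2, 4, 3}, new int[] {4, 3, 2});
        System.out.println("First value of the slice: " + s[0]);
    }
}
```

Because each value equals its flat index, the first value of the slice is 432.0, which is the element at position (4, 3, 2) of the source array.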

More Complex Operations

All these atomic operations (together with the standard mathematical operations) can be combined to achieve basically any kind of tensor algebra you may need. If you find otherwise, please [open an issue][issues-url].

Special Purpose Types

Although PackedCollection is a general purpose data structure for laying out numerical values in memory, and any operation can ultimately use it, there are other types that extend PackedCollection to provide functionality that may be specific to a particular domain.


The Pair type is used to store two values. This seems like it wouldn't require a type of its own, but there are some unique things that make sense only for a pair of values. Some Pair subclasses are listed below.

  1. ComplexNumber - A pair of real numbers, used together to represent a complex number.
  2. TemporalScalar - A pair of a number and a time, used to represent scalar values distributed along a time series.
  3. Scalar - A number and a corresponding uncertainty, used to represent a measurement that may not be precisely known.
  4. Photon - A wavelength and a phase, used to represent a photon in a manner that accounts for quantum interference.
  5. CursorPair - A pair of values that form an interval.


The Vector type stores three values and is used for geometric operations, especially 3D graphics. It has one subclass, Vertex, which represents a position in space but includes references to values for a normal or gradient and a Pair of texture coordinates.


The RGB type stores three values and is used for color operations. It has one subclass, RGBA, which keeps a reference to an additional value for an alpha (transparency) channel.


The MeshData type is used to store the data for a 3D mesh. It directly represents all the data needed for rendering a triangulated 3D object, including the positions of the vertices and the normal/gradient values.

Other Mathematical Operations

Complex Numbers

The sum of two complex numbers is obtained by adding their real parts and their imaginary parts separately, so complex addition and subtraction need no special handling: the CollectionProducer::add and CollectionProducer::subtract methods work just as well as for other types of data. However, multiplying two complex numbers is not trivially reducible to multiplication of the individual values, and hence cannot be accomplished with some variation of the tensor multiplication provided by CollectionProducer::multiply (this is similarly true for exponentiation, etc).

For this case, special methods are available.

public class MyNativeEnabledApplication implements CodeFeatures {
    // ....

	public void complexMath() {
		ComplexNumber a = new ComplexNumber(1, 2);
		ComplexNumber b = new ComplexNumber(3, 4);

		Producer<ComplexNumber> c = multiplyComplex(c(a), c(b));
		System.out.println("(1 + 2i) * (3 + 4i) = ");
		c.get().evaluate().print();
	}
}
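The rule behind complex multiplication is (a + bi)(c + di) = (ac - bd) + (ad + bc)i, which mixes the real and imaginary parts of both operands. The plain-Java sketch below, which does not use the library, contrasts it with element-wise multiplication of the same pairs; the class and method names are hypothetical.

```java
class ComplexMultiplySketch {
    // (a + bi)(c + di) = (ac - bd) + (ad + bc)i, with each number stored
    // as a pair {real, imaginary}.
    static double[] complexProduct(double[] x, double[] y) {
        return new double[] {
            x[0] * y[0] - x[1] * y[1],
            x[0] * y[1] + x[1] * y[0]
        };
    }

    // Element-wise product of the same pairs, for comparison.
    static double[] elementwiseProduct(double[] x, double[] y) {
        return new double[] { x[0] * y[0], x[1] * y[1] };
    }

    public static void main(String[] args) {
        double[] a = {1, 2};
        double[] b = {3, 4};
        double[] c = complexProduct(a, b);
        double[] e = elementwiseProduct(a, b);
        System.out.println("complex:      " + c[0] + " + " + c[1] + "i");
        System.out.println("element-wise: (" + e[0] + ", " + e[1] + ")");
    }
}
```

For (1 + 2i) and (3 + 4i), the complex product is -5 + 10i, whereas multiplying the pairs element by element gives (3, 8), which is not a valid complex product.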

Automatic Differentiation

Most Producer implementations, mainly those that implement ProducerComputation, support automatic differentiation. This is a powerful feature that allows you to compute the gradient of a function with respect to any of its inputs. This is especially useful for machine learning implementations, but that is far from the only use case.


The delta() method is used to form a new Producer that is the gradient of the original Producer with respect to a particular input. The shape of the result will be the shape of the target input appended to the output shape - resulting in a derivative for each combination of input and output. Some examples follow.

public class MyNativeEnabledApplication implements CodeFeatures {
    // ....

	public void polynomialDelta() {
		// x^2 + 3x + 1
		CollectionProducer<PackedCollection<?>> c = x().sq().add(x().mul(3)).add(1);

		// y = f(x)
		Evaluable<PackedCollection<?>> y = c.get();
		PackedCollection<?> out = y.evaluate(pack(1, 2, 3, 4, 5).traverseEach());

		// dy = f'(x) = 2x + 3
		Evaluable<PackedCollection<?>> dy = c.delta(x()).get();
		out = dy.evaluate(pack(1, 2, 3, 4, 5).traverseEach());
	}

	public void vectorDelta() {
		int dim = 3;
		int count = 2;

		PackedCollection<?> v = pack(IntStream.range(0, count * dim).boxed()
				.mapToDouble(Double::valueOf).toArray())
				.reshape(count, dim).traverse();
		PackedCollection<?> w = pack(4, -3, 2);
		CollectionProducer<PackedCollection<?>> x = x(dim);

		// w * x
		CollectionProducer<PackedCollection<?>> c = x.mul(p(w));

		// y = f(x)
		Evaluable<PackedCollection<?>> y = c.get();
		PackedCollection<?> out = y.evaluate(v);

		// dy = f'(x)
		//    = w
		Evaluable<PackedCollection<?>> dy = c.delta(x).get();
		PackedCollection<?> dout = dy.evaluate(v);
	}
}
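One way to check the values and shapes that these derivatives should have is a numerical approximation. The plain-Java sketch below is independent of the library and is not how automatic differentiation works internally: it estimates the Jacobian with central differences, using a hypothetical jacobian helper and a step size chosen arbitrarily for illustration. The result has one row per output and one column per input, so its shape is the output shape followed by the input shape.

```java
import java.util.Arrays;
import java.util.function.UnaryOperator;

class DeltaSketch {
    // Approximate the Jacobian J[i][j] = d f_i / d x_j with central differences.
    static double[][] jacobian(UnaryOperator<double[]> f, double[] x) {
        double h = 1e-4; // step size (an arbitrary choice for this sketch)
        int outputs = f.apply(x).length;
        double[][] j = new double[outputs][x.length];
        for (int col = 0; col < x.length; col++) {
            double[] plus = x.clone();
            double[] minus = x.clone();
            plus[col] += h;
            minus[col] -= h;
            double[] fPlus = f.apply(plus);
            double[] fMinus = f.apply(minus);
            for (int row = 0; row < outputs; row++) {
                j[row][col] = (fPlus[row] - fMinus[row]) / (2 * h);
            }
        }
        return j;
    }

    // f(x) = x^2 + 3x + 1, applied to a single value
    static double[] polynomial(double[] x) {
        return new double[] { x[0] * x[0] + 3 * x[0] + 1 };
    }

    // f(x) = w * x, element by element, with w = (4, -3, 2)
    static double[] weighted(double[] x) {
        double[] w = {4, -3, 2};
        double[] out = new double[x.length];
        for (int i = 0; i < x.length; i++) out[i] = w[i] * x[i];
        return out;
    }

    public static void main(String[] args) {
        // The slope of the polynomial should be close to 2x + 3
        for (int x = 1; x <= 5; x++) {
            double slope = jacobian(DeltaSketch::polynomial, new double[] { x })[0][0];
            System.out.println("f'(" + x + ") is approximately " + slope);
        }
        // For w * x the Jacobian is 3x3, with w on the diagonal
        double[][] j = jacobian(DeltaSketch::weighted, new double[] { 0, 1, 2 });
        System.out.println(Arrays.deepToString(j));
    }
}
```

For the polynomial the estimates should land close to 5, 7, 9, 11 and 13, and for the element-wise product the Jacobian should be close to a diagonal matrix with 4, -3 and 2 on the diagonal, which is consistent with the "f'(x) = w" comment in the example.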


