Using Java with Nvidia GPUs (CUDA)


Solution 1

First of all, you should be aware of the fact that CUDA will not automagically make computations faster. On the one hand, because GPU programming is an art, and it can be very, very challenging to get it right. On the other hand, because GPUs are well-suited only for certain kinds of computations.

This may sound confusing, because you can basically compute anything on the GPU. The key point is, of course, whether you will achieve a good speedup or not. The most important classification here is whether a problem is task parallel or data parallel. The first one refers, roughly speaking, to problems where several threads are working on their own tasks, more or less independently. The second one refers to problems where many threads are all doing the same - but on different parts of the data.

The latter is the kind of problem that GPUs are good at: They have many cores, and all the cores do the same, but operate on different parts of the input data.
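As a minimal CPU-side illustration of the data-parallel pattern (every worker applies the same operation to a different element of the input), here is a small Java sketch using parallel streams; it is only meant to show the shape of the pattern, not GPU code:

```java
import java.util.stream.IntStream;

// Data parallelism in miniature: every "thread" runs the same operation
// (square the element) on a different part of the input data.
public class DataParallel {
    public static double[] squareAll(double[] input) {
        double[] out = new double[input.length];
        IntStream.range(0, input.length).parallel()
                 .forEach(i -> out[i] = input[i] * input[i]);
        return out;
    }
}
```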

You mentioned that you have "simple math but with huge amount of data". Although this may sound like a perfectly data-parallel problem and thus like it was well-suited for a GPU, there is another aspect to consider: GPUs are ridiculously fast in terms of theoretical computational power (FLOPS, Floating Point Operations Per Second). But they are often throttled down by the memory bandwidth.

This leads to another classification of problems. Namely whether problems are memory bound or compute bound.

The first one refers to problems where the number of instructions that are done for each data element is low. For example, consider a parallel vector addition: You'll have to read two data elements, then perform a single addition, and then write the sum into the result vector. You will not see a speedup when doing this on the GPU, because the single addition does not compensate for the efforts of reading/writing the memory.
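To make the vector-addition example concrete, here is a quick back-of-the-envelope calculation of its arithmetic intensity; the numbers assume 4-byte floats and count only the raw loads, stores and additions:

```java
// Back-of-the-envelope arithmetic intensity of c[i] = a[i] + b[i],
// assuming 4-byte float elements. Illustrative, not measured.
public class VectorAddIntensity {
    public static double flopsPerByte(int n) {
        double flops = n;                // one addition per element
        double bytes = 3.0 * n * 4;      // read a[i], read b[i], write c[i]
        return flops / bytes;            // ~0.083 flops per byte moved
    }
}
```

With roughly one floating point operation per 12 bytes of memory traffic, the memory system, not the arithmetic units, dictates the runtime.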

The second term, "compute bound", refers to problems where the number of instructions is high compared to the number of memory reads/writes. For example, consider a matrix multiplication: The number of instructions will be O(n^3) when n is the size of the matrix. In this case, one can expect that the GPU will outperform a CPU at a certain matrix size. Another example could be when many complex trigonometric computations (sine/cosine etc) are performed on "few" data elements.
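The same back-of-the-envelope reasoning for a matrix multiplication shows why its intensity grows with the matrix size (assuming a naive n x n multiplication with about 2n^3 operations and 3n^2 elements moved):

```java
// For an n x n matrix multiplication: ~2*n^3 floating point operations
// (one multiply and one add per inner-loop step) versus ~3*n^2 elements
// moved (two input matrices, one output). The ratio is 2n/3, so the
// arithmetic intensity grows linearly with n.
public class MatMulIntensity {
    public static double flopsPerElement(int n) {
        double flops = 2.0 * n * n * n;
        double elements = 3.0 * n * n;
        return flops / elements;   // = 2n/3
    }
}
```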

As a rule of thumb: You can assume that reading/writing one data element from the "main" GPU memory has a latency of about 500 instructions....

Therefore, another key point for the performance of GPUs is data locality: If you have to read or write data (and in most cases, you will have to ;-)), then you should make sure that the data is kept as close as possible to the GPU cores. GPUs therefore have certain memory areas (referred to as "local memory" or "shared memory") that are usually only a few KB in size, but particularly efficient for data that is about to be involved in a computation.
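On the CPU, the closest analogy to using shared memory is loop tiling: deliberately working on blocks that fit into fast memory before moving on. A simplified sketch (the tile size of 32 is an arbitrary assumption, not a tuned value):

```java
// CPU analogy of "keep data close to the cores": loop tiling.
// The same blocking strategy is what shared memory is typically
// used for in GPU matrix-multiplication kernels.
public class TiledMatMul {
    static final int B = 32; // tile size (assumed, not tuned)

    public static double[] multiply(double[] a, double[] b, int n) {
        double[] c = new double[n * n];
        for (int ii = 0; ii < n; ii += B)
            for (int kk = 0; kk < n; kk += B)
                for (int jj = 0; jj < n; jj += B)
                    // work on one small tile that fits in fast memory
                    for (int i = ii; i < Math.min(ii + B, n); i++)
                        for (int k = kk; k < Math.min(kk + B, n); k++) {
                            double aik = a[i * n + k];
                            for (int j = jj; j < Math.min(jj + B, n); j++)
                                c[i * n + j] += aik * b[k * n + j];
                        }
        return c;
    }
}
```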

So to emphasize this again: GPU programming is an art, that is only remotely related to parallel programming on the CPU. Things like Threads in Java, with all the concurrency infrastructure like ThreadPoolExecutors, ForkJoinPools etc. might give the impression that you just have to split your work somehow and distribute it among several processors. On the GPU, you may encounter challenges on a much lower level: Occupancy, register pressure, shared memory pressure, memory coalescing ... just to name a few.

However, when you have a data-parallel, compute-bound problem to solve, the GPU is the way to go.


A general remark: You specifically asked for CUDA. But I'd strongly recommend also having a look at OpenCL. It has several advantages. First of all, it's a vendor-independent, open industry standard, and there are implementations of OpenCL by AMD, Apple, Intel and NVIDIA. Additionally, there is much broader support for OpenCL in the Java world. The only case where I'd rather settle for CUDA is when you want to use the CUDA runtime libraries, like CUFFT for FFT or CUBLAS for BLAS (matrix/vector operations). Although there are approaches for providing similar libraries for OpenCL, they cannot directly be used from the Java side, unless you create your own JNI bindings for these libraries.


You might also find it interesting to hear that in October 2012, the OpenJDK HotSpot group started the project "Sumatra": http://openjdk.java.net/projects/sumatra/ . The goal of this project is to provide GPU support directly in the JVM, with support from the JIT. The current status and first results can be seen in their mailing list at http://mail.openjdk.java.net/mailman/listinfo/sumatra-dev


However, a while ago, I collected some resources related to "Java on the GPU" in general. I'll summarize these again here, in no particular order.

(Disclaimer: I'm the author of http://jcuda.org/ and http://jocl.org/ )

(Byte)code translation and OpenCL code generation:

https://github.com/aparapi/aparapi : An open-source library that was created and is actively maintained by AMD. In a special "Kernel" class, one can override a specific method which should be executed in parallel. The byte code of this method is loaded at runtime using its own bytecode reader. The code is translated into OpenCL code, which is then compiled using the OpenCL compiler. The result can then be executed on the OpenCL device, which may be a GPU or a CPU. If the compilation into OpenCL is not possible (or no OpenCL is available), the code will still be executed in parallel, using a thread pool.
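For a rough feel of Aparapi's programming model without pulling in the dependency, the following plain-Java sketch mimics the structure (the kernel body that Aparapi would translate to OpenCL) and its thread-pool fallback; the Kernel/run()/getGlobalId() names are referenced only in comments:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Dependency-free stand-in for Aparapi's model: in Aparapi you would
// subclass com.aparapi.Kernel, override run(), and call getGlobalId()
// to find the element the current work-item handles. This sketch
// mimics that structure and the library's thread-pool fallback.
public class KernelStyleAdd {
    public static float[] add(float[] a, float[] b) {
        float[] result = new float[a.length];
        ExecutorService pool = Executors.newFixedThreadPool(
                Runtime.getRuntime().availableProcessors());
        for (int id = 0; id < a.length; id++) {
            final int globalId = id; // plays the role of getGlobalId()
            // this lambda body corresponds to the overridden run() method
            pool.submit(() -> result[globalId] = a[globalId] + b[globalId]);
        }
        pool.shutdown();
        try {
            pool.awaitTermination(1, TimeUnit.MINUTES);
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
        }
        return result;
    }
}
```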

https://github.com/pcpratts/rootbeer1 : An open-source library for converting parts of Java into CUDA programs. It offers dedicated interfaces that may be implemented to indicate that a certain class should be executed on the GPU. In contrast to Aparapi, it tries to automatically serialize the "relevant" data (that is, the complete relevant part of the object graph!) into a representation that is suitable for the GPU.

https://code.google.com/archive/p/java-gpu/ : A library for translating annotated Java code (with some limitations) into CUDA code, which is then compiled into a library that executes the code on the GPU. The library was developed in the context of a PhD thesis, which contains profound background information about the translation process.

https://github.com/ochafik/ScalaCL : Scala bindings for OpenCL. Allows special Scala collections to be processed in parallel with OpenCL. The functions that are called on the elements of the collections can be usual Scala functions (with some limitations) which are then translated into OpenCL kernels.

Language extensions

http://www.ateji.com/px/index.html : A language extension for Java that allows parallel constructs (e.g. parallel for loops, OpenMP style) which are then executed on the GPU with OpenCL. Unfortunately, this very promising project is no longer maintained.

http://www.habanero.rice.edu/Publications.html (JCUDA) : A library that can translate special Java Code (called JCUDA code) into Java- and CUDA-C code, which can then be compiled and executed on the GPU. However, the library does not seem to be publicly available.

https://www2.informatik.uni-erlangen.de/EN/research/JavaOpenMP/index.html : Java language extension for OpenMP constructs, with a CUDA backend

Java OpenCL/CUDA binding libraries

https://github.com/ochafik/JavaCL : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings

http://jogamp.org/jocl/www/ : Java bindings for OpenCL: An object-oriented OpenCL library, based on auto-generated low-level bindings

http://www.lwjgl.org/ : Java bindings for OpenCL: Auto-generated low-level bindings and object-oriented convenience classes

http://jocl.org/ : Java bindings for OpenCL: Low-level bindings that are a 1:1 mapping of the original OpenCL API

http://jcuda.org/ : Java bindings for CUDA: Low-level bindings that are a 1:1 mapping of the original CUDA API

Miscellaneous

http://sourceforge.net/projects/jopencl/ : Java bindings for OpenCL. Seems to be unmaintained since 2010

http://www.hoopoe-cloud.com/ : Java bindings for CUDA. Seems to be no longer maintained


Solution 2

From the research I have done, if you are targeting Nvidia GPUs and have decided to use CUDA over OpenCL, I found three ways to use the CUDA API from Java.

  1. JCuda (or an alternative) - http://www.jcuda.org/. This seems like the best solution for the problems I am working on. Many libraries, such as CUBLAS, are available in JCuda. Kernels are still written in C, though.
  2. JNI - JNI interfaces are not my favorite to write, but are very powerful and would allow you to do anything CUDA can do.
  3. JavaCPP - This basically lets you make a JNI interface in Java without writing the C code directly. There is an example in the question "What is the easiest way to run working CUDA code in Java?" showing how to use this with CUDA Thrust. To me, this seems like you might as well just write a JNI interface.

All of these approaches are basically just ways of using C/C++ code from Java. You should ask yourself why you need to use Java, and whether you could do it in C/C++ instead.

If you like Java and know how to use it and don't want to work with all the pointer management and what-not that comes with C/C++ then JCuda is probably the answer. On the other hand, the CUDA Thrust library and other libraries like it can be used to do a lot of the pointer management in C/C++ and maybe you should look at that.

If you like C/C++ and don't mind pointer management, but there are other constraints forcing you to use Java, then JNI might be the best approach. Though, if your JNI methods are just going to be wrappers for kernel commands, you might as well just use JCuda.
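If you do go the JNI route, the wrapper typically boils down to a single native declaration plus a pure-Java reference implementation for validating results. In the sketch below, the library name, the method, and the C function shown in the comments are hypothetical placeholders, not part of any real library:

```java
// Sketch of the JNI route for a single SAXPY-style kernel launch.
// The native method would be implemented in C/C++ (compiled against
// the CUDA toolkit) and loaded from a library such as "saxpyjni";
// both the library name and the method are hypothetical placeholders.
public class SaxpyJni {
    // The matching C function (per the JNI naming scheme) would be:
    //   JNIEXPORT void JNICALL Java_SaxpyJni_saxpy(JNIEnv*, jclass,
    //       jfloat, jfloatArray, jfloatArray)
    // It would copy the arrays to the device, launch the kernel,
    // and copy the result back into y.
    public static native void saxpy(float alpha, float[] x, float[] y);

    // Pure-Java reference implementation (y = alpha * x + y),
    // useful for validating the native path on small inputs.
    public static void saxpyJava(float alpha, float[] x, float[] y) {
        for (int i = 0; i < x.length; i++) {
            y[i] = alpha * x[i] + y[i];
        }
    }
}
```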

There are a few alternatives to JCuda, such as Cuda4J and Rootbeer, but those do not seem to be maintained. At the time of writing, JCuda supports CUDA 10.1, which was the most up-to-date CUDA SDK.

Additionally, there are a few Java libraries that use CUDA, such as deeplearning4j and Hadoop, that may be able to do what you are looking for without requiring you to write kernel code directly. I have not looked into them too much, though.

Solution 3

I'd start by using one of the projects out there for Java and CUDA: http://www.jcuda.org/

Solution 4

Marco13 already provided an excellent answer.

In case you are searching for a way to use the GPU without implementing CUDA/OpenCL kernels, I would like to add a reference to the finmath-lib-cuda-extensions (finmath-lib-gpu-extensions) http://finmath.net/finmath-lib-cuda-extensions/ (disclaimer: I am the maintainer of this project).

The project provides an implementation of "vector classes", to be precise an interface called RandomVariable, which provides arithmetic operations and reductions on vectors. There are implementations for the CPU and the GPU, and there are implementations using algorithmic differentiation as well as plain valuations.
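The underlying pattern can be sketched as follows; note that this is a simplified illustration of the "vector class" idea, not the actual finmath RandomVariable API:

```java
// Simplified illustration of the "vector class" pattern: arithmetic
// and reductions on whole vectors behind an interface, so that a
// GPU-backed implementation could be swapped in transparently.
// This is NOT the real finmath RandomVariable API, just a sketch.
interface Vector {
    Vector add(Vector other);
    Vector mult(double scalar);
    double average();   // a reduction
}

class CpuVector implements Vector {
    final double[] values;
    CpuVector(double[] values) { this.values = values; }

    public Vector add(Vector other) {
        double[] o = ((CpuVector) other).values;
        double[] r = new double[values.length];
        for (int i = 0; i < r.length; i++) r[i] = values[i] + o[i];
        return new CpuVector(r);
    }
    public Vector mult(double scalar) {
        double[] r = new double[values.length];
        for (int i = 0; i < r.length; i++) r[i] = values[i] * scalar;
        return new CpuVector(r);
    }
    public double average() {
        double sum = 0;
        for (double v : values) sum += v;
        return sum / values.length;
    }
}
```

Client code written against the interface stays unchanged when a GPU implementation (e.g. one backed by JCuda or JOCL) is substituted for the CPU one.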

The performance improvements on the GPU are currently small (but for vectors of size 100,000 you may see performance improvements by a factor of more than 10). This is due to the small kernel sizes; it will improve in a future version.

The GPU implementations use JCuda and JOCL and are available for NVIDIA and AMD (formerly ATI) GPUs.

The library is licensed under Apache 2.0 and available via Maven Central.

Solution 5

There is not much information on the nature of the problem and the data, so it is difficult to advise. However, I would recommend assessing the feasibility of other solutions that may be easier to integrate with Java and that enable horizontal as well as vertical scaling. The first one I would suggest looking at is the open-source analytics engine Apache Spark https://spark.apache.org/, which is available on Microsoft Azure and probably on other cloud IaaS providers too. If you stick with involving your GPU, then the suggestion is to look at the GPU-accelerated analytical databases on the market that fit the budget of your organisation.

Author: Hans

Updated on August 28, 2021

Comments

  • Hans
    Hans almost 3 years

    I'm working on a business project that is done in Java, and it needs huge computation power to compute business markets. Simple math, but with huge amount of data.

    We ordered some CUDA GPUs to try it out, and since Java is not supported by CUDA, I'm wondering where to start. Should I build a JNI interface? Should I use JCuda, or are there other ways?

    I don't have experience in this field, and I would appreciate it if someone could point me to some resources so I can start researching and learning.

    • steve cook
      steve cook about 10 years
      GPUs will help you speed up specific types of compute-intensive problems. However, if you have a huge amount of data, you are more likely to be I/O bound. Most likely, GPUs are not the solution.
    • BlackBear
      BlackBear almost 9 years
      "Boosting Java Performance using GPGPUs" --> arxiv.org/abs/1508.06791
    • JimLohse
      JimLohse over 7 years
      Kind of an open question, I am glad the mods didn't shut it down because the answer from Marco13 is incredibly helpful! Should be a wiki IMHO
  • Cool_Coder
    Cool_Coder about 10 years
    Consider an operation of adding 2 matrices and storing the result in a third matrix. When multi-threaded on the CPU without OpenCL, the bottleneck will always be the step in which the addition happens. This operation is obviously data parallel. But let's say we don't know beforehand whether it will be compute bound or memory bound. It takes a lot of time and resources to implement it and only then see that the CPU is much better at doing this operation. So how does one identify this beforehand, without implementing the OpenCL code?
  • Marco13
    Marco13 about 10 years
    @Cool_Coder Indeed it's hard to tell beforehand whether (or how much) a certain task will benefit from a GPU implementation. For a first gut feeling, one probably needs some experience with different use-cases (which I admittedly also don't really have). A first step could be to look at nvidia.com/object/cuda_showcase_html.html and see whether there is a "similar" problem listed. (It's CUDA, but it's conceptually so close to OpenCL that the results can be transferred in most cases). In most cases, the speedup is also mentioned, and many of them have links to papers or even code
  • steve cook
    steve cook about 10 years
    +1 for aparapi - its a simple way to get started with opencl in java, and allows you to easily compare CPU vs GPU performance for simple cases. Also, it's maintained by AMD but works fine with Nvidia cards.
  • Marco13
    Marco13 about 10 years
    @steve Yes, AMD really did pioneering work with Aparapi, and I think it's by far the most mature approach on the bytecode level (although I wasn't able to closely track its progress recently). One could mention that it is a rather "thick abstraction layer": Strictly speaking, you don't need to know anything about GPU programming and you will not "directly" learn anything about CUDA/OpenCL by using it, but...
  • Marco13
    Marco13 about 10 years
    ... but having a rough idea about HOW your Java code will be translated will help to structure your code in a way that can be translated into an efficient OpenCL version. Additionally, it might sooner or later become obsolete when project "Sumatra" is integrated into the JVM (they are already in close cooperation, and the lead developer of Aparapi is also working on Sumatra), but this will probably still take a while. So anybody who has an existing Java application and wants to see whether it might benefit from the GPU should give Aparapi a try.
  • gouessej
    gouessej almost 10 years
    JavaCL is a lot less active than JogAmp's JOCL. The former uses JNA, whereas the latter uses JNI, which is faster.
  • Marco13
    Marco13 almost 10 years
    @gouessej The list was set up quite a while ago, and aimed at (more-or-less) completeness (I even mentioned some libs that had been abandoned at this point in time - and maybe in the meantime, there are even some new ones...). The performance aspect that you mentioned could be worth a discussion that is beyond the scope of these comments. (BTW: Once I talked to M. Bien, considering merging our JOCLs, but we did not further pursue this idea)
  • gouessej
    gouessej almost 10 years
    M. Bien no longer maintains JOCL but we are open to your suggestions and are favorable to a merge.
  • OverCoder
    OverCoder almost 9 years
    Please make this a wiki post o.o
  • ViggyNash
    ViggyNash over 7 years
    This is one of the best responses I've ever seen on StackOverflow. Thanks for the time and effort!
  • juanmf
    juanmf over 7 years
    Hi, thanks for this great answer. Is there anything like this analysis on any (if available) Vulkan ports? and also given that case, a comparison of OpenCL vs Vulkan?
  • juanmf
    juanmf over 7 years
    just found that lwjgl.org includes bindings for Vulkan API.
  • Alex Punnen
    Alex Punnen over 7 years
    Reading this got me confused to see whether to use CUDA or OpenCL for speeding OpenCV wiki.tiker.net/CudaVsOpenCL
  • Marco13
    Marco13 over 7 years
    @AlexPunnen This is probably beyond the scope of the comments. As far as I know, OpenCV has some CUDA support, as of docs.opencv.org/2.4/modules/gpu/doc/introduction.html . The developer.nvidia.com/npp has many image processing routines, which may be handy. And github.com/GPUOpen-ProfessionalCompute-Tools/HIP may be an "alternative" for CUDA. It might be possible to ask this as a new question, but one has to be careful to phrase it properly, to avoid downvotes for "opinion based"/"asking for third-party libraries"...
  • Alex Punnen
    Alex Punnen over 7 years
    @Marco13 OpenCV definitely supports CUDA; I have been trying to compile with it and make it work for object detection algorithms; I will check this and try also with OpenCL and ask another question (or reply to mine) Thanks