Problem description
Let's say the bottleneck of my Java program really is some tight loops that compute a bunch of vector dot products. Yes, I've profiled; yes, it's the bottleneck; yes, it's significant; yes, that's just how the algorithm is; yes, I've run ProGuard to optimize the bytecode; and so on.
The work is, essentially, dot products: I have two float[50] arrays and I need to compute the sum of their pairwise products. I know processor instruction sets such as SSE or MMX exist to perform these kinds of operations quickly and in bulk.
Yes, I can probably access these by writing some native code through JNI, but the JNI call turns out to be pretty expensive.
I know you can't guarantee what a JIT will or won't compile. Has anyone ever heard of a JIT generating code that uses these instructions? If so, is there anything about the Java code that helps make it compilable this way?
Probably a "no", but worth asking.
Recommended answer
So, basically, you want your code to run faster. JNI is the answer. I know you said it didn't work for you, but let me show you that you are wrong.
Here is Dot.java:
import java.nio.FloatBuffer;
import org.bytedeco.javacpp.*;
import org.bytedeco.javacpp.annotation.*;

@Platform(include = "Dot.h", compiler = "fastfpu")
public class Dot {
    // Load the native library that JavaCPP builds for this class.
    static { Loader.load(); }

    static float[] a = new float[50], b = new float[50];

    // Pure Java dot product over the two arrays.
    static float dot() {
        float sum = 0;
        for (int i = 0; i < 50; i++) {
            sum += a[i]*b[i];
        }
        return sum;
    }

    // Accessors for the native arrays ac and bc declared in Dot.h.
    static native @MemberGetter FloatPointer ac();
    static native @MemberGetter FloatPointer bc();
    // Native dot product defined in Dot.h.
    static native @NoException float dotc();

    public static void main(String[] args) {
        // View the native arrays as direct NIO buffers.
        FloatBuffer ab = ac().capacity(50).asBuffer();
        FloatBuffer bb = bc().capacity(50).asBuffer();

        // Warm-up: exercise both versions before timing.
        for (int i = 0; i < 10000000; i++) {
            a[i%50] = b[i%50] = dot();
            float sum = dotc();
            ab.put(i%50, sum);
            bb.put(i%50, sum);
        }

        // Time the pure Java version.
        long t1 = System.nanoTime();
        for (int i = 0; i < 10000000; i++) {
            a[i%50] = b[i%50] = dot();
        }
        long t2 = System.nanoTime();

        // Time the native version.
        for (int i = 0; i < 10000000; i++) {
            float sum = dotc();
            ab.put(i%50, sum);
            bb.put(i%50, sum);
        }
        long t3 = System.nanoTime();

        // Average time per call, in nanoseconds.
        System.out.println("dot(): " + (t2 - t1)/10000000 + " ns");
        System.out.println("dotc(): " + (t3 - t2)/10000000 + " ns");
    }
}
And Dot.h:
float ac[50], bc[50];

inline float dotc() {
    float sum = 0;
    for (int i = 0; i < 50; i++) {
        sum += ac[i]*bc[i];
    }
    return sum;
}
We can compile and run that with JavaCPP using this command:
$ java -jar javacpp.jar Dot.java -exec
With an Intel(R) Core(TM) i7-7700HQ CPU @ 2.80GHz, Fedora 30, GCC 9.1.1, and OpenJDK 8 or 11, I get this kind of output:
dot(): 39 ns
dotc(): 16 ns
That is roughly 2.4 times faster. We need to use direct NIO buffers instead of arrays, but HotSpot can access direct NIO buffers as fast as arrays. On the other hand, manually unrolling the loop does not provide a measurable boost in performance in this case.
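To make those last two remarks concrete, here is a minimal sketch, not part of the benchmark above, of what a dot product over direct NIO buffers and a hand-unrolled variant might look like in plain Java. The class name DirectDot and the helper names directFloats, dot, and dotUnrolled are made up for illustration; only the java.nio calls are standard. No timings are claimed for it.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

// Illustrative only: DirectDot and its methods are hypothetical names.
public class DirectDot {
    static final int N = 50;

    // Allocate an off-heap (direct) buffer for n floats in native byte order.
    static FloatBuffer directFloats(int n) {
        return ByteBuffer.allocateDirect(n * Float.BYTES)
                .order(ByteOrder.nativeOrder())
                .asFloatBuffer();
    }

    // Straightforward dot product; absolute get(i) leaves the buffer position alone.
    static float dot(FloatBuffer a, FloatBuffer b, int n) {
        float sum = 0;
        for (int i = 0; i < n; i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    // The same computation unrolled four ways by hand, with one accumulator per lane.
    // This reorders the floating-point additions, so for general inputs the
    // result may differ from dot() in the last bits.
    static float dotUnrolled(FloatBuffer a, FloatBuffer b, int n) {
        float s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        int i = 0;
        for (; i + 3 < n; i += 4) {
            s0 += a.get(i)     * b.get(i);
            s1 += a.get(i + 1) * b.get(i + 1);
            s2 += a.get(i + 2) * b.get(i + 2);
            s3 += a.get(i + 3) * b.get(i + 3);
        }
        float sum = (s0 + s1) + (s2 + s3);
        // Handle the leftover elements when n is not a multiple of 4.
        for (; i < n; i++) {
            sum += a.get(i) * b.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        FloatBuffer a = directFloats(N), b = directFloats(N);
        for (int i = 0; i < N; i++) {
            a.put(i, (float) i);
            b.put(i, 2f);
        }
        System.out.println(dot(a, b, N));
        System.out.println(dotUnrolled(a, b, N));
    }
}
```

With the small integer-valued inputs used in main, both methods return the same value. With arbitrary floats the two summation orders can round differently, which is one reason a compiler is cautious about reordering such a loop on its own.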