Tuning the Reduce Operator

A reduction is easy to implement in CUDA, but hard to get right.

In essence, a reduction computes: $x = x_0 \otimes x_1 \otimes \cdots \otimes x_n$
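To make the formula concrete, here is a minimal sequential sketch in host-side C++ that folds the inputs with a binary operator. The name `reduce_reference` and the explicit `identity` parameter are assumptions made for illustration and do not come from the original text. A reference like this is mainly useful for checking the output of a GPU kernel.

```cpp
#include <functional>
#include <vector>

// Hypothetical helper (not from the original text): sequential reference
// for x = x_0 (op) x_1 (op) ... (op) x_n, folding left to right.
// `identity` must be the neutral element of `op` (0 for +, 1 for *).
template <typename T, typename Op>
T reduce_reference(const std::vector<T>& xs, T identity, Op op) {
    T acc = identity;
    for (const T& x : xs) {
        acc = op(acc, x);  // combine the running result with the next element
    }
    return acc;  // an empty input yields `identity`
}
```

For example, `reduce_reference(std::vector<int>{1, 2, 3, 4}, 0, std::plus<int>())` sums the four values. A parallel version combines elements in a different order, so the operator has to be associative. For floating-point addition, which is not exactly associative, a GPU result may differ slightly from this sequential one.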