Rewrite tensor.eye implementation and kernel#2937
Conversation
|
View rendered docs @ https://intelpython.github.io/dpnp/pull/2937/index.html |
|
Array API standard conformance tests for dpnp=0.21.0dev0=py313h509198e_50 ran successfully. |
|
@antonwolfy @vlad-perevezentsev used this as a chance to experiment with vtune and found that performance difference is very insignificant between implementations. Same for GPU utilization, about the same across the board. This implementation is a bit cleaner in my opinion though, so I think still worth adding |
| @@ -345,9 +345,11 @@ sycl::event full_strided_impl(sycl::queue &q, | |||
|
|
|||
| typedef sycl::event (*eye_fn_ptr_t)(sycl::queue &, | |||
| std::size_t nelems, // num_elements | |||
There was a problem hiding this comment.
nelems is not used in the new logic and can be removed
| @@ -400,9 +406,11 @@ class EyeFunctor | |||
| template <typename Ty> | |||
| sycl::event eye_impl(sycl::queue &exec_q, | |||
| std::size_t nelems, | |||
There was a problem hiding this comment.
nelems is not used in the new logic and can be removed
| const ssize_t start, | ||
| const ssize_t end, | ||
| const ssize_t step, | ||
| const ssize_t rows, |
There was a problem hiding this comment.
could you also update a changelog?
| int dst_typeid = array_types.typenum_to_lookup_id(dst_typenum); | ||
|
|
||
| const py::ssize_t nelem = dst.get_size(); | ||
| const py::ssize_t nelems = dst.get_size(); |
There was a problem hiding this comment.
nelems is passed to fn which does not use it
|
|
||
| auto fn = eye_dispatch_vector[dst_typeid]; | ||
| sycl::event eye_event = | ||
| fn(exec_q, static_cast<std::size_t>(nelems), rows, cols, k, stride0, |
This PR proposes an update to the
tensor.eyeconstructor, shifting to an approach using a 2D nd_range kernel which instead uses the condition thatcol - row == kalong the diagonal, which is correct whether the array is C- or F-contiguousThis eliminates any branching in the kernel, in hopes of improving performance even marginally.