[Figure 6 diagram labels: User Guide; Intel® DPC++ Compatibility Tool Usage Flow; Human Readable DPC++ with Inline Comments; Complete Coding & Tune to Desired Performance]
and API calls. The tool can automatically migrate 80-90
percent⁹ of the code (depending on complexity) and embeds
comments to help developers complete the manual steps of
the migration process. In this case study, nearly 100 percent
of the code was automatically migrated in a readable and
usable manner.
Comprehensive Performance Optimization using
Intel Software Tools
Optimization #1. First, SonoScape used the Intel® VTune™
Profiler to analyze their workload. The profiler can quickly
identify CPU and GPU load performance bottlenecks and
provide relevant information. As shown in Figure 5,
vector processing makes full use of Intel’s high instruction
throughput and supports the parallel processing of data to
rapidly improve performance over scalar operations.
Figure 5. Scalar processing vs. Vector processing
Figure 6. Workflow chart of the Intel DPC++ Compatibility Tool
SonoScape also made use of the DPC++ Compiler in the
oneAPI toolkit to recompile its code and generate vector
instructions for enhanced performance, reducing the
processing time of the workload from 141 ms to just 33 ms.⁷
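The scalar-versus-vector idea in Optimization #1 can be illustrated with a minimal sketch. This is not SonoScape's code: the function name `add_arrays` is hypothetical, and whether the loop is actually vectorized depends on the compiler, its flags, and the target CPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical example: element-wise sum c[i] = a[i] + b[i].
// No iteration depends on another, so a vectorizing compiler may
// emit instructions that add several elements at once instead of
// one element per instruction (scalar processing).
std::vector<float> add_arrays(const std::vector<float>& a,
                              const std::vector<float>& b) {
    assert(a.size() == b.size());  // both inputs must have equal length
    std::vector<float> c(a.size());
    for (std::size_t i = 0; i < a.size(); ++i) {
        c[i] = a[i] + b[i];
    }
    return c;
}
```

The loop is written in plain scalar form on purpose: the point of recompiling is that the compiler, not the developer, chooses the vector instructions.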
Optimization #2. Once performance bottlenecks were
identified by the VTune Profiler, SonoScape replaced the
bottleneck code with APIs from Intel® Integrated Performance
Primitives (Intel® IPP), a cross-platform software library of
functions that include accelerators for image processing,
signal processing, data compression, encryption mechanisms,
and other applications. Intel IPP can be optimized for CPUs to
unlock the latest features of Intel architecture platforms (such
as AVX-512) to improve application performance.
For example, the ippsCrossCorrNorm_32f and
ippsDotProd_32f64f functions can improve performance by
removing dual-layer loop calculations and multiplication/addition
loops. Through such optimization, SonoScape was
able to further reduce the processing time of the workload
from 33 ms to 13.787 ms.⁷
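To make the kinds of loops mentioned above concrete, here is a minimal plain C++ sketch of a multiplication/addition loop (a dot product) and a dual-layer loop (a cross-correlation over lags). These are illustrative reference loops only: the names `dot_product` and `cross_correlate` are hypothetical, the code does not call Intel IPP, and it does not claim to reproduce the exact lag or normalization conventions of ippsCrossCorrNorm_32f.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Multiplication/addition loop: dot product of two float arrays,
// accumulated in double precision (float inputs, double result).
double dot_product(const std::vector<float>& x, const std::vector<float>& y) {
    assert(x.size() == y.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < x.size(); ++i) {
        sum += static_cast<double>(x[i]) * static_cast<double>(y[i]);
    }
    return sum;
}

// Dual-layer loop: unnormalized cross-correlation for lags 0..max_lag,
// r[lag] = sum over i of x[i] * y[i + lag], skipping indices outside y.
std::vector<float> cross_correlate(const std::vector<float>& x,
                                   const std::vector<float>& y,
                                   std::size_t max_lag) {
    std::vector<float> r(max_lag + 1, 0.0f);
    for (std::size_t lag = 0; lag <= max_lag; ++lag) {        // outer loop: lags
        double acc = 0.0;
        for (std::size_t i = 0; i < x.size() && i + lag < y.size(); ++i) {
            acc += static_cast<double>(x[i]) * static_cast<double>(y[i + lag]);
        }
        r[lag] = static_cast<float>(acc);
    }
    return r;
}
```

Hand-written loops like these are the candidates that a tuned library routine would replace, since the library can apply instruction-set-specific optimizations that are tedious to write by hand.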
Optimization #3. Originally developed by Intel, the Open
Source Computer Vision Library (OpenCV) can
be used to develop real-time image processing, computer
vision, and pattern recognition programs, and supports
the utilization of Intel IPP for accelerated processing.⁸
By replacing OpenCV functions in the source code with
IPP functions, the solution scales well in large-scale data
scenarios and performs well across all generations of Intel
platforms.
Optimization #4. SonoScape’s S-Fetus 4.0 obstetric
screening assistant also utilizes the Intel® DPC++
Compatibility Tool to efficiently migrate existing CUDA code
to DPC++, ensuring cross-architecture compatibility and
minimizing the time required for migration. As shown in
Figure 6, the tool provides powerful interactive functions to
help developers migrate CUDA code, including kernel code
[Figure 6 diagram labels: Developer’s CUDA Source; Compatibility Tool; 80-90% Transformed; DPC++ Source Code]

[Figure 7 chart data: Time Optimization of Multimodal Workload (ms, lower is better); 20x faster]
Baseline: 140
Optimization 1: 33
Optimization 2: 13.787
Optimization 3: 13.31
Optimization 4: 7.024
After these optimizations were completed, the performance
of the SonoScape S-Fetus 4.0 running on a heterogeneous
platform based on Intel oneAPI DPC++ improved to
nearly 20x the baseline performance recorded
before optimization, as shown in Figure 7.⁷
Figure 7. Performance Improvement with the Intel oneAPI
Base Toolkit⁷
(Baseline: Code before optimization; Optimization 1: Intel oneAPI DPC++
Compiler; Optimization 2: Intel IPP used to replace loop source code;
Optimization 3: Intel IPP used to replace OpenCV functions; Optimization 4:
CPU + iGPU execution after CUDA migration)
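As a worked check of the "nearly 20x" figure, the speedup is the time before divided by the time after. The sketch below uses a hypothetical helper named `speedup`. With the Figure 7 chart values of 140 ms (baseline) and 7.024 ms (Optimization 4) it gives roughly 19.9x. The running text quotes 141 ms for the baseline, which gives roughly 20.1x.

```cpp
#include <cassert>

// Hypothetical helper: speedup factor = time before / time after.
// Both times must be positive and in the same unit (here, milliseconds).
double speedup(double before_ms, double after_ms) {
    assert(before_ms > 0.0 && after_ms > 0.0);
    return before_ms / after_ms;
}
```

Applied per stage, `speedup(141.0, 33.0)` is about 4.3x for Optimization 1 and `speedup(33.0, 13.787)` is about 2.4x for Optimization 2, so the largest single gain comes from recompiling with vectorization.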
[Figure 5 diagram labels: Scalar Processing (operands A and B, result C); Vector Processing (vector length VL, element-wise A_i + B_i)]
Solution Brief | Intel® oneAPI Base Toolkit Helps SonoScape Optimize the Performance of its S-Fetus 4.0 Obstetric Screening Assistant