Deep Dive on Qualcomm Snapdragon DXP
• The Hexagon DSP gets wider, faster, and more powerful
• Smartphone cameras will be faster, with more image processing capabilities
• Qualcomm moves sensor hub inside for always on features
A long-term leader in building truly heterogeneous SoCs, which include a 64-bit CPU, a powerful GPU, dual ISPs, Qualcomm also includes a few DSPs, with brands such as Hexagon. Other SoC builders have DSPs in their chips but use them primarily for audio or modem functions. Qualcomm does that as well and also dedicates one to video and image processing.
DSPs use a wide word and are often referred to as very long instruction word (VLIW) devices. Some folks say the “L” really is “large.” VLIW devices can be run as a parallel SIMD processor and have been used as floating point processors in graphics machines. With the power of a VLIW device also comes a complex programming environment, although TI and others have developed some very clever compilers to take some of the drudgery out of explicitly programming the processor. Qualcomm also has such tools for their OEM customers.
What’s new in the Snapdragon 820 is the extended Hexagon DSP, which Qualcomm has designated the 680. The company is employing the signal processing capabilities of the 680 with the two Spectra ISPs in the Snapdragon, which makes for a powerful and very fast image processing system—things that used to fill a 2U rack 10 years ago now fit in your pocket.
Referencing them as the Hexagon Vector eXtension (HVX), they expanded the SIMD from 64-bits to a 1024b SIMD 4 vector-slot VLIW with 4096 result bits/cycle. You can manipulate the device (via its VLIW I/O) to run 256, 8×8 multiply, or 64, 16×16 multiply operations. It also has 32, 1024-bit vector registers which can be operated as 8/16/32 bit fixed point function. The wide vectors processing only support fixed-point operations. This is because of the focus on imaging processing by the wide vector capability, with most imaging being 16-bit fixed-point or less. Only the Hexagon Vector eXtensions are fixed-point only, not the entire core.
Because a DSP is such a special device, it can accomplish a lot of the same accuracy, and do it much faster, than a conventional CISC FP processor. And it can do it with less power. In imaging, for example, the DSP can generate results ~3x faster at ~10x lower energy (vs. quad-CPU). The Hexagon 680 does support 32-bit floating point on the scalar portion of the core, which also still performs SIMD 64-bit concurrently with the SIMD 1024-bit.
It is important to remember this wide vector capability is an extension of the core. That is, the core retains all the previous capabilities, concurrently supporting SIMD 64-bit for 8/16/32 bit fixed-point as well as 32-bit floating point. The DSP’s special ISA offers sliding window filters, LUTs, histograms, and performance sufficient for UHD video, or post-processing of 20Mpix camera burst mode processing and more.
Schematically, the 680 looks like most DSPs, as shown in the following diagram. However, the supporting hardware multi-threading is unique to Qualcomm’s Hexagon DSP.
The 680 can run four 4 parallel scalar threads each with 4-way VLIW and shared L1/L2 and do it at 500MHz per thread, yielding 2GHz total scalar performance. That translates to super-scalar performance concurrently with new wide vector performance.
There are two Hexagon Vector eXtensions (HVX) contexts, controllable by any two scalar threads that will run up to 500MHz per thread, giving 1GHz total vector performance. Other threads can do scalar work in parallel. The net result is a domain specific architecture with a wide 1024-bit SIMD (for pixel data parallelism) and emphasis on low precision fixed-point plus a special ISA. All of that provides a parallel and coordinated capability for scalar and vector threads, which can exploit the large primary cache for imaging working sets.
The 680 takes data from the SoC’s ISP via a L2 cache, and returns image processing and filtering results to the system memory and CPU, as illustrated in the following block diagram. The primary pre-processing path ingests from the Camera Sensor and returns to the Camera ISP, as highlighted in the yellow block. In addition, any supplemental information obtained from the pre-processing can be stored in memory for later use.
The SoC’s ARM compliant SMMU allows for Zero-Copy data sharing with CPU. That, in turn, provides a multi-threaded DSP capability that can service multiple offload sessions (concurrent apps for Audio, Camera, Computer Vision, etc.). The SMMU supports multiple Context Banks to allow sharing with multiple different address spaces on the CPU and the SMMU can be used to support processing on secure content managed outside of HLOS.
So with a wide-word signal processor in the front end, image things can be run really fast. With ever increasing sensor resolution and higher resolution screens, you need to move pixels from the front end to the screen fast, and they better look right when they get there.
Qualcomm compared the DSP with HVX vs. just a Quad Krait CPU with full Neon-Optimization. The Quad Krait CPU was run at 2.65GHz, and the single DSP/HVX ran at 725MHz. The results are shown in the next diagram.
The point Qualcomm is trying to prove here is that, for the super smartphones that will be coming out in 2016, with mega sensors and big high-res screens, you need more image processing horsepower than you can get from just a CPU, no matter how many cores you jam into that CPU.
Qualcomm actually puts three DSPs in their SOC.
The low-power island DSP is for “always on” sensor processing. This is a major breakthrough for Qualcomm and the industry. Putting the sensor hub inside the SoC saves board space and, most importantly, power. The chip has a new power management schema to be “always off” until needed. That gives a longer battery life for key use cases (e.g., pedometer or sensor-assisted positioning).
Qualcomm is claiming to be the first in the SoC market with super wide vector SIMD extensions for their DSP. They claim it can be exploited through conventional tools and techniques, using shared memory POSIX-like threads (on DSP RTOS), and a LLVM compiler. This, says the company, allows programming with C/C++ and Intrinsics and a suite of pre-optimized libraries for common filters & algorithms. What’s not to like?