Deep Dive on Qualcomm Snapdragon DXP

• The Hexagon DSP gets wider, faster, and more powerful
• Smartphone cameras will be faster, with more image processing capabilities
• Qualcomm moves the sensor hub inside the SoC for always-on features

Qualcomm has long been a leader in building truly heterogeneous SoCs. Alongside a 64-bit CPU, a powerful GPU, and dual ISPs, its chips include several DSPs under the Hexagon brand. Other SoC builders have DSPs in their chips but use them primarily for audio or modem functions. Qualcomm does that as well, and also dedicates one to video and image processing.

DSPs often use a wide instruction word and are frequently referred to as very long instruction word (VLIW) devices. Some folks say the “L” really is “large.” VLIW devices can be run as parallel SIMD processors and have been used as floating-point processors in graphics machines. With the power of a VLIW device comes a complex programming environment, although TI and others have developed some very clever compilers to take some of the drudgery out of explicitly programming the processor. Qualcomm also offers such tools to its OEM customers.

What’s new in the Snapdragon 820 is the extended Hexagon DSP, which Qualcomm has designated the 680. The company pairs the signal processing capabilities of the 680 with the two Spectra ISPs in the Snapdragon, which makes for a powerful and very fast image processing system. Things that used to fill a 2U rack 10 years ago now fit in your pocket.

Qualcomm calls the new capability the Hexagon Vector eXtensions (HVX). It widens the SIMD from 64 bits to a 1024-bit SIMD, four-vector-slot VLIW design with 4096 result bits per cycle. You can manipulate the device (via its VLIW I/O) to run 256 8×8 multiply operations or 64 16×16 multiply operations. It also has 32 1024-bit vector registers, which can be operated on as 8-, 16-, or 32-bit fixed-point values. The wide vector processing supports only fixed-point operations, because the wide vector capability is aimed at image processing, and most imaging data is 16-bit fixed-point or less. Only the Hexagon Vector eXtensions are restricted to fixed-point, not the entire core.
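To make the vector-width figures concrete, here is a small back-of-the-envelope sketch in Python. The widths and slot count come from the text; the way they are combined (four slots times 1024 bits giving 4096 result bits, and 16-bit products for 8×8 multiplies) is our reading of the numbers, not a statement from Qualcomm. The function and variable names are made up for illustration.

```python
VECTOR_BITS = 1024   # HVX vector register width, from the text
VECTOR_SLOTS = 4     # vector slots per VLIW packet, from the text


def lanes(element_bits: int) -> int:
    """Number of fixed-point elements that fit in one 1024-bit vector."""
    return VECTOR_BITS // element_bits


# Elements per vector for each supported fixed-point width.
lanes_by_width = {bits: lanes(bits) for bits in (8, 16, 32)}

# Assumption: the 4096 result bits/cycle figure is 4 slots x 1024 bits.
result_bits_per_cycle = VECTOR_SLOTS * VECTOR_BITS

# Assumption: an 8x8 multiply yields a 16-bit product, so the number of
# such products that fit in the result bits is 4096 / 16.
products_8x8 = result_bits_per_cycle // 16

print(lanes_by_width, result_bits_per_cycle, products_8x8)
```

Under those assumptions, the 256 8×8 multiplies quoted above exactly fill the 4096 result bits.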

Because a DSP is such a specialized device, it can achieve much the same accuracy as a conventional floating-point processor, do it much faster, and use less power. In imaging, for example, the DSP can generate results ~3x faster at ~10x lower energy (vs. quad-CPU). The Hexagon 680 does support 32-bit floating point on the scalar portion of the core, which also still performs 64-bit SIMD concurrently with the 1024-bit SIMD.
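For readers who have not worked in fixed point, the sketch below shows the kind of 16-bit arithmetic the vector unit is built around. It uses the common Q15 format (1 sign bit, 15 fractional bits) as an assumed example; the text does not say which formats Qualcomm's libraries use, and the helper names are hypothetical.

```python
Q = 15  # fractional bits in Q15, an assumed but common 16-bit format
INT16_MIN, INT16_MAX = -32768, 32767


def saturate(v: int) -> int:
    """Clamp an integer to the signed 16-bit range."""
    return max(INT16_MIN, min(INT16_MAX, v))


def to_q15(x: float) -> int:
    """Convert a float in [-1, 1) to a saturated Q15 integer."""
    return saturate(int(round(x * (1 << Q))))


def q15_mul(a: int, b: int) -> int:
    """Multiply two Q15 values with rounding and saturation."""
    product = a * b                              # Q30 intermediate
    rounded = (product + (1 << (Q - 1))) >> Q    # back to Q15, round to nearest
    return saturate(rounded)


def from_q15(v: int) -> float:
    """Convert a Q15 integer back to a float."""
    return v / (1 << Q)


half = to_q15(0.5)
quarter = q15_mul(half, half)
print(half, quarter, from_q15(quarter))
```

Everything here is integer multiply, add, shift, and clamp, which is why this style of arithmetic maps well onto wide, low-precision vector lanes.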

It is important to remember that this wide vector capability is an extension of the core. That is, the core retains all its previous capabilities, concurrently supporting 64-bit SIMD for 8/16/32-bit fixed-point as well as 32-bit floating point. The DSP’s special ISA offers sliding window filters, LUTs, and histograms, with performance sufficient for UHD video, post-processing of 20-megapixel camera bursts, and more.
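The three operations named above are simple to state in scalar form. The following Python sketch shows one plausible version of each on 8-bit pixels, using integer arithmetic only. It is an illustration of the operation types, not Qualcomm's implementation, and the function names and the choice of a 3-tap box filter are our own.

```python
def box_filter_3(row):
    """3-tap sliding-window average over 8-bit pixels (edges replicated)."""
    n = len(row)
    out = []
    for i in range(n):
        left = row[max(i - 1, 0)]
        right = row[min(i + 1, n - 1)]
        # Adding 1 before dividing by 3 rounds to the nearest integer.
        out.append((left + row[i] + right + 1) // 3)
    return out


def apply_lut(row, lut):
    """Map each 8-bit pixel through a 256-entry lookup table."""
    return [lut[p] for p in row]


def histogram(row):
    """Count occurrences of each 8-bit pixel value."""
    bins = [0] * 256
    for p in row:
        bins[p] += 1
    return bins


invert_lut = [255 - v for v in range(256)]   # example table: negative image
pixels = [0, 30, 60, 255]

print(box_filter_3(pixels))
print(apply_lut(pixels, invert_lut))
print(histogram(pixels)[255])
```

Each loop body does the same small integer operation on every pixel, which is the pattern a wide SIMD unit can run across many pixels at once.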

Schematically, the 680 looks like most DSPs, as shown in the following diagram. However, the supporting hardware multi-threading is unique to Qualcomm’s Hexagon DSP.

[Diagram: Hexagon 680 schematic]

The 680 can run four parallel scalar threads, each with 4-way VLIW and shared L1/L2, at 500MHz per thread, yielding 2GHz total scalar performance. That translates to super-scalar performance concurrently with the new wide vector performance.

There are two Hexagon Vector eXtensions (HVX) contexts, controllable by any two scalar threads, which run at up to 500MHz per thread, giving 1GHz total vector performance. Other threads can do scalar work in parallel. The net result is a domain-specific architecture with a wide 1024-bit SIMD (for pixel data parallelism) and an emphasis on low-precision fixed-point plus a special ISA. All of that provides a parallel and coordinated capability for scalar and vector threads, which can exploit the large primary cache for imaging working sets.
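One way to picture four scalar threads sharing two vector contexts is as a small resource pool. The Python sketch below is a software analogy only: it assumes a thread simply waits until a context is free, which may not match how the hardware actually arbitrates. The thread, context, and clock figures come from the text; all names are hypothetical.

```python
import threading

SCALAR_THREADS = 4    # hardware scalar threads, from the text
VECTOR_CONTEXTS = 2   # HVX contexts, from the text
THREAD_MHZ = 500      # per-thread clock, from the text

scalar_total_mhz = SCALAR_THREADS * THREAD_MHZ    # 4 x 500 MHz
vector_total_mhz = VECTOR_CONTEXTS * THREAD_MHZ   # 2 x 500 MHz

# Analogy: a semaphore that admits at most two holders at a time.
contexts = threading.BoundedSemaphore(VECTOR_CONTEXTS)
lock = threading.Lock()
state = {"in_use": 0, "peak": 0}
completed = []


def worker(thread_id: int) -> None:
    with contexts:                       # wait until a vector context is free
        with lock:
            state["in_use"] += 1
            state["peak"] = max(state["peak"], state["in_use"])
        # ... vector work would run here ...
        with lock:
            state["in_use"] -= 1
            completed.append(thread_id)


threads = [threading.Thread(target=worker, args=(i,))
           for i in range(SCALAR_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(scalar_total_mhz, vector_total_mhz, state["peak"])
```

However the threads are scheduled, no more than two ever hold a vector context at once, while all four still finish their work.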

The 680 takes data from the SoC’s ISP via an L2 cache and returns image processing and filtering results to the system memory and CPU, as illustrated in the following block diagram. The primary pre-processing path ingests from the camera sensor and returns to the camera ISP, as highlighted in the yellow block. In addition, any supplemental information obtained from the pre-processing can be stored in memory for later use.

[Diagram: Hexagon 680 data path between the camera sensor, ISP, L2 cache, system memory, and CPU]

The SoC’s ARM-compliant SMMU allows zero-copy data sharing with the CPU. That, in turn, provides a multi-threaded DSP capability that can service multiple offload sessions (concurrent apps for audio, camera, computer vision, etc.). The SMMU supports multiple context banks to allow sharing with multiple different address spaces on the CPU. It can also be used to support processing of secure content managed outside the high-level operating system (HLOS).
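The idea behind zero-copy sharing is that both processors work on the same bytes rather than passing copies back and forth. The Python sketch below is only an analogy using two views of one buffer; it says nothing about how the SMMU maps addresses, and the "CPU" and "DSP" labels and function name are illustrative.

```python
frame = bytearray([10, 20, 30, 40])   # stands in for a shared camera buffer
cpu_view = memoryview(frame)          # the "CPU" mapping of the buffer
dsp_view = memoryview(frame)          # the "DSP" mapping of the same bytes


def dsp_brighten(view, amount):
    """Brighten 8-bit pixels in place, saturating at 255."""
    for i in range(len(view)):
        view[i] = min(255, view[i] + amount)


dsp_brighten(dsp_view, 220)

# The "CPU" sees the result without any copy having been made.
result = list(cpu_view)
print(result)
```

Because both views refer to the same underlying buffer, the work done through one is immediately visible through the other.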

So what?

So with a wide-word signal processor in the front end, image things can be run really fast. With ever-increasing sensor resolution and higher-resolution screens, you need to move pixels from the front end to the screen fast, and they had better look right when they get there.

Qualcomm compared the DSP with HVX against a quad-core Krait CPU with full NEON optimization. The quad Krait CPU ran at 2.65GHz, and the single DSP/HVX ran at 725MHz. The results are shown in the next diagram.

[Diagram: benchmark results, DSP with HVX vs. quad Krait CPU]

The point Qualcomm is trying to prove here is that, for the super smartphones that will be coming out in 2016, with mega sensors and big high-res screens, you need more image processing horsepower than you can get from just a CPU, no matter how many cores you jam into that CPU.
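A rough calculation shows why clock speed alone is the wrong yardstick here. The Python sketch below combines the clock speeds from the comparison with the "~3x faster at ~10x lower energy" figures quoted earlier. It assumes those figures apply to this comparison and that all four CPU cores are busy for the whole run, so treat the outputs as order-of-magnitude estimates rather than measured results.

```python
CPU_CORES = 4
CPU_GHZ = 2.65          # per-core clock, from the text
DSP_GHZ = 0.725         # single DSP/HVX clock, from the text
SPEEDUP = 3.0           # "~3x faster", from the text
ENERGY_RATIO = 1 / 10   # "~10x lower energy", from the text

# Combined CPU clock rate versus the DSP clock rate.
aggregate_cpu_ghz = CPU_CORES * CPU_GHZ
clock_ratio = aggregate_cpu_ghz / DSP_GHZ

# Energy = power x time, so relative power = relative energy / relative time.
relative_time = 1 / SPEEDUP
relative_power = ENERGY_RATIO / relative_time

# Work done per clock tick, relative to the four CPU cores combined.
work_per_clock_ratio = SPEEDUP * clock_ratio

print(round(clock_ratio, 1), round(relative_power, 2),
      round(work_per_clock_ratio, 1))
```

On these assumptions, the DSP finishes sooner while ticking far fewer times per second and drawing a fraction of the average power, which is the case for wide, specialized hardware over more CPU cores.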

Qualcomm actually puts three DSPs in its SoC.

[Diagram: the three DSPs in the SoC]

The low-power island DSP is for “always on” sensor processing. This is a major breakthrough for Qualcomm and the industry. Putting the sensor hub inside the SoC saves board space and, most importantly, power. The chip has a new power management scheme that keeps it “always off” until needed. That gives longer battery life for key use cases (e.g., pedometer or sensor-assisted positioning).

Qualcomm claims to be the first in the SoC market with super-wide vector SIMD extensions for its DSP. It says they can be exploited through conventional tools and techniques, using shared-memory POSIX-like threads (on a DSP RTOS) and an LLVM compiler. This, says the company, allows programming in C/C++ with intrinsics and a suite of pre-optimized libraries for common filters and algorithms. What’s not to like?
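The shared-memory threading model Qualcomm describes will look familiar to anyone who has split an image across worker threads. The Python sketch below shows that general pattern, with each thread handling its own band of rows in one shared buffer. It is a generic illustration, not the Hexagon SDK, and the image size, band split, and function name are all made up.

```python
from concurrent.futures import ThreadPoolExecutor

WIDTH, HEIGHT = 8, 4
image = bytearray(range(WIDTH * HEIGHT))   # shared 8-bit grayscale buffer


def invert_rows(first_row: int, last_row: int) -> int:
    """Invert pixels in rows [first_row, last_row) of the shared buffer."""
    for i in range(first_row * WIDTH, last_row * WIDTH):
        image[i] = 255 - image[i]
    return last_row - first_row


# Two non-overlapping row bands, so the threads never touch the same pixel.
bands = [(0, 2), (2, 4)]

with ThreadPoolExecutor(max_workers=2) as pool:
    rows_done = sum(pool.map(lambda band: invert_rows(*band), bands))

print(rows_done, image[0], image[-1])
```

Keeping the bands disjoint means no locking is needed, which is the usual reason image work partitions so cleanly across threads.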

Published by

Jon Peddie

Dr. Jon Peddie is one of the pioneers of the graphics industry, and formed Jon Peddie Research (JPR) to provide customer-intimate consulting and market forecasting services. Peddie lectures at numerous conferences on topics pertaining to graphics technology and the emerging trends in digital media technology. Recently named one of the most influential analysts, he is frequently quoted in trade and business publications, and contributes articles to numerous publications, as well as appearing on CNN and TechTV. Learn more about Jon and his services at www.jonpeddie.com

8 thoughts on “Deep Dive on Qualcomm Snapdragon DXP”

  1. Interesting, thank you.
    This dovetails with the differentiation talk: any OEM will be able to buy the 820, but not all will be able to take equal advantage of the 680’s capabilities, since they’ll need to either wrap Qualcomm’s barebones code in a UI, or, better, meld it into their own apps for photo/video shooting/retouching/rendering… and maybe finally use the 680 to optimize less obvious (not image-related) system routines/apps.
    On a sad note, third-party devs will probably mostly ignore it because it’ll be on a small share of devices, though hopefully key apps (Adobe…) will have bespoke code.

    1. Agree, but this makes an interesting point in the broader conversation about Qualcomm and Android OEMs. If you paid attention to the Obi Mobile launch, one thing that came out of that was their efforts to fine tune the optics of the Sony sensor to the DSP of the Snapdragon 610. Image quality and post processing features are an area they are focusing on for differentiation. Not that this is something only they can do but it does speak to what Qualcomm is also doing in software for their partners.

      Also worth noting what’s coming with the 820 and other features which will eventually trickle down to other products, is their own custom architecture. There are optimizations they are doing across the board with this new architecture on the SoC from sensors, to modem, to graphics, and CPU that I think are quite interesting. Particularly as they trickle down to other chipsets in the portfolio which OEMs trying to hit $200 price points can use.

  2. This is a curiously context-free description.

    Other ARM or specialty manufacturers have already included graphics capabilities past those in the ARM NEON spec; nVidia’s K1 shipped over a year ago and the X1 is on its way. Of course, Apple *ALSO* includes advanced GPU type capability. Dr. Peddie could correct me, but I *ALSO* believe it’s common to include a very sensor-specific ASIC chip in designs to speed burst-mode, HDR, video and the like, well beyond what would be sensible to try in a CPU.

    “So with a wide-word signal processor in the front end, image things can be run really fast.”
    While this type of circuitry works well for speeding photos, an engineer wouldn’t put this much horsepower into a phone just for that purpose. Gaming and video processing (especially, encoding into h.264/h.265 for efficient, compressed transmission of video shot on a phone, rather than playing a Netflix download) are what justify the extra processing capability. Everybody takes photos but penny-wise OEMs seem VERY unlikely to put this chip into middle-of-the-road phones.

    That puts the emphasis onto the point that obarthelemy raises: code interface/compatibility. A game writer won’t be able to devote a lot of time to making a game run well on a tiny fraction of their potential customers’ phones; they’ll need to use standard libraries available on most Android phones, which the OEM in turn can tune for the specific graphics chip. Understanding where this chip gets used will require a good understanding of how well Google and Qualcomm will work together.

    (It appears a few dozen million Windows phones may ship with this chip, and the same question applies: how much work will Microsoft put in to making either standard video-editing or other apps get the most out of this, and how well will they support an interface that lets game devs write one set of routines for different models of the WinPhone?)

    I’m frankly a bit pessimistic. nVidia has had some powerful graphics chops in its CPUs — that’s its DNA — but the Tegra line appears to be mostly a commercial failure, with “wins” such as the Surface RT that went nowhere. And to date, partly because of the challenge of good graphics in a battery-powered device, but also because of the limited storage & screen on most of the world’s handhelds, mobile game devs haven’t tried to make rich graphics such as those that’d exploit this CPU’s talents.
