In articles and conference presentations on inference accelerators, the focus is primarily on TOPS (frequency times number of MACs), with a little on memory (DRAM interfaces and on-chip SRAM), very little on interconnect (also very important, but that’s another story) and almost nothing on the software!
Without software, the inference accelerator is a rock that does nothing. Software is what breathes life into an inference accelerator (but can’t rescue a bad hardware architecture).
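To make the gap between headline TOPS and delivered throughput concrete, here is a minimal sketch of the arithmetic. It assumes the common convention of counting each multiply-accumulate as two operations (exposed as the `ops_per_mac` parameter so either convention can be used); the MAC count, clock frequency and utilization figures are hypothetical and do not describe any particular chip.

```python
def peak_tops(num_macs: int, freq_hz: float, ops_per_mac: int = 2) -> float:
    """Peak tera-operations per second for a given MAC count and clock.

    ops_per_mac=2 is an assumed convention (one multiply plus one add);
    pass 1 to get the plain "frequency times number of MACs" figure.
    """
    return num_macs * freq_hz * ops_per_mac / 1e12


def delivered_tops(peak: float, utilization: float) -> float:
    """Throughput actually achieved when only a fraction of MACs are kept busy."""
    if not 0.0 <= utilization <= 1.0:
        raise ValueError("utilization must be between 0 and 1")
    return peak * utilization


# Hypothetical accelerator: 4096 MACs clocked at 1 GHz.
peak = peak_tops(num_macs=4096, freq_hz=1.0e9)

# Hypothetical utilization figures: the same silicon delivers very different
# throughput depending on how well the software keeps the MACs fed.
poorly_scheduled = delivered_tops(peak, utilization=0.25)
well_scheduled = delivered_tops(peak, utilization=0.75)
```

The peak number depends only on the hardware; the utilization factor is where the software shows up.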
Ready, fire, aim
Several customers have told us “our vendors cannot give us performance projections before silicon.” And some customers who have designed their own inference accelerators have complained “we have lots of TOPS but the software guys can’t seem to utilize them efficiently.”
And other customers have told us that getting reasonable performance from some well-known inference accelerators requires very low-level programming to manage memory storage and transfers, because the vendors’ software cannot. It appears many inference accelerators have left their software to a late stage rather than developing the software and hardware together to make sure they work well together.
All inference accelerators have in common the following elements:
- On-chip SRAM
- Off-chip DRAM
- Control logic
- On-chip interconnect between all of the units
The number and organization of these elements vary widely between inference accelerators.
When architecting an inference accelerator, how do you know if you are building a chip that will deliver high throughput/watt and high throughput/$? The answer is the inference software.
In architecting our InferX X1 we had a performance estimation model very early on for key performance benchmarks, often requested by customers, such as YOLOv3 for megapixel images and ResNet-50 for 224×224 and megapixel images. Using these performance estimation models along with cost models from our silicon/package vendors allowed us to determine the optimum die size, number of MACs, number of SRAM bytes and number of DRAM interfaces to maximize throughput/$ and throughput/watt for megapixel images.
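The kind of trade-off study described above can be sketched as a small design-space sweep. Everything in this example is a placeholder: the roofline-style throughput estimate, the linear cost model, the workload numbers and all of the names (`Config`, `estimate_fps`, `estimate_cost`) are assumptions made for illustration, not the actual InferX X1 models or vendor cost data.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class Config:
    num_macs: int      # MAC units on the die
    sram_mb: int       # on-chip SRAM in megabytes
    dram_ifaces: int   # number of DRAM interfaces


# Hypothetical workload and technology numbers (placeholders, not measurements).
MACS_PER_FRAME = 100e9           # MAC operations needed per image
WORKING_SET_MB = 16              # data that would ideally stay on chip
TRAFFIC_BYTES = 200e6            # bytes moved per image if nothing fits in SRAM
FREQ_HZ = 1e9                    # clock frequency
BYTES_PER_SEC_PER_IFACE = 4e9    # bandwidth of one DRAM interface


def estimate_fps(cfg: Config) -> float:
    """Toy estimate: frame rate is bounded by compute or by DRAM traffic."""
    compute_fps = cfg.num_macs * FREQ_HZ / MACS_PER_FRAME
    # Assume the share of the working set that fits in SRAM never touches DRAM.
    spill = max(0.0, 1.0 - cfg.sram_mb / WORKING_SET_MB)
    dram_bytes = TRAFFIC_BYTES * spill
    if dram_bytes == 0:
        return compute_fps
    dram_fps = cfg.dram_ifaces * BYTES_PER_SEC_PER_IFACE / dram_bytes
    return min(compute_fps, dram_fps)


def estimate_cost(cfg: Config) -> float:
    """Toy linear cost model in arbitrary dollars (coefficients are made up)."""
    return 5.0 + 0.002 * cfg.num_macs + 0.5 * cfg.sram_mb + 3.0 * cfg.dram_ifaces


def best_config(configs):
    """Pick the configuration with the highest estimated throughput per dollar."""
    return max(configs, key=lambda c: estimate_fps(c) / estimate_cost(c))


candidates = [
    Config(m, s, d)
    for m, s, d in product((1024, 2048, 4096), (4, 8, 16), (1, 2))
]
best = best_config(candidates)
```

A real study would replace `estimate_fps` with a per-layer performance model for each benchmark and `estimate_cost` with vendor quotes, but the structure of the search stays the same: estimate, cost, rank.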
How can we be confident in our performance estimates before silicon? It is because our architecture is totally deterministic. For a given model and image size, we know the execution time to the cycle. It appears that most other inference accelerators have non-deterministic features: bus contention, SRAM contention, DRAM contention. With contention, performance modelling is very difficult without simulating a very large number of images for the full model size.
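A toy model shows why determinism matters for estimation. The sketch below assumes a convolution layer whose MACs are spread perfectly across the array, which is an idealization, and models contention as an independent random stall on each cycle with a made-up probability. It is not the X1's cycle model; the function names are hypothetical.

```python
import random


def conv_macs(h_out: int, w_out: int, c_in: int, c_out: int, k: int) -> int:
    """MAC operations in a k x k convolution layer producing an h_out x w_out output."""
    return h_out * w_out * c_out * c_in * k * k


def deterministic_cycles(layer_macs: int, num_macs: int) -> int:
    """Cycle count with no contention: one closed-form number (integer ceiling)."""
    return -(-layer_macs // num_macs)


def contended_cycles(layer_macs: int, num_macs: int,
                     stall_prob: float, rng: random.Random) -> int:
    """Cycle count when each compute cycle may incur one stall cycle.

    stall_prob is a hypothetical per-cycle chance of losing a cycle to
    bus, SRAM or DRAM contention.
    """
    base = deterministic_cycles(layer_macs, num_macs)
    stalls = sum(1 for _ in range(base) if rng.random() < stall_prob)
    return base + stalls


# Example layer: 56x56 output, 64 input and 64 output channels, 3x3 kernel.
layer = conv_macs(h_out=56, w_out=56, c_in=64, c_out=64, k=3)

# Deterministic case: a single calculation gives the answer.
exact = deterministic_cycles(layer, num_macs=4096)  # 28224 cycles

# Contended case: each run gives a different answer, so an estimate
# needs many runs and still only yields a distribution.
rng = random.Random(0)
trials = [contended_cycles(layer, 4096, stall_prob=0.1, rng=rng) for _ in range(20)]
spread = max(trials) - min(trials)
```

In the deterministic case the estimate is arithmetic; in the contended case it is a statistics problem, and the cost of that grows with model and image size.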
Today our customers can use our performance modelling tool to determine how fast their model/image size will run on X1: it takes a few minutes maximum. Because it’s fast, customers can quickly try modifications to their model to see if it improves throughput by better utilizing the underlying hardware.
Some customers have shared their models with us, especially where they have non-standard applications, to see if we could improve performance. In several cases we have been able to optimize performance 2x or 4x on key layers by implementing new algorithms in our software compiler.
Our full-chip RTL is running on Mentor emulators for multiple inference layers at full megapixel image sizes. Doing this requires our software to actually generate the control code for the X1, so our software is ready for silicon (which we will get soon).
Our nnMAX inference compiler takes neural network models in ONNX and TensorFlow-Lite and compiles them directly to the control code for the InferX X1. The customer does not need to do any low-level programming, unlike what we hear of most other inference accelerators. X1 supports BF16, so customers with models trained in FP32 can very quickly get up and running without having to wait for quantization (but when they do quantize, X1 runs in INT8 mode too).
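The reason BF16 lets FP32 models run without a quantization step is that BF16 keeps FP32's sign bit and full 8-bit exponent and shortens only the mantissa, from 23 bits to 7. The dynamic range is therefore unchanged and no calibration data is needed, unlike INT8. The sketch below illustrates the format by converting with simple truncation of the low 16 bits. That is an assumption made for brevity: hardware and frameworks often round to nearest instead, and how X1 or nnMAX performs the conversion is not stated here. The helper names are hypothetical.

```python
import math
import struct


def fp32_to_bf16_bits(x: float) -> int:
    """Return the 16-bit BF16 pattern for x by truncating its FP32 encoding.

    Truncation (dropping the low 16 mantissa bits) is an assumed, simplified
    conversion; round-to-nearest-even is also common.
    """
    fp32_bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return fp32_bits >> 16


def bf16_bits_to_fp32(bits: int) -> float:
    """Expand a 16-bit BF16 pattern back to a float by zero-filling the low bits."""
    return struct.unpack("<f", struct.pack("<I", (bits & 0xFFFF) << 16))[0]


def to_bf16(x: float) -> float:
    """Round-trip a value through BF16 to see what precision survives."""
    return bf16_bits_to_fp32(fp32_to_bf16_bits(x))


# Values exactly representable in BF16 survive unchanged.
one = to_bf16(1.0)

# Precision drops to roughly 2-3 significant decimal digits...
approx_pi = to_bf16(math.pi)  # 3.140625

# ...but very large and very small FP32 magnitudes remain representable,
# which is why no per-layer scaling or calibration is required.
large = to_bf16(1.0e38)
small = to_bf16(1.0e-30)
```

INT8, by contrast, has only 256 levels, so each tensor needs a scale chosen from representative data, which is the quantization work that can be deferred when starting in BF16.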
When our silicon comes back in Q2, we expect to be able to run numerous open source models (YOLOv3, etc.) and numerous customer proprietary models within a week to confirm our performance estimates, and then to sample customers with boards so they can confirm them as well.
Developing software performance estimation models, and then the full software compiler, in parallel with and ahead of silicon is critical to ensuring that the combination of hardware and software delivers optimum throughput/$ and throughput/watt. A deterministic architecture is very helpful in being able to do this.
Geoff Tate is the founder and CEO of Flex Logix. Tate has more than three decades of experience in technology. He is the former CEO of Rambus, and a current board director at Everspin Technologies. He received his BSc in computer science from the University of Alberta, and an MBA from Harvard Business School.