The Problem
Deep-learning models, whether for natural language processing, image processing, or machine vision, are becoming larger and more complex. Running these models in production requires significant computing resources even after traditional optimization techniques such as hyperparameter tuning, pruning, and quantization are applied. As a result, both cloud and edge deployments of deep learning are challenged in two important dimensions:
- Performance: Applications like autonomous driving, real-time video processing, and interactive voice response are time-sensitive and must deliver results within strict latency budgets. Failure to meet performance targets can limit overall product success.
- Cost: In the cloud, longer processing times and higher memory requirements translate directly into higher costs. At the edge, growing model complexity demands larger and more expensive CPUs or GPUs. The processing unit on a given edge platform is often difficult or impossible to change, so designers struggle to add functionality and intelligence while staying within the available memory and compute budget.
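To make one of the traditional optimization techniques mentioned above concrete, the sketch below shows the core idea of post-training quantization: mapping 32-bit float weights to 8-bit integers with a single scale factor, shrinking memory roughly 4x at the cost of a small, bounded rounding error. This is an illustrative NumPy sketch, not the API of any particular framework; the function names are hypothetical.

```python
import numpy as np

def quantize_int8(w):
    # Symmetric linear quantization: one scale maps the float range
    # [-max|w|, +max|w|] onto the int8 range [-127, 127].
    scale = np.abs(w).max() / 127.0
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    # Recover approximate float weights for computation.
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.standard_normal((64, 64)).astype(np.float32)  # toy weight matrix
q, scale = quantize_int8(w)
w_hat = dequantize(q, scale)

# int8 storage is 1 byte per weight vs. 4 bytes for float32,
# and the per-weight reconstruction error is at most half a step (scale / 2).
```

Real toolchains (e.g. per-channel scales, calibration on activation ranges, quantization-aware training) refine this basic scheme, but the memory/accuracy trade-off it illustrates is exactly the tension between the performance and cost pressures described above.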