GCP – Samsung Electronics supercharges Bixby with Cloud TPUs & TensorFlow
If you are an Android Galaxy phone user, you may already be familiar with Bixby, the intelligent voice assistant from Samsung Electronics that powers more than 160 million devices worldwide in nine languages. Today, we announced that Bixby’s voice-recognition model has achieved an 18x speed boost in training by using Cloud TPUs.
Bixby and other voice assistants use Automatic Speech Recognition (ASR) to transcribe spoken language to text. While this technology has come a long way since its inception, it still isn’t perfect, and it’s important to be able to train and retrain ASR models many times to achieve the best possible accuracy. Samsung, a close partner of Google Cloud, used Cloud TPUs, Google Cloud’s purpose-built machine learning processors, to train their ASR models faster and ultimately improve Bixby’s accuracy.
“At Samsung Electronics, we benefit from Google Cloud Premium Support—in local languages—and regional Technical Account Managers. Working hand in hand with both Support and the TAMs, we were able to evolve our technologies while reducing processing times to just half a day,” said KG Kyoung-Gu Woo, VP of AI Development Group, Mobile Communications Business, Samsung Electronics. “We believe collaboration is the key to offering consumers new mobile experiences they will love.”
The more training data, the better
For several years, the Deep Neural Network-Hidden Markov Model (DNN-HMM) hybrid ASR system architecture served as the standard approach for many speech-to-text services, including the previous generation of Bixby. DNN-HMM systems incorporated acoustic, pronunciation, and language models into the speech recognition pipeline. However, the training process was complex, which made it difficult to optimize overall accuracy.
To overcome the limitations of the DNN-HMM hybrid ASR system, the Bixby team decided to revamp their engine with a cutting-edge end-to-end deep learning approach. By leveraging a single deep neural network model based on the Transformer architecture, the new engine would not only have a simplified training process, but it could also have access to a vast pool of training data.
However, this change in the system also meant new challenges. New experiments and tuning were needed to reach an accuracy comparable to the previous system. Additionally, fast training iterations were critical to keeping the live service up to date while continuously expanding into new languages. In order to meet those requirements while absorbing the ever-increasing amount of training data, the Bixby team decided to explore Cloud TPUs.
Optimizing performance on Cloud TPU
Back in 2013, Google realized that our existing CPU and GPU infrastructure could not keep up with our growing computational needs for AI, so we decided to build a new chip specifically for the purpose. The result was the Tensor Processing Unit (TPU), which has been deployed in Google data centers since 2015. Since then, we have developed multiple new generations of TPU chips as well as large-scale supercomputing systems called TPU Pods that connect thousands of TPU chips together to form the world’s fastest ML training supercomputer.
We also make TPUs and TPU Pods available via Google Cloud in highly scalable Cloud TPU configurations that include up to 2,048 cores and 32 TiB of memory. Today, Cloud TPUs enable ML researchers and engineers to build more powerful and accurate models and empower businesses to bring their AI applications to market faster—all while saving valuable time and cost.
To bring out the full power of Cloud TPU, the Bixby team’s engineers used the Cloud TPU Profiler TensorBoard plugin, which was essential to optimizing performance on Cloud TPU. Using the TPU Compatibility view, they could identify TPU-incompatible operations and replace them with available TensorFlow operations to achieve higher utilization. For a point of reference, we also offer officially supported models that have already been optimized to run on TPUs.
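As a hedged illustration of this kind of substitution (not Bixby’s actual code), consider an op that escapes to host-side Python via `tf.py_function`, which cannot run on a TPU, rewritten as an equivalent native TensorFlow op that can:

```python
import tensorflow as tf

def clip_host_side(x):
    # TPU-incompatible: tf.py_function executes arbitrary Python on the host,
    # so the TPU Compatibility view would flag this op.
    return tf.py_function(lambda t: t.numpy().clip(0.0, 1.0), [x], tf.float32)

def clip_native(x):
    # TPU-compatible: the same computation expressed as a native TF op
    # that XLA can compile for the TPU.
    return tf.clip_by_value(x, 0.0, 1.0)
```

Both functions compute the same result, but only the native version keeps the whole graph compilable for TPU execution.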
Of course, TensorFlow is not the only ecosystem that we aim to support. You can now run PyTorch models on Cloud TPU as well using the PyTorch / XLA integration which became generally available in September. If you are interested in additional platforms that can unlock the potential of TPUs, you may also want to look into JAX, which brings together Python, automatic differentiation, and XLA to enable flexible, fun, and high-performance machine learning research.
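To give a flavor of the JAX approach mentioned above, here is a minimal, illustrative sketch (not Samsung’s code) of how JAX combines NumPy-style Python, automatic differentiation, and XLA compilation; the function and variable names are our own:

```python
import jax
import jax.numpy as jnp

@jax.jit  # compile with XLA; the same code can target CPU, GPU, or TPU
def mse(w, x, y):
    # Mean squared error of a linear prediction x @ w against targets y.
    return jnp.mean((x @ w - y) ** 2)

grad_mse = jax.grad(mse)  # gradient of the loss with respect to w

w = jnp.array([1.0])
x = jnp.array([[2.0]])
y = jnp.array([3.0])
g = grad_mse(w, x, y)  # gradient computed through the XLA-compiled function
```

The same two transformations, `jax.jit` and `jax.grad`, compose with many others, which is what makes JAX attractive for flexible, high-performance research.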
Overcoming technical challenges together with Google
The Bixby team’s migration from GPUs to TPUs seemed to be happening smoothly, but when the time came to test the TPU-trained model back on GPUs, the Bixby team encountered a strange issue. While the model had been fine on TPUs, it started repeating itself once it ran inference on GPUs. For instance, “Hi Bixby, how’s the weather today?” was being transcribed as “Hi Bixby, how’s the weather today? Hi Bixby, how’s the weather today?”
Technical issues like this in machine learning projects are often very difficult to troubleshoot, as it’s not always clear whether the problem is in the code, the platform, the infrastructure, or the data. The Bixby team opened a support case with Google Cloud, and after a combined effort across Customer Engineering, the TPU product team, and Google Brain, we found the root cause.
While sequential models like Bixby’s ASR engine deal with input data of variable length, the XLA compilation for TPU currently requires that all tensor sizes be known at graph construction time. Because of this difference, the model trained on TPUs had learned to expect long padding after each input audio sequence, and it could not reliably predict the end of a sentence when that padding was shortened or removed on GPUs. Unable to complete the sentence, it repeated the same phrases over and over again.
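The static-shape requirement can be sketched in a few lines of plain Python. This is a simplified illustration with invented names, not Bixby’s pipeline: every variable-length sequence is padded up to a fixed bucket length so that all tensor shapes are known before compilation:

```python
# Sketch: XLA requires every tensor shape to be known at graph construction
# time, so variable-length inputs are padded to a fixed bucket length.
PAD_VALUE = 0.0

def pad_to_bucket(features, bucket_len, pad_value=PAD_VALUE):
    """Pad one variable-length feature sequence to a fixed bucket length."""
    if len(features) > bucket_len:
        raise ValueError("sequence longer than bucket")
    return features + [pad_value] * (bucket_len - len(features))

# Every example in a batch ends up with the same static length,
# however long the original audio was.
batch = [[0.1, 0.2], [0.3, 0.4, 0.5, 0.6]]
padded = [pad_to_bucket(seq, 8) for seq in batch]
```

On TPUs the model therefore always saw this trailing padding during training, which is exactly the pattern that was missing when inference moved to GPUs.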
Once the root cause was identified, it was quickly addressed. The Bixby team decided to simulate the TPU environment during GPU inference by introducing additional padding at the end of each input audio. Without making any modifications to the training code, this enabled their new model to predict the end of a sentence correctly on GPUs—and the repetition issue was successfully resolved.
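The workaround amounts to reproducing the TPU-style trailing padding at inference time. A minimal sketch, again with illustrative names rather than the team’s actual code:

```python
# Sketch: simulate the TPU training environment during GPU inference by
# appending fixed trailing padding to each input audio sequence, so the
# model sees the padding pattern it learned to expect.
def append_trailing_padding(audio_frames, num_pad_frames, pad_value=0.0):
    """Append fixed trailing padding to one input audio sequence."""
    return list(audio_frames) + [pad_value] * num_pad_frames

padded_input = append_trailing_padding([0.5, 0.7], 4)
```

Because the change lives entirely on the inference side, the training code did not need to be touched, which is what made the fix so quick to roll out.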
Driving innovation with Google Cloud
As a result of the migration to TPUs, Bixby gained an 18x speed boost in training: 180 hours on their on-prem training environment with 8 GPUs went down to 10 hours on 64 cores of Cloud TPU v3. This drastically increased the iteration speed of experiments, helping the team successfully transition to the new end-to-end architecture, which yielded a 4.7% improvement in word error rate, a lighter model one-tenth the size, and 11x faster inference.
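The headline speedup follows directly from the two reported training times:

```python
gpu_hours = 180  # on-prem training environment with 8 GPUs
tpu_hours = 10   # 64 cores of Cloud TPU v3

speedup = gpu_hours / tpu_hours  # 180 / 10 = 18x
```
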
You can learn more about Samsung’s journey with Cloud TPU from Hanbyul Kim, the engineer from the Bixby team, in this video (subtitles in English) during Google Cloud Next OnAir Recap: Seoul.