On-device training with Core ML – part 1

by Matthijs Hollemans
19 July 2019

Machine learning on mobile gets more popular every year! WWDC 2019 gave us lots of new goodies for adding ML into our apps.

One of the biggest announcements was that Core ML 3 now supports training of models on the iPhone and iPad. Who would have thought a few years ago that we’d be training convnets on our handheld devices!

Training neural networks on the device

In this series of blog posts we’ll take a deep dive into on-device training. I’ll show how to train a customizable image classifier using k-Nearest Neighbors as well as a deep neural network.

This is the first of a four-part series:

  1. Introduction to on-device training
  2. Rock, Paper, Scissors (Lizard? Spock?)
  3. k-Nearest Neighbors
  4. Training a Neural Network

Follow along with the source code on GitHub.

Let’s train some deep learning models on our mobile phones!

Note: As a historical aside, iPhones and iPads have actually supported on-device training since iOS 11.3, which shipped in early 2018; it just wasn’t very convenient to use. These low-level training facilities are provided by the Metal Performance Shaders framework, which also powers the GPU-accelerated training in Turi Create and Create ML on the Mac. But thanks to Core ML 3, on-device training has now become a lot simpler to use!

Personalization instead of training

To be honest… “training deep learning models” might be overselling it a little bit. Apple consistently calls it on-device personalization instead of training.

The goal of these new APIs is to allow fine-tuning an existing model on the personal data of the user.

It’s definitely not for using the iPhone as a replacement for renting a big fat server with NVIDIA Tesla GPUs to train huge models from scratch. Hey, your iPhone’s GPU is powerful but not that powerful…

Think of Core ML 3 training as a form of transfer learning or even online learning, where you only slightly tweak an existing model.

For example, Face ID uses these techniques to learn what the phone’s owner looks like, and to keep its model up-to-date when their face changes over time (growing a beard, wearing different makeup, getting older, etc).

The idea is to start out with a generic model that works OK for everyone, and then make a copy that is customized for each user.

This is not the same as federated learning, where a single model is updated based on the — anonymized — data of many users. Federated learning is a way to do distributed training “on the edge”. Instead of training on centralized servers, it uses the devices of thousands or even millions of users to spread the workload.

Federated learning also takes advantage of training on the user’s own data while keeping it private, but it doesn’t create a personalized model for each individual user. It just learns a little bit from everyone’s data and combines that into one big model. With Core ML’s on-device personalization, however, a single baseline model blossoms into thousands of slightly different ones, each tailored to its one user.

Federated learning is probably something you could do with Core ML models too, but none of the plumbing for that currently exists on iOS.

Core ML vs. Create ML

Just to clear up any potential confusion before we continue: on-device training is completely unrelated to Create ML.

Create ML is Apple’s model training app for macOS. It’s great for quickly building simple models such as image/sound/text classifiers. Like Turi Create, Apple’s other training tool, it employs transfer learning to speed up training times.

Create ML and Core ML were designed to work together: After training, you can directly save your model in Core ML format. No need to run a conversion tool first. There is also a CreateML.framework that lets you train models from a Swift script or Playground.

However, it’s important to realize that on-device training does not use this Create ML framework at all! On-device model personalization is always done with Core ML, not Create ML.

This may seem like a duplication of functionality, but both frameworks do training in slightly different ways. Create ML is good for building the baseline version of the model using as much data as possible, while Core ML’s on-device personalization is intended for slightly tweaking that model using relatively little user data.

So if you ever find yourself wondering how to use these training APIs, make sure you’re looking at the documentation for the correct framework! (Note that Create ML only works on the Mac; it is not available on iOS at all.)

Note: At this point you cannot export trainable models from Create ML — you’ll have to use coremltools afterwards to configure the models for on-device training.

You need a trained model to start with

In order to fine-tune a model, you already need to have a trained model.

Machine learning models start out life not knowing anything. An untrained model’s brain — the learned parameters or weights — is made up of randomly chosen numbers. If you ask such an untrained model for a prediction, it will just make random guesses. To get sensible predictions, you need to train the model first.

Often, your users will have data that is very similar in the broad strokes, but that differs in the details. In that case, it makes sense to use on-device personalization to adapt the model to each user’s specific usage.

To enable this, you need to provide a trained model that already understands the data in general terms. This is the baseline model that gets shipped with the app. On-device training can then be used to make the model learn new things about just this user and their data.

Take photos as an example: most people’s photo albums contain pictures that are very much alike. They will mostly be photos of people or pets or everyday things — as opposed to, say, microscopy pictures or X-rays.

So if your model expects to work on photos — for example, to recognize pets and tag them with their name — you’d provide a baseline model that is trained on a wide variety of images of household animals, so that it already understands what sort of things it can expect to see in such photos. Or if your app is for doctors, you’d provide a model that is already trained to understand X-rays, etc.

Of course, your pets don’t look exactly like my pets. And the model won’t know their names yet. Using on-device training, my phone could learn specific details from my own pet photos, and your phone would learn specific details about your pets. But this only works because the model already knows what pets look like in the general sense.

It would be silly to expect both our phones to learn everything about pet photos from scratch. Not only is that a lot of duplicate effort, it would also require tons of training data. It’s much smarter to use an already-trained model as the starting point and just modify it slightly for each individual user.

Never ever train from scratch?

Training from scratch does make sense if each user’s data is completely unique, unlike the data of any other user of your app. In addition, the data must be simple (not images or audio) and there must be a relatively small amount of it.

As I explained in the previous section, if the data of many users contains similar patterns, you’re better off with a pre-trained base model that has already learned to understand these general patterns.

But let’s say the data of each user is different enough that pre-training a model isn’t going to help any. In that case, training a unique model for each user from scratch is the way to go.

What I’m talking about here is linear or logistic regression, or perhaps a tree-based model. Training these kinds of models on a moderate amount of simple vector data can be quite fast. However, Core ML currently does not let you train such models — if you want to do this, you’ll have to roll your own.

But models such as neural networks… forget about it! The hardware simply isn’t capable of training neural nets from scratch.

Time for a quick back-of-the-envelope calculation. Suppose you want to train an image classifier from the ground up. First, you’d need to collect thousands of training examples — usually about 1000 per object category. You’d bundle these images with the app or let the user collect their own. You could perhaps use the user’s photo album, but keep in mind you’ll also need to have labels for these training examples. Not many users will have the patience to carefully collect and annotate thousands of examples.

To train the model, the optimizer will go through all the training examples, many times over. A single pass through all examples is called an epoch. Let’s say you have 5,000 training examples and you train for 100 epochs, then over the course of training, the model sees 100 × 5000 = 500,000 examples. Assuming the model needs 0.01 seconds to process each example, training this model takes 5000 seconds or almost one and a half hours.
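
Here is that back-of-the-envelope sum as a few lines of Python:

```python
# Back-of-the-envelope training time, using the numbers from the text above.
examples = 5_000          # training examples
epochs = 100              # passes over the training set
secs_per_example = 0.01   # assumed time to process a single example

total_secs = examples * epochs * secs_per_example
print(f"{total_secs:.0f} seconds = {total_secs / 3600:.1f} hours")  # 5000 seconds = 1.4 hours
```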

Well… in theory. Mobile devices aren’t built for running at top speed for hours on end. The phone will get hot quickly, causing thermal throttling to kick in and slow down the processors so the device doesn’t melt. Plus, this kind of computing load will quickly drain the battery. Oh, and the user won’t be able to use any other apps in the meantime. You could train in the background, but as far as I know, that only uses the CPU, which is about 10× slower than the GPU or Neural Engine. Now you’re talking 15 or so hours, running with the CPU maxed out.

As you can tell, training a neural network completely from scratch on mobile just isn’t a good idea. 🥵 🔥

Why train on the device at all?

Doing training on the device is a good idea for a few reasons:

  1. The user’s data never has to leave the device, which is great for privacy.
  2. You don’t need to run (and pay for) servers to do the training on.
  3. Each user ends up with a model that is tailored to their own data and that can keep improving as they use the app.

Obviously, this is not the right solution for all ML training. As I mentioned above, you wouldn’t use it to train big models from scratch.

It really only makes sense to use this technique if a small amount of training on the device is sufficient to make the model better — where “better” could mean something different for each user.

To improve your model based on feedback from all users, what you’re looking for instead is distributed training where each user does a little bit of training on their own device and on their own data, but the results are aggregated and combined into a new master model that is then sent to all the users as part of an app update (see also federated learning).

That approach is useful for when you want the model to learn new trends, such as new popular phrases for a predictive keyboard, but everyone should get the same improvements — they’re not customized for individual users.

Read more about the “why” of on-device training in this blog post.

What do you need for on-device training?

As I mentioned, you need a pretrained model. This will usually involve some kind of neural network. Even for k-Nearest Neighbors (k-NN) you’ll usually have a pipeline made up of a neural network for feature extraction followed by the k-NN classifier.

The pretrained model is just a regular Core ML mlmodel file that is configured to allow updating. This involves the following:

  1. The model is marked as being updatable.
  2. The model describes the training inputs it expects: the training examples and their labels.

For neural networks the mlmodel also has:

  1. a list of which layers are trainable,
  2. the loss function to use,
  3. the optimizer and its hyperparameters (learning rate, mini-batch size, number of epochs, and so on).

Unless your neural network is small, you wouldn’t set all layers to be trainable. Typically, only the very last layer(s) should be trainable; otherwise training would take way too long.

When converting your model to Core ML format with coremltools, you can pass the respect_trainable argument to the converter and it will automatically make the model updatable. But you can also make these changes to an existing mlmodel file afterwards, it just takes a bit more effort. You’ll see how to do this in parts 3 and 4 of this series.
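
To give a rough idea of what the converter route looks like, here is a minimal sketch using the Keras converter from coremltools 3. The model file, layer setup, and input/output names are made up for illustration, and the converter arguments can differ between coremltools versions:

```python
import coremltools
from keras.models import load_model

# Load an already-trained Keras model (the file name is just an example).
keras_model = load_model("PetClassifier.h5")

# Freeze everything except the last layer; only that layer should be
# fine-tuned on the device.
for layer in keras_model.layers[:-1]:
    layer.trainable = False

# With respect_trainable=True, the converter marks the layers that are
# still trainable in Keras as updatable in the resulting mlmodel.
mlmodel = coremltools.converters.keras.convert(
    keras_model,
    input_names=["image"],
    output_names=["labelProbs"],
    respect_trainable=True,
)
mlmodel.save("PetClassifierUpdatable.mlmodel")
```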

Note: An mlmodel that is updatable can only be opened by Xcode 11 or later, and will only work on iOS 13 and up — even for making predictions.

Don’t forget the labels

The whole point of on-device personalization is that you’ll be training the model on the user’s own data. But just the data by itself is not enough.

You also need to know what the data represents, or else the model cannot learn anything from it.

The kind of machine learning we’re talking about here is supervised learning. That means, besides the training examples, you will also need to have labels for those examples.

Often it is the user who has to provide the labels somehow, as it is their data.

For instance, an app can learn to detect what your pets look like and automatically tag any photos they appear in. But you do first have to tell the app what the names of your pets are — i.e. the labels — for a number of photos, or it will never be able to learn the association.

Note: Apple has published a chapter in the Human Interface Guidelines about designing user interfaces for apps that incorporate machine learning. How users will provide labels for the training data is an important consideration in the design of your app’s UI. Highly recommended reading! See also the WWDC 2019 session Designing Great ML Experiences.

Limitations of training with Core ML 3

Currently, Core ML supports training the following model types:

  1. k-Nearest Neighbors (k-NN) classifiers
  2. neural networks (neural network classifiers, regressors, and plain neural networks)

It’s also possible to train these models if they are part of a pipeline, but only if it’s the last model in the pipeline.

The other model types, such as linear regression or decision trees, cannot be trained by Core ML.

Note: If you wanted to train a linear or logistic regression, you could build a very basic one-layer neural network to perform the regression and train that.
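
As a sketch of that workaround (all names, shapes, and hyperparameters below are made up for illustration, and the exact builder API can differ between coremltools versions): a single updatable inner-product layer trained with an MSE loss is effectively a trainable linear regression.

```python
import numpy as np
import coremltools
from coremltools.models import datatypes
from coremltools.models.neural_network import NeuralNetworkBuilder
from coremltools.models.neural_network.update_optimizer_utils import SgdParams

# A "network" with one fully-connected layer: y = Wx + b, i.e. a linear regression.
input_features = [("features", datatypes.Array(3))]
output_features = [("prediction", datatypes.Array(1))]
builder = NeuralNetworkBuilder(input_features, output_features)

builder.add_inner_product(name="linear",
                          W=np.zeros((1, 3)), b=np.zeros(1),  # weights get learned on-device
                          input_channels=3, output_channels=1, has_bias=True,
                          input_name="features", output_name="prediction")

# Mark the layer as trainable and attach an MSE loss plus an SGD optimizer.
builder.make_updatable(["linear"])
builder.set_mean_squared_error_loss(name="loss",
                                    input_feature=("prediction", datatypes.Array(1)))
builder.set_sgd_optimizer(SgdParams(lr=0.01, batch=16))
builder.set_epochs(20)

coremltools.models.MLModel(builder.spec).save("TinyRegressor.mlmodel")
```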

For neural networks, only the following layer types can be trained:

  1. convolution layers
  2. fully-connected (inner product) layers

Backpropagation through many of the other layer types is supported, but not through all of them. If a layer doesn’t support backpropagation, only the layers that come after it (closer to the output) can be trained; the layers preceding it cannot.

I’m sure future versions of Core ML will make it possible to train many other layer types too. Layer types that have weights but that are not trainable yet include: batchnorm, embeddings, bias/scale, and RNN layers such as LSTM or GRU. If you want to train these, wait until next year. 😬

Core ML 3 offers a limited choice of loss functions:

  1. categorical cross-entropy (for classification)
  2. mean squared error, or MSE (for regression)

A model with multiple outputs can also have multiple loss functions. For example, in a model that predicts both a class label and a bounding box you would use cross-entropy for the class label output and MSE for the bounding box. However, there currently is no way to weight these losses so that one counts more than the other. There is also no way to define your own loss functions.

Currently the following optimizers are available:

  1. SGD (stochastic gradient descent), with optional momentum
  2. Adam

You also define the hyperparameters for these optimizers in the mlmodel file, such as the learning rate, momentum, mini-batch size, etc. However, you can override (some of) these at runtime, which is useful for changing the learning rate during training (known as learning rate annealing).
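
For illustration, here is roughly how the optimizer and its hyperparameters get written into an existing model with coremltools. This is a sketch only: the layer and output names are placeholders, and the builder API can differ between coremltools versions.

```python
import coremltools
from coremltools.models.neural_network import NeuralNetworkBuilder
from coremltools.models.neural_network.update_optimizer_utils import AdamParams

spec = coremltools.utils.load_spec("PetClassifier.mlmodel")
builder = NeuralNetworkBuilder(spec=spec)

# Only the last fully-connected layer gets fine-tuned on the device.
builder.make_updatable(["dense_1"])   # placeholder layer name

# The loss function, optimizer, and hyperparameters all live in the mlmodel file.
builder.set_categorical_cross_entropy_loss(name="loss", input="labelProbs")
builder.set_adam_optimizer(AdamParams(lr=0.001, batch=8,
                                      beta1=0.9, beta2=0.999, eps=1e-8))
builder.set_epochs(10)

coremltools.models.MLModel(builder.spec).save("PetClassifierUpdatable.mlmodel")
```

In the app itself, the values you are allowed to change can then be overridden through the parameters dictionary of MLModelConfiguration (keyed by MLParameterKey) when you kick off an update.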

Note: Personally, I think it would have been better if the loss function, the optimizer, and the hyperparameters were not part of the mlmodel but of the CoreML.framework API. That would have allowed for custom loss functions. Oh well.

Some other things you cannot currently do:

As may be obvious, Core ML 3 is not a replacement for TensorFlow or PyTorch just yet. But even with these limitations, it does offer exciting new possibilities of what we can do with machine learning on our devices!

👍 Keep reading: Continue to part 2, Rock, Paper, Scissors (Lizard? Spock?), where we’ll build an app that can detect hand gestures.

Image credit: Vector Graphics by vecteezy.com

Written by Matthijs Hollemans.
First published on Friday, 19 July 2019.
If you liked this post, say hi on Twitter @mhollemans or LinkedIn.
Find the source code on my GitHub.
