If you thought machine learning on mobile is already a bit of a stretch, how about machine learning on microcontrollers?

Take the Arduino Uno. This has an 8-bit CPU running at 16 MHz, no real floating point support — it does not even have a division instruction — and only has 2KB of RAM. Is it even possible to do machine learning on these devices?

**The answer is yes**, but you need to be smart about the algorithm you’re using.

A simple neural network might work but a full-blown convnet will be way too slow without specialized hardware. My interest is in smart algorithms such as Bonsai or SEFR, the topic of today’s blog post.

SEFR, which stands for **Scalable, Efficient, and Fast classifieR**, was first published last year in a paper from three Iranian researchers. Just for fun, I implemented the algorithm in Swift so we can run it on iOS or Mac devices.

You can find the source code for this blog post on GitHub, as usual.

## How it works

The paper is a quick read, so definitely give that a go if you’re interested in reading papers. As is usual for these kinds of papers, the algorithm is described using math. As a programmer, I find algorithms easier to understand when they’re expressed in (pseudo)code. Fortunately, the SEFR algorithm is very straightforward.

SEFR is a **binary classifier**, meaning it can tell apart two classes: the positive class and the negative class. Usually the positive class is the thing you’re looking for, and the negative class is the absence of that thing.

Binary classification is certainly important, but many interesting classification tasks involve multiple classes. You can extend SEFR to do multiclass classification as well, basically by running it multiple times. I’ll show how to do this later on.

Since SEFR is a **supervised learning** algorithm, you’ll need to train it on a set of examples and their labels. The labels are 0 or 1, for the negative class and positive class, respectively. The examples themselves consist of numerical data.

To illustrate this, here is a portion of a possible dataset:

```
let examples: [[Float]] = [
[6.4, 3.1, 5.5, 1.8],
[5.4, 3.0, 4.5, 1.5],
[5.2, 3.5, 1.5, 0.2],
[6.1, 3.0, 4.9, 1.8],
[6.4, 2.8, 5.6, 2.2],
[5.2, 2.7, 3.9, 1.4],
[5.7, 3.8, 1.7, 0.3],
[6.0, 2.7, 5.1, 1.6],
...
]
let labels: [Int] = [
1, 1, 0, 1, 0, 1, 1, 1, ...
]
```

Here, each example uses four **features**. In other words, each example is made up of four floating-point numbers.

Usually `examples`

will be a two-dimensional array of shape N×M where N is the number of examples and M is the number of features. The `labels`

array will be of length N and contains integer values.

The training set does not have to be randomly shuffled.

**Note:** The paper says that the feature data must always be non-negative, i.e. `0.0`

or larger, and recommends normalizing the data into the range [0, 1]. In my testing it didn’t really make much difference, but I suppose negative numbers can mess up the math in interesting ways.

## It’s all about the features

The key idea in SEFR is that we want to determine for each feature whether it helps to identify positive examples, or whether it helps to identify negative examples.

During training, SEFR computes a **weight** value for each feature. If the training data has M features per example, then SEFR learns M weights in total.

The weight is just a number that tells us how much an individual feature “pulls” the example towards the positive class (the weight is close to +1), or how much it pulls the example towards the negative class (the weight is close to -1).

If the weight is close to 0, the feature isn’t very useful to the classification process and can be considered irrelevant.

That’s basically all there is to it. Let’s look at how to compute these weights.

## The algorithm in Swift

Since SEFR is a really simple algorithm, I will describe it by stepping through the source code. You can follow along with the full version.

First, we need to define the parameters that SEFR will learn. I already mentioned the weights, but it also learns a **bias** value. This bias will determine the decision boundary between the two classes. More about this later. There will be M weights — one for each feature — but there is only a single bias.

```
var weights: [Float] = []
var bias: Float = 0
```

To train the model, we define the function `fit()`

. This takes an array of examples and another array of the training labels:

```
func fit(examples: [[Float]], targets: [Int]) {
let numExamples = examples.count
let numFeatures = examples[0].count
```

Recall that `examples`

is a two-dimensional array of size N×M, or `numExamples`

by `numFeatures`

, containing floating-point numbers; `targets`

is an array containing `numExamples`

labels, where each label is either 0 or 1.

First, we will count **how many** positive examples and how many negative examples there are in the training data. If the label for an example is 1, the example belongs to the positive class. If the label is 0 — or really, anything except 1 — the example is from the negative class.

```
var countPos: Float = 0
var countNeg: Float = 0
for i in 0..<numExamples { // loop through all examples
if targets[i] == 1 { // this example has positive label
countPos += 1
} else { // this example has negative label
countNeg += 1
}
}
```

**Note:** Even though the counts are whole numbers, we store them as `Float`

to avoid having to cast them later. If you’re worried about losing precision, have no fear: a `Float`

can store integer values just fine (up to 24 bits).

Next, we look at each of the features in turn. For each individual feature, we compute its **average value** across all the positive examples. We also compute the average value over the negative examples.

```
for f in 0..<numFeatures { // loop through all features
var sumPos: Float = 0
var sumNeg: Float = 0
for i in 0..<numExamples { // loop through all examples
if targets[i] == 1 { // this example has positive label
sumPos += examples[i][f]
} else { // this example has negative label
sumNeg += examples[i][f]
}
}
let avgPos = sumPos / countPos
let avgNeg = sumNeg / countNeg
```

Why do we do this? Let’s say a particular feature usually has a large value when it appears in a positive example, but for any of the negative examples its value is close to zero. Then `avgPos`

for that feature will be large and `avgNeg`

will be small. That means this feature is highly correlated with predicting the positive class.

If `avgNeg`

is greater than `avgPos`

, it’s the other way around: this particular feature is more likely to predict the negative class. And if `avgPos`

and `avgNeg`

are more or less the same, this feature by itself doesn’t really say anything about the example’s class.

We can now **combine these two averages** into a single weight value for this feature:

```
let weight = (avgPos - avgNeg) / (avgPos + avgNeg + 0.0000001);
weights.append(weight)
}
```

The formula used here accomplishes the following:

Since

`avgPos`

and`avgNeg`

can be anything, we want to normalize them so that the weights are always in the range**[-1, +1]**. That’s what the division is for. The small “epsilon” value in the denominator prevents division by zero.If

`avgPos`

for this feature is larger than`avgNeg`

, it means the feature is more correlated with the positive class. The formula turns this into a positive weight value between 0 and +1, because the numerator is positive.However, if

`avgNeg > avgPos`

, the feature is correlated with the negative class. The numerator is negative, making the weight a number between -1 and 0.If

`avgPos`

and`avgNeg`

are identical or very similar, the numerator is a small number. The resulting weight is close to 0, meaning that the feature is not very important.

OK, still with me? At this point we have computed a weight for each individual feature by analyzing our training data.

We can now multiply the data from a given training example with these weights to **calculate a score**. In math terms, we’re taking the dot product between the weights and the example’s features.

The code for calculating the score looks like this:

```
var score: Float = 0
for f in 0..<numFeatures {
score += examples[i][f] * weights[f] // dot product!
}
```

A higher score implies this example is likely positive; a lower score implies the example likely belongs to the negative class. However… we don’t know yet where the actual **decision boundary** is. We can’t say for sure yet how large the score must be to mean “this example is positive”.

For instance, if one example has a score of +43.1 and another has a score of -7.8, does that mean the second example belongs to the positive class or the negative class? We don’t know yet. First, we have to figure out where the decision boundary is. Say the decision boundary is at +10.0, then the second example belongs to the negative class. But if the decision boundary is at -15.4, that example still belongs to the positive class, and only examples with a score less than -15.4 are negative.

In order to figure out the decision boundary, we first compute the **average scores** for all our training examples. As before, we do this separately for the positive and for the negative examples:

```
var sumPosScore: Float = 0
var sumNegScore: Float = 0
for i in 0..<numExamples {
var score: Float = 0
for f in 0..<numFeatures {
score += examples[i][f] * weights[f]
}
if targets[i] == 1 { // example has positive label
sumPosScore += score
} else { // example has negative label
sumNegScore += score
}
}
let avgPosScore = sumPosScore / countPos
let avgNegScore = sumNegScore / countNeg
```

Here, `avgPosScore`

should be larger than `avgNegScore`

because the weights for “positive” features are greater than zero, while the weights for “negative” features are less than zero.

The final step in the training process is to **compute the bias** value:

```
bias = -(countNeg * avgPosScore + countPos * avgNegScore)
/ (countNeg + countPos)
}
```

As I explained, we need to know above which score an example is considered positive. That score is what I called the decision boundary. The bias is used to shift this decision boundary to 0 (hence the minus sign).

To make a prediction, we can now simply take the dot product and add the bias:

```
score = example • weight + bias
```

If this score is >= 0, the example is positive. If this score is < 0, the example is negative. Thanks to the bias, we don’t need to worry about the decision boundary at all.

In essence, the bias is simply the average of `avgPosScore`

and `avgNegScore`

. The formula looks a bit more complex because we use a weighted average, in case the number of positive examples in the training dataset is different from the number of negative examples. This is why we multiply by `countNeg`

and `countPos`

, and divide by the total number of examples.

If you’re wondering why we do `countNeg * avgPosScore`

and vice versa: If there are fewer negative examples than positive examples, we want to the `avgPosScore`

to count less.

This is all you need to do to train the classifier. As you can see, it’s just a bunch of for-loops. It doesn’t get much simpler than that!

**Note:** It should be possible to optimize this even more by using the routines from the Accelerate framework. Instead of computing the average using a for loop, you could use `vDSP.mean()`

to do it in one go, for example. Also, if you split the training examples into two sets, one for the positive examples and one for the negative examples, you can optimize the implementation even more.

## After training comes sunshine

Once the model has been trained, making a prediction on a new example is very straightforward.

I split this up into two functions. The first one computes the “raw” score, just like we did above. This multiplies each feature from the example with its corresponding weight, adds them all up, and also adds the bias.

```
func predictScore(example: [Float]) -> Float {
var score = bias
for f in 0..<weights.count {
score += example[f] * weights[f]
}
return score
}
```

Turning this into a yes/no prediction is as simple as comparing the score to 0:

```
public func predict(example: [Float]) -> Bool {
predictScore(example: example) >= 0
}
```

Here, a `true`

result means the example belongs to the positive class, while `false`

means it belongs to the negative class.

## Going multiclass

A common strategy to turn a binary classifier into a multiclass classifier is to use **one-vs-rest**. If there are, say, 3 classes, you train three separate binary classifiers:

- class A versus (class B and C)
- class B versus (class A and C)
- class C versus (class A and B)

To train the first classifier, you give the examples from class A the label “positive” and the examples from class B and C both get the label “negative”. Similarly for the other classifiers.

To make a prediction, you run the example through all three classifiers and pick the one that gave the highest score. Here, if the second classifier had the highest score, the predicted class is B.

In the source code, this is implemented in `SEFRMulticlass`

. The target labels are still integers but they’re no longer limited to 0 or 1.

The code is really simple. I’ll just highlight the function that is used for making predictions:

```
func predict(example: [Float]) -> Int {
var largest = -Float.greatestFiniteMagnitude
var bestClass = 0
for (label, clf) in zip(labels, classifiers) {
let pred = clf.predictScore(example: example)
if pred > largest {
largest = pred
bestClass = label
}
}
return bestClass
}
```

This loops through the different classifiers — one for each label in the dataset — and makes a prediction. It returns the class label belonging to the classifier that predicted the largest score. (In mathspeak, it takes the argmax.)

## The demo app

It’s easy enough to show that a classifier works well on a toy dataset, but what if you apply it to a real problem?

As a more realistic demonstration, I made a basic app that uses SEFR to classify images. This is the Demo.swift source file in the repo. I’m not going to step through the code here but I will explain what it does.

To run the demo app for yourself, clone the repo and run the following from a Terminal window:

```
$ cat SEFR.swift Demo.swift | swift -
```

This app is written for macOS, although the same code will also work on iOS. There is no Xcode project, it’s just a set of Swift scripts, which is why you need to run it using the above incantation.

The Demo.swift file scans the **images/train** folder to see which subfolders it has. Each of these is considered to be a class. The demo includes the classes: balloon, cat, and sandwich. If you want to use your own images, simply make a folder for it and copy your images into that.

Each of the train subfolders has 15 images in it. These images have been resized to 299×299, but that’s not a requirement — any size image will work (they get resized by the Vision framework to 299×299 anyway).

The demo app loops through all the images in these subfolders and creates **feature vectors** for them using Vision framework’s **FeaturePrint.Scene** model. This is a neural network that is built into macOS and iOS that outputs a 2048-element vector for any image you give it. The numbers in this vector represent supposedly useful information about the image.

You can stick these feature vectors into some kind of classifier. A typical neural network uses logistic regression as its classification layer — i.e. a fully-connected layer followed by softmax — but we’ll use SEFR.

Since there are three classes and each has 15 training images, the total training example array is of size 45 × 2048.

The app also comes with a test set in the **images/test** folder. This has 5 images per class. We also create feature vectors for these images, so we can evaluate how well the classifier works after it has been trained.

**Note:** Recall that SEFR prefers that input data is in the range [0, 1] or at least has no negative values. It turns out that Vision FeaturePrint.Scene outputs vectors that only have numbers >= 0.0, so we don’t need to “fix” the data. If your data contains negative values, consider normalizing the training data. An easy formula is: `normalized = (data - min) / (max - min)`

.

Reading the image files and generating the feature vectors is really the bulk of Demo.swift. Once we have the training and testing data, we do the following to train the multiclass model and test it:

```
let model = SEFRMulticlass()
model.fit(examples: trainExamples, targets: trainLabels)
let predictions = model.predict(examples: testExamples)
```

And then we compute the **accuracy** by counting how many of the `predictions`

are the same as the `testLabels`

.

For the test images that are provided with the demo app, it accurately predicts **100%** of them. High fives all around!

Before we get carried away with this success… It’s important to realize that with 2048 features per image but only 45 images in the training set, the classification problem isn’t very hard. So I’d expect *any* classifier to score 100% correct on this problem.

To see how well SEFR does **with fewer features**, set the `useRandomFeatures`

variable at the top of Demo.swift to `true`

, and choose `numFeatures`

to be a small number. Instead of using the entire 2048-element feature vector, it will now pick a small number of features at random and try to do the classification with those.

Sometimes it will still get a very high score even with one or two features, but the problem is definitely harder now!

Maybe you can add some of your own images and classes to see how well it does on those. (I didn’t want to upload a massive dataset to GitHub.)

## How fast is this?

SEFR is fast and small enough to run on a puny Arduino Uno, and even to train the model on the device itself. Not bad! Of course, it runs even faster on an iPhone.

Training time is linear: If you train it on twice as much data or with twice as many features, it will take twice as long. Training has O(M*N) time complexity, testing a single example is O(M). SEFR also doesn’t need much (temporary) storage space.

The fact that it’s **fast and uses little memory** is a big advantage of SEFR over many other classifiers, and what makes it possible to run on tiny microcontrollers.

Multiclass SEFR can be a little slow. With K classes, you also have to train K different instances of SEFR. To make a prediction, you have to run all of these classifiers and then figure out which gave the best result. The more classes you have, the slower this becomes. Other algorithms, such as Naive Bayes, can trivially handle multiple classes. So SEFR isn’t optimal here.

Of course, there is a downside to having a fast but simple classification algorithm: it’s **not as accurate** as state-of-the-art classifiers such as CatBoost, LightGBM, and XGBoost. As always, each algorithm has its own tradeoffs. SEFR is just another tool in the box. Sometimes it will work better, other times it’s not necessarily the best choice.

Another benefit of SEFR is that the weights tell you **how important** a feature is. This lets you remove irrelevant features — those with a weight close to 0 — from your dataset, for an extra speedup.

Even with its limitations, I do quite like SEFR because it is so stunningly simple. And I look forward to seeing more new, cool algorithms being developed for “Tiny ML”!

If you liked this post, say hi on Twitter @mhollemans or by email matt@machinethink.net.

Find the source code on my GitHub.