This article is part of our coverage of the latest AI research.
A new machine learning technique developed by researchers at Edge Impulse, an ML modeling platform for the edge, enables real-time object detection on devices with very limited compute and memory. Called Faster Objects, More Objects (FOMO), the new deep learning architecture can unlock new computer vision applications.
Most object detection deep learning models have memory and computational requirements that exceed the capability of small processors. FOMO, on the other hand, only requires several hundred kilobytes of memory, making it an excellent technique for TinyML, a subfield of machine learning focused on running ML models on microcontrollers and other memory-limited devices that have limited or no internet connectivity.
Image Classification vs Object Detection
TinyML has made great strides in image classification, where the machine learning model only needs to predict the presence of a certain type of object in an image. On the other hand, object detection requires the model to identify more than one object as well as the bounding box of each instance.
Object detection models are much more complex than image classification networks and require more memory.
“We added computer vision support to Edge Impulse in 2020, and we’ve seen a tremendous increase in applications (40% of our projects are computer vision apps),” Jan Jongboom, CTO at Edge Impulse, told TechTalks. “But with current state-of-the-art models, you could only do image classification on microcontrollers.”
Image classification is very useful for many applications. For example, a security camera might use TinyML image classification to determine whether or not there is a person in the frame. However, much more can be done.
“It was a great nuisance that you were limited to these very basic classification tasks. There is a lot of value in seeing ‘there are three people here’ or ‘this label is in the top-left corner.’ Counting things, for example, is one of the biggest demands we see in the market today,” says Jongboom.
Earlier object detection ML models had to process the input image multiple times to locate objects, which made them slow and computationally expensive. Newer models such as YOLO (You Only Look Once) use single-shot detection to provide near real-time object detection. But their memory requirements are still high. Even models designed for high-end applications are difficult to run on small devices.
“YOLOv5 or MobileNet SSD are just incredibly large networks that will never scale to MCUs and barely fit Raspberry Pi-class devices,” says Jongboom.
Moreover, these models are bad at detecting small objects and they need a lot of data. For example, YOLOv5 recommends over 10,000 training instances per object class.
The idea behind FOMO is that not all object detection applications require the high precision output provided by state-of-the-art deep learning models. By finding the right trade-off between accuracy, speed, and memory, you can reduce your deep learning models to very small sizes while still keeping them useful.
Instead of predicting bounding boxes, FOMO predicts the center of each object. This works because many object detection applications are only interested in the location of objects in the frame, not their size. Centroid detection is much more computationally efficient than bounding box prediction and requires less data.
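To make the centroid idea concrete, here is a minimal sketch (not Edge Impulse's actual code; the function name and grid size are illustrative) of how bounding-box annotations could be reduced to the per-cell centroid labels a FOMO-style model trains on:

```python
import numpy as np

def boxes_to_centroid_grid(boxes, num_classes, grid_size=12):
    """boxes: list of (x_min, y_min, x_max, y_max, class_id), coords in [0, 1].
    Returns a (grid_size, grid_size, num_classes + 1) one-hot grid where
    channel 0 is background and each object marks only its center cell."""
    grid = np.zeros((grid_size, grid_size, num_classes + 1))
    grid[..., 0] = 1.0  # every cell starts out as background
    for x_min, y_min, x_max, y_max, cls in boxes:
        cx = int((x_min + x_max) / 2 * grid_size)  # center column
        cy = int((y_min + y_max) / 2 * grid_size)  # center row
        cx, cy = min(cx, grid_size - 1), min(cy, grid_size - 1)
        grid[cy, cx, :] = 0.0
        grid[cy, cx, cls + 1] = 1.0  # mark only the object's center cell
    return grid

# One class-0 object centered at (0.2, 0.2) lands in grid cell (2, 2)
label = boxes_to_centroid_grid([(0.1, 0.1, 0.3, 0.3, 0)], num_classes=2)
```

Note how the box's width and height are discarded entirely; only the center survives, which is why this representation needs far less annotation precision than box regression.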
Redefining Object Detection Deep Learning Architectures
FOMO also applies a major structural change to traditional deep learning architectures.
Single-shot object detectors are composed of a set of convolutional layers that extract features and several fully connected layers that predict the bounding boxes. The convolutional layers extract visual features hierarchically: the first layer detects simple things like lines and edges in different orientations. Each convolutional layer is usually coupled with a pooling layer, which reduces the size of the layer’s output while retaining the most prominent features in each area.
The output of the pooling layer is then passed to the next convolutional layer, which extracts higher-level features, such as corners, arcs, and circles. As more convolution and pooling layers are added, the feature maps shrink while capturing ever-larger regions of the image, allowing them to detect complex elements such as faces and objects.
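The downsampling step described above can be shown with a toy example. This is a minimal sketch of 2×2 max pooling, which keeps the strongest activation in each non-overlapping window and halves the spatial resolution:

```python
import numpy as np

def max_pool_2x2(fmap):
    """fmap: (H, W) feature map with even H and W. Keeps the maximum
    activation in each non-overlapping 2x2 window, halving resolution."""
    h, w = fmap.shape
    return fmap.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

# A 4x4 feature map pools down to 2x2
pooled = max_pool_2x2(np.arange(16.0).reshape(4, 4))
# pooled == [[5., 7.], [13., 15.]]
```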
Finally, fully connected layers flatten the output of the final convolution layer and attempt to predict the class and bounding box of objects.
FOMO removes the fully connected layers and the last few convolution layers. This turns the neural network’s output into a scaled-down version of the input image, with each output value corresponding to a small patch of that image. The network is then trained with a special loss function so that each output unit predicts the class probabilities for its corresponding patch. The output effectively becomes a heat map of object types.
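The resulting output head is very simple. As an illustrative sketch (not Edge Impulse's implementation; shapes and names are assumptions), a 1×1 convolution over a truncated backbone's feature map is just a per-cell matrix multiply, and a per-cell softmax turns it into the class heat map described above:

```python
import numpy as np

def fomo_head(features, weights, bias):
    """features: (H, W, C) feature map from a truncated backbone.
    weights: (C, num_classes + 1) -- a 1x1 convolution is equivalent to
    the same matrix multiply applied at every spatial position.
    Returns (H, W, num_classes + 1) per-cell class probabilities."""
    logits = features @ weights + bias               # 1x1 convolution
    e = np.exp(logits - logits.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)         # per-cell softmax

rng = np.random.default_rng(0)
# e.g. a 12x12x96 feature map mapped to a 12x12 heat map over 3 channels
# (background + 2 object classes)
heatmap = fomo_head(rng.standard_normal((12, 12, 96)),
                    rng.standard_normal((96, 3)), np.zeros(3))
```

Because no flattening or dense layers follow, the parameter count of the head stays tiny and the output resolution scales directly with the input resolution.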
This approach has several key advantages. First, FOMO is compatible with existing architectures. For example, FOMO can be applied to MobileNetV2, a popular deep learning model for image classification on edge devices.
Additionally, by dramatically reducing the size of the neural network, FOMO cuts the memory and computational requirements of object detection models. According to Edge Impulse, it is 30 times faster than MobileNet SSD and can run on devices with less than 200 kilobytes of RAM.
For example, the following video shows a FOMO neural network detecting objects at 30 frames per second on an Arduino Nicla Vision with just over 200 kilobytes of memory. On a Raspberry Pi 4, FOMO can detect objects at 60 fps as opposed to the MobileNet SSD’s 2 fps performance.
Jongboom told me that FOMO was inspired by work Mat Kelcey, principal engineer at Edge Impulse, has done around neural network architecture for counting bees.
“Traditional object detection algorithms (YOLOv5, MobileNet SSD) are bad for this type of problem (similarly sized objects, lots of very small objects), so he designed a custom architecture that optimizes for these problems,” he said.
The granularity of FOMO output can be configured based on the application and can detect many object instances in a single image.
Limits of FOMO
The benefits of FOMO are not without trade-offs. It works best when the objects are all of similar size. Its output is effectively a grid of equally sized cells, each of which can detect one object. Therefore, if there is a very large object in the foreground and many small objects in the background, it will not work as well.
Also, when objects are too close to each other or overlap, they occupy the same grid cell, reducing the detector’s accuracy (see video below). You can overcome this limitation to some extent by reducing the size of FOMO’s cells or increasing the image resolution.
FOMO is especially useful when the camera is in a fixed location, such as scanning objects on a conveyor belt or counting cars in a parking lot.
The Edge Impulse team plans to expand their work in the future, including making the model even smaller, under 100 kilobytes, and improving it for transfer learning.
This article was originally written by Ben Dickson and published on TechTalks, a publication that examines trends in technology, how they affect the way we live and do business, and the problems they solve. But we also discuss the evil side of technology, the darker implications of new technology, and what we need to watch out for. You can read the original article here.