YOLO (Object Detection Algorithm)

Image Recognition and Object Detection

Convolutional neural networks are a powerful AI technology with a broad range of uses, one of which is image classification. Computer vision poses many complex challenges, and object detection is currently one of the most active. It is most closely associated with autonomous vehicles, where computer vision, LIDAR, and other technologies are combined to build a comprehensive, multi-dimensional picture of the environment and the objects in it. Object detection is also widely used in video surveillance, for example to monitor crowds and help prevent terrorist activity, to count people for statistical research, and to improve user experiences.

Image recognition can be broken down into increasingly complex tasks. The first is classification, which assigns each image to a single category (human, animal, object, and so on) and answers the question 'What does this image contain?'. Next, localization pinpoints a particular object in the image, changing the question to 'Where is the object?'. In practical scenarios, several objects in one image need to be located: an autonomous vehicle, for example, must identify other cars, traffic signals, signs, and pedestrians, and then take suitable action.

Detection finds every object in an image and draws a bounding box around each one. In some scenarios, instance segmentation is used to determine the exact boundaries of objects, though that's a discussion for another day.

Understanding the YOLO (You Only Look Once) Object Detection Algorithm

Object detection is a fundamental computer vision task that involves determining 'what' and 'where': which objects are present in an image and precisely where they are. It is a more advanced task than classification, which can tell objects apart but cannot indicate where they are in the image.

The YOLO algorithm has attracted attention because it combines high accuracy with real-time speed. It needs only one forward pass through the neural network to make its predictions. Non-max suppression is then applied so that each object is detected only once, and the algorithm outputs the detected objects together with their bounding boxes.
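The non-max suppression step can be illustrated with a minimal greedy sketch in Python. It assumes boxes are given as corner coordinates (x1, y1, x2, y2), each with a confidence score, and uses intersection over union (IoU) with an assumed threshold of 0.5 to decide when two boxes overlap too much. The names `iou` and `non_max_suppression` are hypothetical; real YOLO implementations differ in details, and many run the procedure separately for each class.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2), corner coordinates


def iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)  # zero if boxes don't overlap
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def non_max_suppression(boxes: List[Box], scores: List[float],
                        iou_threshold: float = 0.5) -> List[int]:
    """Greedy NMS: return indices of kept boxes, highest score first."""
    # Visit boxes from most to least confident.
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        # Keep box i only if it does not overlap too much with any box already kept.
        if all(iou(boxes[i], boxes[k]) <= iou_threshold for k in keep):
            keep.append(i)
    return keep
```

With two heavily overlapping boxes and one separate box, `non_max_suppression([(10, 10, 50, 50), (12, 12, 52, 52), (100, 100, 140, 140)], [0.9, 0.8, 0.7])` returns `[0, 2]`: the second box overlaps the first with an IoU of about 0.82 and is suppressed.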

YOLO uses a single convolutional neural network to predict multiple bounding boxes and the class probabilities for those boxes at the same time. Because it trains on complete images, it can optimize detection performance directly.
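To make "multiple bounding boxes and class probabilities" concrete, the sketch below assumes a YOLOv2-style output layout: for every grid cell and every box, the network emits five numbers (pc, the probability that the box contains an object, then bx, by, bw, bh for the box center, width and height) followed by C class probabilities. The grid size of 19, five boxes per cell and 80 classes are assumed values, and `output_shape` and `decode_box_vector` are hypothetical helpers, not part of any library.

```python
from typing import List, Tuple


def output_shape(grid: int = 19, boxes_per_cell: int = 5,
                 num_classes: int = 80) -> Tuple[int, int, int, int]:
    """Shape of the prediction tensor: one vector per (row, col, box)."""
    # Each vector holds pc, bx, by, bw, bh plus one probability per class.
    return (grid, grid, boxes_per_cell, 5 + num_classes)


def decode_box_vector(vec: List[float]) -> Tuple[float, Tuple[float, ...], int]:
    """Split [pc, bx, by, bw, bh, c_1..c_C] into score, box and best class."""
    pc, box, class_probs = vec[0], tuple(vec[1:5]), vec[5:]
    best_class = max(range(len(class_probs)), key=lambda c: class_probs[c])
    # Confidence that this box contains an object of class best_class.
    score = pc * class_probs[best_class]
    return score, box, best_class
```

For instance, `output_shape()` gives `(19, 19, 5, 85)`, and `decode_box_vector([0.9, 0.5, 0.5, 0.2, 0.3, 0.1, 0.7, 0.2])` returns a score of about 0.63 (0.9 × 0.7) for class index 1. Multiplying pc by the class probability is one common way to score a box.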

Dissecting the YOLO Algorithm

Object detection methods can be divided into two main types:

  1. Classification-based algorithms: these work in two stages. They first identify regions of interest in an image and then use a convolutional neural network to classify each region. Because a prediction must be run for every region of interest, this approach can be slow. Well-known examples are the region-based convolutional neural network (R-CNN) and its successors Fast R-CNN, Faster R-CNN, and Mask R-CNN.
  2. Regression algorithms: these predict classes and bounding boxes for the whole image in a single run. The best-known algorithms in this group are the YOLO family and SSD, which are mainly used for real-time object detection. What makes them popular? They give up a small amount of accuracy for large gains in speed.

Understanding the YOLO algorithm starts with understanding what it predicts. The goal is to predict the class of each object and the bounding box that marks its location. Each bounding box can be described by four descriptors:

  1. Width
  2. Height
  3. Center of a bounding box
  4. The class of the object inside the box
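The four descriptors map naturally onto a small data structure. This sketch assumes the center, width and height are normalized to the range [0, 1] relative to the image, and shows how to convert them to pixel corner coordinates, a form that is convenient for drawing boxes or computing overlaps. The class `BoundingBox` and its fields are illustrative names, not a specific library's API.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class BoundingBox:
    """A box described by its center, width, height and object class.

    Coordinates are assumed to be normalized to [0, 1] relative to the image.
    """
    x_center: float
    y_center: float
    width: float
    height: float
    class_id: int

    def to_corners(self, image_w: int, image_h: int) -> Tuple[float, float, float, float]:
        """Convert to pixel corner coordinates (x1, y1, x2, y2)."""
        x1 = (self.x_center - self.width / 2) * image_w
        y1 = (self.y_center - self.height / 2) * image_h
        x2 = (self.x_center + self.width / 2) * image_w
        y2 = (self.y_center + self.height / 2) * image_h
        return x1, y1, x2, y2
```

For example, `BoundingBox(0.5, 0.5, 0.25, 0.5, class_id=2).to_corners(640, 480)` returns `(240.0, 120.0, 400.0, 360.0)`.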

YOLO does not first search the image for regions of interest that might contain an object. Instead, it divides the image into a grid of cells, typically 19×19. Each cell is responsible for predicting five bounding boxes, which lets it handle cases where more than one object falls within the cell. As a result, a single image yields 19 × 19 × 5 = 1805 bounding boxes. Most cells and bounding boxes will contain no object, so the network also predicts a value pc, the probability that a box contains an object. This value is used to discard empty boxes and, through non-max suppression, to keep only the best-scoring box among boxes that overlap heavily.
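The grid bookkeeping in the paragraph above can be sketched in a few lines. The sketch assumes the usual YOLO convention that the cell containing an object's center is the one responsible for detecting it, with box centers normalized to [0, 1]. The pc threshold of 0.6 is an assumed value, and all function names are hypothetical.

```python
from typing import List, Tuple

GRID = 19            # cells per side, as in the 19x19 example
BOXES_PER_CELL = 5   # boxes predicted by each cell


def responsible_cell(x_center: float, y_center: float, grid: int = GRID) -> Tuple[int, int]:
    """Return (row, col) of the cell containing a normalized box center."""
    col = min(int(x_center * grid), grid - 1)  # clamp so 1.0 maps to the last cell
    row = min(int(y_center * grid), grid - 1)
    return row, col


def total_boxes(grid: int = GRID, boxes_per_cell: int = BOXES_PER_CELL) -> int:
    """Number of boxes predicted for one image."""
    return grid * grid * boxes_per_cell


def filter_by_pc(pcs: List[float], threshold: float = 0.6) -> List[int]:
    """Indices of boxes whose object probability pc reaches the threshold."""
    return [i for i, pc in enumerate(pcs) if pc >= threshold]
```

`total_boxes()` returns 1805 (19 × 19 × 5), matching the count above; `responsible_cell(0.5, 0.5)` returns `(9, 9)`, the central cell; and `filter_by_pc([0.1, 0.7, 0.6, 0.3])` returns `[1, 2]`. The surviving boxes would then go through non-max suppression.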

YOLO has several benefits compared with other detection methods:

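  1. YOLO sees the entire image during training and testing, so it implicitly encodes contextual information about classes as well as their appearance.
  2. When trained on natural images and tested on other domains, such as artwork, YOLO outperforms other top detection methods, which suggests it generalizes well.
  3. YOLO runs significantly faster than other detection methods.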
