Purpose
Label objects in a camera feed with their names in real time.
Environment
Windows 11 (it also worked on Ubuntu, and on Ubuntu on Raspberry Pi 4)
Python 3.12
PyTorch (install it in advance only if you have a reason to, such as wanting a specific CUDA version; it can also be installed automatically in a later step, so this is optional)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
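If PyTorch is installed ahead of time like this, it is worth confirming from Python that the CUDA build actually sees the GPU. A minimal check (the version and device strings in the comments are just examples):
import torch
# Show the installed PyTorch version and whether a CUDA-capable GPU is visible
print(torch.__version__)          # e.g. 2.6.0+cu126
print(torch.cuda.is_available())  # True if the CUDA build found a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4060 Ti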
License
If you use Ultralytics YOLO free of charge, it falls under the AGPL-3.0 License.
AGPL-3.0 License: This OSI-approved open-source license is perfect for students, researchers, and enthusiasts. It encourages open collaboration and knowledge sharing. See the LICENSE file for full details.
Create a virtual environment and install
mkdir yolo
cd yolo
python -m venv --system-site-packages venv
venv\Scripts\activate
(On Ubuntu / Raspberry Pi, activate with: source venv/bin/activate)
pip install ultralytics
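Once pip install ultralytics finishes, a quick sanity check from the activated venv confirms the package imports and which version was installed (ultralytics.checks() is the environment self-check used in the official notebooks; the CLI test in the next section covers the same ground):
import ultralytics
# Print the installed Ultralytics version and run its built-in environment check
print(ultralytics.__version__)
ultralytics.checks()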
Verify operation
yolo detect predict model=yolo11n.pt
Ultralytics 8.3.103 🚀 Python-3.12.9 torch-2.6.0+cu126 CUDA:0 (NVIDIA GeForce RTX 4060 Ti, 16380MiB)
YOLO11n summary (fused): 100 layers, 2,616,248 parameters, 0 gradients, 6.5 GFLOPs
image 1/2 D:\yolo\venv\Lib\site-packages\ultralytics\assets\bus.jpg: 640×480 4 persons, 1 bus, 44.3ms
image 2/2 D:\yolo\venv\Lib\site-packages\ultralytics\assets\zidane.jpg: 384×640 2 persons, 1 tie, 45.9ms
Speed: 3.5ms preprocess, 45.1ms inference, 62.8ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs\detect\predict
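The same prediction can also be run from Python instead of the CLI. A minimal sketch, using the publicly hosted sample image from the Ultralytics docs as the source (the weights are downloaded automatically on first use, and annotated results are saved under runs/detect/ as above):
from ultralytics import YOLO
# Load the pretrained nano model (downloaded automatically on first use)
model = YOLO('yolo11n.pt')
# Predict on a sample image and save the annotated output under runs/detect/
results = model.predict(source='https://ultralytics.com/images/bus.jpg', save=True)
print(results[0].boxes.cls, results[0].boxes.conf)  # class IDs and confidences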
※ The following is the output from running the same command on a Raspberry Pi 4. (Predictably slow.)
Ultralytics 8.3.105 🚀 Python-3.12.3 torch-2.6.0+cpu CPU (Cortex-A72)
YOLO11n summary (fused): 100 layers, 2,616,248 parameters, 0 gradients, 6.5 GFLOPs
image 1/2 /home/xxx/yolo/venv/lib/python3.12/site-packages/ultralytics/assets/bus.jpg: 640×480 4 persons, 1 bus, 1180.9ms
image 2/2 /home/xxx/yolo/venv/lib/python3.12/site-packages/ultralytics/assets/zidane.jpg: 384×640 2 persons, 1 tie, 935.2ms
Speed: 16.4ms preprocess, 1058.1ms inference, 6.1ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs/detect/predict
※ On a Raspberry Pi 5, it is also worth considering NCNN. On the Pi 4, the ncnn export failed with an error and did not work.
pip install ncnn
yolo export model=yolo11n.pt format=ncnn
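If the export succeeds (e.g. on a Pi 5), the exported model can be loaded from Python the same way as the .pt weights. A minimal sketch, assuming the export produced a yolo11n_ncnn_model directory as described in the Ultralytics Raspberry Pi guide (untested here, since the export failed on the Pi 4):
from ultralytics import YOLO
# Export the PyTorch weights to NCNN format (creates a 'yolo11n_ncnn_model' directory)
YOLO('yolo11n.pt').export(format='ncnn')
# Load the exported NCNN model and run it the same way as the .pt model
ncnn_model = YOLO('yolo11n_ncnn_model')
results = ncnn_model('https://ultralytics.com/images/bus.jpg')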
Implementation
I used this page (https://www.ejtech.io/learn/yolo-on-raspberry-pi) as a reference.
The code below uses yolo11l. If you run it on a Raspberry Pi, I think only yolo11n or yolo11s is practical.
The models can be downloaded from here (https://docs.ultralytics.com/ja/models/yolo11/).
import time
import cv2
import numpy as np
from ultralytics import YOLO
# Load the model into memory and get labelmap
model = YOLO('yolo11l.pt', task='detect')
labels = model.names
# Load image source
cap = cv2.VideoCapture(0)
# Set bounding box colors (using the Tableau 10 color scheme)
bbox_colors = [(164,120,87), (68,148,228), (93,97,209), (178,182,133), (88,159,106),
(96,202,231), (159,124,168), (169,162,241), (98,118,150), (172,176,184)]
# Initialize control and status variables
avg_frame_rate = 0
frame_rate_buffer = []
fps_avg_len = 200
# Begin inference loop
while True:
    t_start = time.perf_counter()
    # Load frame from image source
    ret, frame = cap.read()
    if (frame is None) or (not ret):
        print('Unable to read frames from the camera. This indicates the camera is disconnected or not working. Exiting program.')
        break
    # Run inference on frame
    results = model(frame, verbose=False)
    # Extract results
    detections = results[0].boxes
    # Initialize variable for basic object counting example
    object_count = 0
    # Go through each detection and get bbox coords, confidence, and class
    for i in range(len(detections)):
        # Get bounding box coordinates
        # Ultralytics returns results in Tensor format, which have to be converted to a regular Python array
        xyxy_tensor = detections[i].xyxy.cpu() # Detections in Tensor format in CPU memory
        xyxy = xyxy_tensor.numpy().squeeze() # Convert tensors to Numpy array
        xmin, ymin, xmax, ymax = xyxy.astype(int) # Extract individual coordinates and convert to int
        # Get bounding box class ID and name
        classidx = int(detections[i].cls.item())
        classname = labels[classidx]
        # Get bounding box confidence
        conf = detections[i].conf.item()
        # Draw box if confidence threshold is high enough
        if conf > 0.5:
            color = bbox_colors[classidx % 10]
            cv2.rectangle(frame, (xmin,ymin), (xmax,ymax), color, 2)
            label = f'{classname}: {int(conf*100)}%'
            labelSize, baseLine = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1) # Get font size
            label_ymin = max(ymin, labelSize[1] + 10) # Make sure not to draw label too close to top of window
            cv2.rectangle(frame, (xmin, label_ymin-labelSize[1]-10), (xmin+labelSize[0], label_ymin+baseLine-10), color, cv2.FILLED) # Draw filled box to put label text in
            cv2.putText(frame, label, (xmin, label_ymin-7), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1) # Draw label text
            # Basic example: count the number of objects in the image
            object_count = object_count + 1
    # Calculate and draw framerate (if using video, USB, or Picamera source)
    cv2.putText(frame, f'FPS: {avg_frame_rate:0.2f}', (10,20), cv2.FONT_HERSHEY_SIMPLEX, .7, (0,255,255), 2) # Draw framerate
    # Display detection results
    cv2.putText(frame, f'Number of objects: {object_count}', (10,40), cv2.FONT_HERSHEY_SIMPLEX, .7, (0,255,255), 2) # Draw total number of detected objects
    cv2.imshow('YOLO detection results',frame) # Display image
    # If inferencing on individual images, wait for user keypress before moving to next image. Otherwise, wait 5ms before moving to next frame.
    key = cv2.waitKey(5)
    if key == ord('q') or key == ord('Q'): # Press 'q' to quit
        break
    elif key == ord('s') or key == ord('S'): # Press 's' to pause inference
        cv2.waitKey()
    elif key == ord('p') or key == ord('P'): # Press 'p' to save a picture of results on this frame
        cv2.imwrite('capture.png',frame)
    # Calculate FPS for this frame
    t_stop = time.perf_counter()
    frame_rate_calc = float(1/(t_stop - t_start))
    # Append FPS result to frame_rate_buffer (for finding average FPS over multiple frames)
    if len(frame_rate_buffer) >= fps_avg_len:
        temp = frame_rate_buffer.pop(0)
        frame_rate_buffer.append(frame_rate_calc)
    else:
        frame_rate_buffer.append(frame_rate_calc)
    # Calculate average FPS for past frames
    avg_frame_rate = np.mean(frame_rate_buffer)
# Clean up
print(f'Average pipeline FPS: {avg_frame_rate:.2f}')
cap.release()
cv2.destroyAllWindows()
I also ran this on a Raspberry Pi 4 with the model changed to yolo11n.pt, but it only managed 0.9 FPS. It makes me want a Raspberry Pi 5 or a Raspberry Pi AI HAT+.
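On hardware like the Pi 4, the usual mitigations are the smallest model plus a reduced inference resolution. A rough sketch (imgsz=320 is just an example value, not something I benchmarked; smaller inputs trade accuracy for speed):
import cv2
from ultralytics import YOLO
# Smallest YOLO11 model; a reduced imgsz speeds up inference at the cost of accuracy
model = YOLO('yolo11n.pt', task='detect')
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
if ret:
    results = model(frame, imgsz=320, verbose=False)  # default imgsz is 640
    print(len(results[0].boxes), 'objects detected')
cap.release()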