Purpose
Label objects in a camera feed with their names in real time.
Environment
Windows 11 (it also worked on Ubuntu, and on Ubuntu on Raspberry Pi 4)
Python 3.12
PyTorch (install it in advance only if you have a reason to, such as wanting a specific CUDA version; it can also be installed automatically in a later step, so this is optional)
pip install torch torchvision torchaudio --index-url https://download.pytorch.org/whl/cu126
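If PyTorch is installed ahead of time like this, it is worth confirming from Python that the CUDA build actually sees the GPU. A minimal check (the version and device strings in the comments are just examples):
import torch
# Show the installed PyTorch version and whether a CUDA-capable GPU is visible
print(torch.__version__)          # e.g. 2.6.0+cu126
print(torch.cuda.is_available())  # True if the CUDA build found a GPU
if torch.cuda.is_available():
    print(torch.cuda.get_device_name(0))  # e.g. NVIDIA GeForce RTX 4060 Ti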
License
If you use Ultralytics YOLO free of charge, it falls under the AGPL-3.0 License.
AGPL-3.0 License: This OSI-approved open-source license is perfect for students, researchers, and enthusiasts. It encourages open collaboration and knowledge sharing. See the LICENSE file for full details.
Create a virtual environment and install
mkdir yolo
cd yolo
python -m venv --system-site-packages venv
venv\Scripts\activate
(On Ubuntu / Raspberry Pi, activate with: source venv/bin/activate)
pip install ultralytics
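Once pip install ultralytics finishes, a quick sanity check from the activated venv confirms the package imports and which version was installed (ultralytics.checks() is the environment self-check used in the official notebooks; the CLI test in the next section covers the same ground):
import ultralytics
# Print the installed Ultralytics version and run its built-in environment check
print(ultralytics.__version__)
ultralytics.checks()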
Verify operation
yolo detect predict model=yolo11n.pt
Ultralytics 8.3.103 🚀 Python-3.12.9 torch-2.6.0+cu126 CUDA:0 (NVIDIA GeForce RTX 4060 Ti, 16380MiB)
YOLO11n summary (fused): 100 layers, 2,616,248 parameters, 0 gradients, 6.5 GFLOPs
image 1/2 D:\yolo\venv\Lib\site-packages\ultralytics\assets\bus.jpg: 640×480 4 persons, 1 bus, 44.3ms
image 2/2 D:\yolo\venv\Lib\site-packages\ultralytics\assets\zidane.jpg: 384×640 2 persons, 1 tie, 45.9ms
Speed: 3.5ms preprocess, 45.1ms inference, 62.8ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs\detect\predict
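The same prediction can also be run from Python instead of the CLI. A minimal sketch, using the publicly hosted sample image from the Ultralytics docs as the source (the weights are downloaded automatically on first use, and annotated results are saved under runs/detect/ as above):
from ultralytics import YOLO
# Load the pretrained nano model (downloaded automatically on first use)
model = YOLO('yolo11n.pt')
# Predict on a sample image and save the annotated output under runs/detect/
results = model.predict(source='https://ultralytics.com/images/bus.jpg', save=True)
print(results[0].boxes.cls, results[0].boxes.conf)  # class IDs and confidences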
※ The following is the output from running the same command on a Raspberry Pi 4. (Predictably slow.)
Ultralytics 8.3.105 🚀 Python-3.12.3 torch-2.6.0+cpu CPU (Cortex-A72)
YOLO11n summary (fused): 100 layers, 2,616,248 parameters, 0 gradients, 6.5 GFLOPs
image 1/2 /home/xxx/yolo/venv/lib/python3.12/site-packages/ultralytics/assets/bus.jpg: 640×480 4 persons, 1 bus, 1180.9ms
image 2/2 /home/xxx/yolo/venv/lib/python3.12/site-packages/ultralytics/assets/zidane.jpg: 384×640 2 persons, 1 tie, 935.2ms
Speed: 16.4ms preprocess, 1058.1ms inference, 6.1ms postprocess per image at shape (1, 3, 384, 640)
Results saved to runs/detect/predict
※ On a Raspberry Pi 5, it is also worth considering NCNN. On the Pi 4, the ncnn export failed with an error and did not work.
pip install ncnn
yolo export model=yolo11n.pt format=ncnn
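If the export succeeds (e.g. on a Pi 5), the exported model can be loaded from Python the same way as the .pt weights. A minimal sketch, assuming the export produced a yolo11n_ncnn_model directory as described in the Ultralytics Raspberry Pi guide (untested here, since the export failed on the Pi 4):
from ultralytics import YOLO
# Export the PyTorch weights to NCNN format (creates a 'yolo11n_ncnn_model' directory)
YOLO('yolo11n.pt').export(format='ncnn')
# Load the exported NCNN model and run it the same way as the .pt model
ncnn_model = YOLO('yolo11n_ncnn_model')
results = ncnn_model('https://ultralytics.com/images/bus.jpg')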
Implementation
I used this page (https://www.ejtech.io/learn/yolo-on-raspberry-pi) as a reference.
The code below uses yolo11l. If you run it on a Raspberry Pi, I think only yolo11n or yolo11s is practical.
The models can be downloaded from here (https://docs.ultralytics.com/ja/models/yolo11/).
import time
import cv2
import numpy as np
from ultralytics import YOLO
# Load the model into memory and get labelmap
model = YOLO('yolo11l.pt', task='detect')
labels = model.names
# Load image source
cap = cv2.VideoCapture(0)
# Set bounding box colors (using the Tableau 10 color scheme)
bbox_colors = [(164,120,87), (68,148,228), (93,97,209), (178,182,133), (88,159,106),
(96,202,231), (159,124,168), (169,162,241), (98,118,150), (172,176,184)]
# Initialize control and status variables
avg_frame_rate = 0
frame_rate_buffer = []
fps_avg_len = 200
# Begin inference loop
while True:
    t_start = time.perf_counter()
    # Load frame from image source
    ret, frame = cap.read()
    if (frame is None) or (not ret):
        print('Unable to read frames from the camera. This indicates the camera is disconnected or not working. Exiting program.')
        break
    # Run inference on frame
    results = model(frame, verbose=False)
    # Extract results
    detections = results[0].boxes
    # Initialize variable for basic object counting example
    object_count = 0
    # Go through each detection and get bbox coords, confidence, and class
    for i in range(len(detections)):
        # Get bounding box coordinates
        # Ultralytics returns results in Tensor format, which have to be converted to a regular Python array
        xyxy_tensor = detections[i].xyxy.cpu() # Detections in Tensor format in CPU memory
        xyxy = xyxy_tensor.numpy().squeeze() # Convert tensors to Numpy array
        xmin, ymin, xmax, ymax = xyxy.astype(int) # Extract individual coordinates and convert to int
        # Get bounding box class ID and name
        classidx = int(detections[i].cls.item())
        classname = labels[classidx]
        # Get bounding box confidence
        conf = detections[i].conf.item()
        # Draw box if confidence threshold is high enough
        if conf > 0.5:
            color = bbox_colors[classidx % 10]
            cv2.rectangle(frame, (xmin,ymin), (xmax,ymax), color, 2)
            label = f'{classname}: {int(conf*100)}%'
            labelSize, baseLine = cv2.getTextSize(label, cv2.FONT_HERSHEY_SIMPLEX, 0.5, 1) # Get font size
            label_ymin = max(ymin, labelSize[1] + 10) # Make sure not to draw label too close to top of window
            cv2.rectangle(frame, (xmin, label_ymin-labelSize[1]-10), (xmin+labelSize[0], label_ymin+baseLine-10), color, cv2.FILLED) # Draw filled box to put label text in
            cv2.putText(frame, label, (xmin, label_ymin-7), cv2.FONT_HERSHEY_SIMPLEX, 0.5, (0, 0, 0), 1) # Draw label text
            # Basic example: count the number of objects in the image
            object_count = object_count + 1
    # Calculate and draw framerate (if using video, USB, or Picamera source)
    cv2.putText(frame, f'FPS: {avg_frame_rate:0.2f}', (10,20), cv2.FONT_HERSHEY_SIMPLEX, .7, (0,255,255), 2) # Draw framerate
    # Display detection results
    cv2.putText(frame, f'Number of objects: {object_count}', (10,40), cv2.FONT_HERSHEY_SIMPLEX, .7, (0,255,255), 2) # Draw total number of detected objects
    cv2.imshow('YOLO detection results',frame) # Display image
    # If inferencing on individual images, wait for user keypress before moving to next image. Otherwise, wait 5ms before moving to next frame.
    key = cv2.waitKey(5)
    if key == ord('q') or key == ord('Q'): # Press 'q' to quit
        break
    elif key == ord('s') or key == ord('S'): # Press 's' to pause inference
        cv2.waitKey()
    elif key == ord('p') or key == ord('P'): # Press 'p' to save a picture of results on this frame
        cv2.imwrite('capture.png',frame)
    # Calculate FPS for this frame
    t_stop = time.perf_counter()
    frame_rate_calc = float(1/(t_stop - t_start))
    # Append FPS result to frame_rate_buffer (for finding average FPS over multiple frames)
    if len(frame_rate_buffer) >= fps_avg_len:
        temp = frame_rate_buffer.pop(0)
        frame_rate_buffer.append(frame_rate_calc)
    else:
        frame_rate_buffer.append(frame_rate_calc)
    # Calculate average FPS for past frames
    avg_frame_rate = np.mean(frame_rate_buffer)
# Clean up
print(f'Average pipeline FPS: {avg_frame_rate:.2f}')
cap.release()
cv2.destroyAllWindows()
I also ran this on a Raspberry Pi 4 with the model changed to yolo11n.pt, but it only managed 0.9 FPS. It makes me want a Raspberry Pi 5 or a Raspberry Pi AI HAT+.
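On hardware like the Pi 4, the usual mitigations are the smallest model plus a reduced inference resolution. A rough sketch (imgsz=320 is just an example value, not something I benchmarked; smaller inputs trade accuracy for speed):
import cv2
from ultralytics import YOLO
# Smallest YOLO11 model; a reduced imgsz speeds up inference at the cost of accuracy
model = YOLO('yolo11n.pt', task='detect')
cap = cv2.VideoCapture(0)
ret, frame = cap.read()
if ret:
    results = model(frame, imgsz=320, verbose=False)  # default imgsz is 640
    print(len(results[0].boxes), 'objects detected')
cap.release()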