Comparative Assessment of YOLO Segmentation Extensions for Intelligent Fire Detection
Abstract
With the growing frequency of fire incidents, the demand for rapid and accurate fire detection technologies has become increasingly critical. In this study, we evaluate three segmentation-based object detection models, YOLO (You Only Look Once) v5-seg, YOLOv8-seg, and YOLOv11-seg, for their ability to detect flames and smoke under identical experimental conditions. A total of 5,000 fire images were collected and split into training, validation, and test datasets. The same hardware environment and hyperparameter settings were used for model training to ensure a fair comparison. The experimental results reveal that YOLOv11-seg achieved the best overall performance, with a Precision of 0.710, Recall of 0.570, F1-score of 0.632, and mAP50 (mean Average Precision at an IoU threshold of 0.50) of 0.600. Notably, YOLOv11-seg achieved the highest Recall and mAP values for smoke detection, underscoring its effectiveness in identifying smoke, a critical factor for early fire detection. In terms of efficiency, YOLOv8-seg demonstrated the fastest inference speed, while YOLOv5-seg offered the smallest model size. However, YOLOv11-seg provided a balanced trade-off between computational cost and detection accuracy, making it the most suitable model for real-world fire response scenarios. Accordingly, this study proposes YOLOv11-seg as a robust baseline model for segmentation-based fire detection and provides a foundational reference for future research on deep learning-driven intelligent fire video analysis.
1. Introduction
In recent years, the occurrence of fires in indoor spaces, buildings, and forests has become a critical social issue due to the severe casualties and property losses they cause[1]. Factors such as urbanization, the increasing use of electric vehicles, and forest drying driven by climate change have contributed to the diversification of fire outbreak patterns, further underscoring the importance of early detection and rapid response[2]. Consequently, the development of intelligent fire detection systems capable of reliably identifying fires at an early stage and supporting on-site decision-making has been consistently emphasized[3].
The advancement of deep learning-based artificial intelligence (AI) technologies is helping to overcome the limitations of delays and false alarms commonly associated with conventional temperature- or smoke sensor-based detection methods. In particular, object detection technology has emerged as a suitable approach for automating and enhancing fire detection, as it can identify the visual characteristics of flames and smoke in real time[4].
Previous studies have actively explored methods to further enhance performance by optimizing model structures and the preprocessing/postprocessing of images based on YOLO-series models, extending beyond simple detection. Shao et al.[5] incorporated transformer self-attention and Alpha-CIOU loss into YOLOv5, significantly improving recall in complex backgrounds, and demonstrated the ability to detect a fire within 9 seconds using factory CCTV footage. Yu et al.[6] reduced the weight of YOLOv5’s backbone/neck architecture and introduced Efficient Channel Attention (ECA), thereby improving detection performance for small-scale fire and smoke. Wang et al.[7] proposed DSS-YOLO, a lightweight version of YOLOv8n, which integrated the DynamicConv attention module (e.g., SEAM) to enhance recognition of small/concealed targets while reducing model size and computational load. Lan et al.[8] introduced Light-YOLOv8-Flame, where the C2f module of YOLOv8s was replaced with the FasterNet Block (PConv + Conv), resulting in a 25.34% reduction in parameters, along with improvements of 0.78% in mAP and 2.05% in Recall. Xu et al.[9] improved detection performance in complex backgrounds and multi-scale environments by replacing the head’s DCN2 with DCN3 in YOLOv11-DH3 and redesigning the loss function based on IoU.
Although most previous studies have focused on enhancing fire detection performance through model architecture optimization and image processing, the rationale for selecting a particular version of the YOLO model as the base has not been clearly addressed. Moreover, while many studies concentrate on the simple detection of flames and smoke, few have investigated the precise segmentation of flame and smoke regions, which is essential for effective incident response.
In this study, fire detection models based on YOLOv5-seg, YOLOv8-seg, and YOLOv11-seg—commonly used in the fire detection domain—were trained to address various fire types (building, indoor, wildfire, etc.). These models were then comparatively analyzed for their flame and smoke segmentation performance under identical conditions, providing valuable insights for future research on fire detection model selection and optimization.
2. Real-time object detection model
2.1. YOLOv5-seg
YOLOv5-Seg is a network that extends the YOLOv5 model with a segmentation output. Figure 1 illustrates its overall architecture, demonstrating how object detection and segmentation are performed simultaneously on an input image. The backbone generates an initial feature map by processing the input image through the Conv module, then extracts more complex visual patterns by sequentially passing through several Conv and C3 modules; this process converts low-dimensional features into high-dimensional features, resulting in a more detailed feature map. The feature map output from the backbone is then passed to the neck, which prepares the features used for object classification and location estimation. At this stage, additional feature transformations occur through the Conv module, and resolution is adjusted using the Upsample module to account for object size. Subsequently, the Concat operation merges feature maps from different layers, enabling richer representations for objects of varying scales. Finally, in the Head, object classes and locations are computed based on multiscale feature maps, while in the Segmentation step, pixel-level segmentation results are generated. Owing to these structural characteristics, YOLOv5-Seg can simultaneously perform object detection and segmentation quickly and accurately on input images of different sizes and resolutions[10].
2.2. YOLOv8-seg
YOLOv8-Seg was developed by adding segmentation capabilities to the YOLOv8 architecture. Figure 2 presents the schematic diagram of its detailed architecture. The network consists of three modules (backbone, neck, and head) that work together to enhance both object detection and segmentation performance. The backbone module extracts multilayer features from the input image; replacing the C3 module used in YOLOv5 with the C2f module makes it lighter and more computationally efficient. In addition, the 6×6 convolution in the initial step was replaced with a 3×3 convolution, reducing unnecessary computational load while enabling more efficient feature learning. The backbone generates feature maps at multiple scales, which are then passed to the neck. The neck learns representations optimized for object detection from these multiscale feature maps generated by the backbone. Specifically, it integrates features effectively across different receptive fields through the Spatial Pyramid Pooling-Fast (SPPF) module, enabling stable processing of objects with large size variations. The head, as the final stage, calculates object locations and classes. Here, the prediction process is simplified and generalization performance is improved by adopting an anchor-free design. Detection and segmentation accuracy are further enhanced by eliminating overlapping predictions using the Non-Maximum Suppression (NMS) process. Through this optimized module design, YOLOv8-Seg achieves high performance in both object detection and segmentation, making it suitable for a wide range of computer vision applications[11].
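The NMS step mentioned above can be illustrated with a minimal greedy sketch on axis-aligned boxes. This is not the implementation used by any YOLO release; the names `box_iou` and `nms` and the 0.5 IoU threshold are assumptions chosen for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x1, y1, x2, y2)


def box_iou(a: Box, b: Box) -> float:
    """Intersection over union of two axis-aligned boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def nms(boxes: List[Box], scores: List[float], iou_thr: float = 0.5) -> List[int]:
    """Greedy NMS: visit boxes by descending score, keep a box only if it
    does not overlap an already kept box by more than iou_thr."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep: List[int] = []
    for i in order:
        if all(box_iou(boxes[i], boxes[j]) <= iou_thr for j in keep):
            keep.append(i)
    return keep
```

For example, of two heavily overlapping candidates around the same flame, only the higher-scoring one survives, while a distant smoke candidate is kept regardless of its lower score.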
2.3. YOLOv11-seg
YOLOv11-Seg is an extended model that adds a segmentation output to the conventional YOLOv11 network. Figure 3 presents a schematic diagram of its detailed architecture. The model consists of backbone, neck, and head modules, whose interdependent connections maximize both object detection and segmentation performance. The backbone extracts key features from the input image, incorporating the C3k2 module in place of the C2f module used in YOLOv8. The C3k2 module builds on the Cross Stage Partial (CSP) design, using two small kernels to improve the speed and efficiency of feature learning. In addition, the Cross Stage Partial with Spatial Attention (C2PSA) module is introduced to enhance focus on specific regions, thereby improving detection performance for small or partially occluded objects. The neck module integrates multiscale feature maps extracted from the backbone and passes them to the head. At this stage, the C3k2 module is again applied to maintain computational efficiency while effectively merging information across different object sizes. In the head, the final stage for predicting object locations and classes, multiple C3k2 modules are employed to process the multiscale features. The Convolution-BatchNorm-SiLU (CBS) module is also used to ensure stable feature extraction, while the Non-Maximum Suppression (NMS) process eliminates overlapping candidate regions. With these structural features, YOLOv11-Seg can accurately compute object locations and classes, making it highly effective for both object detection and segmentation tasks[12].
3. Experiment
3.1. Fire detection model development process
Figure 4 illustrates the fire detection model training process and performance comparison. The development environment was configured using Python and PyTorch to implement the YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models. A total of 5,000 images—covering building fires, indoor fires, and wildfires—were collected, preprocessed, and annotated into two classes: fire and smoke. Each model was trained and tested on this dataset, and inference performance was evaluated using a separate test set. The flame and smoke detection performance of the three YOLO-based segmentation models was then quantitatively analyzed under identical conditions, and the model with the highest accuracy on fire images was identified.
3.2. Experimental setup
Model training and testing were conducted in the environment described in Table 1. The hardware configuration included an Intel® Xeon® Silver 4210 (10-core) CPU, 192 GB of RAM, and two NVIDIA RTX 3090 GPUs. The software environment consisted of CUDA 11.8, cuDNN 8.1.0, Python 3.9.0, and PyTorch 2.2.0. To ensure a fair comparison, identical training parameters were applied to the YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models. Stochastic Gradient Descent (SGD) was used as the optimizer with a learning rate of 0.01. The number of training epochs was set to 300, and the batch size to 16. Training and testing the models under these consistent conditions enabled a comparative analysis of their performance.
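The shared training settings can be written down as a single configuration object, which makes the "identical conditions" requirement explicit. The sketch below is only illustrative: the class name `TrainConfig`, the `runs` dictionary, and the assumption that the batch size of 16 is the global batch (the text does not say how it is divided across the two GPUs) are not from the original study.

```python
from dataclasses import dataclass, asdict
import math


@dataclass(frozen=True)
class TrainConfig:
    """Hyperparameters stated in Section 3.2, shared by all three models."""
    optimizer: str = "SGD"
    learning_rate: float = 0.01
    epochs: int = 300
    batch_size: int = 16  # assumed to be the global batch size


MODELS = ("YOLOv5-Seg", "YOLOv8-Seg", "YOLOv11-Seg")
SHARED = TrainConfig()

# One run description per model; only the model name differs.
runs = {name: {"model": name, **asdict(SHARED)} for name in MODELS}


def updates_per_run(num_train_images: int, cfg: TrainConfig) -> int:
    """Optimizer steps in one run, assuming the last partial batch is kept."""
    steps_per_epoch = math.ceil(num_train_images / cfg.batch_size)
    return steps_per_epoch * cfg.epochs


# 4,000 training images (Section 3.3) give 250 steps per epoch.
total_updates = updates_per_run(4000, SHARED)
```

Under these assumptions each model receives 250 × 300 = 75,000 parameter updates, so differences in the results can be attributed to the architectures rather than to the training budget.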
3.3. Data collection
The fire images used in this study were collected from open datasets, including Roboflow and Kaggle. The final dataset consisted of 5,000 images containing flames and smoke across diverse environments, such as building fires, indoor fires, and wildfires. The data were annotated into two classes: fire and smoke. Of the total dataset, 4,000 images (80%) were allocated for training, 500 images (10%) for validation, and the remaining 500 images (10%) for testing. Representative examples of the dataset are presented in Figure 5.
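The 80/10/10 partition described above could be reproduced along the following lines. This is a sketch under stated assumptions: the text does not say how images were assigned to each subset, so the seeded random shuffle, the function name `split_dataset`, and the placeholder file names are all hypothetical.

```python
import random
from typing import List, Sequence, Tuple


def split_dataset(items: Sequence[str], train: float = 0.8, val: float = 0.1,
                  seed: int = 0) -> Tuple[List[str], List[str], List[str]]:
    """Shuffle once with a fixed seed, then cut into train/val/test partitions."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(round(len(shuffled) * train))
    n_val = int(round(len(shuffled) * val))
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_val],
            shuffled[n_train + n_val:])


# Hypothetical file names standing in for the 5,000 collected images.
images = [f"fire_{i:04d}.jpg" for i in range(5000)]
train_set, val_set, test_set = split_dataset(images)
```

Fixing the seed keeps the three subsets identical across the three models, which is what a like-for-like comparison needs.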
3.4. Model training performance evaluation indicator
To evaluate model performance, a confusion matrix was used to quantitatively assess classifier accuracy, as shown in Table 2. The confusion matrix describes the relationship between actual and predicted classes, where True Positive (TP) represents correctly predicted positive cases, False Negative (FN) represents actual positive cases incorrectly predicted as negative, False Positive (FP) represents actual negative cases incorrectly predicted as positive, and True Negative (TN) represents correctly predicted negative cases. Based on these values, key performance metrics—including Precision, Recall, Average Precision (AP), and mean Average Precision (mAP)—were calculated. Precision and Recall were computed using Eqs. (1) and (2), respectively, while AP and mAP were derived using Eqs. (3) and (4).
In Eq. (3), p(r) denotes the maximum precision at a given recall value, and Eq. (4) averages AP over the total number of classes. Furthermore, in Eq. (5), which defines frames per second (FPS), T indicates the total processing time. Here, Preprocess Time refers to the time needed to prepare an input image before it is passed to the model; Inference Time refers to the duration during which the model performs inference; and Postprocess Time refers to the time needed to filter and analyze the output. Finally, FPS is calculated based on the total number of processed images.
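The equations referenced as Eqs. (1) to (5) are not reproduced in this text. The standard definitions consistent with the descriptions above are sketched below. The symbols $N_{cls}$ (number of classes), $N_{img}$ (number of processed images), and the subscripted time terms are assumed notation, and the exact form of Eq. (5) is inferred from the prose rather than taken from the original.

```latex
\begin{align}
\text{Precision} &= \frac{TP}{TP + FP} \tag{1} \\
\text{Recall} &= \frac{TP}{TP + FN} \tag{2} \\
AP &= \int_{0}^{1} p(r)\, dr \tag{3} \\
mAP &= \frac{1}{N_{cls}} \sum_{i=1}^{N_{cls}} AP_i \tag{4} \\
FPS &= \frac{N_{img}}{T}, \qquad T = T_{pre} + T_{inf} + T_{post} \tag{5}
\end{align}
```

The F1-score reported in Section 4 is the harmonic mean of the first two quantities, $F1 = 2 \cdot \text{Precision} \cdot \text{Recall} / (\text{Precision} + \text{Recall})$.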
Accordingly, this study comprehensively evaluated the performance of the YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models using these metrics, and verified the effectiveness of the proposed fire detection models.
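As a rough illustration of how these metrics are obtained from confusion-matrix counts and a precision-recall curve, consider the sketch below. It assumes all-point interpolation for AP, which may differ from the sampling scheme used by the training framework in this study; the function names are chosen for illustration only.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Fraction of predicted positives that are correct."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Fraction of actual positives that are detected."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls: Sequence[float], precisions: Sequence[float]) -> float:
    """Area under the precision-recall curve with a monotone precision envelope.

    `recalls` must be sorted in increasing order; p(r) is taken as the maximum
    precision at any recall >= r.
    """
    r = [0.0, *recalls, 1.0]
    p = [0.0, *precisions, 0.0]
    # Right-to-left running maximum gives the envelope p(r).
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum rectangle areas wherever recall increases.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


def mean_average_precision(ap_per_class: Sequence[float]) -> float:
    """Unweighted mean of per-class AP values (here: fire and smoke)."""
    return sum(ap_per_class) / len(ap_per_class)
```

With two classes, as in this study, mAP is simply the average of the fire and smoke AP values.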
4. Experiment Results
4.1. Model Training Experiment Results
Figure 6 illustrates the changes in Precision, Recall, and mAP during the training of the YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models. All models showed rapid performance improvements between epochs 30 and 50, followed by more gradual changes from epochs 50 to 150. YOLOv5-Seg maintained relatively stable performance after this stage, whereas YOLOv11-Seg continued to improve steadily and converged to a stable level after epoch 150. During convergence, Precision tended to reach higher values than Recall, indicating that false positives (FP) for flames and smoke were effectively suppressed during training, while the reduction of false negatives (FN) was comparatively limited. Among the three models, YOLOv11-Seg achieved the highest performance across Precision, Recall, and mAP, followed by YOLOv8-Seg, with YOLOv5-Seg showing the lowest performance. Notably, YOLOv11-Seg maintained stable performance in the later stages of training, demonstrating consistent boundary detection under varying IoU conditions.
4.2. Model Evaluation and Inference Results
Table 3 presents the quantitative performance comparison of the YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models using 500 test images. Among the three models, YOLOv11-Seg achieved the best overall performance, with a Precision of 0.710, Recall of 0.570, F1-score of 0.632, and mAP50 of 0.600. Notably, the Recall of YOLOv11-Seg exceeded that of YOLOv5-Seg (0.431) and YOLOv8-Seg (0.546) by 0.139 and 0.024, respectively, confirming its superior ability to suppress false negatives and more reliably detect flame and smoke objects. In contrast, YOLOv5-Seg recorded the lowest Recall, indicating a higher frequency of missed detections in real fire images. Moreover, YOLOv11-Seg achieved improved mAP50 compared to the other models, suggesting more accurate and well-defined boundary detection of flame and smoke regions.
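The reported F1-score and Recall gaps can be checked directly from the Precision and Recall values quoted above. The short sketch below does only that arithmetic; the function name `f1_score` is chosen here for illustration.

```python
def f1_score(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall."""
    total = precision + recall
    return 2 * precision * recall / total if total else 0.0


# Overall YOLOv11-Seg values reported in the text.
f1_v11 = f1_score(0.710, 0.570)

# Recall gaps of YOLOv11-Seg over the other two models.
gap_v5 = 0.570 - 0.431
gap_v8 = 0.570 - 0.546

print(f"F1 = {f1_v11:.3f}, gaps = {gap_v5:.3f}, {gap_v8:.3f}")
```

Running it prints `F1 = 0.632, gaps = 0.139, 0.024`, in agreement with the figures stated in this paragraph.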
For the smoke class, YOLOv11-Seg achieved the highest performance across all metrics, with a Precision of 0.692, Recall of 0.635, F1-score of 0.662, mAP50 of 0.624, and mAP50-95 of 0.434. These results indicate that both missed and false detections of smoke were effectively minimized, despite the inherent irregularity and low contrast of smoke in images. This demonstrates that YOLOv11-Seg is highly suitable for early smoke detection, which is critical in real-world fire response scenarios. In contrast, YOLOv5-Seg recorded the lowest Recall for the smoke class at 0.370, underscoring its limitations in reliable smoke detection. For the fire class, YOLOv5-Seg exhibited stronger boundary consistency, with a mAP50 of 0.605 and mAP50-95 of 0.382. However, YOLOv11-Seg reduced false positives and missed detections for the fire class, achieving a Precision of 0.727, Recall of 0.505, and F1-score of 0.598, while also improving boundary detection for smoke.
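The difference between mAP50 and mAP50-95 comes from the IoU threshold at which a predicted mask counts as a correct detection: mAP50 uses 0.50 only, whereas the COCO-style mAP50-95 averages over thresholds from 0.50 to 0.95 in steps of 0.05. The sketch below illustrates this with toy masks; the 4×4 masks and the function name `mask_iou` are hypothetical and not taken from the dataset.

```python
from typing import List

Mask = List[List[int]]  # binary mask: 1 = object pixel, 0 = background


def mask_iou(pred: Mask, truth: Mask) -> float:
    """Pixel-wise intersection over union of two equally sized binary masks."""
    inter = union = 0
    for row_p, row_t in zip(pred, truth):
        for p, t in zip(row_p, row_t):
            inter += p & t
            union += p | t
    return inter / union if union else 0.0


# IoU thresholds 0.50, 0.55, ..., 0.95 used by the mAP50-95 average.
THRESHOLDS = [round(0.50 + 0.05 * k, 2) for k in range(10)]

# Toy masks: the prediction covers 6 of the 8 ground-truth pixels.
truth = [[1, 1, 1, 1],
         [1, 1, 1, 1],
         [0, 0, 0, 0],
         [0, 0, 0, 0]]
pred = [[1, 1, 1, 0],
        [1, 1, 1, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]

iou = mask_iou(pred, truth)                       # 6 / 8 = 0.75
matched_at = [t for t in THRESHOLDS if iou >= t]  # thresholds where this is a hit
```

Here the prediction is a hit at six of the ten thresholds, so it contributes fully to mAP50 but only partially to mAP50-95. This is why a model with looser mask boundaries can score well on the former and noticeably lower on the latter.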
All models achieved an inference speed of at least 70 FPS, confirming their suitability for real-time image processing. Among them, YOLOv8-Seg recorded the highest speed at 86 FPS; however, this did not directly translate to improved detection accuracy. YOLOv11-Seg achieved an inference speed of 79 FPS, offering sufficient real-time processing capability while also delivering the highest accuracy in flame and smoke detection. These results emphasize that accuracy and speed should be considered as independent performance metrics. Accordingly, this study defined detection accuracy as the primary evaluation criterion, with FPS serving as a supplementary indicator. From this perspective, YOLOv11-Seg is regarded as the most suitable model for integration into fire response systems, including CCTV networks, drones, and vehicle-mounted cameras.
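For reference, FPS can be derived from the three per-image stage times described in Section 3.4. The sketch below assumes average times in milliseconds and uses hypothetical timing values; they are not measurements from this study.

```python
def fps_from_times(preprocess_ms: float, inference_ms: float,
                   postprocess_ms: float) -> float:
    """Frames per second from average per-image stage times in milliseconds."""
    total_ms = preprocess_ms + inference_ms + postprocess_ms
    if total_ms <= 0:
        raise ValueError("total processing time must be positive")
    return 1000.0 / total_ms


def fps_from_batch(num_images: int, total_seconds: float) -> float:
    """Equivalent form: processed images divided by total processing time."""
    return num_images / total_seconds


# Hypothetical timings, not measurements from this study.
example_fps = fps_from_times(preprocess_ms=1.5, inference_ms=9.0, postprocess_ms=2.0)
```

With these example values the total is 12.5 ms per image, or 80 FPS. Because preprocessing and postprocessing are included, FPS measured this way is lower than a figure based on inference time alone.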
Figure 7 illustrates a visual comparison of inference results for fire images using the YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models. Consistent with the quantitative analysis in Table 3, the visual results reveal clear differences in detection performance among the models. YOLOv5-Seg was able to detect flame and smoke objects; however, smoke boundaries were often indistinct, and confidence scores were frequently low. These observations align with its low Recall value and the frequent missed detections of smoke objects reported in Table 3. YOLOv8-Seg produced relatively high confidence scores and stable boundary extraction for flame objects but showed lower confidence and occasional omission of smoke regions. This reflects the moderate Recall for the smoke class observed in the quantitative results. In contrast, YOLOv11-Seg consistently generated high confidence scores for both flame and smoke, with more precise boundary detection compared to the other models. Notably, it demonstrated stable smoke detection even under complex or low-contrast background conditions, corresponding to its highest Recall and F1-score for the smoke class. These visual inference results in Figure 7 reinforce the trends identified in Table 3, confirming that YOLOv11-Seg provides the most reliable performance in both flame and smoke detection tasks.
5. Conclusions
To support the development of an intelligent fire detection system capable of responding to various types of fires, including indoor, building, and wildfires, this study conducted a comparative analysis of YOLO-based segmentation models: YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg. All models were evaluated under identical conditions using a dataset of 5,000 real fire images. Both quantitative performance metrics and visual inference results were comprehensively analyzed to assess and compare the effectiveness of each model.
The experimental results demonstrated that YOLOv11-Seg achieved the highest performance among the three models, with a Precision of 0.710, Recall of 0.570, F1-score of 0.632, and mAP50 of 0.600. In particular, the improvement in Recall reduced missed detections (false negatives), thereby enhancing the reliability of early fire detection, an essential requirement in real-world fire response scenarios. Detection performance for the smoke class was notably higher than that of the other models, confirming YOLOv11-Seg’s ability to maintain stable detection even under conditions with unclear object boundaries and low image contrast. Given the critical importance of early smoke detection in fire mitigation, these results underscore the model’s practical value. While YOLOv8-Seg achieved the fastest inference speed and YOLOv5-Seg offered the smallest model size, YOLOv11-Seg provided well-balanced performance across both flame and smoke classes, along with moderate computational resource requirements, making it the most suitable choice for practical fire detection applications. Finally, the visual inference results (Figure 7) were consistent with the quantitative findings (Table 3), further confirming that YOLOv11-Seg delivers the most reliable performance in fire image analysis.
However, the dataset used in this study primarily consists of wildfire, building fire, and indoor fire images captured during the growth phase after ignition. Consequently, it is limited in its ability to fully represent the early stages of fire development, such as small-scale smoke or flame and close-range observations. In particular, because the detection approach relies heavily on color distribution (RGB features) at the pixel level, variations in fire scale or observation distance may adversely affect detection performance. This represents a major limitation of the current study. To address this, future research will focus on incorporating images that better reflect early fire conditions during dataset construction. Additionally, data augmentation techniques will be applied to enhance robustness and improve performance in real-world early fire detection scenarios.
In conclusion, this study objectively evaluated and compared the performance of YOLOv5-Seg, YOLOv8-Seg, and YOLOv11-Seg models using a unified dataset and identical training conditions, providing a foundation for selecting an appropriate segmentation-based fire detection model. Future research will aim to enhance model performance across diverse environments, including nighttime conditions, adverse weather, and complex backgrounds, while also incorporating lightweight and optimization techniques. These efforts are intended to further develop a high-performance fire detection model suitable for real-time applications in practical field environments.
Notes
Author Contributions
Conceptualization, S.C. and H.K.; methodology, J.C.; software, S.L.; validation, S.C. and H.J.; formal analysis, S.C.; investigation, H.K.; resources, H.K.; data curation, J.C.; writing (original draft preparation), S.C.; writing (review and editing), S.C.; visualization, H.J.; supervision, H.J.; project administration, H.J.; funding acquisition, H.J. All authors have read and agreed to the published version of the manuscript.
Conflicts of Interest
The authors declare no conflict of interest.
Acknowledgments
This study was supported by the Korea Agency for Infrastructure Technology Advancement (KAIA) grant funded by the Ministry of Land, Infrastructure and Transport (Grant 202102220002) and by the National Research Foundation of Korea (NRF) grant funded by the Korea government (MSIT) (NRF: 2021R1G1A1014385).