Tomato Disease Classification Using Transformer-Based Deep Learning Models

Xiaocun Huang, Mustafa Muwafak Alobaedy, Mohd Nurul Hafiz Ibrahim, Ali A Khalaf

doi:https://doi.org/10.1109/ISCI65687.2025.11166703

Year: 2025

Venue: 2025 IEEE 7th Symposium on Computers & Informatics (ISCI), 339–343

Type: conference

Citations: Cited by 1 (per OpenAlex)

External link: https://ieeexplore.ieee.org/document/11166703

Abstract

Crop diseases present a major challenge to global agriculture, reducing yields and threatening food security. Traditional manual methods for detecting diseases in crops like tomatoes are labor-intensive and lack accuracy. This paper proposes a deep learning model based on the Transformer architecture for automated tomato leaf disease classification. Leveraging the self-attention mechanism of Transformers, the model captures global contextual features from leaf images, achieving superior recognition accuracy. The model was trained and tested on a diverse dataset comprising ten categories (nine diseases and healthy leaves), with data augmentation enhancing its robustness. Experimental results demonstrate that the Transformer-based model achieved a classification accuracy of 91.77%, outperforming traditional CNN models such as VGG19 and ResNet50. This study highlights the potential of Transformer architectures in precision agriculture, enabling efficient and accurate disease recognition. Future work will focus on optimizing the model for deployment on mobile devices to facilitate real-time, on-field disease diagnosis.

Keywords

Transformer model; tomato disease classification; deep learning; attention mechanism; computer vision

Tomato Disease Classification using Transformer-Based Deep Learning Models

1st Xiaocun Huang
Faculty of Information Technology
City University Malaysia
Cyberjaya, Malaysia
huangxiaocun@caztc.edu.cn

2nd Mustafa Muwafak Alobaedy
Faculty of Information Technology
City University Malaysia
Cyberjaya, Malaysia
alobaedy@ieee.org

3rd Mohd Nurul Hafiz Ibrahim
Faculty of Information Technology
City University Malaysia
Cyberjaya, Malaysia
mohd.nurul.hafiz@city.edu.my

4th Ali A Khalaf
Computer Science Department, College of Science
University of Baghdad
Baghdad, Iraq
alrahmanali@sc.uobaghdad.edu.iq

I. INTRODUCTION 
Plant diseases significantly affect crop yields and food 
safety, directly impacting the agricultural economy and 
causing substantial economic losses [1]. Moreover, they 
exacerbate the global issue of food insecurity [2]. To mitigate 
losses caused by plant diseases, it is essential to continuously 
monitor crop health and implement timely and effective 
countermeasures [3]. However, traditional manual detection 
methods are unsuitable for large-scale monitoring, as they 
require substantial human and financial resources. 
Additionally, these methods often fail to detect disease 
progression promptly, leading to missed opportunities for 
intervention and, consequently, increased losses [4]. 
Therefore, accurate and timely detection of plant diseases is 
critical. 
In recent years, the rapid development of artificial 
intelligence (AI) has enabled many fields to adopt AI-based 
technologies to address complex challenges effectively [5–7]. 
As a crucial branch of AI, computer vision has emerged as a 
powerful tool for solving various visual recognition problems, 
contributing significantly to advances in object detection and 
classification [8]. Convolutional Neural Networks (CNNs), in 
particular, have demonstrated outstanding performance in 
image classification tasks. Notable CNN-based architectures 
include EfficientNet [9], AlexNet [10], ResNet [11], 
DenseNet [12], and MobileNet [13]. Despite their success, 
CNNs have inherent limitations, especially regarding their 
restricted receptive field, which limits their ability to capture 
global image context due to the fixed size of convolutional 
kernels. 
To address this limitation, researchers have drawn 
inspiration from the Transformer architecture, initially 
developed for natural language processing tasks [14], and 
adapted it for computer vision applications through the Vision 
Transformer (ViT) [15]. ViT has shown promising results in 
various computer vision tasks such as image classification 
[16], object detection [17,18], and image segmentation [19]. 
Owing to the self-attention mechanism, Vision Transformers 
can effectively capture global dependencies, allowing the 
model to process holistic image information from the 
beginning of training. 
This study leverages the Transformer architecture to 
develop a deep learning model for classifying tomato leaf 
diseases. The proposed model demonstrates high recognition 
accuracy, highlighting the potential of Transformer-based 
models in agricultural disease detection and precision farming. 
 
II. CONVOLUTIONAL NEURAL NETWORKS 
The traditional Convolutional Neural Network (CNN) 
architecture consists of five main components: the input layer, 
convolutional layer, pooling (downsampling) layer, 
activation function layer, and fully connected layer. As 
illustrated in Figure 1, these components are sequentially 
stacked to form a deep convolutional structure, enabling the 
network to effectively extract features from input images. 
 
Fig. 1. Architecture of a Convolutional Neural Network (CNN) [20]. 

The size of the convolutional kernel determines the 
receptive field and feature extraction capability of a neural 
network model. A standard convolutional layer can include 
kernels of various sizes, such as 3×3, 5×5, and 7×7, each 
affecting the spatial resolution and receptive field of the 
resulting feature maps. 
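
The link between kernel size and receptive field can be made concrete with a small calculation. The sketch below uses the standard recurrence for stacked layers; the helper name `receptive_field` and the example layer stacks are illustrative assumptions, not configurations taken from the paper.

```python
def receptive_field(layers):
    """Return the receptive field (in input pixels) of a stack of layers.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    rf = 1      # a single input pixel sees only itself
    jump = 1    # spacing, in input pixels, between neighbouring outputs
    for kernel_size, stride in layers:
        rf += (kernel_size - 1) * jump
        jump *= stride
    return rf


# Hypothetical stacks, chosen only to illustrate the growth pattern.
print(receptive_field([(3, 1)]))                  # 3
print(receptive_field([(3, 1), (3, 1)]))          # 5
print(receptive_field([(3, 1), (3, 1), (3, 1)]))  # 7
print(receptive_field([(3, 1), (2, 2), (3, 1)]))  # 8
```

Three stacked 3×3 kernels cover the same 7×7 window as a single 7×7 kernel, and a stride-2 pooling step widens the window faster, yet the receptive field stays local until the network becomes deep.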
The pooling layer plays a critical role in convolutional 
neural networks by performing downsampling operations. It 
not only helps extract essential features but also reduces the 
computational cost and the number of parameters in the model. 
Typically, the pooling layer follows the convolutional layer 
and processes its output feature maps. 
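
As a minimal illustration of downsampling, the sketch below applies 2×2 max pooling with stride 2 to a small feature map. Max pooling is assumed here for illustration; the paper does not state which pooling variant is meant, and the function name and values are hypothetical.

```python
def max_pool_2x2(feature_map):
    """Apply 2x2 max pooling with stride 2 to a 2-D list of numbers.

    A trailing odd row or column is dropped.
    """
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for r in range(0, rows - 1, 2):
        pooled_row = []
        for c in range(0, cols - 1, 2):
            window = (
                feature_map[r][c], feature_map[r][c + 1],
                feature_map[r + 1][c], feature_map[r + 1][c + 1],
            )
            pooled_row.append(max(window))
        pooled.append(pooled_row)
    return pooled


feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
print(max_pool_2x2(feature_map))  # [[6, 2], [2, 9]]
```

The 4×4 map shrinks to 2×2, so later layers process a quarter of the values while the strongest response in each window is kept.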
Activation functions are another essential component, 
enabling the network to learn complex patterns in image data. 
By introducing non-linearity into the model, activation 
functions allow neural networks to represent and approximate 
intricate functional relationships, thereby improving their 
ability to fit complex data. 
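
The role of non-linearity can be seen in a one-dimensional toy example. The sketch below assumes ReLU as the activation (the paper does not name one) and uses made-up weights: without the activation, two stacked linear layers collapse into a single linear map.

```python
def relu(x):
    """Rectified linear unit: keep positive values, clamp negatives to zero."""
    return x if x > 0 else 0.0


def linear(x, weight, bias):
    """A one-dimensional linear layer."""
    return weight * x + bias


def two_linear(x):
    # Two linear layers with no activation: algebraically equal to -6x - 2.5.
    return linear(linear(x, 2.0, 1.0), -3.0, 0.5)


def with_relu(x):
    # The same two layers with a ReLU in between: no longer a straight line.
    return linear(relu(linear(x, 2.0, 1.0)), -3.0, 0.5)


for x in (-1.0, 0.0, 1.0):
    print(x, two_linear(x), with_relu(x))
```

For x = -1 the first layer outputs a negative value, ReLU clamps it to zero, and the two functions diverge, which is the extra expressive power the activation provides.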
The fully connected layer, usually positioned at the end of 
a convolutional neural network, serves to transform the two-dimensional feature maps into one-dimensional vectors. It 
integrates the local features extracted by previous layers into 
a comprehensive representation of the input image using a 
weight matrix. 
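
The flatten-then-multiply step described above can be sketched in a few lines. The feature map, weights, and biases below are hypothetical values chosen so the arithmetic is easy to follow.

```python
def flatten(feature_maps):
    """Flatten a list of 2-D feature maps into a single 1-D list."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fully_connected(vector, weights, biases):
    """Dense layer: one weighted sum of the whole input vector per output unit."""
    return [
        sum(w * x for w, x in zip(weight_row, vector)) + bias
        for weight_row, bias in zip(weights, biases)
    ]


feature_maps = [[[1, 2], [3, 4]]]          # one 2x2 feature map
vector = flatten(feature_maps)             # [1, 2, 3, 4]
weights = [[1, 0, 0, 1], [0.5, 0.5, 0.5, 0.5]]
biases = [0, 1]
print(fully_connected(vector, weights, biases))  # [5, 6.0]
```

Each output unit mixes every position of the flattened map, which is how local features are combined into a single image-level representation.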
 
III. TRANSFORMER MODEL STRUCTURE 
In 2017, Vaswani et al. proposed the Transformer model 
for natural language processing tasks. Inspired by this 
architecture, Dosovitskiy et al. later adapted the Transformer 
for computer vision applications. Since then, the Transformer 
has demonstrated outstanding performance in various core 
computer vision tasks. 
The Transformer model consists of two main components: 
the encoder and the decoder. In most vision-related 
applications, the encoder plays the primary role, focusing on 
feature extraction and representation. The encoder is 
composed of multiple identical layers, known as Encoder 
Layers, each containing two key subcomponents: a Self-
Attention layer and a Feed-Forward layer. This structure is 
illustrated in Figure 2. 
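
The Self-Attention layer of an encoder block can be sketched as single-head scaled dot-product attention. This is a minimal pure-Python illustration under stated assumptions: identity projection matrices and a three-token toy input are used for readability, whereas a trained model learns its projections and typically uses several heads. It is not the authors' implementation.

```python
import math


def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def softmax(row):
    """Numerically stable softmax over one list of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def self_attention(x, w_q, w_k, w_v):
    """Return (outputs, attention weights) for a sequence of token vectors x."""
    q, k, v = matmul(x, w_q), matmul(x, w_k), matmul(x, w_v)
    d_k = len(k[0])
    scores = matmul(q, transpose(k))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, v), weights


identity = [[1.0, 0.0], [0.0, 1.0]]          # assumed projections, for illustration
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = self_attention(tokens, identity, identity, identity)
for row in weights:
    print([round(w, 3) for w in row])
```

Every row of the weight matrix sums to one and spans all tokens, so each output vector is a mixture of the entire sequence rather than of a local neighbourhood.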
In the fine-grained classification task of tomato leaf 
diseases, the Transformer model performs feature encoding by 
dividing the plant image into patches. These image patches are 
then flattened into a sequential format. The model's self-attention mechanism assigns different weights to each element 
in the sequence, enabling it to effectively capture the 
relationships between patches. This process enhances the 
model’s ability to recognize subtle variations in the leaf 
images, as illustrated in Figure 3. 
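
The patch-and-flatten step can be sketched as follows for a single-channel image stored as a 2-D list. The function name is hypothetical, and the 224×224 input with 14×14-pixel patches reflects one reading of the settings reported in Section V; real inputs have three colour channels and each flattened patch is then linearly projected, which is omitted here.

```python
def image_to_patches(image, patch_size):
    """Split a 2-D image into non-overlapping square patches, each flattened
    row by row, and return them in raster order."""
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image sides must be divisible by patch_size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[r][c]
                for r in range(top, top + patch_size)
                for c in range(left, left + patch_size)
            ]
            patches.append(patch)
    return patches


# Toy 4x4 image with pixel values 0..15, split into 2x2 patches.
toy = [[4 * r + c for c in range(4)] for r in range(4)]
print(image_to_patches(toy, 2)[0])  # [0, 1, 4, 5]

# Assumed configuration: 224x224 image, 14x14-pixel patches.
blank = [[0] * 224 for _ in range(224)]
sequence = image_to_patches(blank, 14)
print(len(sequence), len(sequence[0]))  # 256 196
```

The resulting list of flattened patches is the token sequence on which self-attention operates.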
 
 
Fig. 2. Transformer Model Architecture 
 
 
Fig. 3. Feature extraction of tomato leaf diseases [21]. 
 
IV. DATA 
The dataset used in this study comprises images from ten 
categories, including various tomato leaf diseases and healthy 
leaves. The images were collected from multiple sources, with 
some obtained directly and others sourced from the internet, 
resulting in variations in format and resolution. 
To evaluate the model’s performance, the dataset was 
divided into training, validation, and testing sets in a 60:20:20 
ratio. Data augmentation—a commonly used technique in 
deep learning—was applied to increase the dataset size and 
diversity by transforming and expanding the existing images. 
This approach helps improve the model’s generalization and 
robustness, enabling it to make more accurate predictions on 
unseen data. During preprocessing, techniques such as 
flipping, mirroring, and noise addition were used to augment 
the images. The distribution of images across the ten classes, 
including diseased and healthy tomato leaves, is presented in 
Table I. 
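
The three augmentations named above can be sketched on a grayscale image stored as a 2-D list of 0..255 values. Several details are assumptions made for illustration: "flipping" is read as a top-to-bottom flip, "mirroring" as a left-to-right flip, and the noise is taken to be Gaussian with a hypothetical standard deviation; the paper does not specify these choices.

```python
import random


def flip_vertical(image):
    """Flip top-to-bottom by reversing the row order."""
    return [list(row) for row in reversed(image)]


def mirror_horizontal(image):
    """Mirror left-to-right by reversing each row."""
    return [list(reversed(row)) for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip the result to the 0..255 range."""
    return [
        [min(255, max(0, round(pixel + rng.gauss(0, sigma)))) for pixel in row]
        for row in image
    ]


image = [[10, 20, 30], [40, 50, 60]]
rng = random.Random(0)                    # fixed seed for repeatability
print(flip_vertical(image))               # [[40, 50, 60], [10, 20, 30]]
print(mirror_horizontal(image))           # [[30, 20, 10], [60, 50, 40]]
noisy = add_gaussian_noise(image, sigma=5.0, rng=rng)
print(noisy)
```

Each transform keeps the label unchanged while altering pixel layout or intensity, which is what lets augmentation enlarge the training set without new annotation.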
TABLE I. DATASET DISTRIBUTION 
Disease species | Number of images 
Bacterial_spot | 1735 
Early_blight | 2236 
Late_blight | 2053 
Leaf_Mold | 2175 
Septoria_leaf_spot | 2349 
Spider_mites Two-spotted_spider_mite | 1744 
Target_Spot | 1826 
Tomato_mosaic_virus | 1818 
Tomato_Yellow_Leaf_Curl_Virus | 1989 
Healthy | 1952 
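
To make the 60:20:20 ratio concrete, the sketch below applies it to the class counts in Table I. Splitting each class separately (a stratified split) is an assumption, as is the rounding rule; the paper does not say how the split was drawn or whether the listed counts were taken before or after augmentation.

```python
CLASS_COUNTS = {
    "Bacterial_spot": 1735,
    "Early_blight": 2236,
    "Late_blight": 2053,
    "Leaf_Mold": 2175,
    "Septoria_leaf_spot": 2349,
    "Spider_mites Two-spotted_spider_mite": 1744,
    "Target_Spot": 1826,
    "Tomato_mosaic_virus": 1818,
    "Tomato_Yellow_Leaf_Curl_Virus": 1989,
    "Healthy": 1952,
}


def split_sizes(n, percentages=(60, 20, 20)):
    """Return (train, validation, test) sizes for n images.

    Integer arithmetic avoids floating-point rounding surprises; any
    remainder is assigned to the test set.
    """
    train = n * percentages[0] // 100
    validation = n * percentages[1] // 100
    test = n - train - validation
    return train, validation, test


total = sum(CLASS_COUNTS.values())
print(total)  # 19877
for name, count in CLASS_COUNTS.items():
    print(name, split_sizes(count))
```

The classes range from 1735 to 2349 images, so the dataset is only mildly imbalanced and a per-class split keeps that balance in all three subsets.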
 
 
V. EXPERIMENT AND ANALYSIS 
The experimental environment used for model training 
and evaluation is outlined in Table II, detailing both the 
hardware and software specifications. These configurations 
were selected to ensure smooth processing of image data and 
efficient training of the Transformer-based model. 
TABLE II. HARDWARE AND SOFTWARE REQUIREMENTS 
 
The input images were resized to 224×224 pixels and then 
divided by the Transformer model into a 16×16 grid of patches, each of 
size 14×14 pixels. The model was trained using the cross-entropy 
loss function and optimized with Stochastic Gradient Descent 
(SGD). Training was conducted for 20 epochs with a batch 
size of 8. The initial learning rate was set to 0.001 and 
dynamically adjusted using a cosine decay schedule. 
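
A cosine decay schedule with the reported initial learning rate and epoch count can be sketched as below. The per-epoch granularity, the final learning rate of zero, and the absence of warm-up are assumptions; the paper gives only the initial rate (0.001), the number of epochs (20), and the schedule type.

```python
import math


def cosine_decay_lr(epoch, total_epochs=20, initial_lr=0.001, min_lr=0.0):
    """Learning rate at a given epoch under half-cosine decay.

    `min_lr` is a hypothetical floor; it is not specified in the paper.
    """
    progress = min(epoch, total_epochs) / total_epochs
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine


for epoch in (0, 5, 10, 15, 20):
    print(epoch, f"{cosine_decay_lr(epoch):.6f}")
```

The rate starts at 0.001, falls slowly at first, reaches half its initial value at the midpoint, and flattens out again near the end of training.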
The model achieved an accuracy of 91.77%, precision of 
91.89%, recall of 91.77%, and an F1-score of 91.77% across 
10 categories, comprising nine disease types and one healthy 
class. Figures 4 and 5 illustrate the training accuracy and loss 
curves, respectively. 
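
The four reported metrics can be computed from true and predicted labels as sketched below. Support-weighted averaging across classes is an assumption: the paper does not state its averaging mode, although a recall equal to the accuracy (both 91.77%) is what weighted averaging always produces. The labels in the example are made up.

```python
def weighted_metrics(y_true, y_pred):
    """Return (accuracy, precision, recall, f1), with per-class scores
    averaged using each class's share of the true labels as its weight."""
    labels = sorted(set(y_true) | set(y_pred))
    total = len(y_true)
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / total
    precision = recall = f1 = 0.0
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        predicted = sum(p == label for p in y_pred)
        support = sum(t == label for t in y_true)
        p_c = tp / predicted if predicted else 0.0
        r_c = tp / support if support else 0.0
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = support / total
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return accuracy, precision, recall, f1


# Hypothetical labels for three classes.
y_true = [0, 0, 0, 1, 1, 2]
y_pred = [0, 0, 1, 1, 1, 2]
accuracy, precision, recall, f1 = weighted_metrics(y_true, y_pred)
print(round(accuracy, 4), round(precision, 4), round(recall, 4), round(f1, 4))
```

With this averaging, weighted recall reduces to the number of correct predictions divided by the number of samples, which is the definition of accuracy.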
 
Fig. 4. Training and validation accuracy curve of the Transformer model 
over 20 epochs. 
 
 
Fig. 5. Training and validation loss curve of the Transformer model over 20 
epochs. 
As illustrated in the confusion matrix (Figure 6), the model 
demonstrates high recognition rates for six classes—Bacterial 
Spot, Leaf Mold, Septoria Leaf Spot, Spider Mites (Two-
Spotted Spider Mite), Tomato Yellow Leaf Curl Virus, and 
Tomato Mosaic Virus—all exceeding 90%. Recognition rates 
for Early Blight, Healthy, and Late Blight are slightly lower 
but still approach 90%. In contrast, the Target Spot class 
records the lowest recognition rate. This is likely due to the 
relatively small and faint spots on the leaves, which become 
less visible under light exposure, causing the model to 
misclassify them as healthy. 
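
A normalized confusion matrix such as the one in Figure 6 is obtained by dividing each row of raw counts by its row total, so the diagonal holds the per-class recognition rate. The sketch below uses invented counts for three classes purely to show the computation; they are not the paper's results.

```python
def normalize_rows(confusion):
    """Divide every row by its sum so each row describes how one true class
    is distributed over the predicted classes."""
    normalized = []
    for row in confusion:
        total = sum(row)
        normalized.append([count / total if total else 0.0 for count in row])
    return normalized


def lowest_recall_class(confusion, class_names):
    """Return the name and recognition rate of the weakest class."""
    rates = [row[i] for i, row in enumerate(normalize_rows(confusion))]
    worst = min(range(len(rates)), key=rates.__getitem__)
    return class_names[worst], rates[worst]


# Hypothetical counts: rows are true classes, columns are predictions.
names = ["class_a", "class_b", "class_c"]
counts = [
    [90, 5, 5],
    [10, 80, 10],
    [20, 10, 70],
]
print(normalize_rows(counts)[2])           # [0.2, 0.1, 0.7]
print(lowest_recall_class(counts, names))  # ('class_c', 0.7)
```

Reading along the row of the weakest class shows where its errors go, which is the kind of inspection behind the observation that Target Spot is confused with healthy leaves.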
To compare the proposed Transformer model with 
conventional CNN architectures, we conducted additional 
experiments using VGG19 and ResNet50. The results 
demonstrate that the Transformer model outperforms both 
CNN models in terms of accuracy, precision, recall, and F1-score. As illustrated in Figures 7–10, classification metrics 
were calculated separately for each model. 
Specifically, VGG19 achieved an accuracy of 84.23%, 
ResNet50 achieved 86.75%, and the Transformer reached 
91.77%. In terms of precision, VGG19 recorded 85.71%, 
ResNet50 87.62%, and the Transformer 91.89%. 
The Transformer's recall was approximately 7.54 percentage 
points higher than that of VGG19 and 5.02 percentage points 
higher than that of ResNet50. Regarding F1-score, VGG19 
achieved 84.23%, ResNet50 86.75%, and the 
Transformer again led with 91.77%. 
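
The gaps of 7.54 and 5.02 quoted above match the differences between the reported accuracy values, which makes them differences in percentage points rather than relative improvements. The short sketch below computes both from the reported figures; the function names are illustrative.

```python
# Accuracy values (in percent) as reported for the three models.
ACCURACY = {"VGG19": 84.23, "ResNet50": 86.75, "Transformer": 91.77}


def point_gain(baseline, model="Transformer"):
    """Absolute difference in percentage points."""
    return round(ACCURACY[model] - ACCURACY[baseline], 2)


def relative_gain(baseline, model="Transformer"):
    """Improvement expressed as a percentage of the baseline's own score."""
    return 100.0 * (ACCURACY[model] - ACCURACY[baseline]) / ACCURACY[baseline]


print(point_gain("VGG19"))     # 7.54
print(point_gain("ResNet50"))  # 5.02
print(round(relative_gain("VGG19"), 2), round(relative_gain("ResNet50"), 2))
```

Keeping the two notions apart matters when comparing against other studies, since a gain stated in percentage points and one stated as a relative increase are not interchangeable.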
 
Hardware | Software 
8GB RAM | Microsoft Windows 2010 
Core i5-11320H Processor | Python 3.8 
500GB of Hard drive capacity | Anaconda 3 
GeForce RTX 3060 | TensorFlow 2.0 

 
Fig. 6. Normalized confusion matrix of the Transformer model. 
 
 
Fig. 7. Comparison of classification accuracy across VGG19, ResNet50, 
and Transformer models. 
 
 
 
Fig. 8. Comparison of precision scores for VGG19, ResNet50, and 
Transformer models. 

 
Fig. 9. F1-score comparison among VGG19, ResNet50, and Transformer 
models. 
 
 
Fig. 10. Recall comparison across VGG19, ResNet50, and Transformer 
models. 
VI. CONCLUSION 
This study demonstrates the effectiveness of using a 
Transformer-based deep learning model for the classification 
and recognition of tomato leaf diseases. The self-attention 
mechanism within the Transformer architecture enables the 
model to focus selectively on critical regions of input images, 
amplifying relevant features while suppressing noise and 
irrelevant details. As a result, the model achieves high 
classification performance, with an accuracy of 91.77%, 
outperforming traditional CNN-based approaches. 
However, due to the model’s complexity and 
computational requirements, it is currently not optimized for 
deployment on resource-constrained devices such as 
smartphones or edge devices. Future work will focus on 
optimizing and compressing the model to enable real-time, 
mobile-based disease detection. This advancement could 
significantly enhance precision agriculture by providing 
accessible, rapid, and reliable diagnostic tools to farmers and 
agricultural workers in the field. 
 
REFERENCES 
[1] J. J. Burdon, L. G. Barrett, L. N. Yang, et al. Maximising world food production through disease control[J]. BioScience, 2020, 70(2): 126–128. 
[2] Z. Li, R. Paul, T. Ba Tis, et al. Non-invasive plant disease diagnostics enabled by smartphone-based fingerprinting of leaf volatiles[J]. Nature Plants, 2019, 5(8): 856–866. 
[3] M. A. Altieri. Agroecology: the science of sustainable agriculture[M]. CRC Press, 2018. 
[4] S. P. Mohanty, D. P. Hughes, M. Salathé. Using deep learning for image-based plant disease detection[J]. Frontiers in Plant Science, 2016, 7: 1419. 
[5] G. Litjens, T. Kooi, B. E. Bejnordi, et al. A survey on deep learning in medical image analysis[J]. Medical Image Analysis, 2017, 42: 60–88. 
[6] W. Min, S. Jiang, L. Liu, et al. A survey on food computing[J]. ACM Computing Surveys (CSUR), 2019, 52(5): 1–36. 
[7] E. Moen, D. Bannon, T. Kudo, et al. Deep learning for cellular image analysis[J]. Nature Methods, 2019, 16(12): 1233–1246. 
[8] J. G. Arnal Barbedo. Digital image processing techniques for detecting, quantifying, and classifying plant diseases[J]. SpringerPlus, 2013, 2(1): 1–12. 
[9] M. Tan, Q. Le. EfficientNet: Rethinking model scaling for convolutional neural networks[C]//International Conference on Machine Learning. 2019: 6105–6114. 
[10] A. Krizhevsky, I. Sutskever, G. E. Hinton. ImageNet classification with deep convolutional neural networks[J]. Advances in Neural Information Processing Systems, 2012, 25: 1097–1105. 
[11] K. He, X. Zhang, S. Ren, et al. Deep residual learning for image recognition[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2016: 770–778. 
[12] G. Huang, Z. Liu, L. Van Der Maaten, et al. Densely connected convolutional networks[C]//Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition. 2017: 4700–4708. 
[13] A. G. Howard, M. Zhu, B. Chen, et al. MobileNets: Efficient convolutional neural networks for mobile vision applications[J]. arXiv preprint arXiv:1704.04861, 2017. 
[14] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30. 
[15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale[C]//International Conference on Learning Representations. 2020. 
[16] H. Touvron, M. Cord, M. Douze, et al. Training data-efficient image transformers & distillation through attention[C]//International Conference on Machine Learning. 2021: 10347–10357. 
[17] N. Carion, F. Massa, G. Synnaeve, et al. End-to-end object detection with transformers[C]//European Conference on Computer Vision. 2020: 213–229. 
[18] X. Zhu, W. Su, L. Lu, et al. Deformable DETR: Deformable transformers for end-to-end object detection[J]. arXiv preprint arXiv:2010.04159, 2020. 
[19] L. Ye, M. Rochan, Z. Liu, et al. Cross-modal self-attention network for referring image segmentation[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2019: 10502–10511. 
[20] S. Zhang, Q. Zhang, H. Li. A review of sign language recognition based on deep learning[J]. Journal of Electronics and Information Technology, 2020, 42(4): 1021–1032. 
[21] G. X. Hao. Research on tomato leaf disease recognition based on deep learning[D]. Tarim University, 2023.

Cited by 1 paper

Top 1 citing works, by citation count (via OpenAlex).

Automated Tomato Leaf Disease Classification using Convolutional Neural Networks (2025)

Cite this

Huang, X., Alobaedy, M. M., Ibrahim, M. N. H., & Khalaf, A. A. (2025). Tomato Disease Classification Using Transformer-Based Deep Learning Models. *2025 IEEE 7th Symposium on Computers & Informatics (ISCI), 339–343*. https://doi.org/10.1109/ISCI65687.2025.11166703