Tomato Disease Classification using Transformer-Based Deep Learning Models
1st Xiaocun Huang
Faculty of Information Technology
City University Malaysia
Cyberjaya, Malaysia
huangxiaocun@caztc.edu.cn
2nd Mustafa Muwafak Alobaedy
Faculty of Information Technology
City University Malaysia
Cyberjaya, Malaysia
alobaedy@ieee.org
3rd Mohd Nurul Hafiz Ibrahim
Faculty of Information Technology
City University Malaysia
Cyberjaya, Malaysia
mohd.nurul.hafiz@city.edu.my
4th Ali A Khalaf
Computer Science Department, College of Science
University of Baghdad
Baghdad, Iraq
alrahmanali@sc.uobaghdad.edu.iq
Abstract— Crop diseases present a major challenge to global
agriculture, reducing yields and threatening food security.
Traditional manual methods for detecting diseases in crops like
tomatoes are labor-intensive and lack accuracy. This paper
proposes a deep learning model based on the Transformer
architecture for automated tomato leaf disease classification.
Leveraging the self-attention mechanism of Transformers, the
model captures global contextual features from leaf images,
achieving superior recognition accuracy. The model was trained
and tested on a diverse dataset comprising ten categories (nine
diseases and healthy leaves), with data augmentation enhancing
its robustness. Experimental results demonstrate that the
Transformer-based model achieved a classification accuracy of
91.77%, outperforming traditional CNN models such as VGG19
and ResNet50. This study highlights the potential of
Transformer architectures in precision agriculture, enabling
efficient and accurate disease recognition. Future work will
focus on optimizing the model for deployment on mobile devices
to facilitate real-time, on-field disease diagnosis.
Keywords— Transformer model, tomato disease classification,
deep learning, attention mechanism, computer vision.
I. INTRODUCTION
Plant diseases significantly affect crop yields and food
safety, directly impacting the agricultural economy and
causing substantial economic losses [1]. Moreover, they
exacerbate the global issue of food insecurity [2]. To mitigate
losses caused by plant diseases, it is essential to continuously
monitor crop health and implement timely and effective
countermeasures [3]. However, traditional manual detection
methods are unsuitable for large-scale monitoring, as they
require substantial human and financial resources.
Additionally, these methods often fail to detect disease
progression promptly, leading to missed opportunities for
intervention and, consequently, increased losses [4].
Therefore, accurate and timely detection of plant diseases is
critical.
In recent years, the rapid development of artificial
intelligence (AI) has enabled many fields to adopt AI-based
technologies to address complex challenges effectively [5–7].
As a crucial branch of AI, computer vision has emerged as a
powerful tool for solving various visual recognition problems,
contributing significantly to advances in object detection and
classification [8]. Convolutional Neural Networks (CNNs), in
particular, have demonstrated outstanding performance in
image classification tasks. Notable CNN-based architectures
include EfficientNet [9], AlexNet [10], ResNet [11],
DenseNet [12], and MobileNet [13]. Despite their success,
CNNs have inherent limitations, especially regarding their
restricted receptive field, which limits their ability to capture
global image context due to the fixed size of convolutional
kernels.
To address this limitation, researchers have drawn
inspiration from the Transformer architecture, initially
developed for natural language processing tasks [14], and
adapted it for computer vision applications through the Vision
Transformer (ViT) [15]. ViT has shown promising results in
various computer vision tasks such as image classification
[16], object detection [17,18], and image segmentation [19].
Owing to the self-attention mechanism, Vision Transformers
can effectively capture global dependencies, allowing the
model to integrate information from across the whole image
even in its earliest layers.
This study leverages the Transformer architecture to
develop a deep learning model for classifying tomato leaf
diseases. The proposed model demonstrates high recognition
accuracy, highlighting the potential of Transformer-based
models in agricultural disease detection and precision farming.
II. CONVOLUTIONAL NEURAL NETWORKS
The traditional Convolutional Neural Network (CNN)
architecture consists of five main components: the input layer,
convolutional layer, pooling (downsampling) layer,
activation function layer, and fully connected layer. As
illustrated in Figure 1, these components are sequentially
stacked to form a deep convolutional structure, enabling the
network to effectively extract features from input images.
Fig. 1. Architecture of a Convolutional Neural Network (CNN) [20].
The size of the convolutional kernel determines the
receptive field and feature extraction capability of a neural
network model. A standard convolutional layer can include
kernels of various sizes, such as 3×3, 5×5, and 7×7, each
affecting the spatial resolution and receptive field of the
resulting feature maps.
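The relationship between kernel size, receptive field, and feature-map resolution can be made concrete with the standard formulas for stacked convolutions. The following is a minimal sketch in plain Python; the function names and the example layer stacks are illustrative assumptions and do not describe any specific network used in this study.

```python
def conv_output_size(input_size, kernel, stride=1, padding=0):
    """Spatial size of a feature map after one convolution (or pooling) layer."""
    return (input_size + 2 * padding - kernel) // stride + 1


def receptive_field(layers):
    """Receptive field, in input pixels, of one output unit.

    `layers` is a list of (kernel_size, stride) pairs applied in order.
    """
    field = 1
    jump = 1  # distance in input pixels between neighbouring output units
    for kernel, stride in layers:
        field += (kernel - 1) * jump
        jump *= stride
    return field


# Illustrative stacks (hypothetical, not taken from the paper).
one_3x3 = receptive_field([(3, 1)])
three_3x3 = receptive_field([(3, 1), (3, 1), (3, 1)])
one_7x7 = receptive_field([(7, 1)])
print(one_3x3, three_3x3, one_7x7)

# A 7x7 kernel with stride 2 and padding 3 halves a 224-pixel side.
print(conv_output_size(224, kernel=7, stride=2, padding=3))
```

Under these formulas, three stacked 3×3 layers see the same input region as a single 7×7 layer, which shows that the receptive field grows only gradually with depth.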
The pooling layer plays a critical role in convolutional
neural networks by performing downsampling operations. It
not only helps extract essential features but also reduces the
computational cost and the number of parameters in the model.
Typically, the pooling layer follows the convolutional layer
and processes its output feature maps.
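As an illustration of the downsampling step, the sketch below applies 2×2 max pooling with stride 2 to a small feature map represented as a list of lists. The window size, stride, and input values are assumptions chosen for the example; the paper does not state which pooling configuration the baseline CNNs use.

```python
def max_pool2d(feature_map, size=2, stride=2):
    """Max pooling over a 2-D feature map given as a list of rows."""
    rows, cols = len(feature_map), len(feature_map[0])
    pooled = []
    for top in range(0, rows - size + 1, stride):
        pooled_row = []
        for left in range(0, cols - size + 1, stride):
            window = [
                feature_map[top + i][left + j]
                for i in range(size)
                for j in range(size)
            ]
            pooled_row.append(max(window))  # keep the strongest response
        pooled.append(pooled_row)
    return pooled


# Hypothetical 4x4 feature map.
feature_map = [
    [1, 3, 2, 0],
    [4, 6, 1, 1],
    [0, 2, 9, 5],
    [1, 1, 3, 7],
]
pooled = max_pool2d(feature_map)
print(pooled)  # 2x2 output: each value is the maximum of one 2x2 window
```

The output has a quarter of the original entries, so every subsequent layer processes less data.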
Activation functions are another essential component,
enabling the network to learn complex patterns in image data.
By introducing non-linearity into the model, activation
functions allow neural networks to represent and approximate
intricate functional relationships, thereby improving their
ability to fit complex data.
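The role of non-linearity can be seen in a one-dimensional toy case: two linear layers without an activation collapse into a single linear map, whereas inserting a ReLU between them does not. This is a minimal sketch with hypothetical weights; ReLU is used only as a common example, since the paper does not name a specific activation function.

```python
def relu(x):
    """Rectified linear unit: passes positive values, zeroes out negatives."""
    return max(0.0, x)


def two_linear_layers(x, w1, w2):
    """Two stacked linear layers with no activation in between."""
    return w2 * (w1 * x)


def two_layers_with_relu(x, w1, w2):
    """The same two layers with a ReLU between them."""
    return w2 * relu(w1 * x)


w1, w2 = -2.0, 3.0  # hypothetical weights

# Without an activation, the stack equals one linear layer of weight w1 * w2.
for x in (-1.0, 0.5, 2.0):
    assert two_linear_layers(x, w1, w2) == (w1 * w2) * x

# With ReLU, the response is no longer a single straight line through the origin.
print(two_layers_with_relu(-1.0, w1, w2), two_layers_with_relu(1.0, w1, w2))
```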
The fully connected layer, usually positioned at the end of
a convolutional neural network, serves to transform the
two-dimensional feature maps into one-dimensional vectors. It
integrates the local features extracted by previous layers into
a comprehensive representation of the input image using a
weight matrix.
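The flattening and weighting steps can be sketched as follows. The feature-map values, the weight matrix, and the biases are hypothetical numbers chosen so the arithmetic is easy to follow; a real network would learn them during training.

```python
def flatten(feature_maps):
    """Concatenate a list of 2-D feature maps into one 1-D vector."""
    return [value for fmap in feature_maps for row in fmap for value in row]


def fully_connected(vector, weights, biases):
    """One output per weight row: dot product with the input plus a bias."""
    return [
        sum(w * x for w, x in zip(weight_row, vector)) + bias
        for weight_row, bias in zip(weights, biases)
    ]


# One hypothetical 2x2 feature map and a layer with two output units.
feature_maps = [[[1.0, 2.0], [3.0, 4.0]]]
weights = [
    [1.0, 0.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5],
]
biases = [0.0, 1.0]

flat = flatten(feature_maps)
outputs = fully_connected(flat, weights, biases)
print(flat, outputs)
```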
III. TRANSFORMER MODEL STRUCTURE
In 2017, Vaswani et al. [14] proposed the Transformer model
for natural language processing tasks. Inspired by this
architecture, Dosovitskiy et al. [15] later adapted the
Transformer for computer vision applications. Since then, the
Transformer has demonstrated outstanding performance in
various core computer vision tasks.
The Transformer model consists of two main components:
the encoder and the decoder. In most vision-related
applications, the encoder plays the primary role, focusing on
feature extraction and representation. The encoder is
composed of multiple identical layers, known as Encoder
Layers, each containing two key subcomponents: a Self-
Attention layer and a Feed-Forward layer. This structure is
illustrated in Figure 2.
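The self-attention sublayer of an encoder layer can be sketched with scaled dot-product attention as defined in [14]. The example below is a single-head version in plain Python; it omits multi-head projection, residual connections, layer normalization, and the feed-forward sublayer, and the token values and identity projection matrices are assumptions chosen for readability rather than parameters of the proposed model.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of scores."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def matmul(a, b):
    """Matrix product of two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def self_attention(tokens, w_q, w_k, w_v):
    """Single-head scaled dot-product self-attention.

    Returns the attended token representations and the attention weights.
    """
    queries, keys, values = matmul(tokens, w_q), matmul(tokens, w_k), matmul(tokens, w_v)
    scale = math.sqrt(len(keys[0]))
    scores = [
        [sum(q * k for q, k in zip(q_row, k_row)) / scale for k_row in keys]
        for q_row in queries
    ]
    weights = [softmax(row) for row in scores]  # each row sums to 1
    return matmul(weights, values), weights


# Three hypothetical 2-dimensional tokens and identity projections.
tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
identity = [[1.0, 0.0], [0.0, 1.0]]
attended, attention = self_attention(tokens, identity, identity, identity)
print(attention[0])  # how strongly token 0 attends to tokens 0, 1, and 2
```

Every token receives a weight for every other token, which is what allows the encoder to relate distant regions of an image within a single layer.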
In the fine-grained classification task of tomato leaf
diseases, the Transformer model performs feature encoding by
dividing the plant image into patches. These image patches are
then flattened into a sequential format. The model's
self-attention mechanism assigns different weights to each element
in the sequence, enabling it to effectively capture the
relationships between patches. This process enhances the
model’s ability to recognize subtle variations in the leaf
images, as illustrated in Figure 3.
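The patch-splitting step can be sketched in plain Python as follows. The sketch uses the sizes reported in Section V (224×224 input, patches of 14×14 pixels) and a single-channel synthetic image; the function name and the row-major patch order are assumptions, and the learned linear projection and position embeddings of a full Vision Transformer [15] are not shown.

```python
def split_into_patches(image, patch_size):
    """Split a 2-D image (list of rows) into flattened, non-overlapping patches.

    Patches are returned in row-major order, each as a 1-D list of pixels.
    """
    height, width = len(image), len(image[0])
    if height % patch_size or width % patch_size:
        raise ValueError("image size must be a multiple of the patch size")
    patches = []
    for top in range(0, height, patch_size):
        for left in range(0, width, patch_size):
            patch = [
                image[top + i][left + j]
                for i in range(patch_size)
                for j in range(patch_size)
            ]
            patches.append(patch)
    return patches


# Synthetic 224x224 single-channel image whose pixel value encodes its position.
SIDE, PATCH = 224, 14
image = [[row * SIDE + col for col in range(SIDE)] for row in range(SIDE)]

patches = split_into_patches(image, PATCH)
print(len(patches), len(patches[0]))  # sequence length and values per patch
```

With these sizes the image becomes a sequence of 256 patch vectors of 196 values each (per channel), which is the sequential input the self-attention mechanism operates on.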
Fig. 2. Transformer Model Architecture
Fig. 3. Feature extraction of tomato leaf diseases [21].
IV. DATA
The dataset used in this study comprises images from ten
categories, including various tomato leaf diseases and healthy
leaves. The images were collected from multiple sources, with
some obtained directly and others sourced from the internet,
resulting in variations in format and resolution.
To evaluate the model’s performance, the dataset was
divided into training, validation, and testing sets in a 60:20:20
ratio. Data augmentation—a commonly used technique in
deep learning—was applied to increase the dataset size and
diversity by transforming and expanding the existing images.
This approach helps improve the model’s generalization and
robustness, enabling it to make more accurate predictions on
unseen data. During preprocessing, techniques such as
flipping, mirroring, and noise addition were used to augment
the images. The distribution of images across the ten classes,
including diseased and healthy tomato leaves, is presented in
Table I.
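The three augmentation techniques named above can be sketched on a small grayscale image stored as a list of rows. The paper does not define the operations precisely, so this sketch assumes that "flipping" means a top-to-bottom flip, "mirroring" means a left-to-right reflection, and noise addition means zero-mean Gaussian noise; the noise level `sigma` and the seed are hypothetical.

```python
import random


def flip_vertical(image):
    """Top-to-bottom flip: reverse the order of the rows."""
    return [list(row) for row in reversed(image)]


def mirror_horizontal(image):
    """Left-to-right mirror: reverse the pixels within each row."""
    return [list(reversed(row)) for row in image]


def add_gaussian_noise(image, sigma, rng):
    """Add zero-mean Gaussian noise and clip to the valid range [0, 255]."""
    return [
        [min(255.0, max(0.0, pixel + rng.gauss(0.0, sigma))) for pixel in row]
        for row in image
    ]


# Hypothetical 2x2 grayscale image.
image = [[10, 20], [30, 40]]
rng = random.Random(0)  # fixed seed so the sketch is repeatable

flipped = flip_vertical(image)
mirrored = mirror_horizontal(image)
noisy = add_gaussian_noise(image, sigma=5.0, rng=rng)
print(flipped, mirrored)
```

Each transform yields a new training sample with the same label as the original image.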
TABLE I. DATASET DISTRIBUTION
Disease species                          Number of images
Bacterial_spot                           1735
Early_blight                             2236
Late_blight                              2053
Leaf_Mold                                2175
Septoria_leaf_spot                       2349
Spider_mites Two-spotted_spider_mite     1744
Target_Spot                              1826
Tomato_mosaic_virus                      1818
Tomato_Yellow_Leaf_Curl_Virus            1989
Healthy                                  1952
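To make the 60:20:20 division concrete, the sketch below computes per-class training, validation, and test counts from the figures in Table I. It assumes the split is applied separately within each class (a stratified split) and that any rounding remainder goes to the test set; the paper states neither detail, nor whether the counts in Table I are taken before or after augmentation.

```python
# Image counts per class, copied from Table I.
CLASS_COUNTS = {
    "Bacterial_spot": 1735,
    "Early_blight": 2236,
    "Late_blight": 2053,
    "Leaf_Mold": 2175,
    "Septoria_leaf_spot": 2349,
    "Spider_mites Two-spotted_spider_mite": 1744,
    "Target_Spot": 1826,
    "Tomato_mosaic_virus": 1818,
    "Tomato_Yellow_Leaf_Curl_Virus": 1989,
    "Healthy": 1952,
}


def split_counts(n_images, train_pct=60, val_pct=20):
    """Return (train, validation, test) counts using integer arithmetic."""
    n_train = n_images * train_pct // 100
    n_val = n_images * val_pct // 100
    n_test = n_images - n_train - n_val  # remainder goes to the test set
    return n_train, n_val, n_test


per_class = {name: split_counts(count) for name, count in CLASS_COUNTS.items()}
total_images = sum(CLASS_COUNTS.values())
totals = tuple(sum(split[i] for split in per_class.values()) for i in range(3))

for name, (n_train, n_val, n_test) in per_class.items():
    print(f"{name:40s} {n_train:5d} {n_val:5d} {n_test:5d}")
print(total_images, totals)
```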
V. EXPERIMENT AND ANALYSIS
The experimental environment used for model training
and evaluation is outlined in Table II, detailing both the
hardware and software specifications. These configurations
were selected to ensure smooth processing of image data and
efficient training of the Transformer-based model.
TABLE II. HARDWARE AND SOFTWARE REQUIREMENTS
The input images were resized to 224×224 pixels and then
divided into a 16×16 grid of non-overlapping patches (256
patches per image), each of size 14×14 pixels. The model was
trained using the cross-entropy
loss function and optimized with Stochastic Gradient Descent
(SGD). Training was conducted for 20 epochs with a batch
size of 8. The initial learning rate was set to 0.001 and
dynamically adjusted using a cosine decay schedule.
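A cosine decay schedule of the kind described above can be sketched as follows. The initial learning rate (0.001) and the number of epochs (20) are taken from the text; the minimum learning rate of zero and the choice to update once per epoch rather than once per batch are assumptions, since the paper does not specify them.

```python
import math


def cosine_decay_lr(step, total_steps, initial_lr=0.001, min_lr=0.0):
    """Learning rate at `step` under a half-cosine decay from initial_lr to min_lr."""
    progress = min(step, total_steps) / total_steps
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    return min_lr + (initial_lr - min_lr) * cosine


EPOCHS = 20
schedule = [cosine_decay_lr(epoch, EPOCHS) for epoch in range(EPOCHS + 1)]

for epoch in (0, 5, 10, 15, 20):
    print(f"epoch {epoch:2d}: lr = {schedule[epoch]:.6f}")
```

The rate falls slowly at first, fastest around the midpoint, and slowly again near the end, so early epochs take large steps and late epochs make fine adjustments.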
The model achieved an accuracy of 91.77%, precision of
91.89%, recall of 91.77%, and an F1-score of 91.77% across
10 categories, comprising nine disease types and one healthy
class. Figures 4 and 5 illustrate the training accuracy and loss
curves, respectively.
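The four reported metrics can all be derived from a confusion matrix. The sketch below computes accuracy together with support-weighted precision, recall, and F1-score. The averaging method is an assumption: the paper does not state it, although the fact that the reported recall equals the reported accuracy (both 91.77%) is consistent with support-weighted averaging, for which the two quantities are always identical. The 3×3 matrix is hypothetical and is not the paper's data.

```python
def classification_metrics(confusion):
    """Accuracy and support-weighted precision, recall, and F1.

    confusion[i][j] is the number of samples of true class i predicted as class j.
    """
    n_classes = len(confusion)
    total = sum(sum(row) for row in confusion)
    accuracy = sum(confusion[i][i] for i in range(n_classes)) / total

    precision = recall = f1 = 0.0
    for i in range(n_classes):
        true_positive = confusion[i][i]
        support = sum(confusion[i])  # samples whose true class is i
        predicted = sum(confusion[r][i] for r in range(n_classes))
        p = true_positive / predicted if predicted else 0.0
        r = true_positive / support if support else 0.0
        f = 2 * p * r / (p + r) if (p + r) else 0.0
        weight = support / total
        precision += weight * p
        recall += weight * r
        f1 += weight * f
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


# Hypothetical 3-class confusion matrix (rows: true class, columns: predicted).
confusion = [
    [8, 1, 1],
    [2, 7, 1],
    [0, 1, 9],
]
metrics = classification_metrics(confusion)
print({name: round(value, 4) for name, value in metrics.items()})
```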
Fig. 4. Training and validation accuracy curve of the Transformer model
over 20 epochs.
Fig. 5. Training and validation loss curve of the Transformer model over 20
epochs.
As illustrated in the confusion matrix (Figure 6), the model
demonstrates high recognition rates for six classes—Bacterial
Spot, Leaf Mold, Septoria Leaf Spot, Spider Mites (Two-
Spotted Spider Mite), Tomato Yellow Leaf Curl Virus, and
Tomato Mosaic Virus—all exceeding 90%. Recognition rates
for Early Blight, Healthy, and Late Blight are slightly lower
but still approach 90%. In contrast, the Target Spot class
records the lowest recognition rate. This is likely due to the
relatively small and faint spots on the leaves, which become
less visible under light exposure, causing the model to
misclassify them as healthy.
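The per-class recognition rates discussed above correspond to the diagonal of a row-normalized confusion matrix. The following sketch shows how such rates, and the most frequent wrong prediction for the weakest class, can be read from a matrix of raw counts. The three-class matrix and its values are hypothetical and chosen only to mirror the pattern described in the text; they are not the values in Figure 6.

```python
def normalize_rows(confusion):
    """Divide each row by its sum so entries become per-class rates."""
    normalized = []
    for row in confusion:
        row_total = sum(row)
        normalized.append([c / row_total if row_total else 0.0 for c in row])
    return normalized


# Hypothetical counts (rows: true class, columns: predicted class).
labels = ["Target_Spot", "Healthy", "Early_blight"]
confusion = [
    [70, 25, 5],
    [4, 94, 2],
    [6, 4, 90],
]

normalized = normalize_rows(confusion)
recognition_rate = {labels[i]: normalized[i][i] for i in range(len(labels))}

# Class with the lowest recognition rate and its most common wrong prediction.
weakest = min(recognition_rate, key=recognition_rate.get)
row = labels.index(weakest)
confused_with = labels[
    max((col for col in range(len(labels)) if col != row), key=lambda col: normalized[row][col])
]
print(weakest, recognition_rate[weakest], confused_with)
```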
To compare the proposed Transformer model with
conventional CNN architectures, we conducted additional
experiments using VGG19 and ResNet50. The results
demonstrate that the Transformer model outperforms both
CNN models in terms of accuracy, precision, recall, and
F1-score. As illustrated in Figures 7–10, each classification
metric was calculated separately for each model.
Specifically, VGG19 achieved an accuracy of 84.23%,
ResNet50 achieved 86.75%, and the Transformer reached
91.77%. In terms of precision, VGG19 recorded 85.71%,
ResNet50 87.62%, and the Transformer 91.89%. The recall of
the Transformer was approximately 7.54 percentage points
higher than that of VGG19 and 5.02 percentage points higher
than that of ResNet50. Regarding F1-score, VGG19 achieved
84.23%, ResNet50 86.75%, and the Transformer again led with
91.77%.
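The gaps between the models are most naturally read as differences in percentage points, which is distinct from a relative improvement. The sketch below computes both from the figures reported in this section; the dictionary layout and function names are illustrative assumptions. Baseline recall values are not listed because the text reports them only as gains.

```python
# Metric values (in percent) as reported in this section.
REPORTED = {
    "VGG19": {"accuracy": 84.23, "precision": 85.71, "f1": 84.23},
    "ResNet50": {"accuracy": 86.75, "precision": 87.62, "f1": 86.75},
    "Transformer": {"accuracy": 91.77, "precision": 91.89, "f1": 91.77},
}


def gain_in_points(metric, baseline, model="Transformer"):
    """Absolute difference between two models, in percentage points."""
    return round(REPORTED[model][metric] - REPORTED[baseline][metric], 2)


def relative_gain_percent(metric, baseline, model="Transformer"):
    """Improvement expressed as a percentage of the baseline's own value."""
    base = REPORTED[baseline][metric]
    return round(100.0 * (REPORTED[model][metric] - base) / base, 2)


for baseline in ("VGG19", "ResNet50"):
    print(
        baseline,
        gain_in_points("accuracy", baseline),
        gain_in_points("precision", baseline),
        relative_gain_percent("accuracy", baseline),
    )
```

The accuracy gaps computed here (7.54 and 5.02 points) coincide with the recall gains quoted in the text.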
Hardware                         Software
8 GB RAM                         Microsoft Windows 2010
Core i5-11320H processor         Python 3.8
500 GB hard drive capacity       Anaconda 3
GeForce RTX 3060                 TensorFlow 2.0
Fig. 6. Normalized confusion matrix of the Transformer model.
Fig. 7. Comparison of classification accuracy across VGG19, ResNet50,
and Transformer models.
Fig. 8. Comparison of precision scores for VGG19, ResNet50, and
Transformer models.
Fig. 9. F1-score comparison among VGG19, ResNet50, and Transformer
models.
Fig. 10. Recall comparison across VGG19, ResNet50, and Transformer
models.
VI. CONCLUSION
This study demonstrates the effectiveness of using a
Transformer-based deep learning model for the classification
and recognition of tomato leaf diseases. The self-attention
mechanism within the Transformer architecture enables the
model to focus selectively on critical regions of input images,
amplifying relevant features while suppressing noise and
irrelevant details. As a result, the model achieves high
classification performance, with an accuracy of 91.77%,
outperforming traditional CNN-based approaches.
However, due to the model’s complexity and
computational requirements, it is currently not optimized for
deployment on resource-constrained devices such as
smartphones or edge devices. Future work will focus on
optimizing and compressing the model to enable real-time,
mobile-based disease detection. This advancement could
significantly enhance precision agriculture by providing
accessible, rapid, and reliable diagnostic tools to farmers and
agricultural workers in the field.
REFERENCES
[1] J. J. Burdon, L. G. Barrett, L. N. Yang, et al. Maximising world food
production through disease control[J]. BioScience, 2020, 70(2): 126-128.
[2] Z. Li, R. Paul, T. Ba Tis, et al. Non-invasive plant disease diagnostics
enabled by smartphone-based fingerprinting of leaf volatiles[J]. Nature
Plants, 2019, 5(8): 856-866.
[3] M. A. Altieri. Agroecology: the science of sustainable agriculture[M].
CRC Press, 2018.
[4] S. P. Mohanty, D. P. Hughes, M. Salathé. Using deep learning for
image-based plant disease detection[J]. Frontiers in plant science, 2016,
7: 1419.
[5] G. Litjens, T. Kooi, B. E. Bejnordi, et al. A survey on deep learning in
medical image analysis[J]. Medical image analysis, 2017, 42: 60-88.
[6] W. Min, S. Jiang, L. Liu, et al. A survey on food computing[J]. ACM
Computing Surveys (CSUR), 2019, 52(5): 1-36.
[7] E. Moen, D. Bannon, T. Kudo, et al. Deep learning for cellular image
analysis[J]. Nature Methods, 2019, 16(12): 1233-1246.
[8] J. G. Arnal Barbedo. Digital image processing techniques for detecting,
quantifying, and classifying plant diseases[J]. SpringerPlus, 2013, 2(1):
1-12.
[9] M. Tan, Q. Le. Efficientnet: Rethinking model scaling for
convolutional neural networks[C]//International conference on
machine learning. 2019: 6105-6114.
[10] A. Krizhevsky, I. Sutskever, G. E. Hinton. Imagenet classification with
deep convolutional neural networks[J]. Advances in neural information
processing systems, 2012, 25: 1097-1105.
[11] K. He, X. Zhang, S. Ren, et al. Deep residual learning for image
recognition[C]//Proceedings of the IEEE conference on computer
vision and pattern recognition. 2016: 770-778.
[12] G. Huang, Z. Liu, L. Van Der Maaten, et al. Densely connected
convolutional networks[C]//Proceedings of the IEEE conference on
computer vision and pattern recognition. 2017: 4700-4708.
[13] A. G. Howard, M. Zhu, B. Chen, et al. Mobilenets: Efficient
convolutional neural networks for mobile vision applications[J]. ArXiv
preprint arXiv:1704.04861, 2017.
[14] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need[J].
Advances in neural information processing systems, 2017, 30.
[15] A. Dosovitskiy, L. Beyer, A. Kolesnikov, et al. An Image is Worth
16x16 Words: Transformers for Image Recognition at
Scale[C]//International Conference on Learning Representations.
2020.
[16] H. Touvron, M. Cord, M. Douze, et al. Training data-efficient image
transformers & distillation through attention[C]//International
Conference on Machine Learning. 2021: 10347-10357.
[17] N. Carion, F. Massa, G. Synnaeve, et al. End-to-end object detection
with transformers[C]//European Conference on Computer Vision.
2020: 213-229.
[18] X. Zhu, W. Su, L. Lu, et al. Deformable detr: Deformable transformers
for end-to-end object detection[J]. ArXiv preprint arXiv:2010.04159,
2020.
[19] L. Ye, M. Rochan, Z. Liu, et al. Cross-modal self-attention network for
referring image segmentation[C]//Proceedings of the IEEE/CVF
Conference on Computer Vision and Pattern Recognition. 2019:
10502-10511.
[20] S. Zhang, Q. Zhang, H. Li. A review of sign language recognition based
on deep learning[J]. Journal of Electronics and Information
Technology, 2020, 42(4): 1021-1032.
[21] G. X. Hao. Research on tomato leaf disease recognition based on deep
learning[D]. Tarim University, 2023.