Real-time and Intelligent Moving Targets Tracking based on UAV Remote Sensing Video Camera and Brain-like Computing Chips

Moving target tracking technology based on Unmanned Aerial Vehicles (UAVs) is widely used in fields such as automatic inspection and emergency response. Existing moving target tracking methods usually suffer from heavy computation and low tracking efficiency. Because the computing power of the UAV platform is limited, real-time tracking and analysis of multiple targets from the video data collected by the UAV platform is a difficult task. In this paper, we propose a novel Target Specific Filtering Tracking with Memory (TSFMTrack) method designed for UAV-based real-time tracking tasks, which involves a Tracklet Filtering Module (TFM) for capturing object appearance features and a Tracklet Matching Module (TMM) for bounding box association in each frame. In experimental comparisons with other State-Of-The-Art (SOTA) methods on popular MOT and UAV tracking datasets, TSFMTrack shows clear advantages in accuracy, computational efficiency and reliability. Furthermore, we deployed TSFMTrack on the brain-inspired chip Lynchip KA200; the experimental results show that TSFMTrack is effective on an edge computing platform and suitable for UAV real-time tracking tasks.


Introduction
In recent years, Unmanned Aerial Vehicle (UAV) remote sensing technology has developed rapidly, and videos captured by UAVs are used for intelligent target tracking analysis. Because UAV platforms offer unique advantages such as compact size, agile maneuverability, and enhanced safety features, UAV-based object tracking has found wide use in areas such as emergency response, traffic management, and factory inspection. However, real-time and intelligent moving target tracking algorithms still face several major challenges. In complex real-world situations, the following factors hinder real-time and accurate moving target tracking.
• Limited energy and computational resources. Limitations in power supply and payload significantly constrain the speed at which UAV remote sensing images can be processed and analyzed in real time. Batteries have long been a barrier, although tethered systems can help compensate for this weakness, allowing flights of several hours. To achieve real-time, accurate, and robust moving target tracking, algorithms must strike a balance between accuracy and efficiency. Meanwhile, the design should be sufficiently lightweight to conserve energy for other energy-consuming control functions in complex environments, thereby enabling the collection of more geographic information during each flight.
• Influence of camera motion. UAV-mounted cameras move fast and change angle continuously, so the captured images reflect the relative motion between the UAV and ground objects. Failure to correct this can lead to significant errors in target trajectory prediction.
• Viewpoint changes. During sampling, UAVs often fly around objects, capturing different sides of 3D ground objects, which leads to diverse changes in object appearance. Without timely online learning and model updates, trackers may misjudge target trajectories or even lose track of targets.
• Low image resolution. The large visual range of UAVs results in background information being insufficient, leading to reduced object resolution in captured images and weakened model representation. This diminished representation can impair the tracker's discriminative ability, ultimately resulting in tracking failures.
• Illumination variations and visual occlusion. The lighting conditions for UAVs can change rapidly, ranging from bright to dim environments or transitioning between indoor, canopy, shadowed, and sunlit areas. Furthermore, UAVs frequently encounter complex and poorly lit natural environments during flight, such as nighttime, rainy, or foggy conditions, making it challenging for trackers to distinguish objects from the background. Also, partial or complete occlusion can hinder obtaining information about objects, making it easy to lose track of them.
In terms of multiple object tracking (MOT), the Tracking-by-Detection (TbD) paradigm is one of the mainstream approaches. Comprising a detection phase and a tracking phase, this paradigm first determines the locations of the targets and then associates them between frames, generating estimated tracks of the targets.
For TbD trackers, the performance of the detection algorithm is crucial, with notable contributions coming from the YOLO series (Glenn, 2022, Glenn, 2024, Ge et al., 2021). These real-time detectors leverage anchor-based convolutional neural networks (CNN) to solve the detection problem through regression, thus achieving remarkable inference speed with relatively high accuracy. Complementing detection, object tracking techniques have witnessed significant advancements, for instance SORT (Wojke et al., 2017), DeepSORT (Pujara and Bhamare, 2022) and their variants (Cao et al., 2023, Maggiolino et al., 2023, Aharon et al., 2022, Zhang et al., 2022). These methods combine Kalman Filters with advanced trajectory matching algorithms and CNNs to enhance tracking robustness, particularly in scenarios characterized by occlusions and nonlinear motion dynamics.
Moreover, recent advances in attention mechanisms and correlation filter-based approaches offer promising inspiration for improving tracking accuracy and real-time performance. TrackFormer (Meinhardt et al., 2022) and correlation filter-based trackers such as MOSSE (Bolme et al., 2010) and ECO (Danelljan et al., 2017) leverage attention mechanisms and the Discrete Fourier Transform, respectively, to tackle complex tracking scenarios while maintaining low computational complexity.
This study aims to address the challenges of real-time and intelligent moving target tracking based on UAV remote sensing video cameras, especially those arising from the unique environment of UAV platforms, and further outlines potential directions to guide research in UAV-based moving target tracking. The primary contributions of this work can be summarized as follows.
In Section 2, we review the related literature in the field. The core idea and methodology are explained in Section 3. The developed tracker is integrated into a cohesive code repository to facilitate accessibility and reproducibility. Experimental evaluation and onboard testing of the proposed method are presented in Section 4, where TSFMTrack is deployed on the Lynchip KA200 brain-inspired chip for comprehensive inference testing on UAV remote sensing videos. Experiments are conducted on three authoritative benchmark datasets, namely MOT17 (Sun et al., 2019), MOT20 (Dendorfer et al., 2020) and UAVDT (Du et al., 2018), to comprehensively assess the performance of TSFMTrack in complex scenarios. SOTA target tracking algorithms such as ByteTrack (Zhang et al., 2022) and BoT-SORT (Aharon et al., 2022) were deployed on the same chip for experimental comparison.

Tracking-by-Detection
There are two main approaches to Multiple Object Tracking (MOT): tracking by detection (TbD) and joint detection and tracking (JDT), with the former being widely used due to its simplicity and modularity. Generally, the TbD method can be divided into two parts: object detection and object tracking.
The YOLO series (Redmon et al., 2016, Glenn, 2022, Glenn, 2024) are commonly used real-time detectors in MOT, outperforming their counterparts in speed and accuracy by modeling detection as a regression problem and introducing anchor-based CNNs to solve it. YOLOX (Ge et al., 2021) removes the prior anchors in YOLO and adds other techniques to reduce hyper-parameters and computational cost while still achieving promising performance.
In object tracking, SORT (Wojke et al., 2017) combines the Kalman Filter and the Hungarian matching algorithm to create a simple yet effective tracker. Building on SORT, DeepSORT (Pujara and Bhamare, 2022) adds a CNN module to learn visual features, improving the tracker's performance in occlusion scenarios. OC-SORT (Cao et al., 2023) employs observation-centric compensation methods to deal with the error accumulation of Kalman filtering in nonlinear motion scenarios. However, OC-SORT's high reliance on image quality results in less effective performance in practical applications. Addressing this issue, Deep-OC-SORT (Maggiolino et al., 2023) combines the aforementioned trackers and adds correction terms on objects' appearance to tackle feature degradation. To overcome the limitations of SORT-like trackers, BoT-SORT (Aharon et al., 2022) combines motion and appearance information to optimize bounding box direction. ByteTrack (Zhang et al., 2022) associates almost every detection box to minimize mismatching while maintaining a high running speed.
Although all the aforementioned methods produce sound and promising results, most of them rely on high-performance GPUs. In UAV tracking scenarios, computational resources are strictly limited by the UAV's payload capacity. Additionally, training deep networks suitable for UAV tracking requires a large number of UAV-based datasets, which are currently insufficient to support this need.

Accurate Tracking with Attention Mechanism
The attention mechanism (Vaswani et al., 2017) has also demonstrated its capability in object tracking. TrackFormer (Meinhardt et al., 2022) models the tracking task as a prediction problem. Using attention for association and an encoder-decoder structure to predict tracklets, it outperformed many state-of-the-art traditional trackers in accuracy. TransMOT effectively models relations between a large number of objects, mainly through a spatial-temporal graph transformer structure. SMILETrack (Wang et al., 2024) incorporates a Siamese network to capture appearance features. By employing Patch Self-Attention mechanisms, SMILETrack effectively attends to image similarity and enhances performance in the presence of occlusions. Although attention-based trackers still share the problems mentioned in Section 2.1, the idea of focusing on the major features is still worth considering for increasing the accuracy of real-time trackers.

Real-time Tracking with Correlation Filter
Correlation Filters (CF) have garnered much attention in UAV-based tracking due to their adaptability, efficiency, and relatively high resilience against background occlusion. One key highlight of correlation filters is that, by applying the Discrete Fourier Transform, they turn cyclic correlation (otherwise computed by convolution) into element-wise multiplication. This significantly reduces computational complexity, allowing CF-based trackers to reach over 24 frames per second (FPS) on a single CPU, meeting real-time requirements for UAV-based tracking.
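As a minimal illustration of the identity above (not the authors' implementation), the sketch below checks in one dimension that cyclic correlation computed directly in the spatial domain matches the result of multiplying one spectrum by the complex conjugate of the other and transforming back. The function names are hypothetical, and the naive O(n²) DFT is used only to keep the example dependency-free; a practical tracker would call an FFT library instead.

```python
import cmath


def dft(values, inverse=False):
    """Naive O(n^2) discrete Fourier transform (stand-in for an FFT routine)."""
    n = len(values)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = 0j
        for t in range(n):
            acc += values[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n)
        out.append(acc / n if inverse else acc)
    return out


def circular_correlation_direct(x, h):
    """Spatial-domain cyclic correlation: g[s] = sum_t x[(t + s) mod n] * h[t]."""
    n = len(x)
    return [sum(x[(t + s) % n] * h[t] for t in range(n)) for s in range(n)]


def circular_correlation_fourier(x, h):
    """Same quantity via element-wise multiplication F * conj(H) in the frequency domain."""
    f_spec, h_spec = dft(x), dft(h)
    g_spec = [f * hk.conjugate() for f, hk in zip(f_spec, h_spec)]
    return [value.real for value in dft(g_spec, inverse=True)]


# Hypothetical 1D "template" h and a copy of it shifted right by two samples.
h = [1.0, 2.0, 3.0, 4.0, 0.0, 0.0, 0.0]
x = [0.0, 0.0, 1.0, 2.0, 3.0, 4.0, 0.0]

direct = circular_correlation_direct(x, h)
fourier = circular_correlation_fourier(x, h)
peak_shift = direct.index(max(direct))  # location of the maximum response
```

The position of the maximum response recovers the shift between the template and the signal, which is the property correlation-filter trackers exploit to localize a target.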
MOSSE (Bolme et al., 2010) was the first to use CF in object tracking, introducing a minimum squared error regularization method that produced a robust and stable CF tracker. CSK (Henriques et al., 2012) introduced a circulant matrix to compute cyclic correlation. KCF (Henriques et al., 2015) summarized CSK and reformulated the CF-tracking algorithm as a Kernelized Correlation Filter, with complexity equivalent to linear algorithms. CCOT (Danelljan et al., 2016) introduced implicit interpolation to integrate multi-resolution deep feature maps. However, the key formula in CCOT incurred high computation costs. Additionally, training continuous filters would introduce numerous optimized parameters, leading to overfitting. ECO (Danelljan et al., 2017) was proposed to address these issues. It introduced a factorized convolution operator to build the filter in CCOT, while a compact generative model of the training sample distribution decreased computational costs. MCPF (Zhang et al., 2017) incorporates a particle filter and a Multi-task Correlation Filter to handle large-scale variation. CSR-DCF (Lukežič et al., 2018) managed to learn accurate features of irregular objects by introducing a spatial confidence map and a channel-wise confidence score. Li et al. (2020b) proposed a discriminative correlation filter (DCF) with a memory queue to preserve keyframes' information, enabling robust long-term tracking. Considering the reversibility of motion, BiCF (Lin et al., 2020) adds bidirectional incongruity terms in training to ensure the filter's consistency in forward and backward motion prediction. Also, AutoTrack (Li et al., 2020a) incorporates automatic spatial-temporal regularization by integrating local and global response maps to dynamically regulate spatial and temporal weights, ensuring adaptability across diverse sequences while maintaining computational efficiency.

The International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XLVIII-1-2024 ISPRS TC I Mid-term Symposium "Intelligent Sensing and Remote Sensing Application", 13-17 May 2024, Changsha, China
Correlation filters produce promising results in Single Object Tracking tasks, but their structure impedes their performance on the MOT task. Decoupling the correlation operation from the CF and applying it to the MOT task might yield a tracker with higher efficiency.

Methodology
In this section, a novel tracker for the real-time UAV MOT task, Target Specific Filtering Track with Memory (TSFMTrack), is presented. Its structure is illustrated in Fig. 1.
Specifically, TSFMTrack consists of two parts: object detection and tracklet matching. We apply YOLOv8 (Glenn, 2024) for detection because it covers a wide range of usage scenarios and gives better results than previous YOLO detectors. The main contributions of our work lie in the tracklet matching part, which comprises (a) a Siamese-like Target Filtering Module (TFM) that accurately learns the features and computes a similarity score, and (b) a Tracklet Matching Module (TMM) that assigns and updates the tracklets using the Hungarian algorithm.

Target Filtering Module
To achieve promising tracking quality with high efficiency, a well-designed feature extractor, the TFM, is proposed. Although Correlation Filter-based trackers achieve low inference time, they generate only one filter per frame and are thus not suitable for MOT tasks. Also, since they update their filters online, their tracking quality is not as good as that of deep-learning based trackers. Siamese network based trackers integrate template information into the search region and are suitable for processing multiple similar inputs.
The proposed TFM combines the advantages of the Siamese network and the correlation operation to precisely and effectively learn discriminative appearance features for accurate tracking. To extend CF's success to MOT, we decouple the correlation operation from the CF and take the last fragment of a tracklet as a specified filter for this object to conduct the correlation operation.
Figure 2 shows the TFM's architecture. It takes the last fragment of a tracklet from the previous frame and the bounding boxes of the current frame as inputs. After being preprocessed by a CNN, the inputs are processed by a Correlation Computing Block (CCB). Finally, another CNN is used to calculate the correlation index between tracklets and targets before calculating the similarity score between the two inputs.
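The correlation computed in the CCB can be written as
$$G = F \odot H^{*} \quad (1)$$
where $\odot$ denotes Hadamard (element-wise) multiplication, $^{*}$ denotes the complex conjugate, and $F$, $G$ and $H$ respectively denote the 2D Discrete Fourier Transform of the input image, the response map and the filter. The output is then transformed back to the spatial domain by the Inverse Discrete Fourier Transform to get the response map.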
Then the object's current region can be found by searching for the maximum response of the map. Here, instead of generating the filter by an algorithm, we directly take the tracking targets as the specified correlation filters for their tracklets and filter the detected targets to get the response map. To take the dissimilarity between filters and unmatched images (i.e. responses generated by the filter and targets that do not belong to this tracklet) into consideration, the filter itself is passed to the next procedure after going through the same filtering process. Since the input images are of different sizes, they are first resized to a fixed size $W \times W$ by the CNN before being processed by the CCB, where $W$ is set to 127 in our work since it is a prime number and is close to $2^7$, which makes it suitable for FFT computation and avoids information erosion (Wu et al., 2019) on the feature map after correlation computation.

Back Propagation Through CCB Module
Back propagation for training the TFM requires a differentiable or piecewise differentiable forward function for each part of the TFM. In this section, we prove that the equivalent function of the CCB Module is differentiable and calculate the gradient through it. This back propagation is illustrated in Figure 3.
Since the operations performed by the CCB Module are all matrix-wise without cross-channel calculation, the proof can be done for 2D tensors (i.e. matrices) and easily generalized to 3D tensors. The following matrices are all $(2n + 1) \times (2n + 1)$ matrices.
For one channel of an image, applying the Fast Fourier Transform followed by the Inverse Fast Fourier Transform in fact implements circular convolution. Let $\Omega$ represent the kernel matrix for circular convolution, $X$ the input matrix and $X(i, j)$ the $(i, j)$-th element of $X$. The circular convolution of $\Omega$ and $X$ can be written as $F = \Omega \circledast X$, where $\circledast$ denotes circular convolution. Other related values are given by Equation (2), in which $\mathbb{Z}_{2n+1}$ is the ring of integers modulo $2n+1$. Circular convolution consists of element-wise multiplications and additions, so it is easy to prove that $F(i, j)$ is differentiable with respect to $X(k, l)$.
Let $X$ and $Y$ respectively denote the tracklet and the bounding box image. The equivalent function of the CCB Module can be written as
For clearer representation and narration, we rewrite the above function as
In formulating the back propagation, we denote the bias matrices returned by the feature-extracting CNN as $(B_1, B_2)$. $B_1$ and $B_2$ correspond to the inputs $F_1$ and $F_2$, and each element is the partial derivative of the loss function $L$ with respect to $F$, i.e. $B(i, j) = \partial L / \partial F(i, j)$.
Let $\tilde{B}_1$ denote the bias matrix passed back to the CNN backbone through the CCB Module with respect to $B_1$; then $\tilde{B}_1(i, j)$ can be calculated as
where $\mathbb{Z}$ is the ring of integers.
Applying a homomorphic transformation to the subscripts of Equation 3 leaves the result unchanged,
in which $\mathrm{Rotate}_\pi(X)$ stands for rotating the matrix $X$ by $\pi$ rad. This is equal to $PXQ$, where
Equation (9) represents the circular convolution between the bias matrix $B_1$ and the rotated convolutional kernel $\Omega$, which can be efficiently computed by the CCB.
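The statement that the gradient passed back through a circular convolution is itself a circular convolution of the upstream gradient with the π-rotated kernel can be checked numerically. The sketch below is an illustration under assumptions, not the authors' code: matrices are indexed from 0 with indices taken modulo the size, so rotation by π about the origin becomes index negation modulo the size, and the loss is a hypothetical linear function chosen so that its gradient with respect to F equals the upstream matrix.

```python
def circ_conv2d(kernel, x):
    """Circular convolution: out[i][j] = sum_{k,l} kernel[k][l] * x[(i-k) % n][(j-l) % n]."""
    n = len(x)
    return [[sum(kernel[k][l] * x[(i - k) % n][(j - l) % n]
                 for k in range(n) for l in range(n))
             for j in range(n)]
            for i in range(n)]


def rotate_pi_mod(kernel):
    """Rotation by pi about index (0, 0) under modular indexing: (k, l) -> (-k, -l) mod n."""
    n = len(kernel)
    return [[kernel[(-k) % n][(-l) % n] for l in range(n)] for k in range(n)]


def loss(kernel, x, upstream):
    """Hypothetical linear loss whose gradient w.r.t. F = kernel (*) x is `upstream`."""
    f = circ_conv2d(kernel, x)
    n = len(x)
    return sum(upstream[i][j] * f[i][j] for i in range(n) for j in range(n))


def numeric_grad(kernel, x, upstream, eps=1e-6):
    """Central finite-difference estimate of dLoss/dX, one element at a time."""
    n = len(x)
    grad = [[0.0] * n for _ in range(n)]
    for p in range(n):
        for q in range(n):
            plus = [row[:] for row in x]
            minus = [row[:] for row in x]
            plus[p][q] += eps
            minus[p][q] -= eps
            grad[p][q] = (loss(kernel, plus, upstream)
                          - loss(kernel, minus, upstream)) / (2 * eps)
    return grad


# Arbitrary 3x3 example values, i.e. (2n + 1) x (2n + 1) with n = 1.
kernel = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0], [2.0, 0.5, 1.0]]
x = [[0.5, 1.0, -2.0], [3.0, 0.0, 1.5], [-1.0, 2.0, 0.25]]
upstream = [[1.0, 0.0, -1.0], [2.0, 0.5, 0.0], [0.0, 1.0, 3.0]]

# Analytic gradient: circular convolution of the upstream gradient with the rotated kernel.
analytic = circ_conv2d(rotate_pi_mod(kernel), upstream)
numeric = numeric_grad(kernel, x, upstream)
max_error = max(abs(analytic[i][j] - numeric[i][j]) for i in range(3) for j in range(3))
```

Because the backward pass has the same form as the forward pass, the same Fourier-domain block can in principle serve both directions.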

Tracklet Matching Module
Tracklet matching is also an indispensable step in object tracking. Matching the tracklets properly has a positive impact on tracking outcomes. Our tracking algorithm is built upon the recent SOTA tracker SMILETrack, which is an extension of ByteTrack. By associating almost every tracklet using a two-stage matching strategy, these two trackers outperformed their counterparts in overall performance. However, when deployed in practice on a UAV, tracking algorithms encounter challenges from unpredictable conditions, e.g. camera motion and occlusion, that affect tracking quality. To address these exterior conditions, we apply Camera Motion Compensation to accurately track objects in interfered scenarios. We also introduce a Key Frame Feature Memory Queue to bring historical views to the tracker, allowing it to adapt to appearance changes and reducing the effect of transient intense interference. We design our TMM to address the aforementioned problems and integrate it with the overall matching pipeline of SMILETrack, achieving an interference-robust and accurate tracking strategy.
Let $O$, $T$ and $S$ denote the set of objects, the current tracklet list and the object-tracklet similarity matrix, respectively. Elements in $O$ are sorted in descending order; any element $O_i$ with a detection score lower than 0.1 is considered background noise and removed before matching. Then we divide $O$ into $O_H$ and $O_L$, which respectively contain the elements with a detection score above and below the median score. The similarity score between the $i$-th object and the $j$-th tracklet, i.e. $S(i, j)$, can be calculated by
where $S_{DIoU}$ is the DIoU similarity and $S_{corr}$ is the output of the TFM given the object $O_i$ and the tracklet $T_j$ as inputs.
Using $O$, $T$ and $S$, the main flow of the TMM can be stated as:
• Stage 1. Find the matches between $O_H$ and $T$. We first predict every tracklet's new position using the Kalman filter; then the Hungarian algorithm is applied to perform linear assignment using the similarity matrix
In summary, the pseudo-code of the TMM can be stated as Algorithm 1.
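The pre-matching step described above (dropping detections scored below 0.1, then splitting the rest around the median score) can be sketched as follows. This is a minimal illustration rather than the authors' code: the `(box, score)` tuple layout and the function name are hypothetical, sorting by detection score is assumed, and assigning detections that tie with the median to the high-score group is an assumption the text does not settle.

```python
from statistics import median


def split_detections(detections, noise_threshold=0.1):
    """Filter out low-score detections and split the rest around the median score.

    `detections` is a list of (box, score) pairs (hypothetical layout).
    Returns (high, low), each sorted by descending score.
    """
    # Remove background noise: detections with a score below the threshold.
    kept = sorted(
        (d for d in detections if d[1] >= noise_threshold),
        key=lambda d: d[1],
        reverse=True,
    )
    if not kept:
        return [], []
    mid_score = median(d[1] for d in kept)
    # Assumption: a score equal to the median goes to the high-score group.
    high = [d for d in kept if d[1] >= mid_score]
    low = [d for d in kept if d[1] < mid_score]
    return high, low


# Hypothetical detections: identifiers stand in for bounding boxes.
example = [("a", 0.9), ("b", 0.05), ("c", 0.6), ("d", 0.3), ("e", 0.2)]
high, low = split_detections(example)
```

In a two-stage scheme of this kind, the high-score group would be matched against all tracklets first, and the low-score group against the tracklets left unmatched.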
Moreover, to train the two convolutional networks, we employed a variant of the DIoU metric (Zheng et al., 2019) for similarity calculation and trained the parameters of the CNNs with the L2 loss.
The measurement of similarity can be represented as:
Thus the loss function can be computed as:
In this scenario, $I$ represents the bounding box of the input object, while $J$ represents the output of tracklet matching. The similarity just calculated serves as an intermediate value approximating $K$, which is not a direct output. This introduces a disparity compared to directly matching the tracklets, making it impractical to apply perceptual loss. Additionally, the limited number of parameters in our proposed method reduces the risk of overfitting, to the extent that regularization terms such as the L1 loss may even hinder performance rather than enhance it. Generally speaking, the L2 loss works most effectively under this circumstance.
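As a point of reference, the sketch below computes the standard DIoU of Zheng et al. (IoU minus the squared center distance divided by the squared diagonal of the smallest enclosing box), together with a mean squared (L2) loss between predicted and target similarities. The `(x1, y1, x2, y2)` box format and the function names are assumptions, and the variant actually used to train TSFMTrack may differ from this standard form.

```python
def diou(box_a, box_b):
    """Standard DIoU for boxes given as (x1, y1, x2, y2): IoU - rho^2 / c^2."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Intersection over union.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    iou = inter / union if union > 0 else 0.0

    # Squared distance between the two box centers.
    rho2 = ((ax1 + ax2) / 2 - (bx1 + bx2) / 2) ** 2 \
         + ((ay1 + ay2) / 2 - (by1 + by2) / 2) ** 2

    # Squared diagonal of the smallest box enclosing both boxes.
    c2 = (max(ax2, bx2) - min(ax1, bx1)) ** 2 \
       + (max(ay2, by2) - min(ay1, by1)) ** 2

    return iou - rho2 / c2 if c2 > 0 else iou


def l2_loss(predicted, target):
    """Mean squared error between two equally long lists of similarity values."""
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(predicted)


overlap = diou((0.0, 0.0, 2.0, 2.0), (1.0, 1.0, 3.0, 3.0))   # partially overlapping
disjoint = diou((0.0, 0.0, 1.0, 1.0), (2.0, 0.0, 3.0, 1.0))  # no overlap
```

Unlike plain IoU, the value stays informative (negative, and lower the farther apart the centers are) when the boxes do not overlap.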

Camera Motion Compensation
Deep-learning based tracking-by-detection trackers rely heavily on image quality, which is influenced by camera motion in real-life scenarios. In a dynamic camera situation such as a UAV, this phenomenon is prevalent and can result in more ID switches or false negatives.
The principle of CMC is visualized in Figure 4. We apply CMC to correct the Kalman state following the formula below:
where $x_k$ and $x'_k$, and $P_k$ and $P'_k$, respectively denote the KF's predicted state vector and covariance matrix before and after the CMC operation. It is worth pointing out that the CMC update is done before the Kalman extrapolation step, so that the prediction starts from the CMC-corrected states, which prevents error accumulation.
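One common form of this correction, used for example in BoT-SORT, applies the estimated affine camera motion to the state and covariance as x' = M̃x + t̃ and P' = M̃PM̃ᵀ, where M̃ is a block-diagonal lift of the 2×2 scaled rotation matrix. The sketch below assumes that form and a simplified four-dimensional state (cx, cy, vx, vy); the state layout, helper names and example values are hypothetical and may not match the formula used in TSFMTrack.

```python
def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def cmc_correct(x, p, m, t):
    """Apply an affine camera-motion correction to a Kalman state.

    x: state [cx, cy, vx, vy]; p: 4x4 covariance;
    m: 2x2 scaled rotation matrix; t: translation [tx, ty].
    """
    # Block-diagonal lift: M acts on the position block and on the velocity block.
    m_tilde = [
        [m[0][0], m[0][1], 0.0, 0.0],
        [m[1][0], m[1][1], 0.0, 0.0],
        [0.0, 0.0, m[0][0], m[0][1]],
        [0.0, 0.0, m[1][0], m[1][1]],
    ]
    # The translation only shifts the position components.
    t_tilde = [t[0], t[1], 0.0, 0.0]

    x_new = [sum(m_tilde[i][k] * x[k] for k in range(4)) + t_tilde[i]
             for i in range(4)]
    p_new = matmul(matmul(m_tilde, p), transpose(m_tilde))
    return x_new, p_new


identity2 = [[1.0, 0.0], [0.0, 1.0]]
identity4 = [[1.0 if i == j else 0.0 for j in range(4)] for i in range(4)]

# A pure camera translation shifts the position and leaves the covariance unchanged.
shifted_x, shifted_p = cmc_correct([10.0, 20.0, 1.0, 0.0], identity4, identity2, [5.0, -3.0])
# A pure scaling by 2 doubles the state and quadruples the covariance.
scaled_x, scaled_p = cmc_correct([1.0, 1.0, 1.0, 1.0], identity4,
                                 [[2.0, 0.0], [0.0, 2.0]], [0.0, 0.0])
```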

Key Frame Feature Memory Queue
Although the interference from the external environment to the image at a given moment is random, the central limit theorem implies that the sum of many interferences in a system approximately follows a normal distribution, so the overall interference to the image can be approximated as fluctuating within a small range over a long time span. If the tracker can gain temporal information while tracking, it should therefore be able to suppress and smooth the impact of noise.
To give access to such information, we introduce the Key Frame Feature Memory Queue (KFF Memory Queue), making our tracker more temporally aware. A memory queue of length $N$ is maintained for every active tracklet, storing fragments belonging to this tracklet. In motion prediction, the frames do not contribute equally: only a small number of frames is needed to define and illustrate a given smooth motion, and these frames are known as key frames. The memory queue identifies such frames and stores them. In practice, key frames are mostly chosen as the starting and ending frames of a transition.
Since the motions of tracked objects are often complex, it is necessary to add an additional frame between the starting and ending points of a motion. Also, following the Kalman Filter's prior hypothesis, motion can be seen as linear (or smooth) over a relatively long span of time, e.g. 1 second. Based on the above, we set $N$ to 6, so that the queue can hold up to 2 seconds of key frame features.
Enqueue and Dequeue. A frame is enqueued only when it is considered a key frame, i.e. when it is the turning point between two different motions, shows a rapid change of appearance, etc. These criteria can be simplified as a low similarity score: if a newly added frame in a tracklet has a similarity score lower than a threshold $\tau$, i.e. $S(i, j) < \tau$, it is added to the memory queue of this tracklet. It is also worth pointing out that the enqueuing operation is only performed at stage 1 of the TMM. If a queue is not updated for a given period of time $t_0$, it is considered "inactive" and removed.
Key frame in TFM. Once a fragment is enqueued, it is used to update the filter in the CCB. Let $f_k$ be the $k$-th fragment in the queue ($f_1$ is the first enqueued fragment) and $X$ be the tracking target; then a target specific temporal filter can be calculated by
where $\oplus$ is element-wise addition. Generally speaking, earlier enqueued frames are less similar to the current appearance than later ones, so a decaying factor needs to be introduced. Likewise, the effect of a given interference on the current frame roughly decays relative to its initial intensity. For computational convenience, both decaying factors are implemented by multiplying the queue fragments $\{f_i\}$ by an exponentially increasing sequence $\{\lambda_i\}$. Without this Memory Queue, the TMM cannot utilize temporal information to tackle exterior interference.
For straightforward understanding, the pseudo-code of the KFF Memory Queue can be written as Algorithm 2.
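The enqueue rule and the weighted combination described above can be sketched as follows. This is an illustration under stated assumptions rather than the authors' Algorithm 2: fragments are represented as flat lists of floats, the class and method names are hypothetical, the values of `tau`, `max_idle` (standing in for $t_0$, counted here in updates) and `decay` are placeholders, and the filter is taken to be the target plus the element-wise weighted sum of the queued fragments with weights that grow exponentially from oldest to newest.

```python
from collections import deque


class KeyFrameMemoryQueue:
    """Per-tracklet queue of key-frame fragments (sketch with assumed parameters)."""

    def __init__(self, maxlen=6, tau=0.5, max_idle=30, decay=0.5):
        self.fragments = deque(maxlen=maxlen)  # N = 6 in the paper; oldest is dropped first
        self.tau = tau            # similarity threshold below which a frame is a key frame
        self.max_idle = max_idle  # updates without an enqueue before the queue is inactive
        self.decay = decay        # base of the exponential weights (assumed value)
        self.idle = 0

    def update(self, fragment, similarity):
        """Enqueue the fragment if its similarity is below tau; return True if enqueued."""
        if similarity < self.tau:
            self.fragments.append(fragment)
            self.idle = 0
            return True
        self.idle += 1
        return False

    def is_inactive(self):
        return self.idle > self.max_idle

    def temporal_filter(self, target):
        """Target plus the element-wise weighted sum of the queued fragments.

        Weights are decay**n for the oldest fragment up to decay**1 for the newest,
        i.e. exponentially increasing along the queue.
        """
        n = len(self.fragments)
        out = list(target)
        for i, fragment in enumerate(self.fragments):
            weight = self.decay ** (n - i)
            out = [o + weight * f for o, f in zip(out, fragment)]
        return out


queue = KeyFrameMemoryQueue()
first = queue.update([1.0, 1.0], similarity=0.9)   # similar enough: not a key frame
second = queue.update([2.0, 0.0], similarity=0.2)  # key frame
third = queue.update([0.0, 4.0], similarity=0.1)   # key frame
combined = queue.temporal_filter([1.0, 1.0])
```

Using a bounded `deque` means the oldest key frame is discarded automatically once the queue holds N fragments.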

Implementation Details
We implemented all the experiments using PyTorch and deployed TSFMTrack on the Lynchip KA200 brain-inspired computing chip to conduct comprehensive inference testing on UAV remote sensing videos. In terms of datasets, our experiments were conducted on the MOT17, MOT20 and UAVDT benchmarks. Metrics such as MOTA (Bernardin and Stiefelhagen, 2008a), IDF1 (Ristani et al., 2016) and HOTA (Luiten et al., 2020) are employed in our experiments, highlighting identity matching. Our detector was initialized on the MOT datasets and fine-tuned on the UAVDT dataset. To optimize performance, data augmentation and an SGD optimizer with cosine annealing were applied. Our TMM module is designed to manage tracklets, and we assessed its key parameters in an ablation study.
MOT20 (Dendorfer et al., 2020) provides data under challenging circumstances, ranging from occlusions and crowded scenes to diverse motion patterns.
Unmanned Aerial Vehicle Detection and Tracking (UAVDT) dataset (Du et al., 2018) is composed of video sequences captured from cameras onboard UAVs, of which the objects are clearly annotated.Designed for aerial surveillance, the dataset includes video clips with scale variations, cluttered backgrounds and rapid motion dynamics, which helps to assess the accuracy and robustness of our proposed algorithm and conform to our application background.
MOTA is computed from the raw counts of misses, false positives and mismatches:
$$\mathrm{MOTA} = 1 - \frac{FN + FP + IDSW}{GT}$$
in which $IDSW$, $FP$, $FN$ and $GT$ represent the numbers of association errors, false positives, false negatives and ground truth objects, respectively.
The identification metric IDF1 (Ristani et al., 2016) can be computed as
$$\mathrm{IDF1} = \frac{2\,IDTP}{2\,IDTP + IDFP + IDFN}$$
where $IDTP$, $IDFP$ and $IDFN$ respectively stand for the numbers of identification true positives, false positives and false negatives.
For HOTA (Luiten et al., 2020), $AC$ denotes the alignment measurement score, and $TP$, $FP$ and $FN$ stand for the sample sets of true positives, false positives and false negatives.
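For concreteness, the two count-based metrics can be evaluated with the short sketch below, which uses the standard CLEAR-MOT and identity-metric definitions. The function names and the example counts are hypothetical and are not results from this paper.

```python
def mota(fn, fp, idsw, gt):
    """MOTA = 1 - (FN + FP + IDSW) / GT, with all arguments given as total counts."""
    return 1.0 - (fn + fp + idsw) / gt


def idf1(idtp, idfp, idfn):
    """IDF1 = 2 * IDTP / (2 * IDTP + IDFP + IDFN)."""
    return 2.0 * idtp / (2.0 * idtp + idfp + idfn)


# Hypothetical counts for a sequence with 1000 ground-truth boxes.
example_mota = mota(fn=100, fp=50, idsw=10, gt=1000)
example_idf1 = idf1(idtp=800, idfp=100, idfn=100)
```

Note that MOTA can become negative when the total number of errors exceeds the number of ground-truth objects, whereas IDF1 always lies between 0 and 1.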

Ablation Studies
The experimental evaluation of TSFMTrack aimed to assess its performance in real-time moving target tracking tasks on UAV platforms. Using benchmark datasets including MOT20, MOT17, and UAVDT, comprehensive analyses were conducted to evaluate TSFMTrack's accuracy and efficiency.
The results in Table 1 show TSFMTrack's performance across the benchmark datasets. Notably, TSFMTrack achieved a MOTA (Multiple Object Tracking Accuracy) of 73.3% on the MOT20 dataset, demonstrating its robustness in tracking moving targets with high accuracy. Similarly, on the MOT17 dataset, TSFMTrack achieved a MOTA of 68.5%, showing competitive performance across different datasets.
Furthermore, an ablation study was conducted to analyze TSFMTrack's performance under various conditions. Table 2 presents the results of the ablation study conducted on the MOT20 dataset. TSFMTrack exhibited stable performance across different configurations, maintaining its effectiveness in real-time MTT tasks.
In-depth analyses were conducted to further understand TSFMTrack's performance characteristics. The similarity analysis revealed that TSFMTrack leverages Intersection over Union (IoU) and Re-identification (Re-ID) metrics for association. The results indicated that IoU performed better in terms of MOTA and identity preservation (IDF1) for the first association stage, while Re-ID yielded higher IDF1 scores. Incorporating IoU as the similarity metric for both association stages resulted in improved overall performance.
Additionally, TSFMTrack was compared with several mainstream state-of-the-art trackers on both the validation set and the test set of the MOT17 dataset. The comparison encompassed metrics such as MOTA, IDF1, HOTA (Higher Order Tracking Accuracy), false negatives (FN), false positives (FP), and identity switches (IDs).
Lastly, an analysis of low-score detection boxes was conducted to assess TSFMTrack's performance in handling challenging scenarios. TSFMTrack effectively recovered true objects and minimized false associations, leading to improved overall performance metrics such as MOTA and IDF1.

Benchmark Evaluation
We compared TSFMTrack with mainstream state-of-the-art trackers on the validation set and the test set.
MOT17. We evaluated the performance of TSFMTrack on the MOT17 dataset, which is a widely used benchmark for multiple object tracking.
UAVDT. We conducted a benchmark evaluation on the UAVDT dataset to evaluate TSFMTrack's performance in aerial tracking scenarios. Although specific quantitative results are not provided here, qualitative assessment indicated that TSFMTrack performed well in tracking objects from aerial viewpoints, demonstrating its applicability in unmanned aerial vehicle (UAV) applications.
The results demonstrate the competent performance of TSFMTrack. With equivalent accuracy on the datasets, the efficiency of TSFMTrack is much higher, indicating that it is well matched to the task of real-time moving target tracking on UAV platforms.
Table 3 summarizes the key performance metrics obtained from the experimental evaluation of TSFMTrack on the aforementioned datasets. Notably, TSFMTrack demonstrates competitive accuracy metrics across all datasets, including MOT17, MOT20 and UAVDT. Specifically, on the MOT20 dataset, TSFMTrack achieved a MOTA of 75.3%, an IDF1 of 78.2%, and a HOTA of 64.0%, indicating its robustness in tracking moving targets with high accuracy.
The results indicate that TSFMTrack maintains high tracking accuracy while processing efficiently, as evidenced by its competitive FPS (Frames Per Second) values. This efficiency is particularly advantageous for real-time applications on UAV platforms, where timely and accurate tracking of moving targets is paramount. Additionally, TSFMTrack's performance remains consistent across different datasets and scenarios, demonstrating its versatility and reliability in various real-world environments.
In conclusion, the experimental evaluation of TSFMTrack reaffirms its competence as a real-time moving target tracking algorithm for UAV platforms. Its combination of accuracy, efficiency, and versatility makes it a promising solution for a wide range of applications, including surveillance, reconnaissance, and disaster management. Future research may focus on further optimizing TSFMTrack's performance and extending its applicability to other domains within the UAV ecosystem.

Future Work
Since the proposed method is a Tracking-by-Detection tracker, it may reach an optimum during training that hampers further optimization. Also, as the tracking environment on UAVs is complex and uncertain, our future work will look for approaches to combine detection and tracking, or further optimize our tracker on UAVs based on feedback from practical use.

Conclusion
In this paper, we propose the Tracklet Filtering Module (TFM), a Siamese-like correlation network that effectively learns object appearance features for multiple-object tracking. We also introduce the Tracklet Matching Module (TMM) for bounding box association in each frame. The experimental results on two MOT datasets (MOT17 and MOT20) and the UAV tracking dataset (UAVDT) demonstrate that the proposed tracker, Target Specific Filtering Track with Memory (TSFMTrack), achieves promising performance in terms of MOTA, IDF1, IDs, and FPS. Besides, the proposed method is deployed on an actual UAV platform and proved to be suitable for real-time tasks.

Figure 5. KFF Memory Queue + CCB structure. The enqueued frames are used to update filters.

Figure 6. Samples for tracking results.

Correlation Computing Block
Use $\odot$ to denote Hadamard multiplication and $^{*}$ to indicate the complex conjugate; then what a Correlation Filter does can be generalized as computing the following formula:

$S_H$. The unmatched objects of $O_H$ and the unmatched tracklets of $T$ are then placed in $O_{H}^{Remain}$ and $T_{H}^{Remain}$. Match the objects in $O_L$ and $T_{H}^{Remain}$. The unmatched objects $O_{Remain}$ and tracklets $T_{Remain}$ would pass a gate function before further operation.

Input: Set of objects: $O$, Set of tracklets of the last frame: $T$, Set of Memory Queue for each tracklet: $Q$, Set of posterior state estimate vectors and covariance matrices: $\{(\vec{x}_i, P_i)\}$, Scaled rotation matrix $M_t$ and translation parameter $\vec{t}$.
Output: Updated $T$, $Q$ and $\{(\vec{x}_i, P_i)\}$.
$T_i$ in $T_{L}^{remain}$ do $T_i$.keep ← $T_i$.keep + 1; if $T_i$.keep > 30 then delete $T_i$
$T \leftarrow T \cup \{(O_i, T_{new})\}$ // New tracklet; else delete $O_i$; $T_{L}^{matched}$.keep ← 0; for

(12) in which $\rho^2(B_1, B_2)$ stands for the distance of the central points of $B_1$ and $B_2$, $c$ is the diagonal length of the smallest enclosing box covering the two boxes, and IoU is calculated with

Table 1. Benchmark evaluation experiment results of TSFMTrack