Research and application of land illegal behavior monitoring based on video image recognition

China places increasingly strict requirements on land management, especially for arable land, its most precious resource, and strict farmland protection policies are being implemented. In megacities, enforcement based on satellite imagery is limited in timeliness and accuracy. To meet the requirements of refined management, high-altitude cameras were deployed to form a near-ground monitoring network covering the cultivated land area. Multi-temporal image groups were obtained through video frame extraction, and a sample library of typical natural-resource monitoring behaviors was established. Ground objects closely related to illegal activities, such as construction machinery and bulldozed areas, were identified with deep learning algorithms. Bidirectional conversion between video image positions and geographic coordinates was used to fuse spatial information with the video monitoring results. In a test over a 20,000-acre permanent basic farmland area, the recognition rate of behaviors related to illegal land use was about 81%. By combining approval information with spatial analysis of the identified patches, warning information can be pushed in real time. The application can effectively improve both the timeliness and the level of supervision.


Introduction
The protection of cultivated land bears on national food security, ecological security, and social stability. China has long attached great importance to protecting cultivated land resources and has resolutely stopped the "non-agricultural" and "non-grain" conversion of cultivated land. To coordinate development with security and with protection, consolidate the foundation of food security in an all-round way, and firmly guard the red line of 1.8 billion mu of cultivated land, high-quality land law enforcement is emphasized as a support for high-quality social development. As a megacity with limited land resources and a continuously growing population, Shanghai needs to adhere strictly to the red line of farmland protection. At the same time, Shanghai's "detection and law enforcement" work requirements emphasize the timeliness of monitoring land violations, so the level of monitoring and detection needs to be improved.
In recent years, remote sensing and GIS technologies have improved land law enforcement supervision, and research combining remote sensing with target image recognition has continued to develop. Among machine learning approaches, methods that learn from labeled samples have clear advantages; neural networks (NN) (Zhang B, 2018; Liu Y G and You Z, 2003; Tian Z Z et al, 2016), support vector machines (SVM) (Guo M W et al, 2014; Chen L and Qing Q Q, 2006), random forests (RF) (Xiang T et al, 2016), and other algorithms have been applied. Models trained on samples achieve high recognition accuracy. In remote sensing, deep learning achieves better image feature extraction for classification and recognition (Wang B and Fan D L, 2019). Convolutional neural networks (CNN) reduce the complexity of the network model and perform well in image processing. Accordingly, studies have used DCNN to identify water bodies (Zhou W X, 2023), semi-supervised classification of hyperspectral remote sensing images for accurate classification (Mao Y L, 2023), feature cascades of heterogeneous convolutional neural networks (HCNN) on high-resolution images, and improved fully convolutional network models for land cover classification (Du J, 2017; Yang X, 2024). Research on remote sensing image recognition is mainly applied to scene classification, target monitoring, and image retrieval (Tian Z H et al, 2023; Heng X B et al, 2023; Tian Q C et al, 2023), which makes it better suited to large-scale monitoring. In the megacity of Shanghai, the urban area is large and cultivated land is relatively small and scattered, so law enforcement must be fine-grained and effective, and a single regulatory approach cannot meet current needs. With the development of digital twin city construction and artificial intelligence (AI) technology, the ability to recognize and interpret small image targets has improved, and AI has been widely used in urban management, public security, transportation, and other fields for extracting information from camera footage, character recognition, and behavior analysis and recognition (Ye P P, 2023; She Z M, 2023; Hu M, 2023; Zhang X Z, 2022; Xue F C, 2018).
In this paper, cameras are introduced as monitoring sensors for key ground areas, deep learning is used for training and testing, and the Faster Regions with CNN features (Faster R-CNN) algorithm is selected. The algorithm detects targets with incomplete image features well (Bao N S et al, 2023), which suits the automatic identification of land violations and its application in a cultivated land protection system. Video image recognition, as a supplementary source of clues on violations alongside remote sensing image recognition, improves the monitoring level of cultivated land protection.

Identifying targets
The main purpose of illegal behavior monitoring is to detect abnormal changes in cultivated land as early as possible. Specific traces of human intervention often appear before the cropland itself changes. These traces can be subdivided into specific scenes and recognizable objects, which can be identified automatically by object detection in video images.
After analysis, illegal behavior can be broadly classified into two categories. In the first, the land use of an area of cultivated land changes, which usually requires mechanical equipment to operate or a temporary construction site to be set up nearby. In the second, the land use attribute does not change, but stockpiled materials occupy part of the area, which reduces the actual cultivated area. On this basis, three categories of identification scenarios and 13 specific targets are proposed, as shown in Table 1. The construction (start-up) early warning scenario identifies machinery and equipment, the construction site early warning scenario identifies the characteristic features of a construction site, and the groundbreaking early warning scenario covers the common forms of occupation of cultivated land, as shown in Figure 1. Non-cultivated use within a cultivated parcel reduces the actual cultivated area.

Sample library production
After the early warning scenarios are classified, illegal land behaviors are refined into corresponding identification targets; this makes clear which samples to collect, and the collected samples can then be built into a sample database. In deep learning, producing the sample library is an important and complex task. Effective samples improve the performance and generalization ability of the model, so that it captures the target features more reliably.
Samples need to have obvious features, such as typical color, texture, and shape.
According to the identification requirements of the three types of early warning scenarios, the mechanical equipment in the construction early warning scenario is a more typical identification target, and its texture is basically fixed. The samples for the construction site early warning and groundbreaking early warning scenes are complex and diverse, and recognition features such as texture and color show little regularity, so the type and number of samples should be enriched during selection.

The accuracy of sample selection affects the training results.
There are several methods for generating samples, including manual annotation, data augmentation, and synthetic data.
Manual annotation is the most common and most accurate method: experts label the samples in a dataset by hand to produce the input data and the corresponding labels, but this requires a lot of time and human resources. Data augmentation processes the raw data through rotation, zooming, panning, and similar operations, which increases sample diversity, but excessive augmentation may introduce unnecessary noise and degrade the performance of the image model. Synthetic data methods augment datasets by creating images with simulation or computer-graphics generation algorithms. Synthetic data is controllable and flexible, but its quality and realism can affect the performance of the model in real-world scenarios.
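As an illustration of the geometric augmentation operations mentioned above (rotation, zooming, panning), the following minimal sketch shows how a labeled bounding box would have to be transformed together with its image so that the annotation stays valid. The box format, the function names, and the clipping rule are assumptions made for this example; they are not taken from the authors' pipeline.

```python
from typing import Optional, Tuple

# A box is (x_min, y_min, x_max, y_max) in pixel coordinates, origin at top-left.
# This format is an assumption made for the example.
Box = Tuple[float, float, float, float]


def rotate90_clockwise(box: Box, img_h: float) -> Box:
    """Rotate a box with its image by 90 degrees clockwise.

    A point (x, y) moves to (img_h - y, x); the rotated image is img_h wide.
    """
    x0, y0, x1, y1 = box
    return (img_h - y1, x0, img_h - y0, x1)


def zoom(box: Box, factor: float) -> Box:
    """Scale a box together with an image resized by `factor`."""
    x0, y0, x1, y1 = box
    return (x0 * factor, y0 * factor, x1 * factor, y1 * factor)


def pan(box: Box, dx: float, dy: float,
        img_w: float, img_h: float) -> Optional[Box]:
    """Shift a box by (dx, dy) and clip it to the frame.

    Returns None when the shifted box falls entirely outside the image,
    in which case the label should be dropped from the augmented sample.
    """
    x0, y0, x1, y1 = box
    x0, x1 = max(0.0, x0 + dx), min(img_w, x1 + dx)
    y0, y1 = max(0.0, y0 + dy), min(img_h, y1 + dy)
    if x0 >= x1 or y0 >= y1:
        return None
    return (x0, y0, x1, y1)
```

Returning None for boxes that leave the frame is one way to avoid training on labels that no longer correspond to visible pixels, which is one source of the augmentation noise described above.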
To select accurate and reliable samples, the image recognition results accumulated over many years were used, and the verified patches were screened and extracted. Because the Shanghai patches were extracted from satellite imagery with a resolution better than 0.5 m, their outlines follow the target objects closely, so the patch boundaries contain too many coordinate points, which hinders later model computation and recognition; the geometries therefore need to be simplified first. To ease computation, the circumscribed (bounding) rectangle of each patch was taken as the sample recognition box, and the batch-optimized patches were used as samples.
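The step of replacing a densely digitized patch outline with its external rectangle can be sketched as follows. This is a minimal example that assumes an axis-aligned rectangle and a patch given as a list of (x, y) vertices; the function name and the sample coordinates are hypothetical.

```python
from typing import Iterable, Tuple

Point = Tuple[float, float]


def bounding_rectangle(vertices: Iterable[Point]) -> Tuple[float, float, float, float]:
    """Return (x_min, y_min, x_max, y_max) enclosing all patch vertices."""
    pts = list(vertices)
    if not pts:
        raise ValueError("patch has no vertices")
    xs = [x for x, _ in pts]
    ys = [y for _, y in pts]
    return (min(xs), min(ys), max(xs), max(ys))


# Hypothetical patch outline; real patches would contain many more vertices.
patch = [(2.0, 1.0), (5.0, 3.0), (4.0, 7.0), (1.0, 4.0)]
box = bounding_rectangle(patch)
```

However many boundary points the patch has, the resulting label is always four numbers, which is the form that box-based detectors expect.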
At the same time, to account for the viewing angle of high-altitude cameras, multi-angle images from aviation, satellite, and drone cameras were selected, and some scene pictures generated automatically by AI models were added (Figure 2).
The main idea of the Faster R-CNN model is to divide the object detection task into two stages: candidate region generation by the region proposal network (RPN), followed by target classification and bounding box regression (Figure 3). First, candidate regions are generated by the RPN. The RPN is a fully convolutional network that slides a fixed-size window over the input feature map and outputs multiple proposals for each window position. Specifically, the RPN predicts k anchor boxes at each window position and uses their sizes and positions to generate candidate boxes. It then screens each candidate box by binary classification, which separates foreground from background, and by regression, which adjusts the size and position of the box, to obtain the final candidate regions. Second, after the candidate regions are screened, Faster R-CNN uses the RoI pooling layer to extract the features of each candidate region. RoI pooling maps candidate regions of different sizes onto fixed-size feature maps, so that target classification and bounding box regression operate on fixed-size features. The targets are then classified by the fully connected layers and a softmax classifier, the bounding box is fine-tuned by the regressor, and the final output is the category of each proposed region and its exact position in the image. The main advantages of Faster R-CNN are end-to-end training, shared convolutional features, and accurate position and size regression; detection is therefore fast and accurate, and the model is easy to implement and train, which has made it one of the mainstream algorithms in object detection.
ResNet50 is a deep network that performs well as a feature extraction network (Li D Z et al, 2018; Li X J et al, 2024; Zhou S H et al, 2024). It helps refine abstract, deeper target features and is better suited to identifying targets with less distinct texture features in the construction site early warning and groundbreaking early warning scenarios. Considering running time and the risk of overfitting, the number of epochs should not be too large. The learning rate is set mainly by experience: too large a value makes the parameters oscillate around the minimum, and too small a value slows convergence; 0.001 was selected for testing as a commonly used value. A key feature of Faster R-CNN is that the RPN generates anchors, which improves detection speed. Because the targets in the groundbreaking early warning scenario, such as indiscriminate occupation and hardening, vary in size, the anchor sizes are set to (32², 64², 128², 256², 512²) so that small targets can also be identified.
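To make the anchor setting concrete, the sketch below generates the anchor boxes centred on a single window position for the five anchor areas from 32² to 512² pixels. The aspect ratios (0.5, 1.0, 2.0) are an assumption borrowed from the commonly used Faster R-CNN defaults; the text does not state which ratios were used, and the function name is hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def make_anchors(cx: float, cy: float,
                 sides: Sequence[int] = (32, 64, 128, 256, 512),
                 ratios: Sequence[float] = (0.5, 1.0, 2.0)) -> List[Box]:
    """Generate anchors centred at (cx, cy).

    Each anchor has area side**2 and height/width equal to one of `ratios`.
    The ratios are an assumed default, not a value given in the paper.
    """
    anchors: List[Box] = []
    for side in sides:
        area = float(side * side)
        for ratio in ratios:
            w = math.sqrt(area / ratio)  # width such that w * h == area
            h = w * ratio                # height / width == ratio
            anchors.append((cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2))
    return anchors


anchors = make_anchors(100.0, 100.0)
```

With five sizes and three ratios, k is 15 anchors per window position. The smallest square anchor covers 32 by 32 pixels, which is what lets the RPN propose boxes for small occupied or hardened areas.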

Target recognition experiments
Based on the sample database for illegal land behavior monitoring, the Faster R-CNN model was used as the base model, and the learning rate and number of iterations were set to identify the targets of construction early warning, construction site early warning, and groundbreaking early warning. The performance of the model on the test dataset was evaluated mainly with precision, recall, and mean average precision (mAP). Precision is the ratio of the number of targets correctly detected by the model to the total number of targets it detected. Recall is the ratio of the number of targets correctly detected by the model to the total number of targets in the test pictures. Average precision is the area enclosed by the precision-recall curve and the coordinate axes, and mAP is its mean value.
$$P = \frac{TP}{TP + FP} \tag{1}$$
$$R = \frac{TP}{TP + FN} \tag{2}$$
where TP, FP, and FN are the numbers of true positives, false positives, and false negatives, respectively.
The test site covers more than 20,000 acres of basic farmland. The cameras are deployed at a height of about 20~40 m, the resolution is 200 dpi, and the test is carried out through video frame extraction. The experiments show that detection performance ranks as construction early warning > construction site early warning > groundbreaking early warning (Table 2). Construction early warning mainly identifies mechanical targets, and its results are relatively good: 90% precision, 95% recall, and an mAP of 0.85. After many training iterations, construction site early warning and groundbreaking early warning both reach a precision of 81% or higher. Construction site early warning, which mainly identifies characteristic construction site objects such as sheds, has a precision of 84%, a recall of 90%, and an mAP of 0.80; groundbreaking early warning has a precision of 81%, a recall of 88%, and an mAP of 0.75. Overall, the Faster R-CNN model detects targets effectively in all three scenarios: construction early warning, construction site early warning, and groundbreaking early warning (Figure 4). However, recognition differs between categories. Construction early warning performs best, probably because mechanical targets have more distinctive features, more stable contours and colors, and are more specific detection targets. By contrast, fences and piles of waste have no fixed outline, their texture varies with lighting, and background interference is strong, so they are recognized less well.
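The evaluation metrics can be sketched in a few lines. The example below computes precision and recall from TP, FP, and FN counts and an average precision as the area under a precision-recall curve. The all-point interpolation used here is one common convention and is an assumption; the text does not say which variant was applied, and the counts in the usage lines are illustrative, not the experimental data.

```python
from typing import Sequence


def precision(tp: int, fp: int) -> float:
    """Share of detections that are correct: TP / (TP + FP)."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """Share of ground-truth targets that are found: TP / (TP + FN)."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def average_precision(recalls: Sequence[float],
                      precisions: Sequence[float]) -> float:
    """Area under the precision-recall curve (all-point interpolation).

    `recalls` must be in ascending order and paired with `precisions`.
    """
    r = [0.0] + list(recalls) + [1.0]
    p = [0.0] + list(precisions) + [0.0]
    # Replace each precision with the best precision at any higher recall.
    for i in range(len(p) - 2, -1, -1):
        p[i] = max(p[i], p[i + 1])
    # Sum the rectangles between consecutive recall values.
    return sum((r[i + 1] - r[i]) * p[i + 1] for i in range(len(r) - 1))


# Illustrative counts only: 90 correct detections, 10 false alarms, 5 misses.
p_example = precision(90, 10)
r_example = recall(90, 5)
ap_example = average_precision([0.5, 1.0], [1.0, 0.5])
```

Averaging the per-class average precision values gives the mAP. The two ratios differ only in their denominators, which is why a model can have high recall and lower precision when it raises many false alarms.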

Cultivated land protection applications
At present, supervision of cultivated land protection relies mainly on law enforcement and manual inspection, which cannot meet the need for a timely response. By deploying high-altitude cameras where farmland is concentrated, video images of the farmland can be obtained. Video is transmitted in real time, frames are extracted as pictures at regular intervals as needed, and the trained model identifies behaviors in the construction early warning, construction site early warning, and groundbreaking early warning scenarios, so that different types of illegal behavior can be identified effectively.
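Regular frame extraction amounts to choosing which frame numbers of a video stream to decode and send to the model. The sketch below computes those indices from the frame rate and a sampling interval; the function name, the parameters, and the example values are assumptions for illustration, since the text does not give the sampling interval that was used.

```python
from typing import List


def frame_indices(total_frames: int, fps: float, interval_s: float) -> List[int]:
    """Indices of the frames to extract, one every `interval_s` seconds."""
    if fps <= 0 or interval_s <= 0:
        raise ValueError("fps and interval_s must be positive")
    step = max(1, round(fps * interval_s))  # frames between two samples
    return list(range(0, total_frames, step))


# Hypothetical clip: 250 frames at 25 frames per second, sampled every 2 seconds.
indices = frame_indices(250, 25.0, 2.0)
```

A longer interval reduces the number of images the detector has to process, at the cost of a later warning, so the interval is a trade-off between computing load and timeliness.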
Based on the model's target recognition, a real-time supervision platform has been established, as shown in Figure 5. Targets identified by the model are spatially positioned and pushed immediately. After comparison with the approval data in the business system, suspected illegal patches are sent to inspectors. After the on-site inspection, the inspectors feed the inspection findings and disposal information back to the platform, and the results are regularly added to the sample library to retrain and improve the model.
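The comparison against approval data can be pictured as a spatial test: a detection that falls inside an approved parcel is not pushed, while one that falls outside every approved parcel is flagged as suspected. The following minimal sketch uses a ray-casting point-in-polygon test. The data layout, the function names, and the rule itself are assumptions for illustration; the platform's actual business logic is not described at this level of detail.

```python
from typing import List, Sequence, Tuple

Point = Tuple[float, float]


def point_in_polygon(pt: Point, polygon: Sequence[Point]) -> bool:
    """Ray-casting test: count the edges crossed by a ray going right from pt."""
    x, y = pt
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        if (y1 > y) != (y2 > y):  # edge spans the horizontal line through pt
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def is_suspected(detection: Point, approved_parcels: List[Sequence[Point]]) -> bool:
    """Flag a detection that lies outside every approved parcel."""
    return not any(point_in_polygon(detection, p) for p in approved_parcels)


# Hypothetical approved parcel in projected coordinates (metres).
approved = [[(0.0, 0.0), (10.0, 0.0), (10.0, 10.0), (0.0, 10.0)]]
```

In this sketch `is_suspected((5.0, 5.0), approved)` is False and `is_suspected((15.0, 5.0), approved)` is True. A production system would more likely compare whole detection footprints with parcel geometries in a GIS library than test single points.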
The platform monitors the farmland area in real time and automatically identifies behaviors related to land violations with the trained Faster R-CNN model. It can quickly flag suspected illegal behavior, which provides a degree of timeliness and effectiveness and improves work efficiency.
Figure 5. Design ideas of the real-time supervision platform

Conclusion
To better protect cultivated land, this paper studies the classification of early warning scenarios for land violations and proposes three types of scenarios: construction early warning, construction site early warning, and groundbreaking early warning. For each scenario, we defined detailed recognition targets and selected the corresponding samples to build a training sample library. We then trained the Faster R-CNN model on the samples to obtain a trained model.
Through experiments, we found that the precision for all three types of scene recognition was 81% or higher, which shows a certain effectiveness; construction early warning performed best, reaching 90%. We therefore apply the model to the management of cultivated land protection and improve the existing level of land law enforcement management by establishing a real-time supervision platform.
In future work, we will further improve the sample library and model training by analyzing the inspection results for the patches after the early warnings are pushed. We will continue to optimize the model to improve the recognition rate for construction site early warning and groundbreaking early warning.

Figure 1. Schematic diagram of the scenes (a1 and a2 show the start-up (construction) early warning scenario; b1 and b2 show the construction site early warning scenario; c1 and c2 show the groundbreaking early warning scenario)

Figure 2. Multi-angle and multi-type sample images (a, b, c, and d are selected from aerial imagery, satellite imagery, video camera, and AI model, respectively)

Figure 3. Structure diagram of the Faster R-CNN model

TP = the number of positive targets identified as positives; FP = the number of negative targets identified as positives; FN = the number of positive targets identified as negatives.

Figure 4. Diagram of identifying monitoring points

Table 2. Training results