Abstract:
High-resolution remote sensing images suffer from semantic information attenuation and multi-scale feature conflicts. To address these issues, this study proposed a semantic segmentation network that integrates multi-dimensional collaborative optimization with cross-level dynamic feature compensation. Specifically, a multi-dimensional collaborative optimization head (MCOH) module was introduced to the network. Using a three-branch collaborative optimization mechanism that involved channel attention, spatial attention, and multi-scale deformable convolution, the MCOH module can suppress shallow feature noise and enhance deep semantic consistency, thereby achieving a trade-off between detail preservation and context modeling. Subsequently, a cross-level dynamic semantic compensation (CDSC) module was incorporated, which can explicitly quantify the semantic similarity of identical objects by constructing a cross-correlation matrix between low-level detail features and high-level semantic features. Furthermore, using a dynamic modulation coefficient, the CDSC module can directionally enhance the feature representation of details via residual connections, effectively mitigating the semantic information loss in a deep network. By integrating the above modules and leveraging the advantages of the convolutional neural network (CNN) and Transformer in terms of local detail extraction and global semantic modeling, respectively, the network model proposed in this study delivered outstanding performance on mainstream international high-resolution remote sensing datasets. The model exhibited mean intersection over union (mIoU) values of 82.49%, 85.10%, and 52.15% on the Vaihingen, Potsdam, and LoveDA datasets, respectively released by the International Society for Photogrammetry and Remote Sensing (ISPRS). Compared to many prevalent models, the proposed model significantly increased segmentation accuracy. Meanwhile, the proposed model outperformed most of its counterparts in terms of memory usage while maintaining parameters and computational complexity within an acceptable level, validating its balance between accuracy and efficiency. The proposed network model in this study serves as a high-performance and practical technical solution for the accurate interpretation of high-resolution remote sensing images in real-world applications such as road extraction, hazard monitoring, and land cover classification.