Search results

1 – 10 of 370
Article
Publication date: 19 December 2023

Jinchao Huang

Abstract

Purpose

Single-shot multi-category clothing recognition and retrieval play a crucial role in online searching and offline settlement scenarios. Existing clothing recognition methods based on RGBD clothing images often suffer from high-dimensional feature representations, leading to compromised performance and efficiency.

Design/methodology/approach

To address this issue, this paper proposes a novel method called Manifold Embedded Discriminative Feature Selection (MEDFS) to select global and local features, thereby reducing the dimensionality of the feature representation and improving performance. Specifically, by combining three global features and three local features, a low-dimensional embedding is constructed to capture the correlations between features and categories. The MEDFS method designs an optimization framework utilizing manifold mapping and sparse regularization to achieve feature selection. The optimization objective is solved using an alternating iterative strategy, ensuring convergence.
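The abstract does not spell out the optimization objective, but a common template for manifold-regularized, L2,1-sparse feature selection with an alternating solver can be sketched as follows; every name, parameter and update rule here is an illustrative assumption, not the authors' implementation.

```python
import numpy as np

def medfs_style_selection(X, Y, L, alpha=1.0, beta=0.1, n_iter=50):
    """Sketch of manifold-regularized sparse feature selection (assumed form):
    minimize ||XW - Y||_F^2 + alpha*tr(W'X'LXW) + beta*||W||_{2,1},
    where L is a graph Laplacian encoding the manifold structure.
    Solved by alternating a closed-form update of W with the standard
    reweighting matrix D that handles the non-smooth L2,1 term.
    """
    n, d = X.shape
    D = np.eye(d)
    for _ in range(n_iter):
        # W-step: regularized least squares, closed form
        A = X.T @ X + alpha * X.T @ L @ X + beta * D
        W = np.linalg.solve(A, X.T @ Y)
        # D-step: D_ii = 1 / (2 * ||w_i||_2 + eps)
        row_norms = np.linalg.norm(W, axis=1)
        D = np.diag(1.0 / (2.0 * row_norms + 1e-8))
    # Rank features by row norm of W; the strongest rows are selected
    return np.argsort(-np.linalg.norm(W, axis=1))
```

The L2,1 norm drives whole rows of W toward zero, so ranking features by row norm yields the subset of global and local features worth keeping.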

Findings

Empirical studies conducted on a publicly available RGBD clothing image dataset demonstrate that the proposed MEDFS method achieves highly competitive clothing classification performance while maintaining efficiency in clothing recognition and retrieval.

Originality/value

This paper introduces a novel approach for multi-category clothing recognition and retrieval, incorporating the selection of global and local features. The proposed method holds potential for practical applications in real-world clothing scenarios.

Details

International Journal of Intelligent Computing and Cybernetics, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 21 August 2023

Minghao Wang, Ming Cong, Yu Du, Dong Liu and Xiaojing Tian

Abstract

Purpose

The purpose of this study is to solve the problem of an unknown initial position in multi-robot raster map fusion. The method covers both two-dimensional (2D) raster maps and three-dimensional (3D) point cloud maps.

Design/methodology/approach

A fusion method combining multiple algorithms is proposed. For 2D raster maps, the method uses accelerated robust feature detection to extract feature points from the individual raster maps; feature points are then matched with a two-step algorithm combining minimum Euclidean distance and adjacent feature relations. Finally, the random sample consensus (RANSAC) algorithm is used for redundant feature fusion. Building on the 2D raster map fusion, coordinate alignment is used for 3D point cloud map fusion.
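As a rough, hedged illustration of such a pipeline in OpenCV (ORB stands in for the paper's SURF-style detector, and a single RANSAC-estimated rigid transform replaces the two-step matching plus redundant-feature fusion described above; the merge rule is a naive assumption):

```python
import cv2
import numpy as np

def fuse_raster_maps(map_a, map_b):
    """Align occupancy-grid image map_b to map_a, then merge (sketch)."""
    orb = cv2.ORB_create(nfeatures=1000)        # stand-in detector
    kp_a, des_a = orb.detectAndCompute(map_a, None)
    kp_b, des_b = orb.detectAndCompute(map_b, None)

    # Brute-force Hamming matching with cross-check as a crude
    # substitute for the two-step distance/adjacency matching
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_b, des_a)

    src = np.float32([kp_b[m.queryIdx].pt for m in matches])
    dst = np.float32([kp_a[m.trainIdx].pt for m in matches])
    # RANSAC rejects remaining mismatches while estimating the transform
    T, _ = cv2.estimateAffinePartial2D(src, dst, method=cv2.RANSAC,
                                       ransacReprojThreshold=3.0)
    aligned = cv2.warpAffine(map_b, T, (map_a.shape[1], map_a.shape[0]))
    return np.maximum(map_a, aligned)           # naive occupancy merge
```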

Findings

The algorithm was verified experimentally with both segmentation mapping (2D raster maps) and actual robot mapping (2D raster maps and 3D point cloud maps). The experiments demonstrated the stability and reliability of the proposed algorithm.

Originality/value

This algorithm uses a new visual method with coordinate alignment to process raster maps, which removes the need for a known initial relative position between robots that traditional methods require and adapts better to the fusion of 3D maps. In addition, the original map data can come from different types of robots, which greatly improves the universality of the algorithm.

Details

Robotic Intelligence and Automation, vol. 43 no. 5
Type: Research Article
ISSN: 2754-6969

Article
Publication date: 1 November 2023

Juan Yang, Zhenkun Li and Xu Du

Abstract

Purpose

Although numerous signal modalities are available for emotion recognition, audio and visual modalities are the most common and predominant forms through which human beings express their emotional states in daily communication. Therefore, achieving automatic and accurate audiovisual emotion recognition is critically important for developing engaging and empathetic human–computer interaction environments. However, two major challenges exist in the field of audiovisual emotion recognition: (1) how to effectively capture representations of each single modality and eliminate redundant features and (2) how to efficiently integrate information from these two modalities to generate discriminative representations.

Design/methodology/approach

A novel key-frame extraction-based attention fusion network (KE-AFN) is proposed for audiovisual emotion recognition. KE-AFN integrates key-frame extraction with multimodal interaction and fusion to enhance audiovisual representations and reduce redundant computation, filling the research gaps of existing approaches. Specifically, a local-maximum-based content analysis is designed to extract key-frames from videos in order to eliminate data redundancy. Two modules, a “Multi-head Attention-based Intra-modality Interaction Module” and a “Multi-head Attention-based Cross-modality Interaction Module”, are proposed to mine and capture intra- and cross-modality interactions, further reducing data redundancy and producing more powerful multimodal representations.
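As a minimal sketch of the cross-modality interaction idea, assuming standard multi-head attention with a residual connection (dimensions, names and the exact fusion design are illustrative, not KE-AFN's actual architecture):

```python
import torch
import torch.nn as nn

class CrossModalAttention(nn.Module):
    """One direction of cross-modality interaction: audio tokens attend
    over visual key-frame tokens (a symmetric twin would do the reverse)."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, audio, visual):
        fused, _ = self.attn(query=audio, key=visual, value=visual)
        return self.norm(audio + fused)      # residual keeps audio identity

# Toy usage: 8 clips, 32 audio frames and 16 visual key-frames, dim 256
audio = torch.randn(8, 32, 256)
visual = torch.randn(8, 16, 256)
print(CrossModalAttention()(audio, visual).shape)  # torch.Size([8, 32, 256])
```

An intra-modality module would be the same block with query, key and value all drawn from a single modality.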

Findings

Extensive experiments on two benchmark datasets (i.e. RAVDESS and CMU-MOSEI) demonstrate the effectiveness and rationality of KE-AFN. Specifically, (1) KE-AFN is superior to state-of-the-art baselines for audiovisual emotion recognition. (2) Exploring the supplementary and complementary information of different modalities provides more emotional clues for better emotion recognition. (3) The proposed key-frame extraction strategy improves accuracy by more than 2.79 per cent. (4) Both exploring intra- and cross-modality interactions and employing attention-based audiovisual fusion lead to better prediction performance.

Originality/value

The proposed KE-AFN can support the development of engaging and empathetic human–computer interaction environments.

Article
Publication date: 3 August 2023

Yandong Hou, Zhengbo Wu, Xinghua Ren, Kaiwen Liu and Zhengquan Chen

Abstract

Purpose

High-resolution remote sensing images possess a wealth of semantic information. However, these images often contain objects of different sizes and distributions, making the semantic segmentation task challenging. In this paper, a bidirectional feature fusion network (BFFNet) is designed to address this challenge, aiming to improve the recognition of surface objects and thereby classify special features more effectively.

Design/methodology/approach

BFFNet has two crucial elements. First, a mean-weighted module (MWM) is used to obtain the key features in the main network. Second, the proposed polarization-enhanced branch network performs feature extraction in parallel with the main network to obtain complementary feature information. The authors then fuse these two feature streams in both directions while applying a cross-entropy loss function to monitor the network training process. Finally, BFFNet is validated on two publicly available datasets, Potsdam and Vaihingen.
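The MWM's internals are not given in the abstract; the following is one plausible reading, a channel re-weighting driven by the global mean activation, together with a toy bidirectional gating fusion. All details are assumptions rather than BFFNet's actual design.

```python
import torch
import torch.nn as nn

class MeanWeightedModule(nn.Module):
    """Assumed form of the mean-weighted module (MWM): channel weights
    derived from the global mean activation, applied back to the map."""
    def __init__(self, channels):
        super().__init__()
        self.fc = nn.Sequential(nn.Linear(channels, channels), nn.Sigmoid())

    def forward(self, x):                    # x: (B, C, H, W)
        w = x.mean(dim=(2, 3))               # global mean per channel
        w = self.fc(w).unsqueeze(-1).unsqueeze(-1)
        return x * w                         # emphasize key features

def bidirectional_fuse(main_feat, branch_feat):
    """Toy two-way fusion: each stream is gated by the other."""
    m2b = branch_feat * torch.sigmoid(main_feat)
    b2m = main_feat * torch.sigmoid(branch_feat)
    return torch.cat([b2m, m2b], dim=1)

x = torch.randn(2, 64, 32, 32)
print(MeanWeightedModule(64)(x).shape)         # torch.Size([2, 64, 32, 32])
print(bidirectional_fuse(x, x.clone()).shape)  # torch.Size([2, 128, 32, 32])
```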

Findings

Quantitative analysis of the experimental results on the two datasets shows that the proposed network outperforms other mainstream segmentation networks by 2–6%. Complete ablation experiments are also conducted to demonstrate the effectiveness of the elements in the network. In summary, BFFNet has proven effective in accurately identifying small objects and in reducing the effect of shadows on the segmentation process.

Originality/value

The originality of the paper is the proposal of a BFFNet based on multi-scale and multi-attention strategies to improve the ability to accurately segment high-resolution and complex remote sensing images, especially for small objects and shadow-obscured objects.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 17 no. 1
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 12 September 2023

Wei Shi, Jing Zhang and Shaoyi He

Abstract

Purpose

With the rapid development of short videos in China, the public has become accustomed to using short videos to express their opinions. This paper aims to solve problems such as how to represent the features of different modalities and achieve effective cross-modal feature fusion when analyzing the multi-modal sentiment of Chinese short videos (CSVs).

Design/methodology/approach

This paper proposes a sentiment analysis model, MSCNN-CPL-CAFF, which uses a multi-scale convolutional neural network and a cross-attention fusion mechanism to analyze CSVs. The audio-visual and textual data of CSVs themed on “COVID-19, catering industry” are first collected from the CSV platform Douyin, and a comparative analysis is then conducted against advanced baseline models.
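A minimal sketch of the multi-scale convolution idea behind the “MSCNN” part, assuming parallel 1-D convolutions with different kernel sizes over a feature sequence; dimensions, kernel sizes and the max-pooling choice are illustrative assumptions:

```python
import torch
import torch.nn as nn

class MultiScaleCNN(nn.Module):
    """Parallel 1-D convolutions with several kernel sizes capture cues at
    different temporal granularities; pooled outputs are concatenated."""
    def __init__(self, dim=128, channels=64, kernel_sizes=(3, 4, 5)):
        super().__init__()
        self.convs = nn.ModuleList(
            nn.Conv1d(dim, channels, k, padding=k // 2) for k in kernel_sizes)

    def forward(self, x):                    # x: (B, T, dim)
        x = x.transpose(1, 2)                # Conv1d expects (B, dim, T)
        feats = [torch.relu(c(x)).max(dim=2).values for c in self.convs]
        return torch.cat(feats, dim=1)       # (B, channels * num_scales)

seq = torch.randn(4, 50, 128)                # e.g. 50 embedded text frames
print(MultiScaleCNN()(seq).shape)            # torch.Size([4, 192])
```

The cross-attention fusion stage would then combine such per-modality representations; its details are not specified in the abstract.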

Findings

Weak-negative and neutral samples are the most numerous, while positive and weak-positive samples are relatively few, accounting for only about 11% of the total. The MSCNN-CPL-CAFF model achieves Acc-2, Acc-3 and F1 scores of 85.01%, 74.16% and 84.84%, respectively, outperforming the best baseline in accuracy while achieving competitive computation speed.

Practical implications

This research offers some implications regarding the impact of COVID-19 on the catering industry in China by focusing on the multi-modal sentiment of CSVs. The methodology can be used to analyze the opinions of the general public on social media platforms and to categorize them accordingly.

Originality/value

This paper presents a novel deep-learning multimodal sentiment analysis model, which provides a new perspective for public opinion research on short video platforms.

Details

Kybernetes, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 0368-492X

Article
Publication date: 19 October 2023

Huaxiang Song

Abstract

Purpose

Classification of remote sensing images (RSI) is a challenging task in computer vision. Recently, researchers have proposed a variety of creative methods for automatic recognition of RSI, and feature fusion is a research hotspot for its great potential to boost performance. However, RSI involve unique imaging conditions and cluttered scenes with complicated backgrounds. Because they differ so much from natural images, previous feature fusion methods have delivered only insignificant performance improvements.

Design/methodology/approach

This work proposes a two-CNN (convolutional neural network) fusion method named the main and branch CNN fusion network (MBC-Net) as an improved solution for classifying RSI. In detail, MBC-Net employs an EfficientNet-B3 as its main CNN stream and an EfficientNet-B0 as a branch, named MC-B3 and BC-B0, respectively. In particular, MBC-Net includes a long-range derivation (LRD) module, specially designed to learn the dependencies between different features. Meanwhile, MBC-Net also uses several unique ideas to tackle problems arising from the two-CNN fusion and from the inherent nature of RSI.
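A toy two-stream sketch of the main/branch idea, assuming torchvision's stock EfficientNet implementations and plain concatenation at the classifier head; the LRD module and MBC-Net's actual fusion are not reproduced, and the 45-class head merely mirrors the NWPU-RESISC45 setting:

```python
import torch
import torch.nn as nn
from torchvision import models

class TwoStreamFusion(nn.Module):
    """EfficientNet-B3 as the main stream, EfficientNet-B0 as the branch;
    pooled features are concatenated and classified (sketch only)."""
    def __init__(self, n_classes=45):
        super().__init__()
        self.main = models.efficientnet_b3(weights=None).features    # 1536 ch
        self.branch = models.efficientnet_b0(weights=None).features  # 1280 ch
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.head = nn.Linear(1536 + 1280, n_classes)

    def forward(self, x):
        f_main = self.pool(self.main(x)).flatten(1)
        f_branch = self.pool(self.branch(x)).flatten(1)
        return self.head(torch.cat([f_main, f_branch], dim=1))

print(TwoStreamFusion()(torch.randn(2, 3, 224, 224)).shape)  # (2, 45)
```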

Findings

Extensive experiments on three RSI sets show that MBC-Net outperforms 38 other state-of-the-art (SOTA) methods published from 2020 to 2023, with a noticeable increase in overall accuracy (OA). MBC-Net not only delivers a 0.7% higher OA on the most challenging NWPU set but also has 62% fewer parameters than the leading approach ranked first in the literature.

Originality/value

MBC-Net is a more effective and efficient feature fusion approach than other SOTA methods in the literature. Visualizations from gradient-weighted class activation mapping (Grad-CAM) reveal that MBC-Net can learn long-range feature dependencies that a single CNN cannot. The t-distributed stochastic neighbor embedding (t-SNE) results demonstrate that MBC-Net's feature representation is more effective than that of other methods. In addition, the ablation tests indicate that MBC-Net fuses features from two CNNs both effectively and efficiently.

Details

International Journal of Intelligent Computing and Cybernetics, vol. 17 no. 1
Type: Research Article
ISSN: 1756-378X

Article
Publication date: 10 January 2024

Sara El-Ateif, Ali Idri and José Luis Fernández-Alemán

Abstract

Purpose

COVID-19 continues to spread and cause deaths. Physicians diagnose COVID-19 using not only real-time polymerase chain reaction but also the computed tomography (CT) and chest X-ray (CXR) modalities, depending on the stage of infection. However, with so many patients and so few doctors, it has become difficult to keep abreast of the disease. Deep learning models have been developed to assist in this respect, and vision transformers are currently state-of-the-art methods, but most techniques currently focus on only one modality (CXR).

Design/methodology/approach

This work aims to leverage the benefits of both CT and CXR to improve COVID-19 diagnosis. This paper studies the differences between the convolutional MobileNetV2, ViT DeiT and Swin Transformer models when trained from scratch versus pretrained on the MedNIST medical dataset rather than the ImageNet dataset of natural images. The comparison is made by reporting six performance metrics, the Scott–Knott Effect Size Difference, the Wilcoxon statistical test and the Borda Count method. The authors also use the Grad-CAM algorithm to study the models' interpretability. Finally, robustness is tested by evaluating the models on Gaussian-noised images.
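The noise-robustness check can be sketched generically as follows; the loader interface, sigma values and [0, 1] clamping are assumptions, not the paper's exact protocol:

```python
import torch

def accuracy_under_noise(model, loader, sigmas=(0.0, 0.05, 0.1), device="cpu"):
    """Re-evaluate a trained classifier on inputs perturbed with Gaussian
    noise of increasing standard deviation (names are illustrative)."""
    model.eval().to(device)
    results = {}
    with torch.no_grad():
        for sigma in sigmas:
            correct = total = 0
            for images, labels in loader:
                noisy = (images + sigma * torch.randn_like(images)).clamp(0, 1)
                preds = model(noisy.to(device)).argmax(dim=1).cpu()
                correct += (preds == labels).sum().item()
                total += labels.numel()
            results[sigma] = correct / total
    return results  # e.g. {0.0: 0.94, 0.05: 0.91, 0.1: 0.87}
```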

Findings

Although the pretrained MobileNetV2 performed best on raw metrics, the best model overall, considering performance, interpretability and robustness to noise, is the Swin Transformer trained from scratch, with the CXR (accuracy = 93.21 per cent) and CT (accuracy = 94.14 per cent) modalities.

Originality/value

The models compared are pretrained on MedNIST and leverage both the CT and CXR modalities.

Details

Data Technologies and Applications, vol. ahead-of-print no. ahead-of-print
Type: Research Article
ISSN: 2514-9288

Article
Publication date: 7 April 2023

Sixing Liu, Yan Chai, Rui Yuan and Hong Miao

Abstract

Purpose

Simultaneous localization and map building (SLAM), as a state estimation problem, is a prerequisite for solving the problem of autonomous vehicle motion in unknown environments. Existing algorithms are based on laser or visual odometry; however, lidar has a limited sensing range and yields few data features, while cameras are vulnerable to external conditions, so localization and map building cannot be performed stably and accurately with a single sensor. This paper aims to propose a tightly coupled three-dimensional laser mapping method that incorporates visual information, using laser point cloud information and image information to complement each other and improve the overall performance of the algorithm.

Design/methodology/approach

The visual feature points are first matched at the front end of the method, and mismatched point pairs are removed using the bidirectional random sample consensus (RANSAC) algorithm. The laser point cloud is then used to obtain their depth information, while the two types of feature points are fed into the pose estimation module for a tightly coupled local bundle adjustment solution using a heuristic simulated annealing algorithm. Finally, the visual bag-of-words model is fused with the laser point cloud information, and a threshold is established to construct a loop-closure framework that further reduces the system's cumulative drift error over time.
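A sketch of the bidirectional mismatch-removal step, assuming it amounts to mutual nearest-neighbour (cross-check) matching followed by RANSAC on the fundamental matrix; the paper's exact bidirectional RANSAC variant may differ, and at least eight mutual matches are assumed:

```python
import cv2
import numpy as np

def filter_matches_bidirectional(kp1, des1, kp2, des2):
    """Keep a match only if it wins nearest-neighbour matching in both
    directions, then let RANSAC reject the remaining outliers (sketch)."""
    bf = cv2.BFMatcher(cv2.NORM_HAMMING)
    fwd = {m.queryIdx: m.trainIdx for m in bf.match(des1, des2)}
    bwd = {m.queryIdx: m.trainIdx for m in bf.match(des2, des1)}
    mutual = [(q, t) for q, t in fwd.items() if bwd.get(t) == q]

    pts1 = np.float32([kp1[q].pt for q, _ in mutual])
    pts2 = np.float32([kp2[t].pt for _, t in mutual])
    # Epipolar RANSAC: the inlier mask marks geometrically consistent pairs
    _, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC, 1.0, 0.99)
    keep = mask.ravel().astype(bool)
    return pts1[keep], pts2[keep]
```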

Findings

Experiments on publicly available data sets show that the proposed method tracks the real trajectory well. Across various scenes, maps can be constructed with high accuracy and robustness using the complementary laser and vision sensors. The method was also verified in a real environment on an autonomous mobile acquisition platform; the system running the method operated reliably over long periods and adapted well to multiple scene types.

Originality/value

A tightly coupled multi-sensor method is proposed to fuse laser and vision information for optimal pose estimation. A bidirectional RANSAC algorithm removes mismatched visual point pairs. Further, oriented FAST and rotated BRIEF (ORB) feature points are used to build a bag-of-words model and construct a real-time loop-closure framework that reduces error accumulation. The experimental validation shows that the accuracy and robustness of single-sensor SLAM algorithms can thus be improved.

Details

Industrial Robot: the international journal of robotics research and application, vol. 50 no. 6
Type: Research Article
ISSN: 0143-991X

Article
Publication date: 22 January 2024

Jun Liu, Junyuan Dong, Mingming Hu and Xu Lu

Abstract

Purpose

Existing Simultaneous Localization and Mapping (SLAM) algorithms are relatively well developed. However, in complex dynamic environments, dynamic points on moving objects in the image can distort the system's observations during mapping, introducing biases and errors into pose estimation and the creation of map points. The aim of this paper is to achieve higher accuracy than traditional SLAM algorithms through a semantic approach.

Design/methodology/approach

In this paper, semantic segmentation of dynamic objects is realized with a U-Net semantic segmentation network. Motion-consistency detection then determines whether the segmented objects are actually moving in the current scene, and a motion-compensation method eliminates dynamic points and compensates the current local image, making the system robust.
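A sketch of the dynamic-point elimination step, under assumed interfaces: a per-pixel instance mask produced by the segmentation network and a per-object motion-consistency verdict; both interfaces are illustrative, not the paper's:

```python
def remove_dynamic_points(keypoints, instance_mask, is_moving):
    """Drop feature points that fall inside the mask of an object the
    motion-consistency check judged to be moving (sketch).

    keypoints:     iterable of cv2.KeyPoint-like objects with a .pt
    instance_mask: HxW integer array, 0 = background, k = object id k
    is_moving:     dict mapping object id -> bool from the motion check
    """
    kept = []
    for kp in keypoints:
        u, v = int(round(kp.pt[0])), int(round(kp.pt[1]))
        obj_id = int(instance_mask[v, u])
        if obj_id == 0 or not is_moving.get(obj_id, False):
            kept.append(kp)          # static point: safe to track and map
    return kept
```

Motion compensation would then refill the masked regions before the compensated local image is passed on, but the abstract gives no detail on that step.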

Findings

Experiments comparing dynamic-point detection and outlier removal are conducted on the Technische Universität München (TUM) dynamic data set, and the results show that the absolute trajectory accuracy of the proposed method is significantly better than that of ORB-SLAM3 and DS-SLAM.

Originality/value

In the semantic segmentation stage, the segmentation mask is combined with dynamic-point detection, elimination and compensation, which reduces the influence of dynamic objects and thus effectively improves localization accuracy in dynamic environments.

Details

Industrial Robot: the international journal of robotics research and application, vol. 51 no. 2
Type: Research Article
ISSN: 0143-991X

Article
Publication date: 15 December 2022

Jiaxiang Hu, Xiaojun Shi, Chunyun Ma, Xin Yao and Yingxin Wang

Abstract

Purpose

The purpose of this paper is to propose a multi-feature, multi-metric and multi-loop tightly coupled LiDAR-visual-inertial odometry, M3LVI, for high-accuracy and robust state estimation and mapping.

Design/methodology/approach

M3LVI is built atop a factor graph and composed of two subsystems: a LiDAR-inertial system (LIS) and a visual-inertial system (VIS). LIS performs multi-feature extraction on the point cloud, followed by multi-metric transformation estimation to realize LiDAR odometry. VIS uses LiDAR-enhanced images and IMU pre-integration to realize visual odometry, providing a reliable initial guess for the LIS matching module. Location recognition is performed by a dual loop module combining Bag of Words and LiDAR-Iris to correct accumulated drift. M3LVI also functions properly when one of the subsystems fails, which greatly increases robustness in degraded environments.
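The factor-graph backbone can be illustrated with a minimal GTSAM pose graph; this 2-D toy (the paper's system is 3-D and far richer) only shows how a loop-closure factor, such as one triggered by the dual BoW/LiDAR-Iris module, corrects accumulated odometry drift. The gtsam Python package is assumed to be installed.

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([1e-3] * 3))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.05] * 3))

# Anchor the first pose, then chain odometry factors around a square
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0), prior_noise))
step = gtsam.Pose2(2, 0, np.pi / 2)          # forward 2 m, turn 90 degrees
for i in range(3):
    graph.add(gtsam.BetweenFactorPose2(i, i + 1, step, odom_noise))
# Loop closure: pose 3 re-observes pose 0, correcting accumulated drift
graph.add(gtsam.BetweenFactorPose2(3, 0, step, odom_noise))

# Deliberately drifted initial guesses, as raw odometry would give
initial = gtsam.Values()
guesses = [(0, 0, 0), (2.1, 0.1, 1.5), (2.2, 2.1, 3.0), (0.1, 2.2, -1.6)]
for i, (x, y, th) in enumerate(guesses):
    initial.insert(i, gtsam.Pose2(x, y, th))

result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
print(result.atPose2(3))                     # pulled back toward (0, 2, -pi/2)
```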

Findings

Quantitative experiments were conducted on the KITTI data set and a campus data set to evaluate M3LVI. The experimental results show that the algorithm achieves higher pose estimation accuracy than existing methods.

Practical implications

The proposed method can greatly improve the positioning and mapping accuracy of automated guided vehicles (AGVs) and has an important impact on AGV material distribution, one of the most important applications of industrial robots.

Originality/value

M3LVI divides the original point cloud into six types, uses multi-metric transformation estimation to estimate the state of the robot, and adopts a factor-graph optimization model to refine the state estimate, which improves pose estimation accuracy. When one subsystem fails, the other can complete the positioning work independently, greatly increasing robustness in degraded environments.

Details

Industrial Robot: the international journal of robotics research and application, vol. 50 no. 3
Type: Research Article
ISSN: 0143-991X
