Feature Descriptors and Deep Learning to Extract Features for Building Monocular VO from TQU-SLAM Benchmark Dataset
Enriching the data used to train Visual SLAM and visual odometry (VO) construction models with deep learning (DL) is a pressing problem in computer vision. DL requires a large amount of training data, and data captured under more diverse contexts and conditions yields more accurate Visual SLAM and VO models. In this paper, we introduce the TQU-SLAM benchmark dataset, which includes 160,631 RGB-D frame pairs. It was collected along the corridors of three interconnected buildings, covering a path length of about 230 m. The ground-truth data of the TQU-SLAM benchmark dataset were prepared manually and include the 6-DOF camera pose, 3D point cloud data, the intrinsic camera parameters, and the transformation matrix between the camera coordinate system and the real-world coordinate system. We also evaluated the TQU-SLAM benchmark dataset with the PySLAM framework, using traditional features such as Shi-Tomasi, SIFT, SURF, ORB, ORB2, AKAZE, KAZE, and BRISK, as well as features extracted by DL models such as VGG, DPVO, and TartanVO. The camera pose estimation results are evaluated and presented: the ORB2 features achieve the lowest pose error (Errd = 5.74 mm), while the Shi-Tomasi features achieve the highest ratio of frames with detected keypoints (rd = 98.97%). We also present and analyze the challenges that the TQU-SLAM benchmark dataset poses for building Visual SLAM and VO systems.
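As an illustration only (not the authors' PySLAM-based evaluation pipeline), the hand-crafted detectors named above can be instantiated through OpenCV. The following is a minimal sketch assuming opencv-python is installed (SURF additionally requires an opencv-contrib build with non-free modules enabled); the frame path is hypothetical.

```python
# Illustrative sketch: extracting the hand-crafted features named in the abstract
# with OpenCV. This is NOT the authors' pipeline; paths and parameters are assumptions.
import cv2

frame = cv2.imread("tqu_slam/rgb/000000.png", cv2.IMREAD_GRAYSCALE)  # hypothetical path

# Shi-Tomasi returns corner locations only (no descriptors).
shi_tomasi_corners = cv2.goodFeaturesToTrack(
    frame, maxCorners=2000, qualityLevel=0.01, minDistance=7)

# Detector/descriptor pairs that can be matched between consecutive frames.
# SURF (cv2.xfeatures2d.SURF_create) needs opencv-contrib with non-free support, so it is omitted here.
detectors = {
    "SIFT": cv2.SIFT_create(),
    "ORB": cv2.ORB_create(nfeatures=2000),
    "AKAZE": cv2.AKAZE_create(),
    "KAZE": cv2.KAZE_create(),
    "BRISK": cv2.BRISK_create(),
}

for name, det in detectors.items():
    keypoints, descriptors = det.detectAndCompute(frame, None)
    print(f"{name}: {len(keypoints)} keypoints detected")
```

In a VO front end, the descriptors produced this way would be matched across consecutive RGB frames to estimate relative camera motion; the DL-based alternatives (VGG, DPVO, TartanVO) replace this hand-crafted extraction step with learned features.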