Methodology

Landmark Detection
High-Resolution Net (HRNet): a deep neural network that maintains a high-resolution representation of the input image throughout the network. It connects high-to-low resolution convolution streams in parallel and repeatedly fuses information across these parallel streams (multi-scale fusion), which produces strong and spatially precise high-resolution representations. HRNet has been widely used for pixel-level tasks since its introduction and has achieved strong performance on them. We therefore consider it a robust model for pixel-level classification tasks.
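The multi-scale fusion idea can be illustrated with a minimal sketch. It is not the HRNet implementation: it assumes a single channel and two resolution streams, and it replaces the learned convolutions with fixed operations (nearest-neighbor upsampling and block averaging). The function names are hypothetical and chosen only for illustration.

```python
def upsample_nearest(x, factor):
    """Repeat every pixel `factor` times along both axes."""
    return [[v for v in row for _ in range(factor)]
            for row in x for _ in range(factor)]


def downsample_mean(x, factor):
    """Average non-overlapping factor-by-factor blocks."""
    h, w = len(x) // factor, len(x[0]) // factor
    return [[sum(x[i * factor + di][j * factor + dj]
                 for di in range(factor) for dj in range(factor)) / factor ** 2
             for j in range(w)]
            for i in range(h)]


def add(a, b):
    """Element-wise sum of two equally sized 2-D maps."""
    return [[p + q for p, q in zip(ra, rb)] for ra, rb in zip(a, b)]


def fuse(high, low, factor=2):
    """Exchange information between a high- and a low-resolution stream.

    Each stream receives the other stream resampled to its own resolution,
    so both resolutions are kept after the fusion step.
    """
    new_high = add(high, upsample_nearest(low, factor))
    new_low = add(low, downsample_mean(high, factor))
    return new_high, new_low


# Toy example: a 4x4 high-resolution map and a 2x2 low-resolution map.
high = [[1.0] * 4 for _ in range(4)]
low = [[1.0, 2.0], [3.0, 4.0]]
new_high, new_low = fuse(high, low)
```

In this sketch the high-resolution stream is never discarded, which is the property the description above attributes to HRNet. The real network learns the resampling with convolutions and applies it across many channels.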
X-ray Synthesis
The pix2pix model: a generative model based on the conditional Generative Adversarial Network (conditional GAN) for image-to-image translation tasks. Brian et al. used it in 2018 to synthesize X-ray images from surface geometry, a task similar to the one in this project. We therefore expect it to also synthesize X-ray images successfully from RGB-D images in this project.
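For reference, the standard pix2pix objective can be written as follows. This is the formulation from the original pix2pix work, given here on the assumption that the default loss was used. The symbols are introduced for illustration: x is the input image (here, the RGB-D image), y is the target image (here, the X-ray image), z is the noise input, G is the generator, D is the discriminator, and \lambda weights the L1 term.

```latex
\mathcal{L}_{cGAN}(G, D) = \mathbb{E}_{x,y}\left[\log D(x, y)\right]
  + \mathbb{E}_{x,z}\left[\log\left(1 - D(x, G(x, z))\right)\right]

\mathcal{L}_{L1}(G) = \mathbb{E}_{x,y,z}\left[\lVert y - G(x, z) \rVert_1\right]

G^{*} = \arg\min_{G} \max_{D} \; \mathcal{L}_{cGAN}(G, D) + \lambda \, \mathcal{L}_{L1}(G)
```

The adversarial term pushes the generator toward realistic-looking outputs, while the L1 term keeps each output close to its paired ground-truth image.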

Results

Landmark Detection
HRNet outputs a 6-channel heatmap, with one channel for each of the six predicted landmarks, and the results were evaluated with the mean squared error loss (MSELoss). The average MSELoss of the model on the test set is 0.00004747, which indicates that the predicted heatmaps closely match the ground-truth heatmaps.
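The evaluation described above can be sketched as follows. This is a minimal illustration in plain Python, not the project's code. It makes several assumptions: the ground-truth heatmaps are Gaussians centered on the landmarks, landmark coordinates are recovered from the peak of each channel, and the helper names, the 16x16 heatmap size, and the landmark positions are hypothetical.

```python
import math


def gaussian_heatmap(height, width, center, sigma=1.0):
    """Build one heatmap channel with a Gaussian peak at `center` (row, col)."""
    cr, cc = center
    return [[math.exp(-((r - cr) ** 2 + (c - cc) ** 2) / (2 * sigma ** 2))
             for c in range(width)]
            for r in range(height)]


def mse(pred, target):
    """Mean squared error averaged over all channels and pixels."""
    total, count = 0.0, 0
    for p_ch, t_ch in zip(pred, target):
        for p_row, t_row in zip(p_ch, t_ch):
            for p, t in zip(p_row, t_row):
                total += (p - t) ** 2
                count += 1
    return total / count


def decode_landmarks(heatmaps):
    """Return the (row, col) position of the peak in each channel."""
    coords = []
    for channel in heatmaps:
        _, row, col = max(
            ((v, r, c) for r, line in enumerate(channel) for c, v in enumerate(line)),
            key=lambda item: item[0],
        )
        coords.append((row, col))
    return coords


# Six hypothetical landmark positions on a 16x16 grid, one per channel.
centers = [(2, 3), (4, 10), (7, 7), (9, 2), (12, 12), (14, 5)]
target = [gaussian_heatmap(16, 16, c) for c in centers]

# A prediction whose peaks are each shifted by one pixel to the right.
pred = [gaussian_heatmap(16, 16, (r, c + 1)) for r, c in centers]

loss = mse(pred, target)
landmarks = decode_landmarks(pred)
```

Because most pixels of such a heatmap are close to zero, the heatmap MSE is small in absolute terms even when a peak is slightly displaced. A distance between decoded and ground-truth landmark coordinates is therefore a useful complement to the MSE when interpreting the result.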
X-ray Synthesis
The pix2pix model synthesized high-quality X-ray images with clear spine curves close to the ground truth, and the feedback from the medical expert was positive overall. This suggests that the model is effective at synthesizing X-ray images from RGB-D images.