MFCT-GAN: multi-information network to reconstruct CT volumes for security screening

Purpose – At airport security checkpoints, baggage screening aims to prevent the transportation of prohibited and potentially dangerous items. Observing the projection images generated by an X-ray scanner is a critical method. However, when multiple objects are stacked on top of each other, distinguishing objects from a two-dimensional picture alone is difficult, which prompts the demand for more precise imaging technology. Reconstructing 3D computed tomography (CT) volumes from 2D X-ray images is a reliable solution. Design/methodology/approach – To distinguish the specific contour shape of stacked items more accurately, a multi-information fusion network (MFCT-GAN) based on the generative adversarial network (GAN) and U-like network (U-NET) is proposed to reconstruct 3D CT volumes from two biplanar orthogonal X-ray projections. The authors use three modules to enhance the qualitative and quantitative reconstruction results compared with the original network. The skip connection modification (SCM) and multi-channels residual dense block (MRDB) enable the network to extract more feature information and learn deeper with high efficiency; the introduction of subjective loss enables the network to focus on the structural similarity (SSIM) of images during training. Findings – On account of the fusion of multiple information, MFCT-GAN can significantly improve the value of quantitative indexes and distinguish contours explicitly between different targets. In particular, SCM makes features more reasonable and accurate when they are expanded into three dimensions. The application of MRDB can alleviate the problem of slow optimization during the late training period, as well as reduce the computational cost. The introduction of subjective loss guides the network to retain more high-frequency information, which makes the rendered CT volumes clearer in details. Originality/value – The authors' proposed MFCT-GAN is able to restore the 3D shapes of different objects well based on biplanar projections.
This is helpful in security check places, where X-ray images of stacked objects need to be examined for the presence of prohibited objects. The authors adopt three new modules, SCM, MRDB and subjective loss, and analyze the role the modules play in 3D reconstruction. Results show a significant improvement in reconstruction in both objective and subjective effects.


Introduction
Baggage screening is a vital procedure in the security domain. It is common to use X-rays in two or three orthogonal views for security checks at metro stations and airports. Normally, security personnel need to verify from the X-ray pictures whether prohibited items are contained in luggage or containers (Benjamin, 1995). Explicit contours and textures of items are important to their judgment. However, X-ray projections do not accurately reflect the three-dimensional information of an object, so security personnel need sufficient experience to distinguish its true shape. CT can generate a set of 2D pictures that accurately reflect 3D information, but it is expensive and not well suited to security screening (World Health Organization, 2011). We therefore propose a 3D reconstruction method based on GAN, which restores CT volumes from two orthogonal X-ray projections as realistically as possible.
Studies have shown that some aviation accidents, as well as subway accidents, are caused by explosives smuggled into checked baggage. Therefore, an accurately designed college baggage check-in security system can effectively prevent the recurrence of unsafe events (Shi et al., 2021). Detecting threat items with X-ray scanners based on 3D X-ray CT images is a widely used method in aviation security (Wang et al., 2020). Potentially dangerous items, such as firearms, sharp objects and objects with sharp edges, have a distinctive physical appearance (shape, volume, texture, etc.) (Mouton et al., 2014). Hence, with the two-dimensional images obtained by the X-ray scanner and further reconstruction, security personnel are able to identify potentially prohibited items in passengers' baggage without a manual search (Mouton and Breckon, 2015), which greatly improves operational efficiency.
Currently, item screening through 2D X-ray images is still a manual inspection process. This workflow is cumbersome and requires security personnel with relevant experience and training (Shanks and Bradley, 2004), which inevitably raises the operational threshold and tends to produce inaccurate judgments when the operator is inexperienced, thus reducing screening efficiency. In particular, when objects overlap, image interpretation becomes a challenging task: the projection generated by X-rays does not reveal whether objects overlap, and simple visual inspection cannot detect whether prohibited items are obscured by other objects (Megherbi et al., 2013). Therefore, if the image can be reconstructed into a three-dimensional form that shows the real shape of the object, the operator can obtain the information masked in the two-dimensional X-ray and make a clearer observation and judgment, which will greatly improve the efficiency of the security screening task.
In aviation security infrastructure, explosive detection systems (EDS) are currently the only approved application of CT (Singh, 2003). Based on EDS, dual-energy CT (DECT) is a technique for distinguishing different materials. Its principle is to use two different X-ray spectra to deduce the chemical composition of the investigated material from its response under these spectral conditions (Jin, 2011). DECT techniques are divided into three categories: post-reconstruction techniques (Graser et al., 2009), pre-reconstruction techniques (Alvarez and Macovski, 1976) and iterative reconstruction techniques (Semerci and Miller, 2012). Although the technique has improved reconstruction performance, its increasing computational demand becomes a significant problem. The X-ray apparatus must rotate swiftly around the items to collect enough projections, which is a costly and time-consuming process. Hence, we need a method with low computational cost that reconstructs from 2D to 3D with less data acquisition.
Most CT reconstruction algorithms require a large number of X-ray images as input, which demands considerable computational performance. Some typical principles, such as maximum likelihood (Shepp and Vardi, 1982) and sparsity (Lustig et al., 2007), are used to improve the quality of tomographic reconstruction. These methods are very time-consuming and therefore unsuitable for fast inspection in security scenarios. In fact, the vast majority of security machines obtain only one to three mutually orthogonal projection images of objects for screening by security personnel, so the question of how to reconstruct 3D information using as few images as possible is significant.
Traditional CT reconstruction methods, which are based on mathematical and theoretical knowledge, often require the creation of fairly accurate models. Examples are filtered back projection and iterative reconstruction (Herman, 2009); filtered back projection relies on the fact that the one-dimensional Fourier transform of a projection is equivalent to a slice of the two-dimensional Fourier transform of the original image. The introduction of a priori knowledge is also a typical approach, such as statistical shape models (Lamecker et al., 2006) or knowledge of anatomical structures (Serradell et al., 2011). However, these reconstruction methods require a corresponding mathematical model to be constructed for each kind of object, which limits their generalizability. Deep learning has a natural advantage in some scenarios, such as modeling invisible parts, where traditional algorithms have difficulty estimating the depth of objects from a priori knowledge. Eigen et al. (2014) use a two-stage convolutional neural network (CNN) to generate a 2D depth map from a 2D image. Henzler et al. (2017) apply a U-NET to obtain better performance. Ying et al. (2019) propose X2CT-GAN, which performs better than traditional CNNs in terms of the subjective effect of reconstruction. The format of the input data is also an important issue. Wurfl et al. (2016) work on X-ray sinograms as input, which are not readable for humans. Magnor et al. (2004) reconstruct the 3D model from a single X-ray image. Because a single two-dimensional picture lacks much three-dimensional information, a three-dimensional reconstruction from only one picture is very blurred. Using two or more images as input should therefore yield a better reconstruction. Thus, inspired by previous work, we apply GAN to reconstruct CT from two X-ray images.
To sum up, our contributions include the following four main points.
(1) We propose the SCM module, which introduces the second image as a weight map for correction when expanding from 2D to 3D after single-channel feature extraction. Numerical and physical information is combined to enhance the reconstruction effect.
(2) We apply MRDB connections for feature extraction, which reduce the number of model parameters while alleviating the problem of model instability.
(3) We propose a subjective loss function for training to improve the subjective quality of the generated volumes.
(4) Compared with other reconstruction algorithms, our method improves both quantitative and qualitative indicators; especially for the restoration of internal details, the effect is significantly improved.

Network framework
In general, similar to X2CT-GAN (Peng et al., 2020), the overall framework of our network combines GAN and U-like network (U-NET). The input is two 2D X-ray projection images and the output is 3D CT volumes. After the encoder-decoder stage, the features of the two networks are fused and passed to a new upsampling decoder to generate the final result. An overview of our network is shown in Figure 1. The details of MFCT-GAN are described below.

Generator
The role of the generator is to produce a set of 3D CT volumes from two mutually orthogonal 2D X-ray images (taken in the vertical, horizontal or width plane). The network consists of three main components: feature extraction with MRDB connectivity, a 3D decoder with an upscaling module and feature fusion with SCM.
Since the input is dual-view, the generator contains two parallel encoder-decoder networks. The feature fusion component integrates the 3D features of the two channels to generate the final reconstructed 3D CT volumes. Because the input consists of dual-view X-ray images, how the features of the two images are extracted independently and fused properly directly affects the quality of reconstruction. We therefore apply some modifications to the original network.

(1) Features extraction: In order to extract sufficient feature information from the input and reduce the image size, feature extraction is usually performed with dense blocks (DB) at the beginning.
MRDB: DenseNet alleviates the vanishing-gradient problem to a certain extent by directly linking features among layers, which makes great use of global features (Huang et al., 2017). However, as the number of DBs increases and the network deepens, DenseNet still suffers from vanishing gradients. The introduction of residual learning can alleviate this problem to some extent; the resulting block is called the residual dense block (RDB) (Zhang et al., 2018). Moreover, Peng et al. (2020) found that multiple residual connections can sufficiently enhance the flow of information, as well as reduce the number of model parameters. Taking the network depth and the efficiency of training into consideration, we propose a modified multi-residual dense block (MRDB). In the relationship between the input and output of the block, $x_0, x_1, \ldots, x_{l-1}$ denote the DB outputs of the layers preceding layer $l$ and $H_l[\cdot]$ denotes the DB of layer $l$, which produces as many feature maps as the growth rate. The outputs of the different layers together form the input of $H_l[\cdot]$. For the last layer, we introduce the first input as residual learning.
As for other details of MRDB, due to the connection of multiple residuals, the input and output share the same number of channels and image size. This is because each DB is followed by a 1 × 1 kernel filter that increases the number of channels, so that the residual can be connected with the previous layer. Each MRDB is followed by a transition block that changes the number of input channels for the next MRDB. Hence, the feature extraction capability of the network is increased effectively.
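As a point of reference, the dense connectivity of DenseNet (Huang et al., 2017) and one way the residual connections described above could be layered on top of it can be written as follows. Only the first line is the standard DenseNet definition; the second and third lines are assumptions derived from the prose, and the symbols $\hat{x}_l$, $W_l^{1\times 1}$, $L$ and $y$ are introduced here purely for illustration.

```latex
% Dense connectivity as defined for DenseNet (Huang et al., 2017)
\hat{x}_l = H_l\big([x_0, x_1, \ldots, x_{l-1}]\big)
% Assumed: a 1x1 convolution restores the channel count so that the
% result can be added to the output of the previous layer
x_l = x_{l-1} + W_l^{1\times 1} * \hat{x}_l
% Assumed: the block output adds the first input x_0 to the last layer x_L
y = x_L + x_0
```

Under these assumptions every $x_l$ and the block output $y$ keep the channel count and spatial size of $x_0$, which matches the statement that the input and output of an MRDB share the same shape.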
(2) 3D decoder: Once feature extraction for the biplanar input images is completed, one dimension needs to be added for subsequent decoding. Inspired by previous work, we add a depth dimension to the feature maps with the same size as the width and length, in other words, we expand the features from two dimensions to three by duplicating the feature maps.
After bridging the 2D encoder and 3D decoder, we apply the classical upsampling method to decode, which consists of two main modules: Conv3d-Norm-ReLU, which generates more details of the reconstruction, and Deconv3d-Norm-ReLU, which restores the size of the 3D CT volumes.
SCM: Combining long and short skip connections is beneficial for deep neural networks (Drozdzal et al., 2016). It is very common to employ skip connections to link encoders and decoders. The situation becomes slightly different for a 2D-to-3D task. Since the encoder features must be expanded from two dimensions to three before being delivered to the following decoder, how to raise the dimension becomes a critical problem. Duplicating the feature map along the depth dimension is the common operation, but it is inaccurate and rough (Ying et al., 2019; Peng et al., 2020; Ratul et al., 2021).
To better utilize the biplanar information, we propose a novel skip connection module (SCM), shown in Figure 2, to transmit low-level features to high-level features.
In summary, when the first image has been encoded and needs to be expanded by one dimension, the second image is introduced as a weight map and the 3D features of the first input are corrected. Finally, the rectified features are fed into the decoder. Specifically, the pixel values in the second image are normalized to a weight map with the shape $(H, W)$. The shape of the 3D feature of the first image is $(D, H, W)$. In particular, the values of $D$, $H$ and $W$ are equal. After that, the 3D features are corrected by multiplying them by the factors in the weight map: all the pixels in row $h$ of depth slice $d$ of the feature maps are multiplied by the weight factor in column $h$ of row $d$ of the weight map.
To make the weight map consistent with the shape of the feature maps, a similar average pooling operation is introduced to reduce the size of the weight map. The correction by the weight map makes the feature maps more accurate when they are expanded from two dimensions to three, and the dual-view input information is fully utilized.
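The correction step can be illustrated with a minimal pure-Python sketch. It assumes one reading of the description above, namely that every pixel in row h of depth slice d is scaled by the weight at row d, column h of the normalized second view. The function names `normalize` and `expand_with_weight_map` are hypothetical, and the average pooling step is omitted by assuming the second view already matches the feature size.

```python
def normalize(view):
    """Scale a 2D image (list of rows) to [0, 1] so it can act as a weight map."""
    flat = [v for row in view for v in row]
    lo, hi = min(flat), max(flat)
    if hi == lo:
        # A constant view carries no information: fall back to plain duplication.
        return [[1.0 for _ in row] for row in view]
    return [[(v - lo) / (hi - lo) for v in row] for row in view]


def expand_with_weight_map(feature_2d, second_view):
    """Lift an (H, W) feature map to (D, H, W) and correct it with the second view.

    Plain duplication would give volume[d][h][w] = feature_2d[h][w] for every d.
    Here each copy is additionally scaled by weights[d][h] (assumed indexing).
    """
    weights = normalize(second_view)
    depth = len(weights)
    return [
        [[value * weights[d][h] for value in row]
         for h, row in enumerate(feature_2d)]
        for d in range(depth)
    ]


feature = [[1.0, 2.0], [3.0, 4.0]]
second = [[0.0, 10.0], [5.0, 10.0]]
volume = expand_with_weight_map(feature, second)
print(volume)
```

With a constant second view the function reduces to plain duplication, which makes the difference between the two expansion strategies easy to inspect on small inputs.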
(3) Features fusion: The shortcoming of using a single 2D image to reconstruct 3D CT volumes is the weak generalization of the model. Repeated training of a deep network alone is rarely enough to learn useful 3D features, especially in applications with many different kinds of objects. That is the reason why we use two mutually orthogonal X-ray images as model input. The complementary information enables the network to generate more accurate reconstruction results. Naturally, after the parallel dual-channel encoder-decoder network, we need to fuse the feature information.
We apply a third decoder network with the same structure as the 3D decoder mentioned before. In reality, the two X-ray images are captured at almost the same time, so there is no motion occlusion in the images. We therefore consider the outputs of both decoders to be equally important, and the input of the third decoder is the average of the outputs of the two parallel decoders. As a result, the output of the third decoder network is the final reconstructed 3D CT volumes.

Discriminator
Based on PatchGANs (Ledig et al., 2017; Zhu et al., 2017), which generalize well and are frequently applied in image generation, we use a modified PatchDiscriminator in our network. The output of the vanilla PatchGAN discriminator is an N × N matrix rather than a scalar value. By discriminating each patch, local image features can be extracted and characterized, which is more conducive to high-resolution image generation. We replace the original conv2d modules with conv3d modules. Conv3d-Norm-ReLU modules with kernel size 5 3 are used three times, followed by the same architecture with kernel size 5 1, and the network ends with a conv3d layer.

Loss functions
In order to balance the quantitative metrics values and qualitative subjective evaluation after 3D reconstruction, we apply four loss functions to constrain the generative model.

Adversarial loss
GAN is a significant architecture for generating photorealistic images and has been well studied in recent research (Goodfellow et al., 2014). The classical GAN uses sigmoid cross-entropy as its objective function, which is usually suitable for logistic classification, so vanishing gradients inevitably become a potential problem. Least squares GAN (LSGAN) replaces the original loss function with a least squares loss function (Mao et al., 2017). The new objective function penalizes samples that are classified as real but lie far from the decision boundary and drags the fake samples back toward the boundary. As a result, the problem of vanishing gradients is alleviated, which improves the generated images. The LSGAN loss is defined as follows:
$$\mathcal{L}_{LSGAN}(D) = \frac{1}{2}\left[\mathbb{E}_{y}\big(D(y \mid x) - 1\big)^2 + \mathbb{E}_{x}\big(D(G(x) \mid x)\big)^2\right]$$
$$\mathcal{L}_{LSGAN}(G) = \frac{1}{2}\left[\mathbb{E}_{x}\big(D(G(x) \mid x) - 1\big)^2\right] \quad (2)$$
where $G$ denotes the generator, $D$ denotes the discriminator, $x$ denotes the input of two orthogonal X-ray projection images and $y$ denotes the corresponding ground-truth CT volumes.
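As a numerical illustration of the least squares objective (Mao et al., 2017), the sketch below computes the discriminator and generator losses from lists of discriminator scores. The function names are hypothetical, and the score lists stand in for the flattened patch outputs of the discriminator; this is an illustration of the loss, not the training code of the paper.

```python
def mean(values):
    """Arithmetic mean of a non-empty list of floats."""
    return sum(values) / len(values)


def lsgan_discriminator_loss(d_real, d_fake):
    """Push scores on real CT volumes toward 1 and scores on generated ones toward 0."""
    real_term = mean([(s - 1.0) ** 2 for s in d_real])
    fake_term = mean([s ** 2 for s in d_fake])
    return 0.5 * (real_term + fake_term)


def lsgan_generator_loss(d_fake):
    """Push the discriminator scores on generated CT volumes toward 1."""
    return 0.5 * mean([(s - 1.0) ** 2 for s in d_fake])


# An undecided discriminator (all scores 0.5) is penalized on both terms.
print(lsgan_discriminator_loss([0.5], [0.5]))
print(lsgan_generator_loss([0.5]))
```

Because the penalty grows quadratically with the distance from the target score, samples far from the decision boundary still produce a usable gradient, which is the property the paragraph above relies on.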

Projection loss
Since the LSGAN loss places the same importance on every pixel, the generated image is close to the ground truth overall but cannot stay similar to it in structure. In real life, owing to prior knowledge, people can easily imagine the original three-dimensional appearance of an object even from a single two-dimensional picture (Jiang et al., 2018). Thus, we use projection loss as prior knowledge to constrain the geometric shape in the network. Following Ying et al. (2019), it is defined as follows:
$$\mathcal{L}_{proj} = \frac{1}{3}\Big[\mathbb{E}_{x,y}\lVert P_v(y) - P_v(G(x))\rVert_1 + \mathbb{E}_{x,y}\lVert P_h(y) - P_h(G(x))\rVert_1 + \mathbb{E}_{x,y}\lVert P_w(y) - P_w(G(x))\rVert_1\Big] \quad (3)$$
where $P_v$ denotes the projection in the vertical plane, $P_h$ denotes the projection in the horizontal plane and $P_w$ denotes the projection in the width plane.
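A projection loss of this kind can be sketched in a few lines of pure Python. The sketch assumes that a projection is the mean intensity along one axis of the volume and that the three L1 terms are averaged over pixels and over views; both choices, and all function names, are assumptions made for illustration.

```python
def project(volume, axis):
    """Mean-intensity projection of a (D, H, W) nested list along one axis."""
    depth, height, width = len(volume), len(volume[0]), len(volume[0][0])
    if axis == 0:
        return [[sum(volume[d][h][w] for d in range(depth)) / depth
                 for w in range(width)] for h in range(height)]
    if axis == 1:
        return [[sum(volume[d][h][w] for h in range(height)) / height
                 for w in range(width)] for d in range(depth)]
    if axis == 2:
        return [[sum(volume[d][h][w] for w in range(width)) / width
                 for h in range(height)] for d in range(depth)]
    raise ValueError("axis must be 0, 1 or 2")


def l1_distance(image_a, image_b):
    """Mean absolute difference between two 2D images of equal shape."""
    diffs = [abs(a - b) for row_a, row_b in zip(image_a, image_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def projection_loss(generated, target):
    """Average the L1 distance between the projections along all three axes."""
    return sum(l1_distance(project(generated, ax), project(target, ax))
               for ax in range(3)) / 3.0
```

Two volumes with identical projections in all three planes give a loss of zero even if individual voxels differ, which is why this term constrains overall geometry and is paired with a voxel-level loss.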

Reconstruction loss
The least squares function makes only a binary (real or fake) judgment on the generated image and the ground truth, which makes it difficult for the model to focus on regions with larger pixel values during training. As a result, blurring occurs in the rendered CT volumes and information is seriously lost. Another pixel-level loss function therefore needs to be introduced. Inspired by previous work (Johnson et al., 2016), we apply a volume reconstruction loss to constrain the model at the pixel level.
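A common voxel-wise choice for such a term is the mean squared error between the generated volume and the ground truth. Assuming the volume reconstruction loss takes this form (the label $\mathcal{L}_{rec}$ is introduced here only for illustration), it would read:

```latex
% Assumed form: squared L2 distance between ground truth y and generated volume G(x)
\mathcal{L}_{rec} = \mathbb{E}_{x,y}\,\lVert y - G(x) \rVert_2^2
```

Unlike the adversarial term, this quantity depends on every voxel of the generated volume, so regions with large intensity errors contribute directly to the gradient.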

Subjective loss
Both projection loss and reconstruction loss are biased toward pixel-level operations, which leads to a high peak signal-to-noise ratio (PSNR). However, a high PSNR does not correlate perfectly with the subjective effect seen by the human eye. Rouditchenko et al. (2019) proposed a novel, differentiable error function that combines the $\ell_1$-norm and SSIM and showed great improvement in image restoration. SSIM measures the similarity between two images and is related to the subjective evaluation of images. Based on this previous study, we propose the subjective loss, which combines $\ell_{SSIM}$ and smooth $\ell_1$, where $\ell_{SSIM}$ represents the SSIM loss function and smooth $\ell_1$ represents the smooth $\ell_1$ loss, which aims to avoid the gradient no longer changing when the learning rate is too small in the late training period (Yu et al., 2016).
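The idea of mixing a structural term with a robust pixel term can be sketched as below. The combination `omega * (1 - SSIM) + smooth_l1` is an assumed form, not a formula taken from the paper. SSIM is computed once over the whole signal instead of over sliding windows, using the usual constants K1 = 0.01 and K2 = 0.03 (Wang et al., 2004), and all function names are hypothetical.

```python
def smooth_l1(pred, target, beta=1.0):
    """Mean smooth L1 loss: quadratic for small errors, linear for large ones."""
    total = 0.0
    for p, t in zip(pred, target):
        diff = abs(p - t)
        total += 0.5 * diff * diff / beta if diff < beta else diff - 0.5 * beta
    return total / len(pred)


def global_ssim(x, y, data_range=1.0):
    """SSIM over two flat signals, using a single global window (simplification)."""
    c1 = (0.01 * data_range) ** 2
    c2 = (0.03 * data_range) ** 2
    n = len(x)
    mu_x, mu_y = sum(x) / n, sum(y) / n
    var_x = sum((a - mu_x) ** 2 for a in x) / n
    var_y = sum((b - mu_y) ** 2 for b in y) / n
    cov = sum((a - mu_x) * (b - mu_y) for a, b in zip(x, y)) / n
    return ((2 * mu_x * mu_y + c1) * (2 * cov + c2)) / (
        (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2))


def subjective_loss(pred, target, omega=9.0):
    """Assumed combination: omega weights the SSIM term against smooth L1."""
    return omega * (1.0 - global_ssim(pred, target)) + smooth_l1(pred, target)
```

The default `omega=9.0` mirrors the weight reported for the SSIM term; a larger value shifts the emphasis from pixel-wise agreement toward structural agreement.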

Total loss
After introducing the four loss functions above, our final objective function is their weighted sum, where each $\alpha$ is the weight of one loss function and represents the importance of that loss term. In a realistic security check place, we pay more attention to subjective similarity, so we appropriately increase the weight of the subjective loss. The final weights are set as follows: $\alpha_1 = 0.1$, $\alpha_2 = 8$, $\alpha_3 = 8$, $\alpha_4 = 2$, $\omega = 9$.
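Under the assumption that the weights are indexed in the order in which the losses were introduced (adversarial, projection, reconstruction, subjective), the objective can be sketched as a weighted sum. The loss labels below are introduced here for illustration only.

```latex
% Assumed pairing of weights and loss terms, following the order of presentation
\mathcal{L}_{total} = \alpha_1 \mathcal{L}_{LSGAN} + \alpha_2 \mathcal{L}_{proj}
                    + \alpha_3 \mathcal{L}_{rec} + \alpha_4 \mathcal{L}_{sub}
% with \alpha_1 = 0.1, \alpha_2 = 8, \alpha_3 = 8, \alpha_4 = 2,
% and \omega = 9 acting inside \mathcal{L}_{sub}
```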

Experiment details

Datasets
To train the model well, we need X-ray projection images obtained from a security scanner together with the corresponding CT volumes. However, because of the high cost and because no such dataset is available online, we use a publicly available chest CT scan dataset: the lung image database consortium (LIDC-IDRI) (Armato et al., 2011). To obtain the corresponding 2D orthogonal projection images, X-rays are synthesized using the digitally reconstructed radiographs (DRR) technique (Milickovic et al., 2000) together with CycleGAN, following the work of Ying et al. (2019). In total, there are 920 paired samples for training and 98 paired samples for testing. Each pair contains two X-ray images resized to 128 × 128 and a CT volume resized to 128 × 128 × 128.

Metrics
We use two typical metrics for our quantitative results: PSNR and SSIM. PSNR is calculated from the mean squared error and reflects the relationship between the maximum signal and the background noise (Horé and Ziou, 2010). In short, it is an objective index for evaluating images. SSIM is calculated from the brightness and contrast of local patterns (Wang et al., 2004). This index is close to real human perception, so SSIM is an image quality evaluation standard in line with human intuition.
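For reference, PSNR follows directly from the mean squared error as 10·log10(MAX²/MSE), where MAX is the largest possible intensity. A minimal sketch, assuming intensities scaled to the range [0, `max_value`] and a hypothetical function name `psnr`:

```python
import math


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in dB between two flat lists of intensities."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical signals: no noise at all
    return 10.0 * math.log10(max_value ** 2 / mse)


# A uniform error of 0.1 on a [0, 1] scale gives an MSE of 0.01.
print(psnr([0.0, 0.0], [0.1, 0.1]))
```

Because the scale is logarithmic, halving the root mean squared error raises PSNR by about 6 dB regardless of the starting value.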

Qualitative analysis
As shown in Figure 3, the first row shows the CT ground truth, the second row shows the results of the baseline X2CT-GAN+B and the third row shows the reconstruction of our proposed model MFCT-GAN. We compare their subjective quality. It can be seen that our model produces higher-quality reconstructions than the baseline. In particular, (1) our model produces more explicit contour boundaries, which clearly distinguish between cavities and solids; (2) for different internal organs, we can see clearer anatomical structures, such as the shape of the spine and spinal cord and (3) for consecutive CT images, our model captures structural changes of organs quickly, so that the generation of the next CT image is adjusted in time.

We visualize the CT sequence by volume rendering (Brian et al., 1996), as shown in Figure 4. From left to right, the figure shows the ground truth, X2CT-GAN+B and MFCT-GAN (ours). As we can see, the baseline method is prone to producing useless redundant data in the thoracic body, and its restoration of the real internal vessels is not as detailed and accurate as that of our proposed method.
Table 3. Evaluation of different proposed modules. Note(s): "Layer 1" represents modification on the first multi-res dense, "Layer 1-2" denotes modification on the first and second multi-res dense and "Layer 1-4" denotes modification on the overall multi-res dense. "ω" is the weight of SSIM loss in subjective loss. The best results are bold for viewing.
In a realistic security scenario, the 3D restoration of overlapping objects needs to be solved first. Therefore, the accurate distinction between internal objects becomes the main indicator of the quality of our reconstruction. Figure 4 suggests that the MFCT-GAN model achieves this well.

Quantitative analysis
In this section, we discuss the metric enhancement of our proposed method. 2DCNN is a CT reconstruction method that appeared very early, only for single-view input (Henzler et al., 2017); X2CT-GAN is the baseline method, where "S" denotes single-view X-ray input and "B" denotes biplanar X-rays input.
We use PSNR and SSIM as evaluation metrics, and the results are shown in Table 1. The 3D reconstruction using GAN clearly works better than the traditional CNN. The dual-view input contains more 3D information, so the reconstruction accuracy is higher. Comparing the baseline with MFCT-GAN specifically, our proposed method improves PSNR by up to 4.82 dB (18.4%), while SSIM increases slightly, by 0.02 (3.05%). When the PSNR value exceeds 30 dB, the image quality can be considered good. Owing to the subjective loss function, the subjective effect improves greatly even though the increase in the index is limited.
The computational performance is analyzed in Table 2. The use of residual learning effectively reduces the number of model parameters and the training time; the MRDB applied in our method further reduces the number of parameters and achieves faster computation, which effectively saves training time.

Ablation study
To investigate the effectiveness of the three improved modules, an ablation study was conducted and the results are shown in Table 3.
(1) The SCM part yields the most obvious improvement in PSNR, which is due to the correction by the other orthogonal view when the dimensionality expansion is performed before the skip connection. The subjective loss function yields the most obvious improvement in SSIM, because the subjective loss includes the calculation of SSIM, so training shifts the emphasis from pixel-level alignment to SSIM.
(2) SCM gradually improves the reconstruction effect as the encoder network deepens. It is reasonable to assume that the smaller the input size to the decoder, the more obvious the alignment effect will be.
(3) ω in the subjective loss is the weight of SSIM, and SSIM retains high-frequency information better, whereas smooth $\ell_1$ pays more attention to low-frequency information. Since we concentrate on the accuracy of the reconstruction of object interiors, the reconstruction effect can be improved by appropriately increasing ω. The optimal value of ω remains to be investigated.

Conclusion
In this paper, we propose a multi-information fusion network, named MFCT-GAN, to reconstruct 3D CT volumes from biplanar X-ray projection images. To cope with the security check scenario, which requires fast restoration of 3D object information, we propose two modules and a loss function, namely SCM, MRDB and the subjective loss function, which are used to improve the reconstruction quality of the vanilla network. The qualitative and quantitative results indicate that our proposed network restores the contours of different parts inside the object well and that the model trains faster. Because the current dataset is limited, we will train on actual X-ray images generated by security scanners with corresponding CT volumes in the future. We also want to design a better volume rendering method to achieve end-to-end reconstruction, with the aim of improving the screening efficiency of security personnel and serving more scenarios.