Abstract
We review the relatively immature field of automated image analysis for X-ray cargo imagery. There is increasing demand for automated analysis methods that can assist in the inspection and selection of containers, due to the ever-growing volumes of traded cargo and the increasing concerns that customs- and security-related threats are being smuggled across borders by organised crime and terrorist networks. We split the field into the classical pipeline of image preprocessing and image understanding. Preprocessing includes: image manipulation; quality improvement; Threat Image Projection (TIP); and material discrimination and segmentation. Image understanding includes: Automated Threat Detection (ATD); and Automated Contents Verification (ACV). We identify several gaps in the literature that need to be addressed and propose ideas for future research. Where the current literature is sparse we borrow from the single-view, multi-view, and CT X-ray baggage domains, which have some characteristics in common with X-ray cargo.
Introduction
The use of cargo containers in global trade transactions continues to grow. From 2004 to 2014, and despite the 2008 global economic crisis, the number of Twenty-foot Equivalent Unit (TEU) container transactions more than doubled to reach almost 7 × 10⁸ TEU per annum [74]. During this time, the US Container Security Initiative (CSI), proposed in the wake of the 9/11 terrorist attacks, has encouraged 100% screening of containers [66], and is being implemented by ports around the world [76]. With the ever-growing numbers of containers and increasingly stringent screening requirements, there has been active research in academia and industry to engineer accurate and rapid screening methods, which are vital for the global economy and security.
Cargo containers are frequently exploited for smuggling, which can be achieved by concealment amongst and within legitimate cargo or packaging, by concealment within legitimate or false container partitions, or by intercepting containers to plant and recover contraband (rip on/rip off) [14]. Smuggling bypasses customs controls, allowing criminals to: avoid duties on legitimate goods (e.g. cars, alcohol, cigarettes); trade prohibited or counterfeit items; launder money; and avoid sanctions [17].
Under the CSI and similar initiatives, cargo inspection is implemented in three layers. The first layer selects samples of containers for inspection [16, 76] based on specific intelligence or a risk analysis. Often, a small fraction of containers are randomly sampled in the hope of catching out criminals who have discovered ways to make shipments appear “low risk” [16]. Selected containers first undergo Non-Intrusive Inspection (NII). If anything suspicious is detected then the container is sent for physical inspection. Physical inspection is very slow and expensive; it has to be well documented for use as evidence and done carefully to avoid compensation payouts if the container is innocuous.
The majority of cargo NII systems use transmission X-ray or γ-ray radiography [42] to form an image of the cargo contents (examples in Fig. 3). The image is sent to a human operator who searches it for any anomalies, specific threats, or discrepancies with the shipping manifest. Cargo images pose a difficult visual search task for the human operator, and they are much more difficult to analyse than other types of border security imagery such as baggage. This is because cargo scanners have to operate at a much larger scale. For example, a 40 ft General Purpose cargo container has a volume of 67.6 m³ [15] and is made out of steel, whereas hand luggage typically has a volume of 0.063 m³ and is usually made out of fabric or plastics. The physical scale of cargo scanners makes it difficult to efficiently perform 3D Computed Tomography (CT) [8] but some multi-view systems do exist. Moreover, for cargo it is more difficult to extract material composition information due to the higher energies required for sufficient penetration to obtain good image contrast (Fig. 1). Cargo images are also far more cluttered, whilst small threats, such as firearms, have a very small visual signature. A comparison between baggage and cargo single-view X-ray imagery is shown in Fig. 2.
Automated image analysis can help with cargo screening by Assisted Inspection or Assisted Selection (Fig. 4). Currently most research has been geared towards Assisted Inspection, with algorithms designed to assist the operator, such as by annotating the image with a Region-of-Interest (ROI) to alert the operator to a potential security- or customs-related threat. The goal of Assisted Selection is to use automated image analysis to inform the risk analysis used for cargo selection, but it relies on the ability to scan all containers at high throughput rates. Such technologies are becoming available, such as rail scanners capable of imaging cargo traveling at up to 60 km/h [61]. When such systems are widely deployed, Assisted Selection has the potential to increase the number of true positive cargoes and reduce the number of false positive cargoes in the selected sample. In doing so, it should allow for human resources to be allocated more efficiently.
The literature on automated image analysis for cargo can be separated into Image Preprocessing and Image Understanding. Image Preprocessing is a broad category including any operation made to an image in order to help Image Understanding by either humans or algorithms. Image Preprocessing includes: image manipulation; image correction and denoising; material discrimination and segmentation; and Threat Image Projection (TIP). Image Understanding methods make decisions based on the image contents. Current methods for this can be split into Automated Threat Detection (ATD) and Automated Contents Verification (ACV).
In this paper we investigate the existing literature according to the themes of Image Preprocessing (Sec. 2) and Image Understanding (Sec. 3). In some cases, the literature directly relating to cargo imagery is scarce. This is due largely to commercial and security protection, and the difficulty for academics to obtain access to commercial scanning hardware. Additionally, the majority of funding goes towards aviation security, where search tasks are more tractable, and there is a more obvious and immediate threat from terrorism. In cases where cargo research is sparse, we look to the literature from other domains such as baggage, since many of the findings there may be transferable to the cargo domain. The purpose of this paper is to map out the current literature, to identify gaps in it, and to propose future directions of research.
Image Preprocessing
We define Image Preprocessing as any process which is performed before, and in order to improve the performance of, Image Understanding whether performed by humans or automated systems. In the literature, we have identified four topics: image manipulation; image quality improvement; material discrimination and segmentation; and Threat Image Projection (TIP).
Image manipulation
Image manipulation is used to improve the accuracy of human operators and automated Image Understanding algorithms. Most work has been on studying the threat detection performance of human operators under different image manipulation functions implemented in commercial image viewing software. Manipulations include pseudo-colouring, edge enhancement, and intensity transforms such as Histogram Equalization (HE), logarithm and square-root (Fig. 5). Note that pseudo-colour is not based on material properties, which we discuss in Sec. 2.4.
For cargo screening, Michel et al. [50] have shown that image pseudo-colouring does not lead to improved performance when identifying narcotics, weapons, Improvised Explosive Devices (IEDs) and other explosives, compared to when the raw greyscale image is used. Similar results have been found by Klock [39], who tested human performance at detecting concealed IEDs, guns, knives and other prohibited items in baggage. Evaluated manipulations included pseudo-colour, intensity inversion, inorganic or organic material stripping, and a commercial Crystal Clear™ image enhancement. They found that the raw greyscale image and the Crystal Clear™ enhancement led to best human performance.
Chen [11] reasons that although most X-ray cargo images are captured and encoded in 16 bits, typical greyscale displays only use 8 bits, so useful information is lost. With pseudo-coloured images there are 8 bits available for each colour channel, thus potentially preserving the information. However, he argues that the effectiveness of pseudo-colour is in fact limited by the ability of humans to detect subtle colour differences. The author also claims that edge enhancement techniques do not work well for cargo due to the complexity of objects and high pixel noise. Chen [11] qualitatively evaluates linear, logarithm and Adaptive HE (AHE) image transforms. He argues that the log transform can be beneficial as it makes image brightness proportional to object thickness, but thin items are sometimes lost. The square-root transform can be beneficial since the Signal-to-Noise Ratio (SNR) is proportional to the square-root of pixel intensity, thus it is an equal-noise display method. Finally, the author observes that, qualitatively, AHE is the best method but that full object thickness information is lost.
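As a rough illustration of the intensity transforms discussed above, the sketch below maps a 16-bit image onto an 8-bit display range using linear, logarithm and square-root transforms. The function name `to_display` and the normalisation choices (full scale of 65535, `log1p` to avoid taking the log of zero) are assumptions made for illustration and are not taken from [11].

```python
import numpy as np

def to_display(raw16, mode="linear"):
    """Map a 16-bit image to 8-bit display values (illustrative sketch only)."""
    x = raw16.astype(np.float64)
    full_scale = 65535.0  # assumed maximum of a 16-bit image
    if mode == "linear":
        y = x / full_scale
    elif mode == "log":
        # log1p avoids log(0); the result is normalised to [0, 1]
        y = np.log1p(x) / np.log1p(full_scale)
    elif mode == "sqrt":
        # the square root approximately equalises Poisson-like noise across intensities
        y = np.sqrt(x / full_scale)
    else:
        raise ValueError(f"unknown mode: {mode}")
    return np.round(255.0 * y).astype(np.uint8)

# Example: a dark pixel is almost invisible in the linear display
# but is lifted well into the visible range by the log transform.
dark = np.array([256], dtype=np.uint16)
linear_value = to_display(dark, "linear")[0]
log_value = to_display(dark, "log")[0]
```

In this sketch the choice of transform only changes how the 16-bit range is compressed into 8 bits; it does not add information, which is consistent with the observation that each transform trades visibility in one intensity range against another.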
As far as we are aware, there have been no specific studies on the effect of image manipulations on automated Image Understanding in cargo. A few researchers have done small studies as parts of larger bodies of work. For example, when building a car detector, Jaccard et al. [35] tested log-intensity histograms as a feature and found them to perform better than intensity histograms. However, the log-image gave worse performance than the raw image when using oriented Basic Image Features (oBIFs), possibly because the oBIF parameters were not re-tuned. In a later paper [37], this time trying both hand-crafted Pyramid Histograms of Visual Words (PHOW) features and features learnt using trained-from-scratch Convolutional Neural Networks (CNNs), the authors found that using log transformed images as input gave a substantial improvement in performance.
Other researchers have applied different image manipulations before applying Image Understanding algorithms. These include: Gaussian blurring [83]; rudimentary segmentation algorithms to extract different image regions [63, 83]; and intensity inversion followed by z-score normalisation and Retinex filtering [84].
Image quality improvement
Image quality improvement can include denoising methods to ameliorate Poisson or salt-and-pepper noise, and methods to correct image errors that arise during image acquisition.
To our knowledge, there have been no published comparison studies on different cargo image denoising techniques. However, in baggage, Mouton et al. [55] perform a comparative study on a number of denoising techniques applied to low quality baggage imagery. Techniques included: anisotropic diffusion; Total Variation (TV) denoising; bilateral filtering; translation-invariant wavelet shrinkage; Non-Local Means (NLM) filtering; and Alpha-Weighted Mean Separation and Histogram Equalisation (AWMSHE). They assess performance by running a Scale-Invariant Feature Transform (SIFT) point detector across the image before and after denoising. Within the CT image, they label each detected point as either an object feature point (located on an object of interest) or a noise feature point (not on the object, and so assumed to be caused by noise or artefacts). Performance is then measured by taking the number of object feature points as a fraction of the total number of feature points, on the assumption that an increasing ratio indicates improved performance for a SIFT-based detection algorithm. They find that all methods offer improved performance over using just the raw image, with translation-invariant wavelet shrinkage performing best. However, it is unclear whether these results would generalise to algorithms that are not based on SIFT.
To our knowledge, there is one publication on image error correction for cargo. Rogers et al. [65] propose a method for correcting wobble artefacts in images captured by mobile cargo scanners. The wobble artefact originates from the wobble of the detector array as a mobile scanner traverses a stationary cargo. The method relies on a slight modification to the scanning hardware by rotating four of the imaging detectors by 90° so that they can measure the beam across its width. This allows the beam to be tracked as it jitters on the detector array. The position tracking is achieved by fitting a Gaussian model to the beam cross-section to obtain an instantaneous estimate of the beam centroid. The instantaneous estimate is Bayesian fused with a second estimate of the beam position, which is a linear combination of previous estimates (i.e. auto-regression). This method improves the tracking robustness to heavily attenuating objects that obscure the beam. The authors use the beam position estimates to apply image corrections. They determine that they can fix: 70% of image error due to detector wobble; 68% of noise due to source fluctuation; and 95% of noise due to sensor variation.
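The fusion step described above can be sketched under simplifying assumptions. The code below assumes both beam-position estimates are independent Gaussians, so that Bayesian fusion reduces to a precision-weighted average. The function names, the auto-regression coefficients and the variances are hypothetical and are not taken from [65].

```python
def ar_predict(history, coeffs):
    """Predict the beam position as a linear combination of recent estimates.

    Assumes len(history) >= len(coeffs); coeffs[-1] weights the newest estimate.
    """
    recent = history[-len(coeffs):]
    return sum(c * h for c, h in zip(coeffs, recent))

def fuse_gaussian(mu_a, var_a, mu_b, var_b):
    """Precision-weighted fusion of two independent Gaussian estimates."""
    w_a = 1.0 / var_a
    w_b = 1.0 / var_b
    var = 1.0 / (w_a + w_b)
    mu = var * (w_a * mu_a + w_b * mu_b)
    return mu, var

# Example: when the beam is heavily attenuated, the instantaneous Gaussian fit
# is unreliable (large variance), so the fused estimate follows the prediction.
predicted = ar_predict([1.0, 2.0, 3.0], [0.5, 0.5])
fused_mu, fused_var = fuse_gaussian(10.0, 1e6, predicted, 1.0)
```

Under these assumptions, the fused estimate degrades gracefully: the less reliable of the two estimates is automatically down-weighted, which is one way to obtain the robustness to obscuring objects that the authors report.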
Other researchers have applied simple image correction and denoising methods in preprocessing prior to algorithmic Image Understanding. These include: median filtering or filling individual erroneous pixels with their neighbourhood median to remove salt-and-pepper noise [35, 63]; normalisation of image columns to reduce errors from X-ray source fluctuation [63, 65]; and deletion of image rows or columns that contain no image information due to source misfire or detector downtime [35, 63].
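Two of these simple corrections can be sketched as follows. This is a minimal illustration rather than the implementation used in [35, 63, 65]: the 3 × 3 neighbourhood, the outlier threshold, and the use of a band of object-free ("air") rows as the per-column reference are all assumptions, and the function names are hypothetical.

```python
import numpy as np

def fill_outliers_with_median(img, threshold):
    """Replace pixels that deviate from their 3x3 neighbourhood median by more than threshold."""
    data = img.astype(np.float64)
    padded = np.pad(data, 1, mode="edge")
    h, w = data.shape
    # Stack the nine shifted views that make up each pixel's 3x3 neighbourhood
    stack = np.stack([padded[r:r + h, c:c + w] for r in range(3) for c in range(3)])
    med = np.median(stack, axis=0)
    out = data.copy()
    bad = np.abs(out - med) > threshold
    out[bad] = med[bad]
    return out

def normalise_columns(img, air_rows):
    """Divide each column by its mean over rows assumed to contain only air."""
    data = img.astype(np.float64)
    ref = data[air_rows, :].mean(axis=0)
    return data / ref

# Example: a single dead pixel is filled, and a column-wise gain change is removed.
flat = np.full((5, 5), 100.0)
flat[2, 2] = 0.0
repaired = fill_outliers_with_median(flat, threshold=10.0)
striped = np.array([[2.0, 4.0], [2.0, 4.0], [1.0, 2.0]])
flattened = normalise_columns(striped, slice(0, 2))
```

Filling only the pixels that fail the threshold test, rather than median filtering the whole image, leaves the remaining pixels untouched and so avoids blurring fine detail.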
Threat Image Projection (TIP)
Threat Image Projection (TIP) is a technique first developed for baggage [51]. Most TIP methods insert a fictional threat from a database into an existing benign image. This can be used in Computer-Based Training (CBT) of operators, to assess performance [72], or improve it by increasing operators’ exposure to rare threat scenarios [27]. Similarly, in cargo, researchers are exploring how TIP imagery can be used to increase the competency of operators. So far, however, they have relied on screening experts to manually merge threat and innocuous images to form the TIP image using X-ray image merging software [50]. Moreover, some researchers are beginning to use TIP as a data augmentation methodology when training Machine Learning (ML) based ATD algorithms.
In CT baggage, TIP is complicated by the 3D nature of the images. TIP algorithms typically search for realistic placement volumes (voids) [44, 81] so that the projected threat does not intersect other objects which would create an unrealistic visual cue for operators. Researchers have defined metrics for View Difficulty, Superposition and Bag Complexity [68–70]. Such metrics can be used for adaptive CBT algorithms, where the difficulty of a given search task can be controlled. For example, if an operator is poor at finding threat items in certain contexts, such as complicated clutter, the algorithm can present more of these examples to improve performance under those contexts. Other researchers have noted that TIP imagery appears unrealistic unless realistic noise and artefacts, that match those of the other objects in the baggage image, are synthesised. For example, Megherbi et al. [44, 45] generate realistic metal artefacts in CT baggage, to ensure that artefacts in the threat are consistent with those in the rest of the baggage, and are not a visual cue for operators. Similar ideas are likely to be useful in cargo TIP, for example ensuring that magnification, pixel noise and scatter point-spread functions are consistent between the threat and the rest of the image.
In cargo, some authors have suggested methods for image synthesis for other purposes, but which could be applicable to TIP. White et al. [28] introduce a method for generating synthetic γ-ray cargo images, and use it as surrogate data for testing the effectiveness of different scanning systems when it is impractical to collect large amounts of empirical data. The authors derive an empirical model of the imaging system response from real images of well-characterised objects. They claim to incorporate system properties such as sensitivity, spatial resolution, contrast and noise. To synthesise a threat image, the authors simulate photon transmission using a commercial ray-tracing package, and then apply smoothing and Gaussian noise consistent with their empirical measurements. The ray-tracing software allows for simulation of complex-object models, such as those developed in Computer-Aided Design (CAD). After simulating the photon transport and detector-response model, the synthetic threat images are injected into real images. They perform this injection by pixel-wise multiplication of the synthetic threat image with the real image. This method comes directly from the Beer-Lambert law and assumes no cross-pixel effects such as scatter. We propose that synthesising threat images from 3D threat models could prove very useful, particularly for adding emerging threats to TIP libraries, for example CAD models of 3D-printed weapons.
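The multiplicative injection follows from the Beer-Lambert law in a few lines. The derivation below is a sketch for a monochromatic beam with no scatter; the symbols $I_0$ (unattenuated intensity), $\mu_b$ and $\mu_t$ (attenuation coefficients of the background cargo and the threat along the ray), and $T_b$ and $T_t$ (their transmissions) are introduced here for illustration and are not taken from [28].

```latex
\begin{align}
I(x, y) &= I_0 \exp\!\left( -\int \mu_b \,\mathrm{d}\ell - \int \mu_t \,\mathrm{d}\ell \right) \\
        &= I_0 \, \exp\!\left( -\int \mu_b \,\mathrm{d}\ell \right) \exp\!\left( -\int \mu_t \,\mathrm{d}\ell \right) \\
\frac{I(x, y)}{I_0} &= T_b(x, y) \, T_t(x, y)
\end{align}
```

Because the attenuation coefficients depend on photon energy, the factorisation is only approximate for a polychromatic beam, in addition to neglecting scatter as noted above.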
In the ML community, training data augmentation is used to improve the performance of ML-based algorithms. Data augmentation reduces overfitting by using label-preserving transformations to artificially enlarge the dataset [40]. Transformations must be realistic for the given imaging system, in order to make algorithms robust to natural variation. In visible spectrum imagery, examples of transformations include random crops (translation invariance), random flips (reflection invariance), and random addition of lighting (invariance to illuminance) [34]. In cargo X-ray imagery transformations could include variations in dose, perspective, material composition, and object orientation. Data augmentation is particularly useful in representation learning, such as deep CNNs, which are prone to overfitting if datasets are limited in size and variety.
For automated Image Understanding in cargo, researchers typically face the problem of unbalanced datasets. Whilst images of non-threat cargoes are abundant, images of threat cargo are usually very rare in the wild, so researchers often rely on capturing staged threat images. Threat staging is time consuming and expensive. Recently, researchers have begun to use TIP frameworks to help train and test ML-based Image Understanding algorithms. For example, staged threat images can be projected into innocuous Stream-of-Commerce (SoC) images, whilst adding realistic variation, to construct a balanced dataset of threat and benign images [36, 63].
Rogers et al. [63] use image synthesis to train a classifier for detecting loads in declared-as-empty containers. The method extracts a database of objects from real cargo X-ray images. An estimate of the background is obtained by exploiting the uniformity of the cargo container in the image vertical. Background removal is achieved by pixel-wise division of the cropped object by the background estimate. Extracted objects are then manipulated to create diversity in the training set. The authors include random object variations in translations, orientation, density, and volume. They also combine multiple random objects to form a composite object. The composite object is projected into real empty container images in a similar way to White et al. [28].
Jaccard et al. [36] follow a similar process to Rogers et al. [63], but for training a CNN from scratch to detect Small Metallic Threats (SMTs). They create very large numbers of threat images, with high variability in background appearances, by injecting threat images into real images by multiplication. They achieve background removal by manually delineating the threat item and background clutter, and dividing by the mean of the non-clutter background. To increase threat variability, the authors randomly position and flip the object, whilst varying the threat attenuation by a random factor between 0.95 and 1.05. The authors sample a total of 1.2 × 10⁴ threat backgrounds from a very large number of real cargo images.
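The projection steps described above can be sketched as follows. This is an illustrative sketch, not the implementation of [36] or [63]: images are assumed to be transmission maps in the range 0 to 1, attenuation scaling is assumed to act on the log-transmission (so the transmission is raised to a power), and the function names and arguments are hypothetical.

```python
import numpy as np

def extract_threat(patch, background_mask):
    """Divide a staged-threat patch by the mean of its clutter-free background pixels."""
    data = patch.astype(np.float64)
    bg_mean = data[background_mask].mean()
    return np.clip(data / bg_mean, 0.0, 1.0)

def project_threat(benign, threat_t, top, left, rng, scale_range=(0.95, 1.05)):
    """Insert a threat transmission map into a benign image by pixel-wise multiplication."""
    t = threat_t
    if rng.random() < 0.5:
        t = t[:, ::-1]  # random horizontal flip
    k = rng.uniform(scale_range[0], scale_range[1])
    t = t ** k  # scales the attenuation -ln(T) by the factor k
    out = benign.astype(np.float64).copy()
    h, w = t.shape
    out[top:top + h, left:left + w] *= t
    return out

# Example: extract a threat from a staged patch and project it into a benign image.
staged = np.array([[100.0, 100.0], [50.0, 100.0]])
mask = np.array([[True, True], [False, True]])
threat = extract_threat(staged, mask)
rng = np.random.default_rng(0)
tip_image = project_threat(np.full((4, 4), 0.8), threat, top=1, left=1, rng=rng)
```

Repeating the projection with different benign images, positions and random draws would yield many distinct training images from a single staged threat, which is the motivation for this form of augmentation.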
Recently, Jaccard et al. [38] have introduced a TIP module useful for training classifiers. They list a number of methods for adding realistic noise and variation to training data. These include: object volume scaling, by jointly scaling the in-plane area and the object attenuation; object density scaling, by scaling the object attenuation; object flips; formation of composite threat objects; addition of noise; and varying the background appearance. More recently, Rogers et al. [64] have introduced a method for magnifying the object according to the depth of the object in the scene. Since most X-ray scanners employ a divergent fan-beam, the object appears taller as it is moved closer to the source. They suggest generating a vertical scale factor α to model this magnification.
Material discrimination and segmentation
Material discrimination aims to identify the type of material at each pixel in the image. There is some crossover with Image Understanding, but we include it as an Image Preprocessing method, since features derived from the material information might help to improve the performance of Image Understanding methods. This has been the case in multi-view X-ray baggage, where material information is more complete [5].
The interactions of X-rays with a material vary depending on the type of material and the type of radiation. By studying the types of interactions occurring it is possible to identify the type of material by some characteristic such as its effective atomic number. To do this, measurements at multiple energies must be made on the material, either by illuminating it with multiple radiation sources [58], or by using a continuous spectrum of radiation energies and a detector that can resolve the difference in the energy spectrum after interaction [26].
Often the high-throughput requirements of commercial systems prevent more than two energies being used, permit at most a few views to be acquired, and require short exposures, leading to substantial image noise. The combined effect of these restrictions makes the signal insufficient for discrimination of individual atomic elements [12]. Instead researchers attempt to discriminate between groups of materials such as organics, light metals and heavy metals [58, 59]. Alternatively some researchers attempt to only identify high-Z materials [12, 25] as they can indicate the smuggling of radioactive materials or their shielding. Even in these simple cases, researchers have found it difficult to accurately discriminate materials from raw measurements on a pixel-wise basis, finding that it is necessary to incorporate spatial information into discrimination [58]. Thus, researchers have applied a number of image segmentation approaches to aid with discrimination.
The majority of the cargo material discrimination literature uses dual-energy X-ray systems and is based on α-curve [41, 56], R-curve [58], or H-L curve [82] methods. There has been little influence in cargo work from the baggage or medical domains, due to the much higher energy regime (Fig. 1). An example is the seminal work in CT by Alvarez and Macovski [2], which expands the attenuation coefficient as a set of intuitive basis functions. This approach works in the CT energy regime, where the photoelectric interaction, which depends strongly on atomic number, is dominant. In the cargo energy regime, however, the photoelectric interaction is subordinate to pair production and scatter.
The α-curve [41, 56], R-curve [58], and H-L curve [82] methods attempt to estimate the effective atomic number (Z) grouping (i.e. organic, light metals, heavy metals) by combining high and low energy transparencies to form a value that can be mapped to effective Z grouping using a lookup table. Authors tend to define the transparency T by normalising the image by the total number of photons (integrated over the range of energies E) emitted by the source and the detector sensitivity D(E).
The R-curve method is motivated by capturing transparencies at energies E1 and E2, and taking the ratio of their logs, R = ln T(E1) / ln T(E2).
For the monochromatic and single material case, the R-ratio is unique to the material atomic number Z0 and so materials can be discriminated, at least in theory. This method is well-suited to γ-ray imaging where the photons are emitted with quantised energies. However, in cargo, the X-ray source is not monochromatic and has a continuous Bremsstrahlung distribution. In this case R varies as a function of the material mass thickness. Nevertheless, one can attempt to recover the effective atomic number grouping at a pixel by experimentally measuring the R-ratio as a function of mass thickness to create a lookup table. There are difficulties at low mass thickness where the R-ratio versus mass thickness curves for different materials overlap.
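The thickness dependence described above can be made explicit with a short derivation. The notation is an assumption made for illustration: $\mu(E, Z)$ is the mass attenuation coefficient, $t$ the mass thickness, $S(E)$ the source spectrum, $D(E)$ the detector sensitivity, and the ratio is taken with $E_1$ in the numerator.

```latex
\begin{align}
T(E_i) &= \exp\!\big( -\mu(E_i, Z)\, t \big), \qquad i = 1, 2, \\
R &= \frac{\ln T(E_1)}{\ln T(E_2)} = \frac{\mu(E_1, Z)}{\mu(E_2, Z)}, \\
T_{\mathrm{poly}} &= \frac{\int S(E)\, D(E)\, \exp\!\big( -\mu(E, Z)\, t \big)\, \mathrm{d}E}{\int S(E)\, D(E)\, \mathrm{d}E}.
\end{align}
```

In the monochromatic case the mass thickness cancels, so $R$ depends only on the material. In the polychromatic case $\ln T_{\mathrm{poly}}$ is no longer linear in $t$, so the ratio of two such logarithms retains a dependence on $t$, which is why the lookup table must be measured as a function of mass thickness.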
The α-curve method computes its quantities from the high and low energy transparencies. Again, a lookup table is determined through experimentation.
Finally, the H-L curve method simply creates a lookup table using the high (H) and low (L) energy images I1 and I2.
The seminal work for dual-energy material discrimination for cargo was by Ogorodnikov and Petrunin [58, 59]. The authors introduce the R-curve method and attempt to classify materials into four groups: organics (hydrocarbon, Z ∼ 5.3); organics/inorganics (aluminium, Z ∼ 13); inorganics (iron, Z ∼ 26); and heavy substances (lead, Z ∼ 82). They use a prototype inspection system, with a 4/8 MeV cut-off Bremsstrahlung beam and a lead beam filter. They identify that the R-ratio crossover of iron and lead can be translocated by use of the filter, thus allowing improved discrimination for small mass thickness [58]. The authors first study the error when discriminating iron from hydrocarbon as a function of mass thickness, and find discrimination is optimal at 40–60 g/cm². They reason that discrimination error increases for lower mass thickness because there is not sufficient contrast between low and high energy images, and for larger mass thickness due to decreasing signal-to-noise. The authors note that, when discriminating between all four groups, material recognition is unreliable; in particular, the water-aluminium discrimination error reaches 40% even at the optimal mass thickness. To remedy this, they incorporate spatial information using a spatial clustering algorithm. All pixels within a given cluster are labeled as a single material based on lookup in the R-ratio table of the cluster mean values. Coloured material discrimination images with and without incorporation of the spatial information are shown in Fig. 6. Qualitatively, it is evident that the use of spatial information greatly improves image quality.
A few years later, Zhang et al. [82] introduced the H-L curve method. They also introduced a material intrinsic difference measure.
Since these initial works, other researchers have largely focused on high-Z detection, claiming that multi-group material discrimination is infeasible for commercial systems. For example, Fu et al. [12] claim that identifying the effective Z of the scanned objects is not practical because it requires high precision measurements and the noise in commercial systems is too large. Most have focused on the detection and segmentation of suspicious or high-Z materials.
Fu et al. [23] attempt to segment suspicious, shielded objects. They introduce a hybrid clustering approach which does not require a prior on the number of clusters or the size of clusters, but a prior on the step level, which determines the number of quantisation levels in the clustered image given the maximum image value. Hybrid clustering performs clustering followed by region growing. For clustering, each pixel is first compared to the mean of its neighbourhood. If the pixel is close to the mean, then its value is assigned as the quantisation of that mean. If it is not close, then they split the neighbourhood into quadrants, compute the means, and set the pixel value to the nearest quadrant’s quantised mean. They claim that this is faster than recursive K-means clustering and the Leader clustering used by Ogorodnikov and Petrunin [58, 59]. After clustering they do region merging, using the highest intensity region as the seed. To segment shielded objects, the authors iterate through the different quantisation levels, binarise the image by quantisation level, and then region fill based on gradients. If the intensity of a filled region is greater than that of its surroundings, then it is regarded as a shielded object. The method is tested on a cargo image with various amounts of shielded lead and tin. No quantitative measure of the performance is given, but the method appears to work well on the single test image presented.
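The per-pixel clustering rule described above can be sketched as follows. This is a loose illustration rather than the algorithm of [23]: the square neighbourhood, the closeness test (within half a step of the neighbourhood mean), the quadrant layout (four overlapping quadrants that each contain the centre pixel), and the function names are all assumptions. The region growing and shielded-object steps are not shown.

```python
import numpy as np

def quantise(value, step):
    """Snap a value to the nearest multiple of the step level."""
    return step * np.round(value / step)

def cluster_pixels(img, step, radius=1):
    """Assign each pixel a quantised local mean (neighbourhood mean, else nearest quadrant mean)."""
    data = img.astype(np.float64)
    h, w = data.shape
    out = np.empty_like(data)
    for y in range(h):
        for x in range(w):
            y0, y1 = max(0, y - radius), min(h, y + radius + 1)
            x0, x1 = max(0, x - radius), min(w, x + radius + 1)
            mean = data[y0:y1, x0:x1].mean()
            if abs(data[y, x] - mean) <= step / 2:
                out[y, x] = quantise(mean, step)
                continue
            # Pixel is far from the neighbourhood mean, so it probably lies near an edge:
            # fall back to the quadrant whose mean is nearest to the pixel value.
            quads = [data[y0:y + 1, x0:x + 1], data[y0:y + 1, x:x1],
                     data[y:y1, x0:x + 1], data[y:y1, x:x1]]
            means = [q.mean() for q in quads]
            nearest = min(means, key=lambda m: abs(m - data[y, x]))
            out[y, x] = quantise(nearest, step)
    return out

# Example: a sharp vertical edge between two flat regions is preserved.
edge = np.hstack([np.full((6, 3), 100.0), np.full((6, 3), 200.0)])
clustered = cluster_pixels(edge, step=50.0, radius=1)
```

The quadrant fallback is what keeps edges sharp in this sketch: a pixel next to a boundary is assigned the mean of the side it belongs to, rather than a blurred mixture of both sides.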
In a separate paper, Fu et al. [24] attempt to improve detection and reduce false alarms for high-Z detection. They apply their hybrid clustering described in [23]. After identifying regions that are shielded by low-Z materials, they attempt to separate the shielded object from the background by subtracting the shielding attenuation from the shielded attenuation. They claim that the approach yields improved high-Z detection. In another paper [12], they identify two sources of error: the edge effect at object edges, due to scatter, misalignment and digitisation; and Poisson noise. They propose a wavelet shrinkage denoising approach, which reduces false negatives and false positives, but no quantitative measure of performance is determined. The authors state that similar results can be achieved by use of a Wiener filter, but that it needs to be combined with morphological filtering.
Chen et al. [12] also focus on detecting high-Z material. They use a 6/9 MeV commercial system. No substantial details of the methods are given, although they state that the high-Z signature is generated using “dual-energy information processing, machine vision and topology analysis, and background object striping” [12]. They show an example of lead detection against a piece-wise varying background density, but no quantitative measure of performance is given.
In a recent paper, Ogorodnikov et al. [57] refer to their original work [58] and echo the sentiments of other researchers: that their previous approach to material separation is labile, unstable and not repeatable in practical implementation. In this paper, although their algorithmic methods are not detailed in full, the authors attempt 3-group (organics, mineral/light metals, metals) material discrimination but this time with a 3.5/6 MeV Bremsstrahlung beam. Additionally, they attempt to calculate the mass of the object under inspection. They claim a mass precision of <10% and an effective atomic number precision of ±1 in the optimal mass thickness range.
Recently, Li et al. [41] have proposed a solution to improve material recognition when two materials overlap in an image. The method requires prior information about one of the overlapping materials, which the authors argue is available in a practical setting from the shipping manifest; alternatively, if trying to separate container and contents, an assumption can be made about the container material. Their algorithm first performs a pre-classification based on the α-curve method, and then determines whether a region is more likely composed of a pure material or two overlapping materials. If composed of two materials, the next step decomposes the material into the two overlapping contributions. The final step is to perform recognition on the materials. To decompose overlapping materials, the authors use a method originating from Dual-Energy X-ray Absorptiometry (DEXA), which is used for measuring bone mineral density and soft-tissue composition of human bodies. The method uses quadratic approximations of the polychromatic transparencies of the high and low energy images. The authors test the algorithm on synthesised data and real data captured in a lab experiment, and achieve good qualitative results.
Other researchers have used simulations to investigate the possibility of material discrimination on systems that are not dual-energy. For example, Gil et al. [26] use Monte Carlo simulation to investigate the possibility of single-shot material discrimination. The single-shot method assumes that the detectors can measure the energy spectrum of the beam and can split it into a low and high energy component to determine the R-ratio. The authors simulate a Bremsstrahlung beam with 9 MeV cut-off, and a low-high division chosen at 4 MeV. They compare the one-shot R-ratio to a 4/9 MeV dual-energy simulation. Comparing the R-ratio for silver and tissue-equivalent plastic, it appears that the one-shot method has a greater discriminative effect, whilst potentially having the benefits of lower X-ray dose and faster scan time. However, it is unclear how this method would work in practice since the derivation of the one-shot R-ratio requires splitting of the energy spectrum before interactions with the scene. Furthermore, it is unclear whether the system model results in a realistic level of noise when compared to a commercial system.
Fantidis et al. [18] investigate potential mixed γ- and X-ray system architectures, and their ability to discriminate materials, through Monte Carlo simulation. They simulate three γ sources (60Co, 137Cs, and 88Y), and a 4/9 MeV dual-energy Bremsstrahlung beam. They test material discrimination performance on 165 materials using different dual, triple and quadruple combinations of the sources. They assess the potential performance of the system using the number of R-overlaps between different materials. They claim that the optimal selection of sources is 4 MeV Bremsstrahlung and 137Cs for dual, and 4/9 MeV Bremsstrahlung and 137Cs for triple. The optimal quadruple source system, although not specified, only offers a slight improvement over the optimal triple source system. There is no evidence that the authors attempt to model system noise and its effects on discrimination; other authors have found that R-values alone are not a good indicator of performance due to noise in the R-estimates when interrogating materials with small or large mass thickness [58].
For CT imagery of baggage, there have been several proposals for single- and dual-energy segmentation, some based on Machine Learning (ML), which we review here. The algorithms are designed for segmenting 3D volumes but aspects of the approaches may be transferable to 2D cargo. In CT baggage segmentation, algorithms must cope with a variable and unknown number of baggage items, each with a wide range of possible shapes and sizes [31]. This is in contrast to the medical domain, where segmentation tasks are prespecified, for example, the segmentation of a particular organ [31]. Therefore, baggage researchers have looked to design unsupervised algorithms that make no assumptions on the number of objects or on their composition.
The approach taken by Grady et al. [31] for single-energy CT first identifies object voxels, then identifies candidate object splits using the Isoperimetric Distance Tree (IDT) method [29], and finally evaluates good splits according to a novel Automatic QUality Assessment (AQUA) metric learnt from a large training set. The initial coarse segmentation uses a Mumford-Shah based method [30] applied to a preprocessed (denoised and artefact reduced) CT image. The AQUA method is based on a 42-dimensional descriptor from the prior literature on object segmentation, which includes features based on geometry, intensity, and gradients. To learn the AQUA model, the authors use Principal Component Analysis (PCA) to reduce dimensionality, then fit a Gaussian Mixture Model (GMM) over the PCA coefficients of all the segments in the training set using Expectation-Maximisation (EM). AQUA is used both to select the best candidate splits, and to select the best segmentation over three different parameter settings.
Mouton et al. [53] introduce a material-based segmentation for low resolution Dual-Energy CT (DECT) images representative of the aviation security environment. After preprocessing to reduce metal artefacts, the authors first perform a coarse segmentation based on the Dual-Energy Index (DEI) and connected component analysis. The DEI combines the high and low energy linear attenuation coefficients at each voxel to give a crude estimate of the material characteristics. The authors use a Random Forest (RF) model to guide the segmentation process by assessing the quality of individual object segments and the entire segmentation. For individual object segments, the trained RF model uses the same 42-dimensional descriptor proposed by Grady et al. [31]. The authors claim that using the RF approach outperforms AQUA in their aviation setting. The quality of full segmentations is assessed using the RF score of constituent objects weighted by the error in the number of segmented objects. The authors demonstrate that their approach outperforms three state-of-the-art segmentation techniques, including: IDT [29]; Symmetric Region Growing (SymRG) [78]; and 3D Flood-Fill region growing (FloodFill) [80].
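To make the DEI-based coarse segmentation step more concrete, the following is a minimal sketch of a per-voxel index followed by a simple binning into coarse material labels. It assumes the definition commonly quoted in the dual-energy CT literature, DEI = (x_L − x_H) / (x_L + x_H + 2000), with x_L and x_H the low and high energy CT numbers in Hounsfield units; the exact form used by Mouton et al. [53] is not given in this review. The function names and the `bin_edges` thresholds are hypothetical, and the connected component analysis is omitted.

```python
def dual_energy_index(hu_low, hu_high):
    """Dual-Energy Index of one voxel from its low/high-energy CT numbers (HU).

    Assumed definition: (low - high) / (low + high + 2000).
    """
    denominator = hu_low + hu_high + 2000.0
    if denominator == 0.0:
        # Air-like voxels (-1000 HU at both energies) carry no material signal.
        return 0.0
    return (hu_low - hu_high) / denominator


def coarse_material_labels(hu_low_voxels, hu_high_voxels, bin_edges):
    """Label each voxel with the index of the DEI bin it falls into.

    bin_edges is a hypothetical, ascending list of DEI thresholds.
    """
    labels = []
    for low, high in zip(hu_low_voxels, hu_high_voxels):
        dei = dual_energy_index(low, high)
        # Count how many thresholds the DEI value reaches or exceeds.
        labels.append(sum(dei >= edge for edge in bin_edges))
    return labels
```

In a fuller pipeline, voxels sharing a label would then be grouped by connected component analysis to form the candidate object segments that the quality model assesses.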
In cargo, material-based segmentation is much more challenging due to overlapping of materials and objects, and the inability to reconstruct linear attenuation coefficients that encode material information. However, the α-curve [41, 56], R-curve [58], and H-L curve [82] methods can provide crude (more so than DEI) material information that could potentially be used to initiate coarse segmentations. Similar methods to AQUA [31] and the RF approach of Mouton et al. [53] could be used to identify object splits and to assess overall segmentation quality. However, it is likely that extra metrics would be required to deal with overlapping objects without a priori information on the number of objects overlapping or their characteristics such as thickness and material. Methods have been proposed in multi-view baggage for layer separation that may be applicable to multi-view cargo [33]. To date, we are not aware of any proposals for cargo, or indeed single-view baggage, that can convincingly address these issues.
Discussion on Image Preprocessing
Of the topics identified in Image Preprocessing, by far the most work has been done on material discrimination. The methods are largely derived from physics, and as far as we know, no ML techniques (similar to baggage Refs. [31, 53]) have been applied to the subject due to the difficulty of obtaining sufficient data with accurate labeling. Additionally, since all authors tend to use different datasets from different commercial partners, or independent lab experiments, it is difficult to compare the performance between different contributions. Furthermore, most authors choose to evaluate performance qualitatively rather than quantitatively, and often using only a single image. We feel that researchers need to better quantify per-pixel classification performance so that different methods can be more easily compared. Moreover, we believe that the field would benefit from an open dataset available for researchers.
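As an illustration of the kind of quantitative reporting suggested above, the following minimal sketch computes overall per-pixel accuracy and per-class recall from two flattened label maps. The function names and class labels are hypothetical, and it assumes that a ground-truth material label is available for every pixel, which in practice is the difficult part.

```python
from collections import Counter


def pixel_accuracy(true_labels, predicted_labels):
    """Fraction of pixels whose predicted material class matches the ground truth."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label maps must be non-empty and of equal length")
    matches = sum(t == p for t, p in zip(true_labels, predicted_labels))
    return matches / len(true_labels)


def per_class_recall(true_labels, predicted_labels):
    """For each ground-truth class, the fraction of its pixels classified correctly."""
    if len(true_labels) != len(predicted_labels) or not true_labels:
        raise ValueError("label maps must be non-empty and of equal length")
    totals = Counter(true_labels)
    correct = Counter(t for t, p in zip(true_labels, predicted_labels) if t == p)
    return {label: correct[label] / totals[label] for label in totals}
```

Reporting per-class figures alongside the overall accuracy matters here because material classes are typically unbalanced, so a method that rarely predicts metals could still score a high overall accuracy.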
Three main methods have been introduced for initial pixel classification: the R, α, and H-L curve methods. It is not immediately clear whether any of these performs better than the others, because researchers have not yet compared the different methods on the same dataset. Such a study is a future avenue for research in the area. In particular, when a new method is introduced, it should be compared to the methods already existing in the literature.
For image manipulation and image quality improvement, there is a need to evaluate, compare and understand different techniques in terms of their effect on the performance of machine and human Image Understanding. Such work has been attempted in the baggage domain [55]. For TIP, some work in cargo has been done as a preprocessing step for training automated Image Understanding algorithms, and TIP methods have only just been put through basic experimental validation by Rogers et al. [64]. The effects of training ML algorithms on synthesised threat images are yet to be fully understood.
Although no work has been done on verifying that training ML algorithms on TIP-augmented cargo data actually boosts performance, there is evidence from other fields, across many tasks, that such augmentation helps [13, 32]. Several problems also remain. For example, it is difficult to generate out-of-plane rotations, so augmentation is usually limited to in-plane rotations of the staged threat items. Potential remedies include either developing a framework for collecting the optimal number of threat poses to make accurate interpolation of intermediate out-of-plane rotations, or generating realistic threat images from realistic 3D CAD models of threats [28, 79]. The requirements for a solution are that out-of-plane rotations are accurate, realistic, and can be computed efficiently or on-the-fly. Whilst the interpolation approach would be fast, it may be difficult to obtain good accuracy without capturing a large number of projections due to the complicated fan-beam geometry. Conversely, the CAD approach would enable more accurate computation of out-of-plane poses, but it is unclear how realistic the generated threat image would be and how fast it can be computed if accurate photon transport models are required.
Image Understanding
Automated Image Understanding tasks in cargo can be split into the themes of Automated Contents Verification (ACV) and Automated Threat Detection (ATD). We give an overview of the most pertinent works in the literature in Table 1.
Automated Contents Verification
ACV checks whether the cargo contents match those stated on the shipment manifest. This can range from Empty Cargo Verification (ECV) to full Manifest Verification (MV). ECV can be useful for increasing throughput, since declared-as-empty cargoes (20% of all containers) can be sent through a separate automated inspection lane. ECV examples are given in Fig. 7. Containers may be falsely declared as empty in shipping fraud, or may be exploited in rip-on/rip-off smuggling operations. False declared-as-empty cargo containers can also pose safety hazards during container stacking at ports due to the unexpected additional weight. MV compares the X-ray image to the Harmonised System (HS) codes declared on the manifest. Each HS code defines a different broad category of cargo type, for example, live animals, animal products or vegetable products.
The first work on ECV was by Chalmers et al. [9, 10], who use “readily available” algorithms to segment the container region and compute metrics that are then compared with empty containers of the same size. No specific details are given on the algorithms or their performance, but we interpret Ref. [10] as follows. The container is classified by generating an intensity histogram of the segmented cargo region and comparing it to histograms from historical empty images. The comparison is made using histogram metrics such as minimum, maximum, mean, and standard deviation. Another method is briefly described by Orphan et al. [60], which segments the image (e.g. floor, walls, and roof) and then applies an unspecified rule-based object detection algorithm. The authors report 97.2% accuracy (with 0.4% false negatives) when classifying SoC images as empty or non-empty.
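The histogram-metric comparison described above might look something like the following minimal sketch. Since no details are published, the decision rule is an assumption: the container is accepted as empty only if each summary statistic lies within a fixed tolerance of the average of that statistic over historical empty images. Intensities are assumed to be normalised to [0, 1], and the function names and `tolerance` parameter are hypothetical.

```python
import statistics


def intensity_summary(pixels):
    """Minimum, maximum, mean and standard deviation of cargo-region intensities."""
    return {
        "min": min(pixels),
        "max": max(pixels),
        "mean": statistics.fmean(pixels),
        "std": statistics.pstdev(pixels),
    }


def looks_empty(pixels, empty_references, tolerance=0.05):
    """Compare a container's summary statistics with those of historical empties.

    empty_references is a list of pixel lists, one per historical empty image.
    """
    candidate = intensity_summary(pixels)
    references = [intensity_summary(ref) for ref in empty_references]
    for name, value in candidate.items():
        reference_value = statistics.fmean([ref[name] for ref in references])
        if abs(value - reference_value) > tolerance:
            return False  # at least one statistic deviates from the empty profile
    return True
```

A rule of this kind is cheap to compute, but it discards all spatial information, which may explain why later work moved to window-based features.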
More recently, Rogers et al. [63] have attempted ECV by detecting loads within cargo containers. They claim that ECV is difficult due to container parts that locally appear similar to small loads, and due to variation in container types (e.g. refrigerated units, bulk units, 20 ft or 40 ft General Purpose). The task is further complicated by container damage and detritus, which the algorithm must learn to ignore. Their method splits the image into a grid of small 96 × 96 pixel windows. Then, for each window, they compute image moments and oriented Basic Image Features (oBIFs) at a range of scales. They feed the features, along with the window spatial coordinates, into a Random Forest (RF). The authors claim that the spatial coordinates allow the RF to implicitly learn the range of possible empty container appearances at different locations. The classification decision for the image is determined by taking the maximum score of the windows composing the image and comparing it to a tunable threshold. The authors generate synthetic examples (TIP) of non-empty containers in order to train the algorithm; this allows training on more difficult examples than those found in the SoC. The algorithm is tested on both real SoC data and difficult synthetic examples. On the SoC data, it is able to detect 99.3% of non-empty containers while raising 0.7% false alarms on truly empty containers. On difficult examples they are able to achieve 90% detection for loads similar to 1.5 kg of cocaine or 1 L of water, while raising false alarms for 1-in-605 or 1-in-197 containers, respectively.
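The window-wise decision rule described above can be sketched as follows. The sketch covers only the gridding and the maximum-over-windows decision; the scoring function is a hypothetical stand-in (mean attenuation within the window) for the trained RF operating on moments, oBIFs and window coordinates. It also assumes non-overlapping windows, with any partial windows at the image border dropped, and intensities normalised to [0, 1].

```python
def window_origins(height, width, size=96):
    """Top-left corners of non-overlapping size x size windows (border remainder dropped)."""
    return [(row, col)
            for row in range(0, height - size + 1, size)
            for col in range(0, width - size + 1, size)]


def toy_window_score(image, row, col, size):
    """Hypothetical stand-in for the RF score: mean attenuation (1 - intensity)."""
    values = [image[r][c]
              for r in range(row, row + size)
              for c in range(col, col + size)]
    return 1.0 - sum(values) / len(values)


def image_is_non_empty(image, threshold, size=96, score=toy_window_score):
    """Flag the image as non-empty if its highest-scoring window exceeds the threshold."""
    height, width = len(image), len(image[0])
    scores = [score(image, row, col, size)
              for row, col in window_origins(height, width, size)]
    return max(scores) > threshold
```

Taking the maximum over windows means that a single small load is sufficient to flag the whole container, at the cost of making the image-level false alarm rate sensitive to any one window that is mis-scored.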
Andrews et al. [3] have recently used ECV as a test problem for anomaly detection using auto-encoders. They use cargo X-ray images of empty and non-empty containers down-sampled to 32 × 9 pixels. In anomaly detection, the algorithm is trained on the normal class only. In one test they considered empty containers as normal, and the loaded containers as anomalies; in another test, the classes were reversed. The authors derive a number of features from the hidden layer of a trained sparse auto-encoder, including: the hidden representation; the scalar residual magnitude; the signed residual (with and without normalisation by the root-mean-squared residual); the absolute residual; and the squared residual (with and without normalisation by the mean-squared residual). The features are classified using a one-class Radial Basis Function Support Vector Machine (RBF-SVM). When considering non-empty containers as the normal class, they find that the RBF-SVM achieves best classification accuracy (92.99%) when fed the hidden representation as a feature. When considering empty containers as the normal class, the best accuracy (99.2%) is achieved when the normalised squared residual is used as the feature.
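The residual-based features listed above can be illustrated with a short sketch. It assumes that the residual is the element-wise difference between the input image and the auto-encoder reconstruction, and that the scalar magnitude is the Euclidean norm of that residual; neither detail is stated in this review. The auto-encoder itself and the one-class RBF-SVM are omitted, and the function and key names are hypothetical.

```python
import math


def residual_features(image, reconstruction):
    """Residual-based features from a flattened image and its reconstruction."""
    residual = [x - y for x, y in zip(image, reconstruction)]
    squared = [r * r for r in residual]
    mean_squared = sum(squared) / len(squared)
    root_mean_squared = math.sqrt(mean_squared)
    perfect = mean_squared == 0.0  # avoid dividing by zero for exact reconstructions
    return {
        "magnitude": math.sqrt(sum(squared)),
        "signed": residual,
        "signed_normalised": residual if perfect
        else [r / root_mean_squared for r in residual],
        "absolute": [abs(r) for r in residual],
        "squared": squared,
        "squared_normalised": squared if perfect
        else [s / mean_squared for s in squared],
    }
```

The normalised variants describe where the reconstruction error is concentrated rather than how large it is overall, which may be why they suit the case in which empty containers form the normal class.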
There have been two published attempts at MV [75, 83]. MV is a multi-class classification task, where cargo containers are classified according to HS code. Tuszynski et al. [75] used the median grey-level image histogram of each HS code in a training set. They then use a weighted city block distance to compare a given example to each HS code model. This approach yields an overall accuracy of 48% given a false positive rate of 5%. This result is improved slightly by Zhang et al. [83], who use a Leung-Malik filter bank to construct a visual codebook as a texture descriptor. They determine that this outperforms Scale-Invariant Feature Transform (SIFT) when classifying cargo images according to their HS code. Note that the authors ignore “non-classical” examples, which they define as those containers that are less than half filled with cargo. We feel that for a real-life deployable system, such examples should be included, since an adversary could purposefully choose to only half fill a container when smuggling or to avoid duties.
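The median-histogram approach of Tuszynski et al. [75] can be sketched as a nearest-model classifier. The sketch below assumes that each HS code model is the bin-wise median of the training histograms for that code, and that the per-bin weights are supplied by the user; how the authors chose their weights, and how they enforce the false positive rate, is not described here. All function names are hypothetical.

```python
import statistics


def median_histogram(histograms):
    """Bin-wise median over the training histograms of one HS code."""
    return [statistics.median(bin_values) for bin_values in zip(*histograms)]


def weighted_city_block(histogram, model, weights):
    """Weighted L1 (city block) distance between a histogram and a class model."""
    return sum(w * abs(h - m) for w, h, m in zip(weights, histogram, model))


def closest_hs_code(histogram, models, weights):
    """Return the HS code whose median histogram is nearest to the given example."""
    return min(models,
               key=lambda code: weighted_city_block(histogram, models[code], weights))
```

For manifest verification, the distance to the declared HS code model could instead be compared against a threshold, so that a container is flagged when its image is too far from what the manifest claims.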
Automated Threat Detection
Currently, there are few publications on cargo ATD; much more work has been done for baggage screening. The first such paper was on detecting cars that may be stolen or undeclared to avoid duties. Jaccard et al. [35] use oBIF histograms computed at a range of scales and an RF classifier. They oversample car windows to boost the number of car examples in the training set. Using a Leave-One-Out-Cross-Validation (LOOCV) scheme they determine a detection rate of 100% of car-containing containers while raising less than 1% false positives on SoC non-car containers. The authors also investigate other features such as intensity histograms, log-intensity histograms, and Basic Image Features (BIFs), but found these inferior to using oBIFs. In a later paper [38], the authors were able to improve performance to a 100% detection rate for a false alarm rate of 0.41%, by including more oBIF scales.
Zheng and Elmaghraby [84] propose a method for ATD in vehicles by detecting anomalous regions within images. They use backscatter images (top view and two side views) and a transmission image (side view) captured from an AS&E OmniView® Gantry. They perform a window-wise correlation analysis comparing a fresh image of the vehicle to a historical image of the same vehicle stored in a database. Images are split into 64 rectangular 4 × 16 pixel windows, and the correlation between windows in similar positions in the analysed and historical images is computed, resulting in a 64 × 64 matrix of window correlation values. A given window is classified as anomalous if the maximum of the corresponding matrix row is below a threshold. No quantitative evaluation of the performance is given. A criticism of this proposed method is that an anomalous region will very rarely indicate an actual threat and so the false positive rate is likely to be extremely high.
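The window-wise correlation analysis can be sketched as follows. The sketch assumes Pearson correlation between flattened windows, computes the full matrix between every fresh window and every historical window, and treats constant windows as uncorrelated with everything; the correlation measure and threshold actually used by the authors are not specified in this review. The function names are hypothetical.

```python
import math


def pearson(a, b):
    """Pearson correlation of two equal-length pixel lists (0.0 if either is constant)."""
    n = len(a)
    mean_a, mean_b = sum(a) / n, sum(b) / n
    covariance = sum((x - mean_a) * (y - mean_b) for x, y in zip(a, b))
    var_a = sum((x - mean_a) ** 2 for x in a)
    var_b = sum((y - mean_b) ** 2 for y in b)
    if var_a == 0.0 or var_b == 0.0:
        return 0.0
    return covariance / math.sqrt(var_a * var_b)


def anomalous_windows(fresh_windows, historical_windows, threshold):
    """Indices of fresh windows whose best match in the historical image is too weak."""
    matrix = [[pearson(fresh, old) for old in historical_windows]
              for fresh in fresh_windows]
    # A window is anomalous when the maximum of its matrix row is below the threshold.
    return [index for index, row in enumerate(matrix) if max(row) < threshold]
```

Taking the row maximum rather than the diagonal entry makes the test tolerant to small shifts between scans, since a window only needs to match some window of the historical image.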
Jaccard et al. [36] attempt to detect threats that are “akin to small metallic objects (e.g. drill)”; the exact nature of the threats is censored to prevent keyword searching. The method uses CNNs trained-from-scratch on an augmented dataset, with real threat images projected into images from the SoC (TIP). The authors found that a 9-layer shallow network architecture (Krizhevsky et al. [40]) and a very deep 19-layer architecture (Simonyan and Zisserman [71]) both performed well. The shallow network uses convolutional layers with large receptive fields, each followed by a max pooling layer, whereas the very deep network uses convolutional layers with small receptive fields, stacked in twos or threes between each max pooling layer. In both cases the classification decision from the fully connected output layer is made using the softmax function. The authors compare the CNNs to an oBIF+RF method similar to that previously used to detect cars [35]. Both the shallow and very deep networks provided a large boost in performance over oBIF+RF, with the very deep network performing slightly better than the shallow network. The authors report a false alarm rate of 0.8% given 90% detection. Examples of SMT results for the CNN approach are given in Fig. 8.
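Operating points of the form "false alarm rate given 90% detection" recur throughout this section, so a minimal sketch of how such a figure can be computed from classifier scores may be useful. It assumes that higher scores indicate a threat, and it picks the highest threshold that still detects at least the target fraction of threat images; this is one common convention, not necessarily the one used by the authors. The function name is hypothetical.

```python
import math


def false_alarm_rate_at_detection(threat_scores, benign_scores, target_detection=0.9):
    """False alarm rate at the highest threshold reaching the target detection rate.

    Returns (false_alarm_rate, threshold); an image is flagged when score >= threshold.
    """
    if not threat_scores or not benign_scores:
        raise ValueError("both score lists must be non-empty")
    ranked = sorted(threat_scores, reverse=True)
    # Number of threats that must be detected; the small offset guards against
    # floating-point products such as 0.7 * 10 landing just above an integer.
    needed = max(1, math.ceil(target_detection * len(ranked) - 1e-9))
    threshold = ranked[needed - 1]
    false_alarms = sum(score >= threshold for score in benign_scores)
    return false_alarms / len(benign_scores), threshold
```

Reporting the false alarm rate at a fixed detection rate, rather than a single accuracy figure, makes results from different papers easier to compare when the class balance of their test sets differs.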
Most recently, Jaccard et al. [37] have revisited their car detection work [35] and applied a trained-from-scratch very deep 19-layer CNN [71]. The authors again use window oversampling to increase the number of car training examples. A method based on Pyramid Histograms of Visual Words (PHOW) was also assessed. The authors find that the CNN approach yielded 100% detection and 0.22% false alarms, and was able to detect even heavily obscured cars. Moreover, the CNN approach yielded 5-fold and 1.5-fold improvements in false alarm rate over the PHOW-based method and oBIF+RF method used in Ref. [35]. Examples of car detection results are given in Fig. 9.
ATD for baggage
More ATD research has been carried out in baggage, and detailed summaries can be found in the review by Mouton et al. [52]. We give a brief overview of the points relevant to cargo.
Several different X-ray imaging modalities are used in baggage screening. These range from single-view [62], to multi-view [5, 49], to full 3D Computed Tomography (CT) [19–21, 54]. Classification performance typically improves from single view to CT as more information becomes available. The challenge is how to best use this information.
The general consensus amongst the baggage community is that classification based on X-ray image data is more challenging than visible spectrum data, and that direct application of methods frequently used in natural images (such as SIFT, Rotation Invariant Feature Transform, and Histogram of Oriented Gradients) does not perform well [67]. However, the performance can be improved by utilising the characteristics of X-ray baggage images. For example, researchers have found that object detection can be improved by augmenting multiple views, using a false colour material image (where pixels are coloured according to the type of material) [4], or using simple descriptors such as density histogram (DH) or density gradient histogram (DGH) [20, 21].
While it has been widely reported that texture descriptors in baggage scans perform poorly due to the lack of texture in X-ray examples [4, 67], the texture visible in cargo X-ray images does differ significantly between images. Medium to low density cargo (such as tyres and machinery) often contains a lot of complex articulated texture, while high density cargo (such as barrels of oil) has a more uniform appearance. This is possibly why researchers in cargo have enjoyed more success with texture descriptors such as oBIFs [35, 63] or visual codebooks based on a Leung-Malik filter bank [83].
Franzel et al. [22] propose a method of fusing detection results from multiple single views to exploit the extra information available from multi-view. They use a voting-based scheme where detection confidence is increased if rays from detection points from single views intersect in 3D. The motivation is to suppress false alarms, since they do not coincide in different views, and to reinforce detections that do. The detection confidence on the single view images is determined by sliding a window over the image, computing Histogram of Oriented Gradients (HOG) as features and using a linear SVM. They address in-plane rotations using a non-maximum suppression scheme, since HOG features are not rotation invariant. Moreover, they claim that the multi-view voting fusion scheme handles out-of-plane rotations. They achieve significantly better detection with their multi-view scheme (80%) over single view (50%) for a 50% false alarm rate.
Baştan et al. [5] propose a different multi-view approach. Instead of fusing single-view classifier confidences, they fuse single-view features. The authors experiment with sparse interest point detectors and dense sampling, with SIFT descriptors and their derivatives (GLOH, CGLOH and CSIFT), as well as the domain spin image descriptor (SPIN) and two novel variants, ESPIN and CSPIN, which incorporate energy information. ESPIN is the concatenation of SPIN descriptors computed on the high and low energy images separately, and CSPIN is the concatenation of SPIN descriptors computed on each channel of the material-coloured image. The authors use a linear Structural SVM (S-SVM) with a branch-and-bound sub-window search framework, which is shown to be more efficient than classical sliding windows. They found both ESPIN and CSPIN performed better than SIFT and SPIN alone, with CSPIN achieving best performance. Like Franzel et al. [22], Baştan et al. [5] find that their multi-view feature concatenation approach performs better than single view. Moreover, their approach performs significantly better than the approach adopted by Franzel et al. [22].
Multi-view fusion approaches similar to those proposed by Baştan et al. [5] and Franzel et al. [22] might be applicable to multi-view fusion in cargo; however, performance is likely to be far worse due to the additional complexity. We feel that a possible approach to multi-view detection, for both baggage and cargo, would be to feed the different views into a CNN as separate channels or separate streams. The CNN can learn to jointly use information from the separate views to make better classifications. For 3D shape recognition, Su et al. [73] have found that CNNs fed with multiple 2D views as inputs perform better than state-of-the-art 3D shape descriptors. It would be an interesting study for ATD in CT, particularly if better performance can be obtained without having to reconstruct the full 3D baggage image.
Recently, Akçay et al. [1] have applied CNNs to ATD in single-view baggage imagery. They recognise that there is a problem with training CNNs from scratch due to the limited availability of data. Thus they adopt a transfer learning approach by taking a pre-trained CNN, primarily trained for general image classification tasks, and fine-tuning it for ATD in X-ray baggage. The pre-trained CNN follows the architecture introduced by Krizhevsky et al. [40], consisting of 5 convolutional layers and 3 fully-connected layers, and is trained on the ImageNet dataset. The authors re-use the generalised feature extraction and representation in the lower layers of the CNN, whilst fine-tuning the upper layers. This achieves 99.26% detection and 0.74% false positives, which significantly outperforms prior work in the field. The authors do not comment on the possibility of training a CNN from scratch on data augmented with TIP imagery and realistic variation, such as the work in cargo by Jaccard et al. [36]. Since TIP methods are well-developed for baggage imagery, it would be interesting to compare a pre-trained and a trained-from-scratch CNN.
Discussion on Image Understanding
It has been just over a decade since publications started to emerge on cargo Image Understanding. In initial works, algorithms were typically based on computing simple features (such as maximum image intensity) and applying intuitive hard-coded rules [60], or by simple comparisons of an image with historical images from a database [9, 84]. Since these initial works, researchers have started to apply ML methods to learn the rules, and even features, from data. Researchers have found that limited access to large, labeled, datasets is still a problem and have started to use Threat Image Projection (TIP) to increase the total amount of training data and the amount of variation within it [36, 63]. Other researchers, in baggage, have chosen to take CNN models trained for recognition tasks on natural images, and fine-tune them for high performance on X-ray imagery [1].
The use of Deep Learning methods, such as CNNs, where feature extraction, representation and classification are learnt simultaneously, shows great promise [1, 38]. Such methods have been shown to achieve superhuman performance in a number of visual tasks, including face recognition and image categorisation [40]. It is, therefore, reasonable to believe that these methods can, and will, outperform humans at visual inspection of X-ray images. The main obstacle to achieving this is the lack of a very large cross-vendor SoC dataset complete with labels, from which a CNN can be trained from scratch and compared to baseline professional human operator performance.
We feel that the main problem with the cargo Image Understanding field is the lack of open datasets for researchers to score and compare methods on. Although it is unlikely that such datasets will be made available for threat items such as weapons, datasets could be made available which contain benign non-sensitive items. If the dataset was labeled with anonymised manifest information (e.g. HS-codes), we feel it would provoke wider interest in the field, since X-ray cargo images pose a very different problem to natural images.
There are many avenues for future research in the field, due to its relative infancy. It would be interesting to see how Deep Learning based object categorisation and semantic segmentation methods work on X-ray cargo images. Such methods could find good use as a form of Automated Contents Verification in Assisted Inspection or Selection. In particular, customs agencies store very large collections of cargo images complete with manifest information (labeling), which would be ideal for training CNNs from scratch. However, these datasets are notoriously difficult for researchers to gain access to. Alternatively, transfer learning approaches similar to Akçay et al. [1] could be explored.
Another future challenge is to develop generalised algorithms that work on images from multiple scanning architectures. So far, algorithms have been developed for a single type of scanner from a single vendor. As far as we know, no researchers have evaluated their algorithms on images from different scanners, and so it is not evident that algorithms would generalise well. Generalisation might be achievable by using transfer learning methods to fine-tune algorithms to specific scanning architectures, or by developing data augmentation techniques that transform images so that they appear as if captured from different scanning architectures.
Conclusion
Automated Analysis of cargo X-ray imagery is still a relatively young field. Over the last decade, more attention has been paid to aviation image analysis (such as baggage), since problems are generally more tractable, and because there has been more funding directed towards aviation due to the more perceivable immediate threat from terrorism. Typically, most work in cargo has been kept in-house by industry for commercial and security reasons. However, academics are beginning to form relationships with industry partners, gaining access to large image datasets with which to work.
In comparison to natural images, cargo X-ray images offer an interesting and difficult challenge for researchers, since objects are translucent, making occlusions difficult to disentangle, and the images are usually very cluttered and noisy, whilst appearing skewed in perspective due to the geometry of the X-ray beam. Furthermore, image contents are often more varied than images from the baggage or medical X-ray imaging domains, since a very diverse range of objects are shipped inside containers. We believe that more researchers would become involved in the field if data was easier to get hold of, for example, through the creation of large, labeled, open datasets.
During this review we have identified several open questions and avenues for future research, which we now summarise.
First, there is a need for a comparison study of different image preprocessing techniques (i.e. denoising, manipulation and correction), so that their effects on the performance of human and algorithmic Image Understanding can be understood. It might be that for CNN-based methods, denoising is not essential, but that performance can be improved considerably using some image manipulation. There is support for this in the work by Jaccard et al. [36, 37], who found that log transforming images helped CNN-based ATD considerably.
Second, would ML-based material discrimination work better than the current physics-derived methods? ML methods might be better at exploiting spatial or contextual information to help in the presence of heavy noise found in commercial systems. With enough available data it might be possible to learn the material mapping using a fully Convolutional Neural Network [43].
Third, do dual-energy systems actually aid automated Image Understanding? For example, can derived material information be used as a feature for ML algorithms? And can the R, α, or H-L curves improve CNN approaches by being fed into the input channels?
Fourth, the application of Deep Learning methods needs to be extended to Automated Contents Verification; in particular, we feel they would be well suited to multi-class manifest verification.
Fifth, how do current and future algorithms compare to human operator performance? More work needs to be done on measuring baseline human performance; however, there may be issues about disclosing these results to the public.
Finally, how transferable are currently developed algorithms - do they generalise to different scanning architectures? If not, can this be achieved through adequate data augmentation or transfer learning techniques?
Footnotes
1 Determined based on British Airways cabin bag size allowance of 56 cm × 45 cm × 25 cm.
2 Details of Crystal Clear™ are difficult to find, but the function “optimises image contrast and resolution to bring out picture details” according to a public verbal communication by Andreas Kotowski (Rapiscan Systems CTO) in 2001.
Acknowledgments
Funding for this work was provided through the EPSRC Grant no. EP/G037264/1 as part of UCL’s Security Science Doctoral Training Centre, and Rapiscan Systems Ltd.
