26• Image Processing
26• Image Processing Camera Calibration For Image Processing Abstract | Full Text: PDF (255K) Holo...
348 downloads
2873 Views
7MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
26• Image Processing
26• Image Processing Camera Calibration For Image Processing Abstract | Full Text: PDF (255K) Holographic Storage Abstract | Full Text: PDF (465K) Image and Video Coding Abstract | Full Text: PDF (459K) Image Color Analysis Abstract | Full Text: PDF (356K) Image Enhancement Abstract | Full Text: PDF (671K) Image Processing Abstract | Full Text: PDF (21418K) Image Processing Equipment Abstract | Full Text: PDF (594K) Image Reconstruction Abstract | Full Text: PDF (649K) Image Texture Abstract | Full Text: PDF (349K) Infrared Imaging Abstract | Full Text: PDF (15901K) Microscope Image Processing and Analysis Abstract | Full Text: PDF (316K)
file:///N|/000000/0WILEY%20ENCYCLOPEDIA%20OF%20EL...ECTRONICS%20ENGINEERING/26.Image%20Processing.htm17.06.2008 15:21:42
file:///N|/000000/0WILEY%20ENCYCLOPEDIA%20OF%20ELECTRICA...TRONICS%20ENGINEERING/26.%20Image%20Processing/W4111.htm
}{{}}
●
HOME ●
ABOUT US ●
CONTACT US ●
HELP
Home / Engineering / Electrical and Electronics Engineering
Wiley Encyclopedia of Electrical and Electronics Engineering Camera Calibration For Image Processing Standard Article Chanchal Chatterjee1 and Vwani P. Roychowdhury2 1GDE Systems Inc., San Diego, CA 2UCLA, Los Angeles, CA Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4111 Article Online Posting Date: December 27, 1999 Abstract | Full Text: HTML PDF (255K)
●
●
● ●
Recommend to Your Librarian Save title to My Profile Email this page Print this page
Browse this title ●
Search this title Enter words or phrases ❍
❍ ❍
Advanced Product Search Search All Content Acronym Finder
Abstract The sections in this article are Camera Calibration Model Linear Methods of Noncoplanar Camera Calibration Nonlinear Methods of Noncoplanar Camera Calibration Robust Methods of Noncoplanar Camera Calibration Coplanar Camera Calibration Experimental Results Concluding Remarks Keywords: camera calibration; noncoplanar calibration; coplanar calibration; extrinsic parameters; intrinsic parameters; robust estimation About Wiley InterScience | About Wiley | Privacy | Terms & Conditions Copyright © 1999-2008John Wiley & Sons, Inc. All Rights Reserved.
file:///N|/000000/0WILEY%20ENCYCLOPEDIA%20OF%20ELE...S%20ENGINEERING/26.%20Image%20Processing/W4111.htm17.06.2008 15:22:02
CAMERA CALIBRATION FOR IMAGE PROCESSING
CAMERA CALIBRATION FOR IMAGE PROCESSING The camera calibration problem is well-studied in photogrammetry and computer vision, and it is crucial to several applications in manufacturing, metrology, and aerospace. The topic includes the body of techniques by which we compute (1) the three-dimensional (3-D) position and orientation of the camera relative to a certain world coordinate system (exterior orientation) and (2) the internal camera geometric and optical characteristics (interior orientation). In photogrammetric terminology, the exterior orientation of a camera is specified by all the parameters that determine the pose of the camera in the world reference frame. The parameters consist of the position of the center of perspectivity and the direction of the optical axis. Specification of the exterior orientation therefore requires three rotation angles and three translation parameters and is accomplished by obtaining the 3-D coordinates of some control points whose corresponding positions in the image are known. The interior orientation of a camera is specified by all the parameters that relate the geometry of ideal perspective projection to the physics of an actual camera. The parameters include the focal length, the image center, the scale factor, and the specification of the lens distortion. There are two types of camera calibration problems: (1) noncoplanar calibration, in which the world points lie on a 3-D surface, and (2) coplanar calibration, in which the world points are on a two-dimensional (2-D) plane. A solution to all extrinsic and intrinsic calibration parameters requires nonlinear optimization. In some applications such as passive stereo image analysis, camera calibration may be performed slowly to obtain very accurate results. However, there are several applications that require repeated computation of the camera parameters at real-time or near-real-time speeds. Examples are: (1) determining the position of a camera mounted on a moving airplane, (2) 3-D shape measurement of mechanical parts in automatic parts assembly, (3) part dimension measurement in automatic part machining, and (4) navigation of a camera-mounted land vehicle. Thus, in some applications a complete nonlinear parameter estimation procedure can be employed, whereas others require fast and near-realtime calibration algorithms. In light of these varying applications, we find many algorithms for camera calibration, each satisfying a set of applications with different levels of accuracy. The literature for camera calibration is vast and we offer a brief review. Many earlier studies (1–7) have primarily considered the extrinsic parameters, although some intrinsic parameters are also computed only for the noncoplanar case. These methods are usually efficient mostly due to the computation of linear equations. However, the methods generally ignore nonlinear lens distortion, and the coplanar case is also dealt with inadequately. For example, Sobel (6) used the basic pinhole camera model and utilized nonlinear optimization methods to compute 18 parameters. The nonlinear approach is similar to the parametric recursive least-squares method discussed below. He did not model lens distortion, and the system depended on the user to provide initial parameters for the optimization technique. Gennery (8) found the camera parameters iteratively by minimizing the error of epipolar constraints, but the method is too error-prone as observed by Tsai (9). 
Yakimovsky and Cunningham (7) and Ganapathy (2) also used the pinhole model and treated some combinations
743
of parameters as single variables in order to formulate the problem as a linear system. However, in this formulation, the variables are not completely linearly independent, yet are treated as such. Furthermore, these methods mostly ignore constraints that the extrinsic and intrinsic parameters must obey, and hence the solutions are suboptimal. For example, the orthonormality constraints on the extrinsic rotation parameters are not strictly imposed. Grosky and Tamburino (4) extended the above linear methods to also include a skew angle parameter, and then they satisfied the orthonormality constraints. In general, linear methods are computationally fast, but in the presence of noise the accuracy of the final solution is relatively poor. Tsai (9) and Lenz and Tsai (10) offered a two-stage algorithm using a radial alignment constraint (RAC) in which most parameters are computed in closed form. A small number of parameters such as the focal length, depth component of the translation vector, and radial lens distortion parameters are computed by an iterative scheme. If image center is unknown, it is determined by a nonlinear approach (10) based on minimizing the RAC residual error. Although the solution is efficient and the closed-form solution is immune to radial lens distortion, the formulation is less effective if tangential lens distortion is also included. Furthermore, by taking the ratio of the collinearity conditions [see Eq. (1)], the method considers tangential information only and radial information is discarded. This is not an optimal solution because all information from calibration points is not fully considered. For example, Weng et al. (11) suggest that ignoring radial information can result in a less reliable estimator. Furthermore, the orthogonality of the first two rows of the rotation matrix is not guaranteed. However, Tsai (9) has offered one of the few algorithms for the coplanar case. Next, we study the computational procedure in traditional analytical photogrammetry. The method (12–16) is based on the parametric recursive procedure of the method of least squares. Using the Euler angles for the rotation matrix [see Ref. 17], two nonlinear collinearity equations are obtained for each observation. The nonlinear equations are linearized using Newton’s first-order approximation. Based on the assumption of normal distribution of errors in measurements, the condition of maximizing the sum of squares of the residuals results in the maximum-likelihood values of the unknowns. Several iterations of the solution must be made to eliminate errors due to the linearization procedure. That is, the computed corrections are applied to the approximations at the end of each iteration, which form the new approximations in the next iteration. Initial solution is a prerequisite to this recursive procedure. In addition, Malhotra and Karara (14) note that ‘‘this solution needs considerable computational effort when the number of parameters in the adjustment are large.’’ Other researchers (4,5,9) also make the same observation. Faig (12), Wong (16), and Malhotra and Karara (14) used the above method for a general solution of all parameters. The generality of their models allows them to accommodate many types of distortions, and it leads to accurate results. However, convergence is not guaranteed. These earlier nonlinear methods obtain good results provided that the estimation model is good, and a good initial guess is available. 
The direct linear transformation (DLT) method of AbdelAziz and Karara (18,19) is a nonlinear method when lens distortion is corrected. However, in this formulation, depth
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright # 1999 John Wiley & Sons, Inc.
744
CAMERA CALIBRATION FOR IMAGE PROCESSING
components of control points in a camera-centered coordinate system are assumed to be constant. Beyer (20) uses a nonlinear formulation and a versatile method based on self-calibrating bundle adjustment. The method is general, and accurate results of up to 1/46th of the pixel spacing in image space are reported. With redundant control points, an accuracy of 1/79th of the pixel spacing is achieved. Nomura et al. (21) also uses a nonlinear method where the dimension of the parameter space for nonlinear optimization is five. This reduced dimensionality for nonlinear optimization is obtained by setting up the calibration chart precisely to eliminate two Euler angles. Scale factor and depth components of the translation vector are assumed known or simply computed. Although starting estimates are required in all nonlinear methods, the research by Weng et al. (11) is an example where a method for obtaining starting estimates is explicitly mentioned. As shown by Weng et al., computing all extrinsic parameters with lens distortion is necessary for accurate camera calibration. The method employs a two-step algorithm that first computes a set of starting values for all parameters, and then it uses linear and nonlinear optimization schemes to obtain accurate estimates. Each iteration improves all parameters by successive minimization of an objective function. The method does not deal with the coplanar case. We next mention the method due to Chatterjee, Roychowdhury, and Chong (22) that uses the Gauss–Seidel framework (23,24) for nonlinear minimization. It exploits the structure of the calibration problem to use smaller blocks that are solved by linear iterations or computed in closed form in each iteration. Nonlinear optimization is performed on a reduced parameter space of three in the noncoplanar case. A detailed initialization algorithm is also provided to obtain starting estimates. Furthermore, they provide an analytical proof of convergence for their algorithm. Finally, there are several applications in aerospace and military that do not have accurate control points. An initial estimate of the calibration parameters is made with inaccurate world points and accurate image points. For such applications, a robust estimation scheme is required. One such algorithm is given by Haralick and Shapiro (25) to solve the extrinsic parameters only. This algorithm is extended to also compute the focal length. Since the literature for calibration is vast and varied, we first discuss the camera calibration model and present off-line methods of computing some intrinsic parameters in the section entitled ‘‘Camera Calibration Model.’’ We next consider noncoplanar camera calibration algorithms that employ (1) linear optimization methods to produce suboptimal solutions in the section entitled ‘‘Linear Methods of Noncoplanar Camera Calibration,’’ (2) nonlinear optimization methods to produce optimal solutions in the section entitled ‘‘Nonlinear Methods of Noncoplanar Camera Calibration,’’ and (3) robust estimation methods to solve a subset of calibration parameters in the section entitled ‘‘Robust Methods of Noncoplanar Camera Calibration.’’ Finally, we discuss algorithms for coplanar camera calibration in the section entitled ‘‘Coplanar Camera Calibration.’’ CAMERA CALIBRATION MODEL In this discussion, we consider at least 11 calibration parameters shown in Table 1. The extrinsic parameters consist of the
Table 1. Camera Calibration Parameters Discussed in This Study Parameters
Type
Description
R t
Extrinsic
f i0 , j 0 s k1 , ..., kr0 p1 , p2 s1 , s2
Intrinsic Intrinsic
Rotation matrix for camera orientation Translation vector for position of camera center Focal length Image center displacement Scale factor Skew angle Radial lens distortion Decentering lens distortion Thin prism lens distortion
Intrinsic
3 ⫻ 3 rotation matrix R for the camera orientation, which can be alternatively specified by the three Euler angles 웆, , and that designate rotations around the X, Y, and Z world axes, respectively (see Fig. 1). Besides, the 3 ⫻ 1 translation vector t denotes the position of the camera center. The intrinsic parameters consist of the effective focal length f of the camera, center of the image array (i0, j0), horizontal scale factor s of the image array, skew angle between the image axes, and radial, tangential, and thin prism lens distortion parameters 兵k1, k2, . . ., kr0, p1, p2, s1, s2其. Extrinsic Parameters Here we describe the geometry of the calibration system. The geometry involves three coordinate systems (see Fig. 1): (1) a 3-D world coordinate system (Xw, Yw, Zw) centered on a point Ow and including a point p ⫽ (x, y, z), (2) a camera coordinate system (Xc, Yc, Zc) with origin at optical center Oc with Zc axis the same as the optical axis, and (3) a 2-D image array system (I, J) centered at a point Os in the image, with (I, J) axes aligned to (Xc, Yc) respectively, and including a point (i, j). Let f be the effective focal length of the camera. The collinearity condition equations are (13,15,19)
f
r T1 p + t1 r T3 p + t3
=i
and
f
r T2 p + t2 r T3 p + t3
= j
(1)
J (i, j)
Os
I
Image plane
Zc Zw
Yc Yw
Ow
Oc
Xc Camera space
Xw 3-D world space p = (x, y, z)
Figure 1. Mapping of a 3-D world point (x, y, z) to image point (i, j).
CAMERA CALIBRATION FOR IMAGE PROCESSING
745
where RT ⫽ [r1 r2 r3] is a 3 ⫻ 3 rotation matrix defining the camera orientation and tT ⫽ [t1 t2 t3] is a translation vector defining the camera position. A very important constraint in calibration algorithms is the orthonormality constraint of the rotation matrix R given by
video. Usually the vertical spacing between lines perfectly matches that on the sensor array, giving us no scale factor in the vertical direction; that is, sj ⫽ 1. The pixels in each line of video signal are resampled by the ADC, which, in reality, samples the video lines with a rate different from the camera and causes the image to be scaled along the horizontal direcRT R = RRT = I (2) tion; that is, si ⬆ 1. Hence, the problem of determining scale factor s ⫽ si. that is, Some researchers (9,10,26) have suggested that the horizontal scale factor s can be approximately determined from rr 1 2 = rr 2 2 = rr 3 2 = 1 and r T1r 2 = r T2r 3 = r T3r 1 = 0 the ratio of the number of sensor elements in the I image direction, to the number of pixels in a line as sampled by the Alternatively, the rotation matrix R can be represented by processor. However, due to timing errors, inconsistency of the the Euler angles 웆, , and that denote rotations around the ADC, and possible tilt of the sensor array, this is not so accuXw, Yw, and Zw world axes, respectively: rate. A more accurate estimate of s is the ratio of the frequency that sensor elements are clocked off of the CCD to the cos φ cos κ sin ω sin φ cos κ − cos ω sin φ cos κ frequency at which the ADC samples. + cos ω sin κ + sin ω sin κ We found a number of methods to compute the horizontal R = − cos φ sin κ − sin ω sin φ sin κ cos ω sin φ sin κ (3) scale factor s by special techniques. Examples are as follows: + cos ω cos κ + sin ω cos κ (1) measuring the frequency of the stripes generated by the interface of ADC-clock and camera-clock that create the scale sin φ − sin ω cos φ cos ω cos φ factor problem (10), (2) measuring the distortion in an image of a perfect circle into an ellipse (30), (3) computing power Image Center and Scale Factor Parameters spectrum of the image of two sets of parallel lines (17), and Ideally the image center is the intersection of the optical axis (4) counting the grid points in an image of a grid pattern (31). of the camera–lens system with the camera’s sensing plane. Consider an image point (if , jf ) with respect to the center For real lenses, optical axis is not so easily defined, and differ- Os of the image buffer. Let the actual image center be at (i0, ent definitions of image centers (26) depend on whether the j0). Let (id, jd) be the location of the point in the image with lens has fixed or variable parameters and on how the variable respect to (i0, j0). We obtain (1,2,9–11,25) parameters are mechanically implemented. For example, for id = s−1 (if − i0 ) and jd = ( jf − j0 ) (4) a simple lens, there can be two axes of symmetry: optical and mechanical. The optical axis is the straight line joining the centers of curvature of the two surfaces of the lens, whereas Lens Distortion Parameters the mechanical axis is determined by the centerline of the machine used to grind the lens’ edge. The angle between As a result of imperfections in the design and assembly of these axes is called decentration (26,27). In a compound lens, lenses, the image of a plane object lies, in general, on a the optical axes of multiple lens elements may not be accu- slightly curved field (15) (see Fig. 2), wherein objects at the rately aligned due to decentration of each lens element, re- edge of the field of view appear somewhat smaller or larger sulting in multiple possibilities for the optical axis. In adjust- than they should. 
Types of lens distortions commonly seen able and variable focal length lenses, the misalignment are radial (15,32) and tangential (15,33). Two common radial between the optical and mechanical axes change as the spac- distortions are pincushion and barrel distortions. Pincushion distortion results, for example, when a lens is used as a maging between the lens elements are changed. We found several methods to measure image center by us- nifying glass, whereas barrel distortion results when the obing special techniques. Examples of such methods are as fol- ject is viewed through a lens at some distance from the eye. Tangential distortions are usually caused by (1) decenlows: (a) measuring the center of the radial lens distortion tering of the lens (decentering distortion) (11,12,21,32,33), (10,26), (b) determining the normal projection of a viewing and (2) imperfections in lens manufacturing or tilt in camera point onto the imaging plane (26,28), (c) measuring the center of the camera’s field of view (26), (d) passing a laser beam sensor or lens (thin prism distortion) (11,12). One of the efthrough the lens assembly and matching the reflection of the fects of tangential distortion is that a straight line passing beam from the lens with the center of the light spot in the image (10,26), (e) measuring the center of cos4th radiometric falloff or the center of vignetting/image spot (26), and (f) changing the focal length of a camera–lens system to determine image center from a point invariant in the image (10,26). Besides, there are several algorithmic methods of computing image center as a part of the complete calibration algorithm (1,8,10,11,26,28,29). Other intrinsic parameters commonly considered are the scale factors si and sj in the I and J image directions, respectively. Array sensors such as CCD/CID sensors acquire the video information line by line, where each line of video signal is well separated by the horizontal sync of the composite
(a)
(b)
(c) Figure 2. Lens distortions: (a) Radial pincushion, (b) radial barrel, and (c) tangential.
746
CAMERA CALIBRATION FOR IMAGE PROCESSING
(if, jf) Frame buffer image point
Image center and scale factor correction
Here r0 cluding tortion, and (s1, dinates below:
is the order of the radial distortion model. Terms incoefficients (k1, k2, k3, . . ., kr0) account for radial dis(p1, p2, p3, . . .) represent decentering distortions, s2, . . .) represent thin prism distortions. Image coorare corrected for lens distortion by the expression i = id + di
(id, jd) Distorted image point
and
j = jd + d j
(6)
Discussions Lens distortion correction
(i, j)
Ideal undistorted image point
Perspective projection and rotation translation
p = (x, y, z) 3-D world point Figure 3. Steps for transforming frame buffer image point (if , jf ) to ideal undistorted image point (i, j) and to 3-D world point p ⫽ (x, y, z).
through the center of the field of view may appear in the image as a weakly curved line (see Fig. 2). Clearly these distortions are disturbing in applications where the ultimate task is to map a 3-D object in uniform scale from its acquired image. Many researchers (10,11,34,35) have observed that ignoring lens distortion is unacceptable in doing 3-D measurements. Although several studies (9,10,21,30,36,37) have considered only radial distortions up to the first or second order, some studies (11,12,16,20,32,33) have considered both radial and tangential lens distortions, and have used nonlinear optimization schemes to compute them. For example Beyer (20) has demonstrated the effects of higher-order radial and tangential distortion models. By using a first-order radial model, an accuracy in image space of 1/7th of the pixel spacing is obtained. By using a third-order radial and first-order decentering distortion model, this accuracy is enhanced to 1/46th of the pixel spacing. Faugeras (1) and Weng et al. (11) used wide-angle lenses and also found that adding nonradial distortion components improved accuracy. Using a f ⫽ 8.5 mm wide-angle lens with both radial and tangential models, Weng et al. (11) demonstrated a significant improvement in image error. This is also supported by our experiments. One commonly used model for correcting lens distortion is that developed by Brown (32,33). Let (di, dj) be the corrections for geometric lens distortions present in image coordinates (id, jd) respectively, where (id, jd) are obtained from Eq. (4). Let (i, j) be the ideal undistorted image coordinates of a 3-D point p ⫽ (x, y, z). With r2 ⫽ (i2 ⫹ j2), di and dj are expressed by the following series (11,15,32,33):
di = i(k1 r2 + k2 r4 + · · · + kr 0 r2r 0 ) + (p1 (r2 + 2i2 ) + 2p2 ij)(1 + p3 r2 + · · · ) + (s1 r2 + · · · ) d j = j(k1 r2 + k2 r4 + · · · + kr 0 r2r 0 ) + (2p1ij + p2 (r2 + 2 j 2 ))(1 + p3 r2 + · · · ) + (s2 r2 + · · · ) (5)
Following the above description of calibration parameters, Fig. 3 below provides the steps for transforming the frame buffer image point (if , jf ) to the ideal undistorted image point (i, j) and to the 3-D world point p ⫽ (x, y, z). LINEAR METHODS OF NONCOPLANAR CAMERA CALIBRATION In this section, we present four well-known linear methods of camera calibration. These methods are due to (a) Yakimovsky and Cunningham (7), (b) Ganapathy (2), (c) Grosky and Tamburino (4), and (d) Tsai and Lenz (9,10). In this section, we discuss all methods for the noncoplanar case only. The coplanar case is treated separately in the section entitled ‘‘Coplanar Camera Calibration.’’ Yakimovsky and Cunningham’s Method One of the earliest methods for camera calibration that employs a linear estimation technique is due to Yakimovsky and Cunningham (7). The collinear condition equations in Eq. (1) can be written as the following linear equation with unknown parameter vector b: p T 0 1 0 −if p T i b= f (7) 0 p T 0 1 − jf p T jf where b T = [ frrT1
frrT2
ft1
ft2
r T3 ]/t3
Given at least five linearly independent control points p and corresponding image points (i, j), the above equation can be solved by the method of linear least squares. The axial vector r3 and the depth component of the translation vector t3 are solved by imposing the constraint 储r3储 ⫽ 1. Let bT ⫽ [bT1 bT2 b3 b4 bT5]. Then r3 ⫽ b5 /储b5储 and 兩t3兩 ⫽ 1/储b5储. As drawn in Fig. 1, the sign of t3 is negative (positive) if the origin of the world coordinate system Ow is in front of (behind) the camera. An algorithmic method to determine the sign of t3 is given in Ref. 9. Estimates of focal length f can be obtained by imposing the constraints 储r1储 ⫽ 1 or 储r2储 ⫽ 1 as f ⫽ 储b1储/储b5储 or f ⫽ 储b2储/储b5储. However, this produces two estimates of focal length f, which cannot be resolved. Although the method is fast, it ignores the image center, scale factor, and lens distortion parameters. Furthermore, the method fails to impose the orthonormality constraints given in Eq. (2). The constraints 储r1储 ⫽ 储r2储 ⫽ 储r3储 ⫽ 1 are imposed after the solution is obtained, and thus the solution is suboptimal, and hence prone to errors due to noise in calibration data.
CAMERA CALIBRATION FOR IMAGE PROCESSING
A modification to this method that imposes the constraint 储r3储 ⫽ 1 during the least-squares solution can be obtained by reformulating Eq. (7) as follows:
pT 0
0 pT
0 1
−if − jf
frr T2
ft1
1 0
Let bT ⫽ [bT1 bT2 b3 b4 bT5]. The, by imposing the orthonormality conditions of (r1, r2, r3), we obtain
b T1b 5 b Tb , j0 = 2 52 2 b5 b5 b b b 1 − i 0b 5 b 2 − j 0b 5 b b , s= f = b5 b 2 − j 0b 5 b b b − i0b 5 b − j 0b 5 b , r2 = 2 , r3 = 5 r1 = 1 b5 b5 b5 sf b f b b b 3 − i0 b 4 − j0 1 , t2 = , and |t3 | = t1 = b5 b5 b5 sf b f b b i0 =
−if p T b=0 − jf p T
(8)
where b T = [ frrT1
ft2
t3
r T3 ]
The constrained least-squares problem associated with Eq. (8) is b subject to the constraint rr 3 2 = 1 Minimize b T Ab
(9)
Here A is the 12 ⫻ 12 data matrix constructed from the control points and corresponding image points from Eq. (8). Let
A2 A3
A1 A= AT2
747
be a partition of matrix A, where A1 僆 ᑬ9⫻9, A2 僆 ᑬ9⫻3 and A3 僆 ᑬ3⫻3. Then Eq. (9) leads to the following 3 ⫻ 3 symmetric eigenvalue problem for the solution of r3 (see Ref. 22): r 3 = λrr 3 (A3 − AT2 A−1 1 A2 )r
(10)
Once again, the method can be extended to include the constraint 储r3储 ⫽ 1 during the least-squares solution. The method is fast and includes the image center and scale factor parameters. However, it does not include the lens distortion parameters. Moreover, the method imposes the orthonormality constraints after the solution is obtained, and hence the solution is suboptimal. Furthermore, since there are five components of b with six orthonormality constraints [see Eq. (2)], whereas there are 10 unknowns 兵r1, r2, r3, t1, t2, t3, f, i0, j0, s其, we obtain ambiguous solutions. For example, bT1b2 /储b5储2 ⫽ i0 j0, which gives us two solutions for i0: i0 =
b T1b 5 b5 2
and i0 =
b T1b 2 bT2 b5
which may not have the same values. Besides, the method treats the different variables as linearly independent when in reality they are not. Grosky and Tamburino’s Method
Here r3 is the eigenvector corresponding to the minimum eigenvalue . The remaining parameters are [ frrT1
frrT2
f t1
t3 ] = −A−1 1 A2 r 3
f t2
(11)
Ganapathy’s Method Ganapathy (2) extended the Yakimovsky and Cunningham method to also include the image center and scale factor parameters. From Eqs. (1) and (4), we obtain the collinear condition equations below:
f
r T1 p + t1 r T3 p + t3
= s−1 (if − i0 ) and
f
r T2 p + t2 r T3 p + t3
Grosky and Tamburino (4) extended Ganapathy’s method to obtain a unique linear solution for the noncoplanar camera calibration problem. In order to remove the ambiguity in Ganapathy’s solution, Grosky and Tamburino introduced a skew angle parameter which is the positive angle by which the I image coordinate is skewed from J. From Eqs. (1) and (4), we obtain the collinearity condition equations:
f
r T1 p + t1 r T3 p + t3
(12)
cos(θ ) + f
pT 0
0 pT
1 0
0 1
i −if p T b=0 f − jf p T jf
(13)
f
sin(θ ) r T2 p + t2 r T3 p + t3
pT 0
0 pT
1 0
0 1
sft1 + i0t3
ft2 + j0t3
r T3 ]/t3
= ( jf − j0 )
(14)
(15)
where
sf (t1 cos θ + t2 sin θ ) + i0t3 frr T2 + j0r T3
i −if p T b= f T p − jf jf
b T = [sf (rrT1 cos θ + r T2 sin θ ) + i0r T3
where b T = [sfrr T1 + i0r T3
From Eq. (14), we obtain the following linear equations in unknown parameter vector b:
From Eq. (13), we obtain the following linear equation in unknown parameter vector b:
r T2 p + t2 r T3 p + t3
= s−1 (if − i0 ) and
= ( jf − j0 )
frrT2 + j0r T3
ft2 + j0t3
r T3 ]/t3
Let bT ⫽ [bT1 bT2 b3 b4 bT5]. Then, we obtain the following solution for the calibration parameters by imposing the orthonor-
748
CAMERA CALIBRATION FOR IMAGE PROCESSING
mality conditions of (r1, r2, r3):
From Eq. (19), we obtain the following linear equation with unknown parameter vector b:
b Tb b Tb 1 , i0 = 1 52 , j0 = 2 52 b5 b5 b5 b b b b 1 − i 0b 5 b 2 − j 0b 5 b b , s= f = b5 b 2 − j0 b 5 b b
|t3 | =
sin θ =
pT − ip pT [ jp
(16) 2 b 5 2 sf b b − j 0b 5 b − i0b 5 b − j 0b 5 sec θ − 2 tan θ, r2 = 2 r1 = 1 b5 b5 b5 sf b f b f b b4 − j0 b3 − i0 b4 − j0 sec θ − tan θ, t2 = t1 = b5 b5 b5 sf b f b f b
Once again, the method can be extended to include the constraint 储r3储 ⫽ 1 during the least-squares solution of Eq. (15). The method is computationally efficient, solves the image center and scale factor parameters, and satisfies the orthonormality conditions given in Eq. (2). However, the method does not include the lens distortion parameters, and it produces a suboptimal solution since the constraints are imposed after the solution is obtained. Moreover, the method also treats different variables as linearly independent when they are not. Tsai and Lenz’s Method Tsai (9) and Lenz and Tsai (10) designed a method that includes the radial lens distortion parameters in their camera model and provides a two-stage algorithm for estimating the camera parameters. In their method, most parameters are computed in a closed form, and a small number of parameters such as focal length, the depth component of the translation vector, and the radial lens distortion parameters are computed by an iterative scheme. Tsai (9) introduced an radial alignment constraint (RAC) which results from the observation that the vectors (id, jd), (i, j), and (rT1p ⫹ t1, rT2p ⫹ t2) in Fig. 1 are radially aligned when the image center is chosen correctly. Algebraically, the RAC states that (17)
If we determine the image center parameters (i0, j0) by any one of the methods described in the section entitled ‘‘Image Center and Scale Factor Parameters,’’ then the frame buffer image coordinates (if , jf ) can be corrected for image center by using Eq. (4) to obtain the distorted image coordinates sid ⫽ if ⫺ i0 and jd ⫽ jf ⫺ j0. Using first-order radial lens distortion, the collinearity condition equations are
sf f
r T1 p + t1 r T3 p + t3 r T2 p + t2 r T3 p + t3
(18) k1 r 2d )
where rd2 ⫽ id2 ⫹ jd2. Dividing the first equation by the second, we obtain srr T1 p + st1 si = d r T2 p + t2 jd
rT2 st1 ]/t2
(20)
1. Solve Eq. (20) by using the linear least squares method to compute b. 2. Compute 兩t2兩 ⫽ 1/储b2储 by imposing the constraint 储r2储 ⫽ 1. The sign of t2 is determined from Eq. (19). 3. Compute scale factor s ⫽ 储b1储/储b2储 by imposing the constraint 储r1储 ⫽ 1. 4. Ignoring lens distortion (i.e., k1 ⫽ 0), compute approximate values of focal length f and depth component of translation vector t3 from the following equation [derived from the second equation in Eq. (18)] by using the least-squares method:
[(rrT2 p
f = [ jdr T3 p] + t2 ) − j d ] t3
5. Compute exact values of f, t3, and radial lens distortion parameter k1 from the following equation [derived from the second equation in Eq. (18)] by using a standard nonlinear optimization scheme such as steepest descent: f (rr T2 p + t2 ) − t3 jd − k1 ( jd r 2dr T3 p ) − k1t3 ( jd r 2d ) − jdr T3 p = 0 where rd2 ⫽ id2 ⫹ j d2. The method is fast and also includes the radial lens distortion parameters. This makes the method widely applicable to a variety of applications. However, the solution is suboptimal and does not impose the orthonormality constraints given in Eq. (2). Furthermore, the image center parameters are not included in the solution. Lenz and Tsai (10) proposed a nonlinear method for solving the image center parameters by minimizing the RAC residual error. However, this computation is applicable after all extrinsic parameters and focal length are computed by the above-mentioned steps. By taking the ratio of the collinearity condition equations [see Eq. (19)], the method considers only the tangential component of the collinearity equations and ignores the radial component. Furthermore, the tangential lens distortion parameters are ignored. NONLINEAR METHODS OF NONCOPLANAR CAMERA CALIBRATION
= sid (1 + k1 r 2d ) and = jd (1 +
where b T = [srr T1
Let bT ⫽ [bT1 bT2 b3]. We next perform the following steps:
b 2 − j 0b 5 ) b 1 − i0b 5 )T (b (b
(id , jd )//(i, j)//(rr T1 p + t1 , r T2 p + t2 )
b = [i], j]b
(19)
The nonlinear methods of noncoplanar camera calibration first originated in the photogrammetry literature, and they were later refined in computer vision research. We present three methods of nonlinear camera calibration. These are (a) nonlinear least-squares method (25), (b) Weng et al.’s method (11), and (c) Chatterjee, Roychowdhury, and Chong’s method (22). These methods are computationally more intensive than the linear methods discussed before, but they lead to very accurate estimates of the parameters. We shall consider only
CAMERA CALIBRATION FOR IMAGE PROCESSING
the noncoplanar case in this section, and the coplanar case is considered separately in the section entitled ‘‘Coplanar Camera Calibration.’’
where
k g1 (β1k , . . ., βM ) and g = k g2 (β1k , . . ., βM ) ∂g1 /∂β1 ∂g1 /∂β2 . . . k G = ∂g2 /∂β1 ∂g2 /∂β2 . . .
k
Nonlinear Least-Squares Method In this formulation of the camera calibration problem, the representation of the rotation matrix R in terms of the Euler angles 웆, , and as given in Eq. (3) is used. Substituting this R and the image coordinate corrections [see Eqs. (4) and (6)] in the collinearity condition equations [Eq. (1)], we obtain two nonlinear equations below (for first-order radial lens distortion only): if = i0 + sf
r
1 (ω, φ, κ )
p + t1 r 3 (ω, φ, κ )T p + t3 Tp
+ (if − i0 )k1 r(s, i0 , j0 )2
and jf = j0 + f
r
2 (ω, φ, κ )
p + t2 r 3 (ω, φ, κ )T p + t3 Tp
+ ( jf − j0 )k1 r(s, i0 , j0 )2
where r(s, i0, j0)2 ⫽ s⫺2(if ⫺ i0)2 ⫹ ( jf ⫺ j0)2. The unknowns are the parameters of exterior orientation 兵웆, , , t1, t2, t3其 and interior orientation 兵f, s, i0, j0, k1其. These equations are of the following general form: αi = gi (β1 , . . ., βM ) + ξi
for i = 1, 2
where 웁1, . . ., 웁M are the unknown parameters, 움1 and 움2 are the observations, and 1 and 2 are additive zero mean Gaussian noise having covariance ⌺. The maximum likelihood solution leads to the minimization of the criterion J = (α − g)T −1 (α − g)
(21)
where 움T ⫽ [움1 움2] and gT ⫽ [g1 g2]. Starting from a given 0 T approximate solution 웁0 ⫽ (웁10, . . ., 웁M ) , we begin by linearizing the nonlinear transformations about 웁0 and solve for adjustments ⌬웁 ⫽ (⌬웁1, . . ., ⌬웁M)T, which when added to 웁 constitute a better approximate solution. We perform this linearization and adjustment iteratively. If good starting estimates 웁0 are available, in most cases 5 to 10 iterations are required to produce the solution of desired accuracy. In the kth iteration, let 웁k be the current approximate solution. The linearization proceeds by representing each gi(웁k ⫹ ⌬웁) by a first-order Taylor series expansion of gi taken around 웁k: gi (β k + β ) = gi (β k ) + gi (β; β k )
for i = 1, 2
where ⌬gi, the total derivative of gi, is a linear function of the vector of adjustments ⌬웁 given by
gi (β; β k ) =
∂gi k ∂gi k (β ) · · · (β ) β ∂β1 ∂βM
for i = 1, 2
Substituting the linearized expressions into the least-squares criterion [Eq. (21)] and then minimizing it with respect to ⌬웁, we obtain T
T
β = (Gk −1 Gk )−1 Gk −1 (α − gk )
749
∂g1 /∂βM ∂g2 /∂βM
The method produces very accurate parameter estimates and is less prone to errors in the data when compared with the linear methods. The method solves all calibration parameters, and the orthonormality constraints are imposed during the optimization process. This is an optimal solution to the calibration problem. However, the method is computationally slow and requires at least 5 to 10 iterations to converge, provided that good starting estimates are available. Furthermore, accurate starting estimates are very important for the method to converge to the correct solution. Usually, a linear method such as Grosky and Tamburino’s method is used to obtain starting parameter estimates. Weng et al.’s Method Weng et al. (11) described a two-step nonlinear algorithm of camera calibration which first computes a set of starting estimates for all parameters in an initialization algorithm and then uses linear and nonlinear optimization schemes to obtain accurate estimates. Weng et al. also showed that computing all extrinsic parameters with lens distortion is necessary for accurate camera calibration. Weng et al. presents two sets of parameters: (1) extrinsic and intrinsic nondistortion parameters m ⫽ 兵f, s, i0, j0, 웆, , , t1, t2, t3其 and (2) set of distortion parameters d ⫽ 兵k1, k2, p1, p2, s1, s2其. Given control points p and corresponding image points (if , jf ), they use the following optimization criterion: m, d ) = J(m
m, d ))2 + ( jf − j(m m, d ))2 ] [(if − i(m
(22)
all points
where
r
1 (ω, φ, κ )
p + t1 + (i f − i0 )k1 r(s, i0 , j0 )2 r 3 (ω, φ, κ )T p + t3 + higher-order lens distortion terms
m, d ) = i0 + sf i(m
Tp
and
r
2 (ω, φ, κ )
p + t2 + ( j f − j0 )k1 r(s, i0 , j0 )2 T r 3 (ω, φ, κ ) p + t3 + higher-order lens distortion terms
m, d ) = j 0 + f j(m
Tp
where r(s, i0, j0)2 ⫽ s⫺2(if ⫺ i0)2 ⫹ ( jf ⫺ j0)2. In the initialization stage, Weng et al. uses two steps shown below: 1. Use a linear algorithm (by ignoring lens distortion) such as Grosky and Tamburino’s method (see section entitled ‘‘Grosky and Tamburino’s Method’’) to obtain the starting estimates of the extrinsic parameters 兵웆, , , t1, t2, t3其, focal length f, image center parameters (i0, j0), and scale factor s.
750
CAMERA CALIBRATION FOR IMAGE PROCESSING
2. Ignoring lens distortions parameters d, minimize the optimization criterion [Eq. (22)] with respect to the parameter set m by a nonlinear optimization technique such as Levenberg–Marquardt (23,38). For the initialization algorithm, only the central points are used. Central points are those calibration points close to the center of the image, and they do not contain significant lens distortion effects. After the initial estimates are obtained, the main estimation algorithm uses iterations between the following two steps: 1. Estimate the distortion parameters d with the nondistortion parameters m held constant by a linear estimation step. Note that in Eq. (22), if the nondistortion parameters m are held constant, then the distortion parameters d can be represented by the following linear equation:
k1 k 2 2 2 2 s(r d + 2id ) 2sid jd sr d 0 p 1 2id jd (r 2d + 2 jd2 ) 0 r 2d p 2 s1 s2 T r 1 p + t1 − i − sf i 0 f T rT3 p + t3 = r 2 p + t2 jf − j0 − f r T3 p + t3
sid r 2d jd r 2d
sid r 4d jd r 4d
Here, the distorted image points (id, jd) are obtained from frame buffer points (if , jf ) by using Eq. (4). 2. Estimate the nondistortion parameters m with the distortion parameters d held constant by nonlinear minimization of the optimization criterion J(m, d) in Eq. (22). The method iterates between the above two steps until a desired accuracy in the estimates is obtained. The method provides very accurate estimates, solves all calibration parameters, and gives us an optimal solution. However, the method is computationally slow and requires good starting estimates to converge to the correct solution. Since the method also provides an initialization algorithm, reliable starting estimates can be obtained. Chatterjee, Roychowdhury, and Chong’s Method This recent technique (22) for camera calibration also offers a two-step algorithm that consists of an initialization phase and a main estimation step. Instead of using the Euler angles (웆, , ) to represent the rotation matrix R, this method instead directly imposes the orthonormality constraints given in Eq. (2). This relieves the objective function of periodic transcendental terms which frequently results in many false minima. Due to the structure of the objective function in the calibration problem, this method employs the Gauss–Seidel (23,24) technique of nonlinear optimization. The Gauss–Seidel method for block components iteratively minimizes the objective function for a given block with the remaining held con-
stant, and the most recently computed values of the remaining blocks are used. This method is particularly attractive because of its easy implementation (23,24,38), and it lends itself to a comprehensive theoretical convergence analysis (22,24). These properties are utilized to obtain a simpler solution and also obtain a rigorous convergence study. The method partitions the parameter set into three blocks: (1) extrinsic parameters and focal length b ⫽ 兵r1, r2, r3, t1, t2, t3, f其; (2) image center and scale factor parameters m ⫽ 兵i0, j0, s其; and (3) lens distortion parameters d ⫽ 兵k1, k2, . . ., kr0, p1, p2, s1, s2其. In the noncoplanar case, most parameters (blocks b and d) are computed by linear iterations or in closed form. Unlike the previous two methods, nonlinear optimization is performed only for the parameter block m—that is, in a reduced parameter space of dimension 3. Initialization Algorithm. From Eq. (1), we obtain the following objective function: b, m , d ) = p − it3 )2 J(b [( frrT1 p + f t1 − r T3 ip all points (23) p − jt3 )2 ] + ( frrT2 p + f t2 − r T3 jp under constraint 储r3储2 ⫽ 1. Here i ⫽ i(m, d) and j ⫽ j(m, d) by using Eqs. (4) and (6). The initialization algorithm consists of the following steps: 1. Ignoring lens distortion (i.e., d ⫽ 0), an initial estimate of m is obtained by using Grosky and Tamburino’s method with only central calibration points. 2. Iterate between the following two steps for all calibration points: a. Compute b with d held constant. From Eq. (23), we obtain the following constrained linear equation in terms of b: p p T 0 1 0 −i −ip p 0 p T 0 1 − j − jp
b = 0 under constraint rr 3 2 = 1 where bT ⫽ [frT1 frT2 ft1 ft2 t3 rT3]. As discussed in the section entitled ‘‘Yakimovsky and Cunningham’s Method’’ this problem leads to the constrained leastsquares problem given in Eq. (9) whose solutions are given in Eqs. (10) and (11). b. Compute d with b held constant. Here d can be solved from the following unconstrained leastsquares equation derived from Eq. (23): ir 2 ir 4 . . . ir 2r 0 r 2 + 2i2 T (rr 3 p + t3 ) jr 2 jr 4 . . . jr 2r 0 2ij 2 T f (rr 1 p + t1 ) − i(rr T3 p + t3 ) 2ij r 0 d= r 2 + 2 j2 0 r 2 f (rr T2 p + t2 ) − j(rrT3 p + t3 ) where d ⫽ [k1 k2 . . ., kr0 p1 p2 s1 s2]
r2 = s−2 (if − i0 )2 + ( j f − j0 )2 i = s−1 (if − i0 ), j = ( j f − j0 ) and r0 is the order of the radial lens distortion model.
CAMERA CALIBRATION FOR IMAGE PROCESSING
Main Algorithm. From Eq. (1), we obtain the following objective function: 2 qi b, d , m ) = − r T1 p − t1 J(b f all points (24) qj 2 T T 2 + − r 2 p − t2 + (q − r 3 p − t3 ) f under orthonormality constraints [Eq. (2)]. Here i ⫽ i(m, d) and j ⫽ j(m, d) by using Eqs. (4) and (6), and q is a depth variable that is also estimated. We define parameter blocks b ⫽ 兵R, t, f其, dT ⫽ [k1 k2 . . . kr0 p1 p2 s1 s2] and mT ⫽ [i0 j0 s]. The algorithm starts with an initial estimate of the depth parameter q for each calibration point. The initial values could be the same constant for each point, with the constant representing an initial guess of how far the object is from the camera center. The algorithm then iterates between the following three steps: 1. Compute b with d and m held constant. This step is similar to the exterior orientation or space resection problem (15,25,39) in analytical photogrammetry. For this step, we define vT ⫽ [i/f j/f 1]. We then iterate between the following three steps: a. Compute R and t with f and q held constant. From Eq. (24) we obtain the following constrained optimization problem: Minimize
v − Rp p − t 2 under constraint RT R = RRT = I qv (25)
This is the well-known problem (13,15,25,40) in photogrammetry, commonly known as the absolute orientation problem. In this formulation, R can be obtained from the singular value decomposition of B ⫽ ⌺(p ⫺ p)(qv ⫺ qv)T which is B ⫽ UDVT. We obtain (24,39,40) R ⫽ VUT. Here p and qv are the averages of p and qv, respectively, with all calibration points. The translation vector t is obtained from: t ⫽ qv ⫺ Rp. b. Compute q with R, t, and f held constant. From Eq. (25), we obtain the following solution of q: q=
p + t )Tv (Rp v 2 v
(26)
c. Compute f with R, t, and q held constant. From Eq. (25), we obtain the following solution of f:
q2 (i2 + j 2 ) f = q(rrT1 p + t1 )i + q(rrT2 p + t2 ) j
tained by the initialization algorithm, is essential for accurate convergence of this step. The method obtains accurate parameter estimates and solves all calibration parameters. The method offers an optimal solution. Although the method reduces the nonlinear optimization to only a dimension of 3, it is still computationally more intensive than the linear optimization methods discussed in the section entitled ‘‘Linear Methods of Noncoplanar Camera Calibration.’’ The method also needs good starting estimates to converge to the correct solution. Since this method also provides an initialization algorithm, we can obtain reliable starting estimates. In addition, the method provides a Lyapunov-type convergence analysis of the initialization and main algorithms (22). ROBUST METHODS OF NONCOPLANAR CAMERA CALIBRATION In some applications, we do not have accurate control points to obtain good parameter estimates. Such applications are common in aerospace and military problems, where a camera mounted on a moving airplane estimates the locations of points on the ground. Since ground points may not be available accurately, the algorithm has to adaptively improve the calibration parameter estimates as more points are obtained. We describe a robust estimation technique (41) for such applications. Since estimating all parameters with a robust technique is cumbersome and in most cases unnecessary, we present a method to robustly estimate only the extrinsic parameters and focal length. The method discussed below is only applicable to the noncoplanar camera calibration problem. This method can, however, be easily extended to the coplanar camera calibration problem discussed in the section entitled ‘‘Coplanar Camera Calibration.’’ The standard least-squares techniques are ideal when the random data perturbations or measurement errors are Gaussian distributed. However, when a small fraction of the data have large non-Gaussian perturbations, least-squares techniques become useless. With least-squares techniques, the perturbation of the estimates caused by the perturbation of even a single data component grows linearly in an unbounded way. In order to estimate a parameter from a set of observations x1, . . ., xN by the robust method, we use a function whose derivative with respect to is as shown below:
J(θ ) =
N n=1
(27)
2. Compute d with b and m held constant. This is a linear optimization step similar to the initialization algorithm, Step 2b. 3. Compute m with b and d held constant. This is an unconstrained nonlinear optimization step with three parameters. The objective function J in Eq. (24) is minimized by standard optimization methods such as the Conjugate Direction (31) or Quasi-Newton (31) methods. Needless to say, a proper starting value of m, ob-
751
ρ(xn − θ ) and
N ∂J(θ ) = ψ (xn − θ ) ∂θ n=1
The minimization of J() is achieved by finding the approN priate that satisfies 兺n⫽1 (xn ⫺ ) ⫽ 0, where (x) ⫽ ⭸(x)/⭸x. The solution to this equation that minimizes J() is called the maximum-likelihood or M-estimator of . Thus, in robust M-estimation, we determine a function so that the resulting estimator will protect us against a small percentage (say, around 10%) of ‘‘outliers.’’ But, in addition, we desire these procedures to produce reasonably good (say, 95% efficient) estimators in case the data actually enjoy the Gaussian assumptions. Here, the 5% efficiency that we sacrifice in these cases is sometimes referred to as the ‘‘premium’’ that we pay to gain all that protection in very non-Gaussian cases.
752
CAMERA CALIBRATION FOR IMAGE PROCESSING
A scale invariant version of robust M-estimator can also be obtained by finding the solution of N n=1
ψ
x
n
−θ s
=0
where the denominator of the second term counts the number of items that enjoy 兩xn ⫺ k兩/s ⱕ a for Huber’s and Tukey’s , 兩xn ⫺ k兩 ⱕ a앟 for Andrew’s Sine , and b ⱕ 兩xi ⫺ k兩 ⬍ c for Hampel’s . H-Method:
where s is the robust estimate of scale such as s=
median|xn − median(xn )| 0.6745
or s=
θk+1 = θk +
where the ‘‘tuning constant’’ a equals 1.5. Hampel’s : |x|, 0 ≤ |x| < a a, a ≤ |x| < b ψ (x) = c − |x| , b ≤ |x| < c c−b 0, c ≤ |x| where reasonably good values are a ⫽ 1.7, b ⫽ 3.4, and c ⫽ 8.5 Andrew’s Sine : sin(x/a), |x| ≤ aπ ψ (x) = 0, |x| > aπ with a ⫽ 2.1. Tukey’s Biweight :
ψ (x) =
x[1 − (x/a)2 ]2 , 0,
n=1 ψ[(xn
− θk )/s]
N
N N w n xn s n=1 ψ[(xn − θk )/s] θk+1 = n=1 = θ + N k N n=1 wn n=1 wn where the weight is wn =
ψ[(xn − θk )/s] (xn − θk )/s
for n = 1, 2, . . ., N
We invoke the objective function (25) to compute robust estimates of the rotation matrix R, translation vector t, and focal length f. Given world (control) points pn and corresponding image points (in, jn) for i ⫽ 1, . . ., N, from Eq. (25), we obtain the following constrained robust estimation problem:
Minimize
N
ρ
q
n vn
n=1
− Rpn − t s
under constraint RT R = RRT = I
(28)
where vTn = [in / f
jn / f
1]
which leads to the problem
Solve
N n=1
ψ
q
n vn
− Rpn − t s
= 0 under constraint RT R = RRT = I Since is a nonlinear function and the above equation is difficult to solve, we use the weighted least-squares method to obtain robust estimates of the parameters R, t, and f. The method consists of the following steps:
|x| ≤ a |x| > a
with a ⫽ 6.0. Let us mention three iteration schemes that can be used to solve 兺 [(xn ⫺ )/s] ⫽ 0, where is any one of the functions above. We use the sample median as the initial guess 0 of . Define ⬘(x) as the derivative of with respect to x, and k as the estimate of at the kth iteration of the algorithm. The 3 methods are: Newton’s method:
θk+1
N
where the second term is the average of the pseudo (Winsorized) residuals (that is, ‘‘least squares’’ is applied to the residuals). Weighted Least-Squares:
75th percentile − 25th percentile 2(0.6745)
If the sample arises from a Gaussian distribution, then s is an estimate of the standard deviation . There are several choices of functions given below that are commonly used in literature (36): Huber’s : x < −1 −a, ψ (x) = x, |x| ≤ a a, x>a
s
N s n=1 ψ[(xn − θk )/s] = θ k + N n=1 ψ [xn − θk )/s]
1. Obtain initial estimates of the depth parameter qn for each calibration point. The initial values could be the same constant for each point, the constant representing an initial guess of how far the object is from the camera center. 2. Iterate between the following three steps: A. (Optional) Estimate scale s from s=
qn vn − Rpn − t q n vn −Rp n −t = 0 0.6745 median
B. Iterate between the following steps: a. Compute R and t with f, qn, and wn held constant. As described in Section 4.3.1, this is the abso-
CAMERA CALIBRATION FOR IMAGE PROCESSING
lute orientation problem in photogrammetry N (25,39,40). We first compute the matrix B ⫽ 兺n⫽1 T wn(pn ⫺ p)(qnvn ⫺ qv) and then compute its singular value decomposition as B ⫽ UDV T. We then obtain (24,39,40): R ⫽ VU T. Here, p ⫽ N N N N 兺n⫽1 wnpn / 兺n⫽1 wn and qv ⫽ 兺n⫽1 wnqnvn / 兺n⫽1 wn are the weighted averages of pn and qnvn, respectively, for all calibration points. The translation vector t is obtained from: t ⫽ qv ⫺ Rp. b. Compute qn with R, t, f and wn held constant. From Eq. (28), we obtain the following solution of qn for each calibration point: qn =
(Rpn + t)T vn vn 2
for n = 1, . . ., N
(29)
753
and first-order lens distortion parameter d ⫽ 兵k1其. This method uses the ratio of the two collinearity condition equations given in Eq. (1). The second method utilizes both collinearity condition equations and is called the ‘‘Conventional Method’’. In both methods we drop the image center and scale factor parameters m ⫽ 兵i0, j0, s其 which may be estimated by several offline methods discussed in the section entitled ‘‘Image Center and Scale Factor Parameters’’. Tsai’s Coplanar Method. Let p ⫽ (x, y, z) be a world (control) point with corresponding image point (if, jf). Given image center and scale factor parameters 兵i0, j0, s其, computed before, we can correct the image point (if, jf) to obtain distorted image point (id, jd) by applying Eq. (4). If we use radial lens distortion only, then the undistorted image point (i, j) is: 2r
i = id (1 + k1 r2d + · · · + kr 0 rd 0 ) and
c. Compute f with R, t, qn, and wn held constant. From Eq. (28), we obtain the following estimate of f: N
f =
wn q2n (i2n
+
jn2 )
n=1 N
wn qn (rT1 pn
+ t1 )in +
n=1
N
(30) wn qn (rT2 pn
2r
j = jd (1 + k1 r2d + · · · + kr 0 rd 0 ) where rd2 ⫽ id2 ⫹ j d2 and k1, . . ., kr0 are the radial lens distortion parameters. Then, by taking the ratio of the two collinearity condition equations (see Eq. (1)), we obtain: i i r11 x + r12 y + t1 = = d r21 x + r22 y + t2 j jd
+ t2 ) j n
n=1
where tT ⫽ [t1 t2 t3]. C. Compute weights wn from: wn =
ψ (qn vn − Rpn − t/s) qn vn − Rpn − t/s
where 兵r11, r12, r21, r22, t1, t2其 are the extrinsic parameters. Tsai’s algorithm uses the above equations and consists of the following steps: for n = 1, . . ., N
1. Compute the extrinsic parameters b from the following linear equation:
Although most people concerned with robustness do iterate on scale s also (Step A), this step is optional since some problems of convergence may be created.
[ jd x
jd y
r 211 + r 221 + r 231 = 1,
r 212 + r 222 + r 232 = 1,
bT =
− id y]b = [id ]
r11 t2
r12 t2
t1 t2
r21 t2
r22 t2
2. Compute t2 from the following equations: √ S − S2 − 4(b1 b5 − b4 b2 )2 t22 = or 2(b1 b5 − b4 b2 )2 2 |t2 | = √ √ 2 2 (b1 + b5 ) + (b2 − b4 ) + (b1 − b5 )2 + (b2 + b4 )2 where
and
r11 r12 + r21 r22 + r31 r32 = 0
(31)
where rij is the jth component of ri. We discuss two methods of estimating the parameters in the coplanar case: (1) iterative linear methods that produce a suboptimal solution and (2) nonlinear methods that produce an optimal solution. Iterative Linear Methods for Coplanar Camera Calibration Here we discuss two methods of parameter computation. The first method is due to Tsai (9) and computes all extrinsic and the focal length parameters b = {rr 11r 12r 21 , r 22 , r 31 , r 32 , f, t1 , t2 }
− id x
where
COPLANAR CAMERA CALIBRATION In coplanar camera calibration, the calibration points lie on a plane. Without loss of generality, we assume that it is the z coordinate that is unimportant. Due to this assumption, the last column of the rotation matrix R and the depth component of the translation vector t are not available in the collinearity condition equations [Eq. (1)]. Instead of the six orthonormality constraints described in Eq. (2), we have three constraints:
jd
bT = [b1
b2
b3
b4
b5 ], and S = b21 + b22 + b24 + b25
3. Determine the sign of t2. If Sign(b1x ⫹ b2 y ⫹ b3) and Sign(id) are same then Sign(t2) ⫽ ⫹1, otherwise Sign(t2) ⫽ ⫺1. 4. Determine the extrinsic parameters:
p
r11 = t1 b1 , r12 = t2 b2 , r21 = t2 b4 , r22 = t2 b5 , t1 = t2 b3 r13 = − 1 − r211 − r212
p
r23 = Sign(r11r21 + r12 r22 ) 1 − r221 − r222 r31 = r12 r23 − r22 r13 r32 = r21 r13 − r11 r23 , r33 = r11 r22 − r21 r12
754
CAMERA CALIBRATION FOR IMAGE PROCESSING
5. Obtain an approximate solution for focal length f and translation component t3 from the following equation by ignoring lens distortion: f = [ jd (r31x + r32 y)] [(r21 x + r22 y + t2 ) − jd ] t3
r0 is the order of the radial lens distortion model, and bT ⫽ [b1 b2 b3 b4 b5 b6 b7 b8]. Then, by imposing the orthonormality constraints [Eq. (31)], we obtain the parameters below:
f2 = − 6. Obtain an accurate estimate of focal length f, translation component t3, and first-order radial lens distortion parameter k1 from the following equation by using a standard nonlinear minimization scheme such as steepest descent:
f (r21 x + r22 y + t2 ) − t3 jd − k1 ( jd r2d (r31 x + r32 y)) − k1t3 ( jd r2d ) − jd (r31 x + r32 y) = 0 where rd2 ⫽ id2 ⫹ jd2. The method is computationally efficient but it produces a suboptimal solution. Moreover, the image center and scale factor parameters are not considered. The Conventional Method. The conventional method is based on the following unconstrained objective function: b, d ) = J(b [( fr11 x + fr12 y + ft1 − r31 ix − r32 iy − it3 )2 all points
+ ( fr21 x + fr22 y + ft2 − r31 jx − r32 jy − jt3 )2 ]
(32)
Here, i ⫽ i(d) and j ⫽ j(d) by using Eq. (6). The method consists of the following iterative algorithm consisting of two linear least-squares step: 1. Compute b with d held constant. From Eq. (32), we obtain the following linear equation in unknown parameter vector b: x y 0 0 −ix −iy 1 0 i b= 0 0 x y − jx − jy 0 1 j
b1 b2 + b3 b4 2 f2 f2 , t3 = 2 or t32 = 2 2 2 2 2 b5 b6 b1 + b3 + f b5 b2 + b4 + f 2 b26
t3 b 1 t b t b t b , r12 = 3 2 , r21 = 3 3 = r22 = 3 4 f f f f t3 b 7 t3 b 8 , and t2 = = t3 b5 , r32 = t3 b6 , t1 = f f
r11 = r31
The sign of t3 can be determined from the camera position with respect to the world coordinate system as described in the section entitled ‘‘Yakimovsky and Cunningham’s Method.’’ The method is reasonably efficient, but it produces a suboptimal solution including an ambiguous solution for t3. Furthermore, the image center and scale factor parameters are not considered. Nonlinear Method for Coplanar Camera Calibration The nonlinear method is based on the objective function, Eq. (32), except that the image coordinates i and j are corrected for image center, scale factor, and lens distortion; that is, they are functions of both m and d: i ⫽ i(m, d) and j ⫽ j(m, d). Furthermore, the orthonormality constraints, Eq. (31), are imposed during the optimization process. We define the parameter block containing extrinsic parameters and focal length as bT ⫽ [fr11 fr12 fr21 fr22 r31 r32 ft1 ft2]/t3. We can now write the three constraints in Eq. (31) in terms of the elements of b. Let bT ⫽ [b1 b2 b3 b4 b5 b6 b7 b8]. The constraint for coplanar camera calibration is b ) = (b5 b1 + b6 b2 )(b6 b1 − b5 b2 ) h(b + (b5 b3 + b6 b4 )(b1 b3 − b5 b4 ) = 0
(33)
The method consists of the following three iterative steps:
where b T = [ f r11
f r12
f r21
f r22
r31
r32
f t1
f t2 ]/t3
We use the linear least-squares method to solve for b. 2. Compute d with b held constant. From Eq. (32), we obtain the following linear equation in unknown parameter vector d: ir 2 ir 4 . . . ir 2r 0 r 2 + 2i2 2ij r2 0 c3 2ij r2 + 2j 2 0 r2 jr 2 jr 4 . . . jr 2r 0 c d= 1 c2 where
d T = [k1 k2 . . . kr 0 p1 p2 s1 s2 ] r 2 = i2 + j 2 c1 = (b1 x + b2 y + b7 ) − i(b5 x + b6 y + 1) c2 = f (b3 x + b4 y + b8 ) − j(b5 x + b6 y + 1) c3 = (b5 x + b6 y + 1)
1. Compute b with d and m held constant. From Eq. (32), we derive the problem: Solve for b from the following constrained equations: x y 0 0 −ix −iy 1 0 i b= 0 0 x y − jx − jy 0 1 j under constraint h(b) ⫽ 0 [see Eq. (33)]. We use nonlinear optimization methods to solve this problem. 2. Compute d with b and m held constant. This step is similar to the corresponding step in the conventional linear method described in the section entitled ‘‘The Conventional Method.’’ From Eq. (32), we obtain the following linear equation in unknown parameter vector d: ir 2 ir 4 . . . ir 2r 0 r 2 + 2i2 2ij r2 0 c3 2ij r2 + 2j 2 0 r2 jr 2 jr 4 . . . jr 2r 0 c d= 1 c2
CAMERA CALIBRATION FOR IMAGE PROCESSING
where
d T = [k1 k2 . . . kr 0 p1 p2 s1 s2 ] r 2 = s−2 (i f − i0 )2 + ( j f − j0 )2 i = s−1 (i f − i0 ), j = ( j f − j0 ) c1 = (b1 x + b2 y + b7 ) − i(b5 x + b6 y + 1) c2 = f (b3 x + b4 y + b8 ) − j(b5 x + b6 y + 1) c3 = (b5 x + b6 y + 1), and r0 is the order of the radial lens distortion model. 3. Compute m with b and d held constant. This is an unconstrained minimization problem similar to the one described in the section entitled ‘‘Main Algorithm.’’ The objective function J(.) in Eq. (32) is minimized. All extrinsic parameters and focal length are obtained from the equations at the end of the section entitled ‘‘The Conventional Method.’’ The ambiguity in the solution of t3 is resolved due to the constraint h(b) ⫽ 0. The method is computationally complex, but it produces an optimal solution. The method also computes the image center and scale factor parameters. EXPERIMENTAL RESULTS Though we have presented at least 10 different methods of camera calibration, we select four representative methods for our experiments. Two of these methods are for noncoplanar calibration, and two for coplanar calibration. Of these two methods in each calibration case, one is a linear computation method and one a nonlinear method. For noncoplanar calibration, we choose the Grosky and Tamburino’s method (4) as the linear method, and the Chatterjee, Roychowdhury, and Chong’s method (22) as the nonlinear method. For coplanar calibration, we choose the Conventional method for linear computation, and then the nonlinear method. We test these algorithms on two sets of data: (1) synthetic data corrupted with known noise, and (2) real data obtained from a calibration setup. The accuracy of each algorithm is evaluated according to square root of the mean square errors in both image components:
1
Image Error
N
=
N
s−2 (in − in (b, d, m))2 +
n=1
N
! ( jn − jn (b, d, m))2
n=1
(34) where (in, jn) are points measured from the image and (in(b, d, m), jn(b, d, m)) are computed from calibration parameters (b, d, m). Experiments with Synthetic Data The synthetic data was generated for both noncoplanar and coplanar cases with a known set of extrinsic and intrinsic camera parameters. First a 10 ⫻ 10 (i.e., N ⫽ 100) grid of points are generated to simulate the calibration target. (Xw, Yw) components of the calibration points are uniformly spaced in a rectangular grid within the range (⫺5, ⫹5). For the non-
755
coplanar case, the Zw component is randomly generated (uniform distribution) within the range (⫺5, ⫹5). The rotation matrix R in Eq. (3) is generated from Euler angles 웆 ⫽ ⫽ ⫽ 15⬚. Translation vector t is (0.5, 0.5, ⫺14.0), and focal length f ⫽ 300. Image coordinates are obtained from the collinearity conditions, stated in Eq. (1). Lens distortion is then added to this data according to first- and second-order radial distortion models with coefficients k1 ⫽ 10⫺7 and k2 ⫽ 10⫺14. Image center and scale factor parameters are next added to the image with i0 ⫽ 5, j0 ⫽ 8 and s ⫽ 0.8. Finally, an independent Gaussian quantization noise is added to the image components. In accordance with (40), the variance of the added noise in i and j image coordinates are (sf)⫺2 /12 and f ⫺2 /12, respectively. The reasoning behind these noise variances are briefly given below. Different values of in Tables 2, 3, and 4 denote different multiples of this noise. In accordance with (11), calibration accuracy can be defined by projecting each pixel onto a plane that is orthogonal to the optical axis, and go through the projected pixel center on the calibration surface. The projection of the pixel on this plane is a rectangle of size a ⫻ b with a in the I image direction and b in the J image direction (see Fig. 1). Uniform digitization noise in a rectangle a ⫻ b has standard deviation 애0 ⫽ 兹(a2 ⫹ b2)/12 ⫽ z2 兹((sf)⫺2 ⫹ f ⫺2)/12 at depth z. As per (11), this positional inaccuracy is comparable to the normalized error in image plane denoted by 애, and defined as below:
1 i
Normalized Image Error N
=µ=
N
n=1
n
− in (b, d, m) sf
2
N jn − jn (b, d, m) + f n=1
! 2
(35) We shall use the normalized image error 애 to evaluate the accuracy of all algorithms. Noncoplanar Case. First the Grosky and Tamburino’s Linear method is applied to the synthetic image data and corresponding world coordinates. Next, the Chatterjee, Roychowdhury, and Chong’s nonlinear method is applied to this data. The initialization algorithm, followed by five iterations of the main algorithm is used. These results for ⫽ 1, 5 and 10 are shown in Table 2. The results clearly show a significant improvement in all parameters as well as image error due to the nonlinear algorithm for all noise levels. As expected, a higher image error is seen for larger quantization noise denoted by . Table 3 shows the image error against iterations of the nonlinear algorithm. Different amounts of quantization noise show that the image error, although better compared with the linear method, increases with increasing noise . Coplanar Case. Similar to the noncoplanar case, we estimate the calibration parameters first with the linear (Conventional) method and then with the nonlinear method. In order to use the (Conventional) linear method, we need to compute the image center (i0, j0) and scale factor (s) parameters by an off-line method. We used an algorithmic scheme given in (22) to obtain s ⫽ 0.801860, i0 ⫽ 4.768148, and j0 ⫽ 8.37548, when the true values are s ⫽ 0.8, i0 ⫽ 5, and j0 ⫽ 8. We used the 20 iterations of each algorithm. The results are given in Table 4.
756
CAMERA CALIBRATION FOR IMAGE PROCESSING
Table 2. Errors in Parameters and Image Due to the Linear and Nonlinear Methodsa (Noncoplanar) Parameters 储r1 ⫺ r*1 储/储r*1 储 储r2 ⫺ r*2 储/储r*2 储 储r3 ⫺ r*3 储/储r*3 储 储t ⫺ t*储/储t*储 兩 f ⫺ f *兩/兩 f *兩 兩 i0 ⫺ i*0 兩/兩i*0 兩 兩 j0 ⫺ j 0*兩/兩 j 0*兩 兩s ⫺ s*兩/兩s*兩 储d ⫺ d*储/储d*储 Image Error 애 a
Linear ( ⫽ 1)
Nonlinear ( ⫽ 1)
Linear ( ⫽ 5)
Nonlinear ( ⫽ 5)
Linear ( ⫽ 10)
Nonlinear ( ⫽ 10)
0.00130163 0.00019631 0.00131631 0.00117155 0.00081911 0.05657449 0.00353542 0.00005129
0.00001288 0.00000522 0.00001350 0.00002384 0.00002200 0.00056567 0.00016246 0.00000139 0.00697193 0.00178726 0.00000596
0.00137304 0.00017492 0.00148493 0.00125086 0.00082520 0.05788677 0.00377974 0.00006083
0.00002542 0.00003419 0.00004105 0.00014397 0.00014350 0.00202255 0.00099542 0.00001226 0.02037457 0.00881193 0.00002936
0.00153758 0.00022877 0.00174635 0.00153610 0.00100966 0.06279373 0.00552706 0.00007275
0.00007441 0.00007260 0.00010160 0.00031061 0.00030045 0.00523871 0.00201209 0.00002601 0.02843441 0.01461288 0.00004869
0.03385128 0.00011293
0.03490645 0.00011644
0.03808799 0.00012704
⫽ Multiple of quantization noise units added to the image points.
Table 3. Image Errors for Iterations of the Nonlinear Methoda
a b
⫽ ⫽ ⫽ ⫽ ⫽
1 2 3 4 5
b
Linear Method
Init. Alg.
Iter. ⫽ 1
Iter. ⫽ 2
Iter. ⫽ 3
Iter. ⫽ 4
Iter. ⫽ 5
0.03463596 0.03491059 0.03527227 0.03571836 0.03624572
0.00206831 0.00369134 0.00540639 0.00714753 0.00889948
0.00179046 0.00351634 0.00525823 0.00700423 0.00875188
0.00178563 0.00351403 0.00525680 0.00700324 0.00875116
0.00178435 0.00351345 0.00525646 0.00700302 0.00875102
0.00178354 0.00351306 0.00525621 0.00700285 0.00875090
0.00178290 0.00351273 0.00525600 0.00700270 0.00875079
⫽ Multiple of quantization noise units added to the image points. Iterations 1–5 are for the Main Alg.
Table 4. Errors in Parameters and Image Due to the Linear and Nonlinear Methods (Coplanar) Parameters 储r1 ⫺ r*1 储/储r*1 储 储r2 ⫺ r*2 储/储r*2 储 储r3 ⫺ r*3 储/储r*3 储 储t ⫺ t*储/储t*储 兩 f ⫺ f *兩/兩 f *兩 兩 i0 ⫺ i*0 兩/兩i*0 兩 兩 j0 ⫺ j 0*兩/兩 j 0*兩 兩s ⫺ s*兩/兩s*兩 储d ⫺ d*储/储d*储 Image Error 애
Linear ( ⫽ 1)
Nonlinear ( ⫽ 1)
Linear ( ⫽ 5)
Nonlinear ( ⫽ 5)
Linear ( ⫽ 10)
Nonlinear ( ⫽ 10)
0.23353182 0.00545535 0.01506598 0.03428311 0.02422284
0.00035990 0.00055831 0.00456795 0.00475897 0.00458828 0.04601510 0.04551644 0.00000293 0.01137128 0.00165670 0.00000555
0.23354564 0.00546050 0.01516991 0.03432293 0.02432405
0.00036719 0.00058498 0.00468768 0.00491331 0.00471164 0.04858865 0.04695227 0.00000865 0.05210049 0.00817535 0.00002738
0.23367292 0.00546703 0.01530110 0.03437303 0.02445051
0.00038140 0.00061839 0.00485197 0.00510839 0.00486651 0.05155711 0.04875117 0.00001538 0.10238679 0.01533643 0.00005112
0.18899921 0.03688418 0.00012600
0.18898312 0.03898028 0.00013317
0.18896301 0.04290334 0.00014660
Table 5. Calibration Results for Real Data (Noncoplanar) Parameters
Linear Method
Nonlinear Method
r1 r2 r3 t f i0 , j0 s k1 , k 2 p1 , p 2 s1 , s 2 Image Error 애
0.9999482 0.0000020 ⫺0.0101782 0.0001275 0.9999191 0.0127197 0.0101774 ⫺0.0127203 0.9998673 5.0072589 ⫺6.3714880 ⫺509.2793659 4731.5257593 21.9558272 ⫺12.4401193
0.9999501 0.0000290 ⫺0.0099916 0.0000262 0.9999847 0.0055243 0.0099916 ⫺0.0055243 0.9999348 4.8876372 ⫺2.6923089 ⫺507.3856448 4724.5813533 19.9538046 ⫺17.3227311 1.0005753 1.0650 ⫻ 10⫺07 ⫺4.6403 ⫻ 10⫺13 3.0800 ⫻ 10⫺06 ⫺2.1293 ⫻ 10⫺06 1.1242 ⫻ 10⫺06 ⫺5.6196 ⫻ 10⫺07 0.14625769 0.00003096
0.24930409 0.00005269
CAMERA CALIBRATION FOR IMAGE PROCESSING
757
Table 6. Calibration Results for Real Data (Coplanar) Parameters
Linear Method
Nonlinear Method
r1 r2 r3 t f i0 , j0 s k1 , k2 p1 , p 2 s1 , s 2 Image Error 애
0.9992172 ⫺0.0000144 0.0000175 0.9999982 0.0016014 ⫺0.0019229 ⫺0.5774381 ⫺0.8354928 ⫺76.8508100 713.9534937
0.9996423 0.0000762 0.0000371 0.9999910 0.0267438 ⫺0.0042384 0.1626380 ⫺1.7551208 ⫺371.4850635 3464.3130475 6.9151391 ⫺8.6055953 0.9998891 2.1772 ⫻ 10⫺07 ⫺2.4362 ⫻ 10⫺12 ⫺3.0989 ⫻ 10⫺06 ⫺1.1376 ⫻ 10⫺06 5.2088 ⫻ 10⫺06 ⫺2.0617 ⫻ 10⫺06 0.21395695 0.00006176
1.3643 ⫻ 10⫺07 ⫺4.3674 ⫻ 10⫺12
0.25422583 0.00035608
These results clearly show that all parameters are estimated with greater accuracy by the nonlinear method when compared with the linear method. However, the overall estimation accuracy of parameters is lower in comparison with the noncoplanar case. This is expected, since a smaller number of parameters are estimated and fewer constraints can be imposed. However, the total image error is less with the nonlinear method. The results also show an increase in the image error due to higher quantization noise. As seen in Table 3, a similar pattern of image error against iterations is obtained for the coplanar case. Experiments with Real Data Real data is generated from test calibration points created by accurately placing a set of 25 dots in a square grid of 5 ⫻ 5 dots on a flat surface. The center to center distance between the dots is 7.875 mm. The diameter of each dot is 3.875 mm. The calibration pattern is mounted on a custom-made calibration stand. The centroid pixel of each dot is obtained by image processing to subpixel accuracy. Although we observed high lens distortion for wide angle lenses, we used a 35 to 70 mm zoom lens because of its frequent use in many applications, and a depth of field that can focus within a range of 0 to 60 mm. We used an off-the-shelf camera in a monoview setup. The camera resolution is 512 ⫻ 480 pixels and the digitizer gives digital images with 16 bits/pixel. For noncoplanar calibration, we placed the calibration target on a high precision stand that can move along the Zw axis of world coordinates with high accuracy by means of a micrometer screw. Each 360⬚ turn of the screw moves the target by 1 mm, and the positional accuracy is within 0.003 mm. The calibration target is positioned at three positions, z ⫽ 0, z ⫽ ⫺20 mm, and z ⫽ ⫺40 mm. Image points are extracted for all 25 dots in each location to obtain a total of 75 data points for calibration. A second image is acquired at all three of the above z locations for testing. Second-order radial and first-order decentering and thin prism lens distortion models are used. The linear (Grosky and Tamburino’s) and nonlinear (Chatterjee, Roychowdhury, and Chong’s) methods are used with five iterations of the nonlinear algorithm. The results for the noncoplanar case are shown in Table 5. As seen with the synthetic data, image error has improved due the nonlinear algorithm. In the coplanar case, we placed the calibration grid at a fixed z location and acquired the calibration and test images.
We next estimated the image center (i0, j0) and scale factor (s) parameters by an off-line method (22). We obtained s ⫽ 1.001084, i0 ⫽ 6.912747, j0 ⫽ ⫺8.601255. Note that the camera settings for the coplanar case are different from the noncoplanar case, and the results are expectedly different. Twenty iterations of both linear (Conventional) and nonlinear algorithms are used. The results for the coplanar case are summarized in Table 6. As seen before, the image error improved with the nonlinear method. Further note that the parameters computed by the nonlinear method are far more consistent with the actual setup. For example, the linear method estimates focal length f ⫽ 713.95, whereas the nonlinear method estimates f ⫽ 3464.31 (see Table 6). In a similar setup for the noncoplanar case, we obtained f ⫽ 4724.58. Thus, the nonlinear estimate of focal length is more consistent with the actual setup. In order to check the effects of the tangential lens distortion model, we used just the radial model on the above data and obtained a normalized image error of 0.00010177. This is compared with 0.00006176 obtained above with radial, tangential, and thin prism lens distortion models. This experiment shows that the decentering and thin prism distortion models are effective in reducing the net normalized image error. Similar results are obtained by others (1,11,20). CONCLUDING REMARKS In this article, we discuss several methods of noncoplanar and coplanar camera calibration. The methods range in computational complexity, execution speed, robustness, and accuracy of the estimates. The linear methods discussed in the section entitled ‘‘Linear Methods of Camera Calibration’’ are computationally efficient and relatively uncomplicated, but they lack the accuracy and robustness of the parameter estimates. The nonlinear methods discussed in the section entitled ‘‘Nonlinear Methods of Camera Calibration’’ are computationally complex, but the parameter estimates are very accurate provided that the calibration data are also accurate. The robust methods discussed in the section entitled ‘‘Robust Methods of Noncoplanar Camera Calibration’’ can produce good parameter estimates even if the control points are not so accurate. However, the method cannot compute all calibration parameters and is computationally complex. In the end, we have given three methods of coplanar camera calibration: Two are computationally simple but produce suboptimal solutions,
758
CAMERA CALIBRATION FOR IMAGE PROCESSING
and the other is computationally complex and produces an optimal solution.
BIBLIOGRAPHY 1. O. D. Faugeras and G. Toscani, The calibration problem for stereo, Proc. Comput. Vision Pattern Recognition Conf., Miami Beach, FL, 1986, pp. 15–20.
21. Y. Nomura et al., Simple calibration algorithm for high-distortion-lens camera, IEEE Trans. Pattern Anal. Mach. Intell., 14: 1095–1099, 1992. 22. C. Chatterjee, V. P. Roychowdhury, and E. K. P. Chong, A nonlinear Gauss–Seidel algorithm for noncoplanar and coplanar camera calibration with convergence analysis, Comput. Vision Image Understanding, 67 (1): 58–80, 1997. 23. D. P. Bertsekas and J. N. Tsitsiklis, Parallel and Distributed Computation, Englewood Cliffs, NJ: Prentice-Hall, 1989.
2. S. Ganapathy, Decomposition of transformation matrices for robot vision, Proc. Int. Conf. Robotics Autom., pp. 130–139, 1984.
24. G. H. Golub and C. F. VanLoan, Matrix Computations, Baltimore: Johns Hopkins Univ. Press, 1983.
3. K. D. Gremban et al., Geometric camera calibration using systems of linear equations, Int. Conf. Robotics, 1988, pp. 562–567.
25. R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, Vol. 2, Reading, MA: Addison-Wesley, 1993.
4. W. I. Grosky and L. A. Tamburino, A unified approach to the linear camera calibration problem, IEEE Trans. Pattern Anal. Mach. Intell., 12: 663–671, 1990.
26. R. G. Wilson and S. A. Shafer, What is the center of the image?, Technical Report CMU-CS-93-122, Carnegie Mellon University, Pittsburgh, 1993.
5. M. Ito and A. Ishii, Range and shape measurement using threeview stereo analysis, Proc. Comput. Vision Pattern Recognition Conf., Miami Beach, FL: 1986, pp. 9–14.
27. W. J. Smith, Modern Optical Engineering, The Design of Optical Systems, Optical and Electro-Optical Engineering Series, New York: McGraw-Hill, 1966.
6. I. Sobel, On calibrating computer controlled cameras for perceiving 3-D scenes, Artif. Intell., 5: 185–198, 1974.
28. L. L. Wang and W. H. Tsai, Camera calibration by vanishing lines for 3D computer vision, IEEE Trans. Pattern Anal. Mach. Intell., 13: 370–376, 1991.
7. Y. Yakimovsky and R. Cunningham, A system for extracting three-dimensional measurements from a stereo pair of TV cameras, Comput. Graph. Image Process., 7: 195–210, 1978. 8. D. B. Gennery, Stereo-camera calibration, Proc. Image Understanding Workshop, 1979, pp. 101–108. 9. R. Y. Tsai, A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses, IEEE J. Robotics Automat., RA-3: 323–343, 1987.
29. J. Z. C. Lai, On the Sensitivity of Camera Calibration, Image and Vision Computing, 11 (10): 656–664, 1993. 30. M. A. Penna, Camera calibration: A quick and easy way to determine the scale factor, IEEE Trans. Pattern Anal. Mach. Intell., 13: 1240–1245, 1991. 31. B. Caprile and V. Torre, Using vanishing points for camera calibration, Int. J. Comput. Vision, 4: 127–140, 1990.
10. R. K. Lenz and R. Y. Tsai, Techniques for calibrating the scale factor and image center for high accuracy 3-D machine vision metrology, IEEE Trans. Pattern Anal. Mach. Intell., 10: 713– 720, 1988.
32. D. C. Brown, Decentering distortion of lenses, Photogramm. Eng., 32: 444–462, 1966.
11. J. Weng, P. Cohen, and M. Herniou, Camera calibration with distortion models and accuracy evaluation, IEEE Trans. Pattern Anal. Mach. Intell., 14: 965–980, 1992.
34. M. Ito and A. Ishii, A non-iterative procedure for rapid and precise camera calibration, Pattern Recognition, 27 (2): 301–310, 1994.
12. W. Faig, Calibration of close-range photogrammetric systems: Mathematical formulation, Photogramm. Eng. Remote Sens., 41 (12): 1479–1486, 1975.
35. J. Y. S. Luh and J. A. Klaasen, A Three-Dimensional Vision by Off-Shelf System with Multi-Cameras, IEEE Trans. Pattern Anal. Machine Intell. 7: 35–45, 1985.
13. S. K. Ghosh, Analytical Photogrammetry, New York: Pergamon, 1979.
36. D. A. Butler and P. K. Pierson, A Distortion-Correction Scheme by Industrial Machine-Vision Application, IEEE Trans. Robotics Autom., 7: 546–551, 1991.
14. R. C. Malhotra and H. M. Karara, A computational procedure and software for establishing a stable three-dimensional test area for close-range applications, Proc. Symp. Close-Range Programmetric Syst., Champaign, IL, 1975.
33. D. C. Brown, Close-range camera calibration, Photogramm. Eng., 37: 855–866, 1971.
37. S-W. Shih et al., Accurate linear technique for camera calibration considering lens distortion by solving an eigenvalue problem, Opt. Eng., 32 (1): 138–149, 1993.
15. Manual of Photogrammetry, 4th ed., American Society of Photogrammetry, 1980.
38. D. Luenberger, Linear and Nonlinear Programming, 2nd ed., Reading, MA: Addison-Wesley, 1984.
16. K. W. Wong, Mathematical formulation and digital analysis in close-range photogrammetry, Photogramm. Eng. Remote Sens., 41 (11): 1355–1375, 1975.
39. R. M. Haralick et al., Pose estimation from corresponding point data, IEEE Trans. Syst. Man Cybern., 19: 1426–1446, 1989.
17. A. Bani-Hasemi, Finding the aspect-ratio of an imaging system, Proc. Comput. Vision Pattern Recognition Conf., Maui, Hawaii, 1991, pp. 122–126. 18. Y. I. Abdel-Aziz and H. M. Karara, Direct linear transformation into object space coordinates in close-range photogrammetry, Proc. Symp. Close-Range Photogrammetry, Univ. of Illinois at Urbana-Champaign, 1971, pp. 1–18. 19. H. M. Karara (ed.), Non-Topographic Photogrammetry, 2nd ed., American Society for Photogrammetry and Remote Sensing, 1989. 20. H. A. Beyer, Accurate calibration of CCD-cameras, Proc. Comput. Vision Pattern Recognition Conf., Champaign, IL, 1992, pp. 96–101.
40. K. S. Arun, T. S. Huang, and S. D. Bolstein, Least-squares fitting of two 3-D point sets, IEEE Trans. Pattern Anal. Mach. Intell., 9: 698–700, 1987. 41. R. L. Launer and G. N. Wilkinson, Robustness in Statistics, New York: Academic Press, 1979.
CHANCHAL CHATTERJEE GDE Systems Inc.
VWANI P. ROYCHOWDHURY UCLA
CAPACITANCE. See DIELECTRIC POLARIZATION.
file:///N|/000000/0WILEY%20ENCYCLOPEDIA%20OF%20ELECTRICA...TRONICS%20ENGINEERING/26.%20Image%20Processing/W4102.htm
}{{}}
●
HOME ●
ABOUT US ●
CONTACT US ●
HELP
Home / Engineering / Electrical and Electronics Engineering
Wiley Encyclopedia of Electrical and Electronics Engineering Holographic Storage Standard Article Tzuo-Chang Lee1 and Jahja Trisnadi2 1Eastman Kodak Company, Rochester, NY 2Silicon Lights Machines, Sunnyvale, CA Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4102 Article Online Posting Date: December 27, 1999 Abstract | Full Text: HTML PDF (465K)
●
●
● ●
Recommend to Your Librarian Save title to My Profile Email this page Print this page
Browse this title ●
Search this title Enter words or phrases
Abstract The sections in this article are Purpose and Methodology Principles of Thick Holograms Storage Density Limits Read Transfer Rate Trade-Off Between The Storage Density and The Transfer Rate About Wiley InterScience | About Wiley | Privacy | Terms & Conditions Copyright © 1999-2008John Wiley & Sons, Inc. All Rights Reserved.
file:///N|/000000/0WILEY%20ENCYCLOPEDIA%20OF%20ELE...S%20ENGINEERING/26.%20Image%20Processing/W4102.htm17.06.2008 15:23:04
❍
❍ ❍
Advanced Product Search Search All Content Acronym Finder
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering c 1999 John Wiley & Sons, Inc. Copyright
HOLOGRAPHIC STORAGE Holographic data storage systems store 2-D data patterns in the form of holograms. The use of thick media allows multiplexing of a large number of holograms in the same media volume, and the number of multiplexed holograms can be as high as a few percent of the thickness-to-wavelength ratio, media permitting. Therefore, holographic storage offers two very attractive features: high data capacity and high data transfer rate. The latter property is a result of the nature of parallel readout of the 2-D data pattern. For a recent review, see Refs. 1,2,3,4,5,6. Figure 1 shows a generic holographic storage system. During recording, electronic data are loaded into the optical spatial modulator (SLM), which spatially modulates the signal beam. The signal beam, upon interfering with a reference beam inside the recording media, forms a complex interference pattern that is replicated in the media as a refractive index pattern. During readout, the media are illuminated with a reference beam, identical to the one used in the recording, causing the complex grating to diffract a fraction of the reference beam to re-create the stored image. The detector (e.g., a charge coupled device, CCD) captures this image and converts it back to electronic data. A number of pages can be multiplexed in the same media volume provided they are recorded with reference beams having different wave vectors. Because of the Bragg selectivity property of thick holograms, each page can be recalled independently by a reference beam whose wave vector matches the write reference wave vector. As shown in Fig. 1, the modulated beam is normally focused (i.e., Fourier transformed) by the Fourier transform lens (FTL) onto the recording media to minimize the spatial extent of the data. The reference beam is passed through a multiplexer, which directs the reference beam to land at the recording site at selected angles. Upon readout, the diffracted signal from the media is inverse Fourier transformed by the inverse Fourier transform lens (IFTL) and captured by CCD. Not shown in Fig. 2 are the optical modulators that switch the beams on and off during the operation and the polarization rotators that control the beam polarizations. Holographic storage with 2-D media was actively researched in the early 1960s (7–12) but became dormant because early efforts were hampered by immature component technologies. Additionally, it was not as attractive as bit serial storage in density. Recent advances in 3-D storage media and in lasers, SLM, and CCD have caused a renewed interest in holographic storage. At the present time, the major difficulty still lies in the lack of a mature enough 3-D holographic storage medium.
Purpose and Methodology The bottom line parameters of importance for holographic storage are the density D and the transfer rate νr . Here, νr represents the read transfer rate. The write transfer rate is also important but has a more complicated dependence on media properties such as sensitivity and reciprocity; it will not be addressed in this paper. Under the constraint of a given SNR requirement of the overall system, it can be shown that when we increase νr , then the diffraction efficiency η must be larger, which will then lead to a lower storage density for media with a finite dynamic range n, or vice versa. However, the trade-off is not a linear one. 1
2
HOLOGRAPHIC STORAGE
Fig. 1. Diagram of a holographic storage that employs angular multiplexing. The essential building blocks of the optical architecture are all contained in this diagram.
The purpose of this article is to quantify the preceding statement with detailed analysis, in the context of a thick medium with a given media noise index χ and dynamic range n. This will provide the starting design framework of a holographic storage system, where the trade-off between the media and the system can be clearly addressed. This article will focus on the angle multiplexing architecture where the reference beam and the signal beam are incident on the same media surface. We will not address another type of architecture where the two beams are incident on two different media surfaces with the 0◦ /90◦ geometry (1,13). The latter has somewhat different trade-off rules and promises higher volume storage densities. Even though it is quite straightforward to employ large storage space in the form of say a disk for the former architecture, it is difficult to extend from a single piece of cube to a large volumetric space effectively with the latter architecture. Our analysis methodology is described in the flow diagram shown in Fig. 2. We first establish the equations for the storage density D in terms of the number of pages per stack and the area occupied by the stack. By stack, we mean the area occupied by the multiplexed holograms in the same volumetric space. The number of pages per stack N pg is limited by the dynamic range n and the diffraction efficiency per hologram η. The area occupied by each stack is a complicated function of λ, F number of the Fourier optics, media thickness d, and N SLM , the number of pixels per side for the SLM and the recording geometry. The results of the trade-off studies between D and the material and optics parameters will be given in the section entitled “Storage Density Limits.” We then establish the equations for the transfer rate νr as constrained by the interpage cross-talk SNRpg , the media noise index χ (to be defined in the subsection entitled “Noise”), and η, which determines the amount of signal photons falling on the CCD. The trade-off between νr and the different noise contributions and η will be given in the section entitled “Read Transfer Rate.” Since the common parameter for D and νr is η, this allows us to establish the trade-off relation between the storage density and the transfer rate clearly. We will present the relation in closed form in the section entitled “Trade-off Between the Storage Density and the Transfer Rate.” Furthermore, we established the trade-off between the dynamic range n and the noise index χ in order to achieve the highest possible storage density for holographic storage.
HOLOGRAPHIC STORAGE
3
Fig. 2. Flow diagram of trade-off analysis between storage density and the data transfer rate. The blocks on the left side derive the dependence of the storage density on the material parameters and the optical parameters, while the two blocks on the right derive the dependence of the transfer rate on the diffraction efficiency and the signal-to-noise. The storage density and the transfer rate are tied together via the diffraction efficiency.
Principles of Thick Holograms Review of the Theory of Thick Holograms. The foundation of thick hologram theory was laid in Kogelnik’s 1969 paper (14) using coupled mode equations. Consider a uniform phase grating inside a media of thickness d (see Fig. 3), with a space-dependent refractive index,
where n is the average refractive index, n is the magnitude of the refractive index modulation, and K is the grating vector. The grating vector K has a magnitude
and a direction making an angle φ with respect to the z axis. Using the wave equation
we look for a diffraction solution of the form
4
HOLOGRAPHIC STORAGE
Fig. 3. Geometry of a thick hologram grating.
Fig. 4. Vector diagram of Bragg condition is the foundation of thick holograms.
where the reference beam with an amplitude of R(z) and wave vector kR is incident on the grating causing a diffraction with amplitude of S(z) and vector kS . Also, R(0) = 1 and S(z ≤ 0) = 0. This solution can be found from first-order perturbation near the Bragg condition. The Bragg condition (identified by subscripts 0) is expressed as
This situation is described by the vector diagram in Fig. 4, which is conveniently drawn on a circle of radius k0 = 2πn0 /λ. θR0 and θS0 are the angle that kR0 and kS0 make with the z axis, respectively. λ is the free space wavelength. Varying of kR0 by δθ in direction and δλ in wavelength and inserting Eq. (4) into Eq. (3), Kogelnik found that the first-order perturbation solution is
HOLOGRAPHIC STORAGE
5
Fig. 5. η versus ξ for various values of ν. It shows that the diffracted intensities are dependent on the degree of refractive index modulation, and it also shows the existence of the side lobes, which would cause cross-talk amongst adjacent holograms.
where ξ is the Bragg mismatch parameter,
(a third term, δn, can be added to include the possibility of controlling the Bragg condition through variation of the bulk refractive index in electrooptic materials by electric field), and the modulation parameter ν is
The geometry factors cR and cS are
The second part of Eq. (9) states that the signal is diffracted at an angle θS = cos − 1 [cos θR − K cos φ/k0 ]. The diffraction efficiency η is
The dependency of η on ξ and ν is shown in Fig. 5.
6
HOLOGRAPHIC STORAGE For small a modulation, η approaches a sinc2 -function of ξ:
In the presence of uniform absorption, Eq. (10) must be replaced by
where a is the absorption constant. Also for notation simplicity, we will drop 0 from quantities at the Bragg condition. In holographic storage, the read reference wave vector is taken to be the same as the write reference wave vector for a single λ system. This is not only a practical matter, but necessary. The object beam usually contains multiple wave vectors (Fourier transform of the data image); therefore, simultaneous satisfaction of Bragg condition for all wave vectors can happen only when the read and write reference wave vectors are either identical or antiparallel.
Multiplexing. Principles. Multiplexing is based on two properties: Bragg selectivity and Bragg degeneracy. Selectivity. We have seen that a sufficiently large deviation from Bragg condition, expressed by the Bragg mismatch parameter
will result in little reconstruction of the diffracted signal. And the thicker the media, the higher the selectivity will be. Let us consider angle multiplexing at a fixed wavelength (δλ = 0). To multiplex as many pages as possible in a given θ range, a small value of ξ is desirable. However, it is obvious that small ξ can potentially give rise to large interpage cross-talk. For small refractive index modulation, usually the case in holographic storage, the sinc2 form of the diffracted signal suggests that two consecutive pages should be separated by
to minimize cross-talk. α is referred as page-separation parameter. Under the condition in Eq. (14), the peak of a given page coincides with the αth zeros of its immediate neighbors. For complex gratings (as opposed to uniform planar gratings), the angular profile minima are not exactly zero. As will be shown in the subsection entitled “Noise,” higher and noninteger α values will be needed to reduce the cross-talk at the price of reducing the number of pages that can be multiplexed. Degeneracy. A pair of writing beams produces a unique grating. The converse is, however, not true. There are infinitely many possible reference beams with different incident angles and/or wavelengths that satisfy the Bragg condition in Eq. (2), [i.e., making an angle cos − 1 (K/2k) with the grating vector] and, therefore, reconstruct the hologram. The loci of all wave vectors that recall a given grating vector K form a family of two-sided cones, as illustrated in Fig. 6. Therefore, multiplexing is a systematic method to organize a number of grating vectors so that they are separated from each other by the selectivity and are not degenerately reconstructed (are not lying in the
HOLOGRAPHIC STORAGE
7
Fig. 6. Degeneracy of hologram readout. That is, there are many different reference beam vectors that can retrieve the same hologram.
intersection of two or more cones). A number of multiplexing methods are known. They are all based on the clever exploitation of the selectivity and degeneracy properties. The most common method is angle multiplexing. Angle multiplexing has been used since the early days of holographic storage (7,8,9,10,11,12) and is still widely employed (1,2,3,4,5,6,15,16,17). Other multiplexing methods have been proposed over the years: wavelength multiplexing (7,18), peristhropic multiplexing (19,20), shift multiplexing (21,22,23), and phase multiplexing (24,25,26,27,28,29,30). For the rest of the discussion, we will confine ourselves to angle multiplexing at a fixed wavelength (δλ = 0). Angle Multiplexing. In angle multiplexing, ξ is controlled by varying the angle between the reference wave vector k and the grating vector K at a fixed wavelength. To leave the Bragg cones as fast as possible, the angle variation must lie in the k–K plane. This is equivalent to the requirement that all the reference and the signal beams must be in the reference-signal plane. Angle multiplexing can be accomplished by either changing the reference incident angle or rotating the media (around an axis normal to the reference-signal plane). The latter, however, introduces abberations and is not commonly adopted in practical implementations. Therefore, we will consider only the case where the reference incident angle is varied. The vector diagram is shown in Fig. 7. Applying the “απ-separation” criteria of Eq. (14) in Eq. (13), the angular change between two consecutive pages is
where θR and θS are the reference and signal incident angles inside the media. Integrating over the available reference angle range, θR,min ≤ θR ≤ θR,max , we obtain the number of multiplexed pages N pg in a stack:
8
HOLOGRAPHIC STORAGE
Fig. 7. Vector diagram for angle multiplexing. It shows how the different holographic grating vectors are generated by different reference beam vectors.
Fig. 8. Ten angularly multiplexed holograms with 3π separation.
where µMX is given by
For the purpose of illustration, we show the angle multiplexing of ten pages with α = 3 in Fig. 8. The parameters used are λ = 0.5 µm, n = 1.5, d = 30 µm, α = 3, and θS = 15◦ . The reference angular range is from θR = 15.0◦ to 40.4◦ (which corresponds to an external angular range of R = 23◦ to 76◦ ). Noise. The presence of noise introduces limitations to holographic storage system performance. High noise must be compensated by a large signal to yield good data recovery, but at the price of consuming the dynamic range and hence reducing storage density. Therefore, it is necessary to quantify noise in details. In this section, we will discuss three types of noise that arise in holographic storage: media scatter noise, interpage cross-talk noise, and detector noise. Media Scatter Noise. Illumination of the media during readout produces the desired signal along with scatter lights whose spatial distribution depends on the media optical properties. The scatter noise distribution is described by B( , )—the scatter noise power in the direction ( , ) per unit incident power per unit solid angle, where and are the polar and azimuthal angles, respectively. If the scatter source is d’Lambertian,
HOLOGRAPHIC STORAGE
9
then B is constant. The scatter noise index is defined as
where det is the solid angle subtended by the detector in the imaging optics (see, for example, Fig. 11, where the noise source would be at the object plane and the detector, in the CCD plane). Experimentally, χ can be determined by placing a detector with a collector lens that captures the scatter noise within the solid angle det . The signal-to-noise ratio is simply
where η is the page diffraction efficiency of the hologram. Interpage Cross-Talk Noise. We have seen that the angular profile of a page beyond the first minima is small but not negligible. The contributions of many neighboring pages give rise to interpage cross-talk, which can become quite large. Let us start by considering the normalized diffracted signal amplitude for a small modulation [see Eqs. (12) and (10)] near the Bragg condition:
where δθR is the read reference angle detuning. Suppose that a second page is recorded with a reference angle that differs by
from the page under consideration. For a single wave vector signal beam, the choice of α = 1, 2, . . . corresponds to zero cross-talk as the peak of the first page coincides with the zero minima of the second, and vice versa. √ However, if the signal beam has an angular spread θS ± θS , where θS = tan − 1 (1/2 2F) and F is the Fnumber of the recording optics, the cross-talk is no longer zero. The cross-talk amplitude X, which comes from the signal deviation, −θS ≤ δθS ≤ θS , is
where
Note that for θS = 0 and θR = 90◦ , this linear approximation gives ε = 0 so that we need to examine only the higher-order terms. This is the optimum geometry for maximizing signal-to-noise ratio. This is one of the attractive attributes of the 0◦ /90◦ geometry (13,31). The cross-talk as a function of the signal pixel position at three different values of α are shown in Fig. 9, where θR = θS = 23◦ , n = 1.5, and F = 2.
10
HOLOGRAPHIC STORAGE
Fig. 9. Normalized cross-talk intensity as a function of δθS at θR = π, 2π, and 3π for F = 2. The pixel position j = 30 corresponds to δθS = θS = 0.24 rad.
Now we address the problem of cross-talk for complex multiplexed holograms. Suppose that there are PL pages on the left and PR pages on the right of a given hologram, then the rms interpage noise-to-signal ratio (NSR) is
In most realistic situations, ε depends only weakly on θR , and therefore we treat it as a constant evaluated at the center θR . For the first p0 pages where παp0 ε is small (say up to π/4), or p0 ∼ 1/(4αε), we can use the approximation sinc2 [πm(1 + ε)] ∼ ε2 , m = αp = integer. Notice that for small values of δθS the approximation is independent of α or p, as can be seen in Fig. 9. The noise contribution of the first p0 pages on both sides is
For p ≥ p0 , we use the envelope of the sinc-function to estimate the worst case:
HOLOGRAPHIC STORAGE
11
Fig. 10. Interpage cross-talk as a function of the pixel positions for a family of α values, where α is a measure of the page angular separation. It is clear from the figure that the optimum choice of α is 2.44 which makes the cross-talk uniform and quite small.
Because ∞ p = 1 1/p2 = 1.645 . . ., Eq. (26) will never exceed 0.33/α2 . Notice that the largest contribution comes from the p = 1 term (0.2/α2 ). Thus, the total rms interpage cross-talk noise is confined to
Let us estimate the interpage cross-talk for the previous example with θR = θS = 23◦ and F = 2. In this case, ε = 0.34. For α = 2 (2π-criteria), p0 < 1, and therefore the first term in Eq. (24) does not contribute. Thus, NSRpg ≤ 0.083 or SNRpe ≥ 12. Because Eq. (27) gives the worst-case estimate, we provide in Fig. 10 a direct numerical simulation of cross-talk using Eq. (25). At α = 2, the cross-talk is zero at the center pixel point but grows rapidly near the edges of the image pattern to 0.06 at the edge of the left side. When the value of α is decreased to 1.9, the zero moves to the right, and the worst-case cross-talk gets larger; but when α is increased, the zero moves in the opposite direction and the worst-case cross-talk value also drops until a more or less uniform cross-talk of less than 0.04 is reached at an α value of 2.44 with this particular example. The best choice of α is actually 2.1, as can be seen from Fig. 10. This slight increase in α results in the lowest-possible interpage cross-talk, and it actually leads to a higher storage density when the density is dynamic range limited, as we will show in more detail in the section entitled “Trade-off Between the Storage Density and the Transfer Rate.” Detector Noise. The CCD noise consists of several sources of Johnson thermal noises at the sense node and at the preamps. These noises are lumped together to provide the rms noise electrons QCCDN . Present technologies can provide QCCDN value in the neighborhood of 20 to 50 (32,33). Thus,
where η is the page diffraction efficiency, PR the read reference laser power, τr the read time, QE the CCD quantum efficiency, N 2 SLM the number of SLM pixels, OS the CCD-to-SLM oversampling ratio, and Eλ the
12
HOLOGRAPHIC STORAGE
Fig. 11. The commonly used geometry of the f-f-f-f configuration for Fourier transform recording.
photonic energy at wavelength λ. The factor 0.5 is the result of the 1:1 ratio between the 1s and the 0s in a pseudorandom coded data sequence. These three noise sources will be utilized to develop the transfer rate in the section entitled “Read Transfer Rate” and also the trade-off between the storage densities and the transfer rates in the section entitled “Tradeoff Between the Storage Density and the Transfer Rate.” Before that, let us first investigate the storage density using the equations established earlier. Also for a review of another class of image detectors called CMOS detectors under investigation at many laboratories, see Refs. 34 and 35.
Storage Density Limits Fourier Transform Recording. Fourier transform recording is a simple method of minimizing the spatial extent of a 2D data pattern. The most common geometry employs an f-f-f-f configuration as shown in Fig. 11. Consider a square spatial light modulator SLM with N SLM × N SLM pixels of pitch p × p placed at the front focal plane of the Fourier transform lens with F-number F. Although the Fourier transform at the back focal plane is extensive, all the fundamental information are contained in the small center portion, a square with side h which equals β times the Rayleigh size:
Analysis and empirical tests shows that β must be ≥1.5 for successful data recovery (6). Because of the fact that the Fourier transform of a discrete binary pattern is very spotty, local intensities at the transform domain have large fluctuations. This causes a severe burden on the dynamic range of the recording media. The best approach to minimize the impact of this problem is via the use of the random phase shifters proposed in the early 1970s (36,37,38,39,40,41). Area per Stack. The ultimate transverse area used by a hologram stack depends on several other factors: the media thickness, the media refractive index, the reference incident angle, and the signal incident angle. The cross section of a stack is shown in Fig. 12. The incident plane is chosen to lie in the x–z plane (see the adopted coordinate system). In this diagram, R is the external reference incident angle and S is the external signal incident angle. H x is the linear size of the stack in the x direction and H y , not shown, is the
HOLOGRAPHIC STORAGE
13
linear size in the y direction. The angle is related to the F-number of the FTL by
H x and H y can be derived using the geometrical diagram of Fig. 12,
and
The multipliers gh and gd in Eqs. (31) and (32) are geometrical factors given by
In angle multiplexing, R should be chosen to be the largest reference angle employed. For normal signal incident ( S = 0), Eqs. (33) and (34) simplify to
respectively. The transverse area of a stack is
14
HOLOGRAPHIC STORAGE
Fig. 12. The cross-section of a stack and the optimum placement of the signal beam in the stack.
A useful way to discuss the hologram area is the area per bit for one hologram Abit , which is simply
As an example, Abit is plotted in Fig. 13 versus the F-number, where the media is assumed to have a thickness of 1 mm, and an index of 1.5. Notice that (1) the storage density drops as the number of pixels in the SLM decreases, and (2) an optimum F opt exists that minimizes the area, and it is dependent on N SLM , d, and the ◦ recording angles. In the special case where S = 0 and under reasonable ◦F-numbers (say F > 1), F opt can be well approximated by d/4βλnNSLM . In a numerical example with R = 60 , β = 1.5, λ = 0.5 µm, N SLM = 512, n = 2, d = 5 mm suitable for crystal recording, and the value of F opt becomes as large as 1.3. Storage Density. We have shown the equations for the area occupied per bit for a given hologram in the previous subsection, and if we also know the number of holograms per stack N pg , we will be able to estimate the final density per unit area. We first examine optics limited N pg , denoted by N pgo resulting from multiplexing, assuming no media limitation, and then examine N pgd imposed by the limited dynamic range of the recording media. The latter poses a limit for a given material, whereas the former poses the ultimate limit. Equations (16) and (17) provide the formula for N pgo , thus the areal density is
where Abit is area per bit given in Eqs. (31)–(37). Regarding N pgd , the number depends not only on n but also on the desired diffraction efficiency η. Combining Eqs. (8)–(12)and assuming that (1) ξ2 + ν2 1, which is true for memory applications where η per page must be small, and (2) each page consumes the same amount of δn, that is n = N pgd δn, we obtain
HOLOGRAPHIC STORAGE
15
Fig. 13. Area per bit per hologram versus the F-number using the optimum recording geometry shown in Figure 12. In this example, d = 1 mm, λ = 500 nm, R = S = 35◦ .
thus
In the following figures, we show the storage density potential for both the photopolymers (42,43,44,45,46,47, 48,49,50,51) and the photorefractive signel crystals (52,53,54) such as LiNbO3 . The former has an index about 1.5 and may potentially have a relatively large n of 10 − 3 to 10 − 2 . However, its thickness is not likely to become scalable to more than 1 mm because of dye absorption and the relatively large media noise in photopolymers. The latter has an index near 2.3 and a small n of 10 − 5 to 10 − 4 , but low absorption and low noise for crystals unburdened with striation problems. Figure 14 shows the storage density, expressed in bits per square inch, versus the available n of the first class of materials for a family of required η. The medium thickness is taken to be 1 mm and the absorption e − ad = 0.3, an FT lens with F = 2, R = S = 35◦ , and there is a reference beam swing angle of ±35◦ at the medium plane. The interpage separation factor α is set at 2.1 (see the subsection entitled “Interpage Cross-talk Noise” and Fig. 10 for the rationale behind this choice). The SLM has 512 × 512 pixels. The solid horizontal line is the density limits set by optics (i.e., Bragg selectivity). One sees from this particular example that the storage
16
HOLOGRAPHIC STORAGE
Fig. 14. Equivalent area density versus the media dynamic range for a photo-polymer like material. The optics-limited density is the solid horizontal line at the top. d = 1 mm, λ = 500 nm, F = 2, n = 1.5.
density ceiling is 3.5 × 1011 bits/in.2 . Even at a high η of 10 − 2 per page, the limit can be reached if the material has a n of slightly more than 10 − 2 . Figure 15 shows the storage density for the second class of materials where we have assumed a medium thickness of 3 mm and FT lens with F = 2 speed, not very far from the optimum F-number, as pointed out in the subsection entitled “Area Per Stack.” Other parameters are the same as those used in Fig. 14. One sees that the storage density ceiling imposed by optics is only slightly higher at 4.5 · 1011 /in.2 , even though the medium thickness has increased three-fold. However, because of the smaller available n of about 10 − 4 of most known crystals, it is not easy to reach the optics limit unless very low η can be tolerated. That implies that very low medium noise is needed. More on the subject of the dependence of density on noise will be discussed in the section entitled “Trade-off Between the Storage Density and the Transfer Rate.” Another important point to be made is that the curves in Figs. 14 and 15 are not much different. This is simply the result of the fact that when one records a hologram from the same side of the material, the useful k space is about the same if the external angles are the same. In this case, both simulations are done at R and S of 35◦ , and there is a reference beam swing angle of ±35◦ at the medium plane. The reason that a medium thickness of 1 mm versus 3 mm has not made much difference is discussed next. Because the hologram size increases very quickly with medium thickness, we suspect that there might be an optimum thickness beyond which the storage density will start to decrease. The example shown in Fig. 16 is for a material of n of 10 − 4 and F = 2, which is suitable for a material like LiNbO3 . Here density versus thickness is plotted for a family of η per page. Again the solid curve at the top is the optics-imposed storage limit, where we see an optimum at 3 mm. Because the optimum is at a relatively broad peak, a media thickness
HOLOGRAPHIC STORAGE
17
Fig. 15. Equivalent area density versus the media dynamic range for a LiNbO3 like material. The optics-limited density is the solid horizontal line at the top. d = 3 mm, λ = 500 nm, F = 2, n = 2.3.
of 1 to 2 mm is sufficient for density maximization. Furthermore, if the diffraction efficiency per page is larger than 10 − 6 , then the density will always be n limited and the density maximum occurs at about 1.5 mm.
Read Transfer Rate Generally, signal recovery for high-density signals requires a broadband SNR of at least 14 dB or 5:1. Because the noise in a broadband signal is linearly proportional to the channel bandwidth, higher transfer rates mean higher noises, and therefore a larger signal level is required, or a larger η is needed. Even though the amount of media noise and the interpage cross-talk noise discussed previously are linearly proportional to η, the CCD noise is not. Our approach is to sum all the noise contributors formulated in the subsection entitled “Noise” from which we derive the relationship between the transfer rate and the various noise parameters and the diffraction efficiency. First of all, from Eq. (28), the CCD SNR is given by
18
HOLOGRAPHIC STORAGE
Fig. 16. Equivalent area density versus thickness at various diffraction efficiencies, where λ = 500 nm, F = 2, n = 2.3, n = 10 − 4 . Notice the optimum near a thickness of 2 mm in this case.
where
νread is the effective data transfer rate, that is,
and
where T read is the CCD exposure time and T xfr is the actual CCD frame transfer time.
HOLOGRAPHIC STORAGE
19
Fig. 17. Required diffraction efficiencies versus data rate per unit read power, for a family of media noise indices.
The media signal-to-noise, as defined in Eq. (19), is
Thus the required SNR − 1 , or NSR, of the detection system is related to the media noise, the interpage crosstalk, and the CCD noise electrons by
Thus, the required η to support the desired νread /PR is
We plot the η dependence on νread /PR in Fig. 17 for a family of media noise indices. We have assumed QCCDN = 50, NSR = 0.1, and NSRpg = 0.04 (see the subsection entitled "Interpage Cross-talk Noise") in this example. Also, the following values are chosen: QE = 0.5, Eλ = 2 eV, NSLM = 512, OS = 4, and p = 2. Several observations can be drawn from Fig. 17: (1) As we start at a low value of νread /PR , the first term related to media noise in the numerator of Eq. (46) is much larger than the second term, which is tied to the CCD noise, and η is essentially independent of
Fig. 18. Trade-off of storage density versus data rate per unit read power, where d = 1 mm, Δn = 10⁻³, λ = 500 nm, NSRpg = 0.04, NSR = 0.1, F = 2, α = 2.1, R = S = 35°.
νread /PR . Then the CCD noise starts to become competitive with the media noise, leading to higher values of η as νread /PR increases. Finally, all curves merge into the slanted linear asymptote, which is controlled by the CCD noise floor, where the media noise and the interpage cross-talk become negligible compared with the noise-bandwidth product of the CCD. (2) For photorefractive crystals like LiNbO3, where the media noise is low, we can support a modest νread /PR with very low η, thus making very large storage densities possible. However, if an aggressive transfer rate is desired, then η must be made larger, reducing the storage density. The opposite is true for the photopolymers, where the media noise is relatively large: higher diffraction efficiencies are needed independent of νread /PR until a much larger value of νread /PR is reached, but the available Δn is not used as effectively, causing penalties in storage density. We will devote the next section to a more quantitative discussion of this point.
Trade-Off Between the Storage Density and the Transfer Rate

The storage density D under the dynamic-range-limited situation is given, from Eq. (40), by
where
where Abit is the area per bit for a single hologram with N²SLM bits, as defined in Eq. (36). Also, the required η to support the desired νread /PR is, from Eq. (46),
Combining Eqs. (46) and (47) yields
where
or
Equation (49) describes the trade-off between storage density and the read data rate under the Δn-limited situation. At a relatively modest data rate,
and Eq. (49) is approximated by
Equation (51) shows that the product Dd is constant at first and then falls off when Eq. (50) no longer holds; see Fig. 18. Also, it is quite clear from Eqs. (49) and (51) that the larger SNRpg is with respect to SNR and the smaller the media noise index is, the higher the storage density will be. Using Eq. (49), we plot the storage density D against νread /PR in Fig. 18. The family of curves has a constant Δn of 10⁻³ but different values of
Fig. 19. Trade-off of noise index χ versus dynamic range Δn in order to reach optics-limited density, for a family of transfer rates. NSRpg = 0.04, NSR = 0.1.
media noise indices. Again the value of α is chosen to be 2.1, giving SNRpg a value of 25 (see the subsection entitled "Interpage Cross-talk Noise" and Fig. 10). One notices the trade-off of density versus the data transfer rate, and also the importance of minimizing the media noise index χ to achieve a large storage density. As stated earlier, when the storage density is driven to the optics limit, it becomes (see the subsection entitled "Storage Density")
Equating Eqs. (38) and (48) yields
Equation (52) provides the relationship among all the material parameters when the optics limit is reached. Note that there is no dependence on the optics (beam angles, F-number, etc.), nor is there any significant dependence on medium thickness except for the absorption term e^(ad). Because it might be easier to develop media with a smaller media noise index than with a larger Δn, we show in Fig. 19 the trade-off of χ versus Δn under an optics-limited situation. Several characteristics stand out clearly: (1) the required Δn to reach optics-limited performance is about 10⁻³ to 10⁻², depending on the desired transfer rate (this, we believe, is an achievable goal for photopolymers); (2) we can make up for the lack of a large Δn by means of a lower noise index to reach the optics-limited capacity; and (3) relatively larger values of Δn are needed for larger transfer rates, no matter how small the noise index is.
TZUO-CHANG LEE, Eastman Kodak Company
JAHJA TRISNADI, Silicon Light Machines
Wiley Encyclopedia of Electrical and Electronics Engineering
Image and Video Coding
Kannan Ramchandran, Mohammad Gharavi-Alkhansari, and Thomas S. Huang, University of Illinois at Urbana-Champaign, Urbana, IL
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4103. Article Online Posting Date: December 27, 1999.
Abstract. The sections in this article are: Compression Framework For Still Images; JPEG–Still-Image Compression Standard; Beyond JPEG: Wavelets; Beyond JPEG: Fractal Coding; Compression Framework For Video; Video Standards; MPEG-4 Standard; MPEG-7 and Multimedia Databases; Current and Future Research Issues; Acknowledgments.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
IMAGE AND VIDEO CODING

Images and video have become ubiquitous in our everyday life. Examples include consumer products ranging from traditional television and still and video cameras to the newer digital still and video cameras, modern printers and fax machines, videoconferencing systems, Kodak Photo-CD, direct broadcast satellites (DBS), digital video discs (DVD), and high-definition TV (HDTV). Furthermore, the Internet has made more and more images and video available. Images and video are also crucial in many applications in medicine (CT, MRI, PET, etc.), education (distance learning), defense (video transmission from unmanned vehicles), and space science (transmission of images and video of Mars from artificial satellites to Earth). In most cases, the representation of images and video is digital. The main things we want to do with images and video are to transmit them; to store and retrieve them; to manipulate them (editing, authoring, etc.); and to visualize them.

In dealing with images and video, in most cases the amount of data (in bits) is huge. Thus, compression is essential to make implementation possible. For example, straight digitization of HDTV-format signals yields more than 1 Gbit/s. It is impossible to transmit this through a standard TV broadcast channel of 6 MHz bandwidth. Fortunately, compression techniques (MPEG-2) reduce the bit rate to 20 Mbit/s. Then, by using appropriate channel coding and modulation methods, the signal can be transmitted through a 6 MHz channel. Similarly, movies on CD, DBS, DVD, and so on are simply not possible without image/video compression.

We have come a long way from the early days (1960s) of basic research in image and video coding to the standardization activities of the last decade (and beyond). The image/video compression standards have had tremendous impact on the development of consumer products. Without the standards, the basic research results (such as transform coding of images) would still be lying dormant in the archives of scholarly journals. In this article, we review the most important basic concepts and techniques of image and video coding, describe the major standards, and discuss some of the challenging research issues. We mention in passing that image/video coding in the broad sense includes source coding (compression), channel coding (error detection and correction), encryption, and watermarking. This article covers only compression.
Compression Framework For Still Images

The goal of compressing video and images is to remove redundancy from their representation. Although there are a number of ways to do this, the inspiration comes, not surprisingly, from the seminal work of Claude Shannon (1), the father of information theory, who quantified the concept of compressing "optimally." Compression can be done losslessly, meaning that there is absolutely no loss of the source information. This is critical when any loss of information is catastrophic, for example, in data files. However, there is a price paid for perfect representation. Understandably, one cannot typically achieve very high compression performance (typical numbers are 2:1 or 3:1 at best). When dealing with images and visual information, there are two dominant issues. First, the sheer volume of the source dictates the need for higher compression ratios (we could not have
digital television with such meager compression ratios), and secondly, perfect replication of the source is often overkill: visual information is inherently very loss tolerant. This leads to the concept of "lossy" compression. One can quantify the amount of loss based on objective measures like mean squared error (MSE) (or more perceptually meaningful measures, which have not, however, proved very analytically tractable to date). The trade-off between the number of bits used to represent an image (rate) and the amount of degradation that the compression entails (distortion) has been formalized by Shannon under Rate-Distortion (R-D) theory. Although the theoretical bounds for specific statistical classes of sources have been provided under this theory, there are two problems: (1) actual images are very complex to model statistically, and simple models do not work well; and (2) practical compression algorithms have severe operational constraints, such as low complexity, low delay, and real-time issues, that need to be addressed. This underlines the need for efficient compression frameworks that offer a good compromise between performance and complexity.

The frameworks that have been most successful and found their way into commercial standards for image compression are based on what is termed transform coding. Standard transform-based image coding algorithms consist of three stages: a linear, typically unitary, decorrelating transform, followed by a quantization stage, and finally entropy coding of the quantized data. A good decorrelating transform removes linear dependencies from the data, thus producing a set of coefficients such that, when scalar quantized, the entropy of the resulting symbol stream is reduced substantially more than if the same quantization were applied directly to the raw image data. Initially, transform coding became popular mainly because of the introduction of the block discrete cosine transform (DCT), which is the workhorse of commercial image compression standards (see next section). There are three reasons for the success of the block DCT: (1) it does a reasonably good job of compacting most of the image energy into a small fraction of the DCT coefficients; (2) because of the use of a block structure, it facilitates spatially adaptive compression, needed when dealing with the highly varying spatial statistics of typical images; (3) it has a fast implementation algorithm, based on a simple extension of the popular fast Fourier transform (FFT), which revolutionized digital signal processing.

Quantization. When the DCT is computed for a block of pixels, the value of its transform coefficients must be conveyed to the decoder with sufficient precision. This precision is controlled by a process called quantization. Basically, quantizing a DCT coefficient involves dividing it by a nonzero positive number called the quantization step size and rounding the quotient to the nearest integer. The bigger the quantization step size, the lower the precision of the quantized DCT coefficient, and the fewer the number of bits required to transmit it to the decoder. Usually, higher frequency coefficients are quantized with less precision, taking advantage of the fact that human eyes are less sensitive to high-frequency components in the image.

Entropy Coding. After quantization, the final step in the encoding process is to remove redundancy in the quantized symbol stream representing the quantized DCT coefficients.
Because of the high-energy compaction property of the DCT, a large fraction of the transform data is quantized to zero. It is certainly unwise to assign as many bits to the zero value as to a nonzero value, for the same reason that Morse code does not assign the same pattern to an "e" as to an "x." Thus, the symbols generated are highly unbalanced in their probabilistic distributions. Some of them appear very frequently, and some others occur only rarely. It is possible to spend fewer bits (on average) to represent all possible outcomes of a certain symbol by assigning shorter binary strings to highly probable outcomes and longer strings to less probable ones. The final stage, called entropy coding, exploits this idea. The fundamental amount of information present was dubbed "entropy" by Shannon (1). There are many practical entropy-coding schemes (including the famous Morse code with which we are all familiar). The two most popular schemes in image coding are Huffman coding and arithmetic coding. These schemes work by different mechanisms but share the same basic idea of coding a symbol with a number of bits inversely related to the probability that it occurs. Therefore, a very challenging problem for entropy coding is to generate an accurate profile for symbol probabilities. This is called the modeling problem. Modeling is done in two different ways: nonadaptive, in which the profile is determined empirically and remains unchanged throughout coding, and adaptive, in which the profile is refined dynamically in the
coding process by using previously coded data. The adaptive way usually produces a more accurate coding model and consequently better compression, but it requires extra computational time to generate the model.
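The following is a minimal sketch (ours, not taken from this article) of the entropy-coding idea just described: more probable symbols receive shorter binary codewords. It builds a Huffman code with Python's standard library; the quantized-symbol stream is made up purely for illustration.

```python
# Minimal Huffman-code sketch: frequent symbols get short codewords.
import heapq
from collections import Counter

def huffman_code(freqs):
    """Return {symbol: bitstring} for a dict of symbol frequencies."""
    # Each heap entry: (total frequency, unique tie-breaker, {symbol: code so far}).
    heap = [(f, i, {s: ""}) for i, (s, f) in enumerate(freqs.items())]
    heapq.heapify(heap)
    while len(heap) > 1:
        f1, _, c1 = heapq.heappop(heap)          # two least probable subtrees
        f2, i2, c2 = heapq.heappop(heap)
        merged = {s: "0" + c for s, c in c1.items()}   # prefix one subtree with 0,
        merged.update({s: "1" + c for s, c in c2.items()})  # the other with 1
        heapq.heappush(heap, (f1 + f2, i2, merged))
    return heap[0][2]

if __name__ == "__main__":
    # Hypothetical quantized-coefficient stream: zero dominates, as in the text.
    stream = [0] * 80 + [1] * 10 + [-1] * 6 + [2] * 3 + [5] * 1
    code = huffman_code(Counter(stream))
    for sym, bits in sorted(code.items(), key=lambda kv: len(kv[1])):
        print(sym, bits)   # the most frequent symbol (0) gets the shortest codeword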
JPEG–Still-Image Compression Standard

JPEG (pronounced "jay-peg") stands for Joint Photographic Experts Group. JPEG is a very successful still-image compression standard, also known as the ISO 10918 standard. JPEG is a compression standard for digital images having single (e.g., gray-scale images) or multiple components (e.g., color images), with each component's sample precision being 8 or 12 bits for the lossy compression mode and 2 to 16 bits for the lossless mode. JPEG is "color-blind" in the sense that the choice of coordinate system for color representation does not affect the compression process. Although there are many different features provided by the standard, here we concentrate only on the most important ones. Additional information can be found in the excellent reference (2), which also contains the Draft International Standard ISO DIS 10918-1. The JPEG standard specifies several encoding and decoding processes, which can be combined into three main functional groups.

The "Baseline System." This group must be supported by hardware and software JPEG coders, and it represents transform-based lossy compression in sequential mode. In this mode, data is decoded in "one pass," as shown in Fig. 1 (compare to the progressive mode in the next group). "Baseline" JPEG has been used almost exclusively up to 1996 in many commercial applications.

The "Extended System." This group provides better compression performance at the price of somewhat increased complexity by using improved entropy-coding techniques, for example, arithmetic coding and extended Huffman coding. It also allows several types of progressive modes of operation. In contrast with the sequential mode, the progressive mode allows a coarse representation of an encoded image which is later refined as additional bits are decoded. For example, if the compressed image is transmitted over a slow link, this feature allows "previewing" the content of the data at a lower quality or resolution during transmission. This progressive feature is illustrated for DCT-based progression (original size but reduced quality for a lower resolution image) and for hierarchical progression (original quality but reduced size for a lower resolution image) in Fig. 1.

Lossless Coding. This group provides lossless data compression, and it does not use transform and quantization. No information is lost during compression. The lossless JPEG mode is based on forming predictions for a sample value from previously encoded samples and encoding only the difference between the predicted and actual values.

The first two groups represent the lossy class of compression processes, and they include transform, quantization, and entropy coding. In the following, we describe the main features of these three operations for the sequential mode. The sequential lossy encoding process is illustrated in Fig. 2.

DCT Transform. In lossy mode, JPEG decomposes each image component into 8 × 8 blocks for transform, quantization, and entropy coding. Blocks are processed in a raster scan order. Basically, the DCT is a method of decomposing a block of data into a weighted sum of certain fixed patterns. Each such pattern is called a basis, and each basis is associated with a coefficient representing the contribution of that pattern to the block to be coded. There are altogether 64 basis functions to completely represent any block without error, and therefore the total number of values to be coded after the transform remains unchanged.
In the inverse DCT transform, each basis block is multiplied by its coefficient and the resulting 64 8 × 8 amplitude arrays are summed, each pixel separately, to reconstruct the original block.
Fig. 1. Different modes of JPEG operation.
The DCT basis blocks are characterized by different spatial frequencies, a measure of how fast the pixel values change in that block. At the lowest frequency, the basis block takes a constant value. In other words, there is no variation inside it. As the frequency increases, the pixel values inside that basis block vary faster and faster. A common characteristic of all natural images and video is that low-frequency components are dominant. Therefore, after the DCT, usually only a few low-frequency coefficients are needed to approximate
Fig. 2. Lossy sequential JPEG compression.
the original image block well, a phenomenon called energy compaction. The energy-compaction property makes coding the DCT coefficients much easier than directly coding the original data. The DCT is a linear orthogonal transform, which can be represented as a multiplication of 8 × 8 image blocks by 8 × 8 DCT basis blocks. Formally, the DCT operation is described by the following equations:
S(v, u) = (1/4) C(u) C(v) Σ(y = 0 to 7) Σ(x = 0 to 7) s(y, x) cos[(2x + 1)uπ/16] cos[(2y + 1)vπ/16]

where C(x) = 1/√2 for x = 0 and C(x) = 1 for x > 0; s(y, x) and S(v, u) are the 2-D sample values and DCT coefficients, respectively. The inverse transform is

s(y, x) = (1/4) Σ(u = 0 to 7) Σ(v = 0 to 7) C(u) C(v) S(v, u) cos[(2x + 1)uπ/16] cos[(2y + 1)vπ/16]
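As a cross-check, the following small sketch (ours, not part of the JPEG specification text) implements the 8 × 8 forward and inverse DCT directly from these equations, using NumPy, and verifies that the inverse reconstructs the block.

```python
# Direct (unoptimized) 8x8 forward and inverse DCT, written from the equations above.
import numpy as np

def C(x):
    return 1.0 / np.sqrt(2.0) if x == 0 else 1.0

def dct2_8x8(block):
    """Forward 8x8 DCT of one (level-shifted) image block."""
    S = np.zeros((8, 8))
    for v in range(8):
        for u in range(8):
            acc = 0.0
            for y in range(8):
                for x in range(8):
                    acc += (block[y, x]
                            * np.cos((2 * x + 1) * u * np.pi / 16)
                            * np.cos((2 * y + 1) * v * np.pi / 16))
            S[v, u] = 0.25 * C(u) * C(v) * acc
    return S

def idct2_8x8(S):
    """Inverse 8x8 DCT: weighted sum of the 64 basis patterns."""
    block = np.zeros((8, 8))
    for y in range(8):
        for x in range(8):
            acc = 0.0
            for v in range(8):
                for u in range(8):
                    acc += (C(u) * C(v) * S[v, u]
                            * np.cos((2 * x + 1) * u * np.pi / 16)
                            * np.cos((2 * y + 1) * v * np.pi / 16))
            block[y, x] = 0.25 * acc
    return block

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    b = rng.integers(0, 256, size=(8, 8)).astype(float) - 128   # level-shifted block
    assert np.allclose(idct2_8x8(dct2_8x8(b)), b)                # perfect reconstruction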
Quantization. The DCT coefficients of each 8 × 8 block are quantized using a uniform scalar quantizer. Quantization step size is defined for each frequency coefficient using an 8 × 8 quantization matrix. Typically, a single quantization table is used for all components. However, up to four different tables may be used if needed.
The values of the quantization tables are encoded in the header of the compressed file. Quantization is a lossy step, that is, the information cannot be recovered perfectly at the decoder. However, it is quantization that allows achieving a high compression rate, at the price of some quality degradation.

Entropy Coding. The first quantized frequency coefficient, called the DC coefficient, represents the average sample value in a block and is predicted from the previously encoded block to save bits. Only the difference from the previous DC coefficient is encoded, which typically is much smaller than the absolute value of the coefficient. The remaining 63 frequency coefficients (called AC coefficients) are encoded using only the data of the current block. Because normally only a few quantized AC coefficients are not zero, their positions and values are efficiently encoded using run-length codes followed by Huffman or arithmetic entropy coders. AC coefficients are processed in a zigzag manner (see Fig. 2), which approximately orders the coefficients from the lowest to the highest frequency. The run-length code represents the sequence of quantized coefficients as (run, value) pairs, where "run" is the number of zero-valued AC coefficients between the current nonzero coefficient and the previous nonzero coefficient, and "value" is the (nonzero) value of the current coefficient. A special end-of-block (EOB) symbol signals the end of the nonzero coefficients in the current block. For the example in Fig. 2, with three nonzero AC coefficients, the possible sequence after the run-length encoder is (0,5)(0,3)(4,7)(EOB); a toy sketch of this zigzag and run-length step is given at the end of this section. The sequence of "runs" and "values" is compressed by using Huffman or arithmetic codes. The parameters of the entropy coder are stored in the header of the compressed image.

Decoding. The decoder performs the inverse operations for entropy coding and the DCT.

JPEG: Merits and Demerits. JPEG remains a popular image compression standard because of its competitive compression efficiency, low complexity and low storage requirements, the existence of efficient implementations, and its several progression modes. However, because of its block-based coding process, JPEG-coded images suffer from so-called "blocky artifacts," especially at high compression rates, where the boundaries of the blocks become visible in the decoded image. This problem is resolved to some extent by postprocessing the decoded image. As technology improves, storage and complexity constraints for image coders become less strict, and a new generation of image coders is preferred because of its improved compression efficiency. Some examples of such coders are briefly described in the following sections.
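The toy sketch below (ours; the block values are hypothetical and chosen only to reproduce the (0,5)(0,3)(4,7)(EOB) example above) shows the zigzag scan of a quantized 8 × 8 block and the run-length pairing of its AC coefficients.

```python
# Zigzag scan plus run-length coding of the AC coefficients of a quantized block.
import numpy as np

def zigzag_indices(n=8):
    """(row, col) visiting order of the JPEG zigzag scan, low to high frequency."""
    order = []
    for s in range(2 * n - 1):                        # s = row + col (one anti-diagonal)
        diag = [(r, s - r) for r in range(n) if 0 <= s - r < n]
        order.extend(diag if s % 2 else diag[::-1])   # alternate traversal direction
    return order

def run_length_ac(q_block):
    """Encode the 63 zigzag-ordered AC coefficients as (run, value) pairs plus 'EOB'."""
    ac = [q_block[r, c] for r, c in zigzag_indices()[1:]]   # skip the DC coefficient
    pairs, run = [], 0
    for v in ac:
        if v == 0:
            run += 1
        else:
            pairs.append((run, int(v)))
            run = 0
    pairs.append("EOB")
    return pairs

if __name__ == "__main__":
    q_block = np.zeros((8, 8), dtype=int)
    q_block[0, 0] = 26                                       # DC (coded differentially, not shown)
    q_block[0, 1], q_block[1, 0], q_block[1, 2] = 5, 3, 7    # three nonzero AC coefficients
    print(run_length_ac(q_block))                            # [(0, 5), (0, 3), (4, 7), 'EOB']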
Beyond JPEG: Wavelets

Since their introduction as a tool for signal representation, wavelets have become increasingly popular within the image-coding community because of the potential gains they offer for constructing efficient image-coding algorithms (3). Those potential gains result from the fact that wavelets provide a good tradeoff between resolution in the space and frequency domains, a feature which results in mapping typical space-domain image phenomena into structured sets of coefficients in the wavelet domain. However, to make effective use of any structure for improving coding performance, an algorithm requires a statistical characterization of the joint distribution of wavelet coefficients, capable of taking that structure into account. A hot research topic as of the late 1990s consists of identifying properties of the wavelet coefficients of images leading to good image-coding performance. Although wavelets share many properties with the DCT (e.g., decorrelation), and the wavelet coefficients may be quantized and entropy coded similarly to the way DCT coefficients are used, taking this same approach ignores a fundamental wavelet property, the joint time-frequency localization. Consider the following simple image model. For coding applications, typical images can be described as a set of smooth surfaces delimited by edge discontinuities (see Fig. 3). Now consider the effect of representing an edge in the wavelet domain, as illustrated by Fig. 4. The abstract property of joint time-frequency localization has a very practical consequence: the formation of energy clusters in image subbands at spatial locations associated with edges in the original image.
Fig. 3. Typical images can be described as a set of smooth surfaces delimited by edge discontinuities. Row 20 (top) is a mostly smooth region, whereas row 370 (bottom) contains some textures.
This clustering effect is attributed mostly to two basic facts: (1) signal energy due to smooth regions is compacted mostly into a few low-frequency coefficients, thus resulting in negligible contributions to coefficients in the higher frequency bands; and (2) because of the small compact support of the wavelets used, edges can contribute energy only to a small number of coefficients in a neighborhood of their space-domain location. This effect is illustrated in Fig. 5. Now, in a data compression application, "useful" properties are those that can be described by some statistical model of the source being coded, so that a practical algorithm can exploit them to achieve better coding performance. Most state-of-the-art wavelet-based image coders are based on exploiting an improved statistical description of image subbands, which explicitly takes into account the space-frequency localization properties of the wavelet transform.
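A small numerical illustration of this localization property (ours; it uses the Haar wavelet as the simplest case, which the article does not single out): for an ideal step edge, only the analysis pair that straddles the edge produces a nonzero detail coefficient, so all high-band energy clusters at the edge location.

```python
# One level of the orthonormal Haar wavelet transform applied to a step signal.
import numpy as np

def haar_analysis(x):
    """Return (low-pass, high-pass) coefficients of one Haar decomposition level."""
    x = np.asarray(x, dtype=float)
    low = (x[0::2] + x[1::2]) / np.sqrt(2.0)    # smooth approximation
    high = (x[0::2] - x[1::2]) / np.sqrt(2.0)   # detail coefficients
    return low, high

if __name__ == "__main__":
    step = np.concatenate([np.zeros(31), np.ones(33)])   # an ideal edge between samples 30 and 31
    low, high = haar_analysis(step)
    # Only the pair straddling the edge yields a nonzero detail coefficient (index 15).
    print(np.flatnonzero(np.abs(high) > 1e-12))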
Beyond JPEG: Fractal Coding

Fractal image coding is an image-compression method based mainly on the work of Barnsley in the 1980s (4) and that of Jacquin in the late 1980s and early 1990s (5). All image-coding methods use some form of redundancy present in images for data compression. The type of redundancy that fractal coding methods use is that of similarity across scale. This is based on the assumption that important image features, including straight edges and corners, do not change significantly under change of scale (see Fig. 6). For example, a small part of a straight edge in an image may look similar to a larger part of the same edge in that image.

Encoding. Just like any image-coding system, a fractal coding system has an encoder and a decoder. In the encoder, an image is first partitioned into small nonoverlapping blocks, called range blocks. For each range block, the encoder searches in the image for larger blocks that are similar to the range block and among all of them picks the one that is most similar to the range block, as shown in Fig. 7. For each range block, this selected larger block is called the domain block. By most similar, we mean that after applying some simple transformations on the domain block, the mean square error (MSE) between the range block and the
Fig. 4. To illustrate how the wavelet decomposition preserves the locality of data characteristics in the spatial domain: (a) a step; (b) its projection on the low-pass space; (c) its projection on the high-pass space. In the high-pass projection, all of the signal energy is concentrated in a neighborhood of the spatial location of the step.
transformed domain block becomes the smallest. In addition to shrinking, which needs to be applied to a domain block to make it the same size as the range block, the transformations typically used are: adding a constant value to all the pixel values in the domain block to adjust its brightness; multiplying all of the pixel values in the domain block by a constant value to adjust its contrast; rotating the domain block (typically by multiples of 90°); and reflecting it against the horizontal, vertical, or diagonal axis.
Fig. 5. To illustrate the clustered nature of wavelet data: (a) coefficients in subband LH0 of Lena whose magnitude is above a fixed threshold (i.e., those which contain most of the signal energy necessary to reduce distortion); (b) an equal number of points as in (a), but drawn from a uniform density over the same area. This figure gives motivation for the type of techniques used later in the coder design that would be completely useless if typical realizations of wavelet data were like (b) instead of (a).
So, for each range block, the encoder needs to look at all potential domain blocks in the image and also must find the best set of transformations that may be applied to the potential domain blocks to make them similar to the range block. In practice, some range blocks may be very flat, and a single DC value may be sufficient to describe them well. In such cases, the encoder may choose not to conduct the search and encode the block by simply saving its DC value. On the other hand, no good matches may be found for some blocks in the image. In such cases, the encoder may choose to break the range block into smaller range subblocks and conduct the search for the subblocks instead. After finding the best matching domain block and its associated transformations for each range block, the encoder saves the address of the domain block and the parameters of these transformations (e.g., how much the brightness was changed, whether and by how much the domain block was rotated). It should be emphasized that the encoder saves neither the values of the pixels in the range blocks nor the values of the pixels in the domain blocks. The addresses and the quantized transformation parameters for all the range blocks in the image are entropy-coded to generate the final binary code for the image. We call this binary code the fractal code of the image. This code represents how each part of the image is approximated by transforming other larger parts of the same image. Now, if for every range block in the image we go to the location of its domain block, extract that block from the image, apply the corresponding transformations to it, and substitute the range block with this transformed block, we obtain a new image, which is an approximation of the original image. If we denote the original image by xo and denote this operation applied to xo by T, then
that is, applying T to xo leaves it almost unchanged. Note that the transformation T is fully represented by the fractal code.
Fig. 6. Fractal coding uses the similarity of different-sized image blocks for coding. It approximates each image block with another larger image block in the same image. For example, blocks “A” and “B” in this image are very similar to blocks “a” and “b,” accordingly, and can be used to approximate them.
If there is an image xf such that T(xf ) is exactly the same as xf for this transformation T, that is,
then xf is called the fixed point of T.

Decoding. So far, we know how the encoder takes an image and makes a fractal code, or equivalently, a transformation T, for it. Now the question is how this information can be used by the decoder to reconstruct the original image or an approximation of it. According to a theorem called the Contraction Mapping Theorem (4), if T is a contractive transformation, then applying T repeatedly to any arbitrary image transforms that image to the fixed point of T. More precisely, for any image y,

T(n)(y) → xf as n → ∞, where T(n) denotes n successive applications of T.
So, the decoder can find xf by simply taking any arbitrary image and applying the transformation T, which is represented by the fractal code, to the image many times. Of course, xf is usually not exactly the same as xo, but according to another theorem called the Collage Theorem (4), the closer T(xo) is to xo, the closer xf will be to xo. In other words, if the encoder can find very good matches for the range blocks, the decoder can duplicate the original image very closely. However, because the decoded image is not exactly the same as the original image, fractal coding falls in the class of lossy compression methods.
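The decoder's reliance on the Contraction Mapping Theorem can be demonstrated numerically. The stripped-down sketch below (ours; a toy contractive affine map stands in for the block-wise transformations stored in a real fractal code) shows that iterating a contractive T from two very different starting "images" converges to the same fixed point.

```python
# Fixed-point iteration of a contractive map, as used conceptually by a fractal decoder.
import numpy as np

rng = np.random.default_rng(1)
n = 16
A = rng.standard_normal((n, n))
A *= 0.5 / np.linalg.norm(A, 2)            # scale so that ||A|| < 1, making T contractive
b = rng.standard_normal(n)

def T(y):
    return A @ y + b                        # toy stand-in for the decoded approximation operator

x_from_zeros = np.zeros(n)                  # start from a blank "image"
x_from_noise = rng.standard_normal(n) * 100.0   # start from an arbitrary "image"
for _ in range(200):                        # repeated application of T
    x_from_zeros = T(x_from_zeros)
    x_from_noise = T(x_from_noise)

print(np.allclose(x_from_zeros, x_from_noise))   # True: both converge to the same point
print(np.allclose(T(x_from_zeros), x_from_zeros))  # True: it is a fixed point of T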
Fig. 7. A basic fractal image coding algorithm.
Computational Complexity. One major difficulty with fractal image coding is that the encoding process is computationally expensive. The main computations are trying every potential domain block, finding the best set of transformations for it, and computing how well the result matches the range block. However, there are techniques that alleviate this computational burden to some extent (6). The description given previously provides the basic mechanism of fractal coding. Many different variations of this algorithm have been proposed by different researchers. Interested readers may look at (6) for more details.
Compression Framework For Video

Now we move from images to video signals. A video signal can be thought of as a sequence of images, or more correctly, image frames. These frames are highly correlated in time, which is hardly surprising considering that typical video sequences come from camera shots of scenes. Thus, in addition to exploiting the spatial redundancy in individual video frames (as in still images), it is important for a good video-coding algorithm to exploit the much more significant temporal redundancy across frames. In state-of-the-art video coders, this is done by what is termed motion-compensated prediction. The basic idea is simple. Because little changes between successive frames of typical video scenes (e.g., when these represent 1/30 s intervals), with high likelihood one can predict the next frame based on the previous or the future frame. To aid in this prediction, it is necessary to model the motion of areas of the picture. State-of-the-art commercial video-coding algorithms send the motion of image blocks from frame to frame. Some parts of the pictures (those in which no significant changes occur) are simply copied from adjacent pictures. Other parts may be best predicted by parts of the neighboring frames corresponding to the same objects that are displaced by motion, and this requires the use of motion compensation. All of the standards covered in the next section (MPEG-1, MPEG-2, H.261, and H.263) use a block-based algorithm for motion compensation, in which each video frame is partitioned into small rectangular blocks, and for each block, a match within a certain neighborhood of the adjacent (previous or next) frames is sought, according to a certain
Fig. 8. Typical motion-compensated video encoder and decoder. Here In, Ĩn, and În represent the original, motion-compensated, and final reconstructed frame n, respectively.
error criterion (e.g., minimum mean absolute error), and the displacement to the best matching location is coded as the motion vector. Clearly, after motion compensation, the compensated difference is much smaller than the original frame difference, but because one cannot always expect to predict perfectly (if one could, there would be no information to send!), one incurs a prediction error. This error is then sent. The better the prediction, the lower this error, and the fewer the number of bits needed to send it. In current commercial video coders, the error signal (termed the displaced frame difference, or DFD) is coded in a way that is almost identical to the way still images are coded using the JPEG standard, that is, using a DCT-based framework. At the decoder, the current frame is reconstructed based on: (1) the past frame; (2) the motion information that was sent relating the previous frame to the current frame; and (3) the prediction error (which is sent in the DCT domain). This very simple framework works very well in practice and forms the core of commercial standards like MPEG and H.263 (see next section). Figure 8 captures the essence of this video-coding framework, which is dubbed hybrid motion-compensated video coding.
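A minimal full-search block-matching sketch follows (ours; an illustration of the idea, not any standard's normative procedure, and the 16 × 16 block size and ±7 pixel search window are arbitrary choices): every candidate displacement in the previous frame is tried, and the one with the smallest sum of absolute differences (SAD) becomes the motion vector.

```python
# Full-search block matching with a sum-of-absolute-differences criterion.
import numpy as np

def motion_vector(cur, prev, top, left, bsize=16, search=7):
    """Return (dy, dx) minimizing SAD for the block of `cur` at (top, left)."""
    block = cur[top:top + bsize, left:left + bsize].astype(int)
    best, best_mv = None, (0, 0)
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if y < 0 or x < 0 or y + bsize > prev.shape[0] or x + bsize > prev.shape[1]:
                continue                                   # candidate falls outside the frame
            cand = prev[y:y + bsize, x:x + bsize].astype(int)
            sad = np.abs(block - cand).sum()
            if best is None or sad < best:
                best, best_mv = sad, (dy, dx)
    return best_mv

if __name__ == "__main__":
    rng = np.random.default_rng(2)
    prev = rng.integers(0, 256, size=(64, 64))
    cur = np.roll(prev, shift=(2, -3), axis=(0, 1))        # simulate simple global motion
    # (-2, 3): the best match in the previous frame lies 2 rows up and 3 columns right.
    print(motion_vector(cur, prev, top=16, left=16))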
Fig. 9. The I-frames, P-frames, and B-frames in a GOP and their predictive relationships.
Video Standards

In this section, we briefly introduce the popular video-coding standards developed by MPEG and ITU-T. MPEG is the acronym for the Moving Picture Experts Group, which was formed under the auspices of the International Organization for Standardization (ISO) and the International Electrotechnical Commission (IEC) and is devoted specifically to standardizing video-compression algorithms for the industry (7). Since it was established in 1988, it has experienced two phases, MPEG-1 and MPEG-2, and is now in its third phase, MPEG-4. Data rates and applications are distinguishing factors among them. MPEG-1 is designed for intermediate data rates on the order of 1.5 Mbit/s. MPEG-2 is intended for higher data rates (10 Mbit/s or more), and MPEG-4 is intended for very low bit rates (about 64 kbit/s or less). MPEG-2 and MPEG-4 have potential uses in telecommunications, and for this reason, a third major international standards organization, the International Telecommunications Union-Telecommunications Sector (ITU-T), is participating in their development. The ITU-T has produced two low-bit-rate video-coding standards so far, namely, H.261 and H.263 (8,9).

Frame Types and GOP. A video sequence is made up of individual pictures, also called frames, that occur at fixed time increments. Although there are many facets in video coding, perhaps the most important one is how to take advantage of the similarity between successive pictures in the sequence. The approach taken by many video coders is that most of the pictures are coded in reference to other ones, and only a few of them are coded independently. These independently coded frames are given the name "intra-coded frames" (I-frames). The I-frames are needed to start coding the whole sequence and also to facilitate video editing and random access. Typically, I-frames are inserted into the sequence at regular time intervals. The pictures between each pair of successive I-frames are called a "Group of Pictures" (GOP) and are treated as a single unit in coding operations. However, coding an I-frame is expensive because it is treated as completely "fresh" information. The high compression ratio needed in many video applications is achieved by coding the other frames in a GOP as differences relative to their neighboring frames. These frames are called "predicted" (P) frames or "bidirectionally predicted" (B) frames, depending on whether only the past frame or both the past and future frames are used as the reference, respectively (a small sketch at the end of this section lists these reference relationships for a typical GOP pattern). Because of the dependency between I-frames, P-frames, and B-frames, the order in which they are coded has to be carefully arranged. Typically, the I-frame in each GOP is coded first, followed by the P-frames, which are predicted from the I-frame, and then finally the B-frames, as shown in Fig. 9. P-frames and B-frames are coded in a variety of ways. Some parts of the pictures (those in which no significant changes occur) are simply copied from adjacent pictures. Other parts may be best predicted from
parts of the neighboring frames corresponding to the same objects that are displaced by motion, by using the motion-compensation technique. Both intra- and interframe coding operate on an 8 × 8 block basis and are facilitated by the discrete cosine transform. After quantization, the final step in the encoding process is to entropy code all the relevant information, namely, motion vectors, quantized DCT coefficients, and control information (for instance, quantization step sizes, frame types, and so on). Huffman coding and arithmetic coding are adopted in MPEG-1/2, H.261, and H.263 for entropy coding. Although adaptive entropy coding usually produces a more accurate coding model and consequently better compression compared to nonadaptive entropy coding, the extra computation time required to update the model is not desirable in most video applications.

Differences among Various Standards. Although all of these standards center around the basic framework of block-based motion compensation and DCT coding, they differ from each other in many significant ways. As mentioned earlier, a major difference between MPEG-1/2 and H.261/3 is their targeted bit rates and applications. The applications targeted by MPEG are (1) digital storage, such as compact disc (CD) and digital video disc (DVD); (2) digital satellite broadcasting; and (3) high-definition television (HDTV). Those targeted by H.261 and H.263 are low-bit-rate applications, like videoconferencing and videotelephony. As applications vary, the types of video sequences being coded and the requirements for coding quality and additional functionalities also differ. For example, in videoconferencing, where simple "head and shoulders" sequences are typical, attention is usually focused on the face of the speaker and less on the background parts of the picture, which therefore requires that the face region be coded with higher quality. MPEG-2 includes all features of MPEG-1 with enhancements that improve its coding efficiency and add more flexibility and functions to it. One such example is scalability, which allows reconstructions of various quality by partially decoding the same bit stream. Another example is error resilience, which enables the system to operate in a more error-prone environment. H.263 can also be regarded as an enhancement and an extension of H.261. The two most notable innovations in H.263 are the overlapped-block motion-compensation (OBMC) mode, in which motion-compensation accuracy for each motion block is improved by using a weighted linear prediction from neighboring blocks, and the PB-frame mode, in which pairs of a P-frame and a B-frame are coded jointly to save bit rate. There are many other differences, however, in nearly every aspect of the coder, which are beyond the scope of this section.
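The toy helper below (ours; the GOP pattern is hypothetical and real encoders allow more flexible structures) makes the I/P/B reference relationships described above concrete: for a display-order frame-type pattern, it lists which reference frames each picture is predicted from.

```python
# List the reference frames of each picture in a simple display-order GOP pattern.
def gop_references(pattern):
    """Return {frame index: list of reference frame indices} for a pattern like 'IBBPBB...'."""
    anchors = [i for i, t in enumerate(pattern) if t in "IP"]   # frames others may reference
    refs = {}
    for i, t in enumerate(pattern):
        if t == "I":
            refs[i] = []                                        # coded independently
        elif t == "P":
            refs[i] = [max(a for a in anchors if a < i)]        # nearest past I- or P-frame
        else:                                                   # "B": past and future anchors
            refs[i] = [max(a for a in anchors if a < i),
                       min(a for a in anchors if a > i)]
    return refs

# The trailing "I" stands in for the first frame of the next GOP.
print(gop_references("IBBPBBPBBI"))
# {0: [], 1: [0, 3], 2: [0, 3], 3: [0], 4: [3, 6], 5: [3, 6], 6: [3], 7: [6, 9], 8: [6, 9], 9: []}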
MPEG-4 Standard

Purpose. The MPEG-4 standard was developed in response to the growing need for a coding method that facilitates access to visual objects in natural and synthetic moving pictures and associated natural or synthetic sound for various applications, such as digital storage media, the Internet, and various forms of wired or wireless communication (10). The use of this standard means that motion video can be manipulated as a form of computer data and can be stored on various storage media, transmitted and received over existing and future networks, and be distributed on existing and future broadcast channels. Target applications include, but are not limited to, such areas as Internet multimedia, interactive video games, interpersonal communications (videoconferencing, videophone, etc.), interactive storage media (optical disks, etc.), multimedia mailing, networked database services (via ATM, etc.), remote emergency systems, remote video surveillance, and wireless multimedia.

Object-Based Coding Scheme. Audiovisual information consists of the coded representation of natural or synthetic objects that can be manifested audibly and/or visually. On the sending side, audiovisual information is compressed, composed, and multiplexed in one or more coded binary streams that are transmitted. On the receiver side, these streams are demultiplexed, decompressed, composed, and presented to the
Fig. 10. Processing stages in an audiovisual terminal.
end user. The end user may have the option to interact with the presentation. Interaction information can be processed locally or transmitted to the sender. The presentation can be performed by a stand-alone system or by part of a system that needs to utilize information represented in compliance with the MPEG-4 standard. In both cases, the receiver is generically called an audiovisual terminal, or just terminal (Fig. 10).

Scene Description. Scene description addresses the organization of audiovisual objects in a scene in terms of both spatial and temporal positioning. This information allows composing individual audiovisual objects after they have been rendered by their respective decoders. The MPEG-4 standard, however, does not mandate particular rendering or composition algorithms or architectures. These are considered implementation-dependent. The scene description can be accomplished by using two different methodologies: parametric [Binary Format for Scenes (BIFS)] and programmatic [Adaptive Audiovisual Session (AAVS) format]. The parametric description is constructed as a coded hierarchy of nodes with attributes and other information (including event sources and targets). The scene description can evolve over time by using coded scene description updates. The programmatic description consists of downloaded Java code that operates within a well-defined execution environment in terms of available systems services for accessing decoder, composition, and rendering resources.

Interactivity. To allow active user involvement with the audiovisual information presented, the MPEG-4 standard provides support for interactive operation. Interactive features are integrated with the scene description information that defines the relationship between sources and targets of events. It does not, however, specify a particular user interface or a mechanism that maps user actions (e.g., keyboard key presses or mouse movements) to such events.
Multiplexing/Demultiplexing. Multiplexing provides the necessary facilities for packaging individual audiovisual object data or control information related to audiovisual objects or system management and control in one or more streams. Individual components to be packaged are called elementary streams and are carried by the multiplexer in logical channels. These streams may be transmitted over a network or retrieved from (or placed on) a storage device. In addition, the multiplex layer provides system control tools, for example, to exchange data between the sender and the receiver or to manage synchronization and system buffers. The multiplex layer does not provide facilities for transport layer operation. It is assumed that these are provided by the network transport facility used by the application.

Decompression. Decompression recovers the data of an audiovisual object from its encoded format (syntax) and performs the necessary operations to reconstruct the original audiovisual object (semantics). The reconstructed audiovisual object is made available to the composition layer for potential use during scene rendering.

Profiles and Levels. Specification of profiles and levels in the MPEG-4 standard is intended to be generic in the sense that it serves a wide range of applications, bit rates, resolutions, qualities, and services. Furthermore, it allows a number of modes of coding of both natural and synthetic video to facilitate access to individual objects in images or video, called content-based access. Various requirements of typical applications have been considered, necessary algorithmic elements have been developed, and they have been integrated into a single syntax, which facilitates the bit-stream interchange among different applications. The MPEG-4 standard includes several complete decoding algorithms and a set of decoding tools. Moreover, the various tools can be combined to form other decoding algorithms. Considering the practicality of implementing the full syntax of this specification, however, a limited number of subsets of the syntax are also stipulated by "profile" and "level." An MPEG-4 "profile" is a defined subset of the entire bit-stream syntax. Within the bounds imposed by the syntax of a given profile, it is still possible to require very large variation in the performance of encoders and decoders depending on the values taken by parameters in the bit stream. "Levels" are defined within each profile to deal with this problem. A level is a defined set of constraints imposed on parameters in the bit stream. These constraints may be simple limits on numbers. Alternatively, they may take the form of constraints on arithmetic combinations of the parameters.

MPEG-4 Objects. Three types of objects supported by the MPEG-4 standard are synthetic objects, visual objects, and audio objects. Synthetic objects, such as geometric entities, transformations, and their attributions, are structured hierarchically to support bit-stream scalability and object scalability. The MPEG-4 standard provides the approach to spatial-temporal scene composition, including normative 2-D/3-D scene graph nodes and their composition supported by the Binary Interchange Format Specification (BIFS). Synthetic and natural object composition relies on rendering performed by the particular system to generate specific pixel-oriented views of the models. Coded visual data can be of several different types, such as video data, still-texture data, 2-D mesh data, or facial animation parameter data.
Audio objects include natural audio, synthetic audio, natural speech, and text-to-speech objects. The following paragraphs briefly explain the visual data objects.

Video Object. A video object in a scene is an entity that a user is allowed to access (seek, browse) and manipulate (cut and paste). Instances of video objects at a given time are called video object planes (vops). The encoding process generates a coded representation of a vop and the compositional information necessary for display. Further, at the decoder, a user may interact with and modify the compositional process as needed. The framework allows coding of individual video objects in a scene and also the traditional picture-based coding, in which the whole picture can be thought of as a single rectangular object. Furthermore, the syntax supports both nonscalable coding and scalable coding. Thus it becomes possible to handle normal scalabilities and object-based scalabilities. The scalability syntax enables reconstructing useful video from pieces of a total bit stream. This is achieved by structuring the total bit stream in two or more layers, starting from a stand-alone base
layer and adding a number of enhancement layers. The base layer can be coded using the nonscalable syntax or, in the case of picture-based coding, even a different video-coding standard. To ensure the ability to access individual objects, it is necessary to achieve a coded representation of their shape. A natural video object consists of a sequence of 2-D representations (at different time intervals), called vops here. For efficient coding of vops, both temporal redundancies and spatial redundancies are exploited. Thus a coded representation of a vop includes representing its shape, its motion, and its image contents.

Still-Texture Object. Visual texture, herein called still-texture coding, is designed to maintain high visual quality in the transmission and rendering of texture under the widely varied viewing conditions typical of interaction with 2-D/3-D synthetic scenes. Still-texture coding provides a multilayer representation of luminance, color, and shape. This supports progressive transmission of the texture for image buildup as it is received by a terminal. Downloading the texture resolution hierarchy for constructing image pyramids used by 3-D graphics APIs is also supported. Quality and SNR scalability are supported by the structure of still-texture coding.

Mesh Object. A 2-D mesh object is a representation of a 2-D deformable geometric shape, with which synthetic video objects may be created during a composition process at the decoder by spatially piecewise warping of existing video object planes or still-texture objects. Instances of mesh objects at a given time are called mesh object planes (mops). The geometry of mesh object planes is coded losslessly. Temporally and spatially predictive techniques and variable-length coding are used to compress 2-D mesh geometry. The coded representation of a 2-D mesh object includes representing its geometry and motion.

Face Object. A 3-D (or 2-D) face object is a representation of the human face that is structured to portray the visual manifestations of speech and facial expressions adequate to achieve visual speech intelligibility and the recognition of the mood of the speaker. A face object is animated by a stream of face animation parameters (FAP) encoded for low-bandwidth transmission in broadcast (one-to-many) or dedicated interactive (point-to-point) communications. The FAPs manipulate key feature control points in a mesh model of the face to produce animated visemes for the mouth (lips, tongue, teeth) and animation of the head and facial features, like the eyes. A simple FAP stream can be transmitted to a decoding terminal that animates a default face model. A more complex session can initialize a custom face in a more capable terminal by downloading face definition parameters (FDP) from the encoder. Thus specific background images, facial textures, and head geometry can be portrayed. The FAP stream for a given user can be generated at the user's terminal from video/audio or from text-to-speech. FAPs can be encoded at bit rates of up to 2 kbit/s to 3 kbit/s at the necessary speech rates. Optional temporal DCT coding provides further compression efficiency in exchange for delay. Using the facilities of systems, a composition of the animated face model and synchronized, coded speech audio (low-bit-rate speech coder or text-to-speech) can provide an integrated low-bandwidth audio/visual speaker for broadcast applications or interactive conversation.
MPEG-7 and Multimedia Databases MPEG-1 and -2 deal with coding at the pixel level. MPEG-4 codes the information at the object level. MPEG-7, a work item newly started in October 1996, will code information at the content level. From MPEG-1 to MPEG-7, the abstraction level of coding increases. Because MPEG-7 is still at its early stage, the description of MPEG-7 in this section is based on the currently available MPEG documents (10). It is highly possible that the final MPEG-7 standard may differ from that described here. MPEG-7 is formally called Multimedia Content Description Interface. It is motivated by the observation that current multimedia data lack a standard content-descriptive interface to facilitate retrieval and filtering. Nowadays, more and more audiovisual information is available all around the world. But before anyone can use this huge amount of information, it has to be located first. This challenge leads to the need for a mechanism that
supports efficient and effective multimedia information retrieval. In addition, another scenario is associated with broadcasting. For instance, an increasing number of broadcast channels are available, and this makes it difficult to select (filter) the broadcast channel that is potentially interesting. The first scenario is the pull (search) application and the second is the push (filtering) application. MPEG-7 aims to standardize a descriptive interface for the multimedia content so that it can easily be searched and filtered. Specifically, MPEG-7 will standardize the following:
• a set of descriptive schemes and descriptors
• a language to specify descriptive schemes, that is, a Description Definition Language (DDL)
• a scheme for coding the description
Before we go any further into MPEG-7, it is important first to introduce some key terminology used throughout the MPEG-7 description (a sketch of how these terms relate follows the list):
• Data: Audiovisual information that will be described by using MPEG-7, regardless of the storage, coding, display, transmission, medium, or technology. This definition is intended to be sufficiently broad to encompass video, film, music, text, and any other medium, for example, an MPEG-4 stream, a video tape, or a book printed on paper.
• Feature: A feature is a distinctive part or characteristic of the data which stands to somebody for something in some respect or capacity (10). Some examples are the color of an image, the style of a video, the title of a movie, the author of a book, the composer of a piece of music, the pitch of an audio segment, and the actors in a movie.
• Descriptor: A descriptor associates a representational value with one or more features. Note that it is possible to have multiple representations for a single feature. Examples might be a time code to represent duration or both color moments and histograms to represent color.
• Description Scheme: The Description Scheme (DS) defines a structure and the semantics of the descriptors and/or description schemes that it contains to model and describe the content.
• Description: A description is the entity that describes the data and consists of a description scheme and an instantiation of the corresponding descriptors.
• Coded Description: A coded description is a representation of the description allowing efficient storage and transmission.
• Description Definition Language: This is the language used to specify Description Schemes. The DDL will allow the creation of new DSs and the extension of existing DSs.
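These definitions nest: a description scheme structures descriptors (and possibly other schemes), and a description instantiates them for a particular piece of content. The following minimal sketch uses hypothetical Python data structures purely to illustrate those relationships; it is not MPEG-7 syntax, which the standard itself is still to define.

from dataclasses import dataclass, field
from typing import List

@dataclass
class Descriptor:
    """Associates a representational value with a feature."""
    feature: str      # e.g., "dominant color" or "title"
    value: object     # e.g., a histogram, a time code, or a text string

@dataclass
class DescriptionScheme:
    """Defines the structure holding descriptors and child schemes."""
    name: str
    descriptors: List[Descriptor] = field(default_factory=list)
    children: List["DescriptionScheme"] = field(default_factory=list)

# A toy description of a video shot; all names here are illustrative only.
shot = DescriptionScheme(
    "VideoShot",
    descriptors=[
        Descriptor("dominant color", (0.2, 0.3, 0.5)),
        Descriptor("duration", "00:00:12:05"),
    ],
)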
MPEG-7, like the other members of the MPEG family, is a standard representation of audiovisual information that satisfies particular requirements. The MPEG-7 standard builds on other (standard) representations, such as analog, PCM, MPEG-1, -2, and -4. One functionality of the MPEG-7 standard is to provide references to suitable portions of other standards. For example, perhaps a shape descriptor used in MPEG-4 is also useful in an MPEG-7 context, and the same may apply to motion vector fields used in MPEG-1 and -2. MPEG-7 descriptors, however, do not depend on the ways the described content is coded or stored. It is possible to attach an MPEG-7 description to an analog movie or to a picture printed on paper. Because the descriptive features must be meaningful in the context of the application, they will differ for different user domains and different applications. This implies that the same material can be described by using different types of features tuned to the area of application. Taking the example of visual material, a lower abstraction level would be a description, for example, of shape, size, texture, color, and movement. The highest level would give semantic information: “This is a scene with a barking brown dog on the left and a blue ball that falls down on the right, with the sound of passing cars in the background.” All of these descriptions are coded efficiently for search. Intermediate levels of abstraction may also exist.
Fig. 11. A block diagram of an MPEG-7 application.
MPEG-7 data may be physically located with the associated audiovisual material in the same data stream or on the same storage system, but the descriptions could also exist somewhere else on the globe. When the content and its descriptions are not co-located, mechanisms that link audiovisual material and their MPEG-7 descriptions are useful. These links should work in both directions. Figure 11 illustrates the functionality and information flow within the MPEG-7 framework. The dotted box denotes the normative elements of the MPEG-7 standard. Note that not all the components in the figure need to be present in all MPEG-7 applications. It is also possible that there are other streams from the content to the user besides those in the figure. Upon completing MPEG-7, many applications, both social and economic, may be greatly affected, including:
• education
• journalism (e.g., searching speeches of a certain politician using his name, his voice, or his face)
• tourist information
• entertainment (e.g., searching a game, karaoke)
• investigative services (human characteristic recognition, forensics)
• geographical information systems
• medical applications
• shopping (e.g., searching for clothes that you like)
• architecture, real estate, interior design
• social (e.g., dating services)
• film, video, and radio archives
The call for proposals of MPEG-7 will be released in October 1998, and the proposals are due in February 1999. The following are called for by the MPEG group:
• Description Schemes
• Descriptors
• Description Definition Language (DDL)
• Coding methods for compact representation of Descriptions
• Tools for creating and interpreting Description Schemes and Descriptors (not among the normative parts)
• Tools for evaluating content-based audiovisual databases (not among the normative parts)
The final International Standard is expected in September 2001.
Current and Future Research Issues
We are entering an exciting era of research in image and video coding. Many important and challenging research issues emerge from applications related to the Internet, multimedia, and virtual reality:
(1) Interplay between networking and compression: Transmission of images and video over heterogeneous computer networks (involving possibly ATM networks, wireless, etc.) introduces many research issues. These include joint source-channel coding, error-resilient coding, and quality of service.
(2) Interplay between processing and compression: For many applications, it is important to process in the compressed domain to reduce storage and processing requirements. Thus, we need algorithms for image/video analysis and understanding in the compressed domain.
(3) Interplay between multimedia database retrieval/visualization and compression: In the past, image/video compression was mainly for transmission. More and more, applications involve the storage, retrieval, and visualization of the data. Thus, the performance criteria of compression algorithms need to be reexamined. For example, in many cases, it is important to decode at different image resolutions.
Acknowledgments The authors would like to thank Igor Kozintsev, Yong Rui, Sergio Servetto, Hai Tao, and Xuguang Yang for their generous help in writing this article.
BIBLIOGRAPHY
1. C. E. Shannon, A mathematical theory of communication, Bell Syst. Tech. J., 27: 379–423, 1948.
2. W. Pennebaker, J. Mitchell, JPEG Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.
3. M. Vetterli, J. Kovacevic, Wavelets and Subband Coding, Upper Saddle River, NJ: Prentice-Hall, 1995.
4. M. F. Barnsley, H. Rising III, Fractals Everywhere, 2nd ed., Boston, MA: Academic Press Professional, 1993.
5. A. E. Jacquin, Image coding based on a fractal theory of iterated contractive image transformations, IEEE Trans. Image Process., 1: 18–30, 1992.
6. Y. Fisher (ed.), Fractal Image Compression: Theory and Application, New York: Springer-Verlag, 1995.
7. J. L. Mitchell et al., MPEG Video Compression Standard, New York: International Thomson, 1997.
8. International Telecommunication Union (ITU), Video Codec for Audiovisual Services at p × 64 kb/s, Geneva, Switzerland, 1993, Recommendation H.261.
9. International Telecommunication Union (ITU), Video Coding for Low Bit Rate Communication, Geneva, Switzerland, 1996, Recommendation H.263.
10. MPEG Web Page [Online], 1998. Available www: http://drogo.cselt.stet.it/mpeg/
KANNAN RAMCHANDRAN MOHAMMAD GHARAVI-ALKHANSARI THOMAS S. HUANG University of Illinois at Urbana-Champaign
Wiley Encyclopedia of Electrical and Electronics Engineering
Image Color Analysis
Standard Article
Henry R. Kang, Peerless System Corporation, Fairport, NY
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W4104
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Basics of Image Color Analysis; Color Description and Formulation; Human Visual Responses; Color Appearance Model; Color Mixing Models; Circular Dot Overlap Model; Device-Independent Color Imaging.
IMAGE COLOR ANALYSIS
Color image analysis is a complex process. A color image is a multidimensional entity. In addition to the color attributes, an image is sampled in a two-dimensional plane having a quantized value to indicate the tone intensity in the third dimension. The smallest unit in the two-dimensional image plane is called the picture element (pixel or pel), constituted by the pixel size (width and height) and depth (the number of tone levels). However, the appearance of a color image is much more intriguing than a few physical measures; there are psychological and psychophysical attributes that cannot be measured quantitatively. Therefore, the most important factor in image analysis, color or not, is human vision, because the human being is the ultimate judge of image quality. Human vision provides the fundamental guidance to digital imaging design, analysis, and interpretation.
Basics of Image Color Analysis
Color images have at least two components, spatial patterns and colorimetric properties. Spatial patterns include inadequate sampling (aliasing, staircasing, etc.), the interaction of color separations (moiré and rosette), and false contouring caused by insufficient tone levels. Colorimetric properties involve color transformation, color matching, color difference, surround, adaptation, and so on. The overall image quality is the integration of these factors as perceived by human eyes. Various theories and models are developed to address these problems and requirements. Sampling and quantization theories ensure proper resolution and tone level. Colorimetry provides quantitative measures to specify color stimuli. Color mixing models and transformation techniques give the ability to move color images through various imaging devices. Color appearance models attempt to account for phenomena, such as chromatic adaptation, that cannot be explained by colorimetry. Device models deal with various device characteristics, such as contone rendering, halftoning, color quantization, and the device dot model for generating appealing color images. A thorough image color analysis should include the human visual model, color measurement, color mixing model, color space transform, color matching, device models, device characteristics, and the color appearance model. Human Vision Theory. Human vision is initiated from the photoreceptors (rods and cones) in the retina of the eye. Rods and cones respond to different levels of luminance; rods serve vision at low luminance levels, and cones at higher luminance levels. In the intermediate levels, both rods and cones function. There is only one type of rod receptor with a peak spectral responsivity at approximately 510 nm. There are three types of cone receptors [see Fig. 1 (Ref. 3)], referred to as L, M, and S cones for their sensitivity in the long, middle, and short wavelengths, respectively. The modern theory of color vision is based on trichromatic and opponent color theories, as shown in Fig. 1 (1). The incoming color signal is separated by three types of cones; the neurons of the retina encode the color into opponent signals. The sum of three cone types (L + M + S) produces an achromatic response, while the combinations of two types with the difference of a third type produce the red-green (L − M + S) and yellow-blue (L + M − S) opponent signals (Ref. 1, pp. 22–23).
Fig. 1. Schematic illustration of the encoding of cone signals into opponent-color signals in the human visual system.
In addition to static human vision theory, dynamic adaptation is very important in color perception. It includes dark, light, and chromatic adaptations. Dark adaptation is the change in visual sensitivity that occurs when the illumination is decreased, such as when one walks into a darkened theater on a sunny afternoon. At first, the entire theater appears completely dark. But after a few minutes one is able to see objects in the theater, such as the aisles, seats, and people. This happens because the visual system is responding to the lack of illumination by becoming more sensitive and capable of producing a meaningful visual response at the lower illumination level. Light adaptation is a decrease in visual sensitivity upon increases in the overall level of illumination. For example, it occurs when one leaves a darkened theater and walks into a sunny street. In this case, the visual system must become less sensitive and adjust to the normal vision (Ref. 1, pp. 23–28). Chromatic adaptation is achieved through the largely independent adaptive gain control on each of the L,
M, and S cone responses. The gains of the three channels depend on the state of eye adaptation, which is determined by preexposed stimuli and the surround, but independent of the test stimulus (2). An example of chromatic adaptation is viewing a piece of white paper under daylight and then under incandescent light. The paper appears white under both illuminants, despite the fact that the energy reflected from the paper has changed from the blue shade to yellow shade. Chromatic adaptation is the single most important property of the human visual system with respect to understanding and modeling color appearance. Human visual system (HVS) models are attempts to utilize human visual sensitivity and selectivity for modeling and improving the perceived image quality. The HVS is based on the psychophysical process that relates psychological phenomena (color, contrast, brightness, etc.) to physical phenomena (light intensity, spatial frequency, wavelength, etc.). It determines what physical conditions give rise to a particular psychological (perceptual, in this case) condition. A common approach is to study the relationship between stimulus quantities and the perception of the just noticeable difference (JND)—a procedure in which the observer is required to discriminate between color stimuli that evoke JNDs in visual sensation (3). To this end, Weber’s and Fechner’s laws play key roles in the basic concepts and formulations of psychophysical measures. Weber’s law states that the JND of a given stimulus is proportional to the magnitude of the stimulus. Fechner built on the work of Weber to derive a mathematical formula, indicating that the perceived stimulus magnitude is proportional to the logarithm of the physical stimulus intensity. Weber fractions in the luminance, wavelength, color purity, and spatial frequency are the foundations of the color difference measure and image texture analysis. Another important foundation is the Stevens power law. Based on the study of over 30 different types of perceptions, Stevens hypothesized that the relationship between perceptual magnitude and stimulus intensity followed a power law with various exponents for different perceptions. The Stevens power law has been used to model many perceptual phenomena and can be found in fundamental aspects of color measurement, such as the relationship between CIEXYZ and CIELAB (Ref. 1, pp. 45–48). Color Mixing. Many of us are familiar with the mixing of colors from childhood experience. But the proper mixing of colors is not a simple matter. There are many theories and models developed solely for interpreting and guiding the mixing of colors. The simplest color mixing model is the ideal block dye model. Roughly, the visible spectrum can be divided into three bands—the blue band from the end of the ultraviolet light to about 520 nm, the green band from about 520 nm to around 610 nm, and the red band from about 610 nm to the onset of the infrared radiation. Additive primaries of block dyes are assumed to have a perfect reflection at one band of the visible spectrum and a total absorption for the other two bands. Subtractive primaries of block dyes reflect perfectly in two bands and absorb completely in one. Additive and subtractive color mixings are different. Additive color mixing is applied to imaging systems that are light emitters, where the subtractive mixing is for imaging systems that combine unabsorbed light. We use ideal block dyes to illustrate the difference between additive and subtractive color mixings. 
Additive color mixing is defined as a color stimulus for which the radiant power in any wavelength interval, small or large, in any part of the spectrum, is equal to the sum of the powers in the same interval of the constituents of the mixture, constituents that are assumed to be optically incoherent (Ref. 3, p. 118). An important characteristic of the additive system is that the object itself is a light emitter (a good example is a color television). Red, green, and blue (RGB) are additive primaries. For ideal block dyes, the mixture of two additive primaries produces a subtractive primary. The mixture of the red and green is yellow, the mixture of the green and blue is cyan, and the mixture of the red and blue is magenta. The mixture of all three additive primaries produces a spectrum of total reflection across the whole visible region, giving a white color. These eight colors define the size of all colors that can be rendered by the additive system. They form a cube with three additive primaries as the axes, having strength in the range of [0,1]. Thus, the full-strength primaries have coordinates of (1,0,0) for red, (0,1,0) for green, and (0,0,1) for blue. The origin of the cube (0,0,0), emitting no light, is black. For two-color mixtures, yellow locates at (1,1,0), cyan at (0,1,1), and magenta at (1,0,1), where the three-color mixture is white and locates at (1,1,1). This cube is called the additive color space, and the
volume of this cube is the size of the color gamut. Note that the additive mixing of ideal block dyes behaves like a logical OR operation; the output is 1 when any of the primaries is 1. Subtractive color mixing is defined as color stimuli for which the radiant powers in the spectra are selectively absorbed by an object such that the remaining spectral radiant power is reflected or transmitted and then received by observers or measuring devices (4). The object itself is not a light emitter, but a light absorber. Cyan, magenta, and yellow (cmy) are subtractive primaries. This system is the complement of the additive system. When cyan and magenta block dyes are mixed, cyan absorbs the red reflection of the magenta and magenta absorbs the green reflection of the cyan. This leaves blue as the only nonabsorbing region. Similarly, the mixture of magenta and yellow gives red, and the mixture of cyan and yellow gives green. When all three subtractive primaries are mixed, all components of the incident light are absorbed to give a black color. Thus, the subtractive mixing of block dyes behaves like a logical AND operation; for a given region of the visible spectrum, the output is 1 if all participating primaries are 1 at that spectral region. Similar to the additive system, the subtractive primaries also form a cube with positions occupied by the complementary colors of the additive cube, where complementary colors are pairs of colors that give an achromatic hue (gray tone), such as the yellow–blue, green–magenta, and cyan–red pairs. A good example of subtractive color mixing is color printing on paper. Color Spaces and Transforms. A color image is acquired, displayed, and rendered by different imaging devices. For example, an image can be acquired by a flat-bed color scanner, displayed on a color monitor, and rendered by a color printer. Different imaging devices use different color spaces; well-known examples are RGB space for monitors and cmy (or cmyk) space for printers. There are two main types of color space: device dependent and device independent. Color signals produced or utilized by imaging devices, such as scanners and printers, are device dependent; they are device specific, depending on the characteristics of the device, such as imaging technology, color description, color gamut, and stability. Thus, the same image rendered by different imaging devices may not look the same. In fact, the images usually look different. These problems can be minimized by using some device-independent color spaces. The Commission Internationale de l'Éclairage (CIE) developed a series of color spaces using colorimetry that are not dependent on the imaging devices. CIE color spaces provide an objective color measure and are being used as interchange standards for converting between device-dependent color spaces. A brief introduction will be given in the section titled "CIE Color Spaces." Because various color spaces are used in the color image process, there is a need to convert them. Color space transformation is essential to the transport of color information during image acquisition, display, and rendition. Converting a color specification from one space to another means finding the links of the mapping. Some transforms have a well-behaved linear relationship; others have a strong nonlinear relationship. Many techniques have been developed to provide sufficient accuracy for color space conversion under the constraints of cost, speed, design parameters, and implementation concerns.
Generally, these techniques can be classified into four categories. The first one is based on theoretical color mixing models like the Neugebauer equations and Kubelka–Munk theories. With analytical models, one can predict the color specifications using relatively few measurements. The second method, polynomial regression, is based on the assumption that the correlation between color spaces can be approximated by a set of simultaneous equations. Once the equations are selected, a statistical regression is performed to a set of selected points with known color specifications in both source and destination spaces for deriving the coefficients of the equations. The third approach uses a three-dimensional lookup table with interpolation. A color space is divided into small cells; then the source and destination color specifications are experimentally determined for all lattice points. Nonlattice points are interpolated, using the nearest lattice points. The last category comprises approaches such as neural networks and fuzzy logic. Detailed descriptions and formulations of these techniques can be found in Ref. 4. Color Appearance Phenomena. CIE colorimetry is based on color matching; it can predict color matches for an average observer, but it is not equipped to specify color appearance. Colorimetry is extremely useful if two stimuli are viewed with identical surround, background, size, illumination, surface characteristics,
and so on. If any of these constraints is violated, it is likely that the color match will no longer hold. In practical applications, the constraints for colorimetry are rarely met. The various phenomena that break the simple CIE tristimulus system are the topics of the color appearance model, such as simultaneous contrast, hue change with luminance level (Bezold–Brücke hue shift), hue change with colorimetric purity (Abney effect), colorfulness increase with luminance (Hunt effect), contrast increase with luminance (Stevens effect), color constancy, discounting of the illuminant, and object recognition (Ref. 1, pp. 133–157). A frequently used example to show the appearance phenomenon is a picture containing two identical gray patches, one surrounded by a black background and the other by a white background. The one with the black surround appears lighter. If we cover both backgrounds with a piece of white paper (or any uniform paper), with two square holes for the gray patches to show through, then these two gray patches will look exactly the same. This simultaneous contrast also occurs in color images. This example illustrates the need for the color appearance model to account for the appearance phenomena, because the color measurement will not be able to tell the difference. The key features of color appearance models are the transform of spectral power distribution to a 3-D signal by cones, chromatic adaptation, opponent color processing, nonlinear response functions, memory colors, and discounting of the illuminant. Color appearance modeling is still an actively researched area, and various models have been proposed. A brief review and an example will be given in the section titled "Color Appearance Model."
Most rendering devices, such as CRT and dot matrix, ink jet, and xerographic printers, have a circular pixel shape (sometimes an irregular round shape because of the ink spreading). A circular pixel will never be able to fit exactly the size and shape of the square grid. To compensate for these defects, the diameter of the round pixel is made bigger than the digital grid period τ such that dots will touch in diagonal positions. This condition is the minimum requirement for a total area coverage by round pixels in a square grid. Usually, the pixel size is made larger than √2 τ but less than 2τ to provide a higher degree of overlapping. These larger pixels eliminate the discontinuity and reduce the jaggedness even in the diagonal direction. But these larger pixels cause the rendered area per pixel to be more than the area of the digital grid. When adjacent dots overlap, the interaction in the overlapped pixel areas is not simply an additive one, but more resembles a logical OR (as if areas of the paper covered by ink are represented by a 1 and uncovered areas are represented by a 0). That is, parts of the paper that have already been darkened are not made significantly darker by an additional section of a dot being placed on top. The correction for the overlapping of the irregularly shaped and oversized pixels is an important part of the overall color imaging model; the formulation of the circular dot overlap model will be given in the section titled "Dot Overlap Model."
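The logical OR behavior described above can be made concrete with a small rasterization sketch. The following toy model (an assumption-laden illustration, not the circular dot overlap formulation referenced in the article) places round dots of diameter d, in units of the grid period, on a square grid and composites them with a logical OR.

import numpy as np

def render_dots(on_cells, d=1.5, res=20):
    """OR-composite round dots (diameter d grid periods) centered on the
    addressed cells; res is the number of subsamples per grid period."""
    rows, cols = on_cells.shape
    canvas = np.zeros((rows * res, cols * res), dtype=bool)
    yy, xx = np.mgrid[0:rows * res, 0:cols * res]
    for r, c in zip(*np.nonzero(on_cells)):
        cy, cx = (r + 0.5) * res, (c + 0.5) * res
        dot = (yy - cy) ** 2 + (xx - cx) ** 2 <= (0.5 * d * res) ** 2
        canvas |= dot  # already-inked areas are not darkened further
    return canvas

grid = np.array([[1, 0], [0, 1]])   # two dots addressed on a diagonal
print(render_dots(grid).mean())     # covered fraction exceeds 2/4 because d > 1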
In addition to the discrete nature of the digital grid, different imaging devices use different imaging technology. The most important rendering methods are the contone and halftone techniques. A contone device is capable of producing analog signals in both spatial and intensity domains, providing continuously changing shades. An example of the true contone imaging is the photographic print. Scanners, monitors, and dye diffusion printers with 8 bit depth (or more) can be considered to be contone devices if the spatial resolution is high enough. The additive color mixing of the contone devices (e.g., monitors) can be addressed by Grassmann’s laws, in which the subtractive color mixing of contone devices (e.g., printers) can be interpreted by the Beer– Lambert–Bouguer law and Kubelka–Munk theory. These will be discussed in the section titled “Color Mixing Models.” The halftone process is perhaps the single most important factor in image reproduction for imaging devices with a limited tone level (usually bilevel). Simply stated, halftoning is a printing and display technique that trades area for depth by partitioning an image plane into small areas containing a certain number of pixels; it then modulates the tone density by modulating area to simulate the gray sensation for bilevel devices. This is possible because of limited human spatial contrast sensitivity. At normal viewing distance, persons with correct (or corrected) vision can resolve roughly 8 cycles/mm to 10 cycles/mm. The number of discernible levels decreases logarithmically as the spatial frequency increases. With a sufficiently high frequency, the gray sensation is achieved by integrating over areas at normal viewing distance. This binary representation of the contone image is an illusion; much image information has been lost after the halftoning. Thus, there is a strong possibility of creating image artifacts such as moir´e and contouring. Moir´e is caused by the interaction among halftone screens, and the contouring is due to an insufficient tone level in the halftone cell. There will be no artifacts if a continuous imaging device is used, such as photographic prints. Recently, printers have been developed that can output a limited number of tone levels (usually levels ≤ 16). The result of this capability creates the so-called multilevel halftoning, which combines the halftone screen with limited tone levels to improve color image quality. Halftoning is a rather complex technique; there are books and lengthy reviews on this topic. Readers interested in learning about this technique can refer to the (Ref. 4, pp. 208–260). Most color displays are based on a frame buffer architecture, in which the image is stored in a video memory from which controllers constantly refresh the screen. Usually, images are recorded as full color images, where the color of each pixel is represented by tristimuli with respect to the display’s primaries and quantized to 8 or 12 bits per channel. A 3-primary and 8-bit device, in theory, can generate 256 × 256 × 256 = 16,777,216 colors. Often, the cost of high-speed video memory needed to support storage of these full color images on a high-resolution display is not justified. And the human eye is not able to resolve these many colors; the most recent estimate is about 2.28 million discernible colors (5). 
Therefore, many color display devices limit the number of colors that can be displayed simultaneously to 8, 12, or 16 bits of video memory, allowing 256, 4096, and 65,536 colors, respectively, for the cost consideration. A palettized image, which has only the colors contained in the palette, can be stored in the video memory and rapidly displayed using lookup tables. This color quantization is designed in two successive steps: (1) the selection of a palette and (2) the mapping of each pixel to a color in the palette. The problem of selecting a palette is a specific instance of the vector quantization. If the input color image has N distinct colors and the palette is to have K entries, the palette selection may be viewed as the process of dividing N colors into K clusters in 3-D color space and selecting a representative color for each cluster. Ideally, this many-to-one mapping should minimize perceived color difference (2). The color quantization can also be designed as a multilevel halftoning. For example, the mapping of a full color image to an 8-bit color palette (256 colors) is a multilevel halftoning with 8 levels of red (3 bits), 8 levels of green (3 bits), and 4 levels of blue (2 bits).
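As a concrete sketch of that last point, mapping a 24-bit RGB pixel to an 8-bit palette index can be done by uniformly quantizing the channels to 3, 3, and 2 bits. This is an illustrative toy, not a perceptually optimized palette design.

def rgb_to_332(r, g, b):
    """Pack 8-bit R, G, B into an 8-bit index: 3 bits red, 3 green, 2 blue."""
    return ((r >> 5) << 5) | ((g >> 5) << 2) | (b >> 6)

def palette_332():
    """The 256 representative colors implied by the 3-3-2 index."""
    pal = []
    for idx in range(256):
        r = (idx >> 5) * 255 // 7           # 8 levels of red
        g = ((idx >> 2) & 0x07) * 255 // 7  # 8 levels of green
        b = (idx & 0x03) * 255 // 3         # 4 levels of blue
        pal.append((r, g, b))
    return pal

print(rgb_to_332(200, 120, 30))  # palette index of a brownish pixel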
Color Description and Formulation
The following sections present mathematical formulations of the CIE color space, color transformation, human visual model, color appearance model, color mixing model, and dot overlap model. These models are developed to address image color analysis and quality improvement. CIE Color Spaces. CIE color spaces are defined in CIE Publication 15.2 (6). The most frequently used CIE color spaces are CIEXYZ, CIELUV, and CIELAB. CIEXYZ is a visually nonuniform CIE color space; it is built on the spectroscopic foundation together with the human color matching ability. The measured reflectance spectrum of an object P is weighed by the spectra of the color matching functions, x̄, ȳ, and z̄, and the standard illuminant Q. Resulting spectra are integrated across the whole visible region to give the tristimulus values:
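In standard CIE notation, and consistent with the definitions in this paragraph, the tristimulus integrals can be written as (k is a normalizing constant, a symbol added here for completeness):

\[
X = k \int P(\lambda)\,Q(\lambda)\,\bar{x}(\lambda)\,d\lambda,\qquad
Y = k \int P(\lambda)\,Q(\lambda)\,\bar{y}(\lambda)\,d\lambda,\qquad
Z = k \int P(\lambda)\,Q(\lambda)\,\bar{z}(\lambda)\,d\lambda .
\]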
where X, Y, and Z are tristimulus values of the object, and λ is the wavelength. In practice, the integrals are approximated by finite step summations. The three-dimensional nature of color suggests plotting the value of each tristimulus component along orthogonal axes. The result is called tristimulus space. The projection of this space to the two-dimensional X and Y plane yields the chromaticity diagram of Fig. 2. Mathematically, this can be expressed as
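A standard form of these normalizations, consistent with the definitions that follow, is

\[
x = \frac{X}{X+Y+Z}, \qquad y = \frac{Y}{X+Y+Z}, \qquad z = \frac{Z}{X+Y+Z} = 1 - x - y .
\]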
where x, y, and z are the chromaticity coordinates. They are the normalization with respect to the sum of tristimulus values. The chromaticity coordinates represent the relative amounts of three stimuli X, Y, and Z required to obtain any color. However, they do not indicate the luminance of the resulting color. The luminance is incorporated into the Y value. Thus, a complete description of a color is given by the triplet (x, y, Y). The 1931 CIE chromaticity diagram is shown in Fig. 2; the boundary of the color locus is the plot of the color matching functions (the spectral colors). The chromaticity diagram provides much useful information, such as the dominant wavelength, complementary color, and purity of a color. The dominant wavelength for a color is obtained by extending the line connecting the color and the illuminant to the spectrum locus. For example, the dominant wavelength for color s1 in Fig. 2 is 584 nm. The complement of a spectral color is on the opposite side of the line connecting the color and the illuminant used. For example, the complement of the color s1 is 483 nm under the illuminant D65 . If the extended line for obtaining the dominant wavelength intersects the “purple line,” the straight line that connects two extreme spectral colors (usually 380 nm and 770 nm), then the color will have no dominant wavelength in the visible spectrum. In this case, the dominant wavelength is specified by the complementary spectral color with a suffix c. The value is obtained by extending a line
Fig. 2. CIE chromaticity diagram showing the dominant wavelength and purity.
backward through the illuminant to the spectrum locus. For example, the dominant wavelength for color s2 in Fig. 2 is 530c nm. Pure spectral colors lie on the spectrum locus that indicates a fully saturated 100% purity. The illuminant represents a fully diluted color with a purity of 0%. The purity of intermediate colors is given by the ratio of two distances: the distance from the illuminant to the color and the distance from the illuminant through the color to the spectrum locus or the purple line. In Fig. 2, the purity of color s1 is a/(a + b) and s2 is c/(c + d) expressed as a percentage (7). The CIE chromaticity coordinates of a mixture of two or more colors obey Grassmann’s laws (Ref. 3, p. 118). The laws provide the proportionality and additivity to the color mixing: (1) Symmetry law—If color stimulus A matches color stimulus B, then color stimulus B matches color stimulus A. (2) Transitivity law—If A matches B and B matches C, then A matches C. (3) Proportionality law—If A matches B, then αA matches αB, where α is any positive factor by which the radiant power of the color stimulus is increased or reduced while its relative spectral distribution is kept the same. (4) Additivity law—If A matches B, C matches D, and (A + C) matches (B + D), then (A + D) matches (B + C), where (A + C), (B + D), (A + D), (B + C) denote, respectively, additive mixtures of A and C, B and D, A and D, and B and C. Visually uniform color spaces of CIELAB and CIELUV are derived from nonlinear transforms of CIEXYZ; they describe colors using opponent type axes relative to a given absolute white point reference. The CIELUV is a transform of the 1976 UCS chromaticity coordinate u , v , Y.
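The CIELUV transform itself is standard; a reconstruction consistent with CIE Publication 15.2 (the subscript n denotes the reference white) is

\[
L^* = 116\,(Y/Y_n)^{1/3} - 16 \quad (Y/Y_n > 0.008856),\qquad
u^* = 13\,L^*\,(u' - u'_n),\qquad
v^* = 13\,L^*\,(v' - v'_n),
\]
\[
u' = \frac{4X}{X + 15Y + 3Z}, \qquad v' = \frac{9Y}{X + 15Y + 3Z}.
\]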
CIELAB is the transform of 1931 CIEXYZ:
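In its standard form, again with the subscript n denoting the reference white, the transform is

\[
L^* = 116\,f(Y/Y_n) - 16,\qquad
a^* = 500\,[\,f(X/X_n) - f(Y/Y_n)\,],\qquad
b^* = 200\,[\,f(Y/Y_n) - f(Z/Z_n)\,],
\]
\[
f(t) = t^{1/3}\ \ (t > 0.008856),\qquad f(t) = 7.787\,t + 16/116\ \ (t \le 0.008856).
\]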
where L* is the lightness and Xn, Yn, and Zn are the tristimulus values of the illuminant. It is sometimes desirable to identify the components of a color difference as correlates of perceived hue, colorfulness, and lightness. The relative colorfulness, or chroma C*ab, and hue angle hab are defined as
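These correlates have the standard CIELAB form

\[
C^*_{ab} = \sqrt{a^{*2} + b^{*2}}, \qquad h_{ab} = \tan^{-1}\!\bigl(b^*/a^*\bigr).
\]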
The color difference is defined as the Euclidean distance in the three-dimensional space. For CIELAB space, the color difference ΔE*ab is
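Written out, this Euclidean distance takes the familiar form

\[
\Delta E^*_{ab} = \sqrt{(\Delta L^*)^2 + (\Delta a^*)^2 + (\Delta b^*)^2}.
\]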
The just noticeable color difference is approximately 0.5 to 1.0 ΔE*ab units. Device-Dependent Color Transform. Again, the simplest model for converting from RGB to cmy is the block dye model, in which a subtractive primary is the total reflectance minus the reflectance of the corresponding additive primary. Mathematically, this is expressed as
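With the total reflectance normalized to 1, the block-dye relation is simply

\[
c = 1 - R, \qquad m = 1 - G, \qquad y = 1 - B .
\]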
where the total reflectance is normalized to 1. This model is simple with minimum computation cost, but it is grossly inadequate. This is because none of the additive and subtractive primaries come close to the block dye
spectra; there are substantial unwanted absorptions in cyan and magenta primaries. To minimize this problem and improve the color fidelity, a linear transform is applied to input RGB:
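A generic form of such a correction, using the transfer-matrix coefficients discussed below, is

\[
\begin{pmatrix} R' \\ G' \\ B' \end{pmatrix} =
\begin{pmatrix} k_{11} & k_{12} & k_{13} \\ k_{21} & k_{22} & k_{23} \\ k_{31} & k_{32} & k_{33} \end{pmatrix}
\begin{pmatrix} R \\ G \\ B \end{pmatrix}.
\]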
Then the corrected additive primary, R′G′B′, is removed from the total reflectance to obtain the corresponding subtractive color.
The coefficients kij of the transfer matrix are determined in such a way that the resulting cmy have a closer match to the original or desired output. Generally, the diagonal elements have values near 1 and the off-diagonal elements have small values, serving as the adjustment to fit the measured results. However, this correction is still not good enough because it is a well-known fact that the RGB to cmy conversion is not linear. A better way is to use three nonlinear functions or lookup tables for transforming RGB to R′G′B′. More realistic color mixing models are given in the section titled "Color Mixing Models." Color Gamut Mapping. One major problem in the color transform is the color gamut mismatch, where the sizes of the source and destination color spaces do not match. This means that some colors in the source space are outside the destination space and cannot be faithfully reproduced. Conversely, some colors in the destination space are outside the source space and will never be used. In this case, color gamut mapping is needed to bring the out-of-gamut colors within the gamut of the destination color space. Color gamut mapping is a technique that acts on a color space to transform a color from one position to another for the purpose of enhancing or preserving the appearance of an image. A special case is the tone reproduction curve, which maps input lightness values to output device values. Conventionally, color gamut mapping is applied to all components of a color space. It is one of the difficult problems in color reproduction; many approaches have been developed. In general, they consist of two components: a directional strategy and a mapping algorithm. The directional strategy decides in which color space to perform the color mapping, determines the reference white point, computes the gamut surface, and decides which color channel to hold constant and which channel to compress first, or whether to compress them simultaneously. The mapping algorithm provides the means for transforming input colors to the surface or inside of the output color space. Frequently used mapping algorithms are clipping, linear compression, and nonlinear compression. Clipping is characterized by the mapping of all out-of-gamut colors to the surface of the output color space, and no change is made to input colors that are inside the output gamut. This maps multiple colors into the same point, which may lose fine details and cause shading artifacts in some images. The linear compression can be expressed as
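A common way of writing this linear lightness compression, consistent with the variable names defined next, is

\[
L^*_{o} = \frac{L^*_{o,\max} - L^*_{o,\min}}{L^*_{i,\max} - L^*_{i,\min}}\,\bigl(L^*_{i} - L^*_{i,\min}\bigr) + L^*_{o,\min}.
\]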
where L*i and L*o are the lightness of the input and output, respectively, the subscript max indicates the maximum value, and the subscript min indicates the minimum value. A similar expression can be obtained for chroma at a constant hue angle. Nonlinear compression, in principle, can use any function. In practice, the midtone is usually retained, while the shadow and highlight regions are nonlinearly mapped by using
a tangent to a linear function near the gray axis, then approaching a horizontal line near the maximum output saturation. This is designed to minimize the loss of details that occurs with clipping while retaining the advantage of reproducing most of the common gamut colors accurately. Linear and nonlinear mappings maintain the shading but can cause an objectionable decrease in saturation. Color gamut mapping is most effective in color spaces in which the luminance and chrominance are independent of each other, such as CIELAB and CIELUV. Popular directional strategies for CIELAB or CIELUV are (1) sequential L* and C* mapping at constant h, (2) simultaneous L* and C* mapping to a fixed anchor point at constant h, (3) simultaneous L*, C*, and h mapping to minimize the color difference between two gamuts, and (4) image-dependent gamut mapping. More information on color gamut mapping can be found on pp. 22–28 of Ref. 4.
Human Visual Responses Human vision response is not uniform with respect to spatial frequency, luminance level, object orientation, visible spectrum wavelength, and so on. The behavior of the human visual system can be studied by employing the Weber and Fechner laws. Weber and Fechner Laws. Weber proposed a general law of sensory thresholds that the JND between one stimulus and another is a constant fraction of the first. This fraction ω, called the Weber fraction, is constant within any one sense modality for a given set of observing conditions (Ref. 3, pp. 489–490). The size of the JND, measured in physical quantities (e.g., radiance or luminance) for a given attribute, depends on the magnitude of the stimuli involved. Generally, the greater the magnitude of the stimuli, the greater the size of the JND. The mathematical formulation of this empirical fact is given in Eq. (23):
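A standard statement of this relation, using the internal-noise constant L0 defined in the next sentence, is

\[
\frac{\Delta L}{L + L_0} = \omega .
\]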
where ΔL is the increment that must be added to the magnitude L of the given stimulus to be just noticeable. The quantity L0 is a constant that can be viewed as internal noise. Fechner reasoned that the JNDs must represent a change in the sensation measure Ψ. Thus, he postulated that JNDs are equal increments in sensation magnitude at all stimulus magnitudes:
where ω is a constant specifying an appropriate unit of the sensation-magnitude increment. By treating ΔΨ and ΔL as differentials, dΨ and dL, respectively, Eq. (24) can be integrated:
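The result is the logarithmic relation of Eq. (25); with Ψ denoting the sensation magnitude (the symbol is an assumption here), it can be written as

\[
\Psi(L) = \mu \,\ln L + \nu .
\]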
Equation (25) relates the sensation measure Ψ(L) by a logarithmic function to the stimulus L, measured in terms of a physical unit. The quantities µ and ν are constants. The relation expressed by Eq. (25) is known as the Fechner law (Ref. 3, pp. 489–490). The implication of Eq. (25) is that a perceptual magnitude may be determined by summing JNDs. The Fechner law of Eq. (25) suggests that quantization levels should be spaced logarithmically in reflectance (i.e., uniformly in the density domain). Weber's and Fechner's laws have been used to form scales of color differences. Quantitative studies address thresholds for a single attribute of color (e.g., brightness, hue, and colorfulness). Important ones are discussed next.
The Weber fraction ΔL/L as a function of the luminance indicates that the Weber fraction is not constant over the range of luminances studied. Because the threshold Weber fraction is the reciprocal of sensitivity, the sensitivity drops quickly as the luminance becomes weak. Similar determinations have been made of the threshold difference in wavelength necessary to see a hue difference. Results indicate that the sensitivity to hue is much lower at both ends of the visible spectrum. Approximately, the human visual system can distinguish between colors with dominant wavelengths differing by about 1 nm in the blue–yellow region, but near the spectrum extremes a 10 nm separation is required (8). The threshold measurements of the purity show that the purity threshold varies considerably with wavelength. A very marked minimum occurs at about 570 nm; the number of JND steps increases rapidly on either side of this wavelength (9). The visual contrast sensitivity with respect to the spatial frequency is known as the contrast sensitivity function (CSF). The CSF is very important in color image analysis. The image contrast is the ratio of the local intensity to the average image intensity (10). Visual contrast sensitivity describes the signal properties of the visual system near threshold contrast. For sinusoidal gratings, contrast C is defined by the Michelson contrast as (11)
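The Michelson definition takes the standard form, with ΔL the luminance amplitude and L̄ the average luminance as defined next:

\[
C = \frac{L_{\max}-L_{\min}}{L_{\max}+L_{\min}} = \frac{\Delta L}{\bar{L}} .
\]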
where Lmax and Lmin are the maximum and minimum luminances, respectively, ΔL is the luminance amplitude, and L̄ is the average luminance. Contrast sensitivity is the reciprocal of the contrast threshold. A CSF describes contrast sensitivity for sinusoidal gratings as a function of spatial frequency expressed in cycles per degree of the visual angle. Typical human contrast sensitivity curves are given in Fig. 3 (Ref. 3); the horizontal axis is spatial frequency and the vertical axis is contrast sensitivity, namely, log(1/C) = −log C, where C is the contrast of the pattern at the detection threshold. The CSF of Fig. 3 shows two features. First, there is a fall-off in sensitivity as the spatial frequency of the test pattern increases, indicating that the visual pathways are insensitive to high-spatial-frequency targets. In other words, human vision has a lowpass filtering characteristic. Second, there is no improvement of sensitivity at low spatial frequencies; there is even a loss of contrast sensitivity at lower spatial frequencies for the luminance CSF. This behavior is dependent on the background intensity; it is more pronounced at higher background intensities. Under scotopic conditions of low illumination, the luminance CSF curve is lowpass and peaks near 1 cpd. On intense photopic backgrounds, CSF curves are bandpass and peak near 8 cpd. Chromatic CSFs show a lowpass characteristic and have much lower cutoff frequencies than the luminance channel (see Fig. 3), where the blue–yellow opponent color pair has the lowest cutoff frequency. This means that the human visual system is much more sensitive to fine spatial changes in luminance than it is to fine spatial changes in chrominance. Images with spatial degradations in the luminance will often be perceived as blurry or not sharp, while similar degradations in the chrominance will usually not be noticed. This property has been utilized in practice for transmitting color signals in that fewer bits and lower resolution are used for chrominance channels to reduce the bandwidth (12,13). It is also being used in building the color HVS. For the oblique effect, the visual acuity is better for gratings oriented at 0° and 90° and least sensitive at 45°. This property has been used in printing for designing halftone screens (14). Moreover, the CSF has been used to estimate the resolution and tone level of imaging devices. An upper limit of the estimate is about 400 dpi with a tone step of 200 levels. In many situations, much lower resolutions and tone levels still give good image quality. The high-frequency cutoff and exponential relationship with spatial frequency form the basis for various HVS models.
Fig. 3. Spatial contrast sensitivity functions for luminance and chromatic contrast.
Color Appearance Model A color appearance model is any model that includes predictors of at least the relative color appearance attributes of lightness, chroma, and hue. To have reasonable predictors of these attributes, the model must include at least some form of a chromatic-adaptation transform. All chromatic-adaptation models have their roots in the von Kries hypothesis. The modern interpretation of the von Kries model is that each cone has an independent gain control, as expressed in Eq. (27):
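Using the quantities defined in the following sentence, this gain control is commonly written as

\[
L_a = \frac{L}{L_{\max}}, \qquad M_a = \frac{M}{M_{\max}}, \qquad S_a = \frac{S}{S_{\max}} .
\]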
where L, M, and S are the initial cone responses; La, Ma, and Sa are the cone signals after adaptation; and Lmax, Mmax, and Smax are the cone responses for the scene white or maximum stimulus. This definition enables the uniform color spaces CIELAB and CIELUV to be considered as color appearance models (Ref. 1, p. 217). More complex models that include predictors of brightness, colorfulness, and various luminance-dependent effects, mentioned in the color appearance phenomena, are the Nayatani (15), Hunt (16), RLAB (17), ATD (18), and LLAB (19) models. Nayatani's and Hunt's models, evolved from many decades of studies, are capable of predicting the full array of color appearance parameters. They are complete and complex and are not suitable to discuss in this short article. ATD, developed by Guth, is a different type of model from the others. It is aimed at describing color vision. RLAB and LLAB are similar in structure. They are extensions of CIELAB and are relatively simple. We present a simple version of the CIE 1997 Interim Color Appearance Model CIECAM97s; the descriptions of other models can be found in Refs. 15 to 19 and in the book Color Appearance Models by Fairchild (1). The formulation of CIECAM97s builds upon the work of many researchers in the field of color appearance (Ref. 1, pp. 384–389). It begins with a conversion from CIEXYZ to spectrally sharpened cone
responses RGB via the following matrix multiplication for both sample and white point:
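CIECAM97s adopts the Bradford-type sharpened primaries; the commonly published matrix, assumed here to be the one intended, is applied to the Y-normalized tristimulus values:

\[
\begin{pmatrix} R \\ G \\ B \end{pmatrix} =
\begin{pmatrix}
 0.8951 &  0.2664 & -0.1614 \\
-0.7502 &  1.7135 &  0.0367 \\
 0.0389 & -0.0685 &  1.0296
\end{pmatrix}
\begin{pmatrix} X/Y \\ Y/Y \\ Z/Y \end{pmatrix}.
\]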
The next step is the chromatic-adaptation transform, which is a modified von Kries adaptation with an exponential nonlinearity added to the short-wavelength blue channel:
where the ε factor is used to specify the degree of adaptation. The value of ε is set equal to 1.0 for complete adaptation and 0.0 for no adaptation; it takes intermediate values for various degrees of incomplete chromatic adaptation. If B happens to be negative, then Bc is also set to be negative. Similar transformations are made for the source white point. Various factors, including a background induction factor g, the background and chromatic brightness induction factors, Nbb and Ncb, and the base exponential nonlinearity z, are computed for later calculations.
The postadaptation signals for both sample and source white are transformed from the sharpened cone responses to the Hunt–Pointer–Estevez cone responses; then nonlinear response compressions are performed by using Eqs. (39) to (42).
The appearance correlates, a, b, and hue angle, h, are calculated as follows:
The achromatic response A is calculated by using Eq. (46) for both sample and white. The lightness J is calculated from the achromatic signals of the sample and white using Eq. (47):
where σ is the constant for the impact of surround; σ = 0.69 for average surround, 0.59 for dim surround, 0.525 for dark surround, and 0.41 for cut-sheet transparencies. Other appearance correlates, such as brightness, saturation, chroma, and colorfulness, can also be calculated.
Color Mixing Models Roughly speaking, two types of color mixing theories exist. The first is formulated for the halftone process, such as the Neugebauer equations, Murray–Davies equation, and Yule–Nielsen model. The other type is based on the light absorption, such as the Beer–Lambert–Bouguer law and Kubelka–Munk theories, for subtractive color mixing.
Neugebauer Equations. In mixing three subtractive primaries, Neugebauer recognized that there are eight dominant colors—namely, white, cyan, magenta, yellow, red, green, blue, and black—for constituting any color halftone prints. A given color is perceived as the integration of these eight Neugebauer dominant colors. The incident light reflected by one of the eight colors is equal to the reflectance of that color multiplied by its area coverage. The total reflectance is the sum of all eight colors weighed by the corresponding area coverages (20,21). Therefore, it is based on broadband color mixing. An example of a three-primary Neugebauer equation is
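For one broadband primary (R, G, or B), Eq. (48) takes the form

\[
P = A_w P_w + A_c P_c + A_m P_m + A_y P_y + A_r P_r + A_g P_g + A_b P_b + A_{3p} P_{3p},
\]

using the area and reflectance symbols defined in the next sentence.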
where P is the total reflectance of the R, G, or B primary. The spectral powers Pw , Pc , Pm , Py , Pr , Pg , Pb , and P3p are the reflectance of the paper, cyan, magenta, yellow, red, green, blue, and black, respectively, measured with a given primary. Neugebauer employed Demichel’s dot overlap model for the area of each dominant color. The areas are computed from the halftone dot areas of three primaries, ac , am , and ay , to give the areas of all eight dominant colors Aw , Ac , Am , Ay , Ar , Ag , Ab , and A3p (Ref. 4, pp. 34–42).
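Demichel's model assumes that the three halftone screens overlap in a statistically independent (random) fashion, which gives the fractional areas

\[
\begin{aligned}
A_w &= (1-a_c)(1-a_m)(1-a_y), & A_c &= a_c(1-a_m)(1-a_y),\\
A_m &= (1-a_c)\,a_m(1-a_y),   & A_y &= (1-a_c)(1-a_m)\,a_y,\\
A_r &= (1-a_c)\,a_m a_y,      & A_g &= a_c(1-a_m)\,a_y,\\
A_b &= a_c a_m(1-a_y),        & A_{3p} &= a_c a_m a_y .
\end{aligned}
\]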
Neugebauer’s model provides a general description of the halftone color mixing; it predicts the resulting red, green, and blue reflectances or tristimulus values XYZ from given dot areas in the print. In practical applications, one wants the cmy (or cmyk) dot areas for a given color. This requires solving the inverted Neugebauer equations; that is, specifing the dot areas of individual inks required to produce the desired RGB or XYZ values. The inversion is not trivial because the Neugebauer equations are nonlinear. The exact analytical solutions to the Neugebauer equations have not yet solved. Numerical solutions by computer have been reported (22,23). Pollak has worked out a partial solution by making assumptions and approximations (24,25,26). The three-primary expression of Eq. (48) can readily be expanded to four primaries by employing the four-primary fractional area expressions given by Hardy and Wurzburg (27):
where
For a four-primary system, the analytic inversion is even more formidable because there are four unknowns (ac, am, ay, and ak) and only three measurable quantities (either RGB or XYZ). The practical approach is to constrain the black printer and then seek a cmy solution that combines with black to produce a target color. Murray–Davies Equation. The Murray–Davies equation derives the reflectance via the absorption of halftone dots. In a unit area, if the solid ink reflectance is Ps, then the absorption by the halftone dots is (1 − Ps) weighted by the dot area coverage A. The reflectance P of a unit halftone area is then the white reflectance (1, or the reflectance of the medium Pw) minus the dot absorptance (28).
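One standard statement of the Murray–Davies relation, with reflectances normalized so that the unprinted medium has reflectance 1, is

$$ P = 1 - A\,(1 - P_s) $$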
Equation (52) is usually expressed in the density domain by using the density-reflectance relationship of D = −log P to give
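With Ds = −log Ps the solid-ink density, a standard statement of the resulting density-domain form is

$$ D = -\log_{10}\!\left[\,1 - A\left(1 - 10^{-D_s}\right)\right] $$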
The Murray–Davies equation is often used to determine the area coverage by measuring the reflectance of the solid and halftone step wedges. Yule–Nielsen Model. From the study of the halftone process, Yule and Nielsen pointed out that light does not emerge from the paper at the point where it entered. They estimated that between one-fourth and one-half of the light that enters through a white area will emerge through a colored area, and vice versa. Based on this observation, Yule and Nielsen took light penetration and scattering into consideration; a fraction of light, A (1 − T i ), is absorbed by the ink film on its way into the substrate, where T i is the transmittance of the ink film. After passing through the ink, the remaining light, 1 − A (1 − T i ), in the substrate is scattered and
reflected. It is attenuated by a factor Pw when the light reaches the air-ink-paper interface. On its way out, this fraction of light is absorbed again by the ink film; the second absorption introduces a power of 2 to the equation. Finally, the light is corrected by the surface reflection rs at the air-ink interface (29). Summing all these effects together, we obtain
The Yule–Nielsen model is a simplified version of a complex phenomenon involving light, colorants, and substrate. For example, the model does not take into account the facts that, in real systems, the light is not completely diffused by the paper and is internally reflected many times within the substrate and ink film. To compensate for these deficiencies, Yule and Nielsen made the square power of Eq. (54) into a variable, n, for fitting the data. With this modification, Eq. (54) becomes
Assuming that the surface reflection rs can be neglected and the reflectance is measured relative to the substrate, we obtain the expression
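A standard statement of the resulting Yule–Nielsen expression is

$$ D = -\,n\,\log_{10}\!\left[\,1 - A\left(1 - 10^{-D_s/n}\right)\right] $$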
in the form of the optical density D. Note the similarity of Eq. (56) to the Murray–Davies equation [Eq. (53)]. Equation (56) is very popular and is often used to determine the ink area coverage or dot gain. Generally, the effective transmittance, Ti(λ), of mixed inks is derived from a nonlinear combination of the primary ink transmittances measured in solid area coverage. Clapper–Yule Model. After formulating the Yule–Nielsen model, Yule then worked with Clapper to develop an accurate account of the halftone process from a theoretical analysis of multiple scattering, internal reflections, and ink transmissions (30,31). In this model, light is reflected many times from the surface, within the ink and substrate, and by the background. The total reflected light is the sum of the light fractions that emerge after each internal reflection cycle.
where ks = the specular component of the surface reflection, ρr = the fraction of light that is reflected at the bottom of the substrate, and ρe = the fraction of light that emerges at the top of the substrate. The Murray–Davies, Yule–Nielsen, and Clapper–Yule halftone color mixing models show a progressively more thorough consideration of the halftone printing process with respect to light absorption, transmission, and scattering induced by the ink-paper interaction. Beer–Lambert–Bouguer Law. The Beer–Lambert–Bouguer law is based on light absorption; thus it is a subtractive color mixing theory. The Beer–Lambert–Bouguer law relates the light intensity to the quantity of the absorbent based on the proportionality and additivity of the colorant absorptivity (Ref. 4, pp. 47–48). The intensity, I, of a monochromatic light attenuated by the absorption of the medium is proportional to its
intensity:
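In differential form (a standard statement of the law),

$$ dI(\lambda) = -K(\lambda)\,q\,I(\lambda)\,dw $$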
where K(λ) is the absorption coefficient of the absorbent at the wavelength λ, w is the length of the light path traversing the absorbent, and q is the concentration of the absorbent. Integrating this differential equation over the entire thickness of the medium gives
or
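Standard integrated forms of the law, stated here for reference, are

$$ I(\lambda) = I_o(\lambda)\,e^{-K(\lambda)\,q\,w}, \qquad I(\lambda) = I_o(\lambda)\,10^{-\xi(\lambda)}, \qquad \xi(\lambda) \propto K(\lambda)\,q\,w, $$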
where Io(λ) is the monochromatic light intensity before passing through the absorbent, I(λ) is the light intensity after passing through the absorbent, and ξ(λ) is the absorbance (also referred to as the optical density) of the absorbent. Assuming that the path length is a constant and individual pixels have the same thickness, we can group the absorption coefficient and path length together as a compounded coefficient, Kd.
For mixed color films, the absorbance, ξm (λ), is obtained by applying the proportionality and additivity rules.
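That is, assuming additivity of the individual absorbances,

$$ \xi_m(\lambda) = \sum_i K_{di}(\lambda)\, q_i, $$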
where Kdi is the absorption coefficient of the ith primary ink and qi is the concentration of the ith primary ink.
Kubelka–Munk Theory. Kubelka and Munk formulated a theory based on two light channels traveling in opposite directions for modeling translucent and opaque media. The light is absorbed and scattered in only two directions, up and down. A background is present at the bottom of the medium to provide the upward light reflection. The derivation of the Kubelka–Munk formula can be found in many publications (4,32,33,34). The general expression of the Kubelka–Munk equation is
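A common hyperbolic statement of this two-constant form, given here for reference for a layer of thickness w over a background of reflectance Pg, is

$$ P = \frac{1 - P_g\left[a - b\coth(bSw)\right]}{a - P_g + b\coth(bSw)}, \qquad a = 1 + \frac{K}{S}, \qquad b = \sqrt{a^2 - 1}, $$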
where K and S are the absorption and scattering coefficients of the colorant layer, w is the film thickness, and Pg is the reflectance of the background.
This expression is called the two-constant Kubelka–Munk equation, because the two constants, K and S, are determined separately. It indicates that the reflectance of a translucent film is a function of the absorption coefficient, the scattering coefficient, the film thickness, and the reflectance of the background. Equation (62) is the basic form of the Kubelka–Munk equation and the foundation of various Kubelka–Munk formulas (33). To employ Eq. (62), one needs to know Pg and w in addition to the K and S constants.
Circular Dot Overlap Model The circular dot overlap model is not the only dot overlap model. In fact, the Neugebauer equations utilize Demichel’s dot overlap model, which is obtained from the joint probability of the overlapped areas. Demichel’s model is used to account for color mixing, while the circular dot overlap model is used to correct the tone reproduction due to oversized pixels. The circular dot model assumes an ideal round shape with equal size and a complete and uniform absorptance within the dot boundary. Pixel overlapped areas are physically computed by utilizing geometric relationships, and the resulting density is a logical OR operation. First, the size of an overlapped area is dependent on the pixel position and pixel size. The vertical (or horizontal) overlapping is different from the diagonal overlapping. For dots of diameter √2τ, there is no overlap in the diagonal position. The overlap in the vertical or horizontal direction can be calculated by using the geometric relationship; for a square grid, the overlapped area is enclosed in a square of size r², where r is the radius of the dot.
For dot sizes greater than √2τ, one can derive relationships such as the diagonally overlapped area Ad, the vertically (or horizontally) overlapped area Av, and the excess area Ae over the digital square pixel.
Monochromatic Dot Overlap Model. Pappas and coworkers expanded the circular dot model to account for the amount of dot overlap at each pixel in terms of three area parameters α, β, and γ, as shown in Fig. 4 (35).
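In terms of the quantities defined below, the effective absorptance of each printed pixel takes a form along the lines of (a hedged sketch rather than a verbatim reproduction of the model in Ref. 35):

$$ \tilde p(i,j) = \begin{cases} 1, & p(i,j) = 1,\\ \kappa_1\alpha + \kappa_2\beta - \kappa_3\gamma, & p(i,j) = 0, \end{cases} $$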
where the window W(i, j) consists of the pixel of interest p(i, j) and its eight neighbors, κ1 is the number of horizontally and vertically neighboring dots that are black, κ2 is the number of diagonally neighboring dots that are black and not adjacent to any horizontally or vertically neighboring black dot, and κ3 is the number of pairs of neighboring black dots in which one is a horizontal neighbor and the other is a vertical neighbor. These area parameters α, β, and γ are related to the overlapped areas Ae , Ad , and Av as follows:
The α, β, and γ can be solved from this set of simultaneous equations. The solutions are as follows:
where ρ is the ratio of the actual dot radius r to the minimum overlap radius τ/√2. Since the dot diameter is between √2τ and 2τ, the ratio ρ is between 1 and √2, such that the black dots are large enough to cover a τ × τ square, but not too large to cover a white pixel separating two black pixels in the vertical or horizontal position. The main application of this dot overlap model is in digital halftoning. Halftone patterns for the different levels of the halftone dot are analyzed with this circular dot model to predict the actual reflectance of the output patterns (35). Threshold levels of the halftone dot are then adjusted so that the reflectance of the dot patterns matches the input reflectance values. This correction produces patterns with fewer isolated pixels, since the calculation gives a more accurate estimate of the actual reflectance achieved on the paper.
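As an illustration, a minimal Python sketch of this overlap-corrected absorptance estimate is given below. The function name and the κ-counting details follow the verbal description above and are illustrative assumptions, not code taken from Ref. 35.

import numpy as np

def overlap_corrected_absorptance(dots, alpha, beta, gamma):
    # dots: binary halftone pattern (1 = black dot); alpha, beta, gamma are the
    # overlap-area parameters of Fig. 4. Returns an estimated absorptance map.
    dots = np.asarray(dots).astype(bool)
    rows, cols = dots.shape
    out = np.zeros((rows, cols), dtype=float)
    padded = np.pad(dots, 1, mode="constant", constant_values=False)
    for i in range(rows):
        for j in range(cols):
            if dots[i, j]:
                out[i, j] = 1.0          # a black dot covers its own pixel
                continue
            win = padded[i:i + 3, j:j + 3]
            n, s = win[0, 1], win[2, 1]
            w, e = win[1, 0], win[1, 2]
            nw, ne, sw, se = win[0, 0], win[0, 2], win[2, 0], win[2, 2]
            k1 = int(n) + int(s) + int(w) + int(e)
            # diagonal black dots not adjacent to a horizontal/vertical black dot
            k2 = int(nw and not (n or w)) + int(ne and not (n or e)) + \
                 int(sw and not (s or w)) + int(se and not (s or e))
            # adjacent horizontal-vertical pairs of black neighbors
            k3 = int(n and w) + int(n and e) + int(s and w) + int(s and e)
            out[i, j] = k1 * alpha + k2 * beta - k3 * gamma
    return out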
Fig. 4. Circular dot overlap model with the definition of three overlapped areas, α, β, and γ.
Fig. 5. Circular dot overlap model for color images and the definition of four overlapped segments, ε, δ, ζ, and η.
Color Dot Overlap Model. Pappas expanded the monochromatic dot overlap model to color (36). The color dot overlap model accounts for overlap between neighboring dots of the same and different colors. The resulting color is the weighted average of the segment colors, with the weight being the area of each segment. The overlapping color segments are shown in Fig. 5; the color of each segment depends on the absorption properties of the overlapped inks. For three-color systems such as RGB and cmy, each segment can have one of the possible combinations of primary colors (white, cyan, magenta, yellow, red, green, blue, and black). For four-color systems, there are 2⁴ = 16 dominant colors. This approach is basically the Neugebauer equations using geometrically computed areas instead of Demichel’s dot overlap model. The area of each segment can be calculated in terms of the area parameters α, β, and γ given in Fig. 4.
In Fig. 5, there are four ε, four δ, four ζ, and one η; these multiple areas can have different colors. This makes the computation of the color in each segment much more complex. Measurement-Based Tone Correction. The circular dot overlap model is an ideal case. It provides a good first-order approximation for the behavior of printers. In reality, it does not adequately account for all printer distortions. Significant discrepancies often exist between the predictions of the model and measured
values (35). In view of this problem, a traditional approach to the correction of printer distortions is the measurement-based technique. Basically, the measurement-based technique is used to obtain the relationship between the input tone level and measured output reflectance or optical density for a specific halftone algorithm, printer, and substrate combination. The experimental procedure consists of making prints having patches that are halftoned by the algorithm of interest. Each patch corresponds to an input tone level. The number of selected tone levels should be sufficient to cover the whole toner dynamic range with fine step size, but it is not necessary to print all available input levels, such as 256 levels for an 8-bit system, because most printers and measuring instruments are not able to resolve such a fine step. The reflectance or density values of these patches are measured to give the tone-response curve of the halftone algorithm used in conjunction with the printing device. Correcting for the printer distortions consists of inverting this curve to form a tone-correction curve. This tone-correction curve is applied to the image data prior to halftoning and printing as a precompensation.
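The sketch below illustrates this measurement-based correction in Python. The patch levels and reflectance readings are hypothetical example data, and the simple linear interpolation stands in for whatever curve fitting a real characterization would use.

import numpy as np

# Hypothetical measurement: requested input tone levels of the printed patches
# and the reflectances measured from them (darker print = lower reflectance).
requested = np.array([0, 32, 64, 96, 128, 160, 192, 224, 255], dtype=float)
reflectance = np.array([0.90, 0.78, 0.66, 0.55, 0.44, 0.34, 0.25, 0.17, 0.10])

# Map measured reflectance to an "effective tone" on the same 0-255 scale.
effective = 255.0 * (reflectance[0] - reflectance) / (reflectance[0] - reflectance[-1])

# Invert the tone-response curve: for every desired effective tone, find the
# request that produces it. np.interp needs increasing x values, which the
# effective-tone sequence is when the printer response is monotonic.
desired = np.arange(256, dtype=float)
correction_lut = np.rint(np.interp(desired, effective, requested)).astype(np.uint8)

def precompensate(image):
    # Apply the tone-correction curve to 8-bit image data before halftoning.
    return correction_lut[image]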
Device-Independent Color Imaging In a modern office environment, various imaging devices, such as scanners, personal computers, workstations, copiers, and printers, are connected in an open environment. For example, a system may have several color printers using different printing technologies, such as dot matrix, ink jet, and xerography. Moreover, it does not exclude devices of the same technology from different manufacturers. In view of these differences, the complexity of an office information system can be extremely high. The main concern in the color image analysis of electronic information at the system level is color consistency; the appearance of a document should stay alike as the image moves across various devices and goes through many color transforms. However, the degree of color consistency depends on the application; desktop printing for casual users may differ significantly from short-run printing of high-quality commercial brochures. Hunt pointed out that there are several levels of color matching: spectral, colorimetric, exact, equivalent, corresponding, and preferred color matching (37). Spectral color reproduction matches the spectral reflectance curves of the original and reproduced colors. At this level, the original and reproduction look the same under any illuminant; there is no metamerism. A pair of stimuli are metamers if they match under one illuminant but mismatch under a different illuminant. The colorimetric reproduction is a metameric match that is characterized by the original and the reproduction colors having the same CIE chromaticities and relative luminances. A reproduction of a color in an image is exact if its chromaticity, relative luminance, and absolute luminance are the same as those in the original scene. The equivalent color reproduction requires that the chromaticities, relative luminances, and absolute luminances of colors have the same appearance as the colors in the original scene when seen in the image viewing conditions. The corresponding reproduction matches the chromaticities and relative luminances of the colors such that they have the same appearance as the colors in the original would have had if they had been illuminated to produce the same average absolute luminance level. The preferred color reproduction is defined as a reproduction in which the colors depart from equality of appearance to those in the original, either absolutely or relative to white, in order to give a more pleasing result to the viewer. Ideally, the appearance of a reproduction should be equal to the appearance of the original or to the customer’s preference. In many cases, however, designers and implementers of color management systems do not know what users really want. For an open environment with such a diverse user demography, it is very difficult to satisfy everybody. Usually, this demand is partially met by an interactive system, adjustable knobs, or some kind of soft proofing system. This difficulty is compounded by the fact that modeling the human visual system is still an area of active research and debate. Even color consistency based on colorimetric equivalence at the system level is not a simple task because each device has its unique characteristics. They differ in imaging technology, color encoding, color description, color gamut, stability, and applications. Because not all scanners use the same lamp and filters
and not all printers use the same inks, the same RGB image file will look different when it is displayed on different monitors. Likewise, the same cmyk image file will look different when it is rendered by different printers. Moreover, there are even bigger differences in color appearance when the image is moved from one medium to another, such as from monitor to print. Because of these differences, efforts have been made to establish a color management system (CMS) that will provide color consistency. An important event was the formation of the International Color Consortium (ICC) for promoting the usage of color images by increasing the interoperability among various applications (image processing, desktop publishing, etc.), different computer platforms, and operating systems. This organization established the architectures and standards for color transforms via ICC profiles and color-management modules (CMM). A profile contains the data for performing a color transform, and a CMM is an engine for actually processing the image through profiles. The central theme in color consistency at the system level is the requirement of a device-independent color description and properly characterized and calibrated color imaging devices. A proper color characterization requires image test standards, techniques, tools, instruments, and controlled viewing conditions. The main features to consider for color consistency are image structure, white point equivalence, color gamut, color transformation, measurement geometry, media difference, printer stability, and registration (Ref. 4, pp. 261–271). Image structure refers to the way an image is rendered by a given device. The primary factor affecting image texture is the quantization; it is particularly apparent if the output is a bilevel device. Scanners and monitors that produce image elements at 8-bit depth can be considered contone devices. For printers, the dye sublimation printer is capable of printing contone; other types of printers, such as xerographic and ink jet printers, have to rely on halftoning to simulate the contone appearance. For bilevel devices, halftoning is perhaps the single most important factor in determining the image structure. It affects the tone reproduction curves, contouring, banding, graininess, moiré, sharpness, and resolution of fine details. Because of these dependencies, color characterization has to be done for each halftone screen. A new characterization is needed when the halftone screen is changed. Various white points are used for different devices; the white point for the scanner is very different from the one used in the CRT monitor. Moreover, the viewing condition of a hard copy is usually different from those used in scanning and displaying. The operator or the color management system must convert the chromaticity of the white point in the original to that of the reproduction. From the definition of the tristimulus values, we know that the exact conversion requires substituting one illuminant spectrum for another, weighting by the color matching functions and the object spectrum, and then integrating over the whole visible range. This illuminant replacement process is not a linear correspondence. The change of viewing conditions has been addressed by the chromatic-adaptation models (Ref. 1, pp. 191–214). The color gamut indicates the range of colors an imaging device can render; it is the volume enclosed by the most saturated primary colors and their two-color mixtures in a 3D color space.
The gamut size and shape are governed mainly by the colorants, printing technology, and media. Good colorants can be found for use in each technology, but they tend to vary from technology to technology. As a result, different color devices have different gamut sizes such that a color monitor is different from a color printer and a xerographic printer is different from an ink jet printer. Color gamut mismatches of input, display, and output devices are the most difficult problem in maintaining color consistency. Numerous approaches for gamut mapping have been proposed; however, this is still an active research area. Color correction for ink-substrate combinations is only a partial solution. The relationship between the original and reproduction should be taken into account. In fact, the optimal relationship is not usually known; Hunt suggests having six different color matching levels (37). In most cases, the optimum relationship is a compromise. The compromise depends on the user’s preference, on the colors in the original (e.g., monitor), and on those available in the reproduction (e.g., printer). The optimum transform for photographic transparency originals that contain mainly light colors (the high-key image) is different from the optimum for an original that contains mainly dark colors (the low-key image).
Graphic images probably require a different transform than scanned photographs (38). Color gamut mapping and preferred color transform are a part of the color appearance task. Most colorimeters and spectrophotometers have measuring geometries that are deliberately chosen either to eliminate or totally include the first surface reflections. Typical viewing conditions, having light reflection, transmission, and scattering, lie somewhere between these two extremes. Colors that look alike in a practical viewing condition may measure differently, and vice versa. This phenomenon is strongly affected by the surface characteristics of the images and is most significant for dark colors. A practical solution to this problem is being proposed by using telecolorimetry that puts the measuring device in the same position as the observer (39). As mentioned previously, CIE systems have limited capability to account for changes in appearance that arise from a change of the white point or substrate. Under different media, the isomeric samples may look different. Even within a given medium type, such as paper substrate, the difference can come from the whiteness of the paper, fluorescence, gloss, surface smoothness, and so on. This problem is also part of the color appearance model. Color transformation is the central part of the color consistency at the system level. This is because different devices use different color descriptions. Various transformation methods, such as the color mixing model, regression, three-dimensional interpolation, artificial neural network, and fuzzy logic, have been developed. Major tradeoffs among these techniques are conversion accuracy, processing speed, and computation cost. Color printers are known for their instability; it is a rather common experience that the first print may be quite different from the hundredth one. A study indicates that dye sublimation, ink jet, and thermal transfer techniques are more stable than lithography (40). It is an encouraging trend that new printing technologies are more stable than the conventional methods. A part of the device stability problems, such as drifting, may be corrected by the device calibration to bring the device back to the nominal state. Color printing requires three (cmy) or four (cmyk) separations. They are printed by multiple passes through the printing mechanism or by multiple separated operations in one pass. These operations require mechanical control of the pixel registration. The registration is very important to the color image quality. But it is more a mechanical problem than an image processing problem. Thus, it is beyond the scope of this article. As stated at the beginning of this article, the color image has two components: spatial patterns and colorimetric properties. Factors affecting the color matching are the color transform, color gamut mapping, viewing conditions, and media. Factors affecting the image structure are the sampling, tone level, compression, and halftoning. The surface characteristics are primarily affected by the media; various substrates may be encountered during image transmission and reproduction. These factors are not independent, they may interact to cause the appearance change. 
For example, inadequate sampling and tone level not only give poor image structures but may also cause color shifts; a different halftone screen may cause tone changes and color shifts; an improper use of colors may create textures; images with sharp edges appear higher in saturation and have more contrast than images with soft edges (38); and finer screen rulings will give higher resolution and cleaner highlight colors (41). With these complexities and difficulties, a systematically formulated color imaging model that incorporates the human visual model, color mixing model, color appearance model, and dot overlap model is needed to improve and analyze color images. Human Visual Model. The CSF plays an important role in determining image resolution, image quality improvement, halftoning, and compression. Therefore, many formulas for the CSF have been derived from various experimental results. Here we introduce a popular one proposed by Mannos and Sakrison (42).
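A common parametrization of this CSF family, given here as a reference form consistent with the constants quoted below, is

$$ H(f_r) = a\,(b + c f_r)\,\exp\!\left[-(c f_r)^{d}\right], \qquad H(f_r) = 1.0 \ \ \text{for } f_r < f_{\max}, $$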
where the constants a, b, c, and d are calculated by regression fits of horizontal and vertical threshold modulation data to be 2.2, 0.192, 0.114, and 1.1, respectively. The parameter fr is the normalized radial spatial frequency in cycles/degree of subtended visual angle, and fmax is the peak frequency of the bandpass CSF curve. The second expression in Eq. (78) changes the bandpass CSF to a lowpass CSF. To account for angular variations in human visual sensitivity, the normalized radial spatial frequency is computed from the actual radial spatial frequency using an angular-dependent scale function (43,44,45):
where
and (θ) is given by
where φ is a symmetry parameter with a value of 0.7 chosen from empirical data, and
This angular normalization produces a 70% bandwidth reduction at 45◦ . The last step is to convert from cycles/degree to cycles/mm on the document for further spatial imaging processing:
where dv is the viewing distance in millimeters, f i is the discrete sampling frequency in cycles per millimeter given by
and τ is the sample spacing on the document. Color Visual Models. In general, color visual models are extensions of luminance models by utilizing the human visual difference in the luminance and chrominance. We can separate the luminance and chrominance channels by using the opponent color description. The luminance part of the model is the same as before. For the chrominance part, we use the fact that the contrast sensitivity to spatial variations in chrominance falls off faster than the luminance with respect to spatial frequency (see Fig. 3) to find a new set of constants for the chromatic CSF. Both responses are lowpass; however, the luminance response is reduced at 45◦ , 135◦ , 225◦ , and 315◦ for the orientation sensitivity. This will place more luminance error along diagonals in the frequency domain. The chrominance response has a narrower bandwidth than the luminance response. Using the narrower chrominance response as opposed to identical responses for both luminance and chrominance will allow more lower-frequency chromatic error, which will not be perceived by human viewers. We can further allow the adjustment of the relative weight between the luminance and chrominance responses. This is achieved by multiplying the luminance response with a weighting factor. As the weighting factor increases, more error will be forced into the chrominance components.
Many human visual models have been proposed that attempt to capture the central features of human perception. The simplest HVS model includes just a visual filter that implements one of the CSFs mentioned previously. Better approaches include a module in front of the filter to account for nonlinearities such as Weber’s law. The filtered signal is pooled together in a single channel of information. This structure is called a single-channel model. Because digital image quality is a very complex phenomenon, it requires inputs of all types of distortions. In view of the image complexity, multiple-channel approaches have been developed to include various inputs. This is achieved by putting a bank of filters before the nonlinearity in a systematic way; each filter addresses one aspect of the whole image quality. This approach is called a multichannel model. S-CIELAB. Zhang and Wandell extended CIELAB to account for the spatial as well as color errors in reproduction of the digital color image (46). They called this new model spatial CIELAB, or S-CIELAB. The design goal is to apply spatial filtering to the color image in a small-field or fine-patterned area, but revert to conventional CIELAB in a large uniform area. The procedure for computing S-CIELAB is as follows: (1) Input image data are transformed into an opponent-colors space. This color transform converts the input image, specified in terms of the CIEXYZ tristimulus values, into three opponent-colors planes that represent luminance, red-green, and blue-yellow components. (2) Each opponent-colors plane is convolved with a two-dimensional kernel whose shape is determined by the visual spatial sensitivity to that color dimension; the area under each of these kernels integrates to one. Lowpass filtering is used to simulate the spatial blurring by the human visual system. The computation is based on the concept of pattern-color separability. The parameters for the color transform and spatial filters were estimated from psychophysical measurements. (3) The filtered representation is converted back to CIEXYZ space, then to the CIELAB representation. The resulting representation includes both spatial filtering and CIELAB processing. (4) The difference between the S-CIELAB representations of the original and its reproduction is the measure of the reproduction error. The difference is expressed by a quantity ΔEs, which is computed exactly as ΔE*ab is in conventional CIELAB. S-CIELAB reflects both spatial and color sensitivity. This model is a color-texture metric and a digital imaging model. The metric has been applied to printed halftone images. Results indicate that S-CIELAB correlates with perceptual data better than standard CIELAB (47). The metric can also be used to improve multilevel halftone images (48). Color Imaging Process. A complete color imaging process that incorporates various visual and physical models may be described as follows: An image with a specific sampling rate, quantization level, and color description is created by the input device, which uses (1) the human visual model to determine the sampling and quantization, and (2) the device calibration to ensure that the gain, offset, and tone reproduction curves are properly set. The setting of tone reproduction curves may employ the dot overlap model, for example. This input device assigns device-dependent color coordinates (e.g., RGB) to the image elements. The image is fed to the color transform engine for converting to colorimetric coordinates (e.g., CIELAB).
This engine performs the colorimetric characterization of the input device; it implements one or more color space transform techniques (e.g., color mixing models, 3-D lookup tables). This engine can be part of the input device, an attachment, or reside in the host computer. The second step is to apply a chromatic-adaptation and/or color appearance model to the colorimetric data, along with additional information on the viewing conditions of the original image (if any), in order to transform the image data into appearance attributes such as lightness, hue, and chroma. These appearance coordinates, which have accounted for the influences of the particular device and the viewing conditions, are called a viewing-conditions-independent space (Ref. 1, p. 346). At this point, the image is represented purely by its original appearance. After this step, the image may go through various processing, such as spatial scaling,
edge enhancement, resolution conversion, editing, or compression. Once the output device is selected, the image should be transformed to a format suitable for that device. If the output is a bilevel device, the image should be halftoned. A good halftone algorithm takes the human visual model, dot overlap model, and color mixing model into account. Then color gamut mapping is performed to meet the output color gamut. Now the image is in its final form with respect to the appearance that is to be reproduced. Next, the process must be reversed. The viewing conditions for the output image, along with the final image-appearance data, are used in an inverted color appearance model to transform from the viewing-conditions-independent space to a device-independent color space (e.g., CIEXYZ). These values, together with the colorimetric characterization of the output device, are used for transforming to the device coordinates (e.g., cmyk) for producing the desired output image.
BIBLIOGRAPHY 1. M. D. Fairchild Color Appearance Models, Reading, MA: Addison-Wesley, 1998, p. 24. 2. G. Sharma H. J. Trussell Digital color imaging, IEEE Trans. Image. Proc., 6: 901–932, 1997. 3. G. Wyszecki W. S. Stiles Color Science: Concepts and Methods, Quantitative Data and Formulae, 2nd ed., New York: Wiley, 1982, pp. 489–490. 4. H. R. Kang Color Technology for Electronic Imaging Devices, Bellingham, WA: SPIE Optical Engineering Press, 1997, pp. 3–6. 5. M. R. Pointer G. G. Attridge The number of discernible colors, Color Res. Appl., 23: 52–54, 1998. 6. CIE, Colorimetry, Publication No. 15.2. Paris: Bureau Central de la CIE, 1971. 7. D. B. Judd G. Wyszecki Color in Business, Science, and Industry, 2nd ed., New York: Wiley, 1967, p. 138. 8. D. F. Rogers Procedural Elements for Computer Graphics, New York: McGraw-Hill, 1985, p. 389. 9. C. J. Bartleson Colorimetry, in F. Grum and C. J. Bartleson, Optical Radiation Measurements, Vol. 2, Color Measurement, New York: Academic Press, 1980, pp. 33–148. 10. B. A. Wandell Foundations of Vision, Sunderland, MA: Sinauer Associates, 1995, pp. 106–108. 11. A. A. Michelson Studies in Optics, Chicago: Univ. Chicago Press, 1927. 12. V. K. Zworykin G. A. Morton Television, 2nd ed., New York: Wiley, 1954. 13. F. Mazda, ed. Electronic Engineers Reference Book, 5th ed., London: Butterworths, 1983. 14. R. Ulichney Digital Halftoning, Cambridge, MA: MIT Press, 1987, pp. 79–84. 15. Y. Nayatani et al. Lightness dependency of chroma scales of a nonlinear color-appearance model and its latest formulation, Color Res. Appl., 20: 156–167, 1995. 16. R. W. G. Hunt The Reproduction of Color in Photography, Printing and Television, 5th ed., England: Fountain Press, 1995, Chap. 31. 17. M. D. Fairchild Refinement of the RLAB color space, Color Res. Appl., 21: 338–346, 1996. 18. S. Guth Further applications of the ATD model for color vision, IS&T/SPIE’s Symposium on Electronic Imaging: Science & Technology, Proc. SPIE, 2414: 12–26, 1995. 19. M. R. Luo M-C. Lo W-G. Kuo The LLAB (l:c) color model, Color Res. Appl., 21: 412–429, 1996. 20. H. E. J. Neugebauer Die theoretischen Grundlagen des Mehrfarbendruckes, Z. wiss. Photogr., 36: 73–89, 1937. 21. J. A. C. Yule Principles of Color Reproduction, New York: Wiley, 1967, Chap. 10. 22. I. Pobboravsky M. Pearson Computation of dot areas required to match a colorimetrically specified color using the modified, Proc. TAGA, 24: 65–77, 1972. 23. R. Holub W. Kearsley Color to colorant conversions in a colorimetric separation system, SPIE Vol. 1184, Neugebauer Memorial Seminar Color Reproduction, 1989, pp. 24–35. 24. F. Pollak The relationship between the densities and dot size of halftone multicolor images, J. Photogr. Sci., 3: 112–116, 1955. 25. F. Pollak Masking for halftone, J. Photogr. Sci., 3: 180–188, 1955. 26. F. Pollak New thoughts on halftone color masking, Penrose Annu., 50: 106–110, 1956. 27. A. C. Hardy F. L. Wurzburg Color correction in color printing, J. Opt. Soc. Amer., 38: 300–307, 1948. 28. A. Murray J. Franklin Inst., 221: 721, 1936.
29. J. A. C. Yule W. J. Nielsen The penetration of light into paper and its effect on halftone reproduction, Proc. Tech. Assn. Graphics Arts, 4: 65–75, 1951. 30. F. R. Clapper J. A. C. Yule The effect of multiple internal reflections on the densities of half-tone prints on paper, J. Opt. Soc. Amer., 43: 600–603, 1953. 31. F. R. Clapper J. A. C. Yule Reproduction of color with halftone images, Proc. Tech. Assn. Graphic Arts, 1–12, 1955. 32. P. Kubelka New contributions to the optics of intensely light-scattering materials. Part I, J. Opt. Soc. Am., 38: 448–457, 1948. 33. D. B. Judd G. Wyszecki Color in Business, Science, and Industry, 3rd ed., New York: Wiley, 1975, pp. 397–461. 34. E. Allen Colorant Formulation and Shading, in F. Grum and C. J. Bartleson, Optical Radiation Measurements, Vol. 2, New York: Academic Press, 1980, pp. 305–315. 35. T. N. Pappas D. L. Neuhoff Model-based halftoning, Proc. SPIE, 1453: 244–255, 1991. 36. T. N. Pappas Model-based halftoning of color images, IEEE Trans. Image Proc., 6: 1014–1024, 1997. 37. R. W. G. Hunt The Reproduction of Color in Photography, Printing and Television, 4th ed., England: Fountain Press, 1987, pp. 177–197. 38. W. L. Rhodes Digital imaging: Problems and standards, Proc. SID, 30: 191–195, 1989. 39. T. Johnson Device independent colour—Is it real? TAGA Proc., 81–113, 1992. 40. S. B. Bolte A perspective on non-impact printing in color, SPIE Vol. 1670, Color Hard Copy and Graphic Arts, 1992, pp. 2–11. 41. G. G. Field The systems approach to other reproduction—A critique, Proc. TAGA, 1–17, 1984. 42. J. L. Mannos D. J. Sakrison The effects of a visual fidelity criterion on the encoding of images, IEEE Trans. Info.Theory, IT-20: 525–536, 1974. 43. T. Mitsa J. R. Alford Single-channel versus multiple-channel visual models for the formulation of image quality measures in digital halftoning, IS&T NIP10, 1994, pp. 385–387. 44. K. E. Spaulding R. L. Miller J. Schildkraut Methods for generating blue-noise dither matrices for digital halftoning, JEI, 6: 208–230, 1997. 45. J. Sullivan L. Ray R. Miller Design of minimum visual modulation halftone patterns, IEEE Trans. Syst. Man. Cybern., 21: 33–38, 1990. 46. X. Zhang B. A. Wandell A spatial extension of CIELAB for digital color image reproduction, SID Int. Symp., Dig. Tech. Papers, 1996, pp. 731–734. 47. X. Zhang, et al. Color image quality metric S-CIELAB and its application on halftone texture visibility. IEEE Compcon. 1997, pp. 44–48. 48. X. Zhang J. E. Farrell B. A. Wandell Applications of a spatial extension to CIELAB, Proc. SPIE/IS&T Symposium on Electronic Imaging, 3025: 154–157, 1997.
HENRY R. KANG Peerless System Corporation
Wiley Encyclopedia of Electrical and Electronics Engineering: Image Enhancement. Standard Article. Scott T. Acton (Oklahoma State University, Stillwater, OK), Dong Wei and Alan C. Bovik (The University of Texas at Austin, Austin, TX). Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4105. Article Online Posting Date: December 27, 1999.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
IMAGE ENHANCEMENT Thirty years ago, the acquisition and processing of digital imagery belonged almost entirely in the domain of military, academic, and industrial research laboratories. Today, elementary school students download digital pictures from the World Wide Web, proud parents store photographs on a digital CD, business executives cut deals via digital video teleconferencing, and sports fans watch their favorite teams on digital satellite TV. These digital images are properly considered to be sampled versions of continuous real-world pictures. Because they are stored and processed on digital computers, digital images are typically discrete domain, discrete range signals. These signals can be conveniently represented and manipulated as matrices containing the light intensity or color information at each sampled point. When the acquired digital image is not fit for a prescribed use, image enhancement techniques may be used to modify the image. According to Gonzales and Woods (1), “The principal objective of enhancement techniques is to process an image so that the result is more suitable than the original image for a specific application.” So, the definition of image enhancement is a fairly broad concept that can encompass many applications. More usually, however, the image enhancement process seeks to enhance the apparent visual quality of an image or to emphasize certain image features. The benefactor of image enhancement either may be a human observer or a computer vision program performing some kind of higher-level image analysis, such as target detection or scene understanding. The two simplest methods for image enhancement involve simple histogram-modifying point operations or spatial digital filtering. More complex methods involve modifying the image content in another domain, such as the coefficient domain of a linear transformation of the image. We will consider all of these approaches in this article. A wealth of literature exists on point operations, linear filters, and nonlinear filters. In this discussion, we highlight several of the most important image enhancement standards and a few recent innovations. We will begin with the simplest.
Image Enhancement Techniques Denote a two-dimensional digital image of gray-level intensities by I. The image I is ordinarily represented in software accessible form as an M × N matrix containing indexed elements I(i, j), where 0 ≤ i ≤ M − 1, 0 ≤ j ≤ N − 1. The elements I(i, j) represent samples of the image intensities, usually called pixels (picture elements). For simplicity, we assume that these come from a finite integer-valued range. This is not unreasonable, since a finite wordlength must be used to represent the intensities. Typically, the pixels represent optical intensity, but they may also represent other attributes of sensed radiation, such as radar, electron micrographs, x rays, or thermal imagery. Point Operations. Often, images obtained via photography, digital photography, flatbed scanning, or other sensors can be of low quality due to poor image contrast or, more generally, poor usage of the available range of possible gray levels. The images may suffer from overexposure or from underexposure, as in the “mandrill” image in Fig. 1(a). In performing image enhancement, we seek to compute J, an enhanced
version of I. The most basic methods of image enhancement involve point operations, where each pixel in the enhanced image is computed as a one-to-one function of the corresponding pixel in the original image: J(i, j) = f [I(i, j)]. The most common point operation is the linear contrast stretching operation, which seeks to maximally utilize the available gray-scale range. If a is the minimum intensity value in image I and b is the maximum, the point operation for linear contrast stretching is defined by
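In a standard form consistent with the description that follows, the operation can be written as

$$ J(i,j) = (K-1)\,\frac{I(i,j) - a}{b - a}, $$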
assuming that the pixel intensities are bounded by 0 ≤ I(i, j) ≤ K − 1, where K is the number of available pixel intensities. The result image J then has maximum gray level K − 1 and minimum gray level 0, with the other gray levels being distributed in-between according to Eq. (1). Figure 1(b) shows the result of linear contrast stretching on Fig. 1(a). Several point operations utilize the image histogram, which is a graph of the frequency of occurrence of each gray level in I. The histogram value H I (k) equals n only if the image I contains exactly n pixels with gray level k. Qualitatively, an image that has a flat or well-distributed histogram may often strike an excellent balance between contrast and preservation of detail. Histogram flattening, also called histogram equalization in Gonzales and Woods (1), may be used to transform an image I into an image J with approximately flat histogram. This transformation can be achieved by assigning
where P(i, j) is a sample cumulative probability formed by using the histogram of I:
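For reference, a standard form of the assignment and its cumulative probability is

$$ J(i,j) = (K-1)\,P(i,j), \qquad P(i,j) = \frac{1}{MN}\sum_{k=0}^{I(i,j)} H_I(k). $$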
The image in Fig. 1(c) is a histogram-flattened version of Fig. 1(a). A third point operation, frame averaging, is useful when it is possible to obtain multiple images Gi , i = 1, . . ., n, of the same scene, each a version of the ideal image I to which deleterious noise has been unintentionally added:
where each noise “image” Ni is an M × N matrix of discrete random variables with zero mean and variance σ2 . The noise may arise as electrical noise, noise in a communications channel, thermal noise, or noise in the sensed radiation. If the noise images are not mutually correlated, then averaging the n frames together will form an effective estimate Iˆ of the uncorrupted image I, which will have a variance of only σ2 /n:
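In symbols (a standard statement of frame averaging),

$$ \hat I(i,j) = \frac{1}{n}\sum_{k=1}^{n} G_k(i,j) = I(i,j) + \frac{1}{n}\sum_{k=1}^{n} N_k(i,j), \qquad \operatorname{Var}\!\left[\hat I(i,j) - I(i,j)\right] = \frac{\sigma^2}{n}. $$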
This technique is only useful, of course, when multiple frames are available of the same scene, when the information content between frames remains unchanged (disallowing, for example, motion between frames), and when the noise content does change between frames. Examples arise quite often, however. For example, frame averaging is often used to enhance synthetic aperture radar images, confocal microscope images, and electron micrographs.
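A minimal numpy sketch of the three point operations described above is shown below; it assumes 8-bit gray-level images, and the function names are illustrative.

import numpy as np

def contrast_stretch(I, K=256):
    # Linear contrast stretch: map [min, max] of I onto the full range [0, K-1].
    I = I.astype(float)
    a, b = I.min(), I.max()
    return np.rint((K - 1) * (I - a) / (b - a)).astype(np.uint8)

def histogram_equalize(I, K=256):
    # Histogram flattening: assign J = (K-1) times the cumulative probability.
    hist = np.bincount(I.ravel(), minlength=K)
    cdf = np.cumsum(hist) / I.size
    return np.rint((K - 1) * cdf[I]).astype(np.uint8)

def frame_average(frames):
    # Average n registered frames of the same scene; noise variance drops as 1/n.
    return np.mean(np.stack(frames, axis=0), axis=0)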
Fig. 1. (a) Original “Mandrill” image (low contrast). (b) “Mandrill” enhanced by linear contrast stretching. (c) “Mandrill” after histogram equalization.
Linear Filters. Linear filters obey the classical linear superposition property as with other linear systems found in the controls, optics, and electronics areas of electrical engineering (2). Linear filters can be realized by linear convolution in the spatial domain or by pointwise multiplication of discrete Fourier transforms in the frequency domain. Thus, linear filters can be characterized by their frequency selectivity and spectrum shaping. As with 1-D signals, 2-D digital linear filters may be of the low-pass, high-pass or band-pass variety.
Much of the current interest in digital image processing can be traced to the rediscovery of the fast Fourier transform (FFT) some 30 years ago (it was known to Gauss). The FFT computes the discrete Fourier transform (DFT) of an N × N image with a computational cost of O(N² log₂ N), whereas naive DFT computation requires N⁴ operations. The speedup afforded by the FFT is tremendous. This is significant in linear filtering-based image enhancement, since linear filters are implemented via convolution:
where F is the impulse response of the linear filter, G is the original image, and J is the filtered, enhanced result. The convolution in Eq. (6) may be implemented in the frequency domain by the following pointwise multiplication (·) and inverse Fourier transform (IFFT):
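In a standard form consistent with the zero-padding convention described next, the frequency-domain implementation is

$$ J = \operatorname{IFFT}\!\big[\operatorname{FFT}(F_0)\cdot\operatorname{FFT}(G_0)\big], $$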
where F0 and G0 are 2N × 2N zero-padded versions of F and G. By this we mean that F0(i, j) = F(i, j) for 0 ≤ i, j ≤ N − 1 and F0(i, j) = 0 otherwise; similarly for G0. The zero padding is necessary to eliminate wraparound effects in the FFTs, which occur because of the natural periodicities inherent in sampled data. If G is corrupted as in Eq. (4) and N contains white noise with zero mean, then enhancement means noise-smoothing, which is usually accomplished by applying a low-pass filter of a fairly wide bandwidth. Typical low-pass filters include the average filter, the Gaussian filter, and the ideal low-pass filter. The average filter can be implemented by averaging a neighborhood (an m × m neighborhood, for example) of pixels around G(i, j) to compute J(i, j). Likewise, average filtering can be viewed as convolving G with a box-shaped kernel F in Eq. (7). An example of average filtering is shown in Fig. 2(a)–(c). Similarly, a Gaussian-shaped kernel F may be convolved with G to form a smoothed, less noisy image, as in Fig. 2(d). The Gaussian-shaped kernel has the advantage of giving more weight to closer neighbors and is well-localized in the frequency domain, since the Fourier transform of the Gaussian is also Gaussian-shaped. This is important because it reduces noise “leakage” at higher frequencies. In order to provide an “ideal” cutoff in the frequency domain, the FFT of G0 can be zeroed beyond a cutoff frequency (this is equivalent to multiplying by a binary DFT F0 in Eq. (7)). This result, shown in Fig. 2(e), reveals the ringing artifacts associated with an ideal low-pass filter. Linear filters are also useful when the goal of image enhancement is sharpening. Often, an image is blurred by a novice photographer who moves the camera or improperly sets the focus. Images are also blurred by motion in the scene or by inherent optical problems, such as with the famous Hubble telescope. Indeed, any optical system contributes some blur to the image. Motion blur and defocus can also be modeled as a linear convolution B ∗ I, where B, in this case, represents linear distortion. This distortion is essentially a low-pass process; therefore, a high-pass filter can be used to sharpen or deblur the distorted image. The most obvious solution is to create an inverse filter B⁻¹ that exactly reverses the low-pass blurring of B. The inverse filter is typically defined in the frequency domain by mathematically inverting each frequency component of the Fourier transform of B, creating a high-pass filter B⁻¹. Let the complex-valued components of the DFT of B be denoted by B̃(u,v). Then, the components of B⁻¹ are given by
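One standard statement, given here for reference, is

$$ \tilde B^{-1}(u,v) = \frac{1}{\tilde B(u,v)}. $$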
The image can be sharpened by pointwise multiplying the (zero-padded) FFT of the blurred image by the (zero-padded) FFT of B⁻¹, then performing the inverse FFT operation, which is why this enhancement is often called deconvolution. It must be noted that difficulty arises when the Fourier transform of B contains zero-valued
Fig. 2. (a) Original “Winston” image. (b) Corrupted “Winston” image with additive Gaussian-distributed noise (σ = 10). (c) Average filter result (5 × 5 window). (d) Gaussian filter result (standard deviation σ = 2). (e) Ideal low-pass filter result (cutoff = N/4). (f) Wavelet shrinkage result.
elements. In this case, a simple solution is the pseudo-inverse filter, which leaves the zeroed frequencies as zeros in the construction of B⁻¹. A challenging problem is encountered when both linear distortion (B) and additive noise (N) degrade the image I:
If we apply a low-pass filter F to ameliorate the effects of noise, then we only further blur the image. In contrast, if we apply an inverse (high-pass) filter B⁻¹ to deblur the image, the high-frequency components of the broadband noise N are amplified, resulting in severe corruption. This ill-posed problem of conflicting goals can be attacked by a compromise between low-pass and high-pass filtering. The famous Wiener filter provides such a compromise [see Russ (3)]. If η represents the noise power and N is white noise, then the frequency response of the Wiener filter is defined by
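A standard statement of this frequency response, given here for reference with η the noise power as above, is

$$ W(u,v) = \frac{\tilde B^{*}(u,v)}{\left|\tilde B(u,v)\right|^{2} + \eta}, $$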
where B̃*(u,v) is the complex conjugate of B̃(u,v). The Wiener filter attempts to balance the operations of denoising and deblurring optimally (according to the mean-square criterion). As the noise power is decreased, the Wiener filter becomes the inverse filter, favoring deblurring. However, the Wiener filter usually produces only moderately improved results, since the tasks of deconvolution (high-pass filtering) and noise-smoothing (low-pass filtering) are antagonistic to one another. The compromise is nearly impossible to balance using purely linear filters. Nonlinear Filters. Nonlinear filters are often designed to remedy deficiencies of linear filtering approaches. Nonlinear filters cannot be implemented by convolution and do not provide a predictable modification of image frequencies. However, for this very reason, powerful nonlinear filters can provide performance attributes not attainable by linear filters, since frequency separation (between image and noise) is often not possible. Nonlinear filters are usually defined by local operations on windows of pixels. The window, or structuring element, defines a local neighborhood of pixels such as the window of pixels at location (i, j):
where K is the window defining pixel coordinate offsets belonging to the local neighborhood of I(i, j). The output pixels in the filtered image J can be expressed as nonlinear many-to-one functions of the corresponding windowed sets of pixels in the image G:
So, the nonlinear filtering operation may be expressed as a function of the image and the defined moving window: J = f (G, K). The windows come in a variety of shapes, mostly symmetric and centered. The size of the window determines the scale of the filtering operation. Larger windows will tend to produce more coarse scale representations, eliminating fine detail. Order Statistic Filters and Image Morphology. Within the class of nonlinear filters, order statistic (OS) filters encompass a large group of effective image enhancers. A complete taxonomy is given in Bovik and Acton (4). The OS filters are based on an arithmetic ordering of the pixels in each window (local neighborhood). At pixel location (i, j) in the image G, given a window K of 2m + 1 pixels, the set of order statistics is denoted by
where GOS (1) (i, j) ≤ GOS (2) (i, j) ≤ . . . ≤ GOS (2m+1) (i, j). These are just the original pixel values covered by the window and reordered from smallest to largest.
Perhaps the most popular nonlinear filter is the median filter (5). It is an OS filter and is implemented by
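In terms of the order statistics, the median filter output takes the standard form

$$ J(i,j) = G_{\mathrm{OS}}^{(m+1)}(i,j), $$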
assuming a window size of 2m + 1 pixels. The median smoothes additive white noise while maintaining edge information—a property that differentiates it from all linear filters. Particularly effective at eliminating impulse noise, the median filter has strong optimality properties when the noise is Laplacian-distributed (6). An example of the smoothing ability of the median filter is shown in Fig. 3(a)–(c). Here, a square 5 × 5 window was used, preserving edges and removing the impulse noise. Care must be taken when determining the window size used with the median filter, or streaking and blotching artifacts may result (7). More general OS filters can be described by a weighted sum of the order statistics as follows:
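That is (a standard weighted order-statistic form),

$$ J(i,j) = \sum_{k=1}^{2m+1} A(k)\, G_{\mathrm{OS}}^{(k)}(i,j), $$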
where A is the vector that determines the weight of each OS. Several important enhancement filters evolve from this framework. The L-inner mean filter or trimmed mean filter may be defined by setting A(k) = 1/(2L + 1) for (m + 1 − L) ≤ k ≤ (m + 1 + L) and A(k) = 0 otherwise. This filter has proven to be robust (giving close to optimal performance) in the presence of many types of noise. Thus, it is often efficacious when the noise is unknown. Weighted median filters also make up a class of effective, robust OS filters (8,9). Other nonlinear filters strongly related to OS filters include morphological filters, which manipulate image intensity profiles and thus are shape-changing filters in this regard. Image morphology is a rapidly exploding area of research in image processing (10). Through the concatenation of a series of simple OS filters, a wide scope of processing techniques emerges. Specifically, the basic filters used are the erosion of G by structuring element K defined by
and the dilation of G by K defined by
The erosion of G by K is often represented by G ⊖ K, while the dilation is represented by G ⊕ K. By themselves, the erode and dilate operators are not useful for image enhancement, because they are biased and do not preserve shape. However, the alternating sequence of erosions and dilations is indeed useful. The close filter is constructed by performing dilation and then erosion:
while the open filter is erosion followed by dilation:
Open and close filters are idempotent, so further closings or openings yield the same result, much like bandpass filters in linear image processing. Although bias-reduced, the open filter will tend to remove small bright
Fig. 3. (a) Original “Bridge” image. (b) “Bridge” image corrupted with 20% salt and pepper noise. (c) Median filter result (B = 5 × 5 square). (d) Open–close filter result (B = 3 × 3 square).
image regions and will separate loosely connected regions, while the close filter will remove dark spots and will link loosely connected regions. To emulate the smoothing properties of the median filter with morphology, the open and close filters can be applied successively. The open–close filter is given by (G ◦ K) · K, and the close–open filter is given by (G · K) ◦ K. Since open–close and close–open filters involve only minimum (erode) and maximum (dilate) calculations, they offer an affordable alternative to the median OS filter, which requires a more expensive ordering of each
windowed set of pixels. However, in the presence of extreme impulse noise, such as the salt and pepper noise shown in Fig. 3(b), the open–close (or close–open) cannot reproduce the results of the median filter [see Fig. 3(d)]. Many variants of the OS and morphological filters have been applied successfully to certain image enhancement applications. The weighted majority with minimum range (WMMR) filter is a special version of the OS filter (17) where only a subset of the order statistics are utilized (11). The subset used to calculate the result at each pixel site is the group of pixel values with minimum range. The WMMR filters have been shown to have special edge enhancing abilities and, under special conditions, can provide near piecewise constant enhanced images. To improve the efficiency of OS filters, stack filters were introduced by Wendt et al. (12). Stack filters exploit a stacking property and a superposition property called the threshold decomposition. The filter is initialized by decomposing the K-valued signal (by thresholding) into binary signals, which can be processed by using simple Boolean operators. “Stacking” the binary signals enables the formation of the filter output. The filters can be used for real-time processing. One limitation of the OS filters is that spatial information inside the filter window is discarded when rank ordering is performed. A recent group of filters, including the C, Ll, and permutation filters, combine the spatial information with the rank ordering of the OS structure (13). The combination filters can be shown to have an improved ability to remove outliers, as compared to the standard OS design, but have the obvious drawback of increased computational complexity. Diffusion Processes. A newly developed class of nonlinear image enhancement methods uses the analogy of heat diffusion to adaptively smooth the image. Anisotropic diffusion, introduced by Perona and Malik (14), encourages intraregion smoothing and discourages interregion smoothing at the image edges. The decision on local smoothing is based on a diffusion coefficient which is generally a function of the local image gradient. Where the gradient magnitude is relatively low, smoothing ensues. Where the gradient is high and an edge may exist, smoothing is inhibited. A discrete version of the diffusion equation is
where t is the iteration number, λ is a rate parameter (λ ≤ 1/4), and the subscripts N, S, E, W represent the direction of diffusion. So, ∇IN (i, j) is the simple difference (directional derivative) in the northern direction [i.e., ∇IN (i, j) = I(i − 1, j) − I(i, j)], and cN (i, j) is the corresponding diffusion coefficient when diffusing image I at location (i, j). One possible form of the diffusion coefficient (for a particular direction d) is given by
where k is an edge threshold. Unfortunately, Eq. (21) will preserve outliers from noise as well as edges. To circumvent this problem, a new diffusion coefficient has been introduced that uses a filtered image to compute the gradient terms (15). For example, we can use a Gaussian-filtered image S = I ∗ G(σ) to compute the gradient terms, given a Gaussian-shaped kernel with standard deviation σ. Then the diffusion coefficient becomes
Fig. 4. (a) Original “Cameraman” image. (b) “Cameraman” image corrupted with additive Laplacian-distributed noise. (c) After eight iterations of anisotropic diffusion with diffusion coefficient in Eq. (21). (d) After eight iterations of anisotropic diffusion with diffusion coefficient in Eq. (22).
and can be used in Eq. (20). A comparison between the two diffusion coefficients is shown in Fig. 4 for an image corrupted with Laplacian-distributed noise. After eight iterations of anisotropic diffusion using the diffusion coefficient of Eq. (21), sharp details are preserved, but several outliers remain [see Fig. 4(c)]. Using the diffusion coefficient of Eq. (22), the noise is eradicated but several fine features are blurred [see Fig. 4(d)]. Anisotropic diffusion is a powerful enhancement tool, but is often limited by the number of iterations needed to achieve an acceptable result. Furthermore, the diffusion equation is inherently ill-posed, leading
to divergent solutions and introducing artifacts such as “staircasing.” Research continues on improving the computational efficiency and on developing robust well-posed diffusion algorithms. Wavelet Shrinkage. Recently, wavelet shrinkage has been recognized as a powerful tool for signal estimation and noise reduction or simply de-noising (16). The wavelet transform utilizes scaled and translated versions of a fixed function, which is called a “wavelet,” and is localized in both the spatial and frequency domains (17). Such a joint spatial-frequency representation can be naturally adapted to both the global and local features in images. The wavelet shrinkage estimate is computed via thresholding wavelet transform coefficients:
where DWT and IDWT stand for discrete wavelet transform and inverse discrete wavelet transform, respectively (17), and f [ ] is a transform-domain point operator defined by either the hard-thresholding rule
or the soft-thresholding rule
where the value of the threshold t is usually determined by the variance of the noise and the size of the image. The key idea of wavelet shrinkage derives from the approximation property of wavelet bases. The DWT compresses the image I into a small number of DWT coefficients of large magnitude, and it packs most of the image energy into these coefficients. On the other hand, the DWT coefficients of the noise N have small magnitudes; that is, the noise energy is spread over a large number of coefficients. Therefore, among the DWT coefficients of G, those having large magnitudes correspond to I and those having small magnitudes correspond to N. Apparently, thresholding the DWT coefficients with an appropriate threshold removes a large amount of noise and maintains most image energy. Though the wavelet shrinkage techniques were originally proposed for the attenuation of imageindependent white Gaussian noise, they work as well for the suppression of other types of distortion such as the blocking artifacts in JPEG-compressed images (18,19). In this case, the problem of enhancing a compressed image may be viewed as a de-noising problem where we regard the compression error as additive noise. We applied the wavelet shrinkage to enhancing the noisy image shown in Fig. 2(b) and show the de-noised image in Fig. 2(f), from which one can clearly see that a large amount of noise has been removed, and most of the sharp image features were preserved without blurring or ringing effects. This example indicates that wavelet shrinkage can significantly outperform the linear filtering approaches. Figure 5 illustrates an example of the enhancement of JPEG-compressed images (20). Figure 5(a) shows a part of the original image. Fig. 5(b) shows the same part in the JPEG-compressed image with a compression ratio 32:1, where blocking artifacts are quite severe due to the loss of information in the process of compression. Figure 5(c) reveals the corresponding part in the enhanced version of Fig. 5(b), to which we have applied wavelet shrinkage. One can find that the blocking artifacts are greatly suppressed and the image quality is dramatically improved.
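The sketch below is a minimal, self-contained version of this idea: a single-level 2-D Haar transform, soft thresholding of the detail coefficients, and the inverse transform. Practical de-noisers use deeper decompositions and smoother wavelets, and the even-sized image, the hand-picked threshold t, and the function names are assumptions of this sketch rather than details from the article.

import numpy as np

def haar2d(x):
    # One level of a 2-D Haar DWT; x must have even height and width.
    lo = (x[:, 0::2] + x[:, 1::2]) / np.sqrt(2)
    hi = (x[:, 0::2] - x[:, 1::2]) / np.sqrt(2)
    ll = (lo[0::2, :] + lo[1::2, :]) / np.sqrt(2)   # coarse approximation
    lh = (lo[0::2, :] - lo[1::2, :]) / np.sqrt(2)   # detail subbands
    hl = (hi[0::2, :] + hi[1::2, :]) / np.sqrt(2)
    hh = (hi[0::2, :] - hi[1::2, :]) / np.sqrt(2)
    return ll, lh, hl, hh

def ihaar2d(ll, lh, hl, hh):
    # Inverse of haar2d.
    lo = np.empty((2 * ll.shape[0], ll.shape[1]))
    hi = np.empty_like(lo)
    lo[0::2, :], lo[1::2, :] = (ll + lh) / np.sqrt(2), (ll - lh) / np.sqrt(2)
    hi[0::2, :], hi[1::2, :] = (hl + hh) / np.sqrt(2), (hl - hh) / np.sqrt(2)
    x = np.empty((lo.shape[0], 2 * lo.shape[1]))
    x[:, 0::2], x[:, 1::2] = (lo + hi) / np.sqrt(2), (lo - hi) / np.sqrt(2)
    return x

def soft_threshold(c, t):
    # Soft-thresholding rule: shrink every coefficient toward zero by t.
    return np.sign(c) * np.maximum(np.abs(c) - t, 0.0)

def shrinkage_denoise(G, t):
    ll, lh, hl, hh = haar2d(G.astype(float))
    # Keep the coarse approximation; shrink only the detail coefficients.
    return ihaar2d(ll, soft_threshold(lh, t), soft_threshold(hl, t), soft_threshold(hh, t))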
Fig. 5. (a) Original “Lena” image. (b) “Lena” JPEG-compressed at 32:1. (c) Wavelet shrinkage applied to Fig. 5b.
Homomorphic Filtering. To this point, we have described methods that only deal with additive noise. In several imaging scenarios, such as radar and laser-based imaging, signal-dependent noise is encountered. The signal-dependent noise can be modeled as a multiplicative process:
for noise values N(i, j) ≥ 0 (“·” is again pointwise multiplication). Applying the traditional low-pass filters or nonlinear filters is fruitless, since the noise is signal dependent. But we can decouple the noise from the signal using a homomorphic approach. The first step of the homomorphic approach is the application of a logarithmic point operation:
Since log[G(i, j)] = log[N(i, j)] + log[I(i, j)], we now have the familiar additive noise problem of Eq. (4). Then we can apply one of the filters discussed above, such as the median filter, and then transform the image back to its original range with an exponential point operation.
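A minimal sketch of this pipeline is shown below, assuming SciPy is available for the median filter; the window size, the small offset used to avoid log(0), and the function name are choices made here for illustration.

import numpy as np
from scipy.ndimage import median_filter

def homomorphic_median(G, size=3, eps=1e-6):
    # The log turns the multiplicative model G = N * I into an additive one,
    # an ordinary additive-noise filter is applied, and exp maps the result back.
    log_g = np.log(G.astype(float) + eps)     # eps guards against log(0)
    filtered = median_filter(log_g, size=size)
    return np.exp(filtered)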
Applications and Extensions
The applications of image enhancement are as numerous as are the sources of images. Different applications, of course, benefit from enhancement methods that are tuned to the statistics of physical processes underlying the image acquisition stage and the noise that is encountered. For example, a good encapsulation of image processing for biological applications is found in Häder (21). With the availability of affordable computing engines that can handle video processing on-line, the enhancement of time sequences of images is of growing interest. A video data set may be written as I(i, j, k), where k represents samples of time. Many of the techniques discussed earlier can be straightforwardly extended to video processing, using three-dimensional FFTs, 3-D convolution, 3-D windows, and 3-D wavelet transforms. However, a special property of video sequences is that they usually contain image motion, which is projected from the motion of objects in the scene. The motion often may be rapid, leading to time-aliasing of the moving parts of the scene. In such cases, direct 3-D extensions of many of the methods discussed above (those that are not point operations) will often fail, since the processed video will often exhibit ghosting artifacts arising from poor handling of the aliased data. This can be ameliorated via motion-compensated enhancement techniques. This generally involves two steps: motion estimation, whereby the local image motion is estimated across the image (by a matching technique), and compensation, where a correction is made to compensate for the shift of an object, before subsequent processing is accomplished. The topic of motion compensation is ably discussed in Tekalp (22).
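A minimal exhaustive-search block matcher in the spirit of the matching step described above is sketched here; the block size, search range, sum-of-absolute-differences criterion, and function name are assumptions of this sketch and are not taken from the article or from Tekalp (22).

import numpy as np

def block_motion(prev, curr, i, j, block=8, search=7):
    # Find the displacement of the block at (i, j) in `curr` relative to `prev`
    # by minimizing the sum of absolute differences (SAD) over a search window.
    ref = curr[i:i + block, j:j + block].astype(float)
    best, best_dv = np.inf, (0, 0)
    for di in range(-search, search + 1):
        for dj in range(-search, search + 1):
            ii, jj = i + di, j + dj
            if ii < 0 or jj < 0 or ii + block > prev.shape[0] or jj + block > prev.shape[1]:
                continue
            cand = prev[ii:ii + block, jj:jj + block].astype(float)
            sad = np.abs(ref - cand).sum()
            if sad < best:
                best, best_dv = sad, (di, dj)
    return best_dv   # motion vector; compensation shifts the block accordingly before filtering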
BIBLIOGRAPHY
1. R. C. Gonzales, R. E. Woods, Digital Image Processing, Reading, MA: Addison-Wesley, 1995.
2. A. Antoniou, Digital Filters: Analysis and Design, New York: McGraw-Hill, 1979.
3. J. C. Russ, The Image Processing Handbook, Boca Raton, FL: CRC Press, 1995.
4. A. C. Bovik, S. T. Acton, The impact of order statistics on signal processing, in H. N. Nagaraja, P. K. Sen, and D. F. Morrison (eds.), Statistical Theory and Applications, New York: Springer-Verlag, 1996, pp. 153–176.
5. J. W. Tukey, Exploratory Data Analysis, Reading, MA: Addison-Wesley, 1971.
6. A. C. Bovik, T. S. Huang, D. C. Munson, Jr., A generalization of median filtering using linear combinations of order statistics, IEEE Trans. Acoust. Speech Signal Process., ASSP-31: 1342–1350, 1983.
7. A. C. Bovik, Streaking in median filtered images, IEEE Trans. Acoust. Speech Signal Process., ASSP-35: 493–503, 1987.
8. O. Yli-Harja, J. Astola, Y. Neuvo, Analysis of the properties of median and weighted median filters using threshold logic and stack filter representation, IEEE Trans. Acoust. Speech Signal Process., ASSP-39: 395–410, 1991.
9. P.-T. Yu, W.-H. Liao, Weighted order statistic filters—their classification, some properties and conversion algorithm, IEEE Trans. Signal Process., 42: 2678–2691, 1994.
10. J. Serra, Image Analysis and Morphology: Vol. 2: The Theoretical Advances, London: Academic Press, 1988.
11. H. G. Longbotham, D. Eberly, The WMMR filters: A class of robust edge enhancers, IEEE Trans. Signal Process., 41: 1680–1684, 1993.
12. P. D. Wendt, E. J. Coyle, N. C. Gallagher, Jr., Stack filters, IEEE Trans. Acoust. Speech Signal Process., ASSP-34: 898–911, 1986.
13. K. E. Barner, G. R. Arce, Permutation filters: A class of nonlinear filters based on set permutations, IEEE Trans. Signal Process., 42: 782–798, 1994.
14. P. Perona, J. Malik, Scale-space and edge detection using anisotropic diffusion, IEEE Trans. Pattern Anal. Mach. Intell., PAMI-12: 629–639, 1990.
15. F. Catte et al., Image selective smoothing and edge detection by nonlinear diffusion, SIAM J. Numer. Anal., 29: 182–193, 1992.
16. D. L. Donoho, De-noising by soft-thresholding, IEEE Trans. Inf. Theory, 41: 613–627, 1995.
17. G. Strang, T. Nguyen, Wavelets and Filter Banks, Wellesley, MA: Wellesley-Cambridge Press, 1996.
18. D. Wei, C. S. Burrus, Optimal wavelet thresholding for various coding schemes, Proc. IEEE Int. Conf. Image Process., I: 610–613, 1995.
19. D. Wei, A. C. Bovik, Enhancement of compressed images by optimal shift-invariant wavelet packet basis, J. Visual Commun. Image Represent., in press.
20. W. B. Pennebaker, J. L. Mitchell, JPEG—Still Image Data Compression Standard, New York: Van Nostrand Reinhold, 1993.
21. D. P. Häder (ed.), Image Analysis in Biology, Boca Raton, FL: CRC Press, pp. 29–53, 1991.
22. A. M. Tekalp, Digital Video Processing, Upper Saddle River, NJ: Prentice-Hall, 1995.
SCOTT T. ACTON, Oklahoma State University
DONG WEI, The University of Texas at Austin
ALAN C. BOVIK, The University of Texas at Austin
Wiley Encyclopedia of Electrical and Electronics Engineering
Image Processing
Standard Article
Harpreet Singh1, Y. Hamzeh1, S. Bhama2, S. Talahmeh3, L. Anneberg4, G. Gerhart5, T. Meitzler5, D. Kaur6
1Wayne State University, Detroit, MI; 2JTICS, Chrysler Corporation, Detroit, MI; 3INVANTAGE, Inc., Taylor, MI; 4Lawrence Technological University, Southfield, MI; 5Research and Engineering Center (TARDEC), Warren, MI; 6University of Toledo, Toledo, OH
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W4106
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Background; Image Restoration; Image Enhancement; Object Recognition; Wavelets; Image Coding; Neural Networks; Image Fusion; Fuzzy Logic and Neuro-Fuzzy Approach to Image Fusion; Major Problems in Video Image Fusion.
IMAGE PROCESSING
BACKGROUND
Image processing consists of a wide variety of techniques and mathematical tools to process an input image. An image is processed as soon as we start extracting data from it. The data of interest in object recognition systems are those related to the object under investigation. An image usually goes through some enhancement steps, in order to improve the extractability of interesting data and suppress other data. Extensive research has been carried out in the area of image processing over the last 30 years. Image processing has a wide area of applications. Some of the important areas of application are business, medicine, military, and automation. Image processing has been defined as a wide variety of techniques that includes coding, filtering, enhancement, restoration, registration, and analysis. In many applications, such as the recognition of three-dimensional objects, image processing and pattern recognition are not separate disciplines. Pattern recognition has been defined as a process of extracting features and classifying objects. In every three-dimensional (3-D) object recognition system there are units for image processing and there are others for pattern recognition. There are two different approaches to image processing:
1. Analog processing. This approach is very fast since the time involved in analog-to-digital (AD) and digital-to-analog (DA) conversion is saved. But this approach is not flexible, since the manipulation of images is very hard.
2. Digital processing. This approach is slower than the analog approach but is very flexible, since manipulation is done very easily. The processing time of this approach is tremendously improved by the advent of parallel processing techniques.
Digital Image Processing
Digital image processing is defined in Ref. 1 as “the processing of two dimensional images by a digital computer.” A digital image is represented by an array of regularly spaced and very small quantized samples of the image. Two processes that are related to any digital system are sampling and quantization. When a picture is digitized, it is represented by regularly spaced samples of this picture. These quantized samples are called pixels. The array of pixels that are processed in practice can be quite large. To represent an ordinary black and white television (TV) image digitally, an array of 512 × 512 pixels is required. Each pixel is represented by an 8 bit number to allow 256 gray levels. Hence a single TV picture needs about 2 × 10⁶ bits. Digital image processing encompasses a wide variety of techniques and mathematical tools. They have all been developed for use in one or the other of two basic activities
that constitute digital image processing: image preprocessing and image analysis. An approach called the state-space approach has been recently used in modeling image processors. These image processors are made of linear iterative circuits. The statespace model is used efficiently in image processing and image analysis. If the model of an image processor is known, the realization of a controllable and observable image processor is then very simple. Image Preprocessing. Image preprocessing is an early stage activity in image processing that is used to prepare an input image for analysis to increase its usefulness. Image preprocessing includes image enhancement, restoration, and registration. Image enhancement accepts a digital image as input and produces an enhanced image as an output; in this context, enhanced means better in some respects. This includes improving the contrast, removing geometric distortion, smoothing the edges, or altering the image to facilitate the interpretation of its information content. In image restoration, the degradation is removed from the image to produce a picture that resembles the original undegraded picture. In image registration, the effects of sensor movements are removed from the image or to combine different pictures received by different sensors of the same field. Image Analysis. Image analysis accepts a digital image as input and produces data or a report of some type. The produced data may be the features that represent the object or objects in the input image. To produce such features, different processes must be performed that include segmentation, boundary extraction, silhouette extraction, and feature extraction. The produced features may be quantitative measures, such as moment invariants, and Fourier descriptors, or even symbols, such as regular geometrical primitives. Sampling and Quantization Quantization is the process of representing a very large number (possibly infinite) of objects with a smaller, finite number of objects. The representing set of objects may be taken from the original set (e.g., the common numberrounding process) or may be completely different (e.g., the alphabetical grading system commonly used to represent test results). In image processing systems, quantization is preceded by another step called sampling. The gray level of each pixel in an image is measured, and a voltage signal that is proportional to the light intensity at each pixel is generated. It is clear that the voltage signal can have any value from the voltages that are generated by the sensing device. Sampling is the process of dividing this closed interval of a continuous voltage signal into a number of subintervals that are usually of equal length. In an 8 bit sampling and quantization process, for example, the interval of voltage signals is divided into 256 subintervals of equal length. In the quantization process, each of the generated intervals from sampling is represented by a code word. In an
8-bit quantization process, each code word consists of an 8 bit binary number. An 8 bit analog-to-digital converter (ADC) can simply accomplish the tasks of sampling and quantization. The image data are now ready for further processes through use of digital computers. For systems that involve dynamic processing of image signals [e.g., TV signals or video streams from chargecoupled device (CCD) cameras], the term sampling refers to a completely different process. In this context, sampling means taking measurements of the continuous image signal at different instants of time. Each measurement can be thought of as a single stationary image. A common problem associated with image digitization is aliasing. The sampling theorem states that for a signal to be completely reconstructable, it must satisfy the following equation:
where ws is the sampling frequency and w is the frequency of the sampled signal. Sampling, in this context, means taking measurements of the analog signal at different instants separated by a fixed time interval t. This theorem is applicable to the sampling of stationary images as well, where sampling is carried out through space instead of time. If the signal is band limited, the sampling frequency is determined according to the frequency of its highest-frequency component. Image signals, however, are subjected to truncating, mainly because of the limitations in sensors and display devices. Sensors are capable of recognizing a limited range of gray levels. Real objects usually have wider ranges of gray levels, which means that both the gray levels higher and lower than the range of the sensor are truncated. Truncating is what causes the aliasing problem. To explain how this happens, consider the simple sinusoidal function given by f(x) = cos(x). Figure 1 shows a plot of this function and Fig. 2 shows a plot of its Fourier transform. Figure 3 shows a truncated version of that function, and Fig. 4 shows the equivalent Fourier transform. This function has infinite duration in the frequency domain. The Nyquist frequency is given by wn = ws/2. If we try to sample this signal with a sampling frequency of ws, then all frequencies higher than the Nyquist frequency will have aliases within the range of the sampling frequency. In other words, aliasing causes high-frequency components of a signal to be seen as low frequencies. This is also known as folding. A practical method to get rid of aliasing is to prefilter the analog signal before sampling. Figure 4 shows that the lower frequencies of the signal contain most of the signal’s power. A filter is designed so that filtered signals do not have frequencies above the Nyquist frequency. A standard analog filter transfer function may be given as
where ζ is the damping factor of the filter and w is its natural frequency. By cascading second- and first-order filters, one can get higher-order systems that have higher performance. Three of the most commonly used filters are the Butterworth filter, the ITAE filter, and the Bessel filter. Bessel
filters are commonly used for high-performance applications, mainly because of the following two factors:
1. The damping factors that may be obtained by a Bessel filter are generally higher than those obtained by other filters. A higher damping factor means a better cancellation of frequencies outside the desired bandwidth.
2. The Bessel filter has a linear phase curve, which means that the shape of the filtered signal is not much distorted.
To demonstrate how we can use a Bessel filter to eliminate high-frequency noise and aliasing, consider the square signal in Fig. 5. This has a frequency of 25 Hz. Another signal with a frequency of 450 Hz is superimposed on the square signal. If we try to sample the square signal with noise [Fig. 5(a)], we will get a very distorted signal [Fig. 5(b)]. Next, we prefilter this signal using a second-order Bessel filter with a bandwidth of 125 Hz and a damping factor of 0.93. The resultant signal is shown in Fig. 5(c). Figure 5(d) shows the new signal after sampling. It is clear that this signal is very close to the original square signal without noise.
IMAGE RESTORATION
Image restoration refers to a group of techniques that are oriented toward modeling the degradation and applying the inverse process in order to recover the original image. Each component in the imaging system contributes to the degrading of the image. Image restoration techniques try to model the degradation effect of each component and then perform operations to undo the model, to restore the original image (3). There are two different modeling approaches for degradation: the a priori approach and the a posteriori approach. These two approaches differ in the manner in which information is gathered to describe the characteristics of the image degradation. The a priori approach is to try to model each source of noise in the imagery system by measuring the system’s responses to arbitrary noises. In many cases, deterministic models cannot be extracted and stochastic models are used instead. The a posteriori approach is adopted when a great deal of information is known about the original image. We can develop a mathematical model for the original image and try to fit the model to the observed image. Figure 6 shows a simplified model for the image degradation and restoration processes. The original image signal f(x, y) is subjected to the linear degrading function h(x, y). An arbitrary noise signal η(x, y) is then added to create the degraded image signal g(x, y). Reconstruction approaches try to estimate the original image signal f(x, y) given the degraded signal g(x, y) and some statistical knowledge of the noise signal η(x, y). We can broadly classify reconstruction techniques into two classes: the filtering reconstruction techniques and the algebraic techniques (5).
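The degradation model of Fig. 6 is easy to simulate, which is convenient for experimenting with the reconstruction filters described next. The sketch below is a minimal NumPy version; the circular (FFT-based) convolution, the 5 × 5 uniform blur kernel, and the Gaussian noise are assumptions made here, not specifics from the article.

import numpy as np

def degrade(f, h, noise_sigma):
    # Fig. 6 model: g = h * f + eta, with the blur implemented as a circular
    # convolution via the FFT for simplicity.
    H = np.fft.fft2(h, s=f.shape)
    blurred = np.real(np.fft.ifft2(np.fft.fft2(f) * H))
    return blurred + np.random.normal(0.0, noise_sigma, f.shape)

h = np.ones((5, 5)) / 25.0    # a 5 x 5 uniform blur as the degrading function h(x, y)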
Figure 1. Cosine function with amplitude A and frequency of 1 Hz.
Figure 2. Power spectrum of the cosine function with amplitude A and frequency of 1 Hz.
Figure 3. Truncated cosine function. The truncation is in the variable x (e.g., time), not in the amplitude.
Figure 4. The power spectrum of the truncate cosine function is a continuous one, with maximum values at the same points, like the power spectrum of the continuous cosine function.
Figure 5. Antialiasing filtering: (a) square signal with higherfrequency noise, (b) digitized signal with noise, (c) continuous signal after using second-order Bessel filter, and (d) digitized filtered signal. Using the Bessel filter reduced noise and improved the digitization process substantially.
Figure 6. Simplified model for the degradation process. The image signal f(x, y) is subjected to a linear degrading function h(x, y) and an arbitrary noise η(x, y) is added to produce the degraded signal g(x, y).
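As a preview of the deconvolution (inverse) filtering idea described in the next subsection, the sketch below divides the spectrum of the degraded image by the frequency response of the blur; the eps guard against near-zero values of H(u, v) and the circular-convolution assumption are choices made here, not the article's formulation.

import numpy as np

def inverse_filter(g, h, eps=1e-3):
    # Naive deconvolution: divide G(u, v) by H(u, v). Wherever |H| is close to zero
    # the division would explode, which is exactly where the noise term dominates;
    # clamping by eps crudely limits that amplification.
    H = np.fft.fft2(h, s=g.shape)
    F_hat = np.fft.fft2(g) / np.where(np.abs(H) > eps, H, eps)
    return np.real(np.fft.ifft2(F_hat))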
Filtering Reconstruction Techniques These techniques are rather classical and they make use of the fact that noise signals usually have higher frequencies than image signals. This means that image signals die out faster than noise signals in high frequencies. By selecting the proper filter, one can get a good estimate of the original image signal, by reducing the effect of noise. Examples of the reconstruction filters are the deconvolution filter and the Wiener filter. Deconvolution Filter. This filter is based on the concept of inverse filtering, in which the transfer function of the degraded system is inverted to produce a restored image. Figure 7 shows typical spectra of a deconvolution filtering image restoration system. If no noise is added to the system, the image is perfectly reconstructed. The presence of noise, however, will add a reconstruction error, the value of which can become quite large at spatial frequencies for which f(x, y) is small. This reflects directly on regions of the image with a high density of details because they have higher spatial frequencies than other regions. Wiener Filter. This filter uses the mean-squared error (MSE) criterion to minimize the error signal between the original and degraded image signals. The mean-squared restoration error is given by
where f(x, y) and g(x, y) are as before. The Wiener filter equation is
where Pn(m, n) and Pf(m, n) are the power spectra of the noise and signal, respectively. The Wiener filter acts as a band-pass filter. At low spatial frequencies, it acts as an inverse filter, whereas at higher frequencies, it acts as a smooth rolloff low-pass filter. This filter is not very suitable for use in cases in which images are investigated by the human eye. The MSE technique treats all errors equally, regardless of their spatial location in the image. The human eye, on the other hand, has a higher degree of tolerance to errors in darker areas of the image than elsewhere. Another limitation of this filter is that it cannot handle dynamically changing image and noise signals. Another type of filter that is closely related to the Wiener filter is the power spectrum equalization (PSE) filter, the equation for which is
The power spectrum obtained by this filter for the reconstructed image is identical to that of the original signal.
Linear Algebraic Restoration Techniques
These techniques utilize matrix algebra and discrete mathematics for solving the problem of image restoration. To
extract the discrete model for the restoration system, we make the following assumptions:
The digitized original image f(x, y) and the restored image fr(x, y) are stored in the M² × 1 column vectors f and fr, respectively.
The digitized degrading function h(x, y) is stored in a square M² × M² matrix H.
The degraded image g(x, y) and the noise η(x, y) are stored in the M² × 1 column vectors g and n, respectively.
We can express the observed (degraded) image vector in the compact form g = Hf + n [Eq. (6)].
We use this model to derive some of the algebraic restoration techniques, namely, the unconstrained reconstruction technique, the constrained reconstruction technique, and the pseudoinverse filtering technique. Unconstrained Reconstruction. If we know very little about the noise n, then we try to find an estimate image fr , such that Hfr approximates g in a least-squares manner. This can be accomplished by minimizing the norm of the noise n. Squaring the norm of both sides of Eq. (6) after substituting for f by the estimate vector fr and moving Hfr to the other side of the equation, we get
where ‖a‖² is the square of the norm of the vector a and is given by ‖a‖² = a′a, where a′ is the transpose of the vector a. Consider the error function E, where
Then our goal is to minimize E with respect to fr . This can be accomplished by taking the derivative of E with respect to fr , and equating the result to zero, that is
Assuming that H⁻¹ exists, the solution of this equation is given as fr = H⁻¹g.
Constrained Reconstruction. To account for the noise term in Eq. (6), we introduce the square M² × M² matrix Q to represent some linear operator on f. By selecting different Q’s, we are able to set the goal of restoration as desired. Equation (8) is now modified to
where λ is a constant called the Lagrange multiplier. Again we try to minimize the error function E by taking its derivative with respect to fr and equating the result to zero, that is
Figure 7. Typical spectra of a deconvolution filtering image-restoration system. The actual response of the restoration process is first inverted. Then signals with frequencies higher than the cutoff frequency are removed. The new response is finally inverted to result in a response very close to the theoretical one.
Solving for fr , we get
where α = 1/λ is a constant that we adjust to satisfy the constraint. Pseudoinverse Filtering. Pseudoinverse filtering is a special case of the constrained reconstruction techniques. In this technique, the constraint matrix Q is the M² × M² identity matrix. Equation (13) then becomes
If α = 0, this equation reduces to Eq. (10), which represents the unconstrained restoration technique.
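A small one-dimensional illustration of these formulas is sketched below; the circulant 3-point blur matrix, the signal length, the noise level, and the value of α are assumptions made here, and the constrained solution is written in the standard least-squares form (H′H + αI)⁻¹H′g with Q taken as the identity, as in the pseudoinverse case above.

import numpy as np

M = 64
f = np.zeros(M)
f[20:40] = 1.0                                     # simple piecewise-constant "object"

H = np.zeros((M, M))
for i in range(M):
    for k in (-1, 0, 1):
        H[i, (i + k) % M] = 1.0 / 3.0              # circulant 3-point moving-average blur

rng = np.random.default_rng(0)
g = H @ f + rng.normal(0.0, 0.01, M)               # observed degraded signal, g = Hf + n

f_unconstrained = np.linalg.solve(H, g)            # unconstrained solution fr = H^(-1) g; noise is amplified

alpha = 0.05                                       # constraint weight, with Q = identity
f_constrained = np.linalg.solve(H.T @ H + alpha * np.eye(M), H.T @ g)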
IMAGE ENHANCEMENT Image enhancement refers to a group of processes that aim toward making the processed image much better than the unprocessed image. In this sense, image enhancement is closely related to image restoration. Enhancement techniques are basically heuristic procedures, that are designed to manipulate an image by taking advantage of the
human vision system (2). Image enhancement is used to remove the noise that may be added to the image from several sources, including electrical sensor noise and transmission channel errors. There are many enhancement techniques to improve the input image. Some of these techniques are gray-scale modification, point operators, contrast enhancement and filtering. Gray-Scale Modification The gray-scale modification technique is very effective to modify contrast or dynamic range of an image. In this technique, the gray scale of an input image is changed to a new gray scale according to a specific transformation. The specific transformation is the relation between the intensity of the input image to the intensity of the output image. By properly choosing this transformation, one can modify the dynamic range or the contrast of an input image. For each input pixel the corresponding output pixel is obtained from the plot or the look-up table. The transformation desired depends on the application. In many applications a good transformation is obtained by computing the histogram of the input image and studying its characteristics. The histogram of an image represents the number of pixels that has a specific gray value as a function of all gray values. The gray-scale transformation is very simple and very efficient, but it has the following two limitations: 1. It is global—the same operation is applied to all the pixels that have the same gray level in the image. 2. The transformation that is good for one image may not be good for others. Point Operators Point processing is a group of simple techniques used for image enhancement. Point operation is a direct mapping from an input image to an output image, where the pixel value in the enhanced output image depends only on the value of the corresponding pixel in the input image (5). This mapping can mathematically be given by
where Po is the output enhanced pixel, Pi is the corresponding input pixel and N is the point operator. In case of discrete images, a point operator may be implemented by a look-up table (LUT), in which for each pixel in the output image, there is a corresponding pixel in the input image. It should be noted that point operators affect the gray level of pixels in an image but have no effect on the spatial relationships between pixels. Point operators may be used in applications such as photometric calibration, contrast enhancement, display calibration, and contour lines. Types of Point Operations. Point operators are generally classified as either linear or nonlinear operators. Linear Point Operations. The output gray level is a linear function of the input gray level. The mapping function is
given by Go = aGi + b,
where Go and Gi are the gray levels of the output pixel and input pixel, respectively. By changing the values of the coefficients a and b, different effects may be accomplished. For example, if a = 1 and b = 0, the gray levels of the image pixels do not change. A negative value of a has the effect of complementing the image. That is, dark areas of the image become light and light areas become dark. Figure 8 shows a plot of a linear point operator. Nonlinear Monotonic Point Operations. Nonlinear point operations are mainly used to modify the midrange gray level. One type of nonlinear point operation increases the gray level of midrange pixels, while leaving dark and light pixels little changed. This type of operator may be given by
where Gm is the maximum gray level and A is a parameter that determines the amount of increase (A > 0) or decrease (A < 0) in the midlevel gray range. The logarithmic operator is a point operator that maps each pixel value to its logarithm. The logarithmic mapping function is given by
In an 8 bit format, the value of c is selected such that the maximum output value is 255. Contrast Enhancement Contrast is defined as the range of brightness values present in an image. Contrast enhancement is a very important issue when the final output image has to be evaluated by a human observer. The human eye can distinguish up to 64 gray levels, compared to 256 different gray levels that can be achieved with an 8 bit imagery system. The range of brightness values collected by a sensor may not match the capabilities of the output display medium. Contrast enhancement involves changing the range of brightness in an image in order to increase contrast. Selecting a suitable method for contrast enhancement depends on two main factors: the state of the original image and the requirements from the final image. In many cases, more than one step of contrast enhancement is required to obtain the desired output. Two common contrast enhancement methods are the contrast-stretching technique and window-and-level technique. Contrast Stretching. This process expands the range of brightness of an original digital photograph into a new distribution. As a result, the total range of sensitivity of the display device can be utilized. Contrast stretching makes minute variations within the image data more obvious. Contrast stretching may be accomplished by both linear and nonlinear methods. Examples of linear contrast enhancement are the minimum-maximum linear contrast stretch, percentage linear contrast stretch, and piecewise
Figure 8. Linear point operator. By changing the slope a and vertical intersection b, we can use point operators to produce different effects in the processed image.
linear contrast stretch. An example of nonlinear contrast enhancement is the histogram equalization. Window and Level. If the brightness range of the element of interest in an image is limited, then a method known as Window and Level can be used to enhance its contrast on the expense of other elements. All signal levels below the desired range will be mapped onto the same low output level and all those with higher levels will be mapped onto the same high output level. Only pixels within the desired range will be represented with an acceptable range of contrast. Two parameters define the desired range of contrast: the middle point Level and the width of the range Window. Figure 9 shows how the window-and-level technique affects the histogram of the image. If the original image has a poor contrast level, then the window-and-level technique alone will not produce satisfactory results. A better technique would be to use one of the contrast-stretching methods to increase the contrast range first and then apply the window-and-level method on the desired element, taking into consideration its new range of contrast. Low-Pass Filtering Low-pass filtering reduces the high-frequency components while retaining the low-frequency components. The lowfrequency components of an image constitute most of the energy in an image. In other words, low-pass filtering removes a large amount of noise at the expense of removing a small amount of the actual signal. Low-pass filtering can be used efficiently to remove additive random and multiplicative noise, but at the same time it blurs the processed image. Blurring is the primary limitation of low-pass filtering. Figure 10 shows a photograph of the famous Taj-Mahal, one of the Seven Wonders of the World. Figure 11 shows the effect of low-pass filtering on the same image.
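The contrast-stretching and window-and-level operations described above can be sketched in a few lines, assuming an 8-bit grayscale image in a NumPy array; the parameter names and the clipping behavior outside the selected range are choices made here for illustration.

import numpy as np

def linear_stretch(G, low, high, out_max=255):
    # Map gray levels in [low, high] linearly onto [0, out_max]; clip everything outside.
    out = (G.astype(float) - low) * out_max / float(high - low)
    return np.clip(out, 0, out_max).astype(np.uint8)

def window_and_level(G, level, window, out_max=255):
    # Stretch the band centered at `level` with width `window` over the full output range;
    # levels below the band map to 0 and levels above it map to out_max.
    low = level - window / 2.0
    return linear_stretch(G, low, low + window, out_max)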
High-Pass Filtering High-pass filtering reduces the low-frequency components while preserving the high-frequency components of an input image. Since the high-frequency components generally correspond to edges or fine details of an image, highpass filtering increases the contrast and thus sharpens the edges of the image. High-pass filtering can be used to remove image blur but at the same time it tends to increase background noise power. Figure 12 shows the effect of high-pass filtering on Fig. 10. Median Filtering Median filtering is a nonlinear digital technique used for image enhancement. The basic idea behind median filtering is the removal of noise by finding the median of neighboring pixels and assigning it to the center point. It is used to reduce impulsive noise and to enhance edges while reducing random noise. Figure 13 shows the effect of filtering the photograph image with a median filter with a window of three pixels. OBJECT RECOGNITION Object recognition includes the process of determining the object’s identity or location in space. The problem of object or target recognition starts with the sensing of data with the help of sensors, such as video cameras and thermal sensors, and then interpreting these data in order to recognize an object or objects. We can divide the object-recognition problem into two categories: the modeling problem and the recognition problem. Image Segmentation Image segmentation is the process of partitioning a digital image into disjoined, meaningful regions. The meaningful regions may represent objects in an image of threedimensional scene, regions corresponding to industrial,
Figure 9. Effect of the window-and-level technique on the image histogram. By using this technique, we can select and enlarge any portion of the image histogram we need.
residential, agricultural, or natural terrain in an aerial recognizance application, and so on. A region is a connected set of pixels and the objects are considered either fourconnected, if only laterally adjacent pixels are considered, or they can be eight-connected, if diagonally adjacent pixels are also considered to be connected. Image segmentation is an efficient and natural process for humans. A human eye (or rather, mind) sees not a complex scene, but rather a collection of objects. In contrast, image segmentation is not an easy task in digital image processing, and it may become a serious problem if the number of objects is large or unknown or if the boundaries between objects are not clear. Three of the most commonly used techniques for digital image segmentation are the gray-level thresholding technique, gradient-based segmentation technique, and the region growing technique. A brief description of the three techniques is given below. Gray-Level Thresholding. This technique is an example of the regional approach, which implies grouping pixels into distinct regions or objects. When using this technique for
image segmentation, all pixels that have gray levels equal to or above a certain threshold are assigned to the object. The rest of the pixels are assigned to the object’s background. If the image contains more than one object, then more threshold levels can be used to do the segmentation. Gradient-Based Segmentation. This technique concentrates on the boundaries between different regions. Reference 5 describes three techniques that are based on the gradient technique. They are the boundary tracking technique, the gradient image thresholding, and Laplacian edge detection. We describe each of these techniques briefly. Boundary Tracking. This technique starts with scanning the image for the pixel with the highest gradient. This pixel is for sure on the object’s boundary. Then a 3-by-3 pixel segment (with the original pixel in its center) is used as tracking probe for the next pixel with the highest gradient, in the neighbor of the original one. This pixel is considered part of the object’s boundary and the probe is moved to the new pixel to search for another boundary pixel. The process is repeated until a closed contour is formed. If, at
Figure 10. Photo of the Taj-Mahal with arbitrary noise.
Figure 11. Low-pass filtering of the image removes most of the noise but blurs the image.
Figure 12. High-pass filtering of the image sharpens the edge details but it increases the background noise signal power.
Figure 13. Median filtering reduces random noise while enhancing edges.
any time, three adjacent pixels in the probe have the same highest gray level (i.e., gradient), the one in the middle is selected. If two have the same highest gradient, the choice is arbitrary. Gradient Image Thresholding. This technique is based on the following phenomenon. If we threshold a gradient image at moderate gray level, the object and the background stand below the threshold and most edge points stand above it (5). The technique works like this: We first threshold the image at a low level to identify the object and background, which are separated by bands of edge points. We then gradually increase the threshold level, causing both the object and background to grow. When they finally touch, without merging, the points of contact define the boundary (5). Laplacian Edge Detection. The Laplacian is a scalar, second-order derivative operator that, for a twodimensional function f(x, y), can be given by
The Laplacian will produce an abrupt zero-crossing at an edge. If a noise-free image has sharp edges, the Laplacian can find them. In the presence of noise, however, low-pass filtering must be used prior to using the Laplacian. Region Growing. Region growing is one of the artificial intelligence techniques for digital image segmentation. Unlike other techniques, this one is capable of utilizing many properties of the pixels inside the different objects, including average gray level, texture, and color information. The technique works as follows: First, the image is partitioned into many small regions. Then the properties of the pixels in each region are computed. In the next step, average properties of adjacent regions are examined. If the difference is significant, the boundary is considered strong and it is allowed to stay. Otherwise, the boundary is considered weak, and the two regions are merged. The process is repeated until there are no more boundaries weak enough to allow regions merging. This technique is especially useful in cases where prior knowledge of the scene features is not available. Texture We can define texture as an attribute representing variations in brightness among the image pixels. This variation may be in the level of brightness, special frequencies, or orientation. Computer-generated images look very realistic for objects of metallic or plastic materials but tend to look smoother than necessary in images of organic materials, such as human skin. Surface roughness is a feature closely related to texture. Texture is sometimes seen as a measure of surface roughness. Random texture is a type of texture caused by noise from cameras, sensors, or films. This type has no recognizable pattern. Satellite images are a good example of images where patterned texture is most recognizable. Different uses of land, such as agriculture or
construction, produce different textures. Some of the operators commonly used to detect texture in images are the rank operators, range operators, and gray-level variation. The rank operator measures the range between the maximum and minimum brightness values in the image. Larger range values correspond to surfaces with a larger roughness. The range operator converts the original image into one in which brightness represents texture (4). Gray-level variation can be measured by the sum of the squares of the differences between brightness of the center pixel and its neighbors. The root of resultant value is then calculated to give the root-mean square (rms) measurement of the surface roughness. Research indicates that the human eye is insensitive to texture differences of an order higher than a second (5). Object Modeling Modeling is the process of representing a real system in an abstract manner, in order to study its different features. It is widely used in all fields of engineering. In control engineering, for example, a mathematical model of a physical system is extracted to facilitate the study of its performance under different circumstances. In object recognition, a model is created for the object under investigation. This model is then compared to different models that are stored in a database. If this model matches one of the available models, say, the model of object A, then the investigated model is classified as object A. In an airplane recognition system, for example, the database contains models of different types of airplanes. There are two types of models in object modeling systems: 1. Geometric Models These models are represented by surface or volume primitives with attributes and relationships. The primitives and their relationships can be described geometrically. These models are used for regular manmade objects because they can be described geometrically. 2. Symbolic Models In these models, a three-dimensional object is represented by a graph whose nodes are the 3-D primitives and whose arcs are the attributes and relations. These models are good for representing objects that are irregular in natural. The primitives of the geometric models depend on the application. In general, there are three classes of primitives: 1. Surface or boundary primitives 2. Sweep primitives 3. Volumetric primitives The boundary primitives are bounded faces of the object. The sweep primitives are general cylinders or cones. The volumetric primitives can be categorized into three types as follows: 1. Spatial Primitives These primitives are 3-D arrays of cells, which may be marked as filled with matter or not.
2. Solid Geometric Primitives These primitives are regular solid primitives, which may be combined by modified versions of the Boolean set operators—union, difference, and intersection. 3. Cell Primitives These primitives have complex shapes and they can be connected together by just one combining operator (glue). Pattern Recognition Patterning is a process to extract features and recognize objects and patterns from a given image. Usually there are two approaches to solve the problem of pattern recognition: the RF processing approach and the image-processing approach. In the RF processing approach, electromagnetic waves are transmitted and then reflected from the target. The reflected waves are analyzed to recognize the target. The RF processing is usually used when the target is very far or out of range of sight. In the image processing approach, a picture or an image of the field is taken. The picture is then processed and analyzed to recognize the objects in the picture. The recognition task is usually done in two phases: 1. Training Phase In this phase a pictorial database or a knowledge base is created offline, which contains data about all objects under consideration. If the recognition system is designed to recognize airplanes, for example, the database may include data about all airplanes under consideration. 2. Recognition Phase In this phase, the incoming picture is first processed to enhance it. The picture is analyzed to extract the features that characterize it. Finally the features are matched with the features stored in the database to recognize the object or objects. WAVELETS Wavelets are a relatively new area of signal processing and offer a potentially very useful approach to image analysis. Because of the fact that one has local control over resolution, many cue features of different size can be extracted from an image. Wavelets also show promise as an aid in understanding how early vision can be emulated and further understood. Wavelet Transforms It is well known that in signal processing, the more compact your method of representing information the better. In 1822, Jean Joseph Fourier devised a very efficient way to represent the information content of a signal. His idea was to represent a signal as the sum of its frequencies. Transmitted power spectra, carrier frequencies, brain activity, NMR signals—all these global descriptions provide a lot of information in a compact manner. However, most of the power of this kind of representation vanishes when one tries to represent information that changes its nature during the course of signal recording. A good example of this kind of a signal is a musical score. A global analysis of
a recording of a musical selection with a Fourier transform (FT) will indicate the specific notes played within the piece of music, but there is no way to recover the timing of the notes. The musical score, on the other hand, indicates each note that was played and the time it was played. Wavelets represent a signal in a way such that local frequency information is available for each position within the signal. Wavelets are able to analyze a signal based on the position-varying spectra. The multiresolution pyramidal decomposition that results is also well matched to fractals and shows great potential for the removal of background noise, such as static in recordings, and for pattern recognition and texture segmentation. In the following section we discuss the conceptual understanding of how wavelets can be used in signal analysis. Wavelet Transform Versus Fourier Transform A fairly old method of computing a local spectrum is to apply the FT to one specific piece of the signal at a time. This is the idea behind what is called the windowed Fourier transform (WFT). Basically, the implementation involves use of a moving rectangular piece of the whole image to isolate a portion of the signal of interest, which is then Fourier transformed. As the window slides along different positions, the WFT gives the spectra at these positions. This kind of analysis has a fundamental problem, however [the mathematics of which is, in fact, similar to the Heisenberg uncertainty principle in quantum mechanics (QM)]. Multiplying the signal by the window function results in convoluting or mixing the signal spectrum with the spectrum of the window. Add to this the fact that as the window gets smaller, its spectrum gets wider, and we have the basic dilemma of localized spectra: The better we determine the position of the signal the poorer we localize the spectrum. This is analogous to the case in QM where increased precision in a description of, say, the momentum of an electron reduces the precision available in the position of that electron. Very accurate determinations can be made or computed, but both are not available simultaneously to an unlimited degree of precision. Correspondingly, there is a fundamental physical limit to the degree of accuracy with which the frequency content can be known at a particular position and simultaneously with which the location where the signal is being analyzed can be known. In 1946, Dennis Gabor introduced a version of the WFT that reduced this uncertainty somewhat. The Gabor transform uses a Gaussian profile for the window, as a Gaussian function minimizes this uncertainty. However, the underlying idea of localizing a spectrum of a signal by windowing the signal needs to be reconsidered. Obviously, caution must be taken in the selection of the signal. Careful attention to the placement of the window, however, is not an easy task for realistic time-varying signals. We are in fact trying to do two different things simultaneously. Frequency is a measure of cycles per unit time or signal length. High-frequency oscillations take much less signal length or time than do low-frequency oscillations. High frequencies can be well localized in the overall signal with a short window, but low-frequency localization requires a long window.
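The windowed Fourier transform described above can be sketched in a few lines; the Hann window, the window length, the hop size, and the test signal whose frequency changes halfway through are all choices made here for illustration.

import numpy as np

def wft(x, window_len=64, hop=32):
    # Slide a fixed-length window along the signal and take the FFT of each piece.
    # Each row of the result is the local spectrum at one window position.
    w = np.hanning(window_len)
    frames = [np.fft.rfft(x[s:s + window_len] * w)
              for s in range(0, len(x) - window_len + 1, hop)]
    return np.array(frames)

# A "musical score"-like test signal: 50 Hz for the first half, 200 Hz for the second.
fs = 1000.0
t = np.arange(0.0, 1.0, 1.0 / fs)
x = np.where(t < 0.5, np.sin(2 * np.pi * 50 * t), np.sin(2 * np.pi * 200 * t))
spectra = wft(x)    # the change of frequency shows up at a definite window position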
The wavelet transform takes an approach that permits the window size to scale to the particular frequency components being analyzed. Wavelets are, generally speaking, functions that meet certain requirements. The name wavelet originates from the requirements of integrating to zero by oscillating about the x axis and of being well localized. In fact, there are many kinds of wavelets: smooth wavelets, wavelets with simple mathematical expressions, wavelets that are associated with filters, and so on. The simplest wavelet is the Haar wavelet. Just as the sine and cosine functions are used in Fourier analysis as basis functions, wavelets are used as the basis functions for representing other functions. After the “mother” wavelet, W(x), is chosen, it is possible to make a basis out of translations and dilations of the mother wavelet, with a the dilation (scale) parameter and b the translation parameter. Special values for a and b are used, namely $a = 2^{-j}$ and $b = k \cdot 2^{-j}$, where k and j are integers. This particular choice of a and b gives a so-called sparse basis. The sparse-basis choice of parameters also makes the link to multiresolution signal analysis possible. There are many important characteristics of wavelets that make them more flexible than Fourier analysis. Fourier basis functions are localized in frequency but not in time; small frequency changes in an FT cause changes everywhere in the time domain. Wavelets can be localized in frequency and scale by the use of dilations, and in time by the use of translations. This ability to perform localization is useful in many applications. Another advantage of wavelets is the high degree of coding efficiency, or in other words, data compression available. Many classes of functions and data can be represented by wavelets in a very compact way. This efficient coding of information makes wavelets a very good tool for use in data compression. One of the more popular examples of this sparse-coding capability is the well-known fact that the FBI has standardized the use of wavelets in digital fingerprint compression. The FBI gets compression ratios on the order of 20:1, and the difference between the original and the wavelet-compressed and decompressed image is negligible. Many more applications of wavelets currently exist. One involves the denoising of old analog recordings. For example, Coifman and co-workers at Yale University have been using wavelets to take the static and hiss out of old recordings; they have cleaned up recordings of Brahms playing his First Hungarian Dance for the piano. The same idea extends to noisy data sets, which can also be cleaned up by using wavelets. Wavelet transforms are typically even faster than the fast Fourier transform (FFT). The data are basically encoded into the coefficients of the wavelets. The computational complexity of the FFT is of the order of n log n, where n is the number of coefficients, whereas for most wavelets the complexity is of the order of n. Figure 14 shows how wavelets are used to process data. Many data operations, such as multiresolution signal processing, can be done by processing the corresponding wavelet coefficients. How wavelets work is best illustrated by way of an example. The simplest of all wavelets is the Haar wavelet, shown in Fig. 15. The Haar wavelet, W(x), is a step function that assumes the values 1 and −1 on the intervals [0, ½) and [½, 1), respectively. The Haar wavelet is more than 80 years old and
has been used for various applications. It can be shown that any continuous function can be approximated by Haar functions. Dilations and translations of the mother wavelet W(x) define an orthogonal basis of the space of all square-integrable functions $L^2(\mathbb{R})$. This means that any element of $L^2(\mathbb{R})$ can be represented as a linear combination of these basis functions. The orthogonality of the wavelet pairs is checked by the condition

$$\int W_{j,k}(x)\, W_{j',k'}(x)\, dx = 0$$

whenever $j = j'$ and $k = k'$ are not both satisfied. The constant that makes the orthogonal basis orthonormal is $2^{j/2}$.

Figure 14. The use of wavelets to process raw data.

Figure 15. The Haar transform basis functions. The basis functions in each stage cover the whole range (from 0 to 1).

Wavelet Transforms and Their Use With Clutter Metrics
Wavelets and wavelet transforms are essentially an elegant tool from the field of mathematics that can be applied to the area of signal processing. Wavelets are used for removing noise or unwanted artifacts from images as well as from acoustic data. Clutter is a term that refers to the psychophysical task of perceiving objects in a scene in the presence of similar objects. There are many definitions of clutter in use at the moment; however, none take into account the multiresolution capability of wavelets.
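Returning to the Haar example above, the following is a minimal sketch of a level-by-level Haar decomposition of a one-dimensional signal, using the averaging/differencing form of the transform. The normalization by √2 and the sample signal are illustrative assumptions rather than details taken from the text.

```python
import numpy as np

def haar_decompose(signal):
    """Multiresolution Haar decomposition of a length-2**n signal.

    At each level the signal is split into pairwise averages (a coarser
    approximation) and pairwise differences (the detail/wavelet coefficients).
    """
    coeffs = []
    approx = np.asarray(signal, dtype=float)
    while len(approx) > 1:
        evens, odds = approx[0::2], approx[1::2]
        detail = (evens - odds) / np.sqrt(2.0)   # Haar wavelet (detail) coefficients
        approx = (evens + odds) / np.sqrt(2.0)   # coarser-scale approximation
        coeffs.append(detail)
    coeffs.append(approx)                        # final overall average
    return coeffs

signal = [4.0, 6.0, 10.0, 12.0, 8.0, 6.0, 5.0, 5.0]   # length must be a power of 2
for level, c in enumerate(haar_decompose(signal)):
    print(f"level {level}: {np.round(c, 3)}")
```

Each successive level halves the resolution, which is the multiresolution pyramid referred to above.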
IMAGE CODING
Image coding refers to groups of techniques that are used to represent three-dimensional (3-D) objects in forms manageable by computers for the purpose of object classification and recognition. Object moments, Fourier descriptors, and expert systems are examples of these techniques.

The Moment Technique
An image of a 3-D object may be described and represented by means of a set of its moments. The set of moments of an object can be normalized to be independent of the object's primary characteristics, namely translation, rotation, and scale. Hence, moments can be used to recognize 3-D objects. There are different types of moments that may be used to solve the problem of image coding; they differ in their properties and their computational complexity.

Conventional Moments. The conventional two-dimensional moment of order p + q of a function f(x, y) is defined by a summation over the area of the image. If the input image is a binary image, then r = 0; otherwise, r = 1. Here f(x, y) represents the gray-level value of the pixel at point (x, y). The conventional moment sequence {M_pq} is uniquely determined by f(x, y), and f(x, y) is in turn uniquely determined by the sequence {M_pq}.
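A minimal sketch of the moment computation follows, assuming the common discrete form $M_{pq} = \sum_x \sum_y x^p y^q f(x, y)$; the small binary test image is purely illustrative.

```python
import numpy as np

def conventional_moments(image, max_order=2):
    """M[p, q] = sum over all pixels of x**p * y**q * f(x, y), up to max_order."""
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    M = np.zeros((max_order + 1, max_order + 1))
    for p in range(max_order + 1):
        for q in range(max_order + 1):
            M[p, q] = np.sum((xs ** p) * (ys ** q) * image)
    return M

# Small binary "object" in an 8 x 8 test image.
img = np.zeros((8, 8))
img[2:5, 3:7] = 1.0

M = conventional_moments(img)
area = M[0, 0]
centroid = (M[1, 0] / M[0, 0], M[0, 1] / M[0, 0])   # used for translation normalization
print("area:", area, "centroid (x, y):", centroid)
```

The zeroth-order moment gives the area and the first-order moments give the centroid, which is the starting point for the translation normalization discussed below.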
It is always possible to find all orders of moments for an image, but a truncated set may offer a more convenient and economical representation of the essential shape characteristics of an object. The conventional moments of an object are changed by the primary characteristics of the object in the following manner: 1. Scale Change The conventional moments of f(x, y) after a scale change by a factor λ are defined by
2. Translation The moments of f(x, y) translated by an amount (a, b) are defined by
3. Rotation The moments of f(x, y) after a rotation by an angle θ about the origin are defined by a corresponding transformation of the original moments.
4. Reflection A reflection about the y axis (x → −x) results in a change of sign of all moments that depend upon an odd power of x.
For object recognition and image analysis, a truncated set of moments is needed. A complete moment set of order n is always defined to be the set of all moments of order n and lower; it contains (n + 1)(n + 2)/2 elements.

Rotational Moments. The rotational moments for an image f(x, y) are defined in terms of the polar coordinates $r = \sqrt{x^2 + y^2}$ and $\theta = \tan^{-1}(y/x)$, where n is the order of the moment and $n \ge L \ge 0$. A truncated set of rotational moments of order n contains the same information as a truncated set of conventional moments of the same order. The rotation normalization of these moments is very simple, but the translation and scale normalization is very difficult and time consuming. The standard rotational moments can be obtained by normalizing the rotational moments with respect to scale, translation, and rotation.

Orthogonal Moments. There are two types of orthogonal moments.
1. Legendre Moments These moments are based on the conventional moments. The Legendre moment of order p + q is defined by projecting f(x, y) onto the product $P_p(x)\,P_q(y)$,
where $P_p(x)$ and $P_q(y)$ are Legendre polynomials, the classical polynomials orthogonal on the interval [−1, 1].
Advantages are as follows: (a) The inverse transform to obtain f(x, y) from the sequence {L_pq} is very simple because of the orthogonality property. (b) They are more suitable for optical methods than the conventional moments, since the range needed to record Legendre moments is less than that needed for conventional moments. The disadvantage is that much computation is required to normalize the Legendre moments with respect to scale, translation, and rotation.
2. Zernike Moments These moments are defined based on the rotational moments. The Zernike moment of order n is defined in terms of the Zernike polynomials, where $[V_{nL}]^*$ is the complex conjugate of the Zernike polynomial. The Zernike polynomials are defined as functions of r and θ, and r is the same as before. The rotation normalization of Zernike moments is very simple, but the translation normalization is very difficult. The inverse transform of f(x, y) from a sequence {A_nL} is very simple because of the orthogonality property.

Object Classification Using Moments. A complete moment set can be created from any of the types of moments discussed above. However, the conventional moments require the minimum computation time for scale and translation normalization; the only operation still to be considered is the rotation normalization. Simple parallel hardware architectures may offer significant time savings in the rotation normalization process. Hence, moment invariants and standard moments are considered for object classification.

Fourier Descriptors
The Fourier descriptor (FD) method is a well-known method of describing the shape of a closed figure. Depending on the technique used to calculate the FDs, the FDs can be normalized to be independent of the translation, the rotation, and the scale of the figure. The method involves two steps:
1. The transformation of a binary matrix to a polygonal boundary
2. The calculation of the FDs of the closed curve

Calculation of Fourier Descriptors. There are different techniques used to calculate the FDs of a closed curve. The two most common techniques are discussed below.

Technique Suggested by Cosgriff and Zahn. This technique involves the following steps:
1. A starting point on the boundary is selected.
2. The boundary is traced in the clockwise direction, and at each pixel the last and next neighbors are found and noted.
3. The vertices of the curve are found.
4. The length between any two successive vertices is calculated. The length between the starting point and a vertex $V_i$ is denoted by $L_i$, and it is equal to $\sum_{k=1}^{i} L_k$.
5. The perimeter is calculated by using the formula $L = \sum_{i=1}^{m} L_i$, where m is the number of vertices.
6. The angular change (φ) at each vertex is calculated.
7. Next, the Fourier series coefficients are calculated, where $\phi_k$ is the angular change at vertex k, $L_k$ is the arc length to vertex k, L is the perimeter, and m is the number of vertices.
8. Lastly, the polar coordinates are calculated.
The $A_n$'s and the $\alpha_n$'s are the Fourier descriptors. It is usually sufficient to represent a closed curve by the first 10 harmonics. The calculated FDs are put in a vector, called the feature vector. The FDs calculated by this technique are independent of rotation, translation, and scale.

Technique Used by Wallace and Wintz and by Kuhl et al. This technique retains almost all shape information and is computationally efficient because it uses the FFT. The steps involved in this technique are explained below.
1. A starting point on the boundary is selected.
2. The original contour is sampled and then replaced by a piecewise linear contour.
3. The piecewise linear contour is represented by a chain code.
4. The length of the contour is computed.
5. The contour is uniformly resampled at a spacing chosen to make the total number of samples a power of 2.
6. The FDs are computed by simply taking the FFT of the resulting samples.
7. The FDs are normalized so that the contour has a standard size, rotation, translation, and starting point.
First, we set A(0) equal to zero to normalize position. Second, scale normalization is accomplished by dividing each coefficient by the magnitude of A(1). Third, rotation and starting-point normalization are accomplished by making the phases of the two coefficients of largest magnitude zero. The normalized FDs (NFDs) are placed in a vector, called the feature vector, which is used to recognize the object. To recognize an object, there must be a training set containing all the possible feature vectors for all possible types of objects. The recognition of an unknown feature vector can then be achieved by using one of the following two techniques:
1. The distance-weighted k nearest neighbor rule
2. The χ² technique
Because FDs are independent of only one orientation angle (θ_y), the training set should contain feature vectors for any given angles θ_x and θ_z.

Expert Systems
Expert systems have been successfully used in different fields of engineering as a substitute for traditional methods. An expert system is a system designed to accomplish a certain goal by following an appropriate set of rules, which is part of a larger, more general group of rules given by experts in the specific field. Expert systems consist of two units: the knowledge-base management unit and the knowledge-base unit. The knowledge-base management unit interfaces between the knowledge base and the outside world, and hence makes it possible to enter the basic knowledge and to structure this knowledge for efficient retrieval and decision making.

Object Representation. The knowledge base has two segments. One segment contains the specific facts that define the problem, and the other segment contains the rules that operate on the facts during the problem-solving process. The knowledge in the knowledge base should be structured in a powerful way to facilitate the searching process and to reduce the searching time. Three knowledge structures are common:
1. Relational Knowledge-Base Structure This takes the form of a table.
2. Hierarchical Knowledge-Base Structure or Tree Structure Each parent has one or more descendants, but each descendant has only one parent.
3. Network Structure Each parent has one or more descendants, and each descendant can have one or more parents.
It is easy to update the knowledge in a relational structure, but searching takes a long time. The hierarchical structure reduces the searching time. The network structure allows a many-to-many relation. In general, each knowledge base has the following features:
1. A fact base that is equivalent to a conventional database.
2. A rule base that consists of production rules that operate on the fact base.
3. Confidence factors, which indicate the degree of confidence that can be placed in facts and rules.
4. Metarules, which determine when and how the rules in the rule base are to be executed.
The knowledge-base management unit accepts the extracted features of an object and then represents them in a way that is appropriate for symbolic processing. There are several ways to represent these features: first-order predicate calculus, frames, and semantic networks. In the problem of recognizing a tank, for example, a first-order predicate represents the fact that a gun is located over a small rectangle in the form (6) OVER (GUN, SMALL-RECTANGLE)
A frame for the same fact might take the following form:
Name of frame: F1
Type of frame: OVER
Object 1: GUN
Object 2: SMALL-RECTANGLE
In a semantic network, each node contains an object, and the lines between the nodes represent the relationships. The knowledge base is unique to the particular problem; it contains rules that show the relationship among several pieces of information. The most common rule is the production rule, such as
IF THERE IS A GUN, AND THE GUN IS OVER A RECTANGLE, AND THE RECTANGLE IS OVER WHEELS, THEN THERE IS A TANK.
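A minimal sketch of how such a production rule could be matched against asserted facts by forward chaining is shown below. The tuple encoding of facts and rules is an illustrative assumption, not the article's notation.

```python
# Facts are tuples; each rule pairs a list of antecedent facts with one consequent.
facts = {
    ("EXISTS", "GUN"),
    ("OVER", "GUN", "RECTANGLE"),
    ("OVER", "RECTANGLE", "WHEELS"),
}

rules = [
    {
        "antecedents": [("EXISTS", "GUN"),
                        ("OVER", "GUN", "RECTANGLE"),
                        ("OVER", "RECTANGLE", "WHEELS")],
        "consequent": ("EXISTS", "TANK"),
    },
]

def forward_chain(facts, rules):
    """Repeatedly fire every rule whose antecedents are all asserted facts."""
    derived = set(facts)
    changed = True
    while changed:
        changed = False
        for rule in rules:
            if rule["consequent"] not in derived and all(a in derived for a in rule["antecedents"]):
                derived.add(rule["consequent"])   # the left-hand side matched: assert the consequence
                changed = True
    return derived

print(("EXISTS", "TANK") in forward_chain(facts, rules))   # True: there is a tank
```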
Here we have three antecedents connected by a logical AND, which when satisfied lead to the consequence that there is a tank (6). When all three antecedents are present as assertions, they match the left-hand side of the production rule, leading to the consequence that there is a tank. In addition, there are two further steps to the making of matches:
1. The production rules may be listed in some order to facilitate the matching process.
2. The search strategy unit may be able to invoke a sequence of rules that finds the match more quickly than a random search of the list.
The search strategy unit searches the knowledge base for a problem solution or an object identification. A knowledge-base search is the foundation of artificial-intelligence (AI) systems. The searching process depends on the knowledge structure, that is, the arrangement of the production rules in the knowledge base. The simplest arrangement of the production rules is to list them sequentially with no order. Each asserted fact is then run through the production rules until a match is found. With a small number of rules, it is practical to search a random list of rules. In the 3-D object recognition problem, there are many rules and hence the knowledge base is large. For this type of problem, the knowledge structure must be designed properly to decrease the searching time.

Searching Algorithms
As stated before, the searching algorithm depends on the knowledge structure. Given the best knowledge structure, the following searching algorithms may be applied.
1. Depth-First Search This technique takes one branch of the knowledge structure tree and searches all the nodes of that branch to its last node. If a solution is not found, the procedure backs up to the next unsearched branch and continues the search to the last node, and so on, until a solution is found.
2. Breadth-First Search This technique searches according to the level of a node. Level one is the top node; the successors of this node are labeled level two. The technique starts by searching all of the nodes on level two, then level three, and so on, until a solution is found.
3. Heuristic Search The two prior searching techniques require searching through the entire knowledge base in order to find the solution. When the knowledge base is large, the time for the solution search can become a real problem. The heuristic search technique is devised to reduce the searching time. A heuristic is any information that brings one closer to a problem solution. Such information often takes the form of rules of thumb that may be applied in decision-making processes. Heuristic information can be applied at many nodes along the solution search path, and the decision can then be made as to which branch of the knowledge structure is more likely to hold the solution. Heuristics are generally applied in one of the following two ways:
1. When deciding whether a particular branch of the knowledge structure should be eliminated
2. When deciding which branch of the knowledge structure should be searched next
The inference unit is used to obtain the object identity. The inference or reasoning process accepts the assertions from the input image and searches the knowledge base for identification. Two approaches to inference are common: deductive inference and inductive inference. Deduction, or backward inference, begins with a hypothesis or a goal and then proceeds to examine the production rules that conform to the hypothesis. Induction, or forward inference, examines the assertions and then draws a conclusion to which the conditions conform. In other words, deduction starts with the goals, while induction starts with the facts or assertions. Different 3-D object recognition systems that utilize AI techniques have been built and tested, and their performance is very promising.

NEURAL NETWORKS
A neural network consists of a number of nonlinear computational elements called neurons or nodes. The neurons are interconnected through adaptive elements known as weights and operate in a parallel environment. The structure of neural networks is similar to simplified biological systems. Recently, neural networks have become a prime area of research because they possess the following characteristics:
Highly parallel structure; hence a capability for fast computing
Ability to learn and adapt to changing system parameters (e.g., inputs)
High degree of tolerance to damage in the connections
Ability to learn through parallel and distributed processing
Nonparametric, not dependent on as many assumptions regarding underlying distributions as traditional techniques (e.g., classifiers or optimizing circuits)

Neural networks have been used as associative content addressable memories (ACAM) and as classifiers. In an ACAM, the memory is associated not by an address, but rather by partially specified or noisy versions of the stored memory pattern. A classifier assigns an input sample to one of a set of predetermined classes. The use of neural networks has recently been proposed for creating dynamic associative memories (DAM), which utilize supervised learning algorithms, by recording or learning to store the information as a stable memory state. The neural network minimizes the energy function formed by the mean square error between the actual training signal and the signal estimated by the network.

Back Propagation
Back propagation is one of the most popular supervised training methods for neural networks. It is based on the gradient-descent technique for minimizing the squared error between the desired output and the actual output. The following is a summary of the back-propagation training algorithm for a two-layer neural network with n inputs, L outputs, and J neurons in the hidden layer.
1. Store the training set {(X_k, d_k): k = 1, 2, 3, . . .}, where X_k is the input vector and d_k is the desired output vector.
2. Initialize V and W to small random numbers (between −1 and 1), where V is the weight matrix for the hidden layer (J × n) and W is the weight matrix for the output layer (L × J).
3. For each pair (X, d), compute the forward pass as follows:
◦ Compute $net_j$ for j = 1, 2, . . . , J, where $net_j = \sum_{i=1}^{n} v_{ji} x_i$.
◦ Compute $z_j$, where, for a unipolar sigmoid activation function, $z_j = 1/(1 + e^{-net_j})$.
◦ Compute $net_l$ for l = 1, 2, . . . , L, where $net_l = \sum_{j=1}^{J} w_{lj} z_j$.
◦ Compute $y_l$, where, for the same unipolar activation function, $y_l = 1/(1 + e^{-net_l})$.
4. Compute the backward pass as follows:
◦ Compute the output-layer error terms $\delta_{ol} = (d_l - y_l)\, y_l (1 - y_l)$.
◦ Compute the hidden-layer error terms $\delta_{hj} = z_j (1 - z_j) \sum_{l=1}^{L} w_{lj}\, \delta_{ol}$.
5. Update the weights as follows:
◦ $w_{lj} \leftarrow w_{lj} + \eta\, \delta_{ol}\, z_j$
◦ $v_{ji} \leftarrow v_{ji} + \eta\, \delta_{hj}\, x_i$
where η is the iteration step (learning rate).
6. Check convergence by comparing the resultant outputs with the desired ones. If the network has not converged, go to step 3.
Figure 16 shows a schematic for a two-layer neural network.

Neural Networks for Image Compression
Recently, neural networks have been used for image compression. Artificial neural network models, called connectionist models or parallel distributed processing models, have received much attention in many fields where high computation rates are required. Many neural network approaches to image compression yield performance superior to that of the traditional discrete approaches. The estimated output is compared with the actual one for learning. For back-propagation algorithms, image data compression is presented as an encoder problem. In fact, the weight matrices encode the principal components of the given image; that is, after convergence the network decomposes the given image into components of decreasing variance, much as singular value decomposition techniques do. Figure 17 shows one possible application of neural networks in image transmission. A digital image generated with an 8 bit imagery system may be reduced in size by feeding it to an 8-to-4 multilayer neural network. On the other end of the line, the compressed image is processed by a 4-to-8 neural network for reconstruction. Compressing images before transmitting them reduces the transmission time substantially.

Neural Networks for Pattern Recognition
The neural network is an ideal tool for pattern recognition. Any recognition system needs to be trained to recognize different patterns, and training is the most important part of the design of neural network systems. All pattern-recognition systems try to imitate the recognition mechanism carried out by humans. Since the neural network is a simplification of the human neural system, it is more likely to adapt to the human way of solving the recognition problem than other techniques and systems. Finally, we can look at pattern recognition as a classification problem, which is best handled by neural network systems. The design of a neural network system for pattern recognition starts with collecting data on each of the objects that are to be recognized by the system. A class is assigned to each object, and the collection of data and classes is used to train the system.
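The following is a compact NumPy sketch of the two-layer back-propagation procedure summarized above, assuming the unipolar sigmoid activation. The layer sizes, learning rate, number of epochs, and the toy XOR training set are illustrative assumptions rather than values from the text.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy training set: XOR (4 patterns, 2 inputs, 1 output).
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
D = np.array([[0], [1], [1], [0]], dtype=float)

n_in, n_hidden, n_out = 2, 4, 1
V = rng.uniform(-1, 1, size=(n_hidden, n_in))    # hidden-layer weights (J x n)
W = rng.uniform(-1, 1, size=(n_out, n_hidden))   # output-layer weights (L x J)
eta = 0.5                                        # iteration step (learning rate)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

for epoch in range(5000):
    for x, d in zip(X, D):
        # Forward pass.
        z = sigmoid(V @ x)                       # hidden activations z_j
        y = sigmoid(W @ z)                       # outputs y_l
        # Backward pass (error terms for sigmoid units).
        delta_o = (d - y) * y * (1.0 - y)
        delta_h = z * (1.0 - z) * (W.T @ delta_o)
        # Weight updates.
        W += eta * np.outer(delta_o, z)
        V += eta * np.outer(delta_h, x)

# Outputs should approach [0, 1, 1, 0] after training.
print(np.round(sigmoid(W @ sigmoid(V @ X.T)).T, 2))
```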
Figure 16. Two-layer neural network with n inputs, j hidden neurons, and l outputs.
Figure 17. The use of neural networks for image compression and transmission.
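The following is a sketch, under assumed parameters, of the 8-to-4-to-8 bottleneck idea in Fig. 17: an encoder/decoder network is trained so that the 4-unit hidden layer carries a compressed code of each 8-pixel block. The block model, learning rate, and training schedule are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

# Toy training data: 8-pixel blocks from a simple two-parameter ramp model,
# so a 4-value code can plausibly describe each block.
t = np.linspace(0.0, 1.0, 8)
offsets = rng.uniform(0.2, 0.8, (500, 1))
slopes = rng.uniform(-0.3, 0.3, (500, 1))
blocks = np.clip(offsets + slopes * t, 0.0, 1.0)

V = rng.uniform(-0.5, 0.5, (4, 8))   # encoder weights: 8 pixels -> 4 code values
W = rng.uniform(-0.5, 0.5, (8, 4))   # decoder weights: 4 code values -> 8 pixels
eta = 0.3

for epoch in range(300):
    for x in blocks:
        code = sigmoid(V @ x)                        # compressed representation (half the size)
        recon = sigmoid(W @ code)                    # reconstruction at the receiving end
        delta_o = (x - recon) * recon * (1 - recon)
        delta_h = code * (1 - code) * (W.T @ delta_o)
        W += eta * np.outer(delta_o, code)
        V += eta * np.outer(delta_h, x)

test = blocks[0]
print(np.round(test, 2))
print(np.round(sigmoid(W @ sigmoid(V @ test)), 2))   # should roughly match the block above
```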
IMAGE FUSION
Image fusion is one form of multi-sensor fusion. It is a comparatively new area that is attracting a great deal of attention in areas related to U.S. Army research and development of situational awareness systems, target detection, automotive applications, medical imaging, and remote sensing. The current definition of sensor fusion is very broad, and the fusion can take place at the signal, pixel, feature, and symbol levels. The goal of image fusion is to create new images that are more suitable for the purposes of human visual perception, object detection, and target recognition. A prerequisite for successful pixel-level image fusion is that the multi-sensor images be correctly aligned on a pixel-by-pixel basis. Image fusion has the following advantages:
1. It improves the reliability by eliminating redundant information.
2. It improves the contrast by retaining complementary information. See Fig. 18.

Figure 18. Fusion process on a Venn diagram.

Image fusion is a sequel to data fusion. A number of approaches are currently being discussed and used for image fusion, such as pyramid transforms, wavelet transforms, mean-value methods, the max/min approach, and fuzzy/neuro-fuzzy logic. Multi-sensor data often contain complementary information about the region interrogated, so image fusion provides an effective method to enable comparison and analysis of such data. Image fusion is being used in numerous medical applications to obtain a better image and is being tested in the automotive industry to enhance the view of the road, so that a better image is available in rainy or foggy weather for collision avoidance applications. The fuzzy/neuro-fuzzy image fusion technique is described below.
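Before the fuzzy technique is described, the following is a minimal sketch of two of the simpler pixel-level combination rules mentioned above (averaging and max selection), assuming the input images are already registered and of equal size; the toy inputs are illustrative.

```python
import numpy as np

def fuse_mean(images):
    """Pixel-wise average of a list of registered, same-size images."""
    return np.mean(np.stack(images), axis=0)

def fuse_max(images):
    """Pixel-wise maximum; tends to retain the locally brighter sensor response."""
    return np.max(np.stack(images), axis=0)

# Two toy "sensor" images: one bright on the left, one bright on the right.
a = np.tile(np.linspace(1.0, 0.0, 8), (8, 1))
b = np.tile(np.linspace(0.0, 1.0, 8), (8, 1))

print(np.round(fuse_mean([a, b]), 2))
print(np.round(fuse_max([a, b]), 2))
```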
Figure 19. Image Fusion system for four input images.
FUZZY LOGIC AND NEURO-FUZZY APPROACH TO IMAGE FUSION
Fuzzy logic and neuro-fuzzy approaches are an alternative to a large number of conventional approaches that are based on a host of empirical relations. Empirical approaches are time consuming and result in low correlation. The fuzzy logic approach is based on simple rules, which are easy to apply and take less time. As shown in Fig. 19, multiple images can be fused using this method. For image fusion in a specific application, the user selects optimized membership functions and the number of membership functions associated with each image, as shown in Fig. 20. Fuzzy rules are then applied at the pixel level to the corresponding pixels of all the input images. The neural network and fuzzy logic approaches can also be used together for sensor fusion. Such a fusion belongs to a class of sensor fusion in which the image or sensor features are the inputs and a decision is the output.
Figure 20. Six Gaussian membership functions for input/output images using fuzzy logic.
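A heavily simplified sketch of pixel-wise fuzzy fusion in the spirit of Figs. 19 and 20 is shown below: each input pixel is weighted by its membership in a single Gaussian fuzzy set, and the output is the weighted average. The membership center and width, the single-rule design, and the toy inputs are illustrative assumptions, not the system described in the text.

```python
import numpy as np

def fuzzy_fuse(images, center=0.5, sigma=0.25):
    """Pixel-wise fused image: a weighted average in which each input pixel is
    weighted by its Gaussian membership in a 'well-exposed' fuzzy set."""
    stack = np.stack(images)                                        # shape (k, H, W)
    weights = np.exp(-((stack - center) ** 2) / (2 * sigma ** 2)) + 1e-6
    return (weights * stack).sum(axis=0) / weights.sum(axis=0)

# Two toy registered inputs: one under-exposed, one over-exposed.
rng = np.random.default_rng(0)
scene = rng.uniform(0.2, 0.8, (4, 4))
dark, bright = 0.5 * scene, np.clip(scene + 0.4, 0.0, 1.0)
print(np.round(fuzzy_fuse([dark, bright]), 2))
```

A real fuzzy or neuro-fuzzy system would use a bank of membership functions per image and a trained rule base, but the per-pixel weighting idea is the same.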
All the parameters of the neuro-fuzzy system, such as the number and type of membership functions and the number of iterations, can be selected according to the requirements of processing speed and quality of the output image. The basic concept is to associate the given sensory inputs with some decision outputs. After the system has been developed, another group of input image data is used to evaluate its performance. In Fig. 21 we can see the output of the fusion of two medical images using both the fuzzy and the neuro-fuzzy algorithms. Applications of fused medical images include image-guided surgery, noninvasive diagnosis, and treatment planning. Images from the same modality may be combined from different times, or images from different modalities may be combined to provide more information to the surgeon. For example, intensity-based tissue classification and anatomical region demarcation may be seen using MRI, while metabolic activity or neural circuit activity from the same region can be measured with PET or functional MRI. This type of information is important for surgical planning, so as to avoid regions of the brain not
relevant to the upcoming surgery.

Figure 21. Fused medical image output using fuzzy and neuro-fuzzy logic.

Video Image Fusion Process
Video image fusion is implemented in the same way, using frame-wise image fusion with any of the image fusion techniques after the video has been reduced to frames. Fuzzy logic is a unique way of performing sensor fusion with little internal complexity. In a dynamic environment the scene changes continuously, and there is a need to focus on correct frame alignment and preprocessing of the frames even more than on the fusion algorithms themselves. Video image fusion using fuzzy logic can provide useful information for a variety of applications. The major steps for implementing video image fusion using the fuzzy or neuro-fuzzy methods are outlined below.
1. Image Registration. All image fusion algorithms require registered input images so that pixel combinations are valid. For two video frames to be fused, it is important that they be properly aligned with each other in both the spatial and the temporal domain. The degree of registration that can be achieved prior to fusion is critical to the quality of the fused image that is produced. If two images are not registered, no image fusion can rectify the fact that invalid pixel information is being compared and possibly combined incorrectly, producing poor image quality at best. Accurate image registration depends on factors such as the fusion application, the speed of movement in the scene, camera frame rates, and so on.
2. Adaptive Preprocessing Algorithm. Before the video fusion process is applied, images must be checked for suitability for fusion. The quality of the input images is foremost in ensuring the quality of the fused output, irrespective of the fusion algorithm adopted. Using a set of scene metrics to monitor input scene quality for local situational awareness, we can adapt the frames to be sent to the fusion process. Adaptive algorithms are currently being developed to screen out skewed images, images with different zoom levels, or distorted frames.
3. Video Fusion Using Fuzzy/Neuro-Fuzzy Logic. Pixel-wise fusion with Gaussian membership functions and an optimum number of membership functions is accomplished with a trained fuzzy system. In the case of neuro-fuzzy logic, a trained neural network is used for fusing the corresponding frames. For video image fusion, the frames of the video are obtained and fused using the same fuzzy fusion technique. All the fused images are then converted back to frames and assembled for the final video display.

Figure 22. Steps of video fusion using fuzzy and neuro-fuzzy logic.

MAJOR PROBLEMS IN VIDEO IMAGE FUSION
Real-time implementation of image fusion tends to magnify many of the inherent problems of the image fusion process. Whereas in still-image fusion adjustments for any changes in the image can be made on an image-by-image basis, the video fusion process requires robust, adaptive, real-time algorithms to deal with dynamic scene problems across all the images in the video set. In a dynamic environment, even with the most rigid arrangement of the cameras, spatial misalignment remains one of the foremost problems in video fusion. Many new solutions have been proposed, such as internal rotation of the frame about its axes based on the nature of the misalignment, and shade- and contrast-based adjustments. The same techniques used for fuzzy-logic video stabilization must be applied to the two image streams being fused. Adaptive frame alignment algorithms are a subject of current research. Real-time image processing requires high computational processing speed. The techniques of parallel processing and cluster computing are possible for image and video processing on a static platform. For processing in dynamic environments, image fusion algorithms need to be modified so that little onboard memory is needed and the fused video can be viewed in real time.
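To tie the video-fusion steps together, the following is a schematic sketch of the frame-wise fusion loop outlined above. The functions `register` and `fuse_frames` are hypothetical placeholders standing in for a real registration step and for whichever fuzzy or neuro-fuzzy combination rule is used.

```python
import numpy as np

def register(frame_a, frame_b):
    """Placeholder spatial/temporal registration: assumes frames are already aligned."""
    return frame_a, frame_b

def fuse_frames(frame_a, frame_b):
    """Placeholder fusion rule (simple average); a fuzzy/neuro-fuzzy rule would go here."""
    return 0.5 * (frame_a + frame_b)

def fuse_video(stream_a, stream_b):
    """Frame-wise fusion of two synchronized streams into one fused stream."""
    fused = []
    for fa, fb in zip(stream_a, stream_b):
        fa, fb = register(fa, fb)
        # Adaptive preprocessing (quality, zoom, and skew checks) would also be applied here.
        fused.append(fuse_frames(fa, fb))
    return fused

# Two synthetic three-frame streams of 4 x 4 images.
stream_a = [np.full((4, 4), i, dtype=float) for i in range(3)]
stream_b = [np.full((4, 4), 2 * i, dtype=float) for i in range(3)]
print([f.mean() for f in fuse_video(stream_a, stream_b)])   # [0.0, 1.5, 3.0]
```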
BIBLIOGRAPHY
1. A. K. Jain, Fundamentals of Digital Image Processing, Englewood Cliffs, NJ: Prentice-Hall, 1989.
2. R. C. Gonzalez and P. Wintz, Digital Image Processing, Reading, MA: Addison-Wesley, 1977.
3. W. K. Pratt, Digital Image Processing, 2nd ed., New York: Wiley-Interscience, 1991.
4. J. C. Russ, The Image Processing Handbook, 2nd ed., Boca Raton, FL: CRC Press, 1995.
5. K. R. Castleman, Digital Image Processing, Upper Saddle River, NJ: Prentice-Hall, 1996.
6. N. Magnenat-Thalmann and D. Thalmann, Image Synthesis: Theory and Practice, New York: Springer-Verlag, 1987.
HARPREET SINGH Y. HAMZEH S. BHAMA S. TALAHMEH L. ANNEBERG G. GERHART T. MEITZLER D. KAUR Wayne State University, Detroit, MI JTICS, Chrysler Corporation, Detroit, MI INVANTAGE, Inc., Taylor, MI Lawrence Technological University, Southfield, MI Research and Engineering Center (TARDEC), Warren, MI University of Toledo, Toledo, OH
Wiley Encyclopedia of Electrical and Electronics Engineering Image Processing Equipment Standard Article Robert John Gove1 1Equator Technologies, Inc., Campbell, CA Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4107 Article Online Posting Date: December 27, 1999 Abstract | Full Text: HTML PDF (594K)
Abstract The sections in this article are Image Sensing Equipment Image Computing Equipment Video and Image Compression Equipment Image Database Equipment Image Display Equipment Image Printing Equipment Keywords: image acquisitions; scanners; video capture; image display; DSP chips; image databases; computer networks; hardware approaches to video and image coding; printers
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
IMAGE PROCESSING EQUIPMENT Image processing equipment (IPE) encompasses a range of machines that collect or generate images, altering them in some way to improve the perceptibility of the images by human or machine. IPE can enable us to see in dark or remote locations or simply to improve the quality of a picture or synthetic scene. The equipment has historically been an effective tool for business, industry, medicine, entertainment, military, and science. Applications of Image Processing Equipment Image processing equipment now extends from the smallest of electronics items produced today, such as a microminiature camera based on a highly integrated chip in biomedical and hidden surveillance applications, to some of the most complex and advanced machines produced. IPE is ubiquitous, being used by virtually every industrialized nation today. While everyone may not be familiar with the direct use of the technology or the plethora of information we indirectly create with image processing equipment, most greatly benefit from its use. Many of the images we see on television and movies were created with image processing. The weather reports we watch include enhanced satellite imagery. The checks we send through the banking system are converted to digital images and then destroyed. The US Treasury uses image processing to assess automatically the quality of currency just after it’s printed. The medical professions feature many imaging modalities, including X-ray and ultrasound images for diagnosis. All of these applications utilize various IPE. Many applications of image processing have become familiar to us due to its use by the entertainment– film industry, personal computers with imaging capabilities, and the Internet. The near perfection of computergenerated imagery (CGI) of dinosaurs in Paramount Picture’s Jurassic Park and scenes of the demise of the Titanic offer shocking realism to movie viewers. Major motion pictures, such as Paramount Picture’s Forrest Gump (released in 1995), feature image manipulation for entertainment. By this technology actors re-live historical events, flawlessly “mixing” with actual film footage of the event. Using image processing to control lighting, color and blending of images, the actor Tom Hanks was realistically portrayed to converse with Presidents Kennedy and Johnson at White House functions of their time, long after their deaths. The actor also became a world-class table tennis player by careful frame-by-frame manipulation of a ping pong ball. In addition it has become all too common to see image processing equipment extracting recognizable images of criminals with blurry and dark photographs taken at crime scenes. Publications have seen development costs reduce with the use of computer imagery. Electronic mailing of digital images captured with the use of scanners or electronic cameras attached to personal computers (PCs) connected to the Internet continues to increase in popularity. Finally this technology may eventually develop to supply prosthetic artificial eyes for the blind (1). Many Users of Image Processing Equipment Nearly every industry uses some form of equipment that processes images. Consumers use cameras for photography and home videos. Travelers use security cameras for vacation homes and Internet “spy” cameras for seeing what awaits them. Doctors and technicians use IPE for medical instrumentation and computeraided analysis of the images for diagnosis and treatments. 
West Virginia Department of Motor Vehicles staff use IPE with facial recognition software (from VisionSphere Technologies Inc.) to verify the identity of those attempting to renew a driver’s license. A neurobiologist in Massachusetts uses thumbnail-size cameras on the back of horseshoe crabs (“Crabcams”) to study behavior of aquarium-bound crabs when viewing the captured
scenes. Astronomers use multispectral, high-resolution, and high-sensitivity cameras to see into deep space with ground-based and space-based telescopes (Hubble Space Telescope) coupled with CCD imagers. NASA used imagers to navigate the Pathfinder Rover on Mars. Manufacturers use vision systems for factory automation, such as inspection of units produced and ultimately product quality control. Pilots use remote sensing and vision systems for navigation. Soldiers use remote battlefield reconnaissance sensors to see the enemy prior to entering the field. Battleships use airborne, visible, IR, and radar imagery for detection of possible hostile concerns. Semiconductor manufacturing technicians use X-ray imaging to align ball-grid-array (BGA) package components onto printed circuit boards. Digital image manipulation for explicit alteration of events has been used for “spin doctoring” of news or evidence [see (2)].

Elements of IPE

Fig. 1. Elements of image processing equipment (IPE) include the input block (for image/video capture/sensing), the processing block (for image/video computing), the storage/transmission block (for image/video database and compression), and the output block (for display or print of the images).

Figure 1 illustrates common elements found in most IPE. The input block includes the functions of capturing a scene with a camera or scanner. The time frame for the event of scene capture varies with the application. Interactive communication with images and video would dictate almost instantaneous capture and processing of the image sequence, while many applications utilize image storage for later analysis. The image process block includes a wide body of functions and algorithms for image processing, coding, color/texture analysis, enhancement, reconstruction, and rendering, using techniques like transform domain processing, histogram analysis, and filtering. The store/transmit block provides image archive, distribution, communication, and database mechanisms, using solutions like video disks, CDs, DVDs, films, or computer/telephony networks. Finally, the output block includes displaying or printing the images on graphics or video monitors, cinema, paper, or for direct machine control.

Removal of Technology Limitations
As with other areas, the availability of related technologies has restricted the capabilities of IPE. Parameters such as the size of the display, the sensitivity of the camera, the number of processing cycles, the amount of memory for image storage, and the number of bytes per second available for communication have historically limited its widespread use. Regardless, science, defense, manufacturing, and commerce industries have embraced the use of IPE. The equipment is widely used for automation of manufacturing processes, entertainment, detection of items not easily perceivable with human vision, and minimization of travel. The technologies just recently developed will very shortly enable IPE to captivate users with immersive high-resolution displays,
supersensitive cameras, powerful supercomputing technologies, extremely high-density memory cards, and high-bandwidth networks, all for cost-effective personal and portable use. Equipment to perform functions such as lifelike video conferencing and automatic face recognition, described in futuristic movies and books, is emerging. Image Representations: Numeric Pixels and Symbolic Objects Images are represented two primary ways—numerically as a multidimensional array of pixels (or picture elements) or symbolically as a list of object descriptors (also known as graphical or vector representations). While graphics processing utilizes geometric vector representations for polygon drawing and image processing utilizes pixel arrays representations, some applications use both representations. The most common representation, the array of pixels, includes a multidimensional array of scalar quantities, proportional in some way to the scene. These include color photographs with shades and tints of visible color and thermal satellite images of earth with pseudo color shadings showing thermal signatures. This representation provides convenient application of signal processing algorithms and manipulation by specialized processor chips. Pixel arrays have been extended to higher dimensions with volume pixel elements or voxels. In this case a voxel would contain a scalar value that represents the density of a point in space (e.g., a human body) at the three coordinates associated with each value. An IPE could then create a 2-D image of any arbitrary plane slicing through the 3-D object by accessing the appropriate voxels. Using the graphical or vector representation, the object descriptors contain information or instructions specific to reconstruction of the objects on a display medium. This method was developed for graphics applications and vector-oriented displays or printers (e.g., x–y pen plotters). In this case simple objects or drawings like line art and schematics can be drawn with a few strokes rather than painting an entire x–y grid image. Using an ordered list of these descriptors (which could include a 2-D image), the image-rendering system first processes and displays objects that have priority, such objects near the center of attention of the user or not even displaying those objects that have been occluded by others. An advantage of the “graphical” representation is simpler creation of 3-D scenes as a function of the viewer’s pose or gaze at objects relative to the other objects. Either representation can be used to create 3-D scenes or views; however, depending on the scenes complexity, the graphical method can yield better compaction of data (for storage or transmission) and simpler implementations. Organization of the Article This article is organized by categories of IPE. The streaming of data and information through image processing equipment can follow the path represented by these categories. They are:
(1) Image sensing equipment, for acquiring the original images
(2) Image computing equipment, which can include enhancement, analysis, measurement, and compression
(3) Video and image compression equipment, for real-time communication with others
(4) Image database equipment, for storage and on-line access of information and transmission
(5) Image display equipment, for on-line visualization by the observer
(6) Image printing equipment, for hardcopy recording or visualization by the observer
In addition to an overview of the system products or subsystems associated with each, this article includes a summary of the components (or chips) that form building blocks with the underlying technologies, enabling image processing applications today and into the future. Pioneering work in each area has created technologies that are used in nearly every industry today.
Image Sensing Equipment Image sensing equipment (ISE) provide raw images for use by image processing equipment. This equipment includes electronic still cameras, video cameras, medical imaging equipment, and page scanners. ISE utilize imaging detectors that produce signals proportional to the intensity of the source illumination. Today CCD chips are the leading type of imaging detector, featured in virtually every class of ISE. Some cameras can collect signals well beyond the perceptibility limits of biological human vision. Examples include using X-ray or infrared spectrum sensors. Many ISE internally utilize image processing to condition the signals in preparation for distribution or to simply remove any sensor-induced artifacts such as geometric distortion. Cameras can sense remote locations, enabling inspection of nearly any situation such as microscopic surgery, lumber processing, or viewing the surface of Mars. In many cases cameras provide the sensing function remotely from the final IPE such as in different cities, in space or under the ocean. Remote connection of sensing equipment, as described later in this document, requires digital compression of the images using a variety of methods such as ISO-MPEG (Moving Picture Experts Group) or JPEG (Joint Photographic Experts Group) standards for storage or transmission via radio waves, telephone, cable, or satellite transmission to the IPE. Direct connection to the IPE may utilize compression technology to either improve the quality or reduce the cost of the connection, or a simple full-bandwidth electrical connection. Electronic Cameras and Detecting Methods. Electronic cameras can be as large as a shoebox and as small as a single IC package including an integrated lens. They capture images of the environment by sensing visible, X-ray, infrared, or other spectrums of light. Cameras can capture still images, a sequence of images used to represent “live” motion (video), or both, depending on the camera’s storage and interface methods. The sensing technology, which includes the camera’s optics, photo-recombination methods, sensitivity, size, power, mobility/portability, and electrical interface bandwidth, all directly limits a camera’s capabilities. Various methods have been used to detect light, including photo-emissive devices like photo-multipliers, photodiodes, phototransistors, vidicons, and silicon chips, all of which generate electrons over time proportional to photons hitting the imager. Professional television cameras were originally developed with vidicon detector tubes. These detectors featured a sensing area that collected the generated charge in a material that acted as an array of capacitors and contained circuits for electrical readout of the charged array. In effect an electron beam scanned across the sensor, regenerating the charge into those capacitors, and was bleed away by the illumination as a circuit sequentially detected the amount of charge used to replenish the sensor. For cost and performance reasons, electronic cameras have practically made a complete transition from tube technology to semiconductor sensors, including CCDs (charge-coupled devices) and CMOS (complementary-metal-oxidesilicon). Camera Sensor Chips. The semiconductor architecture most used today, the CCD as shown in Fig. 2, produces a spatially sampled array of pixels with charge proportional to the incident illumination and uniquely transfers the charge from each storage site (or well) to its neighboring storage site in succession [see (3)]. 
This simultaneous transfer of charge from well to well is characterized as bucket briggading. A CCD generates charge, transfers that charge and converts the charge to voltage on output. Charge generation can be in an array of either MOS capacitors or photodiodes (for interline transfer). In either case, charge transfer occurs in MOS capacitors. The charge is finally converted to a voltage signal at the output of the CCD. A CCD offers uniform response, since its charge generation and collection happens in “precisely” patterned sites or charge wells. During manufacture of CCDs, semiconductor photolithography and process steps like ion implantation create these precise pixels. With vidicon tubes, the specific location of the pixel on the detector varied as a function of input scan voltages. This created a challenge for camera designers that usually resulted in some geometric distortion of the detected images. With CCDs, the pixel location does not change, yielding a very geometrically stable pixel, with less sensitivity to scanning or clock voltages. The CCD’s use of a single high-quality transconductance amplifier, coupled with these highly uniform pixel characteristics, leads to high performance, with excellent uniformity at the output of the device. As with other semiconductor devices, defects
Fig. 2. The virtual phase (VP) CCD imager chip (shown on the left) and the digital camera chip or chips (shown on the right) have been used to make digital cameras. The CCD converts incident photons to electrons, collecting them in patterned potential wells or pixels. Metal lines control the transfer of the charge in each pixel in unison. These vertical metal lines and horizontal channel stops (not shown) define the pixel location. In a VP CCD another “virtual” transfer control line (implanted beside the metal line) works in concert with the metal line to orchestrate charge transfer horizontally as shown by the potential plot in the illustration. The digital camera chip(s) can be fully integrated solutions or independent chips, depending on the product requirements.
can occur. Several techniques have been developed to mask these defects, like pixel output replication [see (4)], to increase the processing yield and resultantly lower the cost to manufacture these devices. CCDs support tremendous variations of integration time, useful for stop action or low illumination applications. Low dark current (the amount of charge generated with no light onto the detector) and high sensitivity mean that a CCD can detect scenes with extremely low illumination when using long exposure times. Cooling CCDs with thermal electric coolers (TEC) further reduce dark current. For these reasons CCD-based telescopes extend our sight to previously unobservable galaxies (e.g., with the Hubble Space Telescope). For security applications, this increased sensitivity means that intruders are seen in extremely low illumination situations. To reduce the cost of consumer products, video camera manufacturers can use inexpensive, larger f /number lens with light absorbing color filters and still be able to make excellent pictures in dimly lit rooms. On the other hand, a very short integration time can be used for capturing rapid motion with stop action clarity such as freezing the image of a hummingbird’s wings in motion. For these “stop-action” applications, a CCD with “electronic-shuttering” again has the advantage over conventional photography’s mechanical-shuttering techniques. A CCD will use frame-transfer or interline-transfer methods (5) to quickly transfer the exposed pixels to covered sites, thereby abruptly terminating integration for those pixels. Only then does the CCD read out the frame of pixels begin, usually at conventional video display rates. Simultaneously integration of the next frame begins in the detecting array. Frame transfer techniques can offer the advantage of a single video frame acquisition, with less field mixture, for freeze-frame applications. Interline-transfer architectures can have lower efficiency than frame-transfer architectures, since the relative percentage of photon-collecting area is lower. However, interline-transfer approaches facilitate the use of photodiode detectors and anti-blooming structures that prevent charge spillage to adjacent pixels when overexposing the device. Since CCDs contain analog pixels, with charge proportional to the light integrated on each charge well, devices have been developed to perform analog signal processing of the image directly on the imaging chip. For example, a CCD can add two pixels together by simply clocking two pixels successively into one charge well. This can reduce the system cost, size and power for several applications.
The size of the pixel affects the dynamic range of the device. A larger pixel can generate more charge. The structure of the clock lines, which facilitate charge transfer, but also reflect light away from the underlying recombination silicon, influences overall efficiency of the CCD. The physical clock structure and shape of these timing signals used to transfer charge determine the devices ability to completely move all the charge from well to well. Any residual charge will lead to poor dynamic range or picture smearing. The architecture of the sensor, including the size of each pixel, the structure of the light-blocking clock lines, and the characteristics of the photo-recombination wells, influences a CCD’s performance. CMOS Sensor Chips. CMOS (complementary-metal-oxide-silicon) sensors represent another architecture that directly images light onto a structure similar to conventional CMOS memories. These sensors require more complicated addressing schemes to select specific x–y cells than a CCD. In addition more chargesensing amplifiers are used, creating more performance issues. CMOS sensors also have lower sensitivity, since their complex clocking structure again tends to block more of the underlying silicon where photo-recombination occurs, leading to pixel fill factors as low as 30%. Lower dynamic range results with higher output amplifier noise relative to a CCD. However, the CMOS approach has seen a resurgence of interest lately due to the cost-effective nature of integrating many full-digital video functions (like analog to digital conversion and compression) and sensing functions directly onto one silicon chip, as well as the use of conventional high-volume semiconductor memory fabrication techniques. Color Imaging Chips. Professional CCD cameras incorporate three sensors and special optics for simultaneous sensing of three images corresponding to the color primaries (red, green, and blue or yellow, magenta, and cyan). Alignment of the individual sensors and the optical assembly creates challenges for manufacturers. For cost reasons, consumer-level cameras utilize only a single sensor coupled with a color filter mosaic integrated onto the top of the chip. Alignment of the filters can be performed optically before cementing onto the chip or directly patterned onto the device. Manufacturers utilize an array of color-filtered pixels which can be combined to form a color triad of pixels. Unfortunately, this color-sensing method can reduce the overall spatial resolution by as much as a third in one dimension. Taking advantage of the human eye’s increased sensitivity to green, some color sequences like green-red-green-blue improve the apparent luminance resolution to only a half the original grayscale resolution (rather than a third). Geometric Distortion and Correction with Warping. Many cameras contain optical systems that present a distorted image of the scene. The distortion can result from geometric properties of the optical system, such as the use of a fish eye lens for wide-angle imaging, or cylindrical lens for line scanning or low-quality lens. Sensor artifacts, such as with the use of vidicons, can also worsen these problems. The distortion can also directly result from applications like producing photomosaic maps of the earth, whereby the captured 2-D satellite images contain curvilinear sides of varying severity, as a function of the angle looking away from normal. 
Distorted and disjoint views from spacecraft such as the Viking and Mariner also require correction and mosaicing for proper viewing. A warping process can correct for these distortions, as described in Ref. 6. With warping, the algorithm matches control points located on the distorted image and the target image. For an application such as the use of a fish-eye lens, this process is very simple, whereas other applications need manual intervention to determine the proper mapping coordinates. Each pixel is then mapped from the source into the destination. Since this cannot be a one-to-one mapping, interpolation of the source image is required per pixel. The degree of the polynomial that estimates the missing pixels will influence the result. However, simple bilinear interpolation or first-order warping can generate effective results. Using four input pixels I1 (pixel value at x + 1, y), I2 (pixel value at x + 1, y + 1), I3 (pixel value at x, y + 1), and I4 (pixel value at x, y), we can compute the interpolated pixel value or intensity I(x, y) for an arbitrary pixel at a location within the four pixels as
I(x, y) = I4 (1 − X)(1 − Y) + I1 X (1 − Y) + I3 (1 − X) Y + I2 X Y

where X and Y are fractional distances (between 0 and 1), computed as the fractional parts of the input-pixel coordinates to which the output pixel (x, y) maps under the first-order (affine) warp:

X = frac(A1 x + A2 y + A3),  Y = frac(A4 x + A5 y + A6)
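The following C sketch implements this first-order warp with bilinear interpolation of the four neighboring source pixels; the function name, argument layout, and simple zero fill at the image border are illustrative assumptions rather than a prescribed implementation.

#include <math.h>
#include <stdint.h>

/* Warp an 8-bit source image into a destination image using the affine
 * mapping (u, v) = (A1*x + A2*y + A3, A4*x + A5*y + A6) from each output
 * pixel (x, y) back into the source, with bilinear interpolation of the
 * four neighboring source pixels. Out-of-range pixels are set to 0. */
void affine_warp(const uint8_t *src, int sw, int sh,
                 uint8_t *dst, int dw, int dh, const double A[6])
{
    for (int y = 0; y < dh; y++) {
        for (int x = 0; x < dw; x++) {
            double u = A[0] * x + A[1] * y + A[2];   /* source column */
            double v = A[3] * x + A[4] * y + A[5];   /* source row    */
            int x0 = (int)floor(u), y0 = (int)floor(v);
            double X = u - x0, Y = v - y0;           /* fractional distances */
            if (x0 < 0 || y0 < 0 || x0 + 1 >= sw || y0 + 1 >= sh) {
                dst[y * dw + x] = 0;
                continue;
            }
            double I4 = src[y0 * sw + x0];           /* (x, y)         */
            double I1 = src[y0 * sw + x0 + 1];       /* (x + 1, y)     */
            double I3 = src[(y0 + 1) * sw + x0];     /* (x, y + 1)     */
            double I2 = src[(y0 + 1) * sw + x0 + 1]; /* (x + 1, y + 1) */
            dst[y * dw + x] = (uint8_t)(I4 * (1 - X) * (1 - Y) + I1 * X * (1 - Y)
                                        + I3 * (1 - X) * Y + I2 * X * Y + 0.5);
        }
    }
}

On a DSP or media processor, the two per-pixel multiplies for u and v would normally be replaced by incremental additions along each row, which is the incremental-addressing saving described next.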
This affine warping can produce translation, rotation, scale, and shear mapping. A set of control points can be used to determine the coefficients A1 through A6. Once they are calculated, we can use incremental addressing (one addition operation) to reduce the number of computations. For a digital signal processor (DSP) or media processor this warping function requires eight adds/subtracts, four multiplies, four loads, one store, and several register accesses per pixel, or perhaps 15 cycles per pixel. With a 640 × 480 pixel image processed at 60 frames per second, this requires nearly 300 million operations per second (MOPS).

Digital Still Cameras. Many manufacturers produce digital still cameras (Olympus, Sharp, Epson) for both consumer and professional markets. Figure 2 shows the key elements of an electronic digital camera. These products enable Internet-based applications of imaging by capturing digital images for subsequent processing, printing, and storage with a personal computer. Several manufacturers produce digital still cameras with 640 × 480 pixel resolution [including Nikon (http://www.nikon.com), Panasonic (http://www.panasonic.com) and Canon (http://www.ccsi.canon.com)]. With an integrated LCD display, computer interfaces, JPEG storage of nearly 130 normal-resolution or 65 fine-resolution images in 4 Mbyte of memory, and annotation on the images using a pen stylus or speech input, these cameras offer a portable solution for image acquisition. Another unit offers a completely enclosed camera on a small PC card for mobile use on a laptop. At the high end, the EOS D2000 (from Canon and Kodak) offers a 2 million pixel (1728 × 1152) electronic camera for photojournalism or surveillance applications. With the falling cost of semiconductor Flash memory cards and continuing CCD improvements, consumer digital still cameras will proliferate in the near term. Likewise, the development of custom chips, and recently a microprocessor-based solution for the digital camera functions shown in Fig. 2, enables cost-effective and flexible solutions. The fully integrated MIPS-based DCAM-101 (from LSI Logic at www.lsi.com) offers a single-chip solution with pixel processing for color space conversion with a Bayer pattern of CCD color filter mosaics, coupled with interfaces for all other functions in the diagram. One manufacturer (Hewlett-Packard Co.) produces a full-solution home-photography kit called HP PhotoSmart. The kit includes a camera with a 640 × 480 CCD with de-mosaicing, filtering, and JPEG compression, a 2400 dpi (dots per inch) scanner with feeders for color slides, negatives, or prints up to 5 × 7 inches, and a printer. When used with a PC the system provides a complete picture-taking, editing (using Microsoft’s PictureIt software), printing, and photograph-storing solution. Another manufacturer (EarthCam) produces an integrated camera solution, called the EarthCam Internet Camera, with direct connection to the Internet via a built-in 28.8 kbit/s modem. Users simply connect the camera to a phone connection for remote control and access to pictures captured from the remotely located camera.

Video Cameras. Video cameras are frequently used as sources for IPE video capture. External frame grabber equipment can capture still images from these cameras or, with some newer video cameras, internal electronics captures still images for subsequent transfer to a computer via standard interfaces. Camcorders include tape storage of up to several hours for later playback. Camcorders have become ubiquitous today.
These cameras have one limitation: image instability caused by one’s natural shaking of the small handheld camera. To correct for this jitter of the picture, some units utilize electronic stabilization: when circuits internal to the camera detect subtle motion, minute repositioning of the optical system stabilizes the picture. Alternative methods adaptively reposition the physical position of the output video on the sensor.
Motion detection processing techniques drive this selective repositioning of the starting pixel of readout of the CCD sensor. Unfortunately, this technique requires a larger sensor to maintain the same picture size while repositioned.

Analog and Digital Outputs. Most video cameras have analog video outputs. Lower-end units use a single composite signal combining luma (Y) and chroma (C), midrange units use separate Y and C signals, and high-end units use individual components of Y, R-Y, and B-Y. These are acceptable for many applications, but they limit overall performance for critical applications. Recent digital video cameras (from Sony) have included IEEE-1394 (originally developed by Apple Computer) as a standard digital interface between the camera’s video and storage or other computer equipment. This approach minimizes cost and improves picture quality in comparison to conventional analog interfaces, which may require digital-to-analog converters in the camera and analog-to-digital converters in the image computing/storage equipment.

Other Video Cameras. One video camera manufacturer produces an MPEG camera with a CCD sensor, an LCD display, and direct MPEG full-motion digital compressed video and audio stored on a 260 Mbyte disk (Hitachi at http://www.hitachi.com). Twenty minutes of MPEG-1 video or 3000 JPEG images can be stored, in any combination. The MPEG camera’s combination of JPEG still image and MPEG movie creation simplifies Internet web site development or CD-ROM creation. Digital video (DV) camcorders (Sony Corp.) offer a higher-quality alternative to conventional analog video camcorders (like 8 mm or VHS-C). DV technology utilizes a proprietary algorithm optimized for recording on a small-format tape, while supporting trick modes such as fast-forward. DV also supports the 16:9 video format, which has wide acceptance in Japan. CCD-based video cameras for production studios (from Sony and Eastman Kodak Microelectronics) can produce high-resolution 1920 × 1080i video format signals. Major events such as the New Year’s Rose Parade and the Olympics have featured these cameras. Video from these cameras will be used for terrestrial broadcast of digital high-definition TV (HDTV) in the United States.

Cameras with Pan-Tilt-Zoom. Applications such as surveillance or video conferencing require manipulation of the camera’s viewing pose. To address this need, several video camera manufacturers have integrated pan-tilt-zoom (PTZ) capabilities into the camera unit (e.g., the Sony D30 and Nisca). Other advanced camera systems for video have integrated microphone arrays and signal processing to create a control signal for pointing or steering the PTZ camera to the pose corresponding to the speaker’s location within the conference room (PictureTel and Vtel).

Medical Imaging. Medical IPE are used to diagnose and treat patients. Several modalities, like MRI, ultrasound, and X-ray CT (computed tomography), are used to detect features within patients or to measure objects with either still or moving image sequences. For example, MRI can provide images for interpretation of the presence of disease or for guidance of radiation therapy. Medical imaging applications require extreme accuracy and repeatability, with no artifacts induced by the equipment. The acquired images can be 2-D slices (cross sections) or images of the body, with pixels corresponding in some way to the density of the object.
Processed images are either projected images of this acquired 2-D data or digitally reconstructed views of projections throughout the volume of data. Magnetic resonance imaging (MRI) builds a cross-sectional image of the body by first aligning the spin axes of the body’s protons (hydrogen nuclei) with a large electromagnetic field. Radio waves then knock the protons out of alignment. As the radio waves are removed, the protons realign to their “normal” spin orientation, emitting radio waves that are finally detected by an RF scanner. An image processor then creates a cross-sectional image of the body, and repeated scanning in succession forms a 3-D volume. The image processor performs filtering, false coloring by mapping intensity to color, and volume rendering tasks. Ultrasound imaging passes harmless ultrasonic sound waves through the body and uses receiving transducers to detect internal reflections, which are then composed into an image in real time. Again, improvements in image quality are obtained with spatial and temporal filtering. Today’s research includes the use of these imaging
Fig. 3. Camera and scanning equipment with their elements, each including a CCD imager.
systems to actually manipulate or target the delivery of drugs or genes with combinations of ultrasound and MRI. Digital X-ray images are produced with amorphous-silicon large-area sensing technology (see DpiX at http://www.dpix.com) integrated with a video camera to pick up the real-time fluoroscopic image. While these can be used for medical applications, other industrial and scientific applications (like welding) are popular. The medical profession has historically utilized film radiology, coupled with scanners for digitization.

Scanners. Scanners represent a class of camera equipment that captures images by effectively sweeping a sensor (linear or 2-D) across an object, capturing sequences of data which are later reconstructed to form the actual image. The image produced by a scanner can be as simple as an image of a page, as complex as a full 3-D volumetric image of one’s body, or as large as a high-resolution view of the entire surface of the earth.

Page Scanners. Xerox led the development of an entire paper-scanning industry based on the flat-bed scanner and printer in one unit, the ubiquitous photocopier. Common in most businesses and many households today, page scanners are manufactured by several companies (e.g., Hewlett-Packard, Umax, and AGFA). Figure 3 shows the common elements of a flat-bed page scanner. These elements include a linear CCD imager, a scanning mechanism, and a buffer that holds the scanned image. For applications like desktop publishing, manufacturers integrate paper control and advanced CCDs, creating a range of handheld to desktop scanning devices with about 600 dots per inch (dpi) up to 1200 dpi, and 30 to 36 bits of color. Several scanners also incorporate resolution enhancement functions, such as interpolation of the images, to increase the apparent resolution of the scanned image by up to a factor of 10. CCDs have typically been used for scanners due to their relatively low cost and the simplicity of creating extremely high resolution in one linear dimension. CCDs are limited only by the size of a silicon wafer (8 or more inches at 1000 pixels per inch) and packaging technologies.

Airborne Scanners and Machine Vision. Large-scale, high-resolution terrain sensing is achieved by taking advantage of the extremely high-resolution detection possible with scan-line sensing. In airborne scanning, with a known ground speed, sweeping a landscape with 2-D imaging arrays generates multiple samples of the scene with only a fixed time delay between each row. These time-delay and integration (TDI) techniques have the benefit that repetitive samples of the same pixels are available for averaging to remove sensing noise, thereby improving the quality of the sensed scene. Similar TDI techniques are used to inspect objects moving at high speeds on a conveyor belt in machine vision applications such as lumber processing.

Infrared Cameras. Cameras that image by detecting the heat of the source, or infrared cameras, have numerous applications for security, military, or night surveillance situations. Photo-multiplier tubes and CCD arrays sense the near-infrared spectrum. Far-infrared detectors have the distinct advantage of imaging through fog. Early far-infrared cameras were devised by mechanically scanning a linear array of IR-sensitive detectors (e.g., mercury-cadmium-telluride and lead-tin-telluride), as described in Ref. 7. For proper operation, these detectors
were cooled to near 100 K with cryogenic coolers. More recent infrared cameras utilize 2-D arrays of detectors, with a resultant shape and size similar to conventional visible-light cameras. These thermal cameras are used in electronics manufacturing to find hot, defective circuits and for detection of heat loss in buildings.

3-D Cameras. Medical imaging, product design, and Internet 3-D applications require capture of 3-D images. 3-D systems include those that are (1) true 3-D, which acquire multiple gazes for stereoscopic display, (2) real 3-D, which capture volumetric data for holographic display or other 3-D displays, and (3) non-true 3-D, which capture a 5-D plenoptic function (color as a function of position and pose), with a succession of 3-D gazes [see (8)]. Real 3-D sensing requires either passive or active scanners (like lasers or MRI) to capture the data. True 3-D and non-true 3-D data are captured with passive cameras, which in some way capture multiple views of the scene. This can be done with special optics for parallax views or with successive repositioning of the camera. The “TriForm” 3-D Image Capture System uses structured light, illuminating the object with patterns of known location and dimension (like a grid), coupled with a true-color CCD imager to collect the object’s surface texture. A non-true 3-D camera produced by researchers at Columbia University, called the Omnicam, creates 360° full-motion video using a curved mirror and glass dome about the size of a large grapefruit. Some manufacturers have produced similar 3-D cameras. As discussed in a later section of this article, the image is later warped to correct for geometric distortions and to allow the viewer to select an arbitrary view 360° around the camera.
Image Computing Equipment

Image computing equipment (ICE) takes raw (or previously processed) images from sensors (or stored images) and performs one of a variety of image processing functions like filtering for enhancement and recognition of features. (Another section of this encyclopedia contains descriptions of these functions and algorithms.) This equipment includes a wide range of computing equipment, both hardware and software. Supercomputers, PCs, workstations, or standalone equipment, some with specialized processing chips or DSP (digital signal processor) chips, perform these tasks. The properties manipulated with ICE include the size, aspect ratio, color, contrast, sharpness, and noise content of an image. In addition, features or properties of the image can be determined with ICE, including the number of objects, the type of the objects (text vs. photograph of a scene), the motion of the objects, and scene changes. With ICE, images or the objects within the images can be blended or mixed with others to create new images. The computing needs of personal computers and workstations continually increase as users demand higher-quality video, graphics, and audio, using faster networks. For example, the simple word processors and spreadsheets of the last decade were replaced with later versions that automatically check punctuation and grammar, on a formatted page, with voice annotation, all as one types. This was possible due to a substantial increase in performance of the underlying computing technologies within the computer. In the future, 3-D image rendering will extend from today’s screen savers and 3-D games to word processors that feature a 3-D representation of the document moving in virtual space, all consuming more and more resources. While the computer’s host processor continues to increase in computing capability by Moore’s law (about 2× every 18 months) to address some of these needs, other processing chips like image processors, media processors, DSPs, and ASICs offer many advantages.

PC and Workstation Image Computing Equipment. Image computing systems commonly use a personal computer (PC) or workstation as a base platform. These IPE include three elements: (1) a high-performance PC, (2) imaging hardware, and (3) image processing software. The high-performance PC includes a host processor, usually with an image computing acceleration processor such as a DSP, storage [random access memory (RAM) and disk], networking, and a high-resolution display. Imaging hardware includes an image capture board, camera, digital video tape, scanner, or other imaging hardware. Image processing
software includes the code to drive I/O (input/output) equipment, such as the camera, and the code that instructs the processor to manipulate the images.

Several elements of a computer system determine its suitability as image computing equipment. The processor or processors must have high speed (performing billions of operations per second), with high bus bandwidth (up to hundreds of megabytes per second) and efficient I/O. The computer’s operating system must support real-time scheduling, with multiple processing tasks or threads. The software tools should include application development tools, graphical user interfaces (GUIs), and windows for concurrent presentation of code, debug information, and images. Memory needs vary as a function of the application, with several megabytes of semiconductor memory and gigabytes of disk memory a minimum today. Networks for communication of images or video should have very high bandwidth (a minimum of 64 kbps), with low latency and jitter. The peripheral interfaces should support industry standards (e.g., IEEE-1394) for direct connection to cameras. The displays should support high resolution (greater than 640 × 480), high frame rate (60 to 85 fps), and color (24 bits).

Image/Video Capture. Paramount to IPE is the capture of the image or image sequence from some source, namely a camera, video tape, laser disk, or DVD. Since most of these sources utilize analog video interfaces, a video capture board is necessary to decode and convert NTSC or PAL analog signals to digital images. The video capture function includes synchronization to the input video using a sync stripper, filtering for proper extraction of chroma and luma, and possible color space conversion. Many chips are available to perform these functions (including Brooktree at www.brooktree.com and Philips at http://www.philips.com). Several commercial and consumer equipment manufacturers produce computer interface boards or boxes for video capture, including the Matrox MeteorII (http://www.matrox.com) and ATI All-in-Wonder Pro (http://www.atitech.com). Video and image capture cards are available for standard interface buses (VME, ISA, VESA, S-bus, and PCI), making them available for the majority of computers. Some computers, such as versions of Apple’s Macintosh, already contain this capture function in the base computer.

Analog Video Interfaces. The video industry adopted a series of international standards for interconnection of video equipment. In the United States and Japan, the NTSC and RS-170 standards identify a composite video signaling representation that encodes luminance (luma), chrominance (chroma), and timing into a single channel. The S-Video method has been adopted for higher-quality separation of luma and chroma data, achieving a higher-resolution picture without the mixing of the 3.58 MHz (US) color carrier signal into the luminance space. The artifact created by mixing these signals is seen as dots crawling on colored edges of the video. PCs and workstations have adopted RGB (with the sync signal on green) and RGBS methods, each with negligible artifacts.

Digital Video Interfaces. Several alternative industry standards have emerged for digital interfaces between cameras, video storage, video processing equipment, displays, and PCs: Universal Serial Bus (USB), IEEE-1394, and VESA. IEEE-1394 supports two compressed-video types, 5:1 and MPEG. Professional-level video products use standards like RS-243 or D1 serial digital video buses.
In addition, several video processing chips support the CCIR-656 digital-signaling format, which represents a CCIR-601 resolution digital image and sync information. Even with the advantages of serial digital video buses (described previously), analog video sources will exist for many years to maintain backward compatibility.

Image Processing Application Software. The processing of these captured images or sequences is accomplished with some combination of the host processor (Intel Pentium, AMD K5, DEC Alpha, SGI MIPS, etc.), attached DSPs, and/or fixed logic. For host-based processing, many off-the-shelf software packages are available, supporting popular operating systems (e.g., Unix, Windows, and Mac-OS). Popular image processing software includes:
(1) Photoshop for general-purpose still image manipulation and enhancement (from Adobe at http://www.adobe.com)
(2) KPT Bryce 3-D for image rendering with texture mapping (from MetaCreations at http://www.metacreations.com)
(3) Khoros for general image processing (from Khoral Research, Inc. at http://www.khoral.com)
(4) Debabilizer for image manipulation or format conversion (from Equilibrium at http://www.equil.com)
(5) Kai’s Power Goo (KPG) for image morphing (from MetaCreations at http://www.metacreations.com)
(6) KBVision and Aphelion libraries and software development environment for image processing and image understanding (from Amerinix Applied Imaging at http://www.aai.com)
(7) Gel-Pro Analyzer for biology applications (from Media Cybernetics at http://www.mediacy.com)
(8) Image-Pro Plus software toolkit for image capture and analysis functions (from Media Cybernetics at http://www.mediacy.com)

These software packages offer a wide range of image conditioning and manipulation functions to the user. In addition, each manufacturer of image capture boards (or frame grabbers) and processor boards provides image processing software.

Video Editing and Computing Equipment. Studio and professional applications of video editing use a plethora of video editing equipment. While computing of the video images can be performed at any speed off-line, the presentation of the processed images must appear at video rates to properly see the result. PCs and workstations (from Sun and Silicon Graphics) coupled with software packages like Media Compositor (Avid Technology) provide complex editing solutions. They utilize compression techniques, such as MPEG, to fit the large sequences on conventional PC disks and MPEG decoders to present the results. As an alternative, high-end professional digital video disk systems (e.g., Abecus or Sierra Systems digital disk recorder DDR systems), video memory systems (e.g., Sony and DVS), and digital VCRs (e.g., Sony’s D1 format recorders) provide frame-by-frame manipulation of digital CCIR-601 video frames. This equipment enables compositing or processing of video frames off-line in the host computer and later playback in real time on the DDR, VMS, or VCR. With these techniques, the algorithm developer can visualize the effects of any image or video-computing algorithm simulated on a workstation in high-level software prior to investing development resources to port the algorithms to real-time processing on DSPs or other custom hardware. For the home or small business, several PC-based solutions have been introduced for manipulation of video clips. With these packages one can transform many hours of home videos into condensed video programs. Apple Computer Inc. has produced AV Macintosh computers with video inputs and outputs, coupled with compression technology (QuickTime), to feature digital video editing. The “Avid Cinema” software package for the Macintosh creates an effective video editing user application (Avid Technology). Also, certain VCRs (Sony) include PC software, when used with appropriate camcorders, to index and edit home movies. Both infrared and LANC (or “Control-L”) interfaces are used to control the VCRs and camcorders. PC add-in boards (AV Media, ATI, Matrox, Videonixs, Sigma Designs, and Real Magic) provide MPEG encoders and decoders to enable this editing application. Finally, software-only solutions for PCs have emerged (the SoftPEG MPEG player from Zoran/CompCore at http://www.zoran.com and XingMPEG from Xing at http://www.xingtech.com). This approach has been limited by either lengthened editing times or reduced image quality, partly due to the limitations of MPEG-1.
However, real-time software-only versions of MPEG-1 encoding and decoding exist today on 300 MHz PC systems. With technology improvements, including faster processor chips, this trend toward software-only video editing and computing will continue.

Processor Chips—Image Processing, Media Processing, DSP and Coding Chips. A wide range of processors with specialized architectures and custom chips have been developed for image computing equipment [for details see (9)]. Multimedia processor (MP) and DSP chips, such as the Texas Instruments TMS320C80 and Philips TriMedia processors, can address the computational needs of image computing. The programmable nature of the multimedia processors, coupled with architectural optimizations for the manipulation of images and clock speeds of several hundred megahertz, creates ideal technologies for IPE. All applications become “soft,” with the advantage of long-term adaptability to changes in requirements, algorithms, data sets, or
standards. However, the degree of programmability and flexibility with respect to these new requirements and applications varies greatly with the architecture and software development tools associated with each solution. In addition to media processors and DSPs, special enhancements to general-purpose processors, like the Intel Pentium processor’s MultiMedia extension (MMX), AMD K6 processors, DEC Alpha processors, and Sun SPARC processors’ VIS instructions, offer speedup of imaging functions in those environments. Table 1 illustrates several currently available image processing and coding chips. Technology limitations to overall performance are based on chip architecture, use of parallelism, and process technology. In addition, software development tools have a dramatic impact on the efficient use of the processor and the development time when utilizing the processors [for details see (10)]. While a particular processor may have exceptional speed at performing a particular function, the user’s ability to change the underlying algorithm depends on the capabilities of the tools. A good C compiler that automatically finds parallelism, reorders instructions, allocates registers, and unrolls loops can greatly accelerate the development of imaging software for the processor.

Programmable DSP and Media Processors for Imaging and Video. Several DSP and MP chips have been designed to meet the computational demands of image and video processing [see (11)]. DSP processors were originally developed in 1982 to satisfy the demands of a digital finite impulse response (FIR) filter or convolution. Specifically, their internal architecture contains key functional units enabling very fast multi-tap filters. With parallel units performing adds, multiplies, and complex data addressing in a single cycle, they efficiently perform the following operation:

y(n) = Σk h(k) x(n − k),  k = 0, 1, …, N − 1
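The C loop below spells out this multiply-accumulate operation in its generic form; it is a plain reference sketch with assumed array and variable names, whereas a DSP executes each inner-loop iteration as a single-cycle multiply-accumulate with hardware-updated operand addresses.

/* N-tap FIR filter: y(n) = sum over k of h(k) * x(n - k). */
void fir_filter(const float *x, float *y, int num_samples,
                const float *h, int num_taps)
{
    for (int n = num_taps - 1; n < num_samples; n++) {
        float acc = 0.0f;
        for (int k = 0; k < num_taps; k++)
            acc += h[k] * x[n - k];   /* one multiply-accumulate per tap */
        y[n] = acc;
    }
}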
While these original devices excelled at audio signal processing applications, the cycle bandwidths required for image computing were not seen until early in the 1990s. Further extensions evolved from that basic DSP architecture to perform multidimensional processing of images (more registers), video processing (higher I/O and processing bandwidths), and graphics processing (with bit-level operations)—or media processing. With programmable multidimensional filters possible, many new applications could use these chips. From Table 1, the TMS320C80, TriMedia, and MPACT media processors possess exceptional computational power, yielding between 2 and 20 billion operations per second (BOPS) for image and video functions. Simple functions like image math on standard-size images can require a few operations per pixel on these processors, corresponding to a few million instructions per second at video rates (computed as image size times frame rate times operations per pixel). Complex filters, geometric warps, and complete codecs can require many BOPS on these processors. The same functions on general-purpose processors could take 10 to 50 times the number of cycles of a DSP or MP. As shown in Fig. 4, these processors use architectures with large register files, instruction and data caches, and arithmetic, logic, and multiply units to achieve speedup for image processing. These functional units are partitionable during execution to support simultaneous processing of multiple bit, byte, or word sizes, depending on the algorithmic needs. The capabilities and interconnection of these units have been optimized for classes of image computing including filters, transforms, morphology, image algebra, and motion computation. Three overarching advantages exist for DSP and MP approaches to these problems. First, their programmable nature maps well to the evolution of new and unique algorithms, which appear as software solutions. For example, new approaches to video coding, such as video coding based on image warping and nonrectangular DCTs [see (12)], can be implemented in a programmable chip. They are also used to implement evolving standards like MPEG-4 [see (13)], which prevents costly hardware redesign each time the standard proposal changes. Second, these DSPs and MPs can accelerate time-to-market over hardware approaches whereby custom logic chips like full-custom devices, PALs, or FPGAs (from Xilinx or Altera) are designed or programmed. This assumes that the task of programming is not daunting. Third, programmable solutions enable multithreaded operations
Fig. 4. This hardware diagram illustrates the elements of an image computing system. A central DSP or media processor (MP) can work in concert with a host computer to process images or video. Input, capture, display, output, and memory chips are also typically used. The DSP or MP features many internal registers, ALUs, and caches to achieve computational performance.
or the repeated switching between several independent tasks. With this feature, application developers can dynamically balance the loading of the processor as a function of demand, supporting computational variation for interactive applications and enabling graceful degradation of less important functions. For example, if video quality becomes vital during a conference, the audio quality or frame rate could be lowered to allocate more cycles to video processing for that portion of time. In conclusion, the impact of using these media processors includes realization of a flexible platform for rapid development of image processing algorithms, a wider range of practical algorithms with several BOPS of performance, and proprietary control of new algorithms in software (not dedicated to hardware).

Other Image Processing and Coding Chips. In applications where standard functions exist, such as MPEG-2 video compression, custom or hardwired solutions find use in cases where that may be the only function to be performed. Hardwired solutions can offer efficient use of transistors and low cost with proper design (a time-consuming task). Many consumer products use these hardwired solutions for that reason. However, use of these hardwired devices can be so restrictive that system performance goals cannot be met. For example, the application may require higher pixel precision than that provided by the manufacturer of the hardwired chip. One can typically alter a programmable solution with software to satisfy the requirement. Other solutions, such as the CLC4000 series and VCP (C-Cube and 8×8, Inc.), include fixed-datapath processors, combining either dedicated logic function blocks or micro-instruction function blocks that are hand-coded in assembly. This datapath architecture limits overall efficiency to only a select class of algorithms, since the silicon devoted to those functions must always be kept busy with those functions to justify their cost. However, for specific applications, those solutions can have a cost advantage over programmable solutions. (See the web site http://www.visiblelight.com/mpeg/products/p tools.htm for a more complete list of these MPEG processors.) Typically these hardwired chips perform only the coding and/or decoding portion, depending on other back-end or front-end chips to pre- and post-process the video (e.g., for conditioning, color space conversion, and scaling) and to align it with digital audio data. Companies like S3, Brooktree, TI, Pixel Semiconductor, and Cirrus
Logic provide chips for functions like dithering and color space conversion in scalable windows. Unfortunately, the scaling and filtering differ as the application switches between standards (MPEG-1 to MPEG-2) and resolutions (source or display). With advanced media processors these functions can fit into the processor along with the codec, simplifying design and minimizing costs.

MVP: An Architecture for Advanced Image and Video Applications. The multimedia video processor (MVP), or TMS320C80, is well suited for image and video applications, such as those required for IPE [see (14,15)]. This single-chip parallel media processor performs over 2 billion operations per second. Moreover, key architecture considerations make this processor ideal for imaging applications. The architecture was tuned for algorithms like MPEG, JPEG, 3-D graphics, DCTs, motion estimation, and multi-tap, multidimensional filters. Several manufacturers of computer add-in boards developed solutions featuring the TMS320C80’s image processing abilities, including Ariel, Loughborough Sound Images (LSI), Matrox, and Precision Digital Images (PDI). The elements within one MVP include four advanced DSP (ADSP) parallel processors (PPs), an embedded RISC master processor (MP), instruction/data caches, floating-point capability, a crossbar switch to connect the many on-chip memory modules, an I/O transfer controller (TC), SRAM, DRAM, and VRAM control, pixel/bit processing, a host interface, and JTAG/IEEE-1149.1 emulation support. The keys to the MVP’s high performance include an efficient parallel processing architecture. It uses a tightly coupled array of heterogeneous processors, each of which performs complete operations in a single cycle (but with some internal pipelining), and each executing a different instruction stream. This architecture forms a specialized MIMD (multiple instruction, multiple data) architecture to achieve high performance in image computing. The individual ADSPs were optimized for fast pixel processing by tuning their parameters for image, video, and graphics processing. Features like certain combinations of subtract followed by absolute value in the same cycle and bit flags to control operations are keys to its performance. In addition, the chip uses intelligent control of image data flowing throughout the chip via a crossbar connection to each processor’s memory, accessible in one cycle, and a complex I/O processor. This I/O transfer controller is capable of directly accessing memory as multidimensional entities in external or internal memory (e.g., strips of data anywhere in memory), relieving the internal processors from the burden of computing every pixel’s address in memory. Finally, the MVP contains most functions needed in a single chip, forgoing slower chip-to-chip communications for several applications. With this collection of elements already contained in the chip, very compact and cost-effective digital image computing systems have been constructed. Reductions on the order of hundreds of components to one have been realized. Examples include a fingerprint identification system, an ultrasound imaging system [see (11)], and a group video conferencing system integrated into a CRT display (Sony).

Programming Media Processors and VLIW. Early media processing chips like the 320C80 required detailed assembly coding to port imaging algorithms onto the processor to achieve peak speeds. C compilers could create code for each of the multiple internal processors.
However, to fully utilize the power of the processor, one manually partitioned algorithms across the processors without the help of the compiler. This would consume many months of design by programmers very skilled at coding in assembly language. The root cause of the complexity of programming these MPs stems from the architecture itself, in which the programmer must consider optimal usage of many restricted and orthogonal resources (e.g., different combinations of register usage and functional data paths). However, with the use of fixed latencies for each operation, like a conventional DSP, the programmer could then systematically piece together a solution. More recent MPs feature very-long-instruction-word (VLIW) architectures (TriMedia in Table 1). In this case, all functional units and registers comprise a large pool of resources that are automatically allocated by a compiler. C compilers improve the programmer’s ability to map the algorithms onto the processor by automatically performing instruction selection, scheduling, and register allocation, and by dealing with the unique latencies and any specific load/store dependencies. A good compiler will find parallelism, fill delay slots, and accurately map the application code to run efficiently on the processor. However, coding efficiency, or the
resultant speed of the processor when using compiler-generated code, may not be as good as with detailed assembly coding. Fortunately, compiler technology continues to improve.

Image-Adaptive Algorithms with Programmable Processors. The computational efficiency of media processors can open new applications to the system designer, including image pre-processing, scene change detection, object tracking, and recognition. For example, when using a programmable solution, data-adaptive routines such as the inverse DCT will have reduced computational complexity if the program ignores zero coefficients, saving adds and multiplies [see (16)], as sketched below. With the ability to perform image-adaptive functions, system developers can create new applications, such as scene content processing, adaptive contrast, and real-time video object overlay, with these media processor chips. Developers can use real-time object detection and image warping for insertion of region-specific information like text in local languages, news, weather, or advertisements directly into the scene. For example, telecasts of ball games can use video messages inserted directly onto the ball field, onto a player’s back, or on actual billboards that line the field but with different messages for different broadcast regions.
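The sketch below illustrates the idea with a direct (nonfast) 8 × 8 inverse DCT in C that simply skips zero coefficients; the function name and single-precision output are assumptions made for clarity rather than a production codec routine.

#include <math.h>

/* Direct 8x8 inverse DCT that accumulates only nonzero coefficients.
 * For the sparse blocks typical after quantization, the work done is
 * roughly proportional to the number of nonzero coefficients. */
void idct8x8_sparse(const float coef[8][8], float block[8][8])
{
    const double pi = 3.14159265358979323846;

    for (int x = 0; x < 8; x++)
        for (int y = 0; y < 8; y++)
            block[x][y] = 0.0f;

    for (int u = 0; u < 8; u++) {
        for (int v = 0; v < 8; v++) {
            float F = coef[u][v];
            if (F == 0.0f)
                continue;                    /* the data-adaptive saving */
            double cu = (u == 0) ? 0.70710678118654752 : 1.0;
            double cv = (v == 0) ? 0.70710678118654752 : 1.0;
            for (int x = 0; x < 8; x++)
                for (int y = 0; y < 8; y++)
                    block[x][y] += (float)(0.25 * cu * cv * F
                                   * cos((2 * x + 1) * u * pi / 16.0)
                                   * cos((2 * y + 1) * v * pi / 16.0));
        }
    }
}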
Video and Image Compression Equipment

Video and image compression equipment utilizes digital compression methods to reduce storage and/or communication bandwidth requirements. This equipment includes videophones, group-room video conferencing systems, and entertainment video distribution equipment such as digital broadcast satellite (DBS) systems. These products can create stunning digital pictures for applications from telemedicine to home theater entertainment, with video decoders and encoders at each site. Video conferencing systems can offer lifelike interaction with pictures of persons in remote locations. One such application, teledining, creates a virtual dining table for simultaneous dining with others in cities around the world. In contrast, DBS systems have a few centrally located, high-quality digital encoders preparing TV programs for “wide-scale” distribution, a satellite delivery network, and millions of inexpensive digital video decoders, one or more in each subscriber’s home, with minimal interactivity.

Video Conferencing Systems. Video conferencing systems include (1) consumer-oriented videophones like the ViaTV (produced by 8×8 Inc.) or the C-Phone, (2) PC desktop systems, and (3) group-room systems like the Concorde (produced by PictureTel Inc.), as described in Refs. 17 and 18. Video conferencing systems completely integrate image sensing, computing, and display subsystems. They also include full codecs (encoders and decoders) with point-to-point network connection. To place video calls between several sites, a multi-point control unit (MCU) serves as a centralized codec, decoding each line and continuously selecting and distributing the video associated with the most active speaker to all participants. Each audio channel is simply summed and redistributed to all callers. These video conferencing systems typically communicate via telephone lines, including packet-switched networks (using ITU H.323) and circuit-switched networks (using ITU H.320), as described in a later section of this article. Figure 5 illustrates uses of these various systems.

DBS Systems and MPEG-2 Encoders. Other digital video and image coding systems include DBS and MPEG-2 encoding. DBS systems exist worldwide, including, in the United States, the Direct Satellite System (DSS) by RCA or Sony (or DirecTV) and Echostar’s satellite dish systems, PerfecTV in Japan, and Canal+ or TPS in France. While DSS methods have typically been used to deliver digital-quality video to homes, recent products also offer Internet access via a low-speed phone connection and high-speed multiplexed data distribution via the satellite. Higher-quality MPEG-2 encoding systems are used for compression of the video with DBS systems (Compression Labs Inc. and DiviCom Inc.) or other PC-based processor boards (produced by Minerva and Optibase).

Internet Streaming Video. As shown in Fig. 5, another application centers on streaming video media to clients via local intranets or the Internet. Programs or movies can be downloaded from a server and played to multiple viewers with multicasting techniques. Local area networks (LANs) may have sufficient bandwidths for
Fig. 5. Digital compressed video equipment utilizes a variety of network infrastructures to deliver video/image services to many clients. Video and image compression equipment exchange compressed digital image and video bitstreams via satellite, packet networks, private line networks, or switched circuit networks, accessed by either the central office in plain-old-telephone-service (POTS), Internet protocol (IP) servers, or routers.
this application, but web-based video streaming must cope with even lower bandwidths and with the difficulties imposed by unpredictable slowdowns or interruptions of service. This application demands very highly compressed video, dynamic buffering, and decoding to present a viewable sequence. In a streaming application, users view the beginning of the movie while the remainder of the clip downloads in the background. The buffering tends to smooth out variations in transmission rates. RealVideo from Progressive Networks (http://www.real.com) and VivoActive (http://www.vivo.com) are the most popular video streaming solutions today. However, video quality suffers with today’s compression technology, typically achieving only 160 × 120 pixel frames at 3 to 5 fps.

Video and Image Coding Standards. The industry standards for video and image compression (ITU H.261, H.262, H.263, JPEG, JBIG, and ATSC’s HDTV) have advantages for specific implementations or markets. Table 2 shows how these compression standards differ. Standards are key to many applications like video conferencing. The interoperability that results from use of standards encourages consumer acceptance of the technology. A legacy of over 50,000 H.261 group video conferencing systems is installed throughout the world, encouraging system developers to keep backward compatibility with those systems. However, while all H.261 systems can interoperate, the quality of the video varies as a function of the manufacturer’s implementation of the standard. H.262 (MPEG-2) excels when higher channel or line rates (>2 Mbit/s) are used, especially with interlaced video. With enhancement modes like overlapping motion vectors and PB frames, H.263 has the highest coding efficiency, which is especially useful at low bit rates (e.g., 20 to 30 kbps using H.324 over POTS) but also beneficial at higher rates (1.5 Mbit/s). New enhancements, called H.263+, continue this trend of image coding efficiency improvements or increased perceptual quality, as described in Ref. 19 and by the University of British Columbia (at www.ece.ubc.ca/spmg). However, most of these H.263+ enhancements come at the expense of additional hardware. Numerous nonstandard algorithms are in use today, like CTX+ (from CLI), SG4 (from PictureTel), wavelet, and fractal techniques; however, interoperability issues limit their widespread acceptance.
These proprietary methods can have value in closed systems like national defense or scientific applications. Unfortunately, the uniqueness of these algorithms drives special hardware requirements. In contrast, implementation of mainstream codecs presents a wider range of chip solutions and lower implementation costs.

Designing High-Quality Digital Compressed Video. In applications where the quality of the compressed picture is paramount, numerous system considerations are important. Proper system design increases the perceptibility of the video and reduces annoying artifacts, particularly when communicating with limited bandwidth. These design considerations include square versus rectangular pixels, pixel clock jitter, system noise, image scaling, interlace-to-progressive conversion, frame rate, gamma, brightness, contrast, saturation, sharpness, artifact reduction, and combining graphics with video.

Design Consideration for Square Versus Rectangular Pixels. While digital cameras can reduce system costs by eliminating the analog-to-digital converter, they can also create complications in the use of the data. If a camera uses square pixels, such as with 640 × 480 resolution, horizontal scaling must be performed in an H.261/H.263 video conferencing system to meet the 352-pixel horizontal resolution (in addition to vertical scaling to conform to the 288-line common intermediate format, CIF, vertical resolution). Performing this scaling will increase the software or hardware costs of the codec and reduce performance somewhat due to scaling artifacts or blurring. In addition, some digital cameras produce only 320 × 240 pixel images due to cost limitations and bandwidth constraints of the interface bus (e.g., Universal Serial Bus, USB, at near 12 Mbit/s), demanding image scaling when used with standard video conferencing.

Design Consideration for Sharpness and Compressed Artifact Reduction. At lower bit rates, the MPEG and H.26× compression process will exhibit annoying block and mosquito artifacts. These artifacts can be removed with video postprocessing. The 8 × 8 blocks are a result of the discontinuity of the image as it is broken into a mosaic. Mosquito artifacts occur due to coding errors in the high-spatial-frequency regions of the image, such as on edges. These edge artifacts change from frame to frame, creating an effect of dots dancing around edges. In addition, the compression process can reduce the overall sharpness of the image. Simply sharpening the image will inappropriately enhance the mosquito and block artifacts. Edge-preserving mosquito and block filtering followed by image sharpening will improve the performance, as described in Ref. 20.

Design Consideration for Pixel Clock Jitter. During quantization of analog video or NTSC/PAL/SECAM video decoding, noise can be created by jitter of the pixel clock. This can waste bits by compressing false
signals. If greater than 10 ns, this jitter can create temporal differences that will be interpreted as frame-to-frame movement by the encoder, reducing overall coding efficiency.

Design Consideration for Low-Level Imaging System Noise. Noise can be injected into the imaging system from several sources, including random and 1/f noise, quantization noise, or finite-precision effects. Temporal filtering, as shown below, can improve coding efficiencies by an order of magnitude but can also create artifacts: rapid motion will result in a somewhat blurred image.
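A first-order recursive filter is one common form of such temporal prefiltering; the C sketch below, with an assumed blending weight alpha, averages each incoming frame into a running result, which suppresses temporal noise at the cost of the motion blur noted above.

#include <stdint.h>

/* Recursive temporal noise filter: state = alpha*in + (1 - alpha)*state.
 * Smaller alpha gives stronger noise reduction but more motion blur.
 * 'state' holds the previous filtered frame and is updated in place. */
void temporal_filter(const uint8_t *in, float *state, uint8_t *out,
                     int num_pixels, float alpha)
{
    for (int i = 0; i < num_pixels; i++) {
        state[i] = alpha * (float)in[i] + (1.0f - alpha) * state[i];
        out[i] = (uint8_t)(state[i] + 0.5f);
    }
}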
Design Consideration for Image Scaling. Converting typical camera resolutions of 720, 640, or 320 horizontal pixels, with 480 (US) or 576 (Europe) vertical rows or lines of pixels, demands the use of scaling to interoperate with other systems in the common intermediate format (CIF), 352 × 480, or other display formats. Conversion of the US’s 480 lines to the 288-line CIF resolution will require a 10-line-to-6-line conversion. Simple nearest-neighbor algorithms produce distorted outputs with harsh aliasing effects for line drawings (21). Bilinear interpolation filters alias the detailed content of the image. Use of anti-aliasing filters or higher-order polynomial filters will correct many of these artifacts, although at considerable implementation cost (22).

Design Consideration for Interlace Versus Progressive Scanning. An interlaced image results when a single image is split into two sequentially transmitted sub-images or fields, each of which has only the odd or even lines. Depending on the system design, interlacing can create a temporal discontinuity in the sequence of images. It is usually manifested as every-other-line jagged edges in the parts of the scene undergoing motion. Today interlaced video (driven by the conventional television broadcast standards) leads the industry but creates complications for some international video compression standards, which were based on the use of progressive images. A simple approach to the problem, namely dropping a field on input, reduces the image’s resolution. Artifacts include an “aliased” appearance for images with nearly horizontal line structure. For example, the corners of a room or fluorescent lights appear to have stair steps on their edges. Converting the two fields to a progressive image prior to compression yields the best results but requires complex and expensive motion-compensated interlace-to-progressive methods. New digital television standards support both progressive and interlaced methods.

Design Consideration for Frame Rate. Most cameras produce 30 frames per second video, matching the format of TV displays. Feature film is typically created at 24 fps (see http://www.smpte.org). Telecine equipment (e.g., produced by Rank) uses a technique called 3:2 pull-down to convert 24 fps film to 30 fps. In essence, three field samples are taken of one input 24 fps frame to create three output fields (two of which make one output frame), and two field samples are taken from the next input 24 fps frame to create the next output frame. One visual problem created with 3:2 pull-down methods manifests itself as a stutter of the objects in the video when panning quickly from side to side. Coding efficiency can improve if the encoder recognizes the “repeated” fields and does not code them.

Design Consideration for Display Effects: Gamma, Brightness, Contrast, and Saturation. Prior to the display of decompressed video, significant adjustment may be needed to compensate for differences in the display used. The amount of processing depends on the characteristics of the display and the characteristics of the source material. For example, when a video sequence is played on a computer’s RGB display, the difference in display gamma from a conventional NTSC TV CRT will most likely result in a lack of visible detail in dark scenes. Gamma correction will remedy this problem by adjusting the intensity of each pixel to compensate for the difference. This can be accomplished by using a simple lookup table.
Likewise, adapting to the video signal’s setup level, color demodulation errors, and gain or linearity problems in the system’s amplifiers can greatly enhance the picture quality.
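A lookup table of this kind can be built once per display setting and then applied per pixel; the C sketch below assumes 8-bit samples, and the exponent passed in would be chosen from the ratio of the source and display gammas.

#include <math.h>
#include <stdint.h>

/* Build a 256-entry table mapping 8-bit intensity through a gamma
 * adjustment, then apply it to every pixel of a frame. */
void build_gamma_lut(uint8_t lut[256], double exponent)
{
    for (int i = 0; i < 256; i++)
        lut[i] = (uint8_t)(255.0 * pow(i / 255.0, exponent) + 0.5);
}

void apply_lut(uint8_t *pixels, int num_pixels, const uint8_t lut[256])
{
    for (int i = 0; i < num_pixels; i++)
        pixels[i] = lut[pixels[i]];
}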
Design Consideration for Combining Graphics and Video. Several problems occur when combining graphics and video. Video is typically overscanned on the TV by more than 4% to keep the viewer from seeing blanking signals on the screen when the set has aged to the point that the scanning electronics effectively shrinks the picture. Proper scaling of the image or cutting out blanking signals is key to a clean edge on the picture. When placing RGB graphics on an NTSC display, proper treatment must be given to the color gamut differences. For example, white text will appear pinkish on an NTSC monitor if not properly corrected. Blending of graphics and video can also pose a problem in terms of aliasing of lines. Finally, graphics can flicker in video displays due to interlacing effects without proper adjustment.

Video and Image Quality Validation. While the design of video and image compression equipment presents many challenges, perhaps the greatest remains the determination of picture quality in compressed image applications. The most common metric used to measure image quality has been the peak signal-to-noise ratio (PSNR). PSNR is calculated by comparing the processed image to the original or reference image as follows:
PSNR = 10 log10 (255² / MSE) dB

where MSE is the mean square error between the processed and reference images. PSNR has many limitations, since its value does not necessarily drop more for errors that a viewer can easily see than for errors that may be masked by the human visual system. A test equipment manufacturer (Tektronix at http://www.tek.com) has produced the PQA200, which produces a single objective, numeric value corresponding to a picture-quality rating (PQR). The PQR utilizes technology developed to measure just-noticeable differences (JND) of human vision (Sarnoff at http://www.sarnoff.com). The process transforms each image according to a model of the human visual system, which includes a linear system with a point nonlinearity and could include masking models for color constancy, chromatic adaptation, lateral inhibition, and Mach bands (23). A JND image or map is produced corresponding to the problems or errors in the image. A totally dark JND image corresponds to a perfect picture with respect to the original image. The PQA200 utilizes multiple TMS320C80 processors (described earlier) to process the JND image in less than a minute, rather than the alternative of using a panel of expert viewers performing analysis and calculations for many weeks to achieve the same result.
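For reference, the PSNR computation itself is only a few lines; the C sketch below assumes two 8-bit grayscale images of equal size and returns the value in decibels.

#include <math.h>
#include <stdint.h>

/* PSNR in dB between a reference and a processed 8-bit image:
 * PSNR = 10 * log10(255^2 / MSE). Returns a large sentinel value
 * when the images are identical (MSE of zero). */
double psnr(const uint8_t *ref, const uint8_t *proc, int num_pixels)
{
    double mse = 0.0;
    for (int i = 0; i < num_pixels; i++) {
        double d = (double)ref[i] - (double)proc[i];
        mse += d * d;
    }
    mse /= (double)num_pixels;
    if (mse == 0.0)
        return 999.0;   /* identical images: PSNR is unbounded */
    return 10.0 * log10(255.0 * 255.0 / mse);
}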
Image Database Equipment

Image database equipment (IDE) provides solutions for the storage and access of image and video files. Traditional database methods of using keyword labels to access text data have been replaced with methods like text labels (i.e., the database “contains a picture of Lou Whittaker on Mt. Rainier”), thumbnail images, interactive 3-D panoramic views, and content-based archival and retrieval. Refer to Ref. 24 for more details. The common elements of all IDE include (1) compression, (2) massive storage to hold the image files, and (3) communication networks for distribution of the data. In a typical application, computer servers contain image objects that are played back by clients utilizing a communication network. These networks can be as simple as a telephone connection to the Internet or a local area network (LAN) and as complex as asynchronous transfer
mode (ATM) networks. We describe common computer networks, some image database applications, and image content retrieval. Computer/Telephony Networks. Traditional computer networks, such as Ethernet, token-ring, and FDDI (Fiber Distributed Data Interface), create cost-effective LAN connections for sharing of image databases within a building or adjacent buildings. The limitations of LAN networks are bandwidth (10 up to 100 Mbit/s), number of clients, and topology. For example, if dozens of clients attempt to access large images on a server, the effective bandwidth decreases and latency increases accordingly. In addition, if the topology has each computer connected in a ring, then each hop between computers on the ring can slow response. Star or centralized topologies, where a central node has direct connections to each client machine, minimize the hops or latency. For smaller networks the fully distributed topology is effective, with each machine interconnected via a direct link. Hierarchical networks create extensions to the centralized network with a series of intermediate nodes, forming a tree structure. For interactive video conferencing with face-to-face communication, the roundtrip delay (from the sending camera, through compression, transmission over the network, decompression and display on the receiving display, and then back the other direction) must be less than 300 ms. For this reason centralized topologies are more useful for ITU-H.323 video conferencing over LANs. Today's wide-area networks (WAN) offer connection of remote client equipment with servers (e.g., with Internet servers). These networks include circuit-switched telephony networks like analog POTS, integrated services digital network (ISDN), T1, and asymmetric digital subscriber line (ADSL). Their advantages include predictable delay and quality of signals. Video conferencing using ITU-H.320 operates within these telephony networks. Alternatively, gateways can be used to place IP packets on these WANs within the ITU-H.323 standard (from Radvision). In high-speed packet-switched networks, like ATM (broadband ISDN), the topology is less important, since the data are broken into packets for transmission via one of a multitude of different paths to the destination. The largest issue with these networks is that of long and unpredictable latency for those packets that travel the long routes to the server. Modems and the Internet. Modems (MOdulator-DEModulators) connected to plain old telephone service (POTS) phone lines enable communication of digital data over analog lines. Today's modems (by 3Com and Rockwell) can achieve up to 56 kbit/s from the server and 28.8 kbit/s from the client via ITU standard V.90 and V.34 modems (28.8 kbit/s both ways). The Internet has been facilitated by use of these client modems in the home and T1 networks in the office, all connected with multi-line modem servers operating with TCP/IP (Transmission Control Protocol/Internet Protocol). Routers and hubs interconnect the variety of network connections within a particular organization before connection to the Internet. The servers are then connected using T3 connections to the backbone of the World Wide Web, providing high-speed transfer of the data packets. Cable, Satellite, and Broadcast Networks. In addition to computer and telephony networks, video and image communication frequently occurs via cable, satellite, and terrestrial broadcast.
As these networks increasingly utilize digital images and computing costs decline, an increasing percentage of IPE uses these one-to-many networks today. Broadcast network solutions offer high video bandwidths, supporting hundreds of digital video programs simultaneously. Unfortunately, their topology with a single server and many clients limits interactive communication. However, with the large quantity of channels available, image and video database applications are emerging to selectively download the client’s requested data to a local disk, such as overnight (e.g., WebTV). Network Demands of Images, from FAX to Video. Effective communication of image and video demands a variety of average bit rates and buffering to compensate for differences between the target rate and the actual rate. Communication with high-quality images may dictate up to 30 Mbit/s, whereas ITU Group-4 FAX could utilize merely 64 kbps or less over POTS phone lines. Moving a typical 640 × 480 × 24 bit color image via the Internet with POTS lines could take several minutes. Sending video over networks requires (1) 64 kbit/s for low-quality video when using H.261, (2) 6 Mbit/s for TV-quality video when using MPEG-2 MP/ML, and (3) 10 to 40 Mbit/s for HDTV-quality video with MPEG-2 MP/HL. Data buffering compensates for dynamic latency in the network or variations in peak network performance. Most video applications require
Fig. 6. Image database equipment utilizes various search applications (or engines), coupled with image/video databases and a communication network, to help users find useful images or video.
access and display of a continuous stream of images without frame drops. The size of this video buffer depends on the type of network (e.g., one-to-many or many-to-many), the required data rate, and traffic on the network. Some applications can tolerate missing data, but only if it occurs in a predictable and infrequent fashion. In that case, minimizing the frequency and duration of frame drops improves the viewing experience [see (25)]. Traditional networks are designed to minimize errors and correct any that occur. Ignoring errors and not waiting for retransmission of bad data can improve the overall data rate realized, but image quality will suffer. Image and Video Content Indexing and Retrieval. With a wide variety of video and imaging material accessible either via the Internet or other electronic libraries, browsing, searching, and retrieving the interesting or appropriate information can become challenging. As shown in Fig. 6, methods utilized include accessing image databases with textual headers or thumbnails. Searching by keywords, a traditional approach to databases, only has value for image databases if a textual description of each image's content (stored in the database with the images) contains pertinent information. This can include physical (size), perceptual, conceptual, objective, subjective, or statistical data. Methods for retrieval of video increase database searching complexity due to the amount of data searched and the interrelationships among frames. Retrieving video clips, browsing video databases, and managing video libraries can be less than satisfying without an algebraic video model and multimedia hyperlinks [see (26)]. The video algebra keeps a working record of operations like how the video was constructed, how it was manipulated (stretched, differenced, summed, etc.), and what was composited or linked together to make the material. Various techniques have been developed to quickly locate pertinent images or data. Current work on the ISO MPEG-7 standard (Multimedia Content Description Interface) should yield a formal standard for such image retrieval techniques. The scope of MPEG-7 includes defining a standard description of images and video for interoperable use by several search engines and feature extractors. Techniques include content labeling
or indexing, where headers or tags contain text which in some way relates to the content of the image [see (27)]. Unfortunately, these labels are quite laborious to create and may not actually catalog the pertinent information. This subjective indexing has limited usefulness since it's difficult to anticipate the content that will interest all users of a particular image or sequence. However, these techniques can be useful for finding all the photos of "David Letterman," but most likely not find those images that look similar to David, since they probably wouldn't have been stored that way. Software tools are commercially available for determining a picture's characteristics like color, texture, or shape of objects. Solutions such as Excalibur's Visual RetrievalWare and AAI's (Amerinix Applied Imaging) KBVision software system use image understanding techniques or adaptive pattern recognition processing (APRP) to allow on-line identification and indexing of images and sequences. The software permits query-by-example searching. One can search for patterns, colors, or a picture that looks similar to a reference picture. Since only patterns are correlated, rather than the underlying object descriptions, a complex object can elude detection by the software. Yahoo!'s Image Surfer site utilizes Excalibur's APRP for Internet searching of items like charts, photos of actors, and movies (see http://ipix.yahoo.com for examples). Other methods include use of thumbnail images, with lower-resolution snapshots of a larger image (e.g., using JPEG), and faster decoding of MPEG sequences by processing only portions of the compressed bitstream. Other techniques include scene change detection methods to aid the surfing of video sequences. Dramatic changes in motion, color, and sound can indicate a scene change. Using only a single frame per scene (located by the scene change detection algorithm) facilitates rapid searching of the database. Web-Based Imaging Database Application. A replacement for home photo albums has emerged: a virtual photo album featuring distributed storage of electronic pictures somewhere on the World Wide Web. Kodak originally produced the FlashPix architecture for Internet imaging. Users upload digital images via a service provider to a database managed by that provider (Kodak in this case). Pictures can be edited, placed into a virtual photo album, sent via email, or selectively accessed by others on the Internet. The use of titles and thumbnail images can minimize the bandwidth requirement when searching through your image database via the Internet. A recent consortium of companies called the Digital Imaging Group (http://www.digitalimaging.org) is continuing to expand this Internet Imaging Protocol (IIP), enabling efficient distributed imaging database solutions using FlashPix and other tiled image formats. 3-D Imaging Database Application. Using virtual reality concepts and the Internet, another type of image database has been created for visualization of 3-D images. These products implement a 5-D space, called the plenoptic function, projecting pixel color as a function of position and pose (or gaze) into the 3-D space. QuickTime-VR (by Apple Computer at http://www.apple.com) represents a view into a 3-D space allowing the observer to rotate 360 degrees left or right, and to shift up or down, all from the same vantage point. It's very compelling to spin through 3-D environments with this technology.
The user can selectively pan, tilt, and zoom throughout the 3-D space from the perspective of the original camera. In addition, users can link to other positions in the scene that have also been captured with the same 3-D techniques. With proper design and capture, users can wander throughout a structure with a true-to-life feel. Live Picture provides an immersive imaging database for users to view 3-D panoramic scenes (similar to QuickTime-VR), but with the added feature of maintaining resolution for selected objects as zooming progresses (see http://www.livepicture.com for examples). In effect, users can virtually fly across town, into a building, through a room, and look at the content of a television within that room, all while maintaining good display resolution. New image capture techniques have been employed for this application. Rather than constructing a complex panoramic camera (e.g., those used with the popular Panavision technology in movie theaters), these images can be collected with a variety of new cameras using a special fish-eye lens. The camera captures a distorted view of the scene in a 360° circle. Subsequent image warping corrects for any distortion created by the lens. Alternatives include capturing successive images or swaths of the scene with overlapping gaze and later creating a mosaic of the entire 3-D scene. Special camera mounts minimize parallax by constraining all rotation
about the plane of the sensor. Subsequent processing stitches the images into a large panoramic image or a compressed file for later expansion by a QuickTime-VR application. To correct for optical distortion or misalignments of the images, image warping is performed (see the section titled "Geometric Distortion and Correction with Warping" earlier in this article for a description of this warping function). 3-D Model Database Applications. Utilizing 3-D scanners and graphical representations, several applications have emerged to offer a visual walk through 3-D databases, such as a human body. These techniques have also been used in the apparel industry to scan customers for specialized fitting of clothing. Another application features a virtual globe, where users retrieve the world's historical information by interacting with a 3-D model display of the world, with successive clicks of a computer mouse to enable searches of a specific location at specific times. Multiresolution analysis and progressive meshes provide two unique techniques for representation and rendering of the 3-D image data. Each of these provides solutions for rapid display of lower-resolution representations, followed by higher-resolution meshes or textures when the data become available over low-bandwidth networks. Interactive computer video games (e.g., Bungie's Myth) provide Internet-based multiplayer sessions with 3-D meshes rendered in real time, somewhat independent of the client computer's processing capabilities and the Internet's available bandwidth.
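As a rough illustration of the scene change detection approach mentioned above for browsing video databases, the following sketch flags frames whose intensity histograms differ sharply from the previous frame. It is a minimal example rather than a production algorithm: the bin count and threshold are assumed tuning parameters, and a real system would also examine motion and audio cues, as noted in the text.

import numpy as np

def gray_histogram(frame, bins=64):
    """Normalized intensity histogram of one grayscale frame (2-D uint8 array)."""
    hist, _ = np.histogram(frame, bins=bins, range=(0, 256))
    return hist / hist.sum()

def detect_scene_changes(frames, threshold=0.4):
    """Return indices of frames whose histogram differs sharply from the previous frame.

    `threshold` is an assumed tuning parameter on the sum of absolute
    histogram differences (range 0 to 2).
    """
    changes = []
    prev_hist = None
    for i, frame in enumerate(frames):
        hist = gray_histogram(frame)
        if prev_hist is not None and np.abs(hist - prev_hist).sum() > threshold:
            changes.append(i)        # likely the first frame of a new scene
        prev_hist = hist
    return changes

# Example with synthetic frames: a dark clip followed by a bright clip.
dark = [np.full((120, 160), 40, dtype=np.uint8) for _ in range(5)]
bright = [np.full((120, 160), 200, dtype=np.uint8) for _ in range(5)]
print(detect_scene_changes(dark + bright))   # -> [5]

Keeping only one representative frame per detected scene is what allows the rapid database searching described above.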
Image Display Equipment Image display equipment provides the mechanism to view raw or processed images or video. This equipment includes computer displays, televisions, studio monitors, or projectors utilizing a variety of display technologies. (See Ref. 28 for an overview of displays and Ref. 29 for a description of emerging television display technologies.) Display technology is very application dependent. The number of viewers, the amount of power or space available, the number of colors, and the rate of image display of the application all drive unique display requirements. The quality of the display directly influences the overall performance of an image processing system. Some display equipment, like advanced consumer televisions, incorporates image processing functions directly into the equipment, conditioning the video signals for any display-dependent effects or adding features that may improve the quality of the displayed pictures. In many cases this processing can destructively interact with the image or compression processing which has occurred prior to display. Examples include special noise reduction filters, sharpness filters, or gamma correction functions, each of which may enhance artifacts in digital images, like block artifacts in MPEG/JPEG compressed images. In general, users and developers of IPE must be familiar with the features and limitations of the display equipment used. Interlaced and Progressive Scanning Methods. Two predominant types of display scanning methods exist today: interlaced video displays, such as NTSC, PAL, or SECAM format televisions used throughout the world, and progressive graphics displays, such as VGA format monitors typically used with today's personal computers. Both displays range in size from handheld to large wall-sized units, depending on the display technology employed. However, as displays increase in size, progressive techniques yield higher performance. Television's interlaced video displays resulted from the limitations of transmission, processing, and display bandwidth of the video systems developed over 50 years ago. Interlacing techniques display only half the lines (odd lines) or rows each field (i.e., every other line), followed by display of the other half of the lines (even lines) the next field time. The benefit of interlacing is that it uses only half the bandwidth of progressive displays. A slight flickering artifact can result, coupled with more motion effects since the number of scan lines is halved for moving objects. Computer graphics' progressive displays provide full scan displays with a resultant minimization of display artifacts. In contrast to the fixed field rate of NTSC or PAL (60 or 50 fields per second), the frame rate of progressive graphics displays will vary depending on the system design. In addition, since the viewer of computer graphics typically sits within an arm's reach of the display, faster display refresh rates (70 to 80 Hz) are necessary
to prevent off-axis flicker for most viewers. Graphics displays feature square pixels for geometric precision, whereas conventional broadcast video formats use nonsquare pixels. Direct View versus Projection Methods. Display systems are also characterized as either direct-view, where the viewer looks at the display device (e.g., a face plate on a CRT or LCD panel), or projection, where the viewer looks at an image projected from the display device (e.g., CRT, LCD, DMD, or other). Projection display systems use optics to enlarge the display device's image onto a screen or directly into the viewer's retina for "heads-up" display. Projection techniques have also been used to present a virtual image that seems to float out in space, being superimposed on the real scene. Direct-view displays can have the advantages of minimal optical distortion and high brightness; the trade-offs include size and cost. Laser scanning creates another type of display, particularly for the creation of 3-D objects; however, it is quite limited due to the serial nature of scanning coupled with no phosphor retention of the scene. Some display devices created with semiconductor technology (like DMD, with displays near one square inch) dictate the use of projection techniques for enlargement for group viewing. Front versus Rear Projection Methods. Projection displays utilize either front projection or rear projection techniques. Electronic front screen projectors (from Sony, Proxima, Barco, etc.) operate analogously to conventional film projectors (e.g., slide projectors, overhead projectors, or cinema), whereby the image is optically projected onto a reflective surface, such as a white wall or glass-beaded screen (from Da-Lite, Inc.). The projected image originates from the front of the reflective screen, with the projector usually somewhere in the room (or behind a wall or ceiling). Rear screen projectors (from RCA/Proscan, Sony, Mitsubishi, and Pioneer) illuminate a screen from behind and usually contain all the equipment within a movable enclosure that looks like a large TV enclosure. An obvious advantage of rear over front projection is the reduction of occlusion of the screen by those in the room within the line-of-sight of the projector, whereas the advantage of front screen can be portability. Rear screen projectors achieve high levels of brightness with customized screens using Fresnel lenses to focus the light, usually at the expense of viewing angle, one disadvantage of rear screen projection techniques. In particular, consumer rear screen projectors have been designed for optimal viewing brightness when viewed from a seated position, while dimming significantly when viewed from a standing position. CRT (Cathode-Ray Tube) Displays. Conventional direct-view and projection displays use CRTs as a good balance between cost, performance, and lifetime. With CRTs, a single high-energy beam, scanning an x–y grid and charging a phosphor screen, can create brilliant displays in fully lit rooms. Direct-view TVs continue to exhibit high performance over other technologies like projection. Current limitations are near 40 in. diagonal in size (Mitsubishi). As one increases the size of the CRT display, the geometric limitations of pointing one beam at a glass screen dictate a deeper tube and curved surface. As well, the pressure of the atmosphere dictates thicker and thicker glass to preserve the vacuum in the tube.
Difficulties include the manufacturing limitations of glass and the depth growing to exceed standard consumer door widths, complicating installation of the TV. Likewise, computer monitors continue to feature CRTs for the same cost and performance reasons. However, several applications make demands on a display that CRTs cannot address. Mechanical design or architecture demands higher-resolution graphics and geometric accuracy. Group activities require bigger screens. Laptop applications require flat displays. For these reasons, flat-screen display panel technologies have seen recent development, as described in Ref. 30. LCD (Liquid Crystal Display). Liquid crystal displays (LCDs) use a panel of light modulators, imaged either by direct viewing of the panel (as with most laptop computers and handheld image display equipment) or by projection [for more details, see (31,32)]. Each of hundreds of thousands of light valves (each pixel) either passes or blocks light, depending on the orientation of microscopic crystals contained within each valve. The amount of light passed through each valve determines the shade or color of the pixel. The crystals are aligned by applying electric fields. Of the various methods used to produce LCDs, the most widely accepted consists of an "active matrix" of transistors for modulation of the crystal orientation. LCD panels today typically
range from a few inches to 14 in., with some as large as 21 in. A manufacturer (Sharp) has actually seamlessly joined two 29 in. displays to make a 40 in. prototype LCD panel. Threadlike twisted-nematic liquid crystals are typically used in LCDs. In the nonenergized state, the crystals (which are sandwiched between two polarizers) align with a twist. This enables light to pass between the two polarizers (oriented 90° out of phase) by twisting the light through the liquid crystals. Supertwist LCDs are more than 180° out of phase. When a field is applied, the molecules align perpendicular to the display surface. Light then passes directly through the crystals to the next polarizer. No light escapes since the light from the first polarizer is 90° out of phase with the second polarizer. Color can be generated by using color triads at each pixel, or with the use of multiple panels for projection displays. LCDs have a great advantage for mobile applications due to their low power consumption. In addition, they perform well in graphics applications (like PCs) due to the geometric stability and 1:1 pixel mapping of the display grid to generated pixels. This 1:1 pixel mapping, like that of other panel displays, can yield extremely sharp text and graphics displays in comparison to analog displays like CRTs. These advantages are offset by disadvantages for many other imaging applications. LCDs typically have low pixel fill factors (the percentage of the pixel site illuminated), such that only 50% to 60% of the pixel site is illuminated, surrounded by dark patterning structures. The contrast ratio (or ratio of brightest to darkest images displayable) has also not matched CRTs. High-resolution LCDs are difficult to manufacture, increasing costs relative to other display technologies. LCDs are typically fabricated at nearly 100 pixels per inch. Xerox Palo Alto Research Center produced the first 300 dot/inch LCD, with 6.3 million pixels in a 13 in. panel. Despite their limitations, LCDs can create sharp and bright displays when coupled with the appropriate illumination system. Active and passive-matrix LCDs exist in the market place today. Less-expensive passive LCDs use row address lines and column pulses, positive and negative about a bias, to signal pixels within a row. Unselected pixels accumulate a "cross-talk" signal as a function of the number of rows in the display. Resolution and contrast performance suffer with passive LCDs. By contrast, active-matrix LCDs use individual transistors (thin-film transistors, TFT) at each pixel to isolate on and off pixels, while maintaining a row/column addressing structure similar to DRAMs. A higher-quality image results from the isolation inherent in active-matrix LCD displays. Recent advances in reflective LCDs offer dramatic increases in contrast ratio, response speed, and brightness, and such panels can be used with ambient illumination rather than power-consuming backlights, with great promise for future applications. FED (Field-Emission Display). With FED technology, a grid (or field) of individually addressable microscopic electron emitters replaces the CRT's single electron beam. The light then strikes a layer of phosphor, causing it to glow at the location of electron emission, forming an image. The advantages of FED include low power consumption, wide angle of view, flat screens, high quality, and spatial stability. Manufacturers, including Candescent Technologies (San Jose), PixTech (Montpellier, France), and Texas Instruments (Dallas, Texas) have produced FEDs.
Currently, manufacturing issues have limited their use. Plasma Gas Display. Applications such as HDTV that call for a wall-mounted, high-brightness, high-resolution display have spurred development of gas plasma displays. Manufacturers of plasma systems (Matsushita, NEC, NHK, Fujitsu, Sony, etc.) have produced displays up to 40 in. and only a few inches thick. The advantages of plasma gas displays include large flat panels, high brightness, and spatial stability. The disadvantages with respect to LCDs have been power efficiency and contrast ratio. A gas plasma display's efficiency is near one lumen per watt, between half and a quarter as efficient as backlit LCDs. With plasma displays, an electric current activates a neon gas at each pixel by creating ultraviolet light with a gas discharge (similar to fluorescent lights). The UV light is then turned into visible red, green, and blue light by phosphors. Vertically oriented, phosphor-coated and gas-filled channels contain one electrode. Another layer in the display, an array of horizontal electrodes, crisscrosses the display. A pixel is then defined by the intersection of two electrodes. As the size of the display or the number of pixels increases, the display does not grow in depth, keeping a thin profile. Manufacturing issues have affected uniformity and contrast ratio and have produced annoying defects. Due to the difficulty of applying pixel information to all electrode intersections simultaneously, pixel modulation methods
Fig. 7. DMD displays utilize a DMD chip and an optics system to create a projection display. The illustration shows a hidden-hinge DMD structure of nine pixels (on the left of the illustration), each of which rotate via a hinge mechanism underneath the mirrors.
are necessary to create shades of gray or color in the display. Addressing limitations had restricted plasma displays to near 6 bits of gray, or 64 shades per color primary, creating significant false contour patterns on the screen. Recent developments have extended plasma displays to 8 bits, some using dithering methods. High-Gain Emissive Display. HGED displays use phosphor modified to operate at low voltages (100 V) with an electron beam grid, rather than a single high-energy beam as in CRTs. Telegen's HGED can produce high resolution with good viewing angles, without the X-ray emissions of CRTs. LEDs on Plastic. With a recent light-emitting diode (LED) display constructed on plastic, new possibilities open for large, flexible displays. These sheets of physically deformable plastic can wrap around the viewer or even replace glass-pane windows with images of any scene displayed on large sheets of plastic [see (33) for more information on LEDs-on-plastic]. DMD (Digital MicroMirror Display). A display technology called Digital MicroMirror (DMD), or Digital Light Processing (DLP), creates an all-digital approach to displaying images (from Texas Instruments). In essence, DMDs create a grayscale or color image by modulating a sequence of binary images posed by an array of microscopic mirrors. Each mirror is smaller than a human hair, usually 17 µm, as shown in Fig. 7. DMDs can create brilliant pictures with minimal artifacts. A unique advantage of a DMD display is its extremely low power of operation; a still binary image can be held with no power applied to the light modulator. Products such as the DaVis Powerbeam V SVGA data and video projector, and projectors from Proxima, In-Focus, and others, are based on TI's DLP. For high-brightness applications, such as those with large audiences, manufacturers have produced DLP projectors with 2000 to 5000 lm (Electrohome and Digital Projection Inc., respectively). DMD technology emerged from work with a real-time image correlation system for use by NASA to remotely dock spacecraft. Initially, micromirrors were glued onto a polymer sheet that was spatially modulated by transistors under the polymer. Later, mirrors with torsioned hinges were patterned on the silicon device to create a robust binary display. Currently, conventional semiconductor processes are used to pattern the mirrors directly onto a silicon wafer. This should lead to manufacturing at a reasonable cost. Early problems with DMDs, which have since been overcome, include artifacts relating to the use of pulse-width modulation to create grayscale and color shades with the binary image display technology. DMDs present a binary image to the viewer at extremely high rates, using the viewer's eye to integrate those binary images into a specific intensity per pixel. Each frame of an 8-bit system is broken into 256 time units (1/30/256 s each). A saturated white pixel would require the DMD to display the binary image with that pixel "on" for the entire frame time,
while a half-gray pixel would only be "on" for half of the frame time. A linear range of display times results. The problem occurs as intensity 128 (binary 10000000) transitions to 127 (binary 01111111). In this case, the temporal asymmetry of pulse widths creates an opportunity for the eye to incorrectly integrate the pulses if the eye moves or has its gaze interrupted (as with some object passing between the eye and the display). Solutions include breaking long pulse binary images into smaller pulses and temporally distributing them [described in (34)]. Miniature Displays or HMDs. Head-mounted displays (HMDs) use special optics and a miniature display to create images that appear to the viewer as a large display at a distance. In effect the viewer sees a 20 in. monitor at a distance of five feet when in actuality a small display is about an inch in front of the viewer. HMDs first appeared in military applications, such as the display of instrumentation for pilots or realistic flight simulators, and now have entered consumer markets. Some problems with image quality and unsettling physical effects (e.g., dizziness or headache) have limited their widespread use. Several display technologies have been used for HMDs. One company (Reflection Technology) uses a scanning mirror to paint an image with a linear LED array. Nintendo's Virtual Game Boy used this display. Another manufacturer (Motorola) produced a miniature 240 × 144 LED array for HMD applications. Yet another manufacturer (Kopin) produced HMD displays by direct patterning of light valves, such as AM-LCD, on silicon wafers at over 1700 lines/in. They produce devices at 320 × 240 resolution. With 15 µm pixels, 192 displays can be created from one 6 in. silicon wafer. Backlighting the LCD is accomplished with LEDs. Color can be achieved with frame sequential illumination of red, green, and blue LEDs. Another approach, with polysilicon TFT LCD mini displays (from Seiko-Epson and Sony), uses RGB color filters. This reduces brightness and resolution by one-third. These LCDs are constructed on quartz substrates, at a higher cost than silicon wafers. Finally, products also feature direct scanning of a laser on the retina of the eye, providing small, head-mounted personal displays. HDTV Display and AFD. While Japan implemented high-definition TV (HDTV) and enhanced-definition TV (EDTV) systems years ago, the recent digital HDTV standard approved by the US government (see http://www.atsc.org) raises questions of what will be the display of choice. Certainly CRT technology continues to offer high quality for most video applications. However, the ideal display for high resolution and consumer costs has not yet been determined. The use of digital transmission, coupled with the advent of digital versatile disk (DVD), may make the digital display technologies most desirable. In that scenario, everything would be digital [as discussed in (34)]. The standard for HDTV in the United States will allow 18 different video formats as shown in Table 3. HDTV in the United States has been a very political issue since broadcasters, cable operators, and TV-set manufacturers must adapt to the new standard, a standard that presents 18 different possibilities for noncompatibility. Others have proposed counterproposals to the standard for alternative use of the digital spectrum. Microsoft, Sun Microsystems, and Compaq favor the introduction of a base layer (DS0) and extensions to the base layer (DS1) so that virtually any format can be supported in the future.
The difficulty has been for standard developers to commit to one format as was done decades ago for NTSC. A flexible compromise to this situation has been developed by others (e.g., Hitachi and RCA Thomson), essentially constructing a display to adapt to any of the formats. Their proposed All-Format Decoder reduces the decoding, memory and display requirements of HDTV to those similar to conventional standard definition TV, while displaying all of the formats [see (35)].
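To make the binary pulse-width modulation behavior of DMDs described above more concrete, the sketch below decomposes an 8-bit intensity into bit-plane display times and shows why the 128-to-127 transition is sensitive to interrupted eye integration. This is a simplified model for illustration only, not TI's actual DLP controller design; the frame is treated here as 255 unit intervals for simplicity.

# Binary pulse-width modulation of one 8-bit pixel on a hypothetical DMD.

def bit_plane_durations(intensity, bits=8):
    """Return (bit, weight) pairs: each bit plane is shown for `weight` unit intervals."""
    return [(b, (intensity >> b & 1) * (1 << b)) for b in range(bits)]

def on_time(intensity, bits=8):
    """Total number of unit intervals the mirror is 'on' during the frame."""
    return sum(weight for _, weight in bit_plane_durations(intensity, bits))

for value in (255, 128, 127):
    planes = [w for _, w in bit_plane_durations(value)]
    print(f"{value:3d} -> on for {on_time(value):3d} of 255 unit intervals, bit-plane times {planes}")

Intensity 128 lights only the single long most-significant-bit pulse, while 127 lights all the short pulses; if the eye's integration is interrupted near that boundary, the two nearly equal intensities can be perceived very differently. One remedy, noted in the text, is to split the long bit-plane pulses into shorter segments distributed across the frame.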
TV and Computer Graphics Convergence. Today, televisions and graphics monitors are converging for many reasons. While analog techniques and limitations drove video standards and systems, now digital techniques are replacing many video functions, including the capture and storage of video, and more recently the display and broadcast of video. Size, cost, error resiliency, and longevity of video data storage started the trend. Digital transmission enables reduced artifacts, no noise problems in broadcast fringe locations, and optimization of the bandwidth for the scene and the display used. Computer graphics displays have achieved sharp, noiseless, flicker-free, and clear pictures. Since computer users sit within arm's reach of the monitor, high refresh rates (70 to 85 Hz) are used to prevent the perception of off-axis flicker. In addition, they can operate at reduced brightness relative to a television, since light diminishes with the square of the distance from the source. TV requires viewing from across a room, at a distance of five or more times the picture height, and hence higher brightness. Off-axis flicker is not a problem. Now with digital broadcast and display techniques, video broadcast viewers enjoy the same benefits as computer graphics, especially if square pixels are used for video. Therefore graphics, text, and video can merge without the distorting effects of scaling one or the other. As the industry migrates to HDTV to attain the movie and computer graphics experience in the home, the challenge becomes finding a display technology that can support expanded screen sizes, maintain high brightness, and increase frame refresh rates to keep flicker distortion at bay. At that point, TV and computer displays will have effectively converged. Image Printing Equipment Image printing equipment provides hardcopy reproduction of raw or processed images. This equipment includes laser printers, ink-jet printers, offset printers, and digital copiers. Very affordable solutions exist today for producing exceptional-quality pictures composed of text, graphics, and full color for most applications. The limitations of today's printing technologies are the cost of consumables (ink, paper, toner, etc.), printing time (network, decompressing and rendering time, etc.), and durability of the printed material, with resolution continuously improving. In fact, we are rapidly approaching the point of widespread replacement of conventional silver-halide film photography by electronic photography coupled with an ink or electrostatic printer. Printers. Printers represent a class of image processing equipment that imprints images onto paper or other material for relatively permanent recording in visible form. Similar to scanners, printers usually sweep an imaging device (linear or 2-D display, pen, or inkjet, rather than an image collector) across a piece of paper. Printers transport the paper while performing this function, whereas most scanners maintain a stationary page. Figure 8 shows these common elements of a printer system. Image processing equipment uses several methods to deliver images to printers. A popular method includes use of a Page Description Language (PDL). A PDL facilitates printing (or display) of a document independent of the source material resolution or the printer resolution. No matter what quality of printer is used, in terms of chromatic or spatial resolution, the source image is communicated independent of those limitations.
Methods for achieving this include coding object descriptions rather than bitmaps. For example, an object like a line or character can be coded as a sequence of pen strokes rather than a bitmap. Popular PDLs include Adobe's Postscript [see (36)], Interpress, and TrueImage. Many other image storage formats like Tagged Image File Format (TIFF), Graphics Interchange Format (GIF), PICT (Apple MacPaint format), BMP (Windows device-independent bitmaps), or others assist with compression and transmission of images to printers [refer to (37) for a more detailed description of the image formats]. The TIFF format maintains high-quality images for delivery to other computers, but without significant compression. With only 256 colors, the GIF format is suitable for synthesized images but does not work well with natural continuous-tone images sensed by cameras. ISO's JBIG (Joint Bi-level Image Experts Group) and CCITT's Group 3 and 4 FAX standards are used for bi-level images and do not compress gray scale images well. However, with JBIG the images can be broken into bit planes for compression. JPEG is very popular for
Fig. 8. Elements of a laser printer system include a processor for printer language interpretation, laser diodes, an OPC, and a fuser.
general continuous-tone images, those found in real-world scenes. The JPEG 2000 standard, with techniques such as progressive lossless reconstruction, is intended to serve applications in which the original JPEG fails, such as low-bandwidth dissemination of images via the Internet, pre-press imaging, electronic photography, and laser print rendering. RIPs. Printers use a raster image processor (RIP) to render printable images from the PDL input data. A RIP consists of the following elements: (1) compositing of images and text, (2) compression and decompression to reduce the cost of the printers by reducing memory requirements, (3) color space conversion [while the final color space is usually CMYK (cyan-magenta-yellow-black), intermediate processing is usually in an alternative color space], (4) filtering to sharpen edges, (5) affine image transforms to rotate and scale images, (6) halftoning or screening to increase apparent color resolution, (7) geometry and spline conversion to smooth fonts, and (8) memory management. Several companies produce a RIP for printer or multi-function printer (MFP) applications, including Electronics-For-Imaging (EFI) and Xionics. RIPs can operate in most computers, resulting in delivery of quality images to a multitude of printers. Communication speed and buffering can limit the effectiveness of these PC-based soft RIPs. A RIP implemented on fast hardware embedded within a printer offers printer manufacturers the opportunity for solutions optimized for cost, speed, and printed picture quality. Ink Jet Printer. Consumer-grade ink-jet printers of today range from 300 to 1200 dots per inch (e.g., those from Lexmark International, Hewlett Packard, Epson, and Canon), with at least one available at 1440 dpi (e.g., the Epson Stylus Color 800). Offering an effective balance of performance and cost, ink jet printers are limited by printing time and artifacts. The time to physically scan a pen cartridge across the page, and then up the page, fundamentally limits printing times. Artifacts include horizontal banding due to paper transport inaccuracies, and graininess of the printed image due to low resolution or the use of dithering and halftone patterns to create shades of gray. Proper color rendition, including proper tint, saturation, and contrast, is controlled by (1) the ink, (2) the control of the ink jet, and (3) the type of paper. The bleeding characteristics of ink vary with the paper fibers, which in effect can change the contrast of the image. These printers make adequate printed images for desktop publishing with text, graphics, and images. Color Laser Printer. Electrostatic color laser printers consist of a print engine and a print controller. The printer's engine determines the base resolution capabilities, the paper throughput in pages per minute, and the type of consumables used. These printers have been manufactured at resolutions of 300 to 1000 dpi for black and white, and color laser printers at 300 to 1200 dpi (Hewlett Packard, IBM, and Canon).
The engine contains a laser diode for imaging onto an organic photoconductor (OPC), a fuser, toner, and paper transport. The low-power laser diode, driven by the controller, first writes the latent image (or pattern of charge) onto the OPC. Four layers of charged powder toner (cyan, magenta, yellow, black) are rolled past the OPC drum, attracting to the oppositely charged OPC image. As the drum rotates, the toner is transferred to the paper from the OPC (in the pattern of the latent image). The fuser's application of heat and pressure with rollers bonds the toner to the page. A fuser oil keeps the toner off the roller and on the paper. The printer's controller will directly influence the overall print quality. Considering the limitations of the print engine, the print controller will translate target pixels into a printable image. For example, the engine may only produce 300 dpi at 16 colors. Considering that limitation, the print controller may perform dithering or halftoning to increase the apparent resolution of the printed page, both spatially and chromatically. The controller's speed in performing image processing functions like enhancement or this dithering can limit throughput. Fast processors provide acceleration for rapid printing in the midst of many image processing functions. Some printers can also modulate the laser beam intensity to control the size and placement of the laser dots, effectively controlling resolution. In addition, apparent resolution can be improved when incorporating image processing techniques that smooth edges or curves. Professional and Offset Printing. High-end, professional, or production printing units utilize many techniques, including electrostatic and ink jet, to transfer an image onto paper or other material. Powerful image processing capabilities of these machines render the large high-resolution images and drive the printing devices, which may include the use of multiple drums or multiple passes, for optimal print quality. Large images and the many subsequent passes they require dictate that these printers use large memories and/or compression techniques for temporary storage of images. Digital Copiers and Multifunction Systems (MFS). A multifunction system (MFS) combines digital scanners, image processing, printers, and digital copiers to create a new class of high-performance page copiers. This coupling of imaging subsystems with image processing offers new features. Conventional copiers simply scanned and printed line by line, whereas these newer forms of digital copiers offer additional features like "Scan-Once-Followed-by-Print-Many-Copies" modes, since they use full compression of the document, adaptive processing of images for improved print quality, and wider ranges of document resizing with affine transforms. In addition, this MFS equipment can incorporate document image segmentation and character recognition for highly compressed storage of the document, subsequent editing of the document, and later high-quality printing of the document. Color Management. To achieve reliable and reproducible color throughout the entire reproduction process, independent of platform or device, major providers like Adobe, AGFA, SGI, Kodak, and Microsoft developed the International Color Consortium (ICC) specification. Scanners use reference images (both scanned and previously stored reference images) to aid in determining the appropriate corrections. Likewise, printers are measured with reference images.
Using this standard, ICC profiles for the input scanner and output printer are tagged with the image file to facilitate proper correction at the time of use, mapping from any color space to any other color space. Consistent color rendition is achieved with these techniques.
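The dithering and halftoning step that a print controller or RIP performs can be illustrated with a small error-diffusion sketch. The code below uses Floyd-Steinberg error diffusion, one common halftoning method named here only as an example; the article does not specify which algorithm a given controller uses. It renders an 8-bit grayscale image for a bi-level print engine while preserving local average density.

import numpy as np

def floyd_steinberg_halftone(gray):
    """Convert an 8-bit grayscale image to a bi-level (0/255) halftone.

    The quantization error at each pixel is diffused onto its unprocessed
    neighbors so that local average density tracks the original gray level.
    """
    img = gray.astype(np.float64).copy()
    h, w = img.shape
    out = np.zeros_like(img)
    for y in range(h):
        for x in range(w):
            old = img[y, x]
            new = 255.0 if old >= 128 else 0.0
            out[y, x] = new
            err = old - new
            if x + 1 < w:
                img[y, x + 1] += err * 7 / 16
            if y + 1 < h:
                if x > 0:
                    img[y + 1, x - 1] += err * 3 / 16
                img[y + 1, x] += err * 5 / 16
                if x + 1 < w:
                    img[y + 1, x + 1] += err * 1 / 16
    return out.astype(np.uint8)

# Example: a horizontal gray ramp becomes a dot pattern whose local dot
# density follows the original gray level.
ramp = np.tile(np.linspace(0, 255, 64, dtype=np.uint8), (16, 1))
print(floyd_steinberg_halftone(ramp)[:2, :12])

In a real printer this step would sit inside the controller's pipeline, after color space conversion and before the laser or ink-jet drive signals are generated.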
BIBLIOGRAPHY
1. C. Koch, Neuromorphic Vision Chips, IEEE Spectrum, 33 (5): 38–46, 1996.
2. W. J. Mitchell, When is seeing believing?, Sci. Am., Special Issue Comput. 21st Century, 6 (1): 110–119, 1995.
3. C. Sequin and M. Tompsett, Charge Transfer Devices, New York: Academic Press, 1975.
4. J. Younse and R. Gove, Texas Instruments, Inc., Programmable CCD Defect Compensator, U.S. Patent 4,805,023, 2/14/89.
5. G. C. Holst, CCD Arrays, Cameras, and Displays, Bellingham, WA: SPIE Optical Engineering Press, 1996.
6. G. Wolberg, Digital Image Warping, Los Alamitos, CA: IEEE Computer Society Press, 1990.
7. W. Wolfe and G. Zissis, The Infrared Handbook, Washington, DC: Environmental Research Institute of Michigan, 1978.
8. B. G. Haskell, A. Puri, and A. N. Netravali, Digital Video: An Introduction to MPEG-2, New York: Chapman & Hall, 1997.
9. V. Bhaskaran and K. Konstantinides, Image and Video Compression Standards: Algorithms and Architectures, Norwell, MA: Kluwer Academic, 1995.
10. R. Gove et al., Next-generation media processors and their impact on medical imaging, SPIE Proc. Med. Imaging, 1998.
11. C. Basoglu, A Generalized Programmable System and Efficient Algorithms for Ultrasound Backend, PhD dissertation, Univ. Washington, Seattle, 1997.
12. Y.-M. Chou and H.-M. Hang, A video coding algorithm based on image warping and nonrectangular DCT coding, Proc. SPIE, Visual Commun. Image Process. '97, San Jose, 1997, pp. 176–187.
13. C. Laurent and C. Boeville, Parallel implementation of greyscale morphological operators on MVP, Proc. SPIE, Visual Commun. Image Process. '97, San Jose, 1997, pp. 502–513.
14. R. J. Gove, Architectures for single-chip image computing, SPIE Proc. Electron. Imaging Sci. Technol. Conf. Image Process. Interchange, San Jose, CA, 1992, pp. 30–40.
15. K. Guttag, R. Gove, and J. Van Aken, A single-chip multiprocessor for multimedia: The MVP, IEEE Comput. Graphics Appl., 12 (6): 53–64, 1992.
16. L. McMillan and L. Westover, A forward-mapping realization of the inverse discrete cosine transform, Proc. IEEE Data Compression Conf., Snowbird, UT, 1992, pp. 219–228.
17. H. Malvar and M. Murphy (eds.), The Science of Videoconferencing 2 (Inside Picture Tel), Andover, MA: PictureTel Corporation, 1996.
18. J. Duran and C. Sauer, Mainstream Videoconferencing: A Developer's Guide to Distance Multimedia, Reading, MA: Addison-Wesley, 1997.
19. B. Erol et al., The H.263+ video coding standard: Complexity and performance, Proc. IEEE Data Compression Conf., Snowbird, UT, 1998, pp. 259–268.
20. A. Jacquin, H. Okada, and P. Crouch, Content-adaptive postfiltering for very low bit rate video, Proc. IEEE Data Compression Conf., Snowbird, UT, 1997, pp. 111–120.
21. J. Goel, Digital video resizing, Electron. Design News, December 18, 1995, pp. 38–39.
22. A. Murat Tekalp, Digital Video Processing, Upper Saddle River, NJ: Prentice-Hall, 1995.
23. W. K. Pratt, Digital Image Processing, 2nd ed., New York: Wiley, 1991.
24. S. Khoshafian and A. Brad Baker, Multimedia and Imaging Databases, San Francisco: Morgan Kaufmann, 1996.
25. D. Kozen, Y. Minsky, and B. Smith, Efficient algorithms for optimal video transmission, Proc. IEEE Data Compression Conf., Snowbird, UT, 1998, pp. 229–238.
26. R. Weiss, A. Duda, and D. Gifford, Content-based access to algebraic video, IEEE Proc. Int. Conf. Multimedia Comput. Syst., 1994, pp. 140–151.
27. Y. Gong et al., An image database system with content capturing and fast image indexing abilities, IEEE Proc. Int. Conf. Multimedia Comput. Syst., 1994, pp. 140–151.
28. A. F. Inglis and A. C. Luther, Video Engineering, 2nd ed., New York: McGraw-Hill, 1996.
29. A. Sobel, Television's bright new technology, Sci. Amer., 278 (5): 45–55, 1998.
30. L. E. Tannas, Jr. (ed.), Flat-Panel Displays and CRTs, New York: Van Nostrand Reinhold, 1985.
31. E. Kaneko, Liquid Crystal TV Displays: Principles and Applications of Liquid Crystal Displays, Norwell, MA: Reidel Publishing, 1987.
32. W. E. Howard, Thin-film transistor/liquid crystal display technology: An introduction, IBM J. Res. Develop., 36 (1): 3–10, 1992.
33. P. Wam, Plastics Get Wired, Sci. Amer., Special Issue on Solid State Century, the Past, Present and Future of the Transistor, 90–96, January 22, 1998.
34. R. Gove, DMD display systems: The impact of an all-digital display, Proc. Soc. Inf. Display Int. Symp., San Jose, CA, 1994.
35. L. Pearlstein, J. Henderson, and J. Boyce, An SDTV decoder with HDTV capability: An all format ATV decoder, 137th SMPTE Proc., 1995, pp. 422–434.
36. Adobe Systems Inc., Postscript Language Reference Manual, Reading, MA: Addison-Wesley, 1990.
37. C. Poynton, Technical Introduction to Digital Video, New York: Wiley, 1996.
38. K. R. Castleman, Digital Image Processing, Upper Saddle River, NJ: Prentice-Hall, 1996.
ROBERT JOHN GOVE Equator Technologies, Inc.
Wiley Encyclopedia of Electrical and Electronics Engineering Image Reconstruction Standard Article Rangaraj M. Rangayyan1 and Apostolos Kantzas1 1University of Calgary Copyright © 1999 by John Wiley & Sons, Inc. DOI: 10.1002/047134608X.W4108 Article Online Posting Date: December 27, 1999 Abstract | Full Text: HTML PDF (649K)
Abstract The sections in this article are Projection Data Collection Methods, Image Reconstruction Techniques, Examples of Reconstructed Images, Display of CT Images, Industrial Applications, Concluding Remarks, and Acknowledgments. Keywords: computed tomography (CT); image reconstruction from projections; nuclear medicine imaging; single-photon emission computed tomography (SPECT); magnetic resonance (MR) imaging (MRI); x-ray imaging; industrial applications of computed tomography
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
IMAGE RECONSTRUCTION Computed tomography (CT) is an imaging technique that has revolutionized the field of medical diagnostics. CT has also found applications in many other areas such as nondestructive evaluation of industrial and biological specimens, radioastronomy, light and electron microscopy, optical interferometry, X-ray crystallography, petroleum engineering, and geophysical exploration. Indirectly, it has also led to new developments in its predecessor techniques in radiographic imaging. The fundamental principle behind CT, namely, image reconstruction from projections, has been known for about 80 years, since the exposition of the topic by Radon (1) in 1917. More recent developments in the subject arose in the 1950s and 1960s from work by a number of researchers in diverse applications. Some important publications in this area are those by Cormack (2, 3) on the representation of a function by its line integrals; by Bracewell and Riddle (4) on the reconstruction of brightness distributions of astronomical bodies from fan-beam scans at various angles; by the authors of Refs. 5 and 6 on the reconstruction of three-dimensional (3-D) images of viruses from electron micrographs; by the author of Ref. 7 on the convolution backprojection technique; and by Gordon et al. (8) on algebraic reconstruction techniques. Pioneering work on the development of practical scanners for medical applications is reported in Refs. 9, 10, and 11. X-ray CT was well established as a clinical diagnostic tool by the early 1970s. Mathematically, the main principle behind CT imaging is that of estimating an image (object) from its projections (integrals) measured in different directions (12–20). A projection of an image is also referred to as the Radon transform of the image at the corresponding angle, after the main proponent of the associated mathematical principles. In continuous space, the projections are ray integrals of the image, measured at different ray positions and angles; in practice, only discrete measurements are available. The solution to the problem of image reconstruction may be formulated variously as backprojecting and summing the given projection data, completing the corresponding Fourier space, solving a set of simultaneous equations, and so on. Each of these methods has its own advantages and disadvantages that determine its suitability to a particular imaging application. In this article, we describe the three basic approaches to image reconstruction from projections mentioned above. Techniques for gathering the projection data as well as for display and processing of the reconstructed images in a few specific application areas will be described briefly.
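To give a flavor of the first formulation listed above, backprojecting and summing the projection data, the following minimal sketch backprojects a set of parallel-ray projections onto a square grid. It is an illustration only: unfiltered backprojection yields a blurred estimate, and the reconstruction techniques described later in this article refine this idea (e.g., by filtering the projections first). The function name and array shapes are our own choices, not part of any standard package.

import numpy as np

def backproject(sinogram, angles_deg, size):
    """Smear each parallel-ray projection back across the image plane and sum.

    sinogram: array of shape (num_angles, num_detector_bins)
    angles_deg: projection angles in degrees
    size: side length of the square reconstruction grid
    """
    recon = np.zeros((size, size))
    center = (size - 1) / 2.0
    ys, xs = np.mgrid[0:size, 0:size]
    xs, ys = xs - center, ys - center
    for proj, angle in zip(sinogram, np.deg2rad(angles_deg)):
        det_center = (proj.size - 1) / 2.0
        # Detector coordinate hit by each pixel for this view: t = x cos(a) + y sin(a)
        t = xs * np.cos(angle) + ys * np.sin(angle) + det_center
        lo = np.clip(np.floor(t).astype(int), 0, proj.size - 2)
        frac = np.clip(t - lo, 0.0, 1.0)
        recon += (1 - frac) * proj[lo] + frac * proj[lo + 1]   # linear interpolation
    return recon / len(angles_deg)

# Example: a square phantom and its projections at 0 and 90 degrees only.
phantom = np.zeros((64, 64))
phantom[24:40, 24:40] = 1.0
sino = np.array([phantom.sum(axis=0), phantom.sum(axis=1)])
image = backproject(sino, [0.0, 90.0], 64)
print(image.shape, round(float(image.max()), 1))

With only two views the result is a cross-shaped blur; many views, and filtering of each projection, are needed for a faithful reconstruction.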
Projection Data Collection Methods In ordinary radiography, a two-dimensional (2-D) shadow of a 3-D body is produced on film by irradiating the body with X-ray photons [see Fig. 1(a)]. Each ray of X-ray photons is attenuated by a factor depending on the integral of the linear attenuation coefficient along the path of the ray and produces a corresponding gray level at the point hit on the film or the detecting device used. Let N_i denote the number of X-ray photons incident upon the body being imaged within a specified time interval for a particular ray path. Let N_o be the
Fig. 1. (a) An ordinary X-ray image or typical radiograph is a 2-D shadow of a 3-D object. The entire object is irradiated with X rays. (b) In CT imaging, only the desired sectional plane of the body—labeled as the plane ABCD in the figure—is irradiated. The measured data represent a 1-D projection of a 2-D cross-sectional plane of the object. The profile X–X in (a) and the projection X–X in (b) should be identical except for the effects of scattered radiation.
corresponding number of photons exiting the body. Then, we have

N_o = N_i exp[ - ∫ µ(x, y) ds ]     (1)

or

ln(N_i / N_o) = ∫ µ(x, y) ds     (2)
N_i and N_o are Poisson variables; it is assumed that they are very large for the above equations to be applicable. µ(x, y) represents the linear attenuation coefficient at (x, y), ds represents the elemental distance along the ray, and the integral is along the ray path from the X-ray source to the detector. The value of µ(x, y) depends on the density of the object or its constituents along the ray path, as well as the frequency (or wavelength or energy) of the radiation used. Equation (1) assumes the use of monochromatic or monoenergetic X rays. A measurement of the exiting X rays (i.e., N_o and N_i for reference) thus gives us only an integral of µ(x, y) over the ray path. The internal details of the body along the ray path are compressed onto a single point on
the film or a single measurement. Extending the same argument to all ray paths, we see that the radiographic image so produced is a 2-D (X-ray) shadow of the 3-D object, where the internal details are superimposed. The problem of obtaining details of the interior of the human body noninvasively had always been of interest, and within a few years after the discovery of X rays by Röntgen in 1895, techniques were developed to image sectional planes of the body. These methods, called laminagraphy, planigraphy, tomography, and so on (21), used synchronous movement of the X-ray source and film in such a way as to produce a sharp image of a single focal plane, with the images of all other planes being blurred; the methods had limited commercial success. CT imaging as we know it today was developed during the late 1960s and the early 1970s, producing images of cross sections of the human head and body as never seen before (noninvasively and nondestructively!). While the fundamental radiographical equation is the same as Eq. (1), in most CT scanners, only the desired cross section of the body is irradiated using a finely collimated ray of X-ray photons [see Fig. 1(b)], instead of irradiating the entire body with a solid beam of X rays as in ordinary radiography. Such ray integrals are measured at many positions and angles around the body, scanning the body in the process. The principle of image reconstruction from projections is then used to compute an image of a section of the body, hence the name computed tomography. [See Ref. 22 for an excellent review of the history of CT imaging; see also Ref. 23.] Figure 2 depicts some of the scanning procedures employed: Figure 2(a) shows the translate–rotate scanning geometry for parallel-ray projections, Fig. 2(b) shows the translate–rotate scanning geometry with a small fan-beam detector array, Fig. 2(c) shows the rotate-only scanning geometry for fan-beam projections, and Fig. 2(d) shows the rotate-only scanning geometry for fan-beam projections using a ring of detectors. A more recently developed scanner specialized for cardiovascular imaging (24, 25) completely eliminates mechanical scanning movement to reduce scanning time by employing electronically steered X-ray microbeams and rings of detectors, as illustrated schematically in Fig. 3. Once a sectional image is obtained, the process may be repeated to obtain a series of sectional images of the 3-D body or object being investigated. Imaging a 3-D body is usually accomplished by reconstructing one 2-D section at a time through the use of one-dimensional (1-D) projections. Exceptions to this are the Dynamic Spatial Reconstructor developed at the Mayo Clinic (22, 26), where a series of 2-D projection images are obtained via irradiation of the portion of the body of interest and a fluorescent screen, and single-photon emission computed tomography (SPECT), where a series of 2-D projection images are obtained using a gamma camera (17, 27–30). Some of the physical considerations in X-ray CT imaging are the effects of beam hardening, dual-energy imaging, scatter, photon detection noise, and ray stopping by heavy implants (18, 28, 29, 31).
•
•
Beam Hardening. The X rays used in radiographic imaging are typically not monoenergetic; that is, they possess X-ray photons over a certain band of frequencies or electromagnetic energy levels. As the X rays propagate through a body, the lower-energy photons get absorbed preferentially (depending on the length of the ray path through the body and the attenuation characteristics of the material along the path). Thus the X rays that pass through the object at longer distances from the source will possess relatively fewer photons at lower-energy levels than at the point of entry into the object (and hence a relatively higher concentration of higher-energy photons). This effect is known as beam hardening. The effect of beam hardening may be reduced by prefiltering or prehardening the X-ray beam and narrowing its spectrum. Use of monoenergetic X rays from a synchrotron or a laser could also obviate this problem. Dual-Energy Imaging. Different materials have different energy-dependent X-ray attenuation coefficients. X-ray measurements or images obtained at multiple energy levels (also known as energy-selective imaging) could be combined to derive information about the distribution of specific materials in the object or body imaged. Weighted combinations of multiple energy images may be obtained to display soft-tissue and hard-tissue details separately (29). Scatter. As an X-ray beam propagates through a body, photons are lost due to absorption and scattering at each point in the body. The angle of the scattered photon at a given point along the incoming beam is
Fig. 2. (a) Translate–rotate scanning geometry for parallel-ray projections; (b) translate–rotate scanning geometry with a small fan-beam detector array; (c) rotate-only scanning geometry for fan-beam projections; and (d) rotate-only scanning geometry for fan-beam projections using a ring of detectors. (Reproduced with permission from R. A. Robb, X-ray computed tomography: An engineering synthesis of multiscientific principles, CRC Crit. Rev. Biomed. Eng., 7: 264–333, Mar. 1982. Copyright © 1982 CRC Press, Boca Raton, Florida.)
a random variable, and hence the scattered photon contributes to noise at the point where it strikes the detector. Furthermore, scattering results in loss of contrast of the part of the object where photons were scattered from the main beam. The noise effect of scattered radiation is significant in gamma-ray emission imaging (SPECT) and requires specific methods to improve the quality of the image (27, 28). The effect of scatter may be reduced by collimation or energy discrimination, as scattered photons usually have lower energy levels.
Fig. 3. Electronic steering of an X-ray beam for motion-free scanning and CT imaging. (Reproduced with permission from D. P. Boyd et al., Proposed dynamic cardiac 3-D densitometer for early detection and evaluation of heart disease, IEEE Trans. Nucl. Sci., NS-26: 2724–2727, 1979. Copyright © 1979 IEEE. See also Ref. 22.)
• Photon Detection Noise. The interaction between an X-ray beam and a detector is governed by the same rules as for interaction with any other matter: photons are lost due to scatter and absorption, and some photons may pass through unaffected (undetected). The small size of the detectors in CT imaging reduces their detection efficiency. Scattered and undetected photons cause noise in the measurement; for a detailed analysis of noise in X-ray detection, please refer to Ref. 29 and Cho et al. (28).
• Ray Stopping by Heavy Implants. If the body imaged contains extremely heavy parts or components that are nearly X-ray opaque and stop the incoming X-ray photons entirely (such as metal screws or plates in bones and surgical clips), no photons will be detected at the corresponding point of exit from the body. Effectively, the attenuation coefficient for the corresponding path would be infinite. A reconstruction algorithm would not be able to redistribute such an attenuation value over the pixels along the corresponding ray path in the reconstructed image. This leads to streaking artifacts in CT images.
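The quantity that all of the reconstruction algorithms discussed below operate on is the ray integral of µ(x, y), not the raw photon counts. The following minimal numerical sketch (not from the article; the attenuation values and ray geometry are invented for illustration, and Poisson noise is ignored) shows how a detector reading is converted to a ray sum under the monoenergetic model of Eq. (1).

```python
import numpy as np

# Hypothetical samples of the linear attenuation coefficient (1/cm) along one ray path;
# the values are invented for illustration (roughly water-like with a denser inclusion).
mu = np.array([0.0, 0.19, 0.19, 0.45, 0.19, 0.19, 0.0])
ds = 0.5                               # spacing between samples along the ray (cm)

N_i = 1.0e6                            # photons entering the body along this ray
line_integral = np.sum(mu) * ds        # discrete approximation of the integral of mu ds
N_o = N_i * np.exp(-line_integral)     # monoenergetic attenuation model, as in Eq. (1)

# The scanner measures N_i and N_o; the projection (ray sum) handed to the
# reconstruction algorithm is the logarithm of their ratio, which recovers
# the line integral of the attenuation coefficient along the ray path.
p = np.log(N_i / N_o)
print(p, line_integral)                # the two values agree, up to floating-point error
```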
Other imaging modalities used for projection data collection are ultrasound (time of flight or attenuation), magnetic resonance (MR), and nuclear emission (gamma rays or positrons) (12, 17, 28, 29, 32–34). Techniques using nonionizing radiation are of importance in imaging pregnant women and/or fetuses. While the physical parameter imaged may differ among these modalities, once the projection data are acquired, the mathematical image reconstruction procedure could be almost the same. A few special considerations in imaging with diffracting sources are described in the section entitled “Imaging with Diffracting Sources.” The nature of data acquired in MR imaging is described in the section entitled “Nature of Data Acquired in Magnetic Resonance Imaging.”
Image Reconstruction Techniques

Projection Geometry. Let us now consider the problem of reconstructing a 2-D image given parallel-ray projections of the image measured at different angles. Referring to Fig. 4, let f(x, y) represent the density distribution within the image. While in practice discrete images are used, the initial presentation here is in continuous-space notation for easier comprehension. Consider a ray AB represented by the equation

\[ x \cos\theta + y \sin\theta = t_1 \]
[The derivations presented in this article follow closely those of Rosenfeld and Kak (13); we thank them for their kind permission. For further details, please refer to Refs. 17 and 18 and to Kak and Slaney (12).] The integral of f(x, y) along the ray path AB is given by

\[ P_\theta(t_1) = \int_{AB} f(x, y)\, ds = \iint f(x, y)\, \delta(x \cos\theta + y \sin\theta - t_1)\, dx\, dy \]
where δ(·) is the Dirac delta function. When this integral is evaluated for different values of the ray offset t_1, we obtain the projection P_θ(t). The function P_θ(t) is known as the Radon transform of f(x, y). [Note that while a single projection P_θ(t) of a 2-D image at a given value of θ is a 1-D function, a set of projections for various values of θ could be seen as a 2-D function.] As the different rays within a projection are parallel to one another, this is known as parallel-ray geometry. Theoretically, we would need an infinite number of projections for all θ to be able to reconstruct the image. Before we consider reconstruction techniques, let us take a look at the projection or Fourier slice theorem.

The Projection or Fourier Slice Theorem. The projection or Fourier slice theorem relates the three spaces we encounter in image reconstruction from projections—the image, Fourier, and projection (Radon) spaces. Considering a 2-D image, the theorem states that the 1-D Fourier transform (FT) of a 1-D projection of the 2-D image is equal to the radial section (slice) of the 2-D FT of the 2-D image at the angle of the projection. This is illustrated graphically in Fig. 5 and may be derived as follows. Let F(u, v) represent the 2-D FT of f(x, y), given by

\[ F(u, v) = \iint f(x, y)\, e^{-j 2\pi (ux + vy)}\, dx\, dy \]
Let S_θ(w) represent the 1-D FT of the projection P_θ(t); that is,

\[ S_\theta(w) = \int_{-\infty}^{\infty} P_\theta(t)\, e^{-j 2\pi w t}\, dt \]
where w represents the frequency variable corresponding to t. (Note: if x, y, s, and t are in centimeters, the units for u, v, and w will be cycles/cm.) Let f(t, s) represent the image f(x, y) rotated by angle θ, with the transformation given by

\[ t = x \cos\theta + y \sin\theta, \qquad s = -x \sin\theta + y \cos\theta \]
Fig. 4. Illustration of a ray path AB through a sectional image f(x, y). The (t, s) axis system is rotated by angle θ with respect to the (x, y) axis system. ds represents the elemental distance along the ray path AB. P_θ(t_1) is the ray integral of f(x, y) for the ray path AB. P_θ(t) is the parallel-ray projection (Radon transform or integral) of f(x, y) at angle θ. (Adapted with permission from A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed., New York: Academic Press, 1982. Copyright © 1982 Academic Press.)
Then

\[ S_\theta(w) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(t, s)\, e^{-j 2\pi w t}\, ds\, dt \]
Fig. 5. Illustration of the Fourier slice theorem. F(u, v) is the 2-D FT of f(x, y). F(w, θ_1) = S_{θ_1}(w) is the 1-D FT of P_{θ_1}(t). F(w, θ_2) = S_{θ_2}(w) is the 1-D FT of P_{θ_2}(t).
Transforming from (t, s) to (x, y), we get

\[ S_\theta(w) = \iint f(x, y)\, e^{-j 2\pi w (x \cos\theta + y \sin\theta)}\, dx\, dy = F(w \cos\theta,\, w \sin\theta) = F(w, \theta), \]
which expresses the projection theorem. Note that dxdy = dsdt. It immediately follows that if we have projections available at all angles from 0◦ to 180◦ , we can take their 1-D FTs, fill the 2-D Fourier space with the corresponding radial sections or slices, and take an inverse 2-D FT to obtain the image f (x, y). The difficulty lies in the fact that, in practice, only a finite number of projections will be available, measured at discrete angular positions or steps. Thus some form of interpolation will be essential in the 2-D Fourier space (5, 6). Extrapolation may also be required if the given projections do not span the
entire angular range. This method of reconstruction from projections, known as the Fourier method, succinctly relates the image, Fourier, and projection (Radon) spaces.

Backprojection. Let us now consider the simplest reconstruction procedure—backprojection (BP). Assuming the rays to be ideal straight lines (rather than strips of finite width) and the image to be made of dimensionless points rather than pixels or voxels of finite size, it can be seen that each point in the image f(x, y) contributes to only one ray integral per parallel-ray projection P_θ(t), with t = x cos θ + y sin θ. We may obtain an estimate of the density at a point by simply summing (integrating) all the rays that pass through it at various angles, that is, by backprojecting the individual rays. In doing so, however, the contributions to the various rays of all other points along their paths are also added up, causing smearing or blurring; yet this method produces a reasonable estimate of the image. Mathematically, simple BP can be expressed as (13)

\[ f_{BP}(x, y) = \int_0^{\pi} P_\theta(x \cos\theta + y \sin\theta)\, d\theta \]
This is a sinusoidal path of integration in the (θ, t) Radon space. In practice, only a finite number of projections and a finite number of rays per projection will be available—that is, the (θ, t) space will be discretized—hence interpolation will be required. Considering a point source as the image to be reconstructed, it can be seen that BP produces a spokelike pattern with lines at all projection angles, intersecting at the position of the point source. This may be considered to be the point spread function (PSF) of the reconstruction process, which is responsible for the blurring of details (please see the section entitled “Examples of Reconstructed Images” for illustrations of the PSF of the BP process). Because the BP procedure is linear, the reconstructed version of an unknown image may be modeled as the convolution of the unknown image with the PSF. Knowing the PSF, we may attempt deconvolution. Deconvolution is implicit in the filtered (convolution) backprojection technique, which is described next.

Filtered Backprojection. Consider the inverse FT relationship

\[ f(x, y) = \iint F(u, v)\, e^{j 2\pi (ux + vy)}\, du\, dv \]
Changing from (u, v) to polar coordinates (w, θ), we get

\[ f(x, y) = \int_0^{2\pi} \int_0^{\infty} F(w, \theta)\, e^{j 2\pi w (x \cos\theta + y \sin\theta)}\, w\, dw\, d\theta \]
Here, u = w cos θ, v = w sin θ, and du dv = w dw dθ. Since F(w, θ + π) = F(−w, θ), we get

\[ f(x, y) = \int_0^{\pi} \int_{-\infty}^{\infty} F(w, \theta)\, |w|\, e^{j 2\pi w t}\, dw\, d\theta \qquad (2) \]
where again t = x cos θ + y sin θ. If we now define

\[ Q_\theta(t) = \int_{-\infty}^{\infty} S_\theta(w)\, |w|\, e^{j 2\pi w t}\, dw, \]
we get

\[ f(x, y) = \int_0^{\pi} Q_\theta(x \cos\theta + y \sin\theta)\, d\theta \]
It is now seen that a perfect reconstruction of f(x, y) may be obtained by backprojecting filtered projections Q_θ(t) instead of backprojecting the original projections P_θ(t); hence the name filtered backprojection (FBP). The filter is represented by the |w| function, known as the ramp filter. Note that the limits of integration in Eq. (2) are (0, π) for θ and (−∞, ∞) for w. In practice, a smoothing window should be applied to reduce the effects of high-frequency noise. Furthermore, the integrals change to summations in practice due to the finite number of projections available, as well as the discrete nature of the projections themselves and the FT computations employed. (Details of the discrete version of FBP are provided in the section entitled “Discrete Filtered Backprojection.”) An important feature of the FBP technique is that each projection may be filtered and backprojected as other projections are being acquired, which was of help in online processing with the first-generation scanners. Furthermore, the inverse FT of the filter |w| (with modifications to account for the discrete nature of measurements, smoothing window, etc.) could be used to convolve the projections directly in the t space (7) using fast array processors. FBP is the most widely used procedure for image reconstruction from projections; however, the procedure provides good reconstructed images only when a large number of projections spanning the full angular range of 0◦ to 180◦ are available.

Discrete Filtered Backprojection. The filtering procedure with the |w| function, in theory, must be performed over −∞ ≤ w ≤ ∞. In practice, the signal energy above a certain frequency limit will be negligible, and |w| filtering beyond that limit only increases noise. Thus we may consider the projections to be bandlimited to ±W cycles/cm. Then, using the sampling theorem, P_θ(t) can be represented by its samples at the sampling rate 2W cycles/cm as

\[ P_\theta(t) = \sum_{k=-\infty}^{\infty} P_\theta\!\left(\frac{k}{2W}\right) \frac{\sin[2\pi W (t - k/2W)]}{2\pi W (t - k/2W)} \]
Then
If the projections are of finite order, that is, they can be represented by a finite number of samples N + 1, then
Assume that N is even, and let the frequency axis be discretized as
Then
This represents the discrete FT (DFT) relationship and may be evaluated using the fast Fourier transform (FFT) algorithm. The filtered projection Qθ (n τ) may be obtained as
If we want to evaluate Qθ (t) for only those t at which Pθ (t) has been sampled, we get
In order to control noise enhancement by the |m(2W/N)| filter, it may be beneficial to include a filter window such as the Hamming window; then
with
Using the convolution theorem, we get
where ⊛ denotes circular (periodic) convolution, and φ(k/2W) is the inverse DFT of |m(2W/N)| G(m(2W/N)), m = −N/2, . . ., 0, . . ., N/2. Butterworth or other lowpass filters may also be used instead of the Hamming window. Note that the inverse FT of |w| does not exist, as |w| is not square integrable. However, if we consider the inverse FT of |w| exp(−ε|w|) as ε → 0, we get the function (13)
For large t, p_ε(t) ≈ −1/(2πt)². The reconstructed image may be obtained as

\[ f(x, y) \approx \frac{\pi}{K} \sum_{i=1}^{K} Q_{\theta_i}(x \cos\theta_i + y \sin\theta_i) \]
where the K angles θ_i are those at which projections P_θ(t) are available. For practical implementation of discrete FBP, let the projections be sampled with an interval of τ cm with no aliasing error. Each projection P_θ(kτ) is thus limited to the frequency band (−W, W), with W = 1/(2τ) cycles/cm. The continuous versions of the filtered projections are

\[ Q_\theta(t) = \int_{-\infty}^{\infty} S_\theta(w)\, H(w)\, e^{j 2\pi w t}\, dw \]
where the filter H(w) = |w| b_W(w), with b_W(w) as defined earlier in Eq. (3). The impulse response of the filter H(w) is (13)

\[ h(t) = 2W^2\, \mathrm{sinc}(2Wt) - W^2\, \mathrm{sinc}^2(Wt), \qquad \mathrm{sinc}(x) = \frac{\sin(\pi x)}{\pi x} \]
Since we require h(t) only at integral multiples of the sampling interval τ, we have (with W = 1/(2τ))

\[ h(n\tau) = \begin{cases} 1/(4\tau^2), & n = 0 \\ 0, & n \ \text{even} \\ -1/(n^2 \pi^2 \tau^2), & n \ \text{odd} \end{cases} \]
The filtered projection Q_θ(t) may be obtained as

\[ Q_{\theta_i}(n\tau) = \tau \sum_{k=0}^{N-1} h(n\tau - k\tau)\, P_{\theta_i}(k\tau), \]
where N is the finite number of samples in the projection P_{θ_i}(kτ). Note that h(nτ) is required for n = −(N − 1), . . ., 0, . . ., N − 1. When the filter is implemented as a convolution, the FBP method is also referred to as convolution backprojection. The procedure for FBP may be expressed in algorithmic form as:
(1) Measure projection P_{θ_i}(nτ).
(2) Compute the filtered projection Q_{θ_i}(nτ).
(3) Backproject the filtered projection Q_{θ_i}(nτ).
(4) Repeat Steps 1 to 3 for all projection angles θ_i, i = 1, 2, . . ., K.
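These four steps translate almost directly into code. The sketch below is a minimal illustration (not the authors' implementation): it assumes the projections are supplied as a (number of angles) × N array of parallel-ray samples taken at spacing τ, uses the sampled ramp-filter impulse response h(nτ) in its standard band-limited form with W = 1/(2τ), backprojects with nearest-neighbour interpolation, and approximates the integral over θ by the factor π/K. With K projections evenly spaced over 0° to 180° it should reproduce the behaviour described in the text: good reconstructions for large K and streaking when only a few views are available.

```python
import numpy as np

def ramp_filter_kernel(num_samples, tau=1.0):
    """Sampled impulse response h(n*tau) of the band-limited ramp filter (W = 1/(2*tau)):
    1/(4*tau^2) at n = 0, zero for even n, and -1/(n*pi*tau)^2 for odd n."""
    n = np.arange(-(num_samples - 1), num_samples)      # n = -(N-1), ..., 0, ..., N-1
    h = np.zeros(n.shape, dtype=float)
    h[n == 0] = 1.0 / (4.0 * tau ** 2)
    odd = (n % 2) != 0
    h[odd] = -1.0 / (np.pi * n[odd] * tau) ** 2
    return h

def filtered_backprojection(projections, angles_deg, tau=1.0):
    """projections: array of shape (K, N), one parallel-ray projection per row.
    Returns an N x N reconstruction on the same grid as the detector samples."""
    K, N = projections.shape
    h = ramp_filter_kernel(N, tau)
    centre = (N - 1) / 2.0
    coords = (np.arange(N) - centre) * tau              # pixel coordinates, origin at the centre
    X, Y = np.meshgrid(coords, coords)
    image = np.zeros((N, N))
    for p, angle in zip(projections, angles_deg):
        theta = np.deg2rad(angle)
        # Step 2: filter the projection by discrete convolution with h (keep the
        # N output samples aligned with the detector positions).
        q = tau * np.convolve(p, h)[N - 1:2 * N - 1]
        # Step 3: backproject; each pixel picks up q at t = x cos(theta) + y sin(theta).
        t = X * np.cos(theta) + Y * np.sin(theta)
        idx = np.clip(np.round(t / tau + centre).astype(int), 0, N - 1)
        image += q[idx]
    return image * np.pi / K                            # approximate the integral over theta
```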
Severe artifacts arise if sampling in the (θ, t) space is inadequate or incomplete. The FBP algorithm is suitable for online implementation in a translate–rotate CT scanner as each parallel-ray projection may be filtered and backprojected as soon as it is acquired, while the scanner is acquiring the next projection. The reconstructed image is ready as soon as the last projection is acquired, filtered, and backprojected. When the projections are acquired using fan-beam geometry, one could either re-bin the fan-beam data to compose parallel-ray projections, or use reconstruction algorithms specifically tailored to fan-beam geometry (12, 13).

Algebraic Reconstruction Techniques. In the absence of complete projection data spanning the full angular range of 0◦ to 180◦, the algebraic reconstruction technique (ART) (19, 20) could yield better results than FBP or the Fourier method. ART is related to the Kaczmarz method of projections for solving simultaneous equations [see Rosenfeld and Kak (13) for an excellent discussion on this topic]. The Kaczmarz method takes an approach that is completely different from that of the Fourier or FBP methods: the available projections (ray sums in the discrete case) are seen as a set of simultaneous equations, with the unknown quantities being discrete pixels of the image. The large sizes of images encountered in practice preclude the use of the usual methods for solving simultaneous equations. Furthermore, in many practical applications, the number of available equations may be far less than the number of pixels in the image to be reconstructed; the set of simultaneous equations is then underdetermined. The Kaczmarz method of projections is an elegant iterative method, which may be implemented easily.
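Before developing the Kaczmarz iteration, the "ray sums as simultaneous equations" view can be made concrete with a toy example (not from the article): a 2 × 2 image whose four pixel values are the unknowns, with unit-width rays along the rows, the columns, and one diagonal. Here numpy's direct least-squares solver stands in for the iterative scheme that follows; for realistic image sizes this direct approach is exactly what the text rules out.

```python
import numpy as np

# Hypothetical 2x2 image, flattened row by row to f = [f1, f2, f3, f4].
f_true = np.array([1.0, 2.0, 3.0, 4.0])

# Each row of W holds the weights w_ij of one ray; with rays that coincide with whole
# rows/columns (and one diagonal) of the pixel grid, the weights are simply 0 or 1.
W = np.array([
    [1, 1, 0, 0],    # ray sum of the top row
    [0, 0, 1, 1],    # ray sum of the bottom row
    [1, 0, 1, 0],    # ray sum of the left column
    [0, 1, 0, 1],    # ray sum of the right column
    [1, 0, 0, 1],    # a diagonal ray, added so that the system has full rank
], dtype=float)

p = W @ f_true                                   # the measured ray sums (projections)

# Reconstruction: find f such that W f = p.  With row and column sums alone the
# equations are not independent (the system is underdetermined, as discussed below);
# the extra diagonal ray makes the solution unique.
f_est, *_ = np.linalg.lstsq(W, p, rcond=None)
print(f_est.reshape(2, 2))                       # recovers the 2x2 image [[1, 2], [3, 4]]
```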
Fig. 6. ART treats the image as a matrix of discrete pixels of finite size (Δx, Δy). Each ray has a finite width. The fraction of the area of the jth pixel crossed by the ith ray is represented by the weighting factor w_ij; w_ij = area of ABCD/(Δx Δy) for the jth pixel in the figure. (Adapted with permission from A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed., New York: Academic Press, 1982. Copyright © 1982 Academic Press.)
Let the image to be reconstructed be divided into N cells, with f_j denoting the value in the jth cell (the image density or intensity is assumed to be constant within each cell). Let M ray sums be made available, expressed as

\[ p_i = \sum_{j=1}^{N} w_{ij}\, f_j, \qquad i = 1, 2, \ldots, M, \]
where wij is the contribution factor of the jth image element to the ith ray sum, equal to the fractional area of the jth cell crossed by the ith ray path, as illustrated in Fig. 6. Note that for a given ray i, most of the wij will be zero, as only a few elements of the image contribute to the corresponding ray sum. Then
A grid representation with N cells gives the image N degrees of freedom. Thus an image represented by (f1 , f2 , . . ., fN ) may be considered to be a single point in an N-dimensional hyperspace. Thus each of the above ray sum equations will represent a hyperplane in this hyperspace. If a unique solution exists, it is given by the intersection of all the hyperplanes at a single point. To arrive at the solution, the Kaczmarz method takes the
approach of successively and iteratively projecting an initial guess and its successors from one hyperplane to the next. Let us, for simplicity, consider a 2-D version of the situation, as illustrated in Fig. 7. Let f^(0) represent vectorially the initial guess to the solution, and let w_1 represent vectorially the series of weights (coefficients) in the first ray equation. The first ray sum may then be written as

\[ \mathbf{w}_1 \cdot \mathbf{f} = p_1 \qquad (4) \]
The hyperplane represented by this equation is orthogonal to w1 . With reference to Fig. 8, Eq. (4) says that for the vector OC corresponding to any point C on the hyperplane, its projection onto the vector w1 is of a constant length. The unit vector OU along w1 is given by
The perpendicular distance of the hyperplane from the origin is
Now, f (1) = f (0) − GH, and
Since the directions of GH and OU are the same, GH = |GH| OU. Thus
Therefore
In general, the jth estimate is obtained from the (j − 1)th estimate as

\[ \mathbf{f}^{(j)} = \mathbf{f}^{(j-1)} - \frac{\mathbf{f}^{(j-1)} \cdot \mathbf{w}_j - p_j}{\mathbf{w}_j \cdot \mathbf{w}_j}\, \mathbf{w}_j \qquad (5) \]
Fig. 7. Illustration of the Kaczmarz method of solving a pair of simultaneous equations in two unknowns. The solution is f = [3, 4]. The weight vectors for the two ray sums (straight lines) are w_1 = [2, −1] and w_2 = [1, 1]. The equations of the straight lines are w_1 · f = 2f_1 − f_2 = 2 = p_1 and w_2 · f = f_1 + f_2 = 7 = p_2. The initial estimate is f^(0) = [4, 1]. The first updated estimate is f^(1) = [2, 2]; the second updated estimate is f^(2) = [3.5, 3.5]. As two ray sums are given, two corrections constitute one cycle (or iteration) of ART. The path of the second cycle of ART is also illustrated in the figure.
That is, the current, (j − 1)th, estimate is projected onto the jth projection hyperplane and the deviation from the true ray sum p_j is obtained. This deviation is normalized and applied as a correction to all the pixels according to the weighting factors w_j. When this process is applied to all the M ray sum hyperplanes given, one cycle or iteration is completed. Depending on the initial guess and the arrangement of the hyperplanes, a number of iterations may have to be completed in order to obtain the solution (if it exists). The following characteristics of ART are worth noting:
• ART proceeds ray by ray and is iterative.
• If the hyperplanes of all the given projections are mutually orthogonal, we may start with any initial guess and reach the solution (if it exists) in only one cycle. On the other hand, if the hyperplanes subtend very small angles with one another, a large number of iterations will be required. The number of iterations required may be reduced by using optimized projection access schemes (35).
Fig. 8. Illustration of the algebraic reconstruction technique. f^(1) is an improved estimate computed by projecting the initial estimate f^(0) onto the hyperplane (the straight line AG in the illustration) corresponding to the first ray sum, given by the equation w_1 · f = p_1. (Reproduced with permission from A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed., New York: Academic Press, 1982. Copyright © 1982 Academic Press.)
• If the number of ray sums is greater than the number of pixels, that is, M ≥ N, but the measurements are noisy, no unique solution exists—the procedure will oscillate in the neighborhood of the intersections of the hyperplanes.
• If M < N, the system is underdetermined and an infinite number of solutions will exist. It has been shown that unconstrained ART converges to the minimum-variance estimate (36).
• The major advantage of ART is that any a priori information about the image may be introduced easily into the iterative procedure (e.g., upper or lower limits on pixel values, spatial boundaries of the image). This may help in obtaining a useful “solution” even if the system is underdetermined.
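The update of Eq. (5) can be checked directly on the two-equation example of Fig. 7. The minimal sketch below (with the numbers taken from the figure caption) reproduces the first two estimates and converges towards the solution [3, 4].

```python
import numpy as np

def kaczmarz_step(f, w, p):
    """Project the current estimate f onto the hyperplane w . f = p, as in Eq. (5)."""
    return f - ((f @ w - p) / (w @ w)) * w

# The two ray-sum equations of Fig. 7: 2*f1 - f2 = 2 and f1 + f2 = 7.
w1, p1 = np.array([2.0, -1.0]), 2.0
w2, p2 = np.array([1.0, 1.0]), 7.0

f = np.array([4.0, 1.0])             # initial estimate f(0), as in the figure
for cycle in range(10):              # one cycle applies both ray equations once
    f = kaczmarz_step(f, w1, p1)     # first correction of cycle 1 gives [2, 2]
    f = kaczmarz_step(f, w2, p2)     # second correction of cycle 1 gives [3.5, 3.5]
print(f)                             # approaches the solution f = [3, 4]
```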
Approximations to the Kaczmarz Method. We could rewrite the reconstruction step in Eq. (5) at the pixel level as

\[ f_m^{(j)} = f_m^{(j-1)} + \frac{p_j - q_j}{\sum_{k=1}^{N} w_{jk}^2}\, w_{jm}, \qquad m = 1, 2, \ldots, N, \]
where q_j = f^(j−1) · w_j = Σ_{k=1}^{N} f_k^(j−1) w_{jk}. This equation says that when we project the (j − 1)th estimate onto the jth hyperplane, the correction factor for the mth cell is

\[ \Delta f_m^{(j)} = \frac{p_j - q_j}{\sum_{k=1}^{N} w_{jk}^2}\, w_{jm} \]
Here, p_j is the given (true) ray sum for the jth ray, and q_j is the computed ray sum for the same ray for the estimated image on hand. (p_j − q_j) is the error in the estimate, which may be normalized and applied as a correction to all the pixels with appropriate weighting. In one of the approximations (19, 20), the w_{jk} are simply replaced by 0’s or 1’s depending on whether or not the center of the kth image cell lies within the jth ray (of finite width). Then the coefficients need not be computed and stored: we may instead determine the pixels to be corrected for the ray considered during the reconstruction procedure. Furthermore, Σ_{k=1}^{N} w_{jk}² = N_j, the number of pixels crossed by the jth ray. The correction to all pixels in the jth ray is then (p_j − q_j)/N_j. Thus

\[ f_m^{(j)} = \begin{cases} f_m^{(j-1)} + \dfrac{p_j - q_j}{N_j}, & \text{if pixel } m \text{ lies within the } j\text{th ray} \\ f_m^{(j-1)}, & \text{otherwise} \end{cases} \]
Because the corrections may be negative, negative pixel values may be encountered. Since negative values are not meaningful in most imaging applications, the constrained (and thereby nonlinear) version of ART is defined as

\[ f_m^{(j)} = \max\!\left[0,\; f_m^{(j-1)} + \frac{p_j - q_j}{N_j}\right] \quad \text{for pixels within the } j\text{th ray} \qquad (8) \]
The corrections could also be multiplicative (8):

\[ f_m^{(j)} = f_m^{(j-1)}\, \frac{p_j}{q_j} \quad \text{for pixels within the } j\text{th ray} \]
In this case no positivity constraint is required. Furthermore, the convex hull of the image is almost guaranteed (subject to approximation related to the number of projections available), as a pixel once set to zero will remain so during subsequent iterations. It has been shown that the multiplicative version of ART converges to the maximum-entropy estimate of the image (18, 37). A generic ART procedure may be expressed in the following algorithmic form:
(1) Prepare an initial estimate of the image. All of the pixels in the initial image could be zero for additive ART; however, for multiplicative ART, pixels within at least the convex hull of the object in the image must be nonzero.
(2) Compute the projection (or ray sum) q_j for the first ray path for the initial estimate of the image.
(3) Obtain the difference between the true ray sum p_j and the computed ray sum q_j, and apply the correction to all the pixels belonging to the ray according to one of the ART equations [e.g., Eq. (6), (7), (8), or (9)]. Apply constraints, if any, based on a priori information available.
(4) Perform Steps 2 and 3 for all rays available.
(5) Steps 2 to 4 constitute one cycle or iteration (over all available projections or ray sums). Repeat Steps 2 to 4 as many times as required. If desired, compute a measure of convergence, such as
Stop if the error is less than a prespecified limit; else, go back to Step 2.
For improved convergence, a simultaneous correction procedure [the simultaneous iterative reconstruction technique—SIRT (38)] has been proposed, where the corrections to all pixels from all the rays are first computed, and the averaged corrections are applied at the same time to all the pixels (i.e., only one correction is applied per pixel per iteration). Guan and Gordon (35) proposed different projection access schemes to improve convergence, including consecutive use of projections in mutually orthogonal directions.

Imaging with Diffracting Sources. In some applications of CT imaging, such as imaging pregnant women, X-ray imaging might not be advisable. Imaging with nonionizing forms of radiation, such as acoustic (ultrasonic) (12, 32) and electromagnetic (optical or thermal) imaging (34), is then a valuable alternative. X-ray imaging is also not suitable when the object to be imaged has no (or poor) contrast in density or atomic number distribution. An important point to note in acoustic or electromagnetic imaging is that these forms of energy do not propagate along straight-line ray paths through a body, due to refraction and diffraction. When the dimensions of inhomogeneities in the object being imaged are comparable to or smaller than the wavelength of the radiation used, geometric propagation concepts cannot be applied; it becomes necessary to consider wave propagation and diffraction-based methods. When the body being imaged may be treated as a weakly scattering object in the 2-D sectional plane and invariant in the axial direction, the Fourier diffraction theorem is applicable (12). This theorem states that the FT of a projection including the effects of diffraction gives values of the 2-D FT of the image along a semicircular arc. Interpolation methods may be developed in the Fourier space, taking this property into account, for reconstruction of images from projections obtained with diffracting sources. Backpropagation and algebraic techniques have also been proposed for the case of imaging with diffracting sources (12). Detailed discussion of these methods is beyond the scope of the present article.

Nature of Data Acquired in Magnetic Resonance Imaging. MR imaging is based on the principle of nuclear magnetic resonance (NMR)—the behavior of nuclei under the influence of externally applied magnetic and electromagnetic (radio-frequency or RF) fields (28, 33, 39). A nucleus with an odd number of protons or an odd number of neutrons has an inherent nuclear spin and exhibits a magnetic moment; such a nucleus is said to be NMR active. The commonly used modes of MR imaging rely on hydrogen (¹H, proton), carbon (¹³C), fluorine (¹⁹F), and phosphorus (³¹P) nuclei. In the absence of an external magnetic field, the vectors of the magnetic moments of active nuclei have random orientations, resulting in no net magnetism. When a strong external magnetic field H_0 is applied, the nuclear spins of active nuclei align with the field (either parallel or antiparallel to the field). The axis of the magnetic field is referred to as the z axis. Parallel alignment corresponds to a lower energy state than antiparallel alignment, and hence there will be more nuclei in the former state. This state of forced alignment results in a net magnetization vector.
The magnetic spin vector of each active nucleus precesses about the z axis at a frequency known as the Larmor frequency, given by ω_0 = −γH_0, where γ is the gyromagnetic ratio of the nucleus considered. This form of precession is comparable to the rotation of a spinning top’s axis around the vertical. MR imaging involves controlled perturbation of the precession of nuclear spins and measurement of the RF signals emitted when the perturbation is stopped and the nuclei return to their previous states. MR imaging is an intrinsically 3-D imaging procedure. Traditional CT scanners require mechanical scanning and provide 2-D cross-sectional or transversal images in a slice-by-slice manner. Slices at other orientations, if required, have to be computed from a set of 2-D slices covering the required volume. In MR imaging, however, images may be obtained directly at any transversal, coronal, sagittal, or oblique section by using appropriate gradients and pulse sequences. Furthermore, no mechanical scanning is involved: slice selection and scanning are performed electronically by the use of magnetic field gradients and RF pulses. The main components and principles of MR imaging are (33):
• A magnet that provides a strong, uniform field of the order of 0.5 to 4 T. This causes all active nuclei to align in the direction of the field (parallel or antiparallel) and precess about the axis of the field. The rate of precession is proportional to the strength of the magnetic field H_0. The stronger the magnetic field, the higher will be the signal-to-noise ratio of the data acquired.
• An RF transmitter to deliver an RF electromagnetic pulse H_1 to the body being imaged. The RF pulse provides the perturbation: it causes the axis of precession of the spin vectors to deviate or “flip” from the z axis. In order for this to happen, the frequency of the RF field must be the same as that of precession of the active nuclei, such that the nuclei can absorb energy from the RF field (hence the term “resonance”). The frequency of RF-induced rotation is given by ω_1 = −γH_1. When the RF perturbation is removed, the active nuclei return to their unperturbed states (alignment with H_0) through various relaxation processes, emitting energy as RF signals.
• A gradient system to apply to the body a controlled, spatially varying and time-varying magnetic field h(x) = G · x, where x is a vector representing the spatial coordinates and G is the gradient applied. The components of G along the z direction as well as in the x and y directions (the plane orthogonal to the z axis) are controlled individually. The gradient causes nuclei at different positions to precess at different frequencies and provides for spatial coding of the signal emitted from the body. The Larmor frequency at x is given by ω(x) = −γ(H_0 + G · x). Nuclei at specific positions or planes in the body may be excited selectively by applying RF pulses of specific frequencies. The combination of the gradient fields and the RF pulses applied is called the pulse sequence.
• An RF detector system to detect the RF signals emitted from the body. The RF signal measured outside the body represents the sum of the RF signals emitted by active nuclei from a certain part or slice of the body, as determined by the pulse sequence. The spectral spread of the RF signal provides information on the location of the corresponding source nuclei.
• A computing and imaging system to reconstruct images from the measured data, as well as process and display the images. Depending on the pulse sequence applied, the RF signal sensed may be formulated as the 2-D or 3-D FT of the image to be reconstructed (28, 33, 39). The data measured correspond to samples of the 2-D FT of a sectional image at points on concentric squares or circles (28). The Fourier or backprojection methods described in the preceding sections may then be used to obtain the image.
The image obtained is a function of the nuclear spin density in space and the corresponding parameters of the relaxation processes involved.
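As a small numerical check of the Larmor relation quoted above, the proton gyromagnetic ratio (γ/2π ≈ 42.58 MHz/T, a standard value not given in the article) predicts the 100 MHz operating frequency of the 2.35 T scanner mentioned in the caption of Fig. 26.

```python
# Larmor frequency of hydrogen (1H) nuclei: f0 = (gamma / 2*pi) * H0.
GAMMA_OVER_2PI_MHZ_PER_TESLA = 42.58     # proton gyromagnetic ratio / 2*pi, in MHz/T (standard value)

def larmor_frequency_mhz(field_tesla):
    return GAMMA_OVER_2PI_MHZ_PER_TESLA * field_tesla

print(larmor_frequency_mhz(2.35))        # ~100 MHz, matching the scanner of Fig. 26
print(larmor_frequency_mhz(1.5))         # ~64 MHz for a typical 1.5 T clinical magnet
```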
Fig. 9. A synthetic 2-D image (phantom) with 101 × 101 8-bit pixels, representing a cross section of a 3-D object.
Examples of Reconstructed Images

Figure 9 shows a synthetic 2-D image (phantom), which we will consider to represent a cross section of a 3-D object. The objects in the image were defined on a discrete grid and hence have step and/or jagged edges. Figure 10(a) is a plot of the projection of the phantom image computed at 90◦; note that the values are all positive. Figure 10(b) is a plot of the filtered projection using only the ramp filter (|w|) required in the FBP algorithm; note that the filtered projection has negative values. Figure 11 shows the reconstruction of the phantom obtained using 90 projections from 2◦ to 180◦ in steps of 2◦ with the simple BP algorithm. While the objects in the image are visible, the smearing effect of the BP algorithm is obvious. Figure 12 shows the reconstruction of the phantom obtained using 90 projections with the FBP algorithm; only the ramp filter essential for the FBP process was used, with no other smoothing or lowpass filter function. The contrast and visibility of the objects are better than those in the case of the simple BP result; however, the image is noisy due to the increasing gain of the ramp filter at higher frequencies. The reconstructed image also exhibits artifacts related to computation of the projections on a discrete grid; please refer to Refs. 18, 36, and 12 for discussions on this topic. The use of additional filters could reduce the noise and artifacts: Fig. 13 shows the result of reconstruction with the FBP algorithm including a fourth-order Butterworth filter with the 3 dB cutoff at 0.2 times the sampling frequency. The Butterworth filter has suppressed the noise and artifacts at the expense of blurring the edges of the objects in the image. Figure 14 shows the reconstruction of the phantom obtained using 90 projections with three iterations of ART, as in Eq. (8). Projections for use with ART were computed from the phantom image data using an angle-dependent ray width, given by max(|sin θ|, |cos θ|) (19, 40). The quality of the reconstruction is better than that given by the BP or FBP algorithms. The Radon transform may be interpreted as a transformation of the given image from the (x, y) space to the (t, θ) space. In practical CT scanning, the projection or ray integral data are obtained as samples at discrete intervals in t and θ. Just as we encounter the (Nyquist or Shannon) sampling theorem in the representation of a 1-D signal in terms of its samples in time, we now encounter the requirement to sample adequately along both the t and θ axes. A major distinction lies in the fact that the measurements made are discrete to begin with, and the signal (the body or object being imaged) cannot be prefiltered to prevent aliasing. Undersampling in either axis will lead to aliasing errors and poor reconstructed images. Figures 15, 16, and 17 show reconstructed
Fig. 10. (a) Projection of the phantom image in Fig. 9 computed at 90◦ . (b) Filtered version of the projection using only the ramp filter required in the FBP algorithm.
Fig. 11. Reconstruction of the phantom in Fig. 9 obtained using 90 projections from 2◦ to 180◦ in steps of 2◦ with the simple BP algorithm.
images obtained using only 10 projections spanning the 0◦ to 180◦ range in sampling steps of 18◦ and using the BP, FBP, and ART algorithms, respectively. The advantage of ART due to the use of the positivity constraint (a priori knowledge imposed) and the ability to iterate is seen in the improved quality of the result.
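Experiments of the kind shown in Figs. 11 through 17 can be approximated with off-the-shelf tools. The sketch below assumes scikit-image is installed and uses its radon/iradon routines (iradon implements FBP; in releases before 0.19 the filter_name argument was called filter), with the standard Shepp–Logan phantom standing in for the phantom of Fig. 9.

```python
import numpy as np
from skimage.data import shepp_logan_phantom
from skimage.transform import radon, iradon, resize

phantom = resize(shepp_logan_phantom(), (101, 101))          # a standard test phantom

# 90 views over 0-180 degrees (cf. Figs. 11-14) versus only 10 views (cf. Figs. 15-17).
theta_90 = np.linspace(0.0, 180.0, 90, endpoint=False)
theta_10 = np.linspace(0.0, 180.0, 10, endpoint=False)

sino_90 = radon(phantom, theta=theta_90)
sino_10 = radon(phantom, theta=theta_10)

recon_90 = iradon(sino_90, theta=theta_90, filter_name="ramp")      # FBP, ramp filter only
recon_10 = iradon(sino_10, theta=theta_10, filter_name="ramp")      # few views: streaking expected
recon_win = iradon(sino_90, theta=theta_90, filter_name="hamming")  # ramp filter plus a smoothing window

for name, rec in [("90 views", recon_90), ("10 views", recon_10), ("90 views, Hamming", recon_win)]:
    rms = float(np.sqrt(np.mean((rec - phantom) ** 2)))
    print(name, "RMS error:", round(rms, 4))
```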
Fig. 12. Reconstruction of the phantom in Fig. 9 obtained using 90 projections from 2◦ to 180◦ in steps of 2◦ with the FBP algorithm; only the ramp filter essential for the FBP process was used.
Fig. 13. Reconstruction of the phantom in Fig. 9 obtained using 90 projections from 2◦ to 180◦ in steps of 2◦ with the FBP algorithm; the ramp filter essential for the FBP process was used along with a Butterworth lowpass filter.
Figures 18, 19, and 20 show reconstructed images obtained using 10 projections but spanning only the angular range of 40◦ to 130◦ in steps of 10◦. The limited angular coverage provided by the projections has clearly affected the quality of the image. Again, ART has provided the best possible reconstruction among the algorithms used in the study. The use of limited projection data in reconstruction results in geometric distortion and streaking artifacts. The distortion may be modeled by the PSF of the reconstruction process (if it is linear and shift invariant, as BP, FBP, and unconstrained ART are). The PSFs of the simple BP method are shown as images in Fig. 21, for the case with 10 projections over 180◦, and in Fig. 22, for the case with 10 projections from 40◦ to 130◦. The reconstructed image is given by the convolution of the original (unknown) image with the PSF. Limited
Fig. 14. Reconstruction of the phantom in Fig. 9 obtained using 90 projections from 2◦ to 180◦ in steps of 2◦ with three iterations of constrained additive ART.
Fig. 15. Reconstruction of the phantom in Fig. 9 obtained using 10 projections from 18◦ to 180◦ in steps of 18◦ with the simple BP algorithm.
improvement in image quality may be obtained by applying deconvolution filters to the reconstructed image (41–48).
Display of CT Images

X-ray CT is capable of producing images with very good density resolution, on the order of 1 part in 1000. For display purposes, the attenuation coefficients are normalized with respect to that of water and expressed as H = 1000(µ/µ_w − 1) Hounsfield units, where µ is the measured attenuation coefficient and µ_w is the attenuation coefficient of water. The CT number is expressed in Hounsfield units, named after the inventor of the first
Fig. 16. Reconstruction of the phantom in Fig. 9 obtained using 10 projections from 18◦ to 180◦ in steps of 18◦ with the FBP algorithm; the ramp filter essential for the FBP process was used along with a Butterworth lowpass filter.
Fig. 17. Reconstruction of the phantom in Fig. 9 obtained using 10 projections from 18◦ to 180◦ in steps of 18◦ with three iterations of constrained additive ART.
commercial medical CT scanner (10). This scale results in values of about +1000 for bone, 0 for water, about −1000 for air, 20 to 80 for soft tissue, and about −800 for lung tissue (22). The dynamic range of CT values is much wider than those of common display devices and the human visual system at a given level of adaptation. Furthermore, detailed diagnosis requires visualization of small density differences. For these reasons, presentation of the entire range of values available in a CT image in a single display is neither practically feasible nor desirable. In practice, small “windows” of the CT number scale are chosen and linearly expanded to occupy the capacity of the display device. The window width and level (center) values may be chosen interactively to display different density ranges with improved perceptibility of details within the chosen density window. (Values above or below the window limits are displayed as totally
Fig. 18. Reconstruction of the phantom in Fig. 9 obtained using 10 projections from 40◦ to 130◦ in steps of 10◦ with the simple BP algorithm.
Fig. 19. Reconstruction of the phantom in Fig. 9 obtained using 10 projections from 40◦ to 130◦ in steps of 10◦ with the FBP algorithm; the ramp filter essential for the FBP process was used along with a Butterworth low-pass filter.
white or black, respectively.) This technique, known as windowing or density slicing, may be expressed as

\[ g(x, y) = \begin{cases} n, & f(x, y) < m \\ \dfrac{f(x, y) - m}{M - m}\,(N - n) + n, & m \le f(x, y) \le M \\ N, & f(x, y) > M \end{cases} \]
where f(x, y) is the original image in CT numbers, g(x, y) is the windowed image to be displayed, (m, M) is the range of CT values in the window to be displayed, and (n, N) is the range of the display values. The window width is (M − m) and the window level is (M + m)/2; the display range is typically n = 0 and N = 255 with 8-bit display systems. Figure 23 shows a set of two CT images of a patient with a head injury, with each image displayed with two sets of window level and width. The effects of the density window chosen on the features
Fig. 20. Reconstruction of the phantom in Fig. 9 obtained using 10 projections from 40◦ to 130◦ in steps of 10◦ with three iterations of constrained additive ART.
Fig. 21. Point spread function of the simple BP procedure using 10 projections spread evenly over 180◦.
of the image displayed are clearly seen in the figure: either the fractured bone or the brain matter is seen in detail in a given windowed image, but not both in the same image. A dramatic visualization of details may be achieved by pseudocolor techniques. Arbitrary or structured color scales could be assigned to CT values by look-up tables or gray-scale-to-color transformations. Some of the popular color transforms are the rainbow (VIBGYOR) and the heated metal color (red–yellow–white) sequences. Difficulties may arise, however, in associating density values with different colors if the transformation is arbitrary and not monotonic in intensity or total brightness. A look-up table linking the displayed colors to CT numbers or other pixel attributes may assist in improved visual analysis of image features in engineering and scientific applications.
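A minimal sketch of the CT-number and window/level computations described above follows; the window settings used in the example are illustrative choices, not values taken from the article.

```python
import numpy as np

def to_hounsfield(mu, mu_water):
    """CT number in Hounsfield units: H = 1000 * (mu / mu_water - 1)."""
    return 1000.0 * (np.asarray(mu, dtype=float) / mu_water - 1.0)

def window_image(f, level, width, n=0, N=255):
    """Density windowing (window/level display mapping) of a CT-number image f:
    values below the window map to n, values above it map to N, and values inside
    the window are stretched linearly onto [n, N]."""
    m, M = level - width / 2.0, level + width / 2.0
    g = (np.asarray(f, dtype=float) - m) / (M - m) * (N - n) + n
    return np.clip(g, n, N)

# Illustrative CT numbers (Hounsfield units) for air, lung, water, soft tissue, and bone.
f = np.array([-1000.0, -800.0, 0.0, 50.0, 1000.0])
print(window_image(f, level=40, width=400))      # a soft-tissue-type window (illustrative settings)
print(window_image(f, level=300, width=2000))    # a wide, bone-oriented window (illustrative settings)
```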
Fig. 22. Point spread function of the simple BP procedure using 10 projections from 40◦ to 130◦ in steps of 10◦.
Fig. 23. A set of two CT images of a patient with head injury, with each image displayed with two sets of window level and width. One of the windows displays the skull, the fracture, and the bone segments, but the brain matter is not visible; the other window displays the brain matter in detail, but the fracture area is saturated. (Courtesy of Dr. W. Gordon, Health Sciences Centre, Winnipeg, MB, Canada.)
Industrial Applications

The phenomenal success of CT in diagnostic medicine has attracted interest from many other disciplines in recent years (49–57). The petroleum industry has demonstrated significant interest, with the objective of determining physical properties of porous rocks and fluid saturations during multiphase flow phenomena associated with petroleum and natural gas production (49, 50, 52). Most major oil companies possess some form of an in-house CT system, although a considerable number of exploratory projects have been performed in hospitals after patient-imaging hours. Recently, the term “process tomography” has become identified with the development and use of tomographic imaging techniques for industrial applications (51, 53). The
Fig. 24. X-ray CT images of multiphase flow in porous media.
possibility of identifying elements of different density and/or different atomic number within a study sample gives a wide potential for industrial applications, some of which are described in the following paragraphs. The characterization of cores from oil and gas reservoirs using CT is based on the following principles. The normalized X-ray attenuation numbers of a CT image are proportional to the bulk density of the sample under study for the energies at which commercial medical X-ray CT scanners operate (49, 50). The atomic number effect in samples consisting of elements of atomic number up to 20 is negligible. A linear relationship between CT number and bulk density can be established through a simple calibration process. In reservoir rock characterization studies, density variations can be translated into porosity maps and specific feature maps (such as vugs and fractures). Density contrast can also be used to identify the distribution of up to two fluids within reservoir rocks, such as oil/water, oil/gas, or water/gas (50, 52). However, the density contrast is often not enough to show quantitative fluid saturation distribution. Contrast enhancement may be achieved by “doping” one of the fluid phases with a high-contrast agent (such as iodide salts). Several attempts have been made to quantify three-phase saturation distribution through selected doping agents and dual-energy scanning, but with limited success (58). Figure 24 shows a typical example of the application of X-ray CT in reservoir rock characterization. The first image shows a typical cross section of a relatively uniform sand pack that contains water. The second image shows the same cross section after a polymer solution was injected into the sand pack; the image clearly identifies the area in which the solution has displaced oil and water. Cruvinel et al. (59) developed a portable X-ray and gamma-ray minitomograph for application in soil science; in particular, they used the scanner to measure the water content and bulk density of soil samples. Soil-related studies may address identification of features (such as fractures, wormholes, and roots) and the flow of various contaminants in soil. Some exotic applications have also appeared in the literature, such as the use of CT to determine the quality of turf in golf courses.
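The linear CT-number-to-bulk-density calibration mentioned above amounts to an ordinary least-squares line fit. In the sketch below the calibration points are hypothetical and serve only to show the procedure.

```python
import numpy as np

# Hypothetical calibration scans: measured mean CT number (HU) of reference samples
# of known bulk density (g/cm^3).  Both columns are invented for illustration.
ct_numbers = np.array([-1000.0, 0.0, 1000.0, 1700.0])
densities = np.array([0.0012, 1.00, 2.20, 2.70])

# Fit density = a * CT + b, then apply the calibration to a reconstructed core image.
a, b = np.polyfit(ct_numbers, densities, deg=1)

def ct_to_density(ct_image):
    return a * np.asarray(ct_image, dtype=float) + b

print("slope a =", round(float(a), 6), "intercept b =", round(float(b), 4))
print(ct_to_density([-500.0, 500.0]))   # predicted densities for two hypothetical voxel values
```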
Recently, numerous applications of CT have appeared in projects related to monitoring of multiphase flow phenomena in pipes and in models of chemical reactors (53, 54). In these cases, the objective is to determine flow phenomena associated with modeling of specific chemical reactor problems. Pipe-flow studies were published first, and fluidized-bed problems and trickle-bed problems began appearing in the literature in the 1990s. In many of the chemical reactor problems studied via CT, new custom-made CT systems have been described in the literature, such as gamma-ray CT devices and a variety of electrical CT systems that utilize capacitance, resistivity, or impedance variability in the systems under study (51, 53). Forestry applications of CT have appeared in the literature in the form of scanning of live trees to measure growth rings and detect decay using a portable X-ray CT scanner (60), and monitoring tree trunks or logs in the timber industry. Applications of CT for characterization of other materials have had limited success. Studies on polymer characterization have shown that CT can be used to study structural defects and other heterogeneities in polymer objects (61). Algorithms for material characterization based on atomic number differences have also been presented in the literature; multiple-energy imaging has been used for such studies. Imaging of steel objects has been performed with high-energy gamma-ray scanners for studies on structural failure and corrosion (62). CT scanners have been custom-built for inspection of turbine engines and rocket boosters. The resolution of CT devices used in industrial applications can vary significantly, reaching down to 200 µm by 200 µm pixels in cross section. Special systems have been built to image small samples (less than 1 cm³ in volume) with resolutions of about 5 µm by 5 µm in cross section (63, 64). Such an imaging procedure is called microtomography, being a hybrid of tomography and microscopy. Most microtomography studies are performed with finely focused radiation beams produced by a particle accelerator. Porous media studies (63) and polymer characterization studies (64) have been reported using microtomography. Application of CT imaging in geotechnical engineering is promising. Published examples of geotechnical CT imaging include pipeline–soil interactions, modeling of tectonic plate movements, and determining stress–strain relationships and stress effects in various media (65). Industrial applications of SPECT have been limited in scope. The major application of SPECT in chemical engineering is for real-time tracer work and radioactive particle tracking. Radio-pharmaceuticals are used to tag either a fluid phase or a solid particle and to follow the trajectories in real time. Applications include flow in porous media and fixed-bed reactors. Radioactive particle tracking has also been used to monitor solid flow in fluidized beds and solid handling systems (66). Figure 25 shows an example of a solid circulation map in a laboratory fluidized-bed column. The map is depicted as a probability density function of a given particle being at a location in the column. Maps such as the one in Fig. 25 can be used for modeling flow phenomena in fluidized-bed reactors. Figure 26 shows two sets of MR images of samples of a sandstone reservoir (67). Longitudinal and transversal images are shown for each sample. The core at the top is a layered sandstone sample: distinct bedding planes are visible.
The sample at the bottom exhibits large pores as bright spots.
Concluding Remarks

The 1980s and 1990s brought about many new developments in CT imaging. Continuing development of versatile imaging equipment and image processing algorithms has been opening up newer applications of CT imaging. Three-dimensional imaging of moving organs such as the heart is now feasible. Three-dimensional display systems and algorithms have been developed to provide new and intriguing displays of the interior of the human body. Three-dimensional images obtained by CT are being used in surgery and radiation therapy, creating the new fields of image-guided surgery and treatment. Practical realization of portable scanners has also made possible field applications in agricultural sciences and other areas. CT is a truly revolutionary investigative imaging technique—a remarkable synthesis of many scientific principles.
Fig. 25. Probability density function of the position of a solid particle in a fluidized bed, obtained by SPECT imaging.
Fig. 26. MR images of reservoir cores obtained by using a Bruker Biospec MR scanner operating at 2.35 T and 100 MHz. Top: layered sandstone; bottom: sandstone with fairly large pores. Bright areas correspond to high proton concentration.
Acknowledgments

RMR thanks Dr. Richard Gordon, University of Manitoba, Winnipeg, Canada, for introducing him to the fascinating subject of computed tomography and for reviewing this article. We are grateful to the various sources cited for permitting use of their material, in particular, Dr. Avinash C. Kak, Purdue University, West
Lafayette, IN. We thank Sridhar Krishnan for help in preparing this article. We also thank Dr. Michael R. Smith of the University of Calgary and Dr. Leszek Hahn of the Foothills Hospital, Calgary, for assistance in preparing parts of the article. This work has been supported by grants from the Natural Sciences and Engineering Research Council of Canada.
BIBLIOGRAPHY
1. J. Radon, Über die Bestimmung von Funktionen durch ihre Integralwerte längs gewisser Mannigfaltigkeiten, Math. Phys. Kl., 69: 262, 1917.
2. A. M. Cormack, Representation of a function by its line integrals, with some radiological applications, J. Appl. Phys., 34: 2722–2727, 1963.
3. A. M. Cormack, Representation of a function by its line integrals, with some radiological applications II, J. Appl. Phys., 35: 2908–2913, 1964.
4. R. N. Bracewell and A. C. Riddle, Inversion of fan-beam scans in radio astronomy, Astrophys. J., 150: 427–434, 1967.
5. R. A. Crowther, et al. Three dimensional reconstructions of spherical viruses by Fourier synthesis from electron micrographs, Nature, 226: 421–425, 1970.
6. D. J. De Rosier and A. Klug, Reconstruction of three dimensional images from electron micrographs, Nature, 217: 130–134, 1968.
7. G. N. Ramachandran and A. V. Lakshminarayanan, Three-dimensional reconstruction from radiographs and electron micrographs: Application of convolutions instead of Fourier transforms, Proc. Natl. Acad. Sci. USA, 68: 2236–2240, 1971.
8. R. Gordon, R. Bender, and G. T. Herman, Algebraic reconstruction techniques (ART) for three-dimensional electron microscopy, J. Theor. Biol., 29: 471–481, 1970.
9. W. H. Oldendorf, Isolated flying spot detection of radio-density discontinuities displaying the internal structural pattern of a complex object, IRE Trans. Bio-Med. Elect., BME-8: 68, 1961.
10. G. N. Hounsfield, Computerized transverse axial scanning (tomography): Part 1. Description of system, Br. J. Radiol., 46: 1016–1022, 1973.
11. J. Ambrose, Computerized transverse axial scanning (tomography): Part 2. Clinical application, Br. J. Radiol., 46: 1023–1047, 1973.
12. A. C. Kak and M. Slaney, Principles of Computerized Tomographic Imaging, Piscataway, NJ: IEEE Press, 1988.
13. A. Rosenfeld and A. C. Kak, Digital Picture Processing, 2nd ed., New York: Academic Press, 1982.
14. A. K. Louis and F. Natterer, Mathematical problems of computerized tomography, Proc. IEEE, 71: 379–389, 1983.
15. R. M. Lewitt, Reconstruction algorithms: Transform methods, Proc. IEEE, 71: 390–408, 1983.
16. Y. Censor, Finite series-expansion reconstruction methods, Proc. IEEE, 71: 409–419, 1983.
17. G. T. Herman, Image Reconstruction from Projections: Implementation and Applications, Berlin: Springer-Verlag, 1979.
18. G. T. Herman, Image Reconstruction from Projections: The Fundamentals of Computed Tomography, New York: Academic Press, 1980.
19. R. Gordon, A tutorial on ART (algebraic reconstruction techniques), IEEE Trans. Nucl. Sci., 21: 78–93, 1974.
20. R. Gordon and G. T. Herman, Three-dimensional reconstruction from projections: A review of algorithms, Int. Rev. Cytol., 38: 111–151, 1974.
21. P. Edholm, The tomogram—its formation and content, Acta Radiol., Suppl. No. 193: 1–109, 1960.
22. R. A. Robb, X-ray computed tomography: An engineering synthesis of multiscientific principles, CRC Crit. Rev. Biomed. Eng., 7: 264–333, 1982.
23. R. M. Rangayyan, Computed tomography techniques and algorithms: A tutorial, Innovation Technol. Biol. Med., 7 (6): 745–762, 1986.
24. D. P. Boyd, et al. Proposed dynamic cardiac 3-D densitometer for early detection and evaluation of heart disease, IEEE Trans. Nucl. Sci., NS-26: 2724–2727, 1979.
25. D. P. Boyd and M. J. Lipton, Cardiac computed tomography, Proc. IEEE, 71: 298–307, 1983.
26. R. A. Robb, et al. High-speed three-dimensional x-ray computed tomography: The Dynamic Spatial Reconstructor, Proc. IEEE, 71: 308–319, 1983.
27. S. A. Larsson, Gamma camera emission tomography, Acta Radiol. Suppl., 363, 1980.
28. Z. H. Cho, J. P. Jones, and M. Singh, Foundations of Medical Imaging, New York: Wiley, 1993.
29. A. Macovski, Medical Imaging Systems, Englewood Cliffs, NJ: Prentice-Hall, 1983.
30. G. F. Knoll, Single-photon emission computed tomography, Proc. IEEE, 71: 320–329, 1983.
31. A. Macovski, Physical problems of computerized tomography, Proc. IEEE, 71: 373–378, 1983.
32. J. F. Greenleaf, Computerized tomography with ultrasound, Proc. IEEE, 71: 330–337, 1983.
33. W. S. Hinshaw and A. H. Lent, An introduction to NMR imaging: From the Bloch equation to the imaging equation, Proc. IEEE, 71: 338–350, 1983.
34. G. Müller et al. (eds.), Medical Optical Tomography: Functional Imaging and Monitoring, Bellingham, WA: SPIE Optical Engineering Press, 1993.
35. H. Guan and R. Gordon, Computed tomography using ART with different projection access schemes: A comparison study under practical situations, Phys. Med. Biol., 41: 1727–1743, 1996.
36. G. T. Herman, A. Lent, and S. W. Rowland, ART: Mathematics and applications—a report on the mathematical foundations and on the applicability to real data of the algebraic reconstruction techniques, J. Theor. Biol., 42: 1–32, 1973.
37. A. Lent, A convergent algorithm for maximum entropy image restoration, with a medical x-ray application, in R. Shaw (ed.), Image Analysis and Evaluation, Washington, DC: Society of Photographic Scientists and Engineers, 1977, pp. 249–257.
38. P. Gilbert, Iterative methods for the three-dimensional reconstruction of an object from projections, J. Theor. Biol., 36: 105–117, 1972.
39. P. C. Lauterbur and C. M. Lai, Zeugmatography by reconstruction from projections, IEEE Trans. Nucl. Sci., 27: 1227–1231, 1980.
40. R. M. Rangayyan and R. Gordon, Streak preventive image reconstruction via ART and adaptive filtering, IEEE Trans. Med. Imaging, 1: 173–178, 1982.
41. R. Gordon and R. M. Rangayyan, Geometric deconvolution: A meta-algorithm for limited view computed tomography, IEEE Trans. Biomed. Eng., 30: 806–810, 1983.
42. R. Gordon, A. P. Dhawan, and R. M. Rangayyan, Reply to comments on geometric deconvolution: A meta-algorithm for limited view computed tomography, IEEE Trans. Biomed. Eng., 32: 242–244, 1985.
43. P. J. Soble, R. M. Rangayyan, and R. Gordon, Quantitative and qualitative evaluation of geometric deconvolution of distortion in limited-view computed tomography, IEEE Trans. Biomed. Eng., 32: 330–335, 1985.
44. R. M. Rangayyan, R. Gordon, and A. P. Dhawan, Algorithms for limited-view computed tomography: An annotated bibliography and a challenge, Appl. Opt., 24 (23): 4000–4012, 1985.
45. A. P. Dhawan, R. M. Rangayyan, and R. Gordon, Image restoration by Wiener deconvolution in limited-view computed tomography, Appl. Opt., 24 (23): 4013–4020, 1985.
46. D. Boulfelfel, et al. Three-dimensional restoration of single photon emission computed tomography images, IEEE Trans. Nucl. Sci., 41: 1746–1754, 1994.
47. D. Boulfelfel, et al. Restoration of single photon emission computed tomography images by the Kalman filter, IEEE Trans. Med. Imaging, 13: 102–109, 1994.
48. D. Boulfelfel, et al. Pre-reconstruction restoration of single photon emission computed tomography images, IEEE Trans. Med. Imaging, 11: 336–341, 1992.
49. A. Kantzas, Investigation of physical properties of porous rocks and fluid flow phenomena in porous media using computer assisted tomography, In Situ, 14 (1): 77–132, 1990.
50. A. Kantzas, D. F. Marentette, and K. N. Jha, Computer assisted tomography: From qualitative visualization to quantitative core analysis, J. Can. Petrol. Tech., 31 (9): 48–56, 1992.
51. R. A. Williams and M. S. Beck, Process Tomography: Principles, Techniques and Applications, London: Butterworth Heinemann, 1995.
52. S. L. Wellington and H. J. Vinegar, X-ray computerized tomography, J. Petrol. Technol., 39 (8): 885–898, 1987.
53. D. M. Scott and R. A. Williams, Frontiers in Industrial Process Tomography, Proc. Eng. Found. Conf., The Cliffs Shell Beach, CA, 1995.
54. J. Chaouki, F. Larachi, and M. P. Duduković (eds.), Non-invasive Monitoring of Multiphase Flows, Amsterdam: Elsevier, 1997.
55. J. L. Ackerman and W. A. Ellingson (eds.), Advanced Tomographic Imaging Methods for the Analysis of Materials, Pittsburgh, PA: Materials Research Society, 1991.
34
IMAGE RECONSTRUCTION
56. Technical Digest of the Topical Meeting on Industrial Applications of Computed Tomography and NMR Imaging, Optical Society of America, Hecla Island, Manitoba, Canada, 1984. 57. Special Issue. Collection of papers from the Optical Society of America Topical Meeting on Industrial Applications of Computed Tomography and NMR Imaging, Appl. Opt., 24 (23): 3948–4133, 1985. 58. A. Kantzas, Applications of computer assisted tomography in the quantitative characterization of porous rocks, Int. Symp. Soc. Core Analysts—CT Workshop, Stavanger, Norway, Sept. 1994. 59. P. E. Cruvinel, et al. X- and gamma-rays computerized minitomograph scanner for soil science, IEEE Trans. Instrum. Meas., 39: 745–750, 1990. 60. A. M. Onoe, et al. Computed tomography for measuring annual rings of a live tree, Proc. IEEE, 71: 907–908, 1983. 61. S. Persson E. Ostman, Use of computed tomography in non-destructive testing of polymeric materials, Appl. Opt., 24 (23): 4095, 1985. 62. B. D. Sawicka R. L. Tapping, CAT scanning of hydrogen induced cracks in steel, Nucl. Instrum. Methods Phys. Res. A, 256: 103, 1987. 63. M. E. Coles, et al. Developments in synchrotron x-ray microtomography with applications to flow in porous media, SPE Reservoir Eval. Eng., 1 (4): 288–296, 1998. 64. W. C. Conner, et al. Use of x-ray microscopy and synchrotron microtomography to characterize polyethylene polymerization particles, Macromolecules, 23 (22): 4742–4747, 1990. 65. A. Kantzas, Stress–strain characterization of sand packs under uniform loads as determined from computer-assisted tomography, J. Can. Petrol. Technol., 36 (6): 48–52, 1997. 66. A. Kantzas, et al. Application of gamma camera imaging and SPECT systems in chemical processes, First World Congress on Industrial Process Tomography, Buxton, Greater Manchester, UK, Apr. 1999. 67. K. Mirotchnik, et al. Determination of mud invasion characteristics of sandstone reservoirs using a combination of advanced core analysis techniques, Proc. 1998 Int. Symp. Soc. Core Analysts, The Hague, The Netherlands, Sept. 1998, p. SCA9815.
RANGARAJ M. RANGAYYAN APOSTOLOS KANTZAS University of Calgary
Wiley Encyclopedia of Electrical and Electronics Engineering
Image Texture
Standard Article
B. S. Manjunath, University of California, Santa Barbara, CA; R. Chellappa, University of Maryland, College Park, MD
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4109
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Texture Classification and Segmentation; Gray Level Image Statistics and Co-Occurrence Matrices; Random Field Models; Spatial Filtering; Rotation Invariant Texture Classification; Applications; Summary; Acknowledgments.
J. Webster (ed.), Wiley Encyclopedia of Electrical and Electronics Engineering. Copyright © 1999 John Wiley & Sons, Inc.
IMAGE TEXTURE An image texture may be defined as a collection of elements or patterns wherein the individual elements themselves may or may not have a well-defined structure. Textures associated with most of the man-made objects have some regularity associated with them, in addition to a well-defined pattern structure. Examples of such textures include pictures of barbed wire, a brick wall, or a marble floor with periodically repeating tiles. On the other hand, a coastal line in an aerial photograph showing sand and water does not have any structure. Texture appears in natural pictures very frequently. The surface of a polished wooden table is a good example. A Landsat satellite picture showing the vegetation in the Amazonian area or the floating ice in the Antarctic are some other examples of such natural textures. A popular data set for researchers in this field is the digitized images from the Brodatz album (1). Some examples of textured images are shown in Fig. 1. Image texture analysis in the past two decades has primarily focused on texture classification, texture segmentation, and texture synthesis. In addition, texture mapping has been studied extensively in computer graphics for generating realistic images for visual simulations, computer animation, and 3-D rendering of elevation maps. In texture classification the objective is to assign a unique label to each homogeneous region. For example, regions in a satellite picture may be classified into ice, water, forest, agricultural areas, and so on. In medical image analysis, texture is used in applications such as segmenting magnetic resonance (MR) images of brain into gray and white matter, or detecting cysts in the X-ray computed tomography (CT) images of the kidneys. While image classification results in segmentation, there is also considerable interest in achieving segmentation without prior knowledge of the textures. Texture is also useful in recovering 3-D shape. A homogeneous 3-D texture under perspective projection assumption will induce distortions in the projected image. Figure 2 illustrates this. Variations in the image, such as the density and shape changes of the texture ‘blobs’, provide information about the 3-D shape of the object (2). Much of the previous work is concerned with synthetic image data where the focus was on shape recovery. However, as mentioned earlier, identifying the basic texture primitive itself is a major research problem. Alternative approaches include using spatial frequency information in recovering the shape (3,4,5). Perhaps one of the few successful applications of textures is in content based image search of large image and video databases (6). Texture information can be used to annotate and search images such as aerial photographs or color pictures of nature. Figures 3 and 4 illustrate some examples taken from an aerial photograph database. In the applications section we discuss more about the construction of a texture dictionary to efficiently navigate large pictorial databases. Combined with color and shape, texture can be used to select a wide variety of patterns even when semantic level information is absent.
Fig. 1. Brick wall, elephant skin, ostrich skin leather, grass from Brodatz album.

Texture Classification and Segmentation Texture classification refers to the problem of assigning a particular class label to a given textured region. If the images are preprocessed to extract homogeneous textured regions, then the pixel data within these regions can be used for estimating the class labels. Here standard pattern classification techniques may be applied
assuming that there is only one texture in the region. Some of the initial work on texture analysis considered using spatial image statistics for classification purposes. These include image correlation (7), energy features and their extensions (8), features from co-occurrence matrices (9), run-length statistics (10), and difference statistics. As texture analysis research matured, two distinct approaches have emerged: in one, researchers seek to understand the process of texture generation and this led to generative texture models which could be used for classification as well as texture creation. This emphasis can be seen in much of the work on random field models for texture representation such as the 2-D nonsymmetric half plane models (11) and noncausal Gauss Markov random field models and their variations (12,13,14,15). A review of some of the recent work can be found in Ref. 16. Once the appropriate model features are computed, the problem of texture classification can be addressed using techniques from traditional pattern classification. Although significant progress has been made using these methods, several problems remain as the methods are sensitive to illumination and resolution changes, and transformations such as rotation. The second emphasis has its roots in modeling human texture perception, particularly that of pre-attentive texture discrimination. Psychologists have studied texture for the purposes of understanding human visual perception for many decades now. Figure 5 shows an example of a texture mosaic where the boundary between the L’s and +’s can be easily identified by humans without requiring detailed analysis of the micropatterns. Pre-attentive texture segmentation refers to this ability of humans to distinguish between textures in an image without any detailed scene analysis. Central to solving this problem are issues related to defining these texture features and their computation. Some of the early work in this field can be attributed to Julesz (17) for his theory of textons as basic textural elements. Spatial filtering approach has been used by many researchers for detecting texture boundaries not clearly explained by the texton theory (18,19). Texture discrimination is generally modeled as a sequence of nonlinear filtering operations without any prior assumptions about the texture generation process. Independent of the human psychovisual considerations, spatial filtering for texture analysis is now a mature area. Some of the recent work involves multiresolution filtering for both classification and segmentation (20,21,22). In the following we briefly describe the current research on both model based and
spatial filtering approaches to texture analysis. Our aim is to expose the reader to the breadth of literature on these topics. Selected references are provided at the end for additional reading.

Fig. 2. Circular blobs mapped onto different 3-D shapes. The perspective distortion of the texture in the images provides depth cues.
Gray Level Image Statistics and Co-Occurrence Matrices A digitized image is an array of intensity values. For a textured image, these intensity values, in general, are distributed in a random manner. However, the statistics underlying these distributions are helpful in characterizing these textures. A picture of a beach has significantly different variations in gray values compared to a cloud image. The gray level co-occurrence matrix characterizes second order gray level relationships. Consider any two pixels separated by (D, θ) in the image. Here D is the distance between the two pixels, the
line connecting the two pixels makes an angle θ with the X-axis. Let one of the pixels have an intensity value I and the other pixel a value J. We can then count all occurrences of pixel pairs in the image which are separated by (D, θ). Let this count be f(I, J). One can thus construct a matrix f(m, n) wherein the elements represent the frequency count for the particular pair of intensities and for a specific distance D. If no directionality distinction is made between the two pixels, that is, f(I, J) = f(J, I), we get a symmetric co-occurrence matrix. A different matrix is constructed for each (D, θ). Further abstraction of information is necessary for computational reasons. Typical features that are computed from f(I, J) include the energy (Σ f(I, J)², summed over I and J) and the entropy (−Σ f(I, J) log f(I, J)). A detailed description of these features can be found in Ref. 23. While the co-occurrence matrix provides an intuitive mechanism to capture spatial relationships, computationally it is expensive. Note that the (D, θ) space must be sampled, and even for a coarse quantization such as 5 distances and 6 orientations, one needs to compute 30 co-occurrence matrices. Since the derived statistics directly depend on the gray values, this measure is sensitive to gray scale distortions. Further, this method only captures intensity relationships in a fine grain texture and may not be well suited for textures whose primitives are spatially large as in, for example, a picture of a brick wall.

Fig. 3. Search and retrieval using texture. “Texture” in aerial photographs can help in search and retrieval of similar image patterns. Identification text superimposed on an airphoto is selected as a texture pattern in this example, and regions containing similar identification marks are retrieved.
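As an illustration of the co-occurrence construction described above, the following sketch builds a symmetric co-occurrence matrix for a single displacement (D, θ) and computes the energy and entropy features. It is a simplified, loop-based illustration (gray levels are assumed to be prequantized to a small number of bins), not the code of any cited reference.

import numpy as np

def cooccurrence_matrix(img, d=1, theta=0.0, levels=16):
    """Symmetric gray-level co-occurrence matrix for one (D, theta) pair.

    img is a 2-D array of integer gray levels already quantized to the
    range 0..levels-1; d is the pixel distance and theta the angle
    (radians) of the displacement vector.
    """
    dr = int(round(-d * np.sin(theta)))   # image rows grow downward
    dc = int(round(d * np.cos(theta)))
    f = np.zeros((levels, levels), dtype=np.float64)
    rows, cols = img.shape
    for r in range(rows):
        for c in range(cols):
            r2, c2 = r + dr, c + dc
            if 0 <= r2 < rows and 0 <= c2 < cols:
                i, j = img[r, c], img[r2, c2]
                f[i, j] += 1
                f[j, i] += 1          # no directionality: f(I,J) = f(J,I)
    f /= f.sum()                       # normalize to relative frequencies
    return f

def glcm_features(f):
    """Energy and entropy of a normalized co-occurrence matrix."""
    energy = np.sum(f ** 2)
    nz = f[f > 0]
    entropy = -np.sum(nz * np.log(nz))
    return energy, entropy

# Example: a quantized 64 x 64 test patch with 16 gray levels.
rng = np.random.default_rng(0)
patch = rng.integers(0, 16, size=(64, 64))
F = cooccurrence_matrix(patch, d=1, theta=0.0)
print(glcm_features(F))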
Random Field Models A typical image is represented over a rectangular array, and the statistical distribution of the pixel intensities can be modeled as a random field. A simple model is to represent a pixel intensity at location s, y(s), as a linear combination of pixel values {y(s + r), r ∈ N} within a small neighborhood N and additive noise (24). The neighborhood N is typically a block of pixels surrounding s, thus leading to noncausal models. The Gaussian Markov random field (GMRF) model is a specific class of such two-dimensional noncausal models that has been quite popular and widely studied within the texture analysis literature. Let Ω denote a set of grid points
on a two-dimensional lattice of size M × M. Then a random process Y(s) is said to be Markov if

Pr[Y(s) | Y(r), r ∈ Ω, r ≠ s] = Pr[Y(s) | Y(s + r), r ∈ N]

Fig. 4. Another example of using texture for pattern retrieval. In this, a texture descriptor from a parking lot is used to search for other parking lots in an aerial image database. The top three matches contain similar patterns. (A): query image; (B), (C), (D): top three retrievals based on texture in (A).
The neighborhood N of s can be arbitrarily defined. However, in most image processing applications it is natural to consider neighbors which are also spatially closer. For the case of GMRF, the neighborhood set of pixels is symmetric. For instance, the four nearest neighbors (North, South, East, and West) of a pixel form the simplest neighborhood set which is referred to as the first order GMRF model. Adding the diagonal neighbors gives the
second order GMRF model and so on. Cross and Jain (25) provide a detailed discussion on the applications of Markov random fields (MRF) in modeling textured images.

Fig. 5. A texture consisting of randomly oriented L and + micropatterns. The line segments are of equal length.

A nontrivial problem with any MRF model is the estimation of model parameters. Usually the structure of the model (such as first or second order GMRF) is assumed to be known, even though estimating the structure itself is a challenging issue. A comparison of different GMRF parameter estimation schemes can be found in Ref. 24. While some of the initial motivation for the use of MRF models was for texture synthesis, they have also been used for texture classification and segmentation purposes. Given an image consisting of an unknown number of texture classes, the problem of segmentation involves both parameter estimation and classification. However, for parameter estimation one needs homogeneous image regions, which can only be obtained after the image is segmented! Both parameter estimation and segmentation are computationally expensive operations. If the texture class and hence the class parameter information is known, image segmentation can be formulated as a maximum a posteriori (MAP) estimation problem. In modeling images with more than one texture, in addition to the MRF model describing a texture patch, an additional random process to characterize the distribution of textures in the image is introduced. This texture label process is usually modeled using discrete Markov models with a single parameter measuring the amount of clustering between neighboring pixels. Let Y be the image intensity, modeled as a GMRF conditioned on the class label, and L be the class label process. Then the posterior distribution of texture labels for the entire image given the intensity array is

Pr(L | Y) ∝ Pr(Y | L) Pr(L)
where Pr(Y|L) is the conditional probability of observing the intensity array given the label distribution, and Pr(L) is the probability of the label distribution L. Maximizing the right-hand side gives the MAP estimate. In general, finding an optimal solution is not feasible because of the nonconvex nature of the MAP cost function. Geman and Geman (26) employ a stochastic relaxation method for finding a solution. This approach can be shown in theory to yield the global optimum but requires impractical annealing schedules. Most of its software implementations have a fixed number of iterations, and the results are usually good.
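To make the MAP formulation concrete, the sketch below uses iterated conditional modes (ICM), a deterministic relaxation that is simpler and weaker than the stochastic relaxation of Geman and Geman. The conditional GMRF likelihood is replaced here by an independent Gaussian likelihood per class and the label prior by a single-parameter Potts-type penalty, so the code only illustrates the idea and does not reproduce any published algorithm.

import numpy as np

def icm_segment(y, means, stds, beta=1.0, iters=10):
    """MAP-style texture labeling by iterated conditional modes (ICM).

    y     : 2-D intensity image
    means : per-class means (K,)
    stds  : per-class standard deviations (K,)
    beta  : clustering parameter of the discrete label prior
    """
    K = len(means)
    # -log likelihood of each pixel under each class (Gaussian stand-in
    # for the conditional GMRF density).
    d = (y[None, :, :] - np.asarray(means)[:, None, None]) ** 2
    loglik = d / (2 * np.asarray(stds)[:, None, None] ** 2) \
             + np.log(np.asarray(stds))[:, None, None]
    labels = np.argmin(loglik, axis=0)           # start from the ML labels
    for _ in range(iters):
        # For each class, count how many of the 4 neighbors disagree.
        disagree = np.zeros((K,) + y.shape)
        for shift in ((0, 1), (0, -1), (1, 0), (-1, 0)):
            nbr = np.roll(labels, shift, axis=(0, 1))   # wrap-around border
            disagree += (nbr[None, :, :] != np.arange(K)[:, None, None])
        energy = loglik + beta * disagree
        labels = np.argmin(energy, axis=0)       # pixelwise MAP update
    return labels

# Toy example: two vertically split Gaussian "textures".
rng = np.random.default_rng(1)
img = np.hstack([rng.normal(0.0, 1.0, (64, 32)),
                 rng.normal(3.0, 1.0, (64, 32))])
seg = icm_segment(img, means=[0.0, 3.0], stds=[1.0, 1.0], beta=1.5)
print(np.mean(seg[:, :32] == 0), np.mean(seg[:, 32:] == 1))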
Gaussian Markov Random Field Parameter Estimation. While there are many methods proposed for estimating GMRF parameters, it is difficult to guarantee both consistency and stability of the estimates. Consistency refers to the property that the estimates converge to the true values of the parameters, and stability refers to the positive definiteness of the covariance matrix in the expression for the joint probability density of the MRF. Parameter estimation can be formulated as an optimization problem. However, as mentioned earlier, parameter estimation requires segmented image regions, whereas segmentation requires good estimates of the parameters. Lakshmanan and Derin (27) propose an optimization framework for simultaneous parameter estimation and segmentation. They compute the maximum likelihood estimates of the parameters and a MAP solution for segmentation. A simple least-squares estimator of this kind is sketched at the end of this section. Multiresolution Analysis and Markov Random Fields. Many of the computations involving MRF models are expensive: the cost functions for parameter estimation and segmentation are nonconvex and hence involve iterative algorithms. One possibility is to use a multiresolution approach. Coarser resolution sample fields are obtained by subsampling. In general, GMRFs lose their Markov property under subsampling (the second-order GMRF with separable autocovariance matrix is an exception). An excellent discussion on multiresolution GMRF models can be found in Ref. 28. Krishnamachari and Chellappa (29) present techniques for GMRF parameter estimation at coarser resolution from finer resolution parameters. They use the coarse resolution parameters to segment the coarse resolution image, and the segmentation results are propagated to finer resolutions. Besides improving the speed of computation, segmentation at coarser resolutions provides good initial conditions for the finer resolutions that follow, thus improving the segmentation results compared to working at the original resolution. Concluding the discussion on GMRF models for texture analysis, we observe that they provide an elegant mathematical framework for describing texture and for deriving algorithms for texture classification and segmentation. Multiresolution models are particularly interesting from a computational point of view. Recently, GMRF models have been applied to unsupervised segmentation of color texture images wherein both spatial and interband interactions are modeled using random fields (30).
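As an illustration of what GMRF parameter estimation involves, the following sketch computes a least-squares estimate of the interaction parameters of a first-order (four-neighbor) GMRF from a single homogeneous patch. It is a simplified version of the classical estimators of the kind compared in Ref. 24; the toroidal boundary handling and zero-mean assumption are simplifications made for brevity.

import numpy as np

def gmrf_ls_estimate(patch):
    """Least-squares estimate of first-order GMRF interaction parameters.

    Returns (theta_h, theta_v, noise_variance) for the model
        y(s) = th_h * [y(s+e1) + y(s-e1)] + th_v * [y(s+e2) + y(s-e2)] + e(s)
    where e1, e2 are the horizontal and vertical unit shifts.
    Periodic (toroidal) boundaries are assumed for simplicity.
    """
    y = patch - patch.mean()                             # zero-mean field
    h = np.roll(y, 1, axis=1) + np.roll(y, -1, axis=1)   # East + West
    v = np.roll(y, 1, axis=0) + np.roll(y, -1, axis=0)   # North + South
    Q = np.stack([h.ravel(), v.ravel()], axis=1)         # regressors
    theta, *_ = np.linalg.lstsq(Q, y.ravel(), rcond=None)
    resid = y.ravel() - Q @ theta
    return theta[0], theta[1], resid.var()

# Example on a smoothed noise field (a crude stand-in for a texture patch).
rng = np.random.default_rng(2)
field = rng.normal(size=(128, 128))
field = (field + np.roll(field, 1, 1) + np.roll(field, 1, 0)) / 3.0
print(gmrf_ls_estimate(field))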
Spatial Filtering A serious drawback of random field models that characterize intensity distributions in modeling textures is that they are sensitive to gray level distortions induced by changes in imaging conditions such as lighting. In contrast, typical spatial filtering methods use the variations in the intensity as opposed to the absolute values of the intensity itself for texture classification and segmentation purposes. Laws’ work on texture energy features (8) is one of the early attempts to apply spatial filtering followed by some nonlinearities (such as computing the energy) for texture discrimination. In this formulation, images are processed by a number of filters, each designed to extract a certain type of image feature. Typical features of interest include edges and lines. While the initial design of these filters by Laws was quite ad-hoc, he obtained significantly better results compared to co-occurrence based methods. In recent years, spatial filtering methods have been extensively studied for texture classification/segmentation tasks. Malik and Perona (19) proposed a computational framework for modeling pre-attentive texture discrimination. Images are filtered using even symmetric kernels (such as Gaussians and Gaussian second derivatives) followed by half-wave rectification. Weak responses are suppressed using local negative feedback among competing feature locations. Texture gradient is then computed and boundaries are identified. The authors argue against the use of energy type features as well as the use of odd-symmetric filters in the preprocessing stage. Their experimental results indicate that the proposed algorithm yields results comparable to human discrimination on a set of textures frequently used in the literature.
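The filter-then-nonlinearity idea behind Laws' energy features can be sketched as follows. The two 1-D kernels are the commonly quoted Laws level and edge vectors, and the local energy is a moving average of the squared filter response; this is a simplified illustration rather than Laws' complete feature set.

import numpy as np
from scipy.ndimage import convolve, uniform_filter

# Commonly quoted 1-D Laws vectors: L5 (level) and E5 (edge).
L5 = np.array([1., 4., 6., 4., 1.])
E5 = np.array([-1., -2., 0., 2., 1.])

def texture_energy(img, row_kernel, col_kernel, window=15):
    """Filter the image with a separable 2-D mask, then average the
    squared response over a local window to obtain an energy map."""
    mask = np.outer(row_kernel, col_kernel)
    mask = mask - mask.mean()            # subtract the mean (DC) component
    response = convolve(img.astype(float), mask, mode='reflect')
    return uniform_filter(response ** 2, size=window)

# Energy maps from two different masks can already separate many textures.
rng = np.random.default_rng(3)
img = rng.normal(size=(128, 128))
e_le = texture_energy(img, L5, E5)       # level x edge mask
e_el = texture_energy(img, E5, L5)       # edge x level mask
features = np.stack([e_le, e_el], axis=-1)
print(features.shape)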
Directional filters, such as those based on Gabor functions, have been used by many researchers for texture analysis (31,32,33,34,35). Gabor functions are modulated Gaussians, and the general form of a two-dimensional Gabor function can be written as

g(x, y) = [1/(2πσxσy)] exp{−(1/2)[(x²/σx²) + (y²/σy²)] + 2πj(u0x + v0y)}
where σx and σy define the widths of the Gaussian in the spatial domain and (u0 , v0 ) is the frequency of the complex sinusoid. Some of the early work on using Gabor functions in image processing and vision can be credited to Daugman (36). Daugman suggests that the 2-D receptive field profiles of simple cells in the mammalian visual cortex can be well modeled by Gabor functions. These functions also have some nice theoretical properties such as minimizing the joint uncertainty in space and frequency which may have some implications in coding and recognition applications. In Ref. 33, the authors present a multiresolution framework for boundary detection. Features of interest in an image are generally present in various spatial resolutions, and a multiscale representation facilitates feature extraction and analysis. The image is first filtered with a bank of Gabor filters tuned to different orientations and scale. This is followed by local competitive and co-operative feature interactions to suppress weak features while reinforcing the stronger ones. An interesting aspect in Ref. 33 is the use of interscale interactions where features at neighboring scales interact, and the resulting output is sensitive to curvature changes and line endings in the image. A grouping stage combines outputs tuned to similar orientations. Finally, texture and intensity gradients are computed to detect boundaries in the image. The paper reports results on a wide variety of images including some interesting examples of illusory boundaries. Continuing on the use of Gabor filters and a multiresolution representation, Manjunath and Ma (21) propose a filter design that generates a bank of Gabor filters given the lower and upper cut off frequencies. This self-similar wavelet dictionary is obtained by dilations and rotations of the kernel g(x, y).
where σu = 1/(2πσx) and σv = 1/(2πσy). The filter bank can then be computed as:
where θ = nπ/K and K is the total number of discrete orientations. Let U l and U h denote the lower and upper center frequencies of interest and S be the number of scales in the multiresolution decomposition. Then the
following equations can be used to compute the filter parameters:
where W = U h and m = 0, 1, . . ., S − 1. The image is then convolved with each of the filters in the dictionary and the mean and standard deviation of the resulting outputs are used in constructing a texture feature vector. A weighted Euclidean distance metric is used to compare two feature vectors. Extensive experiments on the texture in the Brodatz album (1) indicate that this feature vector does quite well in characterizing a wide variety of textures. A detailed comparison and experimental validation of different texture feature representations is made in (21). Later on we present an image database application where similar patterns are retrieved based on texture.
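A minimal sketch of the feature computation described above follows: a small bank of Gabor filters at several scales and orientations is built from the kernel g(x, y), the mean and standard deviation of each filter-output magnitude form the feature vector, and two vectors are compared with a normalized (weighted) Euclidean distance. The octave scale spacing, kernel truncation radius, and normalization used here are illustrative choices, not the exact design equations of Ref. 21.

import numpy as np
from scipy.ndimage import convolve

def gabor_kernel(freq, theta, sigma_x=2.0, sigma_y=2.0, radius=8):
    """Complex 2-D Gabor kernel: a Gaussian modulated by a plane wave of
    spatial frequency `freq` oriented at angle `theta`."""
    y, x = np.mgrid[-radius:radius + 1, -radius:radius + 1]
    u0, v0 = freq * np.cos(theta), freq * np.sin(theta)
    envelope = np.exp(-0.5 * (x**2 / sigma_x**2 + y**2 / sigma_y**2))
    carrier = np.exp(2j * np.pi * (u0 * x + v0 * y))
    g = envelope * carrier / (2 * np.pi * sigma_x * sigma_y)
    return g - g.mean()                      # suppress the DC response

def gabor_features(img, scales=4, orientations=6, f_max=0.4):
    """Mean and standard deviation of the filter-output magnitude for
    every (scale, orientation) pair, concatenated into one vector."""
    img = img.astype(float)
    feats = []
    for m in range(scales):
        freq = f_max / (2 ** m)              # octave spacing (a choice)
        for n in range(orientations):
            theta = n * np.pi / orientations
            g = gabor_kernel(freq, theta)
            re = convolve(img, g.real, mode='reflect')
            im = convolve(img, g.imag, mode='reflect')
            mag = np.hypot(re, im)
            feats += [mag.mean(), mag.std()]
    return np.array(feats)

def weighted_distance(f1, f2, scale):
    """Euclidean distance after normalizing each feature by `scale`
    (e.g., the feature's standard deviation over the database)."""
    return np.sqrt(np.sum(((f1 - f2) / scale) ** 2))

rng = np.random.default_rng(4)
a, b = rng.normal(size=(64, 64)), rng.normal(size=(64, 64))
fa, fb = gabor_features(a), gabor_features(b)
print(weighted_distance(fa, fb, scale=fa.std() + 1e-9))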
Rotation Invariant Texture Classification The general approach to developing rotation-invariant techniques has been to modify successful non-rotationinvariant techniques. Since standard MRF models are inherently dependent on rotation, several methods have been introduced to obtain rotation invariance. Kashyap and Khotanzad (37) developed the circular autoregressive model with parameters that are invariant to image rotation. Choe and Kashyap (38) introduced an autoregressive fractional difference model that has rotation (as well as tilt and slant) invariant parameters. Cohen, Fan, and Patel (39) extended a likelihood function to incorporate rotation (and scale) parameters. To classify a sample, an estimate of its rotation (and scale) is required. For feature-based approaches, rotation-invariance is achieved by using anisotropic features. Porat and Zeevi (40) use first and second order statistics based upon three spatially localized features, two of which (dominant spatial frequency and orientation of dominant spatial frequency) are derived from a Gabor-filtered image. Leung and Peterson (41) present two approaches, one that transforms a Gabor-filtered image into rotation invariant features and the other of which rotates the image before filtering; however, neither utilizes the spatial resolving capabilities of the Gabor filter. You and Cohen (42) use filters that are tuned over a training set to provide high discrimination among its constituent textures. Greenspan, et al., (43) use rotation-invariant structural features obtained via multiresolution Gabor filtering. Rotation invariance is achieved by using the magnitude of a discrete Fourier transform (DFT) in the rotation dimension. Haley and Manjunath (22) have investigated applications of Gabor features for rotation invariant classification. A polar analytic form of a two-dimensional Gabor wavelet and a much more detailed set of microfeatures is computed. From these microfeatures, a micromodel which characterizes spatially localized amplitude, frequency, and directional behavior of the texture, is formed. The essential characteristics of a texture sample, its macrofeatures, are derived from the estimated selected parameters of the micromodel. Classification of texture samples is based on the macromodel derived from a rotation invariant subset of macrofeatures. In experiments using the Brodatz album, comparatively high classification rates are obtained. A detailed feature parametric analysis and feature quality analysis is provided in Ref. 44. Summarizing the discussion on spatial filtering methods for texture analysis, significant progress has been made in the use of band pass filters for extracting texture features since the early work of Laws (8). As in the case of random field models, scale and rotation invariant analysis remain as challenging issues.
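The DFT-magnitude device mentioned above for Greenspan et al. (43) can be sketched in a few lines: if features are arranged on a scale-by-orientation grid, rotating the texture approximately shifts the responses circularly along the orientation axis, and the magnitude of a discrete Fourier transform taken along that axis is unchanged by such shifts. The feature layout assumed below is for illustration only.

import numpy as np

def rotation_invariant(features, scales, orientations):
    """Map a (scales x orientations) feature grid to features that are
    invariant to circular shifts along the orientation axis.

    A rotation of the input texture approximately rotates the responses
    along the orientation axis; the DFT magnitude along that axis
    removes the dependence on the shift.
    """
    grid = np.asarray(features, dtype=float).reshape(scales, orientations)
    spectrum = np.fft.fft(grid, axis=1)
    return np.abs(spectrum).ravel()

# A circular shift of the orientation responses leaves the output unchanged.
rng = np.random.default_rng(5)
f = rng.normal(size=(4, 6))
print(np.allclose(rotation_invariant(f, 4, 6),
                  rotation_invariant(np.roll(f, 2, axis=1), 4, 6)))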
Applications In recent years image texture has emerged as an important primitive to search and browse through large collections for similar-looking patterns. An image can be considered as a mosaic of textures, and texture features associated with the regions can be used to index the image data. For instance, a user browsing an aerial image database may want to identify all parking lots in the image collection. A parking lot with cars parked at regular intervals is an excellent example of a textured pattern. Similarly, agricultural areas and vegetation patches are other examples of textures commonly found in aerial imagery and satellite photographs. An example of a typical query that can be asked of such a content-based retrieval system could be “retrieve all Landsat images of Santa Barbara which have less than 20% cloud cover” or “Find a vegetation patch that looks like this region.” In the Alexandria digital library (ADL) (45) project at the University of California at Santa Barbara, researchers are developing a prototype geographic information system that will have some of the image search features described above. Manjunath and Ma (21) investigate the role of textures in annotating image collections and report on the performance of several state-of-the-art texture analysis algorithms with performance in image retrieval being the objective criterion. Their texture analysis scheme based on a Gabor wavelet decomposition described earlier in this article performed quite well in this application compared to methods such as those using random field models and orthogonal wavelet filters. Ma and Manjunath (46) provide a detailed description of a system that searches aerial photographs based on texture content. They demonstrate that texture could be used to select a large number of geographically salient features including vegetation patterns, parking lots, and building developments. Using texture primitives as visual features, one can query the database to retrieve similar image patterns. Most of the results presented are with airphotos although a similar analysis can be applied to Landsat and Spot satellite images. This is currently being integrated into the ADL project (45), whose goal is to establish an electronic library of spatially indexed data, providing internet access to a wide collection of geographic information. A significant part of this collection includes maps, satellite images, and airphotos. For example, the Maps and Imagery Library at the UCSB contains over 2 million historically valuable aerial photographs. A typical air photo can take over 25 MB of disk space, and providing access to such data raises several important issues, such as multiresolution browsing and selecting images based on content. Figures 3 and 4 show some examples of texture-based retrieval in an airphoto database. What distinguishes image search for database-related applications from traditional texture classification methods is the fact that there is a human in the loop (the user), and there is a need to retrieve more than just the best match. In typical applications a number of top matches with rank-ordered similarities to the query pattern will be retrieved. Comparison in the texture feature space should preserve visual similarities between patterns. This is an important but difficult problem in content-based image retrieval. Toward this objective, a hybrid neural network algorithm to learn the pattern similarity in the texture feature space is proposed (47).
This approach uses training data containing the pattern similarity information (provided by human indexers) to partition the feature space into many visually similar clusters. A performance evaluation of this approach using the Brodatz textures indicates that a significantly better retrieval performance can be achieved. In addition to retrieving perceptually more relevant data, an additional advantage of this approach is that it also provides an efficient indexing tree to narrow down the search space. An interesting component of the system described in Ref. 46 is the texture thesaurus for similarity search. A texture thesaurus can be visualized as an image counterpart of the traditional thesaurus for text search. It creates the information links among the stored image data based on a collection of code words and sample patterns obtained from a training texture pattern set. Similar to parsing text documents using a dictionary or thesaurus, the information within images can be classified and indexed via the use of a texture thesaurus. The design of the texture thesaurus has two stages. The first stage uses a learning similarity algorithm to combine the human perceptual similarity with the low-level feature vector information, and the second stage utilizes a hierarchical vector quantization technique to construct the code words. The texture
thesaurus so constructed is domain-dependent and can be designed to meet the particular need of a specific image data type by exploring the training data. Further, the thesaurus model provides an efficient indexing tree while maintaining or even improving the retrieval performance in terms of human perception. The visual code word representation in the thesaurus can be used as information samples to help users browse through the database.
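A toy version of the code-word idea can be written with ordinary vector quantization: feature vectors from training patterns are clustered, each centroid serves as a visual code word, and database images are indexed by their nearest code word so that a query is compared only against images sharing its code word. The system of Ref. 46 uses learned similarity and a hierarchical quantizer; the flat k-means below is only a stand-in.

import numpy as np
from scipy.cluster.vq import kmeans2

def build_codebook(train_features, n_codewords=8):
    """Cluster training feature vectors; the centroids are the code words.
    (k-means initialization is random, so code words vary run to run.)"""
    codebook, _ = kmeans2(train_features, n_codewords, minit='points')
    return codebook

def assign_codeword(codebook, feature):
    """Index of the nearest code word for one feature vector."""
    return int(np.argmin(np.linalg.norm(codebook - feature, axis=1)))

def retrieve(codebook, index, db_features, query, top=3):
    """Search only the images indexed under the query's code word and
    rank them by Euclidean distance to the query feature vector."""
    w = assign_codeword(codebook, query)
    candidates = index.get(w, [])
    ranked = sorted(candidates,
                    key=lambda i: np.linalg.norm(db_features[i] - query))
    return ranked[:top]

rng = np.random.default_rng(6)
train = rng.normal(size=(200, 12))           # training texture features
db = rng.normal(size=(50, 12))               # database image features
book = build_codebook(train)
index = {}
for i, f in enumerate(db):                   # index the database off-line
    index.setdefault(assign_codeword(book, f), []).append(i)
print(retrieve(book, index, db, db[0]))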
Summary Image texture research has seen much progress during the last two decades. Texture-based image classification has found applications in satellite and medical image analysis and in industrial vision systems for applications such as defect detection. Texture mapping for visualization and computer animations is now a well-established area in computer graphics. Texture appears to be a promising image feature for search and indexing of large image and video databases. In both model-based and spatial filtering approaches, current research is on deriving scale and rotation invariant texture features. Recent work on rotation invariant texture computations appears quite encouraging, whereas scale invariance remains elusive.
Acknowledgments The airphoto examples shown in Figs. 3 and 4 are from a demonstration program written by Dr. Wei-Ying Ma.
BIBLIOGRAPHY 1. P. Brodatz Textures: A Photographic Album for Artists and Designers, New York: Dover, 1966. 2. D. Blostein N. Ahuja Shape from texture: integrating texture element extraction and surface estimation, IEEE Trans. Pattern Anal. Mach. Intell., 11: 1233–50, 1989. 3. R. Bajcsy L. Lieberman Texture gradient as a depth cue. Comput. Graphics Image Process., 5: 52–67, 1976. 4. B. J. Super A. C. Bovik Shape from texture using local spectral moments, IEEE Trans. Pattern Anal. Mach. Intell., 17: April 1995. p. 333–343. 5. J. Krumm S. A. Shafer A characterizable shape-from-texture algorithm using the spectrogram, Proc. IEEE-SP Int. Symp. Time-Freq. Time-Scale Anal., Oct. 25–28 1994, pp. 322–325. 6. H. Voorhees T. Poggio Computing texture boundaries from images, Nature, 333: 364–367, 1988. 7. P. C. Chen T. Pavlidis Segmentation by texture using correlation, IEEE Trans. Pattern Anal. Mach. Intell., 5: 64–69, 1983. 8. K. Laws Textured image segmentation, Ph.D. thesis, University of Southern California, 1978. 9. R. Haralick R. Bosley Texture features for image classification, 3rd ERTS Symp., NASA SP-351, 1219–1228, 1973. 10. M. M. Galloway Texture analysis using gray level run lengths, Comput. Graphics Image Process., 4: 172–179, 1975. 11. C. W. Therrien An estimation theoretic approach to terrain image segmentation, Comput. Vision, Graphics Image Process., 22: 313–326, 1983. 12. F. S. Cohen D. B. Cooper Simple parallel hierarchical and relaxation algorithms for segmenting noncausal Markovian fields, IEEE Trans. Pattern Anal. Mach. Intell., 9: 195–219, 1987. 13. S. Chatterjee R. Chellappa Maximum likelihood texture segmentation using Gaussian Markov random field models, Proc. IEEE Conf. Comput. Vision Pattern Recognition, June 1985. 14. H. Derin H. Elliott Modeling and segmentation of noisy and textured images using Gibbs random fields, IEEE Trans. Pattern Anal. Mach. Intell., 9: 39–55, 1987. 15. B. S. Manjunath R. Chellappa A note on unsupervised texture segmentation, IEEE Trans. Pattern Anal. Mach. Intell., 13: 472–483, 1991.
16. R. Chellappa R. L. Kashyap B. S. Manjunath Model-based texture segmentation and classification, in Handbook of Pattern Recognition and Computer Vision, C. H. Chen, L. F. Pau, and P. S. P. Wang (eds.), Singapore: World Scientific, 1993, pp. 279–310. 17. B. Julesz Textons, the elements of texture perception, and their interactions, Nature, 290: 12, March 1981. 18. J. R. Bergen E. H. Adelson Early vision and texture perception, Nature, 333: 363–364, 1988. 19. J. Malik P. Perona Preattentive texture discrimination with early vision mechanisms, J. Opt. Soc. Amer., A7: 923–932, 1990. 20. T. Chang C.-C. J. Kuo Texture analysis and classification with tree structured wavelet transform, IEEE Trans. Image Process., 2: 429–441, 1993. 21. B. S. Manjunath W. Y. Ma Texture features for browsing and retrieval of image data, IEEE Trans. Pattern Anal. Mach. Intell., 18: 1996, 837–842. 22. G. Haley B. S. Manjunath Rotation invariant texture classification using a complete space-frequency model, IEEE Trans. Image Process., 1998, in press. 23. R. M. Haralick L. Shapiro Computer and Robot Vision, Addison-Wesley, 1992, vol. 1, chap. 9. 24. R. Chellappa Two-dimensional discrete Gaussian Markov random field models for image processing, in L. N. Kanal and A. Rosenfeld (eds.), Progress in Pattern Recognition 2, Amsterdam: Elsevier, 1985. 25. G. R. Cross A. K. Jain Markov random field texture models, IEEE Trans. Pattern Anal. Mach. Intell., 5: 25–39, 1983. 26. S. Geman D. Geman Stochastic relaxation, Gibbs distributions, and Bayesian restoration of images, IEEE Trans. Pattern Anal. Mach. Intell., 6: 721–741, 1984. 27. S. Lakshmanan H. Derin Simultaneous parameter estimation and segmentation of Gibbs random fields using simulated annealing, IEEE Trans. Pattern Anal. Mach. Intell., 11: 799–813, August 1989. 28. S. Lakshmanan H. Derin Gaussian Markov random fields at multiple resolutions, in R. Chellappa (ed.), Markov Random Fields: Theory and Applications, New York: Academic Press, 1993, pp. 131–157. 29. S. Krishnamachari R. Chellappa Multiresolution Gauss-Markov random field models for texture segmentation, IEEE Trans. Image Process., 6: 1997. 30. D. K. Panjwani G. Healey Markov random field models for unsupervised segmentation of textured color images, IEEE Trans. Pattern Anal. Machine Intell., 17: 939–954, 1995. 31. A. Bovik M. Clark W. Geisler Multichannel texture analysis using localized spatial filters, IEEE Trans. Pattern Anal. Mach. Intell., 12: 55–73, 1990. 32. A. K. Jain F. Farrokhnia Unsupervised texture segmentation using Gabor filters, Pattern Recognition, 23: 1167–1186, 1991. 33. B. S. Manjunath R. Chellappa A Unified approach to boundary perception: edges, textures and illusory contours, IEEE Trans. Neural Networks, 4: 96–108, Jan 1993. 34. D. Dunn W. E. Higgins J. Wakeley Texture segmentation using 2-D Gabor elementary functions, IEEE Trans. Pattern Anal. Mach. Intell., 16: 1994. 35. H. Greenspan R. Goodman R. Chellappa C. H. Anderson Learning texture discrimination rules in a multiresolution system, IEEE Trans. Pattern Anal. Machine Intell., 16: 894–901, 1994. 36. J. G. Daugman Uncertainty relation for resolution in space, spatial frequency and orientation optimized by twodimensional visual cortical filters, J. Opt. Soc. Amer., 2: 1160–1169, 1985. 37. R. L. Kashyap A. Khotanzad A Model-based method for rotation invariant texture classification, IEEE Trans. Pattern Anal. Mach. Intell., 8: 472–481, 1986. 38. Y. Choe R. L. Kashyap 3-D shape from a shaded and textural surface image, IEEE Trans. 
Pattern Anal. Mach. Intell., 13: 907–918, 1991. 39. F. S. Cohen Z. Fan M. A. Patel Classification of rotated and scaled textured image using Gaussian Markov random field models, IEEE Trans. Pattern Anal. Mach. Intell., 13: 192–202, 1991. 40. M. Porat Y. Y. Zeevi Localized texture processing in vision: analysis and synthesis in the Gaborian space, IEEE Trans. Biomed. Eng., 36: 115–129, 1989. 41. M. M. Leung A. M. Peterson Scale and rotation invariant texture classification, Proc. 26th Asilomar Conf. Signals, Syst. Comput. Pacific Grove, CA, October 1992. 42. J. You H. A. Cohen Classification and segmentation of rotated and scaled textured images using tuned masks, Pattern Recognition, 26 (2): 245–258, 1993.
43. H. Greenspan et al. Rotation invariant texture recognition using a steerable pyramid, Proc. IEEE Int. Conf. Image Process., Jerusalem, Israel, October 1994. 44. G. M. Haley Rotation invariant texture classification using a complete space-frequency model, M.S. Thesis, Elect. Comput. Eng. Depart., Santa Barbara, CA: Univ. California, June 1996. 45. T. Smith et al. A Digital library for geographically referenced materials, IEEE Computer, 54–60, May 1996. 46. W. Y. Ma B. S. Manjunath A texture thesaurus for browsing large aerial photographs, J. Amer. Soc. Inf. Sci., 49 (7): 633–648, 1997. 47. W. Y. Ma B. S. Manjunath Texture features and learning similarity, Proc. IEEE Int. Conf. Comput. Vision Pattern Recognition, San Francisco, CA, June, 1996, pp. 425–430.
B. S. MANJUNATH University of California R. CHELLAPPA University of Maryland
Wiley Encyclopedia of Electrical and Electronics Engineering
Infrared Imaging
Standard Article
Thomas J. Meitzler, Grant R. Gerhart, Euijung Sohn, US Army Tank-Automotive and Armaments Command, Warren, MI; Harpreet Singh, Kyung-Jae Ha, Wayne State University, Detroit, MI
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved. DOI: 10.1002/047134608X.W4110.pub2
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Theory of Infrared Imaging; Infrared Image Processing; Infrared Systems and Array Geometries; Metrics in Cluttered Image Environments; Equipment and Experimental Methods; Wavelet Transforms of IR Images; Fuzzy Logic and Detection Metrics for IR Images; Applications of Infrared Imaging Sensors; Infrared Imaging and Image Fusion Used for Situational Awareness; Appendix A; Summary.
INFRARED IMAGING In 1800 Sir William Herschel, Royal Astronomer to King George III of England, wrote, “. . . There are rays coming from the sun . . . invested with a high power for heating bodies, but with none for illumination objects . . . The maximum of the heating power is vested among the invisible rays” (1). During his search for a new lens filter material to be used in telescopes for observing solar phenomena, Herschel recorded that some samples of colored glass that gave like reductions in brightness transmitted little of the sun’s heat, whereas other samples transmitted so much heat that he felt impending eye damage after only a few seconds of observation. Herschel performed experiments with prisms and sensitive thermometers to determine which colors of the visible spectrum had the largest heating effect. Herschel recorded that the heating effect increased as the thermometer was moved from the violet to the red end of the spectrum.He continued to move the thermometer beyond the visible red end of the spectrum and saw an even greater temperature increase. The dark heat discovered by Herschel is known today as the infrared part of the electromagnetic (EM) spectrum (7). The prefix infra in this case refers to longer wavelengths, below or beneath (2) the red part of the visible region of the EM spectrum. The wavelengths of the EM waves in an infrared image typically range from 1 to 14 µm. Infrared radiation occupies the region of the electromagnetic spectrum between visible EM waves and radio waves. The range from three to five microns is called the short-wavelength band and the region from 8 to 12 µm is called the long-wavelength band. The technology and science of infrared imaging have developed from Herschel’s discovery in 1800 into a worldwide endeavor involving scientists and engineers from fields such as medicine, astronomy, material science, and the military. The technology of infrared imaging took a leap forward in 1917 when Case invented a photoconductive detector material capable of more efficiently detecting infrared radiation. Further improvements were made to the technology during World War II when it was discovered that the spectral sensitivity of the detectors could be broadened by cooling the photoconductive detector material below room temperature (1). Currently, the majority of fielded military IR imaging systems use similarly cooled detectors. Within the past ten years, however, a new imaging technology based on the pyroelectric effect has led to the development of uncooled IR imaging systems. The usefulness of infrared imaging devices is based on a fundamental physical principle: heated objects emit infrared energy because of molecular agitation. The infrared energy emitted by the material object is called thermal radiance or radiation. The radiance or electromagnetic energy from the object itself is typically called the signature of that particular object. Each object has a unique signature. In a later section, the quantities used to characterize IR signatures are discussed. Several factors determine the magnitude of the thermal radiance emitted by an object. Some of these factors are the object’s temperature relative to its environment, its surface reflectivity and its geometrical properties. In nature infrared energy is exchanged be-
tween objects when they absorb and radiate solar and heat energy from the atmosphere. Thermal energy from vehicles is released from the combustion of fuel inside engines and friction generated by moving parts. Heat from combustion and frictional heat cause ground vehicles to emit IR energy and hence have a signature in the IR spectrum unique to the vehicle’s geometry and material characteristics. From a military point of view, a large IR signature aids in detecting a vehicle and is therefore usually to be avoided. Commercially, large IR signatures indicate energy loss, such as heat escaping through a region of poorly insulated ceiling or, in the case of electrical transmission wire, a region of wire nonuniformity that leads to increased ohmic resistance. There are many applications of infrared imaging, such as medical research, astronomy, and remote sensing. Details of these areas can be found in books on these particular topics. The general principles involved in forming an IR image are discussed below. THEORY OF INFRARED IMAGING The EM spectrum encompasses wavelengths that span the range from very high-frequency gamma rays to lowfrequency radio waves. Infrared radiation has wavelengths just beyond the red part of the spectrum. Most of the observed phenomena in the visible region are related to the reflected sunlight or artificial illumination. Most infrared phenomena are related to radiation emitted from objects. The radiation emitted is related to temperature. The wellknown relationship between temperature and radiation, known as the Stefan–Boltzmann law of radiation, is obtained by integrating Planck’s formula from λ = 0 to λ = ∞. The total radiant exitance by a blackbody is given by
M = σT⁴, where T is the absolute temperature of the blackbody and σ is the Stefan–Boltzmann constant, 5.67 × 10⁻⁸ W/(m² K⁴). Every object in the universe is constantly emitting and receiving IR radiation as thermal radiation from every other object. The amount of radiated energy varies depending on the temperature and emittance of its surfaces. The laws governing the radiation, transmission, and absorption of infrared radiation are part of thermodynamics and optics. This infrared radiation can be used for imaging devices that convert an invisible infrared image into a visible image. Human vision can be extended beyond the visible red part of the EM spectrum with a thermal imaging system. Historically, after the emergence of television following World War II, camera tube principles were applied to night-vision systems. The source radiation for IR imaging may be self-emitted from objects in the scene of interest or reflected from other sources. Sunlight is the primary source of reflected radiation. Another may be from controlled sources, such as lasers, used specifically as illuminators for the imaging device. Those systems with sources that illuminate part of the scene are called active, whereas those relying largely on radiation emitted from the scene are called passive. A passive IR imager is an electrooptical system that accomplishes the function of remote imaging of the scene without active scene illumination. Figure 1 is a block
diagram of a passive infrared imaging system.

Figure 1. Block diagram of a passive IR sensor.

To describe the electronic operation of IR sensors, it is often easier to use the particle nature of light, that is, photons. A photon's path is through the sensor from the scene element source, then on through the optics and detector, which ends with its conversion to an electronic signal at the focal-plane array (FPA). The photon-generated signals are processed to obtain useful information which is displayed as an image by infrared image processing. It will be helpful to the reader to begin a detailed discussion of a passive imaging system by clarifying signal processing versus image processing. The division between signal processing and image processing is blurred by the applied architecture and end application. Signal processing usually refers to the time stream of data coming from a single detector, much like the line trace of a heartbeat on an oscilloscope. Signal processing works on and looks at the rise and fall of the trace to detect targets against the background noise (5). Diagrams of IR signal processors are given in Ref. 3. Infrared imaging has found application in such diverse fields as medicine, nondestructive material evaluation, astronomy, and robotics. As such, a large degree of image processing is needed. The following describes some of these steps for infrared image processing. INFRARED IMAGE PROCESSING Now we consider what happens after the sensor has converted photons to representative signals. The output of the focal plane's multiplexer is analog electrical signals. The signals may be amplified and converted into digital signals and conditioned and processed to form images or targeting information. The following are the special considerations needed to create a clear image, as given in (5):
1. Correction. To smooth out the response of the individual pixels that make up the FPA and make the response more uniform is one of the most important signal processing functions. Most FPAs require a two-point correction that adjusts each pixel for both gain and offset by recording the response of the FPA as it
stares at two known radiance sources of blackbodies at two different temperatures. (A minimal sketch of this two-point correction follows the list.)
2. Calibration. When some radiometric applications use three- and four-point correction, calibration is needed. Calibration adjusts system parameters to known standards, which are the standard blackbody sources.
3. Ac Coupling. Ac coupling removes a pedestal of noise. This allows identifying a smaller temporal spike more easily on top of a large, constant signal platform. This removes the constant amplitude component of the scene and, in the temporal domain, passes only changes.
4. Dc Coupling. Dc coupling works similarly for constant radiation and staring sensors. A dc-coupled signal with a low-pass filter reduces large fluctuations and passes the constant background. This approach is often used with staring sensors.
5. Thresholding. This technique compares a pixel value with some threshold value. If the value is higher than the threshold value, it will be identified as a possible target pixel. Thresholding for focal-plane frames falls into the category of image processing.
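The following is a minimal sketch of the two-point correction of item 1; the per-pixel gains and offsets are chosen so that every pixel reproduces the array-average response at the two blackbody references. The variable names and the use of the array mean as the reference level are illustrative assumptions.

import numpy as np

def two_point_correction(raw, cold_frame, hot_frame):
    """Per-pixel gain/offset (two-point) nonuniformity correction.

    cold_frame, hot_frame : FPA responses while staring at the cold and
                            hot blackbody references.
    raw                   : an uncorrected scene frame.
    Each pixel is mapped so that its cold/hot responses match the
    array-average response at those two references.
    """
    target_cold = cold_frame.mean()
    target_hot = hot_frame.mean()
    span = hot_frame - cold_frame
    span = np.where(np.abs(span) < 1e-12, 1e-12, span)  # avoid divide-by-zero
    gain = (target_hot - target_cold) / span
    offset = target_cold - gain * cold_frame
    return gain * raw + offset

# Simulated 4 x 4 FPA with pixel-to-pixel gain and offset spread.
rng = np.random.default_rng(7)
g_true = 1.0 + 0.1 * rng.normal(size=(4, 4))
o_true = 5.0 * rng.normal(size=(4, 4))
cold = g_true * 100.0 + o_true        # response to the cold blackbody
hot = g_true * 200.0 + o_true         # response to the hot blackbody
scene = g_true * 150.0 + o_true       # a uniform scene between the two
print(np.ptp(two_point_correction(scene, cold, hot)))  # ~0 after correction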
Image processing deals with the differences between adjacent scene pixels in a two-dimensional arrangement of detectors, and it generally includes the time-changing components of signals. Usually, a level of signal processing (such as nonuniformity correction) occurs before other higher order image processing functions, such as edge detection. A primary factor to remember is that signal processing is single-detector element or pixel-related, whereas image processing relates to multiple pixel configurations. Image processing also refers to operations that transform images into other images. Image recognition is the mapping of images into descriptions. Image transformations convert matrix A into another matrix B called the transform image. The image B is computed to extract features, which are characteristic elements (geometrical, IR light intensities) in the image, to classify objects or targets in the scene. The following are some basic image processing techniques and functions and their corresponding techniques (5): 1. Spatial Filtering Spatial filtering removes false targets (reduction of clutter) using a weighted average of the signal from neighboring pixels to determine a localized mean. This localized mean sets an adaptive threshold to determine whether the pixel belongs to a targetlike object. Temporal filtering performs the same in the time domain for each pixel. This process is equivalent to a type of spatial filter for a scanning system and a frame-to-frame subtraction for a staring sensor. Morphological filtering is a spatial filtering technique which uses a specific one-dimensional or two-dimensional operator to detect objects of a given spatial frequency. 2. Frame-to-Frame Subtraction In frame-to-frame subtraction, one frame is simply subtracted from another taken at a different time.
3. Streak Detection Streak detection is a higher order form of frame subtraction indicating moving targets. In streak detection, a pixel and its neighbors are compared from frame-to-frame in the time domain. 4. Image Formation Image formation consists of taking the FPA output and forms an image. IR target features used for classification are characterized by the fact that some are geometric and others statistical in nature. The recognition process involves a treetype decision process adapted to each set of target classes. Using those features by which each class is best discriminated from others, such a procedure first starts with welldiscriminated classes and ends with overlapping classes. Such a hierarchical classification procedure, represented by a tree structure, consists of specified feature extraction and decision rules. The individual decisions in a classification tree may rely on one or more of the following approaches (4): 1. Template matching of features derived, respectively, from feature extraction and learning 2. Statistical pattern recognition 3. Syntactic pattern recognition 4. Hybrid pattern recognition and artificial intelligence for scene understanding. INFRARED SYSTEMS AND ARRAY GEOMETRIES In order to form IR images, several array geometries are possible. There are several established methods for designing practical IR images. Table 1(10) summarizes the various architectures used in infrared systems and the distinguishing features of each generation of system. Focal-Plane Arrays The focal plane is perpendicular to the optic axis where the IR radiation is focused in an imaging system. An array of detectors located there is a focal-plane array (FPA). The focal-plane array consists of densely packed detector elements numbering up to 106 or more. A FPA may contain several FPA sections connected together. Occasionally, a focal plane is curved about the Petzval surface (57) using more than one focal-plane array. A group of focal-plane arrays is called a mosaic. The evolution of infrared detectors suitable for forward-looking infrared (FLIR) systems is shown diagrammatically in (11). FLIRs are IR images that form an image of the scene directly in front of the camera. Large arrays are usually made from photovoltaic detectors. The photovoltaic detector lends itself more to large arrays because of its lower thermal power dissipation. Photovoltaic operation also provides a square root of two improvement in noise over a photoconductive detector, simpler biasing, and more accurately predictable responsivity. Most focal-plane assemblies are constructed by using detector chips with, for example, 640 × 480 detector elements. These can be monolithic (detector and signal processor in a single semiconductor crystal) or hybrid (detector and signal processor in separate materials interconnected
by solder bump bonds and/or evaporated fan-out patterns) (5).

Uncooled Systems

Uncooled infrared focal planes are fundamentally different from cryogenically cooled systems (55). Uncooled focal planes are two-dimensional arrays of infrared detectors thermally isolated from their surroundings. The detectors respond to incoming infrared radiation by changing their temperature. The materials used for these detectors are chosen for certain unique properties, such as a resistance, pyroelectric polarization, or dielectric constant that varies sharply with temperature. A pyroelectric material has an inherent electrical polarization whose magnitude is a function of temperature (56). In a pyroelectric material, the polarization vanishes at the Curie temperature. The pyroelectric coefficient p is defined as the gradient of electrical polarization as a function of temperature. A simple pyroelectric detector consists of a wafer of this material with metal electrodes bonded on each face. The material is oriented so that the polar axis is perpendicular to the electrodes. As the temperature of the pyroelectric material is changed by incident radiation, its polarization changes in direct proportion to the magnitude of the temperature change. Charges accumulate on the electrodes, and a voltage develops across the faces of the pyroelectric material, just as in a capacitor. The charge or voltage residing on the material is a function of the incident irradiation by IR energy. An imaging sensor is realized with this detector array by adding a "chopper" to introduce a known change in the incident radiation and hence in the voltage across the material. Currently, two different types of detector, one ferroelectric and the other bolometric, are used for uncooled focal-plane arrays. The approach followed by Raytheon is to use a ferroelectric material. An array of detectors is made with a ferroelectric material having polarization and dielectric constants that change with temperature, resulting in a change in charge on a capacitor as the target scene varies. The detectors are fabricated from the ceramic material barium strontium titanate (BST) and then bump bonded to a silicon readout circuit. The bolometric approach uses a two-dimensional array of bolometers, resistors whose resistance changes by approximately 2% per °C of temperature change. The temperature-sensitive resistor material is vanadium oxide, suspended on a bridge of silicon nitride that is isolated thermally from the substrate containing the readout electronics. These uncooled focal-plane systems perform robustly enough to make them suitable for many military and commercial applications. An example of a road scene taken with a pyroelectric device is shown in Fig. 19. One development option is to reduce the detector size even further, thus producing better resolution with the same optics. Sensitivity improvements are also desirable and achievable. For applications where the sensitivity is already satisfactory, the increased sensitivity could be traded for slower (i.e., smaller, lighter, and less expensive) optics. Additional development is expected to result in decreased electrical noise and smarter focal planes (3, 5).
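As a back-of-the-envelope illustration of the pyroelectric relation just described (charge proportional to the pyroelectric coefficient, electrode area, and temperature swing), the following sketch estimates the signal from a single element. All numerical values are assumed, illustrative placeholders, not the parameters of any particular detector.

```python
# Minimal sketch: open-circuit signal of a single pyroelectric element.
# All values are assumed placeholders, not those of a specific device.
p_coeff = 3.0e-8        # pyroelectric coefficient, C / (cm^2 * K)  (assumed)
area = 50e-4 * 50e-4    # electrode area of a 50 um x 50 um pixel, cm^2
cap = 2.0e-12           # element capacitance, F (assumed)
delta_t = 0.05          # temperature swing produced by the chopped scene flux, K

charge = p_coeff * area * delta_t   # charge accumulated on the electrodes, C
voltage = charge / cap              # resulting open-circuit voltage, V
print(f"dQ = {charge:.3e} C, dV = {voltage:.3e} V")
```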
METRICS IN CLUTTERED IMAGE ENVIRONMENTS

Many perceptual measures used by the US Army (12, 13) and other armed forces exist for assessing the visibility of vehicles as seen by a human through an infrared sensor and electronic display. These measures have a limited range of applicability because they usually work only for a restricted class of images. The focus of this article is algorithms for optimal measures of thermal contrast, that is, target temperature minus background temperature (ΔT), and clutter. The detection of vehicles by an IR sensor is essentially the problem of detecting such vehicles in a cluttered background. The purpose of defining and quantifying clutter is to aid the development of more realistic human detection models. In one way or another, most of the work on clutter measures or metrics has depended on or grown from a metric defined by Schmieder and Weathersby (SW) (14). The SW clutter metric is written as follows:

\[ C_{\mathrm{SW}} = \left[ \frac{1}{N}\sum_{i=1}^{N}\sigma_i^{2} \right]^{1/2} \qquad (2) \]
where σ_i² is the variance of the ith cell and N is the number of cells. The SW clutter metric divides the image into a number of cells, each of which is scaled to represent twice the longest dimension of the target, and then the variance of each cell is divided by the total number of cells.

Clutter

The clutter and target-to-background temperature difference (ΔT) metrics are discussed first because, together, clutter and ΔT form the signal-to-noise ratio (SNR) used in evaluating the probability of detection (Pd). A large part of the framework for the present methodology for calculating Pd is based on signal detection theory (SDT). The use of SDT in an area of modeling requires a clear distinction between what is to be the signal and what is to be the noise. The delineation of the target in an image constitutes deciding what the signal is to be. Everything else is taken as the noise. As described further in this article, clutter is considered a noise term. Objects in the image which distract the observer from the target or get attention
because they are similar to the target are called clutter.

Clutter Metrics

Objects and background regions in an image that look similar to or that distract from the actual target(s) are known as clutter. To date there has been a lot of work to develop a quantitative measure of background clutter (15–30). Reynolds et al. (17) analyzed many scenes of northern rural Michigan and calculated the aforementioned SW clutter in those scenes as a function of field-measured environmental variables, such as wind speed, humidity, insolation, and temperature. For groups like the US Army, who are concerned with how vehicles appear in open terrain, the difficulty of controlling the parameters measured in the field motivates the use of computer modeling. Hence a large part of clutter research is geared toward finding insights to understand the interaction of environmental variables and clutter with the aid of computer modeling so as to reduce the need for taking field data. The goal is to link clutter with human perception and the probability of detection. There are many definitions of clutter currently used in the literature of image processing and target acquisition modeling (18, 19). Some of these metrics follow:

1. (Mean target) − (mean background) radiance, or ΔT, metric
2. Statistical variance metric (SW clutter)
3. Probability of edge (POE) metric
4. Complexity metrics

There is presently no definition available which is clearly the best in all cases or images. One primary objective of this section is a unified definition of clutter which takes into account the different definitions available in the literature. Many of the images used for study in this article were obtained from the US Army's Night Vision Lab (NVL) terrain board simulator. The terrain board simulator is a large room painted entirely black which has a scaled-down version of the terrain of a certain part of the world at the center of the room. Genuine infrared and visual sensors are then placed at certain positions. On the terrain board are placed scaled-down vehicles painted with an emissive coating to emulate the actual appearance of the vehicles as seen through night-vision cameras. Pictures are then taken with the mounted night-vision cameras and displayed on monitors in target perception experiments. As part of the review of the phenomenology associated with clutter metrics, a few of these metrics are described here. Currently used clutter and image metrics are the following: the Der metric, the POE metric, the Schmieder–Weathersby metric, and texture-based clutter metrics.

Der Clutter Metric. Originally the Der metric was devised as a method to predict the false alarm rate of a given algorithm.
The approach was the following: a double window was convolved one pixel at a time over the image. The size of the inner window was the maximum size of the largest target used at the time. These two features, minimum and maximum, were chosen arbitrarily. At each pixel location, the algorithm decides whether the new pixel is in the same intensity space as the one previously examined and then also whether it fits into the inner window. When an intense region of the image of approximately target size is found, that region is catalogued. The principle behind the Der method, then, is to multiply the distribution of the targetlike areas by the probability-of-detection distribution; the result should give the predicted false alarm rate for an algorithm with a given probability-of-detection distribution. If one simply counts the number of Der objects in the image, that number should indicate the number of targetlike objects in the scene and, hence, provide a measure of clutter (18).
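The published description of the Der procedure is qualitative, so the following Python sketch is only a loose interpretation of the double-window idea: an inner, target-sized window is compared against its surrounding ring, and connected regions whose inner mean clearly exceeds the local surround are counted as targetlike (Der) objects. The window sizes, the threshold constant k, and the use of a local standard deviation are assumptions, not details taken from (18).

```python
import numpy as np
from scipy.ndimage import uniform_filter, label

def der_like_count(image, target_size=11, border=4, k=2.0):
    """Loose sketch of a Der-style double-window clutter count."""
    img = image.astype(float)
    inner = uniform_filter(img, size=target_size)          # mean over inner (target-sized) window
    outer_size = target_size + 2 * border
    outer = uniform_filter(img, size=outer_size)            # mean over outer window
    # Mean of the border ring = (outer-window sum - inner-window sum) / ring area
    ring_area = outer_size**2 - target_size**2
    ring_mean = (outer * outer_size**2 - inner * target_size**2) / ring_area
    # Local spread of the surround, used as an adaptive threshold scale
    local_var = uniform_filter(img**2, size=outer_size) - outer**2
    local_std = np.sqrt(np.clip(local_var, 1e-6, None))
    mask = inner > ring_mean + k * local_std                 # "intense, target-sized" locations
    _, n_objects = label(mask)                               # connected targetlike regions
    return n_objects                                         # crude clutter count
```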
POE Metric. The probability of edge (POE) metric is meant to determine the relationship between the human visual detection system and the statistics of color or black-and-white images. First, the image under consideration is processed with a difference-of-offset-Gaussian (DOOG) filter and is thresholded. This procedure emulates the early-vision part of a human observer (20), basically just the retinal part of human color vision. Then the number of edge points is counted and used as the raw metric. The calculation proceeds as follows. First, the image is divided into blocks twice the apparent size of the target in each dimension. Then a DOOG filter, as described in (21), is applied to each block to emulate one of the channels in preattentive vision. The net effect is to enhance the edges. As discussed by Rotman et al. (22), the histogram of the processed image is normalized and then a threshold T is chosen on the basis of the histogram. The number of points exceeding the threshold in the ith block is computed as POE_{i,T}. Then the POE metric is computed similarly to the statistical variance technique:

\[ \mathrm{POE} = \left[ \frac{1}{N}\sum_{i=1}^{N}\mathrm{POE}_{i,T}^{2} \right]^{1/2} \]
Marr (20) and other vision researchers have recognized that preattentive vision is highly sensitive to edges.
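A minimal sketch of the POE computation described above follows. The DOOG channel is approximated here by a plain difference of Gaussians, and the block size, filter scales, and threshold percentile are assumed values rather than those of (21, 22).

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def poe_metric(image, block, sigma1=1.0, sigma2=2.0, pct=90):
    """Sketch of a probability-of-edge (POE) clutter metric.

    block : side of a square block, nominally twice the apparent target
            size in pixels (assumed to tile the image here).
    """
    img = image.astype(float)
    # Crude stand-in for one DOOG preattentive channel: difference of Gaussians.
    edges = np.abs(gaussian_filter(img, sigma1) - gaussian_filter(img, sigma2))
    thresh = np.percentile(edges, pct)            # threshold chosen from the edge histogram
    poe_blocks = []
    for r in range(0, img.shape[0] - block + 1, block):
        for c in range(0, img.shape[1] - block + 1, block):
            cell = edges[r:r + block, c:c + block]
            poe_blocks.append(np.count_nonzero(cell > thresh))
    poe_blocks = np.asarray(poe_blocks, dtype=float)
    # Aggregated like the statistical-variance (SW) technique: rms over blocks.
    return np.sqrt(np.mean(poe_blocks ** 2))
```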
Schmieder and Weathersby (SW) Metric. Schmieder and Weathersby (14) have proposed the concept of a root-mean-square (rms) clutter metric of the spatial-intensity properties of the background. It is one of the most commonly used clutter measures. The Schmieder and Weathersby clutter metric, shown in Eq. (2), is computed by averaging the variance of contiguous square cells over the whole scene. Typically the cell size is defined as twice the length of the largest target dimension. The signal-to-clutter ratio (SCR) of the image is then given by the average contrast of the target divided by the clutter computed in Eq. (2).
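For concreteness, a minimal sketch of the SW clutter of Eq. (2) and the resulting SCR is shown below; the image is assumed to be a 2-D array of radiance or gray-scale values, and the cell side is taken as twice the largest target dimension, as in the text.

```python
import numpy as np

def sw_clutter(image, target_dim):
    """Schmieder-Weathersby clutter, Eq. (2): rms of the per-cell variances."""
    cell = 2 * target_dim                       # cell side: twice the largest target dimension
    img = image.astype(float)
    variances = []
    for r in range(0, img.shape[0] - cell + 1, cell):
        for c in range(0, img.shape[1] - cell + 1, cell):
            variances.append(img[r:r + cell, c:c + cell].var())
    return np.sqrt(np.mean(variances))

def signal_to_clutter(image, target_mask, target_dim):
    """SCR: average target contrast divided by the SW clutter of the scene."""
    contrast = image[target_mask].mean() - image[~target_mask].mean()
    return contrast / sw_clutter(image, target_dim)
```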
Reynolds (15) showed that the variance in Eq. (2) is equivalent to the following:

\[ C_{\mathrm{SW}} = \left[ \frac{1}{N}\sum_{i=1}^{N}\frac{1}{k}\sum_{j=1}^{k}\left(x_{ij}-u_i\right)^{2} \right]^{1/2} \]
where N is the number of cells, k is the number of pixels per cell, x_{ij} is the radiance of the jth pixel in the ith cell, and u_i is the ith cell mean radiance. Doll (23) compared Eq. (2) with experimental detection times for observers looking at computer-generated images of rural scenes with embedded targets. Good correlation between the average detection time and SCR value was found. One of the fundamental problems of computer-based vision is that the contrast metrics are valid only for a limited group of images.

Texture-Based Clutter. As mentioned previously, the purpose of defining and quantifying clutter is to aid the development of more realistic human-detection models. Clutter is sometimes defined in the sense that areas of similar texture contribute to the distractive capability or clutter of a scene. Textural measures of a scene are potentially very powerful metrics for extracting fine-level contrast differences in an image and play a crucial role in modeling human visual detection. The scene used for this study of texture and clutter was from the standard scene collections of the Keweenaw Research Center of Michigan Technological University, KRC83-1-1.dat. The thermal image contains an M60 tank in a field with a tree background. The basic scene was altered with the Geographic Resources Analysis Support System (GRASS) software package so that there were three scenes with different amounts of clutter, low, medium, and high, respectively. Low, medium, and high clutter were defined as equal to one, three, or nine false targets in the field of view (FOV) containing the vehicle (23). In Fig. 2, pictures A through C are the original input cluttered IR scenes, and pictures D through F are the computer-modeled IR sensor fields of view of the low, medium, and high cluttered scenes, respectively. The three images mentioned were read into the TARDEC Thermal Image Model (TTIM) (24). TTIM is a graphics workstation model that uses measured or simulated target-in-scene IR imagery as input. TTIM has been used to simulate the image degradation introduced by atmospheric and sensor effects (25, 26) at a specified range. TTIM transforms the degraded radiance values on a pixel-by-pixel basis to gray-scale values for display on a workstation for observer interpretation and/or further processing. Equation (2) was used to calculate the variance-based clutter in the images, also known as the Schmieder–Weathersby (SW) clutter. The cell size chosen was twice the maximum target dimension, which was always the length in pixels of the vehicle. The clutter and textural data were then input into statistical and graphing tools for further analysis. The standard SW definition of clutter was modified to include image texture by calculating the gray-scale texture for each cell and then using that texture in place of the variance. The method of calculating texture was based on the equation given in (27).
To compute the SW variance-based clutter, the image is divided into cells twice the maximum target dimension. Then the variance is computed for each cell and aggregated, as described by (30). At a given resolution, a block of pixel values is recorded and the variance of the gray-scale values is formed. Then the rms is taken over the entire image. The equation modified for the digital computation of textural clutter is shown in Eq. (5):
where a and b are any two gray-scale values at any pixel (j, k) in the image matrix, r is the step size, and θ is the direction of the steps (θ equals 0 means stepping horizontally). P is the fraction of the total number of steps whose end points are the gray-scale values a and b, respectively. The mean texture is defined as follows:
Defined in the above manner, texture is considered the spread about the diagonal of the cooccurrence matrix (28). As the clutter in the image increases, the spread of the wings of the central region in the textural plot increases, indicating an increase in texture, as would be expected. The modified clutter equation that includes mean texture follows:
The performance of the textural metric as a function of several variables has been described in (28). The first data set consisted of four variables: (1) wind speed, (2) turbulence, (3) humidity, and (4) temperature difference between the target and background. The second data set included spectral bandpass, turbulence, rain rate, and temperature difference between the target and the background. The experimental design was chosen to include all of the variables so that the fewest simulations would be required. A fifth variable, texture, was computed for each point in the experimental design. Clutter was modeled as a second-order response surface in the original five variables. Stepwise regression was used to build the model, and residual analysis was done to check the assumptions of the model. Some general remarks concerning the nature of the images generated by the sensor and atmospheric models used are appropriate. For the second parameter set, gray-scale levels of the output images were manually scaled. In other words, for each clutter class, the minimum and maximum temperature values for the extreme parametric cases were recorded in advance to provide a temperature/gray-scale range for the entire image set. Otherwise, the autoscale function would be used and the image contrast would always be maximized for each image rather than having a gray scale appropriate to the entire range of pixel temperatures for the parameter variations. The clutter and textural calculations all used the gray-scale values written from the screen to the hard disk.
Figure 2. A, B, C: original images with low, medium, and high clutter; D, E, F: the corresponding TTIM simulations.
For all clutter classes, low, medium, and high, the calculated clutter values decreased as the parameters increased. The modeled sensor scene became more homogeneous, or lower in contrast, and the target was more difficult to distinguish. Thus the probability of detection can decrease even while the clutter, defined as a variance of gray-scale values in a cell, is decreasing. The results of the simulations showed that for the SW clutter metric, temperature difference, rain rate, and texture were the most important parameters. For the texture-based clutter metric, temperature difference, rain rate, and spectral bandpass were important. When the SW clutter was plotted versus the texture-based clutter metric, the result was a straight line. This result is unexpected because the two methods of calculating clutter are very different. The SW clutter metric is a variance of the pixel gray-scale values in the image, and the texture-based clutter metric involves the cooccurrence matrices of the gray-scale values in the image. These two measures may give nearly similar results when the amount of texture in the scene is small. Perhaps by increasing a feature in the scene, such as the periodicity or scene complexity, the two measures would generate more divergent results. A representative graph of the two measures for the highly cluttered scene is shown in Fig. 3.
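A minimal sketch of the texture-based variant discussed above follows: for each cell a gray-level co-occurrence distribution P(a, b) is accumulated for horizontal steps of size r, the cell texture is taken as the spread about the diagonal, Σ(a − b)²P(a, b), and the per-cell textures are aggregated by rms in place of the per-cell variances of Eq. (2). The step size, direction, and gray-level quantization are assumed choices, and 8-bit input imagery is assumed.

```python
import numpy as np

def cell_texture(cell, r=1, levels=32):
    """Spread about the diagonal of the co-occurrence matrix (horizontal steps)."""
    # Quantize assumed 8-bit gray levels into `levels` bins.
    q = np.clip((cell.astype(int) * levels) // 256, 0, levels - 1)
    a = q[:, :-r].ravel()                  # gray level at the start of each step
    b = q[:, r:].ravel()                   # gray level r pixels to the right
    p = np.zeros((levels, levels))
    np.add.at(p, (a, b), 1.0)
    p /= p.sum()                           # fraction of steps with end points (a, b)
    i, j = np.indices(p.shape)
    return np.sum((i - j) ** 2 * p)        # texture ~ spread about the diagonal

def texture_clutter(image, target_dim, r=1):
    """SW-style clutter with per-cell texture substituted for the variance."""
    cell = 2 * target_dim
    vals = [cell_texture(image[y:y + cell, x:x + cell], r)
            for y in range(0, image.shape[0] - cell + 1, cell)
            for x in range(0, image.shape[1] - cell + 1, cell)]
    return np.sqrt(np.mean(vals))
```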
Figure 3. Plot of textured clutter versus SW clutter.
Relative Metrics in the Probability of Detection

Rotman et al. (29) review the NVL model for computing the probability of detection Pd and suggest a way to include clutter in the algorithm for Pd. However, no mention is made of how to compute the properly scaled clutter factors alluded to in the paper. A method for obtaining clutter factors based on other validated clutter measures that can be used in this equation is described next. As described by Rotman et al. (29) and Gerhart et al. (34), the probability of acquisition of a target as a function of time is given by the following:

or

In Eq. (9), ρ is an estimate of the target acquisition probability over an infinite amount of time when the target is in the field of view, and
where n is the number of resolvable cycles across the target, n50 is the number of cycles required for P∞ to equal 0.5, E is equal to 2.7 + 0.7 (n/n50 ), and CF is a clutter factor. Clutter is a term that refers to the psychophysical task of perceiving objects in a scene in the presence of similar objects (15). Clutter factor refers to the number used to represent how many target-like objects there are in the image which confuse the viewer as to the location of the target. Rotman (29) has shown that
Hence, Eq. (9) can be written as a function of ΔT and the clutter factor CF, or the probability of detection P(t) can now be computed as a function of ΔT, the range from the target to the sensor, the atmospheric condition, and the sensor system parameters. Figure 4, courtesy of Dr. Barbara O'Kane of the Night Vision Electrooptic Sensor Directorate (NVESD), is one of the images used in this study. The target is in the foreground.

EQUIPMENT AND EXPERIMENTAL METHODS

Throughout this section, reference has been made to experimental values of Pd. A discussion is now given of what this entails. The experimental determination of the Pd for a target in a cluttered image is determined by using displayed images on a rear-projection screen or computer monitor. Basically, the observer sits at a distance of 1 meter away from a large, high-resolution display monitor in a darkened room or area and glimpses the images for about 1 s. The subject is given training before the tests and is told what to expect in the way of imagery and targets. All of the visual detection tests test foveal detection, that is, detection of an object in a scene directly in the line of sight of the subject. Foveal detection is important for static scene detection, whereas peripheral viewing is often important for detecting motion. The observers are then asked whether they did or did not see a target of interest (military or otherwise), and/or if the observers are not sure and would like to see the image(s) again. This is called a forced-choice experimental design. The images in a particular data set are displayed at random, and the responses of the individual subjects are tallied electronically and used to calculate a probability of detection and of false alarm of the target of interest for the given experiment and for that particular subject. A typical collection of subjects is 10 to 15 and a typical number of images is 15 to 100. The separate subjects' responses, or Pd's, for each target are also used to calculate a population mean probability of detection and probability of false alarm for each target. Psychophysical detection experiments have been performed, using this pilot test lab, for part of the US Army's visual acquisition model validation and verification programs and for a Cooperative Research And Development Agreement (CRADA) between TARDEC and an automobile company on vehicle conspicuity. Throughout the tests mentioned, a small cubicle was used as the experimental area. A schematic of the test setup is shown in Fig. 5. Recently at TARDEC, a visual perception laboratory (VPL) large enough to enclose vehicles and three large rear-projection screens has been built to perform visual and infrared detection tests. A later section describes and shows the TARDEC VPL.

WAVELET TRANSFORMS OF IR IMAGES

Introduction
There are many important characteristics of wavelets that make them more flexible than Fourier analysis. Fourier basis functions are localized in frequency but not in time. Small frequency changes in a FT cause changes everywhere in the time domain. Wavelets can be localized in both frequency position and in scale by dilation and in time by translations. This ability to perform localization is useful in many applications (31). Another advantage of wavelets is the high degree of coding efficiency or, in other words, data compression available. Many classes of functions and data can be represented very compactly by wavelets. Typically the wavelet transforms are computed at a faster rate than the fast Fourier transforms (FFTs). The data is basically encoded into the coefficients of the wavelets. The computational complexity of the FFT is of the order of n log n, where n is the number of coefficients, whereas for most wavelets, the order of complexity is of the order n. Many data operations, such as multiresolution signal processing, can be done by processing the corresponding wavelet coefficients (31). The basic flow of processing in wavelet analysis is shown in Fig. 6. Wavelet Transforms and Their Use with Clutter Metrics Wavelets and wavelet transforms are essentially an elegant tool that can be applied to image processing. Wavelets are used for removing noise or unwanted artifacts from IR images as well as acoustic data (31, 32) or for localizing certain target cue features. There are many definitions of clutter currently used at the moment (30). However, none take into account the multiresolution capability of wavelets. A way to do this is by using edge points determined by wavelets (30, 33). The Pd can be computed using the signal-to-clutter ratio (SCR) as in the classical theory. Then the computed Pd values can be correlated to the experimentally determined Pd for the target. As discussed in (22, 34), the classical algorithm for the determination of the Pd of a target in an image is shown in Eqs. (8), (9), (10), and (11). The idea proposed by (30) is to use wavelets in the clutter factor so that the noise term is replaced by an interference term that includes noise plus clutter. The wavelet probability of the edge (WPOE) algorithm was applied to many IR images with a combination of a personally developed code and the software package XWAVE2 (35) installed on a Silicon Graphics Indigo2 workstation running IRIX 5.03. For input images, Night Vision Lab’s (NVL) terrain board images, developed and shared by Dr. B. O’Kane, C. Walters, and B. Nystrom (12) were used. A sample of this data from the NVL set is shown in Fig. 4. The image was segmented by cell, and a wavelet transform was applied to
Figure 4. NVESD terrain board image (from Dr. O’Kane).
Figure 5. Pilot perception test setup.
Figure 6. Flow of processing in wavelet analysis of IR images.
Figure 7. NVESD image level 1.
find the number of edge points at a particular scale. Figures 7,8, and 9 are the wavelet transforms of the entire IR image. After processing the image with the wavelet filters, what is required is the number of edge points above a certain threshold in each of the cells of the wavelet image and at a particular resolution level. A cell is composed of an array of pixels taken together generally forming a square or rectangular figure. With the number of edge points per cell and the total number of points or pixels in the cell, the
WPOE can be computed. The WPOE clutter metric is used in the denominator of the SNR to compute the probability of detection of the target. For the IR case, the signal is the difference of the mean temperatures of the target and background, and the noise term is a clutter metric, such as the POE, WPOE, or rms clutter. A new metric, the WPOE metric, and an algorithm for computing clutter in infrared and visual images were presented in (30). There are some problems that must be resolved, such as thresholding and
Figure 8. NVESD image level 2.
Figure 9. NVESD image level 3.
further reduction of the cell size, but the method of the WPOE metric is of potential use in image analysis.
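A minimal sketch along the lines of the WPOE computation is given below, using the PyWavelets package; the wavelet family, decomposition level, threshold percentile, and cell size are assumed choices, not necessarily those of (30). The returned value would sit in the denominator of the SNR, from which Pd follows via the model of Eqs. (8)–(11).

```python
import numpy as np
import pywt

def wpoe_clutter(image, cell, level=1, wavelet="haar", pct=90):
    """Sketch of a wavelet probability-of-edge (WPOE) clutter metric.

    Edge strength is taken from the detail coefficients at the requested
    decomposition level; the fraction of strong edge points is found per
    cell, and the per-cell values are aggregated by rms, as with POE.
    """
    coeffs = pywt.wavedec2(image.astype(float), wavelet, level=level)
    ch, cv, cd = coeffs[1]                      # detail bands at the coarsest requested level
    edge = np.sqrt(ch**2 + cv**2 + cd**2)
    thresh = np.percentile(edge, pct)
    c = max(cell // 2**level, 1)                # cell size in the subsampled wavelet plane
    fracs = []
    for r in range(0, edge.shape[0] - c + 1, c):
        for s in range(0, edge.shape[1] - c + 1, c):
            block = edge[r:r + c, s:s + c]
            fracs.append(np.count_nonzero(block > thresh) / block.size)
    return np.sqrt(np.mean(np.square(fracs)))   # WPOE clutter; SNR ~ delta_T / WPOE
```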
FUZZY LOGIC AND DETECTION METRICS FOR IR IMAGES

It has been three decades since L. A. Zadeh first proposed fuzzy set theory (logic) (36). Following Mamdani and Assilian's pioneering work in applying the fuzzy logic approach (FLA) to a steam engine in 1974 (37), the FLA has been finding a rapidly growing number of applications. These applications include transportation (subways, helicopters, traffic control, and air control for highway tunnels), automobiles (engines, brakes, transmission, and cruise control systems), washing machines, dryers, TVs, VCRs, video cameras, and other industries including steel, chemical, power generation, aerospace, medical diagnosis, and data analysis (38–42). Although fuzzy logic encodes expert knowledge directly and easily using rules with linguistic labels, it usually takes some time to design and tune the membership functions which quantitatively define these linguistic labels. Neural network learning techniques can automate this process and substantially reduce development time and cost while improving performance. To enable a system to deal with cognitive uncertainties in a manner more
like humans, researchers have incorporated the concept of fuzzy logic into the neural network modeling approach. A new approach to computing the probability of target detection in infrared and visual scenes containing clutter by fuzzy logic is described in (30). At present, target acquisition models based on the theory of signal detection are not mature enough to robustly model the human detection of targets in cluttered scenes, because our awareness of the visual world is a result of perceiving, not merely detecting, the spatiotemporal, spectrophotometric stimuli transmitted onto the photoreceptors of the retina (43). The computational processes involved with perceptual vision can be considered the process of linking generalized ideas or concepts to retinal, early visual data (43). These ideas or concepts may be various clutter or edge metrics and luminance attributes of a military vehicle or automobile. From a systems-theoretical viewpoint, then, perceptual vision involves mapping early visual data into one or more concepts and then inferring the meaning of the data based on prior experience and knowledge. Even in the case of IR images, the observer is looking at a picture of the scene displayed on some form of a monitor or display. The approaches of fuzzy and neuro-fuzzy systems will provide a robust alternative to complex semiempirical models for predicting observed responses to cluttered scenes. The
fuzzy-based approaches have been used to calculate the Pd of vehicles in different infrared scenes.

Probability of Detection (Pd) for Ground Targets

Software that models the relationships among the various factors affecting the determination of the probability of detection is currently under development. Some of these modeling factors are introduced in (57) to demonstrate the capability of the fuzzy and neuro-fuzzy approaches in predicting Pd with a 0.9 correlation to experimental values. The Pd value can be determined with the FLA using input parameters for the images shown in Fig. 4. The one output parameter is Pd. Figure 10 shows the FLA-computed Pd's and the experimental Pd's for comparison. The correlation of the experimental Pd to the FLA-predicted Pd is 0.99. This result indicates the power of using the FLA to model highly complex data for which there would be many interrelated equations if one tried to model the detection problem as shown in Eqs. (8)–(11).

APPLICATIONS OF INFRARED IMAGING SENSORS

Uncooled Pyroelectric IR Sensors for Collision Avoidance

Collision avoidance and vision enhancement systems are seen as an integral part of the next generation of active automotive safety devices (44, 45). Automotive manufacturers are evaluating the usefulness of a variety of imaging sensors in such systems (44). One potential application for automobiles is a driver's vision-enhancement system (53). This use of night-vision sensors as a safety feature would allow drivers to see objects at a distance of about 450 m, far beyond the range of headlights. Obstacles in the driver's peripheral visual field of view could be seen and recognized much sooner. Sensors that operate at wavelengths close to the electromagnetic frequency band of human vision (such as video cameras) provide images with varying degrees of resolution. However, the quality of the images (in terms of relative contrast and spatial resolution) acquired by such a camera degrades drastically under conditions of poor light, rain, fog, smoke, etc. One way to overcome such poor conditions is to choose an imaging sensor that operates at longer (than visual) wavelengths. The relative contrast in images acquired from such sensors does not degrade as drastically with poor visibility. However, this characteristic comes at a cost: the spatial resolution of the image provided by such sensors is less than that provided by an inexpensive video camera. Passive infrared sensors operate at a wavelength slightly longer than the visual spectrum. (The visual spectrum is between 0.4 and 0.7 µm, and the commonly used portions of the infrared spectrum are in the atmospheric "windows" that reside between 0.7 and 14 µm.) Hence the IR sensors perform better than video cameras (in terms of relative contrast) when visibility is poor. Also, because their wavelength of operation is only slightly longer, the quality of the image provided by an infrared sensor is comparable to that of a video camera (in terms of spatial resolution). As a result, infrared sensors have much potential
for use in automotive collision avoidance systems (44, 46). Of all the different types of infrared detector technologies, this article considers two state-of-the-art infrared detectors that offer beneficial alternatives to an infrared sensor system for automotive and surveillance applications. The first alternative is based on a cooled FPA of CMOS PtSi infrared detectors that operate in the 3.4 to 5.5 µm wavelength band. The second alternative is based on a staring, uncooled, barium strontium titanate (BST) FPA of ceramic sensors that operate in the 7.5 to 13.5 µm wavelength band. Under clear atmospheric conditions and at ranges less than 500 m, the 3.4 to 5.5 µm systems generate images with less contrast than the 7.5 to 13.5 µm systems. Dual-band field data show that the 3.4 to 5.5 band systems present more contrast between temperature extremes, whereas the 7.5 to 13.5 band systems show more detail in the overall picture. The TACOM Thermal Image Model (TTIM) is a computer model that simulates the appearance of a thermal scene seen through an IR imaging system (24). The TTIM simulates the sampling effects of the older single-detector scanning systems and more modern systems that use focalplane, staring arrays. The TTIM also models image intensifiers. A typical TTIM simulation incorporates the image degrading effects of several possible atmospheric conditions by using low-resolution transmission (LOWTRAN), a computer model of the effects of atmospheric conditions on thermal radiation that was developed at the United States Air Force’s Geophysics Laboratory. A particularly attractive feature of the TTIM is that it produces a simulated image for the viewer, not a set of numbers as some of the other simulations do. We refer the reader to Fig. 11 for a schematic representation of the TTIM. Examples of the usage of the TTIM are shown in Figs. 12,13, and 14. In Fig. 12, a CAD file of a tank is shown at the top, and the two lower images show the scene as seen through a TOW IR Imager at a certain range and then with rain added. Figure 13 shows how the TTIM compares different metrics used to quantify the visibility of a target. Figure 13 shows that simple metrics that assume a uniform target and background do not work. Figure 14 shows how the TTIM compares the performance of short and longwave IR imagers in the presence of rain. The TTIM has been used to simulate cooled and uncooled IR imaging systems and to compare their performance from the standpoint of automotive applications. Analogous comparisons exist in the current literature (47, 48). The TTIM and NAC-VPM together allow comparing the performance of the two IR systems in terms of how good the quality of their images is for subsequent human perception/interpretation. Given that we have two images of the same scene, captured by using the two different infrared systems, we use the NAC-VPM to assess which of the two is better. The NAC-VPM is a computational model of the human visual system (49). Within the functional area of signature analysis, the unclassified model consists of two parts: early human vision modeling and signal detection. The early visual part of the model itself is made up of two basic parts. The first part is a color separation module, and the second part is a spatial frequency decomposition module. The color separation module is akin to
Figure 10. Graph of expected versus FLA predicted detection probabilities.
Figure 11. Schematic of the TTIM computer model.
the human visual system. The spatial frequency decomposition system is based on a Gaussian–Laplacian pyramid framework. Such pyramids are special cases of wavelet pyramids, and they represent a reasonable model of spatiofrequency channels in early human vision (50). See Fig. 15 for a schematic representation of NAC-VPM. Figure 16 shows how the TTIM and NAC-VPM together are used to determine which spatial frequencies and features of a vehicle are the most visible. Simulation of Infrared Sensors This section presents the simulation of cooled and uncooled infrared imaging systems using the TTIM. Specifically, the input to the TTIM was the actual thermal images of commercial vehicles in a typical road scene, which were resampled using the TTIM. The initial infrared images were taken at TARDEC with the pyroelectric sensor from Raytheon. Examples of the way rain affects the qual-
ity of the sensor displayed image are presented. Target is synonymous with the object of interest in the scene and no target means the image with the object of interest removed. This type of simulation is a substantial first step to providing a means for comprehensively evaluating and comparing sensor systems. The ability to simulate the sensors provides a means for exactly repeating imaging experiments and measurements, difficult to achieve in field trials. Also, the ability to simulate the sensors provides the ability to exercise control over the imaging conditions. In the cooled infrared systems, for example, it is important to provide proper temperature shielding during field trials. Otherwise, the quality of the images acquired from the infrared system is badly affected, and it negatively affects the validity of subsequent comparisons between sensor systems. By simulating cooled infrared systems, such difficulties can be avoided. Figure 17 shows simulated infrared images of typical commercial vehicles when the viewing distance (the
Figure 12. CAD image of tank and simulated range and rain degradation using TTIM.
Figure 13. Comparison of thermal target metrics using TTIM.
distance between the vehicle and the sensor) is fixed and the amount of rainfall under which the image is acquired increases. This is done for both the cooled and uncooled cases by inputting into the TTIM the thermal image containing the target and no-target image. The images have been resampled according to the specific sensor and then degraded by rain and fog. The uncooled images are in the
left column and the cooled images are in the right column. The top row is the clear case with and without the object of interest, which is the car at the center of the picture. The range for all the pictures is 70 m. The second row is for the case of fog. As one goes down the columns of images, the rain rate is 1, 12.5, 25, 37.5, and 50 mm/h, respectively. The images show that the longwave, uncooled camera provides
Figure 14. Dual-band study using TTIM.
a higher contrast picture under all conditions. The system (courtesy of Mr. Sam McKenney of Raytheon) used by the authors to take the original data is shown in Fig. 18. Figure 19 shows simulated images from the uncooled sensor that were resampled to represent ranges at 30 to 90 m. Sensor Image Comparison The NAC-VPM was used to compare the quality of images acquired from the cooled and the uncooled, infrared imaging systems through rain and fog. With the NAC-VPM one may obtain the SNR and a psychophysical measure of detectability d (58) in each of the images for a vehicle of interest. The input to the TVM was the target and no-target images, corresponding to the infrared systems (54). In Fig. 20 the highest SNR of all frequency channels is plotted as a function of the rain rate. The curve in Fig. 20 with the higher whole image SNR is that of the uncooled pyroelectric FPA. The image with the higher SNR has a greater contrast and is easier to interpret. The object of interest in a scene with the higher d has a higher conspicuity and is therefore easier to see. Figure 21 shows the predicted visibility of the vehicles when viewed through the sensors and atmosphere. The two curves in Fig. 21 are the detectabilities of the target vehicle predicted by the visual model. For this particular case, the conspicuity of the target seen through the uncooled 7.5 to 13.5 band has the higher predicted conspicuity. Using the TTIM, one may successfully simulate both infrared imaging systems. The 7.5 to 13.5 band has more background radiance in the scenes which
tends to add more gray to the image as the rain rate increases, whereas the 3.4 to 5.5 band gets grayer with increasing rain rate primarily because of radiance loss due to scattering. These model predictions are consistent with infrared field images of test patterns through both bands in the rain. Scattering losses are compounded by the shape of the Planck blackbody distribution: the blackbody curves at a temperature of 300 K show that the 7.5 to 13.5 band has almost a factor of 2 more energy. By using the NAC-VPM, the two sensors were compared. In each of the spatial frequency channels found in early vision among humans, a measure of detectability for an object and background of interest is found. The SNR versus rain rate for both the sensors can be plotted, and the variation in the SNR, as the amount of rainfall under which the images are acquired increases, is obtained. Simulations show that (1) the 7.5 to 13.5 band has more exitance than the 3.4 to 5.5 band, (2) the transmittance in rain differs between the two bands by nearly a factor of 1.5, and (3) the uncooled imagery was excellent in quality; taken together, these results indicate that the 7.5 to 13.5 band uncooled pyroelectric sensor is the better multipurpose sensor. In addition, the unit used for data collection was in fact several years old, and there has since been a 50% increase in detector sensitivity along with improvements in the detector uniformity and system implementation. Sensor comparisons are one aspect of collision avoidance and vision enhancement. There are a number of other human factors and social issues associated with the science of collision avoidance, as pointed out in (45).
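The band comparisons above ultimately rest on integrating the Planck blackbody exitance over each sensor band. The sketch below shows one way to carry out that integration at 300 K for the two bands quoted in the text; the ratio obtained depends on exactly how the comparison is framed (total in-band exitance, in-band exitance difference against a background, or the thermal derivative that governs target contrast), so the printed numbers are illustrative only.

```python
import numpy as np

H = 6.626e-34    # Planck constant, J*s
C = 2.998e8      # speed of light, m/s
KB = 1.381e-23   # Boltzmann constant, J/K

def planck_exitance(wavelength_m, temp_k):
    """Spectral radiant exitance M(lambda, T) of a blackbody, W / (m^2 * m)."""
    a = 2.0 * np.pi * H * C**2 / wavelength_m**5
    return a / np.expm1(H * C / (wavelength_m * KB * temp_k))

def band_exitance(lo_um, hi_um, temp_k, n=2000):
    """In-band exitance, W/m^2, by simple numerical integration."""
    lam = np.linspace(lo_um, hi_um, n) * 1e-6
    return np.trapz(planck_exitance(lam, temp_k), lam)

mwir = band_exitance(3.4, 5.5, 300.0)    # 3.4-5.5 um band at 300 K
lwir = band_exitance(7.5, 13.5, 300.0)   # 7.5-13.5 um band at 300 K
print(f"MWIR: {mwir:.1f} W/m^2, LWIR: {lwir:.1f} W/m^2, ratio: {lwir / mwir:.1f}")
```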
Figure 15. Schematic of NAC-VPM computer model.
Figure 16. Use of TTIM and NAC-VPM to determine spatial frequencies of greatest visibility of a baseline vehicle.
TARDEC Visual and Infrared Perception Laboratory (VPL)

The TARDEC National Automotive Center (NAC) is developing a dual-need visual and infrared perception laboratory as part of a Cooperative Research and Development Agreement (CRADA) between the Army Materiel Command (AMC) and local auto companies. Dual use has become an important term which means that the technology
developed by the US Army could be used by the civilian sector, and vice versa. There are many applications of IR imagers; in particular, they are used for defense and civilian collision avoidance applications. As part of the performance assessment of the imagers, a perception laboratory can be used to determine the performance of various sensors in terms of enhancing the scene in the field-of-view of an observer. TARDEC researchers are using this facility
Figure 17. Cooled and uncooled sensor images at fixed range.
Figure 18. Texas Instrument uncooled thermal imager.
to calibrate and validate human performance models for evaluating visual collision avoidance countermeasures for commercial and military vehicles on the nation’s highways. The laboratory is also being used to collect baseline data on the human visual perception of various types of ground vehicles and treatments to those vehicles. Figure 22 shows a schematic of the fully automated data acquisition and analysis hardware used in the laboratory. A unique capability of the laboratory is the magnetic head tracker attached to the observer which relays signals to the control computer for the correct image display at an
appropriate time during the intersection search scenario. Additional recently upgraded capabilities include a combination headtracker and eyetracker to record instantaneous foveal fixation relative to the scene. The existing laboratory configuration could also be used to present infrared images to observers. The laboratory experiments have several advantages over field-test exercises including better control over observer stimuli, larger sample sizes, and lower cost. In particular, laboratory perception tests offer a viable and economic way of augmenting and complementing field test data by using image simulation techniques to extend the
Figure 19. Uncooled imager simulation at four ranges.
Figure 20. Image SNR versus rain rate.
Figure 21. Conspicuity versus rain rate.
range of conditions and target/background signatures beyond the original field test conditions. These techniques are particularly useful for virtual prototyping of military and civilian applications. An example of dual-need functionality of the VPL is that the recent calibration of the NAC-VPM in the laboratory for automotive use also worked well for camouflaged vehicles.
Visual Perception Laboratory Facilities

Figure 23 shows a view of the main test area viewed through the control room window in the laboratory. The entire facility consists of a 2500 ft² area which can accommodate vehicles ranging in size up to the Bradley IFV. This scene also shows a half-car mock-up used in a perception experiment, surrounded by the three video projection screens which display the driver's front, left, and
Figure 22. Current equipment setup for the visual perception laboratory.
Figure 23. TARDEC VPL.
Figure 24. Visual scene in TARDEC VPL as seen by driver.
right views of the intersection traffic. Figure 24 shows a visual scene containing camouflaged targets located along a tree line, depicted from behind the driver's head position through the front windshield of a HMMWV. Visual perception experiments conducted with such scenes will allow Army researchers to study wide field-of-regard (FOR)
search-and-target-acquisition (STA) strategies for military vehicles. Further upgrades to the VPL scheduled to occur during the next year will be the addition of three screens and LCD projectors to provide a five-screen wraparound effect and wide field-of-regard as required for search and target
acquisition performance assessment of ground vehicle systems.

Infrared Imaging for Ice Detection and Situational Awareness

Following are two examples of the use of different parts of the infrared spectrum, near and far, for remote nondestructive testing and increased situational awareness. The first example describes an infrared system for remote detection of ice, and the second describes a fused infrared and visible system for situational awareness. As part of a Space Act Agreement between NASA-Kennedy Space Center (KSC) and the U.S. Army Tank Automotive Research, Development and Engineering Center (TARDEC), members of TARDEC's Visual Perception Lab (VPL) performed a technology search and laboratory evaluation of potential electro-optical systems capable of detecting the presence and determining the thickness of ice on Space Transportation System (STS) External Tank (ET) Spray-On-Foam Insulation (SOFI); see Figure 25. The SAA and subsequent evaluation activity resulted from discussions between NASA-KSC and the Army following the Columbia Shuttle accident. NASA sought a fresh approach to seemingly intractable ice accumulation and assessment problems that had plagued them since the earliest days of the STS Program. The VPL team, having expertise in imaging sensors, desired to contribute in some way to NASA's Return to Flight planning and accomplishment. Previous research by VPL investigators, following earlier NASA inquiries, indicated that it might be possible to detect and image ice-covered areas with an infrared (IR) camera. In addition, it was realized that methods were needed to detect clear ice (transparent to the naked eye) and to discriminate between ice, frost, and water on ET SOFI surfaces. A technology search by members of the VPL resulted in the selection of two electro-optical systems using infrared as candidates for further investigation. The VPL comparison of these systems, testing, and analyses was the subject of the first report submitted to NASA-KSC in June 2004. As a result of that report, VPL investigators and NASA engineers determined that a system developed by MacDonald, Dettwiler and Associates Ltd. (MDA, formerly known as MDR) of Canada offered the greatest potential to support T-3 hour ice debris team detection and evaluation activities on the launch pad prior to STS launches. NASA's initial desire was that the system be capable of detecting ice with the thickness and diameter of a U.S. quarter (approximately 1/16 inch [0.0625 inch] thick and one inch in diameter), in essence the Launch Commit Criteria (LCC) for safe vehicle ascent. In addition, the system was to be passive (without emissions), portable for use by the NASA ice debris detection team on access platforms at T-3 hours, and able to meet launch complex safety requirements (i.e., be explosion proof and within EMI/EMC limits).

Description of the MDA System

The MDA system uses a low-power near-infrared xenon strobe to illuminate a surface on which there may be ice,
in this case, ET SOFI. After illumination of the SOFI surface, electromagnetic energy is reflected back and focused onto an IR (1.1 to 1.4 micron) sensor (Gregoris, 60). An uncooled focal-plane array sensor provides inputs to a linked computer. Based on the electromagnetic theory of reflection of light at the surface of a dielectric (ice in this case), the computer estimates the thickness of ice, if present. Various ice thickness ranges are color-coded (e.g., blue = 0.020–0.029 inch, green = 0.030–0.039 inch, yellow = 0.040–0.049 inch, red ≥ 0.050 inch) (59) on the system monitor to help the operator interpret values quantitatively. A circular "bulls-eye" is shown on the system display to align a small target area. The average measured ice thickness from pixels located in the bulls-eye (an 8 × 8 block of 64 pixels) is displayed on-screen in a field labeled "tkns in" (i.e., thickness in inches). The system and its components (sensor, VHS recorder, and battery power supply) are contained in N2-purged enclosures and are mounted on a two-wheeled cart provided by NASA, as shown in Figure 26. Referring to Figure 27, as light is incident on a thin dielectric (e.g., ice), a fraction of the light is reflected at the air/dielectric interface, and the rest of the light is transmitted through the dielectric. The transmitted fraction propagates through the dielectric until it reflects off the substrate. The light reflected off the substrate returns through the dielectric until it reaches the dielectric/air interface, where it is again partially reflected into the dielectric and the air. Some absorption of the light occurs as it travels through the dielectric. The internal reflection continues until all the light is absorbed completely by the dielectric. For a dielectric of thickness d, the effective reflectance R_e(λ, θ_i) of the dielectric layer is given by Eq. (12):

\[ R_e(\lambda,\theta_i) = R(\lambda,\theta_i) + \frac{R_w(\lambda)\,[1 - R(\lambda,\theta_i)]^2\, e^{-2a(\lambda)d}}{1 - [R_w(\lambda)\,R(\lambda,\theta_i)]^2\, e^{-2a(\lambda)d}} \qquad (12) \]
where R_e(λ, θ_i) is the effective reflectance, R(λ, θ_i) is the dielectric spectral reflectance, a(λ) is the spectral absorptivity, and R_w(λ) is the substrate spectral reflectance. Using specific sub-bands within the near-IR region of 1.1–1.4 microns, the spectral contrast is defined by

\[ C = \frac{R_l - R_u}{R_l + R_u} \qquad (13) \]
where l and u denote the lower and upper sub-bands, respectively, in Eq. (13). Measurement of the reflected energy and computation of the spectral contrast allow for the detection of ice on a surface and the estimation of the thickness d of the ice on that surface. In Figure 28 (chart from U.S. Patent 5,500,530 (60)), the reflectance is plotted versus wavelength for 0.5 mm ice and water layers with light incident normal to the surface. It is clear from Figure 28 that the IR reflectance of water and ice is very different and linear over a long range.
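A minimal numerical sketch of Eqs. (12) and (13) follows. The spectral quantities are treated as scalar, band-averaged values, and the reflectances, absorptivities, sub-band choices, and thickness are assumed illustrative numbers, not MDA calibration data.

```python
import numpy as np

def effective_reflectance(r_dielec, r_subst, absorp, thickness):
    """Eq. (12): effective reflectance of a thin dielectric (ice) layer on a substrate."""
    att = np.exp(-2.0 * absorp * thickness)
    return r_dielec + (r_subst * (1.0 - r_dielec) ** 2 * att) / (
        1.0 - (r_subst * r_dielec) ** 2 * att)

def spectral_contrast(r_lower, r_upper):
    """Eq. (13): contrast between the lower and upper near-IR sub-bands."""
    return (r_lower - r_upper) / (r_lower + r_upper)

# Illustrative, assumed band-averaged values for two sub-bands within 1.1-1.4 um.
d = 0.05   # ice thickness, cm (about 0.020 inch)
r_l = effective_reflectance(r_dielec=0.02, r_subst=0.8, absorp=5.0, thickness=d)
r_u = effective_reflectance(r_dielec=0.02, r_subst=0.8, absorp=40.0, thickness=d)
print(f"C = {spectral_contrast(r_l, r_u):.3f}")   # contrast varies with d, which is how thickness is estimated
```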
Figure 25. Frost and ice on the space shuttle ET SOFI (Courtesy of NASA KSC).
Figure 26. MDA ET Inspection System Cart (courtesy of MDA Corp.).
During the year 2006, VPL and NASA ice debris team members tested the MDA ice camera in a hangar at the Selfridge Air National Guard Base (SANG) in southeastern Michigan. A sample image of the camera's display is shown in Figure 29. Future tests and system modifications are being planned at the time of this article. The images in Figure 30 are an example of the use of image fusion to show surface and subsurface defects in a thermal protective tile used on the shuttle orbiter. The top image is the IR image, the second image is the visible image, and the bottommost image is the fused image of the IR and visible. The type of fusion performed was contrast-selecting Laplacian-pyramid fusion, and the imaging was done using cameras in the TARDEC Visual Perception Lab.
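A minimal sketch of contrast-selecting Laplacian-pyramid fusion of a registered IR/visible pair, in the spirit of Fig. 30, is shown below using OpenCV and NumPy. The pyramid depth and the per-level maximum-absolute-value selection rule are common simple choices and are not claimed to be the exact procedure used for Fig. 30; 8-bit grayscale inputs of the same size are assumed.

```python
import cv2
import numpy as np

def laplacian_pyramid(img, levels=4):
    """Build a Laplacian pyramid: band-pass images plus the low-pass residual."""
    pyr, cur = [], img.astype(np.float32)
    for _ in range(levels):
        down = cv2.pyrDown(cur)
        up = cv2.pyrUp(down, dstsize=(cur.shape[1], cur.shape[0]))
        pyr.append(cur - up)          # band-pass detail at this scale
        cur = down
    pyr.append(cur)                   # low-pass residual
    return pyr

def fuse(ir, vis, levels=4):
    """Fuse two registered grayscale images by per-pixel contrast selection."""
    p_ir, p_vis = laplacian_pyramid(ir, levels), laplacian_pyramid(vis, levels)
    # Keep the band-pass coefficient with the larger magnitude (higher local contrast).
    fused = [np.where(np.abs(a) >= np.abs(b), a, b)
             for a, b in zip(p_ir[:-1], p_vis[:-1])]
    fused.append(0.5 * (p_ir[-1] + p_vis[-1]))      # average the low-pass residuals
    out = fused[-1]
    for band in reversed(fused[:-1]):               # collapse the pyramid
        out = cv2.pyrUp(out, dstsize=(band.shape[1], band.shape[0])) + band
    return np.clip(out, 0, 255).astype(np.uint8)
```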
INFRARED IMAGING AND IMAGE FUSION USED FOR SITUATIONAL AWARENESS

A requirement for armored vehicles of the future is a system that is able to provide close-in situational awareness and understanding to the crew within the whole 360 degree hemisphere (Figure 31) around the vehicle. TACOM, Ford, and Sarnoff Laboratories are partnering to develop, test, and evaluate prototype systems to provide situational awareness. Elmo QN42 visual cameras (Figure 32) are used in the present system because of their small size and excellent
Figure 27. Reflection of light from a thin ice layer.
Figure 28. Computed spectral reflectance of ice and water versus wavelength [60].
Figure 29. MDA ice camera display of an ice-covered SOFI panel (picture from SANG tests). Red areas indicate an ice thickness of 0.0625 in.
Figure 30. IR, visible and fused image of thermal protective tile.
Figure 31. 360 degree panoramic fusion concept picture.
Figure 32. Elmo QN42 Camera used in TARDEC system.
color fidelity. The cameras have a field of view of approximately 53 degrees by 39 degrees. The Elmo cameras have a 410,000-pixel color CCD that results in 786 (V) × 470 lines (H) resolution with simultaneous Y/C and composite video outputs. Indigo Omega infrared cameras (Figure 33) were also selected because of their small size and clear image; they have an image array resolution of 160 by 120 pixels, with a 51 by 51 micron pixel size. The detector is an uncooled microbolometer. The infrared cameras are fitted with 8.5 mm lenses that provide a field of view of approximately 55 by 40 degrees. The Indigo Omega cameras are sensitive to the 7.5 to 13.5 micron band of the electromagnetic spectrum. There are four visible cameras and four IR cameras in each of the housings in the front and rear of the vehicle, hence a total of eight visible and infrared cameras. The outputs of the eight cameras are combined using multiplexers and, with the Sarnoff stitching and image registration software, provide a panoramic view that is scrollable and adjustable in magnification. The imagery from the cameras is combined, registered, and fused to provide a real-time panoramic stitched view of the world around the vehicle onto which they are mounted. Figure 34 shows the sensor systems attached to the Lincoln Navigator. The sensors are in the open configuration for testing and characterization. Future plans for hardening include the use of transparent covers and lenses and a smaller housing. A Lincoln Navigator was used as a test vehicle prototype for several reasons: (1) such a test platform for the cameras is practical for driving around and testing the camera system in metropolitan areas, and (2) the Navigator platform has ample space in the back and provides a convenient platform to demonstrate the system.

APPENDIX A

Abbreviations

ET: External Tank
FOV: Field of View
FSS: Fixed Service Structure
IR: Infrared
LCC: Launch Commit Criteria
LH2: Liquid Hydrogen
LN2: Liquid Nitrogen
LO2: Liquid Oxygen
MDA: MacDonald, Dettwiler and Associates Ltd.
SAA: Space Act Agreement between NASA and TARDEC
SOFI: Spray-On-Foam Insulation
SOW: Statement of Work
STS: Space Transportation System
SWIR: Shortwave Infrared
TARDEC: Tank Automotive Research, Development, and Engineering Center
VPL: Visual Perception Laboratory at TARDEC
SUMMARY

Infrared imaging is playing an increasingly important role in several application areas in the US Army and in the automobile companies for collision avoidance, nondestructive evaluation of materials, medical diagnostic imaging, and astronomy.
Figure 33. Indigo Omega IR camera used in TARDEC system.
Figure 34. Front and rear camera assemblies mounted on a TARDEC Lincoln Navigator with prototype 360 degree image fusion system.
THOMAS J. MEITZLER GRANT R. GERHART EUIJUNG SOHN HARPREET SINGH KYUNG–JAE HA US Army Tank-Automotive and Armaments Command, Warren, MI Wayne State University, Detroit, MI
Wiley Encyclopedia of Electrical and Electronics Engineering
Microscope Image Processing and Analysis
Standard Article
J. Paul Robinson and John Turek, Purdue University, West Lafayette, IN
Copyright © 1999 by John Wiley & Sons, Inc. All rights reserved.
DOI: 10.1002/047134608X.W4101
Article Online Posting Date: December 27, 1999
Abstract
The sections in this article are: Microscopy; Electron Microscopy; Confocal Microscopy; Image Analysis; Medical Applications; Hardware.
Keywords: microscopy; light microscopy; image analysis; confocal microscopy; electron microscopy; 3D techniques; medical applications; hardware
MICROSCOPE IMAGE PROCESSING AND ANALYSIS
MICROSCOPY

History of Microscope Development

Prior to 1800 production microscopes using simple lens systems were of higher resolution than compound microscopes despite the achromatic and spherical aberrations present in the double convex lens design used. In 1812 W. H. Wollaston made a significant improvement to the simple lens, which was followed by further improvements over the next few years. Brewster improved upon this design in 1820. In 1827 Giovanni Battista Amici built high-quality microscopes and introduced the first matched achromatic microscope. He recognized the importance of cover-glass thickness and developed the concept of water immersion. Carl Zeiss and Ernst Abbe advanced oil immersion systems by developing oils that matched the refractive index of glass. Otto Schott formulated glass lenses that color-corrected objectives and produced the first apochromatic objectives in 1886. Köhler's crucial development of what has been termed Köhler illumination (discussed later) just after the turn of the century had a significant impact on microscopy and was possibly one of the most significant factors prior to the electronic age. Modern microscopes are all compound rather than simple microscopes since the image is magnified in two separate lens systems—the ocular and the objective. The following discussion covers both optical and electron microscopy; a comparison of both systems appears in Fig. 1.

Components of Microscopes

Nomenclature of Objectives. Several types of objectives, generally referred to as apochromatic, achromatic, and fluorite, are available for general microscopy. Achromatic objectives were first developed in the early nineteenth century by Lister and Amici, whose goal was to remove as much spherical and axial chromatic aberration as possible. Chromatic aberration prevents adequate imaging because the image if uncorrected will contain color fringes around fine structures. Correction is achieved by combining a convex lens of crown glass with a concave lens of flint glass. Fluorite objectives are made especially for fluorescence. Objectives are generally designed to be used with a cover glass having a thickness of 160 µm to 190 µm. Listed on each objective will be the key characteristics, as shown in Fig. 2.

Dry Objectives. High-quality objectives designed for use in air will usually have a correction collar. If the thickness of the cover glass differs from the ideal 170 µm, the correction collar must be adjusted to reduce spherical aberration.
Oil Immersion and Water Immersion. With objectives of high numerical aperture (NA) and high magnification, oil immersion will generally be necessary, as the resolution of a specimen is directly proportional to the NA. An objective designed for use with oil will always be clearly marked ‘‘oil’’. Specific oils of refractive index necessary to match the specimen and mounting conditions are desirable. For example, glass has an effective refractive index of 1.51 and oil must match this. The variation of refractive index with temperature must also be taken into account. It may be desirable to image an aqueous specimen through water. Water immersion objectives are available that can be placed directly into the suspending medium (which has an effective refractive index of approximately 1.33). Light Sources. Most light sources for microscopy are arc lamps such as xenon, mercury, carbon, or zirconium arc lamps. These lamps generally have a life span of around 250 h, but recently lamps with life spans of 1000 h to 2000 h have been developed. The advantage of these light sources is the broad-spectrum excitation offered, stretching from 325 nm up to 700 nm. The excitation spectra of a mercury lamp and a mercury xenon lamp are compared in Fig. 3. It is possible to excite several fluorochromes simultaneously with these relatively inexpensive light sources.
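As a rough illustration of how numerical aperture governs resolving power, the sketch below evaluates the Rayleigh-type estimate d = 0.61λ/NA (the same relation used later in this article for the confocal Airy radius) for a dry objective and an oil immersion objective. The wavelength and NA values are illustrative assumptions, not figures quoted in the text.

```python
# Illustrative sketch: diffraction-limited lateral resolution d = 0.61 * wavelength / NA.
# The wavelength and NA values below are assumptions chosen only for illustration.

def lateral_resolution_nm(wavelength_nm: float, numerical_aperture: float) -> float:
    """Rayleigh-type estimate of the smallest resolvable separation, in nanometers."""
    return 0.61 * wavelength_nm / numerical_aperture

if __name__ == "__main__":
    wavelength = 488.0  # nm, a typical fluorescence excitation line
    for label, na in [("dry, 0.75 NA", 0.75), ("oil immersion, 1.30 NA", 1.30)]:
        print(f"{label}: d = {lateral_resolution_nm(wavelength, na):.0f} nm")
    # The higher NA of the oil immersion objective gives a proportionally smaller resolvable distance.
```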
Figure 1. Comparison of image formation in electron and light microscopes: transmission electron microscope (TEM), light microscope (LM), scanning electron microscope (SEM), and confocal microscope.
160 mm Tube Length Microscope. Most commercial microscopes have been designed according to the International Standards Organization (ISO) standard, which was created to ensure that components (mainly objectives) were interchangeable. The standard, described in Fig. 4, specifies several key dimensions such as the length of the tube (160 mm), the parfocal distance of the objective (45 mm), and the object-to-image distance (195 mm). Infinity-Corrected Microscopes. Recently many microscope manufacturers have switched to infinity-corrected optical systems. In this method of imaging, the traditional Telan optics (which are usually placed within 160 mm microscopes to allow insertion of other optical components) are replaced by a tube lens that forms a real, intermediate image and in which aberrations can be corrected. The placement of this tube lens allows the insertion of many optical components within the system because of its infinite image distance, its most significant advantage over the 160 mm systems. Microscope Variations Köhler Illumination. One of the most significant improvements in microscopy occurred at the beginning of the twentieth century when August Köhler developed the method of illumination still called Köhler illumination. Köhler also
Figure 2. Objectives. A marking such as PLAN-APO-40× 1.30 NA 160/0.22 indicates a flat-field (PLAN) apochromat (APO) with magnification factor 40×, numerical aperture 1.30, tube length 160 mm, and cover-glass thickness 0.22 mm.
recognized that using shorter-wavelength light (UV) could improve resolution. The principle of Köhler illumination is that parallel light rays provide an evenly illuminated field of view while simultaneously the specimen is illuminated with a very wide cone of light. In Köhler illumination there are two conjugate planes formed with either image-forming planes or illuminating planes, which image the light source (filament). In order to observe the illuminating planes, it is necessary to focus on the objective back focal plane using a special telescope that replaces a regular ocular. Naturally the image planes are focused onto the specimen using the conventional ocular. The Köhler principle is shown in Fig. 5. While the efficiency of this system is not very high, it forms a very even illumination (the bright field), which is usually more important for quantitative purposes. Phase Contrast. The development of phase-contrast microscopy resulted in a Nobel Prize for its inventor, Zernike, in
Types of Optical Microscopes

Figure 3. Arc lamp excitation spectra. Irradiance at 0.5 m (mW · m⁻² · nm⁻¹) of Xe and Hg lamps from 200 nm to 800 nm, with the argon 350 nm and 488 nm lines indicated.
Figure 4. The conventional microscope: mechanical tube length = 160 mm, object-to-image distance = 195 mm, focal length of objective = 45 mm.
1932. Phase contrast is essential to see details in specimens that have almost the same refractive index as the suspending or mounting medium. Changing the phase of the central beam by one quarter of a wavelength at the rear focal plane of the objective achieves fine detail and contrast. Essentially, an object that was previously transparent will now become absorbent, resulting in edges and organelles appearing dark with a light background. Dark-Field Microscopy. In dark-field illumination, light is prevented from traveling through the objective because its angle is increased beyond the collection angle (the numerical aperture) of that objective. Thus, only highly diffracted light from granular or diffractive components or objects reaches the objective. Very small objects will be bright on a dark background. Differential Interference Contrast (Nomarski Optics). Differential interference contrast microscopy requires the insertion of several additional components into the light path. Light travels through a polarizer through two critical components known as Wollaston prisms, which split the light into two
Figure 5. Köhler illumination, showing the conjugate planes for the image-forming rays and the conjugate planes for the illuminating rays (condenser, field iris, specimen, field stop, eyepiece, retina).
parallel beams polarized at 45° with respect to each other. This light illuminates the specimen, which now by its nature creates an optical path difference. The light then passes through a second Wollaston prism, which essentially combines both beams, followed by an analyzer, which is oriented at 45° with respect to the vibration plane of either of the combined waves (giving equal intensity). Rotation of a polarizer increases or decreases one or the other component of the light. The shear created at the edge of the specimen will cause a halo, resulting in poor edge resolution if the Wollaston prisms are not matched to the resolution of the objective, necessitating customized prisms for every objective. For microscopy using epi-illumination, rather than transmitted light, a single Wollaston prism can be used. Interference Reflection Microscopy. The principle of interference reflection depends upon passing the light through surfaces of different refractive index, resulting in constructive and destructive interference. When light is refracted in its immersion medium, a phase shift occurs, which when added to the resultant beam through each component (e.g., glass and medium) will cancel each other out as destructive interference. The further the specimen is from the coverslip the brighter the reflectance will be, and the closer, the darker the reflectance, such that at the contact point of the coverslip and the medium no effective illumination will be observed. Fluorescence Microscopy. In fluorescence microscopy light of one wavelength excites fluorescent molecules within or attached to the specimen, and the emitted light of a longer wavelength is collected. This means that only molecules that are fluorescent will be imaged in such a system. Most fluorescent microscopes use mercury arc lamps as excitation sources, with selection of specific excitation wavelength being made by excitation filters and dichroic beam splitters to separate the excitation light from the emitted signal. ELECTRON MICROSCOPY Principles 1. Comparison of image formation in electron and light microscopes 2. Preparation of biological samples 3. Digital imaging in electron microscopes 4. Image acquisition and storage 5. Image processing 6. Model-based and design-based image analysis Figure 1 is a schematic comparison of a transmission electron microscope (TEM), a light microscope (LM), and a scanning electron microscope (SEM). The TEM and SEM use a beam of electrons as the illumination source, as opposed to photons for the LM. The geometric optics of image formation in the TEM is identical to the LM. The only difference is that focusing is accomplished with electromagnetic lenses in the TEM and glass lenses in the LM. In the TEM and SEM, illumination is most commonly generated by thermally emitted electrons from a tungsten filament. The amount of illumination from an electron gun, or the brightness (β), is defined as β = ρeV/kT, where ρ is the cathode current density, e the elec-
tronic charge, V the accelerating voltage, k Boltzmann's constant, and T the absolute temperature. The brightness β is proportional to V; therefore increasing the accelerating voltage will increase the illumination. The electron gun, the condenser lens (C) system, and the anode (An) form the illuminating system of the microscope. The electron gun consists of the filament or cathode and an electron-gun cap. The electron gun is self-biased, and the bias voltage applied to the gun cap controls the area of electron emission in the filament. Any increase in the beam current causes an increase in the bias voltage, which acts to reduce the beam current again. Below the gun cap is an anode that is held at ground potential. The electrons are accelerated by the potential difference between the cathode and the anode. Apertures (Ap) in each of the microscopes control illumination. In the TEM and LM the illumination is transmitted through the specimen (S) and focused by an objective lens (O), which is the source of the resolving power. After passing through the objective lens the image is further magnified by intermediate (I) and projector lenses (P) in the TEM and by the lens in the eyepiece in the LM. The final image for the TEM is projected onto a fluorescent screen and for the LM onto the retina of the eye. The illumination system of the SEM is the same as that of the TEM, but deflection coils (D) are used to raster the electron beam across the specimen. As the electron beam scans the specimen it generates a variety of secondary electromagnetic radiation that may be analyzed. Low-energy secondary electrons (<50 eV) that escape from the surface of the specimen contain the surface detail information. These electrons impinge upon a scintillator located at one end of a quartz light pipe, where they are collected and converted into light pulses that are conducted via this light pipe to a photomultiplier tube (PMT) for amplification. The signal is then processed so that it can be displayed on cathode ray tubes (CRTs). Preparation Methods In order to understand image processing and analysis of electron microscopic images it is important to have some appreciation of the preparation methods and sample sizes for the specimens that are the subject of the analysis. Biological specimens for transmission or scanning electron microscopy need to have their ‘‘normal’’ structure stabilized via chemical or physical preservation (fixation). The usual chemical fixative is a buffered solution of 2% to 3% glutaraldehyde. This five-carbon aldehyde has two reactive groups that cross-link proteins by forming methylene bridges between polypeptides at reactive side groups, especially active amino and imino groups. The solution is usually slightly hypertonic (400 mOsm to 500 mOsm).
Glutaraldehyde: OHC-CH2-CH2-CH2-CHO
The glutaraldehyde preserves primarily the protein components of tissues but does not preserve lipid components, which require a secondary fixative such as osmium tetroxide (OsO4). After rapid removal from the body, tissues may be fixed by immersion and then minced into 1 mm3 pieces while still submerged in fixative. The rate of fixation may be increased via microwave irradiation procedures. Alternatively, an anesthe-
tized animal may have the blood replaced by the fixative by either whole body perfusion or cannulation and perfusion of a specific organ. After fixation, specimens are dehydrated through a solvent gradient (i.e., 30%, 60%, 90%, 100%), generally acetone or alcohol. This allows the infiltration of the tissue with an embedding resin (most commonly epoxy and acrylic resins) that will be cured and hardened. The embedding resin acts as a support for the tissue, which will be thin-sectioned (50 nm to 90 nm) for examination in the TEM. The images obtained from a thin section perhaps 0.5 mm2 in area will be the subject of the image analysis. With the current computational power and relative ease of extracting data from images, it is simple to generate large amounts of meaningless data. Therefore, the sampling procedures to be used in collecting data are critical for the measurements to be valid. Unbiased sampling procedures are the basis for modern morphological and stereological image analysis. Anyone embarking upon computerized analysis should refer to the stereology literature on sampling before beginning an experiment. The areas for image analysis are best sampled in a systematic manner independent of the content during both the collection and photographic process. Digital Imaging in the Electron Microscope In order to perform image processing and analysis, images must be converted into a digital form. In the 1990s, digital image acquisition has become the standard for scanning electron microscopes. Since the image in the SEM starts as an electronic signal, it is relatively easy to convert and display the image in digital format. Until recently, the resolution capabilities of digital acquisition devices were not suitable or were cost prohibitive for direct digital acquisition in transmission electron microscopes. However, the technology has progressed very rapidly and as the cost of high-resolution charge-coupled device (CCD) cameras drops, digital image acquisition is expected to be the rule and not the exception for most commercial TEM microscopes by the end of the century. Image Acquisition Since current SEMs are all digital, this discussion will focus on TEMs. For microscopes that utilize the traditional film format, the negative may be digitized using flat-bed scanners that have a transparency adapter. Scanning resolutions of 600 dpi or greater are common on today's equipment and are suitable for all but the most demanding analysis, which may require high-resolution drum scanners. It is important that analysis be performed on nonpixelated images. If individual pixels are evident, then the analysis will not be accurate. The pixel depth is also important for analysis of the image. While all flat-bed scanners will provide 8-bit gray-scale resolution, flat-bed scanners that have the capability to capture 10-bit gray-scale or higher can take advantage of the contrast resolution present in an electron microscopic negative. When processing the image the greater pixel depth may be utilized to enhance features of interest. A variety of CCD cameras is currently available for direct image acquisition in the TEM. Cameras with a video rate of acquisition are useful for demonstrating images to groups of individuals and may be useful for basic image analysis such as cell counting. Slow-scan digital cameras such as the Gatan Megascan (2048 × 2048 pixels) and Kodak Megaplus (4096 ×
4096 pixels) are capable of producing high-quality digital images in the electron microscope. One advantage of CCD cameras is that they can perform binning operations. The output from an array of pixels (e.g., 6 ⫻ 6 pixels) may be combined into one pixel. Although this reduces resolution, it also increases sensitivity and allows the rapid collection of information from electron-beam-sensitive specimens using minimal illumination. If the objects of interest are present, then a final high-resolution image that does not utilize binning may be captured.
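A minimal sketch of the binning operation just described, combining each n × n block of detector pixels into one output pixel by summation; the frame size, Poisson noise model, and 2 × 2 / 6 × 6 factors are assumptions used only to illustrate the idea.

```python
# Illustrative n x n pixel binning: each block of detector pixels is summed into one output pixel,
# trading spatial resolution for sensitivity, as described in the text. Synthetic data only.
import numpy as np

def bin_image(image: np.ndarray, n: int) -> np.ndarray:
    """Sum n x n blocks of a 2-D image; edges that do not fill a whole block are trimmed."""
    h, w = (image.shape[0] // n) * n, (image.shape[1] // n) * n
    trimmed = image[:h, :w]
    return trimmed.reshape(h // n, n, w // n, n).sum(axis=(1, 3))

if __name__ == "__main__":
    frame = np.random.poisson(lam=3.0, size=(2048, 2048))  # simulated low-signal frame (assumed size)
    print(bin_image(frame, 2).shape)  # (1024, 1024): 4x fewer pixels, roughly 4x the counts per pixel
    print(bin_image(frame, 6).shape)  # (341, 341) after trimming, matching the 6 x 6 example in the text
```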
CONFOCAL MICROSCOPY

Principles
The original patent for the confocal microscope was filed by Marvin Minsky at Harvard University in 1957. Successive developments were made by Brackenhoff, Wijnaendts van Resandt, Carlesson, Amos, and White, among others. A confocal microscope achieves crisp images of structures even within thick tissue specimens by a process known as optical sectioning. The source of the image is photon emission from fluorescent molecules within or attached to structures within the object being sectioned. A point source of laser light illuminates the back focal plane of the microscope objective and is subsequently focused to a diffraction-limited spot within the specimen. At this point the fluorescent molecules are excited and emit light in all directions. However, because the emitted light refocuses in the objective image plane (being conjugate with the specimen), and because the light passes through a pinhole aperture that blocks out-of-focus light, an image of only a thin optical section of the specimen is formed. Out-of-focus light is effectively removed from the emission, creating a ‘‘clean’’ image, as opposed to the traditional fluorescent microscope that includes all this out-of-focus light (Fig. 6a). It is possible to increase the ‘‘depth’’ of the optical section by varying the diameter of the pinhole, which effectively increases the light collection from the specimen. However, this also decreases the resolution. The resolution of a point light source in this image plane is a circular Airy diffraction pattern with a central bright region and outer dark ring. The radius of this central bright region is defined as rAiry = 0.61λ/NA, where rAiry is a distance in the specimen plane, λ is the wavelength of the excitation source, and NA is the numerical aperture of the objective lens. To increase the signal and decrease the background light, it is necessary to decrease the pinhole to a size slightly less than rAiry; a correct adjustment can decrease the background light by a factor of 10³ over conventional fluorescence microscopy. Therefore pinhole diameter is crucial for achieving maximum resolution in a thick specimen. This becomes a tradeoff, however, between increasing axial resolution (optimum = 0.7rAiry) and lateral resolution (optimum = 0.3rAiry). There are several methods for achieving a confocal image. One common method is to scan the point source over the image using a pair of galvanometer mirrors, one of which scans in the X and the other in the Y direction. The emitted fluorescence emission traverses the reverse pathway and is collected instantaneously. It is separated from the excitation source by a beam-splitting dichroic mirror that reflects the emission to a photomultiplier tube that amplifies the signal, passes it to an analog-to-digital converter (ADC), and displays
Figure 6. (a) How a confocal image is formed. Modified from J. B. Pawley, Handbook of Biological Confocal Microscopy, New York: Plenum Press, 1989. (b) Principles of a line scanner.
the signal as a sequential raster scan of the image. Most current systems utilize 16-bit ADCs, allowing an effective image of 1024 × 1024 pixels or more with at least 256 gray levels. Some confocal microscopes can collect high-speed images at video rates (30 frames/s), while others achieve faster scanning by slit-scanning (see below). Recently, two-photon excitation has been demonstrated in which a fluorophore simultaneously absorbs two photons, each having half the energy normally required to raise the molecule to its excited state. A significant advantage of this system is that only the fluorophore molecules in the plane of excitation can be excited, as this is the only area where the light intensity is sufficient. Therefore even less background noise is collected and the efficiency of imaging thick specimens is significantly increased. Two-photon excitation has an added advantage for those probes requiring UV excitation (see later), which has a tendency to damage tissue (particularly when imaging live cells); the two-photon system can achieve UV excitation without damage to the tissue. Benefits of Confocal Microscopy. Confocal microscopy has become a familiar tool in the research laboratory with a number of significant advantages over conventional fluorescence microscopy. Among these are the following: Reduced blurring of the image from light scattering. As explained previously, since out-of-focus light is excluded from the image plane, it is possible to collect sharper images than with regular fluorescent microscopes. Photons emitted from points outside the image plane are rejected.
Increased effective resolution. This occurs by virtue of the increased resolution observed from a point source of light imaged through a pinhole. Improved signal-to-noise ratio. Decreased background light allows significantly improved signal-to-noise ratio. Z-axis scanning. A series of optical sections can be obtained at regular distances by moving the objective progressively through the specimen in the vertical direction. Depth perception in Z-sectioned images. With software reconstruction techniques, it is possible to reconstruct an image of the fluorescence emission of the specimen through the entire depth of the specimen. Electronic magnification adjustment. By reducing the scanned area of the excitation source, but retaining the effective resolution, it is possible to magnify the image electronically. Light Sources The most common light sources for confocal microscopes are lasers. The acronym LASER stands for light amplification by stimulated emission of radiation; the nature of laser operation is detailed elsewhere. Most lasers on conventional confocal microscopes are continuous wave (CW) lasers and are gas lasers, dye lasers, or solid-state lasers. The most popular gas laser is argon-ion (Ar), followed by either krypton-ion (Kr) or a mixture of argon and krypton (Kr-Ar). Helium-neon (HeNe) and helium-cadmium (He-Cd) are also used in confocal microscopy. The He-Cd provides UV lines at 325 nm or 441 nm, while the argon-ion lasers can emit 350 nm to 363 nm UV light if the laser is powerful enough. Otherwise the most common wavelength for fluorescent molecule excitation is 488 nm [ideal for fluorescein-isothiocyanate (FITC)] and either 543 nm (green He-Ne), 568 nm, or 647 nm (Kr-Ar). The power necessary to excite fluorescent molecules at a specific wavelength can be calculated. For instance, consider 1 mW of power at 488 nm focused via a 1.25 NA objective to a Gaussian spot whose radius at 1/e² intensity is 0.25 µm. The peak intensity at the center will be 10⁻³ W/[π(0.25 × 10⁻⁴ cm)²] = 5.1 × 10⁵ W/cm², or 1.25 × 10²⁴ photons/(cm² · s). At this power, FITC would have 63% of its molecules in an excited state and 37% in ground state at any one time. The great majority of confocal microscopes are designed around conventional microscopes, but as discussed above, the light source can be one of several lasers. For most cell biology studies, arc lamps are not adequate sources of illumination for confocal microscopy. When using multiple laser beams, it is vital to expand the laser beam using a beam-expander telescope so that the back focal aperture of the objective is always completely filled. If several different lasers are used, the beam widths must also be matched if simultaneous excitation is required. The most important feature in selecting the laser line is the absorption maximum of the fluorescent probe. Examples of almost ideal excitation of probes can be seen in Table 1, which shows the maximum excitation in percent of several probes excited by the three primary lines available from a Kr–Ar laser—a common excitation source for confocal microscopes. Line Scanners The principal reason for using a line scanner is to obtain rapid successive images of a fluorescence emission. One of the
Table 1. Excitation and Emission Peaks

Fluorophore    Excitation Peak (nm)   Emission Peak (nm)   % Max Excitation at 488 nm / 568 nm / 647 nm
FITC           496                    518                  87 / 0 / 0
Bodipy         503                    511                  58 / 1 / 1
Tetra-M-Rho    554                    576                  10 / 61 / 0
L-Rhodamine    572                    590                  5 / 92 / 0
Texas Red      592                    610                  3 / 45 / 1
CY5            649                    666                  1 / 11 / 98

Note: You will not be able to see CY5 fluorescence under the regular fluorescent microscope because the wavelength is too high.
limiting factors that must be addressed in confocal microscopy is photobleaching of the fluorophor with intense illumination. By scanning a line of laser light across the specimen instead of a point of light, even illumination of lower intensity can be applied to the specimen at high rates. In general, these instruments are referred to as slit-scanners because they utilize a slit aperture, which can either scan or remain stationary. The detector for these scanners is usually a sensitive video camera—SIT, ISIT, or cooled CCDs. An example of the light path of one such slit-scanner is shown in Fig. 6b. Fluorescent Probes Many fluorescent probes are available for use in confocal microscopy: probes for protein (Table 2), probes for organelles (Table 3), probes for nucleic acids (Table 4), probes for ions (Table 5), probes for pH indicators (Table 6), and probes for oxidation states (Table 7). The excitation properties of each probe depend upon its chemical composition. Ideal fluorescent probes will have a high quantum yield, a large Stokes shift, and nonreactivity with the molecules to which they are bound. It is vital to match the absorption maximum of each probe to the appropriate laser excitation line. For fluorochrome combinations, it is desirable to have fluorochromes with similar absorption peaks but significantly different emission peaks, enabling use of a single excitation source. It is common in confocal microscopy to use two, three, or even four distinct fluorescent molecules simultaneously. Photobleaching Photobleaching is defined as the irreversible destruction of an excited fluorophore by light. Uneven bleaching through a
Table 2. Probes for Proteins

Probe                                Excitation (nm)   Emission (nm)
FITC                                 488               525
PE                                   488               525
APC                                  630               650
PerCP                                488               680
Cascade Blue                         360               450
Coumarin-phalloidin                  350               450
Texas Red                            610               630
Tetramethylrhodamine-amines          550               575
CY3 (indotrimethinecyanines)         540               575
CY5 (indopentamethinecyanines)       640               670
Table 3. Specific Organelle Probes

Probe            Site           Excitation (nm)   Emission (nm)
BODIPY (a)       Golgi          505               511
NBD (b)          Golgi          488               525
DPH (c)          Lipid          350               420
TMA-DPH (d)      Lipid          350               420
Rhodamine 123    Mitochondria   488               525
DiO (e)          Lipid          488               500
DiI-Cn-(5) (f)   Lipid          550               565
DiO-Cn-(3) (g)   Lipid          488               500

(a) BODIPY: borate-dipyrromethene complex. (b) NBD: nitrobenzoxadiazole. (c) DPH: diphenylhexatriene. (d) TMA: trimethylammonium. (e) DiO: DiO-C18-(3)-3,3′-dioctadecyloxacarbocyanine perchlorate. (f) DiI-Cn-(5): 1,1′-di‘‘n’’yl-3,3,3′,3′-tetramethylindocarbocyanine perchlorate. (g) DiO-Cn-(3): 3,3′-di‘‘n’’yl oxacarbocyanine iodide.
specimen will bias the detection of fluorescence, causing a significant problem in confocal microscopy. Methods for countering photobleaching include shorter scan times, high magnification, high NA objectives, and wide emission filters as well as reduced excitation intensity. A number of antifade reagents are available; unfortunately, many are not compatible with viable cells. Antifade Reagents. Many quenchers act by reducing oxygen concentration to prevent formation of excited species of oxygen, particularly singlet oxygen. Antioxidants such as propyl gallate, hydroquinone, and p-phenylenediamine, while fine for fixed specimens, are not satisfactory for live cells. Quenching fluorescence in live cells is possible using systems with reduced O2 concentration or using singlet-oxygen quenchers such as carotenoids (50 mM crocetin or etretinate in cell cultures), ascorbate, imidazole, histidine, cysteamine, reduced glutathione, uric acid, and trolox (vitamin E analog). Photobleaching can be calculated for a particular fluorochrome such as FITC—at 4.4 × 10²³ photons · cm⁻² · s⁻¹ FITC bleaches with a quantum efficiency Qb of 3 × 10⁻⁵. Therefore FITC would be bleaching with a rate constant of 4.2 × 10³ s⁻¹, so 37% of the molecules would remain after 240 µs of irradiation. In a single plane, 16 scans would cause 6% to 50% bleaching. Applications Neuroscience. Evaluation of neuronal tissue is a classic application of confocal microscopy. One example is the identi-
Table 4. Probes for Nucleic Acids

Hoechst 33342 (AT (a) rich) (UV)
DAPI (b) (UV)
PI (c) (UV or visible)
Acridine Orange (visible)
TOTO-1, YOYO-3, BOBO (d) (visible)
Pyrine Y (visible)
Thiazole Orange (visible)

(a) AT: Adenine/Thymine. (b) DAPI: 4′,6-diamidino-2-phenylindole. (c) PI: propidium iodide. (d) TOTO, YOYO, BOBO: dimeric cyanine dyes for nucleic acid staining (Molecular Probes, proprietary).
Table 5. Probes for Ions

Probe    Excitation (nm)   Emission (nm)
INDO-1   350               405, 480 (a)
QUIN-2   350               490
Fluo-3   488               525
Fura-2   330, 360          510

(a) Optimal—it may be more practical to measure at 400, 525 nm.
fication and tracking of nerve cells. Typically neurons are injected with a fluorescent dye such as Lucifer yellow; threedimensional projections are made to identify the structure and pathway of the neuron. Cell Biology. The applications in cell biology are too numerous to cover in this section. However, one of the most useful applications currently is cell tracking using green fluorescent protein (GFP), a naturally occurring protein in the jellyfish Aequorea victoria, which fluoresces when excited by UV or blue light. A fluorescent protein (GFP) can be transfected into cells so that subsequent replication of the organism carries with it the fluorescent reporter molecule, providing a valuable tool for tracking the presence of that protein in developing tissue or differentiated cells. This is particularly useful for identifying regulatory genes in developmental biology and for identifying the biological impact of alterations to normal growth and development processes. In almost any application, multiple fluorescent wavelengths can be detected simultaneously. For instance, the fluorescent dyes Hoechst 33342 (420 nm), FITC (525 nm), and Texas Red (630 nm) can be simultaneously collected to create a three-color image (or more if more detectors are available), providing excellent information regarding the location and relationships between the labeled molecules and the structures they identify. Living Cells. Evaluation of live cells using confocal microscopy presents some difficult problems. One of these is the need to maintain a stable position while imaging a live cell. For example, a viable respiring cell, even when attached to a matrix of some kind, may be constantly changing shape, preventing a finely resolved three-dimensional (3-D) image reconstruction. Fluorescent probes must be found that are not toxic to the cell. Hoechst 33352 is a cell-permeant DNA probe that can be used to label live cells. Figure 7 presents an example in which there was a need to identify and enumerate cells attached to an extracellular matrix. In this image the cells can be accurately enumerated and their relative locations within the matrix determined as well. This figure demonstrates the effectiveness of confocal microscopy as a qualitative and quantitative tool for creating a 3-D image reconstruction of these live cells. The image is in three parts: (a) a 3-D
Table 6. pH Sensitive Indicators

Probe         Excitation (nm)   Emission (nm)
SNARF-1 (a)   488               575
BCECF (b)     488               525, 620
              440, 488          525

(a) SNARF: seminaphthorhodafluor. (b) BCECF: 2′,7′-bis-(carboxyethyl)-5,6-carboxyfluorescein.
Table 7. Probes for Oxidation States

Probe         Oxidant   Excitation (nm)   Emission (nm)
DCFH-DA (a)   H2O2      488               525
HE (b)        O2⁻       488               590
DHR 123 (c)   H2O2      488               525

(a) DCFH-DA: dichlorofluorescin diacetate. (b) HE: hydroethidine. (c) DHR-123: dihydrorhodamine 123.
reconstruction of the material; (b) a pseudo-red/green image that requires red/green glasses to see the 3-D effect; and (c) a rotation of the image at 30° from the plane of collection. Figure 8 is another example of 3-D imaging of live cells, in this case endothelial cells growing on glass in a tissue culture dish. Thirty image sections were taken 0.2 µm apart; the image plane presented shows an X–Z plane with the cells attached to the cover glass. The cover glass is represented by a cartoon insertion to show where the cells would be actually attached. Ratio Imaging. Confocal microscopy can be used for evaluation of physiological processes within cells. Examples are changes in cellular pH, changes in free Ca²⁺ ions, and changes in membrane potential and oxidative processes within cells. These studies require simultaneous fluorescence emission ratioing of the fluorescence molecules in real time. Usually these molecules are excited at one wavelength but emit at two wavelengths depending upon the change in properties of the molecule. For example, changes in cellular pH can be identified using BCECF (2′,7′-bis(carboxyethyl)-5(and 6)-carboxyfluorescein) which is excited at 488 nm and emits at 525 nm and 590 nm. The ratio of 590 nm to 525 nm signals reflects the intracellular pH. Similarly, INDO-1, a calcium indicator, can be excited at 350 nm to measure the amount of Ca²⁺ in a cell. The ratio of 400 nm to 525 nm emission signals reflects the concentration of Ca²⁺ in the cell; INDO-1 can bind Ca²⁺, and the fluorescence of the bound molecule is preferentially at the lower emission wavelength. Rapid changes in Ca²⁺ can be detected by kinetic imaging—taking a series of images at
Figure 8. Example of 3-D reconstruction. Cells grown on a cover glass were imaged by confocal microscopy and then reconstructed as they had been on the original coverslip.
both emission wavelengths in quick succession. An example is shown in Fig. 9. Fluorescence Recovery after Photobleaching. Fluorescence recovery after photobleaching (FRAP) is a measure of the dynamics of the chemical changes of a fluorescent molecule within an object such as a cell. A small area of the cell is bleached by exposure to an intense laser beam, and the recovery of fluorescent species in the bleached area is measured. In such a case the recovery time (t) can be calculated from the equation t = W²/4D, where W is the diameter of the bleached spot and D is the diffusion coefficient of the fluorescent molecule under study. An alternative technique can measure interactions between cells; one of two attached cells is bleached and the recovery of the pair as a whole is monitored. Figure 10 demonstrates such a setup. Diagnostic Pathology. The use of confocal microscopy in diagnostic pathology is a developing field. The key advantages of the confocal system relate to the increased resolution, the ability to create 3-D views (stereoscopic images), and the more advanced image analysis that usually accompany these instruments. Some particular advantages are in evaluation of skin diseases and of cell growth in tissue injury and repair.
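As a quick illustration of the FRAP relation quoted above, t = W²/4D can be inverted to estimate a diffusion coefficient from a measured recovery time. The spot diameter and recovery time below are assumed example values, not measurements from this article.

```python
# Illustrative use of the FRAP relation t = W^2 / (4 D):
# given a bleached-spot diameter W and a measured recovery time t, estimate D.

def diffusion_coefficient(spot_diameter_um: float, recovery_time_s: float) -> float:
    """Return D in um^2/s from t = W^2 / (4 D)."""
    return spot_diameter_um ** 2 / (4.0 * recovery_time_s)

if __name__ == "__main__":
    W = 2.0  # um, assumed bleached-spot diameter
    t = 5.0  # s, assumed recovery time
    print(f"D ~ {diffusion_coefficient(W, t):.2f} um^2/s")  # 0.20 um^2/s for these assumed values
```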
Figure 7. (a) Projection of endothelial cells labeled with hydroethidine imaged at 488 nm excitation and 575 nm emission. Fluorescence measured is for ethidium bromide, the product of oxidation of the hydroethidine. (b) Reconstruction of pollen grains from a confocal image of 30 optical sections.
Figure 9. Calcium flux measurements. Pulmonary artery endothelial cells loaded with Indo-1, a fluorescent probe for calcium. The presence of calcium can be quantitated by measuring the ratio of fluorescence emission at two wavelengths—400 nm (left box) and 525 nm (right box).
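A minimal sketch of the emission-ratio computation behind Fig. 9 and the ratio-imaging discussion above: two registered emission images (for Indo-1, the 400 nm and 525 nm channels) are divided pixel by pixel to give a ratio image that tracks Ca²⁺. The synthetic image data here are assumptions used only for illustration.

```python
# Illustrative pixel-by-pixel emission ratioing (e.g., Indo-1: 400 nm / 525 nm),
# as used for intracellular Ca2+ estimation in the text. Synthetic data only.
import numpy as np

def ratio_image(em_short: np.ndarray, em_long: np.ndarray, eps: float = 1.0) -> np.ndarray:
    """Ratio of two registered emission images; eps avoids division by near-zero background."""
    return em_short.astype(float) / (em_long.astype(float) + eps)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    em_400 = rng.poisson(lam=80, size=(256, 256))   # assumed 400 nm emission image
    em_525 = rng.poisson(lam=120, size=(256, 256))  # assumed 525 nm emission image
    r = ratio_image(em_400, em_525)
    print(f"mean 400/525 ratio: {r.mean():.2f}")    # higher ratios indicate more bound Indo-1 (more Ca2+)
```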
Figure 10. Fluorescence recovery after photobleaching. An intense laser beam bleaches the fluorescence at time zero, and the recovery of fluorescence F(%) is followed over time (e.g., at 10 s and 30 s).

IMAGE ANALYSIS

2-D Image Processing

Image processing is the procedure of feature enhancement prior to image analysis. Image processing is performed on pixels (smallest units of digital image data). The various algorithms used in image processing and morphological analysis perform their operations on groups of pixels (3 × 3, 5 × 5, etc.) called kernels. These image-processing kernels may also be used as structuring elements for various image morphological analysis operations. The diagram represents a series of 3 pixel × 3 pixel kernels (left to right A, B, C). Many image-processing procedures will perform operations on the central (black) pixel using information from neighboring pixels. In kernel A, information from all the neighbors is applied to the central pixel. In kernel B, only the strong neighbors, those pixels vertically or horizontally adjacent, are used. In kernel C, only the weak neighbors, those diagonally adjacent, are used in the processing. Various permutations of these kernel operations form the basis for digital image processing.

At this point it is necessary to define some of the properties of digital images and the principles that guide mathematical morphology. A Euclidean binary image may be considered as a subset of n-dimensional Euclidean space ℝⁿ. A binary image may be represented by the set X = {z : f(z) = 1, z = (x, y) ∈ ℝ²}. The background of a binary object is the set Xᶜ = {z : f(z) = 0, z = (x, y) ∈ ℝ²}. The function f is the characteristic function of X. If the Euclidean grid ℤ² is used instead of ℝ², then the definition for binary images becomes X = {(i, j) : f(i, j) = 1, (i, j) ∈ ℤ²} and Xᶜ = {(i, j) : f(i, j) = 0, (i, j) ∈ ℤ²}.

Mathematical morphology may also be used to describe gray-scale (multivalue) objects. Gray-scale objects are functions f(x, y) of the two spatial coordinates x, y: (x, y) ∈ ℝ² → f(x, y) ∈ ℝ. The gray-scale images may be viewed as subsets of the Cartesian product ℝ² × ℝ. If the gray-scale image is of the form f(i, j), (i, j) ∈ ℤ², then it may be represented as a subset of ℤ² × ℝ. Mathematical morphological operations utilize a structuring element (usually a round or square kernel) to probe an image. The interaction of the probe with the image is the morphological transformation Ψ(X). Measurements m(Ψ(X)) derived from Ψ(X) need to meet several principles if the morphological transformation is to be quantitative. These principles form the foundation for model-based image analysis.

The first principle is translation invariance. This means that the structure of an object does not change with its position in space. If Xz is the translate of the set X ⊂ ℝ² by some vector z ∈ ℝ², the translation invariance is stated Ψ(Xz) = [Ψ(X)]z.

The second principle of mathematical morphology is compatibility with change of scale. A transformation of an object may not change the object's structure. A morphological transformation Ψ(X) is valid if, for some scaling factor λ, the following transformation is true: Ψλ(X) = λΨ(λ⁻¹X). The third principle is local knowledge. Since objects are viewed through image frames or windows, it is necessary that any window M* contain enough information (local knowledge as opposed to global knowledge) such that the transformation Ψ(X) may be performed: [Ψ(X ∩ M)] ∩ M* = Ψ(X) ∩ M*. The fourth principle is semicontinuity. If A is included in B then Ψ(A) should be included in Ψ(B). If Xn is a sequence of closed objects tending toward the limit image of object X, and Ψ(Xn) is the sequence of the transformed objects, then the transformations are semicontinuous if the sequence of transformed objects tends to Ψ(X). These four principles are fundamental to many of the image-processing and analysis functions. Two of the most basic mathematical morphological operations are erosion and dilation. These are based upon Minkowski set addition and subtraction.
If the image (A) is probed by the structuring element (B), the erosion of set A by set B is defined by A ⊖ B = {x : Bx ⊂ A}, where Bx is the translate of B by the vector x.
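A minimal sketch of binary erosion with a 3 × 3 square structuring element, following the definition just given (an output pixel survives only if the translated structuring element fits entirely inside the object), together with dilation, which is defined next in the text. This is implemented directly in numpy rather than with a dedicated morphology library, and the test object is an assumption for illustration.

```python
# Illustrative binary erosion (A minus B) and dilation with a square structuring element.
import numpy as np

def erode(binary: np.ndarray, size: int = 3) -> np.ndarray:
    """A pixel is kept only if every pixel under the size x size element is set (zero padding outside)."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.ones_like(binary)
    for di in range(size):
        for dj in range(size):
            out &= padded[di:di + binary.shape[0], dj:dj + binary.shape[1]]
    return out

def dilate(binary: np.ndarray, size: int = 3) -> np.ndarray:
    """A pixel is set if any pixel under the size x size element is set."""
    pad = size // 2
    padded = np.pad(binary, pad, mode="constant", constant_values=0)
    out = np.zeros_like(binary)
    for di in range(size):
        for dj in range(size):
            out |= padded[di:di + binary.shape[0], dj:dj + binary.shape[1]]
    return out

if __name__ == "__main__":
    img = np.zeros((7, 7), dtype=int)
    img[2:5, 2:5] = 1              # a 3 x 3 square object (assumed test pattern)
    print(erode(img).sum())        # 1: only the center pixel survives erosion
    print(dilate(img).sum())       # 25: the object grows to a 5 x 5 square
```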
In mathematical morphology the complementary operation to erosion is dilation. The dilation of set A by structuring element B is defined by A ⊕ B = [Aᶜ ⊖ (−B)]ᶜ, where Aᶜ is the set theoretic complement of A.

Image Enhancement

With the background just described on some of the principles of mathematical morphology we can now examine some basics of image processing and analysis. Simple image-processing techniques to change the brightness and contrast enhancement may be performed on look-up tables (LUT). Input lookup tables allow manipulation of the image data prior to saving the image in digital format. To utilize an input LUT, a frame grabber is needed. The output LUT displays the data after storage in digital format in a frame buffer or storage device. In an 8-bit gray-scale image, a level of zero (0) represents black and a level 255 represents white. In Fig. 11 the SEM image in the upper left is represented by the histogram (output LUT) in the lower left. The histogram shows that the image does not contain gray values at the upper or lower end. If the histogram is adjusted to make the levels more uniform as represented in the middle histogram, this stretches the range of gray values from 0 to 255. The resulting image and stretched histogram are represented in the upper and lower right frames. It is important to understand that the processed image on the right does not have greater resolution and is not inherently a better image than the one on the left. Because human vision is very sensitive to differences in contrast, the image on the right appears better but is not actually improved.

Figure 11. Scanning electron micrograph of cultured cells. The histograms represent gray-scale distribution. Left, original image, which does not cover the entire gray-scale range. Right, image obtained when gray-scale histogram is stretched to cover the entire range. See text for detailed explanation.

Filtering Methods—Noise Reduction. Images collected under low illumination conditions may have a poor signal-to-noise ratio. The noise in an image may be reduced using image-averaging techniques during the image acquisition phase. By using a frame grabber and capturing and averaging multiple frames (e.g., 16 to 32 frames), one can increase the information in the image and decrease the noise. Cooled CCD cameras have a better signal-to-noise ratio than noncooled CCD cameras. Utilizing spatial filters may also decrease noise in a digital image. Filters such as averaging and Gaussian filters will reduce noise but also cause some blurring of the image, so their use on high-resolution images is usually not acceptable. Median filters cause minimal blurring of the image and may be acceptable for some electron microscopic images. These filters use a kernel such as a 3 × 3 or 5 × 5 to replace the central or target pixel luminance value with the median value of the neighboring pixels. Periodic noise in an image may be removed by editing a 2D fast Fourier transform (FFT). A forward FFT of the image in Fig. 12 allows one to view the periodic noise (center panel) in an image. This noise, as indicated
Figure 12. FFT analysis of image performed using Optimas.
by the white box, may be edited from the image and then an inverse Fourier transform performed to restore the image without the noise (right panel). Image Analysis After adjustment for contrast and brightness and for noise, the next phase of the process is feature identification and classification. Most image data may be classified into areas that feature closed boundaries (e.g., a cell), points (discrete solid points or objects that may be areas), and linear data. For objects to be identified they must be segmented and isolated from the background. It is often useful to convert a grayscale image to binary format (all pixel values set to 0 or 1). Techniques such as image segmentation and edge detection are easily carried out on binary images but may also be performed on gray-scale or color images. Thresholding. The simplest method for segmenting an image is to use thresholding techniques. Thresholding may be performed on monochrome or color images. For monochrome images, pixels within a particular gray-scale range or value may be displayed on a computer monitor and the analysis performed on the displayed pixels. Greater discrimination may be achieved using color images. Image segmentation may be achieved based upon red, green, and blue (RGB) values in the image; a more powerful method is to use hue, saturation, and intensity (HSI). The HSI method of color discrimination is closer to that of the human brain. Depending upon the image, it may be possible to achieve discrete identification of objects based on threshold alone. However, for gray-scale electron microscopic images, thresholding may only work for simple specimens (i.e., count negatively stained virus particles). It is seldom sufficient to identify the objects of interest in complex specimens (sections of a cell or tissue) because many disparate objects often share similar gray-scale levels. Another method to segment the image is edge detection. Edge-detection filters such as Laplace, Sobel, and Roberts will detect and enhance individual objects. However, these filters may still leave some objects with discontinuous boundaries and will fail to separate objects that touch each other. Algorithms may be used to connect nearby edges, and watershed filters may be used to separate touching objects. When images
are segmented it is then useful to use additional parameters to classify the objects. One useful parameter is shape. The shape factor of an object is defined as 4π × (area)/(perimeter)². A perfect circle will have a shape factor of 1. Departure from circularity (e.g., oval or irregular border) will lower the shape factor. The shape factor can also be expressed as (perimeter)²/(area). With this formula, a circle has a minimum value of 4π and the factor becomes larger for noncircular objects. This is a useful criterion for distinguishing cancer cells from normal cells. Cancer cells or their cell nuclei usually are less circular than normal cells. In addition to shape, size features such as minimum and maximum diameter or area may be included. Enumeration. One of the simplest procedures in image analysis is the counting of objects. If a digital image is displayed in any of a variety of image analysis software programs, one has the option of identifying a region of interest (ROI) for analysis. In most circumstances it is important that the area chosen for analysis be derived via systematic sampling, and not because the area has the features of interest. The entire image may be the ROI, or a polygonal or irregular region may be specified. The bounding box of the image in Fig. 13 is the ROI. If a threshold is set using a software program for image analysis, the features will be identified via the RGB or grayscale color values and a boundary identified for the objects. The objects are then counted, omitting any that touch the ROI, since these may be partial profiles. In the preceding example the number of profiles is 26. Up to this point, the discussions of image analysis are those applied to model-based systems. In model-based systems, algorithms are used to extract the quantitative data. However, these morphological operations make certain assumptions about the nature of the objects being measured, and therefore have some inherent bias in the measurement process. Image analysis based upon modern stereological methods utilizes probabilistic geometry to extract quantitative information from images. Because the data are extracted via systematic sampling and point counting, no assumptions are made as to the size or shape of the structures and the technique may therefore be considered an unbiased method of analysis.
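A small sketch of the shape-factor criterion described above, using the 4π·area/perimeter² form on two idealized profiles (a circle and an elongated rectangle). The dimensions are assumptions for illustration; real measurements would come from segmented image objects.

```python
# Illustrative shape factor 4*pi*area/perimeter^2: 1.0 for a perfect circle, lower for irregular profiles.
import math

def shape_factor(area: float, perimeter: float) -> float:
    return 4.0 * math.pi * area / perimeter ** 2

if __name__ == "__main__":
    r = 5.0                              # assumed circle radius (arbitrary units)
    circle = shape_factor(math.pi * r ** 2, 2 * math.pi * r)
    a, b = 20.0, 2.0                     # assumed elongated rectangle, e.g., a spindle-shaped profile
    rectangle = shape_factor(a * b, 2 * (a + b))
    print(f"circle: {circle:.2f}, elongated rectangle: {rectangle:.2f}")
    # circle -> 1.00; elongated rectangle -> ~0.26, i.e., departure from circularity lowers the factor
```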
Figure 13. (a)–(c) Bounding boxes depict the ROI.

Using the methods of modern stereology, an unbiased counting frame is placed over the image used for counting objects. Every profile completely within the counting frame is tallied, along with those that touch the upper and right dashed lines; profiles touching the solid lines are excluded from the tally. The result is 24 profiles counted [Fig. 13(b)]. If the size of the counting frame is known, the number of profiles may be expressed as a numerical density (the number of objects divided by the frame area). If we assume arbitrarily that the frame represents 10 µm², then the numerical density is QA = 24/10 µm² = 2.4 objects per µm².

After counting the number of objects, the next basic operation is to determine the size or area of the objects. If a counting frame with regularly spaced points is placed over Fig. 13(c), the mean areal fraction of the objects may be determined. A profile is counted if it covers the upper right quadrant of one of the counting points on the grid; the points are counted with no regard to their relationship to the counting frame. The areal fraction AA is defined as the number of points hitting profiles divided by the total number of points: AA = 16/96 = 0.166. This value may be used to determine the mean profile area, provided the area of the frame is known. The mean profile area is a = AA/QA, where QA is the numerical density of profiles obtained above, so a = 0.166/2.4 = 0.069 µm².
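The arithmetic of the counting-frame example above can be restated as a short calculation. The frame area (10 µm²), the tallies of 24 profiles and 16 of 96 points, and the resulting values are those given in the text; the code is only an illustrative restatement of that arithmetic.

```python
# Worked example of the unbiased counting-frame arithmetic from the text.
frame_area_um2 = 10.0        # assumed frame area (10 um^2)
profiles_counted = 24        # profiles inside the frame or touching dashed lines

# Numerical density Q_A = profiles per unit frame area
Q_A = profiles_counted / frame_area_um2          # 2.4 objects per um^2

# Areal fraction A_A = points hitting profiles / total grid points
points_hitting = 16
total_points = 96
A_A = points_hitting / total_points              # ~0.166

# Mean profile area a = A_A / Q_A
mean_profile_area = A_A / Q_A                    # ~0.069 um^2
print(f"Q_A = {Q_A:.2f} /um^2, A_A = {A_A:.3f}, "
      f"mean area = {mean_profile_area:.3f} um^2")
```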
Volume Estimation. The same principles of image analysis used to extract information from 2-D images may be extended and modified to extract data from 3-D images. In model-based image analysis, algorithms are used to estimate the volumes of structures. The volumes of structures may be determined from serial sections of material obtained through a confocal microscope (optical sections) or from plastic (epoxy)-embedded sections of tissues. If the thickness of a section is known, the volume may be determined by summing the areas of the segmented objects of interest. Some software programs are able to perform a seed fill of such structures so that the entire object is visualized, and algorithms are then used to estimate the volume. Instead of seed filling, a wire frame may be placed around an object and an isocontoured surface applied to the image; algorithms may then be used to estimate the surface area of the 3-D object.

Using design-based stereology, 3-D information may be extracted from 2-D sections of objects, as shown in Fig. 14(a). The Cavalieri method can be used to estimate the volume of compartments within a structure (e.g., an organ such as the lung or liver). Starting from a random point, the structure is sliced systematically. The slice interval is obtained by dividing the length of the structure by the desired number of slices: a 100 mm structure divided by 10 gives a slice interval of 10 mm. To determine the position of the first slice, a random number between 1 and 10 is chosen (e.g., 4 mm). After processing and embedding the slices, the areas of the slices may be determined by point-counting methods. The volume of the structure is V(struct) = ΣP(struct) · A(point) · n · t, where ΣP(struct) is the sum of all points falling on the slices, A(point) is the area associated with a test point at a magnification of one, n is the slice interval (every nth slice is analyzed), and t is the mean thickness of a slice.

The analysis may also be extended to three dimensions to estimate the numerical density of objects (e.g., cells) within a structure (tissue or organ) using an optical dissector. The dissector is a direct counting method. Thick plastic sections (approximately 40 µm to 50 µm) are examined under the light microscope, and as one slowly focuses down through the section, each cell nucleus that comes into focus is counted. The vertical movement of the stage is the height of the dissector.
Figure 14. (a) 3-D information extracted from 2-D sections of objects; (b) each numbered section (1–4) represents an optical section.
In Fig. 14(b), each numbered section represents an optical section. The volume of the tissue containing the cells is determined by multiplying the area of the dissector counting frame (e.g., 0.5 mm²) by the dissector height. The number of cells divided by the dissector volume gives the 3-D numerical density of the cells; the numerical density multiplied by the volume of the structure V(struct) then equals the total number of cells in the structure.
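A compact sketch of the Cavalieri and optical-dissector calculations described above is given below. The point counts, point area, dissector height, and cell count are invented solely for illustration; only the relations V(struct) = ΣP · A(point) · n · t and numerical density = cells/(frame area × dissector height) come from the text, and the 0.5 mm² frame area is the example value quoted there.

```python
# --- Cavalieri volume estimate (illustrative numbers) -------------------
points_per_slice = [110, 140, 160, 130, 90]  # hypothetical point counts per slice
A_point = 4.0    # assumed area per test point (mm^2) at a magnification of one
n = 10           # every nth slice is analyzed
t = 1.0          # assumed mean slice thickness (mm)

V_struct = sum(points_per_slice) * A_point * n * t       # mm^3

# --- Optical dissector numerical density (illustrative numbers) ---------
frame_area = 0.5         # dissector counting-frame area (mm^2), as in the text
dissector_height = 0.04  # assumed stage travel through the section (mm)
cells_counted = 12       # hypothetical nuclei counted while focusing through

dissector_volume = frame_area * dissector_height         # mm^3
numerical_density = cells_counted / dissector_volume     # cells per mm^3

# Total cells = numerical density x reference volume
total_cells = numerical_density * V_struct
print(f"V(struct) = {V_struct:.0f} mm^3, "
      f"N_V = {numerical_density:.0f} cells/mm^3, "
      f"total cells ~ {total_cells:.2e}")
```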
Data Storage for Image Analysis
Storage media are also undergoing rapid development. Storage on computer hard-disk drives should be considered only temporary until the images are archived. The most cost-effective and platform-portable storage method is archiving images onto recordable compact disk read-only memory (CD-ROM). Since almost all personal computers now have CD-ROM drives, this archival method provides accessibility for most users without the need for specialized equipment. Storing images in ISO-9660 format guarantees that they can be read on Microsoft Windows, UNIX, and Apple computers. CD-ROMs have an expected shelf life of 30 to 100 years, depending upon how they are used and stored. The successor to the CD-ROM is the digital versatile disk (DVD). Currently, DVD players cannot read CD-ROMs, because the wavelength of the DVD laser differs from that required for CD-ROM. Manufacturers may provide DVD readers with dual lasers to give some backward compatibility, but eventually data stored on CD-ROM may have to be transferred to other media. Of the tape storage solutions, digital linear tape (DLT) has the quickest access times and may be a possible solution for mass storage.
MEDICAL APPLICATIONS

Laser Scanning Ophthalmoscopy

Recently a near-infrared laser scanning ophthalmoscope was developed that can look deep into the macula, the central region of the retina where the high density of rods and cones gives the eye its highest resolution. The instrument uses a vertical-cavity surface-emitting laser (VCSEL) array operating at 850 nm and combines a regular ophthalmoscope with a laser confocal ophthalmoscope. The advantage of the near-infrared light is that it passes through the somewhat opaque lens common in patients with cataracts.

HARDWARE

Microscopes

Many companies manufacture microscopes, and all make relatively high-quality instruments. Some of the better known are Leica, Olympus, Nikon, and Zeiss; all of these companies currently manufacture confocal microscopes as well. Bio-Rad Microsciences, Meridian Instruments, Molecular Dynamics, Noran, Newport, and Optiscan all manufacture confocal microscopes based upon the above commercially available microscope systems.

J. PAUL ROBINSON
JOHN TUREK
Purdue University