Lecture Notes in Computer Science Commenced Publication in 1973 Founding and Former Series Editors: Gerhard Goos, Juris Hartmanis, and Jan van Leeuwen
Editorial Board David Hutchison Lancaster University, UK Takeo Kanade Carnegie Mellon University, Pittsburgh, PA, USA Josef Kittler University of Surrey, Guildford, UK Jon M. Kleinberg Cornell University, Ithaca, NY, USA Friedemann Mattern ETH Zurich, Switzerland John C. Mitchell Stanford University, CA, USA Moni Naor Weizmann Institute of Science, Rehovot, Israel Oscar Nierstrasz University of Bern, Switzerland C. Pandu Rangan Indian Institute of Technology, Madras, India Bernhard Steffen University of Dortmund, Germany Madhu Sudan Massachusetts Institute of Technology, MA, USA Demetri Terzopoulos University of California, Los Angeles, CA, USA Doug Tygar University of California, Berkeley, CA, USA Moshe Y. Vardi Rice University, Houston, TX, USA Gerhard Weikum Max-Planck Institute of Computer Science, Saarbruecken, Germany
4319
Long-Wen Chang Wen-Nung Lie (Eds.)
Advances in Image and Video Technology First Pacific Rim Symposium, PSIVT 2006 Hsinchu, Taiwan, December 10-13, 2006 Proceedings
Volume Editors Long-Wen Chang National Tsing Hua University Department of Computer Science Hsinchu, 300 , Taiwan E-mail:
[email protected] Wen-Nung Lie National Chung Cheng University Department of Electrical Engineering Chia-Yi, 621 Taiwan E-mail:
[email protected]
Library of Congress Control Number: 2006937888
CR Subject Classification (1998): H.5.1, H.5, I.4, I.3, H.3-4, E.4
LNCS Sublibrary: SL 6 – Image Processing, Computer Vision, Pattern Recognition, and Graphics
ISSN 0302-9743
ISBN-10 3-540-68297-X Springer Berlin Heidelberg New York
ISBN-13 978-3-540-68297-4 Springer Berlin Heidelberg New York
This work is subject to copyright. All rights are reserved, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, re-use of illustrations, recitation, broadcasting, reproduction on microfilms or in any other way, and storage in data banks. Duplication of this publication or parts thereof is permitted only under the provisions of the German Copyright Law of September 9, 1965, in its current version, and permission for use must always be obtained from Springer. Violations are liable to prosecution under the German Copyright Law. Springer is a part of Springer Science+Business Media springer.com © Springer-Verlag Berlin Heidelberg 2006 Printed in Germany Typesetting: Camera-ready by author, data conversion by Scientific Publishing Services, Chennai, India Printed on acid-free paper SPIN: 11949534 06/3142 543210
Preface
The First IEEE Pacific-Rim Symposium on Image and Video Technology (PSIVT 2006) was held in Hsinchu, Taiwan, Republic of China, on December 11–13, 2006. This volume contains the papers selected for presentation at the conference. The aim of the symposium was to bring together theoretical advances and practical implementations contributing to, or involved in, image and video technology.

PSIVT 2006 featured a comprehensive program including tutorials, keynote and invited talks, oral paper presentations, and posters. We received 450 submissions from 22 countries and accepted 141 of them, an acceptance rate of 31.3%. The intention was to establish PSIVT as a top-quality series of symposia. Decisions were sometimes difficult, but we hope that the final result is acceptable to all involved. Besides keynotes and invited talks, PSIVT 2006 offered 76 oral presentations and 58 posters, counting the papers properly registered by the defined deadline.

We deeply appreciate the help of the reviewers, who generously spent their time to ensure a high-quality reviewing process. Their comments, often quite detailed, certainly offered authors opportunities to improve their work not only for this conference, but also for future research. We thank Springer's LNCS department and the IEEE Circuits and Systems Society for the efficient communication during the preparation of the conference and these proceedings. Their support is greatly appreciated.

This conference would never have been completed successfully without the efforts of many people. We greatly appreciate the effort and cooperation of our strong Organizing Committee. We would also like to thank all the sponsors for their considerable support, including the National Tsing Hua University (NTHU), National Chung Cheng University (NCCU), The University of Auckland (UoA), National Science Council (NSC), Ministry of Education (MoE), Sunplus Technology Co., National Center for High-Performance Computing (NCHC), and Institute for Information Industry (III).

October 2006
Long-Wen Chang Wen-Nung Lie
PSIVT 2006 Organization
Organizing Committee

General Co-chairs
Yung-Chang Chen (National Tsing Hua Univ., Taiwan)
Reinhard Klette (The Univ. of Auckland, New Zealand)

Program Co-chairs
Long-Wen Chang (National Tsing Hua Univ., Taiwan)
Wen-Nung Lie (National Chung Cheng Univ., Taiwan)

Local Arrangements Co-chairs
Chaur-Chin Chen (National Tsing Hua Univ., Taiwan)
Shang-Hong Lai (National Tsing Hua Univ., Taiwan)

Publicity Chair
Chiou-Ting Hsu (National Tsing Hua Univ., Taiwan)

Finance Chair
Tai-Lang Jong (National Tsing Hua Univ., Taiwan)

Publication Chair
Rachel Chiang (National Chung Cheng Univ., Taiwan)

Tutorial and Invited Speaker Chairs
Chia-Wen Lin (National Chung Cheng Univ., Taiwan)

Award Co-chairs
Jin-Jang Leou (National Chung Cheng Univ., Taiwan)
Chia-Wen Lin (National Chung Cheng Univ., Taiwan)

Exhibition Chair
Chung-Lin Huang (National Tsing Hua Univ., Taiwan)

Exhibition Co-chairs
Chung-Lin Huang (National Tsing Hua Univ., Taiwan)
Charng-Long Lee (Sunplus Tech. Co., Taiwan)

Webmaster
Chao-Kuei Hsieh (National Tsing Hua Univ., Taiwan)

Steering Committee
Kap Luk Chan (Nanyang Technological Univ., Singapore)
Yung-Chang Chen (National Tsing Hua Univ., Taiwan)
Yo-Sung Ho (Gwangju Institute of Science and Tech., Korea)
Reinhard Klette (The Univ. of Auckland, New Zealand)
Mohan M. Trivedi (Univ. of California at San Diego, USA)
Theme Co-chairs 3D Scene Modeling Domingo Mery (Pontificia Universidad Catolica de Chile, Chile) Long Quan (The Hong Kong Univ. of Science and Technology, Hong Kong) Image Analysis Chia-Yen Chen (National Chung Cheng Univ., Taiwan) Kazuhiko Kawamoto (Kyushu Institute of Technology, Japan) Intelligent Vision Applications Kap Luk Chan (Nanyang Technological Univ., Singapore) Chin-Teng Lin (National Chiao-Tung Univ., Taiwan) Multimedia Compression and Transmission Ramakrishna Kakarala (Avago Technologies, USA) Shipeng Li (Microsoft Research Asia, China) Multimedia Signal Processing Yo-Sung Ho (Gwangju Institute of Science and Tech., Korea) Ya-Ping Wong (Multimedia University, Malaysia) Panoramic Imaging and Distributed Video Systems Kenichi Kanatani (Okayama University, Japan) Mohan M. Trivedi (Univ. of California at San Diego, USA) Sensors Technologies Charng-Long Lee (Sunplus Tech. Co., Taiwan) Y. Tim Tsai (ITRI, Taiwan) Visualization Masa Takatsuka (The Univ. of Sydney, Australia) Yangsheng Wang (CBSR, CASIA, China)
Program Committee 3D Scene Modeling Chi-Fa Chen (I-Shou Univ., Taiwan) Boris Escalante (Univ. Autonoma de Mexico, Mexico) Andre Gagalowicz (INRIA, France) Nancy Hitschfeld (Universidad de Chile, Chile) Yi-Ping Hung (National Taiwan Univ., Taiwan) Atsushi Imiya (Chiba Univ., Japan) Nahum Kiryati (Tel Aviv Univ., Israel) Brendan McCane (Univ. of Otago, New Zealand) Domingo Mery (Pontificia Universidad Catolica de Chile, Chile) Luis Pizarro (Saarland Univ., Germany) Long Quan (The Hong Kong Univ. of Science and Technology, Hong Kong) Fernando Rannou (Univ. Santiago de Chile, Chile) Bodo Rosenhahn (MPI, Germany) Luis Rueda (Univ. Concepcion, Chile) Hideo Saito (Keio Univ., Japan) Robert Valkenburg (Industrial Research Ltd., New Zealand) Image Analysis Ruey-Feng Chang (National Taiwan Univ., Taiwan) Chia-Yen Chen (The Univ. of Auckland, New Zealand) Sei-Wang Chen (National Taiwan Normal Univ., Taiwan) Li Chen (Univ. of the District of Columbia, USA) Hidekata Hontani (Nagoya Institute of Technology, Japan) Norikazu Ikoma (Kyushu Institute of Technology, Japan) Xiaoyi Jiang (Univ. of M¨ unster, Germany) Kazuhiko Kawamoto (Kyushu Institute of Technology, Japan) Yukiko Kenmochi (French Natl. Center for Scientific Research, France) Yoon Ho Kim (Mokwon Univ., Korea) Anthony Maeder (CSIRO, ICT Centre, Australia) Akira Nakamura (Hiroshima, Japan) Hajime Nobuhara (Univ. of Tsukuba, Japan) Nicolai Petkov (Groningen Univ., Netherlands) Gerd Stanke (GFaI, Germany) Akihiro Sugimoto (National Instutite of Informatics, Japan) Yung-Nien Sun (National Cheng Kung Univ., Taiwan) Toru Tamaki (Hiroshima Univ., Japan) Evelyn L. Tan (Univ. of the Philippines, Quezon City, Philippines) Petra Wiederhold (CINVESTAV, Mexico) Huabei Zhou (Wuhan Univ., China)
Intelligent Vision Applications Jacky Baltes (Univ. of Manitoba, Canada) John Barron (Univ. of Western Ontario, Canada) Chris Bowman (Industrial Research Ltd., New Zealand) Thomas Braunl (The Univ. of Western Australia, Australia) Kap Luk Chan (Nanyang Technological Univ., Singapore) How-lung Eng (Institute of Infocomm. Research, Singapore) Uwe Franke (DaimlerChrysler AG, Machine Perception, Germany) Jessie Jin (Univ. of Newcastle, Australia) Ron Kimmel (Technion, Israel) Chin-Teng Lin (National Chiao Tung Univ., Taiwan) Brian Lovell (The Univ. of Queensland, Australia) Herbert Suesse (Jena Univ., Germany) Shuicheng Yan (Univ. of Illinois at Urbana-Champaign, USA) Su Yang (Fudan Univ., China) Wei Yun Yau (Institute of Infocomm. Research, Singapore) Multimedia Compression and Transmission Mei-Juan Chen (National Dong-Hwa Univ., Taiwan) Markus Flierl (Stanford Univ., USA) Wenjen Ho (Institutes of Information Industry, Taiwan) Ramakrishna Kakarala (Avago Technologies, USA) Andreas Koschan (Univ. of Tennessee, Knoxville, USA) Chang-Ming Lee (National Chung Cheng Univ., Taiwan) Shipeng Li (Microsoft Research Asia, China) Vincenzo Liguore (Ocean Logic Pty Ltd., Australia) Yan Lu (Microsoft Research Asia, China) Lei Ming (TVIA Inc., USA) Philip Ogunbona (Univ. of Wollongong, Australia) Volker Rodehorst (Tenchnische Universit¨ at Berlin, Germany) Gary Sullivan (Microsoft Corporation, USA) Alexis M. Tourapis (Dolby Corporation, USA) Feng Wu (Microsoft Research Asia, China) Chia-Hung Yeh (National Dong-Hwa Univ., Taiwan) Lu Yu (Zhejiang Univ., China) Bing Zeng (The Hong Kong Univ. of Science and Technology, Hong Kong) Multimedia Signal Processing Oscar Au (The Hong Kong Univ. of Science and Technology, Hong Kong) Berlin Chen (National Taiwan Normal Univ., Taiwan) Yo-Sung Ho (Gwangju Institute of Science and Tech., Korea) JunWei Hsieh (Yuan-Ze Univ., Taiwan) Hideaki Kimata (NTT, Japan) Yung-Lyul Lee (Sejong Univ., Korea)
Xuelong Li (Univ. of London, UK) Aljoscha Smolic (Fraunhofer-HHI, Germany) Kwanghoon Sohn (Yonsei Univ., Korea) Yeping Su (Thomson Co., USA) Masayuki Tanimoto (Nagoya Univ., Japan) Ya-Ping Wong (Multimedia Univ., Malaysia) Marcel Worring (Univ. of Amsterdam, Netherlands) Panoramic Imaging and Distributed Video Systems Narendra Ahuja (Beckman Institute, Univ. of Illinois, USA) Elli Angelopoulou (Stevens Inst. of Technology, USA) Naoki Chiba (SANYO North America Corporation, USA) Tomio Echigo (Osaka Univ., Japan) Tarak Gandhi (CVRR, USA) Fay Huang (National Yi-Lan Univ., Taiwan) Kohsia Huang (CVRR, USA) Naoyuki Ichimura (AIST, Japan) Kenichi Kanatani (Okayama Univ., Japan) Sangho Park (CVRR, USA) Shmuel Peleg (Hebrew Univ., Israel) Richard Radke (Rensselaer Polytechnic Institute, USA) Nobutaka Shimada (Ritsumeikan Univ., Japan) Mohan Trivedi (CVRR, USA) Sensor Technologies Anko Boerner (German Aerospace Center (DLR), Germany) Oscal T.-C. Chen (National Chung Cheng Univ., Taiwan) Chiou-Shann Fuh (National Taiwan Univ., Taiwan) Herbert Jahn (German Aerospace Center (DLR), Germany) Charng-Long Lee (Sunplus Inc., Taiwan) C.F. Lin (Yuan Zhe Univ., Taiwan) Y.T. Liu (Altek Inc., Taiwan) Bruce MacDonald (The Univ. of Auckland, New Zealand) John Morris (The Univ.of Auckland, New Zealand) Ralf Reulke (Humboldt Univ., Germany) Martin Scheele (German Aerospace Center (DLR), Germany) Ewe Hong Tat (Multimedia Univ., Malaysia) Y. Tim Tsai (ITRI, Taiwan) Sheng-Jyh Wang (National Chiao Tung Univ., Taiwan) David Yuen (Univ. of Southampton, UK) Visualization Thomas Buelow (Philips Research Lab., Germany) Patrice Delmas (The Univ. of Auckland, New Zealand) David Dagan Feng (Hong Kong Polytech. Univ., Hong Kong)
Richard Green (Univ. of Canterbury, New Zealand) Reinhard Koch (Christian-Albrechts-Univ. of Kiel, Germany) Damon Shing-Min Liu (National Chung Cheng Univ., Taiwan) Ngoc-Minh Le (HCMC Univ. of Technology, Vietnam) Andrew Vande Moere (Univ. of Sydney, Australia) Shigeru Owada (Sony CSL, Japan) Masa Takatsuka (The Univ. of Sydney, Australia) Matthias Teschner (Freiburg Univ., Germany) Yangsheng Wang (CBSR, CASIA, China) Michael Wilkinson (Groningen Univ., Netherlands) Jason Wood (The Univ. of Leeds, UK)
Reviewers Narendra Ahuja Kiyoharu Aizawa Elli Angelopoulou Mohsen Ashourian Oscar Au Jacky Baltes John Barron Anko Boerner Chris Bowman Thomas Buelow Hyeran Byun Kap Luk Chan Chuan-Yu Chang I-Cheng Chang Kevin Y.-J. Chang Long-Wen Chang Berlin Chen Chaur-Chin Chen Chi-Fa Chen Chia-Yen Chen Chih-Ming Chen Ling-Hwei Chen Jiann-Jone Chen Li Chen Mei-Juan Chen Oscal T.-C. Chen Sei-Wang Chen Shu-Yuan Chen Yong-Sheng Chen Fang-Hsuan Cheng Shyi-Chyi Cheng
Cheng-Chin Chiang Rachel Chiang Naoki Chiba Yoon Sik Choe Chun-Hsien Chou Cheng-Hung Chuang Jen-Hui Chuang Pau-Choo Chung Patrice Delmas Tomio Echigo How-lung Eng Boris Escalante Chih-Peng Fan Chiung-Yao Fang Markus Flierl Uwe Franke Chiou-Shann Fuh Zhi-Wei Gao Richard Green Nobuhara Hajime Chin-Chuan Han Hsueh-Ming Hang Yutaka Hatakeyama MahmoudReza Hejazi Nancy Hitschfeld Wenjen Ho Jin Woo Hong Hidekata Hontani Yea-Shuan Huang Hsu-Feng Hsiao JunWei Hsieh
Chiou-Ting Hsu Spencer Y.Hsu Naoyuki Ichimura Norikazu Ikoma Atsushi Imiya Herbert Jahn Byeungwoo Jeon Hong Jeong Xiaoyi Jiang Jesse Jin Youngki Jung Ramakrishna Kakarala Yasushi Kanazawa Li Wei Kang Kazuhiko Kawamoto Hideaki Kawano Yukiko Kenmochi Chnag-Ik Kim Dongsik Kim Hae-Kwang Kim Hyoung Joong Kim Jae-Gon Kim Jin Woong Kim Munchurl Kim Sang-Kyun Kim Yong Han Kim Yoon Ho Kim Ron Kimmel Nahum Kiryati Reinhard Koch Andreas Koschan
Ki Ryong Kwon Shang-Hong Lai Ngoc-Minh Le Chang-Ming Lee Charng-Long Lee Jia-Hong Lee Pei-Jun Lee Sangyoun Lee Yung-Lyul Lee Shipeng Li Mark Liao Wen-Nung Lie Cheng-Chang Lien Jenn-Jier James Lien Vincenzo Liguore Chia-Wen Lin Chin-Teng Lin Guo-Shiang Lin Huei-Yung Lin Shin-Feng Lin Tom Lin Damon Shing-Min Liu Yuante Liu Chun Shien Lu Meng-Ting Lu Yan Lu Bruce MacDonald Brendan McCane Domingo Mery Shaou-Gang Miaou Andrew Vande Moere Kyung Ae Moon Ken’ichi Morooka John Morris
Akira Nakamura Philip Ogunbona Seung-Jun Oh Shigeru Owada Jeng-Shyang Pan Hyun Wook Park Jong-II Park Shmuel Peleg Nicolai Petkov Luis Pizarro Richard Radke Fernando Rannou Ralf Reulke Volker Rodehorst Bodo Rosenhahn Luis Rueda Gary Sullivan Hideo Saito Tomoya Sakai Day-Fann Shen Jau-Ling Shih Nobutaka Shimada Donggyu Sim Kwanghoon Sohn Hwang-Jun Song Gerd Stanke Mu-Chun Su Po-Chyi Su Akihiro Sugimoto Jae-Won Suh Sang Hoon Sull Hung-Min Sun Yung-Nien Sun Aramvith Supavadee
Alexis M. Tourapis Toru Tamaki Ewe Hong Tat Matthias Teschner C.-J. Tsai Cheng-Fa Tsai Chia-Ling Tsai Y. Tim Tsai Chien-Cheng Tseng Din-Chang Tseng Robert Valkenburg Chieh-Chih Wang Sheng-Jyh Wang Wen-Hao Wang Yuan-Kai Wang Michael Wilkinson Chee Sun Won Jason Wood Feng Wu Hsien-Huang P. Wu Shuicheng Yan Jar-Ferr Yang Mau-Tsuen Yang Su Yang Zhi-Fang Yang Wei Yun Yau Chia-Hung Yeh Takanori Yokoyama Lu Yu Sung-Nien Yu David Yuen Huabei Zhou Zhi Zhou
Sponsoring Institutions National Tsing Hua University (NTHU) National Chung Cheng University (NCCU) The University of Auckland (UoA) National Science Council (NSC) Ministry of Education (MoE) Sunplus Technology Co. National Center for High-Performance Computing (NCHC) Institute for Information Industry (III)
Table of Contents
3D Scene Modeling Error Analysis of Feature Based Disparity Estimation . . . . . . . . . . . . . . Patrick A. Mikulastik, Hellward Broszio, Thorsten Thormählen, and Onay Urfalioglu
1
Collinearity and Coplanarity Constraints for Structure from Motion . . . . Gang Liu, Reinhard Klette, and Bodo Rosenhahn
13
Reconstruction of Building Models with Curvilinear Boundaries from Laser Scanner and Aerial Imagery . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang-Chien Chen, Tee-Ann Teo, Chi-Heng Hsieh, and Jiann-Yeou Rau The Generation of 3D Tree Models by the Integration of Multi-sensor Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Liang-Chien Chen, Tee-Ann Teo, and Tsai-Wei Chiang LOD Generation for 3D Polyhedral Building Model . . . . . . . . . . . . . . . . . . Jiann-Yeou Rau, Liang-Chien Chen, Fuan Tsai, Kuo-Hsin Hsiao, and Wei-Chen Hsu Octree Subdivision Using Coplanar Criterion for Hierarchical Point Simplification . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Pai-Feng Lee, Chien-Hsing Chiang, Juin-Ling Tseng, Bin-Shyan Jong, and Tsong-Wuu Lin Dimension Reduction in 3D Gesture Recognition Using Meshless Parameterization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yunli Lee, Dongwuk Kyoung, Eunjung Han, and Keechul Jung Target Calibration and Tracking Using Conformal Geometric Algebra . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yilan Zhao, Robert Valkenburg, Reinhard Klette, and Bodo Rosenhahn
24
34
44
54
64
74
Robust Pose Estimation with 3D Textured Models . . . . . . . . . . . . . . . . . . . Juergen Gall, Bodo Rosenhahn, and Hans-Peter Seidel
84
Multi-scale 3D-Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Karsten Scheibe, Michael Suppa, Heiko Hirschmüller, Bernhard Strackenbrock, Fay Huang, Rui Liu, and Gerd Hirzinger
96
Image Analysis Boundary Based Orientation of Polygonal Shapes . . . . . . . . . . . . . . . . . . . . Joviša Žunić
108
LPP and LPP Mixtures for Graph Spectral Clustering . . . . . . . . . . . . . . . . Bin Luo and Sibao Chen
118
The Invariance Properties of Chromatic Characteristics . . . . . . . . . . . . . . . Yun-Chung Chung, Shyang-Lih Chang, Shen Cherng, and Sei-Wang Chen
128
A Scale Invariant Surface Curvature Estimator . . . . . . . . . . . . . . . . . . . . . . John Rugis and Reinhard Klette
138
N -Point Hough Transform Derived by Geometric Duality . . . . . . . . . . . . . Yoshihiko Mochizuki, Akihiko Torii, and Atsushi Imiya
148
Guided Importance Sampling Based Particle Filtering for Visual Tracking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kazuhiko Kawamoto
158
Intelligent Spot Detection for 2-DE Gel Image . . . . . . . . . . . . . . . . . . . . . . . Yi-Sheng Liu, Shu-Yuan Chen, Ya-Ting Chao, Ru-Sheng Liu, Yuan-Ching Tsai, and Jaw-Shu Hsieh
168
Gaze Estimation from Low Resolution Images . . . . . . . . . . . . . . . . . . . . . . . Yasuhiro Ono, Takahiro Okabe, and Yoichi Sato
178
Image Similarity Comparison Using Dual-Tree Wavelet Transform . . . . . . Mong-Shu Lee, Li-yu Liu, and Fu-Sen Lin
189
A Novel Supervised Dimensionality Reduction Algorithm for Online Image Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fengxi Song, David Zhang, Qinglong Chen, and Jingyu Yang
198
Feature Selection Based on Run Covering . . . . . . . . . . . . . . . . . . . . . . . . . . . Su Yang, Jianning Liang, Yuanyuan Wang, and Adam Winstanley
208
Automated Detection of the Left Ventricle from 4D MR Images: Validation Using Large Clinical Datasets . . . . . . . . . . . . . . . . . . . . . . . . . . . . Xiang Lin, Brett Cowan, and Alistair Young
218
Two Thresholding for Deriving the Bi-level Document Image . . . . . . . . . . Yu-Kumg Chen and Yi-Fan Chang
228
A Novel Merging Criterion Incorporating Boundary Smoothness and Region Homogeneity for Image Segmentation . . . . . . . . . . . . . . . . . . . . Zhi-Gang Tan, Xiao-Chen He, and Nelson H.C. Yung
238
A New Local-Feature Framework for Scale-Invariant Detection of Partially Occluded Objects . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Andrzej Sluzek
248
Color Correction System Using a Color Compensation Chart for the Images from Digital Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Seok-Han Lee, Sang-Won Um, and Jong-Soo Choi
258
A Comparison of Similarity Measures for 2D Rigid MR Image Registration Using Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shutao Li, Shengchu Deng, and Jinglin Peng
270
Finding the Shortest Path Between Two Points in a Simple Polygon by Applying a Rubberband Algorithm . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Fajie Li and Reinhard Klette
280
Facial Expressions Recognition in a Single Static as well as Dynamic Facial Images Using Tracking and Probabilistic Neural Networks . . . . . . . Hadi Seyedarabi, Won-Sook Lee, Ali Aghagolzadeh, and Sohrab Khanmohammadi
292
Application of 3D Co-occurrence Features to Terrain Classification . . . . . Dong-Min Woo, Dong-Chul Park, Seung-Soo Han, and Quoc-Dat Nguyen
305
(2D)2 DLDA for Efficient Face Recognition . . . . . . . . . . . . . . . . . . . . . . . . . . Dong-uk Cho, Un-dong Chang, Kwan-dong Kim, Bong-hyun Kim, and Se-hwan Lee
314
Automatic Skull Segmentation and Registration for Tissue Change Measurement After Mandibular Setback Surgery . . . . . . . . . . . . . . . . . . . . . Jeongjin Lee, Namkug Kim, Ho Lee, Suk-Ho Kang, Jae-Woo Park, and Young-Il Chang
322
Voting Method for Stable Range Optical Flow Computation . . . . . . . . . Atsushi Imiya and Daisuke Yamada
332
Shadow Removal for Foreground Segmentation . . . . . . . . . . . . . . . . . . . . . . Kuo-Hua Lo, Mau-Tsuen Yang, and Rong-Yu Lin
342
A Unified Approach for Combining ASM into AAM . . . . . . . . . . . . . . . . . . Jaewon Sung and Daijin Kim
353
Variational Approach to Cardiac Motion Estimation for Small Animals in Tagged Magnetic Resonance Imaging . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hsun-Hsien Chang
363
Performance Assessment of Image Fusion . . . . . . . . . . . . . . . . . . . . . . . . . . . Qiang Wang and Yi Shen
373
Possibilistic C-Template Clustering and Its Application in Object Detection in Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tsaipei Wang
383
Position Estimation of Solid Balls from Handy Camera for Pool Supporting System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hideaki Uchiyama and Hideo Saito
393
Shape Retrieval Using Statistical Chord-Length Features . . . . . . . . . . . . . . Chaojian Shi and Bin Wang
403
Multifocus Image Sequences for Iris Recognition . . . . . . . . . . . . . . . . . . . . . Byungjun Son, Sung-Hyuk Cha, and Yillbyung Lee
411
Projection Method for Geometric Modeling of High Resolution Satellite Images Applying Different Approximations . . . . . . . . . . . . . . . . . . . . . . . . . . Fahim Arif, Muhammad Akbar, and An-Ming Wu
421
Intelligent Vision Applications Global Localization of Mobile Robot Using Omni-directional Image Correlation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sukjune Yoon, Woosup Han, Seung Ki Min, and Kyung Shik Roh
433
Computer-Aided Vision System for MURA-Type Defect Inspection in Liquid Crystal Displays . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong-Dar Lin and Singa Wang Chiu
442
Effective Detector and Kalman Filter Based Robust Face Tracking System . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi-Young Seong, Byung-Du Kang, Jong-Ho Kim, and Sang-Kyun Kim
453
Recognizing Multiple Billboard Advertisements in Videos . . . . . . . . . . . . . Naoyuki Ichimura
463
Occlusion Detection and Tracking Method Based on Bayesian Decision Theory . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yan Zhou, Bo Hu, and Jianqiu Zhang
474
Pedestrian Recognition in Far-Infrared Images by Combining Boosting-Based Detection and Skeleton-Based Stochastic Tracking . . . . . Ryusuke Miyamoto, Hiroki Sugano, Hiroaki Saito, Hiroshi Tsutsui, Hiroyuki Ochi, Ken’ichi Hatanaka, and Yukihiro Nakamura Unsupervised Texture Segmentation Based on Watershed and a Novel Artificial Immune Antibody Competitive Network . . . . . . . . . . . . . . . . . . . . Wenlong Huang and Licheng Jiao
483
495
A Novel Driving Pattern Recognition and Status Monitoring System . . . Jiann-Der Lee, Jiann-Der Li, Li-Chang Liu, and Chi-Ming Chen
504
Advances on Automated Multiple View Inspection . . . . . . . . . . . . . . . . . . . Domingo Mery and Miguel Carrasco
513
Target Tracking and Positioning on Video Sequence from a Moving Video Camera . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chi-Farn Chen and Min-Hsin Chen
523
Radiometrically-Compensated Projection onto Non-Lambertian Surface Using Multiple Overlapping Projectors . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hanhoon Park, Moon-Hyun Lee, Byung-Kuk Seo, Hong-Chang Shin, and Jong-Il Park
534
Motion Detection in Complex and Dynamic Backgrounds . . . . . . . . . . . . . Daeyong Park, Junbeom Kim, Jaemin Kim, Seongwon Cho, and Sun-Tae Chung
545
Effective Face Detection Using a Small Quantity of Training Data . . . . . . Byung-Du Kang, Jong-Ho Kim, Chi-Young Seong, and Sang-Kyun Kim
553
A New Passage Ranking Algorithm for Video Question Answering . . . . . Yu-Chieh Wu, Yue-Shi Lee, Jie-Chi Yang, and Show-Jane Yen
563
Fusion of Luma and Chroma GMMs for HMM-Based Object Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wen-Hao Wang and Ruei-Cheng Wu
573
Undistorted Projection onto Dynamic Surface . . . . . . . . . . . . . . . . . . . . . . . Hanhoon Park, Moon-Hyun Lee, Byung-Kuk Seo, and Jong-Il Park
582
Robust Moving Object Detection on Moving Platforms . . . . . . . . . . . . . . . Ming-Yu Shih and Bwo-Chau Fu
591
Automatic Selection and Detection of Visual Landmarks Using Multiple Segmentations . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Daniel Langdon, Alvaro Soto, and Domingo Mery
601
Hybrid Camera Surveillance System by Using Stereo Omni-directional System and Robust Human Detection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Kenji Iwata, Yutaka Satoh, Ikushi Yoda, and Katsuhiko Sakaue
611
Robust TV News Story Identification Via Visual Characteristics of Anchorperson Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chia-Hung Yeh, Min-Kuan Chang, Ko-Yen Lu, and Maverick Shih
621
Real-Time Automatic Calibration for Omni-display in Ubiquitous Computing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dongwuk Kyoung, Yunli Lee, Eunjung Han, and Keechul Jung
631
Face Occlusion Detection for Automated Teller Machine Surveillance . . . Daw-Tung Lin and Ming-Ju Liu
641
Automatic Pose-Normalized 3D Face Modeling and Recognition Systems . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Sunjin Yu, Kwontaeg Choi, and Sangyoun Lee
652
Vision-Based Game Interface Using Human Gesture . . . . . . . . . . . . . . . . . . Hye Sun Park, Do Joon Jung, and Hang Joon Kim
662
Terminal Phase Vision-Based Target Recognition and 3D Pose Estimation for a Tail-Sitter, Vertical Takeoff and Landing Unmanned Air Vehicle . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Allen C. Tsai, Peter W. Gibbens, and R. Hugh Stone
672
Detection of Traffic Lights for Vision-Based Car Navigation System . . . . Tae-Hyun Hwang, In-Hak Joo, and Seong-Ik Cho
682
What Is the Average Human Face? . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . George Mamic, Clinton Fookes, and Sridha Sridharan
692
An Improved Real-Time Contour Tracking Algorithm Using Fast Level Set Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Myo Thida, Kap Luk Chan, and How-Lung Eng
702
Multimedia Compression and Transmission Dynamic Perceptual Quality Control Strategy for Mobile Multimedia Transmission Via Watermarking Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chin-Lun Lai
712
A Rate Control Using Adaptive Model-Based Quantization . . . . . . . . . . . . Seonki Kim, Tae-Jung Kim, Sang-Bong Lee, and Jae-Won Suh
722
The Design of Single Encoder and Decoder for Multi-view Video . . . . . . . Haksoo Kim and Manbae Kim
732
Vector Quantization in SPIHT Image Codec . . . . . . . . . . . . . . . . . . . . . . . . . Rafi Mohammad and Christopher F. Barnes
742
Objective Image Quality Evaluation for JPEG, JPEG 2000, and Vidware VisionTM . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chung-Hao Chen, Yi Yao, David L. Page, Besma Abidi, Andreas Koschan, and Mongi Abidi Efficient H.264 Encoding Based on Skip Mode Early Termination . . . . . . Tien-Ying Kuo and Hsin-Ju Lu
751
761
Low-Power H.264 Deblocking Filter Algorithm and Its SoC Implementation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Byung-Joo Kim, Jae-Il Koo, Min-Cheol Hong, and Seongsoo Lee
771
Rate Control Scheme for Video Coding Using Low-Memory-Cost Look-Up Table . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Linjian Mo, Chun Chen, Jiajun Bu, Xu Li, and Zhi Yang
780
Adaptive GOP Bit Allocation to Provide Seamless Video Streaming in Vertical Handoff . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dinh Trieu Duong, Hye-Soo Kim, Le Thanh Ha, Jae-Yun Jeong, and Sung-Jea Ko
791
Memory-Efficiency and High-Speed Architectures for Forward and Inverse DCT with Multiplierless Operation . . . . . . . . . . . . . . . . . . . . . . Tze-Yun Sung, Mao-Jen Sun, Yaw-Shih Shieh, and Hsi-Chin Hsin
802
Unequal Priority Arrangement for Delivering Streaming Videos over Differentiated Service Networks . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chu-Chuan Lee, Pao-Chi Chang, and Shih-Jung Chuang
812
A New Macroblock-Layer Rate Control for H.264/AVC Using Quadratic R-D Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Nam-Rye Son, Yoon-Jeong Shin, Jae-Myung Yoo, and Guee-Sang Lee Selective Quality Control of Multiple Video Programs for Digital Broadcasting Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hong-Yeon Yu, Aaron Park, Keun Won Jang, Jin Park, Seong-Joon Baek, Dong Kook Kim, Young-Chul Kim, and Sung-Hoon Hong
822
832
Joint Source-Channel Video Coding Based on the Optimization of End-to-End Distortions . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Wen-Nung Lie, Zhi-Wei Gao, Tung-Lin Liu, and Ping-Chang Jui
842
A Temporal Error Concealment Method Based on Edge Adaptive Boundary Matching . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Hyun-Soo Kang, Yong-Woo Kim, and Tae-Yong Kim
852
Fractional Full-Search Motion Estimation VLSI Architecture for H.264/AVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chien-Min Ou, Huang-Chun Roan, and Wen-Jyi Hwang
861
Motion JPEG2000 Coding Scheme Based on Human Visual System for Digital Cinema . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jong Han Kim, Sang Beom Kim, and Chee Sun Won
869
Wavelet-Based Image Compression with Polygon-Shaped Region of Interest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Yao-Tien Chen, Din-Chang Tseng, and Pao-Chi Chang
878
Buffer Occupancy Feedback Security Control and Changing Encryption Keys to Protect MOD Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jen-Chieh Lai and Chia-Hui Wang
888
H.264-Based Depth Map Sequence Coding Using Motion Information of Corresponding Texture Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Han Oh and Yo-Sung Ho
898
A Novel Macroblock-Layer Rate-Distortion Model for Rate Control Scheme of H.264/AVC . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Tsung-Han Chiang and Sheng-Jyh Wang
908
A Geometry-Driven Hierarchical Compression Technique for Triangle Meshes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Chang-Min Chou and Din-Chang Tseng
919
Shape Preserving Hierarchical Triangular Mesh for Motion Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Dong-Gyu Lee and Su-Young Han
929
Parameter Selection of Robust Fine Granularity Scalable Video Coding over MPEG-4 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Shao-Yu Wang and Sheng-Jyh Wang
939
Adaptive QoS Control for Scalable Video Streaming over WLAN . . . . . . . Hee Young Cho, Doug Young Suh, and Gwang Hoon Park
949
H.264 Intra Mode Decision for Reducing Complexity Using Directional Masks and Neighboring Modes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Jongho Kim, Kicheol Jeon, and Jechang Jeong
959
Seamless Channel Transition for BroadCatch Scheme . . . . . . . . . . . . . . . . . Yukon Chang and Shou-Li Hsu
969
Spatial Interpolation Algorithm for Consecutive Block Error Using the Just-Noticeable-Distortion Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . Su-Yeol Jeon, Chae-Bong Sohn, Ho-Chong Park, Chang-Beom Ahn, and Seoung-Jun Oh Error Resilient Motion Estimation for Video Coding . . . . . . . . . . . . . . . . . . Wen-Nung Lie, Zhi-Wei Gao, Wei-Chih Chen, and Ping-Chang Jui
979
988
Multimedia Signal Processing A Novel Scrambling Scheme for Digital Video Encryption . . . . . . . . . . . . . Zhenyong Chen, Zhang Xiong, and Long Tang
997
An Electronic Watermarking Technique for Digital Holograms in a DWT Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1007 Hyun-Jun Choi, Young-Ho Seo, Ji-Sang Yoo, and Dong-Wook Kim Off-Line Signature Verification Based on Directional Gradient Spectrum and a Fuzzy Classifier . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1018 Young Woon Woo, Soowhan Han, and Kyung Shik Jang Automatic Annotation and Retrieval for Videos . . . . . . . . . . . . . . . . . . . . . . 1030 Fangshi Wang, De Xu, Wei Lu, and Hongli Xu Application of Novel Chaotic Neural Networks to Text Classification Based on PCA . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1041 Jin Zhang, Guang Li, and Walter J. Freeman Lip Localization Based on Active Shape Model and Gaussian Mixture Model . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1049 Kyung Shik Jang, Soowhan Han, Imgeun Lee, and Young Woon Woo Multi-scale MAP Estimation of High-Resolution Images . . . . . . . . . . . . . . 1059 Shubin Zhao Fast Transcoding Algorithm from MPEG2 to H.264 . . . . . . . . . . . . . . . . . . 1067 Donghyung Kim and Jechang Jeong A New Display-Capture Based Mobile Watermarking System . . . . . . . . . . 1075 Jong-Wook Bae, Bo-Ho Cho, and Sung-Hwan Jung Adaptive Data Hiding for Images Based on Harr Discrete Wavelet Transform . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1085 Bo-Luen Lai and Long-Wen Chang Error Concealment in Video Communications by Informed Watermarking . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1094 Chowdary Adsumilli, Sanjit Mitra, Taesuk Oh, and Yong Cheol Kim Dirty-Paper Trellis-Code Watermarking with Orthogonal Arcs . . . . . . . . . 1103 Taekyung Kim, Taesuk Oh, Yong Cheol Kim, and Seongjong Choi An Efficient Object Tracking Algorithm with Adaptive Prediction of Initial Searching Point . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1113 Jiyan Pan, Bo Hu, and Jian Qiu Zhang
An Automatic Personal TV Scheduler Based on HMM for Intelligent Broadcasting Services . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1123 Agus Syawal Yudhistira, Munchurl Kim, Hieyong Kim, and Hankyu Lee Increasing the Effect of Fingers in Fingerspelling Hand Shapes by Thick Edge Detection and Correlation with Penalization . . . . . . . . . . . 1133 O˘guz Altun and Song¨ ul Albayrak Binary Edge Based Adaptive Motion Correction for Accurate and Robust Digital Image Stabilization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1142 Ohyun Kwon, Byungdeok Nam, and Joonki Paik Contrast Enhancement Using Adaptively Modified Histogram Equalization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1150 Hyoung-Joon Kim, Jong-Myung Lee, Jin-Aeon Lee, Sang-Geun Oh, and Whoi-Yul Kim A Novel Fade Detection Algorithm on H.264/AVC Compressed Domain . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1159 Bita Damghanian, Mahmoud Reza Hashemi, and Mohammad Kazem Akbari A Representation of WOS Filters Through Support Vector Machine . . . . 1168 W.C. Chen and J.H. Jeng Detection Method and Removing Filter for Corner Outliers in Highly Compressed Video . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1177 Jongho Kim, Kicheol Jeon, and Jechang Jeong Accurate Foreground Extraction Using Graph Cut with Trimap Estimation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1185 Jung-Ho Ahn and Hyeran Byun
Panoramic Imaging and Distributed Video Systems Homography Optimization for Consistent Circular Panorama Generation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1195 Masatoshi Sakamoto, Yasuyuki Sugaya, and Kenichi Kanatani Graph-Based Global Optimization for the Registration of a Set of Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1206 Hui Zhou One-Dimensional Search for Reliable Epipole Estimation . . . . . . . . . . . . . . 1215 Tsuyoshi Migita and Takeshi Shakunaga
Sensor Technologies Image Technology for Camera-Based Automatic Guided Vehicle . . . . . . . 1225 Young-Jae Ryoo An Efficient Demosaiced Image Enhancement Method for a Low Cost Single-Chip CMOS Image Sensor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1234 Wonjae Lee, Seongjoo Lee, and Jaeseok Kim Robust Background Subtraction for Quick Illumination Changes . . . . . . . 1244 Shinji Fukui, Yuji Iwahori, Hidenori Itoh, Haruki Kawanaka, and Robert J. Woodham A New Sampling Method of Auto Focus for Voice Coil Motor in Camera Modules . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1254 Wei Hsu, Chiou-Shann Fuh
Visualization A Fast Trapeziums-Based Method for Soft Shadow Volumes . . . . . . . . . . . 1264 Difei Lu and Xiuzi Ye Photo-Consistent Motion Blur Modeling for Realistic Image Synthesis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1273 Huei-Yung Lin and Chia-Hong Chang Integration of GPS, GIS and Photogrammetry for Texture Mapping in Photo-Realistic City Modeling . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1283 Jiann-Yeou Rau, Tee-Ann Teo, Liang-Chien Chen, Fuan Tsai, Kuo-Hsin Hsiao, and Wei-Chen Hsu Focus + Context Visualization with Animation . . . . . . . . . . . . . . . . . . . . . . 1293 Yingcai Wu, Huamin Qu, Hong Zhou, and Ming-Yuen Chan Exploiting Spatial-temporal Coherence in the Construction of Multiple Perspective Videos . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1303 Mau-Tsuen Yang, Shih-Yu Huang, Kuo-Hua Lo, Wen-Kai Tai, and Cheng-Chin Chiang A Perceptual Framework for Comparisons of Direct Volume Rendered Images . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1314 Hon-Cheng Wong, Huamin Qu, Un-Hong Wong, Zesheng Tang, and Klaus Mueller Hierarchical Contribution Culling for Fast Rendering of Complex Scenes . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1324 Jong-Seung Park and Bum-Jong Lee
Hierarchical Blur Identification from Severely Out-of-Focus Images . . . . . 1334 Jungsoo Lee, Yoonjong Yoo, Jeongho Shin, and Joonki Paik Author Index . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 1343
Error Analysis of Feature Based Disparity Estimation Patrick A. Mikulastik, Hellward Broszio, Thorsten Thormählen, and Onay Urfalioglu Information Technology Laboratory, University of Hannover, Germany {mikulast, broszio, thormae, urfaliog}@lfi.uni-hannover.de http://www.lfi.uni-hannover.de
Abstract. For real-time disparity estimation from stereo images the coordinates of feature points are evaluated. This paper analyses the influence of camera noise on the accuracy of feature point coordinates of a feature point detector similar to the Harris Detector, modified for disparity estimation. As a result the error variance of the horizontal coordinate of each feature point and the variance of each corresponding disparity value are calculated as a function of the image noise and the local intensity distribution. Disparities with insufficient accuracy can be discarded in order to ensure a given accuracy. The results of the error analysis are confirmed by experimental results.
1 Introduction
Disparity estimation algorithms compute disparities from the coordinates of selected corresponding feature points from images in standard stereo geometry. For the use of these estimated disparities in computer vision systems it is desirable to specify their accuracy. Therefore, in this paper the error variance of a disparity estimator is determined analytically and experimentally. In previous work Luxen [1] measures the variance of feature point coordinates, taking image noise into account. The result is a mean error variance of all feature points in an image at a specific level of image noise. Local intensity distributions at specific feature points are not taken into account. Rohr [2] introduces an analytical model of a corner and calculates the feature point coordinates of different feature detectors for this corner. Thus he characterizes the different detectors but does not consider the errors of the coordinates due to camera noise. Szeliski [3] has analytically calculated the accuracy of displacement estimators like the KLT-Tracker [4]. The resulting covariance matrix describes the variance of the displacement error for each displacement. Other approaches do not estimate displacements but nevertheless apply similar covariance matrices to describe the accuracy of feature point coordinates [5,6,7,8]. However, these results have not been analytically proven or evaluated in experiments. Thus, so far the accuracy of feature point coordinates from image gradient based detectors similar to the Harris Detector [9] has not been calculated analytically.
This paper analyses the influence of camera noise on the accuracy of a feature point detector for disparity estimation. It is based on a modified Harris Detector [9]. The accuracy is defined by the error variance of the feature coordinate. In a second step the error variance of the disparity estimation is derived. In section 2 the feature detector considered in this paper is described. In section 3 an error analysis for feature coordinates and disparity is presented. Section 4 describes experiments for measuring the error variance of feature coordinates. Conclusions are given in section 5.
2 Feature Point Detector
The feature detector considered in this paper is based on the Harris Detector [9]. In stereo vision only disparities in horizontal direction of the stereo image are considered. Therefore, the process of feature detection is simplified so that only gradients in direction of the x-axis are measured. This also results in a reduced computational complexity. For detection of feature points the following equation describes the edge response function R for vertical edges:

R(x, y) = \sum_{i=-1}^{1} I_x(x, y+i)\,\alpha_i ,  with  \alpha_i = [1, 2, 1]    (1)

where x and y are coordinates in the image. I_x is an approximation of the horizontal intensity gradient:

I_x(x, y) = -I(x-2, y) - 2I(x-1, y) + 2I(x+1, y) + I(x+2, y)    (2)

A feature point is detected if R(x, y) is greater than a predefined threshold T_R and if R(x_m, y_m) is a local maximum in horizontal direction:

R(x_m, y_m) > T_R ,
R(x_m, y_m) > R(x_m - 1, y_m) ,
R(x_m, y_m) > R(x_m + 1, y_m)    (3)
Estimation of subpel coordinates. In horizontal direction a subpel coordinate is estimated for every feature point. A parabola is fitted to the edge response function (see figure 1):

R(x, y) = a + bx + \frac{1}{2} c x^2    (4)

To achieve a compact notation the coordinate system is chosen so that x_m = 0. The three parameters a, b, c are calculated with:

R(-1, y_m) = a - b + \frac{1}{2} c
R(0, y_m) = a
R(+1, y_m) = a + b + \frac{1}{2} c    (5)
Fig. 1. Interpolation of R(x, y) with a parabola. The maximum defines the subpel coordinate of the feature point x0 .
Solved for a, b, c:

a = R(0, y_m)
b = \frac{1}{2}\left(R(+1, y_m) - R(-1, y_m)\right)
c = R(-1, y_m) - 2R(0, y_m) + R(+1, y_m)    (6)

In order to find the maximum of the parabola, the derivative of equation 4 is set to zero:

\frac{\partial R(x, y)}{\partial x} = b + cx \overset{!}{=} 0    (7)

The root of equation 7 marks the subpel coordinate x_0 of the feature point:

x_0 = -\frac{b}{c} = \frac{1}{2}\,\frac{R(-1, y_m) - R(+1, y_m)}{R(-1, y_m) - 2R(0, y_m) + R(+1, y_m)}    (8)
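As a compact summary of section 2, the sketch below implements equations (1)-(3) together with the parabola fit (4)-(8) using NumPy. It is not code from the paper; the threshold value and the function layout are illustrative assumptions.

```python
import numpy as np

def detect_features(img, t_r=1000.0):
    """Minimal sketch of the detector of section 2 (equations 1-8).

    img : 2-D array of intensities I(x, y), indexed as img[y, x]
    t_r : threshold T_R on the edge response R (value here is arbitrary)
    Returns a list of (x0, y) tuples with subpel horizontal coordinates.
    """
    img = np.asarray(img, dtype=float)
    h, w = img.shape
    alpha = np.array([1.0, 2.0, 1.0])

    # Horizontal gradient, equation (2)
    ix = np.zeros_like(img)
    ix[:, 2:w-2] = (-img[:, 0:w-4] - 2.0 * img[:, 1:w-3]
                    + 2.0 * img[:, 3:w-1] + img[:, 4:w])

    # Edge response, equation (1): vertical weighted sum of the gradient
    r = np.zeros_like(ix)
    r[1:h-1, :] = alpha[0]*ix[0:h-2, :] + alpha[1]*ix[1:h-1, :] + alpha[2]*ix[2:h, :]

    features = []
    for y in range(1, h - 1):
        for x in range(3, w - 3):
            # Equation (3): threshold and local maximum in horizontal direction
            if r[y, x] > t_r and r[y, x] > r[y, x-1] and r[y, x] > r[y, x+1]:
                # Parabola fit, equations (6)-(8): subpel offset is -b/c
                b = 0.5 * (r[y, x+1] - r[y, x-1])
                c = r[y, x-1] - 2.0 * r[y, x] + r[y, x+1]
                if c != 0.0:
                    features.append((x - b / c, y))
    return features
```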
3 Variance of the Horizontal Coordinate
To consider the influence of image noise on the subpel coordinates of feature points, noisy intensity values \tilde{I}(x, y) with

\tilde{I}(x, y) = I(x, y) + n(x, y)    (9)

are considered. n(x, y) is white, zero-mean, Gaussian distributed noise with a noise variance of \sigma_n^2. The gradient of the image noise is defined as:

n_x(x, y) = -n(x-2, y) - 2n(x-1, y) + 2n(x+1, y) + n(x+2, y)    (10)
P.A. Mikulastik et al.
Therefore, the measured image gradient is I˜x (x, y): ˜ − 2, y) − 2I(x ˜ − 1, y) + 2I(x ˜ + 1, y) + I(x ˜ + 2, y) I˜x (x, y) = −I(x = Ix (x, y) −n(x − 2, y) − 2n(x − 1, y) + 2n(x + 1, y) + n(x + 2, y) nx (x,y)
= Ix (x, y) + nx (x, y) (11) ˜ y) can be written as: The measured cornerness response function R(x, 1 ˜ y) = I˜x (x, y + i)αi R(x, i=−1 1 1 = Ix (x, y + i)αi + nx (x, y + i)αi i=−1
(12)
i=−1
Ix can be positive or negative if the edge has a gradient in the positive or in the negative direction. The influence of the image noise is the same in both cases. Therefore only edges with positive gradients are considered in the following calculations and it is assumed that Ix is always positive: 1
Ix (x, y + i)αi > 0
(13)
i=−1
Generally, the intensity values are much larger than the image noise: 1
Ix (x, y + i)αi >
i=−1
1
nx (x, y + i)αi
(14)
i=−1
˜ y) can be written as: With equations 13 and 14 R(x, 1 1 ˜ y) = Ix (x, y + i)αi + nx (x, y + i)αi R(x, i=−1 i=−1 R(x,y)
with: Rn (x, y) =
(15)
Rn (x,y)
1
nx (x, y + i)αi
(16)
i=−1
˜ y) is computed for every point-position. The calculation of the feature R(x, point’s subpel coordinates is carried out according to equation 3 to 7. x˜0 can be calculated with equation 8: x ˜0 =
˜ ym ) ˜ 1 R(−1, ym ) − R(1, ˜ ym ) + R(1, ˜ ym ) ˜ 2 R(−1, ym ) − 2R(0,
(17)
Error Analysis of Feature Based Disparity Estimation
and: x ˜0 =
˜ ym ) ˜ 1 R(−1, ym ) − R(1, 2 c + Rn (−1, ym ) − 2Rn (0, ym ) + Rn (1, ym )
5
(18)
Δc
For a more compact notation \Delta c is introduced:

\Delta c = R_n(-1, y_m) - 2R_n(0, y_m) + R_n(1, y_m)    (19)

as well as the normalized value \Delta c':

\Delta c' = \frac{\Delta c}{c}    (20)
c is a sum of intensity values and \Delta c is a sum of noise values. Therefore \Delta c' is a small value. With \Delta c and \Delta c', equation 18 simplifies to:

\tilde{x}_0 = \frac{1}{2}\,\frac{\tilde{R}(-1, y_m) - \tilde{R}(1, y_m)}{c + \Delta c} = \frac{1}{2}\,\frac{\tilde{R}(-1, y_m) - \tilde{R}(1, y_m)}{c\,(1 + \Delta c')}    (21)
Multiplication of numerator and denominator with (1 - \Delta c') yields:

\tilde{x}_0 = \frac{1}{2}\,\frac{\tilde{R}(-1, y_m) - \tilde{R}(1, y_m)}{c\,(1 - \Delta c'^2)} - \frac{1}{2}\,\frac{\tilde{R}(-1, y_m) - \tilde{R}(1, y_m)}{c\,(1 - \Delta c'^2)}\cdot\Delta c'    (22)
Since \Delta c' is a small value it can be assumed that

1 \gg \Delta c'^2 .    (23)
With this assumption, equation 22 is simplified to:

\tilde{x}_0 \approx \frac{1}{2}\,\frac{\tilde{R}(-1, y_m) - \tilde{R}(1, y_m)}{c} - \frac{1}{2}\,\frac{\tilde{R}(-1, y_m) - \tilde{R}(1, y_m)}{c}\cdot\Delta c'    (24)
With equation 16:

\tilde{x}_0 \approx \underbrace{\frac{R(-1, y_m) - R(1, y_m)}{2c}}_{x_0} + \underbrace{\frac{R_n(-1, y_m) - R_n(1, y_m)}{2c}}_{\Delta x_a} - \left(\frac{R(-1, y_m) - R(1, y_m)}{2c} + \frac{R_n(-1, y_m) - R_n(1, y_m)}{2c}\right)\cdot\Delta c'    (25)

\Delta x_a is defined:

\Delta x_a = \frac{R_n(-1, y_m) - R_n(1, y_m)}{2c}    (26)
With \Delta x_a we can write:

\tilde{x}_0 = x_0 + \Delta x_a - x_0 \Delta c' - \Delta x_a \Delta c'    (27)
\Delta x_a \Delta c' is the product of sums of noise values. Therefore it is very small and will be neglected in the following calculation:

\tilde{x}_0 = x_0 + \underbrace{\Delta x_a - x_0 \Delta c'}_{\Delta x_0}    (28)

\Delta x_0 describes the error of the undistorted coordinate x_0:

\Delta x_0 = \Delta x_a - x_0 \Delta c'    (29)
It has been verified experimentally that \Delta x_0 has a zero mean. With this assumption the variance of the coordinate's error \sigma_\Delta^2 equals the mean square E[\Delta x_0^2]:

\sigma_\Delta^2 = E[\Delta x_0^2] = E\left[(\Delta x_a - x_0 \Delta c')^2\right]
               = E\left[\Delta x_a^2 - 2 x_0 \Delta x_a \Delta c' + x_0^2 \Delta c'^2\right]
               = E\left[\Delta x_a^2\right] - E\left[2 x_0 \Delta x_a \Delta c'\right] + E\left[x_0^2 \Delta c'^2\right]    (30)
The terms of equation 30 are evaluated individually:

E\left[\Delta x_a^2\right] = E\left[\left(\frac{R_n(-1, y_m) - R_n(1, y_m)}{2c}\right)^2\right] = \frac{1}{4c^2}\,E\left[\left(R_n(-1, y_m) - R_n(1, y_m)\right)^2\right]    (31)
With equation 15:

E[\Delta x_a^2] = \frac{1}{4c^2}\,E\left[\left(\sum_{i=-1}^{1} n_x(-1, y_m+i)\,\alpha_i - \sum_{i=-1}^{1} n_x(1, y_m+i)\,\alpha_i\right)^2\right]

Evaluation of the square gives:

E[\Delta x_a^2] = \frac{1}{4c^2}\,E\left[\left(\sum_{i=-1}^{1} n_x(-1, y_m+i)\,\alpha_i\right)^2 - 2\sum_{i=-1}^{1} n_x(-1, y_m+i)\,\alpha_i \cdot \sum_{i=-1}^{1} n_x(1, y_m+i)\,\alpha_i + \left(\sum_{i=-1}^{1} n_x(1, y_m+i)\,\alpha_i\right)^2\right]    (32)
With E[n(x, y_1)\,n(x, y_2)] = 0 for y_1 \neq y_2 and E[n(x_1, y)\,n(x_2, y)] = 0 for x_1 \neq x_2:

E\left[\left(\sum_{i=-1}^{1} n_x(x, y+i)\,\alpha_i\right)^2\right] = E\left[\sum_{i=-1}^{1} n_x^2(x, y+i)\,\alpha_i^2\right]
= E\left[\sum_{i=-1}^{1}\left(n^2(x-2, y+i) + 4n^2(x-1, y+i) + 4n^2(x+1, y+i) + n^2(x+2, y+i)\right)\alpha_i^2\right]    (33)

and

E\left[\sum_{i=-1}^{1} n_x(-1, y+i)\,\alpha_i \cdot \sum_{i=-1}^{1} n_x(1, y+i)\,\alpha_i\right] = E\left[\sum_{i=-1}^{1} n_x(-1, y+i)\cdot n_x(1, y+i)\,\alpha_i^2\right]
= E\left[\sum_{i=-1}^{1} -4n^2(0, y+i)\,\alpha_i^2\right] .    (34)
Equation 32 simplifies to:

E[\Delta x_a^2] = \frac{1}{4c^2}\,E\left[\sum_{i=-1}^{1}\left(n^2(-3, y_m+i) + 4n^2(-2, y_m+i) + 4n^2(0, y_m+i) + n^2(1, y_m+i)\right)\alpha_i^2 - 2\sum_{i=-1}^{1} -4n^2(0, y_m+i)\,\alpha_i^2 + \sum_{i=-1}^{1}\left(n^2(-1, y_m+i) + 4n^2(0, y_m+i) + 4n^2(2, y_m+i) + n^2(3, y_m+i)\right)\alpha_i^2\right]    (35)

For the expectation the terms n^2(x, y) become the variance of the image noise \sigma_n^2:

E[\Delta x_a^2] = \frac{1}{4c^2}\left[\sum_{i=-1}^{1} 10\sigma_n^2\,\alpha_i^2 - 2\sum_{i=-1}^{1} -4\sigma_n^2\,\alpha_i^2 + \sum_{i=-1}^{1} 10\sigma_n^2\,\alpha_i^2\right]    (36)

Evaluation of the sums gives:

E[\Delta x_a^2] = \frac{\sigma_n^2}{4c^2}\,[60 + 48 + 60] = \frac{42\sigma_n^2}{c^2}    (37)
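Equation 37 can also be checked numerically. The following Monte Carlo sketch is not part of the paper; the values of σ_n and c are arbitrary assumptions. It draws white Gaussian noise, forms R_n and Δx_a as defined in equations 10, 16 and 26, and compares the sample mean of Δx_a^2 with 42σ_n^2/c^2.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma_n, c = 2.0, 500.0                      # arbitrary illustrative values
alpha = np.array([1.0, 2.0, 1.0])
trials = 200000

# Noise patch around the feature point: rows y_m-1..y_m+1, columns x_m-3..x_m+3
n = rng.normal(0.0, sigma_n, size=(trials, 3, 7))

def nx(patch, x):
    # Gradient of the noise, equation (10); column index 3 corresponds to x = 0
    j = x + 3
    return -patch[:, :, j-2] - 2*patch[:, :, j-1] + 2*patch[:, :, j+1] + patch[:, :, j+2]

rn_m1 = (nx(n, -1) * alpha).sum(axis=1)      # R_n(-1, y_m), equation (16)
rn_p1 = (nx(n, +1) * alpha).sum(axis=1)      # R_n(+1, y_m)
dxa = (rn_m1 - rn_p1) / (2.0 * c)            # Delta x_a, equation (26)

print(np.mean(dxa**2), 42.0 * sigma_n**2 / c**2)   # both approximately 6.7e-4
```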
The second term in equation 30 can be expanded to:

E[2\Delta x_a x_0 \Delta c'] = E\left[2\,\frac{R_n(-1, y_m) - R_n(1, y_m)}{2c}\,x_0\,\frac{\Delta c}{c}\right]
= E\left[2\,\frac{R_n(-1, y_m) - R_n(1, y_m)}{2c}\,\frac{R_n(-1, y_m) - 2R_n(0, y_m) + R_n(1, y_m)}{c}\,x_0\right]
= \frac{x_0}{c^2}\,E\left[\left(R_n(-1, y_m) - R_n(1, y_m)\right)\left(R_n(-1, y_m) - 2R_n(0, y_m) + R_n(1, y_m)\right)\right]    (38)

A calculation similar to that from equation 31 to 36 leads to:

E[2\Delta x_a x_0 \Delta c'] = \frac{x_0}{c^2}\left[\sum_{i=-1}^{1} 10\sigma_n^2\,\alpha_i^2 - \sum_{i=-1}^{1} 10\sigma_n^2\,\alpha_i^2 - 2\sum_{i=-1}^{1} 4\sigma_n^2\,\alpha_i^2 + 2\sum_{i=-1}^{1} 4\sigma_n^2\,\alpha_i^2\right] = 0    (39)

The third term in equation 30 can be expanded to:

E\left[x_0^2 \Delta c'^2\right] = \frac{x_0^2}{c^2}\,E\left[\left(R_n(-1, y_m) - 2R_n(0, y_m) + R_n(1, y_m)\right)^2\right]    (40)

Once again, a calculation similar to that from equation 31 to 36 leads to:

E\left[x_0^2 \Delta c'^2\right] = \frac{120\sigma_n^2 x_0^2}{c^2}    (41)
Insertion of equations 37, 39 and 41 in equation 30 leads to:

\sigma_\Delta^2 = \frac{42\sigma_n^2}{c^2} + \frac{120\sigma_n^2 x_0^2}{c^2} = \sigma_n^2\,\frac{42 + 120 x_0^2}{c^2} = \sigma_n^2\,\frac{42 + 120 x_0^2}{\left(R(-1, y_m) - 2R(0, y_m) + R(+1, y_m)\right)^2}    (42)
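Equation 42 can be evaluated per feature point directly from the three edge response values around the maximum. The following helper is an illustrative sketch, not code from the paper; the function and argument names are assumptions.

```python
def coordinate_error_variance(r_m1, r_0, r_p1, x0, sigma_n2):
    """Error variance of the horizontal feature coordinate, equation (42).

    r_m1, r_0, r_p1 : edge response R at x_m - 1, x_m, x_m + 1
    x0              : subpel coordinate from equation (8), in [-0.5, 0.5]
    sigma_n2        : camera noise variance sigma_n^2
    Returns the variance in pel^2.
    """
    c = r_m1 - 2.0 * r_0 + r_p1   # curvature of the parabola, equation (6)
    return sigma_n2 * (42.0 + 120.0 * x0 ** 2) / (c * c)
```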
This equation will be used to calculate the error variance \sigma_\Delta^2 of the feature point coordinate. The disparity d is the distance between a feature point in the left image and the corresponding feature point in the right image:

d = x_{left} - x_{right}    (43)
The variance \sigma_{\Delta d}^2 of the disparity error \Delta d is:

E\left[\Delta d^2\right] = \sigma_{\Delta d}^2 = E\left[\left(\Delta x_{left} - \Delta x_{right}\right)^2\right]
= E\left[\Delta x_{left}^2 - 2\Delta x_{left}\Delta x_{right} + \Delta x_{right}^2\right]
= E\left[\Delta x_{left}^2\right] - E\left[2\Delta x_{left}\Delta x_{right}\right] + E\left[\Delta x_{right}^2\right]    (44)
The distortions of the feature point’s coordinates in the two images are statistically independent, therefore equation 44 simplifies to:
\sigma_{\Delta d}^2 = E\left[\Delta x_{left}^2\right] + E\left[\Delta x_{right}^2\right] = \sigma_{\Delta_{left}}^2 + \sigma_{\Delta_{right}}^2    (45)

The variance of the disparity error is given by the sum of the error variances of the feature point coordinates.
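In code, equation 45 is a one-liner; the sketch below is illustrative and not from the paper, and the screening bound in the comment is an arbitrary example of the discarding step mentioned in the abstract.

```python
def disparity_error_variance(sigma2_left, sigma2_right):
    """Equation (45): the disparity error variance is the sum of the
    error variances of the two corresponding feature point coordinates."""
    return sigma2_left + sigma2_right

# Example: keep only disparities whose predicted error variance is small enough,
# e.g. disparity_error_variance(s2_l, s2_r) < 0.02  (bound chosen for illustration)
```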
4 Experimental Results
The following experiments have been carried out to evaluate the derivation from the preceding section 3. An image sequence consisting of a static scene with constant lighting is taken with a 3-chip CCD camera (Sony DXC-D30WSP). In order to determine the size of σ_n^2, the average intensity value at each pixel over 1000 frames is calculated to produce a noise-free image for the sequence. By subtraction of the noise-free image and the original images, difference images containing only the camera noise can be obtained. A camera noise variance of σ_n^2 = 4.8, which equals a PSNR of 41.3 dB, was measured for the sequence. Figure 2 shows an example image from the sequence.

Using the feature detector described in section 2, feature points are detected in every image of the sequence. Then correspondences between the feature points in the sequence are established. A correspondence is given if a feature point in one image is located at the coordinates x, y and in another image at the coordinates x ± ε, y ± ε, with ε ≤ 0.5 pel. If a feature point has correspondences in all images of a sequence, the measured variance of its horizontal coordinate σ̃_Δ^2 is calculated. This value can be compared with the results from the derivation in section 3.

Figure 3 shows the measured variances σ̃_Δ^2 over the variances σ_Δ^2 calculated as described in section 3. The figure shows that the measured variances σ̃_Δ^2 have nearly the same values as the calculated ones. Therefore the calculation is confirmed by the experiment. The values also lie in the same range observed by other researchers [1].

A second experiment with a synthetic image shows the dependence between the subpel position of a feature point and the error variance of its coordinate. The image shown in figure 4 is used for this experiment. A feature detection in this image results in feature points with subpel positions in the whole range -0.5 < x_0 < 0.5, because the edge in the image is slightly slanted.
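The noise measurement described above can be written compactly. The sketch below is an assumption about the procedure (NumPy, 8-bit peak value 255) rather than the authors' code, but it reproduces the reported relation between σ_n^2 = 4.8 and a PSNR of 41.3 dB.

```python
import numpy as np

def camera_noise_variance(frames):
    """Estimate the camera noise variance as described in section 4.

    frames : array of shape (n_frames, h, w) holding a static scene
             under constant lighting, e.g. 1000 frames.
    """
    frames = np.asarray(frames, dtype=float)
    mean_img = frames.mean(axis=0)        # temporal average -> (nearly) noise-free image
    noise = frames - mean_img             # difference images contain only the camera noise
    sigma_n2 = noise.var()
    psnr = 10.0 * np.log10(255.0 ** 2 / sigma_n2)   # assuming 8-bit intensities
    return sigma_n2, psnr

# Consistency check with the values reported in the paper:
# 10 * log10(255**2 / 4.8) is approximately 41.3 dB.
```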
Fig. 2. Example image from the test sequence with a camera noise variance of σn2 = 4.8
[Figure 3 is a scatter plot; vertical axis: σ̃_Δ^2 (measured) [pel^2], horizontal axis: σ_Δ^2 (calculated) [pel^2], both ranging from 0 to 0.005.]
Fig. 3. Measured error variances σ̃_Δ^2 over analytically calculated error variances σ_Δ^2 for each feature point in the image sequence taken with a real camera
Because the image shown in figure 4 is noise-free, synthetic noise was added to the image's intensity values to generate 1000 noisy images with the original image as basis. Then the same procedure to calculate σ̃_Δ^2 and σ_Δ^2 as described for the real image is conducted. Figure 5 shows the measured and the calculated noise variances of the feature point's coordinates σ̃_Δ^2 and σ_Δ^2 over the subpel coordinate. It can be observed that the coordinate error variance of feature points at a full-pel position is smaller
Fig. 4. Synthetic image used in the experiment: a) original image, b) detail. The edge is slightly slanted, so that feature points with a range of subpel coordinates can be detected.
Fig. 5. Measured error variance $\tilde{\sigma}_\Delta^2$ and calculated error variance $\sigma_\Delta^2$ (in pel$^2$) as functions of the subpel coordinate $x_0$
than that for feature points at a half-pel position. The variances vary by a factor of about three. It can also be observed that the calculated variances $\sigma_\Delta^2$ match the measured variances $\tilde{\sigma}_\Delta^2$, which supports the correctness of the calculation.
5 Conclusions
A feature point detector using horizontal intensity gradients and offering subpel accuracy was described in Section 2. It was shown that typically most of the feature points have an error variance of less than 0.01 pel$^2$ for the horizontal coordinate. An analysis of the error of the horizontal feature point coordinate revealed the interrelationship between the image noise $\sigma_n^2$, the local image content, given by the local image intensity values $I(x, y)$, and the variance of the feature point's horizontal coordinate error $\sigma_\Delta^2$. A formula for the disparity error variance based on the feature coordinate error variance has been derived. In an experiment (Section 4) it was shown that the results of the analytical derivation
match measured results obtained using synthetic images and images from a real camera. A second experiment has shown that the coordinate error variance of feature points at a full-pel position is smaller by a factor of three than that for feature points at a half-pel position. The calculation presented in this paper allows feature points and disparities to be benchmarked during feature detection on the basis of their expected error variance. This is a great advantage compared to methods that try to eliminate bad feature points at a later stage in the process of disparity estimation. In the future this work will be expanded to a feature detector that measures gradients in all directions in the image, i.e., the feature detector of Harris et al. [9].
References

1. Luxen, M.: Variance component estimation in performance characteristics applied to feature extraction procedures. In: Michaelis, B., Krell, G. (eds.): Pattern Recognition, 25th DAGM Symposium, Magdeburg, Germany, September 10-12, 2003, Proceedings. Lecture Notes in Computer Science, Vol. 2781. Springer (2003) 498–506
2. Rohr, K.: Localization properties of direct corner detectors. Journal of Mathematical Imaging and Vision 4 (1994) 139–150
3. Szeliski, R.: Bayesian Modeling of Uncertainty in Low-Level Vision. Kluwer International Series in Engineering and Computer Science, 79, ISBN 0792390393 (1989)
4. Shi, J., Tomasi, C.: Good features to track. In: CVPR, IEEE (1994) 593–600
5. Kanazawa, Y., Kanatani, K.: Do we really have to consider covariance matrices for image features? In: ICCV (2001) 301–306
6. Morris, D.D., Kanade, T.: A unified factorization algorithm for points, line segments and planes with uncertainty models. In: Proceedings of the Sixth IEEE International Conference on Computer Vision (ICCV'98) (1998) 696–702
7. Singh, A.: An estimation-theoretic framework for image-flow computation. In: ICCV, Third International Conference on Computer Vision (1990) 168–177
8. Förstner, W.: Reliability analysis of parameter estimation in linear models with applications to mensuration problems in computer vision. In: CVGIP 40 (1987) 273–310
9. Harris, C., Stephens, M.: A combined corner and edge detector. In: 4th Alvey Vision Conference (1988) 147–151
Collinearity and Coplanarity Constraints for Structure from Motion

Gang Liu¹, Reinhard Klette¹, and Bodo Rosenhahn²

¹ Department of Computer Science, The University of Auckland, New Zealand
² Max Planck Institute Saarbrücken, Germany
Abstract. Structure from motion (SfM) comprises techniques for estimating 3D structures from uncalibrated 2D image sequences. This work focuses on two contributions: Firstly, a stability analysis is performed and the error propagation of image noise is studied. Secondly, to stabilize SfM, we present two optimization schemes by using a priori knowledge about collinearity or coplanarity of feature points in the scene.
1 Introduction
Structure from motion (SfM) is an ongoing research topic in computer vision and photogrammetry, which has a number of applications in different areas, such as e-commerce, real estate, games and special effects. It aims at recovering 3D (shape) models of (usually rigid) objects from an (uncalibrated) sequence (or set) of 2D images. The original approach [5] of SfM consists of the following steps: (1) extract corresponding points from pairs of images, (2) compute the fundamental matrix, (3) specify the projection matrix, (4) generate a dense depth map, and (5) build a 3D model. A brief introduction to some of those steps will be presented in Section 2. Errors are inevitable in every highly complex procedure depending on real-world data, and this also holds for SfM. To improve the stability of SfM, two optimizations are proposed using information from the 3D scene; see Section 3. Section 4 presents experimental results, and Section 5 concludes the paper with a brief summary.
2 Modules of SfM
This section gives a brief introduction to some of the SfM steps (and related algorithms). For extracting corresponding points, we recall a method proposed in [14]. Then, three methods for computing the fundamental matrix are briefly introduced. To specify a projection matrix from a fundamental matrix, we describe two common methods based on [3,4]. In this step we also use the knowledge of intrinsic camera parameters, which can be obtained through Tsai calibration [12]; this calibration is performed before or after taking the pictures for the used
camera. It allows us to specify the effective focal length $f$, the size factors $k_u$ and $k_v$ of the CCD cells (for calculating the physical size of pixels), and the coordinates $u_0$ and $v_0$ of the principal point (i.e., center point) in the image plane.

2.1 Corresponding Points
We need at least seven pairs of corresponding points to determine the geometric relationship between two images caused by viewing the same object from different viewpoints. One way to extract those points from a pair of images is as follows [14]: (i) extract candidate points by using the Harris corner detector [2], (ii) utilize a correlation technique to find matching pairs, and (iii) remove outliers by using an LMedS (i.e., least-median-of-squares) method. Due to the poor performance of the Harris corner detector on specular objects, this method is normally not suitable for such objects.

2.2 Fundamental and Essential Matrix
A fundamental matrix $F$ is an algebraic representation of epipolar geometry [13]. It can be calculated if we have at least seven correspondences (i.e., pairs of corresponding points), for example using linear methods (such as the 8-Point Algorithm of [8]) or nonlinear methods (such as the RANSAC Algorithm of [1], or the LMedS Algorithm of [14]). In the case of a linear method, the fundamental matrix is specified through solving an overdetermined system of linear equations utilizing the given correspondences. In the case of a nonlinear method, subsets of (at least seven) correspondences are chosen randomly and used to compute candidate fundamental matrices, and then the best one is selected, namely the one which causes the smallest error for all the detected correspondences. According to our experiments, linear methods are more time efficient and provide reasonably good results for large (say more than 13) numbers of correspondences. Nonlinear methods are more time consuming, but less sensitive to noise, especially if the correspondences also contain outliers. For given intrinsic camera parameters $K_1$ and $K_2$, the essential matrix $E$ can be derived from $F$ by computing $E = K_2^T F K_1$.
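As an illustration of the linear route just mentioned, the following sketch implements a normalized variant of the 8-Point Algorithm and the subsequent step $E = K_2^T F K_1$; it is a generic reconstruction with names of our choosing, not the authors' code.

import numpy as np

def eight_point(x1, x2):
    """Estimate F from point correspondences x1 <-> x2, each of shape (N, 2), N >= 8."""
    def normalize(pts):
        c = pts.mean(axis=0)
        s = np.sqrt(2) / np.mean(np.linalg.norm(pts - c, axis=1))
        T = np.array([[s, 0, -s * c[0]], [0, s, -s * c[1]], [0, 0, 1]])
        ph = np.column_stack([pts, np.ones(len(pts))])
        return (T @ ph.T).T, T

    p1, T1 = normalize(x1)
    p2, T2 = normalize(x2)
    # Each correspondence gives one row of the linear system A f = 0 (from x2^T F x1 = 0).
    A = np.array([[u2 * u1, u2 * v1, u2, v2 * u1, v2 * v1, v2, u1, v1, 1.0]
                  for (u1, v1, _), (u2, v2, _) in zip(p1, p2)])
    _, _, Vt = np.linalg.svd(A)
    F = Vt[-1].reshape(3, 3)
    # Enforce the rank-2 constraint on F.
    U, S, Vt2 = np.linalg.svd(F)
    F = U @ np.diag([S[0], S[1], 0.0]) @ Vt2
    F = T2.T @ F @ T1                 # undo the normalization
    return F / F[2, 2]

def essential_from_fundamental(F, K1, K2):
    """E = K2^T F K1 for given intrinsic matrices."""
    return K2.T @ F @ K1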
2.3 Projection Matrix
A projection matrix P can be expressed as follows: P = K[R | −RT ] where K is a matrix of the intrinsic camera parameters, and R and T are the rotation matrix and translation vector (the extrinsic camera parameters). Since the intrinsic parameters are specified by calibration, relative rotation and translation
can be successfully extracted from the fundamental matrix $F$. When recovering the projection matrices in reference to the first camera position, the projection matrix of the first camera position is given as $P_1 = K_1[I \mid 0]$, and the projection matrix of the second camera position is given as $P_2 = K_2[R \mid -RT]$. The method proposed by Hartley and Zisserman for computing the rotation matrix $R$ and translation vector $T$ (from the essential matrix $E$) is as follows [3]:

1. Compute $E$ by using $E = K_2^T F K_1$, where
$$K_i = \begin{pmatrix} f k_u & 0 & u_0 \\ 0 & f k_v & v_0 \\ 0 & 0 & 1 \end{pmatrix}$$
(note: $K_1 = K_2$ if we use the same camera at view points 1 and 2).
2. Perform a singular value decomposition (SVD) of $E$ following the template $E = U\,\mathrm{diag}(1, 1, 0)\,V^T$.
3. Compute $R$ and $T$ (for the second view point), where we have two options, namely
$$R_1 = U W V^T, \qquad R_2 = U W^T V^T, \qquad T_1 = u_3, \qquad T_2 = -u_3,$$
where $u_3$ is the third column of $U$ and
$$W = \begin{pmatrix} 0 & -1 & 0 \\ 1 & 0 & 0 \\ 0 & 0 & 1 \end{pmatrix}.$$

Another method for computing $R$ and $T$ from $E$ (also only using elementary matrix operations) is given in [4], which leads to almost identical results as the method by Hartley and Zisserman.
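A compact sketch of steps 2 and 3 above (our own illustration, with hypothetical names) could look as follows; in practice the valid combination of rotation and translation is selected by checking that triangulated points lie in front of both cameras, a detail the description above does not spell out.

import numpy as np

def decompose_essential(E):
    """Return the two rotation candidates and the translation candidate u3 (up to sign)."""
    U, _, Vt = np.linalg.svd(E)
    # Make sure U and V are proper rotations so that R1, R2 are valid rotation matrices.
    if np.linalg.det(U) < 0:
        U = -U
    if np.linalg.det(Vt) < 0:
        Vt = -Vt
    W = np.array([[0.0, -1.0, 0.0],
                  [1.0,  0.0, 0.0],
                  [0.0,  0.0, 1.0]])
    R1 = U @ W @ Vt
    R2 = U @ W.T @ Vt
    T1 = U[:, 2]        # u3, the third column of U
    T2 = -U[:, 2]
    return (R1, R2), (T1, T2)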
2.4 Dense Depth Map
At this point, the given correspondences allow only a few points to be reconstructed in 3D. A satisfactory 3D model of a pictured object requires a dense map of correspondences. The epipolar constraint (as calculated above) allows the correspondence search to be restricted to one-dimensional epipolar lines: the images are first rectified following the method in [10], and correspondence matching is then done by searching along a corresponding scan line in the rectified image. We also require a recovered baseline between both camera positions to calculate a dense depth map.
3 Optimization with Prior Knowledge
Since the computations of the fundamental and projection matrices are sensitive to noise, it is necessary to apply a method for reducing the effect of noise (to stabilize SfM). We utilize information about the given 3D scene, such as knowledge about collinearity or coplanarity.
3.1 Knowledge About Collinearity
It is not hard to detect collinear points on man-made objects, such as buildings or furniture. Assuming ideal central projection (i.e., no lens distortion or noise), collinear points in object space are mapped onto one line in the image plane. We assume that lens distortions are small enough to be ignored. Linearizing points which are supposed to be collinear can then be seen as a way to remove noise. Least-squares line fitting (minimizing perpendicular offsets) is used to identify the approximating line for a set of "noisy collinear points". Assume that we have such a set of points $P = \{(x_i, y_i) \mid i = 1, \ldots, n\}$ which determines a line $l(\alpha, \beta, \gamma): \alpha x + \beta y + \gamma = 0$. The coefficients $\alpha$, $\beta$ and $\gamma$ are calculated as follows [7]:

$$\alpha = \frac{\mu_{xy}}{\sqrt{\mu_{xy}^2 + (\lambda^* - \mu_{xx})^2}}, \qquad \beta = \frac{\lambda^* - \mu_{xx}}{\sqrt{\mu_{xy}^2 + (\lambda^* - \mu_{xx})^2}}, \qquad \gamma = -(\alpha \bar{x} + \beta \bar{y}),$$

where

$$\lambda^* = \frac{1}{2}\left(\mu_{xx} + \mu_{yy} - \sqrt{(\mu_{xx} - \mu_{yy})^2 + 4\mu_{xy}^2}\right),$$

$$\mu_{xx} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})^2, \qquad \mu_{yy} = \frac{1}{n-1}\sum_{i=1}^{n}(y_i - \bar{y})^2, \qquad \mu_{xy} = \frac{1}{n-1}\sum_{i=1}^{n}(x_i - \bar{x})(y_i - \bar{y}),$$

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i \qquad \text{and} \qquad \bar{y} = \frac{1}{n}\sum_{i=1}^{n} y_i.$$

After specifying the line, the points' positions are modified through perpendicular projection onto the line.
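A direct transcription of these formulas, together with the perpendicular projection used for the correction, might look as follows. This is a sketch under the assumption that the points are given as an (n, 2) NumPy array; the function names are ours.

import numpy as np

def fit_line(points):
    """Total least-squares line a*x + b*y + c = 0 with (a, b) a unit normal."""
    x, y = points[:, 0], points[:, 1]
    xm, ym = x.mean(), y.mean()
    n = len(points)
    mxx = np.sum((x - xm) ** 2) / (n - 1)
    myy = np.sum((y - ym) ** 2) / (n - 1)
    mxy = np.sum((x - xm) * (y - ym)) / (n - 1)
    lam = 0.5 * (mxx + myy - np.sqrt((mxx - myy) ** 2 + 4.0 * mxy ** 2))
    norm = np.hypot(mxy, lam - mxx)
    if norm == 0.0:                       # degenerate case: axis-parallel point set
        a, b = (1.0, 0.0) if mxx < myy else (0.0, 1.0)
    else:
        a, b = mxy / norm, (lam - mxx) / norm
    c = -(a * xm + b * ym)
    return a, b, c

def project_onto_line(points, a, b, c):
    """Move each point perpendicularly onto the fitted line."""
    d = points @ np.array([a, b]) + c     # signed distances to the line
    return points - np.outer(d, np.array([a, b]))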
3.2 Knowledge About Coplanarity
Coplanar points can be expected on rigid structures such as walls or a tabletop. For a set of points all incident with the same plane, there is a $3 \times 3$ matrix $H$, called a homography, which defines a perspective transform of those points into the image plane [11].

Homography. Assume we have an image sequence (generalizing the two-image situation from before) and $p_{ki}$ is the projection of the 3D point $P_i$ into the $k$th image, i.e., $P_i$ is related to $p_{ki}$ as follows:
$$p_{ki} = \omega_{ki} K_k R_k (P_i - T_k) \qquad (1)$$

where $\omega_{ki}$ is an unknown scale factor, $K_k$ denotes the intrinsic matrix (for the used camera), and $R_k$ and $T_k$ are the rotation matrix and translation vector. Following Equation (1), $P_i$ can be expressed as follows:

$$P_i = \omega_{ki}^{-1} R_k^{-1} K_k^{-1} p_{ki} + T_k \qquad (2)$$

Similarly, for the point $p_{li}$ lying in the $l$th image, we have

$$P_i = \omega_{li}^{-1} R_l^{-1} K_l^{-1} p_{li} + T_l \qquad (3)$$

From Equations (2) and (3), we get

$$p_{ki} = \omega_{ki} K_k R_k (\omega_{li}^{-1} R_l^{-1} K_l^{-1} p_{li} + T_l - T_k) \qquad (4)$$

With $R_{kl} = R_k R_l^{-1}$ we define $H_{kl}^{\infty} = K_k R_{kl} K_l^{-1}$. We also have the epipole $e_{kl} = K_k R_k (T_l - T_k)$. Equation (4) can then be simplified to

$$p_{ki} = \omega_{ki} \omega_{li}^{-1} (H_{kl}^{\infty} p_{li} + \omega_{li} e_{kl}) \qquad (5)$$

$H_{kl}^{\infty}$ is what we call the homography which maps points at infinity ($\omega_{li} = 0$) from image $l$ to image $k$. Consider a point $P_i$ on the plane $\vec{n}^T P_i - d = 0$. Then, from Equation (3), we have

$$\vec{n}^T P_i - d = \vec{n}^T \omega_{li}^{-1} R_l^{-1} K_l^{-1} p_{li} + \vec{n}^T T_l - d = 0.$$

Then we have

$$\omega_{li} = \frac{\vec{n}^T R_l^{-1} K_l^{-1} p_{li}}{d - \vec{n}^T T_l},$$

which can be rewritten as

$$\omega_{li} = d_l^{-1} \vec{n}^T R_l^{-1} K_l^{-1} p_{li},$$

where $d_l = d - \vec{n}^T T_l$ is the distance from the camera center (principal point) of the $l$th image to the plane $(\vec{n}, d)$. Substituting $\omega_{li}$ into Equation (5), we finally have

$$p_{ki} = \omega_{ki} \omega_{li}^{-1} (H_{kl}^{\infty} + d_l^{-1} e_{kl} \vec{n}^T R_l^{-1} K_l^{-1})\, p_{li}.$$

Let

$$H = \omega_{ki} \omega_{li}^{-1} (H_{kl}^{\infty} + d_l^{-1} e_{kl} \vec{n}^T R_l^{-1} K_l^{-1}).$$

This means: points lying in the same plane have an identical $H$, which can be utilized as a coplanarity constraint; see [11].

Coplanarity optimization. Coplanar points satisfy the relation described by the homography. We use this relation for modifying "noisy coplanar points," using the following equation:

$$p_{ki} = H_{kl}\, p_{li}$$

Here, $H_{kl}$ is the homography between the $k$th and $l$th image in the sequence, and $p_{ki}$, $p_{li}$ are the projections of point $P_i$ in the $k$th and $l$th image, respectively.
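As an illustration of this correction step, the homography between two views can be estimated from four or more of the (noisy) coplanar correspondences with a standard DLT and then used to re-project the points. This is a generic sketch with names of our own choosing, not the estimation procedure of [11].

import numpy as np

def fit_homography(src, dst):
    """DLT estimate of H with dst ~ H * src; src, dst are (N, 2) arrays, N >= 4."""
    A = []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.array(A, dtype=float))
    H = Vt[-1].reshape(3, 3)
    return H / H[2, 2]

def apply_homography(H, pts):
    """Map (N, 2) points through H and dehomogenize."""
    ph = np.column_stack([pts, np.ones(len(pts))]) @ H.T
    return ph[:, :2] / ph[:, 2:3]

# Correcting noisy coplanar points of image k from their positions in image l:
# H_kl = fit_homography(p_l, p_k); p_k_corrected = apply_homography(H_kl, p_l)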
4 Experiments and Analysis
To analyze the influence of noise, we perform SfM in the way shown in Figure 1. At Step 1, Gaussian noise is introduced into the coordinates of detected correspondences. At Step 2, three different methods are compared to determine which one is best for computing the fundamental matrix. At Step 3, a quantitative error analysis is performed.
Fig. 1. The way we perform SfM
This section first shows experiments on the performance of different methods for computing the fundamental matrix, and then the effect of the optimizations mentioned in the previous section.

4.1 Computation of the Fundamental Matrix
Three algorithms (8-Point, RANSAC and LMedS) are compared with each other in this section. To determine the most stable one in the presence of noise, Gaussian noise (with mean 0 and deviation δ = 1 pixel) and one outlier are added to the given correspondences. The performances of the three algorithms are characterized in Figure 2: due to the outlier, the 8-Point Algorithm is more sensitive than the other two.

4.2 Optimizations
To test the effect of the optimizations mentioned in the previous section, the results of splitting the essential matrices (rotation matrices and translation vectors) are compared with each other. Two images of a calibration object are used as test images (shown in Figure 3). The data obtained from calibration (intrinsic
Fig. 2. Performance of the three algorithms in the presence of noise
Fig. 3. The first (left) and second (right) candidate images
parameters and extrinsic parameters of the camera) are used as the ground truth. The roll angle $\alpha$, pitch angle $\beta$ and yaw angle $\gamma$ are used to compare the rotation matrices in a quantitative manner. These angles can be computed from a rotation matrix $R$ by the following equations [9]:

$$\alpha = \mathrm{atan2}\!\left(\frac{r_{23}}{\sin\gamma},\, \frac{r_{13}}{\sin\gamma}\right), \qquad \beta = \mathrm{atan2}\!\left(\frac{r_{32}}{\sin\gamma},\, \frac{-r_{31}}{\sin\gamma}\right), \qquad \gamma = \mathrm{atan2}\!\left(\sqrt{r_{31}^2 + r_{32}^2},\; r_{33}\right),$$

where $r_{ij}$ is the element of $R$ at the $i$th row and $j$th column, and

$$\mathrm{atan2}(y, x) = \begin{cases} \arctan(y/x) & (x > 0) \\ \frac{y}{|y|}\left(\pi - \arctan(|y/x|)\right) & (x < 0) \\ \frac{y}{|y|}\cdot\frac{\pi}{2} & (x = 0,\; y \neq 0) \\ \text{undefined} & (x = 0,\; y = 0) \end{cases}$$

Since splitting the essential matrix only results in a translation vector up to a scale factor, all translation vectors (including the ground-truth one) are normalized to unit length so that they can be compared with each other in a quantitative manner.
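The angle extraction can be coded directly from these equations; the sketch below (our own, not the authors' code) uses NumPy's arctan2, which already handles the sign cases of the definition above. Since atan2 is insensitive to the common positive factor 1/sin γ, the division can be dropped as long as sin γ > 0; the case sin γ ≈ 0 has to be treated separately.

import numpy as np

def angles_from_rotation(R, eps=1e-9):
    """Roll, pitch and yaw angles (alpha, beta, gamma) as defined above; R is a 3x3 rotation."""
    gamma = np.arctan2(np.hypot(R[2, 0], R[2, 1]), R[2, 2])
    if np.sin(gamma) < eps:               # degenerate case: alpha and beta are not separable
        return np.arctan2(R[1, 0], R[0, 0]), 0.0, gamma
    alpha = np.arctan2(R[1, 2], R[0, 2])  # = atan2(r23 / sin(gamma), r13 / sin(gamma))
    beta = np.arctan2(R[2, 1], -R[2, 0])  # = atan2(r32 / sin(gamma), -r31 / sin(gamma))
    return alpha, beta, gamma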
Fig. 4. Errors in rotation matrices (left) or translation vectors (right). First row: errors from non-noisy data. Second row: noisy data. Third or fourth row: errors from noisy data after optimization with collinearity or coplanarity knowledge, respectively.
Fig. 5. Epipolar lines resulting from different data sets. The green lines (dashed) are from data without generated noise; the red lines (solid) are from noisy data; the blue lines (dash-dotted) and yellow lines (dotted) are from noisy data which has been optimized with collinearity and coplanarity knowledge, respectively.
The comparison of rotation matrices and translation vectors is shown in Figure 4. The errors are the mean errors over ten iterations for different numbers of given correspondences. The noise added is Gaussian noise (with mean 0 and deviation δ = 1 pixel). The method used to compute the fundamental matrix is the 8-Point Algorithm, which is more sensitive to noise than the RANSAC and LMedS algorithms. The method of Hartley and Zisserman is used to split the essential matrix. According to the results shown in Figure 4, the coplanarity knowledge gives a better optimization than the collinearity knowledge. One possible reason is that the collinearity optimization is performed on uncalibrated images, in which the projections of truly collinear points do not lie strictly on a straight line. For arbitrary images, the effect of the optimizations can be seen in Figure 5 by looking at the relative positions of epipolar lines computed from different data sets. It shows that the two optimization strategies have a positive effect on reducing the influence of noise, and that the coplanarity optimization performs better than the collinearity optimization. Figure 6 shows the reconstructed point cloud of the CITR building in Auckland and Figure 7 visualizes the texture-mapped surface model. The main edges of the building are reconstructed with
Fig. 6. Two different views of reconstructed points from the optimized SfM algorithm
Fig. 7. Triangulated surface mesh with textures
near-perfect 90° angles. Slight image noise (less than 1 pixel) already leads to angles between 20° and 140°, which indicates the sensitivity of classic SfM approaches. By incorporating the collinearity and coplanarity constraints, the reconstruction quality is improved.
5 Conclusion
Modules relating to structure from motion have been discussed in this paper. According to the experiments, structure from motion is sensitive to noise and it is necessary to improve its stability. Two optimizations, using collinearity and coplanarity knowledge, have been proposed, and the related experiments show that the two proposed optimizations, especially the coplanarity one, have a positive effect on reducing the influence of noise.
Acknowledgments The authors thank Daniel Grest and Kevin Koeser from the University of Kiel for the BIAS-Library and useful hints.
References

1. M. Fischler and R. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Comm. ACM, 24:381–385, 1981.
2. C. Harris and M. Stephens. A combined corner and edge detector. In Proc. Alvey Vision Conf., pages 147–151, 1988.
3. R. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, New York, 2000.
4. B. K. P. Horn. Recovering baseline and orientation from essential matrix. http://www.ai.mit.edu/people/bkph/papers/essential.pdf, 1990.
5. T. Huang. Motion and structure from feature correspondences: a review. Proc. IEEE, 82:252–268, 1994.
6. R. Klette, K. Schlüns, and A. Koschan. Computer Vision – Three-dimensional Data from Images. Springer, Singapore, 1998.
7. C. L. Lawson and R. J. Hanson. Solving Least Squares Problems. Prentice Hall, Englewood, 1974.
8. H. Longuet-Higgins. A computer algorithm for reconstructing a scene from two projections. Nature, 293:133–135, 1981.
9. R. Murray, Z. Li, and S. Sastry. A Mathematical Introduction to Robotic Manipulation. CRC Press, Boca Raton, 1994.
10. M. Pollefeys, R. Koch, and L. van Gool. A simple and efficient rectification method for general motion. In Proc. Int. Conf. Computer Vision, pages 496–501, 1999.
11. R. Szeliski and P. H. S. Torr. Geometrically constrained structure from motion: points on planes. In Proc. European Workshop 3D Structure Multiple Images Large-Scale Environments, pages 171–186, 1998.
12. R. Tsai. A versatile camera calibration technique for high-accuracy 3D machine vision metrology using off-the-shelf TV cameras and lenses. IEEE J. Robotics and Automation, 3:323–344, 1987.
13. Z. Zhang. Determining the epipolar geometry and its uncertainty: a review. Int. J. Computer Vision, 27:161–198, 1998.
14. Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong. A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry. Artificial Intelligence J., 78:87–119, 1995.
Reconstruction of Building Models with Curvilinear Boundaries from Laser Scanner and Aerial Imagery

Liang-Chien Chen, Tee-Ann Teo, Chi-Heng Hsieh, and Jiann-Yeou Rau

Center for Space and Remote Sensing Research, National Central University, No. 300, Jhong-Da Road, Jhong-Li City, Tao-Yuan 32001, Taiwan
{lcchen, ann}@csrsr.ncu.edu.tw, [email protected], [email protected]

Abstract. This paper presents a scheme to detect building regions, followed by a reconstruction procedure. Airborne LIDAR data and aerial imagery are integrated in the proposed scheme. Among the different building types, we target those with straight and curvilinear boundaries. In the detection stage, region-based segmentation and object-based classification are integrated. In the building reconstruction, we perform edge detection to obtain the initial building lines from the rasterized LIDAR data. The accurate arcs and straight lines are then obtained in the image space. By employing roof analysis, we determine the three-dimensional building structure lines. Finally, the Split-Merge-Shape method is applied to generate the building models. Experimental results indicate that the success rate of the building detection reaches 91%. Among the successfully detected buildings, 90% are fully or partially reconstructed. The planimetric accuracy of the building boundaries is better than 0.8 m, while the shaping error of the reconstructed roofs in height is 0.14 m.

Keywords: LIDAR, Aerial Image, Building Models.
1 Introduction

Building modeling in cyber space is an essential task in the application of three-dimensional geographic information systems (GIS) [1]. The extracted building models are useful for urban planning and management, disaster management, as well as other applications. Traditionally, the generation of building models is mainly performed by using stereo aerial photography. However, the airborne LIDAR (Light Detecting And Ranging) system is proving to be a promising technological alternative. As the airborne LIDAR integrates the laser scanner, Global Positioning System (GPS) and Inertial Navigation System (INS), it is able to provide direct georeferencing. Its high precision in laser ranging and scanning orientation makes decimeter-level accuracy of 3D objects possible. The three-dimensional point clouds acquired by an airborne LIDAR system provide comprehensive shape detail, while aerial images contain plentiful spectral information. Thus, the integration of the two complementary data sets reveals the possibility of an automatic generation of building models. Several data fusion methods have been proposed to generate building models, e.g.,
LIDAR and aerial images [2], LIDAR and three-line-scanner images [3], LIDAR and satellite images [4], and LIDAR, aerial images and 2D maps [5]. The physical mechanism in the model generation includes the identification of the building regions, and the reconstruction of the geometric models. To automate the identification procedure, a classification process using remotely sensed data should be employed to detect the building regions. The reconstruction strategy can be classified into two categories, i.e., model-driven and data-driven. The model-driven approach is a top-down strategy, which starts with a hypothesis of a synthetic building model. Verification of the model's consistency with the LIDAR point clouds is then performed. In this strategy, a number of 3D parametric primitives are generated by the segmentation of the LIDAR data. Afterwards, the best-fitting primitive is selected from the aerial image. The building model is obtained by merging together all the 3D building primitives [6]. This method is restricted by the types of 3D parametric primitives. The data-driven approach is a bottom-up strategy, which starts from the extraction of building primitives, such as building corners, structure lines and roof-tops. Subsequently, a building model can be grouped together through a hypothesis process. A general approach is to extract the plane features from the LIDAR point clouds, and detect the line features from the aerial image. The plane and line features are combined to develop the building models [7]. The reported results are limited to buildings with straight line boundaries. Buildings with curvilinear boundaries are seldom discussed. Furthermore, there is no report in the literature on 3D curvilinear building modeling from LIDAR and image data. From a data fusion point of view, we propose a scheme to reconstruct building models via LIDAR point clouds and aerial images. The proposed scheme comprises two major parts: (1) detection, and (2) reconstruction. Spatial registration of the LIDAR data and aerial imagery is performed during the data preprocessing. The registration is done in such a way that the two data sets are unified in the object coordinate system. Meanwhile, we calculate the exterior orientation parameters of the aerial imagery by employing ground control points. Afterwards, region-based segmentation and object-based classification are integrated during the building detection stage. After the segmentation, the object-based classification method detects the building regions by considering the spectral features, shape, texture, and elevation information. In the building reconstruction stage, the building blocks are divided and conquered. Once the building regions are detected, we analyze the coplanarity of the LIDAR point clouds to obtain the 3D planes and 3D ridge lines. We use an edge detection method to obtain the initial building lines from the rasterized LIDAR data. Through back projection of the initial lines to the image space, the accurate arcs and straight lines are obtained in the image space. The edges extracted from the aerial image are incorporated to determine the 3D positions of the building structure lines. A patented Split-Merge-Shape method [8] is then employed to generate the building models in the last step. This article is organized as follows. Section 2 discusses the methodology of the building detection. In Section 3, the building reconstruction strategy is presented.
We validate the proposed scheme by using an aerial image and LIDAR data acquired by the Leica ALS50 system in Section 4. Finally, a summary of the described method is given in the last section.
2 Building Detection

The primary objective of this section is to extract the building regions. There are two steps in the proposed scheme: (1) region-based segmentation, and (2) object-based classification. The flow chart of the detection method is shown in Fig. 1. There are two ways to conduct the segmentation. The first is the contour-based approach. It performs the segmentation by utilizing the edge information. The second is the region-based segmentation. It uses a region growing technique to merge pixels with similar attributes. We select the region-based approach, because it is less sensitive to noise. The proposed scheme combines the surface variations from the LIDAR data with the spectral information obtained from the orthoimage in the segmentation. The pixels with similar geometric and spectral properties are merged into a region. After segmentation, each region is a candidate object for classification. Instead of a pixel-based approach, an object-based approach is performed. Considering the characteristics of elevation, spectral information, texture, roughness, and shape, the classification procedure is performed to detect the building regions. The considered characteristics are described as follows.
Fig. 1. Flowchart of building detection

(1) Elevation: Subtracting the Digital Terrain Model (DTM) from the Digital Surface Model (DSM), we generate the Normalized DSM (NDSM). The data describes the height variations above ground. By setting an elevation threshold, one can select the above-ground objects, which include buildings and vegetation.
(2) Spectral information: The spectral information is obtained from the color aerial image. A greenness index is used to distinguish the vegetation from non-vegetation areas.
(3) Texture: The texture information is retrieved from the aerial image via a Grey Level Co-occurrence Matrix (GLCM) [9] texture analysis. The GLCM is a matrix of relative frequencies for pixel values occurring within a specific neighborhood. We select the entropy and homogeneity as indices to quantify the co-occurrence probability (a minimal sketch for computing these two indices is given after this list). The role of the texture information is to separate the building from the vegetation when the objects have similar spectral responses.
(4) Roughness: The roughness of the LIDAR data aims to differentiate the vegetation regions from non-vegetation ones. The surface roughness is similar to the texture information of the image data. The role of the surface roughness is to separate the building and vegetation when the objects have similar spectral responses. We choose the slope variance as the roughness index.
(5) Shape: The shape attribute includes the size and length-to-width ratio. An area threshold is used to filter out overly small objects. This means regions smaller than a minimum area are not taken into account as a building. The length-to-width ratio is used to remove overly thin objects. An object is not considered a building when its length-to-width ratio is larger than a specified threshold.
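For the texture attribute in item (3), the two GLCM indices can be computed with a few lines of NumPy. The sketch below is our own illustration; the quantization level and the pixel offset are assumptions, not values from the paper.

import numpy as np

def glcm(gray, dx=1, dy=0, levels=32):
    """Normalized grey-level co-occurrence matrix for one pixel offset (dx, dy)."""
    q = np.floor(gray.astype(float) / (gray.max() + 1e-9) * levels).astype(int)
    q = np.clip(q, 0, levels - 1)
    h, w = q.shape
    a = q[:h - dy, :w - dx].ravel()         # reference pixels
    b = q[dy:, dx:].ravel()                 # neighbour pixels at offset (dx, dy)
    m = np.zeros((levels, levels))
    np.add.at(m, (a, b), 1.0)
    return m / m.sum()

def glcm_entropy(m):
    p = m[m > 0]
    return float(-np.sum(p * np.log(p)))

def glcm_homogeneity(m):
    i, j = np.indices(m.shape)
    return float(np.sum(m / (1.0 + (i - j) ** 2)))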
3 Building Reconstruction

The reconstruction stage begins by isolating each individual building region. The stage includes three parts: (1) detection of roof planes, (2) extraction of 3D structure lines, and (3) 3D building modeling. The flow chart of the building reconstruction is shown in Fig. 2.
Fig. 2. Flowchart of building reconstruction
3.1 Detection of Roof Planes

A TIN-based region growing procedure is employed to detect the roof planes. The point clouds are first structured into a TIN mesh built by Delaunay triangulation. The coplanarity and adjacency between the triangles are considered for growing the TIN-based regions. The coplanarity condition is examined by the distance of the triangle center to the plane. When the triangles meet the coplanarity criteria, the triangles are merged into a new facet. The process starts by selecting a seed triangle and determining the initial plane parameters. The initial plane is determined from the seed triangle. If the distance of a neighboring triangle to the initial plane is smaller than a specified threshold, the two triangles are combined. The parameters of the reference plane are recalculated using all of the triangles that belong to the region. The seed region grows in this manner. When the region stops growing, a new seed triangle is chosen.
The region growing stops when all of the triangles have been examined. Due to the errors in the LIDAR data, the detected regions may consist of fragmented triangles. Thus, small regions are merged with the neighboring region that has the closest normal vector. After the region growing, we use least-squares regression to determine the plane equations. A sample result of the detection is illustrated in Fig. 3.
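Both the coplanarity test and the final plane fit can be expressed with a least-squares plane; the following is a small sketch with our own names and an assumed distance threshold, not the authors' code.

import numpy as np

def fit_plane(points):
    """Least-squares plane n . p + d = 0 through an (N, 3) point set."""
    centroid = points.mean(axis=0)
    _, _, Vt = np.linalg.svd(points - centroid)
    n = Vt[-1]                      # direction of least variance = plane normal
    d = -n @ centroid
    return n, d

def is_coplanar(triangle_center, n, d, tol=0.15):
    """Accept a triangle for the growing region if its center is close to the plane."""
    return abs(n @ triangle_center + d) < tol   # tol in metres, an assumed threshold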
Fig. 3. Illustration of the detection of roof planes: (a) triangle mesh, (b) extracted planes
3.2 Extraction of 3D Structure Lines

Two types of building structure lines, namely ridge lines and step edges, are targeted in this study. A ridge line is a building feature where two planes intersect. It can be determined from the extracted planes. A step edge represents a building structure where roofs have height jumps. A step edge may be straight or curvilinear. Considering the difference in spatial resolution, each initial step edge is estimated from the LIDAR data, while the precise step edge is extracted from the image. In the extraction of ridges, the line is obtained by the intersection of the neighboring planes. Mathematically, the intersection line computed from the plane equations is a 3D straight line without end points. Thus, we use the shared triangle vertices to define the line ends. That means the final product of a ridge line is a straight line with two end points. In the extraction of step edges, we detect the initial building edges from the rasterized LIDAR data. The rough edges from the LIDAR data are used to estimate the location of the step edges in the image space. Building edges around the projected area are detected through the Canny Edge Detector [10]. At this stage, there is no prior knowledge about the lines being straight or curvilinear. In order to distinguish the straight lines from the curvilinear ones, we develop a scheme to identify the different line types. The basic mechanism is to determine the most probable radius of a segment. First, we perform line tracking to split all the edges into several line segments. A sample of the extracted edge pixels is shown in Fig. 4a. Fig. 4b demonstrates a sample result of the split line segments. Afterwards, we merge adjacent line segments by the criterion of length and angle. The merged lines are treated as an arc candidate. Fig. 4c presents a sample result of the arc candidates. The last step is to test the rationality of the radius for each arc candidate. We randomly select three points from an arc candidate to calculate a radius. All the points are tested to generate a radius histogram like the one in Fig. 4d. The horizontal axis is for the radii of possible circles, while the vertical axis presents the accumulated number of radii. The arc candidate is accepted when the radius shows the highest concentration.
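The radius test can be sketched as follows; this is our reconstruction, and the sampling count and histogram bin width are assumptions, not values from the paper. The circumradius of random point triples is accumulated in a histogram, and the candidate is accepted only if a single bin clearly dominates.

import numpy as np

def circumradius(p1, p2, p3):
    """Radius of the circle through three 2D points (inf if they are nearly collinear)."""
    a = np.linalg.norm(p2 - p3)
    b = np.linalg.norm(p1 - p3)
    c = np.linalg.norm(p1 - p2)
    # twice the signed triangle area via the 2D determinant
    det = (p2[0] - p1[0]) * (p3[1] - p1[1]) - (p2[1] - p1[1]) * (p3[0] - p1[0])
    area = 0.5 * abs(det)
    return np.inf if area < 1e-9 else a * b * c / (4.0 * area)

def radius_histogram(points, n_samples=2000, bin_width=2.0, rng=None):
    """Histogram of circumradii of random triples drawn from an arc candidate."""
    rng = np.random.default_rng() if rng is None else rng
    pts = np.asarray(points, dtype=float)
    radii = []
    for _ in range(n_samples):
        i, j, k = rng.choice(len(pts), size=3, replace=False)
        r = circumradius(pts[i], pts[j], pts[k])
        if np.isfinite(r):
            radii.append(r)
    radii = np.array(radii)
    bins = np.arange(0.0, radii.max() + bin_width, bin_width)
    counts, edges = np.histogram(radii, bins=bins)
    return counts, edges     # accept the candidate if counts shows one dominant peak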
Fig. 4. Illustration of curvilinear line separation: (a) extracted edges, (b) split line segments, (c) arc candidates, (d) radius histogram (count over radius)
For the classified straight lines, we use the Hough Transform [11] to extract the target lines in a parameter space. Eq. 1 shows a straight line being transformed in the Hough space. On the other hand, the classified curvilinear lines are extracted by a modified Hough Transform [12]. The circle equation is shown in Eq. 2. Notice that the circle’s radius is calculated from the radius histogram, as described above. Given the image coordinates and the height information from the 3D planes, we calculate the 3D structure lines in the object space via exterior orientation parameters.
$$x_i \cos\theta + y_i \sin\theta = \rho \qquad (1)$$

where $x_i, y_i$ are the pixel coordinates at location $i$, $\theta$ is the angle, and $\rho$ is the distance.

$$(x_i - a)^2 + (y_i - b)^2 = r^2 \qquad (2)$$

where $x_i, y_i$ are the pixel coordinates at location $i$, $(a, b)$ is the center of the circle, and $r$ is the radius of the circle.
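A minimal Hough accumulator for Eq. (1) is sketched below (generic code, not the implementation used in the paper). The circle case of Eq. (2) is analogous, except that each edge pixel votes for candidate centers (a, b) = (x_i − r cos θ, y_i − r sin θ) with the radius r fixed from the histogram.

import numpy as np

def hough_lines(edge_points, image_shape, n_theta=180):
    """Vote edge pixels (x, y) into a (rho, theta) accumulator and return the strongest line."""
    h, w = image_shape
    diag = int(np.ceil(np.hypot(h, w)))
    thetas = np.linspace(0.0, np.pi, n_theta, endpoint=False)
    cos_t, sin_t = np.cos(thetas), np.sin(thetas)
    acc = np.zeros((2 * diag + 1, n_theta), dtype=np.int64)
    for x, y in edge_points:
        rho = np.round(x * cos_t + y * sin_t).astype(int) + diag   # shift to non-negative index
        acc[rho, np.arange(n_theta)] += 1
    r_idx, t_idx = np.unravel_index(np.argmax(acc), acc.shape)
    return r_idx - diag, thetas[t_idx]      # (rho, theta) of the dominant line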
3.3 3D Building Modeling

The extracted 3D structure lines are processed by a patented method, i.e., the Split-Merge-Shape method [8], for building reconstruction. The Split and Merge processes sequentially reconstruct the topology between two consecutive line segments, and then reform the areas as enclosed regions. The two procedures are performed in a two-dimensional space. During splitting, a line segment is chosen for reference. We split all the line segments into a group of roof primitives. All of the possible roof primitives are generated by splitting the area of interest from all the line segments. In the merging procedure, the connectivity of two adjacent roof primitives is analyzed successively. If the boundaries shared between them do not correspond to any 3D line segment, the two roof primitives will be merged. The Shape step uses the available 3D edge height information to determine the most appropriate rooftop. The Shape process is performed in a three-dimensional
space. The first step of shaping is to assign a possible height for each roof edge from its corresponding 3D edge. Every 3D edge is first automatically labeled as a shared edge or an independent edge. The height for an independent edge can then be assigned from its corresponding 3D edges. The second step is to define the shape of a rooftop, according to the height of the independent edges. If more than two independent edges exist, and are sufficient to fit into a planar face, a coplanar fitting is applied. Fig. 5 demonstrates the modeling procedure.
Fig. 5. Procedure of building modeling: (a) 3D line segments, (b) results of splitting, (c) results of merging, (d) results of shaping
4 Experimental Results

The LIDAR data used in this investigation covers a test area situated within the Industrial Technology Research Institute in northern Taiwan. The LIDAR data is obtained by the Leica ALS 50 system. The average density of the LIDAR point clouds is 2 pts/m². The LIDAR data is shown in Fig. 6a. The ground sampling distance of the aerial image is 0.5 m. Fig. 6b shows the image of the test area. The test area contains complex buildings with both straight-line and curvilinear boundaries. The roof types are either flat or gable. There are 23 buildings in the test area. We use stereoscopic measurements to derive the building models as references for validation. The experiments include three different aspects in the validation procedure. The first evaluates the detection rate for building regions. The second checks the planimetric accuracy of the building corners. The third assesses the height discrepancy between the roof tops and the original LIDAR point clouds.

4.1 Building Detection
During building detection, the surface points and ground points from the LIDAR data are both rasterized to a DSM and DTM with a pixel size of 0.5 m. The aerial image is orthorectified by using the DSM. A 1/1,000 scale topographic map is employed as ground truth. The classified results, superimposed onto the topographic map, are shown in Fig. 6c. It is found that 21 out of the 23 buildings are successfully detected, i.e., the detection rate is 91%. The missing buildings are, in general, too small for detection. Both of the missing buildings are smaller than 30 m².
Fig. 6. Results of building detection: (a) LIDAR DSM, (b) aerial image, (c) detected building regions with vector maps
4.2 Building Reconstruction
During building reconstruction, we categorize the buildings into three types: (1) straight-line buildings with flat roofs, (2) straight-line buildings with gable roofs, and (3) curvilinear boundary buildings with flat roofs. Two selected examples with different building complexities are given in Fig. 7. Fig. 8 shows all the generated building models. In the accuracy evaluation, we compare the coordinates of the roof corners in the reconstructed models with the corners acquired by stereoscopic manual measurements. The Root-Mean-Square Errors (RMSE) are 0.71 m and 0.73 m in the X and Y directions, respectively. The ground resolution of the aerial image is 0.5 m. Thus, the accuracy is roughly 1.5 pixels in the image space. In Fig. 9, we provide error vectors that are superimposed onto the building boundaries.
Fig. 7. Results of reconstruction for complex buildings: (a) building regions of case 1, (b) detected planes of case 1, (c) extracted lines of case 1, (d) generated building models of case 1, (e) building regions of case 2, (f) detected planes of case 2, (g) extracted lines of case 2, (h) generated building models of case 2
Fig. 8. The generated building models
Fig. 9. Error vectors of building corners
Table 1. Success rate of building reconstruction

Reconstruction results | Straight line boundaries, flat roof | Straight line boundaries, gable roof | Curvilinear boundaries, flat roof | Success rate (%)
Correct                | 9 | 5 | 2 | 76
Partially correct      | 2 | 0 | 1 | 14
Erroneous              | 0 | 2 | 0 | 10
The fidelity of the building reconstruction is validated in terms of the success rate. The success rate of the reconstruction is divided into three categories, namely, correct, partially correct and erroneous. Reconstructed buildings that have the same shape as their actual counterpart are deemed correct. The partially correct category represents groups of connected buildings of which only a portion is successfully reconstructed. The reconstruction is erroneous when the building model is inherently different in shape from the actual one. Table 1 shows the success rate for the three types of buildings. Seventy-six percent of the buildings are correctly reconstructed. The buildings that failed in the reconstruction, i.e., the erroneous category, are the small ones that do not have enough available LIDAR points. The mean value of the height differences between the LIDAR points and the roof surface, which is called the shaping error, is 0.12 m. The discrepancies range from 0.06 m to 0.33 m.
5 Conclusions

In this investigation, we have presented a scheme for the extraction and reconstruction of building models via the merging of LIDAR data and aerial imagery. The results from the tests demonstrate the potential of reconstructing buildings with straight-line and curvilinear boundaries. The building models generated by the proposed method take advantage of the high horizontal accuracy of the aerial image and the high vertical accuracy of the LIDAR data. More than 91% of the building regions are correctly detected by our approach. Among the successfully detected buildings, ninety percent of the buildings are fully or partially reconstructed. The planimetric accuracy of the building boundaries is better than 0.8 m, while the shaping error of the reconstructed roofs in height is 0.14 m. This demonstrates that the proposed scheme is a promising tool for future applications.
Acknowledgments. This investigation was partially supported by the National Science Council of Taiwan under Project No. NSC94-2211-E-008-028. The authors would like to thank the Department of Land Administration of Taiwan and Industrial Technology Research Institute of Taiwan for providing the test data sets.
References

1. Hamilton, A., Wang, H., Tanyer, A.M., Arayici, Y., Zhang, X., Song, Y.: Urban Information Model for City Planning. ITcon. 10 (2005) 55-67
2. Rottensteiner, F.: Automatic Generation of High-quality Building Models from LIDAR Data. IEEE Computer Graphics and Applications. 23 (2003) 42-50
3. Nakagawa, M., Shibasaki, R., Kagawa, Y.: Fusion Stereo Linear CCD Image and Laser Range Data for Building 3D Urban Model. International Archives of Photogrammetry and Remote Sensing. 34 (2002) 200-211
4. Guo, T.: 3D City Modeling using High-Resolution Satellite Image and Airborne Laser Scanning Data. Doctoral dissertation, Department of Civil Engineering, University of Tokyo, Tokyo (2003)
5. Vosselman, G.: Fusion of Laser Scanning Data, Maps and Aerial Photographs for Building Reconstruction. International Geoscience and Remote Sensing Symposium. 1 (2002) 85-88
6. Rottensteiner, F., Jansa, J.: Automatic Extraction of Buildings from LIDAR Data and Aerial Images. International Archives of Photogrammetry and Remote Sensing. 34 (2002) 295-301
7. Fujii, K., Arikawa, T.: Urban object reconstruction using airborne laser elevation image and aerial image. IEEE Transactions on Geoscience and Remote Sensing. 40 (2002) 2234-2240
8. Rau, J.Y., Chen, L.C.: Robust Reconstruction of Building Models from Three-Dimensional Line Segments. Photogrammetric Engineering and Remote Sensing. 69 (2003) 181-188
9. Haralick, R.M., Shanmugam, K., Dinstein, I.: Textural Features for Image Classification. IEEE Transactions on Systems, Man and Cybernetics. 3 (1973) 610-621
10. Canny, J.: A Computational Approach to Edge Detection. IEEE Transactions on Pattern Analysis and Machine Intelligence. 8 (1986) 679-698
11. Hough, P.V.C.: Method and Means for Recognizing Complex Patterns. U.S. Patent No. 3,069,654 (1962)
12. Xu, L., Oja, E., Kultanen, P.: A New Curve Detection Method: Randomized Hough Transform (RHT). Pattern Recognition Letters. 11 (1990) 331-338
The Generation of 3D Tree Models by the Integration of Multi-sensor Data

Liang-Chien Chen, Tee-Ann Teo, and Tsai-Wei Chiang

Center for Space and Remote Sensing Research, National Central University, No. 300, Jhong-Da Road, Jhong-Li City, Tao-Yuan 32001, Taiwan
{lcchen, ann}@csrsr.ncu.edu.tw, [email protected]

Abstract. Three-dimensional tree modeling is an important task in the management of forest ecosystems. The objective of this investigation is to reconstruct 3D tree models using LIDAR data and high resolution images. The proposed scheme comprises three major steps: (1) data preprocessing, (2) vegetation detection, and (3) tree modeling. The data preprocessing includes spatial registration of the airborne LIDAR and high resolution images, derivation of the above-ground surface from the LIDAR data, and generation of a spectral index from the high resolution images. In the vegetation detection, region-based segmentation and knowledge-based classification are integrated to detect the tree regions. Afterwards, watershed segmentation is selected to extract the tree crowns and heights. In the last step, we use the tree height, tree crown and terrain information to build up the 3D tree models. The experimental results indicate that the accuracy of the extracted individual trees is better than 80%, while the accuracy of the determined tree heights is about 1 m.

Keywords: LIDAR, High Resolution Image, Tree Model.
1 Introduction

Forests are an important and valuable resource to the world. Furthermore, research in relevant environmental issues, such as the ecosystem, biodiversity, wildlife, and so forth, requires detailed forest information. In light of this, there is an increasing urgency to obtain a comprehensive understanding of forests. Thus, the 3D tree model plays an important role in forest management. Remote sensing technology has been applied to forest ecosystem management for many years [1]. Most of the research utilizes the spectral characteristics of optical images to detect the forest and delineate the independent tree crowns [2] [3]. Previous studies used high resolution images to estimate tree locations, canopy density, and biomass [4]. However, optical images are easily influenced by the topography and weather conditions. In addition, as the ability of optical images to penetrate the forest canopy is weak, they are unable to directly capture the 3D forest structure. Nowadays, LIDAR (LIght Detection And Ranging) systems have become a mature tool for the derivation of 3D information. The LIDAR system is an integration of the laser scanner, Global Positioning System (GPS) and Inertial Navigation System (INS). Its high precision in laser ranging and scanning orientation makes decimeter
accuracy for the ground surface possible. The LIDAR technology provides horizontal and vertical information at high spatial resolution and vertical accuracy. Forest attributes like tree heights can be directly retrieved from LIDAR data. Early studies of LIDAR forest measurements used LIDAR to estimate forest vegetation characteristics, such as canopy cover, forest volume, and biomass [5]. However, although the LIDAR data contains abundant 3D shape features, it lacks spectral information. The extraction of individual trees from LIDAR data is an important topic for 3D tree modeling. The strategies for individual tree delineation can be divided into the following categories: (1) pixel-based methods [6], (2) region-based methods [7], (3) contour-based methods [8], and (4) empirical methods [9]. A region-based approach such as watershed segmentation applies mathematical morphology to explore the geometric structure of trees in an image. The advantage of this approach is that the method may selectively preserve the structural information, while also accomplishing the desired tasks within the image. In the region-based approach, homogeneity regarding the shape and color in the neighborhood is examined in a region growing process. The contour-based approach minimizes the internal energy by weighting the parameters. The empirical method collects a large amount of ground truth pertaining to various forest parameters, such as the tree crown width, tree height, and tree age. Afterwards, the relationship among the tree properties is analyzed. A typical result shows that the tree height and tree crown width have a linear relationship after a regression analysis is performed. A review of the rapidly growing literature on LIDAR applications emphasizes the need for data fusion in the processing phase of LIDAR data, which may serve as a method for improving the various feature extraction tasks. Therefore, we conducted an integration of the LIDAR data and high resolution images to build up the forest canopy model. The objective of this investigation was to construct a 3D forest canopy model using LIDAR data and high resolution images. The proposed method was a data fusion scheme with a coarse-to-fine strategy. The proposed scheme comprised three major steps: (1) data preprocessing, (2) vegetation detection, and (3) tree modeling. The data preprocessing included spatial registration of the LIDAR and high resolution images, derivation of the above-ground surface from the LIDAR data, and generation of a spectral index from the high resolution image. In the vegetation detection, region-based segmentation followed by knowledge-based classification was integrated to detect tree regions. Subsequently, we performed a tree crown extraction in the vegetation regions. Watershed segmentation and a local maximum search were selected to extract the tree crowns. In the last step, we used the tree height, tree crown, and terrain information to build the 3D tree models. The validation datasets included an orchard test site in Taiwan, located in Tai-Chung, and a forest test site in Finland, situated in Espoolahti.
2 Methodology

The proposed scheme encompassed three parts: (1) data preprocessing, (2) vegetation detection, and (3) tree modeling. The preprocessing included geometric and radiometric processing. A divide-and-conquer strategy was then incorporated to detect the vegetation, followed by the tree modeling. Fig. 1 shows the flowchart of the proposed method.
Fig. 1. Flowchart of the proposed method: data preprocessing (LIDAR DSM and DTM, normalized DSM, aerial image and greenness image), vegetation detection (region-based segmentation and knowledge-based classification), and tree modeling (watershed segmentation and tree parameterization) leading to the 3D tree models
2.1 Data Preprocessing

The data preprocessing was composed of three major procedures: (1) space registration, (2) derivation of the above-ground surface, and (3) generation of a spectral index. The LIDAR data was used to generate a Digital Terrain Model (DTM) and a Digital Surface Model (DSM) in grid form. In the space registration, the aerial images were orthorectified using the DSM and ground control points. The orthorectified images were essentially co-registered with the LIDAR data. We subtracted the DTM from the DSM to generate the Normalized DSM (NDSM). The NDSM represents the above-ground surface that was used to separate the ground and above-ground objects. The formula of the NDSM is shown in Eq. (1). Fig. 2 demonstrates a sample result of the NDSM. For the multispectral aerial image, we used the red and green bands to calculate the greenness index [10]. Eq. (2) shows the formula of the greenness index. The greenness index was mostly used to identify the vegetation areas. Fig. 3 illustrates a sample result of the greenness image. For cases where the near-infrared band was available, the NDVI (Normalized Difference Vegetation Index) was utilized, as it is a better index for assessing the greenness level. The calculation of the NDVI is shown in Eq. (3).

NDSM = DSM − DTM .   (1)

Greenness = ( G − R ) / ( G + R ) .   (2)

NDVI = ( IR − R ) / ( IR + R ) .   (3)

Fig. 2. LIDAR data preprocessing: (a) Digital Terrain Model, (b) Digital Surface Model, (c) Normalized Digital Surface Model

Fig. 3. Aerial image preprocessing: (a) aerial image, (b) greenness image
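Equations (1)–(3) amount to simple per-pixel band arithmetic. A minimal sketch, assuming the DSM, DTM and image bands are already co-registered NumPy arrays of identical shape (data loading and registration are omitted; the eps guard against division by zero is our addition, not part of the equations):

import numpy as np

def ndsm(dsm, dtm):
    """Normalized DSM = above-ground height, Eq. (1)."""
    return dsm - dtm

def greenness(green, red, eps=1e-6):
    """Greenness index from the green and red bands, Eq. (2)."""
    return (green - red) / (green + red + eps)

def ndvi(nir, red, eps=1e-6):
    """NDVI from the near-infrared and red bands, Eq. (3)."""
    return (nir - red) / (nir + red + eps)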
2.2 Vegetation Detection The objective of the vegetation detection was to extract the vegetation areas, in which the non-vegetation areas would not interfere in the tree modeling. We integrated the region-based segmentation and knowledge-based classification during this stage. The elevation from the LIDAR data and radiometric features from the orthoimages were combined in the segmentation. Thus, pixels with similar height and spectral attributes were merged into a region. After the segmentation, each separated region was a candidate object for classification. Considering the height and spectral characteristics, the vegetation areas were extracted by a fuzzy logic classification. We first used a multiple data segmentation technique to perform the region-based segmentation. It could identify objects with correlated characteristics in terms of reflectance and height. In this step, we fused the NDSM and the greenness level for segmentation. This method identified geographical features using scale homogeneity parameters [11], which were obtained from the spectral reflectance in the RGB and elevation value in the NDSM. The homogeneity was described by a mutually exclusive interaction between the attribute and shape. Eq. (4) shows the formula of the homogeneity index. The composed homogeneity index was based on the attribute and shape factor, where it considered the attribute and shape information simultaneously. The formulas of the attribute and shape factor are shown in Eq. (5) and Eq. (6), respectively. The weights of the attribute and shape factor should be set properly. The attribute factor used the standard deviation of the region as a segmentation criterion. The shape factor selected the smoothness and compactness of the region boundary for weighting. The formulas for the calculations of smoothness and compactness are shown in Eq. (7) and Eq. (8), respectively. After the segmentation, we employed the object-oriented classification in the vegetation detection. The classification was based on a fuzzy logic classification system, where the membership functions employed thresholds and weights for each data layer. The above ground and high greenness objects were classified as vegetation. Fig. 4 shows an example of the segmentation and classification.
38
L.-C. Chen, T.-A. Teo, and T.-W. Chiang
H = w * hattribute + (1 - w) * hshape .  (4)

where H: homogeneity index, hattribute: attribute factor, hshape: shape factor, and w: weighting between the attribute and shape factors.

hattribute = ∑ w_c * σc .  (5)

where hattribute: attribute factor, w_c: weighting among layers, and σc: standard deviation of the pixel attribute in a region.

hshape = ws * hsmooth + ( 1 - ws ) * hcompact .  (6)

hsmooth = L / B .  (7)

hcompact = L / N^0.5 .  (8)

where hshape: shape factor, ws: weighting between smoothness and compactness, hsmooth: smoothness of the region, hcompact: compactness of the region, L: border length of the region, B: shortest border length of the region, and N: area of the region.

2.3 Tree Modeling After the vegetation detection, we focused on the vegetation areas for the extraction of tree models. The primary process included individual tree segmentation and tree parameterization. We first applied the watershed segmentation method [12] on the DSM in the vegetation regions. To avoid the errors of the DTM, we performed the segmentation on the DSM rather than on the NDSM. As the segmentation could detect the changes of individual tree height, it was able to extract the boundary of each individual tree. We assumed that the maximum point within a boundary was the tree position. The local maximum method was applied to extract the tree location and tree height. In this study, a least squares circle fitting was applied to extract the tree crown radius. The tree parameters for each individual tree included the tree crown radius, position and height. The 3D tree models were represented by using a tree model database. Once the tree type was selected, the 3D tree model could be generated from the tree parameters [13]. In this study, the tree types were selected manually. Fig. 5 demonstrates a sample procedure of tree modeling.
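Returning to the segmentation criterion, the homogeneity index of Eqs. (4)–(8) can be evaluated for a candidate region as in the sketch below. It is a minimal illustration assuming per-layer pixel attributes stored as NumPy arrays; the layer weights and region border lengths are supplied by the caller, and the function names are ours rather than eCognition's.

```python
import numpy as np

def homogeneity(layers, layer_weights, border_len, shortest_border_len,
                area, w=0.8, ws=0.5):
    """Composite homogeneity index H of Eqs. (4)-(8).

    layers:              list of 1-D arrays, pixel attributes of one region per layer
                         (e.g., R, G, B reflectance and NDSM height)
    layer_weights:       weights w_c among the layers, Eq. (5)
    border_len:          border length L of the region
    shortest_border_len: shortest possible border length B of the region
    area:                number of pixels N in the region
    w, ws:               attribute/shape and smoothness/compactness weights
    """
    # Eq. (5): attribute factor from the per-layer standard deviations
    h_attr = sum(wc * np.std(vals) for wc, vals in zip(layer_weights, layers))

    # Eqs. (7) and (8): smoothness and compactness of the region boundary
    h_smooth = border_len / shortest_border_len
    h_compact = border_len / np.sqrt(area)

    # Eq. (6): shape factor
    h_shape = ws * h_smooth + (1.0 - ws) * h_compact

    # Eq. (4): composite homogeneity index
    return w * h_attr + (1.0 - w) * h_shape
```

In the experiments reported later, the attribute and shape factors are weighted 0.8 and 0.2, which corresponds to w = 0.8 in this sketch.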
Fig. 4. Vegetation detection (a) segmentation results (b) classified results
Fig. 5. Tree modeling (a) watershed segmentation results (b) parameterized tree crowns (c) 3D tree models
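The tree parameterization step relies on a least-squares circle fit to the segmented crown boundary. The sketch below uses an algebraic (Kåsa-style) formulation; it is our own minimal illustration, not code from the paper, and assumes the boundary points of one watershed segment and the above-ground heights inside it are provided.

```python
import numpy as np

def fit_crown_circle(xs, ys):
    """Least-squares circle fit: returns (center_x, center_y, radius).

    Solves x^2 + y^2 + a*x + b*y + c = 0 in the least-squares sense,
    which is linear in (a, b, c).
    """
    xs = np.asarray(xs, dtype=float)
    ys = np.asarray(ys, dtype=float)
    A = np.column_stack([xs, ys, np.ones_like(xs)])
    rhs = -(xs**2 + ys**2)
    (a, b, c), *_ = np.linalg.lstsq(A, rhs, rcond=None)
    cx, cy = -a / 2.0, -b / 2.0
    radius = np.sqrt(cx**2 + cy**2 - c)
    return cx, cy, radius

def parameterize_tree(boundary_xy, ndsm_values):
    """Crown radius from the boundary, tree height from the local maximum.

    boundary_xy: (n, 2) array of boundary coordinates of one segment
    ndsm_values: 1-D array of above-ground heights inside the segment
    """
    cx, cy, radius = fit_crown_circle(boundary_xy[:, 0], boundary_xy[:, 1])
    tree_height = float(np.max(ndsm_values))   # local maximum method
    return cx, cy, radius, tree_height
```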
3 Experiment Results Two test sites were selected for validation: an orchard area and a forest area. The orchard area was located in Taiwan's Tai-Chung city, and the LIDAR data was obtained via the Optech ALTM2033. The average density of the LIDAR data was roughly 1.7 points per square meter. Color aerial photos with a scale of 1:5,000 were used in test area 1; they were scanned at 20 μm per pixel. Thus, the ground resolution of the digital images was around 0.1 m. The second test area was located in Espoolahti, Finland. The Finland data was released by the European Spatial Data Research (EuroSDR) and the International Society for Photogrammetry and Remote Sensing (ISPRS) as a sample test site. The LIDAR data was also obtained by the Optech ALTM2033, and the average density of the LIDAR data was about 2 points per square meter. The Vexcel Ultracam-D image with a 1/6,000 scale was selected for the forest area. The ground resolution of the digital images was about 0.2 m. The test data sets are shown in Fig. 6. Since the image included the near-infrared band, the NDVI was employed in the Espoolahti case. In the region-based segmentation, we first set the weights of the image layers for the homogeneity segmentation using eCognition [11]. The weighting of the LIDAR part and the aerial photo was 2:1. Considering that the shape of the forest was irregular, the attribute factor was more important than the shape factor. Hence, the attribute and shape factors were 0.8 and 0.2, respectively. After segmentation, we performed the object-oriented classification to determine the tree regions. We used the extracted tree regions in the individual tree extraction. Assuming that the highest point in the boundary was a treetop, one segmented boundary was selected to represent one tree crown. Afterwards, we used the least squares circle fitting to extract the tree crown radius. Finally, the 3D tree models were generated from the tree parameters. We used stereoscopic measurements to determine the tree heights and tree crowns as references for validation. The experiments included three different aspects in the validation. The first evaluated the detection rate for individual trees. The second checked the correctness of the tree crown. The third assessed the accuracy of the tree heights.
Fig. 6. Test area (a) aerial image for Tai-Chung area (b) NDSM for Tai-Chung area (c) aerial image for Espoolahti area (d) NDSM for Espoolahti area
3.1 Tai-Chung Case In the Tai-Chung case, we identified objects with correlated characteristics in terms of reflectance and height. After the vegetation detection, we used the tree areas to perform the watershed segmentation on the DSM. Fig. 7a shows the classification results. Fig. 7b reveals the block results, which were derived from the watershed segmentation. The background image is the corresponding DSM. Subsequently, we used the least squares circle fitting to extract the tree crown radius. The results are shown in Fig. 7c, and the generated 3D tree models are shown in Fig. 7d. During the evaluation, we checked each segmentation boundary to examine the reliability of the tree crown determination. For the validated patches, we compared the tree heights for accuracy assessment. In this case, we had 197 trees in the reference data. The number of detected trees was 210. The correctness was thus 95%. The commission and omission errors stood at 10% and 4%, respectively. Fig. 7c shows the locations of those trees. We then used the 197 matched boundaries to determine the local maximum height for the analysis of the tree crown accuracy. Comparing the automatic results and the manually measured tree crowns, the accuracy of the tree crown reached 92%. The commission and omission errors were 29% and 7%, respectively. The RMSE of the tree height was 0.62 m. The results demonstrated the great potential of employing multi-sensor data for 3D tree modeling.
Fig. 7. Results of Tai-Chung area (a) classified results (b) individual tree segmentation results (c) parameterized tree crowns (d) 3D tree models
3.2 Espoolahti Case The Espoolahti case was a forest area with complex trees. The vegetation detection results are shown in Fig. 8a. The result indicates that most of the areas in the Espoolahti case were tree areas. Fig. 8b shows the results of the individual tree segmentation from the DSM. We then selected a small area, as shown in Fig. 8b, to perform the tree crown parameterization. The tree crowns are shown in Fig. 8c, and the generated 3D tree models are shown in Fig. 8d.
Fig. 8. Results of Espoolahti area (a) classified results (b) individual tree segmentation results (c) parameterized tree crowns (d) 3D tree models
In this case, we had 105 trees in the reference data. The number of detected trees was 97. The number of matched trees was 78, where the correctness was 80%. The commission and omission errors stood at 19% and 25%, respectively. Afterwards, we used the 78 matched boundaries to determine the local maximum height in assessing the accuracy. Comparing the automatic results and the manually measured tree crowns, the accuracy of the tree crown reached 78%. The commission and omission errors were 37% and 21%, respectively. The RMSE of the tree height was 1.12 m. As this location was more complicated than the orchard area, the accuracy was lower than in the previous case.

Table 1. Summary of tree modeling accuracy

Evaluation Item                                        Tai-Chung     Espoolahti
Tree detection rate     Accuracy (%)                   95.4          80.4
                        Commission Error (%)           10.5          25.7
                        Omission Error (%)             4.6           19.6
Tree crown evaluation   Accuracy (%)                   92.8          78.4
                        Commission Error (%)           29.8          37.7
                        Omission Error (%)             7.1           21.6
Tree height evaluation  Root-Mean-Square Error (m)     0.62          1.12
                        Error Range (m)                -2.0 ~ 1.8    -1.9 ~ 2.0
3.3 Summary The accuracy and reliability of our study are shown in Table 1. The experimental results are summarized as follows. (1) The coarse-to-fine strategy detects the vegetation areas in the first stage. The detected areas are then pinpointed to extract the individual trees. This strategy may reduce the amount of error involved in the individual tree extraction. The detection rate reaches approximately 80%.
(2) In the individual tree segmentation, we select a morphology filter to extract the tree boundaries from the Digital Surface Model rather than from optical image data. The reliability of the tree crowns may reach 80%. (3) Comparing the extracted tree heights with the manually measured tree heights, the accuracy of the tree height is about 1 m. (4) The results show the potential of 3D tree model generation by fusing LIDAR and image data.
4 Conclusions This investigation presents a scheme of 3D forest modeling by the fusion of spectral and height information. The results indicate that the correctness of the tree extraction reaches 80%, and the accuracy of the extracted tree heights is about 1 m. Considering the errors in the photogrammetric measurements for the reference data set, the accuracy could be underestimated. The results demonstrate that the proposed scheme may be used to estimate the tree height on an individual-tree basis. Acknowledgments. This investigation was partially supported by the National Science Council of Taiwan under Project No. NSC94-2211-E-008-027. The authors would like to thank the Council of Agriculture of Taiwan, the European Spatial Data Research (EuroSDR) and the International Society for Photogrammetry and Remote Sensing (ISPRS) for providing the test data sets. The authors also wish to thank the anonymous reviewer for helpful comments and suggestions.
References 1. Lefsky, M.A., Cohen, W.B., Spies, T.A.: An Evaluation of Alternate Remote Sensing Products for Forest Inventory, Monitoring, and Mapping of Douglas-Fir Forests in Western Oregon. Canadian Journal of Forest Research. 31 (2001) 78–87. 2. Gong, P., Mei, X., Biging, G.S., Zhang, Z.: Improvement of an Oak Canopy Model Extracted from Digital Photogrammetry. Photogrammetric Engineering and Remote Sensing. 63 (2002) 919-924. 3. Wang, L., Gong, P., Biging, G.S.: Individual Tree-Crown Delineation and Treetop Detection in High-Spatial-Resolution Aerial Imagery. Photogrammetric Engineering and Remote Sensing. 70 (2004) 351-357. 4. Sheng, Y., Gong, P., Biging, G.S.: Model-Based Conifer Canopy Surface Reconstruction from Photographic Imagery: Overcoming the Occlusion, Foreshortening, and Edge Effects. Photogrammetric Engineering and Remote Sensing. 69 (2003) 249-258. 5. Popescu, S.C., Wynne, R.H., Nelson, R.F.: Measuring Individual Tree Crown Diameter with Lidar and Assessing Its Influence on Estimating Forest Volume and Biomass. Canadian Journal of Remote Sensing. 29 (2003) 564–577. 6. Yu, X., Hyyppä, J., Kaartinen, H., Maltamo, M.: Automatic Detection of Harvested Trees and Determination of Forest Growth using Airborne Laser Scanning. Remote Sensing of Environment. 90 (2004) 451-462.
7. Suárez, J.C., Ontiveros, C., Smith, S., Snape, S.: Use of Airborne Lidar and Aerial Photography in the Estimation of Individual Tree Height in Forestry. Computers & Geosciences. 31 (2005) 253-262. 8. Persson, A., Holmgren, J., Soderman, U.: Detecting and Measuring Individual Trees using an Airborne Laser Scanner. Photogrammetric Engineering and Remote Sensing. 68 (2002) 925-932. 9. Popescu, S.C., Wynne, R.H.: Seeing the Trees in the Forest: Using LIDAR and Multispectral Data Fusion with Local Filtering and Variable Window Size for Estimating Tree Height. Photogrammetric Engineering and Remote Sensing. 70 (2004) 589-604. 10. Niederöst, M.: Automated Update of Building Information in Maps using Medium-Scale Imagery (1:15,000). In: Baltsavias, E., Gruen, A., van Gool, L. (eds.): Automatic Extraction of Man-Made Objects from Aerial and Space Images. 3 (2001) 161-170. 11. Definiens: eCognition User's Guide. (2004) 79-81. 12. Vincent, L., Soille, P.: Watersheds in digital spaces: An efficient algorithm based on immersion simulations. IEEE Transactions on Pattern Analysis and Machine Intelligence. 13 (1991) 583-598. 13. Dunbar, M.D., Moskal, L.M., Jakubauskas, M.E.: 3D Visualization for the Analysis of Forest Cover Change. Geocarto International. 19 (2004) 103-112.
LOD Generation for 3D Polyhedral Building Model Jiann-Yeou Rau1, Liang-Chien Chen2, Fuan Tsai3, Kuo-Hsin Hsiao4, and Wei-Chen Hsu4 1
Specialist, 2 Professor, 3Assistant Professor, Center for Space and Remote Sensing Research, National Central University, Jhong-Li, Taiwan {jyrau, lcchen, ftsai}@csrsr.ncu.edu.tw 4 Researcher, Energy and Environment Lab., Industrial Technology Research Institute, Chu-Tung, Taiwan {HKS, ianhsu}@itri.org.tw
Abstract. This paper proposes an algorithm for the automatic generation of Levels-of-Detail (LODs) for 3D polyhedral building models. In this study a group of connected polyhedrons is considered as "one building", after which the generalization is applied to each building consecutively. The most detailed building model used is the polyhedral building model, which allows for an elaborate roof structure, vertical walls and a polygonal ground plan. In this work the term "Pseudo-Continuous LODs" is introduced. The maximum distinguishable "feature resolution" can be estimated from the viewer distance to a building and is used to simplify the building structure by polyhedron merging and wall collapsing with regularization. Experimental results demonstrate that the number of triangles can be reduced as a function of the feature resolution logarithm. Some case studies are presented to illustrate the capability and feasibility of the proposed method, including both regular and irregular building shapes. Keywords: Levels-of-Detail, Generalization, 3D Building Model.
1 Introduction The creation of a digital 3D city model is a generalization procedure from a complex world to digital geometric data that can be stored in a computer. Important spatial features such as roads, buildings, bridges, rivers, lakes, trees, etc. are digitized as two- or three-dimensional objects. Among them, the most important one is the 3D building model. From the application point of view, the more detail provided by a building model, the greater the amount of geometrical information available for spatial analysis and realistic visualization. However, since the computer has limited resources for computation, storage and memory, the geometry of building models has to be generalized to reduce their complexity and increase their efficiency of demonstration or analysis. This means that the generation of different LODs of 3D building models for real-time visualization applications must be a compromise between structural detail and browser efficiency.
In computer graphics, the efficiency of 3D browsing is highly dependent on the number of triangles and textures to be rendered. Many algorithms regarding terrain simplification have already been developed in the field of computer vision and computer graphics. For example, Hoppe [1] and Garland & Heckbert [2] introduced an edge collapse transformation to simplify surface geometry, resulting in continuous LODs that can be applied for progressive meshing applications. In the case of 3D building model generalization, Sester [3] proposed a least squares adjustment method to simplify building ground plans. Kada [4] adopted a similar concept and applied it to 3D polyhedral building models. Based on the scale-spaces theory, Mayer [5] has suggested using a sequence of morphological operations to generalize 3D building models. However, the method is suitable for orthogonal building models only. In the case of non-rectangular buildings, they have to be squared in advance before generalization. In this paper, the concept of "Pseudo-Continuous LODs" (PCLODs) is utilized for the generalization of 3D polyhedral building models. The generation of continuous LODs for a 3D city model is impractical for the following two reasons. The first reason is that the buildings are mostly separated by some distance, so their geometry is different from a digital terrain model. The second reason is that buildings have vertical walls, and the topology has to be maintained during geometric simplification. Thus, in this work we treat a group of connected or contacted polyhedrons as "one building model" and the generalization is applied to each building model respectively. The idea for PCLODs comes from the digital camera. The computer screen corresponds to the camera's focal plane with a fixed CCD sensor size, frame area and focal length. Computer rendering of a virtual landscape is similar to the photo imaging process. A building's structural details are easy to recognize when the viewer distance to the target is short, which is similar to rendering on the computer screen. On the contrary, when the viewer distance to the target gets longer, the smaller structures of buildings become indistinguishable and can thus be ignored during computer rendering. Here we focus on the generalization of 3D polyhedral building models. The most detailed "3D polyhedral building model" is generated semi-automatically based on the SPLIT-MERGE-SHAPE (SMS) algorithm [6]. The SMS algorithm utilizes 3D building structure lines that have been measured from aerial stereo-pairs. The 3D structure lines can be either derived from semi-automatic stereo-measurements [8] or extracted from high-resolution satellite images and/or aerial imagery [9]. Buildings with gable roofs, flat roofs, and regular or irregular ground plans can be described. Two neighboring buildings will not overlap due to the SPLIT process provided by the SMS algorithm. In a 3D visualization system, the viewer distance can be used to estimate the maximum distinguishable "feature resolution". The feature resolution is used to detect small 3D features in one building model. Since the distance variation in dynamic 3D browsing is continuous, the corresponding PCLOD 3D building models can thus be generated immediately. However, in order to reduce the computational load during generalization, we can adopt out-of-core processing by generating all the LODs of a building model before 3D visualization.
Some case studies will be presented to illustrate the capability and feasibility of the proposed method for both regular and irregular type of buildings.
2 Methodology The proposed scheme for generating PCLODs for a 3D building model can be divided into two parts, i.e. polyhedron merging and wall collapsing with regularization. Fig. 1 illustrates a flowchart of the proposed approach, in which we iteratively simplify the original detailed building model or the generalized building model by enlarging the value of the feature resolution. The result is a sequence of different LODs of the 3D building model, from fine to coarse. The method can be used for out-of-core processing or real-time visualization, depending on the power of the computer, the number of building models to be generalized, and the required rendering frame rate. In the following sections, the definition of the feature resolution and the two major procedures are described.

2.1 Feature Resolution In the proposed generalization, only one parameter is adopted for geometry simplification, i.e. the feature resolution (R). During 3D browsing, we can calculate the distance between the viewer and the building. The center of a group of polyhedrons is used in the distance calculation to maintain consistency within a building. According to the perspective projection geometry, we can simulate the computer screen as a CCD frame or a digital camera. Using the viewer distance and the camera's focal length, we can estimate the scale between the object space and the image plane. The feature resolution can thus be obtained using the scale and the CCD cell size. It is also similar to the ground sampling distance (GSD) of one pixel during the photo imaging process. In 3D computer graphics, ground features with a size smaller than the feature resolution are indistinguishable and may be eliminated in order to simplify the geometry and increase the rendering efficiency. In this study, the feature resolution is given, and insignificant geometrical structures are detected and eliminated.

Fig. 1. The proposed generalization scheme

2.2 Polyhedron Merging The polyhedron merging process includes the following three procedures, i.e. (1) the flattening of sloped roofs, (2) the merging of two connected polyhedrons with a small
height difference, and (3) the elimination of small polyhedrons contained within a bigger polyhedron. The first two steps are performed only on the building's roof structure. Two visually important features are considered, i.e. the length of the sloped roof-edge and the height difference between two neighboring flat roofs. The flattening of sloped roofs is processed before the merging of two neighboring flat roofs. There are many kinds of roof types, gable, hipped, tent-shaped, hexagonal, pyramid-shaped roofs and so on, that have to be considered as visually important features. In the polyhedral building model, these roof structures are constructed by sloped triangles with height variations, which are different from those of horizontal rooftops. In cases where the length of the edge of a roof-triangle is smaller than the feature resolution, the sloped roof-triangle will be flattened. This means that its roof height is replaced by the average of the original ones. For example, Fig. 2(A) shows a building with two gable roofs and two flat roofs. Utilizing a feature resolution of 3 meters, these two gable roofs will be flattened first, as shown in Fig. 2(B). The second step is to merge two neighboring flat roofs with an insignificant height difference. If their height difference is smaller than the feature resolution, the neighboring polyhedrons will be merged into one. As illustrated in Fig. 2(C), there are five polyhedrons that are merged into one polyhedron. The final roof height will be close to the roof with the larger area, as obtained by the weighted average of the two roof heights via the horizontal projection area. In order to compare the height difference between two polyhedrons, the topology of all the polyhedrons has to be constructed first. The topology of a polyhedron provides information regarding its neighboring polyhedrons and its corresponding walls. The walls can be categorized as either independent walls or shared walls. A shared wall means there is a corresponding neighboring wall, but this is not true for independent walls. In an island type of building, one small polyhedron will be contained within a bigger polyhedron. The third step is thus designed to eliminate the interior small polyhedron. If the lengths of all the walls of the interior polyhedron and the height difference between these two polyhedrons are all smaller than the feature resolution, the interior one will be directly removed.

Fig. 2. Example of polyhedron merging

2.3 Wall Collapsing with Regularization Since most buildings are composed of at least four walls, i.e. a tetrahedron, a building with less than six walls is not considered in the proposed generalization scheme. In order to preserve the principal structure of a building during generalization, the longest walls have to be detected and used for shape
regularization. Using the independent walls as described in the previous section, we can construct the building's boundaries. The principal structure analysis is applied on the boundary only. Adopting the concept of visual importance, the longest walls are considered as the principal structure of a building. Thus, all independent walls are first sorted according to length, so that a threshold separating them into major and minor structures can be determined. The threshold is determined by the average length of the fourth and the fifth longest walls. Any wall with a horizontal length longer than the threshold is assigned to the principal structure of the building. In the simplification process, the principal structure will be maintained, in order to preserve the visual importance of a building between two consecutive LODs. The purpose of the above polyhedron merging process is mainly to generalize the rooftop structure. The result maintains a detailed building ground plan, including intrusions or pillars located at building corners. In 1996, Hoppe [1] introduced an edge collapse transformation method to eliminate two neighboring triangles, resulting in the merging of two vertices into one. This operation cannot be directly applied to simplify a 3D polyhedral building model. For example, Fig. 3(A) illustrates three connected vertical walls where each wall is composed of two triangles. If the roof-top edge on the middle wall is contracted by merging two corners into one, the wall becomes sloped and destroys the topology as a vertical wall, as shown in Fig. 3(B). In this paper, we adopt the same concept by contracting the whole wall, not only one edge. The effect is illustrated in Fig. 3(C)-(D) using the same example. The purpose of wall collapsing is to simplify the geometric building structure by detecting and eliminating insignificant features.
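The two scalar quantities that drive the generalization, the feature resolution of Section 2.1 and the principal-structure threshold just described, can be written as in the sketch below. The function and variable names are our own, and the viewing parameters are assumptions for illustration.

```python
def feature_resolution(viewer_distance, focal_length, ccd_cell_size):
    """R of Section 2.1: the ground sampling distance of one screen pixel."""
    scale = viewer_distance / focal_length   # object space : image plane
    return scale * ccd_cell_size

def principal_wall_threshold(wall_lengths):
    """Threshold separating major from minor walls.

    Independent walls are sorted by horizontal length; the threshold is the
    average of the fourth and fifth longest walls, and any wall longer than
    it belongs to the principal structure.
    """
    ranked = sorted(wall_lengths, reverse=True)
    if len(ranked) < 5:      # buildings with fewer than six walls are skipped anyway
        return None
    return 0.5 * (ranked[3] + ranked[4])

def is_principal(wall_length, threshold):
    return threshold is not None and wall_length > threshold
```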
Fig. 3. Wall collapse operation for 3D building model
Before the wall collapsing procedure, we need to detect insignificant geometric structures by searching for all short walls in one polyhedron. The shortest walls are eliminated one by one up to a certain threshold length, which is the same as in the principal structure analysis. However, in order to retain the main structure of the building, additional co-linear processing is necessary. During wall collapsing, independent walls and shared walls are treated separately.
A sequence of shared walls neighboring the same polyhedron has to be processed at the same time. A fixed point is first determined at the center of an insignificant wall, as shown in Fig. 3(D), denoted by the symbol "~". The endpoint of the previous wall and the starting point of the following wall are merged at the fixed point. In cases where the previous wall or the following wall is
related to a different polyhedron, the fixed point is determined by the junction point of the two walls which correspond to the two neighboring polyhedrons. This avoids a gap that would appear between them if a center location were used as the fixed point. For example, Fig. 4 illustrates three polyhedrons projected on the horizontal plane. In Fig. 4(A), the shared walls are denoted by red lines, in which E1 is the detected insignificant wall to be eliminated. In Fig. 4(B), if we directly contract E1 at its center, i.e. T1, a gap will occur between the left polyhedron and the right two polyhedrons. So, during wall collapsing we choose the junction point T2, as indicated in Fig. 4(C), as the fixed point to avoid the gap effect.
Fig. 4. Prevention of gap effect during wall collapsing
For independent walls, we have to maintain the principal structure of a building. Thus, the vertices of a wall that belong to the principal structure cannot be moved during the wall collapsing process. This means that the fixed point is set at the junction of the insignificant wall and the principal wall. Since independent walls construct the building boundary, the visual effect will be better, and the number of walls can be reduced further by applying co-linear processing along the principal structure. A piping technique is utilized to accomplish this procedure. A wall that belongs to the principal structure is chosen as the pipe axis. The radius of the pipe is the same as the feature resolution. Any wall contained in the pipe is projected onto the pipe axis, except for wall vertices that are connected to a shared wall, again to avoid the gap effect. For example, Fig. 5 illustrates two polyhedrons projected on the horizontal plane. The red lines denote the principal structure. In Fig. 5(A), E1~E3 are three detected insignificant walls to be eliminated. Since T1 is a termination point related to the principal structure and may not be moved, the fixed point during the collapsing of E1 is set at T1. The elimination of E2 is handled in the same way, with T2 set as the fixed point. However, for the collapsing of E3, the fixed point is set at the center of E3, i.e. T3. Fig. 5(B) demonstrates the wall collapsing result. In the following regularization process, we choose E0 as the pipe's axis. The pipe is denoted by two green dashed lines, as shown in Fig. 5(B). Although E4 and E5 are both contained in the pipe, E4 is connected to a shared wall and cannot be moved. On the other hand, E5 has to be projected onto the pipe's axis to finish the co-linear process; as shown in Fig. 5(B)-(C), T3 has moved to T5.
Fig. 5. Wall collapsing with regularization
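The wall-collapsing pass itself can be sketched as follows. This is our own simplified rendering of the procedure, assuming a wall list with precomputed lengths and junction points; updating the polyhedron topology after each collapse, which the real procedure requires, is omitted.

```python
def midpoint(p, q):
    return ((p[0] + q[0]) / 2.0, (p[1] + q[1]) / 2.0)

def choose_fixed_point(wall):
    """Fixed-point rules of Section 2.3 (sketch).

    'shared_junction' / 'principal_junction' are precomputed junction points,
    or None when the wall does not touch another polyhedron / a principal wall.
    """
    if wall.get('shared_junction') is not None:
        return wall['shared_junction']        # avoid the gap effect
    if wall.get('principal_junction') is not None:
        return wall['principal_junction']     # principal vertices stay fixed
    return midpoint(wall['p_start'], wall['p_end'])

def collapse_short_walls(walls, threshold):
    """Collapse the shortest walls below the length threshold (sketch).

    Returns the list of (wall, fixed_point) collapse operations.
    """
    ops = []
    for wall in sorted(walls, key=lambda w: w['length']):
        if wall['length'] >= threshold:
            break                             # remaining walls are long enough
        ops.append((wall, choose_fixed_point(wall)))
    return ops
```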
3 Case Study The total number of triangles is an important index for evaluating the efficiency of computer graphics rendering. This is especially crucial when the viewer distance becomes longer, since the number of buildings within the view frustum increases significantly. In this section we estimate the reduction rate as the feature resolution (R) gets larger, which corresponds to the viewer moving farther away from the target. In the experiment, the total number of triangles is the sum of all roof [9] and wall triangles. Four complex building models are utilized as examples, as shown in Fig. 6. The feature resolution is changed from one meter to 35 meters, and only six of the 50 LODs are illustrated for demonstration. In the figure, the original elaborate 3D polyhedral building model is denoted by R equal to zero, while the other LODs are generated using a feature resolution of 2, 5, 15, 25, and 35 meters, respectively. The results demonstrate that the principal structure of the building is preserved for all LODs. In addition, two irregularly shaped buildings, i.e. Case 2 and Case 3, are tested in the experiments, and the feasibility of the proposed scheme is also demonstrated. Since the generalization result is dependent on the complexity of the building models, two consecutive LODs may have the same geometry, i.e. the same number of triangles. For example, the generalization results are exactly the same when R equals 15 and 25 in Case 4. In Fig. 6, the maximum and minimum numbers of total triangles are also given. From these, one may notice that the ratios of the maximum to the minimum range from 27 to 53. This means that the total number of triangles is reduced by more than 27 times compared to the original. The implication is that, during 3D visualization of a city model, more than 27 times the number of building models could be rendered in real time when the viewer is a long distance away.
         Case 1   Case 2   Case 3   Case 4
Max.     535      721      731      434
Min.     10       16       19       16
(R = 0, 2, 5, 15, 25, 35 meters)
Fig. 6. Four case studies of generated PCLODs of 3D building models, where R is the feature resolution in meters, and Max./Min. indicate the maximum and minimum number of triangles for all PCLODs of 3D building models
Fig. 7 illustrates the total number of triangles (y-axis) vs. the feature resolution (x-axis). A logarithmic function (Ln) is chosen to fit the relation between the feature resolution and the total number of triangles in the building model. The fitting is applied only in the range between the maximum and minimum total of triangles, for each case respectively. The results are indicated in Fig. 7 with a dashed line. The correlation coefficients after regression for all cases are above 0.8. This demonstrates that the triangle reduction rate is high, especially at smaller feature resolution. The fitted functions are:

Case 1: y = -131.85 Ln(x) + 531.36, R2 = 0.9054
Case 2: y = -170.95 Ln(x) + 557.00, R2 = 0.8736
Case 3: y = -172.14 Ln(x) + 533.14, R2 = 0.8532
Case 4: y = -131.39 Ln(x) + 322.08, R2 = 0.815
Fig. 7. Total number of triangles vs. feature resolution R for each case. The logarithmic function and its fitting results are indicated by a dashed line with a corresponding correlation coefficient (R2).
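The logarithmic fit reported in Fig. 7 is an ordinary least-squares fit of y = a·ln(x) + b, which can be reproduced as in the sketch below. This is a minimal NumPy illustration with made-up sample counts, not the paper's measurements.

```python
import numpy as np

def fit_log_curve(feature_resolution, triangle_counts):
    """Fit y = a*ln(x) + b and return (a, b, r_squared)."""
    x = np.log(np.asarray(feature_resolution, dtype=float))
    y = np.asarray(triangle_counts, dtype=float)
    a, b = np.polyfit(x, y, 1)              # linear regression in ln(x)
    y_hat = a * x + b
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - y.mean()) ** 2)
    return a, b, 1.0 - ss_res / ss_tot

# Illustrative usage with hypothetical counts for one building:
R = [1, 2, 5, 15, 25, 35]
counts = [530, 420, 320, 180, 120, 70]
print(fit_log_curve(R, counts))
```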
4 Conclusions This paper presents an automatic generalization approach for 3D polyhedral building models where only one parameter, i.e. the feature resolution, is used. Since a continuous generalization is not applicable to 3D building models, we propose to generate Pseudo-Continuous LODs by changing the feature resolution. The feature resolution can be estimated from the image scale and the virtual CCD size. The number of triangles for a complex building can be significantly reduced as a function of the feature resolution logarithm. As well, the principal structure of the building can be preserved, avoiding the popping effect produced when the LOD is changed. The proposed algorithm can be applied for the generalization of irregular shaped buildings. The experimental results demonstrate that the proposed algorithm is
effective in terms of reducing the number of triangles needed and also maintaining the principal structure of a building. A reduction rate of greater than 27 times can be achieved for complex building models. The result is applicable for 3D real-time visualization applications of a digital city model. The proposed algorithm can also be used for out-of-core processing or real-time visualization, depending on the power of the computer and the number of building models to be generalized. However, the method is not designed for view-dependent generalization or the aggregation of two non-connected buildings. The incorporation of aggregation into the generalization of adjacent buildings is necessary in future research to further reduce the amount of geometric data. As well, the automatic generation of multi-resolution facade textures from the corresponding building model LODs is necessary for photo-realistic visualization applications.
Acknowledgment This paper is partially supported by the National Science Council under project numbers NSC 89-2211-E-008-086, NSC 93-2211-E-008-026 and the NCU-ITRI Union Research and Developing Center under project number NCU-ITRI950304.
References 1. Hoppe, H., 1996, "Progressive meshes", Proceedings of ACM SIGGRAPH 96, pp. 325-334. 2. Garland, M., Heckbert, P., 1997, "Surface simplification using quadric error metrics". In: Proceedings of ACM SIGGRAPH 97, pp. 206-216, 1997. 3. Sester, M., 2000, "Generalization based on least squares adjustment", International Archives of Photogrammetry and Remote Sensing, Amsterdam, Netherlands, Vol. XXXIII, Part B4, pp. 931-938. 4. Kada, M., 2002, "Automatic generalisation of 3D building models". International Archives of the Photogrammetry, Remote Sensing and Spatial Information Sciences, Volume XXXIV, Part 4, pp. 243-248. 5. Mayer, H., 2005, "Scale-spaces for generalization of 3D buildings", International Journal of Geographical Information Science, Vol. 19, No. 8-9, Sep.-Oct. 2005, pp. 975-997. 6. Rau, J. Y., and Chen, L. C., 2003, "Robust reconstruction of building models from three-dimensional line segments", Photogrammetric Engineering and Remote Sensing, Vol. 69, No. 2, pp. 181-188. 7. Rau, J. Y., Chen, L. C., and Wang, G. H., 2004, "An Interactive Scheme for Building Modeling Using the Split-Merge-Shape Algorithm", International Archives of Photogrammetry and Remote Sensing, Vol. 35, No. B3, pp. 584-589. 8. Chen, L. C., Teo, T. A., Shao, Y. C., Lai, Y. C., and Rau, J. Y., 2004, "Fusion of LIDAR Data and Optical Imagery for Building Modeling", International Archives of Photogrammetry and Remote Sensing, Vol. 35, No. B4, pp. 732-737. 9. Ratcliff, J. W., 2006, "Efficient Polygon Triangulation", Available: http://www.flipcode.com/files/code/triangulate.cpp.
Octree Subdivision Using Coplanar Criterion for Hierarchical Point Simplification Pai-Feng Lee1, Chien-Hsing Chiang1, Juin-Ling Tseng2, Bin-Shyan Jong3, and Tsong-Wuu Lin4 1
Dept. of Electronic Engineering, Chung Yuan Christian University, Dept. of Management Information System, Chin Min Institute of Technology, 3 Dept. of Information & Computer Engineering, Chung Yuan Christian University, 4 Dept. of Computer & Information Science, Soochow University {allen, hikki, arthur}@cg.ice.cycu.edu.tw,
[email protected],
[email protected] 2
Abstract. This study presents a novel, rapid and effective point simplification algorithm based on point clouds, without using either normal or connectivity information. Sampled points are clustered based on shape variations using an octree data structure; the inner point distribution of a cluster is used to judge whether these points satisfy the coplanar characteristics. Accordingly, the relevant point from each coplanar cluster is chosen. The relevant points are reconstructed to a triangular mesh whose error rate remains within a certain tolerance level, significantly reducing the number of calculations needed for reconstruction. A hierarchical triangular mesh based on the octree data structure is also presented. This study presents hierarchical simplification and hierarchical rendering for the reconstructed model to suit user demand, and produces a uniform or feature-sensitive simplified model that facilitates rapid further mesh-based applications, especially level of detail.
1 Introduction Due to the continuing development of computer graphics technology, diversified virtual reality applications are being increasingly adopted. Exactly how three dimensional (3D) objects in the real world can be efficiently and vividly portrayed in virtual scenes has recently become a crucial issue in computer graphics. A triangular mesh is one of the most popular data structures for representing 3D models in applications. Numerous methods currently exist for constructing objects using surface reconstruction, and reconstructing a point cloud as a triangular mesh. The data for sampled points are generally obtained from a laser scanner. However, the extracted sampled points are frequently affected by shape variation, causing over-sampling in flat regions (Fig. 1). The number of triangles created rises as the number of points sampled from the surface of a 3D object rises, helping to reconstruct the correct model. However, subsequent graphics applications, such as morphing, animation, level of detail and compression, increase the computation costs. Appropriate relevant points should be chosen to retain object features and reduce storage and calculation costs.
Hence, reducing the number of triangles while retaining surface characteristics within a certain error value is worthy of research. The main research on object simplification can be divided into two fields, mesh-based simplification [1, 2, 3, 7, 9, 14] and point-based simplification [5, 6, 8, 10, 12, 13]. Mesh-based simplification requires connectivity relations to be obtained in advance. Hence, the sampled point data acquired by a scanner must be triangulated by a reconstruction algorithm before simplification. Surface reconstruction extracts sampled points from a 3D object, and reconstructs the triangular mesh of the original object within a certain tolerance level [4]. The mesh-based simplification then attempts to reduce the number of triangles in the triangular mesh while maintaining the object quality. Numerous good mesh-based simplification algorithms have been presented, including vertex decimation [14], edge contraction [1], triangle contraction [2], vertex clustering [7], vertex pair contraction [9] and feature extraction [3]. Traditional methods such as Quadric Error Metrics (QEM) [9] have decimation operations that are generally arranged in a priority queue according to an error matrix that quantifies the errors caused by decimation. Simplification is performed iteratively to reduce any smoothing of point pairs caused by the decimation operation. This greedy technique can obtain the simplified model with the minimum error relative to the original model. However, while these algorithms all achieve a good simplification effect in application, they need a triangular mesh and connectivity in advance of simplification. Restated, the algorithms are burdened with a large number of computations before simplification processing. Consequently, this process is prohibitively expensive.
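For context, the quadric error measure used by QEM-style simplification [9] can be sketched as follows: each vertex accumulates the squared distances to the planes of its incident triangles, and an edge contraction is scored by the summed quadrics. This is a generic illustration of the idea, not the authors' code.

```python
import numpy as np

def plane_quadric(a, b, c, d):
    """Quadric K_p = p p^T for the plane ax + by + cz + d = 0 (unit normal)."""
    p = np.array([a, b, c, d], dtype=float)
    return np.outer(p, p)

def vertex_error(Q, v):
    """Sum of squared distances from point v to the planes folded into Q."""
    vh = np.append(np.asarray(v, dtype=float), 1.0)   # homogeneous coordinates
    return float(vh @ Q @ vh)

def contraction_cost(Q1, Q2, v_target):
    """Cost of contracting an edge whose endpoints carry quadrics Q1 and Q2."""
    return vertex_error(Q1 + Q2, v_target)
```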
Fig. 1. Over-sampling in the flat region needlessly increases the number of calculations. The simplified model produces the same effect of solid representation.
Therefore, point cloud simplification is an attractive approach. Point-based simplification is applied before reconstruction. If suitable relevant points can be extracted from a point cloud that represent surface variation, then the number of calculations needed for reconstruction can be significantly reduced. Dey [13] presented the first point cloud simplification approach. Dey adopted local feature sizes to detect redundancy in the input point cloud and ensure relevant point densities, thereby exploiting a 3D Voronoi diagram for a densely distributed original point set in advance of simplification. However, this method also requires many computations. Boissonnat and Cazals [6] presented a coarse-to-fine point simplification algorithm that randomly calculates a point subset and builds a 3D Delaunay triangulation. Additional points are inserted iteratively based on their distance to the closest 3D facet until the simplified surface mesh conforms to the restricted error value. Allegre [12] presented a convex hull for all points that adopts a decimation scheme to merge adjacent points according to geometrical and topological constraints. These algorithms must adopt pre-processing to retain the original surface data before simplifying the point set, and therefore require many computations.
Pauly [10] applied the four mesh-based simplification techniques to point cloud simplification. A uniform incremental clustering method is computationally efficient, but leads to a high mean error. The hierarchical clustering method can reduce calculation and memory, but has a marginally better mean error value than the uniform incremental clustering method. The quadric error-based iterative simplification method obtains the best error rate, but has a major disadvantage in that its execution time is sensitive to the input point set size. The particle simulation method obtains a good error rate, but requires many calculations. Alexa [8] proposed to uniformly simplify the point set by estimating the distance from a point to the Moving Least Square (MLS) surface. Alexa also presented a re-sampling method to ensure the distribution of density. Moenning and Dodgson [5] presented an intrinsic coarse-to-fine point simplification algorithm that guarantees uniform or feature-sensitive distribution. They adopted the farthest point sampling and a fast marching algorithm to choose relevant points and set density threshold to ensure point set density. However, their method requires expanding the computational 3D Voronoi diagram, and consequently requires many computations and a large memory. This study presents a novel rapid and effective point simplification algorithm based on a point cloud without normal and connectivity information. This study initiates with a scattered sampled point set in 3D, and the final output is a triangular mesh model simplified according to restrictive criteria. The proposed method reduces the number of calculations between triangulation and establishing the connectivity relation, and includes three main steps, point simplification, reconstruction guarantee and hierarchical simplification. In the point simplification step, sample points are obtained using 3D acquisition devices that fully represent the object surface variation. This step investigates how to best choose the most appropriate number of relevant points from the sampled points to reduce the complexity of operation and obtain an acceptable simplified result. Sampled points are clustered, based on shape variation, by using the octree data structure, which is an inner point distribution of a cube, to judge whether these points correlate with the coplanar properties. The local coplanar method causes the simplified model to have a feature-sensitive property. The feature-sensitive distribution can achieve a small simplification-based error rate, but does not permit successful reconstruction. Consequently, considering the scattered relevant points in the levels of the hierarchical tree are dynamically adjusted. Moreover, the distribution density is increased by dummy vertices in the region with excessive difference between adjacent levels, helped by hierarchical tree information. The problem of undersampling is thus successfully solved, obtaining a good simplified, reconstructed model. Finally, a hierarchical triangular mesh suitable for multi-resolution is obtained after successfully reconstructing the simplified point set. This study presents a novel method for extracting the relevant points for a dense input point set, and adopts the reconstruction algorithm presented by Jong [4] to generate a simplified model. Experimental results confirm that a good simplified model, with the advantages given below, can be quickly obtained. 1. Connectivity relations do not need to be recorded. 
Hence, the reconstruction algorithm can be adopted, significantly reducing the computational cost. 2. The calculations for extracting the relevant points ensure that the simplified model has a good error rate. 3. Using an octree data structure maintains multi-resolution.
4. The hierarchical rendering and imaging can be interactively changed by automatically assigning local sampling constraints.
2 Algorithm Overview The following steps are crucial to simplifying the sampled point cloud and completely reconstructing the triangular mesh. 2.1 Choose the Coplanar Variable In this study, the point cloud is subdivided iteratively according to the space coordinates until each cluster meets its respective restricted criterion. The local neighbor of a point set for a cluster was identified by the formula presented by Pauly [10], based on the following equations:

ci = (1/|Ni|) ∑q∈Ni q , ci ∈ R3 .  (1)

Ci = (1/|Ni|) ∑q∈Ni (q - ci)(q - ci)t , Ci ∈ R3×3 .  (2)

f = λ0 / (λ0 + λ1 + λ2) , if λ0 ≤ λ1 ≤ λ2 .  (3)
where, given a cluster, ci denotes the gravity center, Ci denotes the covariance matrix, q denotes a point in the cluster, and |Ni| denotes the number of points. According to Eq. (2), the covariance matrix Ci is a symmetric positive semi-definite 3×3 matrix with three eigenvalues λ0, λ1 and λ2. These eigenvalues measure the variation of the points in the directions of their corresponding eigenvectors e0, e1, e2. Eigenvector e0 is the eigenvector of the minimum eigenvalue λ0, which denotes the normal vector of a cluster. A cluster conforms to coplanar characteristics if λ0

> btop ) & ( y + b < headlb ) } . (4)

where 2/3 σ ≤ a ≤ σ and 2/3 * 1.6 σ ≤ b ≤ 1.6 σ. bleft, bright, btop, and headlb denote the left, right, top, and lower boundary of a blob, respectively. σ is estimated by the upper horizontal projection of the body: σ = (1/2) * (1/(headlb - btop)) * ∑ Ph[j], where the sum runs from j = btop to headlb and Ph[j] denotes the horizontal projection.
4 Face Occlusion Detection
Based on the best matched elliptical area of the head, the face region is defined as in Fig. 7. Point C(x, y) is the center of the ellipse. The ellipse represents the head region; points P1, P2, P3, and P4 indicate the four corners of the face region. The y coordinate of P3 and P4 is obtained as y + b√(a2 - (x - a)2) / a. We define the ratio of the skin color area to the face region as in Equation (5) to determine whether the face is occluded or not. A face is determined as non-occluded if the skin area ratio ≥ 0.6. Otherwise, the face is occluded. The threshold 0.6 is chosen through extensive experiments.

Skin_ratio = (area of skin color in face region) / (face area) .  (5)
For skin color identification, both the normalized RGB color model (rgb) and the RGB color model are used. We adopt the normalized RGB range mentioned in [16] to discriminate the skin color: 0.36 ≤ r ≤ 0.465 and 0.28 ≤ g ≤ 0.363. The range described above contains black and very dark brown colors. To further exclude those dim colors, we define additional rules in the original RGB color model: (R + G + B)/3 < 28 and B > 28. The former constraint is to exclude the black pixels. The latter condition restricts the blue color range; it aims at removing the very dark brown color.
Fig. 7. The face region of the best matched elliptical area. Skin color region indicates the face region.
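The skin-ratio decision can be sketched as follows. This is our reading of the rules above: the normalized-rgb range selects candidate skin pixels, and pixels that are too dark (average intensity below 28, or blue component not above 28) are excluded. The function names and array layout are assumptions for illustration.

```python
import numpy as np

def skin_mask(rgb):
    """Boolean mask of skin-colored pixels for an H x W x 3 uint8 image."""
    rgb = rgb.astype(np.float64)
    R, G, B = rgb[..., 0], rgb[..., 1], rgb[..., 2]
    total = R + G + B + 1e-6
    r, g = R / total, G / total                      # normalized rgb

    in_rgb_range = (r >= 0.36) & (r <= 0.465) & (g >= 0.28) & (g <= 0.363)
    not_black = (R + G + B) / 3.0 >= 28              # drop black pixels
    not_dark_brown = B > 28                          # drop very dark brown
    return in_rgb_range & not_black & not_dark_brown

def is_face_occluded(face_region, threshold=0.6):
    """Eq. (5): occluded if the skin-area ratio falls below the threshold."""
    mask = skin_mask(face_region)
    skin_ratio = mask.sum() / mask.size
    return skin_ratio < threshold
```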
5 Experimental Results
The proposed facial occlusion detection system is implemented on a low-end PC (Pentium III 1.0 GHz CPU). A warning alarm will be issued once an occlusion case is detected. We used a DVR to capture video sequences in a bank ATM setting. For the sake of real-time performance, the image size of the video sequence is down-scaled by 1/4, resulting in a 160x120 image per frame. The developed detection system works at a speed of 5-20 frames per second.
Fig. 8. Five examples of face occlusion
Fig. 9. Six examples of face occlusion detection results
We have tested 173 video clips, including 130 occlusive cases and 43 non-occlusive cases. The occlusive-case video sequences were recorded from ten people (five males and five females), each with thirteen face occlusion scenarios. Each person walks either forward or backward to the ATM and pretends to perform a transaction or cash withdrawal. In addition, 33 people were tested for non-facial occlusion. Figure 9 presents the results of six facial occlusion detection scenarios. The red circle delineates the detected head area. The square windows show the face area and skin color area accordingly. Table 1 shows the detection performance. The proposed system achieves 95.35% accuracy for non-occlusive detection and 96.15% accuracy for occlusive detection; the overall performance is 95.95%. We also compared the detection performance with other methods [17] and [18],
Table 1. Face occlusion detection hit ratio

Face type       Sequence number   Hit   Correct ratio
Non-occlusive   43                41    95.35%
Occlusive       130               125   96.15%
Overall         173               166   95.95%

Table 2. Performance comparison of the proposed system with other methods

Method             Occlusive cases                Non-occlusive cases
Lin and Lee [17]   96.2% (257/267 frames)         97% (132/136 frames)
Lin and Fan [18]   100% (1/1 video clips)         100% (3/3 video clips)
Our Method         96.15% (125/130 video clips)   95.35% (41/43 video clips)
shown in Table 2. Lin and Lee compute the gray values of eyes and mouth area for justifying their appearance [17]. Their overall accuracy rate was 96.5% out of 403 test images. Lin and Fan used the skin color distribution ratio to determine if the face is occluded [18]. The accuracy rate was 100% (4/4 video clips). As we can see from Table 2, our system performance is comparative to the others. While, our simulation result is more objective with extensive amount of test sequences and more realistic environments.
6 Conclusion
We have implemented a real-time face occlusion detection system. This work presents an effective detection system for facial occlusion to assist security personnel in surveillance by providing both valuable information for further video indexing applications and important clues for investigating a crime. First, static and dynamic edges are combined to detect moving objects. Then a standard least squares straight line fitting method is used to merge motion blobs. Moving forward or backward is determined based on the aspect ratio of the bounding box. An ellipse model is fitted to the candidate face by maximizing the gradient magnitude around the perimeter. Finally, face occlusion detection is performed based on the ratio of skin color in the face, where the skin color is determined by simple thresholding in the R-G color space. The proposed detection system can achieve an overall accuracy of 96.43% at a speed of 5-20 frames per second. The proposed facial occlusion detection system can be applied for surveillance of ATMs with a complex background.
Acknowledgement This work is supported in part by the Ministry of Economic Affairs, Taiwan, under grant no. 95-EC-17-A-02-S1-032, and by the National Science Council, Taiwan, under grant no. NSC 94-2622-E-305-001-CC3.
References 1. Elena Stringa and Carlo S. Regazzoni. Real-time video-shot detection for scene surveillance applications. IEEE Transactions on Image Processing, 9(1):69–79, January 2000. 2. Ismail Haritaoglu, David Harwood, and Larry S. Davis. W4: real-time surveillance of people and their activities. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(8):809–830, August 2000. 3. Osama Masoud and Nikolaos P. Papanikolopoulos. A novel method for tracking and counting pedestrians in real-time using a single camera. IEEE Transactions on Vehicular Technology, 50(5):1667–1278, September 2001. 4. Changick Kim and Jenq-Neng Hwang. Fast and automatic video object segmentation and tracking for content-based applications. IEEE Transactions on Circuits and Systems for Video Technology, 12(2):122–129, February 2002. 5. Shie-Jue Lee, Chen-Sen Ouyang, and Shih-Huai Du. A neuro-fuzzy approach for segmentation of human objects in image sequences. IEEE Transactions on Systems, Man, and Cybernetics, Part B, 33(3):420–437, June 2003. 6. C.R. Wren, A. Azarbayejani, T. Darrell, and A. Pentland. Pfinder: real-time tracking of the human body. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 780–785, 1997. 7. C. Stauffer and W.E.L. Grimson. Learning patterns of activity using real-time tracking. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 747–757, 2000. 8. W. Zhao, R. Chellappa, P. Phillips, and A. Rosenfeld. Face recognition: A literature survey. ACM Computing Surveys, 35(4):399–458, December 2003. 9. R. Brunelli and T. Poggio. Face recognition: Features versus templates. IEEE Transactions on Pattern Analysis and Machine Intelligence, 15:1042–1052, 1993. 10. M. Fukumi, Y. Mitsukura, and N. Akamatsu. A design of face detection system by lip detection neural network and skin distinction neural network. In IEEE International Conference on Systems, Man and Cybernetics, volume 4, pages 2789–2793, October 2000. 11. T. Kurita, M. Pic, and T. Takahashi. Recognition and detection of occluded faces by a neural network classifier with recursive data reconstruction. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance, pages 53–58, 2003. 12. I.B. Ciocoiu. Occluded face recognition using parts-based representation methods. In Proceedings of the 2005 European Conference on Circuit Theory and Design, volume 1, pages 315–318, 2005. 13. R. C. Gonzalez and R. E. Woods. Digital Image Processing, 3rd Edition. Addison-Wesley, 2003. 14. Anton and Rorres. Elementary Linear Algebra, Applications Version, 8th Edition. Wiley, 2002.
15. Stan Birchfield. Elliptical head tracking using intensity gradients and color histogram. In IEEE Conference on Computer Vision and Pattern Recognition, pages 232–237, June 1998. 16. Yanjiang Wang and Baozong Yuan. A novel approach for human face detection from color images under complex background. The Journal of the Pattern Recognition Society, 34:1983–1992, 2001. 17. Hsin-Chien Lin. Detection of faces with covers in a vision-based security. Master’s thesis, National Chiao-Tung University, Taiwan, June 2003. 18. Huang-Shan Lin. Detection of facial occlusions by skin-color based and localminimum based feature extractor. Master’s thesis, National Central University, Taiwan, June 2003.
Automatic Pose-Normalized 3D Face Modeling and Recognition Systems Sunjin Yu1, Kwontaeg Choi2, and Sangyoun Lee1 Biometrics Engineering Research Center, Dept. of Electrical and Electronics Engineering, 2 Dept. of Computer Science, Yonsei University 134 Shinchon-dong, Seodaemun-gu, Seoul 120-749, South Korea {biometrics, choikt, syleee}@yonsei.ac.kr 1
Abstract. Pose-variation factors present a big problem in 2D face recognition. To solve this problem, we designed a 3D face acquisition system which was able to generate multi-view images. However, this created another poseestimation problem in terms of normalizing the 3D face data. This paper presents an automatic pose-normalized 3D face data acquisition method that is able to perform both 3D face modeling and 3D face pose-normalization at once. The proposed method uses 2D information with the AAM (Active Appearance Model) and 3D information with a 3D normal vector. The proposed system is based on stereo vision and a structured light system which consists of 2 cameras and 1 projector. In orsder to verify the performance of the proposed method, we designed an experiment for 2.5D face recognition. Experimental results showed that the proposed method is robust against pose variation.
1 Introduction Although current 2D face recognition systems have reached a certain level of maturity, performance has been limited by external conditions such as head pose and lighting conditions. To alleviate these factors, 3D face recognition methods have recently received significant attention, and the appropriate 3D sensing techniques have been highlighted [1][2]. Many 3D data acquisition methods that use general-purpose cameras and computer vision techniques have been developed [3][4]. Since these methods use generalpurpose cameras, they may be effective for low-cost 3D input sensors that can be used in 3D face recognition systems. Previous approaches in the field of 3D shape reconstruction in computer vision can be broadly classified into two categories; active and passive sensing. Although the stereo camera, a kind of passive sensing device, infers 3D information from multiple images, the human face contains an unlimited number of features. Because of this, it is difficult to use dense reconstruction with human faces. There is also a degree of ambiguity when matching point features. This is known as the matching problem [5]. Therefore, passive sensing is not an adequate choice for 3D face data acquisition. Active sensing, however, uses a CCD camera to project a special pattern onto the subject and reconstruct shapes by using reflected pattern imaging [6]. Because active L.-W. Chang, W.-N. Lie, and R. Chiang (Eds.): PSIVT 2006, LNCS 4319, pp. 652 – 661, 2006. © Springer-Verlag Berlin Heidelberg 2006
Automatic Pose-Normalized 3D Face Modeling and Recognition Systems
653
sensing is better at matching ambiguity and also provides dense feature points, it can act as an appropriate 3D face-sensing technique. However, in general, a 3D reconstruction device requires pose-normalization in order to recognize objects after acquisition. Some 3D recognition systems assume that a normalized 3D database exists already. They may also carry out another posenormalization step after the initial acquisition step [7][8][9][10][11]. This represents a 3D head pose estimation problem in 3D face recognition systems. Solutions to this kind of problem can be roughly divided into feature-based and appearance-based methods. Feature-based methods try to estimate the position of significant facial features such as the eyes, the nose or the mouth in the facial range or intensity image. Appearance-based methods consider the facial image as a global entity [12]. This paper presents a new method for automatic pose-normalized 3D data acquisition which is able to perform both 3D face data acquisition and pose-normalization at once. In the proposed system, both binary and color time-multiplexing patterns were used. Once the correspondence problem was solved, the 3D range data was computed by using triangulation. Triangulation is a well-established technique for acquiring range data with corresponding point information [3]. For pose-normalization, the pose-estimation method was used with the AAM (Active Appearance Method) [13]. This paper is organized as follows. Section 2 presents background information, and the 3D face modeling step is discussed in Section 3. In Section 4, the automatic posenormalized 3D face data acquisition method is proposed and Section 5 comprises the experimental results. Finally, Section 6 concludes the paper.
2 Background 2.1 Camera Calibration Calibration refers to the process of estimating the parameters that determine a projective transformation from the 3D space of the world onto the 2D space of an image plane. A set of 3D-2D point pairs for calibration can be obtained with a calibration rig. If at least six point pairs are known, the calibration matrix can be uniquely determined. However, in many cases, due to errors, it is better to use far more than six point pairs. Although this may result in what is known as the over-determined problem, we used 96 point pairs to ensure high accuracy in our experiments. Finally, the stereo camera system was calibrated with the DLT (Direct Linear Transform) algorithm [3][4]. 2.2 Epipolar Geometry Epipolar geometry refers to the intrinsic projective geometry between two given views. It is independent of scene structure, and depends only on the camera's internal parameters and relative pose [4]. The fundamental matrix F encapsulates this intrinsic geometry. If a point X in 3-space is imaged as x in the first view and as x′ in the second, then the image points satisfy the relation x′ᵀ F x = 0, and the epipolar line is l′ = F x [4].
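The epipolar relation can be checked numerically. The following is a minimal sketch; the fundamental matrix and the point correspondence are made-up placeholders, not values from the paper:

```python
import numpy as np

def epipolar_line(F, x):
    """Epipolar line l' = F x in the second view for an image point x = (u, v)."""
    return F @ np.array([x[0], x[1], 1.0])

def epipolar_residual(F, x, x_prime):
    """Residual x'^T F x; it is (close to) zero for a correct correspondence."""
    xh = np.array([x[0], x[1], 1.0])
    xph = np.array([x_prime[0], x_prime[1], 1.0])
    return float(xph @ F @ xh)

# Placeholder fundamental matrix and correspondence (illustration only).
F = np.array([[0.0, -1e-4, 0.02],
              [1e-4,  0.0, -0.03],
              [-0.02, 0.03, 1.0]])
x, x_prime = (120.0, 85.0), (131.0, 84.0)
print(epipolar_line(F, x), epipolar_residual(F, x, x_prime))
```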
The fundamental matrix is also an algebraic representation of epipolar geometry. It derives the fundamental matrix from the mapping between a point and its epipolar line, and then specifies the properties of the matrix [4]. Using the normalized 8-point algorithm [4] [14], we extracted the fundamental matrix. 2.3 Structured Light System Stereo vision is based on capturing a given scene from two or more points of view and then finding the correspondences between the different images in order to triangulate the 3D position. However, difficulties in finding the correspondences may arise, even when taking into account epipolar constraints. In addition, active stereo vision using light can be divided into time-multiplexing, spatial neighborhood and direct coding [6]. It is possible to use both binary and color time-multiplexing coded patterns to solve the correspondence problem. The idea of these methods is to provide distinguishing features in order to find the correspondence pairs. 2.4 Active Appearance Model Each shape sample x can be expressed by means of a set of shape coefficients as shown in equation (1) and texture g as shown in equation (2).
x = x̄ + P_s b_s    (1)

g = ḡ + P_g b_g    (2)

where x̄ is the mean shape, P_s is the matrix with eigenvectors on the shapes, b_s is a shape coefficient, ḡ is the mean texture, P_g is the matrix with eigenvectors on the texture, and b_g is a texture coefficient. Since the shapes and gray-level textures can be correlated with each other, the shape and texture coefficients can be concatenated into a vector b as shown in equation (3):

b = [ W_s b_s ; b_g ] = Q c    (3)
In equation (3), Q represents the eigenvectors and c is a vector of the appearance parameters which control both the shape and gray-levels. Thus all images are modeled with a vector c which describes both the shape and gray-level texture variations together. With the given face image, it was necessary to search for a 2D face model that minimized the texture error [13]. After AAM matching, b_s, b_g and c were extracted and used as features to identify faces. To improve recognition performance, we used a feature extraction technique which
combined a generic AAM with a person-specific AAM. In training, one generic model was constructed which captured the variation of all the training faces, and a specific model was constructed for each individual to model the variation of only one person. Generally the AAM is able to provide information on the principal axes of the training data since the main idea of the AAM is PCA. The axes extracted from the generic model are efficient ways to identify faces; meanwhile the axes extracted from the person-specific models are efficient ways to track the faces of the corresponding person. Thus, the person-specific model is more suitable for tracking faces whereas the generic model is more suitable for extracting facial features for recognition. With the given face image, we first used the person-specific model to track and extract the shape aligned to that face. In the first frame we tried all person-specific models and then selected one model by using the minimum matching error. We applied a generic AAM model to the selected face region in order to extract the shape and texture features for face recognition. Fig. 1 shows the way to get the geometric information of a face by using the AAM. To find the initial shape position, we used the face detection method proposed by the Viola-Jones cascade detector [15].
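As a rough illustration of how a person-specific model could be selected by minimum matching error and a generic PCA basis then used to extract appearance features, consider the sketch below. All matrices, the matching-error function and the dimensions are hypothetical stand-ins, not the actual AAM implementation:

```python
import numpy as np

def project_to_basis(sample, mean, Q):
    """Appearance parameters c for a concatenated shape/texture vector b (Eq. 3 style)."""
    return Q.T @ (sample - mean)

def select_person_model(frame_features, person_models, matching_error):
    """Pick the person-specific model with the minimum matching error on the first frame."""
    errors = [matching_error(frame_features, m) for m in person_models]
    return int(np.argmin(errors))

# Hypothetical generic model: mean vector and orthonormal eigenvector matrix Q.
rng = np.random.default_rng(0)
mean_b = rng.normal(size=80)
Q = np.linalg.qr(rng.normal(size=(80, 10)))[0]     # columns = appearance modes
b_observed = mean_b + Q @ rng.normal(size=10)      # a synthetic matched sample
c = project_to_basis(b_observed, mean_b, Q)        # features used for recognition
```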
3 3D Face Modeling 3.1 3D Face Data Acquisition System The simplest active technique is to use a single-line stripe pattern, which greatly simplifies the matching process, even though only a single line of 3D data points can be obtained with each image shot. To speed up the acquisition of 3D range data, it is necessary to adopt a multiple-line stripe pattern instead. However, the matching process then becomes much more difficult. One other possibility is to use color information to simplify this difficulty [6].
(a) Face detection
(b) Initial shape
(c) Shape fitting
Fig. 1. Face geometric information using the adaboost algorithm and the AAM
Furthermore, when using the single-camera approach, it is necessary to find the correspondence between the color stripes projected by the light source and the color stripes observed in the image. In general, due to the different reflection properties (or surface albedos) of varying object surfaces, the color of the stripes recorded by the camera is usually different from that of the stripes projected by the light source (even when the objects are perfectly Lambertian)[16]. It is difficult to solve these problems in many practical applications. However, this does not affect our system (which consists of two cameras and one projector) if the object is Lambertian, because the color observed by the two cameras
will be the same, even though this observed color may not be exactly the same as the color projected by the light source. Therefore, by adding one more camera, the more difficult problem of lighting-to-image correspondence is replaced by an easier problem of image-to-image stereo correspondence. Here, the stereo correspondence problem is also easier to solve than traditional stereo correspondence problems because an effective color pattern has been projected onto the object. 3.2 Pattern Design The binary sequential pattern uses two illumination levels which are coded as 0 and 1 and the color sequential pattern uses color instead of these two illumination levels [6]. Previous research has shown that the color model is an effective model for stereo matching [17][18][19]. We adopted the HSI model for color-coded sequential pattern generation. Using the line features and the HSI color model, a set of unique color encoded vertical stripes was generated. Each color-coded stripe was obtained as follows. The stripe color was denoted as shown in equation (4). stripe ( ρ , θ ) = ρ e jθ ,
(4)

where ρ is the saturation value and θ is the hue value of the HS polar coordinate system. We used only one saturation value (saturation = 1) because there was enough hue information to distinguish each stripe for the matching process. Finally, the stripe color equation was defined by using equation (5):

color(m, n) = e^{ j ( m·H_jmp + ε_n ) }    (5)
where m represents the set element, n is the set number, and ε is the hue perturbation. Next, the color-coded sequence was obtained as follows. We determined that the hue distance was 144°. The next set elements were then used sequentially. In this paper, we generated codes in the face region. We used seven images for the binary pattern and three images for the color pattern. The binary pattern generated 2^7 = 128 codes and the color pattern generated 5^3 = 125 codes. 3.3 Absolute Code Interpolation After triangulation, we obtained the 3D reconstructed face data. However, this data was sparse, because we had to deal with a thinned code in each view. Thinning refers to the process of reducing the ambiguity of the correspondence problem. If we did not use the thinning step, the 1-to-many matching problem would have been encountered. Fig. 2 shows the reconstructed sparse 3D face results. However, in order to recognize the images, dense reconstruction is required. Therefore, we used absolute code interpolation, exploiting the fact that a human face surface is a continuous region in which neighboring regions have similar depth. The thinned codes in each view were interpolated by linear interpolation. Fig. 3 shows an example of code interpolation. Fig. 4 shows an example of reconstructed dense 3D face data using the proposed absolute code interpolation. The reconstructed 3D face data consists of a 3D point cloud and a texture map.
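A small sketch of how a color-coded stripe sequence with a fixed hue step can be generated is given below. It uses the standard HSV conversion as a stand-in for the paper's HSI model, and the 144° hue distance from the text; the stripe width, image size and zero hue perturbation are arbitrary assumptions:

```python
import colorsys
import numpy as np

def color_stripe_pattern(n_stripes, width_px, height_px, hue_step_deg=144.0, eps_deg=0.0):
    """Vertical stripes whose hues advance by hue_step_deg (saturation fixed to 1)."""
    pattern = np.zeros((height_px, n_stripes * width_px, 3), dtype=np.uint8)
    for m in range(n_stripes):
        hue = ((m * hue_step_deg + eps_deg) % 360.0) / 360.0
        r, g, b = colorsys.hsv_to_rgb(hue, 1.0, 1.0)
        pattern[:, m * width_px:(m + 1) * width_px] = (np.array([r, g, b]) * 255).astype(np.uint8)
    return pattern

stripes = color_stripe_pattern(n_stripes=125, width_px=4, height_px=480)
```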
4 Automatic Pose-Normalized 3D Face Modeling 4.1 3D Face Pose-Normalization The proposed automatic normalized 3D face data acquisition method uses 2D and 3D information fusion. First, we found the eye region, nose region and mouth region in the 2D image by using the AAM. Second, we reconstructed the 3D face. Third, because we knew the 2D and 3D relation equations, we were able to find the center of the 3D eyes, the top of the nose and the center point of the mouth. Fourth, using the 3D eye and nose top points, we produced a mesh plane. Fifth, using this mesh plane, we were able to find the normal vector. Finally, the problem of DOF 3 against translation was solved by nose tip rearrangement and the problem of DOF 3 against rotation was solved by normal vector rearrangement. By obtaining the three points P0 , P1 and P2 (two eyes’ mid points and lip’s mid point), we were able to make the 3D normal vector as follows [20]:
Fig. 2. Reconstruction of 3D face
Fig. 3. Code Interpolation
Fig. 4. Reconstructed dense 3D face data using the proposed absolute code interpolation
N = ( (P1 − P0) × (P2 − P0) ) / || (P1 − P0) × (P2 − P0) ||    (6)
where N is the 3D normal vector. The process of the proposed method is as follows (a small numerical sketch of steps 5-8 is given after the list):

1. Find the 2D geometry features such as the eye, nose and mouth regions in the 2D image.
2. Reconstruct the 3D face data.
3. Find the 3D geometry features in the 3D space using the relation between the 2D geometry features and the 3D position.
4. Search for the nose tip under the nose region using the 2D nose information.
5. Move the 3D face data's center to the nose tip for translation.
6. Check the eye position and align the 3D face data horizontally for the Z axis.
7. Calculate the 3D normal vector using the two eye regions' mid points and the lip region's mid point.
8. Rotate the 3D normal vector together with the 3D face data about the X and Y axes.
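Steps 5-8 can be summarised numerically as follows. This is only a sketch under the assumption that the normalized face should end up with its normal along the Z axis and its eye line along the X axis; the landmark arrays are placeholders:

```python
import numpy as np

def normalize_pose(points, nose_tip, eye_l, eye_r, lip_mid):
    """Translate the cloud to the nose tip and rotate it so the face normal maps to +Z."""
    P = points - nose_tip                                       # step 5: translation
    eye_mid = 0.5 * (eye_l + eye_r)
    n = np.cross(eye_r - eye_l, lip_mid - eye_mid)
    n /= np.linalg.norm(n)                                      # step 7: unit normal (Eq. 6)
    x_axis = (eye_r - eye_l) / np.linalg.norm(eye_r - eye_l)    # step 6: eye line
    x_axis -= np.dot(x_axis, n) * n                             # make it orthogonal to n
    x_axis /= np.linalg.norm(x_axis)
    y_axis = np.cross(n, x_axis)
    R = np.vstack([x_axis, y_axis, n])                          # rows of the target frame
    return P @ R.T                                              # step 8: rotate the cloud

# Placeholder landmarks in the reconstructed 3D space.
pts = np.random.rand(1000, 3)
normalized = normalize_pose(pts, pts[0], np.array([0.1, 0.2, 1.0]),
                            np.array([0.3, 0.2, 1.0]), np.array([0.2, 0.4, 1.0]))
```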
4.2 Procedure 3D face modeling can be divided into an off-line and an on-line process. Calculating the camera matrix and the fundamental matrix is an off-line process. Because these are calculated before the on-line process, they are not related to running-time. After the off-line process, it is possible to carry out the on-line process, which is related to running-time. The on-line process can be roughly divided into 2D processing and 3D processing. 2D processing consists of capturing the image pairs, detecting the face region, extracting the 2D face geometry features, thinning, absolute code interpolation and finding corresponding pairs. 3D processing consists of triangulation as well as the automatic pose-normalized face modeling step. The procedure is as follows:

1. Choose the pattern type (binary or color).
2. Project the coded patterns onto the human face and capture the stereo reflected pattern images.
3. Detect the face region using the adaboost algorithm in order to provide the initial shape position.
4. Extract the 2D facial geometry features such as the eye, nose, and lip regions using the initial shape position and the AAM.
5. Generate the code according to the pattern type (binary or color).
6. Thin the generated code in order to reduce the correspondence problem.
7. Interpolate the thinned code.
8. Find the corresponding pairs using the fundamental matrix.
9. Triangulate using the camera matrix.
10. Find the 3D normal vector and the nose tip using the 2D facial geometry information.
11. Normalize the pose using the 3D normal vector.

Items 1-8 refer to 2D processing steps and items 9-11 refer to 3D processing steps; a sketch of the triangulation in step 9 is given below.
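A standard linear (DLT-style) triangulation that could implement step 9 is sketched here; the two projection matrices and the matched pixel pair are placeholders standing in for the outputs of the calibration and correspondence steps:

```python
import numpy as np

def triangulate(P1, P2, x1, x2):
    """Linear triangulation of one correspondence (u, v) <-> (u', v') into a 3D point."""
    A = np.vstack([
        x1[0] * P1[2] - P1[0],
        x1[1] * P1[2] - P1[1],
        x2[0] * P2[2] - P2[0],
        x2[1] * P2[2] - P2[1],
    ])
    _, _, Vt = np.linalg.svd(A)
    X = Vt[-1]
    return X[:3] / X[3]

# Placeholder 3x4 camera matrices and a matched (normalized) pixel pair.
P1 = np.hstack([np.eye(3), np.zeros((3, 1))])
P2 = np.hstack([np.eye(3), np.array([[-0.1], [0.0], [0.0]])])
X = triangulate(P1, P2, (0.21, 0.14), (0.16, 0.14))
```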
5 Experimental Results 5.1 3D Face Database We created a 3D face database of fifteen persons. Each person was captured at five different pose variations: front, up, down, left and right. Also, each person was subjected to binary sequential and color sequential reconstruction. The color sequential
reconstructed data were used for training and as the gallery, while the binary sequential reconstructed data were used as the query. After normalization, we obtained the normalized 3D face data. Fig. 5 shows the geometric information and the 3D normal vector in the 3D space. Fig. 6 shows an example of the 3D face database, both with and without the proposed normalization method. 5.2 Recognition Results To verify the proposed pose-normalization method, we compared two types of recognition experiments. The first experiment recognized the reconstructed 3D face data without pose-normalization, and the second recognized the reconstructed 3D face data with pose-normalization. Both experiments projected the 3D data into the 2D space with texture, so existing 2D face recognition algorithms can be applied easily; this is a 2.5D face recognition method. The gallery included reconstructed 3D face data using a color sequential pattern and the query included reconstructed 3D face data using a binary sequential pattern. Table 1 shows the used sets. To recognize them, we used the AAM shape information and the Euclidean distance. Table 2 shows the identification results using the AAM shape information. After normalization, the recognition rate improved by about 35%.
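The matching step with AAM shape information and Euclidean distance amounts to a nearest-neighbour search over the gallery; a minimal sketch with made-up feature vectors:

```python
import numpy as np

def identify(query_feature, gallery_features, gallery_labels):
    """Return the label of the gallery entry with the smallest Euclidean distance."""
    d = np.linalg.norm(gallery_features - query_feature, axis=1)
    return gallery_labels[int(np.argmin(d))]

rng = np.random.default_rng(1)
gallery = rng.normal(size=(75, 20))      # e.g. 5 poses x 15 persons, 20-D shape vectors
labels = np.repeat(np.arange(15), 5)
query = gallery[7] + 0.05 * rng.normal(size=20)
print(identify(query, gallery, labels))  # expected: person 1
```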
Fig. 5. Geometric information and the normal vector in the 3D face space
(a) Without the proposed method
(b) With the proposed method
Fig. 6. Reconstructed 3D face database without and with pose-normalization Table 1. Used sets and types
Used set            Type      Number of data items
Color sequential    Gallery   5 × 15 = 75
Binary sequential   Query     5 × 15 = 75
Table 2. Identification results
Set                     Normalization    Recognition rate
Projected 2D texture    OFF              60 % (45/75)
Projected 2D texture    ON               94.67 % (71/75)
6 Conclusions This paper presents an automatic pose-normalized 3D face data acquisition method and a 2.5D face recognition system. We assumed that the user's pose variation range was ±35° about each axis, and the identification system was trained only for registered people using the AAM. We built a full face recognition system, from 3D data acquisition to recognition, around 3D modeling that uses the proposed pose-normalization method. The processing time is about 10 seconds, depending on the number of matching points. Our system consists of two cameras and one projector, with expansion capabilities: because the projector provides the features, the feature source can easily be changed to lasers, infra-red systems and so on. To normalize the pose factor, we used both 2D and 3D information. 2D and 3D information each have advantages and disadvantages, so we combined them to exploit their advantages and offset their disadvantages. We also used a 3D point cloud and a texture map. The recognition method was 2.5D face recognition, which is very useful because 2D recognition has a lower computational cost than 3D recognition methods and there are powerful recognition methods in the 2D space. The proposed pose-normalization method shows robust results against pose variation. Experimental results show that the normalized recognition rate improved by about 35% when compared with un-normalized recognition. In future work, we will consider other recognition methods in the 2D space such as the SVM (Support Vector Machine), ICA (Independent Component Analysis), LFA (Local Feature Analysis) and so on. Finally, we hope to design a full 3D face data recognition and 2D and 3D multimodal recognition system. Acknowledgements. This work was supported by the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University, Seoul, South Korea.
References 1. H.S. Yang, K.L. Boyer and A.C. Kak, “Range data extraction and interpretation by structured light,” Proc. 1st IEEE Conference on Artificial Intelligence Applications, Denver, CO, pp. 199–205, 1984. 2. K.L. Boyer and AC. Kak, “Color-encoded structured light for rapid active ranging,” IEEE Trans. Pattern Analysis and Machine Intelligence, pp.14–28, 1987. 3. Emanuele Trucco and Alessandro Verri, “Introductory Techniques for 3-D Computer Vision,” Prentice Hall, 1998.
4. R. Hartley and A. Zisserman, “Multiple view Geometry in computer vision,” Cambridge University Press, 2000. 5. D. Scharstein and R. Szeliski, “Stereo Matching with Nonlinear Diffusion”, International Journal of Computer Vision, 28(2), pp. 155 – 174, 1998. 6. J. Salvi, J. Pags and J. Batlle, “Pattern Codification Strategies in Structured Light Systems,” Pattern Recognition 37(4), pp. 827–849, April 2004. 7. S. Malassiotis and M. G. Strintzis, “Pose and illumination compensation for 3D face recognition,” Proc. Int. Conf. Image Processing, 1, pp. 91–94, 2004. 8. S. Malassiotis and M. G. Strintzis, “Robust Face Recognition Using 2D and 3D Data: Pose and Illumination Compensation,” Pattern Recognition, Jan. 2005. 9. Chang, K.I., Bowyer, K.W. and Flynn, P.J. “Multi-Modal 2D and 3D Biometrics for Face Recognition,” Analysis and Modeling of Faces and Gestures, 2003. AMFG 2003. IEEE International Workshop, pp. 187-194, Oct. 2003. 10. C. Xu, Y. Wang, T.Tan and L.Quan, “A New Attempt to Face Recognition Using Eigenfaces”, Proc. ACCV, pp.884~889, 2004. 11. H. Song, S. Lee, J. Kim and K.Sohn, “3D sensor based face recognition,” Applied Optics, Vol. 44, No. 5, pp. 677-687, Feb. 2005. 12. H. Song, “3D Head Pose Estimation and Face Recognition Using Range Image”, Ph.D Thesis, Yonsei University, South Korea, 2005. 13. T. F. Cootes, G. Edwards and C. Taylor, “Active Appearance Models,” IEEE Trans. On Pattern Analysis and Machine Intelligence, 23(6), pp. 681–685, 2001. 14. Hartley, R., "In Defence of the 8-points Algorithm", in Proceedings Fifth International Conference on Computer Vision, Cambridge, Mass., June 1995. 15. P. Viola and M. Jones, “Robust real-time face detection,” International Journal of Computer Vision 57(2), pp. 137–154, 2004. 16. D. Shin and J. Kim, “Point to Point Calibration Method of Structured Light for Facial Data Reconstruction”, First International Conference, ICBA 2004, pp. 200-206, Jul. 2004. 17. C.H. Hsieh, C.J. Tsai, Y.P. Hung and SC. Hsu, “Use of chromatic information in regionbased stereo,” Proc. IPPR Conference on Computer Vision, Graphics, and Image Processing, Nantou, Taiwan, pp. 236–243, 1993. 18. R.C. Gonzales and R.E. Woods, “Digital Image Processing,” Addison-Wesley, Reading, MA, 1992. 19. C. Chen, Y. Hung, C. Chiang and J. Wu, “Range data acquisition using color structured lighting and stereo vision,” Image and Vision Computing, pp. 445–456, Mar. 1997. 20. E. Lengyel, “Mathematics for 3D Game Programming and Computer Graphics”, Charles River Media, 2001.
Vision-Based Game Interface Using Human Gesture Hye Sun Park, Do Joon Jung, and Hang Joon Kim Department of Computer Engineering, Kyungpook National Univ., Korea {hspark, djjung, hjkim}@ailab.knu.ac.kr
Abstract. Vision-based interfaces pose a tempting alternative to physical interfaces. Intuitive and multi-purpose, these interfaces could allow people to interact with computer naturally and effortlessly. The existing various visionbased interfaces are hard to apply in reality since it has many environmental constraints. In this paper, we introduce a vision-based game interface which is robust in varying environments. This interface consists of three main modules: body-parts localization, pose classification and gesture recognition. Firstly, body-part localization module determines the locations of body parts such as face and hands automatically. For this, we extract body parts using SCI-color model, human physical character and heuristic information. Subsequently, pose classification module classifies the positions of detected body parts in a frame into a pose according to Euclidean distance between the input positions and predefined poses. Finally, gesture recognition module extracts a sequence of poses corresponding to the gestures from the successive frames, and translates that sequence into the game commands using a HMM. To assess the effectiveness of the proposed interface, it has been tested with a popular computer game, Quake II, and the results confirm that the vision-based interface facilitates more natural and friendly communication while controlling the game. Keywords: Vision-based Game Interface, Gesture Recognition, HMM.
1 Introduction The quest for intelligent and natural Human Computer Interaction (HCI) has recently received a lot of attention, and influenced the development of various 3D computer games. In particular, if a user can control a 3D computer game using visual, auditory, or simple actions then the user feels more intuitive and interesting during playing a computer game [1-3]. Especially, visual information makes it possible to communicate with computerized equipment at a distance, without any requirement for physical contacts. Also such vision-based interfaces can control games or machines. In this case, a game player feels more intuitive and interesting during playing a computer game. Various vision based interfaces have been proposed to develop a system for human computer interaction and also a few of them are applied to game or machine. Soshi Iba [4] developed the system that hidden Markov models to spot and recognize gestures captured for controlling robots. Rubine [5] produced a system based on L.-W. Chang, W.-N. Lie, and R. Chiang (Eds.): PSIVT 2006, LNCS 4319, pp. 662 – 671, 2006. © Springer-Verlag Berlin Heidelberg 2006
statistical pattern recognition techniques for recognizing single-path gestures (drawn with a mouse or stylus) and multiple-path gestures (consisting of a simultaneous path of multiple fingers). Freeman and Weissman [6] presented a television control system using the tracking of hand movements. Krueger [7] developed an experimental interactive environment system called VIDEOPLACE that segments the user’s image from a known background, and then builds a composite image of the user’s silhouette in a generated environment. Yang et al. [8] developed a gesture based system using a multi-dimensional hidden Markov model (HMM) for telerobotics and human computer interfacing. In this paper, we present a vision-based interface for detecting and recognizing user gesture in image sequence using a single commercial camera in the Quake II game. Previous vision-based interfaces have various constraints such as a specific situation: put on a long-sleeved shirt, a fixed place: stationary lighting and simple background. To solve this problem, it need to more robust vision-based interface. In this paper, the proposed vision-based interface is robust without regard to varying lighting, complex background and various types of cloths (long-sleeved shirt or short-sleeved shirt). Section 2 first of all, introduces such a proposed interface briefly, next explains each modules in detail. In section 3, shows the experimental results of proposed interface. Conclusions are given in section 4.
2 System Overview In this section, we describe the structure of the proposed interface, the functionality of each process in this structure, and the commands to be recognized. The gesture of a player is recognized as an input of Quake II games through our interface. An environment of a proposed interface system is shown in Fig. 1. With a fixed camera, we detect a player’s body parts in each frame, we classify player poses from a symbol table, and recognize from pose symbol sequences to player’s gestures.
Fig. 1. Environment of the proposed game system
Our interface consists of three main modules: body part localization, pose classification, and gesture recognition. Given the input video sequences, body parts localization involves extraction of important body parts, such as the face and hands, from the each frame, Pose classification involves classifying the extracted input features into predefined pose symbols in a symbol table, and gesture recognition involves recognizing a gesture from among the classified symbols using a single
HMM. Quake II receives a recognized gesture command and is operated according to that gesture. The 13 commands are frequently used commands in Quake II; they are described in Table 1.

Table 1. Thirteen types of gestures used in Quake II

Quake Command: WALK FORWARD, BACK PEDAL, ATTACK, TURN LEFT, TURN RIGHT, STEP LEFT, STEP RIGHT, LOOK UP, LOOK DOWN, CENTER VIEW, RUN, UP/JUMP, DOWN/CROUCH.
Gestures: Shake a right hand in a forward direction. Shake a left hand in a backward direction. Stretch out a right hand in front. Stretch out a left hand to the left. Stretch out a right hand to the right. Move a right hand down and a left hand to the left. Move a left hand down and a right hand to the right. Move a head to the left. Move a head to the right. Stretch out both hands horizontally, then return to the same position. Swing arms alternately. Move both hands down and tilt head back. Lift both hands simultaneously.

2.1 Body-Parts Localization
To recognize the user gestures, positions of a face and hands are extracted for each frame of a given image sequence. For this, first of all, we detect a face and hands through three modules: skin-color detection, face detection and hands detection. 2.1.1 Skin-Color Detection Skin-color information is used to extract the hands and face from the input frame, where the skin-color is represented by an SCT (spherical coordinate transform) color model, which is reasonably insensitive to variations in illumination and computationally inexpensive, making it simple to implement [9]. The RGB-color space is transformed into SCT-color space using the following equations:

L = √(R² + G² + B²),   ∠A = cos⁻¹(B / L),   ∠B = cos⁻¹( R / (L · sin(∠A)) )
(1)
where L is the intensity, and ∠A and ∠B represent the colors independent from the intensity. As such, if a particular color is plotted as a point in RGB-space, then the magnitude of the vector for the point is L, the angle between the vector and the blue axis is ∠A, and the angle between the red axis and the projection of the vector onto the RG plane is ∠B. The resulting color space defined by the transformation can be represented by a color triangle [9]. Given the skin-color model obtained from experimental results, if the color of a pixel falls within the skin-color distribution range, the pixel is considered as part of a skin-color area.
However, a background in the real world is quite complex. Thus, the detected regions include both regions of the user's body, namely skin color, and noise. For this problem, we use a background subtraction method proposed in [10]. Given a background image, we can easily detect a foreground region by subtracting the current image from it. We create the background model BG_i by using the first frame, F_1(x,y). We then detect the foreground region FG_t through background subtraction. The foreground region is defined as follows:

FG_t(x, y) = 1  if |BG_i(x, y) − F_t(x, y)| ≥ θ,   and 0 otherwise    (2)
where BGi(x,y) and Ft(x,y) are the intensity values at the background model and the current frame, respectively. FGt(x,y) is a binary image that consists of a foreground and a background. The threshold value θ is 30. We update the background model using current frame Fi and foreground region FGi. We generate the dilated foreground region FGdi which will be used for background updating. We only update the background on the dilated foreground region where FGdi≠ 0. The background update equation is as follows: BGi +1 = (1 − w) BGi + wFi ,
(3)
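Equations (2) and (3) can be written compactly as below; the threshold θ = 30 and weight w = 0.1 are the values given in the text, while everything else (array types, the dilation done elsewhere) is assumed:

```python
import numpy as np

def foreground_mask(bg, frame, theta=30):
    """Eq. (2): pixels whose difference from the background model exceeds theta."""
    return (np.abs(bg.astype(np.int32) - frame.astype(np.int32)) >= theta).astype(np.uint8)

def update_background(bg, frame, fg_dilated, w=0.1):
    """Eq. (3): running-average update, applied only where the dilated foreground is set."""
    updated = (1.0 - w) * bg + w * frame
    return np.where(fg_dilated > 0, updated, bg)
```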
where BG_{i+1} is the updated background model, BG_i is the current background model, and F_i is the current frame. The weighting factor w is 0.1. After background subtraction, the detected skin-color regions within the skin locus have small holes. To eliminate them, we fill them using a morphological operation. Next, we detect the positions of the face and hands by labeling regions using a connected-components analysis. Each labeled region is then determined to be the face, the right hand or the left hand, as described in the next two sub-sections: face detection and hands detection. 2.1.2 Face Detection In an image sequence, detecting the face and hands is difficult because the face and hands have similar properties, such as color and motion information. For the face detection problem, a process is needed to verify whether a face candidate region (skin-color region) is a face or a non-face. For the face verification, we use the weight vectors of the candidate regions in eigenspace. For dimensionality reduction of the feature space, we project an N-dimensional candidate face image to a lower-dimensional feature space, called eigenspace or facespace [11, 12]. In eigenspace, each feature component accounts for a different amount of the variation among the face images. To be brief on the eigenspace, let a training set of face images, each represented by an N-dimensional column vector, for constructing the facespace be I_1, I_2, I_3, …, I_M. The average of the training set is defined by A = (1/M) Σ_{i=1}^{M} I_i. A new set of vectors with zero mean at each dimension is computed as Φ_i = I_i − A. To produce the M orthogonal vectors that optimally describe the distribution of face images, the covariance matrix is originally computed as
C = (1/M) Σ_{i=1}^{M} Φ_i Φ_iᵀ = Y Yᵀ    (4)
for Y = [Φ1Φ 2 ...Φ M ] . Since the matrix C, however, is N × N dimension, determining the N-dimensional eigenvectors and N eigenvalues is an intractable task. Therefore, for the computational feasibility, instead of finding the eigenvectors for C, we calculate M eigenvectors vk and eigenvalues λk of [YTY], so that uk, a basis set is computed as
u_k = (Y v_k) / λ_k ,
(5)
for k=1, …, M. Of the M eigenvectors, the M’ significant eigenvectors are chosen as those with the largest corresponding eigenvalues. For M training face images, the feature vectors Wi = [ w1 , w2 , ... , wM ′ ] are calculated as
w_k = u_kᵀ Φ_i ,   k = 1, …, M′
(6)
To verify the candidate face region to be a face or non-face, the candidate face regions are also projected into the trained eigenspace using Eq. (6). The projected regions are verified using the minimum distances of the detected regions with the face cluster and the non-face cluster according to Eq. (7). min(|| W kcandidate − W face ||, || W kcandidate − W nonface ||),
(7)
where W_k^candidate is the k-th candidate face region in the trained eigenspace, and W_face and W_nonface are the center coordinates of the face cluster and the non-face cluster in the trained eigenspace, respectively. The Euclidean distance measure is used. If there are more than two verified faces in an image frame, we select the biggest one as the face. 2.1.3 Hands Detection We detect the right and left hands using heuristic rules. First, the hands are the two biggest skin-color regions except the face region. Next, the left one is taken as the left hand and the right one as the right hand, under the restriction that the hands never cross. However, when the user wears a short-sleeved shirt, it is hard to detect the hands. Thus, we add the following heuristic rule: the shape of a hand (a fist) is a circle and is bigger than the detected arm. But the forearm is bigger than a fist when the user bends the arm, so we considered human physical and gesture characteristics as follows: if the extracted hand region (including the arm) is close to the face, we search for the hand in the upper half of the detected region; otherwise we search in the lower half. Given this search space in the detected hand region, we finally take as the hand the region with the maximum score defined by Eq. (8):

score = magnification of circle × (number of skin-color pixels inside the hand circle) / (number of pixels inside the hand circle)    (8)
The hand circle is extended by 5 different magnifications to find potential locations for the hand.
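The score of Eq. (8) can be evaluated over the candidate circles as sketched below; the base radius of 3 pixels and the five magnifications follow the text, while the skin mask and candidate centres are placeholders:

```python
import numpy as np

def circle_score(skin_mask, center, magnification, base_radius=3):
    """Eq. (8): magnification x (skin pixels inside the circle) / (pixels inside the circle)."""
    r = magnification * base_radius
    yy, xx = np.ogrid[:skin_mask.shape[0], :skin_mask.shape[1]]
    inside = (yy - center[0]) ** 2 + (xx - center[1]) ** 2 <= r ** 2
    return magnification * skin_mask[inside].mean()

def best_hand(skin_mask, candidate_centers, magnifications=(1, 2, 3, 4, 5)):
    scored = [(circle_score(skin_mask, c, m), c, m)
              for c in candidate_centers for m in magnifications]
    return max(scored)   # (score, centre, magnification) with the maximum score
```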
2.2 Pose Classification
The body-part positions of a frame are classified into the pose symbol of a predefined symbol table that has the smallest norm to the given body-part positions. The norm is defined as the Euclidean distance in the 6D vector space of the body positions. The symbol table is trained off-line using the K-means clustering method. For this, we extract representative poses for each gesture. In our experiments, K is 64 and the number of poses is set to 23, after merging poses that are very similar to each other. 2.3
Gesture Recognition
To recognize a gesture, we take a continuous stream of pose symbols. Every time a symbol is given, the gesture recognizer determines whether or not the user is performing one of the thirteen gestures defined above. Usually, HMMs work on isolated or pre-segmented input symbol sequences, which requires an automatic way of segmenting the stream; in practice such a segmentation is hard to obtain. So we propose a new HMM architecture that takes a continuous stream of pose symbols and segments and recognizes gestures automatically. The proposed HMM is a single HMM that models the thirteen gestures. Given a continuous stream of images as input, the HMM starts with initial state probabilities s⁰ = {s_k⁰} and continuously updates its state probabilities with each symbol of the stream, as shown in Eq. (9):

s_nᵗ = Σ_{k=0}^{K−1} ( s_k^{t−1} × a_kn ) × b_np ,    s_nᵗ = s_nᵗ / Σ_{k=0}^{K−1} s_kᵗ
• S = {sk}: A vector of state probabilities, sk denotes a state probability of state k. • K = the number of states, • M = the number of poses. • A={aij}: An K × K matrix for the state transition probability distributions where aij is the probability of making a transition from state si to sj. • B={bij}: An K × M matrix for the observation symbol probability distributions where bij is the probability of emitting pose symbol vk in state si.
Then, the HMM has a few states that distinguish the path related with each gesture. So when the state has higher state probability than predefined threshold, a gesture is detected and recognized.
3 Experimental Results The experimental results showed that vision-based interface allowed that gamer feel more natural and intuitive during he or she play a game. The proposed interface was implemented on a PentiumIV-2.8GHZ PC using Visual C++ language. In the proposed interface, we improve various restrictions in [4,6,13]. In sub-section, we describe each module’s results in detail.
668
H.S. Park, D.J. Jung, and H.J. Kim
3.1 Skin-Color Detection Result To demonstrate the robustness of the proposed skin-color detection, tests were performed using sequences collected on different days, at different times, under varying lighting environments from four different individuals. To obtain the skincolor model, skin-color regions were extracted from 100 test data, the color distribution of these regions investigated, and the skin-color finally represented by the following parameters: A is 1.001~1.18, B is 0.82 ~ 0.97, and L is 170 ~ 330.
∠
∠
3.2 Face Detection Result To detect the user’s face in skin-color regions, we experiment with twenty different test sequences, and the train to sets consisted of six individuals with five different face orientations. Fig. 2 shows some of the training images used to construct the eigenspace.
Fig. 2. The parts of the training images for eigenspace construction
An analysis of the image sets captured during the experiment revealed that the face detection was 91.0% correct on average. Face detection rate = Non_Face detection rate =
Number of correctly detected images as faces Number of images detected as true faces
(10)
Number of correctly detected images as non_faces Number of images detected as non_faces
(11)
In Table 2, the face detection success rates represent when the face regions were detected as faces and the non-face regions detected as non-faces. Table 2. The face detection success rates Users 1 2 3 4 5 6 Total
Face 93.8% 90.9% 91.3% 90.0% 92.5% 91.0% 91.6%
Non_Face 91.8% 88.3% 90.8% 91.0% 90.0% 89.9% 90.4%
Total 92.8% 89.6% 91.0% 90.5% 91.3% 90.5% 91.0%
Vision-Based Game Interface Using Human Gesture
669
3.3 Hands Detection Result Detected hand region is represented by circle. To find potential locations for the hand, the circle is extended by 5 different magnifications. The circle is extended by 1 radius, 2 radius, 3 radius, 4 radius, 5 radius, the radius is 3 pixels in our experiment. Fig. 3 (a) and (b) show hands detection result with the different magnifications of the circle for potential hands locations. The error is occurred when it is failed to detect face or the hand region is not dominant. Fig. 3 (c) shows an example of detection error. An analysis of the image sets captured in the experiments results that the hands detection was 95.1% correct on average.
×
Fig. 3. Hands detection results. Each example contains an original image (first row), skin color images (second row), and overlaid with detected hands (third row: in the third row, a bounding box with a white solid line is detected face region, bounding boxes with a gray dotted line are skin-color region except face region, bounding boxes with a gray solid line are hand searching region according to the human physical character).
3.4 Gesture Recognition Results The performance of the proposed HMM recognizer was analyzed using a total of 1300 frames of thirteen gestures performed many times by different individuals. The results are summarized in Table 3 based on the detection and reliability ratio used in [14], defined using the following Eqs. (11) and (12). The experimental results showed a 93.5% accuracy.
Detection = (Correctly recognized gestures) / (The number of input gestures)    (12)

Reliability = (Correctly recognized gestures) / (The number of input gestures + the number of insertion errors)
(13)
where ‘Insert’ means the errors when the recognizer detected non-existing gestures. Plus, ‘Delete’ and ‘Substitute’ represent when the recognizer failed to detect a gesture and misclassified a gesture, respectively. Table 3. Gesture-recognition results
Command        Num. of Gestures   Insert   Delete   Substitute   Correct   Detection (%)   Reliability (%)
WALK FORWARD   327                0        27       0            300       91.74           91.74
BACK PEDAL     89                 0        8        0            81        91.01           91.01
TURN LEFT      117                0        0        0            117       100             100
TURN RIGHT     132                0        0        0            132       100             100
STEP LEFT      83                 0        0        0            83        100             100
STEP RIGHT     92                 9        0        0            92        100             91.09
ATTACK         236                0        17       0            219       92.80           92.80
CENTER VIEW    52                 0        3        0            49        94.23           94.23
UP/JUMP        62                 0        2        0            60        96.77           96.77
DOWN/CROUCH    43                 0        4        0            39        90.70           90.70
RUN            32                 0        2        8            22        68.75           68.75
LOOK DOWN      24                 0        3        0            21        87.50           87.50
LOOK UP        27                 0        3        0            24        88.89           88.89
Total          1316               9        69       8            1239      94.15%          93.50%
4 Conclusions In this paper, we proposed a vision-based interface and applied it to a gesture-based game system for Quake II. In the implemented game system, the proposed method showed 93.5% reliability, which was enough to enjoy Quake II using gestures instead of the existing keyboard-and-mouse interface. The proposed method is also robust in varying environments, such as varying lighting, complex backgrounds and various kinds of clothing. Therefore, the user can enjoy playing the game naturally in a real environment.
References 1. Sara Kiesler and Pamela Hinds: Human-Robot Interaction, Journal of HCI: Special Issue on HCI, Vol.19 No. 1 & 2 (2004) 2. Maja Pantic and Leon J. M. Rothkrantz: Toward an Affect-Sensitive Multimodal HuamnComputer Interaction, Proceedings of the IEEE Vol. 91 No. 9 (2003) 3. B. A. MacDonald and T. H. J. Collett: Novelty Processing in Human Robot Interaction, Symposium of one-day trans-disciplinary inter-media in Elam School of Fine Arts at the University of Auckland (2004) 4. S. Iba, J. M. V. Weghe, C. J. J. Paredis and P. K. Khosla: An Architecture for Gesture based Control of Mobile Robots, Intelligent Robots and Systems Vol. 2 (1999) 5. D.H. Rubine: The automatic recognition of gesture, Ph.D dissertation, Computer Science Department, Carnegie Mellon University December (1991) 6. W.T. Freeman and C.D. Weissman: Television control by hand gestures, Int. Workshop of Race and Gesture Recognition (1995) 179-183 7. M.W. Krueger: Artificaial Reality II, Addison-Wesley Reading MA (1991) 8. J. Yang, Y. Xu and C.S. Chen: Gesture Interface: Modeling and Learning, Robotics and Automation (1994) 1747-1752 9. J.Hymes, M.W. Powell, R. Murphy: Cooperative navigation of micro-rovers using color segmentation, IEEE International Symposium on Computational Intelligence in Robotics and Automation, (1999) 195-201 10. F. S. Chen, C. M. Fu and C. L. Hyang: Hand Gesture Recognition using a Real-time Tracking Method and Hidden Markov Models, Image and Vision Computing Vol. 2 (2003) 11. Turk, M., Pentland, A.: Face Recognition Using Eigenfaces. IEEE Proceedings CVPR '91, (1991) 586 -591 12. Guan, A.X., Szu, H.H.: A local face statistics recognition methodology beyond ICA and/or PCA. International Joint Conference on Neural Network, (1999) 1016 -1021 13. H. Kang, C.W. Lee and K. Jung: Recognition-based gesture spotting in video games, Pattern Recognition Letters Vol. 25 (2004) 1701-1714 14. H. K. Lee and J. H. Kim: An HMM-based Threshold Model Approach for Gesture Recognition, IEEE Transaction on Pattern Analysis and Machine Intelligence Vol. 21 (1999)
Terminal Phase Vision-Based Target Recognition and 3D Pose Estimation for a Tail-Sitter, Vertical Takeoff and Landing Unmanned Air Vehicle Allen C. Tsai, Peter W. Gibbens, and R. Hugh Stone School of Aerospace, Mechanical and Mechatronic Engineering, University of Sydney, N.S.W., 2006, Australia {allen.tsai, pwg, hstone}@aeromech.usyd.edu.au
Abstract. This paper presents an approach to accurately identify landing targets and obtain 3D pose estimates for vertical takeoff and landing unmanned air vehicles via computer vision methods. The objective of this paper is to detect and recognize a pre-known landing target and from that landing target obtain the 3D attitude information of the flight vehicle with respect to the landing target using a single image. The Hu’s invariant moments’ theorem is used for target identification and parallel lines of the target shape are investigated to obtain the flight vehicle orientation. Testing of the proposed methods is carried out on flight images obtained from a camera onboard a tail-sitter, vertical takeoff and landing unmanned air vehicle. Keywords: Tail-sitter vertical takeoff and landing unmanned air vehicle, computer vision, moment invariants, vision-based pose/attitude estimation, target identification/detection, parallel lines, vanishing points, perspective transformation and vision-based autonomous landing.
1 Introduction Using cameras as sensors to perform autonomous navigation, guidance and control for either land or flight vehicles via computer vision has been the latest developments in the area of field robotics. This paper focuses on how a tail-sitter Vertical Takeoff and Landing (V.T.O.L.) Unmanned Air Vehicles (U.A.V.) can use visual cues to aid in navigation or guidance, focusing on the 3D attitude estimation of the vehicle especially during the terminal phase: landing; where an accurate tilt angle estimate of the vehicle and position offset from the landing target is required to achieve safe vertical landing. Work in this particular field focusing on state estimation to aid in landing for V.T.O.L. U.A.V. has been done by a number of research groups around the world. Work of particular interest is the U.S.C. A.V.A.T.A.R. project[1]; landing in unstructured 3D environment using vision information has been reported but only the heading angle of the helicopter could be obtained from those vision information. Work done by University of California, Berkeley[2] focused on ego-motion estimation where at least two or more images are L.-W. Chang, W.-N. Lie, and R. Chiang (Eds.): PSIVT 2006, LNCS 4319, pp. 672 – 681, 2006. © Springer-Verlag Berlin Heidelberg 2006
required to determine the 3D pose of a V.T.O.L. U.A.V. Yang and Tsai[3] have looked at position and attitude determination of a helicopter undergoing 3D motion using a single image, but did not present a strategy for target identification. Amidi and Miller [4] used a visual odometer but the visual data could not be used as stand alone information to provide state estimation in 3D space to achieve landing. This paper will depict how from a single image, target identification is achieved as a V.T.O.L. U.A.V. undergoes 3D motion during hover, just before landing, as well as being able to ascertain 3D pose from the appearance of the landing target which is subject to perspective transformation. The 3D attitudes of the vehicle are determined by investigating parallel lines of the landing target. Target identification is achieved through the calculations of Hu’s invariant moments of the target. Paper Outline: Section 2 will discuss in detail the image processing techniques undertaken, set up of the video camera on the “T-Wing” and the design of the landing pad; section 3 looks into the mathematical Hu’s invariant moments’ theorem applied in order to detect and recognize the landing target, section 4 presents the mathematical techniques used to carry out 3D pose estimation, and section 5 presents and discusses the results for the strategies proposed tested on flight images taken from a U.A.V.; and lastly for Section 6, conclusion and directions for future work are drawn.
2 Vision Algorithm The idea of the vision algorithm is to accurately maintain focus on an object of interest, which is the marking on the landing pad (here on after will be referred to as the landing target), by eliminating objects that are not of interest. The elimination of unwanted objects is achieved by a series of transformation from color to a binary image, filtering and image segmentations. In this section the image acquisition hardware set-up and the set up of the landing pad is firstly introduced and following that is the details of the vision algorithm. 2.1 Image Acquisition Hardware Set-Up The capital block letter “T” is used as the marking on the landing pad to make up the landing target. The idea behind using a “T” is that it is of only one axis of symmetry whereas the more conventional helicopter landing pads, either a circle or the capital block letter of “H”, have more than one axis of symmetry. The one axis symmetry will allow uniqueness and robustness when estimating the full 3D attitudes and/or positions of the flight vehicle relative to Ground Coordinate System which is attached to the landing target. The camera used was a C.C.T.V. camera, the imaging sensor was of a 1/3 Panasonic Color CCD type. The resolution is of 737 horizontal by 575 vertical pixels with a field of view of 95º by 59º and records at 25 Hz. Recording of the images of the on-board camera was achieved by transmission using a 2.4GHz wireless four channel transmitter to a laptop computer where a receiver is connected. The camera was calibrated using online calibration toolbox [5]. The test bed used for the flights was
the T-Wing [6] tail-sitter V.T.O.L. U.A.V. which is currently under research and development at the University of Sydney. The set up of the camera on the flight vehicle is shown on the following simulation drawing and photos:
Fig. 1. Simulation drawing and photos of the camera set-up on the T-Wing
2.2 Image Processing Algorithm The low level image processing for the task of target identification and pose estimation requires a transformation from the color image to a gray-scale version to be carried out first. This basically is the elimination of the hue and saturation information of the RGB image while maintaining the luminance information. Once the transformation from color to gray-scale is achieved, image restoration, i.e. the elimination of noise, is achieved through a nonlinear spatial filter, median filter, of mask size [5 × 5] is applied twice[7] over the image. Median filters are known to have low pass characteristics to remove the white noise while still maintaining the edge sharpness, which is critical in extracting good edges as accuracy of attitude estimation is dependent on the outcome of the line detection. After the white noise is removed, a transformation from gray-scale to binary image is required by thresholding at 30% from the maximum intensity value, in an effort to identify the object of interest: the landing target, which is marked out in white. Despite that, the fact of sun’s presence and other high reflectance objects that lie around the field where the experiment was carried out; it is not always possible to eliminate other white regions through this transformation. Image segmentation and connected component labeling was carried out on the images in an attempt to get rid of the objects left over from gray-scale to binary transformation that are of no interest. These objects are left out by determining that their area is either too big (reflectance from the ground due to sunlight) or too small (objects such as other markings). The critical area values for omitting objects are calculated based on the likely altitude and attitudes of the flight vehicle once in hover, which is normally in between two to five meters and ±10º. Given that the marking “T” is of known area and geometry the objects to be kept in interest should have an area within the range 1500 to 15,000 pixels after accounting for perspective transformation. This normally leaves two or more objects still present in the image, which was the case for about 70% of the frames captured. The Hu’s Invariant Moments theorem is then applied as a target identification process in an attempt to pick out the correct landing target. Once the landing target is determined the next stage is to determine the edges of the block letter “T” in order to identify the parallel lines which are to be used later on for attitude estimation. The edge detection is carried out via the Canny edge detector [8];
the line detection is carried out using the Hough transform [8]. The following figure shows the stages of image processing:
Fig. 2. Image processing stages (from left to right and down): 1st the grayscale transformation, median filtering, binary transformation, rejection of small objects, rejections of large objects, and finally target identification with line detection
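The pipeline described above can be approximated with standard OpenCV calls as below. The 30% threshold, the double 5×5 median filter and the 1500-15,000 pixel area window come from the text, while the Canny and Hough parameters are arbitrary assumptions rather than the values used on the flight vehicle:

```python
import cv2
import numpy as np

def candidate_targets(bgr):
    gray = cv2.cvtColor(bgr, cv2.COLOR_BGR2GRAY)
    for _ in range(2):                                   # 5x5 median filter applied twice
        gray = cv2.medianBlur(gray, 5)
    thresh = 0.3 * float(gray.max())                     # binarise at 30% of the maximum intensity
    binary = (gray >= thresh).astype(np.uint8)
    n, labels, stats, _ = cv2.connectedComponentsWithStats(binary)
    keep = [i for i in range(1, n)
            if 1500 <= stats[i, cv2.CC_STAT_AREA] <= 15000]   # reject too-small / too-large blobs
    return binary, labels, keep

def target_lines(binary_roi):
    edges = cv2.Canny(binary_roi * 255, 50, 150)         # edge map of the selected region
    return cv2.HoughLines(edges, 1, np.pi / 180, 60)     # candidate lines of the "T"
```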
3 Target Identification The landing target identification procedure is accomplished by using the geometric properties of the target. This involves investigating the moment of inertia of the target shape. Hu’s invariant moments[9] are known to be invariant under translation, rotation and scaling in the 2D plane. This feature is very well suited to tasks associated with identifying landing targets. The following equation represents the moments of a 2D discrete function:
m_pq = Σ_i Σ_j  i^p j^q I(i, j)
(1)
where (p + q) represents the order of the moments and I(i, j) is the intensity of an image. The indexes i, j correspond to the image plane coordinate axes x and y respectively. The central moments are the moments of inertia defined about the centre of gravity and are given by the following equation:

μ_pq = Σ_i Σ_j (i − x̄)^p (j − ȳ)^q I(i, j)    (2)

where the indexes have the same meaning as in equation (1), and x̄ and ȳ are the centroids of the target shape.
The normalized central moment is defined as follows:

η_pq = μ_pq / (μ_00)^γ    (3)

where γ = (p + q)/2, for p + q = 2, 3, .... The first four orders of invariant moments can be determined from the normalized central moments; they are as follows:

φ_1 = η_20 + η_02
φ_2 = (η_20 − η_02)^2 + 4η_11^2
φ_3 = (η_30 − 3η_12)^2 + (3η_21 − η_03)^2
φ_4 = (η_30 + η_12)^2 + (η_21 + η_03)^2    (4)
In this paper all four orders of invariant moments were tracked to carry out identification of the landing target while allowing for perspective distortion. An object is considered to be the target if the sum of errors over all four orders is the minimum among all the objects. As the tail-sitter V.T.O.L. U.A.V. approaches the landing phase, the flight vehicle will always undergo 3D motion, whereas the invariant moment method described above is only known to be invariant under 2D scaling, translation and rotation. Therefore it is critical to investigate all facets of the invariant moments. As shown by Sivaramakrishna and Shashidhar [10], it is possible to identify objects of interest from among fairly similar shapes, even if they undergo perspective transformation, by tracking the higher order moments as well as the lower order ones. This method is applied later on to determine the correct target, i.e., target identification.
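A minimal sketch of this selection step is given below. It assumes a set of reference invariant moments for the "T" marking computed offline from noiseless images, as described above, and uses OpenCV's moment routines purely for convenience; the function name and the use of an absolute-error sum are illustrative assumptions rather than the exact error measure of this work.

import cv2
import numpy as np

def identify_target(candidate_masks, reference_phi):
    """Return the index of the blob whose first four invariant moments best match the reference.

    candidate_masks: list of binary images, one per remaining blob.
    reference_phi:   first four invariant moments of the ideal "T" marking.
    """
    best_idx, best_err = None, np.inf
    for idx, mask in enumerate(candidate_masks):
        m = cv2.moments(mask, binaryImage=True)
        phi = cv2.HuMoments(m).flatten()[:4]           # phi_1 ... phi_4
        err = np.sum(np.abs(phi - reference_phi[:4]))  # sum of errors over the four orders
        if err < best_err:
            best_idx, best_err = idx, err
    return best_idx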
4 State Estimation

Due to the inherent instability of V.T.O.L. U.A.V.s near the ground during landing, it is necessary to be able to accurately determine the 3D position and 3D attitude of the flight vehicle relative to the landing target in order to carry out a high performance landing. The information given by the parallel lines of the target is one of the vision-based navigation cues used to obtain a high integrity estimate of the 3D attitude of the flight vehicle. Because of the landing pad marking "T", there are two sets of nominally orthogonal parallel lines; the two sets, each containing only two parallel lines, are named A and B. The parallel lines in set A and set B correspond to the horizontal and vertical arms of the "T" respectively.

4.1 Coordinate Axes Transformation
Before establishing the relationship of parallel lines to determine the vehicle attitude, the transformations between several coordinate systems need to be ascertained. The first coordinate system to be defined is the image plane denoted as the I.C.S. (Image Coordinate System), and then the camera coordinate system (C.C.S.) which
represents the camera mounted on the flight vehicle. Lastly, the flight vehicle body axes are denoted as the vehicle coordinate system (V.C.S.). The line directions of the three flight vehicle body axes in the C.C.S. are known in advance and denoted as dnormal, dlongitudinal and dlateral. To describe the relative orientation between the landing target and the flight platform, the global coordinate system (G.C.S.) is also required. The axes of the G.C.S. are denoted as X, Y and Z. The X axis is parallel with the horizontal bar of the block letter "T" and points to the "T"'s right; the Y axis is parallel with the vertical bar of the "T" and points in the positive down direction. These two axes are situated at the centre point where the vertical and horizontal bars meet rather than at the conventional centre of gravity of the shape. The Z axis therefore points up according to the right hand rule. The transformation from the C.C.S. to the V.C.S. amounts to an axis re-definition according to:

X̃_V.C.S. = [  0   0  −1
               1   0   0
               0  −1   0 ] × X̃_C.C.S.    (5)
4.2 Flight Vehicle Orientation Estimation Via Parallel Lines Information
Moving on to the application of the parallel lines of the target shape to deduce the 3D attitude of the flight vehicle: it is well known that a set of parallel 3D lines intersects at a vanishing point on an image plane due to the perspective transformation. The vanishing point is a property that indicates the 3D line direction of the set of parallel lines [11]. Consider a 3D line L represented by a set of points:

L = {(x, y, z) | (x, y, z) = (p1, p2, p3) + λ(d1, d2, d3) for real λ}    (6)
The line L passes through the point (p1, p2, p3) and has the line direction (d1, d2, d3). The image plane point of a 3D point on the line L can be written as:

(u, v) = (f×x/z, f×y/z) = [f×(p1 + λd1)/(p3 + λd3), f×(p2 + λd2)/(p3 + λd3)]    (7)

where f is the focal length of the camera and λ is the line length parameter. A vanishing point (u∞, v∞) can be detected if λ → ∞ and d3 ≠ 0; that is:

(u∞, v∞) = [f×d1/d3, f×d2/d3]    (8)
The direction of the line L can now be uniquely determined from the above equation to be:

(d1, d2, d3) = (u∞, v∞, f) / √(u∞² + v∞² + f²)    (9)
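As a small numerical illustration of equations (8) and (9), the sketch below intersects two detected image lines, given in Hough normal form (ρ, θ), to obtain the vanishing point and then recovers the unit 3D line direction; the line parameters and the focal length value are placeholders.

import numpy as np

def vanishing_point(line1, line2):
    """Intersect two image lines in Hough normal form (rho, theta): rho = u*cos(theta) + v*sin(theta)."""
    A = np.array([[np.cos(line1[1]), np.sin(line1[1])],
                  [np.cos(line2[1]), np.sin(line2[1])]])
    b = np.array([line1[0], line2[0]])
    return np.linalg.solve(A, b)   # (u_inf, v_inf); ill-conditioned if the two lines are almost identical

def line_direction(u_inf, v_inf, f):
    """Equation (9): unit 3D direction of the set of parallel lines in the C.C.S."""
    d = np.array([u_inf, v_inf, f])
    return d / np.linalg.norm(d)

u_inf, v_inf = vanishing_point((120.0, 0.4), (95.0, 0.6))  # example (rho, theta) pairs
d = line_direction(u_inf, v_inf, f=800.0)                  # f: placeholder focal length in pixels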
With the above theory applied to the problem of pose estimation, the line directions of the X axis, which corresponds to the horizontal bar of the "T", and the Y axis, corresponding to the vertical bar of the "T", in the C.C.S. can first be determined. The line directions are denoted as dx, dy and dz, where the line direction dz is determined by the cross product of dx and dy.
Once the directions of the G.C.S. axes in the C.C.S. are determined, the flight platform's attitude with respect to the landing target can be determined via the following, which takes into account the transformation from C.C.S. to V.C.S.:

cos α = (dx · dLong)/(‖dx‖ × ‖dLong‖),
cos β = (dy · dLat)/(‖dy‖ × ‖dLat‖),
cos γ = (dz · dNormal)/(‖dz‖ × ‖dNormal‖)    (10)

where dx, dy and dz are the directions of the G.C.S. axes in the C.C.S., and α, β, and γ are the angles of the G.C.S. axes in the V.C.S., which correspond to the roll, pitch and yaw angles of the flight vehicle respectively. The following figure shows the relations between the C.C.S., G.C.S. and I.C.S., and the line direction of the vanishing point in the C.C.S.
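Reading the products in equation (10) as normalised dot products, the attitude angles reduce to three angle computations between direction vectors, as in the following sketch; the function names are illustrative and the input vectors are assumed to be expressed in a common frame.

import numpy as np

def angle_between(a, b):
    """Angle (degrees) between two direction vectors via the normalised dot product."""
    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
    return np.degrees(np.arccos(np.clip(c, -1.0, 1.0)))

def attitude_angles(d_x, d_y, d_z, d_long, d_lat, d_normal):
    """Equation (10): alpha, beta, gamma from the G.C.S. axis directions and the body-axis directions."""
    alpha = angle_between(d_x, d_long)    # roll
    beta = angle_between(d_y, d_lat)      # pitch
    gamma = angle_between(d_z, d_normal)  # yaw
    return alpha, beta, gamma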
Fig. 3. Schematic diagram of the relations between I.C.S., C.C.S., and G.C.S., and the line direction of G.C.S. Y axis in the C.C.S.
5 Experimental Results

Flight images taken during the flight testing of the "T-Wing", a tail-sitter V.T.O.L. U.A.V., were used to test the accuracy, repeatability and computational efficiency of the aforementioned target recognition and 3D attitude estimation methods. The flight images were extracted from a 160-second period of the flight vehicle hovering over the target, with the target always kept in sight of the camera viewing angles. Figure 4 shows the three angles between the V.C.S. and the G.C.S. as the flight platform pitches, yaws and rolls during the hover.
Fig. 4. Plots of the Alpha, Beta and Gamma angles ascertained from vanishing points and the respective roll, pitch and yaw angles of the vehicle
The estimated attitudes were compared with filtered estimates from a NovAtel RTK G.P.S. unit, with an accuracy down to 2 cm, and a Honeywell ring-laser gyro with 0.5º accuracy. The attitude estimates obtained from the parallel line information compared favorably, with R.M.S. errors of 4.8º in alpha, 4.2º in beta and 4.6º in gamma. This range of error is deemed good by the standards of other work [12]. Figure 5 shows the computed 1st, 2nd, 3rd and 4th order invariant moments of the landing target, "T", from the images captured during the vehicle hover. Knowing that the invariant moments are only "invariant" under 2D translation, scaling and rotation, all four orders of the invariant moments were tracked. The results show that the errors of the invariant moments computed as the flight vehicle hovers above the target are larger than the errors associated with pure 2D motion; but by tracking the first four orders of invariant moments, the target always had a smaller total normalized error (the sum of the percentage discrepancies of all four orders with respect to the true values obtained from noiseless images, divided by four) than the other objects remaining in the images. The first four orders of invariant moments show that the normalized errors were greater for the periods where the vehicle undergoes greater pitching and yawing than for periods where the vehicle is almost in a perfect vertical hover. There were images where noise was a significant issue, but nevertheless, by tracking all four orders of invariant moments, the landing target could still be distinguished. With regard to computational time, when dealing with two objects, the filtering and thresholding of the images took approximately 11.92% of the time; component labeling and segmentation took about 54.99%; the attitude estimation algorithm needed 4.933% and the invariant moment calculations required 28.16% of the computational time.
Fig. 5. The computed 1st, 2nd, 3rd and 4th order invariant moments (in red --) compared with true values (in blue -); and the total normalized error
6 Conclusion

In this paper, an algorithm is presented that uses computer vision techniques to accurately identify landing targets and, from those landing targets, to obtain attitude estimates of a tail-sitter V.T.O.L. U.A.V. undergoing 3D motion during the hover phase. This method of 3D attitude estimation requires only a single image, which makes it computationally more effective than motion analysis, which requires processing of two or more images. This paper also presented techniques for accurately identifying landing targets via invariant moments while the air vehicle is undergoing 3D motion. The results show good accuracy in target detection and pose estimation compared with previous work in this field. Further development of these techniques could in the future enable autonomous landing of manned helicopters onto helipads and commercial aircraft autonomously detecting runways during landing. A major issue requiring investigation is the estimation of attitude when the landing target is only partially visible in the image plane. We intend in the future to integrate the attitude, position and velocity estimates, acting as guidance information, with the control of the "T-Wing", especially during landing.
Acknowledgements. The authors thank J. Roberts for his technical expertise in setting up the camera system and the wireless transmission for the recordings.
References

[1] S. Saripalli, J. F. Montgomery, and G. S. Sukhatme, "Vision-based autonomous landing of an unmanned aerial vehicle," IEEE International Conference on Robotics and Automation, ICRA'02 Proceedings, Washington, DC, 2002.
[2] O. Shakernia, R. Vidal, C. S. Sharp, Y. Ma, and S. Sastry, "Multiple view motion estimation and control for landing an unmanned aerial vehicle," IEEE International Conference on Robotics and Automation, ICRA'02 Proceedings, Washington, DC, 2002.
[3] Z. F. Yang and W. H. Tsai, "Using parallel line information for vision-based landmark location estimation and an application to automatic helicopter landing," Robotics and Computer-Integrated Manufacturing, vol. 14, pp. 297-306, 1998.
[4] O. Amidi, T. Kanade, and J. R. Miller, "Vision-Based Autonomous Helicopter Research at Carnegie Mellon Robotics Institute 1991-1997," American Helicopter Society International Conference, Heli, Japan, 1998.
[5] Z. Zhang, "Flexible Camera Calibration by Viewing a Plane from Unknown Orientations," International Conference on Computer Vision (ICCV'99), Corfu, Greece, pp. 666-673, September 1999.
[6] R. H. Stone, "Configuration Design of a Canard Configuration Tail Sitter Unmanned Air Vehicle Using Multidisciplinary Optimization," PhD Thesis, University of Sydney, Australia, 1999.
[7] R. C. Gonzalez, R. E. Woods, and S. L. Eddins, Digital Image Processing Using MATLAB, Pearson Prentice Hall, Upper Saddle River, N.J., 2004.
[8] R. C. Gonzalez and R. E. Woods, Digital Image Processing, 2nd ed., Pearson Prentice Hall, Upper Saddle River, N.J., 2002.
[9] M. Hu, "Visual Pattern Recognition by Moment Invariants," IRE Transactions on Information Theory, 1962.
[10] R. Sivaramakrishna and N. S. Shashidhar, "Hu's moment invariants: how invariant are they under skew and perspective transformations?", IEEE WESCANEX 97: Communications, Power and Computing, Conference Proceedings, 1997.
[11] R. M. Haralick and L. G. Shapiro, Computer and Robot Vision, vol. II, Addison-Wesley, 1993.
[12] C. S. Sharp, O. Shakernia, and S. S. Sastry, "A vision system for landing an unmanned aerial vehicle," IEEE International Conference on Robotics and Automation, ICRA'01 Proceedings, 2001.
Detection of Traffic Lights for Vision-Based Car Navigation System Hwang Tae-Hyun, Joo In-Hak, and Cho Seong-Ik Telematics.USN Research Division, ETRI 161 Kajeong-dong, Yuseong-gu, Daejeon, Republic of Korea {hth63339, ihjoo, chosi}@etri.re.kr
Abstract. A recent trend in car navigation systems is to use actual video captured by a camera equipped on a vehicle. Video-based navigation systems display guidance information overlaid onto the video before a crossroad is reached, so it is essential to detect where the crossroads are in the video frame. In this paper, we suggest a detection method for traffic lights that is used for estimating the location of crossroads in the image. The suggested method can detect traffic lights at a long distance and estimates the pixel location of the crossroad, which is important information for visually representing guidance information on video. We suggest a new method for traffic light detection that performs color thresholding, finds the center of the traffic light by a Gaussian mask, and verifies the candidate traffic light using the suggested existence-weight map. Experiments show that the detection method works effectively and robustly for outdoor video and can be used for a video-based navigation system.
1
Introduction
With the advent of telematics as a new field of information technology, car navigation systems (CNS) have been developed as a key application in the telematics field, and have attracted much interest from users and demand from service providers. CNSs have been introduced in the form of PDA-based systems that rely on a 2-dimensional digital map and route planning/guidance functions. Recent service providers have begun to support 2.5D and 3D display. However, navigation systems until now have had limited representation power for the real world, especially in the case of a confusing crossroad or junction. Such a defect may make drivers get lost or even cause traffic accidents. In this respect, navigation systems are now aiming at a more realistic representation of the real world. The newest direction for navigation systems is to use actual video captured by a camera equipped on a vehicle. Examples of early video-based systems are INSTAR of Siemens [1] and VICNAS of Kumamoto University [2]. Such systems intend to enhance drivers' understanding of the real world by capturing real-time video and displaying navigation information (such as directed arrows, symbols, and icons) onto it. Because video is more perceptible than a conventional map (either paper or digital), such video-based navigation systems are expected to reduce users' visual distraction while driving and to make navigation guidance easier
and more convenient. To make video-based navigation possible, navigation information should be overlaid accurately onto the video. One approach is AR (Augmented Reality)-styled display, but such an approach requires a very accurate position/attitude of the vehicle as well as the geographic topography, for which high-cost sensors (including the camera) and large-volume databases are needed. To solve the problem, we suggest a computer vision technology that processes the input video and helps to overlay navigation information onto the video. Considering that the most important function of a navigation system is to guide direction before a crossroad is reached, we focus on detecting the location of the crossroad in the image, as a key to our vision-based approach. Practically speaking, however, the guidance should be given at a long distance before a crossroad. Directly detecting a crossroad at a long distance is quite difficult (even the human eye may not identify it), so we detect traffic lights rather than the crossroad itself. Assuming that there are traffic lights at most crossroads, we can estimate the pixel location of a crossroad if a traffic light is detected. Many detection methods have been suggested for road facilities including traffic lights [3][4][5], but most of these methods are not applicable to vision-based navigation systems because they cannot detect traffic lights at a long distance. To effectively and robustly detect traffic lights for our goal, we suggest a new detection method that can detect traffic lights at a long distance. The remainder of this paper is organized as follows. Section 2 provides a detection method for traffic lights. Section 3 describes how the detection functions work with the car navigation system. In Section 4 we present the experimental results, and Section 5 concludes our suggestions.
2
Detection of Traffic Lights
The method suggested in this paper detects traffic lights by the algorithm shown in Fig. 1. Each process is executed sequentially, as described below.

2.1 Traffic Lights Detection
Color thresholding. In this paper we use the HSI color model to analyze the color of traffic lights. It is a color model based on human vision, and many studies have used color thresholding with the HSI color model [3]. It can be used as a measure for distinguishing the predefined colors of traffic lights. We refer to an official guideline for the colors of traffic lights, published by the National Police Agency of Korea, the competent authority for the management of traffic lights [6]. In this guideline, each color of traffic lights is defined as an area in the CIE chromaticity diagram, in which yellow and red occupy almost the same area. Therefore, we use the same color information when detecting yellow and red traffic lights. After transforming the RGB color model into the HSI color model [7], we execute thresholding with regard to the red and green color components of CIE chromaticity to get candidate colors of traffic lights, as
Fig. 1. Traffic lights detection algorithm
C(x, y) = Red,   if ((H(x, y) < Trl) ∨ (H(x, y) > Tru)) ∧ (S(x, y) > Ts) ∧ (I(x, y) > Ti)
C(x, y) = Green, if (Tgl < H(x, y) < Tgu) ∧ (S(x, y) > Ts) ∧ (I(x, y) > Ti)
C(x, y) = 0,     otherwise    (1)

where C(x, y) is the result of the color thresholding, H(x, y) is the hue value, S(x, y) is the saturation value, and I(x, y) is the intensity value. C(x, y) becomes a set of red (including yellow) and green candidate traffic lights. The four threshold values (Trl, Tru, Tgl, Tgu) are determined in order to check whether the hue values are included in a predefined range. Further, to detect candidate regions of traffic lights whose lights are on, we also use thresholds on saturation (Ts) and intensity (Ti).

Noise Removal. There are many noises in the candidate regions for traffic lights, which come from lens distortion, diffraction of sunlight, vibration of the camera, and background objects with similar color. To remove such noises, a morphology operation is executed, as shown in

M = ((((C ⊕ T) ⊕ T) ⊖ T) ⊖ T)    (2)

where M is the result of the morphology operation, C is the result of the color thresholding (previous step), T is a 3 × 3 mask template whose elements are all 1, ⊕ denotes the dilation operation, and ⊖ denotes the erosion operation. The operation (2) is similar to the closing operation of morphology ((C ⊕ T) ⊖ T), but it is a more empirical method that can effectively remove many irregular noises and color components other than those of traffic lights.

Blob Labeling. The segmented candidates of traffic lights are processed through blob labeling, and are kept as further candidates if the size of the labeled blob meets the criterion
Nl < BA < Nu    (3)

where BA denotes the size of the blob area, and Nl and Nu are the lower and upper bounds on BA, respectively. That is, only blobs whose size is within a certain range are kept as further candidates. The actual size of a traffic light (a lamp) is 30 cm in radius, while its pixel size depends on the distance between the traffic light and the camera, as well as the camera specification (such as focal length). The values of Nl and Nu can be changed dynamically; in our experiments they are set to 30 and 350 respectively.
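The detection stage of Section 2.1, equations (1)-(3), can be sketched as follows. This is an illustrative sketch only: OpenCV's HSV conversion stands in for the HSI model, the hue, saturation and intensity thresholds are placeholders rather than the values derived from the official guideline, and only the 30-350 pixel blob bounds are taken from the text.

import cv2
import numpy as np

def detect_candidates(bgr, area_bounds=(30, 350)):
    # Colour thresholding (eq. 1); HSV stands in for HSI and the thresholds are placeholders
    hsv = cv2.cvtColor(bgr, cv2.COLOR_BGR2HSV)
    h, s, v = cv2.split(hsv)
    lit = (s > 100) & (v > 150)
    red = ((h < 10) | (h > 170)) & lit      # red and yellow share one wrapped hue band
    green = ((h > 45) & (h < 90)) & lit
    mask = ((red | green).astype(np.uint8)) * 255
    # Noise removal (eq. 2): two dilations followed by two erosions with a 3 x 3 template
    T = np.ones((3, 3), np.uint8)
    mask = cv2.erode(cv2.dilate(mask, T, iterations=2), T, iterations=2)
    # Blob labeling (eq. 3): keep only blobs whose area lies between Nl and Nu
    n, labels, stats, _ = cv2.connectedComponentsWithStats(mask, connectivity=8)
    keep = [i for i in range(1, n)
            if area_bounds[0] < stats[i, cv2.CC_STAT_AREA] < area_bounds[1]]
    return labels, stats, keep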
2.2 Traffic Lights Verification
The detected candidate traffic lights are processed through the verification process (see the algorithm shown in Fig. 1). The verification process is executed in the two steps described below: finding the center of the lights and recognizing the traffic lights using the existence-weight map.

Traffic Lights Center Detection. The most useful characteristic of traffic lights is that they emit light themselves. Figure 2 shows the extracted candidates for traffic lights and their normalized distribution with regard to green. We can see that the distribution of traffic lights is similar to a Gaussian distribution (Fig. 2(c) and (d)). Therefore, we estimate the center of a traffic light by applying Gaussian mask filtering to a given candidate traffic light. The size of the Gaussian mask is set to half the size of the candidate region. Because the size of the candidate region varies (according to the distance between the traffic light and the camera), the Gaussian mask has a dynamic size accordingly. We estimate that the point with the largest value of the convolution of the candidate region and the Gaussian mask is the center of the traffic light. We denote the estimated center of the traffic light as (cx, cy). Figure 3(a) shows a candidate traffic light after the labeling process, Fig. 3(b) shows the result of estimating the center and boundary of a traffic light, and Fig. 3(c) shows the Gaussian mask used.

Existence-Weight Map (EWM). In the case of a moving camera, the detection of traffic lights may fail because of camera vibration and image distortion caused by light and the lens. Especially in the case of an interlaced camera, the inconsistency of the scan lines makes the detection more difficult. Considering such difficulties, we use an idea for efficient detection of traffic lights: 'The existence of a traffic light in a frame implies it is very likely that the traffic light also exists at a similar pixel position in the next frame.' So we suggest the Existence-Weight Map (EWM) in order to find a traffic light effectively even in the case of an uncertain or failed result in a single frame. To define the EWM, we first segment the upper half of the image into an irregular-sized rectangular grid, whose cells we call blocks (Fig. 4(a)). We exclude the lower half of the image because it is nearly impossible for traffic lights to appear there with our camera setting. Note that the size of each block depends on the image size, and it is set larger for the upper part of the image because a traffic light in the image moves more rapidly around the upper part than around the center of the image.
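The centre-detection step described above can be sketched as follows, under the assumption that each candidate is handed over as a cropped intensity patch; the kernel-size and sigma choices are illustrative, with only the rule that the mask is about half the region size taken from the text.

import cv2
import numpy as np

def light_center(patch):
    """Estimate (cx, cy) inside a candidate intensity patch as the peak of a Gaussian-mask filtering."""
    k = max(3, (min(patch.shape) // 2) | 1)       # roughly half the region size, forced odd
    sigma = 0.3 * ((k - 1) * 0.5 - 1) + 0.8       # a common default rule for the Gaussian width
    g = cv2.getGaussianKernel(k, sigma)
    gauss = g @ g.T                               # 2-D Gaussian mask
    response = cv2.filter2D(patch.astype(np.float32), -1, gauss)
    cy, cx = np.unravel_index(np.argmax(response), response.shape)
    return cx, cy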
Fig. 2. Distribution of traffic lights: (a) original image; (b) normalized G channel; (c) intensity distribution of (b); (d) Gaussian distribution
Fig. 3. Detecting the center of traffic lights: (a) candidate image; (b) detected center; (c) Gaussian mask for (b)
Suppose there are M × N blocks for each frame t (in our experiments with 720 × 480 images, we have 24 × 15 blocks, as shown in Fig. 4(a)). A weight value is assigned to each block, and we denote by P(x, y, t) the weight value of the (x, y)-th block at the t-th frame. We define the weight value as the possibility of the existence of a traffic light in each block, and calculate each P(x, y, t) as follows. First, all P(x, y, t) are set to zero before starting to scan the frames. Given frame t − 1, let (m, n) be the x- and y-index of the block where the estimated center of the traffic light (cx, cy) exists. Then each P(x, y, t) at the next frame is set to

P(m, n, t) = P(m, n, t − 1) + Ti
P(m, n + 1, t) = P(m, n + 1, t − 1) + Tu
P(m ± 1, n, t) = P(m ± 1, n, t − 1) + Ts
P(m ± 1, n + 1, t) = P(m ± 1, n + 1, t − 1) + Ts
P(x, y, t) = P(x, y, t − 1) − Ts,  ∀ x ∉ {m, m − 1, m + 1} and ∀ y ∉ {n, n + 1}    (4)
where the parameters Ti, Tu, Ts are empirically determined positive numbers with 1 > Ti > Tu > Ts > 0. Figure 4(b) shows an example of the parameter values in (4). These values mean: if the center of a traffic light is found in block (m, n, t − 1), it is most likely to appear in the same location (m, n, t); the upward block (m, n + 1, t) has the second highest possibility; and the four sideward blocks (m ± 1, n, t) and (m ± 1, n + 1, t) have the third highest possibility. This is because the traffic light moves upward or sideward in the next frame as the vehicle is moving. Also, to prevent divergence of P(x, y, t), we limit its value as

0 ≤ P(x, y, t) ≤ 5,  ∀ x, y, t    (5)

Fig. 4. Existence-weight map (EWM): (a) block segmentation; (b) example of parameters
Traffic Lights Recognition. With the EWM mentioned above, we recognize traffic lights as a final step, because the candidate regions processed by the center detection steps may include false candidates as well. We assume that the lamps of traffic lights are distinct in intensity from the region outside them. To use this intensity criterion, we first divide the candidate region from the blob labeling process (see Fig. 3) into an inner part (the lamp part) and an outer part (outside the estimated boundary of the traffic light). Having defined (m, n) as the index of the block containing the estimated center of the traffic light (cx, cy), we recognize the traffic lights using the values of the EWM and the intensity values of the candidate region around the estimated traffic light, by the criterion

Ī(in) + k · P(m, n, t) > Ī(out) · Tr    (6)

where Ī(in) and Ī(out) are the average intensity values (defined as 0∼255) of the inner part and the outer part, respectively. In our experiments, the constants k and Tr in (6) are empirically determined as 6 and 2.5 respectively. Finally, we determine that (cx, cy) is the center of a traffic light if and only if (6) yields true, which is our goal.
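The EWM update of equations (4)-(5) and the final test of equation (6) can be sketched as below. The weight values Ti, Tu, Ts are placeholders that only respect the ordering 1 > Ti > Tu > Ts > 0; k = 6 and Tr = 2.5 follow the text, and the array indexing convention is an assumption.

import numpy as np

T_I, T_U, T_S = 0.5, 0.3, 0.1   # placeholder weights respecting 1 > Ti > Tu > Ts > 0
K, T_R = 6.0, 2.5               # constants of eq. (6) as reported in the text

def update_ewm(P, m, n):
    """One EWM update (eqs. 4 and 5) after a centre was found in block (m, n); P is indexed as P[x_block, y_block]."""
    Q = P - T_S                                  # default decay for all other blocks
    Q[m, n] = P[m, n] + T_I                      # same block: highest weight gain
    if n + 1 < P.shape[1]:
        Q[m, n + 1] = P[m, n + 1] + T_U          # block (m, n + 1): second-highest gain
    for dm in (-1, 1):
        if 0 <= m + dm < P.shape[0]:
            Q[m + dm, n] = P[m + dm, n] + T_S    # sideward blocks: third-highest gain
            if n + 1 < P.shape[1]:
                Q[m + dm, n + 1] = P[m + dm, n + 1] + T_S
    return np.clip(Q, 0.0, 5.0)                  # eq. (5)

def is_traffic_light(mean_in, mean_out, P, m, n):
    """Recognition criterion of eq. (6)."""
    return mean_in + K * P[m, n] > mean_out * T_R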
3
Interface with Car Navigation System
In this section, we present how the traffic light detection functions suggested in Section 2 interface and work with the car navigation system. The architecture of the navigation system combined with the detection module is briefly shown in Fig. 5.
Fig. 5. Architecture of the system
3.1
Sending Notification for Detection
When a guide point (crossroad) comes within a predefined distance, say 300 m, the guidance module sends a message Detect(current, guide, f, road) to the detection module, where current is the current position, guide is the position of the next crossroad, f represents the location and attributes of the traffic lights, and road is the vector data for the road between current and guide. When this message is received, the detection module starts to work with the necessary information. The information is mainly used for reducing the search region and estimating where a traffic light appears in a frame. Further, from current and the position of the traffic light (included in f), the detection module can estimate how large (in pixels) each traffic light appears in the video frame.

3.2 Using Detection Result
After the detection module is executed, the detection results are used to display the navigation information more accurately at the crossroad. More specifically, they are used to estimate the pixel position that corresponds to the real-world crossroad. Again, we assume that traffic lights are located at crossroads. For this, the traffic lights located at a crossroad are logically connected to the crossroad in a database. Moreover, for each traffic light, its size (radius), number of lamps, and height from the ground are also recorded. These values are basically defined in a domestic transportation standard, so we do not need to measure them individually. Once a detection result (the bounding rectangle of the lamp of a traffic light) is given, we estimate the pixel position of the crossroad using the detection result and the height of the traffic light. Let (x, y) be the coordinate of a detection result (the bottom of the bounding rectangle). Then the estimated pixel position (x', y') of the crossroad is calculated by

x' = x,
y' = y + (height × f) / (dist × size)    (7)

where height is the height of the traffic light from the ground, f is the focal length of the camera used, dist is the distance between the traffic light and the camera, and size is
the physical length of a CCD cell of the camera (we can get the values of f and size from the camera specifications). In (7), y' − y is the y-axis pixel length corresponding to the actual length height. After the pixel position that corresponds to the real-world crossroad is determined, we can get an adjusted and more accurate pixel position where a direction indicator should be displayed, even when the simply projected point (with camera pitch and topography not considered) is not accurate.
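A direct transcription of equation (7) might look like the following sketch; consistent metric units are assumed for the focal length, distance, traffic-light height and CCD cell size, and the grouping of the fraction follows the definitions given above.

def crossroad_pixel(x, y, height, f, dist, cell_size):
    """Estimate the crossroad's pixel position from a detected traffic light (eq. 7).

    x, y      : bottom of the traffic light's bounding rectangle (pixels)
    height    : height of the traffic light above the ground (m)
    f         : focal length of the camera (m)
    dist      : distance between the traffic light and the camera (m)
    cell_size : physical length of one CCD cell (m)
    """
    dy = (height * f) / (dist * cell_size)   # vertical offset in pixels
    return x, y + dy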
4
Experiments
The experimental environment for our research is a software application with simulation data. The CPU is an Intel Pentium 4 processor at 3.0 GHz, and the size of main memory is 1024 MBytes. The simulation data (video, location, and traffic lights) were collected in two road sections in Seoul and Daejeon, both of which are about 5 km long. The video is captured at 720 × 480 resolution and 30 fps. An application that looks like a video player was developed as a test platform for our experiments. The detection ratio of the traffic light detection module is shown in Fig. 6. In the case of distances of 130 m or longer, the detection ratio is low because the traffic light appears very small (BA is lower than 10); in the case of 90 m, the ratio is about 80% (BA is about 40). Example detection results with regard to the distance between the traffic lights and the camera are shown in Fig. 7. The average processing time of the traffic light detection module for each frame is 70.86 msec. Although we do not resize the original image, a processing rate of about 15 fps is possible.
Fig. 6. Detection ratio
With the detected traffic lights and the estimated location of the crossroad, we display an adjusted direction indicator on the video frame in a prototype video-based navigation system. An example pictorial comparison between the simply projected direction indicator and the direction indicator adjusted by traffic light detection is shown in Fig. 8.
Fig. 7. Detection results: input image and detection result at distances of 120 m, 100 m, 80 m, 60 m, 40 m, and 30 m
Fig. 8. Adjusted display of the direction indicator: (a) projected direction indicator; (b) adjusted by traffic light detection
5
Conclusion
In this paper, we suggested a detection method for traffic lights that can be effectively used for video-based car navigation systems. For video-based navigation systems that display navigation information overlaid onto video when approaching a crossroad, it is essential to find the exact location of the crossroad in the image, because it is important for visually representing guidance information on video. However, because this is very difficult in actual situations, we suggested a method that effectively detects traffic lights instead of the crossroad itself and estimates the location of the crossroad from the result. We detect candidate traffic lights by a color thresholding, noise removal, and blob labeling process, and then find the center of the traffic lights by a Gaussian mask, which is verified by the suggested Existence-Weight
Map. Experiments show that the suggested method works at a long distance before a crossroad for outdoor video and can be effectively used for video-based navigation systems. Because video is regarded as a new medium for future car navigation systems, computer vision technology will be used more and more in the car navigation industry. In the future, more computer vision technologies will be required for other objects that appear in video captured from a vehicle, such as roadside facilities, buildings, road lanes, or other vehicles. We should develop more robust computer vision methods based on analyses of the demands of applications in the field of car navigation systems.
References 1. W. Narzt, G. Pomberger, A. Ferscha, D. Kolb, R. Muller, J. Wieghardt, H. Hortner, C. Lindinger, ”Pervasive Information Acquisition for Mobile AR-Navigation Systems,” 5th IEEE Workshop on Mobile Computing Systems & Applications, Monterey, California, USA, October 2003, pp.13-20. 2. Zhencheng Hu, Keiichi Uchimura, ”Solution of Camera Registration Problem Via 3D-2D Parameterized Model Matching for On-Road Navigation,” International Journal of Image and Graphics, Vol.4, No.1, 2004, pp.3-20. 3. Arturo de la Escalera, Jose Maria Armingol, Jose Manuel Pastor, and Francisco Jose Rodriguez, ”Visual Sign Information Extraction and Identification by Deformable Models for Intelligent Vehicles,” IEEE Transactions on Intelligent Transportation Systems, Vol.5, No.2, Jun 2004. 4. Zhuowen Tu and Ron Li, ”Automatic Recognition of Civil Infrastructure Objects in Mobile Mapping Imagery Using Markov Random Field,” Proc. of ISPRS Conf. 2000, Amsterdam, Jul 2000. 5. Tae-Hyun Hwang, Seong-Ik Cho, Jong-Hyun Park, and Kyuong-Ho Choi, ”Object Tracking for a Video Sequence from a Moving Vehicle: A Multi-modal Approach,” ETRI Journal, Vol.28, No.3, 2006, pp.367-370. 6. National Police Agency of Korea, The Standard Guideline for Colors of Traffic Signs, 2004. 7. R. C. Gonzalez and R. C. Woods, ’Digital Image Processing,’ Addison-Wesley, 1992.
What Is the Average Human Face? George Mamic, Clinton Fookes, and Sridha Sridharan Image and Video Research Laboratory Queensland University of Technology GPO Box 2434, 2 George Street, Brisbane, Queensland 4001 {g.mamic, c.fookes, s.sridharan}@qut.edu.au
Abstract. This paper examines the generation of a generic face model from a moderately sized database. Generic models of the human face have been used in a number of computer vision fields, including reconstruction, active appearance model fitting and face recognition. The model constructed in this paper is based upon the mean of the squared errors generated by comparing average faces calculated from two independent and random samplings of a database of 3D range images. This information is used to determine the average amount of error that is present at given height locations on the generic face, as a function of the number of samples considered. These results are then used to partition the generic face into sub-regions where the greatest variations occur in the generic face models.
1
Introduction
Today's smart security solutions often employ the use of biometrics, such as fingerprints or iris scans, as an integral part of the overall system. This is largely due to the increased demand on modern society to identify or authenticate an individual with much stronger certainty. The uniqueness of face recognition technology that distinguishes it from the other cohort of biometrics is that images of the face can be captured quickly and, in the majority of situations, non-intrusively. Face images also have the potential to operate in noisy crowded environments where other biometrics may fail. Unfortunately, traditional 2D methods for face verification have struggled to cope with changes in illumination and pose. This has led researchers to the use of three-dimensional (3D) facial data in order to overcome these difficulties [1]. One of the key components of 3D face recognition is the ability to reconstruct the 3D human face and estimate relevant models from the captured sensor data. This step is not only important for face recognition, but also for other computer graphics and machine vision fields including entertainment, medical image analysis and human-computer interaction applications [2]. This reconstruction and modeling process often employs the use of a generic face model or generic face mesh as a tool that assists in developing solutions to this challenging problem. Observation of the numerous and varied approaches to solving the face modeling problem has revealed that little thought is given to the actual generic face
model that is employed. Consequently, there are almost as many generic face models in the literature as there are researchers who have investigated the problem. Some face models are constructed from a set of training range images; others use active appearance models [3] or morphable models [4]; and some researchers even employ their own face as a "generic face model" which is subsequently fitted, deformed, or applied to other people's faces. The latter, obviously, is not the optimum approach to adopt. This paper attempts to answer the question: what is the average human face and does an average model actually exist? Such questions are rarely asked, but the answers may be of particular value to a wide range of face modeling and recognition papers. If such a generic/average model does exist, what are the constraints on this model and what are its characteristics? Answering such questions will allow more intelligent application of generic face models in face reconstruction applications. This paper examines the calculation of the generic face mesh in the following manner. Section 2 presents some existing work in the field and demonstrates how generic face meshes are used by numerous researchers in the face recognition field. Section 3 presents the preprocessing steps that were conducted on the database of 3D images used in the construction of the generic face mesh and related experiments. The database employed in this paper is the second version of the "Face Recognition Grand Challenge" (FRGC v2.0) database [5]. Section 4 presents the error metric that was used to determine the number of required models. Section 5 presents the results of the testing of the generic 3D face mesh, and this is followed by the Conclusion, which is presented in Section 6.
2
Related Work
There are numerous methods to capture 3D information of a human face. Methods include a range of both passive and active sensors such as stereo [6–8], shape from shading [9], structured light [2], shape from silhouettes, laser scanners [5], or structure from motion methods [10]. Some of these methods obtain the 3D information directly while others depend on methods to reliably determine correspondences between images containing human faces. To perform the latter approach in an automatic fashion is extremely problematic. The majority of these problems stem directly from the human face itself, as faces are non-planar, non-rigid, non-lambertian and are subject to self occlusion [11]. Due to these issues and the uniform appearance of large portions of the human face, passive techniques such as stereo can not always accurately determine 3D face shape. Active sensors, although they can produce higher resolution and more accurate data, achieve this with more expensive equipment and at the cost of projecting light or other visible patterns onto the subjects face. Many passive techniques depend on extensive manual guidance to select specific feature points or determine correspondences between images [12]. Fua and Miccio [8] use a least-squares optimisation to fit a head animation model to stereo data. Although powerful, their method requires the manual selection of
key 2D feature points and silhouettes in the stereo images. Ypsilos et al. [2] also fit an animation face model to stereo data. However, the stereo data is combined with a projected infrared structured light pattern allowing the simultaneous capture of 3D shape information as well as colour information. Ansari et al. [13] fit a face model to stereo data via a Procrustes analysis to globally optimise the fit, then a local deformation is performed to refine the fitting. Some methods use optical flow [14] or appearance-based techniques [10] to overcome the lack of texture in many regions in the face. These approaches, however, are often hampered by small deformations between successive images and issues associated with non-uniform illumination. Differential techniques attempt to characterise the salient facial features such as the nose, the mouth and the orbits using extracted features such as crest lines and curvature information. Lengagne et al. [15] employ a constrained mesh optimisation based on energy minimisation. The energy function is a combination of two terms, an external term, which fits the model to the data and an internal term, which is a regularisation term. The difficulty with differential techniques is in controlling the optimisation as noisy images can wreak havoc with estimating differential quantities, whilst regularisation terms have the potential to smooth out the very features that are most important to recognition algorithms. Statistical techniques fit morphable models of 3D faces to images using information that is gathered during the construction of the generic face mesh. Blanz and Vetter [16] present a method for face recognition, by simulating the process of image formation in 3D space, fitting a morphable model of 3D faces to images and a framework for face identification based upon model parameters defined by 3D shape and texture. The developed system is a complete image analysis system provided that manual interaction in the definition of feature points is permissible. Koterba et al. [3] however, studied the relationship between active appearance model fitting and camera calibration. They use a generic model as a non-rigid calibration grid to improve the performance of multi-view active appearance model fitting.
3
Face Mesh Preprocessing
The first stage in the investigation of the construction of the generic face mesh involved the development of an appropriate database of face images. Sets of range images were taken from the FRGC v2.0 database [5] and each of the images was then subjected to a three-stage pre-processing. This data normalisation stage is an integral step for any face recognition application, particularly for 3D data. It is rarely noted, but the 3D modality has high levels of noise with often extreme outliers that must consequently be removed. Another issue rarely raised with 3D face recognition is that there is a pose dependence. The angle of the individual to the laser scanner (or other capture device) will cause self-occlusion, unless the capture equipment can gather a full 3D face image. However, even with these issues, 3D face data is considered to be less susceptible to noise than 2D images.
The first stage of the pre-processing pose-normalised the data such that the position and direction of the normal vector of the plane formed by the centre of the eyes and the tip of the nose was identical across all the images. The second stage filtered and interpolated across areas of the face that contained substantial errors in the range image, as is caused when taking range measurements in the eyeball area and the nostrils. The final stage of the pre-processing cropped all the images so as to only contain information from the forehead of the face down to the top lip of the subjects. This was done primarily to reduce any errors that may be introduced by variations in the mouth region that may have occurred during the image capture process. These stages are outlined in more detail as follows [17]. The 3D data is assumed to be point cloud data on a semi-regular x- and y-grid; there is limited variation along the grid lines. Every image has four landmark points: the right eye corner, left eye corner, nose tip and chin. Depth normalisation is conducted by using three of these landmark points (right eye corner, left eye corner and chin) to compensate for in-plane and out-of-plane rotations; the three landmark points are rotated to reside in the plane. The face data is then cropped based on the two eye corners, and the erroneous data in this cropped region is removed using a gradient filter (Algorithm 1) in the x and y domains respectively, Znorm1 = grad_filt(Z, ∂x, N) and Znorm2 = grad_filt(Znorm1, ∂y, N); the median filter used is defined by Algorithm 2. Here Z is the cropped depth image, N is the size of the median filter, C is the cutoff point for the gradient filter, i refers to the columns in the image and j refers to the rows in the image.
Algorithm 1. Gradient Filter (grad_filt(Z, ∂n, N))

  Zn = ∂Z/∂n
  if Zn(i, j) > C then
    Z(i, j) = error
  end if
  for i = 1 : num_cols(Z) do
    for j = 1 : num_rows(Z) do
      Z(i, j) = med_filt(Z, i, j, N)
    end for
  end for
For this work it was found that a value of C = 10 was appropriate and that a median filter of size N = 5 was sufficient for images larger than 128 × 128 pixels. This valid data, Znorm2, was re-interpolated onto a regular grid of 128 × 128 pixels; an example output of this procedure is provided in Figure 1. In addition to filtering and registration, the 3D data must also be range normalised. Range normalisation consists of setting the maximum value to 255 (the maximum value for a normal 2D image) and adjusting all other values to be relative to this value.
Algorithm 2. Median Filter (med_filt(Z, i, j, N))

  start_i = i − N, stop_i = i + N
  start_j = j − N, stop_j = j + N
  if start_i < 1 then
    start_i = 1
  else if stop_i > num_columns(Z) then
    stop_i = num_columns(Z)
  end if
  if start_j < 1 then
    start_j = 1
  else if stop_j > num_rows(Z) then
    stop_j = num_rows(Z)
  end if
  Z_sub_region = Z(start_i : stop_i, start_j : stop_j)
  Z(i, j) = median(Z_sub_region ≠ error)
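A NumPy transcription of Algorithms 1 and 2 is sketched below for illustration; NaN plays the role of the 'error' flag, the absolute gradient is thresholded, and the function names mirror the pseudocode but are otherwise assumptions.

import numpy as np

def med_filt(Z, i, j, N):
    """Median of the valid (non-NaN) values in the (2N+1) x (2N+1) window around column i, row j."""
    r0, r1 = max(j - N, 0), min(j + N + 1, Z.shape[0])
    c0, c1 = max(i - N, 0), min(i + N + 1, Z.shape[1])
    return np.nanmedian(Z[r0:r1, c0:c1])

def grad_filt(Z, axis, N=5, C=10.0):
    """Flag samples whose gradient along `axis` exceeds C, then median-filter every sample."""
    Z = Z.astype(float).copy()
    Zn = np.gradient(Z, axis=axis)
    Z[np.abs(Zn) > C] = np.nan          # mark erroneous range samples
    out = np.empty_like(Z)
    for j in range(Z.shape[0]):         # rows
        for i in range(Z.shape[1]):     # columns
            out[j, i] = med_filt(Z, i, j, N)
    return out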
Fig. 1. A mesh plot of a cropped and interpolated 3D face image (z-axis: depth)
The final database was composed of 466 individuals represented by 128 × 128 images. Each image has the spatial co-ordinates available in X, Y and Z matrices and the position of the eye centres and nose tip was available in a separate structure.
4
Generic Face Experiment Design
The investigation of the average face was done by drawing two random and independent samples of the database of range images R that was constructed in the preprocessing stage. The random sampling was performed using a random
number generator which generates uniformly distributed pseudo-random numbers. The state of the random number generator was altered with every iteration. For a given set S of n randomly selected range images, where S ⊂ R, the mean range distances M1 were calculated by

M1(X, Y) = (1/n) ∑_{i=1}^{n} S_i(X, Y)    (1)

A second set T of n random range images, where T ⊂ Q and Q = R − S, was then chosen and the mean range distances M2 of this set of images were calculated in the same manner. To compare the two 'average' faces calculated from set S and set T, a squared error matrix E² was calculated using

E² = [M1 − M2]²    (2)

The final matrix E² provides the squared error that exists at each point on the surface for the particular pair of randomly generated mean images. This procedure for generating the matrix E² was then repeated for 10,000 simulation runs. The collection of squared error matrices was then used to determine the average squared error across all simulation runs as

E²_ave = (∑ E²) / 10000    (3)
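The sampling experiment of equations (1)-(3) can be sketched as follows, assuming the database is held as an array R of shape (number of faces, 128, 128); the function name and the reduced default number of runs are illustrative choices.

import numpy as np

def average_face_error(R, n, runs=1000, seed=None):
    """Mean of the squared-error maps between two disjoint n-face averages (eqs. 1-3).

    R must have shape (num_faces, 128, 128) with num_faces >= 2 * n.
    """
    rng = np.random.default_rng(seed)
    acc = np.zeros(R.shape[1:], dtype=float)
    for _ in range(runs):
        idx = rng.permutation(R.shape[0])
        S, T = idx[:n], idx[n:2 * n]   # two independent, disjoint samples
        M1 = R[S].mean(axis=0)         # eq. (1)
        M2 = R[T].mean(axis=0)
        acc += (M1 - M2) ** 2          # eq. (2)
    return acc / runs                  # eq. (3), averaged over the runs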
5
Results and Discussion
To investigate the MSE versus the number of models present in the database, the experiment detailed in the previous section was repeated for n = 1...233 (i.e., half the total database size of 466, due to the creation of two generic face models being compared). For every value of n the experiment was conducted 10,000 times and the MSE at each observed range measurement was calculated. The results of this set of experiments are presented in Figure 2. Figure 2 shows that there is an inverse relationship between the number of models and the amount of mean squared error present between the average faces that are calculated. As this is an inverse relationship, it indicates that a truly average face would be achieved only with an infinitely large database. However, by examining the mean error that exists at any given location between the average faces calculated for a given sample size, it is possible to calculate the percentage mean squared error that can be expected at given locations across an average face. These results are presented in Table 1. As seen from Table 1, once more than sixty models are chosen to calculate the average face, the average squared error at a given location on the face varies by 5% or less relative to another average face calculated using the same number of models. This result should serve as a guide as to how many faces should be 'averaged' when calculating a generic face to represent the faces that are present
Fig. 2. Plot of MSE versus the number of faces in the model averaged (y-axis: MSE; x-axis: number of models averaged)
in a database. This can also be used as a starting point in reconstruction and calibration algorithms which require a generic model for initialisation. To further investigate the relationship between the MSE and the 'average face', the MSE surface as it exists on the average face was plotted for the case where 233 models were considered. This relationship is displayed in Figure 3. Figure 3 shows that the bulk of the MSE between the average models was located at the tip of the nasal region and around the brow area, whereas the cheek and forehead regions had far lower values of MSE associated with them. This result indicates that the most important information for applications such as recognition and face reconstruction occurs in these areas, and thus algorithms which are time-critical in their application should focus on these regions for salient parameter calculations. This result is also important for any coding standards that are produced for faces, as any meshes used to represent the face should concentrate their vertices in the nasal and brow areas. To further illustrate this point, a face that contains 95% of the MSE that exists when 233 models are considered can be represented as shown in Figure 4. The results presented in this paper are empirical in nature. A formal hypothesis testing procedure was constructed using the Bonferroni correction as the mechanism by which every range point on the image could be tested simultaneously. Although this framework was accurate in a synthetic
Fig. 3. Plot of MSE variations over the generic face model surface
Fig. 4. Plot of face regions that contain 95% of the MSE that exists in the generic face model when 233 faces are utilised
environment, the variations and interpolation errors for the eye and nostril regions that are present in a database of real faces provided inconsistent results. This is currently the focus of future research, and it is hoped that a formal framework for facial analysis will eventually be constructed.
Table 1. Percentage MSE versus the number of models

No. Models    % Error
10            31.71%
15            20.57%
20            16.05%
24            13.50%
30            11.07%
34             9.58%
39             7.52%
47             6.78%
59             5.34%
78             4.10%
117            2.69%
233            1.36%

6 Conclusion
This paper examined the generation of a generic face model from a moderately sized database. The model constructed was based upon the mean of the squared errors generated by comparing average faces calculated from two independent and random samplings of a database of 3D range images. This information was used to determine the average amount of error present at given height locations on the generic face, as a function of the number of samples considered. Experimental results demonstrated that at least sixty faces need to be incorporated into a generic face model in order to obtain a mean squared error of 5% or less relative to another average face calculated using the same number of models. The empirical evidence presented has demonstrated that a generic or 'average' face model does exist down to an acceptable level of error. The constructed model also revealed that 95% of the errors within the average model are contained within the brow and nose regions of the face. The presented empirical investigation of the generic face model will allow for more intelligent application of generic face models in face reconstruction and recognition applications.
References 1. Bowyer, K.C.K.W., Flynn, P.J.: A survey of approaches to three-dimensional face recognition. Technical report, University of Notre Dame (2003) 2. Ypsilos, I., Hilton, A., Rowe, S.: Video-rate capture of dynamic face shape and appearance. In: Proceedings of the Sixth IEEE International Conference on Automatic Face and Gesture Recognition. (2004) 3. Koterba, S., Baker, S., Matthews, I., Changbo, H., Xiao, J., Cohn, J., Kanade, T.: Multi-view aam fitting and camera calibration. In: Tenth IEEE International Conference on Computer Vision. (2005) 511–518 4. Blanz, V., Vetter, T.: A morphable model for the synthesis of 3d faces. In: SIGGRAPH. (1999)
5. Phillips, J., Flynn, P., Scruggs, T., Bowyer, K., Chang, J., Hoffman, K., Marques, J., Min, J., Worek, W.: Overview of the face recognition grand challenge. In: Proceedings of IEEE Conference of Computer Vision and Pattern Recognition. Volume 1. (2005) 947–954 6. Kuo, C., Lin, T.G., Huang, R.S., Odeh, S.: Facial model estimation from stereo/mono image sequence. IEEE Transactions on Multimedia 5(1) (2003) 8–23 7. Chen, Q., Medioni, G.: Building 3d human face models from two photographs. Journal of VLSI Signal Processing 27 (2001) 127–140 8. Fua, P., Miccio, C.: Animated heads from ordinary images: A least-squares approach. Computer Vision and Image Understanding 75(3) (1999) 247–259 9. Choi, K., Worthington, P., Hancock, E.: Estimating facial pose using shape-fromshading. Pattern Recognition Letters 23(5) (2002) 533–548 10. Kang, S.: A structure from motion approach using constrained deformable models and apperance prediction. Technical Report CRL 97/6, Cambridge Research Laboratory (1997) 11. Baker, S., Kanade, T.: Super-resolution optical flow. Technical Report CMU-RITR-99-36, Robotics Institute, Carnegie Mellon University (1999) 12. Pighin, F., Hecker, J., Lischinski, D., Szeliski, R., Salesin, D.: Synthesizing realistic facial expressions from photographs. In: Proceedings of SIGGRAPH Computer Graphics. Volume 26. (1998) 75–84 13. Ansari, A.N., Abdel-Mottaleb, M.: 3d face modelling using two views and a generic face model with application to 3d face recognition. In: Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance. (2003) 14. DeCarlo, D., Metaxas, D.: Deformable model-based shape and motion analysis from images using motion residual error. In: International Conference on Computer Vision, India (1998) 113–119 15. Lengagne, R., Fua, P., Monga, O.: 3d stereo reconstruction of human faces driven by differential constraints. Image and Vision Computing 18 (2000) 337–343 16. Blanz, V., Vetter, T.: Face recognition based on fitting a 3d morphable model. IEEE Transactions on Pattern Analysis and Machine Intelligence 25(9) (2003) 1063–1074 17. McCool, C., Chandran, V., Sridharan, S.: 2d-3d hybrid face recognition based on pca and feature modelling. Proceedings of the 2nd Workshop on Multimodal User Authentication (2006)
An Improved Real-Time Contour Tracking Algorithm Using Fast Level Set Method

Myo Thida¹, Kap Luk Chan², and How-Lung Eng³

¹,² Center for Signal Processing, Division of Information Engineering, School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore 639798
³ Institute for Infocomm Research, 21 Heng Mui Keng Terrace, Singapore 119613
[email protected], [email protected], [email protected]

Abstract. The human contour provides important information in high-level vision tasks such as activity recognition and human-computer interaction, where detailed analysis of shape deformation is required. In this paper, a real-time region-based contour tracking algorithm is presented. The main advantage of our algorithm is the ability to track nonrigid objects such as humans using an adaptive external speed function with the fast level set method. The weighting parameter λi is adjusted to accommodate situations in which the object being tracked and the background have similar intensity. Experimental results show the better performance of the proposed algorithm with adaptive weights.
1 Introduction
Successful human contour tracking plays a key role in identifying people and in recognizing or understanding human behaviors. Early work on human tracking concentrated on feature points such as corner points of moving silhouettes or the centroid of a rectangular box bounding the tracked person. However, feature-based methods are not appropriate when the shape of the object changes dynamically, as the shape cannot easily be related to the positions of feature points in subsequent frames. In addition, the human contour provides important information in high-level vision tasks such as activity recognition and human-computer interaction, where a detailed analysis of shape deformation is required. The active contour model, also known as the Snake model, was proposed by Kass et al. [7] in 1987 and has been widely used in edge detection, shape modeling, segmentation, and motion tracking. However, the traditional active contour approach encounters two major problems: the mathematical model in the snake method cannot deal with topological changes without adding external mechanisms, and it is very sensitive to the initial contour position and shape. The Level Set Method introduced by Osher and Sethian [12] overcomes these problems. Using the level set formulation, complex curves can be detected and tracked, and topological changes of the curves are handled naturally. Considerable research has been reported on applying the Level Set Method in moving
object detection and tracking. Paragios and Deriche [10] proposed a level-set-based geodesic framework for object tracking in which the speed function is defined in terms of the image gradient, ∇I. However, tracking using the gradient image fails when the boundary between the background and the object being tracked becomes ambiguous. By generalizing the region competition idea [13], Mansouri [8] proposed a Bayesian level set tracking method in which the tracked object and the background are assumed to have distinct luminance characteristics. In [9], Paragios and Deriche combine region-based and edge-based information. Recently, many researchers have introduced shape information into the level set method [1], [5]. However, the use of the Level Set Method has been limited by its high computational complexity. The major challenges of using the level set method in object tracking are (1) reducing the computational cost of the implementation and (2) constructing an adequate speed function that governs the curve deformation. In [11], a new implementation of the level set method is proposed that achieves real-time performance. In this paper, a novel method for human contour tracking using the fast level set implementation is proposed, where the speed function is defined based on the Chan-Vese model with adaptive weighting parameters λi. Compared to [11], which mainly focuses on implementing the level set method in real time, our main objective is to construct an adequate speed function that fits the fast level set implementation, so that our tracking algorithm is robust to weak object boundaries while achieving real-time performance. In the next section, we briefly review the level set method and discuss the fast level set implementation of [11]. Section 3 presents the proposed algorithm and explains the formulation of the external speed function in detail. Finally, experimental results and concluding remarks are given in Sections 4 and 5.
2 Level Set Method
The level set method is a numerical technique for tracking a propagating interface whose topology changes over time. The method has been widely used in many areas, including computer vision and image processing, since the seminal work of Osher and Sethian [12]. The key attraction of the level set method over the snake method is its ability to handle changes in curve topology in a natural way. In addition, the level set method is not as sensitive to the initial contour position and shape as the snake method is. This makes the level set method more suitable for detecting and tracking multiple objects simultaneously while handling topological changes naturally. The level set method embeds the initial position of the moving interface as the zero level set of a higher-dimensional function φ(x, y, t) [12]. The evolution of this implicit function is then tracked over time. Finally, the position of the moving interface at any time t is given by the zero level set of the time-dependent level set function, that is, C(x, y, t) = {(x, y) | φ(x, y, t) = 0}. As the interface moves over time, the value of the function φ at each grid point (x, y) must be updated according to the speed function F. The speed function F can be generated based on global
properties of the image or local properties such as curvature. Regardless of the properties that generate the speed function, it can be assumed that the speed function acts in the direction perpendicular to the interface, since the tangential component does not affect the position of the interface. In order to represent the evolving interface as the zero level set of the implicit function φ, the following condition must hold at any time t:

φ(p(x(t), y(t)), t) = 0    (1)

Taking the derivative using the chain rule, we have

φt + ∇φ(p(t), t) · p'(t) = 0    (2)

Let n be the local unit (outward) normal to the interface. Then n can be written in terms of φ as

n = ∇φ / |∇φ|    (3)

and the speed F, which is perpendicular to the interface, is

F = p'(t) · n    (4)

Hence, equation (2) becomes

φt + F |∇φ| = 0,   φ(x, y, 0) = φ0(x, y)    (5)

where the set {(x, y) | φ0(x, y) = 0} defines the initial contour.

2.1 Speed Function, F
The performance of the level set method depends on the speed function F that governs the model deformation. In the classical approach, the level set speed function depends on the image gradient ∇I. For example, the Geodesic Active Contours model [10] sets the speed function F as

F = g(|∇I|)(κ + ν)    (6)

where g(|∇I|) is commonly defined as

g(|∇I|) = 1 / (1 + |∇Gσ ∗ I|^p)    (7)

with p = 1 or p = 2. Gσ denotes a Gaussian convolution filter with standard deviation σ. The speed decreases towards zero at locations of high gradient. The main assumption of this edge-based model is that the object being tracked has strong intensity contrast with the background, which constrains the applicability of the gradient-based level set method. This has led researchers to consider region-based energies, which rely on information provided by the entire region, such as color, texture, and motion-based properties [2]. Motivated by [13], speed functions that consider the probability density functions of the regions inside and outside the structures of interest have been proposed [1], [8]. This class of speed functions is well suited to situations where the intensity distributions of the regions match the assumed intensity models.
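For concreteness, the sketch below shows one way the edge-stopping term of Eq. (7) could be evaluated for an image; the values of σ and p here are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def edge_stopping_speed(image, sigma=1.5, p=2):
    """g(|grad I|) = 1 / (1 + |grad(G_sigma * I)|^p), as in Eq. (7)."""
    smoothed = gaussian_filter(image.astype(float), sigma)  # G_sigma * I
    gy, gx = np.gradient(smoothed)                          # spatial gradient
    grad_mag = np.hypot(gx, gy)
    return 1.0 / (1.0 + grad_mag ** p)
```

The returned map is close to one in homogeneous regions and drops towards zero near strong edges, which is what slows the curve in Eq. (6).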
2.2 Implementation Technique
In the traditional approach, the level set equation in (5) is solved by replacing time derivatives with finite differences and spatial derivatives with forward and backward difference approximations. However, this direct implementation of the level set method is computationally very expensive, and the Narrow Band approach, in which the computation is limited to the neighborhood pixels of the zero level set, has recently become popular [1]. Even so, the computational cost remains considerable and real-time performance is not achieved. Shi et al. [11] proposed a fast implementation technique for the level set method. Their approach differs from the Narrow Band method in that the curve is evolved without solving a partial differential equation, although the evolving curve is still represented by the level set function φ. The computational cost drops dramatically and real-time performance can be achieved.

Fast Level Set Implementation Approach. In the level set method, the boundary curve C is represented implicitly as the zero level set of a function φ. In the classical level set method, the moving object is tracked by solving the following partial differential equation:

φt + F |∇φ| = 0    (8)

Given the initial contour position, the value of φ(x, y, t = 0) is initialized. Generally, φ is defined as a signed distance function which is negative inside the curve C and positive outside C. The value of φ is then adjusted based on the speed function F, which specifies how contour points move in time. By examining the neighboring points along the contour, it can be seen that the values of φ at neighboring grid points inside C will switch from negative to positive if the curve moves inward, and vice versa if the curve moves outward. Based on these observations, the fast level set method defines two lists of neighboring grid points, Lin and Lout, for the boundary curve C as follows:

Lin = {x | φ(x) < 0 and ∃y ∈ N4(x) such that φ(y) > 0}
Lout = {x | φ(x) > 0 and ∃y ∈ N4(x) such that φ(y) < 0}

where N4(x) is the 4-connected discrete neighborhood of a pixel x. The value of φ is chosen from the limited set of integers {−3, −1, 1, 3} and is defined as

φ(x) = 3, if x is an exterior point;  1, if x is in Lout;  −1, if x is in Lin;  −3, if x is an interior point.    (9)

where the exterior points are those pixels outside C but not in Lout, and the interior points are those pixels inside C but not in Lin. The curve C is then evolved at pixel resolution by simply switching neighboring pixels between the two lists
Lin and Lout based on the speed function F. In general, the speed function F is composed of an external and an internal image-dependent speed:

∂φ/∂t = (Fext + Fint)|∇φ|    (10)

The evolution due to the data-dependent speed Fext and the smoothness regularization due to Fint are separated into two cycles. The first cycle moves the curve toward the desired boundary based on the external speed Fext, and the second cycle smooths the curve with Fint. In general, Fext is a data-dependent speed defined as

Fext > 0, if a pixel lies inside the object;  Fext < 0, otherwise.

According to the definition of Lin and Lout, the external speed values should be positive for all pixels in Lin and negative for all pixels in Lout if the curve is at the object boundary. Based on this observation, we switch a pixel from Lout to Lin if F > 0, and vice versa if F < 0. The iteration process stops (a) if the speed at each neighboring grid point satisfies

F(x) ≤ 0 for all x ∈ Lout and F(x) ≥ 0 for all x ∈ Lin,

or (b) if a pre-specified maximum number of iterations, Na, is reached. The predefined maximum iteration number Na is required for images with noise and clutter. The popular choice for Fint is μκ, where κ is the curvature of the curve, computed as the Laplacian of φ.
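As a rough illustration of this list-based representation, the sketch below initializes the integer-valued φ of Eq. (9) from a binary object mask and shows the switch of a pixel from Lout to Lin; the clean-up passes and the symmetric switch in the opposite direction, which the full two-cycle algorithm of [11] also requires, are omitted here.

```python
import numpy as np

N4 = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # 4-connected neighborhood

def init_lists(mask):
    """Build phi in {-3, -1, 1, 3} from a binary mask (True inside C), per Eq. (9)."""
    phi = np.where(mask, -3, 3).astype(np.int8)
    Lin, Lout = [], []
    H, W = mask.shape
    for i in range(H):
        for j in range(W):
            for di, dj in N4:
                ni, nj = i + di, j + dj
                if 0 <= ni < H and 0 <= nj < W and mask[ni, nj] != mask[i, j]:
                    if mask[i, j]:
                        phi[i, j] = -1; Lin.append((i, j))
                    else:
                        phi[i, j] = 1; Lout.append((i, j))
                    break
    return phi, Lin, Lout

def switch_in(phi, Lin, Lout, x):
    """Move a boundary pixel x with F(x) > 0 from Lout to Lin (curve grows at x)."""
    i, j = x
    phi[i, j] = -1
    Lout.remove(x); Lin.append(x)
    for di, dj in N4:                          # former exterior neighbors join Lout
        ni, nj = i + di, j + dj
        if 0 <= ni < phi.shape[0] and 0 <= nj < phi.shape[1] and phi[ni, nj] == 3:
            phi[ni, nj] = 1; Lout.append((ni, nj))
```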
3 Proposed Algorithm
Assuming that the image acquisition system is static, the moving object is detected by gray-level differencing between the current frame and a reference background model. The background model is updated over time using a mixture of K Gaussian distributions [3]. Ideally, the difference image is composed of two regions: a static region and a moving-object region. The static region corresponds to pixels that belong to both the current and the background frame, where the differences are approximately zero. The moving region corresponds to pixels that belong only to the current frame, where the differences are significantly large. This property can be exploited to define the region-based speed function. Differently from [11], we define the region-based speed function based on the Chan-Vese model for region segmentation [4].

3.1 Speed Function Based on Region Intensity
Given the difference image I(x, y) = It(x, y) − B(x, y), where It(x, y) and B(x, y) refer to the current image and the background image respectively, the energy function of the Chan-Vese segmentation model [4] can be written as

E = μ·|∂Ωin| + v·∫Ωin dxdy + λ1 ∫Ωin (I(x, y) − c1)² dxdy + λ2 ∫Ωout (I(x, y) − c2)² dxdy

where Ωin and Ωout denote the object and background regions respectively. The term |∂Ωin| is the length of the object boundary and provides smoothness regularization. The second term penalizes the area of the object region and is generally omitted by setting v to zero. The last two terms represent the region fitting energy and are minimized only when the moving curve is at the boundary of the object being tracked. The parameters c1 and c2 represent the averages of the image inside and outside the curve C and are updated as φ evolves via

c1(φ) = ∫Ω I(x, y)H(φ) dxdy / ∫Ω H(φ) dxdy ;   c2(φ) = ∫Ω I(x, y)(1 − H(φ)) dxdy / ∫Ω (1 − H(φ)) dxdy

where H(φ) is the Heaviside function, defined as H(x) = 1 if x < 0 and H(x) = 0 if x > 0. Keeping c1 and c2 fixed and minimizing the energy function E with respect to φ results in
∂φ/∂t = δ(φ)[ μ∇·(∇φ/|∇φ|) − v − λ1(I(x, y) − c1)² + λ2(I(x, y) − c2)² ]    (11)

In order to extend the evolution to all level sets of φ, the Dirac function δ(φ) can be replaced with |∇φ|. Hence, the above equation becomes

∂φ/∂t = |∇φ|[ μ∇·(∇φ/|∇φ|) − v − λ1(I(x, y) − c1)² + λ2(I(x, y) − c2)² ]    (12)

where μ > 0, v ≥ 0, and λ1, λ2 > 0 are fixed parameters. In the fast level set implementation, the level set equation is of the form

∂φ/∂t = (Fext + Fint)|∇φ|    (13)

where Fext drives the curve towards the desired boundary and Fint smooths the curve. To provide such behavior, we define the speed functions as

Fext = sign(−λ1(I(x, y) − c1)² + λ2(I(x, y) − c2)²)    (14)
Fint = μ∇·(∇φ/|∇φ|)    (15)

In equation (14), I(x, y) ≈ c1 for a pixel inside the object and I(x, y) ≈ c2 for a pixel outside the object. Hence, for a pixel inside the moving object, the magnitude of the first term in Fext is much smaller than that of the second term, and the resulting Fext is positive. On the other hand, the external speed Fext is negative if the pixel lies outside the moving object. The negative and positive values of Fext drive the curve toward the object boundary, where Fext = 0. The internal speed Fint comes from the length constraint and provides boundary smoothness and the length scale of the moving object.
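The following sketch shows one way the region averages c1, c2 and the external speed of Eq. (14) could be evaluated on the difference image; it assumes the convention that φ < 0 inside the curve, and λ1, λ2 default to the equal-weight case discussed in Sect. 3.2.

```python
import numpy as np

def chan_vese_external_speed(diff_img, phi, lam1=1.0, lam2=1.0):
    """c1, c2 and Fext = sign(-lam1*(I-c1)^2 + lam2*(I-c2)^2), per Eq. (14)."""
    I = diff_img.astype(float)
    inside = phi < 0                       # object region (phi negative inside C)
    c1 = I[inside].mean() if inside.any() else 0.0
    c2 = I[~inside].mean() if (~inside).any() else 0.0
    f_ext = np.sign(-lam1 * (I - c1) ** 2 + lam2 * (I - c2) ** 2)
    return f_ext, c1, c2
```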
3.2 Curve Evolution Parameters
The parameters λi give the weights of the object and background classes. If we choose the weighting parameters λi to be the same, this implies that the object and background classes are equally likely and no particular class is favored over the other. In our proposed algorithm, we adaptively define the external speed function through the parameters λi to reduce the misclassification of foreground pixels as background. Let λ2 = kλ1. Then the external speed is redefined as

Fext = −λ1(I(x, y) − c1)² + kλ1(I(x, y) − c2)²,  if |Fext| < th;
Fext = −(I(x, y) − c1)² + (I(x, y) − c2)²,        otherwise    (16)

where k is a constant parameter with k ≥ 1, and th is a threshold determining the weakness of the external speed function. In our algorithm, the parameters c1 and c2 are chosen to represent the mean intensity values; the algorithm can be extended by choosing c1 and c2 to be other statistical properties.
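A minimal sketch of this adaptive weighting is given below; it assumes the threshold test in Eq. (16) is applied to the equal-weight speed, and the values of k and th are placeholders rather than values reported in the paper.

```python
import numpy as np

def adaptive_external_speed(I, c1, c2, lam1=1.0, k=2.0, th=10.0):
    """Eq. (16): boost the background-fitting term (lambda2 = k*lambda1) only
    where the equal-weight external speed is weak (|Fext| < th)."""
    base = -(I - c1) ** 2 + (I - c2) ** 2                      # equal weights
    boosted = -lam1 * (I - c1) ** 2 + k * lam1 * (I - c2) ** 2
    return np.where(np.abs(base) < th, boosted, base)
```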
4 Experimental Results
To evaluate the performance of the proposed method, it is tested with real-time video sequences. The first few frames are assumed to contain no moving objects and are assigned as the reference background model. For the subsequent frames, the difference image between the current frame and the background model is used as input to the tracking algorithm, while the background model is updated over time.
Fig. 1. (a) Background Model (b) Current Frame and (c) Difference Image
According to the fast level set implementation [11], the mathematical implementation is performed in two cycles. The regularization parameters Na and Ng are selected to balance the external speed and the smoothing effects. In the proposed algorithm, the initial curve can be placed anywhere in the image, and we set the image border as the initial curve in order to detect the moving object in the whole domain. For the subsequent frames, the initial curve can be obtained from the result of the previous frame. In our experiment, we choose Na = 255 for motion detection in the first frame and 20 for subsequent frames. By setting the parameter Na selectively, we can further reduce the computational cost. The parameter Ng eliminates small holes in the final result and
is chosen based on the noise level of the input image. Generally, Ng is chosen to be much smaller than Na. In our experiment, we set Ng = 7, and this parameter is fixed for the whole sequence.

4.1 Motion Detection
In the proposed algorithm, the initial curve can be placed anywhere in the image, and we set the image border as the initial curve in order to detect the moving objects in the whole image. The curve is re-initialized to the image border whenever the motion detection module indicates a change in the number of moving objects in the scene. Otherwise, the tracking curve from the previous frame is used to initialize the curve in the current frame. Fig. 2 shows the evolution process in one of the frames of the Hall monitor sequence. The contour of the person is successfully detected after 243 iterations by satisfying part (a) of the stopping conditions. The time for one iteration is about 0.005 sec.
Fig. 2. The curve evolution process of the two-cycle algorithm (iterations 11, 51, 171, and 243 are shown)
4.2 Tracking
After the moving object is detected successfully, only a small number of iterations is required in subsequent frames to obtain a tight boundary. The average number of iterations required in the subsequent frames is 8, and the average tracking time is 0.0866 sec per frame on a 2.4 GHz PC. The moving person and his shadow are tracked successfully in real time as he walks in the hallway.
Fig. 3. Tracking results of the Hall monitor sequence. Frames 30, 35, and 50 are shown.
We present the result of applying our adaptive speed function to one of the sequences from the NLPR database [6]. The boundaries of the human silhouettes in the observed images are tracked using our proposed algorithm. In the first
Fig. 4. Tracking results of the moving person (frames 23, 24, 27, and 36 are shown) with equal weighting parameters: λ1 = λ2 = 1
Fig. 5. Tracking results of the moving person (frames 23, 24, 27, and 36 are shown): giving the object class a higher weight for those pixels with |Fext| < th
frame, the curve converges to the boundary of the person to be tracked after about 239 iterations. From frame 23 to frame 36, the intensity values of the background and of part of the tracked person's body are similar. The frame difference for most of the points on the left side of the body is very small, which misleads the algorithm into labeling these points as background points (Fig. 4). Indeed, this is a well-known and challenging problem for region-based level set algorithms. In the proposed algorithm, as in other region-based level set methods, tracking is treated as a discriminant analysis of pixels into two classes corresponding to the object region and the background region. When the object being tracked and the background have similar statistical properties, it is difficult to track the boundary accurately without adding additional constraints. As shown in Fig. 5, applying equation (16) can overcome the misclassification problem and recover the body part.
5 Conclusion
In this paper, we propose a real-time region-based contour tracking algorithm using the fast level set method. The external speed function is constructed based on the Chan-Vese model with adaptive weighting parameters λi. The main advantage of our algorithm is its ability to track objects with weak boundaries in real time. The weighting parameters λi are adjusted to accommodate situations in which the object being tracked and the background have similar intensity. Experimental results demonstrate that our proposed method can overcome the misclassification problem by adjusting the weighting parameters.
However, in practical conditions, the low-contrast-region problem may not be solved by adjusting the weighting parameter λi alone. A more sophisticated speed function model will be required to achieve more robust performance when tracking low-contrast objects; this is left as future work.
Acknowledgements. The first author would like to thank the Institute for Infocomm Research, Singapore, for its financial support.
References

1. A. Yilmaz, X. Li, and M. Shah. Contour-based tracking with occlusion handling in video acquired using mobile cameras. IEEE Trans. on Pattern Analysis and Machine Intelligence, 26(11), November 2004.
2. S. Besson, M. Barlaud, and G. Aubert. Detection and tracking of moving objects using a new level set based method. Proc. International Conference on Pattern Recognition, 3:1100–1105, September 2000.
3. C. Stauffer and W.E.L. Grimson. Adaptive background mixture models for real-time tracking. IEEE Computer Society Conference on Computer Vision and Pattern Recognition, 2, 1999.
4. T. Chan and L. Vese. Active contours without edges. IEEE Trans. on Image Processing, 10(2):266–277, February 2001.
5. D. Cremers. Dynamical statistical shape priors for level set based tracking. IEEE Trans. on Pattern Analysis and Machine Intelligence, 2006.
6. http://www.nlpr.ia.ac.cn/English/irds/gaitdatabase.htm
7. M. Kass, A. Witkin, and D. Terzopoulos. Snakes: Active contour models. International Journal of Computer Vision, 1:321–331, 1987.
8. A. Mansouri. Region tracking via level set PDEs without motion computation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 24:947–961, July 2002.
9. N. Paragios and R. Deriche. Unifying boundary and region-based information for geodesic active tracking. IEEE Conference on Computer Vision and Pattern Recognition, Colorado, USA, 1999.
10. N. Paragios and R. Deriche. Geodesic active contours and level sets for the detection and tracking of moving objects. IEEE Trans. on Pattern Analysis and Machine Intelligence, 22(3):266–280, March 2000.
11. Y. Shi and W.C. Karl. Real-time tracking using level sets. Computer Vision and Pattern Recognition, June 2005.
12. S. Osher and J.A. Sethian. Fronts propagating with curvature dependent speed: Algorithms based on Hamilton-Jacobi formulations. Journal of Computational Physics, 79:12–49, 1988.
13. S. Zhu and A. Yuille. Region competition: Unifying snake/balloon, region growing, and Bayes/MDL/energy for multi-band image segmentation. IEEE Trans. on Pattern Analysis and Machine Intelligence, 18(9):884–900, 1996.
Dynamic Perceptual Quality Control Strategy for Mobile Multimedia Transmission Via Watermarking Index*

Chin-Lun Lai

Communication Engineering Department, Oriental Institute of Technology, 58 Sze-Chuan Rd. Sec 2, Pan-Chiao City, 220, Taipei, Taiwan, R.O.C.
[email protected]

Abstract. In this paper, a simple and efficient quality control strategy for mobile multimedia transmission is proposed. By embedding a watermark into the transmitted media as a channel indication index and exploiting the human visual/auditory sensitivity property, a dynamic bit allocation strategy is adopted to obtain the best quality of the media contents under a fixed bit rate constraint. The simulation results show that, using the proposed channel prediction method and bit allocation strategy, higher transmission efficiency can be achieved while the copyright protection requirement is still met, at the cost of only a small increase in computational complexity at the mobile terminal.

Keywords: Channel Estimation, Digital Watermarking, Wavelet and Subband Coding, Dynamic Bit Allocation, Human Sensitivity Function, Mobile Multimedia Transmission.
1 Introduction

Channel detection is a key technology for delivering stable, high-quality media in wireless transmission applications, especially in cellular systems, due to the asymmetric computational power of the base station and the mobile terminal and the frequency division duplexing (FDD) of the uplink and downlink; it remains a challenge for current and future wireless communication systems. For example, a GSM base station transmits a training sequence and receives the analyzed feedback signal from the terminal as a reference for transmit power control, thereby acquiring better transmission performance, but more transmission bandwidth is required [1]. Rather than using a training sequence, an alternative approach [2] uses feedback to inform the base station of the downlink channel condition, but this puts a heavy computational burden on the mobile terminal. Moreover, since channel estimation from a training signal in the time domain only reveals the global condition rather than the condition of each band of the original signal, the optimal strategy to overcome the
* This research was supported by the National Science Council of the Republic of China under NSC 93-2622-E-161-002-CC3.
channel effect in each frequency band is unavailable. Once the channel condition is known, many newly developed schemes, such as transmit diversity [3][4] (open loop), dynamic modulation, variable bit rate coding, and multilevel coding [5][6], can be used to combat the wireless channel effect and thus obtain a more efficient communication system. On the other hand, with the fast development of digital watermarking in multimedia transmission and of media sharing services such as 4G mobile communication and P2P services, the digital watermark has become a broad technique for copyright protection, authentication, and broadcast monitoring in multimedia transmission applications [7]-[11]. Since the watermark can be thought of as side information of the transmitted signal [9]-[12], it is reasonable to expect that the watermark can serve as useful side information in the communication system for determining the channel condition, e.g., the level of channel noise. Thus, watermarking not only serves as an intellectual property protection technique but can also act as a channel indicator in the growing wireless multimedia network services. Although this concept has been disclosed previously, there is no explicit way to correlate the received watermark with the channel condition. It is expected that, by calibrating the distortion rate of the transmitted watermark against the original signal, the channel condition can be estimated precisely by a simple method. Once the channel condition is known, a coding scheme specific to that channel condition can be adopted to improve the transmission efficiency in terms of both bit error rate (BER) and transmission rate. This paper presents an effective approach for mobile multimedia transmission that is expected to satisfy both requirements: copyright protection and channel effect compensation. By detecting the watermark information embedded in every subband of the media, the channel condition can be easily estimated and fed back to the transmitter, and a suitable coding algorithm is then adopted to overcome the channel effect. The human visual/auditory sensitivity property is used in the proposed bit allocation scheme, both across bands and between source and channel coding [13], to achieve quality control under a limited bit rate criterion, which is important in mobile transmission services such as 3G/4G multimedia. For example, for important bands, more bits should be assigned to overcome severe channel effects and obtain acceptable image quality. However, if the band condition is indicated as "good enough", the extra bits can be reallocated to other bands with more distortion, so that a better image with finer detail is available. The paper is organized as follows. In Section 2, the theoretical principle and coding algorithm of the proposed strategy are described. The simulation results and discussions are given in Section 3, and finally, the conclusion and future work are given in Section 4.
2 Theoretical Principles and Algorithms Description

The simplified block diagram of the proposed scheme is shown in Fig. 1 and is described as follows.
Fig. 1. Simplified block diagram of the proposed transmission system. Note that C represents the original subband coefficients.
2.1 Watermark Embedding and Detection

Since the time-domain training sequence used in general transmission systems cannot indicate the channel condition of each subband individually, a subband decomposition codec with watermarking in each band is proposed as follows, so that better efficiency can be gained from the coding strategy. Let Co denote the original source to be transmitted. A wavelet decomposition filter pair with impulse responses (hdL, hdH) is used to divide the baseband signal into two subbands, CL and CH, by

CL = Co * hdL ,   hdL = [0.707, 0.707]    (1)
CH = Co * hdH ,   hdH = [0.707, -0.707]    (2)

where hdL and hdH are the Daubechies wavelets with N = 1, and (CL, CH) represent the low- and high-frequency bands respectively. The reconstruction process is the same as in (1)-(2), convolving the decomposed bands with the reconstruction filters hrL and hrH, given by hrL = hdL and hrH = -hdH. For an image source Co, the wavelet analysis is applied in the same manner to each row and column, yielding four subband coefficient images CLL, CLH, CHL, and CHH.

The watermarking technique we use is similar to Hsu's method [14], based on the DWT and multiresolution techniques, and is stated briefly as follows. For each wavelet band, a pre-determined binary watermark template of one quarter the size of the coefficient image undergoes successive-layer decomposition, pixel-based pseudo permutation, and block-based image-dependent permutation. After that, the watermark information is embedded into the transform domain by modifying the DWT coefficients according to their neighboring relationship. For example, two successive middle-frequency coefficients are compared by subtraction; if the former is greater, a bit '1' can be embedded directly. On the contrary, if a bit '0' is to be inserted, a number must be added to the latter coefficient so that it becomes greater than the former. An example result of watermark hiding using the above algorithm is shown in Fig. 2. The watermark can be self-extracted without the original image [14].
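As a minimal sketch of the analysis step in Eqs. (1)-(2) applied to an image, the code below filters and downsamples rows and columns with the two-tap filters to produce the four subband images; it assumes even image dimensions, and the CLH/CHL naming of the mixed bands is a convention rather than something fixed by the paper.

```python
import numpy as np

def subband_analysis(img):
    """One analysis level with hdL = [0.707, 0.707], hdH = [0.707, -0.707],
    followed by downsampling by two along each direction."""
    x = img.astype(float)
    h = 1.0 / np.sqrt(2.0)
    lo = h * (x[:, 0::2] + x[:, 1::2])     # low-pass along columns
    hi = h * (x[:, 0::2] - x[:, 1::2])     # high-pass along columns
    CLL = h * (lo[0::2, :] + lo[1::2, :])  # low-low
    CLH = h * (lo[0::2, :] - lo[1::2, :])  # low-high
    CHL = h * (hi[0::2, :] + hi[1::2, :])  # high-low
    CHH = h * (hi[0::2, :] - hi[1::2, :])  # high-high
    return CLL, CLH, CHL, CHH
```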
Fig. 2. (Left) Binary text watermark image of size (128*128), (Right) The resultant image after watermarking (256*256)
The received image undergoes the same wavelet decomposition process; the recovered watermark is then obtained by comparing the neighboring middle-frequency coefficients to disclose the transmitted bit symbols and descrambling the bit stream to reconstruct the watermark image.

2.2 Channel Estimation by Watermark Correlation

Owing to the stochastic nature of the original message, it is reasonable to estimate the channel condition by comparing the original and the reconstructed watermark. Thus, after the watermark information is extracted at the terminal, a simple and effective operation can be performed to estimate the channel condition for each frequency band. To reduce the computational complexity, a normalized correlation (NC) coefficient is used to measure the fidelity of the reconstructed watermark:

NC = [ Σ(i=0..M−1) Σ(j=0..M−1) w(i, j) w'(i, j) ] / [ Σ(i=0..M−1) Σ(j=0..M−1) [w(i, j)]² ]    (3)
where w and w' denote the original and received watermark respectively. The NC coefficient reveals the channel effect experienced by the communication system: little noise and channel distortion drive the NC toward one, while heavy noise or severe channel distortion drive it toward zero. Thus, a simple indicator of the channel condition is obtained by observing the NC coefficient and referring to a pre-determined lookup table that correlates the NC value with the noise level (signal-to-noise ratio SNR, or noise variance σ²) or bit error rate (BER) of the communication system. A typical SNR definition is

SNR = 10 log [ Σ(i=0..M−1) Σ(j=0..M−1) [Co(i, j)]² / Σ(i=0..M−1) Σ(j=0..M−1) [Co(i, j) − C'(i, j)]² ]    (4)
where C’ denotes the noisy message. When a noise with known power variance is added to the transmitted image, a corresponding NC value from the corrupted watermark can be calculated by (3), and the corresponding SNR value can be determined by (4). A simulation result for two popular noise models, Gaussian and Uniform, is given in Fig. 3 to describe the relevance between the NC and SNR.
Fig. 3. (a) Correlation curves of NC and SNR values for Gaussian and Uniform noise models. (b) and (c) show the recovered watermark corresponding to receiving an ‘acceptable’ or ‘bad’ transmitted media (Lena) quality respectively. This figure illustrates that the received media fidelity can also be predicted by observing the received watermark.
According to this result, the NC and SNR values can be correlated approximately (note that numerical errors appear for NC near 0 and 1 because curve fitting is used) by

SNR_Gaussian = −142.8·NC² + 254.5·NC − 71.9 (dB)    (5)
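A minimal sketch of this terminal-side estimation step is given below: the NC of Eq. (3) is computed from the original and recovered watermarks and mapped to an SNR estimate through the fitted curve of Eq. (5) for the Gaussian noise case.

```python
import numpy as np

def normalized_correlation(w, w_rec):
    """NC between the original and the recovered watermark, per Eq. (3)."""
    w = w.astype(float)
    w_rec = w_rec.astype(float)
    return float((w * w_rec).sum() / (w ** 2).sum())

def estimate_snr_gaussian(nc):
    """Approximate channel SNR (dB) from NC via the fitted curve of Eq. (5)."""
    return -142.8 * nc ** 2 + 254.5 * nc - 71.9
```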
Thus, the channel noise power level can easily be determined simply by computing the NC at the terminal side.

2.3 Optimal Bit Allocation Strategy by Human Perception Sensitivity Property, Image Contents, and Channel Condition

Once the channel condition for each band is known, it can be sent back to the transmitter as a control signal to adjust the system parameters, and extra transmission efficiency can thus be achieved. In this paper, a dynamic bit allocation scheme for each signal band under a fixed bit rate criterion is used to improve the visual/auditory quality of the transmitted media under various channel conditions, that is, to ensure that a satisfactory quality can be achieved for each channel condition. To achieve this goal, the threshold points indicating the maximum number of removable bits for three visual/auditory quality levels must be determined by a series of subjective experiments. A number of members of our research group were selected to join this test. The wavelet coefficients of the transmitted media are removed (or replaced by zero) one by one according to the visual/auditory weighting function, similar in concept to the HVS-based quantization method in the wavelet domain [15]. That is, wavelet coefficients with lower
Fig. 4. (a) Threshold points of 'good' and 'acceptable' for the Lena image at different noise levels, indicating the maximum removable bits for the desired image quality and the resultant SNR value. (b)-(d) show the results of 'good', 'acceptable', and 'bad' quality in the noiseless condition. Noise levels 1, 2, and 3 correspond to noise variances of 24.9, 99.2, and 404.7, respectively. A high-frequency-content image, 'Baboon', is also tested under the same condition as (b) and is shown in (e) for comparison; the complete test results for the Baboon image under the noiseless condition are depicted as a dotted curve in (a).
perceptual weighting should be removed first. Two threshold points, 'good' and 'acceptable', are decided during the degradation process when most of the testers observe a significant quality change between successive images, that is, when the previous image is judged "good enough" and the next image "not good". Resulting curves for two test examples, the 'Lena' and 'Baboon' images, at different noise levels are shown in Fig. 4(a) to indicate the number of removed bits and the received image fidelity. The threshold points are also examined, and the corresponding images for the noiseless channel case are shown in Fig. 4(b)-(d), which represent the three categories of visual quality. In our coding strategy, the concept of 'bit allocation' includes two aspects and depends on the channel condition and the required service quality. If the communication system operates over a noisier or more distorted channel, more bits should be allocated to the channel coding algorithm to protect perceptually important coefficients. However, if the channel effect is negligible, more bits can be used in the source coding algorithm to improve the fine detail of the transmitted media. For example, if the CLL band is corrupted severely, meaning that acceptable quality is unavailable, we must take some bits from other bands to protect CLL with channel
coding techniques until the desired quality is reached. On the other hand, if the channel condition is 'good enough', the bits that exceed the quality requirement of the desired service can be allocated to the CLH, CHL, and CHH bands to obtain higher image quality with fine detail. Thus, the maximum number of removable bits, which depends on both the image contents and the channel condition, gives us the guiding principle for tuning the coding parameters. To simplify the coding process, the image content classification is represented by the power ratio of the high-frequency band to the full band, say CHH/C. According to the previous experimental results, the maximum removable bit number corresponding to different channel conditions and the desired quality can be approximated by

R_good = 665 + 3700x + 6727x² − 5.451y + 0.0093602y²    (6)
R_accept = 1690 + 3298.5x + 4880x² − 9.5704y + 0.017058y²    (7)

where x denotes the power ratio of the high-frequency band to the overall bands, and y denotes the noise level expressed as a variance. The above two equations simply give us the minimum required bit number at the transmitter when the noise level and the desired service quality are known. Audio signals are also tested in the same way. For audio, the binary watermark image is embedded into the frequency band coefficients with lower auditory sensitivity. The optimal threshold points for the removable bit number, analogous to (6)-(7), are represented by

R_good = 4158 + 787x + 149x² − 21.05y + 0.048951y²    (8)
R_accept = 9401 + 1779.6x + 337x² − 34.391y + 0.07295y²    (9)
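The sketch below simply evaluates the fitted polynomials of Eqs. (6)-(9) to obtain the maximum removable bit number for a given content ratio x and noise variance y; the function name and the 'good'/'acceptable' switch are illustrative conveniences, not part of the paper.

```python
def removable_bits(x, y, media="image", quality="good"):
    """Maximum removable bit number from Eqs. (6)-(7) (image) or (8)-(9) (audio),
    with x = high-band/overall power ratio and y = channel noise variance."""
    if media == "image":
        if quality == "good":
            return 665 + 3700 * x + 6727 * x ** 2 - 5.451 * y + 0.0093602 * y ** 2
        return 1690 + 3298.5 * x + 4880 * x ** 2 - 9.5704 * y + 0.017058 * y ** 2
    if quality == "good":
        return 4158 + 787 * x + 149 * x ** 2 - 21.05 * y + 0.048951 * y ** 2
    return 9401 + 1779.6 * x + 337 * x ** 2 - 34.391 * y + 0.07295 * y ** 2
```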
3 Simulation Results and Discussions

To evaluate the performance of the proposed system, a MATLAB procedure with Simulink blocks is used to simulate the whole communication process; the watermark signal is the same as in Fig. 2, and the test images are Lena, Baboon, etc. After the wavelet decomposition and watermarking process, each watermarked frequency band Cwi, i = 1~4, undergoes digital modulation and is transmitted over a wireless channel with a specific noise model added. At the receiver terminal, under the perfect sample-timing assumption, the output of the matched filter is C'wi, i = 1~4, and the corresponding reconstruction algorithm is then used to retrieve the watermark from the received signal. The channel condition in each band is determined by directly calculating the NC coefficient via (3) and then referring to (5) to obtain the noise power. After that, the channel condition index and the power ratio of the high-frequency band to the overall band are used as the feedback signal to the transmitter. Once the desired service quality is assigned, by referring to (6)-(9) it is possible to allocate bits precisely, either between different signal bands or between channel and source coding schemes, to obtain the best perceptual quality under the present channel condition. To simplify the coding process,
the coding scheme is emulated by preserving the corresponding number of original image coefficients according to the bit allocation result. An example result of the complete simulation process under different channel conditions is shown in Fig. 4 and Fig. 5. It can be observed in Fig. 4(a) that, for the same image under different channel conditions, the number of removable bits increases as the channel noise level decreases. It also reveals that, for two different images under the same channel condition, the removable bit number is greater for the image with higher high-frequency power. These results tell us that more bits can (or should) be saved and re-allocated to the channel coding algorithm to protect the perceptually important coefficients during transmission. In Fig. 5, the resultant images and the corresponding SNR values with and without the dynamic bit allocation strategy (total allocated bits = 52 Kb) for three noise levels are shown in Fig. 5(a)-(f) and the figure caption. From these results, it is observed that under a fixed bit rate constraint, which is common in general wireless transmission services, the proposed algorithm not only saves bits by dropping the training sequence but also achieves image quality comparable to that of the system in [16]. Something interesting should be noted in Fig. 5(d) and (f), where the image with the lower SNR in (f) seems to have better visual quality than (d); occasionally, severe channel noise masks the contouring effect caused by heavy quantization and thus yields a visually smoother image. Nevertheless, this contradiction disappears when the dynamic bit allocation scheme is used (shown in (a) and (c)). Moreover, it should be noted that the reconstructed watermark image can still be recognized as long as the received source media quality is not below the acceptable point, so the copyright protection requirement is maintained. On the other hand, by using the hierarchical coding structure and the human sensitivity property, channel effects such as delay or lost packets, which often damage mobile transmission systems, can be alleviated significantly. For example, if the channel is known to be not
Fig. 5. Resultant images with (a-c) and without (d-f) the dynamic bit allocation scheme under different channel noise levels: (a)(d) σ² = 24.9, (b)(e) σ² = 99.2, and (c)(f) σ² = 404.7. Each image is allocated 52K bits. The SNR values for (a)-(f) are 15.1, 14.8, 14.3, 13.1, 12.9, and 12.2 dB, respectively.
so good, the "most important part" of the media contents will be protected by more channel coding symbols, thereby overcoming the lost-packet effect and maintaining a minimum satisfactory media quality. Other media types, such as video, were also tested; although the video case has not been simulated completely, most test examples lead to similar conclusions. Finally, different noise models, such as Rayleigh or Ricean, give similar simulation results. Although the channel model is more complicated in a real transmission system, it is feasible to use a similar algorithm to detect and compensate for the unknown channel effect and thus meet the quality control requirement.
4 Conclusions and Future Work

An intuitive scheme for channel detection and optimal bit allocation, aimed at the best perceptual media quality in mobile multimedia transmission under a fixed bit-rate constraint, is proposed. With the watermark serving as the channel indicator, the training sequence used in general mobile communication systems to estimate the channel condition can be omitted, so that more resources are saved. The proposed system not only meets the demand for copyright protection of digital contents but also increases the transmission efficiency and the perceptual quality within a bandwidth limitation. Moreover, detecting the channel condition by watermarking in the frequency domain offers the ability to examine the channel for each individual subband, and thus to overcome the different channel effects of the corresponding bands. The simulation results show the good performance of the proposed scheme in receiving visually better images, especially in extreme conditions with a severely noisy environment. That is, the proposed system can provide more perceptually satisfactory media quality than previous systems, which may produce unacceptable results, and is therefore practical for 3G/4G multimedia services such as H.324M mobile video telephony. Finally, although it is reasonable to expect from the simulation results that an implemented system will work well, modifications may be necessary in a real system to obtain optimum performance under various conditions. Several techniques based on the same concept could enhance the performance of the transmission system and are left as future work. For example, multilayer wavelet decomposition with various wavelet functions may further improve the efficiency of the bit allocation scheme. Moreover, various watermarking or coding schemes, including channel coding and source coding, can be tested to find the optimal system framework for different conditions. Techniques such as digital modulation and transmit diversity could also be examined to enhance the system performance further.

Acknowledgments. C.L. Lai would like to thank the group members of Dr. J. Huang and J.C. Yang for their help with part of the MATLAB simulation and for their comments.
References

1. Jorg Eberspacher, Hans-Jorg Vogel: GSM: Switching, Service and Protocols. John Wiley & Sons (1998)
2. A.F. Naguib, A. Paulraj, T. Kailath: Capacity Improvement with Base Station Antenna Arrays in Cellular CDMA. IEEE Transactions on Vehicular Technology, vol. 43, no. 3 (1994) 691-698
3. S.M. Alamouti: A Simple Transmit Diversity Technique for Wireless Communications. IEEE Journal on Select Areas in Communication, vol. 16 (1998) 1451-1458
4. Josef Fuhl, Alexander Kuchar, Ernst Bonek: Capacity Increase in Cellular PCS by Smart Antennas. IEEE VTC Conference, vol. 3 (1997) 1962-1966
5. M. El-Sayed, J. Haffe: A View of Telecommunications Network Evolution. IEEE Communications Magazine (2002) 74-81
6. John G. Proakis: Digital Communications. 4th Edition, McGraw-Hill International Edition (2001)
7. J. Sun, J. Sauvola, D. Howie: Features in Future: 4G Vision from a Technical Perspective. IEEE (2001)
8. J. Borras-Chia: Video Service over 4G Wireless Networks: Not Necessarily Streaming. IEEE (2002)
9. T. Kalker, et al.: Music2Share - Copyright-Compliant Music Sharing in P2P Systems. Proceedings of the IEEE, vol. 92, no. 6 (2004) 961-970
10. I. Cox, J. Bloom, M. Miller: Digital Watermarking: Principles & Practice. Morgan Kaufmann Publishers, Inc., San Francisco (2002)
11. G.C. Langelaar, I. Setyawan, R.L. Lagendijk: Watermarking Digital Image and Video Data - A State-of-the-Art Overview. IEEE Signal Processing Magazine, vol. 17, no. 5 (2000) 20-46
12. C.E. Shannon: Channels with Side Information at the Transmitter. IBM J. Res. Develop., vol. 2 (1958) 289-293
13. C.L. Lai: The Dynamic Detection and Optimal Codec Allocation for Wireless Transmission Based on Digital Watermark Calibration. Grant Project Report of the National Science Council of the Republic of China, NSC 93-2622-E-161-002
14. C.T. Hsu, J.L. Wu: Multiresolution Watermarking for Digital Images. IEEE Transactions on Circuits and Systems II: Analog and Digital Signal Processing, vol. 45, no. 8 (1998) 1097-1101
15. M.J. Nadenau, Julien Reichel, Murat Kunt: Wavelet-Based Color Image Compression: Exploiting the Contrast Sensitivity Function. IEEE Transactions on Image Processing, vol. 12, no. 1 (2003) 58-70
16. D.M. Chandler, S.S. Hemami: Multiresolution Watermarking for Digital Images. IEEE Transactions on Image Processing, vol. 14, no. 4 (2005) 397-410
A Rate Control Using Adaptive Model-Based Quantization

Seonki Kim (1), Tae-Jung Kim (2), Sang-Bong Lee (2), and Jae-Won Suh (2)

(1) LG Electronics, Inc., Seoul, Korea
[email protected]
(2) Chungbuk National University, School of Electrical and Computer Engineering, 12 Gaeshin-dong, Heungduk-gu, Chongju, Korea
[email protected], [email protected], [email protected]

Abstract. Rate control is an essential component of video coding for providing uniform quality under given coding constraints. The source distribution to be quantized cannot be described by a single model throughout a video sequence. In this paper, we present a new rate control algorithm based on the Generalized Gaussian R-D model. By considering the relation between adjacent frames or macroblocks, we determine the model parameters and perform rate control on the H.264/AVC video codec. As shown in the experimental results, the proposed algorithm improves the quality of the reconstructed picture after encoding. In addition, our scheme generates a number of bits close to the target bitrate.
1 Introduction
Generally, the output bitrate in video coding is controlled by the quantization parameter determined in the encoder. In other words, selecting the proper quantization parameter controls the number of output bits and provides the best quality within coding constraints such as the target bitrate and frame rate. In addition, rate control is required to prevent overflow or underflow of the buffer. The loss of data due to overflow is irreparable and leads to quality degradation after encoding. Therefore, rate control is an essential component of video coding. However, rate control itself is not specified by the standard and is considered only on the encoder side. Although rate control algorithms are not mandatory for a video coding standard, the video coding standards have their own rate control models, such as MPEG-2 TM5 [1], the MPEG-4 Q2 model [2], and H.263 TMN8 [3]. The classical rate-distortion-model-based quantizer has been studied for a long time. However, it does not work well at low bitrates, since only the variance of the source data is considered as a parameter to control the rate, and the variance alone is not sufficient to characterize the source data at low bitrates. To overcome this problem, rate-quantization models that generate a bitrate close to the target have been studied [6,7,8]. These rate control algorithms, which use a rate model derived from the rate-distortion function or rate-quantization
curve, assume that the data to be quantized follow a single distribution. In realistic cases, however, all frames cannot be described by a unique model. In this paper, we propose a new rate control algorithm with adaptive model adjustment. Based on the characteristics of the Generalized Gaussian distribution as the source statistic, rate control is performed adaptively. Using previous coding information and the current coding mode, we decide the model parameters and estimate the source statistic. From the experimental results, we conclude that the proposed algorithm provides acceptable and improved quality and generates a bitrate close to the target bitrate.
2 Proposed Algorithm

2.1 Rate-Quantization Model
We consider the R-D function of the Generalized Gaussian source:

R(D) = (1/γ) log2(σ^β / D)    (1)

where σ denotes the standard deviation of the source data to be quantized, and β and γ are model parameters. Previous works [9,10] show that the Generalized Gaussian distribution is suitable for designing a rate model for DCT coefficients. β is a free parameter controlling the exponential rate of skewness of the distribution. As special cases, when β = 1, γ = 2 and β = 2, γ = 1.5, the Generalized Gaussian distribution reduces to the Laplacian and Gaussian distributions, respectively. Before designing a rate model, we should consider the amount of distortion introduced as the quantization parameter (QP) changes, because the information loss is caused by the QP. In general, the distortion due to quantization is defined as

D(Q) = c · Q²    (2)

where c is the distortion parameter that represents the relationship between the distortion and the quantization parameter. The value of the distortion parameter suggests a region-based rate control scheme, which we do not address in this paper. The decision of the distortion parameter is described in the next section. From Eq. (1) and Eq. (2), the rate-quantization model is formulated as

Q = √( σ^β · 2^(−γR) / c )    (3)
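A minimal sketch of evaluating Eq. (3) is shown below; the default β and γ correspond to the Laplacian special case mentioned above, and c = 1/3 (i.e., α = 1 in the distortion weight of Sect. 2.2) is only a placeholder.

```python
import math

def model_quantization_step(sigma, rate, beta=1.0, gamma=2.0, c=1.0 / 3.0):
    """Q from the R-Q model of Eq. (3): Q = sqrt(sigma^beta * 2^(-gamma*R) / c)."""
    return math.sqrt(sigma ** beta * 2.0 ** (-gamma * rate) / c)
```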
724
2.2
S. Kim et al.
Estimation of Coding Parameters
In section 2.2, we designed a new rate-quantization model based on the ratedistortion model of Generalized Gaussian distribution. The Eq. (3) requires 5 parameters: shape factor (β), distortion parameter (c), standard deviation of source (σ), target rate (R), and model parameter (γ). The source statistic value that is the standard deviation is explained in the above section how to get it from the structural constraint of H.264 video coding. In this section, we will explain that the methods to obtain other coding parameters required in the proposed algorithm. Shape Factor. The shape factor β is an important parameter to model a form of the source distribution. However, the exact estimation of the shape parameter for the source is not easy. It also requires a high computational complexity. In order to avoid this problem, some researches have been studied to find the optimal shape factor with small complexity [9,10]. In this paper, we simply decide the shape factor by a similarity between two adjacent images. ⎧ ω > 0.9 ⎪ ⎨ 1, ω < 0.6 β = 2, (4) ⎪ ⎩ 2 − 3.3 · (ω − 0.6), a ≤ ω ≤ b 256 1, |x − x | < 2 1 ω= v(i), v(i) = (5) 256 i=1 0, o.w. where x and x denotes the samples, which are placed on the corresponding point, in the current and the reference frames, respectively. In this paper, we assume that the source distribution is modeled as one of forms existing between Laplacian, that is defined by the value of 1, and Gaussian, that is defined by the value of 2. Distortion Parameter. In Eq. (2), we design a distortion model by a distortion parameter, c, and a quantization parameter, Q. As a quantization parameter, the distortion is also considered at the macroblock-level. In order to decide the distortion parameter for each macroblock, we consider the typical distortion measure of quantization [3]. α D = Q2 (6) 3 where α is a distortion weight for the current macroblock. From Eq. (2) and Eq. (6), we can derive the equation to compute a distortion parameter, c, as follows: α c= (7) 3 ⎧ BT 1 BT ⎨ , < 256 × N 2 α = σ × (256N ) (8) ⎩ 1, o.w.
A Rate Control Using Adaptive Model-Based Quantization
725
where BT denotes the target bits for a frame, σ is the standard deviation. N is the number of macroblocks in a frame. Model Parameter. We should update the model parameter γ. Let RT be the target bits and RA be the coding bits per pixel. Thus, RT and RA is obtained by a division by number of samples in the frame. Theoretically, RT follows the rate-quantization model because the rate-quantization model was based on the rate-distortion function. Thus, we can express RT and RA as follows: RT = γe log2
σβ , c · Q2
RA = γc log2
σβ c · Q2
(9)
where γe and γc are the model parameter used for the current frame and the model parameter to be used for the coming frame.
RT − RA = (1/γe − 1/γc) · log2( σ^β / (c · Q²) ) = (1/γe − 1/γc) · A        (10)

∴ γc = A · γe / ( A − γe · (RT − RA) )        (11)

where

A = log2( σ^β / (c · Q²) )        (12)
The value of A is small. The model parameter is obtained at the frame level, whereas the other parameters are obtained at the macroblock level. In order to define the source statistics for the frame, we re-calculate them from the actual residual data obtained after coding; because the model parameter is updated after the frame has been coded, the real residual data are available. The distortion parameter and the shape factor at the frame level are obtained as follows:

c (β) = (1/N) · Σ_{i=1}^{N} c_i (β_i)        (13)
where N is the number of macroblocks in a frame, and c_i and β_i indicate the distortion parameter and the shape factor of the ith macroblock, so that c and β are their frame-level averages.
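To make the parameter estimation concrete, the following sketch implements the shape-factor decision of Eqs. (4)-(5) and the model-parameter update of Eqs. (11)-(12) in Python. It is an illustrative reading of the equations, not the authors' reference code; the 16×16 macroblock assumption, the function names, and the guard against a zero denominator are our own choices.

```python
import numpy as np

def shape_factor(cur_mb: np.ndarray, ref_mb: np.ndarray) -> float:
    """Eqs. (4)-(5): decide beta from the similarity of two 16x16 macroblocks."""
    # v(i) = 1 when co-located samples differ by less than 2, Eq. (5)
    v = np.abs(cur_mb.astype(float) - ref_mb.astype(float)) < 2
    omega = v.mean()                      # omega = (1/256) * sum of v(i)
    if omega > 0.9:                       # nearly identical blocks -> Laplacian-like
        return 1.0
    if omega < 0.6:                       # dissimilar blocks -> Gaussian-like
        return 2.0
    return 2.0 - 3.3 * (omega - 0.6)      # linear transition in between

def update_gamma(gamma_e: float, r_t: float, r_a: float,
                 sigma: float, beta: float, c: float, q: float) -> float:
    """Eqs. (11)-(12): model parameter for the coming frame."""
    a = np.log2(sigma ** beta / (c * q * q))      # A in Eq. (12)
    denom = a - gamma_e * (r_t - r_a)
    if abs(denom) < 1e-6:                          # guard, not in the paper
        return gamma_e
    return a * gamma_e / denom
```

A typical call would be beta = shape_factor(cur, ref) per macroblock, followed by one update_gamma call per frame after coding.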
2.3 Computation of Quantization Parameter
We compute the quantization parameter for the current macroblock with the designed rate-quantization model in Eq. (3). The calculated quantization parameter is rounded to the nearest integer. It is then adjusted as follows in order to avoid noticeable quality variation between the current and the previous macroblock:

QP* = QP_prev − 2,   if QP − QP_prev < −2
      QP_prev + 2,   if QP − QP_prev > 2
      QP,            otherwise        (14)
Here QP* is the adjusted quantization parameter. After the adjustment, the quantization parameter is clipped to the range 0 to 51, the range defined by the H.264 video coding standard. A minimal sketch of this computation is given below.
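The sketch below puts Section 2.3 together: it evaluates the rate-quantization model of Eq. (3), rounds, applies the ±2 smoothing of Eq. (14), and clips to the H.264 range. Note that, as in the text above, the model output is used directly as the QP value; the function name and argument defaults are illustrative assumptions, not the authors' implementation.

```python
import math

def compute_qp(sigma: float, beta: float, gamma: float, c: float,
               rate: float, qp_prev: int) -> int:
    """Eq. (3) + Eq. (14): model-based QP for one macroblock."""
    # Rate-quantization model, Eq. (3); used here directly as the QP value
    q = math.sqrt((sigma ** beta) * 2.0 ** (-gamma * rate) / c)
    qp = int(round(q))
    # Limit the QP change to +/-2 with respect to the previous macroblock, Eq. (14)
    if qp - qp_prev < -2:
        qp = qp_prev - 2
    elif qp - qp_prev > 2:
        qp = qp_prev + 2
    # Clip to the valid H.264 range
    return max(0, min(51, qp))
```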
2.4 Rate-Distortion Optimization and Its Constraint in H.264
Causality Problem. H.264 introduced the rate-distortion optimization technique to increase coding efficiency. The rate-distortion optimization is based on the Lagrangian optimization method, and H.264 defines the formula for the Lagrangian multiplier as

λ = 0.85 × 2^(QP/3)        (15)

The computation of λ requires the QP value, and λ is calculated in an initial stage of the H.264 coding order. As shown in Eq. (15), the computation of the Lagrangian multiplier requires the value of the quantization parameter. However, the decision of the quantization parameter needs the standard deviation of the source data. Therefore, we run into a "causality problem," or "chicken-and-egg dilemma." In order to avoid this limitation in H.264 video coding, we have to predict the standard deviation value.

Prediction of Standard Deviation. For the causality problem, Ning [11] proposed the following method:

σ = ( 8·σP + σP^L + σP^T ) / 10        (16)

where σ is the estimated standard deviation for the current macroblock and σP is the actual standard deviation of the co-located macroblock in the previous frame; σP^L and σP^T are the standard deviations of the left and top neighbors of that co-located macroblock in the previous frame. Ning's method depends entirely on information from the previous frame, and can therefore suffer from high prediction error because it does not consider the characteristics of the current frame itself. In order to reflect the spatial relationship within a frame when predicting the standard deviation, we also consider the standard deviations of the macroblocks adjacent to the target macroblock. The proposed predictive method is

σ = α1 × σP,                          if (σC^L + σC^T)/2 > α1 × σP
    α2 × σP,                          if (σC^L + σC^T)/2 < α2 × σP
    ( 6×σP + 2×(σC^L + σC^T) ) / 10,  otherwise        (17)

where σP is the standard deviation of the macroblock in the previous frame at the same position as the current macroblock, and σC denotes standard deviations of macroblocks in the current frame; the superscripts T and L indicate the top and left macroblocks of the current macroblock, respectively. The two parameters α1 and α2 are set to 1.1 and 0.9, respectively; these values were decided heuristically by experiments.
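As a quick illustration of Eq. (17), the sketch below predicts the macroblock standard deviation from the co-located value in the previous frame and the left/top neighbors in the current frame. The thresholds 1.1 and 0.9 are the values quoted above; the function name and interface are our own.

```python
def predict_sigma(sigma_p: float, sigma_c_left: float, sigma_c_top: float,
                  alpha1: float = 1.1, alpha2: float = 0.9) -> float:
    """Eq. (17): spatially aware prediction of the macroblock standard deviation."""
    spatial = 0.5 * (sigma_c_left + sigma_c_top)   # average of causal neighbours
    if spatial > alpha1 * sigma_p:                 # current frame clearly busier
        return alpha1 * sigma_p
    if spatial < alpha2 * sigma_p:                 # current frame clearly smoother
        return alpha2 * sigma_p
    # otherwise blend temporal and spatial information (weights 6:2:2)
    return (6.0 * sigma_p + 2.0 * (sigma_c_left + sigma_c_top)) / 10.0
```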
3 Experimental Results

3.1 Simulation Condition
We implement the proposed algorithm in the H.264 reference software. In order to evaluate the proposed quantization algorithm, we set common simulation conditions that follow the baseline profile, which is specified in the H.264 standard document [13]. Table 1 shows our simulation conditions.

Table 1. Simulation Conditions
Rate-Distortion Optimization: On
GOP Structure: IPPP
Symbol Mode: CAVLC
MV search range: 32
Reference Frames: 1
The proposed algorithm is applied to the P-frames because B-frame coding is not included in the baseline profile; for the I-frames, we use a fixed quantization parameter. The test video sequences used for simulation are "Foreman", "News", "Silent", and "Mother and Daughter", which are defined as test sequences in the document. The resolution of the test sequences is QCIF with 4:2:0 sampling. The target bit rates for the experiments are 32, 48, and 64 kbps, and the frame rate is 10 fps. First of all, we verify the accuracy of the proposed prediction method for the source standard deviation. Figure 1 compares the original standard deviation with the standard deviation predicted by the proposed method (Eq. (17)); the proposed method generates values close to the original ones.
Fig. 1. Standard deviation between the original value and the predictive value (original and estimated standard deviation plotted against macroblock number)
3.2 Performance Evaluation
In order to evaluate the proposed quantization algorithm, we compare the test results with the MPEG-2 TM5 [1], H.263 TMN8 [3], and Siwei's [4] algorithms. Table 2 compares the number of coding bits generated by the proposed algorithm with the target bits; the proposed algorithm generates a number of bits within 1% of the target.

Table 2. Coding bits and target bits for the proposed rate control (kbps)
Test Sequence | Target bits | Coding bits | Difference
News          | 32          | 31.83       | -0.17
Foreman       | 32          | 32.08       |  0.08
Silent        | 48          | 48.06       |  0.06
MAD           | 48          | 48.34       |  0.34
One of the measures used to show performance is PSNR. Table 3 compares the average PSNR values of the proposed algorithm and the others; the proposed scheme provides considerably good results. A uniform PSNR is also required to reduce quality variation from the viewpoint of the human visual system. Figure 2 shows the PSNR fluctuations for two sequences.

Table 3. Comparison of average PSNR among three rate control algorithms (dB)
Test Sequence | Target bits (kbps) | Proposed | Siwei's | TMN8  | TM5
News          | 32                 | 34.49    | 33.48   | 33.00 | 33.45
Foreman       | 32                 | 32.89    | 32.01   | 31.73 | 32.00
Silent        | 48                 | 35.64    | 34.90   | 34.31 | 34.70
MAD           | 48                 | 40.04    | 39.57   | 38.50 | 39.19
Fig. 2. PSNR fluctuation comparison: (a) Foreman, 64 kbps; (b) Silent, 48 kbps (PSNR in dB versus frame number for the proposed method, Siwei's method, TMN8, and TM5)
Fig. 3. Rate-distortion curves for (a) Foreman and (b) Silent at target rates of 32, 48, and 64 kbps (PSNR in dB versus rate in kbps for the proposed method, Siwei's method, TMN8, and TM5)
Fig. 4. Bits fluctuation comparison between the proposed and Siwei's algorithms: (a) Foreman, 48 kbps; (b) Silent, 48 kbps (coding bits versus frame number)
As shown in Figure 2, the proposed algorithm provides improved reconstructed image quality without the large quality variations of the other methods. In the figures, we show results for two test sequences, "Foreman" and "Silent", chosen because they have contrasting characteristics: "Foreman" has higher complexity in terms of motion and texture composition than "Silent". Figure 3 shows the rate-distortion curves for the three target bit rates of 32, 48, and 64 kbps; as Figure 3 clearly shows, the proposed algorithm provides higher coding efficiency than the others. One of the main applications of H.264 is mobile communications. A typical mobile channel has a low bit rate and a small capacity for transmitting video data, so small bit fluctuations are required at low bit rates to avoid abrupt changes in channel use. Figure 4 shows the bit fluctuations for two
sequences at the specified target rates. The proposed algorithm generates the coding bits without excessive variation.
4 Conclusion
The rate control for H.264 requires low computational complexity since H.264 already has high coding complexity. In this paper, we proposed an adaptive model-based rate control scheme for the effective coding of inter-frames. The proposed rate-quantization model is derived from the rate-distortion function of the Generalized Gaussian distribution, which can adaptively represent various distributions by changing the shape factor. The shape factor is obtained simply from the difference between macroblocks, and the quantization parameter is calculated with the designed rate-quantization model. Simulation results show that the proposed rate control scheme generates coding bits close to the target bits and provides improved coding efficiency at low bit rates.
References
1. MPEG-2, MPEG-2 Test Model 5 (TM5), Doc. ISO/IEC JTC1/SC29/WG11/N0400, Test Model Editing Committee, Apr. 1994.
2. MPEG-4, MPEG-4 Video Verification Model version 16 (VM16), Doc. ISO/IEC JTC1/SC29/WG11/N3312, Mar. 2000.
3. ITU-T SG16 Video Coding Experts Group, Video Codec Test Model, Near-Term Version 8 (TMN8), Sep. 1997.
4. Siwei Ma, Wen Gao, Peng Gao, and Yan Lu, "Rate Control for Advance Video Coding (AVC) Standard," International Symposium on Circuits and Systems, Vol. 2, pp. 892-895, May 2003.
5. Ning Wang and Yun He, "A New Bit Rate Control Strategy for H.264," Joint Conference of the 4th International Conference on Information, Communications and Signal Processing and 4th Pacific-Rim Conference on Multimedia, 3A2.6, Dec. 2003.
6. Liang-Jin Lin and Antonio Ortega, "Bit-Rate Control Using Piecewise Approximated Rate-Distortion Characteristics," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 8, No. 4, Aug. 1998.
7. Jordi Ribas-Corbera and Shawmin Lei, "Rate Control in DCT Video Coding for Low-Delay Communications," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 9, No. 1, Feb. 1999.
8. Zhihai He and Sanjit K. Mitra, "A Linear Source Model and a Unified Rate Control Algorithm for DCT Video Coding," IEEE Trans. on Circuits and Systems for Video Technology, Vol. 12, No. 11, Nov. 2002.
9. Edmund Y. Lam and Joseph W. Goodman, "A Mathematical Analysis of the DCT Coefficient Distribution for Images," IEEE Trans. on Image Processing, Vol. 9, No. 10, pp. 1661-1666, Oct. 2002.
10. Gregory S. Yovanof and Sam Liu, "Statistical Analysis of the DCT Coefficients and Their Quantization Error," Thirtieth Asilomar Conference, Vol. 1, pp. 601-605, Nov. 1996.
11. N. Wang and Y. He, "A new bit rate control strategy for H.264," Pacific-Rim Conference on Multimedia, Dec. 2003.
12. B. Aiazzi, L. Alparone, and S. Baronti, "Estimation Based on Entropy Matching for Generalized Gaussian PDF Modeling," IEEE Signal Processing Letters, Vol. 6, No. 6, pp. 138-140, June 1999.
13. "Draft ITU-T Recommendation and Final Draft International Standard of Joint Video Specification (ITU-T Rec. H.264 | ISO/IEC 14496-10 AVC)," Joint Video Team (JVT) of ISO/IEC MPEG and ITU-T VCEG (ISO/IEC JTC1/SC29/WG11 and ITU-T SG16 Q.6), JVT-G050.doc, 8th meeting, Geneva, Switzerland, May 2003.
The Design of Single Encoder and Decoder for Multi-view Video Haksoo Kim and Manbae Kim Kangwon National University Department of Computer, Information, and Telecommunication 192-1 Hoja2-dong, Chunchon 200-701, Republic of Korea
[email protected] Abstract. Recently, multi-view video has been an emerging media in the 3-D field. In general, the multi-view video processing requires encoders and decoders as many as the number of cameras, and thus the processing complexity results in difficulties of practical implementation. To solve this problem, this paper considers the design of a single encoder and a single decoder for the multi-view video. At the encoder side, multiview sequences are combined by a video mixer. Then, the mixed sequence is compressed by an H.264/AVC encoder. The decoding part is composed of a single decoder and a scheduler controlling the decoding process. The goal of the scheduler is to assign approximately identical number of decoded frames to each view sequence by estimating the decoder utilization of GOPs and subsequently applying frame skip methods. Experimental results show that identical decoder utilization is achieved for each view sequence. Finally, the performance of the proposed method is compared with a simulcast encoder in terms of bit-rate and PSNR using a RD (rate-distortion) curve.
1 Introduction
With the progress of information technologies and the demand of consumers, a variety of multimedia contents have been served to the consumers. Recently, beyond 2-D and stereoscopic video, multi-view video acquired from multiple cameras has been a newly emerging media [1,2,3,4,5], and can provide a wide view to consumers with immersive perception of natural scenes. In general, multi-view compression systems are implemented similar to a simulcast method, where each view sequence is separately encoded by its dedicated encoder. Fecker et al. [6] proposed a simulcast method. The compressed view sequences are combined into a single bitstream that is transmitted to the client. Anthony et al. [4] developed a multi-view transmission system, where each encoded view sequence is transmitted independently. Both approaches employ encoders and decoders as many as the number of view sequences, requiring complex system design. To solve this problem, this paper presents a single encoder and decoder system for processing multi-view video. A mixed sequence is generated by combining all the view sequences and then compressed by an
H.264/AVC encoder. Additional information such as the number of cameras, the GOP size, and the encoder profile is needed prior to encoding. At the decoder side, a single decoder is used to decode a compressed bitstream containing all view sequences. To assign fair decoding to each view sequence, a scheduling scheme is utilized. The proposed decoder is designed to satisfy the constant quality of decoded sequences that is one of the requirements of MPEG MVC [7]. The standardization of multi-view video compression is currently under way in the MPEG 3DAV MVC group [8]. The reference model of a multi-view encoder has been proposed by Fraunhofer-HHI; it is based on hierarchical B-pictures and an adapted prediction structure. The number of decoded picture buffers (DPB) needed in the encoding process is 2 × GOP length + number of views, and frame reordering is implemented for memory management. Starting from the first image of each view sequence, GOPs are zigzag-scanned and encoded. A main difference is that we utilize an ordinary H.264/AVC coder rather than the multi-view coder proposed in MPEG MVC; therefore, the encoder complexity, including the number of DPBs, does not increase in our method.

Definitions
N : Intra period
M : Interval between I/P pictures
K : Number of view sequences (number of cameras)
Vk : kth view sequence, k ∈ {1, 2, ..., K}
GOP(Vk) : GOP of Vk
GoGOP (Group of GOPs) : the set of GOPs in the time range [ti, ti+N]
Dk : decoding time of Vk in a GoGOP
Uk : decoder utilization of Vk in a GoGOP
S : maximum number of decoded frames in a GoGOP

This paper is organized as follows. In Section 2, the structure of the single encoder is presented along with the video mixing. In Section 3, the decoding system using a single decoder and a scheduler is discussed. Experimental results, evaluating the decoder performance and coding efficiency, are given in Section 4. The conclusion and future works are given in Section 5.
2 Single Encoder with Video Mixing
In order to implement a single encoder, view sequences need to be reordered by a video mixing. Fig. 1 shows how the four view sequences are reordered by the video mixing and then compressed by an H.264/AVC encoder. K and N are 4 and 4, respectively. In Fig. 1 (a), each view sequence has an identical GOP structure (e. g., IPPPPI). The encoding order is GOP1 (V1 ), GOP1 (V2 ), GOP1 (V3 ), GOP1 (V4 ), GOP2 (V1 ), GOP2 (V2 ), ... as illustrated in Fig. 1 (b). The multi-view sequences are encoded in GOP units. The ith group of GOPs, GoGOPi consists of GOPi (V1 ), GOPi (V2 ), GOPi (V3 ), and GOPi (V4 ). The video mixing is designed in a manner that the multi-view video can be encoded by a
conventional H.264/AVC encoder, thus maintaining the identical encoder complexity. The video mixing can be applied to baseline and main profiles of H.264/ AVC. The main profile includes B picture in addition to I and P pictures of the baseline profile [9].
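To make the mixing order concrete, here is a small sketch that produces the GOP-interleaved encoding order described above (GOP1(V1), GOP1(V2), ..., GOP2(V1), ...). The frame-index bookkeeping and the generator interface are our own; the paper only specifies the ordering itself.

```python
def mixing_order(num_views: int, gop_size: int, num_gops: int):
    """Yield (view, frame) pairs in the single-encoder video-mixing order."""
    for g in range(num_gops):                 # GoGOP index
        for v in range(1, num_views + 1):     # V1, V2, ..., VK inside each GoGOP
            for f in range(gop_size):         # frames of GOP_g(Vv)
                yield v, g * gop_size + f

# Example: K = 4 views, N = 4 frames per GOP, 2 GoGOPs
order = list(mixing_order(num_views=4, gop_size=4, num_gops=2))
```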
Fig. 1. How view sequences are mixed and compressed: (a) four view sequences, and (b) the mixed sequence compressed by an H.264/AVC encoder
3 Decoding Multi-view Video
This section presents the multi-view decoding system, which is composed of an H.264/AVC decoder and a scheduler controlling the decoder. Although different scheduling schemes are used for the baseline and main profiles, both are designed so that each encoded view sequence maintains a consistent quality of decoded sequences (e.g., the number of decoded frames). The scheduler is composed of three parts: the computation of the decoder utilization of each view sequence, the view priority decision, and the frame skip. Since the number of frames that can be decoded is limited (e.g., 30 fps), the scheduler effectively controls the number of decoded frames for each view sequence. Further, in order to reduce the degradation of visual image quality caused by the frame skip process, an effective scheme for selecting discarded frames based upon a cost function is presented. Finally, we discuss the frame skip schemes for the baseline and main profiles.
The block diagram of the proposed multi-view decoder is illustrated in Fig. 2. The compressed bitstream of K view sequences is separated by the stream splitter, and the compressed data of each view is stored in buffer(Vk) (k = 1, ..., K). The scheduler manages the decoding of the stored buffer data in GOP units. For a GOP of Vk, the decoding time Dk is computed, and the decoder utilization Uk is then computed by

Uk = Dk / Σ_{k=1}^{K} Dk        (1)

where Uk lies in the range [0, 1].
Fig. 2. Block diagram of the proposed multi-view decoding system
The view priority adaptively varies every GoGOP according to the Uk of the current GOP: a view sequence with low Uk is assigned a higher priority and is thus decoded earlier in the next GOP. The frame skip is mainly composed of the skipped-frame decision and the decoded-frame selection, as shown in Fig. 3. The former decides the number of discarded frames among the N GOP frames: for every Vk, the number of discarded frames lk is estimated from Uk. According to lk, the decoded-frame selection chooses the frames to be decoded based upon a skip mode that is either sequential or non-sequential. For the main profile of H.264/AVC both skip modes can be used, whereas only the sequential skip mode is used for the baseline profile. Both modes utilize a cost function related to visual image quality [10,11]. Finally, the GOP frames to be decoded are determined. Suppose that when the lth frame Fl in a GOP(Vk) is discarded, the successive frames Fl+1, Fl+2, ..., FN are also discarded; the number of decoded frames is then (l − 1). For all GOPs, the total number of decoded frames in a GoGOP is

(l1 − 1) + (l2 − 1) + · · · + (lK − 1) = Σ_{k=1}^{K} lk − K        (2)

where lk is the first frame to be discarded in GOP(Vk).
Fig. 3. Block diagram of the proposed frame skip
The maximum number of decoded frames in a GoGOP is K × N. However, since typical computing power cannot satisfy such a requirement, the number of actually decoded frames, S, is less than or equal to K × N. Eq. (2) then becomes

Σ_{k=1}^{K} lk − K = S        (3)

where S lies in the range [0, K × N]. We utilize Eq. (2) to find l1, l2, ..., lK that satisfy Eq. (3). Given Uk, lk is estimated using an inverse relation: if Uk ∈ [0, 1] is inversely mapped to lk ∈ [S, 1], this is expressed by

l̂k = −(S − 1)·Uk + S        (4)

where l̂k is the estimate. Summing over the K views, we have

Σ_{k=1}^{K} l̂k = Σ_{k=1}^{K} [ −(S − 1)·Uk + S ] = S(K − 1) + 1        (5)

since Σ_{k=1}^{K} Uk = 1. From Eqs. (3) and (5), the first frame number to be discarded for GOP(Vk) is computed by

lk* = ⌊ (S + K) · l̂k / Σ_{k=1}^{K} l̂k ⌋        (6)
where ⌊·⌋ denotes the floor function. The number of decoded frames for GOP(Vk) is lk* − 1. A minimal sketch of this scheduling computation is given below.
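The following sketch combines Eqs. (1), (4), and (6): it converts per-view GOP decoding times into utilizations, maps them to the estimates l̂k, and returns the first discarded frame index lk* for each view. It is a plain reading of the equations, with our own function name and no buffering or bitstream logic.

```python
import math

def frame_skip_estimates(decode_times, s):
    """decode_times: per-view GOP decoding times D_k; s: frames the decoder can afford."""
    total = sum(decode_times)
    u = [d / total for d in decode_times]            # Eq. (1): decoder utilizations
    l_hat = [-(s - 1) * uk + s for uk in u]          # Eq. (4): inverse mapping to [S, 1]
    l_hat_sum = sum(l_hat)                           # equals S(K-1)+1 by Eq. (5)
    k = len(decode_times)
    # Eq. (6): first frame to discard in each GOP(V_k)
    return [math.floor((s + k) * lh / l_hat_sum) for lh in l_hat]

# Example with K = 4 views and S = 40 affordable frames per GoGOP
l_star = frame_skip_estimates([12.0, 9.0, 10.0, 11.0], s=40)
decoded_per_view = [l - 1 for l in l_star]
```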
The next procedure is to determine which frames are decoded. A simple approach is to decode all consecutive frames before the first discarded frame. The other is to discard non-consecutive frames, which achieves better visual image quality. In order to provide a measure of playback discontinuity, a cost function φ(S) is defined. It takes two aspects of playback discontinuity into account: the length of a run of consecutively discarded frames, and the distance between two adjacent but non-consecutive discarded frames. The cost function assigns a cost ci to a discarded frame F depending on whether it belongs to a run of consecutively discarded frames or not. If frame F is the mi-th consecutively discarded frame in such a run, its cost ci is defined to be mi; otherwise, the cost is based on the distance di to the previous discarded frame and is given by ci = 1 + 1/√di. The total cost is the sum of the costs of all discarded frames in the sequence (a small sketch of this cost function is given after Table 1). Since the baseline profile of H.264/AVC defines only I and P pictures, a sequential skip mode is used. For the GOP structure I1 P1 P2 P3 P4 P5 I2 P6 P7 P8, if I1 is discarded, then P1, ..., P5 are also discarded; in general, if Pi is discarded, all successive frames in the same GOP are discarded. Therefore, if three frames in the current GOP and two frames in the next GOP are discarded, the decoded frames are I1, P1, P2 and I2, P6. Unlike the baseline profile, two skip modes can be utilized for the main profile; the non-sequential mode, however, gives a lower cost than the sequential one. The selection of decoded frames depends upon the picture types I, P, and B. Table 1 shows some examples of frame selection with varying GOP structures; the selection order indicates the degree of importance for consistent visual image quality. For the main profile with N = 9 and M = 3, the selection order is I P1 P2 B1 B3 B5 B2 B4 B6; if the total number of decoded frames is six, then I, P1, P2, B1, B3, and B5 are decoded. The selection order can also be read off the other two examples.
Table 1. The non-sequential skip mode and the selection order

Main profile (N=9, M=3):   frames  I  B1 B2 P1 B3 B4 P2 B5 B6
                           order   1  4  7  2  5  8  3  6  9
Main profile (N=8, M=4):   frames  I  B1 B2 B3 P1 B4 B5 B6
                           order   1  3  5  7  2  4  6  8
Main profile (N=10, M=5):  frames  I  B1 B2 B3 B4 P1 B5 B6 B7 B8
                           order   1  3  5  7  9  2  4  6  8  10
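As an illustration of the cost function φ described above, the sketch below scores a GOP-level skip pattern: a frame inside a run of consecutive drops costs its position in the run, and an isolated drop costs 1 + 1/√di, with di the distance to the previous drop. How the very first drop and the first frame of a run are scored is not fully specified in the text; the sketch treats them as isolated drops. The boolean-list encoding of the pattern is our own convention.

```python
import math

def playback_cost(dropped):
    """dropped[i] is True if frame i of the GOP is discarded; returns the total cost."""
    cost = 0.0
    prev_drop = None          # index of the most recent discarded frame
    run_len = 0               # length of the current run of consecutive drops
    for i, is_dropped in enumerate(dropped):
        if not is_dropped:
            continue
        if prev_drop is not None and prev_drop == i - 1:
            run_len += 1
            cost += run_len                       # m_i-th consecutive drop costs m_i
        elif prev_drop is None:
            run_len = 1
            cost += 1.0                           # first drop of the GOP (assumed cost 1)
        else:
            run_len = 1
            d = i - prev_drop                     # distance to the previous drop
            cost += 1.0 + 1.0 / math.sqrt(d)      # isolated drop: 1 + 1/sqrt(d_i)
        prev_drop = i
    return cost

# Example: dropping frames 3, 4 and 7 of a 9-frame GOP
# phi = playback_cost([False, False, False, True, True, False, False, True, False])
```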
4 Experimental Results
We have performed various experiments in order to examine the performance of the proposed single encoder and decoder design. Test multi-view sequences are aquarium, flamenco, objects, and race provided by KDDI, Japan [8]. The image size of the test sequences is 320 × 240. We used H.264/AVC JM (Joint
Model) 7.6 reference software. Bit rates are 128, 256, and 512 Kbps, and the QP (quantization parameter) is fixed to 28. Intra Decoded Refresh (IDR) is set to zero. As given in Table 2, mixed test sequences for the baseline and main profiles were produced with various combinations of (K, N, M) and encoded by H.264/AVC. For example, b21.yuv with (K, N) = (2, 12) is made by combining flamenco1.yuv and aquarium1.yuv; the total number of frames NT is 300 and the encoded file is b21.264. The reason different sequences are mixed is that we wanted to examine the effect of the inter-coding modes of H.264/AVC on the decoded frames. m21.yuv with (K, N, M) = (2, 12, 3) was made from flamenco1.yuv and aquarium1.yuv, and its compressed bitstream is m21.264; its total number of frames is 240. The other test sequences were produced in a similar manner.

Table 2. Test sequences of baseline and main profiles. NT = total frames.

Baseline profile:
b11.yuv: K=1, N=4,  NT=100, encoded file b11.264
b12.yuv: K=1, N=12, NT=300, encoded file b12.264
b21.yuv: K=2, N=12, NT=300, encoded file b21.264
b22.yuv: K=2, N=24, NT=300, encoded file b22.264
b41.yuv: K=4, N=12, NT=400, encoded file b41.264
b42.yuv: K=4, N=24, NT=400, encoded file b42.264
b81.yuv: K=8, N=12, NT=500, encoded file b81.264
b82.yuv: K=8, N=24, NT=500, encoded file b82.264

Main profile:
m11.yuv: K=1, N=4,  M=1, NT=80,  encoded file m11.264
m12.yuv: K=1, N=12, M=2, NT=120, encoded file m12.264
m21.yuv: K=2, N=12, M=3, NT=240, encoded file m21.264
m22.yuv: K=2, N=12, M=2, NT=240, encoded file m22.264
m23.yuv: K=2, N=12, M=6, NT=240, encoded file m23.264
m41.yuv: K=4, N=12, M=6, NT=480, encoded file m41.264
m42.yuv: K=4, N=12, M=4, NT=480, encoded file m42.264
m43.yuv: K=4, N=16, M=8, NT=512, encoded file m43.264
m81.yuv: K=8, N=12, M=3, NT=576, encoded file m81.264
m82.yuv: K=8, N=24, M=6, NT=576, encoded file m82.264

Fig. 4 shows the variation of decoder
utilization of the baseline profile sequences. In case of b41.264 with four views, the decoder utilization is approximately 0.25. Similar results are also observed for b81.264, where the mean decoder utilization is approximately 0.125. Fig. 5 shows the decoder utilization of two main profile sequences, m21.264 and m43.264. The graphs show similar performance with the baseline profile sequences. The variation of cost functions for two main profile sequences, m41.264 and m81.264 is observed in Fig. 6. As predicted, the cost function decreases linearly as S increases. The total number of frames in a GoGOP for m41.264 and m81.264 is 48 and 96, respectively. At S = 40 and 80, the total costs are close to zero. We compared the RD (rate-distortion) performance of the proposed method with a simulcast method as given in Fig. 7. The GOP structure is IBBP... and N is 12. For flamenco1, the simulcast method outperforms ours by PSNR of 2 dB on average. On the contrary, our method outperforms the simulcast one by 8 dB for race1. For objects1 with eight view sequences, two methods show the similar performance. Experimental results show that the performance depends upon image motions contained in multi-view video.
Fig. 4. The variation of decoder utilization for baseline profile sequences
Fig. 5. The variation of decoder utilization for main profile sequences
Fig. 6. The variation of cost functions for main profile sequences with respect to S
Fig. 7. Performance comparison of simulcast sequence and video-mixed sequence
5 Conclusion and Future Works
In this paper, we presented a single-encoder and single-decoder design for multi-view video processing, in contrast to the simulcast approach, which requires as many encoders and decoders as there are camera views. Furthermore, a conventional H.264/AVC encoder and decoder were employed, unlike the MPEG MVC multi-view coder. To carry out this design, a video mixing that reorders the view sequences was proposed. At the decoder side, an appropriate scheduling scheme as well as frame skip modes was proposed. For consistent quality of the decoded view sequences, the decoder utilization was used to decide the frames to be decoded, and a cost function was presented for the baseline and main profiles of H.264/AVC to control visual image quality. Experimental results on various multi-view test sequences validate that each view sequence is fairly decoded. Finally, the coding efficiency verified by RD curves showed that the proposed method achieves performance similar to the simulcast method. For future work, we plan to extend our method to multiple decoders, since this could produce better performance. Secondly, clients might request particular views rather than all the view sequences; our system then needs to be adapted to this application, and one approach will be the application of DIA (Digital Item Adaptation) of the MPEG-21 Multimedia Framework.
Acknowledgment Authors would like to thank anonymous reviewers for their valuable comments. This research was supported by the MIC(Ministry of Information and Communication), Korea, under the ITRC(Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment)(RBRC: IITA-2006-(C1090-0502-0022)).
References
1. E. Cooke, I. Feldman, P. Kauff, and O. Schreer, "A modular approach to virtual view creation for a scalable immersive teleconferencing," IEEE International Conf. on Image Processing, 2003.
2. W. Matusik and H. Pfister, "3D TV: a scalable system for real-time acquisition, transmission and autostereoscopic display of dynamic scenes," Proc. of ACM SIGGRAPH, 2004.
3. S. Lee, K. Lee, C. Han, J. Yoo, M. Kim and J. Kim, "View-switchable stereoscopic HD video over IP networks for next generation broadcasting," SPIE Photonic East, Vol. 6016, Oct. 2005.
4. A. Vetro, W. Matusik, H. Pfister, and J. Xin, "Coding approaches for end-to-end 3D TV systems," Picture Coding Symposium, Dec. 2004.
5. Q. Zhang, W. Zhu and Y.-Q. Zhang, "Resource allocation for multimedia streaming over the Internet," IEEE Trans. on Multimedia, Vol. 3, No. 3, Sep. 2001.
6. Ulrich Fecker and Andre Kaup, "Transposed Picture Ordering for Dynamic Light Field Coding," ISO/IEC JTC1/SC29/WG11, N10929, Redmond, USA, July 2004.
7. MPEG/ISO/IEC JTC1/SC29/WG11, "Requirements on Multi-view Video Coding," N6501, Redmond, USA, July 2004.
8. MPEG/ISO/IEC JTC1/SC29/WG11, "Multiview Coding using AVC," M12945, Bangkok, Thailand, Jan. 2006.
9. I. Richardson, H.264 and MPEG-4 Video Compression, Wiley, 2003.
10. K. Chebrolu and R. Rao, "Selective frame discard for interactive video," IEEE International Conference on Communications, Vol. 7, pp. 4097-4102, June 2004.
11. Z.-L. Zhang, S. Nelakuditi, R. Agarwal, and R. Tsang, "Efficient selective frame discard algorithms for stored video delivery across resource constrained networks," IEEE INFOCOM, 1999.
Vector Quantization in SPIHT Image Codec Rafi Mohammad and Christopher F. Barnes School of Electrical and Computer Engineering, Georgia Institute of Technology, Atlanta, GA 30332-0250, USA
[email protected],
[email protected] Abstract. The image coding1 algorithm “Set Partitioning in Hierarchical Trees (SPIHT)” introduced by Said and Pearlman achieved an excellent rate-distortion performance by an efficient ordering of wavelet coefficients into subsets and bit plane quantization of significant coefficients. We observe that there is high correlation among the significant coefficients in each SPIHT pass. Hence, in this paper we propose trained scalar-vector quantization (depending on a boundary threshold) of significant coefficients to exploit correlation. In each pass, the decoder reconstructs coefficients with scalar or vector quantized values rather than with bit plane quantized values. Our coding method outperforms the scalar SPIHT coding in the high bit-rate region for standard test images.
1 Introduction
All natural images consist of both trends (low frequency content) and anomalies (high frequency content) and wavelet transform is well suited for capturing these phenomena in a small fraction of total coefficients. Due to this excellent energy compaction property, wavelet transform based image compression has grown popular over the last decade [1], [2], [3], [4]. Also, reconstructed images are free of blocking artifacts common to DCT compressed images. Shapiro’s Embedded Zerotree Wavelet (EZW) coder achieved excellent rate-distortion performance by using a novel zerotree data structure [2], successive approximation quantization and arithmetic coding. EZW produces an embedded bit stream that is suitable for remote browsing applications. Said and Pearlman introduced the SPIHT embedded coder [4] as an improvement of EZW, and its performance surpassed the performance of all existing image coding algorithms. Mukherjee and Mitra [5] introduced vector SPIHT (VSPIHT) for image coding by forming blocks of wavelet coefficients before zerotree coding. VSPIHT algorithm for image compression achieved modest improvement over scalar SPIHT [4] algorithm. In this paper we propose a vector quantization of significant coefficients of each pass of SPIHT below a boundary threshold and scalar quantization of significant coefficients above the same threshold. The resulting indices are sent to the decoder for decoding the compressed image. Here, for simplicity we have not performed any further compression of indices through entropy coding. Unlike 1
(Footnote 1: The terms coding and compression are used synonymously in this paper.)
in [5], we form blocks of significant coefficients after they are found using zerotree prediction. Section 2 briefly explains the SPIHT image coding algorithm and the analysis of significant data using correlation. In Section 3, we explain how to utilize the correlation among coefficients to efficiently compress the significant coefficients using a combination of scalar and vector quantization based on the Lloyd algorithm [6]. Section 4 gives the coding results of standard test images using this scalar and vector quantized SPIHT (SQ-VQ SPIHT) algorithm. Section 5 concludes the paper.
Fig. 1. Spatial Orientation Tree
2 Analysis of Significant Coefficients in SPIHT
In a hierarchical subband decomposition such as wavelets there exists a relationship among coefficients across scales in the form of a tree known as the Spatial Orientation Tree (SOT) [4]. SOT consists of a root node and all of its descendants as shown in Fig. 1 for a 3 level dyadic subband decomposition. SPIHT uses three lists to encode and decode significant information: List of insignificant pixels (LIP), List of insignificant sets (LIS) and List of significant pixels (LSP). In each pass coefficients are classified into these lists using a significance test [4]. Every SPIHT pass consists of sorting, refinement and quantization update steps and coefficients are compared to a threshold (T = 2n ) for significance. In the initial pass, T is set as 2n0 , where n0 is chosen such that maximum coefficient cmax ∈ [2n0 , 2n0 +1 ). LIP is initialized with the elements of the highest level low-low frequency subband and LIS is initialized with SOT root nodes. During the sorting step of a pass, LIP elements are tested for significance and moved to LSP, if found significant. Also, the sets in LIS are tested for significance in the
same step and partitioned into subsets. In the refinement step, the elements in the LSP are refined by adding the nth most significant bit (MSB) to the encoder bit stream in the following pass; n is decreased by 1 before the next pass in the quantization update step. Since the coefficients are successively refined with a threshold that is a power of two, SPIHT encoding can be considered Bit Plane Coding (BPC) [7]. BPC is similar to the binary representation of real numbers: with each binary digit added to the right, more precision is added. Correlation between coefficients can be quantified using an autocorrelation function (acf) [8], which measures the closeness or similarity of two samples as a function of their space or time separation k on average. The variance-normalized acf is defined as

ρ_k = R_cc(k) / R_cc(0) = (1/(N_c·σ_c²)) · Σ_{n=0}^{N_c−|k|−1} c(n)·c(n+|k|)        (1)

R_cc(0) = σ_c²        (2)
where N_c is the number of coefficients in a particular pass, i.e., coefficients with magnitude in the interval [T, 2T), σ_c² is the variance of the coefficients, and R_cc(k) is the autocorrelation function. We observe that there is a high correlation among the coefficients resulting from each sorting step; Fig. 2 shows the variation of ρ_k with k for various intervals for the standard test image Barbara. Note that the coefficients in each pass are absolute values; their signs are encoded separately. Similar correlation is found among the coefficients of each sorting step in all the natural images tested. From Fig. 2, it can be seen that absolute coefficients with magnitude less than 1 have ρ_k > 0.95 for k = 1, 2, ..., 10.
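For reference, the variance-normalized acf of Eq. (1) can be computed as in the short sketch below. The NumPy implementation is ours, and Eq. (1) is taken literally (no mean removal, normalization by N_c·σ_c²).

```python
import numpy as np

def normalized_acf(c: np.ndarray, k: int) -> float:
    """Variance-normalized acf of Eq. (1), taken literally."""
    c = np.asarray(c, dtype=float)
    n_c = c.size
    var = c.var()                              # sigma_c^2
    lag = abs(k)
    prod = np.sum(c[: n_c - lag] * c[lag:])    # sum_{n=0}^{N_c-|k|-1} c(n) c(n+|k|)
    return prod / (n_c * var)

# rho_k for lags 1..10 of the significant coefficients of one pass
# rho = [normalized_acf(coeffs, k) for k in range(1, 11)]
```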
Fig. 2. Variation of normalized acf ρ_k with k (T represents the interval [T, 2T)); curves are shown for T = 8, 4, 2, 1, 0.5, and 0.25
3 Scalar-Vector Quantization of Significant Coefficients
Shannon’s rate-distortion theory tells that better compression can be achieved by coding vectors rather than scalars [9]. All natural images have a high probability of having insignificant coefficients and these can be coarsely quantized or discarded [10]. Initial few passes of SPIHT for all natural images result in significant coefficients of high magnitude, hence these coefficients need to be scalar quantized for finer classification. Therefore, we choose to vector quantize the coefficients below and scalar quantize the coefficients above a boundary threshold Tsq−vq . The value of Tsq−vq has to be always less than the initial quantization level T = 2n0 so that coefficients of few initial passes can be scalar quantized. In a fixed rate vector quantizer of codebook size N and block length m, every code vector, yi , is represented by the same number of bits (log2 N ) and the code rate of this VQ is log2 N/m bits per vector component. The goal of optimal vector quantizer is to design a codebook Y = (yi , i = 1, 2, · · · N ) such that the expected distortion between an input vector x and its reproduction vector x ˆ is minimized. The LBG algorithm [11], which is an extension of the Lloyd algorithm for scalar quantizers was developed to produce such a codebook for vector quantizer. For m = 1, vector quantization (VQ) becomes non-uniform scalar quantization (SQ). SQ-VQ SPIHT image encoding algorithm is given in Fig.3 and decoding algorithm will be similar [4] to the encoding algorithm. The refinement step of SPIHT algorithm is replaced by the scalar-vector quantization step. SQ-VQ SPIHT decoder reconstructs the coefficients simply as a table look up operation for scalar and vector quantizers. For a p pass image coding, the total number of bits spent in SPIHT p−1
nb1 = Σ_{i=1}^{p−1} ne_i + nb_sort        (3)

The total number of bits spent in the scalar-vector quantized SPIHT in p passes is

nb2 = Σ_{i=1}^{p_sq} ne_i · log2(N1) + Σ_{i=p_sq+1}^{p} ne_i · ( log2(N2) / m ) + nb_sort        (4)
Here ne_i is the number of elements in sorting step i, p_sq is the number of passes in which coefficients are scalar quantized, N1 is the scalar-quantizer codebook size, N2 is the vector-quantizer codebook size, and nb_sort is the total number of sorting-step bits in p passes. Note that sorting-step coefficients are quantized in the same pass, whereas in SPIHT the current sorting-step coefficients are refined in the refinement step of the following pass.
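A direct transcription of Eqs. (3) and (4) is shown below; it compares the index bits spent by bit-plane SPIHT and by SQ-VQ SPIHT for given per-pass significant-coefficient counts. The function names are illustrative; the default codebook sizes and block length follow the experimental settings reported in Section 4 (N1 = 16, N2 = 256, m = 4).

```python
import math

def spiht_bits(ne, nb_sort):
    """Eq. (3): ne[i] holds the significant-coefficient count of pass i+1 (passes 1..p)."""
    p = len(ne)
    return sum(ne[:p - 1]) + nb_sort

def sqvq_spiht_bits(ne, nb_sort, p_sq, n1=16, n2=256, m=4):
    """Eq. (4): scalar indices for the first p_sq passes, vector indices afterwards."""
    sq = sum(ne[:p_sq]) * math.log2(n1)
    vq = sum(ne[p_sq:]) * (math.log2(n2) / m)
    return sq + vq + nb_sort
```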
4 Implementation and Results
With the above background, we present the actual implementation details and coding results for standard gray scale images used by the image coding research
Fig. 3. SQ-VQ SPIHT image encoder (after initialization and each sorting step, if T < T_sq−vq the significant coefficients are formed into blocks of dimension m and VQ indices are generated; otherwise SQ indices are generated; the process repeats until the last quantization level)
community. Each of the N_tr natural images is wavelet transformed using bi-orthogonal filters for 5 levels. After subtracting the mean from the highest-level low-low subband, the coefficients are grouped into several classes. Each class contains coefficients with absolute value in [2^(n0−n), 2^(n0−n+1)), where n0 is
Fig. 4. Histogram plot of significant wavelet coefficients for the Barbara image (number of coefficients versus coefficient value)
Fig. 5. Rate-distortion curves for (a) Boat and (b) Barbara images using 17/11 and 9/7 bi-orthogonal filters (PSNR in dB versus bits per pixel; T_sq−vq = 1; curves for SQ-VQ SPIHT and SPIHT with each filter)
chosen as per the maximum value of coefficients and n takes values 0, 1, · · · , pmax . Here, pmax is chosen as per the highest bit rate requirement. Fig.4 shows the histogram plot of significant coefficients resulting from all SPIHT passes at 0.1 bits per pixel (bpp) for Barbara image. We can see from the histogram that most of the coefficients are close to zero in magnitude and these coefficients need to be vector quantized by forming blocks for efficient compression. A few high magnitude coefficients have to be scalar quantized for finer classification. We find the distribution of significant coefficients similar to this histogram for all natural images tested. Selecting a suitable boundary threshold, Tsq−vq , between scalar and vector quantizers is an important step. It is found that the threshold Tsq−vq is highly image dependent and its optimal value will result in good compression performance. A trained scalar codebook of size N1 is formed for each class of coefficients having a magnitude greater than Tsq−vq . For vector quantization, training data of block length m is formed from each class of coefficients having magnitude below Tsq−vq . Then, these vectors of each class are quantized using the LBG algorithm [11]. Thus, we have scalar codebooks for each class of coefficients above Tsq vq and vector codebooks below Tsq−vq .
748
R. Mohammad and C.F. Barnes (a)
(b)
36
36
34
34 T
sq−vq
=8
T
sq−vq
=4
32 PSNR (dB)
PSNR (dB)
32
30
28
26
30
28
26 SQ−VQ SPIHT SPIHT
SQ−VQ SPIHT SPIHT
24
22 0.1
24
0.2
0.3
0.4
0.5 0.6 0.7 Bits per pixel (bpp)
0.8
0.9
22 0.1
1
0.2
0.3
0.4
0.5 0.6 0.7 Bits per pixel (bpp)
(c)
0.8
0.9
1
0.9
1
(d)
38
40
36 35 Tsq−vq = 0.25
34 T
sq−vq
= 0.5
30 PSNR (dB)
PSNR (dB)
32 30
25
28 20 26 24 22 0.1
SQ−VQ SPIHT SPIHT
SQ−VQ SPIHT SPIHT
0.2
0.3
0.4
0.5 0.6 0.7 Bits per pixel (bpp)
0.8
15
0.9
1
10 0.1
0.2
0.3
0.4
0.5 0.6 0.7 Bits per pixel (bpp)
0.8
Fig. 6. Rate-distortion curves at various boundary thresholds(Tsq−vq ) for Barbara image
To encode an image in SQ-VQ SPIHT, it is first wavelet transformed using biorthogonal filters for 5 levels and then the mean is subtracted from its top level low-low band. After the sorting pass [4], coefficients in the interval [T 2T ), where T ≥ Tsq−vq , are scalar quantized with a codebook of size N1 and the coefficients in the interval [T 2T ), where T < Tsq−vq , are vector quantized with a codebook of size N2 . The scalar and vector indices are sent to the decoder using log2 N1 and log2 N2 /m bits respectively for each coefficient. For simplicity, the scalar and vector quantized indices are not entropy-coded. We have used Ntr = 230 training images , code book sizes N1 = 16, N2 = 256 and block length m = 4 in all our coding experiments. Also, in our experiments we have not included test images while forming the training data. From the rate-distortion curves given in Fig.5 for Boat and Barbara images, it can be seen that SQ-VQ SPIHT performs better in the high bit region compared to low bit rate region. Also, there is negligble effect on rate-distortion performance with the type of bi-orthogonal filter (17/11 or 9/7) used. To see the effect of the boundary threshold Tsq−vq on rate-distortion performance, we have performed coding experiments at various Tsq−vq values for Barbara image, as shown in Fig.6(a), (b), (c) and (d) using 17/11 bi-orthogonal filters. It can be seen from these figures that as Tsq−vq is reduced, mean square error of coded
Fig. 7. Visual quality results for the Boat image: (a) original image, (b) SQ-VQ SPIHT at 0.2 bpp (PSNR = 28.57 dB)

Fig. 8. Visual quality results for the Barbara image: (a) original image, (b) SQ-VQ SPIHT at 0.65 bpp (PSNR = 32.69 dB)
images increases, resulting in poor PSNR values. At low T_sq−vq values, a large number of significant coefficients need to be scalar quantized compared with high T_sq−vq values, which incurs a high index bit cost and results in poor rate-distortion performance. At a high T_sq−vq value, only a few significant coefficients need to be scalar quantized and the rest are vector quantized, resulting in good rate-distortion performance (as the index bit cost per coefficient is lower in VQ than in SQ), as shown in Fig. 6(a) and (b). In particular, Fig. 6(b) shows that the boundary threshold T_sq−vq = 4 results in the best rate-distortion performance for the Barbara image. Figures 7 and 8 show the visual quality of the Boat and Barbara images coded at 0.2 bpp and 0.65 bpp, respectively, with the SQ-VQ SPIHT algorithm using T_sq−vq = 1 and 17/11 bi-orthogonal filters.
The SQ-VQ SPIHT encoding complexity is higher than that of SPIHT due to exhaustive search vector quantization and non uniform scalar quantization. But the decoding complexity of the same is less than that of SPIHT due to table look up operation. The non-uniform SQ and VQ complexity can be reduced significantly by using residual quantizers (scalar and vector) [12] in SQ-VQ SPIHT.
5 Conclusion
We have presented scalar and vector quantization of significant coefficients in the SPIHT image codec. SQ-VQ SPIHT performs better than the SPIHT coding algorithm in the high bit-rate region compared with the low bit-rate region, and we found that choosing a suitable boundary threshold results in good rate-distortion performance. In the future, we need to find an automatic method to calculate an image-specific boundary threshold. SQ-VQ SPIHT performance can also be improved further by employing variable-rate residual vector quantization instead of the exhaustive-search vector quantization used in this work.
References
1. M. Antonini, M. Barlaud, P. Mathieu, and I. Daubechies, "Image coding using wavelet transform," IEEE Trans. Image Processing, vol. 1, pp. 205-220, Apr. 1992.
2. Jerome M. Shapiro, "Embedded image coding using zerotrees of wavelet coefficients," IEEE Trans. Signal Processing, vol. 41, pp. 3445-3462, Dec. 1993.
3. E.A.B. Da Silva, D.G. Sampson, and M. Ghanbari, "A successive approximation vector quantizer for wavelet transform image coding," IEEE Trans. Image Processing, vol. 5, pp. 299-310, Feb. 1996.
4. Amir Said and William Pearlman, "A new, fast, and efficient image codec based on set partitioning in hierarchical trees," IEEE Trans. Circuits and Syst. Video Technol., vol. 6, pp. 243-250, June 1996.
5. Debargha Mukherjee and Sanjit K. Mitra, "Vector SPIHT for embedded wavelet video and image coding," IEEE Trans. Circuits and Syst. Video Technol., vol. 13, pp. 231-246, Mar. 2003.
6. Allen Gersho and Robert M. Gray, Vector Quantization and Signal Compression, Kluwer Academic Publishers, 1992.
7. Brian A. Banister and Thomas R. Fischer, "Quantization performance in SPIHT and related wavelet image compression algorithms," IEEE Sig. Proc. Letters, vol. 6, pp. 97-99, May 1999.
8. N. S. Jayant and Peter Noll, Digital Coding of Waveforms, Prentice-Hall, 1984.
9. Robert M. Gray, "Vector quantization," IEEE ASSP Mag., vol. 1, pp. 4-29, Apr. 1984.
10. Al Bovik, Handbook of Image and Video Processing, Academic Press, San Diego, USA, 2000.
11. Y. Linde, A. Buzo, and R.M. Gray, "An algorithm for vector quantizer design," IEEE Trans. Commun., pp. 84-95, Jan. 1980.
12. Christopher F. Barnes, Syed A. Rizvi, and Nasser M. Nasrabadi, "Advances in residual vector quantization: A review," IEEE Trans. Image Processing, vol. 5, pp. 226-262, Feb. 1996.
Objective Image Quality Evaluation for JPEG, JPEG 2000, and Vidware VisionTM Chung-Hao Chen, Yi Yao, David L. Page, Besma Abidi, Andreas Koschan, and Mongi Abidi The University of Tennessee Imaging, Robotics and Intelligent System Laboratory Knoxville, Tennessee 37996-2100, USA {cchen10, yyao1, dpage, besma, akoschan, abidi}@utk.edu
Abstract. In this paper, three compression methods, JPEG, JPEG 2000, and Vidware VisionTM are evaluated by different full- and no-reference objective image quality measures including Peak-Signal-to-Noise-Ratio (PSNR), structural similarity (SSIM), and Tenengrad. In the meantime, we also propose an image sharpness measure, non-separable rational function based Tenengrad (NSRT2), to address whether the compression method is appropriate to be used in a machine recognition application. Based on our experimental results Vidware VisionTM is more robust to changes in compression ratio and presents gradually degraded performance at a considerably slower speed thus outperforming JPEG and JPEG 2000 when the compression ratio is smaller than 0.7%. Furthermore, the effectiveness of our proposed measurement, NSRT2, is also validated via experiments and performance comparisons with other objective image quality measures. Keywords: JPEG, JPEG 2000, Vidware VisionTM, image quality, mage sharpness, and image compression.
1 Introduction

Image compression [1], [2], [3] is an essential component of image retrieval [5], recognition [6], and internet applications [7]. The goal of compression is to minimize the size of the data being broadcast or stored, thus minimizing transmission time and storage space while still maintaining a desired quality compared to the original image. One of the most popular lossy compression approaches for still images is JPEG. JPEG [2] stands for Joint Photographic Experts Group, the name of the committee that developed the standard. JPEG compression divides the input image into 8 by 8 pixel blocks and calculates the discrete cosine transform (DCT) of each block. A quantizer rounds off these DCT coefficients according to a pre-defined quantization matrix. This quantization step produces the "lossy" nature of JPEG. Afterwards, JPEG applies a variable length code to these quantized coefficients. JPEG 2000 [1] is a wavelet-based image compression standard that was also created by the Joint Photographic Experts Group with the intention to outperform their
original JPEG standard. JPEG 2000 is based on the idea that coefficients of a transform which decorrelates pixels of an image could be coded more effectively than the original pixels. If functions of the transform, which is the wavelets transform in JPEG 2000, translate most of the important visual information into a small number of coefficients then the remaining coefficients could be quantized/normalized coarsely or even truncated to zero without introducing noticeable distortions. Compared with JPEG, JPEG 2000 [11] not only improves the quality of decompressed images but also supports much lower compression ratios. Recently, Vidware Incorporated developed a product line called Vidware VisionTM [3], [4]. It is divided into three separate categories: Still Image (as a replacement for JPEG), Full Frame video (as a replacement for M-JPEG), and Full Motion video (an H.264 compliant CODEC). Vidware VisionTM Still Image is an integer-based encoding algorithm and varies the size of macro-blocks according to the shape and contents of the image, unlike JPEG where the block size is fixed. Therefore, this variable macro-blocking produces reduced blocky artifacts and increases image fidelity. In terms of full-reference image quality measure, PSNR is the most widely used full-reference quality measure. It is appealing because of its simple computation and clear physical meaning [9], [10]. Nevertheless, it is not closely matched to perceived visual quality [8], [9], [10]. Therefore, SSIM [12] is selected for its improved representation of visual perception. The SSIM measure compares local patterns of pixel intensities that have been normalized and hence are invariant to luminance and contrast. This method could be seen as complementary to the traditional PSNR approach. A successful recognition requires sufficient local details to differentiate various patterns and the ability to preserve these local details. These details translate into the sharpness of the decompressed images and interpreted by image sharpness measures. Sharpness measures have been traditionally divided into 5 categories [13]: gradient based, variance based, correlation based, histogram based, and frequency domain based methods. Image noise level and artifacts, such as blocking effects, vary with respect to different compression methods as well as various compression ratios. Conventional sharpness measures are not feasible because they are unable to differentiate variations caused by actual image edges from those induced by image noise and artifacts. To avoid artificially elevated sharpness values due to image noise and artifacts, NSRT2 is proposed as an adaptive measure. The contributions of this paper are: (1) Evaluation of three compression methods, JPEG, JPEG 2000, and Vidware VisionTM by different full- and no-reference objective image quality measures; (2) The design of NSRT2 as an adaptive measure, and its ability to evaluate the sharpness of decompressed images for machine recognition applications. The remainder of this paper is organized as follow. Section 2 describes our proposed adaptive sharpness measure, NSRT2. Experimental results are demonstrated in section 3, and Section 4 concludes the paper.
2 Sharpness Quality Measure

Sharpness measures are traditionally used to quantify out-of-focus blur. Nevertheless, their extension to evaluating the sharpness of compressed images is non-trivial. To avoid artificially elevated sharpness values due to image noise and artifacts, adaptive sharpness measures assign different weights to pixel gradients according to their local activities: small weights for pixels in smooth areas, and large weights for pixels adjacent to strong edges. Adaptive sharpness measures comprise two determinant factors: the definition of local activities and the selection of weight functions. Figure 1 depicts the flow chart for computing adaptive sharpness measures.

Fig. 1. Flow chart to compute adaptive sharpness measures (image gradients, separable or non-separable, feed polynomial or rational weight functions, which are then applied to gradient-based sharpness measures)
Based on how local activities are described, adaptive sharpness measures can be divided into two groups: separable and non-separable. Separable measures only focus on horizontal and vertical edges. A horizontal image gradient g x ( x, y ) and a vertical
image gradient g y ( x, y ) are computed independently. For instance,
g_x(x, y) = f(x+1, y) − f(x−1, y),   g_y(x, y) = f(x, y+1) − f(x, y−1)        (1)
In contrast, non-separable measures include the contributions from diagonal edges. An example of this type of image gradients g ( x, y ) is given by:
g ( x, y ) = f ( x − 1, y) + f ( x + 1, y ) − f ( x, y − 1) − f ( x, y + 1) .
(2)
Different forms of weights can be used, among which polynomial and rational functions are two popular choices. The polynomial, to be more specific cubic, and rational functions are also exploited in adaptive unsharp masking [14], [15]. The polynomial weights suppress small variation mostly introduced by image noise and have been proved efficient in evaluating the sharpness of high magnification images [16]. The rational weights emphasize a particular range of image gradients. Taking the non-separable image gradient g ( x, y ) for example, the polynomial weights are:
ω(x, y) = g(x, y)^(p_ω)        (3)
where pω is a power index determining the degree of noise suppression. The rational weights can be written as
$$\omega(x,y) = \frac{(2k_0 + k_1)\, g(x,y)}{g^2(x,y) + k_1\, g(x,y) + k_0^2} \qquad (4)$$
where k 0 and k1 are coefficients associated with the peak position L0 and width ΔL of the corresponding function, respectively, and comply with the following equations
$$k_0 = L_0, \qquad k_1^2 + 8 k_0 k_1 + 12 k_0^2 - \Delta L^2 = 0 \qquad (5)$$
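As a concrete illustration of Equations (3)-(5), the following Python sketch (our own, not part of the paper) evaluates the polynomial and rational weights for a range of gradient magnitudes; the function names and the use of NumPy are assumptions.

import numpy as np

def polynomial_weight(g, p_omega=2.0):
    # Equation (3): small gradients, mostly noise, are suppressed.
    return np.abs(g) ** p_omega

def rational_weight(g, L0=10.0, delta_L=35.0):
    # Equation (5): k0 = L0 and k1 solves k1^2 + 8*k0*k1 + 12*k0^2 - delta_L^2 = 0.
    k0 = L0
    k1 = (-8.0 * k0 + np.sqrt(16.0 * k0 ** 2 + 4.0 * delta_L ** 2)) / 2.0
    g = np.abs(g)
    # Equation (4): emphasizes gradients around the peak position L0.
    return (2.0 * k0 + k1) * g / (g ** 2 + k1 * g + k0 ** 2)

# Example: weights for gradient magnitudes 0..20 (cf. Fig. 2).
g = np.linspace(0.0, 20.0, 21)
print(polynomial_weight(g, 2.0))
print(rational_weight(g, L0=10.0, delta_L=35.0))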
Figure 2 illustrates the comparison of different forms of weight functions.
Fig. 2. Illustration of weight functions ω(x): rational weights with L0 = 10, ΔL = 35 and with L0 = 10, ΔL = 5; polynomial weights with pω = 2 and pω = 6; and no weights
The newly designed weights are then applied to gradient based sharpness measures to construct adaptive sharpness measures. For the Tenengrad measure for instance, the resulting separable measure is given by:
$$S = \sum_{M} \sum_{N} \left[ \omega_x(x,y)\, g_x^2(x,y) + \omega_y(x,y)\, g_y^2(x,y) \right] \qquad (6)$$
where $\omega_x(x,y)$ / $\omega_y(x,y)$ denotes the weights obtained from the horizontal/vertical gradients $g_x(x,y)$ / $g_y(x,y)$. For non-separable methods, the corresponding adaptive Tenengrad is formulated as
$$S = \sum_{M} \sum_{N} \omega(x,y) \left[ g_x^2(x,y) + g_y^2(x,y) \right] \qquad (7)$$
To account for blocking artifacts, we follow the method Wang et al. proposed in [12]. The above measures are divided into two groups: pixels within a block and pixels at block borders. Accordingly, we compute two values
$$S = \sum_{\substack{x=0 \\ \mathrm{mod}(x,8) \neq 0}}^{M-1} \; \sum_{\substack{y=0 \\ \mathrm{mod}(y,8) \neq 0}}^{N-1} \omega(x,y) \left[ g_x^2(x,y) + g_y^2(x,y) \right]$$
and
$$B = \sum_{x=0}^{\lfloor M/8 \rfloor - 1} \; \sum_{y=0}^{\lfloor N/8 \rfloor - 1} \omega(8x,8y) \left[ g_x^2(8x,8y) + g_y^2(8x,8y) \right] \qquad (8)$$
The final image quality measure combines these two values:
$$Q = S^{p} B^{-1} \qquad (9)$$
The adjustable variables in the rational weight based measures include the peak position $L_0$, response width $\Delta L$, and weight power $p$. These parameters can be selected according to the visual perception of the images. In our implementation, pixel gradients $g_x(x,y)$ / $g_y(x,y)$ / $g(x,y)$ are scaled by their mean before computing the weights. The chosen measure NSRT2 then has a peak position $k_0$ of 10 and a weight power $p$ of 2. The coefficient $k_1$ is forced to be zero, resulting in a response width of $\Delta L = 2\sqrt{3}\,L_0 \approx 35$. In practice, different forms of image gradients are employed for deriving the weights, as in equations (3) and (4), and for computing sharpness, as in equations (6) and (7), for improved robustness to image noise and artifacts. For weight computation the image gradients used are listed in equations (1) and (2), while for sharpness computation the image gradients based on the Sobel filters are recommended.
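To show how Equations (1)-(9) fit together, the sketch below computes a non-separable, rational-weighted sharpness score with the parameter values quoted above (k0 = 10, k1 = 0, p = 2). It is a minimal reading of the NSRT2 pipeline, not the authors' implementation: the Sobel operator comes from SciPy, and the block-border split of Equation (8) is simplified to "any pixel on an 8-pixel grid line".

import numpy as np
from scipy.ndimage import sobel

def nsrt2_like(img, L0=10.0, p=2.0):
    img = img.astype(np.float64)
    # Non-separable gradient of Equation (2), used only for the weights.
    gw = np.zeros_like(img)
    gw[1:-1, 1:-1] = (img[:-2, 1:-1] + img[2:, 1:-1]
                      - img[1:-1, :-2] - img[1:-1, 2:])
    # Scale by the mean before computing weights, as described in the text.
    gw = np.abs(gw) / (np.abs(gw).mean() + 1e-12)
    # Rational weights of Equation (4) with k1 = 0, so k0 = L0.
    k0, k1 = L0, 0.0
    w = (2.0 * k0 + k1) * gw / (gw ** 2 + k1 * gw + k0 ** 2)
    # Sobel gradients for the sharpness term of Equation (7).
    gx = sobel(img, axis=1)
    gy = sobel(img, axis=0)
    energy = w * (gx ** 2 + gy ** 2)
    # Simplified split of Equation (8): block-border pixels versus the rest.
    yy, xx = np.mgrid[0:img.shape[0], 0:img.shape[1]]
    border = (xx % 8 == 0) | (yy % 8 == 0)
    S = energy[~border].sum()
    B = energy[border].sum() + 1e-12
    # Equation (9): combine the two sums.
    return S ** p / B

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    test = rng.integers(0, 256, size=(64, 64)).astype(np.uint8)
    print(nsrt2_like(test))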
3 Experimental Results We compare the performance of JPEG, JPEG 2000, and Vidware VisionTM based on a data set composed of compressed images at various compression ratios. The 48 raw color images (24 bits/pixel RGB) in the data set include 12 selected images with different resolutions and 36 long range face images (resolution: 720×480). The 12 selected images, as shown in part in Fig. 3 (a)-(d), can be further divided into 4 categories. The first category includes some well-known test images such as the Baboon, Lena, and Peppers. The second category is comprised of standard test patterns, the IEEE resolution chart (sinepatterns.com), Television Color Test Pattern (high-techproductions.com), and Color Test Template (eecs.berkeley.edu). The third category focuses on images used in face and license plate recognition, two exemplary applications of machine recognition. High resolution images are covered in the last category. The second group, as shown in part in Fig. 3 (e)-(f), is a collection of face images of 6 subjects. A total of 6 images per subject are obtained at various distances
(9.5, 10.4, 11.9, 13.4, 14.6, and 15.9 meters) and with different camera focal lengths. The camera's focal length is adjusted so that the subject's face presents similar sizes in all 6 images. Each selected image is compressed by JPEG, JPEG 2000, and Vidware VisionTM with different compression ratios varying from 0.2% to 30%. This produces a total of 342 compressed images for each compression method. In addition, the compression ratio used in this paper is defined as:
$$C = \frac{R}{O} \qquad (10)$$
where C, R, and O represent the compression ratio, the file size of the compressed image, and the file size of the original image, respectively.
Fig. 3. Illustration of test images, (a) Lena 512×512; (b) IEEE resolution chart 1440×1086; (c) License plate 1024×768; (d) Under vehicle view 2560×1920; (e) 9.5 meters face image; (f) 15.9 meters face image
3.1 Full-Reference Quality Measures
Figure 4(a) illustrates the computed SSIM measure. We can see that JPEG 2000 and Vidware VisionTM outperform JPEG for all tested compression ratios. For compression ratios in the range of 7%~10%, JPEG 2000 presents a slightly better performance compared with Vidware VisionTM. However, this performance difference diminishes for the other tested compression ratios. Figure 4(b) illustrates the computed PSNR measure. As expected, JPEG 2000 outperforms JPEG for all tested compression ratios. However, their PSNR values decrease quickly with respect to decreased compression ratios, indicating a rapidly degraded image quality. In contrast, the PSNR value of Vidware VisionTM decreases at a considerably slower speed. As a result, Vidware VisionTM eventually outperforms both JPEG and JPEG 2000 as the compression ratio goes below 0.7%.
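For readers who want to reproduce this kind of full-reference comparison, PSNR and SSIM can be computed per image pair as sketched below. The use of scikit-image and the example file names are our assumptions, not part of the original experiment.

from skimage import io
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def full_reference_scores(original_path, decoded_path):
    ref = io.imread(original_path)
    dec = io.imread(decoded_path)
    psnr = peak_signal_noise_ratio(ref, dec, data_range=255)
    # channel_axis=-1 treats the last dimension as the RGB channels.
    ssim = structural_similarity(ref, dec, data_range=255, channel_axis=-1)
    return psnr, ssim

# Hypothetical usage with one decompressed test image:
# print(full_reference_scores("lena_512.png", "lena_512_jpeg2000_0p7pct.png"))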
Fig. 4. (a) SSIM and (b) PSNR comparison among JPEG, JPEG 2000, and Vidware VisionTM, plotted against the compression ratio (%)
3.2 No-Reference Quality Measures
For fair comparisons, the computed sharpness values are first normalized with respect to the corresponding values of the uncompressed images. Figure 5 shows the performance comparison among JPEG, JPEG 2000, and Vidware VisionTM based on the proposed measure, NSRT2 and the Tenengrad measure. In Fig. 5(a), we could see similar output based on full-reference measures. JPEG 2000 produces the best performance. For most of the tested compression ratios, the performance of Vidware VisionTM falls in between those of JPEG and JPEG 2000. Exceptions are observed when the compression ratio is larger than approximately 10%, where JPEG outperforms Vidware VisionTM and yields a performance comparable to that of JPEG 2000.
Fig. 5. Comparisons among JPEG, JPEG 2000, and Vidware VisionTM based on the normalized sharpness measures (a) NSRT2 and (b) Tenengrad, plotted against the compression ratio (%)
The conventional Tenengrad measure shown in Fig. 5(b) is also implemented and serves as a reference to validate the use of the newly developed NSRT2. The Tenengrad measure responds only to the local activities of image gradients and disregards both the sources and the visual effects of these local activities. The artifacts from image compression are also regarded as useful local details, resulting in similar sharpness values for all decompressed images with the same compression ratio. As a consequence, the simple Tenengrad measure is insufficient and not applicable. Furthermore, at a compression ratio of 0.8%, as highlighted by a dashed circle in Fig. 5(b), the Tenengrad measure fails because of the overwhelming blocking artifacts. In comparison, the proposed NSRT2 is able to distinguish artifacts from desired local details and thus is qualified to evaluate the performances of the tested compression methods. In general, the performance of Vidware VisionTM is less sensitive and decreases gradually with respect to the decreased compression ratio. In comparison,
the performance of JPEG deteriorates significantly for compression ratios smaller than 2% and similar behavior occurs to JPEG 2000 for compression ratios smaller than 0.8%. 3.3 Visual Perception
Figure 6 illustrates visual perception of JPEG, JPEG 2000, Vidware VisionTM with 0.7% compression ratio. It is obvious that JPEG has inferior performance, but it is not easy to judge the performance of JPEG 2000 and Vidware VisionTM.
Fig. 6. Illustration of visual perception with 0.7% compression ratio. (a) original 9.5 meters face image; (b) JPEG; (c)JPEG 2000; (d) Vidware VisionTM.
4 Conclusion In this paper, we evaluated three different compression methods, JPEG, JPEG 2000, and Vidware VisionTM, by existing metrics including PSNR, SSIM, and Tenengrad. We also proposed a new image sharpness measure, NSRT2, for evaluating the sharpness of decompressed images. According to our experimental results, we had the following observations: (1) In general, for lower compression ratios
$$\frac{MAD - h_1}{h_2} \times \sigma . \qquad (7)$$
where σ is a parameter that must be greater than one. In this new model, we have to calculate the model coefficients k1, k2, k3, k4 and k5. Here, we use a linear regression technique to find these coefficients. Based on this model, we then formulate our R-D model for H.264/AVC at the macroblock level. 2.2 The Relation Between Header Bits and Macroblock Mode In H.264/AVC, inter-prediction coding has several macroblock modes. The number of macroblock header bits varies among the macroblock modes. In Fig. 8, we show that the header bits of the 16×16, 16×8 and 8×16 macroblock modes are largely uncorrelated with MAD, but the header bits of the 8×8 mode are proportional to the MAD value of the macroblock. Moreover, the variances of the header bits in the 16×16, 16×8 and 8×16 modes are very small, but this is not the case in the 8×8 mode.
Fig. 8. The relation between header bits and the MAD value of macroblocks
Fig. 9. The relation between macroblock header bits, number of motion vectors, and macroblock MAD
In the 8×8 mode, the motion vector data are variable. This is because an 8×8 block can be divided into 8×4, 4×8, or 4×4 modes. A smaller block size causes more motion vectors and thus more header bits. In Fig. 9, we can observe such a relation. Hence, a larger MAD has a larger chance to encode more motion vectors. Based on this observation, we model the relation between the 8×8 header bits and the MAD value of the macroblock in terms of a linear model:
Hbits8×8 = a × MAD + b .
(8)
where a and b are coefficients of this model. For other modes, such as 16×16, 16×8 and 8×16 modes, we simply estimate the header bits of the current macroblock to be the averaged value of the previous macroblock. Using this header bits prediction method, we may reformulate Equation as
$$\text{Bits} = k_1 \times MAD \times QP'^2 + k_2 \times MAD + k_3 \times QP'^2 + k_4 \times MAD + k_5 \times QP' + \text{Hbits},$$
$$QP = \begin{cases} QP' & \text{if } QP' \le \dfrac{MAD - h_1}{h_2} \times \sigma \\ QP' - 1 & \text{if } QP' > \dfrac{MAD - h_1}{h_2} \times \sigma \end{cases} \qquad (9)$$
where the coefficients are computed based on the linear regression technique over the already encoded data.
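The linear-regression step can be realized as an ordinary least-squares fit over the already encoded macroblocks, as in the sketch below. The exact regressor terms (how MAD and QP' enter each column) are our assumption and may differ from the authors' model; the fitting procedure itself is standard.

import numpy as np

def fit_rd_coefficients(mad, qp, texture_bits):
    # One regressor column per model term; the term choice below is assumed.
    mad = np.asarray(mad, dtype=float)
    qp = np.asarray(qp, dtype=float)
    y = np.asarray(texture_bits, dtype=float)
    X = np.column_stack([mad / qp ** 2, mad / qp, 1.0 / qp ** 2, mad, 1.0 / qp])
    k, *_ = np.linalg.lstsq(X, y, rcond=None)
    return k  # estimates of k1..k5

# Hypothetical usage with statistics gathered from already encoded macroblocks:
rng = np.random.default_rng(1)
mad = rng.uniform(1, 20, 200)
qp = rng.integers(20, 40, 200).astype(float)
bits = 5000 * mad / qp ** 2 + 800 * mad / qp + rng.normal(0, 5, 200)
print(fit_rd_coefficients(mad, qp, bits))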
3 Bit Allocation at Macroblock Level In the rate control scheme of the JM model in H.264/AVC, if the number of basic units in a picture is one, we need to determine the bit allocation for each basic unit. If the basic unit is a macroblock, we perform bit allocation for each macroblock. In JM [8], the original method is defined as
$$T_{mb}(j,k) = \frac{pMAD(j,k)^2}{\sum_k pMAD(j,k)^2} \times T(j) \qquad (10)$$
where j denotes the frame index, T(j) represents the target bits of the jth frame, k denotes the index of the encoded macroblock, and Tmb(j,k) represents the target bits of the kth macroblock at the jth frame. The symbol pMAD denotes the predicted MAD of a macroblock. This equation means that a macroblock with a larger MAD value needs more bits to be coded. Here, the prediction of MAD is based on a linear model: pMAD( j, k ) = a1 × MAD( j − 1, k ) + a2 .
(11)
where pMAD means the predicted MAD value, MAD means the computed MAD value, and a1 and a2 are coefficients. However, when a macroblock is under movement, the corresponding residual data should be in motion too. Hence, the MAD of that macroblock no longer stays at the same place. Fig. 10 illustrates such a situation where the movement of macroblocks causes the MAD patterns in the current frame to be shifted to some other places in the next frame.
Fig. 10. The illustration of macroblock motion
To modify the MAD prediction at the macroblock level, we check the motion vector of each macroblock and move the MAD of each macroblock to the new location
accordingly. After moving all the macroblocks, we calculate the averaged MAD to represent the predicted MAD value. Like the case in Fig. 10, the MAD of macroblock M in the next picture can be computed as:
$$pMAD_{t+1}(M) = \frac{w_N \times MAD_t(N) + w_L \times MAD_t(L) + w_M \times MAD_t(M)}{w_N + w_L + w_M} \qquad (12)$$
where pMAD means the predicted MAD, wi represents the weighting coefficient of a macroblock. This modification not only improves video quality but also reduces the probability of buffer fullness.
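A minimal sketch of this motion-compensated MAD prediction, in the spirit of Equation (12), is given below: each macroblock's MAD is pushed along its motion vector and overlapping contributions are averaged with area weights. The array layout and the use of bilinear area weights are our assumptions, not the JM reference code.

import numpy as np

def predict_mad(mad_t, mv, mb_size=16):
    """mad_t: (H, W) per-macroblock MAD of frame t; mv: (H, W, 2) motion vectors in pixels."""
    H, W = mad_t.shape
    acc = np.zeros((H, W))
    wsum = np.zeros((H, W))
    for y in range(H):
        for x in range(W):
            # Destination position in macroblock units.
            fy = y + mv[y, x, 0] / mb_size
            fx = x + mv[y, x, 1] / mb_size
            y0, x0 = int(np.floor(fy)), int(np.floor(fx))
            for dy in (0, 1):
                for dx in (0, 1):
                    ty, tx = y0 + dy, x0 + dx
                    if 0 <= ty < H and 0 <= tx < W:
                        # Overlap area acts as the weight w_i of Equation (12).
                        w = (1 - abs(fy - ty)) * (1 - abs(fx - tx))
                        acc[ty, tx] += w * mad_t[y, x]
                        wsum[ty, tx] += w
    return np.where(wsum > 0, acc / np.maximum(wsum, 1e-12), mad_t)

# Hypothetical usage on a QCIF-sized macroblock grid (9 x 11 macroblocks):
mad = np.random.default_rng(2).uniform(1, 10, (9, 11))
mv = np.zeros((9, 11, 2))
print(predict_mad(mad, mv).shape)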
4 Proposed Rate Control Scheme for H.264/AVC The proposed rate control scheme for H.264/AVC contains three parts. The first part is the GOP-level rate control, where the initial QP and the definition of buffer fullness of this GOP are determined. The second part is the picture-level rate control, where the QP of each frame is determined. Finally, the basic-unit-level rate control determines the QP value of each basic unit. In Table 1, we list the symbols used in this section. In the GOP-level rate control, the expected available bits for GOP encoding are computed. The initial QP value is determined according to the average QP value of the previous GOP. If this is the first GOP, the initial QP value is determined according to the channel bandwidth and video resolution. The picture-level rate control is further divided into two stages, pre-encoding stage and post-encoding stage. In the pre-encoding stage, we need to determine the target bits of this picture. The target bits are predicted in two aspects. One is according to buffer fullness and channel bit rate. The other is determined by the remaining bits of this GOP. That is, we have
$$\tilde{T}_i(j) = \frac{R_i(j)}{f} + \gamma \times \left( S_i(j) - V_i(j) \right) \qquad (13)$$
$$\hat{T}_i(j) = \frac{B_i(j)}{N_{p,r}} \qquad (14)$$
The real target bits are computed by combining $\tilde{T}_i(j)$ and $\hat{T}_i(j)$ together. In this paper, we choose
$$T_i(j) = 0.5 \times \hat{T}_i(j) + 0.5 \times \tilde{T}_i(j) \qquad (15)$$
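Expressed in code, the frame-level target of Equations (13)-(15) could look like the following sketch; the variable names mirror Table 1, the 0.5/0.5 combination follows Equation (15), and the default value of the constant γ is an assumption since the paper does not state it.

def frame_target_bits(channel_rate, frame_rate, target_buffer_level,
                      buffer_fullness, remaining_bits, remaining_frames,
                      gamma=0.5):
    # Equation (13): target from the channel rate and the buffer status.
    t_buffer = channel_rate / frame_rate + gamma * (target_buffer_level - buffer_fullness)
    # Equation (14): target from the bits left for the remaining stored pictures.
    t_remaining = remaining_bits / max(remaining_frames, 1)
    # Equation (15): combine the two estimates with equal weights.
    return 0.5 * t_remaining + 0.5 * t_buffer

# Hypothetical numbers: 64 kbit/s, 30 fps, 20 frames left in the GOP.
print(frame_target_bits(64000, 30, 9600, 11000, 40000, 20))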
On the other hand, the post-encoding stage adds the actual encoded bits into the buffer and makes sure whether the buffer overflows. Finally, for the basic-unit-level rate control, we first use Equation (12) to predict the MAD value of each macroblock and compute the target bits of each basic unit based on Equation (10). Then, we determine the QP value of each basic unit according to the
Table 1. Summary of Symbols
i : the index of the GOP
j : the index of the picture in each GOP
f : frame rate
Ri(j) : channel bit rate
Np(i-1) : total number of stored pictures in the (i-1)th GOP
Bi(j) : the bits for the rest pictures in this GOP
Si(j) : target buffer level
Vi(j) : buffer fullness
T̃i(j) : the delta target bits
T̂i(j) : the hat target bits
Ti(j) : target bits
Np,r : the number of the remaining stored pictures
R-D model expressed in Equation (9). After determining the QP value, we execute the RDO encoding in each macroblock. The last step is to update the coefficients by the linear regression method.
5 Experimental Results In order to evaluate the performance of the proposed algorithm, we performed several experiments on a few test sequences. The coding parameters are defined as follows: Profile: Baseline; Frame Rate: 30 frames per second; Buffer size: 0.5 × bit rate; Frame Size: QCIF; GOP: IPPP…; Basic Unit: one macroblock; RDO: On; Bit Rate: 64K, 128K.
Table 2. The precision of prediction model
Sequence | Bit Rate | Texture-bit Error (MAD): JM / Our / (JM-Our)/JM | Header-bit Error (MAD): JM / Our / (JM-Our)/JM
Bus | 64K | 569.7 / 193.7 / 0.66 | 230.50 / 195.20 / 0.15
Bus | 128K | 224.2 / 150.5 / 0.33 | 448.91 / 307.23 / 0.32
Flower | 64K | 1185.9 / 995.2 / 0.16 | 346.00 / 220.79 / 0.36
Flower | 128K | 1965.9 / 700.7 / 0.64 | 466.33 / 303.87 / 0.34
Fig. 11. The buffer fullness curves when coding the "bus" sequence with the bit rate = 64K
Fig. 12. The buffer fullness curves when coding the "flower" sequence with the bit rate = 128K
Fig. 13. The PSNR curve when coding the "bus" sequence with the bit rate = 64K
Fig. 14. The PSNR curve when coding the "flower" sequence with the bit rate = 128K
Fig. 15. Comparison of visual quality between JM and the proposed method for the 220th frame of the "flower" sequence when the bit rate is 64K
We first compare the precision of the proposed R-D model. We divided the coded bits into two parts, texture bits and header bits. The comparisons with respect to the JM of H.264/AVC are listed in Table 2. The "error" is defined as the mean of absolute difference (MAD) between the actual coded bits and the target bits (predicted bit number). In Table 2, the improvement is apparent. As shown in Fig. 11 and Fig. 12, our results are closer to the target buffer level. This improvement also increases the PSNR values. Besides the proposed R-D model, the modified MAD prediction also increases the PSNR values in the first few frames of each GOP. The average PSNR gain is 0.26dB and 0.2dB, respectively. Besides the PSNR gain, our proposed method can provide improved visual quality. In Fig. 15, we can see obvious differences, like the flowers in the garden. The improvement is due to the fact that we have reduced the over-use of coded bits during the coding of the upper portion of the pictures. Hence, adequate bits are left for the coding of the bottom portion of the pictures.
6 Conclusion In this paper, we propose a new rate-distortion model for the baseline profile of H.264/AVC. Using the modified R-D model, we can improve the accuracy of rate control and the visual quality, especially at low bitrates. The relation between the header bits and the MAD value has also been discussed. At the basic unit level, we use an intuitive method to predict the MAD value. Utilizing the bit allocation at the macroblock level, we can achieve better bit allocation and visual quality.
Acknowledgement This research was supported by National Science Council of the Republic of China under Grant Number NSC-94-2219-E-009-017.
References [1] ISO/IEC JTCI/SC29/WG11, Test Model 5, 1993. [2] ITU-T/SG15, Video Codec Test Model, TMN8, Portland, June 1997. [3] MPEG-4 Video Verification Model V18.0, Coding of Moving Pictures and Audio N3908, ISO/IEC, JTC1/SC29/WG11, Jan. 2001. [4] H. J. Lee and T. H. Chiang and Y.-Q. Zhang, “Scalable rate control for MPEG-4 video,” IEEE Trans. on Circuits. Syst. Video Technol., vol.10, No.6, pp. 878-894, Sept. 2000. [5] Siwei Ma, Wen Gao, Feng Wu and Yan Lu, “Rate Control for JVT Video Coding Scheme with HRD Considerations,” IEEE International Conference on Image Processing. vol. 3, pp.793-796, Sept. 2003. [6] Thung-Hiung Tsai and Jin-Jang Leou, “A Rate Control scheme for H.264 Video Transmission,” IEEE International Conference on Multimedia and Expo (ICME), vol.2, pp.1359-1368, June 2004.
[7] Satoshi Miyaji, Yasuhiro Takishima and Yoshinori Hatori, “A Novel Rate Control Method for H.264 Video Coding.” IEEE International Conference on Image Processing. vol. 2, pp.II-309-12, 11-14 Sept. 2005. [8] Siwei Ma and Wen Gao, “Rate-Distortion Analysis for H.264/AVC Video Coding and its Application to Rate Control.” IEEE Trans. on Circuits and Syst. Video Technol., vol.15, No.12, pp.1533-1544, Dec. 2005. [9] Z. Li et al., "Adaptive Basic Unit Layer Rate Control for JVT," JVT-G012, 7th Meeting: Pattaya, Thailand, March 2003.
A Geometry-Driven Hierarchical Compression Technique for Triangle Meshes Chang-Min Chou1,2 and Din-Chang Tseng1 1
Institute of Computer Science and Information Engineering, National Central University, No. 300, Jungda Rd., Jhongli City, Taoyuan, Taiwan 320 2 Department of Electronic Engineering, Ching Yun University, No. 229 Cheng-Shing Rd., Jhongli City, Taoyuan, Taiwan 320
[email protected] [email protected] Abstract. A geometry-driven hierarchical compression technique for triangle meshes is proposed such that the compressed 3D models can be efficiently transmitted in a multi-resolution manner. In 3D progressive compression, we usually simplify the finest 3D model to the coarsest mesh vertex by vertex and thus the original model can be reconstructed from the coarsest mesh by operating vertex-split operations in the inversed vertex simplification order. In general, the cost for the vertex-split operations will be increased as the mesh grows. In this paper, we propose a hierarchical compression scheme to keep the cost of the vertex-split operations being independent to the size of the mesh. In addition, we propose a geometry-driven technique, which predicts the connectivity relationship of vertices based on their geometry coordinates, to compress the connectivity information efficiently. The experimental results show the efficiency of our scheme. Keywords: Mesh compression, progressive compression, progressive mesh.
1 Introduction 3D models have been represented by triangle meshes in computer graphics for a long time. In recent years, the issue of 3D model compression, multi-resolution, and progressive transmission gain more attention due to the usage of advanced scanning devices and the growth of Internet. Transmission of 3D models over Internet is relatively slow for complex models with many fine details. A progressive representation is thus appreciated due to its ability to enhance high performance interactivity with large model transmission. Furthermore, in progressive encoding schemes, models can be encoded as an embedded bit stream so that the receiver can terminate the transmission at any point of time to get an approximation of the model with exact bit control [9]. Typically, meshes are represented by two kinds of data: connectivity and geometry. The former represents how vertices in a mesh are connected while the latter represents the coordinates (and normals, textures, etc.) of each vertex. When considering the problem of 3D compression, for the purpose of reconstruction, the L.-W. Chang, W.-N. Lie, and R. Chiang (Eds.): PSIVT 2006, LNCS 4319, pp. 919 – 928, 2006. © Springer-Verlag Berlin Heidelberg 2006
connectivity information has to be encoded (or compressed) losslessly while the geometry information can be encoded in a lossy manner according to the requirement of accuracy [2], [5]. As a result, it is not easy to get a high compression ratio for the connectivity information encoding, especially when the encoding task has to be done in a progressive manner. Previous works on 3D compression can be divided into two major categories: single-resolution [12], [15] and multi-resolution [1], [4], [7], [9], [10], [13] compressions. Heckbert and Garland [6] gave a detailed introduction to the most famous polygonal surface simplification algorithms. Peng et al. [11] gave a survey on most 3D mesh compression technologies up to date. Rossignac's Edgebreaker [12] and TG [15] are both great works for single-resolution connectivity compression of triangle meshes. Both works can achieve 1.5 ~ 2 bits/v or even better for connectivity compression. Valette and Prost [16] proposed a lossy to lossless progressive compression scheme for triangular meshes based on a wavelet multi-resolution theory for irregular 3D meshes. Multi-resolution techniques usually transmit a crude model followed by a series of refinement information such that the 3D model can be reconstructed progressively. Hoppe's PM [7], [8] model introduced a continuous progressive compression scheme, but is space consuming in real practice. In the PM model, the simplification process can be described as a sequence of edge-collapse operations, and the reconstruction process is a sequence of the inversed edge-collapse operations, which are called vertex-split operations. These two operations are illustrated in Fig. 1. In the PM model, the cost for encoding each individual vertex split operation includes three major components: 1. Information that indicates the split vertex, v, in the previous mesh. Theoretically, it takes $\lceil \lg n \rceil$ bits to indicate the split vertex for a mesh with n vertices. 2. Information that specifies how a vertex v is split. This can be achieved by recording the two co-vertices among all incident vertices of vertex v. It costs about $\lceil \lg C_k^2 \rceil$ bits to record these two co-vertices, where k is the number of v's incident vertices. 3. Information that records the geometry (coordinate) information of the newly added vertices. This is usually achieved by encoding the prediction residues, which are the differences between the predicted coordinates and the actual coordinates.
Fig. 1. 'edge collapse' and 'vertex split' operations in the PM model. The two black vertices are co-vertices.
Continuous progressive compression schemes usually impose a significant overhead to transmit the full resolution model when compared to the best single resolution schemes. Pajarola and Rossignac suggested a CPM (Compressed Progressive Meshes) [10] approach that refines the topology of the mesh in batches to eliminate the overhead caused by progressive refinement techniques and achieved
great success. In this paper, we propose a hierarchical compression scheme that keeps the cost of the vertex-split operations independent of the size of the mesh. Our work is similar to the work of Pajarola and Rossignac in its hierarchical manner. However, we propose a different geometry prediction scheme and a geometry-driven connectivity prediction scheme to compress both the geometry and connectivity information efficiently.
2 Hierarchical Simplification Most 3D geometric progressive compression algorithms first transmit a crude model, M0, followed by a sequence of refinement information. Each refinement file contains the information of a set of vertices for reconstructing the next finer model. Fig. 2 illustrates the reconstruction procedure. Our hierarchical transmission scheme can be organized as follows: 1. A crude model, M0, is transmitted using a single resolution compression mechanism. In our work, the TG [15] method is applied on this stage. We suggest M0 to contain about 10 percents of the vertex number of the original model for the best performance in compression. 2. A sequence of refinement archive information of each hierarchy, R1, R2, …, Rn, are sent in order for mesh reconstruction. The refinement archives should contain both the connectivity and geometry information in each hierarchical.
Fig. 2. The refinement reconstruction procedure
We use a 'half-edge' algorithm for the edge-collapse operation. That is, for the two vertices at the ends of the collapsed edge, we simply merge one vertex into the other, as shown in Fig. 3. In the following discussion of this paper, we call the remaining vertex the parent vertex, vp, and the removed vertex the child vertex, vc. One major goal of our hierarchical simplification mechanism is to reduce the heavy burden of indicating the split vertex. The cost for indicating the split vertex is $\lceil \lg n \rceil$ in the PM scheme and thus increases as the mesh grows. We propose a hierarchical simplification scheme that removes a set of vertices from Mi to form a simplified mesh, Mi-1. The set of removed vertices in each hierarchy must be an independent set. An independent set is a set of vertices whose removal will not affect the neighborhood relationship of any other vertex in the same set. In other words, by removing an independent set of vertices from a mesh, the resulting mesh is independent of the order of vertex removals. To satisfy this requirement, we set three constraints in the vertex removal process of each hierarchy:
1. A boundary vertex cannot be a child vertex. 2. The adjacent vertices of a child vertex cannot serve as child vertices of other vertices in the same hierarchy. Fig. 3 illustrates this constraint. 3. A vertex can serve as a parent vertex at most once in each hierarchy.
Fig. 3. Illustration for the 'half-edge' collapse operation and independent set selection constraints. Vertex vc is merged to vp. Once vc is merged to vp, the adjacent vertices of vc (black vertices) can not serve as child vertices of any other vertex in the same hierarchy.
The proposed scheme can successfully reduce the cost for indicating the split vertices. The simplification process is described as follows: 1. Set i = n, where n is the number of the predefined hierarchical levels. 2. Select an independent set of vertices, Vi, from Mi. 3. Remove all vertices of Vi from Mi, and re-triangulate the holes left by those vertex removals to form a simpler mesh Mi-1. 4. Record the refinement information Ri; set i = i-1. 5. Stop if i = 0; else go to Step 2.
3 Refinement Archive Encoding and Decoding For a simplification hierarchy from Mi to Mi-1, we need to encode an archive of refinement information for the reconstruction in the receiver side. The refinement archive includes three major components: which vertices were removed (the split vertex information), the coordinates of these vertices (the geometry information), and how these vertices were connected to their neighboring vertices (the connectivity information). These three components are encoded separately and then packaged together for transmitting to the receiver side. 3.1 The Split Vertices Encoding
For each hierarchy mesh Mi, we select an independent set of vertices for simplification using the selection algorithm described in Section 2, and then execute a series of edge-collapse operations to produce the next hierarchy mesh Mi-1. On the other hand, a series of vertex-split operations should be executed on Mi-1 to reconstruct Mi in the decoding stage. We use the offsets between split vertices in the coarser mesh to record the indices of these split vertices. For example, suppose the indices of the split vertices in Mi-1 are “2, 5, 7, 11, 12…”, the recorded information will be “2, 3, 2, 4, 1…”, which is the first index in Mi-1 followed by a series of offsets. The split vertices are usually evenly distributed among the models, thus the offsets are highly concentrated to several integers. This characteristic implies an entropy coding
method is suitable in this part. We choose an arithmetic coding method [3] for the split vertices encoding. 3.2 The Geometry Encoding
Most 3D compression schemes predict the geometry coordinates based on the connectivity relationship of vertices and then encode the prediction residues [1], [4], [10], [15]. We propose a geometry-driven technique which predicts the connectivity relationship of vertices based on their geometry coordinates. The advantage of geometry-driven compression is that the connectivity information can be compressed much more efficiently. However, we still propose a prediction method based on the connectivity relationship of the previous hierarchy mesh Mi-1 to improve the geometry compression rate. The concept of our prediction method is based on the observation that the average of the coordinates of a parent vertex's neighboring vertices in Mi-1 is usually close to the average of the coordinates of the parent vertex and its corresponding child vertex. That is,
$$\frac{1}{|N(v_p)|} \sum_{v_j \in N(v_p)} v_j \cong \frac{1}{2}\left( v_p + v_c \right) \qquad (1)$$
where $N(v_p)$ is the set of $v_p$'s neighboring vertices and $|N(v_p)|$ is the size of $N(v_p)$. Thus, given the coordinates of $v_p$ and its neighboring vertices in Mi-1, we can predict the coordinates of $v_c$ as
$$v_c' = \frac{2}{|N(v_p)|} \sum_{v_j \in N(v_p)} v_j - v_p \qquad (2)$$
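The prediction of Equation (2) and the residue that is subsequently entropy coded can be written in a few lines of Python; quantization of the residue is omitted and the array conventions are ours, not the authors'.

import numpy as np

def predict_child_position(parent, neighbors):
    # Equation (2): vc' = (2 / |N(vp)|) * sum(vj) - vp
    neighbors = np.asarray(neighbors, dtype=float)
    return 2.0 * neighbors.mean(axis=0) - np.asarray(parent, dtype=float)

def geometry_residue(parent, neighbors, child):
    # Residue vc - vc' that is arithmetic coded (after quantization).
    return np.asarray(child, dtype=float) - predict_child_position(parent, neighbors)

# Hypothetical vertex coordinates:
vp = [0.0, 0.0, 0.0]
ring = [[1, 0, 0], [0, 1, 0], [-1, 0, 0], [0, -1, 0], [0, 0, 1]]
vc = [0.1, 0.05, 0.4]
print(geometry_residue(vp, ring, vc))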
The prediction residues (vc − vc' ) are then encoded for geometry reconstruction. Fig. 4 illustrates the concept of the geometry prediction method. Since the geometry prediction residues (GPRs) tend to be very small when comparing to the actual vertex coordinates, they often distribute within a small range of the coordinate scope. According to this characteristic of GPRs, we encode the GPRs by the arithmetic coding method. 3.3 The Connectivity Encoding
For each edge-collapse pair vp, vc and their neighboring vertices, we encode the connectivity relationship based on their geometry information by the following steps:
Fig. 4. The concept of the geometry prediction method. vm is the barycenter of vp’s neighboring vertices. vc is the actual coordinates and vc’ is the predicted coordinates of the child vertex.
1. Connect each neighboring vertex to either vp or vc, depending on which one is nearer (evaluated by Euclidean distance) to the neighboring vertex. This operation will leave two quadrilateral holes as shown in Fig. 5. We define the quadrilateral hole in which the $v_p v_c$ direction is clockwise as the right quadrilateral hole, and the other one as the left quadrilateral hole. The four corner vertices of the two quadrilateral holes are named $v_p^r$, $v_c^r$, $v_p^l$, and $v_c^l$, according to their positions, as shown in Fig. 5.
Fig. 5. The two quadrilateral holes and four corner vertices
2. Compare the produced structure with the same part of Mi by ignoring the quadrilateral holes. If they have the same structure, the edge-collapse pair is classified as one of the following four categories by referring to the structure of Mi to compare the lengths of $\overline{v_p v_c^r}$ and $\overline{v_c v_p^r}$ in the right quadrilateral hole, and the lengths of $\overline{v_p v_c^l}$ and $\overline{v_c v_p^l}$ in the left quadrilateral hole:
(i) SS: if the actual connections in Mi are the shorter edges in both quadrilateral holes, as shown in Fig. 6 (a).
(ii) SL: if the actual connections in Mi are the shorter edge in the right quadrilateral hole and the longer edge in the left quadrilateral hole, as shown in Fig. 6 (b).
(iii) LS: if the actual connections in Mi are the longer edge in the right quadrilateral hole and the shorter edge in the left quadrilateral hole, as shown in Fig. 6 (c).
(iv) LL: if the actual connections in Mi are the longer edges in both quadrilateral holes, as shown in Fig. 6 (d).
Otherwise (i.e., the structures are different), the edge-collapse pair is classified to the "Others" category (a code sketch of this classification is given after Fig. 7).
Fig. 6. The connection of (a) SS, (b) SL, (c) LS, and (d) LL categories, where "S" represents "shorter" and "L" represents "longer"
3. For vertex pairs classified to the first four categories, we directly encode their connectivity relationship as SS, SL, LS, and LL, respectively. For vertex pairs classified to the "Others" category, we encode the offsets between the actual connection and the SS condition of the two co-vertices. Fig. 7 illustrates the encoding concepts.
Fig. 7. The encoding of vertices belonging to the "Others" category. (a) The SS connection condition: v1 and v6 are the two co-vertices. (b) The actual connection condition: v2 and v4 are the two co-vertices. Two integers, +1 (offset between v1 and v2) and -2 (offset between v6 and v4), are encoded for reconstruction.
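A compact way to reproduce the core of steps 1 and 2 is sketched below: for each quadrilateral hole, the two candidate diagonals are compared and the one actually used in the finer mesh is labelled 'S' (shorter) or 'L' (longer). The data structures are simplified assumptions, and the "Others" case is not handled here.

import numpy as np

def hole_label(vp, vc, vp_corner, vc_corner, vp_connected_to_vc_corner):
    """Label one quadrilateral hole.

    vp_corner, vc_corner: the corner vertices v_p^r / v_c^r (or v_p^l / v_c^l).
    vp_connected_to_vc_corner: True if, in the finer mesh Mi, the triangulating
    diagonal is vp-vc_corner; False if it is vc-vp_corner.
    """
    d_vp = np.linalg.norm(np.subtract(vp, vc_corner))   # length of vp-vc_corner
    d_vc = np.linalg.norm(np.subtract(vc, vp_corner))   # length of vc-vp_corner
    chosen = d_vp if vp_connected_to_vc_corner else d_vc
    return 'S' if chosen <= min(d_vp, d_vc) else 'L'

def classify_pair(vp, vc, right_hole, left_hole):
    # Two-letter symbol (e.g. 'SS', 'LS') fed to the arithmetic coder.
    return hole_label(vp, vc, *right_hole) + hole_label(vp, vc, *left_hole)

# Hypothetical 2-D example for one edge-collapse pair:
label = classify_pair((0, 0), (1, 0),
                      right_hole=((0, -1), (1, -1), True),
                      left_hole=((0, 1), (1, 1), False))
print(label)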
Table 1 shows the probability distribution of the five connectivity categories of the Horse model in each refinement hierarchy. Due to the uneven distribution of these categories, we encode the connectivity information by an arithmetic coding method.
Table 1. The probability distribution of five connectivity categories of the Horse model in each refinement hierarchy
Refinement archive | vertices | SS | SL | LS | LL | Others
R1 | 1159 | 0.49 | 0.11 | 0.11 | 0.03 | 0.26
R2 | 1573 | 0.52 | 0.10 | 0.11 | 0.02 | 0.25
R3 | 2096 | 0.53 | 0.11 | 0.11 | 0.03 | 0.22
R4 | 2821 | 0.57 | 0.13 | 0.11 | 0.03 | 0.16
R5 | 3824 | 0.61 | 0.13 | 0.13 | 0.02 | 0.11
R6 | 5093 | 0.73 | 0.10 | 0.10 | 0.02 | 0.05
For each refinement archive, a header file, which records the lengths of the three refinement components, is attached to the refinement information to form a complete refinement archive package. 3.4 The Refinement Archive Decoding
In the receiver end, a base mesh is received followed by a sequence of refinement archives. The decoder first decodes the base mesh by TG’s algorithm, and then refines the base mesh by the upcoming refinement archives. The decompression steps of each refinement archive to reconstruct a mesh from Mi to Mi+1 are stated as follows:
1. Decode the header file to get the length information of three components: the split vertex information, the GPRs information, and the connectivity information. 2. Decode the split vertex information to figure out the split vertices in Mi. For split vertices, the following geometry and connectivity information are decoded for operating a series of vertex-split operations. 3. Decode the GPRs information. The coordinates of a split child vertex can be calculated by
$$v_c = r - v_p + \frac{2}{|N(v_p)|} \sum_{v_j \in N(v_p)} v_j \qquad (3)$$
where r is the decoded GPR; N(vp) is the set of vp’s neighboring vertices in Mi, and N (v p ) is the size of N(vp). The coordinates of the split child vertices are then used in the next step to get the connection relationship of the next finer mesh Mi+1. 4. Decode the connectivity information. For each vertex-split pair, we operate a vertex-split operation by re-connecting vp and vc to their neighboring vertices. Firstly, we connect each neighboring vertex to either vp or vc, depending on which one is nearer. This operation will leave two quadrilateral holes. Secondly, if the decoded signal is SS, SL, LS, or LL, we simply triangulate these two left quadrilateral holes according to the decoded information. Otherwise, the decoded signals must be two integers, which are the offsets of the two co-vertices between the SS condition and the actual condition. We can easily connect vp and vc, to the correct neighboring vertices after the two co-vertices have been calculated.
4 Experimental Results and Conclusion We have implemented the proposed compression algorithm on various kinds of triangle meshes. We used the TG single resolution scheme to encode the base mesh M0 in all our experiments. In average, it takes about 1.5 bits/v for the connectivity information, 9 bits/v for geometry information with 8-bit quantization per coordinate, and 13.5 bits/v for geometry information with 12-bit quantization per coordinate. Table 2 shows the detailed compression results of our scheme in every hierarchy for the Bunny model, quantized to 10 bits per coordinate. The original Bunny model has 35947 vertices and 69451 faces, and is simplified to a crude model, M0, with 3022 vertices and 3601 faces. The “C bits” column is the sum of the split vertex and the connectivity information. The split vertex information costs about 2.3 bits/v in each hierarchy and the connectivity information costs the others. Table 3 compares the results of our algorithm with that of TG [15], PM [7], and CPM [10] methods. These three mechanisms are the-state-of-the-art schemes for single resolution, continuous progressive, and hierarchical progressive mesh compression, respectively. Our scheme is superior to CPM in both geometry and connectivity compression. From the nature of the independent set split-vertex selection algorithm, edge collapse operations in the same hierarchy will be distributed averagely around the surface of the simplified geometric model. Thus this algorithm is progressively multi-resolution. Fig. 8 shows the experimental results of the Horse model in different hierarchies.
Fig. 8. The Horse model in different hierarchies: (a) The crude model, M0, 2420 vertices, 4836 faces. (b) M2, 4444 vertices, 8884 faces. (c) M4, 8113 vertices, 16222 faces. (d) The finest model, M7, 19851 vertices, 39698 faces.
Table 2. The detailed compression results of our scheme in each hierarchy for a Bunny model, quantized to 10 bits per coordinate. G represents "Geometry" and C represents "Connectivity".
Refinement Archive | vertices | G bits | G/v | Cumulative G/v | C bits | C/v | Cumulative C/v | Reconstructed Model
Base Model | 3022 | 31006 | 10.26 | - | 3082 | 1.02 | - | M0
R1 | 606 | 12130 | 20.02 | 20.02 | 3448 | 5.69 | 5.69 | M1
R2 | 855 | 15537 | 18.17 | 18.94 | 4685 | 5.48 | 5.57 | M2
R3 | 1125 | 18396 | 16.35 | 17.81 | 6142 | 5.46 | 5.52 | M3
R4 | 1577 | 23579 | 14.95 | 16.73 | 8594 | 5.45 | 5.49 | M4
R5 | 2083 | 28862 | 13.86 | 15.77 | 11185 | 5.37 | 5.45 | M5
R6 | 2773 | 35694 | 12.87 | 14.88 | 14336 | 5.17 | 5.37 | M6
R7 | 3668 | 41896 | 11.42 | 13.88 | 17899 | 4.88 | 5.22 | M7
R8 | 4991 | 48083 | 9.63 | 12.68 | 21710 | 4.35 | 4.98 | M8
R9 | 6793 | 64540 | 9.50 | 11.80 | 27987 | 4.12 | 4.74 | M9
R10 | 8454 | 63760 | 7.54 | 10.71 | 27137 | 3.21 | 4.35 | M10
Table 3. Comparison of compression results (bits/v) of TG, PM, CPM, and our scheme. The geometry information is quantized to 10 bits per coordinate. The data are the cumulative compression results for transmitting full resolution models.
Model | Vertex | TG (single resolution) G/v / C/v | PM (continuous progressive) G/v / C/v | CPM (hierarchical progressive) G/v / C/v | Our scheme G/v / C/v
Bunny | 35947 | NA / 1.6 | NA / 37 | 15.4 / 7.2 | 10.71 / 4.35
Horse | 19851 | NA / 1.4 | NA / 28 | 14.2 / 7.0 | 11.97 / 4.56
Triceratops | 2832 | 10.4 / 2.2 | NA / 32 | 14.5 / 7.0 | 12.45 / 4.87
References 1. Bajaj, C.L., Pascucci, V., Zhuang, G.: Progressive Compression and Transmission of Arbitrary Triangular Meshes. In Proc. Visualization (1999) 307-316 2. Choi, J.S., Kim, Y.H., Lee, H.J., Park, I.S., Lee, M.H., Ahn, C.: Geometry Compression of 3-D Mesh Models using Predictive Two-Stage Quantization. IEEE Circuits and Systems for Video Technology, Vol. 10(2). (2000) 312-322 3. Cleary, J.G., Neal, R.M., Witten, I.H.: Arithmetic Coding for Data Compression. ACM Comm., Vol. 30(6). (1987) 520-540 4. Cohen-Or, D., Levin, D., Remez, O.: Progressive Compression of Arbitrary Triangular Meshes. In Proc. Visualization (1999) 62-72 5. Deering, M.: Geometry Compression. In Proc. SIGGRAPH (1995) 13-20 6. Heckbert, P.S., Garland, M.: Survey of Polygonal Surface Simplification Algorithms. In Proc. SIGGRAPH Course Notes 25. (1997) 7. Hoppe, H.: Progressive Meshes. In Proc. SIGGRAPH (1996) 99-108 8. Hoppe, H.: Efficient Implementation of Progressive Meshes. Technical Report MSR-TR98-02. Microsoft Research. (1998) 9. Li, J., Kuo, C.C.: Progressive Coding of 3D Graphics Models. In Proc. IEEE. Vol. 96(6). (1998) 1052-1063 10. Pajarola, R., Rossignac, J.: Compressed Progressive Meshes. IEEE Visualization and Computer Graphics, Vol. 6(1). (2000) 79-93 11. Peng, J., Kim, C.-S., Kuo, C.-C. J.: Technologies for 3D Mesh Compression: A Survey. J. Vis. Commun. Image, R. 16. (2005) 688–733 12. Rossignac, J.: Edgebreaker: Compressing the Incidence Graph of Triangle Meshes. IEEE Visualization and Computer Graphics, Vol. 5(1). (1999) 47-61 13. Soucy, M., Laurendeau, D.: Multiresolution Surface Modeling Based on Hierarchical Triangulation. Computer Vision and Image Understanding, Vol. 63. (1996) 1-14 14. Taubin, G., Rossignac, J.: Geometric Compression Through Topological Surgery. ACM Graphics, Vol. 17(2). (1998) 84-115 15. Touma, C., Gotsman, C.: Triangle Mesh Compression. In Proc. Graphics Interface (1998) 26-34 16. Valette, S., Prost, R.: A Wavelet-Based Progressive Compression Scheme for Triangle Meshes: Wavemesh. IEEE Visualization and Computer Graphics, Vol. 10(2). (2004) |123-129
Shape Preserving Hierarchical Triangular Mesh for Motion Estimation Dong-Gyu Lee1 and Su-Young Han2 1 Dept. of Computer Engineering, Hanbuk University, Sangpae-dong, Dongducheon, Kyounggi-do, 483-120, Korea
[email protected] 2 Dept. Of Computer Science, Anyang University, Samseong-ri, Buleun-myeon, Ganghwa-gun, Incheon, 417-833, Korea
[email protected] Abstract. In this paper, we propose a motion estimation method using hierarchical triangulation that changes the triangular mesh structure according to its motion activity. The subdivision of triangular mesh is performed from the amount of motion that is calculated from the variance of frame difference. As a result, node distribution is concentrated on the region of high activity. The subdivision method that makes it possible to yield hierarchical triangular mesh is proposed as well as the coding method reducing additional information for hierarchical mesh structure is described. From the experimental result, the proposed method has better performance than the conventional BMA and the other mesh based methods. Keywords: Motion Estimation, Motion Compensation, Triangular Mesh.
1 Introduction Block-based motion estimation and motion compensation are widely employed in modern video-compression systems. The block matching algorithm (BMA) assumes that the object displacement is constant within a small block of pixels. This assumption presents difficulties in scenes with multiple moving objects and rotating objects. When adjacent blocks are assigned different displacements, a BMA may form discontinuities in the reconstructed image called blocking artifacts. Triangular and quadrilateral meshes were proposed as an alternative to BMA in order to allow for affine or bilinear motion estimation and thus achieve better performance in complex motion fields. There are two types of mesh structure, regular meshes [1-2] and content-based meshes [3-5]. In regular meshes, nodes are placed at fixed intervals throughout the image. There is no need to send the mesh topology to the decoder. But this may not have the ability to reflect scene content, i.e., a single mesh element may contain multiple motions. Methods using content-based meshes address this problem; they extract the scene characteristic points and apply triangulation to the points. However, these methods require much overhead information about the structure of the mesh and the position of nodes. L.-W. Chang, W.-N. Lie, and R. Chiang (Eds.): PSIVT 2006, LNCS 4319, pp. 929 – 938, 2006. © Springer-Verlag Berlin Heidelberg 2006
In this paper, we propose a hierarchical triangular mesh generation method, in which patches that have high motion activity are successively subdivided, together with a coding method that reduces the additional information needed to describe the hierarchical triangular mesh structure.
2 Hierarchical Triangular Mesh (HTM) The first step of mesh-based methods is to divide the image into many triangular patches. The motion vectors of the triangular patches are estimated by the node displacement and then the moving position of node is obtained. The motion vectors of the three vertices of a triangular patch are employed in determining the six parameters of an affine transform. In backward motion estimation, the displacement of every pixel inside the triangle can also be obtained from the previous frame using the affine transformation. In generating hierarchical mesh structure, we use the regular meshes which are isosceles right triangle in the coarsest level (level 0). The nodes in this mesh have four-way or eight-way connectivity. According to the motion activity, a new node is introduced in the region which has large motion to construct a finer triangle. The variance of the frame difference (FD) between the previous frame and the current frame is chosen as a measuring function to judge whether the region should be subdivided or not (see Equation (1)).
$$m(P) = \frac{1}{N_P} \sum_{(x,y) \in P} \left( f_{FD}(x,y) - \bar{f}_{FD} \right)^2 \qquad (1)$$
where P is the triangular patch region, $N_P$ is the number of pixels in the region, $f_{FD}(x,y)$ is the frame difference between the current frame and the previous frame, and $\bar{f}_{FD}$ is the mean of $f_{FD}(x,y)$. If $m(P) > TH$, three nodes are added to the sides of the triangle, one per side; otherwise, no node is added because of the small motion value (a sketch of this subdivision test is given after the case list below). Because adjacent triangle meshes share a side, the introduced nodes change the structure of adjacent meshes. From the node addition, the following cases occur. (1) If the current mesh and its adjacent meshes are not influenced by node addition, the number of nodes on the sides of the triangle is 0. (2) If a node is added to the adjacent meshes but not to the current mesh, the number of nodes on the sides of the triangle is 1 or 2 (see Fig.1(a)-(e)). (3) If a node is added to the current mesh or is added to all adjacent meshes, the number of nodes on the sides of the triangle is 3 (see Fig.1(e)-(h)).
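A direct implementation of the subdivision test of Equation (1) is shown below; rasterizing a triangular patch into a pixel mask is assumed to be available and is not part of the paper.

import numpy as np

def motion_activity(frame_diff, patch_mask):
    # Equation (1): variance of the frame difference inside a triangular patch.
    values = frame_diff[patch_mask]
    return np.var(values)  # equals mean((f_FD - mean(f_FD))^2)

def should_subdivide(frame_diff, patch_mask, threshold):
    # Nodes are added to all three sides when m(P) exceeds the threshold TH.
    return motion_activity(frame_diff, patch_mask) > threshold

# Hypothetical usage on a synthetic frame difference:
rng = np.random.default_rng(3)
fd = rng.normal(0, 4, (32, 32))
mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True  # stand-in for a rasterized triangle
print(should_subdivide(fd, mask, threshold=10.0))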
Figure 1 illustrates the possible triangulations according to the number and the position of the nodes. In the proposed algorithm, to allow consecutive subdivision of a triangle, the shape of a mesh at a coarse level must be preserved at the finer levels, too. Mesh structures at all levels are restricted to isosceles right triangles. According to the number and the position of nodes on the sides, the algorithm of triangulation is as follows.
Fig. 1. The possible triangulation according to the number and the position of the nodes
Case 0. If the constructed mesh is already an isosceles right triangle, no further processing is necessary (see Fig. 1(a),(c),(f),(h)). Case 1. If all nodes are located only on the legs of an isosceles right triangle, a new node is introduced on the hypotenuse of the isosceles right triangle (see Fig. 1(e)). Figure 2 represents the method of isosceles right triangulation. Here the vector a is from a node on a side to the opposite vertex and the vector b is the side vector containing the node. For the generated triangles to be isosceles right triangles, the vectors a and b must be orthogonal for all nodes. If not, a node must be introduced on the hypotenuse of this triangle.
Fig. 2. Isosceles right triangulation by node addition
The proposed triangulation procedure is as follows. Step 1) At level 0, generate regular meshes, which are isosceles right triangles. Step 2) Examine whether the distance between triangle vertices is the minimum distance at that level and determine whether the patch is split or not. If the amount of motion activity is larger than the given threshold, add a node to each side of the triangle. That is, if m(P) > TH, a node is introduced on all sides.
Step 3) In order for all meshes to be isosceles right triangles, evaluate the inner product of the side vector b containing the node and the vector a from the node to the opposite vertex. If they are not orthogonal, add a new node to the hypotenuse. Iterate the procedure over all meshes until adding a node is no longer necessary. That is, if a·b ≠ 0, a node is introduced on the hypotenuse. Step 4) Triangulate the patches according to the number of nodes on each side of the triangle. Step 5) Increase the level by one and divide the minimum size of the triangle by two. Repeat Step 2).
Fig. 3 indicates this triangulation procedure and Fig. 4 shows the result of sample image.
Fig. 3. The example of hierarchical triangulation (a) level 0 (b) level 1 (c) level 2
Fig. 4. Hierarchical triangular mesh(Football-frame No.2) (a) level 0 (b) level 1 (c) level 2
3 Estimation of Motion Vector in HTM In order to estimate the affine transformation parameters for motion compensation, it is required to estimate the motion vectors at nodes. The motion vectors of the three vertices of a triangular patch are employed in determining the six parameters of an
affine transformation. We use the BMA to get the approximate displacement and then apply the hexagonal matching algorithm(HMA) as a refinement. In order to estimate the moving position of a node, we refine a considered node at the position, where the difference between the reconstructed and the original images is minimal 3.1 Node Motion Estimation Using BMA
BMA centering a considered node is used to search the approximate displacement of node. Because in hierarchical mesh, the distance between nodes is too short as level increases and the additional refinement process is introduced, BMA is applied to the nodes in only level 0. For the other nodes, bilinear interpolation using the position of nodes in level 0 is used to estimate the node position. 3.2 Affine Motion Compensation
After the displacement of the three vertices of a triangular patch is obtained, the image is reconstructed using the affine transformation. For each point (x, y) of a considered triangle, the corresponding transformed position (x', y') is given by
$$u(x,y) = (x - x') = a_1 + a_2 x + a_3 y, \qquad v(x,y) = (y - y') = a_4 + a_5 x + a_6 y \qquad (2)$$
where $a_1, \ldots, a_6$ are the six motion parameters of the considered triangle. Let us consider that Equation (2) is available at each vertex of the triangle:
$$\begin{bmatrix} u(x_1,y_1) \\ u(x_2,y_2) \\ u(x_3,y_3) \\ v(x_1,y_1) \\ v(x_2,y_2) \\ v(x_3,y_3) \end{bmatrix} = \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} = \begin{bmatrix} 1 & x_1 & y_1 & 0 & 0 & 0 \\ 1 & x_2 & y_2 & 0 & 0 & 0 \\ 1 & x_3 & y_3 & 0 & 0 & 0 \\ 0 & 0 & 0 & 1 & x_1 & y_1 \\ 0 & 0 & 0 & 1 & x_2 & y_2 \\ 0 & 0 & 0 & 1 & x_3 & y_3 \end{bmatrix} \mathbf{a} = \begin{bmatrix} \Phi & 0 \\ 0 & \Phi \end{bmatrix} \mathbf{a} \qquad (3)$$
where $\Phi = \begin{bmatrix} 1 & x_1 & y_1 \\ 1 & x_2 & y_2 \\ 1 & x_3 & y_3 \end{bmatrix}$, and by rewriting the vector $\mathbf{a}$ as a function of $\Phi^{-1}$, Equation (3) becomes
$$\mathbf{a} = \begin{bmatrix} \Phi^{-1} & 0 \\ 0 & \Phi^{-1} \end{bmatrix} \begin{bmatrix} u_1 \\ u_2 \\ u_3 \\ v_1 \\ v_2 \\ v_3 \end{bmatrix} \qquad (4)$$
where
$$\Phi^{-1} = \frac{1}{\det(\Phi)} \times \begin{bmatrix} x_2 y_3 - x_3 y_2 & x_3 y_1 - x_1 y_3 & x_1 y_2 - x_2 y_1 \\ y_2 - y_3 & y_3 - y_1 & y_1 - y_2 \\ x_3 - x_2 & x_1 - x_3 & x_2 - x_1 \end{bmatrix} \qquad (5)$$
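Equations (3)-(5) amount to solving two 3x3 linear systems, one for (a1, a2, a3) and one for (a4, a5, a6). The sketch below does this with a generic linear solver instead of the explicit adjugate of Equation (5); the two are mathematically equivalent for a non-degenerate triangle.

import numpy as np

def affine_parameters(vertices, displacements):
    """vertices: 3x2 array of (x_i, y_i); displacements: 3x2 array of (u_i, v_i).
    Returns a = (a1..a6) with u = a1 + a2*x + a3*y and v = a4 + a5*x + a6*y."""
    v = np.asarray(vertices, dtype=float)
    d = np.asarray(displacements, dtype=float)
    phi = np.column_stack([np.ones(3), v[:, 0], v[:, 1]])  # the matrix Phi of Eq. (3)
    a_u = np.linalg.solve(phi, d[:, 0])
    a_v = np.linalg.solve(phi, d[:, 1])
    return np.concatenate([a_u, a_v])

def warp_point(a, x, y):
    # Equation (2): map a point of the current triangle into the reference frame.
    u = a[0] + a[1] * x + a[2] * y
    v = a[3] + a[4] * x + a[5] * y
    return x - u, y - v

# Hypothetical triangle with small node displacements:
a = affine_parameters([(0, 0), (16, 0), (0, 16)], [(1.0, 0.5), (0.5, 0.0), (0.0, 1.0)])
print(a, warp_point(a, 8, 8))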
By substituting the affine parameters into Equation (2), we can perform motion compensation of any point in the current triangle. Because the affine parameters are non-integer real numbers, the resulting coordinates in the reference frame are too. In this case, bilinear interpolation is used to estimate the pixel location. 3.3 Node Refinement
In order to estimate the moving position of a node, we refine a considered node at the position, where the difference between the reconstructed and original images is minimal. Because moving a node affects the structure of adjacent meshes, while keeping neighboring nodes fixed, we move a considered node and reconstruct the image by affine transformation. Motion vector is decided from the position that minimizes the difference. PSNR as the cost function is defined such as
$$PSNR = 10 \log_{10} \frac{255^2}{MSE} \; \text{dB}, \qquad MSE = \frac{1}{N_P} \sum_{(x,y) \in P} \left( I(x,y) - \hat{I}(x,y) \right)^2 \qquad (6)$$
where P is the region bounded by the neighboring nodes, $N_P$ is the number of pixels in that region, and $I(x,y)$ and $\hat{I}(x,y)$ are the pixel intensities of the original image and the reconstructed image, respectively. The refinement process is done on a finer level after all nodes in the coarse level have been refined. The above refinement process is iterated until the positions of the nodes are optimized. If the neighboring nodes were not moved in the previous refinement pass and the considered node is not moved in the current refinement pass, the refinement process is finished.
4 Coding of the Hierarchical Triangular Mesh In contrast to BMA or regular meshes, the proposed mesh structure is deformable, so additional information for describing the mesh structure is required. In this paper, because the subdivision of a triangular mesh is based on the motion activity, the additional information can be minimized by coding only the existence of added nodes. The reconstruction of the mesh structure is accomplished by the same mesh generation process according to the subdivision information. We let code '0' represent that the number of added nodes is one or zero, code '10' represent that the number of added nodes is two, and '11' represent that the number of added nodes is three; that is, '1' represents a patch that can be subdivided in the higher
Fig. 5. The coding of hierarchical mesh structure
composed of two adjacent triangles is subdivided into 8 patches, and the number of added nodes is coded for each patch. For the reconstruction of the mesh structure, if the code '11' is encountered in the received stream, three nodes are added to the patch and the new nodes are connected with each other to form an isosceles right triangle.
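A toy sketch of this prefix code is given below (Python; names are ours). It only illustrates the bit layout of the subdivision information; how the decoder distinguishes zero from one added node under code '0' is not spelled out in the text, so the decoder here simply returns 0 for that code.

```python
# Codeword table as described above: '0' when zero or one node is added to a
# patch, '10' for two added nodes, '11' for three added nodes.
CODE = {0: '0', 1: '0', 2: '10', 3: '11'}

def encode_added_nodes(counts):
    """Concatenate the codewords for the per-patch numbers of added nodes."""
    return ''.join(CODE[c] for c in counts)

def decode_added_nodes(bits, num_patches):
    """Parse the stream back into per-patch symbols; '10' and '11' are
    recovered exactly, '0' is returned as 0 (zero or one added node)."""
    symbols, i = [], 0
    while len(symbols) < num_patches:
        if bits[i] == '0':
            symbols.append(0)
            i += 1
        else:
            symbols.append(2 if bits[i + 1] == '0' else 3)
            i += 2
    return symbols
```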
5 Experimental Results

The performance of the proposed mesh structure was evaluated on two different test sequences using various mesh structures. These sequences (Claire, Football) consist of 352×288 pixels and 352×240 pixels, respectively. Claire is tested at a 3-frame interval for frames 1 to 100, and Football is tested at every frame for frames 1 to 30. In the simulation, the performance of the proposed hierarchical triangular mesh (HTM) is compared with that of BMA, the hexagonal matching algorithm (HMA) [1] and hierarchical grid interpolation (HGI) [6]. A motion vector is estimated for each node, and the reconstructed image is then calculated by motion compensation from the displacements of these nodes. As a measure of the quality of the reconstructed image, the PSNR between the original image and the reconstructed image was used. We use full-search BMA with a search region of [-8, +7] pixels in both the horizontal and vertical directions. In the regular mesh, the node distance is 16×16 pixels, the initial node search region with BMA is [-8, +7], and the search region in the iterative refinement process is [-3, +3]. The other conditions are the same as those of the regular mesh.
We used a 3-level hierarchical structure in HTM. The minimum node distance is 32×32 in level 0, 16×16 in level 1 and 8×8 in level 2. For each triangular patch in which the variance of the frame difference is larger than the threshold, additional nodes are inserted on each side. Node refinement is processed from the low-level nodes to the higher-level nodes. For the sake of performance comparison with the other mesh-based methods, we adjusted the numbers of nodes in HGI and HTM to be similar to those of the other methods. The threshold value for node addition is renewed by the following Equation (7), which keeps the total number of nodes approximately the same as in the regular mesh, that is, about 437 ± 5%:

TH_{var}(n+1) = TH_{var}(n) \times \left\{ 1 + \frac{TotalNode - 437}{437} \right\}    (7)
where TH_{var}(n) and TH_{var}(n+1) are the variance thresholds of the current iteration and the next iteration, respectively, and TotalNode is the total number of nodes.

Table 1. Average PSNR (dB)

Input Image   Algorithm   After Complete Convergence   After 3 Iterations
Claire        BMA         36.0098                      -
Claire        HMA         38.0738                      37.9504
Claire        HGI         38.6513                      38.4131
Claire        HTM         39.1313                      38.9704
Football      BMA         22.7097                      -
Football      HMA         23.1569                      22.3779
Football      HGI         23.5748                      22.7508
Football      HTM         24.1643                      23.9737
During node refinement and image reconstruction, the pixel values at the sub-pixel positions produced by the affine transform were calculated by bilinear interpolation. In general, two or three iterations of node refinement are sufficient in the refinement process. For the performance comparison, Fig. 6 and Fig. 7 show the PSNR of the reconstructed images after complete convergence, and Table 1 shows the average PSNR for each condition.
Fig. 6. The PSNR of motion compensated image (Claire)
Fig. 7. The PSNR of motion compensated image (Football)
Table 2 shows the total number of nodes and the average bitrate for coding the mesh structure. Each motion vector is allocated 4 bits for the horizontal and vertical directions, respectively. The total transmitted bits in HGI and HTM are the sum of the bits for coding the mesh structure and the bits for the motion vectors. Because the number of nodes is variable in these cases, the average bits per image are presented. In Fig. 6 and Fig. 7, we can see that HTM achieves better PSNR performance than the regular mesh or HGI with a similar number of nodes. Contrary to the regular mesh, the HTM structure requires overhead information for mesh structure coding, but with an appropriate coding method the mesh structure can be encoded efficiently. The proposed method is superior to the regular mesh by about 0.5 dB to 1 dB in average PSNR; higher gains are achieved for images with large motion activity. Table 1 also shows the average PSNR after three iterations; this result indicates that the proposed algorithm improves the node convergence speed of the refinement process compared with the other mesh-based methods.

Table 2. Average Bitrate (Transmitted)

Input Image   Algorithm   Average Bits
Claire        BMA         396*8
Claire        HMA         428*8
Claire        HGI         430.6*8 + 452.21
Claire        HTM         434.6*8 + 666.27
Football      BMA         330*8
Football      HMA         362*8
Football      HGI         437.5*8 + 461.5
Football      HTM         439.0*8 + 718.1
6 Conclusion In this paper, a new hierarchical mesh generation method has been proposed. The size of triangular patch is varied according to motion activity of a video sequence. Because the mesh generated is denser in moving area than in still area, the motion compensation pays more attention on moving area such that better performance is
achieved. We exploit a hierarchical mesh structure in which the subdivided patches form isosceles right triangles like the initial triangular patch. This makes further subdivision possible and reduces the overhead information for coding the mesh structure. Simulation results show that the proposed algorithm improves the motion compensation performance by about 0.5 dB to 1 dB in PSNR compared with the other mesh-based methods, and is also suitable for low bitrate coding.
References

[1] Y. Nakaya and H. Harashima: Motion Compensation Based on Spatial Transformations. IEEE Trans. on Circuits and Systems for Video Technology, 4(3), June (1994) 339-356
[2] M. Sayed and W. Badawy: A Novel Motion Estimation Method for Mesh-Based Video Motion Tracking. ICASSP 2004, (2004) 337-340
[3] M. Dundon, O. Avaro, C. Roux: Triangular Active Mesh for Motion Estimation. Signal Processing: Image Communication, 10, (1997) 21-41
[4] G. Al-Regib, Y. Altunbasak and R. M. Mersereau: Hierarchical Motion Estimation with Content-Based Meshes. IEEE Trans. on Circuits and Systems for Video Technology, 13(10), Oct. (2003) 1000-1005
[5] M. Servais, T. Vlachos and T. Davies: Affine Motion Compensation Using a Content-Based Mesh. IEE Proc.-Vis. Image Signal Process., 152(4), August (2005) 415-423
[6] C.-L. Hung and C.-Y. Hsu: A New Motion Compensation Method for Image Sequence Coding Using Hierarchical Grid Interpolation. IEEE Trans. on Circuits and Systems for Video Technology, 4(1), Feb. (1994) 42-52
Parameter Selection of Robust Fine Granularity Scalable Video Coding over MPEG-4

Shao-Yu Wang and Sheng-Jyh Wang

National Chiao Tung University, 1001 Ta Hsueh Road, Hsin Chu, Taiwan, R.O.C.
[email protected],
[email protected]

Abstract. In this paper, we propose an approach to automatically choose the α and β parameters in the RFGS coding scheme for the coding of video sequences. We first analyze the residuals after motion compensation with different parameter settings under various bit rates. We then investigate the relationship between the MSE gain and MSE loss for each bit-plane. With these models and relationships, we first propose an approach to automatically choose the sets of optimized parameters for a fixed bit rate, and then an approach to deal with a range of bit rates. These two MSE-based approaches require only light computational loads.
1 Introduction

In the past decades, video compression has drawn significant attention due to the challenging task of transmitting video data through channels with limited bandwidth. Traditionally, video compression standards, like MPEG-1, aim at optimizing the coding efficiency at a fixed bit rate. For transmission over wireless networks, however, video data had better be coded in a scalable way since the available bandwidth varies dynamically. Therefore, for wireless transmission, video sequences need to be optimized for a wide range of bit rates instead of a fixed bit rate. In MPEG-4, the fine granularity scalability (FGS) scheme has been proposed to offer a scalable way to cope with bandwidth fluctuation [1]. The concept of FGS is illustrated in Fig. 1. In FGS, a video sequence is partitioned into two layers: a base layer, which satisfies the minimum bandwidth requirement, and an enhancement layer, which is obtained by subtracting the base layer from the original sequence. The base layer is coded using DCT-based non-scalable single-layer coding, while the enhancement layer is coded using bit-plane coding and can be truncated at any bit [2]. In the FGS scheme, the motion compensation process is only applied to the base layer and the enhancement layer is coded without temporal prediction. Although this scheme may enhance the capability of error resilience, the lack of temporal prediction in the enhancement layer may sacrifice coding efficiency. After the FGS proposal, several approaches have been proposed to increase the coding efficiency by exploiting the temporal correlation [3], [5]-[8]. Among these approaches, the robust fine granularity scalability (RFGS) scheme proposed in [3] offers a flexible structure to control the level of temporal prediction. In RFGS, two parameters, α and β, are utilized to control the temporal prediction process, as shown
in Fig. 2. The parameter β, with 0 ≤ β ≤ Nb, controls the number of bit-planes in the enhancement layer that are used for motion compensation, while the parameter α, with 0 ≤ α ≤ 1, is a leaky factor that can be used to attenuate the amount of error drift. Here, Nb denotes the number of bit-planes in the enhancement layer. For more details, please refer to [3].
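As a rough illustration of how α and β act together, the sketch below forms an enhancement-layer prediction reference from the base layer plus the first β bit-planes of the previous enhancement frame, attenuated by α. This is only a simplified sketch under our own assumptions about the bit-plane representation (magnitude planes, most significant first); it is not the normative RFGS reconstruction, for which [3] should be consulted.

```python
import numpy as np

def rfgs_reference(base, enh_bitplanes, alpha, beta):
    """Illustrative only: reference = base + alpha * (value carried by the
    first `beta` bit-planes of the previous enhancement frame).
    `enh_bitplanes` is a list of 0/1 arrays, most significant plane first."""
    assert 0.0 <= alpha <= 1.0 and 0 <= beta <= len(enh_bitplanes)
    nb = len(enh_bitplanes)                      # Nb: number of bit-planes
    partial = np.zeros_like(base, dtype=float)
    for k in range(beta):                        # use only the first beta planes
        partial += enh_bitplanes[k] * float(1 << (nb - 1 - k))
    return np.asarray(base, float) + alpha * partial
```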
[Figure: FGS encoder and decoder connected by a channel; the base layer plus K bit-planes of the enhancement layer (Enh. = Enhancement Layer) are transmitted]

Fig. 8. Average waiting time on a BroadCatch-based VoD system (θ = 0.729)

0.5, the proposed method has a smaller average waiting time than that for BroadCatch. Fig. 10 shows the average number of allocated channels and the number of total viewers for the 20 movies in the VoD system. It shows that more channels are indeed allocated to more popular movies, leading to the reduction of the average waiting time.
Fig. 9. Average waiting time as a function of the skew factor θ (20 movies)
Fig. 10. Average number of allocated channels and the number of users (θ = 0.729, (u, l, i) = (9, 4, 5))
5 Conclusion

Various broadcasting schemes are used in near video-on-demand systems to serve multiple movies to a large number of viewers. In these systems, seamless channel transition provides the ability to dynamically reallocate channels without service interruption and thus can be used as an important performance-tuning tool. In this paper, we show how seamless channel transition can be achieved for the BroadCatch scheme. Our simulation shows that the average waiting time can be reduced when seamless channel transition is employed along with a suitable transition strategy.

Acknowledgments. This research is supported in part by an ISU grant, ISU-94-01-08.
Spatial Interpolation Algorithm for Consecutive Block Error Using the Just-Noticeable-Distortion Method

Su-Yeol Jeon, Chae-Bong Sohn, Ho-Chong Park, Chang-Beom Ahn, and Seoung-Jun Oh

VIA-Multimedia Center, Kwangwoon University, 447-1, Wolgye-Dong, Nowon-Gu, 139-701, Seoul, Korea
{k2ambo, via-staff}@viame.re.kr
Abstract. This paper presents a new error concealment algorithm based on edge-oriented spatial interpolation and the JND (Just-Noticeable-Distortion) function, which is used to select valid edge information in adjacent blocks. Several conventional algorithms suffer from high computational complexity or inaccurate recovery of missing blocks having strong edgeness. In order to alleviate these drawbacks, the proposed algorithm estimates a dominant direction along which to interpolate the missing pixel values. At this step, an automatic thresholding method based on the JND function is proposed to determine a valid direction. Then, a lost block is recovered by weighted linear interpolation along the dominant vector. Finally, the median filter is applied to recover pixels that are not located on a line with the estimated dominant direction. Simulation results show that the proposed algorithm improves the PSNR of the recovered image by approximately 2.5 dB over Hsia's algorithm and can significantly reduce computational complexity while keeping subjective quality as good as the RIBMAP method.

Keywords: Error concealment, block-loss recovery, JND function, matching vector, interpolation.
1 Introduction

Block-loss recovery is very important for block-based coding, such as the JPEG and MPEG standards. When a block is lost, it can be estimated from adjacent pixels in a post-processing step. Many spatial interpolation techniques for block-loss recovery have been proposed, but no algorithm is yet acceptable for commercial products in terms of both computational complexity and recovery accuracy. Lee et al. [1] introduced a cubic spline interpolation using a sliding recovery window. Tsekeridou and Pitas [2] proposed a split-match algorithm using 'best match' 4x4 blocks. Sun and Kwok [3] presented a spatial interpolation algorithm using projections onto convex sets [4]. Among the spatial-domain error concealment algorithms, Hsia's algorithm achieves relatively good quality with low computational complexity compared to the other algorithms [5]. This algorithm recovers the error blocks by identifying two 1-D boundary matching vectors. However, its recovery direction is still
inaccurate, since the edge directions of the boundary pixels are not considered. In particular, the performance of Hsia's algorithm declines for regions having multiple edge directions in the boundary pixels. To cope with this shortcoming, we propose a new error recovery method whose key operation is to efficiently find a boundary matching vector according to the edges of neighboring blocks. The proposed scheme works as follows. First, the algorithm detects edges in the neighboring blocks and selects valid directions based on the JND function [6]. Then, a dominant direction is selected as the matching vector among the detected valid edges. Finally, according to the direction of the matching vector, spatial interpolation is applied to the lost blocks. The rest of this paper is organized as follows. In Section 2, the proposed algorithm is presented in detail. Simulation results are shown in Section 3. Finally, Section 4 concludes this paper.
2 Proposed Algorithm

2.1 Adjacent and Connected Blocks

A lost block, L, and its surrounding neighboring blocks are illustrated in Fig. 1. In Fig. 1(a), six neighboring blocks, ATL, AT, ATR, ABL, AB and ABR, and the boundary pixel lines, D, are shown with the lost block, L, of size N×N, in the center. Fig. 1(b) shows the 3x3 boundary blocks, B, which are adjacent to the lost block and lie in the neighboring blocks of the lost block, L.
Fig. 1. Lost block and its neighboring blocks (a) Lost block L and six neighboring blocks ATL, AT, ATR, ABL, AB and ABR. (b) Lost block L and boundary blocks, B.
In the edge-oriented spatial interpolation algorithm proposed by Hsia, lost blocks are restored by 1-D boundary matching between the top and bottom boundary pixel lines in the neighboring blocks. Since this algorithm does not take the edge directions of adjacent blocks into account, its result can deteriorate as shown in Fig. 2.
Fig. 2. The recovered image using Hsia’s spatial interpolation algorithm for MIT image
2.2 JND Function-Based Edge Detection

To find the edge direction and magnitude for the adjacent blocks, A, a Sobel edge detector is applied in the spatial domain to the 3x3 boundary blocks, B. As shown in Fig. 3, each boundary block is divided into two regions according to the edge direction. If the difference between the two divided regions is larger than the JND value at the center gray level of the two regions, the edge is selected as a valid edge. Note that the JND represents the smallest difference in a specified modality of sensory input that is detectable by a human being.
JND(g(x, y)) = \begin{cases} T_o \left( 1 - \sqrt{g(x, y)/127} \right) + 3, & \text{for } g(x, y) \le 127 \\ \gamma \left( g(x, y) - 127 \right) + 3, & \text{otherwise} \end{cases}    (1)
where g(x, y) is the gray-scale value at position (x, y), T_o = 17 and γ = 3/128 [6].
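A small Python sketch of this edge-validity test follows (function names are ours). We read "the JND value at the center of the two regions" as the JND evaluated at their mean gray level; that reading is an assumption on our part.

```python
def jnd(g, t0=17.0, gamma=3.0 / 128.0):
    """Eq. (1): background-luminance-dependent just-noticeable difference
    for a gray level g in [0, 255]."""
    if g <= 127:
        return t0 * (1.0 - (g / 127.0) ** 0.5) + 3.0
    return gamma * (g - 127.0) + 3.0

def is_valid_edge(mean_a, mean_b):
    """A Sobel edge splits a 3x3 boundary block into two regions; keep it
    only if their mean difference exceeds the JND at the average gray level."""
    return abs(mean_a - mean_b) > jnd(0.5 * (mean_a + mean_b))
```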
Fig. 3. Edge direction patterns for 3x3 boundary block, B
2.3 Matching Vector Selection
Matching vector selection is depicted in Fig. 4. According to the JND function-based edge detection, each boundary block may have a valid edge direction, which is a candidate for the best matching vector. The best matching vector is selected as the direction that is most frequently found among all the valid directions. The selected best matching vector represents the dominant edge direction of the neighboring blocks.
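The selection itself is a simple vote over the valid directions, as sketched below (Python; name is ours). Running it once for the top boundary blocks and once for the bottom boundary blocks yields the at-most-two matching vectors used later.

```python
from collections import Counter

def best_matching_vector(valid_directions):
    """Pick the dominant edge direction among the valid directions found in
    the boundary blocks of one side (top or bottom) of the lost block."""
    if not valid_directions:
        return None          # no valid edge: no matching vector for this side
    return Counter(valid_directions).most_common(1)[0][0]
```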
Fig. 4. Matching vector selection
This selection process is executed on the top and the bottom boundary blocks, B, respectively. As a result, each lost block, L, has at most two matching vectors.

2.4 Spatial Interpolation
To recover block losses, spatial interpolation is applied to each lost block, L. Lost-block restoration is accomplished by linear spatial interpolation whose recovery direction is set to the direction of the matching vector. Fig. 5 shows the proposed spatial interpolation algorithm with two matching vectors. To find the more dominant of the two matching vectors between the top and bottom lines, 1-D block boundary matching is employed. Pixels are first interpolated along the dominant matching vector as shown in Fig. 5(a). Then, unrecovered pixels are interpolated along the other matching vector as shown in Fig. 5(b).
Fig. 5. An example of proposed spatial interpolation algorithm with two-matching vector case
Fig. 6. An example of the proposed spatial interpolation algorithm (a) Single matching vector case (b) No matching vector case
According to the proposed edge detection rule, lost blocks do not always have two matching vectors. Fig. 6 shows the cases in which a lost block does not have two matching vectors: Fig. 6(a) depicts the single-matching-vector case and Fig. 6(b) the no-matching-vector case, in which a lost block does not have any matching vector. In these cases, a non-concealment region can remain in the lost block. Accordingly, 1-D block boundary matching is used to determine which edge direction is more accurate [5]. This rule is applied to the top and bottom directions, respectively. Fig. 7 shows that a few pixels are not recovered even after the two spatial interpolations. Pixels in non-concealment areas are replaced with median-filtered values.
Fig. 7. Examples of spatial interpolated image. A small number of pixels are not recovered. (a) Lena (b) MIT.
Fig. 8. Test results for MIT: (a) Original image (b) Damaged image (c) Hsia's (d) RIBMAP and (e) Proposed algorithm
Fig. 9. Test results for Lena: (a) Original image (b) Damaged image (c) Hsia's (d) RIBMAP and (e) Proposed algorithm
3 Experiments and Results

The performance of the proposed algorithm is evaluated with two well-known 512×512 test images: MIT and Lena. The proposed algorithm is compared with Hsia's [5] and Park's [4] algorithms in terms of PSNR. The PSNRs of the proposed and the conventional algorithms are summarized in Table 1. According to Table 1, the proposed algorithm provides a slightly better PSNR value than Hsia's for the Lena image. For the MIT image, the PSNR of the proposed algorithm is improved by approximately 2.5 dB compared with Hsia's.

Table 1. Comparison of PSNRs for different error concealment algorithms

Algorithms         MIT      Lena
Hsia's scheme      23.784   29.735
RIBMAP             28.127   31.706
Proposed scheme    26.262   29.958
The subjective qualities of Hsia's, Park's [4] and the proposed algorithm were also evaluated. Park's RIBMAP algorithm shows the best restoration performance in terms of both subjective and objective quality. However, this algorithm requires high computational complexity: N recovery vector searches and additional filtering are required to restore an N×N lost block. Figs. 8 and 9 show that the proposed algorithm yields a significant improvement in subjective quality compared with Hsia's and comparable subjective quality to the RIBMAP. According to these test results, the proposed algorithm is better than Hsia's in both objective and subjective terms, while it yields image quality similar to the RIBMAP with much lower computational complexity.
4 Conclusion

Error concealment for still images and intra-frames is one of the key techniques for block-based video coding. Hsia proposed an efficient algorithm for concealment of damaged blocks using boundary matching vectors. However, Hsia's algorithm can be degraded when there are several edges in the boundary pixels. In this work, we proposed a new efficient spatial interpolation algorithm for block error concealment based on JND-oriented edge direction matching. In this algorithm, the best matching vector is selected as the most dominant edge direction in the boundary blocks according to the JND function. Experimental results show that the proposed scheme yields around 2.5 dB better PSNR than Hsia's. With low computational complexity, the image quality recovered by the proposed algorithm is similar to that of the RIBMAP.
References

1. X. Lee, Y.-Q. Zhang, and A. Leon-Garcia, "Information loss recovery for block-based image coding techniques-a fuzzy logic approach," IEEE Trans. Image Process., vol. 4, no. 3, pp. 259–273, Mar. 1995.
2. S. Tsekeridou and I. Pitas, "MPEG-2 error concealment based on block matching principles," IEEE Trans. Circuits Syst. Video Technol., vol. 10, no. 4, pp. 646–658, Jun. 2000.
3. H. Sun and W. Kwok, "Concealment of damaged blocks transform coded images using projections onto convex sets," IEEE Trans. Image Process., vol. 4, no. 4, pp. 470–477, Apr. 1995.
4. J. Park, D. Park, R. Marks, and M. El-Sharkawi, "Recovery of Image Blocks Using the Method of Alternating Projections," IEEE Trans. Image Process., vol. 14, no. 4, pp. 461-474, Apr. 2005.
5. S. C. Hsia, "An edge-oriented spatial interpolation for consecutive block error concealment," IEEE Signal Processing Letters, vol. 11, no. 6, pp. 577–580, June 2004.
6. C. H. Chou and Y. C. Li, "A Perceptually Tuned Subband Image Coder Based on the Measure of Just-Noticeable-Distortion Profile," IEEE Trans. Circuits Syst. Video Technol., vol. 5, no. 6, Dec. 1995.
Error Resilient Motion Estimation for Video Coding

Wen-Nung Lie1, Zhi-Wei Gao1, Wei-Chih Chen1, and Ping-Chang Jui2

1 Department of Electrical Engineering, National Chung Cheng University, Chia-Yi, 621, Taiwan, ROC
[email protected],
[email protected],
[email protected]

2 Materials & Electro-Optics Research Division, Chung Shan Institute of Science & Technology, Lung-Tan, Tao-Yuan 325, Taiwan, ROC
Abstract. Due to the temporal prediction adopted in most video coding standards, errors occurring in the current frame will propagate to succeeding frames that refer to it. This causes substantial degradation of the reconstructed video quality at the decoder side. In order to enhance the robustness of existing temporal prediction techniques, another prediction strategy, called error resilient motion estimation (ERME), which takes both coding efficiency and error propagation into consideration, is proposed. To find MVs that satisfy these two requirements, a constrained optimization problem with a new criterion is formed for ME. The proposed algorithm is implemented for the H.264 video coding standard, where multiple reference frames are allowed for ME. Experimental results show that the proposed algorithm can improve PSNR by up to 1.0 dB (at a packet loss rate of 20%) when compared with full-search ME with the traditional SAD (sum of absolute differences) criterion.

Keywords: H.264, Error resilience, Error concealment, Motion estimation.
1 Introduction

Due to the successful development of techniques such as the Discrete Cosine Transform (DCT), Motion Estimation and Compensation (ME/MC), and Variable Length Coding (VLC), a large amount of redundant information in video can be removed, allowing the data size to be reduced for transmission over channels of limited bandwidth. However, compressed videos are vulnerable to channel errors; even a minor disturbance may make the received video bit stream undecodable. Hence, techniques such as error resilient video coding and error concealment [1], [2] were proposed to enhance the robustness of video against errors or to recover videos from errors. To suppress error propagation without sacrificing coding efficiency, methods of searching motion vectors (MVs) with a criterion other than the least SAD (sum of absolute differences) were proposed [3], [4]. This kind of algorithm can be classified as Error Resilient Motion Estimation (ERME). The methods proposed in [3], [4] adopt the end-to-end distortion as a new criterion for motion search. Although these methods are capable of suppressing error propagation efficiently, their computational complexity in estimating the end-to-end distortion is prohibitively high and thus impractical for real-time applications.
Extending the prior concept to multiple reference frames (up to 5) for ME in the AVC/H.264 video coding standard [7], and avoiding the inherited high computational complexity in estimating the end-to-end distortion, a new ERME algorithm is proposed in this paper. After decomposing the end-to-end distortion into source and channel distortions, an objective measure for evaluating the potential error propagation of an MV, which can be estimated with much lower computational complexity, is derived. By integrating this objective measure into the conventional search criterion for ME, a constrained optimization problem to derive error resilient MVs can be formed. Importantly, we consider not only the error resilience but also the coding efficiency of the estimated MVs; that is, tradeoffs are made between them in the presence of channel errors. The main contribution of this work is that an ERME algorithm of low complexity is developed. Experiments show that our proposed algorithm outperforms the traditional ME method of AVC/H.264, improving the PSNR by up to 1.0 dB at a packet loss rate of 20%.
2 Concept of Error Resilient Motion Estimation

The goal of motion estimation is to enhance the coding efficiency by removing the temporal correlation among frames. However, in error-prone environments, error propagation can be induced in the received videos due to incomplete error concealment and the process of motion compensation (MC). In order to compromise between the suppression of error propagation and the retention of coding efficiency, error resilient motion estimation was proposed [3], [4]. In [3], a criterion for motion estimation that considers the end-to-end distortion, that is, the distortion between the original input frame and the decoded frame at the receiver end, was proposed. MVs that minimize the defined criterion are then chosen for encoding. Their experimental results show better quality than traditional MVs. In [4], a motion estimation algorithm similar to [3] was proposed, except that stochastic frame buffers were used as references rather than the conventional reconstructed frame buffers. Although their experimental results show that the use of stochastic frame buffers can improve the quality of the decoded videos, their algorithm is not standard-compatible. A common drawback of the criterion used in [3] and [4] for motion estimation is the extra-high computational cost that is induced. The criterion proposed in [3] separates the errors into three sources: error propagation, error concealment, and quantization. To evaluate the distortion from quantization, a heavy computational cost arises since DCT/IDCT and quantization have to be incorporated into the motion estimation process. In this paper, we use a simplified criterion, which is capable of retaining the error resilience performance, so that ERME becomes feasible in practical applications. A similar concept of minimizing the end-to-end distortion has been applied to optimally deciding the coding mode (intra or inter) for each macroblock (MB) [8], [9], [10]. However, in those works the MVs are estimated with the traditional SAD criterion before the mode decision is made based on an estimated end-to-end distortion. Theoretically, our algorithm for error resilient motion estimation can also be used for finding the MVs for mode decision based on the end-to-end distortion.
Before illustrating our proposed algorithm, we first model the end-to-end distortion for videos transmitted in error-prone environments as follows [5]: the end-to-end distortion can be decomposed into a source distortion Ds, incurred by nq, and a channel distortion Dc, incurred by nc. Here, nq is related to the quantization noise and nc to the incompleteness of error concealment and motion compensation. A pictorial illustration of our model is depicted in Fig. 1, where f_n represents the original video signal, \hat{f}_n represents the encoded video (i.e., the decoded video without error), and \tilde{f}_n represents the video reconstructed at the decoder side in the presence of channel noise n_c.

Fig. 1. The modeling of end-to-end distortion
Via the above model, it can be observed that the end-to-end distortion D_{end-to-end} can be formulated as

D_{end-to-end} = E\{(f_n - \tilde{f}_n)^2\} \cong E\{(f_n - \hat{f}_n)^2\} + E\{(\hat{f}_n - \tilde{f}_n)^2\} = D_s + D_c,    (1)

where D_s = E\{(f_n - \hat{f}_n)^2\} and D_c = E\{(\hat{f}_n - \tilde{f}_n)^2\}. This model was first verified in [5] based on a test platform using the H.263 video coding standard. In order to verify its applicability to the H.264/AVC standard, the following process is conducted. In Eq. (2), a measure of deviation e_D is defined as

e_D = \frac{1}{N} \sum_{n=1}^{N} \frac{\left| \left[ D_S(n) + D_C(n) \right] - D_{end-to-end}(n) \right|}{D_{end-to-end}(n)} \times 100\%,    (2)
where n is the frame index and N is the total number of frames in the video sequence. Simulations were conducted by encoding videos, dropping packets randomly at the indicated rates, recovering the lost MBs, and measuring Ds, Dc, and D_{end-to-end}, respectively. After a set of experiments on 4 video sequences under varying packet loss rates (PLR), it was observed that the above simplification of the end-to-end distortion holds with an average e_D of only 0.863%. Based on this fact, D_{end-to-end} can be reduced by minimizing Ds and Dc separately.
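The deviation measure of Eq. (2) is a one-line computation once the three per-frame distortions have been measured, as in the Python sketch below (function name is ours).

```python
import numpy as np

def deviation_measure(d_s, d_c, d_e2e):
    """Eq. (2): average relative deviation e_D (in percent) between
    D_s(n) + D_c(n) and the measured end-to-end distortion, over N frames.
    All inputs are length-N arrays indexed by frame number n."""
    d_s, d_c, d_e2e = (np.asarray(a, float) for a in (d_s, d_c, d_e2e))
    return 100.0 * np.mean(np.abs((d_s + d_c) - d_e2e) / d_e2e)
```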
To derive error resilient MVs, the channel distortion Dc should be modeled first. The search criterion is defined below:

(\Delta x^*, \Delta y^*) = \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} E\left\{ \left( \hat{f}_n^i(\Delta x, \Delta y) - \tilde{f}_n^i(\Delta x, \Delta y) \right)^2 \right\} \right\},    (3)
where \Delta x and \Delta y are the horizontal and vertical components of the MV, S is the feasible set of motion vectors, p is the number of pixels in a block, \hat{f}_n^i(\Delta x, \Delta y) and \tilde{f}_n^i(\Delta x, \Delta y) represent the i-th pixel of a block motion-compensated via MV = (\Delta x, \Delta y) in the n-th frame, and E\{\cdot\} is the expectation operator. Notice that \hat{f}_n^i(\Delta x, \Delta y) represents the correctly decoded pixel value, while \tilde{f}_n^i(\Delta x, \Delta y) is the pixel value considering erroneous reconstruction. Recognizing that most video coding standards adopt SAD as the criterion for motion search, a new criterion is formed below by changing the squared term in Eq. (3) to an absolute term:
(\Delta x^*, \Delta y^*) = \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} E\left\{ \left| \hat{f}_n^i(\Delta x, \Delta y) - \tilde{f}_n^i(\Delta x, \Delta y) \right| \right\} \right\}.    (4)
Based on the fact that E\{|\hat{x} - \tilde{x}|\} \ge |E\{\hat{x} - \tilde{x}\}|, the following inequality holds:

(\Delta x^*, \Delta y^*) = \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} E\left\{ \left| \hat{f}_n^i(\Delta x, \Delta y) - \tilde{f}_n^i(\Delta x, \Delta y) \right| \right\} \right\}
\ge \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} \left| E\left\{ \hat{f}_n^i(\Delta x, \Delta y) - \tilde{f}_n^i(\Delta x, \Delta y) \right\} \right| \right\}
= \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} \left| \hat{f}_n^i(\Delta x, \Delta y) - E\left\{ \tilde{f}_n^i(\Delta x, \Delta y) \right\} \right| \right\}    (5)
Note that the first term in the summation is deterministic with respect to the E\{\cdot\} operator, due to its independence of the channel conditions. In Eq. (5), E\{\tilde{f}_n^i(\Delta x, \Delta y)\} can be estimated easily by using the formula below, which was derived in [6]:

E[\tilde{f}_n^i(\Delta x, \Delta y)] = (1 - p_e)\left\{ E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] + r_n^i(\Delta x, \Delta y) \right\} + p_e \cdot E[\tilde{f}_{n-1}^i],    (6)
where p_e is the error probability of the considered pixel, r_n^i(\Delta x, \Delta y) is the residual produced after motion compensation, E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] is the expected value of the pixel compensated from the j-th pixel of the (n-\alpha)-th frame, and E[\tilde{f}_{n-1}^i] is the expected value of the recovered pixel if the zero-motion scheme is adopted for error concealment.
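Eq. (6) is a simple per-pixel recursion, sketched below in Python (argument names are ours); the returned value becomes the `exp_ref`/`exp_prev` input when the next frame is processed.

```python
def expected_reconstruction(exp_ref, exp_prev, residual, p_e):
    """Eq. (6): first moment of the decoder's reconstruction of one pixel.
    exp_ref  -- E[reconstruction of the motion-compensated reference pixel in frame n-alpha]
    exp_prev -- E[reconstruction of the co-located pixel in frame n-1] (zero-motion concealment)
    residual -- motion-compensated residual r_n^i
    p_e      -- loss probability of the considered pixel."""
    return (1.0 - p_e) * (exp_ref + residual) + p_e * exp_prev
```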
According to the coding principle, \hat{f}_n^i(\Delta x, \Delta y) can be represented as

\hat{f}_n^i(\Delta x, \Delta y) = \hat{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y) + r_n^i(\Delta x, \Delta y),    (7)
where \hat{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y) is the motion-compensated pixel value from the (n-\alpha)-th frame. Substituting Eqs. (6) and (7) into Eq. (5), we obtain:
(\Delta x^*, \Delta y^*) = \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} \left| \hat{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y) + r_n^i(\Delta x, \Delta y) - p_e \cdot E[\tilde{f}_{n-1}^i] - (1 - p_e)\left( E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] + r_n^i(\Delta x, \Delta y) \right) \right| \right\}    (8)
Eq. (8) is actually not a good criterion, due to the prohibitively high complexity of computing r_n^i(\Delta x, \Delta y): to obtain r_n^i(\Delta x, \Delta y), the processes of DCT, quantization, and inverse DCT need to be performed for each MV candidate. This is also the reason why the algorithms provided in [3], [4] are impractical in terms of computational complexity. The above equation can be rewritten as

(\Delta x^*, \Delta y^*) = \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} \left| \hat{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y) - \left( (1 - p_e) \cdot E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] + p_e \cdot E[\tilde{f}_{n-1}^i] \right) + p_e \cdot r_n^i(\Delta x, \Delta y) \right| \right\}    (9)
The term p_e \cdot r_n^i(\Delta x, \Delta y) can be ignored relative to the other three terms, reducing Eq. (9) to Eq. (10):

(\Delta x^*, \Delta y^*) = \arg\min_{(\Delta x, \Delta y) \in S} \left\{ \sum_{i=1}^{p} \left| \hat{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y) - \left( (1 - p_e) \cdot E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] + p_e \cdot E[\tilde{f}_{n-1}^i] \right) \right| \right\}    (10)
The reasons are stated below. In Eq. (9), the term r_n^i(\Delta x, \Delta y) is the prediction error of a given MV, which is often small if high correlation exists between consecutive video frames. However, the second and third terms in Eq. (9), i.e., E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] and E[\tilde{f}_{n-1}^i], represent the expected reconstructions at the decoder and normally cannot be ignored. In addition, we consider values of p_e of no more than 0.3 in our system. In Eq. (10), the first term \hat{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y) is readily available when encoding the n-th frame, while the second term (the first moment of \tilde{f}_{n-\alpha}^j) can be derived by using the technique proposed in [6]. Note from Eq. (6) that the first moment of \tilde{f}_n^i can be recursively updated for the next frame's use after a frame is encoded (i.e., after the residual r_n^i is figured out), hence guaranteeing the availability of E[\tilde{f}_{n-\alpha}^{i,j}(\Delta x, \Delta y)] when evaluating Eq. (10) for the n-th frame. Clearly, the extra computations required in evaluating Eq. (10) come from the updating of E[\tilde{f}_{n-1}^i]. The criterion proposed in Eq. (10) totally ignores the coding efficiency in the optimization. It is possible that the MVs and residuals thus obtained consume a large bit rate for transmission, even though a certain degree of robustness is achieved. Before proceeding, some mathematical terms are defined. Denote by (w, \alpha) an MV candidate w that refers to the (n-\alpha)-th frame. Define EP_{(w,\alpha)} and CE_{(w,\alpha)} to measure the level of error propagation and coding efficiency, respectively, for a given (w, \alpha), as follows.
EP_{(w,\alpha)} = \sum_{i=1}^{p} \left| \hat{f}_{n-\alpha}^{i,j}(w, \alpha) - \left( (1 - p_e) \cdot E[\tilde{f}_{n-\alpha}^{i,j}(w, \alpha)] + p_e \cdot E[\tilde{f}_{n-1}^i] \right) \right|    (11)

CE_{(w,\alpha)} = \sum_{i=1}^{p} \left( \hat{f}_{n-\alpha}^{i,j}(w, \alpha) - f_n^i \right)^2    (12)
Note that Eq.(11), following from Eq.(10), obviously accounts for the minimization of error propagation, while Eq.(12), related to the power of motion-compensated residuals, accounts for the bit rate required for encoding residuals. Since our purpose is to find MVs which can not only suppress possible error propagation, but also remove redundancies efficiently, a new constrained optimization problem based on EP( w ,α ) and CE( w ,α ) is thus formed below:
\min_{(w, \alpha)} \; EP_{(w,\alpha)} \quad \text{subject to:} \quad CE_{(w,\alpha)} \le \sigma^2.    (13)
where \sigma^2 is a predefined power constraint. By controlling the threshold \sigma^2, the magnitude of the residual signal is restricted; this also helps the validity of assuming a small r_n^i(\Delta x, \Delta y) in deriving Eq. (10). Note that a small threshold \sigma^2 might leave Eq. (13) without a solution, while a large \sigma^2 imposes nearly no constraint on minimizing EP_{(w,\alpha)}. The solution of Eq. (13) can easily be found via a full search over the possible (w, \alpha)'s. Note that our algorithm does not change the nature of the full search in the traditional ME process, but provides a more proper criterion for selecting MVs that are expected to achieve both error resilience and coding efficiency. Our criterion is also applicable to fast ME, where only a certain set of MV candidates is evaluated.
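The constrained search of Eqs. (11)-(13) can be sketched as follows (Python; the data layout and names are ours, not the paper's): every MV candidate whose CE cost exceeds the power constraint is discarded, and the EP-minimizing candidate among the rest is returned.

```python
import numpy as np

def erme_search(candidates, cur_block, p_e, sigma2):
    """Exhaustive solution of Eq. (13). Each candidate is a dict holding, as
    arrays of the block size, its motion-compensated prediction `pred`
    (f-hat_{n-alpha}), the expected decoder reconstruction of that reference
    `exp_pred` (E[f-tilde_{n-alpha}]) and the expected reconstruction of the
    co-located block in frame n-1, `exp_prev`."""
    best, best_ep = None, np.inf
    for cand in candidates:
        ce = np.sum((cand['pred'] - cur_block) ** 2)          # Eq. (12)
        if ce > sigma2:
            continue                                           # violates the power constraint
        ep = np.sum(np.abs(cand['pred']
                           - ((1 - p_e) * cand['exp_pred']
                              + p_e * cand['exp_prev'])))      # Eq. (11)
        if ep < best_ep:
            best, best_ep = cand, ep
    return best
```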
3 Experimental Results

The proposed algorithm was implemented on the JM93 test model of the H.264 video coding standard. The tested video sequences include "Akiyo", "Table tennis", and "Stefan" of CIF size. Each sequence contains 100 frames and is coded at 30 fps. Different quantization parameters (QP) (from 20 to 44) are applied to encode the sequences in order to evaluate the proposed algorithm over a wide range of bit rates. Because our algorithm explores the error resilience of MVs, only the first frame is coded as an I picture and the others are coded as P pictures without intra-coded MBs. The block size for motion estimation is fixed at 16x16 pixels and three reference frames are allowed for each MB. Our proposed ERME algorithm is compared with the conventional search criterion based on SAD at two packet loss rates: 5% and 20%. It is assumed that each packet contains 22 successive MBs. The experimental results are shown in Figs. 2, 3 and 4. In these figures, the horizontal and vertical axes represent the average bit rate and the average PSNR of the reconstructed videos at the decoder, respectively. Notice that each erroneous MB at the decoder is concealed via the zero-motion algorithm, i.e., replaced by the MB at the same location in the previous frame. The curve marked with "PMV_X" represents our
Fig. 2. Averaged PSNR curves of decoded "Akiyo" sequence with errors
PMV_20
OMV_5
PMV_5
26 25
PSNR
24 23 22 21 20 19 190
1190
2190
3190
4190
Kbps
Fig. 3. Averaged PSNR curves of decoded “Stefan” sequence with errors
proposed algorithm, and the curve marked with "OMV_X" represents the original MVs obtained by the full search algorithm with the conventional SAD criterion; "X" denotes the packet loss rate in percent. For example, "PMV_5" represents the PSNR curve of the proposed algorithm at a 5% packet loss rate. It is observed from the experimental results that our proposed algorithm is comparable to the conventional ME algorithm in terms of the reconstruction PSNR for the video sequence "Akiyo". The reason is obvious: the inherent error propagation is minor due to the extremely static nature of "Akiyo". However, for video sequences with moderate to high motion, such as "Table tennis" and "Stefan", in which the error propagation can be substantial, the effectiveness of our proposed algorithm in
Fig. 4. Averaged PSNR curves of decoded "Table tennis" sequence with errors
suppressing error propagation becomes much more significant. For the "Stefan" sequence, our proposed algorithm improves the PSNR performance by up to 1.0 dB at PLR = 20%. On the other hand, our algorithm might be inferior to the conventional motion estimation algorithm at low bit rates and for static sequences; this is a result of the tradeoff made between coding efficiency and error resilience. Finally, the choice of \sigma^2 plays an important role in controlling the effectiveness of our proposed algorithm. The tighter the constraint imposed by \sigma^2, the less effective the suppression of error propagation (since EP_{(w,\alpha)} and CE_{(w,\alpha)} might be
contradictory for a given (w, \alpha)). Currently, \sigma^2 is determined by summing the squared MB difference between frames and is hence adaptive to each MB.
4 Conclusions

In this paper, a new error resilient ME algorithm is proposed. In the proposed algorithm, finding MVs becomes a constrained optimization problem that considers both coding efficiency and error propagation. From the experimental results, it is observed that our proposed algorithm can improve the reconstruction quality by up to 1.0 dB in the presence of a PLR of 20% for videos with moderate to high motion. The contribution of our work is that an efficient and feasible algorithm, in contrast to others in the literature ([3], [4]), is proposed to optimally search MVs that are capable of suppressing error propagation while achieving high coding efficiency.
References

1. Yao Wang and Qin-Fan Zhu, "Error control and concealment for video communication: A review," Proc. of the IEEE, vol. 86, no. 5, pp. 974–997, May 1998.
2. Yao Wang, Stephan Wenger, Jiangtao Wen, and Aggelos K. Katsaggelos, "Error resilient video coding techniques," IEEE Signal Processing Mag., vol. 17, pp. 61–82, July 2000.
3. Hua Yang and K. Rose, "Rate-distortion optimized motion estimation for error resilient video coding," in Proc. of IEEE Int'l Conf. Acoustics, Speech, and Signal Processing, vol. 2, pp. 173-176, March 2005.
4. Oztan Harmanci and A. Murat Tekalp, "Stochastic frame buffers for rate distortion optimized loss resilient video communications," in Proc. of IEEE Int'l Conf. Image Processing, vol. 1, pp. 789-792, Sept. 2005.
5. Zhihai He, Jianfei Cai and Chang Wen Chen, "Joint source channel rate-distortion analysis for adaptive mode selection and rate control in wireless video coding," IEEE Trans. on CSVT, vol. 12, no. 6, pp. 511-523, June 2002.
6. R. Zhang, S. L. Regunathan, and K. Rose, "Video coding with optimal intra/inter mode switching for packet loss resilience," IEEE J. Select. Areas Commun., vol. 18, no. 6, pp. 966-976, June 2000.
7. Jorn Ostermann, Jan Bormans, Peter List, Detlev Marpe, Matthias Narroschke, Fernando Pereira, Thomas Stockhammer, and Thomas Wedi, "Video coding with H.264/AVC: tools, performance, and complexity," IEEE Circuits and Systems Mag., vol. 4, pp. 7-28, First quarter 2004.
8. Pao-Chi Chang, Tien-Hsu Lee, Jhin-Bin Chen, and Ming-Kuang Tsai, "Encoder-originated error resilient schemes for H.264 video coding," 18th IPPR Conference on Computer Vision, Graphics and Image Processing, pp. 406–412, Aug. 2005.
9. A. Leontaris and P. C. Cosman, "Video compression for lossy packet networks with mode switching and a dual-frame buffer," IEEE Trans. on Image Processing, vol. 13, pp. 885-897, July 2004.
10. G. Cote and F. Kossentini, "Optimal intra coding of blocks for robust video communication over the Internet," Signal Processing: Image Commun., pp. 25-34, Sep. 1999.
A Novel Scrambling Scheme for Digital Video Encryption

Zhenyong Chen1, Zhang Xiong1, and Long Tang2

1 School of Computer Science and Engineering, Beihang University, Beijing, China, 100083
2 Department of Computer Science and Technology, Tsinghua University, Beijing, China, 100084
[email protected],
[email protected],
[email protected]

Abstract. Nowadays it is easy and fast to deliver and communicate digital video content, so security problems arise in applications such as video-on-demand, videoconferencing and video chatting systems. In this paper, a scrambling scheme employing a novel grouping strategy is proposed to encrypt compressed video data. In the new scheme, the scrambling and descrambling operations are embedded into the video encoding and decoding processes. DCT coefficients are divided into 64 groups according to their positions in the 8×8 blocks and scrambled inside each group. Besides, motion vectors are also grouped and permuted in terms of their modes. Thus the influence on the statistical properties of the video data is kept as small as possible. Experimental results show that the proposed method has low bitstream overhead and no image quality degradation. It is also shown that the security is enhanced considerably by extending the scrambling range. A content-based dynamic secret key method is also presented to withstand known-plaintext attacks.

Keywords: Video Encryption; MPEG-4; Scrambling.
1 Introduction

Distributed multimedia applications such as video-on-demand systems, video broadcast, video conferencing and videophone make multimedia security research a key issue [1]. Deliberate malicious behavior may even cause an unprotected video communication to leak the privacy of the communicators. Thus content creators and providers will be reluctant to communicate or distribute their video content if they are not assured that the content is securely protected. In this instance, methods of video encryption with conditional access act as a key technique to secure the video content. The methods to encrypt video data that have been discussed in the past years can be classified into three categories, that is, cryptographic methods, permutation methods, and hybrid or other methods. A typical cryptographic approach, called the naive algorithm, is to encrypt the entire MPEG stream using standard encryption algorithms like DES
(Data Encryption Standard) [2] and AES (Advanced Encryption Standard). The naive algorithm treats the MPEG bit-stream as ordinary data and does not use any of the special MPEG structure. Although this is supposed to be the most secure MPEG encryption algorithm that preserves the data size, it usually adds a large amount of processing overhead due to the high data rate of video. There are other cryptographic approaches, called selective algorithms [3, 4], that use the features of the MPEG layered structure. These methods choose portions of the compressed video, such as headers, I frames and I-blocks in P and B frames, and encrypt them using standard encryption algorithms. Tang [5] first presented a zig-zag permutation method for MPEG video, in which the encryption operation is embedded into the MPEG compression process: instead of mapping an 8×8 block to a 1×64 vector in zig-zag order, it uses a random permutation list to map the individual 8×8 block to a 1×64 vector. Zeng [6] extended the permutation range to segments, where each segment consists of several macroblocks/blocks; within each segment, DCT (discrete cosine transform) coefficients of the same frequency band are randomly shuffled within the band. Besides, there is another permutation method that shuffles the code words in the VLC (variable length coding) tables [7]. Qiao and Nahrstedt presented a special method named the Video Encryption Algorithm (VEA) to encrypt video data [8]. It utilizes the results of a statistical analysis of compressed video frames, encodes half of the data using DES, and has a 47% gain in terms of the number of XOR operations over DES. VEA provides efficient, real-time and secure encryption for low-resolution and low-bit-rate MPEG sequences, but it has real-time implementation problems for high-resolution and high-bitrate sequences. In this paper, we present a new scrambling scheme to encrypt video data by grouping the DCT coefficients and motion vectors, which has lower bitstream overhead and higher security than prior similar encryption schemes and introduces no image quality degradation. Furthermore, a content-based dynamic generation method for secret keys is given to increase the ability to resist known-plaintext attacks. The remainder of the paper is structured as follows. Section 2 discusses the group-based scrambling scheme for DCT coefficients in the frequency domain. Section 3 gives a scheme to shuffle the motion vectors of MPEG-4 video. Then we perform experiments and security analysis in Section 4. Finally, in Section 5, we draw some conclusions.
2 Frequency Domain Permutation

It is a natural idea to distort the visual image directly in the spatial domain. Before recovery, the distorted video appears unintelligible, and this method is employed in several video scrambling systems [9, 10]. But it significantly changes the statistical properties of the original video signal, thus making the original video very difficult to compress. Another potential drawback of scrambling video in the spatial domain is that the highly spatially- and temporally-correlated nature of the video data can be used for efficient attacks [6]. So we address the permutation method in the frequency domain.
All current international standards for video compression, namely MPEG-1, MPEG-2, ITU-T H.261, ITU-T H.263, and the baseline modes of MPEG-4 and H.264, are hybrid coding schemes. Such schemes are based on the principles of motion-compensated prediction and block-based transform coding. The transform used is often the discrete cosine transform. Fig. 1 shows a generic block diagram of a hybrid coding scheme. The video sequence to be compressed is segmented into groups of pictures (GOP). Each GOP has an intra-coded frame (I-frame) followed by some forward predictive coded frames (P-frames) and bidirectional predictive coded frames (B-frames). I frames are split into non-overlapping blocks (intra-coded blocks) of 8×8 pixels which are compressed using DCT, quantization (Q), zig-zag scan, run-level coding and entropy coding (VLC). P/B frames are subject to motion compensation by subtracting a motion-compensated prediction. The residual prediction error frames are also split into non-overlapping blocks (inter-coded blocks) of 8×8 pixels which are compressed in the same way as the blocks of intra frames. Sometimes P/B frames also contain intra-coded blocks, when better efficiency is obtained using intra-coded compression. For a given video standard and a fixed set of coding parameters, the bitrate and visual quality of the compressed video data are always fixed.
Fig. 1. Hybrid video coding scheme
In order to avoid the visual quality degradation, we shuffle DCT coefficients after the quantization stage. In other words, the scrambling module is embedded into the position between the quantization and VLC stages. In our scheme shown in Fig.2, the scrambling module contains three components, the secret key KD, the shuffling table generator STG, and the scrambler S. Using the secret key, STG generates a shuffling table for the quantized DCT coefficients. Then the scrambler permutes the quantized DCT coefficients according to the shuffling table and the remaining procedures are the same as the generic hybrid coding scheme. A corresponding reverse process to recover the quantized DCT coefficients is given in Fig.3. In 8×8 DCT transform coding, the 64 transformed coefficients are zig-zag ordered such that coefficients are arranged approximately in the order of increasing frequency. This arrangement enhances the efficiency of the run length coding because the bitstream coder uses a run-length and variable-length coding technique that generally
assigns shorter codewords to combinations of coefficient values and run lengths. Breaking the zig-zag order within a block would therefore significantly degrade the statistics relied upon by a run-length coder. So we use the new grouping strategy illustrated in Fig. 4 to shuffle the DCT coefficients. In fact, the DCT coefficients at the same frequency position have the most similar statistical properties. Based on this observation, for each signal, all the DCT coefficients in a frame are partitioned into 64 groups by frequency position and then shuffled inside each group. In Fig. 4, n is the total number of 8×8 blocks of each signal: for the luminance signal, n = w×h/8/8, and for the chrominance signals, n = w×h/16/16, where w is the frame width and h is the frame height. To some extent, short of an optimization method, this strategy can be regarded as the one that has the least influence on the statistical properties of the quantized DCT coefficients. Thus the overhead of the compressed bit-stream is considerably limited.
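A minimal Python sketch of this group-wise permutation follows. The key-seeded pseudo-random permutation used here is our own stand-in for the paper's shuffling table generator (STG), and the function names are ours.

```python
import numpy as np

def scramble_groups(coeff_blocks, key):
    """`coeff_blocks` has shape (n, 64): the quantized DCT coefficients of the
    n 8x8 blocks of one signal, each block flattened in zig-zag order. For every
    frequency position k (0..63), the n coefficients are permuted with a
    key-seeded permutation, so shuffling never crosses frequency bands."""
    rng = np.random.default_rng(key)          # stand-in for the STG
    scrambled = coeff_blocks.copy()
    perms = [rng.permutation(coeff_blocks.shape[0]) for _ in range(64)]
    for k in range(64):
        scrambled[:, k] = coeff_blocks[perms[k], k]
    return scrambled, perms

def descramble_groups(scrambled, perms):
    """Inverse operation at the decoder, given the same shuffling tables."""
    restored = scrambled.copy()
    for k in range(64):
        restored[perms[k], k] = scrambled[:, k]
    return restored
```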
Fig. 2. Scrambling of quantized DCT coefficients
Fig. 3. Recovery of scrambled quantized DCT coefficients
Fig. 4. Grouping of quantized DCT coefficients
When there is little object or scene motion, scrambling the I frames renders the following P/B frames difficult to perceive, owing to the dependency of P/B frames on I frames. In many other circumstances, however, the video content may not be highly correlated in the temporal dimension. In these cases, the intra-coded blocks and the accumulating residual energy leak partial image information. To achieve higher security, the DCT coefficients of intra-coded and non-intra-coded blocks in P/B frames need to be shuffled as well. Clearly, when the same grouping strategy is employed, these two kinds of blocks should be grouped separately from each other.
3 Motion Vector Permutation

Theoretically, visual confidentiality is complete when the I frames and both the intra-coded and inter-coded blocks of the P/B frames are all encrypted, but the bit-stream overhead then reaches a considerable level. For the inter-coded blocks of P/B frames there is an alternative scheme that encrypts their motion vectors instead, which achieves similar encryption efficiency with less bit-stream overhead. In addition, in applications with high security requirements it is necessary to scramble the motion vector information anyway. As Fig. 5 shows, and similarly to the scrambling scheme for DCT coefficients, the motion vector encryption module contains three components: the secret key KM, the shuffling table generator STG, and the scrambler S. The reverse process that recovers the scrambled motion vectors is shown in Fig. 6.
Fig. 5. Scrambling of motion vectors
Fig. 6. Recovery of scrambled motion vectors
Generally, in a hybrid video coding scheme, motion estimation and compensation are carried out on 16×16 blocks (macroblocks) as the basic unit. The estimated motion vectors and other side information are encoded together with the compressed prediction error of each macroblock, depending on the macroblock type. For example, in the MPEG-4 standard a P-frame macroblock has the three types of motion vectors listed in Table 1, namely 16×16, 16×8 and 8×8. Different macroblock types carry different numbers of motion vectors, which are differenced with respect to a prediction value and coded using variable-length codes. We therefore group the motion vectors by type, in the same way as we group the quantized DCT coefficients, and shuffle them inside each group, so that the encrypted motion vectors can be recovered correctly at the decoder side. B frames have more motion vector types, but the same strategy applies.
Table 1. Types of motion vectors of P frame

Types   Number of motion vector(s)   Has4MVForward Flag   FieldMV Flag
16×16   1                            FALSE                FALSE
16×8    2                            FALSE                TRUE
8×8     4                            TRUE                 Ignored
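A sketch of the motion vector permutation is given below. It assumes the encoder has collected the motion vectors of one frame into per-type arrays and, as in the coefficient sketch, uses a key-seeded generator in place of the STG; the decoder would regenerate the same permutations from the key and apply their inverses (for example via numpy.argsort).

```python
import numpy as np

def scramble_motion_vectors(mv_by_type: dict, key: int) -> dict:
    """mv_by_type maps a partition type ('16x16', '16x8', '8x8') to an array of
    motion vectors with shape (count, 2).  Vectors are shuffled only within
    their own type group, so the bitstream syntax stays valid and the decoder
    can restore the original order from the same key."""
    rng = np.random.default_rng(key)
    scrambled = {}
    for mv_type, vectors in mv_by_type.items():
        perm = rng.permutation(len(vectors))
        scrambled[mv_type] = vectors[perm]
    return scrambled
```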
4 Experiments and Security Analysis

4.1 Experiments

To evaluate the performance of the proposed schemes, we implemented a demo by integrating our encryption strategies into the MPEG-4 verification model coder [11] provided by ISO. Using this demo, experiments were performed that scramble I frames, P-frame motion vectors, and both together. The test video sequences, of QCIF (176×144) size, are Foreman, CarPhone, Claire, Grandma, MissAm, Salesman, Trevor and Suzie, shown in Fig. 7.
Fig. 7. Original video sequences: (a) Foreman, (b) MissAm, (c) CarPhone, (d) Grandma, (e) Salesman, (f) Claire, (g) Trevor, (h) Suzie
Fig. 8. Scrambling of quantized DCT coefficients of I frame
Fig. 9. Scrambling of P frame motion vectors
Fig. 8 gives a scrambled I-frame image of each test video sequence; the image details are not visible at all. Fig. 9 shows that the P-frame macroblocks containing motion, i.e., the inter-coded blocks, have been thrown into confusion, which helps keep the motion information secret. Table 2 shows that the bitstream overhead of I-frame scrambling ranges from 9.54% to 48.16%, with the Foreman video being the lowest. Videos with more motion show a relatively smaller bitrate increase than still videos, and the same holds for P-frame motion vector scrambling. However, videos with a high relative bitstream overhead, such as MissAm, Grandma and Claire, still have a rather high compression rate. This indicates that our encryption schemes are acceptable in terms of their influence on compression efficiency. Zeng [6] reported the bitstream overhead of the Carphone video for his own method, so we use the experimental data of the Carphone video for comparison; the results are listed in Table 3 and show that our method performs better.

Table 2. Experimental results
Video sequence   Total frames   Original length   Normally compressed length   I frame permutation overhead   I frame and P frame motion permutation overhead   Compression rate
Foreman          400            15,206,400        335,529                      9.54%                          11.27%                                            40.73
MissAm           150            5,702,400         36,642                       28.19%                         67.03%                                            93.17
Carphone         382            14,522,112        302,281                      11.73%                         16.56%                                            41.22
Grandma          870            33,073,920        264,765                      23.47%                         41.74%                                            88.13
Salesman         449            17,069,184        192,104                      11.37%                         17.04%                                            75.92
Claire           494            18,779,904        114,272                      48.16%                         84.34%                                            89.15
Trevor           150            5,702,400         90,583                       12.26%                         17.83%                                            53.43
Suzie            150            5,702,400         66,490                       13.53%                         23.20%                                            69.61
Table 3. Comparison of bit overhead of the Carphone video sequence

Scrambling mode                                                                          Zeng's method   Our method
Quantized DCT coefficients of I frames scrambled                                         19.8%           11.7%
Both quantized DCT coefficients of I frames and motion vectors of P frames scrambled     23.6%           16.6%
4.2 Security Analysis

For a 352×288 I frame, assume that nYi nonzero coefficients of the Y luminance signal, nUj nonzero coefficients of the U chrominance signal and nVk nonzero coefficients of the V chrominance signal exist at the positions indexed by i, j and k respectively. The number of trials required for an attacker to recover the original I frame is then

∏_{i=1}^{64} nYi! × ∏_{j=1}^{64} nUj! × ∏_{k=1}^{64} nVk!

where nYi ≤ 44 × 36 = 1584, nUj ≤ 22 × 18 = 396, and nVk ≤ 22 × 18 = 396. The computational cost of completely recovering the original I frame is therefore enormous. Similarly, if the attacker wants to undo the motion vector scrambling of a P frame, the number of required trials is r!·s!·t!, where r, s and t are the numbers of 16×16, 16×8 and 8×8 type motion vectors respectively. To strengthen resistance against known-plaintext attacks, a content-based dynamic secret key generation method can be employed. For example, the encryption key of the quantized DCT coefficients can be produced as follows:
KD(p) = KF(Km, Fp),  p = 1, 2, ..., P   (1)
where p is the frame index, KD(p) is the secret key of frame p, KF is the key generation function, Km is the main key used to generate the dynamic keys, Fp is the feature of frame p, and P is the total number of frames. The feature can be extracted from the content characteristics of each frame, such as the sum or average of the quantized DCT coefficients, which does not change under the shuffling operation. Thus each frame has an individual shuffling table, which makes known-plaintext attacks more difficult. In addition, key synchronization is maintained correctly and is not destroyed by frame desynchronization such as incidental frame dropping or insertion.
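One possible realization of equation (1) is sketched below, with SHA-256 standing in for the unspecified key generation function KF and the sum of the quantized coefficients used as the shuffle-invariant frame feature Fp; both choices are illustrative assumptions, not prescriptions of the scheme.

```python
import hashlib

def frame_key(main_key: bytes, frame_coeffs) -> int:
    """K_D(p) = KF(K_m, F_p): derive a per-frame key from the main key and a
    feature of the frame that the shuffling operation cannot change."""
    feature = int(sum(int(c) for c in frame_coeffs))          # sum is shuffle-invariant
    digest = hashlib.sha256(main_key +
                            feature.to_bytes(8, "big", signed=True)).digest()
    return int.from_bytes(digest[:8], "big")                  # seed for the STG
```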
5 Conclusion

To enhance security, one principle is that the encryption must scramble the image and motion information to the greatest possible extent, since attackers may exploit this information. Moreover, shuffling operations modify the statistical properties of the video data to some degree, which increases the size of the compressed video bitstream; a second principle is therefore that the scrambling operation should disturb the statistical properties of the video data as little as possible. Following these two principles, we have presented a new scrambling scheme that encrypts the video data via
employing the new grouping strategy. Experimental results show that the scrambling scheme has little influence on the size of the compressed video bitstream. In addition, a content-based dynamic secret key generation method is given to increase resistance against known-plaintext attacks. The analysis also indicates that the new scheme has rather high security. It should be emphasized that the encrypted video data keeps the standard video format, so with this method we can leave the beginning or any chosen segments of a movie in the clear to attract the audience while keeping the remainder secret.

Acknowledgments. This work is supported by a grant of the key program "Theory and Methods of Security for Image/Graphics Information on Internet" (No. 60133020) from the National Natural Science Foundation of China. The demo used in our experiments is implemented on the MPEG-4 reference source code provided by the International Organization for Standardization. In addition, using the open source XVID MPEG-4 codec, we also implemented a video encryption/decryption plug-in component for common video players. We would like to thank them all.
References
1. Lintian Qiao and Klara Nahrstedt: Comparison of MPEG encryption algorithms. International Journal of Computers and Graphics, special issue "Data Security in Image Communication and Network", Vol. 22, January 1998.
2. I. Agi and L. Gong: An empirical study of secure MPEG video transmissions. The Internet Society Symposium on Network and Distributed System Security, San Diego, CA, 1996, 137-144.
3. Y. Li, Z. Chen, S. Tan, et al.: Security enhanced MPEG player. In Proceedings of the IEEE First International Workshop on Multimedia Software Development (MMSD'96), Berlin, Germany, March 1996.
4. T. B. Maples and G. A. Spanos: Performance study of a selective encryption scheme for the security of networked, real-time video. In Proceedings of the 4th International Conference on Computer Communications and Networks, Las Vegas, Nevada, September 1995.
5. L. Tang: Methods for encrypting and decrypting MPEG video data efficiently. In Proceedings of the 4th ACM International Multimedia Conference, Boston, MA, 1996, 219-229.
6. Wenjun Zeng and Shawmin Lei: Efficient frequency domain selective scrambling of digital video. IEEE Transactions on Multimedia, 2003, 5(1): 118-129.
7. Mohan S. Kankanhalli and Teo Tian Guan: Compressed domain scrambler/descrambler for digital video. IEEE Transactions on Consumer Electronics, Vol. 48, No. 2, pp. 356-365, May 2002.
8. Lintian Qiao and Klara Nahrstedt: A new algorithm for MPEG video encryption. In Proceedings of the First International Conference on Imaging Science, Systems and Technology (CISST'97), Las Vegas, Nevada, July 1997, pp. 21-29.
9. G. L. Hobbs: Video scrambling. U.S. Patent 5 815 572, Sept. 29, 1998.
10. D. Zeidler and J. Griffin: Method and apparatus for television signal scrambling using block shuffling. U.S. Patent 5 321 748, June 14, 1994.
11. ISO: Source code of MPEG-4 encoder and decoder, publicly available standards. http://www.iso.org/iso/en/ittf/PubliclyAvailableStandards/14496-5_Compressed_directories/Visual/Natural.zip, 2004-03-20.
An Electronic Watermarking Technique for Digital Holograms in a DWT Domain
Hyun-Jun Choi1, Young-Ho Seo2, Ji-Sang Yoo1, and Dong-Wook Kim1
1 Kwangwoon University, Welgye-Dong, Nowon-Gu, Seoul 139-701, Korea
2 Hansung University, Samsung-Dong 3ga, Sungbuk-Gu, Seoul 136-792, Korea
Abstract. A digital hologram generated by computer calculation is one of the most expensive kinds of content, and its usage is expanding, so it is highly necessary to protect the ownership of digital holograms. In this paper a spatial-domain and a frequency-domain electronic watermarking scheme are proposed. The spatial-domain scheme serves only as a baseline for comparison with the frequency-domain scheme, which uses the 2-dimensional Mallat-tree discrete wavelet transform (MDWT). Both schemes showed very high imperceptibility and quite high robustness against attacks; in particular, the MDWT-domain scheme was so robust that the error ratio in the worst case was only 3%. We therefore expect it to serve as a high-performance watermarking scheme for digital holograms. The spatial-domain scheme, however, turned out to be useless when a data compression process follows the watermarking.
1 Introduction
As the digital era has advanced since the late 20th century, digital data has taken an ever larger share of data usage. This is because digital data has advantages over analog data in ease of storage, copying and editing, high immunity to noise in communication, and so on. But some of these advantages also act as disadvantages: digital data is easy to duplicate or modify illegally, and it is not easy to distinguish the original from a copy. Accordingly, various digital contents are increasingly duplicated and used illegally by many internet users. Therefore, protecting the ownership of intellectual property against such illegal duplication and modification of digital content has become an important issue. Digital watermarking embeds data inside the content for claiming such ownership and has been known as the best solution for this protection. So far, many researchers have studied this technology for valuable digital contents such as audio, images, and moving pictures. Digital watermarking is a technique in which specified data (the watermark) is concealed in a digital content such that it is almost impossible to tell whether the watermark is embedded or not (imperceptibility) when the content is actually used. Another very important property is that the embedded watermark should not be harmed or removed by malicious or
non-malicious attacks (robustness), so that the watermark can still be extracted after an attack as long as the attacked content remains useful [1]. Digital holography is a technique in which the interference pattern between the reference light wave and the object light wave is captured with a CCD camera instead of being written on a holographic film [2]. The original image can be reconstructed by loading the digital hologram on a spatial light modulator and illuminating it with the same reference light used during recording. Another way to obtain a digital hologram is to calculate the interference pattern on a computer (Computer-Generated Hologram, CGH). Computing a CGH is very time-consuming because each pixel value is affected by every source point of the object. A hologram or digital hologram is thus a relatively expensive 3-dimensional image, and researchers in many institutes worldwide are currently studying holograms and the related watermarking technology. Most existing approaches, however, are optical methods using optical elements or optical parameters [3,4]. In this paper, we propose a digital watermarking method for digital holograms that works electronically, not optically. The main idea is to use the frequency-domain characteristics of digital holograms, for which the DWT (Discrete Wavelet Transform) is used to transform them into the frequency domain. Because the existing watermarking methods are optical ones, we also set up a spatial-domain electronic watermarking method for comparison purposes. In this paper, all holograms are assumed to be digital, that is, fringe patterns captured by a CCD camera or generated by a computer.
2 Analysis of Digital Holograms
In this section, digital holograms, that is, fringe patterns, are analyzed to determine the proper locations in which to insert a digital watermark. From now on, a fringe pattern is regarded as spatial data, while the frequency-domain data is the result of transforming the fringe pattern with a frequency-domain transform, that is, the DWT. We also assume that each pixel of a fringe pattern has a grey-level value. A fringe pattern may also be colored; in that case the color components are separated into red, green, and blue images, and the image considered in this paper corresponds to one of the three components. Fig. 1 (a) shows a hologram image (200 × 200 [Pixels^2]) created by computer graphics, and Fig. 1 (b) is its fringe pattern (1,024 × 1,024 [Pixels^2]), that is, the CGH created by numerical computation from the image. In fact, a fringe pattern is itself frequency-domain data produced by a Fresnel transform, but as mentioned before, this paper treats a fringe pattern as spatial data. Fig. 2 shows the procedure used to analyze the fringe patterns in the spatial domain and in the frequency domain. In both analysis procedures, a fringe pattern is divided into several segments. From the properties of a fringe pattern, a part of a fringe pattern can reconstruct the original image with some degradation in
image size according to the size of the segment. That means the data in a segment affects the reconstruction of the whole image, so a part of a fringe pattern can be used for watermarking without losing the imperceptibility and robustness of the watermark. For the spatial analysis, each bit-plane (BP) of the fringe pattern is examined, as expressed in Fig. 2 (a); this analysis examines how strongly the data affects the quality of the reconstructed image. For the frequency-domain analysis, 2-dimensional (2D) transforms are performed by regarding a fringe pattern as 2D data. Both the Mallat-tree decomposition (MDWT) and the packetization DWT (PDWT) are considered; 2D MDWT is the standard transform in JPEG2000. The watermark embedding process might disturb the energy distribution of the target data because embedding a watermark changes data values. The change in energy distribution in turn affects some characteristics of the data, which might change the results of further processes such as signal processing or data compression. Therefore, the analysis in this paper mainly consists of examining the energy distribution of the data with the watermarking process in mind. Our watermarking scheme replaces a specific bit at a specific position with a digital watermark bit.
Fig. 1. (a) Original image (b) fringe pattern of (a)
Fig. 2. Procedure for analysis: (a) spatial-domain (b) frequency-domain
2.1 Analysis of a Fringe Pattern in Spatial-Domain
One candidate data domain for watermarking is the spatial domain. As mentioned before, the fringe pattern itself is regarded as the spatial data in this paper, and either the whole fringe pattern or one of its segments is considered as the target watermarking space. The feature analysis of a fringe pattern in the spatial domain consists of examining the effect of watermarking on a specific bit-plane (BP); here a pixel of a fringe pattern is represented with 8 bits, as shown in Fig. 3. The main purpose of the spatial-domain analysis is to determine how much a specific BP affects the image reconstructed from a fringe pattern by inverse CGH (I-CGH). We therefore examined the reconstruction obtained from each BP of a fringe pattern; example images are shown in Fig. 4. In the actual experiment, more than 100 fringe patterns were examined. As expected from experience with 2D images, BP7 (the most significant BP, MSBP) alone reconstructs the image with only a little degradation in intensity, because the weight of BP7 is almost half of the pixel value. An unexpected result is that BP6 of most fringe patterns can still reconstruct the image recognizably, even though the intensities are considerably degraded. This means that only a quarter of the intensity of a fringe pattern is enough to reconstruct the image, which might be useful for data compression of digital holograms. In most fringe patterns, a BP lower than BP6 cannot reconstruct the image, as shown in Fig. 4 (d). It means that data adjustment in BPs lower
Fig. 3. Bit-planes of a fringe pattern or its segment
Fig. 4. Reconstructed images by HoloVision: (a) original (b) with BP7 (c) with BP6 (d) with BP5
than BP6 hardly affects the reconstructed image, so those bit-planes might be good places for watermarking.
2.2 Analysis of a Fringe Pattern in Frequency-Domain by DWT
Frequency analysis by 2D DWT examines the energy distribution over the frequency bands, because the energy distribution is the most important factor in choosing watermarking positions that preserve the imperceptibility and robustness of the embedded watermark.

Mallat-tree 2D DWT. MDWT decomposes an image into monotonically increasing frequency subbands. In one level of the transform, an image is decomposed into four subbands: LL, HL, LH, and HH, where 'L' ('H') denotes a low-pass (high-pass) filtered result and the first (second) letter refers to the horizontal (vertical) direction. For example, HL is the subband obtained by high-pass filtering in the horizontal direction and low-pass filtering in the vertical direction. Once a level of the transform is completed, the next level is applied to the lowest subband, LL; thus k-level decomposition produces 3k+1 subbands. Fig. 5 shows 5-level 2D MDWT results, where (a) shows the decomposition scheme and the subband numbers and (b) is an example with the Lena 2D image. The average energy distribution of the 5-level 2D MDWT results for 100 fringe patterns is shown in Table 1, in which each energy value is the total energy of the coefficients in the corresponding subband. For equal table values, each coefficient in a subband therefore carries about 4 times the energy of a coefficient in the one-level lower subband, because the one-level lower subband contains 4 times as many coefficients. In general, a fringe pattern is known to have many more high-frequency components than an ordinary 2D image. From the experiment, however, a fringe pattern showed a similarly very high energy concentration (more than 70%) on the lowest frequency components, that is, subband 0 in this experiment.
Fig. 5. Frequency domain of 5-level MDWT: (a) scheme and subband numbers (b) transformed example
One peculiar characteristic of this experimental result is that some of the higher-frequency subbands (10, 12, 13, 15, 16, 18) also have quite high energy. These subbands lie in the vertical or diagonal direction, which means a fringe pattern has quite high energy in vertical and diagonal high-frequency components. Consequently, the lowest, the vertical, and the diagonal subbands are good candidates for watermarking locations.

Packetization 2D DWT. One modified 2D DWT method is the packetization 2D DWT (2D PDWT). The scheme is as follows. The first level of the transform is the same as in MDWT, but from the second level on the transform proceeds with a particular purpose. For example, if local energy concentration is needed, a subband whose energy is lower than a given threshold is transformed further until the resulting subband energy exceeds the threshold. In Fig. 6, all subbands except HH proceed to the second level of the transform; among the subbands produced from LL, only the HL and LH subbands are transformed at the third level, only the LL, HL and LH subbands at the fourth level, and so on.

Table 1. Average energy distribution after 5-level 2D MDWT

Subband   Average energy   Ratio (%)
0         16199.634        72.677
1         0.006            0.000
2         0.022            0.000
3         0.077            0.000
4         0.051            0.000
5         0.024            0.000
6         0.104            0.000
7         0.214            0.001
8         0.283            0.001
9         1.559            0.007
10        2.208            0.010
11        0.883            0.004
12        12.514           0.056
13        224.675          1.008
14        6.705            0.030
15        567.815          2.547
16        2185.485         9.805
17        19.361           0.087
18        3068.226         13.765
Total     22289.845        100.000
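An analysis of this kind can be reproduced along the following lines; the wavelet choice ('haar'), the per-coefficient energy definition, and the use of the PyWavelets package are assumptions made for the sketch, not details fixed by the paper.

```python
import numpy as np
import pywt

def subband_energy(fringe: np.ndarray, levels: int = 5, wavelet: str = "haar"):
    """Decompose a fringe pattern with a Mallat-tree 2D DWT and report the
    energy and energy share of each of the 3*levels + 1 subbands."""
    coeffs = pywt.wavedec2(fringe.astype(float), wavelet, level=levels)
    bands = [(f"LL{levels}", coeffs[0])]
    for i, (ch, cv, cd) in enumerate(coeffs[1:], start=1):
        lvl = levels - i + 1                       # coarsest detail level first
        # horizontal, vertical and diagonal detail subbands at this level
        bands += [(f"H{lvl}", ch), (f"V{lvl}", cv), (f"D{lvl}", cd)]
    energies = {name: float(np.mean(np.square(c))) for name, c in bands}
    total = sum(energies.values())
    return {name: (e, 100.0 * e / total) for name, e in energies.items()}
```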
However, this scheme is not easy to apply to the examination of digital holograms, because each hologram may have different frequency characteristics. In this paper we therefore generalize 2D PDWT as in Fig. 7: only two levels of 2D DWT are performed, and each level-1 subband is transformed for exactly one more level, without any further transform, so all subbands are the results of a two-level 2D DWT. The average energy distribution is shown in Table 2. Unlike MDWT, PDWT shows a rather random energy distribution: subband 0 has the highest energy, but subband 14 also has very high energy, and subbands 5, 6, 7 and 15 retain quite high energy. PDWT therefore does not look like a good transform for finding watermarking locations from the energy distribution, and consequently we use only MDWT to find the watermark positions.
Fig. 6. Frequency domain of 2D PDWT: (a) scheme and subband numbers (b) transformed example

Table 2. Energy distribution and ratio of fringe pattern

Subband   Average energy   Ratio (%)
0         16201.034        38.357
1         224.675          0.532
2         6.705            0.016
3         567.815          1.344
4         625.367          1.481
5         1533.207         3.630
6         3743.395         8.863
7         3813.564         9.029
8         0.268            0.001
9         5.734            0.014
10        13.869           0.033
11        274.797          0.651
12        133.677          0.316
13        54.296           0.129
14        10972.849        25.979
15        4066.018         9.627
Total     42237.271        100.000
3 Electronic Watermarking Method on Digital Holograms
In this section, we propose an electronic watermarking method for digital holograms. To find the watermarking positions and embed the watermark we use the 2D MDWT, but we also propose a watermarking method in the spatial domain for comparison with the frequency-domain method. For watermarking, we assume that the watermark data to be embedded is binary; specifically we use a binary image of size 32 × 32 [Pixels^2]. This assumption is not restrictive, because the target data to be watermarked is also digital, and non-binary or analog data can be converted into a binary bit stream.
3.1 Watermarking in Spatial-Domain
Fig. 7. Watermarking procedure in the spatial domain

In general, watermarking in the spatial domain is applied to regions with large high-frequency components in order to preserve the imperceptibility of the watermark. Unlike a general 2D image, a fringe pattern retains high-frequency components all over the image, as examined experimentally in the previous section, so electronic watermarking of a fringe pattern can be done in any region of the fringe pattern. In this paper, we embed the watermark data at random positions over the whole fringe pattern, as shown in Fig. 7. To select the watermarking places, we used a 32-bit linear feedback shift register (LFSR) whose feedback polynomial is

P(x) = x^32 + x^22 + x^2 + 1   (1)
From the LFSR, two groups of parallel outputs are taken and used as the horizontal and vertical coordinates. As can be seen in Fig. 7, the watermark data replaces the 4th bit of the pixel indicated by the two groups of parallel data from the LFSR; the 4th bit was chosen in view of the experimental results of the previous section.
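A minimal sketch of the spatial-domain embedding is shown below. The Fibonacci-style tap interpretation of P(x) and the way the two coordinate words are drawn from the register state are assumptions; the paper fixes only the feedback polynomial and the use of the 4th bit.

```python
import numpy as np

class LFSR32:
    """32-bit LFSR for P(x) = x^32 + x^22 + x^2 + 1 (Eq. 1), Fibonacci style."""
    TAPS = (31, 21, 1)                          # bit indices for x^32, x^22, x^2

    def __init__(self, seed: int):
        self.state = (seed & 0xFFFFFFFF) or 1   # avoid the all-zero state

    def step(self) -> int:
        fb = 0
        for t in self.TAPS:
            fb ^= (self.state >> t) & 1
        self.state = ((self.state << 1) | fb) & 0xFFFFFFFF
        return self.state

def embed_spatial(fringe: np.ndarray, watermark_bits, seed: int) -> np.ndarray:
    """Replace the 4th bit (bit index 3) of pseudo-randomly chosen fringe pixels."""
    h, w = fringe.shape
    lfsr, marked = LFSR32(seed), fringe.copy()
    for bit in watermark_bits:
        s = lfsr.step()
        r, c = (s >> 16) % h, (s & 0xFFFF) % w  # two parallel output groups
        marked[r, c] = (marked[r, c] & 0xF7) | (int(bit) << 3)
    return marked
```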
3.2 Watermarking in MDWT-Domain
For the frequency domain, we considered only MDWT, based on the previous experiments. Owing to the properties of a fringe pattern, the lowest subband after DWT is less sensitive to spatial data changes than the others, as the experimental results showed. Also, because the energy is concentrated in this subband, most coefficients have very high positive values. We therefore chose the lowest subband as the watermarking location. The watermarking scheme in the DWT domain is shown in Fig. 8. Because the size of a fringe pattern is 1,024 × 1,024 [Pixels^2] and the size of the watermark is 32 × 32 [Pixels^2], a 5-level 2D MDWT is performed, which makes the lowest subband the same size as the watermark. For DWT-domain watermarking we selected a bit-plane replacement method: the 3rd lowest bit-plane of the lowest subband is replaced with the whole watermark image data, and the result is inversely 2D MDWT-transformed to reconstruct the fringe pattern.
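The MDWT-domain embedding can be sketched as follows, again with PyWavelets; rounding the lowest-subband coefficients to integers before replacing their 3rd lowest bit-plane is an implementation assumption made for the sketch.

```python
import numpy as np
import pywt

def embed_mdwt(fringe: np.ndarray, watermark: np.ndarray, wavelet: str = "haar"):
    """5-level 2D MDWT of a 1024x1024 fringe pattern gives a 32x32 lowest
    subband; its 3rd lowest bit-plane (bit index 2) is replaced by the 32x32
    binary watermark, then the fringe pattern is rebuilt by the inverse DWT."""
    coeffs = pywt.wavedec2(fringe.astype(float), wavelet, level=5)
    low = np.rint(coeffs[0]).astype(np.int64)            # 32x32 lowest subband
    low = (low & ~np.int64(0x04)) | (watermark.astype(np.int64) << 2)
    coeffs[0] = low.astype(float)
    return pywt.waverec2(coeffs, wavelet)
```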
4 Experimental Results
To test the proposed watermarking scheme in the previous section, more than 100 fringe patterns were watermarked, attacked, and reconstructed. As mentioned
Fig. 8. Watermarking scheme in the MDWT domain
before, the size of a fringe pattern is 1,024 × 1,024 [Pixels^2] and the size of the watermark is 32 × 32 [Pixels^2]. To assess the robustness and imperceptibility of the schemes, we considered four kinds of attacks applied with Adobe Photoshop: JPEG compression, Gaussian noise addition, blurring, and sharpening.
Fig. 9. Watermarking example for rabbit image; fringe pattern of (a) before, (b) after watermarking in spatial domain, (c) after watermarking in DWT domain; holographic image of (d) before, (e) after watermarking in spatial domain, (f) after watermarking in DWT domain
Fig. 9 shows examples of watermarked fringe patterns and the corresponding reconstructed holographic images for the rabbit image. As can be seen, the watermarked fringe patterns and holographic images are indistinguishable from the original ones, which means the visual perceptibility of each scheme is very low. Table 3 shows the normalized correlation (NC) values of the watermarked fringe patterns and their reconstructed holographic images with respect to the original ones; in all cases the NC values are high enough, as expected. Table 4 lists the average error ratios of the extracted watermarks after the four kinds of attacks. For JPEG compression, quality factors down to 0 are used. For Gaussian noise, up to 10% is added, because an image reconstructed from a fringe pattern with more than 10% Gaussian noise is so degraded that it is no longer commercially usable.
In all the experimental cases, the MDWT-domain scheme performed better than the spatial-domain scheme. The worst case for the MDWT-domain scheme was the Gaussian noise attack, but the error rate was only 3% even at the strongest attack of 10% noise. For the spatial-domain scheme, JPEG compression was the worst attack: at JPEG quality 0, which corresponds to a compression ratio of about 20:1, the error rate was more than 19%. This means the spatial-domain scheme is very weak against data compression and is not useful when compression is involved.

Table 3. Average NC values of fringe pattern and holographic image

Domain    Fringe pattern   Holographic image
Spatial   0.99997          0.99962
MDWT      0.99996          0.99989
Table 4. Experimental results of watermarking in the spatial domain and the MDWT domain

Attack (Adobe Photoshop)         Error rate (%), Spatial   Error rate (%), DWT
JPEG quality 6                   0                         0
JPEG quality 4                   0.2                       0
JPEG quality 2                   14.1                      0
JPEG quality 0                   19.4                      0
Gaussian noise addition (5%)     1.5                       0
Gaussian noise addition (10%)    12.5                      3.0
Sharpening                       0                         0
Blurring                         0.4                       0
For reference, Fig. 10 shows some examples of watermarks extracted after attacks. Only the cases with error ratios of 0.2%, 1.5%, 5.5%, 12.5%, and 19.4% are included, but the other cases can be inferred from these results. As the figure suggests, an extracted watermark with an error ratio of less than 10% is still useful for the protection of ownership.
Fig. 10. Extracted watermarks: error ratio (a) 0.2, (b) 1.5, (c) 12.5, (d) 19.4
5 Conclusions
In this paper, we proposed a frequency-domain and a spatial-domain electronic watermarking scheme for digital holograms, where the spatial-domain scheme served only for comparison. We regarded the fringe patterns as the spatial-domain data and the results of the Mallat-tree 2D DWT as the frequency-domain data. The spatial-domain scheme disperses the watermark data throughout the fringe pattern, while the DWT-domain scheme embeds the watermark by replacing a specific bit-plane of the lowest subband with the watermark data plane. Experimental results indicated that both schemes are highly imperceptible in the fringe pattern and in the reconstructed holographic image, and that they are quite robust against most of the attacks, with the lowest robustness generally against Gaussian noise addition. The spatial-domain scheme, however, was very weak against data compression attacks. In all cases the spatial-domain scheme was worse than the DWT-domain scheme and proved useless when data compression is involved. Thus, the DWT-domain watermarking scheme is expected to be an effective means of protecting the ownership of digital holograms, and we expect the contents of this paper to serve as a basis for further research on electronic watermarking of digital holograms.
Acknowledgement This work was supported by grant No. (R01-2006-000-10199-0) from the Basic Research Program of the Korea Science & Engineering Foundation.
References
1. Ingemar J. C., Matthew L. M., Jeffrey A. B.: Digital Watermarking. Morgan Kaufmann Publishers, San Francisco, CA (2002)
2. Ulf S., Werner J.: Digital Holography. Springer, Berlin, Germany (2005)
3. Kishk S., Bahram J.: 3D Object Watermarking by a 3D Hidden Object. OPTICS EXPRESS, Vol. 11, No. 8 (2003) 874-888
4. Hyun K., Yeon L.: Optimal Watermarking of Digital Hologram of 3-D Object. OPTICS EXPRESS, Vol. 13, No. 8 (2005) 2881-2886
Off-Line Signature Verification Based on Directional Gradient Spectrum and a Fuzzy Classifier
Young Woon Woo, Soowhan Han, and Kyung Shik Jang
Department of Multimedia Engineering, Dong-Eui University, San 24, Gaya-Dong, Pusanjin-Gu, Pusan, 614-714, Korea
[email protected] Abstract. In this paper, a method for off-line signature verification based on spectral analysis of directional gradient density function and a weighted fuzzy classifier is proposed. The well defined outline of an incoming signature image is extracted in a preprocessing stage which includes noise reduction, automatic thresholding, image restoration and erosion process. The directional gradient density function derived from extracted signature outline is highly related to the overall shape of signature image, and thus its frequency spectrum is used as a feature set. With this spectral feature set, having a property to be invariant in size, shift, and rotation, a weighted fuzzy classifier is evaluated for the verification of freehand and random forgeries. Experiments show that less than 5% averaged error rate can be achieved on a database of 500 samples including signature images written by Korean letters as well.
1 Introduction
Handwritten signature verification is concerned with determining whether a particular signature truly belongs to a person, so that forgeries can be detected. Most of us are familiar with the process of verifying a signature for identification, especially in legal, banking, and other high-security environments. Verification can be either on-line or off-line, which is distinguished by the data acquisition method [1]. In an on-line system, signature traces are acquired in real time with digitizing tablets, instrumented pens, or other specialized hardware during the signing process. In an off-line system, signature images are acquired with scanners or cameras after the complete signatures have been written. There have been over a dozen prior research efforts, summarized in [2]. However, most prior work on handwriting has used real-time input, i.e., on-line systems [3][4][5]. Relatively few methods have focused on off-line signature verification. In an off-line system, the image of a signature written on paper is obtained with a camera or a scanner, and dynamic information is obviously not available. Since the volume of available information is smaller, signature analysis with off-line techniques is relatively more difficult. To solve the off-line signature verification problem, elastic image matching techniques [6], an extended shadow-coding method [7], and a 2-D FFT (Fast
Fourier Transform) spectral method [8] were presented in the past. For the case in which the actual (or true) signature and the forgery are very similar, Ammar et al. introduced an effective approach based on pressure features of the signature image [9]. More recent efforts tend to use a neural network or fuzzy classifier with the directional probability of the gradient of the signature image [10][11], or with geometric features for the detection of random and freehand forgeries [12]; for the detection of skilled forgeries, Hidden Markov Models (HMM) have been applied successfully [13]. On reviewing the literature it was realized that a direct comparison of results from different researchers is often impossible, owing to factors such as the data sets used, field conditions, training and test data sizes, and the way in which the issue of forgery was handled [1][2]. Therefore, the main focus of this paper is to introduce a new technique for off-line signature verification, and it is compared only with our previous work [10]. In this study, global features based on the FFT spectrum of the directional gradient density function, which captures the overall shape information of a signature image, and a weighted fuzzy classifier are proposed. The feature set extracted from the directional gradient density function was developed earlier and is widely used in off-line signature verification systems [10][11]; it is easy to extract and captures the overall shape of the signature image relatively well. However, the directional density function in [10][11] was derived from the entire signature image and is easily affected by image rotation. In this paper, the directional density function is therefore extracted from only the outline of the signature image, where the overall shape information is concentrated, and its FFT spectrum is used as the feature vector for rotation invariance and data reduction. Our verification process is summarized in Fig. 1. The design of a complete AHSVS (Automatic Handwritten Signature Verification System) able to cope with all classes of forgeries (random, freehand,
Fig. 1. Overall processing steps for the proposed freehand or random forgery detection system
and traced [1]) is a very difficult task because of the computational resources and algorithmic complexity involved. A better solution might be to subdivide the decision process so as to quickly eliminate gross forgeries such as random or freehand forgeries. In this study, we focus on the construction of the first stage of the verification system in a complete AHSVS. Thus, only freehand forgeries, which are written in the forger's own handwriting style without knowledge of the appearance of the genuine signature, and random forgeries, in which the forger uses his or her own signature instead of the genuine one, are considered.
2 Signature Image Segmentation and Feature Measurement
Signature image segmentation: The extraction of a signature image from the noisy background is done as follows. The first step is to apply a lowpass filter, shown in equation (1), to the scanned image for noise reduction:

p'(i, j) = (1/9) Σ_{l=i−1}^{i+1} Σ_{k=j−1}^{j+1} p(l, k),  1 ≤ i ≤ m, 1 ≤ j ≤ n   (1)

where p(l, k) is the original image, p'(i, j) is the averaged image, and m by n is the size of the image (200 by 925). Next, the threshold value THD is automatically selected from the averaged image, based on a simple iterative algorithm proposed by Ridler et al. [14], and the fine signature image is restored as shown in equation (2):

p''(i, j) = p(i, j) if p'(i, j) > THD,  otherwise p''(i, j) = 0   (2)

where p''(i, j) is the restored image. Third, the outline of the signature is extracted from the restored image p''(i, j) by an erosion process. If p''(i, j) > 0 and its 8-neighbor count is below 8, then p''(i, j) must lie on the outline of the signature image, as expressed in equation (3):

p̂(i, j) = p''(i, j) if p''(i, j) > 0 and Np < 8,  otherwise p̂(i, j) = 0   (3)
where Np is the count of 8-neighbor pixels that are greater than zero. A sample signature image restored from a noisy background and its outline is shown in Fig. 2. Feature Measurement: The input to the weighted fuzzy mean classifier for the verification process is the FFT spectral feature vector of the directional gradient density function extracted from only the outline of the signature image. It depends on the overall shape of the signature image and is therefore assumed to carry enough information for the detection of freehand or random forgeries. In the gradient computation process, the Sobel 3 by 3 mask shown in Fig. 3 is convolved with each pixel of the restored image if and only if that pixel is on the outline of the signature image, and the amplitude and angular orientation of the gradient vector are computed by equations (4) and (5).
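A compact sketch of this preprocessing chain (Eqs. 1-3) is given below; the SciPy filtering calls and the convergence tolerance of the iterative threshold are implementation choices added for the sketch, not details fixed by the paper.

```python
import numpy as np
from scipy.ndimage import uniform_filter, convolve

def preprocess(signature: np.ndarray):
    """Lowpass filtering (Eq. 1), Ridler-style iterative thresholding,
    restoration (Eq. 2) and outline extraction (Eq. 3)."""
    p = signature.astype(float)
    p_avg = uniform_filter(p, size=3)                  # 3x3 mean filter

    thd = p_avg.mean()                                 # iterative threshold [14]
    for _ in range(100):
        low, high = p_avg[p_avg <= thd], p_avg[p_avg > thd]
        if low.size == 0 or high.size == 0:
            break
        new_thd = 0.5 * (low.mean() + high.mean())
        if abs(new_thd - thd) < 0.5:
            break
        thd = new_thd

    restored = np.where(p_avg > thd, p, 0.0)           # Eq. (2)

    neighbors = convolve((restored > 0).astype(int),   # count of nonzero 8-neighbors
                         np.array([[1, 1, 1], [1, 0, 1], [1, 1, 1]]),
                         mode="constant")
    outline = np.where((restored > 0) & (neighbors < 8), restored, 0.0)   # Eq. (3)
    return p_avg, restored, outline
```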
Fig. 2. A sample signature restored from noisy background and its outline
Fig. 3. Sobel 3 by 3 gradient mask
Gr(i, j) = Sr(i, j) ⊗ p''(i, j),  Gc(i, j) = Sc(i, j) ⊗ p''(i, j),  G(i, j) = sqrt(Gr(i, j)^2 + Gc(i, j)^2)   (4)

θ(i, j) = (tan^{-1}(Gc/Gr) + π/2) × (128/π),  if and only if p̂(i, j) > 0   (5)

where p̂(i, j) is a pixel on the outline of the signature image and p''(i, j) is a pixel of the restored image. The multiplication by 128/π in equation (5) maps the angular orientation θ(i, j) to the range 0 to 127 for the FFT. Next, the directional gradient density function of the outline pixels, DF(θk), is derived by equation (6), and its normalized form NF(θk) by equation (7):

DF(θk) = Σ_{i=1}^{200} Σ_{j=1}^{925} X(θk, i, j)   (6)

where X(θk, i, j) = G(i, j) if θ(i, j) = θk, with G and θ given by equations (4) and (5), and k = 0, 1, ..., 127.

NF(θk) = DF(θk) / Σ_{θk=0}^{127} DF(θk)   (7)

Finally, as a means of data reduction and feature selection, a 128-point FFT is applied to the normalized directional gradient density, as shown in equation (8):

F(m) = | Σ_{θk=0}^{127} NF(θk) exp(−2πjθk m/128) |   (8)
where m = 0, 1, 2, ..., 127. The D.C. component F(0) is always zero because the mean value of NF(θk) is removed before the FFT. In this study, the first 15 spectral components excluding F(0) are used as the feature vector input to the weighted fuzzy classifier for verification. The feature vector formed by F(1), F(2), ..., F(15) is invariant to size, shift, and rotation. Fig. 4 shows some samples of genuine and forged signatures written in Korean letters together with their feature vectors. The FFT spectral feature vectors extracted from the two genuine signatures, (a) and (b) in Fig. 4, are very similar to each other even though one of them is scaled and rotated, but they differ from the feature vectors extracted from the two freehand forgeries written by two different persons, (c) and (d) in Fig. 4.
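The feature extraction of equations (4)-(8) can be sketched as follows; the Sobel mask orientation and the SciPy convolution are assumptions of the sketch.

```python
import numpy as np
from scipy.ndimage import convolve

SOBEL_R = np.array([[-1, -2, -1], [0, 0, 0], [1, 2, 1]], dtype=float)  # assumed orientation
SOBEL_C = SOBEL_R.T

def fft_gradient_features(restored: np.ndarray, outline: np.ndarray) -> np.ndarray:
    """Directional gradient density on outline pixels (Eqs. 4-7) and its FFT
    spectrum (Eq. 8); returns |F(1)|..|F(15)| as the 15-D feature vector."""
    gr = convolve(restored, SOBEL_R, mode="constant")
    gc = convolve(restored, SOBEL_C, mode="constant")
    mag = np.hypot(gr, gc)                                               # Eq. (4)
    theta = (np.arctan(gc / (gr + 1e-12)) + np.pi / 2) * 128.0 / np.pi   # Eq. (5)
    bins = np.clip(theta.astype(int), 0, 127)

    mask = outline > 0
    df = np.bincount(bins[mask], weights=mag[mask], minlength=128)       # Eq. (6)
    nf = df / df.sum()                                                   # Eq. (7)
    spectrum = np.abs(np.fft.fft(nf - nf.mean()))                        # Eq. (8), mean removed
    return spectrum[1:16]                                                # |F(1)|..|F(15)|
```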
Fig. 4. Samples of genuine and forged signatures written by Korean letters and their FFT spectral feature vectors
3 A Weighted Fuzzy Classifier

The construction of a fuzzy classifier depends on the type of fuzzy membership function and on how the fuzzy mean value is calculated [15]. The triangular membership function has a simple configuration and is easy to apply when only one reference feature set per target pattern is used, as in our experiments. The classifier used in this paper therefore combines a triangular fuzzy membership function with a weighted fuzzy mean method, which uses the variances of the individual feature dimensions of the reference feature set as the weights ωi. This is shown in equation (9):

hw(μ1(x1), μ2(x2), ..., μn(xn); ω1, ω2, ..., ωn) = Σ_{i=1}^{n} μi(xi)·ωi,  (Σ_{i=1}^{n} ωi = 1)   (9)

where hw is the weighted fuzzy mean value, μi is the membership grade obtained from a triangular membership function, ωi is the weight of the i-th feature value xi, and n is the dimension of the incoming feature vector (15 in this paper). This type of classifier does not require a training stage, whereas a classifier based on neural network algorithms does, and the performance of a neural classifier depends strongly on its architecture and learning algorithm. In this fuzzy classifier, the triangular fuzzy membership functions of the fifteen feature dimensions and the variances used as their weights are simply constructed and computed from the reference feature set, and are then used for the verification of a signature image without any further processing. The evaluation process is therefore much simpler than that of a conventional neural network classifier. For an incoming test signature, the fifteen membership grades are computed from the pre-established triangular membership functions, and the weighted fuzzy mean value is derived by equation (9). If it is greater than a threshold value, the signature is verified as genuine; if not, it is discarded as a forgery. More details about this fuzzy classifier based on a triangular membership function and a weighted fuzzy mean value can be found in our previous work [10].

4 Experiments and Performance Assessment

In our experimental process, a total of 500 samples, including signature images written in Korean letters, were collected and verified by both the proposed method and our previous algorithm [10], in which the twelve-dimensional directional density extracted from the entire signature image had been used as the feature vector. The test images belong to five sets of different signatures. Signature data collection for each set was done as follows. One of five different writers was chosen as a target and asked to write his own name twenty times on an A4 page, and four of the remaining writers were assigned to be forgers. Each forger was asked to write the targeted name twenty times in his own handwriting on an A4 page. The forgers were not allowed to study samples of the original signature, because this study focuses only on freehand or random forgery detection, not skilled forgery detection. Each set comprises one A4 page of genuine signatures and four A4 pages of freehand forgeries. They were scanned, one page at a time, at a resolution of 300 dpi, 8-bit gray-scale, and each signature was stored in a 200 by 925 pixel matrix. Thus each set contains 20 genuine signatures and 80 freehand forgeries. By using each of the five writers' own signatures as a target, five sets of different signature classes were made. Some samples of data sets 1 to 5 are shown in Fig. 5; those of data sets 2 and 5 are written in Korean letters.

Fig. 5. Sample signatures in data set 1 to 5

Signature verification with a neural classifier needs a variety of forged signatures to train the classifier for high performance [16]. However, in a real-world environment only a few forged signature samples are available. In this study, a fuzzy mean classifier that requires no knowledge of forged signatures is presented to decide whether an incoming signature is genuine or forged. The construction of the fuzzy classifier and the verification process were done as follows. First, the reference feature set is constructed only with randomly selected genuine signature samples, as shown in equation (10), and the weights of the fifteen feature dimensions are derived by equation (11):

rf(xi) = (1/nm) Σ_{j=1}^{nm} fj(xi),  i = 1, 2, ..., 15   (10)

where nm is the number of selected genuine signature samples, fj(xi) is the i-th feature value of the feature set fj extracted from signature sample j, and rf(xi) is the i-th feature value of the reference feature set. For the weights, the normalized variances of the fifteen feature dimensions, vr1, vr2, ..., vr15, are derived as in our previous work [10]. If the normalized variance of the i-th feature values extracted from the selected signature samples, vri, is the p-th largest value among [vr1, vr2, ..., vr15], then the weight of the i-th feature value is defined as

ωi = a × ((16 − p)-th largest value in [vr1, vr2, ..., vr15])   (11)

where a is a small constant. By equation (11), a feature with a smaller variance receives a larger weight; that is, a feature whose values do not change significantly between genuine signature samples is weighted more heavily in the verification process. After this stage, the verifier performance is evaluated. For an incoming signature image, the membership grades of its feature values are derived by equations (12)-(14):

μi(xi) = (xi − rf(xi)) / rf(xi) + 1,  if xi < rf(xi)   (12)
μi(xi) = −(xi − rf(xi)) / rf(xi) + 1,  if xi ≥ rf(xi)   (13)
μi(xi) = μi(xi) if μi(xi) ≥ 0,  μi(xi) = 0 if μi(xi) < 0   (14)

where xi is the i-th feature value of the input signature image, rf(xi) is the i-th feature value of the reference feature set, and μi(xi) is the membership grade of xi. All membership functions are configured as the triangular type shown in Fig. 6.

Fig. 6. A fuzzy membership grade, μ1(x1), for a signature sample shown in Fig. 4(a)

Next, the weighted fuzzy mean value is derived by equation (15) and the signature is verified by equation (16):

h(μ1(x1), μ2(x2), ..., μ15(x15); ω1, ω2, ..., ω15) = Σ_{i=1}^{15} μi(xi)·ωi   (15)

where h is the weighted fuzzy mean value of the incoming signature image, μi is the membership grade of the i-th feature value, and ωi is the weight given in equation (11).

h ≥ Th: accepted as a genuine signature;  h < Th: rejected as a forgery   (16)
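A sketch of the reference-set construction (Eqs. 10-11) and the verification rule (Eqs. 12-16) is given below; normalizing the weights so that they sum to one, as equation (9) requires, and the small epsilon guard are assumptions added for the sketch.

```python
import numpy as np

def build_reference(genuine_feats: np.ndarray, a: float = 0.01):
    """Reference feature set (Eq. 10) and variance-based weights (Eq. 11)
    from a few genuine samples; genuine_feats has shape (nm, 15)."""
    rf = genuine_feats.mean(axis=0)
    nvar = genuine_feats.var(axis=0)
    nvar = nvar / nvar.sum()
    order = np.argsort(nvar)                     # ascending normalized variance
    weights = np.empty_like(nvar)
    weights[order] = a * np.sort(nvar)[::-1]     # smallest variance gets largest weight
    weights /= weights.sum()                     # so the weights sum to 1 (Eq. 9)
    return rf, weights

def verify(x: np.ndarray, rf: np.ndarray, weights: np.ndarray, thresh: float) -> bool:
    """Triangular membership grades (Eqs. 12-14), weighted fuzzy mean (Eq. 15),
    and the accept/reject decision (Eq. 16)."""
    mu = 1.0 - np.abs(x - rf) / (np.abs(rf) + 1e-12)   # triangular, centred at rf
    mu = np.clip(mu, 0.0, 1.0)                         # Eq. (14)
    h = float(np.dot(mu, weights))                     # Eq. (15)
    return h >= thresh
```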
σ, then S2 will be assigned to the new shot; if P(S3=1|CE) > σ, then S3 will be assigned to the new shot, and so on. We obtain the final annotation set of the shot by dynamic Bayesian inference. The procedure is similar to that proposed by Huang [9], as follows.
Step 1. The directed graph of the Bayesian network is moralized and triangulated into a chordal graph.
Step 2. A join tree (JT) is built from the chordal graph, consisting of clique nodes and separator sets (abbreviated as sepsets).
Step 3. The belief potential of each clique and sepset is initialized according to the conditional probabilities of the nodes in the Bayesian network.
Step 4. The final annotation set is selected from the SCS as follows.
NE = {S1}; CE = ∅; SCS = {S1, ..., SN}
While (NE is not empty) do {
  Input NE into the JT and modify the potentials in the JT;
  Perform global propagation to make the potentials of the JT locally consistent;
  CE = CE ∪ NE; NE = ∅;
  for (each concept Si in SCS) do
    if ((Si ∉ CE) and (P(Si=1|CE) > σ)) then {
      NE = NE ∪ {Si};
      SP[f][Si] = P(Si=1|CE);
    }
}
where CE is the final annotation set (FAS) of a shot after the procedure stops, and SP[f][Si] stores the probability of annotating frame f with concept Si.
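Step 4 can be restated compactly as the loop below; the join-tree construction and global propagation of Steps 1-3 are abstracted behind a hypothetical posterior(s, ce) callback that returns P(S=1|CE), which is an assumption of the sketch.

```python
def annotate(scs, s1, posterior, sigma):
    """Grow the annotation set CE from the first concept S1, repeatedly adding
    every concept whose posterior given the current evidence exceeds sigma."""
    ce, ne = set(), {s1}
    probs = {}
    while ne:
        ce |= ne                      # commit newly accepted concepts as evidence
        ne = set()
        for s in scs:
            if s not in ce:
                p = posterior(s, ce)  # P(S=1 | CE) after global propagation
                if p > sigma:
                    ne.add(s)
                    probs[s] = p      # kept for ranked retrieval (SP[f][S])
    return ce, probs
```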
5 Ranked Retrieval Based on Annotation

The task of video retrieval is similar to the general ad-hoc retrieval problem. We are given a text query Q = {w1, w2, ..., wk} and a collection V of key frames of videos; the goal is to retrieve the video key frames in V that contain the concept set Q. A simple approach is to annotate each key frame in V with a small number of concepts using the techniques proposed in sections 2, 3 and 4; we could then index the annotations and perform text retrieval in the usual manner. This approach is straightforward, but it has the disadvantage of not allowing ranked retrieval, owing to the binary nature of concept occurrence in automatic annotations: a concept either is or is not assigned to a key frame. When the annotations of many retrieved frames contain the same number of concepts, document-length normalization cannot differentiate between these key frames, so all frames containing the same number of concepts are likely to receive the same score. Probabilistic annotation can be used to rank the relevant key frames (here 'relevant key frames' are those that contain all query concepts in their ground-truth annotation). A technique for assigning a probability P(S=1|CE) to every concept S in the annotation was developed in section 4, and a key frame is scored by the probability that the query would be observed. Given the query Q = {w1, w2, ..., wk} and a frame f ∈ V, the probability of observing Q in frame f is

P(Q|f) = ∏_{j=1}^{k} P(wj|f)   (7)
where P(wj|f) has already been computed and stored in SP[f][wj] in section 4. All retrieved frames are ranked by P(Q|f) from large to small.
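The ranked retrieval of equation (7) then reduces to a product of the stored probabilities; in the sketch below, frame_probs plays the role of the SP[f][w] table and a concept missing from a frame's annotation is treated as probability 0, which is an assumption of the sketch.

```python
def rank_frames(query, frame_probs):
    """Score each key frame by P(Q|f) = prod_j P(w_j|f) (Eq. 7) and return the
    frames sorted from the most to the least likely to contain the query."""
    scores = {}
    for f, sp in frame_probs.items():
        score = 1.0
        for w in query:
            score *= sp.get(w, 0.0)   # unannotated concept contributes 0
        scores[f] = score
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```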
6 Experimental Results

We have chosen videos of different genres including landscape, city and animal from the website www.open-video.org to create a database of a few hours of videos. Data from 35 video clips has been used for the experiments.
Table 1. The presence frequency of each concept in training set (%)

concept     percentage   concept     percentage
car         3.08         cloud       13.31
road        6.84         sky         44.87
bridge      2.66         snow        6.84
building    19.39        mountain    14.45
waterfall   1.14         greenery    32.7
water       42.97        land        25.48
boat        6.08         animal      25.48
First, we partition every video clip into several shots and manually annotate the shots. The key frames of each shot are extracted automatically to form the sample set. Perceptual features such as the HSV accumulated histogram and the edge histogram are extracted automatically from the sample set and stored in a video database after normalization. In our experiments, there are 544 key frames for training and 263 key frames for testing. Fourteen different concepts are extracted automatically from the training set, as shown in Table 1.

6.1 Results: Automatic Videos Annotation
In this section we evaluate the performance of our method on the task of automatic video annotation. We are given an unannotated key frame f and are asked to automatically produce an annotation, which is then compared to the manual ground-truth annotation (GT). In [6], several classifiers were evaluated, including k-nearest neighbors, a one-layer neural network, and a mixture of experts, of which k-nearest neighbors was shown to outperform the rest. We therefore compare our method of selecting the semantic candidate set, described in section 2, with the K Nearest Neighbor (KNN) and Naïve Bayesian (NB) classifiers. The four most probable concepts are likewise chosen for KNN and NB to build their SCSs. Three measures are used to evaluate the performance of the three methods:

S1_first = Ncorrect_first / Nfirst   (8)

precision = Ncorrect / Nlabel   (9)

recall = Ncorrect / Nground_truth   (10)
where Nfirst is the number of samples with concept S in the first position in GT, Ncorrect_first is the number of samples whose first concept in the SCS is the same as that in GT (i.e., S), Ncorrect is the number of samples correctly having a given concept S in their SCS, Nlabel is the number of samples having that concept in their SCS, and Nground_truth is the number of samples having that concept in GT. Table 2 shows the mean S1_first (MF), the mean precision (MP) and the mean recall (MR) over all concepts in the SCS. KNN consistently outperforms NB, which conforms to the conclusion drawn by Benítez [6]. It also indicates that all metrics of SCCM are the largest among the three methods; in particular, the MR and MF of SCCM are
much bigger than those of the NB and KNN methods. We therefore have two reasons to expect the annotation performance obtained by SCCM to be the best among the three methods. First, the larger the recall, the more correct concepts the SCS covers. Second, the first concept in the SCS is the first evidence used during Bayesian inference, and the accuracy of this evidence is crucial to the annotation results.

Table 2. The MP, MR and MF of semantic candidate set before inference
Method                      NB          KNN                 SCCM
MP                          0.274       0.378               0.378
MR                          0.444       0.490               0.779
MF                          0.137       0.182               0.392
# concepts with recall>0    13          12                  14
concepts with recall=0      waterfall   waterfall, bridge   none
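For concreteness, a small Python sketch of how the per-concept measures in Eqs. (8)-(10) could be computed is given below; representing each sample as an ordered (SCS, GT) pair of concept lists is an assumption made only for this illustration.

    def concept_metrics(samples, concept):
        """Per-concept S1_first, precision and recall as in Eqs. (8)-(10)."""
        n_first = n_correct_first = 0
        n_label = n_correct = n_ground_truth = 0
        for scs, gt in samples:                 # scs, gt: ordered concept lists
            if gt and gt[0] == concept:
                n_first += 1
                if scs and scs[0] == concept:
                    n_correct_first += 1
            if concept in scs:
                n_label += 1
                if concept in gt:
                    n_correct += 1
            if concept in gt:
                n_ground_truth += 1
        s1_first = n_correct_first / n_first if n_first else 0.0
        precision = n_correct / n_label if n_label else 0.0
        recall = n_correct / n_ground_truth if n_ground_truth else 0.0
        return s1_first, precision, recall

Averaging these values over all fourteen concepts yields the MF, MP and MR figures of Table 2.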
After obtaining the SCS and the first concept of a testing frame, the system obtains the other concepts coexisting in the same frame from the SCS by Bayesian inference, as detailed in Section 4. We also measure the annotation performance by formulas (9) and (10), where N_correct is the number of correct concepts that the system automatically annotated for a testing frame, N_label is the total number of concepts that the system automatically annotates for a testing frame, and N_ground_truth is the number of actual concepts of a testing frame in GT. Table 3 shows the average annotation results using the different methods of extracting the SCS and the two methods of constructing the BN structure. First, it indicates that the inference performance on the SCS obtained using SCCM is significantly higher than that using the NB and KNN algorithms, no matter which method of constructing the BN is used. Second, it shows that the recall values of all concepts in SCCM are larger than 0, while those of some concepts with low presence frequency are zero for NB and KNN. This is because the KNN and NB algorithms are sensitive to the distribution of each concept in the training set. SCCM does not have such a problem because it is not sensitive to the presence frequency of each concept: even a concept with low presence frequency can be assigned to a video shot as long as its semantic class centre is close to the shot according to the visual features. This indicates that SCCM is more robust than the NB and KNN algorithms. Third, IDABM performs a little better than TDABM, no matter which method of obtaining the SCS is used.

Table 3. The result of automatic annotation. # con is the number of concepts with recall>0.
            NB                      KNN                     SCCM
Method      MP      MR      #con    MP      MR      #con    MP      MR      #con
TDABM       0.277   0.414   13      0.355   0.47    11      0.406   0.731   14
IDABM       0.279   0.444   13      0.360   0.489   11      0.410   0.765   14
Figure 2 shows the automatic annotation of several testing samples using the three methods of obtaining the SCS and the two methods of constructing the BN. It is obvious that there is a main concept in each frame. For example, the main concept in Figure 2(a) is 'waterfall', that in Figure 2(b) is 'car' and that in Figure 2(c) is 'bridge'. All of these concepts have a low presence frequency in the training set. SCCM can correctly capture
these concepts, but the NB and KNN algorithms cannot. In most cases, SCCM is able to correctly annotate the main concepts, even though it incorrectly annotates some concepts, such as 'building' in Figure 2(a) and 'water' in Figure 2(b). In general, the SCCM algorithm significantly outperforms the NB and KNN algorithms in annotating video shots, and IDABM performs a little better than TDABM without requiring any prior knowledge.
Fig. 2. Examples of the automatic annotations produced by the several methods for three testing frames (a), (b) and (c)

6.2 Results: Ranked Retrieval of Videos
In this section we turn our attention to the problem of ranked retrieval of key frames. In the retrieval setting we are given a text query Q={w1, w2, …, wk} and a testing collection of unannotated key frames. For each testing frame f, Equation (7) is used to get the conditional probability P(Q|f). All frames in the collection are ranked according to the conditional likelihood P(Q|f). In our retrieval experiments, we use four sets of queries, constructed from all 1-, 2-, 3- and 4-word combinations of concepts that occur at least twice in the testing set. A frame is considered relevant to a given query if its manual annotation (ground-truth) contains all of the query words. We use precision and recall averaged over the entire query set as our evaluation metrics. Table 4. The experiment result of retrieval
                     1-concept         2-concept         3-concept         4-concept
Method   Metric      TDABM   IDABM     TDABM   IDABM     TDABM   IDABM     TDABM   IDABM
NB       Precision   0.277   0.279     0.168   0.176     0.117   0.130     0.111   0.111
         Recall      0.414   0.444     0.313   0.341     0.154   0.205     0.125   0.125
KNN      Precision   0.361   0.364     0.194   0.196     0.178   0.195     0.139   0.139
         Recall      0.471   0.491     0.332   0.355     0.289   0.409     0.313   0.313
SCCM     Precision   0.410   0.416     0.261   0.269     0.258   0.284     0.25    0.436
         Recall      0.731   0.765     0.435   0.475     0.214   0.318     0.313   0.613
Fig. 3. Examples of the top 5 frames retrieved in response to the text query "mountain sky": (a) using the NB algorithm, (b) using the KNN algorithm, (c) using the SCCM algorithm. The results of IDABM are identical to those of TDABM.
Table 4 shows the performance of our method on the four query sets, contrasted with the performance of the NB and KNN algorithms on the same data. We observe that our method substantially outperforms the NB and KNN algorithms on every query set, except for the recall of SCCM compared with that of KNN for 3-concept queries. It could be said that SCCM performs comparably to the KNN algorithm in this case, because the sums of precision and recall of the two methods are about equal. Figure 3 shows the top 5 frames retrieved in response to the text query "mountain sky". We do not compare our results with those of other papers because a direct quantitative comparison across different data sets would be unfair; unfortunately, there is currently no free, standard video data set for measuring algorithm performance.
7 Conclusion

We have proposed a new approach to automatically annotate a video shot with a varied number of semantic concepts and to retrieve videos based on text queries. There are three main contributions of this work. The first is a simple but efficient method to automatically extract the semantic candidate set based on visual features. The second is the IDABM algorithm for building the semantic network. IDABM has two advantages over TDABM: users do not need to provide the node ordering, and the time complexity of orienting the edges is reduced from O(n^4) to O(n^2). The third is obtaining annotations of varied length from the SCS by Bayesian inference. Experiments show that SCCM, though simple, is efficient and significantly outperforms the KNN and NB methods in obtaining the SCS, and that IDABM is better than the TDABM algorithm in automatic semantic annotation.
References
[1] Smith, J.R., Campbell, M., Naphade, M., Natsev, A., Tesic, J.: Learning and Classification of Semantic Concepts in Broadcast Video. International Conference on Intelligence Analysis (2005)
[2] Yan, R.: Probabilistic Models for Combining Diverse Knowledge Sources in Multimedia Retrieval. Dissertation, Carnegie Mellon University (2005)
[3] Barnard, K., Duygulu, P., Forsyth, D., de Freitas, N., Blei, D., Jordan, M.I.: Matching Words and Pictures. Journal of Machine Learning Research (JMLR), Special Issue on Text and Images, Vol. 3 (2003) 1107-1135
[4] Tseng, B.T., Lin, C.-Y., Naphade, M.R., Natsev, A., Smith, J.R.: Normalized Classifier Fusion for Semantic Visual Concept Detection. In: Proc. of Int. Conf. on Image Processing (ICIP-2003), Barcelona, Spain (2003) 14-17
[5] Naphade, M.R.: A Probabilistic Framework for Mapping Audio-Visual Features to High-Level Semantics in Terms of Concepts and Context. Dissertation, University of Illinois at Urbana-Champaign (2001)
[6] Benítez, A.B.: Multimedia Knowledge: Discovery, Classification, Browsing, and Retrieval. Dissertation, Columbia University (2005)
[7] Feng, S.L., Manmatha, R., Lavrenko, V.: Multiple Bernoulli Relevance Models for Image and Video Annotation. In: Proc. of CVPR (2004)
[8] Cheng, J., Greiner, R., Kelly, J., Bell, D., Liu, W.: Learning Belief Networks from Data: An Information Theory Based Approach. Artificial Intelligence 137(1-2) (2002) 43-90
[9] Huang, C.: Inference in Belief Networks: A Procedural Guide. International Journal of Approximate Reasoning, Vol. 11 (1994) 1-158
Application of Novel Chaotic Neural Networks to Text Classification Based on PCA

Jin Zhang1,3, Guang Li2,*, and Walter J. Freeman4

1 Department of Biomedical Engineering, Zhejiang University, Hangzhou, 310027, China
2 National Laboratory of Industrial Control Technology, Zhejiang University, Hangzhou, 310027, China
[email protected]
3 Software College, Hunan University, Changsha, 410082, China
4 Division of Neurobiology, University of California at Berkeley, Donner 101, Berkeley, CA, 94720-3206, USA
Abstract. To model mammalian olfactory neural systems, a chaotic neural network termed the K-set has been constructed. This neural network, with non-convergent "chaotic" dynamics, simulates biological pattern recognition. This paper describes the characteristics of the KIII set and applies it to text classification. Its accuracy and efficiency compared with conventional pattern recognition algorithms are demonstrated in an application to text classification. Keywords: Chaotic neural network, text classification, principal component analysis.
1 Introduction

Text classification [1] is the problem of automatically assigning predefined categories to natural language texts, based on their contents. Text classification applications are emerging as an important class of text processing applications, including indexing texts to support document retrieval, extracting information from texts, aiding human indexers in their tasks, and knowledge management using document classification agents. Due to the increasing volume of information needing to be processed, many algorithms have been proposed to improve classification performance, such as the support vector machine (SVM) [2], Bayesian algorithms [3], and so on. Artificial neural networks [4] are also widely used in text classification because of their classification capability and powerful nonlinear mapping ability. Although traditional artificial neural networks simulate some primary features such as the threshold behavior and plasticity of synapses, their structures are still much simpler than those of real biological neural systems. Based on experimental studies of the olfactory systems of rabbits and salamanders, a chaotic neural network mimicking the olfactory system, called the KIII model, has been established. Built according to the architecture of the olfactory neural system, this
chaotic neural network can simulate the EEG waveforms observed in biological experiments well. In contrast to conventional linear methods, the KIII model has strong nonlinear characteristics. In this paper, we use this novel chaotic neural network mimicking the olfactory system as a pattern classifier for text classification. Before classification, features are extracted based on principal component analysis (PCA). The experimental results show that the KIII model has potential for text classification and other pattern classification tasks.
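Since the details of the feature-extraction stage are not given in this excerpt, the Python sketch below only illustrates one plausible PCA reduction of document term-frequency vectors before they are fed to a classifier; the matrix shapes and the number of retained components are assumptions.

    import numpy as np

    def pca_reduce(X, n_components):
        """Project term-frequency vectors onto their top principal components."""
        Xc = X - X.mean(axis=0)                  # centre the data
        U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
        return Xc @ Vt[:n_components].T          # scores on the leading components

    # usage: 100 documents, 500 terms, keep 8 features for the downstream classifier
    X = np.random.rand(100, 500)
    features = pca_reduce(X, 8)
    print(features.shape)    # (100, 8)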
2 A Novel Chaotic Neural Network ― The KIII Model

2.1 KIII Model

The KIII model architecture follows the olfactory system [5, 6], which consists of four main parts (shown in the left part of Fig. 1). A schematic diagram of the KIII model, in which K0, KI and KII sets are included, is shown in the right part of Fig. 1. In the K-set model, each node represents a neural population or cell ensemble. The dynamical behavior of each ensemble of the olfactory system is governed by equation (1).
[x_i''(t) + (a + b)·x_i'(t) + a·b·x_i(t)] / (a·b) = Σ_{j≠i}^{N} [W_ij · Q(x_j(t), q_j)] + I_i(t)    (1)

Q(x_i(t), q) = q·(1 − e^{−(exp(x(t)) − 1)/q})   if x(t) > x_0
Q(x_i(t), q) = −1                               if x(t) < x_0
where x_0 = ln(1 − q·ln(1 + 1/q))    (2)
Here i = 1, ..., N, where N is the number of channels. In equation (1), x_i(t) represents the state variable of the i-th neural population and x_j(t) represents the state variable of the j-th neural population, which is connected to the i-th, while W_ij indicates the connection strength between them. I_i(t) is an input function, which stands for the external stimulus. Physiological experiments confirm that a linear second-order derivative is an appropriate choice. The parameters a = 0.220 msec^-1 and b = 0.720 msec^-1 are two rate constants derived from electro-physiological experiments. Q(x_j(t), q_j) is a nonlinear sigmoid function derived from the Hodgkin-Huxley equation and is expressed as equation (2). In equation (2), q represents the maximum asymptote of the sigmoid function, which is also obtained from biological experiments. q differs across layers; for example, q equals 5 in the OB layer and 1.824 in the PG layer. The threshold x_0 determines whether the state of a neuron is excitatory or inhibitory. The KIII model describes the whole olfactory neural system. It includes populations of neurons, local synaptic connections, and long forward and distributed time-delayed feedback loops. In the topology of the KIII network, R represents the olfactory receptor, which is sensitive to odor molecules and provides the input to the KIII network. The PG and OB layers are distributed and contain n channels. The AON and PC layers are each composed of a single KII network. The parameters in the KIII network were optimized by measuring the olfactory evoked
Fig. 1. Topological diagram for the neural system (left) and KIII network (right)
potentials and EEG, simulating their waveforms and statistical properties, and fitting the simulated functions to the data by means of nonlinear regression. After the parameter optimization, the KIII network generates EEG-like waveforms with 1/f power spectra [7]. Compared with well-known low-dimensional deterministic chaotic systems such as the Lorenz or logistic attractors, nervous systems are high-dimensional, non-autonomous and noisy systems [8]. In the KIII network, independent rectified Gaussian noise was introduced to every olfactory receptor to model the peripheral excitatory noise, and a single channel of Gaussian noise with an excitatory bias was introduced to model the central biological noise sources. The additive noise eliminated numerical instability of the KIII model and made the system trajectory stable and reliable under statistical measures, meaning that under perturbation of the initial conditions or parameters the system trajectories were robustly stable [9]. Because of this stochastic chaos, the KIII network not only simulated chaotic EEG waveforms, but also acquired a capability for pattern recognition that simulates an aspect of biological intelligence, as demonstrated by previous applications of the KIII network to the recognition of one-dimensional sequences, industrial data and spatiotemporal EEG patterns [10]; deterministic chaos cannot do this, owing to its infinite sensitivity to initial conditions.

2.2 Learning Rule

We study the n-channel KIII model, which means that the R, P and OB layers all have n units, either K0 (R, P) or KII (OB). When a stimulus is presented to some of the n channels while the other channels receive the background input (zero), the OB units in these channels experience a state transition from the basal activity to a limit cycle attractor and the oscillation in these units increases dramatically. Therefore, we
choose the activity of the M1 node in the OB unit as the measure indicating the extent to which the channel is excited. In other words, the output of the KIII set is taken as a 1×n vector at the mitral level to express the AM patterns of the local basins, specifically the input-induced oscillations in the gamma band (20 Hz to 80 Hz) across the n channels during the input-maintained state transition. To put it clearly, the activity of the M1 node is calculated as the root mean square (RMS) value of the M1 signal over the period of stimulation. There are two kinds of learning rules [11]: Hebbian learning and habituation learning. The Hebbian rule reinforces the desired stimulus patterns, while habituation learning decreases the impact of the background noise and of stimuli that are not relevant or significant. According to the modified Hebbian learning rule, when two nodes become excited at the same time with reinforcement, their lateral connection weight is strengthened; on the contrary, without reinforcement the lateral connection is weakened. Let P_M(i) denote the activity of the i-th M1 node in the OB layer, P̄_M the average value of all the P_M(i) (i = 1, ..., n), W_{M(i)->M(j)} the connection weight from the i-th M1 node to the j-th M1 node, and h_Heb and h_Hab the connection weights assigned when two nodes are simultaneously excited or not, respectively. The algorithm can be described as follows:
1.  For i = 1, ..., N
2.    For j = 1, ..., N
3.      IF P_M(i) > (1+K)*P̄_M AND P_M(j) > (1+K)*P̄_M AND i != j THEN
4.        W'_{M(i)->M(j)} = h_Heb
5.      ELSEIF i == j THEN
6.        W'_{M(i)->M(j)} = 0
7.      ELSE
8.        W'_{M(i)->M(j)} = h_Hab
9.      END IF
10.   End  // loop for j
11. End  // loop for i
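A minimal Python sketch of this weight update is given below; the array-based representation and the particular values of h_Heb, h_Hab and the bias coefficient K (discussed after the listing) are illustrative assumptions only.

    import numpy as np

    def update_ob_weights(pm, h_heb, h_hab, k=0.2):
        """Modified Hebbian / habituation update of the M1 lateral weights."""
        n = len(pm)                 # pm: length-n array of M1 activities (RMS values)
        mean_pm = pm.mean()
        w = np.empty((n, n))
        for i in range(n):
            for j in range(n):
                if i == j:
                    w[i, j] = 0.0                                   # no self-connection
                elif pm[i] > (1 + k) * mean_pm and pm[j] > (1 + k) * mean_pm:
                    w[i, j] = h_heb                                 # both nodes co-excited
                else:
                    w[i, j] = h_hab                                 # habituation
        return w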
The bias coefficient K>0 is introduced in order to avoid the saturation of the weight space. The habituation constitutes an adaptive filter to reduce the impact of environmental noise that is continuous and uninformative. In our simulation, the habituation exists at the synapse of the M1 nodes on other nodes in the OB and the lateral connection within the M1 layer. It is implemented by incremental weight decay (multiply with a coefficient hhab N . In this experiment, we set M to be equal to N . In each node in the trellis, there are M/4 incoming arcs and M/4 outgoing arcs. Bold arcs represent positive arcs which embed bit 1 and non-bold arcs represent negative arcs which represent bit 0. H∗ denotes the set of all the positive and negative arcs. H∗ = {h1 , h2 , · · ·, hM }
Fig. 2. Trellis-code with orthogonal arcs
1106
T. Kim et al.
Fig. 3. The 12 DCT coefficients for the extract vector
2.2
Embedding and Detection of Message Bits
The image is converted into 8 × 8 block DCT domain. Then, N low ac components are taken from each of the L blocks, as in Figure 3, and then they are concatenated and shuffled into a randomly-ordered vector. The shuffled vector, v, referred as the extract vector of the image, is an L × N sequence. The i-th message bit is embedded into the i-th N -element segment in v. v = [v1,1 , v1,2 · ··, v1,N · · · vi,1 , vi,2 · · · vi,N · · · vL,1 , vL,2 · · · vL,N ]
3
(1)
3 Informed Embedding over Orthogonal Arcs
Informed coding is the process of choosing the codeword that is most correlated with the cover work. In the proposed trellis, there are 4^L paths in all, starting from the node in the first stage to the end, since 4 arcs exit from each node. Each path represents a specific (but not necessarily distinct) message. Identification of the right path is performed in two steps. In the first step, we remove those arcs in the trellis which do not encode the given message. Then each stage is left with either bold arcs (bit 1) or non-bold arcs (bit 0); all the paths in the modified trellis shown in Figure 4 represent the encoded message. In the second step, the highest-correlation path is identified by Viterbi decoding on the modified trellis. At the receiver side, Viterbi decoding on the complete trellis might return a path other than the path chosen on the modified trellis. To suppress this erroneous decoding, the extract vector is modified such that its correlation with the arcs in the right path secures a margin over all the interfering paths.

3.1 Informed Embedding
The description that follows concerns the embedding process for the i-th stage of the trellis; however, it applies equally to all the other stages since they have identical configurations of arcs. Without loss of generality, we may assume that the i-th message bit is 1 and that the arc in the i-th stage of the path identified by the informed coding process is h_m0. Define u to be the i-th N-element segment of v:

u = [u_1, u_2, ..., u_N] = [v_{i,1}, v_{i,2}, ..., v_{i,N}]
Fig. 4. Modified Trellis-code with only unipolar arcs
Define q_m as the step correlation of an arc h_m with u:

q_m = h_m · u,   m = 1, 2, ..., M    (2)
The Viterbi algorithm returns the path that wins the highest accumulated correlation with v. The accumulated correlation is computed incrementally for each stage of the trellis. We define Q_m^i(j), where j = 1, ..., i, as the sum of the correlations with v of the arcs from the first stage up to the j-th stage, out of all the i arcs in the winning path connected to h_m in the i-th stage:

Q_m^i(i) = q_m + Q_m^i(i − 1)    (3)

When h_m0 was chosen in the informed coding process, there were only positive arcs in the i-th stage of the modified trellis. Hence, there is always a chance of some negative arcs having a larger accumulated correlation than h_m0, and so the detector at the receiver side might return an erroneously decoded message, since detection is performed on the complete trellis. In order to prevent this, we modify u such that Q_{m0}^i(i) is larger than the accumulated correlation of the other positive arcs, as well as of all the negative arcs, by at least R_0 (the desired robustness constant):
Q_{m0}^i(i) ≥ Q_m^i(i) + R_0,   m = 1, ..., M,  m ≠ m_0    (4)
The N-element vector u is updated element by element. Each element of u can be either increased or decreased in accordance with the signs of the corresponding elements of u, h_m0 and the interfering arc h_m. When the signs of u and h_m0 are the same and the sign of the interfering arc h_c is different, we increase the magnitude of that element of u by a factor of (1 + α). When the signs of u and h_c are the same and the sign of h_m0 is different, we decrease the magnitude of that element of u by a factor of (1 − α). If the element of u is too small, we increase or decrease it by β, after checking the signs of h_m0 and u. The modification of u is repeated against all the other interfering arcs. As a result, each element of u is altered only when it helps to increase Q_{m0}^i(i) over some of the interfering arcs. Furthermore, the amount of increase of
Q_m^i(i) (m ≠ m_0) caused by the change of one element of u never exceeds the increase in Q_{m0}^i(i). With this updating scheme, the correlation margin over the interfering arcs either increases or remains the same at each element change of u. In this way, the proposed algorithm effectively modifies u toward h_m0 and suppresses the interfering arcs until the desired robustness level R_0 is achieved. The pseudo-code for the proposed embedding algorithm is as follows:

DO from m = 1 to M (skipping m = m_0)
  h_c ← m-th element of H*
  REPEAT UNTIL (Q_{m0}^i(i) ≥ Q_m^i(i) + R_0)
    DO from n = 1 to N
      u ← n-th element of u
      g ← n-th element of h_m0
      b ← n-th element of h_c
      IF u · g ≈ 0 THEN
        IF g > 0
          |u| ← |u| + β
        ELSE IF g < 0
          |u| ← |u| − β
        END IF
      END IF
      IF u · g > 0 AND g · b < 0 THEN
        |u| ← |u| · (1 + α)
      END IF
      IF u · g < 0 AND g · b < 0 THEN
        |u| ← |u| · (1 − α)
      END IF
    END DO
  END REPEAT
END DO
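A compact Python rendering of this per-stage update is sketched below; treating the accumulated correlations of the competing partial paths as a given array is an assumption made only to keep the example self-contained.

    import numpy as np

    def informed_embed_stage(u, arcs, m0, r0, accum, alpha=0.01, beta=0.01, eps=1e-9):
        """Push u toward arc m0 until the margin of Eq. (4) holds for every interferer.

        u     -- i-th N-element segment of the extract vector (modified in place)
        arcs  -- (M, N) array of arc vectors H*
        accum -- length-M array with Q_m^i(i-1), assumed given from the Viterbi pass
        """
        g = arcs[m0]
        for m in range(len(arcs)):
            if m == m0:
                continue
            b = arcs[m]
            # repeat until the robustness margin over this interfering arc is met
            while accum[m0] + g @ u < accum[m] + b @ u + r0:
                for n in range(len(u)):
                    if abs(u[n] * g[n]) < eps:                 # element of u too small
                        u[n] += beta if g[n] > 0 else -beta
                    if u[n] * g[n] > 0 and g[n] * b[n] < 0:
                        u[n] *= (1 + alpha)                    # strengthen toward h_m0
                    elif u[n] * g[n] < 0 and g[n] * b[n] < 0:
                        u[n] *= (1 - alpha)                    # suppress the interferer
        return u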
3.2 Comparison with Miller's Embedding
In the informed embedding of Miller's method [3], the modification is performed on v, instead of u, in a stochastic manner. Each time v is updated, the accumulated correlation with v of the interfering paths (rather than interfering arcs) also changes in a random way. This introduces two problems. First, at each update of v, Viterbi decoding over all 2^L paths needs to be performed again in order to find the most interfering path and secure the desired robustness against it. Even with the modified (simplified) Viterbi algorithm, this incurs a huge computational cost. Second, the modification of v is accomplished in a random way, so convergence is not guaranteed.
The proposed embedding algorithm is free from these problems. First, convergence of the extract vector is deterministic, since the updating of u is performed in accordance with the signs of u and of the arcs: at each update, u gets closer to h_m0 and recedes farther from the interfering arcs. Second, the embedding process is performed on u instead of along a path, which is L times longer; hence, incremental updating of the accumulated correlation is enough. The speed-up factor is L, the message size.
4 Experimental Results
We tested the proposed embedding algorithm on several standard test images. Message sequences are randomly generated for each experiment. The message length L is 1024 for 256 × 256 images and 4096 for 512 × 512 images. The orthogonal arcs are the column vectors of a 16 × 16 Hadamard matrix. The values of α and β used in the experiments are both 0.01. The performance is evaluated with respect to two criteria. The first is fidelity; for this, we measured the Watson distance [10] in the watermarked image. The second criterion is robustness against attacks; for this, we measured the bit error rate (BER).

4.1 Fidelity Test
In Table 1, the Watson distance (with R_0 = 3.5) is compared for several informed watermarking methods [3][7][11] under similar operating conditions. The proposed method provides a lower Watson distance.

Table 1. Comparison of Watson distance
          Proposed   Miller   Lin   Ward
Average   49         86       55    50

4.2 Robustness Test
With the proposed informed embedding, the BER is zero in the absence of intentional attacks, and it stays at zero until the strength of an attack reaches a certain level. We examined the BER performance of our method under various types of attacks: additive Gaussian noise, lowpass filtering, valumetric scaling and lossy compression.

Additive White Gaussian Noise. The BER under additive Gaussian noise is shown in Figure 5. With the proposed embedding, the BER stays at zero until the standard deviation of the Gaussian noise, σ, reaches 5. Even beyond σ = 5, the BER stays below 1%, while the BER for the other methods is as high as 4%. In summary, the proposed embedding provides superior robustness against Gaussian noise compared with the other methods.
Fig. 5. BER under gaussian noise attack
Fig. 6. BER under lowpass filtering
Fig. 7. BER under valumetric scaling
Fig. 8. BER under JPEG compression
Lowpass Filtering. The BER under lowpass filtering is shown in Figure 6. The lowpass filter has a size of 21 × 21, and the BER is measured for filter widths ranging from σ = 0.1 to 1.0. The performance of the proposed method lies between [3] and [11]. The proposed method performs worse than [3], especially at heavier filtering, because it uses a larger number of DCT coefficients (16) than [3] (12).

Valumetric Scaling. The BER under valumetric scaling is shown in Figure 7. The scale factor ν ranges from 0.1 to 2.0. When ν is less than 1.0 (downscaling), the BER is virtually zero. When ν is larger than 1.0 (upscaling), the proposed embedding provides a lower BER than the others.

Lossy Compression. The BER under JPEG compression is shown in Figure 8. A lower quality factor (QF) means heavier JPEG compression. The BER performance is examined for QF from 20 to 100. The proposed embedding provides BER ≤ 1% for QF = 40. The inferior performance is due to the larger number of DCT coefficients, as in lowpass filtering.

4.3 Speedup Test
Miller’s trellis-code embedding is very slow due to the repeated Viterbi decoding at each iteration of v. The speedup factor by the proposed embedding is L. In experiments, the average time for embedding L(= 1024) bits into a 256 × 256 image is 1200 secs [3], 2 secs [7] and 1 sec (proposed method).
5 Conclusions
We present a dirty-paper trellis-code watermarking algorithm with orthogonal arcs. Though the underlying trellis structure is similar to Miller's, the orthogonal
arcs provide two advantages over the original method. The first is a speedup by a factor of L: the original trellis code with random-valued arcs is very slow since modification of the cover work is performed at the path level, whereas the proposed deterministic algorithm with orthogonal arcs is much faster since incremental arc-level updating of u is enough for embedding each bit. Experimental results also show that the proposed method provides higher fidelity and a lower BER against various types of attacks, compared with other conventional informed watermarking methods.

Acknowledgment. The authors wish to thank the Brain Korea 21 Project of MOE and the Next-generation Digital Convergence Platform Research Fund 2005 of MOCIE for supporting this study.
References
1. Costa, M.: Writing on Dirty Paper. IEEE Trans. Information Theory, vol. 29, pp. 439-441, May 1983
2. Chen, B. and Wornell, G.: Digital Watermarking and Information Embedding Using Dither Modulation. In: Proc. of 2nd WMSP, IEEE, pp. 273-278, 1998
3. Miller, M., Doërr, G., and Cox, I.: Applying Informed Coding and Embedding to Design a Robust High-Capacity Watermark. IEEE Trans. Image Processing, vol. 13, pp. 792-807, June 2004
4. Cox, I., Miller, M., and Bloom, J.: Digital Watermarking. Morgan Kaufmann, 2001
5. Kuk, H. and Kim, Y.: Performance Improvement of Order Statistical Patchwork. LNCS, vol. 2939, Springer-Verlag, pp. 246-262, Oct. 2003
6. Moulin, P. and O'Sullivan, J.: Information-Theoretic Analysis of Information Hiding. In: Proc. of ISIT, IEEE, June 2000, Italy
7. Lin, L., Doërr, G., Cox, I., and Miller, M.: An Efficient Algorithm for Informed Embedding of Dirty-Paper Trellis Codes for Watermarking. In: Proc. of ICIP, IEEE, Sep. 2005, Italy
8. Abrardo, A. and Barni, M.: Informed Watermarking by Means of Orthogonal and Quasi-Orthogonal Dirty Paper Coding. IEEE Trans. Signal Processing, vol. 53, pp. 824-833, Feb. 2005
9. Abrardo, A. and Barni, M.: Fixed-Distortion Orthogonal Dirty Paper Coding for Perceptual Image Watermarking. LNCS, vol. 3200, pp. 52-66, Springer-Verlag, May 2004
10. Watson, A.: DCT Quantization Matrices Optimized for Individual Images. Human Vision, Visual Processing, and Digital Display IV, vol. SPIE-1913, pp. 202-216, 1993
11. Coria-Mendoza, L., Nasiopoulos, P., Ward, R.: A Robust Watermarking Scheme Based on Informed Coding and Informed Embedding. In: Proc. of ICIP, IEEE, Sep. 2005, Italy
An Efficient Object Tracking Algorithm with Adaptive Prediction of Initial Searching Point*

Jiyan Pan, Bo Hu, and Jian Qiu Zhang

Dept. of E.E., Fudan University, 220 Handan Road, Shanghai 200433, P.R. China
{jiyanpan, bohu, jqzhang01}@fudan.edu.cn
Abstract. In object tracking, complex background frequently forms local maxima that tend to distract tracking algorithms from the real target. In order to reduce such risks, we utilize an adaptive Kalman filter to predict the initial searching point in the space of coordinate transform parameters, so that both tracking reliability and computational simplicity are significantly improved. Our method tracks the changing rate of the transform parameters and predicts their future values to determine the initial searching point. More importantly, the noises in the Kalman filter are effectively estimated in our approach without any artificial assumption, which enables our method to adapt to various target motions and searching step sizes without any manual intervention. Simulation results demonstrate the effectiveness of our algorithm. Keywords: Object tracking, coordinate transform, initial searching point, adaptive Kalman filter.
1 Introduction

Object tracking has been widely applied to video retrieval, robotics control, traffic surveillance and homing technologies. Many object tracking algorithms have been reported in the literature, and among them template matching algorithms have drawn much attention [1]-[6]. In such algorithms, the target is modeled by a template and is tracked in a video sequence by matching candidate image regions with the template through coordinate transforms. The set of transform parameters that yields the highest similarity between the template and the mapped image region of the current frame represents the geometric information of the target. The performance of object tracking heavily depends on whether the search for the optimal transform parameters can be executed effectively. Many fast searching algorithms have been proposed in an effort to increase the accuracy of searching results *
This work is part of Research on Key Scientific and Technologic Problems of Molecular Imaging and is supported by National Basic Research Program 973 under Grant 2006CB705700.
while reducing computational complexity. Typical algorithms include Three Step Search (TSS) [7], 2D-Log Search (2DLS) [8], Block-based Gradient Descent Search (BBGDS) [9], and the Lucas-Kanade algorithm [1]. For all the algorithms mentioned above, the distraction of local maxima is always a serious problem, frequently leading to failure to find the real coordinate transform parameters. Ideally, the image region occupied by the real target in the current frame should render the largest similarity measure and therefore unambiguously stand out against the other parts of the frame. When the background is cluttered, however, some nearby objects also generate comparable similarity measures and hence confuse tracking algorithms. When searching for optimal coordinate transform parameters, tracking algorithms frequently find themselves trapped in local maxima produced by background objects and other interferences. Such a situation can be improved by predicting the initial searching point in the space of transform parameters for the next frame and reducing the searching range to ensure unimodality of the similarity measure. Since most local maxima in the transform parameter space reside some distance from the global maximum where the target is located, the risk of being trapped in local maxima can be substantially reduced if the initial searching point is in the close vicinity of the global maximum. This requires a good prediction of the geometric status of the target in each frame. In the realm of object tracking, Kalman filters have been used in the literature [6], [11], [12], but few of them serve the purpose of predicting the initial searching point and enhancing tracking performance for the next frame. Besides, the model noises are fixed and determined empirically. In this paper, we propose an approach which employs a Kalman filter to track the changing rate of the transform parameters instead of directly filtering their values. We then select the predicted parameters as the initial searching point for the next frame. More importantly, after analyzing the cause of the model noises in the Kalman filter, we propose an effective method to estimate the power of those noises. As a result, the Kalman filter in our approach can automatically adapt to various target motions and searching step sizes. Experimental results indicate that the proposed method achieves extremely high accuracy in predicting the parameters and hence a significant decrease in the risk of being distracted by background interferences, as well as a considerable drop in computational burden. The remainder of this paper is organized as follows. Section 2 focuses on the adaptive Kalman prediction of the initial searching point in the transform parameter space, after a brief review of object tracking algorithms based on template matching. Experimental results are included in Section 3. The paper is concluded in Section 4.
2 Adaptive Prediction of the Initial Searching Point

2.1 Object Tracking Based on Template Matching

The object (or target) to be tracked is characterized by an image called the template, which is generally extracted from the first frame of a video sequence. In subsequent frames of the video sequence, the template is mapped to the coordinate system of the frames by coordinate transforms. A searching algorithm tries various combinations of transform
parameters to find the set of transform parameters that maximizes the similarity between the template and the mapped region of the current frame:

a_m = arg max_a  sim{ I[φ(x; a)], T(x) }    (1)
where T(x) is the grey-scale value of a template pixel located at x in the template coordinate system, I(y) is the grey-scale value of a frame pixel located at y in the frame coordinate system, φ(x; a) is the coordinate transform with parameter vector a, and sim{I, T} is a function that measures the degree of similarity between images I and T. Typical examples of sim{I, T} include the normalized linear correlation or the inverse of the SSD (sum of squared differences) between I and T [13]. a_m is the transform parameter vector that the searching algorithm assumes to correspond to the correct geometric information of the target. The type of the coordinate transform is determined by its parameter vector a. For a coordinate transform that consists of translation, scaling and rotation, a has four components and φ(x; a) can be written as

φ(x; a) = a_1 · [cos a_2, sin a_2; −sin a_2, cos a_2] · [x, y]^T + [a_3, a_4]^T    (2)
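As a brief illustration of Eqs. (1)-(2), the Python sketch below warps template coordinates with the four-parameter transform and scores a candidate with the negative SSD; using negative SSD in place of the inverse of the SSD, and sampling the frame by rounding coordinates, are simplifying assumptions.

    import numpy as np

    def warp_coords(x, a):
        """Similarity transform of Eq. (2): scale a1, rotation a2, translation (a3, a4)."""
        a1, a2, a3, a4 = a
        R = np.array([[np.cos(a2),  np.sin(a2)],
                      [-np.sin(a2), np.cos(a2)]])
        return a1 * x @ R.T + np.array([a3, a4])   # x: (K, 2) template coordinates

    def neg_ssd(frame, template_vals, coords, a):
        """Negative SSD between the template and the warped frame region (larger = more similar)."""
        warped = np.rint(warp_coords(coords, a)).astype(int)
        ys = np.clip(warped[:, 1], 0, frame.shape[0] - 1)
        xs = np.clip(warped[:, 0], 0, frame.shape[1] - 1)
        diff = frame[ys, xs] - template_vals
        return -np.sum(diff * diff)

    # usage: a_m = max(candidate_params, key=lambda a: neg_ssd(frame, t_vals, t_coords, a))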
Generally speaking, φ(x; a) can have an arbitrarily large number of parameters and hence describe extremely complex object motions, yet the model described by (2) is sufficient for most real-world tracking applications.

2.2 Predicting the Initial Searching Point

In order to predict the initial searching point in the transform parameter space, the possible value of each transform parameter in the next frame has to be predicted. Since the frame rate is relatively high, we can reasonably assume that the changing rate of each parameter does not alter abruptly over adjacent frame intervals. What brings uncertainty to the changing rate is the influence of the arbitrary motion of the target. Such an influence brings about fluctuation of the changing rate of the transform parameters, and can thus be regarded as noise. We employ an adaptive Kalman filter to track the changing rate of the parameters. Such a method is especially instrumental in predicting, not just smoothing, the geometric status of the target. Since different transform parameters describe independent aspects of target motion, they can be predicted separately. The discussion below therefore focuses on one parameter alone; it can be applied to the other parameters trivially. The state transition equation and the measurement equation for the changing rate of a coordinate transform parameter a are v(n) = v(n − 1) + u(n − 1),
(3)
vm (n ) = v(n ) + w(n ) ,
(4)
where v(n) is the changing rate of the parameter defined as a(n)-a(n-1), vm(n) is the measured changing rate of the parameter, which is actually the increment of the result of parameter search in (1), u(n) is the cause of the fluctuation of v(n) and is white with
the power of σu2(n), and w(n) is the measurement noise resulting from the limit in the precision of the searching step size for the parameter a. It is also white, with the power of σw2(n). Suppose vˆ P (n ) is the prediction of v after the measurement up to frame n-1 is available, and vˆ E (n ) is the estimate of v after the measurement up to frame n is acquired. If eP (n ) denotes the prediction error of v and eE (n ) represents the estimation error of v, the following equations hold: v(n − 1) = vˆ E (n − 1) + eE (n − 1) ,
(5)
v(n ) = vˆ P (n ) + eP (n) .
(6)
Since the state transition coefficient in (3) is one, the estimate of v at frame n-1 serves as the prediction of v at frame n: vˆ P (n ) = vˆ E (n − 1) .
(7)
From (3), and (5) to (7), the relationship between the prediction and the estimation errors can be derived: eP (n ) = eE (n − 1) + u (n − 1) .
(8)
As eE(n-1) is uncorrelated with u(n-1), the additive relationship remains for the power of the signals in (8): σ P2 (n ) = σ E2 (n − 1) + σ u2 (n − 1)
(9)
where σ_P² and σ_E² are the powers of the prediction error and the estimation error, respectively. According to the theory of Kalman filtering [10], the optimal Kalman gain can be expressed as

G(n) = 1 / (1 + σ_w²(n) / σ_P²(n))    (10)
where the increase in the prediction error or the decrease in the measurement noise will lead to the rise in the Kalman gain. After the measured value of v is obtained at frame n, the estimated value of it can be calculated using its predicted value and the Kalman-gain-weighted innovation: vˆE (n ) = vˆP (n ) + G (n )[vm (n ) − vˆP (n )] = vˆP (n ) + G (n )α (n )
(11)
where α (n ) = vm (n ) − vˆ P (n ) is the innovation at frame n. Updating the estimate of v leads to the renewal of estimation error as σ E2 (n ) = [1 − G(n )]σ P2 (n ) .
(12)
(7) and (9) to (12) form a complete iteration to update the prediction of v. After the predicted value of v for frame n+1 is obtained by applying (7) after (11), the prediction of a at frame n+1 can be written as
aˆ P (n + 1) = am (n ) + vˆP (n + 1)
(13)
where â_P(n + 1) is the prediction of a at frame n+1, and a_m(n) is the searching result of a at frame n. â_P(n + 1) is usually very close to the real value of a(n+1), and the initial searching point for a is therefore selected as â_P(n + 1).

2.3 Estimating the Power of the Model Noises

Although the equations listed above seem to have solved our problem, the power of the two model noises, σ_u²(n) and σ_w²(n), remains to be estimated. Correct evaluation of them plays a key role in obtaining a proper Kalman gain and thus directly determines the performance of the Kalman filter. In the remainder of this section we describe our approach to estimating σ_u²(n) and σ_w²(n). As mentioned before, the measurement noise is caused by the non-infinitesimal searching step size used in looking for the optimal coordinate transform parameters. For simplicity of notation, we denote a(n) as a_n. Suppose the step size for searching the parameter a_n is Δ, and the searching result is a_{m,n}. It is reasonable to assume that the true value of a_n is uniformly distributed over an interval of Δ centered at a_{m,n}; that is, the density of the true value of a_n is

p_n(a_n) = 1/Δ  if |a_n − a_{m,n}| ≤ Δ/2;   0 elsewhere    (14)
The power of searching error of an can be expressed as follows:
σ_a² = E{(a_{m,n} − a_n)²}
     = ∫_{−∞}^{+∞} (a_{m,n} − a_n)² p_n(a_n) da_n
     = ∫_{a_{m,n}−Δ/2}^{a_{m,n}+Δ/2} (a_{m,n} − a_n)² (1/Δ) da_n
     = Δ²/12    (15)
Since v(n) is the changing rate of an, it is evident that v(n ) = an − an−1 ,
(16)
vm (n ) = am ,n − a m,n −1 .
(17)
Taking (4), (16) and (17) into consideration, we can derive the power of measurement noise σw2(n) as follows:
σ_w² = E{w²(n)} = E{(v_m(n) − v(n))²}
     = E{(a_{m,n} − a_n)²} + E{(a_{m,n−1} − a_{n−1})²} − 2·E{(a_{m,n} − a_n)(a_{m,n−1} − a_{n−1})}    (18)
As the parameter searching processes at different frames are uncorrelated, the cross term of (18) is zero. Considering (15), we can reduce (18) to
σ_w² = E{(a_{m,n} − a_n)²} + E{(a_{m,n−1} − a_{n−1})²} = Δ²/12 + Δ²/12 = Δ²/6    (19)
From (19) we can infer that having a finer searching step size can reduce the power of the measurement noise, which is just as expected. The estimation of σu2(n), however, is not as straightforward since the motion of the target can be arbitrary. Yet we can still acquire its approximate value by evaluating the power of the innovation α(n). Considering (3), (4), (5) and (7) simultaneously, one can immediately get the following equation which relates the innovation with the estimation error and the two model noises:
α (n ) = eE (n − 1) + u (n − 1) + w(n ) .
(20)
The uncorrelatedness among the right-hand terms in (20) yields
σ_α²(n) = σ_E²(n − 1) + σ_u²(n − 1) + σ_w²(n)    (21)

where σ_α²(n) is the power of the innovation and can be approximated as
σ_α²(n) = (1/N) Σ_{k=n−N+1}^{n} [v_m(k) − v̂_P(k)]²    (22)
N is the number of frames over which the power of the innovation is averaged to obtain its approximate expectation. Combining (19), (21) and (22), we can acquire the estimation of σu2(n) as follows:
σ_u²(n) = (1/N) Σ_{k=n−N+2}^{n+1} [v_m(k) − v̂_P(k)]² − σ_E²(n) − Δ²/6    (23)
where σ_E²(n) is calculated online in the iterations of Kalman filtering. So far we have derived the expressions for estimating the power of the two model noises. By doing so, we do not have to assign any empirical values to those noises as most conventional approaches do. As a result, the proposed method can be applied to video sequences with various characteristics of target motion and to searching algorithms with different searching step sizes, without any need to tune the Kalman filter manually. The last step remaining is to initialize the filter. Since we have no information regarding target motion at the very beginning, it is natural to set the initial values of both v̂_E and σ_E² to zero:

v̂_E(0) = 0,  σ_E²(0) = 0    (24)
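To make the whole prediction loop of Eqs. (7)-(13), (19) and (23) concrete, a minimal Python sketch of a per-parameter predictor is given below; the class structure, the clamping of the process-noise estimate at zero and the exact windowing of the innovation average are implementation assumptions rather than details fixed by the paper.

    import numpy as np

    class ParameterPredictor:
        """Adaptive Kalman prediction of one transform parameter (a sketch)."""

        def __init__(self, step, window=10):
            self.sigma_w2 = step ** 2 / 6.0    # measurement-noise power, Eq. (19)
            self.window = window               # N frames for averaging the innovation
            self.v_est = 0.0                   # v_hat_E(0) = 0, Eq. (24)
            self.sigma_e2 = 0.0                # sigma_E^2(0) = 0, Eq. (24)
            self.innovations = []
            self.a_prev = None
            self.v_pred = 0.0

        def update(self, a_measured):
            """Feed the searched parameter a_m(n); return the predicted a for frame n+1."""
            if self.a_prev is None:
                self.a_prev = a_measured
                return a_measured                        # nothing to predict yet
            v_meas = a_measured - self.a_prev            # measured changing rate
            self.a_prev = a_measured
            alpha = v_meas - self.v_pred                 # innovation
            self.innovations.append(alpha)
            self.innovations = self.innovations[-self.window:]
            sigma_alpha2 = np.mean(np.square(self.innovations))
            sigma_u2 = max(sigma_alpha2 - self.sigma_e2 - self.sigma_w2, 0.0)  # Eq. (23)
            sigma_p2 = self.sigma_e2 + sigma_u2          # Eq. (9)
            gain = sigma_p2 / (sigma_p2 + self.sigma_w2) if sigma_p2 > 0 else 0.0  # Eq. (10)
            self.v_est = self.v_pred + gain * alpha      # Eq. (11)
            self.sigma_e2 = (1.0 - gain) * sigma_p2      # Eq. (12)
            self.v_pred = self.v_est                     # Eq. (7)
            return a_measured + self.v_pred              # Eq. (13)

In this setting, independent predictors of this kind would be run for the horizontal location, the vertical location (step size 1 pixel) and the scale (step size 0.05).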
3 Experimental Results

In order to examine how the adaptive prediction of the initial searching point in the transform parameter space can improve the performance of object tracking, we compare the tracking results of two algorithms that are exactly the same in every other aspect except that the first algorithm selects the transform parameters predicted by our
proposed method as the initial searching point for the next frame, while the other algorithm just takes the parameters found in the current frame as the initial searching point for the next frame. For simplicity, we denote the algorithm with adaptive prediction of the initial searching point as Algorithm 1, and the other one as Algorithm 2. The model of object motion includes translation and scaling. In both algorithms, the searching step size is 1 pixel for the horizontal and vertical locations and 0.05 for the scale. Both algorithms select the inverse of the SSD as the similarity function [2] and use a gradient descent search algorithm to look for the optimal transform parameters. An adaptive Kalman appearance filter is employed to update the template. Figs. 1 to 3 illustrate how well our proposed method predicts the coordinate transform parameters of the next frame. We apply Algorithm 1 to a video sequence where the target undergoes much motion in both spatial location and scale. Both the actual and the predicted values of the coordinate transform parameters for every frame are plotted in the same figure. It can be seen from the figures that our method gives a very precise prediction of what the parameters are going to be in the next frame. The average distance between the initial searching point and the actual point in transform parameter space reduces from 2.7398 to 0.9632 when we use Algorithm 1 instead of Algorithm 2. Such a significant drop in the searching distance is extremely beneficial to tracking algorithms in terms of enhancing tracking stability and decreasing computational burden, as will be demonstrated in the following experimental results. Fig. 4 and Fig. 5 exemplify the considerable improvement of tracking stability when using the adaptive prediction of the initial searching point. When the initial searching point is much closer to the actual point in transform parameter space, tracking algorithms are less likely to be distracted by local maxima resulting from cluttered background, similar objects, or other interferences. This fact is confirmed by our experiments, in which we deliberately choose a video sequence that has a vehicle running on a dark road at night. Due to the darkness, the vehicle is blurred and is somewhat similar to the road. When we apply Algorithm 2 to track the vehicle, it is not long before the algorithm loses the target because of being distracted by interferences from the road, as is shown in Fig. 4. Algorithm 1, however, successfully locks on the target throughout the sequence, as is demonstrated in Fig. 5. The region in the lower right corner of each frame is the overlapped template. The computational burden can also be greatly reduced by the adaptive prediction of the initial searching point. Since the distance between the initial searching point and the final result point is substantially reduced, it takes the searching algorithm in (1) much fewer trials to reach a final status, and computational complexity is therefore considerably reduced. Fig. 6 shows the parameter searching trial times of both algorithms. The right chart of Fig. 6 demonstrates the case where the target has relatively high motion; the saving of computational burden is as high as 66.8%. Even in the case where the target has low motion, illustrated in the left chart of Fig. 6, Algorithm 1 can still lower the computational complexity by 27.1%.
Since only scalar calculations are involved in the adaptive Kalman prediction of the initial searching point, the proposed algorithm can be implemented real time at a rate of 30fps using C codes on a Pentium-4 1.7GHz PC.
Fig. 1. Curves of the horizontal location of the target. The curve with circles represents the actual horizontal target location of every frame, and the curve with crosses depicts the predicted horizontal target location before every new frame is input.
Fig. 2. Curves of the vertical location of the target. The meanings of different types of curves are the same as in Fig. 1.
Fig. 3. Curves of the target scale. The meanings of different types of curves are the same as in Fig. 1.
Fig. 4. Algorithm 2 fails to keep track of the vehicle when facing strong interferences from the background. Frame 1, frame 23 and frame 50 are displayed from left to right.
Fig. 5. Algorithm 1 tracks the vehicle perfectly all the time in spite of the existence of strong interferences from the background. Frame 1, frame 23 and frame 50 are displayed from left to right.
Fig. 6. Curves of parameter searching trial times over frame indices. The curves with circles show the result of Algorithm 2, and the curves with crosses illustrate the result of Algorithm 1. The left chart demonstrates the case where target has low motion, and the right chart, high motion.
4 Conclusion

In this paper we propose an algorithm which adaptively predicts the possible coordinate transform parameters for the next frame and selects them as the initial searching point when looking for the real transform parameters. By doing so, tracking algorithms have less risk of being distracted by local maxima resulting from interferences, and tracking
performance is thus improved. We use an adaptive Kalman filter to achieve this purpose, but instead of directly filtering the values of transform parameters, we apply the Kalman filter on the changing rate of those parameters to effectively predict their future values. Moreover, we quantitatively analyze the cause of the model noises in the Kalman filter and derive their analytical expressions, so that the Kalman filter in our algorithm is automatically and correctly tuned when the characteristics of target motion change over time, or the searching algorithm uses different searching step sizes. Experimental results show that our proposed algorithm considerably promotes tracking stability while substantially decreasing computational complexity.
References
1. Baker, S., Matthews, I.: Lucas-Kanade 20 Years on: A Unifying Framework. Int'l J. Computer Vision, Vol. 53, No. 3 (2004) 221-255
2. Matthews, I., Ishikawa, T., Baker, S.: The Template Update Problem. IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. 26, No. 6 (2004) 810-815
3. Kaneko, T., Hori, O.: Template Update Criterion for Template Matching of Image Sequences. Proc. IEEE Int'l Conf. Pattern Recognition, Vol. 2 (2002) 1-5
4. Peacock, A.M., Matsunaga, S., Renshaw, D., Hannah, J., Murray, A.: Reference Block Updating When Tracking with Block Matching Algorithm. Electronic Letters, Vol. 36 (2000) 309-310
5. Smith, C., Richards, C., Brandt, S., Papanikolopoulos, N.: Visual Tracking for Intelligent Vehicle-highway Systems. IEEE Trans. on Vehicle Technology (1996)
6. Papanikolopoulos, N., Khosla, P., Kanade, T.: Visual Tracking of a Moving Target by a Camera Mounted on a Robot: A Combination of Control and Vision. IEEE Trans. on Robotics and Automation, Vol. 9 (1993) 14-35
7. Wang, Y., Ostermann, J., Zhang, Y.Q.: Video Processing and Communications. Prentice Hall (2002) 159-161
8. Jain, J., Jain, A.: Displacement Measurement and Its Application in Interframe Image Coding. IEEE Trans. on Communications, Vol. 33 (1981) 1799-1808
9. Liu, L.K., Feig, E.: A Block-based Gradient Descent Search Algorithm for Block Motion Estimation in Video Coding. IEEE Trans. on Circuits and Systems for Video Technology, Vol. 6 (1996) 419-422
10. Brown, R.G., Hwang, P.Y.C.: Introduction to Random Signals and Applied Kalman Filtering. John Wiley (1992)
11. Blake, A., Curwen, R., Zisserman, A.: A Framework for Spatio-Temporal Control in the Tracking of Visual Contour. Int'l J. Computer Vision, Vol. 11, No. 2 (1993) 127-145
12. Isard, M., Blake, A.: CONDENSATION – Conditional Density Propagation for Visual Tracking. Int'l J. Computer Vision, Vol. 29, No. 1 (1998) 5-28
13. Sezgin, M., Birecik, S., Demir, D., Bucak, I.O., Cetin, S., Kurugollu, F.: A Comparison of Visual Target Tracking Method in Noisy Environments. Proc. IEEE Int'l Conf. IECON, Vol. 2 (1995) 1360-1365
An Automatic Personal TV Scheduler Based on HMM for Intelligent Broadcasting Services

Agus Syawal Yudhistira1, Munchurl Kim1, Hieyong Kim2, and Hankyu Lee2

1 School of Engineering, Information and Communications University (ICU), 119 Munji Street, Yuseong-Gu, Daejeon, 305-714, Republic of Korea
{agus, mkim}@icu.ac.kr
2 Broadcasting Media Research Division, Electronics and Telecommunications Research Institute, 161 Gajeong-dong, Yuseong-gu, Daejeon, 305-700, Republic of Korea
{hykim, lhk}@etri.re.kr
Abstract. In future television broadcasting, a flood of information from various sources will not always be welcomed by everyone. The ability to access specific information as required is becoming a necessity. We are interested in making the life of the television consumer easier by providing an intelligent television set which can adaptively propose certain shows to the viewer based on the shows the user has watched in the past. The proposed method utilizes a Hidden Markov Model (HMM) to model the user's preference for the genres the viewer will watch, based on the genres recorded over several weeks. The results are satisfactory for users who have moderately to highly consistent preferences in their television consumption. Keywords: Hidden Markov Model (HMM), personalization, multimedia contents.
1 Introduction

Television broadcasting services are entering a new era, especially with the availability of digital TV, the Internet, etc., which profoundly affect TV viewers. A flood of TV programs coming from different sources is not always welcome or desirable. The availability of many TV programs at the TV viewer's side entails the difficulty of finding preferred TV programs. TV viewers must set aside a considerable amount of time for searching TV programs and tailoring the available published TV program schedules into their personalized schedules. Worst is the situation when no TV program schedule is available, so that TV viewers must find their preferred TV program contents by hopping across the TV channels. TV viewers can even miss their preferred contents while searching and tailoring TV program schedules for themselves. Our interest is to make the generation of personalized TV program schedules possible. The personalized TV program scheduler is based on a Hidden Markov Model (HMM) and utilizes the TV viewing history data. The HMM is a well-known method to model and predict processes based on available past behavior information and data. In this paper, our objective is to build a
model that predicts the TV watching behaviors in the chronological order of the TV genres that a TV viewer is likely to watch. We assume that a TV viewer exhibits a consistent TV watching pattern in his/her TV watching history data. The transition from genre to genre in a day while watching TV can be considered a stochastic process that can be modeled with the HMM. For modeling the chronological order of the watched TV program sequences for each day, we explicitly couple genre and channel in building our model. The TV watching history data consists of TV program titles with their respective genres, channels, watched times and durations, etc. To model our automatic personal TV scheduler, the HMM is trained with the TV watching history data for each day. The whole possible TV watching time band is defined from 6:00 P.M. to 12:00 A.M. (midnight) for one day. We assume that the average duration of TV watching time on a specific program is greater than 20 minutes, so the whole time band for one day is segmented into 20-minute subintervals. The TV watching behavior is observed as follows: (1) it is observed what genres of TV programs a TV viewer has watched every 20 minutes; (2) the transition probabilities of the genres are computed every 20 minutes; (3) the TV watching behavior is then regarded as the most probable transition sequence of genres in TV programs; (4) an estimated sequence of TV channels can be presented to the TV viewer according to the most probable transition sequence of genres. This paper is organized as follows: Section 2 briefly introduces the HMM to keep the paper self-contained; Section 3 presents the modeling of a personal TV scheduler by applying the HMM to the usage history data of TV watching; Section 4 shows the experimental results; and we conclude our approach in Section 5.
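As a small illustration of this preprocessing step, the Python sketch below maps one day's viewing history onto the eighteen 20-minute slots of the 18:00-24:00 band; the log format and the genre labels are assumptions made only for the example.

    from datetime import datetime, timedelta

    def genre_sequence_for_day(watch_log, day, slot_minutes=20):
        """Turn one day's watching history into a 20-minute genre observation sequence.

        watch_log -- list of (start: datetime, duration_minutes: int, genre: str)
        day       -- date of interest; the time band is 18:00-24:00 as in the paper
        """
        band_start = datetime.combine(day, datetime.min.time()) + timedelta(hours=18)
        n_slots = (6 * 60) // slot_minutes                 # 18 slots of 20 minutes
        slots = [None] * n_slots
        for start, dur, genre in watch_log:
            end = start + timedelta(minutes=dur)
            for s in range(n_slots):
                slot_t = band_start + timedelta(minutes=s * slot_minutes)
                if start <= slot_t < end:                  # this genre was on in the slot
                    slots[s] = genre
        return slots    # e.g. ['news', 'news', None, 'drama', ...]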
2 Hidden Markov Model The HMM is a statistical model of sequential data with a doubly stochastic process. One of the processes is not directly observable and is termed hidden. Observation is conducted through the other stochastic process, which produces a sequence of observable data. This observable stochastic process is a probabilistic function of the hidden stochastic process. Although we say the underlying stochastic process is hidden, this does not mean that the states are modeled totally arbitrarily by guessing. As in our problem, it is possible to observe the channel or genre transitions directly and to know how many channels and genres are available. But in some other areas this information may not be available. One application that extensively uses HMMs is speech recognition, but the HMM can also be applied to many other kinds of problems, such as image analysis, language identification, DNA sequencing, handwriting and text recognition, signal processing, and climatology. Defined formally, an HMM λ is a 5-tuple
$$\lambda = (S, V, \Pi, A, B) \qquad (1)$$

where $S = \{s_1, s_2, \ldots, s_N\}$ is a finite set of $N$ states, $V = \{v_1, v_2, \ldots, v_M\}$ is a set of $M$ possible symbols that can be emitted from each state, $\Pi = \{\pi_i\}$ are the initial state
probabilities, $A = \{a_{ij}\}$ are the state transition probabilities, and $B = \{b_j(k)\}$ are the output or emission probabilities ($B$ is also called the confusion matrix). In compact and ordinary form, the definition is written as a triplet:

$$\lambda = (\Pi, A, B). \qquad (2)$$
A more detailed description of each parameter is as follows:
1. $N$ is the number of states in the model. Thus, although the states may be hidden, we are assuming that the number of states is finite.
2. $M$ is the number of distinct observation symbols per state.
3. $\pi_i$ is the probability that the system starts at state $i$.
4. $a_{ij}$ is the probability of going to state $j$ from state $i$.
5. Lastly, $b_j(k)$ is the probability of emitting symbol $v_k$ at state $j$.
Of course, the following constraints must be fulfilled, as in every Markov model:

$$\sum_{i=1}^{N} \pi_i = 1 \qquad (3)$$

$$\sum_{j=1}^{N} a_{ij} = 1 \quad \text{for } 1 \le i \le N \qquad (4)$$

$$\sum_{k=1}^{M} b_j(k) = 1 \quad \text{for } 1 \le j \le N. \qquad (5)$$
These constraints assume that the model is a stationary process that does not vary in time. In our case, two special non-emitting states are added: a starting state s_start and an ending state s_end, in addition to the other ordinary emitting states. These states do not have output probability distributions associated with them, but they do have transition probabilities. s_start is always the first state of the model, from which transitions to the other states begin. Thus, the transition probabilities of this state are the initial state probabilities Π themselves. s_end always comes last, toward which the transitions from the other states converge. No transition is possible out of s_end. For more details, readers are recommended to refer to [1][2]. There are three problems to solve with an HMM: 1. Given the model λ and observation sequences, how do we compute the probability of a particular output? This is the evaluation problem. 2. Given the model λ and observation sequences, how can we find the sequence of hidden states that best explains the observations? This is the decoding problem. 3. Given the observation sequences, how can we optimize the model parameters? This is the training problem. For training the HMM, we can choose channels or genres as our observation sequences. The training problem is solved by the Baum-Welch algorithm, also known as the forward-backward algorithm. This method consists of two recursive computations: the forward recursion and the backward recursion. We use the Baum-Welch algorithm to train our HMM-based model [4].
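For illustration only, a minimal NumPy sketch of the forward recursion used for the evaluation problem is given below; the toy parameters and variable names are assumptions for the example and are not taken from the experiments.

```python
import numpy as np

def forward_probability(pi, A, B, observations):
    """Evaluation problem: P(observations | model) via the forward recursion."""
    # pi: (N,) initial state probabilities; A: (N, N) transitions a_ij;
    # B: (N, M) emissions b_j(k); observations: list of symbol indices.
    alpha = pi * B[:, observations[0]]      # initialization
    for o in observations[1:]:
        alpha = (alpha @ A) * B[:, o]       # induction
    return alpha.sum()                      # termination

# Toy example: 2 hidden genres, 3 observable channels (illustrative values).
pi = np.array([0.6, 0.4])
A = np.array([[0.7, 0.3],
              [0.4, 0.6]])
B = np.array([[0.5, 0.4, 0.1],
              [0.1, 0.3, 0.6]])
print(forward_probability(pi, A, B, [0, 2, 1]))
```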
3 Modeling a Personal TV Scheduler Using HMM The availability of TV watching history data is necessary to employ the HMM: that is, what genre of program contents the person watched, when they were watched, how long they were watched, and on which TV channel they were broadcast. In order to implement the HMM to estimate the chronological order of the genre sequences during a day, we choose the genres as the underlying states and the channels as the observations. We assume that TV viewers are allowed to access all the available channels and to watch program contents in all kinds of genres.
Fig. 1. An example of a six-state (genre) fully connected Markov model. Both start and end states are non-emitting, while the other six states emit observation symbols (channels). The number of states is determined by the genre types.
TV viewers usually watch TV program contents on several different channels at any time. This presents us with an ergodic model, as shown in Figure 1. Notice that there is no edge connecting the start state with the end state. In order for the HMM to be an effective model, the stochastic process should be stationary, not varying in time. Since TV program schedules do vary significantly over the course of days, weeks and months, only a limited range of time can be taken and considered invariant. We divide one day into time bands of 2 hours, each of which is represented by one HMM. Since TV programs are usually broadcast in a weekly cycle, the same time bands on the same days can be considered invariant across weeks, so the history data of TV watching from the same day during several weeks can be used for training. This is represented graphically in Fig. 2. As aforementioned, the time band of two hours is segmented into sub-time bands of 20 minutes, thus producing six sub-time bands. Each sub-time band represents an observation instance. In other words, we check the possibility of changing TV program contents every 20 minutes. Each 20-minute period is represented by one genre, namely the one that occupies most of the 20 minutes. For instance, if during that 20-minute period the viewer watches genre 1 for 12
minutes and genre 2 for the remaining 8 minutes, genre 1 is chosen to represent the period. This gives us a coarse estimate of what genre was actually consumed and what genre is expected. Both the genre and channel information of the day are represented in these 20-minute units.
Fig. 2. During a week cycle, a television viewer's history is divided into days. For each day the data is divided into 2-hour bands, each represented by an HMM.
For training the HMM, a windowing method is employed. We use training data obtained during 8 weeks for which the TV program schedules have remained unchanged.
3.1 Initial HMM Parameters
Rabiner states that, although in theory the re-estimation equations should give values of the HMM parameters corresponding to a local maximum of the likelihood function, experience has shown that either random (subject to the stochastic and nonzero-value constraints) or uniform initial estimates of the Π and A parameters are adequate for giving useful re-estimates of these parameters in almost all cases [1]. However, for the B parameters, good initial estimates are helpful in the discrete-symbol case [1]. Given the sequences of genres $G_r = \{g_r(t)\}$ and channels $C_r = \{c_r(t)\}$, $1 \le r \le R$, $1 \le t \le T_r$, where $R$ is the number of weeks inside a window and $T_r$ is the total number of genre sub-time bands in week $r$, the component $b_j(k)$ of $B$ is

$$b_j(k) = \frac{\sum_{r=1}^{R}\sum_{t=1}^{T_r} \mathrm{match}\big(c_r(t) = s_j,\; g_r(t) = v_k\big)}{\sum_{r=1}^{R} T_r} \qquad (6)$$

$$\mathrm{match}\big(c_r(t) = s_j,\; g_r(t) = v_k\big) = \begin{cases} 1 & \text{if } c_r(t) = s_j \wedge g_r(t) = v_k \\ 0 & \text{otherwise} \end{cases} \qquad (7)$$
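A minimal sketch of this relative-frequency initialization of B is given below, under the assumption that the channel and genre observations are already indexed per week and per 20-minute sub-time band (the data layout is an assumption of the sketch):

```python
import numpy as np

def initial_emission_matrix(channel_seqs, genre_seqs, n_states, n_symbols):
    """Initial B following Eqs. (6)-(7): relative frequency of (state, symbol) pairs."""
    counts = np.zeros((n_states, n_symbols))
    total_bands = 0
    for c_seq, g_seq in zip(channel_seqs, genre_seqs):   # one pair of sequences per week r
        for c, g in zip(c_seq, g_seq):                   # one pair per sub-time band t
            counts[c, g] += 1                            # match(c_r(t) = s_j, g_r(t) = v_k)
        total_bands += len(c_seq)
    return counts / total_bands                          # divide by sum_r T_r
```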
3.2 Employing the HMM We employ the HMM for prediction as follows. After the training is completed, a trained HMM is available with new parameters. Following the initial distribution Π and the transition parameter A, as in an ordinary Markov process, a sequence of predicted genres is obtained. From the refined parameter B, we can deduce the channel for each corresponding genre in the sequence. But the genre prediction does not always agree with what is actually broadcast by each TV channel during the day. An adjustment is therefore made according to the predicted channels and the actual schedule of each TV channel during the day. For instance, if for the period 7 P.M. to 7:20 P.M. the prediction is genre 1 from channel A, but according to the day's schedule channel A broadcasts genre 2, then we correct the prediction to genre 2 from channel A. We term this adjustment synchronization.
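The following sketch shows one possible way to generate a recommendation sequence from the trained parameters and to synchronize it against the actual daily schedule; the schedule lookup is a hypothetical helper, and the greedy most-probable-transition choice is an assumption about how the predicted sequence is read off the model.

```python
import numpy as np

def predict_and_synchronize(pi, A, B, schedule, n_bands=6):
    """Greedy genre/channel prediction for one 2-hour band, then synchronization."""
    # schedule[(band, channel)] -> genre actually broadcast in that sub-time band
    # (hypothetical lookup built from the reconstructed schedule table).
    recommendations = []
    state = int(np.argmax(pi))                        # most probable initial genre
    for band in range(n_bands):
        channel = int(np.argmax(B[state]))            # most probable channel for this genre
        genre = schedule.get((band, channel), state)  # correct the genre to the real schedule
        recommendations.append((genre, channel))
        state = int(np.argmax(A[state]))              # most probable next genre
    return recommendations
```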
4 Experiment and Result These experiments utilize the data of TV program viewing history from AC Nielson Korea Research Center, recorded by 2,522 people from December 2002 to May 2003. The history data consists of the following database fields:

Table 1. Experiment source database structure

Field Name        Description
id                TV viewer's ID
date              program broadcasting date
dayofweek         day of the week for the program
subscstart_t      beginning time point of watching a program
subscend_t        ending time point of watching a program
programstart_t    scheduled beginning time of the program
programend_t      scheduled ending time of the program
channel           channel of the program (6 channels)
genre1            genre of the program (8 genres)
Attention must be paid to the column genre1. The column does not provide reliable information on what genres a viewer watched, but only provides information on what genres of TV program contents were broadcast before the viewer changed a channel. It is possible, though, to reconstruct the correct schedule table of each channel from the database by aggregating programstart_t and programend_t of
several viewers. We use this resulting table to provide reliable genre information for the performance evaluation of our experiments. The six television channels in our experiment are MBC, KBS1, KBS2, SBS, EBS and iTV. Each of the broadcast channels provides TV program contents in eight genres: education, drama & movie, news, sports, children, entertainment, information and others. Also, in the experiments we define the whole time band for each day from 6 P.M. to midnight only. Data for the experiments is taken from several people who exhibit TV watching behaviors with medium to high consistency week by week. Consistency here means that a TV viewer watches television almost every week for 6 months, and during the day he/she usually watches television from 6 P.M. to midnight. It also means that each of the TV viewers exhibits a similar pattern in the chronological order of TV program genres that he/she has watched during the six months. Preparation for training proceeds as follows. First, both channel and genre information during the 8-week periods is filtered. Viewing records of less than 5 minutes are regarded as hopping and are excluded. The records are then divided into 2-hour bands and 20-minute sub-time bands. Thus we have 6 transitions for each 2-hour time band. Since we have 26 weeks of recorded data in our database, we can run experiments predicting 18 consecutive weeks. Each week is predicted using the previous 8 weeks of recorded data, and the expected result is validated using the actual recorded data of the week we try to estimate, as shown in Fig. 3.
Fig. 3. HMM training scheme using 8 weeks previous recorded history
Using the 8 weeks of data, the parameters of the HMM are built and then trained using the channel transition information. Predicted transitions of channel and genre can be obtained from the trained model. The evaluation of the trained HMM is shown in Figure 4. The evaluation is made as follows: each 20-minute sub-time band where the prediction matches the actual choice the viewer made, as recorded in the database, is given a score of one. Those that do not match are given a score of zero. This is done for the sequences of predicted channels and genres, respectively. The consistency of the person is computed with the same method by comparing two consecutive weeks. Given a set of genre transitions $G_w = \{g_w(t)\}$ and a set of predicted genre transitions $G'_w = \{g'_w(t)\}$, $1 \le w \le W$ and $1 \le t \le T_w$, where $W$ is the number of experiment weeks we predict and $T_w$ is the total number of genre sub-time bands in week $w$, the expected genre transition accuracy is
$$\text{genre prediction accuracy} = \frac{\sum_{w=1}^{W}\sum_{t=1}^{T_w} \mathrm{genre\_match}\big(g'_w(t), g_w(t)\big)}{\sum_{w=1}^{W} T_w} \qquad (8)$$

$$\mathrm{genre\_match}\big(g'_w(t), g_w(t)\big) = \begin{cases} 1 & \text{if } g'_w(t) = g_w(t) \\ 0 & \text{otherwise} \end{cases} \qquad (9)$$

The genre consistency is counted as

$$\text{genre consistency} = \frac{\sum_{w=2}^{W}\sum_{t=1}^{T_w} \mathrm{genre\_match}\big(g_{w-1}(t), g_w(t)\big)}{\sum_{w=1}^{W} T_w} \qquad (10)$$

$$\mathrm{genre\_match}\big(g_{w-1}(t), g_w(t)\big) = \begin{cases} 1 & \text{if } g_{w-1}(t) = g_w(t) \\ 0 & \text{otherwise} \end{cases} \qquad (11)$$

And given a set of channel transitions $C_w = \{c_w(t)\}$ and a set of predicted channel transitions $C'_w = \{c'_w(t)\}$, $1 \le w \le W$ and $1 \le t \le T_w$, where $W$ is the number of weeks we predict and $T_w$ is the total number of channel sub-time bands in week $w$, the expected channel transition accuracy is

$$\text{channel prediction accuracy} = \frac{\sum_{w=1}^{W}\sum_{t=1}^{T_w} \mathrm{channel\_match}\big(c'_w(t), c_w(t)\big)}{\sum_{w=1}^{W} T_w} \qquad (12)$$

$$\mathrm{channel\_match}\big(c'_w(t), c_w(t)\big) = \begin{cases} 1 & \text{if } c'_w(t) = c_w(t) \\ 0 & \text{otherwise} \end{cases} \qquad (13)$$

And the channel consistency is counted as

$$\text{channel consistency} = \frac{\sum_{w=2}^{W}\sum_{t=1}^{T_w} \mathrm{channel\_match}\big(c_{w-1}(t), c_w(t)\big)}{\sum_{w=1}^{W} T_w} \qquad (14)$$

$$\mathrm{channel\_match}\big(c_{w-1}(t), c_w(t)\big) = \begin{cases} 1 & \text{if } c_{w-1}(t) = c_w(t) \\ 0 & \text{otherwise} \end{cases} \qquad (15)$$
The selection of test subjects is based on the consistency of the viewer, neglecting the viewer's profile.
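A small sketch of how Eqs. (8)-(15) can be computed from per-week sequences of genre or channel labels follows; the nested-list data layout is an assumption of the sketch.

```python
def prediction_accuracy(predicted_weeks, actual_weeks):
    """Eqs. (8)-(9) and (12)-(13): fraction of sub-time bands predicted correctly."""
    hits = sum(int(p == a)
               for pred, act in zip(predicted_weeks, actual_weeks)
               for p, a in zip(pred, act))
    total = sum(len(act) for act in actual_weeks)     # sum over w of T_w
    return hits / total

def consistency(actual_weeks):
    """Eqs. (10)-(11) and (14)-(15): agreement between consecutive weeks."""
    hits = sum(int(x == y)
               for prev, cur in zip(actual_weeks[:-1], actual_weeks[1:])
               for x, y in zip(prev, cur))
    total = sum(len(act) for act in actual_weeks)     # same denominator as in Eqs. (10)/(14)
    return hits / total
```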
Fig. 4. Evaluating expected transition of genres (and channels) of week 9 with actual recorded data of that week
Tables 2 and 3 present the experimental results for the selected viewers under two different scenarios: the prediction accuracies with the genre as state and with the channel as state, respectively. Even with highly consistent data, Table 3 shows lower accuracy than Table 2; in particular, the channel prediction is likely to fail together with the genre prediction. Table 2. Prediction accuracy: Genre as state and channel as observation
Under the heading estimation, there are three columns. The first and second columns are the results of prediction without adjustment. The last column, sync. genre, shows the accuracy of genre prediction with synchronization as stated in Section 3.2. This cannot be done in the scenario whose results appear in Table 3. User ID 115434206 (Wednesday, second model) shows a small anomaly, which can be explained. During the 2-hour period the person watched mostly one kind of genre provided by several channels. The high accuracy of genre estimation (without synchronization) shows that the prediction stays in one kind of genre. Yet, for each state, only one channel is assigned the highest probability. The transition of channels is not adequately reflected in this case
and thus lowers the accuracy. This can be an indication that the 2-hour band is not stationary enough, but the overall result is quite good for coarse deduction and content advising. Viewers who are less consistent in their TV consumption showed, as expected, low prediction accuracy. The reason why the second scenario does not provide good results is that the confusion matrix (genre) only reflects the distribution of genres broadcast by the channel. Thus, if the channel prediction fails, the genre prediction will also fail. This is different from the first scenario, where the genre probability is summarized over all available channels. Table 3. Prediction accuracy: Channel as state and genre as observation
5 Conclusion In this paper, we propose an automatic personal TV scheduler that is modeled using an HMM. The usage history data of TV watching has been used to train and to test the proposed personal TV scheduler. The double stochastic property of the HMM is exploited to synchronize the channel preference with the expected genre preference. It provides satisfactory accuracy in predicting a TV viewer's preference for viewers with medium to high consistency of TV consumption.
References 1. Rabiner, L.R.: A Tutorial on Hidden Markov Models and Selected Application in Speech Recognition. Proc. IEEE 77 (1989) 257-285. 2. Zhai, Cheng Xiang: A Brief Note on the Hidden Markov Models (HMMs). Department of Computer Science University of Illinois at Urbana-Champaign (2003). 3. Stamp, Mark: A Revealing Introduction to Hidden Markov Models. Department of Computer Science San Jose State University (2004). 4. Young, S., Evermann, G., Gales, M., et al.: The HTK Book. Cambridge University Engineering Department (2005) 6-8, 128-130.
Increasing the Effect of Fingers in Fingerspelling Hand Shapes by Thick Edge Detection and Correlation with Penalization Oğuz Altun and Songül Albayrak Yıldız Technical University, Computer Engineering Department, Yıldız, İstanbul, Türkiye {oguz, songul}@ce.yildiz.edu.tr
Abstract. Fingerspelling is used in sign language to spell out names of people and places for which there is no sign or for which the sign is not known. In this work we describe a method for increasing the effect of fingers in Fingerspelling hand shapes. Hand shape objects are obtained by extraction of representative frames, color segmentation in YCrCb space and angle of least inertia based fast alignment [1]. Thick edges of the hand shape objects are extracted with a distance to edge based method. Finally a calculation that penalizes similarity for not-corresponding pixels is employed to correlation based template matching. The experimental Turkish fingerspelling recognition system recognizes all 29 letters of the Turkish alphabet. The train video database is created by three signers, and has a set of 290 videos. The test video database is created by four signers, and has a set of 203 videos. Our methods achieve a success rate of 99%. Keywords: Turkish Fingerspelling Recognition, Fast Alignment, Angle of orientation, Axis of Least Inertia, Thick Edges, Correlation with Penalization.
1 Introduction Sign Language is the language used mainly by deaf people and people with hearing difficulties as a visual means of communication by gestures, facial expression, and body language. There are two major types of communication in sign languages: word based and letter based. The first one has word based sign vocabulary, and gestures, facial expression, and body language are used for communicating these words. The second one has letter based vocabulary, and is called fingerspelling. It is used to spell out names of people and places for which there is no sign and can also be used to spell words for signs that the signer does not know the sign for, or to clarify a sign that is not known by the person reading the signer [2]. Sign languages develop specific to their communities and are not universal. For example, ASL (American Sign Language) is totally different from British Sign Language even though both countries speak English [3]. Quite a number of Sign Language recognition systems are reported amongst which are American Sign Language (SL) [4], Australian SL [5], and Chinese SL [6].
Previous approaches to word-level sign recognition rely heavily on statistical models such as Hidden Markov Models (HMMs). A real-time ASL recognition system developed by Starner and Pentland [4] used colored gloves to track and identify left and right hands. They extracted global features that represent the positions, the angle of the axis of least inertia, and the eccentricity of the bounding ellipse of the two hands. Using an HMM recognizer with a known grammar, they achieved 99.2% accuracy at the word level for 99 test sequences. For TSL (Turkish Sign Language), Haberdar and Albayrak [7] developed a TSL recognition system from video using HMMs for the trajectories of the hands. The system achieved a word accuracy of 95.7% by concentrating only on the global features of the generated signs. The developed system is the first comprehensive study on TSL and recognizes 50 isolated signs. This study was improved with local features and performs person-dependent recognition of 172 isolated signs in two stages with an accuracy of 93.31% [8]. For fingerspelling recognition, the most successful approaches are based on instrumented gloves, which provide information about finger positions. Lamar and Bhuiyant [9] achieved letter recognition rates ranging from 70% to 93%, using colored gloves and neural networks. More recently, Rebollar et al. [10] used a more sophisticated glove to classify 21 out of 26 letters with 100% accuracy; the worst case, letter 'U', achieved 78% accuracy. Isaacs and Foo [11] developed a two-layer feed-forward neural network that recognizes the 24 static letters of the American Sign Language (ASL) alphabet from still input images; their ASL fingerspelling recognition system achieves 99.9% accuracy with an SNR as low as 2. Feris, Turk and others [12] used a multi-flash camera with flashes strategically positioned to cast shadows along depth discontinuities in the scene, allowing efficient and accurate hand shape extraction. Altun et al. [1] used axis of least inertia based fast alignment and compared different classification algorithms. They achieved 98.83% accuracy on 174 videos of the 29 Turkish alphabet letters; 1NN and SVM were the best classifiers. In this work, we have developed a signer-independent fingerspelling recognition system for Turkish Sign Language (TSL). The representative frames are extracted from sign videos. Hand objects in these frames are segmented out by skin color in YCrCb space. These hand objects are aligned using the angle of orientation based fast alignment method. Then, the aligned object is moved into the center of a minimum bounding square and resized. Thick edges of the object in the minimum bounding square are used for correlation with penalization based template matching. The remainder of this paper is organized as follows: In Section 2 we describe the representative frame extraction, skin detection by color, our fast alignment method, and thick edge extraction. We explain the correlation with penalization algorithm in Section 3. Section 4 covers the video database we use, and supplies a table of the distribution of the number of signers in the test and train databases. In Section 5 we evaluate the experimental results, and we supply a table that shows the effects of the methods we use. Section 6 has a discussion of assumptions and future work. Finally, conclusions are addressed in Section 7.
Fig. 1. Representative frames for all 29 letters in Turkish Alphabet
2 Feature Extraction Contrary to Turkish Sign Language word signs, Turkish fingerspelling signs, because of their static structure, can be discriminated by shape alone by use of a representative frame. To take advantage of this and to increase processing speed, these representative frames are extracted and used for recognition. Fig. 1 shows representative frames for all 29 Turkish Alphabet letters. In each representative frame, hand regions are determined by skin color. From the binary images that show hand and background pixels, the regions we are interested in are extracted, aligned and resized as explained in [1]. Finally, from those, the thick edges are extracted for template matching. 2.1 Representative Frame Extraction In a Turkish fingerspelling video, representative frames are the ones with least hand movement. Hence, the frame with minimum distance to its successor is chosen as the representative. The distance between successive frames f and f+1 is given by the sum of the city block distances between corresponding pixels [1]:

$$D_f = \sum_n \big( |\Delta R_n^f| + |\Delta G_n^f| + |\Delta B_n^f| \big), \qquad (1)$$

where $f$ iterates over frames, $n$ iterates over pixels, $R$, $G$, $B$ are the components of the pixel color, $\Delta R_n^f = R_n^{f+1} - R_n^f$, $\Delta G_n^f = G_n^{f+1} - G_n^f$, and $\Delta B_n^f = B_n^{f+1} - B_n^f$.
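A minimal NumPy sketch of this representative-frame selection is given below; the frame list format is an assumption.

```python
import numpy as np

def representative_frame(frames):
    """Return the frame with the minimum city-block distance to its successor (Eq. 1)."""
    # frames: list of H x W x 3 uint8 RGB arrays from one fingerspelling video.
    distances = [np.abs(frames[f + 1].astype(int) - frames[f].astype(int)).sum()
                 for f in range(len(frames) - 1)]
    return frames[int(np.argmin(distances))]
```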
Fig. 2. (a) Original image and detected skin regions after pixel classification, (b) result of the morphological opening
2.2 Skin Detection by Color For skin detection, the YCrCb color space has been found to be superior to other color spaces such as RGB and HSV [13]. Hence we convert the pixel values of images from the RGB color space to YCrCb using (2):

$$Y = 0.299R + 0.587G + 0.114B, \quad Cr = R - Y, \quad Cb = B - Y. \qquad (2)$$

In order to decrease noise, all Y, Cr, and Cb components of the image are smoothed with the 2-D Gaussian filter given by (3),

$$F(x, y) = \frac{1}{2\pi\sigma^2} \exp\!\left( -\frac{x^2 + y^2}{2\sigma^2} \right) \qquad (3)$$

where σ is the standard deviation. Chai and Bouzerdom [14] report that pixels that belong to the skin region have similar Cr and Cb values, and give a distribution of the pixel colors in the Cr-Cb plane. Consequently, we classify a pixel as skin if its Y, Cr, Cb values fall inside the ranges 135 < Cr < 180, 85 < Cb < 135 and Y > 80 (Fig. 2a). As in Altun et al. [1], skin detection is completed after clearing small skin-colored regions by morphological opening (Fig. 2b). 2.3 Fast Alignment for Correlation Based Template Matching Template matching is very sensitive to size and orientation changes. Hence a scheme that can compensate for size and orientation changes is needed. Eliminating orientation information entirely is not appropriate, however, as depicted in Fig. 3. Figs. 3a-b show two 'C' signs that we must be able to match to each other, so we must compensate for the small orientation difference. In Figs. 3c-d we see two 'U' signs that we need to differentiate from 'C' signs. 'U' signs and 'C' signs are quite similar to each other in shape; luckily, orientation is a major differentiator. As a result we need a scheme that
Fig. 3. (a)-(b) The 'C' sign by two different signers. (c)-(d) The 'U' sign by two different signers.
Fig. 4. Axis of least second moment and the angle of orientation θ
not only can compensate for small orientation differences of hand regions, but is also responsive to large ones. We use the fast alignment method [1] that makes the angle of orientation θ zero. The angle of orientation is the angle between the y axis and the axis of least second moment (shown in Fig. 4), and is given by (4):

$$2\theta = \arctan\!\left( \frac{2 M_{11}}{M_{20} - M_{02}} \right), \quad M_{11} = \sum_x \sum_y x\, y\, I(x, y), \quad M_{20} = \sum_x \sum_y x^2\, I(x, y), \quad M_{02} = \sum_x \sum_y y^2\, I(x, y), \qquad (4)$$

where I(x, y) = 1 for pixels on the object, and 0 otherwise. The bounding square is defined as the smallest square that can completely enclose all the pixels of the object [1]. After putting the image in the center of a bounding square, and then resizing the bounding square to a fixed, smaller resolution, the fast alignment process ends (Fig. 5). 2.4 Thick Edge Extraction Most edge detection algorithms give edges that are only one pixel wide. We propose a method to adjust the thickness of the edges without damaging the outer contours of the objects. Our method also adjusts the importance of the edge pixels for correlation.
Fig. 5. Stages of fast alignment. (a) Original frame. (b) Detected skin regions. (c) Region of Interest (ROI). (d) ROI rotated according to the angle of orientation. (e) Resized bounding square with the object in the center.
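The rotation in stage (d) of Fig. 5 relies on the angle of orientation of Eq. (4). A short sketch of this computation is given below; the use of centroid-relative coordinates and arctan2 is an assumption about how the formula is applied, not a detail stated in the paper.

```python
import numpy as np

def orientation_angle(binary):
    """Angle theta between the y axis and the axis of least second moment (Eq. 4)."""
    ys, xs = np.nonzero(binary)          # object pixels, where I(x, y) = 1
    xs = xs - xs.mean()                  # shift to the centroid so that the axis of
    ys = ys - ys.mean()                  # least inertia passes through the object
    m11 = np.sum(xs * ys)
    m20 = np.sum(xs ** 2)
    m02 = np.sum(ys ** 2)
    return 0.5 * np.arctan2(2.0 * m11, m20 - m02)   # theta in radians
```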
Assume a binary image b, where pixels with values equal to zero (zero-pixels) are background, the rest of the pixels (non-zero-pixels) represent objects, and non-zero-pixels have the value Max (e.g., 255). Then, given an edge image e produced from the binary image by an edge detector, the thick edge image t can be produced by the algorithm in Fig. 6:
• Let x, y, u, and v represent pixel coordinates.
• For each non-zero pixel b[x,y] of the binary image b,
  o Let e[u,v] be the non-zero pixel in the edge image whose coordinates u, v have the closest Euclidean distance to the coordinate [x,y]. Call that distance dist[x,y].
  o t[x,y] = Max – steepness*dist[x,y],
  o If t[x,y] < 0, t[x,y] = 0.

Fig. 6. Thick Edge extraction algorithm
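A possible implementation of the thick edge extraction of Fig. 6 using a Euclidean distance transform is sketched below; the use of SciPy's distance transform is an implementation choice, not part of the original description.

```python
import numpy as np
from scipy.ndimage import distance_transform_edt

def thick_edges(binary, edges, steepness=40.0, max_val=255.0):
    """Thick edge image t following the algorithm of Fig. 6."""
    # Distance from every pixel to its nearest edge pixel: distance_transform_edt
    # measures the distance to the nearest zero element, so the edge image is inverted.
    dist = distance_transform_edt(edges == 0)
    t = max_val - steepness * dist      # t[x,y] = Max - steepness * dist[x,y]
    t[t < 0] = 0                        # clip negative values to zero
    t[binary == 0] = 0                  # only object (non-zero) pixels receive a value
    return t
```

Larger steepness values confine the non-zero response to a narrow band around the contour, matching the behavior illustrated in Fig. 7.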
The steepness helps to adjust the thickness and the relative importance of pixels on edges, as seen in Fig. 7.
Fig. 7. Thick edges of the same binary image. The steepness value increases from left to right.
In a fingerspelling shape, most of the information is in the fingers of a hand, and not in the palm (or back of the hand). Thick edge extraction increases contribution of the fingers to the matching result.
3 Correlation with Penalization Similarity of two templates is calculated by the algorithm summarized in Fig. 8. Image pixels take zero or positive values. By using the algorithm in Fig. 8 for correlation based template matching, we not only award the coincidence of two non-zero values, but also penalize the coincidence of a zero with a non-zero value. This ensures that the matching shapes have more matching edges, and less nonmatching ones.
• For each of the corresponding pixel values first and second of the two templates
  o If one of first or second is equal to zero, and the other is not,
    similarity = similarity – (first² + second²)*penalty
  o Else,
    similarity = similarity + first * second

Fig. 8. Correlation with penalization algorithm. The factor penalty helps in adjusting the amount of penalization.
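A vectorized NumPy sketch of this similarity measure is given below; the default penalty value is illustrative only.

```python
import numpy as np

def penalized_correlation(first, second, penalty=1.0):
    """Similarity of two thick-edge templates, following the algorithm of Fig. 8."""
    first = first.astype(np.float64)
    second = second.astype(np.float64)
    mismatch = (first == 0) ^ (second == 0)     # exactly one of the two pixels is zero
    similarity = np.sum(first[~mismatch] * second[~mismatch])
    similarity -= penalty * np.sum(first[mismatch] ** 2 + second[mismatch] ** 2)
    return similarity
```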
4 Video Database The train and test videos are acquired by a Philips PCVC840K CCD webcam. The capture resolution is set to 320x240 with 15 frames per second (fps). While programming is done in C++, the Intel OpenCV library routines are used for video capturing and some of the image processing tasks. We develop a Turkish Sign Language fingerspelling recognition system for the 29 static letters in Turkish Alphabet. The training set is created using three different signers. For training, they sign a total of 10 times for each letter, which sums up to 290 training videos. For testing, they and an additional signer sign a total of 7 times for each letter, which sums up to 203 test videos. Table 1 gives a summary of the distribution of the train and test video numbers for each signer. Notice that training and test sets are totally separated, and test set has the videos of a signer for whom there is no video in the train set. Table 1. Distribution of train and test video numbers for each signer
         Signer 1        Signer 2        Signer 3        Signer 4        Total
         Train  Test     Train  Test     Train  Test     Train  Test     Train  Test
A          4      2        4      2        2      2        0      1       10      7
…          …      …        …      …        …      …        …      …        …      …
Z          4      2        4      2        2      2        0      1       10      7
Total                                                                     290    203
5 Experimental Results Fast alignment, thick edges and correlation with penalization produced a success rate of 99% out of 203 test instances of the 29 Turkish fingerspelling videos. That is, the system gives just two errors. The effect of thick edge extraction and correlation with penalization can be seen in Table 2. Fast alignment is used in all experiments, and it alone gives only 4 errors. An interesting point is that thick edge extraction decreases the success alone, but increases when combined with correlation with penalization. This shows how useful non-matching edge pixels are for the purpose of distinguishing between different hand shapes.
Table 2. Number of true classifications, number of false classifications and the success rate for different combinations of methods. Fast alignment is applied in all cases.
                 Correlation with Penalization        Correlation without Penalization
                 #True   #False   Success Rate (%)    #True   #False   Success Rate (%)
Thick Edges       201       2        99.02              195       8        96.06
No Thick Edges    198       5        97.54              199       4        98.03
6 Discussions and Future Work Even though we assume otherwise, not all letters in the Turkish alphabet are representable by one single frame, 'Ş' being an example. The sign of this letter involves some movement that differentiates it from 'S'. Still, representing the whole sign by one single frame is acceptable, since this work is actually a step towards a full-blown Turkish Sign Language recognition system that can also recognize word signs. That system will incorporate not only shape but also movement, and research on it is continuing. The importance of successful segmentation of the skin and background regions cannot be overstated. In this work we assume that there are no skin-colored background regions and use color based segmentation in YCrCb space. The system's success depends on that assumption; research on better skin segmentation is invaluable. Testing and training sets are created by multiple signers, and the test set has a signer for whom the training set has no video. The system is fast due to the single-frame video representation, the fast alignment process, and resizing the bounding square to a smaller resolution. The amount of resizing can be arranged for different applications. The fast alignment, thick edge extraction and correlation with penalization methods are robust to the problem of occlusion of the skin-colored regions (e.g., hands) with each other, because we do not try to find individual hands, and because our methods allow us to process occluding hand shapes as one big fingerspelling shape. The methods presented here would work equally well in the presence of a face in the frame, even though in this study we used a fingerspelling video database [1] that has only hand regions in each frame. Although we demonstrated these methods in the context of hand shape recognition, they are equally applicable to other problems where shape recognition is required, for example the problem of shape retrieval.
7 Conclusions A Turkish fingerspelling recognition system is tested and found to have 99.02% success rate. This high success rate is the result of the fast alignment, thick edge extraction, and correlation with penalization methods. Fast alignment process brings objects with similar orientation into same alignment, while bringing objects with high orientation difference into different alignment. This is a desired result, because for fingerspelling
recognition, shapes that belong to different letters can be very similar, and the orientation can be the main differentiator. After the alignment, to resize without breaking the alignment, the object is moved into the center of a minimum bounding square. Thick edge extraction increases the contribution of the thinner parts of the hand shapes (like fingers) to the result. This is also desired, since most of the information in a fingerspelling hand shape is in the fingers. As a final step, correlation with penalization helps to account not only for matching pixels, but also for pixels that do not match. Our research on converting this template matching strategy into a local descriptor is continuing.
References 1. Altun, O., Albayrak, S., Ekinci, A., Bükün, B.: Turkish Fingerspelling Recognition System Using Axis of Least Inertia Based Fast Alignment. The 19th Australian Joint Conference on Artificial Intelligence (AI 2006) (2006) 2. http://www.british-sign.co.uk/learnbslsignlanguage/whatisfingerspelling.htm. 3. http://www.deaflibrary.org/asl.html. 4. Starner, T., Weaver, J., Pentland, A.: Real-time American sign language recognition using desk and wearable computer based video. Ieee Transactions on Pattern Analysis and Machine Intelligence 20 (1998) 1371-1375 5. Holden, E.J., Lee, G., Owens, R.: Australian sign language recognition. Machine Vision and Applications 16 (2005) 312-320 6. Gao, W., Fang, G.L., Zhao, D.B., Chen, Y.Q.: A Chinese sign language recognition system based on SOFM/SRN/HMM. Pattern Recognition 37 (2004) 2389-2402 7. Haberdar, H., Albayrak, S.: Real Time Isolated Turkish Sign Language Recognition From Video Using Hidden Markov Models With Global Features. Lecture Notes in Computer Science LNCS 3733 (2005) 677 8. Haberdar, H., Albayrak, S.: Vision Based Real Time Isolated Turkish Sign Language Recognition. International Symposium on Methodologies for Intelligent Systems, Bari, Italy (2006) 9. Lamar, M., Bhuiyant, M.: Hand Alphabet Recognition Using Morphological PCA and Neural Networks. International Joint Conference on Neural Networks, Washington, USA (1999) 2839-2844 10. Rebollar, J., Lindeman, R., Kyriakopoulos, N.: A Multi-Class Pattern Recognition System for Practical Fingerspelling Translation. International Conference on Multimodel Interfaces, Pittsburgh, USA (200) 11. Isaacs, J., Foo, S.: Hand Pose Estimation for American Sign Language Recognition. Thirty-Sixth Southeastern Symposium on, IEEE System Theory (2004) 132-136 12. Feris, R., Turk, M., Raskar, R., Tan, K.: Exploiting Depth Discontinuities for VisionBased Fingerspelling Recognition. 2004 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops(CVPRW'04) (2004) 13. Sazonov, V., Vezhnevetsi, V., Andreeva, A.: A survey on pixel vased skin color detection techniques. Graphicon-2003 (2003) 85-92 14. Chai, D., Bouzerdom, A.: A Bayesian Approach To Skin Colour Classification. TENCON2000 (2000)
Binary Edge Based Adaptive Motion Correction for Accurate and Robust Digital Image Stabilization Ohyun Kwon1,*, Byungdeok Nam1, and Joonki Paik2 1
DSC R&D Center, Optics & Digital Imaging Business, Samsung Techwin Co., Ltd 145-3, Sangdaewon 1-Dong, Jungwon-Gu, Sungnam-City, Kyungki-Do, Korea, 462-121
[email protected] 2 Image Processing and Intelligent Systems Laboratory, Department of Image Engineering, Graduate School of Advanced Imaging Science, Multimedia, and Film, Chung-Ang University
Abstract. Digital image stabilization (DIS) has become one of the most important features in consumer cameras. An image sequence is easily corrupted by handshaking during the acquisition process, especially at high zoom. Many state-of-the-art algorithms for image stabilization perform well for highly patterned and clear images. However, for images with high noise or a low pattern level, their performance degrades seriously. In this paper, we propose a novel digital image stabilization algorithm for the removal of unwanted motion in images with high noise or a low pattern level. The proposed algorithm consists of an adaptive Kalman filter for motion prediction and a binary edge transform (BET) based phase correlation (PC) for motion estimation. Since the proposed algorithm is designed for real-time implementation, it can be used as a DIS algorithm that preserves the conditions under which the photograph was taken in many commercial applications such as low-cost camcorders, digital cameras, CCTV, and surveillance video systems. Keywords: Digital image stabilization, motion estimation, motion correction, binary edge transform, motion discrimination, Kalman filter.
1 Introduction The demand for digital image stabilization techniques is rapidly increasing in many visual applications, such as camcorders, digital cameras, and video surveillance systems. Image stabilization refers to a digital image processing technique that registers an image sequence degraded by unstable camera motion. Conventional stabilization techniques, such as manual stabilization, through-the-lens auto-stabilization, and semi-digital auto-stabilization, cannot inherently deal with digital image stabilization. Image stabilization can be realized with fully digital auto-stabilization based on PC estimation and correction. In this paper, a novel shake-adaptive binary edge transformation, which detects most jitter in noisy images and better preserves the conditions under which the photograph was taken, is proposed. Furthermore, the proposed BET filter can be implemented with the minimum hardware expense of a single frame memory, while conventional temporal filters require several frame memories. Since stabilization
algorithms on noisy images often result in noise amplification, the proposed algorithm should be used prior to any image stabilization operation. The proposed algorithm can also be used in any optical instrument to overcome handshake impairments in the acquired image and thereby resulting in better image quality. The rest of the paper is organized as follows. Existing stabilization techniques and problem formulation are described in section 2. The proposed BET and adaptive motion correction algorithm is described in section 3. Simulation results and comparisons are shown in section 4. Finally, concluding remarks are outlined in section 5.
2 Existing State-of-the-Art Methods Image stabilization systems can be classified into three major types [1]: electronic, optical [2], and digital stabilizers. In DIS, various algorithms have been developed to estimate the local motion vectors, such as representative point matching (RPM) [3], edge pattern matching (EPM) [4], bit-plane matching (BPM) [5], one bit transform matching (1BTM) [6], hierarchical distributed template matching (HDTM) [7], and others [8][9][10]. The major objective of these algorithms is to reduce the computational complexity, in comparison with the full-search block-matching method, without losing too much accuracy and reliability. In general, RPM and 1BTM can greatly reduce the computational complexity in comparison with the other methods. However, they are sensitive to irregular conditions such as moving objects and intentional panning. Therefore, a reliability evaluation is necessary to screen out undesired motion vectors for these methods. In [1], an inverse triangle method is proposed to extract reliable motion vectors in plain images which lack features or contain large low-contrast areas, and a background evaluation model is developed to deal with irregular images which contain large moving objects. Using the HDTM method, only useful reference blocks that are indispensable for accurate motion estimation are selected based on their reliability and consistency, as proposed in [7]. However, these algorithms cannot cover a wide variety of irregular conditions, such as low-level images, and it is also hard to determine an optimum partial template block for discrimination under various conditions. Therefore, accurate estimation is necessary in poor conditions. In the motion correction of DIS, accumulated motion vector estimation [4] and frame position smoothing (FPS) [8] are the two most popular approaches. Accumulated motion vector estimation needs to compromise between stabilization and intentional panning preservation, since the panning condition causes a steady-state lag in the motion trajectory. FPS accomplishes a smooth reconstruction of the actual long-term camera motion by filtering out jitter components, based on designing a filter with an appropriate cut-off frequency or an adaptive fuzzy filter to continuously improve stabilization performance. However, these algorithms cannot handle unfeasible motion vectors from motion estimation.
3 Proposed Digital Image Stabilization Algorithm The proposed digital image stabilization algorithm uses the following two procedures to obtain a suitably registered image: (i) binary edge based phase correlation for motion estimation, and (ii) a Kalman and motion discrimination (MD) filter for motion correction, as shown in Fig. 1.
Fig. 1. Block diagram of the proposed digital image stabilization system
3.1 Motion Estimation Using Binary Edge Based Phase Correlation Binary edge transformation based phase correlation, in which frames are transformed into a binary (one bit per pixel) representation by comparing the original image frame against a shifted version of itself, provides a low-complexity and high-accuracy phase correlation motion estimation approach. For a simple and fast transfer from the original image frame to an edge image frame, we propose the shift difference edge (SDE) transformation in the form of
$$SDE(i, j) = I(i, j) - I(i - \alpha, j - \alpha), \qquad (1)$$

where $SDE(i, j)$ represents the edge image frame obtained by the shift difference transformation and $\alpha$ is a constant that controls the width of the edge line. The binary edge image $BE(i, j)$ is constructed with some threshold $th$ as

$$BE(i, j) = \begin{cases} 1, & SDE(i, j) > th \\ 0, & SDE(i, j) \le th \end{cases} \qquad (2)$$

The threshold $th$ is selected empirically. Each local motion vector $LMV_i$ obtained from a sub-image is compared with the predicted frame motion vector (PFMV), and the feasibility weight

$$d_i = \begin{cases} 1, & |LMV_i - PFMV| \le \beta \\ 0, & |LMV_i - PFMV| > \beta \end{cases} \qquad (4)$$
is used to filter the motion vectors, and the frame motion vector is decided based on the number of feasible motion vectors, which can be expressed in the form of

$$FMV = \frac{d_1 \cdot LMV_1 + d_2 \cdot LMV_2 + d_3 \cdot LMV_3 + d_4 \cdot LMV_4}{d_1 + d_2 + d_3 + d_4}, \qquad d_1 + d_2 + d_3 + d_4 \ne 0. \qquad (5)$$

The MD filter discriminates between proper and improper vectors among the four LMVs of the sub-images using the PFMV, and then calculates the frame motion vector using only the feasible motion vectors for the accuracy of DIS. The use of the adaptive Kalman and MD filter in the motion correction part of the stabilization system overcomes unfeasible motion vectors from the phase correlation.
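The following sketch illustrates the overall flow as reconstructed here: the binary edge transform of Eqs. (1)-(2), a phase-correlation peak search on one sub-image pair, and the feasibility-weighted combination of Eq. (5). The parameter values, the wrap-around handling of Eq. (1) at the image border, and the fallback when no LMV is feasible are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def binary_edge_transform(frame, alpha=2, th=10):
    """Eq. (1) shift difference edge followed by Eq. (2) thresholding."""
    shifted = np.roll(np.roll(frame, alpha, axis=0), alpha, axis=1)   # I(i - alpha, j - alpha)
    sde = frame.astype(int) - shifted.astype(int)
    return (sde > th).astype(np.float64)

def phase_correlation_shift(ref, cur):
    """Local motion vector of one sub-image pair from the phase-correlation peak."""
    cross = np.fft.fft2(ref) * np.conj(np.fft.fft2(cur))
    cross /= np.abs(cross) + 1e-12                     # normalized cross-power spectrum
    surface = np.real(np.fft.ifft2(cross))
    dy, dx = np.unravel_index(np.argmax(surface), surface.shape)
    if dy > ref.shape[0] // 2:                         # wrap large displacements to
        dy -= ref.shape[0]                             # negative shifts
    if dx > ref.shape[1] // 2:
        dx -= ref.shape[1]
    return float(dx), float(dy)

def frame_motion_vector(lmvs, pfmv, beta=4.0):
    """Feasibility-weighted combination of the four LMVs, Eq. (5)."""
    lmvs = np.asarray(lmvs, dtype=float)               # shape (4, 2)
    pfmv = np.asarray(pfmv, dtype=float)
    d = (np.linalg.norm(lmvs - pfmv, axis=1) <= beta).astype(float)   # weights d_i
    if d.sum() == 0:
        return tuple(pfmv)                             # no feasible LMV: keep the prediction
    return tuple((d[:, None] * lmvs).sum(axis=0) / d.sum())
```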
4 Experimental Results In order to further increase the speed of frame motion estimation, four sub-images designated as shown in Fig. 5(a) are used and local motion vectors are computed
using BE transform based phase correlation for all sub-images. We propose rectangular sub-images to extend the Region of Photographing (ROP) and to attempt to remove moving objects in the sub-image. These sub-images are used to determine the LMVs using phase correlation. For efficient FFT computation, sub-images should have a square shape with horizontal and vertical pixel dimensions being a power of two. Typically a sub-image size of 32 × 32 is preferred to reduce the computation and, at the same time, to keep a sufficiently large area for a correct estimation. But the proposed sub-images are not square; each rectangular sub-image is converted to a square as seen in Fig. 5(b).
Fig. 5. Location of rectangular sub-images for motion estimation. (a) Location of sub-images, (b) Change to square.
To evaluate the algorithm performance, the motion estimation algorithm was used on both natural and synthetic low-light image sequences. The stabilizing performance is evaluated subjectively and verified with the root mean square error (RMSE) between the estimated motion vectors and the true motion vectors. The performance is evaluated using the same error measure utilized in [1][6][9]. The RMSE is given by

$$\varepsilon_{RMS} = \left[ \frac{1}{N} \sum_{k=1}^{N} \big( (\hat{x}_k - x_k)^2 + (\hat{y}_k - y_k)^2 \big) \right]^{1/2} \qquad (6)$$

where $(x_k, y_k)$ are the true values of the measured motion vectors and $(\hat{x}_k, \hat{y}_k)$ are the estimated motion vectors generated by the evaluated DIS algorithms. The results of using the proposed stabilization algorithm for both natural and synthetic sequences are shown in Fig. 6. The test sequences are Basketball (B), Library (L), Street (S), Market (M), Café (C), Gate (G), and Phone (P). A performance comparison using the RMSE obtained by several existing stabilization algorithms is given in Table 1. Results are also compared against the phase correlation based motion estimation performance presented in [9]. For this purpose, sub-image based approaches using assignment of the local motion vector with the largest peak amplitude as the global motion vector (Sub-PC-H), assignment of the average of the two local motion vectors with the highest two peak amplitudes as the global motion vector (Sub-PC-2), assignment of the weighted average of all local motion vectors weighted by their peak
Fig. 6. Experimental results of the motion estimation algorithm for DIS: (a) natural image, (b) synthetic noisy image, (c) proposed BET image
amplitude values (Sub-PC-W), and assignment based on the sub-image one-bit transform (Sub-1BT) [6] were simulated for comparison with the proposed BE transform based motion estimation and MD. Samples of the test results with synthetic sequences are shown in Fig. 6(b) and Fig. 6(c). The effectiveness of the proposed DIS algorithm in processing different images can easily be evaluated from Table 1, which demonstrates the stabilization results of the (BE+MD) filter and the comparison filters for images degraded by noise. It can be seen from Table 1 that the shake removal performance of the Sub-1BT is significantly low compared with the other operators. The outputs of the PC based motion estimation with the Sub-PC-H, Sub-PC-2, and Sub-PC-W operators are almost the same, but these operators produce incorrect motion estimates in the noisy image. The Sub-PC-BE operator exhibits much better performance than the others. Even though phase correlation based techniques provide improved motion estimation accuracy, their computational complexity is higher compared to the Sub-1BT approach, due to the requirement of Fourier domain computation. It is clearly seen that the Sub-PC-BE and MD operator successfully detects the motion vector while at the same time
Table 1. Comparison of RMSE of various DIS algorithms with the proposed BET+MD
Method           Measure         B      L      S      M       C      G      P
Sub-PC-H         RMSE            0.152  1.064  0.112  0.125   0.185  0.114  0.121
                 RMSE (Noise)    0.521  0.638  0.452  0.598   0.463  0.479  0.513
Sub-PC-2         RMSE            0.128  0.165  0.136  0.113   0.164  0.105  0.115
                 RMSE (Noise)    0.462  0.662  0.419  0.528   0.533  0.477  0.532
Sub-PC-W         RMSE            0.137  0.142  0.114  0.102   0.161  0.108  0.112
                 RMSE (Noise)    0.431  0.598  0.422  0.487   0.453  0.455  0.561
Sub-1BT          RMSE            0.139  0.183  0.142  0.139   0.194  0.132  0.146
                 RMSE (Noise)    0.311  0.772  0.389  0.566   0.823  0.588  0.427
Sub-PC-BET       RMSE            0.125  0.132  0.098  0.095   0.128  0.103  0.095
                 RMSE (Noise)    0.137  0.151  0.112  0.108   0.131  0.117  0.106
Sub-PC-BET-MD    RMSE            0.123  0.121  0.102  0.092   0.116  0.111  0.091
                 RMSE (Noise)    0.128  0.132  0.111  0.0098  0.135  0.113  0.094
efficiently removing the jittering in the noisy image. In addition, the proposed Sub-PC-BE and MD approach provides a reasonably robust motion estimation system for image sequence stabilization.
5 Conclusions In this paper, a novel digital image stabilization algorithm using a binary edge transform for phase correlation is proposed, which stabilizes shaking image sequences at high zoom. For the realization of the digital image stabilization, we used the binary edge transform, PC estimation, and an adaptive Kalman + MD filter. Although the proposed algorithm is aimed at low light level (less patterned or noisy) images, it can be extended to more general applications such as low-cost camcorders, digital cameras, mobile phone cameras, CCTV, surveillance video systems, and television broadcasting systems.
References 1. S. Hsu, et al., “A robust digital image stabilization technique based on inverse triangle method and background detection,” IEEE Trans. Consumer Electronics, vol. 51, pp. 335345, May, 2005. 2. K. Sato, et al., “Control techniques for optical image stabilizing system,” IEEE Trans. Consumer Electronics, vol. 39, no. 3, pp. 461-466, Aug. 1993. 3. K. Uomori, et al., “Automatic image stabilizing system by full-digital signal processing,” IEEE Trans. Consumer Electronics, vol. 36, no. 3, pp. 510-519, Aug. 1990. 4. J. Paik, et al., “An adaptive motion decision system for digital image stabilizer based on edge pattern matching,” IEEE Trans. Consumer Electronics, vol. 38, no. 3, pp. 607-616, Aug. 1992. 5. S. Ko, et al., “Fast digital image stabilizer based on gray-coded bit-plane matching,” IEEE Trans. Consumer Electronics, vol. 45, pp. 598-603, 1999. 6. A. Yeni, et al., “Fast digital image stabilization using one bit transform based sub-image motion estimation,” IEEE Trans. Consumer Electronics, vol. 49, no. 4, pp. 1320-1325, Nov. 2003. 7. H. Okuda, et al., “Optimum motion estimation algorithm for fast and robust digital image stabilization,” IEEE Trans. Consumer Electronics, vol. 52, no. 1, pp. 276-280, Feb. 2006. 8. S. Erturk, “Image sequence stabilization: motion vector integration (MVI) versus frame position smoothing (FPS),” Proc. of the 2nd International Symposium on Image and Signal Processing and Analysis, pp. 266-271, 2001. 9. S. Erturk, “Digital Image Stabilization with Sub-Image Phase Correlation Based Global Motion Estimation,” IEEE Trans. Consumer Electronics, vol. 49, no. 4, pp. 1320-1325, Nov. 2003. 10. Y. Matsushita, et al., “Full-frame video stabilization with motion inpainting” IEEE Trans. Pattern Analysis and Machine Intelligence, vol. 28, no. 7, pp. 1150-1163, Jul. 2006.
Contrast Enhancement Using Adaptively Modified Histogram Equalization Hyoung-Joon Kim1, Jong-Myung Lee1, Jin-Aeon Lee2, Sang-Geun Oh2, and Whoi-Yul Kim1 1
Division of Electronics and Computer Engineering, Hanyang University, Haengdang-Dong, Sungdong-Gu, Seoul, 133-792, Korea {khjoon, jmlee}@vision.hanyang.ac.kr,
[email protected] 2 Samsung Electronics, Giheung-Eup, Yongin-Si, Gyeonggi-Do, 449-712, Korea {jalee, stephen.oh}@samsung.com
Abstract. A new contrast enhancement method called adaptively modified histogram equalization (AMHE) is proposed as an extension of typical histogram equalization. To prevent any significant change of gray levels between the original image and the histogram equalized image, the AMHE scales the magnitudes of the probability density function of the original image before equalization. The scale factor is determined adaptively based on the mean brightness of the original image. The experimental results indicate that the proposed method not only enhances contrast effectively, but also keeps the tone of the original image. Keywords: Contrast enhancement, image enhancement, histogram equalization.
1 Introduction Histogram equalization (HE) is a very common method for enhancing the contrast of an image [1]. Based on the histogram or probability density function (PDF) of the original image, the image's PDF is reshaped into one with a uniform distribution property in order to enhance contrast. Although HE provides images of higher contrast, it does not necessarily provide images of higher quality [2]. That is, HE shifts the mean brightness of the image significantly and sometimes even degrades the image quality. In addition, HE causes a washed-out effect when the amplitudes of the histogram components are very high at one or several locations on the grayscale [3]. To overcome the aforementioned problems of typical HE, brightness preserving bi-histogram equalization (BBHE) has been proposed [4]. The ultimate goal of BBHE is to preserve the mean brightness of the original image while enhancing the contrast. BBHE decomposes the original image into two sub-images based on the mean brightness, and equalizes each of the sub-images independently based on their corresponding PDFs. Later, equal area dualistic sub-image histogram equalization (DSIHE) [5], recursive mean-separate histogram equalization (RMSHE) [6], and minimum mean brightness error bi-histogram equalization (MMBEBHE) [7] have
been proposed with the same purpose of preserving the mean brightness. Although these methods can prevent significant changes of gray levels and preserve the tone of the original image, they cannot sufficiently enhance the contrast of some sample images presented in Section 3. Another method, bin underflow and bin overflow (BUBO) [2], has been proposed to prevent a significant change of gray levels while still providing functionality for controlling the rate of contrast enhancement in HE. Before equalization, the PDF of the original image is modified so that the magnitudes of each histogram component are limited to given upper and lower bounds. By limiting the magnitudes, a significant change of gray levels does not occur. However, since the gray levels at which the magnitudes of the histogram components are limited are stretched linearly, as shown in Fig. 1, it is sometimes difficult to sufficiently enhance the contrast of the regions composed of pixels at those gray levels. Based on typical HE, we propose a new and simple method called adaptively modified histogram equalization (AMHE) to enhance contrast. Similar to BUBO, AMHE modifies the PDF of the original image before equalization. Our proposed method differs by scaling the magnitudes of the PDF while preserving its original shape, rather than limiting the magnitudes as in BUBO. AMHE also provides functionality to control the rate of contrast enhancement that can be adaptively determined based on the mean brightness of the original image. The rest of the paper is organized as follows. Section 2.1 reviews HE and BUBO. Section 2.2 describes the proposed method, and Section 2.3 presents the rule to adaptively determine the rate of contrast enhancement. Experimental results with comparisons to different methods are presented in Section 3. Finally, the paper is concluded in Section 4.
2 Adaptively Modified Histogram Equalization 2.1 HE and BUBO Let X={X(i, j)} denote a given image with L gray levels, where X(i, j) denotes the gray level at the location (i, j) and X(i, j) ∈ [0, L-1]. The PDF of X is defined as
p(k) = n_k / N,   k = 0, 1, …, L−1        (1)
where N represents the number of pixels in X and n_k denotes the number of pixels with the gray level of k. Based on the PDF, its cumulative distribution function (CDF) is defined as

c(k) = Σ_{i=0}^{k} p(i).        (2)
The mapping function of HE is defined as
f ( k ) = ( L − 1)c( k ) .
(3)
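As a concrete illustration of Eqs. (1)-(3), the following short Python/NumPy sketch (ours, not part of the original paper; the function name and the 8-bit assumption are illustrative) builds the PDF, the CDF, and the HE mapping for a grayscale image.

    import numpy as np

    def histogram_equalization(image, L=256):
        """Plain HE following Eqs. (1)-(3) for an integer-valued image with L gray levels."""
        N = image.size
        # Eq. (1): PDF p(k) = n_k / N
        p = np.bincount(image.ravel(), minlength=L) / N
        # Eq. (2): CDF c(k) = sum of p(i) for i <= k
        c = np.cumsum(p)
        # Eq. (3): mapping f(k) = (L-1) c(k), rounded back to valid gray levels
        f = np.round((L - 1) * c).astype(np.uint8)
        return f[image]                       # apply the mapping pixel-wise

Applying f to every pixel yields the histogram-equalized image discussed next.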
The PDF of the image equalized by Eq. (3) has a uniform distribution, which reflects the enhanced contrast. However, there is no mechanism for controlling the rate of
contrast enhancement. In addition, if many pixels are concentrated in a few gray levels, those gray levels are changed too much and thus the resulting image looks unnatural. To overcome these shortcomings of HE, BUBO [2] modifies the original PDF with constraints as
p_BUBO(k) = c_BO,   if p(k) > c_BO
p_BUBO(k) = p(k),    if c_BU ≤ p(k) ≤ c_BO        (4)
p_BUBO(k) = c_BU,   if p(k) < c_BU
where cBO and cBU are thresholds for the bin underflow (BU) and bin overflow (BO), respectively, defined as
cBU = (1 − α ) / N and cBO = (1 + α ) / N .
(5)
Here, α is a parameter to control the rate of contrast enhancement from none at α = 0 to full HE at α = ∞. Fig. 1 shows a simulated mapping function of BUBO. In Fig. 1 (a), plots for the original PDF, the modified PDF by Eq. (4), and c_BO are shown in thin solid, thick solid, and dotted lines, respectively. The probabilities greater than c_BO are limited to c_BO and thus the part of the mapping function corresponding to those gray levels has linear increment characteristics, as shown in Fig. 1 (b). This can prevent a significant change in the gray level but hinder the enhancement of contrast.
Fig. 1. The mapping function of BUBO: (a) Original PDF (thin solid), modified PDF (thick solid) by Eq. (4), and cBO (dotted). (b) Mapping function between k and k+6 gray levels.
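For comparison with the proposed method, a minimal sketch of the BUBO modification of Eqs. (4)-(5) might look as follows (our own illustration; the final renormalization is an assumption added so that the mapping spans the full gray-level range, since the paper does not spell this step out).

    import numpy as np

    def bubo_mapping(image, alpha=0.7, L=256):
        """BUBO-style enhancement: clip the PDF as in Eq. (4), then equalize."""
        N = image.size
        p = np.bincount(image.ravel(), minlength=L) / N
        c_bu = (1.0 - alpha) / N              # bin-underflow threshold, Eq. (5)
        c_bo = (1.0 + alpha) / N              # bin-overflow threshold, Eq. (5)
        p_bubo = np.clip(p, c_bu, c_bo)       # Eq. (4)
        c = np.cumsum(p_bubo)
        c = c / c[-1]                         # assumed renormalization of the clipped CDF
        return np.round((L - 1) * c).astype(np.uint8)[image]

The default alpha = 0.7 mirrors the setting used for BUBO in the experiments of Section 3.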
2.2 Adaptively Modified Histogram Equalization In HE, the rate of contrast enhancement can be controlled by putting constraints on the gradient of the mapping function, i.e. the CDF [2]. Since the gradient of the CDF becomes the PDF, we can control the rate of contrast enhancement by modifying the PDF. The proposed method, adaptively modified histogram equalization (AMHE), modifies the original PDF and preserves its shape as shown in Fig. 2 by
p_AMHE(k) = p_mid + α (p(k) − p_mid)^2 / (p_max − p_mid),   if p(k) > p_mid
p_AMHE(k) = p_mid − α (p_mid − p(k))^2 / (p_mid − p_min),   otherwise        (6)
where p_min and p_max are the minimum and maximum values of p(k), respectively, and p_mid is the mean value of p_min and p_max. Note that p_AMHE(k) is forcibly set to zero when negative. The rate of contrast enhancement is then determined by α, and the rule for deciding α adaptively is given in the next section.
Fig. 2. Modification of the PDF
For the modified PDF p_AMHE, the CDF c_AMHE is first computed by Eq. (7). Since c_AMHE(L−1) does not become one, c̃_AMHE is used instead as the mapping function for contrast enhancement by a simple normalization in Eq. (8).

c_AMHE(k) = Σ_{i=0}^{k} p_AMHE(i)        (7)

c̃_AMHE(k) = [c_AMHE(k) / c_AMHE(L−1)] (L−1)        (8)
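A minimal sketch of the AMHE mapping of Eqs. (6)-(8) is given below (our own illustration, not the authors' code); it scales the PDF around p_mid, clamps negative values to zero as stated above, and normalizes the resulting CDF. It assumes a non-constant histogram, so that p_max > p_mid > p_min.

    import numpy as np

    def amhe_mapping(p, alpha, L=256):
        """Build the AMHE gray-level mapping from an image PDF p (length L) and scale factor alpha."""
        p_min, p_max = p.min(), p.max()
        p_mid = 0.5 * (p_min + p_max)
        # Eq. (6): scale the PDF around p_mid while preserving its shape
        up   = p_mid + alpha * (p - p_mid) ** 2 / (p_max - p_mid)
        down = p_mid - alpha * (p_mid - p) ** 2 / (p_mid - p_min)
        p_amhe = np.where(p > p_mid, up, down)
        p_amhe = np.maximum(p_amhe, 0.0)          # negative values are forcibly set to zero
        # Eqs. (7)-(8): CDF of the modified PDF, normalized so that level L-1 maps to L-1
        c = np.cumsum(p_amhe)
        c_tilde = (L - 1) * c / c[-1]
        return np.round(c_tilde).astype(np.uint8)

The returned array is applied to an image exactly as in the HE sketch above, i.e., mapping[image].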
Since the shape of the original PDF is preserved, AMHE can enhance contrast with performance similar to HE, while it prevents a significant change in gray level by scaling the PDF. Fig. 3 shows simulated results of AMHE and BUBO. The gray levels where the magnitudes of the histogram components are limited are linearly stretched in the enhanced result of BUBO. In AMHE, however, the gray levels are stretched in proportion to their probabilities, and thus the contrast of the image regions composed of those gray levels is enhanced more than with BUBO. 2.3 Adaptive Parameter Decision As mentioned, the strength of the modification of the PDF, decided by α, affects the mapping function, and thus it is important to determine the proper value of α. In our
Fig. 3. Simulated results of AMHE and BUBO: (a) Original PDF, (b) Mapping function of AMHE, (c) Contrast enhanced PDF using AMHE, (d) Original PDF, (e) Mapping function of BUBO, (f) Contrast enhanced PDF using BUBO
experiments testing various images, fine results have generally been obtained when α is set to 0.7. Nonetheless, we suggest a rule to decide α adaptively based on the mean brightness of the image by Eq. 9. Note that Xm denotes the mean brightness of the image. When the image is decomposed into two sub-images by Xm, Xml indicates the mean brightness of the sub-image composed of pixels with the gray levels less than or equal to Xm while Xmu is the mean brightness of the other sub-image. It is obvious that Xm is larger than Xml and smaller than Xmu.
α = (X_m − X_ml) / (X_mu − X_ml),   for 0 ≤ k ≤ X_m
α = (X_mu − X_m) / (X_mu − X_ml),   for X_m < k ≤ L − 1        (9)

where X_ml = Σ_{k=0}^{X_m} k·p(k) / Σ_{k=0}^{X_m} p(k) and X_mu = Σ_{k=X_m+1}^{L−1} k·p(k) / Σ_{k=X_m+1}^{L−1} p(k).
The motivation for the decision rule of Eq. (9) is the observation that in natural images many pixels are generally concentrated around the mean brightness. More specifically, as the interval between X_ml and X_m becomes shorter, more pixels with gray levels between X_ml and X_m occur in the sub-image composed of pixels with gray levels less than or equal to X_m. The same phenomenon applies to the other sub-image. In the next section, two enhanced results using this decision rule are provided.
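The adaptive choice of α in Eq. (9) can be sketched as follows (our own illustration; the function name and the assumption of a non-degenerate histogram, with pixels on both sides of X_m, are ours). Passing the returned array in place of the scalar alpha in the earlier AMHE sketch applies Eq. (9) element-wise.

    import numpy as np

    def adaptive_alpha(p):
        """Per-gray-level alpha of Eq. (9) computed from a PDF p; returns an array alpha[k]."""
        L = p.size
        k = np.arange(L)
        x_m = int(round(np.sum(k * p)))                      # mean brightness X_m of the image
        lower, upper = p[:x_m + 1], p[x_m + 1:]
        x_ml = np.sum(k[:x_m + 1] * lower) / lower.sum()     # mean of the lower sub-image
        x_mu = np.sum(k[x_m + 1:] * upper) / upper.sum()     # mean of the upper sub-image
        alpha = np.empty(L)
        alpha[:x_m + 1] = (x_m - x_ml) / (x_mu - x_ml)       # for 0 <= k <= X_m
        alpha[x_m + 1:] = (x_mu - x_m) / (x_mu - x_ml)       # for X_m < k <= L-1
        return alpha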
3 Experimental Results To demonstrate the performance of contrast enhancement, we tested the proposed method on various images gathered from the Internet as well as taken by digital
Fig. 4. Contrast enhancement results: (a) Original (158.89), (b) HE (131.59), (c) BBHE (143.22), (d) DSIHE (147.05), (e) RMSHE (154.48), (f) MMBEBHE (158.84), (g) BUBO (148.33), (h) AMHE (122.17)
Fig. 5. The PDFs of the images in Fig. 4: (a) Original, (b) HE, (c) BBHE, (d) DSIHE, (e) RMSHE, (f) MMBEBHE, (g) BUBO, (h) AMHE
cameras. Our method was also compared with different methods such as HE [1], BBHE [4], DSIHE [5], RMSHE [6], MMBEBHE [7], and BUBO [2]. Note that α in BUBO is set to 0.7 and there are two recursions in RMSHE. Two examples are shown in Fig. 4 and Fig. 6, while the histograms of those images are shown in Fig. 5 and Fig. 7, respectively. Note that the numbers in parentheses in Fig. 4 and Fig. 6 indicate the mean brightness of each image. As mentioned previously,
Fig. 6. Contrast enhancement results: (a) Original (26.73), (b) HE (133.08), (c) BBHE (64.59), (d) DSIHE (79.03), (e) RMSHE (38.74), (f) MMBEBHE (39.67), (g) BUBO (66.43), (h) AMHE (88.46)
Fig. 7. The PDFs of the images in Fig. 6: (a) Original, (b) HE, (c) BBHE, (d) DSIHE, (e) RMSHE, (f) MMBEBHE, (g) BUBO, (h) AMHE
HE enhances contrast sufficiently but the enhanced images seem unnatural. BBHE, DSIHE, RMSHE, and MMBEBHE preserve the mean brightness of the original image well in comparison with HE. However, the images enhanced by these methods are far from satisfactory: the region of the tree, in particular, is indistinguishable, as shown in Fig. 8. BUBO and our method enhance contrast better than the rest while maintaining the
Fig. 8. The enlarged images of Fig. 6: (a) Original, (b) HE, (c) BBHE, (d) DSIHE, (e) RMSHE, (f) MMBEBHE, (g) BUBO, (h) AMHE
tone of each original image. Nevertheless, the enhanced images of our method are recognized more easily than those of BUBO, and the grayscale is utilized more efficiently, as shown in Fig. 5 (g) and (h).
4 Conclusions
In this paper, we present a new and simple contrast enhancement method referred to as adaptively modified histogram equalization (AMHE). It is an extension of typical histogram equalization (HE). AMHE modifies the probability density function (PDF) of a given image while preserving its shape and then applies HE to that modified PDF. By modifying the PDF, not only is a significant change in gray level, which often occurs in HE, prevented, but the overall contrast of the image is enhanced. Moreover, the rate of contrast enhancement can be controlled by the degree of the modification of the PDF, and is determined adaptively using the mean brightness of the image. Comparison of experimental results using different methods indicates that the proposed method not only enhances contrast effectively while keeping the tone of the original image, but most importantly it makes the enhanced image look very natural by adhering to the flavor of the original image.
References 1. Gonzalez, R. C., Woods R. E.: Digital Image Processing. 2nd edn. Prentice-Hall, New Jersey (2002) 2. Yang, S. J., Oh, J. H., Park, Y. J.: Contrast Enhancement Using Histogram Equalization with Bin Underflow and Bin Overflow. International Conference on Image Processing (2003) 881-884 3. Chen, Z., Abidi, B. R., Page, D. L., Abidi, M. A.: Gray-Level Grouping (GLG): An Automatic Method for Optimized Image Contrast Enhancement—Part I: The Basic Method. IEEE Transactions on Image Processing, Vol. 15, No. 8 (2006) 2290-2302 4. Kim, Y. T.: Contrast Enhancement using Brightness Preserving Bi-Histogram Equalization. IEEE Transactions on Consumer Electronics, Vol. 43, No. 1 (1997) 1-8 5. Wang, Y., Chen, Q., Zhang, B.: Image Enhancement Based on Equal Area Dualistic SubImage Histogram Equalization Method. IEEE Transactions on Consumer Electronics, Vol. 45, No. 1 (1999) 68-75 6. Chen, S., Ramli, A. R.: Contrast Enhancement using Recursive Mean-Separate Histogram Equalization for Scalable Brightness Preservation. IEEE Transactions on Consumer Electronics, Vol. 49, No. 4 (2003) 1301-1309 7. Chen, S., Ramli, A. R.: Minimum Mean Brightness Error Bi-Histogram Equalization in Contrast Enhancement. IEEE Transactions on Consumer Electronics, Vol. 49, No. 4 (2003) 1310-1319
A Novel Fade Detection Algorithm on H.264/AVC Compressed Domain Bita Damghanian1, Mahmoud Reza Hashemi2, and Mohammad Kazem Akbari1 1
Computer Engineering & Information Technology Department, Amirkabir University of Technology, Tehran, Iran {damghanian, akbari}@ce.aut.ac.ir 2 Multimedia Processing Laboratory School of Electrical and Computer Engineering, University of Tehran, Iran
[email protected] Abstract. With the increasing use and availability of digital video, shot boundary detection, as a fundamental step in automatic video content analysis and retrieval, has received extensive attention in recent years. But the problem of detecting gradual transitions such as fades, especially in the compressed domain, has yet to be solved. In this paper, a new robust compressed domain fade detection method is proposed, which is based on the luminance weighting factors available at slice level in H.264 compressed video. Simplicity and working directly in the H.264 compressed domain with high precision and recall are the main advantages of the novel algorithm. Keywords: Content based video analysis and retrieval, shot boundary detection, fades detection, compressed domain, H.264 video coding standard, weighted prediction.
1 Introduction
To achieve automatic video content analysis, shot boundary detection is a preliminary step; hence, in recent years the research on automatic shot boundary detection techniques has exploded. In the literature, a shot is defined as the sequence of images generated by the camera from the time it begins recording to the time it stops; and according to whether the transition between shots is abrupt or not, it can be classified into two types: abrupt transition (cut) and gradual transition. Gradual transitions are further categorized as dissolves, fades, and all kinds of wipes. There are two types of fade: fade-in and fade-out. A fade-in occurs when the picture gradually appears from a black screen and a fade-out occurs when the picture information gradually disappears, leading to a black screen. Among gradual transitions, fades are mainly used in television and movie production to emphasize time transitions. Furthermore, they can be used to separate different TV program elements such as the main show material from commercial blocks. Fade-out/in combinations can also indicate relative mood and pace between shots [1]. Therefore, detecting them is a very powerful tool for shot classification and story summarization.
Fade detection algorithms are classified as uncompressed and compressed domain algorithms [2]. Uncompressed or pixel domain algorithms utilize information directly from the spatial video domain. These techniques are computationally demanding and time consuming, because the coded videos must be decompressed before the method can be applied; they are hence less efficient compared to the compressed domain approach. Among the international video coding standards, H.264/AVC is the most recent one and, due to its superior compression performance, more video will be encoded with it in the future. Thus, it is meaningful to study fade detection techniques directly in the H.264 compressed domain. The remainder of this paper is organized as follows. Section 2 reviews the existing techniques and related works. Section 3 presents an overview of the weighted prediction tool in the H.264 standard. The novel fade detection method is described in section 4. Simulation results are presented in section 5. Finally, the paper ends in section 6 with the conclusion and future work.
2 Related Works
Several shot boundary detection techniques have been proposed in the literature. Most existing methods focus on cut detection rather than fade or dissolve detection. They are designed based on the fact that the frames within the same shot maintain some consistency in the visual content, while the frames surrounding the shot boundaries exhibit a significant change of visual content [3]. The considered visual content can be a feature in the uncompressed domain such as object boundaries [4], color histogram [5], correlation average [6], or a value in the compressed domain like the rate of macroblocks [2], [7], motion vectors [8], and DC coefficients [9]. Meanwhile, as mentioned in [3], gradual transitions may span over dozens of frames and the variation of the visual content between two consecutive frames may be considerably small; therefore, the identification of gradual transitions is far more difficult, and specific solutions for fade and dissolve detection are needed. In a fade-out, the first frame gradually darkens into a sequence of dark monochrome frames, and in the case of a fade-in, the picture gradually brightens from a sequence of dark monochrome frames; these dark monochrome frames seldom appear elsewhere. As a result, the fade detection problem usually turns into the recognition of monochrome frames. Monochrome frames are identified by calculating the standard deviation of each frame [10], the standard deviation together with the average intensity of each frame [3], or by calculating the correlation of an image with a constant image [6]. After locating monochrome frames, some other constraints are checked for robustness, such as the standard deviation of the pixels [6], the ratio between the second derivative of the variance curve and the first derivative of the mean curve [10], the second derivative curve of the luminance variance, and the first derivative of each frame mean [5], [1]. References [5] and [11] survey some other fade detection techniques which follow broadly the same ideas as the above schemes. All the methods mentioned so far are performed in the pixel domain and hence are not very efficient for applications such as video indexing where we have to deal
with a large amount of compressed video content. Few other methods have been proposed for the MPEG compressed domain. Most of them have a moderate accuracy and precision. For instance, [12] declares a fade-in if the number of positive residual DC coefficients in P frames exceeds a certain percentage of the total number of nonzero DC coefficients consistently over several consecutive frames. Simplicity is the advantage of this technique, because it only uses entropy decoding; but on the other hand, it does not have a high accuracy and cannot locate the exact boundaries of fades. [10] has an approximation of its pixel domain solution for the MPEG-2 and H.263 compressed domain. It approximates the mean and the variance using a DC estimation of each picture. Results are distorted due to the limitations of the variance calculation. There are even fewer methods for the H.264 compressed domain. One of the few existing techniques has been proposed in [13], which first uses the intra prediction mode histogram to locate potential GOPs (Groups of Pictures), where shot transitions occur with great probability. It then utilizes more inter prediction modes and Hidden Markov Models to segment the video sequence. This method is able to fairly detect cuts and dissolves, but not fades. To summarize, there is a real need for a robust fade detection algorithm in the H.264 compressed video. In this paper a new three step algorithm for fade detection is proposed.
3 Overview of the Weighted Prediction Tool in the H.264 Standard Weighted Prediction (WP) is a newly adopted concept in the prediction component of the H.264 video coding standard. It is available in the Main and Extended profiles. Unlike motion estimation which looks for the displacement of moving objects, weighted prediction is interested in changes in light or color. The H.264 encoder determines a single weighting factor for each color component of each reference picture index and encodes it in the bit stream slice header when the weighted prediction explicit mode is selected. Thus, each color component of the reference frame is multiplied by the weighting factor and the motion compensation is performed on the weighted reference. This results in a significant improvement in the performance of motion estimation and a lower prediction error. More details can be found in [14]. The novel fade detection method proposed here, exploits this explicit weighted prediction feature of the H.264 video coding standard.
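As background for the factors used below, one simple way an encoder might estimate an explicit luminance weighting factor is the ratio of the mean luma of the current picture to that of its reference (a sketch of the general idea only; it is our own illustration, and the actual H.264 syntax stores the weight in fixed-point form with a log weight denominator, which is omitted here).

    import numpy as np

    def estimate_luma_weight(cur_luma, ref_luma, eps=1e-6):
        """Rough explicit-WP luminance weight: overall luminance of the current
        slice relative to its reference picture."""
        return float(cur_luma.mean()) / (float(ref_luma.mean()) + eps)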
4 The Proposed Method In this paper a new three step algorithm for fade detection is proposed. The proposed method uses the Luminance WP factors, available at slice level of the H.264 compressed video, to detect fade transitions and their exact boundaries. The first step in the proposed algorithm detects the potential fades and their types (fade-in, fade-out).
False alarms are identified and removed in the second and third steps. The third step is also responsible for locating the exact boundaries of fades in the given video sequence. Some of the key advantages of the proposed method over existing techniques are as follows: 1. It works directly in the H.264 compressed domain and only the slice headers are decoded; hence, improving the efficiency. 2. The proposed algorithm is very simple and requires inexpensive computation. 3. It is able to find the exact boundaries of fades and their types (fade-in, fade-out). The three steps are described in more details in the following sections. 4.1 First Step: Detecting Potential Fades and Their Types According to [15], each P or B slice is at least predicted from one reference frame; which is the previous I or P frame for P slices and the next I or P frame for B slices. Hereafter, we refer to the Luminance Weighted Prediction Factor of P and B slices relative to these references, as PLWP and BLWP factors, respectively. These factors can be considered as the ratio of the overall luminance of the current slice with respect to the reference frame; hence, for PLWP factor, one of the following conditions is plausible: 1. If this ratio is one, there is no change of light between the current and the reference frame. This is usually the case inside a shot where there is no considerable change of light, and is also true in cut and cross-dissolve transitions where the two shots have the same overall luminance or in object/camera motion where there is no change of light. 2. If this ratio is greater than one, the current frame is brighter than the reference frame. This may potentially represent a fade-in, where there is an overall increase of luminance. 3. If this ratio is less than one, the current frame is darker than the reference frame. This may potentially represent a fade-out, where there is an overall decrease of luminance. Based on the above statement, in the first step of the proposed method, the frames with PLWP/BLWP factors not equal to one are selected as fade candidates. If the corresponding PLWP factors are greater than one or if the corresponding BLWP factors are less than one, they are probable fade-ins, otherwise they are probable fade-outs. (As noted earlier, P and B slices are referenced in opposite direction; hence, when PLWP factor is greater than one, the BLWP factor is less than one and vice versa.) 4.2 Second Step: Removing False Alarms It should be noted that not all the cases where the PLWP/BLWP factors are not equal to one, represent a fade. For instance, in an abrupt transition between two shots with different global luminance, or when there is a sudden flash light inside a shot, the PLWP/BLWP factors are not equal to one, for two or three consecutive frames. In
other words, using the simple method in the previous step is not enough and may result in false alarms. The second step is designed based on the definition of a fade. Accordingly, in fades, the overall luminance changes over N consecutive frames; thus, if the PLWP and BLWP factors are not equal to one for less than N consecutive frames, they will be omitted from the list of fade candidates. The value for N can be varied from 10 to about 150 frames. Therefore, we limit N to 10, in order not to miss any real fade. 4.3 Third Step: Removing False Alarms and Locating the Exact Fade Boundaries Additive dissolve, slow camera motion, or slow object movement may also introduce some false alarms. They may change the luminance of consecutive frames and make some long term variations in PLWP and BLWP factors. To make our algorithm robust to these false alarms and to find the exact fade boundaries, the third step of the proposed algorithm is designed based on the rule derived from the mathematical model of fades. We show this analytically for fade-in, but it can be similarly shown for fade-out, as well. According to [16], if G(x; y) is a grey scale sequence and N is the length of the fade transition, a fade-in sequence (F) is modeled by:
F(x, y; t) = G(x, y) × t / N.        (1)
Thus, the fade-in transition frames are calculated as follows: F(0) = G·0, F(1) = G·1/N, F(2) = G·2/N, F(3) = G·3/N, etc. Obviously the ratio of the third frame luminance to the second frame luminance (F(2)/F(1)) will be 2 and the ratio of the fourth frame luminance to the third frame luminance (F(3)/F(2)) will be 3/2, etc. If the video sequence is coded in the IPPP format, the PLWP factor shows the ratio of the overall luminance of the current slice to the overall luminance of the previous frame; therefore, the PLWP factors will be approximately 2, 3/2, 4/3, 5/4, 6/5, etc. In other video coding formats such as IBP or IBBP the reference is not the previous frame. For this reason, these exact values are not obtained, but the PLWP factors are still in descending order. For instance, in the IBP format the reference frame of each P frame will be the frame before the previous one. Thus, the PLWP factors are obtained from (F(4)/F(2), F(6)/F(4), …). Hence, the PLWP factors will be approximately 4/2, 6/4, 8/6, etc. Consequently, we can conclude the third step of the algorithm: among the frames selected in the previous steps as fade candidates, those that have PLWP factors in a descending order are accepted as such. The first and the last points of the PLWP factor decrease are the start and the end points of the fade transition, respectively. 4.4 Summarizing the Three Steps
The simplified flowchart of the novel method is shown in figure 1. Reducing to essentials, only the PLWP factors are considered in the flow diagram.
Fig. 1. Simplified flowchart of the novel method. Only the PLWP Factors are considered in the flow diagram.
At this stage, the complete fade detection method can be summarized as follows: If the PLWP factors change in a descending order from a value greater than one for at least 10 consecutive frames, a fade-in is detected. In the same manner, if the PLWP factors change in a descending order from a value less than one for at least 10 consecutive frames, a fade-out is detected. In both fade types, the first and the last point of decrease are the start and the end points of fade. Figure 2 shows clearly the fades detected by the proposed algorithm based on the PLWP factors for one sample sequence.
Fig. 2. PLWP factors of one sample sequence, scaled by 32. The detected fades are marked. Object motion caused some variations in the PLWP factors (e.g. frames 448 to 477). These variations were correctly removed in the third step.
In the presence of B frames, the BLWP factors must also change in an ascending order from a value less than one for at least 10 consecutive frames to declare a fadein. In the same manner, the BLWP factors must also change in an ascending order from a value greater than one to declare a fade-out.
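For an IPPP-coded sequence, where only the PLWP factors are needed, the summarized rule can be sketched as follows (our own illustration; the list of per-frame PLWP values, the tolerance around one, and the choice N = 10 as the minimum run length are the assumptions of this sketch).

    def detect_fades(plwp, min_len=10, tol=1e-3):
        """Return (start, end, kind) runs found in a list of per-frame PLWP factors.

        A run qualifies when the factors stay away from one for at least min_len
        consecutive frames (step 2) and decrease monotonically (step 3); it is a
        fade-in if it starts above one and a fade-out if it starts below one (step 1).
        """
        fades, i, n = [], 0, len(plwp)
        while i < n:
            if abs(plwp[i] - 1.0) <= tol:       # step 1: factor equal to one, no candidate
                i += 1
                continue
            j = i
            while j + 1 < n and abs(plwp[j + 1] - 1.0) > tol and plwp[j + 1] <= plwp[j]:
                j += 1                           # step 3: extend while the factors descend
            if j - i + 1 >= min_len:             # step 2: minimum duration
                kind = 'fade-in' if plwp[i] > 1.0 else 'fade-out'
                fades.append((i, j, kind))
            i = j + 1
        return fades

In the presence of B frames, the symmetric test on the BLWP factors described above would be added on top of this check.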
5 Simulation Results
To evaluate the proposed technique, it has been simulated and applied to five H.264 encoded sequences (resolution: QCIF, CIF, 30 frames/sec) with three GOP formats including IPPP, IBP, and IBBP. The test sequences contained a total of 8378 frames including 27 fades with varying length (10 to 170 frames), 17 cuts, and 13 dissolves. The test sequences and transitions have been selected to not only provide a variety of real case transition scenarios, but also to provide various disturbance factors, such as various lighting conditions, camera motion and activity (zoom, tilt, and pan), object motion with various speeds, cuts, and dissolves, to fairly evaluate the effectiveness and the robustness of the proposed scheme. Experimental results show that the new method cannot distinguish some special cross-dissolves from fades. These special cross-dissolves are those between two shots with very different luminance, and from the observer's point of view they are very similar to fades. To remove these false alarms, another step can be added to the proposed algorithm to recognize whether the first frame of the detected fade-in and the last frame of the detected fade-out are black or not. If they are not black, they will be removed from the list of fades. Recall, precision, cover-recall, and cover-precision are the main measures used to evaluate the performance of the novel method. They are defined in TRECVID 2005 [17]. Accordingly, recall is the percentage of true fades detected, and precision is the percentage of non false positives in the set of detections. Cover-precision and cover-recall are specific forms of precision and recall that evaluate the detector's ability to accurately locate the start and the end of a fade. Each metric is averaged over each GOP format and the result is summarized in Table 1. Since at the moment no other fade detection method working directly in the H.264 compressed domain is reported in the literature, we have not been able to compare our results with any other technique. However, current experimental results indicate that the novel method presented in this paper has a high precision and recall and can locate the exact boundaries with a high cover-precision and cover-recall.

Table 1. Fade detection results, where Nc is the number of correctly detected fades, Nm is the number of fades missed by the proposed method, Nf is the number of falsely detected fades, C-P is cover-precision, and C-R is cover-recall

GOP format   Nc   Nm   Nf   Precision   Recall   C-P     C-R
IPPP         25   2    0    100         92.59    97.95   99.25
IBP          27   0    2    93.10       100      96.89   99.12
IBBP         27   0    3    90          100      93.86   97.72
6 Conclusion and Future Work
In this paper, a new robust H.264 compressed domain fade detection method is proposed. The new technique not only works directly in the H.264 compressed domain with low complexity, but also has high precision and recall. The new algorithm was evaluated in terms of recall, precision, cover-recall, and cover-precision. Our future work is to further improve the precision by recognizing whether the first frame of the detected fade-in and the last frame of the detected fade-out are black or not. If they are not black, they will be removed from the list of fades. In addition, it is desired to extend the new technique to operate on sequences where weighted prediction is not used. Acknowledgments. This work was supported in part by Iran Telecommunication Research Center (ITRC).
References 1. B.T. Truong, C. Dorai, S. Venkatesh: Improved Fade and Dissolve Detection for Reliable Video Segmentation. In: Proc. International Conference on Image Processing, Vancouver, BC, Canada, Vol. 3(2000) 961-964 2. J. Calic and E. Izquierdo: Towards Real-Time Shot Detection in the MPEG-Compressed Domain. In: Proc. of the Workshop on Image Analysis for Multimedia Interactive Services (WIAMIS), Finland (2001) 1390–1399 3. W. Zheng, J. Yuan, H. Wang, F. Lin, B. Zhang: A Novel Shot Boundary Detection Framework. In: Proc. of VCIP: Visual Communications and Image Processing, Vol.5960, Bellingham, WA (2005) 410-420 4. P.N. Hashimah, L. Gao, R. Qahwaji, J. Jiang: An Improved Algorithm for Shot Cut Detection. In: Proc. of VCIP: Visual Communications and Image Processing, Vol. 5960, Bellingham, WA (2005) 1534-1541 5. B.T. Truong, Ch. Dorai, S. Venkatesh: New Enhancements to Cut, Fade, and Dissolve Detection Processes in Video Segmentation. In: Proc. of the 8th ACM international conference on Multimedia, Marina del Rey, California, United States (2000) 219 - 227 6. S. Porter, M. Mirmehdi, B. Thomas: Detection and Classification of Shot Transitions. In: British Machine Vision Conference (BMVC) (2001) 73-82 7. J. Nang, S. Hong, Y. Ihm: An Efficient Video Segmentation Scheme for MPEG Video Stream using Macroblock Information. In: Proc. of the 7th ACM international conference on Multimedia, Orlando, Florida, United States (1999) 23 - 26 8. W.S. Chau, Oscar C. Au, T.W. Chanc, T.S. Chongd: Efficient and Reliable Scene Change Detection in MPEG Compressed Video. In: Proc. of SPIE: Visual Communications and Image Processing, Vol. 5960, Bellingham, WA (2005) 1542-1549 9. K. Hoashi, M. Sugano, M. Naito, K. Matsumoto, F. Sugaya, Y. Nakajima: Shot Boundary Determination on MPEG Compressed Domain and Story Segmentation Experiments for TRECVID 2004. In: TREC Video Retrieval Evaluation Forum, JAPAN (2004) 10. W.A.C. Fernando, C.N. Canagarajah, D.R. Bull: Fade and Dissolve Detection in Uncompressed and Compressed Video Sequences. In: Proc. International Conference on Image Processing (ICIP), Kobe, Japan, Vol. 3(1999) 299-303
11. R. Lienhart: Reliable Transition Detection in Videos: A Survey and Practitioner’s Guide. In: International Journal of Image and Graphics, Vol.1 (2001) 469-486 12. Divakaran, Ajay, Ito, Hiroshi, Sun, Huifang, Poon, Tommy: Fade-in/out Scene Change Detection in the MPEG-1/2/4 Compressed Video Domain. In: Proc. SPIE Storage and Retrieval for Media Databases, Vol. 3972 (2000) 518-522 13. Y. Liu, W. Zeng, W. Wang, W. Gao: A Novel Compressed Domain Shot Segmentation Algorithm on H.264/AVC. In: ICIP: International Conference on Image Processing, Vol.4 (2004) 2235-2238 14. J. M. Boyce: Weighted Prediction in the H.264/MPEG AVC Video Coding Standard. In: Proc. of the 2004 International Symposium on Circuits and Systems, ISCAS '04, Vol.3 (2004) 789-92 15. L. E. G. Richardson: H.264 and MPEG-4 Video Compression: Video Coding for Next Generation. John Wiley & Sons (2003) 16. Z. Cernekova, C. Nikou, I. Pitas: Entropy Metrics Used for Video Summarization. In: Proc. 18th spring conference on Computer graphics, New York, USA (2002) 73 - 82 17. A. Smeaton, P. Over: TRECVID-2005: Shot Boundary Detection Task Overview. (2005)
A Representation of WOS Filters Through Support Vector Machine W.C. Chen and J.H. Jeng Department of Information Engineering, I-Shou University, Kaohsiung County, Taiwan
[email protected],
[email protected] Abstract. A weighted order statistic filter (WOSF) generates a Boolean function by the specified parameters. However, WOSF with different sets of parameters may generate the same Boolean function. In this paper, one proposes an alternative representation to characterize WOS filters. This method is based on the maximal margin classification of SVM. From a truth table generated by a linearly separable Boolean function, one takes the inputs and outputs to form a training set. By SVM, it generates a maximal margin hyperplane. The hyperplane has an optimal normal vector and optimal bias, and defines a discriminant function. The discriminant function decides the category of the training data and can be used to represent the WOSF. In other words, one merely utilizes a normal vector and bias to represent all WOS filters with the same output but different weight vectors and threshold values. Keywords: linearly separable, Boolean function, maximal margin, normal vector, WOSF, SVM.
1 Introduction
Stack filters [1] [2] are filters based on threshold logic. It is known that any stack filter can be characterized by a Boolean function; with nonnegative weights and threshold values, these stack filters are called weighted order statistic filters (WOSF) [3]. The WOSF, including median filters, weighted median filters (WMF), and order statistic filters (OSF), belong to the class of nonlinear filters. Because the implementation of WOS filters is based on their statistical and deterministic properties, they have good performance for edge preservation and noise suppression. Therefore they can be applied in digital signal processing such as noise cancellation, image reconstruction, edge enhancement, and texture analysis [4]. Some scholars have presented several papers about the representation of WOS filters [5-8]; for example, Pertti Koivisto designed WOS filters by the characterization of training-based optimization, and C. E. Savin used a linearly separable stack-like architecture to design WOS filters. Their representations of WOS filters are based on the pair of a weight vector and a threshold value. Various weight vectors or threshold values can represent different WOS filters. In this paper, one proposes a different way to represent WOS filters. This representation is based on the characterization of maximal margin classification on Support Vector
Machine (SVM). From a truth table generated by a linearly separable Boolean function, one takes the inputs and outputs to form a training set. Based on SVM training, one can generate a maximal margin hyperplane. This hyperplane has an optimal normal vector and an optimal bias. By means of the normal vector and bias, one defines a discriminant function. The sign of the discriminant function can represent the WOSF. In other words, one merely utilizes a normal vector and a bias that can represent all WOS filters with the same output but different weight vectors and threshold values. An example of three variables is given to illustrate and verify the proposed characterization method.
2 WOS Filters
For an n-dimensional input vector x_i ∈ {0, 1}^n, the number of nonzero components of x_i is called the Hamming weight, which is denoted by |x_i|. Let f_r(x_i) be a function with domain {0, 1}^n and range {0, 1} in which the parameter r, 0 ≤ r ≤ n + 1, is referred to as an order. The function is defined as follows:

f_r(x_i) = 1, if |x_i| ≥ r; 0, else        (1)

For each fixed r in (1), the function f_r(x_i) produces a pack of outputs. When r is changed, the function will produce another pack of outputs. Therefore the function f_r(x_i) can produce n + 1 packs of outputs. Such a function f_r(x_i) is called an order statistic filter (OSF).
For the case of n = 3, let x_i = (x1, x2, x3) ∈ {0, 1}^3; the order may be r = 0, 1, 2, 3 or 4. It means that the functions f_0(x_i), f_1(x_i), f_2(x_i), f_3(x_i) and f_4(x_i) can produce 5 packs of outputs. The outputs of these 5 OSF are given in Table 1. In Table 1, each function has the same inputs. When the order r varies, the outputs are changed. For r = 2, the filter f_2(x_i) has outputs (0 0 0 1 0 1 1 1).
An OSF f_r(x_i) can be extended to a WOSF, the weighted OSF f_{Ω,r'}(x_i). The quantity Ω is a weight vector and r' is a threshold value, 0 ≤ r' ≤ n'+1. Given a weight vector Ω = (ω_1 ω_2 … ω_n), ω_i ∈ R, the parameter n' is the inner product of Ω and the input vector x_i, i.e., n' = Σ ω_i x_i. The parameter r' is a threshold value lying between 0 and n'+1. The function f_{Ω,r'}(x_i) can be defined as follows:

f_{Ω,r'}(x_i) = 1, if Σ ω_i x_i ≥ r'; 0, else        (2)

For each fixed r' in (2), the function f_{Ω,r'}(x_i) produces a pack of outputs. Altering r', the function will produce another pack of outputs. Therefore the function f_{Ω,r'}(x_i) can produce n'+1 packs of outputs. Such a function f_{Ω,r'}(x_i) is called a weighted order statistic filter (WOSF).
Let n = 3 and the weight vector Ω = (1 3 1). The threshold values may be r' = 0, 1, 2, 3, 4, 5 and 6. It means that these functions f_{Ω,0}(x_i), f_{Ω,1}(x_i), ……, f_{Ω,5}(x_i), and f_{Ω,6}(x_i) can produce 7 packs of outputs. The outputs of the 7 WOSF are given in Table 2. In Table 2, each function has the same inputs and a fixed weight vector Ω. When the threshold r' varies, the filter and outputs are changed. For r' = 2, the filter f_{Ω,2}(x_i) has outputs (0 0 1 1 0 1 1 1). It is clear that when the weight vector is Ω = (1 1 … 1), the WOS filter is just an OS filter.
A Boolean function f(x_i) is a function having inputs x_i ∈ {0, 1}^n and outputs y_i ∈ {0, 1}. Each Boolean function can generate a truth table. A Boolean function is linearly separable if there exists a vector α = (α_1, …, α_n) ∈ R^n and a threshold value θ ∈ R, such that

f(x_i) = 1, if Σ α_i x_i ≥ θ; 0, else        (3)
In comparison to equation (2), every WOS filter is obviously a linearly separable Boolean function.
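As an illustration of Eqs. (1)-(3), the following sketch (our own code; the helper name wosf_pack is ours) evaluates an OSF or WOSF on every binary input and returns the corresponding output pack; with Ω = (1, 3, 1) it reproduces the columns of Table 2.

    from itertools import product

    def wosf_pack(weights, r):
        """Output pack of f_{Omega,r'}: 1 when sum_i w_i*x_i >= r', over all inputs in {0,1}^n."""
        n = len(weights)
        return tuple(int(sum(w * x for w, x in zip(weights, xs)) >= r)
                     for xs in product((0, 1), repeat=n))

    # f_{(1,3,1),2} gives (0, 0, 1, 1, 0, 1, 1, 1), matching Table 2, and uniform
    # weights (1, 1, 1) with r = 2 give the OSF column f_2 of Table 1.
    assert wosf_pack((1, 3, 1), 2) == (0, 0, 1, 1, 0, 1, 1, 1)
    assert wosf_pack((1, 1, 1), 2) == (0, 0, 0, 1, 0, 1, 1, 1)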
3 WOS Filters Induced from SVM
In learning theory, the support vector machine (SVM) is a supervised learning method and can be used as a classifier or regressor. When it is a classifier, the discriminant function is used to classify the training data. If the training data are linearly separable, the most important characteristic of SVM is that the separating hyperplane has the maximum margin property. It means that the distance from the training data to the hyperplane is maximized. Moreover, only a few of the training data are needed to represent the discriminant function, i.e., the hyperplane. These representative data are referred to as support vectors.
Given X ⊆ R^n, Y = {1, −1}, and the training set S = {(x_i, y_i)}_{i=1}^{l} ⊆ X × Y, the vector x_i is the data information and y_i is the 2-class category. If the training set S is linearly separable, then there exists a hyperplane H_{w,b}

H_{w,b} = { x_i ∈ R^n : f_{w,b}(x_i) = <w, x_i> + b = 0, i = 1, …, l }

such that this hyperplane can correctly classify the training set. Given a pair (w, b), w ∈ R^n and b ∈ R, define f_{w,b}(x_i) by

f_{w,b}(x_i) = <w, x_i> + b

The normalized function g_{w,b} of f_{w,b} is given by

g_{w,b}(x_i) = ||w||^{-1} f_{w,b}(x_i) = < ||w||^{-1} w, x_i > + ||w||^{-1} b,   x_i ∈ R^n
The functional margin μ_S(w, b) and the geometric margin η_S(w, b) of a training set S are defined as

μ_S(w, b) = min_{i=1..l} y_i · [<w, x_i> + b] = min_{i=1..l} y_i · f_{w,b}(x_i)

η_S(w, b) = min_{i=1..l} y_i · [< ||w||^{-1} w, x_i > + ||w||^{-1} b] = min_{i=1..l} y_i · g_{w,b}(x_i)

The margin γ_S of the training set S is the maximum geometric margin over all hyperplanes, defined as

γ_S = max_{w,b} min_{i=1..l} [ y_i · g_{w,b}(x_i) ] = max_{w,b} min_{i=1..l} [ y_i · (< ||w||^{-1} w, x_i > + ||w||^{-1} b) ]

The hyperplane with the maximum geometric margin is called the maximal margin hyperplane or optimal hyperplane. For a linearly separable training set S, the margin of the training set equals the inverse of the Euclidean norm of w. Hence one considers the following primal optimization problem:

minimize   (1/2) w^T w
subject to  y_i · [<w, x_i> + b] ≥ 1,  ∀ i ∈ l        (4)

Suppose the pair (w*, b*) is the solution of the above primal optimization problem; the maximal margin hyperplane can be written as

H_{w*,b*} = { x_i ∈ R^n : f_{w*,b*}(x_i) = <w*, x_i> + b* = 0 }

where w* ∈ R^n, b* ∈ R, and γ_S = ||w*||^{-1}.
To solve the optimization problem, practically, one uses the Lagrangian theorem to convert (4) to the dual optimization problem [9]:

maximize   Σ_{i=1}^{l} z_i − (1/2) Σ_{i=1}^{l} Σ_{j=1}^{l} z_i z_j y_i y_j <x_i, x_j>
subject to  Σ_{i=1}^{l} z_i y_i = 0  and  z_i ≥ 0, ∀ i ∈ l        (5)

Suppose the parameter z* is the solution of the dual optimization problem; one defines the index set I_SV for support vectors, I_SV = { i ∈ l : z_i* > 0 }. The optimal normal vector w* can be written as

w* = Σ_{i=1}^{l} z_i* y_i x_i = Σ_{i ∈ I_SV} z_i* y_i x_i
By means of the Karush-Kuhn-Tucker (KKT) conditions [9]

z_i* [ y_i <w*, x_i> + y_i b* − 1 ] = 0,
y_i [ <w*, x_i> + b* ] − 1 ≥ 0,        (6)
z_i* ≥ 0,

the optimal bias b* is written as

b* = y_k − <w*, x_k> = y_k − Σ_{i ∈ I_SV} z_i* y_i <x_i, x_k>

where k is an arbitrary element in I_SV. Hence the discriminant function is defined as

f_{w*,b*}(x) = <w*, x> + b* = Σ_{i ∈ I_SV} z_i* y_i <x_i, x> + b*        (7)
For any index i ∈ I_SV, the vector x_i is called a support vector, for which the corresponding Lagrange multiplier z_i* > 0. The support vectors are the data points nearest to the hyperplane. Also, they are the points that are the most difficult to classify. Based on the KKT conditions, one has the following equation:

y_i · [ <w*, x_i> + b* ] = 1        (8)

This equation reveals that when the category y_i = +1, the quantity <w*, x_i> + b* = +1, and when y_i = −1, <w*, x_i> + b* = −1. Because the optimal hyperplane H_{w*,b*} can correctly separate the training set, one has

y_i = +1, if f_{w*,b*} > 0; −1, else        (9)

By equations (7), (8) and (9), the sign of the discriminant function, sgn(f_{w*,b*}(x)), is defined as follows:

sgn(f_{w*,b*}(x_i)) = +1, if <w*, x_i> + b* > 0; −1, else        (10)

Since the sign of the discriminant function decides whether the data are classified to the "+1" category or the "−1" category, compared to equation (3), this discriminant function can also represent a linearly separable Boolean function. One now takes the inputs and the outputs to form a training set S = {(x_i, y_i)}_{i=1}^{l} from a truth table generated by a linearly separable Boolean function. Based on SVM training, one can generate a maximal margin hyperplane H_{w*,b*}. This hyperplane has an optimal normal vector w* and an optimal bias b*, from which one can define a discriminant function
f_{w*,b*}(x_i) = <w*, x_i> + b*. Because the truth table also represents many WOSF with the same output, the discriminant function can be used to represent all WOSF with the same output but different weight vectors and threshold values.

Table 1. The output and OS filters representation

Input x_i            Output
(x1, x2, x3)      f0   f1   f2   f3   f4
(0, 0, 0)          1    0    0    0    0
(0, 0, 1)          1    1    0    0    0
(0, 1, 0)          1    1    0    0    0
(0, 1, 1)          1    1    1    0    0
(1, 0, 0)          1    1    0    0    0
(1, 0, 1)          1    1    1    0    0
(1, 1, 0)          1    1    1    0    0
(1, 1, 1)          1    1    1    1    0
Table 2. The output and WOS filters representation with Ω = (1, 3, 1)

Input x_i            Output
(x1, x2, x3)      f_Ω,0  f_Ω,1  f_Ω,2  f_Ω,3  f_Ω,4  f_Ω,5  f_Ω,6
(0, 0, 0)           1      0      0      0      0      0      0
(0, 0, 1)           1      1      0      0      0      0      0
(0, 1, 0)           1      1      1      1      0      0      0
(0, 1, 1)           1      1      1      1      1      0      0
(1, 0, 0)           1      1      0      0      0      0      0
(1, 0, 1)           1      1      1      0      0      0      0
(1, 1, 0)           1      1      1      1      1      0      0
(1, 1, 1)           1      1      1      1      1      1      0
4 An Illustrative Example
From Table 2, one takes a WOSF f_{Ω,4}(x_i) that has the pack of outputs (0 0 0 1 0 0 1 1). Composing the inputs x_i ∈ {0, 1}^3 and outputs to form a training set S = {(x_i, y_i)}_{i=1}^{8},
for SVM training, one generates a hyperplane with normal vector w* = (1 2 1) and bias b* = −1. Hence the discriminant function can be written as

f*_{(121),−1}(x_i) = <(1 2 1), (x1 x2 x3)> + (−1)
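A sketch of this training step using scikit-learn (our own illustration, not the authors' implementation): a linear SVC with a large C approximates the hard-margin problem of Eq. (4); the recovered (w, b) is, up to the solver's scaling, the maximal-margin hyperplane, and it is its sign pattern, not its absolute scale, that characterizes the filter.

    import numpy as np
    from itertools import product
    from sklearn.svm import SVC

    X = np.array(list(product((0, 1), repeat=3)))      # truth-table inputs
    pack = (0, 0, 0, 1, 0, 0, 1, 1)                    # outputs of f_{Omega,4} from Table 2
    y = np.where(np.array(pack) == 1, 1, -1)           # map {0,1} labels to {-1,+1}

    svm = SVC(kernel='linear', C=1e6).fit(X, y)        # large C approximates a hard margin
    w_star, b_star = svm.coef_[0], svm.intercept_[0]
    print(w_star, b_star)                              # normal vector proportional to (1, 2, 1)
    assert (np.sign(X @ w_star + b_star) == y).all()   # the sign reproduces the output pack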
Because a filter is characterized by the outputs (0 0 0 1 0 0 1 1), the sign of the discriminant function sgn(f*_{(121),−1}(x_i)) can represent all WOS filters with the same outputs but different weight vectors Ω. For example, the WOSF f_{(121),3}(x_i) and f_{(131),4}(x_i) generate the same outputs, both of which can be represented by the sign of the discriminant function sgn(f*_{(121),−1}(x_i)). The WOSF of the traditional sense and the SVM sense are listed in Table 3.

Table 3. The WOSF representation of SVM sense and traditional sense with the same output
Output                 (01111111)       (00110111)       (00110011)       (00010011)       (00000001)

SVM sense WOSF sgn(f_{w*,b*}(x_i)):
  w*                   (111)            (121)            (010)            (121)            (111)
  b*                   2                1                0                -1               -2
  Filter               sgn(f_(111),2)   sgn(f_(121),1)   sgn(f_(010),0)   sgn(f_(121),-1)  sgn(f_(111),-2)

Traditional sense WOSF f_{Ω,r'}(x_i):
  Ω                    (131)            (131)            (131)            (131)            (131)
  r'                   1                2                3                4                5
  Filter               f_(131),1        f_(131),2        f_(131),3        f_(131),4        f_(131),5

  Ω                    (324)            (231)            (120)            (121)            (111)
  r'                   2                3                2                3                3
  Filter               f_(324),2        f_(231),3        f_(120),2        f_(121),3        f_(111),3

  Ω                    (543)            (243)            (241)            (243)            (213)
  r'                   3                4                4                6                6
  Filter               f_(543),3        f_(243),4        f_(241),4        f_(243),6        f_(213),6

  Ω                    (474)            (154)            (252)            (252)            (121)
  r'                   4                5                5                7                4
  Filter               f_(474),4        f_(154),5        f_(252),5        f_(252),7        f_(121),4

  Ω                    (589)            (364)            (263)            (465)            (225)
  r'                   5                6                6                10               9
  Filter               f_(589),5        f_(364),6        f_(263),6        f_(465),10       f_(225),9
In Table 3, the element "0" replaces the element "−1" to simplify the presentation. For the pack of outputs (0 0 0 1 0 0 1 1), the represented WOSF merely uses a normal vector w* = (1 2 1) and bias b* = −1 in the SVM sense. Compared with the traditional representation, the WOS filters need many different weight vectors Ω and threshold values r', such as Ω = (1 3 1), r' = 4 or Ω = (2 4 3), r' = 6 or Ω = (2 5 2), r' = 7, and so forth. Obviously, only a few parameters are able to represent all WOSF having the same outputs but different weight vectors. Other WOSF corresponding to different outputs are also listed in Table 3. In Table 3, each column contains a pack of outputs, a corresponding WOSF in the SVM sense, and some WOSF in the traditional sense.
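For instance, reusing the wosf_pack helper sketched after Section 2 (our own code, not part of the paper), one can check that the traditional-sense filters listed in the (00010011) column of Table 3 all generate the same output pack.

    # Several (Omega, r') pairs from the (00010011) column of Table 3:
    for weights, r in [((1, 3, 1), 4), ((1, 2, 1), 3), ((2, 4, 3), 6),
                       ((2, 5, 2), 7), ((4, 6, 5), 10)]:
        assert wosf_pack(weights, r) == (0, 0, 0, 1, 0, 0, 1, 1)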
5 Conclusion
In the traditional sense, each weight vector and threshold value can represent a WOSF. Various weight vectors and threshold values can represent different WOSF, but the outputs are the same. The main purpose of this research is to reduce the number of weight vectors and threshold values and to use another form to represent WOSF. This paper proposes a representation based on the properties of the maximal margin classification of SVM. Through a truth table generated by a linearly separable Boolean function, one takes each input and the related outputs to form a training set for SVM training. One thus generates an optimal normal vector and an optimal bias. The normal vector and bias can represent all WOS filters with the same outputs but different weight vectors and threshold values. Compared to the known representations in [5]-[8], the proposed method can characterize WOSF with fewer parameters.
Reference 1. P. D. Wentd, E. J. Coyle, and N. C. Gallagher Jr. : Stack Filters. IEEE Trans. Acoust. Speech and Signal Processing, vol. 34. no. 4. (1986) 898-911 2. O. Yli-Harja, J. T. Astola, and Y. Neuvo.: Analysis of the properties of median and weighted median filters using threshold logic and stack filter representation. IEEE Transactions on Signal Processing, vol. 39. (1991) 395-410 3. J. Nieweglowski, M. Gabbouj, and Y. Neuvo.: Weighted medias-positive Boolean functions conversion algorithms. Signal Processing, vol. 34. (1982) 149-161 4. P. D. Wentd.: Nonrecursive and recursive stack filters and their filtering behavior. IEEE Trans. Acoust. Speech and Signal Processing, vol. 38. no. 12, (1990) 2099-2107 5. P. Koivisto, and H. Huttunen: Design Of Weighted Order Statistic Filters By Training-Based Optimization. International Symposium on Signal Processing and its Application (ISSPA), Kuala Lumpur, Malaysia, (2001) 6. Michael Ropert, Francois Moreau de Saint-Martin, and Danielle Pele. A new representation of weighted order statistic filters. Signal Processing, vol. 54. (1996) 201-206 7. C. E. Savin, M. O. Ahmad, and M. N. S. Swamy. Design of Weighted Order Statistic Filters Using Linearly Separable Stack-Like Architecture. Circuits and Systems, Proceedings of the 37th Midwest Symposium, vol. 2. (1994) 753–756
8. Nina S. T. Hirata and Roberto Hirata Jr.: Design of Order Statistic Filters from Examples. Proceedings of the XVI Brazilian Symposium on Computer Graphics and Image Processing (SIBGRAPI’03), Brazilian, (2003) 24-26 9. Yih-Lon Lin, Wei-Chin Teng, Jyh-Horng Jeng, and Jer-Guang Hsieh.: Characterization of Canonical Robust Template Values for a Class of Uncoupled CNNs Implementing Linearly Separable Boolean Functions. WSEAS Transactions on Information Science and Applications, vol.7, no.2. (2005) 940-944
Detection Method and Removing Filter for Corner Outliers in Highly Compressed Video Jongho Kim, Kicheol Jeon, and Jechang Jeong Dept. Electrical and Computer Engineering, Hanyang University, 17 Haengdang, Seongdong, Seoul 133-791, Korea {angel, hobohemia, jjeong}@ece.hanyang.ac.kr
Abstract. We propose a detection method and a removal filter for corner outliers in order to improve visual quality in highly compressed video. Corner outliers are detected using the direction of edge going through a block-corner and the properties of blocks around the block-corner. The proposed filter for removing corner outliers, which compensates the stair-shaped discontinuities around edges using the adjacent pixels, is applied to the detected area. Simulation results show that the proposed method improves, particularly in combination with deblocking filters, the visual quality remarkably. Keywords: corner outlier, low bit-rate video, block-based coding, MPEG-4 video.
1 Introduction
Most video coding standards, which have a hybrid structure, adopt block-based motion compensated prediction and transform. Consequently, the block-based processing generates undesired artifacts such as blocking artifacts, ringing noise, and corner outliers, particularly in very low bit-rate video. Blocking artifacts are grid noise along block boundaries in relatively flat areas; ringing noise is the Gibbs phenomenon due to truncation of high-frequency coefficients by quantization; a corner outlier is a special case of blocking artifacts at the cross-point of a block-corner and a diagonal edge. To reduce the blocking artifacts and the ringing noise, a number of studies have been carried out in the spatial domain [1, 2, 5] and the transform domain [3, 4], respectively. However, the corner outliers are still visible in some video sequences, since a deblocking filter is not applied to the areas including a large difference at a block boundary in order to avoid undesired blurring [1]. Hardly any studies have been carried out on removing corner outliers, because the artifacts appear only in limited areas and the peak signal-to-noise ratio (PSNR) improvement is somewhat small, although they degrade visual quality considerably. Therefore, we propose an effective corner outlier removal filter with a simple detection method to improve visual quality in very low bit-rate video. The proposed method can be used along with various deblocking [1-4] and deringing filters [5] for further improvement of visual quality.
The remaining parts of the paper are as follows. We present a detection method for corner outliers based on the pre-defined patterns, i.e., the direction of an edge going through a block-corner, in Section 2. Section 3 describes, in detail, the proposed filter to remove the corner outliers. Simulation results and conclusions are given in Section 4 and Section 5, respectively.
2 Definition and Observations on the Corner Outliers
Consider the original and its reconstructed frames illustrated in Fig. 1, where a diagonal edge goes through a block-corner. The edge occupies large areas in blocks B and C, whereas it occupies very small areas (d0 in Fig. 1) in block D. If block D is flat except for d0, the AC coefficients of the DCT of block D are mainly related to d0, and their values are small. Since most small AC coefficients are truncated by quantization in very low bit-rate coding, the area of d0, which represents the edge in block D, cannot be reconstructed as shown in Fig. 1(b). As a result, the visually annoying stair-shaped artifact is produced around the block-corner. Such artifacts are called corner outliers.
Fig. 1. Conceptual illustration of a corner outlier, (a) an edge going through a block-corner in an original frame, (b) a corner outlier located in d0' in a reconstructed frame
To detect and remove the corner outliers, we find two major observations on the artifacts as follows. First, there is a large difference between boundary values of the block including the corner outlier and the other three blocks around the cross-point. For example, the difference between d0’ and its upper pixel or between d0’ and its left pixel is large as shown in Fig. 1(b). Second, corner outliers are more noticeable in flat areas, that is, each block around the cross-point is relatively flat. We examine the properties of the blocks in terms of the observations around every cross-point to detect the corner outliers appropriately. In addition, since corner outliers are prominent when a diagonal edge occupies block areas unequally around a cross-point, we deal with four types, as depicted in Fig. 2, based on the edge direction. We detect the actual corner outlier, which satisfies the proposed detection conditions among each detection type. Dealing with the pre-defined detection types instead of detecting an edge precisely has an advantage in reduction of computational complexity.
Fig. 2. Detection types based on the edge direction, (a) Lower 45° direction, (b) Upper 45° direction, (c) Upper 135° direction, (d) Lower 135° direction
3 Detection Method and Removal Filter for Corner Outliers
3.1 Detection Method for Corner Outliers
To detect corner outliers based on the first observation, we obtain the average values of the four pixels around the cross-point, which are represented as the shaded areas in Fig. 3, by

A_avg = (1/4) Σ_{i=1}^{4} a_i,  B_avg = (1/4) Σ_{i=1}^{4} b_i,  C_avg = (1/4) Σ_{i=1}^{4} c_i,  D_avg = (1/4) Σ_{i=1}^{4} d_i        (1)
where each capital and small letter denotes blocks and pixels, respectively; that is, A_avg is the average of a_1 to a_4 of block A, B_avg is the average of b_1 to b_4 of block B, and so on. Then, to determine the detection type among the four cases, we examine the differences between the average values of the corner-outlier candidate block and its neighboring blocks, using the equations listed in Table 1. By using the average values around the cross-point instead of each pixel value, we can reduce detection errors in complex areas. In Table 1, QP denotes the quantization parameter. For example, when block A has a corner outlier, the differences between A_avg and B_avg and between A_avg and C_avg are large by the first observation; thereby we consider that block A has a corner outlier. Practically, since we do not know the corner-outlier candidate block, the four cases listed in Table 1 should be investigated. If two corner outliers appear at one cross-point, the artifacts are arranged on diagonally opposite sides because the corner outliers cannot be placed vertically or horizontally. In this case, the proposed method can detect both artifacts without additional computations, since we examine the four cases independently using the equations in Table 1.
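A sketch of this first detection test is given below (our own illustration; the coordinates (by, bx) of the cross-point, the interpretation of a_1..a_4 as the 2x2 patch of each block nearest the cross-point, and the 2*QP threshold of Table 1 are the assumptions of this sketch; the flatness test of Eq. (2) would still follow before filtering).

    import numpy as np

    def candidate_blocks(frame, by, bx, qp):
        """Check the cross-point at (by, bx), a corner on the block grid, and return
        the blocks whose corner average differs from both adjacent blocks by more than 2*QP."""
        f = frame.astype(np.float64)
        # Eq. (1): averages of the 2x2 corner patches of blocks A, B, C, D (by, bx >= 2 assumed)
        a = f[by - 2:by, bx - 2:bx].mean()
        b = f[by - 2:by, bx:bx + 2].mean()
        c = f[by:by + 2, bx - 2:bx].mean()
        d = f[by:by + 2, bx:bx + 2].mean()
        t = 2 * qp
        out = []
        if abs(a - b) > t and abs(a - c) > t: out.append('A')   # upper 45 degree edge
        if abs(b - a) > t and abs(b - d) > t: out.append('B')   # upper 135 degree edge
        if abs(c - a) > t and abs(c - d) > t: out.append('C')   # lower 135 degree edge
        if abs(d - b) > t and abs(d - c) > t: out.append('D')   # lower 45 degree edge
        return out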
(Fig. 3: the four blocks A, B, C, and D meet at the cross-point; in each block the 3×3 group of pixels nearest the cross-point is labeled a1–a9, b1–b9, c1–c9, and d1–d9, with a1, b1, c1, and d1 adjacent to the cross-point and a1–a4, b1–b4, c1–c4, d1–d4 forming the shaded 2×2 corners used in Eq. (1).)
Fig. 3. Pixel arrangement for detecting and filtering corner outliers

Table 1. Detection criterion according to the detection type

Block containing the artifact   Detection type   Detection criterion
A                               Upper 45°        |Aavg − Bavg| > 2QP and |Aavg − Cavg| > 2QP
B                               Upper 135°       |Bavg − Aavg| > 2QP and |Bavg − Davg| > 2QP
C                               Lower 135°       |Cavg − Aavg| > 2QP and |Cavg − Davg| > 2QP
D                               Lower 45°        |Davg − Bavg| > 2QP and |Davg − Cavg| > 2QP
To detect corner outliers based on the second observation, we examine whether the block satisfying the condition in Table 1 is flat or not. That is, when the candidate block is A, we examine the flatness of block A with respect to the pixel including the corner outlier using

$$A_{flat} = \sum_{i=2}^{9} |a_1 - a_i| \qquad (2\text{-}1)$$

where a1 is the pixel including the corner outlier. We regard the block as flat when Aflat is smaller than QP. In the case where another block is determined as a candidate block, we examine whether each of the following is smaller than QP or not:

$$B_{flat} = \sum_{i=2}^{9} |b_1 - b_i|, \quad C_{flat} = \sum_{i=2}^{9} |c_1 - c_i|, \quad D_{flat} = \sum_{i=2}^{9} |d_1 - d_i| \qquad (2\text{-}2)$$
where each capital and small letter denotes blocks and pixels, respectively, and QP is the quantization parameter. When the condition of Table 1 is satisfied and the value of (2) is smaller than QP, we regard the block as including a corner outlier, and the proposed filter is then applied to the block in order to remove the artifact.
3.2 The Proposed Corner Outlier Removal Filter
To remove the corner outliers, we propose a filter that updates pixels of the stair-shaped discontinuity using neighboring pixels. To reduce computational complexity,
we apply the proposed filter under the assumption that the edge has a diagonal direction instead of detecting the actual edge direction of neighboring blocks. This is reasonable because:
1. Corner outliers created by a diagonal edge are more noticeable than the artifacts created by a horizontal or vertical edge at a cross-point.
2. Various deblocking filters can remove corner outliers created by a horizontal or vertical edge.
According to the above assumptions and the detection results obtained in Section 2, when block A includes a corner outlier as shown in Fig. 2(b), the pixels in block A are replaced by

$$\begin{cases} a_1' = (b_1 + b_2 + c_1 + d_1 + d_2 + c_3 + d_3 + d_4)/8 \\ a_2' = (a_2 + a_1' + b_1 + c_2 + c_1 + d_1 + c_4 + c_3 + d_3)/9 \\ a_3' = (a_3 + b_3 + b_4 + a_1' + b_1 + b_2 + c_1 + d_1 + d_2)/9 \\ a_4' = (a_4 + a_3' + b_3 + a_2' + a_1' + b_1 + c_2 + c_1 + d_1)/9 \\ a_5' = (a_5 + b_5 + b_6 + a_3' + b_3 + b_4 + a_1' + b_1 + b_2)/9 \\ a_9' = (a_9 + a_2' + a_1' + c_9 + c_2 + c_1 + c_8 + c_4 + c_3)/9 \end{cases} \qquad (3)$$
where each index follows that of Fig. 3. For blocks B, C, and D, the filtering methods are similar to (3) and the actual equations are as follows: in case where block B includes a corner outlier, as seen in Fig. 2(c), the pixels in block B are replaced by

$$\begin{cases} b_1' = (a_1 + a_2 + d_1 + c_1 + c_2 + d_3 + c_3 + c_4)/8 \\ b_2' = (b_2 + b_1' + a_1 + d_2 + d_1 + c_1 + d_4 + d_3 + c_3)/9 \\ b_3' = (b_3 + a_3 + a_4 + b_1' + a_1 + a_2 + d_1 + c_1 + c_2)/9 \\ b_4' = (b_4 + b_3' + a_3 + b_2' + b_1' + a_1 + d_2 + d_1 + c_1)/9 \\ b_5' = (b_5 + a_5 + a_6 + b_3' + a_3 + a_4 + b_1' + a_1 + a_2)/9 \\ b_9' = (b_9 + b_2' + b_1' + d_9 + d_2 + d_1 + d_8 + d_4 + d_3)/9 \end{cases} \qquad (4)$$
in case where block C includes a corner outlier, as seen in Fig. 2(d), the pixels in block C are replaced by

$$\begin{cases} c_1' = (d_1 + d_2 + a_1 + b_1 + b_2 + a_3 + b_3 + b_4)/8 \\ c_2' = (c_2 + c_1' + d_1 + a_2 + a_1 + b_1 + a_4 + a_3 + b_3)/9 \\ c_3' = (c_3 + d_3 + d_4 + c_1' + d_1 + d_2 + a_1 + b_1 + b_2)/9 \\ c_4' = (c_4 + c_3' + d_3 + c_2' + c_1' + d_1 + a_2 + a_1 + b_1)/9 \\ c_5' = (c_5 + d_5 + d_6 + c_3' + d_3 + d_4 + c_1' + d_1 + d_2)/9 \\ c_9' = (c_9 + c_2' + c_1' + a_9 + a_2 + a_1 + a_8 + a_4 + a_3)/9 \end{cases} \qquad (5)$$
and in case where block D includes a corner outlier, as seen in Fig. 2(a), the pixels in block D are replaced by

$$\begin{cases} d_1' = (c_1 + c_2 + b_1 + a_1 + a_2 + b_3 + a_3 + a_4)/8 \\ d_2' = (d_2 + d_1' + c_1 + b_2 + b_1 + a_1 + b_4 + b_3 + a_3)/9 \\ d_3' = (d_3 + c_3 + c_4 + d_1' + c_1 + c_2 + b_1 + a_1 + a_2)/9 \\ d_4' = (d_4 + d_3' + c_3 + d_2' + d_1' + c_1 + b_2 + b_1 + a_1)/9 \\ d_5' = (d_5 + c_5 + c_6 + d_3' + c_3 + c_4 + d_1' + c_1 + c_2)/9 \\ d_9' = (d_9 + d_2' + d_1' + b_9 + b_2 + b_1 + b_8 + b_4 + b_3)/9 \end{cases} \qquad (6)$$
In each equation, the indices follow those of Fig. 3.
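To make the procedure concrete, the sketch below (hypothetical Python, not the authors' implementation) combines the flatness test of Eq. (2-1) with the block-A update of Eq. (3); it takes a dictionary p that maps the pixel names of Fig. 3 ('a1' … 'd9') to intensity values, leaving the mapping from names to image coordinates to the caller.

    def filter_block_A(p, qp):
        """Return the replacement values a1'..a5', a9' of Eq. (3), or None
        if block A is not flat in the sense of Eq. (2-1)."""
        a_flat = sum(abs(p['a1'] - p['a%d' % i]) for i in range(2, 10))
        if a_flat >= qp:                      # block A is not flat enough
            return None
        out = {}
        out['a1'] = (p['b1'] + p['b2'] + p['c1'] + p['d1'] + p['d2']
                     + p['c3'] + p['d3'] + p['d4']) / 8.0
        out['a2'] = (p['a2'] + out['a1'] + p['b1'] + p['c2'] + p['c1']
                     + p['d1'] + p['c4'] + p['c3'] + p['d3']) / 9.0
        out['a3'] = (p['a3'] + p['b3'] + p['b4'] + out['a1'] + p['b1']
                     + p['b2'] + p['c1'] + p['d1'] + p['d2']) / 9.0
        out['a4'] = (p['a4'] + out['a3'] + p['b3'] + out['a2'] + out['a1']
                     + p['b1'] + p['c2'] + p['c1'] + p['d1']) / 9.0
        out['a5'] = (p['a5'] + p['b5'] + p['b6'] + out['a3'] + p['b3']
                     + p['b4'] + out['a1'] + p['b1'] + p['b2']) / 9.0
        out['a9'] = (p['a9'] + out['a2'] + out['a1'] + p['c9'] + p['c2']
                     + p['c1'] + p['c8'] + p['c4'] + p['c3']) / 9.0
        return out

The corresponding routines for blocks B, C, and D follow Eqs. (4)–(6) in the same way, reusing already-updated values as in Eq. (3).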
4 Simulation Results
The proposed method was applied to ITU test sequences at various bit-rates. To evaluate the proposed method, each test sequence was coded using the MPEG-4 verification model (VM) [6] with two coding modes: IPPP…, i.e., all frames are inter-frame coded except the first frame, and I-only, i.e., all frames are intra-frame coded. In each mode, we applied the proposed filter to the reconstructed frames with none of the coding options switched on and with the deblocking filter [2] switched on, respectively. To arrive at a certain bit-rate, an appropriate quantization parameter was chosen and kept constant throughout the sequence. This avoids possible side effects from typical rate-control methods. The simulation results for IPPP… coded sequences are summarized in Table 2. It can be seen that the PSNR of the luminance component is increased by up to 0.02 dB throughout the sequences. Only a slight improvement in PSNR is obtained because the area satisfying the filtering conditions is limited. However, the proposed filter considerably improves subjective visual quality, particularly for sequences with apparent diagonal edges such as Hall Monitor, Mother & Daughter, Foreman, etc. To emphasize the effect of the proposed filter, partially enlarged views of the first frame of the Hall Monitor sequence under each method, i.e., the results of the proposed method without and with the deblocking filter, respectively, are shown in Fig. 4.
Fig. 4. Result images for Hall Monitor sequence. (a) Original sequence (QCIF, QP=17), (b) and (d) partially enlarged images of MPEG-4 reconstructed images without and with the MPEG-4 deblocking filter, respectively, (c) and (e) partially enlarged images of the proposed method without and with the MPEG-4 deblocking filter, respectively
Fig. 4. (continued)

Table 2. PSNR results for IPPP… case (PSNR_Y in dB)

Sequence            QP  Bit-rate (kbps)  No filtering  No filtering + proposed filter  Deblocking  Deblocking + proposed filter
Hall Monitor        18   9.39            29.8728       29.8851                         30.1957     30.2090
Mother & Daughter   16   9.45            32.2256       32.2469                         32.3684     32.3894
Container Ship      17   9.8             29.5072       29.5082                         29.7281     29.7291
Hall Monitor         9  24.29            34.0276       34.0325                         34.1726     34.1835
Mother & Daughter    8  23.83            35.3303       35.3316                         35.2686     35.2708
Container Ship      10  21.61            32.5701       32.5711                         32.6003     32.6013
Foreman             14  46.44            30.6237       30.6252                         31.0727     31.0739
Coastguard          14  44.68            29.0698       29.0763                         29.1562     29.1586
Hall Monitor        12  47.82            33.8216       33.8236                         34.0921     34.0948
News                19  47.21            31.2192       31.2202                         31.3516     31.3528
Foreman             12  64.65            31.4786       31.4798                         31.7587     31.7598
News                16  63.14            32.0648       32.0658                         32.1233     32.1243
Table 3 shows the results for I-only coded sequences of Hall Monitor and Foreman. The results are similar to the IPPP… coded case. The proposed filtering conditions are satisfied at low QP for the sequences that include low spatial details such as Hall Monitor, since each frame of the sequences is relatively flat originally.
On the other hand, the proposed filtering conditions are satisfied at relatively high QP for the sequences that include medium or high spatial details, such as Foreman, since each frame of those sequences is flattened at high QP. This tendency is maintained with or without the deblocking filter.

Table 3. PSNR results for I-only case (PSNR_Y in dB)

Sequence       Method                  QP=12    QP=17    QP=22    QP=27
Hall Monitor   No filtering            32.8170  30.5284  28.9326  27.6323
               No+proposed             32.8396  30.5381  28.9374  27.6344
               Deblocking              33.2223  30.9624  29.4025  28.1188
               Deblocking+proposed     33.2422  30.9848  29.4052  28.1230
Foreman        No filtering            32.1830  30.0531  28.6406  27.5515
               No+proposed             32.1862  30.0606  28.6540  27.5684
               Deblocking              32.5741  30.5267  29.1635  28.1405
               Deblocking+proposed     32.5760  30.5316  29.1745  28.1526
5 Conclusions
Corner outliers are visually very annoying in highly compressed video, although they appear only in limited areas. To remove them, we have proposed a simple and effective post-processing method, which includes a detection method and a compensation filter. The proposed method, particularly in combination with the deblocking filter, further improves both objective and subjective visual quality.
Acknowledgments. This work was supported in part by the Human Resource Development Project for IT SoC Key Architect under the Korea IT Industry Promotion Agency.
References
1. Park, H., Lee, Y.: A Postprocessing Method for Reducing Quantization Effects in Low Bit-rate Moving Picture Coding. IEEE Trans. Circuits and Syst. Video Technol., Vol. 9 (1999) 161-171
2. Kim, S., Yi, J., Kim, H., Ra, J.: A Deblocking Filter with Two Separate Modes in Block-based Video Coding. IEEE Trans. Circuits and Syst. Video Technol., Vol. 9 (1999) 156-160
3. Zhao, Y., Cheng, G., Yu, S.: Postprocessing Technique for Blocking Artifacts Reduction in DCT Domain. IEE Electron. Lett., Vol. 40 (2004) 1175-1176
4. Wang, C., Zhang, W.-J., Fang, X.-Z.: Adaptive Reduction of Blocking Artifacts in DCT Domain for Highly Compressed Images. IEEE Trans. Consum. Electron., Vol. 50 (2004) 647-654
5. Kaup, A.: Reduction of Ringing Noise in Transform Image Coding Using Simple Adaptive Filter. IEE Electron. Lett., Vol. 34 (1998) 2110-2112
6. Text of MPEG-4 Video Verification Model ver. 18.0. ISO/IEC Doc. N3908 (2001)
Accurate Foreground Extraction Using Graph Cut with Trimap Estimation Jung-Ho Ahn and Hyeran Byun Dept. of Computer Science, Yonsei University, Seoul, Korea, 126-749
Abstract. This paper describes an accurate human silhouette extraction method for video sequences. In computer vision applications that use a static camera, background subtraction is one of the most effective ways of extracting human silhouettes. However, it is prone to errors, so the performance of silhouette-based gait and gesture recognition often decreases significantly. In this paper we propose a two-step segmentation method: trimap estimation and fine segmentation using a graph cut. We first estimate foreground, background and unknown regions with an acceptable level of confidence. Then, the energy function is defined by focusing on the unknown region, and it is minimized via the graph cut method to achieve optimal segmentation. The proposed algorithm was evaluated with respect to ground-truth data and was shown to produce high-quality human silhouettes.
1
Introduction
Background subtraction has been widely used to detect and track moving objects observed by a static camera. Recently, background subtraction methods have been used to assist in human behavior analysis such as surveillance and gait and gesture recognition [10, 13, 18]. In gait and gesture recognition, silhouettes are used to determine configurations of the human body. However, these human silhouettes are often not accurate enough, and recognition performance can decrease significantly when using them [11]. In particular, shadows cannot always be properly removed, and some parts of the silhouette can be lost when occluding object parts have colors similar to those of the occluded background areas. Figure 1 demonstrates these problems. Background subtraction has long been an active area of research. Horprasert et al. [6] proposed a robust background subtraction and shadow detection method, which was applied to an object tracking system together with an appearance model [13]. An adaptive background subtraction method using the Gaussian Mixture Model (GMM) [14] was presented and applied to foreground analysis with intensity and texture information [16]. A joint color-with-depth observation space [7] and an efficient nonparametric background color model [5] were also proposed to extract foreground objects. In this paper we propose a foreground segmentation method that consists of two steps: trimap estimation and fine segmentation using a graph cut. In trimap estimation we estimate the confident foreground and background regions and leave the
1186
J.-H. Ahn and H. Byun
Fig. 1. Problems of the background subtraction method for extracting human silhouettes. In (a) and (b) some parts of the human silhouette disappear when the color of these parts is too similar to that of the occluded background area. (c) and (d) show the shadow problem. These silhouettes were obtained with Horprasert's algorithm [6].
dubious area unknown. The amount of the estimated foreground and background regions is related to the level of confidence. Fine segmentation focuses on the unknown regions and uses an energy minimization technique. Throughout this paper it is assumed that a static background is available, images are captured by a static mono camera, and only one person enters the scene. As a pioneering work on object segmentation using a graph cut, the user-interactive segmentation technique was proposed by Boykov and Jolly [1]. GrabCut [12] and lazy snapping [9] were other user-interactive image cutout systems based on a graph cut. Lazy snapping used a graph cut formulation based on regions generated by the watershed algorithm, instead of the image pixels. In this paper we also perform image segmentation, but it is for estimating the trimap. Recently, the Layered Graph Cut (LGC) method based on color, contrast and stereo matching information [8] has been proposed to infer the foreground by using a pair of fixed webcams. Section 2 describes the trimap concept and the proposed graph cut framework using the trimap information. Section 3 explains the concrete trimap estimation method using region likelihoods. Experimental results are given in Section 4, and conclusions are presented in Section 5.
2 Graph Cut Segmentation with Estimated Trimap
2.1 Estimated Trimap
In natural image matting[4][15] with still images the trimap is supplied by the user. This trimap partitions the image into three regions: firm foreground, firm background and unknown regions. In unknown regions, the matte can be estimated using the color statistics in the known foreground and background regions. In this paper we attempt to estimate the trimap in video applications without user interaction. Foreground segmentation refers to a binary labeling(classification) process that assigns all the pixels in a given image to either foreground or background. It is very hard to label all the pixels in the image correctly, but some parts of the
Accurate Foreground Extraction Using Graph Cut with Trimap Estimation
1187
Fig. 2. Trimap comparison. (a) and (b) show typical trimaps in natural image matting [15], whereas (c) and (d) show the trimaps used in the proposed algorithm.
image can be easily labeled with simple ideas or features. In trimap estimation we pre-determine the labels of each of the pixels in the area that can be easily labeled. Then the trimap information is reflected into the energy function for fine segmentation. In section 3 we propose a trimap estimation method that uses the background model. Traditionally the unknown regions of the trimap is located between foreground and background areas but they can theoretically be placed anywhere in our estimated trimap. Figure 2 shows some typical examples of the trimaps of image matting and the proposed method. The trimap T can be viewed as the function from the set of pixels P of the image to be segmented to the label set LT T : P → LT = {−1, 0, 1}
(1)
where -1, 0, and 1 represent unknown, background and foreground respectively. In every frame we estimate the trimap T and define the sets O and B by O = {p ∈ P|T (p) = 1},
B = {p ∈ P|T (p) = 0}.
(2)
The set of unknown pixels U is defined by P − (O ∪ B). We call the pixels in O the foreground seeds. The pixels in B are called the background seeds.
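As a minimal illustration of Eqs. (1) and (2), the sketch below (hypothetical Python/NumPy, not the authors' code) derives the seed sets O and B and the unknown set U from a trimap stored as an integer array.

    import numpy as np

    def seed_sets(trimap):
        """trimap: integer array with values 1 (foreground), 0 (background)
        and -1 (unknown), as in Eq. (1).  Returns boolean masks for the
        foreground seeds O, the background seeds B and the unknown set U."""
        O = (trimap == 1)
        B = (trimap == 0)
        U = (trimap == -1)   # equivalently, the complement of O and B in P
        return O, B, U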
2.2 Graph Cut with the Estimated Trimap
In this section we describe the energy function for fine segmentation. The energy function is similar to that used in the GrabCut method [12] but it explores both trimap information and the color likelihoods. We consider a standard neighborhood system N of all unordered pairs {p, q} of neighboring pixels, and define z = (z1 , · · · , z|P| ) as the image where zn is the RGB color vector for the nth pixel and f = (f1 , f2 , · · · , f|P| ) as a binary vector whose components fp specify label assignments to pixels p in P. Each fp can be either 1 or 0 where 1 represents the foreground and and 0 represents the background. The vector f defines a segmentation. Given an image, we seek the labeling f that minimizes the energy E(f ) = γD(f ) + V (f )
(3)
1188
J.-H. Ahn and H. Byun
Fig. 3. Graph cut with the estimated trimap information. (a) input image, (b) trimap; white, black, gray areas indicate estimated foreground, background and unknown regions, respectively. (c) graph cut framework with the estimated trimap information, (d) the extracted foreground region.
where coefficient γ specifies the relative importance of the data term D(f ) and the smoothness term V (f ). The form of D(f ) and V (f ) are given by D(f ) = Dp (fp ) p∈P
V (f ) =
δ(fp , fq )Vp,q (fp , fq ).
{p,q}∈N
where δ(fp , fq ) denotes the delta function defined by 1 if fp = fq and 0 otherwise. Given an image, it is necessary that the segmentation boundaries align with contours of high image contrast. This process is modeled using the smoothness term Vp,q defined by (4) Vp,q (fp , fq ) = exp −||zp − zq ||2 /β where the constant β is chosen by the expectation of 2||zp − zq ||2 over all {p, q} ∈ N . The data term Dp measures how well label fp fits pixel p given the observed data zp . We model the data term by using the given trimap T and the foreground and background color likelihoods of P (·|1) and P (·|0), respectively. The likelihoods are modeled by using Gaussian mixtures in the RGB color space, learned from image frames labelled from earlier in the sequence. Using the trimap information the data term Dp can be defined as ⎧ ⎨ − log P (zp |fp ) if p ∈ U WfOp (zp |fp ) if p ∈ O Dp (fp ) = (5) ⎩ if p ∈ B WfBp (zp |fp ) In the above equation WfOp (zp )’s and WfBp (zp ) are defined by WfOp (zp ) = −Wfp log P (zp |fp ) WfBp (zp ) = −W1−fp log P (zp |fp )
(6)
Accurate Foreground Extraction Using Graph Cut with Trimap Estimation
1189
Fig. 4. Pixel foreground likelihoods: (a) image of the 250th frame in the test data JH1, (b) brightness distortion likelihood − log f b (αp |p), (c) scaled chromaticity distortion likelihood −η log f c (γp |p), where η = 5. The likelihoods are truncated by 255.
for all p ∈ P and W1 > 1 > W0 . This data term model boosts the likelihood of − log P (·|fp ) but suppresses − log P (·|1 − fp ) for the pixel p of T (p) = fp where fp = 0 or 1. The estimated foreground and background seeds can be incorrectly labeled, thus the trimap information is used to distort the color likelihoods in the data term and the final segmentation label is given by a graph cut. Minimization of the energy (3) is done by a standard min-cut/max-flow algorithm [2]. Figure 3 (c) illustrates the proposed graph cut framework using the estimated trimap information.
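As an illustration of the smoothness term of Eq. (4), the sketch below (hypothetical Python/NumPy, not the authors' code) computes the contrast-dependent weights for a 4-connected neighborhood; restricting N to horizontal and vertical pixel pairs is an assumption made here for simplicity.

    import numpy as np

    def smoothness_weights(img):
        """Contrast term exp(-||z_p - z_q||^2 / beta) of Eq. (4) for all
        vertical and horizontal neighbor pairs of an H x W x 3 image,
        with beta set to the expectation of 2 * ||z_p - z_q||^2."""
        img = img.astype(np.float64)
        dv = np.sum((img[1:, :] - img[:-1, :]) ** 2, axis=-1)   # vertical pairs
        dh = np.sum((img[:, 1:] - img[:, :-1]) ** 2, axis=-1)   # horizontal pairs
        beta = np.mean(2.0 * np.concatenate([dv.ravel(), dh.ravel()]))
        return np.exp(-dv / beta), np.exp(-dh / beta)

In the graph these weights are attached to the n-links between neighboring pixels, so they are only paid when fp and fq differ, which corresponds to the delta factor in V(f).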
3
The Trimap Estimation Method Using Region Likelihood
To extract an accurate foreground silhouette, it is very important that the set O in the trimap T should not contain any background pixels but should contain as many foreground pixels as possible, and vice versa for the set B. To accomplish this, we estimate the seeds of T in the region units, instead of in the pixel units. Region unit processing can have the a denoising effect which can allow for recovering clear outlines of the foreground objects. This processing was motivated from the perceptual grouping principles for object segmentation of still images[17]. We first propose the background color distribution model in section 3.1, and the region likelihood is determined in section 3.2. Finally, the region decision function for trimap estimation is proposed in section 3.3. 3.1
Brightness and Chromaticity Background Likelihood
Horprasert et. al[6] proposed the statistical background model that separated the brightness from the chromaticity component. They modeled a pixel p by 4-tuple < μp , σp , ap , bp > where μp was the expected color value, σp was the standard deviation of the RGB color value, ap was the variation of the brightness distortion, and bp was the variation of the chromaticity distortion of pixel p. Using the background model, we propose the foreground likelihood l(p) for each pixel p. The object likelihood is decomposed of the brightness and chromaticity likelihoods. From the definitions of the distortions , the distribution
1190
J.-H. Ahn and H. Byun
Fig. 5. Region unit processing for trimap estimation. (a) original image of the 581st frame in the test data JH3 with a bounding box, (b) mean shift segmentation of the image within the bounding box, (c) pixel foreground likelihoods l(p)’s, (d) naive region likelihood Lp (Ri ) (e) white area shows the regions that touch the bounding box.
f b of the brightness distortion α of pixel p can be modelled using normal distribution of mean 1 and variance a2p , α ∼ N (1, a2p ). The distribution f c of the chromaticity distortion γ at pixel p can be approximated by one-sided normal distribution of mean 0 and variance bp 2 , 2 exp(−γ 2 /(2bp 2 )), f c (γ|p) = √ 2πbp
γ ≥ 0.
(7)
Assuming that brightness distortion αp and chromatic distortion γp are independent, the naive background probability density function fB of pixel p can be given by fB (Cp |p) = f b (αp |p)f c (γp |p) , (8) where Cp is the RGB color vector of pixel p, and αp and γp are calculated as in [6]. Figure 4 shows that the brightness and chromaticity distortions complement each other, thus our independence assumption can be empirically supported. The pixel foreground likelihood l of pixel p is given by l(p) = − log(f b (αp |p)) − η log(f c (γp |p)),
(9)
where a constant η is introduced since the chromaticity distortion is relatively smaller than the brightness distortion in practice. Note that l(p) = − log(fB (Cp |p)) when η = 1.
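The pixel foreground likelihood of Eq. (9) can be evaluated directly from the background model parameters of [6]. The sketch below (hypothetical Python/NumPy, not the authors' code) assumes the brightness distortion αp, the chromaticity distortion γp and the per-pixel variations ap and bp are already available; the default η = 5 is the value used for Fig. 4 and is only illustrative.

    import numpy as np

    def pixel_foreground_likelihood(alpha, gamma, a, b, eta=5.0):
        """l(p) = -log f_b(alpha_p | p) - eta * log f_c(gamma_p | p), Eq. (9)."""
        # f_b is the normal density N(1, a^2) of the brightness distortion
        neg_log_fb = 0.5 * np.log(2.0 * np.pi * a ** 2) + (alpha - 1.0) ** 2 / (2.0 * a ** 2)
        # f_c is the one-sided normal density of Eq. (7) for the chromaticity distortion
        neg_log_fc = (-np.log(2.0) + 0.5 * np.log(2.0 * np.pi * b ** 2)
                      + gamma ** 2 / (2.0 * b ** 2))
        return neg_log_fb + eta * neg_log_fc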
3.2 Region Likelihoods
In every frame we perform image segmentation in order to partition each image into homogeneous small regions. We define R = {Ri }i∈I as the set of the regions. For computational efficiency we perform it in the bounding box surrounding the foreground object. The bounding boxes are obtained by the maximal connected component of the foreground regions that are extracted by pixel-wise background subtraction. The foreground region likelihood L(Ri ) of region Ri ∈ R is calculated by L(Ri ) = λ1 Lp (Ri ) + λ2 Lo (Ri ). (10)
Accurate Foreground Extraction Using Graph Cut with Trimap Estimation
1191
Fig. 6. Foreground segmentation with trimap estimation. (a) previous foreground object silhouette of the 580th frame, (b) regularization term Lo (Ri ) scaled by 255, (c) foreground region likelihood L(Ri ) truncated by 255. (d) the estimated trimap, (e) silhouette obtained by the proposed algorithm, (g) silhouette obtained by the Horprasert’s algorithm[6]; deep shadow are on the floor and the teaching desk. .
In the above equation, Lp(Ri) is the naive foreground region likelihood given by the arithmetic mean of the pixel foreground likelihoods l(p), i.e., $L_p(R_i) = \sum_{p \in R_i} l(p) / n_{R_i}$, where nRi is the number of pixels in the region Ri. The naive region likelihood is not enough to decide the foreground object region, especially when the occluding foreground parts are a similar color to the occluded background parts or when there are deep shadows. Figure 5 shows an example of this case. To overcome this problem we use the regularization term Lo(Ri). This is the overlapping ratio of region Ri, given by noRi / nRi, where noRi is the number of pixels in region Ri that belong to the previous foreground object region, and λ2 is a regularization parameter. Figure 5 shows the pixel and naive foreground region likelihoods of an image. Some foreground regions with lower naive region likelihoods were supplemented by Lo in Fig. 6 (c). In the experiments we set λ1 = 0.8 and λ2 = 30.
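The region likelihood of Eq. (10) then combines the mean pixel likelihood with the overlap ratio. The sketch below (hypothetical Python/NumPy, not the authors' code) assumes a per-pixel likelihood map l, a region label map from the mean shift segmentation, and the previous frame's foreground mask; λ1 = 0.8 and λ2 = 30 follow the values reported above.

    import numpy as np

    def region_likelihoods(l, labels, prev_fg_mask, lam1=0.8, lam2=30.0):
        """L(R_i) = lam1 * Lp(R_i) + lam2 * Lo(R_i) for every region (Eq. (10))."""
        n_regions = labels.max() + 1
        L = np.zeros(n_regions)
        for i in range(n_regions):
            mask = (labels == i)
            n = mask.sum()
            if n == 0:
                continue
            Lp = l[mask].mean()                         # naive region likelihood
            Lo = prev_fg_mask[mask].sum() / float(n)    # overlap with previous object
            L[i] = lam1 * Lp + lam2 * Lo
        return L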
3.3 Trimap Decision
By using the region likelihoods L(Ri) in (10), the trimap can be estimated by using the region decision function fD : R → {−1, 0, 1} defined by

$$f_D(R_i) = \begin{cases} 1 & \text{if } L(R_i) > T_U \\ 0 & \text{if } L(R_i) < T_L \\ -1 & \text{otherwise} \end{cases} \qquad (11)$$

where the thresholds TU and TL are related to the level of confidence in the trimap information. We set TU = 170 and TL = 100 in the experiments. Furthermore, we define RB as the collection of the regions that touch the bounding box. The regions in RB are assumed to belong to the background area, but when a region Rj belongs to RB and fD(Rj) = 1, that region is labeled as unknown. Figure 5 (e) shows the regions in RB, and an example of the estimated trimap is shown in Fig. 6 (d).
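The decision of Eq. (11), including the special handling of regions touching the bounding box, can then be written as follows (hypothetical Python, not the authors' code); TU = 170 and TL = 100 are the values used in the experiments.

    def decide_trimap(L, touches_box, t_u=170.0, t_l=100.0):
        """Region decision function f_D of Eq. (11) with the R_B handling of Sec. 3.3."""
        out = []
        for Li, in_RB in zip(L, touches_box):
            if Li > t_u:
                fd = 1
            elif Li < t_l:
                fd = 0
            else:
                fd = -1
            if in_RB:
                # Regions touching the bounding box are assumed background,
                # except that a region decided as foreground is kept unknown.
                fd = -1 if fd == 1 else 0
            out.append(fd)
        return out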
4
Experimental Results
Performance of the proposed method was evaluated with respect to the groundtruth segmentation of every tenth frame in each of five 700-frame test sequences
1192
J.-H. Ahn and H. Byun
Table 1. Segmentation Error(%). The proposed algorithm(GT) is compared with the Horprasert’s algorithm(HP) and the normal Graph Cut algorithm(GC) using color likelihoods on the test sequences of JH1, JH2, JH3, GT 1 and KC1.
GT HP GC
JH1
JH2
JH3
GT 1
KC1
0.47 2.59 10.86
0.45 5.22 12.90
0.98 1.67 12.89
0.74 1.66 16.91
1.58 3.01 23.91
Fig. 7. Comparison of extracted human silhouettes. (Top) Four frames of the test sequence JH3. (Middle) Foreground extraction using HP[6]. (Bottom) Foreground extraction using the proposed algorithm.
of 320×240 images. The ground-truth data was labeled manually. Each pixel was labeled as foreground, background or unknown. The unknown label occurred in one plus pixel and one minus pixel along the ground-truth foreground object boundaries The five test sequences were labeled by JH1, JH2, JH3, GT 1, KC1. Each was composed of different backgrounds and people. In all sequences, one person entered the scene, moved around and assumed natural poses, and the whole body was shown for gesture recognition. A person is shown under a light in KC1 so that deep shadows are cast over the floor and wall. Segmentation performance of the proposed method(GT) was compared with that of the Horprasert’s background subtraction method(HP) [6] and the video version of the GrabCut method(GC) [12]. The only difference between GT and GC is that GT used a trimap but GC did not. Table 1 shows that the proposed method outperformed the both methods. The error rate was calculated within the bounding boxes only ignoring the mixed pixels. In the experiments we used ten Gaussian mixtures for color likelihood models and the mean shift segmentation method [3] was used to obtain the regions Rt . Human silhouette results are
Accurate Foreground Extraction Using Graph Cut with Trimap Estimation
1193
Fig. 8. Comparison of the silhouettes. (Top) Four frames of the test sequence KC1. (Middle) Foreground extraction using HP[6]. (Bottom) Foreground extraction using the proposed algorithm.
shown in Fig.7 and Fig. 8. The average running time of the proposed algorithm was 14.5 fps on a 2.8 GHz Pentium IV desktop machine with 1 GB RAM.
5
Conclusions
This paper has addressed accurate foreground extraction in video applications. We proposed a novel foreground segmentation method using a graph cut with an estimated trimap. We first estimated the trimap by partitioning the image into three regions: foreground, background and unknown regions. Then the trimap information was incorporated into the graph cut framework by distorting the foreground and background color likelihoods. The proposed algorithm showed good results on the real sequences and also worked at near real-time speed. However, about 60 percent of the processing time of the proposed algorithm was taken from the mean shift segmentation. In future work we are developing an efficient image segmentation method to find proper automic regions. Acknowledgements. This research was supported by the Ministry of Information and Communication, Korea under the Information Technology Research Center support program supervised by the Institute of Information Technology Assessment, IITA-2005-(C1090-0501-0019), and the Korea Science and Engineering Foundation (KOSEF) through the Biometrics Engineering Research Center (BERC) at Yonsei University.
1194
J.-H. Ahn and H. Byun
References 1. Y. Boykov, and M. Jolly. Iterative graph cuts for optimal boundary and region segmentation of objects in N-D Images: In: Proc. IEEE Int. Conf. on Computer Vision, (2001) 105-112. 2. Y. Boykov and V. Kolmogorov. An experimental comparision of min-cut/max-flow algorithms for energy minimization in vision: IEEE Trans. on Pattern Anal.and Mach. Intell. 26(9) (2004) 1124-1137. 3. D. Comaniciu, P. Meer: Mean shift: a robust approach toward feature space analysis: IEEE Trans. Pattern Anal. Mach. Intell. 24(5) (2002) 603-619. 4. Y.-Y. Chuang, B. Curless, D. Salesin and R. Szeliski, A bayesian approach to digital matting: In: Proc. Int. Conf. Computer Vison and Pattern Recognition 2 (2001) 264-271. 5. A. Elgmmal, R. Duraiswami, L.S. Davis: Efficient kernel density estimation using the fast gauss transform with applications to color modeling and tracking: IEEE Trans. Pattern Anal. Mach. Intell. 25(11) (2003) 1499-1504. 6. T. Horprasert, D. Harwood, L.S. Davis: A statistical approach for real-time robust background subtraction and shadow detection. In: Proc. IEEE Frame Rate Workshop (1999) 1-19. 7. M. Harville: A framework for high-level feedback to adaptive, per-pixel, mixtureof-Gaussian background models. In: Proc. European Conf. on Computer Vision (2002) 543-560. 8. V. Kolmogorov, A. Criminisi, A. Blake, G. Cross and C. Rother: Bi-layer segmentation of binocular stereo video. In: Proc. Int. Conf. on Computer Vision and Pattern Recognition, (2005). 9. Y. Li, J. Sun, C.-K. Tang and H.-Y. Shum: Lazy snapping. ACM Trans. Graphics 23(3) (2004) 303-308. 10. H. Li, M. Greenspan: Multi-scale gesture recognition from time-varying contours. In: Int. Conf. Computer Vision (2005) 236-243. 11. Z. Liu, S. Sarkar: Effect of silhouette quality on hard problems in gait recognition. IEEE Trans. Systems, Man, and Cybernetics-Part B:Cybernetics 35(2) (2005) 170-183. 12. C. Rother, V. Kolmogorov, and A. Blake. GrabCut: Interactive foreground extraction using iterated graph cuts: In: ACM Trans. Graph, 23(3)(2004) 309-314. 13. A. Senior: Tracking people with probabilistic appearance models. In: Proc. IEEE Int. Workshop on PETS, (2002) 48-55. 14. C. Stauffer, W.E.L. Grimson: Learning patterns of activity using real-time tracking. IEEE Trans. Pattern Anal. Mach. Intell. 22 (2000) 747-757. 15. J. Sun, J. Jia, C.-K. Tang and H.-Y. Shum, Poisson Matting: ACM Transaction on Graphics, 23(3) (2004) 315-321. 16. Y.-L. Tian, M. Lu, A. Hampapur: Robust and efficient foreground analysis for real-time video surveillance. In: Proc. Int. Conf. Computer Vision and Pattern Recognition (2005) 970-975. 17. Z. Tu: An integrated framework for image segmentation and perceptual grouping. In: Int. Conf. Computer Vision (2005) 670-677. 18. L. Wang, T. Tan, H. Ning, W. Hu: Silhouette analysis-based gait recognition for human identification. IEEE Trans. Pattern Anal. Mach. Intell. 25(12) (2003) 15051518.
Homography Optimization for Consistent Circular Panorama Generation Masatoshi Sakamoto1 , Yasuyuki Sugaya2, and Kenichi Kanatani1 1
Department of Computer Science, Okayama University, Okayama 700-8530, Japan {sakamoto, kanatani}@suri.it.okayama-u.ac.jp 2 Department of Information and Computer Sciences, Toyohashi University of Technology, Toyohashi, Aichi 441-8480 Japan
[email protected] Abstract. A 360◦ panorama is generated from images freely taken by a hand-held camera: images are pasted together and mapped onto a cylindrical surface. First, the lack of orientation information is overcome by invoking “oriented projective geometry”. Then, the inconsistencies arising from a 360◦ rotation of the viewing direction are resolved by optimizing all the homographies between images simultaneously, using Gauss-Newton iterations based on a Lie algebra representation. The effectiveness of our method is demonstrated using real images.
1
Introduction
A panorama is an image with a large angle of view, giving viewers an impression as if they were in front of a real scene. In this paper, we consider a panorama that covers all 360◦ directions around the viewer. A typical approach for realizing this is to take images with a special optical system such as an fish-eye lens camera, a mirror-based omnidirectional camera, a composite multicamera system, or a camera rotation mechanism [16]. Such optical systems have developed for autonomous robot navigation applications [17]. Another approach is to paste multiple images together, known as image mosaicing [13,15]. This technique has been studied in relation to such video image processing as image coding, image compression, background extraction, and moving object detection [2,3,10,12]. However, the aim of this paper is not such industrial or media applications. We consider a situation where travelers take pictures around them using an ordinary digital camera, go home, and create panoramic images for personal entertainment. The purpose of this paper is to present a software system that allows this without using any special device or requiring any knowledge about the camera and the way the pictures were taken. The principle of image mosaicing is simple. If we find four or more corresponding points between two images, we can compute the homography (or projective transformation) that maps one image onto the other. Hence, we can warp one image according to the computed homography and past it onto the other image. Continuing this, we can create a panoramic image. L.-W. Chang, W.-N. Lie, and R. Chiang (Eds.): PSIVT 2006, LNCS 4319, pp. 1195–1205, 2006. c Springer-Verlag Berlin Heidelberg 2006
1196
M. Sakamoto, Y. Sugaya, and K. Kanatani
However, the images to be pasted distort as we proceed and diverge to infinity when the viewing direction changes by 90◦ . This can be avoided if we map the images onto a cylindrical surface and unfold it. For this, however, we need to know the orientations of the cameras that took the individual images. That information would be obtained if we used a special device such as an omnidirectional camera or a camera rotation mechanism, but we are assuming that no knowledge is available about camera orientations. We resolve this difficulty by noting that even though a homography-based panorama cannot be displayed on a planar surface beyond a ±90◦ range, it is nevertheless mathematically defined over the entire 360◦ range. If we invoke the formalism of oriented projective geometry [11,14], we can consistently define the pixel values in all directions expressed in homogeneous coordinates, which can then be mapped onto a cylindrical surface. However, another critical issue arises: if we warp one image and successively paste it onto another, the final image may not agree with the initial image due to accumulated errors. To overcome this, we present a numerical scheme for optimizing all the homographies between images simultaneously subject to the condition that no inconsistency arises. This can be done by using GaussNewton iterations based on a Lie algebra representation. We demonstrate the effectiveness of our method using real images.
2 2.1
Panorama Generation Using Homographies Homographies
As is well known, images taken by a camera rotating around the center of the lens are related to each other by homographies (projective transformations). This holds for whatever camera motion if the scene is planar or is sufficiently far away. In whichever case, let (x, y) be a point in one image and (x , y ) the corresponding point in another. If we represent these by 3-D vectors ⎛ ⎛ ⎞ ⎞ x/f0 x /f0 x = ⎝ y/f0 ⎠ , x = ⎝ y /f0 ⎠ , (1) 1 1 where f0 is an arbitrary constant, the homography relationship is written in the form x = Z[Hx]. (2) Here, Z[ · ] denotes scale normalization to make the third component 1, and H is a 3 × 3 nonsingular matrix determined by the relative motion of the camera and its intrinsic parameters. For simplicity, we call the matrix H also a “homography”. As Eq. (2) implies, the absolute magnitude and the sign of the matrix H are indeterminate. If we replace the constant f0 in Eqs. (1) by another value f˜0 , the matrix H in Eq. (2) changes into ˜ ˜ = diag(1, 1, f0 )Hdiag(1, 1, f0 ), H f0 f˜0
(3)
Homography Optimization for Consistent Circular Panorama Generation
1197
where diag( · · · ) denotes the diagonal matrix with diagonal elements · · · in that order). 2.2
Optimal Estimation of Homographies
Equation (2) states that vectors x and Hx are parallel to each other, so it is equivalently rewritten as x × Hx = 0. (4) Given N corresponding points (xα , yα ) and (xα , yα ), α = 1, ..., N , between two images, we let xα and xα be their vector representations in the form of Eqs. (1), ¯ α and x ¯ α their true positions in the absence of noise. If we regard the and x uncertainty of the x and y coordinates of each point as random Gaussian noise of mean 0 and a constant standard deviation, statistically optimal estimation of the homography H reduces to the minimization of J= subject to the constraint
N 1 ¯ α 2 , xα − x 2 α=1
¯ α × H x ¯ α = 0. x
(5)
(6)
Eliminating the constraint by introducing Lagrange multipliers and ignoring higher order error terms, we can rewrite Eq. (5) in the following form1 [6]: J=
N 1 (x × Hxα , W α (xα × Hxα )), 2 α=1 α
− W α = xα × HP k H × xα + (Hxα ) × P k × (Hxα ) .
(7) (8)
Here, ( · )− denotes pseudoinverse, and P k is the following projection matrix: P k = diag(1, 1, 0).
(9)
In this paper, we denote by u × A the matrix whose columns are the vector products of u and the columns of the matrix A, and by A × v the matrix whose rows are the vector products of v and the rows of A. 2.3
Panorama Generation
Suppose the user holds a camera roughly horizontally and takes pictures around him roughly at an equal angle. We assume that the scene is sufficiently far away. Since no mechanical device is used, this is merely an approximation. We assume that the focal length is unknown and different from picture to picture. The user first specifies corresponding points between adjacent images. Various automatic matching techniques have been proposed [8,9,18], but some mismatches are unavoidable using them. In our experiments, we manually selected corresponding points. 1
The source code of the program that minimizes Eq. (7) by a technique called renormalization [7] is available at http://www.suri.it.okayama-u.ac.jp
1198
M. Sakamoto, Y. Sugaya, and K. Kanatani
Fig. 1. Successively warping and pasting images using homographies Π
P
I θ
fr
p
O
Fig. 2. The rectangular region I on the tangent plane Π to the cylinder
Then, we compute the homography H between two neighboring images and paste one image onto the other via Eq. (2). Figure 1 is a part of the panoramic image thus created. We can see that images distort more as the viewing direction moves; the distortion diverges to infinity when the viewing direction is orthogonal to its initial orientation.
3 3.1
Cylindrical Panorama Generation Mapping onto a Cylindrical Surface
The image divergence is avoided if we map the input images onto a cylindrical surface around the viewpoint and unfold it. However, the camera orientation information is missing. This is resolved by invoking oriented projective geometry [11,14]. First, we do the following preparation: – Imagine a hypothetical cylinder of radius fr around the viewpoint O and define a (θ, h) cylindrical coordinate system (θ around the circumference and h in the axial direction). A point with cylindrical coordinates (θ, h) is unfolded onto a point with Cartesian coordinates (fr θ, h). – Let Π be the plane tangent to the cylinder along the line θ = 0. Define an xy coordinate system on it such that the x-axis coincides with the line h = 0. – Define an xy coordinate system on each input image such that the origin is at the center of the frame with the x-axis extending upward and the y-axis rightward. We identify this xy coordinate system with the xy coordinate system on Π and define on Π a rectangular region I of the same size as input images centered on (θ, h) = (0,0) (Fig. 2). – Number the input images in the order of adjacency, and compute the homography H k(k+1) that maps the kth image onto the (k + 1)th image from
Homography Optimization for Consistent Circular Panorama Generation
1199
Fig. 3. Circular panorama corresponding to Fig. 1
the specified corresponding points, k = 1, ..., M , where we use fr in the place of f0 in Eqs. (1) (see Eq. (3)). We choose the sign2 of H k(k+1) so that det H k(k+1) > 0. Then, we compute the pixel value of each point p with (discretized) cylindrical coordinates (θ, h) as follows (Fig. 2): 1. Compute the intersection P of the plane Π with the line passing through the viewpoint O and the point p on the cylinder, using homogeneous coordinates. and OP have the same 2. If P is inside the region I and if the vectors Op orientation, copy the pixel value3 of P in the first image to p. + and H 12 Op, 3. Else, let p be the point on the cylinder such that4 Op compute the intersection P of Π with the line passing through O and p . and OP have the same 4. If P is inside the region I and if the vectors Op orientation, copy the pixel value of P in the second image to p. + , and 5. Else, let p be the point on the cylinder such that Op H 23 Op compute the intersection P of Π with the line passing through O and p . and OP have the same 6. If P is inside the region I and if the vectors Op orientation, copy the pixel value of P in the third image to p. 7. Repeat the same process over all the images and stop. If no pixel value is obtained, the value of p is undefined. The radius fr of the cylinder is arbitrary in principle. However, if we require that the viewing direction should agree with the physical direction, e.g., two viewing directions that make 30◦ actually make 30◦ in the scene, the radius fr should coincide with the focal length f1 for the first image. This is because the mapping onto the cylinder starts with the first image; the subsequent images are successively warped so as to agree with it. The computation of the focal lengths for all the images will be described in Sec. 4. Figure 3 is a circular panorama thus generated using the images for Fig. 1. Since the first image is mapped onto the cylinder, next the second image onto the rest of the cylinder, then the third, and so on, images with smaller numbers look “above” images with larger numbers (we can reverse the order, of course). 2
3
4
In practice, it is sufficient if the (33) element is positive. If it is 0, the image origin is mapped to infinity. This does not happen between two overlapping images. The pixel value of a point with non-integer coordinates is bilinearly interpolated from surrounding pixels. +
The relation denotes that one side is a multiple of the other side by a positive number.
1200
M. Sakamoto, Y. Sugaya, and K. Kanatani
Fig. 4. If neighboring images are successively pasted, the final image does not correctly match the initial image
However, this makes the last image “below” the first image. For consistency, we place the last image above the first image where they overlap (we omit the details). 3.2
Oriented Projective Geometry
In the above procedure, the intersection P of the tangent plane Π is computed in homogeneous coordinates, so no computational failure occurs if P is at infinity. In the standard projective geometry, a point P on the plane Π is identified with the “line of sight” l passing through P and the viewpoint O (the point P is regarded as located at infinity if the line l is parallel to Π). Since the line of sight l is not oriented, no distinction can be made between “in front of” or “behind” the viewpoint O. It is shown, however, that almost all properties of projective geometry is preserved if the line of sight is oriented [14]. This “oriented projective geometry” was found to be very useful for computer vision applications [11]. In the preceding procedure, we utilized this framework when we signed H k(k+1) so and that det H k(k+1) > 0 and compared the “orientations” of the vectors Op , etc. Op
4
Simultaneous Homography Optimization
4.1
Discrepancies of the Circular Mapping
If we compute the homographies between two images independently for all pairs, the final image does not necessarily match the initial image correctly, as shown in Fig. 4. This is because due to the accumulation of numerical errors and image distortions, the composite mapping H M1 H (M−1)M H (M−2)(M−1) · · · H 23 H 12
(10)
does not necessarily define the identity mapping (M is the number of images). This is not a serious problem if the purpose of the circular panorama is simply for displaying the unfolded image. A more important application is, however, to let the user feel virtual reality by interactively moving the scene as the viewer changes the viewing direction5 . This can be don by remapping, each time the 5
The viewing direction is controlled by a mouse, keyboard, or a joystick. If the user wares a head-mount display (HMD), the head orientation can be measured from the signal it emits.
Homography Optimization for Consistent Circular Panorama Generation
1201
user specifies the viewing direction, the corresponding part of the cylinder onto its tangent plane6 . For such applications, no inconsistency is allowed anywhere around the cylinder. To resolve this, Shum and Szeliski [13] introduced a postprocessing for distributing the inconsistency equally over all image overlaps. Here, we adopt a more consistent approach: we optimize all the homographies simultaneously subject to the constraint that Eq. (10) define the identity map. As a byproduct, the orientations and the focal lengths of the cameras that took the input images are optimally estimated. 4.2
Parameterization of Homographies
While a general homography has 8 degrees of freedom7 , the homography arising from camera rotation and focal length change has only 5 degrees of freedom8 . If we take the kth image with focal length f , rotate the camera around the lens center by R (rotation matrix), and take the (k + 1)th image with focal length f , the homography H k(k+1) that maps the kth image onto the (k + 1)th image has the form H k(k+1) = diag(1, 1,
f0 f1 )R diag(1, 1, ). fk+1 k(k+1) f0
(11)
Equation (10) defines the identity mapping if and only if R12 R23 · · · R(M−1)M RM1 = I,
(12)
where I is the unit matrix. So, we minimize (cf. Eq. (7))
J=
M Nk(k+1) 1 k(k+1) (xk+1 × H k(k+1) xkα , W α (xk+1 × H 12 xkα )), α α 2 α=1
(13)
k=1
subject to the constraint that all H k(k+1) have the form of Eq. (11) in such a way that Eq. (12) is satisfied. In Eq. (13), we assume that Nk(k+1) points xkα in the kth image correspond to the Nk(k+1) points xk+1 in the (k + 1)th image, α k(k+1) and the subscript k is computed modulo M . The matrix W α is the value of W in Eq. (8) for the kth and the (k + 1)th images. Through Eq. (11), the function J in Eq. (13) is regarded as a function of the focal lengths f1 , f2 , ..., fM and the rotations R12 , R23 , ..., RM1 ; we minimize J with respect to them subject to Eq. (12). 6
7 8
A well known such system is QuickTime VR [1], for which input images are taken using a special camera rotation mechanism. The 3 × 3 matrix H has 9 elements, but there is an overall scale indeterminacy. The camera rotation matrix R has 3 degrees of freedom, to which are added the focal lengths f and f before and after the camera rotation.
1202
4.3
M. Sakamoto, Y. Sugaya, and K. Kanatani
Lie Algebra Approach
Each rotation Rk(k+1) is specified by three parameters, but using a specific parameterization such as the Euler angles θ, φ, and ψ complicates the equation. So, we adopt the well known method of Lie algebra [4,5]. Namely, instead of directly parameterizing Rk(k+1) , we specify its “increment9 ” in each step. To be specific, we exploit the fact that a small change of rotation Rk(k+1) has the following form [4,5]: Rk(k+1) + ω k(k+1) × Rk(k+1) + · · · .
(14)
Here, · · · denotes terms of order 2 or higher in ω k(k+1) . If we replace Rk(k+1) in H k(k+1) by Eq. (14), Eq. (13) can be regarded as a function of f1 , ..., fM and ω 12 , ..., ω M1 . After minimizing it with respect to them, the rotations Rk(k+1) are updated by Rk(k+1) ← R(ω k(k+1) )Rk(k+1) , (15) where R(ω) denotes the rotation around axis ω by angle ω. The derivatives of J with respect to f1 , ..., fM and ω 12 , ..., ωM1 can be analytically calculated (we omit the details). If we introduce the Gauss-Newton approximation, we can also calculate the second derivatives in simple analytic forms (we omit the details). It seems, therefore, that we can minimize J by Gauss-Newton iterations. However, there is a serious difficulty in doing this. 4.4
Alternating Optimization Approach
We need to enforce the constraint of Eq. (12). To a first approximation, Eq. (12) is expressed in the following form (we omit the details): ω 12 + R12 ω23 + R12 R23 ω 34 + · · · + R12 R23 · · · R(M−1)M ω M1 = 0.
(16)
A well known strategy for constrained optimization is the method of projection: the parameters are incremented without considering the constraint and then projected onto the constraint surface in the parameter space. However, if we try to minimize Eq. (13) without considering Eq. (16), the solution is indeterminate (the Hessian has determinant 0). We resolve this difficulty by adopting the alternate optimization. Namely, Eq. (13) is first minimized with respect to f1 , ..., fM with R12 , ..., RM1 fixed. Then, the result is minimized with respect to ω 12 , ..., ωM1 with f1 , ..., fM fixed. This time, the solution is unique because the Hessian of J with respect to ω12 , ..., ωM1 alone is nonsingular. Next, Eq. (16) is imposed by projection in the form of ˆ k(k+1) = ω k(k+1) − Δω k(k+1) . ω 9
(17)
The linear space defined by such (mathematically infinitesimal) increments is called the Lie algebra so(3) of the group of rotations SO(3) [4].
Homography Optimization for Consistent Circular Panorama Generation
1203
Fig. 5. After the simultaneous optimization of homographies, the discrepancy in Fig. 4(b) disappears
The correction Δω k(k+1) is determined as follows. The condition for Eq. (17) to satisfy Eq. (16) is written as ˜ = e, SΔω (18) where we define
S = I R12 R12 R23 · · · R12 R23 · · · R(M−1)M ,
˜ = Δω , Δω 12 Δω 23 · · · Δω M1 e = ω12 + R12 ω 23 + R12 R23 ω34 + · · · + R12 R23 · · · R(M−1)M ω M1 .
(19)
˜ that minimizes Introducing Lagrange multipliers, we can obtain the solution Δω ˜ subject to Eq. (18) in the form Δω ˜ = S (SS )−1 e. Δω
(20)
From this, the correction formula of Eq. (17) reduces to the following form (we omit the details): ˆ 12 = ω12 − ω
1 1 1 ˆ 23 = ω 23 − ˆ 34 = ω 34 − e, ω R e, ω R R e, M M 12 M 12 23
1 R R · · · R (21) (M−1)M e. M 12 23 The rotations R12 , ..., RM1 are updated by Eq. (15). With these fixed, Eq. (13) is again minimized with respect to f1 , ..., fM , and the same procedure is iterated. Since Eq. (16) is a first approximation in ω k(k+1) , the rotations Rk(k+1) updated by Eq. (15) may not strictly satisfy Eq. (12). However, the discrepancy is very small, so we randomly choose one rotation matrix and replace it by the value that strictly satisfies Eq. (15), i.e., replace it by the inverse of the product of the remaining rotation matrices. Figure 5 is the result corresponding to Fig. 4. No discrepancy occurs this time. This can be confirmed by many examples, which are not shown here due to page limitation, though. ...,
5
ˆ M1 = ω M1 − ω
Concluding Remarks
We have shown a consistent technique for generating a 360◦ circular panorama from images taken by freely moving a hand-held camera.
1204
M. Sakamoto, Y. Sugaya, and K. Kanatani
The first issue is the lack of camera orientation information, which was resolved by computing the homographies between neighboring images and invoking the framework of “oriented projective geometry”. The second issue is the discrepancies between the final image and the initial image due to accumulated errors. We resolved this by simultaneously optimizing all the homographies. To this end, we introduced the Gauss-Newton iterations using the Lie algebra representation and the alternating optimization scheme. In practice, we need to add further corrections and refinements for eliminating intensity discontinuities and geometric discrepancies at individual image boundaries. This can be done easily using existing techniques (we omit the details).
References 1. S.E. Chen, QuickTime VR—An image-based approach to virtual environment navigation, Proc. SIGGRAPH’95 , August 1995, Los Angeles, U.S.A., pp. 29–38. 2. M. Irani, S. Hsu, and P. Anandan, Video compression using mosaicing representations, Signal Process: Image Comm., 7-4/5/6 (1995-11), 529–552. 3. M. Irani, P. Anandan, J. Bergen, R. Kumar, and S. Hsu, Efficient representations of video sequences and their applications, Signal Process: Image Comm., 8-4/5/6 (1996-11), 327–351. 4. K. Kanatani, Group-Theoretical Methods in Image Understanding, Springer, Berlin, Germany, 1990. 5. K. Kanatani, Geometric Computation for Machine Vision, Oxford University Press, Oxford, U.K., 1993. 6. K. Kanatani, Statistical Optimization for Geometric Computation: Theory and Practice, Elsevier, Amsterdam, the Netherlands, 1996; reprinted, Dover, New York, NY, U.S.A., 2005. 7. K. Kanatani, N. Ohta and Y. Kanazawa, Optimal homography computation with a reliability measure, IEICE Trans. Inf. & Syst., E83-D-7 (2000-7), 1369–1374. 8. Y. Kanazawa and K. Kanatani, Robust image matching preserving global consistency, Proc. 6th Asian Conf. Comput. Vision, January 2004, Jeju, Korea, Vol. 2, pp.1128–1133. 9. Y. Kanazawa and K. Kanatani, Image mosaicing by stratified matching, Image Vision Computing, 22-2 (2004-2), 93–103. 10. M.-C. Lee, C.-l. B. Lin, C. Gu, T. Markoc, S I. Zabinski, and R. Szeliski, A layered video object coding system using sprite and affine motion model, IEEE Trans. Circuits Video Tech., 7-1 (1997-2), 130–145. 11. S. Leveau and O. Faugeras, Oriented projective geometry for computer vision, Proc. 4th Euro. Conf. Comput. Vis., Vol. 1, April 1996, Cambridge, U.K., pp. 147–156. 12. H.S. Sawhney and S. Ayer, Compact representations of videos through dominant and multiple motion estimation, IEEE Trans. Patt. Anal. Machine Intell., 18-8 (1996-4), 814–830. 13. H.-Y. Shum and R. Szeliski, Construction of panoramic image mosaics with global and local alignment, Int. J. Comput. Vis., 36-2 (2000-2), 101–130. 14. J. Stolfi, Oriented Projective Geometry: A Framework for Geometric Computation, Academic Press, San Diego, CA, U.S.A., 1991. 15. R. Szeliski and H.-U. Shum, Creating full view panoramic image mosaics and environment maps, Proc. SIGGRAPH’97 , August 1997, Los Angeles, CA, U.S.A., pp.251–258.
Homography Optimization for Consistent Circular Panorama Generation
1205
16. Y. Yagi, Omnidirectional sensing and its applications, IEICE Trans. Inf. & Syst., E82-D-3 (1999-3), 568–579. 17. Y. Yagi, K. Imai, K. Tsuji and M. Yachida, Iconic memory-based omnidirectional route panorama navigation, IEEE Trans. Patt. Anal. Mach. Intell., 27-1 (2005-1), 78–87. 18. Z. Zhang, R. Deriche, O. Faugeras, and Q.-T. Luong, A robust technique for matching two uncalibrated images through the recovery of the unknown epipolar geometry, Artif. Intell., 78-1/2 (1995-10), 87–119.
Graph-Based Global Optimization for the Registration of a Set of Images Hui Zhou Epson Edge, Epson Canada Ltd., 3771 Victoria Park Ave., Toronto, Ontario, Canada M1W 3Z5
[email protected] Abstract. In this paper, we introduce a fast and efficient global registration technique in photo-stitching, in which, the layout of the set of images has multiple rows and columns. The proposed algorithm uses a graph based model to deal with the problem of global registration. By using alternative path from the graph containing the image layout, it is possible to align the images against the reference image when pair-wise registration fails. Keywords: Global registration; panoramic imaging; image mosaics.
1 Introduction With the popularity of digital cameras, creating mosaics and panoramic images by stitching multiple photos or video frames has become an active field for researchers and commercial practitioners. A number of approaches [1] [2] [3] [4] [5] [6] [7] [8] [9] [10] have been proposed. There are usually two ways to generate an image mosaic: frame-to-mosaic and frame-to-frame. Frame-to-mosaic aligns each new frame to the mosaic being built, so registration happens between each frame and the partially formed mosaic; frame-to-frame instead acquires the transforms between frames through pairwise registration and then maps all frames to the mosaic. The first approach registers an original frame against a transformed frame (the mosaic), and hence works well only when there is sufficient overlap between frames. The second approach can handle the large-view-angle case. However, since registration is done pairwise and locally, small registration errors between frames can accumulate and be amplified when all the frames are aligned together into one mosaic. One solution to this problem in the frame-to-frame case is to apply a global optimization step such that all the frame-to-mosaic mappings are maximally consistent with all the local pairwise registrations. The standard optimization technique is bundle adjustment, derived from photogrammetry [11]. Researchers have proposed global optimization techniques based on it. For example, Szeliski and Shum [5] proposed a global optimization based on bundle adjustment, in which the minimization is formulated as a constrained least-squares problem with a 3D rotational and focal length model. In [4], Sawhney et al. proposed a framework in which topology and registration are combined. The global alignment is done through
a way similar to the bundle adjustment approach. Xiong et al. [6] used a simulated annealing technique for global optimization. This paper proposes a novel algorithm for optimizing the registration of a set of images with multiple rows and multiple columns. The algorithm aims to estimate the optimal transform parameters for each frame with respect to the reference frame, such that the stitched output contains the reference frame, to which no transform needs to be applied. A registration graph is introduced and built through the registration steps. By using the registration graph, the proposed approach tries to optimize the registration result -- replacing failed registered pairs by registering other pairs, and reducing the registration error globally. In this paper, we assume that the layout of the image set is defined based on 4-neighbor connectivity, i.e., a frame can have overlapping areas with its up, down, left and right neighbors. Note that it is not necessary for an image to have all four overlapping neighbors; however, every image should have at least one connected neighbor, i.e., there are no isolated images. Pairwise registration is the basis of the proposed optimization algorithm. It has been well explored by researchers [12]. In this paper, a feature-based registration technique [13] is used for registering a pair of frames. The result of each pairwise registration is the list of matching points between the two images.
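The paper does not give implementation details of the feature-based pairwise step, so the following is only a minimal sketch: it assumes OpenCV's ORB detector and a RANSAC homography check as a stand-in for the technique of [13], and the function and threshold names are illustrative.

```python
# Hedged sketch: pairwise feature-based registration returning a matching-point list.
# ORB + RANSAC are stand-ins for the matcher of reference [13].
import cv2
import numpy as np

def pairwise_register(img_a, img_b, min_inliers=12):
    """Return (list of (pt_a, pt_b) inlier matches, 3x3 homography) or (None, None)."""
    orb = cv2.ORB_create(2000)
    kp_a, des_a = orb.detectAndCompute(img_a, None)
    kp_b, des_b = orb.detectAndCompute(img_b, None)
    if des_a is None or des_b is None:
        return None, None
    matcher = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
    matches = matcher.match(des_a, des_b)
    if len(matches) < min_inliers:
        return None, None
    pts_a = np.float32([kp_a[m.queryIdx].pt for m in matches])
    pts_b = np.float32([kp_b[m.trainIdx].pt for m in matches])
    H, mask = cv2.findHomography(pts_a, pts_b, cv2.RANSAC, 3.0)
    if H is None or int(mask.sum()) < min_inliers:
        return None, None           # treat this pair as a failed registration
    inliers = mask.ravel().astype(bool)
    point_list = list(zip(pts_a[inliers], pts_b[inliers]))
    return point_list, H
```

In the rest of the algorithm only the matching-point list matters; the transform can be re-estimated later from combined lists.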
2 Proposed Algorithm We start the registration from the reference image and its four neighbors. The reference image can be specified as any one of the images. In the example shown in Figure 1, we suppose the reference image is frame 5.
Fig. 1. An example of 3x3 image sequence. The resolution has been reduced for illustration purpose (original resolution 1360x1024 for each frame).
2.1 Registration Cycle The first cycle of registration is between frame 5 and its four neighbors, as illustrated in Figure 2. In the rest of the paper, we denote frames by numbers with the prefix 'F'. From Figure 2A, it is obvious that four pairwise registrations are needed: F2 against F5, F4 against F5, F6 against F5, and F8 against F5. After the first cycle of registration, we continue the registration for the frames farther from the reference frame. To avoid misalignment errors between unregistered image pairs, we align one unregistered frame with its two successfully registered neighboring frames simultaneously, as illustrated in Figure 2B. The next section elaborates on how to align one frame against more than one frame.
Fig. 2. Left (A): The four registrations between the reference (F5) and its four neighbors. The gray block indicates the reference frame, and white blocks indicate unregistered frames. Right (B): The 2nd registration cycle, in which frames are aligned against the frames successfully registered in the previous cycle (black blocks).
The registration cycles proceed such that frames closer to the reference frame have higher priority than frames farther from it. 2.2 Registering One Against More Than One Image The motivation for registering one frame against its two successfully registered neighbors, rather than against only one of them, is to obtain a better estimate by exploiting the fact that the frame shares overlapping regions with both of its register-able neighbors. In the example shown in Figure 2B, if we align frame 1 only against frame 2, there could be a large misalignment between frame 1 and frame 4 when we finally align all frames with the reference frame. In this paper, registering one frame against more than one frame is implemented by solving an augmented over-constrained problem. Solving the over-constrained linear equations follows the solution described in reference [1]. For example, suppose we have frame F1 and we need to align it with two other frames, F2 and F4. We take the following steps to align F1 with both F2 and F4 (a code sketch follows the list): 1) Pairwise registration of F1 and F2. If it succeeds, obtain the matching points list between F1 and F2 and call it Q1; otherwise quit. 2) Pairwise registration of F1 and F4. If it succeeds, obtain the matching points list between F1 and F4 and call it Q2; otherwise quit. 3) Combine Q1 and Q2 into Q. Set up a set of over-constrained linear equations based on Q, and solve it.
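A minimal sketch of step 3 is given below. It assumes the matched neighbors (F2 and F4) have already been mapped to the reference plane, so their destination coordinates are known, and it stacks both point lists into one least-squares system for a 2D affine transform. The paper solves its over-constrained equations as in reference [1] and may use a different motion model, so the parameterization here is only illustrative.

```python
# Hedged sketch: align F1 against several already-registered neighbors at once
# by stacking the matching-point lists into one over-constrained linear system.
import numpy as np

def solve_affine_from_lists(point_lists):
    """point_lists: iterable of lists of (p_src, p_dst) pairs, p_* = (x, y).
    Returns a 3x3 affine matrix M (least-squares) mapping src -> dst."""
    A, b = [], []
    for plist in point_lists:          # Q1, Q2, ... combined into one system
        for (xs, ys), (xd, yd) in plist:
            A.append([xs, ys, 1, 0, 0, 0])
            A.append([0, 0, 0, xs, ys, 1])
            b.extend([xd, yd])
    A, b = np.asarray(A, float), np.asarray(b, float)
    params, *_ = np.linalg.lstsq(A, b, rcond=None)   # 6 affine parameters
    a, bb, c, d, e, f = params
    return np.array([[a, bb, c], [d, e, f], [0, 0, 1]])

# Usage (hypothetical data): M_1_ref = solve_affine_from_lists([Q1_mapped, Q2_mapped])
```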
2.3 Registration Graph As the registration proceeds cycle by cycle, we construct a registration graph to record the registration information between frames. A registration graph is a directed graph in which each vertex represents a frame and each edge between vertices represents an alignment. The registration graphs shown in Figure 3 are the graphs constructed at different stages for the set of sample images shown in Figure 1. Pairwise registration fails for F4 against F5 and F6 against F5, but succeeds for F2 against F5 and F8 against F5, as shown in Status 1. The registration graph is updated after every registration cycle finishes. It should be mentioned that, in the registration cycles, a frame is registered only against its previously register-able neighbors. For example, in the 2nd registration cycle, frame 1 is registered only against F2, rather than against both F2 and F4, because F4 failed in the previous registration cycle. The same situation applies to F3 and F7. Figure 3 Status 2 shows the registration graph after all registration cycles finish. Apparently, not all images are successfully registered at this point.
(Figure 3: four 3x3 frame grids, labeled Status 1, Status 2, Status 3, and Status 4.)
Fig. 3. Illustration of the construction of the registration graph. An arrow indicates a successful pairwise registration; an arrow with a backslash through it indicates that pairwise registration failed for that pair.
2.4 Re-registration For the frames that fail to be registered after all registration cycles end, we perform registration again, aligning each such frame against one of its neighbors that can form an alternative path leading to the reference frame. For example, in the graph shown in Figure 3 Status 2, F4 failed to be aligned to the reference frame (F5); we should,
however, be able to align it with the reference if we can register it against frame 1, which was successfully registered in the 2nd registration cycle. But how do we find the proper neighbor to form a path from the current frame to the reference frame? In the example shown in Figure 3 Status 2, we could potentially choose F4 -> F1 -> F2 -> F5 or F4 -> F7 -> F8 -> F5. In this paper, a shortest-path algorithm is applied to find the proper neighbors. The reason for using the shortest path is the observation that the shorter the path, the smaller the accumulated registration error along the path. Floyd's shortest-path algorithm is used to find the shortest path in the registration graph. For the example shown in Figure 3 Status 2, the paths F4 -> F1 -> F2 -> F5 and F4 -> F7 -> F8 -> F5 are both shortest paths; we only need to pick one of them, say F4 -> F1 -> F2 -> F5. We likewise obtain a shortest path when trying to align frame 6 to the reference: F6 -> F3 -> F2 -> F5. After we find the shortest path and determine the proper neighbor of the unregistered frame, we register the frame against that neighbor and modify the registration graph if the registration succeeds. The graph shown in Figure 3 Status 2 becomes the one shown in Figure 3 Status 3 after we successfully register F4 against F1 and F6 against F3. Note that F9 is still not successfully aligned to the reference in Figure 3 Status 3. We handle it by repeating the process described above: find the path F9 -> F6 -> F3 -> F2 -> F5, register F9 -> F6, and modify the registration graph as shown in Figure 3 Status 4. We repeat this process -- shortest-path finding, determining the proper neighbor for registration, and modifying the registration graph -- until every frame has been successfully registered or there is no way to align the remaining unregistered frames.
2.5 Aligning to the Reference with Less Error By multiplying the transform matrices along the shortest path in the registration graph from any frame to the reference frame, we can obtain the direct transform matrix between each frame and the reference frame. For example, the transform matrix $M_{[1][5]}$ between frame 1 and frame 5 is obtained as $M_{[1][5]} = M_{[1][2]} \cdot M_{[2][5]}$, where $M_{[1][2]}$ and $M_{[2][5]}$ are the estimated transform matrices between F1 and F2 and between F2 and F5, respectively. The transform matrices obtained by multiplying the matrices along the path to the reference may not be exactly accurate. Suppose we have a very small error in the transform matrix between each pair:

$$M_{[1][5]} = \hat{M}_{[1][5]} + M_{\delta 15}, \quad M_{[1][2]} = \hat{M}_{[1][2]} + M_{\delta 12}, \quad M_{[2][5]} = \hat{M}_{[2][5]} + M_{\delta 25},$$

where $\hat{M}_{[1][2]}$, $\hat{M}_{[2][5]}$ and $\hat{M}_{[1][5]}$ are the ideal, correct transform matrices, $M_{[1][2]}$, $M_{[2][5]}$ and $M_{[1][5]}$ are the estimated transform matrices between frames 1 and 2, frames 2 and 5, and frames 1 and 5, respectively, and $M_{\delta 15}$, $M_{\delta 12}$ and $M_{\delta 25}$ are the corresponding error matrices between the correct and estimated matrices. Multiplying them, we get

$$\hat{M}_{[1][5]} + M_{\delta 15} = (\hat{M}_{[1][2]} + M_{\delta 12})(\hat{M}_{[2][5]} + M_{\delta 25}),$$

and since $\hat{M}_{[1][5]} = \hat{M}_{[1][2]} \hat{M}_{[2][5]}$, it follows that

$$M_{\delta 15} = \hat{M}_{[1][2]} M_{\delta 25} + M_{\delta 12} \hat{M}_{[2][5]} + M_{\delta 12} M_{\delta 25}.$$
Therefore, through matrix multiplication, the error may be accumulated and even amplified. This accumulated error becomes larger when the multiplication sequence is longer (i.e., when the registration path is longer). Even though we use the shortest path to reduce this accumulated error, the error can still be relatively large. We therefore propose an algorithm to reduce the effect of the cumulative error. During the transformation of a registered image, the matching point list along the registration path from the registered image to the reference image is remapped. In the example illustrated in Fig. 4, frame 1 is registered relative to the previously registered frame 2, which in turn is registered to the reference frame 5. A point P in frame 1 corresponds to a point Q in frame 2, and a point R in frame 2 corresponds to a point S in the reference frame 5.
Fig. 4. Correction of the accumulated error resulting from multiplication of transform matrices
The transform matrices M[1][2] between F1 and F2, and M[2][5] between images F2 and F5 are estimated by solving the corresponding matching point lists. A point Q* corresponding to the point Q after having been transformed to the reference image using M[2][5] can be calculated. A point P* corresponding to the point P after having been transformed to the reference image using M[2][5]•M[1][2] can also be calculated. It should be noted that points P* and Q* can be transformed to locations inside or outside of the reference image I5. Ideally, P* should be located at the same point as Q*. This is not, however, typically the case. P* can differ from Q* as Q* is calculated from M[2][5], whereas P*
is calculated using M[2][5]•M[1][2]. As noted above, a cumulative error can result from one or more matrix multiplications. As a result, Q* should be more accurate than P*. The transform matrix M[1][5] can then be corrected to M̂[1][5] by determining the transformation between the feature points in F1 and the corresponding transformed (i.e., transformed to F5) feature points from F2, thereby cancelling the additional error present in the M[1][5] obtained from the multiplied individual transformations. This correction is repeated for all registration paths containing three or more images. For illustration purposes, assume that, for a path from image I1 to image I2 to ... to image IN, image I1 is to be aligned to image IN. Pi and Pj are the coordinates of matched points in images Ii and Ij, respectively. The transform matrix M[1][N] can be adjusted to alleviate the effect of the cumulative error in an iterative way using the following approach (see the sketch after this list):
1) i ← N − 2, j ← i + 1
2) Pj' ← M[j][N] Pj
3) Solve M[i][N] Pi = Pj' to get M[i][N]
4) i ← i − 1, j ← i + 1
5) If i ≥ 1, go to step 2; otherwise end the algorithm.
After the alignment to the reference frame and the correction of the accumulated error, all frames successfully registered in the previous registration cycles and re-registration steps have been aligned to the reference frame. Each frame now has an estimated transform matrix that can be used to transform the frame onto the reference plane.
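The paper gives no implementation, so the following is only a rough sketch of Sections 2.3–2.5 under simplifying assumptions: the registration graph is stored as a dictionary of edges carrying 3×3 transform matrices, Floyd–Warshall (the paper's choice of Floyd's algorithm) yields the shortest path to the reference, the direct transform is the chained product along that path, and the correction loop re-solves each chained transform from remapped point lists. All names, the matrix-composition convention, and the `solve_affine_from_lists` helper from the earlier sketch are illustrative.

```python
# Hedged sketch of the registration graph, Floyd-Warshall path search, transform
# chaining, and the iterative correction loop of Sections 2.3-2.5.
import numpy as np

def floyd_shortest_paths(frames, edges):
    """edges: dict {(i, j): M_ij} for successful registrations (directed).
    Returns a dict of next-hop pointers for path reconstruction."""
    INF = float("inf")
    dist = {(a, b): (0 if a == b else INF) for a in frames for b in frames}
    nxt = {}
    for (a, b) in edges:
        dist[(a, b)] = 1
        nxt[(a, b)] = b
    for k in frames:                      # classic Floyd-Warshall relaxation
        for a in frames:
            for b in frames:
                if dist[(a, k)] + dist[(k, b)] < dist[(a, b)]:
                    dist[(a, b)] = dist[(a, k)] + dist[(k, b)]
                    nxt[(a, b)] = nxt[(a, k)]
    return nxt

def path_to(nxt, src, ref):
    """Reconstruct src -> ... -> ref from the next-hop table (None if no path)."""
    if src == ref:
        return [src]
    if (src, ref) not in nxt:
        return None
    path, node = [src], src
    while node != ref:
        node = nxt[(node, ref)]
        path.append(node)
    return path

def chain_transform(path, edges):
    """Direct transform M[src][ref] as the product of the pairwise matrices."""
    M = np.eye(3)
    for a, b in zip(path, path[1:]):
        M = edges[(a, b)] @ M             # apply M_ab after the transforms so far
    return M

def apply_h(M, p):
    v = M @ np.array([p[0], p[1], 1.0])
    return (v[0] / v[2], v[1] / v[2])

def correct_along_path(path, edges, point_lists, M_to_ref):
    """Iterative correction (steps 1-5 of Sec. 2.5). point_lists[(a, b)] holds the
    matched points between consecutive images a and b on the path."""
    N = len(path)
    for i in range(N - 2, 0, -1):         # i = N-2, ..., 1 (1-based in the paper)
        a, b = path[i - 1], path[i]
        remapped = [(p_a, apply_h(M_to_ref[b], p_b))     # P_j' = M[j][N] P_j
                    for p_a, p_b in point_lists[(a, b)]]
        M_to_ref[a] = solve_affine_from_lists([remapped])  # re-solve M[i][N]
    return M_to_ref
```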
Fig. 5. Composed output result by transforming all frames against the reference frame
3 Experiment Result To show the registration result, we transformed each frame in Figure 1 using the corresponding estimated transform matrices. When generating the output composite, as shown in Figure 5, one frame is simply put on top of another; no blending is applied in the overlapping regions. The output images were generated from the original sample images, but the picture shown in Figure 5 has been resized for illustration purposes. On a PC with a 1.7 GHz CPU and 512 MB of memory running Microsoft Windows 2000, the total registration time for this 3x3 example is 2.7 seconds.
4 Concluding Remarks This paper proposed a graph-based optimization technique for image sequence registration. By introducing a registration graph and finding shortest paths, the algorithm is able to provide more accurate and robust registration for an image sequence. The accumulated registration error along the path to the reference can be further reduced by the correction step.
References 1. Szeliski, R. and Shum, H.: Creating Full View Panoramic Mosaics and Environment Maps. SIGGRAPH, (1997) 251-258 2. Peleg, S. and Herman, J.: Panoramic Mosaics by Manifold Projection. CVPR, (1997) 338-343 3. Irani, M., Anandan, P. and Hsu, S.: Mosaic based representations of video sequences and their applications. ICCV, (1995) 605-611 4. Sawhney, H. S., Hsu, S., and Kumar, R.: Robust video mosaicing through topology inference and local to global alignment. ECCV, (1998) 103-119 5. Shum, H. and Szeliski, R.: Construction and refinement of panoramic mosaics with global and local alignment. ICCV, (1998) 953-958 6. Xiong, Y. and Turkowski, K.: Registration, calibration and blending in creating high quality panoramas. Proceedings of Applications of Computer Vision, (1998) 69-74 7. Krishnan, A. and Ahuja, N.: Panoramic Image Acquisition. Proceedings IEEE Conference on Computer Vision and Pattern Recognition, San Francisco, CA, June 18-20, (1996) 379-384 8. Benosman, R., Kang, S. B., and Faugeras, O. (Eds.): Panoramic Vision. Springer-Verlag, (2001) 9. Chen, C.-S., Chen, Y.-T., Huang, F.: Stitching and Reconstruction of Linear-Pushbroom Panoramic Images for Planar Scenes. ECCV (2) (2004) 190-201 10. Echigo, T., Radke, R. J., Ramadge, P. J., Miyamori, H., Iisaku, S.: Ghost Error Elimination and Superimposition of Moving Objects in Video Mosaicing. ICIP (4) (1999) 128-132
11. Slama C. C. (Ed) Manual of Photogrammetry. American Society of Photogrammetry and Remote Sensing, Virginia, USA, (1980) 12. Brown, L.G. A survey of image registration techniques. ACM Computing Surveys, 24 (1992) 326–376 13. Zhang, Z., Deriche, R., Faugeras, O., and Luong, Q.-T. A Robust Technique for Matching Two Uncalibrated Images Through the Recovery of the Unknown Epipolar Geometry. Artificial Intelligence Journal, 78, (1995) 87-119
One-Dimensional Search for Reliable Epipole Estimation Tsuyoshi Migita and Takeshi Shakunaga Okayama University, 3-1-1, Tsushima-naka, Okayama, Japan {migita, shaku}@chino.it.okayama-u.ac.jp Abstract. Given a set of point correspondences in an uncalibrated image pair, we can estimate the fundamental matrix, which can be used in calculating several geometric properties of the images. Among the several existing estimation methods, nonlinear methods can yield accurate results if an approximation to the true solution is given, whereas linear methods are inaccurate but no prior knowledge about the solution is required. Usually a linear method is employed to initialize a nonlinear method, but this sometimes results in failure when the linear approximation is far from the true solution. We herein describe an alternative, or complementary, method for the initialization. The proposed method minimizes the algebraic error, making sure that the results have the rank-2 property, which is neglected in the conventional linear method. Although an approximation is still required in order to obtain a feasible algorithm, the method still outperforms the conventional linear 8-point method, and is even comparable to Sampson error minimization. Keywords: fundamental matrix, epipole estimation, polynomial equation, resultant.
1 Introduction
Given a pair of images, and a set of correspondences of the feature points in the images, we can estimate the epipolar geometry, or the fundamental matrix, which can be used in reconstruction of the 3D geometry of the cameras and that of the scene [1, 2]. Epipolar geometry estimation is one of the most important issues in computer vision, and a great number of studies have investigated solutions to epipolar geometry estimation. Although a certain class of the problem could be solved easily by even a naive method, another class was found to be very sensitive to the method used [1, 3]. Therefore, a sophisticated method that has a wider range of application is required. The present paper describes a new approach that reduces the estimation to a high order polynomial equation in a single variable, which enables a reliable initial estimation for the best estimation method in the literature. The most accurate results are obtained by maximum likelihood (ML) estimation [4], which minimizes the squared sum of reprojection errors by estimating the position, the orientation and the intrinsic parameters for each camera, as well as the 3D position for each feature point. In addition, this method can easily
be extended to deal with an arbitrary camera model (e.g. with lens distortion), as well as an arbitrary noise model (e.g. removing outliers). A more convenient method estimates a fundamental matrix that minimizes the Sampson error [1, 3], and it is known that the results are sufficiently accurate for most cases. However, both methods require a reliable initial estimation for an iterative process of nonlinear optimization to be successful. The well-known linear eight-point algorithm [5] (hereinafter L8P) could be regarded as a further approximation to the Sampson error minimization. The L8P algorithm requires two approximations to be performed only with linear algebra. First, the cost function should be approximated to be a bilinear form of the nine elements of the fundamental matrix. Second, in the main calculation, it cannot constrain the resulting matrix to be of rank 2, and the constraint is enforced afterward. Since this method can generate a result without any prior knowledge, it is usually used as an initial estimation for one of the more accurate nonlinear methods, such as those described above. However, this may fail because the L8P can only produce a sub-optimal result, due to the approximations described above. Therefore, it would be beneficial to construct another method that could complement L8P. For this purpose, in the present paper, we reduce the estimation to a high order polynomial equation in a single variable using the resultant, which is a mathematical tool for eliminating variables from a nonlinear equation system. For a single-variable polynomial equation, we can enumerate all of the possible solutions without any initial estimation, whereas L8P can obtain only one candidate. We herein derive two methods, an exact method and its approximation. The exact method minimizes the same cost function as that of the L8P, but the rank-2 constraint is fully taken into account. Therefore, in principle, the results are more accurate than L8P. Since the exact method requires an extreme amount of computation, we need an approximation. However, the approximation method still outperforms L8P, and the results are comparable to those of Sampson error minimization.
2 Mathematical Background
In this section, we briefly introduce several mathematical concepts that are used extensively throughout the present paper. Consider an n × n matrix A and a nonzero n-vector x satisfying Ax = 0. Then the determinant of A is 0, and x is a right null vector of A. The elements of the matrix A can be polynomials, in which case the determinant is also a polynomial. Consider two polynomials $P = \sum_{i=0}^{l} p_i x^i$ and $Q = \sum_{i=0}^{m} q_i x^i$, and observe that the following equation holds when P = Q = 0:

$$\begin{bmatrix} p_0 & p_1 & \cdots & p_l & & \\ & p_0 & p_1 & \cdots & p_l & \\ & & \ddots & & & \ddots \\ q_0 & q_1 & \cdots & q_m & & \\ & q_0 & q_1 & \cdots & q_m & \\ & & \ddots & & & \ddots \end{bmatrix} \begin{bmatrix} 1 \\ x \\ x^2 \\ \vdots \\ x^{l+m-1} \end{bmatrix} = 0 \qquad (1)$$
where, in the matrix, the upper m rows contain the $p_i$'s and the lower l rows contain the $q_i$'s. The matrix is of size (l + m) × (l + m), and its determinant must be 0 if and only if P = Q = 0. In addition, the right null space of the matrix contains a geometric series of x. Such a matrix is called the Sylvester matrix, and its determinant is the resultant of the polynomials P and Q, which can be written as Res(P, Q) [6]. We use the resultant, or a similar technique, to reduce the epipole search to an equation of the following type: letting the epipole be $(x, y, 1)^\top$,

$$(A_0 + x A_1 + \cdots + x^k A_k)\,(1, y, y^2, \cdots, y^n)^\top = 0. \qquad (2)$$
We use the determinant of the matrix to determine x, and the right null vector to determine y. Details are presented in the following sections.
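As a concrete illustration of the resultant machinery (not code from the paper), the sketch below builds the Sylvester matrix of two univariate polynomials with NumPy and evaluates its determinant; coefficients are taken in ascending order $p_0, \dots, p_l$, matching eq. (1).

```python
# Hedged sketch: Sylvester matrix and resultant of two polynomials P, Q,
# with coefficients given in ascending order (p0, p1, ..., pl).
import numpy as np

def sylvester(p, q):
    l, m = len(p) - 1, len(q) - 1          # degrees of P and Q
    S = np.zeros((l + m, l + m))
    for r in range(m):                      # m shifted rows of P's coefficients
        S[r, r:r + l + 1] = p
    for r in range(l):                      # l shifted rows of Q's coefficients
        S[m + r, r:r + m + 1] = q
    return S

def resultant(p, q):
    return np.linalg.det(sylvester(p, q))

# Example: P = (x-1)(x-2) = 2 - 3x + x^2 and Q = x - 1 share the root x = 1,
# so the resultant is (numerically) zero.
print(resultant([2.0, -3.0, 1.0], [-1.0, 1.0]))
```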
3 Epipolar Geometry and the Fundamental Matrix
We first define the problem of epipolar geometry estimation in Section 3.1. Then, in Section 3.2, we reduce the search space to a hemisphere of the epipole. At the same time, we compare the proposed method and the linear 8-point method. In Section 3.3, a simpler formulation of the epipole search is established, which is then reduced to a single-variable equation in Sections 3.4 and 3.5. Finally, a specific method by which to solve the equation is discussed in Section 3.6.
3.1 Basic Formulation
The scene contains two cameras and P feature points. Let $x_p$ be the 3D coordinates of the p'th feature point, and let $l_p$ and $r_p$ be the homogeneous coordinates of the images of the p'th feature point in the first and second images, respectively. Then, the following relation, known as the epipolar constraint, holds [1]:

$$l_p^\top F r_p = 0 \qquad (3)$$

where F is referred to as a fundamental matrix and conveys the internal and external parameters of the two cameras. The scale of F has no meaning and F should be of rank 2. Therefore, F has seven degrees of freedom. Our goal is to estimate F from a given set of feature correspondences $\{(l_p, r_p)\}$ by minimizing the algebraic cost function defined as

$$E(F) := \sum_{p=0}^{P-1} |l_p^\top F r_p|^2 \quad\text{or}\quad E(f) = f^\top H f, \qquad (4)$$

where f is a 9-vector containing all of the elements of the fundamental matrix F in row major order, i.e., $f^\top := [\,a^\top\ b^\top\ c^\top\,]$, where $F^\top = [\,a\ b\ c\,]$, and H is a symmetric matrix defined so that E(F) = E(f). For removing scale ambiguity, the search space of F should be limited so that its Frobenius norm is 1, written hereinafter as $|F|_F = 1$, or $f^\top f = 1$.
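For readers who want to experiment, a minimal sketch of building H from correspondences follows (this is not code from the paper; it simply encodes the definitions above). With f holding F in row-major order, $l_p^\top F r_p = (l_p \otimes r_p)^\top f$, so $H = \sum_p (l_p \otimes r_p)(l_p \otimes r_p)^\top$.

```python
# Hedged sketch: assemble the 9x9 matrix H of eq. (4) from point correspondences.
import numpy as np

def build_H(l_pts, r_pts):
    """l_pts, r_pts: (P, 3) arrays of homogeneous image points in the two views.
    Returns the symmetric 9x9 matrix H with f^T H f = sum_p |l_p^T F r_p|^2."""
    H = np.zeros((9, 9))
    for l, r in zip(l_pts, r_pts):
        k = np.kron(l, r)          # row-major vectorization: (l ⊗ r)^T f = l^T F r
        H += np.outer(k, k)
    return H

def algebraic_error(F, l_pts, r_pts):
    f = F.reshape(-1)              # row-major flattening matches the f above
    return f @ build_H(l_pts, r_pts) @ f
```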
Note that the algebraic cost function can be regarded as an approximation of the Sampson error, defined as
$$E_S(F) := \sum_p \frac{\big(l_p^\top F r_p\big)^2}{|F r_p|_C^2 + |F^\top l_p|_C^2}, \quad\text{where}\quad |(a, b, c)^\top|_C^2 := a^2 + b^2. \qquad (5)$$

3.2 Simple Equation for λ and e
Minimization of the bilinear form $f^\top H f$ subject to the constraint $f^\top f = 1$ is reduced to the following eigenequation:

$$(H - \lambda I)f = 0. \qquad (6)$$

This is the main part of the linear 8-point algorithm. It is easy to show that λ coincides with the residual to be minimized ($\lambda = f^\top H f$) and satisfies the 9th order equation $\det(H - \lambda I) = 0$. In this calculation, the rank-2 constraint could not be imposed. On the other hand, once we know the coordinates of the epipole $e = (x, y, z)^\top$, we can easily obtain the rank-2 fundamental matrix which minimizes the algebraic cost function, as follows [1]: let us define

$$B := \begin{bmatrix} e & & \\ & e & \\ & & e \end{bmatrix}, \quad D := \begin{bmatrix} [u_1\,u_2] & & \\ & [u_1\,u_2] & \\ & & [u_1\,u_2] \end{bmatrix}, \quad\text{and}\quad g := D^\top f \qquad (7)$$

where $(e, u_1, u_2)$ is an orthonormal basis. Obviously, $[B|D]$ is an orthogonal matrix, and $[B|D]$ divides the search space for f into 3D and 6D subspaces. Since the rank-2 constraint requires that $B^\top f = 0$, f must lie in the 6D subspace spanned by the column vectors of D. Therefore, the algebraic cost function is $g^\top D^\top H D g$ and the norm constraint is $g^\top g = 1$. Then, g satisfies the following equation:

$$(D^\top H D - \lambda I)g = 0 \qquad (8)$$
$$\text{where}\quad \det(D^\top H D - \lambda I) = 0. \qquad (9)$$
From the above equation, f is obtained as Dg, where g is the least significant eigenvector of DT HD. The resulting matrix must be of rank 2. Compare this equation with eq. (6). It is important, in the proposed approach, that λ is the algebraic error to be minimized and satisfies the 6th order equation eq. (9), and the coefficients of the equation are a function of the epipole coordinates. Figure 2 (a) shows the minimum one of six solutions of eq. (9), i.e. λ, plotted with respect to e, for the image pair shown in Fig. 1. In principle, our goal is now to choose the epipole that minimizes the minimum solution of eq. (9). In Fig. 2 (a), such an epipole is indicated as a small circle, along with other local minima or maxima or saddle points. In the following subsection, we derive a simpler form of eq. (9), in which the coefficients are 6th order polynomials of the coordinates of the epipole.
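A small numerical sketch of this rank-2 constrained step is given below (illustrative only, not the authors' code): given a candidate epipole e, it builds D from an orthonormal basis, takes the least-significant eigenvector of $D^\top H D$, and maps it back to a rank-2 fundamental matrix.

```python
# Hedged sketch: minimize f^T H f subject to |f| = 1 and F e = 0 for a given epipole e.
import numpy as np

def rank2_F_given_epipole(H, e):
    e = e / np.linalg.norm(e)
    # Complete e to an orthonormal basis (e, u1, u2) via the SVD of e as a row vector.
    _, _, Vt = np.linalg.svd(e.reshape(1, 3))
    u1, u2 = Vt[1], Vt[2]
    # D is 9x6: block-diagonal copies of [u1 u2], matching eq. (7).
    D = np.zeros((9, 6))
    for b in range(3):
        D[3 * b:3 * b + 3, 2 * b:2 * b + 2] = np.column_stack([u1, u2])
    M = D.T @ H @ D
    w, V = np.linalg.eigh(M)          # eq. (8)-(9): eigendecomposition of D^T H D
    g = V[:, 0]                       # least significant eigenvector
    f = D @ g
    F = f.reshape(3, 3)               # row-major, consistent with f's definition
    return F, w[0]                    # w[0] is the algebraic residual lambda
```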
3.3 Four-Dimensional Polynomial Equation for Epipole Search
Here, we must minimize eq. (4) subject to two constraints, $|F|_F^2 = 1$ and $\det(F) = 0$, or equivalently $f^\top f - 1 = 0$ and $B^\top f = 0$. The norm constraint is incorporated into eq. (4) by the Lagrange multiplier method as usual, while the rank-2 constraint is incorporated by the penalty method:

$$E(f, e) = f^\top H f + \nu f^\top B B^\top f - \lambda (f^\top f - 1) \qquad (10)$$

where the penalty weight ν goes to ∞, and λ is a Lagrange multiplier. Note that the last two terms added here are zeroes for the true solution. As is usual for minimization problems, we start with differentiation:

$$(1/2)\nabla E = (H + \nu B B^\top - \lambda I)f = 0. \qquad (11)$$

Multiplying eq. (11) by $f^\top$ from the left, we have $f^\top (H + \nu B B^\top - \lambda I)f = 0$, and comparing this identity with eq. (10), we can see that E = λ. This means that the Lagrange multiplier agrees with the cost function to be minimized. From eq. (11), the determinant of $(H + \nu B B^\top - \lambda I)$ should be zero. Although it may involve a considerable amount of symbolic manipulation, it is straightforward to expand the determinant to show the several properties described below. This is a 9th order equation in λ and a cubic equation in ν. In addition, the coefficient of the $\nu^3$ term is a 6th order polynomial in λ, which is hereinafter referred to as $f_6$. As ν → ∞, six out of the nine solutions of $\det(H + \nu B B^\top - \lambda I) = 0$ approach the solutions of $f_6 = 0$. The explicit expression of $f_6$ is given by

$$f_6(\lambda; e) := \lim_{\nu \to \infty} \nu^{-3} \det\!\big(H - \lambda I + \nu B B^\top\big) = c_0 + c_1 \lambda + c_2 \lambda^2 + \cdots + c_6 \lambda^6 = 0. \qquad (12)$$

The algebraic residual λ should satisfy this equation, which is a simpler form of eq. (9). The coefficients $c_i$ are 6th order homogeneous polynomials in the coordinates of the epipole. Writing the epipole e as $(x, y, z)^\top$, we have

$$c_i = c_{i600} x^6 + c_{i510} x^5 y + \cdots + c_{i006} z^6. \qquad (13)$$

3.4 High Order Polynomial Equation for Epipole Search
Next, we introduce the 1st and 2nd order derivatives of the $c_i$'s, i.e., $\mathbf{g}_i := \nabla c_i =: (g_{i0}, g_{i1}, g_{i2})^\top$ and $H_i := \nabla\nabla^\top c_i$. Note that homogeneity implies $e^\top \mathbf{g}_i = 6 c_i$ and $H_i e = 5\mathbf{g}_i$. In order to find $e = (x, y, z)^\top$ that minimizes the residual λ, we use the property whereby the gradient of eq. (12) satisfies

$$(g_0\ g_1\ g_2)^\top := \mathbf{g}_0 + \mathbf{g}_1 \lambda + \mathbf{g}_2 \lambda^2 + \cdots + \mathbf{g}_6 \lambda^6 = 0. \qquad (14)$$

Note that the right-hand side of the equation is intuitively ωe, where ω is a Lagrange multiplier, but considering $e^\top\big(\sum_i \mathbf{g}_i \lambda^i - \omega e\big) = 0$, we obtain $\omega = 6\sum_i c_i \lambda^i = 0$.
We can remove λ by constructing the following $d_i$'s:

$$\begin{bmatrix} d_0 \\ d_1 \\ d_2 \end{bmatrix} := \frac{1}{|e|^{12}} \begin{bmatrix} \mathrm{Res}(g_1, g_2) \\ \mathrm{Res}(g_2, g_0) \\ \mathrm{Res}(g_0, g_1) \end{bmatrix} = \begin{bmatrix} 0 \\ 0 \\ 0 \end{bmatrix} \qquad (15)$$

where

$$\mathrm{Res}(g_i, g_j) = \det \begin{bmatrix} g_{i0} & g_{i1} & g_{i2} & \cdots & g_{i6} & & \\ g_{j0} & g_{j1} & g_{j2} & \cdots & g_{j6} & & \\ & g_{i0} & g_{i1} & \cdots & & g_{i6} & \\ & g_{j0} & g_{j1} & \cdots & & g_{j6} & \\ & & \ddots & & & & \ddots \end{bmatrix} \qquad (16)$$

and $g_i = \sum_{j=0}^{6} g_{ij}\lambda^j$. In this case, each of the resultants is a 60th order homogeneous polynomial in x, y and z, and a multiple of $|e|^{12}$; therefore, we need to consider the 48th order quotients only. We can solve this system of equations in six different ways. In the following, we describe one of them; the other five are obtained by exchanging the roles of the variables, i.e., by replacing (x, y, z) with (x, z, y), (y, x, z), (y, z, x), (z, x, y), and (z, y, x), respectively. Letting z = 1 and regarding the $d_i$'s as polynomials in y, we have $d_i = \sum_{j=0}^{48} d_{ij} y^j$, where the coefficients $d_{ij}$ are polynomials in x. Then, the following equation holds for the true x and y:

$$A\,(1,\ y,\ y^2,\ \ldots,\ y^{71})^\top = 0, \quad\text{where}\quad A := \begin{bmatrix} d_{00} & d_{01} & d_{02} & \cdots & d_{0,48} & & \\ d_{10} & d_{11} & d_{12} & \cdots & d_{1,48} & & \\ d_{20} & d_{21} & d_{22} & \cdots & d_{2,48} & & \\ & d_{00} & d_{01} & \cdots & & d_{0,48} & \\ & d_{10} & d_{11} & \cdots & & d_{1,48} & \\ & d_{20} & d_{21} & \cdots & & d_{2,48} & \\ & & \ddots & & & & \ddots \end{bmatrix}. \qquad (17)$$
Feasible Solution
We need an approximation to construct a feasible equation. We can assume that λ is small enough to neglect the O(λ2 ) terms, because λ is the residual to be minimized and the typical magnitude of λ is 10−2 . When such higher order terms
are neglected, λ = −c0 /c1 . An example is shown in Fig. 2(b), and Figs. 2(a) and 2(b) are almost indistinguishable. Correspondingly we truncate eq. (14) as follows: g0 + g1 λ = 0 .
(18)
Similar to eq. (15), λ is removed from eq. (18) by ( d0 d1 d2 )T := g 0 × g 1 = 0
(19)
where di is a 10th order homogeneous polynomial in three variables. We can solve this equation in a similar manner to that described in the previous subsection. For each selection of six roles of variables, we construct a matrix corresponding to A in eq. (17), which is now of size 15 × 15, and the determinant is a 75th order polynomial. Figure 3 shows an example of the determinant plotted with respect to x. The polynomial has only a few zeroes, although the order of the polynomial is 75. Figure 4 shows an example of eq. (19), where we can see each of the small circles, representing a minimum, a maximum or a saddle point of λ, lie on the curves where di = 0 for every i. 3.6
Solving the Equation
We must first calculate cijkl ’s. However, the direct calculation of eq. (12) involves extensive calculation. Therefore, we instead use eq. (9) to obtain the solutions λi ’s for at least 28 linearly independent (x, y, z)’s. Then, ci ’s are calculated up to scale 1 , from ⎧ ⎪ 6 6 ⎨ −c0 /c6 = λ0 + λ1 + · · · + λ5
1 i c1 /c6 = λ0 λ1 + λ0 λ2 + · · · + λ4 λ5 (20) (λ − λi ) = ci λ , or ⎪ c .. 6 ⎩ i=0 i=0 . for each (x, y, z). From these ci ’s, an equation system (a least-squares equation system) for cijkl ’s can be constructed, by stacking eq. (13) for more than 28 tuples of (x, y, z)’s. Then, we easily obtain gij ’s by symbolic differentiation of ci ’s. For searching x, which makes the determinant of A in eq. (17), or its variant of size 15 × 15, become zero, the search space can be limited within [−1 : 1] because every point (x, y, z) on the hemisphere is covered either twice or three times by the six permutations of the three variables. Then, the range [−1 : 1] is divided into dozens of parts by equally spaced nodes, and the determinant is calculated for each node. If the sign of the determinant changes between two consecutive nodes, there must be a solution in the interval between the nodes, and the root is calculated by the Secant method. We do not calculate the coefficients of the high order polynomial. For each x that makes A singular, the right null vector of A 1
The scale is not needed. However, if required, we can use the fact that c6 = |e|6 , which can be derived directly from eq. (12), to determine the scale.
1222
T. Migita and T. Shakunaga
Fig. 1. Example of an image pair det(A) x 10 -168 0.4 0.2 0 -0.2 -0.4
(a) Exact λ
(b) Approximation −c0 /c1
-1
-0.5
0
0.5
1 x
Determinant with Fig. 2. Fish-eye projection of the residual λ. The re- Fig. 3. gions where λ approaches 0 appear as dark regions. respect to x when z = 1. The determinant is a 75th order Contour lines are also shown. function of x, but has only three solutions in the range.
For each x that makes A singular, the right null vector of A contains a geometric series of y, and we can extract y from that series; otherwise, the solution x is discarded. We must also be aware of false solutions, because the equation is satisfied not only by a point that minimizes λ but also by maxima and saddle points. To check this, we examine the Hessian $\sum_i H_i \lambda^i$ at the point; all of its eigenvalues should be non-negative at a minimum. Once the epipole is obtained, the fundamental matrix is easily calculated, as described in Section 3.2. The required time for the entire calculation is approximately 100 ms on a 3.2-GHz Pentium 4 processor.
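The 1D search itself is easy to prototype. The sketch below (not the authors' implementation) assumes a callable `det_A(x)` that evaluates the determinant of the 15 × 15 matrix for a given x; it scans [-1, 1] for sign changes and refines each bracketed root with the secant method, as described above.

```python
# Hedged sketch: 1D root search for det A(x) = 0 by sign-change scanning + secant.
import numpy as np

def find_roots(det_A, n_nodes=64, iters=30, tol=1e-12):
    xs = np.linspace(-1.0, 1.0, n_nodes)
    vals = np.array([det_A(x) for x in xs])
    roots = []
    for x0, x1, v0, v1 in zip(xs[:-1], xs[1:], vals[:-1], vals[1:]):
        if v0 == 0.0:
            roots.append(x0)
            continue
        if v0 * v1 > 0:                       # no sign change -> no bracketed root
            continue
        a, b, fa, fb = x0, x1, v0, v1         # secant refinement inside the bracket
        for _ in range(iters):
            if fb == fa:
                break
            c = b - fb * (b - a) / (fb - fa)
            fc = det_A(c)
            a, fa, b, fb = b, fb, c, fc
            if abs(fc) < tol:
                break
        roots.append(b)
    return roots

# Usage (hypothetical helper): roots = find_roots(lambda x: np.linalg.det(build_A15(x)))
```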
4 Experiment
In this section, we show the experimental results for a real image sequence of a toy house. Two out of 37 images are shown in Fig. 1. We tracked 100 feature points manually. The average tracking error was approximately three pixels. There are 397 image pairs that have seven or more point correspondences. Thirty images were captured with the camera traveling on a circular trajectory around the object, and the focal lengths are almost fixed. Seven additional images were captured from positions that were not on the trajectory. The distance to the object and the focal length for these additional images may vary. This increases the variation of the epipole position. Figure 5 shows the distribution of the
Fig. 4. The sign of each element of $\mathbf{g}_0 \times \mathbf{g}_1 = (d_0, d_1, d_2)^\top$ is indicated by the black and white areas, and the boundaries between the areas indicate the locations at which $d_i = 0$. The intersections are also indicated by small circles. (a) First element ($d_0$); (b) second element ($d_1$); (c) third element ($d_2$).
Fig. 5. Distribution of the epipoles for 397 image pairs
Fig. 6. Example of a λ plot with respect to e
epipoles for 397 image pairs. Although many of the epipoles are near (±1, 0, 0), there are image pairs for which the epipoles are near the image center. Number of Solutions: One of the advantages of the proposed method is that it can detect all of the local minima, whereas the linear 8-point method can generate only one result. The number of epipole candidates, or local minima, varied from 1 to 5. There were 179 image pairs for which the proposed method found only one solution, whereas for the other pairs the proposed method found more than one candidate solutions. Figure 6 shows an example of λ variations in the search space, where there are three local minima. There were an additional 106 such pairs, which indicates that this is not a rare case. For many cases, a smaller number of correspondences causes the search space to have more than one local minimum. We can choose the most appropriate minimum from among these candidates. The choice can be made after a nonlinear refinement. Accuracy: We compared the proposed method to Sampson error minimization, and the result is shown in Fig. 7, along with similar results for other two methods. In the figure, (a) shows the linear 8-point algorithm (L8P), (b) shows the proposed method with first order approximation, and (c) shows the nonlinear minimization of the algebraic error. Given the epipole e1 obtained by a certain method and e2 obtained by the Sampson error minimization, the error between them is defined as |e1 − e2 |2 . When a method produces several candidates, the
(Fig. 7 panels: (a) Linear 8-point; (b) approximation −c_0/c_1; (c) exact λ. Each histogram plots frequency (out of 397) against error.)
Fig. 7. Histograms of the errors between the solutions of each method and those of the Sampson error minimization
candidate that gives the smallest error is chosen. We then created a histogram of the errors over the 397 image pairs for each of the three methods. In the figure, we can easily see that the peak of the histogram is around $10^{-3}$ for L8P and around $10^{-5}$ for the other two methods. This indicates that the proposed method is more accurate than L8P, and thus has a lower probability of yielding a local minimum. Moreover, since the shapes of the histograms shown in (b) and (c) are almost the same, the effect of the approximation introduced in Section 3.5 is negligible. In other words, the degradation caused by the approximation is much smaller than the degradation caused by neglecting the rank-2 constraint in L8P. Moreover, a typical error of $10^{-5}$ means that the proposed method is comparable to the Sampson error minimization.
5 Conclusions
We have described herein a novel approach for estimating the epipole from point correspondences in two uncalibrated views. With a slight approximation, the proposed method reduces the search to a set of six 1D searches, each of which is expressed as a 75th order polynomial equation. Within the limitation of our approximation, the proposed method can provide all of the local minima, and the results are comparable to those of Sampson error minimization. However, it is probable that the presented algorithm is not optimal and so should be improved in a future study.
References 1. Hartley, R. I., Zisserman, A.: Multiple View Geometry in Computer Vision. Cambridge University Press (2004) 2. Kanatani, K.: Statistical Optimization for Geometric Computation. Elsevier (1996) 3. Zhang, Z.: Determining the Epipolar Geometry and its Uncertainty: A Review, IJCV 27(2), pp.161–198 (1998) 4. Triggs, B., McLauchlan, P. F., Hartley, R., Fitzgibbon, A. W.: Bundle adjustment — A modern synthesis, LNCS 1883 pp.298–375 (2000) 5. Hartley, R. I.: In Defence of the 8-point Algorithm, PAMI 19-6, pp.580–593, (1997) 6. Cox, D.A., Little, J. B., O’Shea, D.B.: Using Algebraic Geometry, Springer, (1998)
Image Technology for Camera-Based Automatic Guided Vehicle Young-Jae Ryoo Department of Control System Engineering, Mokpo National University, 61 Dorim-ri, Muan-goon, Jeonnam 534-729, Korea
[email protected] Abstract. This paper describes image technology using a neural network system for an automatic guided vehicle to follow a lane. Without complicated image processing from the image of a lane to the vehicle-centered representation of a bird’s eye view in conventional studies, the proposed system transfers the input of image information into the output of a steering angle directly. The neural network system replaces the nonlinear relation of image information to a steering angle of vehicle on the real ground. For image information, the vanishing point and vanishing line of lane on a camera image are used. In a straight and curved lane, the driving performances by the proposed technology are measured in simulation and experimental test. Keywords: Image sensor, neural network, vanishing point, vanishing line.
1 Introduction There is increasing interest in intelligent transportation systems (ITS) that allow driving safely from a starting point to a goal. An important component of an intelligent automatic guided vehicle is lane following, i.e., lateral motion control of the vehicle. Image and video systems play an important role in lane following because of their flexibility and two-dimensional view. Systems that integrate an image sensor and an automatic guided vehicle together have received a lot of attention [1-6]. Such systems can solve many problems that limit applications in current vehicles. Recently, some researchers have discussed possibilities for applying intelligent systems to automatic guided vehicles [7-10]. A typical system comprises several major modules: an image sensor, geometric reasoning, and lateral control. The image sensor acquires camera images, from which certain lane segments are extracted by means of image processing algorithms. After the segments are transformed from the image coordinate system to the local vehicle coordinate system, they are transferred to the reasoning module, which assigns them a consistent lane interpretation using local geometric support and temporal continuity constraints. The lane description is sent to the lateral control system, which generates a steering value for vehicle navigation. Such a system suffers from heavy computation that must be completed within the given time: for the reasoning module, the computational complexity is governed by the camera parameters, while the lateral control depends on the parameters of the lane and the vehicle. Provided the processing time
available is enough, this difficulty might be circumvented with a high speed computing system. Unfortunately, in practice, the processing time is limited, which demands more efficient control methodology from a computational standpoint. In this paper, a fast image technology for an automatic guided vehicle with image sensors is presented, which uses image information to guide itself along lane. In the proposed system, the position and orientation of a vehicle on a lane are assumed to be unknown, but the vanishing point and vanishing line of the lane are given in a video image. The proposed system controls the vehicle so that it can follow the guidelines of lane. The proposed intelligent system transfers the input of image information into the output of steering angle directly, without a complex geometric reasoning from a visual image to a vehicle-centered representation in conventional studies. The neural network system replaces the human driving skill of nonlinear relation between vanishing lines of lane boundary on the camera image and the steering angle of vehicle on the real ground.
1227
VP = 0
(b) (a)
VLL
VLR
VLL=VLR VP < 0
(c)
(d)
VLL>VLR
Fig. 1. Relation of camera image of lane view with the orientation and deviation of vehicle on the lane, and the features of vanishing point and vanishing line on camera image
2.2 Using Neural Network The relation between the steering angle and the vanishing point and vanishing line on the camera image is a highly nonlinear function. Neural network is used for the nonlinear relation because it has the learning capability to map a set of input patterns to a set of output patterns. The inputs of the neural network (x1 , x2) are the lateral position of the vanishing point (VP) and the slope of the vanishing line (VL). The output of the neural network system (y0) is the steering angle value for the vehicle (δ ).
x1 = VP x2 = VL = VLL − VLR
(1)
y0 = δ Learning data could be obtained from human’s driving. After the neural network system learns the relation between input patterns and output patterns sufficiently, it makes a model of the relation between the position and orientation of the vehicle, and that of the lane. Thus, a good model of the control task is obtained by learning, without inputting any knowledge about the specific vehicle and lane.
3 Computer Simulations To simulate in computer as shown in Fig. 2, vehicle model, transformation of coordinate system, and control algorithms should be determined. The used vehicle model is a general model, and has specific parameters. Through transformation of coordinate system, the lane could be displayed on the camera image plane visually.
Fig. 2. Structure of simulation
3.1 Vehicle Model The general model of a vehicle with 4 wheels in the world coordinate system is shown in Fig. 3. The reference point (xcW, ycW) is located at the center point between the rear wheels. The heading angle θ for XW-axis of the world coordinate system and the steering angle δ are defined in the vehicle coordinate system, and the model equation is expressed as follows;
1 θ = v sin δ Lv xcW = v cos θ cos δ
(2)
⋅
y cW = v sin θ cos δ. Since the vehicle coordinate system is used in control of automatic guided vehicles, the current position (xcW, ycW) of the vehicle in the world coordinate system is redefined as the point of origin for the vehicle coordinate system. When the vehicle which has the distance Lv between front wheel and rear wheel runs with velocity v, the
new position of the vehicle is nonlinearly relative to steering angle δ of front wheel and heading angle θ determined by the vehicle direction and lane direction.
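To make the model concrete, the sketch below integrates eq. (2) with a simple forward-Euler step; the time step and parameter values are assumptions for illustration only.

```python
# Hedged sketch: forward-Euler integration of the vehicle model of eq. (2).
import math

def step(state, delta, v, L_v=1.0, dt=0.02):
    """state = (x_c, y_c, theta) in world coordinates; delta = steering angle [rad]."""
    x, y, theta = state
    theta += (v / L_v) * math.sin(delta) * dt
    x += v * math.cos(theta) * math.cos(delta) * dt
    y += v * math.sin(theta) * math.cos(delta) * dt
    return (x, y, theta)

# Usage (hypothetical values): drive straight, then apply a small left-turn steering angle.
s = (0.0, 0.0, 0.0)
for k in range(500):
    s = step(s, delta=0.0 if k < 250 else 0.14, v=1.0)
```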
Fig. 3. Vehicle model and camera coordinate transformation
3.2 Transformation from Ground to Image In order to simulate in computer visually, the lane has to be displayed on the camera image plane. Thus the coordinate transformations along the following steps are needed to determine the lane of visual data from the lane on the world coordinate system : 1) Transformation from the world coordinate system to the vehicle coordinate system. The position (xcW, ycW) on the world coordinates is redefined as the origin for the vehicle coordinates.
$$\begin{bmatrix} X^V \\ Y^V \end{bmatrix} = \begin{bmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{bmatrix} \begin{bmatrix} X^W - x_c^W \\ Y^W - y_c^W \end{bmatrix} \qquad (3)$$
where (XV, YV) is a point in the vehicle coordinate system, (XW, YW) is a point in the world coordinate system, (xcW, ycW) is the reference point located at the center point between the rear wheels, and θ is the heading angle of the vehicle. 2) Transformation from the vehicle coordinate system to the camera coordinate system.
$$\begin{bmatrix} X^C \\ Y^C \\ Z^C \end{bmatrix} = \begin{bmatrix} 0 & -1 \\ 0 & 1 \\ 1 & 1 \end{bmatrix} \begin{bmatrix} X^V \\ Y^V \end{bmatrix} + \begin{bmatrix} 0 \\ H_{vc} \\ L_{vc} \end{bmatrix} \qquad (4)$$
where (XC, YC, ZC) is a point in the camera coordinate system, (XV, YV) is a point on the vehicle coordinate system, Hvc is a height of the mounted camera from ground, and Lvc is the distance of the mounted camera from the center of the vehicle.
3) Transformation from the camera coordinate system to the image coordinate system. (Θ, Φ , Ψ ) is the orientation of the camera with respect to the camera coordinates. The image plane (XI, YI) corresponds to the XC, YC plane at distance f (the focal length) from the position of the camera along the ZC axis. The coordinate transformation is expressed by the rotation matrix R and the focal matrix F.
$$\begin{bmatrix} X^I \\ Y^I \end{bmatrix} = F\left\{ R \begin{bmatrix} X^C \\ Y^C \\ Z^C \end{bmatrix} \right\} \qquad (5)$$
where $(X^I, Y^I)$ is a point in the image coordinate system and $(X^C, Y^C, Z^C)$ is a point in the camera coordinate system. 3.3 Simulation Results This section presents simulation results for the automatic guided vehicle. The simulation program implements the vehicle model, the coordinate transformations, and the control algorithms, and was developed in C++ on an IBM PC.
Fig. 4. The simulation program shows the camera image view and the bird’s eye view
The performance of the proposed neural-network visual control system was evaluated with the vehicle driving on a track containing a straight and a curved lane, as shown in Fig. 5, which gives the bird's-eye view of the lane map and the trajectory of the vehicle's travel. Fig. 6 shows the vehicle's steering angle during automatic guided driving on the lane of Fig. 5; the steering angle is determined by the neural network system, the steering actuator model, and the vehicle model. The steering angle is almost zero on the straight lane (section A), negative on the left-turn curved lane (section B), and returns to
zero after the curved lane (section C). The controller steers the vehicle to the left (by about 0.14 rad) on the left-turn curve (section B). The lateral error is defined as the distance between the center of the lane and the center of the vehicle. As shown in Fig. 6, automatic guided driving is completed with a lateral error of less than 0.60 m.
Fig. 5. Trajectories of vehicle driving on curved lane (curvature radius = 6 meters)
Fig. 6. Lateral error and steering angle
4 Experiments 4.1 Experimental Set-Up The designed vehicle has four wheels, and its size is one third that of a small passenger car. Driving torque is provided by two 200 W induction motors, one at each rear wheel. The front wheel is steered by a worm gear driven by a DC motor. The energy source is four directly connected 6 V batteries, and a CCD camera is used as the image sensor to obtain the lane information. The control computer of the vehicle manages all subsystems, recognizes the lane direction from the input camera image
using the lane recognition algorithm, and generates the steering-angle control signal with the neural network controller. The control computer also manages and monitors the input information from the various signals and inspects the system state. A personal computer was chosen for hardware extensibility and software flexibility. The electric control system consists of a vision system, a steering control system, and a speed control system. The vision system has a camera to acquire the lane image and an IVP-150 image processing board to detect the lane boundary. The steering control system has a D/A converter to convert the control value to an analog voltage, a potentiometer to read the current steering angle, and an analog P-controller. The speed control system has a 16-bit counter to estimate the current speed from encoder pulses, a D/A converter to convert the commanded speed from the controller to an analog voltage, and an inverter to drive the 3-phase induction motors. Fig. 7 shows the designed vehicle.
Fig. 7. Designed vehicle
4.2 Automatic Guided Driving Test The test track with a straight and a curved lane has the configuration of Fig. 5, as evaluated in the simulation study. The lane width is 1.2 m, the thickness of the lane boundary is 0.05 m, and the total length of the lane, merging the straight and curved parts, is about 20 m. The curvature radius of the curved lane is 6 m. The automatic guided vehicle demonstrated excellent driving on both the straight lane and the curved lane, as shown in Fig. 9.
Fig. 8. Camera images, extracted features of vanishing lines and steering angles
Fig. 9. Automatic guided driving on curved lane
5 Conclusions In this paper, an image technology for an automatic guided vehicle with an image sensor was described using a neural network. The nonlinear relation between the image data and the control signal for the steering angle can be learned by the neural network. The validity of this image technology was confirmed by computer simulations. This approach is effective because it essentially replaces the human driving skill of complex geometric calculation and image algorithms with a simple neural network mapping.
An Efficient Demosaiced Image Enhancement Method for a Low Cost Single-Chip CMOS Image Sensor
Wonjae Lee1, Seongjoo Lee2, and Jaeseok Kim1,*
1 Dept. of Electrical and Electronic Engineering, Yonsei University, Seoul, Korea
2 Dept. of Information and Communication Engineering, Sejong University, Seoul, Korea
[email protected], [email protected], [email protected]
Abstract. An efficient demosaiced image enhancement method for a low cost single-chip CMOS image sensor (CIS) with an on-chip image signal processor (ISP) is proposed. In a single-chip CIS for handheld devices that require low power consumption, the implemented ISP must be as small as possible while maintaining high performance. Demosaicing and image enhancement are the largest blocks in the ISP because they need line memories. In this paper, we propose a new method to minimize the number of line memories. In the proposed method, the data buffered for demosaicing are used to construct approximated luminance data for image enhancement. This enables line memory sharing between demosaicing and image enhancement. The experimental results showed that the proposed method enhances image quality. Thus, both high quality and a small area can be achieved with the proposed method. Keywords: Single-chip CMOS image sensor, image signal processor, demosaicing, image enhancement.
1 Introduction
CMOS image sensors (CISs) are becoming popular as image input devices due to the demand for miniaturized, low-power and cost-effective imaging systems such as digital still cameras, PC cameras and video camcorders. CISs have several advantages over CCDs (Charge Coupled Devices), in particular the opportunity to integrate the sensor with other circuitry. For example, an image signal processor (ISP) can be integrated with a sensor into a single chip to reduce packaging costs. Several image processing algorithms are used in the ISP to improve the captured image quality. Among the image processing blocks, demosaicing and image enhancement are the core blocks of the ISP. CISs capture only one color channel per pixel since the Bayer pattern [1] CFA (Color Filter Array) is used. Therefore, the two other colors must be*
This work was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) and Yonsei University Institute of TMS Information Technology, a Brain Korea 21 program, Korea.
interpolated from the neighboring pixel values. This process is called CFA interpolation or demosaicing, and it requires line memories. However, the number of line memories is limited in a low cost single-chip CIS, whereas a large number of line memories is essential for acquiring high quality demosaiced images. Considering the line memories needed for image enhancement as well, the new method therefore seeks to minimize the number of line memories while maintaining high performance. In this paper, we propose a new and efficient demosaiced image enhancement method for ISP-integrated CISs. Using the characteristic of conventional high performance demosaicing algorithms that the green channel is interpolated first and then the red and blue channels, image enhancement is done without additional line memories. The paper is organized as follows. In Section 2, demosaicing algorithms, image enhancement algorithms, and the conventional ISP architecture are introduced. The proposed method is presented in Section 3. The experimental results are described in Section 4, and conclusions are drawn in Section 5.
2 Image Signal Processor Fig. 1 shows a basic structure for an ISP. An image sensor captures the scene. Demosaicing interpolates the missing colors to make color images. After demosaicing, several image processing algorithms are performed to enhance the image quality. Color correction improves color quality. Gamma correction makes the input signal more proportional to the light intensity. The auto white balance removes unrealistic colors to transform the image into what the eyes perceive. The color conversion transforms the color space from RGB to YCbCr, as the YCbCr space is more efficient for image enhancement and compression. Image enhancement improves image contrast or sharpness.
Fig. 1. The basic structure of image signal processor
2.1 Demosaicing Algorithms CISs can perceive only pixel brightness. Therefore, three separate sensors are needed to completely measure each color channel and make a color image. To reduce the cost and complexity, a single image sensor with CFA is used to capture all three color channels simultaneously. The Bayer pattern, shown in Fig. 2, is widely used as a CFA and provides the array or mosaic of the RGB colors. Since each pixel captures only one color channel, two other colors must be interpolated from neighboring pixel values. If this process is carried out inappropriately, the image suffers from various color artifacts, which results in degraded color resolution.
B    G    B1   G    B
G    R2   G3   R4   G
B5   G6   B7   G8   B9
G    R10  G11  R12  G
B    G    B13  G    B
Fig. 2. Bayer CFA pattern
Several demosaicing algorithms have been proposed. Among them, bilinear interpolation is computationally efficient, but it introduces large errors that blur the edges of the resulting image. In adaptive color plane interpolation (ACPI) [2], spatial and spectral correlations are considered to enhance demosaiced image quality. In such images, good quality is maintained even at the edges. It begins by using edge-directed interpolation for the green channel. Referring to Fig. 2, the horizontal and vertical gradients are defined as

DH = |−B5 + 2B7 − B9| + |G6 − G8|
DV = |−B1 + 2B7 − B13| + |G3 − G11|    (1)

Then, the missing green pixel G7 can be estimated as

G7 = (G6 + G8)/2 + (−B5 + 2B7 − B9)/2                            if DH < DV
G7 = (G3 + G11)/2 + (−B1 + 2B7 − B13)/2                          if DH > DV    (2)
G7 = (G3 + G6 + G8 + G11)/4 + (−B1 − B5 + 4B7 − B9 − B13)/8      if DH = DV
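For illustration, a minimal C sketch of the green-channel step of (1)-(2) for a blue (or red) CFA location is given below, using the pixel labels of Fig. 2 (B7 at the centre). It is an illustrative reading of ACPI rather than the authors' hardware implementation; the function name, the row-major buffer layout and the border assumption are ours.

```c
#include <stdlib.h>

/* Clamp to the 8-bit range. */
static int clamp8(int v) { return v < 0 ? 0 : (v > 255 ? 255 : v); }

/* Green value at a blue (or red) CFA location, following (1)-(2).
   bayer: row-major CFA samples, w: image width; (x, y) must lie at least
   two pixels inside the border.  Labels follow Fig. 2 with B7 at (x, y). */
static int acpi_green(const unsigned char *bayer, int w, int x, int y)
{
    int b7  = bayer[y * w + x];
    int g6  = bayer[y * w + x - 1],     g8  = bayer[y * w + x + 1];
    int g3  = bayer[(y - 1) * w + x],   g11 = bayer[(y + 1) * w + x];
    int b5  = bayer[y * w + x - 2],     b9  = bayer[y * w + x + 2];
    int b1  = bayer[(y - 2) * w + x],   b13 = bayer[(y + 2) * w + x];

    int dh = abs(-b5 + 2 * b7 - b9) + abs(g6 - g8);    /* DH of (1) */
    int dv = abs(-b1 + 2 * b7 - b13) + abs(g3 - g11);  /* DV of (1) */

    if (dh < dv)
        return clamp8((g6 + g8) / 2 + (-b5 + 2 * b7 - b9) / 2);
    if (dh > dv)
        return clamp8((g3 + g11) / 2 + (-b1 + 2 * b7 - b13) / 2);
    return clamp8((g3 + g6 + g8 + g11) / 4 + (-b1 - b5 + 4 * b7 - b9 - b13) / 8);
}
```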
Fig. 3 shows the demosaiced images. Many color artifacts appear in Fig. 3(b), which uses bilinear interpolation, whereas there are no color artifacts in Fig. 3(c), which uses ACPI. In addition, the image produced by bilinear interpolation is more blurred than the one produced by ACPI. Therefore, high quality cannot be obtained using bilinear interpolation even when the image enhancement process is applied.
Fig. 3. The demosaiced image (a) Original image, (b) Bilinear interpolation, and (c) ACPI
2.2 Image Enhancement Algorithms I: Unsharp Masking Filter
Due to the optics and demosaicing, the captured image can be blurred. An unsharp masking filter (UMF) is a well-known technique to improve the visual quality of blurred images by enhancing the edges [3]. It is a simple sharpening operator that derives its name from the fact that it enhances edges by subtracting an unsharp, or smoothed, version of an image from the original image. The enhanced image, Îij, is obtained from the input image, Iij, as

Îij = Iij + λ · (Iij − Īij)    (3)

where λ is a scaling constant and Īij is the output of a low pass filter.

2.3 Image Enhancement Algorithms II: Adaptive Contrast Enhancement
When the images captured by the CIS are presented on the display, they may contain low-contrast details because of the difference between the display and the scene dynamic range. When scene information with a wide dynamic range is squeezed into the limited range of a typical display, the local details of low contrast are not perceived. To enhance the image quality adaptively, a real-time adaptive contrast enhancement method (ACE) was introduced [4]. The local frequency envelope is brought close to the global mean while the high-frequency local variations are amplified above the contrast sensitivity threshold. This can be modified for implementation purposes by replacing the global mean with the local mean. Fig. 4 shows the block diagram.
Fig. 4. Block diagram of the adaptive contrast enhancement
The image intensity at each pixel location is transformed based on the local mean (mij) and local standard deviation (σij) estimated on the window surrounding the pixel. The transformed intensity (Îij) becomes

Îij = gij · (Iij − mij) + mij    (4)
where

gij = Gmin                                                   if x < x1
gij = ((x − x1)·Gmax + (x2 − x)·Gmin) / (x2 − x1)            if x1 ≤ x < x2    (5)
gij = Gmax                                                   if x ≥ x2
and x, Iij, Gmin and Gmax represent the ratio of mij to σij, the intensity at pixel location (i, j), and the minimum and maximum values of the variable gain gij, respectively. In addition, x1 and x2 are the endpoints of the interval in which the variable gain is applied to the difference to amplify the local variations. To prevent the gain from being inordinately large in areas with a large mean and a small standard deviation, the local gain is controlled by Gmin and Gmax in (5).
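The ACE transform of (4)-(5) can be sketched in C as below; the local mean, standard deviation and the parameters x1, x2, Gmin, Gmax are assumed to be supplied by the caller, and the names are illustrative rather than part of [4].

```c
/* Variable gain of (5): linear ramp from Gmin at x1 to Gmax at x2. */
static float ace_gain(float x, float x1, float x2, float gmin, float gmax)
{
    if (x < x1)  return gmin;
    if (x >= x2) return gmax;
    return ((x - x1) * gmax + (x2 - x) * gmin) / (x2 - x1);
}

/* Contrast-enhanced intensity of (4). */
static float ace_pixel(float in, float mean, float std,
                       float x1, float x2, float gmin, float gmax)
{
    float x = (std > 0.0f) ? mean / std : x2;   /* ratio m_ij / sigma_ij; a  */
                                                /* zero deviation maps to Gmax */
    float g = ace_gain(x, x1, x2, gmin, gmax);
    return g * (in - mean) + mean;
}
```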
2.4 Conventional Image Signal Processor
In conventional designs [5-7], simple bilinear interpolation is used for color demosaicing to minimize the number of line memories. Fig. 5 shows a block diagram of a conventional ISP. The general image quality of those conventional designs is not high because bilinear interpolation is used. To enhance image quality, a high-performance demosaicing algorithm such as ACPI must be used. Even though the demosaiced image quality of ACPI is good, the required line memories are too large for a single-chip CIS when the line memories for image enhancement are also considered. Therefore, a new method is needed to achieve both high quality and low hardware complexity with a small number of line memories.
Fig. 5. Block diagram of the conventional ISP
3 Proposed Image Signal Processor
In conventional designs, image enhancement is performed using a luminance channel after color demosaicing, so separate line memories are needed in such architectures. In the proposed method, approximated luminance data are used, enabling line memory sharing between demosaicing and image enhancement. ACPI interpolates the green channel first and then the red and blue channels using the interpolated green channel. This means that all green channel data, which account for about 60% of the luminance signal, are available. Therefore, the proposed method does not need additional line memories for image enhancement.
Fig. 6 shows the demosaicing process of ACPI installed in the ISP, which requires four line memories for the input Bayer data to form the 5x5 window and one line memory for two lines of the interpolated green channel. Since half of the green samples are missing in each line, one line memory is enough to store the interpolated green channel. At B19 in the third line of Fig. 6, the green channel is interpolated using the 5x5 window (dash-dot line) and stored in the line memory. The interpolated and stored green channels are used to form a 3x3 window for red and blue channel interpolation. The missing red and blue channels in the second line are interpolated using the interpolated green channel and the red, green and blue channels stored in the line memories for the input Bayer data. The data interpolated in the second line become the outputs of color demosaicing. In Fig. 6, all the green channel data in the 5x3 window (dotted line) centered at G10 are available after the green channel interpolation at B19. Therefore, we can use these green channel data for image enhancement.
Fig. 6. Process of the ACPI
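As a rough illustration of the buffer arrangement described above, the following C sketch models the shared line memories under the assumption of a line-sequential ISP; the structure and names are ours, not the authors' hardware design.

```c
#define MAX_WIDTH 640

/* Line memories shared between demosaicing and image enhancement. */
typedef struct {
    unsigned char bayer[4][MAX_WIDTH]; /* last four CFA lines; together with */
                                       /* the incoming line they form the    */
                                       /* 5x5 demosaicing window             */
    unsigned char green[MAX_WIDTH];    /* interpolated green of two lines    */
                                       /* (only half the greens per line are */
                                       /* missing, so they pack into one     */
                                       /* line memory)                       */
    int head;                          /* index of the oldest Bayer line     */
} isp_line_buf;

/* Row r (0 = oldest) of the rolling Bayer window. */
static unsigned char *bayer_row(isp_line_buf *b, int r)
{
    return b->bayer[(b->head + r) % 4];
}
```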
3.1 Proposed Method I: Unsharp Masking Filter
A UMF needs a low-pass filtered image (Īij). The approximated low-pass filtered image can be calculated as follows, using the linearity of the summation:

Īij = F(Iij) = F(C1·R + C2·G + C3·B) = C1·F(R) + C2·F(G) + C3·F(B)    (6)
where Ci is the coefficient for color conversion from RGB to YCbCr, and R, G, and B represent red, green and blue, respectively. F(x) is the result of a low pass filter when the input is x. Instead of calculating the exact value of F(x), the approximated value F’(x) can be calculated using the pixels generated during the interpolation process. The interpolated red and blue channels at the second line in Fig. 6 can also be employed for image enhancement using a small buffer. If we want to enhance G10 in Fig. 6 and use a mean filter (3x3) as a low pass filter, the F’(x) of each channel can be calculated as
F′(R) = (R8′ + R9 + R10′ + R11) / 4
F′(G) = (ΣGi + ΣGi′) / 9    (7)
F′(B) = (B3 + B9′ + B10′ + B17) / 4

where R′, G′ and B′ denote the interpolated channel data. The luminance of the current pixel (Iij) can be easily obtained since all data are interpolated and available at the current location.
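A hedged C sketch of the approximated low-pass filtering of (6)-(7) combined with the unsharp masking step of (3) is given below; the neighbour samples (R8′, R9, ..., B17) are assumed to be handed over by the demosaicing stage, and all names are illustrative.

```c
/* Approximated low-pass luminance around the current green pixel, per (6)-(7),
   followed by the unsharp-mask output of (3).  c1, c2, c3 are the RGB-to-Y
   weights and lambda the sharpening strength. */
static float umf_enhance(float lum,         /* luminance I_ij at the pixel   */
                         const float r[4],  /* R8', R9, R10', R11            */
                         const float g[9],  /* 3x3 green neighbourhood       */
                         const float b[4],  /* B3, B9', B10', B17            */
                         float c1, float c2, float c3, float lambda)
{
    float fr = (r[0] + r[1] + r[2] + r[3]) / 4.0f;   /* F'(R) */
    float fg = 0.0f;
    for (int i = 0; i < 9; ++i) fg += g[i];
    fg /= 9.0f;                                      /* F'(G) */
    float fb = (b[0] + b[1] + b[2] + b[3]) / 4.0f;   /* F'(B) */

    float lowpass = c1 * fr + c2 * fg + c3 * fb;     /* approximated I-bar, (6) */
    return lum + lambda * (lum - lowpass);           /* unsharp mask, (3)       */
}
```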
3.2 Proposed Method II: Adaptive Contrast Enhancement
To adopt the image enhancement method in [4], the local mean (mij) and local standard deviation (σij) in (4) and (5) must be calculated. The calculation of the approximated local mean of the luminance intensity, mij, is similar to the low-pass filtering process in the UMF, and the 5x3 window is used.
mij = E(Y) = E(C1·R + C2·G + C3·B) = C1·E(R) + C2·E(G) + C3·E(B)    (8)
To obtain the gain, gij, a local standard deviation has to be calculated. Contrary to the mean, which is linear, the standard deviation is a quadratic expression; therefore it cannot be calculated without all of the pixel values. In the proposed method, we use the green channel instead of the luminance channel: the ratio of the green channel mean to the green channel standard deviation is used to obtain the gain, gij.
gij = Eij(G) / σij(G)    (9)
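The green-channel substitution of (8)-(9) can be sketched as follows, assuming the caller collects the samples of the shared 5x3 window; the helper names are ours, and the resulting ratio would in practice still be limited through (5).

```c
#include <math.h>

/* Green-channel statistics over the shared 5x3 window and the ratio of (9).
   n is the number of green samples (15 for a full window). */
static float ace_green_gain(const float *g, int n, float *green_mean)
{
    float m = 0.0f, s = 0.0f;
    for (int i = 0; i < n; ++i) m += g[i];
    m /= n;
    for (int i = 0; i < n; ++i) s += (g[i] - m) * (g[i] - m);
    s = sqrtf(s / n);                 /* local standard deviation of green */

    *green_mean = m;
    return (s > 0.0f) ? m / s : 0.0f; /* E(G)/sigma(G), used as the gain   */
}

/* Approximated local luminance mean of (8) from per-channel window means. */
static float ace_local_mean(float mean_r, float mean_g, float mean_b,
                            float c1, float c2, float c3)
{
    return c1 * mean_r + c2 * mean_g + c3 * mean_b;
}
```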
3.3 Proposed Image Signal Processor
Fig. 7 shows a block diagram of the proposed ISP. Since the color demosaicing block and the image enhancement block share line memories, the ISP can be implemented with only five line memories, which is the same as or fewer than the number of line memories used in [5-7].
Fig. 7. Block diagram of the proposed ISP
4 Experimental Results
We tested the performance of the proposed algorithm with the several natural color images shown in Fig. 8. To show the efficiency of the proposed method, the three methods described below were compared.
1) Method I: Bilinear (3x3 window) interpolation + image enhancement
2) Method II: ACPI + image enhancement using luminance data
3) The proposed method: ACPI + image enhancement using approximated luminance data
Table 1 shows the performance comparisons. Method II and the proposed method show much better performance than Method I, and the number of required line memories for the proposed method is the smallest of the three methods. To show the difference between image enhancement using the luminance channel and using the green channel, we calculated the mean squared error (MSE). The average MSE between Method II and the proposed method is about 4.98 when the UMF is applied and about 7.5 when ACE is applied. Even though there are some differences, the results are visually the same.
Fig. 8. Color images used in the experiments

Table 1. Performance comparisons

                                          Method I   Method II   Proposed method
Demosaicing performance  PSNR (dB)  R       28.37       36.45       36.45
                                    G       32.45       38.25       38.25
                                    B       28.08       36.27       36.27
MSE (vs. Method II)      UMF                    -           -        4.98
                         ACE                    -           -        7.5
# of line memories       Demosaicing            2           5        5
                         Image enhancement      4           4        0
                         Total                  6           9        5
Fig. 9 shows the resulting images. As shown in Fig. 9, the image quality of the proposed method is almost the same as that of Method II and much better than that of Method I. In Fig. 9(b) and (e), the images are blurred and severe color artifacts are observed.
In contrast, sharpened images with few color artifacts are shown in Fig. 9(c)-(d) and (f)-(g). The experimental results indicate that the proposed method can enhance image quality without using additional line memories.
Fig. 9. Results images: (a) Original image, (b) Method I (UMF), (c) Method II (UMF), (d) Proposed Method (UMF), (e) Method I (ACE), (f) Method II (ACE), (g) Proposed Method (ACE)
5 Conclusion
An efficient demosaiced image enhancement method for a low cost single-chip CIS with an on-chip image signal processor was proposed. For low cost single-chip CISs, the implemented ISP must be small while maintaining high quality. In conventional ISPs, demosaicing and image enhancement are performed independently, so they require a large number of line memories; to reduce this number, simple demosaicing algorithms with low quality are usually used. In this paper, we proposed a new method to minimize the number of line memories. In the proposed method, the data buffered for demosaicing were used to construct approximated luminance data for image enhancement. This enabled line memory sharing between the demosaicing block and image enhancement. To verify the proposed method, an unsharp masking
filter and adaptive contrast enhancement were used. The experimental results indicated that the proposed approach can enhance image quality without additional line memories. Other image enhancement techniques can also be applied with the proposed method. Therefore, the proposed method is very useful for low cost single-chip CISs, which require high performance, small area and low power consumption.
Acknowledgments This research was supported by the MIC (Ministry of Information and Communication), Korea, under the ITRC (Information Technology Research Center) support program supervised by the IITA (Institute of Information Technology Assessment) and Yonsei University Institute of TMS Information Technology, a Brain Korea 21 program, Korea.
References
1. B. E. Bayer, “Color Imaging Array,” U.S. Patent 3,971,065
2. J. E. Adams and J. F. Hamilton Jr., “Adaptive Color Plane Interpolation in Single Sensor Color Electronic Camera,” U.S. Patent 5,629,734
3. R. C. Gonzalez and R. E. Woods, Digital Image Processing
4. P. M. Narendra and R. C. Fitch, “Real-Time Adaptive Contrast Enhancement,” IEEE Trans. on Pattern Analysis and Machine Intelligence, Vol. PAMI-3, No. 6 (1981) 655-661
5. Y. H. Jung, J. S. Kim, B. S. Hur, and M. G. Kang, “Design of Real-Time Image Enhancement Preprocessor for CMOS Image Sensor,” IEEE Transactions on Consumer Electronics, Vol. 46, No. 1 (2000) 68-75
6. H. Kim et al., “Digital Signal Processor with Efficient RGB Interpolation and Histogram Accumulation,” IEEE Transactions on Consumer Electronics, Vol. 44 (1998) 1389-1395
7. R. Zhou et al., “System-on-Chip for Mega-Pixel Digital Camera Processor with Auto Control Functions,” Proc. 5th International Conference on ASIC, Vol. 2 (2003) 894-897
Robust Background Subtraction for Quick Illumination Changes
Shinji Fukui1, Yuji Iwahori2, Hidenori Itoh3, Haruki Kawanaka4, and Robert J. Woodham5
1 Faculty of Education, Aichi University of Education, 1 Hirosawa, Igaya-cho, Kariya 448-8542, Japan
[email protected]
2 Faculty of Engineering, Chubu University, 1200 Matsumoto-cho, Kasugai 487-8501, Japan
[email protected]
3 Faculty of Engineering, Nagoya Institute of Technology, Gokiso-cho, Showa-ku, Nagoya 466-8555, Japan
[email protected]
4 Faculty of Info. Sci. and Tech., Aichi Prefectural University, 1522-3 Ibaragabasama, Kumabari, Nagakute-cho, Aichi-Gun 480-1198, Japan
[email protected]
5 Dept. of Computer Science, University of British Columbia, Vancouver, B.C., Canada V6T 1Z4
[email protected] Abstract. This paper proposes a new method to extract moving objects from a color video sequence. The proposed method is robust to both noise and intensity changes in the observed image. A present background image is estimated by generating conversion tables from the original background image to the present image. Then, the moving object region is extracted by background subtraction. Using color gives more accurate detection than a previous method which used only monochrome data. Color images increase the computational load. The method addresses this problem by using the GPU’s throughput. Results are demonstrated with experiments on real data.
1 Introduction
Real-time extraction of moving objects from video sequences is an important topic for various applications of computer vision. Applications include counting the number of cars in traffic, observing traffic patterns, automatic detection of trespassers, video data compression, and analysis of non-rigid motion. Among the segmentation approaches for moving objects, “background subtraction” is the most basic and fastest approach. However, it works well only when the background image has constant brightness, and it fails when the brightness of the moving object is close to that of the background. The “Peripheral Increment Sign” (PIS) [1] has also been proposed for the condition in which the illumination is not constant in a video sequence. It is applicable to
the real-time implementation because the filtering process is simple, but the segmentation of the moving object itself is still liable to be affected by noise. The “Normalized Distance” [2], which was proposed to give a better result even under illumination change, is also based on background subtraction; it is liable to be affected by noise and fails when the brightness is low or when the observed object has a texture similar to the background. In [3], an approach to estimate the background occluded by a moving object is proposed. It uses texture and normalized intensity to cope with illumination change, because both are illumination invariant. This approach assumes a linear change of illumination intensity and does not apply to nonlinear intensity changes. On the other hand, probabilistic approaches using Gaussian mixture models have been proposed to remove the shadow region of a moving object from the estimated moving region [4]-[6]. These methods need a learning process to extract moving objects and do not handle quick illumination changes. This paper proposes a new method to extract moving objects from a color video sequence. The proposed method is robust both to noise and to intensity changes caused by scene illumination changes or by camera functions. The present background image is estimated by generating conversion tables from the original background image to the present image. Then, the moving object region is extracted by a method based on background subtraction. Since the background image, which excludes the moving object itself, is estimated from the observed image and the original background image, the method is applicable to nonlinear intensity changes. The proposed approach uses a color video sequence and gives more accurate detection than our previous method [7], which used only monochrome data. On the other hand, color sequences generally increase the processing load. The method addresses this problem by using the throughput of a Graphics Processing Unit (GPU). Results are demonstrated in experiments on real data.
2 Proposed Approach
Figure 1 shows the outline of the proposed approach. The proposed approach consists of two processes. The first process estimates the present background image. The second process extracts the moving objects based on background subtraction.

2.1 Generation of Estimated Background Image
Let the RGB color values at (x, y) in the original background image, BG, be denoted by BG(x, y) = (BGR (x, y), BGG (x, y), BGB (x, y)). Let those in the observed image at time t = 1, 2, . . . , be denoted by It (x, y) = (ItR (x, y), ItG (x, y), ItB (x, y)). The region with no moving objects at time t is defined as the background region, At .
Fig. 1. Outline of Proposed Approach
When the intensity of the whole image changes, for example owing to the automatic gain control function of the camera, It(x, y) in At differs from BG(x, y). In this case, moving regions cannot be extracted by simple background subtraction. The background image, BGt, has to be estimated with high confidence. Then moving regions are extracted using the subtraction of BGt and It. Conversion tables which convert BG into BGt are generated to estimate BGt. The present background image, BGt, is estimated through the following procedure:
1. Generate the conversion tables
2. Obtain BGt from the conversion tables

Generation of Conversion Tables
Let (x, y) and (x′, y′) be different points in the image. Suppose that, when the overall intensity of the whole image changes, the value of ItR(x, y) is equal to that of ItR(x′, y′) whenever the value of BGR(x, y) is equal to that of BGR(x′, y′). From this assumption, a conversion table from BGR(x, y) to ItR(x, y) is obtained based on the relation between BGR(x, y) and ItR(x, y) in At. Conversion tables for the G and B color channels are obtained in the same way.
Let the candidate for the background region be A′t. Figure 2 shows how to specify A′t. First, the absolute value of the difference between It and It−1 is obtained. Second, the region with the largest differences is selected by thresholding. Third, A′t is calculated as the logical AND between the selected region and the last estimated background region At−1. Here, A0 is the entire image, which is necessary to estimate BG1. When the illumination changes quickly, the approach extracts the whole image region as A′t; in such a case, A′t is treated as a region obtained by a dilation process applied to At−1. The conversion table of each channel from BG(x, y) to It(x, y) is made from the RGB values in the obtained region A′t. These tables convert the RGB color values of BG to the estimated background image BGt.
G
B
It-1 , It-1 , It-1
R
G
1247
B
It , It , It
SUBTRACT AND
Inter-Frame Subtraction (Binary) Image
A’t
At-1
Fig. 2. Specifying Region A′t
A histogram, as shown in Figure 3, is produced from each set (ItR(x, y), BGR(x, y)) in the region A′t to make the conversion table for the R channel. The conversion table is obtained from this histogram as shown in Figure 4. As the histogram at ‘value at BGR = 26’ in Figure 4 shows, pixels with the same value of BGR may not have the same value of ItR under the effect of noise. Therefore, the conversion table uses the median of the set of ItR values, where the set is taken over all pixels of A′t that have the same value of BG(x, y). Since the median value of ItR is 11 in this example, the pair (26, 11) is added to the conversion table. Then, linear interpolation is applied to the points that do not appear in the conversion table.

Obtaining BGt from Conversion Tables
The conversion tables convert the RGB color values of BG to the corresponding RGB color values of the present frame, including in the region occluded by the moving object in It. As a result, the background, BGt, without moving objects is estimated.
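For illustration, a simplified C routine for building one channel's conversion table by the median rule with linear interpolation is shown below; it assumes 8-bit images and a mask marking A′t, and it is not the authors' implementation.

```c
#include <string.h>

/* Conversion table of one colour channel: for every background value v, the
   median observed value over the pixels of A't with bg == v, with linear
   interpolation (and edge extension) for values that never occur. */
static void build_conversion_table(const unsigned char *bg,   /* BG channel   */
                                   const unsigned char *obs,  /* I_t channel  */
                                   const unsigned char *mask, /* 1 inside A't */
                                   int npix, unsigned char table[256])
{
    static int hist[256][256];         /* hist[v][u]: #pixels bg==v, obs==u */
    int count[256] = {0}, known[256] = {0};
    memset(hist, 0, sizeof hist);

    for (int i = 0; i < npix; ++i)
        if (mask[i]) { ++hist[bg[i]][obs[i]]; ++count[bg[i]]; }

    for (int v = 0; v < 256; ++v) {    /* median of each populated column   */
        if (count[v] == 0) continue;
        int half = (count[v] + 1) / 2, acc = 0, u = 0;
        while (acc < half) acc += hist[v][u++];
        table[v] = (unsigned char)(u - 1);
        known[v] = 1;
    }

    int prev = -1;                     /* fill values without observations  */
    for (int v = 0; v < 256; ++v) {
        if (!known[v]) continue;
        if (prev < 0)
            for (int u = 0; u < v; ++u) table[u] = table[v];
        else
            for (int u = prev + 1; u < v; ++u)
                table[u] = (unsigned char)
                    (table[prev] + (table[v] - table[prev]) * (u - prev) / (v - prev));
        prev = v;
    }
    if (prev < 0) { for (int u = 0; u < 256; ++u) table[u] = (unsigned char)u; }
    else          { for (int u = prev + 1; u < 256; ++u) table[u] = table[prev]; }
}
```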
2.2 Object Extraction by Background Subtraction
After BGt is estimated, regions corresponding to moving objects are extracted by background subtraction. To get a more robust result, It is divided into blocks, and each block is classified as a background region or not. The total of the absolute values of the differences in each block is calculated for each color channel. A block that has one (or more) channel total below the threshold is regarded as a block of an object. The method changes the threshold value dynamically as the intensity of the whole image changes. The histograms shown in Figure 3 are used to determine the threshold. The procedure for determining the threshold for an R value is as follows: the range with 80% or more of the frequency from the mean value is considered, and that value is adopted as the threshold for that R value.
Fig. 3. How to Make Histogram
The threshold for an R value that does not appear in the conversion table cannot be determined by the above procedure. In that case, the threshold for that R value is set to that of the next smaller R value. Thresholds for the G and B channels are determined in the same way, so a threshold is determined for each pixel and each color channel. Next, the sum of the absolute values of the differences between BGt(x, y) and It(x, y) and the sum of the threshold values are calculated for each channel in each block. Finally, blocks where the sum of the absolute differences is larger than the sum of the thresholds for all channels are regarded as blocks in the background; otherwise, they are regarded as blocks in the moving object. Each block is analyzed by the above process. Subsequently, a more detailed outline of the moving object is extracted: for blocks that overlap the boundary of a moving object, each pixel is classified into either the moving region or the background region, in the same way as the whole block.
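The block decision can be sketched in C as below; the routine only accumulates the per-block sums, and the classification helper follows the rule as stated in the text (a block is background when the difference sum exceeds the threshold sum in every channel). The names and the block geometry are assumptions; the paper uses 10x10 blocks.

```c
/* Sum of absolute background differences and of per-pixel thresholds for one
   block of a single channel.  bw x bh is the block size, stride the width. */
static void block_sums(const unsigned char *img, const unsigned char *bg_est,
                       const unsigned char *thresh, int stride,
                       int x0, int y0, int bw, int bh,
                       long *diff_sum, long *thr_sum)
{
    long d = 0, t = 0;
    for (int y = y0; y < y0 + bh; ++y)
        for (int x = x0; x < x0 + bw; ++x) {
            int i = y * stride + x;
            int diff = img[i] - bg_est[i];
            d += (diff < 0) ? -diff : diff;
            t += thresh[i];
        }
    *diff_sum = d;
    *thr_sum = t;
}

/* Per the text, background when the difference sum exceeds the threshold sum
   in every channel; otherwise the block belongs to the moving-object region
   and, on the object boundary, its pixels are examined individually. */
static int block_is_background(const long diff[3], const long thr[3])
{
    return diff[0] > thr[0] && diff[1] > thr[1] && diff[2] > thr[2];
}
```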
3 Speed Up Using the GPU
Recently, the performance of the Graphics Processing Unit (GPU) has improved so that more polygons can be displayed in a shorter time. The geometry and rendering engines in a GPU are highly parallelized. Moreover, the shading process has become programmable via what is called the “programmable shader”; current GPUs make geometry and rendering highly programmable. GPUs are very powerful, specialized processors that can be used to increase the speed not only of standard graphics operations but also of other processes [8]. A GPU is useful for image processing and can be programmed with a high-level language like C. The proposed method uses the GPU for acceleration; some other approaches using the GPU have also been proposed recently [9]-[10]. To use the GPU, a texture containing the image data must be created. Any result needed by the CPU must be rendered to a texture, and the CPU then reads the texture data. Also, the processing result of one pixel cannot be used in the processing of other pixels.
Fig. 4. Generating Conversion Table from Histogram
Therefore, generating the conversion tables and determining the thresholds are processed on the CPU in the proposed method. The procedure for the proposed method is as follows:
1. Extract A′t from the It texture, the It−1 texture and the At−1 texture, and render the result into the A′t texture
2. Generate the conversion table for each channel from the A′t texture, the BG texture and the It texture, and create a texture for the conversion tables
3. Determine the thresholds for each channel and create a texture for the thresholds
4. Estimate BGt from the BG texture, the It texture, the A′t texture and the conversion-table texture
5. Calculate the absolute values of the differences between It and BGt, and render the result into a texture
6. Determine the threshold for each pixel and render the result into a texture
7. Calculate the sum of the absolute differences in each block and render the result into a texture
8. Calculate the sum of the thresholds in each block and render the result into a texture
9. Render the result of the blocking process into a texture
10. In the border blocks of a moving object, render the result of the per-pixel thresholding process into the At texture; otherwise, render the result of the blocking process into the At texture
4 Experimental Results
In the experiments, a general-purpose digital video camera was used as the input device. A PC with a Pentium 4 570 and a GeForce 7800 GTX was used to extract moving objects.
Fig. 5. Background Image Obtained in Advance
Fig. 6. Observed Images and Results: (a) Observed Images, (b) Results produced by the proposed approach, and (c) Results produced by the previous approach[7]
moving objects. Each target image frame was 720×480 pixels. The corresponding block size was set to 10 × 10 pixels. First, an indoor scene is used as the original background. The experiment is done under conditions in which the intensity of the light source changes over the course of the image sequence. Figure 5 shows the background images obtained in advance. Figure 6 shows the observed images and experimental results. Figure 6-(a) shows the observed images, which have different overall brightness from the original background image and Figure 6-(b) shows the experimental results. The white region shows the
Fig. 7. Results When Illumination Just Changed: (a) Observed Images, (b) Results produced by the proposed approach, and (c) Results produced by the previous approach[7]
Fig. 8. Background Image Obtained in Advance
detected moving object, and the black region corresponds to the detected background region. Figure 6-(b) demonstrates that this approach can extract the moving object with high performance even under quick illumination changes. Figure 6-(c) shows the experimental results produced by our previous approach [7]; the proposed approach works better than the previous approach [7]. Figure 7 shows the results at the moment when the illumination had just changed. Figure 7-(a) shows the observed images, Figure 7-(b) the results produced by the proposed approach, and Figure 7-(c) the results produced by our previous approach [7]. The proposed approach has a clear advantage over the previous approach [7] even when the illumination has just changed. In this experiment, the implementation using the GPU took about 17.9 msec/frame, while the implementation without the GPU took about 44.7 msec/frame.
Fig. 9. (a) Observed Images, and (b) Results
These results show that using the GPU increases the processing speed dramatically, so that we can extract the moving object in real time. Next, another experiment was done outdoors, about 20 minutes after sunset. Figure 8 shows the background image obtained in advance. Figure 9 shows the observed images and experimental results. Figure 9-(a) shows the observed images, which again have a different overall brightness from the original background image, and Figure 9-(b) shows the experimental results. These results show that this approach can also extract the moving object with high performance even under gradual changes in sunlight.
5 Conclusion
A new approach is presented to extract the moving regions in a video sequence. The approach estimates the present background image. It is able to extract a moving object in real time with robustness to quick intensity changes. Using the GPU results in a significant speed-up, and the GPU can be expected to increase the speed of other image processing operations as well. Some cases of incorrect detection remain. Future work includes the detection of shadow regions produced by moving objects.
Acknowledgment This work is supported by THE HORI INFORMATION SCIENCE PROMOTION FOUNDATION, the Grant-in-Aid for Scientific Research No.16500108, Japanese Ministry of Education, Science, and Culture, and Chubu University grant. Woodham’s research is supported by the Canadian Natural Sciences and Engineering Research Council (NSERC).
References
1. Y. Sato, S. Kaneko, and S. Igarashi, “Robust Object Detection and Segmentation by Peripheral Increment Sign Correlation Image,” Trans. of the IEICE, Vol. J84-D-II, No. 12 (2001) 2585-2594 (in Japanese)
2. S. Nagaya, T. Miyatake, T. Fujita, W. Ito, and H. Ueda, “Moving Object Detection by Time-Correlation-Based Background Judgment Method,” Trans. of the IEICE D-II, Vol. J79-D-II, No. 4 (1996) 568-576 (in Japanese)
3. H. Habe, T. Wada, and T. Matsuyama, “A Robust Background Subtraction Method under Varying Illumination,” Tech. Rep. of IPSJ, SIG-CVIM115-3 (1999) (in Japanese)
4. C. Stauffer and W. E. L. Grimson, “Adaptive Background Mixture Models for Real-Time Tracking,” in Proc. of the 1999 IEEE Computer Society Conference on Computer Vision and Pattern Recognition, Vol. 2 (1999)
5. C. Stauffer and W. E. L. Grimson, “Learning Patterns of Activity Using Real-Time Tracking,” IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. 22, No. 8 (2000) 747-757
6. P. KaewTraKulPong and R. Bowden, “An Improved Adaptive Background Mixture Model for Real-Time Tracking with Shadow Detection,” in Proc. of the 2nd European Workshop on Advanced Video-Based Surveillance Systems (2001)
7. S. Fukui, T. Ishikawa, Y. Iwahori, and H. Itoh, “Extraction of Moving Objects by Estimating Background Brightness,” Trans. of IIEEJ, Vol. 33, No. 3 (2004) 350-357
8. http://www.gpgpu.org/
9. A. Griesser, S. De Roeck, A. Neubeck, and L. Van Gool, “GPU-Based Foreground-Background Segmentation Using an Extended Colinearity Criterion,” in Proc. of Vision, Modeling, and Visualization (VMV) (2005)
10. R. Matsui, H. Nagahara, and M. Yachida, “An Acceleration Method of Omnidirectional Image Processing Using GPU,” in Proc. of Meeting on Image Recognition and Understanding (MIRU2005) (2005) 1539-1546 (in Japanese)
A New Sampling Method of Auto Focus for Voice Coil Motor in Camera Modules
Wei Hsu and Chiou-Shann Fuh
Department of Computer Science and Information Engineering, National Taiwan University, Taipei, Taiwan, R.O.C.
{r93063, fuh}@csie.ntu.edu.tw
Abstract. We present an effective and efficient sampling table for auto-focus searching algorithms in phone camera modules. Many searching algorithms are based on the focus value curve, and they do not consider the correlation between sampling and image characteristics. The proposed table is calibrated from camera images, so it reflects the combined effect of the optical lens, sensor response, noise, image pipeline, compression method, and focus value function. We verify the algorithms in an image signal processor with mega-pixel cameras. The table reduces the total number of searching steps from 412 to 16, and the images are still sharp. Our searching algorithm with the table can focus a scene within 0.6 seconds, from infinity to 8 cm. The method can easily be applied to mobile phones. Keywords: Auto focusing, focus value, phone camera, voice coil motor.
1 Introduction
Auto Focus (AF), Auto Exposure (AE), and Auto White Balance (AWB) algorithms are the most important controls of image quality in digital cameras, and they are key differentiators between cameras. Image sensors of the CCD (Charge-Coupled Device) and CMOS (Complementary Metal-Oxide Semiconductor) types are popular today. CCD sensors have better image quality, but their cost and power consumption are disadvantages. CMOS sensors with a built-in processor are widely applied to mobile phones and webcams, but their image quality is only acceptable. A new architecture, an ISP (Image Signal Processor) with a raw sensor, has been proposed, and many people believe this architecture can strike a balance between image quality and cost; indeed, a well-known company promoted the design for a time. Recently, sensor providers have preferred not to integrate the ISP into high-resolution CMOS sensors, in order to reduce production cost. In a typical ISP, AF finds a focus, AE controls the exposure value, and AWB balances color temperatures. Color calibration maps the device's colors into standard (or preferred) colors. The image pipeline processes Bayer [1] images into full-color images. A JPEG [2] (Joint Photographic Experts Group) encoder compresses images into data streams for communication or storage. In the future, ISPs will support H.264 to meet video applications requiring high compression rate and quality. Small dimensions and high resolution are noticeable features of phone camera modules. Especially regarding height, the thickness of a mobile phone is a very alluring
feature in marketing, but the thickness is usually constrained by the camera's height. As the height decreases, the optical advantage weakens. In the applications above, fixed-focus cameras are used at lower resolutions (under 2 mega-pixels), and their images are always somewhat blurred. An AF function can compensate for this weakness, and it effectively upgrades image sharpness, especially at near distances. If high-pixel-count sensors are bonded with fixed-focus lenses, the output images are still blurred. Therefore, AF is a very important feature in the high-resolution age. AF algorithms can normally be divided into a sharpness function and a searching loop. Sharpness evaluation and the searching servo are closely related in both reliability and speed, and the computation of the sharpness function is usually greater than that of the search. Sharpness functions give statistics to recognize the degree of blur in images. Simple evaluation functions are easily influenced by noise; complex functions can provide more complete information, but their computation is too heavy for embedded systems. The evaluation of suitable functions is therefore necessary. The optical lens of a phone camera has a larger depth of field (DOF), but AF still spends much time searching from infinity to the macro distance. Our new sampling table, built by calibration, can reduce the searching time and preserve reliability, and the proposed searching algorithm performs effectively and efficiently for mobile phones.
2 Focus Value
The focus value is the sharpness information of an image, and auto-focus algorithms measure image quality by focus value functions. Many focus value functions have been presented; we classify them into the gray-level, spatial-filter, frequency, and probability domains. Gray-level functions are often based on amplitude, energy, gradient, absolute value, and variance [3], [8]. In the spatial-filter domain, high-pass or edge filters produce the statistics; Tenengrad and Laplacian are well-known filter processes [3]. For convenience, one-directional (horizontal) filters are used on real platforms; the unilateral sharpness information is incomplete, but it is enough for focusing. The DCT (Discrete Cosine Transform), FFT (Fast Fourier Transform), and wavelets are used in frequency analyses. The Huffman coding length can serve as the focus value [7] in the probability domain: in JPEG compression, frequency data are encoded by Huffman coding, and in-focus images have a more complex distribution of frequency coefficients, so the stream length is larger. Sum-Modulus-Difference (SMD), Laplacian, FFT, and coding length are representative measures of the respective domains. The focus value curves of Fig. 1 show a JPEG sequence from infinity to the macro range evaluated by five focus value functions. The gray-level amplitude often has many local maxima, and many searching methods fail on it. The SMD algorithm performs well on these images because it has a sharp and clear curve. The FFT is well suited to analyzing images, and the Laplacian is a robust method in many objective comparisons. Images are very complex products of the capture device: they are affected by the scene, noise, sensor response, optical lens, image pipeline, compression specification, and capture conditions. After quantization, JPEG images have lost some information, and some extra noise is added.
Fig. 1. Focus value curves by five different functions. Sequential JPEG images are captured step by step from the furthest end to the nearest end. The gray-level amplitude, SMD, Laplacian, FFT, and file-size measures show different characteristics in their curves.
Different functions capture different image features. Many methods normalize their results, but this operation can be skipped during the search to reduce computation. The fastest method is simply to read the file size as the focus value, which costs nothing in time and space. Frequency-domain methods carry the most complete information but are too complex. Algorithms in the gray-level domain are the fastest, but they are sensitive to slight image variations. Methods in the spatial-filter domain are applicable to embedded systems, but a hardware engine should be implemented to meet real-time requirements. We can combine, modify, or transform focus value functions to improve statistical reliability [4]; two or more focus value functions can be computed at the same time to provide more information for the search. In different applications, images are captured in many data formats. Focus value functions usually measure sharpness on the luminance or green channel, but we can also consider file sizes or stream lengths of the compressed data. If the encoder has a DCT hardware engine, we can conveniently use the frequency transform for the most complete and accurate statistics. It is difficult to say which focus value function is best, but we can choose a suitable one to meet specific requirements under the constraints of system capability and computation time.
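As one concrete example of the spatial-filter class of measures, the following C routine accumulates the absolute horizontal second difference over a luminance window; it is only an illustrative focus value function, not necessarily the one implemented in the ISP discussed later.

```c
/* Focus value of a luminance window by a one-directional (horizontal)
   high-pass filter: the sum of absolute second differences.  A sharper
   image yields a larger value.  Illustrative only. */
static long focus_value(const unsigned char *lum, int stride,
                        int x0, int y0, int w, int h)
{
    long fv = 0;
    for (int y = y0; y < y0 + h; ++y)
        for (int x = x0 + 1; x < x0 + w - 1; ++x) {
            int i = y * stride + x;
            int d = -lum[i - 1] + 2 * lum[i] - lum[i + 1];
            fv += (d < 0) ? -d : d;
        }
    return fv;
}
```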
3 Focus Sampling
AF sampling captures the scene at different focus distances. A focus value curve records the focus value at every focus position, and we usually evaluate AF accuracy with this curve. Window segmentation and adaptive step sizes are flexible tools to improve AF performance, and a calibrated sampling table can shorten the focusing time; we discuss them below.
3.1 Focus Value Curve
A stepping motor moves the lens by gear rotation, and a VCM (Voice Coil Motor) drives the lens by coil displacement to change the focus. The input images of capture devices are discrete signals, and fixed focus positions must be repeatable under mechanical and electrical control. Generally, the total set of steps covers all focus distances. Within the searching range, searching algorithms find a focus while minimizing unnecessary lens movement. Global search scans all steps and records the focus value at every step, as in Fig. 2.
Fig. 2. Discrete samples of a focus value curve. The vertical axis is the focus value amplitude, and the focus positions are on the horizontal axis. Global search scans all steps to plot the curve, and we usually analyze AF methods with this curve.
Noise, camera shake, and object movement often cause vibration in a curve, and this situation frequently confuses searching algorithms. Under most definitions of the sharpness function, the maximum statistic corresponds to the sharpest image, and the focus value in the in-focus range is greater than that in the out-of-focus range [6]. After a search, AF moves the lens to the position with the maximum statistic to finish the action. Other searching algorithms can be imagined as jumping along the focus value curve of a global search to find their best focus.

3.2 Window Segment
Window segmentation reduces computation and improves 3A (AF, AE, and AWB) performance, as in Fig. 3. Statistics over a full image are an average result, and the scenic details are hard to see. We therefore segment an image into many blocks and compute their statistics, which usually include the focus value, color average, or color summation. A small number of blocks loses detail, while a large number has a huge computational cost. Basically, bigger image sizes and more blocks are better for precise evaluation; if there were no constraint on computation, the most precise choice would be one pixel per block. The block-average operation reduces noise interference, and it also reduces computation to meet the real-time requirement. In our experiments, we propose 80 to 200 blocks for 640 by 480 pixels. 3A algorithms can give different weights to blocks or select areas for precise control. Central, spot, and matrix AE metering are typical applications. Many AWB algorithms analyze the statistical distribution in specific domains to decide which samples to use for color correction. A central area is important in AF, as in Fig. 3 (a), and we can give suitable weights by location to correspond to human vision. The peripheral blocks are relatively unimportant, and we can even ignore them.
Fig. 3. Window segment. The blocks in dark color are the areas of interest. We can give weights to these blocks to enhance reliability. (a) A central area. (b) An adaptive area.
If we know in which blocks the subject lies, as in Fig. 3 (b), AF can refer to those blocks for the most precise search. Object detection and face tracking are typical applications of such adaptive areas. Window segmentation is very flexible and effective for algorithm development, and it is the trend in ISPs. A focus value curve built from a small number of blocks forms a steep mountain, which makes it easy to distinguish the peak from the hillside; the disadvantage is sensitivity to noise, and local maxima often appear. Smoother curves are accumulated from more blocks, but then searching algorithms decide on the peak slowly. There is a trade-off between the number of computed blocks and focus value reliability. Many AF methods choose different areas for the out-of-focus and in-focus ranges for accuracy and speed, using different central sizes [4], [5], [6]. In the out-of-focus range, a smooth curve avoids local maxima and converges to the in-focus range quickly; by exchanging area sizes, AF can then accurately find the peak on a steep curve in the in-focus range.

3.3 Step Size
Searching algorithms change step sizes to speed up convergence. Global search uses the minimum step to sample all images in the searching range; it is the slowest method but the most accurate. Binary search compares focus values and halves the step size to find the maximum value back and forth. Fibonacci search is similar to binary search, but the step sizes are arranged by the Fibonacci sequence. Binary search and Fibonacci search are faster than global search, but their accuracy is doubtful. Many searching methods consider the amplitude [4], slope [5], or ratio [9] of the focus value to change the step size during hill-climbing. Adaptive step-size methods can decrease the focus time efficiently, but it is very difficult for them to decide the step size in different searching stages, and they do not take the scene, optical lens, or focus value function into account. Focus value curves differ greatly among cameras, and the controlled parameters of the AF must be fine-tuned to improve performance and reliability for a specific platform.

3.4 Sampling Table
Searching algorithms become efficient and effective with a sampling table. The table records focus steps, and a searching algorithm tracks the table to move the lens position. Fig. 4 illustrates the non-linear relation between object distance and focus steps for a phone camera module.
Fig. 4. The object-distance versus focus-step curve. In this camera module, the optically effective steps number around 490. The DOF is narrow at the near end but wide at far distances.
The optical lens has a DOF, so fewer steps can represent all focus positions. Because of the non-linear relation, lens movement should be encoded in the table for efficient searches. A sampling table can be built from the DOF or from fixed distances with probability. With fixed distances, the aim is to focus at certain fixed distances, but this is optically discontinuous. With DOF-based solutions, neighboring sampling steps are optically distinct, which reduces the probability of local maxima; searching algorithms using such a table can also avoid noise interference and improve accuracy. Because of the optical non-linearity, lens movement should not be decided by logical judgement only: we should track specific focus positions in the sampling to improve AF performance. In adaptive step control, a big step can jump over a peak, and a small one is slow. Adaptive control reads the focus value to adjust the step. Normally, high-contrast scenes have a sharp focus value curve, and low-contrast scenes have a gradual one; searching algorithms often confuse in-focus with out-of-focus because they never know the curvature. Noise is also a major source of interference in focus value curves, and the high-frame-rate preview mode handles huge amounts of image data and brings more noise with it. If a searching algorithm bases its search on the continuous optical behavior, it does not miss the necessary sampling steps. An object at the far end, where the DOF is large, can be covered by coarse sampling, while finer sampling is used where the DOF is small; this keeps reliability while improving speed. Searching steps are reduced, and searching time is saved without quality loss. Images are complex, however, and considering only the DOF is insufficient; we therefore present a new sampling method calibrated with the camera itself.
4 The Algorithms
Our proposed method consists of a sampling table and a searching algorithm for VCM lenses in camera modules. The construction of the table is described in Section 4.1, and a fast and robust searching method is presented in Section 4.2.

4.1 The Sampling Table
Fig. 5 shows a focus value curve; the DOF makes images clear over a range of steps. An acceptable threshold differentiates between blurred and clear images: clear images lie above the acceptable threshold on the curve, and blurred images below it.
Fig. 5. A focus value curve corresponds to clear and blurred images by an acceptable threshold. Clear images are at the hilltop, and they span a number of focus steps.
The curve has the maximum focus value (MFV), and the blurred ratio (BR) is calculated by

BR = Focus Value / MFV    (1)
For an efficient search, we build the sampling table from one blurred end (infinity) to the other blurred end (macro). First, we give an acceptable threshold ratio (ABR) to decide whether an image is clear or blurred. A test chart is set at the infinity distance, and global search finds the in-focus step. The focus is then moved step by step toward the near direction until BR becomes smaller than ABR. At that point, we record the focus step in the proposed table, move the test chart so that it is in focus at the recorded step, and obtain a new MFV by global search. After iterating this process from the furthest to the nearest end, we obtain the sampling table. The table, built with the focus value function in use, covers all steps in the searching range. Searching algorithms then refer only to these specific steps, which are the peaks of the curves in Fig. 6, and all scenes retain sharpness above ABR under this sampling. ABR should be fine-tuned for the target camera: focus value curves of a fixed scene differ greatly among cameras and focus value functions, and noise and artifacts interfere to a degree that depends on the edge-enhancement setting. Therefore, we should assess the image quality when deciding ABR to meet the customization.
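The calibration procedure can be sketched as the following C-style pseudocode, assuming bench helpers set_focus_step(), measure_focus_value() and a manual chart-repositioning step exist; the paper re-runs a global search to obtain each new MFV, which is abbreviated here to a single reading at the anchored step.

```c
#define MAX_TABLE 64

/* Bench helpers assumed to exist (prototypes only; illustrative). */
void   set_focus_step(int step);              /* drive the VCM to a step      */
double measure_focus_value(void);             /* focus value of current frame */
void   reposition_chart_for_step(int step);   /* manual bench operation       */

/* Offline calibration of the sampling table (sketch).  Steps are assumed to
   increase from the far end toward the near end. */
static int calibrate_sampling_table(int step_far, int step_near, double abr,
                                    int table[MAX_TABLE])
{
    int n = 0;
    int anchor = step_far;                 /* chart in focus at the far end  */
    table[n++] = anchor;

    while (anchor < step_near && n < MAX_TABLE) {
        int step = anchor;
        double fv;

        set_focus_step(anchor);
        double mfv = measure_focus_value(); /* peak value for this chart pose */

        do {                               /* walk toward the near end        */
            set_focus_step(++step);
            fv = measure_focus_value();
        } while (step < step_near && fv / mfv >= abr);

        table[n++] = step;                 /* first step judged "blurred"     */
        anchor = step;
        reposition_chart_for_step(anchor); /* manual/bench operation          */
    }
    return n;                              /* number of sampling steps        */
}
```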
Fig. 6. Sampling steps for auto focus. The focus value curves are calibrated one by one over the searching range. Every curve has a different gradient after normalization, and the sampling steps are at the peaks.
4.2 Searching Algorithm
Our proposed method has an initial stage and a searching loop stage, and we use the central window segment for focus value evaluation. Fig. 7 shows the search flow chart.
Fig. 7. Proposed auto focus search. The initial and searching-loop stages form the backbone of our auto focus servo. Each new frame drives the loop forward. Peak detection identifies the peak of the focus value curve to reduce the number of searching steps. Finally, the lens is moved to the in-focus position.
In the initial stage, we move the lens to the infinity focus and reset the control parameters. The basic focus value is recorded as a reference for the current scene. The loop then steps through the sampling table frame by frame, and the focus value is refreshed synchronously; the maximum focus value is updated every frame. The loop terminates when the table is exhausted. Peak detection can also break the loop, based on the basic focus value: assuming that the in-focus focus value is more than 1.1 times the out-of-focus value in high-contrast scenes [9], the lens is judged to have passed the peak if the focus value keeps decreasing and stays below 0.9 times the basic focus value for seven accumulated frames. If peak detection does not break the loop, all sampling steps are scanned. After the loop terminates, the lens is moved to the position with the maximum focus value, which completes AF.
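A minimal sketch of this frame-driven loop is given below, with the 0.9× threshold and the seven-frame count taken from the text; `focus_value` is a hypothetical routine that moves the lens to a table entry and returns the focus value of the central window for the next frame.

```python
def af_search(sampling_table, focus_value):
    """Frame-driven AF search over the pre-built sampling table (Sect. 4.2)."""
    base_fv = focus_value(sampling_table[0])    # basic focus value at infinity
    best_step, best_fv = sampling_table[0], base_fv
    prev_fv, drop_frames = base_fv, 0

    for step in sampling_table[1:]:             # one table entry per new frame
        fv = focus_value(step)
        if fv > best_fv:
            best_fv, best_step = fv, step
        # peak detection: the curve keeps decreasing and stays below 0.9x the
        # basic focus value for seven accumulated frames -> the peak has passed
        if fv < prev_fv and fv < 0.9 * base_fv:
            drop_frames += 1
        else:
            drop_frames = 0
        if drop_frames >= 7:
            break
        prev_fv = fv

    return best_step                            # move the lens here to finish AF
```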
5 Experimental Results
In our experiments, we use a VCM camera module of 10 x 10 x 7.3 mm3 whose effective searching range is from infinity to 8 cm. The actuator driver outputs the voltage that produces the lens displacement of the VCM. The driver has a 10-bit resolution for
voltage control, so 1024 steps are available. The effective steps (infinity to 8 cm) number 412 within the searching range; the remaining steps cannot be focused. Popular CMOS sensors of 2 and 3 mega-pixels are verified with our AF algorithm. The sensors provide 10-bit Bayer data of 640 by 480 pixels at 30 fps (frames per second) in preview mode, but we run them at 28.5 fps for better image quality. Our experimental ISP can process this image size in real time. The 3A hardware engine in the ISP calculates red, green, blue, and focus value statistics over a 12 by 8 window grid. The computation and lens displacement can be finished within one frame interval, and the software is developed in C. Because the optical DOF of VCM camera modules is comparatively large, we reduce 4 steps to 1 step for the global search evaluation; global search then takes 3.61 seconds per round with 103 steps. For our proposed table, the acceptable threshold ratio is set to 0.94 to calibrate the camera module, which reduces the 412 steps to 16 steps. Our method takes 0.56 seconds in the worst case and 0.32 seconds at best. Peak detection reduces the number of searching steps in high-contrast scenes, so searches take even less time. Regarding accuracy, the search follows the sampling table over the searching range. Peak detection is disabled in low-contrast scenes for reliability, and it is also ineffective when the object is at the near end; in those cases our search scans all steps in the sampling table. The maximum focus value found by global search does differ from ours, but the difference lies within an acceptable range, so images remain visually sharp. In fact the method is fast even without peak detection: our AF scans all focus positions within 0.6 seconds on our platform.
6 Conclusion
Our proposed algorithm is both fast and accurate on our camera modules and platform. In the sampling table, the focus values of neighboring steps differ considerably, so the sampling method largely avoids local maxima. The table provides the minimum and most efficient set of sampling steps for other searching algorithms as well. For higher accuracy, the maximum focus value can additionally be searched among the three largest sampling steps. An image is the product of the scene, the sensor response, the optical lens, the image pipeline, and the compression method. In the image pipeline especially, edge enhancement can compensate sharpness, but too strong an operation looks unnatural and also amplifies noise. At the current stage, we estimate the acceptable blurred ratio visually. PSNR (Peak Signal-to-Noise Ratio) is a good measure of the difference between two images, but when the focal length changes the scene content changes as well, which makes it hard to evaluate the degree of blur by PSNR. Different combinations of aperture sizes and optical zooms have different DOFs, so tables must be calibrated for each condition. In future work we will focus on an objective evaluation of the acceptable blurred ratio. The developed software can be transferred to the family chips easily, and we will apply it to an ISP chip for further progress. Zoom tracking relies on a similar concept, and we will transfer the idea there as well.
References
1. Bayer, B.: Color imaging array. U.S. Patent No. 3,971,065 (1976)
2. Wallace, K.: The JPEG still picture compression standard. IEEE Trans. on Consumer Electronics (1991)
3. Chern, N.K., Neow, P.A., Ang Jr., M.H.: Practical issues in pixel-based autofocusing for machine vision. Int. Conf. on Robotics and Automation (2001) 2791-2796
4. He, J., et al.: Modified fast climbing search auto-focus algorithm with adaptive step size searching technique for digital camera. IEEE Trans. on Consumer Electronics, Vol. 49, No. 2 (2003) 257-262
5. Choi, K., Ko, S.: New autofocusing technique using the frequency selective weighted median filter for video cameras. IEEE Trans. on Consumer Electronics, Vol. 45, No. 3 (1999) 820-827
6. Lee, J.S., Jung, Y.Y., Kim, B.S., Ko, S.J.: An advanced video camera system with robust AF, AE, and AWB control. IEEE Trans. on Consumer Electronics, Vol. 47, No. 3 (2001) 694-699
7. Weintroub, J., Aronson, M.D., Cargill, E.B.: Method and apparatus for detecting optimum lens focus position. U.S. Patent No. 0,117,514 (2003)
8. Lee, J.H., et al.: Implementation of a passive automatic focusing algorithm for digital still camera. IEEE Trans. on Consumer Electronics, Vol. 41, No. 3 (1995) 449-454
9. Kehtarnavaz, N., Oh, H.J.: Development and real-time implementation of a rule-based auto-focus algorithm. Journal of Real-Time Imaging, Vol. 9, Issue 3 (2003) 197-203
A Fast Trapeziums-Based Method for Soft Shadow Volumes
Difei Lu1 and Xiuzi Ye1,2
1 College of Computer Science, Zhejiang University, China
2 SolidWorks Corporation, USA
Abstract. The shadow volume algorithm is one of the best choices for fast, high-quality shadows in multimedia applications. However, computing detailed soft shadows under area light sources remains one of the most difficult challenges in computer graphics. In this paper, we present a new, fast trapeziums-based algorithm for rendering soft shadows using a single shadow ray per shadow pixel. Compared to other soft shadow methods, our algorithm produces smooth, artifact-free soft shadow images while executing up to an order of magnitude faster. Our main contribution is a trapeziums-based method that quickly determines the proportion of the area light overlapped by occluders as seen from the shadow point to be shaded, rather than using sample points on the area light source. To speed up the calculation, a bounding box of the projected light source is used to associate potential silhouette edges with shadow points beforehand. We demonstrate results for various scenes, showing that detailed soft shadows can be generated at very high speed.
Keywords: shadow algorithms, visibility determination, soft shadow volume.
1 Introduction and Previous Work
Soft shadows result from the continuous variation of illumination across a receiving surface when the light source becomes partially occluded by other objects in the scene. Their appearance is mainly controlled by the shape and location of penumbra regions, the regions on a receiver where the light source is only partially visible. Soft shadows play a key role in the overall realism of computer-generated images. In this paper we present an extended soft shadow volume algorithm based on the earlier penumbra wedge-based methods [1, 2, 3, 4] to produce physically plausible shadows from planar area light sources. [5, 6] present excellent surveys of the vast literature on shadow algorithms, and we only briefly review some of the main approaches here, especially the few that create physically-based soft shadows from area light sources. Using penumbra wedges [7] is fundamentally more efficient than processing the entire occluders. Earlier approximate soft shadow volume methods targeted interactive performance based on graphics hardware. Ray tracing algorithms compute shadows by casting a ray between a scene point and the light source [8]. This method gives excellent results but is quite expensive, because each ray must in effect sample the scene for potential
occluders. [9] parameterizes rays with a spread angle using cones; this technique can approximate soft shadows by tracing a single cone from a surface point to an area light source. [1] describes a soft shadow volume algorithm that generalizes the earlier penumbra wedge-based methods to produce physically correct shadows from planar area light sources. The algorithm targets production-quality ray tracers and generates the same image as stochastic ray tracing, but substantially faster. The fundamental operation in soft shadow generation is determining the visibility between an area light source and a shadow point: the proportion of the visible light source determines the brightness of the shadow point. The algorithm in [1] approximates the area light source with a set of light samples and may therefore produce artifacts in the shadow image. We instead represent the area light source exactly with trapeziums to obtain smoother, artifact-free results. [10] uses a simple extension of ray tracing to create visually approximate shadows with little extra computation; it is interesting and useful, but the results are not very pleasing. Our method is also based on ray tracing and its results are better than those of [10]. Extending [1], we introduce bounding rectangles to quickly determine the set of edges whose projections from a shadow point onto the plane of the area light overlap the light source. We do not construct penumbra wedges for finding silhouette edges, which avoids creating complex wedge geometry. To speed up the calculation we store the potential silhouette edges in a grid data structure: each cell of the grid represents a shadow point and contains its related potential silhouette edges. This is done at the beginning of the calculation. After finding all potential silhouette edges for all shadow points, we check these edges to obtain the exact silhouette edges for every shadow point. Then, for a given shadow point P, we project its exact silhouette edges onto the plane of the light source, and from these projected silhouette edges we construct trapeziums to analyze the visibility of the light source. The details are presented in the following sections.
2 Obtaining Soft Shadows with Ray Tracing
In this section we present the basis of our soft shadow volume technique. The pipeline for computing soft shadows is adapted from [1]. There are six steps in the process: (1) finding all global potential silhouette edges with respect to the area light source; (2) using bounding rectangles to obtain local potential silhouette edges for each shadow point; (3) determining the exact silhouette edges from the result of (2) for each shadow point; (4) for a given shadow point, projecting its silhouette edges onto the plane of the light source and modifying the projected silhouette edges according to the rules detailed later; (5) splitting the projected silhouette edges according to the vertical coordinates of their end points and intersection points; (6) constructing trapeziums to determine the visibility of the light source. In the following paragraphs we explain each step in detail. First of all, the global potential silhouette edges are found. The method is very similar to that described in [1]; it is simple and clear, as shown in Fig. 1.
Fig. 1. Choosing global silhouette edges with respect to a designated area light source (the blue polygon). The red line segment is an edge; the two triangles connected to it lie on plane1 and plane2, respectively. If the area light and v1 are on the same side of plane1 and the area light and v2 are on the same side of plane2, the edge is not a silhouette edge; if the area light and v1 are on different sides of plane1 and the area light and v2 are on different sides of plane2, the edge is also not a silhouette edge; otherwise the edge is a silhouette edge. If only a single triangle is connected to an edge, the edge is always considered a global potential silhouette edge.
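The classification rule stated in the caption can be written compactly. The sketch below is an illustration under an assumed data layout (each adjacent triangle given by a plane point, a plane normal and the opposite vertex used in the test); it is not code from the paper.

```python
import numpy as np

def side(plane_point, normal, p):
    """Signed distance of point p from the plane (positive, negative or zero)."""
    return float(np.dot(normal, np.asarray(p) - np.asarray(plane_point)))

def all_same_side(plane_point, normal, light_verts, v):
    sv = side(plane_point, normal, v)
    return all(side(plane_point, normal, l) * sv > 0 for l in light_verts)

def all_opposite_side(plane_point, normal, light_verts, v):
    sv = side(plane_point, normal, v)
    return all(side(plane_point, normal, l) * sv < 0 for l in light_verts)

def is_global_potential_silhouette(adjacent, light_verts):
    """Classify one edge following the rule in the caption of Fig. 1.

    adjacent -- list of one or two tuples (plane_point, plane_normal, opposite_vertex);
                the tuple layout is an assumption made for this sketch.
    """
    if len(adjacent) == 1:               # boundary edge: always a potential silhouette
        return True
    (p1, n1, v1), (p2, n2, v2) = adjacent
    same1 = all_same_side(p1, n1, light_verts, v1)
    same2 = all_same_side(p2, n2, light_verts, v2)
    diff1 = all_opposite_side(p1, n1, light_verts, v1)
    diff2 = all_opposite_side(p2, n2, light_verts, v2)
    # both "same side" or both "different side"  ->  not a silhouette edge
    if (same1 and same2) or (diff1 and diff2):
        return False
    return True
```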
Fig. 2. The first picture illustrates the construction of the bounding rectangle of the projected vertices of the light source. The second picture shows how to determine whether the light source is visible from a scene point P.
For a given shadow point P, only a small proportion of the global potential silhouette edges are actual silhouette edges. To quickly find the relevant silhouette edges for P, we store all global potential silhouette edges in a grid data structure according to the bounding rectangle of the projected light source on the scene plane. Each shadow point maintains a list of its relevant silhouette edges. For each global potential silhouette edge we construct a bounding rectangle from the projected vertices of the light source, as shown in the first picture of Fig. 2, and add the edge to the lists of all shadow points lying inside the rectangle. After all global potential silhouette edges have been enumerated, the list of each shadow point P contains all local potential silhouette edges relevant to it. We now turn to the soft shadow query for each shadow point. The query is issued for a given shadow point P and determines what proportion of the light source is visible from P. The first step of the query is to enumerate the edges in the list of P and determine which silhouette edges intersect the light source as seen from P (second picture of Fig. 2). Then we test whether these edges are silhouette edges from P, using the method described in Fig. 1 [2, 11].
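A minimal sketch of this binning step is shown below, assuming edges are stored as pairs of 3D endpoints and using a hypothetical `project_to_scene_plane` helper for the projection of a light vertex through an edge endpoint onto the receiver plane; neither the data layout nor the helper name comes from the paper.

```python
from collections import defaultdict

def bin_edges_to_shadow_points(global_edges, light_verts, shadow_points,
                               project_to_scene_plane):
    """Attach each global potential silhouette edge to the shadow points it may affect.

    global_edges            -- list of edges, each a pair of 3D endpoints (assumption)
    project_to_scene_plane  -- hypothetical helper projecting a light vertex through
                               an edge endpoint onto the receiver plane, returning (x, y)
    """
    lists = defaultdict(list)                     # shadow-point index -> edge list
    for edge in global_edges:
        xs, ys = [], []
        for lv in light_verts:                    # bounding rectangle of the projection
            for endpoint in edge:
                x, y = project_to_scene_plane(lv, endpoint)
                xs.append(x)
                ys.append(y)
        xmin, xmax = min(xs), max(xs)
        ymin, ymax = min(ys), max(ys)
        for i, (px, py) in enumerate(shadow_points):
            if xmin <= px <= xmax and ymin <= py <= ymax:
                lists[i].append(edge)             # edge is relevant to this shadow point
    return lists
```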
Fig. 3. Projecting silhouette edges onto the light source plane from P. Only edges that lie inside the light source or intersect it are used to calculate soft shadow volumes.
After collecting the list of exact silhouette edges from P, we project all these silhouette edges onto the light source plane from P (Fig. 3). The projected silhouette edges are oriented according to the relative position of the triangle containing the edge: the orientation of a projected silhouette edge is ascent if the triangle lies on the right of the edge, and descent otherwise. As illustrated in Fig. 3, the orientation of edge V1V2 is ascent and that of V2V3 is descent. The orientation property of a silhouette edge is essential when calculating the visibility of the light source. Some projected silhouette edges must be
modified if they intersect the left side of the light source (Fig. 4). Fig. 4 shows the six cases in which silhouette edges need to be modified. The first four cases are described in [1]; we extend the rules of [1] with Fig. 4e and Fig. 4f.
Fig. 4. The green rectangles represent the area light source and the red lines are silhouette edges. If a silhouette edge intersects the left side of the light source, it must be modified. The first row shows the original edge and the second row the modified edge. Cases a, b, c and d break the original edge into two edges. Cases e and f show how to deal with horizontal edges: the yellow lines and the horizontal edge form a projected triangle of the occluder; in e the triangle is above the edge and the converted edge lies above the original one, while in f the triangle is below the edge and the converted edge lies below the original one. The properties of the changed edges remain the same as the originals. In a, b, c and d a vertical silhouette edge is added; in e and f the original horizontal edges are discarded.
Now we split all silhouette edges at the intersection points and end points of the silhouette edges in the vertical coordinate. Fig. 5 shows the splitting process: in the second picture of Fig. 5, edges are broken at the red points, and the broken edges keep the same orientation property as the original ones. After splitting, the light source is partitioned into many small trapeziums (Fig. 6), each with a unique visibility property.
Fig. 5. Splitting silhouette edges at the end points and intersection points of all edges in the vertical coordinate. The first picture shows the original edges and the second picture the red break points.
From the definition of the orientation property, an ascent silhouette edge marks the beginning of a region of the light source that is occluded, and a descent silhouette edge marks its end. Let the number of surfaces that occlude the light source be the depth number [1]. We mark all trapeziums with depth numbers; each trapezium has a depth number indicating its visibility status. When calculating depth numbers, the initial value for each line is 0 (see Fig. 6). We then calculate the depth numbers from left to right: the depth number of a trapezium equals the previous depth number plus an integer T when the left side of the trapezium is an ascent silhouette edge; otherwise T is subtracted from the previous depth number. T is 1 if only a single triangle is connected to the silhouette edge, and 2 otherwise. In Fig. 6, all depth numbers are calculated for the case T = 1. Special care must be taken to account for silhouette edges missing outside the light source: projected silhouette edges lying to the left of the light source are omitted, and the methods shown in Fig. 4 handle these cases. Consequently, the depth number of a trapezium is a relative value, and the trapeziums with the lowest depth number may still be invisible from the shadow point because those omitted edges are not counted. We must therefore test whether these trapeziums are visible: we cast a single reference ray from the shadow point P to the barycenter of one of them and count the number of occluder surfaces it intersects, yielding the actual depth number of these areas. If the actual depth number is greater than 0, the whole light source is invisible. Otherwise, the visibility rate of the light source is
R = 1 − Σ St / Sl
where R is the visibility rate for the shadow point, St is the area of a trapezium with the lowest depth number and Sl is the area of the whole light source.
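The left-to-right depth-number sweep and the visibility rate can be sketched as follows; the dictionary layout of a trapezium is an assumption made for this illustration, and the reference-ray visibility test is omitted.

```python
def strip_depth_numbers(trapeziums):
    """Assign relative depth numbers within one horizontal strip, left to right.

    Each trapezium is a dict (layout assumed for this sketch) with:
      'left_orient'   -- 'ascent' or 'descent', orientation of its left silhouette edge
      'boundary_tris' -- 1 if that edge borders a single triangle, else 2
      'area'          -- trapezium area
    The region left of the first edge has depth 0 (a relative value).
    """
    depth = 0
    for t in trapeziums:
        step = 1 if t['boundary_tris'] == 1 else 2      # the integer T of the text
        depth += step if t['left_orient'] == 'ascent' else -step
        t['depth'] = depth
    return trapeziums

def visibility_rate(trapeziums, light_area):
    """R = 1 - sum(S_t) / S_l, with S_t taken over the trapeziums carrying the
    lowest relative depth number, as stated in the text."""
    lowest = min(t['depth'] for t in trapeziums)
    s_t = sum(t['area'] for t in trapeziums if t['depth'] == lowest)
    return 1.0 - s_t / light_area
```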
Fig. 6. Trapeziums are constructed according to the orientation property of the silhouette edges. The depth number of each trapezium is calculated, and trapeziums with different depth numbers are drawn in different colors. In this figure T is always 1.
Using the method described above we can enumerate all shadow points, obtain their visibility rates of the light source, and then draw the soft shadow quickly. The rendering resolution of the examples in this paper is 200×200. We interpolate the visibility rate between two neighboring shadow points linearly.
3 Results and Conclusions
In this section, we consider several examples of soft shadows produced using our prototype. Our prototype is implemented in Visual C++. All results were generated on a PC with a P4 2.66 GHz CPU, 1 GB RAM and the Windows OS. Figs. 7, 8, 9 and 10 show the major strength of our algorithm, namely that soft shadows can be cast on a complex planar shadow receiver. As can be seen, the rendered images exhibit the typical characteristics of soft shadows: the shadows are softer the farther the occluder is from the receiver, and harder where the occluder is near the receiver.
Fig. 7. Soft shadows of a cube and an apple cast onto a white floor
Fig. 8. Soft shadows of a cube and a box blended with a complex background
Fig. 9. More complex meshes (bishop and bird) cast soft shadows on different natural scenes
Fig. 10. A large dog mesh is used to test the strength of the method. A flower is placed naturally on a lawn.

Table 1. Performance results for the examples (Figs. 7-10) with a scene resolution of 200×200. All computations performed by our algorithm are included in the timings. The times (in seconds) were obtained on a PC with 1 GB RAM and a 2.66 GHz CPU.

Mesh     # of vertices   # of triangles   Distance to light (cm)   Size of light (cm)   Time (s)
bishop        250              496                 50                   5.0×5.0           1.011
apple         891             1704                 50                   3.0×3.0           2.012
flower        242              401                100                   1.0×1.0           0.934
bird         1155             2246                 30                   5.0×5.0           3.234
cube            8               12                 20                   8.0×8.0           0.005
dog          6650            13176                 50                   3.0×3.0          18.126
box           120              228                 30                  10.0×10.0          0.082
Our new algorithm offers a significant performance improvement in a variety of test scenes. The proposed method in this paper is suitable for computing shadows from very
large light sources because it does not require additional light samples to limit the noise to an acceptable level, as the method in [1] does. We compared our method with the method presented in [1]: ours is faster at the same result quality. While our research takes a small step down a new path, several limitations remain as future work. The most conspicuous limitation of our technique is that, for large scene resolutions, our method becomes as expensive as other soft shadow algorithms and requires smarter strategies to overcome this. As future work, we would like to investigate better sampling strategies for the scene.
Acknowledgements
The authors would like to thank the support from the China NSF under grant #60473106, the China Ministry of Education under grant #20030335064, and the China Ministry of Science and Technology under grant #2003AA4Z3120.
References
1. Laine, S., Aila, T., Assarsson, U., Lehtinen, J., Akenine-Möller, T.: Soft shadow volumes for ray tracing. ACM Trans. Graph. 24, 3 (Jul. 2005) 1156-1165
2. Brabec, S., Seidel, H.-P.: Shadow volumes on programmable graphics hardware. In: Proceedings of Eurographics (2003), Vol. 22, 433-440
3. Assarsson, U., Akenine-Möller, T.: A geometry-based soft shadow volume algorithm using graphics hardware. ACM Trans. Graph. 22, 3 (Jul. 2003) 511-520
4. Akenine-Möller, T., Assarsson, U.: Approximate soft shadows on arbitrary surfaces using penumbra wedges. In: Proceedings of the 13th Eurographics Workshop on Rendering (Pisa, Italy, June 26-28, 2002). Gibson, S., Debevec, P. (Eds.), ACM International Conference Proceeding Series, Vol. 28. Eurographics Association, Aire-la-Ville, Switzerland, 297-306
5. Woo, A., Poulin, P., Fournier, A.: A survey of shadow algorithms. IEEE Computer Graphics and Applications 10(6) (Nov. 1990) 13-32
6. Hasenfratz, J.-M., Lapierre, M., Holzschuch, N., Sillion, F.: A survey of real-time soft shadows algorithms. Computer Graphics Forum 22, 4 (2003) 753-774
7. Akenine-Möller, T., Assarsson, U.: Approximate soft shadows on arbitrary surfaces using penumbra wedges. In: 13th Eurographics Workshop on Rendering, Eurographics (2002) 297-305
8. Whitted, T.: An improved illumination model for shaded display. Communications of the ACM 23 (1980) 343-349
9. Amanatides, J.: Ray tracing with cones. In: Computer Graphics (Proceedings of ACM SIGGRAPH 84), ACM Press (1984) 129-135
10. Brabec, S., Seidel, H.-P.: Single sample soft shadows using depth maps. In: Graphics Interface (2002)
11. McGuire, M.: Observations on silhouette sizes. Journal of Graphics Tools 9, 1 (2004) 1-12
Photo-Consistent Motion Blur Modeling for Realistic Image Synthesis
Huei-Yung Lin and Chia-Hong Chang
Department of Electrical Engineering, National Chung Cheng University, 168 University Rd., Min-Hsiung, Chia-Yi 621, Taiwan, R.O.C.
[email protected], [email protected]

Abstract. Motion blur is an important visual cue for the illusion of object motion. It has many applications in computer animation, virtual reality and augmented reality. In this work, we present a nonlinear imaging model for synthetic motion blur generation. It is shown that the intensity response of the image sensor is determined by the optical parameters of the camera and can be derived by a simple photometric calibration process. Based on the nonlinear behavior of the image intensity response, photo-realistic motion blur can be obtained and combined with real scenes with least visual inconsistency. Experiments have shown that the proposed method generates more photo-consistent results than the conventional motion blur model.
1 Introduction
In the past few years, we have witnessed the convergence of computer vision and computer graphics [1]. Although traditionally regarded as inverse problems of each other, image-based rendering and modeling share the common ground of these two research fields. One of the major topics is how to synthesize computer images for graphics representations based on the knowledge of a given vision model. This problem commonly arises in the application domains of computer animation, virtual reality and augmented reality [2]. For computer animation and virtual reality, synthetic images are generated from existing graphical models for rendering purposes. Augmented reality, on the other hand, requires the composition of virtual objects and real scenes in a natural way. The ultimate goal of these applications is usually to make the synthetic images look as realistic as those actually filmed by cameras. Most of the previous research on rendering synthetic objects into real scenes deals with static image composition, even when used for generating synthetic video sequences. When modeling a scene containing an object that moves quickly during the finite camera exposure time, it is not possible to insert the object directly into the scene by simple image overlay. In addition to the geometric and photometric consistency imposed on the object for the given viewpoint, motion blur or temporal aliasing due to the relative motion between the camera and the scene usually has to be taken into account. It is a very important visual cue to human
perception for the illusion of object motion, and is commonly used in photography to illustrate the dynamic features of a scene. For computer-generated or stop-motion animations with a limited temporal sampling rate, unpleasant effects such as jerky or strobing appearance may be present in the image sequence if motion blur is not modeled appropriately. Early research on the simulation of motion blur suggested convolving the original image with the linear optical system transfer function derived from the motion path [3,4]. The uniform point spread function (PSF) was demonstrated in that work, and high-degree resampling filters were later adopted to further improve the results of temporal anti-aliasing [5]. More recently, Sung et al. introduced visibility and shading functions in the spatial-temporal domain for motion blur image generation [6]. Brostow and Essa proposed a frame-to-frame motion tracking approach to simulate motion blur for stop motion animation [7]. Apart from the generation of realistic motion blur, some researchers have focused on real-time rendering using hardware acceleration for interactive graphics applications [8,9]. Although those results are smooth and visually consistent, they are only approximations due to the oversimplified image formation model. It is commonly believed that the image acquisition process can be approximated by a linear system, and that motion blur can thus be obtained from convolution with a given PSF. However, the nonlinear behavior of image sensors becomes prominent when the light source changes rapidly during the exposure time [10]. In this case, the conventional method using a simple box filter cannot create a photo-realistic or photo-consistent motion blur phenomenon. This might not be a problem in purely computer-generated animation, but the inconsistency will certainly be noticeable when combining virtual objects with real scenes. Thus, in this work we propose a nonlinear imaging model for synthetic motion blur generation. The image formation is modified to incorporate a nonlinear intensity response function, and more photo-consistent simulation results are obtained by using the calibrated parameters of the given camera settings.
2 Image Formation Model
The process of image formation can be determined by the optical parameters of the lens, geometric parameters of the camera projection model, and photometric parameters associated with the environment and the CCD image sensor. To synthesize an image from the same viewpoint of the real scene image, however, only the photometric aspect of image formation has to be considered. From basic radiometry, the relationship between scene radiance L and image irradiance E is given by

E = L (π/4) (d/f)^2 cos^4(α)    (1)

where d, f and α are the aperture diameter, focal length and the angle between the optical axis and the line of sight, respectively [11]. Since the image
intensity is commonly used to represent the image irradiance, it is in turn assumed proportional to the scene radiance for a given set of camera parameters. Thus, most existing algorithms adopt a simple pinhole camera model for synthetic image generation of real scenes. Linear motion blur is generated by convolving the original image with a box filter or a uniform PSF [3]. Although the image synthesis or composition are relatively easy to implement based on the above image formation, the results are usually not satisfactory when compared to the real images captured by a camera. Consequently, photo-realistic scene modeling cannot be accomplished by this simplified imaging model. One major issue which is not explicitly considered in the previous approach is the nonlinear behavior of the image sensors. It is commonly assumed that the image intensity increases linearly with the camera exposure time for any given scene point. However, nonlinear sensors are generally designed to have the output voltage proportional to the log of the light energy for high dynamic range imaging [12,13]. Furthermore, the intensity response function of the image sensors is also affected by the F-number of the camera from our observation. To illustrate this phenomenon, an image printout with white, gray and black stripes is used as a test pattern. Image intensity values under different camera exposures are calibrated for various F-number settings. The plots of intensity value versus exposure time for both the black and gray image stripes are shown in Figure 1. The figures demonstrate that, prior to saturation, the intensity values increase nonlinearly with the exposure times. Although the nonlinear behaviors are not severe for large F-numbers (i.e., small aperture diameters), they are conspicuous for smaller F-numbers. Another important observation is that, even with different scene radiance, the intensity response curves for the black and gray patterns are very similar if the time axis is scaled by a constant. Figure 2 shows the intensity response curves for several F-numbers normalized with respect to the gray and black image patterns. The results suggest that the intensity values of a scene point under different exposure time are governed by the F-number. To establish a more realistic image formation model from the above observations, a monotonically increasing function with nonlinear behavior determined by additional parameters should be adopted. Since the intensity response curves shown in Figure 1 cannot be easily fitted by gamma or log functions with various F-numbers, we model the intensity accumulation versus exposure using an operation similar to the capacitor charging process. The intensity value of an image pixel I(t) is modeled as an inverse exponential function of the integration time t, given by

I(t) = Imax (1 − e^(−k d^2 ρ t))   for 0 ≤ t ≤ T    (2)
where T, Imax, d, k and ρ are the exposure time, maximum intensity, aperture diameter, a camera constant, and a parameter related to the object surface's reflectance property, respectively. (The intensity response of the white image pattern is not shown in Figure 1 because it saturates within a small exposure range.)
Fig. 1. Nonlinear behavior of the intensity response curves with different F-numbers (intensity versus exposure in ms, for F = 2.4, 5, 7.1, 9 and 11). (a) Intensity versus exposure time for the black (left) and gray (right) patterns. (b) A smaller exposure range clearly shows the nonlinear behavior of the intensity.
If all the parameters in Eq. (2) are known, then it is possible to determine the intensity value of the image pixel for any exposure time less than T. For a general 8-bit greyscale image, the maximum intensity Imax is 255 and I(t) is always less than Imax. The aperture diameter d is the focal length divided by the F-number, and can be obtained from the camera settings. The parameters k and ρ are constants for any fixed scene point in the image. Thus, Eq. (2) can be rewritten as
I(t) = Imax (1 − e^(−k t))    (3)
for a given set of camera parameters. The only parameter k can then be determined by an appropriate calibration procedure with different camera settings. To verify Eq. (3), we first observe that I(0) = 0, as expected for any camera settings. The intensity value saturates as t → ∞, and the larger the parameter k is, the faster the saturation occurs. This is consistent with the physical model: k contains the reflectance of the scene point and thus represents the irradiance of the image point. Figures 1 and 2 illustrate that the nonlinear responses are not noticeable for small apertures but are evident for large aperture sizes. In either case, the response function can be modeled by Eq. (3) with some constant k.
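For reference, a minimal sketch of Eq. (3) and its inverse is given below; it will also be used for the blur model of Section 3. The value of k is whichever constant was fitted for the camera setting at hand (e.g. about 0.000015 at F-5 with exposure in ms, per Fig. 2), and an 8-bit intensity range is assumed.

```python
import numpy as np

I_MAX = 255.0   # maximum intensity of an 8-bit image

def intensity_response(t, k):
    """Eq. (3): intensity accumulated after (normalized) integration time t."""
    return I_MAX * (1.0 - np.exp(-k * t))

def inverse_response(intensity, k):
    """Inverse of Eq. (3): the integration time that produces a given intensity.
    Values are clipped just below I_MAX so the logarithm stays finite."""
    i = np.clip(intensity, 0.0, I_MAX - 1e-6)
    return -np.log(1.0 - i / I_MAX) / k
```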
Fig. 2. Normalized intensity response curves for different F-numbers, plotting intensity versus exposure (ms) for the black and gray patterns together with the fitted exponential curves: (a) F-2.4, k = 0.000014; (b) F-5, k = 0.000015; (c) F-7.1, k = 0.0000166; (d) F-11, k = 0.000019.
Thus, the most important aspect of the equation is to characterize the image intensity accumulation versus integration time for fixed camera parameters. For a given intensity value, it is not possible to determine the exposure time, since the image irradiance also depends on the object's reflectance property. However, it is possible to calculate the image intensity of a scene point under any exposure if an intensity-exposure pair is given and the normalized response curve is known for the specific camera parameter settings. This is one of the requirements for generating space-variant motion blur as described in the following section. To obtain the normalized intensity response function, up to an unknown scale factor in the time domain, images of the calibration patterns are captured under various exposures, followed by least-squares fitting to find the parameter k for different F-numbers. As shown in Figure 2, the resulting fitted curves (black dashed lines) for any given F-number provide a good approximation to the actual intensity measurements for both the black and gray patterns. This curve fitting and parameter estimation process can be referred to as photometric calibration of the intensity response function. It should be noted that only the shape of the intensity response curve is significant; the resulting function is normalized in the time axis by an arbitrary scale factor. Given the intensity value of an
image pixel with known camera exposure, the corresponding scene point under a different amount of exposure can be calculated by Eq. (3).
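One simple way to carry out this photometric calibration is to linearize Eq. (3) and fit k by least squares; the sketch below follows that idea and is not the authors' fitting code. The sample numbers in the trailing comment are purely illustrative.

```python
import numpy as np

def fit_k(exposures_ms, intensities, i_max=255.0):
    """Least-squares estimate of k in Eq. (3) from (exposure, intensity) samples.

    Linearizing Eq. (3) gives  -ln(1 - I / I_max) = k * t,  so k is the slope of a
    line through the origin.  Saturated samples are discarded before fitting.
    """
    t = np.asarray(exposures_ms, dtype=float)
    i = np.asarray(intensities, dtype=float)
    keep = i < 0.98 * i_max                       # drop saturated measurements
    t, i = t[keep], i[keep]
    y = -np.log(1.0 - i / i_max)
    return float(np.sum(t * y) / np.sum(t * t))   # slope through the origin

# example (illustrative numbers only): fit_k([100, 333, 1000, 2000], [40, 110, 200, 240])
```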
3 Synthetic Motion Blur Image Generation
Motion blur arises when the relative motion between the scene and the camera is fast during the exposure time of the imaging process. The most commonly used model for motion blur is given by
g(x, y) = ∫_0^T f(x − x0(t), y − y0(t)) dt    (4)
where g(x, y) and f(x, y) are the blurred and ideal images, respectively, T is the duration of the exposure, and x0(t) and y0(t) are the time-varying components of motion in the x and y directions [3]. If only uniform linear motion in the x-direction is considered, the motion blurred image can be generated by averaging the line integral along the motion direction. That is,

g(x, y) = (1/R) ∫_0^R f(x − ρ, y) dρ    (5)
where R is the extent of the motion blur. Eq. (5) essentially describes the blurred image as the convolution of the original (ideal) image with a uniform PSF

h(x, y) = 1/R for |x| ≤ R/2, and 0 otherwise    (6)

This model is de facto the most widely adopted method for generating motion blur images. Its discrete counterpart used for computation is given by

g[m, n] = (1/K) Σ_{i=0}^{K−1} f[m − i, n]    (7)
where K is the number of blurred pixels. As an example of using the above image degradation model, motion blur of an ideal step edge can be obtained by performing a spatial-domain convolution with the PSF given by Eq. (6); the synthetic result is a ramp edge with the width of the motion blur extent R. If this motion blur model is applied to a real edge image, however, the result generally differs from the recorded motion blur image. Figures 3(a), 3(b) and 3(c) illustrate the images and intensity profiles of an ideal step edge, a motion blur edge created using Eq. (7) and a real motion blur edge captured by a camera, respectively. As shown in Figure 3(c), the intensity profile indicates that there is non-uniform weighting of the pixel intensities in real motion blur. Since the curve is not symmetric with respect to its midpoint, this nonlinear response is clearly not due to the optical defocus of the camera and cannot be described by a Gaussian process.
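For comparison, the conventional model of Eq. (7) can be sketched in a few lines; the wrap-around boundary handling and the choice of the column axis as the motion direction are simplifications made for this illustration and are not specified in the text.

```python
import numpy as np

def uniform_motion_blur(image, K):
    """Eq. (7): horizontal motion blur with a uniform PSF of K pixels.

    image -- 2D grayscale array; m in f[m - i, n] is taken here as the column
             index, and boundary pixels wrap around (np.roll).
    """
    out = np.zeros(image.shape, dtype=float)
    for i in range(K):
        out += np.roll(image.astype(float), i, axis=1)   # accumulate f[m - i, n]
    return out / K
```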
Fig. 3. Motion blur synthesis of an ideal edge image (each panel shows the edge image and the corresponding intensity profile): (a) ideal step edge; (b) motion blur edge generated using Eq. (7); (c) real motion blur edge captured by a camera; (d) motion blur edge synthesized by the proposed method.
In this work, motion blur is modeled using the nonlinear intensity response of the image sensor discussed in the previous section. For an image position under uniform motion blur, the intensity value is given by the integration of the image irradiance associated with different scene points during the exposure time; every scene point on the motion path thus contributes intensity for a smaller but equal exposure time. Although these partial intensity values can be derived from the static, fully exposed image by linear interpolation in the time domain, the nonlinear behavior of the intensity response must also be taken into account. Suppose the monotonic response function is I(t); then the motion blur image g(x, y) is given by

g(x, y) = I( (1/R) ∫_0^R I^(-1)(f(x − ρ, y)) dρ )    (8)

where R is the motion blur extent and I^(-1)(·) is the inverse function of I(t). The discrete counterpart of Eq. (8) is

g[m, n] = I( (1/K) Σ_{i=0}^{K−1} I^(-1)(f[m − i, n]) )    (9)

where K is the number of blurred pixels. If we consider the special case that the intensity response function I(t) is linear, then Eqs. (8) and (9) reduce to Eqs. (5) and (7), respectively. Figure 3(d) shows the synthetic motion blur edge of Figure 3(a) and the corresponding intensity profile of the image scanlines generated using Eq. (9); the intensity response function I(t) is given by Eq. (3) with F-5 and k = 0.000015. Comparing the generated images and intensity profiles with the real motion blur, the proposed model clearly gives more photo-consistent results than the one synthesized using a uniform PSF. The fact that brighter scene points contribute more intensity to the image pixels, as shown in Figure 3(c), is successfully modeled by the nonlinear response curve.
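A compact sketch of Eq. (9), with the response model of Eq. (3) inlined, is shown below; as in the earlier uniform-PSF sketch, the boundary handling and axis convention are simplifications, and i_max = 255 assumes an 8-bit image.

```python
import numpy as np

def nonlinear_motion_blur(image, K, k, i_max=255.0):
    """Eq. (9): motion blur computed through the nonlinear response of Eq. (3)."""
    resp = lambda t: i_max * (1.0 - np.exp(-k * t))                        # Eq. (3)
    inv = lambda v: -np.log(1.0 - np.clip(v, 0, i_max - 1e-6) / i_max) / k # its inverse
    exposure = inv(image.astype(float))                 # I^-1(f[m, n])
    acc = np.zeros(exposure.shape, dtype=float)
    for i in range(K):
        acc += np.roll(exposure, i, axis=1)             # sum of I^-1(f[m - i, n])
    return resp(acc / K)                                # I((1/K) * sum), Eq. (9)
```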
4 Results
Figure 4 shows the experimental results for a real scene. The camera is placed about 1 meter in front of the object (a tennis ball). The static image shown in Figure 4(a) is taken at F-5 with an exposure time of 1/8 second. Figure 4(b) shows the motion blur image taken under a 300 mm/sec lateral motion of the camera with the same camera parameters. The blur extent in the image is 180 pixels, which is used for synthetic motion blur image generation. Figure 4(c) illustrates the motion blur synthesized using the widely adopted uniform PSF for image convolution, and the result generated using the proposed nonlinear intensity response function is shown in Figure 4(d). For the color images, the red, green and blue channels are processed separately: motion blur images are first created for each channel using the same intensity response curve and then combined to form the final result.
Fig. 4. Experimental results of a real scene: (a) static image; (b) real motion blur image; (c) motion blur generated using Eq. (7); (d) motion blur by the proposed method.
Fig. 5. Motion blur generated with small F-number (large aperture size)
Careful examination of Figure 4 shows that the image in Figure 4(d) is slightly better than Figure 4(c): the image scanline intensity profiles of Figure 4(d) are very close to those exhibited in the real motion blur image. Figure 5 (left) shows another example taken at F-2.4 with an exposure time of 1/8 second; the middle and right figures are the results using Eq. (7) and the proposed method, respectively. It is clear that the nonlinear behavior becomes prominent and has to be considered for more realistic motion blur synthesis.
5 Conclusion
Image synthesis and composition with motion blur have many applications in computer graphics and visualization. Most existing works generate motion blur by convolving the image with a uniform PSF; the results are usually not photo-consistent due to the nonlinear behavior of the image sensors. In this work, we have presented a nonlinear imaging model for synthetic motion blur generation. More photo-realistic motion blur can be obtained and combined with real scenes with minimal visual inconsistency. Our approach can thus be used to illustrate dynamic motion in still images, or to render fast object motion with a limited frame rate for computer animation.
References
1. Lengyel, J.: The convergence of graphics and vision. Computer 31(7) (1998) 46-53
2. Kutulakos, K.N., Vallino, J.R.: Calibration-free augmented reality. IEEE Transactions on Visualization and Computer Graphics 4(1) (1998) 1-20
3. Potmesil, M., Chakravarty, I.: Modeling motion blur in computer-generated images. In: Proceedings of the 10th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press (1983) 389-399
4. Max, N.L., Lerner, D.M.: A two-and-a-half-d motion-blur algorithm. In: SIGGRAPH '85: Proceedings of the 12th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press, New York (1985) 85-93
5. Dachille, F., Kaufman, A.: High-degree temporal antialiasing. In: CA '00: Proceedings of the Computer Animation (2000) 49-54
6. Sung, K., Pearce, A., Wang, C.: Spatial-temporal antialiasing. IEEE Transactions on Visualization and Computer Graphics 8(2) (2002) 144-153
7. Brostow, G., Essa, I.: Image-based motion blur for stop motion animation. In: SIGGRAPH 01 Conference Proceedings, ACM SIGGRAPH (2001) 561-566
8. Wloka, M.M., Zeleznik, R.C.: Interactive real-time motion blur. The Visual Computer 12(6) (1996) 283-295
9. Meinds, K., Stout, J., van Overveld, K.: Real-time temporal anti-aliasing for 3d graphics. In: Ertl, T. (ed.): VMV, Aka GmbH (2003) 337-344
10. Rush, A.: Nonlinear sensors impact digital imaging. Electronics Engineer (1998)
11. Forsyth, D., Ponce, J.: Computer Vision: A Modern Approach. Prentice-Hall (2003)
12. Debevec, P.E., Malik, J.: Recovering high dynamic range radiance maps from photographs. In: SIGGRAPH '97: Proceedings of the 24th Annual Conference on Computer Graphics and Interactive Techniques, ACM Press (1997) 369-378
13. Schanz, M., Nitta, C., Bussman, A., Hosticka, B.J., Wertheimer, R.K.: A high-dynamic-range CMOS image sensor for automotive applications. IEEE Journal of Solid-State Circuits 35(7) (2000) 932-938
Integration of GPS, GIS and Photogrammetry for Texture Mapping in Photo-Realistic City Modeling
Jiann-Yeou Rau1, Tee-Ann Teo2, Liang-Chien Chen3, Fuan Tsai4, Kuo-Hsin Hsiao5, and Wei-Chen Hsu5
1 Specialist, 2 Ph.D. Student, 3 Professor, 4 Assistant Professor, Center for Space and Remote Sensing Research, National Central University (NCU), Jhong-Li, Taiwan
{jyrau, ann, lcchen, ftsai}@csrsr.ncu.edu.tw
5 Researcher, Energy and Environment Lab., Industrial Technology Research Institute (ITRI), Chu-Tung, Taiwan
{HKS, ianhsu}@itri.org.tw
Abstract. Generating the textures of building facades from terrestrial photos is time-consuming work, and a more cost-effective way to create a photo-realistic city model containing thousands of buildings and facade textures is needed. This paper proposes an approach that integrates GPS, GIS and photogrammetry for multi-face texture mapping of three-dimensional building models. A GPS-integrated, high-resolution, non-metric, wide field-of-view digital camera is used. By combining the GIS graphic interface with the GPS information about camera location, a large quantity of pictures can be managed efficiently. A graphical user interface is designed for interactively solving the exterior orientation parameters. During texture mapping, lens distortion is corrected and self-occluded facades are compensated for automatically. The experimental results indicate that the proposed approach is efficient: a complex building can be treated in one process, multi-face texturing reduces the number of pictures required, and the large quantity of pictures can be managed efficiently through GPS-Photo-GIS integration.
Keywords: Photo-Realistic Modeling, Texture Mapping, City Model.
1 Introduction
The presentation of a photo-realistic 3D city model is of major interest in the field of geoinformatics. At the city scale, a large number of building models and terrestrial photos need to be treated. In the meantime, occlusion problems may be introduced by the buildings themselves (self-occlusion) or by objects in front of the buildings, such as cars, trees, street signs, etc. Thus, high production throughput, accurate geometric quality and occlusion-free texture generation are three of the major problems to solve.
1.1 Related Work
A number of algorithms for extracting 3D geometric building models have already been reported. These algorithms can be categorized as either automatic strategies [1] or semi-automatic approaches [2]. The representation of 3D building models is a generalization procedure: from a city-scale point of view, it is impractical to describe complex buildings using detailed geometrical descriptions, since a detailed sketch of the buildings including windows, ornaments, doors and so on would introduce a huge volume of data and cause problems in real applications. Such detailed structures can be omitted for far-field visualization, but in near-field applications they are important for realism and recognition. Facade texture provides complementary information for a generalized building model whose detailed geometric structures are lost. The texture gives a photo-realistic effect and conveys the visually detailed geometrical structure of a building. Such details are increasingly important for many simulation applications, such as urban planning, virtual tourism, the digital documentation of historic landmarks, and so on. Rooftop texture can be acquired from aerial photos by a true-orthophoto generation process [3]. However, this is not suitable for generating facade texture due to the large perspective distortion. Brenner and Haala [4] proposed an interactive scheme which utilizes projective transformation for the fast production of building facade textures. Haala [5] proposed using a high-resolution panoramic CCD line scanner to improve the efficiency of texture collection: since one picture can cover a large number of buildings, the overall effort is reduced, but the method requires an elevated camera station to minimize occlusions and an expensive rotating CCD scanner. Klinec and Fritsch [6] determined the camera's exterior orientation parameters by automatically searching for correspondences between building models and image features and used them in a photogrammetric space resection process; the occlusion-free facade texture can then be fused from multiple image views. Varshosaz [7] proposed an automatic approach utilizing a specially designed photo system with a servo-driven theodolite. The system takes images automatically in a step-wise manner at preset locations covering the complete field of view; the theodolite angles are used to index the images, so only one space resection has to be done for each station, and the final facade texture is a combination of images from different views.
1.2 Concept of the Paper
In this paper, we describe an efficient, high-quality method using a high-resolution, wide field-of-view (FOV), GPS-integrated digital camera for texture collection. The wide-FOV camera introduces significant lens distortion, so camera calibration is necessary before texture mapping when the photogrammetric back-projection is carried out. Most commercial 3D VR-GIS packages use only one texture to simulate all facades of a building, which results in merely a "virtual realistic" (VR) city model.
realistic” (VR) city model. In this paper, we adopt photogrammetric techniques that can generate different textures for different faces of a building. By this means multiface texturing can be apply for each building results in a “photo-realistic” city model. For the purpose of multi-face texture generation using single photo, the orientation parameters of the camera are required. In this work, it is assumed that polyhedral building models are ready for texture mapping. Thus, the building’s roof and foot corners can be utilized as ground control points in space resection for solving the orientation parameters. However, a traditional standard procedure for orientation modeling is tedious. A graphic user interface (GUI) was designed to reflect the orientation modeling result in real-time by the back-projection of the building’s wireframe model on the terrestrial image via the marking and moving of image control points. The operator can visually inspect the consistency between the wire-frame model and the buildings in the terrestrial image. Even a non-expert inexperienced operator can produce facade textures without a long training time. For the purpose of high production throughput, the buildings are captured in one photo as many as possible provided it can fulfill the required image spatial resolution, e.g. 20cm or 40cm per-pixel. The spatial resolution can be estimated from the image scale and the camera’s pixel size. During the texturing mapping stage, the visible faces are first draped with the corresponding texture as determined by photogrammetric back-projection. For occlusion area, it can be composite from different views of image [6], [7]. However some other issues may be raised, such as differences in illumination can cause grey value discontinuity, miss-registration can introduce a ghost effect, and production efficiency can be degraded, and so on. In this paper, the texture in the non-occluded area is directly created from the visible face. For a partially occluded area, the texture can be generated by filling its neighboring texture using the mirror operation. It is assumed that the building texture is repetitive or homogeneous. Since there is no need for multi-view image composition, the number of pictures can be reduced while still maintaining a good textural quality. Since the quantity of pictures captured for a city model is large, the process of searching for and managing the corresponding pictures for texture production is time consuming. Since the GPS equipment is integrated with the camera, the geographic location can be saved in the header of the picture. When integrated into a GIS environment, the pictures can be managed by their geographic location. We design an emulated GIS interface to mange pictures in a similar way. It is also integrated with the texturing mapping GUI, to assist the operator in their searching for and management of the massive number of terrestrial photos.
2 Methodology
In the following sections, a detailed description of camera calibration, multi-face texture mapping, and picture management by GPS-Photo-GIS integration is provided.
2.1 Camera Calibration
A high-resolution DSLR camera is used for the experimental investigation, i.e. a Nikon D2X with an AF-S DX 17-55 mm lens. The camera has a lens with a focal length from 17 mm to 55 mm and a 23.7 mm x 15.7 mm CMOS sensor. It provides a FOV ranging from 79° to 28° and a resolution of 4,288 x 2,848 pixels, which preserves rich textural information even when acquisition from a distance is required. The high-resolution, wide-FOV digital camera is suitable both for taking pictures of a large building from a short distance and for capturing many buildings from far away. Because the focal length is short, lens distortion is significant and has to be considered in the geometric correction process. At the same time, since one picture can contain many buildings, the number of pictures needed for texture generation is minimized. For camera calibration, PhotoModeler [8] is used to estimate the lens distortion parameters. Lens distortion includes both radial and decentering parts. PhotoModeler uses a well-designed calibration sheet, shown in Fig. 1, with a series of repetitive patterns as ground control points. At first, four ground control points with known distances are measured manually; the other control points are then matched automatically to increase measurement redundancy. A self-calibration bundle adjustment with additional lens distortion parameters is adopted. Since an accurate but CPU-intensive lens distortion correction model is not necessary for texture generation, we correct only the dominant radial distortion. A detailed description of the method can be found in [9].
Fig. 1. Camera calibration sheet
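As an illustration, the dominant radial correction can be applied per image coordinate with the common two-coefficient polynomial model; the coefficients k1 and k2 and the principal point are whatever the calibration reports, and this sketch is not the PhotoModeler API or the authors' exact formulation.

```python
def undistort_radial(x, y, cx, cy, k1, k2):
    """Remove dominant radial lens distortion from one image coordinate.

    (x, y)   -- measured (distorted) image coordinates in pixels
    (cx, cy) -- principal point; k1, k2 -- radial coefficients from calibration
    Uses the common polynomial model x_u = x + (x - cx) * (k1*r^2 + k2*r^4).
    """
    dx, dy = x - cx, y - cy
    r2 = dx * dx + dy * dy
    factor = k1 * r2 + k2 * r2 * r2
    return x + dx * factor, y + dy * factor
```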
2.2 Texture Mapping
In the following sections, the design of the GUI for texture mapping, the camera orientation modeling procedure, the multi-face texture generation procedure, and the occlusion compensation procedure are described.
2.2.1 Design of GUI
A graphical user interface for interactive texture mapping was developed on the Visual Studio .NET platform. The standard OpenGL graphics library is a favorable choice for 3D graphics programming because most commercial graphics cards have hardware implementations that accelerate texture rendering. An example of the designed GUI is illustrated in Fig. 2. The GUI has a 3D viewer to
display the 3D building models, to select the facades of interest, and to show the texture mapping results. It also provides a 2D viewer for displaying terrestrial photos, for marking image control points, and for visually inspecting the back-projected wire-frame models of the buildings on the image. The 2D viewer also serves the GPS-Photo-GIS integration function, which emulates a GIS environment and can be used to manage the massive number of terrestrial photos. In addition, a tablet frame is designed to list the corresponding ground control points, the image control points, and the camera's orientation parameters.
Fig. 2. The designed GUI for texture mapping (3D viewer, 2D viewer, back-projection results, control points list, and orientation parameters)
2.2.2 Control Point Marking and Orientation Modeling
In the 3D viewer, the operator can manually or automatically select more than one facade of interest for texture generation even if occlusions exist. The rooftop and footing corners of the selected facades are labeled and used as candidate ground control points. They are automatically indexed and inserted into the tablet frame. Some of these ground control points are visible in the terrestrial image, and the operator can mark their corresponding image control points in the 2D viewer. After four control points have been marked, the space resection process can be initiated. The orientation results are used to back-project the selected facades and plot their wire-frame models on the terrestrial image. The operator can then visually inspect them for consistency to confirm the correctness of the orientation modeling. If the orientation result is not correct, the operator moves the image control points and the space resection responds in real time. The orientation modeling is finished when the wire-frame models coincide with the corresponding features in the terrestrial image. Fig. 3 illustrates examples of correct and incorrect orientation modeling results. The left-hand column illustrates the texture mapping result, while the right-hand column demonstrates the visual inspection procedure. In this demonstration, three facades are selected and four control points are marked. The wire-frames of the selected facades are back-projected onto the terrestrial image based on the orientation modeling result.
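The wire-frame overlay described above amounts to projecting each 3D model vertex into the photo with the collinearity condition once the orientation has been estimated. The sketch below is a minimal version of that back-projection, not the paper's implementation; the rotation matrix, perspective-center position, and numeric values in the usage line are placeholders.

```python
import numpy as np

def backproject(points_3d, R, camera_center, focal_px, principal_point):
    """Project 3D wire-frame vertices into pixel coordinates (collinearity condition).

    Uses the computer-vision pinhole convention (camera looks along +Z);
    photogrammetric texts write the same relation with -f and a -Z viewing axis.
    """
    pts = np.asarray(points_3d, dtype=float)
    cam = (pts - np.asarray(camera_center, dtype=float)) @ np.asarray(R).T  # ground -> camera
    u = focal_px * cam[:, 0] / cam[:, 2] + principal_point[0]
    v = focal_px * cam[:, 1] / cam[:, 2] + principal_point[1]
    return np.stack([u, v], axis=1)

# Hypothetical usage: overlay one facade edge on a 4288 x 2848 photo
# (identity rotation and a 50 m stand-off are placeholder values).
R = np.eye(3)
corners = [[10.0, 2.0, 50.0], [10.0, 12.0, 50.0]]
print(backproject(corners, R, camera_center=[0.0, 0.0, 0.0],
                  focal_px=3076.0, principal_point=(2144.0, 1424.0)))
```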
2.2.3 Texture Generation and Occlusion Compensation
Once the orientation parameters have been determined, the textures of the selected facades can be generated simultaneously. The technique is the same as for orthophoto generation using the photogrammetric back-projection method, i.e., it utilizes the orientation parameters and the collinearity equations. However, facades may suffer from partial self-occlusion. Occlusion detection can be performed by the ZI-buffer technique similar to that used in true-orthophoto generation [3]. In this work, it is assumed that building facades contain repetitive patterns or are homogeneous in texture; it is estimated that about 90% of the building models processed satisfy this assumption. Hence, the hidden part can be filled with texture from neighboring parts using a mirror operation, which can be done autonomously after detection of the major hidden side. An example of self-occlusion is demonstrated in Fig. 4: facade AB is partially occluded by facade CD. The two facades can be projected to the image plane and their distances to the camera's perspective center can be calculated. The distance is recorded in the Z-buffer and an index map is adopted to indicate the hidden pixels. An index map is also used to store the generated texture.
Fig. 3. Demonstration of (upper row) incorrect and (lower row) correct orientation modelling
Fig. 4. Illustration of self-occlusion and mirror operation for compensation
As shown in the middle of Fig. 4, the hidden parts are denoted in black. The major hidden side is detected by searching for black pixels along the four sides; the side containing the largest percentage of black pixels is defined as the major hidden side. In addition, the width of the black pixels next to the major hidden side (e.g., w1 and w2 in Fig. 4) is estimated. The location of the mirror is determined by the largest of these widths (i.e., w1, since w1 is greater than w2). The right-hand photo in Fig. 4 depicts the texture mirroring result. Comparing the image before and after occlusion compensation shows that the effect of mirror texturing is acceptable if the facade texture has a repetitive or homogeneous pattern. Fig. 5 demonstrates the results of multi-face texture mapping by means of the proposed procedure. Except for trees in front of the buildings, which are difficult or impossible to remove, the experimental results confirm the fidelity of the proposed scheme. The above results are acceptable and useful for most geo-visualization applications. Moreover, since the number of pictures needed to texture a whole building is reduced, the efficiency is significantly improved.
Fig. 5. Example of multi-face texture mapping
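A minimal sketch of the mirror operation described above is given below: it finds the side of the rectified facade texture with the largest share of hidden (masked) boundary pixels and reflects valid texture across the edge of that hidden band. It is an illustrative reimplementation under simplified assumptions, not the authors' code; only the left/right cases are written out.

```python
import numpy as np

def mirror_fill(texture, hidden):
    """Fill occluded facade pixels by mirroring texture across the major hidden side.

    texture: (H, W, 3) rectified facade texture.
    hidden : (H, W) boolean mask, True where the facade is occluded.
    """
    h, w = hidden.shape
    share = {"left": hidden[:, 0].mean(), "right": hidden[:, -1].mean(),
             "top": hidden[0, :].mean(), "bottom": hidden[-1, :].mean()}
    major = max(share, key=share.get)            # side with the most hidden boundary pixels

    out = texture.copy()
    if major == "left":
        band = int(hidden.any(axis=0).argmin())  # width of the hidden band at the left edge
        for c in range(band):                    # mirror across the band boundary
            src = min(2 * band - 1 - c, w - 1)
            out[:, c][hidden[:, c]] = texture[:, src][hidden[:, c]]
    elif major == "right":
        band = int(hidden[:, ::-1].any(axis=0).argmin())
        for c in range(w - band, w):
            src = max(2 * (w - band) - 1 - c, 0)
            out[:, c][hidden[:, c]] = texture[:, src][hidden[:, c]]
    # (top/bottom are handled analogously by mirroring rows instead of columns)
    return out
```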
2.2.4 Picture Management by GPS-Photo-GIS Linking
In the experiments, we integrate the Nikon D2X camera with a Garmin Geko 301 GPS receiver. The images are stored in JPEG format with an EXIF header that can hold the GPS geographic information. In the designed GUI, as shown in Fig. 2, we create a GIS-like environment in the 2D viewer that superimposes the picture locations, given by their geographic coordinates, on an aerial orthophoto. However, the orthophoto is in the TWD67 coordinate system (i.e., Taiwan Datum 1967), while the GPS location information is recorded in WGS84 geographic coordinates (i.e., latitude/longitude). Before superimposition, the GPS coordinates therefore have to be transformed to the TWD67 coordinate system.
Fig. 6. Example of GPS-Photo-GIS integration for picture management
Fig. 6 illustrates an example of the overlay results. In the figure, the locations of the pictures are denoted by yellow triangles. They are linked to the original images on
the hard disk. The operator can thus double-click on a symbol to retrieve and display the corresponding terrestrial photo in the 2D viewer for control point marking. By means of the GPS-Photo-GIS integration, a large quantity of terrestrial pictures can be managed easily and efficiently, which is especially useful when an entire city model needs to be processed.
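As an illustration of this GPS-Photo-GIS link, the sketch below reads the EXIF GPS position of a JPEG and reprojects it for overlay on the orthophoto. The use of PIL and pyproj, and the EPSG code 3828 as the TWD67 TM2 definition, are assumptions made for the example; they are not tools or identifiers named in the paper.

```python
from PIL import Image, ExifTags
from pyproj import Transformer

def _to_degrees(dms, ref):
    # EXIF stores latitude/longitude as (degrees, minutes, seconds) rationals.
    deg = float(dms[0]) + float(dms[1]) / 60.0 + float(dms[2]) / 3600.0
    return -deg if ref in ("S", "W") else deg

def photo_location_twd67(jpeg_path):
    """Read the GPS position from a photo's EXIF header and project it to TWD67 TM2.

    Assumes EPSG:4326 for WGS84 lat/long and EPSG:3828 for TWD67 TM2 (zone 121).
    """
    exif = Image.open(jpeg_path)._getexif() or {}
    gps_raw = exif.get(34853, {})                      # 34853 = GPSInfo tag
    gps = {ExifTags.GPSTAGS.get(k, k): v for k, v in gps_raw.items()}
    lat = _to_degrees(gps["GPSLatitude"], gps["GPSLatitudeRef"])
    lon = _to_degrees(gps["GPSLongitude"], gps["GPSLongitudeRef"])
    transformer = Transformer.from_crs("EPSG:4326", "EPSG:3828", always_xy=True)
    easting, northing = transformer.transform(lon, lat)
    return easting, northing                            # meters, for orthophoto overlay

# Usage with a hypothetical file name:
# x, y = photo_location_twd67("facade_0001.jpg")
```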
3 Case Study
3.1 Camera Calibration
In the experiments, photos from eight different views are acquired using the shortest focal length, i.e., 17 mm; one of them is shown in Fig. 1. The total lens distortion vectors, including radial and decentering distortion, are illustrated in Fig. 7, and the maximum radial, decentering, and total distortions for a focal length of 17 mm are summarized in Table 1. From Table 1 and Fig. 7, we notice that the total lens distortion reaches about 91.55 pixels at the edges of the camera frame and that the radial component dominates. We also use the camera calibration result to evaluate the effect of lens distortion correction. Fig. 8 shows a test picture of a regular grid. Due to lens distortion, the grid lines appear to bend outward, as shown in Fig. 8(A). After applying the lens distortion correction procedure, the distorted lines are restored and appear straight and parallel, as shown in Fig. 8(B). The results demonstrate that the lens distortion is significant and that its correction is indispensable for texture mapping; the adopted correction method is feasible for the creation of a photo-realistic city model.
Fig. 7. The total lens distortion vectors
Table 1. Summary of maximum lens distortion on the x- and y-axes (units: pixels)
                              x-axis   y-axis
Radial Lens Distortion         88.45    59.83
Decentering Lens Distortion     3.10     1.62
Total Lens Distortion          91.55    61.46
Fig. 8. Comparison of lens distortion correction: (A) before correction; (B) after correction
3.2 Multi-face Texture Mapping
In this section, we discuss the texturing of the 3D building models of NCU. Fig. 9 shows an example of a complex building composed of circular and rectangular shapes. In Fig. 10, 12 visible faces are selected for texture mapping, and in Fig. 9, 5 control points, depicted with square symbols, are marked for orientation modeling. The texture mapping result is shown in Fig. 11 for comparison. In this case, if an affine transformation were used for texture mapping, the operator would need to mark four image control points per face, i.e., 48 image control points in total. The manpower needed would be significantly more than for the proposed scheme. In addition, the marking of the image control points may introduce a certain degree of measurement error, so that a grey-value discontinuity effect may occur between two consecutive faces when an affine transformation is utilized. This effect does not happen in the proposed approach, where the image control points for texture mapping are calculated by the collinearity equations with the estimated camera orientation parameters. This means that the visual quality is improved. With this procedure we create a photo-realistic city model of NCU containing about 300 polyhedrons, as shown in Fig. 12. In total, more than 1,400 terrestrial photos are utilized and about 40 man-hours are spent on texture mapping. Since the buildings in NCU are separated by some distance, taking pictures is not a problem. However, trees and cars in front of the buildings give rise to significant non-self-occlusion problems that are difficult to solve. There is a trade-off between efficiency and realism; on the city scale, the proposed approach fulfills the requirements for producing a photo-realistic city model.
Fig. 9. Terrestrial photo
Fig. 10. The selected facades of interest before texture mapping
Fig. 11. Multi-face texture mapping results
Fig. 12. Photo-realistic city model of NCU
4 Conclusions
In this paper, we propose the use of a high-resolution, non-metric, wide-FOV digital camera for texture generation in photo-realistic city modeling. The adopted camera calibration model is both effective and feasible for texture mapping. The designed GUI for control point marking and camera orientation modeling is so easy to operate that an expert or experienced operator is not required. The proposed scheme has proven to be efficient for two reasons. First, the number of pictures to be processed can be reduced, because all visible facade textures can be generated from one photo and partially occluded facades can be compensated for using the mirror operation. The fidelity of the generated texture is high provided the facades meet the assumption that they contain repetitive or homogeneous patterns. Second, a large quantity of pictures can be handled efficiently by the GPS-Photo-GIS integration. Thanks to the adoption of photogrammetric techniques, multi-face texture mapping can be performed, resulting in a photo-realistic city model. The generated city model is appropriate for most geo-visualization applications.
Acknowledgments This work was partially supported by the NCU-ITRI Union Research and Developing Center under project number NCU-ITRI 930304. It is also partially supported by the National Science Council under project number NSC 93-2211-E-008-026.
References
1. Baillard, C., Zisserman, A.: A Plane Sweep Strategy for the 3D Reconstruction of Buildings from Multiple Images. IAPRS, Vol. 33, Part B2, Amsterdam, Netherlands (2000) 56-62
2. Brenner, C.: Towards Fully Automatic Generation of City Models. IAPRS, Vol. 33, Part B3, Amsterdam, Netherlands (2000) 84-92
3. Rau, J.-Y., Chen, N.-Y., Chen, L.-C.: True Orthophoto Generation of Built-Up Areas Using Multi-View Images. PE&RS, Vol. 68, No. 6 (2002) 581-588
4. Brenner, C., Haala, N.: Fast Production of Virtual Reality City Models. IAPRS, Vol. 32, No. 4 (1998) 77-84
5. Haala, N.: On the Refinement of Urban Models by Terrestrial Data Collection. IAPRS, Vol. 35, Part B3, Istanbul, Turkey (2004) 564-569
6. Klinec, D., Fritsch, D.: Towards Pedestrian Navigation and Orientation. In: Proceedings of the 7th South East Asian Survey Congress, SEASC'03, Hong Kong (2003) On CD-ROM
7. Varshosaz, M.: Occlusion-Free 3D Realistic Modeling of Buildings in Urban Areas. IAPRS, Vol. 35, Part B4, Istanbul, Turkey (2004) 437-442
8. PhotoModeler, Eos Systems Inc. Available: http://www.photomodeler.com
9. Bax, M.R.: Real-Time Lens Distortion Correction: 3D Video Graphics Cards Are Good for More than Games. Stanford Electrical Engineering and Computer Science Research Journal, Spring (2004). Available: http://ieee.stanford.edu/ecj/spring04.html
Focus + Context Visualization with Animation
Yingcai Wu, Huamin Qu, Hong Zhou, and Ming-Yuen Chan
Department of Computer Science and Engineering, Hong Kong University of Science and Technology
{wuyc, huamin, zhouhong, pazuchan}@cse.ust.hk
Abstract. In this paper, we present some novel animation techniques to help users understand complex structures contained in volumetric data from the medical imaging and scientific simulation fields. Because of the occlusion of 3D objects, these complex structures and their 3D relationships usually cannot be revealed using one image. With our animation techniques, the focus regions, the context regions, and their relationships can be better visualized at the different stages of an animation. We propose an image-centric method which employs layered depth images (LDIs) to obtain correct depth cues, and a data-centric method which exploits a novel transfer function fusing technique to guarantee smooth transitions between frames. The experimental results on real volume data demonstrate the advantages of our techniques over traditional image blending and transfer function interpolation methods.
1 Introduction Volume visualization helps people gain insight into volumetric data using interactive graphics and imaging techniques. The data from real applications such as medical imaging and computational fluid dynamics often contain multiple complex structures. Because of the occlusion of 3D objects, revealing all these structures and presenting their 3D relationships in one image are very challenging. Two widely used techniques to attack the occlusion problem are direct volume rendering (DVR) and focus+context visualization. By assigning different transparency values to the voxels of the volume data via transfer functions (TFs) and then compositing them into one image, DVR can reveal more information than traditional surface-based rendering. Not all structures are interesting to users. Some structures are more important and are often called the “regions of interest” or “focus regions”, while other structures are less important and only serve as the “context”. Various focus + context techniques such as illustrative visualization and smart visibility have been developed to reveal the regions of interest (i.e., focus regions) while preserving the context. Nevertheless, all these techniques have their limitations and cannot totally solve the occlusion problem. Usually, when the focus regions are highlighted, the context may be missed or distorted to some extent. On the other hand, the focus may be occluded by the preserved context. We propose to use animation for focus+context visualization. We first generate some keyframes which can clearly reveal either the regions of interest or the context, and then fill in the gap between each pair of the successive key frames with a sequence of intermediate frames. The spatial relationship between the focus and the context will be revealed in the animation process. Our work is motivated by an interesting psychology
phenomenon - “bird in cage” - also called afterimage [1]. It is closely related to the theory of persistence of vision which accounts for the illusion of motion in the filming industry. According to the theory, a visual image will persist in a user's brain for a short time after the real image has disappeared from his/her eyes. Therefore, if we quickly show a sequence of the key frames and intermediate frames containing the focus and context, then the focus and context may simultaneously appear in users' brains and the users can figure out their 3D relationship. Compared with traditional images, direct volume rendered images (DVRIs) have some special features. Traditional images usually show objects in the real world so opaque surfaces are often presented, while DVRIs are used to reveal information contained in 3D volume data so multi-layer transparent surfaces are usually displayed. In some DVRIs containing fuzzy structures, there are even no clear surfaces. To generate animations for DVRIs with large transparent areas, depth cues and smooth transitions are especially important for conveying correct information. If we directly apply traditional animation techniques such as image blending to DVRIs, we may not get correct depth cues and some misleading information will be introduced. To address these problems, we propose two animation techniques, i.e., an image-centric method and a data-centric method, for DVRIs. In our image-centric method, the intermediate frames between any two successive keyframes are synthesized by image blending. We employ layered depth images (LDIs) to provide the depth information of the 3D structures in the keyframes and then composite these LDIs in depth-sorted order to obtain the intermediate images with correct depth cues. The image-centric method provides a practical solution for simple volumetric data and low-end platforms. However, since the operations of volume rendering are nonlinear, the image-blending method may fail for complex data. Thus, a data-centric method is introduced for high-end applications. In our data-centric method, the intermediate frames are all direct volume rendered to reveal complex structures and guarantee correct depth cues. The TFs for the intermediate frames are generated by a novel TF fusing technique to guarantee a smooth transition between any two frames. We further develop two animation editing techniques, i.e., level-of-detail and zoom-in/out, to emphasize the focus frames and suppress the context frames in the time domain of the animation. This paper is organized as follows: After reviewing previous work in Section 2, we give an overview of our animation approaches in Section 3. The image-centric approach for generating animations is presented in Section 4, while the data-centric approach is described in Section 5. In Section 6, we discuss how to emphasize the regions of interest while preserving the context in the time domain of the animation. The experimental results are presented in Section 7. We conclude our work and suggest some future research directions in Section 8.
2 Related Work Image Blending. Image blending has been extensively studied in image processing [2], computer vision [3], and human-computer interaction [4]. Whitaker et al. [2] proposed an image blending procedure by progressively minimizing the difference of the level sets between two images. Baudisch et al. [4] developed an approach which relies on a
vector of blending weights to display overlapping windows simultaneously. More recently, Soille [3] presented a method for image compositing based on mathematical morphology and a marker-controlled segmentation paradigm. Image blending is a classic technique and is still widely used. Focus + Context Visualization. To effectively visualize salient features (or focus regions) without occlusion while preserving necessary context is a challenging problem in volume visualization. Weiskopf et al. [5] proposed several interactive clipping methods by exploiting the powerful capability of GPUs. However, some spatial relationships of 3D structures are lost when too much context is removed. Viola et al. [6] recently proposed a novel importance-driven approach which is capable of enhancing important features while preserving necessary context by generating cut-away views and ghosted images from volumetric data. It is further used in an interactive system [7], volumeshop, for direct volume illustration. Nevertheless, their approaches require pre-segmented volumetric data, which are not always available in practice. Volume illustration, first introduced by Ebert et al. [8], provides an alternative way for focus + context visualization. With non-photorealistic rendering (NPR) techniques, volume illustration helps users perceive visualization results by highlighting significant features and subjugating less important details. Lens and distortion is another focus + context technique for volume exploration. Wang et al. [9] developed some interactive GPU-accelerated volume lens to magnify the regions of interest and compress the other volume regions, but at the cost of distorting the data. Animated Visualization. Lum et al. [10] presented an impressive visualization technique called Kinetic Visualization for creating motion along a surface to facilitate the understanding of static transparent 3D shapes. Weiskopf [11] described some relevant psychophysical and physiological findings to demonstrate the important role of color in the perception of motion. Correa et al. [12] proposed a new data traversal approach for volumetric data with motion-controlled transfer functions. Three perceptual properties of motion: flicker, direction, and velocity have been thoroughly examined by Huber et al. [13] in an experimental study. Although there has been growing interest in animated visualization, most previous works either aim at enhancing the perception of 3D shapes and structures of static objects [10, 12] or focus on evaluating some general motion attributes [11,13]. Using animation for focus + context visualization is still an unexplored area.
3 Overview
In this paper, we propose two approaches, i.e., the image-centric and data-centric methods, to create animations for focus + context volume visualization. Our approaches employ an important animation technique called tweening (or in-betweening) which can generate a sequence of intermediate frames between two successive key frames so that the first key frame can smoothly change into the second one. After creating a series of key frames that reveal either the focus or the context, our approaches will apply the tweening techniques to the successive key frames to form an animation. The differences between the image-centric method and the data-centric method lie in three aspects. First, the approaches to generating key frames are different. The image-centric method creates the key frames by surface-based volume rendering with
user-specified isovalues, while the data-centric method generates the key frames by DVR with user-specified transfer functions (TFs). Second, the processes of tweening between the key frames are different. The image-centric method achieves the tweening by blending the key frames with different alpha values; the original volume is not required for the tweening process. The data-centric method generates a sequence of intermediate frames by fusing the transfer functions, and it requires the original volume. Third, the image-centric method aims at low-end platforms with limited memory and storage space, and low-end applications with simple data. In contrast, the data-centric method is used to create higher quality and smoother animations on powerful platforms. The table below summarizes the differences.

                      Image-Centric Method             Data-Centric Method
Keyframe Generation   Surface-based volume rendering   Direct volume rendering
Tweening              Blending keyframes               Fusing TFs
Original Volume       Not required for tweening        Required for tweening
Applications          Low-end                          High-end

Our system further applies two widely used techniques, level-of-detail and zoom-in/out, to the generated animations to emphasize the important frames (i.e., the focus frames) and suppress the less important frames. They are applied to the animation timeline, which controls the display time of each frame in an animation. With the level-of-detail technique, users are able to reduce the number of context keyframes, while the zoom-in/out technique allows users to emphasize the focus frames while suppressing the context frames.
4 Image-Centric Animation In this section, we present the image-centric method. Compared with the data-centric animation scheme (see Section 5), it is light-weight and better suited for low-end platforms such as smartphones and personal digital assistants (PDA). For example, it can be used in a client-server graphics system. The required key frames (i.e., a series of layered depth images (LDIs)) can be generated in a remote server. Some of those LDIs contain the regions of interest, while the rest have the context. The key frames are then sent to the low-end clients. After receiving the key frames, the clients will fill in the gap between each pair of the successive key frames by blending the key frames with different alpha values. It consumes much less bandwidth, memory, and computation time than the data-centric method. 4.1 Layered Depth Images The concept of layered depth images (LDIs) was first proposed by Shade et al. [14] for image-based rendering. In our work, the LDIs are used as key frames to provide the depth information of different structures in the original data. The focus region and context region are specified by users as iso-values. Then surface-based volume rendering will be used to generate all key frames within one pass. The LDIs are stored as the result. We cast a ray from each pixel, and save the depth and interpolated color for each intersection point between the ray and the user-specified iso-surfaces.
There are two reasons why we use LDIs as key frames. First, LDIs consume much less memory and bandwidth than original volumes, and they can be generated efficiently; they are therefore very suitable for low-end platforms. Second, LDIs can guarantee that the blending used in the tweening process (see Section 4.2) and in the level-of-detail approach (see Section 6) will generate correct depth cues for intermediate frames.
4.2 Tweening Between Key Frames
We employ a simple yet effective technique to fill in the gap between each pair of successive key frames. The tweening technique creates a sequence of intermediate frames by blending the successive key frames with different alpha values. We denote the image blending as a function B(P1, P2, α1, α2), where P1 and P2 are two successive key frames, and α1 and α2 are the alpha values for P1 and P2, respectively. Thus, the intermediate frames can be viewed as a set of B(P1, P2, α1, α2) with a linear combination of α1 and α2 of the key frames. To get correct depth cues, the LDI layers at similar depth ranges are alpha blended first, and then the different layers are composited into an intermediate frame in depth-sorted order.
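The sketch below illustrates this image-centric tweening under simplifying assumptions: each keyframe is stored as a list of LDI layers (color, alpha, scalar depth), the two keyframes have the same number of layers, layers are paired by depth order and cross-faded with the tweening weights, and the result is composited back to front. It is an illustration of the idea, not the paper's implementation.

```python
import numpy as np

def tween_ldi(layers1, layers2, alpha1, alpha2, background=0.0):
    """Blend two LDI keyframes (alpha1 + alpha2 = 1) and composite in depth-sorted order.

    layers1, layers2: lists of (color, alpha, depth) with color (H, W, 3), alpha (H, W),
                      and depth a scalar per layer; both lists assumed equally long.
    """
    blended = []
    for (c1, a1, d1), (c2, a2, d2) in zip(sorted(layers1, key=lambda l: l[2]),
                                          sorted(layers2, key=lambda l: l[2])):
        color = alpha1 * c1 + alpha2 * c2            # cross-fade paired layers
        alpha = alpha1 * a1 + alpha2 * a2
        depth = alpha1 * d1 + alpha2 * d2
        blended.append((color, alpha, depth))

    # Back-to-front "over" compositing keeps the depth cues consistent.
    h, w, _ = blended[0][0].shape
    out = np.full((h, w, 3), background, dtype=np.float32)
    for color, alpha, _ in sorted(blended, key=lambda l: l[2], reverse=True):
        out = alpha[..., None] * color + (1.0 - alpha[..., None]) * out
    return out
```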
5 Data-Centric Animation
The image-centric animation scheme is effective for simple volume data and low-end applications. However, because not all 3D structures can be represented as iso-surfaces, some important information contained in complex volume data may be missed in the keyframes and intermediate frames generated by that method. Moreover, the alpha blending operations may introduce some misleading information [4]. Although we can minimize the amount of incorrect information with LDIs, the existence of misleading or incorrect information is still unavoidable in the blended images in some situations. Therefore, for high-end applications, we propose a data-centric method where all frames are generated by direct volume rendering to provide correct depth cues and eliminate misleading information. Our method applies a novel transfer function fusing technique to generate a sequence of intermediate transfer functions from any two successive keyframe transfer functions, and the animation formed by the images rendered using these transfer functions guarantees smooth transitions between frames.
5.1 Similarity-Based Tweening Between Key Frames
TFs in DVR are used to classify different features in volumetric data and determine the information shown in the final images. We assume that multiple key TFs for the key frames have already been generated by users. These TFs either emphasize the focus or the context of the data. Suppose that the TFs for two successive key frames, P1 and P2, are TF1 and TF2, and the number of intermediate frames between the two key frames is N. To tween between these two key frames, one straightforward solution is to linearly interpolate the key TFs to obtain a number of intermediate TFs, which can then be used to generate a sequence of intermediate frames by DVR. For example, the TF for the i-th intermediate frame can be computed as TF_i = ((N - i)/N) * TF_1 + (i/N) * TF_2. However, based on our experiments with real data (see Fig. 1(a)-1(e)), this method cannot generate
expected results. This is because the compositing operation used in volume rendering is nonlinear, and thus a smooth transition between two TFs cannot guarantee a smooth transition between the resulting DVRIs. To solve this problem, we propose a similarity-based tweening technique. We first introduce a similarity metric between two frames and then linearly change the similarity values between the intermediate frames and the key frames. To guarantee a smooth animation, for the i-th intermediate frame P_i, the similarity value between P_i and P1 should be (N - i)/N and the similarity value between P_i and P2 should be i/N. In our data-centric method, we compute a series of intermediate TFs whose generated DVRIs linearly change their similarity values with the keyframes. To do this, we apply a novel transfer function fusing technique.
5.2 Transfer Function Fusing
Consider the following TF fusing problem: Given two transfer functions TF1 and TF2 (called parent transfer functions) and their corresponding DVRIs P1 and P2, we are asked to generate a transfer function TF (called the child transfer function) whose corresponding DVRI P will have a similarity value v1 with P1 and a similarity value v2 with P2. We develop a transfer function fusing system to solve this problem based on our previous work [15]. Our system first employs a contour-based similarity metric to compare two direct volume rendered images. Then an energy function is introduced, and the transfer function fusing problem is turned into an optimization problem which can be solved using genetic algorithms (GAs).
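The sketch below shows the overall shape of this fusing step as a generic optimization loop: a candidate TF is rendered, compared against the two parent DVRIs with a similarity metric, and scored by how far it misses the target values (v1, v2). The `render` and `similarity` callables are assumed inputs, and the simple random-mutation search is only a stand-in for the genetic-algorithm and contour-based metric of the paper [15].

```python
import random

def fuse_transfer_functions(tf1, tf2, p1, p2, v1, v2, render, similarity,
                            iterations=200, step=0.05):
    """Search for a child TF whose rendering has similarity v1 to P1 and v2 to P2.

    tf1, tf2   : parent TFs as lists of opacity samples in [0, 1].
    p1, p2     : the parent DVRIs.
    render     : callable mapping a TF to a DVRI (direct volume rendering).
    similarity : callable mapping two DVRIs to a value in [0, 1].
    """
    def energy(tf):
        img = render(tf)
        return (similarity(img, p1) - v1) ** 2 + (similarity(img, p2) - v2) ** 2

    # Start from a plain mix of the parents, then refine by random mutation
    # (a stand-in for the genetic-algorithm search used in the paper).
    best = [0.5 * a + 0.5 * b for a, b in zip(tf1, tf2)]
    best_e = energy(best)
    for _ in range(iterations):
        cand = [min(1.0, max(0.0, x + random.uniform(-step, step))) for x in best]
        e = energy(cand)
        if e < best_e:
            best, best_e = cand, e
    return best
```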
6 Animation Editing
Each key frame used in the animation usually contains only one specific feature, except in a few situations where a feature is hard to separate from other features. If there are many features in the volumetric data, there will be a large number of key frames in the animation. Thus, the total number of frames, including the key frames and intermediate frames, may become too large, which makes it difficult for users to focus on the frames containing the features they want to see. To overcome this problem, we propose two techniques, level-of-detail (LOD) and zoom-in/out, to emphasize the important frames while suppressing the context frames. The proposed techniques are applied to the timeline, which controls the display time of each frame. In our paper, we apply LOD to the timeline to make the exploration easier and more flexible in a visualization process. Users are allowed to select multiple successive key frames and combine them into a new key frame. After that, the system replaces these keyframes with the newly combined key frame for the animation. Tweening is then done automatically with the image-centric or the data-centric method. Users can perform this combination many times until they obtain satisfactory animations. For the image-centric method, the combination of the key frames is done by blending them with the same or different alpha values specified by users. For the data-centric method, the key frames are combined by the TF fusing technique with different expected similarity values. In our work, we mainly use the LOD technique to reduce the number of context keyframes.
After the context keyframes are clustered by the LOD technique, we may need to further emphasize the frames in which we are interested. In order to highlight these focus frames, we apply a technique similar to the fisheye-view approach [16] to the timeline of the animation, which gives the important frames a longer display time and reduces the display time of the less important frames.
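A small sketch of such a fisheye-style reweighting of the timeline is given below; the weighting function and its parameters are illustrative choices rather than the paper's exact formulation.

```python
def fisheye_display_times(num_frames, focus_index, base_time=0.04, boost=4.0, spread=5.0):
    """Assign per-frame display times so frames near the focus stay on screen longer.

    base_time : display time (seconds) of a far-away context frame.
    boost     : how much longer the focus frame is shown than a context frame.
    spread    : how quickly the emphasis falls off with distance from the focus.
    """
    times = []
    for i in range(num_frames):
        distance = abs(i - focus_index)
        emphasis = 1.0 + (boost - 1.0) / (1.0 + (distance / spread) ** 2)
        times.append(base_time * emphasis)
    return times

# Example: a 60-frame animation whose 25th frame is the focus keyframe.
schedule = fisheye_display_times(60, focus_index=25)
```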
7 Experiment Results
We tested our system on a Pentium(R) 4 3.2 GHz PC with an Nvidia GeForce 6800 Ultra GPU with 256 MB RAM. The sampling rate of DVR was two samples per voxel along each ray, and the image resolution used in the system was 512 × 512. All volumetric data used for the experiments were 8-bit data. For the generation of key frames, the image-centric method is at least n (the number of required key frames) times faster than the data-centric method, since the image-centric method needs only one pass of surface-based volume rendering to generate all the key frames while the data-centric method
[Figure 1 panels: rows (a)-(e), (f)-(j), and (k)-(o) use the weights (0.9, 0.1), (0.7, 0.3), (0.5, 0.5), (0.3, 0.7), and (0.1, 0.9), respectively; (p) keyframe1; (q) keyframe2; (r) opacity-versus-scalar-value plots of TF1 and TF2]
Fig. 1. Tweening between key frames: (p) and (q) are the key frames; (a)-(e) intermediate frames created by linearly interpolating TFs of the key frames with different (α , β ); (f)-(j) intermediate frames created by fusing the key frames with different (v1 , v2 ); (k)-(o) intermediate frames generated by blending the key frames with different (α1 , α2 ); (r) TFs of the key frames
requires n passes of DVR to create them. For the tweening, the data-centric method required around 40 seconds to fuse two TFs, while the image-centric method needed only 0.147 seconds to blend two frames. Fig. 1 shows the differences between the tweening results of the different methods. Fig. 1(p) and 1(q) are the key frames, indicated as keyframe1 and keyframe2, and Fig. 1(a)-1(e) were generated by linearly interpolating the TFs of the key frames (TF = α * TF1 + β * TF2, where α and β are given for each panel, and TF1 and TF2 are shown in Fig. 1(r)). Fig. 1(a)-1(e) fail to form a smooth transition from keyframe1 to keyframe2: the change from Fig. 1(a) to 1(b) is too abrupt, and Fig. 1(b)-1(e) are almost the same as keyframe2 (Fig. 1(q)). In contrast, Fig. 1(f)-1(j), generated by our data-centric method, and Fig. 1(k)-1(o), created by our image-centric method, show a smoother transition between the successive key frames. Moreover, the tweening in Fig. 1(f)-1(j), done by the data-centric method, is the best among all these methods. Fig. 1(k) and 1(o) make the morphing generated by the image-centric method somewhat abrupt and not as good as that created by the data-centric method, since they make the transition from keyframe1 to Fig. 1(l) and the transition from Fig. 1(n) to keyframe2 rough. In addition, the images created by the data-centric method have richer details and provide better depth cues than those created by the image-centric method (see the regions selected by the red curves in Fig. 1(i) and 1(n)). An experiment on a CT carp dataset (256 × 256 × 512) was conducted to demonstrate the effectiveness of the LOD technique used in our system. There are four key frames (Fig. 2(a)-2(d)) generated by the image-centric method, and three key frames (Fig. 2(e)-2(g)) generated by the data-centric method. The LOD technique was applied to reduce the number of context keyframes. Fig. 2(h) was generated by fusing Fig. 2(f), 2(g), and 2(e) with the same expected similarity value (v = 0.5), while Fig. 2(i) was generated by compositing Fig. 2(a)-2(d) in depth-sorted order. Both Fig. 2(h) and
Fig. 2. Key frames: (a)-(d) key frames (LDIs) generated using the image-centric method; (e)-(g) key frames generated with the data-centric method; (h) image generated by fusing (f) with (g), and then with (e); (i) image generated by compositing the LDIs (a)-(d) in depth-sorted order
2(i) can correctly provide depth cues. This experiment shows that both the blending and fusing methods are effective for simple volumetric data like the CT carp data. Finally, our approaches were applied to an MRI head dataset (256 × 256 × 256). Fig. 3(a)-3(c) are LDIs generated by the image-centric method within one pass of surface-based volume rendering, while Fig. 3(d)-3(f) were generated separately using traditional DVR with manually created TFs. Obviously, Fig. 3(b) and 3(c) are sharper and clearer than Fig. 3(e) and 3(f). However, Fig. 3(e)-3(f) may better reflect the fuzzy and noisy nature of the original data.
Fig. 3. Key frames: (a)-(c) are LDIs generated using the image-centric method; (d)-(f) are generated using the data-centric method
8 Conclusions and Future Work
In this paper, we have developed two animation techniques - the image-centric and data-centric methods - for focus + context visualization. Our animation techniques allow users to better visualize the focus and context regions in volume data and to reveal their 3D relationship. Layered depth images were exploited in the image-centric method to provide correct depth cues, while a novel transfer function fusing method was employed in the data-centric method to achieve smooth transitions between frames. The experimental results on real volume data demonstrated the advantages of our methods over traditional image blending and transfer function interpolation approaches. Our animation techniques can help users better understand their data and are very suitable for presentation, education, and data exploration. In the future, we plan to improve the animation quality of the image-centric method by investigating more advanced blending schemes, and the tweening speed of the data-centric method by exploiting graphics hardware. We also want to conduct a user study to thoroughly evaluate the effectiveness of the animation techniques for focus + context volume visualization and to fine-tune the parameters used in our animation system.
Acknowledgements This work was supported by RGC grant CERG 618705 and HKUST grant DAG 04/05 EG02. We would like to thank the anonymous reviewers for their helpful comments, and thank Ka-Kei Chung for his help with the experimental results.
References
1. Robinson, W.S.: Understanding Phenomenal Consciousness. Cambridge University Press, Cambridge, U.K. (2004)
2. Whitaker, R.T.: A level-set approach to image blending. IEEE Transactions on Image Processing 9(11) (2000) 1849–1861
3. Soille, P.: Morphological image compositing. IEEE Transactions on Pattern Analysis and Machine Intelligence 28(5) (2006) 673–683
4. Baudisch, P., Gutwin, C.: Multiblending: displaying overlapping windows simultaneously without the drawbacks of alpha blending. In: Proceedings of the SIGCHI conference on Human factors in computing systems (2004) 367–374
5. Weiskopf, D., Engel, K., Ertl, T.: Interactive clipping techniques for texture-based volume visualization and volume shading. IEEE Transactions on Visualization and Computer Graphics 9(3) (2003) 298–312
6. Viola, I., Kanitsar, A., Gröller, M.E.: Importance-driven feature enhancement in volume visualization. IEEE Transactions on Visualization and Computer Graphics 11(4) (2005) 408–418
7. Bruckner, S., Gröller, M.E.: VolumeShop: An interactive system for direct volume illustration. In: Proceedings of IEEE Visualization 2005 (2005) 671–678
8. Rheingans, P., Ebert, D.: Volume illustration: Nonphotorealistic rendering of volume models. IEEE Transactions on Visualization and Computer Graphics 7(3) (2001) 253–264
9. Wang, L., Zhao, Y., Mueller, K., Kaufman, A.E.: The magic volume lens: An interactive focus+context technique for volume rendering. In: Proceedings of IEEE Visualization 2005 (2005) 367–374
10. Lum, E.B., Stompel, A., Ma, K.L.: Using motion to illustrate static 3D shape - kinetic visualization. IEEE Transactions on Visualization and Computer Graphics 9(2) (2003) 115–126
11. Weiskopf, D.: On the role of color in the perception of motion in animated visualizations. In: Proceedings of IEEE Visualization 2004 (2004) 305–312
12. Correa, C.D., Silver, D.: Dataset traversal with motion-controlled transfer functions. In: Proceedings of IEEE Visualization 2005 (2005) 359–366
13. Huber, D.E., Healey, C.G.: Visualizing data with motion. In: Proceedings of IEEE Visualization 2005 (2005) 527–534
14. Shade, J., Gortler, S., He, L.-W., Szeliski, R.: Layered depth images. In: Proceedings of the 25th annual conference on Computer graphics and interactive techniques (1998) 231–242
15. Wu, Y., Qu, H., Zhou, H., Chan, M.Y.: Fusing features in direct volume rendered images. In: To appear in the proceedings of the Second International Symposium on Visual Computing (2006)
16. Furnas, G.W.: Generalized fisheye views. In: Proceedings of the SIGCHI conference on Human factors in computing systems (1986) 16–23
17. Treue, S., Husain, M., Andersen, R.A.: Human perception of structure from motion. Vision Research 31 (1991) 59–75
18. Lagasse, P., Goldman, L., Hobson, A., Norton, S.R., eds.: The Columbia Encyclopedia. Sixth edn. Columbia University, U.S.A. (2001)
Exploiting Spatial-temporal Coherence in the Construction of Multiple Perspective Videos
Mau-Tsuen Yang, Shih-Yu Huang, Kuo-Hua Lo, Wen-Kai Tai, and Cheng-Chin Chiang
Department of Computer Science & Information Engineering, National Dong-Hua University, Taiwan
[email protected]
Abstract. We implement an image-based system that constructs multiple perspective videos with active objects using an off-the-shelf PC and digital video camcorders, without the need for advanced calibration or 3D model reconstruction. In particular, temporal coherence is exploited to speed up the correspondence search, and spatial coherence is exploited for occlusion recovery to improve the quality of the generated images. The proposed system can automatically establish dense correspondence and construct virtual views on-line according to user instructions.
Keywords: Active object movie, Multiple perspective video, 3D video, View morphing, Occlusion recovery, Spatial-temporal coherence.
1 Introduction
To create an object movie with active objects, called a multiple perspective video or 3D video, an image sequence, instead of a single image, must be stored for each viewing direction to record the movement of the objects. The set of image sequences, each captured at a separate perspective but all focused on the same target object, can be analyzed to generate a virtual image sequence for an arbitrary viewing direction. Several image-based interpolation techniques have been proposed to generate intermediate views of static scenes with fixed objects [1,10,14]. Recently, some research has been conducted on the analysis of dynamic scenes with non-rigid or moving objects [5,8,13] with the help of 3D model reconstruction and re-projection. Spatial-temporal view interpolation [13] was proposed for interpolating views of a non-rigid object across both space and time: captured image sequences are analyzed to compute 3D voxel models of the scene at each time instant and to generate a scene flow across time in which the movement of each voxel is described. In addition, Matsuyama et al. [8] reconstructed 3D shape based on silhouette intersection using a PC cluster, and Kitahara et al. [5] represented a 3D object with a set of planes based on the observer's viewing position. Because these methods are based on explicit 3D geometry, the freedom of the observer's position is guaranteed. However, not only were high-end video cameras and special synchronization hardware required, but the set-up and calibration process was also tedious.
Instead of relying on an explicit 3D model, it is also possible to implement a multiple perspective video using an image-based approach in which an implicit 3D model, such as correspondence, is utilized. Huang et al. [3] proposed a method to generate an arbitrary intermediate video from two videos captured at separate viewing positions. Their interpolation algorithm is based on disparity-based view morphing, and stereo video can be generated, compressed, and displayed on-line. However, the same interpolation algorithm was repeated at each time instant, and temporal coherence was not considered at all. As a result, the computationally intensive correspondence search slowed down the system. Another interesting problem in multiple perspective videos is occlusion, which can significantly degrade the quality of a virtual view. Occlusion occurs when some parts of objects can be seen from one perspective but disappear from other perspectives; in an object movie, occlusion usually occurs on the left/right boundary of an object. Searching for corresponding points in an occluded area is very difficult, if not impossible. However, without corresponding pairs in these areas, the left/right edges of objects in a virtual view become blurred. Speed and quality are two critical issues for realizing the potential power of interactive multiple perspective videos. In this paper, temporal coherence is exploited to speed up the correspondence search, and occlusion is handled properly based on spatial coherence to improve the quality of the generated images. The input image sequences can be captured using off-the-shelf DV camcorders without the need for advanced calibration, and a sequence of virtual views can be generated and displayed on-line using an inexpensive PC according to the demands of the user. Fig. 1 demonstrates the basic ideas and the flow chart of the proposed system.
Fig. 1. Concepts and flowchart of the construction of a multiple perspective video (left/right real image sequences → feature point matching → fundamental matrix estimation and image rectification → foreground/background segmentation → dense depth map construction exploiting temporal coherence → occlusion recovery exploiting spatial coherence → virtual view image sequence)
The remaining parts of the paper are organized as follows. Section 2 explains feature point extraction and corresponding pair matching. Section 3 presents fundamental matrix estimation and image pair rectification. Section 4 describes the foreground/background segmentation. Section 5 utilizes epipolar constraints to efficiently construct a dense depth map. Section 6 exploits temporal coherence to
speed up the correspondence search. Section 7 exploits spatial coherence for occlusion recovery to improve the quality of the depth map. Section 8 demonstrates the experimental results. Section 9 presents the conclusions and discussions.
2 Corresponding Feature Point Matching
First, several image sequences are captured using DV camcorders located at different viewing positions but focusing on the same target object in the center. To construct virtual views, the feature point correspondence between each neighboring pair of captured image sequences must be established. The correspondence problem is very challenging. Instead of finding corresponding feature points manually or semi-automatically, we implement an efficient block matching method to establish the correspondence automatically. A few salient feature points in both images are found using the Susan edge detector [12]. Typical detected feature points are points with significant changes in brightness in their neighborhood, such as the corners of an object. For each feature point in the left (right) image, the corresponding feature point in the right (left) image is searched for by comparing neighborhood similarities calculated with a Normalized Correlation (NC) score defined as follows:
Score(p_1, p_2) = \frac{\sum_{i=-n}^{n}\sum_{j=-m}^{m} [I_1(x+i, y+j) - I_1(x, y)]\,[I_2(s+i, t+j) - I_2(s, t)]}{\sqrt{\sum_{i=-n}^{n}\sum_{j=-m}^{m} [I_1(x+i, y+j) - I_1(x, y)]^2 \; \sum_{i=-n}^{n}\sum_{j=-m}^{m} [I_2(s+i, t+j) - I_2(s, t)]^2}}    (1)
where p1 represents a pixel (x,y) in the left image I1, p2 represents a pixel (s,t) in the right image I2, I(x,y) represents the brightness intensity of a pixel (x,y) in the image I, and the size of the comparison window is (2m+1)x(2n+1). Point p2 with the maximum correlation score is selected as the matching candidate for p1. The results from this step will be a set of corresponding feature point pairs.
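A minimal sketch of this block matching with the NC score of Eq. (1) is shown below (pure NumPy, for illustration only). The window size and search range are hypothetical, and feature points are assumed to lie far enough from the image border that the windows are valid.

```python
import numpy as np

def nc_score(patch1, patch2):
    """Normalized correlation between two equally sized patches (Eq. 1 style)."""
    a = patch1.astype(np.float32)
    b = patch2.astype(np.float32)
    a -= a[a.shape[0] // 2, a.shape[1] // 2]      # subtract the center-pixel intensity
    b -= b[b.shape[0] // 2, b.shape[1] // 2]
    denom = np.sqrt((a * a).sum() * (b * b).sum()) + 1e-9
    return float((a * b).sum() / denom)

def match_feature(left, right, x, y, m=7, n=7, search=30):
    """Find the best right-image match for the left-image feature point (x, y)."""
    win_l = left[y - n:y + n + 1, x - m:x + m + 1]
    best, best_pos = -1.0, None
    for t in range(y - search, y + search + 1):    # brute-force 2-D search window
        for s in range(x - search, x + search + 1):
            win_r = right[t - n:t + n + 1, s - m:s + m + 1]
            if win_r.shape != win_l.shape:
                continue                            # skip windows falling outside the image
            score = nc_score(win_l, win_r)
            if score > best:
                best, best_pos = score, (s, t)
    return best_pos, best
```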
3 Fundamental Matrix Estimation and Image Rectification
Epipolar geometry [11] is an effective means to speed up the search for corresponding points. Suppose a point P in 3D space is projected to p on the image plane I1 with focus C1 and to point p' on the image plane I2 with focus C2. The plane containing p, C1, and C2 is called the epipolar plane. The epipolar plane intersects the image plane I1 (I2) in a straight line L1 (L2); the lines L1 and L2 are called epipolar lines. For each feature point p on I1 (I2), its corresponding point on I2 (I1) must fall on the epipolar line L2 (L1). The projected point p on I1 (I2) and its corresponding epipolar line L2 (L1) on I2 (I1) are related by a 3x3 matrix F, called the fundamental matrix, through the relationships Fp = L2 and F^T p' = L1. The fundamental matrix F contains both the intrinsic parameters (such as focal length) and the extrinsic parameters (such as relative translation and rotation) of an image pair. If
the fundamental matrix is known, then the correspondence search algorithm only needs to search through a 1-D epipolar line instead of through a 2-D search window. Thus, the computation time can be significantly reduced. A fundamental matrix can be estimated using only 8 corresponding pairs [6]. However, the resulting estimation can be seriously degraded by a single outlier in the input corresponding pairs. The estimation accuracy can be improved by providing more reliable corresponding pairs. The redundancy of corresponding pairs can be utilized to minimize the estimation error. Fortunately, it is assumed that the two acquiring cameras do not move during the capturing period, making the fundamental matrix between them fixed across time. Thus, all of the corresponding pairs obtained at different time instants can be fused together to estimate the fundamental matrix. At time instant ti, a record is maintained for selecting the top 30 corresponding pairs with the highest correlation scores from all correspondence pairs obtained from time t0 to ti. These best corresponding pairs are used to estimate a fundamental matrix using a non-linear minimization method [4,7]. The estimated fundamental matrix is then used to obtain the epipolar lines at time instant ti+1. This iterative process is performed repeatedly at each time instant. As a result, the quality of the top 30 corresponding points becomes more reliable and the accuracy of the estimated fundamental matrix is improved across time as well. In case the estimation error is too large due to emerging outliers, the process is reset and a new set of corresponding points is collected. With the fundamental matrix, an image pair can be rectified to speed up the stereo matching. The purpose of the rectification is to transform the images according to their fundamental matrix so that their epipolar lines are aligned horizontally. In this paper, we apply the polar rectification method [9] so that the matching ambiguity can be reduced to half epipolar lines.
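For illustration, the sketch below estimates F and the right-image epipolar lines with OpenCV's RANSAC-based estimator; this substitutes a readily available routine for the non-linear minimization method [4,7] and the polar rectification [9] actually used in the paper.

```python
import cv2
import numpy as np

def estimate_epipolar_geometry(pts_left, pts_right):
    """Estimate F from corresponding points and return epipolar lines in the right image.

    pts_left, pts_right: (N, 2) arrays of corresponding pixel coordinates (N >= 8).
    """
    pts_left = np.asarray(pts_left, dtype=np.float32)
    pts_right = np.asarray(pts_right, dtype=np.float32)
    # RANSAC keeps the estimate robust against the occasional outlier pair.
    F, inlier_mask = cv2.findFundamentalMat(pts_left, pts_right, cv2.FM_RANSAC, 1.0, 0.99)
    # For each left point p, F p gives the epipolar line (a, b, c) in the right image.
    lines_right = cv2.computeCorrespondEpilines(pts_left.reshape(-1, 1, 2), 1, F)
    return F, inlier_mask.ravel().astype(bool), lines_right.reshape(-1, 3)
```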
4 Foreground/Background Segmentation
Since we are usually interested only in the foreground objects in an object movie, the foreground and background should be separated so that the dense computation in the next step can focus on the foreground objects. Every pixel in an image can be classified as foreground or background according to its color information. A static background is assumed in our application, and the color information for the background can be learned in a startup phase consisting of N frames in which no foreground object is present. In these frames, the mean μ_i(x,y) and variance σ_i^2(x,y) of each color component i ∈ {R, G, B} at each pixel (x,y) are measured. In the following frames, a distance d_t(x,y) for the pixel (x,y) at time instant t is defined using the following equation to evaluate its likelihood of being a foreground pixel:
d_t(x, y) = \sum_{i \in \{R, G, B\}} \frac{(I_{i,t}(x, y) - \mu_i(x, y))^2}{\sigma_i^2(x, y)}    (2)
where I_{i,t}(x,y) is the value of the i-th color component of the pixel (x,y). The background region B_l and the foreground region F_l in the left frame can then be segmented using a threshold T:
B_{l,t} = \{(x, y) \mid d_{l,t}(x, y) < T\}, \qquad F_{l,t} = \{(x, y) \mid d_{l,t}(x, y) \ge T\}    (3)
The background (foreground) region B_r (F_r) in the right frame can be defined similarly. After each pixel is classified as foreground or background, a morphological closing operation, i.e., a 3x3 dilation followed by a 3x3 erosion, is performed to fill in the noisy holes in the foreground region.
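A compact sketch of this per-pixel test (Eqs. 2-3) plus the 3x3 closing is shown below; it assumes 8-bit frames stacked as NumPy arrays, uses OpenCV only for the morphology, and the threshold value is a placeholder.

```python
import cv2
import numpy as np

def learn_background(startup_frames):
    """Per-pixel, per-channel mean and variance from N object-free startup frames."""
    stack = np.stack(startup_frames).astype(np.float32)       # (N, H, W, 3)
    return stack.mean(axis=0), stack.var(axis=0) + 1e-6        # avoid division by zero

def segment_foreground(frame, mean, var, threshold=25.0):
    """Classify pixels by the normalized color distance of Eq. (2), then clean with closing."""
    diff = (frame.astype(np.float32) - mean) ** 2 / var
    distance = diff.sum(axis=2)                                 # sum over R, G, B
    foreground = (distance >= threshold).astype(np.uint8) * 255
    kernel = np.ones((3, 3), np.uint8)
    # Closing = dilation followed by erosion, filling small holes in the foreground mask.
    return cv2.morphologyEx(foreground, cv2.MORPH_CLOSE, kernel)
```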
5 Dense Depth Map Construction
A dense depth map is constructed using a correspondence search process that is similar to the block matching method discussed in Section 2. However, several differences exist. First, instead of matching feature points, we need to find a corresponding point for every foreground pixel. To reduce the computational complexity, we use a sum-of-absolute-differences (SAD) score in place of the NC score:

Score(p_1, p_2) = \sum_{i=-n}^{n}\sum_{j=-m}^{m} \left| I_1(x+i, y+j) - I_2(s+i, t+j) \right|    (4)
Second, since the images have been rectified according to the estimated fundamental matrix, the search window can be limited to a horizontal stripe around the epipolar line. By reducing the search area from two dimensions to one dimension, the computation time can be significantly reduced and the quality of the corresponding pairs can be improved. In addition, an analysis of the SAD score distribution can help to identify uncertainties. A nearly flat SAD distribution represents a smooth area without much texture, while a SAD distribution with several similar minima indicates an area with repetitive texture. Let SAD1 be the minimum SAD score, located at pixel pr, and SAD2 be the second lowest SAD score, located at pixel p'r in the search window. We define the uncertainty of a corresponding pair (pl→pr) as U = SAD2/SAD1. The corresponding pair (pl→pr) is accepted only if SAD1 is lower than a predefined threshold and U is higher than another threshold. Third, after the two correspondence sets (left-to-right and right-to-left mappings) are established, they are joined together to obtain a single correspondence set. However, the joined set might contain several one-to-many or many-to-one mappings due to occlusion or mismatch. This can be corrected by a left/right consistency check: each one-to-many (or many-to-one) mapping is replaced by the one-to-one mapping with the lowest SAD score. The final correspondence set is then a single set containing one-to-one dense mappings. With these modifications, a dense depth map can be constructed more reliably and efficiently. Nevertheless, incorrect corresponding pairs and depth estimations are inevitable with an automatic search method.
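The following sketch illustrates the SAD search along a rectified scan line together with the uncertainty test U = SAD2/SAD1 described above; the window size, disparity range, and thresholds are placeholder values, and the pixel is assumed to lie away from the image border.

```python
import numpy as np

def sad(a, b):
    return float(np.abs(a.astype(np.int32) - b.astype(np.int32)).sum())

def match_on_scanline(left, right, x, y, m=5, n=5, max_disp=64,
                      sad_max=2000.0, uncertainty_min=1.2):
    """SAD matching restricted to the same (rectified) scan line, with an ambiguity check."""
    win_l = left[y - n:y + n + 1, x - m:x + m + 1]
    scores = []
    for d in range(max_disp + 1):                    # candidate disparities along the line
        s = x - d
        if s - m < 0:
            break
        win_r = right[y - n:y + n + 1, s - m:s + m + 1]
        scores.append((sad(win_l, win_r), s))
    if len(scores) < 2:
        return None
    scores.sort()
    (sad1, s1), (sad2, _) = scores[0], scores[1]
    # Reject smooth or repetitive areas: best score too high, or not clearly the best.
    if sad1 > sad_max or sad2 / max(sad1, 1e-6) < uncertainty_min:
        return None
    return s1                                        # matched column in the right image
```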
6 Exploiting Temporal Coherence for Speedup
The construction time of the dense depth map can be further reduced by exploiting temporal coherence. In an object movie with a real-time frame rate, objects in
consecutive frames usually stay fixed or move very slowly. Temporal coherence can therefore be exploited to speed up the search for corresponding points. Suppose P_{l,t} (P_{r,t}) is a pixel located in the frame captured by the left (right) camera at time instant t. We represent a one-to-one pair of corresponding points at time instant t by (P_{l,t} ↔ P_{r,t}), and the set of all corresponding pairs at time instant t by CR_t. Inside an image frame at time instant t, only certain points require a new search for corresponding points; the other points can inherit their corresponding points from the previous correspondence set CR_{t-1} at time instant t-1. The regions that require a new correspondence search can be categorized into three cases, analyzed as follows. First, the changing region S contains pixels with significant changes in brightness. Let I_{l,t}(x,y) (I_{r,t}(x,y)) represent the intensity of a pixel (x,y) in the left (right) image at time instant t; the changing region S can then be detected using a threshold:
(5)
Second, the missing region M contains every pixel at time instant t that does not have a corresponding point at time instant t-1:
M_{l,t} = {(x, y) : ((x, y) ↔ P_{r,t-1}) ∉ CR_{t-1}},  M_{r,t} = {(x, y) : (P_{l,t-1} ↔ (x, y)) ∉ CR_{t-1}}    (6)
Third, the invalid region V contains every pixel at time instant t that does have a corresponding point at time instant t-1, but whose corresponding point falls inside the changing region S or the background region B at time instant t:
V_{l,t} = {(x, y) : ((x, y) ↔ P_{r,t-1}) ∈ CR_{t-1} and (P_{r,t-1} ∈ S_{r,t} or P_{r,t-1} ∈ B_{r,t})},  V_{r,t} = {(x, y) : (P_{l,t-1} ↔ (x, y)) ∈ CR_{t-1} and (P_{l,t-1} ∈ S_{l,t} or P_{l,t-1} ∈ B_{l,t})}    (7)
The re-computing region R is then defined as the union of the changing region S, the missing region M, and the invalid region V:
R_{l,t} = S_{l,t} ∪ M_{l,t} ∪ V_{l,t},  R_{r,t} = S_{r,t} ∪ M_{r,t} ∪ V_{r,t}    (8)
For each pixel in the left (right) image inside the re-computing region R_l (R_r), a corresponding point in the right (left) image must be searched for. For each pixel that is not inside the re-computing region, its corresponding point can be propagated from the previous frame. As a result, the time spent on correspondence search is greatly reduced. Fig. 2 demonstrates these three cases using a virtual image sequence that contains a small teapot moving horizontally across a static big teapot.
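The region bookkeeping of Eqs. (5)-(8) can be sketched as below for one view. This is a simplified illustration, not the authors' code: the correspondence set CR_{t-1} is assumed to be stored as a per-pixel column map (the images are rectified, so corresponding points share a row), and the threshold T is arbitrary.

```python
import numpy as np

def recompute_region(I_t, I_prev, corr_prev, changing_other, background_other, T=15):
    """Return the boolean re-computing region R (Eq. 8) for one view at time t.
    I_t, I_prev: current/previous grayscale frames of this view;
    corr_prev: per-pixel column of the corresponding point at t-1 (-1 if none);
    changing_other, background_other: masks S and B of the other view at time t."""
    # Changing region S: pixels whose brightness changed significantly (Eq. 5)
    S = np.abs(I_t.astype(np.int16) - I_prev.astype(np.int16)) > T

    # Missing region M: pixels with no corresponding point at t-1 (Eq. 6)
    M = corr_prev < 0

    # Invalid region V: had a correspondence, but it now falls into the changing
    # or background region of the other view (Eq. 7)
    V = np.zeros_like(S)
    ys, xs = np.nonzero(corr_prev >= 0)
    cols = corr_prev[ys, xs]
    V[ys, xs] = changing_other[ys, cols] | background_other[ys, cols]

    # Pixels outside R inherit their correspondence from CR_{t-1} (Eq. 8)
    return S | M | V
```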
Fig. 2. Exploiting temporal coherence to speed up correspondence search. (a) Left image at time t-1; (b) Left image at time t; (c) Changing area S_{l,t}; (d) Missing area M_{l,t}; (e) Invalid area V_{l,t}.
7 Exploiting Spatial Coherence for Occlusion Recovery
The quality of the final generated virtual images depends on the quality of the depth map. After the dense depth construction, there will still be some holes in the depth map due to high SAD scores or low left/right consistency. The quality of the depth map can be further improved by filling these holes properly. In order to preserve object edges during hole filling, each hole should be filled along scan lines parallel to the primary edge axis of the hole. The primary edge axis of a hole in the depth map is computed as the majority orientation of the corresponding edges in the edge map, which is constructed by convolving the original image with a Prewitt operator [11]. Furthermore, each hole should be filled with a different strategy depending on its cause. First, in smooth areas that lack texture, the correspondence search usually fails because of ambiguity, resulting in holes in the depth map. For each scan line in such a hole, let d_1 (d_2) be the depth value on the left (right) boundary of the line; the depth of any pixel along the scan line is linearly interpolated from d_1 and d_2:
d(s) = d_2 (s / K) + d_1 (1 − s / K)    (9)
where K is the distance between the left and right boundaries and s is the distance between the estimated pixel and the left boundary. Second, closer objects in the scene might occlude objects that are farther away. Some areas appearing in the left (right) image might completely disappear in the right (left) image. The correspondence search also fails in these areas, causing holes in the depth map. Each hole in this case contains two areas, belonging to the closer object and the farther object respectively, so the depth discontinuity should be preserved around the occlusion boundary. Instead of using linear interpolation, a critical point with maximum magnitude in the edge map is found along each scan line. The depth value of
any pixel located to the left (right) of the critical point copies the depth value on the left (right) boundary of the scan line:
d(s) = d_1 if s lies to the left of the critical point, and d(s) = d_2 if s lies to the right of the critical point.    (10)
Third, certain parts of an object might be occluded by the object itself, which is called self-occlusion. In other words, the left (right) boundary of an object that is seen by the left (right) camera might not be observed by the right (left) camera. In this case, the depth along each scan line in the hole is replaced by the depth of the foreground object, i.e., d(s) = min(d_1, d_2). Fig. 3 demonstrates these three cases using a virtual image sequence containing a big teapot occluded by a small teapot; scan lines for hole filling are set to be horizontal in this example for illustration purposes. As a final noise filter, a depth consistency check in the neighborhood is applied to remove outliers. After these refinements, all the empty holes in the depth map are filled, and the depth of each pixel in the foreground region is determined. Once a reliable and dense depth map is established, a virtual image can be interpolated in any desired viewing direction. First, the left and right images (I1 and I2) are warped such that the optical axes of both images (I1' and I2') are perpendicular to the transition baseline. Second, a morphing function is applied to the warped images to obtain the desired transition view. The morphing function used here is disparity-based view morphing [3], a non-linear mapping based on the depth value of the corresponding pair. Finally, the transition view (Is') is warped back to the correct view (Is) with the proper viewing direction and position. Holes in the final view are eliminated by interpolating the values of horizontally adjacent pixels.
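A per-scan-line version of the three hole-filling strategies could look like the sketch below. It is a hedged illustration rather than the authors' code: horizontal scan lines, a known hole type, and valid depth values just outside the hole are all assumptions, and the primary-edge-axis alignment described above is omitted.

```python
import numpy as np

def fill_hole_scanline(depth, edge_mag, y, x0, x1, hole_type):
    """Fill the hole pixels of row y in columns [x0, x1) according to Section 7.
    d1/d2 are the known depths just left/right of the hole; edge_mag is the
    Prewitt edge-magnitude map used to locate the critical point."""
    d1, d2 = depth[y, x0 - 1], depth[y, x1]
    K = x1 - x0
    if hole_type == "smooth":                       # textureless area: Eq. (9)
        for s in range(K):
            depth[y, x0 + s] = d2 * s / K + d1 * (1 - s / K)
    elif hole_type == "occlusion":                  # keep the depth discontinuity: Eq. (10)
        c = x0 + int(np.argmax(edge_mag[y, x0:x1])) # critical point = strongest edge response
        depth[y, x0:c] = d1
        depth[y, c:x1] = d2
    elif hole_type == "self-occlusion":             # copy the foreground depth
        depth[y, x0:x1] = min(d1, d2)
    return depth
```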
Fig. 3. Three types of holes in the depth map. (a) The depth map; (b) Farther object occluded by a closer object; (c) Self-occlusion area; (d) Smooth area.
8 Experimental Results
We demonstrate the construction results of two multiple-perspective videos containing a real person and a cartoon character, respectively. The movie lasts 30 seconds (600 frames), and the acting character rotates his head, shakes his hands, and moves his body throughout this period. The image resolution is 360x240 in 24-bit color format. Two cameras captured the left and right image sequences from different viewing directions (around 30 degrees apart). Given the left and right image sequences as inputs, the proposed system generates and displays a virtual image sequence from the center viewing direction. The execution speed is around 7 frames per second on a standard PC with a Pentium IV 2.0 GHz CPU. One additional camera captured a real image sequence at the center viewing position for comparison purposes. Fig. 4 shows the left, right, virtual center, and real center image sequences.
Fig. 4. Screenshots from two multiple perspective videos
The virtual and real image sequences look very similar during playback. On close examination of frozen images, we notice that two kinds of conditions cause blurred virtual images in this example. The first condition occurs when some parts of the acting character move too fast (e.g., shaking hands). Fast movement blurs the captured images so that feature points cannot be extracted precisely in the blurred area, and the quality of the generated virtual images degrades. The second condition occurs in areas with incorrect corresponding pairs. Incorrect corresponding pairs cause displacement in the interpolation process and also reduce the accuracy of the fundamental matrix estimation. Generally, the quality of the virtual images in the ending part is better than in the beginning part, because the estimated fundamental matrix might not be accurate at the beginning. As time goes by, more reliable corresponding pairs become available, so the estimated fundamental matrix is much more stable in the latter part of the sequence.
Fig. 5(a) demonstrates the speedup of the processing time for each frame to construct and refine the depth map. The curve with asterisk marks (*) represents the processing time without considering temporal coherence and occlusion recovery. The curve with plus marks (+) represents the processing time considering only temporal coherence, while the curve with circle marks (○) represents the processing time considering only occlusion recovery. The curve with square marks (□) represents the processing time considering both temporal coherence and occlusion recovery. It can be noted that the exploitation of temporal coherence not only speeds up the process by a factor of 3, but also decreases the variance of the processing time across frames. This means that the virtual images can be generated much more frequently and smoothly using the proposed method. It can also be observed that the time spent on occlusion recovery is relatively small because each hole can be filled with a one-pass scan. Fig. 5(b) compares the quality of the generated virtual images through the whole image sequence. The quality is measured using the sum-of-square-difference (SSD) over every pixel between the virtual and real image frames. Similarly, four curves with different marks represent the SSD for the four different cases. It can be noted that occlusion recovery improves the quality while temporal coherence degrades the quality of the generated images. It can also be observed that the quality of the virtual view generally improves over time, except around frames 50-60 where the target waves his hands very fast.
Fig. 5. The comparison of the (a) processing time and (b) quality of the virtual images
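For reference, the SSD quality measure used in Fig. 5(b) amounts to the following; the per-pixel grayscale comparison is an assumption about how the frames are represented.

```python
import numpy as np

def ssd(virtual_frame, real_frame):
    """Sum-of-square-difference between the generated virtual frame and the real
    frame captured at the center viewpoint; lower values mean higher quality."""
    d = virtual_frame.astype(np.float64) - real_frame.astype(np.float64)
    return float((d * d).sum())
```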
9 Conclusions and Discussions
We implement a system that automatically constructs a multiple-perspective video from a few image sequences captured at different perspectives. In particular, temporal coherence is analyzed and exploited to speed up the dense correspondence matching between each pair of images. Several strategies are proposed to properly fill holes in the depth map caused by textureless, occluded, or self-occluded regions. Only simple equipment (an off-the-shelf PC and DV cameras) is used in the proposed system. Image sequences captured at separate viewing positions serve as the input data. The proposed system analyzes the captured image sequences and finds the correspondence between them. At playback time, a virtual image sequence is generated dynamically at arbitrary viewing directions so that a user can change the viewing direction on-line.
Acknowledgements. This research was supported in part by the National Science Council of Taiwan.
References
1. Chen, S., Williams, L.: View Interpolation for Image Synthesis. In: Proceedings of SIGGRAPH, pp. 279-288 (1993)
2. Gonzalez, R., Woods, R.: Digital Image Processing. Prentice Hall (2002)
3. Huang, H., Kao, C., Hung, Y.: Generation of Multi-Viewpoint Video from Stereoscopic Video. IEEE Transactions on Consumer Electronics (1999)
4. Intel: Intel Open Source Computer Vision Library. Available at http://www.intel.com/research/mrl/research/opencv
5. Kitahara, I., Ohta, Y.: Scalable 3D Representation for 3D Video Display in a Large-scale Space. In: Proceedings of IEEE Virtual Reality Conference (2003)
6. Longuet-Higgins, H.: Multiple Interpretations of a Pair of Images of a Surface. Proceedings of the Royal Society London A, Vol. 418, pp. 1-15 (1988)
7. Luong, Q., Faugeras, O.: The Fundamental Matrix: Theory, Algorithms, and Stability Analysis. International Journal of Computer Vision, Vol. 17 (1995)
8. Matsuyama, T., Wu, X., Takai, T., Wada, T.: Real-time Dynamic 3-D Object Shape Reconstruction and High-Fidelity Texture Mapping for 3-D Video. IEEE Transactions on Circuits and Systems for Video Technology (2004)
9. Pollefeys, M., Koch, R., Gool, L.: A Simple and Efficient Rectification Method for General Motion. In: International Conference on Computer Vision (1999)
10. Seitz, S., Dyer, C.: View Morphing. In: Proceedings of SIGGRAPH (1996)
11. Shapiro, L., Stockman, G.: Computer Vision. Prentice Hall (2001)
12. Smith, S., Brady, J.: SUSAN - A New Approach to Low Level Image Processing. International Journal of Computer Vision, Vol. 23 (1997)
13. Vedula, S., Baker, S., Kanade, T.: Spatio-Temporal View Interpolation. In: Proceedings of the 13th ACM Eurographics Workshop on Rendering (2002)
14. Zhang, L., Wang, D., Vincent, A.: Adaptive Reconstruction of Intermediate Views from Stereoscopic Images. IEEE Transactions on Circuits and Systems for Video Technology (2006)
A Perceptual Framework for Comparisons of Direct Volume Rendered Images
Hon-Cheng Wong1, Huamin Qu2, Un-Hong Wong1, Zesheng Tang1, and Klaus Mueller3
1 Faculty of Information Technology, Macau University of Science and Technology, Macao, China
2 Department of Computer Science and Engineering, Hong Kong University of Science and Technology, Hong Kong, China
3 Department of Computer Science, Stony Brook University, Stony Brook, USA
[email protected], [email protected], [email protected], [email protected], [email protected]
Abstract. Direct volume rendering (DVR) has been widely used by physicians, scientists, and engineers in many applications. There are various DVR algorithms, and the images generated by these algorithms are somewhat different. Because these direct volume rendered images will be perceived by human beings, it is important to evaluate their quality based on human perception. One of the key perceptual factors is whether and how the visible differences between two images will be observed by users. In this paper we propose a perceptual framework, based on the Visible Differences Predictor (VDP), for comparing direct volume rendered images generated with different algorithms or with the same algorithm under different specifications such as the shading method, gradient estimation scheme, and sampling rate. Our framework consists of a volume rendering engine and a VDP component. The experimental results on some real volume data show that the visible differences between two direct volume rendered images can be measured quantitatively with our framework. Our method can help users choose suitable DVR algorithms and specifications for their applications from a perceptual perspective and steer the visualization process.
1 Introduction
Direct volume rendering (DVR) is a widely used technique in visualization, which directly renders 3D volume data into 2D images without generating any intermediate geometric primitives. There are many DVR algorithms developed in the past two decades, including ray-casting [1], splatting [2], shear-warp [3], 2D texture slicing [4], 3D texture slicing [5], and GPU-based volume rendering [4] [6] [7]. A recent survey of DVR algorithms can be found in [8]. It is well known that direct volume rendered images generated by different DVR methods are somewhat different and some algorithms can generate images with better quality than others. Therefore, there is a need to compare the quality of direct volume rendered images generated by different methods and specifications. Fortunately,
there are more and more works reporting comparisons of volume rendering algorithms [9] [10] and volume rendered images [11]. These direct volume rendered images will be perceived by human beings. Therefore, it is important to quantitatively evaluate DVR images based on human perception. One of the key perceptual factors is whether the visible differences between two images will be observed. However, research concerning this factor is scant. In this paper we propose a perceptual framework, based on Daly's Visible Differences Predictor [12], for the comparative study of direct volume rendered images. It is used to predict the visible differences between direct volume rendered images generated with different algorithms or with the same algorithm under different specifications. The remainder of this paper is organized as follows: We first introduce related work in Section 2. We then describe our framework and review the VDP in Section 3. Next, we compare direct volume rendered images using our framework in Section 4. Finally, we conclude our work and discuss future research directions in Section 5.
2 Related Work
Comparative evaluation of DVR algorithms: Methods for comparing DVR algorithms can be categorized into three classes: 1) Image-level methods [13] [9]: they usually compare DVR images side-by-side using measures such as difference images, mean square error (MSE), and root mean square error (RMSE); 2) Data-level methods [10]: they use raw data and intermediate information obtained during the rendering process for comparison; 3) Analytical methods that calculate error bounds [14]: they analyze the errors in gradient calculations, normal estimation schemes, and filtering and reconstruction operations. In addition to these comparative methods, Meißner et al. [15] performed a practical evaluation of popular DVR algorithms in terms of rendering performance on real-life data sets.
Perception issues in computer graphics: Perception has received considerable attention in graphics research in recent years, especially in the area of global illumination, and much work on perceptually based rendering has been proposed. Most of it focuses on two tasks: 1) establishing stopping criteria for high-quality rendering systems by developing perceptual metrics [16], and 2) optimally managing resource allocation for efficient rendering algorithms by using perceptual metrics [17] [18]. In addition, Rushmeier et al. [19] proposed some metrics for comparing real and synthetic images.
Perception issues in visualization: There is a growing body of research on using perception for visualization. For example, Lu et al. [20] utilized several feature enhancement techniques to create effective and interactive visualizations of scientific and medical data sets. Ebert [21], Interrante [22], and Chalmers and Cater [23] have recently given excellent surveys on perception issues in visualization. Zhou et al. [24] presented a study of image comparison metrics for quantifying the magnitude of difference between a visualization of a computer simulation and a photographic image captured from an experiment. To the best of our knowledge, there has been little work that applies perception to the comparative evaluation of DVR. Inspired by the work of Myszkowski [18], which uses
VDP for global illumination problems, we have developed a framework based on VDP to compare the direct volume rendered images.
3 Our Framework and the Visible Differences Predictor (VDP)
The block diagram of our framework is shown in Figure 1. Our framework contains a volume rendering engine that generates images using the different DVR algorithms it supports. Two direct volume rendered images are produced with the user-specified settings. They are then sent to the VDP, which compares these images and outputs a differences map (the VDP responses).
Fig. 1. Block diagram of our framework
There are many metrics based on the Human Visual System (HVS). The two most popular ones are Daly's Visible Differences Predictor (VDP) [12] and the Sarnoff Visual Discrimination Model (VDM) [25]. Both metrics were shown to perform equally well on average [26]. We chose the VDP for our framework because of its modularity and extensibility. The block diagram of the VDP [18] is shown in Figure 2.
Fig. 2. Block diagram of the Visible Differences Predictor [18]
The VDP receives as input a pair of images (target image and mask image), and outputs a map of probability values that indicates how the differences between those images are perceived [12] [18]. The two input images are first processed by the amplitude nonlinearity, which simulates the adaptation of the HVS to local luminance. Then the resulting images are converted into the frequency domain using the FFT. After that, the contrast sensitivity function (CSF), which simulates the variations in visual sensitivity of the HVS, is applied to the frequency signals. These images are then decomposed into spatial frequency and orientation channels using a pyramid-style cortex transform. A masking function, which is used to increase the
threshold of detectability, is applied to these images, and the minimal threshold elevation value for corresponding channels and pixels is taken by mutual masking of the target/mask images. A psychometric function for predicting the probability of perceived differences is then applied, and finally the predicted probability is visualized. As the VDP is a general-purpose predictor of the differences between images, it can be used to evaluate pairs of images for a wide range of applications. Although the VDP does not support chromatic channels in its input images, in DVR applications many important insights such as depth cues can be well captured in achromatic images. Thus we embedded the VDP in our framework for comparing direct volume rendered images. Since the HVS is more sensitive to differences in contrast and less sensitive to differences between colors, we convert the color direct volume rendered images generated by the volume rendering engine to gray-scale images before sending them to the VDP. All comparisons were performed on a 2.4GHz AMD Opteron 280 processor with 6GB main memory and an NVIDIA Quadro FX 4500 graphics card with 512MB video memory. The resulting VDP responses (the differences map in our framework) are represented as a color map blended with the original gray-scale target image. Color is added to each pixel of this target image to indicate its difference detection probability. Probability values greater than 0.75, the standard threshold value for discrimination tasks [27], are set to red pixels. In the rest of the paper, we usually provide results as a set of three figures: the first two are generated using different algorithms or settings, and the third shows their differences map, encoded in the same color scale as in Figure 3 (d). Background pixels (black pixels) are not included in the calculation of the percentage of red pixels in the VDP result.
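The pre- and post-processing around the VDP described above can be sketched as follows. This is an illustrative summary rather than the authors' code; the Rec. 601 luminance weights and the zero-intensity background test are assumptions.

```python
import numpy as np

def to_grayscale(rgb):
    """Convert a rendered color image to gray scale before sending it to the VDP,
    which operates on achromatic images (Rec. 601 weights assumed)."""
    return 0.299 * rgb[..., 0] + 0.587 * rgb[..., 1] + 0.114 * rgb[..., 2]

def red_pixel_percentage(prob_map, target_gray, p_thresh=0.75):
    """Summarize a VDP differences map: pixels whose detection probability exceeds
    0.75 are counted as 'red', and background (black) pixels of the target image
    are excluded from the percentage."""
    foreground = target_gray > 0
    red = (prob_map > p_thresh) & foreground
    return 100.0 * red.sum() / max(int(foreground.sum()), 1)
```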
4 Comparisons of Direct Volume Rendered Images
In this section, we use our framework to measure the perceptible differences in direct volume rendered images generated with different algorithms or with the same algorithm under different specifications. For the DVR algorithms we select the two most popular methods: GPU-based ray-casting [6] and 3D texture slicing [5]. Several specifications, including shading, gradient estimation scheme, and sampling rate, are chosen for comparison. We limit our case studies to static, regular or rectilinear, scalar volume data only. The size of all images is 512 × 512. All algorithm-independent parameters, such as viewing, transfer functions, and the optical model, are kept constant in each image comparison set in order to have a fair comparison. The experimental results are discussed at the end of this section.
4.1 GPU-Based Ray-Casting Versus 3D Texture Slicing
Two data sets are used here: a 256^3 CT human head and a 256^3 MRI human head. Figure 3 and Figure 4 compare the direct volume rendered images generated with GPU-based ray-casting and 3D texture slicing for these two data sets, respectively. From the VDP responses shown in Figure 3 (c), we can see some noticeable areas of red pixels, which indicate that the differences between Figure 3 (a) and (b) in these regions are quite noticeable (probability > 75%) to human beings, and there are only some light green pixels inside the regions of red pixels. The percentage of red pixels is 11.89% in
the VDP result. In Figure 4 (c), we can see some patches of red pixels and many small patches of light green pixels on the surface of the MRI head; the percentage of red pixels is 28.27%. In both Figure 3 (c) and Figure 4 (c), the red pixels are distributed in the transparent regions. The VDP predictions (Figure 3 (c) and Figure 4 (c)) coincide with the human perception of the visual results shown in Figure 3 (a) and (b), and Figure 4 (a) and (b).
Fig. 3. Comparison of GPU-based ray-casting and 3D texture slicing (256^3 CT human head): (a) GPU-based ray-casting; (b) 3D texture slicing; (c) VDP result (Red pixels: 11.89%); (d) Color scales for encoding the probabilities (%) in (c)
Fig. 4. Comparison of GPU-based ray-casting and 3D texture slicing (256^3 MRI human head): (a) GPU-based ray-casting; (b) 3D texture slicing; (c) VDP result (Red pixels: 28.27%)
4.2 Shading
We are interested in the visual differences between the rendering results in the following two scenarios. First, in pre-shaded DVR, the shading model is evaluated at the grid samples first and the resulting illumination is then interpolated; in contrast, in post-shaded DVR the normal is interpolated first and the shading model is then evaluated for each reconstructed sample. Second, separate color interpolation was used in [1], which interpolates voxel colors and opacities separately before computing their product; Wittenbrink et al. [28] pointed out that it is more correct to multiply color and opacity beforehand at each voxel and then interpolate the product. We compare the pre-shaded and post-shaded DVR images, as well as the opacity-weighted color interpolated [28] and separate color interpolated DVR images, using the GPU-based ray-caster with a 256^3 engine data set.
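The two interpolation orders compared here can be contrasted with the small sketch below for a single reconstructed sample. It is a schematic under stated assumptions (eight voxel neighbors with trilinear weights, and a user-supplied shade(scalar, normal) function), not the paper's renderer.

```python
import numpy as np

def pre_shaded_sample(voxel_colors, weights):
    """Pre-shaded DVR: shade at the grid voxels first, then interpolate the
    already-shaded colors (weights: 8 trilinear weights, voxel_colors: 8x3)."""
    return np.tensordot(weights, voxel_colors, axes=1)

def post_shaded_sample(voxel_normals, voxel_scalars, weights, shade):
    """Post-shaded DVR: interpolate the normal (and scalar) first, then evaluate
    the shading model at the reconstructed sample."""
    n = np.tensordot(weights, voxel_normals, axes=1)
    n = n / max(np.linalg.norm(n), 1e-8)
    s = float(np.dot(weights, voxel_scalars))
    return shade(s, n)

def opacity_weighted_interp(voxel_colors, voxel_alphas, weights):
    """Opacity-weighted color interpolation [28]: interpolate alpha*color and
    alpha, instead of interpolating color and opacity separately."""
    ca = np.tensordot(weights, voxel_colors * voxel_alphas[:, None], axes=1)
    a = float(np.dot(weights, voxel_alphas))
    return ca / max(a, 1e-8), a
```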
Figure 5 shows the comparison of pre-shaded and post-shaded DVR. From Figure 5 (c) we can see that the rendering results of pre-shaded (Figure 5 (a)) and post-shaded (Figure 5 (b)) DVR are quite different. The percentage of red pixels reaches 54.79%, which means that the differences between these two images are very noticeable to human beings.
Fig. 5. Comparison of pre-shaded DVR and post-shaded DVR (256^3 engine): (a) Pre-shaded DVR; (b) Post-shaded DVR; (c) VDP result (Red pixels: 54.79%)
Figure 6 shows the comparison of opacity-weighted color interpolated and separate color interpolated DVR. The percentage of red pixels is 64.00%, and the distribution of the most noticeable differences between Figure 6 (a) and (b) can be easily seen in Figure 6 (c).
Fig. 6. Comparison of opacity-weighted color interpolated and separate color interpolated DVRs (256^3 engine): (a) Opacity-weighted color interpolated DVR; (b) Separate color interpolated DVR; (c) VDP result (Red pixels: 64.00%)
4.3 Gradient Estimation Scheme
In DVR, the gradient estimation scheme expresses the choice of normal computation from the volume data, and it may significantly affect the shading and appearance of the rendering results. Two schemes are compared in Figure 7: the central difference operator [1], which computes gradients at the data locations in the x, y, and z directions and then interpolates the gradient vectors at other locations from the gradients at the eight nearest surrounding data locations, and the intermediate difference operator [29], which computes gradient vectors situated between data locations using differences of the data values at immediate neighbors. Figure 7 (c) shows that the differences between the images are very obvious. The percentage of red pixels is 50.62%.
Fig. 7. Comparison of intermediate and central difference operators in DVR (256^3 aneurism): (a) With intermediate difference operator; (b) With central difference operator; (c) VDP result (Red pixels: 50.62%)
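For clarity, the two gradient operators compared in Figure 7 can be written as below for a scalar volume stored as a 3D array; boundary handling is omitted and unit voxel spacing is assumed.

```python
import numpy as np

def central_difference_gradient(vol, i, j, k):
    """Central difference operator [1]: gradient at a data location from the two
    neighbors along each axis."""
    return np.array([(vol[i + 1, j, k] - vol[i - 1, j, k]) / 2.0,
                     (vol[i, j + 1, k] - vol[i, j - 1, k]) / 2.0,
                     (vol[i, j, k + 1] - vol[i, j, k - 1]) / 2.0])

def intermediate_difference_gradient(vol, i, j, k):
    """Intermediate difference operator [29]: gradient situated between data
    locations, from differences of the immediate neighbors."""
    return np.array([vol[i + 1, j, k] - vol[i, j, k],
                     vol[i, j + 1, k] - vol[i, j, k],
                     vol[i, j, k + 1] - vol[i, j, k]])
```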
4.4 Sampling Rate
Direct volume rendered images generated by a GPU-based ray-caster with different sampling rates are compared in this section. The motivation for this kind of comparison is that we want to reduce the rendering time by lowering the sampling rate without sacrificing too much image quality. A 256^3 CT head data set is used, and two sets of comparison are shown in Figure 8: the upper set compares images generated with 512 samples and with 640 samples, and the lower set compares images generated with 1280 samples and with 1408 samples. Figure 8 (c) shows that the differences between Figure 8 (a) and (b) are quite noticeable; the percentage of red pixels is 39.54%. From Figure 8 (f), we can see that the differences between Figure 8 (d) and (e) are not so noticeable; the percentage of red pixels is 28.46%.
4.5 Discussions
The experimental results show that the differences between two direct volume rendered images are quite noticeable in transparent regions, indicating that different DVR algorithms, or the same algorithm with different specifications, are sensitive to these regions because of the inner structures of the data visualized there. Thus the choice of DVR algorithm or specifications has a major impact on the visual result in the transparent regions. For shading methods, the visual appearance of images generated with pre-shading and post-shading is quite different, as shown in the VDP result (Figure 5 (c)). The image quality of separate color interpolation is considered to suffer from color-bleeding artifacts [28]; the VDP result provides a distribution of where such artifacts may be noticed, namely in the regions with red pixels. For gradient estimation schemes, the intermediate difference operator offers better shading of the images [29], and the VDP result indicates such differences clearly. For sampling rates, the differences between two images with higher sampling rates are smaller than those between two images with lower sampling rates; with sufficiently high sampling rates, further increasing the sampling rate may not improve the image quality. With our framework, the differences between direct volume rendered images can be easily identified quantitatively. Thus it may be used by researchers to determine an appropriate DVR algorithm or set of specifications for their research and applications.
Fig. 8. Comparison of direct volume rendered images with different sampling rates (256^3 CT head): (a) Image rendered with 512 samples; (b) Image rendered with 640 samples; (c) VDP result of (a) and (b) (Red pixels: 39.54%); (d) Image rendered with 1280 samples; (e) Image rendered with 1408 samples; (f) VDP result of (d) and (e) (Red pixels: 28.46%)
5 Conclusions and Future Work
In this paper, we proposed a framework for comparing direct volume rendered images generated by different algorithms or by the same algorithm with different specifications. The two most popular DVR algorithms, GPU-based ray-casting and 3D texture slicing, were selected for comparison. Several specifications, including shading methods (pre-shaded vs. post-shaded, separate color interpolation vs. opacity-weighted color interpolation), gradient estimation schemes (central and intermediate difference operators), and sampling rate, were also compared. The experimental results with real data sets show that we can obtain quantitative and perceptual comparison results with our framework. To conclude, this study is our first attempt to apply perceptual knowledge to direct volume rendered images; it will allow scientists and engineers to better understand volume data. In the future, we would like to perform a psychophysical validation of the VDP for DVR applications and use our framework to conduct a more comprehensive study involving more direct volume rendered images generated by different kernels, such as different filters and optical models. As the computation of the VDP is quite expensive due to the multiscale spatial processing involved in some of its components, we plan to implement the VDP on GPUs and integrate it with existing GPU-based volume rendering algorithms in our framework to provide fast feedback of comparison results. In addition, the VDP may be used for level-of-detail (LOD) selection in large volume visualization, as Wang et al. [30] have done recently. This fast comparative framework can then be used for evaluating direct volume rendered images from a perceptual point of view and steering the DVR process.
Acknowledgments. This work is supported by the Science and Technology Development Fund of Macao SAR grant 001/2005/A.
References
1. Levoy, M.: Display of surfaces from volume data. IEEE Computer Graphics and Applications 8 (1988) 29–37
2. Westover, L.: Footprint evaluation for volume rendering. In: Computer Graphics (ACM SIGGRAPH '90), Volume 24 (1990) 367–376
3. Lacroute, P., Levoy, M.: Fast volume rendering using a shear-warp factorization of the viewing transformation. In: Proceedings of ACM SIGGRAPH '94 (1994) 451–458
4. Rezk-Salama, C., Engel, K., Bauer, M., Greiner, G., Ertl, T.: Interactive volume rendering on standard PC graphics hardware using multi-textures and multi-stage rasterization. In: Proceedings of ACM SIGGRAPH/EUROGRAPHICS Workshop on Graphics Hardware '00 (2000) 109–118
5. Gelder, A.V., Kim, K.: Direct volume rendering with shading via 3D textures. In: Proceedings of ACM/IEEE Symposium on Volume Visualization '96 (1996) 22–30
6. Krüger, J., Westermann, R.: Acceleration techniques for GPU-based volume rendering. In: Proceedings of IEEE Visualization '03 (2003) 287–292
7. Xue, D., Crawfis, R.: Efficient splatting using modern graphics hardware. Journal of Graphics Tools 8 (2003) 1–21
8. Kaufman, A., Mueller, K.: Overview of volume rendering. In: The Visualization Handbook, Edited by C. D. Hansen and C. R. Johnson, Elsevier Butterworth-Heinemann (2005) 127–174
9. Williams, P.L., Uselton, S.P.: Metrics and generation specifications for comparing volume-rendered images. Journal of Visualization and Computer Animation 10 (1999) 159–178
10. Kim, K., Wittenbrink, C.M., Pang, A.: Extended specifications and test data sets for data level comparisons of direct volume rendering algorithms. IEEE Transactions on Visualization and Computer Graphics 7 (2001) 299–317
11. Pommert, A., Höhne, K.H.: Evaluation of image quality in medical volume visualization: The state of the art. In: Proceedings of International Conference on Medical Image Computing and Computer-Assisted Intervention '02 (2002) 598–605
12. Daly, S.: The visible differences predictor: An algorithm for the assessment of image fidelity. In: Digital Image and Human Vision, Edited by A. B. Watson, The MIT Press (1993) 179–206
13. Gaddipatti, A., Machiraju, R., Yagel, R.: Steering image generation with wavelet based perceptual metric. Computer Graphics Forum (EUROGRAPHICS '97) 16 (1997) 241–251
14. Machiraju, R., Yagel, R.: Reconstruction error characterization and control: A sampling theory approach. IEEE Transactions on Visualization and Computer Graphics 2 (1996) 364–378
15. Meißner, M., Huang, J., Bartz, D., Mueller, K., Crawfis, R.: A practical evaluation of popular volume rendering algorithms. In: Proceedings of IEEE Symposium on Volume Visualization '00 (2000) 81–90
16. Ramasubramanian, M., Pattanaik, S.N., Greenberg, D.P.: A perceptually based physical error metric for realistic image synthesis. In: Proceedings of ACM SIGGRAPH '99 (1999) 73–82
17. Bolin, M., Meyer, G.: A perceptually based adaptive sampling algorithm. In: Proceedings of ACM SIGGRAPH '98 (1998) 299–309
18. Myszkowski, K.: The visible differences predictor: Applications to global illumination problems. In: Proceedings of EUROGRAPHICS Workshop on Rendering '98 (1998) 223–236
19. Rushmeier, H., Ward, G., Piatko, C., Sanders, P., Rust, B.: Comparing real and synthetic images: Some ideas about metrics? In: Proceedings of EUROGRAPHICS Workshop on Rendering '95 (1995) 82–91
20. Lu, A., Morris, C.J., Taylor, J., Ebert, D.S., Hansen, C., Rheingans, P., Hartner, M.: Illustrative interactive stipple rendering. IEEE Transactions on Visualization and Computer Graphics 9 (2003) 127–138
21. Ebert, D.S.: Extending visualization to perceptualization: The importance of perception in effective communication of information. In: The Visualization Handbook, Edited by C. D. Hansen and C. R. Johnson, Elsevier Butterworth-Heinemann (2005) 771–780
22. Interrante, V.: Art and science in visualization. In: The Visualization Handbook, Edited by C. D. Hansen and C. R. Johnson, Elsevier Butterworth-Heinemann (2005) 781–806
23. Chalmers, A., Cater, K.: Exploiting human visual perception in visualization. In: The Visualization Handbook, Edited by C. D. Hansen and C. R. Johnson, Elsevier Butterworth-Heinemann (2005) 807–816
24. Zhou, H., Chen, M., Webster, M.F.: Comparative evaluation of visualization and experimental results using image comparison metrics. In: Proceedings of IEEE Visualization '02 (2002) 315–322
25. Lubin, J.: A visual discrimination model for imaging system design and evaluation. In: Vision Models for Target Detection and Recognition, Edited by E. Peli, World Scientific Publishing (1995) 245–283
26. Li, B., Meyer, G., Klassen, R.: A comparison of two image quality models. In: Human Vision and Electronic Imaging III, SPIE Vol. 3299 (1998) 98–109
27. Wilson, H.R.: Psychophysical models of spatial vision and hyperacuity. In: Spatial Vision, Vol. 10, Vision and Visual Dysfunction, Edited by D. Regan, MIT Press (1991) 64–86
28. Wittenbrink, C.M., Malzbender, T., Goss, M.E.: Opacity-weighted color interpolation for volume sampling. In: Proceedings of Symposium on Volume Visualization '98 (1998) 135–142
29. van Scheltinga, J.T., Bosma, M., Smit, J., Lobregt, S.: Image quality improvements in volume rendering. In: Proceedings of the Fourth International Conference on Visualization in Biomedical Computing (VBC '96) (1996) 87–92
30. Wang, C., Garcia, A., Shen, H.W.: Interactive level-of-detail selection using image-based quality metric for large volume visualization. IEEE Transactions on Visualization and Computer Graphics (accepted) (2006)
Hierarchical Contribution Culling for Fast Rendering of Complex Scenes
Jong-Seung Park and Bum-Jong Lee
Department of Computer Science & Engineering, University of Incheon, 177 Dohwa-dong, Nam-gu, Incheon, 402-749, Republic of Korea
{jong, leeyanga}@incheon.ac.kr
Abstract. An intelligent culling technique is necessary when rendering 3D scenes to allow real-time navigation in complex virtual environments. When rendering a large scene, a huge number of objects can reside in the viewing frustum. Virtual urban environments present challenges to interactive visualization systems because of their huge complexity. This paper describes an LOD management (or contribution culling) method for the view-dependent real-time rendering of huge, complex urban scenes. The experimental results show that the proposed LOD management method effectively limits the number of objects sent into the rendering pipeline without notable degradation of rendering quality.
1 Introduction
A large body of previous work has focused on the interactive visualization of complex virtual environments [1][2]. A large number of objects must be drawn to ensure high-quality rendering. However, this slows down the rendering speed, which is not acceptable for many interactive applications. Insignificant objects need not be drawn and should be culled away from the set of objects in the scene. Such a culling process must proceed fast enough to satisfy the real-time requirement, and a suitable spatial data structure is required for fast culling. Spatial data structures can be classified into four categories [3][4]: bounding volume hierarchies, binary space partitioning (BSP) trees, octrees, and quadtrees. A bounding volume is a volume that encloses a set of objects. Bounding volumes constitute a hierarchical structure where a bounding volume in a high-level node corresponds to a large cluster of objects and one in a lowest-level node corresponds to a single object. At rendering time, a large number of objects can be culled out at once using the hierarchical structure of bounding volumes. A binary space partitioning (BSP) tree is a data structure that represents a recursive hierarchical subdivision of a 3D space into convex subspaces; the BSP tree partitions a space by any hyperplane that intersects the interior of that space, and the resulting two subspaces can be further partitioned recursively. An octree is a data structure which defines a shape in 3D space where the space is divided into eight cubes. It can
be further divided into eight smaller cubes recursively, which can be represented by a tree structure where each node corresponds to a cube in the space. The quadtree is a tree structure in which each internal node has up to four children. Quadtrees are most often used to partition a 2D space by recursively subdividing it into four quadrants or regions; the regions may be square or rectangular. Quadtree traversal is required for the visibility test [5]. To reduce the traversal complexity, several efficient quadtree traversal techniques have been proposed, such as a block-priority-based traversal [6]. Quadtrees can also be used to improve rendering quality [7]. Geometry culling is concerned with removing objects that a viewer cannot see during navigation. It reduces the number of polygons sent down the rendering pipeline by excluding faces that need not be drawn. There are several different kinds and stages of culling [3]. View frustum culling is a form of visibility culling based on the premise that only objects in the current view volume must be drawn [8]; objects outside the volume are not visible to the viewer, so they should be discarded. Backface culling can be considered part of the hidden surface removal process of a rendering pipeline: polygons (usually triangles) that are not facing the camera are removed from the rendering pipeline, simply by comparing the polygon's surface normal with the position of the camera. Portal culling divides the models into cells and portals, computing a conservative estimate of which cells are visible at render time. Occlusion culling attempts to determine which portions of objects are hidden by other objects from a given viewpoint; the occluded objects are discarded before they are sent to the rendering pipeline. Occlusion culling is also a form of visible surface determination that attempts to resolve visible (or non-visible) surfaces at a larger granularity than pixel-by-pixel testing [9]. In this paper, we describe a simple and fast contribution culling technique using a hierarchical tree structure. If a polygon is too far from the viewer, it is not drawn, since its contribution to the rendering quality is not significant. Hence, we throw away all objects whose projected area in the screen space is below a threshold; a sketch of this contribution test is given below. In Section 2, we describe the hierarchical tree structure for LOD management. In Section 3, we explain the fast contribution culling and rendering method and provide experimental results using our complex virtual urban environments. Finally, concluding remarks are given in Section 4.
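The contribution test mentioned above can be sketched as follows. It is an illustration under assumptions (a row-vector convention for the 4x4 view-projection matrix, all box corners in front of the camera, and objects stored as dictionaries with a hypothetical 'bbox_corners' field), not the paper's implementation.

```python
import numpy as np

def projected_area_pixels(corners_world, view_proj, viewport_w, viewport_h):
    """Approximate an object's screen-space contribution: project the 8 corners of
    its bounding box and take the area of their 2D bounding rectangle in pixels."""
    pts = np.hstack([corners_world, np.ones((8, 1))]) @ view_proj.T
    ndc = pts[:, :2] / pts[:, 3:4]                      # perspective divide
    px = (ndc[:, 0] * 0.5 + 0.5) * viewport_w           # NDC [-1, 1] -> pixel coordinates
    py = (ndc[:, 1] * 0.5 + 0.5) * viewport_h
    return (px.max() - px.min()) * (py.max() - py.min())

def contribution_cull(objects, view_proj, w, h, min_pixels=16):
    """Discard objects whose projected area falls below the pixel threshold."""
    return [o for o in objects
            if projected_area_pixels(o["bbox_corners"], view_proj, w, h) >= min_pixels]
```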
2 Hierarchical Representation of Scene Structure
For the real-time rendering of complex scenes, the use of levels of detail (LODs) is necessary. LOD management involves decreasing the detail of an object as it moves away from the viewer, which increases rendering efficiency by reducing the number of polygons to be drawn. Real-time rendering issues have been actively studied to satisfy the requirement of real-time computation when drawing large complex scenes. A distributed visibility culling technique was proposed to render large complex scenes using occluders with the BSP tree [10].
Fig. 1. (a) A view frustum with objects and the corresponding bounding volume hierarchy, (b) Subdivision of terrain grid cells
Bittner divided a space using subdivision of a line space and rendered visible objects by selecting occluders [11]. Cohen-Or [12] described a method of partitioning the view space into cells containing a conservative superset of the visible objects. View frustum culling is the easiest form of culling: if a polygon is not in the field of view, it need not be rendered. The culling requires testing the intersection of the polygon with the six planes of the view frustum. Temporal coherence and camera movement coherence can be exploited to reduce the intersection tests. Temporal coherence is based on the fact that if the bounding volume of a node was outside one of the view frustum planes at the previous frame, the node is probably still outside the same plane. Camera translation and rotation coherence exploits the fact that, when navigating in a 3D space, only rotation around an axis or translation is commonly applied to the camera; in those cases, many objects that were outside a plane at the last frame remain outside. Fig. 1(a) shows a view frustum hierarchy. If a non-leaf node is inside the view frustum, all of its child nodes are also inside the view frustum, and this hierarchical structure speeds up the culling process. An appropriate representation for an effective view frustum culling implementation is a quadtree [13][14][15]. The terrain cell partitioning used in quadtree-based view frustum culling is shown in Fig. 1(b). The terrain area is divided into cells of regular squares. A bounding box centered at the camera position is maintained during camera movement, and the system recomputes the quadtree structure only when the camera crosses the bounding box. When cells lie on the frustum boundary, low-level quadtree nodes
Fig. 2. Wire-frame views of the rendered scenes using a simple culling by the number of polygons (upper) and using a simple culling by the distance measure (lower)
30 120
29
100
27
frame rate (fps)
frame rate (fps)
28 26 25 24 23
80 60 40
22 21 2000 4000 6000 8000 10000 12000 14000 # polygons
20
500
1000 1500 distance (m)
2000
2500
Fig. 3. Rendering frame rate for the simple methods according to the number of polygons (left) and the visibility range (right)
are used to refine the areas in order to reduce the number of objects to be sent to the rendering pipeline. Several techniques using hierarchical data structures have been proposed to efficiently handle a large amount of data. Tsuji [16] proposed a method of combining the dynamic control of LOD and the conservative occlusion culling by using a new hierarchical data structure. Face clusters are recursively partitioned to dynamically control the LOD. We use the hierarchical weighted tree structure for the efficient implementation of the view frustum culling and the contribution culling. When rendering a complex environment containing a huge number of objects, the rendering speed should not be slowed down. To avoid such dependency on the number of visible objects, we investigated a criteria to choose the most contributive objects for the high quality rendering. Small objects in the screen space are culled out even though they are near to the camera position.
3 View Space Partitioning and Contribution Culling
Our goal is to guarantee high-fidelity rendering of large models at a low computational load. Hence, it is necessary to reduce the number of polygons sent to the rendering pipeline. We select only a fixed number of the most important visible objects and send them to the rendering pipeline. If small details contribute little or nothing to the scene fidelity, they can be discarded. The contribution of each object is measured as the area of its projection in the screen space. For each object, its bounding volume is computed at the initial loading stage. When creating or updating the hierarchical structure, the bounding volume is projected onto the projection plane and the area of the projection is estimated in pixels. This area is the measure of the contribution of the object. If the number of pixels in the area is smaller than a given threshold, the object is culled away from the rendering pipeline. Our view space partitioning and contribution culling algorithm is as follows. At the initialization stage, we find the quadrilateral visible region of the terrain in the image space; this region corresponds to the root node of the hierarchical structure, and its weight is initially set to zero. Beginning from this node, we recursively split the region into four subregions. For each node, the four corners of its region are always kept; since we know the view frustum geometry and the projection matrix, we can find the world coordinates of the corners. If the number of objects in a region is zero or one, we do not split the region any further. If there are multiple objects in the region, we split it into four subregions; for each subregion we create a node and attach it to the parent node. For each newly created node we evaluate its weight and associate the weight with the node. The weight of a node is defined as the sum of the weights of all objects inside the screen rectangle corresponding to the node, where the weight of an object is the number of pixels of its bounding rectangle in the image space. When a node is created and its weight is evaluated, the weights of all its parent nodes are also updated by adding the weight of the new node. The hierarchical structure is updated only when there is a significant change in camera position or orientation. In a rendering cycle, only a fixed number of significant objects are rendered; this number depends on the desired FPS for a specific machine. To select the significant objects we traverse the tree in breadth-first order. Let N be the desired number of objects to be drawn. We try to choose the N most significant objects starting from the root node n1. If the node has subnodes, we try to choose the N significant objects from the subnodes. For a subnode nk we try to choose (N * wk / w1 + 0.5) objects, where wk is the weight of node nk. Note that the sum of all subnode weights is equal to the weight of the parent node. The selection process is repeated recursively until the desired number of objects has been chosen or there are no more significant objects in the subnodes. A sketch of this selection procedure follows.
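The following sketch illustrates the budgeted, weight-proportional selection just described. It is a simplified reading of the procedure, not the authors' implementation: nodes are assumed to be dictionaries with 'weight', 'children', and 'objects' fields, leaves are assumed to store their objects sorted by decreasing weight, and the denominator of the (N * wk / w1 + 0.5) rule is interpreted as the weight of the node currently being refined.

```python
def select_significant(node, budget, selected):
    """Choose up to 'budget' of the most contributive objects from the weighted
    hierarchy by distributing the budget over children in proportion to their
    weights and recursing until the budget is spent."""
    if budget <= 0 or node["weight"] == 0:
        return
    if not node["children"]:
        selected.extend(node["objects"][:budget])   # leaf: take its best objects
        return
    for child in node["children"]:
        share = int(budget * child["weight"] / node["weight"] + 0.5)
        select_significant(child, share, selected)
```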
We have tested the proposed method on a complex virtual urban environment. Our rendering engine is based on the MS Windows XP platform and is implemented using C++ and DirectX 9.0. The machine has a 2.8GHz Pentium 4 processor with 512MB DDR RAM and an ATI RADEON 9200 GPU with 64MB DRAM. We implemented four different methods and compared the rendering fidelity and the frame rate. The methods are denoted ViewFrustum, FixedRange, FixedPolygon, and Proposed, corresponding to the view frustum culling method, the fixed visibility range method, the fixed number of polygons method, and the proposed method, respectively. The ViewFrustum method renders only the objects that are inside the view frustum. The FixedRange method culls away objects that are placed beyond a specified distance. The FixedPolygon method limits the number of polygons sent to the rendering pipeline; only the given number of polygons nearest to the viewer are sent. The Proposed method chooses the given number of the most significant objects and sends them to the pipeline.
We tested the proposed method by applying it to render a huge number of structures in the Songdo area of the Incheon Free Economic Zone, a metropolitan section which is currently under development. It will occupy a vast stretch of land totaling 13,000 acres by the time all development is completed. Based on the construction implementation plan, we modeled the virtual Songdo city. Currently, our virtual environment contains over a million objects for high-rise landmark business buildings, hotels, research centers, industrial complexes, bio complexes, residential complex buildings, and many other kinds of structures.
We first tested the two simplest simplification methods, FixedPolygon and FixedRange. Wire-frame views of the rendered scenes are shown in Fig. 2. In the upper row, the corresponding polygon limits are 960, 3,000, and 5,040, respectively, from left to right. In the lower row, the corresponding visibility ranges are 300, 1,000, and 2,750 meters, respectively, from left to right. The frame rate when controlling the number of polygons (left) and the visibility range (right) is shown in Fig. 3.
The rendering frame rate should be stable when the viewer is moving freely in the world space. We measured the number of polygons and the rendering frame rate when the camera is rotated about the Y-axis at a fixed position. Fig. 4 shows the number of polygons (left) and the frame rate (right) for the corresponding camera angle. The proposed method provided almost the same quality rendering results at almost the same frame rate, which indicates that the computational load does not depend on the complexity of the current view.
Fig. 4. Comparison of rendering performance for the 360° rotating camera
We compared the rendering fidelity of three methods: FixedRange, FixedPolygon, and Proposed. Two rendered scenes are shown in Fig. 5 and Fig. 6. The three rows in each figure correspond to FixedRange, FixedPolygon, and Proposed, respectively, in order from top to bottom. In the FixedRange method, objects farther than 900 meters are not rendered. In the FixedPolygon method, we limit the number of polygons to 82,720.
Fig. 5. Comparison of rendering fidelity: FixedRange, FixedPolygon, and Proposed
Fig. 6. Comparison of rendering fidelity: FixedRange, FixedPolygon, and Proposed
The rendered scenes from the Proposed method (the third row) show that some significant buildings were rendered which were not rendered by the other methods. In our method, a huge number of tiny objects or mostly occluded objects are not drawn; instead, significant objects at far distances are drawn. The proposed method keeps high-quality rendering without significant performance degradation. Table 1 and Table 2 compare the number of polygons and the frame rate for the four different methods. In the FixedNumber method, the number of polygons is fixed to 82,720. In the Proposed method, the number of polygons is always equal to or less than that of the FixedNumber method.
Table 1. Comparison of the number of polygons
Method        View1     View2     View3     View4     View5     View6
ViewFrustum   106,032   382,016   187,248   317,344   560,992   88,736
FixedRange    84,936    82,720    186,496   177,472   287,264   75,952
FixedNumber   82,720    82,720    82,720    82,720    82,720    82,720
Proposed      78,654    82,316    82,720    82,720    82,720    62,414
Table 2. Comparison of frame rates (unit: frames per second)
Method        View1   View2   View3   View4   View5   View6
ViewFrustum   58      37      51      40      21      61
FixedRange    63      62      51      52      42      64
FixedNumber   62      62      62      62      62      62
Proposed      64      61      61      61      61      64
However, the rendering quality is as good as that of the ViewFrustum method, which draws all visible objects. In our method, a huge number of tiny objects or mostly occluded objects are not drawn; instead, we draw other significant objects.
4 Conclusion
We have presented a fast rendering method for complex urban scenes. The method combines visibility culling and model simplification into a single framework. Using a novel contribution culling technique, less significant objects are culled away from the rendering pipeline. Since only a fixed number of objects is sent to the pipeline, the computational load does not depend on the scene complexity. The perceived quality of the proposed rendering method is almost the same as that of view frustum culling, which draws all visible objects. We assume that, if the screen-projected area of an object is small, then its contribution to human perception is also small. The contributions of objects are represented by weights on the nodes of the hierarchical tree structure. By traversing the nodes, a fixed number of significant visible objects are selected and sent to the pipeline. The rendering speed is determined by the number of significant objects to be drawn and, hence, we obtained a uniform frame rate in rendering. We applied the proposed method to a new metropolitan sector which is currently under development and validated that it is appropriate for the real-time rendering of huge, complex scenes.
Acknowledgements. This work was supported in part by grant No. RTI05-0301 from the Regional Technology Innovation Program of the Ministry of Commerce, Industry and Energy (MOCIE) and in part by the Brain Korea 21 Project in 2006.
Hierarchical Blur Identification from Severely Out-of-Focus Images

Jungsoo Lee1, Yoonjong Yoo1, Jeongho Shin2, and Joonki Paik1

1 Image Processing and Intelligent Systems Laboratory, Department of Image Engineering, Graduate School of Advanced Imaging Science, Multimedia, and Film, Chung-Ang University, Seoul, Korea
[email protected]
http://ipis.cau.ac.kr
2 Department of Web Information Engineering, Hankyong National University, Ansung-city, Korea
[email protected]

Abstract. This paper proposes a blur identification method for severely out-of-focus images. The proposed blur identification algorithm can be used in digital auto-focusing and image restoration. Since it is not easy to estimate a point spread function (PSF) from severely out-of-focus images, a hierarchical approach is adopted: for severe out-of-focus blur, the algorithm estimates and selects a feasible PSF from successively down-sampled images, which contain more useful edge information for PSF estimation. The selected PSF is then reconstructed at the original image resolution by an up-sampling procedure. In order to reconstruct the PSF accurately, a regularized PSF reconstruction algorithm is used. Finally, the severely blurred image is restored with the reconstructed PSF. Experimental results show that the PSF reconstructed by the proposed hierarchical algorithm can efficiently restore severely out-of-focus images.
1 Introduction

Image restoration has been studied extensively over the past decade [1], [5], [6], [7]. Due to the imperfection of practical imaging systems and inappropriate imaging conditions, recorded images are often corrupted by degradations such as blurring, noise, and distortion. Blurring occurs in an imaging system because of the bandwidth reduction of the image. In general, the degradation can be modeled as a linear convolution of the original image with a point-spread function (PSF). Given this linear model, the task of image restoration is to recover or estimate the original image from its degraded observation. Image restoration methods can be classified into two categories: direct and iterative. Direct methods, such as inverse filtering and least-squares filters, may fail to restore blurred images because the inverse problem is ill-posed and its solution is highly sensitive to noise and measurement errors. In practical applications, iterative restoration algorithms are often adopted, as they have two main advantages over direct algorithms [1], [4]. First, the iteration can be terminated prior to convergence,
resulting in a partially deblurred image that often does not exhibit noise amplification. The second advantage is that the inverse operator does not need to be implemented explicitly [1]. However, iterative algorithms require considerable computation. The existing hierarchical techniques in [6], [7] could not provide acceptable performance because it is difficult to accurately estimate the blur parameters in severely blurred images. In this paper, we propose a new hierarchical blur identification and image restoration algorithm. The PSF is estimated by an analysis module based on edge (boundary) information from the image itself [2], [3]. Since it is hard to measure edge information in severely blurred images, a pyramid-based hierarchical algorithm is used. After measuring PSFs from the edges of the image pyramid, PSF reconstruction techniques are used to recover the PSF with the original support size. The paper is organized as follows. In Section 2, the image and hierarchical models are discussed; in Section 3, the PSF coefficient extraction scheme is presented; Section 4 explains the regularized PSF reconstruction. Section 5 covers the experimental results and Section 6 concludes the paper.
2 Hierarchical Model for Image Degradation

2.1 Image Degradation Model

A noisy blurred image is modeled as the output of a 2D linear space-invariant system, which is characterized by its impulse response. The output of a 2D linear system, g(m, n), is given as

g(m, n) = h(m, n) * f(m, n) + n(m, n) = \sum_{k=0}^{N-1} \sum_{l=0}^{N-1} h(m-k, n-l) f(k, l) + n(m, n),   (1)

where h(m, n) represents the PSF, which behaves like a smoothing operator, and f(k, l) is the original input image of size N × N. The last term, n(m, n), represents additive random noise assumed to follow a zero-mean Gaussian distribution. For further simplification, (1) can be expressed in matrix-vector notation as

g = Hf + n,   (2)

where f, g, and n are lexicographically ordered vectors of size N^2 × 1, and H is the N^2 × N^2 blurring operator.
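A minimal sketch of the degradation model in (1)-(2), assuming 'same'-size convolution at the boundaries and an illustrative noise level (neither is prescribed by the paper):

```python
import numpy as np
from scipy.signal import fftconvolve

def degrade(f, h, sigma_n=2.0, seed=0):
    """g = h * f + n: space-invariant blur plus zero-mean Gaussian noise."""
    rng = np.random.default_rng(seed)
    g = fftconvolve(f, h, mode="same")           # h(m, n) * f(m, n)
    n = rng.normal(0.0, sigma_n, size=f.shape)   # zero-mean Gaussian noise n(m, n)
    return g + n
```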
2.2 Hierarchical Model

We assume that a low-resolution PSF is observed. The observation model is then

y_i = W_i g + \eta_i,   (3)

where g and y_i represent the lexicographically ordered desired high-resolution image and the low-resolution image at the i-th level, and W_i and \eta_i represent the corresponding weights and additive Gaussian noise at level i. If the size of the PSF is N × N, then g is an N^2 × 1 vector, y_i is an M^2 × 1 vector, and W_i is an M^2 × N^2 matrix. From Eq. (2), the hierarchical model can be written as

y_i = W_i (Hf + n) + \eta_i = H_i f_i + \bar{\eta}_i,   (4)

where H_i = W_i H, f_i = W_i f, and \bar{\eta}_i = W_i n + \eta_i.
3 Proposed PSF Identification Method

3.1 Hierarchical Structure

The prior PSF estimation method described in Section 3.2 has a constraint: the set of prior PSFs must be finite. Estimation is more effective when the support size of the PSF is small, i.e., when it has fewer coefficients, and when the image is down-sampled the effective support size of the PSF decreases as well. Hence we propose a hierarchical identification scheme, whose outline is shown in Fig. 1. First, we check whether the available blurred image contains feasible edge information for estimating PSFs from the prior PSF sets. If there is not enough edge information, the blurred image is down-sampled repeatedly by a factor of 2 (i.e., by 2^i at level i).
Fig. 1. Hierarchical blur identification procedure
The down-sampled images are denoted by {y, y_1, y_2, …, y_{M-1}, y_M}. By repeating the above process, each down-sampled blurred image is examined for feasible edge information, and H_i in Eq. (4) is identified. After identification, a PSF reconstruction step is required to recover H_1 from H_i.
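A rough sketch of this hierarchical structure is given below. The stopping criterion enough_edges (a mean gradient-magnitude threshold), the threshold value, and the level limit are assumptions; the paper only states that down-sampling continues until feasible edge information is found.

```python
import numpy as np

def build_pyramid(y, max_levels=3, grad_thresh=8.0):
    """Return [y, y_1, ..., y_i]: down-sample by 2 per level until the image
    exposes feasible edge information (or max_levels is reached)."""
    def enough_edges(img):
        gy, gx = np.gradient(img.astype(float))
        return np.mean(np.hypot(gx, gy)) > grad_thresh
    pyramid = [y]
    while len(pyramid) <= max_levels and not enough_edges(pyramid[-1]):
        pyramid.append(pyramid[-1][::2, ::2])   # factor-2 decimation (the W_i role)
    return pyramid
```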
3.2 Prior PSF Estimation

The discrete approximation of an isotropic PSF is shown in Fig. 2. As shown in Fig. 2, many pixels are located off the concentric circles within the region defined as

S_R = \{ (m, n) \mid \sqrt{m^2 + n^2} \le R \},   (5)
Fig. 2. The geometrical representation of a 2D isotropic discrete PSF: The gray rectangles represent inner pixels of the PSF and the empty ones are outer pixels of the PSF
where R is the radius of the PSF. Each pixel within the support S_R is located either on a concentric circle or off it. The pixels on a concentric circle are directly represented by the PSF coefficients a_0, …, a_R, as shown in Fig. 2. Pixels off a concentric circle, on the other hand, are not described by these coefficients, so they are interpolated from the adjacent coefficients on the concentric circles as

h(m, n) = \begin{cases} \alpha a_r + \beta a_{r+1}, & \text{if } (m, n) \in S_R, \\ 0, & \text{otherwise,} \end{cases}   (6)

where a_r and a_{r+1} represent the r-th and (r+1)-st entries of the PSF coefficient vector, respectively. In (6), the index r is determined as

r = \lfloor \sqrt{m^2 + n^2} \rfloor,   (7)
Fig. 3. The pattern images and their edges: (a) original simple pattern image, (b) blurred pattern image, (c) edges of (a), and (d) edges of (b)
where \lfloor \cdot \rfloor denotes truncation to an integer. Based on Fig. 2, \alpha and \beta are determined as

\alpha = r + 1 - \sqrt{m^2 + n^2}, \quad \beta = 1 - \alpha.   (8)
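The radial parameterization of (5)-(8) can be sketched as follows; the final normalization to unit sum is an added assumption, not part of the equations above.

```python
import numpy as np

def radial_psf(a, R):
    """Build a (2R+1)x(2R+1) PSF from radial coefficients a[0..R] using
    h(m, n) = alpha*a_r + beta*a_{r+1} with r = floor(sqrt(m^2 + n^2))."""
    a = np.asarray(a, dtype=float)
    size = 2 * R + 1
    h = np.zeros((size, size))
    for m in range(-R, R + 1):
        for n in range(-R, R + 1):
            rho = np.hypot(m, n)
            if rho > R:
                continue                      # outside S_R -> 0
            r = int(np.floor(rho))            # Eq. (7)
            alpha = r + 1 - rho               # Eq. (8)
            beta = 1.0 - alpha
            a_next = a[r + 1] if r + 1 <= R else 0.0
            h[m + R, n + R] = alpha * a[r] + beta * a_next   # Eq. (6)
    return h / h.sum()                        # normalization (assumed)
```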
This approximation of the 2D discrete PSF applies to popular isotropic blurs such as Gaussian out-of-focus blur, uniform out-of-focus blur, and x-ray scattering. The optimal PSF is selected by calculating the distance between the unit step function and a luminance profile across an edge of the image restored with each of the pre-estimated PSFs,

d^{(k)} = \| s^{(k)} - u \|_2 = \left( \sum_{j=1}^{N} | s_j^{(k)} - u_j |^2 \right)^{1/2}, \quad k = 1, 2, \ldots, K,   (9)

where s^{(k)} represents the edge profile restored with the k-th pre-estimated PSF in the selected auto-focusing region, and u is the unit step function. Here, N denotes the number of points in the edge profile and k is the index of a PSF in the set of K members. The optimal PSF is determined by choosing the k-th PSF with the minimum distance,

\arg\min_k \{ d^{(k)} \}, \quad k = 1, 2, \ldots, K.   (10)
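A small sketch of the selection rule in (9)-(10); producing the restored edge profiles s^(k) (deconvolution with each candidate PSF) is assumed to happen elsewhere, and placing the step at the profile center is also an assumption.

```python
import numpy as np

def select_prior_psf(restored_profiles):
    """restored_profiles: K arrays s^(k), each a luminance profile across an
    edge in the auto-focusing region, restored with the k-th candidate PSF."""
    N = restored_profiles[0].size
    u = (np.arange(N) >= N // 2).astype(float)                # unit step function u
    d = [np.linalg.norm(s - u) for s in restored_profiles]    # Eq. (9)
    return int(np.argmin(d)), d                               # Eq. (10): arg min_k d^(k)
```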
4 PSF Reconstruction Using Spatially Adaptive Constraints

In this section, we describe how the PSF estimated from the down-sampled images can be used to overcome the out-of-focus blur in the higher-resolution images. The estimated PSF is extended and generalized so that it can be applied to its corresponding higher-resolution image.

4.1 PSF Reconstruction for High Resolution Restoration

The PSF reconstruction problem can be stated as the problem of finding the inverse of the down-sampling operator W_i in Eq. (3), or of finding a PSF which is as close as
possible to the original one, subject to an optimality criterion. The PSF reconstruction problem is ill-posed, which manifests itself in the matrix W_i being ill-conditioned. A regularization approach is therefore required. Such an approach controls the noise amplification in the solution by incorporating a priori information about the solution into the restoration process, and it results in the minimization with respect to H of the functional

\Phi(H) = \| H_i - W_i H \|_2^2 + \lambda \| CH \|_2^2.   (11)

The functional \| CH \|_2^2 is called the stabilizing functional, since it bounds the energy of the restored signal at high frequencies, which is primarily due to amplified noise. By minimizing only the first term of \Phi (i.e., \lambda = 0), the solution vector H tends to W_i^{-1} H_i, whereas minimizing only the second term represents the smoothest possible solution. The tradeoff between deconvolution and noise smoothing is determined by the parameter \lambda. By neglecting the terms that are constant with respect to the variable H in Eq. (11), the minimization of \Phi(H) is equivalent to the minimization of f(H), where

f(H) = \frac{1}{2} H^T T H - b^T H   (12)

and

T = W_i^T W_i + \lambda C^T C \quad \text{and} \quad b = W_i^T H_i.   (13)

The necessary condition for a minimum of f(H) results in the system of linear equations

TH = b.   (14)
Iterative methods are used in this work to solve Eq. (14). Among the advantages of iterative algorithms is the possibility of incorporating constraints that express properties of the solution into the PSF reconstruction process. The regularized solution given by (14) can be successively approximated by means of the iteration

H^0 = 0, \quad H^{n+1} = H^n + \beta \left( W_i^T H_i - W_i^T W_i H^n - \lambda(H^n) C^T C H^n \right),   (15)

where \beta represents a multiplier that can be constant or can depend on the iteration index, and \lambda(H) represents the regularization functional.

4.2 Regularization Function
We seek to incorporate the following desirable properties into the regularization function at each iteration step. We further simplify the analysis by assuming, as in [4], that (i) \lambda(H) is inversely proportional to \| H - WH \|_2^2, (ii) \lambda(H) is proportional to \| CH \|_2^2, and (iii) \lambda(H) is larger than zero. Under these assumptions, a regularization function satisfying the properties described above is

\lambda(H) = \theta \left( \frac{\| CH \|_2^2}{\| H - WH \|_2^2 + \delta} \right),   (16)

where \theta is a monotonically increasing function and \delta prevents the denominator from becoming zero. In this paper, Eq. (16) is defined by

\lambda(H) = \ln \left( \frac{\| CH \|_2^2}{\| H - WH \|_2^2 + \delta} \right).   (17)
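A sketch of the iteration (15) combined with the adaptive regularization functional (17), using dense matrices for clarity. The residual in the denominator is interpreted here as the data misfit h_i - W_i H; beta, delta, the iteration count, and the clamping of lambda to nonnegative values (to respect property (iii)) are illustrative choices, not values from the paper.

```python
import numpy as np

def reconstruct_psf(h_i, W, C, beta=0.5, delta=1e-8, n_iter=500):
    """Successive approximation of (15) with the adaptive lambda of (17).
    h_i: lexicographically ordered low-resolution PSF; W: down-sampling
    operator; C: high-pass (Laplacian-like) operator on the high-res grid.
    beta must be chosen small enough for convergence (illustrative value)."""
    H = np.zeros(W.shape[1])                        # H^0 = 0
    for _ in range(n_iter):
        resid = h_i - W @ H                         # data misfit (interpreted residual)
        num = np.linalg.norm(C @ H) ** 2 + delta    # delta also guards log(0) at H^0
        lam = max(np.log(num / (np.linalg.norm(resid) ** 2 + delta)), 0.0)   # cf. (17)
        H = H + beta * (W.T @ h_i - W.T @ (W @ H) - lam * (C.T @ (C @ H)))   # (15)
    return H
```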
5 Experimental Results

In this section, we present results for images synthetically blurred with two different PSFs.

Experiment I: Fig. 4(a) shows the original 512 × 512 image 'man'. We synthetically generate an out-of-focus version of this image by convolving it with a 17 × 17 Gaussian PSF with σ = 3. Noise is then added at a blurred signal-to-noise ratio (BSNR) of 30 dB. The degraded version of 'man' is shown in Fig. 4(b). As the first step of the proposed algorithm, feasible edges should be extracted from the image in Fig. 4(b) according to the algorithm described in Section 3.2. This image, however, lacks feasible edge information, so it is down-sampled to level 2, i.e., by a factor of 4, as shown in Fig. 4(c). Fig. 4(d) shows the restored down-sampled image obtained from Fig. 4(c) using the prior PSF estimation method. The set of prior PSFs consists of 300 Gaussian blurs with support sizes 3, 5, and 7 and variances between 0.04 and 4.04. Finally, Fig. 4(e) shows the image restored at the original support size using the PSF reconstruction described in Section 4. A constrained least squares (CLS) filter is used to restore the out-of-focus image.

Experiment II: A 9 × 9 uniform out-of-focus blur is used in the second experiment, and additive noise at 35 dB BSNR is added to the blurred image. The original image is the 512 × 512 'peppers' shown in Fig. 5(a). Fig. 5(b) shows the image degraded with the 9 × 9 uniform blur. Fig. 5(c) shows the image down-sampled to level 1, i.e., by a factor of 2. Fig. 5(d) shows the restoration of Fig. 5(c) using prior PSF estimation; the set of prior PSFs is the same as in Experiment I. Fig. 5(e) shows the image restored with the CLS filter based on the proposed algorithm.
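For reference, a generic constrained least squares (CLS) restoration of the kind used in the experiments can be sketched as follows; the Laplacian constraint and the regularization weight lam are illustrative, since the paper does not give its CLS parameters.

```python
import numpy as np

def psf2otf(psf, shape):
    """Zero-pad the PSF and circularly shift its center to index (0, 0)."""
    pad = np.zeros(shape)
    pad[:psf.shape[0], :psf.shape[1]] = psf
    pad = np.roll(pad, (-(psf.shape[0] // 2), -(psf.shape[1] // 2)), axis=(0, 1))
    return np.fft.fft2(pad)

def cls_restore(g, psf, lam=1e-2):
    """CLS restoration of blurred image g with the (reconstructed) PSF."""
    H = psf2otf(psf, g.shape)
    lap = np.array([[0, -1, 0], [-1, 4, -1], [0, -1, 0]], dtype=float)
    P = psf2otf(lap, g.shape)                     # Laplacian smoothness constraint
    G = np.fft.fft2(g)
    F = np.conj(H) * G / (np.abs(H) ** 2 + lam * np.abs(P) ** 2)
    return np.real(np.fft.ifft2(F))
```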
Fig. 4. Experimental results of Experiment I: (a) original image, (b) degraded image, (c) image down-sampled from (b) at level 2, (d) restored down-sampled image from (c), and (e) image restored by the proposed algorithm
Fig. 5. Experimental results of Experiment II: (a) original image, (b) degraded image, (c) image down-sampled from (b) at level 1, (d) restored down-sampled image from (c), and (e) image restored by the proposed algorithm
6 Conclusions

In this paper, we have proposed a hierarchical PSF estimation and image restoration algorithm for severely blurred images, which uses a prior PSF estimation approach based on feasible edge information. The algorithm can identify PSFs with a large support size by selecting the optimal PSF at a decreased (down-sampled) resolution, and it is robust against blurs with large support. Experimental results showed that high-frequency details can be restored from severely blurred images. Although the proposed algorithm is currently limited to space-invariant, isotropic PSFs, it can be extended to more general PSF estimation applications.
Acknowledgment

This research was supported by the Korean Ministry of Science and Technology under the National Research Laboratory Project and by the Korean Film Council under the Culture Technology Development Project.
References

1. J. Biemond, R. L. Lagendijk, and R. M. Mersereau, "Iterative Methods for Image Deblurring," Proceedings of the IEEE, vol. 78, no. 5, pp. 856-883, May 1990.
2. S. K. Kim, S. R. Paik, and J. K. Paik, "Simultaneous Out-of-Focus Blur Estimation and Restoration for Digital AF System," IEEE Trans. Consumer Electronics, vol. 44, no. 3, pp. 1071-1075, August 1998.
3. S. K. Kim and J. K. Paik, "Out-of-Focus Blur Estimation and Restoration for Digital Auto-Focusing System," Electronics Letters, vol. 34, no. 12, pp. 1217-1219, June 1998.
4. E. S. Lee and M. G. Kang, "Regularized Adaptive High-Resolution Image Reconstruction Considering Inaccurate Subpixel Registration," IEEE Trans. on Image Processing, vol. 12, no. 7, July 2003.
5. A. K. Katsaggelos, "Iterative Image Restoration Algorithms," Optical Engineering, vol. 28, no. 7, pp. 735-748, July 1989.
6. R. L. Lagendijk, J. Biemond, and D. E. Boekee, "Hierarchical Blur Identification," in Proc. IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP), vol. 4, pp. 1889-1892, April 1990.
7. N. P. Galatsanos, V. Z. Mesarovic, R. Molina, and A. K. Katsaggelos, "Hierarchical Bayesian Image Restoration from Partially Known Blurs," IEEE Trans. on Image Processing, vol. 9, no. 10, pp. 1784-1797, Oct. 2000.
Author Index
Abidi, Besma 751 Abidi, Mongi 751 Adsumilli, Chowdary 1094 Aghagolzadeh, Ali 292 Ahn, Chang-Beom 979 Ahn, Jung-Ho 1185 Akbar, Muhammad 421 Akbari, Mohammad Kazem 1159 Albayrak, Song¨ ul 1133 Altun, O˘ guz 1133 Arif, Fahim 421 Bae, Jong-Wook 1075 Baek, Seong-Joon 832 Barnes, Christopher F. 742 Broszio, Hellward 1 Bu, Jiajun 780 Byun, Hyeran 1185 Carrasco, Miguel 513 Cha, Sung-Hyuk 411 Chan, Kap Luk 702 Chan, Ming-Yuen 1293 Chang, Chia-Hong 1273 Chang, Hsun-Hsien 363 Chang, Long-Wen 1085 Chang, Min-Kuan 621 Chang, Pao-Chi 812, 878 Chang, Shyang-Lih 128 Chang, Un-dong 314 Chang, Yi-Fan 228 Chang, Young-Il 322 Chang, Yukon 969 Chao, Ya-Ting 168 Chen, Chi-Farn 523 Chen, Chi-Ming 504 Chen, Chun 780 Chen, Chung-Hao 751 Chen, Liang-Chien 24, 34, 44, 1283 Chen, Min-Hsin 523 Chen, Qinglong 198 Chen, Sei-Wang 128 Chen, Shu-Yuan 168 Chen, Sibao 118
Chen, W.C. 1168 Chen, Wei-Chih 988 Chen, Yao-Tien 878 Chen, Yu-Kumg 228 Chen, Zhenyong 997 Cherng, Shen 128 Chiang, Cheng-Chin 1303 Chiang, Chien-Hsing 54 Chiang, Tsai-Wei 34 Chiang, Tsung-Han 908 Chiu, Singa Wang 442 Cho, Bo-Ho 1075 Cho, Dong-uk 314 Cho, Hee Young 949 Cho, Seong-Ik 682 Cho, Seongwon 545 Choi, Hyun-Jun 1007 Choi, Jong-Soo 258 Choi, Kwontaeg 652 Choi, Seongjong 1103 Chou, Chang-Min 919 Chuang, Shih-Jung 812 Chung, Sun-Tae 545 Chung, Yun-Chung 128 Cowan, Brett 218 Damghanian, Bita 1159 Deng, Shengchu 270 Duong, Dinh Trieu 791 Eng, How-Lung 702
Fookes, Clinton 692 Freeman, Walter J. 1041 Fu, Bwo-Chau 591 Fuh, Chiou-Shann 1254 Fukui, Shinji 1244 Gall, Juergen 84 Gao, Zhi-Wei 842, 988 Gibbens, Peter W. 672 Ha, Le Thanh 791 Han, Eunjung 64, 631 Han, Seung-Soo 305
Han, Soowhan 1018, 1049 Han, Su-Young 929 Han, Woosup 433 Hashemi, Mahmoud Reza 1159 Hatanaka, Ken’ichi 483 He, Xiao-Chen 238 Hirschm¨ uller, Heiko 96 Hirzinger, Gerd 96 Ho, Yo-Sung 898 Hong, Min-Cheol 771 Hong, Sung-Hoon 832 Hsiao, Kuo-Hsin 44, 1283 Hsieh, Chi-Heng 24 Hsieh, Jaw-Shu 168 Hsin, Hsi-Chin 802 Hsu, Shou-Li 969 Hsu, Wei 1254 Hsu, Wei-Chen 44, 1283 Hu, Bo 474, 1113 Huang, Fay 96 Huang, Shih-Yu 1303 Huang, Wenlong 495 Hwang, Tae-Hyun 682 Hwang, Wen-Jyi 861 Ichimura, Naoyuki 463 Imiya, Atsushi 148, 332 Itoh, Hidenori 1244 Iwahori, Yuji 1244 Iwata, Kenji 611 Jang, Keun Won 832 Jang, Kyung Shik 1018, 1049 Jeng, J.H. 1168 Jeon, Kicheol 959, 1177 Jeon, Su-Yeol 979 Jeong, Jae-Yun 791 Jeong, Jechang 959, 1067, 1177 Jiao, Licheng 495 Jong, Bin-Shyan 54 Joo, In-Hak 682 Jui, Ping-Chang 842, 988 Jung, Do Joon 662 Jung, Keechul 64, 631 Jung, Sung-Hwan 1075 Kanatani, Kenichi 1195 Kang, Byung-Du 453, 553 Kang, Hyun-Soo 852 Kang, Suk-Ho 322
Kawamoto, Kazuhiko 158 Kawanaka, Haruki 1244 Khanmohammadi, Sohrab 292 Kim, Bong-hyun 314 Kim, Byung-Joo 771 Kim, Daijin 353 Kim, Donghyung 1067 Kim, Dong Kook 832 Kim, Dong-Wook 1007 Kim, Haksoo 732 Kim, Hang Joon 662 Kim, Hieyong 1123 Kim, Hye-Soo 791 Kim, Hyoung-Joon 1150 Kim, Jaemin 545 Kim, Jaeseok 1234 Kim, Jong Han 869 Kim, Jongho 959, 1177 Kim, Jong-Ho 453, 553 Kim, Junbeom 545 Kim, Kwan-dong 314 Kim, Manbae 732 Kim, Munchurl 1123 Kim, Namkug 322 Kim, Sang Beom 869 Kim, Sang-Kyun 453, 553 Kim, Seonki 722 Kim, Tae-Jung 722 Kim, Taekyung 1103 Kim, Tae-Yong 852 Kim, Whoi-Yul 1150 Kim, Yong Cheol 1094, 1103 Kim, Yong-Woo 852 Kim, Young-Chul 832 Klette, Reinhard 13, 74, 138, 280 Ko, Sung-Jea 791 Koo, Jae-Il 771 Koschan, Andreas 751 Kuo, Tien-Ying 761 Kwon, Ohyun 1142 Kyoung, Dongwuk 64, 631 Lai, Bo-Luen 1085 Lai, Chin-Lun 712 Lai, Jen-Chieh 888 Langdon, Daniel 601 Lee, Bum-Jong 1324 Lee, Chu-Chuan 812 Lee, Dong-Gyu 929 Lee, Guee-Sang 822
Lee, Hankyu 1123 Lee, Ho 322 Lee, Imgeun 1049 Lee, Jeongjin 322 Lee, Jiann-Der 504 Lee, Jin-Aeon 1150 Lee, Jong-Myung 1150 Lee, Jungsoo 1334 Lee, Mong-Shu 189 Lee, Moon-Hyun 534, 582 Lee, Pai-Feng 54 Lee, Sang-Bong 722 Lee, Sangyoun 652 Lee, Se-hwan 314 Lee, Seok-Han 258 Lee, Seongjoo 1234 Lee, Seongsoo 771 Lee, Wonjae 1234 Lee, Won-Sook 292 Lee, Yillbyung 411 Lee, Yue-Shi 563 Lee, Yunli 64, 631 Li, Fajie 280 Li, Guang 1041 Li, Jiann-Der 504 Li, Shutao 270 Li, Xu 780 Liang, Jianning 208 Lie, Wen-Nung 842, 988 Lin, Daw-Tung 641 Lin, Fu-Sen 189 Lin, Hong-Dar 442 Lin, Huei-Yung 1273 Lin, Rong-Yu 342 Lin, Tsong-Wuu 54 Lin, Xiang 218 Liu, Gang 13 Liu, Li-Chang 504 Liu, Li-yu 189 Liu, Ming-Ju 641 Liu, Rui 96 Liu, Ru-Sheng 168 Liu, Tung-Lin 842 Liu, Yi-Sheng 168 Lo, Kuo-Hua 342, 1303 Lu, Difei 1264 Lu, Hsin-Ju 761 Lu, Ko-Yen 621 Lu, Wei 1030 Luo, Bin 118
Mamic, George 692 Mery, Domingo 513, 601 Migita, Tsuyoshi 1215 Mikulastik, Patrick A. 1 Min, Seung Ki 433 Mitra, Sanjit 1094 Miyamoto, Ryusuke 483 Mo, Linjian 780 Mochizuki, Yoshihiko 148 Mohammad, Rafi 742 Mueller, Klaus 1314 Nakamura, Yukihiro 483 Nam, Byungdeok 1142 Nguyen, Quoc-Dat 305 Ochi, Hiroyuki 483 Oh, Han 898 Oh, Sang-Geun 1150 Oh, Seoung-Jun 979 Oh, Taesuk 1094, 1103 Okabe, Takahiro 178 Ono, Yasuhiro 178 Ou, Chien-Min 861 Page, David L. 751 Paik, Joonki 1142, 1334 Pan, Jiyan 1113 Park, Aaron 832 Park, Daeyong 545 Park, Dong-Chul 305 Park, Gwang Hoon 949 Park, Hanhoon 534, 582 Park, Ho-Chong 979 Park, Hye Sun 662 Park, Jae-Woo 322 Park, Jin 832 Park, Jong-Il 534, 582 Park, Jong-Seung 1324 Peng, Jinglin 270 Qu, Huamin 1293, 1314
Rau, Jiann-Yeou 24, 44, 1283 Roan, Huang-Chun 861 Roh, Kyung Shik 433 Rosenhahn, Bodo 13, 74, 84 Rugis, John 138 Ryoo, Young-Jae 1225
Saito, Hideo 393 Saito, Hiroaki 483 Sakamoto, Masatoshi 1195 Sakaue, Katsuhiko 611 Sato, Yoichi 178 Satoh, Yutaka 611 Scheibe, Karsten 96 Seidel, Hans-Peter 84 Seo, Byung-Kuk 534, 582 Seo, Young-Ho 1007 Seong, Chi-Young 453, 553 Seyedarabi, Hadi 292 Shakunaga, Takeshi 1215 Shen, Yi 373 Shi, Chaojian 403 Shieh, Yaw-Shih 802 Shih, Maverick 621 Shih, Ming-Yu 591 Shin, Hong-Chang 534 Shin, Jeongho 1334 Shin, Yoon-Jeong 822 Sluzek, Andrzej 248 Sohn, Chae-Bong 979 Son, Byungjun 411 Son, Nam-Rye 822 Song, Fengxi 198 Soto, Alvaro 601 Sridharan, Sridha 692 Stone, R. Hugh 672 Strackenbrock, Bernhard 96 Sugano, Hiroki 483 Sugaya, Yasuyuki 1195 Suh, Doug Young 949 Suh, Jae-Won 722 Sun, Mao-Jen 802 Sung, Jaewon 353 Sung, Tze-Yun 802 Suppa, Michael 96 Tai, Wen-Kai 1303 Tan, Zhi-Gang 238 Tang, Long 997 Tang, Zesheng 1314 Teo, Tee-Ann 24, 34, 1283 Thida, Myo 702 Thorm¨ ahlen, Thorsten 1 Torii, Akihiko 148 Tsai, Allen C. 672 Tsai, Fuan 44, 1283 Tsai, Yuan-Ching 168
Tseng, Din-Chang 878, 919 Tseng, Juin-Ling 54 Tsutsui, Hiroshi 483 Uchiyama, Hideaki 393 Um, Sang-Won 258 Urfalioglu, Onay 1 Valkenburg, Robert 74
Wang, Bin 403 Wang, Chia-Hui 888 Wang, Fangshi 1030 Wang, Qiang 373 Wang, Shao-Yu 939 Wang, Sheng-Jyh 908, 939 Wang, Tsaipei 383 Wang, Wen-Hao 573 Wang, Yuanyuan 208 Winstanley, Adam 208 Won, Chee Sun 869 Wong, Hon-Cheng 1314 Wong, Un-Hong 1314 Woo, Dong-Min 305 Woo, Young Woon 1018, 1049 Woodham, Robert J. 1244 Wu, An-Ming 421 Wu, Ruei-Cheng 573 Wu, Yingcai 1293 Wu, Yu-Chieh 563 Xiong, Zhang 997 Xu, De 1030 Xu, Hongli 1030 Yamada, Daisuke 332 Yang, Jie-Chi 563 Yang, Jingyu 198 Yang, Mau-Tsuen 342, 1303 Yang, Su 208 Yang, Zhi 780 Yao, Yi 751 Ye, Xiuzi 1264 Yeh, Chia-Hung 621 Yen, Show-Jane 563 Yoda, Ikushi 611 Yoo, Jae-Myung 822 Yoo, Ji-Sang 1007 Yoo, Yoonjong 1334 Yoon, Sukjune 433
Young, Alistair 218 Yu, Hong-Yeon 832 Yu, Sunjin 652 Yudhistira, Agus Syawal 1123 Yung, Nelson H.C. 238 Zhang, David 198 Zhang, Jian Qiu 1113
Zhang, Jianqiu 474 Zhang, Jin 1041 Zhao, Shubin 1059 Zhao, Yilan 74 Zhou, Hong 1293 Zhou, Hui 1206 Zhou, Yan 474 Žunić, Joviša 108