Series in Machine Perception and Artificial Intelligence, Volume 55
World Scientific
Web Document Analysis
Challenges and Opportunities
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors:
H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA)
Vol. 38: New Approaches to Fuzzy Modeling and Control — Design and Analysis (M. Margaliot and G. Langholz)
Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L. Jain)
Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikainen)
Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues)
Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang)
Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang)
Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli)
Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang)
Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel)
Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang)
Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
Vol. 50: Empirical Evaluation Methods in Computer Vision (Eds. H. I. Christensen and P. J. Phillips)
Vol. 51: Automatic Diatom Identification (Eds. H. du Buf and M. M. Bayer)
Vol. 52: Advances in Image Processing and Understanding: A Festschrift for Thomas S. Huang (Eds. A. C. Bovik, C. W. Chen and D. Goldgof)
Vol. 53: Soft Computing Approach to Pattern Recognition and Image Processing (Eds. A. Ghosh and S. K. Pal)
Vol. 54: Fundamentals of Robotics — Linking Perception to Action (M. Xie)
*For the complete list of titles in this series, please write to the Publisher.
Series in Machine Perception and Artificial Intelligence - Vol. 55
Web Document Analysis
Challenges and Opportunities
Editors
Apostolos Antonacopoulos, University of Liverpool, UK
Jianying Hu, IBM T.J. Watson Research Center, USA
World Scientific, New Jersey • London • Singapore • Hong Kong
Published by World Scientific Publishing Co. Pte. Ltd. 5 Toh Tuck Link, Singapore 596224 USA office: Suite 202, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-238-582-7
Printed by MultiPrint Services
PREFACE
With the ever-increasing use of the Web, a growing number of documents are published and accessed on-line. The emerging issues pose new challenges for Document Analysis. Although the development of XML and the new initiatives on the Semantic Web aim to improve the machine-readability of web documents, they are not likely to eliminate the need for content analysis. This is particularly true for the kind of web documents created as web publications (vs. services) where visual appearance is critical. Such content analysis is crucial for applications such as information extraction, web mining, summarization, content re-purposing for mobile and multi-modal access and web security. The need is evident for discussions to identify the role of Document Analysis in this new technical landscape.

This book is a collection of chapters including state-of-the-art reviews of challenges and opportunities as well as research papers by leading researchers in the field. These chapters are assembled into five parts, reflecting the diverse and interdisciplinary nature of this field. The book starts with Part I, Content Extraction and Web Mining, where four different research groups discuss the application of graph theory, machine learning and natural language processing to the analysis, extraction and mining of web content. Part II deals with issues involved in adaptive content delivery to devices of varying screen size and access modality, particularly mobile devices. Part III focuses on the analysis and management of one of the most common structured elements in web documents — tables. Part IV includes three chapters on issues related to images found in web documents, including text extraction from web images and image search on the web. Finally, the book is concluded in Part V with discussions of new opportunities for Document Analysis in the web domain, including human interactive proofs for web security, the exploitation of web resources for document analysis experiments, the expansion of the concept of "documents" to include multimedia documents, and areas where what has been learnt from traditional Document Image Analysis can be applied to the web domain.
It is our hope that this book will set the scene in the emerging field of Web Document Analysis and stimulate new ideas, new collaborations and new research activities in this important area. We would like to extend our gratitude to Horst Bunke who encouraged and supported us unfailingly in putting together this book. We are also grateful to Ian Seldrup of World Scientific for his helpful guidance and for looking after the final stages of the production. Last but certainly not least, we wish to express our warmest thanks to the Authors, without whose interesting work this book would not have materialised.
Apostolos Antonacopoulos and Jianying Hu
CONTENTS

Preface    v

Part I. Content Extraction and Web Mining
Ch. 1. Clustering of Web Documents Using a Graph Model (A. Schenker, M. Last, H. Bunke and A. Kandel)    3
Ch. 2. Applications of Graph Probing to Web Document Analysis (D. Lopresti and G. Wilfong)    19
Ch. 3. Web Structure Analysis for Information Mining (V. Lakshmi, A.H. Tan and C.L. Tan)    39
Ch. 4. Natural Language Processing for Web Document Analysis (M. Kunze and D. Rosner)    59

Part II. Document Analysis for Adaptive Content Delivery
Ch. 5. Reflowable Document Images (T.M. Breuel, W.C. Janssen, K. Popat and H.S. Baird)    81
Ch. 6. Extraction and Management of Content from HTML Documents (H. Alam, R. Hartono and A.F.R. Rahman)    95
Ch. 7. HTML Page Analysis Based on Visual Cues (Y. Yang, Y. Chen and H.J. Zhang)    113

Part III. Table Understanding on the Web
Ch. 8. Automatic Table Detection in HTML Documents (Y. Wang and J. Hu)    135
Ch. 9. A Wrapper Induction System for Complex Documents and its Application to Tabular Data on the Web (W.W. Cohen, M. Hurst and L.S. Jensen)    155
Ch. 10. Extracting Attributes and their Values from Web Pages (M. Yoshida, K. Torisawa and J. Tsujii)    179

Part IV. Web Image Analysis and Retrieval
Ch. 11. A Fuzzy Approach to Text Segmentation in Web Images Based on Human Colour Perception (A. Antonacopoulos and D. Karatzas)    203
Ch. 12. Searching for Images on the Web Using Textual Metadata (E.V. Munson and Y. Tsymbalenko)    223
Ch. 13. An Anatomy of a Large-Scale Image Search Engine (W.-C. Lai, E.Y. Chang and K.-T. Cheng)    235

Part V. New Opportunities
Ch. 14. Web Security and Document Image Analysis (H.S. Baird and K. Popat)    257
Ch. 15. Exploiting WWW Resources in Experimental Document Analysis Research (D. Lopresti)    273
Ch. 16. Structured Media for Authoring Multimedia Documents (T. Tran-Thuong and C. Roisin)    293
Ch. 17. Document Analysis Revisited for Web Documents (R. Ingold and C. Vanoirbeek)    315

Author Index    333
Part I. Content Extraction and Web Mining
CHAPTER 1
CLUSTERING OF WEB DOCUMENTS USING A GRAPH MODEL
Adam Schenker(1), Mark Last(2), Horst Bunke(3) and Abraham Kandel(1)
(1) Department of Computer Science and Engineering, University of South Florida, 4202 E. Fowler Ave. ENB 118, Tampa, FL 33620, USA. E-mail: aschenke, [email protected]
(2) Department of Information Systems Engineering, Ben-Gurion University of the Negev, Beer-Sheva 84105, Israel. E-mail: [email protected]
(3) Institut für Informatik und angewandte Mathematik, Department of Computer Science, University of Bern, Neubrückstrasse 10, CH-3012 Bern, Switzerland. E-mail: [email protected]
In this chapter we enhance the representation of web documents by utilizing graphs instead of vectors. In typical content-based representations of web documents based on the popular vector model, the structural (term adjacency and term location) information cannot be used for clustering. We have created a new framework for extending traditional numerical vector-based clustering algorithms to work with graphs. This approach is demonstrated by an extended version of the classical k-means clustering algorithm which uses the maximum common subgraph distance measure and the concept of median graphs in the place of the usual distance and centroid calculations, respectively. An interesting feature of our approach is that the determination of the maximum common subgraph for measuring graph similarity, which is an NP-Complete problem, becomes polynomial time with our graph representation. By applying this graph-based k-means algorithm to the graph model we demonstrate a superior performance when clustering a collection of web documents.

1. Introduction

In the field of machine learning, clustering has been a useful and active area of research for some time. In clustering, the goal is to separate a given group of data items (the data set) into groups (called clusters) such that items in the same cluster are similar to each other and dissimilar to the items in other clusters.
Unlike the supervised methods of classification, no labeled examples are provided for training. Clustering of web documents is an important problem for two major reasons. First, clustering a document collection into categories enables it to be more easily browsed and used. Automatic categorization is especially important for the World Wide Web with its huge number of dynamic (time varying) documents and diversity of topics; such features make it extremely difficult to classify pages manually as we might do with small document corpora related to a single field or topic. Second, clustering can improve the performance of search and retrieval on a document collection. Hierarchical clustering methods, for example, are used often for this purpose.1

When representing documents for clustering, a vector model is typically used.2 In this model, each possible term that can appear in a document becomes a feature dimension. The value assigned to each dimension of a document may indicate the number of times the corresponding term appears on it. This model is simple and allows the use of traditional clustering methods that deal with numerical feature vectors in a Euclidean feature space. However, it discards information such as the order in which the terms appear, where in the document the terms appear, how close the terms are to each other, and so forth. By keeping this kind of structural information we could possibly improve the performance of the clustering. The problem is that traditional clustering methods are often restricted to working on purely numeric feature vectors. This comes from the need to compute distances between data items or to calculate some representative of a cluster of items (i.e. a centroid or center of a cluster), both of which are easily accomplished in a Euclidean space. Thus either the original data needs to be converted to a vector of numeric values by discarding possibly useful structural information (which is what we are doing when using the vector model to represent documents) or we need to develop new, customized algorithms for the specific representation.

We deal with this problem by introducing an extension of a classical clustering method that allows us to work with graphs as fundamental data structures instead of being limited to vectors of numeric values. Our approach has two main benefits. First, it allows us to keep the inherent structure of the original documents by modeling each document as a graph, rather than having to arrive at numeric feature vectors that contain only term frequencies. Second, we do not need to develop new clustering algorithms completely from scratch: we can apply straightforward extensions to go from classical clustering algorithms that use numerical vectors to those that deal with graphs. In this chapter we will describe a k-means clustering algorithm that utilizes graphs instead of vectors and illustrate its usefulness by applying it to the problem of clustering a
collection of web documents. We will show how web documents can be modeled as graphs and then clustered using our method. Experimental results will be given and compared with previous results reported for the same web data set based on a traditional vector representation.

The chapter is organized as follows. In Sec. 2 we introduce the mathematical foundations we will use for clustering with graphs. In Sec. 3, we extend the classical k-means algorithm to use graphs instead of numerical vectors. In Sec. 4 we will describe a web page data set and its representation by the graph model. In Sec. 5 we present experimental results and a comparison with previous results from clustering the same web documents when using a vector model and classical k-means algorithms. Conclusions are given in Sec. 6.

2. Graphs: Formal Notation

Graphs are a mathematical formalism for dealing with structured entities and systems. In basic terms a graph consists of vertices (or nodes), which correspond to some objects or components. Graphs also contain edges, which indicate the relationships between the vertices. The first definition we have is that of the graph itself. Each data item (document) in the data set we are clustering will be represented by such a graph:

Definition 1. A graph3 G is formally defined by a 4-tuple (quadruple): G = (V, E, α, β), where V is a set of vertices (also called nodes), E ⊆ V × V is a set of edges connecting the vertices, α: V → Σ_V is a function labeling the vertices, and β: E → Σ_E is a function labeling the edges (Σ_V and Σ_E being the sets of labels that can appear on the nodes and edges, respectively).

The next definition we have is that of a subgraph. One graph is a subgraph of another graph if it exists as a part of the larger graph:

Definition 2. A graph G1 = (V1, E1, α1, β1) is a subgraph5 of a graph G2 = (V2, E2, α2, β2), denoted G1 ⊆ G2, if V1 ⊆ V2, E1 ⊆ E2 ∩ (V1 × V1), α1(x) = α2(x) for all x ∈ V1, and β1((x,y)) = β2((x,y)) for all (x,y) ∈ E1.

Next we have the important concept of the maximum common subgraph, or mcs for short, which is the largest subgraph a pair of graphs have in common:

Definition 3. A graph G is a maximum common subgraph5 (mcs) of graphs G1 and G2, denoted mcs(G1, G2), if: (1) G ⊆ G1, (2) G ⊆ G2, and (3) there is no other
subgraph G′ (G′ ⊆ G1, G′ ⊆ G2) such that |G′| > |G|. (Here |G| is intended to convey the "size" of the graph G; usually it is taken to mean |V|, i.e. the number of vertices in the graph.)

Using these definitions, a method for computing the distance between two graphs using the maximum common subgraph has been proposed:

    d(G1, G2) = 1 − |mcs(G1, G2)| / max(|G1|, |G2|)        (1)

where G1 and G2 are graphs, mcs(G1, G2) is their maximum common subgraph, max(...) is the standard numerical maximum operation, and |...| denotes the size of the graph as we mentioned in Definition 3.6 This distance measure has four important properties.3 First, it is restricted to producing a number in the interval [0, 1]. Second, the distance is 0 only when the two graphs are identical. Third, the distance between two graphs is symmetric. Fourth, it obeys the triangle inequality, which ensures the distance measure behaves in an intuitive way. For example, if we have two dissimilar objects (i.e. there is a large distance between them) the triangle inequality implies that a third object which is similar (i.e. has a small distance) to one of those objects must be dissimilar to the other. Methods for computing the mcs are presented in the literature.7,8 In the general case the computation of mcs is NP-Complete, but as we will see later in the chapter, for our graph representation the computation of mcs is polynomial time due to the existence of unique node labels in the considered application. Other distance measures which are also based on the maximum common subgraph have been suggested. For example, Wallis et al. have introduced a different metric which is not as heavily influenced by the size of the larger graph.9 Fernandez and Valiente combine the maximum common subgraph and the minimum common supergraph in their proposed distance measure.10 However, Eq. 1 is the "classic" version and the one we will use in our implementation and experiments. As yet there are no reported findings to indicate which distance measure is most appropriate for various applications, and this is a topic we will investigate in future research. However, the distance measure of Eq. 1 has the advantage that it requires the least number of computations when compared to the other two distance measures we mentioned above.

Finally we need to introduce the concept of the median of a set of graphs. We define this formally as:
Definition 4. The median of a set of n graphs,11 S = {G1, G2, ..., Gn}, is a graph G such that G has the smallest average distance to all elements in S:

    G = arg min_{G ∈ S} (1/n) Σ_{s ∈ S} dist(s, G)        (2)
Here S is the set of n graphs (and thus |S| = n) and G is the median. The median is defined to be a graph in set S. Thus the median of a set of graphs is the graph from that set which has the minimum average distance to all the other graphs in the set. The distance dist(...) is computed from Eq. 1 above. There also exist the concepts of the generalized median and weighted mean, where we don't require that G be a member of S.11,12 However, the related computational procedures are much more demanding and we do not consider them in the context of this chapter. Note that the implementation of Eq. 2 requires only O(n²) graph distance computations and then finding the minimum among those distances.

3. The Extended k-Means Clustering Algorithm

With our formal notation now in hand, we are ready to describe our framework for extending classical clustering methods which rely on Euclidean distance. The extension is surprisingly simple. First, any distance calculations between data items is accomplished with a graph-theoretical distance measure, such as that of Eq. 1. Second, since it is necessary to compute the distance between data items and cluster centers, it follows that the cluster centers (centroids) must also be graphs if we are to use a method such as that in Eq. 1. Therefore, we compute the representative "centroid" of a cluster as the median graph of the set of graphs in that cluster (Eq. 2). We will now show a specific example of this extension to illustrate the technique. To avoid any confusion, we should briefly emphasize here the difference between our method and the family of "traditional" graph-theoretic clustering algorithms.1,13 In the typical graph clustering case, all the data to be clustered is represented as a single graph where the vertices are the data items and the edge weights indicate the similarity between items. This graph is then partitioned to create groups of connected components (clusters). In our method, each data item to be clustered is represented by a graph. These graphs are then clustered using some clustering algorithm (in this case, k-means) utilizing the distance and median computations previously defined in lieu of the traditional Euclidean distance and centroid calculations.
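As a concrete illustration of Eqs. 1 and 2, the following Python sketch (ours, not part of the original chapter) shows how the mcs-based distance and the median graph could be computed. It assumes, as in the web-document graphs introduced later in this chapter, that node labels are unique within each graph, and it uses the size definition |G| = |V| + |E| of Definition 5; a graph is represented simply as a pair (nodes, edges), where nodes is a set of term labels and edges is a set of (from_label, to_label, section_label) triples.

    def graph_size(g):
        nodes, edges = g
        return len(nodes) + len(edges)            # |G| = |V| + |E|

    def mcs(g1, g2):
        # Maximum common subgraph under unique node labels (polynomial time).
        nodes1, edges1 = g1
        nodes2, edges2 = g2
        common_nodes = nodes1 & nodes2            # Step 1: intersect the term sets
        common_edges = {e for e in edges1 & edges2
                        if e[0] in common_nodes and e[1] in common_nodes}
        return common_nodes, common_edges         # Step 2: keep identically labeled edges

    def graph_distance(g1, g2):
        # Eq. 1: d(G1,G2) = 1 - |mcs(G1,G2)| / max(|G1|,|G2|)
        denom = max(graph_size(g1), graph_size(g2))
        return 0.0 if denom == 0 else 1.0 - graph_size(mcs(g1, g2)) / denom

    def median_graph(graphs):
        # Eq. 2: the member of the set with the smallest average distance to the rest.
        return min(graphs,
                   key=lambda g: sum(graph_distance(g, other) for other in graphs))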
Inputs: the set of n data items and a parameter k, defining the number of clusters to create.
Outputs: the centroids of the clusters (represented as numerical vectors) and for each data item the cluster (an integer in [1,k]) it belongs to.
Step 1. Assign each data item (vector) randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the centroids of each cluster.
Step 3. Given the new centroids, assign each data item to be in the cluster of its closest centroid.
Step 4. Re-compute the centroids as in Step 2. Repeat Steps 3 and 4 until the centroids do not change.
Fig. 1. The basic k-means clustering algorithm.
The k-means clustering algorithm is a simple and straightforward method for clustering data.14 The basic algorithm is given in Fig. 1. This method is applicable to purely numerical data when using Euclidean distance and centroid calculations. The usual paradigm is to represent each data item, which consists of m numeric values, as a vector in the space ℝ^m. In this case the distances between two data items are computed using the Euclidean distance in m dimensions and the centroids are computed to be the mean of the data in the cluster. However, now that we have a distance measure for graphs (Eq. 1) and a method of determining a representative of a set of graphs (the median, Eq. 2) we can apply the same method to data sets whose elements are graphs rather than vectors. The k-means algorithm extended to operate on graphs is given in Fig. 2.

Inputs: the set of n data items (represented by graphs) and a parameter k, defining the number of clusters to create.
Outputs: the centroids of the clusters (represented as graphs) and for each data item the cluster (an integer in [1,k]) it belongs to.
Step 1. Assign each data item (graph) randomly to a cluster (from 1 to k).
Step 2. Using the initial assignment, determine the median of the set of graphs for each cluster using Eq. 2.
Step 3. Given the new medians, assign each data item to be in the cluster of its closest median (as determined by distance using Eq. 1).
Step 4. Re-compute the medians as in Step 2. Repeat Steps 3 and 4 until the medians do not change.
Fig. 2. The k-means algorithm for using graphs.
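A compact Python sketch of the extended k-means loop of Fig. 2 is given below; it is an illustration rather than the authors' implementation, and it assumes a graph distance function (such as graph_distance from the earlier sketch) is passed in as dist.

    import random

    def median_of(cluster, dist):
        # Eq. 2: member of the cluster with the smallest total distance to the others.
        return min(cluster, key=lambda g: sum(dist(g, other) for other in cluster))

    def graph_kmeans(graphs, k, dist, max_iter=100, seed=0):
        rng = random.Random(seed)
        # Step 1: random initial assignment of each graph to one of k clusters.
        assignment = [rng.randrange(k) for _ in graphs]
        for _ in range(max_iter):
            # Steps 2/4: recompute the median graph of every non-empty cluster.
            medians = []
            for c in range(k):
                members = [g for g, a in zip(graphs, assignment) if a == c]
                medians.append(median_of(members, dist) if members else rng.choice(graphs))
            # Step 3: reassign each graph to the cluster of its closest median.
            new_assignment = [min(range(k), key=lambda c: dist(g, medians[c]))
                              for g in graphs]
            if new_assignment == assignment:      # stop when the assignment is stable
                break
            assignment = new_assignment
        return assignment, medians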
4. Clustering of Web Documents using the Graph Model

In order to demonstrate the performance and possible benefits of the graph-based approach, we have applied the extended k-means algorithm to the clustering of a collection of web documents. Some research into performing
clustering of web pages is reported in the literature.15-18 Similarity of web pages represented by graphs has been discussed in a recent work by Lopresti and Wilfong.19 Their approach differs from ours in that they extract numerical features from the graphs (such as node degree and vertex frequency) to determine page similarity rather than comparing the actual graphs; they also use a graph representation based on the syntactical structure of the HTML parse tree rather than the textual content of the pages. However, the work we are most interested in for evaluation purposes is that of Strehl et al.20 In that paper, the authors compared the performance of different clustering methods on web page data sets. This paper is especially important to the current work, since it presents baseline results for a variety of standard clustering methods including the classical k-means using different similarity measures. The data set we will be using is the Yahoo "K" series,* which was one of the data sets used by Strehl et al. in their experiments.20 This data set consists of 2,340 Yahoo news pages downloaded from www.yahoo.com in their original HTML format. Each page is assigned to one of 20 categories based on its content, such as "technology", "sports" or "health". Although a pre-processed version of the data set is also available in the form of a term-document matrix and a list of stemmed words, we are using the original documents in order to capture their inherent structural information using graphs. We represent each web document as a graph using the following method:
• Each term (word) appearing in the web document, except for stop words (see below), becomes a vertex in the graph representing that document. This is accomplished by labeling each node (using the node labeling function α, see Definition 1) with the term it represents. Note that we create only a single vertex for each word even if a word appears more than once in the text. Thus each vertex in the graph represents a unique word and is labeled with a unique term not used to label any other node.
• If word a immediately precedes word b somewhere in a "section" s of the web document (see below), then there is a directed edge from the vertex corresponding to a to the vertex corresponding to b with an edge label s. We take into account certain punctuation (such as a period) and do not create an edge when these are present between two words.
This data set is available at: ftp://ftp.cs.umn.edu/dept/users/boley/PDDPdata
• We have defined three "sections" for the web pages. First, we have the section title, which contains the text in the document's TITLE tag and any provided keywords (meta-data). Second we have the section link, which is text appearing in clickable links on the page. Third we have the section text, which comprises any of the readable text in the document (this includes link text but not title and keyword text).
• We perform removal of stop words, such as "the", "and", "of", etc., which are generally not useful in conveying information, by removing the corresponding nodes and their incident edges. We also perform simple stemming by checking for common alternate forms of words, such as the plural form.
• We remove the most infrequently occurring words on each page, leaving at most m nodes per graph (m being a user provided parameter). This is similar to a dimensionality reduction process for vector representations.
This form of knowledge representation is a type of semantic network, where nodes in the graph are objects and labeled edges indicate the relationships between objects.21 The conceptual graph is a type of semantic network sometimes used in information retrieval.22 With conceptual graphs, terms or phrases related to documents appear as nodes. The types of relations (edge labels) include synonym, part-whole, antonym, and so forth. Conceptual graphs are used to indicate meaning-oriented relationships between concepts, whereas our method indicates structural relationships that exist between terms in a web document. We give a simple example of our graph representation of a web document in Fig. 3. The ovals indicate nodes and their corresponding term labels. The edges are labeled according to title (TI), link (L), or text (TX). The document represented by the example has the title "YAHOO NEWS", a link whose text reads "MORE NEWS", and text containing "REUTERS NEWS SERVICE REPORTS". This novel method of document representation is somewhat similar to that of directed acyclic word graphs (or DAWGs); however, our nodes represent words rather than letters, our model allows for cycles in the graphs, and the edges are labeled.
Fig. 3. An example graph representation of a web document.
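To make the construction concrete, the following sketch (ours, not the authors') builds the graph of Fig. 3 from the three document sections described above; the stop-word list and the punctuation handling are reduced to bare stubs, and the output follows the (nodes, edges) convention used in the earlier sketches.

    STOP_WORDS = {"the", "and", "of", "a", "an", "in", "to", "for"}   # tiny illustrative list

    def build_graph(sections, max_nodes=30):
        # sections maps a section label ('title', 'link', 'text') to its word sequence.
        counts, edges = {}, set()
        for label, words in sections.items():
            words = [w.upper() for w in words if w.lower() not in STOP_WORDS]
            for w in words:
                counts[w] = counts.get(w, 0) + 1
            # a directed edge for each pair of adjacent words, labeled by the section
            for a, b in zip(words, words[1:]):
                edges.add((a, b, label))
        # keep only the max_nodes most frequent terms (dimensionality reduction)
        nodes = set(sorted(counts, key=counts.get, reverse=True)[:max_nodes])
        edges = {(a, b, s) for (a, b, s) in edges if a in nodes and b in nodes}
        return nodes, edges

    # The example of Fig. 3:
    example = build_graph({
        "title": ["YAHOO", "NEWS"],
        "link":  ["MORE", "NEWS"],
        "text":  ["REUTERS", "NEWS", "SERVICE", "REPORTS"],
    })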
When determining the size of a graph representing a web document (Definition 3) we use the following method:

Definition 5. The size of a graph G = (V, E, α, β), denoted |G|, is defined as |G| = |V| + |E|.

Recall that the typical definition is simply |G| = |V|. However, for this application it is detrimental to ignore the contribution of the edges, which indicate the number of phrases identified in the text. Further, it is possible to have more than one edge between two nodes since we are labeling the edges according to the document section in which the terms are adjacent separately. Before moving on to the experiments, we mention an interesting feature this model of representing documents has on the time complexity of determining the distance between two graphs (Eq. 1). In the distance calculation we are using the maximum common subgraph; the determination of this in the general case is known to be an NP-Complete problem.24 However, our graphs for this application have the following property:

    ∀x, y ∈ V, α(x) = α(y) if and only if x = y        (3)
In other words, each node in a graph has a unique label assigned to it, namely the term it represents. No two nodes in a graph will have the same label. Thus the maximum common subgraph Gm = (Vm, Em, αm, βm) of a pair of graphs G1 and G2 can be created using the following method:

Step 1. Create the set of vertices by Vm = {x : x ∈ V1 and x ∈ V2 and α1(x) = α2(x)}.
Step 2. Create the set of edges by Em = {(x, y) : x, y ∈ Vm and β1((x,y)) = β2((x,y))}.
The first step states the set of vertices in the maximum common subgraph is just the intersection of the set of terms of both graphs. Each term in the intersection becomes a node in the maximum common subgraph. The second step creates the edges by examining the set of nodes created from the previous step. We examine all pairs of nodes in the set; if both nodes contain an edge between them in both original graphs and share a common label, then we add the edge to the maximum common subgraph. Note that this is different from the concept of induced maximum common subgraph, where nodes are added only if they are connected by an edge in both original graphs. If there is a common subset of nodes but different edge configurations in the original graphs, we still add the nodes using our method. We also note that in document clustering, the nodes, which represent terms, are much more important than the edges, which only indicate the relationships between the terms (i.e., followed by). We see that the complexity of this method is O(|V1||V2|) for the first step and O(|Vmcs|²) for the second step. Thus it is O(|V1||V2| + |Vmcs|²) ≤ O(|V|² + |Vmcs|²) = O(|V|²) overall if we substitute |V| = max(|V1|, |V2|).

5. Experimental Results

In order to compare clustering methods with differing distance measures, Strehl et al. proposed the use of an information-theoretic measure of clustering performance.20 This measurement is given as:
    Λ^M = (2/n) Σ_{l=1..k} Σ_{h=1..g} n_l^(h) · log_{k·g} [ (n_l^(h) · n) / ( Σ_{i=1..k} n_i^(h) · Σ_{i=1..g} n_l^(i) ) ]        (4)
where n is the number of data items, k is the desired number of clusters, g is the actual number of categories, and n_l^(h) is the number of items in cluster l that belong to category h.

w · x_i − b ≥ +1 for y_i = +1
w · x_i − b ≤ −1 for y_i = −1

while minimizing the vector 2-norm of w. The SVM problem in linearly separable cases can be efficiently solved using quadratic programming techniques, while the non-linearly separable cases can be solved by either introducing soft margin hyperplanes, or by mapping the original data vectors to a higher dimensional space where the data points become linearly separable.14,15 One reason why SVMs are very powerful is that they are very universal learners. In their basic form, SVMs learn linear threshold functions. Nevertheless, by a simple "plug-in" of an appropriate kernel function, they can be used to learn polynomial classifiers, radial basis function (RBF) networks, and three-layer sigmoid neural nets.15 For our experiments, we used the SVMlight system implemented by Thorsten Joachims.16

4. Data Collection and Ground Truthing

Since there is no publicly available web table ground truth database, researchers tested their algorithms on different data sets in the past.2,3,4 However, their data sets either had limited manually annotated table data (e.g., 75 HTML pages in Penn et al.,2 175 manually annotated table tags in Yoshida et al.4) or were collected from some specific domains (e.g., a set of tables selected from airline information pages were used in Chen et al.3). To develop our machine learning based table detection algorithm, we
needed to build a general web table ground truth database of significant size.

4.1. Data Collection
Instead of working within a specific domain, our goal of data collection was to get tables of as many different varieties as possible from the web. At the same time, we also needed to insure that enough samples of genuine tables were collected for training purposes. Because of the latter practical constraint we biased the data collection process somewhat towards web pages that are more likely to contain genuine tables. A set of key words often associated with tables were composed and used to retrieve and download web pages using the Google search engine. Three directories on Google were searched: the business directory and news directory using key words {table, stock, bonds, figure, schedule, weather, score, service, results, value}, and the science directory using key words {table, results, value}. A total of 2,851 web pages were downloaded in this manner and we ground truthed 1,393 HTML pages out of these (chosen randomly among all the HTML pages).
Truthing
HTML File
J C^__ Parser ___J^>
r—T—n Hierarchy
_ \_ CAdding attributes^ HTML with attributes and unique index to each table(ground truth)
r
C^Validation ~[^)
(a)
(b)
Fig. 2. (a) The diagram of ground truthing procedure; (b) A snapshot of the ground truthing interface.
There has been no previous report on how to systematically generate web table ground truth data. To build a large web table ground truth
database, a simple, flexible and complete ground truth protocol is required. Figure 2(a) shows the diagram of our ground truthing procedure. We created a new Document Type Definition (DTD) which is a superset of the W3C HTML 3.2 DTD. We added three attributes for the <table> element, which are "tabid", "genuine table" and "table title". The possible value of the second attribute is yes or no and the value of the first and third attributes is a string. We used these three attributes to record the ground truth of each leaf <table> node. The benefit of this design is that the ground truth data is inside the HTML file format. We can use exactly the same parser to process the ground truth data. We developed a graphical user interface for web table ground truthing using the Java language. Figure 2(b) is a snapshot of the interface. There are two windows. After reading an HTML file, the hierarchy of the HTML file is shown in the left window. When an item is selected in the hierarchy, the HTML source for the selected item is shown in the right window. There is a panel below the menu bar. The user can use the radio button to select either genuine table or non-genuine table. The text window is used to input the table title.
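As an illustration of how such annotated files can be consumed with an ordinary HTML parser, the short sketch below (ours, not from the chapter; the exact attribute spellings are an assumption based on the names quoted above) collects the ground truth labels of the leaf table elements using BeautifulSoup.

    from bs4 import BeautifulSoup

    def read_ground_truth(html):
        # Return (tabid, genuine?, title) for every leaf <table> in an annotated page.
        soup = BeautifulSoup(html, "html.parser")
        records = []
        for table in soup.find_all("table"):
            if table.find("table") is not None:      # skip non-leaf tables
                continue
            # attribute names below are a guess at how the quoted labels are serialized
            records.append((table.get("tabid"),
                            table.get("genuine_table", "no").lower() == "yes",
                            table.get("table_title", "")))
        return records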
4.3. Database Description

The resulting database is summarized in Table 1. It contains 14,609 <table> elements, out of which 11,477 are leaf <table> elements. Among the leaf <table> elements, 1,740 are genuine tables and the remaining 9,737 are non-genuine tables. Note that even in this somewhat biased collection, genuine tables only account for less than 15% of all leaf table elements.

Table 1. Summary of the database.

<table> elements:        14,609
Leaf <table> elements:   11,477
Genuine tables:           1,740
Non-genuine tables:       9,737
5. Experiments

A hold-out method was used to evaluate our table classifier. We randomly divided the data set into nine parts. The classifiers were trained on eight parts and then tested on the remaining one part. This procedure was repeated nine times, each time with a different choice for the test part. Then the combined nine part results were averaged to arrive at the overall performance measures.13
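A minimal version of this evaluation protocol, written with scikit-learn and using stand-in feature vectors (the real system's features are the layout, content type and word group features described earlier), could look as follows; it is an illustration, not the authors' code.

    import numpy as np
    from sklearn.model_selection import KFold
    from sklearn.tree import DecisionTreeClassifier

    def nine_fold_holdout(X, y, seed=0):
        # Train on eight parts, test on the ninth, rotate, and average the scores.
        scores = []
        for train_idx, test_idx in KFold(n_splits=9, shuffle=True,
                                         random_state=seed).split(X):
            clf = DecisionTreeClassifier(random_state=seed)
            clf.fit(X[train_idx], y[train_idx])
            scores.append(clf.score(X[test_idx], y[test_idx]))
        return float(np.mean(scores))

    # toy stand-in data: 900 "tables" described by 16 features each
    X = np.random.RandomState(0).rand(900, 16)
    y = (X[:, 0] > 0.85).astype(int)      # pretend roughly 15% are genuine tables
    print(nine_fold_holdout(X, y))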
The output of the classifier is compared with the ground truth and the standard performance measures precision (P), recall (R) and F-measure (F) are computed. Let Ngg, Ngn, Nng represent the number of samples in the categories "genuine classified as genuine", "genuine classified as non-genuine", and "non-genuine classified as genuine", respectively. The performance measures are defined as:

    R = Ngg / (Ngg + Ngn),    P = Ngg / (Ngg + Nng),    F = 2RP / (R + P)
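These three measures translate directly into code; the helper below is a straightforward transcription (ours) of the definitions above, with purely hypothetical counts used in the example call.

    def prf(n_gg, n_gn, n_ng):
        # Recall, precision and F-measure from the three counts defined above.
        recall = n_gg / (n_gg + n_gn) if (n_gg + n_gn) else 0.0
        precision = n_gg / (n_gg + n_ng) if (n_gg + n_ng) else 0.0
        f = 2 * recall * precision / (recall + precision) if (recall + precision) else 0.0
        return recall, precision, f

    # hypothetical counts chosen to give roughly R = 94.25%, P = 97.50%
    print(prf(1640, 100, 42))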
For comparison among different features we report the performance measures when the best F-measure is achieved using the decision tree classifier. The results of the table detection algorithm using various features and feature combinations are given in Table 2. For both the naive Bayes based and the kNN based word group features, 120 word clusters were used (M = 120).

Table 2. Results using various feature groups and the decision tree classifier.

            R(%)     P(%)     F(%)
L          87.24    88.15    87.70
T          90.80    95.70    93.25
LT         94.20    97.27    95.73
LTW-VS     94.25    97.50    95.88
LTW-NB     95.46    94.64    95.05
LTW-KNN    89.60    95.94    92.77

L: Layout features only. T: Content type features only. LT: Layout and content type features. LTW-VS: Layout, content type and vector space based word group features. LTW-NB: Layout, content type and naive Bayes based word group features. LTW-KNN: Layout, content type and kNN based word group features.
Overall, the best results were produced with the combination of layout, content type and vector space based word group features, achieving an F-measure of 95.88%. Table 3 compares the performances of different learning algorithms using the full feature set. The learning algorithms tested include the decision tree classifier and the SVM algorithm with two different kernels: linear and radial basis function (RBF).

Table 3. Experimental results using different learning algorithms.

                  R(%)     P(%)     F(%)
Decision Tree    94.25    97.50    95.88
SVM (linear)     93.91    91.39    92.65
SVM (RBF)        95.98    95.81    95.89
As seen from the table, for this application the SVM with radial basis function kernel performed much better than the one with linear kernel. It achieved an F-measure of 95.89%, comparable to the 95.88% achieved by the decision tree classifier. Figure 3 shows two examples of correctly classified tables, where Fig. 3(a) is a genuine table and Fig. 3(b) is a non-genuine table.
Fig. 3. Examples of correctly classified tables: (a) a genuine table; (b) a non-genuine table.
Figure 4 shows a few examples where our algorithm failed. Figure 4(a) was misclassified as a non-genuine table, likely because its cell lengths are highly inconsistent and it has many hyperlinks which is unusual for genuine
Fig. 4. Examples of misclassified tables: (a), (b) genuine tables misclassified as non-genuine; (c), (d) non-genuine tables misclassified as genuine.
tables. Figure 4(b) was misclassified as non-genuine because its HTML source code contains only two <tr> tags. Instead of the <tr> tag, the author used <p> tags to place the multiple table rows in separate lines. This points to the need for a more carefully designed pseudo-rendering process. Figure 4(c) shows a non-genuine table misclassified as genuine. A close examination reveals that it indeed has good consistency along the row direction. In fact, one could even argue that this is indeed a genuine table, with implicit row headers of Title, Name, Company Affiliation and Phone Number. This example demonstrates one of the most difficult challenges in table understanding, namely the ambiguous nature of many table instances (see the paper by Hu et al.17 for a more detailed analysis on that). Figure 4(d) was also misclassified as a genuine table. This is a case where layout features and the kind of shallow content features we used are not enough; deeper semantic analysis would be needed in order to identify the lack of logical coherence which makes it a non-genuine table. For comparison, we tested the previously developed rule-based system2 on the same database. The initial results (shown in Table 4 under "Original Rule Based") were very poor. After carefully studying the results from
the initial experiment we realized that most of the errors were caused by a rule imposing a hard limit on cell lengths in genuine tables. After deleting that rule the rule-based system achieved much improved results (shown in Table 4 under "Modified Rule Based"). However, the proposed machine learning based method still performs considerably better in comparison. This demonstrates that systems based on hand-crafted rules tend to be brittle and do not generalize well. In this case, even after careful manual adjustment in a new database, it still does not work as well as an automatically trained classifier.
Table 4. Experimental results of the rule based system.

                         R(%)     P(%)     F(%)
Original Rule Based     48.16    75.70    61.93
Modified Rule Based     95.80    79.46    87.63
A direct comparison to other previous results3,4 is not possible currently because of the lack of access to their systems. However, our test database is clearly more general and far larger than the ones used in Chen et al.3 and Yoshida et al.,4 while our precision and recall rates are both higher.
6. Conclusion and Future Work

We presented a machine learning based table detection algorithm for HTML documents. Layout features, content type features and word group features were used to construct a feature set. Two well known classifiers, the decision tree classifier and the SVM, were tested along with these features. For the most complex word group feature, we investigated three alternatives: vector space based, naive Bayes based, and weighted K nearest neighbor based. We also constructed a large web table ground truth database for training and testing. Experiments on this large database yielded very promising results and reconfirmed the importance of combining layout and content features for table detection. Our future work includes handling more different HTML styles in pseudo-rendering and developing a machine learning based table interpretation algorithm. We would also like to investigate ways to incorporate deeper language analysis for both table detection and interpretation.
7. Acknowledgment
We would like to thank Kathie Shipley for her help in collecting the web pages, and Amit Bagga for discussions on vector space models.
References

1. M. Hurst, "Layout and Language: Challenges for Table Understanding on the Web", First International Workshop on Web Document Analysis, Seattle, WA, USA, September 2001 (ISBN 0-9541148-0-9) and also at http://www.csc.liv.ac.uk/~wda2001.
2. G. Penn, J. Hu, H. Luo, and R. McDonald, "Flexible Web Document Analysis for Delivery to Narrow-Bandwidth Devices", Sixth International Conference on Document Analysis and Recognition (ICDAR'01), Seattle, WA, USA, September 2001, pp. 1074-1078.
3. H.-H. Chen, S.-C. Tsai, and J.-H. Tsai, "Mining Tables from Large Scale HTML Texts", The 18th International Conference on Computational Linguistics, Saarbrucken, Germany, July 2000, pp. 166-172.
4. M. Yoshida, K. Torisawa, and J. Tsujii, "A Method to Integrate Tables of the World Wide Web", First International Workshop on Web Document Analysis, Seattle, WA, USA, September 2001 (ISBN 0-9541148-0-9) and also at http://www.csc.liv.ac.uk/~wda2001.
5. D. Mladenic, "Text-Learning and Related Intelligent Agents", IEEE Expert, July-August 1999.
6. J. Hu, R. Kashi, D. Lopresti, and G. Wilfong, "Medium-Independent Table Detection", SPIE Document Recognition and Retrieval VII, San Jose, CA, January 2000, pp. 291-302.
7. T. Joachims, "A Probabilistic Analysis of the Rocchio Algorithm with TFIDF for Text Categorization", The 14th International Conference on Machine Learning, Nashville, Tennessee, 1997, pp. 143-151.
8. Y. Yang and X. Liu, "A Re-Examination of Text Categorization Methods", 22nd International Conference on Research and Development in Information Retrieval (SIGIR'99), Berkeley, California, 1999, pp. 42-49.
9. M. F. Porter, "An Algorithm for Suffix Stripping", Program, 14(3), 1980, pp. 130-137.
10. D. Baker and A. K. McCallum, "Distributional Clustering of Words for Text Classification", SIGIR'98, Melbourne, Australia, 1998, pp. 96-103.
11. A. McCallum, K. Nigam, J. Rennie, and K. Seymore, "Automating the Construction of Internet Portals with Machine Learning", Information Retrieval Journal, 3, 2000, pp. 127-163.
12. T. M. Mitchell, Machine Learning, McGraw-Hill, 1997.
13. R. Haralick and L. Shapiro, Computer and Robot Vision, Addison Wesley, 1992.
14. V. N. Vapnik, The Nature of Statistical Learning Theory, Springer, New York, 1995.
15. C. Cortes and V. Vapnik, "Support-Vector Networks", Machine Learning, 20, August 1995, pp. 273-296.
16. T. Joachims, "Making Large-Scale SVM Learning Practical", Advances in Kernel Methods - Support Vector Learning, B. Scholkopf, C. Burges and A. Smola (eds.), MIT Press, 1999.
17. J. Hu, R. Kashi, D. Lopresti, G. Nagy, and G. Wilfong, "Why Table Ground-Truthing is Hard", Sixth International Conference on Document Analysis and Recognition (ICDAR'01), Seattle, WA, September 2001, pp. 129-133.
purpose of learning wrappers. We then propose a learning system that can exploit multiple document representations. In more detail, the system includes a single general-purpose "master learning algorithm" and a varying number of smaller, special-purpose "builders", each of which can exploit a different view of a document. Implemented builders make use of DOM-level and token-level views of a document; views that take more direct advantage of visual characteristics of rendered text, like font size and font type; and views that exploit a high-level geometric analysis of tabular information. Experiments show that the learning system achieves excellent results on real-world wrapping tasks, as well as on artificial wrapping tasks previously considered by the research community.
2. Issues in Wrapper Learning

One important challenge faced in wrapper learning is picking the representation for documents that is most suitable for learning. Most previous wrapper learning systems represent a document as a linear sequence of tokens or characters.2,3 Another possible scheme is to represent documents as trees, for instance using the document-object model (DOM). This representation is used by a handful of wrapper learning systems4,5 and many wrapper programming languages (e.g., Sahuguet et al.6). Unfortunately, both of these representations are imperfect. In a web site, regularities are most reliably observed in the view of the information seen by human readers, that is, in the rendered document. Since the rendering is a two-dimensional image, neither a linear representation nor a tree representation can encode it adequately. One case in which this representational mismatch is important is the case of complex HTML tables. Consider the sample table of Fig. 1. Suppose we wish to extract the third column of Fig. 1. This set of items cannot easily be described at the DOM or token level: for instance, the best DOM-level description is probably "td nodes (table data nodes in HTML) such that the sum of the column width of all left-sibling td nodes is 2, where column width is defined by the colspan attribute if it is present, and is defined to be one otherwise." Extracting the data items in the first column is also complex, since one must eliminate the "cut-in" table cells (those labeled "Actresses" and "Singers") from that column. Again, cut-in table cells have a complex, difficult-to-learn description at the DOM level ("td nodes such that no right-sibling td node contains visible text").
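To see why such descriptions are awkward, the following sketch (ours, not the chapter's; it assumes the page has been parsed with BeautifulSoup) spells out the quoted DOM-level rule for the third column of a table like Fig. 1, colspan bookkeeping included.

    from bs4 import BeautifulSoup

    def third_column_cells(table_html):
        # td nodes whose left siblings span a total column width of exactly 2
        soup = BeautifulSoup(table_html, "html.parser")
        cells = []
        for td in soup.find_all("td"):
            width_before = sum(int(sib.get("colspan", 1))
                               for sib in td.find_previous_siblings("td"))
            if width_before == 2:
                cells.append(td.get_text(strip=True))
        return cells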
Fig. 1. A difficult page to wrap. (The example page reads "Check out this KOOL Stuff!!!" and contains a table whose first column lists Lucy Lawless and Angelina Jolie under a cut-in cell "Actresses", and Madonna and Brittany Spears under a cut-in cell "Singers", with "images" and "links" cells in the second and third columns of each row, followed by "Last modified: 11/1/01.")
Rendered page: "My Favorite Musical Artists", shown as a bulleted list: Muddy Waters, John Hammond, Ry Cooder, ...
HTML implementation 1: (h3)My Favorite Musical Artists(/h3)
Fig. 2. A rendered page, with two HTML implementations. The second implementa tion exhibits irregularity at the DOM level, even though the rendering has a regular appearance. Another problemmatic case is illustrated by Fig. 2. Here a rendering of a web page is shown, along with two possible HTML representations. In the first case, the HTML is very regular, and hence the artist names to be extracted can be described quite easily and concisely. In the second case,
158
Cohen et al.
the underlying HTML is irregular, even though it has the same appear ance when rendered. (Specifically, the author alternated between using the markup sequences (i)(b)foo(/b)(/i) and (b)(i)bar(/i)(/b) in constructing italicized boldfaced text.) This sort of irregularity is unusual in pages that are created by database scripts; however, it is quite common in pages that are created or edited manually. In summary, one would like to be able to concisely express concepts like "all items in the second column of a table" or "all italicized boldfaced strings". However, while these concepts can be easily described in terms of the rendered page, they may be hard to express in terms of a DOM- or token-level representation.
3. A n Extensible Wrapper Learning System The remarks above are not intended to suggest that DOM and token repre sentations are bad—in fact they are often quite good. We claim simply that neither is sufficient to successfully model all wrappers concisely. In view of this, we argue that an ideal wrapper-learning system will be able to exploit several different representations of a document—or more precisely, several different views of a single highly expressive baseline representation. In this chapter we will describe such a learning system, called WL2. 3.1. Architecture
of the Learning
System
The basic idea in WL2 is to express the bias of the learning system as an ordered set of "builders". Each "builder" is associated with a certain restricted language L. However, the builder for L is not a learning algorithm for L. Instead, to facilitate implementation of new "builders", a separate master learning algorithm handles most of the real work of learning, and builders need support only a small number of operations on L. Builders can also be constructed by composing other builders in certain ways. For instance, two builders for languages L\ and Li can be combined to obtain builders for the language [L\ oL2) (composition), or the language {L\ AL 2 ) (conjunction). We will describe builders for several token-based, DOM-based, and hy brid representations, as well as for representations based on properties of the expected rendering of a document. Specifically, we will describe builders for representations based on the expected formatting properties of text nodes (font, color and so on), as well as representations based on the expected geometric layout of tables in HTML.
Wrapper Induction
3.2. A Generic
and Its Application
Representation
to Tabular Data
for Structured
159
Documents
We will begin with a general scheme for describing subsections of a docu ment, and then define languages based on restricted views of this general scheme. We assume that structured documents are represented with the docu ment object model (DOM). (For pedagogical reasons we simplify this model slightly in our presentation.) A DOM tree is an ordered tree, where each node is either an element node or a text node. An element node has an ordered list of zero or more child nodes, and contains a string-valued tag (such as t a b l e , h i , or l i ) and also zero or more string-valued attributes (such as href or src). A text node is normally denned to contain a single text string, and to have no children. To simplify the presentation, however, we will assume that a text node containing a string s of length k will have k "character node" children, one for each character in s.
Items to be extracted from a DOM tree are represented as spans. A span consists of two span boundaries, a right boundary and a left boundary. Conceptually, a boundary corresponds to a position in the structured doc ument. We define a span boundary to be a pair (n,k), where n is a node and k is an integer. A span boundary points to a spot between the fc-th and the (k + l)-th child of n. For example, if n\ is the rightmost text node in Fig. 3, then (ni, 0) is before the first character of the word "Provo", and (n\, 5) is after the last character of the word "Provo". The span with left boundary (ni, 0) and right boundary (m, 5) corresponds to the text "Provo". As another example, if ri2 is the leftmost l i node in Fig. 3, then the span from (ri2,0) to (ri2,1) contains the text "Pittsburgh, PA". It also
Cohen et al,
160
corresponds to a single DOM node, namely, the leftmost anchor (a) node in the DOM tree. A span that corresponds to a single DOM node is called a node span. 3.3. A Generic
Representation
for
Extractors
A predicate Pi(si, S2) is a binary relation on spans. To execute a predicate pi on span s\ means to compute the set EXECUTE{pi, s\) = {S2 ■ Pi(si, S2)}. For example, consider a predicate p(s\, S2) which is defined to be true if and only if (a) s\ contains S2, and (b) S2 is a node span corresponding to an ele ment node with tag l i . Let si be a span encompassing the entire document of Fig. 3. Then EXECUTE(p, si) contains two spans, each corresponding to an l i node in the DOM tree, one containing the text "Pittsburgh, PA", and one containing the text "Provo, UT". We require that every predicate is one-to-many and that membership in a predicate can be efficiently decided (i.e., given two spans si and S2, one can easily test if p(s\, S2) is true.) We also assume that predicates are executable—i.e., that EXECUTE (p,s) can be efficiently computed for any initial span s. The extraction routines learned by our wrapper induction system are represented as executable predicates. Since predicates are sim ply sets, it is possible to combine predicates by Boolean operations like conjunction or disjunction; similarly, one can naturally say that predicate Pi is "more general than" predicate pj (i.e. it defines a superset). We note that these semantics can be used for many commonly used extraction languages, such as regular expressions and XPath queries. 3 Many of the predicates learned by the system are stored as equivalent regular expressions or XPath queries. 3.4. Representing
Training
Data
A wrapper induction system is typically trained by having a user identify items that should be extracted from a page. Since it is inconvenient to label all of a large page, a user should have the option of labeling some initial section of a page. To generate negative data, it is assumed that the user completely labeled the page or an initial section of it. A training set T for our system thus consists of a set of triples (Outeri,Scope1,InnerSeti), (Outer2,Scope2,InnerSet2), ■■■, where in each pair Outerj is usually a span corresponding to a web page, Scopei a
X P a t h is a widely-used declarative language for addressing nodes in an XML (or XHTML) document. 7
Wrapper Induction
and Its Application
to Tabular Data
161
is the part of Outeri that the user has completely labeled, and InnerSeti is the set of all spans that should be extracted from Outeri. Constructing positive data from a training set is trivial. The positive examples are simply all pairs {{Outeri, Innerij) : Innerij € InnerSeti}. When it is convenient we will think of T as this set of pairs. While it is not immediately evident how negative data can be con structed, notice that any hypothesized predicate p can be tested for con sistency with a training set T by simply executing it on each outer span in the training set. The spans in the set InnerSeti — EXECUTE(p, Outeri) are false negative predictions for p, and the false positive predictions for p are spans s in the set {s S EXECUTE(p, Outeri) - InnerSeti : contains (Scope, s)} 3.5. Designing
(1)
a Bias
The bias of the learning system is represented by an ordered list of builders. Each builder BL corresponds to a certain restricted extraction language11 L. To give two simple examples, consider these restricted languages: • -^bracket is defined as follows. Each concept c € Lbracket is defined by a pair (£, r), where I and r are strings. Each pair corresponds to a predicate pn^{s\,S'i), which is true if and only if s°\ exceeds a certain threshold (currently, two). Here, a is a heuristic parameter that weakens the impact of a when pos(a) has a small value; it is currently set to two.
3. List Analysis

In this section we describe a method to analyze lists based on the extracted ontologies. As stated in the introduction, a Web page given as an input to our system is first decomposed into a sequence of blocks bounded by separators. The State Sequence Estimation Module (SSEM) determines a sequence of states for the block sequence, by using an ontology extracted from HTML tables. Before explaining the list analysis algorithm, we formally define the terms used in the remainder of this chapter. After that, we describe our SSEM module, which estimates a sequence of states.

3.1. Term Definition
In the following we give definitions of the terms used in the subsequent sections.
• A page is a sequence of page fragments, each of which is either a block or a separator.
• A block b is a sequence of words.
Fig. 9. An example of HMMs for block sequences.
• A separator is a sequence of separator elements, which are HTML tags or special characters. The special characters are those that tend to be used as block boundaries; they are defined a priori.
• An ontology is a sequence ((A_1, V_1), (A_2, V_2), ..., (A_m, V_m)), where A_i and V_i correspond to the i-th attribute in the ontology and its value, respectively. A_i is a sequence of strings used to express the i-th attribute, and V_i is that used to express its value. The function size(i), whose value is the number of tables from which A_i and V_i were collected, is defined for each i.
• A role is a pair (l, i), where l ∈ {att, val} and i ∈ {1, 2, ..., m}. l, or a label, denotes whether a block represents an attribute or a value, and i, or an index, denotes the attribute's (or value's) number in the ontology. In addition, there are other roles denoted by (sentence, −) and (none, −).
• A state is defined for each block and has a role as its value. We denote the label of the state s by l(s) and the index by i(s).
These definitions are illustrated in the sketch below.
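A compact sketch of these terms as data structures; the names and typing choices are ours, chosen only for illustration, not part of the chapter's system.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Block = List[str]        # a block: a sequence of words
Separator = List[str]    # a separator: HTML tags and/or special characters such as ":"

@dataclass
class OntologyEntry:
    attribute_strings: List[str]   # A_i: strings expressing the i-th attribute
    value_strings: List[str]       # V_i: strings expressing its value
    size: int                      # size(i): number of tables the pair was collected from

Ontology = List[OntologyEntry]

# A role is a pair (label, index); label is "att", "val", "sentence" or "none",
# and the index is None for the (sentence, -) and (none, -) roles.
Role = Tuple[str, Optional[int]]

@dataclass
class State:
    role: Role
    def label(self) -> str:            # l(s)
        return self.role[0]
    def index(self) -> Optional[int]:  # i(s)
        return self.role[1]
```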
3.2. State Sequence Estimation Module
Given a sequence of blocks B = (b_1, b_2, ..., b_n) and a sequence of separators C = (c_1, c_2, ..., c_{n−1}), the State Sequence Estimation Module (SSEM) estimates the most probable sequence of states S = (s_1, s_2, ..., s_n). Here, s_i is the state given to the block b_i, and c_i is the separator between blocks b_i and b_{i+1}.
Currently we have 23 special characters, including ":", "#", "=", etc. An explanation for the sentence and none roles will be given in Sec. 3.2.1. Separators before the first block and those after the last block are ignored.
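Since Fig. 9 casts the block sequence as a hidden Markov model, the most probable state sequence can be found by Viterbi-style dynamic programming. The sketch below is only illustrative and is not the chapter's actual estimator; the log_init, log_trans and log_emit scoring functions are assumed placeholders for whatever probabilities the model assigns.

```python
def viterbi(blocks, states, log_emit, log_trans, log_init):
    """Illustrative Viterbi decoder over a block sequence.
    blocks: sequence of blocks; states: candidate roles (hashable);
    log_init(s), log_trans(s_prev, s), log_emit(s, block): assumed log-scores."""
    best = {s: log_init(s) + log_emit(s, blocks[0]) for s in states}
    backpointers = []
    for b in blocks[1:]:
        prev, best, ptrs = best, {}, {}
        for s in states:
            q_best, score = max(((q, prev[q] + log_trans(q, s)) for q in states),
                                key=lambda qs: qs[1])
            best[s] = score + log_emit(s, b)
            ptrs[s] = q_best
        backpointers.append(ptrs)
    # backtrack from the highest-scoring final state
    last = max(best, key=best.get)
    path = [last]
    for ptrs in reversed(backpointers):
        path.append(ptrs[path[-1]])
    return list(reversed(path))
```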
The SSEM estimates the sequence S for which P(S|B) takes the highest value. In other words, ...

... T, and "·" denotes an inner product. We can then rewrite f as

f(x) = w · Φ(x),  where  w = Σ_{i=1}^{n} α_i Φ(x_i).    (1)
Thus, by using K we are implicitly projecting the training data into a different (often higher dimensional) feature space T. The SVM then computes the α_i's that correspond to the maximal margin hyperplane in T. By choosing different kernel functions, we can implicitly project the training data from X into spaces T for which hyperplanes in T correspond to more complex decision boundaries in the original space X.
Intuitively, SVM_Active works by combining the following three ideas, sketched in code after the list:
(1) SVM_Active regards the task of learning a target concept as one of learning an SVM binary classifier. An SVM captures the query concept by separating the relevant images from the irrelevant images with a hyperplane in a projected space, usually a very high-dimensional one. The projected points on one side of the hyperplane are considered relevant to the query concept and the rest irrelevant.
(2) SVM_Active learns the classifier quickly via active learning. The active part of SVM_Active selects the most informative instances with which to train the SVM classifier. This step ensures fast convergence to the query concept in a small number of feedback rounds.
(3) Once the classifier is trained, SVM_Active returns the top-k most relevant images. These are the k images farthest from the hyperplane on the query concept side.
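As a rough sketch of steps (2) and (3) only, using scikit-learn rather than the authors' implementation and hypothetical feature arrays, one feedback round might be written as follows. It assumes the labeled pool already contains at least one relevant and one irrelevant image.

```python
import numpy as np
from sklearn.svm import SVC

def feedback_round(X_labeled, y_labeled, X_unlabeled, n_feedback=20, top_k=10):
    """One SVM_Active-style round: train an SVM, pick the unlabeled images
    closest to the hyperplane for user feedback, and rank the remainder."""
    clf = SVC(kernel="rbf").fit(X_labeled, y_labeled)
    margins = clf.decision_function(X_unlabeled)      # signed distance to the hyperplane
    # most informative: smallest absolute margin (closest to the boundary)
    ask_user = np.argsort(np.abs(margins))[:n_feedback]
    # most relevant: largest positive margin (farthest on the relevant side)
    top_relevant = np.argsort(-margins)[:top_k]
    return ask_user, top_relevant
```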
3.3. Hybrid Algorithms
Although MEGA and SVM_Active are effective active learning algorithms, we believe combining their strengths can result in even better learning algorithms. In this section we propose a hybrid algorithm. As discussed in Section 3.2, the SVM_Active scheme needs at least one positive and one negative example to start. MEGA is not restricted by this seeding requirement, and it is able to find relevant examples quickly by refining the sampling boundary. It is therefore logical to employ MEGA to perform the initialization task. Once some relevant images are found, the refinement step can be executed by either MEGA or SVM_Active. Thus, we can have three execution alternatives.
(1) MEGA only. Use MEGA all the way to learn a concept.
(2) SVM_Active only. Use random sampling to find the first relevant example(s) and then use SVM_Active to learn a concept.
(3) Pipeline learning. Use MEGA to find initial relevant objects, and then switch to SVM_Active for refining the binary classifier and ranking returned objects.
4. Experiments

We implemented both MEGA and SVM_Active in C and C++, and tested them on an Intel Pentium III workstation running Linux. We have implemented an industrial-strength prototype with all features discussed in this chapter. We tested our prototype with the intent of answering three central questions.
(1) Do MEGA and SVM_Active learn to model complex query concepts in very high dimensional spaces both quickly and with a small number of iterations?
(2) Is using MEGA to find the first relevant object(s) and then switching to SVM_Active a more effective algorithm than using the two learning algorithms individually?
(3) Does the percentage of relevant data affect learning performance? (If we reduce the matching data from 5% of the dataset to 1%, does it take more iterations to learn query concepts?)
For our empirical evaluation of our learning methods we used a twenty-category image dataset where each category consisted of 200 to 300 images. Images for this dataset were collected from Corel Image CDs and the Internet. The image set consists of architecture, bears, cities, clouds, couples, elephants, fabrics, fireworks, food, flowers, insects, ladies, landscape, objectionable images, planets, tigers, tools, vehicles, waves, and textures.
We separate our queries into two categories: 5% and 1% queries. The matching images for each of the more specific query concepts such as "purple flowers" and "white bears" account for about 1% of the total dataset. More general concepts such as "bears", "flowers" and "architectures" have about 5% matching images in the dataset. For each experiment, we report results for these two categories of queries separately.
In the experiments, we report the precision of the top-10 and top-20 retrievals to measure performance. We run each experiment ten times and report the average precision.

5. Results and Discussion

Here we present our analysis of the results, organized with regard to the central questions listed at the start of Section 4.
Fig. 7. Precision versus iterations for top-10 retrieval: (a) 5% and (b) 1% queries.
5.1. MEGA, SVM_Active, and Pipelining MEGA with SVM_Active
For top-10 retrieval with 5% matching data (Fig. 7(a)), SVM_Active clearly outperforms MEGA. The major weakness of SVM_Active is in initialization, that is, finding the first few positive samples. Such a weakness may not seriously affect this experiment, since there is a high probability of finding one of the 5% positive examples through random sampling. For queries with only 1% matching data (Fig. 7(b)), such a weakness becomes more
Fig. 8. Precision versus iterations for top-20 retrieval: (a) 5% and (b) 1% queries.
significant, because it substantially degrades SVM_Active's performance. Especially in the first two iterations, the precision of SVM_Active is very low. Overall, the precision of MEGA and SVM_Active is similar for 1% queries.
The hybrid algorithm (pipelining MEGA with SVM_Active) clearly outperforms both SVM_Active and MEGA. The difference in precision is more significant for queries with 1% matching data than for those with 5% matching data. This trend indicates the strength of the hybrid algorithm in handling more specific query concepts. Note that the precision of the hybrid
algorithm reaches 90% after 4 and 9 iterations, respectively, for these two experiments (Figs. 7(a) and (b)).
For top-20 retrieval (Figs. 8(a) and (b)), the precision of the hybrid algorithm remains the highest, followed by SVM_Active and MEGA. Again, the differences in their performance are more significant for queries with 1% matching data than for those with 5% matching data. For 1% queries, the hybrid algorithm achieves 100% precision after 12 iterations, which indicates the robustness of this new enhancement.

5.2. Observations
Our experiments have answered the questions that we stated at the beginning of Section 4.
(1) MEGA and SVM_Active can learn complex query concepts in high-dimensional spaces in a small number of user iterations.
(2) The hybrid scheme that uses MEGA to find the first relevant object(s) and then switches to SVM_Active works significantly better than using the two learning algorithms individually.
(3) When the matching data is scarce, the number of iterations required to learn a concept increases.

6. Comparison with Related Work

Figure 9 depicts the architectures of some traditional image search engines. A shaded box in the figure indicates that the component is used in the corresponding search paradigm. Figure 9(a) shows the architecture of the current Google image search engine. Google uses text only and does not consider image content. For a text-based image retrieval engine, keywords entered by users may have multiple senses and hence may not precisely characterize the users' query concept. For example, a search with the keyword "baby" returns images ranging from baby photos to baby drawings, statues, and cartoons.
Figure 9(b) shows the architecture of a typical content-based image retrieval system. The following systems fit into this category: IBM QBIC,1 Stanford SIMPLIcity,2 Columbia VisualSEEk and WebSEEk,3 CMU Informedia4 and UCSB NeTra.5 These systems try to capture the user's query concepts through one "good" example. This restrictive assumption may not be able to model a query concept such as flowers of different kinds that have diversified visual features. Our system overcomes
Fig. 9. System architecture of three search paradigms: (a) text-based; (b) content-based.
the restriction by learning through multiple examples, both positive and negative, provided by users.
Our system (Fig. 6) employs a sampling method to minimize the number of samples required to learn a binary classifier that separates the objects matching the concept from the rest. To improve classification accuracy, researchers have recently proposed a number of ensemble techniques such as bagging,26 arcing27 and boosting.28,29 These ensemble schemes enjoy success in improving classification accuracy through reduction in bias or variance, but they do not help reduce the number of samples required to learn a query concept. In fact, most ensemble schemes actually increase learning time because they introduce learning redundancy for improving prediction accuracy.29,30,31
To reduce the number of required samples, researchers in the machine learning community have conducted several studies of active learning for classification. The query-by-committee algorithm32,33 uses a distribution over all possible classifiers and attempts to greedily reduce the entropy of this distribution. This general-purpose algorithm has been applied in a number of domains using classifiers (such as Naive Bayes classifiers34,35) for which specifying and sampling classifiers from a distribution is natural. Probabilistic models such as the Naive Bayes classifier provide interpretable models and principled ways to incorporate prior knowledge and data with missing values. However, they typically do not perform as well as discriminative methods such as SVMs.36,37
7. Conclusion
This chapter proposes an image retrieval system that uses active learning to capture complex and subjective query concepts. We propose using MEGA to quickly find objects relevant to the query concept, then switching to SVM_Active once relevant objects are found. The experimental results show that this hybrid approach outperforms MEGA and SVM_Active when they are used separately. An on-line system prototype is available at the web site.38
We are expanding our work in at least two directions. First, we have recently discovered a perceptual distance function, the dynamic partial function (DPF),39,40 that works substantially better than traditional distance functions in finding transformed images (e.g., framed, cropped, rotated, and downsampled images of a seed image). We plan to use DPF as a kernel function to examine its effectiveness in query-concept learning. Second, we have developed an image annotation scheme,41 which provides images with a set of keywords associated with confidence factors for supporting multimodal image retrieval and annotation. We believe that with the perception-based paradigm as the core, a keyword and content integrated system provides the solution to large-scale image search on the Web.
References

1. M. Flickner, H. Sawhney, J. Ashley, Q. Huang, B. Dom, M. Gorkani, J. Hafner, D. Lee, D. Petkovic, D. Steele and P. Yanker, "Query by Image and Video Content: the QBIC System", IEEE Computer, vol. 28, no. 9, 1995, pp. 23-32.
2. L. Wu, C. Faloutsos, K. Sycara and T. R. Payne, "FALCON: Feedback Adaptive Loop for Content-Based Retrieval", The 26th Very Large Data Bases Conference, September 2000, pp. 297-306.
3. J. R. Smith and S.-F. Chang, "VisualSEEk: A Fully Automated Content-Based Image Query System", ACM Multimedia Conference, 1996.
4. S. Stevens, M. Christel and H. Wactlar, "Informedia: Improving Access to Digital Video", Interactions, vol. 1, no. 3, 1994, pp. 67-71.
5. W. Y. Ma and B. Manjunath, "NeTra: A Toolbox for Navigating Large Image Databases", Proc. Int'l Conf. Image Processing, 1997, pp. 568-571.
6. E. Chang, B. Li and C. Li, "Towards Perception-Based Image Retrieval", IEEE Content-Based Access of Image and Video Libraries, June 2000, pp. 101-105.
7. S. Tong and E. Chang, "Support Vector Machine Active Learning for Image Retrieval", Proceedings of ACM International Conference on Multimedia, October 2001, pp. 107-118.
8. S. Arya, D. Mount, N. Netanyahu, R. Silverman and A. Wu, "An Optimal Algorithm for Approximate Nearest Neighbor Searching in Fixed Dimensions", Proceedings of the 5th ACM-SIAM Symposium on Discrete Algorithms, 1994, pp. 573-582.
9. K. Clarkson, "An Algorithm for Approximate Closest-point Queries", Proceedings of the 10th Annual Symposium on Computational Geometry, 1994, pp. 160-164.
10. P. Ciaccia and M. Patella, "PAC Nearest Neighbor Queries: Approximate and Controlled Search in High-Dimensional and Metric Spaces", International Conference on Data Engineering, 2000, pp. 244-255.
11. P. Indyk and R. Motwani, "Approximate Nearest Neighbors: Towards Removing the Curse of Dimensionality", Proceedings of the 30th ACM Symposium on Theory of Computing, 1998, pp. 604-613.
12. J. M. Kleinberg, "Two Algorithms for Nearest-Neighbor Search in High Dimensions", Proceedings of the 29th ACM Symposium on Theory of Computing, 1997, pp. 599-608.
13. E. Kushilevitz, R. Ostrovsky and Y. Rabani, "Efficient Search for Approximate Nearest Neighbor in High Dimensional Spaces", Proceedings of the 30th ACM Symposium on Theory of Computing, 1998, pp. 614-623.
14. J. Li, J. Z. Wang and G. Wiederhold, "IRM: Integrated Region Matching for Image Retrieval", Proceedings of ACM Multimedia, October 2000, pp. 147-156.
15. A. Natsev, R. Rastogi and K. Shim, "WALRUS: A Similarity Retrieval Algorithm for Image Databases", Proceedings of ACM SIGMOD, June 1999, pp. 395-406.
16. S. Brin and H. Garcia-Molina, "Copy Detection Mechanisms for Digital Documents", Proceedings of ACM SIGMOD, May 1995, pp. 398-409.
17. E. Chang, J. Wang, C. Li and G. Wiederhold, "RIME - A Replicated Image Detector for the WWW", Proc. of SPIE Symposium of Voice, Video, and Data Communications, November 1998.
18. K. Goh and E. Chang, "Indexing Multimedia Data in High-dimensional and Dynamic Weighted Feature Spaces", The 6th Visual Database Conference, May 2002.
19. A. Gersho and R. Gray, Vector Quantization and Signal Compression, Kluwer Academic, 1991.
20. E. Chang and B. Li, "MEGA - The Maximizing Expected Generalization Algorithm for Learning Complex Query Concepts (Extended Version)", Technical Report, http://www-db.stanford.edu/~echang/mega.pdf, November 2000.
21. M. Kearns, M. Li and L. Valiant, "Learning Boolean Formulae", Journal of the ACM, vol. 41, no. 6, 1994, pp. 1298-1328.
22. M. Kearns and U. Vazirani, An Introduction to Computational Learning Theory, MIT Press, 1994.
23. T. Mitchell, Machine Learning, McGraw Hill, 1997.
24. L. Valiant, "A Theory of the Learnable", Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, 1984, pp. 436-445.
25. C. Burges, "A Tutorial on Support Vector Machines for Pattern Recognition", Data Mining and Knowledge Discovery, vol. 2, 1998, pp. 121-167.
26. L. Breiman, "Bagging Predictors", Machine Learning, 1996, pp. 123-140.
27. L. Breiman, "Arcing Classifiers", The Annals of Statistics, 1998, pp. 801-849.
28. R. Schapire, Y. Freund, P. Bartlett and W. Lee, "Boosting the Margin: A New Explanation for the Effectiveness of Voting Methods", in Proceedings of the Fourteenth International Conference on Machine Learning, Morgan Kaufmann, 1997, pp. 322-330.
29. A. Grove and D. Schuurmans, "Boosting in the Limit: Maximizing the Margin of Learned Ensembles", in Proc. 15th National Conference on Artificial Intelligence (AAAI), 1998, pp. 692-699.
30. T. Dietterich and G. Bakiri, "Solving Multiclass Learning Problems via Error-Correcting Output Codes", Journal of Artificial Intelligence Research, vol. 2, 1995, pp. 263-286.
31. M. Moreira and E. Mayoraz, "Improving Pairwise Coupling Classification with Error Correcting Classifiers", Proceedings of the Tenth European Conference on Machine Learning, April 1998.
32. H. Seung, M. Opper and H. Sompolinsky, "Query by Committee", in Proceedings of the Fifth Workshop on Computational Learning Theory, Morgan Kaufmann, 1992, pp. 287-294.
33. Y. Freund, H. Seung, E. Shamir and N. Tishby, "Selective Sampling Using the Query by Committee Algorithm", Machine Learning, vol. 28, 1997, pp. 133-168.
34. I. Dagan and S. Engelson, "Committee-based Sampling for Training Probabilistic Classifiers", in Proceedings of the Twelfth International Conference on Machine Learning, Morgan Kaufmann, 1995, pp. 150-157.
35. A. McCallum and K. Nigam, "Employing EM in Pool-Based Active Learning for Text Classification", in Proceedings of the Fifteenth International Conference on Machine Learning, Morgan Kaufmann, 1998, pp. 350-358.
36. T. Joachims, "Text Categorization with Support Vector Machines", in Proceedings of the European Conference on Machine Learning, Springer-Verlag, 1998, pp. 137-142.
37. S. Dumais, J. Platt, D. Heckerman and M. Sahami, "Inductive Learning Algorithms and Representations for Text Categorization", in Proceedings of the Seventh International Conference on Information and Knowledge Management, ACM Press, 1998, pp. 148-155.
38. E. Chang, K.-T. Cheng, W.-C. Lai, C.-T. Wu, C.-W. Chang and Y.-L. Wu, "PBIR - A System that Learns Subjective Image Query Concepts", Proceedings of ACM Multimedia, http://www.mmdb.ece.ucsb.edu/~demo/corelacm/, October 2001, pp. 611-614.
39. B. Li, E. Chang and C.-T. Wu, "Dynamic Partial Function", IEEE Conference on Image Processing, September 2002.
40. B. Li and E. Chang, "Discovery of a Perceptual Distance Function for Measuring Image Similarity", ACM Multimedia Journal Special Issue, 2002.
41. E. Chang, K. Goh, G. Sychay and G. Wu, "Content-based Soft Annotation for Multimodal Image Retrieval Using Bayes Point Machines", IEEE Transactions on Circuits and Systems for Video Technology, Special Issue on Conceptual and Dynamical Aspects of Multimedia Content Description, 2002.
Part V. New Opportunities
CHAPTER 14

WEB SECURITY AND DOCUMENT IMAGE ANALYSIS*
Henry S. Baird and Kris Popat
Palo Alto Research Center
3333 Coyote Hill Road
Palo Alto, CA 94304 USA
E-mail: {baird,popat}@parc.com
URL: www.parc.com/istl/groups/did
Web services offered for human use are being abused by programs. Efforts to defend against these have, over the last five years, stimulated the development of a new family of security protocols able to distinguish between human and machine users automatically over GUIs and networks. AltaVista pioneered this technology in 1997; by 2000, Yahoo! and PayPal were using similar methods. Researchers at Carnegie-Mellon University and, then, a collaboration between the University of California at Berkeley and the Palo Alto Research Center developed such tests. By January 2002 the subject was called 'human interactive proofs' (HIPs), defined broadly as challenge/response protocols which allow a human to authenticate herself as a member of a given group: e.g., human (vs. machine), herself (vs. anyone else), etc. All commercial uses of HIPs exploit the gap in reading ability between humans and machines. Thus, many technical issues studied by the document image analysis (DIA) research community are relevant to HIPs. This chapter describes the evolution of HIP R&D, applications of HIPs now and on the horizon, relevant legal issues, highlights of the first NSF HIP workshop, and proposals for a DIA research agenda to advance the state of the art of HIPs.
*This is an expanded version of a paper published in the DAS2002 Proceedings.1

1. Introduction

In 1997 Andrei Broder and his colleagues,4 then at the DEC Systems Research Center, developed a scheme to block the abusive automatic submission of URLs to the AltaVista web-site.5 Their approach was to present a
potential user with an image of printed text formed specially so that machine vision (OCR) systems could not read it but humans still could. In September 2000, Udi Manber, Chief Scientist at Yahoo!, challenged Prof. Manuel Blum and his students2 at the School of Computer Science at Carnegie Mellon University (CMU) to design an "easy to use reverse Turing test" that would block 'bots' (computer programs) from registering for services including chat rooms, mail, briefcases, etc. In October of that year, Prof. Blum asked the first author, of the Palo Alto Research Center (PARC), and Prof. Richard Fateman, of the Computer Science Division of the University of California at Berkeley (UCB), whether systematically applied image degradations could form the basis of such a filter, stimulating the development of PessimalPrint.3
In January 2002, Prof. Blum and the present authors ran a workshop at PARC on 'human interactive proofs' (HIPs), defined broadly as a class of challenge/response protocols which allow a human to be authenticated as a member of a given group: an adult (vs. a child), a human (vs. machine), a particular individual (vs. everyone else), etc. All commercial uses of HIPs known to us exploit the large gap in ability between human and machine vision systems in reading images of text.
The number of applications of vision-based HIPs to Web security is large and growing. HIPs have been used to block access to many services by machine users, but they could also, in principle, be used as 'anti-scraping' technologies to prevent the large-scale copying of databases, prices, auction bids, etc. If HIPs (possibly not based on vision) could be devised to discriminate reliably between adults and children, the commercial value of the resulting applications would be large.
Many technical issues that have been systematically studied by the document image analysis (DIA) community are relevant to the HIP research program. In an effort to stimulate interest in HIPs within the document image analysis research community, this chapter details the evolution of the HIP research field, the range of applications of HIPs appearing on the horizon, highlights of the first HIP workshop, and proposals for a DIA research agenda to advance the state of the art of HIPs.
1.1. An Influential Precursor: Turing Tests
Alan Turing proposed6 a methodology for testing whether or not a machine effectively exhibits intelligence, by means of an "imitation game" conducted over teletype connections in which a human judge asks questions of two
interlocutors, one human and the other a machine, and eventually decides which of them is human. If judges fail sufficiently often to decide correctly, then that fact would be, Turing proposed, strong evidence that the machine possessed artificial intelligence. His proposal has been widely influential in the computer science, cognitive science, and philosophical communities7 for over fifty years.
However, no machine has "passed the Turing test" in its original sense in spite of perennial serious attempts. In fact it remains easy for human judges to distinguish machines from humans under Turing-test-like conditions. Graphical user interfaces (GUIs) invite the use of images as well as text in the dialogues.
1.2. Robot Exclusion Conventions
The Robot Exclusion Standard, an informal consensus reached in 1994 by the robots mailing list (robots@nexor.co.uk), specifies the format of a file (the http://.../robots.txt file) which a web site or server may install to instruct all robots visiting the site which paths it should not traverse in search of documents. The Robots META tag allows HTML authors to indicate to visiting robots whether or not a document may be indexed or used to harvest more links (cf. www.robotstxt.org/wc/meta-user.html). Many Web services (Yahoo!, Google, etc.) respect these conventions.
Some 'abuses' which HIPs address are caused by deliberate disregard of these conventions. The legality of disregarding the conventions has been vigorously litigated but remains unsettled.8,9 Even if remedies under civil or criminal law are finally allowed, there will certainly be many instances where litigation is likely to be futile or not cost-effective. Thus there will probably remain strong commercial incentives to use technical means to enforce the exclusion conventions.
The financial value of any service to be protected against 'bots' cannot be very great, since a human can be paid (or in some other way rewarded) to pass the CAPTCHA (an acronym for Completely Automated Public Turing Test to Tell Computers and Humans Apart, coined by Prof. Manuel Blum, Luis A. von Ahn, and John Langford of CMU). Of course, minimum human response times, of 5-10 seconds at least, may be almost always slower than an automated attack, and this speed gap may force reengineering of the 'bot' attack pattern. Nevertheless, this may be simpler, and more stable, than actively engaging in an escalating arms race with CAPTCHA designers. There are unsubstantiated reports that humans are already being
incented to pass CAPTCHAs.10

1.3. Primitive Means
For several years now web-page designers have chosen to render some apparent text as image (e.g., GIF) rather than encoded text (e.g., ASCII), sometimes in order to impede the legibility of the text to screen scrapers and spammers. A frequent use of this is to hide email addresses from automatic harvesting by potential spammers. To our knowledge the extent of this practice has not been documented.
One of the earliest published attempts to automate the reading of imaged text on web pages was by Lopresti and Zhou.11 Kanungo et al.12 reported that, in a sample of 862 web pages, "42% of images contain text" and, of the images with text, "59% contain at least one word that does not appear in the ... HTML file."
1.4. First Use: The Add-URL Problem
In 1997 AltaVista sought ways to block or discourage the automatic submission of URLs to their search engine. This free "add-URL" service is important to AltaVista since it broadens its search coverage and ensures that sites important to its most motivated customers are included. However, some users were abusing the service by automating the submission of large numbers of URLs, and certain URLs many times, in an effort to skew AltaVista's importance ranking algorithms.
Andrei Broder, Chief Scientist of AltaVista, and his colleagues developed a filter (now visible at their site5). Their method is to generate an image of printed text randomly (in a "ransom note" style using mixed typefaces) so that machine vision (OCR) systems cannot read it but humans still can (Fig. 1). In January 2002 Broder told the present authors that the system had been in use for "over a year" and had reduced the number of "spam add-URL" submissions by "over 95%." (No details concerning the residual 5% are mentioned.) A U.S. patent4 was issued in April 2001.
To the present authors, these do not seem to present a difficult challenge to modern machine vision methods. The black characters are widely separated against a background of uniform grey, so they can be easily isolated. Recognizing an isolated bilevel pattern (here, a single character) which has undergone arbitrary affine spatial transformations is a well-studied problem in pattern recognition, and several effective methods have been published.13,14 The variety of typefaces used can be attacked by a brute-force enumeration.
Fig. 1. Example of an AltaVista challenge: letters are chosen at random, then each is assigned to a typeface at random, then each letter is rotated and scaled, and finally (optionally, not shown here) background clutter is added.
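A toy sketch of this kind of "ransom note" generation, using the Pillow imaging library; the font file names are placeholders and all rendering choices below are ours, not AltaVista's.

```python
import random
from PIL import Image, ImageDraw, ImageFont

def ransom_note_challenge(text, fonts, size=48, canvas=(400, 120), grey=200):
    """Render each letter in a randomly chosen typeface, randomly rotated and
    scaled, on a uniform grey background. `fonts` is a list of .ttf file paths."""
    img = Image.new("L", canvas, color=grey)
    x = 10
    for ch in text:
        pt = random.randint(int(0.7 * size), int(1.3 * size))   # random scale
        font = ImageFont.truetype(random.choice(fonts), pt)     # random typeface
        glyph = Image.new("L", (pt * 2, pt * 2), color=grey)
        ImageDraw.Draw(glyph).text((pt // 2, pt // 2), ch, fill=0, font=font)
        glyph = glyph.rotate(random.uniform(-30, 30), fillcolor=grey)  # random rotation
        img.paste(glyph, (x, random.randint(0, 20)))
        x += pt + 10
    return img

# Example usage (font paths are hypothetical):
# challenge = ransom_note_challenge("XK7RQ", ["DejaVuSans.ttf", "FreeSerif.ttf"])
# challenge.save("challenge.png")
```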
1.5. The Chat Room Problem
In September 2000, Udi Manber of Yahoo! described this "chat room problem" to researchers at CMU: 'bots' were joining on-line chat rooms and irritating the people there, e.g., by pointing them to advertising sites. How could all 'bots' be refused entry to chat rooms? CMU's Prof. Manuel Blum, Luis A. von Ahn, and John Langford articulated2 some desirable properties of a test, including:
• the test's challenges can be automatically generated and graded (i.e., the judge is a machine);
• the test can be taken quickly and easily by human users (i.e., the dialogue should not go on long);
• the test will accept virtually all human users (even young or naive users) with high reliability while rejecting very few;
• the test will reject virtually all machine users; and
• the test will resist automatic attack for many years even as technology advances and even if the test's algorithms are known (e.g., published and/or released as open source).
Theoretical security issues underlying the design of CAPTCHAs have been addressed by Luis von Ahn, Manuel Blum, Nicholas Hopper and John Langford.15
The CMU team developed a "hard" GIMPY CAPTCHA which picked English words at random and rendered them as images of printed text under a wide variety of shape deformations and image occlusions, the word
images often overlapping. The user was asked to transcribe some number of the words correctly. An example is shown in Fig. 2.
Fig. 2. Example of a "hard" GIMPY image produced by the Carnegie-Mellon Univ. CAPTCHA.
The non-linear deformations of the words and the extensive overlapping of images are, in our opinion, likely to pose serious challenges to existing machine-reading technology. However, it turned out to place too heavy a burden on human users, also: in trials on the Yahoo! website, users complained so much that this CAPTCHA was withdrawn.
As a result, a simplified version of GIMPY ("easy" or "EZ" GIMPY), using only one word-image at a time (Fig. 3), was installed by Yahoo!, and is in use at the time of writing (visible at chat.yahoo.com after clicking on 'Sign Up For Yahoo! Chat!'). It is used to restrict access to chat rooms and other services to human users. According to Udi Manber, Chief Scientist of Yahoo!, it serves up as many as a million challenges each day.
The variety of deformations and confusing backgrounds (the full range of these is not exhibited in the Figure) poses a serious challenge to present machine-vision systems, which typically lack versatility and are fragile outside of a narrow range of expected inputs. However, the use of one English word may be a significant weakness, since even a small number of partial recognition results can rapidly prune the number of word-choices.
Fig. 3. Example of a simplified Yahoo! challenge (CMU's "EZ GIMPY"): an English word is selected at random, then the word (as a whole) is typeset using a typeface chosen at random, and finally the word image is altered randomly by a variety of means including image degradations, scoring with white lines (shown here), and non-linear deformations.
1.6. Screening Financial Accounts
PayPal (www.paypal.com) is screening applications for its financial payments accounts using a text-image challenge (Fig. 4). We do not know any details about its motivation or its technical basis.
Fig. 4. Example of a PayPal challenge: letters and numerals are chosen at random and then typeset, spaced widely apart, and finally a grid of dashed lines is overprinted.
This CAPTCHA appears to use a single typeface, which strikes us as a
serious weakness that the use of occluding grids does little to strengthen. A similar CAPTCHA has recently appeared on the Overture website (click on 'Advertiser Login' at www.overture.com).
1.7. PessimalPrint
The first author, together with Richard Fateman and Allison Coates of UCB, applied a model of document image degradations16 that approximates ten aspects of the physics of machine-printing and imaging of text, including spatial sampling rate and error, affine spatial deformations, jitter, speckle, blurring, thresholding, and symbol size. Fig. 5 shows an example of a PessimalPrint challenge that was synthetically degraded according to certain parameter settings of this model.
Fig. 5. Example of a PessimalPrint challenge: an English word is chosen at random, then the word (as a whole) is typeset using a randomly chosen typeface, and finally the word image is degraded according to randomly selected parameters (within certain ranges) of the image degradation model.
An experiment assisted by ten UCB graduate-student subjects and three commercial OCR machines located a range of model parameters in which images could be generated pseudorandomly that were always legible to the human subjects and never correctly recognized by the OCR systems. In the current version of PessimalPrint, for each challenge a single English word is chosen randomly from a set of 70 words commonly found on the Web; then the word is rendered using one of a small set of typefaces and that ideal image is degraded using the parameters selected randomly from the useful range. These images, being simpler and less mentally challenging than the original GIMPY, would in our view almost certainly be more readily accepted by human subjects.
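To make the degradation step concrete, here is a small illustrative sketch, our own and not the published degradation model, with arbitrarily chosen parameter values, that blurs, speckles, and thresholds a rendered word image using NumPy and Pillow.

```python
import numpy as np
from PIL import Image, ImageFilter

def degrade(word_image, blur_radius=1.5, speckle_rate=0.02, threshold=128, seed=None):
    """Apply blur, salt-and-pepper speckle, and global thresholding to a
    greyscale word image (a PIL Image in mode 'L')."""
    rng = np.random.default_rng(seed)
    blurred = word_image.filter(ImageFilter.GaussianBlur(blur_radius))
    pixels = np.asarray(blurred, dtype=np.uint8).copy()
    # salt-and-pepper speckle
    mask = rng.random(pixels.shape) < speckle_rate
    pixels[mask] = rng.choice([0, 255], size=int(mask.sum()))
    # global threshold to a bilevel image
    bilevel = np.where(pixels > threshold, 255, 0).astype(np.uint8)
    return Image.fromarray(bilevel, mode="L")
```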
2. The First International HIP Workshop

The first NSF-sponsored workshop on Human Interactive Proofs was held January 9-11, 2002, at the Palo Alto Research Center. The workshop was a two-and-one-half-day, 100%-participation workshop. There were thirty-eight invited participants, with large representations from CMU, U.C. Berkeley, and PARC. The CMU group was led by Prof. Manuel Blum, co-organizer of the workshop with the first author. Profs. Richard Fateman, Doug Tygar, and Jitendra Malik and their students attended from UCB. Prof. George Nagy represented RPI, Dr. Nancy Chan the City Univ. of Hong Kong, and Dr. Moni Naor the Weizmann Institute. Dr. Robert Sloan, Director of the NSF Theory of Computing Program, attended and expressed warm support for this new research area. The Chief Scientists of Yahoo! and AltaVista were present, along with researchers from IBM Research, Lucent Bell Labs, Intertrust STAR Labs, RSA Security, and Document Recognition Technologies, Inc. Prof. John McCarthy of Stanford University presented an invited plenary talk on "Frontiers of AI".
As a starting point for discussion, HIPs were defined tentatively as automatic protocols allowing a person to authenticate him/herself, as, e.g., human (not a machine), an adult (not a child), himself (no one else), over a network without the burden of passwords, biometrics, special mechanical aids, or special training.
There was considerable breadth of interests represented; topics presented and discussed included:
• Completely Automatic Public Turing tests to tell Computers and Humans Apart (CAPTCHAs): criteria, proofs, and design;
• secure authentication of individuals without using identifying or other devices;
• catalogues of actual exploits and attacks by machines on commercial services intended for human use;
• funding prospects for HIP work;
• a design and implementation case study of a "Ransom Note" style CAPTCHA;
• audio-based CAPTCHAs;
• CAPTCHA design considerations specific to East-Asian languages;
• authentication and forensics of video footage;
• feasibility of text-only CAPTCHAs;
• CAPTCHAs based on the human-machine gap in text recognition
ability;
• images, human visual ability, and computer vision in CAPTCHA technology;
• usability issues in cryptography tools;
• human-fault-tolerant approaches to cryptography and authentication;
• robustly non-transferable authentication; and
• protocols based on the human ability to memorize through association and perform simple mental calculations.
• thwarting password guessing
• blocking denial-of-service attacks
• suppressing spam
• preventing ballot stuffing
• protecting databases (e.g., the paper by Baron8)
Some believe that similar problems are likely to arise on Intranets. Many further details of the HIP2002 workshop are available on-line at www.parc.com/istl/groups/did/HIP2002, including the Program and Participants' list.

3. Implications for DIA Research

The emergence of 'human interactive proofs' as a research field offers a rare opportunity (perhaps unprecedented since Turing's day) for a substantive alliance between the DIA and theoretical computer science research communities, especially theorists interested in cryptography and security. The implications for DIA research may be far-reaching.
At the heart of CAPTCHAs based on reading-ability gaps is the choice of the family of challenges: that is, defining the technical conditions under which text images can be generated that are reliably human-legible but machine-illegible. This triggers many DIA research questions, including:
• Historically, what do the fields of Computer Vision, Pattern Recognition, and DIA suggest are the most intractable obstacles to machine reading, e.g.: segmentation problems (clutter, etc.); gestalt-completion challenges (parts missing or obscured); severe image degradation?
• What are the conditions under which human reading is peculiarly
(or even better, inexplicably) robust? What does the literature in cognitive science and the psychophysics of human reading suggest, e.g.: ideal size and image contrast; known linguistic context; style consistency?
• Where, quantitatively as well as qualitatively, are the margins of good performance located, for machines and for humans?
• Having chosen one or more of these 'ability gaps', how can we reliably generate an inexhaustible supply of distinct challenges that lie strictly 'inside' the gap?
It is well known in the DIA field that low-quality images of printed-text documents pose serious challenges to current image pattern recognition technologies.17,18 In an attempt to understand the nature and severity of the challenge, models of document image degradations16,27 have been developed and used to explore the limitations28 of image pattern recognition algorithms. These methods should be extended theoretically and be better characterized in an engineering sense, in order to make progress on the questions above.
The choice of image degradations for PessimalPrint was often guided by the discussion by Rice et al.18 of cases that defeat modern OCR machines, especially:
• thickened images, so that characters merge together;
• thinned images, so that characters fragment into unconnected components;
• noisy images, causing rough edges and salt-and-pepper noise;
• condensed fonts, with narrower aspect ratios than usual; and
• Italic fonts, whose rectilinear bounding boxes overlap their neighbors'.
Does the rich collection of examples in this book suggest other effective means that should be exploited?
To our knowledge, all DIA research so far has been focused on applications in non-adversarial environments. We should look closely at new security-sensitive questions such as:
• how easily can image degradations be normalized away?
• can machines exploit lexicons (and other linguistic context) more or less effectively than people?
Our familiarity with the state of the art of machine vision leads us to
consider the recognition of non-text images, such as line drawings, faces, and various objects in natural scenes. One might reasonably intuit that these would be harder and so decide to use them rather than images of text. This intuition is not supported by the cognitive science literature on human reading of words. There is no consensus on whether recognition occurs letter-by-letter or by a word-template model;24,25 some theories stress the importance of contextual clues26 from natural language and pragmatic knowledge. Furthermore, many theories of human reading assume perfectly formed images of text. However, we have not found in the literature a theory of human reading which accounts for the robust human ability to read despite extreme segmentation (merging, fragmentation) of images of characters.
The resistance of these problems to technical attack for four decades and the incompleteness of our understanding of human reading abilities suggest that it is premature to decide that the recognition of text under conditions of low quality, occlusion, and clutter is intrinsically much easier (that is, a significantly weaker challenge to the machine vision state of the art) than recognition of objects in natural scenes. There is another reason to use images of text: the correct answer to the challenge is unambiguously clear and, even more helpful, it maps into a unique sequence of keystrokes. Can we put these arguments more convincingly?

4. Discussion

The HIP2002 Workshop was a 'snapshot' of a research community in the early stages of formation. It seems to us to be a promising field, already enjoying a critical mass of hard problems, smart researchers, and commercial value. The academic disciplines that were represented at the workshop included:
Perhaps this list is too narrow; other disciplines that could make important contributions may include:
• biometrics, especially remotely sensed (noninvasive) personal attributes which may help reinforce tentative conclusions developed by CAPTCHAs;
• cognitive science, especially for sharper quantitative insight into the margins of good performance of human cognitive abilities;
• psychophysics, especially the psychophysics of human vision and reading, both normal and impaired; and
• psychology generally, for better understanding of widely shared human intellectual abilities and limitations.

5. Acknowledgments

Our interest in HIPs was triggered by a question (could character images form the basis of a Turing test?) raised by Manuel Blum of Carnegie-Mellon Univ., which in turn was stimulated by Udi Manber's posing the "chat room problem" at CMU in September 2000. Manuel Blum, Luis A. von Ahn, and John Langford, all of CMU, shared with us much of their early thinking. Manuel proposed the HIP workshop, accepted our offer to hold it at PARC, and promoted it vigorously, inviting key participants. John Langford, Lenore Blum, and Luis A. von Ahn helped with many details of planning and execution. Charles Bennett of IBM Research took the group photo. We are especially grateful to many PARC researchers and staff for helping us run the workshop so smoothly: Prateek Sarkar, Tom Breuel, Tom Berson, Dirk Balfanz, David Goldberg, Jeanette Figueroa, Randy Jenkins, Beej Martinez, Eleanor Alvarido, Dan Novarro, Mimi Gardner, Dayne Peavy, Mike Hornbuckle, Sally Peters, and Kathy Jarvis. Allison Coates provided references and commentary related to the cognitive science literature. Monica Chew provided references and commentary related to the psychophysics of vision literature. This chapter has benefited from discussions with Hermann Calabria, Andrei Broder, and Udi Manber and from careful readings by Monica Chew and Victoria Stodden.

References

1. H. S. Baird and K. Popat, "Human Interactive Proofs and Document Image Analysis," Proc., 5th IAPR Int'l Workshop on Document Analysis Systems, Princeton, NJ, Springer-Verlag (Berlin) LNCS 2423, August 2002, pp. 507-518.
2. M. Blum, L. A. von Ahn, and J. Langford, The CAPTCHA Project, "Completely Automatic Public Turing Test to tell Computers and Humans Apart," www.captcha.net, Dept. of Computer Science, Carnegie-Mellon Univ., and personal communications, November 2000.
3. A. L. Coates, H. S. Baird, and R. Fateman, "Pessimal Print: a Reverse Turing Test," Proc., IAPR 6th Int'l Conf. on Document Analysis and Recognition, Seattle, WA, September 10-13, 2001, pp. 1154-1158.
4. M. D. Lillibridge, M. Abadi, K. Bharat, and A. Z. Broder, "Method for Selectively Restricting Access to Computer Systems," U.S. Patent No. 6,195,698, issued February 27, 2001.
5. AltaVista's "Add-URL" site: altavista.com/sites/addurl/newurl, protected by the earliest known CAPTCHA.
6. A. Turing, "Computing Machinery and Intelligence," Mind, 59(236), 1950, pp. 433-460.
7. A. P. Saygin, I. Cicekli, and V. Akman, "Turing Test: 50 Years Later," Minds and Machines, 10(4), Kluwer, 2000.
8. D. P. Baron, "eBay and Database Protection," Case No. P-33, Case Writing Office, Stanford Graduate School of Business, 518 Memorial Way, Stanford Univ., Stanford, CA 94305-5015, 2001.
9. P. Plitch, "Are Bots Legal?," The Wall Street Journal, Dow Jones Newswires: Jersey City, NJ, online.wsj.com, September 16, 2002.
10. C. Thompson, "Slaves to Our Machines," Wired magazine, October 2002, pp. 35-36.
11. D. Lopresti and J. Zhou, "Locating and Recognizing Text in WWW Images," Information Retrieval, May 2000, 2(2/3), pp. 177-206.
12. T. Kanungo, C. H. Lee, and R. Bradford, "What Fraction of Images on the Web Contain Text?", Proc., 1st Int'l Workshop on Web Document Analysis, Seattle, WA, September 8, 2001 (ISBN 0-9541148-0-9) and also at www.csc.liv.ac.uk/~wda2001.
13. D. Shen, W. H. Wong, and H. H. S. Ip, "Affine-Invariant Image Retrieval by Correspondence Matching of Shapes," Image and Vision Computing, (17), 1999, pp. 489-499.
14. T. Leung, M. Burl, and P. Perona, "Probabilistic Affine Invariants for Recognition," Proc., IEEE Computer Society Conf. on Computer Vision and Pattern Recognition, 1998, pp. 678-684.
15. L. von Ahn, M. Blum, N. J. Hopper and J. Langford, "CAPTCHA: Using Hard AI Problems For Security," EUROCRYPT 2003, May 4-8, 2003, Warsaw, Poland.
16. H. S. Baird, "Document Image Defect Models," in H. S. Baird, H. Bunke and K. Yamamoto (Eds.), Structured Document Image Analysis, Springer-Verlag: New York, 1992, pp. 546-556.
17. S. V. Rice, F. R. Jenkins and T. A. Nartker, "The Fifth Annual Test of OCR Accuracy," ISRI TR-96-01, Univ. of Nevada, Las Vegas, 1996.
18. S. V. Rice, G. Nagy, and T. A. Nartker, OCR: An Illustrated Guide to the Frontier, Kluwer Academic Publishers, 1999.
19. G. E. Legge, D. G. Pelli, G. S. Rubin, and M. M. Schleske, "Psychophysics of Reading: I. Normal Vision," Vision Research, 25(2), 1985, pp. 239-252.
20. G. E. Legge, T. S. Klitz, and B. S. Tjan, "Mr. Chips: An Ideal-Observer Model of Reading," Psychological Review, 104(3), 1997, pp. 524-553.
21. D. G. Pelli, C. W. Burns, B. Farell, and D. C. Moore, "Identifying Letters," Vision Research, (accepted with minor revisions; to appear), 2002.
22. T. Pavlidis, "Thirty Years at the Pattern Recognition Front," King-Sun Fu Prize Lecture, 11th Int'l Conf. on Pattern Recognition, Barcelona, September 2000.
23. G. Nagy and S. Seth, "Modern Optical Character Recognition," The Froehlich/Kent Encyclopaedia of Telecommunications, 11, Marcel Dekker, NY, 1996, pp. 473-531.
24. R. G. Crowder, The Psychology of Reading, Oxford University Press, 1982.
25. P. A. Kolers, M. E. Wrolstad, and H. Bouma, Processing of Visible Language 2, Plenum Press, 1980.
26. L. M. Gentile, M. L. Kamil, and J. S. Blanchard, Reading Research Revisited, Charles E. Merrill Publishing, 1983.
27. T. Kanungo, Document Degradation Models and Methodology for Degradation Model Validation, Ph.D. Dissertation, Dept. EE, Univ. Washington, March 1996.
28. T. K. Ho and H. S. Baird, "Large-Scale Simulation Studies in Image Pattern Recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, 19(10), October 1997, pp. 1067-1079.
CHAPTER 15

EXPLOITING WWW RESOURCES IN EXPERIMENTAL DOCUMENT ANALYSIS RESEARCH
Daniel Lopresti
Bell Labs, Lucent Technologies Inc.
600 Mountain Avenue
Murray Hill, NJ 07974 USA
E-mail: dpl@research.bell-labs.com

Many large collections of document images are now becoming available online as part of digital library initiatives, fueled by the explosive growth of the World Wide Web. In this chapter, we examine protocols and system-related issues that arise in attempting to make use of these new resources, both as a target application (building better search engines) and as a way of overcoming the problem of acquiring ground-truth to support experimental document analysis research. We also report on our experiences running two simple tests involving data drawn from one such collection. The potential synergies between document analysis and digital libraries could lead to substantial benefits for both communities.

1. Introduction

In the years that have passed since we first examined potential synergies between the World Wide Web and the field of document analysis,1 the Web has established itself as the largest distributed collection of documents in the history of civilization. Many researchers are now exploring problems that have arisen out of this phenomenon, including, for example, the extraction and recognition of text embedded in color GIF and JPEG images,2-6 and the detection of tables in HTML.7 Document analysis is being applied to the conversion process of placing archival material on the WWW.8 Since pages represented in image format can be quite large, effective compression techniques are needed for document storage and delivery.9-11 Moreover, the pervasive impact of the Web has spawned work in related areas, including the use of XML to represent recognition results,12,13 as well as frameworks for Web-based cooperative document
understanding.14 Such opportunities and challenges were the subject of a recent workshop.15
Despite this flurry of activity centered around the Web, there is an important development that appears to have been largely overlooked: that is, the rapidly growing body of traditional scanned documents now being made available online. In retrospect, this should come as no surprise as: (1) the WWW was always touted as a delivery mechanism for multimedia content, (2) documents serve as a basic "quantum" of information in our society, and (3) most users are generally oblivious to the distinction between a page presented in image format versus one encoded in, say, HTML. Often, collections of scanned documents are the product of digital library projects aimed at preserving and disseminating works of historical significance.16,17 However, noisy word images generated on-demand also play a key role in an interesting new idea for protecting online services from automated attacks, an application area sometimes referred to as Human Interactive Proofs.18,19
For example, the Making of America collection (part of Cornell University's Prototype Digital Library16) comprises 267 monographs (books) and 22 journals (equaling 955 serial volumes) for a total of 907,750 pages, making it almost 1,000 times larger than the dataset offered on the UW1 CD-ROM.20 The procedures used in creating this digital library match standard methodologies employed in experimental document analysis research:
"The materials in the MOA collection were scanned from the original paper source, with materials disbound locally due to the brittle nature of many of the items ... The images were captured at 600 dpi in TIFF image format and compressed using CCITT Group 4. Minimal document structuring occurred at the point of conversion, primarily linking image numbers to pagination and tagging self-referencing portions of the text ... Further conversion included both optical character recognition of the page images, and SGML-encoding of the ensuing textual information."21
While OCR results are used for full text retrieval purposes, the default view returned to users of the system is an image of a scanned page.
Fig. 1 shows a snapshot of a browser window displaying a page from Making of America on the left,22 and another example of an online document image, a card from the catalog for Princeton University's library, on the right.23 As a result of such efforts at bringing scanned documents online,
Fig. 1. Examples of documents delivered in image format on the Web.
several intriguing opportunities present themselves to researchers working in document analysis. The most obvious of these would be to apply state-of-the-art techniques to build higher quality and/or more powerful indices for information retrieval and presentation. This notion of crafting a better third-party search engine for digital libraries has an analog on the Web as a whole, where competing search engines vie for users by indexing the text that can be extracted from documents encoded in HTML, PDF, PostScript, and other "easy" formats. It is certainly possible to imagine doing a better job on the MOA collection; for example, a search for the keyword "modem" returns 1,364 hits in documents published between 1815 and 1926, even though the word was first coined in the early 1950's.ᵃ Two of the librarians in the project write:
"Our attention to retaining pagination and document structure will allow us to selectively insert improved OCR as it is completed. As we insert the more accurate OCR over time, we expect that the greatly improved OCR will make the searching tools even more effective."25
Beyond this relatively straightforward improvement, it seems conceivable
ᵃThis test was inspired by a discussion in Baker's book Double Fold, p. 71.24 The word "modem" is most likely in the collection because of a common OCR error, the merging of two adjacent characters into one, rn → m, which causes "modern" to become "modem."
that higher-level document analysis methods could provide useful new paradigms for retrieval from digital libraries.
Another thought-provoking possibility would be to use existing online collections of scanned images as a way of overcoming the problem of acquiring sufficient training and testing data to support experimental document analysis research. This matter is regarded as so pressing that it was one of the prime motivations behind the creation of the Open Mind Initiative,26 a project to enlist Web users around the world to assist in the labeling of ground-truth data for algorithm development. But while Open Mind deals with this one aspect of the problem, it does not address where the raw data comes from, or what qualifies it as "relevant." These issues will be a focus of this paper.

2. Traditional Approaches

Typically, document analysis researchers either assemble their own collections of scanned images and/or use pre-existing datasets, such as those disseminated by UW,20 NIST,27 UNLV,28 and CEDAR.29 The former approach allows the corpus to be targeted to the task under study, but the acquisition process can be time-consuming and perhaps expensive. On the other hand, standard datasets distributed on CD-ROM, once purchased, are easy to use and provide a convenient basis for comparison, although they may not cover the precise application of interest, potentially introduce copyright issues, and could become overused to the point that techniques are developed specific to the test set, which is usually relatively small.
Another methodology designed to replace or supplement the previous two approaches involves synthesizing training or testing data. There are, for example, models for generating noisy page images30 and for creating random instances of tables.31 While it is possible to produce an endless stream of data in this way, there is always the question of whether such data is truly representative.

3. Exploiting WWW Resources

As we have noted, there is an enormous quantity of page image data now available on the Web. How might this be used to support document analysis research? Consider the basic steps involved in building datasets for either training or testing purposes: (1) collecting and scanning representative pages, (2) labeling the ground-truth, and (3) distributing the dataset. While the last step might not seem strictly necessary, good scientific practice
requires describing experiments in sufficient detail that it is possible to reproduce them. With that in mind, it clearly becomes important that the test data be accessible to other researchers.

With digital libraries, the first and last of these steps are already taken care of. The pages have been scanned and are freely available online. The developer of the library presumably has dealt with any copyright issues connected to the works in question. Furthermore, it is easy to argue that such pages must be representative because they are, in fact, real documents of definite value to some target audience. Still, there remains the question of what to do about labeling the ground-truth.

What are the available options? One solution would be to make use of the existing ground-truth provided by the digital library itself (e.g., the OCR results in the case of the Making of America collection). Another would be to develop protocols for using truth produced and/or maintained by a third party (previous researchers who have used the same test documents, or an Open Mind-like entity). A third approach would be to study evaluation techniques that do not depend on having an explicit ground-truth (e.g., comparing retrieval effectiveness relative to what is obtained when using the source library's tools). For Human Interactive Proofs, the system providing the challenge can also be used to judge whether or not the response is correct (see, for example, the word image recognition problem that is used to protect the free email services offered by Yahoo! at http://mail.yahoo.com).

4. Proof of Concept: Analysis of a Digital Library

To explore the ideas outlined in this paper, we have performed two simple "proof of concept" exercises: the first using the pagereader system32 developed by Baird at Bell Labs to OCR a set of pages randomly chosen from the Making of America digital library,16 and the second examining an algorithm we have proposed with several colleagues for table detection.33 This sort of evaluation is fundamentally different from the kinds typically described in the literature. Because the selection of test images is unbiased and completely automatic, the pages in question are never seen in advance by the researcher(s) involved in running the tests; there can be no attempt, explicit or subconscious, to discard images that do not fit the model or to tune an algorithm to the dataset.b

b It is, of course, quite acceptable to maintain a record of the test documents that were used for an after-the-fact analysis.

As a result, this criterion is almost
certainly more demanding than what is normally encountered.

Most research systems for document analysis, including pagereader and our table detection code, assume the input image will be in TIFF format; however, TIFF is not a native encoding for current Web browsers. In the case of Making of America, the pages are delivered in one of three possible formats: a "50% size" GIF image, a "100% size" GIF image, and a PDF document containing the original scanned TIFF. The GIF forms have relatively low spatial resolution, making use of grayscale (image depth) to compensate, and hence would be difficult to use without a significant amount of extra work. We therefore chose to implement a process pipeline that first converts the PDF version of the page into PostScript and then extracts the image directly from the PostScript. In addition to the various image "flavors" of the page, the OCR output used to create the searchable index for Making of America is available. We can use this text for evaluation purposes, but must be careful about making assumptions concerning its quality or the way that it is formatted.c

c Generally, we assume that the text may contain a modest number of OCR errors, but that any severe problems will have been detected and corrected by those responsible for building the digital library.

Lacking our own complete index of the digital library, our approach to retrieving a random page image from Making of America is to issue a query by choosing a term from the Unix spell dictionary, which contains 24,259 words including a number of proper names. From the results of this search, we randomly choose one of the works (book or journal) that is returned, and from that work we select a specific page that contains a match. The implementation of the Web interface is programmed in Tcl/Tk using the Spynergy Toolkit.34 It takes a total of six HTTP "round-trips" to get the data we need (a sketch of this retrieval loop follows the list):

(1) First, issue a search request using a randomly chosen keyword and retrieve the results.
(2) The results are presented in "slices" of 50 works per HTML page. Randomly select a slice and retrieve it.
(3) Within the slice, determine one of the works at random and retrieve it.
(4) Within the work, randomly choose one of the matches and retrieve it.
(5) Based on the HTML for the final target page, retrieve the PDF file that contains the embedded TIFF image (which is then extracted locally).
(6) In the same way, retrieve the OCR text corresponding to the target page.
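To illustrate the shape of this retrieval loop, here is a minimal Python sketch of the same six round-trips. The chapter's implementation is written in Tcl/Tk with the Spynergy Toolkit; the sketch below is not that code, and every URL, query parameter, and link-matching regular expression in it is an illustrative assumption rather than the actual MOA interface (relative links would also need to be resolved with urljoin in practice).

import random
import re
import urllib.request

# Placeholder endpoint -- the real MOA query interface is not reproduced here.
SEARCH_URL = "http://example.org/moa/search?q="

def fetch(url, timeout=30):
    """One HTTP round-trip, returning the raw bytes of the response."""
    with urllib.request.urlopen(url, timeout=timeout) as resp:
        return resp.read()

def links(page_bytes, pattern):
    """Tiny 'wrapper' that pulls candidate links out of returned HTML."""
    return re.findall(pattern, page_bytes.decode("latin-1", errors="replace"))

def random_page(dictionary="/usr/share/dict/words"):
    words = [w.strip() for w in open(dictionary) if w.strip()]
    term = random.choice(words)                                                 # (1) keyword query
    results = fetch(SEARCH_URL + term)
    slice_page = fetch(random.choice(links(results, r'href="([^"]+slice[^"]*)"')))   # (2) pick a slice
    work_page = fetch(random.choice(links(slice_page, r'href="([^"]+work[^"]*)"')))  # (3) pick a work
    match_page = fetch(random.choice(links(work_page, r'href="([^"]+page[^"]*)"')))  # (4) pick a match
    pdf_bytes = fetch(links(match_page, r'href="([^"]+\.pdf)"')[0])                  # (5) PDF with embedded TIFF
    ocr_bytes = fetch(links(match_page, r'href="([^"]+ocr[^"]*)"')[0])               # (6) OCR text for the page
    return pdf_bytes, ocr_bytes.decode("latin-1", errors="replace")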
The last step is skipped in the table detection experiment as it is unnecessary. We have developed a set of simple "wrappers" to extract the required information from the HTML code returned by the MOA server.

4.1. Optical Character Recognition
For the OCR experiment, we retrieved 250 pages from the digital library. On the occasions when an HTTP fetch timed-out (after 30 seconds for the initial connection, and 5 seconds for each subsequent buffer), the search was attempted again using a different term.d This situation seemed to arise most often when the original query generated an extremely large number of hits (tens of thousands); it is likely that the machine serving Making of America builds data structures that grow with the size of the result. The 10 most- and least-frequent matches are listed in Table 1. Note that there is a wide distribution and even arcane terms arise occasionally in the collection.

Table 1. Most- and least-frequent matches in the OCR experiment.

Most-Frequent Matches
Hits       Search Term
236,021    enemy
103,160    science
46,007     edge
44,467     sold
42,054     empire
39,429     taught
35,812     request
35,667     guide
34,123     base
31,952     virtue
The times needed to retrieve and process the pages are graphed, in order of decreasing total time, in Fig. 2 (note that the y-axis uses a log scale).e The four components of the total are the times required to: (1) fetch the data, (2) convert the PDF to TIFF, (3) OCR the image, and (4) compare the output from pagereader to the ground-truth. The minimum total time was 93 seconds, the maximum was 1,966 seconds, and the average was 376 seconds. These values are dominated by the time it took to perform OCR (minimum 42 seconds, maximum 1,890 seconds, average 323 seconds).

d We also re-ran searches that returned no matches.
e All tests were performed on an older SGI O2 workstation (200 MHz MIPS R5000 CPU, 64 MB RAM).
In other words, OCR was responsible for 86% of the computation time, on average. On the other hand, processing the HTTP requests and retrieving the page images over the Internet amounted to only about 6% of the total. This ratio is likely to hold true for any kind of sophisticated document analysis, so overhead due to network delay should not be an issue.
Fig. 2. Times to retrieve and process the 250 test pages used in the OCR experiment.
Given the output from OCR and a suitable ground-truth, we would ordinarily apply techniques from approximate string matching to classify errors and provide a quantifiable measure of the accuracy of the recognition process.35 Such an approach will not work here, however. Although we presume the ground-truth contains a reasonably reliable representation of the text on the page (a "bag of words," if you will), we cannot be certain of the precise layout standards used by those who built the digital library. For example, a two-column page could be represented that way in the ground-truth, or it may be de-columnized. Arbitrary conventions might be employed for unrelated articles appearing on the same page. The fact that we have no guarantee there will be a correspondence between the reading orders for the OCR output and the truth, combined with the potential for large numbers of OCR errors and the need for the evaluation to be fully automated, means that string matching methods must be ruled out. Instead, we have chosen to perform evaluation by applying a well-known
measure developed in the context of information retrieval. The vector space model, first proposed by Salton, et al.,36 assigns large weights to terms that occur frequently in a given document but rarely in others, because such terms are able to distinguish the document in question from the rest of the collection. Let tf_ik be the frequency of term t_k in document D_i, n_k be the number of documents containing term t_k, T be the total number of terms, and N be the size of the collection. Then a common weighting scheme (tf × idf, or "term frequency times inverse document frequency") defines w_ik, the weight of term t_k in document D_i, to be:

w_{ik} = \frac{tf_{ik} \, \log(N/n_k)}{\sqrt{\sum_{j=1}^{T} \left[ tf_{ij} \, \log(N/n_j) \right]^2}}     (1)
The summation in the denominator normalizes the length of the vector so that all documents have an equal chance of being retrieved. Given query vector Q_i = (w_i1, w_i2, ..., w_iT) and document vector D_j = (w_j1, w_j2, ..., w_jT), a dot product is computed to quantify the similarity between the two. In our analysis, we apply this measure using word unigram tokens with stopword removal.
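To make the computation concrete, the following minimal Python sketch builds tf × idf vectors over word unigrams and compares two texts by the normalized dot product of equation (1). It is a generic illustration rather than the evaluation code used in these experiments; the tiny stopword list and the simple tokenizer are assumptions, and document frequencies are estimated from whatever collection the caller supplies.

import math
import re
from collections import Counter

STOPWORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}   # illustrative only

def tokens(text):
    """Word unigram tokens, lower-cased, with stopwords removed."""
    return [w for w in re.findall(r"[a-z]+", text.lower()) if w not in STOPWORDS]

def similarity(text_a, text_b, collection):
    """Vector space similarity between two texts, with document frequencies
    estimated from `collection` (a list of texts, e.g. the library's pages)."""
    n = len(collection)
    df = Counter()
    for doc in collection:
        df.update(set(tokens(doc)))

    def weights(text):
        tf = Counter(tokens(text))
        # tf x idf, as in equation (1); terms absent from the collection are skipped
        return {t: f * math.log(n / df[t]) for t, f in tf.items() if df.get(t)}

    wa, wb = weights(text_a), weights(text_b)
    dot = sum(w * wb.get(t, 0.0) for t, w in wa.items())
    norm = (math.sqrt(sum(w * w for w in wa.values())) *
            math.sqrt(sum(w * w for w in wb.values())))
    return dot / norm if norm else 0.0

Applied to an OCR result and the corresponding library text, a score near 1 indicates that the two share essentially the same weighted vocabulary, regardless of reading order.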
Fig. 3. Vector space similarity scores for the 250 test pages used in the OCR experiment, plotted against document rank (sorted by vector space similarity).
The similarity scores for the 250 test documents relative to their ground-truths are graphed in Fig. 3, sorted in order of decreasing similarity.
The maximum was 0.916, the minimum 0.030, and the average 0.520. While these values may seem low, one must keep in mind several important mitigating factors: (1) the severity of the test, (2) the "ground-truth" may itself contain OCR errors, and (3) vector space similarity is not identical to OCR accuracy. A more detailed examination of the results for the five best and five worst documents, as listed in Tables 2 and 3, provides subjective confirmation that this paradigm makes useful distinctions between "easy" and "hard" pages for the system under study.
Table 2. Highest vector space similarity scores for the OCR experiment.
(Pages are identified by http://cdl.library.cornell.edu/cgi-bin/moa/moa-cgi?notisid= followed by the listed id.)

Vector Space Similarity   Document                    Note
0.916                     ABP2287-0047-195, p. 766    two column layout.
0.889                     ABK2934-0016-34, p. 192     two column layout.
0.883                     ABR0102-0171-4, p. 97       two column layout with ruling line down gutter.
0.882                     ABP2287-0042-55, p. 251     two column layout.
0.874                     ABP2287-0056-192, p. 929    two columns headed by centered title and abstract, text starts with ornate drop-cap.
Table 3. Lowest vector space similarity scores for the OCR experiment.
(Pages are identified by http://cdl.library.cornell.edu/cgi-bin/moa/moa-cgi?notisid= followed by the listed id.)

Vector Space Similarity   Document                    Note
0.045                     ABR0102-0045-13, p. 661     two column layout, scan looks light, ground-truth also noisy.
0.038                     ABS1821-0024-102, p. 46     three columns (newspaper format), page looks slightly skewed, irregular line spacing.
0.036                     ABK4014-0008-45, p. 285     two columns, obvious skew, small font, tight spacing.
0.036                     ABS1821-0006-20, p. 6       three columns, line drawing in middle of page, scan skewed and light, ground-truth also noisy.
0.030                     ANU4519-0130, p. 881        two columns (index page from pension records including many proper names), sparse text, obvious skew.
Taking a closer look at the best- and worst-case documents, shown in Figs. 4 and 5, respectively, we note that the ground-truth supplied by MOA
is generally "cleaner" than the output from pagereader, although it is by no means error-free. It is easy to understand why the OCR system under test behaved as it did, given the relative qualities of the two input images; the page in Fig. 4 is clear and dark and printed using a relatively large font, while the page in Fig. 5 is light and sparse, uses a small font, and exhibits visible skew. We also see the impact that de-columnization would have on attempts to compare the two representations using simple string matching, the traditional method for determining OCR accuracy; the output from pagereader does not correspond on a line-by-line basis to the ground-truth, so the edit distance will not be representative of the character-level errors that occurred.
Fig. 4. The best-case document, MOA ground-truth, and OCR results from pagereader (ABP2287-0047-195, p. 766).
Fig. 5. The worst-case document, MOA ground-truth, and OCR results from pagereader (ANU4519-0130, p. 881).

To examine quantitatively how the vector space evaluation measure relates to string edit distance, we corrected by hand the text files provided by Making of America for the best- and worst-case documents. We then computed the normalized edit distance between this "true" ground-truth
and the corresponding OCR results, as well as the ground-truth as originally supplied. Table 4 presents this comparison. Vector space similarity appears roughly similar to edit distance when the OCR accuracy is high, but much more severe in its judgement when the accuracy is low. Whether the verified error rate for the worst-case ground-truth, 6.6%, is representative of the collection as a whole is a matter for further investigation.

Table 4. Comparison of vector space similarity and string edit distance for the best- and worst-case documents.
(Pages are identified by http://cdl.library.cornell.edu/cgi-bin/moa/moa-cgi?notisid= followed by the listed id.)

Document                   Vector Space    Normalized Edit Distance (corrected ground-truth vs.)
                           Similarity      OCR Result      MOA Ground-Truth
ABP2287-0047-195, p. 766   0.916           0.968           0.997
ANU4519-0130, p. 881       0.030           0.547           0.934
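For readers who wish to reproduce this kind of comparison, the sketch below computes a normalized edit distance between two texts using the standard Levenshtein dynamic program. It is a generic illustration, not the chapter's evaluation code; in particular, normalizing by the longer string's length is one common convention among several.

def edit_distance(a, b):
    """Levenshtein distance between strings a and b (single-character
    insertions, deletions, and substitutions, each costing 1)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]
        for j, cb in enumerate(b, 1):
            curr.append(min(prev[j] + 1,                  # delete ca
                            curr[j - 1] + 1,              # insert cb
                            prev[j - 1] + (ca != cb)))    # substitute ca with cb
        prev = curr
    return prev[-1]

def normalized_edit_distance(a, b):
    """Edit distance scaled into [0, 1] by the length of the longer string."""
    if not a and not b:
        return 0.0
    return edit_distance(a, b) / max(len(a), len(b))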
Proof-reading and correcting more than a small number of sample documents would be tedious (although as a service provided to the international research community, it could perhaps be justified). To increase our confidence that vector space is a useful comparison measure in this case, to some approximation, we turned instead to a simulation study. Taking the ground-truth for 1,000 documents randomly selected from MOA, we ran the files through a filter that injects "noise" at a predetermined rate. The basic editing operations performed were single character deletions, insertions, and substitutions. We then computed both the vector space similarity and the normalized string edit distance between the original and modified files, graphing the results in Fig. 6. There is clearly a strong relationship between the two, suggesting it may be reasonable to use vector space as a surrogate for edit distance in certain applications, as we have done here.
Fig. 6. Results of the study comparing string edit distance vs. vector space similarity.
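A noise filter of the kind used in this simulation is straightforward to approximate. The Python sketch below applies single-character deletions, insertions, and substitutions at a predetermined rate; the choice of alphabet and the uniform selection among the three operations are assumptions for illustration, not details taken from the chapter.

import random
import string

def inject_noise(text, rate, alphabet=string.ascii_lowercase, seed=None):
    """Corrupt `text`: at each position, with probability `rate`, either delete
    the character, insert a random character before it, or substitute it."""
    rng = random.Random(seed)
    out = []
    for ch in text:
        if rng.random() < rate:
            op = rng.choice(("delete", "insert", "substitute"))
            if op == "delete":
                continue
            if op == "insert":
                out.append(rng.choice(alphabet))
                out.append(ch)
            else:  # substitute
                out.append(rng.choice(alphabet))
        else:
            out.append(ch)
    return "".join(out)

One could then plot normalized_edit_distance(original, noisy) against the vector space similarity of the same pair of files, as in Fig. 6.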
4.2. Table Detection
Our past work on table detection considered input in both ASCII and image format. In the latter case, we tested our techniques on a relatively small number of pages that we knew in advance contained tables. The focus was on whether the algorithm could correctly delimit the boundaries of a table and its various component regions. Another important aspect of the detection problem, however, is deciding when tables are present in an
unknown input. Indeed, for many real applications this must be the first step and hence becomes a key issue.

As reported elsewhere,33 our approach to table detection is to formulate the task of partitioning the input into tables as an optimization problem that can be solved using dynamic programming. Say that tab[i,j] is a measure of our confidence when text lines i through j are interpreted as a single table. Let merit_pre(i, [i+1,j]) be the merit of prepending line i to the table extending from line i+1 to line j, and merit_app([i,j-1], j) be the merit of appending line j to the table extending from line i to line j-1. Then:

tab[i,j] = \max \begin{cases} merit_{pre}(i, [i+1,j]) + tab[i+1,j] \\ tab[i,j-1] + merit_{app}([i,j-1], j) \end{cases}     (2)

The merit functions are based on white space correlation. This defines an upper triangular matrix with values for all possible table starting and ending positions.

The partitioning of the input into tables can then be expressed as an optimization problem. Let score[i,j] correspond to the best way to interpret lines i through j as some number of (i.e., zero or more) tables. The computation is:

score[i,j] = \max \begin{cases} tab[i,j] \\ \max_{i \le k < j} \left( score[i,k] + score[k+1,j] \right) \end{cases}
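To make the recurrences concrete, here is a minimal Python sketch of the two dynamic programs, assuming user-supplied merit functions (the chapter's white-space-correlation merits are not reproduced) and a simplifying base case of zero confidence for a single-line table; the split-point recurrence for score follows the standard partitioning form given above.

from functools import lru_cache

def table_confidence(n, merit_pre, merit_app):
    """tab[i][j]: confidence that text lines i..j form a single table, per Eq. (2).
    merit_pre(i, (i+1, j)) and merit_app((i, j-1), j) are supplied by the caller;
    tab[i][i] = 0 for single lines is an assumption, not a detail from the chapter."""
    tab = [[0.0] * n for _ in range(n)]
    for length in range(2, n + 1):
        for i in range(n - length + 1):
            j = i + length - 1
            tab[i][j] = max(merit_pre(i, (i + 1, j)) + tab[i + 1][j],
                            tab[i][j - 1] + merit_app((i, j - 1), j))
    return tab

def best_partition(n, tab):
    """score[i][j]: best way to read lines i..j as zero or more tables."""
    @lru_cache(maxsize=None)
    def score(i, j):
        if i > j:
            return 0.0
        best = tab[i][j]                                    # lines i..j as one table
        for k in range(i, j):                               # or split between k and k+1
            best = max(best, score(i, k) + score(k + 1, j))
        return best
    return score(0, n - 1)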