Search-Based Applications At the Confluence of Search and Database Technologies
Synthesis Lectures on Information Concepts, Retrieval, and Services
Editor: Gary Marchionini, University of North Carolina, Chapel Hill
Synthesis Lectures on Information Concepts, Retrieval, and Services is edited by Gary Marchionini of the University of North Carolina. The series will publish 50- to 100-page publications on topics pertaining to information science and applications of technology to information discovery, production, distribution, and management. The scope will largely follow the purview of premier information and computer science conferences, such as ASIST, ACM SIGIR, ACM/IEEE JCDL, and ACM CIKM. Potential topics include, but are not limited to: data models, indexing theory and algorithms, classification, information architecture, information economics, privacy and identity, scholarly communication, bibliometrics and webometrics, personal information management, human information behavior, digital libraries, archives and preservation, cultural informatics, information retrieval evaluation, data fusion, relevance feedback, recommendation systems, question answering, natural language processing for retrieval, text summarization, multimedia retrieval, multilingual retrieval, and exploratory search.
Search-Based Applications - At the Confluence of Search and Database Technologies Gregory Grefenstette and Laura Wilber 2010
Information Concepts: From Books to Cyberspace Identities Gary Marchionini 2010
Estimating the Query Difficulty for Information Retrieval David Carmel and Elad Yom-Tov 2010
iRODS Primer: Integrated Rule-Oriented Data System Arcot Rajasekar, Reagan Moore, Chien-Yi Hou, Christopher A. Lee, Richard Marciano, Antoine de Torcy, Michael Wan, Wayne Schroeder, Sheau-Yen Chen, Lucas Gilbert, Paul Tooby, and Bing Zhu 2010
Collaborative Web Search: Who, What, Where, When, and Why Meredith Ringel Morris and Jaime Teevan 2009
Multimedia Information Retrieval Stefan Rueger 2009
Online Multiplayer Games William Sims Bainbridge 2009
Information Architecture: The Design and Integration of Information Spaces Wei Ding and Xia Lin 2009
Reading and Writing the Electronic Book Catherine C. Marshall 2009
Hypermedia Genes: An Evolutionary Perspective on Concepts, Models, and Architectures Nuno M. Guimarães and Luís M. Carriço 2009
Understanding User-Web Interactions via Web Analytics Bernard J. (Jim) Jansen 2009
XML Retrieval Mounia Lalmas 2009
Faceted Search Daniel Tunkelang 2009
Introduction to Webometrics: Quantitative Web Research for the Social Sciences Michael Thelwall 2009
Exploratory Search: Beyond the Query-Response Paradigm Ryen W. White and Resa A. Roth 2009
New Concepts in Digital Reference R. David Lankes 2009
Automated Metadata in Multimedia Information Systems: Creation, Refinement, Use in Surrogates, and Evaluation Michael G. Christel 2009
Copyright © 2011 by Morgan & Claypool
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or transmitted in any form or by any means—electronic, mechanical, photocopy, recording, or any other except for brief quotations in printed reviews, without the prior permission of the publisher.
Search-Based Applications - At the Confluence of Search and Database Technologies Gregory Grefenstette and Laura Wilber www.morganclaypool.com
ISBN: 9781608455072 paperback
ISBN: 9781608455089 ebook
DOI 10.2200/S00320ED1V01Y201012ICR017
A Publication in the Morgan & Claypool Publishers series SYNTHESIS LECTURES ON INFORMATION CONCEPTS, RETRIEVAL, AND SERVICES
Lecture #17
Series Editor: Gary Marchionini, University of North Carolina, Chapel Hill
Series ISSN
Synthesis Lectures on Information Concepts, Retrieval, and Services
Print 1947-945X Electronic 1947-9468
Search-Based Applications At the Confluence of Search and Database Technologies
Gregory Grefenstette and Laura Wilber Exalead, S.A.
SYNTHESIS LECTURES ON INFORMATION CONCEPTS, RETRIEVAL, AND SERVICES #17
Morgan & Claypool Publishers
ABSTRACT We are poised at a major turning point in the history of information management via computers. Recent evolutions in computing, communications, and commerce are fundamentally reshaping the ways in which we humans interact with information, and generating enormous volumes of electronic data along the way. As a result of these forces, what will data management technologies, and their supporting software and system architectures, look like in ten years? It is difficult to say, but we can see the future taking shape now in a new generation of information access platforms that combine strategies and structures of two familiar – and previously quite distinct – technologies, search engines and databases, and in a new model for software applications, the Search-Based Application (SBA), which offers a pragmatic way to solve both well-known and emerging information management challenges as of now. Search engines are the world’s most familiar and widely deployed information access tool, used by hundreds of millions of people every day to locate information on the Web, but few are aware they can now also be used to provide precise, multidimensional information access and analysis that is hard to distinguish from current database applications, yet endowed with the usability and massive scalability of Web search. In this book, we hope to introduce Search Based Applications to a wider audience, using real case studies to show how this flexible technology can be used to intelligently aggregate large volumes of unstructured data (like Web pages) and structured data (like database content), and to make that data available in a highly contextual, quasi real-time manner to a wide base of users for a varied range of purposes. We also hope to shed light on the general convergences underway in search and database disciplines, convergences that make SBAs possible, and which serve as harbingers of information management paradigms and technologies to come.
KEYWORDS search-based applications, search engines, semantic technologies, natural language processing, human-computer information retrieval, data retrieval, online analytical processing, OLAP, data integration, alternative data access platforms, unified information access, NoSQL, mash-up technologies
Contents
Acknowledgments
Glossary
1 Search Based Applications
  1.1 Introduction
    1.1.1 What is a Search Based Application?
  1.2 High Impact, Low Risk Solution for Businesses
  1.3 Fertile Ground for Interdisciplinary Research
  1.4 A Valuable Tool for Database Administrators
  1.5 New Opportunities for Search Specialists
  1.6 New Flexibility for Software Developers
    1.6.1 Lecture Roadmap
2 Evolving Business Information Access Needs
  2.1 Changing Times
  2.2 The Need for High Performance and Scalability
  2.3 The Need for Unified Access to Global Information
  2.4 The Need for Simple Yet Secure Access
3 Origins and Histories
  3.1 Search Engines
  3.2 Databases
  3.3 What has Changed Recently
    3.3.1 Search Engines Enter the Enterprise
    3.3.2 Databases Go Online
    3.3.3 Structural and Conceptual Changes
4 Data Models & Storage
  4.1 Search Engines
    4.1.1 Conceptual Data Model
    4.1.2 Data Storage
    4.1.3 Storage Framework
  4.2 Databases
    4.2.1 Conceptual Data Model
    4.2.2 Data Storage
    4.2.3 Storage Framework
  4.3 What has Changed Recently
    4.3.1 Search Engines
    4.3.2 Databases
5 Data Collection/Population
  5.1 Search Engines
    5.1.1 Collection
    5.1.2 Updating
  5.2 Databases
    5.2.1 Creation/Collection
    5.2.2 Updating
  5.3 What has Changed
    5.3.1 Search Engines
    5.3.2 Databases
6 Data Processing
  6.1 Search Engines
    6.1.1 Natural Language Processing
    6.1.2 Relevancy Criteria
  6.2 Databases
  6.3 What has Changed
    6.3.1 Search Engines
    6.3.2 Databases
7 Data Retrieval
  7.1 Search Engines
    7.1.1 Querying
    7.1.2 Output
  7.2 Databases
    7.2.1 Querying
    7.2.2 Output
  7.3 What’s Changed?
    7.3.1 Search Engines
    7.3.2 Databases
8 Data Security, Usability, Performance, Cost
  8.1 Search Engines
  8.2 Databases
  8.3 What has Changed
    8.3.1 Search Engines
9 Summary Evolutions and Convergences
  9.1 SBA-Enabling Search Engine Evolutions
    9.1.1 Data Model
    9.1.2 Data Storage
    9.1.3 Data Collection
    9.1.4 Data Processing
    9.1.5 Data Retrieval & Output
    9.1.6 Data Security, Usability, Performance, Cost
  9.2 Convergence
10 SBA Platforms
  10.1 What is an SBA Platform?
  10.2 Information Access Platforms
  10.3 SBA Platforms: Market Leaders
  10.4 SBA Platforms: Other Vendors
  10.5 SBA Vendors: COTS Applications
11 SBA Uses & Preconditions
  11.1 When Are SBAs Used?
  11.2 How Are SBAs Used?
12 Anatomy of a Search Based Application
  12.1 SBAs for Structured Data
    12.1.1 Data Collection
    12.1.2 Data Processing
    12.1.3 Data Updates
    12.1.4 Data Retrieval & Analysis
  12.2 SBAs for Unstructured Content
    12.2.1 Data Collection
    12.2.2 Data Processing
    12.2.3 Data Updates
    12.2.4 Data Retrieval & Analysis
  12.3 SBAs for Hybrid Content
13 Case Study: GEFCO
  13.1 Background
  13.2 A Track & Trace Solution
  13.3 Existing Drawbacks
  13.4 Opting for a Search Based Application
  13.5 First prototypes
  13.6 Deployment
  13.7 Future
14 Case Study: Urbanizer
  14.1 Background
  14.2 The Urbanizer Solution
  14.3 How Urbanizer Works
  14.4 What’s Next
15 Case Study: National Postal Agency
  15.1 Customer Service SBA
    15.1.1 Background
    15.1.2 Deployment
  15.2 Operational Business Intelligence (OBI) SBA
    15.2.1 Background
    15.2.2 Deployment
  15.3 Sales Information SBA for Telemarketing
    15.3.1 Background
    15.3.2 Deployment
16 Future Directions
  16.1 The Influence of the Deep Web
    16.1.1 Surfacing Structured Data
    16.1.2 Opening Access to Multimedia Content
  16.2 The Influence of the Semantic Web
  16.3 The Influence of the Mobile Web
    16.3.1 Mission-Based IR
    16.3.2 Innovation in Visualization
  16.4 ...And Continuing Database/Search Convergence
Bibliography
Authors’ Biographies
Acknowledgments
We would like to thank Gary Marchionini and Diane Cerra for inviting us to participate in this timely and important lecture series, with a special thank you to Diane for her assistance and patience in guiding us through the publication process. We would also like to thank Morgan & Claypool’s reviewers, including Susan Feldman, Stephen Arnold and John Tait, for their thoughtful suggestions and comments on our manuscript. Ms. Feldman and Mr. Arnold are constant sources of insight for all of us working in search and information access-related disciplines, and we welcome Mr. Tait’s remarks based on his long IR research experience at the University of Sunderland and his more recent efforts at advancing research in IR for patents and other large scale collections at the Information Retrieval Facility. In addition, we are grateful to our colleagues and managers at Exalead for allowing us time to work on this lecture, and for providing valuable feedback on our draft manuscript, especially Olivier Astier, Stéphane Donzé and David Thoumas. We would also like to thank our partners and customers. They are the source of the examples provided in this book, and they have played a pioneering role in expanding the boundaries of applied search technologies, in general, and search-based applications, in particular. Finally, we would like to thank our families. Their love sustains us in all we do, and we dedicate this book to them.
Gregory Grefenstette and Laura Wilber December 2010
Glossary
ACID
Constraints on a database for achieving Atomicity, Consistency, Isolation and Durability
Agility
The ease with which a computer application can be altered, improved, or extended
API
Application Programming Interface, specifies how to call a computer program, what arguments to use, and what you can expect as output
Application layer
Part of the Open System Interconnection model, in which an application interacts with a human user, or another application
Atomicity
The idea that a database transaction either succeeds or fails in its entirety
Availability
The percentage of time that data can be read or used.
Batch
A computer task that is programmed to run at a certain time (usually at night) with no human intervention
B2C
Business to Customer; B2C websites offer goods or services directly to users
B+ tree
A block-oriented data structure for efficient insertion and removal of data nodes
BI
Business Intelligence, views on data that aid users with business planning and decision making
BigTable
An internal data storage system used by Google, handles multidimensional key-value pairs
BSON
Binary JSON
Business application
Any information processing application used in running a business
Cache
A rapid computer memory where frequently or recently used data is temporarily stored
CAP theorem
One cannot achieve Consistency, Availability, and Partition tolerance at the same time
Category
A flat or hierarchic semantic dimension added to a document, or part of a document
Categorization
Assigning, usually through statistical means, one or more categories to text
CDM
Customer Data Management
Cloud services
Computer applications that are executed on computers outside the enterprise rather than in-house. Examples are SalesForce, Google Apps, Yahoo mail, etc.
Clustering
Grouping documents according to content similarity
CMS
Content Management System
Consistency
A quality of an information system in which only valid data is recorded; that is, there are not two conflicting versions of the same data
Connector
A program that extracts information from a certain file format, or from a database
Consolidation
Making all the data concerning one entity available in one output
COTS
Commercial off-the-shelf software
Crawl
Fetching web pages for indexing by following URLs found in each page
CRM
Customer Relationship Management, applications used by businesses to interact with customers
CSIS
Customer Service Information System
Data integration
Merging data from different data sources or different information systems
Data mart
A subset of data found in an enterprise information system, relevant for a specific group or purpose
Data warehouse
A database which is used to consolidate data from disparate sources
DBA
Database administrator, the person who is responsible for maintaining (and often designing) an organization’s database(s)
Deep Web
Web pages that are dynamically generated as a result of form input and/or database querying
Directory
A listing of the files or websites in a particular storage system
DIS
Decision Intelligence System, a computer-based system for helping decision making
Document model
A model of seeing a database entity as a single persistent document, composed of typed fields and categories corresponding to the entity’s attributes
Dublin Core Metadata
A standard for metadata associated with documents, such as Title, Creator, Publisher, etc.
Durability
A database quality that means that successfully completed transactions must persist (or be recoverable) in the case of a system failure
EDI
Electronic Data Interchange, an early database communication system
ETL
Extract-Transform-Load, any method for extracting all or part of a database and storing it in another database
Enterprise Search
Searching access-controlled, structured and unstructured data found within the enterprise
ERP
Enterprise Resource Planning
Evolutive Data Model
Model that can be easily extended with new fields or data types without rebuilding the entire data structure
Facet
A dimension of meaning that can be used for restricting search; for example, shirts and coats are two facets that could be found on a shopping site
Field
A labeled part of a document in a search engine. Fields can be typed to contain text, numbers, dates, GPS coordinates, or categories
Firewall
A computer-implemented protection that isolates internal company data from outside access
File server
A service that provides sequential or direct access to computer files
Full-text engine
A system for searching any of the words found in documents, rather than just a set of manually assigned keywords
Garbage collection
A process for recovering memory, usually by recognizing deleted or out-of-date data
Gartner
An information technology research and advisory firm that reports on technology issues
GPS
Global Positioning System, a system of satellites for geolocating a point on the globe
Hash table
Hashing converts a data item into a single number, and the hash table maps this number to a list of items
Heuristics
Methods based more on demonstrated performance than on theory; weighting words by their inverse frequency in a collection is an example
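As a minimal sketch of the word-weighting heuristic named in this entry, the Python function below weights each word by the logarithm of the inverse of its document frequency. The function name, the toy collection, and the specific formula log(N / df) are assumptions chosen for illustration; engines in practice use many variants of this weighting.

```python
import math
from collections import Counter

def inverse_document_frequency(documents):
    """Return {word: log(N / df)} for a list of tokenized documents."""
    n_docs = len(documents)
    # Count the number of documents containing each word (not total occurrences).
    doc_freq = Counter(word for doc in documents for word in set(doc))
    return {word: math.log(n_docs / df) for word, df in doc_freq.items()}

# Toy collection, invented for illustration.
docs = [
    ["search", "engine", "index"],
    ["database", "index", "table"],
    ["search", "database", "index"],
]
weights = inverse_document_frequency(docs)
```

A word that occurs in every document ("index" here) receives a weight of zero, while a word found in a single document ("engine") receives the highest weight, which is why rare words tend to dominate relevance scoring.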
HTTP
HyperText Transfer Protocol, an application layer protocol for accessing web pages
IDC
International Data Corporation, a global provider of market intelligence and analysis concerning information technology
ILM
Information Lifecycle Management
IMAP
Internet Message Access Protocol, a protocol for retrieving email messages from a mail server
Index, inverted
A data structure that contains lists of words with pointers to where the words are found in documents
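To illustrate the inverted index entry, the following Python sketch builds one over a toy collection. The function name, the whitespace tokenization, and the (document id, word position) pointer format are assumptions made for illustration, not a description of any particular engine.

```python
from collections import defaultdict

def build_inverted_index(documents):
    """Map each word to a list of (doc_id, position) pointers."""
    index = defaultdict(list)
    for doc_id, text in sorted(documents.items()):
        # Naive tokenization: lowercase and split on whitespace.
        for position, word in enumerate(text.lower().split()):
            index[word].append((doc_id, position))
    return dict(index)

# Toy collection, invented for illustration.
docs = {
    1: "search engines index documents",
    2: "databases index rows",
}
index = build_inverted_index(docs)
```

Looking up `index["index"]` returns the pointers for both documents without scanning either text, which is the property that makes full-text search fast.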
Index slice
One section of an inverted index which can be distributed over many different computer stores
Intranet
A secure network that gives authorized users Web-style access to an organization’s information assets (e.g., internal documents and web pages)
IR
Information Retrieval, the study of how to index and retrieve information, usually from unstructured text
IS
Information System, a generic term for any computer system for storing and retrieving information
Isolation
The database constraint specifying that data involved in a transaction are isolated from (inaccessible to) other transactions until the transaction is completed to avoid conflicts and overwrites
IT
Information Technology, a generic term covering all aspects of using computers to store and manipulate information
JDBC
Java Database Connectivity, a Java version of ODBC
Join
In a relational database, gathering together data contained in different tables
JSON
JavaScript Object Notation, a standard for exchanging data between systems
Key-value store
A data storage and retrieval system in which a key (identifying an entity) is linked to the one or more values associated with that entity. This allows rapid lookup of values associated with an entity, but does not allow joins on other fields
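The trade-off described in this entry can be sketched in Python, with a plain dictionary standing in for the store. The keys, the records, and the two helper functions are hypothetical and serve only to contrast a lookup by key with a search on another field.

```python
# A plain dict stands in for the key-value store (an assumption for illustration).
store = {
    "customer:42": {"name": "Ada", "city": "Paris"},
    "customer:43": {"name": "Bob", "city": "Lyon"},
}

def get(key):
    """Lookup by key: one direct access, independent of the store's size."""
    return store.get(key)

def find_by_field(field, value):
    """No index or join on other fields, so every record has to be scanned."""
    return [key for key, record in store.items() if record.get(field) == value]
```

Here `get("customer:42")` is a single access, whereas `find_by_field("city", "Lyon")` must visit every record, since the store keeps no structure over fields other than the key.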
Mash-up
A software application that dynamically aggregates information from many different sources, or output from many processes, in a single screen
MDM
Master Data Management, a system of policies, processes and technologies designed to maintain the accuracy and consistency of essential data across many data silos
Metadata
Typed data associated with a document, for example, Author, Date, Category
Mobile Web
Web pages accessible through a mobile device such as a smartphone
MySQL
A popular open source relational database
Normalized relational schema
A model for a relational database that is designed to prevent redundancies that can cause anomalies when inserting, updating, and deleting data
NoSQL
Not Only SQL, an umbrella term for large scale data storage and retrieval systems that use structures and querying methodologies that are different from those of relational database systems
OBI
Operational Business Intelligence, data reporting and analysis that supports decision making concerning routine, day-to-day operations
OCR
Optical Character Recognition, a technology used for converting paper documents or text encapsulated in images into electronic text, usually with some noise caused by the conversion
ODBC
Open Database Connectivity, a middleware for enabling and managing exchanges between databases
Offloading
Extracting information from a database application and storing it in a search engine application
OLAP
Online Analytical Processing, tools for analyzing data in databases
OLTP
Online Transaction Processing
Ontology
A taxonomy with rules that can deduce links not necessarily present in the taxonomy
Partition tolerance
Means that a distributed database can still function if some of its nodes are no longer available
Performance
The measure of a computer application’s rapidity, throughput, availability, or resource utilization
PHP
PHP: Hypertext Preprocessor, a language for programming web pages
PLM
Product Lifecycle Management, systems which allow for the management of a product from design to retirement
Plug-and-play
Modules that can be used without any reprogramming, “out of the box”
POC
Proof of concept, an application that proves that something can be done, though it may not be optimized for performance
Portal
A web interface to a data source
Primary key
In a relational database, a value corresponding to a unique entity, that allows tables to be joined for a given entity
RDBMS
Relational database management system
Redundancy
Storing the same data in two different places in a database or information system. This can cause problems of consistency if one of the values is changed and not the other
Relational model
A model for databases in which data is represented as tables. Some values, called primary keys, link tables together
Relevancy
For a given query, a heuristically determined score of the supposed pertinence of a document to the query
REST
Representational State Transfer, protocol used in web services, in which no state is preserved, but in which every operation of reading or writing is self sufficient
RFID
Radio Frequency Identification, systems using embedded chips to transmit information
RSS
Really Simple Syndication, an XML format for transmitting frequently updated data
R tree
An efficient data structure for storing GPS-indexed points and finding all the points in a given radius around a point
RDF
Resource Description Framework, a format for representing data as sets of triples, used in semantic web representations
SBA
Search Based Application, an information access or analysis application built on a search engine, rather than on a database
SCM
Supply Chain Management
Scalability
The desirable quality of being able to treat larger and larger data sets without a decrease in performance, or rise in cost
Search engine
A computer program for indexing and searching in documents
Semantic Web
Collection of web pages that are annotated with machine readable descriptions of their content
Semistructured data
Data found in places where the data type can be surmised, such as in explicitly labeled metadata, or in structured tables on web pages
SEO
Search engine optimization, strategies that help a web page owner to improve a site’s ranking in common web search engines
SERP
Search engine results page, the output of a query to a search engine
Silo
An imagery-filled term for an isolated information system
SMART system
An early search engine developed by Gerard Salton at Cornell
SOAP
Simple Object Access Protocol, a format for transmitting data between services
Social media
Data uploaded by identified users, such as in YouTube, Facebook, Flickr
SQL
Structured Query Language, commonly used language for manipulating relational databases
Structured data
Data organized according to an explicit schema and broken down into discrete units of meaning, with units represented using consistent data types and formats (databases, log files, spreadsheets)
SVM
Support vector machine, used in classification
Table
Part of a relational database, a body of related information. Each row of the table corresponds to one entity, and each column, to some attribute of this entity
Taxonomy
A hierarchically typed system of entities, such as mammals being part of animals being part of living beings
TCO
Total cost of ownership, how much an application costs when all implicit and explicit costs are factored in over time
Timestamp
A chronological value indicating when some data was created
Top-k
The k highest ranked responses in a database system that can rank answers to a query
Transaction
In databases, a sequence of actions that should be performed as an uninterruptible unit, for example, purchasing a seat on a flight
Unstructured data
Data that is not formally or consistently organized, such as textual data (email, reports, documents) and multimedia content
URL
Uniform Resource Locator, the address of a web page
Usability
The desirable quality of being able to be used by a large population of users with little or no training
Vertical application
An application built for a specific domain, such as pharmaceuticals, finance, or manufacturing. A horizontal application could be used in a number of different domains.
XML
eXtensible Markup Language, a standard for including metadata in a document
W3C
World Wide Web Consortium
WYSIWYG
What You See Is What You Get
YPG
Yellow Pages Group, Canada
CHAPTER
1
Search Based Applications
1.1
INTRODUCTION
Figure 1.1: Can you see the search engine behind these screens?
Management of information via computers is undergoing a revolutionary change as the frontier between databases and search engines is disappearing. Against this backdrop of nascent convergence, a new class of software has emerged that combines the advantages of each technology, right now, in Search Based Applications. Until just a short while ago, the lines were still relatively clear. Database software concentrated on creating, storing, maintaining and accessing structured data, where discrete units of information (e.g. product number, quantity available, quantity sold, date) and their relation to each other were well defined. Search engines were primarily concerned with locating a document or a bit of information within collections of unstructured textual data: short abstracts, long reports, newspaper articles, email, Web pages, etc. (classic Information Retrieval, or IR; see Chap. 3). Business applications were built on top of databases, which defined the universe of information available to the end user, and search engines were used for IR on the Web and in the enterprise.
Figure 1.2: Databases have traditionally been concerned with the world of structured data; search engines with that of unstructured data (some of these data types, like HTML pages and email messages, contain a certain level of exploitable structure, and are consequently sometimes referred to as "semi-structured").
Such neat distinctions are now falling away as the core architectures, functionality and roles of search engines and databases have begun to evolve and converge. A new generation of non-relational databases, which shares conceptual models and structures with search engines, has emerged from the world of the Web (see Chapter 4), and a new breed of search engine has arisen which provides native functionality akin to both relational and non-relational databases (described in Chapters 3-9 and listed in Chapter 10). It is this new generation engine that supports Search Based Applications, which offer precise, multi-axial information access and analysis that is virtually indistinguishable at a surface level from database applications, yet are endowed with the usability and massive scalability of Web search.
1.1.1
WHAT IS A SEARCH BASED APPLICATION?
We define a Search Based Application (SBA) as any software application built on a search engine backbone rather than a database infrastructure, and whose purpose is not classic IR, but rather mission-oriented information access, analysis or discovery.1
1 This new type of application has alternately been referred to as a "search application," "search-centric application," "extended business application," "unified information access application" and "search-based application." The latter is the label used by IDC’s Susan Feldman, one of the first industry analysts to identify SBAs as a disruptive trend and an influential force in the SBA label being adopted as the industry standard. Feldman has recently moved toward a more precise definition, limiting SBAs to "fully packaged applications" supplying "all the tools that are commonly needed for a specific task or workflow," that is to say, commercial-off-the-shelf (COTS) software [Feldman and Reynolds, 2010]. However, we prefer a broader definition to underscore one of the great benefits of the SBA model: the ability for anyone to rapidly and inexpensively develop highly specific solutions for unique contexts, and, following the same pattern as database applications, we expect both custom and COTS SBAs to flourish over the next decade.
Definition: Search Based Application
A software application that uses a search engine as the primary information access backbone, and whose main purpose is performing a domain-oriented task rather than locating a document.
Examples:
• Customer service and support
• Logistical track and trace
• Contextual advertising
• Decision intelligence
• e-Discovery
SBAs may be used to provide more intuitive, meaningful and scalable access to the content in a single database, hiding away the complexity of the database structure as data is extracted and re-purposed by search engine techniques. They may also be used to autonomously and intelligently gather together massive volumes of unstructured and structured data from an unlimited number of sources (internal or external) and to make this aggregate data available in real time to a wide base of users for a broad range of purposes.
While search engines in the SBA context complement rather than replace databases, which remain ideal tools for many types of transaction processing, this ’re-purposing’ of search engines nonetheless represents a major rupture with a 30-year tradition of database-centered software application development. In spite of the significance of this shift, the SBA trend has been unfolding largely under the radar of researchers, systems architects and software developers. However, SBAs have begun to capture the focused attention of business.2
"The elements that make search powerful are not necessarily the search box, but the ability to bring together multiple types of information quickly and understandably, in real time, and at massive scale. Databases have been the underpinning for most of the current generation of enterprise applications; search technologies may well be the software backbone of the future."
Susan Feldman, IDC LINK, June 9, 2010
2 SBAs are fueling a significant portion of the growth in the search and information access market, which IDC estimates grew at double digit rates in 2007 and 2008, and at a healthy 3.9% (to $2.1 billion) in 2009 [Feldman and Reynolds, 2010]. Gartner, Inc. estimates a compound annual growth rate of 11.7% from 2007 to 2013 for the enterprise search market [Andrews, 2010].
1.2
HIGH IMPACT, LOW RISK SOLUTION FOR BUSINESSES
SBAs offer businesses a rapid, low risk way to eliminate some of the peskiest and most common information systems (IS) problems: siloed data, poor application usability, shifting user requirements, systemic rigidity and limited scalability.
Figure 1.3: Search engine-based Sourcier makes vast volumes of structured water quality data accessible via map-based search and visualization, and ad hoc, point-and-click analysis.
Even though SBAs allow businesses to clear these hurdles and bring together large volumes of real time information in an immediately actionable form, thereby improving productivity, decision making and innovation, too many in the business community are still unaware that search engines can serve as an information integration, discovery and analysis platform. This is the reason we have written this book.
1.3
FERTILE GROUND FOR INTERDISCIPLINARY RESEARCH
We have also undertaken this project to introduce SBAs to a wider segment of the data management research community. Though the convergence of search and database technologies is gradually being recognized by this community,3 many researchers are still unaware of the pragmatic benefits of SBAs and the mutually beneficial evolutions underway in both search and database disciplines.
3 See, for example, recent workshops like Using Search Engine Technology for Information Management (USETIM’09) that was held in August 2009 at the 35th International Conference on Very Large Data Bases (VLDB09), which examines whether search engine technology can be used to perform tasks usually undertaken by databases. http://vldb2009.org/?q=node/30
However, as a group of prominent database and search scientists recently noted, exploding data volumes and usage scenarios along with major shifts in computing hardware and platforms have resulted in an "urgent, widespread need for new data management technologies," innovations that will only come about through interdisciplinary research.4
Figure 1.4: This Akerys portal generates personalized, real-time real estate market intelligence based on unstructured online classifieds and in-house databases.
1.4
A VALUABLE TOOL FOR DATABASE ADMINISTRATORS
Like their research counterparts, many Database Administrators (DBAs) are also unfamiliar with SBAs. We hope this book will raise awareness of SBAs among DBAs as well, because SBAs offer these professionals a fast and non-intrusive way to offload overtaxed systems5 and to reveal the full richness of the data those systems contain, opening database content up for free-wheeling discovery and analysis, and enabling it to be contextualized with external Web, database and enterprise content.
4 From The Claremont Report on Database Research, the summary report of the May 2008 meeting of a group of leading database and data management researchers who meet every five years to discuss the state of the research field and its impacts on practice: http://db.cs.berkeley.edu/claremont/claremontreport08.pdf
5 Offloading a database means extracting all the data that a user might want to access and indexing a copy of this information in a search engine. The term offloading refers to the fact that search requests no longer access the original database, whose processing load is hence reduced.
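To make the idea of offloading more tangible, here is a minimal sketch in Python. It is an illustration under stated assumptions rather than a description of any particular product: the `products` table, its columns, and the `offload` and `search` helpers are hypothetical, and a plain in-memory dictionary stands in for a real search engine index.

```python
import sqlite3
from collections import defaultdict


def offload(conn, table, key_column, text_columns):
    """Copy rows out of a database table into an in-memory inverted index.

    Hypothetical helper: table and column names are trusted here and are
    interpolated directly, which is acceptable only in a sketch.
    """
    index = defaultdict(set)
    columns = ", ".join([key_column] + text_columns)
    for key, *values in conn.execute(f"SELECT {columns} FROM {table}"):
        for value in values:
            for word in str(value).lower().split():
                index[word].add(key)
    return index


def search(index, word):
    """Answer a keyword query from the index copy, not from the database."""
    return sorted(index.get(word.lower(), set()))


# A toy source database (assumed schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE products (id INTEGER PRIMARY KEY, name TEXT, description TEXT)")
conn.executemany(
    "INSERT INTO products VALUES (?, ?, ?)",
    [
        (1, "Water filter", "removes chlorine from tap water"),
        (2, "Garden hose", "flexible hose for garden water supply"),
    ],
)
conn.commit()

index = offload(conn, "products", "id", ["name", "description"])
conn.close()  # later searches never touch the original database

print(search(index, "water"))
print(search(index, "hose"))
```

Once the copy has been indexed, queries are served entirely from the index, which is the sense in which the load on the original database is reduced.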
1.5
NEW OPPORTUNITIES FOR SEARCH SPECIALISTS
For search specialists who are not yet familiar with SBAs, we hope to introduce them to this significant new way of using search technology to improve our day-to-day personal and professional lives, and to make them aware of the new opportunities for scientific advancement and entrepreneurship awaiting as we seek ways to improve the performance of search engines in the context of SBA usage.
1.6
NEW FLEXIBILITY FOR SOFTWARE DEVELOPERS
We also hope to make software developers aware of the new options SBAs offer: one doesn’t always need to access an existing database (or create a new one) to develop business applications or to meticulously identify all user needs in advance of programming, and one need not settle for applications that must be modified every time these needs or source data change.
1.6.1
LECTURE ROADMAP
While this diversity of audiences and the short format of the book necessitate a surface treatment of many issues, we will consider our mission accomplished if each of our readers walks away with a solid (if basic) understanding of the significance, function, capabilities and limitations of SBAs, and a desire to go forth and learn more. To begin, we’ll first take a look at the ways in which information access needs have changed, then provide a comparative view of ways in which search engines and databases work and how each has evolved. We’ll then explain how SBAs work and how and when they are being used, including presenting several case studies. Finally, we will situate this shift within the larger context of evolutions taking place on the Web, including conceptions of the Deep Web, the Semantic Web, and the Mobile Web, and what these evolutions may mean for the next generation of SBAs.
CHAPTER
2
Evolving Business Information Access Needs
2.1
CHANGING TIMES
Figure 2.1: The 1946 ENIAC, arguably the first general-purpose electronic computer, weighed in at 30 tons and consumed 63 sq. meters of floor space. To celebrate ENIAC’s 50th birthday, a University of Pennsylvania team integrated the whole of ENIAC on a 7x5 sq. mm chip. (U.S. Army Photo, courtesy Harold Breaux.)
Before we examine search and database technologies in more detail (paying particular attention to recent evolutions giving rise to Search Based Applications), it’s important to first understand the changes in the business information landscape which are driving these evolutions.
Globalization, the Internet, new data capture technologies (e.g., barcode scanners, RFID), GPS, Cloud services, mobile computing, 3D and virtualization... a whole host of evolutions over the past two decades have resulted in a veritable explosion in the volume of data businesses must manage, and near-runaway complexity in enterprise information ecosystems. Data silos are mushrooming at an impossible pace, and the number and types of users and devices interacting with organizations’ information systems are proliferating. While opinions may vary as to specific recommendations for addressing these challenges, it’s clear that, at a minimum, organizations need:
• Better ways to manage large data volumes (improved performance, scalability and agility)
• More data integration (physical or virtual)
• Easier (yet secure) access for more types of users and devices
2.2
THE NEED FOR HIGH PERFORMANCE AND SCALABILITY
According to a recent IDC estimate, the amount of digital information created and replicated in 2009 was 800,000 petabytes, enough to fill a stack of DVDs reaching from the earth to the moon and back. The estimate for 2020? A 44-fold increase to 35 zettabytes, or enough to fill a stack of DVDs reaching halfway to Mars [Gantz and Reinsel, 2010]. Businesses must not only find some way to locate accurate information in these massive stores, but find a way to make it understandable and useful to a broad range of audiences. To make matters worse, they must do so faster than ever before. In today’s hyper competitive, interconnected global economy, yesterday’s data simply will no longer do.
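The relation between the two IDC figures can be checked with a few lines of arithmetic. The sketch below assumes decimal (SI) prefixes, so that one zettabyte equals one million petabytes; the variable names are ours, not IDC's.

```python
# Assumption: decimal (SI) prefixes, 1 PB = 10**15 bytes and 1 ZB = 10**21 bytes.
PETABYTES_PER_ZETTABYTE = 10**6

created_2009_pb = 800_000  # 2009 estimate quoted in the text
growth_factor = 44         # "44-fold increase" quoted in the text

created_2009_zb = created_2009_pb / PETABYTES_PER_ZETTABYTE
projected_2020_pb = created_2009_pb * growth_factor
projected_2020_zb = projected_2020_pb / PETABYTES_PER_ZETTABYTE

print(f"2009: {created_2009_zb} ZB")
print(f"2020: {projected_2020_zb} ZB")
```

Under that assumption, 800,000 petabytes is 0.8 zettabytes, and a 44-fold increase gives roughly 35 zettabytes, which agrees with the rounded figure cited above.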
2.3
THE NEED FOR UNIFIED ACCESS TO GLOBAL INFORMATION
Similarly, making business decisions based on only a small fraction of available data will also no longer suffice. Up to 90% of these massive corporate information assets now exist in unstructured (or semi-structured) format,1 like text documents, multimedia files, and Web content. Information systems have to date done a poor job of exploiting this ‘messy’ content, resulting in a very limited perspective of a business and its market. Some data, like Web and social media content, is simply not leveraged in a business context (not in any formal way, at least): already strained systems simply can’t digest such voluminous, ‘dirty’ data.
1 Seth Grimes (http://www.clarabridge.com/default.aspx?tabid=137&ModuleID=635&ArticleID=551) investigates the origins of this commonly cited approximation of the amount of unstructured data. He concludes that, whatever the real proportion “53%/80%/85%/90%”, these figures “make concrete – they focus and solidify – the realization that unstructured data matters.” Susan Feldman traces the 85% figure back to an IBM study from the 1990s.
Figure 2.2: Skyrocketing data volumes result in information overload in the workplace. (Based on source image in the IDC Digital Universe Study, sponsored by EMC, May 2010, [Gantz and Reinsel, 2010].)
At the same time, current information systems are ill-suited to bringing data from multiple data sources together in an understandable, pertinent way, even when source systems contain coherent versions of a consistent type of data (for example, unifying access to multiple databases). Data integration is urgently needed, yet it remains for most organizations a prohibitively complex, costly undertaking, if not a completely Sisyphean task given the proliferation of data sources, the complexity of contemporary information ecosystems, and the high rate of mergers and acquisitions.
2.4
THE NEED FOR SIMPLE YET SECURE ACCESS
Even if a business could surmount these back-end scaling and integration challenges, making this data easy for both human beings and other systems to use would remain a challenge. The evolution of virtually all work into “information work” to some degree, globalization, and the increasing depth, complexity and interdependence of supply chains2 (fueled in large part by the growing influence of a consumer-led model of demand), mean Information Technology (IT) is under tremendous pressure to extend access to business information assets to an ever greater range of users and systems. Consequently, Information Technology no longer has the luxury of developing applications for a known group of trained professionals; they are being tasked with creating applications that can be used by people whom they do not know, cannot train, and who may not even speak the same language, much less have a predictable set of technical skills.3 Moreover, long experience with the consumer Web has made even highly trained, highly skilled workers impatient with difficult-to-use information systems. Accordingly, instant usability has assumed a mission-critical role.
While IT struggles to respond to this clamor for Web-style simplicity, unity, and scalability, they must still meet security demands that are growing more complex and stringent by the day.
All in all, it’s a formidable set of challenges to say the least, one that points to a need for a unified information access platform that can handle both unstructured and structured data (both inside the firewall and out on the Internet), scales easily to any volume, is completely secure, and is utterly simple to use. In short, it points to the need for an infrastructure that combines the capacities and benefits of both search engines and databases.
This convergence, satisfied at a functional level by SBAs, has been fueled by these shifts in information access needs and by a dissolution of the boundaries between the Web and the enterprise, with an attendant incursion of search engines and databases into each other’s once exclusive domains. Let’s now look more closely at search and database technologies, and the effects of this cross-incursion, in particular the SBA-enabling transformation of IR-focused search engines into general information access platforms.
2 See [Dedrick et al., 2008] for an interesting study on how available information technology relates to implementable supply chain models.
3 One early attempt to circumvent the complexity of accessing databases was to use a natural language interface [Copestake and Sparck Jones, 1990]. In such interfaces, a well-structured, grammatical, “ordinary” language query was transformed into a classical database request. Unfortunately, such systems remained brittle and could not sufficiently hide the complexity of the information model used by the database.
CHAPTER
3
Origins and Histories
At A Glance
Characteristic      Search Engine            Databases
Origin              Web                      Enterprise
Primary Usage       Information retrieval    Transaction processing
Target content      Unstructured text        Structured data
Number of users     Unlimited                Limited
Data volume         Billions of records      Thousands/millions of records
3.1
SEARCH ENGINES
The best known search engine today is Google, but research into search engines that could rank documents relevant to a query began in the late 1950s with Cyril Cleverdon’s Cranfield studies [Cleverdon and Mills, 1963], Margaret Masterman and her team at the Cambridge Language Research Unit [Masterman et al., 1958], and Gerard Salton’s SMART system [Salton, 1971]. Arising from the world of library science and card catalogs, these early search engines were built to index and retrieve books and other documents, moving from older systems using Boolean retrieval over controlled indexing terms to relevance ranking and fuzzy matches over the full text of documents.
With the appearance of the Internet in the early 1990s, Web-based search engines began to emerge as an efficient way to help users locate information in a huge, organic, and linked collection of documents. The very first Internet IR systems compiled searchable lists of file names found on Internet servers [Deutsch and Emtage, 1992], and the next generation (e.g., Yahoo!) sought to build directories of Web sites [Filo and Yang, 1995] or, alternately, to build a searchable key word index of the textual content of individual Web pages (not unlike a book index). Instead of users navigating a directory menu to locate information, these latter “full-text” engines enabled users to simply enter a key word of interest in a text box, and the engine would return a ranked list of pages containing that term.1 This full-text model remains the dominant force in information retrieval (IR) from Web sources.2
1 The first such full-text web search engine was WebCrawler [Pinkerton, 1994], which came out in 1994.
2 Another tradition of Boolean search engines gave rise to IBM’s Storage and Information Retrieval System (STAIRS) and Lockheed’s Dialog, which survives as ProQuest’s Dialog, www.dialog.com.
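The full-text model described above, a key word index over page content that returns a ranked list of pages containing a term, can be sketched in a few lines of Python. This is a deliberately simplified illustration: the page names and texts are invented, tokenization is a plain whitespace split, and ranking by raw occurrence count is our assumption, not the method of any engine named in this chapter.

```python
from collections import Counter, defaultdict


def build_index(pages):
    """Map each word to a Counter of {page_id: number of occurrences}."""
    index = defaultdict(Counter)
    for page_id, text in pages.items():
        for word in text.lower().split():
            index[word][page_id] += 1
    return index


def search(index, keyword):
    """Return ids of pages containing the keyword, most occurrences first."""
    postings = index.get(keyword.lower(), Counter())
    ranked = sorted(postings.items(), key=lambda item: (-item[1], item[0]))
    return [page_id for page_id, _ in ranked]


# Invented sample pages, for illustration only.
pages = {
    "a.html": "search engines index web pages",
    "b.html": "an index of pages is like a book index",
    "c.html": "databases store structured records",
}
index = build_index(pages)

print(search(index, "index"))
print(search(index, "databases"))
```

The structure built by `build_index` is a small inverted index in the sense of the Glossary: a list of words with pointers to the documents in which they occur.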
Search engine developers were concerned about processing great quantities of varied data, with little human intervention, into indexes that could be rapidly searched in response to a user query. When the Web began to grow at an exponential rate, the concern of search engine developers was to gather as much information as possible in the processing time allocated. Rather than providing all possible responses to a query, techniques were developed to try to provide the best answers, using word weighting, web page weighting and other heuristics. Gradually a variety of approximate matching techniques, involving language processing techniques, and more and more exposure of whatever metadata or category information might be retained from input documents, were implemented in search engines. Today, millions of people worldwide have become familiar with using Web search engines to find at least some of the information they need.
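As an illustration of what word weighting can look like, the sketch below uses TF-IDF, one widely known weighting scheme. The text above does not name a specific scheme, so this choice, the function name `rank_by_tf_idf`, and the sample documents are all assumptions made for the example.

```python
import math
from collections import Counter


def rank_by_tf_idf(documents, query):
    """Rank documents for a query with a basic TF-IDF word weighting.

    A word occurring in few documents gets a high weight; a word occurring
    in every document gets a weight of zero and does not affect ranking.
    """
    n_docs = len(documents)
    tokenized = {doc_id: text.lower().split() for doc_id, text in documents.items()}

    # Document frequency: in how many documents does each word appear?
    doc_freq = Counter()
    for words in tokenized.values():
        doc_freq.update(set(words))

    scores = {}
    for doc_id, words in tokenized.items():
        counts = Counter(words)
        score = 0.0
        for term in query.lower().split():
            if counts[term]:
                score += counts[term] * math.log(n_docs / doc_freq[term])
        if score > 0:
            scores[doc_id] = score
    # Best score first; ties broken by document id for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Invented sample documents, for illustration only.
documents = {
    "d1": "the cat sat on the mat",
    "d2": "the dog chased the cat",
    "d3": "the stock market fell",
}

print(rank_by_tf_idf(documents, "market cat"))
```

Here the rare word "market" outweighs the more common "cat", so the only document mentioning the market is ranked first, while a query consisting solely of "the" matches nothing useful because the word appears everywhere.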
Figure 3.1: Two early Web search engines, WebCrawler and Lycos. Screenshots from October 1996 (courtesy of the Internet Archive).
Not intended for transaction processing and focused initially on IR for textual data only, these Web engines were designed from inception to execute lightning fast read operations against a massive volume of data by a vast number of simultaneous users [Brin and Page, 1998].
3.2
DATABASES
Available commercially for more than 50 years, databases were born and bred inside the enterprise, long before the Web existed. Simply stated, they are software programs designed to record and store information in a logical, pre-defined3 structure [Nijssen and Halpin, 1989]. They were initially designed to support a limited number of users (typically well-trained), and process a limited volume of data (data storage was very expensive during the period in which databases emerged). From their earliest days, the primary role of databases was to capture and store business transactions (orders, deliveries, payments, etc.) in a process known as Online Transaction Processing (OLTP), and to provide reporting against those transactions [Claybrook, 1992]. Secondarily, though equally important, they were used to organize and store non-transactional information essential to a business’s operations, for example, employee or product data. The overriding concern of database developers was that the data contained in the databases remain accurate and reliable, even if many people were manipulating the database, reading it, and writing into it, at the same time. If you consider the problem of reserving an airline ticket, it becomes clear how important and difficult this is. Airline agents and consumers all over the world might be trying to reserve seats simultaneously, and it is the job of the database software to make sure that every available seat is immediately visible, in real time, and that no seat is sold twice. These worries led database designers to concentrate on how to make transactions dealing with data consistent and foolproof. Around the mid 1970s, databases began to be used to consolidate data from different business units to produce operational reports, representing the first generation of decision intelligence systems (DIS). 
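The airline reservation scenario described above can be sketched with Python's built-in `sqlite3` module. The `seats` table, the `reserve_seat` helper, and the passenger names are hypothetical; the point is only to show a booking performed as a single unit inside a transaction, so that a seat already taken cannot be sold a second time.

```python
import sqlite3


def reserve_seat(conn, seat, passenger):
    """Try to book a seat; return True only if this call obtained it."""
    with conn:  # commits on success, rolls back if an exception is raised
        cursor = conn.execute(
            # The seat is updated only if nobody holds it yet.
            "UPDATE seats SET passenger = ? WHERE seat = ? AND passenger IS NULL",
            (passenger, seat),
        )
        return cursor.rowcount == 1


# A toy flight with a single free seat (assumed schema, for illustration).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE seats (seat TEXT PRIMARY KEY, passenger TEXT)")
conn.execute("INSERT INTO seats VALUES ('12A', NULL)")
conn.commit()

first = reserve_seat(conn, "12A", "Alice")
second = reserve_seat(conn, "12A", "Bob")
holder = conn.execute("SELECT passenger FROM seats WHERE seat = '12A'").fetchone()[0]

print(first, second, holder)
```

The second request fails because the check and the update happen in one statement within one transaction. A production reservation system would face many concurrent connections and additional failure modes that this single-connection sketch does not attempt to model.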
From the early 1990s on, DIS evolved from simple reporting to more sophisticated analysis enabled by special tools (called OLAP tools for Online Analytical Processing) capable of combining multiple dimensions (time, geography, etc.) and indicators (sales, clients, inventory levels, etc.). To support the consolidated views needed for DIS, databases began to be aggregated into ‘mega-databases’ known as data warehouses [Chaudhuri and Dayal, 1997], which, in turn, had to be re-divided into smaller, task-driven databases known as data marts [Bonifati et al., 2001] to skirt complexity and performance issues.
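The idea of combining dimensions and indicators can be illustrated with a small aggregation routine. This is a toy sketch, not an OLAP tool: the `roll_up` function and the sample sales records are invented for the example.

```python
from collections import defaultdict


def roll_up(facts, dimensions, indicator):
    """Sum one indicator over every combination of the chosen dimensions."""
    totals = defaultdict(float)
    for fact in facts:
        key = tuple(fact[dimension] for dimension in dimensions)
        totals[key] += fact[indicator]
    return dict(totals)


# Invented records: two dimensions (year, region) and one indicator (sales).
facts = [
    {"year": 2009, "region": "Europe", "sales": 120.0},
    {"year": 2009, "region": "Asia", "sales": 80.0},
    {"year": 2010, "region": "Europe", "sales": 150.0},
    {"year": 2010, "region": "Europe", "sales": 50.0},
]

by_year = roll_up(facts, ["year"], "sales")
by_year_and_region = roll_up(facts, ["year", "region"], "sales")

print(by_year)
print(by_year_and_region)
```

Choosing a different list of dimensions gives a coarser or finer view of the same indicator, which is the kind of multi-dimensional consolidation the data warehouses and data marts described above were built to serve at scale.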
3.3
WHAT HAS CHANGED RECENTLY
First, let’s look at what hasn’t changed: Search engines still handle textual data exceptionally well and generally provide far faster, more scalable IR than databases. Databases remain exceptional tools for recording transactions (OLTP) and for the deep analysis of structured data (complex OLAP) [Negash and Gray, 2008]. What has changed, however, is a blurring of the boundaries between the Web and the enterprise, and the attendant incursion of each into the other’s once exclusive domains.
3 The term “pre-defined” is not meant to imply that the data schemas employed by relational databases are static: they are rather constructed with a base model that typically evolves over the course of usage. This evolution, however, requires largely manual adaptation of the schema and stored data.
Figure 3.2: The Database contains the reference version of enterprise data. Data warehouses collect data from different databases to achieve a consolidated view, and/or to offload access requests to protect transaction processing capacity. Data marts contain smaller slices of this consolidated data.
3.3.1
SEARCH ENGINES ENTER THE ENTERPRISE
In the 1990s, search engines entered the world of searching enterprise data.4 Beginning with the release of Verity’s Topic engine5 in 1988, a new type of engine, the enterprise search engine, was tasked with IR on internal business systems (file servers, email servers, intranets, collaboration systems, etc.) instead of on the Web. This shift meant these full-text engines had to tackle a new set of challenges unique to a business environment: processing a wider range of data formats, enforcing security, and developing different conceptions of relevancy and ranking (Web notions of ‘popularity’ being largely meaningless in an enterprise environment). These engines also sought to provide alternative ways of navigating search results, such as categorical clustering (called faceted search, see Tunkelang [2009]), rather than leaving users to rely solely on ranked lists. At the same time, similar experiments with faceted search and navigation were taking place on the Web.6 (Search-based DIS was not yet on the radar, but the foundations were being laid.)
Figure 3.3: Enterprise search required new strategies for determining relevancy, navigating results, and managing security.
4 Personal Library Software (PLS), later acquired by AOL, provided search capabilities over CD-ROMs and some enterprise data in the mid 1980s, and Lockheed’s Dialog system provided enterprises access to external data and text databases from the early 1970s.
5 http://www.answers.com/topic/verity-inc
3.3.2 DATABASES GO ONLINE
During the same period, databases entered the world of the Web. First, application layers were constructed to tie Web interfaces into back end database systems to enable e-commerce [Hasselbring, 2000].7 This drive rapidly expanded to other business functions as well (customer support, knowledgebases, etc.). Shortly thereafter, databases began to be used to manage, search and present the entirety of a website’s content, a role their enterprise counterparts had already begun to play in internal content management systems (CMS). These expanded IR functions meant databases needed to become more adept at manipulating textual information, and the incursion into the Web, along with escalating corporate datastores, placed pressure on databases to improve their scalability and performance to meet the demands of large volumes of users and data.
6 Alta Vista Live Topics, etc.
7 Before the advent of Web e-commerce, databases were already connecting with one another via EDI (Electronic Data Interchange) systems, first connected via dedicated channels, later connecting via the Internet, which was conceived in 1969 by the U.S. Department of Defense’s Advanced Research Projects Agency, or DARPA. The World Wide Web, and consequently Web-based ecommerce, emerged in the mid-1990s.
Figure 3.4: Search Based Applications introduce the affordances of search engines into the information access and business intelligence domains.
3.3.3 STRUCTURAL AND CONCEPTUAL CHANGES
Overall, the cumulative effect of these shifts, as well as changes in business IR needs, led to important structural and conceptual evolutions in both databases and search engines, touching on foundational areas such as:
• Conceptual Data Models
• Logical Storage Structures
• Data Collection/Population Procedures
• Data Processing Methods
• Data Retrieval Strategies
We’ll now compare the traditional approaches and recent evolutions of databases and search engines in each of these areas, showing what has changed recently that allows for the realization of Search Based Applications.
CHAPTER 4
Data Models & Storage

At A Glance
Characteristic              Search Engine     Databases
Basic semantic model        Document model    Relational data model
Logical storage structure   Index             Relational table
Representational state      De-normalized     Normalized
Storage architecture        Distributed       Centralized

4.1 SEARCH ENGINES
4.1.1 CONCEPTUAL DATA MODEL
Search engines use a “document model” to represent information. In the earliest days of Web search, a ‘document’ was a Web page, and that document consisted of keywords found in the page as well as descriptive information like page title, content headings, author, and modification date (collectively known as metadata,1 or information about information). The first enterprise engines likewise conceived of a document in a fairly literal sense: a Word document, a presentation, an email message, etc.
4.1.2 DATA STORAGE
The search engine index provides the primary structure for storing information about documents, including the information required to retrieve them: file system location, Web address (URL), etc. Some engines would store part of the documents crawled in a cache, a compressed copy of select text, with associated metadata. Other engines would store cached copies of complete documents. For efficiency, some search engines (particularly Web engines) would build and maintain indexes using these cached copies, though users would be linked through to the ‘live’ page when they clicked on a result.
1 See the Dublin Core Metadata Initiative at dublincore.org for an elaborate standard of metadata markup for documents in the sense used here.
Let’s look at the example of a simple full-text Web engine.2 To construct an index for this type of engine, the engine first creates a forward index which extracts all the terms used in a single document (Figure 4.1). Next, an inverted index is created against the contents of the forward index which reverses this pairing and lists first words, then all document(s) in which a word appears, facilitating keyword-based retrieval. To return results ranked by relevancy, such indexes also incorporate information like the number of times a term is used in a document (frequency), an indication of the position of these occurrences (if a user’s search term occurs often and at the top of a document, it is probably a good match) and, for Web engines, factors such as the number and credibility of external links pointing to the source pages (Figure 4.2).
Figure 4.1: A forward index compiles all pertinent words in a given document.
Figure 4.2: An inverted index extracts all occurrences of a word from the forward index, and associates it with other data such as position and frequency. Here, the term “dogs” appears in Document 1 two times, at positions 3 and 6.
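The forward and inverted indexes described above can be sketched in a few lines of Python. This is a minimal illustration rather than how any particular engine works: the function names, the whitespace tokenization, and the two sample documents are assumptions made for the example. The sample text was chosen so that “dogs” appears in Document 1 twice, at positions 3 and 6, as in Figure 4.2.

```python
from collections import defaultdict

def build_forward_index(docs):
    """Map each document id to the ordered list of terms it contains."""
    # Assumption: lowercasing and whitespace splitting stand in for real tokenization.
    return {doc_id: text.lower().split() for doc_id, text in docs.items()}

def build_inverted_index(forward_index):
    """Reverse the pairing: map each term to {doc_id: [positions]}, positions counted from 1."""
    inverted = defaultdict(dict)
    for doc_id, terms in forward_index.items():
        for position, term in enumerate(terms, start=1):
            inverted[term].setdefault(doc_id, []).append(position)
    return dict(inverted)

# Hypothetical sample documents.
docs = {
    1: "my two dogs chase other dogs",
    2: "cats ignore dogs",
}
forward = build_forward_index(docs)
inverted = build_inverted_index(forward)

# Frequency is simply the number of recorded positions.
dogs_in_doc1 = inverted["dogs"][1]
print(dogs_in_doc1, len(dogs_in_doc1))
```

A keyword query then becomes a dictionary lookup on the term, and the stored positions and frequencies are available for ranking.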
4.1.3 STORAGE FRAMEWORK
To cope with the performance, scalability and availability demands of the Web, search engines from inception were designed with the distributed architectures suited to grid computing models. In large volume environments, search platforms distribute processing tasks, indexes and document caches across multiple servers [Councill et al., 2006]. For example, an engine may distribute slave copies of index slices across multiple servers (with all writes written to the master, all reads from slaves), with a ‘meta-index’ used to direct queries to the right slice. These meta-indices may take the form of bitmaps, hash tables (for equality searches), B+ trees (a multi-level index for block-oriented storage capable of range searches like <, >, =, or between [Comer, 1979]), R-trees (for multi-dimensional data such as geospatial coordinates [Guttman, 1984]), or some combination of the above.3
2 For an in-depth description of building a search engine, see Büttcher et al. [2010].
Figure 4.3: Distributed architectures with load balancing, partitioning and replication are used for improved performance and availability in large volume environments
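As a rough sketch of the arrangement described above, the following Python class routes writes to a master copy of each index slice and reads to replica copies, using a hash-based meta-index to pick the slice. The class name, the CRC32 hash, the round-robin read policy, and the explicit replicate step are all assumptions chosen for illustration; production engines use far more elaborate schemes.

```python
import zlib

class ShardedIndex:
    """Toy index split into slices, each with one master and several read replicas."""

    def __init__(self, num_slices, replicas_per_slice=2):
        self.masters = [dict() for _ in range(num_slices)]
        self.replicas = [[dict() for _ in range(replicas_per_slice)]
                         for _ in range(num_slices)]
        self._next_replica = [0] * num_slices

    def slice_for(self, term):
        # Hash-table style meta-index: a stable hash of the term selects the slice.
        return zlib.crc32(term.encode("utf-8")) % len(self.masters)

    def write(self, term, doc_id):
        # All writes go to the master copy of the slice.
        s = self.slice_for(term)
        self.masters[s].setdefault(term, set()).add(doc_id)

    def replicate(self):
        # Copy each master slice to its replicas (assumed to run periodically).
        for s, master in enumerate(self.masters):
            for replica in self.replicas[s]:
                replica.clear()
                replica.update({t: set(ids) for t, ids in master.items()})

    def read(self, term):
        # All reads are served by replicas, chosen round-robin for load balancing.
        s = self.slice_for(term)
        i = self._next_replica[s]
        self._next_replica[s] = (i + 1) % len(self.replicas[s])
        return set(self.replicas[s][i].get(term, set()))

index = ShardedIndex(num_slices=4)
index.write("dogs", 1)
index.write("dogs", 2)
before = index.read("dogs")   # replicas have not been refreshed yet
index.replicate()
after = index.read("dogs")
```

Because a hash only answers equality lookups, a range query would need one of the tree-based meta-indexes mentioned above instead.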
3 http://en.wikipedia.org/wiki/Tree_(data_structure) provides a good introduction to each of these tree data structures.
4.2 DATABASES
4.2.1 CONCEPTUAL DATA MODEL
While search engines represent information using a document model, databases employ a ‘data model,’ or more specifically, for relational databases (the dominant database type since the 1980s), a normalized relational schema [Codd, 1990]. The goal of this schema is to enable complete, accurate representations of business entities, like products or customers, that can be used in a multitude of ways. Unlike conventional search engines in which the index serves as a directory for retrieving externally stored documents, a database’s normalized relational schema serves as both conceptual data model and physical storage framework.
4.2.2 DATA STORAGE
The primary logical storage structure in a database is a table (also called, logically enough, a relation). Each individual table contains a body of related information, for example product details grouped in a ‘Products’ table. Each column (attribute) represents a single category of information for the entities represented in the table, for example, ‘Price’ in a ‘Products’ table. Each cell (field) in a column contains a uniform representation of that attribute (for instance, text values like “one hundred,” or numeric values like “100” to represent prices). Each row (also called a tuple or record) in the table constitutes a complete representation of the entity the table treats (for example, a specification for a product). As each row represents a unique entity, it is assigned a unique identifier (the primary key). Consider, for instance, the simple table of products for a company in Figure 4.4.
Figure 4.4: Database tables store information in a well-defined, highly consistent form. The row is the logical unit for representing a complete view of an individual instance of the entity the table treats.
Figure 4.5 shows a Manufacturers table. It contains information related to the Products table (Who manufactures a given product?). To logically bind these two tables, the manufacturer’s primary key is inserted into the Products table. When a primary key appears in an external table, it is called a foreign key. It is through this series of primary and foreign keys that the relationships between tables are established. The individual structure of tables and their relationships to one another constitute the database’s overall relational schema, a schema which is precisely defined in advance of data creation/collection by the database’s architect.
Figure 4.5: Relationships between tables are established via a system of Primary and Foreign Keys.
To retrieve a list of all products displaying the manufacturers’ names and addresses, these two tables have to be combined (joined). Why store data in separate tables if it has to be pieced back together in this fashion? The answer is to avoid repetition and anomalies through a process known as normalization. In normalization, all data is broken down into the smallest, non-repeated units possible. The goal of normalization is to avoid having multiple values entered in a single cell, to avoid repeating data in multiple rows, and to avoid repeating data across multiple tables. It is a means to ensure data consistency (i.e., two versions of the same piece of information are the same across the system) and integrity (i.e., the data remains the same as it was entered) as well as to avoid redundancy, the latter in part being a legacy from the early days of database development, when, as pointed out previously, storage was extremely expensive. Figure 4.7 shows a simple non-normalized table, and the same table normalized.
Exceptions encountered over the course of usage (for instance, adding a manufacturer producing the same part in multiple locations under more than one name) require modification of the data model and database structure, as well as modification of the application layer programming used to retrieve data views. If one is attempting to manage a large, rapidly evolving body of information, accommodating exceptions can become quite complex and time-consuming.
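The primary key, foreign key, and join mechanics described above can be tried out with Python’s built-in sqlite3 module. The sketch below is illustrative only: the column names and sample rows are hypothetical and are not taken from Figures 4.4 and 4.5. Note that the manufacturer’s address is stored once and recovered for each product through the join.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE Manufacturers (
        manufacturer_id INTEGER PRIMARY KEY,   -- primary key
        name            TEXT NOT NULL,
        address         TEXT NOT NULL
    );
    CREATE TABLE Products (
        product_id      INTEGER PRIMARY KEY,   -- primary key
        name            TEXT NOT NULL,
        price           REAL NOT NULL,
        -- foreign key: the manufacturer's primary key stored in the Products table
        manufacturer_id INTEGER NOT NULL REFERENCES Manufacturers(manufacturer_id)
    );
""")

# Hypothetical sample data.
conn.execute("INSERT INTO Manufacturers VALUES (1, 'Acme', '1 Main St')")
conn.executemany(
    "INSERT INTO Products VALUES (?, ?, ?, ?)",
    [(10, "Scooter", 100.0, 1), (11, "Helmet", 25.0, 1)],
)

# Piece the normalized data back together with a join.
rows = conn.execute("""
    SELECT p.name, m.name, m.address
    FROM Products AS p
    JOIN Manufacturers AS m ON m.manufacturer_id = p.manufacturer_id
    ORDER BY p.product_id
""").fetchall()

manufacturer_count = conn.execute("SELECT COUNT(*) FROM Manufacturers").fetchone()[0]
```

If the manufacturer moves, only one row in Manufacturers changes, which is the consistency benefit normalization is meant to provide; the cost is the join at read time.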
4.2.3 STORAGE FRAMEWORK
Relational databases can use distributed architectures (master/slave replication, cluster computing, table partitioning, etc.) for large volume systems. However, given their primary emphasis on resource-intensive write operations (the Create, Update and Delete of CRUD operations), their need to join data across multiple tables for meaningful data views, and their need to ensure data integrity and reliability, maintaining performance, integrity and consistency across multiple servers is usually complex and expensive.
Figure 4.6: Entity-Relationship (ER) Diagrams are a common way of representing the overall database model.
Figure 4.7: Data is broken down into the smallest discrete units needed to preserve data integrity and consistency
4.3 WHAT HAS CHANGED RECENTLY
4.3.1 SEARCH ENGINES
First, with search’s incursion into the enterprise, and the accompanying push to handle structured content (a force in Web search engine development as well), the concept of a ‘document’ began to evolve. For Search Based Application engines, the number and complexity of attributes stored in a given column greatly increased, and the conception of a document expanded from that of a literal document like a Web page or text file, to also include a meaningful collection of information akin to a database-style business entity. For example, a search ‘document’ in an SBA context may aggregate numerous discrete pieces of information from multiple sources to create a well-rounded representation of an entity like a product or employee. Unlike a database entity, however, this meaningful representation is always present in its entirety in a single synthetic document stored within the search engine index. Entity attributes are stored in a de-normalized state and can evolve as source data evolves (we’ll look at how these representations are built and maintained in Chapters 5 and 6).
Figure 4.8: The concept of a ‘document’ in the search context has evolved to include representations of database-style entities
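To make the idea of a synthetic, de-normalized document concrete, here is a small Python sketch that folds records from three hypothetical sources (a product table, a manufacturer table, and a set of customer reviews) into one flat document ready for indexing. The function name, field names, and sample data are assumptions for illustration only.

```python
def build_entity_document(product, manufacturers, reviews):
    """Fold related records from several sources into one flat, de-normalized document."""
    maker = manufacturers[product["manufacturer_id"]]
    return {
        "id": product["product_id"],
        "name": product["name"],
        "price": product["price"],
        # Manufacturer attributes are copied into the document rather than referenced by key.
        "manufacturer_name": maker["name"],
        "manufacturer_address": maker["address"],
        # Unstructured text from another source sits alongside the structured fields.
        "review_text": [r["text"] for r in reviews
                        if r["product_id"] == product["product_id"]],
    }

# Hypothetical source records.
manufacturers = {1: {"name": "Acme", "address": "1 Main St"}}
product = {"product_id": 10, "name": "Scooter", "price": 100.0, "manufacturer_id": 1}
reviews = [
    {"product_id": 10, "text": "Sturdy and fast."},
    {"product_id": 11, "text": "Fits well."},
]

document = build_entity_document(product, manufacturers, reviews)
```

The resulting dictionary contains everything needed to answer a query about the product without a join, at the price of repeating manufacturer data in every product document.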
4.3.2 DATABASES
The core data model and storage unit for a relational database remains the row-oriented, normalized relational table, and scaling this model continues to be expensive and complex. To improve performance, database engineers began to experiment with ways to introduce persistent, document-style modes of representing business entities into databases, since joining multiple tables to get a unified data view is resource intensive.
At their simplest level, these efforts entailed a more extensive use of views or virtual tables, which are in essence simply cached versions of data ‘pre-assembled’ into meaningful units [Carey et al., 1998]. At a more advanced level [Hull, 1997], these efforts have resulted in experiments with object-oriented databases, which were originally created to handle complex data types like images, video, audio, and graphs that were not well supported by conventional relational databases. However, performance and scaling issues, among others, have confined object-oriented databases to niche markets, though mainstream databases have adopted some features and strategies of these systems into conventional relational databases (such as using object-oriented strategies for handling select complex data types like multimedia).
More recent efforts to surmount performance and scaling barriers and develop a more agile data model and more scalable data storage have resulted in new non-relational database structures. These include:
• Key-value stores
• Document databases
• Wide-column stores
• Graph databases
These types of databases are collectively referred to as NoSQL databases [Leavitt, 2010] (for “Not Only SQL,” a reference to the standard querying language for relational databases), distributed data stores, schemaless databases, or VLDBs (for Very Large Databases, as many of these alternatives are reviewed in the conferences and journals of the non-profit VLDB Endowment). Though each label is less than ideal, we’ll use the most common, NoSQL databases, for this book.
Despite their structural differences, NoSQL databases share several primary characteristics with each other, and with search engines. They all:
• Represent data as key-value pairings stored in column-oriented indexes,
• Use distributed architectures (supported by an extensive use of meta-indexes to support partitioning and sharding) to overcome the performance and scaling limitations of relational databases,
• Emerged from the Web, or use Web-derived technologies (prime agents including Internet giants like Amazon, Facebook, Google, LinkedIn and eBay), and
• Relax consistency and integrity requirements to improve performance (following the CAP theorem that you cannot achieve Consistency, Availability and Partition-Tolerance at the same time).4
Below are snapshot views of the data models and storage architectures employed by each of these types of non-relational databases.5
4 See Brewer [2000], and Gilbert and Lynch [2002].
5 A list of NoSQL databases is found here: http://nosql-database.org.
Key-Value Stores
Figure 4.9: Key-Value Stores enable ultra-rapid retrieval of simple data
Typically, these databases map simple string keys to string values in a hash table structure (Figure 4.9). Some support values in the form of strings, lists, sets or hashes where keys are strings and values are either strings or integers. None support repetition. For most, querying is performed against keys only, and limited to exact matches (e.g., I can search for ‘Prod_123’ but not ‘Zapito’). Others support operations like intersection, union, and difference between sets and sorting of lists, sets and sorted sets. Examples include Voldemort (LinkedIn), Redis, SimpleDB (Amazon), Tokyo Cabinet, Dynamo, Riak and MemcacheDB6 [DeCandia et al., 2007].
Document Database
Document databases are key-value stores as well, but the values they contain are semi-structured and can be queried. Figure 4.10 shows a simple example with multiple attribute name/value pairs in the Value column.
Figure 4.10: Document Databases contain semi-structured values that can be queried. The number and type of attributes per row can vary, offering greater flexibility than the relational data model.
6 project-voldemort.com, code.google.com/p/redis, aws.amazon.com/simpledb, fallabs.com/tokyocabinet, www.dynamocomputing.com, wiki.basho.com/display/RIAK, memcachedb.org
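The difference between the two models can be sketched with plain Python dictionaries. This is only an analogy, not the API of any of the products listed: the helper names `kv_get` and `find_documents` and the sample attributes are hypothetical. The key ‘Prod_123’ and the value ‘Zapito’ follow the example in the text.

```python
# Key-value store: opaque values, lookups by exact key only.
kv_store = {"Prod_123": "Zapito"}

def kv_get(store, key):
    """Exact-match lookup on the key; the value itself cannot be searched."""
    return store.get(key)

# Document store: values are semi-structured, and rows may carry different attributes.
doc_store = {
    "Prod_123": {"name": "Zapito", "type": "scooter", "price": 100},
    "Prod_456": {"name": "Helmet", "color": "red"},
}

def find_documents(store, **criteria):
    """Return the keys of all documents whose attributes match every criterion."""
    return sorted(
        key for key, doc in store.items()
        if all(doc.get(field) == value for field, value in criteria.items())
    )

by_key = kv_get(kv_store, "Prod_123")             # found: the key matches exactly
by_value = kv_get(kv_store, "Zapito")             # not found: values are not searchable
by_attribute = find_documents(doc_store, name="Zapito")
```

A real document database would index the attributes rather than scan every value, but the query model, matching on fields inside the value, is the same.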
Document databases include XML databases, which store data as XML documents. Many support joins using JavaScript Object Notation (JSON) or Binary JSON (BSON).7 Examples include CouchDB and MongoDB (JSON/BSON) and MarkLogic, Berkeley DB XML, and MonetDB (XML databases).8
Wide Column Databases
These structures, sometimes called Bigtable clones as most are patterned after the original Bigtable [Chang et al., 2006], Google’s internal storage system for handling structured data, can be thought of as multi-dimensional key-value pairs (Google’s own definition: “a Bigtable is a sparse, distributed, persistent, multi-dimensional sorted map”). Figure 4.11 offers a simplified representation of such a map.
Figure 4.11: Wide Column Databases are multi-dimensional key-value stores that can accommodate a very large number of attributes. They offer no native structure for determining relationships or joining tables.
In essence, a Bigtable structure is made up of individual ‘big tables’ which are like giant, unnormalized database tables. Each can house a huge number of attributes, like SBA engines, but there is no native semantic structure for determining relationships and no way to join tables, though as with SBA engines, duplicates are allowed. In the case of Bigtables, this duplication includes duplicate rows, as with the scooter price above. Timestamps are used to distinguish the most recent data while supporting historical queries and auto garbage collection (a strategy likewise employed by some XML databases).
7 www.json.org and bsonspec.org
8 couchdb.apache.org, www.mongodb.org, www.marklogic.com, www.oracle.com/technetwork/database/berkeleydb, monetdb.cwi.nl
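A toy version of such a multi-dimensional map, keyed by row, column, and timestamp, might look like the following. It is a sketch under simplifying assumptions (a single in-memory dictionary, no distribution or sorting by row key), and the class and method names are hypothetical rather than those of Bigtable or its clones.

```python
class WideColumnTable:
    """Toy map of (row key, column) -> timestamped versions of a value."""

    def __init__(self):
        self._cells = {}  # (row_key, column) -> list of (timestamp, value), oldest first

    def put(self, row_key, column, value, timestamp):
        # A new write never overwrites; it adds another timestamped version.
        versions = self._cells.setdefault((row_key, column), [])
        versions.append((timestamp, value))
        versions.sort(key=lambda version: version[0])

    def get(self, row_key, column, as_of=None):
        """Return the most recent value, or the value current at time `as_of`."""
        versions = self._cells.get((row_key, column), [])
        if as_of is not None:
            versions = [v for v in versions if v[0] <= as_of]
        return versions[-1][1] if versions else None

    def collect_garbage(self, keep_versions):
        """Drop all but the newest `keep_versions` versions of every cell."""
        if keep_versions < 1:
            raise ValueError("keep_versions must be at least 1")
        for key in list(self._cells):
            self._cells[key] = self._cells[key][-keep_versions:]

table = WideColumnTable()
table.put("scooter", "price", 100, timestamp=1)
table.put("scooter", "price", 120, timestamp=2)   # a duplicate row with a newer timestamp

latest = table.get("scooter", "price")
historical = table.get("scooter", "price", as_of=1)
table.collect_garbage(keep_versions=1)
after_gc = table.get("scooter", "price", as_of=1)
```

The timestamp dimension is what lets duplicate rows coexist: the newest version answers ordinary reads, older versions answer historical queries until garbage collection removes them.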
Examples in addition to Google’s Bigtable include Cassandra (Facebook), HBase, Hypertable and Kai.9
Graph Databases
Graph databases [Angles and Gutierrez, 2008] replace relational tables with structured relational graphs of one-to-many key-value pairs. They are the only one of the four NoSQL types discussed here that concern themselves with relations. A graph database considers each stored item to have any number of relationships. These relationships can be viewed as links, which together form a network, or graph. These graphs can be represented as an object-oriented network of nodes, relations and properties.
Figure 4.12: Graph databases are more concerned with the relationships between data entities than with the entities themselves.
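The node, relation, and property model can be illustrated with a small in-memory graph and a breadth-first traversal. This is a sketch only: the `Graph` class, its method names, and the social-network sample data are assumptions, and it does not reflect the query interface of any graph database named in this section.

```python
from collections import defaultdict, deque

class Graph:
    """Toy property graph: nodes with properties, linked by named relations."""

    def __init__(self):
        self.properties = {}             # node id -> dict of properties
        self.edges = defaultdict(list)   # node id -> list of (relation, target id)

    def add_node(self, node_id, **props):
        self.properties[node_id] = props

    def relate(self, source, relation, target):
        self.edges[source].append((relation, target))

    def reachable(self, start, relation, max_hops):
        """Follow one relation type outward from `start`, up to `max_hops` links away."""
        seen = {start}
        queue = deque([(start, 0)])
        found = []
        while queue:
            node, hops = queue.popleft()
            if hops == max_hops:
                continue
            for rel, target in self.edges[node]:
                if rel == relation and target not in seen:
                    seen.add(target)
                    found.append(target)
                    queue.append((target, hops + 1))
        return found

# Hypothetical social network.
graph = Graph()
for person in ("alice", "bob", "carol", "dave"):
    graph.add_node(person, kind="person")
graph.relate("alice", "knows", "bob")
graph.relate("bob", "knows", "carol")
graph.relate("carol", "knows", "dave")

friends = graph.reachable("alice", "knows", max_hops=1)
friends_of_friends = graph.reachable("alice", "knows", max_hops=2)
```

Questions such as “who is within two links of this person” are awkward to express as relational joins but fall out naturally from a traversal like this one.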
While accessing such relationships is very useful in many applications (consider social networking, for example), querying graph databases is typically slow [Cheng et al., 2009] since graph structures have to be matched, even if the volume of data that can be treated is very high. Examples of graph databases include Neo4j [Vicknair et al., 2010], InfoGrid [Giannadakis et al., 2010], Sones, VertexDB, and AllegroGraph.10
The column-oriented, key-value centered distributed storage model used by all of these NoSQL databases is the key to their massive scalability and IR processing speed, and it represents the major point of convergence with search engines. Information retrieval needs have evolved, and both search engines and NoSQL databases are filling IR needs unmet by relational database-centered IR. However, significant structural differences between these two technologies remain in structured data handling, the use of semantic technologies, and querying methodologies. We’ll explore these differences in the next two chapters.
9 cassandra.apache.org, hbase.apache.org, hypertable.org, sourceforge.net/projects/kai/
10 neo4j.org, infogrid.org, sones.com/home, dekorte.com/projects/opensource/vertexdb, franz.com/agraph/allegrograph
CHAPTER 5
Data Collection/Population

At A Glance
Characteristic    Search Engine     Databases
Primary method    Crawlers          Direct writes, ETL (connectors)
Pre-processing    Not required      Required
Data freshness    Quasi-real-time   24hrs+ for data warehouses

5.1 SEARCH ENGINES
5.1.1 COLLECTION
Early Web search engines used a single primary tool to collect data, a software program called a crawler [Heydon and Najork, 1999]. The crawler would connect to a website, capture the text it contained along with basic metadata like page titles, content headers or sub-headers, etc. (sending the information collected back to a central server(s) for indexing), and then follow the hyperlinks from one page to the next in an unending circuit across the Web. Aside from some basic, automated formatting and clean up (for example, removing HTML tags or double whitespaces), no pre-processing was required for the data collected – it was a straight take-it-as-found mode of operating.
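The crawl loop described above (fetch a page, keep its text and title, follow its links) can be sketched with Python’s standard html.parser module. To stay self-contained and runnable, this example crawls a tiny in-memory “Web” instead of making network requests; the `fetch` callable, the `FAKE_WEB` pages, and the example.com URLs are all assumptions, and a real crawler would add politeness delays, robots.txt handling, and URL normalization.

```python
from collections import deque
from html.parser import HTMLParser

class PageParser(HTMLParser):
    """Collects the title, visible text, and outgoing links of one page."""

    def __init__(self):
        super().__init__()
        self.links, self.text_parts, self.title = [], [], ""
        self._in_title = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href:
                self.links.append(href)
        elif tag == "title":
            self._in_title = True

    def handle_endtag(self, tag):
        if tag == "title":
            self._in_title = False

    def handle_data(self, data):
        if self._in_title:
            self.title += data
        else:
            self.text_parts.append(data)

def crawl(fetch, seed_url):
    """Breadth-first crawl: `fetch(url)` returns HTML, or None if the page is unavailable."""
    seen, queue, collected = {seed_url}, deque([seed_url]), {}
    while queue:
        url = queue.popleft()
        html = fetch(url)
        if html is None:
            continue
        parser = PageParser()
        parser.feed(html)
        parser.close()
        # Basic clean-up: tags are already stripped; collapse repeated whitespace.
        text = " ".join(" ".join(parser.text_parts).split())
        collected[url] = {"title": parser.title.strip(), "text": text}
        for link in parser.links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return collected

# Hypothetical two-page site standing in for the Web.
FAKE_WEB = {
    "http://a.example/": "<html><head><title>Home</title></head><body><h1>Welcome</h1>  "
                         "<a href='http://a.example/dogs'>Dogs</a></body></html>",
    "http://a.example/dogs": "<title>Dogs</title><p>All  about dogs</p>"
                             "<a href='http://a.example/'>back</a>",
}
pages = crawl(FAKE_WEB.get, "http://a.example/")
```

The `seen` set is what keeps the “unending circuit” from looping forever on pages that link back to each other.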
5.1.2 UPDATING
Search engines employ varying update strategies [Wolf et al., 2002] according to available resources and editorial or business objectives. Some simply crawl the entire Web on a fixed schedule (biweekly, monthly, etc.), re-indexing content as they go; others employ strategies based on factors like the expected frequency of changes (based on prior visits or site type, such as news media) and site quality heuristics. Whatever strategy is used to achieve optimal freshness, search engines are designed for incremental, differential data collection and index updating, and there is no technical barrier to high performance search engines performing quasi-real-time updates for billions of pages.1
1 Prior to its recent migration from a MapReduce to Bigtable index architecture, Google employed a vertical strategy for updating some portions of its index more frequently than others. This kept the index relatively fresh for certain types of sites like major news outlets, but the intensive batch-style processing within MapReduce impacted index freshness even for these frequently crawled sites. Under the new architecture, Google has moved much closer to a global, near real-time incremental update strategy. See http://googleblog.blogspot.com/2010/06/our-new-search-index-caffeine.html.
5.2 DATABASES
5.2.1 CREATION/COLLECTION
Databases typically occupy a middleware position in IT architecture, receiving data through discrete data inputs by human users through front-end applications, through individual writes by other applications, or through batch imports (automated or manual). Automated batch transfers are usually accomplished with the aid of ETL tools (for Extract, Transform and Load).
Figure 5.1: Databases occupy a middleware position in conventional IS architectures, and are at the core of all data creation, storage, update, delete and access processes.
Because databases were primarily designed to record business transactions, they feature a host of tools designed to ensure the accuracy and referential integrity of data [Bernstein and Newcomer, 2009, Chapter 6]. Examples include being able to roll data back to its original state if a transaction is interrupted mid-stream (for example, by a power failure), rejecting data that is not of a consistent type or format (for instance, enforcing a uniform representation of monetary amounts), preventing other users or processes from accessing a table when a user is creating or modifying a record to prevent conflicts (transaction locking), and being able to recover a successfully completed transaction from a log file in case of a system failure. These types of constraints are referred to as ACID constraints, for Atomicity, Consistency, Isolation, and Durability.
To ensure consistency during batch transfers, ETL tools can be used to ‘clean’ data, that is to say, to render it consistent with the target database’s structure and formatting conventions (with a fair amount of manual mapping and configuration). If the data to be imported is simply inconsistent with the target database’s data model, that model must be revised before the transfer can proceed.
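The rollback and data-rejection behavior described above can be observed with Python’s built-in sqlite3 module. In this sketch, a transfer between two accounts is wrapped in a transaction; when the second statement violates a constraint, the first statement is undone as well. The table, the CHECK constraint, and the `transfer` helper are hypothetical and chosen only to trigger a mid-transaction failure.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE Accounts (
        account_id INTEGER PRIMARY KEY,
        balance    INTEGER NOT NULL CHECK (balance >= 0)  -- reject inconsistent data
    )
""")
conn.executemany("INSERT INTO Accounts VALUES (?, ?)", [(1, 100), (2, 50)])
conn.commit()

def transfer(conn, source, target, amount):
    """Move `amount` between accounts atomically: both updates apply, or neither does."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            conn.execute(
                "UPDATE Accounts SET balance = balance + ? WHERE account_id = ?",
                (amount, target),
            )
            conn.execute(
                "UPDATE Accounts SET balance = balance - ? WHERE account_id = ?",
                (amount, source),
            )
        return True
    except sqlite3.IntegrityError:
        return False

def balances(conn):
    return conn.execute(
        "SELECT account_id, balance FROM Accounts ORDER BY account_id"
    ).fetchall()

failed = transfer(conn, source=1, target=2, amount=500)  # would overdraw account 1
after_failure = balances(conn)
succeeded = transfer(conn, source=1, target=2, amount=30)
after_success = balances(conn)
```

The credit to account 2 in the failed transfer is applied first and then rolled back, so the balances are unchanged afterwards; this illustrates atomicity and consistency, while isolation and durability are not exercised by this single-connection, in-memory example.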
5.2.2 UPDATING
Databases typically use differential update processes, inserting, updating or deleting data as received by user or system inputs. However, for large central repositories like data warehouses, processing such real-time incremental changes can be slower than a complete transfer and rebuild. Whether such systems employ a differential or rebuild strategy, resource-intensive updates are typically executed in batch operations scheduled for off-peak hours, typically once a day, to avoid performance bottlenecks. The downside of this practice is that users who rely on business intelligence systems built on data warehouses and marts have to make decisions based on data that is 24 hours old (or older, depending on the scale and complexity of the database systems).
5.3 WHAT HAS CHANGED
5.3.1 SEARCH ENGINES
As search engines were pushed to accommodate a wider variety of data, they developed software interfaces called connectors that enabled them to access and acquire new types of content. These include file system connectors (to sequentially read and index document repositories like enterprise file servers), messaging connectors (for connecting to enterprise email systems), and, for Search Based Application engines, database connectors (using the Open Database Connectivity, or ODBC, and Java Database Connectivity, or JDBC, protocols).2 Many engines now also feature a Push Application Programming Interface (API) that supports custom connectors developed in standard programming languages and typically communicating with the engine via HTTP protocols.
As a result, there is now a generation of search engines that can connect to virtually any information source, and process virtually any type of content: unstructured (text documents, presentations, multimedia files), semi-structured (email messages, HTML, XML), and structured (databases, logs, spreadsheets).3
And in an advance important to the development of SBAs, latter generation engines can not only ingest the data contained in structured systems, they can capture and exploit the data schema employed by the source system, still with no pre-processing other than a basic metadata mapping in a push API or database connector. The schematic information extracted is not only represented within the index, it can also optionally be used to guide the entire indexing and query processing chain, much as earlier engines used external taxonomies and ontologies to guide the indexing of text (more on this in Chapter 6, Data Processing).
In addition to expanding the range and semantic depth of information a search engine can ingest, this framework of crawlers, connectors and APIs has given businesses considerable control over data freshness. Indexes continue to be updated in a differential, incremental process (with the rules and schedules being separately configurable for crawlers and connectors), and updates can be performed with little to no impact on source systems.4 This means engines can index data directly from source systems rather than centralized repositories, and data freshness can be quasi-instantaneous even for large data volumes, or scheduled as desired to optimize cost and performance.
Even in databases, the idea of data freshness can vary, with data being updated once a day (for example, for calculating sales) to real-time (as in the stock market). SBA platforms can download data from databases on a regular schedule (every 15 minutes, once an hour, etc.). Accordingly, deploying an SBA platform alongside a data warehouse can drop data latency from 24 hours to near real-time. And once the data is loaded in the index, unlimited users can query the data at will with zero impact on underlying production databases, a significant advantage given that access requests typically far
2 See support.microsoft.com/kb/110093, download-llnw.oracle.com/javase/tutorial/jdbc/basics/
3 See http://incubator.apache.org/connectors/ or http://www.exalead.com/software/common/pdfs/products/cloudview/Exalead-Connectors-and-Formats.pdf.
5.3.2 DATABASES
The basic procedures for loading data into relational databases, meanwhile, have not changed, though ODBC connectors are extensively used for data consolidation, and the number and sophistication of ETL tools has increased. High end ETL tools5 now have the capacity to load structured, unstructured, and semi-structured data into a relational database management system (RDBMS), though this is typically accomplished via a companion search engine. Whatever its original format, the data to be loaded must still conform to the RDBMS data schema, and latency remains an issue as long as primary access is delivered via a centralized data warehouse.
NoSQL databases use strategies similar to search engines for data collection and updating, including the use of crawlers and HTTP/REST protocols6 executed through APIs. And their flexible data models alleviate the need for heavy pre-processing of ingested data. However, unlike search engines, NoSQL databases are often positioned as alternatives to databases, and much of their efficiency is achieved through compromises to ACID constraints [Leavitt, 2010] that make them a less desirable tool for OLTP. Such systems often aim simply for eventual consistency, meaning that under their particular distributed architecture, one user may see a data update a few seconds later than another user, but eventually, everyone will have access to consistent information. This eventual consistency is sufficient for many applications, especially access-oriented Web applications. As Tony Tam, VP of engineering at Wordnik, stated when migrating five billion documents (1.5 terabytes of data) from a MySQL database to MongoDB, “We kind of don’t care about consistency.”7 And although some NoSQL databases can support ACID constraints to some degree (e.g., document databases), they are not designed to support high throughput OLTP applications over nonpartitionable workloads.8
SBA engines don’t require such compromises because they are intended to complement, rather than replace, relational database systems (though they could be used as replacements in certain access-oriented contexts).
4 Generally, such update requests have no impact on production databases. However, if a request does put a load on a database, it is a known load, and the database administrator can spend whatever time is required to optimize the process, or can simply schedule it at an appropriate time or rate.
5 See www.etltool.com
6 REST stands for “representational state transfer” and refers to the ability to access information through an HTTP call.
7 Reported in the article “MongoDB Handles Masses Of Data,” InformationWeek, May 3, 2010. http://www.informationweek.com/news/software/database/showArticle.jhtml?articleID=224700367
8 For an example of the research being done on scaling ACID-compliant OLTP systems on distributed, shared-nothing architectures, see http://db.cs.yale.edu/determinism-vldb10.pdf [Thomson and Abadi, 2010].
CHAPTER 6

Data Processing

At A Glance

Characteristic        Search Engine                 Databases
Processing            Natural language processing   Data processing
Principal technology  Semantics                     Data mapping

6.1 SEARCH ENGINES
Many search engines prepare extracted content for indexing through a two-step process: natural language processing, followed by the assignment of relevancy criteria. Natural Language Processing serves three purposes: normalizing away linguistic variations before a document is indexed; recognizing structure in text, such as noun phrases that should be indexed as a unit; and typing the structures found, identifying them, for example, as persons, places or things. These typed, normalized features are then indexed. The index contains pointers to where the features were found: in what document, in what sentence, and in what position. In addition to these positions, a weight is assigned to each feature, with rarer features receiving higher weights, to be used when calculating the relevancy of a document to a query.
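The index structure described above can be illustrated with a minimal sketch. It assumes whitespace tokenization, a naive split of sentences on periods, and an inverse-document-frequency formula, log(N / document frequency), as one common way of giving rarer features higher weights; the function names and the sample documents are hypothetical and do not describe any particular engine.

```python
import math
from collections import defaultdict

def build_index(docs):
    """Map each feature to postings of (doc_id, sentence_no, position)."""
    index = defaultdict(list)
    for doc_id, text in docs.items():
        # Naive sentence split on periods; real engines use tokenizer rules.
        for sent_no, sentence in enumerate(text.split(".")):
            for pos, word in enumerate(sentence.lower().split()):
                index[word].append((doc_id, sent_no, pos))
    return index

def idf_weights(index, n_docs):
    """Assumed weighting: log(N / number of documents containing the feature)."""
    weights = {}
    for feature, postings in index.items():
        doc_freq = len({doc_id for doc_id, _, _ in postings})
        weights[feature] = math.log(n_docs / doc_freq)
    return weights

# Hypothetical sample collection.
docs = {
    "d1": "the crane lifted the beam. the crane stopped",
    "d2": "the bird flew south",
    "d3": "the beam cracked",
}
index = build_index(docs)
weights = idf_weights(index, len(docs))
```

Here `index["crane"]` records two occurrences in document d1 (sentences 0 and 1), and a word found in every document, such as "the", receives a weight of zero.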
6.1.1 NATURAL LANGUAGE PROCESSING
After documents consisting of text and basic metadata (e.g., URL, file size, file type, modify/creation/extraction date) have been extracted from source systems and saved in the designated storage structure (partial/complete cache; in-memory or remote), the unstructured textual parts are prepared for indexing via simple tokenization or more elaborate natural language processing (NLP). This preparation identifies indexable key words within documents and normalizes them in a process that can include these NLP steps:

Language Detection
Using statistical sampling of common sequences of letters, the engine determines in which language a document is written [Grefenstette, 1995]. For example, a text containing a word ending in -ack is more likely to be English than French, and a word ending in -que is more likely to be French than English. Cumulative statistics over all the letter sequences found in a piece of text are used to decide the language.

Tokenization
Next, the text is tokenized [Grover et al., 2000], or split, into a sequence of individual words using language-specific grammar, punctuation and word separation rules. Tokenization of input text solves simple problems such as stripping punctuation from words, but it can be complicated by the fact that certain languages include punctuation inside words (for example, the French word aujourd’hui (today)) and certain words may contain punctuation (such as C++, a programming language). Tokenization also recognizes sentence boundaries so that a document containing the fragment “...in New York. Researchers from...” won’t be returned as an exact match for a phrasal query on “New York Researchers.”

Stemming and Lemmatization
Recognized tokens are further normalized either through simple suffix removal or by morphological analysis. A stemming module [Baeza-Yates and Ribeiro-Neto, 2010, Chapter 7] will apply language-specific suffixing rules to remove common suffixes, for example, removing a trailing -s from “items” to form the stem “item,” which is then indexed.1 More elaborate, linguistically-based morphological analysis (also called lemmatization) uses dictionaries and rules [Salton, 1989] extensively to identify more complex variants, transforming “mice” into “mouse” for indexing.

Part of Speech Tagging
Optional, and more computationally expensive, tagging modules [Manning and Schütze, 1999, Chapter 10] can be used to improve lemmatization by first identifying the part of speech of a word, determining by context whether a term like “ride” is being used in the source document as a verb rather than a noun, so that the term can be mapped to appropriate variants like “rode” and “riding.” Some search engines offer the option of indexing words selectively, based on their part of speech.
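The tokenization and stemming steps above can be sketched as follows. This is a toy illustration under stated assumptions, not how a production engine works: the exception pattern `KEEP_WHOLE`, the sentence splitter, and the trailing -s stemmer are all hypothetical simplifications (the stemmer would mangle a word like “this”).

```python
import re

# Hypothetical exception list: tokens whose internal punctuation must be kept.
KEEP_WHOLE = r"C\+\+|aujourd['’]hui"
TOKEN_RE = re.compile(rf"{KEEP_WHOLE}|\w+", re.IGNORECASE)

def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def tokenize(sentence):
    """Lowercased tokens; punctuation is dropped except inside exceptions."""
    return [t.lower() for t in TOKEN_RE.findall(sentence)]

def stem(token):
    """Toy suffix stripper: removes a trailing -s only."""
    if len(token) > 3 and token.endswith("s") and not token.endswith("ss"):
        return token[:-1]
    return token

def analyze(text):
    """Return one list of stemmed tokens per sentence."""
    return [[stem(t) for t in tokenize(s)] for s in split_sentences(text)]

def phrase_in_sentence(phrase, sentences):
    """True if the stemmed phrase occurs contiguously within a single sentence."""
    target = [stem(t) for t in tokenize(phrase)]
    n = len(target)
    return any(sent[i:i + n] == target
               for sent in sentences
               for i in range(len(sent) - n + 1))

sentences = analyze("We met in New York. Researchers from C++ teams sorted items.")
```

Because tokens are grouped by sentence, `phrase_in_sentence("New York", sentences)` succeeds while `phrase_in_sentence("New York Researchers", sentences)` does not, mirroring the sentence-boundary example in the text.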
Chunking
Part-of-speech-tagged text can be parsed or chunked [Abney, 1991] to recognize units, such as noun phrases, which can be indexed as a unit or stored with the document as additional features. For example, from the text "... information from barcode scanners are downloaded to the ...", the normalized noun phrase barcode_scanner can be recognized by parsing and added as an additional feature in the index.

1 Stemming, Lemmatization, and Part of Speech Tagging may likewise be applied at query time as well as during indexing.
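The chunking example can be sketched on input that has already been tagged. The tags below are hand-assigned in Penn Treebank style for illustration (an assumption; no tagger is run here), and the rule of joining consecutive nouns with an underscore is a hypothetical simplification of real noun phrase grammars.

```python
def noun_phrases(tagged):
    """Group runs of noun tags (NN, NNS) into normalized multi-word features."""
    phrases, run = [], []
    for word, tag in tagged + [("", "END")]:  # sentinel flushes the final run
        if tag in ("NN", "NNS"):
            w = word.lower()
            # Crude normalization: strip the plural -s from plural nouns.
            run.append(w[:-1] if tag == "NNS" and w.endswith("s") else w)
        else:
            if len(run) > 1:  # keep only multi-word units
                phrases.append("_".join(run))
            run = []
    return phrases

# Hand-tagged fragment from the example in the text.
tagged = [
    ("information", "NN"), ("from", "IN"), ("barcode", "NN"),
    ("scanners", "NNS"), ("are", "VBP"), ("downloaded", "VBN"),
]
features = noun_phrases(tagged)
```

For this fragment, `features` contains the single unit `barcode_scanner`, which could then be added to the index alongside the individual words.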
6.1.2 RELEVANCY CRITERIA
For both Web and enterprise engines, the indexing process also includes computation of general ranking and relevancy scores for the document, as well as the assignment of relevancy-related metadata that can be used at query time to determine relevance in the context of a specific query. This may include, for example, the position and frequency of a given term within a document [Clarke and Cormack, 2000], a weighting for the ‘credibility’ of the source [Caverlee and Liu, 2007], or the GPS coordinates of the document for use in a map-based search. On the Web, document processing may include calculating the number and quality of external links to the document (or web page) and the frequency of page updates [Page et al., 1998]. Once Natural Language Processing and relevancy assignments are complete, documents are automatically mapped to high level index fields (like language or mime_type) and written to the index.2
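To make the idea of scoring pages by the number and quality of their incoming links concrete, here is a minimal sketch in the spirit of the link analysis cited above. The damping factor of 0.85, the iteration count, and the three-page link graph are assumptions chosen for illustration; this is not a description of any engine's actual ranking formula.

```python
def link_scores(links, damping=0.85, iterations=50):
    """Iteratively score pages so that links from high-scoring pages count more."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    scores = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        new = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * scores[page] / len(targets)
                for t in targets:
                    new[t] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                for t in pages:
                    new[t] += damping * scores[page] / n
        scores = new
    return scores

# Hypothetical link graph: a and b link to c, and c links back to a.
scores = link_scores({"a": ["c"], "b": ["c"], "c": ["a"]})
```

In this graph, page c scores highest because it has two incoming links, a comes next because its single incoming link is from the high-scoring c, and b, with no incoming links, scores lowest.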
6.2 DATABASES
Conventional relational databases do not apply linguistic analysis when preparing data to be written to database tables. Instead, natural language processing is replaced by conventional data processing: the data is mapped, formatted and loaded according to strictly defined structures and procedures. These mappings must traditionally be established manually for each source. For example, in the case of a structured source, the source system may record customer gender in a “Gender” column, with cells containing “Female” or “Male” values, while the target system uses a “Sex” column with cells containing “1” for male and “2” for female. The administrator would need to configure the connector, API or ETL tool to map these columns and to convert the textual data to its numeric counterparts. For data entered by end users of applications built on databases, this mapping would be handled by the application programming. For example, the user may select “Female” or “Male” from a pulldown menu, but the backend programming would pass values of “1” or “2” to the SQL script in charge of writing data to the database. At the time of the load (or SQL commit), data validation is applied. If validation fails, it may result in full rejection of the transaction.
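The Gender-to-Sex mapping and the all-or-nothing validation described above can be sketched with Python's standard sqlite3 module. The table name `customer`, its columns, and the function `load_customers` are hypothetical; only the value mapping (1 for male, 2 for female) comes from the example in the text.

```python
import sqlite3

# Mapping from the example: source "Gender" text to target "Sex" code.
GENDER_TO_SEX = {"Male": 1, "Female": 2}

def load_customers(conn, source_rows):
    """Load all rows in one transaction; reject the whole batch on any failure."""
    try:
        with conn:  # commits on success, rolls back if an exception is raised
            for name, gender in source_rows:
                if gender not in GENDER_TO_SEX:
                    raise ValueError(f"unmapped Gender value: {gender!r}")
                conn.execute(
                    "INSERT INTO customer (name, sex) VALUES (?, ?)",
                    (name, GENDER_TO_SEX[gender]),
                )
        return True
    except (ValueError, sqlite3.IntegrityError):
        return False

conn = sqlite3.connect(":memory:")
# Hypothetical target schema with a validation constraint on the Sex column.
conn.execute(
    "CREATE TABLE customer ("
    "name TEXT NOT NULL, sex INTEGER NOT NULL CHECK (sex IN (1, 2)))"
)

rejected = load_customers(conn, [("Ann", "Female"), ("Bob", "Unknown")])
accepted = load_customers(conn, [("Ann", "Female"), ("Bob", "Male")])
rows = conn.execute("SELECT name, sex FROM customer ORDER BY name").fetchall()
```

The first batch is rejected in full because one value cannot be mapped, so even its valid row is rolled back; only the rows of the second batch end up in the table.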
6.3 WHAT HAS CHANGED

6.3.1 SEARCH ENGINES
There are three main evolutions in search engine data processing that enable the unified access to information found in Search Based Applications:

1. An expansion of the statistical calculations performed on source data
2. A capacity to extract the semantic information contained in structured data
3. An expansion of the range and depth of semantic processing on all data

2 One also says ingested into the index.

As search engines entered the enterprise, there was a marked expansion of the statistical calculations applied to 1) develop meaningful ranking criteria in the absence of Web-style popularity measures [Fagin et al., 2003], and 2) support dynamic categorization and clustering3 to provide task-oriented navigation and refinement of search results. To meet these two needs, Search Based Application engines began to apply counts to every indexable entity and attribute (references to a particular person in an email system, number of products in a database with a particular attribute, regional sales from an ERP, etc.). In SBAs, the capacity to organize information into categories and clusters is being used for generic, database-style content presentation and navigation (rather than simply search results refinement), and the wealth of statistical calculations intended to support relevancy is being repurposed to generate ad hoc reporting and analysis across all data facets, as will be covered in more depth in the next chapter.

Next, advances in connector technology [Kofler, 2005, Reese, 2000] are enabling search engines to retrieve data schemas from structured systems and to ingest these schemas as indexable entity attributes, as well as to employ them for high level index organization. The ability to capture, access and exploit the rich semantic metadata encapsulated in structured data is the prime reason search engines now provide an effective means of database offloading (see Figures 6.1 and 6.2).

The next SBA-enabling evolution is a significant extension of the depth and range of natural language processing [Manning and Schütze, 1999] employed by SBA engines.
These expanded semantic capabilities, coupled with basic mapping supplied in configuring APIs and connectors, enable SBA engines to:

• Effectively structure unstructured content
• Enrich data with meanings and relationships not reflected in source systems
• Aggregate heterogeneous, multi-source content (non-structured and/or structured) into a meaningful whole

This expansion is also the reason the ‘document’ in the search document model has been able to evolve from a literal document to a meaningful unit of business information. Beyond the basic Natural Language Processing functions outlined earlier (Language Detection, Tokenization, Stemming, Lemmatization and Part of Speech Tagging), semantic analysis today may include:

Word Sense Disambiguation
Using contextual analysis [Gabrilovich and Markovitch, 2009], for example, to determine whether a “crane” in a given text is a type of bird or a type of machine, using the words found around the word “crane” in the document and comparing these to word sense models created ahead of time. The word might then be indexed as “crane (bird)” or “crane (machine)”.

Figure 6.1: SBA engines use database connectors plus natural language processing to transform relational database models into a column-oriented document model.

Standard and Custom Entity Extraction
Identifying, extracting and normalizing entities like people, places, organizations, dates, times, physical properties, measurements, etc., a process aided by dictionaries or the use of thesauruses paired with extraction rules [Kazama and Torisawa, 2007].

Dependency Analysis
Identifying relations between words in a sentence, such as the subjects or objects of verbs.4 Dependency parsing helps to isolate noun phrases that can be indexed as a unit as well as showing relations between entities.

3 See, for example, http://demos.vivisimo.com/projects/BioMed
4 http://nlp.stanford.edu/software/lex-parser.shtml

Summarization
Producing, for example, a shorter text sample that would contain much of the important information in a longer text [Mani and Maybury, 1999]. Search engines generally post snippets in which the important query words were found close together. Summarization of longer documents can also be done independently of a query, by calculating the most important (frequent) words in the documents and pulling out a number of sentences that are dense in these important words.

Figure 6.2: The entity attributes and relationships extracted from the database remain discretely accessible and exploitable.

Pronoun Resolution
Determining, for example, who the words “they” or “this person” refer to in a text [Mitkov, 2001]. Pronoun resolution connects these anaphoric references with the most likely entities found in the preceding text.

Event and Fact Extraction
Recognizing basic narrative constructions: for example, determining if a text describes an attack (even if the word “attack” was not used in the text) and determining when it took place, who was involved, and whether anybody was hurt [Ji and Grishman, 2008]. Event extraction,
which often relies on dependency extraction, can provide a database-entry-like summary of a text.

Table Extraction
Recognizing tables in text [Cafarella et al., 2010]. This can help type entities, since the column heading often provides the type of the entities found in the column. It can also be used to attach attributes to these entities, for example, the team that a basketball player plays for. There is much research into transforming data found in semi-structured tables into Linked Open Data [Bizer et al., 1998].

Multimedia Analysis
Extracting semantic information beyond standard metadata like titles and descriptions; for instance, using automatic speech-to-text transcription to open access to speech recordings and videos [Lamel and Gauvain, 2008], or applying object recognition processing to images, such as face detection or color or pattern matching [Lew et al., 2006].

Sentiment Analysis
Deciding if a text is positive or negative, and what principal emotive sentiments it conveys [Grefenstette, 2004]. This is done using lexicons of negatively and positively charged words. Current techniques use specific lexicons for each domain of interest.5

5 See http://www.urbanizer.com, which implements domain lexicons for restaurants’ cuisine, service and atmosphere.

Dynamic Categorization
Determining from context if, for example, the primary subject of a document is medicine, or finance, or sports, and tagging it as such. Documents may be clustered according to these dynamic categories, and one may apply further classification technologies to these clusters to organize them into a meaningful structure [Russell and Norvig, 2009, Chapter 18], [Qi and Davison, 2009]. Examples include:

• Rule Based Classification (supervised classification using a model ontology and document collection). Rule Based Classification may include the use of ‘fuzzy’ ontology matching (using contextual analysis, synonyms and linguistic variants) to more effectively identify common entities across heterogeneous source systems.
• Bayesian Classification (unsupervised classification based on statistical probabilities)
• SVM (Support Vector Machine) Classification (another form of supervised statistical learning)

Relationship Mapping
Using the principle of “co-occurrence” to map relationships between people, objects, places, events, etc.6 For example, if person A and person B are mentioned in one text, and persons B and C in another, there is tangential evidence of a relationship between persons A and C.

These and other types of semantic analysis tasks all involve deciding when two things are the same, applying some label (metadata) to a piece of information to better represent its meaning and context, or showing the relationship between two things. Commercial search engines7 may contain up to 20 different semantic processors to extend and enrich content. Accordingly, the depth and complexity of the meanings and relationships a modern semantic engine can unearth and capture exceed those of most database systems. Moreover, these relationships are organic rather than pre-defined, which means they can evolve as the underlying data evolves.
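The co-occurrence principle described under Relationship Mapping can be sketched in a few lines. The input is assumed to be the output of an earlier entity extraction step (the entity lists below are hypothetical), and the two function names are illustrative only.

```python
from collections import defaultdict
from itertools import combinations

def cooccurrence_graph(doc_entities):
    """Link every pair of entities mentioned in the same text."""
    graph = defaultdict(set)
    for entities in doc_entities.values():
        for a, b in combinations(sorted(set(entities)), 2):
            graph[a].add(b)
            graph[b].add(a)
    return graph

def indirect_links(graph, entity):
    """Entities reachable through one shared neighbor but never co-mentioned."""
    direct = graph.get(entity, set())
    second_hop = set()
    for neighbor in direct:
        second_hop |= graph.get(neighbor, set())
    return second_hop - direct - {entity}

# Hypothetical extraction output for the two texts in the example.
doc_entities = {
    "text1": ["Person A", "Person B"],
    "text2": ["Person B", "Person C"],
}
graph = cooccurrence_graph(doc_entities)
related = indirect_links(graph, "Person A")
```

Here `related` contains only Person C: A and C never appear together, but both co-occur with B, which is the tangential evidence the text refers to.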
6.3.2 DATABASES
In general, databases continue to follow strict, top-down data processing procedures. These procedures are part and parcel of the core strength of databases: ensuring data integrity. What has changed, however, is that conventional databases are increasingly employing Natural Language Processing to aid and automate the mapping processes during data integration, especially when dealing with very large databases. To return to our prior example, a natural language processor may be used within an ETL tool to automatically map the “Gender” and “Sex” columns or, at least, to identify them as possible matches for subsequent human review.8 To date, however, the NLP tools used in such processes are limited, and advanced semantics are not employed at the processing level, though the semantics inherent in the database structure itself can, of course, be exploited to great advantage during the retrieval process.

With the exception of relationship mapping in graph databases (such as Neo4j), NoSQL systems (key-value, document, and wide-column stores) likewise employ NLP in a limited manner. They do not support full text indexing (or, consequently, full text searching), nor the automatic categorization and clustering that enable faceted search and navigation, nor the semantic tagging that supports fuzzy query interpretation and matching. In addition, they do not support the ranking and relevancy calculations that are being used to provide reporting and analysis in SBAs. In fact, both RDBMS and NoSQL systems are typically paired with a search engine (external or embedded) to deliver these capabilities, and NoSQL systems may instead be paired with an RDBMS for categorization, reporting and analysis.
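The column-matching scenario above, in which “Gender” and “Sex” are flagged as possible matches for human review, can be sketched as follows. The synonym groups stand in for the thesaurus or ontology a real tool would consult; they, the normalization rule, and the function names are all hypothetical.

```python
# Hypothetical synonym groups; a real tool would draw on a thesaurus or ontology.
SYNONYMS = [
    {"gender", "sex"},
    {"surname", "last_name", "family_name"},
]

def normalize(name):
    """Lowercase a column name and replace spaces with underscores."""
    return name.strip().lower().replace(" ", "_")

def candidate_matches(source_cols, target_cols):
    """Propose (source, target, reason) pairs for subsequent human review."""
    pairs = []
    for s in source_cols:
        for t in target_cols:
            ns, nt = normalize(s), normalize(t)
            if ns == nt:
                pairs.append((s, t, "exact"))
            elif any(ns in group and nt in group for group in SYNONYMS):
                pairs.append((s, t, "synonym: review"))
    return pairs

matches = candidate_matches(["Gender", "Last Name", "City"], ["Sex", "Surname", "city"])
```

Exact name matches could be mapped automatically, while synonym-based candidates such as Gender and Sex are only flagged, leaving the final decision to a person.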
6 See, for example, http://labs.exalead.com/experiments/miiget.html
7 Attivio, Autonomy, Endeca, Exalead, Funnelback, Sinequa, Vivisimo, among others.
8 See, for example, http://www.megaputer.com/polyanalyst.php
CHAPTER 7

Data Retrieval

At A Glance

Characteristic                  Search Engine                Databases
Read Pattern                    Column                       Row
Query method                    Natural language             SQL commands
Algebraic, numeric operations   No                           Yes
Filtering                       Post (Ranking/relevancy)     Pre (Exact match)
Query Interface                 Unique query box             Form-based interface
Data output                     Ranked lists, visualisation  Limited by data and structures

7.1 SEARCH ENGINES

7.1.1 QUERYING
The traditional role of search engines is to help human users locate documents (Web pages, text files, etc.). To do so, a user enters a natural language question or one or more key words in a search text box. As in the content indexing pipeline, the search engine first parses the user’s request into individual words (tokenization), then identifies possible variants for these words (stemming and lemmatization) before identifying and evaluating matches. There are three basic types of queries for conventional search engines: Ranked Queries, Phrasal Queries and Boolean Queries.

• Ranked Queries
Ranked queries produce results that are ordered (ranked) according to some computed relevancy score. On the Web, this score includes lexical, phonetic or semantic distance factors and ‘popularity’ factors (like the number and type of external links pointing to the document [Brin and Page, 1998]). Ranked queries are also called top-k queries [Ilyas et al., 2004] in the database world.

• Phrasal Queries
Phrasal queries are ranked queries that take word order into account to make the best match. For a query like “dog shelters New York,” the engine would give a higher score to a document entitled “List of dog shelters in New York City” than to a document entitled “York Supervisor Complains New Emergency Shelter Not Fit for a Dog.” Supporting phrasal queries implies that the positions of the words inside each input document are also stored in the index.

• Boolean Queries
Search engines also support Boolean queries, which contain words plus Boolean operators like AND, OR and NOT, and may employ parentheses to show the order in which relationships should be considered:

(crawfish OR crayfish) AND harvesting
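The Boolean and phrasal query types above can be illustrated against a small positional index. This is a sketch under stated assumptions: the index layout, the function names, and the sample documents are hypothetical, Boolean operators are mapped directly onto Python set operations, and the phrase check is an exact-match simplification of the ranked phrasal matching described in the text.

```python
from collections import defaultdict

def build_index(docs):
    """Map each lowercased word to {doc_id: [positions]}."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for pos, word in enumerate(text.lower().split()):
            index[word].setdefault(doc_id, []).append(pos)
    return index

def docs_with(index, word):
    """Set of documents containing the word."""
    return set(index.get(word.lower(), {}))

def phrase_match(index, phrase):
    """Documents in which the words of the phrase appear consecutively."""
    words = phrase.lower().split()
    hits = set()
    for doc_id, starts in index.get(words[0], {}).items():
        for start in starts:
            if all(start + i in index.get(w, {}).get(doc_id, [])
                   for i, w in enumerate(words)):
                hits.add(doc_id)
    return hits

# Hypothetical document collection.
docs = {
    1: "crawfish harvesting season opens",
    2: "crayfish recipes",
    3: "harvesting crayfish in louisiana",
    4: "list of dog shelters in new york city",
    5: "york supervisor complains new emergency shelter not fit for a dog",
}
index = build_index(docs)

# (crawfish OR crayfish) AND harvesting
boolean_hits = (docs_with(index, "crawfish") | docs_with(index, "crayfish")) \
    & docs_with(index, "harvesting")
phrase_hits = phrase_match(index, "new york")
```

The Boolean query selects documents 1 and 3, and the phrase check selects only document 4: document 5 contains both “new” and “york”, but not in adjacent positions, which is why stored positions are needed.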
7.1.2 OUTPUT
For conventional search engines, results are output as ranked lists on a Search Engine Results Page, or SERP (Figure 7.1). These lists typically contain document titles, content snippets or summaries, hyperlinks to source documents, and basic metadata like file size and type. Users scan the list and click through to source documents until they find what they need (or rephrase or abandon their search). A whole industry, Search Engine Optimization (SEO), has grown up to help web page authors and owners improve their ranking in these lists.1
7.2 DATABASES

7.2.1 QUERYING
Relational databases are queried for a broad range of purposes: to perform a task (e.g., process an order), locate specific data (e.g., look up an address), display content, or analyze information (e.g., generate and review reports by various data dimensions). These functions are performed using Structured Query Language (SQL) commands. SQL2 is a robust query language permitting a wide range of operations for both numerical and textual data. Retrieval options include comparisons (=, , >, =,