SYNTACTIC PATTERN RECOGNITION FOR SEISMIC OIL EXPLORATION Hi) u- Yuunil mi ni> 60
■ MACHINE PERCEPTION ARTIFICIAL INTEL...
42 downloads
1169 Views
5MB Size
Report
This content was uploaded by our users and we assume good faith they have the permission to share this book. If you own the copyright to this book and it is wrongfully on our website, we offer a simple DMCA procedure to remove your content from our site. Start by pressing the button below!
Report copyright / DMCA form
SYNTACTIC PATTERN RECOGNITION FOR SEISMIC OIL EXPLORATION Hi) u- Yuunil mi ni> 60
■ MACHINE PERCEPTION ARTIFICIAL INTELLIGENCE ^ ^ ^ V o l u m e 46 ^ ^ 1
World Scientific
50
30
20
10
SYNTACTIC PATTERN RECOGNITION FOR SEISMIC OIL EXPLORATION
SERIES IN MACHINE PERCEPTION AND ARTIFICIAL INTELLIGENCE* Editors: H. Bunke (Univ. Bern, Switzerland) P. S. P. Wang (Northeastern Univ., USA)
Vol. 34: Advances in Handwriting Recognition (Ed. S.-W. Lee) Vol. 35: Vision Interface — Real World Applications of Computer Vision (Eds. M. Cherietand Y.-H. Yang) Vol. 36: Wavelet Theory and Its Application to Pattern Recognition (V. V. Tang, L. H. Yang, J. Liu and H. Ma) Vol. 37: Image Processing for the Food Industry (E. Fl. Davies) Vol. 38: New Approaches to Fuzzy Modeling and Control — Design and Analysis (M. Margaliot and G. Langholz) Vol. 39: Artificial Intelligence Techniques in Breast Cancer Diagnosis and Prognosis (Eds. A. Jain, A. Jain, S. Jain and L Jain) Vol. 40: Texture Analysis in Machine Vision (Ed. M. K. Pietikainen) Vol. 41: Neuro-Fuzzy Pattern Recognition (Eds. H. Bunke and A. Kandel) Vol. 42: Invariants for Pattern Recognition and Classification (Ed. M. A. Rodrigues) Vol. 43: Agent Engineering (Eds. Jiming Liu, Ning Zhong, Yuan Y. Tang and Patrick S. P. Wang) Vol. 44: Multispectral Image Processing and Pattern Recognition (Eds. J. Shen, P. S. P. Wang and T. Zhang) Vol. 45: Hidden Markov Models: Applications in Computer Vision (Eds. H. Bunke and T. Caelli) Vol. 46: Syntactic Pattern Recognition for Seismic Oil Exploration (K. Y. Huang) Vol. 47: Hybrid Methods in Pattern Recognition (Eds. H. Bunke and A. Kandel) Vol. 48: Multimodal Interface for Human-Machine Communications (Eds. P. C. Yuen, Y. Y. Tang and P. S. P. Wang) Vol. 49: Neural Networks and Systolic Array Design (Eds. D. Zhang and S. K. Pal)
*For the complete list of titles in this series, please write to the Publisher.
Series in Machine Perception and Artificial Intelligence - Vol. 46
SYNTACTIC PATTERN RECOGNITION FOR SEISMIC OIL EXPLORATION
Kou-Yuan Huang Department of Computer and Information Science National Chiao Tung University, Taiwan Hsinchu, Taiwan
V|S* World Scientific w l
NewJersey Sinqapore •»lLondon • Hong Kong New Jersey • Singapore
Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 912805 USA office: Suite IB, 1060 Main Street, River Edge, NJ 07661 UK office: 57 Shelton Street, Covent Garden, London WC2H 9HE
British Library Cataloguing-in-Publication Data A catalogue record for this book is available from the British Library.
SYNTACTIC PATTERN RECOGNITION FOR SEISMIC OIL EXPLORATION Series in Machine Perception & Artificial Intelligence Volume 46 Copyright © 2002 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher.
For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 222 Rosewood Drive, Danvers, MA 01923, USA. In this case permission to photocopy is not required from the publisher.
ISBN 981-02-4600-5
Printed in Singapore by Mainland Press
In memory of my father
To my mother, my wife, Jen-Jen, my children Yuh-Shan Cathy and Harry
In memory of the late Goss Distinguished Professor King-sun Pu School of Electrical and Computer Engineering Purdue University
This page is intentionally left blank
AUTHOR'S BIOGRAPHY Kou-Yuan Huang received the B.S. in Physics and M.S. in Geophysics from National Central University, Taiwan, in 1973 and 1977, respectively, and the M.S.E.E. and Ph.D. degrees in Electrical and Computer Engineer ing from Purdue University, West Lafayette, Indiana, in 1980 and 1983, respectively. Since 1978, he was a Graduate Research Assistant at Purdue Univer sity. From 1978 to 1979 he was in Geophysics, Department of Geoscience. From 1979, he was in the School of Electrical and Computer Engineering. He joined the Laboratory for Applications of Remote Sensing (LARS) in 1979. From 1981 to 1983, he was in the Advanced Automation Research Laboratory. From September 1983 to August 1988, he was the Faculty in the Department of Computer Science, University of Houston. Now he is the Professor at the Department of Computer and Information Science at National Chiao Tung University. From August 1992 to July 1993, he was the Visiting Scholar at University of Texas at Austin for one semester and later at Princeton University. From August 1996 to July 1997, he took his sabbatical leave at Rice University and University of Houston. He widely published papers in journals: Geophysics, Geoexploration, Pattern Recognition, IEEE Transactions on Geoscience and Remote Sensing,..., etc. His major contributions are in the areas of seismic pattern recognition using image processing, statistical, syntactic, neural networks, and fuzzy logic methods.
VU
This page is intentionally left blank
PREFACE The use of pattern recognition has become more and more important in seismic oil exploration. Interpreting a large volume of seismic data is a challenging problem. Seismic reflection data in the one-shot seismogram and stacked seismogram may contain some structural information from the response of subsurface. Syntactic/structural pattern recognition techniques can recognize the structural seismic patterns and to improve seismic inter pretations. In 1-D seismic data analyses, different Ricker wavelets represent differ ent pattern classes. We can describe the Ricker wavelets into strings or sentences of symbols. Then we can parse the testing string by grammar or compute the distance between testing string and training string, and assign the testing string to its correct class. In 2-D seismic data analyses, the primary reflection from the geologic structure of gas and oil sand zones can show a seismic structure (bright spots). The bright spot pattern can be processed and represented as a tree representation. Then we can use the tree recognition system to recognize this kind of seismic pattern. In 1-D syntactic analyses, the methods include (1) the error-correcting finite-state parsing for the recognition of the 1-D string of Ricker wavelets, (2) the modified error-correcting Earley's parsing and (3) the parsing using match primitive measure for the recognition of the 1-D attributed string of the wavelets, and (4) the Levenshtein distance computation and (5) the likelihood ratio test for wavelet recognition. In the 2-D tree automata, the methods include (6) the weighted minimum-distance structure preserved error-correcting tree automaton and (7) the modified maximum-likelihood structure preserved error-correcting tree automaton for syntactic parsing of the 2-D seismic bright spot patterns. Finally we present (8) a hierarchical system to recognize seismic patterns in a seismogram.
IX
X
PREFACE
Syntactic seismic pattern recognition can be one of the milestones to ward geophysical intelligent interpretation system. The syntactic methods in this book can be applied to other fields, for example: medical diagnosis system. This book has been written for geophysicists, computer scientists and electrical engineers. I thank Kevin M. Barry of Teledyne Exploration for providing real seismic data. I am indebted to my graduate students, especially at the University of Houston — University Park (1983-1988), in many helpful discussions. At last I thank Professor C. H. Chen at the University of Mas sachusetts at Dartmouth. He encouraged me to write a paper, "Syntactic pattern recognition," in the book, Handbook of Pattern Recognition & Com puter Vision, edited by C. H. Chen, L. F. Pau, and P. S. P. Wang, World Scientific Publishing, 2nd edition, 1998/99. Then using the syntactic ap proaches to the seismic exploration data, I can finish this book. This book was partially supported by the National Science Council, Taiwan, under grant NSC-78-0408-E009-16, NSC-80-0408-E-009-17, NSC-81-0408-E-00912 and NSC-82-0408-E-009-065. Kou-Yuan Huang Hsinchu, Taiwan
CONTENTS AUTHOR'S BIOGRAPHY
vii
PREFACE
ix
1
INTRODUCTION TO SYNTACTIC PATTERN RECOGNITION 1.1. SUMMARY 1.2. INTRODUCTION 1.3. ORGANIZATION OF THIS BOOK
2
INTRODUCTION TO FORMAL LANGUAGES AND AUTOMATA
2.1. SUMMARY 2.2. LANGUAGES AND GRAMMARS
2.3. 2.4. 2.5.
2.6.
Type 0 (unrestricted) grammar Type 1 (context-sensitive) grammar Type 2 (context-free) grammar Type 3 (finite-state or regular) grammar FINITE-STATE AUTOMATON EARLEY'S PARSING FINITE-STATE GRAMMATICAL INFERENCE 2.5.1. Inference of Canonical Finite-State Grammar 2.5.2. Inference of Finite-State Grammar Based on K-Tails . . STRING DISTANCE COMPUTATION
xi
1 1 1 3 7
7 7 9 9 9 9 10 14 16 16 17 18
xii
ERROR-CORRECTING FINITE-STATE A U T O M A T O N FOR RECOGNITION OF RICKER WAVELETS 3.1. SUMMARY 3.2. INTRODUCTION 3.3. SYNTACTIC PATTERN RECOGNITION 3.3.1. Training and Testing Ricker Wavelets 3.3.2. Location of Waveforms and Pattern Representation 3.4. EXPANDED GRAMMARS 3.4.1. General Expanded Finite-State Grammar 3.4.2. Restricted Expanded Finite-State Grammar 3.5. MINIMUM-DISTANCE ERROR-CORRECTING FINITE-STATE PARSING 3.6. CLASSIFICATION OF RICKER WAVELETS 3.7. DISCUSSION AND CONCLUSIONS
CONTENTS
3
ATTRIBUTED GRAMMAR A N D ERROR-CORRECTING EARLEY'S PARSING 4.1. SUMMARY 4.2. INTRODUCTION 4.3. ATTRIBUTED PRIMITIVES AND STRING 4.4. DEFINITION OF ERROR TRANSFORMATIONS FOR ATTRIBUTED STRINGS 4.5. INFERENCE OF ATTRIBUTED GRAMMAR 4.6. MINIMUM-DISTANCE ERROR-CORRECTING EARLEY'S PARSING FOR ATTRIBUTED STRING 4.7. EXPERIMENT
. .
21 21 21 22 22 25 25 25 28 31 32 37
4
A T T R I B U T E D G R A M M A R A N D MATCH PRIMITIVE M E A S U R E (MPM) FOR RECOGNITION OF SEISMIC WAVELETS 5.1. SUMMARY 5.2. SIMILARITY MEASURE OF ATTRIBUTED STRING MATCHING 5.3. INFERENCE OF ATTRIBUTED GRAMMAR 5.4. TOP-DOWN PARSING USING MPM
39 39 39 41 41 42 45 47
5
51 51 51 55 56
CONTENTS
xiii
5.5. EXPERIMENTS OF SEISMIC PATTERN RECOGNITION 5.5.1. Recognition of Seismic Ricker Wavelets 5.5.2. Recognition of Wavelets in Real Seismogram 5.6. CONCLUSIONS
58 58 60 64
6
S T R I N G DISTANCE A N D LIKELIHOOD RATIO TEST FOR D E T E C T I O N OF C A N D I D A T E B R I G H T SPOT 6.1. SUMMARY 6.2. INTRODUCTION 6.3. OPTIMAL QUANTIZATION ENCODING 6.4. LIKELIHOOD RATIO TEST (LRT) 6.5. LEVENSHTEIN DISTANCE AND ERROR PROBABILITY 6.6. EXPERIMENT AT MISSISSIPPI CANYON 6.6.1. Likelihood Ratio Test (LRT) 6.6.2. Threshold for Global Detection 6.6.3. Threshold for the Detection of Candidate Bright Spot 6.7. EXPERIMENT AT HIGH ISLAND T R E E G R A M M A R A N D A U T O M A T O N FOR SEISMIC PATTERN R E C O G N I T I O N 7.1. SUMMARY 7.2. INTRODUCTION 7.3. TREE GRAMMAR AND LANGUAGE 7.4. TREE AUTOMATON 7.5. TREE REPRESENTATIONS OF PATTERNS 7.6. INFERENCE OF EXPANSIVE TREE GRAMMAR 7.7. WEIGHTED MINIMUM-DISTANCE SPECTA 7.8. MODIFIED MAXIMUM-LIKELIHOOD SPECTA 7.9. MINIMUM DISTANCE GECTA 7.10. EXPERIMENTS ON INPUT TESTING SEISMOGRAMS 7.11. DISCUSSION AND CONCLUSIONS
65 65 65 66 67 68 69 72 72 72 73
7
75 75 75 77 78 84 85 86 92 94 . . 95 102
xiv
A HIERARCHICAL R E C O G N I T I O N SYSTEM OF SEISMIC PATTERNS A N D F U T U R E S T U D Y 8.1. SUMMARY 8.2. INTRODUCTION 8.3. SYNTACTIC PATTERN RECOGNITION 8.3.1. Linking Processing and Segmentation 8.3.2. Primitive Recognition 8.3.3. Training Patterns 8.3.4. Grammatical Inference 8.3.5. Finite-state Error Correcting Parsing 8.4. COMMON-SOURCE SIMULATED SEISMOGRAM RESULTS 8.5. STACKED SIMULATED SEISMOGRAM RESULTS 8.6. CONCLUSIONS 8.7. FUTURE STUDY
CONTENTS
8
103 103 103 107 107 107 108 109 109 110 117 121 121
REFERENCES
123
INDEX
131
Chapter 1
INTRODUCTION TO SYNTACTIC PATTERN RECOGNITION
1.1.
SUMMARY
In this chapter we discuss the fundamental idea, system, methods, and applications of syntactic pattern recognition; the reason to use syntactic methods in seismic data. Also we describe the content of each chapter. 1.2.
INTRODUCTION
Syntactic pattern recognition has been developed over two decades, re ceived much attention and applied widely to many practical pattern recog nition problems, such as (1) English and Chinese character recognition, (2) fingerprint recognition, (3) speech recognition, (4) remote sensing data analysis, (5) biomedical data analysis in chromosome images, carotid pulse waves, EEG signals,..., etc., (6) scene analysis, (7) texture analysis, (8) 3-D object recognition, (9) two-dimensional mathematical symbols, (10) spark chamber pictures, (11) chemical structures, (12) geophysical seismic signal analysis,..., etc. [3, 6, 13-16, 19, 22-24, 30, 37, 39, 41, 46, 48, 49, 53-55, 58, 59, 62, 64-66, 72, 73, 78-81, 85-89, 92, 96, 98, 100, 112, 113, 116]. In the pattern recognition problems, besides the statistical approach, the structural information that describes the pattern is important, so we can use syntactic methods to recognize the pattern. A pattern can be decomposed into simpler subpatterns, and each simpler subpattern can be l
INTRODUCTION
2
Fig. 1.1.
TO SYNTACTIC
PATTERN
RECOGNITION
Block diagram of a syntactic pattern recognition system.
decomposed again into even simpler subpatterns, and so on. The simplest subpatterns are called primitives (symbols, terminals). A pattern can be described as a representation, i.e., a stringof primitives, a tree, a graph, an. array, a matrix, or an attributed string,..., etc. [33, 43, 64-66, 68, 91, 110]. We can parse the representation and assign the pattern to its correct class. A basic block diagram of the syntactic pattern recognition system is shown in Fig. 1.1. The system consists of two major parts: training and recognition. The training part consists of primitive (and relation) selec tion, grammatical inference, automata construction from the training pat terns, and the recognition part consists of preprocessing, segmentation or decomposition, primitive (and relation) recognition, construction of pattern representation, and syntactic parsing analysis for the input testing pattern. The finite-state grammar, context-free grammar and context-sensitive grammar of the formal language are adopted in the description of 1-D string representation of the pattern [2, 41]. The 1-D string grammars also include programmed grammar, indexed grammar, grammar of picture de scription language, transition network grammar, operator precedence gram mar, pivot grammar, plex grammar, attributed grammar,..., etc. [16, 36, 37, 39, 41, 49, 55, 90, 92, 95, 97, 101]. The syntactic parsing analy ses include finite-state automata, pushdown automata, top-down parsing, bottom-up parsing, Cocke-Younger-Kasami parsing, Earley's parsing,..., etc. [2, 41]. The description power can be extended from 1-D string grammars to high-dimensional pattern grammars for the analysis of 2-D and 3-D pat terns. The high-dimensional pattern grammars include tree grammar,
ORGANIZATION OF THIS BOOK
3
array grammar, web grammar, graph grammar, shape grammar, matrix grammar,..., etc. [17, 33, 41, 43, 66, 68, 88, 91, 94, 110]. The syntactic parsing analyses include tree automata, array automata,..., etc. For consideration of substitution, insertion, and deletion errors in the pattern, the automata can be expanded to error-correcting automata to accept the noisy pattern or distorted pattern [1, 64, 65, 84, 101, 102, 106, 115]. The 1-D string grammars and high-dimensional pattern grammars also include stochastic grammars, languages, and the corresponding parsers [29, 40, 44, 86, 101, 105]. The use of pattern recognition has become more and more important in seismic exploration [4, 5, 10, 11,18-21, 26, 50, 52-68, 99]. However, most of the papers emphasize statistical seismic pattern recognition. Interpreting a large volume of seismic data is a challenging problem. Seismic data in the one-shot seismogram and stacked seismogram may contain some phys ical and structural information from the response of subsurface. So before interpreting seismic data, it is better to have the techniques to process the seismic data and to improve seismic interpretation. Here using the structural information of seismic data, we propose the important syntactic approach to seismic pattern recognition.
1.3.
ORGANIZATION OF THIS BOOK
In Chapter 2, we start to discuss the fundamental theory of formal lan guages and parsing methods. There are four kinds of basic grammars and languages: finite-state, context-free, context-sensitive, and unrestricted. Finite-state automaton can recognize the finite-state language. Earley's parsing algorithm can recognize the context-free language. Finite-state grammar can be inferred from sample strings. Levenshtein distance is the distance computation between two strings. In Chapter 3, syntactic pattern recognition techniques are applied to the analysis of 1-D seismic traces to classify Ricker wavelets. Seismic Ricker wavelets have structural information in shape, and each Ricker wavelet can be represented by a string of symbols. To recognize the strings, we use a finite-state automaton to identify each string. The automaton can accept strings having substitution, insertion, and deletion errors of the symbols. There are two attributes, terminal symbol and weight, in each transition of
4
INTRODUCTION TO SYNTACTIC PATTERN
RECOGNITION
the automaton. A minimum-cost error-correcting finite-state automaton is proposed to parse the input string. Two methods of parsing attributed string are proposed. One is the modified error-correcting Earley's parsing in Chapter 4, and the other is a parsing using the match primitive measure (MPM) in Chapter 5. In Chapter 4, the modified minimum distance error-correcting Earley parsing for an attributed string can handle three types of error. The recognition criterion of the modified Earley's algorithm is "minimumdistance." We discuss the application of the parsing method to the recog nition of seismic Ricker wavelets and the recognition of wavelets in real seismic data in Chapter 5. In Chapter 5, the computation of the match primitive measure between two attributed strings using dynamic programming is proposed. The MPM parsing algorithm for an attributed string can handle three types of er ror. The MPM parsing algorithm is obtained from the computation be tween the input string and the string generated by the attributed grammar. The MPM parsing is more efficient than the modified Earley's parsing. The recognition criterion of the MPM parsing algorithm is "maximummatching". The parsing method is applied to the recognition of seismic Ricker wavelets and the recognition of wavelets in real seismic data. In Chapter 6, Levenshtein distance computation is applied to detect the candidate bright spot, trace by trace, in the real seismograms. The system for one-dimensional seismic analysis includes a likelihood ratio test, optimal amplitude-dependent encoding, probability of detecting the sig nal involved in the global and local detection, plus minimum-distance and nearest-neighbor classification rules. The relation between error probability and Levenshtein distance is proposed. In Chapter 7, tree automaton of syntactic pattern recognition is adopted to recognize 2-D structural seismic patterns. The tree automaton system includes two parts. In the training part of the system, the training seis mic patterns of known classes are transformed into their corresponding tree representations. Tree representations can infer tree grammars. Several tree grammars are combined into one unified tree grammar. Tree gram mar can generate the error-correcting tree automaton. In the recognition part of the system, each input testing seismogram passes through pre processing and tree representation of seismic pattern. Each input tree is parsed and recognized into the correct class by the error-correcting tree
ORGANIZATION OF THIS BOOK
5
automaton. Because of complex variations in the seismic patterns, three kinds of automaton are adopted in the recognition: weighted minimum distance structure preserved error-correcting tree automaton (SPECTA), modified maximum-likelihood SPECTA, and minimum distance generalized error-correcting tree automaton (GECTA). Weighted minimum distance SPECTA and modified maximum-likelihood SPECTA take only substi tution errors of the tree structure into consideration. Minimum-distance GECTA takes substitution, deletion, and insertion errors of the tree struc ture into consideration. The bright spot seismic pattern is shown as the example in the parsing steps. Tree automata could be applied to the recog nition of other seismic patterns, such as pinch-out, flat spot, gradual sealevel fall, and gradual sealevel rise patterns. The tree automaton system pro vides a tool for recognition of seismic patterns, and the recognition results can improve seismic interpretation. In Chapter 8, we present a hierarchical system to recognize seismic patterns in a seismogram. The seismic patterns are hierarchically decom posed or recognized into single patterns, straight-line patterns or hyperbolic patterns, using syntactic pattern recognition. The Hough transformation technique is used for reconstruction, pattern by pattern. The system of syntactic pattern recognition includes envelope generation, a linking pro cess in the seismogram, segmentation, primitive recognition, grammatical inference, and syntax analysis. The seismic patterns are automatically recognized and reconstructed.
This page is intentionally left blank
Chapter 2
INTRODUCTION TO FORMAL LANGUAGES AND AUTOMATA
2.1.
SUMMARY
In this chapter we introduce the fundamental theory of formal lan guages and syntactic parsing methods: finite-state grammar and language, context-free grammar and language, finite-state automaton for recognition of finite-state language, the Earley's parsing of context-free language, the inferences of finite-state grammar from training samples, and the distance computation between two strings. We use these fundamental methods in the later chapters.
2.2.
LANGUAGES AND GRAMMARS
The formal language theory was initially developed to understand the basic properties of natural languages. The phrase-structure grammar with a set of rewriting rules can be used as a method for describing languages. The phrase-structure languages and their relation to automata were described by Aho and Ullman [2]. Before defining the formal languages, we analyze a basic English sen tence, "The boy runs fast," using the English grammar. We can parse the sentence in tree. We can have a set of production rules or rewriting rules from the tree: (1) <sentence> -> <noun phrase> 7
INTRODUCTION TO FORMAL LANGUAGES AND AUTOMATA
(2) (3) (4) (5) (6) (7)
<noun phrase> -» <article> <noun> —> <article> -» The <noun> -> boy —> runs -> fast
where the symbol "—»•" means "can be rewritten as." The sentence can be derived by the production rules from (1) to (7) using the left-most substitution: <sentence> =>■ <noun phrase> => <article> <noun> =>■ The <noun> => The boy => The boy => The boy runs =>• The boy runs fast Definition 2.1 In the formal language theory, a grammar is a fourfold multiple, G — (Vjv, Vr,P,S), in which (1) Vjv is a finite set of nonterminal (nonprimitive) symbols, (2) VT is a set of terminal (primitive) symbols, V/vUVr = V, VjvU VT = , (3) P is a finite set of production rules denoted by a —>■ /3, where a is in (VJV U VTO'VAKVAT U VT)* and /3 is in (Vjv U
VT)*,
(4) 5 ( e VJV) is a starting nonterminal symbol, where (Vjv U Vp)* denotes the set containing all strings over (Vjv U VT) including zero length string (sentence) A. Prom the starting symbol 5, a string (sentence) is derived by using the production rules of P, S —► ao => cc\ => ■ • • => a m = a } . S =>■ a G
G
LANGUAGES AND GRAMMARS
9
represents that we can derive from S to the string (sentence) a in several derivation steps using production rules in G. Depending on the form of the production rules, the grammars can be divided into finite-state grammar, context-free grammar, context-sensitive grammar, and unrestricted grammar. Type 0 {unrestricted) grammar There is no restriction on the production rules. The languages generated by type 0 grammars are called type 0 languages. Type 1 {context-sensitive)
grammar
The production rules of type 1 (context-sensitive) grammars are of the form a\Aa2 —>• aif3a2 where A £ V/v, a i , a2 G V*, and (3 G V+ {j3 € V* and j3 ^ A). That nonterminal A is replaced by string /3 is dependent on the contexts of the both sides of A. The languages generated by contextsensitive grammars are called type 1, or context-sensitive, languages. Type 2 {context-free) grammar The production rules of type 2 (context-free) grammars are of the form A —> (3 where A 6 Vjf, and /3 G V+. That nonterminal A is replaced by string /? is independent on the contexts of the both sides of A. The languages generated by context-free grammars are called type 2, or contextfree, languages. Type 3 {finite-state or regular) grammar The production rules of type 3 (finite-state or regular) grammars are of the form A -» aB or A —> b where A, B e V^ and a, b G Vy. The languages generated by finite-state grammars are called type 3, or finite-state (or regular), languages. Example 2.1 Finite-state grammar and language Consider the grammar G={VN, VT, P, S), where VN = {S, A}, VT = {a, b}, and P: (1)
S-^aA
INTRODUCTION TO FORMAL LANGUAGES AND AUTOMATA
10
(2) (3)
A^aA A^b
From the form of the production rules, the grammar is a finite-state gram mar. A typical sentence is generated by the derivation S => aA =>■ aaA => aaaA => aaab. In general, L(G) = {anb\n = 1,2,...}. E x a m p l e 2.2 Context-free grammar and language Consider the grammar G = (VN, VT, P, S), where VN = {S, A}, VT = {0,1}, and P: (1) S-^OAO (2) A^OAQ (3) A -+ 1 The grammar is a context-free grammar and L(G) = {0™10n|n = 1,2,...}.
2.3.
FINITE-STATE A U T O M A T O N
A finite-state automaton is the simplest recognizer (recognition device) to recognize the strings (sentences) of the language which are generated from finite-state grammar. Definition 2.2
A deterministic finite-state automaton A is a quintuple
A=(^2,Q,8,q0,F)
,
where ^ is a finite set of input symbols, Q is a finite set of states, 6 is a mapping of Q x £) onto Q (the next state function), qo (G Q) is the initial state, and F C Q is the set of final states. T(A) is the language accepted by A A convenient representation of a finite-state automaton is given in Fig. 2.1. The finite control, in one of the states in Q, reads symbols from an input tape sequentially from left to right. Initially, the finite control is in state go a n d is scanning the leftmost symbol of a string in YT, which appears on the input tape. J2* denotes the set containing all strings over ^2 including A, the empty string. The interpretation of S{q,a)=ql,
q,q'eQ,
d £ ^
FINITE-STATE
AUTOMATON
11
(b) Fig. 2.2. (a) Graphical representation of 8{q, a) = q'. (b) A state transition table of % , o ) =q'.
is that the automaton A, in present state q and scanning the present input symbol a, goes to next state q' and the input head moves one square to the right. A convenient way to represent the mapping 6 is by use of a state transition diagram or a state transition table. The state transition diagram and table corresponding to S(q, a) = q' are shown in Figs. 2.2(a) and 2.2(b). The mapping 5 can be extended from an input symbol to a string of input symbols by defining S(q,X) = q,
6(q,xa) = 6(6(q,x),a),
x G Y J and a G \]
.
Thus, the interpretation of S(q, x) = q' is that the automaton A, starting in state q and scanning through the string x on the input tape, will be in state q' and the input head moves to the right from the portion of the input tape containing x.
INTRODUCTION
12
TO FORMAL LANGUAGES AND AUTOMATA
A string or a sentence w is said to be accepted by A if S(qo, w) =p for some p € F. The set of strings accepted by A is defined as T(A) = {w\5(q0,w) G F } . There is a theorem that transforms a finite-state grammar into a finitestate automaton [41]. The relation is that production rules in P become the mapping 5 in A Theorem 2.1 Let G = (V/v, VT,P, S) be a finite-state grammar. Then there exists a finite-state automaton A = (]T}, a is a production in P, add [S —> »a, 0] to LQ. Perform step (2) until no new items can be added to LQ. (2) For each item [A -> »Bf3,0] is in LQ and for all productions in P of the form B —> 7, add the item [B —> »7,0] to LQ. Now, construct Lj from
LQ,LI,
...,
Lj-\.
(3) For each item of the form [^4 —> a • a/3, i] in Lj-\ such that a = aj, add item [A -4 aa • f3, i] to Lj. Perform steps (4) and (5) until no new items can be added to Lj.
EARLEY'S
PARSING
15
(4) For each item [A -¥ a», i] in Lj and each item [B —> 7 • Af), k\ in Li, add [B ->■ 7 A • /?, ft] to Xj. (5) For each item [A —► a • i?/3, i] in Lj and for all productions in P of the form B —¥ 7, add the item [B —>■ #7, j] to L,-. Step (3) is the derivation to match the terminal a,j. Step (4) is the backward relay or connection in derivation of partial string. Step (5) is the forward derivation or expansion. If [S —>■ a»,0] is in Ln, then S can derive from position 0 to n for the string w = aia^- • • an, i.e., w is in L(G), otherwise the string w is not in L{G). The space complexity of Earley's parsing is Oin2), where n is the length of the input string. If the grammar is ambiguous, the time complexity is 0(n3), and if the grammar is unambiguous, the time complexity is 0(n2). Example 2.4 VN = {S,T,F},
Given the context-free grammar G = (VJV, Vr,P,S), VT = { « , + , * , ( , ) } , and P : (1)S-*S
+T
where
(4)S^T
(2) T ->■ T*F
(5)T->F
(3) F -» (S)
(6)F->a.
Let w = a*a. Applying the Earley's parsing algorithm, we obtain the parse lists for w. L\ :
L2 '■
L3 :
[5-*-«5 + T,0]
[F-tam.O]
[T->T*.F,0]
[F-¥
[S->«T,0]
[r^F.,0]
[F^.(5),2]
[T^r*F.,0]
[T-^.T*F,0]
[S^T;0]
[F-+«a,2]
[5-*r»,0]
[T->»F,0]
[T^-T«*F,0]
[F-+.(S),0]
[ S - + 5 • + r , 0]
LQ
:
a;2]
[T-+T»*F,0] [S-»5»+r,0]
[F -> «a, 0] Since [5 -> T», 0] is in L3, the input string a*a is in L(G). After a tring is accepted, its one or all possible derivation trees can be extracted [2]. Earley's parsing algorithm is used for error-correcting parsing of attributed string in Chapter 4.
INTRODUCTION
16
2.5.
TO FORMAL LANGUAGES AND AUTOMATA
FINITE-STATE G R A M M A T I C A L I N F E R E N C E
The grammar can be directly inferred from a set of sentences, L(G). The problem of learning a grammar based on a set of sentences is called gram matical inference. A basic diagram of a grammatical inference machine is shown as follows. Two finite-state grammar inference techniques are dis cussed: one is the canonical inference and the other is the K-tai\ inference.
Source grammar 2.5.1.
Inference
•^1 ) -*"2 '
•'
x
t}
of Canonical
Inference algorithm Finite-State
Inferred grammar Grammar
A basic canonical definite finite-state grammar can be inferred from a set of training strings [41]. The canonical definite finite-state grammar Gc associ ated with the positive sample set S+ = {x\, X2, ■ ■ ■, xn} is defined as follows: GC =
(VN,VT,P,S)
where S is the starting symbol, VN, VT, and P are generated using the following steps. (1) Check each string Xi e S+ and identify all of the distinct terminal symbols used in the strings of S+. Call this set of the terminal symbols as VT. (2) For each string x; € S+, Xi = a^aii ■ ■ ■ a,in, generate the corresponding production rules S —¥ CLuZil Zn -¥
aaZi2
Zi2 —► a^Ziz
^i,n—\
' din
Each Zij represents a new nonterminal symbol. (3) The nonterminal symbol set Vjv consists of S and all the distinct non terminal symbols Zitj produced in step (2). The set P consists of all the distinct production rules generated in step (2).
FINITE-STATE
GRAMMATICAL
INFERENCE
17
Example 2.5 Given a training string abbe, the inferred canonical definite finite-state grammar GC(VN, Vp, P, S) is as follows: VN = {S,A,B,C}
VT = {a,b,c}
S = {S} .
The production rule set P:
(0)S^aA,
(l)A-¥bB,
(2)B-+bC,
(3)C-+c.
We use this example in Chapter 3 for inference of expanded
finite-state
grammar.
2.5.2.
Inference of Finite-State on K- Tails
Grammar
Based
Initially, we define the derivative of a language with respect to the symbol, the new set of derivative is assigned as a state and the relation can become a production rule reversely. Then we combine the derivative and AT-tails to infer the finite-state grammar from a set of training strings or sentences. Definition 2.4 The formal derivative of a set of strings S with respect to the symbol a 6 Vp is defined DaS = {x\ax € 5 } . The formal derivative can easily be extended so that if ail
x-
\ Deletion
\Substitution
Definition 2.8 Weighted error transformations [41] Similar to the defi nition of error transformations, the weighted error transformations can be defined as follows. (1) Weighted substitution error transformation u\au)2 S|—>' ui\bw2, for a,b £ Vx, a y£ b, where S(a,b) is the cost of substituting a by b. Let S(a,a) =0.
20
INTRODUCTION TO FORMAL LANGUAGES AND AUTOMATA
(2) Weighted deletion error transformation a;iatJ2 °h—> ^x^i-, for a € Vp, where D(a) is the cost of deleting a from u}iau)2(3) Weighted insertion error transformation W1UJ2 (■—> W1&W2, for b S Vp, where 1(b) is the cost of inserting b. Definition 2.9 Weighted distance between two strings The weighted dis tance between two strings x, y £ Vp, dw(x,y), is defined as the smallest cost of weighted error transformations required to derive y from x. A l g o r i t h m 2.2
Weighted distance between two strings [109]
Input:
Two strings x = a\a2 • ■ -an and y = &i&2 • • • bm, substitution error cost S(a, b), S(a,a) = 0, deletion error cost D(a), and insertion error cost 1(a), 0 , 6 6 Vp. Output: d(x,y). Method: Step 1. D(0,0) = 0. Step 2. Do i = l , n . D{i,0) = D(i - 1,0) + D(ai) Do j = l,m. D(0,j) = D{0,j - 1) + I(bj) Step 3. Do i = 1, n; do j = 1, m. el=D(i-l,j-l)+S(ai,bj) e2 = D(i-l,j) + D(ai) e3 = D(i,j-l)+I(bj) D(i,j) = m i n ( e i , e 2 , e 3 ) Step 4. d(x,y) = D(n,m). Exit.
We may consider context-deletion and context-insertion errors, then the deletion cost Del(a,b), deleting o in front of b or after b, and the insertion cost I(a,b), inserting b in front of a or after a, must be included. The rela tion between string distance and error probability has ever been presented in the detection of wavelets [58, 59]. For a given input string y and a given grammar G, we can find the minimum distance between y and z using parsing technique, where string z is in L(G). The parsing technique is called minimum-distance errorcorrecting parser (MDECP) (1, 41]. We use MDECP in the finite-state parsing, attributed string parsing and tree automaton.
Chapter 3
E R R O R - C O R R E C T I N G FINITE-STATE AUTOMATON FOR RECOGNITION OF RICKER WAVELETS
3.1.
SUMMARY
Syntactic pattern recognition techniques are applied to the analysis of onedimensional seismic traces for classification of Ricker wavelets. Seismic Ricker wavelets have structural information in the shape. Each Ricker wavelet can be represented by a string of symbols. In order to recognize the strings, we use the finite-state automaton for recognition of each string. Then each Ricker wavelet can be classified. The automaton can accept the strings with the substitution, insertion, and deletion errors of the symbols. There are two attributes, terminal symbol and weight, in each transition of the automaton. A minimum-cost error-correcting finite-state automaton is proposed to parse the input string. The recognition results of Ricker wavelets are quite encouraging.
3.2.
INTRODUCTION
Since seismic Ricker wavelets have structural information, it is natural to adopt a syntactic (structural) approach in seismic pattern analysis [58, 64, 65]. Each Ricker wavelet can be represented by a string of symbols (terminals, primitives). Each of these strings can then be recognized by the finite-state automaton allowing each Ricker wavelet to be classified. A block diagram of the classification system is shown in Fig. 3.1. The system includes a training (analysis) part and a recognition part. The 21
ERROR-CORRECTING
22
Input seismic signals
Location of wavelets
FINITE-STATE AUTOMATON .
String pattern representation Amplitude Segmen dependent tation encoding
Error-correcting finite-state parsing
Inference of finite-state grammar
Expanded >| grammar and automaton
Classification results
Recognition ' Training T Training wavelets
String pattern representation
Fig. 3.1. A classification system of seismic wavelets using error-correcting finite-state parsing.
training part establishes p a t t e r n representation and grammatical inference. T h e recognition p a r t includes location of waveforms, p a t t e r n representation, and error-correcting finite-state parsing. P a t t e r n representation performs p a t t e r n segmentation and primitive recognition to convert a Ricker wavelet into a string of primitives (symbols). Grammatical inference infers finitestate grammar from a set of training strings. T h e finite-state g r a m m a r is expanded to contain three types of error symbol transformations: dele tion, insertion, a n d substitution errors. T h e a u t o m a t o n can be constructed from error-correcting finite-state g r a m m a r . Then, the minimum-distance error-correcting finite-state a u t o m a t o n can perform syntactic parsing a n d classification of input Ricker wavelets.
3.3. 3.3.1.
SYNTACTIC PATTERN Training
and Testing
RECOGNITION
Ricker
Wavelets
T h e eight classes of zero-phase Ricker wavelets with different frequencies (15, 25, and 35 Hz) and maximum amplitudes (—0.30 to 0.30) are used in the classification a n d shown in Table 3.1. T h e 28 Ricker wavelets of eight classes with r a n d o m noise are generated in the seismic traces in Fig. 3.2. T h e sampling interval is 1 ms. T h e class of each Ricker wavelet in the seismic traces is shown in Table 3.2. Eight Ricker wavelets are chosen as the training wavelets and one for each class.
SYNTACTIC
PATTERN Table 3.1.
Ricker wavelet class
RECOGNITION
23
Selected Ricker wavelets in string representations.
Frequency (Hz)
Reflection coefficient
Training strings (corrupted by noise)
0.25 0.15 -0.15 -0.25 0.20 -0.20 0.30 -0.30
cccaacccaaccoAbcoCCBCCC AACCCBBC ccACoccbBAbaBAbccoBCCBaaBCCC CCBAaoCCCbbBCBbaCCbcoAcccoAccbab BCCCDCooCCAcaCCocbBacccbcccBacc ccccccccoBBBCCCCCC CCCCCCBBCCocccccccc dddddcbBCCDDD DDDCCCCbcccddd
1
15
2
15
3
15
4
15
5
25
6
25
7
35
8
35
Table 3.2.
Classes of Ricker wavelets in Fig. 3.2. Sample
Class
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28
5 6 1 5 1 8 1 7 3 4 2 1 7 8 8 8 5 6 7 5 4 2 3 6 3 7 1 3
ERROR-CORRECTING
FINITE-STATE AUTOMATON
1. OS4 T I SHIT)
3.2. Twenty-eight Ricker wavelets in the seismic traces.
,..
EXPANDED
3.3.2.
GRAMMARS
Location Pattern
25
of Waveforms and Representation
We use pick detection method to locate each Ricker wavelet. Then the pattern representation can transform each wavelet into a syntactic string of primitives. The time interval of the segmentation is 1 ms and each segment is assigned as a primitive. In order to classify the Ricker wavelets with different amplitudes, the amplitude-dependent encoding of the modified Freeman's chain code is used [58, 65]. The difference di = y%+i — yi of vertical coordinates of two adjacent data points, (ti,yi) and (£j + i,yj + i), is assigned as a terminal symbol according to the following: Wi = d,
if 0.05027 < di,
Wi
— c,
if 0.01497 < di < 0.05027,
Wi
= b,
if 0.00668 < di < 0.01497,
Wi
= a,
if 0.00199 < di < 0.00668,
Wi
= o,
if - 0.00199 < di < 0.00199,
Wi
= A
if - 0.00668 < di < -0.00199,
Wi
= B
if - 0.01497 < di < -0.00668,
Wi
= C
if - 0.05027 < di < -0.01497,
Wi
_ D
if di < -0.05027.
and
After primitive recognition, the string representations of the eight training Ricker wavelets are shown in Table 3.1.
3.4. 3.4.1.
EXPANDED GRAMMARS General
Expanded
Finite-State
Grammar
Because of noise and distortion problems, three types of error symbol trans formations may occur in the strings: substitution errors, insertion errors, and deletion errors. After the inference of finite-state grammar, the gram mar is expanded to include the three types of error production rules. This
26
ERROR-CORRECTING
FINITE-STATE AUTOMATON
...
type of grammar is called a general expanded finite-state grammar. The following steps are used to construct this grammar. (1) The original production forms of a finite-state grammar are A —► aB, or A -» a. Change A —> a to A —i aF, where F is the new nonterminal with ending terminal a. (2) The production forms added to account for substitution errors are A -» bB, where a^b. (3) The production forms added to account for insertion errors are A —> aA. (4) The production forms added to account for deletion errors are A —> XB, where A is the empty terminal. We can put weights on these production rules if we wish. The algorithm is as follows. Algorithm 3.1
Construction of general expanded finite-state grammar
Input: A finite-state grammar G = (Vpf, Vp, P, S). Output: The general expanded finite-state grammar G' = (yN, VT, P', S"), where P' is a set of weighted production rules. Method: (1) For each production rule in P with the form A —> a in the grammar G, change the rule to the form A —► aF, where F is a new nonterminal. (2) Set V± = VTU {A}, Vff=VNU {F}, S' = S. (3) Let the production rule set P' = P with a weight of zero for each original production rule. (4) For each nonterminal A in V^ add the production A -¥ XA (with the weight 0) to P'. (5) Substitution error production: For each production A —> aB in P do For each terminal b in Vp do If A -> bB is not in the P' then add the production A —► bB (with the weight 1) to P'. (6) Insertion error production: For each nonterminal A in VN do For each terminal a in Vr do If A -> aA is not in the P' then add the production A —>■ aA (with the weight 1) to P'.
EXPANDED
27
GRAMMARS
(7) Deletion error production: For each production A —» aB in P (A ^ B) do If A -» AS is not in the P ' then add the production A -» AP (with the weight 1) to P'. (8) Add the production P -» A (with weight 0) to P . (9) Output C . Equal unit weight is assigned to each error production rule in the algorithm. Different weights may be assigned in steps (5), (6) and (7). E x a m p l e 3.1 From the previous Example 2.5, given the training string abbe, the inferred general expanded finite-state grammar G'(V^,Vr',. P ' , S") using Algorithm 3.1 is as follows: Vif = {S,A,B,C,F}
Vi = {a,b,c,X}
S> = {S}
The production rule set P' is: (0)5 ->aA,0
13)
B-+aC,l
26) C -> aC, 1
(l)A^-bB,0
14) B ->• cC, 1
27) C -»■fcC,1
(2)B->bC,0
15) C -4 aP, 1
28) C -> cC, 1
(3) C -> cF, 0
16) C -»• 6F, 1
29) F -> aP, 1
(4) 5 -> AS, 0
17) S -> oS", 1
30) P -4 6P, 1
(5)A->AA,0
18) 5 -* 65,1
31) P -> cP, 1
( 6 ) P - > AP,0
19) 5 -4 c5,1
3 2 ) 5 ^ A.4,1
(7) C -»■ AC, 0
20)
A-^aA,l
33)A->AP,1
( 8 ) F - > AF,0
21) A ->• &A, 1
34) B -s- AC, 1
(9) S -»■ 64, 1
22)
35) C - > A F , 1
(10) 5 - + c . 4 , 1
23) P -4 aB, 1
(ll)A->aP,l
24) P -)■ bB, 1
(12) A-> cP, 1
25)P->cP,l
A-^cA,l
36)P->A,0
Production rules (9) to (16 handle the substitution errors, rules (17) to (31) handle the insertion errors, and rules (32) to (35) handle the deletion errors. The corresponding error-correcting finite-state automaton is shown in Fig. 3.3.
ERROR-CORRECTING
28
(c,1)
(c.1)
(G,1)
(c.1)
FINITE-STATE AUTOMATON
(c,1)
(C,1)
(c,1)
(CO)
...
(c,1)
Fig. 3.3- Transition d i a g r a m of t h e general e x p a n d e d g r a m m a r G' of E x a m p l e 3.1.
3.4.2.
Restricted
Expanded
Finite-State
Grammar
For insertion error, we can insert an error terminal symbol before or after some arbitrary terminal symbol. Then we can expand the form of the production rule A —> aB with restricted insertion error as follows:
or
A —»• bB\,
B\ -4- aB
A —> ai?2 ,
B2 —> bB
(Insert b in front of a), (Insert b after a).
The proposed algorithm is described in the following. A l g o r i t h m 3.2
Construction of restricted expanded finite-state grammar
Input: A finite-state grammar G = (Vjv, Vp, P, S). Output: The restricted expanded finite-state grammar G' = {Vpj, Vp, P', S'), where P' is a set of weighted production rules. Method: (1) For each production rule in P with the form A —¥ a in the grammar G, change the rule to the form A —► aF, where F is a new nonterminal. (2) Let P' = P with the weight 0 for each original production rule. (3) Substitution error production: For each production A —> aB in P do For each terminal b in Vr do
EXPANDED
(4)
(5)
(6) (7) (8) (9)
GRAMMARS
29
If A -s- bB is not in the P' then add the production A —> bB (with the weight 1) to P'. Insertion error production: For each production A —t aB in P do { Insert b in front of a } For each terminal b in V? do add the production A ->■ frBi (with the weight 1) to P', add the production Bi —>■ aB (with the weight 0) to P ' , and { Insert 6 after a } For each terminal b in V? do add the production A —>■ a P 2 (with the weight 0) to P ' , add the production Bi —> bB (with the weight 1) to P'. Deletion error production: For each production A -> aB in P (A ^ B) do If A -> AB is not in the P ' then add the production A —>■ AB (with the weight 1) to P ' . Set S' = S,Vj, = VTL> {A}, V^ = all the distinct nonterminal symbols inP' For each nonterminal A in V^ do add the production A —► AA (with the weight 0) to P ' . Add the production F ->• A (with weight 0) to P ' . Output G'.
E x a m p l e 3.2 From the Example 2.5, given the training string abbe, the inferred restricted expanded finite-state grammar G"(V^, V^,P",S") using Algorithm 3.2 is as follows: Vj} = {S,A,B,C,D,E,F,G,H,I,J,K,L}
V^ = {a,b,c,\}
The production rule set P" is: (0)S->aA,0
(21)P->aA,l
(42) C ->■ cK, 1
(1) A -» 6B, 0
(22) E->bA,l
(43) if -> cF, 0
(2) B -4 bC, 0
(23) P -> cA, 1
(44) C -> cL, 0
(3) C -4 cF, 0
(24) A ->• aG, 1
(45) !■ -4 aF, 1
(4) 5 -4 bA, 1
(25) A -4 bG, 1
(46) L -4 bF, 1
S" = {S}.
ERROR-CORRECTING
30
(5)
S-¥cA,l
(6) A-¥aB,l (7) A -¥ cB, 1 (8) B -> aC, 1 (9) £ -> cC, 1 (10) C->aF,l (11)C->6F,1 (12) 5 -> XA, 1 (13)4-+ A 5 , l (14) B -»• AC, 1 (15) C -»• AF, 1 (16) S -4 aZ>, 1 (17) S -► &D, 1 (18) 5 -}■ eD, 1 (19) D - > a.4,0 (20) 5 -> a F , 0
FINITE-STATE AUTOMATON .
26 A -* cG, 1 !27 G - + 6 5 , 0 '28; A ^ 6iJ, 0 29 H -*aB,l 30 H->bB,l !31 H^cB,l 32 B^-a.1,1 33 B^bl,l 34 B->cI,l 35 J -> 6C, 0 '36 B^-bJ,0 37 J ->■ aC, 1 38 J -4 6C, 1 39. J -!• CC, 1 40 C-+aK,l '41 C-+bK,l
(47) Z, ->• cF, 1 (48) 5 -+ AS,0 (49)4 -s- XA, 0 (50) 5 -> XB, 0 (51)C-> AC,0 (52) D -+ AD, 0 (53) £ -s- XE, 0 (54) F ->• AF, 0 (55)G->AG,0 (56) H->XH,0 (57)/(58) J (59) # (60) £ -
A/,0 • AJ,0 *Aif,0 >AZ,,0
(61)F->A,0
The corresponding error -correcting finite-state automaton is shown in Fig. 3.4.
Fig. 3.4. Transition diagram of the restricted expanded grammar G" of Example 3.2.
MINIMUM-DISTANCE ERROR-CORRECTING
3.5.
...
31
MINIMUM-DISTANCE E R R O R - C O R R E C T I N G FINITE-STATE PARSING
Input testing pattern strings can be analyzed by a finite-state automa ton that can accept strings derived by finite-state grammar. Given a finite-state grammar G, there exists a finite-state automaton M such that the set of strings accepted by M is equal to the set of strings L(G) derived by the finite-state grammar G. The automaton M can be represented by a finite-state transition diagram [41]. The production rule A —¥ aB in the finite-state grammar corresponds to the transition S(A, a) = B in the automaton. An input string can go from the initial state to the final state of the automaton if the string is accepted by the automaton. Here, each transition of the error-correcting finite-state automaton has two attributes, input terminal symbol and weight value. For the production rule A -4- aB with weight w, we use CAB{O) — w a s the cost representation in the transition of the automaton. We want to parse the input string from the initial state to the final state by minimum cost. The following algorithm is proposed to compute the minimum cost (distance) by using the dynamic programming technique [41]. Algorithm 3.3 Minimum cost of error-correcting finite-state parsing with two attributes in each transition Input:
An error-correcting finite-state automaton with n nodes num bered 1,2,... ,n, where node 1 is the initial state and node n is the final state. Two attributes, terminal symbol and its cost function Cij(a), for 1 < i, j < n, a € (Vp U {A}), with Cij(a) > 0, for all i and j . An input testing string S. Output: M\n the minimum cost of the path from node 1 to node n when the parsing sequence is equal to the terminal sequence of the input string S. Method: (1) M\\ = 0, Mij = maxint (a large number), 1 < j < n. (2) For 1 < j < n do Mij = min{Mifc + Ckj(X), l a, y(A) = y(a) to the inference grammar, if they are the new production rules.
This inferred grammar will generate excessive strings if we apply syn tactic rules only. However, we can use semantic rules (inherited attributes) to restrict the grammar so that no excessive strings are generated.
4.6.
MINIMUM-DISTANCE ERROR-CORRECTING EARLEY'S PARSING FOR A T T R I B U T E D STRING
A modified Earley's parsing algorithm is here developed for attributed context-free languages. Here, errors of insertion, deletion, and substitu tion transformation are all considered in the derivation of Earley's item lists. Let the attributed grammar G = (Vn,Vt,P, S) be a CFG (contextfree grammar), and let z = &1&2 • • • °n be an input string in Vt*. The form [A —> ct»P, x, y, i] is called an item for z if A —¥ a/3 is a production in P and 0 < i < n [2, 32]. The dot in a • (3 between a and f3 is a meta-symbol not in Vn or Vt, which represents the parsing position; x is a counter for local syntactic deformation which accumulates the total cost of substitution of terminal symbols. When A = S, y is used as a counter for global deforma tion which records the total cost of insertion and deletion errors. On the other hand, if A ^ S, then y is used as the synthesized attribute of A. The meaning of index i is the starting parsing position of the string, and it is the same pointer as the conventional item of Earley's parsing algorithm. The parsing algorithm for an input string z is shown in the following. Algorithm 4.2 Minimum-distance an attributed string
error-correcting Earley's parsing for
ATTRIBUTED
46
Input:
GRAMMAR AND
ERROR-CORRECTING...
An attributed grammar G= (Vn, Vt,P,S) and a test string z = b\ b2 ■ ■ ■ bn in Vt*
Output: The parse lists IQ, I\,...,In, and the decision whether or not z is accepted by the grammar G together with the syntactic and semantic deformation distances. Method: (1) Set j = 0 and add [S -> • a, 0,0,0] to Ij if S —> a is a production in P. (2) Repeat steps (3), (4) and (5) until no new items can be added to Ij. (3) If [B —> £ •, xi,yi,i] is in Ij,B ^ S, and (a) if [A —> a • B(3, x2,y2, k] is in h and A ^ S, then add item [A —> a,B • f3,xi +x2,yi +V2,k] to Ij. (b) if [S* —y a • Bfi, x2, y2, k] is in Ii, then add item [S -¥ aB • (3,x\ + x2,V2 + \L{B) -yi\,k] to Ij. (c) if [S —> a • Cf3,x2,y2,k] is in i,, C ^ B, then add item [S1 —^ a • C/3, x2,yi + y2, k] to Ij. (d) if [S —>• a •, a;2, j/2, A] is in Ii, then add item [S —>■ a •, x2, y\ + y2, k] to Ij. (4) If S —> ^ is a production in P, and if [A —> a • B(3, x, y, i] is in Ij, then add item [B -> • £ , 0,0, j] to Ij. (5) If [5 -4 a»B/3, x,y,i] is in i j , then add item [5 —>■ aB»/3, x,y + L(B),i) to i j . (6) If j" = n, go to step (8); otherwise increase j to j + 1. (7) For each item [A —>■ a • a/3, x, y, ij in i j - i , add item [A —J- a a • /3, x + ^(a, 6j), y + y(a),«] to Ij, where y{a) is the synthesized attribute of a. For simplicity, let y(a) = 1 for all a in VJ. S(a, bj) is the substitution cost, and S(a,a) = 0. Go to step (2). (8) If item [S —> a»,x,y,0] is in In, then string z is accepted by grammar G where x is the local deformation distance, and y is the global de formation distance; otherwise, string z is not accepted by grammar G. Exit. In the above algorithm, step (3b) handles the semantic insertion and deletion errors, steps (3c) and (3d) handle the syntactic insertion errors, step (5) handles the syntactic deletion errors, and step (7) handles the syn tactic substitution errors. It is possible for collision to occur in the process of developing a new item; i.e., the old item has already been in the list
EXPERIMENT
47
when a new item is to be put in the list. Under this situation, the one with the minimum distance (minimum of x + y) is selected for that item. Actually, collision may occur with only the items related to 5-productions, because insertion and deletion transformations are allowed for those items only. Since the error-correcting grammar is ambiguous, the time complexity is 0(n3), and the space complexity is 0(n2), where n is the length of the input string. The parsing is inefficient if the length of the input string is large. 4.7.
EXPERIMENT
Given a training string abbe. Using Algorithm 4.1, the inferred attributed grammar is shown in Fig. 4.2. An input test string aabc is parsed by Algorithm 4.2. The corresponding item lists are shown in Fig. 4.3. As we
Training string : abbe The attributed grammar G(Vn,Vt,P,S) is as follows : Vn = {S,A,B,C} Vt={a,b,c) S ={S} The production set P is as follows : Syntactic rules Semantic rules S-^ABC A->aA A—>a B->bB B->b C->cC C->c
Fig. 4.2.
L(A) = 1 L(B) = 2 U.Q = 1 y(A) = y(a)+y(A) y(A) = y(a) y(B)=y(b)+y(B) y(B)=y{b) y(Q = y(c) + y(Q y(Q = y(c)
Training string abbe and its inferred atributed grammar for Earley's parsing.
48
ATTRIBUTED GRAMMAR AND
ERROR-CORRECTING...
Test string aabc and its item lists I[0] contains [S-+ABC* [C->»c [C^fcC [S^>AB*C [B->»b [B^'bB [S->A»BC [A—>»a [A-^faA [S^'ABC
,0,4,0] ,0,0,0] ,0,0,0] ,0,3,0] ,0,0,0] ,0,0,0] ,0,1,0] ,0,0,0] ,0,0,0] ,0,0,0]
I [4] contains [C->cC* [B->bB» [A—»oA» [A-^aA* [B->bB»
[C->cO [C-»c [C-^'cC [B->»b [B->«6B [A->«a [A-^aA» [C->cO [B->kB« [S->»ABC [A-*aA* [S-+ABC* [S^>AB*C [S->A'BC [A-^>a»A [A->a»
Mil
I[3] contains [A->aA» ,1,3,0] [B^,bB» ,2,3,0] ,3,3,0] [C->cO ,0,0,3] [C-*»c ,0,0,3] [C^'cC ,0,0,3] [S-»£ [B-^'bB ,0,0,3] [A->»a ,0,0,3] [A—>*aA ,0,0,3] [C-^cC* ,2,2,1] [B^>bB* ,1,2,1] ,0,3,0] [S-fABC [A-)aA» ,1,2,1] ,1,0,0] [S-»Afl»C [S^ABC* ,1,1,0] ,0,2,0] [S^>A*BC [A-^a*A ,1,1,2] [A—>a* ,1,1,2] [B->b*B ,0,1,2] [B->6« ,0,1,2] [C->c»C ,1,1,2] [C->c» ,1,1,2]
I[l] contain, [C->»c [C^»cC [£->•* [B->.W? [A-»«a [A->*aA [S->ABC [S^AmBC [S^AB»C [S-+ABC* [A-*a*A [A-*a» [B^b'B [£->£■• [C-^cC [C^>c
I[2] contains ,0,0,1] ,0,0,1] ,0,0,1] ,0,0,1] ,0,0,1] ,0,0,1] ,0,1,0] ,0,0,0] ,0,2,0] ,0,3,0] ,0,1,0] ,0,1,0] ,1,1,0] ,1,1,0] ,1,1,0] ,1,1,0]
[C^'c [C->*cC [£->•& [B-**bB [A->»a [A^>*aA [C->cO [B-+bB» [S-*»ABC [A-^aA* [S^ABC* [S^>AB»C [S^A'BC [A-»a»A [i4-»a» [B-)b»B [B->6. [C->c*C [C-^c»
,0,0,2] ,0,0,2] ,0,0,2] ,0,0,2] ,0,0,2] ,0,0,2] ,2,2,0] ,2,2,0] ,0,2,0] ,0,2,0] ,1,2,0] ,1,1,0] ,0,1,0] ,0,1,1] ,0,1,1] ,1,1,1] ,1,1,1] ,1,1,1] ,1,1,1]
,3,4,0] ,3,4,0] ,2,4,0] ,2,3,1] ,2,3,1] ,2,3,1] ,0,0,4] ,0,0,4] ,0,0,4] ,0,0,4] ,0,0,4] ,0,0,4] ,1,2,2] ,1,2,2] ,0,4,0] ,2,2,2] ,1,0,0] ,1,1,0] ,0,3,0] ,1,1,3] ,1,1,3] ,0,1,3] ,0,1,3] ,1,1,3] ,1,1,31
Fig. 4.3. Item lists of the Earley's attributed parsing on the test string aabc.
EXPERIMENT
49
can see from the derived item lists, the three kinds of errors are considered. The corresponding items are generated for each possible error transforma tion. Because the item [S -> ABC •, 1,0,0] is in I± list, the string aabc is accepted with local syntactic deformation distance 1 and global deformation distance 0.
This page is intentionally left blank
Chapter 5
ATTRIBUTED G R A M M A R AND M A T C H P R I M I T I V E MEASURE ( M P M ) FOR R E C O G N I T I O N OF SEISMIC WAVELETS
5.1.
SUMMARY
The computation of the match primitive measure between two attributed strings using dynamic programming is proposed. The MPM parsing algorithm for an attributed string can handle three types of error. The MPM parsing algorithm is obtained from the computation between the in put string and the string generated by the attributed grammar. The MPM parsing is more efficient than the modified Earley's parsing. The recognition criterion of the MPM parsing algorithm is "maximum-matching". The Earley's parsing and MPM parsing methods are applied to the recognition of seismic Ricker wavelets and the recognition of wavelets in real seismic data.
5.2. SIMILARITY MEASURE OF ATTRIBUTED STRING MATCHING

Although the modified Earley's parsing algorithm considers all three types of errors, the parsing is inefficient. Here, the parsing of an attributed string using the match primitive measure (MPM) is proposed. The similarity measure between two attributed strings is defined and discussed in the following.

The match primitive measure (MPM) is defined as the maximum number of matched primitives between two strings. The computation of the MPM between two length-attributed strings can be implemented by the dynamic programming technique on grid nodes, as shown in Fig. 5.1. Three attributes (f, h, v) are associated with each node. Let a be an attributed string, where a[i] denotes the ith primitive of a, and a[i].s and a[i].y denote the syntactic symbol and length attribute of a[i], respectively. Let (i, j) indicate a position in the grid. f[i,j] is the MPM value from point (0,0) to (i,j), i.e., the MPM value between the attributed substrings (a[1].s, a[1].y)(a[2].s, a[2].y)...(a[i].s, a[i].y) and (b[1].s, b[1].y)(b[2].s, b[2].y)...(b[j].s, b[j].y) of the attributed strings a and b. h[i,j] and v[i,j] are the residual length attributes of the primitives a[i] and b[j], respectively, after this match. The partial MPM f[i,j] can be computed from the partial MPMs f[i-1,j] and f[i,j-1], as shown in Fig. 5.1. The following algorithm computes the MPM between two attributed strings.
Fig. 5.1. Partial MPM f[i,j] computed from f[i,j-1] and f[i-1,j].
Algorithm 5.1. Computation of the match primitive measure (MPM) between two attributed strings.

Input: Two attributed strings a and b, where
  a = (a[1].s, a[1].y)(a[2].s, a[2].y) ... (a[m].s, a[m].y)
  b = (b[1].s, b[1].y)(b[2].s, b[2].y) ... (b[n].s, b[n].y)
and m, n are the numbers of primitives of a and b, respectively.
Output: The maximum MPM S(a, b).
Method:
(1) f[0,0] := 0; h[0,0] := 0; v[0,0] := 0;
(2) for i := 1 to m do begin f[i,0] := 0; h[i,0] := a[i].y; v[i,0] := 0; end;
(3) for j := 1 to n do begin f[0,j] := 0; h[0,j] := 0; v[0,j] := b[j].y; end;
(4) for i := 1 to m do
      for j := 1 to n do begin
        node1 := hmove(i, j);
        node2 := vmove(i, j);
        if node1.f > node2.f then node[i,j] := node1 else node[i,j] := node2;
      end;
(5) Output S(a,b) := f[m,n] / sqrt(y1 * y2), where y1 = sum over i of a[i].y and y2 = sum over j of b[j].y.

The functions hmove and vmove are written as follows:

function hmove(i, j): node_type;  {node(i-1, j) -> node(i, j)}
begin
  if a[i].s <> b[j].s then dl := 0
  else dl := min(v[i-1, j], a[i].y);
  hmove.f := f[i-1, j] + dl;
  hmove.h := a[i].y - dl;
  hmove.v := v[i-1, j] - dl;
  return(hmove);
end;

function vmove(i, j): node_type;  {node(i, j-1) -> node(i, j)}
begin
  if a[i].s <> b[j].s then dl := 0
  else dl := min(h[i, j-1], b[j].y);
  vmove.f := f[i, j-1] + dl;
  vmove.h := h[i, j-1] - dl;
  vmove.v := b[j].y - dl;
  return(vmove);
end;

In the above algorithm, two functions are used. Function hmove computes the attributes (f, h, v) at node (i, j) from node (i-1, j); function vmove computes them from node (i, j-1). An example of the MPM computation for two attributed strings is shown in Table 5.1; the normalized MPM value is calculated.

Table 5.1. Calculation of the MPM between two attributed strings a and b, where a = (a,3)(d,1)(c,5)(h,2)(a,3)(g,2) and b = (a,4)(c,2)(d,2)(c,4)(h,1)(g,3). Each cell lists the node attributes (f, h, v).
           i=0        (a,3)      (d,1)      (c,5)      (h,2)      (a,3)      (g,2)
 j=0     (0,0,0)    (0,3,0)    (0,1,0)    (0,5,0)    (0,2,0)    (0,3,0)    (0,2,0)
 (a,4)   (0,0,4)    (3,0,1)    (3,1,1)    (3,5,1)    (3,2,1)    (4,2,0)    (4,2,0)
 (c,2)   (0,0,2)    (3,0,2)    (3,1,2)    (5,3,0)    (5,2,0)    (5,3,0)    (5,2,0)
 (d,2)   (0,0,2)    (3,0,2)    (4,0,1)    (5,3,2)    (5,2,2)    (5,3,2)    (5,2,2)
 (c,4)   (0,0,4)    (3,0,4)    (4,0,4)    (8,0,1)    (8,2,1)    (8,3,1)    (8,2,1)
 (h,1)   (0,0,1)    (3,0,1)    (4,0,1)    (8,0,1)    (9,1,0)    (9,3,0)    (9,2,0)
 (g,3)   (0,0,3)    (3,0,3)    (4,0,3)    (8,0,3)    (9,1,3)    (9,3,3)    (11,0,1)

S(a, b) = 11/√(16 × 16) = 0.6875
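To make the computation concrete, the following is a direct transcription of Algorithm 5.1 into Python; the function name, the list-of-pairs data layout, and the tie-breaking toward vmove are our reading of the pseudocode, not code from the book.

import math

def mpm(a, b):
    # Match primitive measure (Algorithm 5.1) between two attributed
    # strings, each given as a list of (symbol, length) pairs.
    m, n = len(a), len(b)
    # node[i][j] = (f, h, v): partial MPM and the residual lengths
    # of primitives a[i] and b[j].
    node = [[(0, 0, 0)] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):          # step (2): boundary column
        node[i][0] = (0, a[i - 1][1], 0)
    for j in range(1, n + 1):          # step (3): boundary row
        node[0][j] = (0, 0, b[j - 1][1])
    for i in range(1, m + 1):          # step (4)
        for j in range(1, n + 1):
            match = a[i - 1][0] == b[j - 1][0]
            f, h, v = node[i - 1][j]   # hmove: from node (i-1, j)
            d = min(v, a[i - 1][1]) if match else 0
            hmove = (f + d, a[i - 1][1] - d, v - d)
            f, h, v = node[i][j - 1]   # vmove: from node (i, j-1)
            d = min(h, b[j - 1][1]) if match else 0
            vmove = (f + d, h - d, b[j - 1][1] - d)
            node[i][j] = hmove if hmove[0] > vmove[0] else vmove
    y1 = sum(y for _, y in a)          # step (5): normalization
    y2 = sum(y for _, y in b)
    return node[m][n][0] / math.sqrt(y1 * y2)

# The strings of Table 5.1:
a = [('a', 3), ('d', 1), ('c', 5), ('h', 2), ('a', 3), ('g', 2)]
b = [('a', 4), ('c', 2), ('d', 2), ('c', 4), ('h', 1), ('g', 3)]
print(mpm(a, b))  # 0.6875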
5.3. INFERENCE OF ATTRIBUTED GRAMMAR
For the parsing of an attributed string using the property of the MPM, the attributed grammar for the training strings is inferred first. The inference procedure, similar to Algorithm 4.1, is described below.

Algorithm 5.2. Inference of attributed grammar.

Input: A set of training strings.
Output: An attributed grammar.
Method:
(1) Convert each input string to an attributed string by merging identical adjacent primitives.
(2) For each input attributed string a1 a2 a3 ... ak, add to the grammar the production S -> A1 A2 A3 ... Ak, where Ai is the nonterminal corresponding to terminal ai, with the semantic rules L(Ai) = yi, 1 <= i <= k, where yi is the length attribute of primitive ai.
(3) For each primitive a, add the production rule A -> a with the semantic rule y(A) = y(a), where y(a) = y is the length attribute of primitive a.

An example is shown in Fig. 5.2.
Training string: abbc. The attributed grammar G = (Vn, Vt, P, S) is as follows: Vn = {A, B, C, S}, Vt = {a, b, c}, and S is the start symbol. The production set P is as follows:

Syntactic rules        Semantic rules
S -> ABC               L(A) = 1, L(B) = 2, L(C) = 1
A -> a                 y(A) = y(a), y(a) = 1
B -> b                 y(B) = y(b), y(b) = 2
C -> c                 y(C) = y(c), y(c) = 1

Fig. 5.2. Training string abbc and its inferred attributed grammar for the MPM parsing.
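A minimal Python sketch of Algorithm 5.2, assuming strings of single-character primitives; the tuple-based grammar representation and the use of the uppercased terminal as its nonterminal are illustrative choices, not the book's notation.

def infer_attributed_grammar(training_strings):
    s_productions = []    # rules S -> A1 A2 ... Ak with attributes L(Ai)
    primitive_rules = {}  # rules A -> a with attribute y(a)
    for s in training_strings:
        # step (1): merge identical adjacent primitives into (symbol, length)
        attributed = []
        for sym in s:
            if attributed and attributed[-1][0] == sym:
                attributed[-1] = (sym, attributed[-1][1] + 1)
            else:
                attributed.append((sym, 1))
        # step (2): one S-production per training string, L(Ai) = yi
        s_productions.append([(sym.upper(), y) for sym, y in attributed])
        # step (3): one primitive rule A -> a per terminal, y(a) = y
        for sym, y in attributed:
            primitive_rules[sym.upper()] = (sym, y)
    return s_productions, primitive_rules

# For "abbc" this yields S -> ABC with L(A) = 1, L(B) = 2, L(C) = 1,
# and the primitive rules A -> a, B -> b, C -> c, as in Fig. 5.2.
print(infer_attributed_grammar(["abbc"]))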
5.4. TOP-DOWN PARSING USING MPM
Given an attributed grammar G and an input attributed string z, the value of the MPM between z and L(G), the language generated by the grammar G, is calculated. Consider an S-production rule in the grammar, which has the form

S -> A1 A2 ... Am.

For each nonterminal on the right-hand side of an S-production rule, two attributes are associated with it: f[k] denotes the MPM value calculated from the beginning up to the parse of the kth nonterminal, and h[k] is a residual attribute used in the later calculation. The proposed algorithm to compute the MPM between z and L(G) is described in the following.

Algorithm 5.3. Top-down parsing using the MPM.

Input: An attributed grammar G = (Vn, Vt, P, S) and an input string z. Let m = the number of primitives in z, and n = the length of z = the sum over j of z[j].y.
Output: The maximum MPM between z and L(G).
Method:
(1) Set N = the number of S-production rules, and max_MPM = 0.
(2) Set f[0] = 0 and h[0] = 0.
(3) For all 1 <= k <= N, do steps (4) to (10).
(4) Apply the kth S-production rule, of the form Sk -> Ak,1 Ak,2 ... Ak,mk, where mk is the number of nonterminals on the right-hand side of the kth S-production rule, to do steps (5) to (8).
(5) For all 1 <= i <= mk, do { f[i] = 0; h[i] = L(Ak,i); }.
(6) For all 1 <= j <= m, do steps (7) and (8).
(7) Set v0 = z[j].y and v = v0.
(8) For all 1 <= i <= mk, apply the production rule Ak,i -> ak,i:
    (a) if z[j].s = ak,i then dl = min(y(ak,i), v) else dl = 0;
        f1 = f[i-1] + dl; h1 = y(ak,i) - dl; v1 = v - dl;
    (b) if z[j].s = ak,i then dl = min(h[i], v0) else dl = 0;
        f2 = f[i] + dl; h2 = h[i] - dl; v2 = v0 - dl;
    if f1 > f2 then { f[i] = f1; h[i] = h1; v = v1; }
    else { f[i] = f2; h[i] = h2; v = v2; }
(9) Compute MPM = f[mk]/sqrt(n * nk), where nk = the sum over i of L(Ak,i) is the length of the string generated by the kth S-production rule.
(10) If MPM > max_MPM, then max_MPM = MPM.
(11) Output max_MPM.

Here the normalized MPM is calculated. Algorithm 5.3 is obtained from the comparison between the input string and the string generated by each S-production rule.

Example 5.1. The training string abbc and its inferred attributed grammar are shown in Fig. 5.2. One input string aabc has been tested, and the parsing result is shown in Table 5.2. The MPM value is 0.75 after normalization.
Table 5.2. Parsing result for the test string aabc; the MPM value is 0.75.

Test string: aabc = (a, 2)(b, 1)(c, 1)

f[i] after scanning z[j]:

 j\i     1    2    3
 1       1    1    1
 2       1    2    2
 3       1    2    3

max_MPM = 3;  normalized MPM = 3/√(4 × 4) = 0.75
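A Python sketch of Algorithm 5.3; the grammar layout (a list of S-production right-hand sides plus a dictionary of primitive rules, as produced by the sketch of Algorithm 5.2 above) is illustrative, and the normalization implements step (9).

import math

def topdown_mpm(s_productions, primitive_rules, z):
    n = sum(y for _, y in z)                   # length of z
    max_mpm = 0.0
    for rhs in s_productions:                  # steps (3)-(4)
        mk = len(rhs)
        f = [0] * (mk + 1)
        h = [0] + [L for _, L in rhs]          # step (5): h[i] = L(Ak,i)
        for sym, v0 in z:                      # steps (6)-(7)
            v = v0
            for i in range(1, mk + 1):         # step (8)
                A = rhs[i - 1][0]
                a, ya = primitive_rules[A]     # rule Ak,i -> ak,i
                d1 = min(ya, v) if sym == a else 0       # (a)
                f1, h1, v1 = f[i - 1] + d1, ya - d1, v - d1
                d2 = min(h[i], v0) if sym == a else 0    # (b)
                f2, h2, v2 = f[i] + d2, h[i] - d2, v0 - d2
                if f1 > f2:
                    f[i], h[i], v = f1, h1, v1
                else:
                    f[i], h[i], v = f2, h2, v2
        nk = sum(L for _, L in rhs)            # generated string length
        max_mpm = max(max_mpm, f[mk] / math.sqrt(n * nk))  # steps (9)-(10)
    return max_mpm

# Grammar of Fig. 5.2 and the test string aabc of Example 5.1:
s_prods = [[('A', 1), ('B', 2), ('C', 1)]]
prims = {'A': ('a', 1), 'B': ('b', 2), 'C': ('c', 1)}
print(topdown_mpm(s_prods, prims, [('a', 2), ('b', 1), ('c', 1)]))  # 0.75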
5.5. EXPERIMENTS OF SEISMIC PATTERN RECOGNITION
5.5.1. Recognition of Seismic Ricker Wavelets
Since seismic wavelets have structural information, it is natural to adopt a syntactic approach in seismic pattern analysis [58, 59, 62, 64-66]. The Earley parsing proposed in Chapter 4 and the MPM parsing of this chapter are used for the classification of seismic Ricker wavelets. A block diagram of the classification system is shown in Fig. 4.1. Twenty-eight zero-phase Ricker wavelets of eight classes with Gaussian noise are generated in the simulated seismic traces in Fig. 3.2. Eight class wavelets with different frequencies and reflection coefficients are selected as the training patterns and listed in Table 3.1. Each wavelet in a seismic trace and its corresponding class are listed in Table 3.2. The signal of each seismic trace is converted into a syntactic string of primitives. The sampling interval is 1 ms, and each segment is assigned a primitive using the modified Freeman chain code. Since one-dimensional seismic data are processed, nine primitives with rightward direction are defined in Chapter 3. The strings of the eight training Ricker wavelets corrupted by noise are listed in Table 3.1. In the Earley parsing approach, the attributed grammar for each class is inferred from the eight training strings by using Algorithm 4.1. The recognition rate and average CPU time of the modified Earley's parsing (Algorithm 4.2) are listed in Table 5.3. The computer is a VAX 11/780. The parsing speed is poor due to the inefficiency of Algorithm 4.2; however, the recognition rate is good.

Table 5.3. The average parsing time and percentage of correct classification of the classifier using the attributed error-correcting Earley parser (Algorithm 4.2).

Average CPU time for one string (sec)    Percentage of correct classification
1,974.8                                  89.29% (25/28)
In the MPM parsing approach, the same set of test data is used again. The attributed grammar is inferred by using Algorithm 5.2. Four cases with different weightings of errors are listed in Table 5.4.
Table 5.4. The four different cases and the descriptions for each case.
Case no.    Case description

1    Normalized MPM. The number of substitution errors is estimated by use of an upper bound; the number of insertion and deletion errors is estimated by use of a lower bound.

2    Nonequal weights are assigned to the insertion, deletion, and substitution transformations: weight(insertion) = 2.0, weight(deletion) = 2.0, weight(substitution) = 1.0. The number of substitution errors is estimated by use of a lower bound; the number of insertion and deletion errors is estimated by use of an upper bound.

3    Nonequal weights are assigned to the insertion, deletion, and substitution transformations: weight(insertion) = 2.0, weight(deletion) = 2.0, weight(substitution) = 1.0. The numbers of substitution errors and of insertion and deletion errors are estimated by use of the averages of the upper and lower bounds.

4    Nonequal weights are assigned to the insertion, deletion, and substitution transformations: weight(insertion) = 2.0, weight(deletion) = 2.0, weight(substitution) = 1.0.
Table 5.5. The average parsing time and percentage of correct classification of the attributed-grammar parser using Algorithm 5.3 for four different cases.

Case no.    Average CPU time for one string (sec)    Percentage of correct classification
1           0.244                                    75.00% (21/28)
2           0.244                                    82.14% (23/28)
3           0.243                                    60.71% (17/28)
4           0.242                                    75.00% (21/28)
The recognition rate and the CPU time of the MPM parsing using Algorithm 5.3 are listed in Table 5.5. The parsing speed is fast; nevertheless, the recognition rate is not better than that of Earley's parsing. If the method of estimating distance is used, the recognition rate is improved, especially for Case 2. In Case 2, some of the insertion or deletion errors are counted as substitution errors. In Case 3, all of the substitution errors are considered as insertion and deletion errors. Due to the effect of noise, substitution errors occur much more easily than insertion and deletion errors. Insertion and deletion errors are more important than substitution errors in our study.
5.5.2. Recognition of Wavelets in a Real Seismogram
A real seismogram from the Mississippi Canyon is studied for the classification of wavelets and is shown in Fig. 5.3. The selected training traces are the 6th, 14th, 22nd, 30th, 38th, 46th, 54th, and 62nd traces. The wavelets are extracted through the process of peak detection and wavelet determination. The training samples are clustered into six classes by using the hierarchical dendrogram.
Fig. 5.3. Real seismogram at Mississippi Canyon.
Fig. 5.4(a). The detected waveforms of the 1st class in the seismogram from Mississippi Canyon.
Fig. 5.4(b). The detected waveforms of the 2nd class in the seismogram from Mississippi Canyon.
Fig. 5.4(c). The detected waveforms of the 3rd class in the seismogram from Mississippi Canyon.
Fig. 5.4(d). The detected waveforms of the 4th class in the seismogram from Mississippi Canyon.
Fig. 5.4(e). The detected waveforms of the 5th class in the seismogram from Mississippi Canyon.
Fig. 5.4(f). The detected waveforms of the 6th class in the seismogram from Mississippi Canyon.
Because of the better recognition rate in the above analysis, the modified Earley's parsing is chosen in this process. After the attributed parsing, the classification results are shown in Figs. 5.4(a)-5.4(f).

5.6. CONCLUSIONS
Using an attribute, it is possible to reduce the length of a pattern string and the size of the inferred attributed grammar. Owing to the repetition of primitives in the string representation of wavelets, attributed strings and grammars are a better way to describe the wavelets. The computation of the match primitive measure between two attributed strings using dynamic programming has been proposed. The MPM parsing algorithm has been obtained by modifying the comparison between two attributed strings into the comparison between an input string and the strings generated by the attributed grammar. The recognition criterion of the modified Earley's algorithm is minimum-distance, whereas that of the MPM parsing algorithm is maximum-matching. The parsing methods have been applied to the recognition of seismic Ricker wavelets and to the recognition of wavelets in real seismic data. The recognition results can improve seismic interpretation.
Chapter 6
STRING DISTANCE AND LIKELIHOOD RATIO TEST FOR DETECTION OF CANDIDATE BRIGHT SPOT
6.1. SUMMARY

Syntactic pattern recognition techniques are applied to the classification of wavelets in seismograms. The system for one-dimensional seismic analysis includes a likelihood ratio test, optimal amplitude-dependent encoding, the probability of detecting the signal involved in the global and local detection, plus minimum-distance and nearest-neighbor classification rules. A relation between error probability and Levenshtein distance is proposed.

6.2. INTRODUCTION
In a seismogram, the wavelets of a bright spot have a specific structure, so syntactic pattern recognition is proposed to detect the candidate bright spot trace by trace. A block diagram of the 1-D syntactic pattern recognition system for the detection of candidate bright spots is shown in Fig. 6.1. The characteristic of the 1-D syntactic approach is string matching in the seismic trace. For the detection of candidate bright spots, testing traces are selected from the input seismogram, and tree classification techniques are used in the detection of candidate bright spots [56, 57]. From the detected candidate bright spot, the sample patterns of the wavelets are extracted. Amplitude-dependent encodings of optimal quantization are used. The global detection is to detect the possible wavelets. The Levenshtein distance [75] is computed between the possible wavelet string and the extracted strings of bright spots.
Testing: input seismogram -> optimal quantization encoding -> global detection (extract possible wavelets) -> string distance computation -> thresholding to detect candidate bright spots -> classification result.
Training: tree classification to bright spot patterns -> optimal quantization -> encoded patterns of bright spots.

Fig. 6.1. A block diagram of the syntactic pattern recognition system for the detection of candidate bright spots in a seismogram.
The local detection is to extract the candidate wavelet. Using the probability of detection, a threshold is set to detect the candidate bright spot. The system is used to detect the candidate bright spot, trace by trace, in the real seismograms of Mississippi Canyon and High Island.
6.3. OPTIMAL QUANTIZATION ENCODING
Initially, the seismic trace is encoded by amplitude-dependent encoding. Assign the ith pair of waveform points [(xi, yi), (xi+1, yi+1)] to the symbol wi denoting the slope characteristic of the line segment joining the two points, and let di = yi+1 - yi. For the amplitude-dependent encoding, the assignment of di = yi+1 - yi to a symbol is a quantization problem. The optimal quantization of 8 levels for Gaussian samples is used [47]. From the experiments described here, if the standard deviation σs of di from the signal is larger than 1.5σn of di from noise, then the 8-level optimal quantization is good. The pattern primitives are defined by the quantization intervals, e.g., wi = d for di > 1.76σs.
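As a concrete illustration of the encoding, here is a small Python sketch. The decision boundaries ±0.5σs, ±1.05σs, ±1.76σs follow the 8-level optimal Gaussian quantizer used in the text, and σs is inferred from ±0.5σs = ±0.0268365 given in Sec. 6.6.2; however, the assignment of intervals to the symbols a-d (positive slopes) and A-D (negative slopes) is our assumption.

SIGMA_S = 0.053673  # assumed: 0.5 * sigma_s = 0.0268365 (Sec. 6.6.2)

def encode(trace, sigma=SIGMA_S):
    # Amplitude-dependent encoding of a sampled trace into primitives.
    bounds = [0.5 * sigma, 1.05 * sigma, 1.76 * sigma]
    pos, neg = "abcd", "ABCD"  # hypothetical interval-to-symbol mapping
    symbols = []
    for y0, y1 in zip(trace, trace[1:]):
        d = y1 - y0
        level = sum(abs(d) > b for b in bounds)  # 0..3
        symbols.append(pos[level] if d >= 0 else neg[level])
    return "".join(symbols)

print(encode([0.0, 0.10, 0.09, -0.02]))  # 'dAD'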
Fig. 6.3(a). Tree classification result of bright spots.

From the likelihood ratio test, the signal is decided if r > 0.0240482 or r < -0.0240482.

6.6.2. Threshold for Global Detection
For an input seismic string, the wavelets must be detected and extracted. Because the detection is based on the Levenshtein distance between strings, the detected signal is defined as the possible wavelets, and the accompanying detection is defined as the global detection. Comparing the 8-level optimal quantization with the LRT in Fig. 6.3(b), the levels closest to β = ±0.0240482 are the end points at ±0.0268365 = ±0.5σs. Therefore, ±0.0268365 = ±0.5σs are selected as the new threshold. The areas above 0.0268365 and below -0.0268365 in Fig. 6.3(b) are the detected areas of the signal, i.e., the intervals of b, c, d, B, C, and D. The probability of detecting b, c, d, B, C, or D is 0.617. For an input string of 8 symbols, 8 × 0.617 = 4.936. A threshold is therefore set: if the number of symbols in {b, c, d, B, C, D} is equal to or larger than 5, then a possible wavelet is detected; otherwise, the next 8-symbol string is input.
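A sketch of this global detection rule in Python; the non-overlapping 8-symbol stride follows the wording "input the next 8-symbol string", and the names are illustrative.

DETECTED = set("bcdBCD")  # intervals above +/-0.5 sigma_s

def global_detection(symbols, window=8, k=5):
    # A possible wavelet is detected when at least k of the 8 symbols
    # fall in the detected intervals.
    hits = []
    for start in range(0, len(symbols) - window + 1, window):
        w = symbols[start:start + window]
        if sum(s in DETECTED for s in w) >= k:
            hits.append((start, w))
    return hits

print(global_detection("aAABBdcbAAaa"))  # [(0, 'aAABBdcb')]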
6.6.3. Threshold for the Detection of Candidate Bright Spot
The extracted string of the 21st trace is ABBAAbdc. Using the 8-level optimal quantization, the value di is quantized to the value ci, the conditional mean of its quantization interval [47]. Assume that Gaussian noise is added to the quantized value ci; then the probability of detection within the quantized interval can be calculated from Fig. 6.3(c). The extracted string of the 21st trace has 8 symbols, each with its quantized value ci. For the 8 quantized values ci of the candidate bright spot, the probability of detecting each ci can be calculated from the statistical table of the normal distribution. The extracted string of the 21st trace has 3 A, 2 B, 1 b, 1 c, and 1 d. From Fig. 6.3(c), the sum of these 8 detection probabilities is 3 × 0.66 + 2 × 0.704 + 1 × 0.704 + 1 × 0.7115 + 1 × 0.9319 = 5.7354. Truncating 5.7354 gives 5, so the approximate number of detected symbols is 5, and the number of missing and false symbols, i.e., error or undetected symbols, is 8 - 5 = 3.
The input string belongs to the strings of candidate bright spots if the Levenshtein distance is less than 3 symbols, so 3 is selected as the threshold. Suppose that x1, x2, x3, and x4 are the candidate bright spot strings on the 13th, 21st, 29th, and 37th traces. If dL(x1, y) < k1, or dL(x2, y) < k2, or dL(x3, y) < k3, or dL(x4, y) < k4, then y is a detected candidate bright spot string. The threshold values k1, k2, k3, and k4 are 3 in this experiment. The classification result is shown in Fig. 6.4.
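The Levenshtein distance used here is the standard three-operation edit distance; the following compact Python version of the distance and of the multi-reference detection rule is a sketch, with illustrative names and example strings.

def levenshtein(s, t):
    # Minimum number of substitutions, insertions, and deletions
    # transforming s into t.
    prev = list(range(len(t) + 1))
    for i, cs in enumerate(s, 1):
        cur = [i]
        for j, ct in enumerate(t, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (cs != ct)))   # substitution
        prev = cur
    return prev[-1]

def is_candidate_bright_spot(y, references, k=3):
    # y is detected if it is within distance k of any reference string.
    return any(levenshtein(x, y) < k for x in references)

# e.g., with the extracted string of the 21st trace as a reference:
print(is_candidate_bright_spot("ABBAABdc", ["ABBAAbdc"]))  # True (distance 1)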
Fig. 6.4. 1-D classification result at Mississippi Canyon.

6.7. EXPERIMENT AT HIGH ISLAND
Similarly, the real seismogram at High Island is shown in Fig. 6.5, and the classification result is shown in Fig. 6.6.
Fig. 6.5. Real seismogram at High Island (negative on the right).

Fig. 6.6. 1-D classification result at High Island.
Chapter 7
TREE GRAMMAR AND AUTOMATON FOR SEISMIC PATTERN RECOGNITION
7.1. SUMMARY

In a number of synthetic seismograms, there may exist certain structures in shape. In order to recognize seismic patterns and improve seismic interpretation, we use the method of the tree automaton. We show that the seismic bright spot pattern can be represented as a tree, so we can use the tree automaton in the recognition of seismic patterns. The tree automaton system includes two parts. In the training part, the training seismic patterns of known classes are constructed into their corresponding tree representations. Trees can infer tree grammars; several tree grammars are combined into one unified tree grammar, and the tree grammar generates the error-correcting tree automaton. In the recognition part, each input testing seismogram passes through preprocessing, pattern extraction, and tree representation of the seismic pattern. Each input tree is then parsed and recognized by the error-correcting tree automaton. Several fundamental algorithms for tree construction of the seismic pattern and tree grammar inference are proposed in this study. The method is applied to the recognition of bright spot, pinch-out, flat spot, gradual sea-level fall, and gradual sea-level rise patterns.
7.2. INTRODUCTION

Tree grammars and the corresponding recognizers, tree automata, have been successfully used in many applications, for example English character
recognition, LANDSAT data interpretation, fingerprint recognition, classification of bubble chamber photographs, and texture analysis [41, 78, 85, 89, 116]. Fu pointed out that "by the extension of one-dimensional concatenation to multidimensional concatenation, strings are generalized to trees" [41]. Compared with other high-dimensional pattern grammars (web grammar, array grammar, graph grammar, plex grammar, shape grammar, etc. [41]), a tree grammar describes a pattern easily and conveniently using the tree data structure, especially in tree traversal and in the substitution, insertion, and deletion of a tree node. The tree automaton system is shown in Fig. 7.1.
Recognition: input testing seismogram -> preprocessing ((1) envelope, (2) thresholding, (3) compression, (4) thinning) -> pattern representation ((1) pattern extraction, (2) primitive recognition, (3) tree construction) -> error-correcting tree automata -> classification results.
Training: training seismic patterns -> pattern representation ((1) pattern extraction, (2) primitive recognition) -> tree grammar inference.

Fig. 7.1. A tree automaton system for seismic pattern recognition.
In the recognition part of the system, each input testing seismogram passes through preprocessing and tree representation of the seismic pattern. The preprocessing includes envelope processing [35, 56, 57, 104], thresholding, compression in the vertical time-axis direction, and thinning [63, 67, 107, 117]. Tree representation of a seismic pattern includes the extraction of seismic patterns, primitive recognition, and construction of the tree representation, so a seismic pattern can be constructed as a tree. The tree is then parsed by the error-correcting tree automaton into the correct class. Three kinds of tree automaton are adopted in the recognition: the weighted minimum-distance structure-preserved error-correcting tree automaton
(SPECTA), the modified maximum-likelihood SPECTA, and the minimum-distance generalized error-correcting tree automaton (GECTA). We have made some modifications to the methods of the weighted minimum-distance SPECTA and the maximum-likelihood SPECTA. We show that the seismic bright spot pattern can be represented as a tree, so the tree automaton can be used in the recognition of seismic patterns.
7.3. TREE GRAMMAR AND LANGUAGE
A tree domain (tree structure) is shown below [41]. Each node has its ordering position index and is filled with a terminal symbol; 0 is the root index of the tree.

                 0
          /      |      \
        0.1     0.2     0.3  ...
       /   \
   0.1.1   0.1.2  ...
    /   \
0.1.1.1  0.1.1.2
Each node has its own children, except the bottom leaves of the tree. The number of children of a node is called the rank of the node. Although there are different kinds of tree grammars [41], we use the expansive tree grammar in this study because of the following theorem.

Theorem 7.1. For each regular tree grammar Gt, one can effectively construct an equivalent expansive grammar G't, i.e., L(G't) = L(Gt) [41].

An expansive tree grammar is a four-tuple Gt = (V, r, P, S), where V = VN ∪ VT, VN is the set of nonterminal symbols, VT is the set of terminal symbols, S is the starting nonterminal symbol, and
r is the rank of a terminal symbol, i.e., the number of children of the tree node. Each tree production rule in P is of one of the forms

(1)  X0 -> x                   (2)  X0 -> x
            / | ... \
          X1 X2     Xr(x)

where x ∈ VT and X0, X1, X2, ..., Xr(x) ∈ VN. For convenience, the tree production rule (1) can be written as X0 -> x X1 X2 ... Xr(x) [41]. From the starting symbol S, a tree is derived by using the tree production rules of P: S -> a0 => a1 => ... => am = a. The tree language generated by Gt is defined as L(Gt) = {a | a is a tree and S =>* a in Gt}, where =>* represents several derivation steps using the tree production rules of Gt.
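As a small illustration of such top-down derivation, the following Python sketch derives trees from expansive productions; the grammar encoded here is that of Example 7.1 below, and the dictionary representation and depth control are our own devices.

import random

# Expansive productions: nonterminal -> list of (terminal, children).
PRODUCTIONS = {
    'S': [('$', ['A', 'B'])],
    'A': [('a', ['A', 'B']), ('a', [])],
    'B': [('b', ['A', 'B']), ('b', [])],
}

def derive(nt='S', depth=2):
    # Top-down derivation; prefer leaf rules once the depth is used up.
    rules = PRODUCTIONS[nt]
    if depth <= 0:
        rules = [r for r in rules if not r[1]] or rules
    sym, children = random.choice(rules)
    return (sym, [derive(c, depth - 1) for c in children])

print(derive())  # e.g. ('$', [('a', []), ('b', [('a', []), ('b', [])])])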
7.4. TREE AUTOMATON
The bottom-up replacement functions of the tree automaton are generated from the expansive tree production rules of the tree grammar. The expansive tree grammar is Gt = (V, r, P, S), and the tree automaton is Mt = (Q, f, S), where Q is the set of states, f is the set of replacement functions, and S becomes the final state. If the tree production rule

X0 -> x
       / | ... \
     X1 X2     Xn

is in P, then the corresponding bottom-up replacement function of the tree automaton can be written as

fx(X1, X2, ..., Xn) -> X0.
A tree automaton is an automatic machine for recognizing trees; its bottom-up replacement functions run in the reverse direction of the tree production rules. A tree grammar derives a tree by forward, top-down derivation, whereas the tree automaton performs a backward replacement of states from the bottom to the root of the tree. If the final replacement state
is in the set of final states, then the tree is accepted by the automaton of that class; otherwise, the tree is rejected.

Example 7.1. Consider the following tree grammar Gt = (V, r, P, S), where V = {S, A, B, $, a, b}, VT = {$, a, b}, r(a) = {2, 0}, r(b) = {2, 0}, r($) = 2, and P:
(2) A -» a
I\ A
(3) B -+ b
I\
B
A
(4) A-Ki
(5) B -> b
I \
B
A
b
can generate the patterns, for example, using productions (1), (4), and (5),
/ \ a b or using productions (1), (2), (3), (4), (5), (4), and (5). $
t
^W
w r
Vi
r
1 a 1 \ a b
\ b 1\ a b
f
The tree automaton which accepts the set of trees generated by Gt is Mt = (Q, fa, fb, f$,F), where Q = {qA,qB, qs}, F = {qs}, and / : fa = qA, fb = qB, fa(qA,
qB) = qA, h(qA, qs) = qB, / $ ( « u , qB) = qs ■
Example 7.2. The following tree grammar can be used to generate trees representing L-C networks.
Gt = (V, r, P, S), where V = {S, A, B, D, E, $, Vin, L, C, W}, VT = {$, Vin, L, C, W}, r($) = 2, r(Vin) = 1, r(L) = {1, 2}, r(C) = 1, r(W) = 0, and P:

(1)  S -> $          (2)  A -> Vin        (3)  B -> L
           / \                  |                    / \
          A   B                 E                   D   B

(4)  B -> L          (5)  D -> C          (6)  E -> W
           |                    |
           D                    E
For example, after applying productions (1), (2), (3), (6), (5), (4), (6), (5), and (6), the following tree is generated:

         $
       /   \
    Vin     L
      |    / \
      W   C   L
          |   |
          W   C
              |
              W
The tree automaton which accepts the set of trees generated by Gt is Mt = (Q, fW, fC, fL, fVin, f$, F), where Q = {qE, qD, qB, qA, qS}, F = {qS}, and f:

fW( ) = qE,   fC(qE) = qD,   fL(qD) = qB,   fL(qD, qB) = qB,   fVin(qE) = qA,   f$(qA, qB) = qS.

Example 7.3. Tree representation and tree grammar of the seismic bright spot pattern.
(A) Bright spot pattern: The seismogram of the primary reflection of a bright spot is generated from a geologic model [27, 28, 93]. The geologic model is shown in Fig. 7.2 and the seismogram in Fig. 7.3. After envelope processing, thresholding, compression in the vertical direction, and thinning, the bright spot pattern can be shown below. We scan the pattern from left to right, then from top to bottom. The segments (branches) can be extracted in
Fig. 7.2. Geologic model (upper layer: density 2.0 g/cm³, velocity 2.0 km/s; lower layer: density 2.5 g/cm³, velocity 2.59 km/s).

[Fig. 7.3: the seismogram generated from the geologic model, stations 10-50.]