COMBINATORIAL GROUP TESTING AND ITS APPLICATIONS

Author:
Dingzhu Du; Frank Hwang



SERIES ON APPLIED MATHEMATICS Editor-in-Chief: Frank Hwang Associate Editors-in-Chief: Zhong-ci Shi and Kunio Tanabe

Vol. 1

International Conference on Scientific Computation eds. T. Chan and Z.-C. Shi

Vol. 2

Network Optimization Problems: Algorithms, Applications and Complexity eds. D.-Z. Du and P. M. Pardalos

Vol. 3

Combinatorial Group Testing and Its Applications by D.-Z. Du and F. K. Hwang

Vol. 4

Computation of Differential Equations and Dynamical Systems eds. K. Feng and Z.-C. Shi

Vol. 5

Numerical Mathematics eds. Z.-C. Shi and T. Ushijima

Series on Applied Mathematics Volume 3

Ding-Zhu Du, Department of Computer Science, University of Minnesota, and Institute of Applied Mathematics, Academia Sinica, Beijing

Frank K. Hwang, AT&T Bell Laboratories, Murray Hill

World Scientific

Singapore • New Jersey • London • Hong Kong

Published by World Scientific Publishing Co. Pte. Ltd. P O Box 128, Farrer Road, Singapore 9128 USA office: Suite 1B, 1060 Main Street, River Edge, NJ 07661 UK office: 73 Lynton Mead, Totteridge, London N20 8DH

Library of Congress Cataloging-in-Publication Data Du, Dingzhu. Combinatorial group testing and its applications / Ding-Zhu Du, Frank K. Hwang. p. cm. — (Series on applied mathematics; vol. 3) Includes bibliographical references and index. ISBN 9810212933 1. Combinatorial group theory. I. Hwang, Frank. II. Title. III. Series: Series on applied mathematics v. 3. QA182.5.D8 1993 512'.2-dc20 93-26812 CIP

Copyright © 1993 by World Scientific Publishing Co. Pte. Ltd. All rights reserved. This book, or parts thereof, may not be reproduced in any form or by any means, electronic or mechanical, including photocopying, recording or any information storage and retrieval system now known or to be invented, without written permission from the Publisher. For photocopying of material in this volume, please pay a copying fee through the Copyright Clearance Center, Inc., 27 Congress Street, Salem, MA 01970, USA.

Printed in Singapore by JBW Printers & Binders Pte. Ltd.

Preface

Group testing has been around for fifty years. It started as an idea to do large scale blood testing economically. When such needs subsided, group testing stayed dormant for many years until it was revived by needs for new industrial testing. Later, group testing also emerged from many nontesting situations, such as experimental designs, multiaccess communication, coding theory, clone library screening, nonlinear optimization, computational complexity, etc. With a potential world-wide outbreak of AIDS, group testing just might go the full cycle and become an effective tool in blood testing again. Another fertile area for application is testing zonal environmental pollution.

Group testing literature can be generally divided into two types, probabilistic and combinatorial. In the former, a probability model is used to describe the distribution of defectives, and the goal is to minimize the expected number of tests. In the latter, a deterministic model is used and the goal is usually to minimize the number of tests under a worst-case scenario. While both types are important, we will focus on the second type in this book because of the different flavors of these two types of results.

To find optimal algorithms for combinatorial group testing is difficult, and there are not many optimal results in the existing literature. In fact, the computational complexity of combinatorial group testing has not been determined. We suspect that the general problem is hard in some complexity class, but do not know which class. (It is known that the problem belongs to the class PSPACE, but it seems not to be PSPACE-complete.) The difficulty is that the input consists of two or more integers, which is too simple for complexity analysis. However, even if a proof of hardness is eventually given, this does not spell the end of the subject, since the subject has many, many branches, each posing a different set of challenging problems.

This book is not only the first attempt to collect all theory and applications about combinatorial group testing in one place, but it also carries the personal perspective of the authors, who have worked on this subject for a quarter of a century. We hope that this book will provide a forum and focus for further research on this subject, and also be a source for references and publications.

Finally, we thank E. Barillot, A.T. Borchers, R.V. Book, G.J. Chang, F.R.K. Chung, A.G. Dyachkov, D. Kelley, K.-I Ko, M. Parnes, D. Raghavarao, M. Ruszinko, V.V. Rykov, J. Spencer, M. Sobel, U. Vaccaro, and A.C. Yao for giving us encouragement and helpful discussions at various stages of the formation of this book. Of course, the oversights and errors are our sole responsibility.


Contents

Preface

Chapter 1 Introduction
1.1 The History of Group Testing
1.2 The Binary Tree Representation of a Group Testing Algorithm and the Information Lower Bound
1.3 The Structure of Group Testing
1.4 Number of Group Testing Algorithms
1.5 A Prototype Problem and Some Basic Inequalities
1.6 Variations of the Prototype Problem
References

Chapter 2 General Algorithms
2.1 Li's s-Stage Algorithm
2.2 Hwang's Generalized Binary Splitting Algorithm
2.3 The Nested Class
2.4 (d, n) Algorithms and Merging Algorithms
2.5 Some Practical Considerations
2.6 An Application to Clone Screenings
References

Chapter 3 Algorithms for Special Cases
3.1 Two Disjoint Sets Each Containing Exactly One Defective
3.2 An Application to Locating Electrical Shorts
3.3 The 2-Defective Case
3.4 The 3-Defective Case
3.5 When is Individual Testing Minimax?
3.6 Identifying a Single Defective with Parallel Tests
References

Chapter 4 Nonadaptive Algorithms and Binary Superimposed Codes
4.1 The Matrix Representation
4.2 Basic Relations and Bounds
4.3 Constant Weight Matrices and Random Codes
4.4 General Constructions
4.5 Special Constructions
References

Chapter 5 Multiaccess Channels and Extensions
5.1 Multiaccess Channels
5.2 Nonadaptive Algorithms
5.3 Two Variations
5.4 The k-Channel
5.5 Quantitative Channels
References

Chapter 6 Some Other Group Testing Models
6.1 Symmetric Group Testing
6.2 Some Additive Models
6.3 A Maximum Model
6.4 Some Models for d = 2
References

Chapter 7 Competitive Group Testing
7.1 The First Competitiveness
7.2 Bisecting
7.3 Doubling
7.4 Jumping
7.5 The Second Competitiveness
7.6 Digging
7.7 Tight Bound
References

Chapter 8 Unreliable Tests
8.1 Ulam's Problem
8.2 General Lower and Upper Bounds
8.3 Linearly Bounded Lies (1)
8.4 The Chip Game
8.5 Linearly Bounded Lies (2)
8.6 Other Restrictions on Lies
References

Chapter 9 Optimal Search in One Variable
9.1 Midpoint Strategy
9.2 Fibonacci Search
9.3 Minimum Root Identification
References

Chapter 10 Unbounded Search
10.1 Introduction
10.2 Bentley-Yao Algorithms
10.3 Search with Lies
10.4 Unbounded Fibonacci Search
References

Chapter 11 Group Testing on Graphs
11.1 On Bipartite Graphs
11.2 On Graphs
11.3 On Hypergraphs
11.4 On Trees
11.5 Other Constraints
References

Chapter 12 Membership Problems
12.1 Examples
12.2 Polyhedral Membership
12.3 Boolean Formulas and Decision Trees
12.4 Recognition of Graph Properties
References

Chapter 13 Complexity Issues
13.1 General Notions
13.2 The Prototype Problem is in PSPACE
13.3 Consistency
13.4 Determinacy
13.5 On Sample Space S(n)
13.6 Learning by Examples
References

Index

1 Introduction

Group testing has been around for fifty years. While group testing literature traditionally employs probabilistic models, the combinatorial model has carved out its own share and become an important part of the literature. Furthermore, combinatorial group testing has formed ties with many computer science subjects: complexity theory, computational geometry and learning models, among others. It has also been used in multiaccess communication and coding.

1.1 The History of Group Testing

Unlike many other mathematical problems which can be traced back to earlier centuries and divergent sources, the origin of group testing is pretty much pinned down to a fairly recent event, World War II, and is usually credited to a single person, Robert Dorfman. The following is his recollection after 50 years (quoted from a November 17, 1992 letter in response to our inquiry about the role of Rosenblatt):

"The date was 1942 or early '43. The place was Washington, DC in the offices of the Price Statistics Branch of the Research Division of the office of Price Administration, where David Rosenblatt and I were both working. The offices were located in a temporary building that consisted of long wings, chock-full of desks without partitions. The drabness of life in those wings was relieved by occasional bull sessions. Group testing was first conceived in one of them, in which David Rosenblatt and I participated. Being economists, we were all struck by the wastefulness of subjecting blood samples from millions of draftees to identical analyses in order to detect a few thousand cases of syphilis. Someone (who?) suggested that it might be economical to pool the blood samples, and the idea was batted back and forth. There was lively give-and-take and some persiflage. I don't recall how explicitly the problem was formulated there. What is clear is that I took the idea seriously enough so that in the next few days I formulated the underlying probability problem and worked through the algebra (which is pretty elementary). Shortly after, I wrote it up, presented it at a meeting of the Washington Statistical Association, and submitted the four-page note that was published in the Annals of Mathematical Statistics. By the time the note was published, Rosenblatt and I were both overseas and out of contact."

We also quote from an October 18, 1992 letter from David Rosenblatt to Milton Sobel which provides a different perspective.

"It is now (Fall of 1992) over fifty year ago when I first invented and propounded the concepts and procedure for what is now called "group testing" in Statistics. I expounded it in the Spring of 1942 the day after I reported for induction during World War II and underwent blood sampling for the Wasserman test. I expounded it before a group of fellow statisticians in the division of Research of the Office of Price Administration in Washington, D.C. Among my auditors that morning was my then colleague Robert Dorfman."

Considering that fifty years have lapsed between the event and the recollections, we find the two accounts reasonably close to each other. Whatever discrepancies there are, they are certainly within the normal boundary of differences associated with human memory. Thus that "someone" (in Dorfman's letter) who first suggested pooling blood samples could very well be David Rosenblatt. It is also undisputed that Dorfman alone wrote the seminal report [1], published in the Notes Section of the journal Annals of Mathematical Statistics, which gave a method intended to be used by the United States Public Health Service and the Selective Service System to weed out all syphilitic men called up for induction. We quote from [1]:

"Under this program each prospective inductee is subjected to a 'Wasserman-type' blood test. The test may be divided conveniently into two parts: 1. A sample of blood is drawn from the man, 2.
The blood sample is subjected to a laboratory analysis which reveals the presence or absence of "syphilitic antigen." The presence of syphilitic antigen is a good indication of infection. When this procedure is used, n chemical analyses are required in order to detect all infected members of a population of size n. The germ of the proposed technique is revealed by the following possibility. Suppose that after the individual blood sera are drawn they are pooled in groups of, say, five and that the groups rather than the individual sera are subjected to chemical analysis. If none of the five sera contributing to the pool contains syphilitic antigen, the pool will not contain it either and will test negative. If, however, one or more of the sera contain syphilitic antigen, the pool will also contain it and the group test will reveal its presence (the author inserted a note here saying that diagnostic tests for syphilis are extremely sensitive and will show positive results for even great dilutions of antigen). The individuals making up the pool must then be retested to determine which of the members are infected. It is not necessary to draw a new blood sample for this purpose since sufficient blood for both the test and the retest can be taken at once. The chemical analysis requires only small quantities of blood."

Figure 1.1: The idea of group testing. (If no one in the pool contains syphilitic antigen, then many tests are saved. If one or more of the sera in the pool contain syphilitic antigen, then a test may be wasted.)

Unfortunately, this very promising idea of grouping blood samples for syphilis screening was not actually put to use. The main reason, communicated to us by C. Eisenhart, was that the test was no longer accurate when as few as eight or nine samples were pooled. Nevertheless, test accuracy could have improved over the years, or might not be a serious problem in screening for another disease. Therefore we quoted the above from Dorfman at length not only because of its historical significance, but also because, in this age of a potential AIDS epidemic, Dorfman's clear account of applying group testing to screen syphilitic individuals may have new impact on the medical world and the health service sector.


Dorfman's blood testing problem found its way into a very popular textbook on probability as an exercise, Feller's 1950 book "An Introduction to Probability Theory and Its Applications, Vol. I" [2], and thus might have survived as a coffee-break talking point in academic circles in those days. But by and large, with the conclusion of the Second World War and the release of millions of inductees, the need for group testing disappeared from the Selective Service and the academic world was ready to bury it as a war episode. The only exception was a short note published by Sterrett [9] in 1951 based on his Ph.D. dissertation at the University of Pittsburgh. Then Sobel and Groll [8], two Bell Laboratories scientists, gave the phrase "group testing" new meaning by giving the subject a very thorough treatment and breaking much new ground for future studies in their 74-page paper. Again, they were motivated by a practical need, this time from the industrial sector, to remove all leakers from a set of n devices. We quote from Sobel and Groll:

"One chemical apparatus is available and the devices are tested by putting x of them (where $1 \le x \le n$) in a bell jar and testing whether any of the gas used in constructing the devices has leaked out into the bell jar. It is assumed that the presence of gas in the bell jar indicates only that there is at least one leaker and that the amount of gas gives no indication of the number of leakers."

Sobel and Groll also mentioned other industrial applications such as testing condensers and resistors. The main idea is very well demonstrated by the Christmas tree lighting problem. A batch of light bulbs is electrically arranged in series and tested by applying a voltage across the whole batch or any subset thereof. If the lights are on, then the whole tested subset of bulbs must be good; if the lights are off, then at least one bulb in the subset is defective. Call the set of defectives among the n items the defective set.
Dorfman, as well as Sobel and Groll, studied group testing under probabilistic models, namely, a probability distribution is attached to the defective set and the goal is to minimize the expected number of tests required to identify the defective set. Katona [5] first emphasized the combinatorial aspects of group testing. However, his coverage was predominantly for the case of a single defective and he considered probability distributions of defectives. In this volume a more restrictive viewpoint on combinatorial group testing (CGT) is taken by completely eliminating probability distributions on defectives. The presumed knowledge on the defective set is that it must be a member, called a sample, of a given family called a sample space. For example, the sample space can consist of all d-subsets of the n items, when the presumed knowledge is that there are exactly d defectives among the n items. This sample space is denoted by S(d,n). An item is said to be defective (or good) in a sample if it is in (or not in) the sample. The goal in CGT is to minimize the number of tests under the worst scenario. A best algorithm under this goal is called a minimax algorithm. The reason that probabilistic models are excluded is not that they are less important or less interesting, but simply that there is so much to tell about group testing and this is a natural way to divide the material.

Li [6] was the first to study CGT. He was concerned with the situation where industrial and scientific experiments are conducted only to determine which of the variables are important. Usually, only a relatively small number of critical variables exists among a large group of candidates. These critical variables are assumed to have effects too large to be masked by the experimental error, or by the combined effect of the unimportant variables. Interpreting each variable as an item, each critical variable as a defective, and each experiment as a group test, a large effect from an experiment indicates the existence of a critical variable among the variables covered by the experiment. Li assumed that there are exactly d critical variables to start with, and set out to minimize the worst-case number of tests.

Since Li, CGT has been studied alongside probabilistic group testing (PGT) for those classical applications in medical, industrial and statistical fields. Recently, CGT has also been studied in complexity theory, graph theory, learning models, communication channels and fault tolerant computing. While it is very encouraging to see a wide interest in group testing, one unfortunate consequence is that the results obtained are fragmented and submerged in the jargon of the particular fields. This book is an attempt to give a unified and coherent account of up-to-date results in combinatorial group testing.

1.2 The Binary Tree Representation of a Group Testing Algorithm and the Information Lower Bound

A binary tree can be inductively defined as a node, called the root, with its two disjoint binary trees, called the left and right subtree of the root, either both empty or both nonempty. Nodes occurring in the two subtrees are called descendants of the root (the two immediate descendants are called children), and all nodes having a given node as a descendant are ancestors (the immediate ancestor is called a parent). Two children of the same parent are siblings. Nodes which have no descendants are called leaves, and all other nodes are called internal nodes. The path length of a node is the number of that node's ancestors. A node is also said to be at level $l$ if its path length is $l - 1$. The depth of a binary tree is the maximal level over all leaves.

Let S denote the sample space. Then a group testing algorithm T for S can be represented by a binary tree, also denoted by T, by the following rules:

(i) Each internal node u is associated with a test t(u); its two links are associated with the two outcomes of t(u) (we will always designate the negative outcome by the left link). The test history H(u) of a node u is the set of tests and outcomes associated with the nodes and links on the path of u.

(ii) Each node u is also associated with an event S(u) which consists of all the members of S consistent with H(u). $|S(v)| \le 1$ for each leaf v.

Figure 1.2: The binary tree representation. (Legend: test node, sample node, 0 = negative outcome, 1 = positive outcome.)

Since the test t(u) simply splits S(u) into two disjoint subsets, every member of S must appear in one and only one S(v) for some leaf v. An algorithm is called reasonable if no test whose outcome is predictable is allowed. For a reasonable algorithm each S(u) is split into two nonempty subsets. Thus $|S(v)| = 1$ for every leaf v, i.e., there exists a one-to-one mapping between S and the leaf-set. Let p(s) denote the path of the leaf v associated with the sample s and let $|p(s)|$ denote its length. Then

$$M_T(S) = \max_{s \in S} |p(s)|$$

is the depth of T. Since the same subset will not be tested more than once in a reasonable algorithm, the number of tests, and consequently the number of reasonable algorithms, is finite. Therefore we can define

$$M(S) = \min_{T} M_T(S).$$

An algorithm which achieves M(S) is called a minimax algorithm for S. The goal of CGT is to find a minimax algorithm for a given S; usually, we settle for a good heuristic. The determination of M(S) and obtaining good bounds for it are challenging problems and unsettled for most S.

Let $\lceil x \rceil$ ($\lfloor x \rfloor$) denote the smallest (largest) integer not less (greater) than x. Also let $\log x$ mean $\log_2 x$ throughout unless otherwise specified.

Theorem 1.2.1 $M(S) \ge \lceil \log |S| \rceil$.


Proof. Consider a group testing algorithm T for the problem S. At each internal node u of T, the test t(u) splits S(u) into two disjoint subsets according to whether the outcome is negative or positive. The cardinality of one of the two subsets must be at least half of $|S(u)|$. Therefore $M(S) \ge \lceil \log |S| \rceil$. □

We will refer to the bound in Theorem 1.2.1 as the information lower bound.

Lemma 1.2.2 $\sum_{s \in S} 2^{-|p(s)|} = 1$ for every reasonable algorithm.

Proof. True for $|S| = 1$. A straightforward induction on $|S|$ proves the general case. □

Theorem 1.2.1 and Lemma 1.2.2 are well known in the binary tree literature. The proofs are included here for completeness.

A group testing algorithm T for the sample space S is called admissible if there does not exist another algorithm T' for S such that

$$|p(s)| \ge |p'(s)| \quad \text{for all } s \in S$$

with at least one strict inequality true, where p'(s) is the path of s in T'.

Theorem 1.2.3 A group testing algorithm is admissible if and only if it is reasonable.

Proof. That "reasonable" is necessary is obvious. To show sufficiency, suppose to the contrary that T is an inadmissible reasonable algorithm, i.e., there exists T' such that

$$|p(s)| \ge |p'(s)| \quad \text{for all } s \in S$$

with at least one strict inequality true. Then

$$\sum_{s \in S} 2^{-|p'(s)|} > \sum_{s \in S} 2^{-|p(s)|} = 1,$$

a contradiction to Lemma 1.2.2. □

From now on only admissible, or reasonable, algorithms are considered in the volume. The word "algorithm" implies an admissible algorithm.
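As a small illustration of the information lower bound and of the leaf-sum identity for reasonable algorithms, the following Python sketch evaluates both on a toy tree. The nested-tuple tree encoding, the function names, and the three-item example are assumptions made for this illustration only; they do not come from the text.

```python
from fractions import Fraction


def information_lower_bound(num_samples):
    """Return ceil(log2(num_samples)), the information lower bound."""
    # For an integer m >= 1, (m - 1).bit_length() equals ceil(log2(m)) exactly.
    return (num_samples - 1).bit_length()


def leaf_depths(tree, depth=0):
    """List the path length (number of tests) of every leaf.

    Assumed encoding: a leaf is a string naming the identified sample; an
    internal node is a tuple (tested_group, negative_subtree, positive_subtree).
    """
    if isinstance(tree, str):
        return [depth]
    _group, negative, positive = tree
    return leaf_depths(negative, depth + 1) + leaf_depths(positive, depth + 1)


# Hypothetical example: items a, b, c with exactly one defective.
# Test {a}; if the outcome is negative, test {b}.
tree = ({"a"}, ({"b"}, "c is defective", "b is defective"), "a is defective")

depths = leaf_depths(tree)
kraft_sum = sum(Fraction(1, 2 ** d) for d in depths)
print("path lengths:", depths)
print("sum of 2^(-|p(s)|):", kraft_sum)
print("depth:", max(depths), "lower bound:", information_lower_bound(len(depths)))
```

In this toy case the tree is reasonable, the sum over leaves equals 1, and the depth meets the lower bound, so the algorithm is minimax for that sample space.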

1.3 The Structure of Group Testing

The information lower bound is usually not achievable. For example, consider a set of six items containing exactly two defectives. Then $\lceil \log |S| \rceil = \lceil \log \binom{6}{2} \rceil = 4$. If a subset of one item is tested, the split is 10 (negative) and 5 (positive); it is 6 and 9 for a subset of two, 3 and 12 for a subset of three, and 1 and 14 for a subset of four. In every case the larger part contains at least 9 samples, so by Theorem 1.2.1 at least $\lceil \log 9 \rceil = 4$ more tests are required after the first one, for a total of at least five.
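The split sizes in this example can be recomputed with a short Python sketch. It assumes the sample space S(d, n) of all d-subsets of n items and a first test on k items; the function names are illustrative, not the book's notation.

```python
from math import comb


def ceil_log2(m):
    """ceil(log2(m)) for an integer m >= 1, computed without floating point."""
    return (m - 1).bit_length()


def split_sizes(n, d, k):
    """(negative, positive) split of S(d, n) when a group of k items is tested."""
    negative = comb(n - k, d)          # all d defectives lie outside the group
    positive = comb(n, d) - negative   # at least one defective is in the group
    return negative, positive


def bound_after_first_test(n, d):
    """Lower bound on the number of tests obtained by examining the first test."""
    best = None
    for k in range(1, n):
        negative, positive = split_sizes(n, d, k)
        if negative == 0 or positive == 0:
            continue  # the outcome is predictable, so the test is not reasonable
        bound = 1 + ceil_log2(max(negative, positive))
        best = bound if best is None else min(best, bound)
    return best


for k in range(1, 5):
    print(k, split_sizes(6, 2, k))
print("information lower bound:", ceil_log2(comb(6, 2)))
print("bound after looking at the first test:", bound_after_first_test(6, 2))
```

The second bound exceeds the information lower bound by one because no group test can split the 15 samples into parts of size at most 8.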


The reason that the information lower bound cannot be achieved in general for group testing is that the split of S(u) at an internal node u is not arbitrary, but must be realizable by a group test. Therefore it is of importance to study which types of splitting are permissible in group testing. The rest of this section reports work done by Hwang, Lin and Mallows [4].

While a group testing algorithm certainly performs the tests in the order starting from the root of the tree and proceeding to the leaves, the analysis is often more convenient if started from the leaves (that is the way the Huffman tree, a minimum weighted binary tree, is constructed). Thus instead of asking what splits are permissible, the question becomes: for two children nodes x, y of u, what types of S(x) and S(y) are permitted to merge into S(u)?

Let N denote a set of n items, D the defective set and $S_0 = \{D_1, \ldots, D_k\}$ the initial sample space. Without loss of generality assume $\bigcup_{i=1}^{k} D_i = N$, for any item not in $\bigcup_{i=1}^{k} D_i$ can immediately be identified as good and deleted from N. A subset $S_1$ of $S_0$ is said to be realizable if there exists a group testing tree T for $S_0$ and a node u of T such that $S(u) = S_1$. A partition $\pi = (S_1, \ldots, S_m)$ of $S_0$ is said to be realizable if there exists a group testing tree T for $S_0$ and a set of m nodes $(u_1, \ldots, u_m)$ of T such that $S(u_i) = S_i$ for $1 \le i \le m$.

Define $\|S\| = \bigcup_{D_i \in S} D_i$, $\overline{\|S\|} = N \setminus \|S\|$, the complement of $\|S\|$, and $\bar{S} = \{A \mid A \subseteq \|S\|,\ A \supseteq \text{some } D_i \in S\}$, the closure of S. Furthermore, let $\pi = (S_1, S_2, \ldots, S_m)$ be a partition of $S_0$, i.e., $S_i \cap S_j = \emptyset$ for all $i \ne j$, and $\bigcup_{S_i \in \pi} S_i = S_0$. Then $S_i$ and $S_j$ are said to be separable if there exists an $I \subseteq N$ such that $I \cap D = \emptyset$ for all $D \in S_i$, and $I \cap D \ne \emptyset$ for all $D \in S_j$ (or vice versa).

Given $\pi$, define a directed graph $G_\pi$ by taking each $S_i$ as a node, and a directed edge from $S_i$ to $S_j$ ($S_i \to S_j$), $i \ne j$, if and only if there exist $A_i \in \bar{S}_i$, $A_j \in \bar{S}_j$ such that $A_i \subseteq A_j$.

Theorem 1.3.1 The following statements are equivalent:
(i) $S_i$ and $S_j$ are not separable.
(ii) $S_i \to S_j \to S_i$ in $G_\pi$.
(iii) $\bar{S}_i \cap \bar{S}_j \ne \emptyset$.

Proof. By showing (i) ⇒ (ii) ⇒ (iii) ⇒ (i).

(i) ⇒ (ii): It is first shown that if $S_i \not\to S_j$, then $I = \overline{\|S_j\|} \cap \|S_i\| \ne \emptyset$ and I separates $S_i$, $S_j$. Clearly, if $I = \emptyset$, then $\|S_i\| \subseteq \|S_j\|$; since $\|S_i\| \in \bar{S}_i$ and $\|S_j\| \in \bar{S}_j$, it follows that $S_i \to S_j$, a contradiction. Thus $I \ne \emptyset$. Also, $I \cap D = \emptyset$ for all $D \in S_j$ since $I \subseteq \overline{\|S_j\|}$. Furthermore, if there is a $D \in S_i$ such that $I \cap D = \emptyset$, then from the definition of I, $\overline{\|S_j\|} \cap \|S_i\| \cap D = \overline{\|S_j\|} \cap D = \emptyset$ and hence $D \subseteq \|S_j\| \in \bar{S}_j$. This implies $S_i \to S_j$, again a contradiction. Therefore $I \cap D \ne \emptyset$ for all $D \in S_i$ and I separates $S_i$, $S_j$. The proof is similar if $S_j \not\to S_i$.

(ii) ⇒ (iii): Suppose $S_i \to S_j \to S_i$. Let $A_i, A_i' \in \bar{S}_i$ and $A_j, A_j' \in \bar{S}_j$ be such that $A_i \subseteq A_j$ and $A_i' \supseteq A_j'$. Furthermore, let $u = \|S_i\| \cap \|S_j\|$. Then $u \ne \emptyset$. Since $A_i \subseteq A_j \subseteq \|S_j\|$, $A_i \subseteq \|S_i\| \cap \|S_j\| = u \subseteq \|S_i\|$ and thus $u \in \bar{S}_i$. A similar argument shows $u \in \bar{S}_j$. Therefore $\bar{S}_i \cap \bar{S}_j \ne \emptyset$.

(iii) ⇒ (i): Suppose $w \in \bar{S}_i \cap \bar{S}_j$ ($w \ne \emptyset$) but $S_i$ and $S_j$ are separable. Let $I \subseteq N$ be a subset such that $I \cap D_i \ne \emptyset$ for all $D_i \in S_i$ and $I \cap D_j = \emptyset$ for all $D_j \in S_j$. But $w \in \bar{S}_i$ implies that there exists a $D_i \in S_i$ such that $D_i \subseteq w$, hence $I \cap w \supseteq I \cap D_i \ne \emptyset$. On the other hand, $w \in \bar{S}_j$ implies $w \subseteq \|S_j\|$, which implies $I \cap w = \emptyset$, a contradiction. Therefore, $S_i$ and $S_j$ are not separable. □
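The three equivalent conditions can be checked by brute force on small families. The Python sketch below assumes that samples are nonempty frozensets, that the closure of a family S is the set of all A lying between some sample of S and the union of S, and that edges of the graph are defined through closures. All function names and both example families are hypothetical.

```python
from itertools import combinations


def powerset(items):
    """All subsets of a finite collection, as frozensets."""
    items = list(items)
    return [frozenset(c) for r in range(len(items) + 1)
            for c in combinations(items, r)]


def support(family):
    """The union of all samples in the family."""
    return frozenset().union(*family)


def closure(family):
    """All A with D <= A <= support(family) for some sample D in the family."""
    return {A for A in powerset(support(family)) if any(D <= A for D in family)}


def has_edge(Si, Sj):
    """Edge Si -> Sj: some member of closure(Si) lies inside a member of closure(Sj)."""
    return any(Ai <= Aj for Ai in closure(Si) for Aj in closure(Sj))


def separable(Si, Sj, ground):
    """Search all test groups I for one that misses one family and hits the other."""
    for I in powerset(ground):
        if all(not (I & D) for D in Si) and all(I & D for D in Sj):
            return True
        if all(not (I & D) for D in Sj) and all(I & D for D in Si):
            return True
    return False


def report(Si, Sj, ground):
    """Return (not separable, two-cycle present, closures intersect)."""
    return (not separable(Si, Sj, ground),
            has_edge(Si, Sj) and has_edge(Sj, Si),
            bool(closure(Si) & closure(Sj)))


fs = frozenset
# Hypothetical families: the first pair cannot be separated, the second can.
print(report({fs({1, 2})}, {fs({1}), fs({2})}, {1, 2}))
print(report({fs({1}), fs({2})}, {fs({3}), fs({1, 3})}, {1, 2, 3}))
```

In each report the three entries agree, as the theorem predicts; in the second example the group {3} is a separating test.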

Theorem 1.3.2 $\pi$ is realizable if and only if $G_\pi$ does not contain a directed cycle.

Proof. Suppose $G_\pi$ has a cycle C. Let $\pi'$ be a partition of $S_0$ obtained by merging two (separable) sets $S_i$ and $S_j$ in $\pi$. From Theorem 1.3.1, $G_\pi$ cannot contain the cycle $S_i \to S_j \to S_i$, for otherwise $S_i$ and $S_j$ are not separable, therefore not mergeable. Since every edge in $G_\pi$ except those between $S_i$ and $S_j$ is preserved in $G_{\pi'}$, $G_{\pi'}$ must contain a cycle C'. By repeating this argument, eventually one obtains a partition with no two parts separable. Therefore $\pi$ is not realizable.

Next assume that $G_\pi$ contains no cycle. The graph $G_\pi$ induces a partial ordering on the $S_i$'s with $S_i < S_j$ if and only if $S_i \to S_j$. Let $S_j$ be a minimal element in this ordering. Then $I = \overline{\|S_j\|}$ separates $S_j$ from all other $S_i$'s. Let $G_{\pi'}$ be obtained from $G_\pi$ by deleting the node $S_j$ and its edges. Then clearly, $G_{\pi'}$ contains no cycle. Therefore the above argument can be repeated to find a set of tests which separate one $S_i$ at a time. □

Unfortunately, local conditions, i.e., conditions on $S_i$ and $S_j$ alone, are not sufficient to tell whether a pair is validly mergeable or not. ($S_i$ and $S_j$ are validly mergeable if merging $S_i$ and $S_j$ preserves the realizability of the partition.) Theorem 1.3.1 provides some local necessary conditions while the following corollaries to Theorem 1.3.2 provide more such conditions as well as some sufficient conditions.
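The acyclicity criterion of Theorem 1.3.2 can be tested mechanically. The sketch below is illustrative only: it assumes an edge from one part to another exists exactly when some sample of the first part is contained in the union of the second (a simplification of the closure-based definition that yields the same edges), and it uses Kahn's algorithm for cycle detection, which is a choice made here rather than anything prescribed by the text. It applies the criterion without constructing a testing tree.

```python
def support(family):
    """The union of all samples in the family."""
    return frozenset().union(*family)


def edge_set(partition):
    """Edges i -> j: some sample of part i lies inside the union of part j."""
    supports = [support(part) for part in partition]
    return {
        (i, j)
        for i, part in enumerate(partition)
        for j in range(len(partition))
        if i != j and any(D <= supports[j] for D in part)
    }


def is_acyclic(num_nodes, edges):
    """Kahn's algorithm: repeatedly remove a node with no incoming edge."""
    indegree = [0] * num_nodes
    for _, j in edges:
        indegree[j] += 1
    ready = [v for v in range(num_nodes) if indegree[v] == 0]
    removed = 0
    while ready:
        v = ready.pop()
        removed += 1
        for a, b in edges:
            if a == v:
                indegree[b] -= 1
                if indegree[b] == 0:
                    ready.append(b)
    return removed == num_nodes


def passes_acyclicity_test(partition):
    """True when the graph of the partition has no directed cycle."""
    return is_acyclic(len(partition), edge_set(partition))


# Hypothetical sample space {{1}, {2}, {1, 2}} on two items.
one, two, both = frozenset({1}), frozenset({2}), frozenset({1, 2})
singletons = [{one}, {two}, {both}]
merged = [{one, two}, {both}]
print(passes_acyclicity_test(singletons), passes_acyclicity_test(merged))
```

The partition into single samples passes the test, while grouping {1} and {2} against {1, 2} creates a two-cycle: no group test can tell those two parts apart.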

Corollary 1.3.3 Let M be the set of $S_i$'s which are maximal under the partial ordering induced by $G_\pi$. (If $|M| = 1$, add to $M = \{S_i\}$ an element $S_j$ such that $S_j < S_i$ and $S_j \not< S_k$ for all other $S_k$.) Then every pair in M is validly mergeable.

Corollary 1.3.4 Let $\pi$ be realizable. Then the pair $S_i$, $S_j$ is validly mergeable if there does not exist some other $S_k$ such that either $S_i \to S_k$ or $S_j \to S_k$.

Corollary 1.3.5 $S_i$ and $S_j$ are not validly mergeable if there exists another $S_k$ such that either $S_i \to S_k \to S_j$ or vice versa.

Corollary 1.3.6 Let $\pi$ be the partition of $S_0$ in which every $S_i$ consists of just a single element. Then $\pi$ is realizable.


Let $S \subseteq S_0$ ($S \ne \emptyset$) and let $\pi = \{S, S_1, S_2, \ldots, S_m\}$ be the partition of $S_0$ where each $S_i$ consists of a single element $D_i$. The following theorem shows a connection between the realization of S and the realization of $\pi$.

Theorem 1.3.7 The following statements are equivalent:
(i) S is realizable.
(ii) $\bar{S} \cap S_0 = S$.
(iii) $\pi$ is realizable.

Proof. We show (i) ⇒ (ii) ⇒ (iii) ⇒ (i).

(i) ⇒ (ii): Since $S \subseteq S_0$, clearly $\bar{S} \cap S_0 \supseteq S$. So only $\bar{S} \cap S_0 \subseteq S$ needs to be shown. Let $D \in \bar{S} \cap S_0$; then $D \in \bar{S}$ and hence there is some $D' \in S$ such that $D' \subseteq D$. Suppose to the contrary that $D \notin S$. Then $\{D\} = S_i$ for some i. But $D \in \bar{S}$ implies $S_i \to S$ and $D' \subseteq D$ implies $S \to S_i$. Hence S is not realizable by Theorem 1.3.2.

(ii) ⇒ (iii): If $\pi$ is not realizable, then $G_\pi$ contains a directed cycle. Since each $S_i$ is a single element, $\bar{S}_i = S_i$ and hence any directed cycle must contain S, say $S_1 \to S_2 \to \cdots \to S_k \to S \to S_1$. But $S_1 \to S_2 \to \cdots \to S_k \to S$ implies $S_1 \to S$ since the $S_i$'s are single elements. By Theorem 1.3.1, $S \to S_1 \to S$ implies $\bar{S} \cap \bar{S}_1 = \bar{S} \cap \{D_1\} \ne \emptyset$. This implies $D_1 \in \bar{S} \cap S_0$. Since $D_1 \notin S$, $\bar{S} \cap S_0 \ne S$.

(iii) ⇒ (i): Trivially true by using the definition of realizability. □

1.4 Number of Group Testing Algorithms

O n e interesting p r o b l e m is t o count t h e n u m b e r of group testing .algorithms for S. This is t h e t o t a l n u m b e r of binary trees with \S\ leaves (labeled by m e m b e r s of S) which satisfy t h e group testing s t r u c t u r e . While this problem r e m a i n s open, Moon and Sobel [7] counted for a class of algorithms when t h e sample space S is t h e power set of n i t e m s . Call a group pure if it contains no defective, and contaminated otherwise. A group testing algorithm is nested if whenever a contaminated group is known, t h e next group to be tested m u s t be a proper subset of t h e c o n t a m i n a t e d group. (A more detailed definition is given in Section 2.3.) Sobel and Groll [8] proved L e m m a 1 . 4 . 1 Let U be the set of unclassified items and suppose that C C U is tested to be contaminated. Furthermore, suppose C" C C is then tested to be contaminated. Then items in C\C can be mixed with items inU\C without losing any information. Proof. Since C" being c o n t a m i n a t e d implies C being c o n t a m i n a t e d , t h e sample space given b o t h C and C" being c o n t a m i n a t e d is t h e same as only C" being c o n t a m i n a t e d . B u t u n d e r t h e l a t t e r case, items in C \ C and U\C are indistinguishable. • T h u s u n d e r a nested algorithm, at any stage t h e set of unclassified i t e m s is characterized by two p a r a m e t e r s m and n, where m > 0 is t h e n u m b e r of items in a


contaminated group and $n$ is the total number of unclassified items. Let $f(m,n)$ denote the number of nested algorithms when the sample space is characterized by such $m$ and $n$. By using the "nested" property, Moon and Sobel [7] obtained

\[
\begin{aligned}
f(0,0) &= 1, \\
f(0,n) &= \sum_{k=1}^{n} f(0,n-k)\, f(k,n) && \text{for } n \ge 1, \\
f(1,n) &= f(0,n-1) && \text{for } n \ge 1, \\
f(m,n) &= \sum_{k=1}^{m-1} f(m-k,n-k)\, f(k,n) && \text{for } n \ge m > 1,
\end{aligned}
\]

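where $k$ is the size of the group to be tested. Recall that the Catalan numbers

\[ C_k = \frac{1}{k}\binom{2k-2}{k-1} \]

satisfy the recurrence relation

\[ C_k = \sum_{i=1}^{k-1} C_i C_{k-i} \]

for $k \ge 2$. Moon and Sobel gave

Theorem 1.4.2

\[
\begin{aligned}
f(0,n) &= C_{n+1} \prod_{i=1}^{n-1} f(0,i) && \text{for } n \ge 2, \\
f(m,n) &= C_m \prod_{i=1}^{m} f(0,n-i) && \text{for } 1 \le m \le n.
\end{aligned}
\]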

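Proof. Theorem 1.4.2 is easily verified for $f(0,2)$ and $f(1,n)$. The general case is proved by induction:

\[
\begin{aligned}
f(0,n) &= \sum_{k=1}^{n} f(0,n-k)\, f(k,n) \\
&= \sum_{k=1}^{n} \Bigl( C_{n-k+1} \prod_{i=1}^{n-k-1} f(0,i) \Bigr) \Bigl( C_k \prod_{i=1}^{k} f(0,n-i) \Bigr) \\
&= C_{n+1} \prod_{i=1}^{n-1} f(0,i), \\
f(m,n) &= \sum_{k=1}^{m-1} f(m-k,n-k)\, f(k,n) \\
&= \sum_{k=1}^{m-1} \Bigl( C_{m-k} \prod_{i=1}^{m-k} f(0,n-k-i) \Bigr) \Bigl( C_k \prod_{i=1}^{k} f(0,n-i) \Bigr) \\
&= C_m \prod_{i=1}^{m} f(0,n-i). \qquad \Box
\end{aligned}
\]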

Define $F(n) = f(0,n)$. The following recurrence relations follow readily from Theorem 1.4.2 and the definition of the Catalan numbers.

Corollary 1.4.3 If $n \ge 1$, then

\[ F(n) = \frac{C_{n+1}}{C_n} F^2(n-1) = \frac{2(2n-1)}{n+1} F^2(n-1). \]

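Corollary 1.4.4 If $n \ge 2$, then

\[ F(n) = C_{n+1} C_n C_{n-1}^{2} C_{n-2}^{4} \cdots C_2^{2^{n-2}}. \]

Corollary 1.4.5 If $n \ge 1$, then

\[ F(n) = 4^{2^n - 1} \prod_{i=1}^{n} \left\{ 1 - \frac{3}{2(i+1)} \right\}^{2^{n-i}}. \]

The first few values of $F(n)$ are

n       1    2    3     4      5          6
F(n)    1    2    10    280    235,200    173,859,840,000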
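The recurrence and the closed form of Theorem 1.4.2 are easy to cross-check numerically. The following is a minimal sketch, not part of the original text; the function names `f`, `catalan` and `closed_form` are illustrative choices. It evaluates the Moon and Sobel recurrence directly and compares it with the Catalan-product formula.

```python
from functools import lru_cache
from math import comb, prod


@lru_cache(maxsize=None)
def f(m: int, n: int) -> int:
    """Number of nested algorithms for the state (m, n), by the recurrence."""
    if m == 0:
        if n == 0:
            return 1
        return sum(f(0, n - k) * f(k, n) for k in range(1, n + 1))
    if m == 1:
        return f(0, n - 1)
    return sum(f(m - k, n - k) * f(k, n) for k in range(1, m))


def catalan(k: int) -> int:
    """C_k = (1/k) * binom(2k - 2, k - 1), so C_1 = C_2 = 1 and C_3 = 2."""
    return comb(2 * k - 2, k - 1) // k


def closed_form(n: int) -> int:
    """f(0, n) = C_{n+1} * prod_{i=1}^{n-1} f(0, i), stated for n >= 2."""
    return catalan(n + 1) * prod(f(0, i) for i in range(1, n))


values = [f(0, n) for n in range(1, 7)]
print(values)  # F(1), ..., F(6)
```

Both routes agree for every small $n$ tried here, which is a sanity check on the recurrence as reconstructed, not a proof.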

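The limiting behavior of $F(n)$ can be derived from the formula in Corollary 1.4.5.

Corollary 1.4.6

\[ \lim_{n \to \infty} \{F(n)\}^{2^{-n}} = 4 \prod_{i=1}^{\infty} \left\{ 1 - \frac{3}{2(i+1)} \right\}^{2^{-i}} = 1.526753\cdots. \]

More generally, it can be shown that

\[ F(n) = \tfrac{1}{4}\, \alpha^{2^n} \left\{ 1 + \frac{3}{2n} + o\!\left(\frac{1}{n}\right) \right\} \]

as $n \to \infty$, where $\alpha = 1.526753\cdots$; in particular, $F(n) > \frac{1}{4} \alpha^{2^n}$ for $n \ge 1$. The proof of this is omitted.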
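The constant 1.526753 can be reproduced numerically. The sketch below assumes that the limit in Corollary 1.4.6 is the infinite product $4 \prod_{i \ge 1} \{1 - 3/(2(i+1))\}^{2^{-i}}$, which is our reading of the formula; the helper name `limit_constant` and the truncation at 200 factors are illustrative choices.

```python
from math import prod


def limit_constant(terms: int = 200) -> float:
    """Truncated product 4 * prod_{i=1}^{terms} (1 - 3 / (2 (i + 1))) ** (2 ** -i)."""
    return 4.0 * prod(
        (1.0 - 3.0 / (2.0 * (i + 1))) ** (2.0 ** -i) for i in range(1, terms + 1)
    )


alpha = limit_constant()
print(round(alpha, 6))
```

The factors approach 1 geometrically fast, so a few dozen terms already fix the first six decimals.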

1.5 A Prototype Problem and Some Basic Inequalities

We first describe a prototype CGT problem which will be the focus of the whole book. Then we discuss generalizations and special cases. Consider a set $I$ of $n$ items known to contain exactly $d$ defectives. The defectives look exactly like the good items and the only way to identify them is through testing, which is error-free. A test can be applied to an arbitrary subset of the $n$ items with two possible outcomes: a negative outcome indicates that all items in the subset are good, a positive outcome indicates the opposite, i.e., at least one item in the subset is defective (but not knowing which ones or how many are defective). The goal is to find an algorithm $A$ to identify all defectives with a small $M_A(S) = M_A(d,n)$. This is the prototype problem. It is also called the $(d,n)$-problem to highlight the two parameters $n$ and $d$. In the literature the $(d,n)$ problem has sometimes been called the hypergeometric group testing problem. We now reserve the latter term for the case


that a uniform distribution is imposed on $S$, since then the probability that a subset of $k$ items contains $x$ defectives follows the hypergeometric distribution

\[ \binom{d}{x} \binom{n-d}{k-x} \Big/ \binom{n}{k}. \]

Note that the hypergeometric group testing problem under this new definition belongs to PGT, not CGT. Since $M(n,n) = M(0,n) = 0$, whenever the $(d,n)$ problem is studied, it is understood that $0 < d < n$. Hu, Hwang and Wang [3] proved some basic inequalities about the $M$ function which are reported in this section.

Lemma 1.5.1 $S \subseteq S'$ implies $M(S)$

$1 + M(d-1,n-1)$ for $m \ge 2$ and $0 < d < n$.

Proof. By Lemma 1.5.1

\[ M(m;d,n) \ge M(2;d,n) \quad \text{for } m \ge 2. \]

Let $T$ be an algorithm for the $(2;d,n)$ problem, and let $M_T(2;d,n) = k$, i.e., $k$ is the maximum path length of a leaf in $T$. Let $I_1$ and $I_2$ denote the two items in the contaminated group.

Claim. Every path of length $k$ in $T$ includes at least one test that contains either $I_1$ or $I_2$.


Proof of claim. Suppose to the contrary that for some leaf $v$ the path $p(v)$ has length $k$ and involves no test containing $I_1$ or $I_2$. Since no test on $p(v)$ can distinguish $I_1$ from $I_2$, and since $\{I_1, I_2\}$ is a contaminated group, $I_1$ and $I_2$ must both be defective in the sample $s(v)$. Let $u$ be the sibling node of $v$. Then $u$ is also a leaf since $v$ has maximum path length. Since $p(u)$ and $p(v)$ have the same set of tests, $I_1$ and $I_2$ must also be both defective in the sample $s(u)$. Since $s(u) \ne s(v)$, there exist indices $i$ and $j$ such that $I_i$ is defective and $I_j$ is good in the sample $s(u)$, while $I_i$ is good and $I_j$ is defective in $s(v)$. Let $w$ denote the parent node of $u$ and $v$; thus $S(w) = \{s(u), s(v)\}$. Then no test on $p(w)$ can be of the form $G \cup \{I_i\}$, where $G$, possibly empty, contains only items classified as good in $s(u)$, since such a test must yield a positive outcome for $s(u)$ and a negative outcome for $s(v)$, and hence would have separated $s(u)$ from $s(v)$. Define $s$ to be the sample identical to $s(u)$ except that $I_2$ is good and both $I_i$ and $I_j$ are defective. Then $s$ can be distinguished from $s(u)$ only by a test containing $I_2$, which by assumption does not exist on $p(w)$, or by a test of the form $G \cup \{I_j\}$, whose existence has also been ruled out. Thus $s \in S(w)$, and $u$ and $v$ cannot both be leaves, a contradiction that completes the proof of the claim.

By renaming the items if necessary, one may assume that every path of length $k$ in $T$ involves a test that contains $I_1$. Add an imaginary defective to the $(d-1,n-1)$ problem and label it $I_1$, and map the $n-1$ items of the $(d-1,n-1)$ problem one-to-one to the items of the $(d,n)$ problem except $I_1$. Then the modified $T$ can be used to solve the $(d-1,n-1)$ problem except that every test containing $I_1$ is skipped since the positive outcome is predictable. But each path of length $k$ in $T$ contains such a test. Hence the maximum path length in applying $T$ to the $(d-1,n-1)$ problem is $k-1$, i.e.,

\[ M_T(2;d,n) \ge 1 + M_T(d-1,n-1). \]
Since T is arbitrary, the proof is complete.

•
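For very small $n$ the minimax value $M(d,n)$ of the prototype problem can be computed by exhaustive search, which gives a way to spot-check inequalities of this section. The sketch below is an illustration only: the name `minimax_tests` and the brute-force strategy are assumptions of this example, not the book's method, and the search is exponential, so it is only usable for roughly $n \le 5$.

```python
from functools import lru_cache
from itertools import combinations
from math import ceil, log2


@lru_cache(maxsize=None)
def minimax_tests(d: int, n: int) -> int:
    """M(d, n): worst-case number of tests needed to identify exactly d
    defectives among n items, found by searching all test trees."""
    items = range(n)
    samples = frozenset(frozenset(c) for c in combinations(items, d))
    groups = [frozenset(g) for r in range(1, n + 1) for g in combinations(items, r)]

    @lru_cache(maxsize=None)
    def solve(space: frozenset) -> int:
        if len(space) <= 1:
            return 0                      # the sample is already determined
        best = len(space)                 # loose upper bound, always improved below
        for g in groups:
            positive = frozenset(s for s in space if s & g)
            negative = space - positive
            if positive and negative:     # only tests that split the space help
                best = min(best, 1 + max(solve(positive), solve(negative)))
        return best

    return solve(samples)


for n in range(2, 6):
    print(n, [minimax_tests(d, n) for d in range(1, n)])
```

On these small cases the search agrees with $M(1,n) = \lceil \log n \rceil$, with $M(n-1,n) = n-1$, and with the bound $M(d,n) \ge 1 + M(d-1,n-1)$.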

Corollary 1.5.4 $M(n-1,n) = n-1$.

Proof. Trivially true for $n = 1$. For $n \ge 2$

\[ M(n-1,n) = M(n;n-1,n) \ge 1 + M(n-2,n-1) \ge n-1 + M(0,1) = n-1. \qquad \Box \]

Theorem 1.5.5 $M(d,n) \ge 1 + M(d-1,n-1) \ge M(d-1,n)$ for $0 < d < n$.

Proof. By noting $M(d,n) = M(n;d,n)$, the first inequality follows from Theorem 1.5.3. The second inequality is trivially true for $d = 1$. The general case is proved by using the induction assumption

\[ M(d,n) \ge M(d-1,n). \]

Let $T$ be the algorithm which first tests a single item and then uses a minimax algorithm for the remaining problem. Then

\[
\begin{aligned}
M(d-1,n) &\le M_T(d-1,n) \\
&= 1 + \max\{M(d-1,n-1),\, M(d-2,n-1)\} \\
&= 1 + M(d-1,n-1). \qquad \Box
\end{aligned}
\]

Lemma 1.5.6 $M(d,n) \le n-1$.

Proof. The individual testing algorithm needs only $n-1$ tests since the state of the last item can be deduced by knowing the states of the other items and knowing $d$. $\Box$

Lemma 1.5.7 Suppose that $n-d > 1$. Then $M(d,n) = n-1$ implies $M(d,n-1) = n-2$.

Proof. Suppose to the contrary that $M(d,n-1) \ne n-2$. By Lemma 1.5.6, this is equivalent to assuming $M(d,n-1) < n-2$. Let $T$ denote an algorithm for the $(d,n)$ problem which first tests a single item and then uses a minimax algorithm for the remaining problem. Then

\[
\begin{aligned}
M(d,n) &\le M_T(d,n) \\
&= 1 + \max\{M(d,n-1),\, M(d-1,n-1)\} \\
&= 1 + M(d,n-1) \quad \text{by Theorem 1.5.5}
\end{aligned}
\]

1, then

\[ M_T(d,n) \ge 1 + M(m;d,n) \ge 2 + M(d-1,n-1) \quad \text{by Theorem 1.5.3}, \]

a contradiction to what has just been shown. Therefore $m = 1$ and

\[
\begin{aligned}
M(d,n) &= 1 + \max\{M(d,n-1),\, M(d-1,n-1)\} \\
&= 1 + M(d,n-1)
\end{aligned}
\]

by Theorem 1.5.5 and the fact $d < n-1$. It follows that

\[ M(d-1,n-1) = M(d,n-1). \]

Hence

\[ M(d-1,n-1) = n-2 \]

by induction. Therefore $M(d,n) = 1 + M(d-1,n-1) = n-1$.

Lemma 1.5.9 Suppose $M(d,n) < n-1$. Then

$M(d,n) \ge 2l + M($ 1. First consider the case $l = 1$. Let $T$ be a minimax algorithm for the $(d,n)$ problem which first tests a set of $m$ items. If $m > 1$, then Lemma 1.5.9 is an immediate consequence of Theorem 1.5.3. Therefore assume $m = 1$. Suppose to the contrary that

\[ M_T(d,n) < 2 + M(d-1,n-1). \]

Then

$1 + M(d-1,n-1) \ge M_T(d,n) = 1 + \max\{M($ 2a~\

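Hence by induction

\[
M_G(d,n-2^\alpha) =
\begin{cases}
(\alpha+2)d + (p-1) - 1 & \text{for } p \ge 1, \\
(\alpha+1)d + (d-2) - 1 & \text{for } p = 0,\ \theta < 2^{\alpha-1}, \\
(\alpha+1)d + (d-1) - 1 & \text{for } p = 0,\ \theta \ge 2^{\alpha-1}.
\end{cases}
\]

Consequently

\[
1 + M_G(d,n-2^\alpha) =
\begin{cases}
(\alpha+2)d + p - 2 & \text{for } p = 0,\ \theta < 2^{\alpha-1}, \\
(\alpha+2)d + p - 1 & \text{otherwise.}
\end{cases}
\]

For $d' = d-1$ and $n' = n-1$,

\[
l' = n' - d' + 1 =
\begin{cases}
2^\alpha (d-1) + 2^\alpha (p+1) + \theta & \text{for } p \le d-3, \\
2^{\alpha+1} (d-1) + \theta & \text{for } p = d-2, \\
2^{\alpha+1} (d-1) + 2^\alpha + \theta & \text{for } p = d-1.
\end{cases}
\]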

Hence by induction (a + 2 ) ( d - l ) + ( p + l ) - l (a + 3 ) ( d - l ) - l

MG{d-l,n-l)

for p 2.

Proof. l + d-l\

n

dj "

ld

\ d ) > d\ [2" 1 + F{l;d,n) = l + H(d-l,n-l) >

H{d-l,n)

.

2.3 The Nested

Class

25

Finally, for $m = 1$ and $d \ge 2$,

\[ F(1;d,n) = H(d-1,n-1) \ge H(d-2,n-1) = F(1;d-1,n). \]

For general $m > 1$

\[
\begin{aligned}
F(m;d,n) &= \min_{1 \le k < m} \max\{F(m-k;d,n-k),\, F(k;d,n)\} \\
&\ge \min_{1 \le k < m} \max\{F(m-k;d-1,n-k),\, F(k;d-1,n)\} \quad \text{by induction}
\end{aligned}
\]

l $2^{k-2}$. Test a group of $n - 2^{k-1}$ items. If the outcome is negative, then $n - 2^{k-1} \ge 2^k - n$ good items are identified, while a defective can be identified from the remaining $2^{k-1}$ items in $k-1$ tests by binary splitting. If the outcome is positive, then by induction, at least $2^{k-1} - (n - 2^{k-1}) = 2^k - n$ good items are identified along with a defective if $k-1$ more tests are used. 2. $n - 2^{k-1}$ -

\ 2'

for

z>2 .

Theorem 2.3.5 For $d \ge 2$,

\[ f_d(t) = f_d(t-1) + l\bigl(f_{d-1}(t') - f_d(t-1),\; t-1-t'\bigr), \]

where $t'$ is defined in

\[ f_{d-1}(t') > f_d(t-1) > f_{d-1}(t'-1). \]

Proof. Suppose that $f_d(t-1) < n \le f_d(t)$. The first test must be on a group of not fewer than $n - f_d(t-1)$ items, for otherwise a negative outcome would leave too many items for the remaining $t-1$ tests. On the other hand there is no need to test more than $n - f_d(t-1)$ items since only the case of a positive outcome is of concern and by Lemma 2.3.2, the fewer items in the contaminated group the better. For the time being, assume that $f_{d-1}(t'+1) > f_d(t)$. Under this assumption, if the first defective lies among the first $n - f_{d-1}(t')$ items, then after its identification the remaining items need $t'+1$ further tests. Therefore only $t-2-t'$ tests are available to identify the first defective. Otherwise, the remaining items can be identified by $t'$ tests and one more test is available to identify the first defective. Therefore, the maximum value of doable $n$ is

\[ n - f_d(t-1) = l\bigl(f_{d-1}(t') - f_d(t-1),\; t-1-t'\bigr). \]

The proof of $f_{d-1}(t'+1) > f_d(t)$ is very involved. The reader is referred to [11] which proved the same for the corresponding $R^*$-minimax merging algorithm (see Section 2.4). $\Box$

By noting $f_1(t) = 2^t$, $f_2(t)$ has a closed-form solution.

Corollary 2.3.6 f 22 - 2 + [ 4 ^ J /2(t)

=

\

«-l 2

for t even and > 4, 1 2

{ 2 V - 2 + I "*"' ' '] for t odd and > 3.

The recursive equation for $f_d(t)$ can be solved in $O(td)$ time. Since the generalized binary splitting algorithm is in the nested class, $H(d,n)$ cannot exceed $M_G(d,n)$, which is of the order $d \log(n/d)$. Therefore, $f_x(y)$ for $x \le d$ and $y \le d \log(n/d)$ can be computed in $O(d^2 \log(n/d))$ time, a reduction by a factor of $n^3/d \log(n/d)$ from the brute force method.

2.4 (d, n) Algorithms and Merging

27

Algorithms

For given n and d, compute fx{y) for all x < d and y < d \og(n/d). A minimax nested algorithm is defined by the following procedure (assume that unidentified items form a line): Step 1. Find t such that fd(t) b3, then a sequence of comparisons (if needed) involving ad are used to merge ad into Bg, i.e. to establish 6,- < aj < 6,+i for some i, j MT,(d,n) . Proof. Let T" be obtained from T by adding a subtree Tv to each terminal node v having a positive number of free items (see Figure 2.1). T„ is the tree obtained by testing the free items one by one. Since free items are the only items at v whose states are uncertain when (d, n) is changed to (d, n), I " is a procedure for the (d, n) problem. From Lemma 2.5.1, the sibling node of v is the root of a subtree with maximum path length at least / — 1 where / is the number of free items of sv. The theorem follows immediately. • Corollary 2.5.3 M(d,n) + 1 > M(d,n)

> M(d,n + 1).

Proof. The first inequality follows from the theorem. The second inequality follows from the observation that the (d, n + 1) problem can be solved by any procedure for the (d, n) problem, provided one of the n + 1 items is put aside. But the nature of the item put aside can be deduced with certainty once the natures of the other n items are known. • For nested algorithms Hwang [9] gave a stronger result which is stated here without proof. Theorem 2.5.4 F(m;d,n+

1) =

Corollary 2.5.5 H(d,n + 1) =

F(m;d,n). H{d,n).


Figure 2.1: From tree $T$ to tree $T'$.

When the tests are destructive or consummate, as in the blood test application, then the number of tests an item can go through (or the number of duplicates) becomes an important issue. If an item can only go through one test, then individual testing is the only feasible algorithm. If an item can go through at most $s$ tests, then Li's s-stage algorithm is a good one to use.

In some other applications, the size of a testable group is restricted. Usually, the restriction is of the type that no more than $k$ items can be tested as a group. An algorithm can be modified to fit the restriction by cutting the group size to $k$ whenever the algorithm calls for a larger group.

There are more subtle restrictions. Call a storage unit a bin and assume that all items in a bin are indistinguishable to the tester even though they have different test history. A b-bin algorithm can use at most $b$ bins to store items. A small number of bins not only saves storage units, but also implies easier implementation. Since at most stages of the testing process, one bin is needed to store good items, one to store defectives and one to store unidentified items, any sensible algorithm will need at least three bins. An obvious 3-bin algorithm is individual testing; and at first glance this seems to be the only sensible 3-bin algorithm. We now show the surprising result that Li's s-stage algorithm can be implemented as a 3-bin algorithm. The three bins are labeled "queue," "good item" and "new queue." At the beginning of stage $i$, items which have been identified as good are in the good-item bin, and all other items

33

2.5 Some Practical Considerations

are in the queue bin. Items in the queue bin are tested in groups of size $k_i$ (some possibly $k_i - 1$) according to Li's s-stage algorithm. Items in groups tested negative are thrown into the good-item bin, and items in groups tested positive are thrown into the new-queue bin. At the end of stage $i$, the queue bin is emptied and changes labels with the new-queue bin to start the next stage. Of course, at stage $s$, each group is of size one and the items thrown into the new-queue bin are all defectives.

All nested algorithms are 4-bin algorithms. Intuitively, one expects the minimax nested algorithm to be a minimax 4-bin algorithm. But this involves proving that the following type of tests can be excluded from a minimax algorithm: Suppose a contaminated group $C$ exists. Test a group $G$ which is not a subset of $C$. When $G$ is also contaminated, throw $G$ into the bin containing $C$ and throw $C$ into the other bin containing unidentified items. Note that the last move loses information on finding $C$ contaminated. But it is hard to relate this to nonminimaxity.

Yet another possible restriction is on the number of recursive equations defining a minimax algorithm in a given class. A small number implies a faster solution for the recursive equations. For example, individual testing needs one equation (in a trivial way) and the nested class needs two.

Suppose that there are $p$ processors or $p$ persons to administer the tests in parallel. Then $p$ disjoint groups can be tested in one "round." In some circumstances, when the cost of time dominates the cost of tests, the number of rounds is a more relevant criterion to evaluate an algorithm than the number of tests. Li's s-stage algorithm can be easily adapted to be a parallel algorithm. Define $n' = \lceil n/p \rceil$. Apply Li's algorithm to the $(d,n')$ problem except that the $g_i$ groups at stage $i$, $i = 1,\ldots,s$, are partitioned into $\lceil g_i/p \rceil$ classes and groups in the same class are tested in the same round.
At the last stage, instead of d defectives, at most d contaminated groups are identified. Since each group contains at most p items, at most d more parallel tests identify all defectives in these groups. Recall that the number of tests for Li's algorithm is upper bounded by

l o l ^ l o s (5) ' where s = In f^J. The total number of rounds with p processors is upper bounded by logep

\dp)

which tends to

\dp)

feH

iogn

when n is much larger than d and p.
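The 3-bin implementation of Li's s-stage algorithm described in this section can be sketched in a few lines. This is an illustrative simulation only: the function name `three_bin_s_stage`, the way group sizes are passed in, and the simple slicing of the queue (which lets the last group of a stage be smaller, rather than balancing sizes $k_i$ and $k_i - 1$) are assumptions of the example, not part of the text.

```python
def three_bin_s_stage(items, defectives, group_sizes):
    """Simulate an s-stage algorithm with the bins "queue", "good item", "new queue".

    group_sizes holds one group size per stage; the last one must be 1 so that
    the final stage tests items individually.
    Returns (items left in the queue, good-item bin, number of tests).
    """
    assert group_sizes[-1] == 1, "the last stage must test items one by one"
    queue, good, tests = list(items), [], 0
    for k in group_sizes:
        new_queue = []
        for start in range(0, len(queue), k):
            group = queue[start:start + k]
            tests += 1
            if any(x in defectives for x in group):   # positive outcome
                new_queue.extend(group)
            else:                                     # negative outcome
                good.extend(group)
        queue = new_queue   # the two queue bins swap labels for the next stage
    return queue, good, tests


found, good, tests = three_bin_s_stage(range(16), {3, 12}, (4, 2, 1))
print(found, tests)
```

At no point does the simulation need to remember which group an item came from, which is exactly why three bins suffice.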


2.6 An Application to Clone Screenings

Screening large collections of clones for those that contain specific DNA sequences is a preliminary step indispensable to numerous genetic studies and is usually performed with a clone-by-clone probe or something equivalent. For a yeast artificial-chromosome (YAC) library, the high sensitivity and specificity of the polymerase chain reaction (PCR) allows the detection of target sequences in DNA prepared from pools of thousands of YAC clones. Several PCR-based screening protocols have been suggested to reduce the number of tests. They will be introduced here and their relations to group testing expounded.

A YAC library typically contains from 10,000 to 100,000 clones where each DNA sequence will appear on average in $r$ clones. The value of $r$ is specified by the library and the ratio $r/n$, though depending on the particular library, is in the order of $10^{-4}$. A PCR can identify the existence of a specified DNA sequence in a pool of clones and the identification is considered error-free if the pool size stays within a certain limit. The objective is to identify all clones containing the specified DNA sequence with a minimum amount of effort; this includes the number of PCRs, the ease of preparing the pools and whether the pooling can be done in parallel.

Thus it is clear that clone screening can be cast as a group testing problem where the clones are the items and those containing a specified DNA sequence the defectives. While the number of defectives is usually given in its expected value, an upper bound can be obtained by assuming that the number of defectives follows a Poisson distribution. A PCR is a group test with some restriction on its maximum size. If the DNA of a pool contains the appropriate PCR product, then the test outcome is positive (such a pool is identified as "positive"). While one wants to minimize the number of tests, how easily the test group can be assembled is definitely a concern.
In general, nonadaptive algorithms are preferred over sequential algorithms, with the multistage algorithms with a small number of stages a possible compromise. Green and Olson [6] considered a human YAC library with n = 23,040 clones and r = 2. YAC clones were grown on nylon filters in rectangular arrays of 384 colonies following inoculation from four 96-well microtiter plates. The yeast cells from each filter are pooled and the DNA is purified, yielding single-filter pools of DNA. Equal aliquots from single-filter pools are mixed together in groups of five to yield multifilter pools, each representing the DNA from 1920 clones. The multi-filter pools of DNA are then analyzed individually for the presence of a specified DNA segment by using the PCR. If a multi-filter pool is found to be positive, then each constituent single-filter pool is analyzed individually by the same PCR assay. Upon generation of a positive single-filter pool, locations of positive clones within the 384-clone array are established by colony hybridization using the radio-labeled PCR product as the probe. Green and Olson also considered a modification which affects when a positive single-filter pool is found. Instead of colony hybridization, the modification uses the binary representation matrix (see Section 4.5) to identify the positive clone in

$\lceil \log 1920 \rceil = 11$ pools. Note that this method would fail if there exists more than one positive clone in the pool, and more pools have to be taken for remedy. But for the given parameters of the library, the probability that this will happen is only 3%. The modified Green and Olson method is a 3-stage group testing algorithm except that in the last stage the individual testing is replaced by some technology-based procedures. Note that the size of the single-filter grouping is constrained to be a multiple of 384, and that of the multi-filter grouping to be within the limit of an effective PCR.

De Jong, Aslanidis, Alleman and Chen [3], and also Evans and Lewis [5], proposed a different approach which was elaborated and analyzed by Barillot, Lacroix and Cohen [1]. They assumed that the clones are arranged in a 2-dimensional matrix, and each row or column yields a pool. A positive clone renders its row and its column positive. Thus all positive clones are located at the intersections of positive rows and positive columns. Unfortunately, the reverse is not true. In fact $r$ positive clones may cause $r^2$ such positive intersections. Thus one or several other similar matrices are needed to differentiate clones at these intersections, using the fact that positive clones will be located at positive intersections of every matrix.

Figure 2.2: True and false positive intersection.

For better discrimination, it is desirable that any two clones appear in the same row or column in at most one matrix. Such matrices have been studied in the statistical literature and are known as lattice designs. In particular, they are called lattice squares if the matrices are square matrices. The construction of a set of $k$, $k \ge 2$, lattice squares corresponds to the construction of $k-2$ latin squares (see [15]). Since the effort of collecting a pool is not negligible, after a pool is collected and analyzed, a decision needs to be made among three choices: use another matrix for more pools, test the ambiguous clones individually, or accept a small probability of including a false positive clone.

The 2-dimensional pooling strategy can be extended to $d$-dimensional spaces where each hyperplane yields a pool. For $d$-dimensional cubes, Barillot et al. pointed out that the size of a pool is $n^{1-1/d}$ and would be too large for $d > 4$. They provided figures which show the optimal $d$ as a function of $n$ and $r$.
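The ambiguity of row-and-column pooling is easy to see on a toy case. The following sketch is an illustration with made-up coordinates; the helper name `positive_intersections` is hypothetical. It lists the intersections of positive rows and positive columns for a given set of positive clones in one matrix.

```python
def positive_intersections(positive_clones):
    """Cells lying on both a positive row pool and a positive column pool."""
    rows = {r for r, _ in positive_clones}
    cols = {c for _, c in positive_clones}
    return {(r, c) for r in rows for c in cols}


clones = {(0, 0), (1, 1)}                 # r = 2 positive clones
candidates = positive_intersections(clones)
false_positives = candidates - clones
print(sorted(candidates), sorted(false_positives))
```

With two positive clones in different rows and columns, four cells are flagged and two of them are false, which is the $r^2$ worst case mentioned above. Resolving them takes either another matrix or individual tests.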

References

[1] E. Barillot, B. Lacroix and D. Cohen, Theoretical analysis of library screening using a N-dimensional pooling strategy, Nucleic Acids Res. 19 (1991) 6241-6247.
[2] X. M. Chang, F. K. Hwang and J. F. Weng, Group testing with two and three defectives, Ann. N.Y. Acad. Sci. Vol. 576, Ed. M. F. Capobianco, M. Guan, D. F. Hsu and F. Tian, (New York, 1989) 86-96.
[3] P. J. De Jong, C. Aslanidis, J. Alleman and C. Chen, Genome mapping and sequencing, Cold Spring Harbour Conf. New York, 1990, 48.
[4] R. Dorfman, The detection of defective members of large populations, Ann. Math. Statist. 14 (1943) 436-440.
[5] G. A. Evans and K. A. Lewis, Physical mapping of complex genomes by cosmid multiplex analysis, Proc. Nat. Acad. Sci. USA 86 (1989) 5030-5034.
[6] E. D. Green and M. V. Olson, Systematic screening of yeast artificial-chromosome libraries by use of the polymerase chain reaction, Proc. Nat. Acad. Sci. USA 87 (1990) 1213-1217.
[7] F. K. Hwang, A minimax procedure on group testing problems, Tamkang J. Math. 2 (1971) 39-44.
[8] F. K. Hwang, Hypergeometric group testing procedures and merging procedures, Bull. Inst. Math. Acad. Sinica 5 (1977) 335-343.
[9] F. K. Hwang, A note on hypergeometric group testing procedures, SIAM J. Appl. Math. 34 (1978) 371-375.
[10] F. K. Hwang, A method for detecting all defective members in a population by group testing, J. Amer. Statist. Assoc. 67 (1972) 605-608.
[11] F. K. Hwang and D. N. Deutsch, A class of merging algorithms, J. Assoc. Comput. Mach. 20 (1973) 148-159.
[12] F. K. Hwang, T. T. Song and D. Z. Du, Hypergeometric and generalized hypergeometric group testing, SIAM J. Alg. Disc. Methods 2 (1981) 426-428.
[13] D. E. Knuth, The Art of Computer Programming, Vol. 3 (Addison-Wesley, Reading, Mass. 1972).
[14] C. H. Li, A sequential method for screening experimental variables, J. Amer. Statist. Assoc. 57 (1962) 455-477.
[15] D. Raghavarao, Constructions and Combinatorial Problems in Designs of Experiments (Wiley, New York, 1971).
[16] M. Sobel and P. A. Groll, Group testing to eliminate efficiently all defectives in a binomial sample, Bell System Tech. J. 38 (1959) 1179-1252.

3 Algorithms for Special Cases

When $d$ is very small or very large, more is known about $M(d,n)$. $M(1,n) = \lceil \log n \rceil$ by Theorem 1.2.1, and by using binary splitting. Surprisingly, $M(2,n)$ and $M(3,n)$ are still open problems, although "almost" minimax algorithms are known. On the other hand one expects individual testing to be minimax when $n/d$ is small. It is known that the threshold value for this ratio lies between 21/8 and 3, and was conjectured to be 3.

3.1 Two Disjoint Sets Each Containing Exactly One Defective

Chang and Hwang [1], [2] studied the CGT problem of identifying two defectives in $A = \{A_1, \ldots, A_m\}$ and $B = \{B_1, \ldots, B_n\}$ where $A$ and $B$ are disjoint and each contains exactly one defective. At first, it seems that one cannot do better than work on the two disjoint sets separately. The following example shows that intuition is not always reliable for this problem.

Example 3.1. Let $A = \{A_1, A_2, A_3\}$ and $B = \{B_1, B_2, B_3, B_4, B_5\}$. If one identifies the defectives in $A$ and $B$ separately, then it takes $\lceil \log 3 \rceil + \lceil \log 5 \rceil = 2 + 3 = 5$ tests. However, the following algorithm shows that the two defectives can be identified in 4 tests.

Step 1. Test $\{A_1, B_1\}$. If the outcome is negative, then $A$ has two items and $B$ has four items left. Binary splitting will identify the two defectives in $\log 2 + \log 4 = 3$ more tests. Therefore, it suffices to consider the positive outcome.

Step 2. Test $B_1$. If the outcome is negative, then $A_1$ must be defective. The defective in the four remaining items of $B$ can be identified in 2 more tests. If the outcome is positive, then the defective in the three items of $A$ can be identified in 2 more tests.

Note that there are $3 \times 5 = 15$ samples $\{A_i, B_j\}$. Since $\lceil \log 15 \rceil = 4$, one certainly cannot do better than 4 tests.
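The two steps of Example 3.1 can be simulated over all 15 samples to confirm the worst case of 4 tests. This is a sketch for illustration: the helper names `binary_split` and `example_3_1` are hypothetical, and the particular halving rule inside `binary_split` is one of several equally valid choices.

```python
from itertools import product

A = ["A1", "A2", "A3"]
B = ["B1", "B2", "B3", "B4", "B5"]


def binary_split(candidates, contaminated):
    """Locate the single defective known to lie in candidates.
    Returns (defective, number of tests used)."""
    tests = 0
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        tests += 1
        candidates = half if contaminated(half) else candidates[len(half):]
    return candidates[0], tests


def example_3_1(a_def, b_def):
    """Run the algorithm of Example 3.1; returns (identified pair, tests used)."""
    defectives = {a_def, b_def}

    def contaminated(group):
        return any(x in defectives for x in group)

    if not contaminated(["A1", "B1"]):            # Step 1, negative outcome
        a, ta = binary_split(A[1:], contaminated)
        b, tb = binary_split(B[1:], contaminated)
        return {a, b}, 1 + ta + tb
    if contaminated(["B1"]):                      # Step 2, positive outcome
        a, ta = binary_split(A, contaminated)
        return {a, "B1"}, 2 + ta
    b, tb = binary_split(B[1:], contaminated)     # Step 2, negative: A1 is defective
    return {"A1", b}, 2 + tb


results = {(a, b): example_3_1(a, b) for a, b in product(A, B)}
worst = max(tests for _, tests in results.values())
print(worst)
```

Every sample is identified correctly, and no sample needs more than 4 tests.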


In general, the sample space is $A \times B$ which will also be denoted by $m \times n$ if $|A| = m$ and $|B| = n$. Does there always exist an algorithm to identify the two defectives in $A \times B$ in $\lceil \log mn \rceil$ tests? Chang and Hwang [2] answered in the affirmative.

A sample space is said to be A-distinct if no two samples in it share the same A-item $A_j$. Suppose $S$ is a sample space with

\[ |S| = 2^r + 2^{r-1} + \cdots + 2^{r-p} + q, \]

where $2^{r-p-1} > q \ge 0$. An algorithm $T$ for $S$ is called A-sharp if it satisfies the following conditions:

(i) $T$ solves $S$ in $r+1$ tests.
(ii) Let $v(i)$ be the $i$th node on the all-positive path of $T$, the path where every outcome is positive. Let $v'(i)$ be the child-node of $v(i)$ with the negative outcome. Then $|S(v'(i))| = 2^{r-i}$ for $i = 0, 1, \ldots, p$.
(iii) $|S(v(p+1))| = q$ and $S(v(p+1))$ is A-distinct.

If $|S| = 2^r$, then the above conditions are replaced by the single condition

(i') $T$ solves $S$ in $r$ tests.

Lemma 3.1.1 There exists an A-sharp algorithm for any A-distinct sample space.

Proof. Ignore the B-items in the A-distinct sample space. Since the A-items are all distinct, there is no restriction on the partitions. It is easily verified that there exists a binary splitting algorithm which is A-sharp. $\Box$

For $m$ fixed, define $n_k$ to be the largest integer such that $m n_k \le 2^k$. Clearly, there exists a $k$ for which $n_k = 1$.

Theorem 3.1.2 $M(m \times n_k) = k$ for all $n_k \ge 1$. Furthermore, if $n_k$ is odd, then there exists an A-sharp algorithm for the sample space $m \times n_k$.

Proof. By the definition of $n_k$ and Theorem 1.2.1, $M(m \times n_k) \ge k$. Therefore it suffices to prove $M(m \times n_k) \le k$. If $m$ is even, test half of $A$ and use induction. So, it suffices to consider odd $m$. The sample space $m \times 1$ is A-distinct. Therefore there exists an A-sharp algorithm by Lemma 3.1.1. For general $n_k > 1$, Theorem 3.1.2 is proved by induction on $n_k$. Note that $2^{k-2} < m n_{k-1} \le 2^{k-1} < m(n_{k-1} + 1)$ implies $2^{k-1} < m(2 n_{k-1})$

1 a n d nk_r

+ 1 .

is necessarily odd. Let

\[ m n_{k-r} = 2^{k-r-1} + 2^{k-r-2} + \cdots + 2^{k-r-p} + q, \]

where $0 \le q < 2^{k-r-p-1}$. Then

\[
\begin{aligned}
m n_k &= m(2^r n_{k-r} + 1) \\
&= 2^{k-1} + 2^{k-2} + \cdots + 2^{k-p} + 2^r q + m.
\end{aligned}
\]

Let $T$ be an A-sharp algorithm for the $(m \times n_{k-r})$-problem. The existence of $T$ is assured by the induction hypothesis. Let $v$ be the node on the all-positive path of $T$ associated with $q$ samples. Let $J$ be the set of $j$ such that $(A_i, B_j) \in S(v)$ for some $A_i$. For $j \in J$, let $L_j$ denote a set consisting of those $A_i$'s such that $(A_i, B_j) \in S(v)$. Since $S(v)$ is A-distinct, the $L_j$'s are disjoint.

An A-sharp algorithm is now given for the $(m \times n_k)$-problem. For notational convenience, write $n$ for $n_{k-r}$ and $n'$ for $n_k$. Then $B$ will refer to the set of $n$ items and $B'$ to the set of $n'$ items. Partition $B' - \{B_{n'}\}$ into $n$ groups of $2^r$ items $G_1, \ldots, G_n$. Consider $T$ truncated at the node $v$; i.e., delete the subtree rooted at $v$ from $T$. Let $T'$ be an algorithm for the $(m \times n')$-problem where $T'$ is obtained from $T$ by replacing each item $B_j$ in a test by the group $G_j$ and adding $B_{n'}$ to every group tested on the all-positive path. Then each terminal node of $T'$, except the node $v'$ corresponding with $v$, will be associated with a set of solutions $(A_i \times G_j)$ for some $i$ and $j$. Since the only uncertainty is on $G_j$ and $|G_j| = 2^r$, $r$ more tests suffice. Therefore, it suffices to give an A-sharp algorithm for the sample space

\[ S(v') = \Bigl\{ \bigcup_{j \in J} (L_j \times G_j) \Bigr\} \cup (A \times B_{n'}), \]

with $|S(v')| = 2^r q + m$. Let $G_{j1}, \ldots, G_{j2^r}$ denote the $2^r$ items in $G_j$. Define

\[ T_1 = \Bigl\{ \bigcup_{j \in J} G_{j1} \Bigr\} \cup R, \]

where $R$ is a subset of A-items not in any of the $L_j$, $j \in J$, with $|R| = 2^r q + m - 2^{k-p-1} - q$. Note that there are a total of $m - q$ A-items not in any of the $L_j$. It is now proved that $m - q \ge |R| \ge 0$. The former inequality follows immediately from the fact that

\[ 2^r q < 2^r 2^{k-r-p-1} = 2^{k-p-1}. \]

Furthermore, since

\[
\begin{aligned}
2^{k-1} &< m(n_{k-1} + 1) \\
&= m(2^{r-1} n_{k-r} + 1) \\
&= 2^{k-2} + 2^{k-3} + \cdots + 2^{k-p-1} + 2^{r-1} q + m,
\end{aligned}
\]

it follows that

\[ 2^{r-1} q + m > 2^{k-p-1}, \]

or

\[ |R| > 2^{r-1} q - q \ge 0. \]

Test $T_1$ at $S(v')$. Let $S_g$ and $S_d$ denote the partition of $S(v')$ according as to whether the outcome of $T_1$ is negative or positive. Then

\[ S_d = \Bigl\{ \bigcup_{j \in J} (L_j \times G_{j1}) \Bigr\} \cup (R \times B_{n'}), \]

with

\[
\begin{aligned}
|S_d| &= q + 2^r q + m - 2^{k-p-1} - q \\
&= 2^r q + m - 2^{k-p-1},
\end{aligned}
\]

and

\[ S_g = \Bigl\{ \bigcup_{w=2}^{2^r} \bigcup_{j \in J} (L_j \times G_{jw}) \Bigr\} \cup (\{A \setminus R\} \times B_{n'}), \]

with $|S_g| = |S(v')| - |S_d| = 2^{k-p-1}$. Since $S_d$ is A-distinct, there exists an A-sharp algorithm for $S_d$ by Lemma 3.1.1. It remains to be shown that $S_g$ can be done in $k-p-1$ tests. Note that $S_g$ can also be represented as

\[ S_g = \Bigl\{ \bigcup_{j \in J} \bigl( L_j \times \{ G_j \setminus \{G_{j1}\} \cup \{B_{n'}\} \} \bigr) \Bigr\} \cup \Bigl( \Bigl\{ (A \setminus R) \setminus \bigcup_{j \in J} L_j \Bigr\} \times B_{n'} \Bigr). \]

Since $|G_j \setminus \{G_{j1}\} \cup \{B_{n'}\}| = 2^r$, $|(A \setminus R) \setminus \bigcup_{j \in J} L_j|$ must also be a multiple of $2^r$. Partition $A - R - \{\bigcup_{j \in J} L_j\}$ into $2^r$ subsets of equal size, $H_1, \ldots, H_{2^r}$. Define

\[ T_w = \Bigl\{ \bigcup_{j \in J} G_{jw} \Bigr\} \cup H_w \quad \text{for } w = 2, 3, \ldots, 2^r. \]

Then the $T_w$'s are disjoint. By testing a sequence of proper combinations of $T_w$'s, it is easily seen that in $r$ tests $S_g$ can be partitioned into $2^r$ subsets consisting of

\[ S_w = \Bigl\{ \bigcup_{j \in J} (L_j \times G_{jw}) \Bigr\} \cup (H_w \times B_{n'}), \quad w = 2, \ldots, 2^r, \]

and

\[ S_1 = S_g \setminus \Bigl\{ \bigcup_{w=2}^{2^r} S_w \Bigr\}. \]

Furthermore, $S_w$ is A-distinct and $|S_w| = 2^{k-p-r-1}$ for each $w = 1, \ldots, 2^r$. Therefore each $S_w$ can be solved in $k-p-r-1$ more tests by Lemma 3.1.1. This shows that the algorithm just described for $S(v')$ is A-sharp. Therefore, the algorithm $T'$ plus the extension on $v'$ as described is A-sharp. $\Box$

Figure 3.1: An A-sharp algorithm for $A_3 \times B_{21}$ (panels: $T(A_3 \times B_3)$ and $T'(A_3 \times B_{21})$).

Corollary 3.1.3 $M(m \times n) = \lceil \log mn \rceil$ for all m and n.

The corollary follows from Theorem 3.1.2 by way of the easily verifiable fact that $M(m \times n)$ is monotone nondecreasing in n. It can be easily verified that if the $m \times n$ model is changed to A and B each containing "at least" one defective, and the problem is to identify one defective from each of A and B, then Corollary 3.1.3 remains true. Denote the new problem by $\bar{m} \times \bar{n}$. Then

Corollary 3.1.4 $M(\bar{m} \times \bar{n}) = \lceil \log mn \rceil$ for all m and n.

Ruszinko [11] studied the line version, i.e., A and B are ordered sets and a test group must consist of items from the top of the two orders. He showed that for m = 11 and $n = (2^{10k} - 1)/11$ (which is an integer by Fermat's theorem), the first test group must consist of 8 items from A and $2^{10k-1}$ items from B. But no second test can split the remaining samples into two parts both of sizes not exceeding $2^{10k-2}$. He also proved

Theorem 3.1.5 If $\log m + \log n - \lfloor \log m \rfloor - \lfloor \log n \rfloor > 0.8$, then $\lceil \log mn \rceil$ tests suffice.

Weng and Hwang [13] considered the case of k disjoint sets.


Theorem 3.1.6 Suppose that set $S_i$ has $2^{2^i} + 1$ items containing exactly one defective for $i = 0, 1, \ldots, m$. Then

$M(S_0 \times S_1 \times \cdots \times S_m) = 2^{m+1}$.

Proof. Since

$\prod_{i=0}^{m} (2^{2^i} + 1) = 2^{2^{m+1}-1} + 2^{2^{m+1}-2} + \cdots + 2^1 + 2^0 > 2^{2^{m+1}-1}$,

$M(S_0 \times S_1 \times \cdots \times S_m) \ge 2^{m+1}$ by Theorem 1.2.1.

The reverse inequality is proved by giving an algorithm which requires only $2^{m+1}$ tests. Let $I_i$ be an arbitrary item from $S_i$, $i = 0, 1, \ldots, m$. Consider the sequence of subsets $J_1, \ldots, J_{2^{m+1}-1}$, where $J_{2^k} = \{I_k\}$ for $k = 0, 1, \ldots, m$ and $J_{2^k + j} = \{I_k\} \cup J_j$ for $1 \le j < 2^k$. It is easily proved by induction that $\sum_{I_i \in J_j} 2^i = j$ for $1 \le j \le 2^{m+1} - 1$. The actual testing of groups is done in the reverse order of $J_j$ until either a negative outcome is obtained or $J_1$ is tested positive. Note that a subset is tested only when all subsets containing it have been tested. Therefore $J_{2^{m+1}-1}, \ldots, J_{j+1}$ all tested positive and $J_j$ negative imply that items in $J_j$ are good and items in $\{I_0, I_1, \ldots, I_m\} \setminus J_j$ are defective. Thus the $(S_0 \times S_1 \times \cdots \times S_m)$ problem is reduced to the $(\prod_{I_i \in J_j} (S_i \setminus \{I_i\}))$ problem. The total number of tests is

$2^{m+1} - j + \sum_{I_i \in J_j} 2^i = 2^{m+1}$.

If $J_{2^{m+1}-1}, \ldots, J_1$ all tested positive, then each $I_i$, $0 \le i \le m$, has been identified as a defective through individual testing. Therefore all defectives are identified while the total number of tests is simply $2^{m+1} - 1$. □
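The construction in the proof of Theorem 3.1.6 can be checked mechanically for small m. The following is a minimal sketch, not the authors' implementation: items $I_i$ are represented only by their indices i, the names `build_subsets` and `total_tests` are hypothetical, and the step "solve $S_i \setminus \{I_i\}$" is charged $2^i$ tests on the assumption that a single defective among $2^{2^i}$ items is found by binary splitting.

```python
def build_subsets(m):
    """Map j -> J_j (as a frozenset of indices i with I_i in J_j),
    for 1 <= j <= 2^(m+1) - 1, following J_{2^k} = {I_k} and
    J_{2^k + j} = {I_k} | J_j for 1 <= j < 2^k."""
    J = {}
    for k in range(m + 1):
        J[2 ** k] = frozenset([k])
        for j in range(1, 2 ** k):
            J[2 ** k + j] = frozenset([k]) | J[j]
    return J


def total_tests(m, defective_chosen):
    """Count tests when the chosen item I_i is defective exactly for
    the indices i in defective_chosen."""
    J = build_subsets(m)
    used = 0
    for j in range(2 ** (m + 1) - 1, 0, -1):   # reverse order of J_j
        used += 1
        if not (J[j] & defective_chosen):      # negative outcome
            # Each S_i minus I_i (i in J_j) has 2^(2^i) items and one
            # defective: 2^i more tests by binary splitting (assumed).
            return used + sum(2 ** i for i in J[j])
    return used                                # every J_j was positive


J = build_subsets(3)
assert all(sum(2 ** i for i in J[j]) == j for j in J)
print(total_tests(2, frozenset({1})))
```

For m = 2 the count is $2^{m+1} = 8$ for every choice of defective items except when all three chosen items are defective, where it drops to 7, in agreement with the last paragraph of the proof.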

3.2 An Application to Locating Electrical Shorts

A prevalent type of fault in the manufacture of electrical circuits is the presence of a short circuit ("short") between two nets of a circuit. Short testing constitutes a significant part of the manufacturing process. Several fault patterns can be described as variations or combinations of these types of faults. Applications of short testing range from printed circuit board testing and circuit testing to functional testing. Short testing procedures can have two different objectives: a detecting procedure aims simply to detect the presence of a short, while a locating procedure identifies the shorted pairs of nets. A short detector in common use is an apparatus involving two connecting leads which, when connected respectively to two nets or groups of nets, can detect, but not locate, the presence of a short between these nets or groups of nets. An obvious way to use this device for short location is the so-called n-square testing, which checks each pair of nets separately. If the circuit has n nets to start with, then this location procedure requires $\binom{n}{2}$ tests.
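The n-square procedure described above can be sketched as follows. This is an illustration under stated assumptions: nets are numbered 0 to n-1, the detector is simulated from a known list of shorted pairs, and the names `make_detector` and `n_square_testing` are hypothetical.

```python
from itertools import combinations


def make_detector(shorts):
    """Simulate a two-lead short detector: it reports whether some
    shorted pair has one net on lead A and the other on lead B."""
    shorted = {frozenset(pair) for pair in shorts}

    def detect(group_a, group_b):
        return any(frozenset((u, v)) in shorted
                   for u in group_a for v in group_b)

    return detect


def n_square_testing(n, detect):
    """Check every pair of nets separately; uses n*(n-1)/2 tests."""
    found, tests = [], 0
    for u, v in combinations(range(n), 2):
        tests += 1
        if detect([u], [v]):
            found.append((u, v))
    return found, tests


detect = make_detector([(1, 4)])
print(n_square_testing(6, detect))
```

With 6 nets the procedure always spends 15 tests, regardless of how few shorts are present, which is the inefficiency that group tests on sets of nets aim to avoid.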

Figure 3.2: A short detector (lead A connected to net u, lead B connected to net v).

Garey, Johnson, and So [7] proposed a procedure for detecting shorts between nets in printed circuit boards. They showed that no more than 5, 8, or 12 tests (depending on the particular assumptions) will ever be required, independent of the number of nets. However, it is not a locating procedure, and their assumptions allow only shorts occurring vertically or horizontally, not diagonally.

Skilling [12] proposed a clever locating procedure much more efficient than the n-square testing. In his method each net is tested individually against all other nets, collectively. Thus after n tests one obtains all the shorted nets, though not knowing to which other nets they are shorted. If there are d shorted nets, then $\binom{d}{2}$ more tests are required to determine which of these are shorted to which others. Thus the total number of tests is $n + \binom{d}{2}$, which can be significantly less than $\binom{n}{2}$ for n much larger than d.

Chen and Hwang [1] proposed a locating procedure, based on Theorem 3.1.2, which requires approximately $2(d+1) \log n$ tests. Consider a circuit board with n networks. Without loss of generality assume $n = 2^a$. This testing process can be mathematically described as follows: Let $[\sum_{i=1}^{k} (a_i \times b_i)]$ denote the configuration where there exist 2k disjoint sets with $a_1, b_1, \ldots, a_k, b_k$ nets such that there exists a shorted pair between the nets in $a_i$ and the nets in $b_i$ for at least one i. Let $[\sum_{i=1}^{k} a_i]$ denote the configuration where there exist k disjoint sets with

1 let $i_t$ denote the integer such that

itM\ < 2'* < fit ' +1 Since no integer i is a solution of m = 2' for t > 1, no ambiguity arises from the definition of it. By the information lower bound (Theorem 1.2.1), it is clearly an upper bound of n 2 i + 1 - 2' = 2' and /2* + 1\

. < 2* implies - it - 1 > 25 .

For t odd, i ( = 2 2 . Therefore

( T ) - ( V ) = i{(^-D^-(2^-D(2^-2)} =

5 {(»«+i + l)»«+i " 2(t t + i - 1) " 2 t + 1 + 3 - 2 ^ - 4 }

=

{(?1+12+1)-2'} + {3.2ifi-(8(+1

t t + i + 2^ _

2t > 212 for t = 2k + 1 > 13 .

Proof. Constructions for $c_t$, $t \le 11$, were given in [4] and will not be repeated here. These $c_t$ equal $i_t - 1$ for $t \ge 4$ or $i_t$ for $t \le 3$, hence equal $n_t(2)$ by Theorems 3.3.3 and 1.2.1. For $t \ge 12$, there exists a generic algorithm as follows:

1. Test a group of $c_t - c_{t-1}$ items. If the outcome is negative, follow the $c_{t-1}$ construction.

2. If the outcome of the first test is positive, next test a group of m items from the untested group (with $c_{t-1}$ items) where

$m = 39 \cdot 2^{k-6}$ for $t = 2k$, and $m = 55 \cdot 2^{k-6}$ for $t = 2k + 1$.

If the outcome is positive, the problem is reduced to the $(c_t - c_{t-1}) \times m$ problem. Since

$(c_t - c_{t-1}) m = (26 \cdot 2^{k-6})(39 \cdot 2^{k-6}) < 2^{2k-2}$ for $t = 2k$,
$(c_t - c_{t-1}) m = (37 \cdot 2^{k-6})(55 \cdot 2^{k-6}) < 2^{2k-1}$ for $t = 2k + 1$,

by Corollary 3.1.3, $t - 2$ more tests suffice.

3. If the outcome of the second test is negative, next test a group of $\lfloor m/2 \rfloor$ items from the untested group (with $c_{t-1} - m$ items). If the outcome is positive, the problem is reduced to the $(c_t - c_{t-1}) \times \lfloor m/2 \rfloor$ problem. Since

$(c_t - c_{t-1}) \lfloor m/2 \rfloor \le \frac{1}{2}(c_t - c_{t-1}) m < 2^{t-3}$,

$t - 3$ more tests suffice. If the outcome is negative, a total of

$c_t - m - \lfloor m/2 \rfloor \le 89 \cdot 2^{k-6} - 39 \cdot 2^{k-6} - 19 \cdot 2^{k-6} = 31 \cdot 2^{k-6} < c_{t-3}$ for $t = 2k$,
$c_t - m - \lfloor m/2 \rfloor \le 63 \cdot 2^{k-5} - 55 \cdot 2^{k-6} - 27 \cdot 2^{k-6} = 44 \cdot 2^{k-6} < c_{t-3}$ for $t = 2k + 1$

items is left; hence $t - 3$ more tests suffice. □

Corollary 3.3.5 $c_t / n_t(2) \ge 0.983$.

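Proof. For $t \le 11$, $c_t / n_t(2) = 1$. For $t \ge 12$, substituting $i_t - 1$ for $n_t(2)$ and using $i_t - 1 < 2^{(t+1)/2}$,

$c_t / n_t(2) \ge 89 \cdot 2^{k-6} / (i_t - 1) > 89 / (64\sqrt{2}) > 0.983$ for $t = 2k$,
$c_t / n_t(2) \ge 63 \cdot 2^{k-5} / (i_t - 1) > 63/64 > 0.984$ for $t = 2k + 1$. □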


Chang, Hwang and Weng also gave an improvement of the algorithm c. Let w be the algorithm resulting from the selection of the largest m satisfying < 2'~ 2 ,

(wt - wt-i)m

< 2'- 3 ,

ft-uH-iHfj i

m

i

( to /

2 IU< — I ^Wf—Wt-i

(

2«-3

\Wt—Wt-\

or equivalently, (wt - w^wt

- wt-3 + 2) - 3 • 2'" 3 < 0 .

Though this relaxation may decrease wt, it doesn't affect the asymptotic result. Define 5-3^2

«2 =

,

3

/8v^-7 12^2-9 \ r + p p,


a3

=

„ „ 9p 2 3 + 6p + — (lb

a4

18%/2 - 15


pq

27p\ -(-2+—)q>

=

as and

9p 2 a'3 = 3 + 6p + - i -

N o t e t h a t a! < 0 for p < p*. F u r t h e r m o r e , there exists a tv large enough such t h a t for all t >tp, i/2 + a3 < 0 ai2* + a 2 2 Also note t h a t tp is d e t e r m i n e d independent of q. Let q = 2('" +1 »/ 2 p T h e n it is easily verified 9 - > ° > P -

for t < tv

2 ( « + i)/2

T h e proof t h a t — >P-

2 ^

*»*>'»•

is by induction on t. It will be shown t h a t

satisfies t h e inequality (y - t c - O d / - u>,_3 + 2) - 3 • 2 ' - 3 < 0 . Since u>( is t h e largest integral solution of t h e relaxed inequality,

For easier writing define r = 2'^ 2 . T h e n ut < y2r, r / 2 - 3 / 2 . Hence

[(-

2(t+l)/2

ut-i

> r — 3/2 and ut-z

Ut + 1 - tOi_3 + 2 - 3 - 2 '

(,.^)^ + ,-(,-J)(r-l) (VS-Dpr + i + f - f f 1 a-\r2 — a 2 r + a 3 + a^r'1

+ a^r'

.^-i)

pr + 3 + — - 3gr-i

>


Since q > 0, air2 + a2r + a3 < air2 + a2r + a'3 < 0 for t > tp . Furthermore, -1 ,

a\r

-2

+ asr

-2 / °5

=

—a^r

=

a^r

= ^

2

I (

\

rI lSq

V.30 + 27p

r

\

j

( 30 + 27P - j ^

0

'

since l ^ l < 1 and 2^2 tp . ~

3.4 The 3-Defective Case

Chang, Hwang and Weng [4] gave the following result. T h e o r e m 3.4.1 There exists an algorithm k such that the (3,ht) problem can be solved in t tests where t +1

fort 8 ,

c2k + 2*- 2 + 1

for t = 3k > 9 ,

ht

$\lfloor (c_{2k+1} + c_{2k})/2 \rfloor + 3 \cdot 2^{k-3} + 1$ for $t = 3k + 1 \ge 10$, except $h_{12} = 26$, $h_{13} = 32$, $h_{14} = 40$ and $h_{15} = 52$, where $c_t$ are given in Theorem 3.3.4.

Proof. For $t \le 7$ let h be the individual testing algorithm. For $t = 3k + 2 \ge 8$ let h be the following algorithm: First test a group of $2^{k-1}$ items. If the outcome is negative, the number of remaining items is $h_{3k+2} - 2^{k-1} = h_{3k+1}$. Hence $3k + 1$ more tests suffice. If the outcome is positive, identify a defective in the contaminated group in $k - 1$ more tests. The number of remaining items is at most

$h_{3k+2} - 1 \le 108 \cdot 2^{k-6} + 3 \cdot 2^{k-3} + 2^{k-1} = 41 \cdot 2^{k-4} < c_{2k+2}$.


The total number of tests is at most $1 + (k - 1) + (2k + 2) = 3k + 2$.

For $t = 3k$ let h be the following algorithm:

1. First test a group of $3 \cdot 2^{k-3}$ items. If the outcome is negative, the number of remaining items is

$h_{3k} - 3 \cdot 2^{k-3} = 89 \cdot 2^{k-6} + 2^{k-2} + 1 - 3 \cdot 2^{k-3} = 81 \cdot 2^{k-6} + 1 \le \lfloor 107 \cdot 2^{k-7} \rfloor + 3 \cdot 2^{k-4} + 1 + 2^{k-2} \le h_{3k-1}$.

Hence $3k - 1$ more tests suffice.

2. If the outcome of the first test is positive, next test a group of $2^{k-2}$ items consisting of $2^{k-3}$ items from each of the contaminated group (with $3 \cdot 2^{k-3}$ items) and the untested group (with $h_{3k} - 3 \cdot 2^{k-3}$ items). If the outcome is negative, the size of the contaminated group is reduced to $2^{k-2}$. Identify a defective therein in $k - 2$ tests. The number of remaining items is at most

$h_{3k} - 2^{k-2} - 1 \le c_{2k}$.

The total number of tests is at most $2 + (k - 2) + 2k = 3k$.

3. If the outcome of the second test is positive, then the unidentified items are divided into four sets A, B, C, D of sizes $2^{k-2}$, $2^{k-3}$, $2^{k-3}$ and $h_{3k} - 2^{k-1}$ such that $A \cup B$ and $B \cup C$ each contains a defective. Next test the set A. If the outcome is negative, then B must contain a defective. Identify it in $k - 3$ tests. The number of remaining items is at most

$2^{k-3} - 1 + 2^{k-3} + h_{3k} - 2^{k-1} = h_{3k} - 2^{k-2} - 1$,

which has been shown to be at most $c_{2k}$. The total number of tests is at most $3 + (k - 3) + 2k = 3k$. If the outcome is positive, then identify a defective in each of A and $B \cup C$ in $k - 2$ tests each. The number of remaining items is at most

$h_{3k} - 2 = c_{2k} + 2^{k-2} - 1 = 89 \cdot 2^{k-6} + 2^{k-2} - 1 < 2^{k+1}$.

Hence $k + 1$ more tests identify the last defective. The total number of tests is at most $3 + (k - 2) + (k - 2) + (k + 1) = 3k$.


Finally, for $t = 3k + 1 \ge 10$, let h be the following algorithm: First test a group of $h_{3k+1} - h_{3k}$ items. If the outcome is negative, there are $h_{3k}$ items left and the problem can be solved in 3k more tests. If the outcome is positive, the contaminated group contains

h3k+1-h3k

= [3*±p^j + 2*-s L

63-44 J 2

+ 22

[37^!j + 2*-3

for k = 5 , for

k

>6 .

It is easily verified that $2^{k-2} < h_{3k+1} - h_{3k} < 2^{k-1}$. By Lemma 2.3.3, a defective can be identified from the contaminated group either in $k - 2$ tests, or in $k - 1$ tests with at least $2^{k-1} - (h_{3k+1} - h_{3k})$ good items also identified. In the first case, the number of remaining items is at most $h_{3k+1} - 1$, which can be verified to be $< c_{2k+2}$ for all $k \ge 3$. In the second case, the number of remaining items is

$h_{3k+1} - 1 - [2^{k-1} - (h_{3k+1} - h_{3k})] = 2h_{3k+1} - h_{3k} - 2^{k-1} - 1$

3d.

Proof. It suffices to prove Theorem 3.5.1 for $M(d, 3d + 1)$ since $M(d, n)$ is nondecreasing in n. For d = 1, $M(1, 4) = 2 < 4 - 1$ by binary splitting. The general case is proved by induction on d. Let T be a nested (line) algorithm which always tests a group of two items (unless only one is left) if no contaminated group exists. When a contaminated group of two items exists, T identifies a defective in it by another test. In the case that the item in the contaminated group being tested is good, then a good item is identified at no additional cost. All other good items are identified in pairs, except that those good items appearing after the last defective are identified by deduction without testing. Note that T has at most 3d tests since the d defectives require 2d tests and another d tests identify 2d good items. As the number of good items is odd, one of them will either be identified for free or is the last item, for which no test is needed (since it appears after the last defective). Suppose that T takes 3d tests. Then before the last defective is identified, at least 2d good items must have been identified. If the last defective is identified with a free good item, then no more good items appear after the last defective; otherwise, at most one good item appears after. In either case, before the last two tests which identify the last defective, at most two items, one good and one defective, are unidentified. By testing one of them the nature of the other item is deduced. Thus only $3d - 1$ tests are needed. □

The latest progress towards proving the $n \le 3d$ conjecture is a result by Du and Hwang [6] which shows that individual testing is minimax for $n \le \lfloor 21d/8 \rfloor$. Consider the following minimization problem. Let m and n be relatively prime positive integers, l an integer satisfying $0 \le l \le m + n - 2$, and $\lambda$ a positive number. Define $l_1 = \lfloor m(l + 1)/(m + n) \rfloor$. The problem is to locate the minimum of

$F(k) = \binom{(m+n)k + l}{mk + l_1} \lambda^k$

over the nonnegative integers $k = 0, 1, 2, \ldots$.


Let l2 = ln(l + l)/(m + n)\. Since m and n are relatively prime and / + 1 < m + n, neither m(l + l ) / ( m + n) nor n(l + l ) / ( m + n) can be an integer. Therefore

m(/ + l)

1 1 or c < 2(1 + l ) / ( m + n + 2(1 + 1)), f(x) = 1 has no nonnegative solution. If 1 > c > 1 — 1/2M, f(x) = 1 has a unique nonnegative solution, lying in the interval (1/2(1 — c) — M,c/2(1 — c) — (I + l ) / ( m + n)). If 1 - 1/2M > c > 2(1 + l ) / m + n + 2(1 + 1), f(x) = 1 either has no nonnegative solution or has a unique one, lying in the interval [0,c/2(l — c) — (I + l)/(m + n)]. Proof. See [6]. Theorem 3.5.3 M(d,n)

= n - 1 for n

213k~2 is proved by showing that l'K \ ,„i3i_2

min - ) / 2 —

>1


Theorem 3.5.3 then follows from Corollary 1.5.10 by setting / = t. Define

Then m = 4, n = 13, I = h = 0, A = 2" 1 3 . Compute M = max

'm + h

,

n + l2\

= 1,

mnnn M > = 7 TTTT = 0-7677 > 1 - — = 0.5 . (m + n)m+n\ 2 Therefore, from Theorem 3.5.2, f(k) = 1 has a unique nonnegative solution in the interval (1.15, 2.59). Namely, F(k), and hence r47fc*)/213k~2 attains a minimum at k = 2. Thus we have 1

c

T'( 1 4 7 *)/ 2iafc " a = ( ? ) / 2 M = L08>1 As the proofs for the other seven cases are analogous to case (i) but with different parameter values, only the values of the parameters are given in each case without further details. K° will always denote the value of k that minimizes fcj)/2"~2t~2. Case (ii). d = 8k + 1, n = 21k + 2, t = 4k + 1, 1.08 < K° < 2.54. 3 17* + A ,ol3*-2 a _ f ^

/o24

T V 4* J/^- =UJ/2M = l-*0>l. Case (iii). d = 8k + 2, n = 21Jfc + 5, t = 4fc + 2, 0.92 < K° < 2.42. 17fc + A ,„ *-i _ m i n ^ ' y " J / 213 - - 1 =

m

. J / 2 0 \ / o l 2 _ , 1 fi / 3 7 \ / o 2 5 i n | ^ j / 2 - = 1.18, ^ j / 2 " = 1.15J > 1 .

Case (iv). d = 8A; + 3, n = 21k + 7, t = 4k + 2, 0.84 < A'0 < 2.30.

Case (v). d = 8A; + 4, n = 21fc + 10, t = 4k + 3, 0.69 < K° < 2.18.

Case (vi). d = 8fc + 5, n = 21k + 13, t = 4k + 3, 0.54 < K° < 2.01.


Case (vii). d = 8k + 6, n = 21k + 15, t = 4k + 3, 0.40 < K° < 1.88.

Case (viii).


Printed in Singapore by JBW Printers & Binders Pte. Ltd.

Preface

Group testing has been around for fifty years. It started as an idea to do large scale blood testing economically. When such needs subsided, group testing stayed dormant for many years until it was revived with needs for new industrial testing. Later, group testing also emerged from many nontesting situations, such as experimental designs, multiaccess communication, coding theory, clone library screening, nonlinear optimization, computational complexity, etc. With a potential world-wide outbreak of AIDS, group testing just might go full cycle and become an effective tool in blood testing again. Another fertile area for application is testing zonal environmental pollution.

Group testing literature can be generally divided into two types, probabilistic and combinatorial. In the former, a probability model is used to describe the distribution of defectives, and the goal is to minimize the expected number of tests. In the latter, a deterministic model is used and the goal is usually to minimize the number of tests under a worst-case scenario. While both types are important, we will focus on the second type in this book because of the different flavors of these two types of results.

To find optimal algorithms for combinatorial group testing is difficult, and there are not many optimal results in the existing literature. In fact, the computational complexity of combinatorial group testing has not been determined. We suspect that the general problem is hard in some complexity class, but do not know which class. (It is known that the problem belongs to the class PSPACE, but it seems not to be PSPACE-complete.) The difficulty is that the input consists of two or more integers, which is too simple for complexity analysis. However, even if a proof of hardness is eventually given, this does not spell the end of the subject, since the subject has many, many branches each posing a different set of challenging problems.

This book is not only the first attempt to collect all theory and applications about combinatorial group testing in one place, but it also carries the personal perspective of the authors, who have worked on this subject for a quarter of a century. We hope that this book will provide a forum and focus for further research on this subject, and also be a source for references and publications. Finally, we thank E. Barillot, A.T. Borchers, R.V. Book, G.J. Chang, F.R.K. Chung, A.G. Dyachkov, D. Kelley, K.-I Ko, M. Parnes, D. Raghavarao, M. Ruszinko, V.V. Rykov, J. Spencer, M. Sobel, U. Vaccaro, and A.C. Yao for giving us encouragement and helpful discussions at various stages of the formation of this book. Of course, the oversights and errors are our sole responsibility.


Contents

Preface

Chapter 1 Introduction
1.1 The History of Group Testing
1.2 The Binary Tree Representation of a Group Testing Algorithm and the Information Lower Bound
1.3 The Structure of Group Testing
1.4 Number of Group Testing Algorithms
1.5 A Prototype Problem and Some Basic Inequalities
1.6 Variations of the Prototype Problem
References

Chapter 2 General Algorithms
2.1 Li's s-Stage Algorithm
2.2 Hwang's Generalized Binary Splitting Algorithm
2.3 The Nested Class
2.4 (d, n) Algorithms and Merging Algorithms
2.5 Some Practical Considerations
2.6 An Application to Clone Screenings
References

Chapter 3 Algorithms for Special Cases
3.1 Two Disjoint Sets Each Containing Exactly One Defective
3.2 An Application to Locating Electrical Shorts
3.3 The 2-Defective Case
3.4 The 3-Defective Case
3.5 When is Individual Testing Minimax?
3.6 Identifying a Single Defective with Parallel Tests
References

Chapter 4 Nonadaptive Algorithms and Binary Superimposed Codes
4.1 The Matrix Representation
4.2 Basic Relations and Bounds
4.3 Constant Weight Matrices and Random Codes
4.4 General Constructions
4.5 Special Constructions
References

Chapter 5 Multiaccess Channels and Extensions
5.1 Multiaccess Channels
5.2 Nonadaptive Algorithms
5.3 Two Variations
5.4 The k-Channel
5.5 Quantitative Channels
References

Chapter 6 Some Other Group Testing Models
6.1 Symmetric Group Testing
6.2 Some Additive Models
6.3 A Maximum Model
6.4 Some Models for d = 2
References

Chapter 7 Competitive Group Testing
7.1 The First Competitiveness
7.2 Bisecting
7.3 Doubling
7.4 Jumping
7.5 The Second Competitiveness
7.6 Digging
7.7 Tight Bound
References

Chapter 8 Unreliable Tests
8.1 Ulam's Problem
8.2 General Lower and Upper Bounds
8.3 Linearly Bounded Lies (1)
8.4 The Chip Game
8.5 Linearly Bounded Lies (2)
8.6 Other Restrictions on Lies
References

Chapter 9 Optimal Search in One Variable
9.1 Midpoint Strategy
9.2 Fibonacci Search
9.3 Minimum Root Identification
References

Chapter 10 Unbounded Search
10.1 Introduction
10.2 Bentley-Yao Algorithms
10.3 Search with Lies
10.4 Unbounded Fibonacci Search
References

Chapter 11 Group Testing on Graphs
11.1 On Bipartite Graphs
11.2 On Graphs
11.3 On Hypergraphs
11.4 On Trees
11.5 Other Constraints
References

Chapter 12 Membership Problems
12.1 Examples
12.2 Polyhedral Membership
12.3 Boolean Formulas and Decision Trees
12.4 Recognition of Graph Properties
References

Chapter 13 Complexity Issues
13.1 General Notions
13.2 The Prototype Problem is in PSPACE
13.3 Consistency
13.4 Determinacy
13.5 On Sample Space S(n)
13.6 Learning by Examples
References

Index

1 Introduction

Group testing has been around for fifty years. While group testing literature traditionally employs probabilistic models, the combinatorial model has carved out its own share and become an important part of the literature. Furthermore, combinatorial group testing has ties with many computer science subjects: complexity theory, computational geometry and learning models, among others. It has also been used in multiaccess communication and coding.

1.1 The History of Group Testing

Unlike many other mathematical problems which can be traced back to earlier centuries and divergent sources, the origin of group testing is pretty much pinned down to a fairly recent event, World War II, and is usually credited to a single person, Robert Dorfman. The following is his recollection after 50 years (quoted from a November 17, 1992 letter in response to our inquiry about the role of Rosenblatt):

"The date was 1942 or early '43. The place was Washington, DC in the offices of the Price Statistics Branch of the Research Division of the Office of Price Administration, where David Rosenblatt and I were both working. The offices were located in a temporary building that consisted of long wings, chock-full of desks without partitions. The drabness of life in those wings was relieved by occasional bull sessions. Group testing was first conceived in one of them, in which David Rosenblatt and I participated. Being economists, we were all struck by the wastefulness of subjecting blood samples from millions of draftees to identical analyses in order to detect a few thousand cases of syphilis. Someone (who?) suggested that it might be economical to pool the blood samples, and the idea was batted back and forth. There was lively give-and-take and some persiflage. I don't recall how explicitly the problem was formulated there. What is clear is that I took the idea seriously enough so that in the next few days I formulated the underlying probability problem and worked through the algebra (which is pretty elementary). Shortly after, I wrote it up, presented it at a meeting of the Washington Statistical Association, and submitted the four-page note that was published in the Annals of Mathematical Statistics. By the time the note was published, Rosenblatt and I were both overseas and out of contact."

We also quote from an October 18, 1992 letter from David Rosenblatt to Milton Sobel which provides a different perspective.

"It is now (Fall of 1992) over fifty years ago when I first invented and propounded the concepts and procedure for what is now called "group testing" in Statistics. I expounded it in the Spring of 1942 the day after I reported for induction during World War II and underwent blood sampling for the Wasserman test. I expounded it before a group of fellow statisticians in the division of Research of the Office of Price Administration in Washington, D.C. Among my auditors that morning was my then colleague Robert Dorfman."

Considering that fifty years have lapsed between the event and the recollections, we find the two accounts reasonably close to each other. Whatever discrepancies there are, they are certainly within the normal boundary of differences associated with human memory. Thus that "someone" (in Dorfman's letter) who first suggested pooling blood samples could very well be David Rosenblatt. It is also undisputed that Dorfman alone wrote the seminal report [1], published in the Notes Section of the journal Annals of Mathematical Statistics, which gave a method intended to be used by the United States Public Health Service and the Selective Service System to weed out all syphilitic men called up for induction. We quote from [1]:

"Under this program each prospective inductee is subjected to a 'Wasserman-type' blood test. The test may be divided conveniently into two parts: 1. A sample of blood is drawn from the man, 2.
The blood sample is subjected to a laboratory analysis which reveals the presence or absence of "syphilitic antigen." The presence of syphilitic antigen is a good indication of infection. When this procedure is used, n chemical analyses are required in order to detect all infected members of a population of size n. The germ of the proposed technique is revealed by the following possibility. Suppose that after the individual blood sera are drawn they are pooled in groups of, say, five and that the groups rather than the individual sera are subjected to chemical analysis. If none of the five sera contributing to the pool contains syphilitic antigen, the pool will not contain it either and will test negative. If, however, one or more of the sera contain syphilitic antigen, the pool will also contain it and the group test will reveal its presence (the author inserted a note here saying that diagnostic tests for syphilis are extremely sensitive and will show positive results for even great dilutions of antigen). The individuals making up the pool must then be retested to determine which of the members are infected. It is not necessary to draw a new blood sample for this purpose since sufficient blood for both the test and the retest can be taken at once. The chemical analysis requires only small quantities of blood."
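The two-stage procedure in the quoted passage (pool, then retest each member of a positive pool) can be sketched in a few lines. This is an illustration only, assuming a perfectly accurate test and a fixed pool size of five as in the quotation; the function name `dorfman_two_stage` and the boolean-list representation of the population are hypothetical.

```python
def dorfman_two_stage(infected, pool_size=5):
    """Pool samples in consecutive groups; retest every member of a
    positive pool individually. Returns (infected indices, tests used).

    `infected` is a list of booleans, one per individual. The pooled
    test is assumed to be positive exactly when some member is infected.
    """
    tests = 0
    positives = []
    for start in range(0, len(infected), pool_size):
        pool = range(start, min(start + pool_size, len(infected)))
        tests += 1                      # one test on the pooled sample
        if any(infected[i] for i in pool):
            for i in pool:              # retest each member separately
                tests += 1
                if infected[i]:
                    positives.append(i)
    return positives, tests


population = [False] * 20
population[7] = True
print(dorfman_two_stage(population))
```

With 20 individuals and a single infected one, the sketch uses 4 pooled tests plus 5 retests, 9 in total, instead of 20 individual tests.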

Figure 1.1: The idea of group testing. (Panel captions: "If one or more of the sera in the pool contain syphilitic antigen, then a test may be wasted." and "But, if no one in the pool contains syphilitic antigen, then many tests are saved.")

Unfortunately, this very promising idea of grouping blood samples for syphilis screening was not actually put to use. The main reason, communicated to us by C. Eisenhart, was that the test was no longer accurate when as few as eight or nine samples were pooled. Nevertheless, test accuracy could have been improved over the years, or might not be a serious problem in screening for another disease. Therefore we quoted the above from Dorfman at length not only because of its historical significance, but also because in this age of a potential AIDS epidemic, Dorfman's clear account of applying group testing to screen syphilitic individuals may have a new impact on the medical world and the health service sector.


Dorfman's blood testing problem found its way into a very popular textbook on probability as an exercise, Feller's 1950 book "An Introduction to Probability Theory and Its Applications, Vol. I" [2], and thus might have survived as a coffee break talk-piece in academic circles in those days. But by and large, with the conclusion of the Second World War and the release of millions of inductees, the need for group testing disappeared from the Selective Service and the academic world was ready to bury it as a war episode. The only exception was a short note published by Sterrett [9] in 1951 based on his Ph.D. dissertation at the University of Pittsburgh. Then Sobel and Groll [8], two Bell Laboratories scientists, gave the phrase "group testing" new meaning by giving the subject a very thorough treatment and breaking much new ground for future studies in their 74-page paper. Again, they were motivated by a practical need, this time from the industrial sector, to remove all leakers from a set of n devices. We quote from Sobel and Groll: "One chemical apparatus is available and the devices are tested by putting x of them (where 1 ≤ x ≤ n) in a bell jar and testing whether any of the gas used in constructing the devices has leaked out into the bell jar. It is assumed that the presence of gas in the bell jar indicates only that there is at least one leaker and that the amount of gas gives no indication of the number of leakers." Sobel and Groll also mentioned other industrial applications such as testing condensers and resistors. The main idea is very well demonstrated by the Christmas tree lighting problem. A batch of light bulbs is electrically arranged in series and tested by applying a voltage across the whole batch or any subset thereof. If the lights are on, then the bulbs in the tested subset must all be good; if the lights are off, then at least one bulb in the subset is defective. Call the set of defectives among the n items the defective set.
Dorfman, as well as Sobel and Groll, studied group testing under probabilistic models, namely, a probability distribution is attached to the defective set and the goal is to minimize the expected number of tests required to identify the defective set. Katona [5] first emphasized the combinatorial aspects of group testing. However, his coverage was predominantly for the case of a single defective and he considered probability distributions of defectives. In this volume a more restrictive viewpoint on combinatorial group testing (CGT) is taken by completely eliminating probability distributions on defectives. The presumed knowledge on the defective set is that it must be a member, called a sample, of a given family called a sample space. For example, the sample space can consist of all d-subsets of the n items, when the presumed knowledge is that there are exactly d defectives among the n items. This sample space is denoted by S(d, n). An item is said to be defective (or good) in a sample if it is in (or not in) the sample. The goal in CGT is to minimize the number of tests under the worst scenario. A best algorithm under this goal is called a minimax algorithm. The reason that probabilistic models are excluded is not because they are less important or less interesting, but simply

because there is so much to tell about group testing and this is a natural way to divide the material. Li [6] was the first to study CGT. He was concerned with the situation where industrial and scientific experiments are conducted only to determine which of the variables are important. Usually, only a relatively small number of critical variables exists among a large group of candidates. These critical variables are assumed to have effects too large to be masked by the experimental error, or by the combined effect of the unimportant variables. Interpreting each variable as an item, each critical variable as a defective, and each experiment as a group test, a large effect from an experiment indicates the existence of a critical variable among the variables covered by the experiment. Li assumed that there are exactly d critical variables to start with, and set out to minimize the worst-case number of tests. Since Li, CGT has been studied alongside probabilistic group testing (PGT) for those classical applications in medical, industrial and statistical fields. Recently, CGT has also been studied in complexity theory, graph theory, learning models, communication channels and fault-tolerant computing. While it is very encouraging to see such wide interest in group testing, one unfortunate consequence is that the results obtained are fragmented and submerged in the jargon of the particular fields. This book is an attempt to give a unified and coherent account of up-to-date results in combinatorial group testing.

1.2 The Binary Tree Representation of a Group Testing Algorithm and the Information Lower Bound

A binary tree can be inductively defined as a node, called the root, together with two disjoint binary trees, called the left and right subtrees of the root, either both empty or both nonempty. Nodes occurring in the two subtrees are called descendants of the root (the two immediate descendants are called children), and all nodes having a given node as a descendant are its ancestors (the immediate ancestor is called the parent). Two children of the same parent are siblings. Nodes which have no descendants are called leaves, and all other nodes are called internal nodes. The path length of a node is the number of that node's ancestors. A node is said to be at level $l$ if its path length is $l - 1$. The depth of a binary tree is the maximal level over all leaves. Let S denote the sample space. Then a group testing algorithm T for S can be represented by a binary tree, also denoted by T, by the following rules: (i) Each internal node u is associated with a test t(u); its two links are associated with the two outcomes of t(u) (we will always designate the negative outcome by the left link). The test history H(u) of a node u is the set of tests and outcomes associated with the nodes and links on the path of u. (ii) Each node u is also associated with an event S(u) which consists of all the members of S consistent with H(u). $|S(v)| \le 1$ for each leaf v.

[Figure 1.2: The binary tree representation. Internal nodes are tests and leaves are samples; a link labeled 0 is a negative outcome and a link labeled 1 is a positive outcome.]

Since the test t(u) simply splits S(u) into two disjoint subsets, every member of S must appear in one and only one S(v) for some leaf v. An algorithm is called reasonable if no test whose outcome is predictable is allowed. For a reasonable algorithm each S(u) is split into two nonempty subsets. Thus $|S(v)| = 1$ for every leaf v, i.e., there exists a one-to-one mapping between S and the leaf-set. Let p(s) denote the path of the leaf v associated with the sample s and let $|p(s)|$ denote its length. Then $M_T(S) = \max_{s \in S} |p(s)|$ is the depth of T. Since the same subset will not be tested more than once in a reasonable algorithm, the number of tests, and consequently the number of reasonable algorithms, is finite. Therefore we can define $M(S) = \min_T M_T(S)$. An algorithm which achieves M(S) is called a minimax algorithm for S. The goal of CGT is to find a minimax algorithm for a given S; usually, we settle for a good heuristic. The determination of M(S) and obtaining good bounds for it are challenging problems and unsettled for most S. Let $\lceil x \rceil$ ($\lfloor x \rfloor$) denote the smallest (largest) integer not less (greater) than x. Also let $\log x$ mean $\log_2 x$ throughout unless otherwise specified.

Theorem 1.2.1 $M(S) \ge \lceil \log |S| \rceil$.

Proof. Consider a group testing algorithm T for the problem S. At each internal node u of T, the test t(u) splits S(u) into two disjoint subsets according to whether the outcome is negative or positive. The cardinality of one of the two subsets must be at least half of $|S(u)|$. Therefore $M(S) \ge \lceil \log |S| \rceil$. □

We will refer to the bound in Theorem 1.2.1 as the information lower bound.

Lemma 1.2.2 $\sum_{s \in S} 2^{-|p(s)|} = 1$ for every reasonable algorithm.

Proof. True for $|S| = 1$. A straightforward induction on $|S|$ proves the general case. □

Theorem 1.2.1 and Lemma 1.2.2 are well known in the binary tree literature. The proofs are included here for completeness. A group testing algorithm T for the sample space S is called admissible if there does not exist another algorithm T' for S such that

$$|p(s)| \ge |p'(s)| \quad \text{for all } s \in S$$

with at least one strict inequality true, where p'(s) is the path of s in T'.

Theorem 1.2.3 A group testing algorithm is admissible if and only if it is reasonable.

Proof. That "reasonable" is necessary is obvious. To show sufficiency, suppose to the contrary that T is an inadmissible reasonable algorithm, i.e., there exists T' such that

$$|p(s)| \ge |p'(s)| \quad \text{for all } s \in S$$

with at least one strict inequality true. Then

$$\sum_{s \in S} 2^{-|p'(s)|} > \sum_{s \in S} 2^{-|p(s)|} = 1,$$

a contradiction to Lemma 1.2.2. □

From now on only admissible, or reasonable, algorithms are considered in the volume. The word "algorithm" implies an admissible algorithm.
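
The information lower bound and the identity of Lemma 1.2.2 are easy to check numerically on a small tree. The following Python sketch is only an illustration under stated assumptions: the nested-tuple encoding of a tree and the names `info_lower_bound`, `leaf_depths` and `tree` are hypothetical choices made here, not notation from the text.

```python
from fractions import Fraction
from math import comb

def info_lower_bound(num_samples):
    """ceil(log2 |S|), computed with exact integer arithmetic."""
    return (num_samples - 1).bit_length()

def leaf_depths(tree, depth=0):
    """Yield (sample, path length) for every leaf of a nested-tuple tree.

    An internal node is a tuple (test group, negative subtree, positive subtree);
    a leaf is the identified sample, stored as a frozenset.
    """
    if isinstance(tree, tuple):
        _, negative, positive = tree
        yield from leaf_depths(negative, depth + 1)
        yield from leaf_depths(positive, depth + 1)
    else:
        yield tree, depth

# A reasonable algorithm for S(1, 4): binary splitting on the items 1..4.
tree = (
    {1, 2},
    ({3}, frozenset({4}), frozenset({3})),   # {1, 2} negative: defective is 3 or 4
    ({1}, frozenset({2}), frozenset({1})),   # {1, 2} positive: defective is 1 or 2
)

depths = dict(leaf_depths(tree))
kraft_sum = sum(Fraction(1, 2 ** d) for d in depths.values())
depth_of_tree = max(depths.values())
bound_for_six_items_two_defectives = info_lower_bound(comb(6, 2))
```

Here `kraft_sum` is the sum in Lemma 1.2.2 and `depth_of_tree` is $M_T(S)$ for this particular tree, which meets the information lower bound `info_lower_bound(4)`.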

1.3 The Structure of Group Testing

The information lower bound is usually not achievable. For example, consider a set of six items containing exactly two defectives. Then $\lceil \log |S| \rceil = \lceil \log \binom{6}{2} \rceil = 4$. If a subset of one item is tested, the split is 10 (negative) and 5 (positive); it is 6 and 9 for a subset of two, 3 and 12 for a subset of three, and 1 and 14 for a subset of four. In each case the larger part has more than $2^3 = 8$ members, so by Theorem 1.2.1 at least four more tests are required after the first one, for a total of at least five.

The reason that the information lower bound cannot be achieved in general for group testing is that the split of S(u) at an internal node u is not arbitrary, but must be realizable by a group test. Therefore it is of importance to study which types of splitting are permissible in group testing. The rest of this section reports work done by Hwang, Lin and Mallows [4].

While a group testing algorithm certainly performs the tests in the order starting from the root of the tree and proceeding to the leaves, the analysis is often more convenient if started from the leaves (that is the way the Huffman tree, a minimum weighted binary tree, is constructed). Thus instead of asking what splits are permissible, the question becomes: for two children nodes x, y of u, what types of S(x) and S(y) are permitted to merge into S(u)?

Let N denote a set of n items, D the defective set and $S_0 = \{D_1, \ldots, D_k\}$ the initial sample space. Without loss of generality assume $\bigcup_{i=1}^{k} D_i = N$, for any item not in $\bigcup_{i=1}^{k} D_i$ can immediately be identified as good and deleted from N. A subset $S_1$ of $S_0$ is said to be realizable if there exists a group testing tree T for $S_0$ and a node u of T such that $S(u) = S_1$. A partition $\pi = (S_1, \ldots, S_m)$ of $S_0$ is said to be realizable if there exists a group testing tree T for $S_0$ and a set of m nodes $(u_1, \ldots, u_m)$ of T such that $S(u_i) = S_i$ for $1 \le i \le m$. Define $\|S\| = \bigcup_{D_i \in S} D_i$, $\overline{\|S\|} = N \setminus \|S\|$, the complement of $\|S\|$, and $\overline{S} = \{A \mid A \subseteq \|S\|,\ A \supseteq \text{some } D_i \in S\}$, the closure of S. Furthermore, let $\pi = (S_1, S_2, \ldots, S_m)$ be a partition of $S_0$, i.e., $S_i \cap S_j = \emptyset$ for all $i \ne j$, and $\bigcup_{S_i \in \pi} S_i = S_0$. Then $S_i$ and $S_j$ are said to be separable if there exists an $I \subseteq N$ such that $I \cap D = \emptyset$ for all $D \in S_i$, and $I \cap D \ne \emptyset$ for all $D \in S_j$ (or vice versa).

Given $\pi$, define a directed graph $G_\pi$ by taking each $S_i$ as a node, and a directed edge from $S_i$ to $S_j$ ($S_i \to S_j$), $i \ne j$, if and only if there exist $A_i \in \overline{S_i}$, $A_j \in \overline{S_j}$ such that $A_i \subseteq A_j$.

Theorem 1.3.1 The following statements are equivalent:
(i) $S_i$ and $S_j$ are not separable.
(ii) $S_i \to S_j \to S_i$ in $G_\pi$.
(iii) $\overline{S_i} \cap \overline{S_j} \ne \emptyset$.

Proof. By showing (i) ⇒ (ii) ⇒ (iii) ⇒ (i).

(i) ⇒ (ii): It is first shown that if $S_i \not\to S_j$, then $I = \overline{\|S_j\|} \cap \|S_i\| \ne \emptyset$ and I separates $S_i$, $S_j$. Clearly, if $I = \emptyset$, then $\|S_i\| \subseteq \|S_j\|$; since $\|S_i\| \in \overline{S_i}$ and $\|S_j\| \in \overline{S_j}$, it follows that $S_i \to S_j$, a contradiction. Thus $I \ne \emptyset$. Also, $I \cap D = \emptyset$ for all $D \in S_j$ since $I \subseteq \overline{\|S_j\|}$. Furthermore, if there is a $D \in S_i$ such that $I \cap D = \emptyset$, then from the definition of I, $\overline{\|S_j\|} \cap \|S_i\| \cap D = \overline{\|S_j\|} \cap D = \emptyset$ and hence $D \subseteq \|S_j\| \in \overline{S_j}$. This implies $S_i \to S_j$, again a contradiction. Therefore $I \cap D \ne \emptyset$ for all $D \in S_i$ and I separates $S_i$, $S_j$. The proof is similar if $S_j \not\to S_i$.

(ii) ⇒ (iii): Suppose $S_i \to S_j \to S_i$. Let $A_i, A_i' \in \overline{S_i}$ and $A_j, A_j' \in \overline{S_j}$ be such that $A_i \subseteq A_j$ and $A_i' \supseteq A_j'$. Furthermore, let $u = \|S_i\| \cap \|S_j\|$. Then $u \ne \emptyset$. Since $A_i \subseteq A_j \subseteq \|S_j\|$, $A_i \subseteq \|S_i\| \cap \|S_j\| = u \subseteq \|S_i\|$ and thus $u \in \overline{S_i}$. A similar argument shows $u \in \overline{S_j}$. Therefore $\overline{S_i} \cap \overline{S_j} \ne \emptyset$.

(iii) ⇒ (i): Suppose $w \in \overline{S_i} \cap \overline{S_j}$ ($w \ne \emptyset$) but $S_i$ and $S_j$ are separable. Let $I \subseteq N$ be a subset such that $I \cap D_i \ne \emptyset$ for all $D_i \in S_i$ and $I \cap D_j = \emptyset$ for all $D_j \in S_j$. But $w \in \overline{S_i}$ implies there exists a $D_i \in S_i$ such that $D_i \subseteq w$, hence $I \cap w \supseteq I \cap D_i \ne \emptyset$. On the other hand $w \in \overline{S_j}$ implies $w \subseteq \|S_j\|$, which implies $I \cap w = \emptyset$, a contradiction. Therefore, $S_i$ and $S_j$ are not separable. □

Theorem 1.3.2 $\pi$ is realizable if and only if $G_\pi$ does not contain a directed cycle.

Proof. Suppose $G_\pi$ has a cycle C. Let $\pi'$ be a partition of $S_0$ obtained by merging two (separable) sets $S_i$ and $S_j$ in $\pi$. From Theorem 1.3.1, $G_\pi$ cannot contain the cycle $S_i \to S_j \to S_i$, for otherwise $S_i$ and $S_j$ are not separable and therefore not mergeable. Since every edge in $G_\pi$ except those between $S_i$ and $S_j$ is preserved in $G_{\pi'}$, $G_{\pi'}$ must contain a cycle C'. By repeating this argument, eventually one obtains a partition with no two parts separable. Therefore $\pi$ is not realizable.

Next assume that $G_\pi$ contains no cycle. The graph $G_\pi$ induces a partial ordering on the $S_i$'s with $S_i < S_j$ if and only if $S_i \to S_j$. Let $S_j$ be a minimal element in this ordering. Then $I = \overline{\|S_j\|}$ separates $S_j$ from all other $S_i$'s. Let $G_{\pi'}$ be obtained from $G_\pi$ by deleting the node $S_j$ and its edges. Then clearly, $G_{\pi'}$ contains no cycle. Therefore the above argument can be repeated to find a set of tests which separate one $S_i$ at a time. □

Unfortunately, local conditions, i.e., conditions on $S_i$ and $S_j$ alone, are not sufficient to tell whether a pair is validly mergeable or not. ($S_i$ and $S_j$ are validly mergeable if merging $S_i$ and $S_j$ preserves the realizability of the partition.) Theorem 1.3.1 provides some local necessary conditions while the following corollaries to Theorem 1.3.2 provide more such conditions as well as some sufficient conditions.

Corollary 1.3.3 Let M be the set of $S_i$'s which are maximal under the partial ordering induced by $G_\pi$. (If $|M| = 1$, add to $M = \{S_i\}$ an element $S_j$ such that $S_j < S_i$ and $S_j \not< S_k$ for all other $S_k$.) Then every pair in M is validly mergeable.

Corollary 1.3.4 Let $\pi$ be realizable. Then the pair $S_i$, $S_j$ is validly mergeable if there does not exist some other $S_k$ such that either $S_i \to S_k$ or $S_j \to S_k$.

Corollary 1.3.5 $S_i$ and $S_j$ are not validly mergeable if there exists another $S_k$ such that either $S_i \to S_k \to S_j$ or vice versa.

Corollary 1.3.6 Let $\pi$ be the partition of $S_0$ in which every $S_i$ consists of just a single element. Then $\pi$ is realizable.
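
For small item sets the notions of separability and of the graph $G_\pi$ can be explored by brute force. The Python sketch below is an illustration under stated assumptions, not the authors' procedure: the function names are hypothetical, and `has_edge` relies on the observation (made here, not in the text) that $S_i \to S_j$ holds exactly when some $D \in S_i$ is contained in $\|S_j\|$, which follows from the definition of the closure.

```python
from itertools import combinations

def support(part):
    """||S||: the union of all samples (defective sets) in the part S."""
    return frozenset().union(*part)

def has_edge(s_i, s_j):
    """S_i -> S_j in G_pi, tested as: some D in S_i lies inside ||S_j||."""
    target = support(s_j)
    return any(d <= target for d in s_i)

def separable(s_i, s_j, items):
    """Search all groups I for one that misses every sample of one part
    and meets every sample of the other."""
    for size in range(1, len(items) + 1):
        for group in combinations(sorted(items), size):
            g = frozenset(group)
            meets_i = [bool(g & d) for d in s_i]
            meets_j = [bool(g & d) for d in s_j]
            if (not any(meets_i) and all(meets_j)) or (all(meets_i) and not any(meets_j)):
                return True
    return False

def is_acyclic(parts):
    """Repeatedly delete a part with no incoming edge, as in the proof of
    Theorem 1.3.2; this succeeds exactly when G_pi has no directed cycle."""
    remaining = list(range(len(parts)))
    while remaining:
        sources = [j for j in remaining
                   if not any(has_edge(parts[i], parts[j]) for i in remaining if i != j)]
        if not sources:
            return False
        remaining.remove(sources[0])
    return True

def fs(*xs):
    return frozenset(xs)

# Two partitions of S(2, 4), the 2-subsets of the items {1, 2, 3, 4}.
items = {1, 2, 3, 4}
with_item_1 = [fs(1, 2), fs(1, 3), fs(1, 4)]
without_item_1 = [fs(2, 3), fs(2, 4), fs(3, 4)]
crossed_a = [fs(1, 2), fs(3, 4)]
crossed_b = [fs(1, 3), fs(1, 4), fs(2, 3), fs(2, 4)]
```

The first partition is split by the single test {1}; in the second, each part has full support, so edges run both ways and, in line with Theorem 1.3.1, no test separates the parts.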

Let $S \subseteq S_0$ ($S \ne \emptyset$) and let $\pi = \{S, S_1, S_2, \ldots, S_m\}$ be the partition of $S_0$ where each $S_i$ consists of a single element $D_i$. The following theorem shows a connection between the realization of S and the realization of $\pi$.

Theorem 1.3.7 The following statements are equivalent:
(i) S is realizable.
(ii) $\overline{S} \cap S_0 = S$.
(iii) $\pi$ is realizable.

Proof. We show (i) ⇒ (ii) ⇒ (iii) ⇒ (i).

(i) ⇒ (ii): Since $S \subseteq S_0$, clearly $\overline{S} \cap S_0 \supseteq S$. So only $\overline{S} \cap S_0 \subseteq S$ needs to be shown. Let $D \in \overline{S} \cap S_0$; then $D \in \overline{S}$ and hence there is some $D' \in S$ such that $D' \subseteq D$. Suppose to the contrary that $D \notin S$. Then $\{D\} = S_i$ for some i. But $D \in \overline{S}$ implies $S_i \to S$ and $D' \subseteq D$ implies $S \to S_i$. Hence S is not realizable by Theorem 1.3.2.

(ii) ⇒ (iii): If $\pi$ is not realizable, then $G_\pi$ contains a directed cycle. Since each $S_i$ is a single element, $\overline{S_i} = S_i$ and hence any directed cycle must contain S, say $S_1 \to S_2 \to \cdots \to S_k \to S \to S_1$. But $S_1 \to S_2 \to \cdots \to S_k \to S$ implies $S_1 \to S$ since the $S_i$'s are single elements. By Theorem 1.3.1, $S \to S_1 \to S$ implies $\overline{S} \cap \overline{S_1} = \overline{S} \cap \{D_1\} \ne \emptyset$. This implies $D_1 \in \overline{S} \cap S_0$. Since $D_1 \notin S$, $\overline{S} \cap S_0 \ne S$.

(iii) ⇒ (i): Trivially true by using the definition of realizability. □

1.4 Number of Group Testing Algorithms

One interesting problem is to count the number of group testing algorithms for S. This is the total number of binary trees with $|S|$ leaves (labeled by members of S) which satisfy the group testing structure. While this problem remains open, Moon and Sobel [7] counted a class of algorithms when the sample space S is the power set of n items. Call a group pure if it contains no defective, and contaminated otherwise. A group testing algorithm is nested if whenever a contaminated group is known, the next group to be tested must be a proper subset of the contaminated group. (A more detailed definition is given in Section 2.3.) Sobel and Groll [8] proved

Lemma 1.4.1 Let U be the set of unclassified items and suppose that $C \subset U$ is tested to be contaminated. Furthermore, suppose $C' \subset C$ is then tested to be contaminated. Then items in $C \setminus C'$ can be mixed with items in $U \setminus C$ without losing any information.

Proof. Since C' being contaminated implies C being contaminated, the sample space given both C and C' being contaminated is the same as given only C' being contaminated. But in the latter case, items in $C \setminus C'$ and $U \setminus C$ are indistinguishable. □

Thus under a nested algorithm, at any stage the set of unclassified items is characterized by two parameters m and n, where $m \ge 0$ is the number of items in a contaminated group and n is the total number of unclassified items. Let f(m, n) denote the number of nested algorithms when the sample space is characterized by such m and n. By using the "nested" property, Moon and Sobel [7] obtained

$$\begin{aligned}
f(0,0) &= 1, \\
f(0,n) &= \sum_{k=1}^{n} f(0,n-k)\, f(k,n) && \text{for } n \ge 1, \\
f(1,n) &= f(0,n-1) && \text{for } n \ge 1, \\
f(m,n) &= \sum_{k=1}^{m-1} f(m-k,n-k)\, f(k,n) && \text{for } n \ge m > 1,
\end{aligned}$$

where k is the size of the group to be tested. Recall that the Catalan numbers

$$C_k = \frac{1}{k}\binom{2k-2}{k-1}$$

satisfy the recurrence relation

$$C_k = \sum_{i=1}^{k-1} C_i\, C_{k-i}$$

for $k \ge 2$. Moon and Sobel gave

Theorem 1.4.2
$$\begin{aligned}
f(0,n) &= C_{n+1} \prod_{i=1}^{n-1} f(0,i) && \text{for } n \ge 2, \\
f(m,n) &= C_m \prod_{i=1}^{m} f(0,n-i) && \text{for } 1 \le m \le n.
\end{aligned}$$

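Proof. Theorem 1.4.2 is easily verified for f(0, 2) and f(1, n). The general case is proved by induction:

$$\begin{aligned}
f(0,n) &= \sum_{k=1}^{n} f(0,n-k)\, f(k,n) \\
&= \sum_{k=1}^{n} \Bigl( C_{n-k+1} \prod_{i=1}^{n-k-1} f(0,i) \Bigr) \Bigl( C_k \prod_{i=1}^{k} f(0,n-i) \Bigr) \\
&= C_{n+1} \prod_{i=1}^{n-1} f(0,i), \\
f(m,n) &= \sum_{k=1}^{m-1} f(m-k,n-k)\, f(k,n) \\
&= \sum_{k=1}^{m-1} \Bigl( C_{m-k} \prod_{i=1}^{m-k} f(0,n-k-i) \Bigr) \Bigl( C_k \prod_{i=1}^{k} f(0,n-i) \Bigr) \\
&= C_m \prod_{i=1}^{m} f(0,n-i). \qquad \Box
\end{aligned}$$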

Define F(n) = f(0, n). The following recurrence relations follow readily from Theorem 1.4.2 and the definition of the Catalan numbers.

Corollary 1.4.3 If $n \ge 1$, then
$$F(n) = \frac{C_{n+1}}{C_n} F^2(n-1) = \frac{2(2n-1)}{n+1} F^2(n-1).$$

Corollary 1.4.4 If $n \ge 2$, then
$$F(n) = C_{n+1}\, C_n\, C_{n-1}^{2}\, C_{n-2}^{4} \cdots C_2^{2^{n-2}}.$$

Corollary 1.4.5 If $n \ge 1$, then
$$F(n) = 4^{2^n - 1} \prod_{i=1}^{n} \Bigl\{ 1 - \frac{3}{2(i+1)} \Bigr\}^{2^{n-i}}.$$

The first few values of F(n) are

n      1   2   3    4     5         6
F(n)   1   2   10   280   235,200   173,859,840,000

The limiting behavior of F(n) can be derived from the formula in Corollary 1.4.5.

Corollary 1.4.6
$$\lim_{n \to \infty} \{F(n)\}^{2^{-n}} = 4 \prod_{i=1}^{\infty} \Bigl\{ 1 - \frac{3}{2(i+1)} \Bigr\}^{2^{-i}} = 1.526753\cdots.$$

More generally, it can be shown that
$$F(n) = \tfrac{1}{4}\, \alpha^{2^n} \Bigl\{ 1 + \frac{3}{2n} + O(n^{-2}) \Bigr\}$$
as $n \to \infty$, where $\alpha = 1.526753\cdots$; in particular, $F(n) > \frac{1}{4}\alpha^{2^n}$ for $n \ge 1$. The proof of this is omitted.
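
The recurrences for f(m, n) are straightforward to evaluate, which gives a direct check of Theorem 1.4.2 and of the values of F(n). The Python sketch below is an illustration only; the function names `catalan` and `f` mirror the symbols in the text, and the memoized recursion is simply one convenient way to evaluate the recurrences.

```python
from functools import lru_cache
from math import comb, prod

def catalan(k):
    """C_k = (1/k) * binom(2k - 2, k - 1), so C_1 = C_2 = 1, C_3 = 2, C_4 = 5."""
    return comb(2 * k - 2, k - 1) // k

@lru_cache(maxsize=None)
def f(m, n):
    """Number of nested algorithms with n unclassified items, m of them
    in a contaminated group (recurrences of Moon and Sobel)."""
    if m == 0:
        if n == 0:
            return 1
        return sum(f(0, n - k) * f(k, n) for k in range(1, n + 1))
    if m == 1:
        return f(0, n - 1)
    return sum(f(m - k, n - k) * f(k, n) for k in range(1, m))

def closed_form(n):
    """Right-hand side of Theorem 1.4.2 for f(0, n), n >= 2."""
    return catalan(n + 1) * prod(f(0, i) for i in range(1, n))

first_values = [f(0, n) for n in range(1, 7)]
```

Evaluating `first_values` gives F(1), ..., F(6), and `closed_form(n)` can be compared with `f(0, n)` for any small n.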

1.5 A Prototype Problem and Some Basic Inequalities

We first describe a prototype CGT problem which will be the focus of the whole book. Then we discuss generalizations and special cases. Consider a set I of n items known to contain exactly d defectives. The defectives look exactly like the good items and the only way to identify them is through testing, which is error-free. A test can be applied to an arbitrary subset of the n items with two possible outcomes: a negative outcome indicates that all items in the subset are good, a positive outcome indicates the opposite, i.e., at least one item in the subset is defective (without revealing which ones or how many are defective). The goal is to find an algorithm A to identify all defectives with a small $M_A(S) = M_A(d, n)$. This is the prototype problem. It is also called the (d, n) problem to highlight the two parameters n and d. In the literature the (d, n) problem has sometimes been called the hypergeometric group testing problem. We now reserve the latter term for the case that a uniform distribution is imposed on S, since then the probability that a subset of k items contains x defectives follows the hypergeometric distribution

$$\binom{d}{x} \binom{n-d}{k-x} \Big/ \binom{n}{k}.$$

Note that the hypergeometric group testing problem under this new definition belongs to PGT, not CGT. Since M(n, n) = M(0, n) = 0, whenever the (d, n) problem is studied, it is understood that 0 < d < n. Hu, Hwang and Wang [3] proved some basic inequalities about the M function which are reported in this section.

Lemma 1.5.1 $S \subseteq S'$ implies $M(S) \le M(S')$.

[...]

Theorem 1.5.3 $M(m; d, n) \ge 1 + M(d-1, n-1)$ for $m \ge 2$ and $0 < d < n$.

Proof. By Lemma 1.5.1,

$$M(m; d, n) \ge M(2; d, n) \quad \text{for } m \ge 2.$$

Let T be an algorithm for the (2; d, n) problem, and let $M_T(2; d, n) = k$, i.e., k is the maximum path length of a leaf in T. Let $I_1$ and $I_2$ denote the two items in the contaminated group.

Claim. Every path of length k in T includes at least one test that contains either $I_1$ or $I_2$.

Proof of claim. Suppose to the contrary that for some leaf v the path p(v) has length k and involves no test containing $I_1$ or $I_2$. Since no test on p(v) can distinguish $I_1$ from $I_2$, and since $\{I_1, I_2\}$ is a contaminated group, $I_1$ and $I_2$ must both be defective in the sample s(v). Let u be the sibling node of v. Then u is also a leaf since v has maximum path length. Since p(u) and p(v) have the same set of tests, $I_1$ and $I_2$ must also both be defective in the sample s(u). Since $s(u) \ne s(v)$, there exist indices i and j such that $I_i$ is defective and $I_j$ is good in the sample s(u), while $I_i$ is good and $I_j$ is defective in s(v). Let w denote the parent node of u and v; thus $S(w) = \{s(u), s(v)\}$. Then no test on p(w) can be of the form $G \cup \{I_i\}$, where G, possibly empty, contains only items classified as good in s(u), since such a test must yield a positive outcome for s(u) and a negative outcome for s(v), and hence would have separated s(u) from s(v). Define s to be the sample identical to s(u) except that $I_2$ is good and both $I_i$ and $I_j$ are defective. Then s can be distinguished from s(u) only by a test containing $I_2$, which by assumption does not exist on p(w), or by a test of the form $G \cup \{I_j\}$, whose existence has also been ruled out. Thus $s \in S(w)$, and u and v cannot both be leaves, a contradiction that completes the proof of the claim.

By renaming the items if necessary, one may assume that every path of length k in T involves a test that contains $I_1$. Add an imaginary defective to the (d − 1, n − 1) problem and label it $I_1$, and map the n − 1 items of the (d − 1, n − 1) problem one-to-one to the items of the (d, n) problem except $I_1$. Then the modified T can be used to solve the (d − 1, n − 1) problem except that every test containing $I_1$ is skipped since the positive outcome is predictable. But each path of length k in T contains such a test. Hence the maximum path length in applying T to the (d − 1, n − 1) problem is at most k − 1, i.e.,

$$M_T(2; d, n) \ge 1 + M_T(d-1, n-1).$$

Since T is arbitrary, the proof is complete.

•
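
For very small n the minimax value M(d, n) can be found by exhaustive search over all reasonable test trees, which gives a way to sanity-check inequalities of this kind. The following Python sketch is an illustration only, not an algorithm from the text: the name `minimax_tests` is hypothetical, and the search is exponential, so it is practical only for a handful of items.

```python
from functools import lru_cache
from itertools import combinations

def minimax_tests(d, n):
    """M(d, n) by exhaustive search: exactly d defectives among n items."""
    items = range(n)
    samples = frozenset(frozenset(c) for c in combinations(items, d))
    groups = [frozenset(c) for size in range(1, n + 1)
              for c in combinations(items, size)]

    @lru_cache(maxsize=None)
    def solve(space):
        # space is S(u): the samples still consistent with the test history
        if len(space) <= 1:
            return 0
        best = len(space) - 1          # a splitting test always exists, so this is an upper bound
        for g in groups:
            negative = frozenset(s for s in space if not (s & g))
            if not negative or len(negative) == len(space):
                continue               # outcome predictable: not a reasonable test
            positive = space - negative
            best = min(best, 1 + max(solve(negative), solve(positive)))
        return best

    return solve(samples)

small_values = {(d, n): minimax_tests(d, n) for n in range(1, 6) for d in range(0, n + 1)}
```

For example, `small_values[(1, 4)]` is 2 (binary splitting) and `small_values[(3, 4)]` is 3, because with three defectives among four items only single-item tests have an unpredictable outcome.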

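Corollary 1.5.4 M(n − 1, n) = n − 1.

Proof. Trivially true for n = 1. For $n \ge 2$,

$$M(n-1, n) = M(n; n-1, n) \ge 1 + M(n-2, n-1) \ge n - 1 + M(0, 1) = n - 1.$$

The reverse inequality holds since individual testing never needs more than n − 1 tests. □

Theorem 1.5.5 $M(d, n) \ge 1 + M(d-1, n-1) \ge M(d-1, n)$ for $0 < d < n$.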

Proof. By noting M(d, n) = M(n; d, n), the first inequality follows from Theorem 1.5.3. The second inequality is trivially true for d = 1. The general case is proved by using the induction assumption

$$M(d, n) \ge M(d-1, n).$$

Let T be the algorithm which first tests a single item and then uses a minimax algorithm for the remaining problem. Then

$$M(d-1, n) \le M_T(d-1, n) = 1 + \max\{M(d-1, n-1),\ M(d-2, n-1)\} = 1 + M(d-1, n-1). \qquad \Box$$

Lemma 1.5.6 $M(d, n) \le n - 1$.

Proof. The individual testing algorithm needs only n − 1 tests since the state of the last item can be deduced by knowing the states of the other items and knowing d. □

Lemma 1.5.7 Suppose that n − d > 1. Then M(d, n) = n − 1 implies M(d, n − 1) = n − 2.

Proof. Suppose to the contrary that $M(d, n-1) \ne n - 2$. By Lemma 1.5.6, this is equivalent to assuming $M(d, n-1) < n - 2$. Let T denote an algorithm for the (d, n) problem which first tests a single item and then uses a minimax algorithm for the remaining problem. Then, by Theorem 1.5.5,

$$M(d, n) \le M_T(d, n) = 1 + \max\{M(d, n-1),\ M(d-1, n-1)\} = 1 + M(d, n-1) < n - 1,$$

a contradiction. □

[...]

1, then

$$M_T(d, n) \ge 1 + M(m; d, n) \ge 2 + M(d-1, n-1)$$

by Theorem 1.5.3, a contradiction to what has just been shown. Therefore m = 1 and

$$M(d, n) = 1 + \max\{M(d, n-1),\ M(d-1, n-1)\} = 1 + M(d, n-1)$$

by Theorem 1.5.5 and the fact d < n − 1. It follows that

$$M(d-1, n-1) = M(d, n-1).$$

Hence

$$M(d-1, n-1) = n - 2$$

by induction. Therefore M(d, n) = 1 + M(d − 1, n − 1) = n − 1.

Lemma 1.5.9 Suppose M(d, n) < n − 1. Then M(d, n) ≥ 2l + M( [...] 1. First consider the case l = 1. Let T be a minimax algorithm for the (d, n) problem which first tests a set of m items. If m > 1, then Lemma 1.5.9 is an immediate consequence of Theorem 1.5.3. Therefore assume m = 1. Suppose to the contrary that

$$M_T(d, n) < 2 + M(d-1, n-1).$$

Then 1 + M(d − 1, n − 1) ≥ $M_T(d, n)$ = 1 + max{M( [...]

General Algorithms

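Hence by induction

$$M_G(d, n - 2^{\alpha}) = \begin{cases}
(\alpha+2)d + (p-1) - 1 & \text{for } p \ge 1, \\
(\alpha+1)d + (d-2) - 1 & \text{for } p = 0,\ \theta < 2^{\alpha-1}, \\
(\alpha+1)d + (d-1) - 1 & \text{for } p = 0,\ \theta \ge 2^{\alpha-1}.
\end{cases}$$

Consequently

$$1 + M_G(d, n - 2^{\alpha}) = \begin{cases}
(\alpha+2)d + p - 2 & \text{for } p = 0,\ \theta < 2^{\alpha-1}, \\
(\alpha+2)d + p - 1 & \text{otherwise.}
\end{cases}$$

For d' = d − 1 and n' = n − 1,

$$l' = n' - d' + 1 = \begin{cases}
2^{\alpha}(d-1) + 2^{\alpha}(p+1) + \theta & \text{for } p \le d-3, \\
2^{\alpha+1}(d-1) + \theta & \text{for } p = d-2, \\
2^{\alpha+1}(d-1) + 2^{\alpha} + \theta & \text{for } p = d-1.
\end{cases}$$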

Hence by induction (a + 2 ) ( d - l ) + ( p + l ) - l (a + 3 ) ( d - l ) - l

MG{d-l,n-l)

for p 2.

Proof. l + d-l\

n

dj "

ld

\ d ) > d\ [2" 1 + F{l;d,n) = l + H(d-l,n-l) >

H{d-l,n)

.

2.3 The Nested

Class

25

Finally, for m = 1 and $d \ge 2$,

$$F(1; d, n) = H(d-1, n-1) \ge H(d-2, n-1) = F(1; d-1, n).$$

For general m > 1,

$$\begin{aligned}
F(m; d, n) &= \min_{1 \le k < m} \max\{F(m-k; d, n-k),\ F(k; d, n)\} \\
&\ge \min_{1 \le k < m} \max\{F(m-k; d-1, n-k),\ F(k; d-1, n)\} \quad \text{by induction}
\end{aligned}$$

[...] $2^{k-2}$. Test a group of $n - 2^{k-1}$ items. If the outcome is negative, then $n - 2^{k-1} \ge 2^k - n$ good items are identified, while a defective can be identified from the remaining $2^{k-1}$ items in k − 1 tests by binary splitting. If the outcome is positive, then by induction, at least $2^{k-1} - (n - 2^{k-1}) = 2^k - n$ good items are identified along with a defective if k − 1 more tests are used. 2. $n - 2^{k-1}$ -

\ 2'

for

z>2 .

Theorem 2.3.5 For $d \ge 2$, $f_d(t) = f_d(t-1) + l\bigl(f_{d-1}(t') - f_d(t-1),\ t - 1 - t'\bigr)$, where t' is defined by $f_{d-1}(t') \ge f_d(t-1) > f_{d-1}(t'-1)$.

Proof. Suppose that $f_d(t-1) < n \le f_d(t)$. The first test must be on a group of not fewer than $n - f_d(t-1)$ items, for otherwise a negative outcome would leave too many items for the remaining t − 1 tests. On the other hand there is no need to test more than $n - f_d(t-1)$ items since only the case of a positive outcome is of concern and, by Lemma 2.3.2, the fewer items in the contaminated group the better. For the time being, assume that $f_{d-1}(t'+1) \ge f_d(t)$. Under this assumption, if the first defective lies among the first $n - f_{d-1}(t')$ items, then after its identification the remaining items need t' + 1 further tests. Therefore only t − 2 − t' tests are available to identify the first defective. Otherwise, the remaining items can be identified by t' tests and one more test is available to identify the first defective. Therefore, the maximum value of doable n is given by

$$n - f_d(t-1) = l\bigl(f_{d-1}(t') - f_d(t-1),\ t - 1 - t'\bigr).$$

The proof of $f_{d-1}(t'+1) \ge f_d(t)$ is very involved. The reader is referred to [11] which proved the same for the corresponding R*-minimax merging algorithm (see Section 2.4). □

By noting $f_1(t) = 2^t$, $f_2(t)$ has a closed-form solution. Corollary 2.3.6 f 22 - 2 + [ 4 ^ J /2(t)

=

\

«-l 2

for t even and > 4, 1 2

{ 2 V - 2 + I "*"' ' '] for t odd and ≥ 3. The recursive equation for $f_d(t)$ can be solved in O(td) time. Since the generalized binary splitting algorithm is in the nested class, H(d, n) cannot exceed $M_G(d, n)$, which is of the order d log(n/d). Therefore, $f_x(y)$ for $x \le d$ and $y \le d \log(n/d)$ can be computed in $O(d^2 \log(n/d))$ time, a reduction by a factor of $n^3 / d \log(n/d)$ from the brute force method.

2.4 (d, n) Algorithms and Merging

27

Algorithms

For given n and d, compute $f_x(y)$ for all $x \le d$ and $y \le d \log(n/d)$. A minimax nested algorithm is defined by the following procedure (assume that unidentified items form a line): Step 1. Find t such that $f_d(t)$ [...] $b_3$, then a sequence of comparisons (if needed) involving $a_d$ are used to merge $a_d$ into $B_g$, i.e. to establish $b_i < a_j < b_{i+1}$ for some i, j [...] $M_{T'}(\bar{d}, n)$, where $(\bar{d}, n)$ denotes the problem in which d is only an upper bound on the number of defectives.

Proof. Let T' be obtained from T by adding a subtree $T_v$ to each terminal node v having a positive number of free items (see Figure 2.1). $T_v$ is the tree obtained by testing the free items one by one. Since free items are the only items at v whose states are uncertain when (d, n) is changed to $(\bar{d}, n)$, T' is a procedure for the $(\bar{d}, n)$ problem. From Lemma 2.5.1, the sibling node of v is the root of a subtree with maximum path length at least one less than the number of free items of $s_v$. The theorem follows immediately. □

Corollary 2.5.3 $M(d, n) + 1 \ge M(\bar{d}, n) \ge M(d, n+1)$.

Proof. The first inequality follows from the theorem. The second inequality follows from the observation that the (d, n + 1) problem can be solved by any procedure for the $(\bar{d}, n)$ problem, provided one of the n + 1 items is put aside. The nature of the item put aside can be deduced with certainty once the natures of the other n items are known. □

For nested algorithms Hwang [9] gave a stronger result which is stated here without proof.

Theorem 2.5.4 $F(m; d, n+1) = F(m; \bar{d}, n)$.

Corollary 2.5.5 $H(d, n+1) = H(\bar{d}, n)$.

[Figure 2.1: From tree T to tree T'.]

When the tests are destructive or consumptive, as in the blood test application, then the number of tests an item can go through (or the number of duplicates) becomes an important issue. If an item can only go through one test, then individual testing is the only feasible algorithm. If an item can go through at most s tests, then Li's s-stage algorithm is a good one to use. In some other applications, the size of a testable group is restricted. Usually, the restriction is of the type that no more than k items can be tested as a group. An algorithm can be modified to fit the restriction by cutting the group size to k whenever the algorithm calls for a larger group. There are more subtle restrictions. Call a storage unit a bin and assume that all items in a bin are indistinguishable to the tester even though they have different test histories. A b-bin algorithm can use at most b bins to store items. A small number of bins not only saves storage units, but also implies easier implementation. Since at most stages of the testing process one bin is needed to store good items, one to store defectives and one to store unidentified items, any sensible algorithm will need at least three bins. An obvious 3-bin algorithm is individual testing; and at first glance this seems to be the only sensible 3-bin algorithm. We now show the surprising result that Li's s-stage algorithm can be implemented as a 3-bin algorithm. The three bins are labeled "queue," "good item" and "new queue." At the beginning of stage i, items which have been identified as good are in the good-item bin, and all other items

are in the queue bin. Items in the queue bin are tested in groups of size $k_i$ (some possibly $k_i - 1$) according to Li's s-stage algorithm. Items in groups tested negative are thrown into the good-item bin, and items in groups tested positive are thrown into the new-queue bin. At the end of stage i, the queue bin is emptied and changes labels with the new-queue bin to start the next stage. Of course, at stage s, each group is of size one and the items thrown into the new-queue bin are all defectives.

All nested algorithms are 4-bin algorithms. Intuitively, one expects the minimax nested algorithm to be a minimax 4-bin algorithm. But this involves proving that the following type of test can be excluded from a minimax algorithm: Suppose a contaminated group C exists. Test a group G which is not a subset of C. When G is also contaminated, throw G into the bin containing C and throw C into the other bin containing unidentified items. Note that the last move loses the information obtained on finding C contaminated. But it is hard to relate this to nonminimaxity.

Yet another possible restriction is on the number of recursive equations defining a minimax algorithm in a given class. A small number implies a faster solution for the recursive equations. For example, individual testing needs one equation (in a trivial way) and the nested class needs two.

Suppose that there are p processors or p persons to administer the tests in parallel. Then p disjoint groups can be tested in one "round." In circumstances where the cost of time dominates the cost of tests, the number of rounds is a more relevant criterion for evaluating an algorithm than the number of tests. Li's s-stage algorithm can be easily adapted to be a parallel algorithm. Define $n' = \lceil n/p \rceil$. Apply Li's algorithm to the (d, n') problem except that the $g_i$ groups at stage i, i = 1, ..., s, are partitioned into $\lceil g_i/p \rceil$ classes and groups in the same class are tested in the same round.
At the last stage, instead of d defectives, at most d contaminated groups are identified. Since each group contains at most p items, at most d more parallel tests identify all defectives in these groups. Recall that the number of tests for Li's algorithm is upper bounded by

$$\frac{e}{\log e}\, d \log \frac{n}{d},$$

where $s = \ln(n/d)$. The total number of rounds with p processors is upper bounded by

\dp)

which tends to

\dp)

feH

iogn

when n is much larger than d and p.

•


2.6 An Application to Clone Screenings

Screening large collections of clones for those that contain specific DNA sequences is an indispensable preliminary to numerous genetic studies and is usually performed with a clone-by-clone probe or something equivalent. For a yeast artificial-chromosome (YAC) library, the high sensitivity and specificity of the polymerase chain reaction (PCR) allows the detection of target sequences in DNA prepared from pools of thousands of YAC clones. Several PCR-based screening protocols have been suggested to reduce the number of tests. They will be introduced here and their relations to group testing expounded.

A YAC library typically contains from 10,000 to 100,000 clones where each DNA sequence will appear on average in r clones. The value of r is specified by the library, and the ratio r/n, though depending on the particular library, is of the order of $10^{-4}$. A PCR can identify the existence of a specified DNA sequence in a pool of clones, and the identification is considered error-free if the pool size stays within a certain limit. The objective is to identify all clones containing the specified DNA sequence with minimum effort; this includes the number of PCRs, the ease of preparing the pools, and whether the pooling can be done in parallel.

Thus it is clear that clone screening can be cast as a group testing problem where the clones are the items and those containing a specified DNA sequence are the defectives. While the number of defectives is usually given by its expected value, an upper bound can be obtained by assuming that the number of defectives follows a Poisson distribution. A PCR is a group test with some restriction on its maximum size. If the DNA of a pool contains the appropriate PCR product, then the test outcome is positive (such a pool is identified as "positive"). While one wants to minimize the number of tests, how easily the test group can be assembled is definitely a concern.
In general, nonadaptive algorithms are preferred over sequential algorithms, with multistage algorithms having a small number of stages as a possible compromise.

Green and Olson [6] considered a human YAC library with n = 23,040 clones and r = 2. YAC clones were grown on nylon filters in rectangular arrays of 384 colonies following inoculation from four 96-well microtiter plates. The yeast cells from each filter are pooled and the DNA is purified, yielding single-filter pools of DNA. Equal aliquots from single-filter pools are mixed together in groups of five to yield multi-filter pools, each representing the DNA from 1920 clones. The multi-filter pools of DNA are then analyzed individually for the presence of a specified DNA segment by using the PCR. If a multi-filter pool is found to be positive, then each constituent single-filter pool is analyzed individually by the same PCR assay. Upon identification of a positive single-filter pool, locations of positive clones within the 384-clone array are established by colony hybridization using the radio-labeled PCR product as the probe.

Green and Olson also considered a modification which takes effect when a positive single-filter pool is found. Instead of colony hybridization, the modification uses the binary representation matrix (see Section 4.5) to identify the positive clone in


$\lceil \log 1920 \rceil = 11$ pools. Note that this method would fail if there exists more than one positive clone in the pool, and more pools have to be taken as a remedy. But for the given parameters of the library, the probability that this will happen is only 3%. The modified Green and Olson method is a 3-stage group testing algorithm, except that in the last stage the individual testing is replaced by some technology-based procedures. Note that the size of the single-filter grouping is constrained to be a multiple of 384, and that of the multi-filter grouping to be within the limit of an effective PCR.

De Jong, Aslanidis, Alleman and Chen [3], and also Evans and Lewis [5], proposed a different approach which was elaborated and analyzed by Barillot, Lacroix and Cohen [1]. They assumed that the clones are arranged in a 2-dimensional matrix, and each row or column yields a pool. A positive clone renders its row and its column positive. Thus all positive clones are located at the intersections of positive rows and positive columns. Unfortunately, the converse is not true. In fact, r positive clones may cause $r^2$ such positive intersections. Thus one or several other similar matrices are

Figure 2.2: True and false positive intersection.

needed to differentiate clones at these intersections, using the fact that positive clones will be located at positive intersections of every matrix. For better discrimination, it is desirable that any two clones appear in the same row or column in at most one matrix. Such matrices have been studied in the statistical literature and are known as lattice designs. In particular, they are called lattice squares if the matrices are square matrices. The construction of a set of k, k > 2, lattice squares corresponds to the construction of k - 2 latin squares (see [15]).

Since the effort of collecting a pool is not negligible, after a pool is collected and analyzed, a decision needs to be made among three choices: use another matrix for more pools, test the ambiguous clones individually, or accept a small probability of including a false positive clone.

The 2-dimensional pooling strategy can be extended to d-dimensional spaces where each hyperplane yields a pool. For d-dimensional cubes, Barillot et al. pointed out that the size of a pool is $n^{1-1/d}$ and would be too large for d > 4. They provided figures which show the optimal d as a function of n and r.
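The ambiguity of row-and-column pooling can be illustrated with a small simulation. The following is a minimal sketch under stated assumptions: the function name `candidate_clones` and the 4 x 4 grid are hypothetical choices made for illustration, not part of the cited protocols. It shows that r = 2 positive clones lying in distinct rows and columns produce $r^2 = 4$ positive intersections, two of which are false.

```python
from itertools import product

def candidate_clones(positives, rows, cols):
    """Return the intersections of positive row pools and positive column pools.

    positives: set of (row, col) positions of the truly positive clones.
    A row (column) pool is positive when it contains at least one positive clone.
    """
    positive_rows = {r for r in range(rows)
                     if any((r, c) in positives for c in range(cols))}
    positive_cols = {c for c in range(cols)
                     if any((r, c) in positives for r in range(rows))}
    return set(product(positive_rows, positive_cols))

# Hypothetical example: two positive clones in a 4 x 4 matrix.
true_positives = {(0, 1), (2, 3)}
candidates = candidate_clones(true_positives, rows=4, cols=4)
false_positives = candidates - true_positives
print(sorted(candidates))
print(sorted(false_positives))
```

A second matrix in which the clones are arranged differently would eliminate a false intersection whenever that clone does not again fall on a positive row and a positive column.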


References

[1] E. Barillot, B. Lacroix and D. Cohen, Theoretical analysis of library screening using a N-dimensional pooling strategy, Nucleic Acids Res. 19 (1991) 6241-6247.

[2] X. M. Chang, F. K. Hwang and J. F. Weng, Group testing with two and three defectives, Ann. N.Y. Acad. Sci. Vol. 576, Ed. M. F. Capobianco, M. Guan, D. F. Hsu and F. Tian, (New York, 1989) 86-96.

[3] P. J. De Jong, C. Aslanidis, J. Alleman and C. Chen, Genome mapping and sequencing, Cold Spring Harbour Conf. New York, 1990, 48.

[4] R. Dorfman, The detection of defective members of large populations, Ann. Math. Statist. 14 (1943) 436-440.

[5] G. A. Evans and K. A. Lewis, Physical mapping of complex genomes by cosmid multiplex analysis, Proc. Nat. Acad. Sci. USA 86 (1989) 5030-5034.

[6] E. D. Green and M. V. Olson, Systematic screening of yeast artificial-chromosome libraries by use of the polymerase chain reaction, Proc. Nat. Acad. Sci. USA 87 (1990) 1213-1217.

[7] F. K. Hwang, A minimax procedure on group testing problems, Tamkang J. Math. 2 (1971) 39-44.

[8] F. K. Hwang, Hypergeometric group testing procedures and merging procedures, Bull. Inst. Math. Acad. Sinica 5 (1977) 335-343.

[9] F. K. Hwang, A note on hypergeometric group testing procedures, SIAM J. Appl. Math. 34 (1978) 371-375.

[10] F. K. Hwang, A method for detecting all defective members in a population by group testing, J. Amer. Statist. Assoc. 67 (1972) 605-608.

[11] F. K. Hwang and D. N. Deutsch, A class of merging algorithms, J. Assoc. Comput. Math. 20 (1973) 148-159.

[12] F. K. Hwang, T. T. Song and D. Z. Du, Hypergeometric and generalized hypergeometric group testing, SIAM J. Alg. Disc. Methods 2 (1981) 426-428.

[13] D. E. Knuth, The Art of Computer Programming, Vol. 3, (Addison-Wesley, Reading, Mass. 1972).

[14] C. H. Li, A sequential method for screening experimental variables, J. Amer. Statist. Assoc. 57 (1962) 455-477.

[15] D. Raghavarao, Constructions and Combinatorial Problems in Designs of Experiments, (Wiley, New York, 1971).

[16] M. Sobel and P. A. Groll, Group testing to eliminate efficiently all defectives in a binomial sample, Bell System Tech. J. 38 (1959) 1179-1252.

3 Algorithms for Special Cases

When d is very small or very large, more is known about M(d,n). $M(1,n) = \lceil \log n \rceil$ by Theorem 1.2.1 and by using binary splitting. Surprisingly, M(2,n) and M(3,n) are still open problems, although "almost" minimax algorithms are known. On the other hand, one expects individual testing to be minimax when n/d is small. It is known that the threshold value for this ratio lies between 21/8 and 3, and it was conjectured to be 3.
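The claim $M(1,n) = \lceil \log n \rceil$ can be checked numerically for small n. The sketch below is an illustration only: the helper names `binary_splitting` and `worst_case_tests` are hypothetical, and the halving rule (test the first $\lfloor |C|/2 \rfloor$ candidates) is one possible way to realize binary splitting.

```python
def binary_splitting(items, is_contaminated):
    """Identify the single defective among `items`; return (defective, tests_used)."""
    tests = 0
    candidates = list(items)
    while len(candidates) > 1:
        half = candidates[: len(candidates) // 2]
        tests += 1
        if is_contaminated(half):
            candidates = half                      # defective lies in the tested half
        else:
            candidates = candidates[len(half):]    # defective lies in the other half
    return candidates[0], tests

def worst_case_tests(n):
    """Maximum number of tests over all n possible positions of the defective."""
    worst = 0
    for defective in range(n):
        found, tests = binary_splitting(range(n), lambda group: defective in group)
        assert found == defective
        worst = max(worst, tests)
    return worst

for n in (1, 5, 8, 1000):
    # (n - 1).bit_length() equals ceil(log2(n)) for every integer n >= 1.
    print(n, worst_case_tests(n), (n - 1).bit_length())
```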

3.1 Two Disjoint Sets Each Containing Exactly One Defective

Chang and Hwang [1], [2] studied the CGT problem of identifying two defectives in $A = \{A_1, \ldots, A_m\}$ and $B = \{B_1, \ldots, B_n\}$ where A and B are disjoint and each contains exactly one defective. At first, it seems that one cannot do better than work on the two disjoint sets separately. The following example shows that intuition is not always reliable for this problem.

Example 3.1. Let $A = \{A_1, A_2, A_3\}$ and $B = \{B_1, B_2, B_3, B_4, B_5\}$. If one identifies the defectives in A and B separately, then it takes $\lceil \log 3 \rceil + \lceil \log 5 \rceil = 2 + 3 = 5$ tests. However, the following algorithm shows that the two defectives can be identified in 4 tests.

Step 1. Test $\{A_1, B_1\}$. If the outcome is negative, then A has two items and B has four items left. Binary splitting will identify the two defectives in $\log 2 + \log 4 = 3$ more tests. Therefore, it suffices to consider the positive outcome.

Step 2. Test $B_1$. If the outcome is negative, then $A_1$ must be defective. The defective in the four remaining items of B can be identified in 2 more tests. If the outcome is positive, then the defective in the three items of A can be identified in 2 more tests.

Note that there are 3 x 5 = 15 samples $\{A_i, B_j\}$. Since $\lceil \log 15 \rceil = 4$, one certainly cannot do better than 4 tests.
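The two steps of Example 3.1 can be traced exhaustively over all 15 samples. This is a minimal sketch, assuming items are labelled by their indices and that the binary splitting used after Steps 1 and 2 tests the first half of the remaining candidates; the function name `solve_example` is hypothetical.

```python
from itertools import product

def solve_example(defective_a, defective_b):
    """Run the 4-test procedure of Example 3.1; return (found_a, found_b, tests)."""
    A, B = [1, 2, 3], [1, 2, 3, 4, 5]
    tests = 0

    def test(a_items, b_items):
        # A group test is positive when the group contains a defective.
        nonlocal tests
        tests += 1
        return defective_a in a_items or defective_b in b_items

    def split(candidates, side):
        # Binary splitting on one set known to hold exactly one defective.
        while len(candidates) > 1:
            half = candidates[: len(candidates) // 2]
            positive = test(half, []) if side == "A" else test([], half)
            candidates = half if positive else candidates[len(half):]
        return candidates[0]

    if not test([1], [1]):          # Step 1 negative: A1 and B1 are both good
        return split(A[1:], "A"), split(B[1:], "B"), tests
    if not test([], [1]):           # Step 2 negative: B1 is good, so A1 is defective
        return 1, split(B[1:], "B"), tests
    return split(A, "A"), 1, tests  # Step 2 positive: B1 is defective

worst = max(solve_example(a, b)[2]
            for a, b in product(range(1, 4), range(1, 6)))
print(worst)
```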


In general, the sample space is $A \times B$, which will also be denoted by $m \times n$ if $|A| = m$ and $|B| = n$. Does there always exist an algorithm to identify the two defectives in $A \times B$ in $\lceil \log mn \rceil$ tests? Chang and Hwang [2] answered in the affirmative.

A sample space is said to be A-distinct if no two samples in it share the same A-item $A_j$. Suppose S is a sample space with

$$|S| = 2^r + 2^{r-1} + \cdots + 2^{r-p} + q,$$

where $2^{r-p-1} > q \ge 0$. An algorithm T for S is called A-sharp if it satisfies the following conditions:

(i) T solves S in r + 1 tests.

(ii) Let v(i) be the ith node on the all-positive path of T, the path where every outcome is positive. Let v'(i) be the child-node of v(i) with the negative outcome. Then $|S(v'(i))| = 2^{r-i}$ for $i = 0, 1, \ldots, p$.

(iii) $|S(v(p+1))| = q$ and $S(v(p+1))$ is A-distinct.

If $|S| = 2^r$, then the above conditions are replaced by the single condition

(i') T solves S in r tests.

Lemma 3.1.1 There exists an A-sharp algorithm for any A-distinct sample space.

Proof. Ignore the B-items in the A-distinct sample space. Since the A-items are all distinct, there is no restriction on the partitions. It is easily verified that there exists a binary splitting algorithm which is A-sharp. □

For m fixed, define $n_k$ to be the largest integer such that $m n_k \le 2^k$. Clearly, there exists a k for which $n_k = 1$.

Theorem 3.1.2 $M(m \times n_k) = k$ for all $n_k \ge 1$. Furthermore, if $n_k$ is odd, then there exists an A-sharp algorithm for the sample space $m \times n_k$.

Proof. By the definition of $n_k$ and Theorem 1.2.1, $M(m \times n_k) \ge k$. Therefore it suffices to prove $M(m \times n_k) \le k$. If m is even, test half of A and use induction. So, it suffices to consider odd m. The sample space $m \times 1$ is A-distinct. Therefore there exists an A-sharp algorithm by Lemma 3.1.1. For general $n_k > 1$, Theorem 3.1.2 is proved by induction on $n_k$. Note that

$$2^{k-2} < m n_{k-1} < 2^{k-1} < m(n_{k-1} + 1)$$

implies

$$2^{k-1} < m(2 n_{k-1}) < 2^k < m(2 n_{k-1} + 2).$$

Therefore $n_k$ is either $2 n_{k-1}$ or $2 n_{k-1} + 1$. In the former case, test half of the set B and use induction on the remaining $(m \times n_{k-1})$-problem. In the latter case, let r be the largest integer such that

$$n_k = 2^r n_{k-r} + 1.$$

Then $r \ge 1$ and $n_{k-r}$ is necessarily odd. Let

$$m n_{k-r} = 2^{k-r-1} + 2^{k-r-2} + \cdots + 2^{k-r-p} + q,$$

where $0 \le q < 2^{k-r-p-1}$. Then

$$m n_k = m(2^r n_{k-r} + 1) = 2^{k-1} + 2^{k-2} + \cdots + 2^{k-p} + 2^r q + m.$$

Let T be an A-sharp algorithm for the $(m \times n_{k-r})$-problem. The existence of T is assured by the induction hypothesis. Let v be the node on the all-positive path of T associated with q samples. Let J be the set of j such that $(A_i, B_j) \in S(v)$ for some $A_i$. For $j \in J$, let $L_j$ denote the set consisting of those $A_i$'s such that $(A_i, B_j) \in S(v)$. Since S(v) is A-distinct, the $L_j$'s are disjoint.

An A-sharp algorithm is now given for the $(m \times n_k)$-problem. For notational convenience, write n for $n_{k-r}$ and n' for $n_k$. Then B will refer to the set of n items and B' to the set of n' items. Partition $B' - \{B_{n'}\}$ into n groups of $2^r$ items $G_1, \ldots, G_n$. Consider T truncated at the node v; i.e., delete the subtree rooted at v from T. Let T' be an algorithm for the $(m \times n')$-problem where T' is obtained from T by replacing each item $B_j$ in a test by the group $G_j$ and adding $B_{n'}$ to every group tested on the all-positive path. Then each terminal node of T', except the node v' corresponding with v, will be associated with a set of solutions $(A_i \times G_j)$ for some i and j. Since the only uncertainty is on $G_j$ and $|G_j| = 2^r$, r more tests suffice. Therefore, it suffices to give an A-sharp algorithm for the sample space

$$S(v') = \Big\{\bigcup_{j \in J} (L_j \times G_j)\Big\} \cup (A \times B_{n'}),$$

with $|S(v')| = 2^r q + m$. Let $G_{j1}, \ldots, G_{j2^r}$ denote the $2^r$ items in $G_j$. Define

$$T_1 = \Big\{\bigcup_{j \in J} G_{j1}\Big\} \cup R,$$

where R is a subset of A-items not in any of the $L_j$, $j \in J$, with $|R| = 2^r q + m - 2^{k-p-1} - q$. Note that there are a total of $m - q$ A-items not in any of the $L_j$. It is now proved that $m - q > |R| > 0$. The former inequality follows immediately from the fact that

$$2^r q < 2^r 2^{k-r-p-1} = 2^{k-p-1}.$$


Furthermore, since

$$2^{k-1} < m(n_{k-1} + 1) = m(2^{r-1} n_{k-r} + 1) = 2^{k-2} + 2^{k-3} + \cdots + 2^{k-p-1} + 2^{r-1} q + m,$$

it follows that

$$2^{r-1} q + m > 2^{k-p-1},$$

or

$$|R| > 2^{r-1} q - q \ge 0.$$

Test $T_1$ at S(v'). Let $S_g$ and $S_d$ denote the partition of S(v') according as to whether the outcome of $T_1$ is negative or positive. Then

$$S_d = \Big\{\bigcup_{j \in J} (L_j \times G_{j1})\Big\} \cup (R \times B_{n'}),$$

with

$$|S_d| = q + 2^r q + m - 2^{k-p-1} - q = 2^r q + m - 2^{k-p-1},$$

and

$$S_g = \Big\{\bigcup_{w=2}^{2^r} \bigcup_{j \in J} (L_j \times G_{jw})\Big\} \cup (\{A \setminus R\} \times B_{n'}),$$

with $|S_g| = |S(v')| - |S_d| = 2^{k-p-1}$. Since $S_d$ is A-distinct, there exists an A-sharp algorithm for $S_d$ by Lemma 3.1.1. It remains to be shown that $S_g$ can be done in $k - p - 1$ tests. Note that $S_g$ can also be represented as

$$S_g = \Big\{\bigcup_{j \in J} \big(L_j \times \{G_j \setminus \{G_{j1}\} \cup \{B_{n'}\}\}\big)\Big\} \cup \Big(\Big\{(A \setminus R) \setminus \bigcup_{j \in J} L_j\Big\} \times B_{n'}\Big).$$

Since $|G_j \setminus \{G_{j1}\} \cup \{B_{n'}\}| = 2^r$, $|(A \setminus R) \setminus \bigcup_{j \in J} L_j|$ must also be a multiple of $2^r$. Partition $A - R - \{\bigcup_{j \in J} L_j\}$ into $2^r$ subsets of equal size, $H_1, \ldots, H_{2^r}$. Define

$$T_w = \Big\{\bigcup_{j \in J} G_{jw}\Big\} \cup H_w \quad \text{for } w = 2, 3, \ldots, 2^r.$$

Then the $T_w$'s are disjoint. By testing a sequence of proper combinations of $T_w$'s, it is easily seen that in r tests $S_g$ can be partitioned into $2^r$ subsets consisting of

$$S_w = \Big\{\bigcup_{j \in J} (L_j \times G_{jw})\Big\} \cup (H_w \times B_{n'}), \quad w = 2, \ldots, 2^r,$$

and

$$S_1 = S_g \setminus \Big\{\bigcup_{w=2}^{2^r} S_w\Big\}.$$

Furthermore, $S_w$ is A-distinct and $|S_w| = 2^{k-p-r-1}$ for each $w = 1, \ldots, 2^r$. Therefore each $S_w$ can be solved in $k - p - r - 1$ more tests by Lemma 3.1.1. This shows that the algorithm just described for S(v') is A-sharp. Therefore, the algorithm T' plus the extension on v' as described is A-sharp. □

Figure 3.1: An A-sharp algorithm for $A_3 \times B_{21}$.

Corollary 3.1.3 $M(m \times n) = \lceil \log mn \rceil$ for all m and n.

The corollary follows from Theorem 3.1.2 by way of the easily verifiable fact that $M(m \times n)$ is monotone nondecreasing in n.

It can be easily verified that if the $m \times n$ model is changed to A and B each containing "at least" one defective, and the problem is to identify one defective from each of A and B, then Corollary 3.1.3 remains true. Denote the new problem by $\bar{m} \times \bar{n}$. Then

Corollary 3.1.4 $M(\bar{m} \times \bar{n}) = \lceil \log mn \rceil$ for all m and n.

Ruszinko [11] studied the line version, i.e., A and B are ordered sets and a test group must consist of items from the top of the two orders. He showed that for m = 11 and $n = (2^{10k} - 1)/11$ (which is an integer by Fermat's theorem), the first test group must consist of 8 items from A and $2^{10k-4}$ items from B. But no second test can split the remaining samples into two parts both of sizes not exceeding $2^{10k-2}$. He also proved

Theorem 3.1.5 If $\log m + \log n - \lfloor \log m \rfloor - \lfloor \log n \rfloor > 0.8$, then $\lceil \log mn \rceil$ tests suffice.

Weng and Hwang [13] considered the case of k disjoint sets.


Theorem 3.1.6 Suppose that set $S_i$ has $2^{2^i} + 1$ items containing exactly one defective for $i = 0, 1, \ldots, m$. Then

$$M(S_0 \times S_1 \times \cdots \times S_m) = 2^{m+1}.$$

Proof. Since

$$\prod_{i=0}^{m} (2^{2^i} + 1) = 2^{2^{m+1}-1} + 2^{2^{m+1}-2} + \cdots + 2^1 + 2^0 > 2^{2^{m+1}-1},$$

$$M(S_0 \times S_1 \times \cdots \times S_m) \ge 2^{m+1}$$

by Theorem 1.2.1.

The reverse inequality is proved by giving an algorithm which requires only $2^{m+1}$ tests. Let $I_i$ be an arbitrary item from $S_i$, $i = 0, 1, \ldots, m$. Consider the sequence of subsets $J_1, \ldots, J_{2^{m+1}-1}$, where $J_{2^k} = \{I_k\}$ for $k = 0, 1, \ldots, m$ and $J_{2^k+j} = \{I_k\} \cup J_j$ for $1 \le j < 2^k$. It is easily proved by induction that $\sum_{I_i \in J_j} 2^i = j$ for $1 \le j \le 2^{m+1} - 1$. The actual testing of groups is done in the reverse order of $J_j$ until either a negative outcome is obtained or $J_1$ is tested positive. Note that a subset is tested only when all subsets containing it have been tested. Therefore $J_{2^{m+1}-1}, \ldots, J_{j+1}$ all tested positive and $J_j$ negative imply that items in $J_j$ are good and items in $\{I_0, I_1, \ldots, I_m\} \setminus J_j$ are defective. Thus the $(S_0 \times S_1 \times \cdots \times S_m)$ problem is reduced to the $(\prod_{I_i \in J_j} (S_i \setminus \{I_i\}))$ problem. The total number of tests is

$$2^{m+1} - j + \sum_{I_i \in J_j} 2^i = 2^{m+1}.$$

If $J_{2^{m+1}-1}, \ldots, J_1$ all tested positive, then each $I_i$, $0 \le i \le m$, has been identified as a defective through individual testing. Therefore all defectives are identified while the total number of tests is simply $2^{m+1} - 1$. □

3.2 An Application to Locating Electrical Shorts

A prevalent type of fault in the manufacture of electrical circuits is the presence of a short circuit ("short") between two nets of a circuit. Short testing constitutes a significant part of the manufacturing process. Several fault patterns can be described as variations or combinations of these types of faults. Applications of short testing range from printed circuit board testing and circuit testing to functional testing.

Short testing procedures can have two different objectives; a detecting procedure aims simply to detect the presence of a short, while a locating procedure identifies the shorted pairs of nets. A short detector in common use is an apparatus involving two connecting leads which, when connected respectively to two nets or groups of nets, can detect, but not locate, the presence of a short between these nets or groups of nets. An obvious way to use this device for short location is the so-called n-square testing, which checks each pair of nets separately. If the circuit has n nets to start with, then this location procedure requires $\binom{n}{2}$ tests.
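The cost of n-square testing can be made concrete with a short simulation. This is a minimal sketch, assuming the detector is modelled as a membership check against a hypothetical list of shorted pairs; the function name `n_square_testing` is an illustrative choice.

```python
from itertools import combinations
from math import comb

def n_square_testing(n, shorted_pairs):
    """Check every pair of nets separately; return (located_pairs, tests_used)."""
    shorts = {frozenset(pair) for pair in shorted_pairs}
    located, tests = [], 0
    for u, v in combinations(range(n), 2):
        tests += 1                          # one use of the two-lead detector
        if frozenset((u, v)) in shorts:
            located.append((u, v))
    return located, tests

# Hypothetical board with 6 nets and two shorted pairs.
located, tests = n_square_testing(6, [(1, 4), (2, 3)])
print(located, tests, comb(6, 2))
```

The number of tests depends only on n, not on how many shorts are present, which is why procedures that test groups of nets can do much better.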

Figure 3.2: A short detector (lead A connected to net u, lead B connected to net v).

Garey, Johnson, and So [7] proposed a procedure for detecting shorts between nets in printed circuit boards. They showed that no more than 5, 8, or 12 tests (depending on the particular assumptions) will ever be required, independent of the number of nets. However, it is not a locating procedure, and their assumptions allow only shorts occurring vertically or horizontally, not diagonally.

Skilling [12] proposed a clever locating procedure much more efficient than the n-square testing. In his method each net is tested individually against all other nets, collectively. Thus after n tests one obtains all the shorted nets, though not knowing to which other nets they are shorted. If there are d shorted nets, then $\binom{d}{2}$ more tests are required to determine which of these are shorted to which others. Thus the total number of tests is $n + \binom{d}{2}$, which can be significantly less than $\binom{n}{2}$ for n much larger than d.

Chen and Hwang [1] proposed a locating procedure, based on Theorem 3.1.2, which requires approximately $2(d + 1) \log n$ tests. Consider a circuit board with n networks. Without loss of generality assume $n = 2^a$. This testing process can be mathematically described as follows: Let $[\sum_{i=1}^{k} (a_i \times b_i)]$ denote the configuration where there exist 2k disjoint sets with $a_1, b_1, \ldots, a_k, b_k$ nets such that there exists a shorted pair between the nets in $a_i$ and the nets in $b_i$ for at least one i. Let $[\sum_{i=1}^{k} a_i]$ denote the configuration where there exist k disjoint sets with ... 1 let $i_t$ denote the integer such that

itM\ < 2'* < fit ' +1 Since no integer i is a solution of m = 2' for t > 1, no ambiguity arises from the definition of it. By the information lower bound (Theorem 1.2.1), it is clearly an upper bound of n 2 i + 1 - 2' = 2' and /2* + 1\

. < 2* implies - it - 1 > 25 .

For t odd, i ( = 2 2 . Therefore

( T ) - ( V ) = i{(^-D^-(2^-D(2^-2)} =

5 {(»«+i + l)»«+i " 2(t t + i - 1) " 2 t + 1 + 3 - 2 ^ - 4 }

=

{(?1+12+1)-2'} + {3.2ifi-(8(+1

t t + i + 2^ _

2t > 212 for t = 2k + 1 > 13 .

Proof. Constructions for $c_t$, $t \le 11$, were given in [4] and will not be repeated here. These $c_t$ equal $i_t - 1$ for $t \ge 4$ or $i_t$ for $t \le 3$, hence equal $n_t(2)$ by Theorems 3.3.3 and 1.2.1. For $t \ge 12$, there exists a generic algorithm as follows:

1. Test a group of $c_t - c_{t-1}$ items. If the outcome is negative, follow the $c_{t-1}$ construction.

2. If the outcome of the first test is positive, next test a group of m items from the untested group (with $c_{t-1}$ items) where

$$m = \begin{cases} 39 \cdot 2^{k-6} & \text{for } t = 2k, \\ 55 \cdot 2^{k-6} & \text{for } t = 2k + 1. \end{cases}$$

If the outcome is positive, the problem is reduced to the $(c_t - c_{t-1}) \times m$ problem. Since

$$(c_t - c_{t-1})\, m = \begin{cases} (26 \cdot 2^{k-6})(39 \cdot 2^{k-6}) < 2^{2k-2} & \text{for } t = 2k, \\ (37 \cdot 2^{k-6})(55 \cdot 2^{k-6}) < 2^{2k-1} & \text{for } t = 2k + 1, \end{cases}$$

by Corollary 3.1.3, $t - 2$ more tests suffice.

3. If the outcome of the second test is negative, next test a group of $\lfloor m/2 \rfloor$ items from the untested group (with $c_{t-1} - m$ items). If the outcome is positive, the problem is reduced to the $(c_t - c_{t-1}) \times \lfloor m/2 \rfloor$ problem. Since

$$(c_t - c_{t-1}) \lfloor m/2 \rfloor \le \tfrac{1}{2}(c_t - c_{t-1})\, m < 2^{t-3},$$

$t - 3$ more tests suffice. If the outcome is negative, a total of

$$c_t - m - \lfloor m/2 \rfloor \le \begin{cases} 89 \cdot 2^{k-6} - 39 \cdot 2^{k-6} - 19 \cdot 2^{k-6} = 31 \cdot 2^{k-6} < c_{t-3} & \text{for } t = 2k, \\ 63 \cdot 2^{k-5} - 55 \cdot 2^{k-6} - 27 \cdot 2^{k-6} = 44 \cdot 2^{k-6} < c_{t-3} & \text{for } t = 2k + 1 \end{cases}$$

items is left; hence $t - 3$ more tests suffice. □

Corollary 3.3.5 $c_t / n_t(2) > 0.983$.

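Proof. For $t \le 11$, $c_t / n_t(2) = 1$. For $t \ge 12$, substituting $i_t - 1$ for $n_t(2)$,

$$\frac{c_t}{n_t(2)} \ge \frac{c_t}{i_t - 1} > \begin{cases} \dfrac{89}{64\sqrt{2}} > 0.983 & \text{for } t = 2k, \\ \dfrac{63}{64} > 0.984 & \text{for } t = 2k + 1. \end{cases}$$ □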


Chang, Hwang and Weng also gave an improvement of the algorithm c. Let w be the algorithm resulting from the selection of the largest m satisfying < 2'~ 2 ,

(wt - wt-i)m

< 2'- 3 ,

ft-uH-iHfj i

m

i

( to /

2 IU< — I ^Wf—Wt-i

(

2«-3

\Wt—Wt-\

or equivalently, (wt - w^wt

- wt-3 + 2) - 3 • 2'" 3 < 0 .

Though this relaxation may decrease wt, it doesn't affect the asymptotic result. Define 5-3^2

«2 =

,

3

/8v^-7 12^2-9 \ r + p p,


a3

=

„ „ 9p 2 3 + 6p + — (lb

a4

18%/2 - 15


pq

27p\ -(-2+—)q>

=

as and

9p 2 a'3 = 3 + 6p + - i -

N o t e t h a t a! < 0 for p < p*. F u r t h e r m o r e , there exists a tv large enough such t h a t for all t >tp, i/2 + a3 < 0 ai2* + a 2 2 Also note t h a t tp is d e t e r m i n e d independent of q. Let q = 2('" +1 »/ 2 p T h e n it is easily verified 9 - > ° > P -

for t < tv

2 ( « + i)/2

T h e proof t h a t — >P-

2 ^

*»*>'»•

is by induction on t. It will be shown t h a t

satisfies t h e inequality (y - t c - O d / - u>,_3 + 2) - 3 • 2 ' - 3 < 0 . Since u>( is t h e largest integral solution of t h e relaxed inequality,

For easier writing define r = 2'^ 2 . T h e n ut < y2r, r / 2 - 3 / 2 . Hence

[(-

2(t+l)/2

ut-i

> r — 3/2 and ut-z

Ut + 1 - tOi_3 + 2 - 3 - 2 '

(,.^)^ + ,-(,-J)(r-l) (VS-Dpr + i + f - f f 1 a-\r2 — a 2 r + a 3 + a^r'1

+ a^r'

.^-i)

pr + 3 + — - 3gr-i

>


Since q > 0, air2 + a2r + a3 < air2 + a2r + a'3 < 0 for t > tp . Furthermore, -1 ,

a\r

-2

+ asr

-2 / °5

=

—a^r

=

a^r

= ^

2

I (

\

rI lSq

V.30 + 27p

r

\

j

( 30 + 27P - j ^

0

'

since l ^ l < 1 and 2^2 tp . ~

3.4 The 3-Defective Case

Chang, Hwang and Weng [4] gave the following result.

Theorem 3.4.1 There exists an algorithm h such that the $(3, h_t)$ problem can be solved in t tests where

$$h_t = \begin{cases} t + 1 & \text{for } t \le 7, \\ h_{3k+1} + 2^{k-1} & \text{for } t = 3k + 2 \ge 8, \\ c_{2k} + 2^{k-2} + 1 & \text{for } t = 3k \ge 9, \\ \lfloor (c_{2k+1} + c_{2k})/2 \rfloor + 3 \cdot 2^{k-3} + 1 & \text{for } t = 3k + 1 \ge 10, \end{cases}$$

except $h_{12} = 26$, $h_{13} = 32$, $h_{14} = 40$ and $h_{15} = 52$, where $c_t$ are given in Theorem 3.3.4.

Proof. For $t \le 7$ let h be the individual testing algorithm. For $t = 3k + 2 \ge 8$ let h be the following algorithm:

First test a group of $2^{k-1}$ items. If the outcome is negative, the number of remaining items is

$$h_{3k+2} - 2^{k-1} = h_{3k+1}.$$

Hence $3k + 1$ more tests suffice. If the outcome is positive, identify a defective in the contaminated group in $k - 1$ more tests. The number of remaining items is at most

$$h_{3k+2} - 1 \le 108 \cdot 2^{k-6} + 3 \cdot 2^{k-3} + 2^{k-1} = 41 \cdot 2^{k-4} < c_{2k+2}.$$


The total number of tests is at most

$$1 + (k - 1) + (2k + 2) = 3k + 2.$$

For $t = 3k$ let h be the following algorithm:

1. First test a group of $3 \cdot 2^{k-3}$ items. If the outcome is negative, the number of remaining items is

$$h_{3k} - 3 \cdot 2^{k-3} = 89 \cdot 2^{k-6} + 2^{k-2} + 1 - 3 \cdot 2^{k-3} = 81 \cdot 2^{k-6} + 1 < \lfloor 107 \cdot 2^{k-7} \rfloor + 3 \cdot 2^{k-4} + 1 + 2^{k-2} \le h_{3k-1}.$$

Hence $3k - 1$ more tests suffice.

2. If the outcome of the first test is positive, next test a group of $2^{k-2}$ items consisting of $2^{k-3}$ items from each of the contaminated group (with $3 \cdot 2^{k-3}$ items) and the untested group (with $h_{3k} - 3 \cdot 2^{k-3}$ items). If the outcome is negative, the size of the contaminated group is reduced to $2^{k-2}$. Identify a defective therein in $k - 2$ tests. The number of remaining items is at most

$$h_{3k} - 2^{k-2} - 1 \le c_{2k}.$$

The total number of tests is at most

$$2 + (k - 2) + 2k = 3k.$$

3. If the outcome of the second test is positive, then the unidentified items are divided into four sets A, B, C, D of sizes $2^{k-2}$, $2^{k-3}$, $2^{k-3}$ and $h_{3k} - 2^{k-1}$ such that $A \cup B$ and $B \cup C$ each contains a defective. Next test the set A. If the outcome is negative, then B must contain a defective. Identify it in $k - 3$ tests. The number of remaining items is at most

$$2^{k-3} - 1 + 2^{k-3} + h_{3k} - 2^{k-1} = h_{3k} - 2^{k-2} - 1,$$

which has been shown to be at most $c_{2k}$. The total number of tests is at most

$$3 + (k - 3) + 2k = 3k.$$

If the outcome is positive, then identify a defective in each of A and $B \cup C$ in $k - 2$ tests each. The number of remaining items is at most

$$h_{3k} - 2 = c_{2k} + 2^{k-2} - 1 = 89 \cdot 2^{k-6} + 2^{k-2} - 1 < 2^{k+1}.$$

Hence $k + 1$ more tests identify the last defective. The total number of tests is at most

$$3 + (k - 2) + (k - 2) + (k + 1) = 3k.$$


Finally, for $t = 3k + 1 \ge 10$, let h be the following algorithm: First test a group of $h_{3k+1} - h_{3k}$ items. If the outcome is negative, there are $h_{3k}$ items left and the problem can be solved in 3k more tests. If the outcome is positive, the contaminated group contains

h3k+1-h3k

= [3*±p^j + 2*-s L

63-44 J 2

+ 22

[37^!j + 2*-3

for k = 5 , for

k

>6 .

It is easily verified that

$$2^{k-2} < h_{3k+1} - h_{3k} < 2^{k-1}.$$

By Lemma 2.3.3, a defective can be identified from the contaminated group either in $k - 2$ tests, or in $k - 1$ tests with at least $2^{k-1} - (h_{3k+1} - h_{3k})$ good items also identified. In the first case, the number of remaining items is at most $h_{3k+1} - 1$, which can be verified to be $\le c_{2k+2}$ for all $k \ge 3$. In the second case, the number of remaining items is

h3k+1

- 1 - [2 4 " 1 - {h3k+1

=

2h3k+l

- h3k - 2k~l - 1

3d.

Proof. It suffices to prove Theorem 3.5.1 for M(d, 3d + 1) since M(d,n) is nondecreasing in n. For d = 1, M(1,4) = 2 < 4 - 1 by binary splitting. The general case is proved by induction on d.

Let T be a nested (line) algorithm which always tests a group of two items (unless only one is left) if no contaminated group exists. When a contaminated group of two items exists, T identifies a defective in it by another test. In the case that the item in the contaminated group being tested is good, a good item is identified at no additional cost. All other good items are identified in pairs, except that those good items appearing after the last defective are identified by deduction without testing. Note that T has at most 3d tests since the d defectives require 2d tests and another d tests identify 2d good items. As the number of good items is odd, one of them will either be identified for free or is the last item for which no test is needed (since it appears after the last defective).

Suppose that T takes 3d tests. Then before the last defective is identified, at least 2d good items must have been identified. If the last defective is identified with a free good item, then no more good items appear after the last defective; otherwise, at most one good item appears after it. In either case, before the last two tests which identify the last defective, at most two items, one good and one defective, are unidentified. By testing one of them the nature of the other item is deduced. Thus only 3d - 1 tests are needed. □

The latest progress towards proving the $n \le 3d$ conjecture is a result by Du and Hwang [6] which shows that individual testing is minimax for $n \le \lfloor 21d/8 \rfloor$.

Consider the following minimization problem. Let m and n be relatively prime positive integers, l an integer satisfying $0 \le l \le m + n - 2$, and $\lambda$ a positive number. Define $l_1 = \lfloor m(l + 1)/(m + n) \rfloor$. The problem is to locate the minimum of

\

mk + h J

over the nonnegative integers k — 0,1,2, • • • .


Let l2 = ln(l + l)/(m + n)\. Since m and n are relatively prime and / + 1 < m + n, neither m(l + l ) / ( m + n) nor n(l + l ) / ( m + n) can be an integer. Therefore

m(/ + l)

1 1 or c < 2(1 + l ) / ( m + n + 2(1 + 1)), f(x) = 1 has no nonnegative solution. If 1 > c > 1 — 1/2M, f(x) = 1 has a unique nonnegative solution, lying in the interval (1/2(1 — c) — M,c/2(1 — c) — (I + l ) / ( m + n)). If 1 - 1/2M > c > 2(1 + l ) / m + n + 2(1 + 1), f(x) = 1 either has no nonnegative solution or has a unique one, lying in the interval [0,c/2(l — c) — (I + l)/(m + n)]. Proof. See [6]. Theorem 3.5.3 M(d,n)

= n - 1 for n

213k~2 is proved by showing that l'K \ ,„i3i_2

min - ) / 2 —

>1


Theorem 3.5.3 then follows from Corollary 1.5.10 by setting / = t. Define

Then m = 4, n = 13, I = h = 0, A = 2" 1 3 . Compute M = max

'm + h

,

n + l2\

= 1,

mnnn M > = 7 TTTT = 0-7677 > 1 - — = 0.5 . (m + n)m+n\ 2 Therefore, from Theorem 3.5.2, f(k) = 1 has a unique nonnegative solution in the interval (1.15, 2.59). Namely, F(k), and hence r47fc*)/213k~2 attains a minimum at k = 2. Thus we have 1

c

T'( 1 4 7 *)/ 2iafc " a = ( ? ) / 2 M = L08>1 As the proofs for the other seven cases are analogous to case (i) but with different parameter values, only the values of the parameters are given in each case without further details. K° will always denote the value of k that minimizes fcj)/2"~2t~2. Case (ii). d = 8k + 1, n = 21k + 2, t = 4k + 1, 1.08 < K° < 2.54. 3 17* + A ,ol3*-2 a _ f ^

/o24

T V 4* J/^- =UJ/2M = l-*0>l. Case (iii). d = 8k + 2, n = 21Jfc + 5, t = 4fc + 2, 0.92 < K° < 2.42. 17fc + A ,„ *-i _ m i n ^ ' y " J / 213 - - 1 =

m

. J / 2 0 \ / o l 2 _ , 1 fi / 3 7 \ / o 2 5 i n | ^ j / 2 - = 1.18, ^ j / 2 " = 1.15J > 1 .

Case (iv). d = 8k + 3, n = 21k + 7, t = 4k + 2, 0.84 < K° < 2.30.

Case (v). d = 8k + 4, n = 21k + 10, t = 4k + 3, 0.69 < K° < 2.18.

Case (vi). d = 8k + 5, n = 21k + 13, t = 4k + 3, 0.54 < K° < 2.01.


Case (vii). d = 8k + 6, n = 21k + 15, t = 4k + 3, 0.40 < K° < 1.88.

Case (viii).
