Comput Optim Appl (2007) 38: 249–259 DOI 10.1007/s10589-007-9043-y
A bilinear algorithm for sparse representations

Pando Georgiev · Panos Pardalos · Fabian Theis
Published online: 20 July 2007 © Springer Science+Business Media, LLC 2007
Abstract We consider the following sparse representation problem: represent a given matrix X ∈ R^{m×N} as a product X = AS of two matrices A ∈ R^{m×n} (m ≤ n < N) and S ∈ R^{n×N}, under the requirements that every m × m submatrix of A is nonsingular and that S is sparse in the sense that each column of S has at least n − m + 1 zero elements. It is known that under some mild additional assumptions such a representation is unique, up to scaling and permutation of the rows of S. We show that finding A (the most difficult part of such a representation) can be reduced to a hyperplane clustering problem. We present a bilinear algorithm for this clustering, which is robust to outliers. A computer simulation example is presented showing the robustness of our algorithm.

Keywords Sparse component analysis · Blind source separation · Underdetermined mixtures
P. Georgiev ()
University of Cincinnati, ECECS Department, Cincinnati ML 0030, OH 45221, USA
e-mail: [email protected]

P. Georgiev
Sofia University “St. Kliment Ohridski”, 5 J. Bourchier Blvd., 1126 Sofia, Bulgaria

P. Pardalos
University of Florida, Center for Applied Optimization, Gainesville, FL, USA
e-mail: [email protected]

F. Theis
Institute of Biophysics, University of Regensburg, 93040 Regensburg, Germany
e-mail: [email protected]
1 Introduction

Representation of data has become a fundamental question in data analysis, signal processing, data mining, neuroscience, etc. Each day a wealth of data is created: literally millions of large data sets are generated in medical imaging, surveillance, and scientific experiments. In addition, the Internet has become a communication medium with vast capacity, generating massive traffic data sets. However, such a wealth of data will only be useful to the extent that it can be processed efficiently and meaningfully, whether it be for storage, transmission, visual display, fast on-line graphical query, correlation, or registration against data from other modalities.
In the present paper we consider a linear representation of a given matrix exploiting sparsity. Such a linear representation of a given data set X (in the form of an (m × N)-matrix) is

X = AS,   A ∈ R^{m×n}, S ∈ R^{n×N}.   (1)
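To make the setting concrete, here is a small NumPy sketch (our own illustration; the variable names and dimensions are ours) that builds an instance of the representation (1) in which every column of S has at least n − m + 1 zeros, i.e. at most m − 1 nonzero entries, as required in the problem statement above:

```python
import numpy as np

rng = np.random.default_rng(0)

m, n, N = 3, 5, 200                        # m <= n < N, as in the problem statement
A = rng.standard_normal((m, n))            # mixing matrix (dictionary)

S = rng.standard_normal((n, N))            # start from dense sources
# enforce the sparsity requirement: each column keeps at most m-1 nonzeros,
# i.e. it has at least n-m+1 zero entries
for j in range(N):
    zero_idx = rng.choice(n, size=n - m + 1, replace=False)
    S[zero_idx, j] = 0.0

X = A @ S                                  # observed data matrix
assert np.all((S != 0).sum(axis=0) <= m - 1)
```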
Under different assumptions on the unknown matrices A (dictionary) and S (source signals) this linear representation is known as:
1. the Independent Component Analysis (ICA) problem, if the rows of S are (discrete) random variables which are as statistically independent as possible;
2. Nonnegative Matrix Factorization (NMF) (see [10]), if the elements of X, A and S are nonnegative;
3. the Sparse Representation or Sparse Component Analysis (SCA) problem, if S contains as many zeros as possible.
There is a large literature devoted to ICA problems (see for instance [5, 8] and references therein), but mostly for the case m ≥ n. We refer to [7, 9, 12–14] and references therein for some recent papers on SCA and underdetermined ICA (m < n).
A related problem is the so-called Blind Source Separation (BSS) problem, in which we know a priori that a representation such as (1) exists and the task is to recover the sources (and the mixing matrix) as accurately as possible. A fundamental property of the complete BSS problem is that such a recovery (under the assumption in item 1 and non-Gaussianity of the sources) is unique up to permutation and scaling of the sources, which makes the BSS problem so attractive. Its applications span a large range: data mining, biomedical signal analysis and processing, geophysical data processing, speech enhancement, image reconstruction, wireless communications, etc. A similar uniqueness of the recovery is valid if, instead of independence, we assume some sparsity of the source matrix [7]. So the sparse representation method proposed here provides an effective alternative to the ICA approach for BSS problems, as well as a robust alternative in the presence of noise and outliers.
The main contribution of this paper is a new algorithm for recovering the mixing matrix in (1), based on weak sparsity assumptions on the source matrix S. The source matrix can afterwards be recovered by the source recovery algorithm from [7].
We have to note the fundamental difference between recovery of sparse signals by our method and by l1-norm minimization. The l1-norm minimization gives solutions which generally have at most m non-zeros [4, 6] and in some special cases gives the sparsest solution [6].
We utilize the fact that for almost all data vectors x (in a measure sense) such that the system x = As has a sparse solution with fewer than m nonzero elements, this solution is unique (see [7]), while in [6] the authors proved that for all data vectors x such that the system x = As has a sparse solution with fewer than Spark(A)/2 nonzero elements, this solution is unique. Note that Spark(A) ≤ m + 1, where Spark(A) is the smallest number of linearly dependent columns of A. So our assumptions for uniqueness of the representation (up to permutation and scaling of the sources) are less restrictive, since we allow more nonzero elements in the solution, namely m − 1.

2 Blind source separation based on sparsity assumptions

In this section we present the method in [7] for solving the BSS problem if the following assumptions are satisfied:
(A1) the mixing matrix A ∈ R^{m×n} has the property that every square m × m submatrix of it is nonsingular;
(A2) each column of the source matrix S has at most m − 1 nonzero elements;
(A3) the sources are sufficiently richly represented in the following sense: for every index set of n − m + 1 elements I = {i_1, . . . , i_{n−m+1}} ⊂ {1, . . . , n} there exist at least m column vectors of the matrix S such that each of them has zero elements in the places with indices in I and each m − 1 of them are linearly independent.

2.1 Matrix identification

We describe conditions in the sparse BSS problem under which we can identify the mixing matrix uniquely up to permutation and scaling of the columns.

2.1.1 General case—full identifiability

Theorem 1 (Identifiability conditions—general case [7]) Assume that in the representation X = AS the matrix A satisfies condition (A1), the matrix S satisfies conditions (A2) and (A3) and only the matrix X is known. Then the mixing matrix A is identifiable uniquely up to permutation and scaling of the columns.

The following matrix identification algorithm is based on the proof of Theorem 1. The main idea is that the weak sparsity assumptions on the source matrix imply that the data points lie on finitely many hyperplanes (the skeleton of the data set), which can be estimated by clustering. The columns of the mixing matrix lie on intersections of some of these hyperplanes, and their normalized estimates can be found, again by clustering, from the normal vectors of the hyperplanes.

Algorithm 1 SCA matrix identification algorithm [7]
Data: samples x(1), . . . , x(T) of X.
Result: estimated mixing matrix Â.
Hyperplane identification
1. Cluster the columns of X in $\binom{n}{m-1}$ groups H_k, k = 1, . . . , $\binom{n}{m-1}$, such that the span of the elements of each group H_k produces one hyperplane and these hyperplanes are different.
Matrix identification
2. Cluster the normal vectors to these hyperplanes in the smallest number of groups G_j, j = 1, . . . , n (which gives the number of sources n), such that the normal vectors to the hyperplanes in each group G_j lie in a new hyperplane Ĥ_j.
3. Calculate the normal vectors â_j to each hyperplane Ĥ_j, j = 1, . . . , n.
4. The matrix Â with columns â_j is an estimate of the mixing matrix (up to permutation and scaling of the columns).

2.2 Identification of sources

Theorem 2 (Uniqueness of sparse representation [7]) Let H be the set of all x ∈ R^m such that the linear system As = x has a solution with at least n − m + 1 zero components. If A fulfills (A1), then there exists a subset H_0 ⊂ H with measure zero with respect to H, such that for every x ∈ H \ H_0 this system has no other solution with this property.

From Theorem 2 it follows that the sources are identifiable generically, i.e. up to a set with measure zero, if they have a level of sparseness greater than or equal to n − m + 1 (i.e., each column of S has at least n − m + 1 zero elements) and the mixing matrix is known. Below we present an algorithm based on the observation in Theorem 2. The algorithm identifies the hyperplane from the skeleton to which each data vector x belongs and finds the coefficients by which x is represented as a linear combination of columns of the mixing matrix estimated by Algorithm 1.

Algorithm 2 Source Recovery Algorithm [7]
Data: samples x(1), . . . , x(N) (column vectors) of the data matrix X, and the mixing matrix A.
Result: estimated source matrix S.
1. Identify the set of hyperplanes H produced by taking the linear hull of every subset of the columns of A with m − 1 elements;
2. Repeat for i = 1 to N:
3. Identify the subspace H ∈ H containing x_i := X(:, i), or, in a practical situation with noise present, identify the one to which the distance from x_i is minimal, and project x_i onto H to obtain x̃_i;
4. If H is produced by the linear hull of the column vectors a_{i_1}, . . . , a_{i_{m−1}}, then find coefficients λ_{i,j} such that

$\tilde{x}_i = \sum_{j=1}^{m-1} \lambda_{i,j} a_{i_j}$.
These coefficients are uniquely determined if x̃_i does not belong to the set H_0 of measure zero with respect to H (see Theorem 2);
5. Construct the solution S(:, i): it contains λ_{i,j} in place i_j for j = 1, . . . , m − 1, and the rest of its components are zero.
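A minimal Python sketch of the source recovery step, written by us from the description of Algorithm 2 (the function and variable names are ours, and the hyperplanes are enumerated by brute force over all index sets of m − 1 columns of A):

```python
import numpy as np
from itertools import combinations

def recover_sources(X, A):
    """Sketch of Algorithm 2: recover S from data X and (estimated) mixing matrix A."""
    m, n = A.shape
    N = X.shape[1]
    S = np.zeros((n, N))
    # enumerate the hyperplanes spanned by every m-1 columns of A
    index_sets = [list(idx) for idx in combinations(range(n), m - 1)]
    bases = [A[:, idx] for idx in index_sets]          # m x (m-1) bases
    # unit normal of span(B): left singular vector for the zero singular value
    normals = [np.linalg.svd(B)[0][:, -1] for B in bases]
    for i in range(N):
        x = X[:, i]
        # pick the hyperplane with minimal distance |w^T x|
        k = int(np.argmin([abs(w @ x) for w in normals]))
        B, w = bases[k], normals[k]
        x_proj = x - (w @ x) * w                        # project x onto the hyperplane
        coef, *_ = np.linalg.lstsq(B, x_proj, rcond=None)
        S[index_sets[k], i] = coef                      # place coefficients, rest stays zero
    return S
```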
3 Sparse component analysis

Here we review a method from [7] for solving the SCA problem. Now the conditions are formulated only in terms of the data matrix X.

Theorem 3 (SCA conditions [7]) Assume that m ≤ n ≤ N and the matrix X ∈ R^{m×N} satisfies the following conditions:
(i) the columns of X lie in the union H of $\binom{n}{m-1}$ different hyperplanes, each column lies in only one such hyperplane, and each hyperplane contains at least m columns of X such that each m − 1 of them are linearly independent;
(ii) for each i ∈ {1, . . . , n} there exist $p = \binom{n-1}{m-2}$ different hyperplanes $\{H_{i,j}\}_{j=1}^{p}$ in H such that their intersection $L_i = \bigcap_{j=1}^{p} H_{i,j}$ is a one-dimensional subspace;
(iii) any m different L_i span the whole R^m.
Then the matrix X is representable uniquely (up to permutation and scaling of the columns of A and rows of S) in the form X = AS, where the matrices A ∈ R^{m×n} and S ∈ R^{n×N} satisfy the conditions (A1) and (A2), (A3) respectively.

4 Skeletons of a finite set of points

In this section we present the main idea of our method. Let X be a finite set of points represented by the columns $\{x_j\}_{j=1}^{N}$ of the matrix X ∈ R^{m×N}. The solution $\{(w_i^0, b_i^0)\}_{i=1}^{n}$ of the following minimization problem:

minimize  $\sum_{j=1}^{N} \min_{1\le i\le n} |w_i^T x_j - b_i|$
subject to  $\|w_i\| = 1$,  $b_i \in R$,  $i = 1, \dots, n$,   (2)
defines the w^(1)-skeleton of X (introduced in [11]). It consists of a union of n hyperplanes H_i = {x ∈ R^m : w_i^T x = b_i}, i = 1, . . . , n, such that the sum of the minimum distances of every point x_j to these hyperplanes is minimal. Analogously, the solution of the following minimization problem:

minimize  $\sum_{j=1}^{N} \min_{1\le i\le n} |w_i^T x_j - b_i|^2$
subject to  $\|w_i\| = 1$,  $b_i \in R$,  $i = 1, \dots, n$,   (3)
defines the w^(2)-skeleton of X (introduced in [2]). It consists of a union of n hyperplanes $\{H_i\}_{i=1}^{n}$ (defined as above) such that the sum of the squared minimum distances of every point x_j to these hyperplanes is minimal.
It is clear that if the representation X = AS is sparse (in the sense that each column of S contains at most m − 1 non-zero elements, i.e. it satisfies condition (A2)), then the two skeletons defined above coincide and the data points (the columns of X) lie on them.
Conversely, if the two skeletons defined above coincide, the columns of X lie on them and the conditions of Theorem 3 are satisfied, then finding these skeletons leads to finding the mixing matrix and, subsequently, a sparse representation of X in the form (1), which is unique (up to scaling and permutation of the rows of S).
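As a small illustration (ours; the helper name is hypothetical), the objectives in (2) and (3) can be evaluated for a candidate family of hyperplanes as follows; on data satisfying (A2) both costs vanish on the true skeleton:

```python
import numpy as np

def skeleton_cost(X, W, b, squared=False):
    """Objective of problem (2), or of (3) if squared=True.

    X : (m, N) data matrix; W : (n, m), rows are unit normals w_i; b : (n,) offsets.
    """
    # distance of point x_j from hyperplane i is |w_i^T x_j - b_i| (unit normals assumed)
    R = np.abs(W @ X - b[:, None])          # shape (n, N)
    d = R.min(axis=0)                        # distance of each point to its closest hyperplane
    return np.sum(d**2) if squared else np.sum(d)
```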
5 Reduction of the clustering problem to a bilinear optimization problem

Below we present a bilinear algorithm for finding a modified skeleton, replacing the condition $\|w_i\| = 1$ in (2) with the condition $w_i^T e = 1$, where the vector e has norm one.

Proposition 4 Assume that the w^(1)-skeleton of X is unique and any column of X lies in only one hyperplane of this skeleton. If the minimum value of the cost function in the clustering problem (2) is zero and b_i = 0, then finding the w^(1)-skeleton is equivalent to solving the following bilinear minimization problem (with variables t_{ij} ∈ R, u_{ij} ∈ R, w_i ∈ R^m):

minimize  $\sum_{j=1}^{N} \sum_{i=1}^{n} t_{ij} u_{ij}$   (4)
subject to  $-u_{ij} \le w_i^T x_j \le u_{ij}$,  ∀i, j,   (5)
$\sum_{i=1}^{n} t_{ij} = 1$,  $t_{ij} \ge 0$,  ∀i, j,   (6)
$w_i^T e = 1$,  ∀i.   (7)
The solution of (4–7) is unique and for any j only one component of t_{ij}, i = 1, . . . , n, is one and the rest are zero, i.e. the matrix T = {t_{ij}} gives the cluster assignment.

Remark A necessary condition for the minimum value of the cost function in the minimization problem (4) to be zero is that no x_j is collinear with the vector e.

Proof of Proposition 4 Let $\{w_i^0\}_{i=1}^{n}$ be the solution of (2). Put t_{ij} = 1 if $i = \arg\min_{1\le k\le n} |w_k^{0T} x_j|$ and t_{ij} = 0 otherwise. Such an arg min is uniquely defined by the assumption that any column of X lies in only one hyperplane of the skeleton. Putting also $w_i = w_i^0 / (w_i^{0T} e)$ and $u_{ij} = |w_i^T x_j|$, we obtain a solution of (4–7) given by w_i, t_{ij}, u_{ij}, i = 1, . . . , n, j = 1, . . . , N.
Conversely, let w_i, t_{ij}, u_{ij}, i = 1, . . . , n, j = 1, . . . , N, be a solution of (4–7) (with minimum value zero). By (6), for any j there is at least one index i_j such that $t_{i_j j} \neq 0$, which implies that $u_{i_j j} = 0$, i.e. x_j lies in a hyperplane with normal vector $w_{i_j}$. Since any column of X lies in only one hyperplane of the skeleton, we have $w_i^T x_j \neq 0$ for every $i \neq i_j$, which implies t_{ij} = 0. Putting $w_i^0 = w_i / \|w_i\|$ we see that $\{w_i^0\}_{i=1}^{n}$ is a solution of (2).
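For a fixed assignment matrix T = {t_{ij}}, problem (4–7) decouples into one linear program per hyperplane in the variables (w_i, {u_{ij}}). The following sketch (our own encoding of the constraints using scipy.optimize.linprog, not code from the paper) solves this w-step for a single cluster of assigned points:

```python
import numpy as np
from scipy.optimize import linprog

def fit_hyperplane_lp(Xi, e):
    """w-step of (4)-(7) for one cluster.

    Xi : (m, k) points currently assigned to this hyperplane; e : unit vector.
    Returns w with w^T e = 1 minimising sum_j |w^T x_j| over the assigned points.
    """
    m, k = Xi.shape
    # variables z = (w, u); cost = sum(u)
    c = np.concatenate([np.zeros(m), np.ones(k)])
    #  w^T x_j - u_j <= 0   and   -w^T x_j - u_j <= 0   encode (5)
    A_ub = np.block([[ Xi.T, -np.eye(k)],
                     [-Xi.T, -np.eye(k)]])
    b_ub = np.zeros(2 * k)
    A_eq = np.concatenate([e, np.zeros(k)]).reshape(1, -1)   # e^T w = 1, i.e. (7)
    b_eq = np.array([1.0])
    bounds = [(None, None)] * m + [(0, None)] * k            # w free, u >= 0
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds, method="highs")
    return res.x[:m]
```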
In the case when the assumptions of Proposition 4 are satisfied, except for the assumption that the minimum value of the cost function in the clustering problem (2) is zero, we have that the minimization problem (4–7) and the following problem

minimize  $\sum_{j=1}^{N} \min_{1\le i\le n} |w_i^T x_j|$
subject to  $w_i^T e = 1$,  i = 1, . . . , n,   (8)
are equivalent. The proof is similar to that of Proposition 4, having in mind that

$\sum_{j=1}^{N} \sum_{i=1}^{n} t_{ij} u_{ij} \ge \sum_{j=1}^{N} \min_{1\le i\le n} u_{ij}$,
as equality is achieved only if $t_{i_j j} = 1$, where $i_j = \arg\min_{1\le i\le n} u_{ij}$. If f_min and g_min are the minimum values of the minimization problems (2) and (8) respectively, we have f_min ≤ g_min, so if g_min is near zero we can expect that their solutions coincide (this is the case in the example presented below).
The problem (8) has the advantage that it can be solved by linear programming (reformulated as the problem (4–7)), which is robust to outliers. This is stated in the following proposition.

Proposition 5 Assume that the vector e with norm 1 is chosen in such a way that the following condition is satisfied: if we take any two disjoint subsets J_1 and J_2 of {1, . . . , N}, then the vector e is not collinear with the vector $\sum_{j\in J_1} x_j - \sum_{j\in J_2} x_j$. If $\{w_i\}_{i=1}^{n}$ is a solution of (8), then for every i ∈ {1, . . . , n} there is at least one j ∈ {1, . . . , N} such that $w_i^T x_j = 0$, i.e. the hyperplanes with normal vectors w_i each contain at least one column vector of X.

Remark Almost all vectors e (in a measure sense) in the unit sphere in R^m satisfy the assumption in Proposition 5, so in practice e is randomly chosen.

Proof of Proposition 5 Denote

$J_i = \{ j : |w_i^T x_j| = \min_{1\le k\le n} |w_k^T x_j| \}$.

Then w_i is a solution of the following minimization problem:

minimize  $f(w) = \sum_{j\in J_i} |w^T x_j|$  subject to  $w^T e = 1$.   (9)

Let i be fixed. Denote for simplicity w = w_i. Define

$I^+ = \{ j \in J_i : w^T x_j > 0 \}$,  $I^- = \{ j \in J_i : w^T x_j < 0 \}$,  $I^0 = \{ j \in J_i : w^T x_j = 0 \}$.

We have to prove that $I^0 \neq \emptyset$.
By Kuhn–Tucker’s theorem in its subdifferential form and by the chain rule for subdifferentials (see for instance [1]) there exist ε_j ∈ [−1, 1], j ∈ I^0, and λ ∈ R such that

$\sum_{j\in I^+} x_j - \sum_{j\in I^-} x_j + \sum_{j\in I^0} \varepsilon_j x_j = \lambda e$.   (10)
Here we used the fact that the subdifferential of the function | · | at zero is equal to [−1, 1]. Multiplying (10) by w we obtain λ = f_min. If f_min = 0, the proposition is proved. Otherwise λ ≠ 0, and if we assume that I^0 = ∅, then (10) yields a contradiction with the assumption on e.
The above proposition shows that our method for finding skeletons is robust to outliers in the following sense: if most of the column vectors of X lie on hyperplanes, finding these hyperplanes can be performed by solving the linear programming problem (4–7). Note that such robustness to outliers is not unexpected: it appears, for instance, in linear regression and clustering problems when the l2 (Euclidean) norm is replaced by the l1 norm (see for instance [3]).
The separated linear programming problems (with respect to {t_{ij}} and with respect to ({w_i}_{i=1}^{n}, {u_{ij}})) can be solved alternately, which leads to the following analogue of the k-median clustering algorithm of Bradley, Mangasarian and Street [3] (see Algorithm 3). Note that the linear programming problem with respect to {t_{ij}} is the cluster assignment step, which is realized simply as an assignment. The finite termination of the algorithm can be proved analogously to the proof of Theorem 3.7 in [2].

Algorithm 3 n-Planes clustering algorithm via linear programming
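The pseudocode box of Algorithm 3 is not reproduced in this text; the sketch below is our reading of the alternating scheme just described (it reuses fit_hyperplane_lp from the previous sketch; the function name, initialization, and stopping rule are our choices): assign each point to the hyperplane minimizing |w_i^T x_j|, re-fit each w_i by the linear program, and stop when the assignment no longer changes.

```python
import numpy as np
# assumes fit_hyperplane_lp (defined above) is available

def n_planes_clustering(X, n, e=None, max_iter=50, rng=None):
    """Sketch of alternating n-planes clustering via LP (cf. Algorithm 3)."""
    rng = np.random.default_rng() if rng is None else rng
    m, N = X.shape
    if e is None:
        e = rng.standard_normal(m)
        e /= np.linalg.norm(e)                 # random unit vector, as in Proposition 5
    # random initial normals, scaled so that w_i^T e = 1
    W = rng.standard_normal((n, m))
    W /= (W @ e)[:, None]
    labels = np.full(N, -1)
    for _ in range(max_iter):
        # cluster assignment step: smallest residual |w_i^T x_j|
        new_labels = np.argmin(np.abs(W @ X), axis=0)
        if np.array_equal(new_labels, labels):
            break                              # assignment unchanged: finite termination
        labels = new_labels
        # w-step: one LP per cluster
        for i in range(n):
            idx = np.where(labels == i)[0]
            if idx.size >= m - 1:              # enough points to define a hyperplane
                W[i] = fit_hyperplane_lp(X[:, idx], e)
    return W, labels
```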
6 Computer simulation example

We present an example, created in Matlab, showing the robustness of our algorithm to outliers. We created a matrix S0 ∈ R^{5×120} from 5 random variables with normal distribution, using the command S0 = randn(5, 120) (in Matlab notation). We set to zero two elements (randomly chosen) in each column of S0 and obtained a 2-sparse source matrix S, plotted in Fig. 1. We randomly created a normalized mixing matrix (the norm of each column is one)
Fig. 1 Example. Artificially created 2-sparse source signals
H = [  0.5688   0.1983  −0.4287  −0.3894  −0.7692
       0.4337  −0.4531  −0.2374  −0.0730   0.2047
      −0.6317  −0.2109  −0.1637   0.8516  −0.6037
      −0.2988  −0.8432  −0.8562  −0.3432  −0.0443 ]

and created the mixed signal matrix as X = HS. We added to the first 20 columns of X a noise matrix N created as N = 0.1 ∗ randn(4, 20) (in Matlab notation). We ran our n-planes clustering algorithm several times: after obtaining a hyperplane such that at least 5 columns of X are within distance 10^{−4} of it, we removed these columns and ran the algorithm again in the same way, until we had obtained 10 hyperplanes. They produced the following estimate of the mixing matrix (after running the Matrix identification part of Algorithm 1, which reconstructs the matrix from the hyperplanes):

A = [  0.7692   0.1983   0.3894   0.4287   0.5688
      −0.2047  −0.4531   0.0730   0.2374   0.4337
       0.6037  −0.2109  −0.8516   0.1637  −0.6317
       0.0443  −0.8432   0.3432   0.8562  −0.2988 ]

Such a perfect reconstruction was not possible when running the algorithm with a precision lower than 10^{−4}. After running the source recovery algorithm (see Algorithm 2) we obtained the recovered signals plotted in Fig. 2.

Fig. 2 Example. Recovered source signals using the recovered mixing matrix A. The samples from 20 to 120 recover perfectly (after permutation and scaling) the original source signals
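The data-generation part of this experiment can be reproduced along the following lines in NumPy (an approximate re-creation, not the authors’ Matlab script; the random draws, and hence the recovered matrices, will of course differ):

```python
import numpy as np

rng = np.random.default_rng(1)

# 2-sparse sources: 5 signals, 120 samples, two randomly chosen zeros per column
S0 = rng.standard_normal((5, 120))
for j in range(120):
    S0[rng.choice(5, size=2, replace=False), j] = 0.0

# random mixing matrix with unit-norm columns (4 mixtures of 5 sources)
H = rng.standard_normal((4, 5))
H /= np.linalg.norm(H, axis=0)

X = H @ S0
X[:, :20] += 0.1 * rng.standard_normal((4, 20))   # outliers: noise on the first 20 columns
```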
7 Conclusion

We developed a new algorithm for hyperplane clustering which, in combination with other algorithms, leads to a sparse representation of signals, in the sense that we can reconstruct uniquely (up to permutation and scaling) a given linear mixture as the product of a mixing matrix and a sparse source matrix. Our algorithm has the advantages that it uses linear programming and is robust to outliers. An example is presented showing the robustness of our algorithm.
References

1. Alekseev, V.M., Tihomirov, V.M., Fomin, S.V.: Optimal Control. Nauka, Moscow (1979) (in Russian)
2. Bradley, P.S., Mangasarian, O.L.: k-Plane clustering. J. Glob. Optim. 16(1), 23–32 (2000)
3. Bradley, P.S., Mangasarian, O.L., Street, W.N.: Clustering via concave minimization. In: Advances in Neural Information Processing Systems, vol. 9, pp. 368–374. MIT Press, Cambridge (1997). See also Mathematical Programming Technical Report 96-03 (May 1996)
4. Chen, S., Donoho, D., Saunders, M.: Atomic decomposition by basis pursuit. SIAM J. Sci. Comput. 20(1), 33–61 (1998)
5. Cichocki, A., Amari, S.: Adaptive Blind Signal and Image Processing. Wiley, Chichester (2002)
6. Donoho, D., Elad, M.: Optimally sparse representation in general (nonorthogonal) dictionaries via l1 minimization. Proc. Natl. Acad. Sci. U.S.A. 100(5), 2197–2202 (2003)
7. Georgiev, P., Theis, F., Cichocki, A.: Sparse component analysis and blind source separation of underdetermined mixtures. IEEE Trans. Neural Netw. 16(4), 992–996 (2005)
8. Hyvärinen, A., Karhunen, J., Oja, E.: Independent Component Analysis. Wiley, New York (2001)
9. Lee, T.-W., Lewicki, M.S., Girolami, M., Sejnowski, T.J.: Blind source separation of more sources than mixtures using overcomplete representations. IEEE Signal Process. Lett. 6(4), 87–90 (1999)
10. Lee, D.D., Seung, H.S.: Learning the parts of objects by non-negative matrix factorization. Nature 401, 788–791 (1999)
11. Rubinov, A.M., Ugon, J.: Skeletons of finite sets of points. Research working paper 03/06, School of Information Technology & Mathematical Sciences, University of Ballarat (2003)
12. Theis, F.J., Lang, E.W., Puntonet, C.G.: A geometric algorithm for overcomplete linear ICA. Neurocomputing 56, 381–398 (2004)
13. Waheed, K., Salem, F.: Algebraic overcomplete independent component analysis. In: Proceedings of the International Conference ICA 2003, Nara, Japan, pp. 1077–1082 (2003)
14. Zibulevsky, M., Pearlmutter, B.A.: Blind source separation by sparse decomposition in a signal dictionary. Neural Comput. 13(4), 863–882 (2001)