Here I show some of the internal thinking behind recent work in which I prove a theorem that I stated as a conjecture in an earlier, related manuscript. This arose from a thought I had a while ago: can we achieve a Gram-Schmidt-like process without the use of inner products? If so, I thought I could use it to produce interesting new matrix factorizations. This proved hard to even make meaningful: what does orthogonality even mean without inner products? I give some of the context below, state my generalization of “orthogonal” to settings without inner products, and then prove that my algorithm achieves these pseudo-orthogonal results.
\(\newcommand{\norm}[1]{\left\lVert#1\right\rVert}\)
The Gram-Schmidt algorithm takes as input a sequence of linearly independent vectors \( A_1,A_2,\ldots,A_n \) and produces a new set of linearly independent vectors \( Q_1,Q_2,\ldots,Q_n \) such that \( span(Q_1,Q_2,\ldots,Q_n) = span(A_1,A_2,\ldots,A_n)\) (see the definition of span), and also \( Q_i \cdot Q_j = 0 \) whenever \( i\neq j \) while \( Q_i \cdot Q_i = 1\) for every \( i \). Here I’m using the standard Euclidean dot product.
What is the utility of this orthogonality property? Orthogonality essentially makes it easy to accurately solve systems of linear equations \( \mathbf{Q}\mathbf{x}=\mathbf{b} \) where \( \mathbf{Q} = [Q_1,Q_2,\ldots,Q_n] \). This results from the simple consequence of orthogonality that \( \mathbf{Q}^T \mathbf{Q} = \mathbf{I} \): multiplying \( \mathbf{Q}\mathbf{x}=\mathbf{b} \) on the left by \( \mathbf{Q}^T \) gives \( \mathbf{x} = \mathbf{Q}^T\mathbf{b} \), so no elimination or factorization is needed.
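As a quick illustration (not part of the argument above, and using numpy's built-in QR factorization as a stand-in for Gram-Schmidt), orthonormal columns turn the linear solve into a single matrix-vector product:

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))         # columns are linearly independent with probability 1
Q, _ = np.linalg.qr(A)                  # orthonormalize the columns of A

print(np.allclose(Q.T @ Q, np.eye(5)))  # True: Q^T Q = I
b = rng.standard_normal(5)
x = Q.T @ b                             # solve Qx = b without any elimination
print(np.allclose(Q @ x, b))            # True
```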
Thus, to produce an algorithm that generalizes the Gram-Schmidt algorithm, I should seek to reproduce this utility of orthogonality. The utility isn’t necessarily the vanishing dot products themselves, but rather that they result in easily solved systems of linear equations. Below I give a precise mathematical way to measure how hard it is to solve linear systems with a given matrix: conditioning.
If we take \( \norm{ \mathbf{x} }_ 2 = \sqrt{\mathbf{x} \cdot \mathbf{x}} \) then the matrix \(\mathbf{Q} \) mentioned above has the property that \( \norm{ \mathbf{Q}\mathbf{x} }_ 2 = \norm{ \mathbf{x} }_ 2 \) for any \( \mathbf{x} \) regardless of the input matrix \( \mathbf{A} = [A_1,A_2,\ldots,A_n] \).
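For completeness, this is a one-line consequence of \( \mathbf{Q}^T\mathbf{Q}=\mathbf{I} \):
\[
\norm{\mathbf{Q}\mathbf{x}}_2^2 \;=\; (\mathbf{Q}\mathbf{x})\cdot(\mathbf{Q}\mathbf{x}) \;=\; \mathbf{x}^T\mathbf{Q}^T\mathbf{Q}\,\mathbf{x} \;=\; \mathbf{x}^T\mathbf{x} \;=\; \norm{\mathbf{x}}_2^2 .
\]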
I generalize this property as follows: Suppose that \( \norm{ \cdot } \) is any norm and that \( F \) is an algorithm (such as Gram-Schmidt) which, given any input matrix \( \mathbf{A} \), outputs \( \mathbf{Q}=F(\mathbf{A}) \) such that \( span(Q_1,\ldots,Q_n) = span(A_1,\ldots,A_n)\). I then require that there exist \(C_1>0,C_2>0\) such that for every \( \mathbf{A} \) and \( \mathbf{x} \) we have
\[
C_1 \norm{\mathbf{x}} \;\leq\; \norm{\mathbf{Q}\mathbf{x}} \;\leq\; C_2 \norm{\mathbf{x}} .
\]
In other words: I require the algorithm \( F \) to never produce matrices \( \mathbf{Q}=F(\mathbf{A}) \) that shrink or grow input vectors by too much. How much they may do so is controlled ahead of time by the constants \(C_1,C_2\), and these do not depend on the input matrix \( \mathbf{A} \). I now give an \( F \) which achieves this and then prove that it does.
Here I assume we have a sequence of \( n\) linearly independent vectors \(A_1,\ldots,A_n\) and that \( \norm{ \cdot } \) is any norm whatsoever (not necessarily Euclidean). I define \(Q_1,\ldots,Q_n\) inductively as follows:
\[
Q_1 = \frac{A_1}{\norm{A_1}},
\]
and for any integer \( k>1 \) I define
\[
c^k = \operatorname*{arg\,min}_{c \,\in\, span(Q_1,\ldots,Q_{k-1})} \norm{A_k - c},
\qquad
Q_k = \frac{A_k - c^k}{\norm{A_k - c^k}} .
\]
To put it in words: we take the first column of \( \mathbf{Q} \) to be the normalized first column of \( \mathbf{A} \). Every subsequent column of \( \mathbf{Q} \) is the error in the best \( \norm{ \cdot} \)-norm approximation to the next column of \( \mathbf{A} \) from the span of the previous columns of \( \mathbf{Q} \), again normalized. Note that \( \norm{ Q_i }=1 \) for all \( i \).
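Here is a rough numerical sketch of this procedure, assuming the norm is supplied as a Python callable. The best-approximation step has no closed form for an arbitrary norm, so this sketch approximates it with a general-purpose optimizer; the helper name `norm_gram_schmidt` and the use of `scipy.optimize.minimize` are incidental choices for illustration, not part of the construction above.

```python
import numpy as np
from scipy.optimize import minimize

def norm_gram_schmidt(A, norm):
    """Column-by-column construction: Q_1 = A_1/||A_1||, and for k > 1,
    Q_k = (A_k - c_k)/||A_k - c_k|| where c_k is a best approximation to A_k
    from span(Q_1, ..., Q_{k-1}) in the supplied norm."""
    m, n = A.shape
    Q = np.zeros((m, n))
    Q[:, 0] = A[:, 0] / norm(A[:, 0])
    for k in range(1, n):
        basis = Q[:, :k]
        # Approximate the best-approximation step: minimize ||A_k - basis @ coeffs||
        # over the coefficient vector.
        objective = lambda coeffs: norm(A[:, k] - basis @ coeffs)
        result = minimize(objective, np.zeros(k), method="Nelder-Mead",
                          options={"xatol": 1e-10, "fatol": 1e-10, "maxiter": 20000})
        residual = A[:, k] - basis @ result.x
        Q[:, k] = residual / norm(residual)
    return Q

# Example with the max-norm, which does not come from an inner product.
max_norm = lambda v: np.max(np.abs(v))
rng = np.random.default_rng(0)
A = rng.standard_normal((6, 4))
Q = norm_gram_schmidt(A, max_norm)
print([round(max_norm(Q[:, j]), 12) for j in range(4)])  # every column has unit max-norm
```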
I now show that the algorithm specified above results in constants \(C_1>0,C_2>0\) such that for every \( \mathbf{A} \) and \( \mathbf{x} \) we have
\[
C_1 \norm{\mathbf{x}} \;\leq\; \norm{\mathbf{Q}\mathbf{x}} \;\leq\; C_2 \norm{\mathbf{x}} .
\]
The “less than or equal” part of this bound is known as a “forward” bound and is relatively straightforward. The “greater than or equal” part is known as an “inverse bound” and was extremely hard (for me) to prove. I start by proving the forward bound.
Suppose that \(\norm{ \cdot } \) is a norm, that \( \mathbf{A}=[A_1,\ldots,A_n] \) is a matrix with linearly independent columns, and that \( \mathbf{x}=[x_1,\ldots,x_n] \) is a vector. Suppose further that \(\mathbf{Q}\) is defined as above. Then using the triangle inequality and properties of norms we have
\[
\norm{\mathbf{Q}\mathbf{x}}
= \norm{\sum_{i=1}^{n} x_i Q_i}
\leq \sum_{i=1}^{n} |x_i| \norm{Q_i}
= \sum_{i=1}^{n} |x_i|
= \norm{\mathbf{x}}_1 ,
\]
and we obtain \( C_2 \) by applying norm equivalence on finite-dimensional spaces between \( \norm{ \cdot }_ 1 \) and \( \norm{ \cdot } \).
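A quick numerical sanity check of this bound, reusing the hypothetical `norm_gram_schmidt` sketch from above with the max-norm: since every column of \( \mathbf{Q} \) has unit norm, \( \norm{\mathbf{Q}\mathbf{x}} \) should never exceed \( \norm{\mathbf{x}}_1 \).

```python
import numpy as np

max_norm = lambda v: np.max(np.abs(v))
rng = np.random.default_rng(1)
A = rng.standard_normal((6, 4))
Q = norm_gram_schmidt(A, max_norm)      # sketch defined earlier

for _ in range(1000):
    x = rng.standard_normal(4)
    assert max_norm(Q @ x) <= np.sum(np.abs(x)) + 1e-12
print("forward bound ||Qx|| <= ||x||_1 held on all samples")
```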
The forward bound above is relatively straightforward in the sense that you only need to apply standard linear algebra inequalities to achieve it. The inverse inequality, on the other hand, required a lot more lateral thinking to prove. I do this one next.
Suppose that \(\norm{ \cdot } \) is a norm, that \( \mathbf{A}=[A_1,\ldots,A_n] \) is a matrix with linearly independent columns, and that \( \mathbf{x}=[x_1,\ldots,x_n] \) is a vector. Suppose further that \(\mathbf{Q}\) is defined as above.
I begin with a subtle fact: for every positive integer \( k\leq n \) with \( x_k \neq 0 \) (the case \( x_k = 0 \) is trivial), writing \( c = -\frac{1}{x_k}\sum_{i=1}^{k-1} x_i Q_i \in span(Q_1,\ldots,Q_{k-1}) \), we have
\[
\norm{\sum_{i=1}^{k} x_i Q_i}
\;=\; |x_k| \, \norm{Q_k - c}
\;\geq\; |x_k| \, \norm{Q_k}
\;=\; |x_k| ,
\]
where the inequality comes from the optimality property of \( c^k \) (adding any element of \( span(Q_1,\ldots,Q_{k-1}) \) to \( Q_k \) cannot decrease its norm, since otherwise \( c^k \) would not be a best approximation to \( A_k \)) and the final equation from \( \norm{Q_k}=1\).
Now I define \( k \) to be the largest integer which satisfies
\[
|x_k| \;\geq\; \frac{1}{2^k}\norm{\mathbf{x}}_{\infty}
\]
(such a \( k \) exists whenever \( \mathbf{x}\neq 0 \), since any index at which \( |x_j| = \norm{\mathbf{x}}_ {\infty} \) satisfies the condition; the case \( \mathbf{x}=0 \) is trivial).
In particular this means \( |x_j| < \frac{1}{2^j}\norm{\mathbf{x}}_ {\infty} \) for \(j>k \). Assume for the moment that \( k<n \) (if \( k=n \) the subtle fact already gives \( \norm{\mathbf{Q}\mathbf{x}} \geq |x_n| \geq \frac{1}{2^n}\norm{\mathbf{x}}_ {\infty} \) and we are done). Thus we have
\[
\sum_{j=k+1}^{n} |x_j|
\;<\; \norm{\mathbf{x}}_{\infty} \sum_{j=k+1}^{n} \frac{1}{2^j}
\;=\; \left(\frac{1}{2^k} - \frac{1}{2^n}\right)\norm{\mathbf{x}}_{\infty}
\;\leq\; |x_k| - \frac{1}{2^n}\norm{\mathbf{x}}_{\infty} .
\]
The above inequality will let us remove the absolute value in the reverse triangle inequality later, because it guarantees
\[
\norm{\sum_{i=1}^{k} x_i Q_i} \;\geq\; |x_k| \;>\; \sum_{j=k+1}^{n} |x_j| \;\geq\; \norm{\sum_{j=k+1}^{n} x_j Q_j} .
\]
We now apply the reverse triangle inequality and get the following
\[
\norm{\mathbf{Q}\mathbf{x}}
\;=\; \norm{\sum_{i=1}^{k} x_i Q_i + \sum_{j=k+1}^{n} x_j Q_j}
\;\geq\; \norm{\sum_{i=1}^{k} x_i Q_i} - \norm{\sum_{j=k+1}^{n} x_j Q_j}
\;\geq\; |x_k| - \sum_{j=k+1}^{n} |x_j|
\;\geq\; \frac{1}{2^n}\norm{\mathbf{x}}_{\infty} .
\]
The proof is complete by appealing to norm equivalence between \( \norm{\cdot } \) and \( \norm{ \cdot }_ {\infty} \), which supplies a constant \( C_1>0 \) (independent of \( \mathbf{A} \)) with \( \norm{\mathbf{Q}\mathbf{x}} \geq C_1 \norm{\mathbf{x}} \).
I stated a simple inductive algorithm that takes a matrix \( \mathbf{A} \) as input and produces a new matrix \( \mathbf{Q} \) which satisfies bounded growth and shrinkage when applied to a vector \( \mathbf{x} \). I quantified the growth and shrinkage factors \( C_1,C_2 \) by the above theorems and proved them. These factors are independent of \( \mathbf{A} \). This suggests that the matrix \( \mathbf{Q} \) is “well conditioned” in the sense that it is easy to solve linear systems \( \mathbf{Q}\mathbf{x} = \mathbf{b} \) and does not get harder to solve this system even if we input a terribly conditioned \(\mathbf{A}\). I consider this a useful step in the direction of generalizing the Gram-Schmidt algorithm so that it does not strictly require inner products.
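To close, here is a small numerical illustration of this claim, again leaning on the hypothetical `norm_gram_schmidt` sketch from earlier: feed the procedure a terribly conditioned \( \mathbf{A} \) and compare how \( \mathbf{A} \) and \( \mathbf{Q} \) act on a direction that \( \mathbf{A} \) nearly annihilates.

```python
import numpy as np

max_norm = lambda v: np.max(np.abs(v))
rng = np.random.default_rng(2)

A = rng.standard_normal((5, 5))
A[:, 4] = A[:, 3] + 1e-8 * rng.standard_normal(5)  # nearly dependent columns: terrible conditioning
Q = norm_gram_schmidt(A, max_norm)                 # sketch defined earlier

x_bad = np.array([0.0, 0.0, 0.0, 1.0, -1.0])       # direction that A nearly annihilates
print(max_norm(A @ x_bad) / max_norm(x_bad))       # around 1e-8: A shrinks this direction drastically
print(max_norm(Q @ x_bad) / max_norm(x_bad))       # stays far from that scale: Q does not inherit the bad conditioning
```

With an exact best-approximation step the theorem guarantees a lower bound independent of \( \mathbf{A} \); the optimizer in the sketch is only approximate, so this is an illustration rather than a verification.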