Here I show some of my internal thinking behind recent work where I prove a theorem that I stated as conjecture in an earlier, related manuscript. This arose from a thought I had a while ago: can we achieve a Gram-schmidt-like proces but *without* the use of inner products? If possible then I thought I could use it produce interesting new matrix factorizations. This proved hard to even make meaningful, what does orthogonality even mean without inner products? I give some of the context below, I state my generalization of “orthogonal” to settings without inner products, and then I prove that my algorithm achieves these pseudo-orthogonal results.

\(\newcommand{\norm}[1]{\left\lVert#1\right\rVert}\)

The Gram-Schmidt algorithm takes as input a sequence of linearly independent vectors \( A_1,A_2,\ldots,A_n \) and produces a new set of linearly independent vectors \( Q_1,Q_2,\ldots,Q_n \) such that \( span(Q_1,Q_2,\ldots,Q_n) = span(A_1,A_2,\ldots,A_n)\) (see definition of span), and also \( Q_i \cdot Q_j = 0 \) whenever \( i\neq j \) but \( Q_i \cdot Q_j = 1\). Here I’m using the standard Euclidean dot product.

What is the utility of this orthogonality property? Orthogonality essentially makes it easy to accurately solve systems of linear equations \( \mathbf{Q}\mathbf{x}=\mathbf{b} \) where \( \mathbf{Q} = [Q_1,Q_2,\ldots,Q_n] \). This results from the simple consequence of orthogonality that \( \mathbf{Q}^T \mathbf{Q} = \mathbf{I} \).

Thus to produce an algorithm that generalizes the Gram-Schmidt algorithm I should seek to reproduce this *utility* of orthogonality. The utility isn’t necessarily the vanishing dot products, but rather that it results in easily solved systems of linear equations. I give a precise mathematical way to determine how hard it is to solve linear systems with a matrix: conditioning

If we take \( \norm{ \mathbf{x} }_ 2 = \sqrt{\mathbf{x} \cdot \mathbf{x}} \) then the matrix \(\mathbf{Q} \) mentioned above has the property that \( \norm{ \mathbf{Q}\mathbf{x} }_ 2 = \norm{ \mathbf{x} }_ 2 \) for any \( \mathbf{x} \) *regardless* of the input matrix \( \mathbf{A} = [A_1,A_2,\ldots,A_n] \).

I generalize this property as follows: Suppose that \( \norm{ \cdot } \) is any norm and that \( F \) is an algorithm (such as Gram-Schmidt) which given any input matrix \( \mathbf{A} \) it outputs \( \mathbf{Q}=F(\mathbf{A}) \) such that \( span(Q_1,\ldots,Q_n) = span(A_1,\ldots,A_n)\). Then there exists \(C_1>0,C_2>0\) such that for every \( \mathbf{A} \) and \( \mathbf{x} \) we have

\begin{align} C_1 \norm{ \mathbf{x} } \leq \norm{ F(\mathbf{A})\mathbf{x} } \leq C_2 \norm{ \mathbf{x} } \end{align}

In other words: I require the algorithm \( F \) to never result in matrices \( \mathbf{Q}=F(\mathbf{A}) \) that *shrink* or *grow* input vectors by too much. The amount that it may do this is known ahead of time by the constants \(C_1,C_2\) and these do not depend on the input matrices \( \mathbf{A} \). I now give an \( F \) which achieves this and then prove it

Here I assume we have a sequence of \( n\) linearly independent vectors \(A_1,\ldots,A_n\) and that \( \norm{ \cdot } \) is any norm whatsoever (not necessarily Euclidean). I define \(Q_1,\ldots,Q_n\) inductively as follows:

\begin{align} Q_1 = \frac{1}{\norm{ A_1 }} A_1 \end{align}

and for any integer \( k>1 \) I define

\begin{align} c^k &= argmin_{u\in \mathbb{R}^{k-1}} \norm{A_{k-1} - \sum_{j=1}^{k-1} Q_j u_j } \\ q_k &= \sum _ {j=1} ^ {k-1} Q_k c_j ^k \\ \gamma_k &= \norm{ q_k } \\ Q_{k} &= \frac{1}{\gamma_k} q_k \end{align}

To put in words: We start by taking the first column of \( \mathbf{Q} \) to be the normalized first column of \( \mathbf{A} \). Every subsequent column of \( \mathbf{Q} \) is the best approximation in the \( \norm{ \cdot} \)-norm of the previous columns to the next column of \( \mathbf{A} \), and then normalized. Note that \( \norm{ Q_i }=1 \) for all \( i \).

I now show that the algorithm specified above results in constants \(C_1>0,C_2>0\) such that for every \( \mathbf{A} \) and \( \mathbf{x} \) we have

\begin{align} C_1 \norm{ \mathbf{x} } \leq \norm{ F(\mathbf{A})\mathbf{x} } \leq C_2 \norm{ \mathbf{x} } \end{align}

The “less than or equal” part of this bound is known as a “forward” bound and is relatively straightforward. The “greater than or equal” part is known as an “inverse bound” and was extremely hard (for me) to prove. I start by proving the forward bound.

Suppose that \(\norm{ \cdot } \) is a norm, that \( \mathbf{A}=[A_1,\ldots,A_n] \) is a matrix with linearly independent columns, and that \( \mathbf{x}=[x_1,\ldots,x_n] \) is a vector. Suppose further that \(\mathbf{Q}\) is defined as above. Then using the triangle inequality and properties of norms we have

\begin{align} \norm{\mathbf{Q}\mathbf{x}} &= \norm{ \sum _{i=1}^n Q_i x_i } \\ &\leq \sum_{i=1}^n \norm{ Q_i x_i } \\ &\leq \sum_{i=1}^n \norm{ Q_i} \left | x_i \right | \\ &= \norm{ x } _1 \end{align}

we obtain \( C_2 \) by applying norm equivalence on finite dimensional spaces between \( \norm{ \cdot }_ 1 \) and \( \norm{ \cdot } \).

The forward bound above is relatively straightforward in the sense that you only need apply standard linear algebra inequalities to achieve it. The inverse inequality on the other hand required a lot more lateral thinking to prove. I do this one next.

Suppose that \(\norm{ \cdot } \) is a norm, that \( \mathbf{A}=[A_1,\ldots,A_n] \) is a matrix with linearly independent columns, and that \( \mathbf{x}=[x_1,\ldots,x_n] \) is a vector. Suppose further that \(\mathbf{Q}\) is defined as above.

I begin with a subtle fact: for every positive integer \( k\leq n \) we have

\begin{align} \norm{\sum_{j=1}^k Q_j x_j} &= \norm{\sum_{j=1}^{k-1} Q_j + Q_k x_k } \\ &= \norm{\sum_{j=1}^{k-1} Q_j + \frac{1}{\gamma_k} \left (A_k - \sum_{j=1}^{k-1} Q_j c^k_j \right)x_k} \\ &\geq \norm{\frac{1}{\gamma_k} \left (A_k - \sum_{j=1}^{k-1} Q_j c^k_j \right)x_k} \\ &= \norm{Q_k x_k} \\ &= |x_k| \end{align}

where the inequality comes from the optimality property of \( c^k \) and the final equation from \( \norm{Q_k}=1\).

Now I *define* \( k \) to be the largest integer which satisfies

\begin{align} |x_k| \geq \frac{1}{2^k} \norm{x}_ {\infty} \end{align}

In particular this means \( x_j < \frac{1}{2^j}\norm{x}_ {\infty} \) for \(j>k \). Thus we have

\begin{align} \norm{\sum_{j={k+1}}^n Q_j x_j} &\leq \sum_{j=k+1}^n \norm{Q_j x_j} \\ &\leq \sum_{j=k+1}^n \frac{1}{2^j} \norm{x}_{\infty} \\ &\leq \frac{1}{2^k} \norm{x}_{\infty} \\ &\leq |x_k| \\ &\leq \norm{\sum_{j=1}^k Q_j x_j } \end{align}

The above inequality will let us remove absolute values later because

\begin{align} \left | \norm{\sum_{j=1}^k Q_j x_j } - \norm{\sum_{j={k+1}}^n Q_j x_j} \right| = \norm{\sum_{j=1}^k Q_j x_j } - \norm{\sum_{j={k+1}}^n Q_j x_j} \end{align}

We now apply the reverse triangle inequality and get the following

\begin{align*} \norm{\mathbf{Q}\mathbf{x}} &= \norm{\sum_{j=1}^n Q_j x_j} \\ &= \norm{\sum_{j=1}^k Q_j x_j + \sum_{j=k+1}^n Q_j x_j} \\ &\geq \left | \norm{\sum_{j=1}^k Q_j x_j} - \norm{\sum_{j=k+1}^n Q_j x_j} \right | \text{ (by reverse triangle inequality)}\\ &= \norm{\sum_{j=1}^k Q_j x_j} - \norm{\sum_{j=k+1}^n Q_j x_j} \\ &\geq \norm{\sum_{j=1}^k Q_j x_j} - \sum_{j=k+1}^n \frac{1}{2^j}\norm{x}_{\infty} \\ &\geq \left | x_k \right | - \sum_{j=k+1}^n \frac{1}{2^j}\norm{\mathbf{x}}_{\infty} \\ &\geq \frac{1}{2^k}\norm{\mathbf{x}}_{\infty} - \sum_{j=k+1}^n \frac{1}{2^j}\norm{\mathbf{x}}_{\infty} \\ &= (\frac{1}{2^k} - \sum_{j=k+1}^n \frac{1}{2^j})\norm{\mathbf{x}}_{\infty} \\ &\geq \frac{1}{2^n}\norm{\mathbf{x}}_{\infty} \text{ (properties of geometric series) } \end{align*}

The proof is complete by appealing to norm equivalence between \( \norm{\cdot } \) and \( \norm{ \cdot }_ {\infty} \)

I stated a simple inductive algorithm that takes a matrix \( \mathbf{A} \) as input and produces a new matrix \( \mathbf{Q} \) which satisfies bounded growth and shrinkage when applied to a vector \( \mathbf{x} \). I quantified the growth and shrinkage factors \( C_1,C_2 \) by the above theorems and proved them. These factors are independent of \( \mathbf{A} \). This suggests that the matrix \( \mathbf{Q} \) is “well conditioned” in the sense that it is easy to solve linear systems \( \mathbf{Q}\mathbf{x} = \mathbf{b} \) and does not *get harder* to solve this system even if we input a terribly conditioned \(\mathbf{A}\). I consider this a useful step in the direction of generalizing the Gram-Schmidt algorithm so that it does not strictly require inner products.