An Interesting Inverse Inequality: Proof

January 27, 2021

Here I show some of my internal thinking behind recent work in which I prove a theorem that I stated as a conjecture in an earlier, related manuscript. This arose from a thought I had a while ago: can we achieve a Gram-Schmidt-like process without the use of inner products? If so, I thought I could use it to produce interesting new matrix factorizations. This proved hard even to make meaningful: what does orthogonality mean without inner products? Below I give some of the context, state my generalization of “orthogonal” to settings without inner products, and then prove that my algorithm achieves these pseudo-orthogonal results.

$$\newcommand{\norm}[1]{\left\lVert#1\right\rVert}$$

What does Gram-Schmidt do for us?

The Gram-Schmidt algorithm takes as input a sequence of linearly independent vectors $$A_1,A_2,\ldots,A_n$$ and produces a new sequence of linearly independent vectors $$Q_1,Q_2,\ldots,Q_n$$ such that $$span(Q_1,Q_2,\ldots,Q_n) = span(A_1,A_2,\ldots,A_n)$$ (see definition of span), and also $$Q_i \cdot Q_j = 0$$ whenever $$i\neq j$$ while $$Q_i \cdot Q_i = 1$$ for every $$i$$. Here I’m using the standard Euclidean dot product.

What is the utility of this orthogonality property? Orthogonality makes it easy to solve systems of linear equations $$\mathbf{Q}\mathbf{x}=\mathbf{b}$$ accurately, where $$\mathbf{Q} = [Q_1,Q_2,\ldots,Q_n]$$. This is a simple consequence of orthogonality: $$\mathbf{Q}^T \mathbf{Q} = \mathbf{I}$$, so the solution is just $$\mathbf{x} = \mathbf{Q}^T \mathbf{b}$$.
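As a quick numerical illustration (a sketch using NumPy; `np.linalg.qr` orthonormalizes the columns of a random full-rank matrix), solving $$\mathbf{Q}\mathbf{x}=\mathbf{b}$$ reduces to a single matrix-vector product:

```python
import numpy as np

# Build an orthonormal Q from a random full-rank A via QR factorization.
rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4))
Q, _ = np.linalg.qr(A)

# Because Q^T Q = I, the system Q x = b is solved by x = Q^T b:
# no elimination, no conditioning worries.
b = rng.standard_normal(4)
x = Q.T @ b

print(np.allclose(Q @ x, b))  # True
```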

Thus, to produce an algorithm that generalizes the Gram-Schmidt algorithm, I should seek to reproduce this utility of orthogonality. The utility isn’t the vanishing dot products themselves, but rather that they yield easily solved systems of linear equations. There is a precise mathematical way to measure how hard it is to solve linear systems with a given matrix: its conditioning.

Gram-Schmidt without inner products: use norms instead

If we take $$\norm{ \mathbf{x} }_ 2 = \sqrt{\mathbf{x} \cdot \mathbf{x}}$$ then the matrix $$\mathbf{Q}$$ mentioned above has the property that $$\norm{ \mathbf{Q}\mathbf{x} }_ 2 = \norm{ \mathbf{x} }_ 2$$ for any $$\mathbf{x}$$ regardless of the input matrix $$\mathbf{A} = [A_1,A_2,\ldots,A_n]$$.

Generalizing orthogonality

I generalize this property as follows: Suppose that $$\norm{ \cdot }$$ is any norm and that $$F$$ is an algorithm (such as Gram-Schmidt) which, given any input matrix $$\mathbf{A}$$, outputs $$\mathbf{Q}=F(\mathbf{A})$$ such that $$span(Q_1,\ldots,Q_n) = span(A_1,\ldots,A_n)$$. I require that there exist constants $$C_1>0,C_2>0$$ such that for every $$\mathbf{A}$$ and $$\mathbf{x}$$ we have

\begin{align} C_1 \norm{ \mathbf{x} } \leq \norm{ F(\mathbf{A})\mathbf{x} } \leq C_2 \norm{ \mathbf{x} } \end{align}

In other words: I require the algorithm $$F$$ to never output matrices $$\mathbf{Q}=F(\mathbf{A})$$ that shrink or grow input vectors by too much. How much it may do so is known ahead of time through the constants $$C_1,C_2$$, which do not depend on the input matrix $$\mathbf{A}$$. I now give an $$F$$ which achieves this and then prove it does.
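The property can be probed numerically. Here is a small illustrative helper (the name `empirical_growth_bounds` is my own, not from the manuscript) that estimates the smallest and largest value of $$\norm{\mathbf{Q}\mathbf{x}}/\norm{\mathbf{x}}$$ over random vectors, as empirical stand-ins for $$C_1$$ and $$C_2$$:

```python
import numpy as np

def empirical_growth_bounds(Q, norm, trials=2000, seed=0):
    """Estimate min and max of ||Q x|| / ||x|| over random x.
    These are empirical stand-ins for C_1 and C_2 (an illustration,
    not a proof)."""
    rng = np.random.default_rng(seed)
    n = Q.shape[1]
    ratios = []
    for _ in range(trials):
        x = rng.standard_normal(n)
        ratios.append(norm(Q @ x) / norm(x))
    return min(ratios), max(ratios)

# For an orthonormal Q and the Euclidean norm, ||Q x|| = ||x||,
# so both empirical bounds sit at 1.
Q, _ = np.linalg.qr(np.random.default_rng(3).standard_normal((5, 5)))
lo, hi = empirical_growth_bounds(Q, np.linalg.norm)
print(round(lo, 6), round(hi, 6))  # 1.0 1.0
```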

The algorithm

Here I assume we have a sequence of $$n$$ linearly independent vectors $$A_1,\ldots,A_n$$ and that $$\norm{ \cdot }$$ is any norm whatsoever (not necessarily Euclidean). I define $$Q_1,\ldots,Q_n$$ inductively as follows:

\begin{align} Q_1 = \frac{1}{\norm{ A_1 }} A_1 \end{align}

and for any integer $$k>1$$ I define

\begin{align} c^k &= \operatorname*{argmin}_{u\in \mathbb{R}^{k-1}} \norm{A_{k} - \sum_{j=1}^{k-1} Q_j u_j } \\ q_k &= A_k - \sum _ {j=1} ^ {k-1} Q_j c_j ^k \\ \gamma_k &= \norm{ q_k } \\ Q_{k} &= \frac{1}{\gamma_k} q_k \end{align}

To put it in words: we take the first column of $$\mathbf{Q}$$ to be the normalized first column of $$\mathbf{A}$$. Every subsequent column of $$\mathbf{Q}$$ is the residual left over after best approximating, in the $$\norm{ \cdot}$$-norm, the next column of $$\mathbf{A}$$ by the previous columns of $$\mathbf{Q}$$, and then normalized. Note that $$\norm{ Q_i }=1$$ for all $$i$$.
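The construction can be sketched in a few lines of Python. Since a general norm has no closed-form projection, I compute the $$argmin$$ numerically with `scipy.optimize.minimize`; the function name `norm_gram_schmidt` is my own shorthand, and this is a sketch rather than a production implementation:

```python
import numpy as np
from scipy.optimize import minimize

def norm_gram_schmidt(A, norm):
    """Column k of Q is the normalized residual of the best
    norm-approximation of A[:, k] by the previously built columns."""
    m, n = A.shape
    Q = np.zeros((m, n))
    Q[:, 0] = A[:, 0] / norm(A[:, 0])
    for k in range(1, n):
        # c^k = argmin_u || A_k - sum_j Q_j u_j ||, found numerically.
        obj = lambda u: norm(A[:, k] - Q[:, :k] @ u)
        c = minimize(obj, np.zeros(k), method="Nelder-Mead").x
        q = A[:, k] - Q[:, :k] @ c   # residual q_k
        Q[:, k] = q / norm(q)        # divide by gamma_k = ||q_k||
    return Q

# Example with the 1-norm, which comes from no inner product.
rng = np.random.default_rng(1)
A = rng.standard_normal((5, 3))
Q = norm_gram_schmidt(A, lambda v: np.sum(np.abs(v)))
print(np.allclose(np.sum(np.abs(Q), axis=0), 1.0))  # True: unit 1-norm columns
```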

Proving key properties

I now show that the algorithm specified above results in constants $$C_1>0,C_2>0$$ such that for every $$\mathbf{A}$$ and $$\mathbf{x}$$ we have

\begin{align} C_1 \norm{ \mathbf{x} } \leq \norm{ F(\mathbf{A})\mathbf{x} } \leq C_2 \norm{ \mathbf{x} } \end{align}

The “less than or equal” part of this bound is known as a “forward” bound and is relatively straightforward. The “greater than or equal” part is known as an “inverse bound” and was extremely hard (for me) to prove. I start by proving the forward bound.

The forward bound

Suppose that $$\norm{ \cdot }$$ is a norm, that $$\mathbf{A}=[A_1,\ldots,A_n]$$ is a matrix with linearly independent columns, and that $$\mathbf{x}=[x_1,\ldots,x_n]$$ is a vector. Suppose further that $$\mathbf{Q}$$ is defined as above. Then using the triangle inequality and properties of norms we have

\begin{align} \norm{\mathbf{Q}\mathbf{x}} &= \norm{ \sum _{i=1}^n Q_i x_i } \\ &\leq \sum_{i=1}^n \norm{ Q_i x_i } \\ &= \sum_{i=1}^n \norm{ Q_i} \left | x_i \right | \\ &= \norm{ \mathbf{x} } _1 \end{align}

We obtain $$C_2$$ by applying the equivalence of norms on finite-dimensional spaces between $$\norm{ \cdot }_ 1$$ and $$\norm{ \cdot }$$.
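The bound $$\norm{\mathbf{Q}\mathbf{x}} \leq \norm{\mathbf{x}}_1$$ holds for any matrix with unit-norm columns, which is easy to spot-check numerically (a sketch using the Euclidean norm):

```python
import numpy as np

# Any matrix with unit-norm columns satisfies ||Q x|| <= ||x||_1.
rng = np.random.default_rng(2)
Q = rng.standard_normal((6, 4))
Q /= np.linalg.norm(Q, axis=0)   # normalize each column to unit 2-norm

ok = True
for _ in range(1000):
    x = rng.standard_normal(4)
    ok &= np.linalg.norm(Q @ x) <= np.sum(np.abs(x)) + 1e-12
print(ok)  # True
```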

The forward bound above is relatively straightforward in the sense that it only requires standard linear-algebra inequalities. The inverse inequality, on the other hand, required a lot more lateral thinking to prove. I tackle it next.

The inverse bound

Suppose that $$\norm{ \cdot }$$ is a norm, that $$\mathbf{A}=[A_1,\ldots,A_n]$$ is a matrix with linearly independent columns, and that $$\mathbf{x}=[x_1,\ldots,x_n]$$ is a vector. Suppose further that $$\mathbf{Q}$$ is defined as above.

I begin with a subtle fact: for every positive integer $$k\leq n$$ we have

\begin{align} \norm{\sum_{j=1}^k Q_j x_j} &= \norm{\sum_{j=1}^{k-1} Q_j x_j + Q_k x_k } \\ &= \norm{\sum_{j=1}^{k-1} Q_j x_j + \frac{1}{\gamma_k} \left (A_k - \sum_{j=1}^{k-1} Q_j c^k_j \right)x_k} \\ &\geq \norm{\frac{1}{\gamma_k} \left (A_k - \sum_{j=1}^{k-1} Q_j c^k_j \right)x_k} \\ &= \norm{Q_k x_k} \\ &= |x_k| \end{align}

where the inequality comes from the optimality property of $$c^k$$: when $$x_k \neq 0$$, dividing through by $$|x_k|/\gamma_k$$ rewrites the left-hand side as $$\norm{A_k - \sum_{j=1}^{k-1} Q_j u_j}$$ for some $$u \in \mathbb{R}^{k-1}$$, which by definition of $$c^k$$ is at least $$\norm{A_k - \sum_{j=1}^{k-1} Q_j c^k_j}$$ (the case $$x_k = 0$$ is trivial). The final equality comes from $$\norm{Q_k}=1$$.

Now I define $$k$$ to be the largest integer which satisfies

\begin{align} |x_k| \geq \frac{1}{2^k} \norm{x}_ {\infty} \end{align}

Such a $$k$$ exists because the index achieving $$\norm{x}_ {\infty}$$ satisfies the inequality. By maximality of $$k$$, this means $$|x_j| < \frac{1}{2^j}\norm{x}_ {\infty}$$ for $$j>k$$. Thus we have

\begin{align} \norm{\sum_{j={k+1}}^n Q_j x_j} &\leq \sum_{j=k+1}^n \norm{Q_j x_j} \\ &\leq \sum_{j=k+1}^n \frac{1}{2^j} \norm{x}_{\infty} \\ &\leq \frac{1}{2^k} \norm{x}_{\infty} \\ &\leq |x_k| \\ &\leq \norm{\sum_{j=1}^k Q_j x_j } \end{align}

The above inequality will let us remove absolute values later because

\begin{align} \left | \norm{\sum_{j=1}^k Q_j x_j } - \norm{\sum_{j={k+1}}^n Q_j x_j} \right| = \norm{\sum_{j=1}^k Q_j x_j } - \norm{\sum_{j={k+1}}^n Q_j x_j} \end{align}

We now apply the reverse triangle inequality and get the following

\begin{align*} \norm{\mathbf{Q}\mathbf{x}} &= \norm{\sum_{j=1}^n Q_j x_j} \\ &= \norm{\sum_{j=1}^k Q_j x_j + \sum_{j=k+1}^n Q_j x_j} \\ &\geq \left | \norm{\sum_{j=1}^k Q_j x_j} - \norm{\sum_{j=k+1}^n Q_j x_j} \right | \text{ (by the reverse triangle inequality)}\\ &= \norm{\sum_{j=1}^k Q_j x_j} - \norm{\sum_{j=k+1}^n Q_j x_j} \\ &\geq \norm{\sum_{j=1}^k Q_j x_j} - \sum_{j=k+1}^n \frac{1}{2^j}\norm{\mathbf{x}}_{\infty} \\ &\geq \left | x_k \right | - \sum_{j=k+1}^n \frac{1}{2^j}\norm{\mathbf{x}}_{\infty} \\ &\geq \frac{1}{2^k}\norm{\mathbf{x}}_{\infty} - \sum_{j=k+1}^n \frac{1}{2^j}\norm{\mathbf{x}}_{\infty} \\ &= \left(\frac{1}{2^k} - \sum_{j=k+1}^n \frac{1}{2^j}\right)\norm{\mathbf{x}}_{\infty} \\ &\geq \frac{1}{2^n}\norm{\mathbf{x}}_{\infty} \text{ (properties of geometric series) } \end{align*}
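The final geometric-series step rests on the identity $$\frac{1}{2^k} - \sum_{j=k+1}^n \frac{1}{2^j} = \frac{1}{2^n}$$, which a few lines of Python confirm (exactly, since powers of two are exact in floating point):

```python
# Check: 2^{-k} - sum_{j=k+1}^{n} 2^{-j} = 2^{-n} for every k <= n.
n = 10
for k in range(1, n + 1):
    lhs = 2.0**-k - sum(2.0**-j for j in range(k + 1, n + 1))
    assert lhs == 2.0**-n
print("identity holds for all k <= n =", n)  # identity holds for all k <= n = 10
```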

The proof is complete by appealing to the equivalence of norms between $$\norm{\cdot }$$ and $$\norm{ \cdot }_ {\infty}$$ on finite-dimensional spaces, which yields $$C_1$$.

Conclusion

I stated a simple inductive algorithm that takes a matrix $$\mathbf{A}$$ as input and produces a new matrix $$\mathbf{Q}$$ which exhibits bounded growth and shrinkage when applied to any vector $$\mathbf{x}$$. The growth and shrinkage factors $$C_1,C_2$$ are quantified by the theorems above, and they are independent of $$\mathbf{A}$$. This suggests that $$\mathbf{Q}$$ is “well conditioned” in the sense that linear systems $$\mathbf{Q}\mathbf{x} = \mathbf{b}$$ are easy to solve, and do not get harder to solve even if we input a terribly conditioned $$\mathbf{A}$$. I consider this a useful step toward generalizing the Gram-Schmidt algorithm so that it does not strictly require inner products.