Orthogonal and Non-Orthogonal Features

Metronome
My understanding is that PCA is a technique to orthogonalize possibly-non-orthogonal features (calling the outputs principal components instead of features). However, in the videos I've watched as well as this visual tool, the dimensions of the feature space always have an orthogonal basis at the start, and the PCA just turns out to be an angle-preserving transformation from one orthogonal basis to another.
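To make that concrete, here is a minimal numerical check (a sketch in Python with numpy, which is just my choice of tool; the correlated data is invented): the matrix of principal components is orthogonal, so the change of basis preserves angles.

[code]
# Minimal sketch (numpy; invented data): PCA's change of basis is orthogonal.
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=500)
# Two correlated features, expressed in the usual orthonormal basis.
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=500)])
X -= X.mean(axis=0)

# Principal components = eigenvectors of the covariance matrix.
eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))

# V is orthogonal, so the transformation is angle-preserving.
print(np.allclose(V.T @ V, np.eye(2)))  # True
[/code]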

If features are said to be non-orthogonal, does that mean the dimensions of the feature space are generated by a non-orthogonal basis, or merely that the data points are correlated over a feature space generated by an orthogonal basis? Is there some obvious isomorphism between these two geometries so that they're used interchangeably? Or is it that multicollinearity is the former concept, and multicollinearity and data point correlation are just two unrelated reasons to use PCA?

For example, take a supervised learning problem of predicting the annual average temperature on some planet at Longitude [imath]A_{test}[/imath] Latitude [imath]B_{test}[/imath] based on the features of the temperatures at Longitude [imath]A_{train 1}[/imath] Latitude [imath]B_{train 1}[/imath] and Longitude [imath]A_{train 2}[/imath] Latitude [imath]B_{train 2}[/imath]. If the two training points are positioned very close to each other (especially if the testing point is far away), that seems to imply some defect in the dimensions of the feature space itself, not merely in the data points. Is this a case of a highly non-orthogonal basis? The limiting case would be using exactly the same position for both training features. Would that be a case of a feature space generated by two linearly dependent basis vectors, or merely of a straight line of data embedded in a feature space generated by two orthogonal basis vectors? Or is the problem that it really should be the former, but you can't know to model it that way unless you know the Longitudes and Latitudes in advance and model accordingly, and the latter is the more generic representation of the data you'll get otherwise?
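Here is a quick numerical look at that limiting case (again just a sketch with numpy and invented data): duplicating a feature makes the covariance matrix singular, and the degeneracy shows up as a zero eigenvalue, i.e. a line of data inside a plane that still has an orthogonal basis.

[code]
# Sketch of the limiting case: two identical "sensor" features
# (numpy; invented data). The covariance matrix becomes singular.
import numpy as np

rng = np.random.default_rng(1)
t = rng.normal(size=300)
X = np.column_stack([t, t])      # the same training position used twice
X -= X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
# One eigenvalue is (numerically) zero: the data is a straight line
# embedded in a plane spanned by the usual orthogonal basis.
print(np.round(eigvals, 6))
[/code]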
 
However, in the videos I've watched as well as this visual tool, the dimensions of the feature space always have an orthogonal basis at the start, and the PCA just turns out to be an angle-preserving transformation from one orthogonal basis to another.
You got it right here, and, BTW, I think the tool you found provides a good illustration.
If features are said to be non-orthogonal, does that mean the dimensions of the feature space are generated by a non-orthogonal basis, or merely that the data points are correlated over a feature space generated by an orthogonal basis?
I believe the latter is a better interpretation, i.e. the bases are orthonormal in both the input and the output of PCA.
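A small check of that claim (a sketch with scikit-learn, which is just my assumed tooling; the data is invented): the fitted components form an orthonormal basis, and the transformed coordinates come out decorrelated.

[code]
# Sketch: orthonormal bases on both sides of PCA
# (scikit-learn; invented data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(2)
x = rng.normal(size=500)
X = np.column_stack([x, 0.8 * x + 0.2 * rng.normal(size=500)])

pca = PCA(n_components=2).fit(X)
Z = pca.transform(X)

# The rows of components_ are orthonormal (the output basis)...
print(np.allclose(pca.components_ @ pca.components_.T, np.eye(2)))  # True
# ...and the scores are decorrelated: their covariance is diagonal.
print(np.round(np.cov(Z, rowvar=False), 3))
[/code]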

As for your example, I think the sets are too small to make a meaningful example, since PCA is essentially a statistical tool.
 
Hmmm, let me retry a similar example. Suppose we contrast two separate regression problems. I am creating a regression to predict the temperatures of locations selected uniformly at random on the surface of Bearth, and you are creating a regression to predict the temperatures of locations selected uniformly at random on the surface of Schmearth. As features, each of us has four temperature sensors. I place my temperature sensors in northern New Bork City, eastern New Bork City, southern New Bork City, and western New Bork City, while you place a single temperature sensor on each of four continents on Schmearth. The measurements of a fifth temperature sensor randomly jumping around each planet are the respective prediction targets.

Bearth and Schmearth resemble Earth in terms of the relevant geography (NBC is about the size of NYC, continents are big, the sizes of the three planets are about the same), but we cannot infer temperature information based upon that of Earth, proximity to a sun, etc.; all we have is our respective training data. Also, to avoid the added complexities of time series analysis, the temperature sensors do not timestamp the measurements, but they do associate the five simultaneous measurements with each other, so that what they output is an (unordered) set of data points (1 target measurement and 4 feature measurements per data point), just as needed for basic regression problems.

The ground truth is that temperatures on Bearth fluctuate a lot, while temperatures on Schmearth do not. However, each of us will end up with high correlation within our training data. I believe the problem is that I have a lot of multicollinearity in my features, due to my redundant placement of temperature sensors, while you have much less multicollinearity but an objectively smaller temperature range across your planet.
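To illustrate the contrast, here is a hedged simulation sketch (numpy and scikit-learn again; all numbers invented, the "sensors" are just correlated noise): four nearly co-located sensors give a near-rank-1 design matrix, which PCA reports as one dominant component, whereas four spread-out, weakly correlated sensors spread the variance out.

[code]
# Sketch of the two sensor layouts (numpy + scikit-learn; invented data).
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(3)
n = 1000

# Bearth: big temperature swings, four nearly co-located NBC sensors.
swing = rng.normal(scale=10.0, size=n)
bearth = np.column_stack([swing + rng.normal(scale=0.5, size=n)
                          for _ in range(4)])

# Schmearth: small swings, four weakly correlated continental sensors.
common = rng.normal(size=n)
schmearth = np.column_stack([0.3 * common + rng.normal(size=n)
                             for _ in range(4)])

for name, X in [("Bearth", bearth), ("Schmearth", schmearth)]:
    print(name, np.round(PCA().fit(X).explained_variance_ratio_, 3))
# Bearth:    roughly [0.998, ...]             -- heavy multicollinearity
# Schmearth: roughly [0.31, 0.23, 0.23, 0.23] -- mild correlation
[/code]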

My understanding was that PCA would parse out this difference to the extent possible, but your answer seems to imply that PCA will not be able to tell the difference because all it looks at is the correlation of the data points, not the features themselves. Even if I know that my temperature sensor placement is creating multicollinearity, it sounds like there's no way for me to build such information into my PCA algorithm as a feature space generated by a non-orthogonal basis, let alone ask PCA to discover my multicollinearity.

In my mind, it's no longer clear in what sense PCA is a good tool to deal with multicollinearity. Is there a context similar to this in which something like Gram-Schmidt (which genuinely orthogonalizes non-orthogonal bases) would be used to do the thing I was trying to do with PCA?
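To pin down the distinction I'm drawing, here is a side-by-side sketch (numpy; invented data): Gram-Schmidt, done here via QR decomposition, orthogonalizes the observed feature columns themselves, while PCA rotates the basis to the directions of maximal variance.

[code]
# Sketch: Gram-Schmidt (via QR) vs. PCA on the same correlated data
# (numpy only; invented data).
import numpy as np

rng = np.random.default_rng(4)
x = rng.normal(size=200)
X = np.column_stack([x, 0.9 * x + 0.1 * rng.normal(size=200)])
X -= X.mean(axis=0)

# QR factorization performs Gram-Schmidt on the columns of X:
# Q replaces the feature columns with orthonormal ones.
Q, R = np.linalg.qr(X)
print(np.allclose(Q.T @ Q, np.eye(2)))       # True

# PCA instead rotates the basis: the scores are decorrelated and ordered
# by variance, and unlike Gram-Schmidt the result doesn't privilege
# whichever feature column happens to come first.
eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ V
print(np.round(np.cov(Z, rowvar=False), 3))  # diagonal
[/code]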
 
I still don't understand your question, and my knowledge of PCA is quite limited. My understanding is that PCA assumes an orthogonal basis for the data space because it computes an orthogonal transformation. E.g., here is an excerpt from Wikipedia:

PCA is defined as an orthogonal linear transformation on a real inner product space that transforms the data to a new coordinate system such that the greatest variance by some scalar projection of the data comes to lie on the first coordinate (called the first principal component), the second greatest variance on the second coordinate, and so on.
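A tiny numerical reading of that definition (a sketch with numpy, my assumed tooling; invented data): after the orthogonal transformation, the coordinate variances come out in descending order.

[code]
# Sketch: coordinate variances are sorted descending after PCA
# (numpy only; invented data).
import numpy as np

rng = np.random.default_rng(5)
X = rng.normal(size=(400, 3)) @ rng.normal(size=(3, 3))  # correlated data
X -= X.mean(axis=0)

eigvals, V = np.linalg.eigh(np.cov(X, rowvar=False))
Z = X @ V[:, ::-1]                   # eigh sorts ascending; flip to descending
print(np.round(np.var(Z, axis=0, ddof=1), 3))  # non-increasing variances
[/code]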
 