All algorithms presented in Chapters 5 to 9 belong to the larger class of supervised learning tools. Such tools seek to unveil a mapping between predictors $\textbf{X}$ and a label $\textbf{Z}$. The supervision comes from the fact that the algorithm is explicitly asked to explain this particular variable $\textbf{Z}$. Another important part of machine learning consists of unsupervised tasks, that is, when $\textbf{Z}$ is not specified and the algorithm tries to make sense of $\textbf{X}$ on its own. Often, relationships between the components of $\textbf{X}$ are identified. This field is much too vast to be summarized in one book, let alone one chapter. The purpose here is to briefly explain in what ways unsupervised learning can be used, especially in the data pre-processing phase.
Often, it is tempting to supply all predictors to an ML-fueled predictive engine. That may not be a good idea when some predictors are highly correlated. To illustrate this, the simplest example is a regression on two variables with zero mean and the following covariance and precision matrices:
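For concreteness, and assuming unit variances so that the covariance and the correlation $\rho$ coincide (as in the discussion that follows), these matrices read

$$\boldsymbol{\Sigma}=\begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}, \qquad \boldsymbol{\Sigma}^{-1}=\frac{1}{1-\rho^2}\begin{pmatrix} 1 & -\rho \\ -\rho & 1 \end{pmatrix}.$$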
When the covariance/correlation $\rho$ increases towards 1 (the two variables are co-linear), the scaling denominator in $\boldsymbol{\Sigma}^{-1}$ goes to zero and the formula $\hat{\boldsymbol{\beta}}=\boldsymbol{\Sigma}^{-1}\textbf{X}'\textbf{Z}$ implies that one coefficient will be highly positive and one highly negative. The regression creates a spurious arbitrage between the two variables. Of course, this is very inefficient and yields disastrous results out-of-sample.
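A minimal simulation makes this tangible. The snippet below uses synthetic data (not the book's sample); the variable names, the seed and the parameter values are arbitrary choices for illustration.

import numpy as np
np.random.seed(42) # For reproducibility
n, rho = 1000, 0.999 # Sample size and (very high) correlation
x1 = np.random.normal(size=n)
x2 = rho * x1 + np.sqrt(1 - rho**2) * np.random.normal(size=n) # Nearly collinear copy of x1
y = 0.5 * x1 + np.random.normal(size=n) # Only x1 truly drives y
X = np.column_stack([x1, x2])
beta = np.linalg.solve(X.T @ X, X.T @ y) # OLS coefficients
print(beta, beta.sum()) # Erratic, offsetting coefficients; their sum stays close to 0.5

Re-running the snippet with different seeds shows coefficients that swing wildly (often with opposite signs), while their sum remains close to the true total effect of 0.5.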
We illustrate what happens when many variables are used in the regression below (Table 15.1). One illustration of the aforementioned phenomenon comes from the variables Mkt_Cap_12M_Usd and Mkt_Cap_6M_Usd, which have a correlation of 99.6% in the training sample. Both are singled out as highly significant, but their signs are contradictory. Moreover, the magnitudes of their coefficients are very close (0.20 versus 0.18), so that their net effect almost cancels out. Naturally, providing the regression with only one of these two inputs would have been wiser.
import numpy as np # Needed for np.abs below
import pandas as pd # Needed to assemble the output table
import statsmodels.api as sm # Package for clean regression output
stat = sm.OLS(training_sample['R1M_Usd'], sm.add_constant(training_sample[features])).fit() # Model: predict R1M_Usd with all features
reg_thrhld = 3 # Keep significant predictors only
boo_filter = np.abs(stat.tvalues) >= reg_thrhld # Regressors above the significance threshold
estimate = stat.params[boo_filter] # estimate
std_error = stat.bse[boo_filter] # std.error
statistic = stat.tvalues[boo_filter] # statistic
p_value = stat.pvalues[boo_filter] # p.value
significant_regressors = pd.concat([estimate, std_error, statistic, p_value], axis=1) # Put output in clean format
significant_regressors.columns = ['estimate', 'std.error', 'statistic', 'p.value'] # Renaming columns
print(significant_regressors)
                   estimate  std.error  statistic       p.value
const              0.040574   0.005343   7.594323  3.107512e-14
Ebitda_Margin      0.013237   0.003493   3.789999  1.506925e-04
Ev_Ebitda          0.006814   0.002256   3.020213  2.526288e-03
Fa_Ci              0.007231   0.002347   3.081471  2.060090e-03
Fcf_Bv             0.025054   0.005131   4.882465  1.048492e-06
Fcf_Yld           -0.015893   0.003736  -4.254127  2.099628e-05
Mkt_Cap_12M_Usd    0.204738   0.027432   7.463476  8.461142e-14
Mkt_Cap_6M_Usd    -0.179780   0.045939  -3.913443  9.101987e-05
Mom_5M_Usd        -0.018669   0.004431  -4.212972  2.521442e-05
Mom_Sharp_11M_Usd  0.017817   0.004695   3.795131  1.476096e-04
Ni                 0.015461   0.004497   3.438361  5.853680e-04
Ni_Avail_Margin    0.011814   0.003861   3.059359  2.218407e-03
Ocf_Bv            -0.019811   0.005294  -3.742277  1.824119e-04
Pb                -0.017897   0.003129  -5.720637  1.062777e-08
Pe                -0.008991   0.002354  -3.819565  1.337278e-04
Sales_Ps          -0.015786   0.004628  -3.411062  6.472325e-04
Vol1Y_Usd          0.011425   0.002792   4.091628  4.285247e-05
Vol3Y_Usd          0.008459   0.002795   3.026169  2.477060e-03
TABLE 15.1: Significant predictors in the training sample.
In fact, there are several indicators for the market capitalization and maybe only one would suffice, but it is not obvious which one is the best choice.
To further depict correlation issues, we compute the correlation matrix of the predictors below (on the training sample). Because of its dimension, we show it graphically.
import seaborn as sns # Package for plots
sns.set(rc={'figure.figsize':(16,16)}) # Setting the figsize in seaborn
sns.heatmap(training_sample[features].corr()) # Correlation matrix and plot
FIGURE 15.1: Correlation matrix of predictors.
The graph of Figure 15.1 reveals several light squares around the diagonal. For instance, the biggest square around the first third of features relates to all accounting ratios based on free cash flows. Because of this common term in their calculation, the features are naturally highly correlated. These local correlation patterns occur several times in the dataset and explain why it is not a good idea to use simple regression with this set of features.
In full disclosure, multicollinearity (when predictors are correlated) can be much less of a problem for ML tools than it is for pure statistical inference. In statistics, one central goal is to study the properties of the $\beta$ coefficients, and collinearity perturbs this kind of analysis. In machine learning, the aim is to maximize out-of-sample accuracy. If having many predictors can be helpful, then so be it. One simple example can help clarify this matter. When building a regression tree, having many predictors gives more options for the splits. If the features make sense, then they can be useful. The same reasoning applies to random forests and boosted trees. What matters is that the large spectrum of features helps improve the generalization ability of the model. Their collinearity is irrelevant.
In the remainder of the chapter, we present two approaches that help reduce the number of predictors: dimension reduction techniques (principal component analysis and autoencoders) and clustering methods that group similar features into homogeneous families.
The first method is a cornerstone in dimensionality reduction. It seeks to determine a smaller number of factors ($K'<K$) that retain as much as possible of the information contained in the original $K$ predictors.
In this short subsection, we define some key concepts that are required to fully understand the derivation of principal component analysis (PCA). Henceforth, we work with matrices (in bold fonts). An $I \times K$ matrix $\textbf{X}$ is orthonormal if $I> K$ and $\textbf{X}'\textbf{X}=\textbf{I}_K$. When $I=K$, the (square) matrix is called orthogonal and $\textbf{X}'\textbf{X}=\textbf{X}\textbf{X}'=\textbf{I}_K$, i.e., $\textbf{X}^{-1}=\textbf{X}'$.
One foundational result in matrix theory is the Singular Value Decomposition (SVD, see, e.g., chapter 5 in Meyer (2000)). The SVD is formulated as follows: any $I \times K$ matrix $\textbf{X}$ can be decomposed into

$$\textbf{X}=\textbf{U}\boldsymbol{\Delta}\textbf{V}',$$
where $\textbf{U}$ ($I\times I$) and $\textbf{V}$ ($K\times K$) are orthogonal and $\boldsymbol{\Delta}$ (with dimensions $I\times K$) is diagonal, i.e., $\Delta_{i,k}=0$ whenever $i\neq k$. In addition, $\Delta_{i,i}\ge 0$: the diagonal terms of $\boldsymbol{\Delta}$ are nonnegative.
For simplicity, we assume below that $\textbf{1}_I'\textbf{X}=\textbf{0}_K'$, i.e., that all columns have zero sum (and hence zero mean). This allows us to write the (sample) covariance matrix as $\boldsymbol{\Sigma}_X= \frac{1}{I-1}\textbf{X}'\textbf{X}$.
One crucial feature of covariance matrices is their symmetry. Indeed, real-valued symmetric (square) matrices enjoy an SVD which is much more powerful: when $\textbf{X}$ is symmetric, there exists an orthogonal matrix $\textbf{Q}$ and a diagonal matrix $\textbf{D}$ such that

$$\textbf{X}=\textbf{Q}\textbf{D}\textbf{Q}'. \tag{15.2}$$
This process is called diagonalization (see chapter 7 in Meyer (2000)) and conveniently applies to covariance matrices.
The goal of PCA is to build a new dataset $\tilde{\textbf{X}}$ that has fewer columns but keeps as much information as possible when compressing the original one, $\textbf{X}$. The key notion is the change of basis, which is a linear transformation of $\textbf{X}$ into $\textbf{Z}$, a matrix with identical dimensions, via

$$\textbf{Z}=\textbf{X}\textbf{P}, \tag{15.3}$$
where $\textbf{P}$ is a $K \times K$ matrix. There are of course an infinite number of ways to transform $\textbf{X}$ into $\textbf{Z}$, but two fundamental constraints help reduce the possibilities. The first constraint is that the columns of $\textbf{Z}$ be uncorrelated. Having uncorrelated features is desirable because they then all tell different stories and have zero redundancy. The second constraint is that the variance of the columns of $\textbf{Z}$ is highly concentrated. This means that a few factors (columns) will capture most of the explanatory power (signal), while the others will consist predominantly of noise. All of this is encoded in the covariance matrix of $\textbf{Z}$.
The covariance matrix of $\textbf{Z}$ is

$$\boldsymbol{\Sigma}_Z=\frac{1}{I-1}\textbf{Z}'\textbf{Z}=\frac{1}{I-1}\textbf{P}'\textbf{X}'\textbf{X}\textbf{P}.$$
In this expression, we plug the decomposition (15.2), applied to the symmetric matrix $\textbf{X}'\textbf{X}=\textbf{Q}\textbf{D}\textbf{Q}'$:

$$\boldsymbol{\Sigma}_Z=\frac{1}{I-1}\textbf{P}'\textbf{Q}\textbf{D}\textbf{Q}'\textbf{P}.$$
Thus, picking $\textbf{P}=\textbf{Q}$, we get, by orthogonality, $\boldsymbol{\Sigma}_Z=\frac{1}{I-1}\textbf{D}$, that is, a diagonal covariance matrix for $\textbf{Z}$. The columns of $\textbf{Z}$ can then be re-shuffled in decreasing order of variance so that the diagonal elements of $\boldsymbol{\Sigma}_Z$ progressively shrink. This is useful because it helps locate the factors with the most informational content (the first factors). In the limit, a constant column (with zero variance) carries no signal.
The matrix $\textbf{Z}$ is a linear transformation of $\textbf{X}$; thus, it is expected to carry the same information, even though this information is coded differently. Since the columns are ordered according to their relative importance, it is simple to omit some of them. The new set of features $\tilde{\textbf{X}}$ consists of the first $K'$ (with $K'<K$) columns of $\textbf{Z}$.
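To make the derivation concrete, here is a small numerical sketch on synthetic data (the names and dimensions are arbitrary). It verifies that the change of basis delivers uncorrelated columns, which can then be truncated.

import numpy as np
np.random.seed(0)
I, K, K_prime = 200, 5, 2 # Sample size, number of features, number of retained factors
X = np.random.normal(size=(I, K))
X = X - X.mean(axis=0) # Impose zero-mean columns, as in the text
Sigma_X = X.T @ X / (I - 1) # Sample covariance matrix
eig_val, Q = np.linalg.eigh(Sigma_X) # Diagonalization of the symmetric matrix Sigma_X
order = np.argsort(eig_val)[::-1] # Sort the factors by decreasing variance
Q = Q[:, order]
Z = X @ Q # Change of basis: Z = XP with P = Q
Sigma_Z = Z.T @ Z / (I - 1)
print(np.round(Sigma_Z, 4)) # Diagonal (up to rounding): the new columns are uncorrelated
X_tilde = Z[:, :K_prime] # Truncated dataset: keep the first K' factors only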
Below, we show how to perform PCA with scikit-learn and visualize the output with the Python pca package. To ease readability, we use the smaller sample with fewer predictors.
from sklearn import decomposition
pca = decomposition.PCA(n_components=7) # We impose the number of components
pca.fit(training_sample[features_short]) # Performs PCA on the smaller set of predictors
print(pca.explained_variance_ratio_) # Checking the share of variance explained per component
P = pd.DataFrame(pca.components_, columns=features_short).T # Rotation matrix P (7 x 7): rows = features, columns = components
P.columns = ['P' + str(col) for col in P.columns] # Tidying up column names
P
[0.35718238 0.1940806 0.15561321 0.10434453 0.09601422 0.07017118 0.02259388]
| | P0 | P1 | P2 | P3 | P4 | P5 | P6 |
|---|---|---|---|---|---|---|---|
| Div_Yld | -0.271599 | 0.579099 | 0.045725 | -0.528956 | 0.226626 | 0.506566 | 0.032012 |
| Eps | -0.420407 | 0.150082 | -0.024767 | 0.337373 | -0.771377 | 0.301883 | 0.011965 |
| Mkt_Cap_12M_Usd | -0.523868 | -0.343239 | 0.172289 | 0.062495 | 0.252781 | 0.002987 | 0.714319 |
| Mom_11M_Usd | -0.047238 | -0.057714 | -0.897160 | 0.241015 | 0.250559 | 0.258477 | 0.043179 |
| Ocf | -0.532947 | -0.195890 | 0.185039 | 0.234371 | 0.357596 | 0.049015 | -0.676866 |
| Pb | -0.152413 | -0.580806 | -0.221048 | -0.682136 | -0.308665 | 0.038675 | -0.168799 |
| Vol1Y_Usd | 0.406890 | -0.381139 | 0.282162 | 0.155411 | 0.061575 | 0.762588 | 0.008632 |
The rotation gives the matrix $\textbf{P}$: it is the tool that changes the basis. The first line of the output reports the share of total variance explained by each new factor (column). Each factor is indexed by a P column (principal component). Often, the first PC (first column P0 in the output) loads negatively on all initial features: a weighted average of all predictors is expected to carry a lot of information. In the above example, it is almost the case, with the exception of volatility, which has a positive coefficient in the first PC. The second PC is an arbitrage between price-to-book (short) and dividend yield (long). The third PC is contrarian, as it loads heavily and negatively on momentum. Not all principal components are easy to interpret.
Sometimes, it can be useful to visualize the way the principal components are built. In Figure 15.2, we show one popular representation that is used for two factors (usually the first two).
from pca import pca
model = pca(n_components=7) # Initialize
results = model.fit_transform(training_sample[features_short], col_labels=features_short) # Fit transform and include the column labels and row labels
model.biplot(n_feat=7, PC=[0,1],cmap=None, label=None, legend=False) # Make biplot
[pca] >Processing dataframe..
[pca] >The PCA reduction is performed on the [7] columns of the input dataframe.
[pca] >Fit using PCA.
[pca] >Compute loadings and PCs.
[pca] >Compute explained variance.
[pca] >Outlier detection using Hotelling T2 test with alpha=[0.05] and n_components=[7]
[pca] >Outlier detection using SPE/DmodX with n_std=[2]
[pca] >Plot PC1 vs PC2 with loadings.
FIGURE 15.2: Visual representation of PCA with two dimensions.
The numbers indicated along the axes are the proportions of explained variance of each PC. They match the explained_variance_ratio_ output shown above: 35.7% for the first component and 19.4% for the second.
Once the rotation is known, it is possible to select a subsample of the transformed data. From the original 7 features, it is easy to keep only the first 4 components.
pd.DataFrame( # Using DataFrame format
    np.matmul( # Matrix product using numpy
        training_sample[features_short].values, P.values[:, :4]), # Original data times the first 4 columns of P
    columns=['PC1', 'PC2', 'PC3', 'PC4'] # Column names
).head() # Show first 5 lines
| | PC1 | PC2 | PC3 | PC4 |
|---|---|---|---|---|
| 0 | -0.591998 | -0.177306 | 0.058881 | -0.349897 |
| 1 | -0.043180 | -0.718323 | -0.510459 | -0.050138 |
| 2 | -1.104983 | -0.429470 | 0.023240 | -0.171445 |
| 3 | -0.376485 | -0.418983 | -0.650190 | -0.081842 |
| 4 | -0.018831 | -0.581435 | 0.242719 | -0.358501 |
These 4 factors can then be used as orthogonal features in any ML engine. The fact that the features are uncorrelated is undoubtedly an asset. But the price of this convenience is high: the features are no longer immediately interpretable. De-correlating the predictors adds yet another layer of “blackbox-ing” in the algorithm.
PCA can also be used to estimate factor models. In Equation (15.3), it suffices to replace $\textbf{Z}$ with returns, $\textbf{X}$ with factor values and $\textbf{P}$ with factor loadings (see, e.g., Connor and Korajczyk (1988) for an early reference). More recently, Lettau and Pelger (2020a) and Lettau and Pelger (2020b) propose a thorough analysis of PCA estimation techniques. They notably argue that first moments of returns are important and should be included in the objective function, alongside the optimization on the second moments.
We end this subsection with a technical note. Usually, PCA is performed on the covariance matrix of returns. Sometimes, it may be preferable to decompose the correlation matrix. The result may change substantially if the variables have very different variances (which is not really the case in the equity space). If the investment universe encompasses several asset classes, then a correlation-based PCA will reduce the importance of the most volatile class. In this case, it is as if all returns are scaled by their respective volatilities.
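For readers who want to experiment, one simple way to obtain a correlation-based PCA with scikit-learn is to standardize the columns beforehand: the covariance matrix of standardized data is the correlation matrix. The sketch below reuses the objects defined above.

from sklearn.preprocessing import StandardScaler # Column-wise standardization
from sklearn.decomposition import PCA
scaled_features = StandardScaler().fit_transform(training_sample[features_short]) # Zero mean, unit variance
pca_corr = PCA(n_components=7).fit(scaled_features) # PCA on standardized data = correlation-based PCA
print(pca_corr.explained_variance_ratio_) # Shares of variance, now based on the correlation matrix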
In a PCA, the coding from $\textbf{X}$ to $\textbf{Z}$ is straightforward, linear and works both ways: \begin{equation} \textbf{Z}=\textbf{X}\textbf{P} \quad \text{and} \quad \textbf{X}=\textbf{Z}\textbf{P}', \end{equation}
so that we recover $\textbf{X}$ from $\textbf{Z}$. This can be written differently:

$$\underset{(I\times K)}{\textbf{X}}\ \overset{\text{encode, via } \textbf{P}}{\longrightarrow} \ \underset{(I\times K)}{\textbf{Z}} \ \overset{\text{decode, via } \textbf{P}'}{\longrightarrow}\ \underset{(I\times K)}{\textbf{X}}.$$
If we take the truncated version and seek a smaller output (with only $K'$ columns), this gives:

$$\underset{(I\times K)}{\textbf{X}}\ \overset{\text{encode, via } \textbf{P}_{K'}}{\longrightarrow} \ \underset{(I\times K')}{\tilde{\textbf{X}}} \ \overset{\text{decode, via } \textbf{P}'_{K'}}{\longrightarrow}\ \underset{(I\times K)}{\breve{\textbf{X}}},$$

where $\textbf{P}_{K'}$ is the restriction of $\textbf{P}$ to the $K'$ columns that correspond to the factors with the largest variances. The dimensions of the matrices are indicated inside the brackets below them. In this case, the decoding cannot recover $\textbf{X}$ exactly but only an approximation, which we write $\breve{\textbf{X}}$. This approximation is coded with less information, hence this new data $\breve{\textbf{X}}$ is compressed and provides a parsimonious representation of the original sample $\textbf{X}$.
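To illustrate the loss incurred by the truncation, here is a short sketch that compresses and then decodes the data with the rotation $\textbf{P}$ computed earlier (we ignore the centering that scikit-learn applies internally, so the figure is only indicative).

import numpy as np
X_mat = training_sample[features_short].values # Original sample (I x 7)
P_trunc = P.values[:, :4] # Restriction of P to the first 4 factors
X_tilde = X_mat @ P_trunc # Compressed data (I x 4)
X_breve = X_tilde @ P_trunc.T # Approximate reconstruction (I x 7)
print(np.mean((X_mat - X_breve) ** 2)) # Mean squared compression error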
An autoencoder generalizes this concept to nonlinear coding functions. Simple linear autoencoders are linked to latent factor models (see Proposition 1 in for the case of single-layer autoencoders). The scheme is the following:

$$\underset{(I\times K)}{\textbf{X}}\ \overset{\text{encode, via } N}{\longrightarrow} \ \underset{(I\times K')}{\tilde{\textbf{X}}=N(\textbf{X})} \ \overset{\text{decode, via } N'}{\longrightarrow}\ \underset{(I\times K)}{\breve{\textbf{X}}=N'(\tilde{\textbf{X}})},$$

where the encoding and decoding functions $N$ and $N'$ are often taken to be neural networks. The term autoencoder comes from the fact that the target output, which we often write $\textbf{Z}$, is the original sample $\textbf{X}$. Thus, the algorithm seeks to determine the function $N$ that minimizes the distance (to be defined) between $\textbf{X}$ and the output value $\breve{\textbf{X}}$. The encoder generates an alternative representation of $\textbf{X}$, whereas the decoder tries to recode it back to its original values. Naturally, the intermediate (coded) version $\tilde{\textbf{X}}$ is targeted to have a smaller dimension than $\textbf{X}$.
Autoencoders are easy to code in Keras (see Chapter 7 for more details on Keras). To underline the power of the framework, we resort to another way of coding a NN: the so-called functional API. For simplicity, we work with the small number of predictors (7). The structure of the network consists of two symmetric networks with only one intermediate layer containing 32 units. The activation function is sigmoid; this makes sense since the input has values in the unit interval.
import tensorflow as tf
from tensorflow import keras
from tensorflow.keras import layers, Input
from plot_keras_history import show_history, plot_history
input_layer = Input(shape=(7,)) # features_short has 7 columns
encoder = tf.keras.layers.Dense(units=32, activation="sigmoid")(input_layer) # First, encode
encoder = tf.keras.layers.Dense(units=4)(encoder) # 4 dimensions for the output layer (same as PCA example)
decoder = tf.keras.layers.Dense(units=32, activation="sigmoid")(encoder) # Then, from encoder, decode
decoder = tf.keras.layers.Dense(units=7)(decoder) # the original sample has 7 features
In the training part, we optimize the MSE and use an Adam update of the weights (see Section 7.2.3).
ae_model = keras.Model(input_layer, decoder) # Builds the model
ae_model.compile( # Learning parameters
optimizer='adam',
loss='mean_squared_error',
metrics='mean_squared_error')
Finally, we are ready to train the data onto itself! The evolution of the loss on the training and testing samples is depicted in Figure 15.3. The decreasing pattern shows the progress of the compression quality.
history=ae_model.fit(NN_train_features, # Input
NN_train_features, # Output
epochs=15,
batch_size=512,
validation_data=(NN_test_features, NN_test_features))
plot_history(history)
Epoch 1/15
387/387 [==============================] - 2s 3ms/step - loss: 0.0756 - mean_squared_error: 0.0756 - val_loss: 0.0424 - val_mean_squared_error: 0.0424
Epoch 2/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0307 - mean_squared_error: 0.0307 - val_loss: 0.0236 - val_mean_squared_error: 0.0236
Epoch 3/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0194 - mean_squared_error: 0.0194 - val_loss: 0.0167 - val_mean_squared_error: 0.0167
Epoch 4/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0163 - mean_squared_error: 0.0163 - val_loss: 0.0160 - val_mean_squared_error: 0.0160
Epoch 5/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0159 - mean_squared_error: 0.0159 - val_loss: 0.0157 - val_mean_squared_error: 0.0157
Epoch 6/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0157 - mean_squared_error: 0.0157 - val_loss: 0.0156 - val_mean_squared_error: 0.0156
Epoch 7/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0156 - mean_squared_error: 0.0156 - val_loss: 0.0156 - val_mean_squared_error: 0.0156
Epoch 8/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0156 - mean_squared_error: 0.0156 - val_loss: 0.0155 - val_mean_squared_error: 0.0155
Epoch 9/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0155 - mean_squared_error: 0.0155 - val_loss: 0.0155 - val_mean_squared_error: 0.0155
Epoch 10/15
387/387 [==============================] - 1s 3ms/step - loss: 0.0155 - mean_squared_error: 0.0155 - val_loss: 0.0155 - val_mean_squared_error: 0.0155
Epoch 11/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0155 - mean_squared_error: 0.0155 - val_loss: 0.0154 - val_mean_squared_error: 0.0154
Epoch 12/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0154 - mean_squared_error: 0.0154 - val_loss: 0.0154 - val_mean_squared_error: 0.0154
Epoch 13/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0154 - mean_squared_error: 0.0154 - val_loss: 0.0153 - val_mean_squared_error: 0.0153
Epoch 14/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0153 - mean_squared_error: 0.0153 - val_loss: 0.0152 - val_mean_squared_error: 0.0152
Epoch 15/15
387/387 [==============================] - 1s 2ms/step - loss: 0.0152 - mean_squared_error: 0.0152 - val_loss: 0.0150 - val_mean_squared_error: 0.0150
FIGURE 15.3: Output from the training of an autoencoder.
In order to get the details of all weights and biases, the syntax is the following.
ae_weights=ae_model.get_weights()
Retrieving the encoder and processing the data into the compressed format is just a matter of matrix manipulation. In practice, it is possible to build a submodel by loading the weights from the encoder (see exercise below).
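One possible way to do this with the functional API is sketched below; it assumes the layers defined above are still in memory, and the sub-model shares the trained weights of the encoding layers (this is also the point of the exercise at the end of the chapter).

from tensorflow import keras
encoder_model = keras.Model(input_layer, encoder) # Sub-model that stops at the 4-unit coding layer
compressed_train = encoder_model.predict(NN_train_features) # Compressed (4-column) version of the training features
print(compressed_train.shape)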
The second family of unsupervised tools pertains to clustering. Features are grouped into homogeneous families of predictors. It is then possible to single out one among the group (or to create a synthetic average of all of them). Mechanically, the number of predictors is reduced.
The principle is simple: among a group of variables (the reasoning would be the same for observations in the other dimension) $\textbf{x}_{\{1 \le j \le J\}}$, find the combination of $k<J$ groups that minimizes

$$\underset{\textbf{S}}{\text{argmin}} \ \sum_{i=1}^{k}\sum_{\textbf{x}\in S_i}||\textbf{x}-\textbf{m}_i ||^2,$$
where $||\cdot ||$ is some norm which is usually taken to be the Euclidean $l^2$-norm. The $S_i$ are the groups and the minimization is run on the whole set of groups $\textbf{S}$. The $\textbf{m}_i$ are the group means (also called centroids or barycenters): $\textbf{m}_i=(\text{card}(S_i))^{-1}\sum_{\textbf{x}\in S_i}\textbf{x}$.
In order to ensure optimality, all possible arrangements must be tested, which is prohibitively long when $k$ and $J$ are large. Therefore, the problem is usually solved with greedy algorithms that seek (and find) solutions that are not optimal but ‘good enough’.
One heuristic way to proceed is the following (a minimal sketch of these steps is given just after this list):
1. choose $k$ initial centroids, for instance at random among the points;
2. assign each point to the group of its closest centroid;
3. recompute each centroid as the mean of the points assigned to its group;
4. repeat steps 2 and 3 until the assignments (or the centroids) no longer change.
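The following toy numpy sketch implements these steps (hypothetical helper names; this is not the implementation used in the example below, which relies on scikit-learn).

import numpy as np
def k_means_lloyd(X, k, n_iter=20, seed=0):
    # X has one row per item to cluster (here, one row per feature)
    rng = np.random.default_rng(seed)
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float) # Step 1: random initial centroids
    for _ in range(n_iter): # Steps 2 and 3, repeated a fixed number of times
        dist = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2) # Distance of each item to each centroid
        labels = dist.argmin(axis=1) # Step 2: assign each item to its closest centroid
        for i in range(k): # Step 3: recompute the centroids
            if np.any(labels == i):
                centroids[i] = X[labels == i].mean(axis=0)
    return labels, centroids
# Example call (clustering the columns of the training sample):
# labels, centers = k_means_lloyd(training_sample[features].values.T, k=10)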
Below, we illustrate this process with an example. From all 93 features, we build 10 clusters.
from sklearn import cluster
k_means = cluster.KMeans(n_clusters=10) # Setting the number of clusters
k_means.fit(training_sample[features].T) # Performs k-means on the transposed data: one row per feature
clusters = pd.DataFrame([features, k_means.labels_], index=["factor", "cluster"]).T # Organize the cluster output
clusters.loc[clusters['cluster']==4, :] # Shows one particular group
| | factor | cluster |
|---|---|---|
| 6 | Capex_Ps_Cf | 4 |
| 19 | Eps | 4 |
| 20 | Eps_Basic | 4 |
| 21 | Eps_Basic_Gr | 4 |
| 22 | Eps_Contin_Oper | 4 |
| 23 | Eps_Dil | 4 |
| 68 | Op_Prt_Margin | 4 |
| 69 | Oper_Ps_Net_Cf | 4 |
| 80 | Sales_Ps | 4 |
We single out the cluster labelled 4, which is composed mainly of accounting ratios related to EPS. Given these 10 clusters, we can build a much smaller group of features that can then be fed to the predictive engines described in Chapters 5 to 9. The representative of a cluster can be the member that is closest to the center, or simply the center itself. This pre-processing step can nonetheless cause problems in the forecasting phase. Typically, it requires that the training data also be clustered, and the extension to the testing data is not straightforward (the clusters may not be the same).
To the best of our knowledge, nearest neighbors are not used in large-scale portfolio choice applications. The reason is simple: computational cost. Nonetheless, the concept of neighbors is widespread in unsupervised learning and can be used locally in complement to interpretability tools. Theoretical results on k-NN relating to bounds for error rates on classification tasks can be found in section 6.2 of Ripley (2007). The rationale is the following. If:
- the training sample covers the region of the feature space in which the test instance lies, and
- the relationship between the features and the label is sufficiently stable across the two samples,
then the neighborhood of one instance $\textbf{x}_i$ from the testing features computed on the training sample will yield valuable information on $y_i$.
In what follows, we thus seek to find neighbors of one particular instance $\textbf{x}_i$ (a $K$-dimensional row vector). Note that there is a major difference with the previous section: the clustering is now performed at the observation level (rows) and not at the predictor level (columns).
Given a dataset with the same (corresponding) columns $\textbf{X}_{i,k}$, the neighbors are defined via a similarity measure (or distance)

$$D(\textbf{x}_j,\textbf{x}_i)=\sum_{k=1}^K c_k \, d_k(x_{j,k},x_{i,k}), \tag{15.9}$$
where the distance functions $d_k$ can operate on various data types (numerical, categorical, etc.). For numerical values, typical choices are $d_k(x_{j,k},x_{i,k})=(x_{j,k}-x_{i,k})^2$ or $d_k(x_{j,k},x_{i,k})=|x_{j,k}-x_{i,k}|$. For categorical values, we refer to the exhaustive survey by Boriah, Chandola, and Kumar (2008), which lists 14 possible measures. Finally, the $c_k$ in Equation (15.9) allow some flexibility by weighting features. This is useful because both raw values ($x_{i,k}$ versus $x_{i,k'}$) and measure outputs ($d_k$ versus $d_{k'}$) can have different scales.
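As an illustration, here is a tiny sketch of such a measure for numerical features (the weights are a user choice; setting them all to one recovers the squared Euclidean distance, which ranks neighbors exactly like the Euclidean metric used by scikit-learn below).

import numpy as np
def weighted_distance(x_j, x_i, c):
    # D(x_j, x_i) = sum_k c_k * d_k(x_jk, x_ik), with squared differences as d_k
    return np.sum(c * (np.asarray(x_j) - np.asarray(x_i)) ** 2)
print(weighted_distance([0.2, 0.5], [0.1, 0.9], c=np.ones(2))) # 0.01 + 0.16 = 0.17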
Once the distances are computed over the whole sample, they are ranked using indices $l_1^i, \dots, l_I^i$:

$$D\left(\textbf{x}_{l_1^i},\textbf{x}_i\right) \le D\left(\textbf{x}_{l_2^i},\textbf{x}_i\right)\le \dots \le D\left(\textbf{x}_{l_I^i},\textbf{x}_i\right).$$
The nearest neighbors are those indexed by $l_m^i$ for $m=1,\dots,k$. We leave out the case when there are problematic equalities of the type $D\left(\textbf{x}_{l_m^i},\textbf{x}_i\right)=D\left(\textbf{x}_{l_{m+1}^i},\textbf{x}_i\right)$ for the sake of simplicity and because they rarely occur in practice as long as there are sufficiently many numerical predictors.
Given these neighbors, it is now possible to build a prediction for the label side $y_i$. The rationale is straightforward: if $\textbf{x}_i$ is close to other instances $\textbf{x}_j$, then the label value $y_i$ should also be close to $y_j$ (under the assumption that the features carry some predictive information over the label $y$).
An intuitive prediction for $y_i$ is the following weighted average:

$$\hat{y}_i=\frac{\sum_{j\neq i} h\left(D(\textbf{x}_j,\textbf{x}_i)\right)y_j}{\sum_{j\neq i} h\left(D(\textbf{x}_j,\textbf{x}_i)\right)},$$
where $h$ is a decreasing function. Thus, the further $\textbf{x}_j$ is from $\textbf{x}_i$, the smaller its weight in the average. A typical choice for $h$ is $h(z)=e^{-az}$ for some parameter $a>0$ that determines how penalizing the distance $D(\textbf{x}_j,\textbf{x}_i)$ is. Of course, the average can be taken over the set of $k$ nearest neighbors only, in which case $h$ is set to zero beyond the corresponding distance threshold:

$$\hat{y}_i=\frac{\sum_{m=1}^k h\left(D(\textbf{x}_{l_m^i},\textbf{x}_i)\right)y_{l_m^i}}{\sum_{m=1}^k h\left(D(\textbf{x}_{l_m^i},\textbf{x}_i)\right)}.$$
A more agnostic rule is to take $h:=1$ over the set of neighbors, in which case all neighbors have the same weight (see the early discussion by T. Bailey and Jain (1978) in the case of classification). For classification tasks, the procedure involves a voting rule whereby the class with the most votes wins the contest, with possible tie-breaking methods, as in the toy snippet below. The interested reader can have a look at the short survey in Bhatia et al. (2010).
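For completeness, here is a toy version of such a voting rule (a hypothetical helper; ties are simply broken by the order in which classes are first encountered).

from collections import Counter
def knn_vote(neighbor_classes):
    # Majority vote among the classes of the k nearest neighbors
    return Counter(neighbor_classes).most_common(1)[0][0]
print(knn_vote(['up', 'down', 'up'])) # 'up'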
For the choice of optimal $k$, several complicated techniques and criteria exist (see, e.g., Ghosh (2006) and Hall et al. (2008)). Heuristic values often do the job pretty well. A rule of thumb is that $k=\sqrt{I}$ ($I$ being the total number of instances) is not too far from the optimal value, unless $I$ is exceedingly large.
Below, we illustrate this concept. We pick one date (December 31, 2006) and single out one asset (with stock_id equal to 13). We then seek to find the $k=30$ stocks that are the closest to this asset at this particular date.
from sklearn import neighbors as nb # Package for Nearest Neighbors detection
knn_data = data_ml.loc[data_ml['date']=='2006-12-31',:] # Dataset for k-NN exercise
knn_target = knn_data.loc[knn_data['stock_id'] == 13, features] # Target observation
knn_sample = knn_data.loc[knn_data['stock_id'] != 13, features] # All other observations
neighbors = nb.NearestNeighbors(n_neighbors=30) # Number of neighbors to use
neighbors.fit(knn_sample)
NearestNeighbors(n_neighbors=30)
neigh_dist, neigh_ind = neighbors.kneighbors(knn_target)
print(pd.DataFrame(neigh_ind)) # Indices of the k nearest neighbors
   0  1   2    3    4   5    6    7   8    9  ...   20  21   22   23   24  \
0  9  8  99  185  294  21  266  180  23  191  ...  310  95  165  215  268

    25   26  27   28   29
0  194  103  17  117  539

[1 rows x 30 columns]
Once the neighbors and distances are known, we can compute a prediction for the return of the target stock. We use the function $h(z)=e^{-z}$ for the weighting of instances (via the distances).
knn_labels = knn_data.loc[knn_data['stock_id'] != 13, 'R1M_Usd'].values[neigh_ind] # y values of the neighbors (indices refer to the rows of knn_sample)
np.sum(knn_labels * np.exp(-neigh_dist)/np.sum(np.exp(-neigh_dist))) # Prediction with h(z)=e^(-z)
0.03092438258317905
knn_data.loc[knn_data['stock_id'] == 13, 'R1M_Usd'] # True y
96734    0.089
Name: R1M_Usd, dtype: float64
The prediction is neither very good nor very bad (the sign is correct!). However, note that this example cannot be used for predictive purposes because we use data from 2006-12-31 to predict a return at the same date. In order to avoid the forward-looking bias, the knn_sample variable should be built from a prior point in time, for instance as follows.
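A possible adjustment is sketched below; the choice of the prior date is for illustration only (we simply take the most recent date strictly before the prediction date and search for the neighbors there).

prior_date = data_ml.loc[data_ml['date'] < '2006-12-31', 'date'].max() # Most recent date before the target date
knn_sample_past = data_ml.loc[(data_ml['date'] == prior_date) & (data_ml['stock_id'] != 13), features] # Candidate neighbors, past data only
neighbors_past = nb.NearestNeighbors(n_neighbors=30).fit(knn_sample_past) # Fit on past observations
neigh_dist_past, neigh_ind_past = neighbors_past.kneighbors(knn_target) # Neighbors of the 2006-12-31 target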
The above computations are fast (a handful of seconds at most), but they hold for only one asset. In a $k$-NN exercise, each stock gets a custom prediction and the set of neighbors must be re-assessed each time. For $N$ assets, $N(N-1)/2$ distances must be evaluated. This is particularly costly in a backtest, especially when several parameters can be tested (the number of neighbors $k$, or $a$ in the weighting function $h(z)=e^{-az}$). When the investment universe is small (when trading indices, for instance), k-NN methods become computationally attractive (see for instance Chen and Hao (2017)).
Code the compressed version of the data (narrow training sample) via the encoder part of the autoencoder.
Bailey, T., and A. K. Jain. 1978. “A Note on Distance-Weighted K-Nearest Neighbor Rules.” IEEE Transactions on Systems, Man, and Cybernetics 8 (4): 311–13.
Bhatia, Nitin, and others. 2010. “Survey of Nearest Neighbor Techniques.” arXiv Preprint, no. 1007.0085.
Boriah, Shyam, Varun Chandola, and Vipin Kumar. 2008. “Similarity Measures for Categorical Data: A Comparative Evaluation.” In Proceedings of the 2008 Siam International Conference on Data Mining, 243–54.
Chen, Yingjun, and Yongtao Hao. 2017. “A Feature Weighted Support Vector Machine and K-Nearest Neighbor Algorithm for Stock Market Indices Prediction.” Expert Systems with Applications 80: 340–55.
Connor, Gregory, and Robert A Korajczyk. 1988. “Risk and Return in an Equilibrium Apt: Application of a New Test Methodology.” Journal of Financial Economics 21 (2): 255–89.
Ghosh, Anil K. 2006. “On Optimum Choice of K in Nearest Neighbor Classification.” Computational Statistics & Data Analysis 50 (11): 3113–23.
Hall, Peter, Byeong U Park, Richard J Samworth, and others. 2008. “Choice of Neighbor Order in Nearest-Neighbor Classification.” Annals of Statistics 36 (5): 2135–52.
Lettau, Martin, and Markus Pelger. 2020a. “Estimating Latent Asset-Pricing Factors.” Journal of Econometrics Forthcoming.
Lettau, Martin, and Markus Pelger. 2020b. “Factors That Fit the Time Series and Cross-Section of Stock Returns.” Review of Financial Studies 33 (5): 2274–2325.
Meyer, Carl D. 2000. Matrix Analysis and Applied Linear Algebra. Vol. 71. SIAM.
Ripley, Brian D. 2007. Pattern Recognition and Neural Networks. Cambridge University Press.