Feature Generation and Selection for Single Character Recognition
 Background: Single-character recognition
 Dataset
 Solutions
 Feature selection
 Results
 Summary
 References/Notes
Background: Single-character recognition
In my Capitals-playing project I used the Tesseract OCR engine to read the letters in each tile after segmenting tiles with OpenCV.
Although my segmentation is consistent, presenting Tesseract with a single character from a single font on a white background, I noticed some performance problems. For example, the letter 'W' is never recognized correctly, and occasionally other letters are misclassified. Whether this is due to my using Tesseract incorrectly (i.e. possibly this could improve with training), complications from using it in single character mode, or because Tesseract is garbage, I'm not sure.
Also, Tesseract is a powerful OCR engine, but I only need single-character recognition. Invoking Tesseract for each tile in a screenshot takes a significant portion of the total runtime of capitalssolver.
I decided to write my own classifier for the relatively trivial problem of classifying consistently-rendered single-character tiles. All of the code used is available on github/receptorocr. I have tried to note where I used different revisions of the same code in this article.
Dataset
I generated training/test data by segmenting several screenshots from Capitals using gen_training_data.py and labeled them manually using label_seg.py. This training data is available here, and is the same dataset I use throughout this article. A scaled example image is below.
This is a $c=28$ class problem: A-Z, blank, and "capital" (an arbitrary icon specific to the game Capitals). I have a small dataset with $n=346$ examples. I have at least 4 examples of each class, except for 'J', of which there are only 3. The images are 500px by 500px binary images, or nominally $d=250,000$ dimensions. Throughout the article I use 75% of the data (259 instances) for training, and the remaining 25% (87 instances) for testing.
Note the small $n$ means that there is significant variability in performance, depending on how the training/test sets are (randomly) split. It's not improbable that there could be no 'J's present in the training set in one run, for example. Where error has varied significantly I have done several runs and presented a typical value.
Also note that since this is a multinomial problem, the 'chance' error rate is not 50%, but approximately 90%, obtained by always guessing the most frequent class, 'A'.
Solutions
Template matching w/ LSPC (brute force benchmark)
Note: Throughout this article I use LSPC[^{1}]. It is a kernel-based nonlinear classifier, approximately as accurate as SVM or KLR, but blindingly fast (it has a closed-form solution), allowing me to evaluate hundreds of models in the time it would take to train a single KLR. It is also natively multiclass and produces probabilistic outputs.
A naïve approach is to encode each image as a vector of pixel values and use these as inputs to a nonlinear classifier, such as a neural network, support vector machine, or KLR. This is essentially template matching.
Template matching can be effective when images are consistent within classes (i.e. all "f"s look the same). On more difficult problems, template matching can be effective with some application-specific preprocessing, such as deskewing[^{2}], and other normalization.
Encoding entire images as vectors can lead to dimensionality problems, even after reducing image resolution. Also, template matching models cannot make more "perceptual" decisions ("two straight lines at right angles, plus a curving part might be a five"). Part of the allure of deep neural networks is that their hierarchical structure may enable this type of decision making.
Since all character examples in my application come from a single font and are consistently oriented, template matching is a good benchmark.
Using template_match.py I generated the $n=346$, $d=250,000$ vectors and classified using LSPC by passing the resulting 165 megabyte CSV to template.m.
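For concreteness, the vectorization amounts to flattening each image into a row of a design matrix. This is only a sketch of the idea, not the actual template_match.py; the toy 3x3 images stand in for the 500x500 tiles:

```python
import numpy as np

def images_to_matrix(images):
    """Flatten a list of equally-sized binary images (2-D arrays)
    into an (n, d) design matrix of raw pixel values."""
    return np.stack([img.ravel() for img in images]).astype(np.float64)

# Two toy 3x3 "images" stand in for the real 500x500 tiles.
imgs = [np.eye(3, dtype=np.uint8), np.zeros((3, 3), dtype=np.uint8)]
X = images_to_matrix(imgs)   # shape (2, 9); the real matrix is (346, 250000)
```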
The resulting data is unwieldy, and clearly redundant (as a back-of-the-envelope measure of information content, note that it gzips to 1.2M).
Test error obtained with LSPC for template matching was 31.03%, significantly better than chance but not useful.
Test error for template matching: 31.03%
Hu moments (feature benchmark)
Hu invariant moments are image moments which are invariant under translation, scaling, and rotation. They are not meant as pattern recognition features, however I use them here to provide another benchmark. This can be thought of as a slightly more sophisticated, image-processing-specific version of classifying data based on their summary statistics (mean, variance, etc).
This isn't completely off the rails: in this representation, for example, blank spaces are all zeros, and it's easy to imagine that some other patterns will be separable.
Using hu_moments.py I generated the $n=346$, $d=7$ vectors and classified using the same MATLAB code as above.
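To illustrate what these features are, here is a from-scratch NumPy sketch of the first two Hu moments (hu_moments.py presumably uses OpenCV's cv2.HuMoments; the formulas below are the standard definitions in terms of scale-normalized central moments):

```python
import numpy as np

def hu_first_two(img):
    """First two Hu invariant moments of a binary image.
    h1 and h2 are invariant under translation, scaling, and rotation."""
    ys, xs = np.nonzero(img)
    m00 = len(xs)                       # zeroth moment: number of 'on' pixels
    xbar, ybar = xs.mean(), ys.mean()   # centroid

    def mu(p, q):                       # central moment mu_pq
        return ((xs - xbar) ** p * (ys - ybar) ** q).sum()

    def eta(p, q):                      # scale-normalized central moment
        return mu(p, q) / m00 ** (1 + (p + q) / 2)

    h1 = eta(2, 0) + eta(0, 2)
    h2 = (eta(2, 0) - eta(0, 2)) ** 2 + 4 * eta(1, 1) ** 2
    return h1, h2
```

Because the moments are taken about the centroid and normalized by $m_{00}$, translating or scaling the glyph leaves the features unchanged.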
Test error obtained with LSPC for Hu features was 10.34%, showing even rudimentary features can outperform template matching. Less is more. Small Data! ^{3}
Test error for Hu features: 10.34%
Receptors
With better feature design, we can generate smaller models, achieve better classification, and possibly gain insight into the data by producing more explainable models. The disadvantage to this approach over a more general method is that the features produced are often domain-specific, design can be labor-intensive, and approaches can be non-obvious or unknown in some domains.
Example features for character recognition could be:
- counting connected components
- drawing regularly-spaced horizontal and vertical lines over the image and counting intersections for each
- statistics gleaned from approximating edges as lower-degree polygons
As an interesting compromise between manual feature design and automatic feature extraction, I found this Code Project post by Andrew Kirillov, who uses "receptors" (scroll to "Another Approach")^{4}.
The idea is to project the same set of small line segments on each image, and generate a vector of receptor activation (crossing the letter / not crossing the letter) for each image.
I construct receptors by generating a set of midpoints, lengths, and angles. The midpoint positions are normalized so the image centroid is $(0.5,0.5)$, and distances (length, offset from centroid) are normalized by the image diagonal. So these features should not depend on scale or translation (although significant variation in whitespace padding could break my implementation).
Midpoints have a Gaussian distribution $N(\mu = [0.5, 0.5], \sigma^2 = 0.2)$, and lengths are Rayleigh distributed ($\sigma = 0.08$). Angles are uniform between $0$ and $2\pi$.
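A minimal sketch of receptor generation under those distributions (the actual parametrization in gen_receptors.py may differ; here each receptor is stored as its two endpoints in normalized image coordinates):

```python
import numpy as np

rng = np.random.default_rng(0)

def gen_receptors(n):
    """Generate n receptors as (x1, y1, x2, y2) endpoints in normalized
    coordinates: midpoints ~ N([0.5, 0.5], sigma^2 = 0.2), lengths ~
    Rayleigh(sigma = 0.08), angles ~ Uniform(0, 2*pi)."""
    mid = rng.normal(0.5, np.sqrt(0.2), size=(n, 2))
    length = rng.rayleigh(0.08, size=n)
    angle = rng.uniform(0, 2 * np.pi, size=n)
    dx = 0.5 * length * np.cos(angle)
    dy = 0.5 * length * np.sin(angle)
    return np.column_stack([mid[:, 0] - dx, mid[:, 1] - dy,
                            mid[:, 0] + dx, mid[:, 1] + dy])
```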
With gen_receptors.py I generated 2500 receptors, and produced a CSV using gen_training_csv.py.
At first, I tried to improve upon the receptor model by making receptor activation a real number (the average pixel intensity across the receptor) rather than a binary activation. Test error obtained with LSPC for 2500 real-valued receptors was 22%. By clamping receptor values to binary 0/1, I obtained perfect classification.
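One simple way to compute a binary activation (a sketch; gen_training_csv.py may rasterize the segments differently) is to sample points along each segment and check whether any lands on a letter pixel:

```python
import numpy as np

def activations(img, receptors, samples=32):
    """Binary receptor activations for one image: a receptor fires if any
    sampled point along its segment lands on an 'on' (letter) pixel.
    Coordinates are normalized so (0, 0)-(1, 1) spans the image."""
    h, w = img.shape
    t = np.linspace(0.0, 1.0, samples)
    out = np.zeros(len(receptors), dtype=np.uint8)
    for i, (x1, y1, x2, y2) in enumerate(receptors):
        xs = np.clip(((x1 + t * (x2 - x1)) * (w - 1)).round().astype(int), 0, w - 1)
        ys = np.clip(((y1 + t * (y2 - y1)) * (h - 1)).round().astype(int), 0, h - 1)
        out[i] = img[ys, xs].any()
    return out
```

Averaging `img[ys, xs]` instead of taking `.any()` gives the real-valued variant that performed worse above.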
In the Code Project post, the author is using a neural network, and training time can be greatly improved by reducing the number of features. He uses empirical estimates of entropy to attempt to select the most useful features.
Here, $d=2500$ receptors are more than enough to perfectly classify my dataset, and training time with LSPC is a fraction of a second, making feature selection unnecessary. However, requiring so many receptors for such an easy classification task seems wasteful, so I discuss feature selection below.
Test error for 2500 binary receptors: 0.00%
Feature selection
Theoretical bound and PCA
While 2500 binary features perfectly separate the data, $c=28$ classes should be separable with $\lceil \log_2(28) \rceil = 5$ binary features^{5}, if five features can be found which:
- are consistent within-class (i.e. the feature is always or never on whenever it is shown an example of an 'S')
- usefully separate the space (e.g. a feature that is always on for A-M and always off for N-Z; a second feature is on for A, C, E, ... and off for B, D, F, ...)
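To see why five such features are enough in principle: $2^5 = 32 \ge 28$, so each class can be given a distinct 5-bit code, with the $b$-th ideal feature reporting bit $b$ of the code:

```python
import string

# 28 classes: A-Z, blank, and the "capital" icon.
classes = list(string.ascii_uppercase) + ["blank", "capital"]

# Assign each class a distinct 5-bit code; five ideal binary features
# (one per bit) would then identify any class exactly.
codes = {c: tuple((i >> b) & 1 for b in range(5)) for i, c in enumerate(classes)}
```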
These qualities can be measured statistically with entropy, but the idea is fixed in my intuition from playing Guess Who? as a kid. In the game, players attempt to select an individual from an array of faces by asking their opponent yes/no questions. "Is the person I am looking for female?" is a good first question to ask, because it splits the game space about evenly. "Does the person have green eyeglasses?" is highly specific and could pay off, but it's more likely to eliminate only one or two faces, so it is not a great first question. I'll discuss this more in the next section.
Although I know that only five binary features could separate 28 classes, I don't yet know that these features can be modeled with receptors (I have strong evidence that such features exist, however, since I have separated the space using 2500 receptors).
More evidence that I can use significantly fewer than 2500 receptors: I used Principal Component Analysis, which I have written about previously, on the receptor activation data. Using PCA I can project the data along the first $k$ principal components, which is like reshaping the data along a new set of axes which are uncorrelated.
It's in part a way to get at the underlying dimensionality of the data: a $d \times d$ matrix will have $d$ principal components, but only a few may be large. In my test with 4841 receptors only about 10 principal components were significantly larger than zero, suggesting that a low-dimensional representation will be sufficient to represent most of the information.
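The projection step can be sketched with an SVD of the centered activation matrix (a Python stand-in for the MATLAB rocr_pca.m; the random binary matrix below stands in for real receptor activations):

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the first k principal components."""
    Xc = X - X.mean(axis=0)                          # center each feature
    # SVD of centered data: rows of Vt are the principal axes,
    # singular values s are sorted in descending order.
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T, s

# Stand-in for an (n_images, n_receptors) binary activation matrix.
X = np.random.default_rng(1).integers(0, 2, size=(50, 200)).astype(float)
Z, s = pca_project(X, 5)   # Z has shape (50, 5)
```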
I used rocr_pca.m to do PCA on the receptor activation data. Using the first $k$ principal components^{6}:
| $k$ | LSPC test error |
|---|---|
| 1 | 24.14% |
| 2 | 5.75% |
| 3 | 1.15% |
| 4 | 0.00% |
| 5 | 0.00% |
(Note that fewer than five principal components were needed. This is because the projection along these components is realvalued, not binary).
With, for example, 5 principal components, I still need to compute the activation for all 4841 receptors before projecting in order to classify an image^{7}. But this experiment shows that the underlying information is not nearly 4841dimensional and far fewer receptors should be needed. Selecting useful receptors is explored in the following section.
Test error for 4 principal components (from 4841 binary receptors): 0.00%
Entropy
Andrew Kirillov uses entropy to do feature selection. I reimplemented this approach at first, and I'll summarize it here. The upshot is that it can reduce the number of features required, but not as dramatically as hill-climbing (below).
Probability
I know that only a few receptors should be required. I generate a field of 5,000 receptors, which should be large enough to contain a few useful features. I model receptor activation (binary on/off) with a random variable $Y_k \in \{0,1\}$ (where $k$ indexes the particular receptor, 1 to 5,000). A random variable $X \in \{A, B, C, ...\}$ models the class of a given image.
Some quantities, which I empirically estimate from my training images:
- $p(X)$: class frequency distribution, e.g. $p(A) = 0.1069$, $p(W) = 0.0173$
- $p(Y_k=1)$: probability that receptor $k$ is on (across all images)
- $p(Y_k=1 \mid X=x)$: probability that receptor $k$ is on, given that the image is a specific letter $x$
For example, if receptor 903 is always on when the letter is 'Q', then $p(Y_{903}=1 \mid X=q) = 1$. This means that all 'Q's trigger receptor 903, which is good, but it does not mean that receptor 903 is a good indication that the letter is a 'Q'.
Using Bayes' rule, I can compute $p(X=x \mid Y_k=1)$, the probability that an image is a specific letter given that receptor $k$ is on:
\begin{equation} p(X=x \mid Y_k=1) = \frac{p(Y_k=1 \mid X=x) \cdot p(X=x)}{p(Y_k=1)} \end{equation}
Receptor 903 may light up for W's and A's also (or for every letter! In other words, $p(X=q \mid Y_{903}=1)$ may still be small).
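Estimating these quantities from data is just counting. A sketch of the Bayes computation above, using a hypothetical helper (not from the actual code) that takes one receptor's binary activation vector and the per-image labels:

```python
import numpy as np

def posterior_given_on(y, labels, x):
    """Empirical Bayes estimate of p(X = x | Y = 1) for one receptor,
    from its binary activations y and the image class labels."""
    p_y1 = y.mean()                         # p(Y = 1)
    p_x = (labels == x).mean()              # p(X = x)
    p_y1_given_x = y[labels == x].mean()    # p(Y = 1 | X = x)
    return p_y1_given_x * p_x / p_y1
```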
H(Y|X)
So two quantities are interesting. We want receptors which are consistent within their classes. We can measure this by the entropy $H(Y_k \mid X)$. This is the average of the conditional entropy $H(Y_k \mid X=x)$ across all $x$. For example, if $H(Y_{k} \mid X=q)$ is large, then there is uncertainty about whether receptor $k$ will turn on for a 'Q'. If it is zero, then receptor $k$ is either always on or always off when presented with a 'Q'.
H(X|Y)
More interesting is how well a receptor determines class. This is $H(X \mid Y_k=1)$, which captures the uncertainty in $X$ given that $Y_k$ is on. It is minimized by certainty (if $p(X=x \mid Y_k=1) = 1$ for some $x$), and maximized ($\log_2 C$) by a uniform distribution, indicating complete uncertainty.^{8}
I select receptors which are consistent and divide the space by selecting small $H(Y_k \mid X)$ and large $H(X \mid Y_k=1)$. This is done by assigning to each receptor a 'usefulness' score defined by the product:
\begin{equation} H(X \mid Y_k=1) \cdot (1 - H(Y_k \mid X)), \end{equation}
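The score can be sketched as follows, estimating both entropies empirically from one receptor's binary activations and the class labels (a stand-in for the actual implementation):

```python
import numpy as np

def entropy(p):
    """Shannon entropy (bits) of a discrete distribution."""
    p = p[p > 0]
    return -(p * np.log2(p)).sum()

def usefulness(y, labels):
    """Score one binary receptor as H(X|Y=1) * (1 - H(Y|X))."""
    classes = np.unique(labels)
    # H(Y|X): average over classes of the binary entropy of y within each class.
    h_y_given_x = 0.0
    for c in classes:
        q = y[labels == c].mean()
        h_y_given_x += (labels == c).mean() * entropy(np.array([q, 1 - q]))
    # H(X|Y=1): entropy of the class distribution among firing images.
    on = labels[y == 1]
    if len(on) == 0:
        return 0.0
    p = np.array([(on == c).mean() for c in classes])
    return entropy(p) * (1 - h_y_given_x)
```

A receptor that fires consistently for exactly half of the classes scores highest; a noisy receptor (within-class entropy 1 bit) scores zero.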
and selecting the receptors with the $N$ highest values. By classifying using LSPC with the first $N$ receptors, I plot test error vs $N$, showing how many receptors are required for successful classification (from an initial set of 2500):
So only about 600 receptors are required for perfect classification, which is an improvement, but the image shows that the first nearly-200 receptors are completely useless!
I also tried selecting for receptor specificity, i.e. recomputing usefulness as:
\begin{equation} (\log_2 C - H(X \mid Y_k=1)) \cdot (1 - H(Y_k \mid X)), \end{equation}
Note the larger horizontal axis here. With this method, the first few receptors were more immediately useful, but it took longer to converge. Also, this is from an initial set of 5000. This is evidence that there is a balance between initial set size (larger = more likely to generate useful receptors) and redundancy.
(This method used revisions 2fb264a9a0e759eaa06e0c0a9cc263c655f78f17 and 3706c08a1d067cf13a3d2efd3c4317fc76071e44 of gen_receptors.py).
Both methods are a clear improvement over the full 2,500 or 5,000 receptors, but hundreds of receptors still seems like too many. This is addressed in the next section.
Test error for first 600 features selected using entropy: 0.00%
Redundancy (KL divergence)
The entropy-based selection described above has a certain information-theoretic appeal, but it does not look for receptors which split the class space differently. A receptor which separates A-M from N-Z has high 'usefulness' as calculated above, but so do 200 receptors which separate A-M from N-Z, and these do not add information beyond the first. In the image below, the first 1,000 or so receptors (shown in green) very specifically identify a capital. Only after 1,000 do we get receptors which activate for other letters (red).
I attempted to address this by augmenting the usefulness score with a measure of redundancy^{9}. The Kullback-Leibler divergence is a non-negative measure of difference between probability distributions, closely related to mutual information and entropy. $D_{KL}(P \| Q)$ is not symmetric, but $D_{KL}(P \| Q) + D_{KL}(Q \| P)$ is. So I use the following algorithm to score receptors:
- Compute usefulness for all receptors as above
- Until no receptors remain:
  - Add the most useful receptor to an ordered set $S$
  - Recompute usefulness for all remaining receptors, multiplying by the average symmetric KL divergence from the receptors in $S$
- Take the first $N$ receptors from $S$; these are the $N$ most 'useful' receptors
(This method used revision e5be113fbe1dccdad865c78ccdf5cfef10eae127 of gen_receptors.py).
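A sketch of that loop (the `dists` argument holds each receptor's empirical $p(X \mid Y_k=1)$ distribution; details of the real implementation may differ):

```python
import numpy as np

def sym_kl(p, q, eps=1e-12):
    """Symmetric KL divergence between two discrete distributions."""
    p, q = p + eps, q + eps
    return float((p * np.log2(p / q)).sum() + (q * np.log2(q / p)).sum())

def rank_with_redundancy(scores, dists):
    """Greedy ordering: repeatedly take the highest-scoring receptor,
    then down-weight the rest by their average symmetric KL divergence
    from the receptors already chosen."""
    remaining = list(range(len(scores)))
    chosen = []
    s = np.asarray(scores, dtype=float)
    while remaining:
        best = max(remaining, key=lambda i: s[i])
        chosen.append(best)
        remaining.remove(best)
        for i in remaining:
            div = np.mean([sym_kl(dists[i], dists[j]) for j in chosen])
            s[i] = scores[i] * div     # redundant receptors sink in rank
    return chosen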
The result of this approach, from an initial set of 5000, is shown below (here I am minimizing $H(X \mid Y)$, as in the previous plot).
Initially, this is a clear improvement over the last method, but it has not succeeded in eliminating many features in the end. This leads me to abandon the information-theoretic approach entirely in favour of a simple method discussed in the next section.
Greedy hill-climbing & pruning
A simple feature selection strategy is greedy hill-climbing. Start with an empty set of features, $S$, and a set of remaining features, $R$.
- Until $R$ is empty or the test error has not improved in a few iterations:
  - For every feature $k \in R$, train a classifier with features $S + k$.
  - Add whichever feature decreased test error the most to $S$
  - Remove it from $R$
This involves training an astonishing number of classifiers: on the order of the triangular number $T_n = n(n+1)/2$ for $n$ initial features. Fortunately this is possible in a reasonable period of time with LSPC, and even fast if I add the five best features at each iteration instead of proceeding one at a time.
The "greedy" aspect of this algorithm is that it never unselects a feature. It is descending^{10} an objective function (test error) by taking the steepest immediate step at every iteration. A complete search would examine the power set of features, $2^{5000}$ subsets, but that is too many and the greedy hill climb works well enough.
Since I added five features at a time, hill climbing is followed by a pruning step, which is much the same but in reverse: one feature is removed at a time. Code is rocr_hill.m.
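The forward-selection loop can be sketched as follows, simplified to one feature per iteration and a generic `error_fn` (whereas rocr_hill.m scores candidates with LSPC test error and adds five at a time):

```python
import numpy as np

def greedy_select(X, labels, error_fn, patience=3):
    """Greedy forward selection: repeatedly add the single feature whose
    inclusion lowers error_fn(X[:, S], labels) the most, stopping when no
    improvement has been seen for `patience` rounds."""
    S, R = [], list(range(X.shape[1]))
    best_err, stale = float("inf"), 0
    while R and stale < patience:
        # Try every remaining feature and keep the best.
        errs = [(error_fn(X[:, S + [k]], labels), k) for k in R]
        err, k = min(errs)
        S.append(k)
        R.remove(k)
        if err < best_err:
            best_err, stale = err, 0
        else:
            stale += 1
    return S, best_err
```

A pruning pass is the same loop in reverse: repeatedly drop whichever feature's removal leaves the error unchanged (or improves it).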
Results
In a test with 5000 initial receptors, 45 were added and then pruned to 20 while maintaining perfect classification. These receptors are shown below on a blank image and two letters:
The features are shown below, along with their entropies and probabilities of activation $p(Y_k=1 \mid X=x)$. This helps illustrate how each feature breaks up the space:
Legend: On (> 90%), Mid (15% - 90%), Low (1% - 15%), Off (< 1%)

(The original table also color-codes each receptor's activation level, per the legend above, for every class: blank, capital, and A through Z. Those cells do not survive in plain text; the entropies do.)

| # | $H(Y_k \mid X)$ | $H(X \mid Y_k=1)$ |
|---|---|---|
| 4621 | 0.06 | 4.50 |
| 2133 | 0.08 | 3.92 |
| 2598 | 0.04 | 3.81 |
| 3035 | 0.11 | 3.78 |
| 2404 | 0.10 | 3.65 |
| 1571 | 0.13 | 3.33 |
| 2186 | 0.07 | 3.31 |
| 2151 | 0.15 | 3.28 |
| 1460 | 0.07 | 3.24 |
| 1777 | 0.03 | 3.09 |
| 1195 | 0.09 | 2.98 |
| 1645 | 0.19 | 2.85 |
| 1202 | 0.11 | 2.40 |
| 1330 | 0.14 | 1.84 |
| 154 | 0.09 | 1.46 |
| 1145 | 0.05 | 1.06 |
| 177 | 0.01 | 0.00 |
| 392 | 0.01 | 0.00 |
| 445 | 0.01 | 0.00 |
| 315 | 0.01 | 0.00 |
Summary
Here is a summary of the techniques discussed in this article. In all cases (except chance), LSPC is used to classify the generated/selected features.
| Method | Features | Error |
|---|---|---|
| Chance | 0 | ~90% |
| Template matching | 250,000 | 31.03% |
| Hu moments | 7 | 10.34% |
| PCA (continuous activation) | 8 principal components (4841 underlying receptors) | 0% |
| PCA (binary activation) | 4 principal components (4841 underlying receptors) | 0% |
| Receptors (continuous activation) | 2500 | 22% |
| Receptors (binary activation) | 2500 | 0% |
| Entropy selection (max $H(X \mid Y)$) | ~600 (initially 2500) | 0% |
| Entropy selection (min $H(X \mid Y)$) | ~1700 (initially 5000) | 0% |
| Entropy (min $H(X \mid Y)$) + KL divergence | ~1500 (initially 5000) | 0% |
| Greedy hill-climbing & pruning | 20 (initially 5000) | 0% |
References/Notes

1. Sugiyama, Masashi. "Superfast-trainable multi-class probabilistic classifier by least-squares posterior fitting." IEICE Transactions on Information and Systems 93.10 (2010): 2690-2701.

2. Or 'deslanting', discussed in Teow, Loo-Nin, and Kia-Fock Loe. "Robust vision-based features and classification schemes for off-line handwritten digit recognition." Pattern Recognition 35.11 (2002): 2355-2364.

3. This illustrates an inefficiency in the model: the template matching dataset contained more information; extracting the Hu features did not add any information. By rearranging information and throwing away redundant data, I was able to improve performance here. It is theoretically possible that a sufficiently complex neural network could perform optimally (here, perfect classification is possible) on the raw data. In some cases this is impractical, and some preprocessing can go a long way. On the other hand, especially in image recognition tasks, DBNs and other deep neural models have shown remarkable results and can be used to generate high-level features automatically. It's possible that this kind of model can make manual feature design unnecessary, and enable feature design in spaces we do not understand.

4. It's worth noting the similarities to template matching here. These features are not invariant to rotation or skew, but are somewhat more flexible than templates because activation may take place anywhere on an arbitrary line segment.

5. One feature can split 28 classes into two groups of 14. Further independent features could split those into four groups of 7, then four groups of 3 and four groups of 4, and so on: [28] > [14 14] > [7 7 7 7] > [3 3 3 3 4 4 4 4] > [2 2 2 2 2 2 2 2 2 2 2 2 1 1 1 1] > 28 singleton groups. So 5 features are needed.

6. With continuous rather than binary receptor activations, PCA+LSPC required 8 principal components.

7. For LSPC, this means PCA is no real advantage. It is cheap enough to classify $d=2500$ vectors. Preprocessing with PCA will be a big speedup for models which depend more heavily on dimension.

8. Maximizing this value, then, would seem counterintuitive. However, receptors which evenly split the class space also have high entropy. I also tried minimizing $H(X \mid Y)$, which is more akin to asking "does the subject wear green eyeglasses?" than "does the subject have facial hair?", but small values here at least ensure there is some certainty. Both approaches suffer from the redundancy issue I discuss in the next section.

9. This combined entropy/redundancy approach is vaguely reminiscent of more mature techniques like mRMR.

10. I guess nobody calls it "valley-descending".