
About Feature Scaling and Normalization

Posted on 2018-10-15

About standardization

The result of standardization (or Z-score normalization) is that the features will be rescaled so that they’ll have the properties of a standard normal distribution with $\mu=0$ and $\sigma=1$

where $\mu$ is the mean (average) and $\sigma$ is the standard deviation from the mean; standard scores (also called z scores) of the samples are calculated as follows:
$$
z = \frac{x - \mu}{\sigma}
$$
Standardizing the features so that they are centered around 0 with a standard deviation of 1 is not only important if we are comparing measurements that have different units, but it is also a general requirement for many machine learning algorithms. Intuitively, we can think of gradient descent as a prominent example (an optimization algorithm often used in logistic regression, SVMs, perceptrons, neural networks etc.); with features being on different scales, certain weights may update faster than others since the feature values $x_j$ play a role in the weight updates
$$
\Delta w_j = - \eta \frac{\partial J}{\partial w_j} = \eta \sum_i (t^{(i)} - o^{(i)}) x_j^{(i)},
$$
so that

$w_j := w_j + \Delta w_j$, where $\eta$ is the learning rate, $t$ the target class label, and $o$ the actual output. Other intuitive examples include K-Nearest Neighbor algorithms and clustering algorithms that use, for example, Euclidean distance measures – in fact, tree-based classifiers are probably the only classifiers where feature scaling doesn’t make a difference.
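To make the scale effect concrete, here is a toy sketch of a single perceptron-style weight update for two features on very different scales (all numbers here are made up for illustration):

import numpy as np

# Toy single-sample update: two features on very different scales (made-up numbers)
x = np.array([0.5, 500.0])   # feature 1 in "small" units, feature 2 in "large" units
t, o = 1.0, 0.2              # target and current output
eta = 0.01                   # learning rate

delta_w = eta * (t - o) * x
print(delta_w)               # the update is [0.004, 4.0]: the second weight moves 1000x further

Because the gradient step for each weight is proportional to its feature value, the weight of the "large" feature dominates the updates unless the features are put on a comparable scale first.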

In fact, the only family of algorithms that I could think of being scale-invariant are tree-based methods. Let’s take the general CART decision tree algorithm. Without going into much depth regarding information gain and impurity measures, we can think of the decision as “is feature $x_i$ >= some_val?” Intuitively, we can see that it really doesn’t matter on which scale this feature is (centimeters, Fahrenheit, a standardized scale – it really doesn’t matter).
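As a quick sanity check of that claim, here is a minimal sketch (using scikit-learn's bundled copy of the same Wine data used later in this post; the scaling factor is arbitrary) showing that a decision tree makes identical predictions whether or not the features are rescaled:

import numpy as np
from sklearn.datasets import load_wine
from sklearn.tree import DecisionTreeClassifier

X, y = load_wine(return_X_y=True)

# One tree on the raw features ...
tree_raw = DecisionTreeClassifier(random_state=0).fit(X, y)

# ... and one on features multiplied by an arbitrary positive constant
X_scaled = X * 1000.0
tree_scaled = DecisionTreeClassifier(random_state=0).fit(X_scaled, y)

# The split thresholds differ, but the predictions are the same
print(np.array_equal(tree_raw.predict(X), tree_scaled.predict(X_scaled)))  # True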

Some examples of algorithms where feature scaling matters are:

  • k-nearest neighbors with a Euclidean distance measure, if you want all features to contribute equally
  • k-means (see k-nearest neighbors)
  • logistic regression, SVMs, perceptrons, neural networks etc. if you are using gradient descent/ascent-based optimization, otherwise some weights will update much faster than others
  • linear discriminant analysis, principal component analysis, kernel principal component analysis, since you want to find the directions that maximize the variance (under the constraint that those directions/eigenvectors/principal components are orthogonal); you want to have features on the same scale since you’d otherwise emphasize variables on “larger measurement scales” more. There are many more cases than I can possibly list here … I always recommend thinking about the algorithm and what it’s doing, and then it typically becomes obvious whether you want to scale your features or not.

In addition, we’d also want to think about whether we want to “standardize” or “normalize” (here: scaling to [0, 1] range) our data. Some algorithms assume that our data is centered at 0. For example, if we initialize the weights of a small multi-layer perceptron with tanh activation units to 0 or small random values centered around zero, we want to update the model weights “equally.” As a rule of thumb I’d say: When in doubt, just standardize the data, it shouldn’t hurt.

About Min-Max scaling

An alternative approach to Z-score normalization (or standardization) is the so-called Min-Max scaling (often also simply called “normalization” - a common cause of ambiguity).
In this approach, the data is scaled to a fixed range - usually 0 to 1.
The cost of having this bounded range - in contrast to standardization - is that we will end up with smaller standard deviations, which can suppress the effect of outliers.

A Min-Max scaling is typically done via the following equation:
$$
X_{norm} = \frac{X - X_{min}}{X_{max} - X_{min}}
$$

Z-score standardization or Min-Max scaling?

“Standardization or Min-Max scaling?” - There is no obvious answer to this question: it really depends on the application.

For example, in clustering analyses, standardization may be especially crucial in order to compare similarities between features based on certain distance measures. Another prominent example is the Principal Component Analysis, where we usually prefer standardization over Min-Max scaling, since we are interested in the components that maximize the variance (depending on the question and if the PCA computes the components via the correlation matrix instead of the covariance matrix; but more about PCA in my previous article).

However, this doesn’t mean that Min-Max scaling is not useful at all! A popular application is image processing, where pixel intensities have to be normalized to fit within a certain range (i.e., 0 to 255 for the RGB color range). Also, typical neural network algorithms require data on a 0-1 scale.

Standardizing and normalizing - how it can be done using scikit-learn

Of course, we could make use of NumPy’s vectorization capabilities to calculate the z-scores for standardization and to normalize the data using the equations that were mentioned in the previous sections. However, there is an even more convenient approach using the preprocessing module from scikit-learn, one of Python’s open-source machine learning libraries.

For the following examples and discussion, we will have a look at the free “Wine” Dataset that is deposited on the UCI machine learning repository
(http://archive.ics.uci.edu/ml/datasets/Wine).

Forina, M. et al, PARVUS - An Extendible Package for Data Exploration, Classification and Correlation. Institute of Pharmaceutical and Food Analysis and Technologies, Via Brigata Salerno, 16147 Genoa, Italy.

Bache, K. & Lichman, M. (2013). UCI Machine Learning Repository [http://archive.ics.uci.edu/ml]. Irvine, CA: University of California, School of Information and Computer Science.

The Wine dataset consists of 3 different classes where each row corresponds to a particular wine sample.

The class labels (1, 2, 3) are listed in the first column, and the columns 2-14 correspond to 13 different attributes (features):

1) Alcohol
2) Malic acid
…

Loading the wine dataset

import pandas as pd
import numpy as np

df = pd.io.parsers.read_csv(
    'https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/wine_data.csv',
    header=None,
    usecols=[0,1,2])

df.columns = ['Class label', 'Alcohol', 'Malic acid']
df.head()
   Class label  Alcohol  Malic acid
0            1    14.23        1.71
1            1    13.20        1.78
2            1    13.16        2.36
3            1    14.37        1.95
4            1    13.24        2.59

As we can see in the table above, the features Alcohol (percent/volume) and Malic acid (g/l) are measured on different scales, so that feature scaling is necessary prior to any comparison or combination of these data.

Standardization and Min-Max scaling

from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(df[['Alcohol', 'Malic acid']])
df_std = std_scale.transform(df[['Alcohol', 'Malic acid']])

minmax_scale = preprocessing.MinMaxScaler().fit(df[['Alcohol', 'Malic acid']])
df_minmax = minmax_scale.transform(df[['Alcohol', 'Malic acid']])
print('Mean after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_std[:,0].mean(), df_std[:,1].mean()))
print('\nStandard deviation after standardization:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_std[:,0].std(), df_std[:,1].std()))
Mean after standardization:
Alcohol=0.00, Malic acid=0.00

Standard deviation after standardization:
Alcohol=1.00, Malic acid=1.00
print('Min-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_minmax[:,0].min(), df_minmax[:,1].min()))
print('\nMax-value after min-max scaling:\nAlcohol={:.2f}, Malic acid={:.2f}'
      .format(df_minmax[:,0].max(), df_minmax[:,1].max()))
Min-value after min-max scaling:
Alcohol=0.00, Malic acid=0.00

Max-value after min-max scaling:
Alcohol=1.00, Malic acid=1.00

Plotting

%matplotlib inline
from matplotlib import pyplot as plt

def plot():
    plt.figure(figsize=(8,6))

    plt.scatter(df['Alcohol'], df['Malic acid'],
                color='green', label='input scale', alpha=0.5)
    plt.scatter(df_std[:,0], df_std[:,1], color='red',
                label=r'Standardized [$N (\mu=0, \; \sigma=1)$]', alpha=0.3)
    plt.scatter(df_minmax[:,0], df_minmax[:,1],
                color='blue', label='min-max scaled [min=0, max=1]', alpha=0.3)

    plt.title('Alcohol and Malic Acid content of the wine dataset')
    plt.xlabel('Alcohol')
    plt.ylabel('Malic Acid')
    plt.legend(loc='upper left')
    plt.grid()
    plt.tight_layout()

plot()
plt.show()


The plot above includes the wine datapoints on all three different scales: the input scale where the alcohol content was measured in volume-percent (green), the standardized features (red), and the normalized features (blue). In the following plot, we will zoom in on each of the three different scales.

fig, ax = plt.subplots(3, figsize=(6,14))

for a, d, l in zip(range(len(ax)),
                   (df[['Alcohol', 'Malic acid']].values, df_std, df_minmax),
                   ('Input scale',
                    r'Standardized [$N (\mu=0, \; \sigma=1)$]',
                    'min-max scaled [min=0, max=1]')):
    for i, c in zip(range(1,4), ('red', 'blue', 'green')):
        ax[a].scatter(d[df['Class label'].values == i, 0],
                      d[df['Class label'].values == i, 1],
                      alpha=0.5,
                      color=c,
                      label='Class %s' % i)
    ax[a].set_title(l)
    ax[a].set_xlabel('Alcohol')
    ax[a].set_ylabel('Malic Acid')
    ax[a].legend(loc='upper left')
    ax[a].grid()

plt.tight_layout()
plt.show()


Bottom-up approaches

Of course, we can also code the equations for standardization and 0-1 Min-Max scaling “manually”. However, the scikit-learn methods are still useful if you are working with test and training data sets and want to scale them equally.

E.g.,

std_scale = preprocessing.StandardScaler().fit(X_train)
X_train = std_scale.transform(X_train)
X_test = std_scale.transform(X_test)

Below, we will perform the calculations using “pure” Python code, and a more convenient NumPy solution, which is especially useful if we want to transform a whole matrix.

Vanilla Python

# Standardization
x = [1, 4, 5, 6, 6, 2, 3]
mean = sum(x) / len(x)
std_dev = (1 / len(x) * sum([(x_i - mean)**2 for x_i in x]))**0.5
z_scores = [(x_i - mean) / std_dev for x_i in x]

# Min-Max scaling
minmax = [(x_i - min(x)) / (max(x) - min(x)) for x_i in x]

NumPy

import numpy as np

# Standardization
x_np = np.asarray(x)
z_scores_np = (x_np - x_np.mean()) / x_np.std()

# Min-Max scaling
np_minmax = (x_np - x_np.min()) / (x_np.max() - x_np.min())
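The same NumPy expressions also work column by column on a whole matrix if we aggregate along axis=0 (a minimal sketch with a small made-up matrix):

# Column-wise standardization and Min-Max scaling of a 2D array (toy data)
X = np.array([[1.0, 10.0],
              [4.0, 20.0],
              [5.0, 60.0]])

X_std = (X - X.mean(axis=0)) / X.std(axis=0)
X_minmax = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))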

Visualization

Just to make sure that our code works correctly, let us plot the results via matplotlib.

from matplotlib import pyplot as plt

fig, ((ax1, ax2), (ax3, ax4)) = plt.subplots(nrows=2, ncols=2, figsize=(10,5))

y_pos = [0 for i in range(len(x))]

ax1.scatter(z_scores, y_pos, color='g')
ax1.set_title('Python standardization', color='g')

ax2.scatter(minmax, y_pos, color='g')
ax2.set_title('Python Min-Max scaling', color='g')

ax3.scatter(z_scores_np, y_pos, color='b')
ax3.set_title('Python NumPy standardization', color='b')

ax4.scatter(np_minmax, y_pos, color='b')
ax4.set_title('Python NumPy Min-Max scaling', color='b')

plt.tight_layout()

for ax in (ax1, ax2, ax3, ax4):
    ax.get_yaxis().set_visible(False)
    ax.grid()

plt.show()


The effect of standardization on PCA in a pattern classification task

Earlier, I mentioned the Principal Component Analysis (PCA) as an example where standardization is crucial, since it is “analyzing” the variances of the different features.
Now, let us see how standardization affects PCA and a subsequent supervised classification on the whole wine dataset.

In the following section, we will go through the following steps:

  • Reading in the dataset
  • Dividing the dataset into a separate training and test dataset
  • Standardization of the features
  • Principal Component Analysis (PCA) to reduce the dimensionality
  • Training a naive Bayes classifier
  • Evaluating the classification accuracy with and without standardization

Reading in the dataset

import pandas as pd

df = pd.io.parsers.read_csv(
    'https://raw.githubusercontent.com/rasbt/pattern_classification/master/data/wine_data.csv',
    header=None)

Dividing the dataset into a separate training and test dataset

In this step, we will randomly divide the wine dataset into a training dataset and a test dataset, where the training dataset will contain 70% of the samples and the test dataset the remaining 30%.

# Note: in newer scikit-learn versions, train_test_split lives in sklearn.model_selection
from sklearn.cross_validation import train_test_split

X_wine = df.values[:,1:]
y_wine = df.values[:,0]

X_train, X_test, y_train, y_test = train_test_split(X_wine, y_wine,
                                                    test_size=0.30, random_state=12345)

Feature Scaling - Standardization

from sklearn import preprocessing

std_scale = preprocessing.StandardScaler().fit(X_train)
X_train_std = std_scale.transform(X_train)
X_test_std = std_scale.transform(X_test)

Dimensionality reduction via Principal Component Analysis (PCA)

Now, we perform a PCA on the standardized and the non-standardized datasets to transform the dataset onto a 2-dimensional feature subspace.
In a real application, a procedure like cross-validation would be used in order to find out what choice of features yields an optimal balance between “preserving information” and “overfitting” for different classifiers. However, we will omit this step since we don’t want to train a perfect classifier here, but merely compare the effects of standardization.

from sklearn.decomposition import PCA

# on non-standardized data
pca = PCA(n_components=2).fit(X_train)
X_train = pca.transform(X_train)
X_test = pca.transform(X_test)

# on standardized data
pca_std = PCA(n_components=2).fit(X_train_std)
X_train_std = pca_std.transform(X_train_std)
X_test_std = pca_std.transform(X_test_std)

Let us quickly visualize what our new feature subspace looks like (note that class labels are not considered in a PCA - in contrast to a Linear Discriminant Analysis - but I will add them to the plot for clarity).

from matplotlib import pyplot as plt

fig, (ax1, ax2) = plt.subplots(ncols=2, figsize=(10,4))

for l, c, m in zip(range(1,4), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax1.scatter(X_train[y_train==l, 0], X_train[y_train==l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m)

for l, c, m in zip(range(1,4), ('blue', 'red', 'green'), ('^', 's', 'o')):
    ax2.scatter(X_train_std[y_train==l, 0], X_train_std[y_train==l, 1],
                color=c,
                label='class %s' % l,
                alpha=0.5,
                marker=m)

ax1.set_title('Transformed NON-standardized training dataset after PCA')
ax2.set_title('Transformed standardized training dataset after PCA')

for ax in (ax1, ax2):
    ax.set_xlabel('1st principal component')
    ax.set_ylabel('2nd principal component')
    ax.legend(loc='upper right')
    ax.grid()

plt.tight_layout()
plt.show()


Training a naive Bayes classifier

We will use a naive Bayes classifier for the classification task. If you are not familiar with it, the term “naive” comes from the assumption that all features are “independent.”
All in all, it is a simple but robust classifier based on Bayes’ rule.

I don’t want to get into more detail about Bayes’ rule in this article, but if you are interested in a more detailed collection of examples, please have a look at the Statistical Pattern Classification examples in my pattern classification repository.

from sklearn.naive_bayes import GaussianNB

# on non-standardized data
gnb = GaussianNB()
fit = gnb.fit(X_train, y_train)

# on standardized data
gnb_std = GaussianNB()
fit_std = gnb_std.fit(X_train_std, y_train)

Evaluating the classification accuracy with and without standardization

from sklearn import metrics

pred_train = gnb.predict(X_train)
print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train)))

pred_test = gnb.predict(X_test)
print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test)))
Prediction accuracy for the training dataset
81.45%

Prediction accuracy for the test dataset
64.81%
pred_train_std = gnb_std.predict(X_train_std)
print('\nPrediction accuracy for the training dataset')
print('{:.2%}'.format(metrics.accuracy_score(y_train, pred_train_std)))

pred_test_std = gnb_std.predict(X_test_std)
print('\nPrediction accuracy for the test dataset')
print('{:.2%}\n'.format(metrics.accuracy_score(y_test, pred_test_std)))
Prediction accuracy for the training dataset
96.77%

Prediction accuracy for the test dataset
98.15%

As we can see, standardization prior to PCA definitely led to a decrease in the empirical error rate when classifying samples from the test dataset.

Appendix A: The effect of scaling and mean centering of variables prior to PCA

Let us think about whether it matters or not if the variables are centered for applications such as Principal Component Analysis (PCA), if the PCA is calculated from the covariance matrix (i.e., the $k$ principal components are the eigenvectors of the covariance matrix that correspond to the $k$ largest eigenvalues).

1. Mean centering does not affect the covariance matrix

Here, the rationale is: if the covariance is the same whether the variables are centered or not, the result of the PCA will be the same.

Let’s assume we have the two variables $x$ and $y$. Then the covariance between the attributes is calculated as
$$
\sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$
Let us write the centered variables as
$$
x' = x - \bar{x} \text{ and } y' = y - \bar{y}
$$
The centered covariance would then be calculated as follows:
$$
\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (x_i' - \bar{x}')(y_i' - \bar{y}')
$$
But since after centering, $\bar{x}' = 0$ and $\bar{y}' = 0$, we have

$\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} x_i' y_i'$, which is our original covariance if we resubstitute the terms $x' = x - \bar{x} \text{ and } y' = y - \bar{y}$.

Even centering only one variable, e.g., $x$, wouldn't affect the covariance:

$$
\sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i' - \bar{x}')(y_i - \bar{y})
$$

2. Scaling of variables does affect the covariance matrix

If one variable is scaled, e.g., from pounds into kilograms (1 pound = 0.453592 kg), it does affect the covariance and therefore influences the results of a PCA.

Let $c$ be the scaling factor for $x$.

Given that the “original” covariance is calculated as

$$
\sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$
the covariance after scaling would be calculated as:

$$
\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (c \cdot x_i - c \cdot \bar{x})(y_i - \bar{y}) = \frac{c}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y}) \Rightarrow \sigma_{xy}' = c \cdot \sigma_{xy}
$$
Therefore, scaling one attribute by the constant $c$ rescales the covariance to $c \cdot \sigma_{xy}$. So if we scale $x$ from pounds to kilograms, the covariance between $x$ and $y$ becomes 0.453592 times its original value.
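A quick numerical check of this identity (a minimal sketch with made-up data; the variable names are my own):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

c = 0.453592  # pounds -> kilograms
cov_xy = np.cov(x, y)[0, 1]
cov_cxy = np.cov(c * x, y)[0, 1]

print(np.isclose(cov_cxy, c * cov_xy))  # True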

3. Standardizing affects the covariance

Standardization of features will have an effect on the outcome of a PCA (assuming that the variables are originally not standardized). This is because standardizing divides the covariance between every pair of variables by the product of their standard deviations.

The equation for standardization of a variable is written as

$$
z = \frac{x_i - \bar{x}}{\sigma}
$$
The “original” covariance matrix:

$$
\sigma_{xy} = \frac{1}{n-1} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$
And after standardizing both variables:

$$
x' = \frac{x - \bar{x}}{\sigma_x} \text{ and } y' = \frac{y - \bar{y}}{\sigma_y}
$$

$$
\sigma_{xy}' = \frac{1}{n-1} \sum_{i}^{n} (x_i' - 0)(y_i' - 0) = \frac{1}{n-1} \sum_{i}^{n} \bigg(\frac{x_i - \bar{x}}{\sigma_x}\bigg)\bigg(\frac{y_i - \bar{y}}{\sigma_y}\bigg) = \frac{1}{(n-1) \cdot \sigma_x \sigma_y} \sum_{i}^{n} (x_i - \bar{x})(y_i - \bar{y})
$$

$$
\Rightarrow \sigma_{xy}' = \frac{\sigma_{xy}}{\sigma_x \sigma_y}
$$
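In other words, the covariance of the standardized variables is the Pearson correlation coefficient. A quick numerical check (a minimal sketch with made-up data; ddof=1 matches the 1/(n-1) convention used above):

import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1000)
y = 0.5 * x + rng.normal(size=1000)

x_std = (x - x.mean()) / x.std(ddof=1)
y_std = (y - y.mean()) / y.std(ddof=1)

print(np.isclose(np.cov(x_std, y_std)[0, 1], np.corrcoef(x, y)[0, 1]))  # True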


Install SerpentAI on Windows 10

Posted on 2018-06-07

Python Environment

Python 3.6+ (with Anaconda)

Serpent.AI was developed taking full advantage of Python 3.6 so it is only natural that the Python requirement be for versions 3.6 and up.

Installing regular Python 3.6+ isn’t exactly difficult but Serpent.AI relies on a good amount of scientific computing libraries that are extremely difficult / impossible to compile on your own on Windows. Thankfully, the Anaconda Distribution exists and takes this huge weight off our collective shoulders.

Installing Anaconda 5.2.0 (Python 3.6)

Download the Python 3.6 version of Anaconda 5.2.0 and run the graphical installer.

The following commands are to be performed in an Anaconda Prompt with elevated privileges (Right click and Run as Administrator). It is recommended to create a shortcut to this prompt because every Python and Serpent command will have to be performed from there starting now.

Creating a Conda Env for Serpent.AI

conda create --name serpent python=3.6 (‘serpent’ can be replaced with another name)

Creating a directory for your Serpent.AI projects

mkdir SerpentAI && cd SerpentAI

Activating the Conda Env

conda activate serpent

3rd-Party Dependencies

Redis

Redis is used in the framework as the in-memory store for the captured frame buffers as well as the temporary storage of analytics events. Redis is not officially supported on Windows; Microsoft used to maintain a port, but it has since been abandoned. That said, the Microsoft port is sufficient for the framework and outperforms alternatives such as running Redis inside WSL on Windows 10. It will install as a Windows Service. Make sure you set it to start automatically.

Install Windows Subsystem for Linux (WSL)

  1. From Start, search for Turn Windows features on or off (type turn)
  2. Select Windows Subsystem for Linux (beta)


Once installed you can run bash on Ubuntu by typing bash from a Windows Command Prompt. To install the latest version of Redis we’ll need to use a repository that maintains up-to-date packages for Ubuntu and Debian servers like https://www.dotdeb.org which you can add to Ubuntu’s apt-get sources with:

$ echo deb http://packages.dotdeb.org wheezy all >> dotdeb.org.list
$ echo deb-src http://packages.dotdeb.org wheezy all >> dotdeb.org.list
$ sudo mv dotdeb.org.list /etc/apt/sources.list.d
$ wget -q -O - http://www.dotdeb.org/dotdeb.gpg | sudo apt-key add -

Then after updating our APT cache we can install Redis with:

$ sudo apt-get update
$ sudo apt-get install redis-server

You’ll then be able to launch redis with:

$ redis-server --daemonize yes

This will run Redis in the background, freeing your shell so you can play with it using the Redis client:

$ redis-cli
127.0.0.1:6379> SET foo bar
OK
127.0.0.1:6379> GET foo
"bar"

You can connect to it from within bash, or from your Windows desktop using the native Windows redis-cli binary from MSOpenTech.

Build Tools for Visual Studio 2017

Some of the packages that will be installed alongside Serpent.AI are not pre-compiled binaries and will need to be built from source. This is a little more problematic for Windows, but with the correct C++ Build Tools for Visual Studio it all goes down smoothly.

You can get the proper installer by visiting https://www.visualstudio.com/downloads/ and scrolling down to the Build Tools for Visual Studio 2017 download. Download, run, select the Visual C++ build tools section and make sure the following components are checked (the full Visual Studio IDE is not installed):

  • Visual C++ Build Tools core features
  • VC++ 2017 version 15.7 v14.14 latest v141 tools
  • Visual C++ 2017 Redistributable Update
  • VC++ 2015.3 v14.00 (v140) toolset for desktop
  • Windows 10 SDK (10.0.17134.0)
  • Windows Universal CRT SDK

Installing Serpent.AI

Once all of the above has been installed and set up, you are ready to install the framework. Remember that PATH changes in Windows are not reflected in command prompts that were opened before you made the changes. Open a fresh Anaconda prompt before continuing to avoid installation issues.

Go back to the directory you created earlier for your Serpent.AI projects. Make sure you are scoped in your Conda Env.

Run pip install SerpentAI

Then run serpent setup to install the remaining dependencies automatically.

Installing Optional Modules

In the spirit of keeping the initial installation on the light side, some specialized / niche components with extra dependencies have been isolated from the core. It is recommended to only focus on installing them once you reach a point where you actually need them. The framework will provide a warning when a feature you are trying to use requires one of those modules.

OCR

A module to provide OCR functionality in your game agents.

Tesseract

Serpent.AI leverages Tesseract for its OCR functionality. You can install Tesseract for Windows by following these steps:

  1. Visit https://github.com/UB-Mannheim/tesseract/wiki
  2. Download the .exe for version 3
  3. Run the graphical installer (Remember the install path!)
  4. Add the path to tesseract.exe to your %PATH% environment variable

You can test your Tesseract installation by opening an Anaconda Prompt and executing tesseract --list-langs.

Installation

Once you’ve validated that Tesseract has been properly set up, you can install the module with serpent setup ocr

GUI

A module to allow Serpent.AI desktop app to run.

Kivy

Kivy is the GUI framework used in the framework.

Once you are ready to test Kivy, you can install the module with serpent setup gui and try to run serpent visual_debugger.


Matching Networks for One Shot Learning

Posted on 2018-06-02

By DeepMind crew: Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra

This is a paper on one-shot learning, where we’d like to learn a class based on very few (or indeed, 1) training examples. E.g. it suffices to show a child a single giraffe, not a few hundred thousand, before it can recognize more giraffes.

This paper falls into a category of “duh of course” kind of paper, something very interesting, powerful, but somehow obvious only in retrospect. I like it.

Suppose you’re given a single example of some class and would like to label it in test images.

  • Observation 1: a standard approach might be to train an Exemplar SVM for this one (or these few) examples vs. all the other training examples - i.e. a linear classifier. But this requires optimization.
  • Observation 2: known non-parametric alternatives (e.g. k-Nearest Neighbor) don’t suffer from this problem. E.g. I could immediately use a Nearest Neighbor to classify the new class without having to do any optimization whatsoever. However, NN is gross because it depends on an (arbitrarily-chosen) metric, e.g. L2 distance. Ew.
  • Core idea: let’s train a fully end-to-end nearest neighbor classifier!

The training protocol

As the authors amusingly point out in the conclusion (and this is the duh of course part), “one-shot learning is much easier if you train the network to do one-shot learning”. Therefore, we want the test-time protocol (given N novel classes with only k examples each (e.g. k = 1 or 5), assign new instances to one of the N classes) to exactly match the training-time protocol.

To create each “episode” of training from a dataset of examples then:

  1. Sample a task T from the training data, e.g. select 5 labels, and up to 5 examples per label (i.e. 5-25 examples).
  2. To form one episode sample a label set L (e.g. {cats, dogs}) and then use L to sample the support set S and a batch B of examples to evaluate loss on.

The idea at a high level is clear, but the writing here is a bit unclear on the details of exactly how the sampling is done.
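My reading of the protocol, as a rough sketch in code (the data layout, class/example counts, and function name here are my own assumptions, not the paper's):

import random

def sample_episode(data, n_classes=5, k_shot=1, batch_per_class=2):
    """data: dict mapping label -> list of examples (assumed layout)."""
    # Sample the label set L for this episode, e.g. 5 of the training labels
    label_set = random.sample(list(data.keys()), n_classes)

    support_set, batch = [], []
    for label in label_set:
        examples = random.sample(data[label], k_shot + batch_per_class)
        # k examples per label form the support set S ...
        support_set += [(x, label) for x in examples[:k_shot]]
        # ... and disjoint examples of the same labels form the batch B the loss is computed on
        batch += [(x, label) for x in examples[k_shot:]]

    random.shuffle(batch)
    return support_set, batch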

The model

I find the paper’s model description slightly wordy and unclear, but basically we’re building a differentiable nearest neighbor++. The output $\hat{y}$ for a test example $\hat{x}$ is computed very similarly to what you might see in nearest neighbors:
$$
\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i) y_i,
$$
where $a$ acts as a kernel, computing the extent to which $\hat{x}$ is similar to a training example $x_i$, and then the labels from the training examples ($y_i$) are weight-blended together accordingly. The paper doesn’t mention this, but I assume that for classification the $y_i$ would presumably be one-hot vectors.

Now, we’re going to embed both the training examples $x_i$ and the test example $\hat{x}$, and we’ll interpret their inner products (or here, a cosine similarity) as the “match”, and pass that through a softmax to get normalized mixing weights so they add up to 1. No surprises here, this is quite natural:
$$
a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}), g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}), g(x_j))}}
$$
Here $c()$ is the cosine distance, which I presume is implemented by normalizing the two input vectors to have unit L2 norm and taking a dot product. I assume the authors tried skipping the normalization too and it did worse? Anyway, now all that’s left to define is the function $f$ (i.e. how do we embed the test example into a vector) and the function $g$ (i.e. how do we embed each training example into a vector?).
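As a sanity check on that attention step, here is a minimal numpy sketch of "cosine similarity followed by a softmax" (function and variable names are my own; the embeddings are assumed to be computed already):

import numpy as np

def attention_weights(f_xhat, g_xs):
    """f_xhat: (D,) embedded test example; g_xs: (k, D) embedded support examples."""
    # Cosine similarity: normalize both sides to unit L2 norm, then dot product
    f_n = f_xhat / np.linalg.norm(f_xhat)
    g_n = g_xs / np.linalg.norm(g_xs, axis=1, keepdims=True)
    sims = g_n @ f_n
    # Softmax over the support set gives mixing weights that sum to 1
    e = np.exp(sims - sims.max())
    return e / e.sum()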

Embedding the training examples. This (the function $g$) is a bidirectional LSTM over the examples.

That is, the encoding of the i’th example $x_i$ is a function of its “raw” embedding $g'(x_i)$ and the embeddings of its friends, communicated through the bidirectional network’s hidden states. In other words, each training example is a function of not just itself but of all of its friends in the set. This is part of the ++ above, because in a normal nearest neighbor you wouldn’t change the representation of an example as a function of the other data points in the training set.

It’s odd that the order is not mentioned, I assume it’s random? This is a bit gross because order matters to a bidirectional LSTM; you’d get different embeddings if you permute the examples.

Embedding the test example. This (the function $f$) is an LSTM that processes for a fixed number of steps (K time steps) and at each step also attends over the examples in the training set. The encoding is the last hidden state of the LSTM. Again, this way we’re allowing the network to change its encoding of the test example as a function of the training examples. Nifty.

That looks scary at first, but it’s really just a vanilla LSTM with attention, where the input at each time step is constant ($f'(\hat{x})$, an encoding of the test example all by itself) and the hidden state is a function of the previous hidden state but also of a concatenated readout vector $r$, which we obtain by attending over the encoded training examples (encoded with $g$ from above).

Oh, and I assume there is a typo in equation (5); it should say $r_k = \dots$ without the $-1$ on the LHS.

Experiments

Task: the N-way k-shot learning task. I.e. we’re given k (e.g. 1 or 5) labelled examples for each of N classes that we have not previously trained on and asked to classify new instances into the N classes.

Baselines: an “obvious” strategy of using a pretrained ConvNet and doing nearest neighbor based on the codes. An option of finetuning the network on the new examples as well (requires training and careful and strong regularization!).

MANN of Santoro et al. [21]: Also a DeepMind paper, a fun NTM-like Meta-Learning approach that is fed a sequence of examples and asked to predict their labels.

Siamese network of Koch et al. [11]: A siamese network that takes two examples and predicts whether they are from the same class or not with logistic regression. A test example is labeled with a nearest neighbor: with the class it matches best according to the siamese net (requires iteration over all training examples one by one). Also, this approach is less end-to-end than the one here because it requires the ad-hoc nearest neighbor matching, while here the exact end task is optimized for. It’s beautiful.

Omniglot experiments


Omniglot of Lake et al. [14] is an MNIST-like scribbles dataset with 1623 characters and 20 examples each.

The image encoder is a CNN with 4 modules of [3x3 CONV 64 filters, batchnorm, ReLU, 2x2 max pool]. The image is claimed to be resized from the original 28x28 down to 1x1x64, which doesn’t make sense because downsampling by a factor of 2 four times is a reduction of 16, and 28/16 is a non-integer greater than 1. I’m assuming they use VALID convs?

Results:

Matching nets do best. Fully Conditional Embeddings (FCE), by which I mean the “Full Context Embeddings” of Section 2.1.2, are not used here and are mentioned not to work much better. Finetuning helps a bit on the baselines but not with matching nets (weird).

The comparisons in this table are somewhat confusing:

  • I can’t find the MANN numbers of 82.8% and 94.9% in their paper [21]; not clear where they come from. E.g. for 5 classes and 5-shot they seem to report 88.4% not 94.9% as seen here. I must be missing something.
  • I also can’t find the numbers reported here in the Siamese Net [11] paper. As far as I can tell in their Table 2 they report one-shot accuracy, 20-way classification to be 92.0, while here it is listed as 88.1%?
  • The results of Lake et al. [14] who proposed Omniglot are also missing from the table. If I’m understanding this correctly they report 95.2% on 1-shot 20-way, while matching nets here show 93.8%, and humans are estimated at 95.5%. That is, the results here appear weaker than those of Lake et al., but one should keep in mind that the method here is significantly more generic and does not make any assumptions about the existence of strokes, etc., and it’s a simple, single fully-differentiable blob of neural stuff.

(skipping ImageNet/LM experiments as there are few surprises)

Conclusions

Good paper, effectively develops a differentiable nearest neighbor trained end-to-end. It’s something new, I like it!

A few concerns:

  • A bidirectional LSTM (not order-invariant computation) is applied over the set of training examples to encode them. The authors don’t talk about the order actually used, which presumably is random, or mention this potentially unsatisfying feature. This could be solved by using a recurrent attentional mechanism instead, as the authors are certainly aware of and as has been discussed at length in ORDER MATTERS: SEQUENCE TO SEQUENCE FOR SETS, where Oriol is also the first author. I wish there was a comment on this point in the paper somewhere.

  • The approach also gets quite a bit slower as the number of training examples grows, but once this number is large one would presumably switch over to a parametric approach.

  • It’s also potentially concerning that during training the method uses a specific number of examples, e.g. 5-25, so this is the number that must also be used at test time. What happens if we want the size of our training set to grow online? It appears that we need to retrain the network because the encoder LSTM for the training data is not “used to” seeing inputs of more examples? That is, unless you fall back to iteratively subsampling the training data, doing multiple inference passes and averaging, or something like that. If we don’t use FCE, it can still be that the attention-mechanism LSTM is not “used to” attending over many more examples, but it’s not clear how much this matters. An interesting experiment would be to not use FCE and try to use 100 or 1000 training examples, while only training on up to 25 (with and without FCE). Discussion surrounding this point would be interesting.

  • Not clear what happened with the Omniglot experiments, with incorrect numbers for [11], [21], and the exclusion of Lake et al. [14] comparison.

  • A baseline that is missing would in my opinion also include training of an Exemplar SVM, which is a much more powerful approach than encode-with-a-cnn-and-nearest-neighbor.

    ​


One Shot Learning and Siamese Networks in Keras

Posted on 2018-06-02

Background:

Conventional wisdom says that deep neural networks are really good at learning from high dimensional data like images or spoken language, but only when they have huge amounts of labelled examples to train on. Humans on the other hand, are capable of one-shot learning - if you take a human who’s never seen a spatula before, and show them a single picture of a spatula, they will probably be able to distinguish spatulas from other kitchen utensils with astoundingly high precision.


Never been inside a kitchen before? Now’s your chance to test your one-shot learning ability! Which of the images on the right is of the same type as the big image? Email me for the correct answer.

Yet another one of the things humans can do that seemed trivial to us right up until we tried to make an algorithm do it.

This ability to rapidly learn from very little data seems like it’s obviously desirable for machine learning systems to have because collecting and labelling data is expensive. I also think this is an important step on the long road towards general intelligence.

Recently there have been many interesting papers about one-shot learning with neural nets and they’ve gotten some good results. This is a new area that really excites me, so I wanted to make a gentle introduction to make it more accessible to fellow newcomers to deep learning.

In this post, I want to:

  • Introduce and formulate the problem of one-shot learning
  • Describe benchmarks for one-shot classification and give a baseline for performance
  • Give an example of deep one-shot learning by partially reimplementing the model in this paper with keras.
  • Hopefully point out some small insights that aren’t obvious to everyone

Formulating the Problem - N-way One-Shot Learning

Before we try to solve any problem, we should first precisely state what the problem actually is, so here is the problem of one-shot classification expressed symbolically:

Our model is given a tiny labelled training set $S$, which has $N$ examples, each a vector of the same dimension with a distinct label $y$:
$$
S = \{(x_1, y_1), \dots, (x_N, y_N)\}
$$
It is also given $\hat{x}$, the test example it has to classify. Since exactly one example in the support set has the right class, the aim is to correctly predict which $y \in S$ is the same as $\hat{x}$’s label, $\hat{y}$.

There are fancier ways of defining the problem, but this one is ours. Here are some things to make note of:

  • Real world problems might not always have the constraint that exactly one image has the correct class
  • It’s easy to generalize this to k-shot learning by having k examples for each $y_i$ rather than just one.
  • When N is higher, there are more possible classes that $\hat{x}$ can belong to, so it’s harder to predict the correct one.
  • Random guessing will average $\frac{100}{N}\%$ accuracy.

Here are some examples of one-shot learning tasks on the Omniglot dataset, which I’ll describe in the next section.

9, 25 and 36 way one-shot learning tasks.

About the data - Omniglot! :

The Omniglot dataset is a collection of 1623 hand drawn characters from 50 alphabets. For every character there are just 20 examples, each drawn by a different person at resolution 105x105.

A few of the alphabets from the Omniglot dataset. As you can see, there’s a huge variety of different symbols.

If you like machine learning, you’ve probably heard of the MNIST dataset. Omniglot is sometimes referred to as the transpose of mnist, since it has 1623 types of character with only 20 examples each, in contrast to MNIST having thousands of examples for only 10 digits. There is also data about the strokes used to create each character, but we won’t be using that. Usually, it’s split into 30 training alphabets and 20 evaluation alphabets. All those different characters make for lots of possible one-shot tasks, so it’s a really good benchmark for one-shot learning algorithms.

A One-Shot Learning Baseline / 1 Nearest Neighbour

The simplest way of doing classification is with k-nearest neighbours, but since there is only one example per class we have to do 1 nearest neighbour. This is very simple, just calculate the Euclidean distance of the test example from each training example and pick the closest one:
$$
C(\hat{x}) = {\arg \min}_{c \in S} \| \hat{x} - x_c \|
$$
According to Koch et al., 1-NN gets ~28% accuracy on 20-way one-shot classification on Omniglot. 28% doesn’t sound great, but it’s nearly six times more accurate than random guessing (5%). This is a good baseline or “sanity check” to compare future one-shot algorithms with.
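A minimal sketch of that baseline (the array layout is my own assumption: the support set as an (N, D) matrix with one row per class):

import numpy as np

def one_nn_predict(support_X, support_y, x_hat):
    """support_X: (N, D) array, one example per class; support_y: (N,) labels; x_hat: (D,) test example."""
    # Euclidean distance from the test example to every support example
    dists = np.linalg.norm(support_X - x_hat, axis=1)
    # Predict the label of the closest support example
    return support_y[np.argmin(dists)]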

Hierarchical Bayesian Program Learning from Lake et al gets 95.2% - very impressive! The ~30% of this paper which I understood was very interesting. Comparing it with deep learning results that train on raw pixels is kind of “apples and oranges” though, because:

  1. HBPL used data about the strokes, not just the raw pixels
  2. HBPL on omniglot involved learning a generative model for strokes. The algorithm requires data with more complicated annotation, so unlike deep learning it can’t easily be tweaked to one-shot learn from raw pixels of dogs/trucks/brain scans/spatulas and other objects that aren’t made up of brushstrokes.

Lake et al also says that humans get 95.5% accuracy in 20 way classification on omniglot, only beating HBPL by a tiny margin. In the spirit of nullius in verba, I tried testing myself on the 20 way tasks and managed to average 97.2%. I wasn’t always doing true one-shot learning though - I saw several symbols I recognised, since I’m familiar with the greek alphabet, hiragana and katakana. I removed those alphabets and tried again but still managed 96.7%. My hypothesis is that having to read my own terrible handwriting has endowed me with superhuman symbol recognition ability.

Ways to use deep networks for one shot learning?!

If we naively train a neural network on a one-shot task as a vanilla cross-entropy-loss softmax classifier, it will severely overfit. Heck, even if it were hundred-shot learning, a modern neural net would still probably overfit. Big neural networks have millions of parameters to adjust to their data and so they can learn a huge space of possible functions. (More formally, they have a high VC dimension, which is part of why they do so well at learning from complex data with high dimensionality.) Unfortunately, this strength also appears to be their undoing for one-shot learning. When there are millions of parameters to gradient descend upon, and a staggeringly huge number of possible mappings that can be learned, how can we make a network learn one that generalizes when there’s just a single example to learn from?

It’s easier for humans to one-shot learn the concept of a spatula or the letter Θ because they have spent a lifetime observing and learning from similar objects. It’s not really fair to compare the performance of a human who’s spent a lifetime having to classify objects and symbols with that of a randomly initialized neural net, which imposes a very weak prior about the structure of the mapping to be learned from the data. This is why most of the one-shot learning papers I’ve seen take the approach of knowledge transfer from other tasks.

Neural nets are really good at extracting useful features from structurally complex/high dimensional data, such as images. If a neural network is given training data that is similar to (but not the same as) that in the one-shot task, it might be able to learn useful features which can be used in a simple learning algorithm that doesn’t require adjusting these parameters. It still counts as one-shot learning as long as the training examples are of different classes to the examples used for one-shot testing.

(NOTE: Here a feature means a “transformation of the data that is useful for learning”.)

So now an interesting problem is how do we get a neural network to learn the features? The most obvious way of doing this (if there’s labelled data) is just vanilla transfer learning - train a softmax classifier on the training set, then fine-tune the weights of the last layer on the support set of the one-shot task. In practice, neural net classifiers don’t work too well for data like omniglot where there are few examples per class, and even fine tuning only the weights in the last layer is enough to overfit the support set. Still works quite a lot better than L2 distance nearest neighbour though! (See Matching Networks for One Shot learning for a comparison table of various deep one-shot learning methods and their accuracy.)

There’s a better way of doing it though! Remember 1 nearest neighbour? This simple, non-parametric one-shot learner just classifies the test example with the same class as whatever support example is the closest in L2 distance. This works OK, but L2 distance suffers from the ominous-sounding curse of dimensionality and so won’t work well for data with thousands of dimensions like Omniglot. Also, if you have two nearly identical images and move one over a few pixels to the right, the L2 distance can go from being almost zero to being really high. L2 distance is a metric that is just woefully inadequate for this task. Deep learning to the rescue? We can use a deep convolutional network to learn some kind of similarity function that a non-parametric classifier like nearest neighbor can use.

Siamese networks

I originally planned to have craniopagus conjoined twins as the accompanying image for this section, but ultimately decided that siamese cats would go over better.

This wonderful paper is what I will be implementing in this tutorial. Koch et al’s approach to getting a neural net to do one-shot classification is to give it two images and train it to guess whether they have the same category. Then when doing a one-shot classification task described above, the network can compare the test image to each image in the support set, and pick which one it thinks is most likely to be of the same category. So we want a neural net architecture that takes two images as input and outputs the probability they share the same class.

Say $x_1$ and $x_2$ are two images in our dataset, and let $x_1 \circ x_2$ mean “$x_1$ and $x_2$ are images with the same class”. Note that $x_1 \circ x_2$ is the same as $x_2 \circ x_1$ - this means that if we reverse the order of the inputs to the neural network, the output should be the same: $p(x_1 \circ x_2)$ should equal $p(x_2 \circ x_1)$. This property is called symmetry, and siamese nets are designed around having it.

Symmetry is important because it’s required for learning a distance metric - the distance from $x_1$ to $x_2$ should equal the distance from $x_2$ to $x_1$.

If we just concatenate two examples together and use them as a single input to a neural net, each example will be matrix-multiplied (or convolved) with a different set of weights, which breaks symmetry. Sure, it’s possible it will eventually manage to learn the exact same weights for each input, but it would be much easier to learn a single set of weights applied to both inputs. So we could propagate both inputs through identical twin neural nets with shared parameters, then use the absolute difference as the input to a linear classifier - this is essentially what a siamese net is. Two identical twins, joined at the head, hence the name.

Network architecture

Unfortunately, properly explaining how and why convolutional neural nets work would make this post twice as long. If you want to understand how convnets work, I suggest checking out cs231n and then colah. For any non-DL people who are reading this, the best summary I can give of a CNN is this: an image is a 3D array of pixels. A convolutional layer is where you have a neuron connected to a tiny subgrid of pixels or neurons, and use copies of that neuron across all parts of the image/block to make another 3D array of neuron activations. A max pooling layer makes a block of activations spatially smaller. Lots of these stacked on top of one another can be trained with gradient descent and are really good at learning from images.

I’m going to describe the architecture pretty briefly because it’s not the important part of the paper. Koch et al. use a convolutional siamese network to classify pairs of Omniglot images, so the twin networks are both convolutional neural nets (CNNs). The twins each have the following architecture: convolution with 64 10x10 filters, relu -> max pool -> convolution with 128 7x7 filters, relu -> max pool -> convolution with 128 4x4 filters, relu -> max pool -> convolution with 256 4x4 filters. The twin networks reduce their inputs down to smaller and smaller 3D tensors; finally there is a fully connected layer with 4096 units. The absolute difference between the two vectors is used as input to a linear classifier. All up, the network has 38,951,745 parameters - 96% of which belong to the fully connected layer. This is quite a lot, so the network has high capacity to overfit, but as I show below, pairwise training means the dataset size is huge, so this won’t be a problem.

Hastily made architecture diagram.

The output is squashed into [0,1] with a sigmoid function to make it a probability. We use the target $t=1$ when the images have the same class and $t=0$ for a different class. It’s trained with logistic regression. This means the loss function should be the binary cross entropy between the predictions and targets. There is also an L2 weight decay term in the loss to encourage the network to learn smaller/less noisy weights and possibly improve generalization:
$$
L(x_1, x_2, t) = -t \cdot \log(p(x_1 \circ x_2)) - (1 - t) \cdot \log(1 - p(x_1 \circ x_2)) + \lambda \cdot \|w\|^2
$$
When it does a one-shot task, the siamese net simply classifies the test image as whatever image in the support set it thinks is most similar to the test image:
$$
C(\hat{x}, S) = {\arg \max}_{c} P(\hat{x} \circ x_c), \quad x_c \in S
$$
This uses an argmax unlike nearest neighbour, which uses an argmin, because a metric like L2 is higher the more “different” the examples are, while this model outputs $p(x_1 \circ x_2)$, so we want the highest value. This approach has one flaw that’s obvious to me: for any $x_a$ in the support set, the probability $p(\hat{x} \circ x_a)$ is independent of every other example in the support set! This means the probabilities won’t sum to 1 and that important information is ignored, namely that the test image will be of the same type as exactly one $x \in S$…

Observation: effective dataset size in pairwise training

EDIT: After discussing this with a PhD student at UoA, I think this bit might be overstated or even just wrong. Empirically, my implementation did overfit, even though it wasn’t trained for enough iterations to sample every possible pair, which kind of contradicts this section. I’m leaving it up in the spirit of being wrong loudly.

One cool thing I noticed about training on pairs is that there are quadratically many possible pairs of images to train the model on, making it hard to overfit. Say we have $E$ examples each of $C$ classes. Since there are $C \cdot E$ images in total, the total number of possible pairs is given by
$$
N_{\text{pairs}} = \binom{C \cdot E}{2} = \frac{(C \cdot E)!}{2! \, (C \cdot E - 2)!}
$$
For Omniglot with its 20 examples of 964 training classes, this leads to 185,849,560 possible pairs, which is huge! However, the siamese network needs examples of both same-class and different-class pairs. There are $E$ examples per class, so there will be $\binom{E}{2}$ pairs for every class, which means there are $N_{\text{same}} = \binom{E}{2} \cdot C$ possible pairs with the same class - 183,160 pairs for Omniglot. Even though 183,160 example pairs is plenty, it’s only a thousandth of the possible pairs, and the number of same-class pairs increases quadratically with $E$ but only linearly with $C$. This is important because the siamese network should be given a 1:1 ratio of same-class and different-class pairs to train on - perhaps it implies that pairwise training is easier on datasets with lots of examples per class.
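Those two counts are easy to verify (a quick sketch; E and C as defined above for Omniglot):

from math import comb

E, C = 20, 964            # examples per class, number of training classes
print(comb(C * E, 2))     # 185849560 possible pairs
print(comb(E, 2) * C)     # 183160 same-class pairs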

The Code:

Prefer to just play with a jupyter notebook? I got you fam

Here is the model definition; it should be pretty easy to follow if you’ve seen keras before. I only define the twin network’s architecture once as a Sequential() model and then call it with respect to each of the two input layers; this way the same parameters are used for both inputs. Then I merge them together with the absolute difference, add an output layer, and compile the model with binary cross entropy loss.

from keras.layers import Input, Conv2D, Lambda, merge, Dense, Flatten, MaxPooling2D
from keras.models import Model, Sequential
from keras.regularizers import l2
from keras import backend as K
from keras.optimizers import SGD, Adam
from keras.losses import binary_crossentropy
import numpy.random as rng
import numpy as np
import os
import dill as pickle
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

def W_init(shape, name=None):
    """Initialize weights as in paper"""
    values = rng.normal(loc=0, scale=1e-2, size=shape)
    return K.variable(values, name=name)

#//TODO: figure out how to initialize layer biases in keras.
def b_init(shape, name=None):
    """Initialize bias as in paper"""
    values = rng.normal(loc=0.5, scale=1e-2, size=shape)
    return K.variable(values, name=name)

input_shape = (105, 105, 1)
left_input = Input(input_shape)
right_input = Input(input_shape)

#build convnet to use in each siamese 'leg'
convnet = Sequential()
convnet.add(Conv2D(64, (10,10), activation='relu', input_shape=input_shape,
                   kernel_initializer=W_init, kernel_regularizer=l2(2e-4)))
convnet.add(MaxPooling2D())
convnet.add(Conv2D(128, (7,7), activation='relu',
                   kernel_regularizer=l2(2e-4), kernel_initializer=W_init, bias_initializer=b_init))
convnet.add(MaxPooling2D())
convnet.add(Conv2D(128, (4,4), activation='relu', kernel_initializer=W_init,
                   kernel_regularizer=l2(2e-4), bias_initializer=b_init))
convnet.add(MaxPooling2D())
convnet.add(Conv2D(256, (4,4), activation='relu', kernel_initializer=W_init,
                   kernel_regularizer=l2(2e-4), bias_initializer=b_init))
convnet.add(Flatten())
convnet.add(Dense(4096, activation="sigmoid",
                  kernel_regularizer=l2(1e-3), kernel_initializer=W_init, bias_initializer=b_init))

#encode each of the two inputs into a vector with the convnet
encoded_l = convnet(left_input)
encoded_r = convnet(right_input)

#merge two encoded inputs with the l1 distance between them
#(note: `merge` with a mode function is the old Keras 1.x API; newer Keras versions use a Lambda layer instead)
L1_distance = lambda x: K.abs(x[0] - x[1])
both = merge([encoded_l, encoded_r], mode=L1_distance, output_shape=lambda x: x[0])
prediction = Dense(1, activation='sigmoid', bias_initializer=b_init)(both)
siamese_net = Model(input=[left_input, right_input], output=prediction)

#optimizer = SGD(0.0004, momentum=0.6, nesterov=True, decay=0.0003)
optimizer = Adam(0.00006)
#//TODO: get layerwise learning rates and momentum annealing scheme described in paper working
siamese_net.compile(loss="binary_crossentropy", optimizer=optimizer)

siamese_net.count_params()

The original paper used layerwise learning rates and momentum - I skipped this because it was kind of messy to implement in keras and the hyperparameters aren’t the interesting part of the paper. Koch et al. add examples to the dataset by distorting the images and run experiments with a fixed training set of up to 150,000 pairs. Since that won’t fit in my computer’s memory, I decided to just randomly sample pairs. Loading image pairs was probably the hardest part of this to implement. Since there were 20 examples for every class, I reshaped the data into N_classes x 20 x 105 x 105 arrays, to make it easier to index by category.

class Siamese_Loader:
    """For loading batches and testing tasks to a siamese net"""
    def __init__(self, Xtrain, Xval):
        self.Xval = Xval
        self.Xtrain = Xtrain
        self.n_classes, self.n_examples, self.w, self.h = Xtrain.shape
        self.n_val, self.n_ex_val, _, _ = Xval.shape

    def get_batch(self, n):
        """Create batch of n pairs, half same class, half different class"""
        categories = rng.choice(self.n_classes, size=(n,), replace=False)
        pairs = [np.zeros((n, self.h, self.w, 1)) for i in range(2)]
        targets = np.zeros((n,))
        targets[n//2:] = 1
        for i in range(n):
            category = categories[i]
            idx_1 = rng.randint(0, self.n_examples)
            pairs[0][i,:,:,:] = self.Xtrain[category, idx_1].reshape(self.w, self.h, 1)
            idx_2 = rng.randint(0, self.n_examples)
            #pick images of same class for 1st half, different for 2nd
            category_2 = category if i >= n//2 else (category + rng.randint(1, self.n_classes)) % self.n_classes
            pairs[1][i,:,:,:] = self.Xtrain[category_2, idx_2].reshape(self.w, self.h, 1)
        return pairs, targets

    def make_oneshot_task(self, N):
        """Create pairs of test image, support set for testing N way one-shot learning."""
        categories = rng.choice(self.n_val, size=(N,), replace=False)
        indices = rng.randint(0, self.n_ex_val, size=(N,))
        true_category = categories[0]
        ex1, ex2 = rng.choice(self.n_examples, replace=False, size=(2,))
        test_image = np.asarray([self.Xval[true_category, ex1, :, :]]*N).reshape(N, self.w, self.h, 1)
        support_set = self.Xval[categories, indices, :, :]
        support_set[0,:,:] = self.Xval[true_category, ex2]
        support_set = support_set.reshape(N, self.w, self.h, 1)
        pairs = [test_image, support_set]
        targets = np.zeros((N,))
        targets[0] = 1
        return pairs, targets

    def test_oneshot(self, model, N, k, verbose=0):
        """Test average N way oneshot learning accuracy of a siamese neural net over k one-shot tasks"""
        n_correct = 0
        if verbose:
            print("Evaluating model on {} unique {} way one-shot learning tasks ...".format(k, N))
        for i in range(k):
            inputs, targets = self.make_oneshot_task(N)
            probs = model.predict(inputs)
            if np.argmax(probs) == 0:
                n_correct += 1
        percent_correct = (100.0 * n_correct / k)
        if verbose:
            print("Got an average of {}% {} way one-shot learning accuracy".format(percent_correct, N))
        return percent_correct

...And now the training loop. Nothing unusual here, except that I monitor one-shot task accuracy on the validation set to measure performance, rather than the validation loss.
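
The loop assumes a loader has already been constructed from the stacked arrays; a minimal setup sketch (mine, with Xtrain built from the background set and Xval from the evaluation set as above) would be:

#Hypothetical setup for the loop below: build the batch/task loader and pull
#one batch to sanity-check the shapes.
loader = Siamese_Loader(Xtrain, Xval)
pairs, targets = loader.get_batch(32)  #two arrays of shape (32, 105, 105, 1)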

evaluate_every = 7000
loss_every = 300
batch_size = 32
N_way = 20
n_val = 550
siamese_net.load_weights("PATH")
best = 76.0
for i in range(900000):
    (inputs, targets) = loader.get_batch(batch_size)
    loss = siamese_net.train_on_batch(inputs, targets)
    if i % evaluate_every == 0:
        val_acc = loader.test_oneshot(siamese_net, N_way, n_val, verbose=True)
        if val_acc >= best:
            print("saving")
            siamese_net.save('PATH')
            best = val_acc
    if i % loss_every == 0:
        print("iteration {}, training loss: {:.2f},".format(i, loss))

Results

Once the learning curve flattened out, I used the weights that got the best validation 20-way accuracy for testing. My network averaged ~83% accuracy on tasks from the evaluation set, compared to 93% in the original paper. This difference is probably because I didn't implement many of the performance-enhancing tricks from the original paper, like layerwise learning rates/momentum, data augmentation with distortions and Bayesian hyperparameter optimization, and I also probably trained for fewer epochs. I'm not too worried about this because this tutorial was more about introducing one-shot learning in general than about squeezing the last few % of performance out of a classifier. There is no shortage of resources on that!

I was curious to see how accuracy varied over different values of N in N-way one-shot learning, so I plotted it, with comparisons to 1-nearest neighbour, random guessing, and training-set performance.

[Figure: N-way one-shot accuracy as a function of N for the siamese network, the 1-nearest-neighbour baseline, random guessing, and training-set performance.]
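
If you want to reproduce the baselines, the nearest-neighbour one is easy to compute with the same task generator; here is a minimal sketch (mine, not necessarily what was used for the plot), with random guessing being simply 1/N:

#Sketch: 1-nearest-neighbour baseline on the same N-way one-shot tasks,
#using raw pixel-wise L2 distance between the test image and each support image.
import numpy as np

def nearest_neighbour_correct(pairs, targets):
    """Return 1 if the support image closest to the test image is the true match."""
    test_image, support_set = pairs
    dists = np.sqrt(((support_set - test_image) ** 2).sum(axis=(1, 2, 3)))
    return 1 if np.argmin(dists) == np.argmax(targets) else 0

def test_nn_accuracy(N, k, loader):
    """Average 1-NN accuracy over k N-way one-shot tasks."""
    n_correct = 0
    for _ in range(k):
        pairs, targets = loader.make_oneshot_task(N)
        n_correct += nearest_neighbour_correct(pairs, targets)
    return 100.0 * n_correct / k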

As you can see, the network performs worse on tasks from the validation set than on the training set, especially for high values of N, so there must be some overfitting. It would be interesting to see how well traditional regularization methods like dropout work when the validation set is made of completely different classes than the training set. Still, it works better than I expected for large N, averaging above 65% accuracy on 50-60 way tasks.

Discussion

We’ve just trained a neural network to do same-different pairwise classification on symbols. More importantly, we’ve shown that it can then get reasonable accuracy at 20-way one-shot learning on symbols from unseen alphabets. Of course, this is not the only way to use deep networks for one-shot learning.

As I touched on earlier, I think a major flaw of this siamese approach is that it only compares the test image to every support image individually, when it should be comparing the test image to the support set as a whole. When the network compares the test image to any image $x_1$, the score $p(\hat{x} \circ x_1)$ is the same no matter what else is in the support set. This is silly. Say you’re doing a one-shot task and you see an image that looks similar to the test image. You should be much less confident that they have the same class if there is another image in the support set that also looks similar to the test image. The training objective is different from the test objective. It might work better to have a model that can compare the test image to the support set as a whole and use the constraint that only one support image has the same class.
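
The crudest patch I can think of (my own sketch, not something from the paper or tested here) is to normalise the pairwise scores over the support set at test time, so that confidence in any one match drops when another support image also scores highly; the argmax prediction is unchanged, but the confidences at least depend on the whole support set:

#Sketch: softmax the pairwise similarity scores over the support set so the
#confidences compete with each other. test_image is a single (105, 105, 1) array.
import numpy as np

def one_shot_probs(model, test_image, support_set):
    N = support_set.shape[0]
    test_batch = np.repeat(test_image[np.newaxis], N, axis=0)
    scores = model.predict([test_batch, support_set]).ravel()
    exp = np.exp(scores - scores.max())  #numerically stable softmax
    return exp / exp.sum()  #probabilities over the support set, summing to 1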

Matching Networks for One Shot Learning does exactly that. Rather than learning a similarity function, they have a deep model learn a full nearest-neighbour classifier end to end, training directly on one-shot tasks rather than on image pairs. Andrej Karpathy’s notes explain it much better than I can. Since you are learning a machine classifier, this can be seen as a kind of meta-learning. One-shot Learning with Memory-Augmented Neural Networks explores the connection between one-shot learning and meta-learning and trains a memory-augmented network on Omniglot, though I confess I had trouble understanding this paper.

What next?

The Omniglot dataset has been around since 2015, and already there are scalable ML algorithms getting within the ballpark of human-level performance on certain one-shot learning tasks. Hopefully one day it will be seen as a mere “sanity check” for one-shot classification algorithms, much like MNIST is for supervised learning now.

Image classification is cool, but I don’t think it’s the most interesting problem in machine learning. Now that we know deep one-shot learning can work pretty well, I think it would be cool to see attempts at one-shot learning for other, more exotic tasks.

Ideas from one-shot learning could be used for more sample-efficient reinforcement learning, especially for problems like OpenAI’s Universe, where there are lots of MDPs/environments that share similar visual features and dynamics. It would be cool to have an RL agent that could efficiently explore a new environment after learning in similar MDPs.

[Image: OpenAI’s World of Bits environments.]

One-shot Imitation Learning is one of my favourite one-shot learning papers. The goal is to have an agent learn a robust policy for solving a task from a single human demonstration of that task. This is done by:

  1. Having a neural net map from the current state and a sequence of states (the human demonstration) to an action
  2. Training it on pairs of human demonstrations on slightly different variants of the same task, with the goal of reproducing the second demonstration based on the first.

This strikes me as a really promising path to one day having broadly applicable, learning based robots!

Bringing one-shot learning to NLP tasks is a cool idea too. Matching Networks for One-Shot Learning has an attempt at one-shot language modelling, filling in a missing word in a test sentence given a small support set of sentences, and it seems to work pretty well. Exciting!

Conclusion

Anyway, thanks for reading! I hope you’ve managed to one-shot learn the concept of one-shot learning :) If not, I’d love to hear feedback or answer any questions you have!

Abracadabra

Anaconda uses socket proxy on Windows 10

Posted on 2018-05-17 | | Visitors

You need to create a .condarc file in your Windows user area:

C:\Users\<username>\

The file should contain (if you are using shadowsocks):

channels:
  - defaults
# Show channel URLs when displaying what is going to be downloaded and
# in 'conda list'. The default is False.
show_channel_urls: True
allow_other_channels: True
proxy_servers:
  http: socks5://127.0.0.1:1080
  https: socks5://127.0.0.1:1080
ssl_verify: False

Note that you cannot directly create a file whose name begins with a dot in Windows Explorer.

To create or rename such a file in Windows Explorer, rename it to .name. - the trailing dot is required, and Windows Explorer will remove it.

To create a new file that begins with a dot from the command prompt:

echo testing > .name
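
Once the file is in place, one way to check that conda picked up the settings (assuming a reasonably recent conda) is to print the relevant key:

conda config --show proxy_servers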