
Install SerpentAI on Windows 10

Posted on 2018-06-07

Python Environment

Python 3.6+ (with Anaconda)

Serpent.AI was developed taking full advantage of Python 3.6 so it is only natural that the Python requirement be for versions 3.6 and up.

Installing regular Python 3.6+ isn’t exactly difficult but Serpent.AI relies on a good amount of scientific computing libraries that are extremely difficult / impossible to compile on your own on Windows. Thankfully, the Anaconda Distribution exists and takes this huge weight off our collective shoulders.

Installing Anaconda 5.2.0 (Python 3.6)

Download the Python 3.6 version of Anaconda 5.2.0 and run the graphical installer.

The following commands are to be performed in an Anaconda Prompt with elevated privileges (Right click and Run as Administrator). It is recommended to create a shortcut to this prompt because every Python and Serpent command will have to be performed from there starting now.

Creating a Conda Env for Serpent.AI

conda create --name serpent python=3.6 (‘serpent’ can be replaced with another name)

Creating a directory for your Serpent.AI projects

mkdir SerpentAI && cd SerpentAI

Activating the Conda Env

conda activate serpent

3rd-Party Dependencies

Redis

Redis is used in the framework as the in-memory store for the captured frame buffers as well as the temporary storage of analytics events. Redis is not officially supported on Windows; Microsoft used to maintain a port, but it has since been abandoned. That said, the Microsoft port is sufficient for the framework and performs better than alternatives such as running Redis inside WSL on Windows 10. It installs as a Windows Service; make sure you set it to start automatically.
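If you go with the Windows service, a quick way to confirm it is running (assuming redis-cli is on your PATH, which the Microsoft installer normally takes care of) is:

redis-cli ping
PONG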

Install Windows Subsystem for Linux (WSL)

  1. From Start, search for Turn Windows features on or off (type turn)
  2. Select Windows Subsystem for Linux (beta)


Once installed, you can run Bash on Ubuntu by typing bash from a Windows Command Prompt. To install the latest version of Redis we’ll need to use a repository that maintains up-to-date packages for Ubuntu and Debian servers, like https://www.dotdeb.org, which you can add to Ubuntu’s apt-get sources with:

$ echo deb http://packages.dotdeb.org wheezy all >> dotdeb.org.list
$ echo deb-src http://packages.dotdeb.org wheezy all >> dotdeb.org.list
$ sudo mv dotdeb.org.list /etc/apt/sources.list.d
$ wget -q -O - http://www.dotdeb.org/dotdeb.gpg | sudo apt-key add -

Then after updating our APT cache we can install Redis with:

$ sudo apt-get update
$ sudo apt-get install redis-server

You’ll then be able to launch redis with:

$ redis-server --daemonize yes

This will run Redis in the background, freeing your shell so you can play with it using the Redis client:

$ redis-cli
127.0.0.1:6379> SET foo bar
OK
127.0.0.1:6379> GET foo
"bar"

You can connect to it from within Bash or from your Windows desktop using the native redis-cli Windows binary from MSOpenTech.

Build Tools for Visual Studio 2017

Some of the packages that will be installed alongside Serpent.AI are not pre-compiled binaries and will need to be built from source. This is a little more problematic on Windows, but with the correct C++ Build Tools for Visual Studio it all goes smoothly.

You can get the proper installer by visiting https://www.visualstudio.com/downloads/ and scrolling down to the Build Tools for Visual Studio 2017 download. Download, run, select the Visual C++ build tools section and make sure the following components are checked (the full Visual Studio IDEs are not needed, only the build tools):

  • Visual C++ Build Tools core features
  • VC++ 2017 version 15.7 v14.14 latest v141 tools
  • Visual C++ 2017 Redistributable Update
  • VC++ 2015.3 v14.00 (v140) toolset for desktop
  • Windows 10 SDK (10.0.17134.0)
  • Windows Universal CRT SDK

Installing Serpent.AI

Once all of the above has been installed and set up, you are ready to install the framework. Remember that PATH changes on Windows are not reflected in command prompts that were already open when you made the changes. Open a fresh Anaconda Prompt before continuing to avoid installation issues.

Go back to the directory you created earlier for your Serpent.AI projects. Make sure you are scoped in your Conda Env.

Run pip install SerpentAI

Then run serpent setup to install the remaining dependencies automatically.
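For reference, the full sequence from a fresh elevated Anaconda Prompt looks roughly like this (assuming the env name and project folder created earlier):

conda activate serpent
cd SerpentAI
pip install SerpentAI
serpent setup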

Installing Optional Modules

In the spirit of keeping the initial installation on the light side, some specialized / niche components with extra dependencies have been isolated from the core. It is recommended to only focus on installing them once you reach a point where you actually need them. The framework will provide a warning when a feature you are trying to use requires one of those modules.

OCR

A module to provide OCR functionality in your game agents.

Tesseract

Serpent.AI leverages Tesseract for its OCR functionality. You can install Tesseract for Windows by following these steps:

  1. Visit https://github.com/UB-Mannheim/tesseract/wiki
  2. Download the .exe for version 3
  3. Run the graphical installer (Remember the install path!)
  4. Add the path to tesseract.exe to your %PATH% environment variable

You can test your Tesseract installation by opening an Anaconda Prompt and executing tesseract --list-langs.
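For example, assuming a default install location (adjust the path to wherever you actually installed Tesseract), you could add it to the PATH of the current prompt session and test it like this:

set PATH=%PATH%;C:\Program Files (x86)\Tesseract-OCR
tesseract --list-langs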

Installation

Once you’ve validated that Tesseract has been properly set up, you can install the module with serpent setup ocr

GUI

A module that allows the Serpent.AI desktop app to run.

Kivy

Kivy is the GUI framework used by Serpent.AI.

Once you are ready to test Kivy, you can install the module with serpent setup gui and try running serpent visual_debugger.


Matching Networks for One Shot Learning

Posted on 2018-06-02

By DeepMind crew: Oriol Vinyals, Charles Blundell, Timothy Lillicrap, Koray Kavukcuoglu, Daan Wierstra

This is a paper on one-shot learning, where we’d like to learn a class based on very few (or indeed, 1) training examples. E.g. it suffices to show a child a single giraffe, not a few hundred thousand examples, before it can recognize more giraffes.

This paper falls into a category of “duh of course” kind of paper, something very interesting, powerful, but somehow obvious only in retrospect. I like it.

Suppose you’re given a single example of some class and would like to label it in test images.

  • Observation 1: a standard approach might be to train an Exemplar SVM for this one (or these few) examples vs. all the other training examples - i.e. a linear classifier. But this requires optimization.
  • Observation 2: known non-parametric alternatives (e.g. k-Nearest Neighbor) don’t suffer from this problem. E.g. I could immediately use a Nearest Neighbor to classify the new class without having to do any optimization whatsoever. However, NN is gross because it depends on an (arbitrarily-chosen) metric, e.g. L2 distance. Ew.
  • Core idea: let’s train a fully end-to-end nearest neighbor classifier!

The training protocol

As the authors amusingly point out in the conclusion (and this is the duh of course part), “one-shot learning is much easier if you train the network to do one-shot learning”. Therefore, we want the test-time protocol (given N novel classes with only k examples each (e.g. k = 1 or 5), predict new instances to one of N classes) to exactly match the training time protocol.

To create each “episode” of training from a dataset of examples then:

  1. Sample a task T from the training data, e.g. select 5 labels, and up to 5 examples per label (i.e. 5-25 examples).
  2. To form one episode sample a label set L (e.g. {cats, dogs}) and then use L to sample the support set S and a batch B of examples to evaluate loss on.

The idea at a high level is clear, but the writing here is a bit unclear on the details of exactly how the sampling is done.

The model

I find the paper’s model description slightly wordy and unclear, but basically we’re building a differentiable nearest neighbor++. The output \hat{y} for a test example \hat{x} is computed very similarly to what you might see in Nearest Neighbors:
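(Paraphrasing equation (1) of the paper, the prediction is a kernel-weighted blend of the support labels:)
$$
\hat{y} = \sum_{i=1}^{k} a(\hat{x}, x_i)\, y_i
$$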
where a acts as a kernel, computing the extent to which \hat{x} is similar to a training example x_i, and then the labels from the training examples (y_i) are weight-blended together accordingly. The paper doesn’t mention this but I assume for classification y_i would presumably be one-hot vectors.

Now, we’re going to embed both the training examples x_i and the test example \hat{x}, and we’ll interpret their inner products (or here a cosine similarity) as the “match”, and pass that through a softmax to get normalized mixing weights so they add up to 1. No surprises here, this is quite natural:

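(Paraphrasing the paper, the attention kernel is a softmax over cosine similarities between the embedded test and support examples:)
$$
a(\hat{x}, x_i) = \frac{e^{c(f(\hat{x}),\, g(x_i))}}{\sum_{j=1}^{k} e^{c(f(\hat{x}),\, g(x_j))}}
$$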
Here c() is cosine distance, which I presume is implemented by normalizing the two input vectors to have unit L2 norm and taking a dot product. I assume the authors tried skipping the normalization too and it did worse? Anyway, now all that’s left to define is the function f (i.e. how do we embed the test example into a vector) and the function g (i.e. how do we embed each training example into a vector?).

Embedding the training examples. This (the function g) is a bidirectional LSTM over the examples:

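(Roughly, from memory of the paper’s appendix: the embedding of each support example is its raw embedding plus the forward and backward hidden states of a bidirectional LSTM run over the raw embeddings of the whole support set:)
$$
g(x_i, S) = \overrightarrow{h}_i + \overleftarrow{h}_i + g'(x_i)
$$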

i.e. encoding of i’th example x_i is a function of its “raw” embedding g’(x_i) and the embedding of its friends, communicated through the bidirectional network’s hidden states. i.e. each training example is a function of not just itself but all of its friends in the set. This is part of the ++ above, because in a normal nearest neighbor you wouldn’t change the representation of an example as a function of the other data points in the training set.

It’s odd that the order is not mentioned, I assume it’s random? This is a bit gross because order matters to a bidirectional LSTM; you’d get different embeddings if you permute the examples.

Embedding the test example. This (the function f) is an LSTM that processes for a fixed number of steps (K time steps) and at each point also attends over the examples in the training set. The encoding is the last hidden state of the LSTM. Again, this way we’re allowing the network to change its encoding of the test example as a function of the training examples. Nifty:
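(Roughly, paraphrasing equations (2)-(5) of the paper: for k = 1...K the read LSTM takes the raw test embedding as its constant input, adds it to its hidden state, and attends over the encoded support set to form a readout r:)
$$
\hat{h}_k, c_k = \mathrm{LSTM}(f'(\hat{x}), [h_{k-1}, r_{k-1}], c_{k-1}), \qquad h_k = \hat{h}_k + f'(\hat{x})
$$
$$
r_{k-1} = \sum_{i=1}^{|S|} a(h_{k-1}, g(x_i))\, g(x_i), \qquad a(h_{k-1}, g(x_i)) = \mathrm{softmax}\big(h_{k-1}^{\top} g(x_i)\big)
$$
with the final encoding $f(\hat{x}, S) = h_K$.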

That looks scary at first but it’s really just a vanilla LSTM with attention where the input at each time step is constant (f’(\hat{x}), an encoding of the test example all by itself) and the hidden state is a function of previous hidden state but also a concatenated readout vector r, which we obtain by attending over the encoded training examples (encoded with g from above).

Oh and I assume there is a typo in equation (5), it should say r_k = … without the -1 on LHS.

Experiments

Task: N-way k-shot learning task. i.e. we’re given k (e.g. 1 or 5) labelled examples for N classes that we have not previously trained on and asked to classify new instances into the N classes.

Baselines: an “obvious” strategy of using a pretrained ConvNet and doing nearest neighbor based on the codes. An option of finetuning the network on the new examples as well (requires training and careful and strong regularization!).

MANN of Santoro et al. [21]: Also a DeepMind paper, a fun NTM-like Meta-Learning approach that is fed a sequence of examples and asked to predict their labels.

Siamese network of Koch et al. [11]: A siamese network that takes two examples and predicts whether they are from the same class or not with logistic regression. A test example is labeled with a nearest neighbor: with the class it matches best according to the siamese net (requires iteration over all training examples one by one). Also, this approach is less end-to-end than the one here because it requires the ad-hoc nearest neighbor matching, while here the exact end task is optimized for. It’s beautiful.

Omniglot experiments


Omniglot of Lake et al. [14] is a MNIST-like scribbles dataset with 1623 characters with 20 examples each.

Image encoder is a CNN with 4 modules of [3x3 CONV 64 filters, batchnorm, ReLU, 2x2 max pool]. The image is claimed to be resized this way from the original 28x28 down to 1x1x64, which doesn’t make sense because a factor-of-2 downsampling applied 4 times is a reduction of 16, and 28/16 is a non-integer greater than 1. I’m assuming they use VALID convs?

Results:

Matching nets do best. Fully Conditional Embeddings (FCE), by which I mean the “Full Context Embeddings” of Section 2.1.2, are not used here; they are mentioned to not work much better. Finetuning helps a bit on the baselines but not with Matching nets (weird).

The comparisons in this table are somewhat confusing:

  • I can’t find the MANN numbers of 82.8% and 94.9% in their paper [21]; not clear where they come from. E.g. for 5 classes and 5-shot they seem to report 88.4% not 94.9% as seen here. I must be missing something.
  • I also can’t find the numbers reported here in the Siamese Net [11] paper. As far as I can tell in their Table 2 they report one-shot accuracy, 20-way classification to be 92.0, while here it is listed as 88.1%?
  • The results of Lake et al. [14] who proposed Omniglot are also missing from the table. If I’m understanding this correctly they report 95.2% on 1-shot 20-way, while matching nets here show 93.8%, and humans are estimated at 95.5%. That is, the results here appear weaker than those of Lake et al., but one should keep in mind that the method here is significantly more generic and does not make any assumptions about the existence of strokes, etc., and it’s a simple, single fully-differentiable blob of neural stuff.

(skipping ImageNet/LM experiments as there are few surprises)

Conclusions

Good paper, effectively develops a differentiable nearest neighbor trained end-to-end. It’s something new, I like it!

A few concerns:

  • A bidirectional LSTM (not an order-invariant computation) is applied over the set of training examples to encode them. The authors don’t talk about the order actually used, which presumably is random, or mention this potentially unsatisfying feature. This can be solved by using a recurrent attentional mechanism instead, as the authors are certainly aware, and as has been discussed at length in ORDER MATTERS: SEQUENCE TO SEQUENCE FOR SETS, where Oriol is also the first author. I wish there was a comment on this point in the paper somewhere.

  • The approach also gets quite a bit slower as the number of training examples grows, but once this number is large one would presumably switch over to a parametric approach.

  • It’s also potentially concerning that during training the method uses a specific number of examples, e.g. 5-25, so this is the number that must also be used at test time. What happens if we want the size of our training set to grow online? It appears that we need to retrain the network because the encoder LSTM for the training data is not “used to” seeing inputs of more examples? That is unless you fall back to iteratively subsampling the training data, doing multiple inference passes and averaging, or something like that. If we don’t use FCE it can still be that the attention mechanism LSTM is not “used to” attending over many more examples, but it’s not clear how much this matters. An interesting experiment would be to not use FCE and try to use 100 or 1000 training examples, while only training on up to 25 (with and without FCE). Discussion surrounding this point would be interesting.

  • Not clear what happened with the Omniglot experiments, with incorrect numbers for [11], [21], and the exclusion of Lake et al. [14] comparison.

  • A baseline that is missing would in my opinion also include training of an Exemplar SVM, which is a much more powerful approach than encode-with-a-cnn-and-nearest-neighbor.



One Shot Learning and Siamese Networks in Keras

Posted on 2018-06-02

Background:

Conventional wisdom says that deep neural networks are really good at learning from high dimensional data like images or spoken language, but only when they have huge amounts of labelled examples to train on. Humans on the other hand, are capable of one-shot learning - if you take a human who’s never seen a spatula before, and show them a single picture of a spatula, they will probably be able to distinguish spatulas from other kitchen utensils with astoundingly high precision.

[Figure: Never been inside a kitchen before? Now’s your chance to test your one-shot learning ability! Which of the images on the right is of the same type as the big image? Email me for the correct answer.]

..Yet another one of the things humans can do that seemed trivial to us right up until we tried to make an algorithm do it.

This ability to rapidly learn from very little data seems like it’s obviously desirable for machine learning systems to have because collecting and labelling data is expensive. I also think this is an important step on the long road towards general intelligence.

Recently there have been many interesting papers about one-shot learning with neural nets and they’ve gotten some good results. This is a new area that really excites me, so I wanted to make a gentle introduction to make it more accessible to fellow newcomers to deep learning.

In this post, I want to:

  • Introduce and formulate the problem of one-shot learning
  • Describe benchmarks for one-shot classification and give a baseline for performance
  • Give an example of deep one-shot learning by partially reimplementing the model in this paper with keras.
  • Hopefully point out some small insights that aren’t obvious to everyone

Formulating the Problem - N-way One-Shot Learning

Before we try to solve any problem, we should first precisely state what the problem actually is, so here is the problem of one-shot classification expressed symbolically:

Our model is given a tiny labelled training set $S$, which has N examples, each a vector of the same dimension with a distinct label $y$:
$$
S = \{(x_1, y_1), \dots, (x_N, y_N)\}
$$
It is also given $\hat{x}$, the test example it has to classify. Since exactly one example in the support set has the right class, the aim is to correctly predict which $y \in S$ is the same as $\hat{x}$ ‘s label, $\hat{y}$.

There are fancier ways of defining the problem, but this one is ours. Here are some things to make note of:

  • Real world problems might not always have the constraint that exactly one image has the correct class
  • It’s easy to generalize this to k-shot learning by having there be k examples for each $y_i$ rather than just one.
  • When N is higher, there are more possible classes that $\hat{x}$ can belong to, so it’s harder to predict the correct one.
  • Random guessing will average $\frac{100}{N}\%$ accuracy.

Here are some examples of one-shot learning tasks on the Omniglot dataset, which I’ll describe in the next section.

[Figure: 9, 25 and 36 way one-shot learning tasks.]

About the data - Omniglot! :

The Omniglot dataset is a collection of 1623 hand drawn characters from 50 alphabets. For every character there are just 20 examples, each drawn by a different person at resolution 105x105.

[Figure: A few of the alphabets from the Omniglot dataset. As you can see, there’s a huge variety of different symbols.]

If you like machine learning, you’ve probably heard of the MNIST dataset. Omniglot is sometimes referred to as the transpose of mnist, since it has 1623 types of character with only 20 examples each, in contrast to MNIST having thousands of examples for only 10 digits. There is also data about the strokes used to create each character, but we won’t be using that. Usually, it’s split into 30 training alphabets and 20 evaluation alphabets. All those different characters make for lots of possible one-shot tasks, so it’s a really good benchmark for one-shot learning algorithms.

A One-Shot Learning Baseline / 1 Nearest Neighbour

The simplest way of doing classification is with k-nearest neighbours, but since there is only one example per class we have to do 1 nearest neighbour. This is very simple, just calculate the Euclidean distance of the test example from each training example and pick the closest one:
$$
C(\hat{x}) = {\arg\min}_{c \in S} \|\hat{x} - x_c\|
$$
According to Koch et al, 1-nn gets ~28% accuracy in 20 way one shot classification on omniglot. 28% doesn’t sound great, but it’s nearly six times more accurate than random guessing (5%). This is a good baseline or “sanity check” to compare future one-shot algorithms with.
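To make the baseline concrete, here is a minimal numpy sketch (my own illustration, not from Koch et al) of 1-nearest-neighbour one-shot classification over flattened images:

import numpy as np

def nearest_neighbour_oneshot(test_image, support_set):
    """support_set: (N, H, W) array, one image per candidate class.
    Returns the index of the support image closest in L2 distance."""
    x = test_image.reshape(-1).astype(np.float64)
    X = support_set.reshape(support_set.shape[0], -1).astype(np.float64)
    distances = np.sqrt(((X - x) ** 2).sum(axis=1))  # Euclidean distance to each support example
    return int(np.argmin(distances))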

Hierarchical Bayesian Program Learning from Lake et al gets 95.2% - very impressive! The ~30% of this paper which I understood was very interesting. Comparing it with deep learning results that train on raw pixels is kind of “apples and oranges” though, because:

  1. HBPL used data about the strokes, not just the raw pixels
  2. HBPL on omniglot involved learning a generative model for strokes. The algorithm requires data with more complicated annotation, so unlike deep learning it can’t easily be tweaked to one-shot learn from raw pixels of dogs/trucks/brain scans/spatulas and other objects that aren’t made up of brushstrokes.

Lake et al also says that humans get 95.5% accuracy in 20 way classification on omniglot, only beating HBPL by a tiny margin. In the spirit of nullius in verba, I tried testing myself on the 20 way tasks and managed to average 97.2%. I wasn’t always doing true one-shot learning though - I saw several symbols I recognised, since I’m familiar with the greek alphabet, hiragana and katakana. I removed those alphabets and tried again but still managed 96.7%. My hypothesis is that having to read my own terrible handwriting has endowed me with superhuman symbol recognition ability.

Ways to use deep networks for one shot learning?!

If we naively train a neural network on a one-shot task as a vanilla cross-entropy-loss softmax classifier, it will severely overfit. Heck, even if it was a hundred shot learning a modern neural net would still probably overfit. Big neural networks have millions of parameters to adjust to their data and so they can learn a huge space of possible functions. (More formally, they have a high VC dimension, which is part of why they do so well at learning from complex data with high dimensionality.) Unfortunately this strength also appears to be their undoing for one-shot learning. When there are millions of parameters to gradient descend upon, and a staggeringly huge number of possible mappings that can be learned, how can we make a network learn one that generalizes when there’s just a single example to learn from?

It’s easier for humans to one-shot learn the concept of a spatula or the letter Θ because they have spent a lifetime observing and learning from similar objects. It’s not really fair to compare the performance of a human who’s spent a lifetime having to classify objects and symbols with that of a randomly initialized neural net, which imposes a very weak prior about the structure of the mapping to be learned from the data. This is why most of the one-shot learning papers I’ve seen take the approach of knowledge transfer from other tasks.

Neural nets are really good at extracting useful features from structurally complex/high dimensional data, such as images. If a neural network is given training data that is similar to (but not the same as) that in the one-shot task, it might be able to learn useful features which can be used in a simple learning algorithm that doesn’t require adjusting these parameters. It still counts as one-shot learning as long as the training examples are of different classes to the examples used for one-shot testing.

(NOTE: Here a feature means a “transformation of the data that is useful for learning”.)

So now an interesting problem is how do we get a neural network to learn the features? The most obvious way of doing this (if there’s labelled data) is just vanilla transfer learning - train a softmax classifier on the training set, then fine-tune the weights of the last layer on the support set of the one-shot task. In practice, neural net classifiers don’t work too well for data like omniglot where there are few examples per class, and even fine tuning only the weights in the last layer is enough to overfit the support set. Still works quite a lot better than L2 distance nearest neighbour though! (See Matching Networks for One Shot learning for a comparison table of various deep one-shot learning methods and their accuracy.)

There’s a better way of doing it though! Remember 1 nearest neighbour? This simple, non-parametric one-shot learner just classifies the test example with the same class of whatever support example is the closest in L2 distance. This works ok, but L2 distance suffers from the ominous sounding curse of dimensionality and so won’t work well for data with thousands of dimensions like omniglot. Also, if you have two nearly identical images and move one over a few pixels to the right, the L2 distance can go from being almost zero to being really high. L2 distance is a metric that is just woefully inadequate for this task. Deep learning to the rescue? We can use a deep convolutional network to learn some kind of similarity function that a non-parametric classifier like nearest neighbour can use.

Siamese networks

[Figure: I originally planned to have craniopagus conjoined twins as the accompanying image for this section but ultimately decided that siamese cats would go over better.]

This wonderful paper is what I will be implementing in this tutorial. Koch et al’s approach to getting a neural net to do one-shot classification is to give it two images and train it to guess whether they have the same category. Then when doing a one-shot classification task described above, the network can compare the test image to each image in the support set, and pick which one it thinks is most likely to be of the same category. So we want a neural net architecture that takes two images as input and outputs the probability they share the same class.

Say x1 and x2 are two images in our dataset, and let x1∘x2 mean “x1 and x2 are images with the same class”. Note that x1∘x2 is the same as x2∘x1 - this means that if we reverse the order of the inputs to the neural network, the output should be the same - p(x1∘x2) should equal p(x2∘x1). This property is called symmetry and siamese nets are designed around having it.

Symmetry is important because it’s required for learning a distance metric - the distance from x1 to x2 should equal the distance x2 to x1.

If we just concatenate two examples together and use them as a single input to a neural net, each example will be matrix multiplied (or convolved) with a different set of weights, which breaks symmetry. Sure it’s possible it will eventually manage to learn the exact same weights for each input, but it would be much easier to learn a single set of weights applied to both inputs. So we could propagate both inputs through identical twin neural nets with shared parameters, then use the absolute difference as the input to a linear classifier - this is essentially what a siamese net is. Two identical twins, joined at the head, hence the name.

Network architecture

Unfortunately, properly explaining how and why convolutional neural nets work would make this post twice as long. If you want to understand how convnets work, I suggest checking out cs231n and then colah. For any non-dl people who are reading this, the best summary I can give of a CNN is this: An image is a 3D array of pixels. A convolutional layer is where you have a neuron connected to a tiny subgrid of pixels or neurons, and use copies of that neuron across all parts of the image/block to make another 3d array of neuron activations. A max pooling layer makes a block of activations spatially smaller. Lots of these stacked on top of one another can be trained with gradient descent and are really good at learning from images.

I’m going to describe the architecture pretty briefly because it’s not the important part of the paper. Koch et al uses a convolutional siamese network to classify pairs of omniglot images, so the twin networks are both convolutional neural nets (CNNs). The twins each have the following architecture: convolution with 64 10x10 filters, relu -> max pool -> convolution with 128 7x7 filters, relu -> max pool -> convolution with 128 4x4 filters, relu -> max pool -> convolution with 256 4x4 filters. The twin networks reduce their inputs down to smaller and smaller 3d tensors, and finally there is a fully connected layer with 4096 units. The absolute difference between the two vectors is used as input to a linear classifier. All up, the network has 38,951,745 parameters - 96% of which belong to the fully connected layer. This is quite a lot, so the network has high capacity to overfit, but as I show below, pairwise training means the dataset size is huge so this won’t be a problem.

[Figure: Hastily made architecture diagram.]

The output is squashed into [0,1] with a sigmoid function to make it a probability. We use the target t = 1 when the images have the same class and t = 0 for a different class. It’s trained with logistic regression, so the loss function is the binary cross entropy between the predictions and targets. There is also an L2 weight decay term in the loss to encourage the network to learn smaller/less noisy weights and possibly improve generalization:
$$
L(x_1, x_2, t) = -\,t \cdot \log(p(x_1 \circ x_2)) - (1 - t) \cdot \log(1 - p(x_1 \circ x_2)) + \lambda \cdot \|w\|^2
$$
When it does a one-shot task, the siamese net simply classifies the test image as whatever image in the support set it thinks is most similar to the test image:
$$
C(\hat{x}, S) = {\arg\max}_{c} P(\hat{x} \circ x_c), \quad x_c \in S
$$
This uses an argmax unlike nearest neighbour, which uses an argmin, because a metric like L2 is higher the more “different” the examples are, whereas this model outputs $p(x_1 \circ x_2)$, so we want the highest value. This approach has one flaw that’s obvious to me: for any $x_a$ in the support set, the probability $p(\hat{x} \circ x_a)$ is independent of every other example in the support set! This means the probabilities won’t sum to 1, and it ignores important information, namely that the test image will be the same type as exactly one x ∈ S…

Observation: effective dataset size in pairwise training

EDIT: After discussing this with a PhD student at UoA, I think this bit might be overstated or even just wrong. Empirically, my implementation did overfit, even though it wasn’t trained for enough iterations to sample every possible pair, which kind of contradicts this section. I’m leaving it up in the spirit of being wrong loudly.

One cool thing I noticed about training on pairs is that there are quadratically many possible pairs of images to train the model on, making it hard to overfit. Say we have E examples each of C classes. Since there are E⋅C images total, the total number of possible pairs is given by
$$
N_{pairs} = \binom{E \cdot C}{2} = \frac{(E \cdot C)!}{2!\,(E \cdot C - 2)!}
$$
For omniglot with its 20 examples of 964 training classes, this leads to 185,849,560 possible pairs, which is huge! However, the siamese network needs examples of both same and different class pairs. There are E examples per class, so there will be $\binom{E}{2}$ pairs for every class, which means there are $N_{same} = \binom{E}{2} \cdot C$ possible pairs with the same class - 183,160 pairs for omniglot. Even though 183,160 example pairs is plenty, it’s only a thousandth of the possible pairs, and the number of same-class pairs increases quadratically with E but only linearly with C. This is important because the siamese network should be given a 1:1 ratio of same-class and different-class pairs to train on - perhaps it implies that pairwise training is easier on datasets with lots of examples per class.
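As a quick sanity check on that arithmetic (a throwaway snippet, not from the original post):

E, C = 20, 964                                 # 20 examples per class, 964 training classes
n_images = E * C
total_pairs = n_images * (n_images - 1) // 2   # 185,849,560 possible pairs
same_class_pairs = E * (E - 1) // 2 * C        # 183,160 same-class pairs
print(total_pairs, same_class_pairs)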

The Code:

Prefer to just play with a jupyter notebook? I got you fam

Here is the model definition; it should be pretty easy to follow if you’ve seen Keras before. I only define the twin network’s architecture once as a Sequential() model and then call it with respect to each of two input layers; this way the same parameters are used for both inputs. Then I merge the two encodings with their absolute difference, add an output layer, and compile the model with binary cross entropy loss.

from keras.layers import Input, Conv2D, Lambda, merge, Dense, Flatten, MaxPooling2D
from keras.models import Model, Sequential
from keras.regularizers import l2
from keras import backend as K
from keras.optimizers import SGD, Adam
from keras.losses import binary_crossentropy
import numpy.random as rng
import numpy as np
import os
import dill as pickle
import matplotlib.pyplot as plt
from sklearn.utils import shuffle

def W_init(shape, name=None):
    """Initialize weights as in paper"""
    values = rng.normal(loc=0, scale=1e-2, size=shape)
    return K.variable(values, name=name)

#//TODO: figure out how to initialize layer biases in keras.
def b_init(shape, name=None):
    """Initialize bias as in paper"""
    values = rng.normal(loc=0.5, scale=1e-2, size=shape)
    return K.variable(values, name=name)

input_shape = (105, 105, 1)
left_input = Input(input_shape)
right_input = Input(input_shape)

# build convnet to use in each siamese 'leg'
convnet = Sequential()
convnet.add(Conv2D(64, (10, 10), activation='relu', input_shape=input_shape,
                   kernel_initializer=W_init, kernel_regularizer=l2(2e-4)))
convnet.add(MaxPooling2D())
convnet.add(Conv2D(128, (7, 7), activation='relu',
                   kernel_regularizer=l2(2e-4), kernel_initializer=W_init, bias_initializer=b_init))
convnet.add(MaxPooling2D())
convnet.add(Conv2D(128, (4, 4), activation='relu', kernel_initializer=W_init,
                   kernel_regularizer=l2(2e-4), bias_initializer=b_init))
convnet.add(MaxPooling2D())
convnet.add(Conv2D(256, (4, 4), activation='relu', kernel_initializer=W_init,
                   kernel_regularizer=l2(2e-4), bias_initializer=b_init))
convnet.add(Flatten())
convnet.add(Dense(4096, activation="sigmoid",
                  kernel_regularizer=l2(1e-3), kernel_initializer=W_init, bias_initializer=b_init))

# encode each of the two inputs into a vector with the convnet
encoded_l = convnet(left_input)
encoded_r = convnet(right_input)

# merge the two encoded inputs with the L1 distance between them
L1_distance = lambda x: K.abs(x[0] - x[1])
both = merge([encoded_l, encoded_r], mode=L1_distance, output_shape=lambda x: x[0])
prediction = Dense(1, activation='sigmoid', bias_initializer=b_init)(both)
siamese_net = Model(input=[left_input, right_input], output=prediction)

# optimizer = SGD(0.0004, momentum=0.6, nesterov=True, decay=0.0003)
optimizer = Adam(0.00006)
#//TODO: get layerwise learning rates and momentum annealing scheme described in paper working
siamese_net.compile(loss="binary_crossentropy", optimizer=optimizer)
siamese_net.count_params()

The original paper used layerwise learning rates and momentum - I skipped this because it was kind of messy to implement in Keras and the hyperparameters aren’t the interesting part of the paper. Koch et al adds examples to the dataset by distorting the images and runs experiments with a fixed training set of up to 150,000 pairs. Since that won’t fit in my computer’s memory, I decided to just randomly sample pairs. Loading image pairs was probably the hardest part of this to implement. Since there were 20 examples for every class, I reshaped the data into N_classes x 20 x 105 x 105 arrays, to make it easier to index by category.

class Siamese_Loader:
    """For loading batches and testing tasks to a siamese net"""
    def __init__(self, Xtrain, Xval):
        self.Xval = Xval
        self.Xtrain = Xtrain
        self.n_classes, self.n_examples, self.w, self.h = Xtrain.shape
        self.n_val, self.n_ex_val, _, _ = Xval.shape

    def get_batch(self, n):
        """Create batch of n pairs, half same class, half different class"""
        categories = rng.choice(self.n_classes, size=(n,), replace=False)
        pairs = [np.zeros((n, self.h, self.w, 1)) for i in range(2)]
        targets = np.zeros((n,))
        targets[n//2:] = 1  # second half of the batch are same-class pairs
        for i in range(n):
            category = categories[i]
            idx_1 = rng.randint(0, self.n_examples)
            pairs[0][i, :, :, :] = self.Xtrain[category, idx_1].reshape(self.w, self.h, 1)
            idx_2 = rng.randint(0, self.n_examples)
            # pick images of a different class for the 1st half, same class for the 2nd
            category_2 = category if i >= n//2 else (category + rng.randint(1, self.n_classes)) % self.n_classes
            pairs[1][i, :, :, :] = self.Xtrain[category_2, idx_2].reshape(self.w, self.h, 1)
        return pairs, targets

    def make_oneshot_task(self, N):
        """Create pairs of test image, support set for testing N way one-shot learning."""
        categories = rng.choice(self.n_val, size=(N,), replace=False)
        indices = rng.randint(0, self.n_ex_val, size=(N,))
        true_category = categories[0]
        ex1, ex2 = rng.choice(self.n_examples, replace=False, size=(2,))
        test_image = np.asarray([self.Xval[true_category, ex1, :, :]]*N).reshape(N, self.w, self.h, 1)
        support_set = self.Xval[categories, indices, :, :]
        support_set[0, :, :] = self.Xval[true_category, ex2]  # exactly one support image shares the class
        support_set = support_set.reshape(N, self.w, self.h, 1)
        pairs = [test_image, support_set]
        targets = np.zeros((N,))
        targets[0] = 1
        return pairs, targets

    def test_oneshot(self, model, N, k, verbose=0):
        """Test average N way oneshot learning accuracy of a siamese neural net over k one-shot tasks"""
        n_correct = 0
        if verbose:
            print("Evaluating model on {} unique {} way one-shot learning tasks ...".format(k, N))
        for i in range(k):
            inputs, targets = self.make_oneshot_task(N)
            probs = model.predict(inputs)
            if np.argmax(probs) == 0:
                n_correct += 1
        percent_correct = (100.0 * n_correct / k)
        if verbose:
            print("Got an average of {}% {} way one-shot learning accuracy".format(percent_correct, N))
        return percent_correct

..And now the training loop. Nothing unusual here, except that I monitor validation accuracy on one-shot tasks to test performance, rather than loss on the validation set.

evaluate_every = 7000  # interval (in batches) for evaluating one-shot accuracy
loss_every = 300       # interval for printing the training loss
batch_size = 32
N_way = 20             # how many classes for testing one-shot tasks
n_val = 550            # how many one-shot tasks to validate on
siamese_net.load_weights("PATH")  # resume from saved weights (replace PATH)
best = 76.0
for i in range(900000):
    (inputs, targets) = loader.get_batch(batch_size)  # loader is a Siamese_Loader built from the data
    loss = siamese_net.train_on_batch(inputs, targets)
    if i % evaluate_every == 0:
        val_acc = loader.test_oneshot(siamese_net, N_way, n_val, verbose=True)
        if val_acc >= best:
            print("saving")
            siamese_net.save('PATH')
            best = val_acc
    if i % loss_every == 0:
        print("iteration {}, training loss: {:.2f},".format(i, loss))

Results

Once the learning curve flattened out, I used the weights which got the best validation 20 way accuracy for testing. My network averaged ~83% accuracy for tasks from the evaluation set, compared to 93% in the original paper. Probably this difference is because I didn’t implement many of the performance-enhancing tricks from the original paper, like layerwise learning rates/momentum, data augmentation with distortions and Bayesian hyperparameter optimization, and I also probably trained for fewer epochs. I’m not too worried about this because this tutorial was more about introducing one-shot learning in general than about squeezing the last few % of performance out of a classifier. There is no shortage of resources on that!

I was curious to see how accuracy varied over different values of “N” in N way one shot learning, so I plotted it, with comparisons to 1 nearest neighbours, random guessing and training set performance.

[Figure: one-shot accuracy vs. N, compared to 1-nearest-neighbour, random guessing and training-set performance.]

As you can see, it performs worse on tasks from the validation set than the train set, especially for high values of N, so there must be overfitting. It would be interesting to see how well traditional regularization methods like dropout work when the validation set is made of completely different classes to the training set. It works better than I expected for large N, still averaging above 65% accuracy for 50-60 way tasks.

Discussion

We’ve just trained a neural network to do same-different pairwise classification on symbols. More importantly, we’ve shown that it can then get reasonable accuracy in 20 way one-shot learning on symbols from unseen alphabets. Of course, this is not the only way to use deep networks for one-shot learning.

As I touched on earlier, I think a major flaw of this siamese approach is that it only compares the test image to every support image individually, when it should be comparing it to the support set as a whole. When the network compares the test image to any image $x_1$, $p(\hat{x} \circ x_1)$ is the same no matter what else is in the support set. This is silly. Say you’re doing a one-shot task and you see an image that looks similar to the test image. You should be much less confident they have the same class if there is another image in the support set that also looks similar to the test image. The training objective is different to the test objective. It might work better to have a model that can compare the test image to the support set as a whole and use the constraint that only one support image has the same class.

Matching Networks for One Shot learning does exactly that. Rather than learning a similarity function, they have a deep model learn a full nearest neighbour classifier end to end, training directly on oneshot tasks rather than on image pairs. Andrej Karpathy’s notes explain it much better than I can. Since you are learning a machine classifier, this can be seen as a kind of meta-learning. One-shot Learning with Memory-Augmented Neural Networks explores the connection between one-shot learning and meta learning and trains a memory augmented network on omniglot, though I confess I had trouble understanding this paper.

What next?

The omniglot dataset has been around since 2015, and already there are scalable ML algorithms getting within the ballpark of human level performance on certain one-shot learning tasks. Hopefully one day it will be seen as a mere “sanity check” for one-shot classification algorithms much like MNIST is for supervised learning now.

Image classification is cool but I don’t think it’s the most interesting problem in machine learning. Now that we know deep one-shot learning can work pretty well, I think it would be cool to see attempts at one-shot learning for other, more exotic tasks.

Ideas from one-shot learning could be used for more sample efficient reinforcement learning, especially for problems like OpenAI’s Universe, where there are lots of MDPs/environments that have similar visual features and dynamics. - It would be cool to have an RL agent that could efficiently explore a new environment after learning in similar MDPs.

imageOpenAI’s world of bits environments.

One-shot Imitation Learning is one of my favourite one-shot learning papers. The goal is to have an agent learn a robust policy for solving a task from a single human demonstration of that task. This is done by:

  1. Having a neural net map from the current state and a sequence of states(the human demonstration) to an action
  2. Training it on pairs of human demonstrations on slightly different variants of the same task, with the goal of reproducing the second demonstration based on the first.

This strikes me as a really promising path to one day having broadly applicable, learning based robots!

Bringing one-shot learning to NLP tasks is a cool idea too. Matching Networks for One-Shot learning has an attempt at one-shot language modeling, filling a missing word in a test sentence given a small set of support sentences, and it seems to work pretty well. Exciting!

Conclusion

Anyway, thanks for reading! I hope you’ve managed to one-shot learn the concept of one-shot learning :) If not, I’d love to hear feedback or answer any questions you have!


Using a SOCKS proxy with Anaconda on Windows 10

Posted on 2018-05-17

You need to create a .condarc file in your Windows user directory:

C:\Users\<username>\

The file should contain (if you are using shadowsocks):

channels:
  - defaults
# Show channel URLs when displaying what is going to be downloaded and
# in 'conda list'. The default is False.
show_channel_urls: True
allow_other_channels: True
proxy_servers:
  http: socks5://127.0.0.1:1080
  https: socks5://127.0.0.1:1080
ssl_verify: False
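To check that conda actually picks these settings up, you can dump the effective configuration and then try a network operation through the proxy (standard conda commands, nothing specific to this setup):

conda config --show
conda search numpy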

Note that you cannot directly create a file whose name begins with a dot in Windows Explorer.

To create/rename it in Windows Explorer, just rename the file to .name. - the additional dot at the end is necessary and will be removed by Windows Explorer.

To create a new file that begins with a dot from the command prompt:

echo testing > .name

Paper Note: Selective Search for Object Recognition

Posted on 2018-05-03

I first came across Selective Search in the well-known object detection paper Rich feature hierarchies for accurate object detection and semantic segmentation, so this paper can be seen as preparation for reading R-CNN.

Although the title also mentions Object Recognition, the real novelty lies in Selective Search. So here I will only briefly introduce the idea and the procedure of Selective Search, and skip the Object Recognition part.

What is Selective Search?

Put simply, Selective Search finds the regions of an image where objects are likely to be.

[Figure: Selective Search proposals drawn as red boxes on an astronaut image]

In the astronaut image above, the red boxes are the regions that Selective Search proposes as possibly containing objects.

Before digging into how it works, let's first ask: how do we decide which regions belong to a single object?

[Figure: four example images from the paper]

The authors use the four images above to illustrate four possible situations:

  1. Image (a): objects may be hierarchically nested, e.g. a spoon inside a bowl;
  2. Image (b): the two cats can be separated by colour, but not by texture;
  3. Image (c): the chameleon can be distinguished from its surroundings by texture, but not by colour;
  4. Image (d): the wheels are part of the car, not because their colour or texture is similar, but because the wheels are attached to the car.

So no single feature is enough to localize objects; we have to combine several strategies, and that is the essence of Selective Search.

Issues to consider

Before learning the Selective Search algorithm, I had studied object (mainly face) detection methods in a computer vision course. The most conventional, and also most brute-force, approach is to scan the whole image row by row with rectangular windows of different sizes and decide from the features inside each window whether it contains the target object. The complexity of this approach is extremely high, which is why it is also called exhaustive search. In face detection, thanks to Haar features, the integral image proposed by Paul Viola and Michael Jones lets detection finish in reasonable time. But not every feature works with integral images; with neural networks in particular, the dynamic-programming idea behind integral images is of little use.

To address the shortcomings of the traditional approach, Selective Search improves on it from three angles:

  1. We cannot know the size of objects in advance, so traditional methods have to scan with windows of many different sizes to avoid missing anything. Selective Search solves this with a hierarchical algorithm;
  2. The time complexity of detection can be very high. Selective Search follows the principle that simple is beautiful: it only generates candidate object regions quickly and does not do the actual detection;
  3. In addition, as argued in the previous section, it uses several kinds of prior knowledge to make simple judgements about each region, avoiding useless search and improving both speed and accuracy.

Algorithm framework

[Figure: the hierarchical grouping algorithm from the paper]

The algorithm given in the paper is quite detailed; here is a brief translation.

  • Input: a colour image.
  • Output: likely object locations, in practice a large set of rectangle coordinates.
  • First, initialize the image into many small regions $R = \{r_1, \cdots, r_n\}$ using the method of Efficient Graph-Based Image Segmentation. Since our focus is Selective Search, I treat that algorithm as a black box.
  • Initialize a similarity set as the empty set: $S = \varnothing$.
  • Compute the similarity between every pair of neighbouring regions (the similarity function is analysed in detail below) and put them into S; each element of S is really a pair of regions together with their similarity.
  • Find the pair in S with the highest similarity, merge the two regions, and delete from S every similarity and pair involving either of them. Recompute the similarity between this new region and its neighbours, put those into S, and add the merged region to R. Repeat until S is empty.
  • Extract the bounding box of every region in R (the smallest rectangle enclosing the region); these boxes are the possible object locations.

In addition, for speed, the features of a newly merged region can be derived from its two parent regions, without having to loop over the new region's pixels again. These features are used when computing similarities. A rough sketch of the whole loop follows.
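To make the loop concrete, here is a rough Python sketch of the grouping procedure described above (segment, find_neighbours, similarity, merge and bbox are placeholders for the pieces discussed in this post, not an actual implementation):

def selective_search(image, segment, find_neighbours, similarity, merge, bbox):
    """Hierarchical grouping loop of Selective Search (conceptual sketch)."""
    R = segment(image)                              # initial small regions, treated as a black box
    S = {(ri, rj): similarity(ri, rj) for ri, rj in find_neighbours(R)}
    while S:
        ri, rj = max(S, key=S.get)                  # the most similar neighbouring pair
        rt = merge(ri, rj)                          # features derived from the two parents, no pixel pass
        # the new region inherits its parents' neighbours
        neighbours_of_rt = {a if b in (ri, rj) else b
                            for (a, b) in S if ri in (a, b) or rj in (a, b)} - {ri, rj}
        # remove every similarity involving ri or rj, then add similarities to the new region
        S = {pair: s for pair, s in S.items() if ri not in pair and rj not in pair}
        for rn in neighbours_of_rt:
            S[(rt, rn)] = similarity(rt, rn)
        R.append(rt)
    return [bbox(r) for r in R]                     # candidate object locations (bounding boxes)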

Similarity measures

The similarity measure directly determines the order in which regions are merged, and therefore the quality of the final result.

The paper compares eight colour spaces; in practice only one colour space (e.g. RGB) is chosen for the computation.

As stated at the beginning, we need to combine several kinds of information. The authors split the similarity measure into four sub-measures, called Complementary Similarity Measures. All four are normalized to the interval [0, 1].

1. Colour similarity $s_{color}(r_i, r_j)$

As mentioned at the start of this post, colour is an important cue for telling objects apart. For each region, the paper builds a histogram over each colour channel with 25 bins (for a 0-255 channel, roughly every 10 values fall into one bin). With three channels this gives a 75-dimensional histogram vector $C_i = \{c_{i}^{1}, \dots, c_{i}^{n}\}$, where n = 75. The histogram is then normalized with the L1 norm (sum of absolute values). From the histograms we can compute the colour similarity of two regions:
$$
s_{color}(r_i, r_j) =\sum_{k=1}^{n}{\min(c_{i}^{k}, c_{j}^{k})}
$$
The colour histogram can be handed down cheaply when regions are merged: the histogram of the merged region is
$$
C_t=\frac{size(r_i)C_i+size(r_j)C_j}{size(r_i)+size(r_j)}
$$
where $size(r_i)$ is the area of region $r_i$, and the merged region has $size(r_t)=size(r_i)+size(r_j)$.
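A small numpy sketch of the colour histogram, the intersection-style similarity and the merged histogram (my own illustration of the formulas above):

import numpy as np

def colour_histogram(region_pixels, bins=25):
    """region_pixels: (n_pixels, 3) array of RGB values in [0, 255].
    Returns a 75-dimensional, L1-normalized histogram."""
    hists = [np.histogram(region_pixels[:, c], bins=bins, range=(0, 255))[0] for c in range(3)]
    h = np.concatenate(hists).astype(np.float64)
    return h / h.sum()                               # L1 normalization

def s_colour(c_i, c_j):
    return np.minimum(c_i, c_j).sum()                # sum of element-wise minima

def merged_histogram(c_i, size_i, c_j, size_j):
    return (size_i * c_i + size_j * c_j) / (size_i + size_j)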

2. Texture similarity $s_{texture}(r_i, r_j)$

Another factor to consider is texture, i.e. the gradient information of the image.

For texture, the paper uses a SIFT-like feature. Following the idea behind SIFT, for every pixel of each colour channel the Gaussian first derivative (σ = 1) is computed in 8 orientations, and a 10-bin histogram is accumulated for each orientation. One colour channel therefore yields an 80-dimensional histogram vector, and three channels give 240 dimensions: $T_i=\{t_i^{(1)}, \dots, t_i^{(n)}\}$, where n = 240. Note that this histogram is again L1-normalized. The texture similarity of two regions is then computed in the same way as the colour similarity:
$$
s_{texture}(r_i, r_j) =\sum_{k=1}^{n}{\min(t_{i}^{k}, t_{j}^{k})}
$$

3. Size similarity $s_{size}(r_i, r_j)$

When merging, the paper gives priority to small regions. This keeps the areas of the regions being merged roughly comparable at each step and prevents large regions from gradually swallowing the small ones. The reason is simple: we want regions of different sizes to be generated evenly in every corner of the image, which plays the same role as scanning the image with rectangles of different sizes in exhaustive search. The similarity is defined as:
$$
s_{size}(r_i, r_j)=1-\frac{size(r_i) + size(r_j)}{size(im)}
$$
where $size(im)$ is the number of pixels in the original image.

4. Fill similarity $s_{fill}(r_i, r_j)$

Fill similarity measures how well two regions fit together; in my view it addresses the containment relation between objects raised at the very beginning (e.g. a wheel contained in a car). Before giving the formula, define a rectangular region $BB_{ij}$ as the smallest bounding box containing both $r_i$ and $r_j$. With that, the similarity is:
$$
s_{fill}(r_i, r_j)=1-\frac{size(BB_{ij})-size(r_i)-size(r_j)}{size(im)}
$$
To compute $BB_{ij}$ efficiently, we can store each region's bounding box as it is computed, so that $BB_{ij}$ can be derived quickly from the bounding boxes of the two regions.

5. Final similarity

Combining the four sub-measures above, the final similarity is:
$$
s(r_i, r_j) = a_1 s_{color}(r_i, r_j) + a_2 s_{texture}(r_i, r_j) + a_3 s_{size}(r_i, r_j) + a_4 s_{fill}(r_i, r_j)
$$
where each $a_i$ is 0 or 1, indicating whether the corresponding sub-measure is used.

Combining Locations

At this point the Selective Search pipeline is essentially complete: we have extracted the locations where objects might be. One last issue remains: ranking these locations. The number of extracted boxes is huge, while a user may only need a handful of them, so it makes sense to assign the boxes priorities and return them in priority order. The authors call this step Combining Locations; I could not find a good translation, so I keep the English term.

The ranking method is simple. The authors first assign each region a number. As mentioned earlier, Selective Search builds a hierarchy by successive merging, so the region covering the whole image gets number 1, the two sub-regions that were merged into it get number 2, and so on. Ranking by this number alone has a flaw: large regions would always come first. To avoid this, the authors multiply each number by a random value $RND \in [0,1]$ and sort the regions by the resulting value, in ascending order, to obtain the final ranking.
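A tiny sketch of this randomized ranking (my own illustration; level is 1 for the region covering the whole image, 2 for the two regions merged into it, and so on):

import random

def rank_proposals(boxes_with_level):
    """boxes_with_level: list of (bounding_box, level) pairs from the grouping hierarchy."""
    keyed = [(level * random.random(), box) for box, level in boxes_with_level]
    keyed.sort(key=lambda pair: pair[0])   # ascending: diversifies instead of always putting big regions first
    return [box for _, box in keyed]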

References

  • Selective Search for Object Recognition
  • Efficient Graph-Based Image Segmentation

Author: Jermmy

Original link: https://jermmy.github.io/2017/05/04/2017-5-4-paper-notes-selective-search/

Copyright notice: Unless otherwise stated, all articles on this blog are licensed under CC BY-NC-SA 3.0. Please credit the source when reposting!
