CS231n Software Packages notes

Software Packages

Caffe

http://caffe.berkeleyvision.org

Overview

  • From U.C. Berkeley
  • Written in C++
  • Has Python and Matlab bindings
  • Good for training or finetuning feedforward models

Tip

Don’t be afraid to read the code!

Main classes

  • Blob: Stores data and derivatives
  • Layer: Transforms bottom blobs to top blobs
  • Net:
    • Many layers
    • Computes gradients via forward / backward
  • Solver: Uses gradients to update weights

Protocol Buffers

  • “Typed JSON” from Google

  • Define “message types” in .proto files

    message Person {
    required string name = 1;
    required int32 id = 2;
    optional string email = 3;
    }
  • Serialize instances to text files (.prototxt)

    name: "John Doe"
    id: 1234
    email: "jdoe@example.com"
  • Compile classes for different languages
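
As a rough illustration (not part of the original notes), here is how the Person message above might be compiled and used from Python; person_pb2 is simply the module name protoc generates for person.proto:

# Generate Python classes from the .proto definition first:
#   protoc --python_out=. person.proto
from google.protobuf import text_format
import person_pb2  # generated by protoc; module name assumed

# Build a message programmatically...
p = person_pb2.Person()
p.name = "John Doe"
p.id = 1234
p.email = "jdoe@example.com"

# ...or parse it from prototxt-style text, which is how Caffe reads nets and solvers
p2 = person_pb2.Person()
text_format.Merge('name: "John Doe"\nid: 1234', p2)
print(p2.name, p2.id)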

Training / Finetuning

  1. Convert data (run a script)
  2. Define net (edit prototxt)
  3. Define solver (edit prototxt)
  4. Train (with pretrained weights) (run a script)

Step 1: Convert Data

  • DataLayer reading from LMDB is the easiest
  • Create the LMDB using convert_imageset
  • It needs a text file where each line is
    • “path/to/image.jpeg [label]”
  • Alternatively, create an HDF5 file yourself using h5py (a sketch follows this list)
  • Other data layers:
    • ImageDataLayer: Read from image files
    • WindowDataLayer: For detection
    • HDF5DataLayer: Read from an HDF5 file
    • From memory, using the Python interface
    • All of these are harder to use (except the Python one)
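
As a rough sketch of both options (file names, shapes, and the 'data' / 'label' dataset names are illustrative, not prescribed by these notes):

import numpy as np
import h5py

# Image list consumed by convert_imageset: one "path label" pair per line
with open('train.txt', 'w') as f:
    f.write('images/cat.jpeg 0\n')
    f.write('images/dog.jpeg 1\n')

# Alternatively, write an HDF5 file yourself for Caffe's HDF5 data layer.
# The dataset names just have to match the top blobs your net expects.
X = np.random.randn(10, 3, 224, 224).astype(np.float32)  # N x C x H x W
y = np.random.randint(0, 2, size=10).astype(np.float32)
with h5py.File('train.h5', 'w') as f:
    f.create_dataset('data', data=X)
    f.create_dataset('label', data=y)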

Step 2: Define Net

name: "ResNet-152"
input: "data"
input_dim: 1
input_dim: 3
input_dim: 224
input_dim: 224
layer {
bottom: "data"
top: "conv1"
name: "conv1"
type: "Convolution"
convolution_param {
num_output: 64
kernel_size: 7
pad: 3
stride: 2
bias_term: false
}
}
layer {
bottom: "conv1"
top: "conv1"
name: "bn_conv1"
type: "BatchNorm"
batch_norm_param {
use_global_stats: true
}
}
layer {
bottom: "conv1"
top: "conv1"
name: "scale_conv1"
type: "Scale"
scale_param {
bias_term: true
}
}
layer {
top: "conv1"
bottom: "conv1"
name: "conv1_relu"
type: "ReLU"
}
layer {
bottom: "conv1"
top: "pool1"
name: "pool1"
type: "Pooling"
pooling_param {
kernel_size: 3
stride: 2
pool: MAX
}
}
layer {
bottom: "pool1"
top: "res2a_branch1"
name: "res2a_branch1"
type: "Convolution"
convolution_param {
num_output: 256
kernel_size: 1
pad: 0
stride: 1
bias_term: false
}
}
layer {
bottom: "res2a_branch1"
top: "res2a_branch1"
name: "bn2a_branch1"
type: "BatchNorm"
batch_norm_param {
use_global_stats: true
}
}
layer {
bottom: "res2a_branch1"
top: "res2a_branch1"
name: "scale2a_branch1"
type: "Scale"
scale_param {
bias_term: true
}
}
layer {
bottom: "pool1"
top: "res2a_branch2a"
name: "res2a_branch2a"
type: "Convolution"
convolution_param {
num_output: 64
kernel_size: 1
pad: 0
stride: 1
bias_term: false
}
}
layer {
bottom: "res2a_branch2a"
top: "res2a_branch2a"
name: "bn2a_branch2a"
type: "BatchNorm"
batch_norm_param {
use_global_stats: true
}
}
layer {
bottom: "res2a_branch2a"
top: "res2a_branch2a"
name: "scale2a_branch2a"
type: "Scale"
scale_param {
bias_term: true
}
}
  • .prototxt can get ugly for big models
  • ResNet-152 prototxt is 6775 lines long!
  • Not “compositional”; can’t easily define a residual block and reuse

Step 2: Define Net (finetuning)

  • Layers with the same name as in the pretrained net: weights are copied
  • Layers with a different name: weights are reinitialized

Step 3: Define Solver

  • Write a prototxt file defining a SolverParameter

    message SolverParameter {
    //////////////////////////////////////////////////////////////////////////////
    // Specifying the train and test networks
    //
    // Exactly one train net must be specified using one of the following fields:
    // train_net_param, train_net, net_param, net
    // One or more test nets may be specified using any of the following fields:
    // test_net_param, test_net, net_param, net
    // If more than one test net field is specified (e.g., both net and
    // test_net are specified), they will be evaluated in the field order given
    // above: (1) test_net_param, (2) test_net, (3) net_param/net.
    // A test_iter must be specified for each test_net.
    // A test_level and/or a test_stage may also be specified for each test_net.
    //////////////////////////////////////////////////////////////////////////////
    // Proto filename for the train net, possibly combined with one or more
    // test nets.
    optional string net = 24;
    // Inline train net param, possibly combined with one or more test nets.
    optional NetParameter net_param = 25;
    optional string train_net = 1; // Proto filename for the train net.
    repeated string test_net = 2; // Proto filenames for the test nets.
    optional NetParameter train_net_param = 21; // Inline train net params.
    repeated NetParameter test_net_param = 22; // Inline test net params.
    // The states for the train/test nets. Must be unspecified or
    // specified once per net.
    //
    // By default, all states will have solver = true;
    // train_state will have phase = TRAIN,
    // and all test_state's will have phase = TEST.
    // Other defaults are set according to the NetState defaults.
    optional NetState train_state = 26;
    repeated NetState test_state = 27;
    // The number of iterations for each test net.
    repeated int32 test_iter = 3;
    // The number of iterations between two testing phases.
    optional int32 test_interval = 4 [default = 0];
    optional bool test_compute_loss = 19 [default = false];
    // If true, run an initial test pass before the first iteration,
    // ensuring memory availability and printing the starting value of the loss.
    optional bool test_initialization = 32 [default = true];
    optional float base_lr = 5; // The base learning rate
    // the number of iterations between displaying info. If display = 0, no info
    // will be displayed.
    optional int32 display = 6;
    // Display the loss averaged over the last average_loss iterations
    optional int32 average_loss = 33 [default = 1];
    optional int32 max_iter = 7; // the maximum number of iterations
    optional string lr_policy = 8; // The learning rate decay policy.
    optional float gamma = 9; // The parameter to compute the learning rate.
    optional float power = 10; // The parameter to compute the learning rate.
    optional float momentum = 11; // The momentum value.
    optional float weight_decay = 12; // The weight decay.
    // regularization types supported: L1 and L2
    // controlled by weight_decay
    optional string regularization_type = 29 [default = "L2"];
    // the stepsize for learning rate policy "step"
    optional int32 stepsize = 13;
    // the stepsize for learning rate policy "multistep"
    repeated int32 stepvalue = 34;
    // Set clip_gradients to >= 0 to clip parameter gradients to that L2 norm,
    // whenever their actual L2 norm is larger.
    optional float clip_gradients = 35 [default = -1];
    optional int32 snapshot = 14 [default = 0]; // The snapshot interval
    optional string snapshot_prefix = 15; // The prefix for the snapshot.
    // whether to snapshot diff in the results or not. Snapshotting diff will help
    // debugging but the final protocol buffer size will be much larger.
    optional bool snapshot_diff = 16 [default = false];
    // the mode solver will use: 0 for CPU and 1 for GPU. Use GPU in default.
    enum SolverMode {
    CPU = 0;
    GPU = 1;
    }
    optional SolverMode solver_mode = 17 [default = GPU];
    // the device_id will that be used in GPU mode. Use device_id = 0 in default.
    optional int32 device_id = 18 [default = 0];
    // If non-negative, the seed with which the Solver will initialize the Caffe
    // random number generator -- useful for reproducible results. Otherwise,
    // (and by default) initialize using a seed derived from the system clock.
    optional int64 random_seed = 20 [default = -1];
    // Solver type
    enum SolverType {
    SGD = 0;
    NESTEROV = 1;
    ADAGRAD = 2;
    }
    optional SolverType solver_type = 30 [default = SGD];
    // numerical stability for AdaGrad
    optional float delta = 31 [default = 1e-8];
    // If true, print information about the state of the net that may help with
    // debugging learning problems.
    optional bool debug_info = 23 [default = false];
    // If false, don't save a snapshot after training finishes.
    optional bool snapshot_after_train = 28 [default = true];
    }
    // A message that stores the solver snapshots
    message SolverState {
    optional int32 iter = 1; // The current iteration
    optional string learned_net = 2; // The file that stores the learned net.
    repeated BlobProto history = 3; // The history for sgd solvers
    optional int32 current_step = 4 [default = 0]; // The current step for learning rate
    }
    enum Phase {
    TRAIN = 0;
    TEST = 1;
    }
    message NetState {
    optional Phase phase = 1 [default = TEST];
    optional int32 level = 2 [default = 0];
    repeated string stage = 3;
    }
    message NetStateRule {
    // Set phase to require the NetState have a particular phase (TRAIN or TEST)
    // to meet this rule.
    optional Phase phase = 1;
    // Set the minimum and/or maximum levels in which the layer should be used.
    // Leave undefined to meet the rule regardless of level.
    optional int32 min_level = 2;
    optional int32 max_level = 3;
    // Customizable sets of stages to include or exclude.
    // The net must have ALL of the specified stages and NONE of the specified
    // "not_stage"s to meet the rule.
    // (Use multiple NetStateRules to specify conjunctions of stages.)
    repeated string stage = 4;
    repeated string not_stage = 5;
    }
    // Specifies training parameters (multipliers on global learning constants,
    // and the name and other settings used for weight sharing).
    message ParamSpec {
    // The names of the parameter blobs -- useful for sharing parameters among
    // layers, but never required otherwise. To share a parameter between two
    // layers, give it a (non-empty) name.
    optional string name = 1;
    // Whether to require shared weights to have the same shape, or just the same
    // count -- defaults to STRICT if unspecified.
    optional DimCheckMode share_mode = 2;
    enum DimCheckMode {
    // STRICT (default) requires that num, channels, height, width each match.
    STRICT = 0;
    // PERMISSIVE requires only the count (num*channels*height*width) to match.
    PERMISSIVE = 1;
    }
    // The multiplier on the global learning rate for this parameter.
    optional float lr_mult = 3 [default = 1.0];
    // The multiplier on the global weight decay for this parameter.
    optional float decay_mult = 4 [default = 1.0];
    }
  • If finetuning, copy existing solver.prototxt file

    • Change net to be your net
    • Change snapshot_prefix to your output
    • Reduce base learning rate (divide by 100)
    • Maybe change max_iter and snapshot

Step 4: Train

./build/tools/caffe train \
-gpu 0 \
-model path/to/trainval.prototxt \
-solver path/to/solver.prototxt \
-weights path/to/pretrained_weights.caffemodel
# -gpu -1 for CPU mode
# -gpu all for multi-GPU data parallelism

Model Zoo

https://github.com/BVLC/caffe/wiki/Model-Zoo

Python Interface

Read the code! The two most important files are caffe/python/caffe/_caffe.cpp (exports the Blob, Layer, Net, and Solver classes to Python) and caffe/python/caffe/pycaffe.py (adds extra methods to the Net class).

Good for:

  • Interfacing with numpy
  • Extract features: Run net forward
  • Compute gradients: Run net backward (DeepDream, etc)
  • Define layers in Python with numpy (CPU only)
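
A minimal feature-extraction sketch with pycaffe (the file paths and the 'fc7' blob name are placeholders; use the blobs of your own net):

import numpy as np
import caffe

caffe.set_mode_gpu()  # or caffe.set_mode_cpu()
net = caffe.Net('deploy.prototxt',        # net definition (placeholder path)
                'weights.caffemodel',     # pretrained weights (placeholder path)
                caffe.TEST)

# Fill the input blob with a preprocessed batch (shape must match the net)
net.blobs['data'].data[...] = np.random.randn(1, 3, 224, 224)
net.forward()

# Read features out of an intermediate blob; run net.backward() for gradients
features = net.blobs['fc7'].data.copy()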

Pros / Cons

  • (+) Good for feedforward networks
  • (+) Good for finetuning existing networks
  • (+) Train models without writing any code!
  • (+) Python interface is pretty useful!
  • (-) Need to write C++ / CUDA for new GPU layers
  • (-) Not good for recurrent networks
  • (-) Cumbersome for big networks (GoogLeNet, ResNet)

Torch

http://torch.ch

Overview

  • From NYU + IDIAP
  • Written in C and Lua
  • Used a lot at Facebook and DeepMind

Lua

Learn Lua in 15 Minutes

  • High level scripting language, easy to interface with C
  • Similar to JavaScript:
    • One data structure: table == JS object
    • Prototypical inheritance: metatable == JS prototype
    • First-class functions
  • Some gotchas:
    • 1-indexed =(
    • Variables global by default =(
    • Small standard library

Tensor

Torch tensors are just like numpy arrays

Documentation is on GitHub: https://github.com/torch/torch7/blob/master/doc/tensor.md

nn

nn module lets you easily build and train neural nets

require 'nn'
-- our optimization procedure will iterate over the modules, so only share
-- the parameters
mlp = nn.Sequential()
linear = nn.Linear(2,2)
linear_clone = linear:clone('weight','bias') -- clone sharing the parameters
mlp:add(linear)
mlp:add(linear_clone)
function gradUpdate(mlp, x, y, criterion, learningRate)
local pred = mlp:forward(x)
local err = criterion:forward(pred, y)
local gradCriterion = criterion:backward(pred, y)
mlp:zeroGradParameters()
mlp:backward(x, gradCriterion)
mlp:updateParameters(learningRate)
end
-- our optimization procedure will use all the parameters at once, because
-- it requires the flattened parameters and gradParameters Tensors. Thus,
-- we need to share both the parameters and the gradParameters
mlp = nn.Sequential()
linear = nn.Linear(2,2)
-- need to share the parameters and the gradParameters as well
linear_clone = linear:clone('weight','bias','gradWeight','gradBias')
mlp:add(linear)
mlp:add(linear_clone)
params, gradParams = mlp:getParameters()
function gradUpdate(mlp, x, y, criterion, learningRate, params, gradParams)
local pred = mlp:forward(x)
local err = criterion:forward(pred, y)
local gradCriterion = criterion:backward(pred, y)
mlp:zeroGradParameters()
mlp:backward(x, gradCriterion)
-- adds the gradients to all the parameters at once
params:add(-learningRate, gradParams)
end

cunn

Running on GPU is easy

require 'nn'
require 'cunn' -- CUDA backend for nn
local model = nn.Sequential()
model:add(nn.Linear(2,2))
model:add(nn.LogSoftMax())
model:cuda() -- convert model to CUDA
local input = torch.Tensor(32,2):uniform()
input = input:cuda() -- move the input to the GPU
local output = model:forward(input)
-- or build the input on the GPU directly:
local input = torch.CudaTensor(32,2):uniform()
local output = model:forward(input)

optim

optim package implements different update rules: momentum, Adam, etc

require 'optim'
for epoch = 1, 50 do
-- local function we give to optim
-- it takes current weights as input, and outputs the loss
-- and the gradient of the loss with respect to the weights
-- gradParams is calculated implicitly by calling 'backward',
-- because the model's weight and bias gradient tensors
-- are simply views onto gradParams
function feval(params)
gradParams:zero()
local outputs = model:forward(batchInputs)
local loss = criterion:forward(outputs, batchLabels)
local dloss_doutputs = criterion:backward(outputs, batchLabels)
model:backward(batchInputs, dloss_doutputs)
return loss, gradParams
end
optim.sgd(feval, params, optimState)
end

Modules

  • Caffe has Nets and Layers; Torch just has Modules
  • Modules are classes written in Lua; easy to read and write
  • Forward / backward written in Lua using Tensor methods
  • Same code runs on CPU / GPU
local Linear, parent = torch.class('nn.Linear', 'nn.Module')
function Linear:__init(inputSize, outputSize, bias)
parent.__init(self)
local bias = ((bias == nil) and true) or bias
self.weight = torch.Tensor(outputSize, inputSize)
self.gradWeight = torch.Tensor(outputSize, inputSize)
if bias then
self.bias = torch.Tensor(outputSize)
self.gradBias = torch.Tensor(outputSize)
end
self:reset()
end
function Linear:noBias()
self.bias = nil
self.gradBias = nil
return self
end
function Linear:reset(stdv)
if stdv then
stdv = stdv * math.sqrt(3)
else
stdv = 1./math.sqrt(self.weight:size(2))
end
if nn.oldSeed then
for i=1,self.weight:size(1) do
self.weight:select(1, i):apply(function()
return torch.uniform(-stdv, stdv)
end)
end
if self.bias then
for i=1,self.bias:nElement() do
self.bias[i] = torch.uniform(-stdv, stdv)
end
end
else
self.weight:uniform(-stdv, stdv)
if self.bias then self.bias:uniform(-stdv, stdv) end
end
return self
end
local function updateAddBuffer(self, input)
local nframe = input:size(1)
self.addBuffer = self.addBuffer or input.new()
if self.addBuffer:nElement() ~= nframe then
self.addBuffer:resize(nframe):fill(1)
end
end
function Linear:updateOutput(input)
if input:dim() == 1 then
self.output:resize(self.weight:size(1))
if self.bias then self.output:copy(self.bias) else self.output:zero() end
self.output:addmv(1, self.weight, input)
elseif input:dim() == 2 then
local nframe = input:size(1)
local nElement = self.output:nElement()
self.output:resize(nframe, self.weight:size(1))
if self.output:nElement() ~= nElement then
self.output:zero()
end
updateAddBuffer(self, input)
self.output:addmm(0, self.output, 1, input, self.weight:t())
if self.bias then self.output:addr(1, self.addBuffer, self.bias) end
else
error('input must be vector or matrix')
end
return self.output
end
function Linear:updateGradInput(input, gradOutput)
if self.gradInput then
local nElement = self.gradInput:nElement()
self.gradInput:resizeAs(input)
if self.gradInput:nElement() ~= nElement then
self.gradInput:zero()
end
if input:dim() == 1 then
self.gradInput:addmv(0, 1, self.weight:t(), gradOutput)
elseif input:dim() == 2 then
self.gradInput:addmm(0, 1, gradOutput, self.weight)
end
return self.gradInput
end
end
function Linear:accGradParameters(input, gradOutput, scale)
scale = scale or 1
if input:dim() == 1 then
self.gradWeight:addr(scale, gradOutput, input)
if self.bias then self.gradBias:add(scale, gradOutput) end
elseif input:dim() == 2 then
self.gradWeight:addmm(scale, gradOutput:t(), input)
if self.bias then
-- update the size of addBuffer if the input is not the same size as the one we had in last updateGradInput
updateAddBuffer(self, input)
self.gradBias:addmv(scale, gradOutput:t(), self.addBuffer)
end
end
end
function Linear:sharedAccUpdateGradParameters(input, gradOutput, lr)
-- we do not need to accumulate parameters when sharing:
self:defaultAccUpdateGradParameters(input, gradOutput, lr)
end
function Linear:clearState()
if self.addBuffer then self.addBuffer:set() end
return parent.clearState(self)
end
function Linear:__tostring__()
return torch.type(self) ..
string.format('(%d -> %d)', self.weight:size(2), self.weight:size(1)) ..
(self.bias == nil and ' without bias' or '')
end

Tons of built-in modules and loss functions

https://github.com/torch/nn

Container

Container modules allow you to combine multiple modules

nngraph

A multi-layer network where each layer takes output of previous two layers as input.

require 'nn'
require 'nngraph'
input = nn.Identity()()
L1 = nn.Tanh()(nn.Linear(10, 20)(input))
L2 = nn.Tanh()(nn.Linear(30, 60)(nn.JoinTable(1)({input, L1})))
L3 = nn.Tanh()(nn.Linear(80, 160)(nn.JoinTable(1)({L1, L2})))
g = nn.gModule({input}, {L3})
indata = torch.rand(10)
gdata = torch.rand(160)
g:forward(indata)
g:backward(indata, gdata)
graph.dot(g.fg, 'Forward Graph')
graph.dot(g.bg, 'Backward Graph')

More Info

Pretrained Models

Package Management

After installing Torch, use luarocks to install or update Lua packages, e.g. luarocks install nn

(Similar to pip install in Python)

Other useful packages

Typical Workflow

  1. Preprocess data; usually use a Python script to dump data to HDF5
  2. Train a model in Lua / Torch; read from HDF5 datafile, save trained model to disk
  3. Use trained model for something, often with an evaluation script

Example: https://github.com/jcjohnson/torch-rnn

Step 1: Preprocess data; usually use a Python script to dump data to HDF5 (https://github.com/jcjohnson/torch-rnn/blob/master/scripts/preprocess.py)
Step 2: Train a model in Lua / Torch; read from HDF5 datafile, save trained model to disk (https://github.com/jcjohnson/torch-rnn/blob/master/train.lua)
Step 3: Use trained model for something, often with an evaluation script (https://github.com/jcjohnson/torch-rnn/blob/master/sample.lua)

Pros / Cons

  • (-) Lua
  • (-) Less plug-and-play than Caffe
    • You usually write your own training code
  • (+) Lots of modular pieces that are easy to combine
  • (+) Easy to write your own layer types and run on GPU
  • (+) Most of the library code is in Lua, easy to read
  • (+) Lots of pretrained models!
  • (-) Not great for RNNs

Theano

http://deeplearning.net/software/theano/

Overview

  • From Yoshua Bengio’s group at the University of Montreal
  • Embracing computation graphs and symbolic computation (see the sketch after this list)
  • High-level wrappers: Keras, Lasagne
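
A minimal sketch of this symbolic style, using only the basic tensor / shared-variable / function API:

import numpy as np
import theano
import theano.tensor as T

# Build a symbolic graph: nothing is computed yet
x = T.vector('x')
w = theano.shared(np.ones(3), name='w')  # learnable parameter
y = T.dot(w, x)                          # symbolic expression
gw = T.grad(y, w)                        # symbolic gradient w.r.t. w

# Compile the graph into a callable function, then run it on real data
f = theano.function(inputs=[x], outputs=[y, gw])
print(f(np.array([1.0, 2.0, 3.0])))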

Other Topics

Conditionals: The ifelse and switch functions allow conditional control flow in the graph

Loops: The scan function allows for (some types of) loops in the computational graph; good for RNNs
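
For example, a small scan sketch that computes a running sum over a vector (the step function here is illustrative):

import numpy as np
import theano
import theano.tensor as T

x = T.vector('x')
# scan applies the step function along the sequence, threading the accumulator
outputs, updates = theano.scan(
    fn=lambda x_t, acc: acc + x_t,
    sequences=x,
    outputs_info=T.as_tensor_variable(np.asarray(0.0, dtype=theano.config.floatX)))
running_sum = theano.function([x], outputs)
print(running_sum(np.array([1.0, 2.0, 3.0], dtype=theano.config.floatX)))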

Derivatives: Efficient Jacobian / vector products via the R and L operators; symbolic Hessians (gradient of gradient)

Sparse matrices, optimizations, etc

Multi-GPU

Experimental model parallelism:
http://deeplearning.net/software/theano/tutorial/using_multi_gpu.html

Data parallelism using platoon:
https://github.com/mila-udem/platoon

High level wrapper

  • Lasagne
  • Keras
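
To give a flavor of what these wrappers buy you, a small Keras-style sketch (layer sizes and data are made up, and keyword names vary between Keras versions):

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# A tiny fully-connected classifier
model = Sequential()
model.add(Dense(32, input_dim=4, activation='relu'))
model.add(Dense(3, activation='softmax'))
model.compile(loss='categorical_crossentropy', optimizer='sgd')

X = np.random.randn(100, 4)
y = np.eye(3)[np.random.randint(0, 3, 100)]  # one-hot labels
model.fit(X, y, epochs=5, batch_size=16)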

Pretrained Models

Lasagne Model Zoo has pretrained common architectures:
https://github.com/Lasagne/Recipes/tree/master/modelzoo
AlexNet with weights: https://github.com/uoguelph-mlrg/theano_alexnet
sklearn-theano: Run OverFeat and GoogLeNet forward, but no fine-tuning? http://sklearn-theano.github.io
caffe-theano-conversion: CS 231n project from last year: load models and weights from caffe! Not sure if full-featured https://github.com/kitofans/caffe-theano-conversion

Pros / Cons

  • (+) Python + numpy
  • (+) Computational graph is nice abstraction
  • (+) RNNs fit nicely in computational graph
  • (-) Raw Theano is somewhat low-level
  • (+) High level wrappers (Keras, Lasagne) ease the pain
  • (-) Error messages can be unhelpful
  • (-) Large models can have long compile times
  • (-) Much “fatter” than Torch; more magic
  • (-) Patchy support for pretrained models

TensorFlow

https://www.tensorflow.org

Overview

  • From Google
  • Very similar to Theano: all about computation graphs (see the sketch after this list)
  • Easy visualizations (TensorBoard)
  • Multi-GPU and multi-node training
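
A minimal sketch of the graph-and-session style (TF 1.x-era API names assumed):

import numpy as np
import tensorflow as tf

# Build the graph symbolically, much like Theano
x = tf.placeholder(tf.float32, shape=[None, 3])
w = tf.Variable(tf.ones([3, 1]))
y = tf.matmul(x, w)
loss = tf.reduce_mean(tf.square(y))
train_step = tf.train.GradientDescentOptimizer(0.1).minimize(loss)

# Then run it in a session
with tf.Session() as sess:
    sess.run(tf.global_variables_initializer())
    _, loss_val = sess.run([train_step, loss],
                           feed_dict={x: np.random.randn(8, 3).astype(np.float32)})
    print(loss_val)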

Tensorboard

Tensorboard makes it easy to visualize what’s happening inside your models
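
A hedged sketch of how summaries feed TensorBoard (TF 1.x-style summary API assumed; the loss tensor and log directory are placeholders):

import tensorflow as tf

loss = tf.constant(0.5)          # stand-in for a real loss tensor
tf.summary.scalar('loss', loss)  # record a scalar over time
merged = tf.summary.merge_all()

with tf.Session() as sess:
    writer = tf.summary.FileWriter('/tmp/logs', sess.graph)
    for step in range(10):
        summary = sess.run(merged)
        writer.add_summary(summary, step)
    writer.close()
# Then launch: tensorboard --logdir=/tmp/logs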

Multi-GPU

Distributed

Pretrained Models

You can get a pretrained version of Inception here:
https://github.com/tensorflow/tensorflow/blob/master/tensorflow/examples/android/README.md

(In an Android example?? Very well-hidden)

The only one I could find =(

Pros / Cons

  • (+) Python + numpy
  • (+) Computational graph abstraction, like Theano; great for RNNs
  • (+) Much faster compile times than Theano
  • (+) Slightly more convenient than raw Theano?
  • (+) TensorBoard for visualization
  • (+) Data AND model parallelism; best of all frameworks
  • (+/-) Distributed models, but not open-source yet
  • (-) Slower than other frameworks right now
  • (-) Much “fatter” than Torch; more magic
  • (-) Not many pretrained models

Use Cases

  • Extract AlexNet or VGG features? Use Caffe
  • Fine-tune AlexNet for new classes? Use Caffe
  • Image Captioning with finetuning?
    • -> Need pretrained models (Caffe, Torch, Lasagne)
    • -> Need RNNs (Torch or Lasagne)
    • -> Use Torch or Lasagne
  • Segmentation? (Classify every pixel)
    • -> Need pretrained model (Caffe, Torch, Lasagne)
    • -> Need funny loss function
    • -> If loss function exists in Caffe: Use Caffe
    • -> If you want to write your own loss: Use Torch
  • Object Detection?
    • -> Need pretrained model (Torch, Caffe, Lasagne)
    • -> Need lots of custom imperative code (NOT Lasagne)
    • -> Use Caffe + Python or Torch
  • Language modeling with new RNN structure?
    • -> Need easy recurrent nets (NOT Caffe, Torch)
    • -> No need for pretrained models
    • -> Use Theano or TensorFlow
  • Implement BatchNorm?
    • -> Don’t want to derive gradient? Theano or TensorFlow
    • -> Implement efficient backward pass? Use Torch

Recommendation:

  • Feature extraction / finetuning existing models: Use Caffe
  • Complex uses of pretrained models: Use Lasagne or Torch
  • Write your own layers: Use Torch
  • Crazy RNNs: Use Theano or TensorFlow
  • Huge model, need model parallelism: Use TensorFlow