# Deep Learning

## Introduction

Convolutional neural networks show great promise in modern image processing. New techniques for detecting edges in images, a historically difficult problem, may have significance for the field of computer handwriting recognition. In the Spring of 2017, we, Jenna McCampbell, Ji Won Chung, and Farida Sabry, set out to learn as much as we could about edge detection with convolutional neural networks, and to begin implementing a CNN architecture that could allow for improved handwriting recognition. Much of our learning was guided by "Holistically-Nested Edge Detection," a paper describing the research of Saining Xie and Zhuowen Tu of the University of California, San Diego.[1]

## A History of Neural Nets and Their Capabilities

**1943:** Warren McCulloch and Walter Pitts discovered that neurons that summed binary inputs and outputted a binary result based on a given threshold could model AND, OR, and NOT logical operations. This study became a foundation for modern artificial intelligence because it demonstrated that computers could perform logical reasoning.

**1958:** Frank Rosenblatt invented the perceptron, a mathematical model of how the neurons in brains work. Perceptrons take a number of binary inputs, multiply them by certain weights based on their importance, and output one if the resulting weighted sum exceeds a given threshold and zero otherwise. Inspired by Donald Hebb's postulate that human brains learn by forming and changing synapses between neurons, Rosenblatt added to McCulloch and Pitts's earlier discoveries by providing the perceptron as a mechanism to allow computers to learn. Given a set of training data, the weights of binary inputs to perceptrons could be changed to produce the correct results for known data. Rosenblatt showed perceptrons could be used to classify simple shapes represented in 20 by 20 pixel images.
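As an illustrative sketch (not Rosenblatt's original setup), the perceptron and its learning rule can be written in a few lines; the AND task, learning rate, and epoch count below are assumed values chosen for the example:

```python
# A minimal sketch of a perceptron with Rosenblatt-style weight updates.
# The AND function, learning rate, and epoch count are illustrative choices.

def predict(weights, bias, inputs):
    """Output 1 if the weighted sum clears the threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if total > 0 else 0

def train(samples, epochs=10, lr=0.1):
    weights, bias = [0.0, 0.0], 0.0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # Shift each weight toward producing the correct output.
            weights = [w + lr * error * x for w, x in zip(weights, inputs)]
            bias += lr * error
    return weights, bias

# AND is linearly separable, so the perceptron converges on it.
and_data = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, bias = train(and_data)
print([predict(weights, bias, x) for x, _ in and_data])  # → [0, 0, 0, 1]
```

Note that the same loop would never converge on XOR, which foreshadows the limitation Minsky and Papert later emphasized.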

**1960:** Bernard Widrow and Ted Hoff created "an adaptive 'ADALINE' neuron using chemical memistors." This model eliminated the threshold activation function conceived of by Rosenblatt to determine whether the combined inputs to a neuron should output one or zero. They demonstrated that eliminating the threshold activation function allowed for a calculus-based approach to minimizing error and finding optimal weight values: they used the derivatives of the training error with respect to each weight to zero in on optimal weight values.

**1969:** Marvin Minsky and Seymour Papert published a book called *Perceptrons* in which they discussed the limitations of the perceptron, in response to what they felt was a pervasive exaggeration of this model's ability to make computers think like humans. The book discussed many limitations, including the inability of a single perceptron to learn the XOR function because it is not linearly separable (the decision surface cannot be separated by a line). Minsky and Papert argued that solving the XOR problem would require multiple layers of perceptrons. Because Rosenblatt's training model only specified the correct output for the final output layer, it could not be used to adjust weights in hidden layers (layers between the input and output layers). It is believed that this book helped to usher in the first age of stagnation in the funding for research on computer neurons and artificial intelligence.

**1974-1986:** Paul Werbos became the first in the United States to suggest, in his PhD thesis, that backpropagation could be used to train multilayer neural nets. Backpropagation derived from the realization that an alternative to the perceptron could be used in neural networks. Such a neuron would use an activation function that was non-linear, yet still differentiable, drawing from the ideas of ADALINE. With such an activation function, the derivative could be used to adjust weights, and the chain rule could be employed to find the derivatives for neurons in earlier layers and adjust their weights as well, making learning in a multilayer neural network possible. Because of the stagnation in funding for artificial intelligence at this time, Werbos did not publish on backpropagation until 1982, and it wasn't until 1986 that David Rumelhart, Geoffrey Hinton, and Ronald Williams popularized these ideas with their clear and concise description in "Learning representations by back-propagating errors".

**1989:** Kurt Hornik, Maxwell Stinchcombe, and Halbert White published "Multilayer feedforward networks are universal approximators," an article that mathematically proved that multilayer neural networks could theoretically implement any function. Also in 1989, Yann LeCun et al. published "Backpropagation Applied to Handwritten Zip Code Recognition," demonstrating a concrete application of multilayer neural networks trained through backpropagation. However, the model used in this paper did not use backpropagation alone to achieve its results. The first hidden layer of the model employed convolution. Each neuron in this layer had a smaller set of weights than the entire dimensions of the image being processed. This set of weights could be applied to many subsets of a given image to detect a particular pattern. Rather than apply a different weight to each pixel in an image, this method allowed one neuron to detect a smaller pattern anywhere in an image. The following layers then took these localized features and combined them to detect larger portions of the image. The last two layers were traditional hidden layers that detected patterns from the aggregate work of the previous layers to determine what number the image of a handwritten digit represented. This approach prevented the neural network from having to relearn the same patterns in new locations within an image.

**Late 1980s to mid 1990s:** Backpropagation works by essentially splitting up the error and working backwards to assign different amounts of blame to the weights of each neural connection. Because of this, the more layers involved in a neural network, the harder it was to train that network accurately with backpropagation; error signals either shrunk rapidly or exploded. Although convolution helped to cut back on the number of layers necessary to make a functional neural network, this problem persisted, and neural networks gained a reputation for being unduly complicated and producing suboptimal results. Enthusiasm and funding for neural networks thus waned again during the mid-1990s.

**2006:** Geoffrey Hinton, Simon Osindero, and Yee-Whye Teh conducted research with the help of funding from the Canadian Institute for Advanced Research and published a breakthrough paper, "A fast learning algorithm for deep belief nets." This paper showed that neural networks made of many layers underperformed because they were initialized with random weights. Instead of randomly selecting the initial weights for their neural network, the team trained each layer separately with unsupervised training to obtain more accurate weights with which to train the entire network with supervised training. This breakthrough, along with improved computing power since the mid-1990s, led to a significant resurgence in confidence in neural networks, and companies like Microsoft and Google began to devote their resources to deep learning research.

**Early 2010s to present:** Geoffrey Hinton suggests that the mistakes that have led neural networks to produce suboptimal results are as follows: training data sets were too small, computers were too slow, weights were not initialized optimally, and often the wrong kind of activation function was used. Much of the work being done after Hinton et al.'s 2006 breakthrough focuses on correcting previous assumptions about neural networks and comparing the breakthroughs of deep convolutional neural networks to other computational methods.

**For more information on the history of neural nets, see:** Andrey Kurenkov's Blog Post

## An Introduction to Neural Networks

**Convolution**

Convolution in neural networks is a mathematical process employed in hidden layers to identify certain features, like color or shape, within an image. A convolutional layer has multiple filters, each of which is an array of numbers representing a pattern to be recognized within an input image. Each filter is smaller than the dimensions of the input itself, and through convolution it can be used to scan for its feature throughout the image. The filter and its receptive field, the area of the input the filter covers at one position, may be scaled to produce different results. In a convolutional layer, each filter is shifted across the input image. At each location, the corresponding input and filter values are multiplied and summed, resulting in an array with one number for each position where the filter is applied. This resulting array, called a feature map, shows how strongly the pattern represented in the filter occurs in each area of the input.
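A minimal sketch of this process, using an invented 5x5 image and a hand-made 3x3 vertical-edge filter (stride 1, no padding assumed), neither of which comes from the text:

```python
import numpy as np

# A single convolutional filter producing a feature map.
def feature_map(image, filt):
    """Slide the filter over the image (stride 1, no padding) and
    record the sum of elementwise products at each position."""
    ih, iw = image.shape
    fh, fw = filt.shape
    out = np.zeros((ih - fh + 1, iw - fw + 1))
    for r in range(out.shape[0]):
        for c in range(out.shape[1]):
            receptive_field = image[r:r + fh, c:c + fw]
            out[r, c] = np.sum(receptive_field * filt)
    return out

# An image whose left half is dark (0) and right half is bright (1).
image = np.array([[0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1],
                  [0, 0, 1, 1, 1]], dtype=float)

# A filter that responds strongly to a vertical dark-to-bright edge.
vertical_edge = np.array([[-1, 0, 1],
                          [-1, 0, 1],
                          [-1, 0, 1]], dtype=float)

fmap = feature_map(image, vertical_edge)
print(fmap)  # strongest responses in the columns where the edge lies
```

The same small set of filter weights is reused at every position, which is exactly why one convolutional neuron can detect its pattern anywhere in the image.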

## Saining Xie and Zhuowen Tu's "Holistically-Nested Edge Detection"[2]

### Introduction to the HED System

This section is a summary of Saining Xie and Zhuowen Tu's paper on "Holistically-Nested Edge Detection".

Humans have the ability to detect and distinguish the boundaries between an object and its surroundings. This faculty is still lacking in edge detection technology.

Xie and Tu created a holistically-nested edge detection (HED) system. HED is an end-to-end detection system which “automatically learns the type of hierarchical features” necessary for edge detection. The system is “holistic” because the algorithm “aims to train and predict edges in an image-to-image fashion”. It is “nested” to highlight the value of the edge maps produced as side outputs: the side outputs share a common prediction path, and the system makes its “successive edge maps be more concise”.

The HED improves holistic image training and prediction by employing a derivation of fully convolutional neural networks (FCN) for *image-to-image* classification: the system takes an entire image as its input, and its output is an image of the edge map.

The HED also tackles the problem of nested multi-scale feature learning, employing a derivation of deeply-supervised nets that directly supervises early classification results.

Background research indicates that the following seven properties are crucial for a successful system:

**(1)** Carefully designed and/or learned features

**(2)** Multi-scale response fusion

**(3)** Engagement of different levels of visual perception

**(4)** Incorporating structural information

**(5)** Incorporating context

**(6)** Exploring 3D geometry

**(7)** Addressing occlusion boundaries

The HED system is an end-to-end edge detection system that incorporates deep supervision and side outputs to enhance the results. Deep supervision makes the multi-scale responses more semantically meaningful.

However, this HED system does not incorporate contextual information, as it does not explicitly model constraints among local pixel labels.

The HED system is computationally efficient and accurate for the following reasons:

**(1)** The image-to-image training can train on significantly more samples simultaneously

**(2)** The additional deep supervision helps learn more transparent features

**(3)** Using the side outputs in the “end-to-end learning encourages coherent contributions from each layer”

### Various Network Architectures

The HED system is able to efficiently produce predictions from multiple image scales thanks to its unique architecture. Xie and Tu identify several categories of previously researched architectures that attempt to provide similar results, including **multi-stream networks**, **skip layer networks**, **networks that employ a single model on multiple inputs**, and **training independent networks**.

Systems that employ **multi-stream learning** pass input through multiple branches in the network, each representing different parameter values and receptive field sizes. The output from these branches is then passed through a global output layer to produce the result. Unlike multi-stream learning, **skip layer networks** contain only one branch. Links combine feature responses for different layers, all of which are combined in a single output layer. Xie and Tu identify the fact that only one prediction is produced as a shared weak point of these two models. **Networks that employ a single model on multiple inputs** can be more effective for edge detection because they produce multiple predictions at multiple scales. These systems take inputs of multiple sizes and produce as many multi-sized outputs. Another, more radical approach is to **train independent networks** such that one input may be analyzed at different depths in each independent network. Since this approach is extremely resource-inefficient, it does not provide an ideal model for edge detection.

Xie and Tu’s answer to the relative pros and cons of the above described architectures is their **holistically-nested system** which attempts to reduce computational redundancy while maintaining the ability to produce predictions at multiple scales. This architecture is composed of a single stream with multiple layers that have different receptive field sizes and produce differently scaled side outputs. These side outputs may be combined through a fusion layer to produce one more comprehensive output.
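As a rough illustration of the fusion idea only (not Xie and Tu's implementation), the sketch below combines several hypothetical side-output maps with a weighted fusion layer. The maps, the fusion weights, and the assumption that all side outputs have already been upsampled to a common resolution are invented for the example:

```python
import numpy as np

# Weighted fusion of side-output edge maps, squashed to (0, 1)
# per-pixel edge probabilities with a sigmoid.
def fuse(side_outputs, fusion_weights):
    combined = sum(w * m for w, m in zip(fusion_weights, side_outputs))
    return 1.0 / (1.0 + np.exp(-combined))  # sigmoid

# Three hypothetical side outputs from layers with growing receptive fields:
fine   = np.array([[4.0, -4.0], [4.0, -4.0]])   # early layer: sharp, noisy
mid    = np.array([[2.0, -2.0], [2.0, -2.0]])
coarse = np.array([[1.0, -1.0], [1.0, -1.0]])   # deep layer: blurry, reliable

fused = fuse([fine, mid, coarse], fusion_weights=[0.5, 0.3, 0.2])
print(fused)  # per-pixel edge probabilities
```

In the real system the fusion weights are learned during training rather than fixed, and each side output also receives its own supervision signal.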

## Installation of HED Neural Network on Windows

This is the documentation on the installation of the HED neural network architecture for Windows.

1. Go to https://github.com/s9xie/hed (extra information on prerequisites is listed)
2. Install the prerequisites for Caffe
3. Clone the modified Caffe repo for HED: https://github.com/s9xie/hed.git
4. Download the pretrained model: http://vcl.ucsd.edu/hed/hed_pretrained_bsds.caffemodel
5. Place it into the /hed folder
6. Configure and build Caffe: https://github.com/BVLC/caffe/blob/windows/README.md
7. Add cmake.exe and python.exe to $PATH

### Errors/Bugs in Installation

Ran into the following problems running scripts\build_win.cmd and running Visual Studio 2015:

- '"C:\Program Files (x86)\Microsoft Visual Studio 14.0\Common7\Tools\..\..\VC\vcvarsall.bat"' is not recognized as an internal or external command, operable program or batch file.
- CMake Error: CMake was unable to find a build program corresponding to "Ninja". CMAKE_MAKE_PROGRAM is not set. You probably need to select a different build tool.
- CMake Error: CMAKE_C_COMPILER not set, after EnableLanguage
- CMake Error: CMAKE_CXX_COMPILER not set, after EnableLanguage
- -- Configuring incomplete, errors occurred! ERROR: Configure failed

#### Resolving vcvarsall.bat

1. Open Visual Studio and check if the license is updated.
2. If Visual Studio says the license is expired:
   a) Create an account for a renewed license for Visual Studio 2015
   b) Look for Programs and Features on the computer - you may have to update the license before starting this process and restart the computer at one point and try again
   c) Look for the Visual Studio 2015 program
   d) Right-click "Visual Studio 2015"
   e) Click "Change"
   f) Click continue/update/install until the page shown on Stack Overflow (http://stackoverflow.com/questions/33323172/vcvarsall-bat-needed-for-python-to-compile-missing-from-visual-studio-2015-v-1) appears
   g) Select Programming Language > Visual C++ > Common Tools for Visual C++ 2015, AND Windows and Web Development

### Subsidiary Knowledge

#### To add something to the PATH

1. Go to Control Panel
2. Select System
3. Advanced System Properties
4. Environment Variables
5. Select PATH, then click the "Edit" button
6. Add the address/path, separated by semicolons and a space

#### To switch from H to C drive

`CHDIR /D C:`

## Installation of HED Neural Networks on OSX

1. Go to https://github.com/s9xie/hed
2. Install prerequisites for Caffe:
   a) Install Homebrew
   b) Follow the instructions
3. Get the modified Caffe for HED:
   a) Install git
   b) Clone the s9xie/hed repository, then download the pre-trained model and place it in the examples/hed folder of the repository clone

A possible solution to the following error, "ImportError: No module named _caffe", can be found at https://github.com/BVLC/caffe/issues/263.

## Backpropagation

This is a high-level explanation of backpropagation, summarizing the following links: [3], [4], [5]. For a more thorough mathematical explanation, refer to the sources linked at the bottom of the section. For beginners, it will be helpful to view or review concepts on neural networks,[6] the feed-forward algorithm, the chain rule,[7] and gradient descent.[8]

Before explaining backpropagation, we must first develop the intuition for it. The purpose of training a network is to find the set of weights, W, that minimizes the cost function, J(W). The cost function, which depends on the weights, measures the performance of our network. Because each weight contributes its own dimension, the cost function is complex and multidimensional,[9] and there may be multiple local minima and maxima. Therefore, it is necessary to use an algorithm called gradient descent[10] to minimize the cost. Gradient descent repeatedly updates the weights of the network, moving them toward a minimum or optimal value. The following will be called the update rule:[11]

W:= W - α*(∂J/∂W)

Let us denote the cost function as J(W). The updated weight (on the left-hand side) is the original weight minus the derivative of the cost function with respect to the weight itself, scaled by the learning rate, alpha. So as the change in cost with respect to the change in weight becomes smaller, we approach our optimal weight, the W on the left-hand side. The learning rate is chosen at the discretion of the researcher.[12]

A more efficient variant of gradient descent is momentum. Momentum accumulates a velocity term from past gradients, so the weights move faster along directions in which the gradient is consistent. This increased efficiency trains the network faster. The momentum update rule is denoted as:[13]

W := W + V = W + μV - α*(∂J/∂W)

V := μV - α*(∂J/∂W) is the velocity update rule. μ (mu) is the factor by which our velocity V decays.

This process is the conceptual backbone of backpropagation.
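The two update rules above can be sketched on a tiny example. The one-dimensional cost J(W) = (W - 3)^2, the learning rate alpha, and the decay factor mu below are illustrative choices, not values from any source:

```python
# Plain gradient descent vs. momentum on J(W) = (W - 3)^2,
# whose minimum is at W = 3.

def grad(W):
    return 2 * (W - 3)                      # dJ/dW for J(W) = (W - 3)^2

def gradient_descent(W=0.0, alpha=0.1, steps=200):
    for _ in range(steps):
        W = W - alpha * grad(W)             # W := W - a*(dJ/dW)
    return W

def momentum(W=0.0, alpha=0.1, mu=0.9, steps=200):
    V = 0.0
    for _ in range(steps):
        V = mu * V - alpha * grad(W)        # V := mu*V - a*(dJ/dW)
        W = W + V                           # W := W + V
    return W

print(gradient_descent())  # → ~3.0
print(momentum())          # → ~3.0
```

On a cost surface this simple, both rules converge to the minimum; momentum's payoff appears on harder surfaces where the gradient points consistently in one direction over many steps, so the accumulated velocity carries the weights forward.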

However, learning multiple features is still computationally intensive, and finding the weights is difficult if we rely on an algorithm that requires information about the task and a guess-and-check method to see whether a feature is effective. For example, it is inefficient to constantly guess and check whether a random change in the weight on an edge enhances the performance of the network: multiple training cases must be fed forward just to verify that one weight change was effective. A different method is to compute how changes in the activities of the hidden units affect the error, and from that compute how to change the weights. This is more efficient because the number of weights is greater than the number of hidden activities. Calculating each weight's effect directly by perturbation is inefficient and impractical.[14]

--> (input) ---w1---> (h1) ---w2--->(h2) ---w3---> (output) ----->

A neural network such as this is created first by a feed-forward algorithm. Each hidden layer has an activation function f. The hidden layer has the hidden neurons h1 and h2: h1 has an input weight of w1 and an output weight of w2; h2 has an input weight of w2 and an output weight of w3. To recap, in a neural network the output depends on its previous weights and hidden units.[15]

The output = f(w3 * h2), h2 = f(w2 * h1), and h1 = f(w1 * input). By substitution of h2: output = f(w3 * f(w2 * h1)). By substitution of h1: output = f(w3 * f(w2 * f(w1 * input))).

The derivative of the output with respect to weight one, without loss of generality, by the chain rule [16]:

(∂output/∂w1) = (∂output/∂h2)*(∂h2/∂h1)*(∂h1/∂w1)

We will now add one more layer, J box, that calculates the error and returns it[17].

--> (input) ---w1---> (h1) ---w2--->(h2) ---w3---> (output) -----> [ J ] ---> error

Without loss of generality, the derivative of the error with respect to weight 1 will be [18]:

(∂error/∂w1) = (∂error/∂output)*(∂output/∂h2)*(∂h2/∂h1)*(∂h1/∂w1)

This is called the error derivative. The error derivatives with respect to the hidden activities indicate how fast the error changes given a change in the activity of a hidden unit. The activity of a hidden unit can alter the output units and the error: “Each hidden activity can affect many output units and can therefore have many separate effects on the error. These effects must be combined”. The error derivatives for all hidden units can be computed simultaneously in an efficient manner, and their results can be easily converted into the error derivatives for the “weights that go into a hidden unit”.[19] This error derivative is computable, given an activation function and a cost function (which will be defined). Therefore, via gradient descent, moving in the direction opposite to the arrows of the neural network, one can find a more optimal weight.

Essentially, backpropagation is the process of finding the most optimal weights by working backwards from J to any weight (in this example, w1) using the chain rule.
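The chain of derivatives above can be checked numerically on a tiny network. In this sketch the sigmoid activation, the squared-error cost, and all weight and input values are made-up choices for illustration:

```python
import math

# A three-weight network: output = f(w3 * f(w2 * f(w1 * input))),
# with f = sigmoid and squared-error cost J.
def f(x):
    return 1.0 / (1.0 + math.exp(-x))   # sigmoid activation

def f_prime(x):
    return f(x) * (1.0 - f(x))          # derivative of the sigmoid

w1, w2, w3 = 0.5, -0.3, 0.8
x, target = 1.0, 0.0

# Forward pass, keeping each pre-activation for the backward pass.
z1 = w1 * x;  h1 = f(z1)
z2 = w2 * h1; h2 = f(z2)
z3 = w3 * h2; out = f(z3)
error = 0.5 * (out - target) ** 2

# Backward pass: apply the chain rule link by link, starting from J.
d_out = out - target                 # ∂error/∂output
d_h2 = d_out * f_prime(z3) * w3      # ∂error/∂h2
d_h1 = d_h2 * f_prime(z2) * w2       # ∂error/∂h1
d_w1 = d_h1 * f_prime(z1) * x        # ∂error/∂w1

# Check against a finite-difference estimate of ∂error/∂w1.
eps = 1e-6
out_shifted = f(w3 * f(w2 * f((w1 + eps) * x)))
numeric = (0.5 * (out_shifted - target) ** 2 - error) / eps
print(d_w1, numeric)  # the two estimates agree closely
```

Each line of the backward pass corresponds to one factor in the chain-rule product for (∂error/∂w1), which is why the computation runs from J back toward the input.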

For a more mathematical explanation of backpropagation, please refer to MIT Course 6.034[20] or the Stanford class website on CNN visual recognition.[21]