This post shows how to build a deep neural network from scratch in Julia for the classic appication of building a clasifier of images of cats
Some notes before start.
JuliaCall
with the chunk’s language set up to julia
, however the following functions can be executed directly on a Julia IDE, actually the first time I coded this exercise was in Atom customize to Juno.library(JuliaCall);
options(JULIA_HOME="C:/Users/aaron/AppData/Local/Programs/Julia-1.6.1/bin")
julia_setup(rebuild=TRUE)
#Test code in Julia
=sqrt(2);
a println(a)
1.4142135623730951
This functions uses load_data()
an auxiliary function presented in the appendix of auxiliary functions, here.
using HDF5, Images, ImageView, TestImages
#---------- Load the data
, trset_y_orig, tset_x_orig, tset_y_orig, classes=load_data()
trset_x_orig#Example Image: The index can be changed by the user
=trset_x_orig[:,:,:,129];
ExampleIm#Convert the UInt8 hexadecimal format to N0f8 (ranging from 0 to 1)
=normedview(ExampleIm);
B=colorview(RGB,B[1,:,:],B[2,:,:],B[3,:,:])'
C; imshow(C)
Image with index 129 is the following image of a cat, in Julia is represented as an 3x64x64 array, a image of 64-by-64 pixels in 3 channels, RGB.
In this step, we create a function to flatten the data into a vector (from a 3-dimensional array) and normalize by 255 to have pixel values between zero and one.
#Function to vectorize the images
function flatData(X::Array)
=reshape(X, :, size(X)[end])/255
flatXreturn flatX
end
=flatData(trset_x_orig);
trset_x=flatData(tset_x_orig); tset_x
Here we have 209 images in the training set and 50 images in the test set.
We are going to use a 5-layer NN (four hidden layers) with:
We specified this architecture into an array with the dimensions of the weights matrices per layer. We also use init_param()
, presented here to initialize the parameters of the NN.
=[size(trset_x)[1],20,15,10,5,1];
dim_layers=init_param(dim_layers); Param
For example, this implies that in the first layer, the weight matrix \(W_1\) is \(20\times 12288\) because we take the flatten image of the input layer (\(3\cdot 64\cdot 64=12288\)) and we multiply by the weights in each of the 20 neurons; the associated bias is \(b_1\) that is \(20\times 1\). In the second layer, the weight matrix \(W_2\) is \(15\times 20\) because we take the output from the first layer (\(20\times 1\)) and we multiply by the weights in each of the 15 neurons of the second layer; the associated bias is \(b_2\) that is \(15\times 1\).
In this model our parameters to learn are \((W_i,b_i)\) with \(i=1,2,3,4,5\). That amount to \(258,316\) constants to optimize (\(12888\cdot 20+20+15\cdot 20+15+10\cdot 15+10+5\cdot 10+5+5\cdot 1+1\)).
We are going to use the cross entropy cost functions. It is implemented here. Let \(\hat{a}\) be the output vector from the last layer of the NN; then the cost functions is given by: \[C=-\frac{1}{N}\sum_{i=1}^N y_i \ln(\hat{a}_i)+(1-y_i)\ln(1-\hat{a}_i)\] where \(N\) is the number of examples in the training set.
We will use two activation functions: in the first 4 hidden layers each neuron will use a ReLu activation and in the final layer the activation will be of the Sigmoid function.
The ReLU function is implemented here and it is given by: \[f(x)=\max(0,x)\]
The Sigmoid function is implemented here and it is given by: \[\sigma(x)=\frac{1}{1+\exp(-x)}\] Notice that the sigmoid function is always between zero and one.
During training we will do \(S=10000\) iterations, where each iterations performs a forward propagation, computes the associated cost, does backward propagation and finally updates the parameters before the following iteration. Performing a fixed number of iterations during optimization is rather simple, it would be better to create a flag that stops execution after achieving a target accuracy in the training sample.
When optimizing using gradient descent, a learning rate \(\alpha\) is used during the parameter update:
\[\theta_{t+1}\leftarrow\theta_t+\alpha \frac{\partial C}{\partial \theta}\bigg\rvert_{\theta=\theta_t}\]
In this exercise we set \(\alpha=0.0075\). There is whole literature about the role of the learning rate and how to set it.
Training the model will be done using gradient descent to optimize the cost function. This entails:
These steps are combined in the following L_layer_model()
function.
"""
Function to combine all sustantive function into a deep cat classifier
"""
function L_layer_model(X::Array,Y::Array,layer_dims::Array, learning_rate::Float64,num_iter::Int,print_cost::Bool)
#Step 1: Initialize parameters
=init_param(layer_dims); println(param["W1"][1:2,1:2])
param#Step 2: Gradient descent loop
for i in 1:num_iter
#Step 3: Do a complete forward pass
,caches=complete_forward(X, param)
AL#Step 4: Compute the current cost
=cost_function(AL,Y)
cost#Step 5: Do a complete backward pass
=complete_backward(AL,Y,caches)
grads#Step 6: Update parameters
=update_param(param,grads,learning_rate)
param#Step 7: Print cost
if print_cost & ((i%100)==0)
"The cost after iteration ", i , " is ",cost)
println("W1"][1:2,1:2])
println(param[end
end
#Step 8: Return the parameters
return param
end
=0.0075; nIter=10000;
lrate#Train the model
@time parameters=L_layer_model(trset_x,trset_y,
, lrate, nIter, true) dim_layers
Executing the function with 10000 iterations should not take long (under a couple of minutes) and in the execution we should see the cost function values decreasing.
The next step is to evaluate the accuracy of the model in both the training and the test set. We use the predict()
function implemente here. In this functions if the predicted probability of \(Y=1\), is bigger than 50%, then we label that image as of a cat, and if the probability is less than 50% then it is labelled as of not a cat. This threshold can be adjusted according the application, here we opt for a parsimonious approach.
#Check the Accuracy on the train set
= predict(trset_x, trset_y, parameters)
pred_train
#Check the Accuracy on the test set
=reshape(tset_y_orig,(1,length(tset_y_orig)));
tset_y= predict(tset_x, tset_y, parameters) pred_test
The previous code gives an accuracy on the training set of \(99.52\%\) and of \(66\%\) on the test set, not bad for a NN without using computer vision tricks.
: 0.9952153110047847
Training Accuracy: 66 Test Accuracy
Finally we illustrate how the NN predicts an image in the test set.
=50
index=tset_x_orig[:,:,:,index];
Example_Test_Im#Convert the UInt8 hexadecimal format to N0f8 (ranging from 0 to 1)
=normedview(Example_Test_Im);
B=colorview(RGB,B[1,:,:],B[2,:,:],B[3,:,:]) C
We see that the image is that of a cat, the NN has not seen this image and correctly identifies that the image has a cat!
if pred_test[index]==tset_y[index]
"Correct! The label is correct.")
println(if pred_test[index]==1
"It's a CATTO! "); println("It's a CATTO! "); println("It's a CATTO! ")
println(elseif pred_test[index]==0
"No Cat! "); println("No Cat! "); println("No Cat! ")
println(end
else
"!!! The label is incorrect !!!")
println(end
! The label is correct.
Correct's a CATTO!
It's a CATTO!
It's a CATTO! It
The h5
files are in a folder named datasets
.
#Function to read the h5 file and obtain the matrices for the images
function load_data()
#Train data
=h5open("datasets\\train_catvnoncat.h5","r")
train_data=read(train_data["train_set_x"]) # your train set features
train_set_x_orig=read(train_data["train_set_y"]) # your train set labels
train_set_y_orig#Test data
=h5open("datasets\\test_catvnoncat.h5","r")
test_data=read(test_data["test_set_x"]) # your train set features
test_set_x_orig=read(test_data["test_set_y"]) # your train set labels
test_set_y_orig#Classes
=read(test_data["list_classes"])
classes#Reshape the matrices
return train_set_x_orig, train_set_y_orig, test_set_x_orig, test_set_y_orig, classes
end
#Computes the Sigmoid function to the entries of the array A
function Sigmoid(A::Array)
=A
cache=(exp.(-A).+1).^(-1)
Zreturn Z,cache
end
#sigmoid backward
function sigmoid_backward(dA::Array,cache::Array)
=reshape(cache,size(dA));
Z=(exp.(-Z).+1).^(-1);
S=dA.*(S.*(-S.+1));
dZ@assert size(dZ)==size(Z)
return dZ
end
#ReLu Function
#Computes the ReLU activation function to the entries of the array A
function ReLu(A::Array)
=A
cache=max.(A,0)
Z@assert size(A)==size(Z)
return Z,cache
end
#Relu backward
function relu_backward(dA::Array,cache::Array)
=cache; dZ=copy(dA); dZ[Z.<0].=0;
Z@assert size(dZ)==size(Z)
return dZ
end
#Parameter initializing
"""
Function to initialize the parameter of all the layers
"""
function init_param(layer_dim::Array)
=Dict();
paramfor l in 2:length(layer_dim)
"W",(l-1))]=randn((layer_dim[l],layer_dim[l-1]))/sqrt(layer_dim[l-1]);
param[string("b",(l-1))]=zeros(layer_dim[l],1)
param[string(@assert size(param[string("W",(l-1))])==(layer_dim[l], layer_dim[l-1])
@assert size(param[string("b",(l-1))])==(layer_dim[l],1)
end
return param
end
#Forward Linear Activation
"""
Function to compute the forward linear activation of a layer
"""
function linear_activation(A_prev::Array, W::Array, b::Array, activation::String)
=Dict("A_prev"=>A_prev,"W"=>W,"b"=>b);
linear_cache=W*A_prev.+b;
Zif activation=="sigmoid"
,act_cache=Sigmoid(Z)
Aelseif activation=="relu"
,act_cache=ReLu(Z)
Aend
=Dict("LinCache"=>linear_cache,"ActCache"=>act_cache)
cachereturn A,cache
end
#Complete forward propagation
function complete_forward(X::Array, parameters::Dict)
=X
A=Int(length(parameters)/2)
L=[]
cachesfor l in 1:(L-1)
=A
A_prev,cache=linear_activation(A_prev,parameters[string("W",l)], parameters[string("b",l)], "relu")
A=vcat(caches,cache)
cachesend
,cache=linear_activation(A,parameters[string("W",L)], parameters[string("b",L)], "sigmoid")
AL=vcat(caches,cache)
cachesreturn AL, caches
end
"""
Function for the cross entropy cost
"""
function cost_function(AL::Array,Y::Array)
=length(Y)
m=(-1/m)*sum(Y.*log.(AL).+(-Y.+1).*log.(-AL.+1))
costreturn cost
end
#Linear_backward step
"""
Function to compute one step of linear-backward derivatives
"""
function Linear_backward(dZ::Array,cache::Dict)
=cache["A_prev"]; W=cache["W"]; b=cache["b"];
A_prev=size(A_prev)[2]; #dZ=reshape(dZ,(1,length(dZ)))
m=(1/m)*dZ*A_prev'
dW=(1/m)*sum(dZ,dims=2)
db=W'*dZ
dA_prev
@assert size(A_prev)==size(dA_prev)
@assert size(W)==size(dW)
@assert size(b)==size(db)
return dA_prev,dW,db
end
#Linear activation backward
"""
Function to compute the backward prop of a linear activation layer
"""
function linear_act_back(dA::Array,cache::Dict,activation::String)
=cache["LinCache"]
linear_cache=cache["ActCache"]
activation_cacheif activation=="relu"
=relu_backward(dA, activation_cache);# print(dZ)
dZ, dW, db=Linear_backward(dZ, linear_cache)
dA_prevelseif activation=="sigmoid"
=sigmoid_backward(dA, activation_cache);# print(dZ)
dZ, dW, db=Linear_backward(dZ, linear_cache)
dA_prevend
return dA_prev, dW, db
end
"""
Function that computes the whole backward pass
"""
function complete_backward(AL::Array,Y::Array,caches::Array)
=Dict()
grads=length(caches)
L#AL=reshape(AL,m);
=reshape(Y,(1,length(Y)))
Y#Initialize the back prop
=-((Y./AL)-(-Y.+1)./(-AL.+1))
dAL=caches[L]
current_cache"dA",L)], grads[string("dW",L)], grads[string("db",L)] = linear_act_back(dAL, current_cache, "sigmoid")
grads[string(#Do the whole backward pass
for l in 1:(L-1)
=caches[(L-l)]
current_cache, dW_temp, db_temp = linear_act_back(grads[string("dA",(L-l+1))], current_cache, "relu")
dA_prev_temp"dA",(L-l))]=dA_prev_temp
grads[string("dW",(L-l))]=dW_temp
grads[string("db",(L-l))]=db_temp
grads[string(end
return grads
end
#Function to update parameters
"""
Function to update the parameters via gradient descent
"""
function update_param(param::Dict, grads::Dict, learning_rate::Float64)
=Int(length(param)/2)
Lfor l in 1:L
"W",l)]=param[string("W",l)]-learning_rate*grads[string("dW",l)]
param[string("b",l)]=param[string("b",l)]-learning_rate*grads[string("db",l)]
param[string(end
return param
end
#Function to construct predictions given parameters
function predict(X::Array,Y::Array,param::Dict)
=size(X)[2]
m,caches=complete_forward(X,param)
probs=1*(probs.>0.5)
p=sum(p.==Y)/m
acu"Probs: ", string(probs))
println("Predictions: ", string(p))
println("True labels: ", string(Y))
println("Accuracy: ",acu)
println(return p
end