Abstract: This PDSG workshop introduces basic concepts of the grandfather of neural networks - the Perceptron. Concepts covered are history, algorithm and limitations.
Level: Fundamental
Requirements: No prior programming or statistics knowledge required.
2. Initial History
• Neural Networks have been around a long time.
• 1943 - Warren McCulloch, a neurophysiologist and Walter Pitts,
a mathematician, published a paper on how neurons might work.
They modeled a simple neural network with electrical circuits.
• 1949 - The Organization of Behavior, by Donald Hebb reinforced
the concept of neurons.
• 1950s - Nathaniel Rochester of the IBM research laboratories
led the first effort to simulate a neural network.
• 1959 - Bernard Widrow and Marcian Hoff of Stanford developed
the first real neural network – MADALINE.
• 1969 - Marvin Minsky and Seymour Papert's book Perceptrons
kicked off a period of disillusionment, with little research
continuing until 1981. They demonstrated that the Perceptron
could not model an XOR operation.
3. Neuron
Neural Networks consist of Neurons.
[Figure: three inputs X1, X2, X3, each with a weight W1, W2, W3, feeding a single neuron that produces an output value.]
• Inputs – the features (independent variables) in the dataset.
• Weights – the importance of how each feature contributes to the output.
• Neuron – the model (predictor).
• Output Value – the prediction. Can be a real value, a probability, binary, or categorical.
4. Neuron – Categorical Output
Neural Networks consist of Neurons.
[Figure: inputs X1, X2, X3 with weights W1, W2, W3 feed a single neuron, whose output feeds three output nodes Y1, Y2, Y3.]
• Outputs are categorical (e.g., Apple, Pear, Banana).
• The neuron outputs only a single value.
• Output nodes Y1, Y2, and Y3 each weight the output from the neuron and make a separate calculation for their final output.
5. Neuron - Details
Neural Networks consist of Neurons.
[Figure: inputs X1, X2, X3 with weights W1, W2, W3 feed a single neuron that produces an output value.]
• Normalize (0..1) or standardize the inputs (feature scaling) so no input dominates another.
• The neuron computes Ø( ∑ wᵢ · xᵢ ), for i = 0..n – a summation of the weighted inputs, passed through an activation function Ø.
• The higher the weight, the more that input contributes to the outcome (prediction).
• Backward propagation adjusts (learns) the weights (e.g., Gradient Descent).
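The weighted-sum-plus-activation computation above can be sketched in a few lines of Python. The function names and the example numbers are illustrative, not from the slides:

```python
# A minimal sketch of a single neuron: a weighted sum of the inputs
# passed through an activation function Ø.

def neuron(inputs, weights, activation):
    """Compute Ø( sum of w_i * x_i ) for one neuron."""
    z = sum(w * x for w, x in zip(weights, inputs))
    return activation(z)

# Threshold activation: output 1 if the weighted sum is >= 0, else 0.
def threshold(z):
    return 1 if z >= 0 else 0

# Weighted sum = 1.0*0.4 + 0.5*0.6 + (-2.0)*0.1 = 0.5, so the neuron fires.
print(neuron([1.0, 0.5, -2.0], [0.4, 0.6, 0.1], threshold))  # 1
```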
6. Activation Functions
• Most Common
• Threshold – either a zero or a one is output (binary):
Ø(x) = 1 if x ≥ 0, 0 if x < 0
• Sigmoid – a curve that converges exponentially towards 0 for
x < 0 and towards 1 for x > 0:
Ø(x) = 1 / (1 + e⁻ˣ)
Also referred to as a squashing function, squashing the output
between 0 and 1. Popularly used in output nodes for
probability prediction.
7. Activation Functions
• Most Common
• Hyperbolic Tangent – converges to -1 for x < 0 and 1 for x > 0:
Ø(x) = (1 − e⁻²ˣ) / (1 + e⁻²ˣ)
Also referred to as a squashing function, squashing the output
between -1 and 1.
• Rectifier – 0 if x ≤ 0, otherwise x:
Ø(x) = max(0, x), or equivalently Ø(x) = 0 if x ≤ 0, x if x > 0.
Popularly used in hidden layers for outputting to the next layer.
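The four activation functions above translate directly into Python (here using the standard-library `math` module; `math.tanh` is equivalent to the (1 − e⁻²ˣ)/(1 + e⁻²ˣ) form):

```python
import math

def threshold(x):
    # Binary: 1 if x >= 0, else 0.
    return 1 if x >= 0 else 0

def sigmoid(x):
    # Squashes any real x into (0, 1).
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # Squashes any real x into (-1, 1).
    return (1 - math.exp(-2 * x)) / (1 + math.exp(-2 * x))

def relu(x):
    # Rectifier: 0 for x <= 0, otherwise x.
    return max(0.0, x)
```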
8. Fully Connected Neural Network (FCNN)
• A Fully Connected Neural Network consists of:
• Input Layer – inputs from the data (samples).
• Output Layer – the predictions.
• Hidden Layer(s) – Between the input and output layers,
where the learning occurs.
• All nodes are connected to every other node in the next layer.
• Activation Functions – where outputs are binary, squashed, or
rectified.
• Forward Feeding and Backward Propagation - for learning the
weights.
9. Fully Connected Neural Network (FCNN)
[Figure: inputs X1, X2, …, Xn form the Input Layer, feeding one Hidden Layer of nodes, which feeds a single output node ŷ in the Output Layer.]
Simple FCNN:
• One hidden layer, using the Rectifier activation function (ReLU) – if below zero, output no signal.
• One output node, using the Sigmoid activation function – squash into a probability.
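The simple FCNN above can be sketched as a plain-Python forward pass: a ReLU hidden layer feeding a single sigmoid output node. The weight values are illustrative:

```python
import math

def relu(x):
    return max(0.0, x)

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, hidden_weights, output_weights):
    # Hidden layer: each node is connected to every input (fully connected)
    # and applies ReLU to its weighted sum.
    hidden = [relu(sum(w * xi for w, xi in zip(ws, x))) for ws in hidden_weights]
    # Output node: weighted sum of the hidden outputs, squashed to a probability.
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)))

y_hat = forward([0.2, 0.7], [[0.5, -0.3], [0.8, 0.1]], [1.2, -0.4])
print(0.0 < y_hat < 1.0)  # True: the sigmoid output is always a probability
```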
10. Deep Neural Network (DNN)
[Figure: inputs X1, X2, …, Xn form the Input Layer, feeding multiple Hidden Layers, which feed the output node ŷ in the Output Layer.]
It’s a Deep Neural Network if it has more than one hidden
layer – that’s it!
11. Hidden Nodes are Specialized Learners
[Figure: inputs Age and Income (sample: age < 25, income < 1000) feed a hidden node with weights W1-1 and W2-1, which feeds the output ŷ (Spending).]
• Each node in the hidden layer specializes.
• Example: a node learns weights to best predict when age is young (18–25) and income is low (i.e., they spend their parents’ money). It outputs a high signal for such samples, and a low or no signal otherwise.
• The more hidden nodes, the more specialized learners.
12. Cost Function
Calculate Cost (Loss) During Training
[Figure: inputs Age and Income (sample: age < 25, income < 1000) with weights W1-1 and W2-1 feed the predicted spending ŷ; the data provides the actual value y (the label).]
• The cost compares the predicted value ŷ and the actual value y:
C = ½ (y − ŷ)²
• This quadratic cost is one of the most commonly used cost functions for neural networks.
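The quadratic cost is a one-liner; summing it over a batch of (label, prediction) pairs gives the total training cost used in the next slide. Function names are illustrative:

```python
# C = 1/2 * (y - yhat)^2 for one sample, and its sum over a batch.

def cost(y, y_hat):
    return 0.5 * (y - y_hat) ** 2

def total_cost(labels, predictions):
    return sum(cost(y, p) for y, p in zip(labels, predictions))

print(cost(1.0, 0.8))  # 0.5 * 0.2^2, i.e. about 0.02
```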
13. Feed Forward - Training
Feed Forward Training Loop
• Feed a single row of the training data at a time into the neural network.
• Calculate the cost (loss) for each row, C = ½ (y − ŷ)², and its summation over the training set, ∑ C = ∑ ½ (y − ŷ)².
• Adjust Weights – make small adjustments to the weights in the neural network.
• Converged? (Can’t minimize the cost function anymore.)
• No – run the training set again through the neural network. Each run is called an Epoch.
• Yes – stop. The result is the trained neural network.
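The training loop above can be sketched for a single sigmoid neuron: feed each row, compute the error, nudge the weights downhill (gradient descent), and repeat over epochs. The dataset, learning rate, and epoch count are all illustrative:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Tiny illustrative dataset: learn y = 1 when the single feature is positive.
data = [(-2.0, 0), (-1.0, 0), (1.0, 1), (2.0, 1)]
w, b, lr = 0.0, 0.0, 0.5

for epoch in range(200):              # each pass over the data is an Epoch
    for x, y in data:                 # feed a single row at a time
        y_hat = sigmoid(w * x + b)    # feed forward
        # Gradient of C = 1/2 (y - yhat)^2 through the sigmoid:
        grad = (y_hat - y) * y_hat * (1 - y_hat)
        w -= lr * grad * x            # small adjustments to the weights
        b -= lr * grad

print(sigmoid(w * 2.0 + b) > 0.8)  # True: confident on a positive input
```

In a real FCNN the same idea is applied layer by layer via backward propagation; here there is only one weight and one bias, so the chain rule is a single line.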
14. Multiple Output Nodes - Softmax
• Squashes a set of input values into values between 0 and 1
(probabilities), all adding up to 1.
[Figure: features x1, x2, x3 form the Input Layer, feeding a Hidden Layer; the Output Layer’s predicted (real) values z1, z2, z3, …, zk are passed through Softmax, giving f(zi) ∈ (0, 1) for each.]
• The results are classification probabilities, e.g., 90% apple, 6% pear, 3% orange, 1% banana.
• Each output node specializes on a different classification.
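Softmax exponentiates each output value and normalizes by the total, so the results land in (0, 1) and sum to 1. A minimal sketch (the max-subtraction is a standard numerical-stability step, not from the slides):

```python
import math

def softmax(zs):
    # Subtract the max before exponentiating to avoid overflow.
    m = max(zs)
    exps = [math.exp(z - m) for z in zs]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([3.0, 1.0, 0.2])
print(round(sum(probs), 6))  # 1.0: the probabilities add up to one
```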
15. Final Note – Training vs. Prediction
• Once we have trained the neural network, we do not have to
repeat the training steps when using the model for prediction.
• No repeating of epochs, gradient descent, or backward propagation – prediction needs only a single forward pass.
• The model will therefore run much faster than during training.