Some DL.

Published: Mon 21 October 2024
By Alex

In python.

Deep Learning Tools

Machine Learning

Differences and similarities between DL and ML. In ML, a model receives data, extracts patterns, and builds a representation that best fits the data. When you then feed it new data, the model assigns a class or label to each datapoint. Learning, in other words, is storing collections of patterns that are used to make assumptions about new input. DL is a subtype of ML.
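As a toy illustration of this fit-then-predict idea, here is a minimal nearest-centroid classifier in numpy; the data and class labels are made up for the example, not taken from any real dataset.

```python
import numpy as np

# "Learning": store one pattern (the mean vector) per class.
train_x = np.array([[1.0, 1.1], [0.9, 1.0], [5.0, 5.2], [5.1, 4.9]])
train_y = np.array([0, 0, 1, 1])
centroids = {c: train_x[train_y == c].mean(axis=0) for c in np.unique(train_y)}

# "Inference": label new input by the closest stored pattern.
def predict(x):
    return min(centroids, key=lambda c: np.linalg.norm(x - centroids[c]))

print(predict(np.array([0.8, 1.2])))  # -> 0
print(predict(np.array([4.8, 5.0])))  # -> 1
```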

Deep

The "deep" refers to a deep structure of layers stacked on top of each other. It is loosely inspired by the structure of the brain, where layers of neurons perform successive steps to identify patterns and categorize the world around us. This deep stack of abstract layers is interconnected, and the connections are shaped by the data fed into it.

Deep learning differs from classical ML in that you do not always need to perform manual feature extraction for the model (which is usually necessary in ML): representation learning can learn features directly from raw input.

Another difference between DL and ML is that large amounts of data are crucial for DL; without massive amounts of training data, a deep model is often less accurate than simpler ML models.

A DL application typically needs to process thousands, sometimes millions, of objects before it starts reacting appropriately when such an object appears in the system.

Simultaneous Multiple Computations Along Neural Networks

Recognizing objects the way humans do was something computers found difficult until Rosenblatt's perceptron algorithm was developed in 1957. It was able to solve simple image recognition problems. The Multilayer Perceptron (feedforward network) was later derived from it: input is processed sequentially through multiple layers of neurons, each making decisions and passing a meaningful output forward.
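A minimal sketch of the classic perceptron learning rule in numpy (a toy example on made-up, linearly separable data, not Rosenblatt's original implementation):

```python
import numpy as np

# Toy, linearly separable data: label is 1 only when both inputs are 1 (logical AND).
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0, 0, 0, 1])

w = np.zeros(2)
b = 0.0
lr = 0.1

for epoch in range(20):
    for xi, target in zip(X, y):
        prediction = 1 if np.dot(w, xi) + b > 0 else 0   # step activation
        error = target - prediction
        w += lr * error * xi                              # perceptron update rule
        b += lr * error

print([1 if np.dot(w, xi) + b > 0 else 0 for xi in X])    # -> [0, 0, 0, 1]
```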

GPU

The processing power of high-performance graphics processing units is what makes DL run fast compared to execution on a CPU. GPUs are essential for deep learning because they can process matrices, vectors, and scalars in parallel.

Common Deep Learning Concepts

Types of Deep Learning

| Type | Input Data | Principle | Year of Breakthrough |
|------|------------|-----------|----------------------|
| Reinforcement Learning | Various | Trains agents to make sequential decisions by interacting with an environment to maximize cumulative rewards. | 1950s |
| Recurrent Neural Networks (RNNs) | Sequential data (e.g., text) | Use recurrent layers to maintain state across inputs and time steps. | 1986 |
| Convolutional Neural Networks (CNNs) | Grid-like data (e.g., images) | Use convolutional layers to extract features, pooling layers to reduce feature map size, and fully connected layers for classification. | 1989 |
| Autoencoders | Various | Learn data representations by compressing and decompressing input data. | 2006 |
| Transfer Learning | Various | Uses pre-trained models to solve new tasks by freezing some layers and retraining others. | 2014 |

Maths. Linear Algebra.

Understanding linear algebra is crucial for DL.

Chollet's book quickly covers the data structures and data types most relevant to TensorFlow (tensors are simply n-dimensional arrays), shown here as Python (numpy) objects.

import numpy as np

# scalar
x = 1

# matrix
z = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

# vector (the last row of z)
y = z[-1]  # [7, 8, 9]

# rank-3 tensor with shape (3, 2, 5), printed level by level below
tensor = np.zeros((3, 2, 5))
"""
Level 1:
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Level 2:
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]

Level 3:
[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]
"""
  • The array tensor is a 3D tensor with shape (3, 2, 5).
  • It has 3 levels along the first axis, each containing a 2x5 matrix (a list of lists).
  • The docstring above shows the three levels printed one by one.

Jake VanderPlas covers the dimensionality (shapes) of nested numpy arrays and matrix algebra; I have assembled that material in another article here: <>.

Scalars/Matrix Addition & Multiplication

# scalar/matrix addition & multiplication
# Given:
matrix_a = np.array([[1, 2, 3], [4, 5, 6]])
matrix_b = np.array([[7, 8], [9, 10], [11, 12]])

# multiplication
matrix_c = np.dot(matrix_a, matrix_b)  # [[58, 64], [139, 154]]

# which in native Python would be:
matrix_c2 = [[0, 0], [0, 0]]

# for each row i of A and each column j of B:
#   multiply A[i][k] by B[k][j] for every k (k ranges over the columns of A)
#   and sum the products into C[i][j]
for i in range(matrix_a.shape[0]):
    for j in range(matrix_b.shape[1]):
        matrix_c2[i][j] = sum(
            matrix_a[i][k] * matrix_b[k][j] for k in range(matrix_a.shape[1])
        )
'''
Matrix A (2x3) times matrix B (3x2):
the result C has shape (2x2); the reverse product B x A would have shape (3x3).
'''
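A quick sanity check, reusing matrix_a, matrix_b, matrix_c, and matrix_c2 from the snippet above, that the hand-rolled loop agrees with numpy:

```python
assert np.allclose(matrix_c, matrix_c2)   # both give [[58, 64], [139, 154]]
print(np.dot(matrix_b, matrix_a).shape)   # (3, 3) -- the reverse product B x A
```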
(Figure: illustration of matrix multiplication.)

Transpose Matrices

# matrix m transpose
m = np.array([[1, 2, 3], [4, 5, 6]])
m_t = np.transpose(m) # [[1, 4], [2, 5], [3, 6]]

Neural Networks Concept Viewed as a Graph

  • Neurons are like nodes in a graph.
  • Weights are like edges connecting the nodes.
  • Activation functions are like the nodes' output functions.
  • Loss functions are like the cost functions.
  • Optimizers are like the algorithms to minimize the cost functions.

Useful Analogy: Comparing TensorFlow Neural Networks to Execution Graphs

| Aspect | Neural Network | Execution Graph |
|--------|----------------|-----------------|
| Structure | Layers (input, hidden, output) connected by weights. | Visual representation of computations. |
| Nodes | Neurons that perform computations. | Operations (e.g., addition, multiplication). |
| Edges | Weights that connect neurons. | Data paths that carry data structures between operations. |
| Data Flow | Data (tensors) flows from the input layer, through hidden layers, to the output layer. | Data flows from one operation to another, following the graph's structure. |
| Training | Adjusts the weights to minimize error. | Updates the graph's parameters to optimize performance. |
| Example | Input Layer -> Hidden Layer -> Output Layer | Operation A -> Operation B -> Operation C |
| Visual Representation | Input -> [Weights] -> Hidden -> [Weights] -> Output | Node A -> [Data] -> Node B -> [Data] -> Node C |
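As a small illustration of the graph view, TensorFlow can trace a plain Python function into an execution graph with tf.function; the toy shapes below are chosen just for the example.

```python
import tensorflow as tf

@tf.function  # traces the computation into a TensorFlow graph
def dense_layer(x, w, b):
    # one graph node per operation: matmul -> add -> relu
    return tf.nn.relu(tf.matmul(x, w) + b)

x = tf.random.normal((1, 4))
w = tf.random.normal((4, 3))
b = tf.zeros((3,))
print(dense_layer(x, w, b).shape)  # (1, 3)
```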

Why Are Activation Functions Applied to Tensors?

Activation functions are applied to the tensors at each layer of a neural network. They introduce non-linearity into the model; without them, stacked layers would collapse into a single linear transformation. The most common activation functions used in DL are:

  1. Sigmoid: tf.keras.activations.sigmoid (0 to 1), used in the output layer of a binary classification problem.
  2. Tanh: tf.keras.activations.tanh (-1 to 1)
  3. ReLU: tf.keras.activations.relu (0 to infinity), used in the hidden layers of a neural network.
  4. Leaky ReLU: tf.keras.layers.LeakyReLU (like ReLU, but allows a small negative slope instead of zero for negative inputs)
  5. Softmax: tf.keras.activations.softmax (0 to 1, summing to 1), used in the output layer of a multi-class classification problem.
An equivalent implementation in Python (numpy):

```python
def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x)
    return exp_x / exp_x.sum()
```

Loss Functions

Loss functions measure the error between the predicted value and the actual value; this error signal is then used to update the weights of the model so that the error is minimized. Here are the most common loss functions used in DL:

  1. Mean Squared Error: `tf.keras.losses.mean_squared_error`
  2. Binary Crossentropy: `tf.keras.losses.binary_crossentropy`
  3. Categorical Crossentropy: `tf.keras.losses.categorical_crossentropy`
  4. Sparse Categorical Crossentropy: `tf.keras.losses.sparse_categorical_crossentropy`
  5. Hinge: `tf.keras.losses.hinge`

Implementing MSE in Python (numpy):

```python
def mean_squared_error(y_true, y_pred):
    return np.mean(np.square(y_true - y_pred))
```
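For comparison, here are minimal numpy sketches of binary and categorical cross-entropy; these are simplified versions of what the Keras losses compute, with a small epsilon added for numerical stability.

```python
import numpy as np

def binary_crossentropy(y_true, y_pred, eps=1e-7):
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(y_true * np.log(y_pred) + (1 - y_true) * np.log(1 - y_pred))

def categorical_crossentropy(y_true, y_pred, eps=1e-7):
    # y_true is one-hot encoded, y_pred is a probability distribution per sample
    y_pred = np.clip(y_pred, eps, 1 - eps)
    return -np.mean(np.sum(y_true * np.log(y_pred), axis=-1))

print(binary_crossentropy(np.array([1, 0]), np.array([0.9, 0.2])))                    # ~0.16
print(categorical_crossentropy(np.array([[0, 1, 0]]), np.array([[0.1, 0.8, 0.1]])))   # ~0.22
```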

Optimizers

Optimizers are used to update the weights of the model to minimize the error (minimize the loss function). Here are the most common optimizers used in DL:

  1. SGD: tf.keras.optimizers.SGD
  2. RMSprop: tf.keras.optimizers.RMSprop
  3. Adagrad: tf.keras.optimizers.Adagrad
  4. Adadelta: tf.keras.optimizers.Adadelta
  5. Adam: tf.keras.optimizers.Adam
A minimal numpy sketch of a plain gradient-descent step and a single Adam step:

def gradient_descent(weights, learning_rate, gradients):
    # vanilla gradient descent: step in the direction opposite to the gradient
    return weights - learning_rate * gradients

def adam(weights, learning_rate, gradients, m, v, t, beta1=0.9, beta2=0.999, epsilon=1e-8):
    # keep exponential moving averages of the gradient (m) and its square (v)
    m = beta1 * m + (1 - beta1) * gradients
    v = beta2 * v + (1 - beta2) * gradients**2
    # bias-correct the averages (t is the 1-based step number)
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    # adaptive step; return the state so it can be reused on the next step
    return weights - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon), m, v
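A usage sketch (toy quadratic loss f(w) = w², made-up numbers) showing how the moving averages m, v and the step counter t are carried between calls to the adam function defined above:

```python
import numpy as np

w = np.array([5.0])                 # start far from the optimum of f(w) = w**2
m = np.zeros_like(w)
v = np.zeros_like(w)

for t in range(1, 201):
    grads = 2 * w                   # gradient of w**2
    w, m, v = adam(w, learning_rate=0.1, gradients=grads, m=m, v=v, t=t)

print(w)                            # close to 0 after 200 steps
```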

Regularization

Regularization reduces model complexity and prevents overfitting by adding a penalty to the loss function.

Types of Regularization:

  • L1 Regularization: Adds a penalty equal to the absolute value of coefficients.
    tf.keras.regularizers.L1(l1=0.01)
    
  • L2 Regularization: Adds a penalty equal to the square of coefficients.
    tf.keras.regularizers.L2(l2=0.01)
    
  • L1 and L2 Regularization: Combines both L1 and L2 penalties.
    tf.keras.regularizers.L1L2(l1=0.01, l2=0.01)
    

Usage in Layers:

  • Apply regularization to layers like Dense, Conv2D.
    model = tf.keras.Sequential([
          tf.keras.layers.Dense(128, activation='relu',
          kernel_regularizer=tf.keras.regularizers.L2(0.01)),
          tf.keras.layers.Dense(10, activation='softmax')
    ])
    

Impact on Training:

  • Regularization terms are added to the loss function to control model complexity.

Hyperparameter Tuning:

  • Adjust l1, l2 to balance underfitting and overfitting.

Dropout:

  • Randomly ignore neurons during training.
    tf.keras.layers.Dropout(rate=0.5)
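Under the hood, dropout can be sketched in numpy as follows; this is a simplified "inverted dropout" illustration, not Keras's actual implementation.

```python
import numpy as np

def dropout(activations, rate=0.5, training=True):
    if not training:
        return activations                       # dropout is a no-op at inference time
    keep_prob = 1.0 - rate
    mask = np.random.rand(*activations.shape) < keep_prob
    # scale by 1/keep_prob so the expected activation stays the same
    return activations * mask / keep_prob

h = np.ones((2, 4))
print(dropout(h, rate=0.5))   # roughly half the entries zeroed, the rest scaled to 2.0
```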
    

Backpropagation

Backpropagation computes the gradient of the loss function with respect to each weight by applying the chain rule backwards through the network; an optimizer then uses those gradients to update the weights.

def backpropagation(weights, learning_rate, gradients):
    # strictly speaking this is only the weight-update step;
    # backpropagation itself is the chain-rule computation of `gradients`
    return weights - learning_rate * gradients
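A minimal end-to-end sketch of backpropagation for a 784-128-10 network with ReLU and softmax, assuming cross-entropy loss and a single training example (the chain-rule gradients below follow from that assumption):

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(784)                      # input example
y_true = np.eye(10)[3]                            # made-up one-hot label

W1 = rng.standard_normal((784, 128)) * 0.01
b1 = np.zeros(128)
W2 = rng.standard_normal((128, 10)) * 0.01
b2 = np.zeros(10)

# forward pass
z1 = x @ W1 + b1
h = np.maximum(0, z1)                             # ReLU
z2 = h @ W2 + b2
y_pred = np.exp(z2 - z2.max())
y_pred /= y_pred.sum()                            # softmax

# backward pass (chain rule), for softmax output + cross-entropy loss:
dz2 = y_pred - y_true                             # gradient w.r.t. output pre-activation
dW2 = np.outer(h, dz2)
db2 = dz2
dh = W2 @ dz2
dz1 = dh * (z1 > 0)                               # ReLU derivative
dW1 = np.outer(x, dz1)
db1 = dz1

# gradient-descent update
lr = 0.01
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```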

Mathematics

Traversing Neural Network with Mathematical Operations

A neural network can be represented mathematically using matrices and vectors; for example, the bias vector of the 10-unit output layer used later in this article is

$$ \mathbf{b}_2 = \begin{bmatrix} b_{2,1} \\ b_{2,2} \\ \vdots \\ b_{2,10} \end{bmatrix} \in \mathbb{R}^{10} $$

Neural Network Tree Representation

Forward propagation in a neural network can be represented as a tree-like layered structure with three layers: input, hidden, and output. Below is a diagram of a simple neural network with one hidden layer; the corresponding mathematical operations for the nodes and edges are given in the following sections.

    A1 A2 A3          Input layer
     \\| |//
      \|X|/
      B1 B2           Hidden Layer
       \ /
        C1            Output Layer

Pseudocode

  1. Define the input layer.
  2. Define the hidden layer with a specified number of neurons and an activation function.
  3. Define the output layer with the number of neurons corresponding to the number of classes and an activation function.
  4. Compile the model with a loss function and an optimizer.
  5. Train the model on the dataset.
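A minimal Keras sketch of these steps, assuming the 784-128-10 architecture used later in this article; x_train and y_train are placeholders for a real dataset.

```python
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(784,)),                          # 1. input layer
    tf.keras.layers.Dense(128, activation='relu'),         # 2. hidden layer
    tf.keras.layers.Dense(10, activation='softmax'),       # 3. output layer
])
model.compile(loss='sparse_categorical_crossentropy',      # 4. loss function + optimizer
              optimizer='adam',
              metrics=['accuracy'])
# model.fit(x_train, y_train, epochs=5)                    # 5. train on the dataset
```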

Mathematical Representation using Matrices

Input Layer:

  • Neurons: Nodes that receive input data. $$ \mathbf{x} \in \mathbb{R}^{784} $$

Hidden Layer:

  • Weights: $ \mathbf{W}_1 \in \mathbb{R}^{784 \times 128} $ $$ \mathbf{W}_1 = \begin{bmatrix} w_{1,1} & w_{1,2} & \cdots & w_{1,128} \\ w_{2,1} & w_{2,2} & \cdots & w_{2,128} \\ \vdots & \vdots & \ddots & \vdots \\ w_{784,1} & w_{784,2} & \cdots & w_{784,128} \end{bmatrix} $$
  • Biases: $$ \mathbf{b}_1 = \begin{bmatrix} b_{1,1} \\ b_{1,2} \\ \vdots \\ b_{1,128} \end{bmatrix} \in \mathbb{R}^{128} $$
  • Activation: $$ \mathbf{h} = \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1) $$

Output Layer: $$ \mathbf{y} = \text{softmax}(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2) $$

Explanation Beyond the Maths

  • Input Layer: Takes an input vector $\mathbf{x}$ of size 784.
  • Hidden Layer: Applies a linear transformation followed by a ReLU activation function.
  • Output Layer: Applies another linear transformation followed by a softmax activation function to produce a probability distribution over the 10 classes.

Neural Network Layers

| Layer | Pseudocode | Mathematical Representation | Explanation Beyond the Maths |
|-------|------------|-----------------------------|------------------------------|
| Input Layer | `x = np.random.randn(input_size)` | $\mathbf{x} \in \mathbb{R}^{784}$ | Takes an input vector $\mathbf{x}$ of size 784, representing a 28x28 pixel image. |
| Hidden Layer | `h = relu(np.dot(x, weight1) + bias1)` | $\mathbf{h} = \text{ReLU}(\mathbf{W}_1 \mathbf{x} + \mathbf{b}_1)$ | Applies a linear transformation followed by a ReLU activation function. |
| Output Layer | `y = softmax(np.dot(h, weight2) + bias2)` | $\mathbf{y} = \text{softmax}(\mathbf{W}_2 \mathbf{h} + \mathbf{b}_2)$ | Applies another linear transformation followed by a softmax activation function. |

Mathematical Representation Details

| Component | Pseudocode | Mathematical Representation | Explanation Beyond the Maths |
|-----------|------------|-----------------------------|------------------------------|
| Weights (Hidden Layer) | `weight1 = np.random.randn(input_size, hidden_layer_size)` | $\mathbf{W}_1 \in \mathbb{R}^{784 \times 128}$ | Matrix of weights connecting the input layer to the hidden layer. |
| Biases (Hidden Layer) | `bias1 = np.random.randn(hidden_layer_size)` | $\mathbf{b}_1 \in \mathbb{R}^{128}$ | Bias vector added to the hidden layer. |
| Weights (Output Layer) | `weight2 = np.random.randn(hidden_layer_size, output_size)` | $\mathbf{W}_2 \in \mathbb{R}^{128 \times 10}$ | Matrix of weights connecting the hidden layer to the output layer. |
| Biases (Output Layer) | `bias2 = np.random.randn(output_size)` | $\mathbf{b}_2 \in \mathbb{R}^{10}$ | Bias vector added to the output layer. |

Example

For a digit recognition problem, the input layer receives a 784-dimensional vector representing a 28x28 pixel image. The hidden layer applies a linear transformation followed by a ReLU activation function. The output layer applies another linear transformation followed by a softmax activation function to produce a probability distribution over the 10 classes (digits 0-9).

Training a Neural Network

Training involves repeatedly feeding input data through the network (the forward pass), measuring the loss, and adjusting the weights via backpropagation. The code below sketches the forward pass for the 784-128-10 network.

import numpy as np

# Define the input layer
input_size = 784
hidden_layer_size = 128
output_size = 10

# Initialize weights and biases
weight1 = np.random.randn(input_size, hidden_layer_size)  # Weights for input to hidden layer
bias1 = np.random.randn(hidden_layer_size)  # Biases for hidden layer

weight2 = np.random.randn(hidden_layer_size, output_size)  # Weights for hidden to output layer
bias2 = np.random.randn(output_size)  # Biases for output layer

# Forward pass
def relu(x):
    return np.maximum(0, x)

def softmax(x):
    exp_x = np.exp(x - np.max(x))
    return exp_x / exp_x.sum(axis=0)

# Input vector
x = np.random.randn(input_size)
# Hidden layer computation
h = relu(np.dot(x, weight1) + bias1)
# Output layer computation
y = softmax(np.dot(h, weight2) + bias2)
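To complete the picture, here is a hedged continuation of the snippet above: it computes the cross-entropy loss for this forward pass against a made-up one-hot target and notes what a training step would add.

```python
# Cross-entropy loss against a (made-up) one-hot target for class 3
y_true = np.zeros(output_size)
y_true[3] = 1.0
loss = -np.sum(y_true * np.log(y + 1e-7))
print(loss)

# One training step would then:
#   1. backpropagate the loss to get gradients for weight1, bias1, weight2, bias2
#   2. update each parameter, e.g. weight2 -= learning_rate * gradient_of_weight2
# (see the backpropagation sketch earlier in this article)
```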

Conclusion

Deep Learning is a subset of Machine Learning that uses neural networks to model complex patterns in data. It involves activation functions, loss functions, optimizers, regularization, backpropagation, and training, and most of it boils down to linear algebra translated into Python.
