Deep Learning

1. Neural Networks

The rise of the World Wide Web and affordable data storage has made massive datasets available. At the same time, the increase in computational power has set the stage for deep learning to flourish. Deep learning has come a long way, evolving from basic neural network concepts into the advanced systems we rely on today. It all began with the idea of artificial neurons, inspired by the workings of the human brain.

The human brain works by connecting billions of neurons that send signals to each other. A neuron either fires or stays inactive depending on the signals it receives: when enough input arrives, it activates and sends the signal forward, like a switch that turns on once it gets enough power. When neurons fire, they pass information to other neurons, forming a network. When we learn something, specific connections between neurons get stronger, and the more we practice or repeat something, the stronger these connections become. If we stop using a connection, it weakens over time. This is how the brain adapts and learns new things.

If we tried visualizing neuron connections, they would look like the picture on the right. Google and Harvard University mapped a microscopic fragment of a woman’s brain. You can also view this interactive simulation.

Screenshot of Neuroglancer, a tool to visualize a fraction of the brain.

The human brain inspires neural networks. When you look at a Golden Retriever, light enters your eyes and reaches the retina, where the image is converted into electrical signals. These signals travel to the visual cortex in your brain. In the visual cortex, different neurons become active based on the dog’s features, such as the shape of its head, ears, and the color of its coat. Through experience and repeated exposure to retrievers, your brain has learned to connect these neural activation patterns with the concept of a Golden Retriever. Once these neurons are activated, your brain efficiently processes the information and identifies the dog.

A Simple Artificial Neural Network

The diagram shows an artificial neural network. It has three input neurons that receive initial data, four hidden neurons that process this information, and a single output neuron that provides the final prediction or decision. Connections between neurons are shown with arrows. Every connection between neurons has a weight that adjusts the strength of the signal sent to the next layer. During training, the network changes these weights to improve its accuracy. Hidden neurons are crucial for learning. They extract patterns and features from the input data. These neurons take the weighted input from the input layer, apply an activation function, and send the transformed signal to the next layer. The hidden layers allow the network to understand complex relationships within the data, enabling it to recognize non-linear patterns and make more accurate predictions or classifications.
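To make this concrete, here is a minimal sketch of that 3-4-1 network in Python with NumPy. The weights are random placeholders rather than trained values, and the sigmoid activation is just one common choice:

```python
import numpy as np

def sigmoid(x):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

rng = np.random.default_rng(0)

# Random placeholder weights: 3 inputs -> 4 hidden -> 1 output.
W1 = rng.normal(size=(3, 4))   # input-to-hidden weights
b1 = np.zeros(4)               # hidden-layer biases
W2 = rng.normal(size=(4, 1))   # hidden-to-output weights
b2 = np.zeros(1)               # output bias

def forward(x):
    """One forward pass: a weighted sum plus an activation at each layer."""
    hidden = sigmoid(x @ W1 + b1)       # four hidden activations
    output = sigmoid(hidden @ W2 + b2)  # single prediction in (0, 1)
    return output

x = np.array([0.5, -1.2, 0.3])  # three example input values
print(forward(x))               # an arbitrary prediction before any training
```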

2. Backpropagation

In the 1980s, deep learning took a significant step forward with the development of the backpropagation algorithm. This innovation made it possible to train neural networks far more effectively by adjusting their connection weights based on the errors they made. It led directly to deeper architectures that could distinguish patterns in data, and it laid the foundation for most modern applications across industries.

Backpropagation is a method used in training neural networks. It involves two main steps:

1. Forward Pass: The neural network predicts based on the current weights.

2. Backward Pass: The network adjusts its weights based on the error in its prediction to improve future predictions.

Backpropagation aims to minimize the error between the network’s predictions and the actual outcomes by updating the network’s weights. Let’s compare this to buying a new electric car. If you want to buy a new EV, you might consider the price, the car’s battery range, and the vehicle’s brand. Based on these factors, imagine a system that predicts whether buying a particular electric vehicle is a good idea. The system initially assigns weights to each factor, determining how much they influence the final decision.

Step 1: Forward Pass

In the forward pass, the system begins by considering various inputs. For example, a car might cost $40,000, have a 300-mile range, and come from a reputable brand. Each of these factors is assigned a weight that reflects its importance. For instance, the cost might be assigned a weight of 0.5, the range 0.3, and the brand 0.2. The system then calculates a weighted sum of these inputs to make a decision.

After considering the cost, range, and brand of the car and multiplying each by its respective weight, you arrive at a score. However, there might still be something that influences your decision toward buying the car, even if the inputs (cost, range, brand) aren’t perfect.

That’s where the bias comes in. A bias is added to the weighted sum of the inputs to adjust the final score. It represents an extra push toward making a decision, even if the inputs alone don’t lead to a clear result.

In the formula, this looks like:

\[ \text{Score} = (\text{Cost} \times w_{\text{cost}}) + (\text{Range} \times w_{\text{range}}) + (\text{Brand} \times w_{\text{brand}}) + \text{Bias} \]

This bias shifts the score just enough to affect the final outcome, which is then passed through an activation function that leads to the final decision: “Yes, buy the car” or “No, don’t buy the car.”
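Here is that calculation as a small Python sketch. The weights are the illustrative values from above; the scaled input values and the bias are assumptions added for this example, since raw dollars and miles would swamp a weighted sum:

```python
import math

# Illustrative inputs, scaled to roughly 0-1 so the weights are comparable.
# (The scaling is our assumption, not part of the original example.)
cost   = 0.6   # e.g. $40,000 mapped onto a 0-1 affordability scale
rng_mi = 0.75  # e.g. a 300-mile range mapped onto a 0-1 scale
brand  = 0.9   # reputable brand

w_cost, w_range, w_brand = 0.5, 0.3, 0.2  # weights from the example above
bias = -0.5                               # assumed: a default lean toward "no"

score = cost * w_cost + rng_mi * w_range + brand * w_brand + bias

# Activation function: squash the score into a probability-like value.
decision = 1.0 / (1.0 + math.exp(-score))
print("Buy the car" if decision > 0.5 else "Don't buy the car")
```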

Step 2: Calculate the Error

After making the decision, the system assesses the outcome. If you bought the car but were unhappy with it, perhaps because the range was too short, this indicates that the system’s prediction was incorrect. The error is the difference between the predicted outcome (a good purchase) and the actual outcome (a bad purchase). In neural networks, this error is typically calculated using a loss function, such as Mean Squared Error (MSE) for regression tasks or Cross-Entropy for classification tasks. The error can be represented as:

\[ \text{Error} = \text{Loss}(\text{Predicted Outcome}, \text{Actual Outcome}) \]
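As a quick sketch, both loss functions take only a few lines of NumPy; the 0.55 prediction below is an assumed value for illustration:

```python
import numpy as np

def mse(predicted, actual):
    """Mean Squared Error: typical for regression tasks."""
    return np.mean((np.asarray(predicted) - np.asarray(actual)) ** 2)

def binary_cross_entropy(predicted, actual):
    """Cross-Entropy for a yes/no classification like our car decision."""
    p = np.clip(np.asarray(predicted, dtype=float), 1e-12, 1 - 1e-12)
    y = np.asarray(actual, dtype=float)
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Predicted "good purchase" with 0.55 confidence, but it was actually bad (0).
print(mse([0.55], [0.0]))                   # 0.3025
print(binary_cross_entropy([0.55], [0.0]))  # roughly 0.799
```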

Step 3: Backward Pass

In the backward pass, the system adjusts the weights to minimize future errors. It calculates how much each weight contributed to the error by finding the gradient of the loss function with respect to each weight. The gradient tells the system how much a change in the weight will impact the error. The system then updates the weights by moving them in a direction that reduces the error, using an optimization method like Gradient Descent. The formula used for updating the weights is:

\[ w_{\text{new}} = w_{\text{old}} - \eta \times \text{gradient} \]

Here, \( \eta \) (eta) is the learning rate, which controls how big the adjustments are with each iteration. The system repeats this process many times, gradually reducing the errors and improving its predictions.
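The following minimal sketch applies this update rule to a single weight, assuming a toy error curve \( (w - 3)^2 \) whose gradient we can write down by hand:

```python
# Gradient descent on a single weight, where error(w) = (w - 3)**2.
# The gradient is 2 * (w - 3), so the error is smallest at w = 3.
w = 0.0    # initial weight
eta = 0.1  # learning rate: how big each adjustment is

for step in range(50):
    gradient = 2 * (w - 3)  # slope of the error at the current weight
    w = w - eta * gradient  # the update rule from the formula above

print(w)  # close to 3.0: the error has been driven toward its minimum
```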

In this Python visualization, data enters the input layer and moves through the network. This process is called the forward pass. As the data flows through the hidden layers, each layer applies certain weights to the inputs, transforming the data step by step. These transformations help the network identify patterns or features in the data, which are essential for making accurate predictions. Finally, the processed data reaches the output layer, where the network makes its decision or prediction.

After this forward pass, the network checks its prediction against the actual result, calculating the error. Then, the network uses backpropagation to correct this error: the error is sent backward through the network, starting from the output layer and moving back through the hidden layers, and each layer’s weights are adjusted based on how much they contributed to the error. Backpropagation enables neural networks to learn from their mistakes and improve over time. By changing the network’s weights based on errors, backpropagation allows the model to improve its predictions, classifications, or decisions.
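Putting the three steps together, here is a minimal, self-contained training loop in NumPy. It trains the 3-4-1 network from Section 1 on a synthetic dataset; the labeling rule and all hyperparameters are assumptions chosen for the demo:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Toy dataset: the label is 1 when the three inputs sum to a positive number.
X = rng.normal(size=(200, 3))
y = (X.sum(axis=1) > 0).astype(float).reshape(-1, 1)

# Same 3-4-1 architecture as before.
W1 = rng.normal(scale=0.5, size=(3, 4)); b1 = np.zeros(4)
W2 = rng.normal(scale=0.5, size=(4, 1)); b2 = np.zeros(1)
eta = 0.5

for epoch in range(2000):
    # Forward pass: predict with the current weights.
    h = sigmoid(X @ W1 + b1)
    pred = sigmoid(h @ W2 + b2)

    # Backward pass: gradient of the cross-entropy loss for each weight.
    # With a sigmoid output, d(loss)/d(pre-activation) is simply pred - y.
    d_out = (pred - y) / len(X)
    dW2 = h.T @ d_out
    db2 = d_out.sum(axis=0)
    d_hidden = (d_out @ W2.T) * h * (1 - h)  # chain rule through the sigmoid
    dW1 = X.T @ d_hidden
    db1 = d_hidden.sum(axis=0)

    # Update every weight against its gradient.
    W1 -= eta * dW1; b1 -= eta * db1
    W2 -= eta * dW2; b2 -= eta * db2

accuracy = ((pred > 0.5) == (y > 0.5)).mean()
print(f"training accuracy: {accuracy:.2f}")  # should approach 1.0
```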

3. Convolutional Neural Networks (CNNs)

CNNs are designed to handle data like images and videos. They use a unique technique called convolutions to process pixel data efficiently. Earlier fully connected networks trained with backpropagation struggled with large images because they required too many parameters. CNNs solved this by reducing the number of parameters and automatically learning features (like edges or shapes) from images. CNNs revolutionized computer vision, enabling applications like facial recognition, object detection, and autonomous driving. They became popular with the success of AlexNet in 2012, which outperformed previous image recognition methods.

To understand how CNNs work, let’s look at this illustration:

An image of a Golden Retriever that is fed into the CNN.

The convolutional layer detects specific parts of the image, such as edges or certain dog features. Filters scan small portions of the image to identify patterns.

The pooling layer reduces the size of the feature maps while retaining the necessary information, keeping only the most prominent features. The result looks simpler but still preserves the essential structures.

The pooled feature maps are then flattened into a single column or vector of values. This step is more abstract, since the result is simply a one-dimensional vector of numbers.

The flattened vector is fed into the fully connected layer, which makes the final prediction.

Dog: 90% confidence

Cat: 10% confidence
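The pipeline just described can be sketched in a few lines of NumPy. The hand-written edge filter, the 8 x 8 random "image", and the ReLU activation are assumptions for illustration; a real CNN learns its filters during training and ends with a fully connected layer that produces the confidence scores:

```python
import numpy as np

def convolve2d(image, kernel):
    """Slide a small filter over the image (valid cross-correlation)."""
    kh, kw = kernel.shape
    out_h = image.shape[0] - kh + 1
    out_w = image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i+kh, j:j+kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Keep only the strongest activation in each size x size window."""
    h, w = feature_map.shape
    h, w = h - h % size, w - w % size  # trim to a multiple of the window
    trimmed = feature_map[:h, :w]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.random.default_rng(0).random((8, 8))  # stand-in for pixel data

# A classic vertical-edge filter: bright-to-dark transitions respond strongly.
edge_filter = np.array([[1., 0., -1.],
                        [1., 0., -1.],
                        [1., 0., -1.]])

features = np.maximum(convolve2d(image, edge_filter), 0)  # ReLU activation
pooled = max_pool(features)       # smaller map, prominent features kept
flattened = pooled.reshape(-1)    # the vector fed to the fully connected layer
print(features.shape, pooled.shape, flattened.shape)  # (6, 6) (3, 3) (9,)
```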

To understand CNNs better, please read this excellent article about understanding them through visualization.

CNNs are used for video analysis, facial recognition, and image classification.

4. Recurrent Neural Networks (RNNs) and Long Short-Term Memory Networks (LSTMs)

Recurrent Neural Networks (RNNs) process data in a specific order, like sentences, time series, or speech. They understand this order and context better by remembering what came before. When using voice assistants like Siri or Alexa to play a song, RNNs help understand the command by paying attention to the order of the words. However, RNNs struggle with remembering information from long ago if the sequence is too long.

Visualization of an RNN

Green circles (xt): Inputs at each time step.

Blue rectangles (ht): Hidden states that carry memory from one step to the next, allowing the network to retain information about previous inputs.

Red circles (ot): Outputs generated based on the hidden states.

The arrows show how information flows between inputs, hidden states, and outputs over time. The hidden state updates with each input, helping the RNN maintain context from past steps and making it ideal for sequential tasks like language processing or time series analysis.
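The core of this diagram is a single recurrence that reuses the same weights at every time step. Here is a minimal NumPy sketch of it; the sizes and random weights are arbitrary placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

# The same weight matrices are reused at every time step.
W_xh = rng.normal(scale=0.5, size=(input_size, hidden_size))   # input -> hidden
W_hh = rng.normal(scale=0.5, size=(hidden_size, hidden_size))  # hidden -> hidden
b_h = np.zeros(hidden_size)

def rnn_step(x_t, h_prev):
    """h_t mixes the new input with the memory carried in h_{t-1}."""
    return np.tanh(x_t @ W_xh + h_prev @ W_hh + b_h)

h = np.zeros(hidden_size)                    # empty memory before the sequence
sequence = rng.normal(size=(5, input_size))  # five inputs x_1 ... x_5
for x_t in sequence:
    h = rnn_step(x_t, h)  # the hidden state carries context forward
print(h)  # a summary of everything the network has seen so far
```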

However, RNNs struggled with retaining long-term dependencies, which led to the development of Long Short-Term Memory (LSTM) networks and Gated Recurrent Units (GRUs). These architectures allow better information retention over long sequences, making them useful for language modeling, translation, and sentiment analysis tasks. In a customer service chatbot, an LSTM or GRU can better understand the flow of conversation, remembering what the customer asked five minutes ago.
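An LSTM achieves this retention with gates that decide what to erase, what to store, and what to reveal at each step. The sketch below shows one standard LSTM cell in NumPy; the sizes, the random weights, and the forget-gate bias of 1 (a common initialization trick) are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))
n_in, n_hid = 3, 4

# One weight matrix per gate, each acting on [h_prev, x_t] concatenated.
W_f, W_i, W_c, W_o = (rng.normal(scale=0.5, size=(n_hid + n_in, n_hid))
                      for _ in range(4))
b_f = np.ones(n_hid)  # forget gate starts open: remember by default
b_i = np.zeros(n_hid)
b_c = np.zeros(n_hid)
b_o = np.zeros(n_hid)

def lstm_step(x_t, h_prev, c_prev):
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(z @ W_f + b_f)        # forget gate: what to erase from memory
    i = sigmoid(z @ W_i + b_i)        # input gate: what new info to store
    c_tilde = np.tanh(z @ W_c + b_c)  # candidate memory content
    c = f * c_prev + i * c_tilde      # long-term cell state flows onward
    o = sigmoid(z @ W_o + b_o)        # output gate: what to reveal
    h = o * np.tanh(c)                # short-term hidden state
    return h, c

h = np.zeros(n_hid)
c = np.zeros(n_hid)
for x_t in rng.normal(size=(10, n_in)):  # a ten-step input sequence
    h, c = lstm_step(x_t, h, c)
print(h, c)
```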

5. Generative Adversarial Networks (GANs)

Ian Goodfellow invented Generative Adversarial Networks (GANs) in 2014. GANs consist of two interconnected networks: a generator and a discriminator. The generator network creates synthetic data, such as images, trying to make it indistinguishable from real data, while the discriminator assesses the authenticity of the generated data. This dynamic interplay between the two networks continuously improves the generator’s ability to produce highly realistic data, making GANs a powerful tool for generating lifelike images and other forms of data.

In a GAN, there are two main parts: the generator and the discriminator. The generator creates fake data that looks as real as possible. The discriminator acts like a critic, checking both real data from the actual dataset and fake data from the generator. Its job is to tell the difference between real and fake data. This back-and-forth competition between the generator and the discriminator is called an adversarial process. The process repeats many times (epochs), with both networks improving. The main goal is for the generator to make data so realistic that the discriminator can’t reliably tell what’s real and what’s fake anymore.
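To show the adversarial loop without a deep learning framework, here is a deliberately tiny NumPy sketch on one-dimensional data. The Gaussian target, the one-line generator and discriminator, and all hyperparameters are assumptions chosen so every gradient can be derived by hand; real GANs use deep networks and automatic differentiation:

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda x: 1.0 / (1.0 + np.exp(-x))

# Real data the generator must imitate: a Gaussian centered at 4.
def real_batch(n):
    return rng.normal(loc=4.0, scale=0.5, size=n)

# Generator g(z) = a*z + b; discriminator logit = w1*x + w2*x**2 + c.
a, b = 1.0, 0.0
w1, w2, c = 0.1, 0.0, 0.0
lr, n = 0.02, 128

for epoch in range(5000):
    # --- Discriminator step: push d(real) toward 1 and d(fake) toward 0 ---
    x_r = real_batch(n)
    x_f = a * rng.normal(size=n) + b
    g_r = sigmoid(w1 * x_r + w2 * x_r**2 + c) - 1  # dLoss/dlogit on real data
    g_f = sigmoid(w1 * x_f + w2 * x_f**2 + c)      # dLoss/dlogit on fake data
    w1 -= lr * np.mean(g_r * x_r + g_f * x_f)
    w2 -= lr * np.mean(g_r * x_r**2 + g_f * x_f**2)
    c  -= lr * np.mean(g_r + g_f)

    # --- Generator step: push d(fake) toward 1; gradients flow through d ---
    z = rng.normal(size=n)
    x_f = a * z + b
    g_f = sigmoid(w1 * x_f + w2 * x_f**2 + c) - 1  # non-saturating G loss
    g_x = g_f * (w1 + 2 * w2 * x_f)                # chain rule into x_fake
    a -= lr * np.mean(g_x * z)
    b -= lr * np.mean(g_x)

fake = a * rng.normal(size=2000) + b
print(f"fake mean {fake.mean():.2f}, fake std {fake.std():.2f}")
# The fake mean should drift toward 4.0; GAN training is famously
# unstable, so the exact values vary from run to run.
```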

View this notebook to see a basic GAN in action.

Please use this simulator to play around with generative adversarial networks (GANs).

6. Transformers

Researchers at Google Brain introduced the transformer architecture in 2017 and popularized transformer models. They are the brains behind many AI tools, such as chatbots, language translation services, and even tools that can write text. Transformer models are designed to process sequences of data efficiently. They rely on self-attention to understand the relationships between all input parts simultaneously rather than processing them step by step like older models. This makes them much faster and better at handling large datasets, especially in tasks like language understanding, translation, or text generation.
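Self-attention itself is a short computation: every token’s query is scored against every other token’s key, and the resulting weights mix the value vectors. Here is a minimal single-head sketch in NumPy; the sequence length, dimensions, and random embeddings are placeholders:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
seq_len, d_model = 4, 8  # e.g. four tokens, eight-dimensional embeddings

X = rng.normal(size=(seq_len, d_model))  # stand-in token embeddings
W_q, W_k, W_v = (rng.normal(scale=0.3, size=(d_model, d_model))
                 for _ in range(3))

Q, K, V = X @ W_q, X @ W_k, X @ W_v      # queries, keys, values

# Every token scores its relevance to every other token at once:
# this is the "all parts simultaneously" property of self-attention.
scores = Q @ K.T / np.sqrt(d_model)      # (4, 4) relevance matrix
weights = softmax(scores)                # each row sums to 1
output = weights @ V                     # context-aware token representations

print(weights.round(2))  # how much each token attends to every other one
```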

For example, suppose you subscribe to a music service that recommends songs based on your preferences. A system that only looked at your previous music choices would not be accurate: you might want fast, upbeat music while playing sports or at the gym, but prefer slower music at other times. Instead, the system looks at user actions and their timestamps and discovers patterns in those actions to determine a user’s choice at a given time.

View a transformer in action: Transformer application.

7. Large Language Models (LLMs) and Multimodal Language Models (MMLs)

OpenAI launched the first large-scale language model based on the Transformer architecture in 2018. GPT-1 demonstrated the effectiveness of pre-training a model on large amounts of text data and then fine-tuning it for specific tasks. GPT-1 had 117 million parameters, and GPT-3, launched in 2020, had 175 billion parameters. GPT-4 is estimated to have 1.76 trillion parameters. Parameter counts are often used to describe the size of LLMs. The more parameters a model has, the more complex patterns it can learn, but it also requires more data and computational power to train effectively. Advanced hardware is needed to train large language models.

Recently, more multimodal language models have been released. These models can handle and output more than one type of data—like text, images, and sounds—all at once. One of the most popular is Google’s Gemini, which powers the Gemini chatbot. OpenAI also has a multimodal LLM, GPT-4o, available via ChatGPT Plus, OpenAI’s API, and the free chatbot Microsoft Copilot. Claude 3 is Anthropic’s latest family of models. Amazon has invested in Anthropic but is also developing its own multimodal model, Olympus, to be launched later this year.

To view a timeline of recent Multimodal Models, read this page.

Resources:


Dive into Deep Learning (D2L). (n.d.). Handbook of Deep Learning. Retrieved from https://www.d2l.ai/chapter_recurrent-modern/index.html

World Economic Forum. (2022, January). How Deep Learning Drives Business Productivity and Revenue. Retrieved from https://www.weforum.org/agenda/2022/01/deep-learning-business-productivity-revenue

Google Cloud. (n.d.). What Is Deep Learning? A Beginner’s Guide. Retrieved from https://cloud.google.com/discover/what-is-deep-learning?hl=en

Fridman, L. (n.d.). MIT Deep Learning Basics Tutorial. GitHub. Retrieved from https://github.com/lexfridman/mit-deep-learning/blob/master/tutorial_deep_learning_basics/deep_learning_basics.ipynb

Google Developers. (n.d.). Google Machine Learning Crash Course. Retrieved from https://developers.google.com/machine-learning/crash-course

arXiv. (n.d.). Multimodal Machine Learning: Foundations and Applications. Retrieved from https://arxiv.org/html/2401.13601v5