Foundations of Generative AI and Transformers
Introduction
There’s a lot of unsettling energy these days about AI taking over the world. The way things are going, it wouldn’t be too surprising if that claim turned out to be true, especially if AI keeps making our essays sound like Shakespeare wrote them. Jokes aside, AI isn’t some magical alchemy of spells, and it certainly won’t replace you. AI works best as a creative partner, one guided by sound ethical standards and human imagination.
Amid the recent swarm of information about artificial intelligence and its related disciplines, this article aims to clear up the confusion surrounding generative AI and the processes behind how it generates different modalities of data at lightning speed. We’ll discuss the mechanisms that paved the way for generative AI, offering a brief look into its inner workings and the powerful techniques that drive it. We’ll then discuss how you, the reader, can apply this new knowledge in your own career, and close with a challenge to strengthen your prowess in this dynamic field.
Getting the Facts Straight: Clarifying the Nuances
I’m confident you’ve heard words such as AI, generative AI, and LLMs thrown around loosely, but what do they all mean? AI, or artificial intelligence, refers to computing systems designed to model human cognition. These systems can perform tasks like learning, problem-solving, and decision-making by simulating human thought processes, unlike traditional computers that simply follow predefined instructions. Generative AI is a subset of artificial intelligence that performs generative tasks (hence the name): creating new content such as text, images, or videos based on patterns learned from large datasets. These tasks rely on mathematical frameworks trained on big data to produce multimodal outputs. Think of these models as ‘digital brains’ that learn patterns from data and combine them to create new content. The main idea is that generative AI is not only an analysis paradigm; it also produces new outputs by recombining patterns learned from the data it’s given.

Large Language Models (LLMs) are a type of generative AI designed to understand and process natural language. Popular examples include OpenAI’s GPT-4 (used in ChatGPT), Meta’s Llama 4, and Anthropic’s Claude 3.7 Sonnet. Imagine AI as an entire workforce in a company, generative AI as the creative department, and LLMs like GPT-4 as the department’s star copywriters, the ones who craft the most attractive text for the public to read. The models behind these tools are like the years of training the copywriters put into mastering the art of language. All of these concepts are interconnected but serve slightly different purposes within the AI space.
Your Brain Plays an Important Role
Now that the terminology is set straight, we can dive deeper into how some popular generative AI models work. Before we step into the technical details, let’s take a step back and see where the inspiration for such intricate methodologies came from.
A question I like to ask others to spark their interest in AI is, “When you look at me, how do you know it’s me?” This question may sound trivial, but when you stop and reflect on it, it can be quite daunting. Regardless of what I am wearing, the setting I’m in, or whether I came back from vacation with a slight tan, people who are familiar with me can still recognize me in a split second. Our brain naturally filters and ranks sensory input based on relevance, and that prioritization drives our response; in this example, identifying me as the person you’re currently observing. The brain can be thought of as a sophisticated network, capable of processing, filtering, and adapting to new information. By studying how the brain learns, researchers have discovered clever ways to design models that mimic this behaviour. Therefore, if the goal is to mimic how a human generates text or other forms of data, it makes sense to mimic the “architecture” of the brain, which can be represented mathematically. This is the idea behind neural networks, which serve as the backbone for many modern generative AI models. Advancements in this field led to the development of Transformer architectures, which we’ll discuss in the later sections.
Artificial Networks: Mimicking the Mind
If we want to mimic how the brain works, wouldn’t it make sense to emulate its structure and encode its main components into a mathematical architecture? That is one of the main ideas behind how many generative AI models, and broader fields of AI, operate. Without getting into too much biology, engineers attempt to capture cognitive processes such as the evaluation of sensory inputs, neuron-to-neuron communication (like two people exchanging information), contextual processing, reasoning, and more. These and other processes are imitated and integrated into the design of architectures that aim to emulate aspects of human thought.
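As a tiny illustration of what encoding the brain’s components into mathematics looks like, here is the classic artificial neuron, the basic unit of a neural network: it weighs its incoming signals, sums them, and passes the result through an activation function. All the numbers below are made up for demonstration.

```python
import numpy as np

def artificial_neuron(inputs: np.ndarray, weights: np.ndarray, bias: float) -> float:
    """One 'neuron': weigh each input, sum, then squash with a sigmoid activation."""
    weighted_sum = np.dot(inputs, weights) + bias   # combine incoming signals
    return 1.0 / (1.0 + np.exp(-weighted_sum))      # firing strength between 0 and 1

# Made-up values: three sensory inputs, each with a different importance.
inputs = np.array([0.9, 0.1, 0.4])
weights = np.array([1.5, -0.8, 0.3])
print(artificial_neuron(inputs, weights, bias=-0.5))
```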
Attention Was All That We Needed
Narrowing our focus to generative AI, a major milestone in the field was marked by Google’s 2017 paper, Attention Is All You Need (Vaswani et al., 2017). This paper revealed key capabilities that earlier models, such as GRUs (Gated Recurrent Unit networks) and LSTMs (Long Short-Term Memory networks), were missing: namely, the ability to retain information across long sequences and maintain contextual understanding, which limited their effectiveness in generating coherent long-form content or responding with nuance to complex tasks. Earlier models often struggled because they processed inputs sequentially, making it difficult to capture long-range dependencies and selectively filter relevant context. To overcome these limitations, researchers at Google proposed the Transformer, an architecture that captures context by attending to distinct parts of the input through a mechanism called self-attention, a process where the model evaluates how much each word in the input relates to the others (like how ChatGPT processes a prompt). This mechanism is enhanced by multi-headed self-attention, where multiple “heads” focus on different parts of the input simultaneously for richer understanding.
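For readers comfortable with a little notation, the paper expresses this scaled dot-product attention compactly, where Q, K, and V are matrices of queries, keys, and values (unpacked in the analogy below) and d_k is the dimensionality of the keys:

```latex
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V
```

Dividing by the square root of d_k keeps the dot products from growing so large that the softmax saturates, which is why the authors call it scaled dot-product attention.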
To understand how Transformer architecture works, consider the following analogy. Imagine standing in a room surrounded by hundreds of people, each speaking to you at once. Before you can begin to process what these people are saying, you need to organize the information: you assign each voice a unique position based on where it is coming from. This acts like positional encodings, helping you keep track of the order of information.
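To make the positional step concrete, here is a minimal NumPy sketch of the sinusoidal positional encodings proposed in the original paper. The function name and the toy dimensions are my own for illustration; only the sine/cosine recipe comes from the paper.

```python
import numpy as np

def sinusoidal_positional_encoding(seq_len: int, d_model: int) -> np.ndarray:
    """Return a (seq_len, d_model) matrix of sinusoidal position encodings."""
    positions = np.arange(seq_len)[:, np.newaxis]          # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[np.newaxis, :]         # even embedding indices
    angle_rates = 1.0 / np.power(10000.0, dims / d_model)  # one frequency per pair
    angles = positions * angle_rates                       # (seq_len, d_model / 2)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)   # sine on even indices
    pe[:, 1::2] = np.cos(angles)   # cosine on odd indices
    return pe

# Each row gives one position a unique "fingerprint" that is added to its embedding.
pe = sinusoidal_positional_encoding(seq_len=10, d_model=16)
print(pe.shape)  # (10, 16)
```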
Next, instead of dealing with raw voices, you convert what each person says into a simpler, structured form, like translating it into numbers you can easily work with. This process resembles embedding the words into a numerical space that the Transformer can operate on, since it only understands numbers. As you listen, you instinctively ask: "What am I hearing?" (queries), "What information is being offered?" (keys), and "What should I remember?" (values). These are the queries, keys, and values inside the self-attention mechanism, helping you decide where to focus your attention.
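Putting queries, keys, and values into code, here is a minimal single-head self-attention sketch in NumPy. The dimensions are toy values, and the weight matrices are random placeholders standing in for parameters a real model would learn.

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))  # subtract max for numerical stability
    return e / e.sum(axis=-1, keepdims=True)

def self_attention(x, w_q, w_k, w_v):
    """x: (seq_len, d_model). Returns contextualized vectors of the same shape."""
    q = x @ w_q   # "What am I hearing?"      (queries)
    k = x @ w_k   # "What is being offered?"  (keys)
    v = x @ w_v   # "What should I remember?" (values)
    d_k = k.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)  # how strongly each token relates to every other token
    weights = softmax(scores)        # each row sums to 1: a distribution of attention
    return weights @ v               # a weighted blend of the value vectors

rng = np.random.default_rng(0)
x = rng.normal(size=(5, 8))                               # 5 tokens, toy embedding size 8
w_q, w_k, w_v = (rng.normal(size=(8, 8)) for _ in range(3))
print(self_attention(x, w_q, w_k, w_v).shape)             # (5, 8)
```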
Since you can think about multiple conversations at once, your brain creates different strategies for listening; one for tone, one for important words, one for emotional cues. This is like using multiple heads in multi-headed self-attention, allowing the Transformer to analyze various aspects of the input in parallel. After gathering and filtering the information, you process it more deeply to make sense of what you’ve focused on, like how each token passes through a small feed-forward network inside the Transformer to refine its meaning.
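In code, the multiple heads amount to running that same attention computation several times in parallel on smaller slices of the embedding and concatenating the results. A minimal sketch, again with random placeholder weights (a real Transformer also applies a final learned output projection after the concatenation):

```python
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_self_attention(x: np.ndarray, n_heads: int) -> np.ndarray:
    """Split d_model across heads, attend independently in each, then concatenate."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    rng = np.random.default_rng(42)
    heads = []
    for _ in range(n_heads):
        # Per-head projections (placeholders for learned weight matrices).
        w_q, w_k, w_v = (rng.normal(size=(d_model, d_head)) for _ in range(3))
        q, k, v = x @ w_q, x @ w_k, x @ w_v
        weights = softmax(q @ k.T / np.sqrt(d_head))
        heads.append(weights @ v)             # each head: (seq_len, d_head)
    return np.concatenate(heads, axis=-1)     # (seq_len, d_model), one "view" per head

x = np.random.default_rng(0).normal(size=(5, 8))
print(multi_head_self_attention(x, n_heads=2).shape)  # (5, 8)
```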
For example, consider a query like: “Why are there so many geese on Western University’s campus in the spring?” When processed by a Transformer, each attention head might focus on different components of the sentence. One of them might attend to “so many geese,” another to “Western University,” and another to “in the spring.” Just as your brain splits focus between tone, content, and context in a conversation, the model breaks the sentence down into parallel attention streams. Each of these is then refined through small feed-forward networks to build a complete understanding of the query, combining areas like wildlife behaviour, location, and seasonality to inform its response.
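The refinement step mentioned above is the position-wise feed-forward network: two linear layers with a ReLU in between, applied to each token independently. A minimal sketch with made-up sizes (the paper uses a much wider inner layer):

```python
import numpy as np

def feed_forward(x, w1, b1, w2, b2):
    """Position-wise FFN: expand each token's vector, apply ReLU, project back down."""
    hidden = np.maximum(0.0, x @ w1 + b1)  # ReLU keeps only positive activations
    return hidden @ w2 + b2                # project back to the model dimension

rng = np.random.default_rng(1)
d_model, d_ff = 8, 32                      # toy sizes; the paper widens by 4x (512 -> 2048)
x = rng.normal(size=(5, d_model))          # 5 tokens coming out of the attention step
w1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
w2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
print(feed_forward(x, w1, b1, w2, b2).shape)  # (5, 8)
```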
Once you have fully processed the information, you start forming a clear response, piecing together the most important parts. This resembles decoding and token prediction, where the Transformer generates the next word (token) based on everything it has understood so far.
Finally, before speaking back, you need to translate your structured understanding back into natural language (numbers to letters), just like the Transformer maps its internal numerical predictions back into readable text.
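To see decoding and detokenization end to end, here is a toy greedy decoding loop. The "model" below is a stand-in function that returns fake scores, and the five-word vocabulary is invented; real systems learn tokenizers over enormous corpora and predict over vocabularies of tens of thousands of tokens.

```python
import numpy as np

# Invented toy vocabulary; index 0 is a special "stop" token.
vocab = ["<end>", "otters", "are", "playful", "swimmers"]

def fake_next_token_scores(token_ids: list) -> np.ndarray:
    """Stand-in for a Transformer forward pass: one score per vocabulary entry."""
    rng = np.random.default_rng(len(token_ids))   # deterministic fake scores
    scores = rng.normal(size=len(vocab))
    scores[0] += len(token_ids) - 3.0             # make <end> likelier as text grows
    return scores

token_ids = [1]                                   # start from the token "otters"
while token_ids[-1] != 0 and len(token_ids) < 10:
    scores = fake_next_token_scores(token_ids)
    token_ids.append(int(scores.argmax()))        # greedy: always take the top score

# Detokenize: map the numeric predictions back into readable words.
print(" ".join(vocab[i] for i in token_ids if i != 0))
```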
And like any conversation, you repeat this cycle (listening, focusing, processing, responding) layer after layer until the whole interaction is complete. In a way, this process mirrors how our brains filter, prioritize, and respond to information in real time. This is the power of the Transformer. Today, companies like OpenAI and Meta build their most advanced models on Transformer architectures. As AI continues to evolve, these ideas are likely to remain foundational in driving innovation across industries such as healthcare and education. With careful decision-making, AI can serve as a tool to support human work rather than replace it. How we apply these technologies will determine their impact: AI could either displace workers through automation or enhance human capabilities and create new opportunities (Acemoglu & Johnson, 2023). In general, AI should be used to enhance work, never to cheat, plagiarize, or compromise integrity.
The Challenge
For your challenge, pick a hobby you like and explore it with a generative AI tool such as ChatGPT, Google’s Gemini, or Microsoft Copilot. Start with a broad or vague prompt, then gradually refine it by adding more specific details, and compare the outputs at each step.
For example, I love otters. Here’s a sample prompt progression (a scripted way to run the same comparison follows the list):
- Vague Prompt: "Tell me about otters."
- More specific: "How do otters adapt to their environments in different parts of the world?"
- Even more specific: "What are the main threats to North American river otters due to habitat loss?"
- Very specific: "How has urban development along the Thames River in Ontario impacted the behaviour and population of local otters over the past decade?"
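If you would like to run the comparison programmatically, here is a minimal sketch assuming the OpenAI Python SDK (`pip install openai`) and an API key set in your environment; the model name is an assumption, and any chat interface works just as well.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

prompts = [
    "Tell me about otters.",
    "How do otters adapt to their environments in different parts of the world?",
    "What are the main threats to North American river otters due to habitat loss?",
    "How has urban development along the Thames River in Ontario impacted the "
    "behaviour and population of local otters over the past decade?",
]

for prompt in prompts:
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # assumed model name; substitute one you have access to
        messages=[{"role": "user", "content": prompt}],
    )
    print(f"--- {prompt}\n{response.choices[0].message.content}\n")
```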
As you refine your prompts, reflect on the following:
- How does the AI’s response change with each level of specificity?
- Why might the AI give you the responses it does at each level?
- Did the model leave out key context?
- What details or perspectives become clearer as the prompt narrows?
- What might this reveal about how the AI interprets and prioritizes your request?
A More Technical Challenge
A Quick Note:
If you’re new to AI or don’t have a technical background, don’t worry: this section is an optional dive for those curious about how Transformers work under the hood. The questions below guide you through key concepts from the original paper, Attention Is All You Need, which introduced the Transformer architecture. You don’t need to answer them perfectly; the goal is to explore how these models handle language, parallelism, and long-range dependencies in ways that differ from past approaches. Feel free to skim the paper or tackle a few questions that interest you.
Now that you’ve learned some of the foundations, you are invited to take a deeper dive into the technical details to further your understanding. For a more advanced challenge, read the original 2017 paper Attention Is All You Need and, as you read it, pause to reflect on the following questions.
- Why do the authors argue that self-attention models can be more interpretable than recurrent models?
- Why does the Transformer use positional encodings instead of recurrence to represent the order of sequences?
- What is the core problem that self-attention solves in sequence modelling? How does it affect long-range dependencies compared to recurrent neural networks?
- What advantage does multi-headed attention (MHSA) provide over single-headed attention? How does using multiple heads enhance model capacity?
- What specific design decision made by Google makes the Transformer highly parallelizable during training?
- What is the purpose of masking the decoder’s self-attention layers?
- When scaling up to bigger models, which factors (dropout, number of heads, embedding size) impact performance the most?
References
- Acemoglu, D., & Johnson, S. (2023, October 25). Choosing AI’s impact on the future of work. MIT Shaping the Future of Work. https://mitsloan.mit.edu/ideas-made-to-matter/choosing-ais-impact-future-of-work
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., & Polosukhin, I. (2017). Attention is all you need. arXiv. https://doi.org/10.48550/arXiv.1706.03762
Disclosure
NOTE: This post was written without the use of generative AI. All grammar and wording are the author's own. Photos are sourced from scholarly references.
