The Journey of a Prompt: What Really Happens Inside an LLM?

We've all been there. A blank cursor blinks on the screen, a complex question swirls in your head, and you turn to your digital oracle—a large language model. Imagine typing something like: "Explain the attention mechanism in transformers as if I'm 10 years old". You hit Enter, and… magic. In a couple of seconds, a perfectly structured, clear, and even witty answer unfolds on the screen. It feels as if an invisible genius is sitting on the other side, instantly grasping your query and finding the best words for the explanation.

What exactly happened in those two seconds? How did the string of characters we entered transform into meaningful and coherent text? We've become so accustomed to this everyday miracle that we've stopped asking questions. For most of us, an LLM is a perfect black box. We toss in a note with a question, and it spits out an answer. It's like some magical mailbox from a fantasy novel: reliable, fast, and unfathomable.

In my articles, I constantly reiterate the same idea: artificial intelligence is not magic. It's the result of complex engineering and scientific work, a confluence of circumstances where computational power, data volumes, and ingenious algorithms converged at a single point. And to truly use these tools effectively and responsibly, we must understand how they work. Therefore, our task today is to open this magical mailbox and meet the "genius" inside personally. We will trace the complete journey of our simple prompt from the moment you hit Enter until the first word of the answer appears.

Be prepared: behind this almost instantaneous reaction lies a process comparable in its logistics and complexity to the operation of an entire metropolis. It has its own border checkpoints, its libraries, its industrial centers, and even its creative quarters. Let's take a tour of this digital metropolis and see what happens at each stage of our query's journey. And like any journey, it all begins with crossing the border.


Customs. Turning Words into Numbers (Tokenization)

So, our prompt-traveler has arrived at the first checkpoint of the digital metropolis—the "customs office." And here, its first and most fundamental transformation awaits. Unlike us, the machine doesn't operate with words, sentences, let alone nuances of sarcasm or poetry. Its language is the language of mathematics. And for our query to continue its journey, it needs to be translated into this universal language. This process is called tokenization.

To continue our analogy, tokenization is like a gourmet dish (our prompt) being broken down into its simplest, basic ingredients upon entering the city. Instead of a complex "arugula and shrimp salad," we're left with a set of containers: "arugula leaves," "shrimp," "cherry tomatoes," "olive oil," "salt." Only in our case, the "ingredients" aren't food products, but tokens.

A token is not always an entire word. It can be a part of a word, a single character, or a whole word if it appears frequently. A useful rule of thumb is that one token generally corresponds to approximately 4 characters of plain English text. This roughly translates to ¾ of a word (i.e., 100 tokens ≈ 75 words). A special algorithm, the tokenizer, breaks down our original text into these minimal semantic units that exist in its "vocabulary." And then, each such token is assigned a unique numerical identifier—an ID.

To make this less abstract, try OpenAI's online tokenizer. Enter a phrase, for example "Controlled Hallucinations", and in real time you'll see how it breaks down into parts: ["Controlled", "Hall", "uc", "inations"]. Below them appear the tokens' numerical identifiers from the model's vocabulary: [162001, 14346, 1734, 15628].
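
If you'd rather see this in code, here is a minimal sketch using tiktoken, OpenAI's open-source tokenizer library. I'm assuming the o200k_base encoding here; every model family has its own vocabulary, so the pieces and IDs you get may differ slightly from the web demo above.

# A minimal tokenization sketch with the tiktoken library.
# The encoding name ("o200k_base") is an assumption for illustration;
# other models use other vocabularies, so the exact IDs will differ.
import tiktoken

enc = tiktoken.get_encoding("o200k_base")

text = "Controlled Hallucinations"
token_ids = enc.encode(text)                        # numeric IDs
tokens = [enc.decode([tid]) for tid in token_ids]   # human-readable pieces

print(tokens)     # the parts the phrase was split into
print(token_ids)  # their IDs in the vocabulary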

Here's what's most interesting. For the model, the words "cat" and "Cat" are two completely different tokens with different IDs. It doesn't understand that it's the same animal written differently; it simply sees [83827] and [3682, 1523]. Similarly, a word like "tokenization" may be split into several tokens if the whole word isn't in the tokenizer's vocabulary. This is a strict, merciless, and entirely formal process. Its task is not to understand meaning, but to unify the input data, transforming it into a sequence of numbers on which mathematical operations can be performed.

See? For the model, your beautiful and meaningful sentence no longer exists. For it, there's only a set of IDs, a vector of numbers. This is the entry ticket, the passport of our query, allowing it to pass customs control and enter the next, far more mysterious quarter of this city—a gigantic library of meanings, where these soulless numbers will begin to acquire weight and significance.


The Library of Meanings. Vector Representations (Embeddings)

Our prompt, broken down into faceless numerical IDs, has cleared customs. But what's next? A sequence of numbers like [162001, 14346, 1734, 15628] carries no meaning on its own. It's just an inventory number. To transform this number into something meaningful, our query is sent to the very heart of the digital city—a gigantic, unimaginably vast library. This is the stage of creating vector representations (embeddings).

Imagine a typical library. Books in it are arranged alphabetically. This is convenient for searching, but it tells you nothing about the content. The book 'Anna Karenina' will stand next to 'Analytical Geometry,' even though in terms of meaning, they are in different universes. Now imagine another library—a library of meanings. Here, books are arranged not alphabetically, but by content and context. All books on quantum physics are on one shelf, next to them are shelves on general relativity. Romance novels are grouped in one hall, and detective stories in the neighboring one. Moreover, the closer the books are in meaning, the closer they are placed to each other on the shelves.

This is precisely how the embeddings mechanism works. For each token in its vocabulary, the model stores a special "address"—a vector. This is not just a number, but a long, long array of numbers (e.g., 768, 4096, or even more elements), which represents the coordinates of that token in a multi-dimensional semantic space. When our token with ID 14346 ("Hall") reaches this stage, the model simply finds the corresponding row in its gigantic table-catalog and replaces the ID with this vector.

And this is where the real magic, based on pure geometry, begins. In this multi-dimensional space, words with similar meanings turn out to be neighbors. The vector for the word "king" will be located near the vector for "monarch," and the vector for "dog" will be near "hound" and "puppy." But what's most striking is that relationships between words also transform into geometric vectors. A classic example you've probably heard: vector("King") - vector("Man") + vector("Woman") ≈ vector("Queen")

This is not a metaphor. It's a real mathematical operation. The model, having been trained on gigantic volumes of text, has learned that the transition from "man" to "woman" is a specific displacement in this space, a certain vector. And by applying this same displacement to "king," we arrive at the point where "queen" is located.
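
To make the geometry tangible, here is a toy Python sketch with invented four-dimensional vectors. Real embeddings are learned and have hundreds or thousands of dimensions, so treat this purely as an illustration of the arithmetic, not as data from any actual model.

# Toy illustration of king - man + woman ≈ queen.
# The 4-dimensional vectors are invented for demonstration only.
import numpy as np

vectors = {
    "king":  np.array([0.9, 0.8, 0.1, 0.7]),
    "man":   np.array([0.1, 0.8, 0.1, 0.1]),
    "woman": np.array([0.1, 0.8, 0.9, 0.1]),
    "queen": np.array([0.9, 0.8, 0.9, 0.7]),
    "dog":   np.array([0.2, 0.1, 0.4, 0.9]),
}

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Apply the "man -> woman" displacement to "king"
result = vectors["king"] - vectors["man"] + vectors["woman"]

# The nearest word to the result (by cosine similarity) turns out to be "queen"
nearest = max(vectors, key=lambda word: cosine(result, vectors[word]))
print(nearest)  # queen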

This is where one of the main misconceptions lies. 
Myth: "The model understands the meaning of words, like a human." 
Reality: No, it doesn't. It operates with geometry. For it, "meaning" is simply a point in a multi-dimensional space, and "understanding" is the calculation of distances and directions between these points. This is pure, cold mathematics, not consciousness or empathy. The model doesn't know what royalty or gender differences are. It only knows that the coordinates of tokens associated with these concepts obey specific geometric laws.


The Heart of the Machine. The Attention Spotlight (Attention Mechanism)

So, our prompt has passed through tokenization and been transformed into a matrix of multi-dimensional vectors—coordinates in that gigantic library, imbued with mathematical "meaning." But for now this is just a static map of the terrain. How does the model know which of these vectors to look at first? How does it establish connections between, say, a "robot," a "suitcase," and the property "heavy"?

This is where a mechanism comes into play that can, without exaggeration, be called the heart or, if you like, the central nervous system of modern LLMs: the Attention Mechanism, the key innovation of the Transformer architecture, which I already mentioned in my breakdown of the MiniMax-M1 model's architecture. If embeddings are the dictionary, then attention is the grammar and logic that links words from this dictionary into meaningful constructions.

The best analogy that comes to mind is a spotlight on a dark theater stage. Imagine that all the words of your prompt, represented as vectors, are actors frozen on stage. When the model needs to generate the next word (play the next scene), it doesn't look at all actors simultaneously with the same intensity. Instead, it turns on a powerful spotlight that picks out from the darkness those who are most important for the current line. The spotlight's beam might be brighter on one actor and only lightly touch another. This brightness is the "weight" or "score" of attention.

Let's break down a classic example that perfectly illustrates this process. Take the sentence: "The robot couldn't lift the suitcase because it was too heavy." For a human, it's obvious that the pronoun "it" refers to the suitcase, not the robot. But how does a machine, for which this is just a sequence of vectors, understand this?

When the model reaches the word "it," the attention mechanism essentially starts asking the entire preceding context: "Hey, which of you should I pay the most attention to right now to understand what's being referred to?". To do this, it calculates "compatibility scores" (attention scores) between the vector of the word "it" and the vectors of all previous words: "The robot," "couldn't," "lift," "the suitcase," "because," "it." Thanks to the magic of vector arithmetic, which we discussed earlier, the vector for "suitcase," possessing the properties of an inanimate and liftable object, will receive the highest compatibility score with the context "was too heavy." The vector for "robot," which is the subject of the action in this sentence, will receive a much lower score. The attention spotlight will brightly illuminate "suitcase," and the model will understand that the subsequent narrative should be built around the properties of this particular object.

This "weighting" process occurs constantly, with the generation of every new word in the answer. The model re-evaluates the entire preceding text—both the original prompt and the already generated part of the answer—and decides which parts of the context are most relevant for predicting the next token.

Attention is the model's superpower for remembering context and not losing the thread of the narrative, even in very long dialogues. This is what distinguishes modern LLMs from their predecessors, which suffered from "amnesia" and would forget the beginning of a sentence by its end.


The Birth of the Answer. Word-by-Word Prediction

We left our prompt in the very heart of the machine, where the attention mechanism, like a spotlight, highlighted the most important semantic connections between word-vectors. The model now "understands" (in its own, mathematical sense) what refers to what. But how does a coherent answer emerge from this focused understanding? Here we approach the culmination of the entire process, the moment when abstract computations finally materialize into text.

And the first thing to realize is: the answer is not generated all at once. The model doesn't conceive of a phrase and then write it down. The process is more like a "complete the phrase" game or the operation of the world's most advanced T9. The model generates the answer autoregressively, meaning token by token, predicting the most probable continuation of the already existing text.

Let's simulate this process step-by-step.

  1. Input Data: The model received our prompt, transformed into a matrix of vectors: ["Explain", "attention", "mechanism", "in", "transformers", ...]
  2. First Prediction: This entire matrix passes through many layers of the Transformer. At the output of the last layer, the model generates not a word, but a gigantic array of numbers—the so-called logits. The size of this array is equal to the size of its entire vocabulary (e.g., 50,000+ tokens).
  3. Probability Calculation: These raw logits are passed through a Softmax function, which converts them into a probability distribution. Now, each token in the vocabulary has its own probability of becoming the next word in the sentence. For example:
    • Token "This" — probability 15%
    • Token "Mechanism" — probability 12%
    • Token "Attention" — probability 9%
    • ...
    • Token "Valera" — probability 0.00001%
  4. First Word Selection: The model places its bet and selects the most probable (by default) token. Let's say it's "This." The answer has begun to emerge.
  5. Autoregressive Cycle: And now for the most important part. This new token "This" is immediately appended to the end of the original context. In the next step, the model will analyze not just our prompt, but ["Explain", ..., "as", "if", "I'm", "10", "years", "old", "This"]. And the entire process repeats: a new pass through the layers, a new calculation of probabilities for the next word, a new selection. Then, this word is also added to the context, and so on, token by token, until the model predicts a special [END_OF_SEQUENCE] token, signaling the completion of the answer.
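
To make the cycle concrete, here is a schematic sketch of that loop. The fake_model function is a made-up stand-in that returns random logits over a tiny vocabulary; a real LLM computes those logits with its Transformer layers, but the surrounding cycle looks just like this.

# Schematic sketch of the autoregressive loop described above.
# fake_model() is a made-up stand-in returning random logits; a real LLM
# would compute them with its Transformer layers.
import numpy as np

rng = np.random.default_rng(42)
vocab = ["This", "mechanism", "attention", "is", "like", "a", "spotlight", "<END>"]

def fake_model(context):
    """Stand-in for the network: one logit per vocabulary token."""
    return rng.normal(size=len(vocab))

def softmax(logits):
    e = np.exp(logits - logits.max())
    return e / e.sum()

context = ["Explain", "attention", "to", "a", "10", "year", "old"]
for _ in range(20):                            # hard limit, like max_tokens
    probs = softmax(fake_model(context))       # step 3: probability distribution
    next_token = vocab[int(np.argmax(probs))]  # step 4: greedy pick
    if next_token == "<END>":                  # special end-of-sequence token
        break
    context.append(next_token)                 # step 5: feed it back in

print(" ".join(context))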

It is this step-by-step, cyclical process that allows us to debunk another popular myth. 
Myth: "The model invents or composes the answer." 
Reality: It makes the most probable statistical prediction for the next word, based on all the preceding context. Every word in the answer is not a flash of insight or a creative act, but the result of a complex mathematical probability calculation. There is no creativity, consciousness, or intention in the human sense involved. It is simply an incredibly powerful mechanism for predicting sequences, which has learned to mimic meaningful speech by analyzing trillions of examples from the internet.


The Director's Console. Controlling Creativity, Predictability, and... Everything Else

We've established that the birth of an answer is a cascade of statistical bets. At each step, the model looks at the probability distribution and selects the next token. By default, it aims to choose the most obvious, mathematically most probable option. But what if we could intervene in this process? What if we had a control panel that allowed us to influence exactly how the model makes its choice? Such a console exists, and it's much richer than it appears at first glance. Let's break down its key regulators.

The Creativity Dials: temperature, top_p, and top_k

  • temperature: This is the main regulator for "boldness."
    • Explanation: A low temperature makes token selection very strict and predictable, while a high temperature smoothes out probabilities, giving less obvious words a chance.
    • Recommendations:
      • 0.1–0.3 (The "Encyclopedist" Mode): Ideal for tasks requiring factual accuracy, code generation, or data extraction. Answers will be dry but reliable.
      • 0.7–0.9 (The "Conversationalist" Mode): The sweet spot for most applications. Suitable for chatbots, text writing, creative copywriting. Answers turn out lively yet coherent.
      • 1.0 and above (The "Mad Poet" Mode): For brainstorming, generating unconventional ideas, and poetry. Use with caution—there's a high risk of getting incoherent text.
  • top_p (nucleus sampling): A more refined tool for controlling creativity.
    • Explanation: It keeps only the smallest set of the most probable tokens whose cumulative probability reaches p and discards the rest. This helps avoid outright nonsensical options while preserving diversity.
    • Recommendations: The optimal value for most tasks is considered to be 0.9 or 0.95. An important nuance: generally, you use either temperature or top_p, as they solve similar problems using different methods.
  • top_k: The most straightforward of the trio.
    • Explanation: Simply limits the selection to the k most probable tokens.
    • Recommendations: Less popular than top_p, but easy to understand. A value around 40–50 is often encountered.
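
To see what these dials actually do to the numbers, here is a small sketch that reshapes one and the same toy distribution with temperature, top_k, and top_p. The five tokens and their logits are invented purely for illustration.

# How temperature, top_k and top_p reshape a toy distribution.
# The tokens and logits below are invented for illustration.
import numpy as np

tokens = ["This", "Mechanism", "Attention", "Imagine", "Valera"]
logits = np.array([3.0, 2.8, 2.5, 1.0, -5.0])

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# temperature: divide logits before softmax (low = sharper, high = flatter)
for t in (0.2, 0.8, 1.5):
    print(f"temperature={t}:", softmax(logits / t).round(3))

# top_k: keep only the k most probable tokens, then renormalize
k = 3
probs = softmax(logits)
top_k_probs = np.zeros_like(probs)
keep = np.argsort(probs)[-k:]
top_k_probs[keep] = probs[keep]
top_k_probs /= top_k_probs.sum()
print("top_k=3:", top_k_probs.round(3))

# top_p (nucleus): keep the smallest set whose cumulative probability >= p
p = 0.9
order = np.argsort(probs)[::-1]
cutoff = int(np.searchsorted(np.cumsum(probs[order]), p)) + 1
top_p_probs = np.zeros_like(probs)
top_p_probs[order[:cutoff]] = probs[order[:cutoff]]
top_p_probs /= top_p_probs.sum()
print("top_p=0.9:", top_p_probs.round(3))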

Fighting Parroting: Repetition Penalties

One of the most annoying traits of LLMs is their tendency to get stuck in loops. To combat this, we have two "penalty" parameters:

  • presence_penalty:
    • Explanation: Imposes a fixed "penalty" on any token that has already appeared in the text. This encourages the model to use more diverse vocabulary.
    • Recommendations: Small values in the range of 0.1 to 1.0 are typical. Even a value like 0.2 already noticeably improves the diversity of the answer.
  • frequency_penalty:
    • Explanation: The penalty amount depends on how often the token has already appeared. The more repetitions, the higher the penalty.
    • Recommendations: The range is similar—from 0.1 to 2.0, where 2.0 is already a very aggressive penalty that will almost certainly prevent any repetitions but might make the text unnatural.
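
Here is a toy sketch of how both penalties adjust the logits before sampling: a flat deduction once a token has appeared at all (presence) and a deduction that grows with every repetition (frequency). Treat the exact formula as an illustration of the idea rather than any particular provider's internals.

# Toy sketch of repetition penalties applied to logits before sampling.
# The formula is illustrative: a flat deduction for presence and a
# count-proportional deduction for frequency.
import numpy as np

vocab = ["the", "robot", "suitcase", "heavy", "lift"]
logits = np.array([2.0, 1.8, 1.5, 1.2, 1.0])

generated_so_far = ["the", "robot", "the", "the"]   # "the" x3, "robot" x1
counts = np.array([generated_so_far.count(t) for t in vocab])

presence_penalty = 0.5
frequency_penalty = 0.5

penalized = (
    logits
    - presence_penalty * (counts > 0).astype(float)  # flat penalty once used
    - frequency_penalty * counts                     # grows with each repeat
)

print(logits)     # [2.  1.8 1.5 1.2 1. ]
print(penalized)  # "the" drops sharply, "robot" a little, the rest unchanged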

Hard Limits and Stop Signals

These parameters do not affect creativity but set the boundaries for generation.

  • max_tokens:
    • Explanation: The maximum number of tokens the model will generate in the answer. Your emergency brake.
    • Recommendations: The value completely depends on the task. For a headline, 30 is enough; for a paragraph, 200; for a full answer or a short article, 1024 or 2048.
  • Stop Sequence:
    • Explanation: A string upon whose appearance generation will immediately cease.
    • Recommendations: There's no numerical value here. The only recommendation is: choose a sequence that is guaranteed not to appear in a meaningful answer. Examples: "\n\n\n", "###", "<|endoftext|>", or for a chatbot—"\nUser:".

Having understood this, you cease to be a passive user. You receive the keys to the machine. You can make the model an accurate assistant, a creative partner, or a generator of random ideas. Now you are not just an audience member; you are a director with a full control panel, capable of customizing every aspect of the upcoming performance.


The Last Mile. From Theory to Code with a Gemini API Example

We've grasped the theory, virtually turned the dials on the director's console from the previous act, and now understand how to influence the model's "character." But all this complex mechanics remains an abstraction until we find a way to access it. That way is an API (Application Programming Interface). All the power, all the complexity, and all our control levers become real and tangible precisely through it.

To continue our analogies, a large language model is an incredibly complex and powerful engine built deep within Google or OpenAI. And an API is the dashboard, ignition key, and pedals, which are exposed specifically for us, the developers. They allow us to start this engine, set its RPMs, and direct its power where needed. Let's cover this "last mile" and see how to transform our theory into a working Python script using the Gemini API as an example.

Step 1: The Key to the Kingdom (API Key)

Before starting the machine, we need a key. In the world of APIs, such a key is the API key. It's your unique secret pass into the model's world, used for authenticating your requests. You can obtain it, for example, in Google AI Studio with just a couple of clicks.

But here I want to emphasize security. An API key is like the key to a very expensive car. Don't leave it in plain sight. The most common mistake for beginners is to embed the key directly into the code. This is absolutely forbidden; otherwise, upon the first publication of your code to GitHub, your key (and your budget) will become public knowledge. The correct way is to store it in environment variables or use specialized services for secret management.
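
In practice, the environment-variable approach looks something like this minimal sketch: the key lives in your shell (or in a .env file), and the code only reads it, never contains it.

# Minimal sketch: the key is read from the environment, never hard-coded.
# In your shell (or a .env file loaded with python-dotenv):
#   export GOOGLE_API_KEY="your-secret-key"
import os

api_key = os.environ.get("GOOGLE_API_KEY")
if api_key is None:
    raise RuntimeError("GOOGLE_API_KEY is not set - refusing to start.")

# The google-genai SDK can also pick this variable up on its own,
# as we'll see in the next step.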

Step 2: Connecting to the Machine (SDK)

With the key in hand, we need to "connect" to the model. Theoretically, you could do this "directly"—by sending raw HTTP requests to the desired endpoint. But that's slow and inconvenient. It's much simpler and more efficient to use an official library—an SDK (Software Development Kit)—which handles all that messy work for us. For Gemini, this is the google-genai library.

It is installed with a single command:

pip install google-genai

The SDK provides us with convenient and intuitive Python objects and methods, abstracting away all the complexity of network interaction.

Step 3: First Startup (Simple Call)

We have the key, the tools are installed. It's time to start the engine. According to the documentation, the new syntax has become even more intuitive. Let's write our "Hello, World!" in the world of large language models.

from google import genai

# SDK will automatically pick up your key if it's stored
# in the GOOGLE_API_KEY environment variable
client = genai.Client()

# Send our first prompt, specifying the model and content
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents="Explain how AI works?"
)

# Print the text part of the response
print(response.text)

That's it. Behind these few lines of code lies the entire journey we described: tokenization, embeddings, the functioning of the attention mechanism, and word-by-word generation.

Step 4: Using the Director's Console (Passing Parameters)

And now—the most interesting part. Let's close the loop of our narrative and use that very director's console from the previous act. All the parameters we discussed are passed to the model via a special configuration object.

from google import genai
from google.genai import types

client = genai.Client()

# Create a configuration, using knowledge from the "Director's Console"
config = types.GenerateContentConfig(
    temperature=0.1,  # Enable "Encyclopedist" mode for a precise answer
    max_output_tokens=1024
)

# Call the model, passing our prompt and configuration
response = client.models.generate_content(
    model="gemini-2.5-flash",
    contents=["Explain how AI works?"],
    config=config
)

print(response.text)

See? temperature=0.1 is not just an abstract number, but a concrete line in your code that directly controls the model's behavior. You just instructed it to be as precise and restrained as possible. We've traveled the full path from the vague concept of "creativity" to a real tool that can be imported, configured, and applied. We've learned to drive. Now we are ready to ask truly complex questions about the future of these technologies.


From Ballet to the Horizon: What's Next?

We have completed our journey. We've opened the "magical mailbox," traced our prompt's path from simple text to a complex matrix of vectors, seen how the attention spotlight extracts meaning, and even sat at the director's console, controlling the model's behavior through code. Now, beyond the magic, we see an elegant, albeit incredibly complex, mathematical ballet.

But understanding how it works is only the first step. The real work begins when we ask the question: "So what now?". Having understood the mechanics, we can finally formulate the main questions about the limits, risks, and future of this approach. And I want to leave you with three of them, transforming this article from an instruction into food for thought.

  1. The Hallucination Problem. Now we understand why models "lie." They don't lie in the human sense—they make a statistically probable but, unfortunately, incorrect prediction. This is not malicious intent, but a side effect of their nature. The main question is: Can we "cure" this propensity for fabrication without killing the creativity that arises from the same probabilistic freedom? One of the most promising approaches to solving this problem is the use of techniques like RAG, which "ground" the model in concrete facts, preventing it from drifting into fantasies, as I discussed in detail in the article on creating smart search for notes.
  2. The "Parroting" Problem. If a model is, in essence, a "stochastic parrot" that incredibly skillfully combines and repeats what it has read on the internet, can it truly create new knowledge? Or are we doomed to receive only brilliant remixes of existing ideas? Even more importantly, such an approach forces the model to reproduce and amplify all human biases embedded in the training data. This leads to real problems, as in the case of AI bias in recruitment, which I analyzed in a previous piece.
  3. The Future of Architectures. We've dissected today's "engine"—the Transformer. But the technological race doesn't stand still. Tomorrow, it might be replaced or augmented by hybrid and multimodal systems. For instance, architectures like Mixture-of-Experts (MoE), which use a "team" of highly specialized models instead of a single monolithic "brain," are already changing the game. What will this change in our prompt's journey? What will its path through such a distributed system look like?

We started with "how does it work?" and ended with "what does it mean for us?".

And in this, perhaps, lies the true magic of large language models: they have transformed us from passive users into active participants in a dialogue about the nature of intelligence. Every prompt we make is not just a query, but an experiment. Every successful or unsuccessful generation provides data for analysis, not just for the machine, but for our own expectations. They compel us to formulate thoughts more precisely, to critically evaluate information, and to ask questions about the limits of what's possible. And the main lesson of this journey is that understanding these systems means gaining not just a powerful tool, but a new, albeit very strange, dialogue partner that helps us better understand not only machines but ourselves.

Stay curious.
