Here is an excerpt from an article written by Rama Ramakrishnan for MIT Sloan Management Review.
Illustration Credit: Carolyn Geason-Beissel/MIT SMR | Getty Images
* * *
Business leaders making decisions involving AI need to know the essentials of how large language models and the GenAI tools based on them operate. Get up to speed on these commonly misunderstood topics.
In my work at MIT Sloan School of Management, I have taught the basics of how large language models (LLMs) work to many executives during the past two years.
Some people posit that business leaders neither want to nor need to know how LLMs and the generative AI tools that they power work — and are interested only in the results the tools can deliver. That is not my experience. Forward-thinking leaders care about results, of course, but they are also keenly aware that a clear and accurate mental model of how LLMs work is a necessary foundation for making sound business decisions regarding the use of AI technologies in the enterprise.
In this column, I share 10 questions I am often asked about commonly misunderstood topics, along with their answers. You don’t need to read a book on each of these topics, nor do you have to get into the technical weeds, but you do need to understand the essentials. Consider this list a useful reference for yourself and for your teams, colleagues, or customers the next time one of these questions comes up in discussion. I have heard from my executive-level students at MIT that this knowledge is especially helpful as a reality check in conversations with technology partners.
10 Essential Questions and Answers on GenAI and LLMs
[Here are the first four.]
1. I understand that LLMs generate output one piece of text at a time. How does the LLM “decide” when to stop?
Put another way, when does the LLM decide to give the user the final answer to a question? The decision to stop generating is determined by a combination of what the LLM predicts and the rules set by the software system running it. It is not a choice made by the LLM alone. Let’s examine in detail how this works.
When an LLM answers a question, it produces text one small piece at a time. The technical name for a piece is token.1 Tokens can be words or parts of words. At each step, the LLM predicts which token should come next based on the prompt and what it has already written so far.2
An external system runs the LLM in a “generate the next token; append it to the input; generate the next token” loop until a stopping condition is triggered. When this happens, the system stops asking the LLM for more tokens and shows the result to the user.
Many stopping conditions are used in practice. An important one involves a special “end of sequence” token that (informally) means “end of answer.” This token is used in the training process to denote the end of individual training examples, so during training, the LLM learns to predict this special token at the point where its answer is complete. Other stopping conditions include (but are not limited to) reaching a limit on the number of tokens generated so far, or the generation of a user-defined pattern called a stop sequence.
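To make the loop concrete, here is a minimal sketch in Python. The model call is a stand-in that returns a canned answer, and the token values and limits are invented for illustration; the three stopping conditions mirror the ones described above.

```python
# A minimal sketch of the "generate, append, generate again" loop described above.
# The model call is a stand-in; the three stopping conditions mirror the text.

EOS_TOKEN = "<eos>"        # special "end of sequence" token
MAX_NEW_TOKENS = 50        # cap on how many tokens may be generated
STOP_SEQUENCE = "\n\n"     # example of a user-defined stop sequence

CANNED_ANSWER = ["Generation", " stops", " here", ".", EOS_TOKEN]

def predict_next_token(prompt_tokens, generated):
    # Stand-in for the LLM: a real system would run a forward pass over the
    # prompt plus everything generated so far and return the predicted token.
    return CANNED_ANSWER[min(len(generated), len(CANNED_ANSWER) - 1)]

def generate(prompt_tokens):
    generated = []
    while True:
        token = predict_next_token(prompt_tokens, generated)
        if token == EOS_TOKEN:                            # model signals "end of answer"
            break
        generated.append(token)
        if len(generated) >= MAX_NEW_TOKENS:              # token-count limit reached
            break
        if "".join(generated).endswith(STOP_SEQUENCE):    # stop sequence produced
            break
    return "".join(generated)

print(generate(["Why", " does", " it", " stop", "?"]))    # -> "Generation stops here."
```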
When we use the web version of a tool like ChatGPT as consumers, we don’t see this process — only the finished text. But when your organization starts building its own LLM apps, developers can adjust these stopping rules and other parameters themselves, and these choices can affect answer completeness, cost, and formatting.
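For example, a developer building on OpenAI’s API might set these rules directly when calling the model. This is just one vendor’s interface, shown only as a sketch; parameter and field names vary across providers and SDK versions.

```python
# One way a developer might set stopping rules, using OpenAI's Python SDK.
# Parameter and field names vary by vendor and SDK version.
from openai import OpenAI

client = OpenAI()  # assumes the OPENAI_API_KEY environment variable is set

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[{"role": "user", "content": "Summarize our Q3 results in three bullet points."}],
    max_tokens=200,    # token-count limit: caps answer length (and cost)
    stop=["###"],      # user-defined stop sequence: generation halts if this pattern appears
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "stop" = natural end; "length" = hit the token cap
```

Lowering the token cap, for instance, can truncate answers mid-sentence, which is one reason two apps built on the same model can behave quite differently.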
The important point here is that the “decision” to stop is an interaction between the LLM’s token predictions and external control logic, not a decision made by the LLM.
2. If the LLM makes a mistake and I correct it, will it update itself immediately?
No, the LLM will not update itself immediately if you correct it. If you are using tools like ChatGPT or Claude, your correction might help improve future versions of the model if your chat history is included in a future training run, but those updates happen over weeks or months, not instantly.
Some apps, such as ChatGPT, have a memory feature that can update in real time to remember personal information like your name, preferences, or location. However, this memory is used for personalization and does not appear to be used for correcting the model’s factual knowledge or reasoning errors.
3. If the LLM repeatedly generates one token at a time based on the current conversation, why have I seen it use information from a prior conversation (say, from a week ago) in the response?
LLMs generate responses one token at a time, based on the input they are given in that conversation. By default, they don’t use past conversations. However, as noted in the response above, some LLM applications have a memory feature that lets them store information from earlier chats — such as your name, interests, preferences, ongoing projects, or frequently queried topics.
When you start a new chat, relevant pieces of this stored memory may be automatically added to the prompt behind the scenes. This means that the model is not actually recalling past chats in real time; instead, it is being fed reminders of that information as part of the input. That’s how it can appear to “remember” things from a week ago.
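A simplified sketch of this idea in Python: stored memory items are added to the prompt for a new chat, so the model “sees” them as input rather than recalling past conversations. The memory items and the assembly logic here are invented for illustration; vendors have not disclosed their exact methods.

```python
# Illustrative only: how remembered facts might be prepended to a new chat's prompt.
stored_memory = [
    "User's name is Priya.",
    "User prefers concise, bulleted answers.",
    "User is working on a supply-chain analytics project.",
]

def build_prompt(user_message, memory_items):
    memory_block = "\n".join(f"- {item}" for item in memory_items)
    system_prompt = (
        "You are a helpful assistant. "
        "Known facts about the user from earlier chats:\n" + memory_block
    )
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_message},
    ]

messages = build_prompt("Draft a status update for my project stakeholders.", stored_memory)
print(messages[0]["content"])  # the system prompt now carries the remembered facts
```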
The details of what is stored and when it is used vary by vendor, and the exact methods haven’t been disclosed. It is possible that a technique like retrieval-augmented generation (RAG) is being used to decide which memory items to include in a new prompt. Many platforms allow users to view, edit, or turn off memory entirely. In the ChatGPT app, for example, this can be accessed via Settings > Personalization.
RAG, if you are not familiar with it, is a technique for giving the LLM access to a specific set of proprietary data at the time it generates a response. This helps the LLM ground its answers in that data and provide more helpful responses.
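Here is a toy sketch of the RAG idea: score a set of stored documents (or memory items) against the new question, keep the most relevant ones, and include them in the prompt. Real systems typically use vector embeddings and a vector database; the word-overlap scoring and sample documents below are stand-ins for illustration only.

```python
# Toy retrieval-augmented generation (RAG) sketch: rank stored items by relevance
# to the question, then add the top matches to the prompt as context.
documents = [
    "Refund policy: enterprise customers may cancel within 30 days.",
    "Security: all customer data is encrypted at rest and in transit.",
    "Pricing: the Pro plan is billed annually at $40 per seat.",
]

def relevance(question, doc):
    # Stand-in for embedding similarity: count shared lowercase words.
    return len(set(question.lower().split()) & set(doc.lower().split()))

def build_rag_prompt(question, docs, top_k=2):
    ranked = sorted(docs, key=lambda d: relevance(question, d), reverse=True)
    context = "\n".join(ranked[:top_k])
    return (
        "Answer using only the context below.\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

print(build_rag_prompt("What is our refund policy for enterprise customers?", documents))
```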
4. I understand that LLMs have a training cutoff date, and they don’t “know” about things that happened after that date. However, they can answer questions about events that happened after the cutoff date. How does this work?
When you ask a question about something that happened after an LLM’s training cutoff date, the model itself doesn’t “know” about the event unless it has access to up-to-date information. Some systems — like ChatGPT with browsing enabled — can perform live web searches to help answer such questions.
In those cases, the LLM may generate a search query based on your question, and a separate part of the system (outside the model itself) carries out the search. The results are then sent back to the LLM so that it can generate an answer based on that fresh information. Not all LLMs or applications have this capability, though. Without access to live data, a model might still generate an answer based on its training data, which doesn’t reflect real-world updates.
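Here is a simplified sketch of that pattern. Both helper functions are hypothetical stand-ins; real products (such as ChatGPT with browsing) work roughly this way, but their internals have not been disclosed.

```python
# Sketch of the "model proposes a query, an external component runs the search,
# results are fed back to the model" loop. Both helpers are stand-ins.

def llm(prompt):
    # Stand-in for a call to the language model; returns a canned string here.
    return f"[model output for a prompt of {len(prompt)} characters]"

def web_search(query):
    # Stand-in for the external search component (outside the model itself).
    return f"[top results for: {query}]"

def answer_with_search(question):
    # Step 1: the model turns the user's question into a search query.
    query = llm(f"Write a short web search query for: {question}")
    # Step 2: a separate system component executes the search.
    results = web_search(query)
    # Step 3: the results are sent back to the model as fresh context.
    return llm(
        "Using the search results below, answer the question.\n"
        f"Search results:\n{results}\n\nQuestion: {question}"
    )

print(answer_with_search("Who won yesterday's match?"))
```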
* * *
References
1. On average, a token is about three-fourths of a word, and modern LLMs have a vocabulary of tens of thousands to over 100,000 tokens. You can enter different questions into OpenAI’s Tokenizer tool and see how a word is tokenized to gain a deeper understanding.
2. Strictly speaking, given an input, the LLM generates a probability (that is, a number between 0.0 and 1.0) for each token in its vocabulary. You can think of the probability for a token as a measure of its suitability to be the next token. Across all the tokens in the vocabulary, the probabilities add up to 1.0. The next token is selected based on these probabilities using a variety of developer-controllable strategies (such as picking the token with the highest probability or selecting a token randomly in proportion to its probability).
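As a small illustration of these selection strategies, here is a sketch with a toy probability distribution over a five-token vocabulary (the tokens and numbers are invented).

```python
# Greedy selection vs. sampling in proportion to probability, on toy numbers.
import random

vocab = ["Paris", "London", "Rome", "Berlin", "Madrid"]
probs = [0.60, 0.20, 0.10, 0.06, 0.04]   # probabilities sum to 1.0

greedy_choice = vocab[probs.index(max(probs))]            # pick the highest-probability token
sampled_choice = random.choices(vocab, weights=probs)[0]  # pick randomly, weighted by probability

print(greedy_choice, sampled_choice)
```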