Comprehending the Generative AI Text Generation Process

Harshal Kothawade
6 min read · Jan 24, 2024

In the ever-evolving landscape of artificial intelligence, one remarkable innovation stands out — Text-Based Generative AI. This cutting-edge technology holds immense importance in various domains, revolutionizing the way we interact with machines and leveraging the power of language for unprecedented applications. Text-Based Generative AI excels in understanding and generating human-like text, enabling machines to comprehend and respond to natural language input.

However, delving into how these generative AI models actually produce text is truly intriguing. Understanding the underlying mechanisms offers a fascinating glimpse into the intricacies of text generation. In this article, we will unpack how these models operate, step by step.

1. Model Vocabulary — the set of words or tokens that the language model has been trained on and can recognize. This vocabulary covers the range of words and expressions present in the data the model processed during training. Since we have no predetermined set of fixed words here, we treat individual characters as the units of the vocabulary.

# Given that there is no predetermined set of fixed words, we are treating individual characters as the units within the vocabulary
# Consider the following set of characters as the vocabulary.
# Experiment: manually add a few characters to the vocab list, such as digits or lowercase letters
vocab = ['\n', ' ', '!', '"', '#', ',', '-', '.', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
vocab_size = len(vocab)
print(vocab, vocab_size)
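In practice, the vocabulary is usually derived from the training corpus itself rather than written by hand. Here is a minimal sketch of that approach; the 'text' string is only a placeholder corpus for illustration, not part of the demonstration above.

# Sketch: building the vocabulary from a corpus instead of hard-coding it
# 'text' is a placeholder for whatever training text is available
text = "HELLO WORLD.\nTHIS IS A TINY CORPUS!"
vocab = sorted(set(text))   # unique characters, sorted for a stable ordering
vocab_size = len(vocab)
print(vocab, vocab_size)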

2. Encoding and Decoding — Encoding converts input text into a format the model can understand, typically through tokenization, where words or subword units are represented as numerical tokens. Decoding is the process of generating text in a coherent and contextually appropriate manner: the model refines its predictions based on the context, which is updated with each generated token. In summary, encoding transforms input text into a form suitable for the model, while decoding generates meaningful, contextually appropriate output from the encoded information.

# Creating dictionaries based on the vocab token and its index for encoding and decoding
str_to_int = {ch:i for i, ch in enumerate(vocab)}
print("str_to_int: ",str_to_int)
int_to_str = {i:ch for i, ch in enumerate(vocab)}
print("int_to_str: ",int_to_str)

encode = lambda s: [str_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_str[c] for c in l])

# Output:
# str_to_int: {'\n': 0, ' ': 1, '!': 2, '"': 3, '#': 4, ',': 5, '-': 6, '.': 7, 'A': 8, 'B': 9, 'C': 10, 'D': 11, 'E': 12, 'F': 13, 'G': 14, 'H': 15, 'I': 16, 'J': 17, 'K': 18, 'L': 19, 'M': 20, 'N': 21, 'O': 22, 'P': 23, 'Q': 24, 'R': 25, 'S': 26, 'T': 27, 'U': 28, 'V': 29, 'W': 30, 'X': 31, 'Y': 32, 'Z': 33}
# int_to_str: {0: '\n', 1: ' ', 2: '!', 3: '"', 4: '#', 5: ',', 6: '-', 7: '.', 8: 'A', 9: 'B', 10: 'C', 11: 'D', 12: 'E', 13: 'F', 14: 'G', 15: 'H', 16: 'I', 17: 'J', 18: 'K', 19: 'L', 20: 'M', 21: 'N', 22: 'O', 23: 'P', 24: 'Q', 25: 'R', 26: 'S', 27: 'T', 28: 'U', 29: 'V', 30: 'W', 31: 'X', 32: 'Y', 33: 'Z'}
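A quick round trip confirms that the two functions are inverses of each other. The input string here is arbitrary, as long as every character appears in the vocab.

# Round-trip example: encode a string into indices, then decode it back
tokens = encode("HELLO WORLD")
print(tokens)          # [15, 12, 19, 19, 22, 1, 30, 22, 25, 19, 11]
print(decode(tokens))  # HELLO WORLD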

3. Creating embedding vectors — The purpose of the ‘nn.Embedding’ layer is to create a dense vector representation (embedding) for each unique index. Instead of representing words or categorical values as single integers, an embedding layer maps each index to a high-dimensional vector. Although we are not training any model here, these embedding vectors are learnable parameters: during training, the neural network would adjust their values based on the task at hand.

# Initializing embedding randomly for this exercise
# Experiment: Here any existing character level embedding can be used for next character prediction
self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)
Figure: Bigram count matrix
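To make the lookup concrete, here is a standalone sketch (outside the class, reusing the vocab and ‘str_to_int’ defined above): indexing the table with a character’s index returns a vector of length vocab_size, which this model treats directly as logits over the next character, conceptually one row of the bigram matrix pictured above.

# Standalone sketch of the embedding lookup (assumes vocab_size and str_to_int from above)
import torch
import torch.nn as nn

token_embedding_table = nn.Embedding(vocab_size, vocab_size)

idx = torch.tensor([str_to_int['A']])   # index of the character 'A'
logits = token_embedding_table(idx)     # shape: (1, vocab_size), one score per possible next character
print(logits.shape)                     # torch.Size([1, 34])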

4. Generating tokens — To generate tokens, two essential elements are required: a starting token (or context) and the number of tokens to generate. Typically, models incorporate a <SOS> (start of sequence) token and predict an <EOS> (end of sequence) token during next-word prediction. In the absence of such an advanced model, we opt for a simpler approach: a context token (e.g., ‘\n’) and the desired number of new tokens serve as the parameters for our generation process.

# Initialize model with vocab_size
model = BigramLanguageModel(vocab_size)

# Create starting point for the model to generate tokens
# Use the '\n' token at index 0 as the start token
# Experiment: provide a random starting index instead, e.g. torch.randint(vocab_size, (1, 1), dtype=torch.long)
context = torch.zeros((1,1), dtype=torch.long)

# Number of tokens to be generated
# Experiment: Try changing the max_new_tokens
max_new_tokens = 10

generated_index = model.generate(context, max_new_tokens=max_new_tokens)
generated_chars = decode(generated_index[0].tolist())
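Under the hood, ‘generate’ samples each next character from a probability distribution with ‘torch.multinomial’ (visible in the complete code below), which is why repeated runs produce different text. A greedy alternative would always pick the highest-probability character. Here is a small sketch of the difference; the toy distribution is made up purely for illustration.

# Sketch: sampling vs. greedy selection from the same toy distribution
import torch

probs = torch.tensor([[0.1, 0.7, 0.2]])

sampled = torch.multinomial(probs, num_samples=1)    # index 1 most of the time, but 0 or 2 are possible
greedy = torch.argmax(probs, dim=-1, keepdim=True)   # always index 1
print(sampled, greedy)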

It’s important to note that this model produces random text because it has not undergone any training and its embeddings are initialized randomly. The primary goal of this model is to provide a straightforward illustration of the mechanics of text generation rather than to generate meaningful content.
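If we did want the output to reflect real character statistics without a full training loop, one option (purely a sketch, not part of the demonstration) would be to fill the embedding table with log bigram counts gathered from a corpus, in the spirit of the bigram count matrix shown earlier. The ‘corpus’ string below is a made-up placeholder, and the snippet relies on the ‘model’, ‘str_to_int’, and ‘vocab_size’ objects defined above.

# Hedged sketch: filling the embedding table with log bigram counts from a corpus
# 'corpus' is only a placeholder; every character must already exist in the vocab
import torch

corpus = "THE QUICK BROWN FOX JUMPS OVER THE LAZY DOG.\n"
counts = torch.zeros(vocab_size, vocab_size)
for ch1, ch2 in zip(corpus, corpus[1:]):
    counts[str_to_int[ch1], str_to_int[ch2]] += 1

with torch.no_grad():
    # +1 smoothing avoids log(0); after this, sampling roughly follows the corpus statistics
    model.token_embedding_table.weight.copy_((counts + 1).log())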

The complete code below serves as a demonstration:

import torch
import torch.nn as nn
from torch.nn import functional as F

# Given that there is no predetermined set of fixed words, we are treating individual characters as the units within the vocabulary
# Consider the following set of characters as the vocabulary.
# Experiment: manually add a few characters to the vocab list, such as digits or lowercase letters
vocab = ['\n', ' ', '!', '"', '#', ',', '-', '.', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z']
vocab_size = len(vocab)

# Creating dictionaries based on the vocab token and its index for encoding and decoding
str_to_int = {ch:i for i, ch in enumerate(vocab)}
int_to_str = {i:ch for i, ch in enumerate(vocab)}

encode = lambda s: [str_to_int[c] for c in s]
decode = lambda l: ''.join([int_to_str[c] for c in l])

# Class for generating next token from current input
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Initializing embedding randomly for this exercise
        # Experiment: Here any existing character level embedding can be used for next character prediction
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    # generate next character based on index (input characters) and max_new_tokens (how many tokens to generate)
    def generate(self, index, max_new_tokens):
        for _ in range(max_new_tokens):
            # Step 1: Get logits for the current index
            logits = self.token_embedding_table(index)
            # Step 2: Select the logits for the last position in the sequence
            logits = logits[:, -1, :]
            # Step 3: Apply softmax to obtain probabilities
            probs = F.softmax(logits, dim=-1)
            # Step 4: Sample the next index from the probability distribution
            index_next = torch.multinomial(probs, num_samples=1)
            # Step 5: Concatenate the newly sampled index to the sequence
            index = torch.cat((index, index_next), dim=1)
        # Step 6: Return the generated sequence
        return index

# Initialize model with vocab_size
model = BigramLanguageModel(vocab_size)

# Create starting point for the model to generate tokens
# Use the '\n' token at index 0 as the start token
# Experiment: provide a random starting index instead, e.g. torch.randint(vocab_size, (1, 1), dtype=torch.long)
context = torch.zeros((1,1), dtype=torch.long)

# Number of tokens to be generated
# Experiment: Try changing the max_new_tokens
max_new_tokens = 10

generated_index = model.generate(context, max_new_tokens=max_new_tokens)
generated_chars = decode(generated_index[0].tolist())
print(f"generated_chars: '{generated_chars}'")

In summary, this walkthrough illustrates the construction and operation of a basic bigram language model for character-level token generation using PyTorch. It involves setting up a vocabulary, defining encoding/decoding functions, and building a model that predicts the next character from the current one using a (here randomly initialized) bigram embedding table. The code generates a sequence of characters and prints the result.
