I’ve recently been working to improve my understanding of how Intelligent Document Processing (IDP) leverages different types of AI to extract data from documents. We’ve been using ABBYY FlexiCapture for many years and are now transitioning our solutions to ABBYY Vantage. With all the recent discussion around AI in IDP tools, I thought it best to educate myself more deeply on how our tools utilise it.
ABBYY FlexiCapture has incorporated early forms of AI for a long time, including:
- Machine learning
- Deep learning
- Neural networks
- Transformers
During my research, I came across the excellent YouTube series Neural Networks by 3Blue1Brown. One particular video in the series, ‘Transformers (how LLMs work) explained visually’, really helped me understand some of the key concepts. Here are my takeaways, and I highly recommend watching the series if you want to learn more.
What does GPT stand for?
The first snippet of information I learned was the meaning behind ‘GPT’ in ChatGPT:
- G = Generative
- P = Pre-trained
- T = Transformer
The key term here is Transformer, as ‘generative’ and ‘pre-trained’ are relatively self-explanatory. Transformers are the backbone of Large Language Models (LLMs) such as ChatGPT and also play a crucial role in the deep learning methods that ABBYY has traditionally used in its IDP tools.
The origin of Transformers
3Blue1Brown explains that the first Transformer model was originally developed by Google as a translation tool—converting text from one language to another. However, Transformers have since become the foundation for processing various data types, such as text-to-image and text-to-speech conversion.
How does a generative Transformer like ChatGPT create responses?
At a high level, ChatGPT generates responses by predicting the next word in a sequence. It does this through the following process:
- It breaks the input text down into tokens, which represent words or parts of words, and each token is then mapped to a vector: a long list of numbers that encodes its meaning.
- Some words are split into multiple tokens. For example, “cleverest” becomes “clever” plus “est”: the root carries the core meaning, and “est” modifies it.
- Tokens are also the unit used to measure input and output costs in generative AI. Since tokens aren’t always full words, it is the token count, not the word count, that determines the computational power required, and consequently, the cost. (A toy tokeniser is sketched after this list.)
- The vectors then pass repeatedly through two kinds of blocks in the model:
- Attention blocks, which let the vectors pass information between one another, so that the numbers representing each word are updated by the words around it; this is how the model links words together by meaning (a numerical sketch follows this list).
- Multilayer perceptron (MLP) blocks, which analyse each vector by asking questions such as:
- Is this a noun?
- Is this in English?
- Is this an amount?
- Etc.
- The model then predicts the most probable next token by assigning a probability to every token it knows and selecting from a shortlist of the most likely candidates.
- This process is repeated: the chosen token is appended to the sequence, and the model runs again on the previous list of vectors along with the newly added one (a minimal version of this loop is sketched below).
- This looping mechanism is what makes ChatGPT “talk” as words appear on the screen in real time.
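To make the token idea concrete, here is a minimal sketch in Python of a greedy subword tokeniser. Everything in it (the vocabulary, the splitting rule, the embedding numbers) is invented for illustration; real systems such as ChatGPT use learned byte-pair-encoding vocabularies with tens of thousands of entries and vectors with thousands of dimensions.

```python
# A toy greedy subword tokeniser. The vocabulary and embeddings below are
# invented for illustration; real LLMs learn both from huge text corpora.

# Tiny vocabulary of known subword units.
VOCAB = ["clever", "est", "the", "er", "c", "l", "e", "v", "r", "s", "t"]

# A made-up embedding table: each token maps to a short vector of numbers.
EMBEDDINGS = {token: [round(0.1 * i + 0.01 * len(token), 2) for i in range(4)]
              for token in VOCAB}

def tokenise(word: str) -> list[str]:
    """Split a word into tokens by greedily matching the longest vocab entry."""
    tokens = []
    while word:
        for unit in sorted(VOCAB, key=len, reverse=True):
            if word.startswith(unit):
                tokens.append(unit)
                word = word[len(unit):]
                break
        else:
            raise ValueError(f"cannot tokenise remainder: {word!r}")
    return tokens

for word in ["clever", "cleverest"]:
    tokens = tokenise(word)
    vectors = [EMBEDDINGS[t] for t in tokens]
    print(word, "->", tokens, "->", vectors)

# "cleverest" splits into ['clever', 'est'], so it costs two tokens
# where "clever" costs one; each token then gets its own vector.
```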
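The attention step can also be sketched numerically. Below is a bare-bones version of scaled dot-product attention; the input values are invented, and real attention blocks learn separate query, key, and value projection matrices and run many attention “heads” in parallel. The core mechanism is the same, though: each vector is updated as a weighted mix of the others, with the weights derived from how strongly the vectors match.

```python
import numpy as np

# Three token vectors (4 numbers each), invented for illustration.
# In a real model these come from the embedding table plus position info.
x = np.array([
    [0.9, 0.1, 0.0, 0.2],   # e.g. "clever"
    [0.8, 0.2, 0.1, 0.1],   # e.g. "est"
    [0.0, 0.9, 0.7, 0.3],   # e.g. "cat"
])

def softmax(scores: np.ndarray) -> np.ndarray:
    """Turn raw scores into probabilities that sum to 1 along each row."""
    e = np.exp(scores - scores.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Simplified self-attention: queries, keys and values are the vectors
# themselves (real attention blocks learn Wq, Wk, Wv projections).
d = x.shape[-1]
scores = x @ x.T / np.sqrt(d)   # how strongly each vector matches the others
weights = softmax(scores)       # one row of mixing weights per token
updated = weights @ x           # each vector becomes a weighted mix

print(np.round(weights, 2))     # similar vectors ("clever"/"est") mix strongly
```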
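Finally, the looping mechanism itself. In the sketch below the entire Transformer is replaced by a stand-in function that returns made-up scores, because the point is the outer loop: score every candidate token, turn the scores into probabilities, pick from a shortlist of the most likely options, append the choice, and repeat. Everything inside fake_transformer is invented; in ChatGPT that call is the full stack of attention and MLP blocks.

```python
import math
import random

VOCAB = ["The", "cat", "sat", "on", "the", "mat", "."]

def fake_transformer(sequence: list[str]) -> list[float]:
    """Stand-in for the real model: one made-up score per vocab token.
    In a real LLM this is where the attention and MLP blocks run."""
    # Invented rule: favour the token that follows the last one in VOCAB,
    # so the toy output reads roughly like a sentence.
    pos = VOCAB.index(sequence[-1])
    return [3.0 - abs(i - (pos + 1)) for i in range(len(VOCAB))]

def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into probabilities that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

sequence = ["The"]
for _ in range(5):
    probs = softmax(fake_transformer(sequence))
    # Shortlist: keep the 2 most probable tokens, then sample between them.
    shortlist = sorted(zip(VOCAB, probs), key=lambda p: p[1], reverse=True)[:2]
    tokens, weights = zip(*shortlist)
    next_token = random.choices(tokens, weights=weights)[0]
    sequence.append(next_token)   # feed the choice back in and loop
    print(" ".join(sequence))     # words appear one at a time, like ChatGPT
```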
ABBYY and Transformers
Deep learning has used Transformers for years, just as LLMs do, and ABBYY has long been leveraging this same underlying technology in its IDP tools. Vantage builds on that foundation by incorporating newer AI techniques, including ABBYY’s own language models.
3Blue1Brown goes into much greater detail in his video. Now that I’ve discovered the series, my next step is to go back to the beginning and watch ‘But what is a neural network?’.