Blog post

Building a chatbot with OpenAI: A wild journey

Illustration: Tech writer’s chatbot adventure
Information

This blog post chronicles my vainglorious attempt to teach OpenAI about our SDK documentation. This part of the journey ended in a proof of concept (POC) that I presented internally to the company. Since then, the beta version of the documentation Q&A tool has been made available to the general public. To check this out, head to the Ask AI page. And please let me know what you think about it!

Prompt:

You’re a senior technical writer at a cool, fully remote IT company. You’re responsible for the documentation of different software development kits (SDKs) that enable developers to easily integrate powerful document processing solutions into their apps. Since ChatGPT came out, there has been a lot of discussion in the company about how to leverage this technology. After the CEO says, “We really need to stop researching and start building here,” you decide to take it on yourself to implement a proof of concept (POC) that integrates OpenAI technology into the documentation. Specifically, you want to develop a chatbot tool that enables users to ask questions about the SDK and then gives accurate answers based on the documentation using OpenAI’s language processing capabilities. The end goal is to provide users with working and correct code that’s tailormade to their use cases. As a technical writer with junior developer skills at best, you enjoy the challenge, but you’re out of your depth here.

After a week of research, trial and error, sleepless nights, and broken dreams, the POC is ready for presentation at the general company meeting. Somehow you even manage to include a tool in the POC that enables a user to upload a PDF document, ask a question about the content of the document, and get an answer to their question. The presentation goes well, and everyone seems to enjoy the examples where OpenAI gives weird answers.

Write a blog post about this adventure with humor and sarcasm.

ChatGPT’s response:

So, picture this: I, a senior technical writer with junior developer skills, decided to tackle the challenge of integrating OpenAI technology into our SDK documentation. What could go wrong?

Spoiler alert: everything.

After a week of endless research, trial and error, and questioning my life choices, I finally managed to develop a POC that integrated OpenAI’s language processing capabilities with our SDK documentation. And, because I’m a glutton for punishment, I even threw in a tool that lets users upload a PDF document and ask a question about its content. Genius, right?

Fast forward to the presentation at the general company meeting, and I was bracing myself for the worst. But to my surprise, everyone seemed to enjoy the examples where OpenAI gave some seriously weird answers.

Like the time someone asked the chatbot, “How do I implement the SDK in my app?” and it responded with, “Have you tried feeding it after midnight and getting it wet?” I mean, technically that’s not wrong, but it’s not exactly helpful either.

But despite the bizarre answers, the POC was a success. It’s always a relief when something you’ve poured your blood, sweat, and tears into actually works. And who knows, maybe one day OpenAI will develop a sense of humor that’s a little less...odd.

Indeed, OpenAI’s sense of humor might be odd, but this is much funnier than what I could’ve come up with. As a technical writer, I’ve mastered the art of writing boring and sterile text and stripping language down to its bare essentials. Lunartic Andy, one of my good tech writer friends, calls me Manny Horvitz as a reference to how I repeatedly and brutally butchered his beautiful prose in my reviews.

But what’s missing from OpenAI’s answer? Well, the substance. Everything that you can’t expect a bullshit generator to do, even if it’s as impressive as GPT-4. And this is exactly the reason I needed to develop a POC instead of relying on ChatGPT.

Why Not Just Ask ChatGPT?

Ask ChatGPT a question about how to achieve a certain goal with our SDKs. For example, “How to redact phone numbers with GdPicture.NET?” The chances are high that you’ll get some generic marketing text about how powerful the SDK is, along with some non-existent classes and methods that purportedly redact the exact text “phone number” from a document. It says nothing about the smart redaction capabilities we released a few months ago that automatically recognize sensitive information like phone numbers and redact them from the document. It gives a convincing, well-written answer, but a rather useless one for developers.

Why does this happen? There are several forces at play here. An important one is the cutoff date of the model. The current ChatGPT language model works with data up until September 2021. This means that if you’re interested in a feature we shipped last week, ChatGPT isn’t going to help you. In agile environments where even daily releases aren’t at all unusual, we need a solution where the answer is based on the latest documentation updates.

But for ChatGPT, a problem even bigger than outdated information is no information at all. If it works on the basis of outdated but complete information, the worst it can do is give you code that worked in our SDK two years ago. However, if ChatGPT has incomplete or no information about a topic, it tends to “hallucinate”: It invents stuff based on the closest match it finds in its metallic brain. When this happens, it gives the impression of talking with an eminent student with a compulsion to conform. They’re caught off guard by a question they haven’t fully prepared for, but they aim to please, so they make up facts and present them in a very convincing way.

Feeding the Beast with Context

The solution is to provide OpenAI with context that we know to be accurate, comprehensive, and up to date. In my case, this means including pieces of documentation in the prompt that are the most relevant to the user’s question. Then comes a bit of prompt engineering where I ask OpenAI to rely only on the information explicitly stated in the context, and I teach it to say the crucial “I don’t know” if that’s the case. It has ego, so it must be a hard pill to swallow, but it plays along.

Oh, and I set the “temperature” to 0. Temperature basically determines how risk-taking or adventurous OpenAI is when it formulates its answer. 0 means that the answer is more focused, predictable, and deterministic, and 1 means there’s more variation or randomness to it. Simply put, in my experience, 0 gives you a run-of-the-mill, mediocre high school essay, whereas 1 gives you a hit-and-miss ride of brilliance and utter nonsense. Needless to say, for my boring use case and lifeless writing style, 0 fits the bill perfectly.

Here Comes a Chopper

However, I’ve skipped over an important step. How do you know what context to provide to OpenAI? And how do you feed it to the beast so that it fits within its strict token limits?

Token limits are basically the plague of working with the OpenAI API. One token is equivalent to around four characters in English texts. The different models have different limitations of how many tokens you can have in the prompt and the answer combined.

Model Token Limit Approximate Word Limit
text-davinci-003 4,096 3,000
gpt-3.5-turbo 4,096 3,000
gpt-4 8,192 6,000
gpt-4-32k 32,768 24,000

The GPT-4 API isn’t yet publicly available and costs much more than GPT-3.5 models. So at the moment, the prompt and the answer together need to fit within a very strict 3,000-word limit. How to get around this?

Here comes the Retrieval Plugin to the rescue. Released by the OpenAI team in late March, this tool chops up large documents into smaller chunks of text, represents the chunks as vectors using the OpenAI Embeddings API, and stores the chunks and their embedding vectors in the database. The embedding vector is just a list of numbers that represents a piece of text based on 1,536 output dimensions. You can compare the embedding vectors of text strings to determine how closely related they are. Don’t ask me how this works, but it does.

Anyway, when you ask a question, the Retrieval Plugin represents the question as a vector using the OpenAI Embeddings API, compares the embedding vector of the question with the embedding vectors of the chunks of text in the database, and returns the chunks of text that most closely match the question. The Retrieval Plugin seems to be by far the easiest way to determine the context that’s most relevant to the question and fits within the token limits.

It must be noted that bespoke solutions similar to the Retrieval Plugin have already been around for some time. Supabase implemented its Clippy tool with the same logic, and the company was kind enough to write it all up in a blog post, which was a great inspiration when working on this problem. Kudos to the great Supabase team for its pioneering work here. Meanwhile, I have the advantage of jumping on the bandwagon later when one-size-fits-all solutions like the Retrieval Plugin are already out there.

The Nitty-Gritty

That’s all well and good, but how does the docs chatbot answer questions about documentation?

  1. Previously, I uploaded chunks of the documentation with the Retrieval Plugin to a vector database. The plugin represents these chunks as vectors based on the OpenAI Embeddings API and stores the chunks and their embedding vectors in the database.

  2. The user asks a question in the app.

  3. The app sends the user question to the Retrieval Plugin API. The API represents the question as a vector based on the OpenAI embeddings API, compares the embedding vector of the question with the embedding vectors of the chunks of text in the database, and returns the chunks of text that most closely match the question.

  4. The app formulates the prompt. The prompt includes basic instructions about the expected answer, the user question, and the most relevant chunks of text as context.

  5. The app sends a prompt to the OpenAI Chat Completion API.

  6. The OpenAI answer is displayed to the user.

And this is how the document analyzer answers questions about an uploaded PDF:

  1. The user asks a question in the app and uploads a PDF document.

  2. The app extracts text from the document using the PassportPDF API, our cloud service for document processing.

  3. Using the Retrieval Plugin, the app splits the text into smaller chunks, represents these chunks as vectors based on the OpenAI Embeddings API, and stores the chunks and their embedding vectors in the database.

  4. The app sends the user question to the Retrieval Plugin API. The API represents the question as a vector based on the OpenAI embeddings API, compares the embedding vector of the question with the embedding vectors of the chunks of text in the database that originated from the uploaded document, and returns the chunks of text that most closely match the question.

  5. The app deletes the document from the database.

  6. The app formulates the prompt. The prompt includes basic instructions about the expected answer, the user question, and the most relevant chunks of text as context.

  7. The app sends a prompt to the OpenAI Chat Completion API.

  8. The OpenAI answer is displayed to the user.

Mostly Harmless?

What are the results then? Is ChatGPT really taking over our jobs? Is it taking over the world? I certainly don’t know the answer to these questions. But the answers I got from my POC were generally quite good and surprisingly consistent. Once I uploaded the right guides to the database, it answered the question about redacting phone numbers perfectly. And even when the answer was sometimes inadequate, I could easily tweak the prompt or the text chunk generation mechanism to get better results. For example, when a colleague interrogated finer details on an earlier iteration of the app, like how to redact phone numbers using blue, the answer contained a fair amount of hallucination. After changing the prompt and instructing OpenAI more explicitly to rely only on the context, the problem was solved.

And this isn’t cheating, by the way. As far as I understand this field, the work on artificial intelligence mainly evolved through trial and error based on what works in practice. Tweaking the input and the output (pre- and post-processing) is what will really distinguish good and bad practical implementations of OpenAI technology.

It’s interesting to compare how the two most popular OpenAI models performed. text-davinci-003 is the older model, optimized for text completion. gpt-3.5-turbo is optimized for chat and is 10 times cheaper than text-davinci-003. OpenAI says both have roughly the same capabilities. I can’t say there’s a clear winner, but it felt like gpt-3.5-turbo is better for this use case. For example, when it couldn’t answer the question on the basis of the provided context, text-davinci-003 always said, “Sorry, I don’t know.” In contrast, gpt-3.5-turbo usually said, “Sorry, I don’t know. The given documentation does not provide any specific information on this question. However, it does mention something very closely related to this. Let me tell you that.” This could be helpful. Even if you don’t get the exact answer you’re looking for, the app will give you something to start with.

As for the document analyzer, on a good day, it’s capable of taking a 100-page document full of financial information and answering a specific question correctly in 10–15 seconds. Most of the time, it has good days, but there are some inexplicable bad days. And on a bad day, it struggles to interpret a single-page contract and retrieve the most obvious pieces of information, such as who the contract is for. I haven’t really figured out the pattern here. Most likely, the way chunks are generated and text is split up has a huge impact on results.

Life, the Universe, and Everything

By the way, have I told you that my writing style is often boring and devoid of life? Probably more than once. And now I have irrefutable scientific proof. When I uploaded the developer guides that I wrote for GdPicture.NET to the vector database and asked my magnum opus “What is life?” all it could say was: “Sorry, I don’t know how to help with that based on the provided context.” OK, OK, this is to be expected. Actually, I should be happy that I engineered the prompt correctly. But what happens if I upload something slightly different to the database? Let’s say, my wife’s blog post full of love and life and passion about a Hungarian craft chocolate maker? Surely it won’t be enough to convince the beast that it can answer questions about the meaning of life. And then this happens:

Life is a journey of discovery, growth, and transformation. It is a unique experience that is filled with joy, sorrow, and surprises. From the story of Rózsavölgyi Csokoládé, we can see that life is about taking risks, making decisions, and learning from our mistakes. It is about finding our passions and pursuing them with determination and passion. It is also about connecting with others, sharing experiences, and creating something beautiful together. Life is about making the most of every moment and cherishing the memories we create along the way.

Crushing. Life is chocolate.

And Another Thing

“This is just a POC” was what I was repeating to myself when feature creep raised its ugly head, because it’s so tempting to take on just one more thing, make that button just a bit prettier, or refactor the code to make it look just a bit less spaghetti. Clearly, there are many questions left unanswered and many problems left unsolved. The Retrieval Plugin is a great tool, but its mechanism for generating chunks of text from larger documents is rudimentary at best. At the time of writing, it simply cuts the document up into little shreds that fit in 200 tokens (around 150 words). This logic chops larger code samples into meaningless snippets, classifies them, later assembles the snippets in random order, and feeds all this to OpenAI as context. Not very smart.

Good old prompt engineering can also save the day. Sometimes it feels like you can’t tell OpenAI enough times to only rely on the provided context. You update the prompt, but then it becomes too shy and risk-averse, and it refuses to answer questions for which the relevant information is clearly in the provided context. There seems to be a fine balance to strike between being too strict and too lenient in your instructions.

And as the example with chocolate and the meaning of life demonstrated, you need to be very careful about what makes it to the database. One enthusiastic blog post about something remotely touching on the joys of life, rather than on the delights of smart redaction and barcode generation, was seemingly enough to make OpenAI believe that it’s qualified to engage in deep philosophical discussions about the universe and everything.

So Long, and Thanks for All the Fish

All in all, it was a wild ride, and I’m grateful for the experience. And if anyone needs a chatbot tool that tells you how to feed and water your SDK, I’m your girl.

Why does ChatGPT think I’m a girl? Is it a progressive mindset, prejudice, or the cold-headed calculation of probabilities? We might never know what goes on inside the black box, and all this seems like black magic to me. It’s black magic in the sense that we don’t know how and why this works, but if we perform the correct rituals, it seems we can figure out, through sheer luck and trial and error, a way to get the results we want.

Is this black magic taking over the world? If yes, have I just contributed in a small part to our collective downfall? I don’t know. The given context doesn’t provide any specific information on this question. However, it does mention something very closely related to this, about a technical writer feeling grateful for the opportunity to work on a team where he could take the luxury of spending a week on something wild that’s completely outside his comfort zone. Do you want to hear more about that?

Explore related topics

Free trial Ready to get started?
Free trial