Joining the legions of people amazed by OpenAI's recently unveiled offerings, I wanted to share the lights and shadows of my experience trying ChatGPT and its underlying LLMs. As a disclaimer, I'll say that my aim was not selfless at all: I did not intend to evaluate the product itself. Rather, I envisioned the bots as the "software engineer" friend that people would like to have: always ready, always on call. In that vein, can ChatGPT help in our mission to democratize BigML's Machine Learning workflows for non-programmers?
So I got an account. You need an email address and a phone number to receive the activation code, and your account is automatically granted an $18 allowance to test the tool, which luckily was enough to run the different scenarios that I'll describe in this post.
Chatting with the raw-bot
To be precise, let me start by saying that GPT-4 was released toward the end of the experiment, so GPT-3.5 was used for the majority of the tests that I'm going to share.
Let's get going! As the usual maintainer of BigML's Python bindings, my curiosity led me first to ask ChatGPT to write Python code for the canonical training workflow:
Wow! My first impression was: that’s simply amazing! This is correctly structured Python code, refers to and uses the Python bindings provided by BigML, and clearly defines the steps that you need to take, which are:
Creating an API connection
Uploading data to create a Source
Creating a Dataset from the Source
Creating a Model from the Dataset
The code should stop here, right? I asked for a decision tree to be created, and I don't need to download it. Still, knowing how to download the tree wouldn't hurt, so for a newbie trying that code, it looks pretty sweet!
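For reference, a minimal working version of those steps with the bindings might look like the sketch below, assuming the bigml package is installed and the credentials are set as environment variables (the CSV file name is illustrative):

```python
# A minimal sketch of the canonical BigML training workflow.
# Assumes `pip install bigml` and that BIGML_USERNAME and
# BIGML_API_KEY are set in the environment.
from bigml.api import BigML

api = BigML()                           # the connection picks up the credentials
source = api.create_source("iris.csv")  # upload data to create a Source
api.ok(source)                          # wait until the Source is finished
dataset = api.create_dataset(source)    # create a Dataset from the Source
api.ok(dataset)
model = api.create_model(dataset)       # create a decision tree Model
api.ok(model)
```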
Of course, I'm no newbie in that world, so I started seeing the problems that someone with no experience would have using ChatGPT's snippet:
The first line is meant to install the Python bindings and would only work if used in a Jupyter Notebook (or similar) environment.
The connection to the API needs some credentials to be set as environment variables to work.
The api.create_model method is given a nonexistent collection of arguments (which seemed strange, considering that the rest of the create calls don't include any).
The api.export method uses a nonexistent download argument.
So a newbie trying this would probably hit the related errors one by one, which can be quite tiresome for someone who does not know what might be happening. Our fictitious user could ask the bot-friend itself for help, and ChatGPT would modify the answer to correct some of them, but the code would still not run. It's frustrating to think that you're so close to the solution and yet unable to reach the finish line in sight…
We programmers are used to that frustration. Our eyes are trained to spot mistakes and our brains naturally think about tracking down the source of an error. With no such training, you can only ask ChatGPT once again and hope for the best. Well, according to our experiment, that does not always lead to a happy ending.
Learn from our experience
Still, if I had to label this first experiment for a sentiment analysis dataset, I'd attach the positive label to it. I mean, the code that was generated is quite close to working code. If I really knew zero about BigML or the bindings' classes and methods, that's a lot of correct information! The errors were mostly in the details. Sure, they prevent the code from working, but they seem small enough, and one can imagine that by providing some more information to the LLM, or tweaking it a bit, it could eventually get the code right. I take my hat off to the engineers who got the model this far, and I expect it to keep improving over time. There are lots of Python code examples stored on GitHub and other repositories that have surely contributed to the present state of the tool and can further help bring it to the next level. That thought led me to try another experiment: what about building a workflow using WhizzML?
That's more adventurous indeed, because WhizzML is a Domain Specific Language developed at BigML that runs on the server side of the platform. OpenAI's tools should be aware of it, as we have public documents describing the language and some tutorials, and plenty of code examples are available on GitHub. Nonetheless, it is not as common as a general-purpose language such as Python.
We all learn from our experiences. Robots do too. Their experience is extracted from data: in this case, lots of documentation, web pages, and code snippets. Thus, it is to be expected that examples in common languages are learned better than those in niche ones, and that's what we detected. Asking whether we could create a cluster using WhizzML, we got a correct affirmative response. However, the associated ChatGPT-generated code was incorrect pseudo-Ruby with the wrong syntax. To make matters worse, it used nonexistent functions.
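For contrast, here is roughly what a cluster-building script looks like in actual WhizzML: a minimal sketch based on my reading of the language reference (the data URL is illustrative):

```whizzml
;; A minimal sketch: build a cluster from a remote CSV in WhizzML.
;; The source URL is illustrative.
(define source-id (create-source {"remote" "https://static.bigml.com/csv/iris.csv"}))
(define dataset-id (create-dataset {"source" source-id}))
(define cluster-id (create-cluster {"dataset" dataset-id}))
```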
The ChatGPT Plus response is even worse: the code is highlighted as wasm, it defines some credentials that are never used, and it adds functions that don't exist in the language as well as attributes that are not allowed in the API.
Still, maybe we could help ChatGPT learn what it needs to know about a DSL so that it can work for one too? It certainly seemed worth a try.
Teaching the robot
The idea would be to add some context information before asking the robot, so that it can focus on the right kind of data before providing the answer. Fortunately, we found a library that helped: LangChain. The library offers utilities to index the information that we want the OpenAI LLMs to use, and to generate questions that summarize the chat history too. On top of that, it allows adding the reference information as a preface to the question itself, asking the LLM to restrict its answer to that. In fact, the template used says literally: If you don't know the answer, just say that you don't know, and don't try to make up an answer.
Using that library, we created an application that receives the question, adds the chain of previous questions to keep the conversation context, and sends along the WhizzML reference, tutorials, and examples that we want to be the main source for the automated response. The result in this case was far better than in the previous ones. However, some simple questions received the I don't know response, so we tried to be less restrictive. We changed the template to: Show any code snippet found in the context related to the answer if available. That once again improved the result, especially for basic questions like how to create a decision tree or how to create a prediction.
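For illustration, the core of such an application could look like the sketch below, assuming the 2023-era LangChain API plus the openai and faiss-cpu packages (the file name and the question are placeholders):

```python
# A minimal sketch of the retrieval-augmented chat described above.
# Assumes a 2023-era LangChain and OPENAI_API_KEY in the environment.
from langchain.document_loaders import TextLoader
from langchain.text_splitter import CharacterTextSplitter
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import FAISS
from langchain.chat_models import ChatOpenAI
from langchain.chains import ConversationalRetrievalChain

# Index the WhizzML reference, tutorials, and examples.
docs = TextLoader("whizzml_docs.txt").load()
chunks = CharacterTextSplitter(chunk_size=1000, chunk_overlap=100).split_documents(docs)
index = FAISS.from_documents(chunks, OpenAIEmbeddings())

# The chain condenses the chat history into a standalone question and
# prefixes the retrieved context to the prompt sent to the LLM. The QA
# prompt template can be customized to adjust the "I don't know" behavior.
chat = ConversationalRetrievalChain.from_llm(
    ChatOpenAI(temperature=0), retriever=index.as_retriever())

history = []
result = chat({"question": "How can I create a cluster in WhizzML?",
               "chat_history": history})
print(result["answer"])
```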
Of course, that’s in part cheating since those canonical workflows are perfectly described in our documentation. So we started asking more general questions, like: can I create a supervised anomaly detector using WhizzML? We know anomaly detectors are unsupervised, but does the LLM know that?
The initial response was positive, showing the code to build an anomaly detector. After some more questioning, the bot acknowledged that anomaly detectors are unsupervised models, so contradictions surfaced. Also, we observed that small changes in the wording of the question sometimes led to different results, some of which were totally wrong. We saw responses including incorrect PHP-like code, and we got the app to recognize that its syntax was not WhizzML-conformant.
In summary, the app is able to tell you that it’s making mistakes but keeps making them all the same. What’s more human than that?
Out of credit
Finally, we exhausted all our credit using the API, and that stopped our evaluation in its tracks. Still, I think we were able to form an opinion as to whether we could embrace ChatGPT as the new software programmer on our team. At this stage, my feeling is that the responses are not sound enough for non-programmers to use the bot blindly. Maybe programmers could still benefit from the tool, because it will be easier for them to separate the wheat from the chaff at a glance or correct any errors as they appear. But then again, programmers have an excellent choice of tools (e.g., Read the Docs, Stack Overflow) where they can find working examples that have already been tested or checked by referees.
Being a software engineer myself, I know some might think that my conclusion is biased, as if I had been rooting for the bot's failure all along. They would be totally wrong! Your author is in her mid-fifties, and nothing would make her happier than being retired. Maybe I used the tool for the wrong goal, and I should join the HustleGPT wave that uses ChatGPT as a Strategic Advisor for entrepreneurship, with apparent success.
To wrap up, we should definitely keep an eye on this outstanding advance and follow its evolution. However, the experiment has awakened some questions in my mind: Will minorities be ignored or misrepresented by this kind of model? Are LLMs also inheriting the lies and mistakes that we humans have introduced over the years into the countless unedited sources of data that we've produced? Would you believe a ChatGPT diagnosis if your health and well-being depended on it? We'll probably need the dust to settle some more before answering those crucial questions.