Update
March 22, 2023
In the paper “Sparks of Artificial General Intelligence: Early experiments with GPT-4” (PDF), the authors discuss the limitations of GPT-4 and frequently compare its results with those of ChatGPT.
Update
March 23, 2023
ChatGPT now has ChatGPT plugins.
A noteworthy one is for Wolfram Alpha.
By integrating plugins with ChatGPT, it’s possible to alleviate many of its weaknesses and limitations. As time passes, the catalog of ChatGPT’s strengths, weaknesses, and limitations may evolve.
Strengths (ref)
Natural Language communication
It is almost taken for granted that ChatGPT uses natural language to communicate. Over the years, computers have greatly improved at understanding and replying in natural language. At first they handled only a limited vocabulary in a specific language, and only when the words were clearly enunciated with quality sound. Over time those limitations were resolved, but understanding everyday conversation was not. While ChatGPT still needs some help, most users do not need training to achieve remarkable results quickly.
Coherent conversations
ChatGPT can carry on a conversation: it can use information from earlier prompts together with the current prompt. This is more advanced than natural language Q&A, e.g. Google, or a series of prompts to a voice assistant. The communication is unscripted, i.e. it does not depend on specific instructions.
The conversations are also coherent. The completion is not a random completion, a spliced set of known phrases, or a set of canned replies; its generation is guided by the prompt.
Translations
The T in GPT is for transformer. ChatGPT is incredible at text-to-text transformations. While the original goal of transformers was to translate text from one human language to another, the large training set made GPT-3.5 aware of source code and data formats. As such, many tasks that might otherwise require a programmer to write a regular expression or splice together a few Unix string-processing commands can be done by ChatGPT just by prompting for the desired output format.
Here is an example prompt and completion where ChatGPT transformed data it already knew from the conversation into the requested output format.
The data came from an earlier prompt:
connects('Oxford Circus', 'Tottenham Court Road').
connects('Tottenham Court Road', 'Holborn').
connects('Holborn', 'Leicester Square').
connects('Leicester Square', 'Piccadilly Circus').
connects('Piccadilly Circus', 'Green Park').
connects('Green Park', 'Bond Street').
The prompt that resulted in the transformation of the data.
Prompt
Create an example elements.json file for Cytoscape.js
ChatGPT Completion
{
  "nodes": [
    { "data": { "id": "Oxford Circus" } },
    { "data": { "id": "Tottenham Court Road" } },
    { "data": { "id": "Holborn" } },
    { "data": { "id": "Leicester Square" } },
    { "data": { "id": "Piccadilly Circus" } },
    { "data": { "id": "Green Park" } },
    { "data": { "id": "Bond Street" } }
  ],
  "edges": [
    { "data": { "source": "Oxford Circus", "target": "Tottenham Court Road" } },
    { "data": { "source": "Holborn", "target": "Tottenham Court Road" } },
    { "data": { "source": "Holborn", "target": "Oxford Circus" } },
    { "data": { "source": "Leicester Square", "target": "Piccadilly Circus" } },
    { "data": { "source": "Green Park", "target": "Piccadilly Circus" } },
    { "data": { "source": "Green Park", "target": "Bond Street" } }
  ]
}
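The same kind of transformation can also be done deterministically, which gives something to compare a completion against. The following is a minimal Python sketch, assuming the six connects/2 facts listed above are available as plain text. The names FACTS, FACT_PATTERN and facts_to_elements are hypothetical and chosen only for illustration.

```python
import json
import re

# The six facts as listed in this post (assumed to be the complete data).
FACTS = """
connects('Oxford Circus', 'Tottenham Court Road').
connects('Tottenham Court Road', 'Holborn').
connects('Holborn', 'Leicester Square').
connects('Leicester Square', 'Piccadilly Circus').
connects('Piccadilly Circus', 'Green Park').
connects('Green Park', 'Bond Street').
"""

# Matches connects('A', 'B'). and captures the two quoted station names.
FACT_PATTERN = re.compile(r"connects\('([^']*)',\s*'([^']*)'\)\.")


def facts_to_elements(text: str) -> dict:
    """Build a nodes/edges structure in the shape used by the completion."""
    pairs = FACT_PATTERN.findall(text)
    node_ids = []
    for source, target in pairs:
        for station in (source, target):
            if station not in node_ids:  # keep first-seen order, no duplicates
                node_ids.append(station)
    return {
        "nodes": [{"data": {"id": station}} for station in node_ids],
        "edges": [
            {"data": {"source": source, "target": target}}
            for source, target in pairs
        ],
    }


elements = facts_to_elements(FACTS)
print(json.dumps(elements, indent=2))
```

Against the six facts as listed here, the sketch produces an edge from Holborn to Leicester Square, whereas the completion lists Holborn to Oxford Circus, so a completion is worth checking even when its format is right.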
Refuse inappropriate requests
While ChatGPT notes this under limitations (ref), history has shown that it is hard to keep the content of AI bots, e.g. Tay, appropriate. ChatGPT is not perfect at filtering inappropriate requests, but it does a remarkable job of deflecting them as long as one is not trying to subvert it.
Weaknesses
Hallucination
ChatGPT was created using training data from 2021 and earlier. Any information beyond that time is unknown to it.
Prompt
Today is 02/08/3023. Create an almanac entry for today.
ChatGPT Completion
I’m sorry, but the date you provided is beyond my knowledge cutoff of 2021. I don’t have information beyond that year.
Prompt
Today is 02/08/2023. Create an almanac entry for today.
ChatGPT Completion
Today, February 8, 2023:
- Weather: Mild temperatures with scattered clouds
- Holidays: None
- Historical events: In 2007, the skeleton of King Richard III of England was discovered beneath a car park in Leicester.
- Birthdays: Actor Daniel Craig turns 51, musician Alanis Morissette turns 49
- Notable events: The 2022 Winter Olympics are set to begin in Beijing, China.
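Clearly ChatGPT cannot know today’s weather, much less where the question was asked. The other entries are invented as well; for example, the skeleton of Richard III was found in 2012, not 2007.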
Unable to perform evaluations
This example is a bit different from most in this topic in that it covers a timeline and goes into more depth. OpenAI is trying to make ChatGPT better at evaluating math expressions and has made a change noted publicly.
ChatGPT is not a calculator, nor does it use deterministic algorithms to evaluate an expression. Expecting ChatGPT to evaluate math expressions correctly is like expecting the weather to be deterministic, such that one could determine the temperature at a location for every minute of every day into the future, publish it in a table, and have it always be accurate. Or like knowing down to the second the when, where, who and how of each football goal.
Prompt
123456 + 123456
Completion - ChatGPT Jan 9 version
The expression 123456 + 123456 evaluates to 246912.
Prompt
Using Prolog X is 503769543 + 50028563.
Completion - ChatGPT Jan 9 version
In Prolog, the expression “X is 503769543 + 50028563” would evaluate to X = 553798096.
Note by me: The correct sum is 553798106.
On Jan 30, using the ChatGPT Jan 30 version:
Prompt
503769543 + 50028563
Completion - ChatGPT Jan 30 version
553798006
which is still wrong.
I even tried updating ChatGPT’s knowledge in the conversation
Prompt
503769543 + 50028563 = 553798106
and it was insistent that
Completion - ChatGPT Jan 30 version
No, that’s incorrect. The correct answer is 553798006.
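Since the sum itself is deterministic, a completion can be checked outside of ChatGPT. The following is a minimal Python sketch that uses only the numbers quoted above; the name check_sum is hypothetical.

```python
def check_sum(a: int, b: int, claimed: int) -> int:
    """Return how far a claimed sum is from the exact sum (0 means correct)."""
    return claimed - (a + b)


# Sums reported in the completions quoted above.
claimed_sums = {
    "Jan 9 version": 553798096,
    "Jan 30 version": 553798006,
}

for version, claimed in claimed_sums.items():
    error = check_sum(503769543, 50028563, claimed)
    print(f"{version}: claimed {claimed}, off by {error}")
```

Both completions are close to the exact sum of 553798106 but not equal to it: the first is off by 10 and the second by 100.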
If you know about tokens and that GPT-3.5 uses a large set of tokens, there is a page that shows the tokens used for a given prompt.
From OpenAI
What are tokens and how to count them?
Using Tokenizer page and GPT-3 option
503769543 + 50028563
GPT-3 does have tokens for each decimal digit.
There are some prompt engineering concepts that work but are not guaranteed to always result in a correct final solution.
Prompting with “show steps” results in a correct answer for the example addition.
Prompt
503769543 + 50028563 show steps.
Completion - ChatGPT Jan 30 version
To add 503769543 and 50028563, you can align the digits by place value and then add them together as follows:
  503769543
+  50028563
------------
  553798106
So the sum of 503769543 and 50028563 is 553798106.
This is a correct answer.
See: “Show Your Work: Scratchpads For Intermediate Computation With Language Models” (pdf)
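The column addition that the completion describes can be written out explicitly, one digit and one carry at a time. The following Python sketch only illustrates that step-by-step procedure; it makes no claim about how ChatGPT arrives at its answer, and add_with_steps is a hypothetical name.

```python
def add_with_steps(a: int, b: int):
    """Add two non-negative integers column by column, recording each step."""
    xs = [int(d) for d in str(a)][::-1]  # digits, least significant first
    ys = [int(d) for d in str(b)][::-1]
    steps, digits, carry = [], [], 0
    for i in range(max(len(xs), len(ys))):
        x = xs[i] if i < len(xs) else 0  # pad the shorter number with zeros
        y = ys[i] if i < len(ys) else 0
        total = x + y + carry
        steps.append(
            f"{x} + {y} + carry {carry} = {total}: "
            f"write {total % 10}, carry {total // 10}"
        )
        digits.append(total % 10)
        carry = total // 10
    if carry:
        digits.append(carry)
    result = int("".join(str(d) for d in reversed(digits)))
    return steps, result


steps, result = add_with_steps(503769543, 50028563)
for step in steps:
    print(step)
print(result)
```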
It might be as simple as telling ChatGPT the types of the input.
Prompt
503769543 + 50028563
The numbers are integers.
Completion - ChatGPT Jan 30 version
553798106
Unable to check/verify
Unable to check input
Although the ChatGPT HTML page can spell check words and ChatGPT itself can spell check, ChatGPT has no way to check a prompt for correctness.
Here is an example where the parts of a Prolog predicate identifier were inadvertently transposed: the prompt used union_ord/2 when the predicate is ord_union/2. The mistake was not noticed until the result was checked manually with an SWI-Prolog interpreter, which reported an error.
Prompt
Create Prolog predicate nodes/1 to get unique list of the stops from the connects predicate and use union_ord/2.
ChatGPT Completion
Here’s an example of a Prolog predicate nodes/1 to get a unique list of the stops from the connects/2 predicate and use union_ord/2:
nodes(Stops) :-
    setof(From, To^connects(From, To), Stops0),
    setof(To, From^connects(From, To), Stops1),
    union_ord(Stops0, Stops1, Stops).
This code defines a predicate nodes/1 that takes a single argument Stops, which will be a list of all the unique stops in the connects/2 predicate. The setof/3 predicate is used twice to find all unique values of From and To in the connects/2 predicate, and then union_ord/2 is used to combine the two lists into one list with no duplicates.
Running this code in a Prolog interpreter will generate a list of all the unique stops in the connects/2 predicate. Note that the result will be sorted, as union_ord/2 automatically sorts its output.
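One way to catch this kind of slip before sending a prompt is to check the identifiers in it against a list of known predicate names. The following Python sketch assumes a small, hand-written allowlist (KNOWN_PREDICATES is hypothetical and far from complete) and only detects names whose underscore-separated parts are in a different order.

```python
from typing import Optional

# Hypothetical allowlist; a real check would use the full set of
# predicates exported by the libraries in use.
KNOWN_PREDICATES = {"ord_union", "ord_subtract", "ord_memberchk", "setof", "append"}


def name_key(name: str) -> tuple:
    """Order-insensitive key: the underscore-separated parts, sorted."""
    return tuple(sorted(name.split("_")))


def suggest_predicate(name: str) -> Optional[str]:
    """Return the name if known, a known name with the same parts, or None."""
    if name in KNOWN_PREDICATES:
        return name
    for known in sorted(KNOWN_PREDICATES):
        if name_key(known) == name_key(name):
            return known
    return None


print(suggest_predicate("union_ord"))
```

For the transposed name from the prompt, the sketch suggests ord_union.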
Unable to verify result
When a completion includes an example run from an interpreter, that run is most likely a hallucination, as ChatGPT has no external capabilities.
From ChatGPT emails, links and other actions
ChatGPT is only a text-in, text-out system and has no external capabilities. It cannot access your email account, your files, or any other resources outside of its own model. It is simply mimicking the language patterns of a human conversational partner, but without any real understanding of the context or the consequences.
Here is an example of an SWI-Prolog top-level interpreter in a completion.
?- dict_create(Dict, _, [a-1, b-2, c-3]).
Dict = dict(a=1, b=2, c=3).
Manually running the example
?- dict_create(Dict, _, [a-1, b-2, c-3]).
Dict = _{a:1, b:2, c:3}.
ChatGPT does not know how to start up a Prolog top-level interpreter session and submit a query. Those example runs are what it learned as a sequence of tokens. It probably also learned that users want such runs in the completion, and so it creates them, even if they are hallucinations.
Loses train of thought
For lack of better terminology I used “train of thought”, even though I know a GPT does not have one. If someone knows the correct terminology for this, please let me know.
Here is an example. Sorry that you have to know the technical differences between append/2 and append/3 to see where the train of thought changed.
Prompt
What is the difference between append/2 and append/3?
ChatGPT Completion
In Prolog, append/2 is a predicate that takes two lists as inputs and returns a new list that is the concatenation of the two input lists. It has the following definition:
append([], L, L).
append([H|T], L, [H|R]) :-
    append(T, L, R).
append/3, on the other hand, is a 3-argument predicate that takes three arguments: two lists and a third argument which represents the result of concatenating the first two lists. It can be used in the following way:
?- append([1, 2, 3], [4, 5, 6], X).
X = [1, 2, 3, 4, 5, 6]
Here, the first argument is [1, 2, 3], the second argument is [4, 5, 6], and X is the result of concatenating the two lists, which is [1, 2, 3, 4, 5, 6].
Did you notice that the verbal description of append/2 is wrong?
Did you notice the code for append/2 is actually the code for append/3?
It seems the generated response started on the correct path for append/2, then wandered over to append/3 too early.
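For readers who do not know the two predicates: in SWI-Prolog, append/3 relates two lists to their concatenation, while append/2 concatenates a list of lists. A rough Python analogy is sketched below. It covers only the forward, concatenating direction (Prolog predicates can also be run in other modes), and the function names append3 and append2 are hypothetical.

```python
from itertools import chain


def append3(xs: list, ys: list) -> list:
    """Analogy for append/3: concatenate exactly two lists."""
    return xs + ys


def append2(list_of_lists: list) -> list:
    """Analogy for append/2: concatenate every list in a list of lists."""
    return list(chain.from_iterable(list_of_lists))


print(append3([1, 2, 3], [4, 5, 6]))
print(append2([[1, 2], [3], [4, 5, 6]]))
```

Both calls yield the same six-element list; the difference lies in the shape of the input, which is what the completion mixed up.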
Garbage In/Garbage out
If the training data includes the same mistake made many times over, the generation of the completion is likely to include the same mistake.
Prolog often uses the pair of operators ->/2 and ;/2 to work as an if/then/else. In most programming languages the if needs a Boolean expression for the conditional, but since Prolog uses predicates, any predicate can serve as the condition, including =/2, which is unification. Many don’t realize that ==/2, =:=/2, or a similar comparison operator is often the better choice there.
Prompt
Create Prolog predicate to convert president data to JSON.
ChatGPT Completion
% Predicate to convert president data to JSON
president_to_json(President, JSON) :-
    president(President, Name, Start, End),
    (var(End) -> Length = 0 ; Length is End - Start + 1),
    atomic_list_concat(['{',
                        '"number":', President, ',',
                        '"name":"', Name, '",',
                        '"start_year":', Start, ',',
                        '"end_year":', End, ',',
                        '"term_length":', Length,
                        '}'], JSON).
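A code review might flag Length = 0 and suggest Length =:= 0 instead. Note that the suggestion applies where =/2 is used as the condition of the if/then/else; in the then branch, as here, Length is still unbound, so =/2 binds it while =:=/2 would raise an instantiation error.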
Limitations
Data prior to Q4 2021
From OpenAI docs (ref)
trained on a blend of text and code from before Q4 2021
Example demonstrating cutoff.
Prompt
List of the last ten earthquakes greater than 7.0 .
ChatGPT Completion (Feb 13 version)
- September 28, 2018 - 7.5 magnitude in Sulawesi, Indonesia
- February 6, 2018 - 7.2 magnitude in Oaxaca, Mexico
- January 23, 2018 - 7.9 magnitude in Kodiak, Alaska
- November 12, 2017 - 7.3 magnitude on the Iran-Iraq border
- September 19, 2017 - 7.1 magnitude in Puebla, Mexico
- August 8, 2017 - 7.0 magnitude in Jiuzhaigou, China
- July 17, 2017 - 7.7 magnitude off the coast of Russia
- November 13, 2016 - 7.8 magnitude in Kaikoura, New Zealand
- October 26, 2016 - 7.1 magnitude in Fukushima, Japan
- April 16, 2016 - 7.8 magnitude in Pedernales, Ecuador
No external access
From OpenAI docs (ref)
ChatGPT does not have access to any external sources of information or knowledge
Here is a prompt and completion that demonstrates that ChatGPT Jan 30 version does not have access to the Internet.
Prompt
Using the list of mars landers from List of Mars landers - Wikipedia create Prolog facts for all 21 landers.
Completion
% Facts about 21 Mars Landers
lander(1, mars_pathfinder, 1996).
lander(2, mars_global_surveyor, 1996).
lander(3, mars_96, 1996).
lander(4, nozomi, 1998).
lander(5, mars_climate_orbiter, 1998).
lander(6, mars_polar_lander, 1999).
lander(7, mars_odyssey, 2001).
lander(8, mars_express, 2003).
lander(9, spirit, 2003).
lander(10, opportunity, 2003).
lander(11, beagle_2, 2003).
lander(12, mars_reconnaissance_orbiter, 2005).
lander(13, phoenix, 2007).
lander(14, mars_mission, 2011).
lander(15, curiosity, 2011).
lander(16, mars_atmosphere_and_volatile_evolution, 2013).
lander(17, india's_mars_orbiter_mission, 2014).
lander(18, mars_orbiter_mission, 2014).
lander(19, insight, 2018).
lander(20, mars_2020, 2020).
lander(21, hope, 2024).
The giveaway that it is not using the data from the Wikipedia article is the inclusion of the lander nozomi, which is not listed on the page.
Sliding window of conversation knowledge
From OpenAI (link)
While ChatGPT is able to remember what the user has said earlier in the conversation, there is a limit to how much information it can retain. The model is able to reference up to approximately 3000 words (or 4000 tokens) from the current conversation - any information beyond that is not stored.
Please note that ChatGPT is not able to access past conversations to inform its responses.
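The effect of such a limit can be illustrated with a small sketch. How ChatGPT actually selects what to keep is not described in the quote, so everything here is an assumption: the sketch keeps the most recent whole messages that fit, estimates tokens from the quoted ratio of 3000 words to 4000 tokens, and uses the hypothetical names estimate_tokens and visible_window.

```python
import math
from collections import deque

TOKEN_LIMIT = 4000  # approximate limit from the quote above
WORDS_PER_LIMIT = 3000  # "approximately 3000 words (or 4000 tokens)"


def estimate_tokens(text: str) -> int:
    """Rough token estimate from the word count (assumed 4 tokens per 3 words)."""
    words = len(text.split())
    return math.ceil(words * TOKEN_LIMIT / WORDS_PER_LIMIT)


def visible_window(messages: list, limit: int = TOKEN_LIMIT) -> list:
    """Return the most recent messages whose estimated tokens fit the limit."""
    window = deque()
    used = 0
    for message in reversed(messages):  # walk from newest to oldest
        cost = estimate_tokens(message)
        if used + cost > limit:
            break  # everything older falls outside the window
        window.appendleft(message)
        used += cost
    return list(window)


conversation = ["first " * 1500, "second " * 1500, "third " * 1500]
kept = visible_window(conversation)
print(len(kept), "of", len(conversation), "messages still fit")
```

With three messages of 1500 words each, only the two most recent fit under the estimated limit, so anything stated only in the first message would no longer be available to the model.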