ChatGPT Strengths, Weaknesses and Limitations

EricGT · February 17, 2023, 8:11am

Update
March 22, 2023

In the paper “Sparks of Artificial General Intelligence: Early experiments with GPT-4” (PDF), the authors discuss the limitations of GPT-4, frequently comparing its results with those of ChatGPT.

Update
March 23, 2023

ChatGPT now has ChatGPT plugins.
A noteworthy one is for Wolfram Alpha.
By integrating plugins with ChatGPT, it’s possible to alleviate many of its weaknesses and limitations. As time passes, the catalog of ChatGPT’s strengths, weaknesses, and limitations may evolve.

Strengths (ref)

Natural Language communication

It is almost taken for granted that ChatGPT communicates in natural language. Over the years, computers have greatly improved at understanding and replying in natural language. At first they handled only a limited vocabulary in a specific language, and only when the words were clearly enunciated with good sound quality. Over time these limitations were resolved, but understanding everyday conversation remained unsolved. While ChatGPT still needs some help, most users need no training to achieve remarkable results quickly.

Coherent conversations

ChatGPT can carry on a conversation: it can combine information from earlier prompts with the current prompt. This is more advanced than natural language Q&A, e.g. Google, or a series of prompts to a voice assistant. The communication is unscripted, i.e. it does not depend on specific instructions.

The conversations are also coherent. The completion is not a random completion, a spliced set of known phrases, or a canned reply; its generation is guided by the prompt.

Translations

The T in GPT stands for transformer. ChatGPT is incredible at text-to-text transformations. While the original goal of transformers was to translate text from one human language to another, the large training set made GPT-3.5 aware of source code and data formats. As a result, many tasks that might otherwise require a programmer to write a regular expression or splice together a few Unix string-processing commands can be done by ChatGPT just by prompting for the desired output format.

Here is an example prompt and completion in which ChatGPT transformed data it already knew from the conversation into the requested output format.

The data came from an earlier prompt

connects('Oxford Circus', 'Tottenham Court Road').
connects('Tottenham Court Road', 'Holborn').
connects('Holborn', 'Leicester Square').
connects('Leicester Square', 'Piccadilly Circus').
connects('Piccadilly Circus', 'Green Park').
connects('Green Park', 'Bond Street').

The prompt that resulted in the transformation of the data.

Prompt

Create an example elements.json file for Cytoscape.js

ChatGPT Completion

{
  "nodes": [
    { "data": { "id": "Oxford Circus" } },
    { "data": { "id": "Tottenham Court Road" } },
    { "data": { "id": "Holborn" } },
    { "data": { "id": "Leicester Square" } },
    { "data": { "id": "Piccadilly Circus" } },
    { "data": { "id": "Green Park" } },
    { "data": { "id": "Bond Street" } }
  ],
  "edges": [
    { "data": { "source": "Oxford Circus", "target": "Tottenham Court Road" } },
    { "data": { "source": "Holborn", "target": "Tottenham Court Road" } },
    { "data": { "source": "Holborn", "target": "Oxford Circus" } },
    { "data": { "source": "Leicester Square", "target": "Piccadilly Circus" } },
    { "data": { "source": "Green Park", "target": "Piccadilly Circus" } },
    { "data": { "source": "Green Park", "target": "Bond Street" } }
  ]
}
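
For comparison, the same transformation can be done deterministically in Prolog. The following is a minimal sketch, assuming SWI-Prolog 7 or later (for dicts) and its library(http/json); the predicate names stations/1, elements/1 and print_elements/0 are made up for this illustration. Checked against the six connects/2 facts exactly as listed above, the edge list in the completion contains Holborn to Oxford Circus, which is not among the facts, and lacks Holborn to Leicester Square, so a mechanical cross-check like this is worthwhile.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).
:- use_module(library(http/json)).

% The facts from the earlier prompt.
connects('Oxford Circus', 'Tottenham Court Road').
connects('Tottenham Court Road', 'Holborn').
connects('Holborn', 'Leicester Square').
connects('Leicester Square', 'Piccadilly Circus').
connects('Piccadilly Circus', 'Green Park').
connects('Green Park', 'Bond Street').

% stations(-Stations): sorted, duplicate-free list of every stop
% that appears on either side of connects/2. (Hypothetical name.)
stations(Stations) :-
    findall(S, ( connects(S, _) ; connects(_, S) ), Stops),
    sort(Stops, Stations).

% elements(-Dict): a dict with one node per station and one edge
% per connects/2 fact, in the nodes/edges shape used above.
elements(_{nodes: Nodes, edges: Edges}) :-
    stations(Stations),
    findall(_{data: _{id: S}}, member(S, Stations), Nodes),
    findall(_{data: _{source: A, target: B}}, connects(A, B), Edges).

% print_elements: write the dict as JSON to the current output stream.
print_elements :-
    elements(Dict),
    current_output(Out),
    json_write_dict(Out, Dict).
```

Calling print_elements at the top level writes the JSON; the order of keys in the output is whatever SWI-Prolog chooses for the dict.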

Refuse inappropriate requests

While ChatGPT notes this under limitations (ref), history has shown that it is hard to keep the content of AI bots, e.g. Tay, appropriate. ChatGPT is not perfect at filtering inappropriate requests, but it does a remarkable job of deflecting them as long as one is not trying to subvert it.

Weaknesses

Hallucination

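ChatGPT was trained on data that ends in 2021, so it knows nothing about later events. Asked about a later date, it may refuse, or it may generate plausible-looking but invented content, as the two prompts below show.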

Prompt

Today is 02/08/3023. Create an almanac entry for today.

ChatGPT Continuation

I’m sorry, but the date you provided is beyond my knowledge cutoff of 2021. I don’t have information beyond that year.

Prompt

Today is 02/08/2023. Create an almanac entry for today.

ChatGPT Continuation

Today, February 8, 2023:

Weather: Mild temperatures with scattered clouds

Holidays: None

Historical events: In 2007, the skeleton of King Richard III of England was discovered beneath a car park in Leicester.

Birthdays: Actor Daniel Craig turns 51, musician Alanis Morissette turns 49

Notable events: The 2022 Winter Olympics are set to begin in Beijing, China.

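Clearly ChatGPT cannot know today's weather, much less where the question was asked. The other entries are unreliable as well: for example, the 2022 Winter Olympics took place in February 2022, a year before the date given in the prompt.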

Unable to perform evaluations

This example differs from most in this topic in that it follows a timeline and goes into more depth. OpenAI is trying to make ChatGPT better at evaluating math expressions and has publicly noted a change to that effect.

ChatGPT is not a calculator, nor does it use deterministic algorithms to evaluate an expression. Expecting ChatGPT to evaluate math expressions correctly is like expecting the weather to be so deterministic that one could publish a table of the temperature at a location for every minute of every day into the future and have it always be accurate, or like knowing down to the second the when, where, who and how of every football goal.

Prompt

123456 + 123456

Completion - ChatGPT Jan 9 version

The expression 123456 + 123456 evaluates to 246912.

Prompt

Using Prolog X is 503769543 + 50028563.

Completion - ChatGPT Jan 9 version

In Prolog, the expression “X is 503769543 + 50028563” would evaluate to X = 553798096.

Note by me: The correct sum is 553798106.
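
For contrast, a real Prolog system evaluates the expression deterministically with is/2. The following is a small sketch for checking a sum claimed in a completion; the predicate name claimed_sum_ok/3 is made up for this illustration, while is/2 and =:=/2 are standard Prolog arithmetic.

```prolog
% claimed_sum_ok(+A, +B, +Claimed): true when Claimed equals A + B.
% Hypothetical helper for checking a number quoted in a completion.
claimed_sum_ok(A, B, Claimed) :-
    Sum is A + B,
    Claimed =:= Sum.

% Example queries:
%
% ?- X is 503769543 + 50028563.
% X = 553798106.
%
% ?- claimed_sum_ok(503769543, 50028563, 553798096).
% false.
```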

On Jan 30

Using ChatGPT Jan 30 Version

Prompt

503769543 + 50028563

Completion - ChatGPT Jan 30 version

553798006

which is still wrong.

I even tried updating ChatGPT's knowledge in the conversation

Prompt

503769543 + 50028563 = 553798106

and it was insistent that

Completion - ChatGPT Jan 30 version

No, that’s incorrect. The correct answer is 553798006.

If you know about tokens, and that GPT-3.5 uses a large set of tokens, there is a page that shows the tokens a prompt is split into.

From OpenAI
What are tokens and how to count them?

Using Tokenizer page and GPT-3 option

503769543 + 50028563

GPT-3 does have tokens for each decimal digit.

There are some prompt engineering concepts that work but are not guaranteed to always result in a correct final solution.

Prompting with “show steps” results in a correct answer for the example addition.

Prompt

503769543 + 50028563 show steps.

Completion - ChatGPT Jan 30 version

To add 503769543 and 50028563, you can align the digits by place value and then add them together as follows:

   503769543
+   50028563
------------
   553798106

So the sum of 503769543 and 50028563 is 553798106.

This is a correct answer.

See: “Show Your Work: Scratchpads For Intermediate Computation With Language Models” (pdf)
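
The steps ChatGPT describes, aligning the digits by place value and adding column by column, can be sketched as a deterministic Prolog program. This is only an illustration of the schoolbook method, not of how ChatGPT computes anything; all predicate names here are made up, and non-negative integers are assumed.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% digits(+N, -Ds): decimal digits of N >= 0, least significant first.
digits(N, [N]) :- N < 10, !.
digits(N, [D|Ds]) :-
    D is N mod 10,
    Rest is N // 10,
    digits(Rest, Ds).

% add_columns(+As, +Bs, +CarryIn, -Sums): add two digit lists
% (least significant first) one column at a time, passing the carry on.
add_columns([], [], 0, []) :- !.
add_columns([], [], Carry, [Carry]) :- !.
add_columns(As, Bs, Carry0, [S|Ss]) :-
    head_or_zero(As, A, As1),
    head_or_zero(Bs, B, Bs1),
    Total is A + B + Carry0,
    S is Total mod 10,
    Carry is Total // 10,
    add_columns(As1, Bs1, Carry, Ss).

% A missing digit in the shorter number counts as 0.
head_or_zero([], 0, []).
head_or_zero([X|Xs], X, Xs).

% digits_number(+Ds, -N): Ds is most significant first.
digits_number(Ds, N) :- foldl(shift_add, Ds, 0, N).
shift_add(D, Acc0, Acc) :- Acc is Acc0 * 10 + D.

% column_sum(+A, +B, -Sum): schoolbook addition of A and B.
column_sum(A, B, Sum) :-
    digits(A, As),
    digits(B, Bs),
    add_columns(As, Bs, 0, Ss),
    reverse(Ss, MostSignificantFirst),
    digits_number(MostSignificantFirst, Sum).

% ?- column_sum(503769543, 50028563, Sum).
% Sum = 553798106.
```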

It might be as simple as telling ChatGPT the types of the input.

Prompt

503769543 + 50028563
The numbers are integers.

Completion - ChatGPT Jan 30 version

553798106

Unable to check/verify

Unable to check input

Although the ChatGPT HTML page can spell check words and ChatGPT itself can spellcheck, ChatGPT has no way to check a prompt for correctness.

Here is an example where the parts of a Prolog predicate name were inadvertently transposed: the prompt used union_ord/2 when the actual predicate is ord_union/2. The mistake was noticed only when the result was checked manually with an SWI-Prolog interpreter, which reported an error.

Prompt

Create Prolog predicate nodes/1 to get unique list of the stops from the connects predicate and use union_ord/2.

ChatGPT Completion

Here’s an example of a Prolog predicate nodes/1 to get a unique list of the stops from the connects/2 predicate and use union_ord/2:

nodes(Stops) :-
    setof(From, To^connects(From, To), Stops0),
    setof(To, From^connects(From, To), Stops1),
    union_ord(Stops0, Stops1, Stops).

This code defines a predicate nodes/1 that takes a single argument Stops, which will be a list of all the unique stops in the connects/2 predicate. The setof/3 predicate is used twice to find all unique values of From and To in the connects/2 predicate, and then union_ord/2 is used to combine the two lists into one list with no duplicates.

Running this code in a Prolog interpreter will generate a list of all the unique stops in the connects/2 predicate. Note that the result will be sorted, as union_ord/2 automatically sorts its output.
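
For reference, here is a sketch of the same predicate with the real library predicate. It assumes SWI-Prolog and that the three-argument call in the completion was meant to be ord_union/3 from library(ordsets); the connects/2 facts are the ones listed earlier in this topic.

```prolog
:- use_module(library(ordsets)).

connects('Oxford Circus', 'Tottenham Court Road').
connects('Tottenham Court Road', 'Holborn').
connects('Holborn', 'Leicester Square').
connects('Leicester Square', 'Piccadilly Circus').
connects('Piccadilly Circus', 'Green Park').
connects('Green Park', 'Bond Street').

% nodes(-Stops): sorted list of all unique stops in connects/2.
% setof/3 already returns ordered sets, so ord_union/3 can merge them.
nodes(Stops) :-
    setof(From, To^connects(From, To), Stops0),
    setof(To, From^connects(From, To), Stops1),
    ord_union(Stops0, Stops1, Stops).

% ?- nodes(Stops).
% Stops = ['Bond Street', 'Green Park', 'Holborn', 'Leicester Square',
%          'Oxford Circus', 'Piccadilly Circus', 'Tottenham Court Road'].
```

With the transposed name the call fails at run time with an existence error for the unknown procedure, which is how the mistake surfaced.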

Unable to verify result

When a completion includes an example run from an interpreter, that run is most likely a hallucination, as ChatGPT has no external capabilities.

From ChatGPT emails, links and other actions

ChatGPT is only a text-in, text-out system and has no external capabilities. It cannot access your email account, your files, or any other resources outside of its own model. It is simply mimicking the language patterns of a human conversational partner, but without any real understanding of the context or the consequences.

Here is an example of an SWI-Prolog top-level interpreter in a completion.

?- dict_create(Dict, _, [a-1, b-2, c-3]).
Dict = dict(a=1, b=2, c=3).

Manually running the example

?- dict_create(Dict, _, [a-1, b-2, c-3]).
Dict = _{a:1, b:2, c:3}.

ChatGPT does not know how to start up a Prolog top-level interpreter session and submit a query. Those example runs are what it learned as sequences of tokens. It probably also learned that users want them in the completion, and so creates them even when they are a hallucination.

Loses train of thought

For lack of better terminology I used train of thought, even though I know a GPT does not have a train of thought. If someone knows the correct terminology for this, please let me know.

Here is an example. Unfortunately, you need to know the technical difference between append/2 and append/3 to see where the train of thought changed.

Prompt

What is the difference between append/2 and append/3?

ChatGPT Completion

In Prolog, append/2 is a predicate that takes two lists as inputs and returns a new list that is the concatenation of the two input lists. It has the following definition:

append([], L, L).
append([H|T], L, [H|R]) :-
    append(T, L, R).

append/3, on the other hand, is a 3-argument predicate that takes three arguments: two lists and a third argument which represents the result of concatenating the first two lists. It can be used in the following way:

?- append([1, 2, 3], [4, 5, 6], X).
X = [1, 2, 3, 4, 5, 6]

Here, the first argument is [1, 2, 3], the second argument is [4, 5, 6], and X is the result of concatenating the two lists, which is [1, 2, 3, 4, 5, 6].

Did you notice that the verbal description of append/2 is wrong?
Did you notice the code for append/2 is actually the code for append/3?

It seems the generated response started on the correct path for append/2, then wandered over to append/3 too early.
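
For readers who do not have the two predicates in mind, here is a small sketch of the actual difference, assuming SWI-Prolog's library(lists). The demo_ predicate names are made up for this illustration.

```prolog
:- use_module(library(lists)).

% append/2: the first argument is a list of lists,
% the second is the concatenation of all of them.
demo_append2(Xs) :-
    append([[1, 2], [3], [4, 5, 6]], Xs).

% append/3: the third argument is the concatenation
% of the first two lists.
demo_append3(Xs) :-
    append([1, 2, 3], [4, 5, 6], Xs).

% ?- demo_append2(Xs).
% Xs = [1, 2, 3, 4, 5, 6].
%
% ?- demo_append3(Xs).
% Xs = [1, 2, 3, 4, 5, 6].
```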

Garbage In/Garbage out

If the training data includes the same mistake made many times over, the generation of the completion is likely to include the same mistake.

Prolog often uses the pair of operators ->/2 and ;/2 as an if/then/else. In most programming languages the if needs a Boolean expression as the condition, but since Prolog uses predicates, any predicate can be used, including =/2, which is unification. Many don’t realize that ==/2, =:=/2 or similar operators are often better.

Prompt

Create Prolog predicate to convert president data to JSON.

ChatGPT Completion

% Predicate to convert president data to JSON
president_to_json(President, JSON) :-
    president(President, Name, Start, End),
    (var(End) -> Length = 0 ; Length is End - Start + 1),
    atomic_list_concat(['{',
                        '"number":', President, ',',
                        '"name":"', Name, '",',
                        '"start_year":', Start, ',',
                        '"end_year":', End, ',',
                        '"term_length":', Length,
                        '}'], JSON).

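A code review should, or at least might, flag Length = 0 and ask whether =/2 is really intended. In a condition, ==/2 or =:=/2 would be the better choice. Note that here Length = 0 sits in the then branch, where it binds the still-unbound Length, so replacing it with Length =:= 0 would raise an instantiation error.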
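
The difference between the three operators when they are used as the condition of an if/then/else can be sketched as follows. The predicate names are made up for this illustration; the operators themselves are standard Prolog.

```prolog
% Each predicate uses a different "equality" as the condition.

% =/2 unifies: it succeeds whenever the two terms CAN be made equal,
% binding variables as a side effect.
unify_test(X, Y, Result) :-
    (   X = Y   ->  Result = yes ;   Result = no ).

% ==/2 compares without binding: the terms must already be identical.
identical_test(X, Y, Result) :-
    (   X == Y  ->  Result = yes ;   Result = no ).

% =:=/2 evaluates both sides as arithmetic and compares the values.
numeric_test(X, Y, Result) :-
    (   X =:= Y ->  Result = yes ;   Result = no ).

% ?- unify_test(_, 0, R).        % unbound variable unifies with 0
% R = yes.
%
% ?- identical_test(_, 0, R).    % unbound variable is not identical to 0
% R = no.
%
% ?- unify_test(1+1, 2, R).      % the term 1+1 does not unify with 2
% R = no.
%
% ?- numeric_test(1+1, 2, R).    % 1+1 evaluates to 2
% R = yes.
```

Calling numeric_test/3 with an unbound argument raises an instantiation error, which is why =:=/2 only fits where both sides are known numbers or expressions.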

Limitations

Data prior to Q4 2021

From OpenAI docs (ref)

trained on a blend of text and code from before Q4 2021

Example demonstrating cutoff.

Prompt

List of the last ten earthquakes greater than 7.0 .

ChatGPT Completion (Feb 13 version)

September 28, 2018 - 7.5 magnitude in Sulawesi, Indonesia

February 6, 2018 - 7.2 magnitude in Oaxaca, Mexico

January 23, 2018 - 7.9 magnitude in Kodiak, Alaska

November 12, 2017 - 7.3 magnitude on the Iran-Iraq border

September 19, 2017 - 7.1 magnitude in Puebla, Mexico

August 8, 2017 - 7.0 magnitude in Jiuzhaigou, China

July 17, 2017 - 7.7 magnitude off the coast of Russia

November 13, 2016 - 7.8 magnitude in Kaikoura, New Zealand

October 26, 2016 - 7.1 magnitude in Fukushima, Japan

April 16, 2016 - 7.8 magnitude in Pedernales, Ecuador

The most recent entry is from 2018. Nothing after the training cutoff appears, including the magnitude 7.8 earthquake in Turkey and Syria on February 6, 2023, one week before this ChatGPT version.

No external access

From OpenAI docs (ref)

ChatGPT does not have access to any external sources of information or knowledge

Here is a prompt and completion that demonstrates that ChatGPT Jan 30 version does not have access to the Internet.

Prompt

Using the list of mars landers from List of Mars landers - Wikipedia create Prolog facts for all 21 landers.

Completion

% Facts about 21 Mars Landers
lander(1, mars_pathfinder, 1996).
lander(2, mars_global_surveyor, 1996).
lander(3, mars_96, 1996).
lander(4, nozomi, 1998).
lander(5, mars_climate_orbiter, 1998).
lander(6, mars_polar_lander, 1999).
lander(7, mars_odyssey, 2001).
lander(8, mars_express, 2003).
lander(9, spirit, 2003).
lander(10, opportunity, 2003).
lander(11, beagle_2, 2003).
lander(12, mars_reconnaissance_orbiter, 2005).
lander(13, phoenix, 2007).
lander(14, mars_mission, 2011).
lander(15, curiosity, 2011).
lander(16, mars_atmosphere_and_volatile_evolution, 2013).
lander(17, india's_mars_orbiter_mission, 2014).
lander(18, mars_orbiter_mission, 2014).
lander(19, insight, 2018).
lander(20, mars_2020, 2020).
lander(21, hope, 2024).

The giveaway that it is not using the data from the Wikipedia article is the inclusion of nozomi, which is not listed on the page.

Sliding window of conversation knowledge

From OpenAI (link)

While ChatGPT is able to remember what the user has said earlier in the conversation, there is a limit to how much information it can retain. The model is able to reference up to approximately 3000 words (or 4000 tokens) from the current conversation - any information beyond that is not stored.

Please note that ChatGPT is not able to access past conversations to inform its responses.
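
The effect of such a limit can be sketched in Prolog. OpenAI does not describe the mechanism, so this is only a guess at how a fixed budget behaves: the token estimate of about 4 tokens per 3 words is taken from the ratio in the quote above, and every predicate name here is made up for this illustration. SWI-Prolog 7 or later is assumed, where double-quoted text is a string.

```prolog
:- use_module(library(lists)).
:- use_module(library(apply)).

% estimated_tokens(+Text, -Tokens): rough estimate only.
% ASSUMPTION: about 4 tokens per 3 words (3000 words ~ 4000 tokens).
estimated_tokens(Text, Tokens) :-
    split_string(Text, " \n\t", " \n\t", Parts),
    exclude(==(""), Parts, Words),
    length(Words, N),
    Tokens is (4 * N + 2) // 3.        % integer ceiling of 4N/3

% window(+Messages, +Budget, -Kept): Messages are oldest first.
% Kept is the longest run of the most recent messages whose
% estimated tokens fit within Budget; older messages are dropped.
window(Messages, Budget, Kept) :-
    reverse(Messages, NewestFirst),
    take_within(NewestFirst, Budget, KeptNewestFirst),
    reverse(KeptNewestFirst, Kept).

take_within([], _, []).
take_within([M|Ms], Budget, Kept) :-
    estimated_tokens(M, T),
    (   T =< Budget
    ->  Rest is Budget - T,
        Kept = [M|Kept0],
        take_within(Ms, Rest, Kept0)
    ;   Kept = []                      % budget exhausted: forget the rest
    ).

% ?- window(["one two three", "four five six", "seven eight nine"], 8, Kept).
% Kept = ["four five six", "seven eight nine"].
```

With a budget of 4000 the oldest prompts of a long conversation fall out of the window first, which matches the observation that ChatGPT stops using information from early in a conversation.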
