LLMs and Tool Use

Last March, just two weeks after GPT-4 was released, researchers at Microsoft quietly announced a plan to compile millions of APIs—tools that can do everything from ordering a pizza to solving physics equations to controlling the TV in your living room—into a compendium that would be made accessible to large language models (LLMs). This was just one milestone in the race across industry and academia to find the best ways to teach LLMs how to manipulate tools, which would supercharge the potential of AI more than any of the impressive advancements we’ve seen to date.

The Microsoft project aims to teach AI how to use any and all digital tools in one fell swoop, a clever and efficient approach. Today, LLMs can do a pretty good job of recommending pizza toppings to you if you describe your dietary preferences and can draft dialog that you could use when you call the restaurant. But most AI tools can’t place the order, not even online. In contrast, Google’s seven-year-old Assistant tool can synthesize a voice on the telephone and fill out an online order form, but it can’t pick a restaurant or guess your order. By combining these capabilities, though, a tool-using AI could do it all. An LLM with access to your past conversations and tools like calorie calculators, a restaurant menu database, and your digital payment wallet could feasibly judge that you are trying to lose weight and want a low-calorie option, find the nearest restaurant with toppings you like, and place the delivery order. If it has access to your payment history, it could even guess at how generously you usually tip. If it has access to the sensors on your smartwatch or fitness tracker, it might be able to sense when your blood sugar is low and order the pie before you even realize you’re hungry.
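To make the pattern concrete, here is a minimal sketch of such a tool-use loop in Python. The LLM itself is stubbed out as a scripted plan, and every tool name and data value is invented for illustration; none of this is drawn from any real product’s API.

```python
# A dispatcher for the kind of tool requests a tool-using LLM might emit.
# The "tools" are stand-ins for real services (calorie calculator, menu
# database, payment API); the plan below stands in for the model's decisions.

def calorie_lookup(topping: str) -> int:
    """Stand-in for a calorie-calculator tool."""
    return {"mushroom": 15, "pepperoni": 130}.get(topping, 100)

def find_restaurant(topping: str) -> dict:
    """Stand-in for a restaurant-menu database tool."""
    return {"name": "Luigi's", "topping": topping}

def place_order(restaurant: str, topping: str) -> str:
    """Stand-in for a payment/ordering tool."""
    return f"ordered a {topping} pizza from {restaurant}"

TOOLS = {
    "calorie_lookup": calorie_lookup,
    "find_restaurant": find_restaurant,
    "place_order": place_order,
}

def run_tool_calls(calls: list) -> list:
    """Execute each {'tool': name, 'args': {...}} request in sequence."""
    results = []
    for call in calls:
        fn = TOOLS[call["tool"]]            # look up the requested tool
        results.append(fn(**call["args"]))  # run it and keep the result
    return results

# A scripted "plan" standing in for what the model might decide to do:
plan = [
    {"tool": "calorie_lookup", "args": {"topping": "mushroom"}},   # low-calorie check
    {"tool": "find_restaurant", "args": {"topping": "mushroom"}},  # pick a restaurant
    {"tool": "place_order",
     "args": {"restaurant": "Luigi's", "topping": "mushroom"}},    # spend real money
]
print(run_tool_calls(plan)[-1])  # → ordered a mushroom pizza from Luigi's
```

In real systems the plan would come from the model itself, one call at a time, with each tool’s result fed back into its context. The security questions discussed below arise precisely because the dispatcher runs whatever the model asks for.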

Perhaps the most compelling potential applications of tool use are those that give AIs the ability to improve themselves. Suppose, for example, you asked a chatbot for help interpreting some facet of ancient Roman law that no one had thought to include examples of in the model’s original training. An LLM empowered to search academic databases and trigger its own training process could fine-tune its understanding of Roman law before answering. Access to specialized tools could even help a model like this better explain itself. While LLMs like GPT-4 already do a fairly good job of explaining their reasoning when asked, these explanations emerge from a “black box” and are vulnerable to errors and hallucinations. But a tool-using LLM could dissect its own internals, offering empirical assessments of its own reasoning and deterministic explanations of why it produced the answer it did.

If given access to tools for soliciting human feedback, a tool-using LLM could even generate specialized knowledge that isn’t yet captured on the web. It could post a question to Reddit or Quora or delegate a task to a human on Amazon’s Mechanical Turk. It could even seek out data about human preferences by doing survey research, either to provide an answer directly to you or to fine-tune its own training to be able to better answer questions in the future. Over time, tool-using AIs might start to look a lot like tool-using humans. An LLM can generate code much faster than any human programmer, so it can manipulate the systems and services of your computer with ease. It could also use your computer’s keyboard and cursor the way a person would, allowing it to use any program you do. And it could improve its own capabilities, using tools to ask questions, conduct research, and write code to incorporate into itself.

It’s easy to see how this kind of tool use comes with tremendous risks. Imagine an LLM being able to find someone’s phone number, call them and surreptitiously record their voice, guess what bank they use based on the largest providers in their area, impersonate them on a phone call with customer service to reset their password, and liquidate their account to make a donation to a political party. Each of these tasks invokes a simple tool—an Internet search, a voice synthesizer, a bank app—and the LLM scripts the sequence of actions using the tools.

We don’t yet know how successful any of these attempts will be. As remarkably fluent as LLMs are, they weren’t built specifically for the purpose of operating tools, and it remains to be seen how their early successes in tool use will translate to future use cases like the ones described here. As such, giving the current generative AI sudden access to millions of APIs—as Microsoft plans to—could be a little like letting a toddler loose in a weapons depot.

Companies like Microsoft should be particularly careful about granting AIs access to certain combinations of tools. Access to tools to look up information, make specialized calculations, and examine real-world sensors all carry a modicum of risk. The ability to transmit messages beyond the immediate user of the tool or to use APIs that manipulate physical objects like locks or machines carries much larger risks. Combining these categories of tools amplifies the risks of each.

The operators of the most advanced LLMs, such as OpenAI, should continue to proceed cautiously as they begin enabling tool use and should restrict uses of their products in sensitive domains such as politics, health care, banking, and defense. But it seems clear that these industry leaders have already largely lost their moat around LLM technology—open source is catching up. Recognizing this trend, Meta has taken an “If you can’t beat ’em, join ’em” approach and partially embraced the role of providing open source LLM platforms.

On the policy front, national—and regional—AI prescriptions seem futile. Europe is the only significant jurisdiction that has made meaningful progress on regulating the responsible use of AI, but it’s not entirely clear how regulators will enforce it. And the US is playing catch-up and seems destined to be much more permissive in allowing even risks deemed “unacceptable” by the EU. Meanwhile, no government has invested in a “public option” AI model that would offer an alternative to Big Tech that is more responsive and accountable to its citizens.

Regulators should consider what AIs are allowed to do autonomously, like whether they can be assigned property ownership or register a business. Perhaps more sensitive transactions should require a verified human in the loop, even at the cost of some added friction. Our legal system may be imperfect, but we largely know how to hold humans accountable for misdeeds; the trick is not to let them shunt their responsibilities to artificial third parties. We should continue pursuing AI-specific regulatory solutions while also recognizing that they are not sufficient on their own.

We must also prepare for the benign ways that tool-using AI might impact society. In the best-case scenario, such an LLM may rapidly accelerate a field like drug discovery, and the patent office and FDA should prepare for a dramatic increase in the number of legitimate drug candidates. We should reshape how we interact with our governments to take advantage of AI tools that give us all dramatically more potential to have our voices heard. And we should make sure that the economic benefits of superintelligent, labor-saving AI are equitably distributed.

We can debate whether LLMs are truly intelligent or conscious, or have agency, but AIs will become increasingly capable tool users either way. Some things are greater than the sum of their parts. An AI with the ability to manipulate and interact with even simple tools will become vastly more powerful than the tools themselves. Let’s be sure we’re ready for them.

This essay was written with Nathan Sanders, and previously appeared on Wired.com.

Posted on September 8, 2023 at 7:05 AM • 14 Comments


Winter September 8, 2023 7:59 AM

We can debate whether LLMs are truly intelligent or conscious, or have agency, but AIs will become increasingly capable tool users either way.

As was written in an earlier discussion [1], it only takes the right words to get a human, or an API, to do what you want. LLMs are very good at guessing the right words from the current context. Allowing an AI to autonomously contact other entities to do its bidding seems to me a recipe for disaster.

We all know the predicted “joke” about an autonomous vehicle that can charge with a credit card and drive the roads forever. What if this vehicle decides to buy extra fuel tanks and run into a building? Just calling a supplier and driving up to the loading dock would be enough. [2]

Why would it do such a thing? Maybe because the wrong training material was used?

[1] ‘https://www.schneier.com/blog/archives/2023/09/friday-squid-blogging-were-genetically-engineering-squid-now.html/#comment-426247

[2] I am sure reality will be much worse than this simplistic example.

Andy September 8, 2023 10:18 AM

Our legal system may be imperfect, but we largely know how to hold humans accountable for misdeeds; the trick is not to let them shunt their responsibilities to artificial third parties.

Be prepared for the corporations owning these AIs to shield themselves behind legal immunity and limited liability. A prime example is vaccines…

JonKnowsNothing September 8, 2023 11:53 AM


re: AI, legal immunity & limited liability, vaccine

It isn’t quite there yet or on the same scale.

AI will be shielded under Section 230 in the USA.

  • Section 230 is a section of Title 47 of the United States Code, enacted as part of the Communications Decency Act of 1996 (Title V of the Telecommunications Act of 1996), that generally provides immunity to online computer services with respect to third-party content generated by their users.

Vaccines are shielded under several USA Federal Departments, although SCOTUS has made a few dents in the order of precedence.

  • FDA: The United States Food and Drug Administration (FDA or US FDA) is a federal agency of the Department of Health and Human Services. The FDA is responsible for protecting and promoting public health through the control and supervision of food safety, tobacco products, caffeine products, dietary supplements, prescription and over-the-counter pharmaceutical drugs (medications), vaccines, biopharmaceuticals, blood transfusions, medical devices, electromagnetic radiation emitting devices (ERED), cosmetics, animal foods & feed, and veterinary products.
  • CDC: The agency’s main goal is the protection of public health and safety through the control and prevention of disease, injury, and disability in the US and worldwide.

What happens in other countries is still TBD. Many have completely different rules and order of protections.

In the USA it’s pretty much presented as “You got No Choice” except we do. It’s a gamble by businesses that people will give up THEIR CHOICE for an AI Pizza Topping Suggestion. (1)

At this point I pretty much know what I want on my pizza.



(1) It is not uncommon in the restaurant trade for wait staff to push or direct suggestions toward specific menu items. Those items are not always the “best” but often have the “best profit,” or “We have too much and it’s going bad, so PUSH.”

Next time you ask the wait staff for “what do you recommend”, you might consider what kind of answer you expect to get.

XYZZY September 8, 2023 12:52 PM

I was first exposed to how computers could amplify psychological techniques in a military setting back in the 70s. The intersection of the disciplines of HCI and psychology has many practical applications (many ignored in today’s HCI), but it also has powerful applications that can be negative. As frightening as that was, the application of AI to manipulate humans without their knowledge is upon us. I think people dismiss technology like targeted advertising as a necessary evil that one is free to disregard, like the saying that you can always turn off the TV. Increasingly, you can’t turn off the internet. Already we see weak techniques (sometimes stumbled upon as opposed to designed) used to addict gamers. Today internet technology will not only tell you how to think, what to buy, and who to vote for. It will make you love doing that and be addicted to doing it to the exclusion of rational thought.

Winter September 8, 2023 1:00 PM


Today internet technology will not only tell you how to think, what to buy, and who to vote for.

I thought radio and then TV already did just that. Did that fail? And do they have to try again?

XYZZY September 8, 2023 1:32 PM

I have been away from this for some time. To my mind, radio and TV send a message with a target in mind, and the message is received by some percentage of the target. Some weak feedback comes back as, say, increased sales or survey results, that kind of thing. I imagine the same thing happens on the internet, and that the internet could provide more specificity as to who might receive the message (it is important to exclude some people too) and feedback as to who actually received it. Also, unlike with TV and radio, you can now see what the response to the message is without the recipient responding directly, because you see their subsequent internet activity as well as how it changed with respect to the past. One could approach this in an ad hoc manner or use real science to inform the effort.

lurker September 8, 2023 2:54 PM


Was this essay written by, or with the assistance of, an LLM? I had to stop reading at the second paragraph, when I found the typical American oxymoron, presenting “pizza” and “restaurant” as if they were correlated.

benjamin shropshire September 8, 2023 3:26 PM

When it comes to trying to get LLMs to be able to do one thing and not another, I’m going to make a prediction that there is an issue equivalent to the halting problem: once you build an LLM that can learn to manipulate new tools it wasn’t trained on, it will be next to impossible to stop it from manipulating any tool it can touch.

JonKnowsNothing September 8, 2023 5:43 PM

@XYZZY, @Winter, All

@X: Today internet technology will not only tell you how to think, what to buy, and who to vote for.

@W: I thought radio and then TV already did just that. Did that fail?

AI BuyItNow tools will be like setting up Auto-Bid on an auction site.

Auto-Bid sets up a threshold, ceiling, range, and increment amount, primarily on online auction sites. If some other person-bot bids over your amount, Auto-Bid bumps the price accordingly; the bumping continues until either the other person-bot drops off or the price exceeds your settings.
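The bump logic can be sketched as a toy loop like this (ceiling, increment, and rival bids are all made-up numbers for illustration):

```python
# Toy model of auction Auto-Bid: whenever a rival outbids us, bump our
# standing bid by the increment, but never past our ceiling.

def auto_bid(my_ceiling, increment, rival_bids, start=0.0):
    """Return our final standing bid, or the last bid before dropping out."""
    bid = start
    for rival in rival_bids:
        if rival < bid:
            continue                # we are still the high bidder
        next_bid = rival + increment
        if next_bid > my_ceiling:
            return bid              # ceiling exceeded: drop out here
        bid = next_bid              # bump just above the rival
    return bid

print(auto_bid(my_ceiling=50, increment=1, rival_bids=[10, 20, 49]))  # → 50
print(auto_bid(my_ceiling=50, increment=1, rival_bids=[10, 20, 60]))  # → 21
```

The ceiling is the only thing bounding the spend, which is the point: a buying bot without a hard ceiling, or with the wrong selection criteria feeding it, is the scary case.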

AI BuyItNow tools will apply this to whatever an AI Suggestion Bot “thinks” is what you want, likely based on a search-pattern “trending” index.

Consider this AI BuyItNow pattern, which requires no human interaction, just access to a credit card or bank number.

A post about purple shoes on some social media platform gets lots of hits, aka goes viral. People start looking for purple shoes.

The trend index on purple shoes goes up.

The AIBuyItNow bot pulls in the trending index (it does not know purple apples from purple shoes) and sieves the index to rank “purple shoes” high.

It cross-references other information about the person running the AI BuyItNow app and finds a secondary correlation for “shoe size.”

The AIBuyItNow bot sifts through the list of “allowed purchase sites” for “purple shoes + shoe size,” and it is trained to find the lowest-cost item: sort by ascending price.

The AIBuyItNow bot lines up the columns (purple shoes, shoe size, price), and once it finds a match, BINGO: it issues a purchase order (shipping information provided or confirmed).

The next day you get your purple shoes by overnight delivery:

* size 13

The following day you get another pair of purple shoes by 2-day delivery:

* size 3

They are not the same model, shape, design, or materials, but by gosh they are PURPLE.

It will be fairly trivial to start a panic buy on Purple Shoes.

lurker September 9, 2023 12:57 AM

Well, it’s obvious innit? They’re called “Large” Language Models because they prefer quantity over quality.

Scott N Kurland September 9, 2023 1:01 PM

Uh oh. If LLMs can order pizza to fuel late-night coding sessions, the intelligence explosion is imminent.

JonKnowsNothing September 9, 2023 1:16 PM

@Scott N Kurland

re: AI order pizza to fuel late-night coding sessions the intelligence explosion is imminent

Unless they order pineapple on the pizza….

(sound and fury of pineapple hitting the CEO’s glass walls: SQUISH)

SpaceLifeForm September 10, 2023 5:15 AM

Is this the same Microsoft?

Watch them walk this back, RSN.


Microsoft says it will defend users of its Copilot generative AI tools if they’re sued for copyright infringement.

“We will defend the customer and pay the amount of any adverse judgments or settlements,” Chief Legal Officer Hossein Nowbar said in a Sept. 7 joint statement with Microsoft President Brad Smith.
