AI has become the ultimate pocket encyclopedia—ask it anything, and it’s ready with a quick answer.
Recipes? Sure. Movie trivia? Easy!
But throw it a real-world, multi-step problem, and suddenly, your super-smart AI looks a little… well, lost.
Take this scenario for instance – you’re at the grocery store, staring at a shelf of fruits.
You snap a photo and ask your AI, “How much will 2 kilograms of bananas and 1 kilogram of apples cost me?” Sounds simple enough, right? Except it’s not. The AI needs to:
- Read the price tags on the fruit (OCR magic).
- Match the prices to the right fruits.
- Do some math to figure out your total.
Most AI models? They’ll flub it. And when they fail, you’re left wondering—was it the reading part, the matching part, or the math? Meanwhile, you’re still standing there, calculator in hand.
Enter TACO: the latest brainchild from the Salesforce innovation lab that tastes great (ok, maybe not literally) but works like a charm! It’s an AI that truly gets it.
In this blog post, we’ll cover everything you need to know about TACO to stay ahead in the AI race.
What is TACO?
Salesforce AI Research has rolled out TACO.
“We present TACO, a family of multi-modal large action models designed to improve performance on complex questions that require multiple capabilities and demand multi-step solutions,” Salesforce said in a blog post on January 16, 2025.
While open-source multi-modal models can handle simple questions just fine, they often struggle with more complex ones that need a mix of skills—like recognizing details, understanding visuals, and doing some reasoning, especially when the task requires several steps.
That is the gap TACO aims to close: it is a series of multi-modal action models built to crush these tricky, multi-step problems.
TACO works by generating chains-of-thought-and-action (CoTA), using tools like OCR, depth estimation, and calculators to break down tasks step by step, and then pulling everything together to give a smooth, clear answer.
In a nutshell, TACO takes on a major flaw in today’s AI systems (like open-source multi-modal models) that struggle with solving real-world, complex problems one step at a time.
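To see what a chain-of-thought-and-action might look like in practice, here is a minimal Python sketch. It is not Salesforce's implementation: the tool functions, the `run_cota` helper, and the prices are all hypothetical, and a hand-written list of steps stands in for the model's own planning. The point is only the pattern of thought, then action, then observation, repeated until the answer falls out.

```python
import operator

# Hypothetical stand-ins for real tools. A real system would call an OCR
# engine and a calculator tool; these stubs just return simple results.
def ocr(image):
    return image["text"]  # pretend the photo already carries its extracted text


def calculate(op, a, b):
    ops = {"add": operator.add, "mul": operator.mul, "div": operator.truediv}
    return ops[op](a, b)


TOOLS = {"ocr": ocr, "calculate": calculate}


def run_cota(steps, image):
    """Run scripted (thought, tool, argument-builder) steps and record a trace.

    Each argument-builder receives the image and all earlier observations,
    so later steps can build on the results of earlier ones.
    """
    observations = []
    trace = []
    for thought, tool, make_args in steps:
        result = TOOLS[tool](*make_args(image, observations))
        observations.append(result)
        trace.append({"thought": thought, "action": tool, "observation": result})
    return trace


# Invented prices per kilogram, purely for illustration.
image = {"text": {"bananas": 1.5, "apples": 3.0}}

steps = [
    ("Read the price tags.", "ocr", lambda img, obs: (img,)),
    ("Cost of 2 kg of bananas.", "calculate",
     lambda img, obs: ("mul", obs[0]["bananas"], 2)),
    ("Cost of 1 kg of apples.", "calculate",
     lambda img, obs: ("mul", obs[0]["apples"], 1)),
    ("Add the two costs.", "calculate",
     lambda img, obs: ("add", obs[1], obs[2])),
]

trace = run_cota(steps, image)
for step in trace:
    print(step)

answer = trace[-1]["observation"]
print(f"Total: ${answer:.2f}")
```

Because every step leaves a record of what was thought, which tool ran, and what came back, a wrong total can be traced to the step that produced it.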
How Exactly Does TACO by Salesforce Work?
Salesforce claims that TACO achieved 30-50% higher performance compared to models using traditional direct answers. It also outperformed baseline models by up to 20% on the MMVet benchmark.[i]
Let’s break down how TACO works using an example where you take a photo of a gas station sign with different gas prices, and you ask the AI how much gas you can buy with a specific budget. Here’s what happens in the background:
- Step 1: Image Capture
You snap a photo of the gas station panel showing different gasoline types and prices.
- Step 2: Multi-Modal Processing
TACO identifies the image as part of the question by analyzing the visual data (the gas prices, fuel types, etc.) using its built-in vision capabilities, like object detection and OCR (Optical Character Recognition).
- Step 3: OCR & Localization
TACO uses OCR to extract the text from the image, identifying things like the price for regular, premium, and diesel fuel. It also figures out where each price is located, so it can match the correct fuel type with the price.
- Step 4: Action & Reasoning
TACO then reasons through the problem: it takes the specified budget and performs a multi-step calculation to determine how much gas you can buy for each fuel type (based on the prices it extracted earlier).
- Step 5: External Tool Usage (if needed)
If the task needs more than the model can do on its own, TACO can call on external tools, such as a calculator, an API for current fuel price info, or even a web search, to ensure the answer is accurate.
- Step 6: Chain of Thought & Action (CoTA)
TACO generates chains-of-thought-and-action, working through each step logically, combining reasoning and actions like text extraction, calculations, and using external tools.
- Step 7: Final Response
TACO delivers a coherent, accurate answer that explains how much gas you can buy based on your budget, ensuring everything lines up.
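The matching and math in Steps 3 and 4 can be sketched in a few lines of Python. Treat this as an illustration under stated assumptions, not TACO's actual code: the detections, coordinates, prices, and the $40 budget are invented, and the function names are hypothetical. The sketch assumes each price sits on roughly the same row as its fuel label, so it pairs every label with the price closest to it vertically.

```python
# Hypothetical OCR output: (text, x, y), where x and y mark the center of
# each detected text box. All values are invented for illustration.
detections = [
    ("Regular", 40, 100), ("3.20", 200, 102),
    ("Premium", 40, 160), ("3.80", 200, 158),
    ("Diesel", 40, 220), ("4.00", 200, 223),
]


def is_price(text):
    """Return True if the text parses as a number."""
    try:
        float(text)
        return True
    except ValueError:
        return False


def match_labels_to_prices(detections):
    """Localization: pair each fuel label with the vertically nearest price."""
    labels = [d for d in detections if not is_price(d[0])]
    prices = [d for d in detections if is_price(d[0])]
    matched = {}
    for name, _, label_y in labels:
        nearest = min(prices, key=lambda p: abs(p[2] - label_y))
        matched[name] = float(nearest[0])
    return matched


def units_for_budget(prices, budget):
    """Reasoning: how many units (gallons or liters) the budget buys per fuel."""
    return {fuel: round(budget / price, 2) for fuel, price in prices.items()}


prices = match_labels_to_prices(detections)
amounts = units_for_budget(prices, budget=40.0)

for fuel, amount in amounts.items():
    print(f"{fuel}: {amount} units at {prices[fuel]:.2f} each")
```

Keeping the matching separate from the arithmetic makes it easier to tell whether a wrong answer came from pairing the wrong price with a fuel type or from the calculation itself.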
Building TACO involved training it on a carefully curated CoTA dataset with 293K examples from 31 different sources, like Visual Genome. The dataset spans a wide range of tasks, from math reasoning to OCR and deep visual understanding, and draws on tools like object localization and language-based solvers to cover many kinds of reasoning and action. The training setup combined LLaMA3 for language and CLIP for visuals, creating a solid multi-modal framework. The team then fine-tuned the model by adjusting hyperparameters, such as lowering learning rates and increasing training epochs, so that it could tackle complex, multi-modal challenges.
So, unlike basic AI models that can handle simple queries but fail on complex, multi-step tasks, TACO’s ability to think through each part of the problem and call on external tools makes it more reliable and efficient for tackling difficult queries.
How TACO by Salesforce Fits in Sales, Service, Marketing, and More
Use Cases in Sales, Service, Marketing, and Commerce
- Sales: You’ve got a ton of product images, customer preferences, and market trends floating around. TACO steps in like a superhero, pulling together all that info, offering personalized recommendations, and predicting which products are likely to fly off the shelves.
- Service: Now, picture you’re in customer service, and a customer sends you a picture of their broken product along with a question. TACO doesn’t just read the message; it looks at the image, understands the issue, and gives you the info you need to solve it super quickly.
- Marketing: You’re running a marketing campaign, and you’ve got loads of social media images and text to sift through. TACO gets to work, analyzing both the pics and the comments, picking up on sentiment, and helping you target the right audience with the perfect message.
- Commerce: Running an online store? TACO’s got you covered. It scans product images, reviews, and prices across different platforms to help you figure out the best pricing strategies or which products are hot right now.
Industry-Specific Examples
- Healthcare: Let’s say you’re a doctor, and you’ve got an X-ray and a bunch of patient records to go through. TACO can scan that X-ray, read through the patient’s history, and even suggest treatment options based on all the data. It’s like having a second brain to help with tough decisions—without the stress.
- Finance: If you’re in finance, imagine looking at financial reports, stock data, and market trends. TACO can bring all that together, extracting key insights from documents, scanning for trends, and even making investment recommendations.
- Retail: Running a retail business? TACO helps you analyze what’s selling, what’s not, and why, by looking at store shelf images and customer shopping behavior. It’s your ally that tells you exactly how to lay out your store or what products to push—no guesswork involved!
The Bottom Line
And bam: TACO is here, the AI that gets you. No more canned bot responses. TACO’s all about handling those tricky, real-world problems like a boss. It’s not just about answering questions; it’s about cracking multi-step challenges, using all the right tools, and still making it look easy.
This is your AI sidekick that’s smarter than your average assistant and doesn’t need a coffee break!