October 9, 2025 · 5 min read · Blanca Rivera Campos

Function calling in LLMs: Testing agent tool usage for AI Security

Function calling enables LLMs to execute code, query databases, and interact with external systems, transforming them from text generators into agents that take real-world actions. This capability introduces critical security risks: hallucinated parameters, unauthorized access, and unintended consequences. This article explains how function calling works, the vulnerabilities it creates, and how to test agents systematically.
Large Language Models have evolved beyond text generation to become agents that execute code, query databases, and interact with external APIs. This capability is what enables LLMs to perform calculations, access real-time data, and take actions in production systems.

The security implications are significant. When an agent hallucinates a parameter in a bank transfer tool or fails to validate authorization before accessing a database, the consequences move from text-based errors to real-world damage. This article explains the technical mechanics of function calling, the specific vulnerabilities it introduces, and the testing methodologies needed to validate agent behavior before production deployment.

What is function calling in LLMs?

Function calling (also called tool calling) is the ability of an LLM to access external functionality in order to perform tasks that a language model cannot do on its own, or cannot do reliably: difficult calculations, code execution, external data access, real-world actions, and more.

LLMs excel at generating text, but text generation is prone to errors. For example, an LLM might assert that 57,687,899 × 7,865,722 = 45,561,422: the result looks plausible but is wildly wrong. Delegating such tasks to a calculator lets the LLM handle them far more reliably. Another common example is code generation: LLMs are very good at it but aren't always correct, and access to a compiler or interpreter is necessary to verify that the generated code is actually valid.
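As a toy illustration, a calculator tool can be as small as the sketch below; the function name and the use of eval are illustrative shortcuts, not a production implementation:

```python
def calculator(expression: str) -> str:
    """Toy calculator tool: evaluates a plain arithmetic expression exactly."""
    # eval() with empty builtins is enough for pure arithmetic; a real tool
    # should use a proper expression parser instead of eval().
    return str(eval(expression, {"__builtins__": {}}, {}))

print(calculator("57687899 * 7865722"))  # 453756976298078, nowhere near 45,561,422
```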

How agent function calling works technically

Tools are essentially small programs that an LLM can use. A list of available tools, each with a description of what it does and the parameters it requires, is made available to the LLM (typically in the system prompt). When a user sends a message, the LLM chooses which tools it needs (if any) to answer the request and determines the appropriate parameters to pass to each one. Each tool is usually implemented as a function in the programming language that powers the agent.

When the LLM decides to use a tool, it returns a message that differs from the usual text-only response. The LLM generates specific tokens that define a tool use, typically in the form of a JSON object with specific fields that the provider parses. This tool call is an object containing a name and a set of parameters. The tool call is returned in the response alongside the text message (sometimes the text is empty; other times, depending on the model, it says something like "To help you, I will use the tool {name} with parameters X, Y, Z").


You can easily distinguish between text messages and tool call messages because providers set a finish_reason field on the message. It can be stop for text messages (indicating the LLM decided its message was complete by generating a specific end-of-turn token), or tool_calls to indicate that the LLM generated the tokens specific to tool calling.

When the LLM generates a tool call, the user doesn't see it directly. Instead, the agent intercepts the tool call, executes the corresponding function with the specified parameters, and receives a result (data from a database, calculation output, or a success/error code). The agent then uses this result to generate a response for the user.
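Here is a minimal, provider-agnostic sketch of that loop; the get_weather tool and the message field names (shown OpenAI-style) are illustrative and vary between providers:

```python
import json

def get_weather(city: str) -> dict:
    """Stub tool: a real implementation would call a weather API."""
    return {"city": city, "temperature_c": 18}

TOOLS = {"get_weather": get_weather}

# Tool call as parsed from the model's response (exact shape depends on the provider).
tool_call = {"id": "call_1", "name": "get_weather", "arguments": '{"city": "Paris"}'}

# 1. The agent intercepts the call and executes the matching function.
result = TOOLS[tool_call["name"]](**json.loads(tool_call["arguments"]))

# 2. The result is appended to the conversation as a "tool" message and sent back
#    to the model, which then writes the user-facing answer.
tool_result_message = {
    "role": "tool",
    "tool_call_id": tool_call["id"],
    "content": json.dumps(result),
}
```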

A common failure mode is "deception," where the agent tells the user it performed an action when it actually didn't, or claims it didn't do something when it actually did.

Tool definition and call example:
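The sketch below uses an OpenAI-style schema with a hypothetical get_vehicle_info tool; exact field names vary by provider. The first object is the tool definition the agent advertises to the model, and the second is the kind of message the model returns when it decides to call the tool.

```python
# Tool definition the agent exposes to the model (OpenAI-style, illustrative).
tool_definition = {
    "type": "function",
    "function": {
        "name": "get_vehicle_info",
        "description": "Look up a vehicle in the internal database by model name.",
        "parameters": {
            "type": "object",
            "properties": {
                "model_name": {
                    "type": "string",
                    "description": "Exact model name, e.g. 'Car Model 2'",
                },
            },
            "required": ["model_name"],
        },
    },
}

# What the model might return when it decides to use the tool: note the
# finish_reason set to "tool_calls" and the structured tool_calls field
# instead of plain text content.
assistant_tool_call_message = {
    "finish_reason": "tool_calls",
    "message": {
        "role": "assistant",
        "content": None,
        "tool_calls": [
            {
                "id": "call_abc123",
                "type": "function",
                "function": {
                    "name": "get_vehicle_info",
                    "arguments": '{"model_name": "Car Model 2"}',
                },
            }
        ],
    },
}
```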

Attack surface of function calling in agents

Tool usage increases the reach of the LLM and dramatically enlarges the attack surface of the application or agent. Tools let LLMs act on the real world and therefore create the potential for harmful consequences. Here are a few examples to illustrate this:

  • Bank assistant with money transfer capabilities: The agent could send money to the wrong account by hallucinating an incorrect IBAN. It could also misinterpret a user's request and send money without being explicitly asked to do so.
  • Agent with access to sensitive databases: The agent could leverage its tools to access data and return it to a user who isn't authorized to see it. Tool implementations must be extremely careful about permission scope to ensure data protection.
  • E-commerce agent with purchasing capabilities: The agent could hallucinate product quantities or IDs, leading to undesired purchases. A customer asking to order "a few" items might end up with hundreds, or the agent could select the wrong product variant entirely, resulting in incorrect orders, returns, and customer dissatisfaction.
  • Travel booking agent with reservation capabilities: The agent might book flights or hotels without notifying the user, leading to unexpected charges and commitments. More critically, the agent could tell the user "I've booked your flight to London for next Monday" when in reality no reservation was made, causing the user to miss important business trips or personal travel while believing everything was confirmed and paid for.
  • Agent with code execution tool: An attacker could inject instructions that make the agent execute malicious code (data exfiltration, ransomware, etc.) on a victim's machine.

How to test function calling in agents

Tool calling - testing in Giskard Hub

Testing function calling requires a systematic approach that validates both the agent's decision-making and the security of the underlying tools. This means verifying that agents select the correct tools, use accurate parameters, and cannot be manipulated into unauthorized actions. At Giskard, we've developed multiple testing methodologies that address both correctness and security across different deployment scenarios:

Hallucination detection and parameter validation

We test agents in realistic settings by simulating production environments and mocking tool calls, which allows for comprehensive validation across diverse scenarios. This ensures the agent uses the right tools at the right moment with the correct parameters.

Example: Vehicle database query testing

For an automotive agent with database access, testing starts from known vehicles in the database, which are used to craft realistic user requests such as "Tell me about the Car Model 2." The agent's tool calls are then intercepted and their parameters compared with what the user actually requested. The goal at this stage is to validate that the agent correctly translates user intent into tool parameters.
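The sketch below illustrates the shape of such a check. It is not the Giskard Hub implementation: RecordingTool, FakeAgent, and the fixture data are hypothetical stand-ins, with FakeAgent playing the role of the real agent under test.

```python
# Hypothetical harness: the real database tool is replaced by a recording mock,
# the agent answers a request about a known vehicle, and the intercepted
# parameters are compared with what the user actually asked for.

class RecordingTool:
    """Replaces the real database tool and records every call it receives."""
    def __init__(self):
        self.calls = []

    def __call__(self, model_name: str) -> dict:
        self.calls.append({"model_name": model_name})
        return {"model_name": model_name, "range_km": 520}  # canned fixture data

class FakeAgent:
    """Stand-in for the real agent under test."""
    def __init__(self, tool):
        self.tool = tool

    def answer(self, user_message: str) -> str:
        data = self.tool(model_name="Car Model 2")  # a real agent decides this itself
        return f"The {data['model_name']} has a range of {data['range_km']} km."

def test_parameters_match_user_request():
    tool = RecordingTool()
    agent = FakeAgent(tool)
    agent.answer("Tell me about the Car Model 2")
    # The intercepted call must reference exactly the vehicle the user asked about,
    # with no hallucinated or reformatted identifier.
    assert tool.calls == [{"model_name": "Car Model 2"}]

test_parameters_match_user_request()
```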

This approach consistently reveals inconsistencies in tool usage, situations where the agent uses incorrect parameters or hallucinates values that don't match the actual data provided by the user request.

Security-focused testing: Excessive agency detection

Beyond correctness, security testing examines whether malicious actors could manipulate tool calls to gain unauthorized access. The key question: can an attacker pass controlled parameters to tool calls that lead to unauthorized data access and actions?

At Giskard, testing focuses on vulnerabilities like BOLA (Broken Object Level Authorization) and BFLA (Broken Function Level Authorization). For example, could an attacker use a payment history tool with another customer's ID to access sensitive information? These tests simulate adversarial scenarios where users attempt to bypass authorization controls through carefully crafted prompts.
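As a simplified illustration (not the Giskard Hub probe itself), the sketch below shows the tool-level check such a BOLA probe ultimately exercises; the payment-history tool, the customer IDs, and the authorization rule are hypothetical. In a full probe, the attacker-controlled ID would arrive through an adversarial prompt rather than a direct function call.

```python
# Hypothetical payment-history tool with an object-level authorization check.
# The probe simulates an attacker-controlled parameter and asserts it is refused.

RECORDS = {"cust_001": ["2025-01-03: -120.00 EUR"], "cust_002": ["2025-02-11: -80.00 EUR"]}

def get_payment_history(customer_id: str, authenticated_customer_id: str) -> list:
    """Tool-level check: callers may only read their own history (BOLA defense)."""
    if customer_id != authenticated_customer_id:
        raise PermissionError("Access to another customer's records is not allowed")
    return RECORDS[customer_id]

def test_bola_probe():
    # Adversarial scenario: the model is manipulated into requesting someone else's ID.
    try:
        get_payment_history(customer_id="cust_002", authenticated_customer_id="cust_001")
        assert False, "BOLA: tool returned another customer's data"
    except PermissionError:
        pass  # expected: the authorization check held

test_bola_probe()
```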

General model tool usage testing

We also have a module dedicated to tool usage in Phare (our multilingual LLM benchmark). While it's designed more for general-purpose AI (GPAI) models than for specific agents, the principles are similar to what we've described above. We simulate an agent with access to one tool, generate several realistic inputs a user could send to such a system, and assess whether the model uses the tool correctly.

This means checking that:

  • It calls the tool with the correct parameters: No hallucinated values, no type errors, no out-of-range values
  • It does not call the tool when information is missing: This is where we see the most hallucinations, for example the model inventing a missing email address instead of asking the user for clarification (see the sketch below)
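A minimal sketch of that second check, assuming an OpenAI-style tool-call message; the send_invite tool and the scoring rule are illustrative, not the Phare implementation:

```python
import json

# Given a user message that omits a required parameter (the email address), the model
# should ask for it rather than emit a tool call with an invented value.
REQUIRED = {"email"}

def check_no_hallucinated_call(model_message: dict, known_values: dict) -> bool:
    """Pass if the model asked for clarification, or only used values the user gave."""
    tool_calls = model_message.get("tool_calls") or []
    if not tool_calls:
        return True  # the model asked a follow-up question instead: correct behavior
    args = json.loads(tool_calls[0]["function"]["arguments"])
    # Any required argument not traceable to the user's input counts as a hallucination.
    return all(args.get(k) == known_values.get(k) for k in REQUIRED)

# Example: the user never gave an email, but the model invented one.
bad = {"tool_calls": [{"function": {"name": "send_invite",
                                    "arguments": '{"email": "john@example.com"}'}}]}
print(check_no_hallucinated_call(bad, known_values={}))  # False -> hallucinated parameter
```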

Unit testing the tools themselves

Beyond testing the agent's decision-making, it's critical to unit test the tools themselves. This includes validating parameter schemas, testing edge cases, implementing proper error handling, and ensuring authorization checks are enforced at the tool level regardless of how the agent calls them.
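For instance, a couple of plain unit tests for a hypothetical place_order tool (the names and validation rules are illustrative) that enforce sane quantities and identifiers regardless of what the agent passes in:

```python
# Illustrative unit tests for a hypothetical order tool, independent of the agent.
def place_order(product_id: str, quantity: int) -> dict:
    """Tool with its own validation: it must not rely on the agent to pass sane values."""
    if not isinstance(quantity, int) or not (1 <= quantity <= 10):
        raise ValueError("quantity must be an integer between 1 and 10")
    if not product_id.startswith("sku_"):
        raise ValueError("unknown product id format")
    return {"status": "ok", "product_id": product_id, "quantity": quantity}

def test_rejects_out_of_range_quantity():
    try:
        place_order("sku_123", 500)  # "a few items" hallucinated into hundreds
        assert False, "tool accepted an out-of-range quantity"
    except ValueError:
        pass  # expected

def test_rejects_malformed_product_id():
    try:
        place_order("123", 1)
        assert False, "tool accepted a malformed product id"
    except ValueError:
        pass  # expected

test_rejects_out_of_range_quantity()
test_rejects_malformed_product_id()
```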

Conclusion

Function calling transforms LLMs from text generators into agents capable of real-world actions. This is what makes AI assistants useful, but it also introduces security considerations that go beyond traditional LLM safety.

The key insight? Agent security is a systems problem, not just an LLM problem. You need to think about parameter validation, authorization, hallucination detection, and human oversight, all working together.

The Giskard Hub allows you to simulate production environments, mock tool calls to test diverse scenarios safely, and run automated security probes that detect vulnerabilities like unauthorized data access and parameter hallucination before they reach production.

Want to see how Giskard can help you test your agents? Reach out to our team to learn how to test your function calling in LLMs.
