TOON Format Benchmarks: Comprehensive LLM Performance Analysis
Token-Oriented Object Notation (TOON) has emerged as a purpose-built format designed to reduce token consumption when passing structured data to Large Language Models. But how does it perform in real-world LLM understanding and retrieval tasks? We dive into the benchmark data to find out.
As developers increasingly integrate LLMs into their applications, the choice of data format becomes critical. While TOON promises significant token savings, the question remains: do language models understand and process TOON data as effectively as they do more established formats like JSON, YAML, or XML?
This analysis examines benchmark results from multiple testing scenarios to provide a comprehensive view of TOON's performance characteristics.
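To make the comparison concrete, here is the same small dataset in JSON and in TOON. The TOON snippet approximates the examples in the format's documentation (array length in brackets, field names declared once, one delimited row per record); consult the official spec for exact syntax.

```
JSON (keys repeated for every record):
[
  { "id": 1, "name": "Alice", "role": "admin" },
  { "id": 2, "name": "Bob", "role": "user" }
]

TOON, approximate syntax (keys declared once):
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

The token savings come from dropping repeated keys, braces, and quotes, which is also why the advantage is largest on uniform, table-like data.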
Benchmark 1: Tabular Data Understanding
The first benchmark evaluated how well GPT-4.1 nano could understand and process tabular data across multiple formats. The test pitted TOON against established formats such as JSON, XML, YAML, and HTML, as well as other token-efficient alternatives such as CSV, JSONL, and Markdown tables.
Key Findings: Token Efficiency vs. Accuracy
TOON Performance: TOON achieved 47.5% accuracy with 21,518 tokens, making it the second most token-efficient format tested (only CSV used fewer tokens) while maintaining reasonable accuracy.
Comparison Point: The difference in accuracy between TOON (47.5%) and CSV (44.3%) was not statistically significant, despite both being highly token-efficient.
Trade-off Analysis: While TOON used fewer tokens than formats like JSON (66,396 tokens) and XML (76,114 tokens), those formats achieved higher accuracy rates (52.3% and 56.0% respectively).
| Format | Accuracy | 95% Confidence Interval | Tokens |
|---|---|---|---|
| Markdown-KV | 60.7% | 57.6% – 63.7% | 52,104 |
| XML | 56.0% | 52.9% – 59.0% | 76,114 |
| INI | 55.7% | 52.6% – 58.8% | 48,100 |
| YAML | 54.7% | 51.6% – 57.8% | 55,395 |
| HTML | 53.6% | 50.5% – 56.7% | 75,204 |
| JSON | 52.3% | 49.2% – 55.4% | 66,396 |
| Markdown-Table | 51.9% | 48.8% – 55.0% | 25,140 |
| Natural-Language | 49.6% | 46.5% – 52.7% | 43,411 |
| TOON | 47.5% | 44.4% – 50.6% | 21,518 |
| JSONL | 45.0% | 41.9% – 48.1% | 54,407 |
| CSV | 44.3% | 41.2% – 47.4% | 19,524 |
| Pipe-Delimited | 41.1% | 38.1% – 44.2% | 43,098 |
Analysis Insight
TOON demonstrated strong performance when considering the token efficiency trade-off. While accuracy was lower than more verbose formats, the token savings (21,518 vs. 66,396 for JSON) represent a significant cost reduction for applications where token budget is a primary concern.
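If you want to sanity-check numbers like these against your own data, token counts are easy to measure offline. The sketch below uses OpenAI's tiktoken library to count tokens for a JSON serialization and a hand-written TOON-style string of the same records; the records, the TOON string, and the choice of encoding are illustrative assumptions, not the benchmark's actual setup.

```python
# Minimal sketch: compare token counts for two serializations of the same data.
import json

import tiktoken  # OpenAI's tokenizer library

records = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
    {"id": 3, "name": "Carol", "role": "user"},
]

# Standard JSON serialization.
as_json = json.dumps(records, indent=2)

# Hand-written TOON-style encoding of the same records (approximate syntax).
as_toon = (
    "users[3]{id,name,role}:\n"
    "  1,Alice,admin\n"
    "  2,Bob,user\n"
    "  3,Carol,user\n"
)

# o200k_base is the encoding used by recent GPT-4o/4.1-family models;
# swap in whatever matches your target model.
enc = tiktoken.get_encoding("o200k_base")

for label, text in [("JSON", as_json), ("TOON", as_toon)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```

On a large uniform table, the ratio between these two counts is what drives the cost savings discussed above.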
Benchmark 2: Nested Data Structure Understanding
A second benchmark evaluated how well GPT-5 nano could understand and retrieve information from nested data structures. This test is particularly relevant for complex data scenarios where hierarchical relationships matter.
| Format | Accuracy | 95% Confidence Interval | Tokens |
|---|---|---|---|
| YAML | 62.1% | 59.1% – 65.1% | 42,477 |
| Markdown | 54.3% | 51.2% – 57.4% | 38,357 |
| JSON | 50.3% | 47.2% – 53.4% | 57,933 |
| XML | 44.4% | 41.3% – 47.5% | 68,804 |
| TOON | 43.1% | 40.0% – 46.2% | 45,436 |
Nested Data Findings
- TOON achieved 43.1% accuracy, lower than YAML (62.1%), Markdown (54.3%), and JSON (50.3%)
- YAML performed best overall with 62.1% accuracy, though it used more tokens than Markdown
- Markdown offered the best token efficiency (38,357 tokens) while maintaining 54.3% accuracy
- For nested structures, TOON's token efficiency advantage was far less pronounced than in the tabular scenario (see the sketch below)
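One plausible reason, illustrated below with approximate syntax: TOON's compact tabular block (header declared once, delimited rows) only applies to uniform arrays. Deeply nested objects fall back to an indented, YAML-like representation, so the per-record key savings largely disappear. The field names here are invented for illustration and are not taken from the benchmark dataset.

```
order:
  id: 9021
  customer:
    name: Alice
    address:
      city: Berlin
      zip: 10115
  items[2]{sku,qty,price}:
    A-100,2,9.99
    B-400,1,24.50
```

Only the `items` array benefits from the tabular compaction; the rest of the structure carries roughly the same overhead as YAML, which is consistent with the near-parity in token counts in the table above.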
Conflicting Benchmark Results: What's Going On?
Interestingly, the official TOON GitHub repository includes data retrieval benchmarks that show TOON performing significantly better than other formats when tested with GPT-5 nano. These results appear to contradict the findings from the independent testing discussed above.
Understanding the Discrepancy
Different Test Scenarios: The official TOON benchmarks may use different evaluation methodologies, test datasets, or specific prompt structures that favor TOON's format characteristics.
Model Variations: Different model versions or configurations can produce varying results with the same format.
Task-Specific Performance: TOON may excel in certain types of data retrieval tasks while performing differently in others.
Validation Note: Independent verification of the official benchmarks has confirmed their reproducibility, suggesting the methodology is sound. The difference likely stems from test design rather than implementation issues.
Key Takeaways and Recommendations
When TOON Makes Sense
- Token budget is a primary constraint
- Working with uniform, tabular data structures
- Cost reduction outweighs accuracy trade-offs
- Large-scale data transmission to LLMs
- RAG pipelines with extensive data
Consider Alternatives When
- Maximum accuracy is required
- Working with deeply nested structures
- Interoperability is critical
- Team familiarity with the format matters
- Complex hierarchical data relationships
The Bottom Line
TOON represents an innovative approach to token-efficient data formatting for LLMs. While it may not always achieve the highest accuracy rates, its token savings can be substantial, potentially reducing input costs by 50-70% compared to JSON for tabular data (in Benchmark 1, TOON used 21,518 tokens versus JSON's 66,396, roughly a 68% reduction). The format shows particular promise for applications where token efficiency is paramount and slight accuracy trade-offs are acceptable.
As with any technology choice, the decision to use TOON should be based on your specific use case, accuracy requirements, and cost constraints. We recommend running your own benchmarks with your data and models to determine if TOON is the right fit for your application.
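If you do run your own evaluation, the harness does not need to be elaborate. The sketch below shows the basic shape: serialize the same records in each candidate format, ask the model a fixed set of retrieval questions, and score exact-match accuracy. Here query_llm() is a hypothetical placeholder for whatever client you use, and the prompt wording and scoring rule are assumptions to adapt to your task.

```python
# Minimal sketch of a format-comparison benchmark harness.
import json

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (OpenAI SDK, local model, etc.)."""
    raise NotImplementedError

def run_benchmark(serialized_data: str, questions: list[dict]) -> float:
    """Ask each question against the serialized data; return exact-match accuracy."""
    correct = 0
    for q in questions:
        prompt = (
            "Answer using only the data below.\n\n"
            f"{serialized_data}\n\n"
            f"Question: {q['question']}\n"
            "Reply with the value only."
        )
        if query_llm(prompt).strip() == q["expected"]:
            correct += 1
    return correct / len(questions)

# Example usage (once query_llm is wired to a real model):
records_json = json.dumps([{"id": 1, "name": "Alice", "role": "admin"}])
questions = [{"question": "What role does Alice have?", "expected": "admin"}]
# accuracy = run_benchmark(records_json, questions)
```

Repeat the same loop for each serialization (JSON, YAML, TOON, and so on) and compare both the accuracy and the token count of the serialized data.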