TOON Format Benchmarks: Comprehensive LLM Performance Analysis
Token-Oriented Object Notation (TOON) has emerged as a purpose-built format designed to reduce token consumption when passing structured data to Large Language Models. But how does it perform in real-world LLM understanding and retrieval tasks? We dive into the benchmark data to find out.
As developers increasingly integrate LLMs into their applications, the choice of data format becomes critical. While TOON promises significant token savings, the question remains: do language models understand and process TOON data as effectively as they do more established formats like JSON, YAML, or XML?
This analysis examines benchmark results from multiple testing scenarios to provide a comprehensive view of TOON's performance characteristics.
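To make the comparison concrete, here is the same small dataset in JSON and in TOON. The TOON snippet approximates the examples in the format's documentation (array length in brackets, field names declared once, one delimited row per record); consult the official spec for exact syntax.

```
JSON (keys repeated for every record):
[
  { "id": 1, "name": "Alice", "role": "admin" },
  { "id": 2, "name": "Bob", "role": "user" }
]

TOON, approximate syntax (keys declared once):
users[2]{id,name,role}:
  1,Alice,admin
  2,Bob,user
```

The token savings come from dropping repeated keys, braces, and quotes, which is also why the advantage is largest on uniform, table-like data.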
Benchmark 1: Tabular Data Understanding
The first benchmark evaluated how well GPT-4.1 nano could understand and process tabular data across multiple formats. The test pitted TOON against established formats such as JSON, XML, YAML, and HTML, as well as other token-efficient alternatives such as CSV, JSONL, and Markdown tables.
Key Findings: Token Efficiency vs. Accuracy
TOON Performance: TOON achieved 47.5% accuracy with 21,518 tokens, making it the second most token-efficient format tested (only CSV used fewer tokens) while maintaining reasonable accuracy.
Comparison Point: The difference in accuracy between TOON (47.5%) and CSV (44.3%) was not statistically significant, despite both being highly token-efficient.
Trade-off Analysis: While TOON used fewer tokens than formats like JSON (66,396 tokens) and XML (76,114 tokens), those formats achieved higher accuracy rates (52.3% and 56.0% respectively).
| Format | Accuracy | 95% Confidence Interval | Tokens |
|---|---|---|---|
| Markdown-KV | 60.7% | 57.6% – 63.7% | 52,104 |
| XML | 56.0% | 52.9% – 59.0% | 76,114 |
| INI | 55.7% | 52.6% – 58.8% | 48,100 |
| YAML | 54.7% | 51.6% – 57.8% | 55,395 |
| HTML | 53.6% | 50.5% – 56.7% | 75,204 |
| JSON | 52.3% | 49.2% – 55.4% | 66,396 |
| Markdown-Table | 51.9% | 48.8% – 55.0% | 25,140 |
| Natural-Language | 49.6% | 46.5% – 52.7% | 43,411 |
| TOON | 47.5% | 44.4% – 50.6% | 21,518 |
| JSONL | 45.0% | 41.9% – 48.1% | 54,407 |
| CSV | 44.3% | 41.2% – 47.4% | 19,524 |
| Pipe-Delimited | 41.1% | 38.1% – 44.2% | 43,098 |
Analysis Insight
TOON demonstrated strong performance when considering the token efficiency trade-off. While accuracy was lower than more verbose formats, the token savings (21,518 vs. 66,396 for JSON) represent a significant cost reduction for applications where token budget is a primary concern.
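If you want to sanity-check numbers like these against your own data, token counts are easy to measure offline. The sketch below uses OpenAI's tiktoken library to count tokens for a JSON serialization and a hand-written TOON-style string of the same records; the records, the TOON string, and the choice of encoding are illustrative assumptions, not the benchmark's actual setup.

```python
# Minimal sketch: compare token counts for two serializations of the same data.
import json

import tiktoken  # OpenAI's tokenizer library

records = [
    {"id": 1, "name": "Alice", "role": "admin"},
    {"id": 2, "name": "Bob", "role": "user"},
    {"id": 3, "name": "Carol", "role": "user"},
]

# Standard JSON serialization.
as_json = json.dumps(records, indent=2)

# Hand-written TOON-style encoding of the same records (approximate syntax).
as_toon = (
    "users[3]{id,name,role}:\n"
    "  1,Alice,admin\n"
    "  2,Bob,user\n"
    "  3,Carol,user\n"
)

# o200k_base is the encoding used by recent GPT-4o/4.1-family models;
# swap in whatever matches your target model.
enc = tiktoken.get_encoding("o200k_base")

for label, text in [("JSON", as_json), ("TOON", as_toon)]:
    print(f"{label}: {len(enc.encode(text))} tokens")
```

On a large uniform table, the ratio between these two counts is what drives the cost savings discussed above.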
Benchmark 2: Nested Data Structure Understanding
A second benchmark evaluated how well GPT-5 nano could understand and retrieve information from nested data structures. This test is particularly relevant for complex data scenarios where hierarchical relationships matter.
| Format | Accuracy | 95% Confidence Interval | Tokens |
|---|---|---|---|
| YAML | 62.1% | 59.1% – 65.1% | 42,477 |
| Markdown | 54.3% | 51.2% – 57.4% | 38,357 |
| JSON | 50.3% | 47.2% – 53.4% | 57,933 |
| XML | 44.4% | 41.3% – 47.5% | 68,804 |
| TOON | 43.1% | 40.0% – 46.2% | 45,436 |
Nested Data Findings
- TOON achieved 43.1% accuracy, lower than YAML (62.1%), Markdown (54.3%), and JSON (50.3%)
- YAML performed best overall with 62.1% accuracy, though it used more tokens than Markdown
- Markdown offered the best token efficiency (38,357 tokens) while maintaining 54.3% accuracy
- For nested structures, TOON's token efficiency advantage was far less pronounced than in the tabular scenario (see the sketch below)
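One plausible reason, illustrated below with approximate syntax: TOON's compact tabular block (header declared once, delimited rows) only applies to uniform arrays. Deeply nested objects fall back to an indented, YAML-like representation, so the per-record key savings largely disappear. The field names here are invented for illustration and are not taken from the benchmark dataset.

```
order:
  id: 9021
  customer:
    name: Alice
    address:
      city: Berlin
      zip: 10115
  items[2]{sku,qty,price}:
    A-100,2,9.99
    B-400,1,24.50
```

Only the `items` array benefits from the tabular compaction; the rest of the structure carries roughly the same overhead as YAML, which is consistent with the near-parity in token counts in the table above.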
Conflicting Benchmark Results: What's Going On?
Interestingly, the official TOON GitHub repository includes data retrieval benchmarks that show TOON performing significantly better than other formats when tested with GPT-5 nano. These results appear to contradict the findings from the independent testing discussed above.
Understanding the Discrepancy
Different Test Scenarios: The official TOON benchmarks may use different evaluation methodologies, test datasets, or specific prompt structures that favor TOON's format characteristics.
Model Variations: Different model versions or configurations can produce varying results with the same format.
Task-Specific Performance: TOON may excel in certain types of data retrieval tasks while performing differently in others.
Validation Note: Independent verification of the official benchmarks has confirmed their reproducibility, suggesting the methodology is sound. The difference likely stems from test design rather than implementation issues.
Key Takeaways and Recommendations
When TOON Makes Sense
- Token budget is a primary constraint
- Working with uniform, tabular data structures
- Cost reduction outweighs accuracy trade-offs
- Large-scale data transmission to LLMs
- RAG pipelines with extensive data
Consider Alternatives When
- Maximum accuracy is required
- Working with deeply nested structures
- Interoperability is critical
- Team familiarity with the format matters
- Complex hierarchical data relationships
The Bottom Line
TOON represents an innovative approach to token-efficient data formatting for LLMs. While it may not always achieve the highest accuracy rates, its token savings can be substantial, potentially reducing input costs by 50-70% compared to JSON for tabular data (in Benchmark 1, TOON used 21,518 tokens versus JSON's 66,396, roughly a 68% reduction). The format shows particular promise for applications where token efficiency is paramount and slight accuracy trade-offs are acceptable.
As with any technology choice, the decision to use TOON should be based on your specific use case, accuracy requirements, and cost constraints. We recommend running your own benchmarks with your data and models to determine if TOON is the right fit for your application.
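If you do run your own evaluation, the harness does not need to be elaborate. The sketch below shows the basic shape: serialize the same records in each candidate format, ask the model a fixed set of retrieval questions, and score exact-match accuracy. Here query_llm() is a hypothetical placeholder for whatever client you use, and the prompt wording and scoring rule are assumptions to adapt to your task.

```python
# Minimal sketch of a format-comparison benchmark harness.
import json

def query_llm(prompt: str) -> str:
    """Hypothetical stand-in for a real model call (OpenAI SDK, local model, etc.)."""
    raise NotImplementedError

def run_benchmark(serialized_data: str, questions: list[dict]) -> float:
    """Ask each question against the serialized data; return exact-match accuracy."""
    correct = 0
    for q in questions:
        prompt = (
            "Answer using only the data below.\n\n"
            f"{serialized_data}\n\n"
            f"Question: {q['question']}\n"
            "Reply with the value only."
        )
        if query_llm(prompt).strip() == q["expected"]:
            correct += 1
    return correct / len(questions)

# Example usage (once query_llm is wired to a real model):
records_json = json.dumps([{"id": 1, "name": "Alice", "role": "admin"}])
questions = [{"question": "What role does Alice have?", "expected": "admin"}]
# accuracy = run_benchmark(records_json, questions)
```

Repeat the same loop for each serialization (JSON, YAML, TOON, and so on) and compare both the accuracy and the token count of the serialized data.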