Token Efficiency of Programming Languages for LLM Code Generation

Source: study by Martin Alderson (RosettaCode, GPT-4 tokenizer, 19 languages)


Rankings

#    Language     Avg Tokens   Category
1    J            ~70          Array language, pure ASCII
2    Clojure      109          Functional
3    APL          110          Array language
4    Haskell      115          Functional
5    F#           118          Functional
6    Python       130          Dynamic
7    Ruby         ~135         Dynamic
8    JavaScript   148          Dynamic
9    Go           ~160         Statically typed
10   C#           ~170         Statically typed
11   C            182          Procedural

Key fact: a 2.6× gap between the most efficient language (J) and the least efficient (C).
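The 2.6× figure follows directly from the averages in the rankings table (182 / 70). A minimal sketch that reproduces it and expresses every listed language as a multiple of J; the dictionary simply copies the table values, including the approximate ones, and the function name `relative_cost` is chosen here for illustration:

```python
# Average tokens per task, copied from the rankings table.
# Entries shown with "~" in the table (J, Ruby, Go, C#) are approximate.
AVG_TOKENS = {
    "J": 70, "Clojure": 109, "APL": 110, "Haskell": 115, "F#": 118,
    "Python": 130, "Ruby": 135, "JavaScript": 148, "Go": 160,
    "C#": 170, "C": 182,
}


def relative_cost(baseline: str = "J") -> dict[str, float]:
    """Return each language's average token count as a multiple of the baseline."""
    base = AVG_TOKENS[baseline]
    return {lang: round(tokens / base, 2) for lang, tokens in AVG_TOKENS.items()}


if __name__ == "__main__":
    for lang, ratio in relative_cost().items():
        print(f"{lang:<12}{ratio:.2f}x")
```

Changing the baseline (for example `relative_cost("Python")`) shows the same data from the point of view of a mainstream language rather than the most compact one.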


Key Findings

Dynamic languages win on token count

No type declarations means fewer tokens. JavaScript is the notable outlier: at 148 tokens on average, it is the most verbose dynamic language in the set.

Functional languages punch above their weight

Haskell (115) and F# (118) compete with the dynamic languages, and in this ranking edge out Python (130), despite being statically typed. The reason is strong type inference, which removes the need for most explicit type annotations.

APL/J paradox

APL's famous terseness works against it with LLM tokenizers: its Unicode glyphs (⍳, ⍴, ⌽) tokenize poorly, each becoming multiple tokens. J keeps the array-language style but uses pure ASCII, and leads the ranking at roughly 70 tokens on average.
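One way to see why the glyphs are expensive, without running a tokenizer: byte-level BPE tokenizers start from UTF-8 bytes, and a character whose bytes were rarely merged during training can cost up to one token per byte. The sketch below only measures UTF-8 byte widths, which is a proxy and not an actual token count; the J spellings in the comment are rough counterparts given for illustration.

```python
def utf8_width(text: str) -> dict[str, int]:
    """Map each distinct non-space character to its UTF-8 length in bytes."""
    return {ch: len(ch.encode("utf-8")) for ch in text if not ch.isspace()}


def total_bytes(text: str) -> int:
    """Total UTF-8 size of the text, an upper bound on byte-level token pieces."""
    return len(text.encode("utf-8"))


# The three APL glyphs named in the text, and rough J counterparts
# (i. for iota, $ for shape, |. for reverse), which are plain ASCII.
apl_glyphs = "⍳⍴⌽"
j_primitives = "i.$|."

if __name__ == "__main__":
    print(utf8_width(apl_glyphs))      # each glyph occupies 3 bytes
    print(total_bytes(apl_glyphs))     # 9 bytes for 3 characters
    print(total_bytes(j_primitives))   # 5 bytes for 5 characters
```

To measure real token counts, the same strings would have to be passed through the tokenizer used in the study; that step is deliberately left out here.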

Typed languages still win for LLM development

  • Hallucinated methods are caught at compile time
  • LSP integration works better
  • Rapid feedback loop

Using typed languages for LLMs has an awful lot of benefits — not least because it can compile and get rapid feedback on any syntax errors or method hallucinations.

Frameworks matter more than languages

Follow-up research found that web framework choice has a larger token impact than language selection.


Why It Matters

As LLMs become primary coding assistants, the context window is a hard limit. Every token spent on boilerplate shrinks the space left for:

  • Business logic
  • Tests
  • Documentation
  • Code review

Token footprint directly drives productivity and API cost (OpenAI/Anthropic per-token billing).
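To make the budget and billing effect concrete, here is a back-of-the-envelope sketch. The per-task averages for J and C come from the rankings; the 8,000-token budget and the price of 10 USD per million tokens are hypothetical placeholders, not real provider figures, and both function names are invented for this example.

```python
def snippets_in_budget(avg_tokens: int, context_budget: int) -> int:
    """How many average-sized snippets fit into a fixed token budget."""
    return context_budget // avg_tokens


def generation_cost(avg_tokens: int, snippets: int, usd_per_million: float) -> float:
    """Cost in USD of generating `snippets` average-sized snippets."""
    return avg_tokens * snippets * usd_per_million / 1_000_000


# Hypothetical assumptions, chosen only to illustrate the arithmetic.
CONTEXT_BUDGET = 8_000      # tokens reserved for code in the context window
USD_PER_MILLION = 10.0      # placeholder price, not a real provider rate

if __name__ == "__main__":
    for lang, avg in {"J": 70, "C": 182}.items():
        fit = snippets_in_budget(avg, CONTEXT_BUDGET)
        cost = generation_cost(avg, 1_000, USD_PER_MILLION)
        print(f"{lang}: {fit} snippets fit, 1000 snippets cost ${cost:.2f}")
```

Whatever price and budget are plugged in, the ratio between two languages stays the same as the ratio of their average token counts.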


Practical Takeaways

Context                                  Recommended
Long AI-assisted sessions                F#, Haskell, Ruby, Clojure
Ecosystem + AI tooling                   Python (reasonable middle ground)
Maximum efficiency                       J (but niche, steep learning curve)
Performance-critical modules             C / Rust (keep separate from LLM context)
Avoid for token-constrained contexts     C, Java, plain Go unless necessary

Hybrid approach: write the orchestration layer in a token-efficient language (Clojure/F#) and keep the performance-critical modules in C/Rust.


Sources