Token Efficiency of Programming Languages for LLM Code Generation

Source: research by Martin Alderson (RosettaCode, GPT-4 tokenizer, 19 languages)

Rankings

#	Language	Avg Tokens	Category
1	J	~70	Array language, pure ASCII
2	Clojure	109	Functional
3	APL	110	Array language
4	Haskell	115	Functional
5	F#	118	Functional
6	Python	130	Dynamic
7	Ruby	~135	Dynamic
8	JavaScript	148	Dynamic
9	Go	~160	Statically typed
10	C#	~170	Statically typed
11	C	182	Procedural

Key fact: there is a 2.6× gap between the most efficient language (J) and the least efficient (C).
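The gap can be checked directly from the table. The snippet below is a minimal sketch that copies the average token counts listed above (entries marked "~" are approximate in the source) and divides each by the lowest value; the names `avg_tokens` and `ratio_to_best` are illustrative only.

```python
# Average token counts copied from the rankings table above.
# Values marked "~" in the table are approximate.
avg_tokens = {
    "J": 70, "Clojure": 109, "APL": 110, "Haskell": 115, "F#": 118,
    "Python": 130, "Ruby": 135, "JavaScript": 148, "Go": 160,
    "C#": 170, "C": 182,
}


def ratio_to_best(table):
    """Return each language's token count relative to the smallest one."""
    best = min(table.values())
    return {lang: round(count / best, 2) for lang, count in table.items()}


ratios = ratio_to_best(avg_tokens)
for lang, ratio in ratios.items():
    print(f"{lang:<12}{ratio:.2f}x")
```

With these figures, C comes out at 182 / 70 = 2.6 times the token count of J, which matches the stated gap.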

Key Findings

Dynamic languages win on token count

Without type declarations, the same program needs fewer tokens. JavaScript is the notable outlier: at 148 tokens on average, it is the most verbose dynamic language in the set.
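To make the effect of type declarations concrete, the sketch below compares one small function written with and without annotations. It is a rough illustration under a loud assumption: `rough_token_count` is a hypothetical regex-based lexical counter, not the GPT-4 tokenizer used in the study, so only the direction of the difference is meaningful, not the absolute numbers.

```python
import re


def rough_token_count(source: str) -> int:
    """Crude lexical proxy: count words/numbers and single punctuation marks.

    This is NOT a BPE tokenizer; it only approximates how much there is to encode.
    """
    return len(re.findall(r"\w+|[^\w\s]", source))


untyped = "def add(a, b):\n    return a + b\n"
typed = "def add(a: int, b: int) -> int:\n    return a + b\n"

untyped_count = rough_token_count(untyped)  # 12 lexical pieces
typed_count = rough_token_count(typed)      # 19 lexical pieces
print(untyped_count, typed_count)
```

The annotated version carries the same logic but more symbols, which is the pattern the study reports at the language level.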

Functional languages punch above their weight

Haskell (115) and F# (118) come in below Python (130) despite being statically typed. The likely reason is type inference, which removes most of the need for explicit type annotations.

APL/J paradox

APL's famous terseness does not translate into a low token count: its Unicode glyphs (⍳, ⍴, ⌽) tokenize poorly, each becoming multiple tokens. J uses plain ASCII and leads the ranking at roughly 70 tokens on average.
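One way to see why the glyphs are expensive is to look at their UTF-8 encoding. The sketch below assumes a byte-level BPE tokenizer (the family the GPT-4 tokenizer belongs to), where a character without a learned merge can split into as many tokens as it has bytes. It measures bytes, not actual tokens, and the J spellings listed are rough counterparts chosen for illustration.

```python
def utf8_len(text: str) -> int:
    """Number of bytes the text occupies in UTF-8."""
    return len(text.encode("utf-8"))


# APL glyphs named in the text, with rough J counterparts (illustrative pairing).
apl_glyphs = ["⍳", "⍴", "⌽"]
j_spellings = ["i.", "$", "|."]

glyph_bytes = {glyph: utf8_len(glyph) for glyph in apl_glyphs}
ascii_bytes = {spelling: utf8_len(spelling) for spelling in j_spellings}

for glyph, spelling in zip(apl_glyphs, j_spellings):
    print(f"{glyph}: {glyph_bytes[glyph]} bytes   {spelling}: {ascii_bytes[spelling]} bytes")
```

Each APL glyph occupies three bytes, while the ASCII spellings occupy one or two, and ASCII punctuation is far more likely to have dedicated entries in a vocabulary trained mostly on conventional source code.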

Typed languages still win for LLM development

Compile-time catch of hallucinations
LSP integration works better
Rapid feedback loop

Using typed languages for LLMs has an awful lot of benefits — not least because it can compile and get rapid feedback on any syntax errors or method hallucinations.

Frameworks matter more than languages

Follow-up research found that web framework choice has a larger token impact than language selection.

Why It Matters

As LLMs become primary coding assistants, the context window is a hard limit. Every token spent on boilerplate shrinks space for:

Business logic
Tests
Documentation
Code review

Token footprint directly drives productivity and API cost (OpenAI/Anthropic per-token billing).
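The budget argument can be sketched with simple arithmetic. The example below uses the average token counts for J and C from the rankings table; the context size (`CONTEXT_TOKENS`) and the price (`PRICE_PER_MILLION_USD`) are hypothetical placeholders, not any provider's real limits or rates.

```python
# Hypothetical figures for illustration only; substitute real limits and prices.
CONTEXT_TOKENS = 8_000          # assumed context window size
PRICE_PER_MILLION_USD = 10.0    # assumed price per one million tokens


def solutions_per_context(tokens_per_solution: int, context_tokens: int) -> int:
    """How many average-sized solutions fit into one context window."""
    return context_tokens // tokens_per_solution


def cost_usd(tokens_per_solution: int, solutions: int, price_per_million: float) -> float:
    """Cost of processing the given number of solutions at a per-token price."""
    return tokens_per_solution * solutions * price_per_million / 1_000_000


for lang, avg in {"J": 70, "C": 182}.items():
    fit = solutions_per_context(avg, CONTEXT_TOKENS)
    cost = cost_usd(avg, 1_000, PRICE_PER_MILLION_USD)
    print(f"{lang}: {fit} solutions per window, ${cost:.2f} per 1,000 solutions")
```

Under these assumed numbers, 114 J-sized solutions fit where only 43 C-sized ones do, and the cost scales by the same 2.6× factor.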

Practical Takeaways

Context	Recommended
Long AI-assisted sessions	F#, Haskell, Ruby, Clojure
Ecosystem + AI tooling	Python (reasonable middle ground)
Maximum efficiency	J (but niche, steep learning curve)
Performance-critical modules	C / Rust (keep separate from LLM context)
Avoid for token-constrained contexts	C, Java, plain Go unless necessary

Hybrid approach: token-efficient orchestration layer (Clojure/F#) + performance modules (C/Rust).