
Qwen3.5: The World's Most Extra AI Robot (Explained by Someone Who's Never Met an AI)

by Sanjay

🤖 Type: Causal Language Model with Vision Encoder

A robot that reads words like it's gossiping at a coffee shop. ("What's coming next? TELL ME!") Also has eyeballs, so it can judge your memes. You know that friend who always tries to finish your sentences? Same energy, except this one's actually good at it. ☕👀


🎓 Training Stage: Pre-training & Post-training

School happened. Twice. 📚

Round one: the robot binge-read EVERY BOOK ON EARTH for months (pre-training). Came out feral. Unhinged. Capable of solving differential equations and also telling your grandmother her casserole is statistically mediocre.

Round two: charm school (post-training). Humans taught it manners. Still feral, but… polite about it. 🎩🐾


🎭 RLHF / DPO / GRPO: Charm School Has Three Programs

That charm school? Three competing curricula. Every model card namedrops one. 🏫

RLHF (Reinforcement Learning from Human Feedback): Humans grade the robot's homework. A separate model learns from the grades and coaches the main robot. OpenAI's move. Expensive. Works. 🏆

DPO (Direct Preference Optimization): Skip the coach. Feed the grades directly to the robot. Cheaper. Llama and Mistral chose this one. 📖

GRPO (Group Relative Policy Optimization): DeepSeek's invention. The robot writes a whole GROUP of answers to the same question, scores them against each other, and learns from whichever ones beat the group average. No separate coach model, no humans grading each answer. Competing against yourself as a training strategy. 💀
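
For the curious, here's what the middle program boils down to. A minimal sketch of the DPO loss in Python, assuming you've already summed the log-probs of a chosen and a rejected answer under both the model being trained and a frozen reference copy; the beta value is a typical-looking hyperparameter, not Qwen's actual setting:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO loss for a single preference pair: push the model to
    prefer the chosen answer MORE than the reference model does."""
    chosen_margin = logp_chosen - ref_logp_chosen        # extra preference for the winner
    rejected_margin = logp_rejected - ref_logp_rejected  # extra preference for the loser
    logits = beta * (chosen_margin - rejected_margin)
    return -math.log(1.0 / (1.0 + math.exp(-logits)))    # -log(sigmoid(logits))

# Toy numbers: the model already leans slightly toward the chosen answer.
print(dpo_loss(-12.0, -15.0, -13.0, -14.5))  # ~0.62; shrinks as the lean grows
```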


🧠 Number of Parameters: 397B total, 17B activated

397 BILLION brain cells. Only uses 17 billion at a time. 🤯

Owning 397 billion action figures and only playing with 17 billion because you're lazy. That's the vibe. Efficient AND neurotically organized. 🧸


📦 Hidden Dimension: 4096

The robot's thoughts live in an invisible box with 4,096 compartments. More compartments = better at remembering you actually like pineapple on pizza. Won't judge you. Much. 🍕


โœ‚๏ธ Tokenizer (BPE): The Butcher

Before the robot reads anything, your sentence gets HACKED TO PIECES. ๐Ÿช“

"Unbelievable" becomes un + believ + able. "The" stays whole because "the" is basic. Your name? BUTCHERED. The robot has never met you. Nothing personal.

The hatchet has a name: BPE (Byte Pair Encoding). Statisticians built it. Don't ask them about it at parties. ๐Ÿงฑ

Some tokenizers have a byte fallback so if they encounter something truly cursed (weird Unicode, ancient scripts, your cat walking across the keyboard) they just break it into raw bytes instead of giving up. No unknown tokens. Ever. ๐ŸฑโŒจ๏ธ
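
Want to watch the butchering up close? A toy sketch in Python, with a hand-invented merge table (hypothetical; real tokenizers learn hundreds of thousands of merges from mountains of text):

```python
def bpe_tokenize(text, merges):
    """Toy BPE: start from raw UTF-8 bytes, repeatedly apply the
    best-ranked merge. Byte fallback comes for free, because the
    starting alphabet already covers all 256 possible bytes."""
    tokens = [bytes([b]) for b in text.encode("utf-8")]
    while len(tokens) > 1:
        # Rank every adjacent pair; lower rank = learned earlier = merged first.
        ranked = [(merges[pair], i)
                  for i, pair in enumerate(zip(tokens, tokens[1:]))
                  if pair in merges]
        if not ranked:
            break  # nothing left to merge
        _, i = min(ranked)
        tokens = tokens[:i] + [tokens[i] + tokens[i + 1]] + tokens[i + 2:]
    return tokens

# Four made-up merges, ranked in the order they were "learned".
merges = {(b"u", b"n"): 0, (b"l", b"e"): 1, (b"a", b"b"): 2, (b"ab", b"le"): 3}
print(bpe_tokenize("unable", merges))  # [b'un', b'able']
```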


📖 Token Embedding: 248,320 (Padded)

Vocabulary? THICC. 248,320 tokens (word-pieces, courtesy of the butcher above). More entries than most people's lifetime vocabulary. Shakespeare managed roughly 31,000 unique words across his entire body of work. Qwen has 8x that as a starting vocabulary. Peasant energy confirmed. 👑💀


๐Ÿ Number of Layers: 60

Sixty. Layers. ๐Ÿ˜ค

Like a lasagna, but instead of delicious pasta, it's PURE THINKING POWER. Each layer is a different expert at different jobs. One's good at poetry. One's good at math. A couple are just here vibing and making mistakes like all of us. Fifteen more are quietly preventing a catastrophic meltdown. ๐Ÿง‘โ€๐Ÿณ๐Ÿ


⚡ SwiGLU: The Bouncer Got Replaced

Inside each layer there's a bouncer deciding what information gets through. 🚪

Old bouncer (ReLU (Rectified Linear Unit)): "You positive? Enter. You negative? Die." Brutal. Simple. Kinda dumb.

New bouncer (SwiGLU (Swish-Gated Linear Unit)): Actually reads the guest list. Lets some negative signals through if they're cool. Every major model fired ReLU and hired SwiGLU. ReLU is updating its LinkedIn. 💅
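
The new bouncer in about ten lines. A minimal sketch with toy dimensions and random weights (the real block sits inside every one of those 60 layers, with much bigger matrices):

```python
import numpy as np

def swiglu_ffn(x, W_gate, W_up, W_down):
    """SwiGLU feed-forward block: swish(x @ W_gate) gates x @ W_up.
    Unlike ReLU, swish doesn't kill negatives outright; small
    negative signals get through at reduced strength."""
    gate = x @ W_gate
    swish = gate / (1.0 + np.exp(-gate))   # swish(z) = z * sigmoid(z)
    return (swish * (x @ W_up)) @ W_down   # elementwise gate, then project back

# Toy sizes: hidden dim 8, FFN dim 16 (a real model would be 4096 and up).
rng = np.random.default_rng(0)
x = rng.normal(size=(1, 8))
out = swiglu_ffn(x, rng.normal(size=(8, 16)),
                 rng.normal(size=(8, 16)), rng.normal(size=(16, 8)))
print(out.shape)  # (1, 8) -- same shape in, same shape out
```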


🔊 RMSNorm: The Kindergarten Teacher

After 60 layers, numbers go absolutely unhinged. Some blow up to millions. Others shrink to basically nothing. 📉

RMSNorm (Root Mean Square Normalization) is the adult in the room. Sits between every layer going "calm down, calm down, CALM DOWN." Like a kindergarten teacher but for math. 👩‍🏫

Older version (LayerNorm (Layer Normalization)) did the same thing with twice the effort. RMSNorm said "I can do this in my sleep" and honestly? It can. 😴
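
The whole teacher fits in a tweet. A sketch of the standard formulation (divide by the root-mean-square, multiply by a learned gain; the inputs here are made up):

```python
import numpy as np

def rms_norm(x, gain, eps=1e-6):
    """RMSNorm: rescale each vector by its root-mean-square.
    No mean subtraction, no bias -- that's the extra work LayerNorm
    does and RMSNorm skips."""
    rms = np.sqrt(np.mean(x ** 2, axis=-1, keepdims=True) + eps)
    return (x / rms) * gain

x = np.array([[1000.0, -2000.0, 0.003]])   # numbers gone unhinged
print(rms_norm(x, gain=np.ones(3)))        # back to a civilized scale
```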


👀 Gated DeltaNet

64 heads for V and 16 for QK

64 eyeballs watching one thing, 16 eyeballs watching another. Spider-level surveillance. Creepy AND effective. 🕷️👁️

Head Dimension: 128

Each eyeball sees in 128 colors. Humans get three. Three! We are COOKED. 🎨🔥


🔬 Gated Attention

32 heads for Q and 2 for KV

Different eyeballs, different vibes. Some are going "LOOK AT EVERYTHING 👀" while two in the back are like "nah, just watching these specific things" 😎

Head Dimension: 256

The fancy eyeballs. 256 colors. The 128-color eyeballs from the DeltaNet section? Jealous. 💅✨

RoPE (Rotary Position Embedding): 64 dimensions

How the robot avoids losing its place mid-sentence. Every token's features get rotated by an angle that depends on where it sits, so word #7 and word #8 literally point in slightly different directions. You know that feeling when you're reading and suddenly realize you have NO IDEA what page you're on? The robot never has that problem. 🧘📖
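
The rotation itself, sketched. This assumes the standard paired-feature formulation; the 64 matches the rotary dimension above, but the input vector is made up:

```python
import numpy as np

def rope(x, position, base=10000.0):
    """Rotary Position Embedding: rotate each (even, odd) feature
    pair by an angle that grows with the token's position. Relative
    position then falls straight out of query-key dot products."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)      # one frequency per pair
    angles = position * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x_even, x_odd = x[..., 0::2], x[..., 1::2]
    out = np.empty_like(x)
    out[..., 0::2] = x_even * cos - x_odd * sin    # plain 2-D rotation
    out[..., 1::2] = x_even * sin + x_odd * cos
    return out

q = np.ones(64)                 # one 64-dim rotary slice of a query head
print(rope(q, position=0)[:4])  # position 0: no rotation at all
print(rope(q, position=7)[:4])  # position 7: same vector, new angles
```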


๐Ÿค GQA (Grouped Query Attention)

Normal attention: every Q (Query) head gets its own K (Key) and V (Value) heads. Like giving every employee their own personal assistant. Payroll? ASTRONOMICAL. ๐Ÿ’ธ

GQA: 32 employees share two assistants. The assistants are OVERWORKED but payroll went way down.

You literally already saw this. Go back to the Gated Attention section. 32 Q heads, 2 KV heads. That's GQA. You just didn't know its name. Now you do. You're welcome. ๐Ÿชช
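
The payroll trick, sketched with the 32-Q / 2-KV split from above (random toy tensors, not real model weights):

```python
import numpy as np

def gqa_scores(q, k):
    """Grouped Query Attention scores: each KV head serves a whole
    group of query heads, so here 2 assistants cover 32 employees."""
    n_q, n_kv = q.shape[0], k.shape[0]         # 32 and 2
    group = n_q // n_kv                        # 16 Q heads per KV head
    k_shared = np.repeat(k, group, axis=0)     # hand each group its assistant
    d = q.shape[-1]
    return np.einsum("hqd,hkd->hqk", q, k_shared) / np.sqrt(d)

rng = np.random.default_rng(0)
q = rng.normal(size=(32, 10, 256))   # 32 query heads, 10 tokens, head dim 256
k = rng.normal(size=(2, 10, 256))    # only 2 key heads live in the cache
print(gqa_scores(q, k).shape)        # (32, 10, 10) -- full scores, tiny cache
```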


🪟 Sliding Window Attention

Full attention: every word stares at every other word. Cost? Quadratic. Your wallet? In danger. 😬💰

Sliding window: each word only looks at its neighbors. Reading through a porthole instead of panoramic windows. Cheap. Fast. Slightly blind. 🚢

Distant words still hear about each other though. Information hops from window to window like drama spreading through a friend group. By the last layer, EVERYONE knows.

Mistral does this. Gemma does this. Qwen alternates cheap layers with expensive ones for the same effect. Different robots, same gossip network. 🔄
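
The porthole is just a mask. A minimal sketch; the window size here is picked for readability, not Qwen's actual setting:

```python
import numpy as np

def sliding_window_mask(n_tokens, window):
    """Causal sliding-window mask: token i may look at tokens
    i-window+1 .. i. True = allowed to peek."""
    i = np.arange(n_tokens)[:, None]
    j = np.arange(n_tokens)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_mask(6, window=3).astype(int))
# Each row is one token peeking through its 3-token porthole:
# [[1 0 0 0 0 0]
#  [1 1 0 0 0 0]
#  [1 1 1 0 0 0]
#  [0 1 1 1 0 0]
#  [0 0 1 1 1 0]
#  [0 0 0 1 1 1]]
```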


🧠💾 KV Cache: The Browser Tabs Problem

The robot generates text one word at a time. Without a cache it would redo ALL the math for EVERY PREVIOUS WORD each time. Re-reading War and Peace from page 1 every time you turn a page. 📖😩

The KV (Key-Value) Cache is the robot's browser tabs. Keeps old results open so it doesn't recompute. Problem? Tabs pile up. Long conversations = your GPU's RAM (Random Access Memory) is CRYING. 😭

GQA keeps the cache small (fewer tabs). Quantization keeps each tab skinny. Everything in this post is secretly helping the cache. It's a whole conspiracy. 🕵️
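
The tabs, sketched. A bare-bones single-head cache that ignores batching, masking, and every other production headache:

```python
import numpy as np

class KVCache:
    """Minimal KV cache: keep every past token's K and V around, so
    generating the next token means attention math for ONE new token
    instead of redoing the whole history."""
    def __init__(self):
        self.keys, self.values = [], []

    def append(self, k, v):           # one new token's projections
        self.keys.append(k)
        self.values.append(v)

    def attend(self, q):
        K = np.stack(self.keys)       # (seq_len, d) -- the open tabs
        V = np.stack(self.values)
        scores = K @ q / np.sqrt(q.shape[-1])
        weights = np.exp(scores - scores.max())
        weights /= weights.sum()      # softmax over the whole past
        return weights @ V

cache = KVCache()
rng = np.random.default_rng(0)
for _ in range(5):                    # pretend we've generated 5 tokens
    cache.append(rng.normal(size=64), rng.normal(size=64))
print(cache.attend(rng.normal(size=64)).shape)  # (64,)
```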


🧪 Mixture Of Experts

512 Experts

512 tiny brains live inside the robot. Each one an expert at something different: 🧠

  • 🎭 Brain #47: "I'm REALLY good at poetry"
  • 🧮 Brain #127: "I too have chosen violence… at mathematics"
  • 🌊 Brain #499: "I'm just vibes, honestly"
  • 🔧 Brain #203: quietly fixing everyone else's mistakes

10 Routed + 1 Shared

For every single token, the robot assembles a squad. Picks 10 expert brains for the job, PLUS one loyal brain that shows up regardless. Your friend who's there at 3 AM, no questions asked. 💪🫂

Expert Intermediate: 1024

Each tiny brain has 1,024 thinking spots. They're probably thinking about pizza. 🍕🤔
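
Squad assembly, sketched. The 512 / top-10 / plus-1-shared numbers match the section above; everything else (tiny linear "experts", softmax over the winners) is a plausible simplification, since exact routing details vary model to model:

```python
import numpy as np

def moe_forward(x, router_w, experts, shared_expert, top_k=10):
    """Mixture-of-Experts routing: score all experts, keep the top
    10, blend their outputs, and ALWAYS add the one shared expert."""
    logits = x @ router_w                    # one score per expert
    top = np.argsort(logits)[-top_k:]        # this token's squad
    gates = np.exp(logits[top] - logits[top].max())
    gates /= gates.sum()                     # softmax over the winners only
    routed = sum(g * experts[i](x) for g, i in zip(gates, top))
    return routed + shared_expert(x)         # loyal 3 AM friend, always on

# Toy setup: 512 tiny linear "experts" over an 8-dim hidden state.
rng = np.random.default_rng(0)
experts = [lambda x, W=rng.normal(size=(8, 8)): x @ W for _ in range(512)]
shared = lambda x, W=rng.normal(size=(8, 8)): x @ W
router_w = rng.normal(size=(8, 512))
print(moe_forward(rng.normal(size=8), router_w, experts, shared).shape)  # (8,)
```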


🎰 LM Output: 248,320 (Padded)

After all that chaos, the robot picks its best guess from 248,320 possible tokens. Wheel of Fortune energy, except this contestant wins every round. 🎡🏆


🔮 MTP (Multi-Token Prediction)

The robot doesn't just guess ONE word ahead. It practices guessing THREE WORDS ahead like some kind of psychic. Nostradamus, but for autocomplete. And unlike Nostradamus, verifiably accurate. 🔮✨


๐ŸŽ๏ธ Speculative Decoding: The Intern System

Normal generation: big robot writes one word. Thinks. Writes another word. Thinks. One. At. A. Time. ๐ŸŒ

Speculative decoding: hire a tiny fast intern to guess the next five words. Big robot checks them all at once. Intern right? You just 5x'd your speed. Intern wrong? Big robot fixes it, no harm done. ๐Ÿƒโ€โ™‚๏ธ๐Ÿ’จ

Qwen trained with MTP (Multi-Token Prediction) so its intern literally studied for this job before showing up. Employee of the month. ๐Ÿ“‹
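
The intern system, sketched with greedy decoding and two hypothetical toy "models". The real thing verifies all the guesses in one batched forward pass (that's the whole speedup) and handles sampling, not just greedy:

```python
def speculative_step(prompt, draft_model, big_model, k=5):
    """One speculative round: the intern (draft) guesses k tokens,
    the big model checks them in order and keeps the longest
    agreeing prefix, plus its own fix-up token on a mismatch."""
    guesses, ctx = [], list(prompt)
    for _ in range(k):                   # intern sprints ahead
        nxt = draft_model(ctx)
        guesses.append(nxt)
        ctx.append(nxt)
    accepted, ctx = [], list(prompt)
    for g in guesses:                    # big model verifies
        verified = big_model(ctx)        # (in practice: one batched pass)
        if verified != g:
            accepted.append(verified)    # fix the intern's mistake...
            break                        # ...and toss the rest
        accepted.append(g)
        ctx.append(g)
    return accepted                      # up to k tokens per big-model round

# Toy models: both continue the alphabet, but the intern fumbles after 'c'.
big = lambda ctx: chr(ord(ctx[-1]) + 1)
draft = lambda ctx: "x" if ctx[-1] == "c" else chr(ord(ctx[-1]) + 1)
print(speculative_step(list("a"), draft, big))  # ['b', 'c', 'd'] -- caught it
```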


🧠💭 Chain-of-Thought: The Robot Learns to Pause

Regular robots: you ask, they answer. Fast. Sometimes confidently, catastrophically wrong. 💬

Reasoning robots: you ask, they THINK. Out loud. For a while. The robot argues with itself like you do at 2 AM about whether you locked the door. 🤔

The WILD part? DeepSeek-R1 figured this out BY ITSELF. Nobody programmed "think step by step." The robot just started doing it because it got better scores. Taught itself to stop and reflect, which is more self-awareness than most people manage before their second coffee. ☕🫠


๐Ÿ“ Context Length: 262,144 natively, extensible up to 1,010,000

262,144 words at once. In one sitting. ๐Ÿ˜ณ

For scale:

  • ๐Ÿ“• The entire Lord of the Rings trilogy, appendices included
  • ๐Ÿ“ฐ Every article you've ever rage-clicked and didn't finish
  • ๐Ÿฟ Your entire Netflix watch history described in excruciating detail
  • ๐Ÿ“– War and Peace (yes, the long version)
  • ๐Ÿ’ฌ Your group chat from 2015 that you're still embarrassed about

Not enough? STRETCH IT TO A MILLION WORDS. Literary glutton. Will not be stopped. ๐Ÿ“š๐Ÿคค


📦 Distillation: Copying the Smart Kid's Homework

Big robot: smart, expensive, eats GPUs (Graphics Processing Units) for breakfast. 🖥️🔥

Small robot: needs to be smart too but on a budget. 🖥️😅

Distillation: the small robot (student) copies the big robot's (teacher) answers instead of learning from scratch. The student often ends up punching above its weight because it learned from curated answers instead of the entire chaotic internet. 🎓

DeepSeek, Llama, OpenAI. Everyone's copying everyone. Homework plagiarism, but make it machine learning. 📝
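
The copying mechanics, sketched as classic soft-label distillation: the student matches the teacher's whole probability distribution with a KL loss at temperature T. That's the textbook recipe; labs also distill by simply fine-tuning on teacher-written answers:

```python
import numpy as np

def distill_loss(student_logits, teacher_logits, T=2.0):
    """Soft-label distillation: KL divergence between the teacher's
    and student's next-token distributions. T > 1 softens both, so
    the student also learns which WRONG answers are almost right."""
    def softmax(z):
        e = np.exp(z - z.max())
        return e / e.sum()
    p_teacher = softmax(teacher_logits / T)
    p_student = softmax(student_logits / T)
    return float(np.sum(p_teacher * np.log(p_teacher / p_student)))

teacher = np.array([4.0, 1.0, 0.5])    # teacher: confident about token 0
student = np.array([2.0, 1.5, 1.0])    # student: vaguely copying so far
print(distill_loss(student, teacher))  # shrinks as the copying improves
```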


๐Ÿ—œ๏ธ Quantization: The Diet Plan

Full-precision 397B-parameter model = ~1.5 TERABYTES. Not fitting on your MacBook. Or your MacBook Pro. Or four of them taped together. ๐Ÿ—๏ธ

Quantization = shrink the numbers. FP32 (32-bit Floating Point) โ†’ FP16 โ†’ FP8 โ†’ INT4 (4-bit Integer). Each step halves the size. Like compressing a WAV to MP3 to... a really small MP3. ๐Ÿ“‰

Quality barely drops. Speed goes up. The robot lost the weight and kept the personality. Every fitness influencer's dream. โš–๏ธ
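
The diet, sketched as the simplest possible symmetric INT4 scheme: one float scale per group of weights, tiny integers for everything else. Real quantizers (GPTQ, AWQ, and friends) work much harder to keep quality:

```python
import numpy as np

def quantize_int4(weights):
    """Symmetric 4-bit quantization of one weight group: integers in
    [-8, 7] plus ONE float scale, instead of a float per weight.
    (Stored in int8 here; real kernels pack two 4-bit values per byte.)"""
    scale = np.abs(weights).max() / 7.0         # map the biggest weight to 7
    q = np.clip(np.round(weights / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q, scale):
    return q.astype(np.float32) * scale         # what gets used at runtime

w = np.random.default_rng(0).normal(size=8).astype(np.float32)
q, s = quantize_int4(w)
print(w)                 # the original floats
print(dequantize(q, s))  # close enough -- at 8x fewer bits than FP32
```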


🎬 TL;DR

🤖 = weird pasta-brained robot with 60 layers, 512 tiny expert brains, 397 billion parameters (only uses 17 billion to flex), reads a million words without forgetting the beginning, went to three different charm schools, has an intern for speed, copies smart kids' homework to make budget versions of itself, went on a diet, and is PROBABLY judging your grammar right now.

Be nice to it. It's got a LOT going on. 💀