Sergey Nikolenko

Selfloss

I bought this game simply as an indie title in a style I like; the name Goodwin Games meant nothing to me. But then I started noticing that although the landscapes and many of the names are quite Scandinavian, and there are plenty of birches there too, you are still rafting down the Lena River, female spirits fly around in kokoshniks, the evil sorcerer later turns out to be Koschei, and one of the chapters features a literal Baba Yaga and her hut on chicken legs. And indeed, this game was made by three people from St. Petersburg, two men and a woman, and at the end of the credits there was even a touching thank-you to friends and colleagues at ITMO University.

Selfloss is very beautifully made: atmospheric, varied, emotional. The game has five chapters, and although they are all about platforming and sailing a little boat, they are genuinely different, and every setting is good. The gameplay is rather monotonous (the puzzles are straightforward, and the fights, bosses aside, are uninteresting), but it did not have time to bore me in seven hours. The game has rich lore and an intriguing world… but it is somehow not well connected to the story, and one gets the impression that the lore was added separately from the game. I hope there will be other games set in this world (as was the case with SteamWorld).

The story itself also raises some questions. It would seem to be everything we love: the old healer Kazimir is preparing to die, and on his way to the other world he recalls his hard (no joke) life and helps various creatures say farewell to the world by performing that very selfloss ritual. But at the end everything suddenly flips; on the one hand, a final twist is a fine tradition, but on the other, here it is not set up in any way and leaves only disappointment.

Otherwise, though, the story is powerful, and overall I liked the game a lot. As a near-debut project by a tiny indie studio it is an outright masterpiece, but even with no allowances made it was a pleasure to play through in a couple of evenings. Recommended.

Sergey Nikolenko

P.S. You can comment on and discuss this post in the “Sineкура” channel: come join us!

August 15, 2025
A Mixed Blessing I: Mixtures of Experts from Committee Machines to LLMs
The world’s most powerful AI models are mostly asleep, just like our brains. Giant trillion-parameter models activate only a tiny fraction of their parameters for any given input, using mixtures of experts (MoE) to route the activations. MoE models began as a clever trick to cheat the scaling laws; by now, they are rapidly turning into the organizing principle of frontier AI models. In this post, we will discuss MoEs in depth, starting from the committee machines of the late 1980s and ending with the latest MoE-based frontier LLMs. We will mostly discuss LLMs and the underlying principles of MoE architectures, leaving vision-based and multimodal mixtures of experts for the second part.

Introduction

In December 2024, a relatively unknown Chinese AI lab called DeepSeek released a new model. As they say in Silicon Valley, it was a good model, sir. Their DeepSeek-V3 (DeepSeek AI, 2024), built with an architecture that has 671 billion parameters in total but activates only a fraction on each input, matched or exceeded the performance of models costing orders of magnitude more to train. One of their secrets was an architectural pattern that traces its roots back to the 1980s but has only recently become the organizing principle of frontier AI: Mixtures of Experts (MoE).

If you have been following AI developments, you may have noticed a curious pattern. Headlines about frontier models trumpet ever-larger parameter counts: GPT-4’s rumored 1.8 trillion, Llama 4 Behemoth’s 2 trillion parameters. But these models are still quite responsive, and can be run efficiently enough to serve millions of users. What makes it possible is that frontier models are increasingly sparse. They are not monolithic neural networks where every parameter activates on every input. Instead, they are built on mixture of experts (MoE) architectures that include routing systems that dynamically assemble specialized components, activating perhaps only 10-20% of their total capacity for any given task.

This shift from dense to sparse architectures may represent more than just an engineering optimization. It may be a fundamental rethinking of how intelligence should be organized: after all, biological neural networks also work in a sparse way, and not every neuron in your brain fires even for a complex reasoning task. By developing MoE models with sparse activation, we can afford models with trillions of parameters that run on reasonable hardware, achieve better performance per compute dollar, and potentially offer more interpretable and modular AI systems.

Yet the journey of MoEs has been anything but smooth. For decades, these ideas had been considered too complex and computationally intensive for practical use. It took many hardware advances and algorithmic innovations to resurrect them, but today, MoE architectures power most frontier large language models (LLMs). In this deep dive, we will trace this remarkable evolution of mixture of experts architectures from their origins in 1980s ensemble methods to their current status as the backbone of frontier AI.

Before we begin, let me recommend a few surveys that give a lot more references and background than I could here: Cai et al. (2024), Masoudnia, Ebrahimpour (2024), and Mu, Lin (2025). To give you an idea of the scale of these surveys, here is a general timeline of recent MoE models by Cai et al. (2024) that their survey broadly follows:

Naturally, I won’t be able to discuss all of this in detail, but I do plan two posts on the subject for you, and both will be quite long.

In this first part, we discuss MoE architectures from their very beginnings in the 1980s to the latest LLMs, leaving the image processing and multimodal stuff for the second part. Here is my own timeline of the models and ideas that we touch upon in detail in this first part of the post; hopefully it will also serve as a plan for what follows:

Let’s begin!

MoE Origins: From Ensemble Intuition to Hierarchical Experts

Despite their recent resurgence, mixtures of experts have surprisingly deep historical roots. To fully understand their impact and evolution, let us begin by tracing the origins of these ideas back to a much earlier and simpler era, long before GPUs and Transformer architectures started dominating AI research.

Committee Machines (1980s and 1990s)

The first appearance of models very similar to modern MoE approaches was, unsurprisingly, in ensemble methods, specifically in a class of models known as committee machines, emerging in the late 1980s and maturing through the early 1990s (Schwarze, Hertz, 1992; 1993).

These models combined multiple simple networks or algorithms—each independently rather weak—into a collectively strong predictor. Indeed, the boundary between committee machines and multilayered neural networks remained fuzzy; both combined multiple processing units, but committee machines emphasized the ensemble aspect while neural networks focused on hierarchical feature learning. For instance, here is an illustration by Schwarze and Hertz (1993) that looks just like a two-layered perceptron:

Early theoretical explorations into committee machines applied insights from statistical physics (Mato, Parga, 1992), analyzing how multiple hidden units, called “committee members”, could collaboratively provide robust predictions.

In their seminal 1993 work, Michael Perrone and Leon Cooper (the neurobiologist who was the C in BCM theory of learning in the visual cortex) demonstrated that a simple average of multiple neural network predictions could significantly outperform any single member. They found that ensembles inherently reduced variance and often generalized better, a cornerstone insight that I still teach every time I talk about ensemble models and committees in my machine learning courses.
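The variance-reduction argument can be checked with a toy simulation. This is only a sketch under an idealized assumption that is mine, not Perrone and Cooper's setup: every committee member predicts the true value plus independent zero-mean Gaussian noise. The function name `simulate` and all of its parameters are hypothetical. Real networks make correlated errors, so the gain in practice is smaller than in this idealized case.

```python
import random
import statistics

def simulate(n_members=10, n_trials=5000, noise_std=1.0, seed=0):
    """Compare the mean squared error of one noisy predictor and an averaged committee."""
    rng = random.Random(seed)
    target = 0.0  # true value to predict (toy assumption)
    single_sq, committee_sq = [], []
    for _ in range(n_trials):
        # Each member: truth plus independent noise (idealized assumption).
        preds = [target + rng.gauss(0.0, noise_std) for _ in range(n_members)]
        single_sq.append((preds[0] - target) ** 2)
        committee_sq.append((statistics.fmean(preds) - target) ** 2)
    return statistics.fmean(single_sq), statistics.fmean(committee_sq)

single_mse, committee_mse = simulate()
print(f"single member MSE: {single_mse:.3f}")
print(f"committee MSE:     {committee_mse:.3f}")
```

With independent errors, the committee error should shrink roughly in proportion to the number of members, which is the best case for simple averaging.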

But while averaging neural networks was powerful, it was also somewhat naive: all ensemble members contributed equally regardless of context or expertise. A natural next step was to refine ensembles by letting each member specialize, focusing their attention on different parts of the input space. This specialization led to the key innovation behind mixtures of experts: gating mechanisms.

First mixtures of experts (1991–1994)

The first “true” MoE model was introduced in a seminal 1991 paper titled “Adaptive Mixtures of Local Experts” and authored by a very impressive team of Robert Jacobs, Michael Jordan, Steven Nowlan, and Geoffrey Hinton. They transformed the general notion of an ensemble into a precise probabilistic model, explicitly introducing the gating concept. Their architecture had a “gating network”, itself a neural network, that learned to partition the input space and assign inputs probabilistically to multiple “expert” modules:

Each expert would handle a localized subproblem, trained jointly with the gating network via gradient descent, minimizing the squared prediction error. Note that in the 1990s, every expert would be activated on every input; sparse gating would come much later.
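To make the dense, softly gated mixture concrete, here is a minimal forward-pass sketch in plain Python. It assumes linear experts and a linear gate followed by a softmax; the names `dense_moe`, `gate_weights`, and `expert_weights` are illustrative and do not come from the original paper, and training is left out entirely.

```python
import math
import random

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def dense_moe(x, gate_weights, expert_weights):
    """Every expert runs on x; the gate's softmax probabilities mix their outputs."""
    gate_probs = softmax([dot(w, x) for w in gate_weights])
    expert_outputs = [dot(w, x) for w in expert_weights]  # linear experts (toy assumption)
    y = sum(p * o for p, o in zip(gate_probs, expert_outputs))
    return y, gate_probs

rng = random.Random(0)
dim, n_experts = 4, 3
gate_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
expert_weights = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(n_experts)]
x = [0.5, -1.0, 2.0, 0.1]
y, gate_probs = dense_moe(x, gate_weights, expert_weights)
print("gate probabilities:", [round(p, 3) for p in gate_probs])
print("mixture output:", round(y, 3))
```

Note that the cost of this forward pass grows linearly with the number of experts, since all of them are evaluated for every input.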

The follow-up work by Jordan and Jacobs extended this idea to hierarchical mixtures of experts (Jordan, Jacobs, 1991; 1993). Here, gating networks were structured into trees, where each node made a soft decision about passing data down one branch or another, progressively narrowing down to the most appropriate expert:

Importantly, Jacobs and Jordan also saw how to apply the expectation-maximization (EM) algorithm for maximum likelihood optimization of these networks, firmly establishing MoE as a principled statistical modeling tool. EM is a very important tool in every machine learning researcher’s toolbox, appearing wherever models contain a lot of hidden (latent) variables, and hierarchical MoEs are no exception.

In total, these seminal contributions introduced several fundamental concepts that are important for MoE architectures even today:
- softmax gating probabilities, today mirrored as router logits in Transformer-MoE layers;
- tree-structured gating networks, which are used in multi-layer or cascaded routing in modern implementations, and
- Expectation-Maximization training, which keeps inspiring auxiliary losses in contemporary MoE implementations to balance expert load and ensure specialization.
Moreover, back in the early 1990s the authors of these works also foresaw the benefits of modularity: experts could be independently updated, swapped, or fine-tuned; this is also widely used in today’s practice, as we will see below.

The “winter of MoEs” (mid-1990s to 2010s)

Given such promising beginnings, one might wonder why MoEs did not immediately revolutionize neural network research in the 1990s. Instead, mixtures of experts went through a period of dormancy lasting nearly two decades; they would resurface only in 2017.

Why so? Mostly due to practical bottlenecks, most importantly high computational cost. The computers of the 1990s and early 2000s simply could not handle the parallel computation required to run many experts simultaneously. Typical experiments at that time involved small models even by the standards of the time, often with just a few thousand parameters in total, which did not let researchers exploit the scalability that is the main draw of MoEs.

Besides, there were alternative methods readily available and actually working better. For instance, even the relatively narrow field of ensemble methods specifically was dominated by boosting approaches, starting from AdaBoost and later extended into gradient boosting methods. Boosting is indeed a better way to combine weak submodels, while MoEs begin to shine only as the component models become smarter.

Nevertheless, some research continued. I want to highlight the Bayesian analysis of MoEs and hierarchical MoEs that was developed at the turn of the century (Jacobs et al., 1997; Bishop, Svensén, 2002). By adding prior distributions to model parameters, Bayesian approaches helped address overfitting on smaller datasets and provided principled uncertainty estimates. These capabilities would prove valuable later when MoEs eventually scaled up, but for now, this theoretical groundwork lay dormant, waiting for the computational resources to catch up.

There were also some practical applications in speech recognition (to model pronunciation variations) and computer vision (for fine-grained classification), but these were done at a modest scale and without too much success. With the exception of the above-mentioned Bayesian analysis, MoEs remained largely absent from premier ML conferences.

MoE Renaissance with Outrageously Large Networks

Everything changed dramatically with the 2017 paper called “Outrageously Large Neural Networks” by Google Brain researchers Noam Shazeer et al. (with Geoffrey Hinton again among the authors!). They revitalized mixtures of experts by creatively leveraging GPUs and introducing critical innovations.

The most important was the sparse gating mechanism: rather than activating every expert, they proposed to activate just the top-k experts (typically k=2) per input token, dramatically reducing computational overhead:

Second, load-balancing auxiliary loss: to avoid the well-known “dead expert” problem (experts receiving zero or negligible input), they introduced a simple but effective auxiliary loss penalizing imbalanced usage across experts. Specifically, they introduced a smoothed version of the load function (probability that a given expert is activated under some random noise distribution) and a loss function that aims to reduce the coefficient of variation in the load.
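Both ingredients can be sketched in a few lines of plain Python. This is a simplified illustration, not the paper's exact recipe: it leaves out the noise term and the smoothed load estimator, and it applies the squared coefficient of variation to a simpler per-expert "importance" (the total gate weight an expert receives over a batch). The names `top_k_gate` and `cv_squared` are hypothetical.

```python
import math
import random

def top_k_gate(logits, k=2):
    """Keep the k largest logits, apply softmax over them, and zero out the rest."""
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in top)
    exps = {i: math.exp(logits[i] - m) for i in top}
    total = sum(exps.values())
    return [exps[i] / total if i in exps else 0.0 for i in range(len(logits))]

def cv_squared(values):
    """Squared coefficient of variation: variance divided by squared mean."""
    mean = sum(values) / len(values)
    var = sum((v - mean) ** 2 for v in values) / len(values)
    return var / (mean ** 2 + 1e-10)

rng = random.Random(0)
n_tokens, n_experts = 64, 8
# Random router logits stand in for a learned gating network (toy assumption).
batch_logits = [[rng.gauss(0, 1) for _ in range(n_experts)] for _ in range(n_tokens)]
gates = [top_k_gate(row, k=2) for row in batch_logits]

# Per-expert importance: total gate weight received over the batch.
importance = [sum(g[e] for g in gates) for e in range(n_experts)]
aux_loss = cv_squared(importance)
print("importance per expert:", [round(v, 2) for v in importance])
print("CV^2 balance penalty:", round(aux_loss, 4))
```

The penalty is zero when all experts receive the same total weight and grows as routing collapses onto a few experts, which is the behaviour the auxiliary loss is meant to discourage.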

These improvements enabled Shazeer et al. to build a 137-billion-parameter MoE LSTM (in 2017, that was a huge, huge network!) that matched a dense LSTM with only 1 billion parameters in training costs, a spectacular scalability breakthrough. This work demonstrated how MoEs with conditional computation could efficiently scale to enormous sizes, and revitalized mixtures of experts for the ML community.

The work by Shazeer et al. (2017) laid a foundation that modern MoE models have since built upon, inspiring large-scale implementations that we will discuss below. The underlying concepts (gating networks, localized specialization, and soft probabilistic routing) trace directly back to Jacobs and Jordan’s original formulations, but only with the computational power and practical innovations of recent years have MoEs been elevated from academic obscurity to practical prominence.

Overall, in this section we have retraced the intellectual arc that led from committee machines in the late 1980s to modern trillion-parameter sparsely-gated transformers. Interestingly, many contemporary issues such as dead experts, router instability, and bandwidth bottlenecks were already foreseen in the early literature. If you read Jacobs and Jordan’s seminal papers and related early works carefully, you can still find valuable theoretical and practical insights even into today’s open problems.

After this historical review, in the next sections we will go deeper into contemporary developments, starting with the remarkable journey of scaling language models with sparse conditional computation. Shazeer et al. used LSTM-based experts, but we have been living in the era of Transformers since 2017; how do they combine with conditional computation?

Scaling Text Models with Sparse Conditional Compute

If the historical journey of mixtures of experts (MoE) was a story of visionary ideas that could not be implemented on contemporary hardware, the next chapter in their story unfolded rapidly, almost explosively, once GPUs and TPUs arrived on the scene. By the early 2020s, scaling language models became the dominant focus of AI research. Yet, the push for ever-larger models soon faced both physical and economic limitations. Mixtures of experts, with their elegant promise of sparsity, offered a way around these limitations, and thus MoEs experienced a remarkable renaissance.

GShard and Switch Transformer

The era of modern MoE architectures began in earnest with Google’s GShard project in 2020 (Lepikhin et al., 2020). Faced with the huge Transformer-based models that already defined machine translation at that time, Google researchers looked for ways to train ultra-large Transformers without the typical explosive growth in computational costs. GShard’s key innovation was to combine MoE sparsity with Google’s powerful XLA compiler, which efficiently shards model parameters across multiple TPU devices.

We are used to the idea that self-attention is the computational bottleneck of Transformers due to its quadratic complexity. However, for the machine translation problem Lepikhin et al. were solving, where the inputs are relatively short but latent dimensions are large, the complexity of a Transformer was entirely dominated by feedforward layers! For their specific parameters ( $d_{\mathrm{model}}=1024$ , $d_{\mathrm{FF}}=8192$ , $L=128$ input tokens), a self-attention layer took

$4\cdot L^2\cdot d_{\mathrm{model}} = 4\cdot 128^2 \cdot 1024 \approx 6.7\times 10^7 \text{ FLOPs},$

while a feedforward layer was ~32 times heavier with about

$2\cdot L\cdot d_{\mathrm{model}}\cdot d_{\mathrm{FF}} = 2\cdot 128 \cdot 1024 \cdot 8192\approx 2.15\times 10^9 \text{ FLOPs}.$

In terms of parameter count too, feedforward layers took about two thirds of a Transformer’s parameters (Geva et al., 2020). To put this in perspective: the self-attention computation was like calculating distances between 128 items, while the feedforward layer was like applying 8192 different transformations to each item—a much heavier operation.
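The arithmetic above is easy to reproduce. The short script below simply plugs the quoted parameters into the same two cost formulas; the variable names are mine.

```python
# Parameters quoted in the text for the GShard translation setting.
d_model, d_ff, seq_len = 1024, 8192, 128

attention_flops = 4 * seq_len**2 * d_model          # 4 * L^2 * d_model
feedforward_flops = 2 * seq_len * d_model * d_ff    # 2 * L * d_model * d_FF
ratio = feedforward_flops / attention_flops

# Under these two formulas the costs are equal when L = d_FF / 2.
crossover_len = d_ff // 2

print(f"self-attention: {attention_flops:.2e} FLOPs")
print(f"feedforward:    {feedforward_flops:.2e} FLOPs")
print(f"ratio:          {ratio:.0f}x")
print(f"attention catches up at L = {crossover_len} tokens")
```

Since the ratio simplifies to $d_{\mathrm{FF}} / (2L)$, feedforward layers dominate under these formulas for any sequence shorter than 4096 tokens at these dimensions, which is why sparsifying them was the natural target.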

So the approach of Lepikhin et al. (2020) was concentrated on feedforward layers. Instead of each Transformer block performing a single dense computation, they replaced half of the dense feed-forward layers with sparsely gated mixture of experts layers (the illustration shows that only every other FF layer is turned into a MoE layer):

The gating mechanism selectively activates only a handful of experts per input token, thus dramatically reducing the computational load. Gating also included load balancing, this time for training: each expert was assigned a “capacity” value, and on every mini-batch, the number of tokens assigned to an expert could not exceed this capacity. Lepikhin et al. (2020) also introduced a highly parallel implementation for a cluster of TPUs, which was also very helpful, although perhaps too technical to discuss here.
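The capacity constraint can be illustrated with a small dispatch routine. This is a sketch of the general idea only: the function `dispatch_with_capacity`, the `capacity_factor` value, and the rule for computing capacity are all assumptions made for illustration, and random expert choices stand in for a learned router.

```python
import random

def dispatch_with_capacity(choices, n_experts, capacity):
    """Assign tokens to their chosen experts, refusing tokens once an expert is full.

    choices[t] lists the experts selected for token t (for example, its top-2).
    Returns per-expert token lists and the (token, expert) pairs that overflowed.
    """
    assigned = [[] for _ in range(n_experts)]
    overflow = []
    for token, experts in enumerate(choices):
        for e in experts:
            if len(assigned[e]) < capacity:
                assigned[e].append(token)
            else:
                overflow.append((token, e))
    return assigned, overflow

rng = random.Random(0)
n_tokens, n_experts, k = 32, 4, 2
# Hypothetical capacity rule: an even share of the k * n_tokens assignments, times a slack factor.
capacity_factor = 1.25
capacity = int(capacity_factor * k * n_tokens / n_experts)
# Random expert choices stand in for a learned router (toy assumption).
choices = [rng.sample(range(n_experts), k) for _ in range(n_tokens)]
assigned, overflow = dispatch_with_capacity(choices, n_experts, capacity)
print("capacity per expert:", capacity)
print("tokens per expert:", [len(a) for a in assigned])
print("overflowed assignments:", len(overflow))
```

A fixed capacity keeps every expert's workload bounded, which makes it possible to allocate buffers of a known size on each device; the price is that some assignments may be refused when routing is unbalanced.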

This allowed GShard to reach previously unattainable scales; ultimately, they trained a 600-billion-parameter machine translation model on 2,048 TPUs over just four days. Moreover, this model delivered a substantial improvement in machine translation metrics compared to dense baselines at similar computational cost.

Another Google Brain project, this time by Fedus et al. (2021), introduced the Switch Transformer in early 2021. Building directly on GShard’s sparse routing idea, the Switch Transformer took sparsity to a more extreme form, implementing a “top-1 routing” mechanism. In GShard, two experts handled each token (unless their capacity was exceeded or one of them had a low weight), but Switch Transformer simplified matters further by activating only the single most appropriate expert per token. Again, this idea was applied to feedforward layers:

This simplification ran contrary to contemporary sensibilities: Shazeer et al. (2017) hypothesized that you have to compare several (at least two) experts to have nontrivial signal for the routing function, and other works studied the number of experts required and found that more experts were helpful, especially in lower layers of the network (Ramachandran, Le, 2018).
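Top-1 routing itself is simple to write down. The sketch below uses toy scalar functions as hypothetical stand-ins for feedforward experts, and `switch_route` is an illustrative name. As I understand the Switch formulation, the chosen expert's output is scaled by its router probability, which is what keeps the router trainable even though only one expert is evaluated.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    s = sum(exps)
    return [e / s for e in exps]

def switch_route(router_logits, experts, x):
    """Top-1 routing: run only the highest-probability expert, scaled by that probability."""
    probs = softmax(router_logits)
    best = max(range(len(probs)), key=probs.__getitem__)
    return probs[best] * experts[best](x), best

# Toy scalar "experts" (hypothetical stand-ins for feedforward blocks).
experts = [lambda x: 2 * x, lambda x: x + 1, lambda x: -x]
y, chosen = switch_route([0.1, 2.0, -1.0], experts, x=3.0)
print("chosen expert:", chosen, "output:", round(y, 3))
```

Only one expert function is called per token here, so the per-token compute stays constant no matter how many experts are added.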

Still, Fedus et al. (2021) proved to be right: switching worked just as well as regular gating, only faster. The Switch Transformer was the first to successfully train a model with a trillion parameters! Even more importantly, increasing the number of parameters in this sparse setting still conformed to the famous scaling laws of Kaplan et al. (2020) and allowed training to be much more sample-efficient, i.e., achieve the same results with less data and training compute:

The Switch Transformer emphatically proved that MoE models could not only scale but also remain stable, trainable, and economical. But machine translation was just the beginning.

GLaM and the Pathways approach

Inspired by the groundbreaking successes of GShard and Switch Transformer, later in 2021 Google Research introduced the Generalist Language Model (GLaM; Du et al., 2022), a further improved MoE-based sparse architecture.

Similar to GShard, GLaM replaces every other Transformer layer with a MoE layer, and at most two experts are activated for every input. But they added recent modifications that the original Transformer architecture had received by that time—Gated Linear Units in feedforward layers (Shazeer, 2020) and GELU activations (Hendrycks, Gimpel, 2016)—and utilized even more aggressive sparsity, activating only 8% of its 1.2 trillion parameters per token. This resulted in GLaM outperforming GPT-3 in both results and computational costs, and in a scaling law that held at least up to 64 experts:

Despite this extreme sparsity, GLaM performed very well on benchmark tasks, often matching or surpassing dense models such as GPT-3 (175B parameters), while using only a fraction (about one-third) of the energy during training.

GLaM was not just a single model with a novel technical breakthrough; it represented a philosophical shift. Google’s broader “Pathways” initiative introduced soon after (Dean, 2021) contained a vision of single models capable of handling multiple tasks and modalities efficiently by selectively activating different parts of the network—pathways—based on context:

The vision already suggested hierarchical routing strategies, where one router would select broad modality-specific pathways and another would pick fine-grained token-level experts, just like modern multimodal architectures such as Gemini 1.5 do.

The first widely adopted result of this initiative was the Pathways Language Model (PaLM) LLM released by Google in April 2022 (Chowdhery et al., 2022). While PaLM itself was not a MoE model, it represented Google’s commitment to the broader Pathways vision of conditional computation and served as an important stepping stone towards their later MoE implementations in Gemini. Where GLaM/MoE models saved compute by explicitly sparse activations, PaLM showed that a dense Transformer could still scale to half-a-trillion params if the communication stack is re-engineered. Specifically, PaLM used a client-server graph and gradient exchange between denser compute clusters to achieve near-linear throughput, scaling for the first time to two TPU pods… which let PaLM understand jokes about pods much better:

PaLM also popularized chain-of-thought prompting, showing how asking the model to generate its reasoning consistently boosts results across multiple applications. It became a popular and important family of LLMs, leading to PaLM 2 (Anil et al., 2023) and specialized offshoots such as Med-PaLM for medical applications (Singhal et al., 2023a; 2023b) and PaLM-E for embodied reasoning (Driess et al., 2023).

Note that PaLM is not a mixture-of-experts model per se; it is a dense model where every layer is activated on every input. To be honest, it’s not really a full implementation of the Pathways idea either; it runs on the Pathways orchestration level but does not employ the original detailed routing idea. As Chowdhery et al. (2022) put it themselves, “PaLM is only the first step in our vision towards establishing Pathways as the future of ML scaling at Google and beyond”.

Still, it was a great first step and an important stepping stone towards distributed LLM architectures. But when and how did actual MoE models enter the frontier LLM landscape? While PaLM demonstrated the power of dense models with novel distributed orchestration, the fundamental efficiency gains of sparse architectures were still underexplored. Let us see how two major industry players took up the challenge of making MoE models practical and ready for widespread adoption.

DeepSpeed-MoE and Megatron-Core: opensourcing MoE for the masses

In 2022, the landscape of LLMs was not yet as firmly established as it is today, when new entries such as Grok require huge investments from the richest guys in the world and still kind of fall short. The DeepSpeed family and its Megatron offshoots could still try to come in fresh and define the LLM frontier.

Although, to be fair, there were not one, but two extremely rich companies involved: DeepSpeed was developed by Microsoft researchers, while the Megatron family was trained with the vast computational resources of NVIDIA. They first came together to produce the Megatron-Turing Natural Language Generation model (MT-NLG), which had a record-breaking 530B parameters back in October 2021 and broke a lot of records in natural language understanding and generation tasks (Smith et al., 2022; Kharya, Alvi, 2021).

By mid-2022, both teams of researchers tried to build on this success with sparse architectures. DeepSpeed-MoE (Rajbhandari et al., 2022) set the goal to make mixture of experts models practical for GPT-style language modelling, i.e., cheap to train and, crucially, cheap to serve in production. The paper presents an end-to-end stack that includes new architecture tweaks, compression ideas, distributed training, and an inference engine.

They use top-1 gating (already standard when you want to be fast and cheap), still sparsify only the feedforward parts of the Transformer architecture, and shard each expert onto its own GPU rank, mixing data and tensors so that every GPU processes exactly one expert.

The MoE architectural novelty in DeepSpeed-MoE was the Pyramid-Residual MoE (PR-MoE). The pyramid part means that early layers keep fewer experts, and later layers double them, which reflects the idea that deeper layers learn task-specific concepts. The residual part means that each token always passes through a small dense MLP branch, while the expert acts as a residual “error-corrector”:

This allowed DeepSpeed-MoE to have up to 3x fewer parameters with virtually identical zero-shot accuracy.
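The two PR-MoE ideas can be sketched independently of any real framework. Everything below is a toy illustration under my own assumptions: `pyramid_schedule` and `residual_moe` are hypothetical names, the layer counts and expert counts are made up, and scalar functions stand in for the dense MLP branch and the experts.

```python
def pyramid_schedule(n_layers, base_experts, n_wide_layers):
    """Experts per MoE layer: a base count early on, doubled in the last n_wide_layers."""
    narrow = [base_experts] * (n_layers - n_wide_layers)
    wide = [2 * base_experts] * n_wide_layers
    return narrow + wide

def residual_moe(x, dense_mlp, experts, router):
    """Residual MoE: a dense branch for every token plus one routed expert as a correction."""
    chosen = router(x)
    return dense_mlp(x) + experts[chosen](x), chosen

# Hypothetical configuration, for illustration only.
schedule = pyramid_schedule(n_layers=6, base_experts=4, n_wide_layers=2)
print("experts per layer:", schedule)

# Toy scalar stand-ins for the dense branch, the experts, and the router.
dense_mlp = lambda x: 0.5 * x
experts = [lambda x: 0.1 * x, lambda x: -0.1 * x]
router = lambda x: 0 if x >= 0 else 1
y, chosen = residual_moe(2.0, dense_mlp, experts, router)
print("chosen expert:", chosen, "output:", round(y, 3))
```

Because the dense branch always contributes, a single routed expert only has to learn a correction on top of it, which is one way to read the "error-corrector" description above.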

Another interesting trick concerned distillation, which could train smaller MoE ensembles that approach the quality of the original much larger variant. Rajbhandari et al. (2022) noticed that while distillation is a natural idea for MoE models as well (although it was still novel at the time: Artetxe et al., 2021 and Fedus et al., 2022 distilled MoE models before, but only into dense smaller Transformers), it starts to hurt the results after a while if you just distill larger experts into smaller ones. Thus, they introduced Mixture-of-Students (MoS) distillation: the student experts copy the teacher for the first 40% of pretraining, and then distillation is turned off so the students can keep minimizing their own LM loss.
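The staged schedule boils down to a loss that changes form partway through training. Here is a minimal sketch of that switch; the function name `mos_loss`, the `kd_weight` parameter, and the simple additive combination of the two terms are my assumptions rather than details taken from the paper.

```python
def mos_loss(lm_loss, kd_loss, step, total_steps, kd_fraction=0.4, kd_weight=1.0):
    """Language-modelling loss plus a distillation term that is dropped after kd_fraction of training."""
    if step < kd_fraction * total_steps:
        return lm_loss + kd_weight * kd_loss  # early phase: also imitate the teacher
    return lm_loss                            # late phase: student optimizes its own LM loss

total_steps = 1000
early = mos_loss(lm_loss=2.0, kd_loss=0.5, step=100, total_steps=total_steps)
late = mos_loss(lm_loss=2.0, kd_loss=0.5, step=500, total_steps=total_steps)
print("loss at step 100:", early)
print("loss at step 500:", late)
```

In a real training loop the two loss values would come from the student's forward pass and a comparison with the teacher's outputs; here they are passed in as plain numbers to keep the schedule itself in focus.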

All of these innovations, and many more, have made their way into an open source release. The DeepSpeed library, still supported and improved by Microsoft researchers, emerged as a very influential open source framework. For MoE models specifically, it offers CUDA kernels optimized for sparsity together with sophisticated memory management techniques, and DeepSpeed’s ZeRO-3 memory partitioning method (from Zero Redundancy Optimizer, first introduced by Rajbhandari et al., 2019; see also ZeRO++ by Wang et al., 2023) later pushed scalability further, allowing massive MoE models to run on relatively modest hardware configurations.

NVIDIA’s Megatron-LM suite, which includes the Megatron-Core framework with MoE implementations, followed shortly thereafter, introducing specialized kernels known as Grouped-GEMM, which efficiently batch many small expert computations into a single GPU operation. Megatron-Core also pioneered capacity-aware token dropping and systematic monitoring tools to alleviate common MoE issues like dead experts. These innovations significantly streamlined training pipelines, accelerating adoption of MoE at scale across diverse applications. What’s more, the Megatron-LM suite is also published in open source and has become the basis for many interesting research advances (Shoeybi et al., 2019; Shin et al., 2020; Boyd et al., 2020; Wang et al., 2023; Waleffe et al., 2024).

While DeepSpeed and Megatron showed that it was feasible to scale MoEs up, the real test would come with their adoption in production systems serving millions of users. In the next section, we will examine how today’s frontier LLMs have embraced and extended these ideas. Are state of the art LLMs powered by MoEs or have they found a better way to scale?

Mixtures of Experts in Flagship LLMs

Actually, the answer is yes: they are powered by MoEs. Between 2023 and 2025, MoE architectures firmly transitioned from research artifacts to production-ready frontier models. Let me show three prominent models that exemplify this shift.

Mixtral of Experts

It is fashionable to frown upon Mistral AI, a European research lab whose top models indeed often do not match the best American and Chinese offerings in scale and capabilities, and the EU AI Act and other associated bureaucracy are not helping matters.

But they are committed to open source, where they make significant contributions to the field, and the Mixtral family rapidly became an open-source sensation in 2023. Specifically, we are interested in the Mixtral of Experts idea (Jiang et al., 2024; Mistral AI, 2023) and its 8x7B model based on the Mistral-7B backbone (Jiang et al., 2023) that has entered plenty of baseline comparisons and has given rise to a lot of research papers.

Mixtral 8x7B is relatively small: 47B total parameters with 13B of them active per token, 32 layers with 8 experts per layer and top-2 routing, and only 32K tokens of context. The MoE implementation is straightforward—when a token arrives at a feedforward layer (again, here we leave the self-attention layers unchanged), a router scores it into 8 logits, only the top 2 survive, and their outputs are combined in a linear combination:

Unlike previous MoE architectures, Mixtral 8x7B has small fan-out (only 8 experts) but top-2 routing instead of the common top-1 choice; this seems to provide a better quality-compute trade-off, and 13B active parameters fit on commodity GPUs, so many researchers could run Mixtral 8x7B for their projects without huge funding.
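The name “8x7B” might suggest 56B parameters, so it is worth checking how the quoted 47B total and 13B active figures fit together. The back-of-the-envelope sketch below assumes a simple two-part split that is my own simplification: parameters shared by every token (attention, embeddings, routers) plus 8 equal experts, of which 2 are active. It uses the rounded numbers from the text, so the results are rough estimates only.

```python
# Rounded figures from the text, in billions of parameters.
total_params_b = 47.0
active_params_b = 13.0
n_experts, top_k = 8, 2

# Assumed decomposition (a simplification):
#   total  = shared + n_experts * per_expert
#   active = shared + top_k     * per_expert
per_expert_b = (total_params_b - active_params_b) / (n_experts - top_k)
shared_b = active_params_b - top_k * per_expert_b

print(f"per-expert feedforward parameters: ~{per_expert_b:.1f}B")
print(f"shared (non-expert) parameters:    ~{shared_b:.1f}B")
print(f"active fraction per token:         {active_params_b / total_params_b:.0%}")
```

Under this split, only the feedforward experts are replicated eight times while everything else is shared, which is why the total comes out well below eight times a 7B model.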

The Mistral AI team also studied which experts get activated for which tokens.

And here is an illustration of which experts get activated on which tokens for several layers in the architecture (Jiang et al., 2024); note that it’s less semantic and perhaps more syntactic than you would expect, especially at top layers:

With good clean multilingual training data, Mixtral 8x7B became the first open-source MoE that rivaled GPT-3.5 on average while running on a single GPU.

Gemini 1.5 Pro

Google’s flagship model, Gemini 1.5 (Gemini Team, 2024), showcased multimodal routing across text, images, audio, and video, delivering previously unimaginable context windows—up to an astonishing 10 million tokens.

Again, Gemini 1.5 Pro replaces most of the dense feed-forward blocks with MoE blocks, routing every token to a small subset of experts. The main difference from previous MoE iterations was that previously MoE LLMs were unimodal, topped out at 2–32K tokens of context, and their expert routing was still fairly simple. Gemini 1.5 introduces a multimodal encoder-decoder, raises the context size up to millions of tokens, and introduces a more careful gating scheme (materials mention DSelect-k by Hazimeh et al., 2021, which we will discuss below), so load balancing stays healthy even when the prompt is a 700K-token codebase.

Gemini also introduced intricate caching strategies, such as cross-expert KV-cache sharing, which led to significant gains in cost efficiency and, again, context enlargement.

GPT-4 and later OpenAI models

OpenAI is notoriously not that open with its research ideas (and perhaps they are quite correct in that!), so I cannot really tell you in detail what was new in GPT-4o; all we have officially are the system cards for GPT-4 (OpenAI, 2023) and GPT-4o (OpenAI, 2024), which are both mostly about capabilities and safety rather than about the ideas. So the following details should be taken as informed speculation rather than confirmed facts.

But there have been leaks and analyses, and I can link to a Semianalysis post on GPT-4 that suggests that GPT-4o (and its base model GPT-4) employs approximately 16 experts per MoE layer, each with around 111 billion parameters, amounting to roughly 1.8 trillion total parameters with around 280 billion active at any given time; see also here and here.

While the exact number of parameters may be different, it’s almost certain that starting from GPT-4, OpenAI’s latest models are all of the MoE variety. This lets them both be huge and at the same time have reasonable runtimes. GPT-4 routes to the top 2 experts out of 16, and there are, it seems, specialized experts fine-tuned to do specific stuff, e.g., one expert concentrates on safety and another on coding. GPT-4o seems to have the same basic structure but several efficiency improvements that are about, e.g., sparse attention mechanisms rather than MoE.
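The rumored numbers are at least internally consistent, which is easy to check. The snippet below uses only the unconfirmed figures quoted above; reading the remainder as always-active parameters (attention and other shared parts) is an inference from the arithmetic, not something the sources are confirmed to state.

```python
# Unconfirmed figures from the leak discussed above, in billions of parameters.
n_experts = 16
params_per_expert = 111
top_k = 2
reported_active = 280

all_experts = n_experts * params_per_expert        # parameters held in experts
routed_active = top_k * params_per_expert          # expert parameters used per token
implied_shared = reported_active - routed_active   # what is left for always-on parts

print(all_experts, routed_active, implied_shared)
```

Sixteen experts of 111B each give 1776B, i.e., roughly the 1.8 trillion total, and two active experts give 222B, which leaves about 58B of the reported 280B for everything that is not an expert.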

Note that GPT-4 is still the workhorse of the latest reasoning models such as o4-mini, o3 and o3-pro; the release of a larger model, GPT-4.5 (OpenAI, 2025), was not that successful, and while I cannot be 100% sure, it seems that all of OpenAI’s top offerings are still GPT-4-powered.

DeepSeek-V3

The top-tier Chinese model, DeepSeek-V3 (DeepSeek AI, 2024), which soon became the basis for the famous DeepSeek-R1 reasoning model (Guo et al., 2025), pushes breadth: lots of fine-grained experts, relatively large K, sophisticated router engineering to keep them busy, and a lot of very insightful tricks in the training process to make the giant 671B parameter model as efficient as possible. Here let me refer to my previous longform post where I went into some details about the engineering ingenuity of DeepSeek-V3.

Llama 4

Probably the latest frontier family of models that we do have detailed information about is the Llama 4 family, introduced in April 2025 (Meta AI, 2025). And fortunately for us now, instead of using a dense decoder-only transformer like the Llama 3 family, Llama 4 became the first of the Llama models to implement a MoE-based architecture.

For example, the Llama 4 Maverick model employs a large MoE architecture with 128 experts per layer and top-2 routing, meaning that each token activates 2 experts out of 128. The total parameter count reaches over 400 billion, but only about 17B parameters are active per token, making it efficient to run. One of these two experts in the Llama 4 family is a shared expert:

The model also introduced several innovations in the Transformer architecture itself, in particular the iRoPE approach with interleaved attention layers without positional embeddings. Perhaps most importantly for the open-source community, Meta released not just the model weights (on Huggingface) but also detailed training recipes and infrastructure code (on GitHub), making it a great starting point for others to train their own MoE models at scale.

With that, we are already basically at the frontier of modern LLMs; latest improvements deal with fine-tuning strategies and turning one-shot LLMs into reasoning models, which do not change the underlying model structure. So let me end the high-level survey part of the post here and proceed to actually diving into the details of routing strategies and other architectural choices that a MoE model designer has to make.

Router Mechanics: How To Train Your Monster

Mixtures of experts sound very simple from a bird’s eye view: you have lots of small, specialized networks (“experts”), and for every input, you activate just a few, as chosen by a router which is also a subnetwork. But translating this elegant idea into reality involves many important practical considerations. How exactly do routers work? What other considerations are there? In this section, let’s dive beneath the surface and explore exactly how engineers tame this beast in practice.

Sparse vs. dense MoEs

The first important distinction is between dense and sparse MoEs: do we combine the experts in a full linear combination or do we choose a small subset of the experts? Here is an illustration by Cai et al. (2024):

Formally speaking, a dense MoE router produces a vector of scores $\mathbf{g}(\mathbf{x},\boldsymbol{\theta})$, where $\boldsymbol{\theta}$ are the router’s parameters and $\mathbf{x}$ is the input, and then passes it through a softmax to get the mixture weights:

$F(\mathbf{x},\boldsymbol{\theta},\mathbf{W})=\sum_{i=1}^n\mathrm{softmax}\left(\mathbf{g}(\mathbf{x},\boldsymbol{\theta})\right)_i\cdot f_i(\mathbf{x},\mathbf{w}_i),$

where $\mathbf{w}_i$ are the weights of the experts. A sparse MoE router only leaves the top $k$ of these scores and sets the rest to negative infinity so that the softmax would lead to zero probabilities:

$g_{i,\mathrm{sparse}}(\mathbf{x},\boldsymbol{\theta})=\mathrm{TopK}(\mathbf{g}(\mathbf{x},\boldsymbol{\theta}))_i=\begin{cases} g_i(\mathbf{x},\boldsymbol{\theta}), & \text{if it is in the top }k, \\-\infty, & \text{otherwise}.\end{cases}$
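The two formulas above can be sketched in a few lines of plain Python. The names `top_k_mask` and `moe_forward`, as well as the toy experts, are hypothetical and chosen only for this illustration; the point is that masking scores with negative infinity before the softmax yields exact zeros, so the masked experts never need to be evaluated.

```python
import math

def softmax(scores):
    m = max(scores)                                # at least one score is finite
    exps = [math.exp(s - m) for s in scores]       # math.exp(-inf) is exactly 0.0
    z = sum(exps)
    return [e / z for e in exps]

def top_k_mask(scores, k):
    """Keep the k largest scores and set the rest to -inf."""
    keep = set(sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k])
    return [s if i in keep else -math.inf for i, s in enumerate(scores)]

def moe_forward(x, scores, experts, k=None):
    """Dense mixture if k is None, sparse top-k mixture otherwise."""
    gate = softmax(scores if k is None else top_k_mask(scores, k))
    out = [0.0] * len(x)
    for g, f in zip(gate, experts):
        if g == 0.0:
            continue                               # masked experts are never evaluated
        out = [o + g * v for o, v in zip(out, f(x))]
    return out, gate

# Toy experts that scale the input by a constant.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
scores = [2.0, 1.0, 0.0, -1.0]
dense_out, dense_gate = moe_forward([1.0], scores, experts)
sparse_out, sparse_gate = moe_forward([1.0], scores, experts, k=2)
```

With `k=None` all four experts contribute with positive weights; with `k=2` the last two gate values are exactly zero and the first two are renormalized to sum to one.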

Early works mostly used dense MoEs (Jordan, Jacobs, 1991; 1993; Collobert et al., 2001). But ever since the sparsely-gated MoE layer (Shazeer et al., 2017) that we discussed above introduced and popularized sparse routing, the common agreement has been that sparse routing is the way to go as a much more efficient solution: you don’t have to activate all experts every time.

The only remnants of dense MoEs were shared experts, pioneered by the DeepSpeed-MoE (Rajbhandari et al., 2022) architecture that we discussed above. In this case, every token is processed by a fixed expert (which is “densely activated”, so to speak) and another expert chosen through gating, so it is actually top-1 gating with two resulting experts.
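A minimal sketch of this "shared expert plus top-1" pattern might look as follows. The function name and the way the two outputs are combined (a plain sum) are assumptions made for illustration; actual implementations may weight or normalize the two branches differently.

```python
def shared_plus_top1(x, router_scores, shared_expert, routed_experts):
    """Every token goes through the shared expert plus one routed expert."""
    best = max(range(len(router_scores)), key=lambda i: router_scores[i])
    shared_out = shared_expert(x)
    routed_out = routed_experts[best](x)
    # Assumption: the two outputs are simply added. With a single surviving
    # score, the softmax weight of the routed expert is exactly 1.
    return [s + r for s, r in zip(shared_out, routed_out)], best

# Toy usage with hypothetical experts.
shared = lambda x: list(x)
routed = [lambda x, s=s: [s * v for v in x] for s in (0.0, 10.0, 100.0)]
out, chosen = shared_plus_top1([1.0, 2.0], [0.2, 1.5, -0.3], shared, routed)
```

Here the router picks expert 1, so two experts run for the token even though only one routing decision is made.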

It has been known that dense MoEs might work better—see, e.g., Riquelme et al. (2021)—but the tradeoff has been generally assumed to be not worth it. Still, interestingly, work on dense mixtures of experts has not stopped, and sometimes it has been very fruitful. Let me highlight a few of these cases; they are very different but each explores alternatives to the classical sparsely-gated Top-k MoE layer.

EvoMoE (Evolutional Mixture-of-Experts; Nie et al., 2021) introduces two-phase training. In the expert-diversify phase, the authors start with one shared expert, then copy and randomly mask it to spawn many diverse experts. In the gate-sparsify phase, they introduce the Dense-to-Sparse (DTS) gate: a softmax that initially routes tokens to all experts but then gradually prunes itself until it behaves like a sparse Top-1 or Top-2 routing gate. Here is an illustration by Nie et al. (2021):

As a result, routing is fully dense at the beginning, allowing every expert to receive gradient signals; sparsity is imposed only after experts have specialized, so in this case the authors try to get the best of both worlds with a hybrid training procedure.
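To get a feeling for how a gate can drift from dense to sparse, here is a toy sketch. It is not the exact DTS gate of Nie et al. (2021): the temperature schedule, the threshold value, and the function name `dense_to_sparse_gate` are all assumptions made for illustration. The idea shown is only that a softmax with a decreasing temperature concentrates its mass, and experts whose weight falls below a threshold are dropped.

```python
import math

def dense_to_sparse_gate(scores, temperature, threshold):
    """Temperature softmax followed by pruning of low-weight experts."""
    m = max(scores)
    exps = [math.exp((s - m) / temperature) for s in scores]
    z = sum(exps)
    probs = [e / z for e in exps]
    # Always keep the best expert so that the gate never becomes empty.
    best = max(range(len(probs)), key=lambda i: probs[i])
    kept = [p if (p >= threshold or i == best) else 0.0 for i, p in enumerate(probs)]
    z_kept = sum(kept)
    return [p / z_kept for p in kept]

scores = [1.2, 0.8, 0.1, -0.5]
# Hypothetical annealing schedule: high temperature early, low temperature late.
for temperature in (4.0, 1.0, 0.25, 0.05):
    gate = dense_to_sparse_gate(scores, temperature, threshold=0.1)
    active = sum(g > 0 for g in gate)
    print(f"T={temperature}: {active} active experts")
```

As the temperature goes down, the number of active experts in this example shrinks from all four to a single one, so early training is dense and late training is effectively top-1.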

MoLE (Mixture of LoRA Experts; Wu et al., 2024) notices that sparseness is important when experts are large, but there are use cases when they aren’t. What if your experts are LoRA adapters? Then you can treat every independently trained LoRA as an expert and learn a hierarchical soft-max gate that mixes them layer-by-layer.

The idea is to get an economical but expressive way to learn to adaptively combine different LoRAs; here is a comparison vs. linear composition (left) and reference tuning (center), two other popular methods for LoRA composition:

Moreover, because the routing gate is continuous, MoLE can later mask unwanted experts at inference time and renormalize weights without retraining. Wu et al. (2024) introduce a gating balance loss that keeps experts alive and avoids dominance by a few adapters. As a result, they get an approach that can scale to dozens of adapters with little extra memory because only the tiny routing gate is trainable.
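Masking experts at inference time and renormalizing the remaining weights is a very simple operation. The sketch below is a generic illustration with a hypothetical helper name, not code from the MoLE paper.

```python
def mask_and_renormalize(gate_weights, masked):
    """Zero out the masked experts and rescale the rest to sum to one."""
    kept = [0.0 if i in masked else w for i, w in enumerate(gate_weights)]
    total = sum(kept)
    if total == 0.0:
        raise ValueError("all experts are masked")
    return [w / total for w in kept]

# Three hypothetical LoRA experts; switch off the second one.
new_weights = mask_and_renormalize([0.5, 0.3, 0.2], masked={1})
```

The relative proportions of the surviving experts are preserved (here 5:2), which is why no retraining of the gate is needed.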

Another creative way to combine LoRA adapters with a MoE mechanism was introduced in LoRAMoE (Dou et al., 2024). Compared to standard MoE, LoRAMoE replaces each feedforward layer by a mini-MoE whose experts are low-rank LoRA matrices; the output is a dense weighted sum over all experts:

Basically, just freeze the backbone, plug in multiple LoRA adapters as experts and blend them with a dense router. Moreover, Dou et al. (2024) specialize their experts explicitly, introducing a localized balancing constraint that biases half of the experts towards knowledge-heavy tasks and half towards instruction following (we are talking about image-to-text models here).
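The "frozen backbone plus densely mixed LoRA experts" idea can be sketched as follows. This is an illustration under stated assumptions rather than the LoRAMoE implementation: the linear softmax router, the `scale` factor, and all names are hypothetical, and matrices are plain nested lists to keep the example dependency-free.

```python
import math

def matvec(matrix, vector):
    return [sum(m * v for m, v in zip(row, vector)) for row in matrix]

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    z = sum(exps)
    return [e / z for e in exps]

def lora_moe_forward(x, frozen_w, router_w, lora_experts, scale=1.0):
    """Frozen layer output plus a dense weighted sum of low-rank updates."""
    out = matvec(frozen_w, x)                  # frozen pretrained weights
    gate = softmax(matvec(router_w, x))        # dense gate over all experts
    for g, (a, b) in zip(gate, lora_experts):
        delta = matvec(b, matvec(a, x))        # low-rank update B_i (A_i x)
        out = [o + scale * g * d for o, d in zip(out, delta)]
    return out, gate

# Toy usage: identity backbone, uniform router, two rank-1 experts.
frozen_w = [[1.0, 0.0], [0.0, 1.0]]
router_w = [[0.0, 0.0], [0.0, 0.0]]
experts = [([[1.0, 0.0]], [[1.0], [0.0]]),     # (A_1, B_1)
           ([[0.0, 1.0]], [[0.0], [1.0]])]     # (A_2, B_2)
out, gate = lora_moe_forward([1.0, 2.0], frozen_w, router_w, experts)
```

Only the router and the small matrices A_i, B_i would be trainable in such a layer; the frozen weights are never updated.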

Interestingly, they applied it to LLMs and found that dense MoE and a frozen backbone preserve pretrained world knowledge much better than full SFT or single-LoRA tuning. After fine-tuning on specific downstream tasks, the LoRAMoE architecture lost much less performance on general question answering benchmarks about the world than the alternatives.

As you can see, sparse vs. dense MoE is not a dichotomy but more of a spectrum: you can switch between pure dense and pure sparse MoE in training, or apply dense MoEs to smaller subnetworks, or use any other hybrid approach that might be preferable in your specific case.

MoPE and where to place it

The discussion of MoLE and LoRAMoE brings us to a whole different class of MoE-based architectures: Mixtures of Parameter-Efficient Experts (MoPE). MoPE is a design paradigm that combines two ideas that had previously advanced largely in parallel:
- the conditional computation and specialization benefits of MoEs and
- the memory- and compute-friendly techniques of parameter-efficient fine-tuning (PEFT) such as LoRA, conditional adapters, prefix tuning and others (see, e.g., a detailed survey by Han et al., 2024).
In MoPE, the heavy dense backbone of a pretrained LLM is frozen, and each expert represents a lightweight PEFT module. A gating network selects, per token or per example, the subset of PEFT experts that will be applied on top of the frozen weights. The idea is to combine the task-adaptivity and routing diversity of MoEs while not actually spending the storage and FLOPs budget usually associated with sparse expert layers. It was pioneered by MoELoRA (Liu et al., 2023) and MixLoRA (Li et al., 2024) and has blossomed into a subfield of its own.

Since we are talking about PEFT now, there is a wide variety of different places where one can put the adapter inside a standard Transformer stack of the base model. Here I will again use an illustration by Cai et al. (2024) with a general taxonomy of these placements, and then we will consider them one by one:

In attention-level MoPE, experts augment individual projection matrices (Q, K, V, O). E.g., MoELoRA (Liu et al., 2023) inserts multiple LoRA experts into the Query and Value projections and uses top-k routing plus a contrastive loss to encourage expert diversity and stable routing. MoCLE (Gou et al., 2024) clusters tasks first and then routes inputs to cluster-specific LoRA experts plus a universal backup expert.

FFN-level MoPE adapts the feedforward sublayer, as classical MoEs usually do, but instead of gating the full FFN, parallel PEFT experts are added and combined through gating. AdaMix (Wang et al., 2022) pioneered this setting, using adapter experts with a stochastic routing policy at training time that collapses to a cheap ensemble at inference, while MixDA (Diao et al., 2023) places domain-adapters and task-adapters sequentially to disentangle domain and task knowledge. LoRAMoE (Dou et al., 2024), which we have discussed above, also belongs to this class.

In Transformer-block-level MoPE, two expert pools are attached simultaneously—one modulating the self-attention block, the other the FFN. For example, MoV (Zadouri et al., 2024) replaces both blocks’ dense parameters with tunable vectors, showing large memory savings without sacrificing quality, while MoLoRA from the same work follows the dual-pool idea but sticks to LoRA experts. UniPELT (Mao et al., 2024) mixes heterogeneous PEFT methods (adapter, prefix, LoRA) under separate gates, letting the model pick the most suitable adaptation mechanism per input.

Finally, layer-wise (or every-layer) MoPE treats the entire Transformer layer as the unit of specialization, with each layer hosting a set of PEFT experts applied to the whole block. We have discussed MoLE (Wu et al., 2024) that views already-trained LoRAs as ingredients and learns only the gating weights to compose them on-the-fly for a new domain. Omni-SMoLA (Wu et al., 2024) extends the idea to three modality-specific expert sets (text, vision, and multimodal) for vision-and-language tasks.

Training regimes

Another design choice that the designer of a MoE system has to make is how exactly to train the experts. Early MoE systems such as GShard and Switch Transformer simply replaced each feed-forward block with a sparse MoE layer and trained the entire network end-to-end; inference mirrored the training configuration without architectural changes. This approach remains a standard baseline, but by now there are three major routes that one can follow to try and improve upon it.

The dense-to-sparse regime starts with a dense backbone and gradually introduces sparsity. For example, Residual MoE (RMoE; Wu et al., 2022) pretrains a Vision Transformer densely, then fine-tunes it with sparsely activated experts while aligning outputs between the original FFN and the emerging MoE layer. EvoMoE (Nie et al., 2021) that we have discussed above bootstraps knowledge with a single expert, then spawns additional experts, and employs a Dense-to-Sparse Gate (DTS-Gate) that performs gradual annealing from full activation to top-k routing, thus alleviating the cold start problems for experts. The Sparse Upcycling approach (Komatsuzaki et al., 2023) copies FFN weights from a well-trained dense checkpoint into multiple identical experts, adds a randomly initialized router, and then retrains the network. DS-MoE (Pan et al., 2024) and SMoE-Dropout (Chen et al., 2023) are hybrid approaches that alternate between dense and sparse activations during training so that all experts receive gradient signal (this is needed to avoid expert under-utilization), but inference reverts to sparse execution for efficiency.
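The initialization step of sparse upcycling is simple enough to sketch directly. In the snippet below, the dictionary layout of the FFN weights, the function name `upcycle`, and the router initialization (Gaussian with standard deviation 0.02) are assumptions made for illustration, not details from Komatsuzaki et al. (2023).

```python
import copy
import random

def upcycle(dense_ffn, num_experts, d_model, seed=0):
    """Turn one dense FFN into an MoE layer with identical experts."""
    rng = random.Random(seed)
    # Every expert starts as an independent copy of the dense FFN weights.
    experts = [copy.deepcopy(dense_ffn) for _ in range(num_experts)]
    # The router is the only new component and is randomly initialized.
    router = [[rng.gauss(0.0, 0.02) for _ in range(d_model)]
              for _ in range(num_experts)]
    return {"experts": experts, "router": router}

# Toy dense FFN with a 2 -> 1 -> 2 shape.
dense = {"w_in": [[1.0, 2.0]], "w_out": [[3.0], [4.0]]}
moe = upcycle(dense, num_experts=4, d_model=2)
```

Right after this step, the experts are identical copies of the dense FFN; they start to differ only during the subsequent training, driven by the tokens that the router sends to each of them.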

Sparse-to-dense distillation or pruning aims for deployability: how do we compress a high-capacity sparse MoE into a latency-friendly dense model? For example, Switch Transformer (Fedus et al., 2021) distills knowledge into a dense student with only ~5% of the parameters, while OneS (Xue et al., 2023) first aggregates expert weights by SVD or top-k heuristics, then distills into a single dense model that can keep 60–90% of the parent’s gains across vision and NLP tasks. ModuleFormer (Shen et al., 2023) and Expert-Slimming / Expert-Trimming strategies by He et al. (2024) prune under-used experts or average their weights back into a single FFN, yielding a structurally dense network that preserves most task accuracy while significantly reducing memory and communication costs. In general, the sparse-to-dense training mode demonstrates that MoE can serve as a training scaffold: once knowledge has diversified across experts, it can be re-fused into a compact model without full retraining.

Finally, expert-model merging means that experts are trained independently and then fused together with a router. For example, the Branch-Train-Merge (BTM; Li et al., 2022) and Branch-Train-MiX (BTX; Sukhbaatar et al., 2024) pipelines train several dense experts independently on domain-specific data and later fuse their FFNs into a single MoE layer, leaving attention weights averaged. Then MoE is briefly fine-tuned in order to teach the router to exploit this cross-domain complementarity. The Fusion of Experts (FoE; Wang et al., 2024) framework generalizes this idea, treating merging as supervised learning over heterogeneous expert outputs. Expert-model merging avoids the heavy communication of joint MoE training and allows recycling of preexisting specialist checkpoints, which is particularly attractive when there are external reasons against combined training such as, e.g., privacy concerns.
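The merging step described above (keep each FFN as an expert, average the attention weights) can be sketched like this. The dictionary layout, the helper names, and the placeholder for the router are assumptions for illustration; real pipelines operate on full checkpoints and also have to handle embeddings, normalization layers, and so on.

```python
def average(tensors):
    """Elementwise mean of equally shaped nested lists of floats."""
    if isinstance(tensors[0], list):
        return [average(list(group)) for group in zip(*tensors)]
    return sum(tensors) / len(tensors)

def merge_branches(branches):
    """Merge independently trained dense models into one MoE-style layer."""
    return {
        # Attention weights are averaged across branches...
        "attention": average([b["attention"] for b in branches]),
        # ...while each branch's FFN is kept intact as a separate expert.
        "experts": [b["ffn"] for b in branches],
        # The router is new; it has to be learned in the short fine-tuning stage.
        "router": None,
    }

# Two hypothetical domain specialists with toy weights.
code_model = {"attention": [[1.0, 2.0], [3.0, 4.0]], "ffn": "ffn_code"}
math_model = {"attention": [[3.0, 4.0], [5.0, 6.0]], "ffn": "ffn_math"}
merged = merge_branches([code_model, math_model])
```

Nothing in this step requires the branches to have been trained together, which is what makes the approach attractive when joint training is impractical.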

Obviously, the dimensions of sparse vs. dense routing, PEFT placement, and training modes are orthogonal, and there have been plenty of different variations for each of these ideas; let me again link to the surveys here (Cai et al., 2024; Masoudnia, Ebrahimpour, 2024; Mu, Lin, 2025). Moreover, these are not even the only dimensions. One could also consider different ways to feed training data or different engineering approaches to sharding the experts on a computational cluster and organizing communication between them. But I hope that this section has given you a good taste of the diversity of MoE approaches, and perhaps we will investigate some other dimensions in the second part.

Conclusion

The journey of mixtures of experts has been a journey from relatively obscure ensemble methods that couldn’t quite find a good problem fit to the backbone of frontier AI systems. This evolution offers several important lessons about innovation in machine learning and points toward some future directions.

First, timing matters as much as brilliance. The core MoE ideas from Jacobs et al. (1991) were essentially correct—they just arrived twenty-five years too early. Without GPUs, distributed computing, and modern optimization techniques, even the most elegant algorithms remained academic curiosities. This pattern repeats throughout ML history. For example, the main ideas of Transformer architectures were arguably already introduced in the attention mechanisms of the early 1990s but needed computational scale to shine, diffusion models had spent years in relative obscurity before efficient sampling algorithms were developed, and reinforcement learning required massive simulation infrastructure to tackle large-scale problems. The lesson for researchers is both humbling and hopeful: your “failed” idea might just be waiting for its technological moment, and also, perhaps, we will hear more about other “failed” ideas in the future, when the time is right.

Second, sparsity is not just an optimization trick—it is an important design principle. The success of MoE models demonstrates that by activating only 1-10% of parameters per input, we can achieve what dense models cannot: gracefully scale to trillions of parameters while maintaining reasonable inference costs. This sparse activation pattern mirrors biological neural networks, where specialized regions handle specific functions, and human organizations, where human experts collaborate dynamically based on a specific task. Perhaps the future of AI lies not in ever-larger monolithic models but in large ecosystems of specialized components, assembled on-demand for each specific problem.

Third, the best architectures balance elegance with engineering. Every successful MoE implementation requires both theoretical insights and pragmatic engineering; we have talked about the former in this first part, and next time we will talk some more about the latter.

In the second part of this mini-series, we will explore how these same MoE principles are transforming computer vision and multimodal AI, where the routing problem becomes even more fascinating: how do you decide which experts should handle a pixel, a word, or the boundary between them? The principles we have explored—sparse activation, specialized expertise, dynamic routing—take on new dimensions when applied across modalities. Stay tuned!

Sergey Nikolenko
Head of AI, Synthesis AI
August 9, 2025
One-Eyed Likho (Лихо Одноглазое)

This is a fresh release from the Perm-based studio Morteshka, which specializes in Russian mythology. I played their “Black Book” with great pleasure two years ago; it was a small card game, but both the gameplay and the lore (real lore, based on bylichki folk tales and the history of the Perm region) were in the right place.

This time the approach is more global. The main message of the game is the popularization of Propp; the main collectibles are different versions of the tale about blinding a one-eyed enemy, from Polyphemus to Likho itself. But the game itself is done in the Russian style: a blacksmith and a tailor are going somewhere, end up in mystical liminal spaces, trick Likho, all that.

Gameplay and style are also completely different: it is a first-person game, mostly about puzzles, with a couple of stealth segments. In style and visuals the game reminded me of INDIKA, which I should also look back on sometime: first-person view, Russian chthonic gloom, good Russian voice acting. Only here there is also a bit of horror and a lot of mysticism, which is logical for a fairy tale. It is quick to complete, and most puzzles are very simple (although sometimes it is not quite clear what the game wants from you).

The game left a good impression. Perhaps the only downside is that everything that happens is predictable by design. Yes, it is a well-known fairy tale, but one could, for example, add some metanarrative. But the imagery is great, the style is cool, and you can see that it was made on a small budget but with soul. Well done, Morteshka; I will be waiting for the next releases.

Sergey Nikolenko

P.S. You can comment on and discuss this post in the “Sineкура” channel: join us!

August 8, 2025
Signalis

Oh, this is a powerful work of art. The authors have built a whole world with a (truly) stunning story and deep lore, and then began to serve it all skillfully in small portions that are not that easy to tie together, but every new note leaves you wanting to find the next one. I will note the development history separately: the German studio rose-engine consists of literally two people, Yuri Stern and Barbara Wittmann, and they were making Signalis from 2014 to 2022. This, of course, inspires genuine admiration.

Gameplay-wise it is a typical early Resident Evil, said I, having never played Resident Evil, Silent Hill, or anything at all on PlayStation 1/2, and knowing the genre only by hearsay. But here you cannot mistake it for anything else, and the graphical style of the original PlayStation also adds nicely to the horror… and sometimes it switches to a first-person view, which is also done very well. In general, you can eat the atmosphere with a spoon from the beginning to the end of the game.

I really liked the style, the lore, and the oppressive horror atmosphere without much actual horror: honestly, I am afraid of “real” horror games with jump scares to the point where it becomes unpleasant, while here it is just tense and very depressing from start to finish, and that is exactly what I like.

What I did not like is that the mysteriousness of the Signalis lore did not go anywhere until the very end. The game never explained anything coherent to me, and, honestly, by the end of the game my head was quite a mess, although I did not take long breaks and finished the game in about a week.

I could sort out the mess only after watching the video “The Story of Signalis”; there are spoilers, do not watch it if you have not played! The game gets extra credit for the fact that I really wanted to figure things out and wanted to watch this video. But still, I presumptuously think that if I understood nothing after finishing the game, then most likely many players understood nothing… So it turns out that the developers came up with a very cool story that they told to very few people. Moreover, it seems that this is not a bug but a feature, and the video I watched is just one of many interpretations (here is one about the dream theory, for example, also with spoilers).

But overall, for fans of survival horror, fans of good science fiction, and fans of depressing [quasi|pseudo]philosophical works of any genre, I categorically recommend it. Also, Signalis is first and foremost about love. Like everything in the world.

Sergey Nikolenko

P.S. You can comment on and discuss this post in the “Sineкура” channel: join us!

August 8, 2025
Don’t Turn Off the Wiretap: CoT and Interpretability
In July 2025, an important paper on AI safety came out with more than 40 authors and a very wide range of affiliations: OpenAI, Anthropic, DeepMind, METR, Redwood Research, Meta, UK AI Security Institute, Apollo Research… The authors include deep learning legend Yoshua Bengio, Wojciech Zaremba, Daniel Kokotajlo, and famous AI safety researchers Neel Nanda, Anca Dragan, Dan Hendrycks, and Victoria Krakovna; the “expert endorsers” include Geoffrey Hinton, Sam Bowman, John Schulman, and Ilya Sutskever… What could all of them unanimously agree on?

What is CoT and why it matters

Chain of Thought (CoT) is the ability of language models to “think out loud”, laying out their reasoning step by step in natural language before giving the final answer. While earlier models like ChatGPT simply generated text word by word, the new generation of reasoning models, starting with OpenAI o1 and onward, is specifically trained to carry out extended reasoning in text form. Roughly speaking, CoT is a scratchpad given to the models so that they can write down intermediate steps and hypotheses that will not necessarily make it into the final answer for the user.

The whole point of the paper “Chain of Thought Monitorability: A New and Fragile Opportunity for AI Safety” boils down to telling the developers of the world’s best LLMs: do not give up on the plain old human-readable CoT! Why has this question suddenly become relevant, why is human-readable CoT so important, and what might replace it?

A bit about interpretability

As the capabilities of AI systems grow, so do the associated risks. One of the key safety problems is the opacity of how exactly models make decisions. We see the input and the result, but the process in between remains a “black box”, those very “giant inscrutable matrices” that Eliezer Yudkowsky used to like talking about. And the more capable a model is, the larger the potential negative effect of its wrong decisions. The stronger you are, the more you can break; and, of course, there is already plenty to worry about, since we are talking about systems capable of complex planning, writing code, or making important decisions.

That is why one of the central directions in AI safety is interpretability. I have recently written a long post about it, and Dario Amodei has written the essay “The Urgency of Interpretability”; if you are interested, I recommend at least the latter, but I would be glad if you read the former too.

Traditional approaches to AI interpretability, in particular mechanistic interpretability, which tries to understand how neural networks work at the level of individual neurons and small circuits composed of them (it is actively developed by Anthropic, and I discuss it in detail in that post), have been progressing quite fast lately. This is exactly the direction that produced “Golden Gate Claude” a year ago:

But still, success there is far from complete: we can sometimes notice a thing or two, but we are still very far from fully understanding how a large neural network makes decisions.

How CoT helps AI safety

And here comes the chain-of-thought “scratchpad”, which offers the most direct route possible: if the model “thinks” in human language, we can literally read its thoughts! And it turns out that these thoughts are, so far, indeed truthful, and we really do see what the model is thinking, even in cases when it later decides to deceive us!

The authors of the paper claim (and this is a generally indisputable claim) that CoT monitoring is a unique opportunity for ensuring the safety of AI systems. They give plenty of convincing examples.

Here, Meinke et al. (2024) report that models are already capable of in-context scheming (pursuing hidden goals without admitting it), but the CoT in this work always gives everything away:

And Baker et al. (2025) wrote a whole paper about CoT monitoring, and it is also full of examples where the model simply and honestly thinks about what it is going to do, even if the final actions end up looking innocent:

It seems like a wonderful opportunity that should be used as much as possible.

Why CoT may deteriorate

However, this opportunity is fragile. Korbak et al. (2025) warn about several factors that could destroy it.
- The language may become incomprehensible due to RL. As reinforcement learning is scaled up, models may start using “languages” for their reasoning that are less and less comprehensible to humans. This is a long-known effect; the same first author wrote a paper back in 2022 showing that if you fine-tune a regular LLM (there were no reasoning models at all back then) with rewards based on outcomes rather than texts, the language becomes less and less readable (Korbak et al., 2022).
- Optimizing the CoT, both directly and indirectly, spoils it. Attempts to make reasoning shorter or “safer” may lead to it no longer reflecting the model’s actual thinking process. The main potential culprits here are variants of process reward models (models that evaluate the path to the solution in RL and not just the answer), which would make sure that LLMs do not commit “thoughtcrimes”; the result, of course, may be that models simply stop thinking anything extra in the open and pass the necessary information along in some other way. Such examples are already known, and even indirect optimization such as “let us show the CoT to the people who evaluate the result in RLHF, and it will affect their ratings” can also lead to CoT obfuscation (Baker et al., 2025).
- Reasoning in the latent space does not produce a CoT. Models capable of reasoning in the latent space without verbalization can completely bypass the need for human-readable notes in the “scratchpad”. This direction is being developed very actively right now, by the way (Hao et al., 2024; Geiping et al., 2025).

What should we do?

Korbak et al. (2025) formulate several concrete recommendations, which I personally can only join:
- Develop effective metrics for how monitorable the CoT is, that is, create standardized tests for evaluating to what extent chains of thought actually reflect the model’s “thinking” process.
- Publish evaluation results: include information about monitorability in the technical reports on models.
- Take this capability into account when making decisions: treat CoT monitorability as an important safety factor alongside other metrics.
- When developing various methods related to reasoning in LLMs, study how different training methods affect the comprehensibility of reasoning.

By the way, there is one more optimistic observation here: in reality, monitoring does not always have to work in 100% or 99% of the cases, even when the cost of an error is high. It would suffice if monitoring could be rerun in such a way that errors do not repeat and, preferably, are not too correlated with each other; then, clearly, any constant probability of catching an error will quickly get very close to one. In the context of LLM agents, Bhatt et al. (2025) have recently written a whole paper about this titled “Ctrl-Z: Controlling AI Agents via Resampling”.
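The arithmetic behind this observation is short. The sketch below assumes fully independent monitoring attempts, which is an idealization: with correlated errors the probability would grow more slowly. The function name is hypothetical.

```python
def detection_probability(p_single, n_checks):
    """Chance that at least one of n independent checks catches the problem."""
    return 1.0 - (1.0 - p_single) ** n_checks

# A monitor that catches a problem only half of the time, rerun several times.
for n in (1, 3, 5, 10):
    print(n, detection_probability(0.5, n))
```

Even a monitor that misses half of the problems on a single pass would, under the independence assumption, catch more than 99.9% of them after ten reruns.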

Философские размышления и выводы

Ситуация с CoT напоминает классическую проблему философии сознания: можем ли мы доверять тому, что кто-то говорит о своих мыслях? В случае с AI-моделями эта проблема усугубляется тем, что сами “мысли” являются результатом оптимизации. Модель может научиться порождать правдоподобные, но вводящие в заблуждение объяснения — и очень важно, чтобы мы никоим образом не пытались это поощрять.

Есть и ещё один фундаментальный вопрос: является ли человеческий язык необходимым инструментом для сложных рассуждений или просто удобным промежуточным представлением? Скорее всего, конечно, второе, и для того, чтобы мониторинг CoT мог стать долгосрочным решением для AI-безопасности, нам нужно приложить нетривиальные усилия. Иначе “мысли” LLM быстро станут настолько же непостижимыми, как и их внутренние активации.

The authors honestly acknowledge the limitations of their approach: CoT monitoring is not a panacea, and it should only be used as an additional layer of defense. Still, while we have this opportunity to peek into the “thoughts” of LLMs, it would be unwise to ignore it. We should keep studying, developing, and using chain-of-thought monitoring while this window into LLM thinking remains open. After all, it may close sooner than we think.

Sergey Nikolenko
August 4, 2025
Mixture-of-Recursions: Thinking over Tokens Recursively and Adaptively
AI models keep growing, and the “bitter lesson” and scaling laws still hold, but the growing compute requirements are becoming increasingly unaffordable even for the giants of the market. One of the key questions of modern AI is how to do the same thing more efficiently. In this post, I discuss a new approach that very elegantly combines several known ideas and looks quite promising.

Three paths to efficiency

Researchers from KAIST and Google, Bae et al. (2025), have introduced a new idea called Mixture-of-Recursions (MoR). To understand what it does, let us first recall three main strategies for making Transformers more efficient without radically changing the architecture (without switching to Mamba-like models, for example):
- Mixture-of-Experts (MoE) adds activation sparsity at inference: the model has many expert subnetworks, and a special router chooses which subnetwork to use for each token at each MoE layer; as a result, the network has a huge number of parameters, but only a small subset of them is activated for each token; MoE is a vast field that I hope to cover in much more detail soon, but in general all modern giant models are built this way;
- layer tying reduces the number of parameters by reusing the same weights across different layers; this strategy has been used at least since Universal Transformers (Dehghani et al., 2018) and keeps reappearing to this day (Gholami, Omar, 2023; Bae et al., 2025b);
- early exiting also adds sparsity, like MoE, but does so by stopping processing at early layers for simpler tokens; the idea goes back to Depth-Adaptive Transformers (Elbayad et al., 2020) and keeps evolving; for example, the LayerSkip architecture (Elhoushi et al., 2024) appeared recently.
Traditionally, researchers have used only one of these approaches at a time, especially since the strategies target different things: weight tying saves memory but not inference time, while sparse activations or early exits, conversely, do nothing to reduce the required memory. MoR claims that both problems can and should be solved with a single architectural decision. Let us first look at the immediate predecessors and inspirations of MoR, and then move on to the paper itself.
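For readers who have not seen MoE routing before, the first strategy in the list can be sketched in a few lines. This is a toy illustration under my own assumptions: the router is a plain linear map, the experts are simple scaling functions, and top-2 routing with renormalized weights is one common choice rather than the design of any particular model. All names here are hypothetical.

```python
import math


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def moe_layer(token, router_weights, experts, top_k=2):
    """Send one token vector to its top_k experts and mix their outputs."""
    # Router: one logit per expert (a linear map, assumed for illustration).
    logits = [sum(w * x for w, x in zip(row, token)) for row in router_weights]
    probs = softmax(logits)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(token)
    for i in chosen:  # only top_k experts are evaluated: this is the sparsity
        weight = probs[i] / norm
        expert_out = experts[i](token)
        out = [o + weight * v for o, v in zip(out, expert_out)]
    return out, chosen


# Four toy "experts" that just scale the input by 1, 2, 3, and 4.
experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
router_weights = [[0.0, 0.0], [2.0, 0.0], [1.0, 0.0], [-1.0, 0.0]]
output, chosen = moe_layer([1.0, 0.0], router_weights, experts)
print(chosen, output)
```

All four experts count towards the parameter budget, but only two of them run for this token.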

Mixture-of-Depths and recursive Transformers

The first important predecessor of this work is the Mixture-of-Depths idea proposed by researchers from Google DeepMind (Raposo et al., 2024). They introduced token-level routing for adaptive computation. While MoE models train a router to choose between different subnetworks (experts), Mixture-of-Depths trains a router to choose whether to apply a layer or skip straight to the residual connection.
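The "apply the layer or take the residual path" decision can be sketched roughly as follows. This is a simplified illustration, not the implementation from Raposo et al. (2024): the router scores are passed in as plain numbers, `capacity` is a hypothetical name for how many tokens the layer may process, and I omit the scaling of the block output by the router weight that a trainable version would need.

```python
def mixture_of_depths_layer(tokens, scores, layer_fn, capacity):
    """tokens: list of vectors; scores: one router score per token."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    selected = set(order[:capacity])  # only these tokens pay for the layer
    out = []
    for i, x in enumerate(tokens):
        if i in selected:
            update = layer_fn(x)
            out.append([a + b for a, b in zip(x, update)])  # residual + layer output
        else:
            out.append(list(x))  # skipped: the residual connection only
    return out, selected


tokens = [[1.0], [2.0], [3.0]]
scores = [0.1, 0.9, 0.5]
out, selected = mixture_of_depths_layer(
    tokens, scores, layer_fn=lambda x: [10 * v for v in x], capacity=1
)
print(out, selected)
```

With a capacity of one, only the highest-scoring token is transformed, and the other two pass through unchanged.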

And if we combine layer tying with early exits, we get recursive Transformers (Bae et al., 2025b): models that repeat a single block of K layers several times, creating a looped architecture. For example, instead of 30 unique layers, a recursive model has only 10 layers that are applied three times, which accordingly cuts the number of parameters by a factor of three.
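The 30-versus-10 arithmetic above can be sketched as a loop over a shared block. The "layers" here are trivial stand-in functions and the per-layer parameter count is a made-up number; only the looping structure is the point.

```python
def run_recursive(x, block, num_loops):
    """Apply the same block of layers num_loops times (the weights are shared)."""
    for _ in range(num_loops):
        for layer in block:
            x = layer(x)
    return x


def unique_params(num_unique_layers, params_per_layer):
    return num_unique_layers * params_per_layer


# A block of 10 toy layers; each one just adds 1, so the result counts layer applications.
block = [lambda v: v + 1 for _ in range(10)]
effective_depth = run_recursive(0, block, num_loops=3)

# Hypothetical 1M parameters per layer: 30 unique layers vs. 10 shared ones.
ratio = unique_params(30, 1_000_000) / unique_params(10, 1_000_000)
print(effective_depth, ratio)
```

The input still passes through 30 layer applications, but only 10 layers' worth of weights has to be stored.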

But what is the point of applying the same layers several times? An early layer designed to work with the tokens themselves would be useless when applied to their global semantic representations in higher layers. Indeed, it would be better to have a way to modify the layers as we go. This is why Relaxed Recursive Transformers (RRT) add small LoRA adapters to the layers, which can be trained separately for each loop iteration.
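As a rough sketch of this relaxation, each loop iteration can use the shared weight matrix plus its own low-rank correction, so the effective weights at iteration i are W + B_i A_i. The code below is my own toy version with plain Python lists and a single linear "layer"; it is not the RRT implementation, and all names are illustrative.

```python
def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def relaxed_layer(x, shared_w, lora_a, lora_b):
    """Compute (W + B A) x without ever materializing the full B A matrix."""
    base = matvec(shared_w, x)
    low_rank = matvec(lora_b, matvec(lora_a, x))
    return [u + v for u, v in zip(base, low_rank)]


def relaxed_recursion(x, shared_w, adapters):
    """One pass per (A_i, B_i) pair; shared_w is reused in every iteration."""
    for lora_a, lora_b in adapters:
        x = relaxed_layer(x, shared_w, lora_a, lora_b)
    return x


shared_w = [[1.0, 0.0], [0.0, 1.0]]         # shared d x d weights (identity here)
adapters = [
    ([[1.0, 0.0]], [[0.0], [1.0]]),         # rank-1 adapter for iteration 1
    ([[0.0, 0.0]], [[0.0], [0.0]]),         # zero adapter: iteration 2 uses W as is
]
result = relaxed_recursion([1.0, 2.0], shared_w, adapters)
print(result)
```

For hidden size d and rank r, every extra iteration costs 2·d·r adapter parameters instead of another d·d matrix, which is why the relaxation stays cheap when r is small.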

Bae et al. (2025b) found that their RRTs consistently outperform standard architectures: if you distill a pretrained Gemma 2B model into a recursive Gemma 1B, the results are much better than when distilling into a regular Gemma 1B, and they approach the results of the original model with 2B parameters.

Mixture-of-Recursions: adaptive depth for every token

In their latest work, Bae et al. (2025) take the next step. They introduce routing mechanisms that decide, individually for each token, how many times to apply the recursive blocks. This is why the approach is called a “mixture of recursions”: small router subnetworks make token processing adaptive, dynamically assigning different recursion depths to individual tokens. This means that if the model needs to generate a simple function word, it can stop after the first pass through the recursive block, while for a semantically rich word that takes some thought to predict it can use, say, three iterations.

The illustration below shows (a) the structure of a router that can skip a recursion step, (b) the overall structure of the model, and (c) an example of how simpler tokens are produced with fewer recursion steps than semantically rich ones:

The idea is to give each token exactly as much processing time as it needs, no more and no less. The routers themselves are learned during training, developing an implicit understanding of which tokens deserve deeper processing.

Such adaptive routing can be implemented in different ways. The authors compare two strategies:
- expert-choice routing treats each recursion depth as an “expert” that chooses which tokens it wants to process; this approach guarantees perfect load balancing (a big problem for MoE models: how do you convince the router not to send everything to the same expert?), since each loop iteration processes a fixed number of tokens; but it takes a lot of work on causality to get autoregressive token generation right;
- token-choice routing takes a simpler path: each token decides in advance how many recursion steps it needs; although this may lead to load imbalance (what if all tokens want the maximum?), the authors show that this problem can be mitigated with additional regularization (load balancing losses).
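The difference between the two strategies is easier to see in code. Below is a simplified sketch under my own assumptions: router scores are given as plain numbers instead of being produced by a learned router, `keep_per_step` is a hypothetical name for the per-depth capacity, and the training-time details from the paper (auxiliary losses, causality fixes) are left out entirely.

```python
def expert_choice_depths(step_scores, keep_per_step):
    """step_scores[r][i]: score of token i at recursion step r.
    keep_per_step[r]: how many tokens step r is allowed to process."""
    n = len(step_scores[0])
    active = list(range(n))
    depths = [0] * n
    for r, k in enumerate(keep_per_step):
        # The step (the "expert") picks its top-k tokens among those still active.
        active = sorted(active, key=lambda i: step_scores[r][i], reverse=True)[:k]
        for i in active:
            depths[i] = r + 1
    return depths


def token_choice_depths(token_logits):
    """token_logits[i][d]: preference of token i for recursion depth d + 1."""
    # Every token commits to a depth up front; nothing enforces balance here.
    return [max(range(len(row)), key=lambda d: row[d]) + 1 for row in token_logits]


step_scores = [
    [0.9, 0.8, 0.1, 0.7],
    [0.2, 0.9, 0.0, 0.6],
    [0.5, 0.1, 0.0, 0.9],
]
ec = expert_choice_depths(step_scores, keep_per_step=[4, 2, 1])
tc = token_choice_depths([[2.0, 0.1, 0.0], [0.0, 0.0, 3.0]])
print(ec, tc)
```

In the expert-choice version the load at each depth is fixed by construction (4, 2, and 1 tokens here), while in the token-choice version nothing prevents every token from asking for the maximum depth, hence the need for a balancing loss.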
Here is an illustration of the two routing schemes from Bae et al. (2025):

Besides routing, part (c) of this figure also illustrates the two caching mechanisms proposed in the paper:
- recursion-wise KV caching (top) stores KV pairs only for the tokens that are actually processed at each recursion depth;
- recursive KV sharing (bottom) reuses the KV pairs computed at the first recursion step for all subsequent steps.
In practice, both routing approaches give good results (expert-choice is slightly better), and the caching scheme should be chosen as needed: recursive KV sharing is faster but requires more memory than recursion-wise KV caching.

Empirically, things also work out well for Bae et al. (2025): the results show that MoR is both a theoretically elegant idea and a practically useful one. When trained with the same compute budget, MoR models consistently outperform regular Transformers while using only about a third of the parameters of equivalent standard models. If you distill Gemma 2B into a recursive Gemma 1B, you get a much better model than by distilling into a regular Gemma 1B, with results almost matching the original 2B.

The MoR idea also leads to a significant inference speedup (up to 2x) thanks to a special mechanism called continuous depth-wise batching: since many tokens leave processing earlier than others, the freed slots can be immediately filled with new tokens. This also seems to be an important idea: the composition of the batch changes all the time, freed memory does not sit idle, and processing runs as fast as possible.
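Here is a toy simulation of why refilling freed slots helps. It is only a scheduling sketch under strong assumptions: every recursion step costs one time unit, a batch has a fixed number of slots, and each token is reduced to its required depth. The function names are mine, and the real mechanism in the paper works with actual sequences and KV caches.

```python
from collections import deque


def static_batching_steps(depths, batch_size):
    """Each batch runs until its deepest token finishes; early slots sit idle."""
    steps = 0
    for start in range(0, len(depths), batch_size):
        steps += max(depths[start:start + batch_size])
    return steps


def continuous_batching_steps(depths, batch_size):
    """A slot freed by a finished token is refilled before the next step."""
    waiting = deque(depths)
    slots = []  # remaining recursion steps for the tokens currently in the batch
    steps = 0
    while waiting or slots:
        while waiting and len(slots) < batch_size:
            slots.append(waiting.popleft())
        steps += 1
        slots = [d - 1 for d in slots if d > 1]  # tokens with no steps left exit
    return steps


depths = [3, 1, 1, 1, 3, 1, 1, 1]  # mostly "easy" tokens, a couple of "hard" ones
static_steps = static_batching_steps(depths, batch_size=2)
continuous_steps = continuous_batching_steps(depths, batch_size=2)
print(static_steps, continuous_steps)
```

Mixing tokens at different depths in one batch is only possible because every depth applies the same shared block, so a slot does not care which recursion step its token is on.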

Conclusion

Mixture-of-Recursions looks like an important work; most likely it is not the last word, but the direction is great. Perhaps the main criticism that comes to mind is that the experiments are still very small: I mentioned Gemma 2B, and indeed it is the largest model the paper experiments with. Still, it would be surprising if the MoR idea suddenly stopped working at scale.

Interestingly, MoR also naturally supports test-time scaling: by adjusting the recursion depth at inference, you can tune the trade-off between quality and speed. In a sense, MoR provides a natural mechanism for complex reasoning in latent space: since tokens can go through several rounds of processing, and the router itself chooses the number of rounds, a MoR model can learn to “think before it speaks”.

Overall, I think this line of work (Mixture-of-Depths, recursive Transformers, Mixture-of-Recursions) is a very promising direction. The result is large AI models that are adaptive not only in which subnetworks they use (like regular MoE models) but also in how much computation they use.

Conceptually, MoR is a vertical version of Mixture-of-Experts: instead of distributing tokens across wide expert matrices, we send them (or do not send them) further in depth. By the way, it would be very easy to combine MoR and MoE, and this is a natural next step; I think it has not been done here only because the models are still fairly small overall. I would not be surprised if a “3D-sparse” model (tokens × depth × experts) appeared in this direction soon.

So, as I always say, even if physical scaling stops one way or another (it is already hard to shrink transistors any further), algorithmic progress will still bring us plenty of news and many-fold improvements in efficiency, in both speed and memory. Let's keep following the progress!

Sergey Nikolenko

P.S. You can comment on and discuss this post in the “Sineкура” channel: join us!
July 30, 2025
РОП of ТОП-ДС, with an Asterisk
Maxim Abramov sent me a funny link to a news item in the channel of his team's bachelor's program “Artificial Intelligence and Data Science” (the English abbreviation in the link turns out to be very rich; I wrote to Maxim about it, and I hope they change it). Thanks to the colleagues for deciding to highlight my talk out of the whole program, but in reality it was a small progress-report presentation among several dozen similar ones from different universities. This is exactly what I was doing for most of last week: taking a professional development course at ITMO called “Leading a Top-Level AI Educational Program: A Competency-Role Approach”.

In practice, this meant that I studied the competency-role model for the AI field developed by colleagues at ITMO and translated into its rough language the curriculum of the bachelor's program “AI360: Mathematics of Machine Learning”, which I now lead. Moreover, I lead it not just formally: at ITMO I was both taking the professional development course and defending SPbU's artificial intelligence programs, which had won the corresponding competitions (here is a news item on the SPbU website, for example). I hope to take a more active part in the life of the faculty starting this fall; we have just discussed these plans with Stanislav Smirnov.

And so: look and envy, I am the РОП of the ТОП-ДС program with an asterisk! РОП stands for “руководитель образовательной программы” (head of an educational program), but do not ask me why “Top” is written in capital Russian letters (yes, it is just “top”, it does not stand for anything), or why data science is abbreviated as “ДС”; you would expect either “НоД” or “DS”... but no, it is “ДС”.

I am joking, but in fact I am very far from seriously criticizing the model that ITMO has come up with. If I had been given the task of developing such a mega-plan for the entire field of AI education, I definitely would not have done better. If we accept the need for such a document as a given carved in stone, the document itself turned out to be very reasonable. And many lectures in the professional development program were genuinely interesting. But in essence, the main thing required of us there did, of course, come down to a large and not very meaningful amount of bureaucratic work.

Personally, I was also sick all of last week, and by Friday, when the programs had to be defended, I was right in the phase when the fever is no longer too high, but you have a sore throat and a runny nose at the same time. So the photo fully reflects my inner and outer state at that moment.

Why am I telling you all this? Well, of course, to advertise the new bachelor's program at the Faculty of Mathematics and Computer Science (MCS) at SPbU! Come join us, we have lots of machine learning courses and large scholarships:

AI360: Mathematics of Machine Learning

All the “old” MCS bachelor's programs are, of course, still excellent; no “cannibalization”, as colleagues from ITMO elegantly called it, is happening or expected here. You can take my three-to-four-semester machine learning course in any of the programs, and in general all these programs are also turning to face artificial intelligence; what can you do, such is life these days:
I have no idea whether advertising makes sense right now or whether this is already only for next year, but in any case, come join us!

Sergey Nikolenko

July 29, 2025
Snufkin: Melody of Moominvalley

A sweet and simple game about the Moomins. Formally it is an RPG, with levels and side quests, but in reality, of course, it is a simple linear adventure. It all ends quickly: in 3-4 hours you will save Moominvalley, bring the evil characters over to the side of good, and see a lot of cuteness in a fairly classic Moomin style.

If you love the Moomins, I definitely recommend it. If you love kind children's games and like the aesthetics, then also yes. Although, by the way, the Moomins themselves are not all that childish by today's standards; they are rather sincere and true to life. Here, for example, is a passage from “Moominpappa at Sea”:

“We found a crate of whisky,” Moomintroll announced.
“Splendid!” said Moominmamma. “Now we must have a picnic!”

Personally, I do not regret playing it through; it was very sweet. But I would not call it very engaging either: do not expect any special plot twists or great gameplay here. Still, it is perfect for relaxing for an evening or on a trip; it lasted me two flights.

Sergey Nikolenko


July 25, 2025
Year Walk

Now this is the genre I like! A very short game that tells you interesting things and even plays a little with the fourth wall. I advise you to simply pick it up and play it through; it will take no more than a couple of hours and will be genuinely interesting.

Year walk (Årsgång) is a real Swedish tradition of a special kind of divination: before New Year or Christmas, people would first purify themselves in various ways and then set off on a long ritual walk. If everything went well on the walk, they could see the future there, a reflection of the events of the coming year. But besides Swedish folklore, the game also has a plot (most of which begins after the credits, so do not skip them!), and it all fits into two hours, so I will not spoil much; I simply recommend giving it a try.

Sergey Nikolenko


July 25, 2025
Timelie

A puzzle game whose core mechanic is stealth through time control. In each puzzle you (playing a female protagonist, that is) must get through a door without being seen by robots, and you can rewind time back and forth at any moment with great precision. There are, of course, a few more mechanics, including a terribly cute cat that is controlled separately and can attract attention by meowing; cats, of course, immediately give any game +1 point.

As a puzzle, Timelie works well: it is mostly easy to get through, but sometimes there is something to think about, and the time rewinding guarantees that there is no backtracking or replaying from scratch, which pure puzzle games are often guilty of. I definitely recommend it to puzzle lovers.

But personally, I kept badly missing a story. What is actually going on in the game is completely unclear; everything is very, very abstract. Where we are going and why is motivated only by the fact that an incomprehensible wave of destruction sometimes chases us. Surely this is some powerful metaphor for some kind of loss (of a cat?). If there had also been a good story here, as there was, for example, in the pure puzzle game Filament (I did not review it here, which is a pity; I should get back to reviews from the archives), this would have been an 8-9 out of 10 for me.

Sergey Nikolenko


July 25, 2025