Sharing is Caring: How Collective Reinforcement Learning Can Supercharge Language Models
arXiv: 2509.08721v1
Reinforcement learning (RL) has become the method of choice for post-training language models. RL lets models improve through trial and error, much as humans learn by practicing. Instead of relying only on fixed datasets, models can test answers, get feedback, and adjust. This is how techniques like Reinforcement Learning from Human Feedback (RLHF) helped align powerful systems such as ChatGPT.
But there is a problem. Traditional RL for large models is expensive, slow, and technically challenging. Training requires big GPU clusters, synchronization of model weights, and lots of careful engineering. All of this makes RL-based improvement out of reach for most people and organizations.
The paper we’re discussing today proposes a clever solution to these challenges. It introduces Swarm Sampling Policy Optimization (SAPO), a new decentralized approach that lets many models train together by sharing their experiences. Instead of one giant centralized system, SAPO works more like a community. Each model trains on its own device, shares what it has learned with others, and benefits from the discoveries of the group. The motto is simple but powerful: sharing is caring.
The Problem with Traditional RL for Language Models
When researchers want to improve a language model after pre-training, they usually turn to RL methods. For example, RLHF relies on human feedback to teach models to prefer helpful or accurate answers. Other versions, like RL with verifiable rewards (RLVR), use rule-based checks to judge correctness automatically.
These approaches have shown impressive results, especially in improving factual accuracy and reasoning. But scaling them up is not easy. Large-scale RL training typically requires big GPU clusters that must stay tightly synchronized, which creates communication bottlenecks and makes the process fragile. It is also expensive, because it means keeping many GPUs running in parallel.
Moreover, traditional RL systems assume uniformity: the same hardware, the same model size, and the same setup. This leaves out the vast world of smaller, more diverse devices like laptops or local servers that could otherwise contribute to training.
In short, the current way of doing RL for language models is powerful but limited. It benefits big players with deep pockets, while smaller contributors and communities are left out.
SAPO: Reinforcement Learning Through Sharing
The idea behind SAPO is simple but transformative. Imagine a swarm of models, each running on a different computer. These computers don’t need to be the same. Some might have GPUs, others just CPUs; some might run big models, others smaller ones. Each node trains its own model locally using reinforcement learning.
Here’s where it gets interesting: instead of keeping all of this experience to itself, each node shares its rollouts. A rollout is basically a set of answers generated in response to a question or task. For example, if a model is asked a math question, it might produce several possible solutions. These are its rollouts.
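To make that concrete, here is a minimal sketch of what a single shared rollout could look like once serialized as plain text. The structure and field names are my own illustration, not a format defined in the paper.

```python
import json

# A hypothetical rollout: one task plus several candidate answers produced by
# the local model (structure and field names are illustrative, not the paper's).
rollout = {
    "task": "Solve for x: 3x + 7 = 22",
    "completions": [
        "3x = 15, so x = 5.",
        "x = (22 - 7) / 3 = 5",
        "x = 29/3",  # an incorrect attempt is still part of the rollout
    ],
}

# Because the payload is plain text, any node can read it,
# regardless of its model architecture or tokenizer.
shared_payload = json.dumps(rollout)
print(shared_payload)
```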
In SAPO, these rollouts are shared across the network in plain text. Other nodes can then sample from this pool of shared experiences, re-encode them with their own models, and use them for training. In other words, when one model has an “aha moment,” that insight doesn’t stay locked up; it spreads to the rest of the swarm.
This has several advantages:
It removes the need for synchronization of model weights.
It allows heterogeneous hardware and models to participate.
It creates a collaborative effect, where models learn faster by benefiting from each other’s discoveries.
Importantly, each node has control over which rollouts it accepts. If a rollout seems unhelpful or incorrect, it can simply discard it. This ensures flexibility and protects against bad data polluting the system.
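Putting the pieces together, one round on a single node might look roughly like the sketch below. Everything here, generate_rollouts, SwarmPool, training_round, the min_reward filter, is a toy stand-in invented for illustration; it mirrors the flow described above (generate, share, sample, filter, train), not the authors' actual implementation.

```python
import random

# Minimal stand-ins so the sketch runs on its own; a real node would use an
# actual language model, reward signal, and a networked rollout pool.
def generate_rollouts(task, n):
    """Pretend to sample n candidate answers from the local model."""
    return [{"task": task, "answer": f"candidate {i}", "reward": random.random()}
            for i in range(n)]

class SwarmPool:
    """Toy in-memory stand-in for the swarm's shared plain-text rollout pool."""
    def __init__(self):
        self.rollouts = []

    def publish(self, rollouts):
        self.rollouts.extend(rollouts)

    def sample(self, n):
        return random.sample(self.rollouts, min(n, len(self.rollouts)))

def training_round(task, pool, n_local=4, n_external=4, min_reward=0.2):
    # 1. Generate local rollouts and share them with the swarm.
    local = generate_rollouts(task, n_local)
    pool.publish(local)

    # 2. Sample from the shared pool (in a real swarm, mostly other nodes'
    #    experience) and discard anything the local filter deems unhelpful.
    external = [r for r in pool.sample(n_external) if r["reward"] >= min_reward]

    # 3. In a real system the combined batch would be re-encoded with the local
    #    tokenizer and fed to an ordinary policy-gradient update (omitted here).
    return local + external

swarm_pool = SwarmPool()
batch = training_round("Solve for x: 3x + 7 = 22", swarm_pool)
print(f"training on {len(batch)} rollouts this round")
```

The n_local and n_external arguments control the local-versus-shared balance that the experiments below explore.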
The Swarm in Action
The researchers tested SAPO in both controlled experiments and large-scale open demos.
In controlled settings, they created a swarm of eight Qwen2.5 models, each with 0.5 billion parameters. These models were given tasks from the ReasoningGYM dataset, which generates problems in algebra, logic, arithmetic, and other reasoning domains. Each model would attempt problems, generate multiple answers, and then share some of these rollouts with the swarm.
The team tried different setups:
Models using only their own rollouts (no sharing).
Models mixing some local rollouts with some external ones.
Models relying heavily on external rollouts.
The results were striking. The best performance came from a balanced approach using half local rollouts and half external rollouts. In this case, cumulative rewards improved by 94% compared to the baseline where models didn’t share at all.
The reason is clear: when models share just enough, they can spread useful discoveries quickly without drowning in low-quality external data. If they rely too much on external rollouts, performance can actually suffer, because weaker models may spread bad habits.
The experiments showed that balance matters. Sharing is powerful, but too much dependence on others can destabilize learning.
Insights from the Open-Source Demo
Beyond controlled tests, the team also ran a large open-source demo. Thousands of community members participated, running models of different sizes and types on their own hardware. These models joined the swarm, shared rollouts, and learned collectively.
The demo revealed several interesting insights.
First, swarm training clearly helped smaller models. For example, Qwen2.5 models with 0.5 billion parameters improved significantly when trained in the swarm rather than in isolation. After about 175 rounds, swarm-trained models consistently outperformed those trained alone.
Second, stronger models like Qwen3 with 0.6 billion parameters didn’t show as much benefit. This suggests that SAPO’s collaborative edge may matter most for mid-sized models that can absorb and propagate rollouts effectively.
Third, because the demo used simple random sampling of rollouts, a lot of low-quality samples ended up in the pool. The researchers noted that smarter filtering strategies would likely boost performance further, especially for larger models.
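The paper leaves the design of such filters open. Purely as an illustration of the idea, the sketch below replaces uniform random sampling with sampling weighted by a score the receiving node computes locally; the weighted_sample helper and the toy scoring rule are assumptions of mine, not a method from the paper.

```python
import random

def weighted_sample(shared_rollouts, score, k):
    """Sample k shared rollouts with probability proportional to a local score.

    An illustrative alternative to uniform random sampling, not the paper's
    method: `score` stands in for whatever check the receiving node trusts
    (a verifier, a small reward model, simple heuristics, ...).
    """
    weights = [max(score(r), 1e-6) for r in shared_rollouts]  # avoid all-zero weights
    return random.choices(shared_rollouts, weights=weights,
                          k=min(k, len(shared_rollouts)))

# Toy usage: prefer rollouts whose answer passes a simple local check.
shared_rollouts = [
    {"task": "2 + 2", "answer": "4"},
    {"task": "2 + 2", "answer": "5"},
    {"task": "2 + 2", "answer": "four, i.e. 4"},
]
picked = weighted_sample(shared_rollouts,
                         score=lambda r: 1.0 if "4" in r["answer"] else 0.1,
                         k=2)
print(picked)
```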
The takeaway is that collective learning works in practice, even across messy, real-world conditions. With better strategies for selecting and weighting shared experiences, the benefits could be even greater.
Why This Matters
The implications of SAPO are big. By making RL post-training more efficient and accessible, it opens the door to a more democratic form of AI improvement. Instead of only big labs with expensive clusters improving models, communities of individuals with ordinary hardware could collaborate.
This could speed up progress in reasoning capabilities, make small models more useful, and encourage open-source development. Imagine thousands of laptops around the world contributing to a shared swarm, each learning from the others’ breakthroughs.
It also introduces a new way of thinking about AI training. Instead of isolated models being fine-tuned in closed systems, we could have networks of models teaching each other, passing along insights, and collectively getting smarter.
Challenges and Open Questions
Of course, SAPO is not a silver bullet. The paper notes several challenges that remain open for research.
One challenge is stability. When models rely too much on external rollouts, the system can oscillate between learning and forgetting. Finding the right balance between local and shared experience is key.
Another challenge is trust. In a large, open swarm, how can we make sure that the shared rollouts are high-quality and not malicious? Strategies for filtering and weighting rollouts will be crucial here.
There’s also the question of heterogeneity. The paper tested some scenarios with different models, but more systematic studies are needed. How does the swarm behave when models of very different sizes and architectures participate? What happens if humans or non-traditional “policies” are added to the mix?
Finally, there’s the exciting possibility of multi-modal swarms. Since SAPO is not tied to text, swarms could include image models, audio models, or even hybrid systems. One node’s reward function might favor aesthetics, while another favors accuracy. Together, the swarm could produce outputs that satisfy both.
The Bigger Picture
At its core, SAPO shows us a new way to think about intelligence. Just as humans learn not only from their own experiences but also from others, models can improve by sharing. The swarm becomes more than the sum of its parts, with collective breakthroughs spreading across the network.
This vision fits into a broader shift in AI research toward multi-agent systems. Instead of focusing on single powerful models, researchers are exploring how many models can collaborate, debate, specialize, and self-improve together. SAPO offers a practical, scalable way to make this collaboration happen during the training process.
In the future, we might see global networks of models that constantly teach each other, improving reasoning, creativity, and alignment far faster than any one lab could manage alone.
Conclusion
The phrase “sharing is caring” might sound simple, but in the context of AI training, it could be revolutionary. By letting models exchange experiences in a decentralized swarm, SAPO shows a way to overcome the cost and complexity of traditional reinforcement learning.
The results are clear: balanced sharing can nearly double performance, community swarms can drive real-world improvements, and collective learning offers a scalable path forward.
There are still challenges to solve, including stability, trust, and heterogeneity, but the promise is huge. Instead of AI progress being concentrated in a few well-funded labs, it could become a collective endeavor, powered by communities of contributors around the world.
As we look ahead, one thing is certain: the future of AI may not just be about building bigger models, but about teaching them to learn together.