C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning

1Sun Yat-sen University, 2Hong Kong Polytechnic University, 3Peking University, 4Huawei Noah’s Ark Lab


Abstract

Recent advances in multimodal large language models (MLLMs) have shown impressive reasoning capabilities. However, further enhancing existing MLLMs necessitates high-quality vision-language datasets with carefully curated task complexities, which are both costly and challenging to scale. Although recent self-improving models that iteratively refine themselves offer a feasible solution, they still suffer from two core challenges: (i) most existing methods augment visual or textual data separately, resulting in discrepancies in data complexity (e.g., over-simplified diagrams paired with redundant textual descriptions); and (ii) the evolution of data and models is also separated, leading to scenarios where models are exposed to tasks with mismatched difficulty levels. To address these issues, we propose C2-Evo, an automatic, closed-loop self-improving framework that jointly evolves both training data and model capabilities. Specifically, given a base dataset and a base model, C2-Evo enhances them by a cross-modal data evolution loop and a data-model evolution loop. The former loop expands the base dataset by generating complex multimodal problems that combine structured textual sub-problems with iteratively specified geometric diagrams, while the latter loop adaptively selects the generated problems based on the performance of the base model, to conduct supervised fine-tuning and reinforcement learning alternately. Consequently, our method continuously refines its model and training data, and consistently obtains considerable performance gains across multiple mathematical reasoning benchmarks.
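The closed loop described above can be sketched in a toy form: evolve problems to be harder, keep only those at a useful difficulty for the current model, and train on them. Everything below (the scalar "difficulty" representation, the `pass_rate` band, and the collapsed SFT/RL step) is a hypothetical illustration of the idea, not the authors' implementation.

```python
import random

random.seed(0)

def evolve_problem(difficulty):
    """Cross-modal data evolution (toy): produce a harder variant of a
    problem, where a problem is reduced to a difficulty score in [0, 1]."""
    return min(1.0, difficulty + random.uniform(0.05, 0.2))

def pass_rate(skill, difficulty):
    """Stand-in for evaluating the model on a problem: the further the
    difficulty exceeds the model's skill, the lower the pass rate."""
    return max(0.0, min(1.0, 1.0 - (difficulty - skill)))

def train(skill, problems, lr=0.1):
    """Stand-in for one SFT or RL round: the model's skill moves toward
    the mean difficulty of the selected problems."""
    if not problems:
        return skill
    target = sum(problems) / len(problems)
    return skill + lr * (target - skill)

def c2_evo_sketch(skill=0.3, rounds=4, low=0.2, high=0.8):
    data = [0.2, 0.3, 0.4]  # seed dataset as difficulty scores
    for r in range(rounds):
        candidates = [evolve_problem(d) for d in data]
        # Data-model evolution: keep problems whose pass rate for the
        # current model falls in a band (neither trivial nor hopeless).
        selected = [d for d in candidates if low <= pass_rate(skill, d) <= high]
        # The paper alternates SFT and RL; here both collapse to `train`.
        skill = train(skill, selected)
        data.extend(selected)
    return skill, data

final_skill, final_data = c2_evo_sketch()
```

The key design point this sketch mirrors is that the selection band couples the two loops: as the model improves, previously hard problems fall out of the band, so each round the retained data tracks the model's current frontier.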


BibTeX

@article{chen2025c2evocoevolvingmultimodaldata,
      title={C2-Evo: Co-Evolving Multimodal Data and Model for Self-Improving Reasoning}, 
      author={Xiuwei Chen and Wentao Hu and Hanhui Li and Jun Zhou and Zisheng Chen and Meng Cao and Yihan Zeng and Kui Zhang and Yu-Jie Yuan and Jianhua Han and Hang Xu and Xiaodan Liang},
      year={2025},
      eprint={2507.16518},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2507.16518}, 
}