Insights from LongEval at CLEF 2025

Last week in Madrid, we joined the Information Retrieval community for the annual Conference and Labs of the Evaluation Forum (CLEF). As co-organizers of the LongEval shared task, we were deeply involved in building the Scientific Search dataset for the SciRetrieval task. After nearly a year of preparation, it was fantastic to finally meet with participating teams and discuss their ideas and approaches.

The LongEval shared task investigates a crucial but often overlooked challenge in retrieval evaluation: most offline evaluations are static, but real-world search environments are dynamic. Users, collections, and relevance constantly evolve. By explicitly incorporating this temporal dimension, LongEval provides a unique testbed for longitudinal evaluation and a more realistic way to investigate a system’s long-term performance.

Why This Matters for STELLA and Continuous Evaluation

The evolving testbeds directly affect experiments conducted with STELLA. Rigorous pre-testing is a cornerstone of the continuous evaluation framework that STELLA envisions: new features should be pre-tested offline before being deployed for online testing. One challenge is to ensure that these offline pre-tests accurately reflect the current online environment, which is why dynamic, up-to-date testbeds are crucial. Furthermore, such testbeds help us understand how retrieval systems are affected by temporal changes and how long test collections remain valid, and ultimately allow the offline and online stages of the continuous evaluation cycle to be better aligned.

Keeping Test Collections Fresh and Relevant

This raises a critical question: How can we ensure test collections don’t become stale?

The LongEval datasets address this by using two key techniques:

  1. Corpus and Query Snapshots: The datasets are built from monthly snapshots of the document corpus and user queries, ensuring the collection remains current.
  2. Pseudo-Relevance from Click Models: Instead of relying on repeated manual assessments, relevance judgments are generated from logged user interactions using click models.
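
To make the second point concrete, here is a minimal sketch of how pseudo-relevance labels could be derived from a click log using an examination-debiased click-through rate. The log format, the position-bias estimates, and the grading thresholds are illustrative assumptions, not the actual LongEval pipeline.

```python
# Minimal sketch: pseudo-qrels from a click log (assumed format, not LongEval's pipeline).
from collections import defaultdict

# Hypothetical click log: one entry per impression of a document for a query.
click_log = [
    {"query_id": "q1", "doc_id": "d1", "rank": 1, "clicked": True},
    {"query_id": "q1", "doc_id": "d2", "rank": 2, "clicked": False},
    {"query_id": "q1", "doc_id": "d1", "rank": 1, "clicked": False},
    {"query_id": "q1", "doc_id": "d3", "rank": 3, "clicked": True},
]

# Assumed position-bias estimates: probability that a user examines a result at a
# given rank (in practice these would be estimated from the logs themselves).
examination_prob = {1: 1.0, 2: 0.6, 3: 0.4}

clicks = defaultdict(float)
examinations = defaultdict(float)
for entry in click_log:
    key = (entry["query_id"], entry["doc_id"])
    examinations[key] += examination_prob[entry["rank"]]
    if entry["clicked"]:
        clicks[key] += 1.0

def to_grade(ctr: float) -> int:
    """Bin the debiased click-through rate into graded labels (arbitrary thresholds)."""
    if ctr >= 0.75:
        return 2
    if ctr >= 0.25:
        return 1
    return 0

qrels = {key: to_grade(clicks[key] / examinations[key]) for key in examinations}
for (query_id, doc_id), grade in sorted(qrels.items()):
    print(f"{query_id} 0 {doc_id} {grade}")  # TREC qrels format
```

Because the labels come entirely from logged interactions, this step can be rerun for every new snapshot without any additional assessment effort.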

While pseudo-relevance labels are not on par with deeply judged editorial relevance labels, this approach offers strong advantages. Pseudo-relevance labels are far easier to obtain, which is crucial when annotation is needed repeatedly. Moreover, the resulting relevance judgments (qrels) reflect average user behavior while avoiding issues such as assessor fatigue that can arise from repeated manual judgments. The same technique is directly applicable to STELLA, where user interactions are already logged in the STELLA Server!

Simplifying Experiments with ir_dataset_longeval

Longitudinal experiments are time-consuming and tedious, requiring researchers to manage numerous test collections and repeatedly generate retrieval runs. To streamline this process, we introduced ir_dataset_longeval, an extension for the popular ir_datasets toolkit.

This extension simplifies longitudinal IR experiments by adding convenient functionality, including:

  • Easy iteration over previous collection snapshots.
  • Direct access to document and query timestamps.
  • Simplified loading of new or local snapshots from disk.

By taking over the data-maintenance overhead, the toolkit allows researchers to focus more on the evaluation itself.
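
For context, the following sketch shows roughly the per-snapshot loop that the extension streamlines, expressed here with the plain ir_datasets API. The snapshot identifiers are placeholders, and ir_dataset_longeval's own helper functions are not shown; the sketch only illustrates iterating over monthly snapshots and touching their documents, queries, and qrels.

```python
# Sketch of a longitudinal experiment over monthly snapshots with plain ir_datasets.
# The dataset IDs below are placeholders for illustration only.
import ir_datasets

# Hypothetical snapshot identifiers, ordered from oldest to newest.
snapshot_ids = [
    "longeval/2023-06",
    "longeval/2023-07",
    "longeval/2023-08",
]

for snapshot_id in snapshot_ids:
    dataset = ir_datasets.load(snapshot_id)

    # Build or update the index for this snapshot's documents ...
    num_docs = sum(1 for _ in dataset.docs_iter())

    # ... then retrieve and evaluate against this snapshot's queries and qrels.
    num_queries = sum(1 for _ in dataset.queries_iter())
    num_qrels = sum(1 for _ in dataset.qrels_iter())

    print(f"{snapshot_id}: {num_docs} docs, {num_queries} queries, {num_qrels} qrels")
```

The extension replaces the hand-maintained list of snapshot IDs with built-in snapshot iteration, exposes document and query timestamps, and can load new or local snapshots from disk.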

Get Involved and Look Ahead

Building on this year’s success, we have already proposed four new and exciting tasks for LongEval 2026. If this work interests you, stay tuned and consider participating! Furthermore, if you are interested in constructing snapshotted test collections for your own search systems, please feel free to reach out to us. We’d be happy to support you and share our experience.
