End To End Product Retrieval

Dataset paper: Coming Soon


Daniel Campos (University of Illinois), Surya Kallumadi (Lowes), Corby Rosset (Microsoft), ChengXiang Zhai (University of Illinois), Alessandro Magnani (Walmart)

For any questions, comments, or suggestions please email Daniel Campos



The Product Search Track studies information retrieval in the field of product search: given a large corpus of products, the user's goal is to find the product that best suits their need.

Our main goal is to study how end-to-end retrieval systems can be built and evaluated given a large set of products.

Track Tasks

The Product Search Track has three tasks: ranking, end-to-end retrieval, and multi-modal end-to-end retrieval. You can submit up to three runs for each of these tasks.

Each task uses the same training data, which originates from the ESCI Challenge for Improving Product Search, and shares the same set of evaluation queries.

Below the three tasks are described in more detail.

Product Ranking Task

The first task focuses on product ranking. In this task we provide an initial ranking of 100 products from a BM25 baseline, and you are expected to re-rank the products in terms of their relevance to the user's given intent.

The ranking task is deliberately focused: the candidate sets are fixed and there is no need to implement a complex end-to-end system, which makes experimentation quick and runs easily comparable.
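As a rough illustration of what re-ranking a fixed candidate set involves, the sketch below re-orders a list of (product id, text) pairs by a new score. The names `rerank` and `term_overlap` are hypothetical, and the term-overlap scorer is only a toy stand-in for whatever model a participant would actually use.

```python
from typing import Callable, List, Tuple


def term_overlap(query: str, text: str) -> float:
    """Toy scorer: fraction of query terms that appear in the product text."""
    q_terms = set(query.lower().split())
    d_terms = set(text.lower().split())
    return len(q_terms & d_terms) / len(q_terms) if q_terms else 0.0


def rerank(
    query: str,
    candidates: List[Tuple[str, str]],
    score_fn: Callable[[str, str], float],
) -> List[Tuple[str, float]]:
    """Re-order a fixed candidate list by a new relevance score, best first."""
    scored = [(pid, score_fn(query, text)) for pid, text in candidates]
    # sorted() is stable, so tied products keep their initial (e.g. BM25) order.
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    candidates = [
        ("p1", "red running shoes"),
        ("p2", "blue cordless drill"),
        ("p3", "cordless drill battery pack"),
    ]
    for pid, score in rerank("cordless drill battery", candidates, term_overlap):
        print(pid, round(score, 3))
```

Because the candidate set never changes, two runs that differ only in `score_fn` can be compared directly.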

Product Retrieval Task

The second task focuses on end-to-end product retrieval. In this task we provide a large collection of products, and participants need to design end-to-end retrieval systems that leverage whichever information they find relevant or useful.

Unlike the ranking task, the focus here is on understanding the interplay between retrieval and re-ranking systems.
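One way to picture that interplay is a two-stage pipeline: a cheap first stage selects candidates from the whole collection, and a second stage re-orders only those candidates. The sketch below is a minimal illustration under that assumption; `build_index`, `retrieve`, and `search` are hypothetical names, and the term-match first stage stands in for any real retriever.

```python
from collections import Counter, defaultdict
from typing import Callable, Dict, List, Set, Tuple


def build_index(corpus: Dict[str, str]) -> Dict[str, Set[str]]:
    """Map each lowercased term to the set of product ids containing it."""
    index: Dict[str, Set[str]] = defaultdict(set)
    for pid, text in corpus.items():
        for term in text.lower().split():
            index[term].add(pid)
    return index


def retrieve(query: str, index: Dict[str, Set[str]], k: int) -> List[str]:
    """First stage: top-k products by number of matching query terms."""
    hits: Counter = Counter()
    for term in set(query.lower().split()):
        for pid in index.get(term, ()):
            hits[pid] += 1
    # Sort by match count (descending), then by id so the order is deterministic.
    ranked = sorted(hits.items(), key=lambda kv: (-kv[1], kv[0]))
    return [pid for pid, _ in ranked[:k]]


def search(
    query: str,
    corpus: Dict[str, str],
    index: Dict[str, Set[str]],
    rerank_fn: Callable[[str, str], float],
    k: int = 100,
) -> List[Tuple[str, float]]:
    """Second stage: re-score only the first-stage candidates."""
    candidates = retrieve(query, index, k)
    scored = [(pid, rerank_fn(query, corpus[pid])) for pid in candidates]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)
```

In this arrangement the re-ranker can never recover a product that the first stage missed, which is why the choice of `k` and of the first-stage retriever matters as much as the re-ranker itself.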

Multi-Modal Product Retrieval Task

The third task focuses on end-to-end product retrieval using multiple modalities. In this task we provide a large collection of products where each product features additional attributes and information, such as related clicks and images. Participants need to design end-to-end retrieval systems that leverage whichever information they find relevant or useful.

The focus of this task is to understand the interplay between different modalities and the value that additional, potentially weak, data provides.
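A simple way to probe the value of an extra modality is late fusion: score products separately per modality, normalize, and combine with weights. The sketch below is only one possible setup, not a prescribed method; the modality names, the weights, and the helper names `min_max` and `fuse` are all assumptions made for illustration.

```python
from typing import Dict


def min_max(scores: Dict[str, float]) -> Dict[str, float]:
    """Rescale one modality's scores to [0, 1] so modalities are comparable."""
    if not scores:
        return {}
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        # All scores are equal, so this modality carries no ranking signal.
        return {pid: 0.0 for pid in scores}
    return {pid: (s - lo) / (hi - lo) for pid, s in scores.items()}


def fuse(
    per_modality: Dict[str, Dict[str, float]],
    weights: Dict[str, float],
) -> Dict[str, float]:
    """Weighted sum of normalized per-modality scores for each product id."""
    fused: Dict[str, float] = {}
    for modality, scores in per_modality.items():
        weight = weights.get(modality, 0.0)
        for pid, score in min_max(scores).items():
            # A product missing from a modality simply gets no contribution from it.
            fused[pid] = fused.get(pid, 0.0) + weight * score
    return fused
```

Setting a modality's weight to zero gives a quick ablation: comparing the fused ranking with and without that modality shows how much the weaker signal changes the result.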

Use of external information

You are allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what resources you used, for example an external corpus such as Wikipedia, a pre-trained model, or a proprietary corpus.

This will allow us to analyze the runs and break them down into types.

Query Selection





As mentioned above, all tasks share the same training data and test queries, so only one dataset is provided below.

End To End Retrieval

Type | Filename | File size | Num Records | Description | Format
Corpus | coming soon | N GB | | | json: docid, url, title, body
Train | coming soon | N GB | | | tsv: qid, query
Dev | coming soon | N GB | | | tsv: qid, query
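Since the files are not yet released, the following is only a sketch of how the listed formats could be read. It assumes the query files are headerless tab-separated lines of qid and query, and that the corpus is stored as one JSON object per line; both are assumptions, and the function names are hypothetical.

```python
import csv
import json
from typing import Dict, Iterable


def read_queries(lines: Iterable[str]) -> Dict[str, str]:
    """Parse tab-separated 'qid<TAB>query' lines into a qid -> query mapping."""
    reader = csv.reader(lines, delimiter="\t", quoting=csv.QUOTE_NONE)
    return {row[0]: row[1] for row in reader if len(row) >= 2}


def read_corpus(lines: Iterable[str]) -> Dict[str, dict]:
    """Parse one JSON object per line, keyed by its 'docid' field."""
    corpus: Dict[str, dict] = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        record = json.loads(line)
        corpus[str(record["docid"])] = record
    return corpus
```

Both functions accept any iterable of lines, so they can be called on an open file handle, for example `read_queries(open(path, encoding="utf-8", newline=""))`, where `path` is whatever filename is eventually published.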

Submission, evaluation and judging

We will follow the classic TREC submission format, which is repeated below. White space is used to separate columns. The width of the columns is not important, but each line must have exactly six columns with at least one space between them.

1 Q0 pid1    1 2.73 runid1
1 Q0 pid2    2 2.71 runid1
1 Q0 pid3    3 2.61 runid1
1 Q0 pid4    4 2.05 runid1
1 Q0 pid5    5 1.89 runid1

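where the six columns are, in order: the query id, the literal Q0 (currently unused), the product id, the rank at which the product is retrieved, the score that produced the ranking (non-increasing down the list for each query), and the id of the run you are submitting.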
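A minimal sketch of producing lines in the standard six-column TREC run layout is shown below. The function name `format_run` and the four-decimal score formatting are illustrative choices, not requirements of the track.

```python
from typing import Dict, List, Tuple


def format_run(
    results: Dict[str, List[Tuple[str, float]]],
    run_id: str,
) -> List[str]:
    """Turn per-query (product id, score) lists into six-column TREC run lines."""
    lines: List[str] = []
    for qid, scored in results.items():
        # Rank by score, highest first, so scores are non-increasing per query.
        ordered = sorted(scored, key=lambda pair: pair[1], reverse=True)
        for rank, (pid, score) in enumerate(ordered, start=1):
            lines.append(f"{qid} Q0 {pid} {rank} {score:.4f} {run_id}")
    return lines


if __name__ == "__main__":
    run = format_run({"1": [("pid2", 2.71), ("pid1", 2.73)]}, "runid1")
    print("\n".join(run))
```

Ranks are derived from the sorted scores rather than passed in, which keeps the rank and score columns consistent with each other.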

As the official evaluation set, we provide a set of ? queries, of which ? or more will be judged by NIST assessors. For this purpose, NIST will use depth pooling with separate pools for each task. Products in these pools will then be labelled by NIST assessors using multi-graded judgments, allowing us to measure NDCG.
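To make the metric concrete, here is a small sketch of NDCG at a cutoff computed from graded judgments. It assumes the graded label is used directly as the gain, a log2 rank discount, and that unjudged products count as grade 0; the official evaluation tooling may use a different gain function or cutoff, so treat this as illustrative only.

```python
import math
from typing import Dict, List


def dcg(gains: List[float]) -> float:
    """Discounted cumulative gain: the gain at 0-based position i is divided by log2(i + 2)."""
    return sum(gain / math.log2(i + 2) for i, gain in enumerate(gains))


def ndcg_at_k(ranked_pids: List[str], judgments: Dict[str, int], k: int) -> float:
    """NDCG@k for one query; assumes each product id appears at most once in the ranking."""
    gains = [float(judgments.get(pid, 0)) for pid in ranked_pids[:k]]
    ideal = [float(g) for g in sorted(judgments.values(), reverse=True)[:k]]
    ideal_dcg = dcg(ideal)
    # A query with no relevant judged products gets 0 rather than a division error.
    return dcg(gains) / ideal_dcg if ideal_dcg > 0 else 0.0
```

A ranking that places products in decreasing order of their judged grade scores 1.0, and any ranking that places a lower-graded product above a higher-graded one scores less.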

The main type of TREC submission is automatic, which means there was no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the queries, retrain your model, or make any other sort of manual adjustment after you see the test queries. Ideally, you only look at the test queries to check that they ran properly (i.e. no bugs) and then submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual and provide a description of the alterations that were performed.