End To End Product Retrieval

Coordinators

Daniel Campos (University of Illinois), Surya Kallumadi (Lowes), Corby Rosset (Microsoft), ChengXiang Zhai (University of Illinois), Alessandro Magnani (Walmart)

For any questions, comments, or suggestions, please email Daniel Campos or sign up for email updates.

Introduction

The Product Search Track studies information retrieval for product search: the setting in which a user searches a large corpus of products with the goal of finding the product that best suits their need.

Our main goal is to study how end-to-end retrieval systems can be built and evaluated given a large set of products.

Track Tasks

The Product Search Track has three tasks: ranking, end-to-end retrieval, and multi-modal end-to-end retrieval. You may submit up to three runs for each task.

Each task uses the same training data, originating from the ESCI Challenge for Improving Product Search, and shares the same set of evaluation queries.

The three tasks are described in more detail below.

Product Ranking Task

The first task focuses on product ranking. In this task we provide an initial ranking of 1,000 products from a BM25 baseline, and you are expected to re-rank these products by their relevance to the user's intent.

Ranking is a focused task: the candidate sets are fixed and there is no need to implement a complex end-to-end system, which makes experimentation quick and runs easy to compare.
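As a concrete illustration, here is a minimal re-ranking sketch. It assumes you have already loaded the queries, the provided BM25 candidates, and the product texts into Python dictionaries; the cross-encoder checkpoint, variable names, and run ID are illustrative assumptions, not an official baseline.

```python
# Minimal re-ranking sketch (not an official baseline).
# Assumes: queries = {qid: query text}, candidates = {qid: [docid, ...]} from the
# provided BM25 run, docs = {docid: product title + description}.
from sentence_transformers import CrossEncoder

model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # any re-ranker works here

def rerank(queries, candidates, docs, run_id="my-rerank-run"):
    """Return TREC-format result lines, one per (query, product) pair."""
    lines = []
    for qid, query in queries.items():
        pairs = [(query, docs[did]) for did in candidates[qid]]
        scores = model.predict(pairs)  # one relevance score per pair
        ranked = sorted(zip(candidates[qid], scores), key=lambda x: -x[1])
        for rank, (did, score) in enumerate(ranked, start=1):
            lines.append(f"{qid} Q0 {did} {rank} {score:.4f} {run_id}")
    return lines
```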

Product Retrieval Task

The second task focuses on end-to-end product retrieval. In this task we provide a large collection of products, and participants must design end-to-end retrieval systems that leverage whatever information they find relevant or useful.

Unlike the ranking task, the focus here is on understanding the interplay between retrieval and re-ranking systems.
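As a starting point, here is a minimal first-stage retrieval sketch over the simplified JSONL collection using Pyserini. The index path, BM25 parameters, and run ID are assumptions; any retrieval stack is acceptable.

```python
# First build a Lucene index over the Pyserini-style JSONL collection, e.g.:
#   python -m pyserini.index.lucene --collection JsonCollection \
#       --input collection_dir/ --index indexes/products-simple \
#       --generator DefaultLuceneDocumentGenerator --threads 8
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/products-simple")  # assumed local index path
searcher.set_bm25(k1=0.9, b=0.4)                      # illustrative BM25 parameters

def retrieve(queries, k=1000, run_id="bm25-baseline"):
    """queries: {qid: query text}. Yields TREC-format result lines."""
    for qid, query in queries.items():
        for rank, hit in enumerate(searcher.search(query, k=k), start=1):
            yield f"{qid} Q0 {hit.docid} {rank} {hit.score:.4f} {run_id}"
```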

Multi-Modal Product Retrieval Task

The third task focuses on end-to-end product retrieval using multiple modalities. In this task we provide a large collection of products in which each product carries additional attributes and information, such as related clicks and images, and participants must design end-to-end retrieval systems that leverage whatever information they find relevant or useful.

The focus of this task is to understand the interplay between modalities and the value that additional, potentially weak, data provides.
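One possible way to combine modalities is late fusion of text and image similarity scores. The sketch below assumes per-product fields such as product["title"], product["description"], and product["image"], and uses CLIP-style embeddings via sentence-transformers; the field names, checkpoints, and fusion weight are assumptions, not a prescribed pipeline.

```python
# Late-fusion sketch: blend a text-similarity score with an image-similarity score.
from sentence_transformers import SentenceTransformer, util
from PIL import Image

text_model = SentenceTransformer("all-MiniLM-L6-v2")  # text embeddings (assumed choice)
clip_model = SentenceTransformer("clip-ViT-B-32")     # joint image/text embeddings

def score_product(query, product, alpha=0.7):
    """Return a fused relevance score for one (query, product) pair."""
    q_text = text_model.encode(query, convert_to_tensor=True)
    p_text = text_model.encode(product["title"] + " " + product["description"],
                               convert_to_tensor=True)
    text_score = util.cos_sim(q_text, p_text).item()

    q_clip = clip_model.encode(query, convert_to_tensor=True)
    p_img = clip_model.encode(Image.open(product["image"]), convert_to_tensor=True)
    image_score = util.cos_sim(q_clip, p_img).item()

    return alpha * text_score + (1 - alpha) * image_score
```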

Use of external information

You are allowed to use external information while developing your runs. When you submit your runs, please fill in the form listing what evidence you used: for example, an external corpus such as Wikipedia, a pre-trained model, or a proprietary corpus.

When submitting runs, participants will be able to indicate which resources they used. This will allow us to analyze the runs and break them down by type.

Datasets

As mentioned above, all tasks share the same training data and test queries, so only one dataset is provided below.

All datasets can be found on Hugging Face under the TREC Product Search organization.
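If you work in Python, the files can be pulled with the Hugging Face `datasets` library. The repository IDs and split names below are assumptions for illustration; check the organization page for the current dataset IDs.

```python
# Loading sketch with the `datasets` library; repo IDs/splits are assumed, not official.
from datasets import load_dataset

corpus = load_dataset("trec-product-search/product-search-corpus", split="train")
qrels = load_dataset("trec-product-search/product-search-train-qrels", split="train")

print(corpus[0])  # e.g. a record with a docid, title, and description
```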

| Type | Filename | File size | Num Records | Description | Format |
| --- | --- | --- | --- | --- | --- |
| Query to Query ID | Query2QueryID | 946 KB | 30,734 | TREC-style QueryID to Query Text | tsv: qid\tquery |
| Collection | Collection (TREC Format) | 1.81 GB (568 MB compressed) | 1,661,907 | TREC-style corpus collection | tsv: docid\tTitle\tDescription |
| Collection | Collection (Pyserini Simplified) | 1.9 GB (573.2 MB compressed) | 1,661,907 | Pyserini-style simple JSONL corpus collection | json: docid, contents (Title and Description) |
| Collection | Collection (Pyserini Simplified) Parquet | 900.3 MB (770.9 MB compressed) | 1,661,907 | Pyserini-style simple Parquet corpus collection | json: docid, contents (Title and Description) |
| Collection | Collection (Pyserini Full) | 10.69 GB (2.8 GB compressed) | 1,661,907 | Pyserini-style full JSONL corpus collection | json: docid, contents (Title, Description, metadata, reviews) |
| Collection | Collection (Pyserini Full) Parquet | 4.56 GB (3.92 GB compressed) | 1,661,907 | Pyserini-style full Parquet corpus collection | json: docid, contents (Title, Description, metadata, reviews) |
| Collection | Collection JSONL | 10.3 GB (2.78 GB compressed) | 1,661,907 | JSONL collection | json: docid, contents (Title, Description, metadata, reviews) |
| Collection | Collection Parquet | 5.22 GB (4.53 GB compressed) | 1,661,907 | JSONL collection (Parquet) | json: docid, contents (Title, Description, metadata, reviews) |
| Train QREL | Train QRELS (TREC Format) | 6.8 MB (2.1 MB compressed) | 392,119 | Train QRELs | tsv: qid, 0, docid, relevance label |
| Train QREL (JSON) | Train QRELS (JSONL Format) | 21.5 MB (2.4 MB compressed) | 392,119 | Train QRELs (JSON) | json: qid, 0, docid, relevance label |
| Train QREL (Parquet) | Train QRELS (Parquet Format) | 2.4 MB (2 MB compressed) | 392,119 | Train QRELs (Parquet) | json: qid, 0, docid, relevance label |
| Dev QREL | Dev QRELS (TREC Format) | 2.9 MB (906 KB compressed) | 169,952 | Dev QRELs | tsv: qid, 0, docid, relevance label |
| Dev QREL (JSON) | Dev QRELS (JSONL Format) | 21.5 MB (2.4 MB compressed) | 169,952 | Dev QRELs (JSON) | json: qid, 0, docid, relevance label |
| Dev QREL (Parquet) | Dev QRELS (Parquet Format) | 2.4 MB (2 MB compressed) | 169,952 | Dev QRELs (Parquet) | json: qid, 0, docid, relevance label |
| 2023 Test Queries | 2023 Test Queries (TREC Format) | 52 KB (28 KB compressed) | 926 | 2023 test queries | tsv: qid, query text |
| Test QREL | Test QRELS (TREC Format) | 18 KB (6 KB compressed) | 998 | Test QRELs (recall base) | tsv: qid, 0, docid, relevance label |
| Test QREL (JSON) | Test QRELS (JSONL Format) | 15 KB (7 KB compressed) | 998 | Test QRELs (JSON) | json: qid, 0, docid, relevance label |
| Test QREL (Parquet) | Test QRELS (Parquet Format) | 13 KB (7 KB compressed) | 998 | Test QRELs (Parquet) | json: qid, 0, docid, relevance label |
| Triples | Train Triples JSONL | 6.23 GB (1.28 GB compressed) | 20,888 | Training triples, JSON format | json: qid, query, positive passages, negative passages |
| Triples | Dev Triples JSONL | 2.67 GB (550.3 MB compressed) | 8,956 | Development triples, JSON format | tsv: qid, query |
| Triples | Dev Triples Parquet | 938 MB (777.2 MB compressed) | 8,956 | Development triples, Parquet format | tsv: qid, query |
| Triples | Test Triples Parquet | 47 KB (37 KB compressed) | 926 | Test triples, Parquet format | tsv: qid, query |
| Triples | Test Triples JSON | 125 KB (30 KB compressed) | 926 | Test triples, JSON format | tsv: qid, query |
| Train Top 100 BM25 (Pyserini Simple Context) | Top 100 Train BM25 Simple | 107.3 MB (28.8 MB compressed) | 2,046,828 | BM25 top 100 for train queries | tsv: qid, doc_id, rank, score, run-name |
| Train Top 100 BM25 (Pyserini Full Context) | Top 100 Train BM25 Full | 103.7 MB (28.8 MB compressed) | 2,055,876 | BM25 top 100 for train queries | tsv: qid, doc_id, rank, score, run-name |
| Dev Top 1000 BM25 (Pyserini Simple Context) | Top 1000 Dev BM25 Simple | 107.3 MB (28.8 MB compressed) | 8,717,672 | BM25 top 1000 for dev queries | tsv: qid, doc_id, rank, score, run-name |
| Dev Top 1000 BM25 (Pyserini Full Context) | Top 1000 Dev BM25 Full | 447.4 MB (121.1 MB compressed) | 8,717,672 | BM25 top 1000 for dev queries | tsv: qid, doc_id, rank, score, run-name |
| Test Top 1000 BM25 (Pyserini Simple Context) | Top 1000 Test BM25 Simple | 4.1 MB (1.1 MB compressed) | 75,452 | BM25 top 1000 for test queries | tsv: qid, doc_id, rank, score, run-name |
| Test Top 1000 BM25 (Pyserini Full Context) | Top 1000 Test BM25 Full | 3.9 MB (1.1 MB compressed) | 75,696 | BM25 top 1000 for test queries | tsv: qid, doc_id, rank, score, run-name |
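For reference, here is a small sketch of reading the TREC-style qrels and BM25 run files with pandas. The local file names are placeholders; the column layouts follow the formats listed in the table above.

```python
# Reading sketch for the whitespace-separated qrels (qid, 0, docid, relevance)
# and the tab-separated BM25 runs (qid, doc_id, rank, score, run-name).
import pandas as pd

qrels = pd.read_csv("train.qrels", sep=r"\s+", header=None,
                    names=["qid", "unused", "docid", "relevance"])
bm25 = pd.read_csv("dev_bm25_top1000.tsv", sep="\t", header=None,
                   names=["qid", "docid", "rank", "score", "run"])

# Example: how many judged-relevant products appear in each query's candidate list.
relevant = qrels[qrels.relevance > 0]
coverage = bm25.merge(relevant, on=["qid", "docid"]).groupby("qid").size()
print(coverage.describe())
```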

Submission, evaluation and judging

We will follow the classic TREC submission format, repeated below. Whitespace is used to separate columns. The width of the columns is not important, but each line must contain exactly six columns with at least one space between them.

The expected format is:

    qid Q0 docid rank score run_id

where qid is the query ID, Q0 is the literal string "Q0", docid is the product ID, rank is the rank of the product for that query, score is the system's score, and run_id is an identifier for your run. For example:

    1 Q0 pid1 1 2.73 runid1
    1 Q0 pid2 2 2.71 runid1
    1 Q0 pid3 3 2.61 runid1
    1 Q0 pid4 4 2.05 runid1
    1 Q0 pid5 5 1.89 runid1
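If you build runs programmatically, a small helper along these lines will produce the six-column format; this is only a sketch, and the output file name and example scores are placeholders.

```python
# Write a run in the six-column TREC format described above.
def write_run(results, run_id, path="my_run.trec"):
    """results: {qid: [(docid, score), ...]}; products are re-sorted by score."""
    with open(path, "w") as f:
        for qid, scored in results.items():
            ranked = sorted(scored, key=lambda x: -x[1])
            for rank, (docid, score) in enumerate(ranked, start=1):
                f.write(f"{qid} Q0 {docid} {rank} {score:.4f} {run_id}\n")

# Tiny usage example with made-up IDs and scores.
write_run({"1": [("pid1", 2.73), ("pid2", 2.71)]}, run_id="runid1")
```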

As the official evaluation set, we provide 926 queries, of which 50 or more will be judged by NIST assessors. NIST will use depth pooling, with a separate pool for each task. Products in these pools will then be labeled by NIST assessors using multi-graded judgments, allowing us to measure NDCG.
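For local development you can approximate this evaluation with pytrec_eval (Python bindings for trec_eval). The sketch below assumes qrels and a run already loaded as nested dicts; the toy data is made up, and NIST's official cutoffs and pooling may differ.

```python
# NDCG sketch with pytrec_eval; qrels hold graded labels, runs hold system scores.
import pytrec_eval

qrels = {"1": {"pid1": 3, "pid2": 0}}                    # {qid: {docid: graded label}}
run = {"1": {"pid1": 2.73, "pid2": 2.71, "pid3": 2.61}}  # {qid: {docid: score}}

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
scores = evaluator.evaluate(run)
print(scores["1"]["ndcg_cut_10"])  # NDCG@10 for query "1"
```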

The main type of TREC submission is automatic, which means there is no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the queries, retrain your model, or make any other manual adjustments after you see the test queries. Ideally, you only look at the test queries to check that they ran properly (i.e., no bugs) and then submit your automatic runs. However, if you want a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual and provide a description of the alterations that were performed.