End To End Product Retrieval

Coordinators

Surya Kallumadi (Lowes), Daniel Campos (University of Illinois), Sahiti Labhishetty (Target), ChengXiang Zhai (University of Illinois), Alessandro Magnani (Walmart)

For any questions, comments, or suggestions, please email Surya or sign up for email updates.


Timetable

Introduction

The Product Search Track studies information retrieval in the domain of product search: the setting where there is a large corpus of products and the user's goal and intent is to find the product that suits their need.

Our main goal is to study how end-to-end retrieval systems can be built and evaluated given a large set of products.

Track Tasks

The Product Search Track has one task: end-to-end retrieval. Each TREC participant group can submit up to ten runs, independently of what approaches are used.

The dataset builds on the 2023 TREC Product Search Track, which itself is based on the ESCI Challenge for Improving Product Search. Unlike last year, the 2024 focus is not on generating a collection but on exploring methods for generating synthetic queries via simulation and by leveraging large language models.
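As a rough illustration of the idea, the sketch below generates a simulated shopping query for a product with an off-the-shelf instruction-tuned model. The checkpoint, prompt, and product text are arbitrary choices for illustration only, not the track's actual generation pipeline.

```python
from transformers import pipeline

# Illustrative only: the model checkpoint and prompt are arbitrary choices,
# not the official query-generation pipeline used by the track.
generator = pipeline("text-generation", model="Qwen/Qwen2.5-0.5B-Instruct")

product = "Stainless steel 12-cup programmable coffee maker with reusable filter"
prompt = (
    "Write a short search query a shopper might type when looking for this product:\n"
    f"{product}\nQuery:"
)

# Sample a short continuation and keep only the newly generated text.
out = generator(prompt, max_new_tokens=16, do_sample=True, return_full_text=False)
print(out[0]["generated_text"].strip())  # e.g. a simulated shopping query
```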

Given the corpus, retrieval can happen in at least three ways: re-ranking, text-only retrieval, and multi-modal retrieval. For re-ranking, research groups are given an initial set of 1,000 documents per query, extracted using a BM25 baseline, and can focus on re-ranking these existing results using any modeling approach.

For text-only retrieval, the task formulation is much like re-ranking but without the top 1,000 documents; these must be generated by the research group using the text of each product. Finally, in multi-modal retrieval, additional information in the form of images, reviews, and product taxonomy can be used to improve retrieval or ranking performance.
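For example, in the re-ranking setting, a minimal sketch of scoring the provided BM25 candidates with an off-the-shelf cross-encoder might look like the following. The checkpoint and the `queries`, `docs`, and `bm25_run` lookups are illustrative assumptions, not part of the track tooling.

```python
from sentence_transformers import CrossEncoder

# Illustrative re-ranking sketch. Assumes the following are already loaded:
#   queries[qid]  -> query text
#   docs[docid]   -> product text (e.g. title + description)
#   bm25_run[qid] -> list of candidate docids from the provided BM25 baseline
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # any re-ranker could be used

def rerank(qid, queries, docs, bm25_run, k=1000):
    candidates = bm25_run[qid][:k]
    pairs = [(queries[qid], docs[docid]) for docid in candidates]
    scores = model.predict(pairs)  # higher score = more relevant
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return ranked  # list of (docid, score), best first
```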

Use of external information

You are allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what evidence you used, for example an external corpus such as Wikipedia, a pre-trained model, or a proprietary corpus.

When submitting runs, participants will be able to indicate which resources they used. This will allow us to analyze the runs and break them down by type.

Datasets

As mentioned above, all task settings share training data and test queries, so there is only one dataset, provided below.

All datasets can be found on Hugging Face under the TREC Product Search organization. There are many variants of the collection in the Hugging Face repository, including JSON, Parquet, and other formats. With regard to the collection, there is the simple collection, which only features product titles and descriptions, and the full collection, which includes metadata such as reviews and the product taxonomy.

| Type | Filename | File size | Num Records | Description | Format |
| --- | --- | --- | --- | --- | --- |
| Query to Query ID | Query2QueryID | 946 KB | 30,734 | TREC-style QueryID to Query Text | tsv: qid, query |
| Collection | Collection (TREC Format) | 1.81 GB (568 MB compressed) | 1,661,907 | TREC-style corpus collection | tsv: docid, title, description |
| Train QREL (ESCI) | Train QRELS (TREC Format) | 6.8 MB (2.1 MB compressed) | 392,119 | Train QRELs | tsv: qid, 0, docid, relevance label |
| Dev QREL (ESCI) | Dev QRELS (TREC Format) | 2.9 MB (906 KB compressed) | 169,952 | Dev QRELs | tsv: qid, 0, docid, relevance label |
| 2023 Test Queries | 2023 Test Queries (TREC Format) | 12 KB (7 KB compressed) | 186 | 2023 Test Queries | tsv: qid, query text |
| 2023 Test QREL Synthetic (Non-NIST) | 2023 Test QREL Synthetic (Non-NIST) (TREC Format) | 18 KB (6 KB compressed) | 998 | 2023 Test QRELs, synthetic (non-NIST) | tsv: qid, 0, docid, relevance label |
| 2023 Test QRELS (NIST Judged) | 2023 Test QREL (TREC Format) | 2.1 MB (460 KB compressed) | 115,490 | 2023 Test QRELs | tsv: qid, 0, docid, relevance label |
| 2024 Test Queries | 2024 Test Queries (TREC Format) | TBD | TBD | 2024 Test Queries | tsv: qid, query text |
| Training Triples (Query, Positive, Negative Pairs) | Train Triples JSONL | 6.23 GB (1.28 GB compressed) | 20,888 | Training triples | json: qid, query, positive passages, negative passages |
| Top 100 BM25 (Pyserini Simple Context) 2024 Queries | Top 100 Train BM25 Simple | TBD | TBD | BM25 top 100 for 2024 queries | tsv: qid, doc_id, rank, score, run-name |
| Top 100 BM25 (Pyserini Full Context) 2024 Queries | Top 100 Train BM25 Full | TBD | TBD | BM25 top 100 for 2024 queries | tsv: qid, doc_id, rank, score, run-name |
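The Hugging Face variants of these files can be loaded directly with the `datasets` library. The repository identifier below is a placeholder; substitute the exact dataset name listed under the TREC Product Search organization page.

```python
from datasets import load_dataset

# Placeholder repository ID: replace with the exact dataset name listed under the
# TREC Product Search organization on Hugging Face.
corpus = load_dataset("trec-product-search/<collection-repo>", split="train")

# Each record in the simple collection carries a document ID, a title, and a description.
print(corpus[0])
```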

Getting Started/Tevatron Usage

To allow quick experimentation, we have made the datasets compatible with the popular Tevatron library. To train, index, and retrieve over the product search collection, researchers can take the Tevatron MS MARCO example guide, update the dataset names, and run it with their favorite model variant. For simplicity, an example is shown below.
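The sketch below mirrors the Tevatron MS MARCO example: train a dense retriever, encode the corpus and queries, then retrieve with FAISS. The dataset identifiers are placeholders to be replaced with the TREC Product Search repositories on Hugging Face, and the module paths and flags follow the Tevatron v1 example guide, so check the current guide if the library layout has changed.

```python
import subprocess

# Placeholder dataset identifiers: swap in the exact TREC Product Search repository
# names from the Hugging Face organization page.
TRAIN_DATA = "trec-product-search/<train-triples-repo>"
CORPUS_DATA = "trec-product-search/<collection-repo>"
QUERY_DATA = "trec-product-search/<test-queries-repo>"

def run(args):
    subprocess.run(args, check=True)

# 1) Train a dense retriever (flags mirror the Tevatron v1 MS MARCO example guide).
run(["python", "-m", "tevatron.driver.train",
     "--output_dir", "retriever_product_search",
     "--model_name_or_path", "bert-base-uncased",
     "--dataset_name", TRAIN_DATA,
     "--per_device_train_batch_size", "8",
     "--train_n_passages", "8",
     "--learning_rate", "5e-6",
     "--q_max_len", "32",
     "--p_max_len", "256",
     "--num_train_epochs", "3",
     "--fp16"])

# 2) Encode the product collection and the test queries.
run(["python", "-m", "tevatron.driver.encode",
     "--output_dir", "temp",
     "--model_name_or_path", "retriever_product_search",
     "--dataset_name", CORPUS_DATA,
     "--p_max_len", "256",
     "--encoded_save_path", "corpus_emb.pkl",
     "--fp16"])
run(["python", "-m", "tevatron.driver.encode",
     "--output_dir", "temp",
     "--model_name_or_path", "retriever_product_search",
     "--dataset_name", QUERY_DATA,
     "--q_max_len", "32",
     "--encode_is_qry",
     "--encoded_save_path", "query_emb.pkl",
     "--fp16"])

# 3) Retrieve the top 1000 products per query with Tevatron's FAISS retriever.
run(["python", "-m", "tevatron.faiss_retriever",
     "--query_reps", "query_emb.pkl",
     "--passage_reps", "corpus_emb.pkl",
     "--depth", "1000",
     "--batch_size", "-1",
     "--save_text",
     "--save_ranking_to", "run.product-search.txt"])
```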

Submission, evaluation and judging

We will be following the classic TREC submission format, which is repeated below. White space is used to separate columns. The width of the columns is not important, but it is important to have exactly six columns per line with at least one space between the columns.

Each line has the form qid Q0 docid rank score run_id, where qid is the query (topic) ID, Q0 is a required literal constant, docid is the product ID of the retrieved item, rank is the item's rank for that query, score is the system's score for the item, and run_id is a string identifying the run. For example:

1 Q0 pid1    1 2.73 runid1
1 Q0 pid2    1 2.71 runid1
1 Q0 pid3    1 2.61 runid1
1 Q0 pid4    1 2.05 runid1
1 Q0 pid5    1 1.89 runid1
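For convenience, here is a short sketch for writing ranked results in this six-column layout; the `ranked` mapping, run name, and output path are illustrative.

```python
# ranked: dict mapping qid -> list of (product_id, score), sorted by score descending
def write_run(ranked, run_id, path):
    with open(path, "w") as out:
        for qid, scored in ranked.items():
            for rank, (pid, score) in enumerate(scored, start=1):
                out.write(f"{qid} Q0 {pid} {rank} {score:.4f} {run_id}\n")

# Example usage with dummy scores:
write_run({"1": [("pid1", 2.73), ("pid2", 2.71)]}, "runid1", "run.product-search.txt")
```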

As the official evaluation set, we provide a set of 926 queries, of which 50 or more will be judged by NIST assessors. For this purpose, NIST will be using depth pooling with separate pools for each task. Products in these pools will then be labelled by NIST assessors using multi-graded judgments, allowing us to measure NDCG.
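To sanity-check a run locally before submission, one option is to score it against the released 2023 qrels with pytrec_eval; the file paths below are placeholders, and `ndcg_cut_10` is just one of the cutoffs the measure family reports.

```python
import collections
import pytrec_eval

def read_qrels(path):
    # qrels lines: qid 0 docid relevance
    qrels = collections.defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels[qid][docid] = int(rel)
    return dict(qrels)

def read_run(path):
    # run lines: qid Q0 docid rank score run_id
    run = collections.defaultdict(dict)
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run[qid][docid] = float(score)
    return dict(run)

qrels = read_qrels("2023-test-qrels.txt")    # placeholder path to the released qrels
run = read_run("run.product-search.txt")     # placeholder path to your run file
evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)
mean_ndcg10 = sum(m["ndcg_cut_10"] for m in per_query.values()) / len(per_query)
print(f"NDCG@10: {mean_ndcg10:.4f}")
```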

The main type of TREC submission is automatic, which means there was no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the queries, retrain your model, or make any other sort of manual adjustment after you see the test queries. The ideal case is that you only look at the test queries to check that they ran properly (i.e., no bugs) and then submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual and provide a description of what alterations were performed.