Surya Kallumadi (Lowe's), Daniel Campos (University of Illinois), Sahiti Labhishetty (Target), ChengXiang Zhai (University of Illinois), Alessandro Magnani (Walmart)
For any questions, comments, or suggestions, please email Surya or sign up for email updates.
The Product Search Track studies information retrieval in the product search setting: given a corpus of many products, the user's goal and intent is to find the product that best suits their need.
Our main goal is to study how end-to-end retrieval systems can be built and evaluated given a large set of products.
The Product Search Track has one task: end-to-end retrieval. Each TREC participant group can submit up to ten runs, independently of which approaches are used.
The dataset builds on the 2023 TREC Product Search Track, which itself is based on the ESCI Challenge for Improving Product Search. Unlike last year, the 2024 focus is not on generating a collection but on exploring methods for generating synthetic queries via simulation and for leveraging large language models.
Given the corpus, retrieval can happen in at least three ways: re-ranking, text-only retrieval, and multi-modal retrieval. For re-ranking, research groups are given an initial set of 1000 documents for each query, extracted using a BM25 baseline, and can focus on re-ranking those existing results using any modeling approach.
For text-only retrieval, the task formulation is much like re-ranking, but without the top 1000 documents: these candidates must be generated by the research group using the text of each product. Finally, in multi-modal retrieval, additional information in the form of images, reviews, and the product taxonomy can be used to improve retrieval or ranking performance.
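As an illustration of the re-ranking formulation, the minimal sketch below re-scores a handful of BM25 candidates with an off-the-shelf cross-encoder. This is not the official baseline: the model name and the in-memory candidate list are assumptions for illustration, and in practice the candidates would come from the provided top-1000 BM25 runs.

```python
# Minimal re-ranking sketch (not the official baseline).
# Assumes candidates were already retrieved (e.g., from the provided BM25 top-1000 run)
# and that their product text is available; the model choice is illustrative.
from sentence_transformers import CrossEncoder

query = "wireless noise cancelling headphones"  # example query text
candidates = [  # (docid, product title + description) pairs -- placeholder data
    ("B00001", "Wireless Over-Ear Headphones with Active Noise Cancelling"),
    ("B00002", "Wired Earbuds with Microphone"),
    ("B00003", "Bluetooth Noise Cancelling Headphones, 30h Battery"),
]

# Any pointwise cross-encoder works here; this MS MARCO model is just a common default.
model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
scores = model.predict([(query, text) for _, text in candidates])

# Sort candidates by the cross-encoder score (higher is better) and print the new ranking.
reranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
for rank, ((docid, _), score) in enumerate(reranked, start=1):
    print(docid, rank, float(score))
```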
You are allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what evidence you used, for example an external corpus such as Wikipedia, a pre-trained model, or a proprietary corpus.
When submitting runs, participants will be able to indicate what resources they used. This will allow us to analyze the runs and break them down into types.
As mentioned above, all tasks share training data and test queries, so there is only one dataset, provided below.
All datasets can be found on Hugging Face under the TREC Product Search organization. The repository provides several variants of the collection, including JSON, Parquet, and other formats. With regard to the collection, there is a simple collection, which only features product titles and descriptions, and a full collection, which includes metadata such as reviews and the product taxonomy. A minimal example of loading the data is shown after the table below.
Type | Filename | File size | Num Records | Description | Format |
---|---|---|---|---|---|
Query to Query ID | Query2QueryID | 946 KB | 30,734 | TREC style QueryID to Query Text | tsv: qid, query |
Collection | Collection (TREC Format) | 1.81 GB (568 MB compressed) | 1,661,907 | TREC style corpus collection | tsv: docid, title, description |
Train QREL (ESCI) | Train QRELS (TREC Format) | 6.8 MB (2.1 MB compressed) | 392,119 | Train QRELs | tsv: qid, 0, docid, relevance label |
Dev QREL (ESCI) | Dev QRELS (TREC Format) | 2.9 MB (906 KB compressed) | 169,952 | Dev QRELs | tsv: qid, 0, docid, relevance label |
2023 Test Queries | 2023 Test Queries (TREC Format) | 12 KB (7 KB compressed) | 186 | 2023 Test Queries | tsv: qid, query text |
2023 Test QREL Synthetic (Non NIST) | 2023 Test QREL Synthetic (Non NIST) (TREC Format) | 18 KB (6 KB compressed) | 998 | 2023 Test QRELs Synthetic (Non NIST) | tsv: qid, 0, docid, relevance label |
2023 Test QRELS (NIST Judged) | 2023 Test QRELS (TREC Format) | 2.1 MB (460 KB compressed) | 115,490 | 2023 Test QRELs | tsv: qid, 0, docid, relevance label |
2024 Test Queries | 2024 Test Queries (TREC Format) | TBD | TBD | 2024 Test Queries | tsv: qid, query text |
Training Triples (Query, Positive, Negative Pairs) | Train Triples JSONL | 6.23 GB (1.28 GB compressed) | 20,888 | Training triples in JSON format | json: qid, query, positive passages, negative passages |
Top 100 BM25 (Pyserini Simple Context) 2024 Queries | Top 100 Train BM25 Simple | TBD | TBD | BM25 top 100 for 2024 Queries | tsv: qid, doc_id, rank, score, run-name |
Top 100 BM25 (Pyserini Full Context) 2024 Queries | Top 100 Train BM25 Full | TBD | TBD | BM25 top 100 for 2024 Queries | tsv: qid, doc_id, rank, score, run-name |
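As a quick way to get started, the sketch below loads the corpus and queries with the Hugging Face datasets library. The dataset identifiers are placeholders for illustration; substitute the names published under the TREC Product Search organization on Hugging Face.

```python
# Minimal sketch for loading the collection and queries with Hugging Face datasets.
# The dataset identifiers below are placeholders; substitute the names from the
# TREC Product Search organization page on Hugging Face.
from datasets import load_dataset

# Simple collection: product id, title, and description.
corpus = load_dataset("trec-product-search/product-search-corpus", split="train")  # assumed name

# Queries: query id and query text.
queries = load_dataset("trec-product-search/product-search-queries", split="train")  # assumed name

print(corpus[0])   # e.g., {"docid": ..., "title": ..., "description": ...}
print(queries[0])  # e.g., {"qid": ..., "query": ...}
```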
To allow quick experimentation, we have made the datasets compatible with the popular Tevatron library. To train, index, and retrieve over the product search collection, researchers can take the Tevatron MSMARCO Example Guide, update the dataset names, and run with their favorite model variant. For simplicity, an example is shown below.
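The sketch below is one way to follow that recipe: it invokes the same training entry point used in the Tevatron MSMARCO example, with the dataset name pointed at the Tevatron-formatted product search triples. The dataset identifier, model choice, and hyperparameters are illustrative assumptions, and the flags follow Tevatron's MSMARCO example, so they may differ across Tevatron versions; treat this as a sketch of the workflow rather than the official baseline.

```python
# Sketch of adapting the Tevatron MSMARCO example to the product search data.
# The dataset name, model, and hyperparameters are placeholders; the command-line
# flags mirror Tevatron's MSMARCO example and may vary by version.
import subprocess

subprocess.run(
    [
        "python", "-m", "tevatron.driver.train",
        "--output_dir", "product_search_retriever",
        "--model_name_or_path", "bert-base-uncased",                      # any favorite model variant
        "--dataset_name", "trec-product-search/product-search-triples",   # assumed dataset name
        "--per_device_train_batch_size", "8",
        "--learning_rate", "5e-6",
        "--num_train_epochs", "3",
    ],
    check=True,
)

# Encoding the corpus and retrieving then follow the same guide
# (tevatron.driver.encode and tevatron.faiss_retriever), again with the
# corpus dataset name swapped in the same way.
```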
We will be following the classic TREC submission format, which is repeated below. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
1 Q0 pid1 1 2.73 runid1
1 Q0 pid2 1 2.71 runid1
1 Q0 pid3 1 2.61 runid1
1 Q0 pid4 1 2.05 runid1
1 Q0 pid5 1 1.89 runid1

, where:
- the first column is the query (topic) ID,
- the second column is unused and must always be the literal "Q0",
- the third column is the ID of the retrieved product,
- the fourth column is the rank of the product for that query,
- the fifth column is the score your system assigned to the product (higher is better), and
- the sixth column is the run ID (run tag) identifying your submission.
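As a concrete illustration, the helper below writes rankings in this six-column format. It is a generic sketch: the function name and the in-memory results dictionary are placeholders for whatever your system produces.

```python
# Sketch: writing a TREC-format run file with exactly six whitespace-separated columns:
# qid, Q0, product id, rank, score, run id. The results dict is placeholder data.
def write_run(results, run_id, path):
    """results maps qid -> list of (docid, score) pairs."""
    with open(path, "w") as out:
        for qid, scored_docs in results.items():
            # Sort by score descending and assign ranks starting at 1.
            ranked = sorted(scored_docs, key=lambda x: x[1], reverse=True)
            for rank, (docid, score) in enumerate(ranked, start=1):
                out.write(f"{qid} Q0 {docid} {rank} {score} {run_id}\n")

# Example usage with toy data.
write_run({"1": [("pid1", 2.73), ("pid2", 2.71)]}, "runid1", "my_run.trec")
```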
As the official evaluation set, we provide a set of 926 queries, of which 50 or more will be judged by NIST assessors. For this purpose, NIST will use depth pooling with separate pools for each task. Products in these pools will then be labeled by NIST assessors using multi-graded judgments, allowing us to measure NDCG.
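For local evaluation against the provided QRELs, one option is the pytrec_eval package, which mirrors trec_eval. The sketch below computes mean NDCG for a run; the file paths are placeholders, and NIST's official evaluation may use different settings.

```python
# Sketch: scoring a run file against a QRELs file with pytrec_eval.
# File paths are placeholders; NIST's official evaluation settings may differ.
import pytrec_eval

def read_qrels(path):
    """Read TREC QRELs lines: qid, 0, docid, relevance label."""
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path):
    """Read TREC run lines: qid, Q0, docid, rank, score, run id."""
    run = {}
    with open(path) as f:
        for line in f:
            qid, _, docid, _, score, _ = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = read_qrels("dev_qrels.tsv")   # placeholder path
run = read_run("my_run.trec")         # placeholder path

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg"})
per_query = evaluator.evaluate(run)
mean_ndcg = sum(m["ndcg"] for m in per_query.values()) / len(per_query)
print(f"mean NDCG: {mean_ndcg:.4f}")
```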
The main type of TREC submission is automatic, which means there was no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the queries, retrain your model, or make any other manual adjustments after you see the test queries. The ideal case is that you only look at the test queries to check that they ran properly (i.e., no bugs) and then submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual and provide a description of what types of alterations were performed.