Daniel Campos (University of Illinois), Surya Kallumadi(Lowes), Corby Rosset (Microsoft), ChengXiang Zhai (University of Illinois), Alessandro Magnani (Walmart)
For any questions, comments, or suggestions please email Daniel Campos or sign up for email updates
The Product Search Track studies information retrieval in the field of product search: the setting where there is a corpus of many products and the user's goal is to find the product that best suits their need.
Our main goal is to study how end-to-end retrieval systems can be built and evaluated given a large set of products.
The product search track has three tasks: ranking, end-to-end retrieval, and multi-modal end-to-end retrieval. You can submit up to three runs for each of these tasks.
Each task uses the same training data, originating from the ESCI Challenge for Improving Product Search, and shares the same set of evaluation queries.
Below the three tasks are described in more detail.
The first task focuses on product ranking. In this task we provide an initial ranking of 1,000 documents from a BM25 baseline, and you are expected to re-rank the products by their relevance to the user's stated intent.
The ranking task is deliberately focused: the candidate sets are fixed and there is no need to implement a complex end-to-end system, which makes experimentation quick and runs easy to compare.
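To make the task concrete, here is a minimal re-ranking sketch in Python. It assumes the baseline run follows the five-column layout of the BM25 run files listed in the dataset table below (qid, doc_id, rank, score, run-name); `queries`, `doc_texts`, and `score_pair` are hypothetical placeholders for your own query/product lookups and relevance model, not part of the provided data.

```python
# Minimal re-ranking sketch: read the provided BM25 candidates, rescore each
# query-product pair with your own model, and sort by the new scores.
from collections import defaultdict

def load_candidates(baseline_path):
    """Read a five-column run file (qid, docid, rank, score, run-name)."""
    candidates = defaultdict(list)
    with open(baseline_path) as f:
        for line in f:
            qid, docid, _rank, _score, _run = line.split()
            candidates[qid].append(docid)
    return candidates

def score_pair(query, doc_text):
    # Hypothetical scorer: replace with your model's query-product relevance score.
    return float(len(set(query.lower().split()) & set(doc_text.lower().split())))

def rerank(candidates, queries, doc_texts):
    """Return {qid: [(docid, score), ...]} sorted by decreasing score."""
    return {
        qid: sorted(((d, score_pair(queries[qid], doc_texts[d])) for d in docids),
                    key=lambda pair: pair[1], reverse=True)
        for qid, docids in candidates.items()
    }
```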
The second task focuses on end-to-end product retrieval. In this task we provide a large collection of products, and participants need to design end-to-end retrieval systems which leverage whatever information they find relevant and useful.
Unlike the ranking task, the focus here is on understanding the interplay between retrieval and re-ranking systems.
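As a starting point, a first-stage retrieval run can be produced directly from the Pyserini-format collection. The sketch below is one possible baseline, assuming you have already built a Lucene index from the Pyserini-style JSONL collection; the index path and run tag are placeholders.

```python
# First-stage BM25 retrieval sketch using Pyserini over a locally built index.
from pyserini.search.lucene import LuceneSearcher

searcher = LuceneSearcher("indexes/trec-product-search")  # placeholder index path
searcher.set_bm25(k1=0.9, b=0.4)  # default-style BM25 parameters; tune as needed

def retrieve(qid, query, k=1000, run_id="bm25-baseline"):
    """Return TREC-format run lines for one query."""
    hits = searcher.search(query, k=k)
    return [f"{qid} Q0 {hit.docid} {rank} {hit.score:.4f} {run_id}"
            for rank, hit in enumerate(hits, start=1)]
```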
The third task focuses on end-to-end product retrieval using multiple modalities. In this task we provide a large collection of products where each product features additional attributes and information, such as related clicks and images, and participants need to design end-to-end retrieval systems which leverage whatever information they find relevant and useful.
The focus of this task is to understand the interplay between different modalities and the value that additional, potentially weak, data provides.
You are allowed to use external information while developing your runs. When you submit your runs, please fill in a form listing what evidence you used, for example an external corpus such as Wikipedia, a pre-trained model, or a proprietary corpus.
When submitting runs, participants will be able to indicate what resources they used. This will allow us to analyze the runs and break them down into types.
As mentioned above, all tasks share the same training data and test queries, so only one dataset is provided below.
All datasets can be found on Hugging Face under the TREC Product Search organization; a short loading sketch follows the table below.
Type | Filename | File size | Num Records | Description | Format |
---|---|---|---|---|---|
Query to Query ID | Query2QueryID | 946 KB | 30,734 | TREC style QueryID to Query Text | tsv: qid\tquery |
Collection | Collection (TREC Format) | 1.81 GB (568 MB compressed) | 1,661,907 | TREC style corpus collection | tsv: docid\tTitle\tDescription |
Collection | Collection (Pyserini- Simplified) | 1.9 GB (573.2 MB compressed) | 1,661,907 | Pyserini Style Simple JSONL corpus collection | json: docid, contents (Title and Description) |
Collection | Collection (Pyserini- Simplified) Parquet | 900.3 MB (770.9 MB compressed) | 1,661,907 | Pyserini Style Simple Parquet corpus collection | json: docid, contents (Title and Description) |
Collection | Collection (Pyserini- Full) | 10.69 GB (2.8 GB compressed) | 1,661,907 | Pyserini Style Full JSONL corpus collection | json: docid, contents (Title,Description, metadata, reviews) |
Collection | Collection (Pyserini- Full) Parquet | 4.56 GB (3.92 GB compressed) | 1,661,907 | Pyserini Style Full Parquet corpus collection | json: docid, contents (Title,Description, metadata, reviews) |
Collection | Collection JSONL | 10.3 GB (2.78 GB compressed) | 1,661,907 | JSONL Collection | json: docid, contents (Title,Description, metadata, reviews) |
Collection | Collection Parquet | 5.22 GB (4.53 GB compressed) | 1,661,907 | Collection (Parquet) | json: docid, contents (Title,Description, metadata, reviews) |
Train QREL | Train QRELS (TREC Format) | 6.8 MB (2.1 MB compressed) | 392,119 | Train QRELs | tsv: qid, 0, docid, relevance label |
Train QREL (JSON) | Train QRELS (JSONL Format) | 21.5 MB (2.4 MB compressed) | 392,119 | Train QRELs (JSON) | json: qid, 0, docid, relevance label |
Train QREL (Parquet) | Train QRELS (Parquet Format) | 2.4 MB (2 MB compressed) | 392,119 | Train QRELs (Parquet) | json: qid, 0, docid, relevance label |
Dev QREL | Dev QRELS (TREC Format) | 2.9 MB (906 KB compressed) | 169,952 | Dev QRELs | tsv: qid, 0, docid, relevance label |
Dev QREL (JSON) | Dev QRELS (JSONL Format) | 21.5 MB (2.4 MB compressed) | 169,952 | Dev QRELs (JSON) | json: qid, 0, docid, relevance label |
Dev QREL (Parquet) | Dev QRELS (Parquet Format) | 2.4 MB (2 MB compressed) | 169,952 | Dev QRELs (Parquet) | json: qid, 0, docid, relevance label |
2023 Test Queries | 2023 Test Queries (TREC Format) | 52 KB (28 KB compressed) | 926 | 2023 Test Queries | tsv: qid, query text |
Test QREL | Test QRELS (TREC Format) | 18 KB (6 KB compressed) | 998 | Test QRELs (Recall Based) | tsv: qid, 0, docid, relevance label |
Test QREL (JSON) | Test QRELS (JSONL Format) | 15 KB (7 KB compressed) | 998 | Test QRELs (JSON) | json: qid, 0, docid, relevance label |
Test QREL (Parquet) | Test QRELS (Parquet Format) | 13 KB (7 KB compressed) | 998 | Test QRELs (Parquet) | json: qid, 0, docid, relevance label |
Triples | Train Triples JSONL | 6.23 GB (1.28 GB compressed) | 20,888 | Training Triples json format | json: qid, query, positive passages, negative passages
Triples | Dev Triples JSONL | 2.67 GB (550.3 MB compressed) | 8,956 | Development Triples json format | tsv: qid, query
Triples | Dev Triples Parquet | 938 MB (777.2 MB compressed) | 8,956 | Development Triples parquet format | tsv: qid, query
Triples | Test Triples Parquet | 47 KB (37 KB compressed) | 926 | Test Triples parquet format | tsv: qid, query
Triples | Test Triples JSON | 125 KB (30 KB compressed) | 926 | Test Triples json format | tsv: qid, query
Train Top 100 BM25 (Pyserini Simple Context) | Top 100 Train BM25 Simple | 107.3 MB (28.8 MB compressed) | 2,046,828 | BM25 top 100 For Train Queries | tsv: qid, doc_id, rank, score, run-name |
Train Top 100 BM25 (Pyserini Full Context) | Top 100 Train BM25 Full | 103.7 MB (28.8 MB compressed) | 2,055,876 | BM25 top 100 For Train Queries | tsv: qid, doc_id, rank, score, run-name
Dev Top 1000 BM25 (Pyserini Simple Context) | Top 1000 Dev BM25 Simple | 107.3 MB (28.8 MB compressed) | 8,717,672 | BM25 top 1000 For Dev Queries | tsv: qid, doc_id, rank, score, run-name
Dev Top 1000 BM25 (Pyserini Full Context) | Top 1000 Dev BM25 Full | 447.4 MB (121.1 MB compressed) | 8,717,672 | BM25 top 1000 For Dev Queries | tsv: qid, doc_id, rank, score, run-name
Test Top 1000 BM25 (Pyserini Simple Context) | Top 1000 Test BM25 Simple | 4.1 MB (1.1 MB compressed) | 75,452 | BM25 top 1000 For Test Queries | tsv: qid, doc_id, rank, score, run-name
Test Top 1000 BM25 (Pyserini Full Context) | Top 1000 Test BM25 Full | 3.9 MB (1.1 MB compressed) | 75,696 | BM25 top 1000 For Test Queries | tsv: qid, doc_id, rank, score, run-name
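For reference, the files above can be pulled with the Hugging Face `datasets` library. The repository name below is illustrative only; check the TREC Product Search organization page for the exact dataset names, and depending on how a repository is laid out you may need to pass `data_files=` explicitly.

```python
# Sketch of loading the product collection from the Hugging Face Hub.
# "trec-product-search/product-search-corpus" is a placeholder repository name.
from datasets import load_dataset

collection = load_dataset("trec-product-search/product-search-corpus", split="train")
print(collection[0])  # expected fields per the table: docid, contents
```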
We will be following the classic TREC submission formatting, which is repeated below. White space is used to separate columns. The width of the columns in the format is not important, but it is important to have exactly six columns per line with at least one space between the columns.
The six columns are the query ID, the fixed string Q0, the product ID, the rank, the score assigned by your system, and the run ID, for example:
1 Q0 pid1 1 2.73 runid1
1 Q0 pid2 1 2.71 runid1
1 Q0 pid3 1 2.61 runid1
1 Q0 pid4 1 2.05 runid1
1 Q0 pid5 1 1.89 runid1
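A short sketch of producing a file in this format is shown below; `ranked_results` is a hypothetical mapping from query ID to (product ID, score) pairs sorted by decreasing score, such as the output of a re-ranker.

```python
# Write a run in the six-column TREC format: qid Q0 pid rank score runid.
def write_run(ranked_results, out_path, run_id="runid1"):
    with open(out_path, "w") as out:
        for qid, scored_products in ranked_results.items():
            for rank, (pid, score) in enumerate(scored_products, start=1):
                out.write(f"{qid} Q0 {pid} {rank} {score:.2f} {run_id}\n")

# Example matching the lines above.
write_run({"1": [("pid1", 2.73), ("pid2", 2.71), ("pid3", 2.61)]}, "my_run.trec")
```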
As the official evaluation set, we provide a set of 926 queries, of which 50 or more will be judged by NIST assessors. For this purpose, NIST will use depth pooling with separate pools for each task. Products in these pools will then be labeled by NIST assessors using multi-graded judgments, allowing us to measure NDCG.
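While waiting for NIST judgments, runs can be sanity-checked locally against the provided dev QRELs. The sketch below uses the pytrec_eval package and assumes whitespace-separated TREC-format qrels and run files; the file paths are placeholders.

```python
# Local NDCG@10 sanity check with pytrec_eval against TREC-format qrels and run files.
import pytrec_eval

def read_qrels(path):
    qrels = {}
    with open(path) as f:
        for line in f:
            qid, _q0, docid, rel = line.split()
            qrels.setdefault(qid, {})[docid] = int(rel)
    return qrels

def read_run(path):
    run = {}
    with open(path) as f:
        for line in f:
            qid, _q0, docid, _rank, score, _tag = line.split()
            run.setdefault(qid, {})[docid] = float(score)
    return run

qrels = read_qrels("dev_qrels.trec")  # placeholder path
run = read_run("my_run.trec")         # placeholder path

evaluator = pytrec_eval.RelevanceEvaluator(qrels, {"ndcg_cut"})
per_query = evaluator.evaluate(run)
ndcg10 = sum(q["ndcg_cut_10"] for q in per_query.values()) / len(per_query)
print(f"NDCG@10: {ndcg10:.4f}")
```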
The main type of TREC submission is automatic, which means there was no manual intervention in running the test queries. This means you should not adjust your runs, rewrite the queries, retrain your model, or make any other sort of manual adjustment after you see the test queries. The ideal case is that you only look at the test queries to check that they ran properly (i.e., no bugs) and then submit your automatic runs. However, if you want to have a human in the loop for your run, or do anything else that uses the test queries to adjust your model or ranking, you can mark your run as manual and provide a description of what types of alterations were performed.