Previous Data
Visit here for all datasets.
2024 Datasets
Type | Filename | File Size | Num Records | Description | Format |
---|---|---|---|---|---|
Query to Query ID | Query2QueryID | 946 KB | 30,734 | TREC style QueryID to Query Text | tsv: qid, query |
Collection | Collection (TREC Format) | 1.81 GB (568 MB compressed) | 1,661,907 | TREC style corpus collection | tsv: docid, Title, Description |
Train QREL (ESCI) | Train QRELS (TREC Format) | 6.8 MB (2.1 MB compressed) | 392,119 | Train QRELs | tsv: qid, 0, docid, relevance label |
Dev QREL (ESCI) | Dev QRELS (TREC Format) | 2.9 MB (906 KB compressed) | 169,952 | Dev QRELs | tsv: qid, 0, docid, relevance label |
Training Triples | Train Triples JSONL | 6.23 GB (1.28 GB compressed) | 20,888 | Training triples in JSONL format | json: qid, query, positive passages, negative passages |
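The QREL files above all share the standard four-column TREC layout (qid, 0, docid, relevance label) as tab-separated values. A minimal stdlib sketch of loading that layout; the sample rows below are invented for illustration, not taken from the real files:

```python
import csv
import io

# Hypothetical rows in the qid, 0, docid, relevance-label layout
# described in the tables above (values are made up).
sample = "q1\t0\tD100\t2\nq1\t0\tD101\t0\nq2\t0\tD200\t1\n"

def load_qrels(fileobj):
    """Parse a TREC-style QREL TSV into {qid: {docid: relevance}}."""
    qrels = {}
    for qid, _unused, docid, label in csv.reader(fileobj, delimiter="\t"):
        qrels.setdefault(qid, {})[docid] = int(label)
    return qrels

qrels = load_qrels(io.StringIO(sample))
print(qrels["q1"]["D100"])  # -> 2
```

The same reader works for the train, dev, and test QRELs, since they share one format.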
2023 Datasets
Type | Filename | File Size | Num Records | Description | Format |
---|---|---|---|---|---|
Test Queries | 2023 Test Queries (TREC Format) | 12 KB (7 KB compressed) | 186 | 2023 Test Queries | tsv: qid, query text |
Test QREL Synthetic (Non NIST) | 2023 Test QREL Synthetic (Non NIST) | 18 KB (6 KB compressed) | 998 | 2023 Test QREL Synthetic (Non NIST) | tsv: qid, 0, docid, relevance label |
Test QRELS (NIST Judged) | 2023 Test QREL (TREC Format) | 2.1 MB (460 KB compressed) | 115,490 | 2023 Test QRELs | tsv: qid, 0, docid, relevance label |
Getting Started/Tevatron Usage
To allow quick experimentation, we have made the datasets compatible with the popular Tevatron library. To train, index, and retrieve over the product search collection, researchers can follow the Tevatron MSMARCO Example Guide, update the dataset names, and run with their favorite model variant. A minimal example is shown below.
Steps
- Train a Model

```bash
python -m tevatron.driver.train \
  --output_dir product_search_bi_encoder_baseline \
  --model_name_or_path bert-base-uncased \
  --dataset_name trec-product-search/Product-Search-Triples
```
- Encode the Corpus

```bash
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path product_search_bi_encoder_baseline \
  --dataset_name trec-product-search/product-search-corpus \
  --encoded_save_path corpus_emb.pkl \
  --encode_num_shard 1 \
  --encode_shard_index 0
```
- Create Query Embeddings for the 2023 Queries

```bash
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path product_search_bi_encoder_baseline \
  --dataset_name trec-product-search/Product-Search-Triples/test \
  --encoded_save_path query_emb.pkl \
  --q_max_len 32 \
  --encode_is_qry
```
- Retrieve Top Results

```bash
python -m tevatron.faiss_retriever \
  --query_reps query_emb.pkl \
  --passage_reps corpus_emb.pkl \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to run.txt
```
- Convert the Ranking to TREC Format

```bash
python -m tevatron.utils.format.convert_result_to_trec \
  --input run.txt \
  --output run.trec
```
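The target here is the standard six-column TREC run format (`qid Q0 docid rank score tag`). As a reference for what the conversion produces, here is a hand-rolled sketch; the three-column input layout (qid, docid, score per line), the sample values, and the `my_run` tag are all assumptions for illustration:

```python
import io

# Assumed input layout: qid \t docid \t score, one line per retrieved doc
# (sample values are invented).
raw_run = "q1\tD100\t12.5\nq1\tD101\t11.0\nq2\tD200\t9.3\n"

def to_trec(fileobj, tag="my_run"):
    """Emit standard TREC run lines: qid Q0 docid rank score tag."""
    runs = {}
    for line in fileobj:
        qid, docid, score = line.split("\t")
        runs.setdefault(qid, []).append((docid, float(score)))
    out = []
    for qid, docs in runs.items():
        # Rank documents per query by descending score, 1-indexed.
        docs.sort(key=lambda d: d[1], reverse=True)
        for rank, (docid, score) in enumerate(docs, start=1):
            out.append(f"{qid} Q0 {docid} {rank} {score} {tag}")
    return out

for line in to_trec(io.StringIO(raw_run)):
    print(line)
```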
- Evaluate with TREC Eval or ir_measures

```bash
ir_measures product_qrel.trec run.trec \
  NDCG@1 NDCG@3 NDCG@5 NDCG@10 NDCG@100 NDCG@1000 \
  AP@1 AP@3 AP@5 AP@10 AP@100 AP@1000
```
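For intuition, NDCG@k can be sketched in a few lines of pure Python using the common log2-discounted linear-gain formulation (ir_measures' implementation may differ in details such as tie handling; the judgments and ranking below are toy values):

```python
import math

def ndcg_at_k(qrel, ranking, k):
    """NDCG@k for one query.

    qrel maps docid -> graded relevance; ranking is the list of
    retrieved docids in rank order."""
    # DCG: gain of each retrieved doc, discounted by log2(rank + 1).
    dcg = sum(qrel.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    # Ideal DCG: the same sum over the best possible ordering.
    ideal = sorted(qrel.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy judgments and a toy ranking that swaps the top two docs.
qrel = {"D1": 3, "D2": 1, "D3": 0}
print(round(ndcg_at_k(qrel, ["D2", "D1", "D3"], 3), 4))  # -> 0.7967
```

A perfect ordering of the same documents scores exactly 1.0, which is why NDCG is convenient for comparing runs across queries.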