Previous Data
Visit here for all datasets.
2024 Datasets
Type | Filename | File Size | Num Records | Description | Format |
---|---|---|---|---|---|
Query to Query ID | Query2QueryID | 946 KB | 30,734 | TREC style QueryID to Query Text | tsv: qid, query |
Collection | Collection (TREC Format) | 1.81 GB (568 MB compressed) | 1,661,907 | TREC style corpus collection | tsv: docid, Title, Description |
Train QREL (ESCI) | Train QRELS (TREC Format) | 6.8 MB (2.1 MB compressed) | 392,119 | Train QRELs | tsv: qid, 0, docid, relevance label |
Dev QREL (ESCI) | Dev QRELS (TREC Format) | 2.9 MB (906 KB compressed) | 169,952 | Dev QRELs | tsv: qid, 0, docid, relevance label |
Training Triples | Train Triples JSONL | 6.23 GB (1.28 GB compressed) | 20,888 | Training triples in JSONL format | json: qid, query, positive passages, negative passages |
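The QREL files above all share the standard four-column TREC layout (qid, 0, docid, relevance label) as tab-separated values. A minimal stdlib sketch of loading that layout; the sample rows below are invented for illustration, not taken from the real files:

```python
import csv
import io

# Hypothetical rows in the qid, 0, docid, relevance-label layout
# described in the tables above (values are made up).
sample = "q1\t0\tD100\t2\nq1\t0\tD101\t0\nq2\t0\tD200\t1\n"

def load_qrels(fileobj):
    """Parse a TREC-style QREL TSV into {qid: {docid: relevance}}."""
    qrels = {}
    for qid, _unused, docid, label in csv.reader(fileobj, delimiter="\t"):
        qrels.setdefault(qid, {})[docid] = int(label)
    return qrels

qrels = load_qrels(io.StringIO(sample))
print(qrels["q1"]["D100"])  # -> 2
```

The same reader works for the train, dev, and test QRELs, since they share one format.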
2023 Datasets
Type | Filename | File Size | Num Records | Description | Format |
---|---|---|---|---|---|
Test Queries | 2023 Test Queries (TREC Format) | 12 KB (7 KB compressed) | 186 | 2023 Test Queries | tsv: qid, query text |
Test QREL Synthetic (Non NIST) | 2023 Test QREL Synthetic (Non NIST) | 18 KB (6 KB compressed) | 998 | 2023 Test QREL Synthetic (Non NIST) | tsv: qid, 0, docid, relevance label |
Test QRELS (NIST Judged) | 2023 Test QREL (TREC Format) | 2.1 MB (460 KB compressed) | 115,490 | 2023 Test QRELs | tsv: qid, 0, docid, relevance label |
Getting Started/Tevatron Usage
To allow quick experimentation, we have made the datasets compatible with the popular Tevatron library. To train, index, and retrieve over the product search collection, researchers can follow the Tevatron MSMARCO Example Guide, update the dataset names, and run with their favorite model variant. A minimal example is shown below.
Steps
- Train a Model

```bash
python -m tevatron.driver.train \
  --output_dir product_search_bi_encoder_baseline \
  --model_name_or_path bert-base-uncased \
  --dataset_name trec-product-search/Product-Search-Triples
```
- Encode the Corpus

```bash
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path product_search_bi_encoder_baseline \
  --dataset_name trec-product-search/product-search-corpus \
  --encoded_save_path corpus_emb.pkl \
  --encode_num_shard 1 \
  --encode_shard_index 0
```
- Create Query Embeddings for the 2023 Queries

```bash
python -m tevatron.driver.encode \
  --output_dir=temp \
  --model_name_or_path product_search_bi_encoder_baseline \
  --dataset_name trec-product-search/Product-Search-Triples/test \
  --encoded_save_path query_emb.pkl \
  --q_max_len 32 \
  --encode_is_qry
```
- Retrieve Top Results

```bash
python -m tevatron.faiss_retriever \
  --query_reps query_emb.pkl \
  --passage_reps corpus_emb.pkl \
  --depth 100 \
  --batch_size -1 \
  --save_text \
  --save_ranking_to run.txt
```
- Convert the Ranking to TREC Format

```bash
python -m tevatron.utils.format.convert_result_to_trec \
  --input run.txt \
  --output run.trec
```
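The target here is the standard six-column TREC run format (`qid Q0 docid rank score tag`). As a reference for what the conversion produces, here is a hand-rolled sketch; the three-column input layout (qid, docid, score per line), the sample values, and the `my_run` tag are all assumptions for illustration:

```python
import io

# Assumed input layout: qid \t docid \t score, one line per retrieved doc
# (sample values are invented).
raw_run = "q1\tD100\t12.5\nq1\tD101\t11.0\nq2\tD200\t9.3\n"

def to_trec(fileobj, tag="my_run"):
    """Emit standard TREC run lines: qid Q0 docid rank score tag."""
    runs = {}
    for line in fileobj:
        qid, docid, score = line.split("\t")
        runs.setdefault(qid, []).append((docid, float(score)))
    out = []
    for qid, docs in runs.items():
        # Rank documents per query by descending score, 1-indexed.
        docs.sort(key=lambda d: d[1], reverse=True)
        for rank, (docid, score) in enumerate(docs, start=1):
            out.append(f"{qid} Q0 {docid} {rank} {score} {tag}")
    return out

for line in to_trec(io.StringIO(raw_run)):
    print(line)
```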
- Evaluate with TREC Eval or ir_measures

```bash
ir_measures product_qrel.trec run.trec \
  NDCG@1 NDCG@3 NDCG@5 NDCG@10 NDCG@100 NDCG@1000 \
  AP@1 AP@3 AP@5 AP@10 AP@100 AP@1000
```
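For intuition, NDCG@k can be sketched in a few lines of pure Python using the common log2-discounted linear-gain formulation (ir_measures' implementation may differ in details such as tie handling; the judgments and ranking below are toy values):

```python
import math

def ndcg_at_k(qrel, ranking, k):
    """NDCG@k for one query.

    qrel maps docid -> graded relevance; ranking is the list of
    retrieved docids in rank order."""
    # DCG: gain of each retrieved doc, discounted by log2(rank + 1).
    dcg = sum(qrel.get(d, 0) / math.log2(i + 2)
              for i, d in enumerate(ranking[:k]))
    # Ideal DCG: the same sum over the best possible ordering.
    ideal = sorted(qrel.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Toy judgments and a toy ranking that swaps the top two docs.
qrel = {"D1": 3, "D2": 1, "D3": 0}
print(round(ndcg_at_k(qrel, ["D2", "D1", "D3"], 3), 4))  # -> 0.7967
```

A perfect ordering of the same documents scores exactly 1.0, which is why NDCG is convenient for comparing runs across queries.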