Recommendation Task Instructions

Many e-commerce sites provide various forms of related product recommendations: products that are somehow related to a product the user is looking at, has added to their cart, or has purchased or viewed in the past. However, published research on related-product recommendation is scant, and very few datasets are well-suited to evaluate such recommendations.

Product relationships also come in different types, further complicating the evaluation but opening fresh possibilities for research. This track focuses on substitute and complementary relationships, defined in terms of product function and use. Specifically, given a reference item A and a related item B:

  - B is a substitute for A if B can be used in place of A to serve the same function.
  - B is a complement to A if B is typically used together with A or enhances the use of A.

In the first version of this task, we are focusing on coarse-grained relationships. For example, two camera lenses for the same camera are substitutes for each other even if they serve different photographic purposes (e.g., a portrait lens and a zoom lens for the same camera platform are both considered substitutes in this version of the task).

The goal of this track is to identify and distinguish complementary and substitute products, enabling recommendation experiences that treat these relationship types differently and support richer user exploration of the product space.

Because related-product recommendation lists are often short, we will be using short rankings of products (with separate lists of substitutes and complements) as the primary basis for evaluating submitted runs. Teams will also submit a third, longer list for each query for pooling.

Training and Preparatory Data

We have provided the following data to track participants, available on HuggingFace:

For your final submissions, use the eval directory.

All data is recorded with ASINs, so your model can be trained by cross-linking it with other public datasets.

You are not limited to the product data in the corpus — feel free to enrich with other sources, such as other data available in the original ESCI or M2 data sets, or the UCSD Ratings & Reviews.
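
As an illustration, a cross-link of this kind might look like the sketch below, which joins the track corpus with an external product dataset on ASIN using pandas. The file paths and column names (asin, parent_asin) are assumptions for illustration only; check the schemas of the files you actually use.

    import pandas as pd

    # Load the track corpus and an external product dataset.
    # Both paths are placeholders; substitute the files you actually use.
    corpus = pd.read_parquet("product_corpus.parquet")
    external = pd.read_json("external_reviews.jsonl", lines=True)

    # Both sources identify products by ASIN, so a left join aligns them.
    # The column names are assumptions; adjust them to the real schemas.
    enriched = corpus.merge(
        external,
        left_on="asin",
        right_on="parent_asin",
        how="left",
        suffixes=("", "_ext"),
    )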

Our repository also contains copies of the relevant pieces of the original M2 and ESCI data sets, pursuant to their Apache licenses. The search corpus is formed by combining the M2 and ESCI product training data sets and filtering them as follows:

Task Definition and Query Data

The “query” data will consist of a set of requests for related product recommendations. Each request contains a single Amazon product ID (the reference item). For each reference item, the system should produce (and teams submit) three output lists (a construction sketch follows the list):

  1. A ranked list of 100 related items, with an annotation as to whether they are complementary or substitute. This will be used to generate deeper pools for evaluation.
  2. A list of 10 Top Complementary items.
  3. A list of 10 Top Substitute items.
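
As a sketch of how the three lists relate, assume a system that produces scored candidates, each tagged with a predicted relationship type; the (asin, score, rel_type) structure below is an illustration, not part of the track specification.

    def build_submission_lists(candidates):
        """Split scored candidates into the three per-query output lists.

        candidates: list of (asin, score, rel_type) tuples, where rel_type is
        "C" (complement) or "S" (substitute), scored by your model.
        """
        ranked = sorted(candidates, key=lambda c: c[1], reverse=True)
        pooled = ranked[:100]                                  # list 1: top 100 with C/S labels
        complements = [c for c in ranked if c[2] == "C"][:10]  # list 2: top 10 complements
        substitutes = [c for c in ranked if c[2] == "S"][:10]  # list 3: top 10 substitutes
        return pooled, complements, substitutes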

Participant solutions are not restricted to the training data we provide — it is acceptable to enrich the track data with additional data sources such as the Amazon Review datasets for training or model operation.

Query Format

The query data is provided in a TSV file with three columns: query ID, product ID (ASIN), and product title.
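
For example, the queries could be loaded with a few lines of Python; the file name is a placeholder for the query file distributed by the track.

    import csv

    # Each row is: query ID, reference product ASIN, product title.
    with open("queries.tsv", newline="", encoding="utf-8") as f:
        queries = [
            {"qid": qid, "asin": asin, "title": title}
            for qid, asin, title in csv.reader(f, delimiter="\t")
        ]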

Run Format

Submitted runs should be in the six-column TREC run format (TSV with columns qid, iter, product, rank, score, runID), with qids derived from the input qids. Specifically, for input query 3, the outputs should be recorded as qids 3R, 3C, and 3S.

For the 100-item pooled runs, since standard TREC analysis tools ignore the iteration field (field 2), use it to label items: emit C for a complement and S for a substitute (see the writer sketch after the field list below).

The fields are as follows:

  1. qid: the query identifier (from the query file), with the suffix indicating which list the result is in (R, C, or S).
  2. iter: identifier for the round in a multi-round query, generally unused. For the top-100 run, store the relationship type (C or S) here; for the other lists, either store the relationship type or use a sentinel value such as 0.
  3. product: the ASIN of the product in this rank.
  4. rank: the rank of the retrieval result.
  5. score: the score for this product for this query.
  6. runID: the run identifier, usually the name of the system/variant producing these results.
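
A minimal writer sketch for this format is shown below. It assumes each list holds (asin, score, rel_type) tuples as in the earlier sketch; that structure, the helper name, and the run identifier are illustrative, not part of the track specification.

    def write_run_lines(out, qid, pooled, complements, substitutes, run_id):
        """Write one query's three lists in the six-column TREC run format."""
        def emit(suffix, items, label_iter):
            for rank, (asin, score, rel) in enumerate(items, start=1):
                iter_field = rel if label_iter else "0"  # C/S label or sentinel
                out.write(f"{qid}{suffix}\t{iter_field}\t{asin}\t{rank}\t{score}\t{run_id}\n")

        emit("R", pooled, label_iter=True)        # 100-item pooled list, C/S in the iter field
        emit("C", complements, label_iter=False)  # top-10 complements
        emit("S", substitutes, label_iter=False)  # top-10 substitutes

Called with qid "3", this produces lines under qids 3R, 3C, and 3S, matching the convention above.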

Annotation and Relevance

Recommended items from submitted runs will be pooled and assessed by NIST assessors. Each item will be labeled with one of 4 categories (2 of which have graded labels):

Evaluation Metrics

The primary evaluation metric will be NDCG, computed separately for the top-substitute and top-complement recommendation lists. These per-list scores will be aggregated in the following ways to produce submission-level metrics:

We will compute supplementary metrics including:

Timeline