Recommendation Task Instructions

Many e-commerce sites provide various forms of related product recommendations: products that are somehow related to a product the user is looking at, has added to their cart, or has purchased or viewed in the past. However, published research on related-product recommendation is scant, and very few datasets are well-suited to evaluate such recommendations.

Product relationships also come in different types, further complicating evaluation but opening fresh possibilities for research. This track focuses on substitute and complementary relationships, defined in terms of product function and use. Specifically, given a reference item A and a related item B:

  - B is a substitute for A if it serves the same function as A and could reasonably be used or purchased in its place.
  - B is a complement to A if it is typically used or purchased together with A.

In the first version of this task, we are focusing on coarse-grained relationships. For example, two camera lenses for the same camera are substitutes for each other, even if they serve different photographic purposes (e.g., a portrait lens and a zoom lens for the same camera platform are considered substitutes in this version of the task).

The goal of this track is to identify and distinguish complementary and substitute products, enabling recommendation experiences that treat these relationship types differently and support richer user exploration of the product space.

Because related-product recommendation lists are often short, we will use short rankings of products (with separate lists of substitutes and complements) as the primary basis for evaluating submitted runs. Teams will also submit a third, longer list for each query for pooling.

Training and Preparatory Data

We are providing the following data to track participants (coming soon):

Precise details on the subset that forms the corpus are forthcoming, but all products used as query items or expected to be recommended will be from the M2 and/or ESCI data sets. Amazon product identifiers are consistent across both data sets.

Task Definition and Query Data

The “query” data will consist of a set of requests for related product recommendations. Each request contains a single Amazon product ID (the reference item). For each reference item, the system should produce (and teams submit) three output lists:

  1. A ranked list of 100 related items, each annotated as complementary or substitute. This list will be used to generate deeper pools for evaluation.
  2. A list of 10 Top Complementary items.
  3. A list of 10 Top Substitute items.

The query data will be in a TSV file with 2 columns (query ID and product ID). Output data should be a TSV file in the 6-column TREC runs format (qid, iter, product, rank, score, runID), with QIDs derived from the input QIDs (for input query 3, the outputs should be recorded as qids 3R, 3C, and 3S).
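
To make the submission format concrete, the sketch below (not an official tool) writes one query's three lists in the 6-column format described above. The function names, run ID, and placeholder product IDs are illustrative, and this draft does not yet specify how the complementary/substitute annotation on the 100-item list should be encoded, so only the six standard columns are written.

```python
# Minimal sketch of writing run output in the format described above.
# Names, run ID, and placeholder product IDs are illustrative only.

def write_run_rows(out, qid, ranked_items, run_id):
    """Write (product_id, score) pairs, ranked best-first, for one derived qid."""
    for rank, (product_id, score) in enumerate(ranked_items, start=1):
        # Columns: qid, iter, product, rank, score, runID ("Q0" is the
        # conventional placeholder for the iter column in TREC run files).
        out.write(f"{qid}\tQ0\t{product_id}\t{rank}\t{score:.4f}\t{run_id}\n")

def write_query_outputs(out, qid, related, complements, substitutes, run_id):
    """For input query `qid`, emit the three derived qids: {qid}R, {qid}C, {qid}S."""
    write_run_rows(out, f"{qid}R", related[:100], run_id)     # 100-item pooling list
    write_run_rows(out, f"{qid}C", complements[:10], run_id)  # Top Complementary
    write_run_rows(out, f"{qid}S", substitutes[:10], run_id)  # Top Substitute

# Example: for input query 3, this produces rows such as
#   3R<TAB>Q0<TAB>B000PLACEHOLDER<TAB>1<TAB>12.3456<TAB>myrun
with open("myrun.tsv", "w") as out:
    write_query_outputs(out, "3",
                        related=[("B000PLACEHOLDER", 12.3456)],
                        complements=[("B000PLACEHOLDER", 11.0000)],
                        substitutes=[("B000PLACEHOLDER", 10.5000)],
                        run_id="myrun")
```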

Participant solutions are not restricted to the training data we provide — it is acceptable to enrich the track data with additional data sources such as the Amazon Review datasets for training or model operation.

Annotation and Relevance

Recommended items from submitted runs will be pooled and assessed by NIST assessors. Each item will be labeled with one of 4 categories (2 of which have graded labels):

Evaluation Metrics

The primary evaluation metric will be NDCG computed separately for each top-substitute and top-complement recommendation list. This will be aggregated in the following ways to produce submission-level metrics:

We will compute supplementary metrics including:
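
As an illustration of the primary metric, the sketch below computes NDCG for a single top-substitute or top-complement list. The gain values assigned to assessment categories, the log2 discount, and the cutoff are assumptions for illustration; official scoring (e.g., with trec_eval) may differ.

```python
import math

def ndcg(ranked_products, gains, k=10):
    """NDCG@k for a single top-substitute or top-complement list.

    `ranked_products` is the ranked list of product IDs from a run;
    `gains` maps product ID -> graded relevance gain from assessment.
    """
    dcg = sum(gains.get(p, 0) / math.log2(i + 2)
              for i, p in enumerate(ranked_products[:k]))
    ideal = sorted(gains.values(), reverse=True)[:k]
    idcg = sum(g / math.log2(i + 2) for i, g in enumerate(ideal))
    return dcg / idcg if idcg > 0 else 0.0

# Example: a short substitute list scored against hypothetical assessments.
print(ndcg(["ASIN1", "ASIN2", "ASIN3"], {"ASIN1": 2, "ASIN3": 1}, k=10))
```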

Timeline