Competitions

WSDM Cup 2023: Unbiased Learning & Pre-training for Web Search

Learning to Rank (LTR), which aims to measure documents' relevance to queries, is a popular research topic in information retrieval with wide practical use in web search engines, e-commerce, and streaming services. With the rise of deep learning, the heavy burden of data annotation has driven academia and industry to study learning to rank from implicit user feedback and to pre-train language models (PLMs) with self-supervised learning. However, directly optimizing a model on click data yields unsatisfactory performance due to the strong biases in implicit user feedback, such as position bias, trust bias, and click necessity bias. Unbiased learning to rank (ULTR) was therefore proposed to debias user feedback with counterfactual learning algorithms. Yet real-world user feedback can be far more complex than synthetic feedback generated under specific user behavior assumptions (e.g., a position-dependent click model), and ULTR algorithms that perform well on synthetic datasets may not perform consistently well in real-world scenarios.

Furthermore, it is nontrivial to directly apply recent advances in PLMs to web-scale search engine systems, since explicitly capturing the comprehensive relevance between queries and documents is crucial to the ranking task. Existing pre-training objectives, whether sequence-based tasks (e.g., masked token prediction) or sentence pair-based tasks (e.g., permuted language modeling), learn contextual representations from intra- or inter-sentence coherence, which cannot be straightforwardly adapted to model query-document relevance. Therefore, this competition focuses on unbiased learning to rank and pre-training for web search on a real long-tail user feedback dataset from Baidu Search. (Baidu is the largest Chinese search engine, with 632 million monthly active users, and has a strong ambition and responsibility to advance technical development in the community.)

  • TASK-1: Unbiased Learning for Web Search
    • For the unbiased learning to rank task, you are required to train a ranking model with the Large Scale Web Search Session Data. The Expert Annotation Dataset and any external datasets are not allowed for training the ranking model. After the submission deadline, awards will be granted only to teams whose code can reproduce their results.
  • TASK-2: Pre-training for Web Search
    • For the pre-training for web search task, you are required to pre-train a PLM with the Large Scale Web Search Session Data and fine-tune it with the Expert Annotation Dataset (here is the [PLM] for reference). After the submission deadline, awards will be granted only to teams whose code can reproduce their results.

    Description

    Dataset Characteristics

    Advanced Semantic Feature

    We provide the original text of queries and documents after desensitization, which makes it possible to construct advanced semantic features with pretrained language models. We also provide a series of large-scale pretrained language models trained with the MLM paradigm.

    Diverse Display Information

    Unlike other datasets, which provide only the ranking position, Baidu-ULTR provides richer information, including the displayed URL, the displayed title and abstract of each document, the document category, and the SERP height, enabling advanced study of biases beyond those related to ranking position. A detailed description can be found on the description page.

    Rich User Behaviors on Search Result Pages (SERPs)

    Both real-world user clicks and other user behaviors, e.g., skips, dwell time, and display time, have been recorded, offering the possibility of optimizing user engagement and exploring multi-task learning in ULTR. A detailed description can be found on the description page.

    Dataset Scale

    Baidu-ULTR contains 1.2 billion search sessions for training, which is sufficient to support training large-scale language models. 7,008 expert-annotated queries are available for evaluation, which is large enough to provide reliable performance estimates.

    Reference

    If you use this dataset or our reproduced results, please cite:


    A Large Scale Search Dataset for Unbiased Learning to Rank

    Lixin Zou*, Haitao Mao*, Xiaokai Chu, Jiliang Tang, Wenwen Ye, Shuaiqiang Wang, and Dawei Yin.(*: equal contributions)


    The BibTeX entry is as follows:

    @inproceedings{zou2022large,
        title={A Large Scale Search Dataset for Unbiased Learning to Rank},
        author={Lixin Zou and Haitao Mao and Xiaokai Chu and Jiliang Tang and Wenwen Ye and Shuaiqiang Wang and Dawei Yin},
        booktitle={NeurIPS 2022},
        year={2022}
    }
    Please contact us via zoulixin15@gmail.com or haitaoma@msu.edu if you have any concerns regarding this dataset.

    Download

    To download the Baidu-ULTR dataset, use the following links:

    • Download training set
    • Download test set
    • Download unigram dict

    Suppose you have downloaded the training data and test data.

    First, move all the gzipped files into the directory './data/train_data/', e.g.,

    mv yourpath/*.gz ./data/train_data/  

    Second, move the file part-00000.gz into './data/click_data/'; it will be treated as one of the validation sets.

    mv ./data/train_data/part-00000.gz ./data/click_data/part-00000.gz

    Finally, split the annotated data nips_annotation_data_0522.txt into test and validation sets, and move them into the directory './data/annotate_data/':

    mv test_data.txt ./data/annotate_data/
    mv val_data.txt ./data/annotate_data/
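    No official split script is provided; below is a minimal Python sketch that produces the two files the commands above then move, assuming a line-oriented file and a 50/50 random split (splitting at the query level instead would avoid query overlap between the two sets):

    import random

    random.seed(42)  # fixed seed for a reproducible split

    with open("nips_annotation_data_0522.txt", encoding="utf-8") as f:
        lines = f.readlines()

    random.shuffle(lines)
    mid = len(lines) // 2  # 50/50 split is an assumption; adjust the ratio as needed

    with open("val_data.txt", "w", encoding="utf-8") as f:
        f.writelines(lines[:mid])
    with open("test_data.txt", "w", encoding="utf-8") as f:
        f.writelines(lines[mid:])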

    Schema

    Train Data — Large Scale Web Search Session Data

    The search session is organized as:

    Qid, Query, Query Reformulation
    Pos 1, URL MD5, Title, Abstract, Multimedia Type, Click, -, -, Skip, SERP Height, Displayed Time, Displayed Time Middle, First Click, Displayed Count, SERP's Max Show Height, Slipoff Count After Click, Dwelling Time, Displayed Time Top, SERP to Top, Displayed Count Top, Displayed Count Bottom, Slipoff Count, -, Final Click, Displayed Time Bottom, Click Count, Displayed Count, -, Last Click, Reverse Display Count, Displayed Count Middle, -
    Pos 2, (same columns as Pos 1)
    ......
    Pos N, (same columns as Pos 1)
    # SERP is the abbreviation of search result page.

    Column Explanation Remark
    Qid The query id.
    Query The user-issued query. Sequential token ids separated by "\x01".
    Query Reformulation The subsequent queries issued by the user under the same search goal. Sequential token ids separated by "\x01".
    Pos The document's display order on the screen. [1,30]
    Url_md5 The md5 hash identifying the url.
    Title The title of the document. Sequential token ids separated by "\x01".
    Abstract A query-related brief introduction of the document under the title. Sequential token ids separated by "\x01".
    Multimedia Type The type of url, for example, advertisement, video, map. int
    Click Whether the user clicked the document. [0,1]
    - - -
    - - -
    Skip Whether the user skipped the document on the screen. [0,1]
    SERP Height The vertical pixels of the SERP on the screen. Continuous Value
    Displayed Time The document's display time on the screen. Continuous Value
    Displayed Time Middle The document's display time on the middle 1/3 of the screen. Continuous Value
    First Click The identifier of the user's first click in a query. [0,1]
    Displayed Count The document's display count on the screen. Discrete Number
    SERP's Max Show Height The max vertical pixels of the SERP on the screen. Continuous Value
    Slipoff Count After Click The number of slip-offs after the user clicks the document. Discrete Number
    Dwelling Time The length of time a user spends on a document after clicking its link on the SERP, but before returning to the SERP results. Continuous Value
    Displayed Time Top The document's display time on the top 1/3 of the screen. Continuous Value
    SERP to Top The vertical pixels from the SERP to the top of the screen. Continuous Value
    Displayed Count Top The document's display count on the top 1/3 of the screen. Discrete Number
    Displayed Count Bottom The document's display count on the bottom 1/3 of the screen. Discrete Number
    Slipoff Count The number of times the document was slipped off the screen. Discrete Number
    - - -
    Final Click The identifier of the user's last click in a query session. [0,1]
    Displayed Time Bottom The document's display time on the bottom 1/3 of the screen. Continuous Value
    Click Count The document's click count. Discrete Number
    Displayed Count The document's display count on the screen. Discrete Number
    - - -
    Last Click The identifier of the user's last click in a query. Discrete Number
    Reverse Display Count The document's display count when the user browses in reverse order, from bottom to top. Discrete Number
    Displayed Count Middle The document's display count on the middle 1/3 of the screen. Discrete Number
    - - -
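    A minimal Python sketch for streaming sessions from one training shard. It assumes (this page does not confirm the on-disk delimiters) that columns are tab-separated, that token ids inside a text field are joined by "\x01", and that a session header row carries exactly the three query columns:

    import gzip

    SEP = "\x01"  # assumed token-id separator inside text fields (see schema above)

    def read_sessions(path):
        """Yield (query_info, docs) pairs from one part-*.gz training shard."""
        with gzip.open(path, "rt", encoding="utf-8") as f:
            query_info, docs = None, []
            for line in f:
                cols = line.rstrip("\n").split("\t")
                if len(cols) == 3:  # assumed session header: qid, query, reformulation
                    if query_info is not None:
                        yield query_info, docs
                    qid, query, reformulation = cols
                    query_info = (qid, query.split(SEP), reformulation.split(SEP))
                    docs = []
                elif query_info is not None:  # one displayed document row
                    docs.append({
                        "pos": int(cols[0]),
                        "url_md5": cols[1],
                        "title": cols[2].split(SEP),
                        "abstract": cols[3].split(SEP),
                        "multimedia_type": cols[4],
                        "click": int(cols[5]),
                        # remaining behavior columns follow the order listed above
                    })
            if query_info is not None:
                yield query_info, docs

    For example, read_sessions('./data/click_data/part-00000.gz') streams sessions one at a time without loading the whole shard into memory.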

    Test Data — Expert Annotation Dataset for Validation

    The expert annotation dataset is organized as:

    Column Explanation Remark
    Query The user-issued query. Sequential token ids separated by "\x01".
    Title The title of the document. Sequential token ids separated by "\x01".
    Abstract A query-related brief introduction of the document under the title. Sequential token ids separated by "\x01".
    Label Expert annotation label. [0,4]
    Bucket The queries are split in descending order into 10 buckets according to their monthly search frequency; e.g., buckets 0, 1, and 2 are high-frequency queries while buckets 7, 8, and 9 are tail queries. [0,9]
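    A minimal loading sketch for the annotation file, again assuming tab-separated columns (the "\x01" token separator is stated in the schema above):

    SEP = "\x01"  # token-id separator, per the schema above

    def load_annotations(path):
        """Load expert-annotated rows as dicts, one per query-document pair."""
        rows = []
        with open(path, encoding="utf-8") as f:
            for line in f:
                query, title, abstract, label, bucket = line.rstrip("\n").split("\t")
                rows.append({
                    "query": query.split(SEP),
                    "title": title.split(SEP),
                    "abstract": abstract.split(SEP),
                    "label": int(label),    # graded relevance in [0, 4]
                    "bucket": int(bucket),  # frequency bucket in [0, 9]
                })
        return rows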

    Leaderboard

    Rule

    Performance is ranked by DCG@1.

    Rank Method DCG@1 ERR@1 DCG@3 ERR@3 DCG@5 ERR@5 DCG@10 ERR@10 Link
    1 DLA 1.293±0.015 0.081±0.001 2.839±0.011 0.137±0.001 3.976±0.007 0.160±0.001 6.236±0.017 0.181±0.001 [paper] [code]
    2 PairD 1.243±0.037 0.078±0.002 2.760±0.078 0.133±0.003 3.910±0.092 0.156±0.003 6.214±0.114 0.181±0.001 [paper] [code]
    3 IPW 1.239±0.038 0.077±0.002 2.742±0.076 0.133±0.003 3.896±0.087 0.156±0.003 6.170±0.124 0.178±0.003 [paper] [code]
    4 Naive 1.235±0.029 0.077±0.002 2.743±0.072 0.133±0.003 3.889±0.087 0.156±0.003 6.170±0.124 0.178±0.003 [paper] [code]
    5 REM 1.235±0.029 0.077±0.002 2.743±0.072 0.133±0.003 3.889±0.087 0.156±0.003 6.170±0.124 0.178±0.003 [paper] [code]
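    For reference, a minimal sketch of the metric under the common exponential-gain formulation DCG@k = sum_{i=1..k} (2^{rel_i} - 1) / log2(i + 1); the organizers' exact gain and discount convention is an assumption here:

    import math

    def dcg_at_k(labels, k):
        """DCG@k for a ranked list of graded relevance labels (assumed gain: 2^rel - 1)."""
        return sum((2 ** rel - 1) / math.log2(i + 2)
                   for i, rel in enumerate(labels[:k]))

    # Example: labels [4, 2, 0, 3, 1] at positions 1..5
    print(round(dcg_at_k([4, 2, 0, 3, 1], 5), 3))  # 20.294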

    Contact

    Please contact us via zoulixin15@gmail.com or haitaoma@msu.edu if you have any concerns.