A news recommendation evaluation framework

License: Apache License 2.0

Language: Java 100.00%

Topics: news, recommendation, evaluation, incremental, streaming, stream, framework, evaluation-metrics, recommender-system

streamingrec's Introduction

StreamingRec

An evaluation framework that simulates a real-life recommendation scenario, in which recommendation algorithms receive user clicks and newly published articles chronologically. After each click, algorithms generate recommendation lists, which are then evaluated by comparing them with the next user clicks. Most of the algorithms are implemented so that they learn incrementally from every click.

Research Paper

This framework is described in detail in the following research paper:

Michael Jugovac, Dietmar Jannach, and Mozhgan Karimi. 2018. StreamingRec: A Framework for Benchmarking Stream-based News Recommenders. In Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys '18), October 2–7, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3240323.3240384. Forthcoming.

A preprint is available on the author's homepage.

If you are using StreamingRec in your paper, please cite the above-mentioned reference in your bibliography.
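For convenience, a BibTeX entry assembled from the reference above could look as follows (field values are taken directly from the citation; please verify the final publication details, such as page numbers, against the ACM Digital Library):

@inproceedings{Jugovac2018StreamingRec,
  author    = {Michael Jugovac and Dietmar Jannach and Mozhgan Karimi},
  title     = {StreamingRec: A Framework for Benchmarking Stream-based News Recommenders},
  booktitle = {Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys '18)},
  year      = {2018},
  location  = {Vancouver, BC, Canada},
  publisher = {ACM},
  address   = {New York, NY, USA},
  numpages  = {5},
  doi       = {10.1145/3240323.3240384}
}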

Implemented recommendation algorithms

  • Most Popular
  • Recently Published
  • Recently Popular
  • Recently Clicked
  • Item-Item CF
  • Co-Occurrence
  • k-Nearest Neighbor
  • Sequential Pattern Mining
  • Lucene
  • Keyword-based
  • Bayesian Personalized Ranking

Running StreamingRec

Run on the command line

  1. Download the pre-compiled jar from the releases tab
  2. Acquire item and click CSV input files (see below)
  3. Create algorithm and metric JSON config files (see below)
  4. Run with java -jar StreamingRec.jar <parameters>
    • Commonly, at least the following parameters should be used: java -jar StreamingRec.jar --items=<path_to_item_meta_data_file> --clicks=<path_to_click_data_file> --algorithm-config=<path_to_algorithm_json_config_file> --metrics-config=<path_to_metrics_json_config_file> --session-inactivity-threshold (a complete example invocation is shown after this list)
    • For a full list and description of the available parameters, run with -h
    • For systems with limited RAM, adjusting the --thread-count=<N> parameter can help. By default, it is set to the number of available CPU cores - 1, but in general, fewer concurrent threads result in lower RAM usage.
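For example, assuming the item and click files were created in a local data/ folder (see the import instructions below) and the sample configuration files from the project's config folder are used, a complete invocation could look like this (all file paths are placeholders):

java -jar StreamingRec.jar --items=data/items.csv --clicks=data/clicks.csv --algorithm-config=config/algorithm-config-simple.json --metrics-config=config/metrics-config.json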

How to acquire input files (data sets)

Outbrain

The Outbrain data set is publicly available. To use it with StreamingRec, the following steps are necessary.

  1. Register an account at Kaggle and download the data set files from https://www.kaggle.com/c/outbrain-click-prediction/data. Only the following files are necessary:
    • page_views.csv.zip
    • documents_meta.csv.zip
    • documents_categories.csv.zip
    • documents_entities.csv.zip
    • documents_topics.csv.zip
  2. Put all above-mentioned files in one folder.
  3. Run the StreamingRec import script with at least the following parameters: java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=<id_of_publisher_to_be_extracted>
  4. After processing, two data set files (one for the item meta data and one for the click data) will be created that can be used with StreamingRec.

Plista / CLEF-NewsREEL Challenge

The 2016 Plista challenge data set is no longer publicly available. However, for researchers who still have access to it, the following steps need to be executed to use this data set.

  1. Collect all the .gz files of the data set in one folder
  2. Run the first stage import script with java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadPlista <parameters>. For help about the parameters run with -h
  3. This generates intermediate input files. To create the final input files, run the second stage import script with java -cp StreamingRec.jar tudo.streamingrec.data.loading.JoinPlistaTransactionsWithMetaInfo <parameters>. For help about the parameters run with -h
  4. After processing, two data set files (one for the item meta data and one for the click data) will be created that can be used with StreamingRec. The intermediate input files can be deleted.

How to configure algorithms and metrics

Algorithms and metrics are configured via JSON files, one for algorithms, one for metrics. Each of the files contains a JSON array that is made up of one JSON object per algorithm/metric. A general metrics configuration can be found in the project folder at config/metrics-config.json. For a quick test, a simple algorithm configuration is included at config/algorithm-config-simple.json.

To configure algorithms or metrics manually, a new JSON config file can be created. The format of the JSON object representing the algorithm or metric is always the same:

{
    "name": "Co-Occurrence", 
    "algorithm": ".FastSessionCoOccurrence", 
    "wholeSession": false 
}
  • name: the pretty-print name of the algorithm, which will appear as an identifier in the result list
  • algorithm: the qualified name of the algorithm class relative to the tudo.streamingrec.algorithms package
  • wholeSession: an additional parameter specific to this algorithm (Co-Occurrence)
    • "Additional parameters" are all parameters that the respective algorithm class offers via its setter methods. To find out which parameters are offered by which algorithm, consult the javadoc from the releases tab

Import and run in Eclipse

  1. Download/Clone the source code
  2. Import the project into Eclipse via Import -> Existing Maven Projects
  3. Follow the above-mentioned instructions on how to acquire input files and configure the algorithms and metrics
  4. To set parameters, adjust the class attributes directly in tudo.Framework.StreamingRec or set the call parameters via Eclipse's Run Configurations menu
  5. Run the class tudo.Framework.StreamingRec via Run as ... -> Java Application

Implementing new algorithms / metrics

Custom algorithms have to extend the class tudo.streamingrec.algorithms.Algorithm. The implementation shall update its model every time the method void trainInternal(List<Item> items, List<ClickData> transactions) is called and it shall generate a recommendation list every time the method LongArrayList recommend(ClickData clickData) is called.
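As an illustration, the following is a minimal sketch of a custom algorithm that simply recommends the most recently published items, newest first. The method signatures follow the description above; the packages of Item and ClickData, the id field of Item, and the exact names and visibility of the methods to override should be verified against the Algorithm base class and the javadoc:

package tudo.streamingrec.algorithms;

import java.util.List;

import it.unimi.dsi.fastutil.longs.LongArrayList;

import tudo.streamingrec.data.ClickData; // assumed package
import tudo.streamingrec.data.Item;      // assumed package

// Sketch of a custom algorithm (not part of the framework itself):
// recommends the most recently published items, newest first.
public class MostRecentlyPublished extends Algorithm {

    // Item ids, newest first; updated incrementally with every training call.
    private final LongArrayList recentItems = new LongArrayList();

    @Override
    protected void trainInternal(List<Item> items, List<ClickData> transactions) {
        // Learn incrementally: remember every newly published item.
        for (Item item : items) {
            recentItems.rem(item.id);    // avoid duplicates ("id" is an assumed field)
            recentItems.add(0, item.id); // newest first
        }
        // Click data (transactions) is ignored by this simple baseline.
    }

    @Override
    public LongArrayList recommend(ClickData clickData) {
        // Return a copy so that callers cannot modify the internal model.
        return new LongArrayList(recentItems);
    }
}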

Custom metrics have to extend the class tudo.streamingrec.evaluation.metrics.Metric. The implementation shall calculate, update, and store its current metric value every time the method void evaluate(Transaction transaction, LongArrayList recommendations, LongOpenHashSet userTransactions) is called. The final metric value shall be returned when the method double getResults() is called. In any case the value of the class attribute k shall be observed.
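Analogously, below is a minimal sketch of a custom metric: a hit rate at k, i.e., the fraction of evaluated transactions whose clicked item appears among the first k recommended items. The package of Transaction, the field access transaction.item.id, and the assumption that k is an integer cut-off accessible from subclasses should be verified against the actual classes:

package tudo.streamingrec.evaluation.metrics;

import it.unimi.dsi.fastutil.longs.LongArrayList;
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

import tudo.streamingrec.data.Transaction; // assumed package

// Sketch of a custom metric (not part of the framework itself): hit rate at k.
public class HitRateAtK extends Metric {

    private int hits = 0;
    private int evaluations = 0;

    @Override
    public void evaluate(Transaction transaction, LongArrayList recommendations, LongOpenHashSet userTransactions) {
        evaluations++;
        // Observe the cut-off k defined by the Metric base class.
        int cutoff = Math.min(k, recommendations.size());
        for (int i = 0; i < cutoff; i++) {
            if (recommendations.getLong(i) == transaction.item.id) { // assumed field access
                hits++;
                break;
            }
        }
    }

    @Override
    public double getResults() {
        // Average over all evaluated transactions (0 if nothing was evaluated yet).
        return evaluations == 0 ? 0 : hits / (double) evaluations;
    }
}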

License

Copyright [2017,2018] [Mozhgan Karimi, Michael Jugovac, Dietmar Jannach]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

streamingrec's People

Contributors

mjugo


streamingrec's Issues

Issue of pre-processing the outbrain dataset

Hi,

I tried to process the Outbrain dataset with 'java -cp StreamingRec.jar org.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=43' and got Events.csv. The first 5 rows are as follows:

Publisher Category ItemID Cookie Timestamp keywords
1707327 18787 1465876800836 NaN NaN NaN
1513276 134357 1465876801429 NaN NaN NaN
1766890 38208 1465876801758 NaN NaN NaN
1513276 136834 1465876802078 NaN NaN NaN
830700 2429 1465876802136 NaN NaN NaN

I reviewed the code and found that Publisher = document_id, Category = user_id, and ItemID = timestamp, which is really peculiar. Could you please explain why you are processing the dataset in such a way? Why is the Timestamp NaN instead of the true timestamp? Thanks a lot.

GRU4Rec evaluation

Hi Michael,

We were reading your paper and hoping to find the GRU4Rec evaluation, particularly: "For GRU4Rec, we employed a heuristic to sample from the more recent sessions to take recent temporal shifts into account, as proposed in [45]."

In [45], we were able to find the following:

"We propose a simple solution to get the best of both worlds via pre-training. We first train a model on the entire dataset. The trained model is then used to initialize a new model, which is only trained using only a more recent subset of the data, e.g. the last month worth of data out of a year of click sequences. This allows the model to have the benefit of a good initialization using large amounts of data, and yet is focused on more recent click-sequences. In this way, it resembles the fine-tuning process used in training of image-based networks [2], where the models are typically initialized by pre-training on ImageNet (a large image classification dataset) before the weights are fine-tuned on a smaller image dataset in the desired domain."

How was the sampling done in particular? Are there open-source template codes?

Thanks.
