A news recommendation evaluation framework

License: Apache License 2.0

Language: Java 100.00%

Topics: news, recommendation, evaluation, incremental, streaming, stream, framework, evaluation-metrics, recommender-system

streamingrec's Introduction

StreamingRec

An evaluation framework that simulates a real-life recommendation scenario, in which recommendation algorithms receive user clicks and newly published articles chronologically. After each click, algorithms generate recommendation lists, which are then evaluated by comparing them with the next user clicks. Most of the algorithms are implemented so that they learn incrementally from every click.

Research Paper

This framework is described in detail in the following research paper:

Michael Jugovac, Dietmar Jannach, and Mozhgan Karimi. 2018. StreamingRec: A Framework for Benchmarking Stream-based News Recommenders. In Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys '18), October 2–7, 2018, Vancouver, BC, Canada. ACM, New York, NY, USA, 5 pages. https://doi.org/10.1145/3240323.3240384. Forthcoming.

A preprint is available on the author's homepage.

If you are using StreamingRec in your paper, please cite the above-mentioned reference in your bibliography.
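For convenience, a BibTeX entry assembled from the reference above could look as follows (field values are taken directly from the citation; please verify the final publication details, such as page numbers, against the ACM Digital Library):

@inproceedings{Jugovac2018StreamingRec,
  author    = {Michael Jugovac and Dietmar Jannach and Mozhgan Karimi},
  title     = {StreamingRec: A Framework for Benchmarking Stream-based News Recommenders},
  booktitle = {Proceedings of the Twelfth ACM Conference on Recommender Systems (RecSys '18)},
  year      = {2018},
  location  = {Vancouver, BC, Canada},
  publisher = {ACM},
  address   = {New York, NY, USA},
  numpages  = {5},
  doi       = {10.1145/3240323.3240384}
}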

Implemented recommendation algorithms

  • Most Popular
  • Recently Published
  • Recently Popular
  • Recently Clicked
  • Item-Item CF
  • Co-Occurrence
  • k-Nearest Neighbor
  • Sequential Pattern Mining
  • Lucene
  • Keyword-based
  • Bayesian Personalized Ranking

Running StreamingRec

Run on the command line

  1. Download the pre-compiled jar from the releases tab
  2. Acquire item and click CSV input files (see below)
  3. Create algorithm and metric JSON config files (see below)
  4. Run with java -jar StreamingRec.jar <parameters>
    • Commonly, at least the following parameters should be used: java -jar StreamingRec.jar --items=<path_to_item_meta_data_file> --clicks=<path_to_click_data_file> --algorithm-config=<path_to_algorithm_json_config_file> --metrics-config=<path_to_metrics_json_config_file> --session-inactivity-threshold (a complete example invocation is shown after this list)
    • For a full list and description of the available parameters, run with -h
    • For systems with limited RAM, adjusting the --thread-count=<N> parameter can help. By default, it is set to the number of available CPU cores - 1, but in general, fewer concurrent threads result in lower RAM usage.
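For example, assuming the item and click files were created in a local data/ folder (see the import instructions below) and the sample configuration files from the project's config folder are used, a complete invocation could look like this (all file paths are placeholders):

java -jar StreamingRec.jar --items=data/items.csv --clicks=data/clicks.csv --algorithm-config=config/algorithm-config-simple.json --metrics-config=config/metrics-config.json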

How to acquire input files (data sets)

Outbrain

The Outbrain data set is publicly available. To use it with StreamingRec, the following steps are necessary.

  1. Register an account at Kaggle and download the data set files from https://www.kaggle.com/c/outbrain-click-prediction/data. Only the following files are necessary:
    • page_views.csv.zip
    • documents_meta.csv.zip
    • documents_categories.csv.zip
    • documents_entities.csv.zip
    • documents_topics.csv.zip
  2. Put all above-mentioned files in one folder.
  3. Run the StreamingRec import script with at least the following parameters: java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=<id_of_publisher_to_be_extracted>
  4. After processing, two data set files (one for the item meta data and one for the click data) will be created that can be used with StreamingRec.

Plista / CLEF-NewsREEL Challenge

The 2016 Plista challenge data set is no longer publicly available. However, for researchers who still have access to it, the following steps need to be executed to use this data set.

  1. Collect all the .gz files of the data set in one folder
  2. Run the first stage import script with java -cp StreamingRec.jar tudo.streamingrec.data.loading.ReadPlista <parameters>. For help about the parameters run with -h
  3. This generates intermediate input files. To create the final input files, run the second stage import script with java -cp StreamingRec.jar tudo.streamingrec.data.loading.JoinPlistaTransactionsWithMetaInfo <parameters>. For help about the parameters run with -h
  4. After processing, two data set files (one for the item meta data and one for the click data) will be created that can be used with StreamingRec. The intermediate input files can be deleted.

How to configure algorithms and metrics

Algorithms and metrics are configured via JSON files, one for algorithms, one for metrics. Each of the files contains a JSON array that is made up of one JSON object per algorithm/metric. A general metrics configuration can be found in the project folder at config/metrics-config.json. For a quick test, a simple algorithm configuration is included at config/algorithm-config-simple.json.

To configure algorithms or metrics manually, a new JSON config file can be created. The format of the JSON object representing the algorithm or metric is always the same:

{
    "name": "Co-Occurrence", 
    "algorithm": ".FastSessionCoOccurrence", 
    "wholeSession": false 
}
  • name: the pretty-print name of the algorithm, which will appear as an identifier in the result list
  • algorithm: the qualified name of the algorithm class relative to the tudo.streamingrec.algorithms package
  • wholeSession: an additional parameter specific to this algorithm (Co-Occurrence)
    • "Additional parameters" are all parameters that the respective algorithm class offers via its setter methods. To find out which parameters are offered by which algorithm, consult the javadoc from the releases tab

Import and run in Eclipse

  1. Download/Clone the source code
  2. Import the project into Eclipse via Import -> Existing Maven Projects
  3. Follow the above-mentioned instructions on how to acquire input files and configure the algorithms and metrics
  4. To set parameters, adjust the class attributes directly in tudo.Framework.StreamingRec or set the call parameters via Eclipse's Run Configurations menu
  5. Run the class tudo.Framework.StreamingRec via Run as ... -> Java Application

Implementing new algorithms / metrics

Custom algorithms have to extend the class tudo.streamingrec.algorithms.Algorithm. The implementation shall update its model every time the method void trainInternal(List<Item> items, List<ClickData> transactions) is called and it shall generate a recommendation list every time the method LongArrayList recommend(ClickData clickData) is called.
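As an illustration, the following is a minimal sketch of a custom algorithm that simply recommends the most recently published items, newest first. The method signatures follow the description above; the packages of Item and ClickData, the id field of Item, and the exact names and visibility of the methods to override should be verified against the Algorithm base class and the javadoc:

package tudo.streamingrec.algorithms;

import java.util.List;

import it.unimi.dsi.fastutil.longs.LongArrayList;

import tudo.streamingrec.data.ClickData; // assumed package
import tudo.streamingrec.data.Item;      // assumed package

// Sketch of a custom algorithm (not part of the framework itself):
// recommends the most recently published items, newest first.
public class MostRecentlyPublished extends Algorithm {

    // Item ids, newest first; updated incrementally with every training call.
    private final LongArrayList recentItems = new LongArrayList();

    @Override
    protected void trainInternal(List<Item> items, List<ClickData> transactions) {
        // Learn incrementally: remember every newly published item.
        for (Item item : items) {
            recentItems.rem(item.id);    // avoid duplicates ("id" is an assumed field)
            recentItems.add(0, item.id); // newest first
        }
        // Click data (transactions) is ignored by this simple baseline.
    }

    @Override
    public LongArrayList recommend(ClickData clickData) {
        // Return a copy so that callers cannot modify the internal model.
        return new LongArrayList(recentItems);
    }
}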

Custom metrics have to extend the class tudo.streamingrec.evaluation.metrics.Metric. The implementation shall calculate, update, and store its current metric value every time the method void evaluate(Transaction transaction, LongArrayList recommendations, LongOpenHashSet userTransactions) is called. The final metric value shall be returned when the method double getResults() is called. In any case the value of the class attribute k shall be observed.
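Analogously, below is a minimal sketch of a custom metric: a hit rate at k, i.e., the fraction of evaluated transactions whose clicked item appears among the first k recommended items. The package of Transaction, the field access transaction.item.id, and the assumption that k is an integer cut-off accessible from subclasses should be verified against the actual classes:

package tudo.streamingrec.evaluation.metrics;

import it.unimi.dsi.fastutil.longs.LongArrayList;
import it.unimi.dsi.fastutil.longs.LongOpenHashSet;

import tudo.streamingrec.data.Transaction; // assumed package

// Sketch of a custom metric (not part of the framework itself): hit rate at k.
public class HitRateAtK extends Metric {

    private int hits = 0;
    private int evaluations = 0;

    @Override
    public void evaluate(Transaction transaction, LongArrayList recommendations, LongOpenHashSet userTransactions) {
        evaluations++;
        // Observe the cut-off k defined by the Metric base class.
        int cutoff = Math.min(k, recommendations.size());
        for (int i = 0; i < cutoff; i++) {
            if (recommendations.getLong(i) == transaction.item.id) { // assumed field access
                hits++;
                break;
            }
        }
    }

    @Override
    public double getResults() {
        // Average over all evaluated transactions (0 if nothing was evaluated yet).
        return evaluations == 0 ? 0 : hits / (double) evaluations;
    }
}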

License

Copyright [2017,2018] [Mozhgan Karimi, Michael Jugovac, Dietmar Jannach]

Licensed under the Apache License, Version 2.0 (the "License"); you may not use this file except in compliance with the License. You may obtain a copy of the License at

http://www.apache.org/licenses/LICENSE-2.0

Unless required by applicable law or agreed to in writing, software distributed under the License is distributed on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the License for the specific language governing permissions and limitations under the License.

streamingrec's People

Contributors

mjugo


streamingrec's Issues

Issue of pre-processing the outbrain dataset

Hi,

I tried to process the Outbrain dataset with 'java -cp StreamingRec.jar org.streamingrec.data.loading.ReadOutbrain --input-folder=<folder_to_outbrain_files> --out-items=<path_to_item_output_file> --out-clicks=<path_to_clicks_output_file> --publisher=43' and got Events.csv. The first 5 rows are as follows:

Publisher Category ItemID Cookie Timestamp keywords
1707327 18787 1465876800836 NaN NaN NaN
1513276 134357 1465876801429 NaN NaN NaN
1766890 38208 1465876801758 NaN NaN NaN
1513276 136834 1465876802078 NaN NaN NaN
830700 2429 1465876802136 NaN NaN NaN

I reviewed the code and found that Publisher = document_id, Category = user_id, and ItemID = timestamp, which is really peculiar. Could you please explain why you are processing the dataset in such a way? Why is the Timestamp NaN instead of the true timestamp? Thanks a lot.

GRU4Rec evaluation

Hi Michael,

We were reading your paper and hoping to find the GRU4Rec evaluation, particularly: "For GRU4Rec, we employed a heuristic to sample from the more recent sessions to take recent temporal shifts into account, as proposed in [45]."

In [45], we were able to find the following:

"We propose a simple solution to get the best of both worlds via pre-training. We first train a model on the entire dataset. The trained model is then used to initialize a new model, which is only trained using only a more recent subset of the data, e.g. the last month worth of data out of a year of click sequences. This allows the model to have the benefit of a good initialization using large amounts of data, and yet is focused on more recent click-sequences. In this way, it resembles the fine-tuning process used in training of image-based networks [2], where the models are typically initialized by pre-training on ImageNet (a large image classification dataset) before the weights are fine-tuned on a smaller image dataset in the desired domain."

How was the sampling done in particular? Are there open-source template codes?

Thanks.
