How to integrate a recommendation micro-service
Steps to implement a dockerize application for participating in Living Lab for evaluating your REC systems in a live environment of gesis search which is a search engine for finding information about social science research data and open access publications.
Table of Contents
1. [Data](#0)
2. [Implementing Ranking Algorithm](#1)
3. [Implementing Dockerize Flask App](#2)
4. [Next Steps](#3)
<hr>
0. Prerequisites
Before starting this tutorial, make sure all requirements in the README.md are fulfilled.
1.Data
- A corpus od publication 93k and Research data 83k metadata from GESIS Leibniz Institute for the Social Sciences
- Metadata in different languages (mixed and separated)
!cd data && mkdir gesis-search && mkdir gesis-search/datasets && mkdir gesis-search/documents
!wget -O gesis-search/datasets/dataset.jsonl \
https://th-koeln.sciebo.de/s/OBm0NLEwz1RYl9N/download?path=%2Fgesis-search%2Fdatasets&files=dataset.jsonl -Q --show-progress
!wget -O gesis-search/publications.jsonl \
https://th-koeln.sciebo.de/s/OBm0NLEwz1RYl9N/download?path=%2Fgesis-search%2Fdocuments&files=publication.jsonl -Q --show-progress
!chown -R 775 data/*
PATH = "./data/gesis-search/"
import json
import jsonlines
import pandas as pd
import numpy as np
import pickle
import random
pd.set_option("display.max_columns", None)
with jsonlines.open(PATH+"publications/publication.jsonl") as f :
pub = [obj for obj in f]
with jsonlines.open(PATH+"datasets/dataset.jsonl") as f2 :
dataset = [obj for obj in f2]
pubdf = pd.DataFrame(pub)
pubdf = pubdf.set_index('id')
datasetdf = pd.DataFrame(dataset)
datasetdf = datasetdf.set_index('id')
print("Number of publication: ",len(pubdf))
print("Metadata are: " ,list(pubdf.columns))
pubdf.head(3)
Number of publication: 93953
Metadata are: ['title', 'abstract', 'topic', 'person', 'links', 'subtype', 'document_type', 'coreAuthor', 'doi', 'date']
title | abstract | topic | person | links | subtype | document_type | coreAuthor | doi | date | |
---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||
gesis-ssoar-1002 | New Concerns, More Cooperation? How Non-Tradit... | None | [Indien, Wirtschaftsbeziehungen, bilaterale Be... | [Biba, Sebastian] | [{'label': 'Link', 'link': 'https://journals.s... | journal_article | Zeitschriftenaufsatz | [Biba, Sebastian] | None | 2016 |
gesis-ssoar-1006 | Buddhism in Current China-India Diplomacy | Buddhism is being emphasised strongly in both ... | [China, Indien, bilaterale Beziehungen, Außenp... | [Scott, David] | [{'label': 'Link', 'link': 'https://journals.s... | journal_article | Zeitschriftenaufsatz | [Scott, David] | None | 2016 |
gesis-ssoar-10066 | Zukunftsaufgaben der Humanisierung des Arbeits... | Das seit 1974 vom BMFT geförderte Programm "Fr... | [Arbeitswelt, Technik, Rationalisierung, Arbei... | [Altmann, Norbert, Düll, Klaus, Lutz, Burkart] | [{'label': 'Link', 'link': 'http://www.ssoar.i... | book | Buch | [Altmann, Norbert, Düll, Klaus, Lutz, Burkart] | None | 1987 |
print("Number of Research Data: ",len(datasetdf))
print("Metadata are: " ,list(datasetdf.columns))
datasetdf.head(3)
Number of Research Data: 83225
Metadata are: ['title', 'subtype', 'abstract', 'person', 'time_collection', 'countries_collection', 'methodology_collection', 'universe', 'selection_method', 'doi', 'publication_year', 'topic']
title | subtype | abstract | person | time_collection | countries_collection | methodology_collection | universe | selection_method | doi | publication_year | topic | |
---|---|---|---|---|---|---|---|---|---|---|---|---|
id | ||||||||||||
ZA0018 | Einstellung zur Wehrbereitschaft und Demokrati... | dbk | Vergleichsstudie bei der Zivilbevölkerung zu e... | None | 10.1960 - 11.1960 | [Deutschland] | Mündliche Befragung mit standardisiertem Frage... | Alter: 16 Jahre und älter. | Mehrstufige Zufallsauswahl \r\n | doi:10.4232/1.11581 | 2013 | [Konflikte, Sicherheit und Frieden, Politische... |
ZA0025 | Einstellung zur Monarchie (Niederlande)\r\n | dbk | Einstellung der Niederländer zu den Deutschen ... | None | 06.1965 - 07.1965 | [Niederlande] | Mündliche Befragung mit standardisiertem Frage... | Alter: 20 Jahre und älter | Quotenauswahl | doi:10.4232/1.0025 | 1965 | [Politische Verhaltensweisen und Einstellungen... |
ZA0042 | Politische Einstellungen (Juni 1966)\r\n | dbk | Beurteilung der Parteien.<br/><br/>Themen: Beu... | None | 06.1966 - 07.1966 | [Deutschland] | Mündliche Befragung mit standardisiertem Frage... | Alter: 16-79 Jahre | Mehrstufige Zufallsauswahl \r\n | doi:10.4232/1.0042 | 1966 | [Politische Verhaltensweisen und Einstellungen... |
2. Implementing the Recommendation Algorithm
In the following we implement a simple app to create randomize recommendation for every publication. You will find out how simple it is.
idx = []
with jsonlines.open('./data/gesis-search/datasets/dataset.jsonl') as reader:
for obj in reader:
idx.append(obj.get('id'))
def recommend_datasets(item_id, page, rpp):
itemlist = random.choices(idx, k=rpp)
return {
'page': page,
'rpp': rpp,
'item_id': item_id,
'itemlist': itemlist,
'num_found': len(itemlist)
}
recommend_datasets("gesis-ssoar-1002", 1, 5)
{ "page": 1,
"rpp": 5,
"item_id": "gesis-ssoar-1002",
"itemlist": ["datasearch-httpseasy-dans-knaw-nloai--oaieasy-dans-knaw-nleasy-dataset32489",
"datasearch-httpseasy-dans-knaw-nloai--oaieasy-dans-knaw-nleasy-dataset76673",
"ZA5859",
"ZA8682",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de4527"],
"num_found": 5}
3.Implementing Dockerize Flask App
Application structure
API Endpoints
Endpoints
To connect your system with STELLA, your container has to provide endpoints according to our interface. Most importantly, your system has to implement an indexing-endpoint. This endpoint is called when your system is used for the first time. It builds a search-index from the data provided by the site. The recommendations endpoints must return an ordered list of document-ids in JSON format. If you provide these endpoints, your results will be integrated seamlessly into the sites’ pages.
-
GET /test
: print the name o container -
GET /index
: index the data for retrieval GET /recommendation/datasets?<string:item_id>
: Retrieve a ranking corresponding to the query specified at the endpoint. A JSON object with maximally 10 entries will be returned.
app.py
you don't need to change this file
from flask import Flask, request, jsonify
from systems import Ranker, Recommender
app = Flask(__name__)
ranker = Ranker()
recommender = Recommender()
@app.route('/test', methods=["GET"])
def test():
return 'Container is running', 200
@app.route('/index', methods=["GET"])
def index():
ranker.index()
recommender.index()
return 'Indexing done!', 200
@app.route('/ranking', methods=["GET"])
def ranking():
query = request.args.get('query', None)
page = request.args.get('page', default=0, type=int)
rpp = request.args.get('rpp', default=20, type=int)
response = ranker.rank_publications(query, page, rpp)
return jsonify(response)
@app.route('/recommendation/datasets', methods=["GET"])
def rec_data():
item_id = request.args.get('item_id', None)
page = request.args.get('page', default=0, type=int)
rpp = request.args.get('rpp', default=20, type=int)
response = recommender.recommend_datasets(item_id, page, rpp)
return jsonify(response)
@app.route('/recommendation/publications', methods=["GET"])
def rec_pub():
item_id = request.args.get('item_id', None)
page = request.args.get('page', default=0, type=int)
rpp = request.args.get('rpp', default=20, type=int)
response = recommender.recommend_publications(item_id, page, rpp)
return jsonify(response)
if __name__ == '__main__':
app.run(host='0.0.0.0', port=5000, debug=True)
system.py
import jsonlines
import random
class Ranker(object):
def __init__(self):
self.idx = None
def index(self):
pass
def rank_publications(self, query, page, rpp):
itemlist = []
return {
'page': page,
'rpp': rpp,
'query': query,
'itemlist': itemlist,
'num_found': len(itemlist)
}
class Recommender(object):
def __init__(self):
self.idx = None
def index(self):
self.idx = []
with jsonlines.open('./data/gesis-search/datasets/dataset.jsonl') as reader:
for obj in reader:
self.idx.append(obj.get('id'))
def recommend_datasets(self, item_id, page, rpp):
# implement your ranking algorithm here!
itemlist = random.choices(self.idx, k=rpp)
return {
'page': page,
'rpp': rpp,
'item_id': item_id,
'itemlist': itemlist,
'num_found': len(itemlist)
}
def recommend_publications(self, item_id, page, rpp):
itemlist = []
return {
'page': page,
'rpp': rpp,
'item_id': item_id,
'itemlist': itemlist,
'num_found': len(itemlist)
}
DockerFile
FROM python:3.7
COPY requirements.txt requirements.txt
RUN python -m pip install -r requirements.txt
COPY . .
ENTRYPOINT python3 app.py
Running the App
$ cd gesis_rec_micro
$ docker build -t participant/random-rec .
$ docker run -p 5000:5000 participant/random-rec
Test the app
{
"item_id": "gesis-ssoar-44449",
"itemlist": [
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de542142",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de462150",
"datasearch-httpseasy-dans-knaw-nloai--oaieasy-dans-knaw-nleasy-dataset51047",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de438799",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de585980",
"datasearch-httpsoai-datacite-orgoai--oaioai-datacite-org57070",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de438015",
"datasearch-httpsoai-datacite-orgoai--oaioai-datacite-org15413441",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de519570",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de7788",
"datasearch-httpsdataverse-unc-eduoai--hdl1902-29H-792102",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de549431",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de449775",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de450194",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de656781",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de434948",
"datasearch-httpwww-da-ra-deoaip--oaioai-da-ra-de433497",
"ZA7177",
"ZA8333",
"datasearch-httpseasy-dans-knaw-nloai--oaieasy-dans-knaw-nleasy-dataset35905"
],
"num_found": 20,
"page": 0,
"rpp": 20
}