Finding similar foods to change the unhealthy diet using Knowledge Graphs

This project integrates an important publicly available dataset on foods, Open Food Facts. In the first phase, I transformed the dataset into a unified Knowledge Graph using YARRRML, a YAML-based language for writing mapping rules. I then applied AMIE, a rule mining algorithm for knowledge graphs, to infer new rules and increase the completeness of the graph. There is often a trade-off between coverage and correctness: since uncertain rules do not produce true facts exclusively (otherwise they would be called certain rules), applying them also creates some wrong relations.

Data

Open Food Facts is a collaborative project built by tens of thousands of volunteers and managed by a non-profit organization. It consists of a database of food products with ingredients, allergens, nutrition facts and all the tidbits of information we can find on product labels. The database gathers around 1,637,564 products. CSV Data Export: https://static.openfoodfacts.org/data/en.openfoodfacts.org.products.csv

➡️ products_original.csv

To install

download latest rmlmapper:
https://github.com/RMLio/rmlmapper-java/releases

follow these steps for YARRRML:
https://rml.io/yarrrml/tutorial/getting-started/#writing-rules-on-your-computer

download latest amie3:
https://github.com/lajus/amie/releases/tag/3.0

download latest graphdb:
https://graphdb.ontotext.com/

Query Editor (Excel) Cleaning

Filter all products where 'states' value includes 'en:complete'
Filter all products which have a 'nutriscore_score' value
Delete all unnecessary columns
Convert all number-type columns to text columns (to avoid scientific notation such as 8,4E14)
Delete all backslashes

➡️ products_clean.csv
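The same filters can be approximated outside Excel. The following is a minimal sketch using only the Python standard library, not the procedure used for this project: the function name `clean_rows` and the `keep_columns` argument are hypothetical, and it assumes 'states' holds a comma-separated list of tags.

```python
def clean_rows(rows, keep_columns):
    """Yield cleaned rows: complete products that have a Nutri-Score value.

    `rows` is any iterable of dicts, e.g. the rows of a csv.DictReader.
    """
    for row in rows:
        # Assumption: 'states' is a comma-separated list of tags.
        states = [s.strip() for s in (row.get("states") or "").split(",")]
        if "en:complete" not in states:
            continue
        # Skip products without a 'nutriscore_score' value.
        if not (row.get("nutriscore_score") or "").strip():
            continue
        # Keep only the wanted columns, as text, with backslashes removed.
        yield {
            col: str(row.get(col) or "").replace("\\", "")
            for col in keep_columns
        }
```

Reading every cell as a string also sidesteps the scientific-notation problem, since nothing is ever parsed as a number.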

Python Cleaning

Remove all unnecessary information (e.g. quantity, proportion in 'ingredients_text')
Create dataset of categories, ingredients, countries based on all products
1. Remove duplicated values
2. Sort the values alphabetically
3. Create 'id' column based on the label (because label can contain spaces)
4. Export to CSV

➡️ categories.csv ingredients.csv countries.csv
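The four steps above can be sketched as follows. This is an illustration under assumptions, not the project's actual script: the helper names (`make_id`, `build_lookup`, `to_csv`) are hypothetical, and the id scheme (lowercase, hyphen-separated) is only inferred from IRIs such as `category/fermented-milk-products` used later in this README. Note that this simple pattern also replaces accented letters with hyphens.

```python
import csv
import io
import re


def make_id(label):
    """Turn a label into an IRI-safe id: lowercase, other characters become '-'."""
    return re.sub(r"[^a-z0-9]+", "-", label.strip().lower()).strip("-")


def build_lookup(labels):
    """Deduplicate and alphabetically sort labels, then attach an id to each."""
    unique = sorted({label.strip() for label in labels if label.strip()})
    return [{"id": make_id(label), "label": label} for label in unique]


def to_csv(rows):
    """Serialize lookup rows to CSV text with an 'id,label' header."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=["id", "label"], lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

For example, `to_csv(build_lookup(all_category_labels))` would produce the content of a file like categories.csv, where `all_category_labels` stands for the labels collected from every product.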

Create dataset of products
1. Create class Product
2. Generate all products
3. Export to JSON

➡️ products.json

🤔 The dataset of products is exported to JSON because YARRRML does not support using the function grel:string_split for the creation of new IRIs with a CSV dataset.
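A possible shape for the Product class and the JSON export is sketched below. The field names, the `clean_ingredients` helper and its regular expressions are assumptions made for illustration; the original script may differ. Multi-valued fields are stored as JSON arrays, so each value can be mapped to its own IRI without any string splitting in the mapping rules.

```python
import json
import re
from dataclasses import asdict, dataclass, field


def clean_ingredients(text):
    """Split an ingredients string and drop quantities such as '30%' or '(12 g)'."""
    text = re.sub(r"\([^)]*\)", "", text)  # parenthesised details
    text = re.sub(
        r"\d+(?:[.,]\d+)?\s*(?:%|(?:g|kg|ml|cl|l)\b)", "", text, flags=re.I
    )  # proportions and quantities
    parts = (" ".join(part.split()).lower() for part in text.split(","))
    return [part for part in parts if part]


@dataclass
class Product:
    id: str
    label: str
    nutriscore: str
    categories: list = field(default_factory=list)
    ingredients: list = field(default_factory=list)
    countries: list = field(default_factory=list)


def export_products(products):
    """Serialize a list of Product objects to a JSON array."""
    return json.dumps([asdict(p) for p in products], ensure_ascii=False, indent=2)
```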

Mapping (YARRRML to RML to RDF)

Product

yarrrml-parser -i products.yarrrml.yml -o products.rml.ttl
java -jar _rmlmapper.jar -m products.rml.ttl -o products.ttl -s turtle

➡️ products.ttl

Ingredients

yarrrml-parser -i ingredients.yarrrml.yml -o ingredients.rml.ttl
java -jar _rmlmapper.jar -m ingredients.rml.ttl -o ingredients.ttl -s turtle

➡️ ingredients.ttl

Countries

yarrrml-parser -i countries.yarrrml.yml -o countries.rml.ttl
java -jar _rmlmapper.jar -m countries.rml.ttl -o countries.ttl -s turtle

➡️ countries.ttl
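Since the two commands are identical for every dataset apart from the file names, they can be scripted. The sketch below only wraps the commands shown above; the function names are hypothetical, and it assumes `yarrrml-parser` and `java` are on the PATH and that `_rmlmapper.jar` sits in the working directory.

```python
import subprocess


def mapping_commands(name, mapper_jar="_rmlmapper.jar"):
    """Return the YARRRML -> RML and RML -> RDF commands for one dataset."""
    rules = f"{name}.yarrrml.yml"
    rml = f"{name}.rml.ttl"
    output = f"{name}.ttl"
    return [
        ["yarrrml-parser", "-i", rules, "-o", rml],
        ["java", "-jar", mapper_jar, "-m", rml, "-o", output, "-s", "turtle"],
    ]


def run_all(names):
    """Run both steps for every dataset, stopping at the first failure."""
    for name in names:
        for command in mapping_commands(name):
            subprocess.run(command, check=True)
```

Calling `run_all(["products", "ingredients", "countries"])` would then reproduce the three mappings in order.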

GraphDB

Initialization of the graph

Create new repository
Import RDF files (products.ttl, categories.ttl, ingredients.ttl, countries.ttl) with http://kg-course/mapping for the target named graphs

Visualize the graph

Explore > Visual Graph
Search RDF resources (e.g. https://w3id.org/um/ken4256/category/cakes)

Linking - product to product

java -jar _limes.jar _config_limes.xml

AMIE - rule mining

# default run
java -jar _amie.jar statements.tsv
# save into file
java -jar _amie.jar -oute statements.tsv -maxad 3 -minis 1 -optimai > output/rules_default.out
# help / documentation
java -jar _amie.jar -h
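The trade-off between coverage and correctness mentioned in the introduction can be made concrete with a toy example. AMIE reports, among other measures, the support and the standard confidence of each rule. The sketch below computes both for a rule of the simple form body(x, y) => head(x, y); the triples and relation names are invented for illustration and are not taken from statements.tsv.

```python
# Toy triples (subject, relation, object); invented data.
TRIPLES = [
    ("p1", "shopping_category", "yogurts"), ("p1", "mainCategory", "yogurts"),
    ("p2", "shopping_category", "yogurts"), ("p2", "mainCategory", "yogurts"),
    ("p3", "shopping_category", "dairies"), ("p3", "mainCategory", "yogurts"),
    ("p4", "shopping_category", "yogurts"),
]


def rule_stats(triples, body_relation, head_relation):
    """Score the rule body_relation(x, y) => head_relation(x, y)."""
    body = {(s, o) for s, p, o in triples if p == body_relation}
    head = {(s, o) for s, p, o in triples if p == head_relation}
    support = len(body & head)  # predictions already present in the graph
    confidence = support / len(body) if body else 0.0  # standard confidence
    new_facts = body - head  # facts the rule would add to the graph
    return support, confidence, new_facts
```

With this data the rule has a support of 2 and a standard confidence of 0.5. Applying it would add two facts: a main category for p4, which is plausible, and a second main category for p3, which is wrong. A lower confidence threshold increases coverage but lets more of these wrong relations into the graph.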

SPARQL REQUESTS

Prefixes of all requests

PREFIX : <http://mapping.example.com/>
PREFIX schema: <https://schema.org/>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbo: <http://dbpedia.org/ontology/>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
PREFIX ken: <https://w3id.org/um/ken4256/>
PREFIX fo: <https://www.bbc.co.uk/ontologies/fo/>
PREFIX food: <http://data.lirmm.fr/ontologies/food#>
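The queries below can also be sent from a script instead of the GraphDB workbench. This is a hedged sketch for SELECT queries only: it assumes GraphDB runs locally on its default port 7200, and the repository id passed by the caller (for instance "food") is a placeholder for whatever name was chosen when the repository was created. The function name `build_request` is hypothetical.

```python
import urllib.parse
import urllib.request

# The prefixes listed above, prepended to every query.
PREFIXES = "\n".join([
    "PREFIX : <http://mapping.example.com/>",
    "PREFIX schema: <https://schema.org/>",
    "PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>",
    "PREFIX dbo: <http://dbpedia.org/ontology/>",
    "PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>",
    "PREFIX ken: <https://w3id.org/um/ken4256/>",
    "PREFIX fo: <https://www.bbc.co.uk/ontologies/fo/>",
    "PREFIX food: <http://data.lirmm.fr/ontologies/food#>",
]) + "\n"


def build_request(query, repository, base_url="http://localhost:7200"):
    """Build (but do not send) a SPARQL protocol POST request."""
    body = urllib.parse.urlencode({"query": PREFIXES + query}).encode("utf-8")
    return urllib.request.Request(
        f"{base_url}/repositories/{repository}",
        data=body,
        headers={
            "Accept": "application/sparql-results+json",
            "Content-Type": "application/x-www-form-urlencoded",
        },
        method="POST",
    )
```

Passing the result to `urllib.request.urlopen` and decoding the response with `json.load` returns the bindings in the standard SPARQL JSON results format.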

Embedding: Creation of similarHealthierProduct relations between products

Added 501061 statements in the graph: http://kg-course/embedding. Update took 1h 44m 20s.

INSERT {
    GRAPH <http://kg-course/embedding>
    { ?product2 ken:similarHealthierProduct ?product1 }
}
# SELECT DISTINCT ?product1 ?product1Nutriscore ?product2 ?product2Nutriscore
WHERE { 
	?product1 a schema:Product ;
		rdfs:label ?product1Label ;
		food:nutritionScoreFrPer100g ?product1Nutriscore ;
		ken:mainCategory ?mainCategory1 ;
		fo:shopping_category ?category1 ;
		fo:shopping_category ?category11 .
	?category1 a fo:ShoppingCategory ;
		rdfs:label ?category1Label .
	?product2 a schema:Product ;
		rdfs:label ?product2Label ;
		food:nutritionScoreFrPer100g ?product2Nutriscore ;
		ken:mainCategory ?mainCategory2 ;
		fo:shopping_category ?category2 ;
		fo:shopping_category ?category22 .
	?category2 a fo:ShoppingCategory ;
		rdfs:label ?category2Label .

	FILTER(?product1Nutriscore = <https://w3id.org/um/ken4256/nutriscore/a> || ?product1Nutriscore = <https://w3id.org/um/ken4256/nutriscore/b>)
	FILTER(?product2Nutriscore != <https://w3id.org/um/ken4256/nutriscore/a> && ?product2Nutriscore != <https://w3id.org/um/ken4256/nutriscore/b>)
    
	FILTER(?product1 != ?product2)
    
	FILTER(?mainCategory1 != ?category1)
	FILTER(?mainCategory1 != ?category11)
	FILTER(?category1 != ?category11)
    
	FILTER(?mainCategory2 != ?category2)
	FILTER(?mainCategory2 != ?category22)
	FILTER(?category2 != ?category22)
    
	FILTER(?mainCategory1 = ?mainCategory2)
	FILTER(?category1 = ?category2)
	FILTER(?category11 = ?category22)
}
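The logic of this update is easier to follow on in-memory data: a product with nutriscore C, D or E is linked to every product with nutriscore A or B that shares its main category and at least two other shopping categories. The sketch below mirrors that logic in Python; the input structure and the function name are hypothetical, and it is not how the relations were actually created (that was done with the SPARQL update above).

```python
from collections import defaultdict
from itertools import combinations

HEALTHY = {"a", "b"}


def similar_healthier_pairs(products):
    """Return (less_healthy_id, healthier_id) pairs.

    `products` maps an id to a dict with keys 'nutriscore' (str),
    'main' (main category) and 'categories' (set of shopping categories).
    """
    groups = defaultdict(lambda: {"healthy": set(), "other": set()})
    for pid, info in products.items():
        # Shopping categories other than the main category.
        extra = sorted(info["categories"] - {info["main"]})
        bucket = "healthy" if info["nutriscore"] in HEALTHY else "other"
        # One key per (main category, pair of other categories).
        for pair in combinations(extra, 2):
            groups[(info["main"], pair)][bucket].add(pid)

    pairs = set()
    for group in groups.values():
        for less_healthy in group["other"]:
            for healthier in group["healthy"]:
                pairs.add((less_healthy, healthier))
    return pairs
```

Each returned pair corresponds to one `ken:similarHealthierProduct` statement, with the less healthy product as subject and the healthier one as object.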

List the number of similar healthier products for each product, in descending order

SELECT ?product ?mainCategory (COUNT(?similarProduct) AS ?count)
WHERE { 
	?product a schema:Product ;
		ken:mainCategory ?mainCategory ;
		ken:similarHealthierProduct ?similarProduct.
}
GROUP BY ?product ?mainCategory
ORDER BY DESC(?count)

Count the products of a specific (main) category that have at least one similar healthier product

SELECT DISTINCT (COUNT(DISTINCT ?product) AS ?count)
WHERE { 
	?product a schema:Product ;
		ken:mainCategory ?mainCategory ;
		fo:shopping_category ?category1 ;
		fo:shopping_category ?category11 ;
	    	ken:similarHealthierProduct ?similarProduct .
    
	FILTER(?mainCategory = <https://w3id.org/um/ken4256/category/yogurts>)
	FILTER(?category1 = <https://w3id.org/um/ken4256/category/dairies>)
	FILTER(?category11 = <https://w3id.org/um/ken4256/category/fermented-milk-products>)
}

Count the number of products with a specific nutriscore (e.g. E)

SELECT (COUNT(DISTINCT ?product) AS ?count)
WHERE { 
	?product a schema:Product ;
		food:nutritionScoreFrPer100g ?nutriscore .

	FILTER (?nutriscore = <https://w3id.org/um/ken4256/nutriscore/e>)
}

Get all similar healthier products based on a product (e.g. product id = 0000005016)

SELECT DISTINCT ?productOutput ?productOuputLabel
WHERE { 
	?productGiven a schema:Product ;
		rdfs:label ?productGivenLabel ;
		ken:similarHealthierProduct ?productOutput .
	?productOutput a schema:Product ;
		rdfs:label ?productOuputLabel .

	FILTER(?productGiven = <https://w3id.org/um/ken4256/product/0000005016>)
}

Get all similar healthier products based on an ingredient id and an ingredient label

With this query we have two options to filter products:

We can provide the id of an ingredient we want in the products (e.g. we want the product to be made of chocolate).
We can also provide a part of the label of an ingredient that we want in the products (without FILTER NOT EXISTS) or that we do not want in the products (with FILTER NOT EXISTS, e.g. we do not want the product to be made of milk).

In this example, we allow the products returned to be of nutriscore A or B.

SELECT DISTINCT ?productOutput ?nutriscore ?productOuputLabel
WHERE { 
	?productOutput a schema:Product ;
		food:nutritionScoreFrPer100g ?nutriscore ;
		fo:ingredients ?ingredientForId ;
		fo:ingredients ?ingredientForLabel ;
		rdfs:label ?productOuputLabel .
	?ingredientForLabel a fo:Ingredient ;
		rdfs:label ?ingredientLabel .

	FILTER(?nutriscore = <https://w3id.org/um/ken4256/nutriscore/a> || ?nutriscore = <https://w3id.org/um/ken4256/nutriscore/b>)
	FILTER(?ingredientForId = <https://w3id.org/um/ken4256/ingredient/chocolate>)
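	# Exclude every product that has at least one ingredient whose label contains "milk"
	FILTER NOT EXISTS {
		?productOutput fo:ingredients ?excludedIngredient .
		?excludedIngredient rdfs:label ?excludedLabel .
		FILTER contains(?excludedLabel, "milk")
	}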
}
ORDER BY ?nutriscore

Get all similar healthier products based on a product id and the country where the product is available (e.g. product id = 0073755603888, country label = France)

SELECT DISTINCT ?productOutput ?nutriscore ?productOuputLabel
WHERE {
	?productGiven a schema:Product ;
		rdfs:label ?productGivenLabel ;
		ken:similarHealthierProduct ?productOutput .
	?productOutput a schema:Product ;
		food:nutritionScoreFrPer100g ?nutriscore ;
		fo:ingredients ?ingredientForId ;
		fo:ingredients ?ingredientForLabel ;
  		essglobal:isAvailableAt ?productOutputCountry ;
		rdfs:label ?productOuputLabel .
	?ingredientForLabel a fo:Ingredient ;
		rdfs:label ?ingredientLabel .
    ?productOutputCountry a schema:Country ;
        rdfs:label ?productOutputCountryLabel .

	FILTER(?productGiven = <https://w3id.org/um/ken4256/product/0073755603888>)
    FILTER(?productOutputCountryLabel = "France") 
}
ORDER BY ?nutriscore

Get all products which have the shopping category 'cakes', contain chocolate and have nutriscore A or B

SELECT DISTINCT ?product ?productOuputLabel ?nutriscore
WHERE {
    ?product a schema:Product ;
        food:nutritionScoreFrPer100g ?nutriscore ;
        fo:shopping_category ?productShoppingCategory ;
        fo:ingredients ?ingredientForLabel ;
        rdfs:label ?productOuputLabel .
    ?ingredientForLabel a fo:Ingredient ;
        rdfs:label ?ingredientLabel .

    FILTER(?productShoppingCategory = <https://w3id.org/um/ken4256/category/cakes>)
    FILTER contains(?ingredientLabel, "chocolat")
    FILTER(?nutriscore = <https://w3id.org/um/ken4256/nutriscore/a>
    || ?nutriscore = <https://w3id.org/um/ken4256/nutriscore/b>)
}
ORDER BY ?nutriscore

Get all pairs of products which are similar (relations created by the linking)

128,592 / 2 = 64,296 pairs (divided by 2 because the relation is bidirectional)

SELECT ?product ?productName ?similarProduct ?similarProductName
WHERE { 
	?product a schema:Product ;
        rdfs:label ?productName ;
		schema:isSimilarTo ?similarProduct .
    ?similarProduct a schema:Product ;
        rdfs:label ?similarProductName .
}
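The division by 2 holds because every link is stored in both directions. As a small illustration, with a hypothetical helper name and invented ids, the undirected pairs can also be counted directly from the query results:

```python
def undirected_pairs(links):
    """Collapse directed (product, similarProduct) rows into unordered pairs."""
    return {frozenset(link) for link in links if link[0] != link[1]}


# Each link appears once per direction, as in the query results.
rows = [("p1", "p2"), ("p2", "p1"), ("p1", "p3"), ("p3", "p1")]
pair_count = len(undirected_pairs(rows))  # 2 pairs from 4 rows
```

Counting this way stays correct even if some links were stored in one direction only, which a plain division by 2 would not detect.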

Get all pairs of similar products which are available in different countries (relations created by the linking)

41,052 / 2 = 20,526 pairs (divided by 2 because the relation is bidirectional)

SELECT ?product ?productName ?similarProduct ?similarProductName
WHERE { 
	?product a schema:Product ;
        rdfs:label ?productName ;
		schema:isSimilarTo ?similarProduct ;
    	essglobal:isAvailableAt ?productCountry .
    ?similarProduct a schema:Product ;
        rdfs:label ?similarProductName ;
    	essglobal:isAvailableAt ?similarProductCountry .
    FILTER(?productCountry != ?similarProductCountry)
}

louismeeckers / similar-food-prediction