This is an example of how to leverage WikiPedia data to create a set of synonyms. Unlike traditional Thesauruses, this dataset allows for the creations of synonyms over words such as microsoft, bill gates, ISS, apartment, sauna, enzyme or toronto making it a really good solution to use in Full Text Search applications where you want to match what the user is typing to relevant synonyms in the content.
This also can be very useful for industry specific terms. For example, in the medical industry you might want to have synonyms for words such as "enzyme" or in real estate, you might want to have synonyms for words such as "apartment".
Please note that before you can run the synonym application, you will need to re-create the SQLite database. To do this, please make sure to follow the steps outlined in the below section "Running the Synonym Application". If you want to use the most recent set of WikiPedia data, please start from the section "How to recreate the Data Set (Optional)".
The application included under \src\WikiSynonym allows you to provide a word which is then sent to a SQLite database to return matched synonyms. The usage for the application is as follows:
WikiSynonym.exe sauna
where the output will be:
aufguss, dry sauna, saun, sauna culture, sudatory, sweatbath, swedish bath, theremae
In addition, you can pass multi-word terms in double quotes such as:
WikiSynonym.exe "Bill Gates"
which will return results such as:
bil gates, bill gate, bill gates, bill gates (microsoft), bill gates iii, bill gates wealth index, billgatesiii, gate bill, gates & co., gates, bill, gates, william henry, iii, iii gates william henry, mr. gates, phoebe adele gates, sir bill gates, sir william henry gates iii, william gates iv, william gates, iii, william h gates, william h gates iii, william h. gates iii, william h. gates, iii, william henry gates, william henry gates 3, william henry gates iii, william henry gates iv, william henry, iii gates, willy gates
Within the \data directory you will find a set of tab separated files (synonymsX.tsv) that contains all of the WikiPedia extracted synonyms. This can be useful if you want to place this content in a different data store.
In the event, you want to re-create the data set I provided above, you can do this by following these instructions. Please note, that I used SQL Server, but you could use an alternate datastore such as MySQL which might be simpler since the data dump is already in MySQL format.
Download and unzip latest SQL redirect and page files from (https://dumps.wikimedia.org/enwiki). These will likely be of the format: enwiki-latest-redirect.sql.gz, and enwiki-latest-page.sql.gz
Unzip these files to create the full .sql files.
You will need a number of tables to host the WikiPedia data. If you are using SQL Server, you can use the schema file (database_schema.sql) that I provided in the \data directory.
Next, you will need to import the data. I have provided a very basic application under \src\WikiSQLConverter which allows you to import the data from these files. To use the application you run:
WikiSQLConverter.exe "[File path of wiki SQL file]" "[SQL Server Connection String]"
Note: this step will likely take a significant amount of time.
The following queries can be run to clean up the data in the synonym list as well as to combine it into a single table.
INSERT INTO page_relation SELECT s.rd_from as sid, t.page_id as tid, p.page_namespace as snamespace, t.page_namespace as tnamespace, p.page_title as stitle, t.page_title as ttitle FROM redirect s JOIN page p ON (s.rd_from = p.page_id) JOIN page t ON (s.rd_namespace = t.page_namespace AND s.rd_title = t.page_title);
insert into synonyms select lower(ttitle), lower(stitle) from page_relation group by lower(ttitle), lower(stitle);
delete from synonyms where synonym like '%??%' or root like '%??%'
update [synonyms] set root = REPLACE(root, '', ' '), synonym = REPLACE(synonym, '', ' ');
update synonyms set root = REPLACE(root, '"', ' '), synonym = REPLACE(synonym, '"', ' ');
update synonyms set root = REPLACE(root, '"', ' '), synonym = REPLACE(synonym, '"', ' ');
update synonyms set root = REPLACE(root, '“', ' '), synonym = REPLACE(synonym, '“', ' ');
update synonyms set root = REPLACE(root, '”', ' '), synonym = REPLACE(synonym, '”', ' ');
In this step we will take all the synonyms and export them to a set of Tab Separated Files that can easily be integrated into a SQLite database to be used by our application. To do this, you will run a BCP command to bulk copy the data out in files of no more than 2 million rows / file. Please note, I had ~7M rows in the synonyms table so if you find you have more you might need to add another line or two...
bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 0 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms1.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c
bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 2000001 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms2.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c
bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 4000001 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms3.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c
bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 6000001 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms4.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c
In order to run the Synonym application, we will need to create a SQLite database that will contain the synonyms. Unfortunately, due to the size of the database, I could not include it in github, but luckily it is really easy to re-create.
If you do not already have SQLite installed, you can do this by going here: https://www.sqlite.org/download.html. Since I have windows, I downloaded the Precompiled Binaries for Windows (sqlite-tools-win32-x86-3130000.zip). Unzip this file and open a command prompt to this directory.
Copy the .tsv files from above to this directory and run the command:
sqlite3 synonyms.db
From the sqlite command line execute:
create table synonyms (root nvarchar(255), synonym nvarchar(255))
go
create index idx_synonyms_root on synonyms(root)
go
create index idx_synonyms_synonym on synonyms(synonym)
go
.separator "\t"
.import synonyms1.tsv synonyms
.import synonyms2.tsv synonyms
.import synonyms3.tsv synonyms
.import synonyms4.tsv synonyms
Optionally at this point you can play with the synonyms by executing a query such as this to find all synonyms for microsoft.
with SynonymList as
( select synonym from synonyms where root = 'microsoft'
union
select synonym from synonyms where root in (
select root from synonyms where synonym = 'microsoft'
))
select distinct synonym from SynonymList order by synonym
go
At this point you can exit out of the SQLite command window and copy the resulting synonyms.db to your source code working directory: \src\WikiSynonym
Now that you have created and copied the SQLite database to your source code directory. You can build the application and run it from a command line. For example, after building the application you can run:
WikiSynonym.exe "Bill Gates"