Giter Site home page Giter Site logo

wikisynonyms's Introduction

WikiSynonyms

This is an example of how to leverage WikiPedia data to create a set of synonyms. Unlike traditional Thesauruses, this dataset allows for the creations of synonyms over words such as microsoft, bill gates, ISS, apartment, sauna, enzyme or toronto making it a really good solution to use in Full Text Search applications where you want to match what the user is typing to relevant synonyms in the content.

This also can be very useful for industry specific terms. For example, in the medical industry you might want to have synonyms for words such as "enzyme" or in real estate, you might want to have synonyms for words such as "apartment".

Please note that before you can run the synonym application, you will need to re-create the SQLite database. To do this, please make sure to follow the steps outlined in the below section "Running the Synonym Application". If you want to use the most recent set of WikiPedia data, please start from the section "How to recreate the Data Set (Optional)".

Console Application

The application included under \src\WikiSynonym allows you to provide a word which is then sent to a SQLite database to return matched synonyms. The usage for the application is as follows:

WikiSynonym.exe sauna

where the output will be:

aufguss, dry sauna, saun, sauna culture, sudatory, sweatbath, swedish bath, theremae

In addition, you can pass multi-word terms in double quotes such as:

WikiSynonym.exe "Bill Gates"

which will return results such as:

bil gates, bill gate, bill gates, bill gates (microsoft), bill gates iii, bill gates wealth index, billgatesiii, gate bill, gates & co., gates, bill, gates, william henry, iii, iii gates william henry, mr. gates, phoebe adele gates, sir bill gates, sir william henry gates iii, william gates iv, william gates, iii, william h gates, william h gates iii, william h. gates iii, william h. gates, iii, william henry gates, william henry gates 3, william henry gates iii, william henry gates iv, william henry, iii gates, willy gates

Data Set

Within the \data directory you will find a set of tab separated files (synonymsX.tsv) that contains all of the WikiPedia extracted synonyms. This can be useful if you want to place this content in a different data store.

How to recreate the Data Set (Optional)

In the event, you want to re-create the data set I provided above, you can do this by following these instructions. Please note, that I used SQL Server, but you could use an alternate datastore such as MySQL which might be simpler since the data dump is already in MySQL format.

Step 1 - Download the SQL Dump Files

Download and unzip latest SQL redirect and page files from (https://dumps.wikimedia.org/enwiki). These will likely be of the format: enwiki-latest-redirect.sql.gz, and enwiki-latest-page.sql.gz

Unzip these files to create the full .sql files.

Step 2 - Import the SQL Dump Files to SQL Server

You will need a number of tables to host the WikiPedia data. If you are using SQL Server, you can use the schema file (database_schema.sql) that I provided in the \data directory.

Next, you will need to import the data. I have provided a very basic application under \src\WikiSQLConverter which allows you to import the data from these files. To use the application you run:

WikiSQLConverter.exe "[File path of wiki SQL file]" "[SQL Server Connection String]"

Note: this step will likely take a significant amount of time.

Step 3 - Create the Synonyms table

The following queries can be run to clean up the data in the synonym list as well as to combine it into a single table.

INSERT INTO page_relation SELECT s.rd_from as sid, t.page_id as tid, p.page_namespace as snamespace, t.page_namespace as tnamespace, p.page_title as stitle, t.page_title as ttitle FROM redirect s JOIN page p ON (s.rd_from = p.page_id) JOIN page t ON (s.rd_namespace = t.page_namespace AND s.rd_title = t.page_title);

insert into synonyms select lower(ttitle), lower(stitle) from page_relation group by lower(ttitle), lower(stitle);

delete from synonyms where synonym like '%??%' or root like '%??%'

update [synonyms] set root = REPLACE(root, '', ' '), synonym = REPLACE(synonym, '', ' ');

update synonyms set root = REPLACE(root, '"', ' '), synonym = REPLACE(synonym, '"', ' ');

update synonyms set root = REPLACE(root, '"', ' '), synonym = REPLACE(synonym, '"', ' ');

update synonyms set root = REPLACE(root, '“', ' '), synonym = REPLACE(synonym, '“', ' ');

update synonyms set root = REPLACE(root, '”', ' '), synonym = REPLACE(synonym, '”', ' ');

Step 4 - Extract Synonyms to Tab Separated Files

In this step we will take all the synonyms and export them to a set of Tab Separated Files that can easily be integrated into a SQLite database to be used by our application. To do this, you will run a BCP command to bulk copy the data out in files of no more than 2 million rows / file. Please note, I had ~7M rows in the synonyms table so if you find you have more you might need to add another line or two...

bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 0 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms1.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c

bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 2000001 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms2.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c

bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 4000001 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms3.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c

bcp "SELECT * FROM [synonyms] ORDER BY Root OFFSET 6000001 ROWS FETCH NEXT 2000000 ROWS ONLY" queryout synonyms4.tsv -S (local) -d WikipediaSynonyms -T -t "\t" -c

Running the Synonym Application

In order to run the Synonym application, we will need to create a SQLite database that will contain the synonyms. Unfortunately, due to the size of the database, I could not include it in github, but luckily it is really easy to re-create.

Step 1 - Download SQLite

If you do not already have SQLite installed, you can do this by going here: https://www.sqlite.org/download.html. Since I have windows, I downloaded the Precompiled Binaries for Windows (sqlite-tools-win32-x86-3130000.zip). Unzip this file and open a command prompt to this directory.

Step 2 - Create the SQLite database

Copy the .tsv files from above to this directory and run the command:

sqlite3 synonyms.db

From the sqlite command line execute:

create table synonyms (root nvarchar(255), synonym nvarchar(255))

go

create index idx_synonyms_root on synonyms(root)

go

create index idx_synonyms_synonym on synonyms(synonym)

go

.separator "\t"

.import synonyms1.tsv synonyms

.import synonyms2.tsv synonyms

.import synonyms3.tsv synonyms

.import synonyms4.tsv synonyms

Optionally at this point you can play with the synonyms by executing a query such as this to find all synonyms for microsoft.

with SynonymList as

( select synonym from synonyms where root = 'microsoft'

union

select synonym from synonyms where root in (

select root from synonyms where synonym = 'microsoft'

))

select distinct synonym from SynonymList order by synonym

go

At this point you can exit out of the SQLite command window and copy the resulting synonyms.db to your source code working directory: \src\WikiSynonym

Step 3 - Building and running the Synonym Application

Now that you have created and copied the SQLite database to your source code directory. You can build the application and run it from a command line. For example, after building the application you can run:

WikiSynonym.exe "Bill Gates"

wikisynonyms's People

Contributors

liamca avatar

Stargazers

 avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.