Giter Site home page Giter Site logo

mofu's Introduction

mofu 字符串模糊匹配

0x01 简介

mofu是一个针对中文/English字符串模糊匹配精确计算权值的一个算法。适合用于数据库较小的在线搜索或数据库较大的精确搜索中。

过去曾用于洛谷OnlineJudge的题目搜索和OI in Hand的题目查重。

目前例程采用php+mysql实现,相对于传统中文分词搜索在计算权值上具有更好的准确度。

0x02 MySQL数据库的准备

你需要一个拥有PRIMARY KEY的数据表,对应数据条目的ID。

并创建一个类型为TXT的键,开启FULLTEXT索引。

0x03 索引的建立

索引建立的程序位于prepare.php中。

mofu_prepare($str):函数用于对数据库中要拿来搜索的内容进行预处理

$str为要用于搜索的原始字符串

返回值为处理后的字符串,需要存入数据库中。

该函数简要处理过程如下:

  1. 将所有大写字符转换为小写

  2. 替换所有ASCII符号为空格

  3. 在文字间增加空格,在左右两个字符状态有一项改变时在其中间添加空格,状态包括 {

    1. 是否ASCII字符

    2. 如是ASCII字符,是数字还是英文

}

0x04 搜索原理

mofu按照普通用户在网站进行搜索时的习惯进行搜索。

相关函数位于mofu_core.php

  1. 对querystring按照' '进行字符串分割,分割完的字符串我们暂且称为这段字符串

  2. 遍历分割后的querystring,检测这段是否为纯ASCII字符 {

    若是 {

     if (这段字符串.长度 <= 5) {
    
     	添加subkey(这段字符串);
    
     	添加subkey(' '.这段字符串.' ');
    
     }
    
     else {
    
     	每4个字符切开,加入subkey中
    
     }
    

    }

    若不是 {

     //即非纯ASCII字符
    
     if (这段字符串.长度 <= 3) {
    
     	添加subkey(这段字符串);
    
     }
    
     else {
    
     	每2个字符切开,加入subkey中
    
     }
    

    }

}

  1. 对于数据表的每条记录,要搜索的这段字符串,拥有最大权值为1。

设subkey数量为n,则每份subkey占有权值1/n

将创建好的subkey放入数据库中逐条搜索。

当一条数据存在当前的subkey时,对其权值+1/n

之后,在每段字符串搜索结束时,对这段字符串的权值进行平方,作为这段字符串对于这条记录的最终权值,加到这条记录上。

  1. 依次完成每段字符串和每条记录的权值计算和排序,最终计算权值,输出按照权值排序过的PRIMARY KEY数组。

mofu's People

Contributors

cyyself avatar

Stargazers

Tianle Xu avatar  avatar Soha Jin avatar

Watchers

James Cloos avatar Soha Jin avatar  avatar  avatar

Forkers

tianhaiit

mofu's Issues

Can you publish it to composer?

Can you edit the code to let it suit PSR standard and publish it to Composer?

Then we can use it by composer require cyyself/mofu. Is it simple?

And, can you add support for TiDB and PostgreSQL? In my daily usage, MySQL is not enough. I need TiDB or PostgreSQL to power my service. Sometimes I need MongoDB too.

Also, if there's a Golang version, could you please let me know?

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.