
NeuralGREWT:
Neural Grammar Rule Extraction and Word Taxonomy

[Word taxonomy visualization: click to interact]

Unsupervised learning and comprehensible representation of grammar rules and fuzzy symbol categories by high-dimensional clustering of string probabilities derived from natural language models and Comprehensible Convolutional Neural Networks (CCNN).

Theory

A grammar is a set of rules for how symbols[1] may be arranged to form valid strings in a language. Strings of symbols that follow all grammar rules are grammatically valid. Grammatic validity is necessary for, but does not imply, sensicality. For example: the string of symbols (in this case a sentence of words) "The hat runs and cooks into a hairdresser." is a grammatically valid sentence that is also nonsensical in isolation[2].

Given this definition of grammatic validity, valid strings will be more common than invalid ones in any string-set composed mostly of sensical strings, making it possible to infer a string's validity from its probability.

Therefore, it may be possible to learn the grammar rules of any language, including computer, fictional, and extra-terrestrial languages, without needing to make sense of anything written in the given language[3].

Let's test this hypothesis.

One good first step to learning grammar rules may be to identify the different categories of symbols present in a given language.

Symbols will be said to share a syntactic category in proportion to their mutual interchangeability. In other words: two symbols will share a category in proportion to the probability that one may be replaced by the other in a randomly chosen valid string without rendering that string invalid.

Knowing this, a natural language generator[4] can be used to determine the degree of mutual interchangeability of symbols in a language, using total string likelihood before and after replacement as an indicator of relative validity, and therefore of mutual interchangeability, which in turn implies a taxonomy. Once we have these validity scores, we can use T-SNE to infer discrete symbol categories from the degree of mutual interchangeability (or, equivalently, from the clustering of validity-under-replacement scores).
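
As a concrete illustration, here is a minimal sketch of that replacement test using an off-the-shelf GPT-2 checkpoint from the HuggingFace transformers library; the helper names (string_log_likelihood, interchangeability_delta) and the simple word-level splitting are illustrative assumptions, not the project's actual code.

import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

def string_log_likelihood(text: str) -> float:
    # Sum of log-probabilities of each token given all previous tokens.
    ids = tokenizer(text, return_tensors="pt").input_ids
    with torch.no_grad():
        loss = model(ids, labels=ids).loss  # mean negative log-likelihood per predicted token
    return -loss.item() * (ids.shape[1] - 1)  # undo the mean to get a summed log-likelihood

def interchangeability_delta(ground: list[str], position: int, symbol: str) -> float:
    # Likelihood delta when one word of a ground string is replaced by `symbol`;
    # values near zero suggest the two symbols are mutually interchangeable here.
    perturbed = list(ground)
    perturbed[position] = symbol
    return string_log_likelihood(" ".join(perturbed)) - string_log_likelihood(" ".join(ground))

# Example: swapping one pronoun for another should barely move the score,
# while swapping a pronoun for a verb should lower it sharply.
print(interchangeability_delta("she walked to the store".split(), 0, "he"))
print(interchangeability_delta("she walked to the store".split(), 0, "ran"))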

Procedure

  • Compile a large corpus of valid and/or sensical writings in the chosen language.
  • Train a natural language model, or predictor, on the corpus until it is very good at predicting symbols in the language given context. Anything like BERT or GPT-2 will do fine, and is in fact probably overkill.
    • NOTE: It's definitely overkill. Future work will employ a maximally minimal version of BERT to accelerate calculation of validity tensors and eliminate "understanding intrusion". Large BERT and GPT-2 models possess a good deal of understanding of written text, which leads them to rank the probability of valid but otherwise nonsensical strings significantly lower than they would if they understood only base grammar relationships. Clearly this is not good if we are looking for a measure of base validity regardless of sensicality[5].
  • Compile a large "ground set" of valid and/or meaningful strings in the chosen language. (In the case of unknown languages, this ground set can simply be a random subset of the corpus.)
  • Compile a "symbol set" of all symbols in the chosen language, or at least a large set of the most common ones.
  • Create a 4-dimensional perturbation tensor from the ground and symbol sets as follows (a sketch of this step and the next appears after this list):
    • Add to the perturbation tensor one cube for each symbol in the symbol set; construct each cube by stacking a plane for each ground string in the ground set; construct each plane by stacking a vector for every string that can be created by replacing any one item in the ground string with the symbol, taking care to do this in the same order for all vectors[6].
  • Create a 4-dimensional validity tensor from the perturbation tensor as follows:
    • For each vector in the perturbation tensor, judge the probability of that vector by summing the relative likelihoods of each symbol appearing at its location given all previous symbols in the vector (using the predictor), taking the difference between this and the sum-validity of its corresponding ground string to obtain a validity delta, and dividing by the length of that vector. That division might be unnecessary, but we will find out.
      • Update: It occurs to me that dividing by sentence length would have a largely unhelpful effect on the validity score of any one dimension relative to all the others, exaggerating similarities between dimensions as vectors grow longer. The ground-truth validity delta may or may not have anything to do with string length, depending on the language, so it is wrong to force such a correlation in general. A proper normalization that treats all dimensions equally is what we want here.
  • Perform T-SNE followed by PCA on the validity tensor to infer the number and relative importance of symbol categories.
    • NOTE: In this case PCA will not, and cannot, provide a useful classifier for data points that were not already present in the T-SNE plot. Here we are simply looking to quantify the number of clusters and get some small idea of the distances between them[7].
  • Name each symbol category. (can be totally arbitrary)
  • Create a sym-cat mapping of each symbol to a list of its categories sorted in descending order of the symbol's belongingness to each category.
  • Create a cat-sym mapping of each category to a list of its symbols sorted in descending order of each symbol's belongingness to the category.
  • Create a token set from the corpus by replacing each symbol in the corpus with a name, or token, for its corresponding category[8].
  • Create a garbage set from the token set by randomizing a large enough proportion of the tokens in the token set to render each string likely invalid.
  • Train a comprehensible, spiking, convolutional neural network, or grammar net, to distinguish between items of the token and garbage sets.
    • The filters of the trained grammar net are first-order grammar rules, the fully connected layers behind them are second-order grammar rules, and so on (or something to that effect, depending on the structure of the grammar net).
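
Below is a minimal sketch of the two tensor-building steps referenced above, assuming the string_log_likelihood helper from the earlier sketch. The layout (one scalar validity delta per symbol, ground string, and replacement position, with NaN padding past each string's end) and the name build_validity_tensor are illustrative assumptions, not the project's actual code.

import numpy as np

def build_validity_tensor(ground_set: list[list[str]], symbol_set: list[str], max_len: int) -> np.ndarray:
    # validity[s, g, p] = log-likelihood delta when symbol s replaces position p
    # of ground string g; positions past a string's end stay NaN (padding).
    validity = np.full((len(symbol_set), len(ground_set), max_len), np.nan)
    base = [string_log_likelihood(" ".join(g)) for g in ground_set]
    for si, symbol in enumerate(symbol_set):
        for gi, ground in enumerate(ground_set):
            for pos in range(min(len(ground), max_len)):
                perturbed = list(ground)
                perturbed[pos] = symbol
                validity[si, gi, pos] = string_log_likelihood(" ".join(perturbed)) - base[gi]
    return validity

Flattening the (ground string, position) axes of this array then gives one long validity vector per symbol, which is what the T-SNE step in the Status section operates on.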

Status

  • Gather Corpus
  • Train Natural Language Model (GPT-2)
  • Compile Ground Set
  • Compile Symbol Set
  • Generate Perturbation Tensor
  • Calculate Validity Tensor
    • Signs from the first few checkpoints are looking very promising. Replacements known to largely preserve validity (such as replacing a pronoun with any other pronoun) are being assigned markedly higher probability by the predictor. The trend is so striking that it is visible to the naked eye even before normalization. It feels like cheating a little bit, since I am trying to stick to a pre-registered method, and peeking at the data while it's being processed can't be good for my objectivity. I did have to look at least once to make sure that the program was calculating what I meant it to, so I suppose it can't be helped that I went looking for patterns like these. I'll just have to be extra careful not to let this good news inform any of my future decisions. The best way to do that is to make as few of them as possible and stick closely to the plan as written.
    • [ Trial set generated in 10 days of computation time on a GTX 1070 M ]
  • T-SNE
    • To improve our dynamic range, each dimension of the validity tensor is scaled independently of the others so that its values now range between -1 and 1 for all symbols. This is to ensure signals of dimensions with naturally smaller values are not so severely washed out by signals of naturally larger dimensions in the next step.
    • After scaling, the validity tensor is passed through a Heaviside function to produce a tensor of sparse boolean vectors with ones for all values above a specified threshold. Implicitly, this means that all values above some high-but-ultimately-arbitrary validity threshold[9] are regarded as totally valid replacements, and all others are assumed to be equally invalid.
    • T-SNE was performed using the Jaccard dissimilarity index of each vector against every other vector as a distance metric, together with Barnes-Hut approximation, to find a good representation of the clustering in a visualizable space (see the sketch after this list).
    • The step above was performed with many perplexities ranging from 5 to 50. The qualitative optimum was found to lie between 12.5 and 17.5[10].
  • PCA
    • It is clear from the visualization that PCA is not the best choice here after all. The trouble now is not in finding the dimensions that best define the data, but in finding a way to identify the boundaries of distinct (and not-so-distinct) T-SNE clusters in the visualizable space with minimal, if any, human involvement. DBSCAN seems well suited to this task. Barring that, many of these clusters seem isolated enough to eyeball without injecting too much subjectivity (hence the reduction to a visualizable space), but an automatic approach is always preferable.
  • Automatic Cluster Identification with DBSCAN
  • Category Naming and Mapping
  • Generate Token and Garbage Sets
  • Train Grammar Net
    • One good way to make the grammar net more comprehensible would be to use a very large number of filters in the convolutional layers and strong regularization across the board, so as to ensure that the vast majority of filters obviously do nothing[11], while emphasizing the important few that communicate a lot.
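
Here is the sketch referenced above: a minimal version of the scaling, Heaviside thresholding, Jaccard-distance T-SNE, and DBSCAN steps, assuming validity is the (symbol, ground string, position) array from the previous sketch. The threshold and clustering hyperparameters are placeholders, not the values actually used in this project.

import numpy as np
from scipy.spatial.distance import pdist, squareform
from sklearn.manifold import TSNE
from sklearn.cluster import DBSCAN

# One point per symbol: flatten the (ground string, position) axes into one row,
# treating padded positions as maximally invalid replacements.
X = validity.reshape(validity.shape[0], -1)
X = np.where(np.isnan(X), np.nanmin(X), X)

# Scale each dimension (column) to [-1, 1] independently of the others.
lo, hi = X.min(axis=0), X.max(axis=0)
X_scaled = 2.0 * (X - lo) / np.where(hi > lo, hi - lo, 1.0) - 1.0

# Heaviside step: keep only "valid enough" replacements as sparse booleans.
THRESHOLD = 0.5  # placeholder; the real cutoff is chosen empirically
X_bool = X_scaled > THRESHOLD

# Jaccard dissimilarity between every pair of boolean symbol vectors,
# passed to T-SNE as a precomputed distance matrix.
distances = squareform(pdist(X_bool, metric="jaccard"))
embedding = TSNE(metric="precomputed", init="random", perplexity=15).fit_transform(distances)

# DBSCAN on the 2-D embedding labels each symbol with a cluster (category); -1 marks noise.
labels = DBSCAN(eps=2.0, min_samples=5).fit_predict(embedding)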

Results

  • Word Taxonomy Visualization:

Footnotes

[1] More precisely: "morphemes" composed of symbols.

[2] "In isolation" meaning in the absence of explanatory context.

[3] So long as we are in possession of an ample body of text that would be sensical to a speaker of that language.

[4] Any predictor of morphemes from context, really.

[5] This will also ensure that we get a good measurement of validity in some imaginable language in which it is impossible to have grammatically valid strings that don't make sense. That would mean only that validity implies sensicality. In such languages, learning base validity is identical to learning sensicality, which may or may not require a larger network to accomplish (an interesting philosophical question there) but is nevertheless possible, as demonstrated by the large language models of recent years.

[6] In an actual tensor, all such vectors need to be padded to uniform dimension with null strings. In practice we can use a 3-dimensional tensor of pointers to vectors of variable dimension.

[7] Distances between T-SNE clusters are sometimes meaningless, so the distances according to PCA are not to be taken seriously in this case. However, the number of clusters and the degree of overlap communicated by PCA on T-SNE will be reliably meaningful if good clusters are found.

  • NOTE: This subtlety is obviated by PCA's replacement with DBSCAN in the cluster identification step.

[8] In the proof of concept we will simply pick the highest-ranked category from each symbol's entry in the sym-cat mapping. Future work will be needed to make an informed choice of category in the case of context-dependent taxonomy.

[9] In this case, above -5 or thereabouts after scaling.

[10] Specifically: the optimum for a set of 500 points in ~5500 dimensions. This optimum is likely to change depending on your point-count-to-dimensionality ratio.

[11] Or are lower-weighted, redundant duplicates that all do the same thing.

How to Install

[TBD]

How to Use

[TBD]

Contributing

For contributors to the project, do this before making your first commit:

  • Install pre-commit
cd /path/to/this/repository/
sudo apt install pre-commit
pre-commit install

(We do all of our development on Linux.)

  • To test updates to the README and other GitHub-flavored Markdown files, simply install Grip and feed it your desired file.
pip3 install grip
python3 -m grip README.md
  • Then follow the link provided by the Grip server for a live preview of your work.

  • When satisfied with your changes, you can compile to an HTML file with:

python3 -m grip README.md --export README.html

Authors

  • Gabe M. LaFond - Initial work - ExamDay

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the LICENSE.md file for details
