
dialectr's Introduction

dialectR: Doing Dialectometry in R

dialectR is an R package for doing dialectometry, the quantitative study of dialects. The analyses in this package use variants of edit distance to compute the aggregate distance between phonetic transcriptions of dialects, a measure that prior studies have shown to correspond with perceptual data on mutual intelligibility.

The development version of dialectR can be installed with devtools:

devtools::install_github("b05102139/dialectR")

As a preliminary example, we will examine Dutch dialect data transcribed in the Goeman-Taeldeman-van-Reenen-Project. The data is provided in the package and can be loaded like this:

library(dialectR)
data(Dutch)

This dataset covers 562 concepts across 613 sites in Dutch-speaking areas. Concepts are stored as columns and sites as rows.

In dialectometry, an important step before further analysis is computing the aggregate edit distance between dialect sites. An in-depth discussion of edit distance is beyond the scope of this introduction, but in brief it is a distance metric between strings: the number of insertions, deletions, and substitutions needed to transform one string into another. Dialect data, however, often contains multiple entries per site and missing entries, and distances must be normalized for differences in entry length, so dialectR provides a specialized version of edit distance that meets these needs:

dialectR::leven("koguma", "kokoimo")
## [1] 4
dialectR::leven("koguma", "kokoimo", alignment_normalization = TRUE)
## [1] 0.5714286
dialectR::leven("koguma/goguma", "kokoimo", alignment_normalization = TRUE, delim = "/")
## [1] 0.6190476

The code above shows, respectively, the plain edit distance between two strings; the length-normalized distance; and the handling of multiple responses from one site, which is a common situation when collecting dialect data.
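
To make the three cases above concrete, here is a minimal sketch in base R that does not use dialectR. It relies on utils::adist() for the plain Levenshtein distance. The helper names plain_lev, norm_lev, and multi_lev are hypothetical, and two choices are assumptions rather than the package's actual rules: normalizing by the length of the longer string, and averaging over all variant pairs.

```r
# Plain Levenshtein distance via base R's utils::adist().
plain_lev <- function(a, b) {
  as.numeric(utils::adist(a, b)[1, 1])
}

# Assumption: divide by the length of the longer string, as a stand-in
# for the alignment length that dialectR normalizes by.
norm_lev <- function(a, b) {
  plain_lev(a, b) / max(nchar(a), nchar(b))
}

# Assumption: with several responses per site, average the normalized
# distance over every pair of variants.
multi_lev <- function(a, b, delim = "/") {
  va <- strsplit(a, delim, fixed = TRUE)[[1]]
  vb <- strsplit(b, delim, fixed = TRUE)[[1]]
  pairs <- expand.grid(a = va, b = vb, stringsAsFactors = FALSE)
  mean(mapply(norm_lev, pairs$a, pairs$b))
}

plain_lev("koguma", "kokoimo")
## [1] 4
norm_lev("koguma", "kokoimo")
## [1] 0.5714286
multi_lev("koguma/goguma", "kokoimo")
## [1] 0.6428571
```

The first two values agree with dialectR::leven, while the third does not (0.6428571 here against 0.6190476), so the package evidently normalizes or combines multiple responses differently from this sketch.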

The value of such a metric becomes clear when the differences between sites are aggregated. Using the same function arguments as above, we can compute the aggregate distance between every pair of sites:

distDutch <- dialectR::distance_matrix(Dutch, "leven", alignment_normalization = TRUE)
distDutch[1:3,1:3]
##             Aalsmeer NH Aalst BeLb Aalst BeOv
## Aalsmeer NH   0.0000000  0.3609137  0.4224437
## Aalst BeLb    0.3609137  0.0000000  0.3745071
## Aalst BeOv    0.4224437  0.3745071  0.0000000
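
The aggregation step can be sketched in base R as follows. This is an illustration under stated assumptions, not the implementation of distance_matrix: the toy data, the helper names norm_lev and aggregate_dist, the normalization by the longer string, and the choice to skip concepts missing at either site are all hypothetical.

```r
# Assumption: normalized edit distance, dividing by the longer string.
norm_lev <- function(a, b) {
  as.numeric(utils::adist(a, b)[1, 1]) / max(nchar(a), nchar(b))
}

# Toy data in the same layout as Dutch: sites as rows, concepts as columns.
toy <- matrix(
  c("huis", "hus", "hoes",
    "water", "wotter", NA),
  nrow = 3,
  dimnames = list(c("SiteA", "SiteB", "SiteC"), c("house", "water"))
)

# Mean per-concept distance for every pair of sites, using only the
# concepts that were recorded at both sites.
aggregate_dist <- function(x) {
  n <- nrow(x)
  d <- matrix(NA_real_, n, n, dimnames = list(rownames(x), rownames(x)))
  for (i in seq_len(n)) {
    for (j in seq_len(n)) {
      ok <- !is.na(x[i, ]) & !is.na(x[j, ])
      if (any(ok)) {
        d[i, j] <- mean(mapply(norm_lev, x[i, ok], x[j, ok]))
      }
    }
  }
  d
}

toy_dist <- aggregate_dist(toy)
round(toy_dist, 3)
```

The result is a symmetric matrix with zeros on the diagonal, the same shape as distDutch above.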

These distances can then be projected onto geography. The package provides two such analyses: one based on multidimensional scaling, which maps three dimensions onto RGB values and mixes them evenly, and one based on hierarchical clustering. The following two plots show that the dialect groupings from these two analyses largely converge. First, the results of multidimensional scaling:

dutch_points <- get_points(system.file("extdata", "DutchKML.kml", package="dialectR"))
dutch_polygons <- get_polygons(system.file("extdata", "DutchKML.kml", package="dialectR"))
mds_map(distDutch, dutch_points, dutch_polygons)
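
The idea of turning three MDS dimensions into colours can be sketched with base R alone. This is not the code behind mds_map. It uses the eurodist dataset that ships with R as a stand-in for distDutch, and rescaling each dimension to the range 0 to 1 is an assumption about how the colour channels might be derived.

```r
# Stand-in distance data: eurodist, road distances between 21 European cities.
mds <- cmdscale(eurodist, k = 3)

# Rescale one MDS dimension to [0, 1] so it can serve as a colour channel.
rescale01 <- function(v) (v - min(v)) / (max(v) - min(v))

# One colour per site: dimension 1 is red, 2 is green, 3 is blue.
channels <- apply(mds, 2, rescale01)
site_colours <- rgb(channels[, 1], channels[, 2], channels[, 3])
names(site_colours) <- rownames(mds)

head(site_colours, 3)
```

Sites that lie close together in the three-dimensional MDS space receive similar colours, which is what makes dialect areas visible on the map.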

And here are the results of hierarchical clustering:

cluster_map(distDutch, cluster_num = 6, method = "ward.D2", kml_points = dutch_points, kml_polygon = dutch_polygons)
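
For readers who want to see the clustering step without the map, the following base R sketch shows Ward clustering cut into six groups. It assumes, without confirmation from the package source, that cluster_map does something comparable internally. The eurodist dataset again stands in for distDutch; with a plain matrix such as distDutch, as.dist() would be needed first.

```r
# Stand-in distance data: eurodist is already a "dist" object.
hc <- hclust(eurodist, method = "ward.D2")

# Cut the dendrogram into six groups, mirroring cluster_num = 6 above.
groups <- cutree(hc, k = 6)

# Number of sites assigned to each cluster.
table(groups)
```

Each site receives an integer cluster label, which can then be used to colour points or polygons on a map.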

In addition to such transcription-based methods, we also provide an acoustic-based method which is capable of computing the distance between audio data:

i_audio <- system.file("extdata", "i.wav", package="dialectR")
e_audio <- system.file("extdata", "e.wav", package="dialectR")
acoustic_distance(i_audio, e_audio)
## [1] 9.414545

The validity of this distance can be shown if we apply it to audio recordings of IPA vowels. Using the recordings provided by Peter Ladefoged, we show how the distance between IPA vowels can be used to reproduce the acoustic vowel space:

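# we assume all 12 vowel recordings are downloaded to a single folder
vowel_files <- list.files("C:/Users/USER/Downloads/ipa_vowels", full.names = TRUE)
# label each vowel by its file name without the extension
# (assumes the files are named after the vowels they contain)
vowel_names <- tools::file_path_sans_ext(basename(vowel_files))
# pairwise acoustic distance between all recordings
vowel_dist <- sapply(vowel_files, function(x) {
  sapply(vowel_files, function(y) acoustic_distance(x, y))
})
dimnames(vowel_dist) <- list(vowel_names, vowel_names)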

The distance matrix generated from this can be seen below:

vowel_dist[2:4,2:4]
##          a         e         i
## a  0.00000 10.509091 10.299737
## e 10.50909  0.000000  9.414545
## i 10.29974  9.414545  0.000000

We can now apply multidimensional scaling to the data:

vowel_mds <- cmdscale(vowel_dist, k = 3)
plot(-vowel_mds[,2], vowel_mds[,1], xlab = "", ylab = "")
text(-vowel_mds[,2], vowel_mds[,1], cex = 1.2, labels = vowel_names, pos = 4)

As can be seen, the distances between the vowels largely correspond to conventional charts of the acoustic vowel space.

dialectR remains in active development. If you would like to use dialectR in your research and have any concerns, ideas, or questions, do feel free to contact us.

dialectr's Issues

Applying dialectR to morphological paradigms

I'm curious how meaningful it would be to apply dialectR's edit distance to comparing the morphology of two dialects or related languages. Say two lects have similar but non-identical morphological paradigms: for example, archaic English "you goeth" vs normal "you go", or the Wiktionary paradigms of various Turkic languages. Have you explored that at all?
