
dayvonjersen / linguist


Detect programming language used in git repository. Go port of github linguist.

License: Apache License 2.0

Go 99.82% Makefile 0.18%
linguist github-linguist golang language-detection bayes-classifier

linguist's Introduction

linguist

godoc reference

Go port of github linguist.

Many thanks to @petermattis for his initial work in laying the groundwork of creating this project, and especially for suggesting the use of naive Bayesian classification.

Thanks also to @jbrukh for github.com/jbrukh/bayesian

install

prerequisites:

go get github.com/jteeuwen/go-bindata/go-bindata
mkdir -p $GOPATH/src/github.com/dayvonjersen/linguist
git clone --depth=1 https://github.com/dayvonjersen/linguist $GOPATH/src/github.com/dayvonjersen/linguist
go get -d github.com/dayvonjersen/linguist
cd $GOPATH/src/github.com/dayvonjersen/linguist
make
l

see also

command-line reference implementation, which is documented separately

tokenizer | (godoc reference)

linguist's People

Contributors

christiand93, dayvonjersen, petermattis


linguist's Issues

Suggestion: if .gitattributes has an override with linguist-language=UNKNOWN, then add it with a new random color to code line stats anyway

I have a suggestion that would be really useful on Gitea, which, as far as I can tell from searching, uses this library. When a repo's .gitattributes file contains an override such as linguist-language=UNKNOWN, where UNKNOWN is a language linguist doesn't know, that language should still be added to the statistics under the given name with any unused color, so that it shows up on Gitea's stats bar. This would help creators of small languages who host repos on Gitea instances without admin access: they could mark their source files so that the overall repo stats are correct, even though that obviously still won't fix the missing syntax highlighting.
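
One way to pick a color for a language that linguist does not know is to derive it from the language name, so the same name always gets the same color. The sketch below is only an illustration: fallbackColor is a hypothetical helper, not part of linguist, and hashing does not by itself guarantee that the color is unused by a known language.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// fallbackColor is a hypothetical helper (not part of linguist). It derives
// a stable "#rrggbb" color from a language name by hashing the name with
// FNV-1a and keeping the low 24 bits.
func fallbackColor(lang string) string {
	h := fnv.New32a()
	h.Write([]byte(lang))
	return fmt.Sprintf("#%06x", h.Sum32()&0xffffff)
}

func main() {
	// The same name always maps to the same color.
	fmt.Println("UNKNOWN", fallbackColor("UNKNOWN"))
}
```

A real implementation would also need to check the result against the colors already assigned in languages.yml and choose another one on collision.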

Problem with language mapping in case of duplicated percentage

Hi,

I noticed that if two languages have exactly the same byte count, the output is wrong: both percentages get mapped to the same language.

To test it, I created a CSS file and a JS file of the same size. Here is the output from the tool:

JavaScript: 50.0000%
JavaScript: 50.0000%

I think the problem is that existing entries in the qqq map get overwritten if the percentage is the same:

// From cmd/l/main.go starting at line 188
...
results := []float64{}
qqq := map[float64]string{}
for lang, num := range langs {
	res := (float64(num) / float64(total_size)) * 100.0
	results = append(results, res)
	qqq[res] = lang  // <-- overwriting map entries, if percentage already existing
}
...

Kind regards,
Christian
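
A possible fix, sketched under assumptions: keep each language together with its percentage in a slice of structs and sort that slice, instead of keying a map by the percentage. The names langShare and shares are hypothetical and do not come from cmd/l/main.go; only the percentage formula is taken from the snippet above.

```go
package main

import (
	"fmt"
	"sort"
)

// langShare pairs a language with its share of the total bytes.
// Hypothetical type, introduced only for this sketch.
type langShare struct {
	Lang    string
	Percent float64
}

// shares converts byte counts into percentages. The result is sorted by
// descending percentage, then by name, so equal percentages are both kept
// and appear in a stable order.
func shares(langs map[string]int, totalSize int) []langShare {
	out := make([]langShare, 0, len(langs))
	for lang, num := range langs {
		out = append(out, langShare{lang, float64(num) / float64(totalSize) * 100.0})
	}
	sort.Slice(out, func(i, j int) bool {
		if out[i].Percent != out[j].Percent {
			return out[i].Percent > out[j].Percent
		}
		return out[i].Lang < out[j].Lang
	})
	return out
}

func main() {
	// Two languages with the same byte count, as in the report.
	for _, s := range shares(map[string]int{"CSS": 120, "JavaScript": 120}, 240) {
		fmt.Printf("%s: %.4f%%\n", s.Lang, s.Percent)
	}
	// Prints:
	// CSS: 50.0000%
	// JavaScript: 50.0000%
}
```

Because the language name travels with its percentage, two languages with the same share can no longer overwrite each other.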

Linguist isn't thread safe

Currently you can't call LanguageByContents (and I believe other functions as well) from multiple goroutines.

Will add the exact error encountered later.

TypeScript code is detected as JavaScript.

LanguageByContents mistakenly thinks TypeScript is JavaScript.

For example, this will return "JavaScript" even though it's clearly TypeScript.

linguist.LanguageByContents(`
function wrapInArray(obj: string | string[]) {
  if (typeof obj === "string") {
    return [obj];
  }
  return obj;
}`, nil)

go mod tidy: "./testutil" is relative, but relative import paths are not supported in module mode

$ go mod tidy
go: finding module for package github.com/smartystreets/goconvey/convey
go: finding module for package gopkg.in/check.v1
go: found gopkg.in/check.v1 in gopkg.in/check.v1 v1.0.0-20201130134442-10cb98267c6c
go: found github.com/smartystreets/goconvey/convey in github.com/smartystreets/goconvey v1.8.1
go: downloading golang.org/x/text v0.8.0
github.com/ianlewis/linguist/cmd/l imports
        github.com/dayvonjersen/git4go tested by
        github.com/dayvonjersen/git4go.test imports
        ./testutil: "./testutil" is relative, but relative import paths are not supported in module mode

File extensions are not 1:1 unique with languages

For example, the .ts file extension is used by TypeScript as well as by Qt translation files, which are in XML format.

The languages.yml file lists this extension under both languages, but linguist internally links extensions to languages using a map with one language per extension. As a result, the last entry added to the map wins.

Currently LanguageByFilename returns a string. We should probably deprecate it and introduce a version of this function that returns a list of matching languages.
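
A minimal sketch of what such a function could look like, assuming a map from extension to a list of languages. The name languagesByFilename and the table contents are hypothetical and only illustrate the idea; they are not the library's API or its full data.

```go
package main

import (
	"fmt"
	"path/filepath"
)

// extensionLanguages maps an extension to every language that claims it.
// The entries are illustrative, not the full languages.yml data.
var extensionLanguages = map[string][]string{
	".ts": {"TypeScript", "XML"},
	".go": {"Go"},
}

// languagesByFilename is a hypothetical variant of LanguageByFilename that
// returns all candidate languages for a file name instead of a single one.
// It returns nil when the extension is unknown.
func languagesByFilename(filename string) []string {
	return extensionLanguages[filepath.Ext(filename)]
}

func main() {
	fmt.Println(languagesByFilename("messages.ts")) // prints [TypeScript XML]
}
```

With more than one candidate, the caller can fall back to content-based detection to choose between them.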

make the project go-getable

Maybe this is a configuration I have for git, but if I try to go get this repository, I get the following error:

< TONS OF SUBMODULE REGISTRATION AND CLONING >
...
fatal: could not read Username for 'https://github.com': terminal prompts disabled
fatal: clone of 'https://github.com/httpspec/sublime-highlighting' into submodule path '/Users/jzelinskie/OSS/go/faq/src/github.com/generaltso/linguist/data/linguist/vendor/grammars/Sublime-HTTP' failed
Failed to clone 'vendor/grammars/Sublime-HTTP' a second time, aborting
Failed to recurse into submodule path 'data/linguist'
package github.com/generaltso/linguist: exit status 1

If I run the following, everything works flawlessly:

go get github.com/jteeuwen/go-bindata/go-bindata
mkdir -p $GOPATH/src/github.com/generaltso/linguist
git clone --depth=1 https://github.com/generaltso/linguist $GOPATH/src/github.com/generaltso/linguist
go get -d github.com/generaltso/linguist
cd $GOPATH/src/github.com/generaltso/linguist
make

This is a shame because I'd like to include linguist in a project that I want to be go-getable and I cannot do so unless I vendor linguist and run its makefile (and carry that knowledge going forward for whenever I want to update linguist as a dependency).

ci: Automation to update linguist data

It would be nice to have a CI workflow that runs daily or weekly that will check if the Ruby linguist has been updated (or certain files in the Ruby linguist repo) and create a PR to update the submodule. This could also regenerate generated code if the code was checked in (Related to #6)

consider using a package manager?

Hey there! First off I'd like to say thank you for writing this port of linguist to Go. It has saved me a ton of time writing one myself. ❤️

Now, with that out of the way: when I first built linguist, one of this project's dependencies was broken in master. Because this project relies on go get to fetch its dependencies, the build failed until I submitted this PR over to git4go.

I would consider using a package manager like glide or dep so that future newcomers won't be bitten by issues like I was. Would you be amenable to a PR that provides a basic glide.yaml and instructions on how to build the project using glide?
