Comments (19)
I was able to reproduce this issue in #3027. Indeed, the GitHub in place editor replaces all POSIX LF newlines by CR LF.
I wonder whether this is somehow caused by the addition of pre-commit. I'll try with a repository that doesn't have pre-commit enabled.
from codespell.
Worse, this means you can't review and make or merge suggested changes because the diff is now too large. π’
from codespell.
Ah, I thought this was somehow because the dictionary file had grown too large, but yes, GitHub complains the diff is too large (50,000 lines). I was wondering where this comes from.
Which files are being changed by pre-commit? 50,000 lines means it's the dictionary file, but then why does it mess with the dictionary file?
from codespell.
Is 174a5bf just about whitespace changes? How can a couple of whitespace changes result in a diff of 50,000 lines? I'm lost.
from codespell.
I've not looked properly, but I suspect it's changing all the line endings back and forth.
from codespell.
To/from DOS/POSIX line endings? But then it would be an effect of mixed-line-ending, not trailing-whitespace.
It's still a mystery to me what triggers the massive diffs. Does the GitHub editor write files with DOS endings?
from codespell.
In the short term, the --fix=no
option of mixed-line-ending might work around these back & forth line endings changes.
from codespell.
It looks like the editor you are using in GitHub changes the newlines in all of dictionary.txt
.
- I am at a loss why it does that, a Google search comes up with nothing.
- I am at a loss why we hadn't noticed that before.
from codespell.
I wonder if something has changed, or it's just that pre-commit is changing them to say LF, whereas GitHub generates CRLF, and most other editors won't explicitly change things, so it never used to have an impact.
from codespell.
Some of your commits change all newlines to CR LF. For example, after 95a2fbc, all lines end with \r\n
:
$ git checkout peternewman-obsloete
$
$ git checkout 95a2fbc985065cff6f47c982e82355bb22d885b4
$
$ od -c dictionary.txt
0000000 1 n d - > 1 s t \r \n 2 r d - > 2
0000020 n d \r \n 2 s t - > 2 n d \r \n 3 n
0000040 d - > 3 r d \r \n 3 r t - > 3 r d
0000060 \r \n 3 s t - > 3 r d \r \n 4 r d -
0000100 > 4 t h \r \n _ _ a t t r i b y t
[...]
I suspect this is new behaviour. I mean, either we would have CR LF in all files by now, or we would have seen similar diffs of 50,000 lines in the past. The question is why now? Does pre-commit somehow affect the GitHub editor? Was this specific commit really created using the GitHub editor?
from codespell.
Which of the two following options did you choose to edit the dictionary?
from codespell.
They were all edit in place. I used find (and maybe replace), which is a fairly new feature in the basic editor, but that was about all of note.
from codespell.
Well I'm glad I'm not imagining it!
I wonder if it could be that maybe the first run of pre-commit fixed all the line endings to be consistent and the presumed auto-mode behaves differently when the file is consistent. Or worse it's clever enough to realise I'm editing on Windows and behave accordingly!
We could try something like this to fix it (if that editor honours it):
https://www.aleksandrhovhannisyan.com/blog/crlf-vs-lf-normalizing-line-endings-in-git/
from codespell.
The newlines in dictionaries had always been a simple LF. I checked old commits to make sure.
I have already run dos2unix
on dictionaries, the dictionaries haven't changed. This means all newlines are standard POSIX LF newlines.
from codespell.
This is a bug of the GitHub in place editor. What appears to trigger it is not pre-commit, but the size/contents of the modified file.
- I can reproduce the issue by copying the current
dictionary.txt
file into a distinct repository. See DimitriPapadopoulos/github-edit-in-place#8. - I cannot reproduce the issue by copying a past version of
dictionary.txt
(8d0d82b) into a distinct repository. See DimitriPapadopoulos/github-edit-in-place#3.
The size of file dictionary.txt
increased considerably in the last few weeks. Between 8d0d82b and current master
branch:
- 813 β1,129 KB
- 37,406 β51,304 lines
I suspect the size or number of lines triggered some bug, perhaps a variable overflow:
from codespell.
The bug is triggered when adding just 1 byte (and no new lines) to a file of size > 1 MB .
It is not triggered when adding 1 byte (or more) to a file of size <= 1 MB.
Tried with these two files:
1048576_bytes-52429_lines.txt
(1048576 bytes is exactly 1 MB)1048577_bytes-52429_lines.txt
(1048577 bytes is 1 MB + 1 byte)
I guess this clears things up: this really is a bug of the GitHub in place editor. Does this mean we should split dictionary.txt
into chunks < 1 MB? I would recommend we leave it as is for now. I have notified GitHub about the bug.
from codespell.
I guess this clears things up: this really is a bug of the GitHub in place editor. Does this mean we should split
dictionary.txt
into chunks < 1 MB? I would recommend we leave it as is for now.
That's some excellent digging! Can't we fix this in the short term by adding a pre-commit or repo .gitattributes to force it back to LF, which will at least mean the overall diff of a PR will be small, even if the changes per commit are significant?
I have notified GitHub about the bug.
Excellent, is that something public you can link to, or a private report option they have?
from codespell.
We already have a pre-commit hook that forces back to LF (hence the unreadable diffs). I have tried a .gitattributes
file that would force LF, but it doesn't seem to be taken into account by GitHub.
Unfortunately, the ticket I managed to create appears to be private β makes sense since GitHub is not open source.
from codespell.
This is a know GitHub bug when editing files > 1 MB. it had already been reported to them. They will probably fix it in time.
Also, see #1275 about breaking up dictionary.txt
into separate files, which might result in files <= 1 MB.
@peternewman Can we close this ticket now that it is clear this is a GutHub bug for files > 1 MB, unrelated to pre-comit?
from codespell.
Related Issues (20)
- Codespell behaves differently in pre-commit when referencing files to exclude HOT 1
- Allow defining pyproject.toml argument check-filenames as a boolean HOT 3
- The value of "--interactive" is not restricted
- Question: how can I create a space seperated word in a dictionary? HOT 1
- I can't get ignore-words to work HOT 1
- remove `crate -> create` HOT 3
- Add GNUAspell/benchβs common-all/batch0.tab dictionary
- Redundant sort test HOT 3
- `setuptools_scm` dependency HOT 2
- Ignore regular expressions HOT 6
- question: how to print filename of the file being processed?
- Support namespaced dictionary packages HOT 1
- "Goverment" not regonized in all relevant lines HOT 2
- support ignore for blocks of code
- Codespell Not Ignoring Words Properly when Provided in Original Case HOT 1
- release a new version ? HOT 15
- Unable to add inline comments for markdown documents HOT 7
- Splitting multiple 'skip' paths onto indented config file lines behaves differently than when on a single line HOT 4
- Are inline ignores restricted to hash-style comments? HOT 1
- Is it possible to combine a custom dictionary with a builtin dictionary? HOT 1
Recommend Projects
-
React
A declarative, efficient, and flexible JavaScript library for building user interfaces.
-
Vue.js
π Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
-
Typescript
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
-
TensorFlow
An Open Source Machine Learning Framework for Everyone
-
Django
The Web framework for perfectionists with deadlines.
-
Laravel
A PHP framework for web artisans
-
D3
Bring data to life with SVG, Canvas and HTML. πππ
-
Recommend Topics
-
javascript
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
-
web
Some thing interesting about web. New door for the world.
-
server
A server is a program made to process requests and deliver data to clients.
-
Machine learning
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
-
Visualization
Some thing interesting about visualization, use data art
-
Game
Some thing interesting about game, make everyone happy.
Recommend Org
-
Facebook
We are working to build community through open source technology. NB: members must have two-factor auth.
-
Microsoft
Open source projects and samples from Microsoft.
-
Google
Google β€οΈ Open Source for everyone.
-
Alibaba
Alibaba Open Source for everyone
-
D3
Data-Driven Documents codes.
-
Tencent
China tencent open source team.
from codespell.