Comments (4)
Thanks for suggesting this! I agree that the www and the root domain SHOULD be the same thing, and in many cases they are, so I will try to do some de-duping.
Unfortunately, I've seen a few too many cases of weirdly configured .mil domains, so I don't think I can get away with just adding the root domain when the scraper finds a www. For example, nslookup fails on dfas.mil, but works on www.dfas.mil. The same happens for a lot of unit sites, like wood.army.mil. I don't even know what to do with a domain like mail.mil, which is actively used for email, but only responds to https (not http) on the subdomain web.mail.mil. I suppose I should also be de-duping domains like www1.nga.mil, which responds with the same content as www.nga.mil (with nslookup on nga.mil failing), but there's no reason to assume this will always be so.
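To make the root-versus-www check concrete, here is a rough Python sketch of the kind of lookup described above. It is only an illustration, not the scraper's actual code: the function names (`resolves`, `root_and_www`, `check_pair`) are made up for this example, and it only tests whether a name resolves in DNS, not whether the two hosts serve the same content.

```python
import socket


def resolves(hostname):
    """Return True if DNS resolution of hostname succeeds (hypothetical helper)."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def root_and_www(hostname):
    """Return the (root, www) forms of a hostname.

    Only a literal leading "www." is stripped, so a name such as
    www1.nga.mil is treated as its own root and is not de-duplicated here.
    """
    host = hostname.lower().rstrip(".")
    root = host[4:] if host.startswith("www.") else host
    return root, "www." + root


def check_pair(hostname, resolver=resolves):
    """Report which of the root and www forms resolve.

    The resolver argument is injectable so the logic can be exercised
    without touching the network.
    """
    root, www = root_and_www(hostname)
    return {root: resolver(root), www: resolver(www)}
```

Calling `check_pair("www.dfas.mil")` would return a dictionary keyed by `dfas.mil` and `www.dfas.mil`; the values depend on live DNS, so a root name should only be added to the list when its entry comes back true.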
Anyway, I've clearly got my work cut out for me. How do you handle checking for duplicates on the .gov side?
from dotmil-domains.
After further thought, you've changed my mind on this.
Every browser that I checked is smart enough to try the www when the root domain fails. Also, checking every domain to look for cases where the www and the root domain serve different content would be a job for a different tool.
I think the code changes I committed last night solve the issue for the vast majority of cases, so I'm going to go ahead and close this. There are still a few duplicates, like that www1 example, but I may just have to handle those manually.
Thanks for taking a look at this!
from dotmil-domains.
One thing you might look at is the revamped site-inspector, which I helped with and which looks at all 4 "endpoints" for a domain -- https://www, http://www, https://, and http:// -- and tries to figure out which is "canonical" and what the overall behavior is.
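For anyone who wants to see what checking the four endpoints might look like, here is a minimal Python sketch. It is not site-inspector's code or API: `endpoints`, `probe`, and `survey` are hypothetical names, and the sketch only records which endpoints answer and where they end up after redirects. Deciding which one is "canonical" is left out.

```python
import http.client
import urllib.error
import urllib.request


def endpoints(domain):
    """Build the four endpoint URLs for a domain, given with or without www."""
    host = domain.lower().rstrip(".")
    root = host[4:] if host.startswith("www.") else host
    return [
        "https://www." + root,
        "http://www." + root,
        "https://" + root,
        "http://" + root,
    ]


def probe(url, timeout=5):
    """Fetch url and return (final_url, status), or None if the request fails.

    DNS failures, TLS errors, timeouts, and HTTP error statuses all count
    as failures in this simplified version.
    """
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.geturl(), resp.status
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        return None


def survey(domain, fetch=probe):
    """Map each of the four endpoints to its probe result.

    The fetch argument is injectable so the function can be tried
    without making real requests.
    """
    return {url: fetch(url) for url in endpoints(domain)}
```

If all the endpoints that respond report the same final URL, that URL is a reasonable first guess at the canonical one, though a real tool would need to look at far more than this.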
I know I'll at least be running the .mil domain list through it before long!
from dotmil-domains.
And relatedly, something I've put some work time into is https://github.com/18F/domain-scan, which uses site-inspector and a couple other tools to output data and reports.
from dotmil-domains.