
Comments (4)

esonderegger commented on August 16, 2024

Thanks for suggesting this! I agree that the www and the root domain SHOULD be the same thing, and in many cases they are, so I will try to do some de-duping.

Unfortunately, I've seen a few too many cases of weirdly configured .mil domains, so I don't think I can get away with just adding the root domain when the scraper finds a www. For example, nslookup fails on dfas.mil, but works on www.dfas.mil. The same happens for a lot of unit sites, like wood.army.mil. I don't even know what to do with a domain like mail.mil, which is actively used for email but only responds to https (not http) on the subdomain web.mail.mil. I suppose I should also be de-duping domains like www1.nga.mil, which responds with the same content as www.nga.mil (with nslookup on nga.mil failing), but there's no reason to assume this will always be so.

Anyway, I've clearly got my work cut out for me. How do you handle checking for duplicates on the .gov side?
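
To make the DNS situation above concrete, here is a minimal sketch of checking whether the root domain and its www counterpart each resolve. It is not code from dotmil-domains; the function names (`candidate_pair`, `resolves`, `dns_status`) are hypothetical, and it uses `socket.getaddrinfo` as a stand-in for running nslookup by hand.

```python
import socket


def candidate_pair(hostname):
    """Return (root, www) forms of a hostname, e.g. ('dfas.mil', 'www.dfas.mil')."""
    host = hostname.strip().lower().rstrip(".")
    root = host[4:] if host.startswith("www.") else host
    return root, "www." + root


def resolves(hostname):
    """True if the system resolver returns at least one address for hostname."""
    try:
        socket.getaddrinfo(hostname, None)
        return True
    except socket.gaierror:
        return False


def dns_status(hostname, resolver=resolves):
    """Map both the root and www forms of hostname to whether they resolve.

    The resolver argument is injectable (an assumption of this sketch) so the
    logic can be exercised without touching the network.
    """
    root, www = candidate_pair(hostname)
    return {root: resolver(root), www: resolver(www)}
```

Calling `dns_status("www.dfas.mil")` would report both forms separately, so a scraper could add the root domain only when it actually resolves rather than assuming it does.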

from dotmil-domains.

esonderegger commented on August 16, 2024

After further thought, you've changed my mind on this.

Every browser that I checked is smart enough to try the www when the root domain fails. Also, checking every domain to look for cases where the www and the root domain serve different content would be the job for a different tool.

I think the code changes I committed last night solve the issue for the vast majority of cases, so I'm going to go ahead and close this. There are still a few duplicates, like that www1 example, but I may just have to handle those manually.

Thanks for taking a look at this!
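
For illustration only, de-duping a list so that the www and root forms count as one entry might look roughly like the sketch below. This is not the committed change; `dedupe_www` is a hypothetical name, and keeping the first-seen form is an assumption. Hosts such as www1.nga.mil are deliberately left alone, matching the note above that those may need manual handling.

```python
def dedupe_www(hostnames):
    """Drop entries that differ from an earlier entry only by a leading 'www.'.

    Assumption of this sketch: the first form encountered is the one kept.
    Only an exact 'www.' label is stripped, so 'www1.' hosts are untouched.
    """
    seen = set()
    kept = []
    for host in hostnames:
        normalized = host.strip().lower().rstrip(".")
        key = normalized[4:] if normalized.startswith("www.") else normalized
        if key not in seen:
            seen.add(key)
            kept.append(normalized)
    return kept
```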


konklone commented on August 16, 2024

One thing you might look at is the revamped site-inspector, which I helped with and which looks at all 4 "endpoints" for a domain -- https://www, http://www, https://, and http:// -- and tries to figure out which is "canonical" and what the overall behavior is.

I know I'll at least be running the .mil domain list through it before long!
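
As a rough sketch of the four "endpoints" idea, the helper below only builds the four URLs for a domain, in the order listed above. It does not reproduce site-inspector's logic for deciding which endpoint is canonical, and `endpoints` is a hypothetical name chosen for this example.

```python
def endpoints(domain):
    """Return the four URLs to probe for a domain:
    https://www, http://www, https:// and http://, in that order.
    """
    root = domain.strip().lower().rstrip(".")
    if root.startswith("www."):
        root = root[4:]
    return [
        f"{scheme}://{prefix}{root}"
        for prefix in ("www.", "")
        for scheme in ("https", "http")
    ]
```

Each URL could then be requested separately and the responses compared, which is where cases like mail.mil (https only, on a subdomain) would show up.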


konklone commented on August 16, 2024

And relatedly, something I've put some work time into is https://github.com/18F/domain-scan, which uses site-inspector and a couple other tools to output data and reports.

