non-domains entries not visible (CLOSED)

Willem60 commented on August 16, 2024
non-domains entries not visible

from ftl.

Comments (13)

yubiuser commented on August 16, 2024

Which list are you seeing this with?

Willem60 commented on August 16, 2024

Which list are you seeing this with?

That is not important. I need to see which domains are considered non-domains.
(My domain list, which I import, contains only a-z, 0-9, dot, and hyphen.)

yubiuser commented on August 16, 2024

We cannot help you if you're not helping us to help you. We cannot reproduce the issue without knowing which list you're seeing issues with. This is not going to work.

PromoFaux commented on August 16, 2024

These are the regex patterns we use to define valid domains:

// Define valid domain patterns
// No need to include uppercase letters, as we convert to lowercase in gravity_ParseFileIntoDomains() already
// Adapted from https://stackoverflow.com/a/30007882
// - Added "(?:...)" to form non-capturing groups (slightly faster)
#define TLD_PATTERN "[a-z0-9][a-z0-9-]{0,61}[a-z0-9]"
#define SUBDOMAIN_PATTERN "([a-z0-9_-]{0,63}\\.)"
// supported exact style: subdomain.domain.tld
// SUBDOMAIN_PATTERN is mandatory for exact style, disallowing TLD blocking
#define VALID_DOMAIN_REXEX SUBDOMAIN_PATTERN"+"TLD_PATTERN
// supported ABP style: ||subdomain.domain.tlp^
// SUBDOMAIN_PATTERN is optional for ABP style, allowing TLD blocking: ||tld^
// See https://github.com/pi-hole/pi-hole/pull/5240
#define ABP_DOMAIN_REXEX "\\|\\|"SUBDOMAIN_PATTERN"*"TLD_PATTERN"\\^"

I guess you could manually parse your list to see if any lines don't fit those patterns.
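
A minimal sketch of such a manual check, using POSIX `regex.h`. The pattern macros are copied from the snippet above (with spaces added between the string literals). The helper names `matches_fully` and `line_is_usable`, the explicit `^(...)$` anchoring, and the use of extended regex syntax are assumptions made for illustration; FTL may compile and apply these patterns differently.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

// Patterns as quoted in the thread
#define TLD_PATTERN "[a-z0-9][a-z0-9-]{0,61}[a-z0-9]"
#define SUBDOMAIN_PATTERN "([a-z0-9_-]{0,63}\\.)"
#define VALID_DOMAIN_REXEX SUBDOMAIN_PATTERN "+" TLD_PATTERN
#define ABP_DOMAIN_REXEX "\\|\\|" SUBDOMAIN_PATTERN "*" TLD_PATTERN "\\^"

// Hypothetical helper: true if the whole line matches the pattern.
// Assumption: the pattern is anchored here with ^(...)$ and compiled
// as a POSIX extended regular expression.
static bool matches_fully(const char *pattern, const char *line)
{
    char anchored[512];
    regex_t re;

    snprintf(anchored, sizeof(anchored), "^(%s)$", pattern);
    if (regcomp(&re, anchored, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    const bool ok = regexec(&re, line, 0, NULL, 0) == 0;
    regfree(&re);
    return ok;
}

// Hypothetical helper: true if the line fits the exact or the ABP style
bool line_is_usable(const char *line)
{
    return matches_fully(VALID_DOMAIN_REXEX, line) ||
           matches_fully(ABP_DOMAIN_REXEX, line);
}
```

Calling `line_is_usable()` on each lowercased line of a list and printing the lines for which it returns false would give a rough view of the entries that do not fit the patterns.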

Willem60 commented on August 16, 2024

Hi PromoFaux, thank you very much for the info. I will look into it.

But the question was "I need to see which domains are considered non-domains". So it is not about which entries fail to fit those patterns. I have 151 non-domains that don't fit those patterns; I just don't see the 151, or even some of them, on my screen.

PromoFaux commented on August 16, 2024

Usually the gravity output would show a sample of 5 of them, but it seems there might be something "special" about your list that is preventing it from showing anything other than the first empty string value.

This is why it would be helpful to have visibility of the list, as we could then analyse, troubleshoot, and debug further.

DL6ER commented on August 16, 2024

It's also worth noting that gravity shows unique non-domains. If your list has 151 empty "domains" then the seen output is expected.
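
A small sketch of why many bad lines can collapse into a single displayed sample. It assumes, hypothetically, that each affected line starts with a NUL byte, so that C string functions see it as an empty string. The `samples` struct and both functions are invented for illustration and are not FTL's actual data structures; only the sample size of 5 comes from the comment above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MAX_SAMPLES 5

// Hypothetical store for the unique samples shown to the user
struct samples {
    const char *items[MAX_SAMPLES];
    size_t count;
};

// Store entry only if an equal string has not been stored yet.
// Returns true if the entry was added.
bool add_unique_sample(struct samples *s, const char *entry)
{
    for (size_t i = 0; i < s->count; i++)
        if (strcmp(s->items[i], entry) == 0)
            return false;

    if (s->count >= MAX_SAMPLES)
        return false;

    s->items[s->count++] = entry;
    return true;
}

// Count how many unique samples result from n rejected lines
size_t count_unique(const char *const *lines, size_t n)
{
    struct samples s = { .count = 0 };

    for (size_t i = 0; i < n; i++)
        add_unique_sample(&s, lines[i]);
    return s.count;
}
```

With this sketch, any number of lines that begin with a NUL byte compare equal to the empty string, so they would yield exactly one unique sample.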

Willem60 commented on August 16, 2024

I think my non-domains, which should go into the invalid_domains_list[i] array, are not listed because they do not match the false-positive regex (line 215 of gravity-parseList.c).

Here are my files:
hosts3.csv
hosts2.csv
hosts1.csv

rdwebdesign commented on August 16, 2024

False positives are suppressed, but there are just a few items considered false positives.

This is the list of false positives:

// A list of items of common local hostnames not to report as unusable
// Some lists (i.e StevenBlack's) contain these as they are supposed to be used as HOST files
// but flagging them as unusable causes more confusion than it's worth - so we suppress them from the output
#define FALSE_POSITIVES "^(localhost|localhost.localdomain|local|broadcasthost|localhost|ip6-localhost|ip6-loopback|lo0 localhost|ip6-localnet|ip6-mcastprefix|ip6-allnodes|ip6-allrouters|ip6-allhosts)$"

Do you have any of these entries?
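
To check individual entries against that list, a sketch along these lines could be used. The `FALSE_POSITIVES` macro is copied from the snippet above; the function name `is_false_positive` and the choice of POSIX extended regex flags are assumptions, not necessarily how FTL evaluates the pattern.

```c
#include <regex.h>
#include <stdbool.h>
#include <stddef.h>

// Pattern as quoted in the thread
#define FALSE_POSITIVES "^(localhost|localhost.localdomain|local|broadcasthost|localhost|ip6-localhost|ip6-loopback|lo0 localhost|ip6-localnet|ip6-mcastprefix|ip6-allnodes|ip6-allrouters|ip6-allhosts)$"

// Hypothetical helper: true if entry is one of the suppressed hostnames.
// Assumption: the pattern is compiled as a POSIX extended regex.
bool is_false_positive(const char *entry)
{
    regex_t re;

    if (regcomp(&re, FALSE_POSITIVES, REG_EXTENDED | REG_NOSUB) != 0)
        return false;

    const bool hit = regexec(&re, entry, 0, NULL, 0) == 0;
    regfree(&re);
    return hit;
}
```

An ordinary domain, or an empty string, does not match this pattern, so false-positive suppression alone would not hide such entries.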

yubiuser commented on August 16, 2024

hosts3.csv is a defective file. It contains binary data on these lines:

1126341:direktpaket.com
1126342:donzidirect.com
1126343:dowslakemicro.com
1126344:dragoman.com
1126345:drisner-trockenbau.de
1126346:e.fivebelow.com
1126347:fav7bhn0.atlassian.net
1126348:fca-worldwide.com
1126349:forgela.com
1126350:freshworks.com
1126351:gmx.net
1126352:gotowebinar.com
1126353:halfsow.shop
1126354:hell.sighnun.shop
1126355:hicglobalsolution.com
1126356:hotsighning.com
1126357:icloud.com
1126358:infrac2.ddns.net
1126359:infraccion18.ddns.net
1126360:itariannotifications.com
1126361:jouw-pensioen.nl
1126362:jssgallery.org
1126363:judecollins.com
1126364:just-in-time-racing.com
1126365:katiestevens.net
1126366:keelhauler.org
1126367:km.maarhoudcontact.com
1126368:leadpartners24.nl
1126369:luelstudio.com
1126370:magicduino.com
1126371:mail.app.com
1126372:maxxtrend.nl
1126373:medianews24.nl
1126374:news.sedo.com
1126375:pro-versender.com
1126376:profiprodukte.net
1126377:profiverkauf.com
1126378:qualiview.nl
1126379:realspouse.com
1126380:riotops.com
1126381:routezilla.com
1126382:seniorenvoordeelpas.nl
1126383:snapmood.shop
1126384:spielendraussen.de
1126385:stackoftuts.com
1126386:starmodernfurniture.com
1126387:successoverpass.com
1126388:successwithkenny.info
1126389:sunnysideas.com
1126390:tahiti.com
1126391:take.sighnun.shop
1126392:thecircuitdetective.com
1126393:thedailyracquet.com
1126394:us.pycon.org
1126395:versender50.com
1126396:versenderbuero.com
1126397:wcr-datacontrol.info
1126398:werkzeugeonline.net
1126399:wxs.nl
1126400:adamslaboratory.com
1126401:afiph.org
1126402:archeinconsultants.com
1126403:armada.mil.ec
1126404:beach-north.com
1126405:becomeuagain.com
1126406:boomcomunicazione.com
1126407:c14.tez.host
1126408:campusleeuwarden.com
1126409:cas.menshealthyagain.com
1126410:centreforglobaleducation.com
1126411:chmcok.com
1126412:chs-deutschland.de
1126413:cluster.com
1126414:comeonconnect.com
1126415:coupon1euro.com
1126416:dd12postapoc.com
1126417:designpartnersindonesia.com
1126418:devip2.noc401.com
1126419:dogcareco.com
1126420:easthartford.org
1126421:fivebelow.com
1126422:generatorenprofis.net
1126423:generatorexperten.com
1126424:giapeaservices.com
1126425:goproswimtri.com
1126426:goraifilms.com
1126427:gowologlobal.com
1126428:grandestar.net
1126429:han-solo.net
1126430:hemafoundation.org
1126431:hirallabs.com
1126432:inmotionhosting.com
1126433:joinaff.com
1126434:joypluscondoms.com
1126435:kawaramachi-ai.com
1126436:kvk.nl
1126437:loopevolutionrecords.com
1126438:lottiecooper.lc
1126439:mail2you.club
1126440:markenhandelonline.com
1126441:meltingpotaz.com
1126442:metrostroy.com
1126443:mgdgirlsguild.org
1126444:mobile-stromerzeuger.com
1126445:moghadamzaferan.com
1126446:muenster.de
1126447:mumrests.com
1126448:murakamitatami.com
1126449:nextnewcustomer.com
1126450:nickblattfilms.com
1126451:opensea.io
1126452:ordenlaw.com
1126453:ovathemes.com
1126454:ovh.net
1126455:premium232.web-hosting.com
1126456:premium81.web-hosting.com
1126457:rackharbor.com
1126458:rent355.com
1126459:repois2020.com
1126460:rs.wewehost.com
1126461:ruleengineering.com
1126462:ryanmorel.com
1126463:s15.avl4.acemsrvd.com
1126464:s3.csa1.acemsd2.com
1126465:s6.csa1.acemsd3.com
1126466:se1.ezhostingserver.com
1126467:seobrand.net
1126468:server.vromsystems.com
1126469:sewingshoppe318.com
1126470:sgg-egypt.com
1126471:sharonlouisephotography.com
1126472:splus-s.com
1126473:sspatra.com
1126474:stage-app.nl
1126475:statecensus.info
1126476:stromgeneratoren-handel.com
1126477:talaskurutma.com
1126478:testsendblaster.com
1126479:tin.it
1126480:tonepit.rest
1126481:trinec.org
1126482:uitgekookt.nl
1126483:unsub.spmta.com
1126484:uttoron.com
1126485:vidtour.shop
1126486:warenoutlet.net
1126487:werkzeughandeldirekt.net
1126488:xml-io.proteusthemes.com
1126489:xmsnet.nl
1126490:z-kompass.com
1126491:zakelijk-diensten.nl

151 Lines affected.

Willem60 commented on August 16, 2024

OK, thank you for finding this issue.
I use $in = preg_replace('/[^a-zA-Z0-9\.\-]/s','',$in); in PHP to clean up binary data. I don't know C.
Maybe it's an idea to put something similar in gravity-parseList.c.
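
For reference, a rough C counterpart of that PHP clean-up could look like the sketch below. The function `strip_non_domain_bytes` is hypothetical and is not part of gravity-parseList.c; the thread does not say how the eventual fix works, so this only illustrates the idea of keeping the bytes a-z, A-Z, 0-9, dot, and hyphen. It takes an explicit length so that embedded NUL bytes do not end the input early.

```c
#include <stddef.h>

// Hypothetical helper: copy only [a-zA-Z0-9.-] bytes from in[0..len)
// to out and NUL-terminate the result. Returns the number of bytes kept.
// The caller must provide room for len + 1 bytes in out.
size_t strip_non_domain_bytes(const char *in, size_t len, char *out)
{
    size_t kept = 0;

    for (size_t i = 0; i < len; i++) {
        const char c = in[i];
        const int ok = (c >= 'a' && c <= 'z') || (c >= 'A' && c <= 'Z') ||
                       (c >= '0' && c <= '9') || c == '.' || c == '-';
        if (ok)
            out[kept++] = c;
    }
    out[kept] = '\0';
    return kept;
}
```

Note that, like the PHP version, this would also strip newlines if it were applied to a whole file rather than to one line at a time.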

rdwebdesign commented on August 16, 2024

(My domain list, which I import, contains only a-z, 0-9, dot, and hyphen.)

Actually your file contains many NULL characters between lines 1126342 and 1126492:

[image]
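
One way to confirm this for a list held in memory is to scan it line by line for NUL bytes. The sketch below is a hypothetical helper, not code from FTL; it assumes the file has already been read into a buffer of known length and that lines are separated by `\n`.

```c
#include <stddef.h>
#include <string.h>

// Hypothetical helper: count the lines in buf[0..len) that contain at
// least one NUL byte. Lines are separated by '\n'.
size_t count_lines_with_nul(const char *buf, size_t len)
{
    size_t affected = 0;
    size_t start = 0;

    while (start < len) {
        // Find the end of the current line (or the end of the buffer)
        const char *nl = memchr(buf + start, '\n', len - start);
        const size_t line_len = nl ? (size_t)(nl - (buf + start)) : len - start;

        if (memchr(buf + start, '\0', line_len) != NULL)
            affected++;

        start += line_len + 1; // skip past the newline
    }
    return affected;
}
```

Because the scan uses `memchr` with explicit lengths instead of string functions, a NUL byte inside a line does not end the search early.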

yubiuser commented on August 16, 2024

Fixed with the linked PR.
