Part of the logic of my classification DB SquidBlocker is youtube related. To allo

YouTube urls and links classification,about andybalholm/redwood

Comments (80)

andybalholm commented on August 16, 2024 1

I've merged the master branch back into the external-classifier branch now. So you should be able to test an up-to-date version.

from redwood.

andybalholm commented on August 16, 2024 1

I've merged the external-classifier branch now.

I'm pretty sure that if you write your ACLs correctly, you can do your CONNECT rules with external-classifier.

andybalholm commented on August 16, 2024

That wouldn't be the way to make it work with my external-classifier API. My API is intended to work with a single endpoint that gives it scores for all the categories it supports.

Multiple endpoints are possible but not very efficient.

elico commented on August 16, 2024

@andybalholm OK but step by step we will get there.
I understand the logic.

elico commented on August 16, 2024

@andybalholm specific use cases:

self signed certificate of the destination host
basic http authentication over https or http. (is it possible to use the scheme https://username:password@service/?url=https://google.com/) as it is?

elico commented on August 16, 2024

@andybalholm I have a sketch for the service.
It works nicely with curl.
I will try to test it with the RedWood branch, but can you add another parameter to the classification URL: the source IP address? For example:
https://username:password@service/?url=https://google.com/&src=192.168.8.1

Basic sketch:
http://gogs.ngtech.co.il/NgTech-LTD/redwood-classification-portal

elico commented on August 16, 2024

@andybalholm OK so the service is up and responding with score 1000 for porn but...
there is no such category so what happens is:

Feb 25 02:45:51 chr-lab-px6 redwood[4724]: 2019/02/25 02:45:51 error fetching https://www.google.com/complete/search?client=firefox&q=sex: context canceled
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: 2019/02/25 02:45:52 http2: panic serving 192.168.74.30:53006: runtime error: invalid memory address or nil pointer dereference
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: goroutine 425 [running]:
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc0003b01b0, 0xc00057ffaf, 0xc00068a380)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /home/eliezer/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2088 +0x163
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: panic(0x972060, 0xf3e6c0)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /usr/local/go/src/runtime/panic.go:513 +0x1b9
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: main.(*config).ChooseACLCategoryAction(0xc0001a0f00, 0xc00069e0f0, 0xc00069e030, 0x113, 0xc00057f328, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /home/eliezer/Scripts/redwood-original/acl.go:556 +0x76f
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: main.proxyHandler.ServeHTTP(0x1, 0xc000427e33, 0x3, 0x0, 0x0, 0xab1a40, 0xf46d00, 0xab6c80, 0xc0003b01b0, 0xc0001e5300)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /home/eliezer/Scripts/redwood-original/proxy.go:300 +0x9e5
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: net/http.serverHandler.ServeHTTP(0xc0004f2270, 0xab6c80, 0xc0003b01b0, 0xc0001e5300)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /usr/local/go/src/net/http/server.go:2741 +0xab
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: net/http.initNPNRequest.ServeHTTP(0xc0005ace00, 0xc0004f2270, 0xab6c80, 0xc0003b01b0, 0xc0001e5300)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /usr/local/go/src/net/http/server.go:3291 +0x8d
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: net/http.Handler.ServeHTTP-fm(0xab6c80, 0xc0003b01b0, 0xc0001e5300)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /usr/local/go/src/net/http/h2_bundle.go:5592 +0x4d
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: golang.org/x/net/http2.(*serverConn).runHandler(0xc00068a380, 0xc0003b01b0, 0xc0001e5300, 0xc00032fc40)
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /home/eliezer/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2095 +0x89
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: created by golang.org/x/net/http2.(*serverConn).processHeaders
Feb 25 02:45:52 chr-lab-px6 redwood[4724]: /home/eliezer/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:1829 +0x4d3
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: 2019/02/25 02:45:59 http2: panic serving 192.168.74.30:53008: runtime error: invalid memory address or nil pointer dereference
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: goroutine 432 [running]:
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: golang.org/x/net/http2.(*serverConn).runHandler.func1(0xc0003b0050, 0xc0006d9faf, 0xc00061dc00)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /home/eliezer/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2088 +0x163
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: panic(0x972060, 0xf3e6c0)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /usr/local/go/src/runtime/panic.go:513 +0x1b9
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: main.(*config).ChooseACLCategoryAction(0xc0001a0f00, 0xc00036a570, 0xc00036a4b0, 0x113, 0xc0006d9328, 0x1, 0x1, 0x0, 0x0, 0x0, ...)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /home/eliezer/Scripts/redwood-original/acl.go:556 +0x76f
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: main.proxyHandler.ServeHTTP(0x1, 0xc0002a0053, 0x3, 0x0, 0x0, 0xab1a40, 0xf46d00, 0xab6c80, 0xc0003b0050, 0xc0001e4500)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /home/eliezer/Scripts/redwood-original/proxy.go:300 +0x9e5
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: net/http.serverHandler.ServeHTTP(0xc0001788f0, 0xab6c80, 0xc0003b0050, 0xc0001e4500)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /usr/local/go/src/net/http/server.go:2741 +0xab
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: net/http.initNPNRequest.ServeHTTP(0xc000038a80, 0xc0001788f0, 0xab6c80, 0xc0003b0050, 0xc0001e4500)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /usr/local/go/src/net/http/server.go:3291 +0x8d
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: net/http.Handler.ServeHTTP-fm(0xab6c80, 0xc0003b0050, 0xc0001e4500)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /usr/local/go/src/net/http/h2_bundle.go:5592 +0x4d
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: golang.org/x/net/http2.(*serverConn).runHandler(0xc00061dc00, 0xc0003b0050, 0xc0001e4500, 0xc00034c4c0)
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /home/eliezer/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:2095 +0x89
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: created by golang.org/x/net/http2.(*serverConn).processHeaders
Feb 25 02:45:59 chr-lab-px6 redwood[4724]: /home/eliezer/go/pkg/mod/golang.org/x/[email protected]/http2/server.go:1829 +0x4d3

elico commented on August 16, 2024

@andybalholm OK I managed to make a category and I see that in the access logs:
..
Maybe I am missing some configurations since the score is there.
...
The following acl.conf configuration worked for me, but the blocks aren't nice:
block porn

Let me know how you want to proceed with this.

thinkwelltwd commented on August 16, 2024

self signed certificate of the destination host

If self-signed certs are actually a problem, the answer is to make Redwood support self-signed certs in the manner required.

basic http authentication over https or http.

Then make Redwood support the authentication you require; in 2019, what server actually uses basic http auth?

Since the RedWood proxy and Squid lacks some dynamicity.

Yes, Squid lacks dynamicity. In what ways does Redwood lack the same dynamicity? The examples you provide would be resolved by adding additional configuration knobs.

It's not clear why either of the 2 examples merits splitting the logic into 2 libraries, where the second library would be required to re-implement the classifying features already provided by Redwood, plus the config knobs you currently find lacking.

In my mind, the correct approach would be to have drbl-peer call the Redwood classifier and process Redwood's results, rather than have Redwood call a classifier like drbl-peer.

elico commented on August 16, 2024

@thinkwelltwd First, I think that RedWood is bound to the OS certificate store, and I assume that it's possible to just add the self-signed certificate to this store.

About the basic auth: yes, there are internal services that use basic auth as an internal mechanism, but FW the service is much preferred.
For example, in my lab I have a manual classification DB that humans and machines can interact with.
These services sit in a closed environment and can be accessed via a secured gateway.
I know that many use services like OpenDNS or other providers for blacklist or whitelist classification of URLs and/or domains, but many of these services are based on manually inserted data.

This feature gives the admin an option to forcibly decide, without automation or an algorithm, that some content belongs to a specific category.
One example: a friend of mine had a public blacklist which was supposed to be up to date, and since a news site contained tabula content it was considered an "Ads" site instead of a "news" site.
Enforcing policy is something that is needed in many ISPs and households.

This is also why I wrote SquidBlocker:

It works great for me for blocking YouTube videos in a way that no other service I know of offers, i.e. by video ID.

If you can add more modules to reduce the use of external customized classification services (which sometimes don't make sense to some), then it would be nice.
Specifically, Facebook, YouTube, Twitter, Instagram and other content can be blocked by a human but is hard to spot or classify.

For example, picture DHASH lists are nice, but can RedWood reload a list in real time?

... With this in mind, I do agree that RedWood can be used as a classification service, which might be used as an ICAP service.

thinkwelltwd commented on August 16, 2024

First, I think that RedWood is bound to the OS certificate store, and I assume that it's possible to just add the self-signed certificate to this store.

Yes, certainly.

About the basic auth: yes, there are internal services that use basic auth as an internal mechanism, but FW the service is much preferred.

So there are internal services that use basic auth, and the classifier needs to authenticate with these services to correctly classify the content. Am I understanding the scenario correctly?

For example, in my lab I have a manual classification DB that humans and machines can interact with. These services sit in a closed environment and can be accessed via a secured gateway.

Then the answer is apparently No, I'm not understanding the scenario correctly. So where is the basic auth required?

To connect to an external resource of classifier rules?
To connect to the resource serving content where classification is required?

I know that many use services like OpenDNS or other providers for black or white list classification of URL's and/or domains but many of these services are based on manually inserted data.....

This feature gives the admin an option to forcibly decide, without automation or an algorithm, that some content belongs to a specific category.

When the admin forcibly decides something, he has to communicate his will to a running process, running on a server somewhere. Why should this process be an external classifier rather than Redwood?

One example is that a friend of mine had a public black list which should be up-to-date and since a news site contained tabula content it was considered an "Ads" site instead of "news" site.

I think this is part of the core misunderstanding. Last time I checked, Squid relies on blacklists / regexes / mimetypes, etc. and has little to no native capacity to work with the response body. OTOH, a properly configured Redwood is much more interested in the response body, and therefore focuses much more on phrases than blacklists. In this regard it is orders of magnitude more powerful than Squid.

At this point, it seems like the hypothetical scenario is no longer that Redwood query an external classifier, but rather that Redwood query an external database/service of hashes or domains.

I can see why that might be useful if the database of records is too large to fit in memory.

For example pictures DHASH are nice but, can RedWood reload a list in real time?

Yes. Just do a service redwood reload. All currently active connections are held open until completion or timeout. All new connections are handled with the new configuration.

andybalholm commented on August 16, 2024

@andybalholm specific use cases:

self signed certificate of the destination host

basic http authentication over https or http. (is it possible to use the scheme https://username:password@service/?url=https://google.com/) as it is?

Putting the username and password in the URL works. The Go http package automatically adds an Authorization: Basic header when you do that.

But I was envisioning that in production deployments, the external classifier would usually be running on the same machine as Redwood. This would make authentication (and certificates) much less important.
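
The behavior described above can be checked with a small Go sketch. It starts a throwaway local server that echoes the credentials it received, then requests it with a URL carrying userinfo. The helper name `fetchWithUserinfo` and the stub server are assumptions made for illustration; they are not part of Redwood.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// fetchWithUserinfo (hypothetical helper) requests a local stub server
// using a URL of the form http://user:pass@host/?url=... and returns
// what the server saw in the Authorization header.
func fetchWithUserinfo(user, pass string) (string, error) {
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		u, p, ok := r.BasicAuth() // parses "Authorization: Basic ..."
		fmt.Fprintf(w, "%v %s %s", ok, u, p)
	}))
	defer srv.Close()

	target, err := url.Parse(srv.URL)
	if err != nil {
		return "", err
	}
	target.User = url.UserPassword(user, pass)
	target.RawQuery = url.Values{"url": {"https://google.com/"}}.Encode()

	// No header is set explicitly; the client derives it from the URL.
	resp, err := http.Get(target.String())
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	got, err := fetchWithUserinfo("username", "password")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	fmt.Println(got)
}
```

Note that credentials embedded in a URL tend to end up in logs and process listings, which is one more reason to prefer a classifier running on the same machine.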

andybalholm commented on August 16, 2024

can you add another parameter to the classification url?
the src ip address?

It sounds like you don't understand the key insight that inspired the development of Redwood: classifying the page and applying a policy are two separate operations.

In DansGuardian (the filter we used before), there was no real classification step. It only dealt with policy (allowing or blocking the page). Redwood classifies a page first, and then decides what to do with it based on that classification.

The types of external databases you have been talking about integrating with sound like they belong in the classification stage. They tell us whether the page contains porn, malware, etc. I don't see what use the source IP address would be in that case.
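
The two-stage split can be illustrated with a minimal Go sketch. This is not Redwood's code: the types `Scores` and `Policy`, the functions `classify` and `decide`, and the threshold rule are all assumptions chosen to show why the source IP belongs to the policy stage rather than the classification stage.

```go
package main

import "fmt"

// Scores maps a category name to the score a classifier gave the page.
type Scores map[string]int

// Policy is a hypothetical per-client rule set: block any page whose
// score in one of the listed categories reaches Threshold.
type Policy struct {
	Blocked   []string
	Threshold int
}

// classify is stage one. It depends only on the page, so the same URL
// yields the same scores no matter who requested it.
func classify(pageURL string, db map[string]Scores) Scores {
	return db[pageURL]
}

// decide is stage two. It combines the scores with the policy that
// applies to the requesting client and returns "block" or "allow".
func decide(s Scores, p Policy) string {
	for _, cat := range p.Blocked {
		if s[cat] >= p.Threshold {
			return "block"
		}
	}
	return "allow"
}

func main() {
	db := map[string]Scores{"https://example.com/": {"porn": 1000}}
	// The source IP selects a policy; it never reaches classify.
	policies := map[string]Policy{
		"192.168.8.1": {Blocked: []string{"porn"}, Threshold: 500},
		"192.168.8.2": {}, // unrestricted client
	}
	scores := classify("https://example.com/", db)
	for _, ip := range []string{"192.168.8.1", "192.168.8.2"} {
		fmt.Println(ip, decide(scores, policies[ip]))
	}
}
```

Both clients get identical scores for the page; only the decision differs.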

andybalholm commented on August 16, 2024

there is no such category so what happens is:

I've pushed a commit that fixes the nil pointer dereference.

andybalholm commented on August 16, 2024

I can see why that might be useful if the database of records is too large to fit in memory.

The primary use-case would be a database that can be queried (for a particular domain or URL) but not enumerated (to list all domains or URLs). One example that @elico mentioned would be OpenDNS's list of porn sites. You can easily check it with a DNS query, but you can't download the whole list.

andybalholm commented on August 16, 2024

At this point, it seems like the hypothetical scenario is no longer that Redwood query an external classifier, but rather that Redwood query an external database/service of hashes or domains.

Right. I decided the service should return a map of categories to weights so that it can query multiple databases at once, and combine the results at one API endpoint. But maybe that's more elaborate than necessary.

Maybe a better approach would be to add some ACL types that check domains and URLs against external programs, and match if the program has a zero exit status. Then we could make a script like this:

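#!/bin/bash

# Ask the OpenDNS resolver about the domain given as $1.
# Exit status 0 (a match) means the answer contains the block-page address.
# -F treats the dots literally; the unquoted backslashes would be stripped by the shell.
host "$1" 208.67.222.123 | grep -qF '146.112.61.106'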

If the script was called opendns-check, then we would (with this hypothetical feature) be able to use it in Redwood like this:

acl opendns-adult check-domain opendns-check
block opendns-adult
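
For comparison, the lookup that script performs can also be done in-process. The sketch below is an assumption-laden illustration, not an existing Redwood feature: `lookupVia` and `containsBlockAddr` are hypothetical names, and the resolver and block-page addresses are simply the ones used in the script. Without an argument the program only demonstrates the comparison on a canned answer and makes no DNS query.

```go
package main

import (
	"context"
	"fmt"
	"net"
	"os"
	"time"
)

const (
	resolverAddr = "208.67.222.123:53" // resolver used by the script above
	blockAddr    = "146.112.61.106"    // address the script greps for
)

// containsBlockAddr reports whether any returned address is the
// block-page address, i.e. whether the domain is listed.
func containsBlockAddr(addrs []string) bool {
	for _, a := range addrs {
		if a == blockAddr {
			return true
		}
	}
	return false
}

// lookupVia resolves host using the given DNS server instead of the
// system resolver.
func lookupVia(ctx context.Context, server, host string) ([]string, error) {
	r := &net.Resolver{
		PreferGo: true,
		Dial: func(ctx context.Context, network, _ string) (net.Conn, error) {
			var d net.Dialer
			return d.DialContext(ctx, network, server)
		},
	}
	return r.LookupHost(ctx, host)
}

func main() {
	if len(os.Args) > 1 {
		ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
		defer cancel()
		addrs, err := lookupVia(ctx, resolverAddr, os.Args[1])
		fmt.Println(os.Args[1], err == nil && containsBlockAddr(addrs))
		return
	}
	// No argument: show the comparison on a canned answer.
	fmt.Println(containsBlockAddr([]string{blockAddr}))
}
```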

thinkwelltwd commented on August 16, 2024

If the script was called opendns-check, then we would (with this hypothetical feature) be able to use it in Redwood like this:

acl opendns-adult check-domain opendns-check
block opendns-adult

That limits us to one service call per ACL then, whether or not the service supports multiple return values, which brings up scalability questions. Could multiple ACL definitions point to the same script, where the ACL tag name is one of the expected service return values?

Something like so:

acl opendns-adult check-domain opendns-check
acl opendns-p2p check-domain opendns-check
acl opendns-tasteless check-domain opendns-check

The service would return a zero for non-matching or one of the ACL tag names. Would this be a way that a single API call could address multiple ACLs?

(I don't think the API service call should be allowed to return any string value; that would get very hard to reason about when diagnosing connection handling.)

andybalholm commented on August 16, 2024

Or we could extend the concept of using the command's exit status. If no particular exit status is specified in the ACL rule, it would match when the exit status is zero. But you could also specify another exit status to check for:

acl opendns-adult check-domain opendns-check 1
acl opendns-p2p check-domain opendns-check 2 
acl opendns-tasteless check-domain opendns-check 3
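
A minimal Go sketch of how such exit-status matching could work is shown below. The `check-domain` ACL type is still hypothetical at this point in the thread, so the function names `exitStatus` and `matches` are assumptions too, and `sh -c 'exit 2'` merely stands in for a script like opendns-check.

```go
package main

import (
	"errors"
	"fmt"
	"os/exec"
)

// exitStatus runs the command and returns its exit status. An error is
// returned only when the command could not be run at all.
func exitStatus(name string, args ...string) (int, error) {
	err := exec.Command(name, args...).Run()
	if err == nil {
		return 0, nil
	}
	var ee *exec.ExitError
	if errors.As(err, &ee) {
		return ee.ExitCode(), nil
	}
	return -1, err
}

// matches reports whether the command exited with the status named in
// the ACL rule. A rule without an explicit status would pass want = 0.
func matches(want int, name string, args ...string) bool {
	got, err := exitStatus(name, args...)
	return err == nil && got == want
}

func main() {
	// Stand-in for the checker script: it always exits with status 2.
	for want := 0; want <= 3; want++ {
		fmt.Println(want, matches(want, "sh", "-c", "exit 2"))
	}
}
```

With one rule per status, each ACL in the example above would start the script again; caching the status per domain would be a natural refinement, but it is not shown here.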

elico commented on August 16, 2024

@andybalholm @thinkwelltwd This specific thread is about YouTube...
We don't yet have a full downloadable DB of IDs.
Therefore we need to classify them individually.

Squid is not an ideal solution in any case.
The fact that RedWood was written makes a great point about the subject from a couple of points of view.
Squid-Cache is named for caching, and this was its main purpose.
Over time the project took a couple of directions, based on the developers and the different investors.
Amos has been leading the project for a long time now, and I have seen a change in the caching approach.
There is a specific thing that leads the project due to past errors, and it's "integrity" from the "CIA" acronym.

As I was told, Squid-Cache is not a performance system, but it is coded at a very low level, which lets it handle a couple of use cases very well.
Because of its caching nature, Squid-Cache delegated content-level work to eCAP and ICAP services.
These are nice, but a well-performing one is hard to come by. One of the main things that shows this is actually RedWood: ICAP does things wrong and RedWood does things better.
However, RedWood, like any classifier, cannot compete with something like nested motion pictures in pirated YouTube videos, but it can do a lot as the main proxy process.
It has good garbage collection and solid libraries, enough to run a 10 Gbps system (leaving aside for a moment the CPU and RAM requirements).

So back to the subject...
YouTube videos can be both trance and porn, both trance and nudity, or both ads and news, and can be rated R or PG-13.
If we have a child at home and we want to limit some aspects by his MAC address or IP address, we can write an ACL specifically for that.
Let's say it is the kids' hour and I want to limit the kids to PG-13 rated videos: what can I do with RedWood today? Do I have a reconfiguration page, or do I have to get into the device and reload via the CLI?
I don't expect you to implement everything as open source...
But I can give an example of a tool that did something nicely but lacks some security aspects:
http://gatesentryfilter.abdullahirfan.com/

A shell command is nice, but opening a process per request, compared to running internal code that sits in RAM, plus a couple of other things, leads me to think that this specific solution is a nice concept, even if it might not fit all purposes and might be irrelevant to some. I think it's a good one compared to what I was thinking of at the start.

Since this specific feature doesn't cause any regression in the current state of the code and can be reverted with basically a single command, I think it's a good one.

@andybalholm I will test the new version and see how it's performing.
(I have another question but I will ask separately)
Thanks so far to you two for the coding and the fruitful discussion.

thinkwelltwd commented on August 16, 2024

Or we could extend the concept of using the command's exit status. If no particular exit status is specified in the ACL rule, it would match when the exit status is zero. But you could also specify another exit status to check for:

I really like that idea. It makes the ACL line slightly more explicit.

This specific thread is about YouTube... We don't yet have a full downloadable DB of IDs. Therefore we need to classify them individually.

I'm sorry; perhaps I was the one that side-tracked it.

With all respect though, if you don't have a database of IDs and need to classify individually, then start by entering the IDs as regexes in Redwood's existing category structure. That'll take you a very long way.

YouTube videos can be both trance and porn, can be both trance and nudity, can be both ads and news , can be rated as R or PG-13.

Redwood is a text classifier only, with no computer vision functionality. But you probably already knew that.

Let say this is the kids hour time and I want to limit the kids to PG-13 rate videos, what can I do with RedWood today?

If you frame the problem that way, there's no way to limit kids to PG-13 videos today. You can limit them by categories defined as phrase lists, but not by Youtube's video rating. We use a multi-level approach:

Enforce Safe Search by Youtube DNS but that's awkward being at the router level.
I've built a graphical console around classifying Youtube videos by downloading the text / keywords via API call and passing that text to Redwood's classifier API; approved videos play embedded inside the console page. That works pretty well and makes it easy to have all kinds of access controls and reporting, but it wouldn't work in an ISP context.

So back to the original question, assuming you have a list of PG-13 rated IDs and need a solution today, Redwood's categories will get you far down the road.

And with that I'll bow out of the discussion. Hope something can work out!!

elico commented on August 16, 2024

@andybalholm just to make things clear... towards what @thinkwelltwd wrote.
I have a DB and a team that classifies for an ISP (not for me individually), but I doubt that the current RedWood configuration structure would be able to cope with constant reconfiguration or constant updates of the lists or regexes (let's say 30 IDs per second).

Please let me know if this specific concept makes this subject and feature relevant.
As for the script that will run, let's say it will run a curl command per URL, i.e.:

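#!/bin/bash

# Send the page URL ($1) as a URL-encoded "url" query parameter.
# Exit status 0 means the service's answer contains "PG-13".
curl -s -G --data-urlencode "url=$1" "http://youtube-id-classification-service/" | grep -q "PG-13"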

I assume that it would be better to do something like that inside RedWood instead of using a simple script.
For a tiny network with 10 users a simple script might be OK but for a SOHO workplace (40 users) or even a small ISP (1k clients) I think a script approach would be the one they wouldn't want.

elico commented on August 16, 2024

@thinkwelltwd If you can extract URL's from inside a page then it's a very good thing.
It means that the classification can be upgraded to do some page mangling, right?(I think it's for another issue/discussion)

andybalholm commented on August 16, 2024

I doubt that the current RedWood configuration structure would be able to cope with a constant reconfiguration or constant update of the lists or regex.(let say 30 ID's per sec)

Redwood does a complete reload of its configuration every time it receives SIGHUP. So the throughput of changes doesn't matter; what matters is how often the configuration is reloaded. So it's a matter of balancing the performance impact of reloading against how far out of date you're willing to let the configuration get.

Surprisingly, the main performance impact of reloading is not CPU usage, but RAM usage. In some cases a long-running request keeps an old configuration from being garbage-collected until the request finishes. (This happens much less than it used to, but it still happens.) So it's possible to have several copies of the configuration in memory at once.

The redwood server in my office can reload its configuration in roughly 1 second of CPU time (1 core of a 4-core Phenom 9850). But as I recall the configuration occupies roughly 50–100 MB of RAM. (This is a configuration that takes up roughly 7 MB on disk.) If a busy server were configured to reload its configuration every 5 seconds, the redwood process might easily use several GB just storing outdated configurations that haven't been garbage-collected yet.

But reloading once a minute would likely be practical in many situations.
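
The memory effect described above follows from the usual reload-by-swap pattern, sketched below under stated assumptions. This is not Redwood's implementation: `Config`, `load`, `reload`, and `snapshot` are hypothetical names. A request takes a pointer to the configuration when it starts, so a reload cannot free the old copy until every request holding it has finished.

```go
package main

import (
	"fmt"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
)

// Config stands in for the fully parsed configuration (tens of MB in
// the example above). Generation only makes the swap visible.
type Config struct {
	Generation int
}

var current atomic.Value // always holds a *Config

// load would re-read every configuration file from disk.
func load(gen int) *Config { return &Config{Generation: gen} }

// reload builds a complete new configuration and swaps it in.
func reload() {
	old := current.Load().(*Config)
	current.Store(load(old.Generation + 1))
}

// snapshot is what a request calls once when it starts. The returned
// pointer keeps that configuration alive until the request is done.
func snapshot() *Config { return current.Load().(*Config) }

// watchSIGHUP triggers a reload for every SIGHUP received.
func watchSIGHUP() {
	ch := make(chan os.Signal, 1)
	signal.Notify(ch, syscall.SIGHUP)
	go func() {
		for range ch {
			reload()
		}
	}()
}

func main() {
	current.Store(load(1))
	watchSIGHUP()

	inFlight := snapshot() // a long-running request starts here
	reload()               // configuration is reloaded meanwhile

	// The old copy is still referenced; new requests see the new one.
	fmt.Println(inFlight.Generation, snapshot().Generation)
}
```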

For a tiny network with 10 users a simple script might be OK but for a SOHO workplace (40 users) or even a small ISP (1k clients) I think a script approach would be the one they wouldn't want.

I'm sure the overhead of starting that many new processes wouldn't be a serious issue on a 40-user system, but I can see how it could be on a 1000-user system.

andybalholm commented on August 16, 2024

I think the best way to handle this Youtube ID database would be to have a scheduled task that periodically (say every 5 minutes) updates a bunch of Redwood configuration files full of regexes and reloads Redwood.

Is this the only database you're currently using that needs the full URL (not just the domain name)? Maybe the best way to deal with the domain-name ones is to add support for simple SpamAssassin-style URI DNSBLs.

elico commented on August 16, 2024

@andybalholm The last time I touched SpamAssassin was years ago.
Indeed, URI DNSBLs are good.
The issue is that we have dynamic URLs that keep changing, and regexes that are kind of static.
If I need to predict a specific video page, then I need some software to find the URL and another to answer what it's "worth" or what it "contains".
I believe that YouTube can be verified against RedWood lists, but not exactly with regexes.
I have an algorithm based on the YouTube API's description that can assist in finding the right ID to match against the RedWood DB.

Can this classification feature be integrated into RedWood in some way?
The current code is not big, and I believe that it can be pulled into master and later on, if required, be replaced by a successor feature.

andybalholm commented on August 16, 2024

So are you dynamically classifying YouTube videos based on their metadata, as the requests come in? I had the impression that you had a database of YouTube IDs and their classifications, which was being rapidly updated.

elico commented on August 16, 2024

@andybalholm I have two systems:

One classifies into a set of categories such as R, PG-13, etc.; this one is a DB.
The other classifies a video ID as permitted for a specific group, while other groups are not permitted the same.

The concept is that we have a SPAM level ratio:

128 is R or porn
100 is adults
60 is educational
etc...

I don't have a static list of IDs, since they are checked by a team or an "AI" as they come in.
The "AI" is signaled to check the YouTube ID for:

ytimg
youtube video content
youtube video thumbnails

The DB content is being populated as we speak...
I can instruct it to test and check the metadata, but I rely on the video-checking AI more than on the metadata.

andybalholm commented on August 16, 2024

So does the team or "AI" check the video in real-time, between the HTTP request and response?

elico commented on August 16, 2024

The AI checks in almost real time but they have two policies:

real time users
non real time users

Real-time users receive the page as long as it has not been blacklisted.
Non-real-time users receive a page saying that the video is being checked for them, with a refresh interval of 15 seconds.
As long as the video has not been whitelisted, the same block page is served; if it has been blacklisted, the user receives a 100% block page with an option to send a form about it.
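
The two policies can be written down as a small decision table. The Go sketch below is only an interpretation of the description above: the names `Verdict`, `Action`, and `decide` are hypothetical, and it assumes that a blacklisted video is blocked for real-time users as well.

```go
package main

import "fmt"

// Verdict is the state of a video ID in the classification DB.
type Verdict string

const (
	Unchecked   Verdict = "unchecked"
	Whitelisted Verdict = "whitelisted"
	Blacklisted Verdict = "blacklisted"
)

// Action is what the proxy sends back to the client.
type Action string

const (
	Serve   Action = "serve the page"
	Pending Action = "checking page, refresh every 15 seconds"
	Block   Action = "block page with report form"
)

// decide applies the two policies: real-time users are served until a
// video is blacklisted, while the other users wait until it has been
// whitelisted.
func decide(realTime bool, v Verdict) Action {
	switch {
	case v == Blacklisted:
		return Block
	case v == Whitelisted:
		return Serve
	case realTime:
		return Serve // not yet checked, but not blacklisted either
	default:
		return Pending
	}
}

func main() {
	for _, rt := range []bool{true, false} {
		for _, v := range []Verdict{Unchecked, Whitelisted, Blacklisted} {
			fmt.Printf("realTime=%v %s: %s\n", rt, v, decide(rt, v))
		}
	}
}
```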

andybalholm commented on August 16, 2024

So neither group is truly real-time as I was defining it, but the 15-second latency requirement would likely make doing a full reload of the Redwood configuration problematic.

elico commented on August 16, 2024

From my point of view we are talking about a system of at least 1k clients, since RedWood is that good.
Therefore reloading the proxy itself is a luxury that only specific systems have.
Squid doesn't have this function, and since there are known memory leaks in any reload (a leak for even 15 seconds), when counting 1k clients and above it's crucial that the proxy be able to run smoothly without a reload or reconfiguration.
This is how it's done in many systems that I have seen.

andybalholm commented on August 16, 2024

Doesn't Squid reload the config with squid -k reconfigure? Or SIGHUP?

elico commented on August 16, 2024

It does, but its use is discouraged.

andybalholm commented on August 16, 2024

So it sounds like some way to call out to an external process will be necessary in order to make this work with the required latency. But I'm not very satisfied with the external-classifier interface I proposed earlier. What I'm inclined toward right now is a new ACL type, check-url. It would be configured like this:

acl ngtech-bad check-url http://ngtech.co.il/rbl/vote/?url=

The URL of the page would be query-encoded and appended to the specified URL. If fetching the resulting URL returns a status of 200, it is considered a match.

This seems like it would be "the simplest thing that could possibly work." What do you think?
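
The proposed check can be sketched in a few lines of Go. This follows the description above (query-encode the page URL, append it to the configured prefix, treat status 200 as a match) but is not the eventual Redwood implementation; `checkURL` and `demo` are hypothetical names, and `demo` runs against a local stub service rather than the real endpoint.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"net/url"
)

// checkURL appends the query-escaped page URL to prefix, fetches the
// result, and reports a match when the status is 200.
func checkURL(client *http.Client, prefix, pageURL string) (bool, error) {
	resp, err := client.Get(prefix + url.QueryEscape(pageURL))
	if err != nil {
		return false, err
	}
	defer resp.Body.Close()
	return resp.StatusCode == http.StatusOK, nil
}

// demo starts a stub voting service that answers 200 for one listed
// URL and 404 for everything else, then checks pageURL against it.
func demo(pageURL string) (bool, error) {
	listed := map[string]bool{"https://www.youtube.com/watch?v=XYZ": true}
	srv := httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		if listed[r.URL.Query().Get("url")] {
			w.WriteHeader(http.StatusOK)
			return
		}
		http.NotFound(w, r)
	}))
	defer srv.Close()
	return checkURL(srv.Client(), srv.URL+"/rbl/vote/?url=", pageURL)
}

func main() {
	for _, u := range []string{
		"https://www.youtube.com/watch?v=XYZ",
		"https://www.example.com/",
	} {
		match, err := demo(u)
		fmt.Println(u, match, err)
	}
}
```

A production version would want a client timeout and a decision about what a failed fetch means (match or no match); both are left open here.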

elico commented on August 16, 2024

@andybalholm Well, I think it's a good solution. It will certainly add some delay, but the clients are OK with this.
They care more about security or safety than about speed.
And this external service has a cache, so it will speed things up.
I am testing it right now with my API and it seems to work as expected with the basic sketch.

andybalholm commented on August 16, 2024

I've been thinking about another option: embed a JavaScript interpreter into Redwood (probably github.com/dop251/goja), and support "ACL script" files. An ACL script file would be a JS script that would be passed an http.Request object, and could set ACLs on the request based on its evaluation of the URL, headers, etc.

This would be more complicated than my other proposal, but it would also be more flexible. For example, if you wanted to check YouTube IDs, the script could return quickly if the URL isn't on YouTube, and just reach out to the external service if it is.

What do you think?

elico commented on August 16, 2024

@andybalholm It depends on whether it can utilize threading and a couple of other things related to concurrency.
GreasySpoon (an ICAP service) did that with ECMAScript, Java, and Ruby.
It can be good... if it utilizes multiple cores.
I am not sure how to write testing/checking code in this language.

andybalholm commented on August 16, 2024

It wouldn't be multithreaded within the checks for a single request. But multiple requests would be checked in parallel.

In many ways I would prefer to use dynamically-loaded Go plugins, but -buildmode=plugin only works on Linux and macOS, and it can't be cross-compiled. I'm reluctant to add those constraints to Redwood.

Another option would be statically-compiled plugins. I would add "hooks" for easily extending Redwood's behavior, but the extension code would need to be compiled into the Redwood binary. Maybe that's the best way to do it.

elico commented on August 16, 2024

I have seen a similar concept in CoreDNS.
Whether it's better to use hooks depends only on the amount of work involved.
For CoreDNS I have tried to compile such a thing but have yet to see a good example of it.

from redwood.

elico commented on August 16, 2024

@andybalholm I just remembered that I may have forgotten to say that the classification answer will not take 15 seconds, since the video just enters a queue to be checked.
It would be a horrible service if every request required the client to wait 15 seconds.
For those requesting another URL, or a URL with a specific component, the service will block at first; after the queue job is done (the video has been inspected), the client will receive a good rating, i.e. 128 is bad but 0 100 is good.
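
A minimal sketch of the "block first, rate later" behaviour described above. The class name, the in-memory queue, and the `inspect` callback are hypothetical; 128 is the "bad" score from the comment, and 0 is assumed here to be a "good" one.

```python
import queue

BLOCK_SCORE = 128  # "bad" rating mentioned in the comment
GOOD_SCORE = 0     # assumption: a low score means "good"


class QueuedClassifier:
    """Answers immediately; unknown videos are blocked until inspected."""

    def __init__(self):
        self.ratings = {}             # video ID -> final score
        self.pending = queue.Queue()  # IDs waiting for inspection
        self.queued = set()           # avoids queueing the same ID twice

    def classify(self, video_id):
        if video_id in self.ratings:
            return self.ratings[video_id]
        if video_id not in self.queued:
            self.queued.add(video_id)
            self.pending.put(video_id)
        return BLOCK_SCORE  # block until the queue job has finished

    def run_pending(self, inspect):
        """Drain the queue; inspect(video_id) stands in for the slow check."""
        while not self.pending.empty():
            video_id = self.pending.get()
            self.ratings[video_id] = inspect(video_id)
            self.queued.discard(video_id)
```

In a real service `run_pending` would run in a background worker, so the client never waits for the inspection itself.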

from redwood.

elico commented on August 16, 2024

@andybalholm I would like to try and check this again.
The current logic I am testing is:
allow specific https://www.youtube.com/embed/XYZ
while blocking all other content, such as lists or single YouTube videos that are not embedded.

My first step would be to create an up-to-date version of:
https://github.com/andybalholm/redwood-config

With ssl-bump
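
The allow/block logic above could be sketched like this. The allow-list contents, the function name, and the returned labels are hypothetical and only illustrate the idea; this is not how Redwood or the actual service implements it.

```python
from urllib.parse import urlsplit

# Hypothetical allow-list of embed IDs; "XYZ" is the placeholder from the comment.
ALLOWED_EMBED_IDS = {"XYZ"}


def youtube_decision(url):
    """Return "allow", "block", or "not-youtube" for a request URL."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host not in ("www.youtube.com", "youtube.com"):
        return "not-youtube"
    segments = [s for s in parts.path.split("/") if s]
    # Only /embed/<ID> with an allow-listed ID is permitted.
    if len(segments) == 2 and segments[0] == "embed" and segments[1] in ALLOWED_EMBED_IDS:
        return "allow"
    # Watch pages, playlists, and unknown embeds are blocked.
    return "block"
```

The ID is compared without lowercasing, because YouTube IDs are case-sensitive.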

from redwood.

andybalholm commented on August 16, 2024

Yes, that configuration could use some TLC before it’s used in production. Andy

On Mar 12, 2020, at 3:02 PM, Eliezer Croitoru wrote:

@andybalholm I would like to try and check this again. The current logic I am testing is: allow specific https://www.youtube.com/embed/XYZ while blocking all other content like lists or single youtube content which is on embedded. My first step would be to create an up-to-date version of: https://github.com/andybalholm/redwood-config With ssl-bump

from redwood.

elico commented on August 16, 2024

@andybalholm OK, so the basic interface would be:
Request from RedWood: a GET or POST with a query or form argument named url, whose value is the URL of the pending request.
As curl commands, the request would be:
curl "http://localhost:5000/api/v1/classify?url=https://www.youtube.com/watch?v=TdZ4asd84FM"
or
curl "http://localhost:5000/api/v1/classify" --data "url=https://www.youtube.com/watch?v=TdZ4asd84FM"
or a more complex one:
curl --header "Content-Type: application/json" --request POST --data '{"url":"https://www.youtube.com/watch?v=TdZ4asd84FM"}' http://localhost:5000/api/v1/classify
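
Because the url value itself contains `?` and `=` (and often `&`), it is safer to percent-encode it before it goes into a query string or form body. A minimal sketch of building the three request variants; the endpoint is the one from the curl examples, and the helper names are hypothetical:

```python
import json
from urllib.parse import urlencode

ENDPOINT = "http://localhost:5000/api/v1/classify"  # from the curl examples


def build_get_url(target):
    """GET variant: the target URL is percent-encoded into the query string."""
    return ENDPOINT + "?" + urlencode({"url": target})


def build_form_body(target):
    """POST variant with application/x-www-form-urlencoded data."""
    return urlencode({"url": target}).encode("ascii")


def build_json_body(target):
    """POST variant with a JSON document."""
    return json.dumps({"url": target}).encode("utf-8")
```

With curl, `--data-urlencode` gives the same encoding for the form variant.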

...I'm missing the required raw response for a couple of categories.
The video mentioned above is a documentary, which should count as an educational video.
What should the classification service's response be for this one, and for a couple of other use cases:

non-rated content
piracy/warez
G, i.e. kids-rated

(I do not have the code from last time)

from redwood.

andybalholm commented on August 16, 2024

Educational: {"education":500}
Non-rated: {}
Piracy: {"piracy":500}
G-Rated: {"g-rated": 500}

The category names and exact scores are up to you, of course. But non-rated should simply return an empty object to indicate the absence of information.
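
As an illustration of that response shape, here is a minimal sketch of the lookup side of a classifier. The table and function name are hypothetical; only the JSON shapes come from the examples above.

```python
import json

# Hypothetical ratings table: URL -> {category: score}.
RATINGS = {
    "https://www.youtube.com/watch?v=TdZ4asd84FM": {"education": 500},
}


def classification_response(url):
    """Return the JSON body for a URL; unknown URLs yield an empty object."""
    return json.dumps(RATINGS.get(url, {}))
```

An unknown URL produces `{}`, which signals the absence of information rather than a "safe" verdict.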

Keep in mind that the external-classifier branch has not been merged into master and is a year out of date. I'm not opposed to merging it; I'm just waiting to see if anybody actually uses it.

from redwood.

elico commented on August 16, 2024

@andybalholm I noticed that it's not merged.
I am still trying to put the basic setup together again... a Docker container is my first attempt.

from redwood.

elico commented on August 16, 2024

@andybalholm I started testing it without intercept on my local LAN using proxy.PAC.
The proxy bumps TLS and works pretty well.
Currently I have dedicated proxy and DB VMs.

from redwood.

elico commented on August 16, 2024

@andybalholm OK, so I tested a bit and there is one specific issue.
How can I distinguish CONNECT requests from regular requests?
I am receiving:
url=http://www.test.com:9090

for URL: https://www.test.com:9090
I need something in the classification request that identifies it as a CONNECT, and then I would also be able to instruct RedWood to bump the connection.

from redwood.

andybalholm commented on August 16, 2024

It would be trivial to add a method parameter to the external classifier API. But you shouldn't need to distinguish CONNECT requests in the classification phase. That should be done with ACLs.

from redwood.

elico commented on August 16, 2024

@andybalholm I have a couple of use cases in which I need to bump connections dynamically.
I have seen that I can classify a URL or a CONNECT request into a category such as "tlsbump" or "sslbypass", and with these I can decide whether a specific user and/or request/host/port should be bumped.

Is there any way to do this dynamically with ACLs, using an external ACL/classification interface?

from redwood.

andybalholm commented on August 16, 2024

Make your external classifier put it in a category that gets special treatment in your ssl-bump acls.

from redwood.

elico commented on August 16, 2024

@andybalholm it does...
Currently my assumption, from the code at: https://github.com/andybalholm/redwood/blob/master/tls.go#L168

is that the URL of a TLS request will not have a forward slash after the host, i.e.:

http://example.com:443
is a CONNECT request.

However, any other real HTTP/1.1 URL should have a path that starts with a forward slash, i.e.:

http://example.com:8000/
is not a CONNECT.
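
That heuristic can be written down in a few lines. This is only a sketch of the assumption stated above (an empty path means CONNECT); the function name is hypothetical, and it is not a documented Redwood guarantee.

```python
from urllib.parse import urlsplit


def looks_like_connect(url):
    """Heuristic: a classification URL with no path at all came from a CONNECT."""
    parts = urlsplit(url)
    return parts.path == "" and parts.query == ""
```

Passing the request method explicitly is more robust than inferring it from the shape of the URL.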

What I think should be in the classification request:

URL
Request METHOD
SRC IP address
UserID/ClientID

In my tests I have seen very large URLs that cannot fit into a GET request URL, so my recommendation is to use a POST request with either form values or JSON.

Also, since RedWood does everything else, I think other details of the request are not relevant for most use cases.
Specifically for YouTube URLs (and other dynamic sites), more "customized" filtering is required, and that is outside the scope of what I am working on now.

from redwood.

andybalholm commented on August 16, 2024

I've brought the external-classifier branch up to date now, switched it to use a POST request, and added the request method to the data that is submitted.

But a URL classifier should not need to know the client username or IP address. Those may be relevant to choosing an action, but they do not change what the page is. And that is what classification is about. Setting policy for specific users comes later.

from redwood.

elico commented on August 16, 2024

@andybalholm OK.
I have tested it and it works for me fine and as expected.
For a CONNECT request I am returning the following response:
{ "nudity_annotation": 1000, "tlsbump": 1000 }
which makes the tlsbump category match but carries only nudity_annotation, compared to a regular non-CONNECT response:
{ "nudity": 1000, "tlsbump": 1000 }
which will match the nudity category, and the connection will be "broken".
That is, with the annotation the connection will be bumped, and only the HTTP request will be blocked by the URL classification, since the nudity category matches.
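
A minimal sketch of that method-dependent response. The category names and scores are the ones from the two JSON examples above; the function and its parameters are hypothetical.

```python
def scores_for(method, is_nudity):
    """Build the category scores returned for one classification request."""
    scores = {"tlsbump": 1000}
    if is_nudity:
        # For CONNECT, use a category that no block rule references, so the
        # connection is bumped and the inner HTTP request is blocked instead.
        key = "nudity_annotation" if method == "CONNECT" else "nudity"
        scores[key] = 1000
    return scores
```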

About the user ID and/or IP address, you are right that it should be handled at the ACL level, later in the code.

Just mentioning a use case:
A network of ~200 clients (desktops/mobile/others) that requires the option to change ACLs for 5 minutes, or any other period, for a specific username/userid/IP.
For me it only makes sense to change the ACLs with a reload, on the condition that only the ACLs are reloaded and not the categories, unless there was a change on disk.

Should I open a new issue?

from redwood.

andybalholm commented on August 16, 2024

Do you mean that putting the "nudity" category on the CONNECT request makes Redwood block the connection instead of bumping it? You should have your ACLs set up in such a way that the default category actions won't affect a CONNECT request—that CONNECT requests should always be either allowed or bumped.

from redwood.

andybalholm commented on August 16, 2024

I don't want to get into just reloading part of the configuration.

from redwood.

elico commented on August 16, 2024

@andybalholm I wanted to have two options:

bump everything and bypass specific urls/domains/ips
bump only specific domains/ips/urls

From my point of view, for every TLS connection with an SNI that should be blocked, like "example.org" or "", there should be an option to either close the connection or bump-and-block at the HTTP level.

My current testing acls.conf:

acl connect method CONNECT
block opendns-adult
block ytbl
block nudity
allow localnet connect sslbypass
ssl-bump localnet connect
ignore-category connect
acl localnet user-ip 192.168.0.0/16 10.0.0.0/8 172.16.0.0/12
acl all user-ip 0.0.0.0/0
disable-proxy-headers all

redwood.conf

http-proxy :8080
http-proxy :18080
transparent-https :18443
blockpage /var/redwood/static/block.html
static-files-dir /var/redwood/static
cgi-bin /var/redwood/cgi
categories /etc/redwood/categories
acls /etc/redwood/acls.conf
external-classifier http://yt-classfier.ngtech.home/api/v1/checkurl/url
threshold 275
content-pruning /etc/redwood/pruning.conf
query-changes /etc/redwood/safesearch.conf
access-log /var/log/redwood/access.log
tls-log /var/log/redwood/tls-access.log
tls-cert /etc/redwood/ssl-cert/myCAcert.pem
tls-key /etc/redwood/ssl-cert/myCAkey.pem
# cd /etc/redwood/ && tree

.
├── acls.conf
├── acls.conf.backup
├── categories
│   ├── ads
│   │   ├── category.conf
│   │   └── domains.list
│   ├── index.csv
│   ├── nudity
│   │   ├── category.conf
│   │   └── regex.list
│   ├── ****
│   │   ├── category.conf
│   │   ├── custom-domains.list
│   │   └── domains.list
│   ├── sslbumpbypass
│   │   ├── category.conf
│   │   └── sslbumpbypass.list
│   ├── tlsbump
│   │   ├── category.conf
│   │   └── domains.list
│   └── vpn
│       ├── category.conf
│       └── domains.list
├── opendns.js
├── pruning.conf
├── redwood.conf
├── safesearch.conf
├── sslbump-defaultbypass-acls.conf
└── ssl-cert
    ├── myCAcert.pem
    └── myCAkey.pem

from redwood.

andybalholm commented on August 16, 2024

Move the lines that reference the connect ACL up, so that they come before the block lines, and they will take precedence. But when it's not a CONNECT request, it will fall through to the block lines.

(And get rid of the ignore-category connect line. That used to be useful to prevent default actions, but it doesn't work any more. You can't ignore a category without mentioning its name in the ACL action line; the old way was just too confusing.)
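
To see why the order matters, here is a toy model of rule evaluation. It assumes that the first rule whose ACLs all match decides the action, which is what "take precedence" suggests. It is a simplification for illustration, not Redwood's implementation, and the function name is hypothetical.

```python
def first_matching_action(rules, request_acls):
    """Return the action of the first rule whose ACLs all apply to the request."""
    for action, required in rules:
        if all(name in request_acls for name in required):
            return action
    return None


# Rule order as posted in acls.conf above.
ORIGINAL = [
    ("block", ["opendns-adult"]),
    ("block", ["ytbl"]),
    ("block", ["nudity"]),
    ("allow", ["localnet", "connect", "sslbypass"]),
    ("ssl-bump", ["localnet", "connect"]),
]

# Same rules with the connect lines moved above the block lines.
REORDERED = ORIGINAL[3:] + ORIGINAL[:3]

connect_request = {"localnet", "connect", "nudity"}  # CONNECT to a "nudity" host
plain_request = {"localnet", "nudity"}               # regular request, same host
```

Under this model the CONNECT request is blocked with the original order and bumped with the reordered one, while the plain request falls through to `block nudity` in both.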

from redwood.

elico commented on August 16, 2024

@andybalholm I finally understood what made me think it worked as expected.
I had a p**n category, and that was what was checking the URLs, not the classification service.

I now see that CONNECT requests are only passed to the classification service after the connections are bumped.

From my point of view the classification service is an ACL service.
It has more "info" on the pages based on their urls, method, etc.

What I might need is an ACL and Classification service :\

from redwood.

elico commented on August 16, 2024

@andybalholm What works fine for me is a second service in the configuration.
An example with both the external classifier and a TLS-bump ACL service is at:
diff: elico@0caa3b6

from redwood.

andybalholm commented on August 16, 2024

Everything you are doing there could be done with a single classifier service and a properly written acls.conf. In particular, it is quite possible to write an acls.conf file that will not let a CONNECT request ever be blocked.

from redwood.

elico commented on August 16, 2024

@andybalholm I have tried to understand how to do that and currently I do not see how.

I wanted to classify YouTube URLs using only ACLs and categories, but...
The main issue is that the URLs are tested with a lowercase comparison.
For a YouTube ID to be matched, at least a real, full (case-sensitive) regex match is needed.
Therefore I wrote my local service.
I will try to publish it when I feel comfortable.
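
A short sketch of why lowercasing breaks ID matching. The regex assumes the commonly seen 11-character ID form in `v=` parameters; the allow-list and function name are hypothetical.

```python
import re

# Case-sensitive pattern for the v= parameter (assumes 11-character IDs).
VIDEO_ID = re.compile(r"[?&]v=([A-Za-z0-9_-]{11})(?:[&#]|$)")

ALLOWED_IDS = {"TdZ4asd84FM"}  # hypothetical allow-list


def extract_video_id(url):
    """Return the video ID with its original case, or None if absent."""
    match = VIDEO_ID.search(url)
    return match.group(1) if match else None


url = "https://www.youtube.com/watch?v=TdZ4asd84FM"
exact = extract_video_id(url)            # keeps the original case
lowered = extract_video_id(url.lower())  # what a lowercase comparison sees
```

Here `exact` is in `ALLOWED_IDS` and `lowered` is not, and two different videos whose IDs differ only in case would collapse into the same lowercase key.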

from redwood.

andybalholm commented on August 16, 2024

That shows why a separate classifier service might be needed, but CONNECT requests being blocked is a separate issue.

from redwood.

elico commented on August 16, 2024

@andybalholm This is a docker deployment with a simplified classification service:
https://github.com/elico/yt-classification-service-example

Technically it can be installed with a DB backend, which makes the service a bit more dynamic.
A next level could be a classification portal for a community project.

from redwood.

mrbluecoat commented on August 16, 2024

I'm just waiting to see if anybody actually uses it.

I'm interested. I've done some preliminary research of external services: https://mrbluecoat.blogspot.com/2019/11/url-filtering-services.html

https://github.com/luigi1809/webfilter-ng#for-filtering-based-on-dns--categorifyorg-api is a good place to start (redis caching and nghttpx significantly improve performance)

from redwood.

elico commented on August 16, 2024

@andybalholm what is missing for an integration of the service?
Maybe an option to turn it on and off?

not sure..

from redwood.

andybalholm commented on August 16, 2024

I don't understand your question.

from redwood.

elico commented on August 16, 2024

@andybalholm To my understanding, the external classification branch was not merged into master.
What are the options for an external classification service?
Now that I'm reading this again, I was thinking about using something like the OpenDNS example:

var openDNSResult = lookupHost(request.URL.Host, "208.67.222.123");

if (openDNSResult == "146.112.61.106") {
	addACL("opendns-adult");
} else if (openDNSResult == "146.112.61.108") {
	addACL("opendns-phishing");
}

But with my service.
I need an example. Let's say my service sits at: https://ngtech.co.il/rbl_service
How would I send the URL of the request and the method to the remote service, and how can I read the result?
That seems more reasonable than changing the whole logic of the software.

Was this clear?

from redwood.

andybalholm commented on August 16, 2024

The only thing keeping the external-classifier branch from being merged is lack of users. I don't want to merge code that won't be used. So if the external-classifier branch does what you need, go ahead and use it; I'll merge it.

If not, give me documentation on your service (what the requests and responses look like), and I'll see if I can tell you how to use it with Redwood's scripting support.

from redwood.

elico commented on August 16, 2024

@andybalholm Let's use categorify as the base for the script.
The request should be either a POST or a GET to:
https://categorify.org/api?website=google.com
https://categorify.org/api?url=https://google.com/&method=GET

The result would be some kind of JSON with a couple of objects, like rating below.
When the rating contains "nudity": true (the sample below shows false), I can call:
addACL("categorify-adult");

{
  "url": "https://google.com/",
  "method": "GET",
  "rating": {
    "language": false,
    "violence": false,
    "nudity": false,
    "adult": false,
    "value": "PG",
    "description": "Safe for all audiences."
  },
  "category": [
    "Search Engine",
    "Clean Browsing"
  ],
  "keyword_heatmap": {
    "google": 276,
    "information": 124,
    "services": 123,
    "example": 77,
    "learn": 71,
    "privacy": 62,
    "account": 53,
    "store": 37,
    "play": 34,
    "search": 33,
    "policy": 32,
    "started": 31,
    "content": 31,
    "personal": 30,
    "including": 27,
    "service": 26,
    "data": 24,
    "using": 23,
    "collect": 23,
    "share": 20
  }
}

Thanks,
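
For illustration, the mapping from such a response to ACL names could look like the sketch below. It assumes the JSON shape shown above. "categorify-adult" is the name used in this comment; "categorify-violence" and the function name are hypothetical.

```python
import json


def acls_from_rating(response_text):
    """Map the boolean rating flags of the sample response to ACL names."""
    rating = json.loads(response_text).get("rating", {})
    acls = set()
    if rating.get("nudity") or rating.get("adult"):
        acls.add("categorify-adult")
    if rating.get("violence"):
        acls.add("categorify-violence")  # hypothetical ACL name
    return acls
```

For the google.com sample above, where all flags are false, this returns an empty set, so no ACL would be added.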

from redwood.

elico commented on August 16, 2024

@andybalholm I am starting to test this in production tonight.
I will update you as soon as I have enough results compared to Fortigate and Checkpoint devices.

from redwood.

elico commented on August 16, 2024

@andybalholm It seems that the code works great.
I will try to write a test service in Ruby to verify which API to test against.
If it's possible to pay you some amount of money, let me know.
I don't have the site of your appliance.

from redwood.

andybalholm commented on August 16, 2024

So are you using scripting, or external-classifier?

I developed Redwood for Compass Foundation (https://compassfoundation.io/).

from redwood.

elico commented on August 16, 2024

@andybalholm I don't know how to write a script that behaves like external-classifier.
I am working on a new setup, so I am planning the... right solution.
Reloading every time a change happens is fine only up to a certain file size.
So far I have created prototypes with: Squid, your Redwood, Redwood with the external classifier, and Redwood with the external classifier plus an ssl-bump external classifier.
They all work.

If there is a way to implement external-classifier in a script, that would be better, because I don't want to patch Redwood.
It is really good coding work.
I have found the image I was looking for:
https://compassfoundation.io/web/image/170990/security-appliance-web.png

However, I have a much more advanced Dell device here, so I am using it.
For now, RedWood is the most advanced filtering proxy I have seen: it can work with HTTP/2 and allows customization.

from redwood.

andybalholm commented on August 16, 2024

You won't have to patch Redwood to use external_classifier long-term. If I know someone is actually going to use it, I'll merge it into master.

from redwood.

elico commented on August 16, 2024

@andybalholm I'm currently using it with the following setup:
https://github.com/elico/yt-classification-service-example

It's a bit more than what is now in the external_classification branch.
I do both URL classification and a TLS/SSL bump decision based on the destination domain.

from redwood.

andybalholm commented on August 16, 2024

I'm pretty sure that what you're doing with external-connect-acl can be done with external-classifier.

from redwood.

elico commented on August 16, 2024

@andybalholm I have verified that it's not possible without patching some sections above and/or below the point where the external-connect-acl hook is.
I can try to verify this again later on.

from redwood.

elico commented on August 16, 2024

@andybalholm Let me test it in production over the next few days, so we will know how it works instead of believing it works.

from redwood.

mrbluecoat commented on August 16, 2024

FYI, categorify is limited to 200 queries per day

from redwood.

elico commented on August 16, 2024

@andybalholm For now it works great in production.
The basic rule of thumb is to bump everything and use exclusions.
I can add YouTube IDs to the DB as I please, and even under load it works better than expected.

from redwood.
