The POESIA software is designed to filter Internet content on three channels (web, email and news). In this project, we are interested only in the web channel. The POESIA architecture is composed of several components. All HTTP traffic goes through the HTTP proxy, which asks the POESIA monitor to filter the content. The communication between the proxy and the monitor is based on the ICAP protocol. The monitor communicates with a set of specific filters (Spanish, English and French text filters; language detectors; image filters; JavaScript filters; URL filters...) and with decisors, or decision mechanisms.
POESIA is targeted at situations where groups of computers are used for Internet browsing, such as classrooms, libraries, computer centers and businesses. The number of computers will not exceed 20, because POESIA is installed on a single computer and its CPU resources limit the filtering capacity.
The ICAP protocol is fully specified in RFC 3507 [2]. When the proxy receives an HTTP request or an HTTP response, it encapsulates it in an ICAP request and sends it to the monitor, which plays the role of the ICAP server. The monitor replies with an ICAP message that encapsulates a modified HTTP request or response.
The purpose of the ICAP protocol is content adaptation, but in the POESIA software we use it only for filtering. The monitor will not "modify" the HTTP message carried in the ICAP request; it will only accept or reject it. Here are two examples:
The client requests a pornographic web page, for example www.sex.com. The client browser sends a request to the proxy (step 1):
    GET /index.html HTTP/1.1
    Host: www.sex.com
    User-Agent: Mozilla
    Connection: keep-alive
The proxy will encapsulate the HTTP request in an ICAP request and send it to the monitor (step 2):
    REQMOD icap://monitor.school.com/filter ICAP/1.0
    Host: monitor.school.com
    Encapsulated: req-hdr=0, null-body=68

    GET /index.html HTTP/1.1
    Host: www.sex.com
    User-Agent: Mozilla
The ICAP request begins with REQMOD, which is the ICAP method used. REQMOD means request modification and RESPMOD means response modification. The ICAP-specific header "Encapsulated" indicates the nature of the encapsulated HTTP message (req for requests), whether it has a body (null-body when there is none) and the length of the HTTP headers (68 bytes). We assume that the monitor queries a database of unauthorized web servers and finds the URL (www.sex.com) in it. Thus the ICAP reply will look like this (step 3):
    ICAP/1.0 200 OK
    Date: Mon, 10 Jan 2000 09:55:21 GMT
    Server: POESIA-Monitor/1.0
    Connection: close
    ISTag: "W3E4R7U9-L2E4-2"
    Encapsulated: res-hdr=0, res-body=(...)

    HTTP/1.1 403 Forbidden
    Date: Wed, 08 Nov 2000 16:02:10 GMT
    Server: Apache/1.3.12 (Unix)
    Last-Modified: Thu, 02 Nov 2000 13:51:37 GMT
    ETag: "63600-1989-3a017169"
    Content-Length: 58
    Content-Type: text/html

    3a
    Sorry, you are not allowed to access that naughty content.
    0
According to the ICAP protocol, when the proxy receives an HTTP response in reply to an HTTP request submitted in an ICAP transaction, it delivers that response to the web client (step 4):
    HTTP/1.1 403 Forbidden
    Date: Wed, 08 Nov 2000 16:02:10 GMT
    Server: Apache/1.3.12 (Unix)
    Last-Modified: Thu, 02 Nov 2000 13:51:37 GMT
    ETag: "63600-1989-3a017169"
    Content-Length: 58
    Content-Type: text/html
    Connection: close

    Sorry, you are not allowed to access that naughty content.
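Notice in this example the use of chunked transfer encoding in the ICAP communication: the body encapsulated in the ICAP reply is preceded by its chunk size, 3a, which is hexadecimal for 58, the Content-Length of the error page.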
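The offset in the "Encapsulated" header and the chunk size can be reproduced with a short sketch. The Python code below is an illustration only and is not taken from POESIA or Shweby; the function names build_reqmod and chunk_body are hypothetical. It assumes CRLF line endings, as in HTTP, and an HTTP request without a body.

```python
CRLF = "\r\n"

def build_reqmod(icap_url, icap_host, http_request_lines):
    """Wrap a body-less HTTP request in an ICAP REQMOD message (sketch)."""
    # The encapsulated HTTP header block ends with an empty line.
    http_headers = CRLF.join(http_request_lines) + CRLF + CRLF
    # With no body, null-body is the byte offset just past the HTTP headers.
    offset = len(http_headers.encode("ascii"))
    icap_headers = CRLF.join([
        f"REQMOD {icap_url} ICAP/1.0",
        f"Host: {icap_host}",
        f"Encapsulated: req-hdr=0, null-body={offset}",
    ]) + CRLF + CRLF
    return icap_headers + http_headers

def chunk_body(body: bytes) -> bytes:
    """Encode a body as one chunk followed by the terminating zero chunk."""
    return b"%x\r\n" % len(body) + body + b"\r\n0\r\n\r\n"

# Values taken from the first example.
message = build_reqmod(
    "icap://monitor.school.com/filter",
    "monitor.school.com",
    ["GET /index.html HTTP/1.1", "Host: www.sex.com", "User-Agent: Mozilla"],
)
error_page = b"Sorry, you are not allowed to access that naughty content."
chunked = chunk_body(error_page)
```

With the three header lines of the example, the computed offset is 68, and the 58-byte error page is announced by the chunk size 3a.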
When the web client requests content that cannot be classified as forbidden or allowed from its URL alone, we must download the content before scanning it. In this example, the requested URL is www.unknown-server.com. The client sends its request to the proxy like this (step 1):
    GET /index.html HTTP/1.1
    Host: www.unknown-server.com
    User-Agent: Mozilla
The proxy forwards it to the monitor as in the previous example (step 2). The monitor will not block the request, and will reply like this (step 3):
    ICAP/1.0 200 OK
    Date: Mon, 10 Jan 2000 09:55:21 GMT
    Server: POESIA-Monitor/1.0
    Connection: close
    ISTag: "W3E4R7U9-L2E4-2"
    Encapsulated: req-hdr=0, null-body=(...)

    GET /index.html HTTP/1.1
    Host: www.unknown-server.com
    User-Agent: Mozilla
The monitor hasn't changed anything in the request. In the ICAP protocol, the ICAP server can indicate that it will not change the HTTP request (or the response) by returning a 204 "No Content" response, provided the proxy has allowed this with an "Allow: 204" header in its ICAP request. In this case, the proxy must store the HTTP message in a buffer before asking the ICAP server for an adaptation. We suppose now that the proxy has contacted www.unknown-server.com and received some naughty content. The ICAP transaction will look like this:
From the proxy to the monitor (step 6):
    RESPMOD icap://icap.example.org/satisf ICAP/1.0
    Host: icap.example.org
    Encapsulated: req-hdr=0, res-hdr=(...), res-body=(...)

    GET /origin-resource HTTP/1.1
    Host: www.origin-server.com
    Accept: text/html, text/plain, image/gif
    Accept-Encoding: gzip, compress

    HTTP/1.1 200 OK
    Date: Mon, 10 Jan 2000 09:52:22 GMT
    Server: Apache/1.3.6 (Unix)
    ETag: "63840-1ab7-378d415b"
    Content-Type: text/html

    1d
    Here is some naughty content.
    0
From the monitor to the proxy (step 7):
    ICAP/1.0 200 OK
    Date: Mon, 10 Jan 2000 09:55:21 GMT
    Server: POESIA-Monitor/1.0
    Connection: close
    ISTag: "W3E4R7U9-L2E4-2"
    Encapsulated: res-hdr=0, res-body=(...)

    HTTP/1.1 403 Forbidden
    Date: Wed, 08 Nov 2000 16:02:10 GMT
    Server: Apache/1.3.12 (Unix)
    Last-Modified: Thu, 02 Nov 2000 13:51:37 GMT
    ETag: "63600-1989-3a017169"
    Content-Length: 58
    Content-Type: text/html

    3a
    Sorry, you are not allowed to access that naughty content.
    0
Thus the final response to the client will be (step 8):
    HTTP/1.1 403 Forbidden
    Date: Wed, 08 Nov 2000 16:02:10 GMT
    Server: Apache/1.3.12 (Unix)
    Last-Modified: Thu, 02 Nov 2000 13:51:37 GMT
    ETag: "63600-1989-3a017169"
    Content-Length: 58
    Content-Type: text/html
    Connection: close

    Sorry, you are not allowed to access that naughty content.
If the monitor accepts the content, it will reply with code 200 or 204, and the original content will be delivered to the client.
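The way the proxy could choose the message to forward after an ICAP transaction can be sketched as follows. This is a minimal Python illustration, not the Shweby implementation; the function name message_to_forward and its parameters are hypothetical, and only the two status codes discussed above are handled.

```python
def message_to_forward(icap_status: int, buffered: bytes, encapsulated: bytes) -> bytes:
    """Pick the HTTP message the proxy forwards after an ICAP transaction."""
    if icap_status == 204:
        # "No Content": the monitor changed nothing, so the proxy
        # falls back on the copy it buffered before the ICAP call.
        return buffered
    if icap_status == 200:
        # The ICAP reply carries the message to use: either the unchanged
        # original or a replacement such as the 403 error page.
        return encapsulated
    raise ValueError(f"unexpected ICAP status {icap_status}")
```

The 204 branch is the reason the proxy has to keep its own copy of the HTTP message: in that case nothing usable comes back from the monitor.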
Before we developed the Shweby proxy server, the proxy used in POESIA was Squid-ICAP [4], an ICAP-enabled version of Squid [5]. Although Squid is a popular, stable and scalable HTTP proxy, Squid-ICAP still has some bugs and is hard to maintain. The POESIA project requires a stable ICAP implementation, even if it is not scalable. That is why we developed the Shweby proxy server.
This section describes the requirements that POESIA places on the proxy.
The POESIA software supports several levels of filtering, such as a kid filter, a teenager filter and an adult filter. The filtering level is managed by the system administrator and depends on the client IP address. We must be able to configure the proxy for a suitable filtering level and to change the configuration without restarting the proxy.
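One possible way to map client IP addresses to filtering levels, and to change the mapping at run time, is sketched below in Python. The class FilterLevels, the example networks and the choice of "kid" as the default level are assumptions made for illustration; they are not part of POESIA.

```python
import ipaddress

class FilterLevels:
    """Map client IP addresses to a filtering level (illustrative sketch)."""

    def __init__(self, rules, default="kid"):
        self.reload(rules, default)

    def reload(self, rules, default="kid"):
        # Build the new table first, then swap it in with a single
        # assignment, so no restart of the proxy is needed.
        parsed = [(ipaddress.ip_network(net), level) for net, level in rules]
        # Most specific network first, so a single host can override its subnet.
        parsed.sort(key=lambda item: item[0].prefixlen, reverse=True)
        self._rules, self._default = parsed, default

    def level_for(self, client_ip):
        addr = ipaddress.ip_address(client_ip)
        for network, level in self._rules:
            if addr in network:
                return level
        # Assumption: unknown clients get the default (strictest) level.
        return self._default

levels = FilterLevels([("192.168.1.0/24", "kid"), ("192.168.1.10/32", "adult")])
```

Calling levels.reload(...) with a new rule list changes the filtering level of the affected clients from the next request on.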
Because we serve many computers with Internet content, caching reduces the response time and saves outgoing bandwidth. Content can be cached before or after filtering. In POESIA the filter is called after the cache, so the cache holds unfiltered content. This is safer when the users change: for example, when kids take the place of adults, we simply change the POESIA filter and do not need to empty the cache.
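The ordering of cache and filter can be illustrated with a small Python sketch. The function serve and its parameters (a dictionary used as cache, a fetch callable and an is_allowed callable standing in for the POESIA filter) are hypothetical names chosen for this example.

```python
def serve(url, level, cache, fetch, is_allowed):
    """Return the content for url, or None if the current filter rejects it."""
    if url not in cache:
        cache[url] = fetch(url)  # the cache stores the unfiltered content
    content = cache[url]
    # Filtering happens on every request, after the cache lookup, so a
    # change of filtering level takes effect without emptying the cache.
    return content if is_allowed(content, level) else None

# Toy stand-ins for the origin server and the filter.
fetch_calls = []

def fetch(url):
    fetch_calls.append(url)
    return b"Here is some naughty content."

def is_allowed(content, level):
    return level == "adult" or b"naughty" not in content

cache = {}
adult_view = serve("http://www.unknown-server.com/", "adult", cache, fetch, is_allowed)
kid_view = serve("http://www.unknown-server.com/", "kid", cache, fetch, is_allowed)
```

In this toy run the second request is answered from the cache, yet it is still rejected because the filter runs again with the new level.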
The POESIA monitor requires buffering. It waits for the whole ICAP request before it makes a decision, and the content is either fully accepted or fully rejected. Because HTTP servers may omit the Content-Length header, we cannot know in advance the length of the delivered content, and the monitor buffers could be completely filled. It is therefore better to buffer the content on the server side of the proxy. If the received content exceeds a limit (10 MB), the proxy rejects it and sends an error message to the client browser. This limit is large enough for web browsing, but it prevents the download of large files.
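Bounded buffering of a response whose length is not known in advance can be sketched as follows. This Python example is an illustration under stated assumptions: the limit is read as 10 megabytes, and the names buffer_body, BodyTooLarge and MAX_BODY_BYTES are hypothetical.

```python
import io

MAX_BODY_BYTES = 10 * 1024 * 1024  # assumed reading of the 10 MB limit

class BodyTooLarge(Exception):
    """Raised when the content exceeds the buffering limit."""

def buffer_body(stream, limit=MAX_BODY_BYTES, block_size=64 * 1024):
    """Read a body of unknown length into memory, up to limit bytes."""
    buffered = io.BytesIO()
    total = 0
    while True:
        block = stream.read(block_size)
        if not block:
            break  # end of the content
        total += len(block)
        if total > limit:
            # The proxy would answer the client with an error page here.
            raise BodyTooLarge(f"content exceeds {limit} bytes")
        buffered.write(block)
    return buffered.getvalue()
```

Only a body that fits within the limit is returned in full, and it can then be handed to the monitor in a single ICAP request.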
The stability of the proxy server is an important POESIA requirement. The proxy must run for months without breaking down.
The proxy must protect itself against hacking attacks (e.g. via Telnet). The POESIA end user must not be able to bypass the proxy by changing the browser configuration. This is achieved by transparent proxying: the browser behaves as if it were connected directly to the Internet, but the POESIA gateway forwards all requests to the proxy. Thus the proxy must support transparent requests (or direct requests). The only difference between the two kinds of request is in the first line: a proxy request contains the full URI, whereas a transparent request contains only the path. In both cases the name of the server is provided by the "Host" header.
Example of a proxy request:
    GET http://www.google.com/index.html HTTP/1.1
    Host: www.google.com
    ...
Example of a transparent (direct) request:
    GET /index.html HTTP/1.1
    Host: www.google.com
    ...
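A proxy that accepts both kinds of request has to recover the full URL in either case. The Python sketch below shows one way to do it; the function name target_url is hypothetical, and the code assumes plain HTTP and a "Host" header that is always present.

```python
from urllib.parse import urlsplit

def target_url(request_line, headers):
    """Return the full URL for a proxy request or a transparent request."""
    method, target, version = request_line.split(" ", 2)
    if urlsplit(target).scheme:
        # Proxy request: the first line already holds the full URI.
        return target
    # Transparent request: the first line holds only the path,
    # so the server name is taken from the Host header.
    return "http://" + headers["Host"] + target

proxy_url = target_url("GET http://www.google.com/index.html HTTP/1.1",
                       {"Host": "www.google.com"})
direct_url = target_url("GET /index.html HTTP/1.1",
                        {"Host": "www.google.com"})
```

Both calls yield the same URL, so the rest of the proxy can treat the two kinds of request identically.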
The POESIA software is targeted at a group of 20 PCs. Assuming that each PC user requests a web page every 10 seconds and that each web page includes 10 components, the 20 PCs request 2 pages per second, which amounts to 20 HTTP requests per second. The web proxy must therefore handle 20 requests per second.
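The load estimate can be checked with a few lines of Python. The variable names are chosen for this illustration, and the figures are the assumptions stated above.

```python
clients = 20                 # PCs served by one POESIA installation
seconds_between_pages = 10   # each user requests one page every 10 seconds
components_per_page = 10     # HTTP requests needed to load one page

pages_per_second = clients / seconds_between_pages
requests_per_second = pages_per_second * components_per_page
```

Changing any of the three assumptions scales the result linearly, which makes it easy to re-estimate the load for a different room size.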
POESIA is not designed to scale to a large number of client PCs, so it does not require a scalable web proxy.