We have to find a compromise between stability, performance and features. Stability is the most important requirement, because the proxy server must run for a long time without crashing. This requires a particular style of programming: great attention will be paid to testing, and all exceptions must be caught.
The performance requirements are modest. Shweby must be able to handle the load of 20 PC clients, as in a classroom. Assuming that each client requests a web page every 10 seconds and that a web page contains 10 components, the generated load is 20 requests per second.
There is also an ICAP-enabled version of Squid, Squid ICAP, but our tests of this version show that it is not stable enough to be used in production. The developers of POESIA, an open source web filtering project, are experimenting with and debugging Squid ICAP. Nevertheless, maintaining this software is quite hard.
The reason is that Squid is a caching proxy, and Squid ICAP has kept the cache feature. It is designed with four processing points: request modification before the cache (reqmod precache), request modification after the cache (reqmod postcache), response modification before the cache (respmod precache), and response modification after the cache (respmod postcache).
Squid is designed as a single-threaded process. It uses one big poll() in order to detect I/O events on the open sockets to clients and servers. When data is available to be read on a socket, the socket is read and the callback function responsible for that socket is called.
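The following minimal sketch illustrates this event-driven style; it is not Squid's actual code, and the callback table and all names are invented for the example:

```c
/* Minimal sketch of a single-threaded poll() event loop with per-socket
 * callbacks, in the style described above.  Not Squid's actual code:
 * the callback table and all names here are invented for illustration. */
#include <poll.h>

#define MAX_FDS 1024

typedef void (*read_handler)(int fd);     /* called when fd is readable */

static struct pollfd fds[MAX_FDS];        /* sockets to clients and servers */
static read_handler  handlers[MAX_FDS];   /* one registered callback per slot */
static int           nfds;

void event_loop(void)
{
    for (;;) {
        /* One big poll over every open socket. */
        int ready = poll(fds, nfds, -1);
        if (ready < 0)
            continue;                     /* interrupted; retry */

        for (int i = 0; i < nfds; i++) {
            if (fds[i].revents & POLLIN)
                handlers[i](fds[i].fd);   /* data available: run the callback */
        }
    }
}
```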
A second way to code the proxy is to use threads. Each client connection starts one independent thread which handles all the I/O traffic requested by that client. This approach is simpler than Squid's callbacks, although it needs more care with concurrent access to shared data.
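A minimal sketch of this model, using POSIX threads (the names are invented for illustration; this is not Middleman's code):

```c
/* Minimal sketch of the thread-per-connection model, using POSIX threads.
 * Names are invented for illustration; this is not Middleman's code. */
#include <pthread.h>
#include <sys/socket.h>
#include <unistd.h>

/* Handles all the client<->server traffic for one connection. */
static void *handle_client(void *arg)
{
    int client_fd = (int)(long)arg;

    /* ... read the request, contact the origin server, relay the reply ...
     * Any data shared between threads (counters, configuration, ...) must
     * be protected, e.g. with a pthread mutex. */

    close(client_fd);
    return NULL;
}

void accept_loop(int listen_fd)
{
    for (;;) {
        int client_fd = accept(listen_fd, NULL, NULL);
        if (client_fd < 0)
            continue;

        pthread_t tid;
        if (pthread_create(&tid, NULL, handle_client,
                           (void *)(long)client_fd) == 0)
            pthread_detach(tid);          /* no join needed: thread cleans up */
        else
            close(client_fd);
    }
}
```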
We found a multithreaded HTTP proxy suitable to be adapted for ICAP support: Middleman. Middleman is an advanced HTTP/1.1 proxy server with features designed to increase privacy and remove unwanted content. It was written in C by Jason Mclaughlin and distributed under the terms of the GPL licence. As this proxy has reached the production/stable development level, we decided to patch it for ICAP support.
Connection type | Protocols | Description |
---|---|---|
Non persistent | HTTP/1.0 (default behaviour) | The TCP connection is closed after the end of the reply body |
Persistent, without pipelining | HTTP/1.0 with "Connection: keep-alive", HTTP/1.1 (default behaviour) | Many URLs can be fetched over the same TCP connection. Requests are sent one by one, each after the reception of the previous reply |
Persistent, with pipelining | HTTP/1.1 | Many URLs can be fetched over the same TCP connection. Requests are sent together, and the replies are returned together. |
The ICAP protocol supports persistent connections, but without pipelining; pipelining is not mentioned in the ICAP RFC. Therefore we will not implement pipelining in Shweby.
In the case of a POST, the client "posts" some data to the server, and the server needs to know the length of the request body. Since the client cannot simply close the connection to signal the end of the body (it still needs to receive the reply), it must either send a "Content-Length" header or use chunked transfer encoding (announced by the "Transfer-Encoding" header). The server must do the same for its reply if the connection is persistent. Note that chunking can only be used in HTTP/1.1.
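As an illustration of these rules, here is a simplified sketch of the receiver-side logic for deciding how a message body is delimited (invented names, not the actual Shweby code):

```c
/* Sketch of how a receiver decides how an HTTP message body is delimited,
 * based on the rules above.  Simplified illustration, not Shweby's code. */
#include <strings.h>            /* strcasecmp */

enum framing {
    FRAMING_NONE,               /* no body at all                              */
    FRAMING_CHUNKED,            /* Transfer-Encoding: chunked (HTTP/1.1 only)  */
    FRAMING_CONTENT_LENGTH,     /* read exactly Content-Length bytes           */
    FRAMING_UNTIL_CLOSE         /* reply body ends when the connection closes  */
};

/* transfer_encoding / content_length are the header values, or NULL if
 * the corresponding header is absent. */
enum framing body_framing(const char *transfer_encoding,
                          const char *content_length,
                          int is_request)
{
    if (transfer_encoding && strcasecmp(transfer_encoding, "chunked") == 0)
        return FRAMING_CHUNKED;
    if (content_length)
        return FRAMING_CONTENT_LENGTH;
    /* A request body cannot be delimited by closing the connection;
     * a reply body can (non-persistent connection). */
    return is_request ? FRAMING_NONE : FRAMING_UNTIL_CLOSE;
}
```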
In order to validate the HTTP communication, we have developed a series of tests covering all valid combinations of protocol version, method, Connection header and chunking usage. These combinations are listed in the table below ("CL" indicates that a Content-Length header is sent, and "Connection (S&C)" gives the value of the Connection header sent by both server and client):
# | method | version | comment | Connection (S&C) | client CL | server CL | client chunked | server chunked |
---|---|---|---|---|---|---|---|---|
0 | GET | 1.0 | closed by default | | 0 | 0 | 0 | 0 |
1 | GET | 1.0 | closed by header | close | 0 | 0 | 0 | 0 |
2 | GET | 1.0 | closed with useless CL | close | 0 | 1 | 0 | 0 |
3 | GET | 1.0 | keep-alive | keep-alive | 0 | 1 | 0 | 0 |
4 | GET | 1.1 | keep-alive by default | | 0 | 1 | 0 | 0 |
5 | GET | 1.1 | closed | close | 0 | 0 | 0 | 0 |
6 | GET | 1.1 | closed with useless CL | close | 0 | 1 | 0 | 0 |
7 | GET | 1.1 | keep-alive by header | keep-alive | 0 | 1 | 0 | 0 |
8 | GET | 1.1 | chunked, keep-alive by default | | 0 | 0 | 0 | 1 |
9 | GET | 1.1 | chunked, closed | close | 0 | 0 | 0 | 1 |
10 | GET | 1.1 | chunked, keep-alive by header | keep-alive | 0 | 0 | 0 | 1 |
11 | POST | 1.0 | closed by default | | 1 | 0 | 0 | 0 |
12 | POST | 1.0 | closed by header | close | 1 | 0 | 0 | 0 |
13 | POST | 1.0 | closed with useless CL | close | 1 | 1 | 0 | 0 |
14 | POST | 1.0 | keep-alive | keep-alive | 1 | 1 | 0 | 0 |
15 | POST | 1.1 | keep-alive by default | | 1 | 1 | 0 | 0 |
16 | POST | 1.1 | closed | close | 1 | 0 | 0 | 0 |
17 | POST | 1.1 | closed with useless CL | close | 1 | 1 | 0 | 0 |
18 | POST | 1.1 | keep-alive by header | keep-alive | 1 | 1 | 0 | 0 |
19 | POST | 1.1 | chunked, keep-alive by default | | 0 | 0 | 1 | 1 |
20 | POST | 1.1 | chunked, closed | close | 0 | 0 | 1 | 1 |
21 | POST | 1.1 | chunked, keep-alive by header | keep-alive | 0 | 0 | 1 | 1 |
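To show how such a matrix can be driven programmatically, here is a minimal sketch that encodes a few rows of the table as C data; the struct and field names are hypothetical and this is not the actual Shweby test harness:

```c
/* Hypothetical encoding of a few rows of the test matrix above.
 * Illustrative sketch only, not the actual Shweby test harness. */
#include <stdio.h>

struct http_test {
    int         num;            /* test number (column "#")           */
    const char *method;         /* GET or POST                        */
    const char *version;        /* "1.0" or "1.1"                     */
    const char *connection;     /* Connection header value, or NULL   */
    int         client_cl;      /* client sends Content-Length        */
    int         server_cl;      /* server sends Content-Length        */
    int         client_chunked; /* client uses chunked encoding       */
    int         server_chunked; /* server uses chunked encoding       */
};

static const struct http_test tests[] = {
    {  0, "GET",  "1.0", NULL,         0, 0, 0, 0 },
    {  3, "GET",  "1.0", "keep-alive", 0, 1, 0, 0 },
    {  8, "GET",  "1.1", NULL,         0, 0, 0, 1 },
    { 14, "POST", "1.0", "keep-alive", 1, 1, 0, 0 },
    { 21, "POST", "1.1", "keep-alive", 0, 0, 1, 1 },
};

int main(void)
{
    for (size_t i = 0; i < sizeof tests / sizeof tests[0]; i++)
        printf("test %d: %s HTTP/%s\n",
               tests[i].num, tests[i].method, tests[i].version);
    return 0;
}
```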
If we add 100 to the test number, the data is sent in small fragments of a few bytes (typically 5 or 10 bytes), with a delay of 100 ms between fragments. The receiver (i.e. the proxy) therefore reads small pieces of data and must collect them before it can process them; for example, the proxy must gather the complete HTTP headers before it can know whether the connection is persistent.
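To cope with this, the proxy has to buffer the received bytes until the empty line that terminates the HTTP headers has been seen. A minimal sketch of such buffering on a blocking socket (invented names, not the actual Shweby code):

```c
/* Sketch of buffering fragmented input until the complete HTTP headers
 * (terminated by an empty line, "\r\n\r\n") have been received.
 * Invented names; not the actual Shweby code. */
#include <string.h>
#include <unistd.h>

#define HDR_MAX 8192

/* Returns 1 when the full header block is in buf; returns -1 on error,
 * on oversized headers, or if the peer closed the connection early.
 * *used is the number of bytes already accumulated in buf. */
int read_headers(int fd, char *buf, size_t *used)
{
    for (;;) {
        if (*used >= HDR_MAX - 1)
            return -1;                          /* headers too large */

        ssize_t n = read(fd, buf + *used, HDR_MAX - 1 - *used);
        if (n <= 0)
            return -1;                          /* error or connection closed */

        *used += (size_t)n;
        buf[*used] = '\0';

        if (strstr(buf, "\r\n\r\n"))            /* empty line: headers complete */
            return 1;
        /* otherwise: only a fragment arrived, keep reading */
    }
}
```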