Shweby User Manual

Version 0.9.2

Riadh Elloumi <riadh@melix.net>
Fares Triki <triki@enst.fr>
Based on Middleman by Jason McLaughlin
http://shweby.sourceforge.net


Introduction

Shweby is a caching proxy server that supports the ICAP protocol. ICAP is the Internet Content Adaptation Protocol, designed by the IETF to modify HTTP requests and responses. Shweby implements one or more ICAP clients, which contact ICAP servers and ask them to adapt a request or a response.

Shweby is based on Middleman, a proxy server written by Jason McLaughlin and hosted at http://www.sourceforge.net/projects/middle-man. Middleman is an advanced HTTP/1.1 proxy server with features designed to increase privacy and remove unwanted content. These features are disabled in Shweby, because the proxy does not perform the adaptation itself but asks an ICAP server to do it. Shweby does, however, inherit gzip encoding and an intuitive Web interface for configuring the proxy from Middleman. Shweby also supports persistent connections to ICAP and HTTP servers.

Installation
Installing Shweby should be straightforward. After extracting the archive, type "./configure && make". If you're using a BSD operating system you will need to use "gmake" rather than "make"; if that's unavailable, as a last resort you can use BSD's make and then run the "gcc -o shweby *.o -pthread -lz" command afterwards. Several compile-time options are available for the configure script; type "./configure --help" to see a complete list.

Running

If you wish to have the proxy server loaded at boot time, there is a script in the "scripts" directory called shweby.init to assist you with that. Edit the paths at the top, then copy it to the "/etc/rc.#" directory, where # is your current runlevel (if you're unsure what it is, use the "runlevel" command). You may need to rename the script: on a Debian-based distribution the naming scheme for init.d scripts has the form "S##program", where ## is the order in which the script is loaded and "program" is the program's name.

There are several command line options you may use when loading the proxy server; at the very least you will need to use the -c option followed by the path to the configuration file. The -p option can be used to have Shweby check (and create) a file containing the PID of the proxy server; this can be used to prevent multiple instances of the proxy server from running concurrently. The -l option specifies the path to the logfile if the --enable-syslog option wasn't used during compilation, and -d specifies the level of detail which should be logged; use -h for a complete list of loglevels.

There is a sample configuration file called icap.xml. It is configured for iserver10, an ICAP server developed by Network Appliance. You can configure iserver10 to do some adaptation; please see the README file in the iserver10 package. We haven't tested Shweby with other ICAP servers.

Using

Once the proxy server is running, you'll then need to configure your web browser to use it.

If you're using Mozilla, open up the edit menu and click on preferences. Expand the "advanced" options then click on "Proxies". Click on the "Manual proxy configuration" radio button then fill in the HTTP and HTTPS fields with the IP address and port of the proxy; if you're using the default configuration "icap.xml", the port will be 4000 with ICAP adaptation and 4001 without ICAP adaptation. There is no ICAP adaptation for HTTPS URLs, because the requests and responses are encrypted.

If you're using Konqueror, open up the settings menu and click on "Configure Konqueror". Click on the icon labeled "Proxy" in the left pane, click the "Use proxy" checkbox and then the "Manual proxy configuration" checkbox. Click on setup to the right of that then fill in the HTTP and HTTPS fields with the IP address and port of the proxy.


Configuration

Most of the configuration is made easy by the Web interface; however, it may be necessary to manually edit the configuration file to change network settings if the default is unusable on your configuration. The snippet of XML below shows what the configuration section looks like:

<network>
    <listen>
        <ip>127.0.0.1</ip>
        <port>8080</port>
    </listen>
</network>

Each <listen> section inside the <network> section has an <ip> and <port> option, which should contain after them the IP address and port number to listen on, respectively. You may leave out the <ip> option to have Shweby listen on all interfaces. Shweby, by default, can listen on up to 20 ports at a time.
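
As an illustration only, the following Python sketch reads a <network> section like the one above and lists the addresses the proxy would listen on. The second <listen> entry (port 4001, no <ip>) is a hypothetical addition to show the "all interfaces" case; this is not how Shweby itself parses its configuration.

```python
import xml.etree.ElementTree as ET

# The <network> snippet from this manual, plus a second, hypothetical
# <listen> entry without <ip> to show the "all interfaces" case.
CONFIG = """
<network>
    <listen>
        <ip>127.0.0.1</ip>
        <port>8080</port>
    </listen>
    <listen>
        <port>4001</port>
    </listen>
</network>
"""

def listeners(xml_text):
    """Return a list of (ip, port) pairs; ip is None when <ip> is omitted."""
    root = ET.fromstring(xml_text)
    result = []
    for listen in root.iter("listen"):
        ip = listen.findtext("ip")            # None if the tag is absent
        port = int(listen.findtext("port"))
        result.append((ip.strip() if ip else None, port))
    return result

for ip, port in listeners(CONFIG):
    print(f"{ip or 'all interfaces'}:{port}")
```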

As mentioned above, all other configuration settings can be modified through the Web interface. To access this, while using the proxy load "http://shweby" in your browser; when not using the proxy, the Web interface is accessible by making a regular HTTP request for /shweby to the proxy's IP address and port.
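
A minimal Python sketch of the second access method (a regular HTTP request for /shweby sent directly to the proxy) is shown here. The address 127.0.0.1 and port 4000 are assumptions taken from the default icap.xml setup mentioned earlier; adjust them to your own listen settings. The fetch function is only defined, not called, so nothing is contacted until you call it yourself.

```python
import urllib.request

def web_interface_url(proxy_host, proxy_port):
    """URL of the Web interface when NOT browsing through the proxy."""
    return f"http://{proxy_host}:{proxy_port}/shweby"

def fetch_web_interface(proxy_host, proxy_port, timeout=5):
    """Fetch the Web interface front page directly from the proxy."""
    # An empty ProxyHandler disables any system-wide proxy settings, so the
    # request goes straight to the proxy's own address as a regular request.
    opener = urllib.request.build_opener(urllib.request.ProxyHandler({}))
    url = web_interface_url(proxy_host, proxy_port)
    with opener.open(url, timeout=timeout) as resp:
        return resp.status, resp.read()

# Assumed address and port; see the <listen> entries in your configuration.
print(web_interface_url("127.0.0.1", 4000))
```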

Once you've loaded the Web interface, you will see a page with several links available at the top.

The "Active connections" link will display a page showing all connections currently being handled by the proxy.

The "DNS cache" link is for debugging purposes only, and will display entries in the DNS cache.

The "Show headers" link will bring you to a page showing all the HTTP headers your browser sends.

The "Save settings" link will bring you to a page with a Filename dialog where you can save all current settings; by default it will be filled with the path to the configuration file given when the proxy server was loaded. Please don't use this feature: it has not yet been extended to support ICAP, and it will erase the ICAP section from your configuration file.

The "Load settings" link will also bring you to a page with a Filename dialog, as well as an "Overwrite" option. The overwrite option can be used to select whether the settings contained in the configuration file will overwrite all current settings or simply be added to them. This feature is not tested with Shweby.

The "View log entries" link will bring you to a page showing recent entries made to the logfile, and will allow you to search through them using regular expressions. The log buffer can also be cleared from here, and its size can be adjusted. The level of logging detail available through the Web interface is unaffected by the options given on the command line, and always includes all log entries with the exception of debug messages.

The "View cache entries" link will bring you to a page showing cached files, and give you the option to search through and selectively delete them.

The "Connection pool" link will bring you to a page showing connections currently being held open in the connection pool awaiting reuse.

The "Config" link will bring you to a page where all configuration settings can be accessed. On the main page you will see a dialog with a drop down list containing the name of each section, as well as a table with a list of each section and an enable/disable radio button beside it; this can be used to quickly enable/disable a feature if it's causing problems with a website.

When you select an item in the drop down list and click on the submit button, you will be brought to a page containing a dialog at the top as well as a list of entries for that section below. The dialog at the top will always contain an "Add" link, which can be used to add an additional entry to the section, and in some cases will have several other options which will be explained below. Each entry at the bottom has an "Edit", "Delete", "Up", "Down", "Top", and "Bottom" link. The "Edit" link will bring you to a dialog where you can edit that specific entry; the "Delete" link will remove it from the section. The "Up" and "Down" links allow you to change the order of the entries; this is important in cases where more than one entry can match the same thing. The "Top" and "Bottom" links can be used to move the entry to the very top or bottom of the list.

All entries for all sections have an "Enabled" option which allows you to disable a specific entry, a "Comment" option to describe the purpose of the entry, and a "Profiles" option.

The "Profiles" option can be used to have separate configuration settings for different users; a comma separated list is used to specify each configuration profile that entry belongs to, and that entry will only be enabled for users in one of those profiles. The "Profiles" option in Access entries is used to specify which configuration profiles are enabled for connections matching that entry. If any profiles are enabled for a connection, entries with no profiles are disabled; conversely, entries that aren't in any profiles are disabled for connections that are.

Several sections follow an allow/deny/policy model; for these sections, each entry has an action option which will specify what happens when it is found to match. If no matching entry is found, the action the policy is set to will be taken. It is important to remember that all entries with an action opposite to the policy are searched first, and if nothing is found the entries with an action the same as the policy are not searched. So, for example, if the policy for the access section is set to "allow", and no entries with a "deny" action are found matching the connection, none of the entries with an "allow" action are looked at, so any access limitations specified in the allow entry are ignored.
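
The search order described above can be sketched in Python as follows. This is only an illustration of the rule as stated in this manual, not Shweby's actual code: the Entry class, its "matches" callback and the sample addresses are hypothetical, and the sketch stops at the first pass (it does not model what happens to same-action entries after an opposite-action entry has matched).

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

@dataclass
class Entry:
    action: str                      # "allow" or "deny"
    matches: Callable[[str], bool]   # hypothetical matcher for a connection
    comment: str = ""

def opposite(action: str) -> str:
    return "deny" if action == "allow" else "allow"

def decide(policy: str, entries: List[Entry],
           conn: str) -> Tuple[str, Optional[Entry]]:
    """Return (action, entry that decided it, or None if the policy did)."""
    # Only entries whose action is opposite to the policy are searched.
    for entry in entries:
        if entry.action == opposite(policy) and entry.matches(conn):
            return entry.action, entry
    # Nothing matched: the policy applies, and entries with the same action
    # as the policy are never looked at, so their settings are ignored.
    return policy, None

allow_lan = Entry("allow", lambda ip: ip.startswith("10."), "LAN limits")
deny_lab = Entry("deny", lambda ip: ip.startswith("192.168."), "lab network")
entries = [allow_lan, deny_lab]

print(decide("allow", entries, "10.0.0.5"))      # policy decides; allow_lan unused
print(decide("allow", entries, "192.168.1.2")[0])
```

In the first call the "allow" entry matches the address but is never consulted, which is exactly the pitfall the paragraph above warns about.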

The tables below will describe all the options available in each section and the entries within them.

--- Global section ---

Purpose
The global section gives access to configuration options that affect the overall operation of the proxy server.
General subsection
Timeout
The timeout in seconds to wait for a client to make the initial HTTP request.
Keepalive timeout
The timeout in seconds to wait for keepalive requests.
Maximum buffer size
The maximum size in bytes of files that are buffered and processed by the rewrite, keyword, and external features.
Temporary directory
The directory temporary files are stored in.
CONNECT ports
The ports outgoing CONNECT requests are allowed to be made to; each port or port range should be separated by a comma. A port range is a lower and upper port separated by a dash; either may be omitted to allow the lowest or highest possible ports. For example, "-1024, 8888, 6660-6669" will allow CONNECT requests to be made on ports 0 to 1024, 8888, and 6660 to 6669.
Connection pool size
The number of keep-alive connections to HTTP and FTP servers to keep in the connection pool; these connections will be shared between threads.
Connection pool timeout
The time in seconds a connection may remain in the connection pool before being removed.
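
To make the "CONNECT ports" syntax concrete, here is a small Python sketch that parses such a list and tests a port against it. It is an illustration under stated assumptions, not Shweby's parser: the function names are made up, and 65535 is assumed as the "highest possible port" when the upper bound is omitted.

```python
def parse_connect_ports(spec):
    """Parse e.g. "-1024, 8888, 6660-6669" into a list of (low, high) pairs."""
    ranges = []
    for item in spec.split(","):
        item = item.strip()
        if not item:
            continue
        if "-" in item:
            low, high = item.split("-", 1)
            low = int(low) if low.strip() else 0         # "-1024" means 0..1024
            high = int(high) if high.strip() else 65535  # "1024-" means 1024..65535
        else:
            low = high = int(item)                       # a single port
        ranges.append((low, high))
    return ranges

def port_allowed(port, ranges):
    """True if the port falls inside any of the parsed ranges."""
    return any(low <= port <= high for low, high in ranges)

ranges = parse_connect_ports("-1024, 8888, 6660-6669")
print(ranges)
print(port_allowed(443, ranges), port_allowed(8080, ranges))
```
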
FTP subsection
Passive mode
Use passive mode for FTP transfers; this is useful if you are behind a firewall that prevents the FTP server from opening a connection to you.
Timeout
The timeout to wait for a response to commands sent to the FTP server.
Anonymous login
The login to use when none is explicitly given in the URL.
Anonymous password
The password to use when none is explicitly given in the URL.
Sort order
The order in which FTP directory listings are sorted.
Sort field
The field by which FTP directory listings are sorted.
Cache subsection
Path
The directory where cached files are stored; if this is unset, only memory will be used to cache files.
Disk cache size
The maximum size in bytes of the disk cache.
Memory cache size
The maximum size in bytes of the memory cache.
Disk free extra
The number of additional bytes to free up when the disk is cleaned, this is useful to prevent the routine that scans the cache directories from being called too often.
Memory free extra
The number of additional bytes to free up when the memory is cleaned.
Minimum age
The minimum age any file must be according to the Last-Modified header before it is cached.
Maximum age
The maximum age of any cached file before it must be revalidated; this overrides any given expiry time.
Revalidate age
The maximum age of any cached file which didn't include any headers that indicate when it should expire before it must be revalidated; if set to 0, all cached files whose expiry time is uncertain will be verified. If no "Last-Modified" header is received to calculate the percent of age freshness, the cached file is always revalidated.
Percent of age freshness
The percentage of time between the date given in the Last-Modified header and the current time a cached file will be considered fresh after downloading.
Minimum file size
The minimum file size in bytes of any cached file.
Maximum file size
The maximum file size in bytes of any cached file; if set to 0, no maximum file size is imposed.
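
The "Percent of age freshness" option can be read as the following calculation, sketched here in Python. This is an interpretation of the description above rather than Shweby's code; the function name and the sample dates are hypothetical, and the sketch ignores the "Minimum age", "Maximum age" and "Revalidate age" limits.

```python
from datetime import datetime, timezone

def heuristic_fresh_until(last_modified, downloaded, percent):
    """Time until which a cached file counts as fresh.

    The file's age at download time (downloaded - Last-Modified) is scaled
    by the configured percentage and added to the download time.
    """
    age_at_download = downloaded - last_modified
    return downloaded + age_at_download * percent / 100

last_modified = datetime(2024, 1, 1, tzinfo=timezone.utc)
downloaded = datetime(2024, 1, 11, tzinfo=timezone.utc)   # file was 10 days old

# With 10 percent, a 10-day-old file stays fresh for 1 day after download.
print(heuristic_fresh_until(last_modified, downloaded, 10))
```
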
DNSBL subsection
Template
The template to send when a domain is found to be blocked.
Domain
The DNSBL domain that the domain being checked is prepended to; e.g. in.dnsbl.org will cause a lookup for bad.com.in.dnsbl.org to be made when a page from the bad.com domain is requested.
Blocked IP addresses
A comma separated list of IP addresses which, when returned by the DNS lookup, will cause the page to be blocked.
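
The DNSBL check described above can be sketched in Python as two small steps: building the name to look up, and comparing the returned addresses against the "Blocked IP addresses" list. The function names and the sample addresses 127.0.0.2 and 127.0.0.3 are hypothetical; the DNS lookup itself is left out so the sketch runs without network access (in practice it could be done with socket.gethostbyname_ex).

```python
def dnsbl_query_name(host, dnsbl_domain):
    """Prepend the requested host to the DNSBL domain."""
    return f"{host.rstrip('.')}.{dnsbl_domain.lstrip('.')}"

def is_blocked(resolved_ips, blocked_ips):
    """True if any address returned by the lookup is in the blocked list.

    resolved_ips: addresses returned by the DNS lookup (empty if none).
    blocked_ips:  the comma separated "Blocked IP addresses" setting.
    """
    blocked = {ip.strip() for ip in blocked_ips.split(",") if ip.strip()}
    return any(ip in blocked for ip in resolved_ips)

print(dnsbl_query_name("bad.com", "in.dnsbl.org"))
print(is_blocked(["127.0.0.2"], "127.0.0.2, 127.0.0.3"))
```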

--- Network section ---

Purpose
The network section is used to configure general network settings. The configuration file must be saved and the proxy server has to be restarted before any changes take effect.
Entry options
IP
The IP address of the interface to bind to; leave empty to have the proxy listen on all interfaces.
Port
The port number to listen on.

--- ICAP section ---

Purpose
The ICAP section is used to configure ICAP clients in the proxy.
Service subsection
This is the configuration subsection for one ICAP client. It uses the following entries:

id
The identifier of the ICAP service. It is used to declare ICAP services in ICAP classes.
method
This refers to the ICAP method and the processing point. There are four processing points for content modification:
  1. reqmod_precache: adaptation is done for requests "on their way to the cache". This is needed for blocking harmful content identified by the request URL.
  2. reqmod_postcache: adaptation is done for requests "on their way to the Internet". Example: removing cookies from the request.
  3. respmod_precache: adaptation is done for responses "on their way to the cache". Example: scanning for viruses.
  4. respmod_postcache: adaptation is done for responses "on their way to the client". Example: filtering harmful content.

host
The host where the ICAP server is running
port
The port of the ICAP server.
path
The path of the ICAP service in the ICAP server.
keepalive
Set to "enable" if you want the connection between Shweby and the ICAP server to stay alive; this improves performance. The default is "enable".
shortcut
Set to "enable" if you want Shweby to forward the request or the response unchanged when the ICAP modification fails. If set to "disable", your browser will receive an error message when the ICAP adaptation fails.
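
To show how the host, port and path entries fit together, the Python sketch below builds the ICAP URI for a service and wraps a body-less HTTP request in an ICAP REQMOD message, following the wire format of the ICAP specification (RFC 3507). It is an illustration, not Shweby's implementation: the host icap.example.net and the path "reqmod" are made-up values, and 1344 is the port commonly registered for ICAP.

```python
def icap_uri(host, port, path):
    """Combine the service's host, port and path entries into an ICAP URI."""
    return f"icap://{host}:{port}/{path.lstrip('/')}"

def build_reqmod(host, port, path, http_request_headers):
    """Encapsulate body-less HTTP request headers in an ICAP REQMOD message."""
    encapsulated = http_request_headers.encode("ascii")
    lines = [
        f"REQMOD {icap_uri(host, port, path)} ICAP/1.0",
        f"Host: {host}",
        # req-hdr starts at offset 0; null-body gives the offset where a
        # body would start, i.e. right after the encapsulated headers.
        f"Encapsulated: req-hdr=0, null-body={len(encapsulated)}",
        "",
        "",
    ]
    return "\r\n".join(lines).encode("ascii") + encapsulated

HTTP_REQUEST = "GET / HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
message = build_reqmod("icap.example.net", 1344, "reqmod", HTTP_REQUEST)
print(message.decode("ascii"))
```
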
Class subsection
An ICAP class is a set of ICAP services designed to perform a set of adaptations. For example, we can define an Adult class and a Kid class, with a filtering service in the latter. You can switch between classes by modifying the class tag in the Network section, or by binding a different port to each class.

Here are the tags of the Class subsection:

id
The identifier of the ICAP class. It is used in the Network section to bind a port to an ICAP class.
service
The id of a service. A class may use many services or none; add one service tag per service. Each service must be described in the "Service" subsection.

--- Access section ---

Purpose
The access feature is used to control who can access the proxy server, and to what extent.
Global options
Policy
Default action to take when no matching entry is found.
Entry options
Class
The ICAP class to use for this access list; ICAP classes are defined in the ICAP section.
IP Address
A regular expression matching the IP addresses this entry applies to; leaving it blank will cause the entry to match everything. Several IP address regular expressions may be given; add one IP tag for each.
IP Range
A range of IP addresses (the start IP, a dash, then the end IP). There can be several IP Range tags.
Port
The port used for this access list. Shweby must listen on this port (see the Network section). There can be several Port tags.
Username
If this field is not empty, clients matching this entry will be required to authenticate with the proxy server. There can be more than one entry matching the same IP address, in which case the one matching the username/password sent by the browser is used.
Password
The client's password if the username field is used.
Access
A list of features connections matching this entry are allowed to access, the options are:
Web interface - Access to all of the web interface (access to /shweby/template/<template name> is always allowed regardless of this)
Proxy requests - Allowed to make regular proxy requests
CONNECT Requests - Allowed to make CONNECT requests
Transparent proxying - Allowed to make transparent proxy requests (must be allowed to make HTTP requests as well)
HTTP Requests - Allowed to make regular HTTP requests to proxy (for Web interface and redirected requests)
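
The two ways of selecting clients in an Access entry ("IP Address" as a regular expression, "IP Range" as start and end addresses joined by a dash) can be sketched in Python as shown here. The function names and sample addresses are hypothetical, and whether Shweby anchors the regular expression or searches anywhere in the address is an assumption (the sketch searches anywhere).

```python
import ipaddress
import re

def ip_in_range(client_ip, ip_range):
    """True if client_ip lies within "start-end" (both ends inclusive)."""
    start, end = (ipaddress.ip_address(part.strip())
                  for part in ip_range.split("-", 1))
    return start <= ipaddress.ip_address(client_ip) <= end

def ip_matches_regex(client_ip, pattern):
    """True if the pattern matches; a blank pattern matches everything."""
    return pattern == "" or re.search(pattern, client_ip) is not None

print(ip_in_range("192.168.1.50", "192.168.1.10-192.168.1.100"))
print(ip_matches_regex("10.0.0.7", r"^10\.0\.0\."))
```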




--- Templates section ---

Purpose
Templates are used throughout Shweby as a replacement for pages which can't be displayed due to filtering, error, or other conditions.
Global options
Path
Location to look for templates in if no absolute path is given.
Entry options
Name
The name of the template; this is used in other sections to reference it. It may also be one of the following to replace internal error messages:
blocked - Page blocked
nodns - DNS lookup failed
badrequest - Malformed HTTP header from client
badresponse - Malformed HTTP header from server
nofile - File not found
nocache - Cache file not found when browsing in offline mode
noconnect - Connection failed
noaccess - Access denied
badprotocol - Protocol not implemented
badauth - Authorization failed (when forwarding through SOCKS4)

There are 3 built-in templates that can be used: tinygif (a 1x1 transparent gif image), checkedgif (a 4x4 grey and transparent checkered pattern), and tinyswf (an empty flash animation).

You can override the content sent by a website for certain response codes by making a template with a numerical name the same as the response code.


There are several variables that can be used in templates which will be replaced with information about the request currently being handled, they are:
$HTTP_METHOD - Method used to request file
$HTTP_HOST - Host HTTP request was made to.
$HTTP_FILE - File HTTP request was made for.
$HTTP_PORT - Port HTTP request was made to.
$IP - IP address of client making request.

Templates can be accessed directly by loading "http://shweby/template/<template name>".

File
The filename of the template
Mimetype
The MIME-type of the template. When using an executable, this can be set to STDIN to have the MIME-type extracted from a "Content-type" header sent by the program; this will be explained in greater depth below.
Response code
The response code to use when sending the template; leave blank to use the internal default.
Type
Template type, either File or Executable. If Executable is chosen, the file is executed and whatever it writes to STDOUT is sent as the template. Several environment variables are set for the executable to use; they will be explained further below in the external section.
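
The effect of the template variables listed under "Name" ($HTTP_METHOD, $HTTP_HOST, $HTTP_FILE, $HTTP_PORT and $IP) can be previewed with a short Python sketch. It uses Python's string.Template as a stand-in for Shweby's own substitution, which may differ in detail; the template text and the request values are made up for the example.

```python
from string import Template

def render_template(text, method, host, file, port, ip):
    """Replace the documented template variables with request details."""
    values = {
        "HTTP_METHOD": method,
        "HTTP_HOST": host,
        "HTTP_FILE": file,
        "HTTP_PORT": str(port),
        "IP": ip,
    }
    # safe_substitute leaves any unknown $NAME untouched instead of failing.
    return Template(text).safe_substitute(values)

page = "<p>$HTTP_METHOD http://$HTTP_HOST:$HTTP_PORT$HTTP_FILE was blocked for $IP.</p>"
print(render_template(page, "GET", "bad.com", "/ads/banner.gif", 80, "192.168.1.50"))
```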

--- Forward section ---

Purpose
The forward feature allows you to selectively forward requests through another proxy or SOCKS4 firewall based on their URL.
Entry options
Host
A regular expression matching the hosts you wish to have requests forwarded for; leave blank to match everything.
File
A regular expression matching the files you wish to have requests forwarded for; leave blank to match everything.
Proxy
The hostname or IP address of the proxy to forward through; if this is left blank, and the host or file options aren't, no action will be taken for requests matching the host and file.
Username
The username to use if the proxy requires authentication.
Password
The password to use if the proxy requires authentication.
Domain
The NT domain when using the NTLM authentication protocol.
Port
The port number of the proxy to forward through.
Type
What type of proxy to forward through; can be HTTP or SOCKS4
Applies to
What type of requests are forwarded; can be HTTP and/or CONNECT (HTTPS)


Transparent proxying


Shweby can be used to transparently proxy requests; to make use of this feature, you will need firewall software capable of forwarding connections. Configure the firewall to forward connections destined for port 80 to the proxy server; the proxy server will look at the Host header sent by the browser and use that to determine what host the request was originally intended for. This feature may not work for all browsers: sending the Host header is only required for HTTP 1.1, although most HTTP 1.0 clients send it anyway.

If you're using iptables under Linux, the following command should do the job (replace the interface and port to match your setup):
iptables -t nat -A PREROUTING -i eth0 -p tcp --dport 80 -j REDIRECT --to-port 4000
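
The role of the Host header in transparent proxying can be illustrated with a short Python sketch that recovers the intended destination from a redirected request. This is not Shweby's code: the function name is hypothetical, the default port of 80 is an assumption matching the redirect rule above, and IPv6 literals in the Host header are not handled.

```python
def original_destination(raw_request, default_port=80):
    """Return (host, port) from the Host header, or None if it is missing."""
    head = raw_request.split(b"\r\n\r\n", 1)[0].decode("iso-8859-1")
    for line in head.split("\r\n")[1:]:          # skip the request line
        name, sep, value = line.partition(":")
        if sep and name.strip().lower() == "host":
            host, _, port = value.strip().partition(":")
            return host, int(port) if port else default_port
    return None                                   # e.g. an HTTP 1.0 client

request = b"GET /index.html HTTP/1.1\r\nHost: www.example.com\r\n\r\n"
print(original_destination(request))
```

A request without a Host header yields None, which corresponds to the case where transparent proxying cannot work for that client.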


Frequently asked questions


Q: Some pages show strange numbers throughout the document, and it hangs when loading a page.
A: Shweby is an HTTP 1.1 proxy; some older browsers (such as Netscape 4.x) will not work correctly with the proxy. The only solution is to upgrade your browser.

Notes

- The caching feature relies on a feature currently available only on Linux (the mremap system call); on other operating systems this feature is emulated but with a severe performance penalty.

- Make sure the TZ environment variable is set to your timezone before running Shweby to ensure the cache refresh algorithm works correctly.


Reporting bugs

If you encounter any problems while using Shweby, please contact me. If the problem results in a crash, please follow these steps to help me debug the problem:
1) Run "make clean" in the Shweby directory if you haven't already done so.
2) Recompile Shweby using the --enable-debug option of the configure script.
3) Type "ulimit -c unlimited" in your shell before running the proxy; this will cause Shweby to dump a core file when it crashes.
4) Email me the compiled binary, core file, and configuration file you were using at the time. The last few log entries would also be helpful.

Feature requests

If you have any ideas on how Shweby could be improved, please email us (address at top)... We'll do our best to respond.