9.4. Using a Reverse Proxy

A proxy is an intermediary communication device. The term "proxy" commonly refers to a forward proxy, which is a gateway device that fetches web traffic on behalf of client devices. We are more interested in the opposite type of proxy. Reverse proxies are gateway devices that isolate servers from the Web and accept traffic on their behalf.

There are two reasons to add a reverse proxy to the network: security and performance. The benefits of reverse proxies stem from centralization: with a single point of entry for HTTP traffic, we can monitor and control all of it in one place. Therefore, the larger the network, the greater the benefits. Here are the advantages:

Unified access control: Since all requests come in through the proxy, it is easy to see and control them all. The proxy thus acts as a central point of policy enforcement.
Unified logging: Similar to the previous point, we need to collect logs only from one device instead of devising complex schemes to collect logs from all devices in the network.
Improved performance: Transparent caching, content compression, and SSL termination are easy to implement at the reverse proxy level.
Application isolation: With a reverse proxy in place, it becomes possible (and easy) to examine every HTTP request and response. The proxy becomes a sort of umbrella, which can protect vulnerable web applications.
Host and web server isolation: Your internal network may consist of many different web servers, some of which may be legacy systems that cannot be replaced or fixed when broken. Preventing direct contact with the clients allows the system to remain operational and safe.
Hiding of network topology: The more attackers know about the internal network, the easier it is to break in. The topology is often exposed through a carelessly managed DNS. If a network is guarded by a reverse proxy system, the outside world need not know anything about the internal network. Through the use of private DNS servers and private address space, the network topology can be hidden.

There are some disadvantages as well:

Increased complexity: Adding a reverse proxy requires careful thought and increased effort in system maintenance.
Complicated logging: Since systems are not accessed directly any more, the log files they produce will not contain the real client IP addresses. All requests will look like they are coming from the reverse proxy server. Some systems will offer a way around this, and some won't. Thus, special care should be given to logging on the reverse proxy.
Central point of failure: A central point of failure is unacceptable in mission critical systems. To remove it, a high availability (HA) system is needed. Such systems are expensive and increase the network's complexity.
Processing bottleneck: If a proxy is introduced as a security measure, it may become a processing bottleneck. In such cases, the need for increased security must be weighed against the cost of creating a clustered reverse proxy implementation.
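
As a sketch of one way around the logging problem described above: Apache's mod_proxy adds an X-Forwarded-For header carrying the client address to each request it forwards, so a backend that is itself Apache can record that header in place of the connecting address. The format nickname (proxied) and the log path below are assumptions made for illustration, and the header can be trusted only if the backend accepts connections from the proxy alone.

```apache
# On the backend server, not on the proxy.
# Log the client address reported by the proxy in X-Forwarded-For
# instead of %h, which would always be the proxy's own address.
LogFormat "%{X-Forwarded-For}i %l %u %t \"%r\" %>s %b" proxied
CustomLog logs/access_log proxied
```

Backends that are not Apache may or may not offer an equivalent, which is why the logs kept on the proxy itself remain the authoritative record.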

9.4.1. Apache Reverse Proxy

The use of Apache 2 is recommended in reverse proxy systems. The new version of the mod_proxy module offers better support for standards and conforms to the HTTP/1.1 specification. The Apache 2 architecture also introduces filters, which allow several modules to examine and transform the content (both on the input and the output) as it passes through the server.

The following modules will be needed:

mod_proxy, mod_proxy_http: For basic proxying functionality
mod_headers: Manipulates request and response headers
mod_rewrite: Manipulates the request URI and performs other tricks
mod_proxy_html: Corrects absolute links in the HTML
mod_deflate: Adds content compression
mod_cache, mod_disk_cache, mod_mem_cache: Add content caching
mod_security: Implements HTTP firewalling

You are unlikely to need mod_proxy_connect, which handles the CONNECT method and is required for forward proxy operation only.
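
If the modules were built as shared objects rather than compiled statically into the server, the stock ones from the list above have to be loaded explicitly. The following is a minimal sketch under that assumption. The modules/ path is the conventional default and may differ on your system, and the third-party modules (mod_proxy_html and mod_security) are installed separately and are left out here.

```apache
# Assumes the modules were built as shared objects in modules/
LoadModule proxy_module       modules/mod_proxy.so
LoadModule proxy_http_module  modules/mod_proxy_http.so
LoadModule headers_module     modules/mod_headers.so
LoadModule rewrite_module     modules/mod_rewrite.so
LoadModule deflate_module     modules/mod_deflate.so
LoadModule cache_module       modules/mod_cache.so
LoadModule disk_cache_module  modules/mod_disk_cache.so
LoadModule mem_cache_module   modules/mod_mem_cache.so
```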

9.4.1.1 Setting up the reverse proxy

Compile the web server as usual. Whenever the proxy module is used within a server, turn off the forward proxying operation:

# do not work as forward proxy
ProxyRequests Off

Not turning it off is a frequent error that creates an open proxy out of a web server, allowing anyone to go through it to reach any other system the web server can reach. Spammers will want to use it to send spam to the Internet, and attackers will use the open proxy to reach the internal network.

Two directives are needed to activate the proxy:

ProxyPass / http://web.internal.com/
ProxyPassReverse / http://web.internal.com/

The first directive instructs the proxy to forward all requests it receives to the internal server web.internal.com and to forward the responses back to the client. So, when someone types the proxy address in the browser, she will be served the content from the internal web server (web.internal.com) without having to know about it or access it directly.

The same applies to the internal server. It is not aware that all requests are executed through the proxy. To it, the proxy is just another client. During normal operation, the internal server will use its real name (web.internal.com) in a response. If such a response goes to the client unmodified, the real name of the internal server will be revealed. The client will also try to use the real name for subsequent requests, but that will probably fail because the internal name is hidden from the public and a firewall prevents access to the internal server.

This is where the second directive comes in. It instructs the proxy server to observe response headers, modify them to hide the internal information, and respond to its clients with responses that make sense to them.
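
To make the effect of the second directive concrete, here is a sketch of what happens to a redirect issued by the backend. The public name www.example.com and the /login path are assumptions chosen for illustration. ProxyPassReverse adjusts the Location, Content-Location, and URI response headers; it does not touch the response body.

```apache
ProxyRequests Off
ProxyPass        / http://web.internal.com/
ProxyPassReverse / http://web.internal.com/

# Redirect header as sent by the backend:
#   Location: http://web.internal.com/login
# The same header as the client receives it from the proxy
# (assuming the proxy is reached as www.example.com):
#   Location: http://www.example.com/login
```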

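Another way to use the reverse proxy is through mod_rewrite. The following has the same effect as the ProxyPass directive above; ProxyPassReverse is still needed to fix the response headers. Note the use of the P (proxy) and L (last rule) flags. In the server context the matched path begins with a slash, so the slash is kept outside the captured group to avoid doubling it in the target URL.

RewriteEngine On
RewriteRule ^/(.*)$ http://web.internal.com/$1 [P,L]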

9.4.1.2 mod_proxy_html

At this point, one problem remains: applications often generate and embed absolute links into HTML pages. Unlike the response header problem, which Apache handles, absolute links in pages are left unmodified. Again, this reveals the real name of the internal server to its clients. Stock Apache cannot solve this problem, but a third-party module can: mod_proxy_html, which is maintained by Nick Kew. It can be downloaded from http://apache.webthing.com/mod_proxy_html/. It requires libxml2, which can be found at http://xmlsoft.org. (Note: the module's author warns against using libxml2 versions lower than 2.5.10.)

To compile the module, I had to pass the compiler the path to libxml2:

# apxs -Wc,-I/usr/include/libxml2 -cia mod_proxy_html.c

Because the module depends on libxml2, you also have to load the libxml2 dynamic library in the httpd.conf configuration file before attempting to load the mod_proxy_html module:

LoadFile /usr/lib/libxml2.so
LoadModule proxy_html_module modules/mod_proxy_html.so

The module looks into every HTML page, searches for absolute links referencing the internal server, and replaces them with links referencing the proxy. To activate this behavior, add the following to the configuration file:

# activate mod_proxy_html
SetOutputFilter proxy-html
   
# prevent content compression in backend operation
RequestHeader unset Accept-Encoding
   
# replace references to the internal server
# with references to this proxy
ProxyHTMLURLMap http://web.internal.com/ /

You may be wondering about the directive to prevent compression. If the client supports content decompression, it will state that with an appropriate Accept-Encoding header:

Accept-Encoding: gzip,deflate

If that happens, the backend server will respond with a compressed response, but mod_proxy_html does not know how to handle compressed content and fails to do its job. By removing the header from the request, we force uncompressed communication between the reverse proxy and the backend server. This is not a problem: chances are both servers share a fast local network, where compression would do little to improve performance.

Read Nick's excellent article published in Apache Week, in which he gives more tips and tricks for reverse proxying:

"Running a Reverse Proxy With Apache" by Nick Kew (http://www.apacheweek.com/features/reverseproxies)

There is an unavoidable performance penalty when using mod_proxy_html. To avoid unnecessary slowdown, activate this module only when a problem with absolute links needs to be solved.
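
One way to limit the penalty is to enable the filter only for the part of the site that is known to emit absolute links. The following is a minimal sketch, assuming the affected application lives under a hypothetical /app/ path; the rest of the site is then proxied without being parsed.

```apache
# Filter only the part of the site known to emit absolute links;
# /app/ is a hypothetical path used for illustration
<Location /app/>
    SetOutputFilter proxy-html
    RequestHeader unset Accept-Encoding
    ProxyHTMLURLMap http://web.internal.com/ /
</Location>
```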

9.4.2. Reverse Proxy by Network Design

The most common approach to running a reverse proxy is to design it into the network. The web server is assigned a private IP address (e.g., 192.168.0.1) instead of a public one. The reverse proxy gets a public IP address (e.g., 217.160.182.153), and this address is attached to the domain name (which is www.example.com in the following example). Configuring Apache to respond to a domain name by forwarding requests to another server is trivial:

<VirtualHost www.example.com>
    ProxyPass / http://192.168.0.1/
    ProxyPassReverse / http://192.168.0.1/
   
    # additional mod_proxy_html configuration 
    # options can be added here if required
</VirtualHost>
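
For reference, here is a sketch of how the same virtual host might look with the mod_proxy_html options from the previous section filled in. It assumes the backend embeds its private address (192.168.0.1) in absolute links; if the links use an internal hostname instead, that name would go into the ProxyHTMLURLMap directive.

```apache
<VirtualHost www.example.com>
    # never act as a forward proxy
    ProxyRequests Off

    ProxyPass / http://192.168.0.1/
    ProxyPassReverse / http://192.168.0.1/

    # rewrite absolute links that refer to the backend
    # (assumes the links use the private address)
    SetOutputFilter proxy-html
    RequestHeader unset Accept-Encoding
    ProxyHTMLURLMap http://192.168.0.1/ /
</VirtualHost>
```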

9.4.3. Reverse Proxy by Redirecting Network Traffic

Sometimes, when faced with a network that is already up and running, it may be impossible or too difficult to reconfigure the network to introduce a reverse proxy. Under such circumstances, you may decide to introduce the reverse proxy through traffic redirection at the network level. This technique is also useful when you are unsure about whether you want to use a proxy, and you want to see how it works before committing more resources.

The following steps show how a transparent reverse proxy is introduced to a network, assuming the gateway is capable of redirecting traffic:

The web server retains its real IP address. It will be unaware that traffic is not coming to it directly any more.
A reverse proxy is added to the same network segment.
A firewall rule is added to the gateway to redirect the incoming web traffic to the proxy instead of to the web server.

The exact firewall rule depends on the type of gateway. Assuming the web server is at 192.168.1.99 and the reverse proxy is at 192.168.1.100, the following iptables command will transparently redirect all web server traffic through the proxy:

# iptables -t nat -A PREROUTING -d 192.168.1.99 -p tcp --dport 80 \
> -j DNAT --to-destination 192.168.1.100
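
The proxy itself still has to be told where the real web server is. The following is a sketch of the matching Apache configuration on the proxy (192.168.1.100), under two assumptions: that the proxy reaches the web server directly on the shared segment, so its own requests do not pass through the gateway and are not redirected again; and that the web server should keep seeing the Host header the client sent, which is what ProxyPreserveHost arranges.

```apache
# On the reverse proxy at 192.168.1.100
ProxyRequests Off
ProxyPass / http://192.168.1.99/
ProxyPassReverse / http://192.168.1.99/

# pass the client's original Host header to the web server
# (an assumption about what the backend expects)
ProxyPreserveHost On
```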