Squid Proxy Server 3.1

Beginner's Guide

Improve the performance of your network using the caching

and access control capabilies of Squid

Kulbir Saini

BIRMINGHAM - MUMBAI

Squid Proxy Server 3.1

Beginner's Guide

or transmied in any form or by any means, without the prior wrien permission of the

publisher, except in the case of brief quotaons embedded in crical arcles or reviews.

Every eort has been made in the preparaon of this book to ensure the accuracy of the

informaon presented. However, the informaon contained in this book is sold without

warranty, either express or implied. Neither the author, nor Packt Publishing, its dealers or

distributors will be held liable for any damages caused or alleged to be caused directly or

indirectly by this book.

Packt Publishing has endeavored to provide trademark informaon about all of the

companies and products menoned in this book by the appropriate use of capitals.

However, Packt Publishing cannot guarantee the accuracy of this informaon.

First published: February 2011

Producon Reference: 1160211

Published by Packt Publishing Ltd.

32 Lincoln Road

Olton

Birmingham, B27 6PA, UK.

ISBN 978-1-849513-90-6

www.packtpub.com

Cover Image by Faiz Faohi (faizfattohi@gmail.com)

Credits

Author

Kulbir Saini

Reviewers

Mihai Dobos

Siju Oommen George

Amos Y. Jeries

Rajkumar Seenivasan

Acquision Editor

Sarah Cullington

Development Editor

Susmita Panda

Technical Editor

Sakina Kaydawala

Copy Editor

Leonard D'Silva

Indexer

Hemangini Bari

Editorial Team Leader

Mithun Sehgal

Project Team Leader

Ashwin Shey

Project Coordinator

Michelle Quadros

Proofreader

Lindsey Thomas

Graphics

Nilesh Mohite

Producon Coordinators

Aparna Bhagat

Kruthika Bangera

Cover Work

Aparna Bhagat

About the Author

Kulbir Saini is an entrepreneur based in Hyderabad, India. He has had extensive experience

in managing systems and network infrastructure. Apart from his work as a freelance

developer, he provides services to a number of startups. Through his blogs, he has been an

acve contributor of documentaon for various open source projects, most notable being

The Fedora Project and Squid. Besides computers, which his life praccally revolves around,

he loves travelling to remote places with his friends. For more details, please check

http://saini.co.in/.

There are people who served as a source of inspiraon, people who helped

me throughout, and my friends who were always there for me. Without

them, this book wouldn't have been possible.

I would like to thank Sunil Mohan Ranta, Nirnimesh, Suryakant Padar,

Shiben Bhaacharjee, Tarun Jain, Sanyam Sharma, Jayaram Kowta, Amal

Raj, Sachin Rawat, Vidit Bansal, Upasana Tegta, Gopal Da Joshi, Vardhman

Jain, Sandeep Chandna, Anurag Singh Rana, Sandeep Kumar, Rishabh

Mukherjee, Mahaveer Singh Deora, Sambhav Jain, Ajay Somani, Ankush

Kalkote, Deepak Vig, Kapil Agrawal, Sachin Goyal, Pankaj Saini, Alok Kumar,

Nin Bansal, Nin Gupta, Kapil Bajaj, Gaurav Kharkwal, Atul Dwivedi,

Abhinav Parashar, Bhargava Chowdary, Maru Borker, Abhilash I, Gopal

Krishna Koduri, Sashidhar Guntury, Siva Reddy, Prashant Mathur, Vipul

Mial, Deep G.P., Shikha Aggarwal, Gaganpreet Singh Arora, Sanrag Sood,

Anshuman Singh, Himanshu Singh, Himanshu Sharma, Dinesh Yadav, Tushar

Mahajan, Sankalp Khare, Mayank Juneja, Ankur Goel, Anuraj Pandey, Rohit

Nigam, Romit Pandey, Ankit Rai, Vishwajeet Singh, Suyesh Tiwari, Sanidhya

Kashap, and Kunal Jain.

I would also like to thank Michelle Quadros, Sarah Cullington, Susmita

Panda, Priya Mukherji, and Snehman K Kohli from Packt who have been

extremely helpful and encouraging during the wring of the book.

Special thanks go out to my parents and sister, for their love and support.

About the Reviewers

Mihai Dobos has a strong background in networking and security technologies, with hands

on project experience in open source, Cisco, Juniper, Symantec, and many other vendors.

He started as a Cisco trainer right aer nishing high school, then moved on to real-life

implementaons of network and security soluons. Mihai is now studying for his Masters

degree in Informaon Security in the Military Technical Academy.

Siju Oommen George works as the Senior Systems Administrator at HiFX Learning

Services, which is part of Virtual Training Company. He also over sees network, security,

and systems-related aspects at HiFX IT & Media Services, Fingent, and Quantlogic.

He completed his BTech course in Producon Engineering from the University of Calicut in

2000 and has many years of System Administraon experience on BSD, OS X, Linux, and

Microso Windows Plaorms, involving both open source and proprietary soware. He is

also a contributor to the DragonFlyBSD Handbook. He acvely advocates the use of BSDs

among Computer Professionals and encourages Computer students to do the same. He is an

acve parcipant in many of the BSD, Linux, and open source soware mailing lists and enjoys

helping others who are new to a parcular technology. He also reviews computer-related

books in his spare me. He is married to Sophia Yesudas who works in the Airline Industry.

I would like to thank my Lord and Savior Jesus Christ who gave me the

grace to connue working on reviewing this book during my busy schedule

and sickness, my wife Sophia for allowing me to steal me from her and

spend it in front of the computer at home, my Father T O Oommen and my

Late mother C I Maria who worked hard to pay for my educaon, my Pastor

Rajesh Mathew Koukapilly who was with me in all the ups and downs of

life, and nally my employer Mohan Thomas who provided me with the

encouragement and facilies to research, experiment, work, and learn

almost everything I know in the computer eld.

Amos Y. Jeries' original background is in genec engineering, physics, and astronomy.

He was introduced to compung in 1994. By 1996, he was developing networked

mulplayer games and accounng soware on the Macintosh plaorm. In 2000, he joined

the nanotechnology eld working with members of the Foresight Instute and others

spreading the foundaons of the technology. In 2001, he graduated from the University of

Waikato with a Bachelor of Science (Soware Engineering) degree with addional topical

background in soware design, languages, compiler construcon, data storage, encrypon,

and arcial intelligence. In 2002, as a post-graduate, Amos worked as a developer creang

real-me soware for mul-media I/O, networking, and recording on Large Interacve

Display Surfaces [1]. Later in 2002, he began a career in HTTP web design and network

administraon, founding Treehouse Networks Ltd. in 2003 as a consultancy. This led him into

the eld of SMTP mail networking and as a result data forensics and the an-spam/an-virus

industry. In 2004, he returned to formal study in the topics of low-level networking protocols

and human-computer interacon. In 2007, he entered the Squid project as a developer

integrang IPv6 support and soon stepped into the posion of Squid-3 maintainer. In 2008,

he began contract work for the Te Kotahitanga research project at the University of Waikato

developing online tools for supporng teacher professional development [2,3].

Acknowledgements should go to Robert Collins, Henrik Nordstrom,

Francesco Chemolli, and Alex Rousskov[4]. Without whom Squid-3 would

have ceased to exist some years back.

[1]

http://www.waikato.ac.nz/php/research.php?author=12357

5&mode=show

[2]

http://edlinked.soe.waikato.ac.nz/departments/index.

php?dept_id=20&page_id=2639

[3](Research publicaon due out next year).

[4] Non-English characters exist in the correct spelling of these names

www.PacktPub.com

Support files, eBooks, discount offers, and more

You might want to visit www.PacktPub.com for support les and downloads related to

your book.

Did you know that Packt oers eBook versions of every book published, with PDF and ePub

les available? You can upgrade to the eBook version at

www.PacktPub.com and as a print

book customer, you are entled to a discount on the eBook copy. Get in touch with us at

service@packtpub.com for more details.

www.PacktPub.com, you can also read a collecon of free technical arcles. Sign up

for a range of free newsleers and receive exclusive discounts and oers on Packt books

and eBooks.

http://PacktLib.PacktPub.com

Do you need instant soluons to your IT quesons? PacktLib is Packt's online digital book

library. Here, you can access, read, and search across Packt's enre library of books.

Why Subscribe?

Fully searchable across every book published by Packt

Copy and paste, print and bookmark content

On demand and accessible via web browser

Free Access for Packt account holders

If you have an account with Packt at www.PacktPub.com, you can use this to access

PacktLib today and view nine enrely free books. Simply use your login credenals for

immediate access.

•

Table of Contents

Preface 1

Chapter 1: Geng Started with Squid 7

Proxy server 7

Reverse proxy 9

Geng Squid 9

Time for acon – idenfying the right version 10

Methods of obtaining Squid 11

Using source archives 11

Time for acon – downloading Squid 11

Obtaining the latest source code from Bazaar VCS 12

Time for acon – using Bazaar to obtain source code 13

Using binary packages 14

Installing Squid 14

Installing Squid from source code 14

Compiling Squid 14

Uncompressing the source archive 15

Congure or system check 15

Time for acon – running the congure command 25

Time for acon – compiling the source 26

Time for acon – installing Squid 27

Time for acon – exploring Squid les 27

Installing Squid from binary packages 29

Fedora, CentOS or Red Hat 30

Debian or Ubuntu 30

FreeBSD 30

OpenBSD or NetBSD 30

Dragony BSD 30

Gentoo 30

Arch Linux 31

Summary 32

Table of Contents

[ ii ]

Chapter 2: Conguring Squid 33

Quick start 34

Syntax of the conguraon le 34

Types of direcves 35

HTTP port 37

Time for acon – seng the HTTP port 37

Access control lists 38

Time for acon – construcng simple ACLs 39

Controlling access to the proxy server 40

HTTP access control 40

Time for acon – combining ACLs and HTTP access 41

HTTP reply access 42

ICP access 43

HTCP access 43

HTCP CLR access 43

Miss access 43

Ident lookup access 43

Cache peers or neighbors 44

Declaring cache peers 44

Time for acon – adding a cache peer 44

Quickly restricng access to domains using peers 45

Advanced control on access using peers 46

Caching web documents 46

Using main memory (RAM) for caching 46

In-transit objects or current requests 47

Hot or popular objects 47

Negavely cached objects 47

Specifying cache space in RAM 47

Time for acon – specifying space for memory caching 48

Maximum object size in memory 48

Memory cache mode 49

Using hard disks for caching 49

Specifying the storage space 49

Time for acon – creang a cache directory 51

Conguring the number of sub directories 52

Time for acon – adding a cache directory 52

Cache directory selecon 53

Cache object size limits 53

Seng limits on object replacement 54

Cache replacement policies 54

Least recently used (LRU) 54

Greedy dual size frequency (GDSF) 54

Least frequently used with dynamic aging (LFUDA) 55

Table of Contents

[ iii ]

Tuning Squid for enhanced caching 55

Selecve caching 55

Time for acon – prevenng the caching of local content 55

Refresh paerns for cached objects 56

Time for acon – calculang the freshness of cached objects 57

Opons for refresh paern 58

Aborng the paral retrievals 60

Caching the failed requests 61

Playing around with HTTP headers 61

Controlling HTTP headers in requests 61

Controlling HTTP headers in responses 62

Replacing the contents of HTTP headers 62

DNS server conguraon 62

Specifying the DNS program path 63

Controlling the number of DNS client processes 63

Seng the DNS name servers 63

Time for acon – adding DNS name servers 64

Seng the hosts le 64

Default domain name for requests 64

Timeout for DNS queries 64

Caching the DNS responses 65

Seng the size of the DNS cache 65

Logging 66

Log formats 66

Log le rotaon or log le backups 66

Log access 66

Buered logs 66

Strip query terms 67

URL rewriters and redirectors 67

Other conguraon direcves 67

Seng the eecve user for running Squid 68

Conguring hostnames for the proxy server 68

Hostname visible to everyone 68

Unique hostname for the server 68

Controlling the request forwarding 68

Always direct 69

Never direct 69

Hierarchy stoplist 69

Broken posts 70

TCP outgoing address 70

Table of Contents

[ iv ]

PID lename 71

Client netmask 71

Summary 73

Chapter 3: Running Squid 75

Command line opons 75

Geng a list of available opons 76

Time for acon – lisng the opons 77

Geng informaon about our Squid installaon 78

Time for acon – nding out the Squid version 78

Creang cache or swap directories 78

Time for acon – creang cache directories 78

Using a dierent conguraon le 79

Geng verbose output 79

Time for acon – debugging output in the console 80

Full debugging output on the terminal 81

Running as a normal process 82

Parsing the Squid conguraon le for errors or warnings 82

Time for acon – tesng our conguraon le 82

Sending various signals to a running Squid process 83

Reloading a new conguraon le in a running process 83

Shung down the Squid process 84

Interrupng or killing a running Squid process 84

Checking the status of a running Squid process 84

Sending a running process in to debug mode 85

Rotang the log les 85

Forcing the storage metadata to rebuild 86

Double checking swap during rebuild 86

Automacally starng Squid at system startup 87

Adding Squid command to /etc/rc.local le 87

Adding init script 87

Time for acon – adding the init script 87

Summary 89

Chapter 4: Geng Started with Squid's Powerful ACLs and Access Rules 91

Access control lists 92

Fast and slow ACL types 92

Source and desnaon IP address 92

Time for acon – construcng ACL lists using IP addresses 93

Time for acon – using a range of IP addresses to build ACL lists 94

Source and desnaon domain names 96

Time for acon – construcng ACL lists using domain names 97

Desnaon port 98

Table of Contents

[ v ]

Time for acon – building ACL lists using desnaon ports 99

HTTP methods 101

Idenfying requests using the request protocol 102

Time for acon – using a request protocol to construct access rules 102

Time-based ACLs 103

URL and URL path-based idencaon 104

Matching client usernames 105

Proxy authencaon 106

Time for acon – enforcing proxy authencaon 107

User limits 108

Idencaon based on various HTTP headers 109

HTTP reply status 111

Idenfying random requests 112

Access list rules 112

Access to HTTP protocol 112

Access to other ports 114

Enforcing limited access to neighbors 115

Time for acon – denying miss_access to neighbors 115

Requesng neighbor proxy servers 116

Forwarding requests to remote servers 117

Ident lookup access 117

Controlled caching of web documents 118

URL rewrite access 118

HTTP header access 119

Custom error pages 119

Maximum size of the reply body 120

Logging requests selecvely 120

Mixing ACL lists and rules – example scenarios 121

Handling caching of local content 121

Time for acon – avoiding caching of local content 121

Denying access from external networks 122

Denying access to selecve clients 122

Blocking the download of video content 123

Time for acon – blocking video content 123

Special access for certain clients 123

Time for acon – wring rules for special access 124

Limited access during working hours 124

Allowing some clients to connect to special ports 125

Tesng access control with squidclient 126

Table of Contents

[ vi ]

Time for acon – tesng our access control example with squidclient 128

Time for acon – tesng a complex access control 129

Summary 132

Chapter 5: Understanding Log Files and Log Formats 133

Log messages 134

Cache log or debug log 134

Time for acon – understanding the cache log 134

Access log 137

Understanding the access log 137

Time for acon – understanding the access log messages 137

Access log syntax 139

Time for acon – analyzing a syntax to specify access log 139

Log format 140

Time for acon – learning log format and format codes 140

Log formats provided by Squid 142

Time for acon – customizing the access log with a new log format 142

Selecve logging of requests 143

Time for acon – using access_log to control logging of requests 144

Referer log 144

Time for acon – enabling the referer log 145

Time for acon – translang the referer logs to a human-readable format 145

User agent log 146

Time for acon – enabling user agent logging 147

Emulang HTTP server-like logs 147

Time for acon – enabling HTTP server log emulaon 147

Log le rotaon 148

Other log related features 148

Cache store log 149

Summary 150

Chapter 6: Managing Squid and Monitoring Trac 151

Cache manager 151

Installing the Apache Web server 152

Time for acon – installing Apache Web server 152

Conguring Apache for providing the cache manager web interface 152

Time for acon – conguring Apache to use cachemgr.cgi 153

Accessing the cache manager web interface 153

Conguring Squid 154

General Runme Informaon 156

IP Cache Stats and Contents 157

FQDN Cache Stascs 158

Table of Contents

[ vii ]

HTTP Header Stascs 159

Trac and Resource Counters 160

Request Forwarding Stascs 161

Cache Client List 162

Memory Ulizaon 163

Internal DNS Stascs 164

Log le analyzers 165

Calamaris 165

Installing Calamaris 166

Time for acon – installing Calamaris 166

Using Calamaris to generate stascs 167

Time for acon – generang stats in plain text format 167

Time for acon – generang graphical reports with Calamaris 168

Summary 171

Chapter 7: Protecng your Squid Proxy Server with Authencaon 173

HTTP authencaon 174

Basic authencaon 174

Time for acon – exploring Basic authencaon 174

Database authencaon 176

Conguring database authencaon 177

NCSA authencaon 178

Time for acon – conguring NCSA authencaon 178

NIS authencaon 179

LDAP authencaon 179

SMB authencaon 179

PAM authencaon 180

Time for acon – conguring PAM service 180

MSNT authencaon 180

Time for acon – conguring MSNT authencaon 180

MSNT mul domain authencaon 181

SASL authencaon 182

Time for acon – conguring Squid to use SASL authencaon 182

getpwnam authencaon 182

POP3 authencaon 183

RADIUS authencaon 183

Time for acon – conguring RADIUS authencaon 183

Fake Basic authencaon 184

Digest authencaon 184

Time for acon – conguring Digest authencaon 185

File authencaon 186

LDAP authencaon 186

eDirectory authencaon 187

Table of Contents

[ viii ]

Microso NTLM authencaon 187

Samba's NTLM authencaon 188

Fake NTLM authencaon 188

Negoate authencaon 189

Time for acon – conguring Negoate authencaon 189

Using mulple authencaon schemes 190

Wring a custom authencaon helper 191

Time for acon – wring a helper program 191

Making non-concurrent helpers concurrent 192

Common issues with authencaon 193

Summary 196

Chapter 8: Building a Hierarchy of Squid Caches 197

Cache hierarchies 198

Reasons to use hierarchical caching 198

Problems with hierarchical caching 199

Joining a cache hierarchy 201

Time for acon – joining a cache hierarchy 202

ICP opons 202

HTCP opons 203

Peer or neighbor selecon 204

Opons for peer selecon methods 205

Other cache peer opons 208

Controlling communicaon with peers 209

Domain-based forwarding 209

Time for acon – conguring Squid for domain-based forwarding 210

Cache peer access 210

Time for acon – forwarding requests to cache peers using ACLs 211

Switching peer relaonship 212

Time for acon – conguring Squid to switch peer relaonship 213

Controlling request redirects 213

Peer communicaon protocols 215

Internet Cache Protocol 215

Cache digests 216

Squid and cache digest conguraon 217

Hypertext Caching Protocol 218

Summary 219

Chapter 9: Squid in Reverse Proxy Mode 221

What is reverse proxy mode? 222

Exploring reverse proxy mode 222

Conguring Squid as a server surrogate 223

Table of Contents

[ ix ]

HTTP port 224

HTTP opons in reverse proxy mode 224

HTTPS port 225

HTTPS opons in reverse proxy mode 226

Adding backend web servers 229

Cache peer opons for reverse proxy mode 229

Time for acon – adding backend web servers 229

Support for surrogate protocol 230

Understanding the surrogate protocol 230

Conguraon opons for surrogate support 231

Support for ESI protocol 231

Conguring Squid for ESI support 232

Logging messages in web server log format 232

Ignoring the browser reloads 232

Time for acon – conguring Squid to ignore the

browser reloads 233

Access controls in reverse proxy mode 233

Squid in only reverse proxy mode 234

Squid in reverse proxy and forward proxy mode 234

Example conguraons 235

Web server and Squid server on the same machine 236

Accelerang mulple backend web servers hosng one website 236

Accelerang mulple web servers hosng mulple websites 237

Summary 238

Chapter 10: Squid in Intercept Mode 239

Intercepon caching 239

Time for acon – understanding intercepon caching 240

Advantages of intercepon caching 241

Problems with intercepon caching 241

Diverng HTTP trac to Squid 243

Using a router's policy roung to divert requests 243

Using rule-based switching to divert requests 244

Using Squid server as a bridge 244

Using WCCP tunnel 245

Implemenng intercepon caching 245

Conguring the network devices 245

Conguring the operang system 246

Time for acon – enabling IP forwarding 246

Time for acon – redirecng HTTP trac to Squid 247

Conguring Squid 248

Conguring HTTP port 248

Summary 250

Table of Contents

[ x ]

Chapter 11: Wring URL Redirectors and Rewriters 251

URL redirectors and rewriters 251

Understanding URL redirectors 252

HTTP status codes for redirecon 253

Understanding URL rewriters 254

Issues with URL rewriters 255

Squid, URL redirectors, and rewriters 256

Communicaon interface 256

Time for acon – exploring the message ow between

Squid and redirectors 257

Time for acon – wring a simple URL redirector program 258

Concurrency 259

Handling whitespace in URLs 259

Using the uri_whitespace direcve 259

Making redirector programs intelligent 260

Wring our own URL redirector program 260

Time for acon – wring our own template for a URL redirector 261

Conguring Squid 262

Specifying the URL redirector program 263

Controlling redirector children 263

Controlling requests passed to the redirector program 264

Bypassing URL redirector programs when under heavy load 264

Rewring the Host HTTP header 265

A special URL redirector – deny_info 265

Popular URL redirectors 267

SquidGuard 267

Squirm 267

Ad Zapper 268

Summary 269

Chapter 12: Troubleshoong Squid 271

Some common issues 271

Cannot write to log les 272

Time for acon – changing the ownership of log les 272

Could not determine hostname 272

Cannot create swap directories 273

Time for acon – xing cache directory permissions 273

Failed vericaon of swap directories 274

Time for acon – creang swap directories 274

Address already in use 274

Table of Contents

[ xi ]

Time for acon – nding the program listening on a specic port 275

URLs with underscore results in an invalid URL 276

Enforce hostname checks 276

Allow underscore 276

Squid becomes slow over me 276

The request or reply is too large 277

Access denied on the proxy server 277

Connecon refused when reaching a sibling proxy server 278

Debugging problems 278

Time for acon – debugging HTTP requests 281

Time for acon – debugging access control 282

Geng help online and reporng bugs 284

Summary 286

Pop Quiz Answers 287

Index 291

Preface

Squid proxy server enables you to cache your web content and return it quickly on

subsequent requests. System administrators oen struggle with delays and too much

bandwidth being used, but Squid solves these problems by handling requests locally. By

deploying Squid in accelerator mode, requests are handled faster than on normal web

servers, thus making your site perform quicker than everyone else's!

The Squid Proxy Server 3.1 Beginner's Guide will help you to install and congure Squid so

that it is opmized to enhance the performance of your network. Caching usually takes a

lot of professional know-how, which can take me and be very confusing. The Squid proxy

server reduces the amount of eort that you will have to spend and this book will show you

how best to use Squid, saving your me and allowing you to get most out of your network.

Whether you only run one site, or are in charge of a whole network, Squid is an invaluable

tool which improves performance immeasurably. Caching and performance opmizaon

usually requires a lot of work on the developer's part, but Squid does all that for you. This

book will show you how to get the most out of Squid by customizing it for your network.

You will learn about the dierent conguraon opons available and the transparent and

accelerated modes that enable you to focus on parcular areas of your network.

Applying proxy servers to large networks can be a lot of work as you have to decide where

to place restricons and who to grant access. However, the straighorward examples in this

book will guide you through step-by-step so that you will have a proxy server that covers all

areas of your network by the me you nish reading.

What this book covers

Chapter 1, Geng Started with Squid, discusses the basics of proxy servers and web

caching and how we can ulize them to save bandwidth and improve the end user's

browsing experience. We will also learn to idenfy the correct Squid version for our

environment. We will explore various conguraon opons available for enabling or

disabling certain features while we compile Squid from the source code. We will explore

steps to compile and install Squid.

Preface

[ 2 ]

Chapter 2, Conguring Squid, explores the syntax used in the Squid conguraon le, which

is used to control Squid's behavior. We will explore the important direcves used in the

conguraon le and will see related examples to understand them beer. We will have

a brief overview of the powerful access control lists which we will learn in detail in later

chapters. We will also learn to ne-tune our cache to achieve a beer HIT rao to save

bandwidth and reduce the average page load me.

Chapter 3, Running Squid, talks about running Squid in dierent modes and various

command line opons available for debugging purposes. We will also learn about rotang

Squid logs to reclaim disk space by deleng old/obsolete log les. We will learn to install

the

init script to automacally start Squid on system startup.

Chapter 4, Geng Started with Squid's Powerful ACLs and Access Rules, explores the Access

Control Lists in detail with examples. We will learn about various ACL types and to construct

ACLs to idenfy requests and responses based on dierent criteria. We will also learn about

mixing ACLs of various types with access rules to achieve desired access control.

Chapter 5, Understanding Log Files and Log Formats, discusses conguring Squid to generate

customized log messages. We will also learn to interpret the messages logged by Squid in

various log les.

Chapter 6, Managing Squid and Monitoring Trac, explores the Squid's Cache Manager

web interface in this chapter using which we can monitor our Squid proxy server and get

stascs about dierent components of Squid. We will also have a look at a few log le

analyzers which make analyzing trac simpler compared to manually interpreng the

access log messages.

Chapter 7, Protecng your Squid with Authencaon, teaches us to protect our Squid

proxy server with authencaon using the various authencaon schemes available. We

will also learn to write custom authencaon helpers using which we can build our own

authencaon system for Squid.

Chapter 8, Building a Hierarchy of Squid Caches, explores cache hierarchies in detail. We will

also learn to congure Squid to act as a parent or a sibling proxy server in a hierarchy, and to

use other proxy servers as a parent or sibling cache.

Chapter 9, Squid in Reverse Proxy Mode, discusses how Squid can accept HTTP requests on

behalf of one or more web servers in the background. We will learn to congure Squid in

reverse proxy mode. We will also have a look at a few example scenarios.

Chapter 10, Squid in Intercept Mode, talks about the details of intercept mode and how to

congure the network devices, and the host operang system to intercept the HTTP requests

and forward them to Squid proxy server. We will also have a look at the pros and cons of

Squid in intercept mode.

Preface

[ 3 ]

Chapter 11, Wring URL Redirectors and Rewriters. Squid's behavior can be further

customized using the URL redirectors and rewriter helpers. In this chapter, we will learn

about the internals of redirectors and rewriters and we will create our own custom helpers.

Chapter 12, Troubleshoong Squid, discusses some common problems or errors which you

may come across while conguring or running Squid. We will also learn about geng online

help to resolve issues with Squid and ling bug reports.

What you need for this book

A beginner level knowledge of Linux/Unix operang system and familiarity with basic

commands is all what you need. Squid runs almost on all Linux/Unix operang systems and

there is a great possibility that your favorite operang system repository already has Squid.

On a server, the availability of free main memory and speed of hard disk play a major role

in determining the performance of the Squid proxy server. As most of the cached objects

stay on the hard disks, faster disks will result in low disk latency and faster responses. But

faster hard disks (SCSI) are oen very expensive as compared to ATA hard disks and we have

to analyze our requirements to strike a balance between the disk speed we need and the

money we are going to spend on it.

The main memory is the most important factor for opmizing Squid's performance. Squid

stores a lile bit of informaon about each cached object in the main memory. On average,

Squid consumes up to 32 MB of the main memory for every GB of disk caching. The actual

memory ulizaon may vary depending on the average object size, CPU architecture, and

the number of concurrent users, and so on. While memory is crical for good performance,

a faster CPU also helps, but is not really crical.

Who this book is for

If you are a Linux or Unix system administrator and you want to enhance the performance

of your network or you are a web developer and want to enhance the performance of

your website, this book is for you. You will be expected to have some basic knowledge of

networking concepts, but may not have used caching systems or proxy servers unl now.

Conventions

In this book, you will nd several headings appearing frequently. To give clear instrucons of

how to complete a procedure or task, we use:

Preface

[ 4 ]

Time for action - heading

1. Acon 1

2. Acon 2

3. Acon 3

Instrucons oen need some extra explanaon so that they make sense, so they are

followed with:

What just happened?

This heading explains the working of tasks or instrucons that you have just completed.

You will also nd some other learning aids in the book, including:

Pop quiz

These are short mulple choice quesons intended to help you test your own understanding.

Have a go hero - heading

These set praccal challenges and give you ideas for experimenng with what you

have learned.

You will also nd a number of styles of text that disnguish between dierent kinds of

informaon. Here are some examples of these styles, and an explanaon of their meaning.

Code words in text are shown as follows: "The direcve

visible_hostname is used to set

the hostname."

A block of code is set as follows:

acl special_network src 192.0.2.0/24

tcp_outgoing_address 198.51.100.25 special_network

tcp_outgoing_address 198.51.100.86

Any command-line input or output is wrien as follows:

$ mkdir /drive/squid_cache

New terms and important words are shown in bold. Words that you see on the screen, in

menus or dialog boxes for example, appear in the text like this: "If we click on the Internal

DNS Stascs link in the Cache Manager menu, we will be presented with various stascs

about the requests performed by the internal DNS client".

Preface

[ 5 ]

Warnings or important notes appear in a box like this.

Tips and tricks appear like this.

Reader feedback

Feedback from our readers is always welcome. Let us know what you think about this

book—what you liked or may have disliked. Reader feedback is important for us to develop

tles that you really get the most out of.

To send us general feedback, simply send an e-mail to

feedback@packtpub.com, and

menon the book tle via the subject of your message.

If there is a book that you need and would like to see us publish, please send us a note in the

SUGGEST A TITLE form on

www.packtpub.com or e-mail suggest@packtpub.com.

If there is a topic that you have experse in and you are interested in either wring or

contribung to a book on, see our author guide on

www.packtpub.com/authors.

Customer support

Now that you are the proud owner of a Packt book, we have a number of things to help you

to get the most from your purchase.

Downloading the example code for the book

You can download the example code les for all Packt books you have purchased

from your account at http://www.packtpub.com. If you purchased this

book elsewhere, you can visit http://www.packtpub.com/support and

Preface

[ 6 ]

Errata

Although we have taken every care to ensure the accuracy of our content, mistakes do

happen. If you nd a mistake in one of our books—maybe a mistake in the text or the

code—we would be grateful if you would report this to us. By doing so, you can save other

readers from frustraon and help us improve subsequent versions of this book. If you

nd any errata, please report them by vising

http://www.packtpub.com/support,

selecng your book, clicking on the errata submission form link, and entering the details

of your errata. Once your errata are veried, your submission will be accepted and the

errata will be uploaded on our website, or added to any list of exisng errata, under the

Errata secon of that tle. Any exisng errata can be viewed by selecng your tle from

http://www.packtpub.com/support.

Piracy

Piracy of copyright material on the Internet is an ongoing problem across all media. At Packt,

we take the protecon of our copyright and licenses very seriously. If you come across any

illegal copies of our works, in any form, on the Internet, please provide us with the locaon

address or website name immediately, so that we can pursue a remedy.

Please contact us at

pirated material.

We appreciate your help in protecng our authors, and our ability to bring you

valuable content.

Questions

You can contact us at questions@packtpub.com if you are having a problem with any

aspect of the book, and we will do our best to address it.

Getting Started with Squid

In this chapter, we will have a look at how proxy servers and web caching

works in general. We will proceed to download the correct Squid package

for our operang system, based on the system requirements that we learned

about in the Preface. We will learn how to compile and build addional Squid

features. We will also learn the advantages of compiling Squid manually from

the source over using a pre-compiled binary package.

In the nal secon, we will learn how to install Squid from a compiled source

binary package, using popular package managers. Installaon is a crucial

part in geng started with Squid. Somemes, we need to compile Squid with

custom ags, depending on the environment requirements.

So let's get started with the real stu.

Proxy server

A proxy server is a computer system sing between the client requesng a web document

and the target server (another computer system) serving the document. In its simplest form,

a proxy server facilitates communicaon between client and target server without modifying

requests or replies. When we iniate a request for a resource from the target server, the

proxy server hijacks our connecon and represents itself as a client to the target server,

requesng the resource on our behalf. If a reply is received, the proxy server returns it to us,

giving a feel that we have communicated with the target server.

Geng Started with Squid

[ 8 ]

In advanced forms, a proxy server can lter requests based on various rules and may allow

communicaon only when requests can be validated against the available rules. The rules

are generally based on an IP address of a client or target server, protocol, content type of

web documents, web content type, and so on.

As seen in the preceding image, clients can't make direct requests to the web servers. To

facilitate communicaon between clients and web servers, we have connected them using

a proxy server which is acng as a medium of communicaon for clients and web servers.

Somemes, a proxy server can modify requests or replies, or can even store the replies from

the target server locally for fullling the same request from the same or other clients at a

later stage. Storing the replies locally for use at a later me is known as caching. Caching is a

popular technique used by proxy servers to save bandwidth, empowering web servers, and

improving the end user's browsing experience.

Proxy servers are mostly deployed to perform the following:

Reduce bandwidth usage

Enhance the user's browsing experience by reducing page load me which, in turn,

is achieved by caching web documents

Enforce network access policies

Monitoring user trac or reporng Internet usage for individual users or groups

Enhance user privacy by not exposing a user's machine directly to Internet

Distribute load among dierent web servers to reduce load on a single server

Empower a poorly performing web server

Filter requests or replies using an integrated virus/malware detecon system

Load balance network trac across mulple Internet connecons

Relay trac around within a local area network



Chapter 1

[ 9 ]

In simple terms, a proxy server is an agent between a client and target server that has a

list of rules against which it validates every request or reply, and then allows or denies

access accordingly.

Reverse proxy

Reverse proxying is a technique of storing the replies or resources from a web server locally

so that the subsequent requests to the same resource can be sased from the local copy

on the proxy server, somemes without even actually contacng the web server. The proxy

server or web cache checks if the locally stored copy of the web document is sll valid before

serving the cached copy.

The life of the locally stored web document is calculated from the addional HTTP headers

received from the web server. Using HTTP headers, web servers can control whether a given

document/response should be cached by a proxy server or not.

Web caching is mostly used:

To reduce bandwidth usage. A large number of stac web documents like CSS and

JavaScript les, images, videos, and so on can be cached as they don't change

frequently and constutes the major part of a response from a web server.

By ISPs to reduce average page load me to enhance browsing experience for their

customers on Dial-Up or broadband.

To take a load o a very busy web server by serving stac pages/documents from

a proxy server's cache.

Getting Squid

Squid is available in several forms (compressed source archives, source code from a version

control system, binary packages such as RPM, DEB, and so on) from Squid's ocial website,

various Squid mirrors worldwide, and soware repositories of almost all the popular

operang systems. Squid is also shipped with many Linux/Unix distribuons.

There are various versions and releases of Squid available for download from Squid's ocial

website. To get the most out of a Squid installaon its best to check out the latest source

code from a Version Control System (VCS) so that we get the latest features and xes. But be

warned, the latest source code from a VCS is generally leading edge and may not be stable or

may not even work properly. Though code from a VCS is good for learning or tesng Squid's

new features, you are strongly advised not to use code from a VCS for producon deployments.



Geng Started with Squid

[ 10 ]

If we want to play safe, we should probably download the latest stable version or stable

version from the older releases. Stable versions are generally tested before they are

released and are supposed to work out of the box. Stable versions can directly be used in

producon deployments.

Time for action – identifying the right version

A list of available versions of Squid is maintained at http://www.squid-cache.org/

Versions/

. For producon environments, we should use versions listed under the Stable

Versions secon only. If we want to test new Squid features in our environment or if we

intend to provide feedback to the Squid community about the new version, then we should

be using one of the Beta Versions.

As we can see in the preceding screenshot, the website contains the First Producon

Release Date and Latest Release Date for the stable versions. If we click on any of the

versions, we are directed to a page containing a list of all the releases in that parcular

version. Let's have a look at the page for version 3.1:

Chapter 1

[ 11 ]

For every release, along with a release date, there are links for downloading compressed

source archives.

Dierent versions of Squid may have dierent features. For example, all the features

available in Squid version 2.7 may or may not be available in newer versions such as Squid

3.x. Some features may have been deprecated or have become redundant over me and

they are generally removed. On the other hand, Squid 3.x may have several new features

or exisng features in an improved and revised manner.

Therefore, we should always aim for the latest version, but depending on the environment,

we may go for stable or beta version. Also, if we need specic features that are not available

in the latest version, we may choose from the available releases in a dierent branch.

What just happened?

We had a brief look at the pages containing the dierent versions and releases of Squid,

on Squid's ocial website. We also learned which versions and releases that we should

download and use for dierent types of usage.

Methods of obtaining Squid

Aer idenfying the version of Squid that we should be using for compiling and installaon,

let's have a look at the ways in which we can obtain Squid release 3.1.10.

Using source archives

Compressed source archives are the most popular way of geng Squid. To download the

source archive, please visit Squid download page, http://www.squid-cache.org/

Download/

. This web page has links for downloading the dierent versions and releases

of Squid, either from the ocial website or available mirrors worldwide. We can use either

HTTP or FTP for geng the Squid source archive.

Time for action – downloading Squid

Now we are going to download Squid 3.1.10 from Squid's ocial website:

1. Let's go to the web page http://www.squid-cache.org/Versions/.

2. Now we need to click on the link to Version 3.1, as shown in the

following screenshot:

Geng Started with Squid

[ 12 ]

3. We'll be taken to a page displaying the various releases in version 3.1. The link with

the display text tar.gz in the Download column is a link to the compressed source

archive for Squid release 3.1.10, as shown in the following screenshot:

4. To download Squid 3.1.10 using the web browser, just click on the link.

5. Alternavely, we can use wget to download the source archive from the command

line as follows:

wget http://www.squid-cache.org/Versions/v3/3.1/squid-3.1.10.tar.gz

What just happened?

We successfully retrieved Squid version 3.1.10 from Squid's ocial website. The process of

retrieving other stable or beta versions is very similar.

Obtaining the latest source code from Bazaar VCS

Advanced users may be interested in geng the very latest source code from the Squid

code repository, using Bazaar. We can safely skip this secon if we are not familiar with

VCS in general. Bazaar is a popular version control system used to track project history and

facilitate collaboraon. From version 3.x onwards, Squid source code has been migrated to

Bazaar. Therefore, we should ensure that we have Bazaar installed on our system in order

to checkout the source code from repository. To nd out more about Bazaar or for Bazaar

installaon and conguraon manuals, please visit Bazaar's ocial website at

http://bazaar.canonical.com/.

Once we have setup Bazaar, we should head to the Squid code repository mirrored on

Launchpad at

https://code.launchpad.net/squid/. From here we can browse all the

versions and branches of Squid. Let's get ourselves familiar with the page layout:

Chapter 1

[ 13 ]

In the previous screenshot, Series: trunk represents the development branch, which

contains code that is sll in development and is not ready for producon use. The branches

with the status Mature are stable and can be used right away in producon environments.

Time for action – using Bazaar to obtain source code

Now that we are familiar with the various branches, versions, and releases. Let's proceed to

checking out the source code with Bazaar. To download code from any branch, the syntax for

the command is as follows:

bzr branch lp:squid[/branch[/version]]

branch and version are oponal parameters in the previous code. So, if we want to get

branch 3.1, then the command will be as follows:

bzr branch lp:squid/3.1

The previous command will fetch source code from Launchpad and may take a considerable

amount of me, depending on the Internet connecon. If we are willing to download source

code for Squid version 3.1.10, then the command will be as follows:

bzr branch lp:squid/3.1/3.1.10

In the previous code, 3.1 is the branch name and 3.1.10 is the specic version of Squid

that we want to checkout.

What just happened?

We learned to fetch the source code for any Squid branch or release using Bazaar from

Squid's source code hosted on Launchpad.

Have a go hero – fetching the source code

Using the command syntax that we learned in the previous secon, fetch the source code for

Squid version 3.0.stable25 from Launchpad.

Soluon:

bzr branch lp:squid/3.0/3.0.stable25

Explanaon: If we browse to the parcular version on Launchpad, the version

number used in the command becomes obvious.



Geng Started with Squid

[ 14 ]

Using binary packages

Squid binary packages are pre-compiled and ready to install soware bundles. Binary

packages are available in the soware repositories of almost all Linux/Unix-based operang

systems. Depending on the operang system, only stable and somemes well tested beta

versions make it to the soware repositories, so they are ready for producon use.

Installing Squid

Squid can be installed using the source code we obtained in the previous secon, using a

package manager which, in turn, uses the binary package available for our operang system.

Let's have a detailed look at the ways in which we can install Squid.

Installing Squid from source code

Installing Squid from source code is a three step process:

1. Select the features and operang system-specic sengs.

2. Compile the source code to generate the executables.

3. Place the generated executables and other required les in their designated

locaons for Squid to funcon properly.

We can perform some of the above steps using automated tools that make the compilaon

and installaon process relavely easy.

Compiling Squid

Compiling Squid is a process of compiling several les containing C/C++ source code and

generang executables. Compiling Squid is really easy and can be done in a few steps. For

compiling Squid, we need an ANSI C/C++ compliant compiler. If we already have a GNU C/

C++ Compiler (GNU Compiler Collecon (GCC) and g++, which are available on almost every

Linux/Unix-based operang system by default), we are ready to begin the actual compilaon.

Why compile?

Compiling Squid is a bit of a painful task compared to installing Squid from the binary

package. However, we recommend compiling Squid from the source instead of using

pre-compiled binaries. Let's walk through a few advantages of compiling Squid from

the source:

While compiling we can enable extra features, which may not be enabled in the

pre-compiled binary package.



Chapter 1

[ 15 ]

When compiling, we can also disable extra features that are not needed for a

parcular environment. For example, we may not need Authencaon helpers or

ICMP support.

configure probes the system for several features and enables or disables them

accordingly, while pre-compiled binary packages will have the features detected for

the system the source was compiled on.

Using

configure, we can specify an alternate locaon for installing Squid. We can

even install Squid without root or super user privileges, which may not be possible

with pre-compiled binary package.

Though compiling Squid from source has a lot of advantages over installing from the binary

package, the binary package has its own advantages. For example, when we are in damage

control mode or a crisis situaon and we need to get the proxy server up and running really

quickly, using a binary package for installaon will provide a quicker installaon.

Uncompressing the source archive

If we obtained the Squid in a compressed archive format, we must extract it before we can

proceed any further. If we obtained Squid from Launchpad using Bazaar, we don't need

to perform this step.

tar -xvzf squid-3.1.10.tar.gz

tar is a popular command which is used to extract compressed archives of various types.

On the other hand, it can also be used to compress many les into a single archive. The

preceding command will extract the archive to a directory named

squid-3.1.10.

Congure or system check

Congure or system check is the rst step in the compilaon process and is achieved by

running ./configure from the command line. This program probes the system, making

sure that the required packages are installed. This also checks the system capabilies and

collects informaon about the system architecture and default sengs such as, available

le descriptors and so on. Aer collecng all the informaon, this program generates the

makefiles, which are used in the next step to actually compile the Squid source code.

Running

configure without any parameters uses the preset defaults. If we are willing to

change the default Squid sengs or if we want to disable some oponal features that are

enabled by default, or if we want to install Squid in an alternate locaon in the le system,

we need to pass opons to configure. Use the following the command to see the available

opons along with a brief descripon.



Geng Started with Squid

[ 16 ]

Let's run configure with the --help opon to have a look at the available

conguraon opons.

./configure --help | less

This will display the page containing the opons and their brief descripon for configure.

Use up and down arrow keys to navigate through the informaon. Now let's discuss a few of

the commonly used opons with configure:

--prex

The --prefix opon is the most commonly used opon. If we are tesng a new version or

if we wanted to test mulple Squid versions, we will have mulple Squid version installed

on our system. To idenfy the dierent versions and to prevent interference or confusion

between the versions, it's a good idea to install them in separate directories.

For example, for installing Squid version 3.1.10, we can use the directory

/opt/

squid/3.1.10/

and the corresponding configure command will be run as:

./configure --prefix=/opt/squid/3.1.10/

Similarly, for installing Squid version 3.1, we can use the directory /opt/squid/3.1/.

From now onwards, ${prefix} will represent the locaon

where we have installed Squid, that is, the directory name used

with the --prefix opon while running configure, as shown

in the previous command.

Squid provides even more control over the locaon of dierent types of les such as

executables and documentaon les. Their placement can be controlled with opons such

as --bindir, --sbindir, and so on. Please check the configure help page for further

details on these opons.

Now, let's check the oponal features and packages. To enable any oponal feature, we pass

an opon in the format

--enable-FEATURE_NAME and to disable a feature, the opon

format is either --disable-FEATURE_NAME or --enable-FEATURE_NAME=no. For

example, icmp is a feature name.

./configure --enable-FEATURE # FEATURE will be enabled

./configure --disable-FEATURE # FEATURE will be disabled

./configure --enable-FEATURE=no # FEATURE will be disabled

Similarly, to compile Squid with an available package, we pass an opon in the format

--with-PACKAGE_NAME and to compile Squid without a package, we pass the opon

--without-PACKAGE_NAME. openssl is an example package name.

Chapter 1

[ 17 ]

--enable-gnuregex

Regular expressions are used for construcng Access Control Lists in Squid. If we are running

a modern Linux/Unix-based operang system, we don't need to worry about this opon. But

if our system doesn't have built-in support for regular expressions, we should enable support

for regular expressions using --enable-gnuregex.

--disable-inline

Squid has a lot of code that can be inlined, which is good for producon use. But inline code

takes longer to compile and is useful when we need to compile a source only once for seng

up Squid for producon use. This opon is intended to be used during development when

we need to compile Squid me and again.

--disable-optimizations

Squid is, by default, compiled with compiler opmizaons that result in beer performance.

Again this opon should be used while debugging a problem or tesng dierent versions

as it'll reduce compilaon me. The --disable-inline opon is automacally used if we

use this opon.

--enable-storeio

Squid's performance depends heavily on disk I/O performance when disk caching is enabled.

The quicker Squid can read/write les from cache, the lesser me it'll take to sasfy a

request, which in turn will result in smaller delays. Dierent storage techniques may lead to

opmized performance, depending on the trac type and usage. We can use this opon to

build Squid with support for various store I/O modules. Please check the src/fs/ directory

in the Squid source code for available store I/O modules.

./configure --enable-storeio=ufs,aufs,coss,diskd,null

--enable-removal-policies

While using disk caching, we instruct Squid to use a specied disk space for caching web

documents. Over a period of me, the space is consumed and Squid will sll need more

space to cache new documents. Squid then has to decide which old documents should

be removed or purged from the cache to make space for storing the new ones. There are

dierent policies for purging the documents to achieve maximum benets from caching.

The policies are based on heap and list data structures. List data structure is enabled by

default. Please check the

src/repl/ directory in the Squid source code for available

removal policies.

./configure --enable-removal-policies=heap,lru

Geng Started with Squid

[ 18 ]

--enable-icmp

This opon is helpful in determining the distance from other cache peers and remote

servers to esmate approximate latency. This is useful only if we have other cache peers

in the network.

--enable-delay-pools

Squid uses delay pools to limit or control bandwidth that can be used by a client or a group

of clients. Delay pools are like leaky buckets which leak data (web trac) to clients and are

relled at a controlled rate. These come in handy when we need to control the bandwidth

used by a group of users.

--enable-esi

This opon enables Squid to use Edge Side Includes (see http://www.esi.org for more

informaon). If this is enabled, Squid completely ignores cache-control headers from clients.

This opon is only intended to be used when Squid is used in accelerator mode.

--enable-useragent-log

This provides the capability of logging user agent headers from HTTP requests by clients.

--enable-referer-log

If we enable this opon, Squid will be able to write a referer header eld from

HTTP requests.

--disable-wccp

This opon disables support for Cisco's Web Cache Communicaon Protocol (WCCP).

WCCP enables communicaon between caches, which in turn helps in localizing the trac.

By default, WCCP-support is enabled.

--disable-wccpv2

Similar to the previous opon, this disables support Cisco's WCCP version 2. WCCPv2

is an improved version of WCCP and has built-in support for load balancing, scaling,

fault-tolerance, and service assurance mechanisms. By default, WCCPv2 support is enabled.

--disable-snmp

In Squid versions 3.x, SNMP (Simple Network Management Protocol) is enabled by

default. SNMP is quite popular among system administrators for monitoring servers

and network devices.

Chapter 1

[ 19 ]

--enable-cachemgr-hostname

Cache Manager (cachemgr.cgi) is a CGI ulity to manage Squid's cache and view cache

stascs using a web interface. The host name for accessing cache manager can be set using

this opon. By default, we can access cache manager web interface using

localhost or the

IP address of the Squid server.

./configure --enable-cachemgr-hostname=squidproxy.example.com

--enable-arp-acl

Squid supports building Access Control Lists based on MAC (or Ethernet) addresses.

This feature is disabled by default. If we want to control client access based on Ethernet

addresses, we should enable this feature. Enabling this is a good idea while learning Squid.

This opon will be replaced by --enable-eui which is enabled by default.

--disable-htcp

Hypertext Caching Protocol (HTCP) can be used by Squid to send and receive cache digests

to neighboring caches. This opon disables HTCP support.

--enable-ssl

Squid can terminate SSL connecons. When Squid is congured in reverse proxy mode,

Squid can terminate the SSL connecons iniated by clients and handle it on behalf of the

web server in the backend. This essenally means that the backend web server will not

have to do any SSL work, which means signicant computaon savings. In this case, the

communicaon between Squid and the backend web server will be pure HTTP, but clients

will sll see it as a secure connecon with the web server. This is useful only when Squid is

congured to work in accelerator or reverse proxy mode.

--enable-cache-digests

Cache digests are Squid's way of sharing informaon with neighboring Squid servers about

the cached web documents, in a compressed format.

--enable-default-err-language

Whenever Squid encounters an error (for example, a page not found, access denied, or

network unreachable error) that should be conveyed to the client, Squid uses default pages

for showing these errors. The error pages are available in local languages. This opon can be

used to specify the default language for all the error pages. The default language for error

pages is English.

./configure --enable-default-err-language=Spanish

Geng Started with Squid

[ 20 ]

--enable-err-languages

By default, Squid builds support for all available languages. If we only want to build Squid

with languages which we are familiar with, we can use this opon. Please check the

errors/ directory in the Squid source code for available languages.

./configure --enable-err-languages='English French German'

--disable-http-violations

Squid has conguraon opons, and by using them, we can force Squid to violate HTTP

protocol standards by replacing header elds in HTTP requests or responses. Tinkering with

HTTP headers is against standard HTTP norms. We can disable support for all sorts of HTTP

violaons by using this opon.

--enable-ipfw-transparent

IPFIREWALL (IPFW) is a rewall applicaon for the FreeBSD system maintained by FreeBSD

sta and volunteers. This opon is useful while seng up Transparent Proxy Server on

systems with IPFW. If our system doesn't have IPFW, we should avoid using this opon,

because Squid will fail to compile. The default behavior is auto-detect, which does the job

quite well.

--enable-ipf-transparent

IPFilter (IPF) is also a stateful rewall for many Unix-like operang systems. It is provided by

NetBSD, Solaris, and so on. If our system has IPF, then we should enable this opon to be

able to congure Squid in Transparent mode. Enabling this opon in the absence of IPF on

the system will result in compile errors.

--enable-pf-transparent

Packet Filter (PF) is yet another stateful rewall applicaon originally developed for

OpenBSD. This opon is useful on systems with PF installed to achieve Transparent

Proxy mode. Do not enable this opon if PF is not installed.

--enable-linux-netiter

Neilter is the packet ltering framework in Linux kernels in series 2.4.x and 2.6.x. This

opon is useful for enabling Transparent Proxy support on Linux-based operang systems.

--enable-follow-x-forwarded-for

When a HTTP request is forwarded by a proxy, the proxy writes essenal informaon about

itself and the client for which the request is being forwarded, in HTTP headers. This opon

enables Squid to try to lookup the IP address of the original client for which the request was

forwarded through one or more proxy servers.

Chapter 1

[ 21 ]

--disable-ident-lookups

This prevents Squid from performing ident lookups or idenfying a username for every

connecon. Disabling this may prevent our system from a possible Denial of Service

aack by a malicious client requesng a large number of connecons.

--disable-internal-dns

Squid has its own implementaon of DNS protocol and is capable of building DNS queries. If

we want to use Squid's internal DNS, then we should not disable it. Otherwise, we can disable

support for Squid's internal DNS feature by using this opon and can use external DNS servers.

--enable-default-hostsle

Using this opon, we can select the default locaon of the hosts le. On most operang

systems, it's located in the /etc/hosts directory.

./configure --enable-default-hostsfile=/some/other/location/hosts

--enable-auth

Squid supports various authencaon mechanisms. This opon enables support for

authencaon schemes. This congure opon (and related enable auth opons) are

undergoing change.

Old Syntax

Previously, this opon was used to enable authencaon support and a list of authencaon

schemes was also passed. The authencaon schemes from the list were then built

during compilaon.

./configure --enable-auth=basic,digest,ntlm

New Syntax

Now, this opon is used only to enable global support for authencaon and a list of

authencaon schemes is not passed along. The authencaon scheme is enabled with the

opon --enable-auth-AUTHENTICATION_SCHEME where AUTHENTICATION_SCHEME

is the name of the authencaon scheme. By default, all the authencaon schemes

are enabled and the corresponding authencaon helpers are built during compilaon.

Authencaon helpers are external programs that can authencate clients using various

authencaon mechanisms, against dierent user databases.

./configure --enable-auth

Geng Started with Squid

[ 22 ]

--enable-auth-basic

This opon enables support for a Basic Authencaon scheme and builds the list of helpers

specied. If the list of helpers is not provided, this will enable all the possible helpers. A list

of available helpers for this scheme can be found in the

helpers/basic_auth/ directory

in the Squid source code. To disable this authencaon scheme, we can use

--disable-

auth-basic

./configure --enable-auth-basic=PAM,NCSA,LDAP

If we want to enable this opon but don't want to build any helpers, we should use "none"

in place of a list of helpers.

./configure --enable-auth-basic=none

Previously, this opon was known as --enable-basic-auth-helpers. The list of helpers

is passed in a similar way.

./configure --enable-basic-auth-helpers=PAM,NCSA,LDAP

The old and new opon syntax for all other authencaon schemes

are similar.

--enable-auth-ntlm

Squid support for the NTLM authencaon scheme is enabled with this opon. The available

helpers for this scheme reside in the helpers/ntlm_auth/ directory in the Squid source

code. To disable NTLM authencaon scheme support, use the

--disable-auth-ntlm

opon.

./configure --enable-auth-ntlm=smb_lm,no_check

--enable-auth-negotiate

This opon enables the Negoate Authencaon scheme. Details and syntax are similar

to the above authencaon scheme opon.

./configure --enable-auth-negotiate=kerberos

--enable-auth-digest

This opon enables support for Digest Authencaon scheme. Other details are similar

to the above opon.

--enable-ntlm-fail-open

If this opon is enabled and a helper fails while authencang a user, it can sll allow

Squid to authencate the user. This opon should be used with care as it may lead

to security loopholes.

Chapter 1

[ 23 ]

--enable-external-acl-helpers

Squid supports external ACLs using helpers. If we are willing to use external ACLs, we should

consider using this opon. We can also use this opon while learning. A list of external

ACL helpers should be passed to build specic helpers. The default behavior is to build

all the available helpers. A list of available external ACL helpers can be found in the

helpers/external_acl/ directory in the Squid source code.

./configure --enable-external-acl-helpers=unix_group,ldap_group

--disable-translation

By default, Squid tries to present error and manual pages in a local language. If we don't

want this to happen, then we may use this opon.

--disable-auto-locale

Based on a client's request headers, Squid tries to automacally provide localized error

pages. We can use this opon to disable the automac localizaon. The error_directory

tag in the Squid conguraon le must be congured if we use this opon.

--disable-unlinkd

unlinkd is an external process which is used to make unlink system calls. This opon

disables unlinkd support in Squid. Disabling unlinkd is not a good idea as the unlink

system call can block a process for a considerable amount of me, which can cause a delay

in responses.

--with-default-user

We normally don't want to run Squid as the root user to omit any security risks. By default,

Squid runs as the user nobody. However, if we have installed Squid from a pre-compiled

binary, Squid may run as a 'squid' or 'proxy' user depending on the operang system we

are using. Using this opon, we can set the default user for running Squid. See the following

example of how to use this opon:

./configure --with-default-user=squid

--with-logdir

By default, Squid writes all logs and error reports to designated les in ${prefix}/var/

logs/

. This locaon is dierent from the locaon used by all other processes and daemons

to write their logs. In order to get quick access to the Squid logs, we may want to place them

in the default system log directory, which is /var/log/ in most of the Linux-based

operang systems. See the following example of the syntax to achieve this:

./configure --with-logdir=/var/log/squid/

Geng Started with Squid

[ 24 ]

--with-pidle

The default locaon for storing the Squid PID le is ${prefix}/var/run/squid.

pid

, which is not the standard system locaon for storing PID les. On most Linux-based

operang systems, the PID les are stored in /var/run/. So, we may want to change the

default PIDle locaon using the following opon:

./configure --with-pidfile=/var/run/squid.pid

--with-aufs-threads

Using this opon, we can specify the number of threads to use when the aufs storage

system is used for managing the cache directories. If this opon is not used, Squid

automacally calculates the number of threads that should be used:

./configure --with-aufs-threads=12

--without-pthreads

Older versions of Squid were built without POSIX threads support. Now, Squid is built with

pthreads support by default, therefore, if we don't want to enable pthreads support,

we'll have to explicitly disable it.

--with-openssl

If we want to build Squid with OpenSSL support, we can use this opon to specify the

OpenSSL installaon path, if it's not installed in the default locaon:

./configure --with-openssl=/opt/openssl/

--with-large-les

Under heavy trac, Squid's log les (especially the access log) grow quickly and in the long

run the le size may become quite large. We can use this opon to enable support for large

log les.

For beer performance, it is good pracce to rotate log les frequently

instead of going with large les.

--with-ledescriptors

Operang systems use le descriptors (basically integers) to track the open les and sockets.

By default, there is a limit on the number of le descriptors a user can use (normally 1024).

Once Squid has accepted connecons which have consumed all the available le descriptors

to the Squid user, it can't accept more connecons unless some of the le descriptors

are released.

Chapter 1

[ 25 ]

Under heavy load, Squid frequently runs out of le descriptors. We can use the following

opon to overcome the le descriptor shortage problem:

./configure --with-filedescriptors=8192

We also need to increase the system-wide limit on the number of le

descriptors available to a user.

Have a go hero – le descriptors

Find out the maximum number of available le descriptors for your user. Also, write down

the commands that will set the maximum available le descriptors limit to 8192.

Soluon: To check the available le-descriptors use the following command:

ulimit -n

To set the le descriptor limit to 8192, we need to append the following lines to

/etc/security/limits.conf:

username hard nofile 8192

username soft nofile 8192

The preceding acon can be performed only with root or super user privileges.

Time for action – running the congure command

Now that we have had a brief look at several of the available opons, we can layout the

opons for the environment for which we are building Squid. Now, we are ready to run

the configure command with the following opons:

./configure --prefix=/opt/squid/ --with-logdir=/var/log/squid/ --with-

pidfile=/var/run/squid.pid --enable-storeio=ufs,aufs --enable-removal-

policies=lru,heap --enable-icmp --enable-useragent-log --enable-referer-

log --enable-cache-digests --with-large-files

The preceding command will run for a while, probing the system for various capabilies

and making decisions on the basis of the available libraries and modules. The configure

writes debugging output to the config.log le in the same directory. It is always wise

to check the config.log for any errors which may have occurred while running the

configure command.

If everything goes ne,

configure will generate the makefiles in several directories

which will be required for compiling the source code in the next step.

Geng Started with Squid

[ 26 ]

What just happened?

Running the configure program with the opons menoned in the previous code example,

will generate the makefiles needed to compile the Squid source code and source code of

the modules enabled. It will also generate the config.log and config.status les. All the

messages which are generated during the running of the configure program are logged to

the config.log le. The config.status le is an executable which can be run to recreate

the makefiles.

Have a go hero – debugging congure errors

In the Squid source directory, run the configure command, as shown in the following code:

./configure --enable-storeio='aufs,disk'

Now try to check what went wrong and x the errors.

Time for action – compiling the source

Aer specifying our environment and building the requirements, we need to do the actual

compilaon. Compiling source code is very easy and is a maer of just one command:

make

We do not need to be the root or super user to execute this command. This command may

take a considerable amount of me to execute, depending on the system hardware. Running

make will produce a lot of output in the terminal. It may also produce a lot of compiler

warnings which can safely be ignored in most cases.

make ends with errors, we should check Squid bugzilla for similar problems. We can update

an exisng bug with our error report or create a new bug report if there is no similar bug

already. For details on troubleshoong and compleng bug reports, please refer to

Chapter 12, Troubleshoong Squid.

make ends without any errors, we can quickly proceed to the installaon phase. We can

also run make again to verify that everything is compiled successfully. Running make again

should produce a lot of lines similar to the following:

Making all in compat

make[1]: Entering directory '/home/user/squid-source/compat'

make[1]: Nothing to be done for 'all'.

make[1]: Leaving directory '/home/user/squid-source/compat'

What just happened?

We have just run the make command that will compile the source code of Squid and related

modules, to generate executables, if it nishes without errors. The generated executables

are ready to be installed now.

Chapter 1

[ 27 ]

Time for action – installing Squid

The successful compilaon of the source code in the previous secon will generate the

required programs depending on the features and packages we have enabled or disabled.

However, they should be moved to their designated locaons, so that they can be used.

Let's perform the nal steps of the installaon.

1. Depending on the ${prefix}, we may need root or super user privileges for

installing Squid. If root or super user privileges are needed, we should rst switch

to root or super user by using the following command:

2. Now all we need to do is to run the make command with install as the argument:

make install

This will install or simply move programs to their designated locaons, depending on

the path used with the --prefix opon while running the configure program.

What just happened?

We just learned how to perform the nal step in installing Squid, which is to place the

generated programs and other essenal les in their designated locaons.

Time for action – exploring Squid les

Let's have a look at the les and directories generated during installaon. The easiest way

to checkout the directories and les generated is to use the tree command. If the tree

command is not available, we can list les using the ls command as well.

tree ${prefix} | less

${prefix} is the directory used with the --prefix opon for configure. Now let's

have a brief overview at the important les generated by Squid during installaon. All

of the following directories and les listed, reside in ${prefix}:

bin

This directory contains programs which can be executed or run by a user without root or

super user privileges.

bin/squidclient

squidclient is a HTTP client with advanced capabilies, which allow it to nker HTTP

requests to test the Squid server. Run squidclient to checkout the available opons:

${prefix}/bin/squidclient

Geng Started with Squid

[ 28 ]

etc

This is the place where for all the conguraon les related to Squid are located.

It's a good idea to use the --sysconfdir=/etc/squid/ opon

with configure, so that you can share the conguraon across

dierent Squid installaons while tesng.

etc/squid.conf

This is the default locaon for the Squid conguraon le. The squid.conf generated

during installaon is the bare minimum conguraon required for Squid to be used.

We always make changes to this le if we need to alter the Squid conguraon.

etc/squid.conf.default

Squid generates this default conguraon le so that we can copy and rename it to

squid.conf and start afresh.

etc/squid.conf.documented

This is a fully documented version of squid.conf, containing thousands of lines of

comments. We should always refer to this le for the available conguraon tags for

the version of Squid when we have installed.

libexec

This directory contains helper programs built during Squid compilaon.

libexec/cachemgr.cgi

This CGI program provides a web interface for managing the Squid cache called

Cache Manager.

sbin

This directory contains programs which can only be executed by a user with root or super

user privileges.

sbin/squid

This is the actual Squid program, which is generally run as a daemon.

This is the locaon for error page templates, documentaon, and other les used by Squid.

share/errors

This directory contains localized error page templates. The templates are HTML pages and we

can customize the error messages displayed by Squid, by modifying these HTML templates.

Chapter 1

[ 29 ]

share/icons

This directory contains a number of small images used for FTP or gopher directory lisng.

share/man

This is the place where the man pages for squid, squidclient, and helpers are built during

compilaon. Man pages are manual or help pages which can be viewed using the command

man (available on all Linux/Unix distribuons). To view a man page located at /opt/squid/

share/man/man8/squid.8

, we can use the man command as follows:

man /opt/squid/share/man/man8/squid.8

For more details about man pages, please visit

http://en.wikipedia.org/wiki/Man_page.

var

A place for les that change frequently while Squid is running.

var/cache

This is the default directory for storing the cached web documents on a hard disk.

var/logs

This is the default home for all the log les (such as cache.log, access.log, and so on)

used by Squid.

What just happened?

We have just looked at the various les and directories generated during installaon and

a had brief overview of what each directory contains.

Installing Squid from binary packages

Squid binary packages are available in the soware repositories of most operang systems

and we can install them by using the package managers provided by the respecve operang

systems. Next, we'll see how to use a package manager on a few operang systems to

install Squid.

The latest or beta versions may not be available in soware repositories of all

the operang systems. In such cases, we should get the latest or beta versions

from the Squid website, as explained earlier in this chapter.

Geng Started with Squid

[ 30 ]

Fedora, CentOS or Red Hat

Yum is a popular package manager on RPM-based operang systems. Squid RPM is available

in the Fedora, CentOS, and Red Hat repositories. To install Squid, we can simply use the

following command:

yum install squid

Debian or Ubuntu

We can use apt-get to install Squid on Debian or Ubuntu:

apt-get install squid3

FreeBSD

Squid is available in the FreeBSD ports collecon. The following command can be used to

install Squid on FreeBSD:

pkg_add -r squid31

For more informaon on package management in FreeBSD, please go to

http://www.freebsd.org/doc/handbook/packages-using.html.

OpenBSD or NetBSD

Installing Squid on OpenBSD or NetBSD is similar to installing it on FreeBSD and can be

performed using the following command:

pkd_add squid31

To learn more about the package management system in OpenBSD and NetBSD, please

refer to http://www.openbsd.org/ports.html#Get and http://www.netbsd.org/

docs/pkgsrc/using.html#installing-binary-packages

respecvely.

Dragony BSD

To install Squid on Dragony BSD, we can use the following command:

pkg_radd squid31

For more informaon on installing binary packages on Dragony BSD, please visit

http://www.dragonflybsd.org/docs/newhandbook/pkgsrc/.

Gentoo

We can install Squid on Gentoo Linux using emerge, as shown next:

emerge =squid-3.1*

Chapter 1

[ 31 ]

Arch Linux

To install Squid on Arch Linux, we can use the package manager pacman, as shown in the

following command:

pacman -S squid

For more informaon on pacman, please visit

https://wiki.archlinux.org/index.php/Pacman.

Pop quiz

1. Which of the following web documents can't be cached by a proxy server?

a. A HTML page

b. A JPEG image

c. A PHP script that produces output based on a client's IP Address

d. A JavaScript le

2. In which of the following scenarios, should we worry about the

--enable-diskio opon?

a. Caching in RAM (main memory) is enabled

b. Caching on hard disk is enabled

c. Caching is disabled

d. None of the above

3. When does a removal policy selecon aect the overall Squid performance?

a. If caching is disabled

b. If caching on the hard disk and RAM is enabled

c. A removal policy selecon is not related to caching

d. A removal policy doesn't aect overall Squid performance

Geng Started with Squid

[ 32 ]

Summary

We learned about proxy servers and web caching in general and the ways in which they

can be useful, especially for saving bandwidth and improving end user experience. Then we

moved on to exploring Squid, which is a powerful caching proxy server. The following are the

important things that we learned in this chapter:

Various ways to grab Squid for producon use or development

Meaning of various congure opons

Compiling Squid source code

Installing Squid from source and binary package

Pros and cons of compiling Squid from source

We also discussed about the directory structure and les generated by Squid

during installaon.

Now that we know how to install Squid, we are ready to learn how to congure Squid

according to requirements for a given network environment. We'll learn about this with

a few examples in the next chapter.



Conguring Squid

We have learned about compiling Squid source code and installing Squid from

a source and binary package. In this chapter, we are going to learn to congure

Squid according to the requirements of a given network. We will learn about

the general syntax used for a Squid conguraon le and then we will move on

to exploring the dierent opons available to ne tune Squid. There will be a

few opons which we will only cover briey but there will be chapters dedicated

to them while we will explore other opons in detail.

In this chapter, we will cover the following:

Quick exposure to Squid

Syntax of the conguraon le

HTTP port, the most important conguraon direcve

Access Control Lists (ACLs)

Controlling access to various components of Squid

Cache peers or neighbors

Caching the web documents in the main memory and hard disk

Tuning Squid to enhance bandwidth savings and reduce latency

Modifying the HTTP headers accompanied with requests and responses

Conguring Squid to use DNS servers

A few direcves related to logging

Other important or commonly used conguraon direcves

So let's get started.



Conguring Squid

[ 34 ]

Quick start

Before we explore a conguraon le in detail, let's have a look at the minimal

conguraon that you will need to get started. Get ready with the conguraon le located

at /opt/squid/etc/squid.conf, as we are going to make the changes and addions

necessary to quickly set up a minimal proxy server.

cache_dir ufs /opt/squid/var/cache/ 500 16 256

acl my_machine src 192.0.2.21 # Replace with your IP address

http_access allow my_machine

We should add the previous lines at the top of our current conguraon le (ensuring that

we change the IP address accordingly). Now, we need to create the cache directories. We

can do that by using the following command:

$ /opt/squid/sbin/squid -z

We are now ready to run our proxy server, and this can be done by running the

following command:

$ /opt/squid/sbin/squid

Squid will start listening on port 3128 (default) on all network interfaces on our machine.

Now we can congure our browser to use Squid as an HTTP proxy server with the host as

the IP address of our machine and port 3128.

Once the browser is congured, try browsing to

http://www.example.com/.

That's it! We have congured Squid as an HTTP proxy server! Now try to browse to

http://www.example.com:897/ and observe the message you receive. The message

shown is an access denied message sent to you by Squid.

Now, let's move on to understanding the conguraon le in detail.

Syntax of the conguration le

Squid's conguraon le can normally be found at /etc/squid/squid.conf,

/usr/local/squid/etc/squid.conf, or ${prefix}/etc/squid.conf where

${prefix} is the value passed to the --prefix opon, which is passed to the

configure command before compiling Squid.

In the newer versions of Squid, a documented version of

squid.conf, known as squid.

conf.documented

, can be found along side squid.conf. In this chapter, we'll cover some

of the import direcves available in the conguraon le. For a detailed descripon of all the

direcves used in the conguraon le, please check http://www.squid-cache.org/

Doc/config/

Chapter 2

[ 35 ]

The syntax for Squid's documented conguraon le is similar to many other programs for

Linux/Unix. Generally, there are a few lines of comments containing useful related

documentaon before every direcve used in the conguraon le. This makes it easier to

understand and congure direcves, even for people who are not familiar with conguring

applicaons using conguraon les. Normally, we just need to read the comments and use

the appropriate opons available for a parcular direcve.

The lines beginning with the character

# are treated as comments and are completely ignored

by Squid while parsing the conguraon le. Addionally, any blank lines are also ignored.

# Test comment. This and the above blank line will be ignored by

Squid.

Let's see a snippet from the documented conguraon le (squid.conf.documented)

# TAG: cache_effective_user

# If you start Squid as root, it will change its effective/real

# UID/GID to the user specified below. The default is to change

# to UID of nobody.

# see also; cache_effective_group

#Default:

# cache_effective_user nobody

In the previous snippet, the rst line menons the name of the direcve, that is in this case,

cache_effective_user. The lines following the tag line provide brief informaon about

the usage of a direcve. The last line shows the default value for the direcve, if none

is specied.

Types of directives

Now, let's have a brief look at the dierent types of direcves and the values that can

be specied.

Single valued directives

These are direcves which take only one value. These direcves should not be used mulple

mes in the conguraon le because the last occurrence of the direcve will override all

the previous declaraons. For example, logfile_rotate should be specied only once.

logfile_rotate 10

# Few lines containing other configuration directives

logfile_rotate 5

In this case, ve logfile rotaons will be made when we trigger Squid to rotate logfiles.

Conguring Squid

[ 36 ]

Boolean-valued or toggle directives

These are also single valued direcves, but these direcves are generally used to toggle

features on or o.

query_icmp on

log_icp_queries off

url_rewrite_bypass off

We use these direcves when we need to change the default behavior.

Multi-valued directives

Direcves of this type generally take one or more than one value. We can either specify all the

values on a single line aer the direcve or we can write them on mulple lines with a direcve

repeated every me. All the values for a direcve are aggregated from dierent lines:

hostname_aliases proxy.exmaple.com squid.example.com

Oponally, we can pass them on separate lines as follows:

dns_nameservers proxy.example.com

dns_nameservers squid.example.com

Both the previous code snippets will instruct Squid to use proxy.example.com and

squid.example.com as aliases for the hostname of our proxy server.

Directives with time as a value

There are a few direcves which take values with me as the unit. Squid understands the

words seconds, minutes, hours, and so on, and these can be suxed to numerical values

to specify actual values. For example:

request_timeout 3 hours

persistent_request_timeout 2 minutes

Directives with le or memory size as values

The values passed to these direcves are generally suxed with le or memory size units like

bytes, KB, MB, or GB. For example:

reply_body_max_size 10 MB

cache_mem 512 MB

maximum_object_in_memory 8192 KB

As we are familiar with the conguraon le syntax now, let's open the squid.conf le and

learn about the frequently used direcves.

Chapter 2

[ 37 ]

Have a go hero – categorize the directives

Open the documented Squid conguraon le and nd out at least three direcves of each

type that we discussed before. Don't use the direcves already used in the examples.

HTTP port

This direcve is used to specify the port where Squid will listen for client connecons. The

default behavior is to listen on port 3128 on all the available interfaces on a machine.

Time for action – setting the HTTP port

Now, we'll see the various ways to set the HTTP port in the squid.conf le:

In its simplest form, we just specify the port on which we want Squid to listen:

http_port 8080

We can also specify the IP address and port combinaon on which we want Squid

to listen. We normally use this approach when we have mulple interfaces on our

machine and we want Squid to listen only on the interface connected to local area

network (LAN):

http_port 192.0.2.25:3128

This will instruct Squid to listen on port 3128 on the interface with the IP address as

192.0.2.25.

Another form in which we can specify

http_port is by using hostname and

port combinaon:

http_port myproxy.example.com:8080

The hostname will be translated to an IP address by Squid and then Squid will listen

on port 8080 on that parcular IP address.

Another aspect of this direcve is that, it can take mulple values on separate lines.

Let's see what the following lines will do:

http_port 192.0.2.25:8080

http_port lan1.example.com:3128

http_port lan2.example.com:8081

These lines will trigger Squid to listen on three dierent IP addresses and port

combinaons. This is generally helpful when we have clients in dierent LANs,

which are congured to use dierent ports for the proxy server.



Conguring Squid

[ 38 ]

In the newer versions of Squid, we may also specify the mode of operaon such as

intercept, tproxy, accel, and so on.

Intercept mode will support the intercepon of requests without needing to

congure the client machines. We'll learn more about intercepon proxy

servers in Chapter 10, Squid in Intercept Mode.

http_port 3128 intercept

tproxy mode is used to enable Linux Transparent Proxy support for spoong

outgoing connecons using the client's IP address.

http_port 8080 tproxy

We should note that enabling intercept or tproxy mode

disables any congured authencaon mechanism. Also, IPv6 is

supported for tproxy but requires very recent kernel versions. IPv6

is not supported in the intercept mode.

Accelerator mode is enabled using the mode accel. It's a good idea to listen on

port 80, if we are conguring Squid in accelerator mode. This mode can't be used as

it is. We must specify at least one website we want to accelerate. We'll learn more

about the accelerator mode in Chapter 9, Squid in Accelerator Mode.

http_port 80 accel defaultsite=website.example.com

We should set the HTTP port carefully as the standard ports like 3128

or 8080 can pose a security risk if we don't secure the port properly.

If we don't want to spend me on securing the port, we can use any

arbitrary port number above 10000.

What just happened?

In this secon, we learned about the usage of one of the most important direcves, namely,

http_port. We have learned about the various ways in which we can specify HTTP port,

depending on the requirement. We can force Squid to listen on mulple interfaces and on

dierent ports, on dierent interfaces.

Access control lists

Access Control Lists (ACLs) are the base elements for access control and are normally

used in combinaon with other direcves such as http_access, icp_access, and so

on, to control access to various Squid components and web resources. ACLs idenfy a web

transacon and then direcves such as http_access, cache, and then decides whether

the transacon should be allowed or not. Also, we should note that the direcves related

to accessing resources generally end with _access.



Chapter 2

[ 39 ]

Every access control list denion must have a name and type, followed by the values for

that parcular ACL type:

acl ACL_NAME ACL_TYPE value

acl ACL_NAME ACL_TYPE "/path/to/filename"

The values for any ACL name can either be specied directly aer ACL_TYPE or Squid can

read them from a separate le. Here we should note that the values in the le should be

wrien as one value per line.

Time for action – constructing simple ACLs

Let's construct an access control list for the domain name example.com:

acl example_site dstdomain example.com

In this code, example_site is the name of the ACL with type dstdomain, which reects

that the value, example.com, is the domain name.

Now if we want to construct an access control list which can cover a lot of example websites,

we have the following three possible ways of doing it:

1. Values on a single line: We can specify all the possible values on a single line:

acl example_sites dstdomain example.com example.net example.org

This works ne as long as there are only a few values.

2. Values on mulple lines: In case the list of values that we want to specify grows

signicantly, we can split the list and pass values on mulple lines:

acl example_sites dstdomain example.com example.net

acl example_sites dstdomain example.org

3. Values from a le: If case the number of values we want to specify is quite large, we

can put them in a dedicated le and then instruct Squid to read the values from that

parcular le:

acl example_sites dstdomain '/opt/squid/etc/example_sites.txt'

We can place the example_sites.txt le in the same directory as squid.conf

so that it's easy to locate. The contents of the example_sites.txt le should be

as follows:

# This file can also have comments

# Write one value (domain name) per line

example.net

example.org # Temporarily remove example.org from example_sites

acl

example.com

Conguring Squid

[ 40 ]

ACL names are case-insensive and are mul-valued. So we can use them, mulple mes,

and the values will aggregate:

acl NiCe_NaMe src 192.0.2.21

acl nIcE_nAmE src 192.0.2.23

This code doesn't represent two dierent access control lists. It's just one ACL with two

addresses, namely, 192.0.2.21 and 192.0.2.23, as values.

We should carefully note that one ACL name can't be used with more than one

ACL type.

acl destination dstdomain example.com

acl destination dst 192.0.2.24

The above code is invalid as it uses ACL name destination across two

dierent ACL types.

The previous examples of access lists are very basic and are simply to get us started. We'll

explore access lists and controls in Chapter 4, Geng Started with Squid's Powerful ACLs

and Access Rules.

What just happened?

We have just learned to create some simple ACLs of the ACL type dstdomain, which

idenes the desnaon domain in a request.

Have a go hero – understanding the pre-dened ACLs

Jump to the ACL secon in the Squid conguraon le and try to understand the ACLs

provided by Squid, by default.

Controlling access to the proxy server

While Squid is running on our server, it can be accessed in several ways for example, via

normal web browsing by end users or as a parent or sibling proxy server by neighboring

proxy servers. Squid provides various direcves to control access to dierent resources.

Next, we'll learn about granng or revoking access to dierent resources.

HTTP access control

ACLs help only in idenfying requests based on dierent rules. ACLs are of no use by

themselves, they should be combined with access control direcves to allow or deny access

to various resources. http_access is one such direcve which is used to grant access to

perform HTTP transacons through Squid.

Chapter 2

[ 41 ]

Let's have a look at the syntax of http_access:

http_access allow|deny [!]ACL_NAME

Using http_access, we can either allow or deny access to the HTTP transacons through

Squid. The

ACL_NAME in the code signies the requests for which the access must be granted

or revoked. If a bang (

!) is prexed to the ACL_NAME, the access will be granted or revoked

for all the requests that are not idened by

ACL_NAME.

Time for action – combining ACLs and HTTP access

Let's have a look at a few cases for controlling HTTP access using example ACLs. When we

have mulple access rules, Squid matches a parcular request against them from top to

boom and keeps doing so unl a denite acon (allow or deny) is determined. Please

note that if we have mulple ACLs within a single access rule, then a request is matched

against all the ACLs from le to right, and Squid stops processing the rule as soon as it

encounters an ACL that can't idenfy the request. An access rule with mulple ACLs results

in a denite acon, only if the request is idened by all the ACLs used in the rule.

acl my_home_machine src 192.0.2.21

acl my_lab_machine src 198.51.100.86

http_access allow my_home_machine

http_access allow my_lab_machine

The ACLs and access rules in the previous code will allow hosts 192.0.2.21 and

198.51.100.86 to access the proxy server. The aforemenoned access rules may

also be wrien as:

acl my_machines src 192.0.2.21 198.51.100.86

http_access allow my_machines

Default behavior is to allow access to all the clients in a local area network and deny access

to all the other clients. If we want clients (who are not in our local area network) to be able

to use our proxy server, we must add addional access rules to allow them.

The default behavior of HTTP access control is a bit tricky if access for a client

can't be idened by any of the access rules. In such cases, the default behavior

is to do the opposite of the last access rule. If last access rule is deny, then the

acon will be to allow access and vice-versa. Therefore, to avoid any confusion

or undesired behavior, it's a good pracce to add a deny all line aer the

access rules.

http_access deny all

Conguring Squid

[ 42 ]

The parameter all is a special ACL element provided by Squid and it represents all the

IP addresses. This line will deny access to everything. As this goes aer all other access rules,

requests from unknown clients will be denied.

What just happened?

We learned to combine ACLs with the http_access direcve to allow or deny access to

clients. We also learned how to group dierent ACLs of the same type and then use them

to control access.

HTTP reply access

HTTP reply is the response received from the web server corresponding to a request iniated

by a client. Using the

http_reply_access direcve, we can control the access to the

replies received. The syntax of http_reply_access is similar to http_access.

http_reply_access allow|deny [!]ACL_NAME

This direcve parally overrides the permissions granted by http_access. Let's see

an example:

acl my_machine src 192.0.2.21

http_access allow my_machine

http_reply_access deny my_machine

We have allowed http_access to host 192.0.2.21 but sll it will not be able to access

the websites properly as it's not allowed to receive any replies. The host can only make

requests to a proxy server for web documents but won't receive any reply.

This direcve is normally used to deny access for content types such as audio, video,

and so on, to prevent users from accessing media content.

We should be really careful while using the http_reply_access direcve.

When a request is allowed by http_access, Squid will contact the original

server, even if a rule with the http_reply_access direcve denies the

response. This may lead to serious security issues. For example, consider a client

receiving a malicious URL, which can submit a client's crical private informaon

using the HTTP POST method. If the client's request passes through http_

access rules but the response is denied by an http_reply_access rule,

then the client will be under the impression that nothing happened but a hacker

will have cleverly stolen our client's private informaon.

Chapter 2

[ 43 ]

ICP access

This direcve is used to control the query access by our neighboring caches using the

Internet Cache Protocol (ICP). It basically allows or denies access to the ICP port. The syntax

is similar to

http_access and the default behavior is to deny all ICP queries.

icp_access allow|deny [!]ACL_NAME

HTCP access

Using this direcve, we can control whether Squid will respond to certain HTCP requests or

not. The syntax is similar to http_access and the default behavior is to deny all queries.

HTCP CLR access

Neighboring caches can make requests to purge or remove cache objects in the form of HTCP

CLR requests. The htcp_clr_access direcve can be used to grant purge access to only

trusted cache peers.

Miss access

This direcve is used to specify which all cache peers or clients can use as their parent cache.

When a cache peer or client tries to fetch content using our proxy server, the request may

result in a MISS (not present in cache) or a HIT (can be sased from our cache). Generally, a

MISS is fetched by our server on behalf of a client or peer. If we don't want our clients or peers

to fetch content using our proxy, then we can use the

miss_access direcve, as shown:

acl bad_clients src 192.0.2.0/24

miss_access deny bad_clients

miss_access allow all

This code will not allow bad_clients to use our proxy server as a parent proxy. The default

behavior is to allow all the clients who pass the http_access rule to use the proxy server

as a parent.

Ident lookup access

This direcve determines whether or not Squid should perform a username lookup for the

client TCP requests.

acl ident_aware_hosts src 192.0.2.0/24

ident_lookup_access allow ident_aware_hosts

ident_lookup_access deny all

This code will allow Squid to perform ident lookups only for ident_aware_hosts. The

default behavior is not to perform ident lookups for all queries.

Conguring Squid

[ 44 ]

Only TCP/IP-based ACLs are supported with this direcve.

Cache peers or neighbors

Cache peers or neighbors are the other proxy servers with which our Squid proxy server can:

Share its cache with to reduce bandwidth usage and access me

Use it as a parent or sibling proxy server to sasfy its clients' requests

Use it as a parent or sibling proxy server

We normally deploy more than one proxy server in the same network to share the load of

a single server for beer performance. The proxy servers can use each other's cache to

retrieve the cached web documents locally to improve performance. Let's have a brief look

at the direcves provided by Squid for communicaon among dierent cache peers.

Declaring cache peers

The direcve cache_peer is used to tell Squid about proxy servers in our neighborhood.

Let's have a quick look at the syntax for this direcve:

cache_peer HOSTNAME_OR_IP_ADDRESS TYPE PROXY_PORT ICP_PORT [OPTIONS]

In this code, HOSTNAME_OR_IP_ADDRESS is the hostname or IP address of the target proxy

server or cache peer. TYPE species the type of the proxy server, which in turn, determines

how that proxy server will be used by our proxy server. The other proxy servers can be used

as a parent, sibling, or a member of a mulcast group.

Time for action – adding a cache peer

Let's add a proxy server (parent.example.com) that will act as a parent proxy to our

proxy server:

cache_peer parent.example.com parent 3128 3130 default proxy-only

3130 is the standard ICP port. If the other proxy server is not using the standard ICP port, we

should change the code accordingly. This code will direct Squid to use parent.example.com

as a proxy server to sasfy client requests in case it's not able to do so itself.



Chapter 2

[ 45 ]

The opon default species that this cache peer should be used as a last resort in

the scenario where other peers can't be contacted. The opon proxy-only species

that the content fetched using this peer should not be cached locally. This is helpful when

we don't want to replicate cached web documents, especially when the two peers are

connected with a high bandwidth backbone.

What just happened?

We added parent.example.com as a cache peer or parent proxy to our Squid proxy server.

We also used the opon proxy-only, which means the requests fetched using this cache

peer will not be cached on our proxy server.

There are several other opons in which you can add cache peers, for various purposes,

such as, a hierarchy. We'll discuss cache peers in detail in Chapter 8, Building a Hierarchy

of Squid Caches.

Quickly restricting access to domains using peers

If we have added a few proxy servers as cache peers to our Squid server, we may have the

desire to have a lile bit of control over the requests being forwarded to the peers. The

direcve cache_peer_domain is a quick way to achieve the desired control. The syntax

of this direcve is quite simple:

cache_peer_domain CACHE_PEER_HOSTNAME [!]DOMAIN1 [[!]DOMAIN2 ...]

In the code, CACHE_PEER_HOSTNAME is the hostname or IP address of the cache peer being

used when declaring it as a cache peer, using the cache_peer direcve. We can specify any

number of domains which may be fetched through this cache peer. Adding a bang (!) as a

prex to the domain name will prevent the use of this cache peer for that parcular domain.

Let's say we want to use the

videoproxy.example.com cache peer for browsing video

portals like Youtube, Nelix, Metacafe, and so on.

cache_peer_domain videoproxy.example.com .youtube.com .netflix.com

cache_peer_domain videoproxy.example.com .metacafe.com

These two lines will congure Squid to use the videoproxy.example.com cache peer for

requests to the domains youtube.com, netflix.com, and metacafe.com only. Requests

to other domains will not be forwarded using this peer.

Conguring Squid

[ 46 ]

Advanced control on access using peers

We just learned about cache_peer_domain, which provides a way to control access using

cache peers. However, it's not really exible in granng or revoking access. That's when

cache_peer_access comes into the picture, which provides a very exible way to control

access using cache peers using ACLs. The syntax and implicaons are similar to other access

direcves such as http_access.

cache_peer_access CACHE_PEER_HOSTNAME allow|deny [!]ACL_NAME

Let's write the following conguraon lines, which will allow only the clients on the network

192.0.2.0/24 to use the cache peer acadproxy.example.com for accessing Youtube,

Nelix, and Metacafe.

acl my_network src 192.0.2.0/24

acl video_sites dstdomain .youtube.com .netflix.com .metacafe.com

cache_peer_access acadproxy.example.com allow my_network video_sites

cache_peer_access acadproxy.example.com deny all

In the same way, we can use other ACL types to achieve beer control over access to various

websites using cache peers.

Caching web documents

All this me, we have been talking about the caching of web documents and how it helps in

saving bandwidth and improving the end user experience, now it's me to learn how and

where Squid actually keeps these cached documents so that they can be served on demand.

Squid uses main memory (RAM) and hard disks for storing or caching the web documents.

Caching is a complex process but Squid handles it beaufully and exposes the direcves

using

squid.conf, so that we can control how much should be cached and what should

be given the highest priority while caching. Let's have a brief look at the caching-related

direcves provided by Squid.

Using main memory (RAM) for caching

The web documents cached in the main memory or RAM can be served very quickly as data

read/write speeds of RAM are very high compared to hard disks with mechanical parts.

However, as the amount of space available in RAM for caching is very low compared to the

cache space available on hard disks, only very popular objects or the documents with a very

high probability of being requested again, are stored in cache space available in RAM.

As the cache space in memory is precious, the documents are stored on a priority basis.

Let's have a look at the dierent types of objects which can be cached.

Chapter 2

[ 47 ]

In-transit objects or current requests

These are the objects related to the current requests and they have the highest priority to be

kept in the cache space in RAM. These objects must be kept in RAM and if there is a situaon

where the incoming request rate is quite high and we are about to overow the cache space

in RAM, Squid will try to keep the served part (the part which has already been sent to the

client) on the disk to create free space in RAM.

Hot or popular objects

These objects or web documents are popular and are requested quite frequently compared to

others. These are stored in the cache space le aer storing the in-transit objects as these have

a lower priority than in-transit objects. These objects are generally pushed to disk when there

is a need to generate more in RAM cache space for storing the in-transit objects.

Negatively cached objects

Negavely cached objects are error messages which Squid has encountered while fetching

a page or web document on behalf of a client. For example, if a request to a web page has

resulted in a HTTP error 404 (page not found), and Squid receives a subsequent request for

the same web page, then Squid will check if the response is sll fresh and will return a reply

from the cache itself. If there is a request for the same page aer the negavely cached

object corresponding to that page has expired, Squid will check again if the page is available.

Negavely cached objects have the same priority as hot or popular objects and they can be

pushed to disk at any me in favor of in-transit objects.

Specifying cache space in RAM

So far we have learned about how the available cache space is ulized for storing or caching

dierent types of objects with dierent priories. Now, it's me to learn about specifying

the amount of RAM space we want to dedicate for caching. While deciding the RAM space

for caching, we should be neither greedy nor paranoid. If we specify a large percentage

of RAM for caching, the overall system performance will suer as the system will start

swapping processes in case there is no free RAM le for other processes. If we use a very

low percentage of RAM for caching, then we'll not be able to take full advantage of Squid's

caching mechanism. The default size of the memory cache is 256 MB.

Conguring Squid

[ 48 ]

Time for action – specifying space for memory caching

We can use extra RAM space available on a running system aer sparing a chunk of memory

that can be ulized by the running process under heavy load. To nd out the amount of free

RAM available on our system, we can use either the

top or free command. To nd out the

free RAM in Megabytes, we can use the

free command as follows:

$ free -m

For more details, please check the top(1) and free(1) man pages.

Now, let's say we have 4 GB of total RAM on the server and all the processes are running

comfortably in 1 GB of RAM space. Aer securing another 512 MB for emergency situaons

where running processes may take extra memory, we can safely allocate 2.5 GB of RAM

for caching.

To specify the cache size in the main memory, we use the direcve

cache_mem. It has a very

simple format. As we have learned before, we can specify the memory size in

bytes, KB, MB,

or GB. Let's specify the cache memory size for the previous example:

cache_mem 2500 MB

The previous value specied with cache_mem is in Megabytes.

What just happened?

We learned about calculang the approximate space in the main memory, which can be used

to cache web documents and therefore enhance the performance of the Squid server by a

signicant margin.

Have a go hero – calculating cache_mem for your machine

Note down the total RAM on your machine and calculate the approximate space in

megabytes that you can allocate for memory caching.

Maximum object size in memory

As we have limited space in memory available for caching objects, we need to use the space

in an opmized way. We should plan to set this a bit low, as seng it to a too larger size will

mean that there will be a lesser number of cached objects in the memory and the HIT (being

found in cache) rate will suer signicantly. The default maximum size used by Squid is 512

KB, but we can change it depending on our value for cache_mem. So, if we want to set it to

1 MB, as we have a lot of RAM available for caching (as in the previous example), we can use

the maximum_object_size_in_memory direcve as follows:

maximum_object_size_in_memory 1 MB

This command will set the allowed maximum object size in memory cache to 1 MB.

Chapter 2

[ 49 ]

Memory cache mode

With the newer versions of Squid, we can control which objects we want to keep in the

memory cache for opmizing the performance. Squid oers the direcve memory_cache_

mode

to set the mode that Squid should use to ulize the space available in memory cache.

There are three dierent modes available:

Mode Descripon

always

The mode always is used to keep all the most recently fetched objects that can t

in the available space. This is the default mode used by Squid.

disk

When the disk mode is set, only the objects which are already cached on a hard

disk and have received a HIT (meaning they were requested subsequently aer being

cached), will be stored in the memory cache.

network

Only the objects which have been fetched from the network (including neighbors)

are kept in the memory cache, if the network mode is set.

Seng the mode is easy and can be set using the memory_cache_mode direcve as shown:

memory_cache_mode always

This conguraon line will set memory cache mode to always; this means that most

recently fetched objects will be kept in the memory.

Using hard disks for caching

In the previous secon, we learned about using the main memory for caching various

types of objects or web documents to reduce bandwidth usage and enhance the end user

experience. However, as the space available in RAM is small in size and we can't really invest

a lot in the main memory as it's very expensive in terms of bytes per unit of money. As

opposed to the mechanical disks, we prefer to deploy proxy servers with huge storage space

which can be used for caching objects. Let's have a look at how to tell Squid about caching

objects to disks.

Specifying the storage space

The direcve cache_dir is used to declare the space on the hard disk where Squid will

store or cache the web documents for use in future. Let's have a look at the syntax of

cache_dir and try to understand the dierent arguments and opons:

cache_dir STORAGE_TYPE DIRECTORY SIZE_IN_Mbytes L1 L2 [OPTIONS]

Conguring Squid

[ 50 ]

Storage types

Operang systems implement lesystems to store les and directories on the disk drives. In

the Linux/Unix world,

ext2, ext3, ext4, reiserfs, xfs, UFS (Unix File System), and so

on, are the popular lesystems. Filesystems also expose a few system calls such as open(),

close(), read(), and so on, so that other programs can read/write/remove les from the

storage. Squid also uses these system calls to interact with the lesystems and manage the

cached objects on the disk.

On top of the lesystems and with the help of the available system calls exposed by the

lesystems, Squid implements storage schemes such as

ufs, aufs, and diskd.

All the storage schemes supported by the operang system are built by default. The

ufs

is a very simple storage scheme and all the I/O transacons are done using the main Squid

process. As some of the system calls are blocking (meaning the system call will not return

unl the I/O transacon is complete) in nature, they somemes cause delays in processing

requests, especially under heavy load, resulng in an overall bad performance.

ufs is good

for servers with less load and high speed disks, but is not really preferable for busy caches.

aufs is an improved version of ufs where a stands for asynchronous I/O. In other words, aufs

is ufs with asynchronous I/O support, which is achieved by ulizing POSIX-threads (pthreads

library). Asynchronous I/O prevents blocking of the main Squid process by some system calls,

meaning that Squid can keep on serving requests while we are waing for some I/O transacon

to complete. So, if we have the pthreads library support on our operang system, we should

always go for aufs instead of ufs, especially for heavily loaded proxy servers.

The Disk Daemon (

diskd) storage scheme is similar to aufs. The only dierence is that diskd

uses an external process for I/O transacons instead of threads. Squid and diskd process

for each cache_dir (of the diskd type) to communicate using message queues and shared

memory. As diskd involves a queuing system, it may get overloaded over me in a busy proxy

server. So, we can pass two addional opons to cache_dir which determines how Squid

will behave in case there are more messages in the queues than diskd is able to process. Let's

have a look at the syntax of the cache_dir direcve for diskd as STORAGE_TYPE

cache_dir diskd DIRECTORY SIZE_Mbytes L1 L2 [OPTIONS] [Q1=n] [Q2=n]

The value of Q1 signies the number of pending messages in the queue beyond which Squid

will not place new requests for I/O transacons. Though Squid will keep on entertaining

requests normally, it'll not be able to cache new objects or check cache for any HITs. HIT

performance will suer in this period of me. The default value of Q1 is 64.

The value of

Q2 signies the number of pending messages in the queue beyond which Squid

will cease to operate and will go in to block mode. No new requests will be served in this

period unl Squid receives a reply or the messages in the queue fall below this number. The

default number of Q2 is 72.

Chapter 2

[ 51 ]

As you can see from the explanaon of Q1 and Q2, if the value of Q1 is more than Q2, Squid

will go in to block mode rst. If the queue is full it will result in higher latency but beer

HIT rao. If the value of Q1 is less than Q2, Squid will keep on serving the requests from the

network even if there is no I/O. This will result in lower latency, but the HIT rao will suer

considerably.

Choosing a directory name or location

We can specify any locaon on the lesystem for the directory name. Squid will populate it

with its own directory structure and will start storing or caching web documents in the space

available. However, we must make sure that the directory already exists and is writable by

the Squid process. Squid will not create the directory if it doesn't exist already.

Time for action – creating a cache directory

The cache directory locaon may not be on the same disk or paron. We can mount

another disk drive and specify that as the directory for caching. For example, let's say we

have another drive connected as

/dev/sdb and one of the parons is /dev/sdb1, we

can mount it to the /drive/ and use it right away.

$ mkdir /drive/

$ mount /dev/sdb1 /drive/squid_cache/

$ mkdir /drive/squid_cache

$ chown squid:squid /drive/squid_cache/

In the previous code, we created a directory /drive/ and mounted /dev/sdb1, the

paron from the other disk drive, to it. Then, we created a directory squid_cache in the

directory

/drive/ and changed the ownership of the directory to Squid, so that Squid can

have write access to this directory. Now we can use /drive/squid_cache/ as one of the

directories with the cache_dir direcve.

What just happened?

We mounted a paron from a dierent hard disk and assigned the correct ownership to use

it as a cache directory for disk caching.

Declaring the size of the cache

This is the easy part. We must keep in mind that we should not append MB or GB to

the number while specifying the size in this direcve. The size is always specied in

Megabytes. So, if we want to use 100 GB of disk space for caching, we should set size

to 102400 (102400 MB/1024 = 100 GB).

Conguring Squid

[ 52 ]

If we want to use the enre disk paron for caching, we should not set the cache size to be

equal to the size of the paron because Squid may need some extra space for temporary

les and the swap.state le. So, it's good pracce to subtract 5-15 percent from the total

disk space for temporary les and then set the cache size.

Conguring the number of sub directories

There are two arguments to cache_dir named as L1 and L2. Squid stores the cacheable

objects in a hierarchical fashion in directories named so that it'll be faster to lookup an object

in the cache. The hierarchy is of two-levels, where

L1 determines the number of directories

at the rst level and L2 determines the number of directories in each of the directories at

the rst level. We should set L1 and L2 high enough so that directories at the second level

don't have a huge number of les.

Read-only cache

Somemes we may want our cache to be in read-only mode so that Squid doesn't store

or remove any cached objects from it but connues to serve the content from it. This is

achieved by using an addional opon named no-store with the cache_dir direcve.

Please note that Squid will not update any content in the read-only cache directory. This

feature is used very rarely.

Time for action – adding a cache directory

So far we have learned the meaning of dierent parameters used with the cache_dir

direcve. Let's see an example of the cache directory /squid_cache/ with 50GB of

free space:

cache_dir aufs /squid_cache/ 51200 32 512

We have a cache directory /squid_cache/ with 50 GB of free space with the values of L1

and L2 as 32 and 512 respecvely. So, if we assume the average size of a cached object to be

16 KB, there will be 51200x1024÷(32x512x16) = 200 cached objects in each of the directories

at the second level, which is quite good.

What just happened?

We added /squid_cache/ with a 50 GB free disk space as a cache directory to cache web

documents on the hard disk. Following the previous instrucons, we can add as many cache

directories as we want, depending on the availability of space.

Chapter 2

[ 53 ]

Cache directory selection

If we have specied mulple caching directories, we may need a more ecient algorithm to

ensure opmal performance. For example, when under a heavy load, Squid will perform a lot

of I/O transacons. In such cases, if the load is split across the directories, this will obviously

result in low latency.

Squid provides the direcve

store_dir_select_algorithm, which can be used to specify

the way in which the cache directories should be used. It takes one of the values from

least-load and round-robin.

store_dir_select_algorithm least-load|round-robin

If we want to distribute cached objects evenly across the caching directories, we should go

for round-robin. If we want the best performance with least latency, we should certainly

go for least-load, where Squid will try to pick the directory with the least I/O operaons.

Cache object size limits

It is important to place limits on the size of the web documents which we are going to

cache for achieving a beer HIT rao. Depending on the results we want to achieve, we

may want to keep the maximum limit a bit higher than the default, which is 4 MB, which in

turn depends on the size of the cache we have specied. For example, if we have a cache

directory with a size of 10 GB and we set the maximum cacheable object size to 500 MB,

there will be fewer objects in the cache and the HIT rao will suer signicantly resulng in

high latency. However, we shouldn't keep it really low either, as this will result in lots of I/O

but fewer bandwidth savings.

Squid provides two direcves known as

minimum_object_size and maximum_object_

size

to set the object size limits. The minimum size is 0 KB, by default, meaning that there is

no lower limit on the object size. If we have a huge amount of storage dedicated to caching,

we can set the maximum limit to something around 100 MB, which will make sure that

the popular soware, audio/video content, and so on, are also cached, which may lead

to signicant bandwidth savings.

minimum_object_size 0 KB

maximum_object_size 96 MB

This conguraon will set the minimum and maximum object size in the cache to 0 (zero) and

96 MB respecvely, which means that objects with a size larger than 96 MB will not be cached.

Conguring Squid

[ 54 ]

Setting limits on object replacement

Over a period of me, the allocated space for the caching directories starts to ll up.

Squid starts deleng cached objects from the cache once the occupied space by the objects

crosses a certain threshold, which is determined by using the cache_swap_low and

cache_swap_high direcves. These direcves take integral values between 0 and 100.

cache_swap_low 96

cache_swap_high 97

So, in accordance with these values, when the space occupied for a cache directory crosses

96 percent, Squid will start deleng objects from the cache and will try to maintain the

ulizaon near 96 percent. However, if the incoming rate is high and the space ulizaon

starts to touch the high limit (97 percent), the deleon becomes quite frequent unl

ulizaon moves towards the lower limit.

Squid's defaults for low and high limits are 90 percent and 95 percent respecvely, which

are good if the size of cache directory is low (like 10 GB). However, if we have a large amount

of space for caching (such as a few hundreds GBs), we can push the limits a bit higher and

closer because even 1 percent will mean a dierence of more than a gigabyte.

Cache replacement policies

In the previous two secons, we learned about using the main memory and hard disks

for caching web documents and how to congure Squid for opmal performance. As me

passes, cache will start to ll and at some point in me, Squid will need to purge or delete

old objects from the cache to make space for new ones. Removal of objects from the cache

can be achieved in several ways. One of the simplest ways to do this is to start by removing

the least recently used or least frequently used objects from the cache.

Squid builds dierent removal or replacement policies on top of the list and heap data

structures. Let's have a look at the dierent policies provided by Squid.

Least recently used (LRU)

Least recently used (lru) is the simplest removal policy built by Squid by default. Squid starts

by removing the cached objects that are oldest (since the last HIT). The LRU policy ulizes the

list data structure, but there is also a heap-based implementaon of LRU known as heap lru.

Greedy dual size frequency (GDSF)

GDSF (heap GDSF) is a heap-based removal policy. In this policy, Squid tries to keep popular

objects with a smaller size in the cache. In other words, if there are two cached objects with

the same popularity, the object with the larger size will be purged so that we can make space

for more of the less popular objects, which will eventually lead to a beer HIT rao. While

using this policy, the HIT rao is beer, but overall bandwidth savings are small.

Chapter 2

[ 55 ]

Least frequently used with dynamic aging (LFUDA)

LFUDA (heap LFUDA) is also a heap-based replacement policy. Squid keeps the most

popular objects in the cache, irrespecve of their size. So, this policy compromises a bit of

the HIT rao, but may result in beer bandwidth savings compared to GDSF. For example, if

a cached object with a large size encounters a HIT, it'll be equal to HITs for several small sized

popular objects. So, this policy tries to opmize bandwidth savings instead of the HIT rao.

We should keep the maximum object size in the cache high if we use this policy to further

opmize the bandwidth savings.

Now we need to specify one of the policies which we have just learned, for cache

replacement for the main memory caching as well as hard disk caching. Squid provides

the direcves

memory_replacement_policy and cache_replacement_policy for

specifying the removal policies.

memory_replacement_policy lru

cache_replacement_policy heap LFUDA

These conguraon lines will set the memory replacement policy to lru and the on disk

cache replacement policy to heap LFUDA.

Tuning Squid for enhanced caching

Although Squid performs quite well with default caching opons, we can tune it to perform

even beer, by not caching the unwanted web objects and caching a few non-cacheable web

documents. This will achieve higher bandwidth savings and reduced latency. Let's have a look

at a few techniques that can be helpful.

Selective caching

There may be cases when we don't want to cache certain web documents or requests from

clients. The direcve cache is very helpful in such cases and is very easy to use.

Time for action – preventing the caching of local content

If we don't want to cache responses for certain requests or clients, we can deny it using this

opon. The default behavior is to allow all cacheable responses to be cached. As servers in

our local area network are close enough that we may not want to waste cache space on our

proxy server by caching responses from these servers, we can selecvely deny caching for

responses from local servers.

acl local_machines dst 192.0.2.0/24 198.51.100.0/24

cache deny local_machines

This code will prevent responses from the servers in the networks 192.0.2.0/24 and

198.51.100.0/24 from being cached by the proxy server.

Conguring Squid

[ 56 ]

What just happened?

To opmize the performance (especially the HIT rao), we have congured Squid not to

cache the objects that are available on the local area network. We have also learned how

to selecvely deny caching of such content.

Refresh patterns for cached objects

Squid provides the direcve refresh_pattern, using which we can control the status of

a cached object.

Using refresh_pattern to cache the non-cacheable responses or to

alter the lifeme of the cached objects, may lead to unexpected behavior or

responses from the web servers. We should use this direcve very carefully.

Refresh paerns can be used to achieve higher HIT raos by keeping the recently expired

objects fresh for a short period of me, or by overriding some of the HTTP headers sent by

the web servers. While the cache direcve can make use of ACLs, refresh_pattern uses

regular expressions. The advantage of using the refresh_pattern direcve is that we can

alter the lifeme of the cached objects, while with the cache direcve we can only control

whether a request should be cached or not.

Let's have a look at the syntax of the

refresh_pattern direcve:

refresh_pattern [-i] regex min percent max [OPTIONS]

The parameter regex should be a regular expression describing the request URL. A refresh

paern line is applied to any URL matching the corresponding regular expression. There can

be mulple lines of refresh paerns. The rst line, whose regular expression matches the

current URL, is used. By default, the regular expression is case-sensive, but we can use -i

to make it case-insensive.

Some objects or responses from web servers may not carry an expiry me. Using the

min

parameter, we can specify the me (in minutes) for which the object or response should be

considered fresh. The default and recommended value for this parameter is 0 because altering

it may cause problems or unexpected behavior with dynamic web pages. We can use a higher

value when we are absolutely sure that a website doesn't supply any dynamic content.

The parameter

percent determines the life of a cached object in the absence of the

Expires headers. An object's life me is considered to be the dierence between the mes

extracted from the Last-Modified and Date headers. So, if we set the value of percent

to 50 , and the dierence between the mes from Last-Modified and Date headers is

one hour, then the object will be considered fresh for the next 30 minutes. The response

age is simply the me that has passed since the response was generated by the web server

or was validated by the proxy server for the freshness. The rao of the response age to the

object life me is termed as lm-factor in the Squid world.

Chapter 2

[ 57 ]

Similarly the min, max parameters are the minimum and maximum mes (in minutes) for

which a cached object is considered fresh. If a cached object has spent more me in the

cache than max, then it won't be considered fresh anymore.

We should note that the Expires HTTP header

overrides min and max values.

Let's have a look at the method used for determining the freshness or staleness of a cached

object. A cached object is considered:

Stale (or expired), if the expiry me that was menoned in the HTTP response

header is in the past.

Fresh, if the expiry me menoned in the HTTP response headers is in the future.

Stale, if response age is more than the

max value.

Fresh, if

lm-factor is less than the percent value.

Fresh, if the response age is less than the

min value.

Stale, otherwise.

Time for action – calculating the freshness of cached objects

Let's see an example of a refresh_pattern and try to calculate the freshness of an object:

refresh_patten -i ^http://example.com/test.jpg$ 0 60% 1440

Let's say a client requested the image at http://example.com/text.jpg an hour ago,

and the image was last modied (created) on the web server six hours ago. Let's assume

that the web server didn't specify the expiry me. So, we have the following values for the

dierent variables:

At the me of the request, the object age was (6 - 1) = 5 hours.

Currently, the response age is 1 hour.

Currently, the

lm-factor is 1÷5 = 20 percent

Let's check whether the object is sll fresh or not:

The response age is 60 minutes, which is not more than 1440 (

max value), so this

can't be the deciding factor.

lm-factor is 20 percent, which is less than 60 percent, so the object is sll fresh.

Now, let's calculate the me when the object will expire. The object age is 5 hours and

percent value is 60 percent. So, object will expire in (5 x 60) ÷100 = 3 hours from the last

request, that is, 2 hours from now.



Conguring Squid

[ 58 ]

What just happened?

We learned the formula for calculang the freshness or staleness of a cached object and

also the me aer which a cached object will expire. We also learned about specifying

refresh paerns for the dierent content types to opmize performance.

Options for refresh pattern

Most of the me, the expiry me is specied by the web servers for all requests. But some

web documents such as style sheets (CSS) or JavaScript (JS) les included on web page,

change quite rarely and we can bump up their expiry me to a higher value to take full

advantage of caching. As the web servers already specify the expiry me, the cached CSS/

JS le will automacally expire. To forcibly ignore the Expires and a lot of other headers

related to caching, we can pass opons to the refresh_pattern direcve.

Let's have a look at the opons available for the

refresh_pattern direcve and how they

can help us improve the HIT rao.

Please be warned that using the following opons violates HTTP

standards and may also cause unexpected browsing problems.

override-expire

The opon override-expire, overrides or ignores the Expires header, which is the

main player for determining the expiry me of a cached response. As the Expires header

is ignored, the values of the min, max, and percent parameters will play an essenal role

in determining the freshness of a response.

override-lastmod

The opon override-lastmod will force Squid to ignore the Last-Modified header,

which will eventually enforce the use of min value to determine the freshness of an object.

This opon is of no use, if we have set the value of min to zero.

reload-into-ims

Using the reload-into-ims opon will force Squid to convert the no-cache direcves in

the HTTP headers to the If-Modified-Since headers. The use of this opon is useful only

when the Last-Modified header is present.

ignore-reload

Using the opon ignore-reload will simply ignore the no-cache or reload direcves

present in the HTTP headers.

Chapter 2

[ 59 ]

ignore-no-cache

When the opon ignore-no-cache is used, Squid simply ignores the no-cache direcve

in the HTTP headers.

ignore-no-store

The HTTP header Cache-Control: no-store is used to tell clients that they are not

allowed to store the data being transmied. If the opon

ignore-no-store is set, Squid

will simply ignore this HTTP header and will cache the response if it's cacheable.

ignore-must-revalidate

The HTTP header Cache-Control: must-revalidate means that the response must

be revalidated with the originang web server before it's used again. Seng the opon

ignore-must-revalidate will enforce Squid to ignore this header.

ignore-private

Private informaon or sensive data generally carries an HTTP header known as

Cache-Control: private so that intermediate servers don't cache the responses.

However, the opon ignore-private can be used to ignore this header.

ignore-auth

If the opon ignore-auth is set, then Squid will be able to cache the authorizaon

requests. Using this opon may be really risky.

refresh-ims

This opon can be prey useful. The opon refresh-ims forces Squid to validate the

cached object with the original server whenever an If-Modified-Since request header is

received from a client. Using this may increase the latency, but the clients will always get the

latest data.

Let's see an example with these opons:

refresh_pattern -i .jpg$ 0 60% 1440 ignore-no-cache ignore-no-store

reload-into-ims

This code will force all the JPEG images to be cached whether the original servers want us to

cache them or not. They will be refreshed only:

If the

Expires HTTP header was present and the expiry me is in past.

If the

Expires HTTP header was missing and the response age has exceeded the

max value.



Conguring Squid

[ 60 ]

Have a go hero – forcing the Google homepage to be cached for longer

Write a refresh_pattern conguraon that forces the Google homepage to be cached for

six hours.

Soluon:

refresh_pattern -i ^http:\/\/www\.google\.com\/$ 0 20% 360 override-

expire override-lastmod ignore-reload ignore-no-cache ignore-no-store

reload-into-ims ignore-must-revalidate

Aborting the partial retrievals

When a client iniates a request for fetching some data and aborts it prematurely, Squid

may connue to try and fetch the data. This may cause bandwidth and other resources

such as processing power and memory to be wasted, however if we get subsequent

requests for the same object, it'll result in a beer HIT rao. To counter act this problem,

Squid provides three direcves quick_abort_min (KB), quick_abort_max (KB), and

quick_abort_pct (percent).

For all the aborted requests, Squid will check the values for these direcves and will take the

appropriate acon according to the following rules:

If the remaining data that should be fetched is less than the value of

quick_abort_min, Squid will connue to fetch it.

If the remaining data to be transferred is more than the value of

quick_abort_max, Squid will immediately abort the request.

If the data that has already been transferred is more than

quick_abort_pct

percent of the total data, then Squid will keep retrieving the data.

Both the

quick_abort_min and quick_abort_max values are in KiloBytes (KB) (or any

allowed memory size unit) while quick_abort_pct is a percentage value. If we want to abort

the requests in all cases, which may be required if we are short of bandwidth. We should set

quick_abort_min and quick_abort_max to zero. If we have a lot of spare bandwidth, we

can set a higher values for quick_abort_min and quick_abort_max, and a relavely low

value for quick_abort_pct. Let's see an example for a high bandwidth case:

quick_abort_min 1024 KB

quick_abort_max 2048 KB

quick_abort_pct 90



Chapter 2

[ 61 ]

Caching the failed requests

Requests for resources which doesn't exist (HTTP Error 404) or a client doesn't have

permission to access the requested resource (HTTP Error 403) are common and requests

to such resources make up a signicant percentage of the total requests. These responses

are cacheable by Squid. However, somemes web servers don't send the Expires HTTP

headers in responses, which prevents Squid from caching these responses. To solve this

problem, Squid provides the direcve negative_ttl that forces such responses to be

cached for the me specied. The syntax of negative_ttl is as follows:

negative_ttl TIME_UNITS

Previously, this value was ve minutes by default, but in the newer versions of Squid, it is set

to zero seconds by default.

Playing around with HTTP headers

As all the requests and responses pass through Squid, it can add, modify, or delete the HTTP

headers accompanied with requests and responses. These acons are usually performed

to achieve anonymity or to hide the client-specic informaon. Squid has three direcves,

namely, request_header_access, reply_header_access, and header_replace to

modify the HTTP headers in requests and responses. Let's have a brief look at them.

Please be warned that using any of these direcves

violates HTTP standards and may cause problems.

Controlling HTTP headers in requests

The direcve request_header_access is used in combinaon with ACLs to determine

whether a parcular HTTP header will be retained in a request or if it will be dropped before

forwarding the request. The advantage of having ACLs here is that they provide a lot of

exibility. We can selecvely drop some HTTP headers for a few clients.

Let's have a look at the syntax of

request_header_access:

request_header_access HEADER_NAME allow|deny [!]ACL_NAME ...

So, if we are not willing to expose what browsers our clients are using, we can easily drop the

User-Agent header from requests. The following code will drop this parcular header for

all the requests:

request_header_access User-Agent deny all

Conguring Squid

[ 62 ]

The parameter all is a special keyword here represenng all the HTTP headers. Similarly, if

we don't want web servers to know about the browsing habits of our clients, we can start by

dropping the Referer header from all the requests.

request_header_access Referer deny all

Again, please be warned that dropping these headers may cause serious problems in

browsing. By default, no headers are removed.

Controlling HTTP headers in responses

Similar to the direcve request_header_access, we have the reply_header_access

direcve to drop the HTTP headers in responses. The syntax and usage is similar. For

example, for dropping the Server HTTP header, the example conguraon line will be:

reply_header_access Server deny all

By default, all headers are retained and are sent they are received.

Replacing the contents of HTTP headers

While the previous two direcves can only be used to drop any unwanted HTTP headers, the

direcve header_replace can be used to send false informaon to replace the contents

of the headers. Please note that this direcve replaces contents of headers which have been

denied using the request_header_access direcve and is valid only for requests and not

responses. We use this direcve to replace the contents of the headers with a stac xed

value. Let's have a look at the syntax of header_replace:

header_replace HEADER_NAME TEXT_VALUE

For example, we can send out the User-Agent header reecng that all our clients use the

Firefox web browser. Let's see the code for this example:

header_replace User-Agent Mozilla/5.0 (X11; U; Linux i686; en-US;

rv:0.9.3) Gecko/20010801

Again, we want to warn you that web servers generally validate or need the User-Agent

and other HTTP headers to serve the right content for a client, and modifying these headers

may cause unexpected problems.

DNS server conguration

For every request received from a client, Squid needs to resolve the domain name before it

can contact the target web server. For this purpose, Squid can either use the built-in internal

DNS client or, external DNS program to resolve the hostnames. The default behavior is to

Chapter 2

[ 63 ]

use the internal DNS client for resolving hostnames unless we have used the --disable-

internal-dns

opon but it must be set with the configure program before compiling

Squid, as shown:

$ ./configure --disable-internal-dns

Let's have a quick look at the DNS-related conguraon direcves provided by Squid.

Specifying the DNS program path

The direcve cache_dns_program is used to specify the path of the external DNS program

built with Squid. If we have not moved the Squid-related le aer installing, this direcve

will have the correct value, by default. However, if the DNS program is located at a dierent

locaon, we can specify the path using the following direcve:

cache_dns_program /path/to/dnsprogram

Controlling the number of DNS client processes

The number of parallel instances of the DNS program specied by cache_dns_program

can be controlled by using the direcve dns_children. The syntax of the direcve

dns_children is as follows:

dns_children max startup=n idle=n

The parameter max determines the maximum number of DNS programs which can run at any

one me. We should set it to a signicantly high value as Squid has to wait for the response

from the DNS program before it can proceed any further and seng this number to a lower

value will keep Squid waing for the response. The default value is set to 32.

The value of the parameter

startup determines the number of DNS programs that will be

started when Squid starts. This can be set to zero and Squid will not start any processes by

default. The rst ever request to Squid will result in the creaon of the rst child process.

The value of the parameter

idle determines the number of processes that will be available

at any one me. More requests will result in the creaon of more processes, but keeping this

many processes free (available) is subject to a total of max processes. A minimum acceptable

value for this parameter is 1.

Setting the DNS name servers

By default, Squid picks up the name servers from the le /etc/resolv.conf.

However, if we want to specify a list of dierent name servers, we can use the direcve

dns_nameservers.

Conguring Squid

[ 64 ]

Time for action – adding DNS name servers

A list of IP addresses can be passed to this direcve or several IP addresses can be wrien on

dierent lines like the following:

dns_nameservers 192.0.2.25 198.51.100.25

dns_nameservers 203.0.113.25

The previous conguraon lines will set the name servers to 192.0.2.25, 198.51.100.25,

and 203.0.113.25.

What just happened?

We added three DNS name servers to the Squid conguraon le which will be used by

Squid to resolve the domain names corresponding to the requests received from the clients.

Setting the hosts le

Squid can read the hostname and IP address associaons from the hosts le generally found

at /etc/hosts. This le normally contains hostnames for the machines or servers in the

local area network. We can specify the host's le locaon using the direcve hosts_file

as shown:

hosts_file /etc/hosts

If we don't want Squid to read the host's le, we can set the value to none.

Default domain name for requests

Using the direcve append_domain, we can append a default domain name to the

hostnames without any period (.) in them. This is generally useful for handling local domain

names. The value of the append_domain must begin with a period (.). For example:

append_domain .example.com

Timeout for DNS queries

If the DNS servers do not respond to the query within the me specied by the direcve

dns_timeout, they are assumed to be unavailable. The default meout value is two

minutes. Considering the ever increasing network speeds, we can set this to a slightly lower

value. For example, if there is no response within one minute, we can consider the DNS

service to be unavailable.

Chapter 2

[ 65 ]

Caching the DNS responses

The IP addresses of most domains change quite rarely, so it's safe to cache the posive

responses from DNS servers for a few hours. This doesn't provide much of a saving in

bandwidth, but caching DNS responses may reduce the latency quite signicantly because

a DNS query is done for every request. For caching DNS responses while using an external

DNS program, Squid provides two direcves known as

positive_dns_ttl and

negative_dns_ttl to tune the caching of DNS responses.

The direcve

positive_dns_ttl determines the maximum me for which a posive

DNS response will be cached while

negative_dns_ttl determines the me for which

a negave DNS response will be cached. The direcve

negative_dns_ttl also serves as

a minimum me for which the posive DNS responses can be cached.

Let's see the example values for both of the direcves:

positive_dns_ttl 8 hours

negative_dns_ttl 30 seconds

We should keep the me to live (TTL) for negave responses to a lower value as the negave

responses may be due to problems with the DNS servers.

Setting the size of the DNS cache

Squid performs domain name to address lookups for all the MISS requests and address

to domain name lookups for requests involving ACLs such as

dstdomain. These lookups

are cached. To control the size of these cached lookups, Squid exposes four direcves—

ipcache_size (number), ipcache_low (percent), ipcache_high (percent),

and fqdncache_size (number). Let's see what these direcves mean.

The direcve

ipcache_size determines the maximum number of entries that can be

cached for domain name to address lookups. As these entries take really small amounts of

memory and the amount of available main memory is enormous these days, we can cache

tens of thousands of these entries. The default value for this direcve is 1024, but we can

easily push it to 15,000 on busy caches.

The direcves

ipcache_low (let's say 95) and ipcache_high (let's say 97) are low and

high water marks for the IP cache. So, Squid will try to keep the number of entries in the

cache between 95 percent and 97 percent.

Using

fqdncache_size, we can simply set the maximum number of address to domain

name lookups that can be in the cache at any me. These entries also take really small

amounts of memory, so we can cache a large number of these. The default value is 1024,

but we can easily push it to 10,000 on busy caches.

Conguring Squid

[ 66 ]

Logging

Squid logs all the client requests and events to les. Squid provides various direcves to

control the locaon of log les, format of log messages, and to choose which requests to

log. Let's have a brief look at some of the direcves. We'll learn about logging in detail in

Chapter 5, Understanding Log Files and Log Formats.

Log formats

We can dene mulple log formats using the direcve logformat as well as the pre-dened

log formats supplied by Squid. Log formats are basically an arrangement of one or more

pre-dened format codes. Various log formats such as

squid, common, combined, and so

on, are provided by Squid, by default. We'll have a detailed look at dening addional log

formats in Chapter 5.

Log le rotation or log le backups

Over a period of me, the log les grow in size. The common pracce is to move the older

logs to separate les as a backup or for analysis, and then connue wring the logs to the

original log le. The default Squid behavior is to keep 10 backups of log les. We can change

this behavior with the direcve logfile_rotate as follows:

logfile_rotate 20

Log access

By default, Squid logs requests from all the clients to the log le set by the direcve

access_log. If we want to prevent some client requests from being logged by Squid,

we can use the log_access direcve along with ACLs. An example may be that the

CEO doesn't want his requests to be logged:

acl ceo_laptop src 192.0.2.21

log_access deny ceo_laptop

We should note that the requests denied for logging using this

direcve will not count towards performance measurements.

Buffered logs

By default, all the log les are wrien without buering any output. Buering the logs

enhances/improves performance under heavy usage or when debugging is enabled.

This direcve is rarely used.

Chapter 2

[ 67 ]

Strip query terms

Query terms are key-value pairs passed using a URL in a HTTP request. Somemes, this may

contain sensive or private informaon about the client requesng the web resource. By

default, Squid strips all the query terms from a request URL before logging it. Another reason

for stripping query terms is that the query terms are oen very long and can make monitoring

the access log very painful. However, we may want to disable it someme, especially while

debugging a problem, for example, a client is not able to access a website properly.

strip_query_terms off

This conguraon will prevent query terms from being stripped before requests are logged.

It's a good pracce to set this direcve to

on for protecng clients' privacy.

URL rewriters and redirectors

URL rewriters and redirectors are third party, independent helper programs that we can use

with Squid to modify or rewrite requests from clients. In most cases, we try to redirect a client

to a dierent web page or resource from the one that was inially requested by the client.

The interesng part is that URL rewriters can be coded in any programming language. URL

rewriters are run as independent processes and communicate with Squid using standard I/O.

URL rewriters provide a totally new area of opportunity as we can redirect clients to custom

error pages for dierent scenarios, redirect users to local mirrors of websites or soware

repositories, block adversements with small blank images, and so on.

Squid doesn't have any URL rewriters by default as we are supposed to write our own

URL rewriters because the possibilies are enormous. It is also possible to download URL

rewriters wrien by others and use them right away. We'll learn about how to use or write

our own URL rewriters in detail in Chapter 11, Wring URL Redirectors and Rewriters.

Other conguration directives

Squid has hundreds of conguraon direcves to control it in various ways. It's not possible

to discuss all of them here, we'll try to cover the important ones.

Conguring Squid

[ 68 ]

Setting the effective user for running Squid

Although we generally start the Squid server as root, it never runs with the privileges of the

root user. Right aer starng, Squid changes its real UID (User ID)/GID (Group ID) to the user

determined by the direcve

cache_effective_user. By default, it is set to nobody. We

can create a separate user for running Squid and set the value of this direcve accordingly.

For example, on some operang systems, Squid is run as squid user. The corresponding

conguraon line will be as follows:

cache_effective_user squid

Please make sure that the user specied as the value for cache_effective_user exists.

Conguring hostnames for the proxy server

Squid uses hostnames for the server for forwarding requests to other cache peers or for

detecng the neighbor caches. There two dierent direcves named

visible_hostname

and unique_hostname which are used to set the hostname of the proxy server for dierent

purposes. Let's have a quick look at these direcves.

Hostname visible to everyone

The direcve visible_hostname is used to set the hostname, which will be visible on all

the error or informaon pages used by Squid. We can set it as shown:

visible_hostname proxy.example.com

Unique hostname for the server

If we want to name all the proxy servers in our network as proxy.example.com, we can

achieve it by seng

visible_hostname for all of them to proxy.example.com. However,

doing so will cause problems in forwarding requests among the caches and detecng

forward loops. To solve this problem, Squid provides the direcve

unique_hostname. We

should set this to a unique hostname value to get rid of forward loops.

unique_hostname proxy1.example.com

Controlling the request forwarding

If we have cache peers or neighbors in our network, Squid will try to contact them for HITs

or for forwarding requests. We can control the manner in which the requests are forwarded

to other caches using the direcves

always_direct, never_direct, hierarchy_

stoplist

, prefer_direct, and cache_peer_access. Next we'll have a look at

a few of these direcves with examples.

Chapter 2

[ 69 ]

Always direct

Somemes we may want Squid to fetch the content directly from origin servers instead

of forwarding the queries to neighboring caches. This is achieved using the direcve

always_direct. The syntax is similar to http_access:

always_direct allow|deny [!]ACL_NAME

This direcve is very useful in forwarding requests to servers in the local area network

directly because contacng cache peers may introduce an unnecessary delay.

acl lan_servers dst 192.0.2.0/24

always_direct allow lan_servers

This code will instruct Squid to forward requests to desnaon servers idened by

lan_servers directly to the origin servers and the requests will not be routed through

other cache peers.

Never direct

This direcve is opposite of always_direct, but we should understand it carefully before

using it. If we want to enforce the use of a proxy server for all the client requests, then this

direcve comes handy.

never_direct allow all

This rule will enforce the usage of a proxy server for all the requests. However, generally, it's

a good pracce to allow clients to connect directly to local servers. So, we can use something

similar to the following:

acl lan_servers dst 192.0.2.0/24

never_direct deny lan_servers

never_direct allow all

These rules will make sure that requests to all the servers, except those idened by

lan_servers, go through another proxy server.

Hierarchy stoplist

This is a simple direcve prevenng the forwarding of client requests to neighbor caches.

Let's have a look at the syntax:

hierarchy_stoplist word1 word2 word3 ...

Conguring Squid

[ 70 ]

If any of the words from the list of words is found in the request URL, the request will not

be forwarded to the neighbor caches and the origin servers will be contacted directly. This

direcve is generally helpful for handling dynamic pages directly instead of roung them

using cache peers.

hierarchy_stoplist cgi-bin jsp ?

This code will prevent the forwarding of URLs containing any of cgi-bin, jsp, or ? to

cache peers.

Please note that the direcve never_direct

overrides hierarchy_stoplist.

Broken posts

Some web servers have broken implementaons of the POST method (a method using which

we can securely send data to the web server) and they expect a pair of CRLF (new-line) aer

the POST request data. Using the broken_posts direcve, we can request Squid to send an

extra CRLF pair aer the POST request data.

acl bad_server dstdomain broken.example.com

broken_posts allow bad_server

The rules in this code will take care of the broken implementaon of the POST method

on the host

broken.example.com. We should use this direcve only if its absolutely

necessary.

TCP outgoing address

This direcve is useful for forwarding requests to dierent network interfaces, depending on

the client's network. Let's have a look at the syntax for this direcve:

tcp_outgoing_address ip_address [[!]ACL_NAME]

In this line, ip_address is the IP address of the outgoing interface which we want to use.

The ACL name is totally oponal. An example case may be when we want to route trac

for a specic network using a dierent network interface:

acl special_network src 192.0.2.0/24

tcp_outgoing_address 198.51.100.25 special_network

tcp_outgoing_address 198.51.100.86

The previous code will set the outgoing address for requests from clients in the network

192.0.2.0/24 to 198.51.100.25, and for all other requests the outgoing address

will be set to 198.51.100.86.

Chapter 2

[ 71 ]

PID lename

Just like several other programs for Unix/Linux, Squid writes the process ID of the current

process in a PID le. This direcve is used to control the locaon of a PID le.

pid_filename /var/run/squid.pid

If we don't want Squid to write its process ID to any le, we can use none instead of lename:

pid_filename none

Seng the path of the PID le to none will prevent regular management

operaons like automac log rotaon or restarng Squid. The operang system

will not be able to stop Squid at the me of a shutdown or restart.

Client netmask

By default Squid logs the complete IP address of the client for every request. To enhance the

privacy of our clients, we can use this direcve to hide the actual IP addresses of the clients.

Let's see an example:

client_netmask 255.255.255.0

If a client with the IP address 192.0.2.21 accesses our proxy server, then his address will

be logged as 192.0.2.0 instead of 192.0.2.21 because Squid will set the last 8 bits of the

IP address to zero. Basically, a logical AND operaon is performed between binary version of

the netmask and the IP address to be logged. The same IP address will also be reected in

the cache manager's web interface.

Pop quiz

1. Consider the following snippet from the Squid conguraon le:

http_port 192.0.2.22:8080

http_port 192.0.2.22:3128

Which one of the following is true?

a. Squid will listen on port 8080 on all interfaces.

b. Squid will listen on port 3128 on all interfaces.

c. Squid will listen on port 8080 and 3128 on all interfaces.

d. Squid will listen on port 8080 and 3128 on interface with IP address 192.0.2.22.

Conguring Squid

[ 72 ]

2. Consider the following lines from the Squid conguraon le:

acl exapmile_sites dstdomain .example.com .example.net

We want to deny access to the requests idened by the ACL example_sites.

Which one of the following rules will not do it?

a. http_access deny example_sites

b. http_access deny Example_sites

c. http_access deny ExampleSites

d. http_access deny Example_Sites

3. Consider the following Squid conguraon:

acl blocked_clients src 192.0.2.0/24

acl special_client src 192.0.2.21

http_access deny blocked_clients

http_access allow special_client

What will happen when a client with an IP Address 192.0.2.21 tries to access the

web through our Squid proxy server?

a. They will be denied access.

b. Somemes because of the allow rule.

c. The conguraon is ambiguous and Squid will crash.

d. Squid will not crash but it'll not be able to determine denite access permissions.

4. Which of the following is correct?

a. Total memory used by Squid is determined by cache_mem.

b. Total memory used by Squid is more than that specied using cache_mem.

c. Total memory used by Squid is less than that specied using cache_mem.

d. Total memory used by Squid is independent of the memory specied using

cache_mem.

5. Let's say we have the following line in our conguraon le:

append_domain .google.com

If a client tries to browse to the website http://mail/. What will the result be?

a. The client will get an error saying domain not found.

b. Nothing will happen.

c. Squid will crash.

d. Client will automacally be redirected to http://mail.google.com/.

Chapter 2

[ 73 ]

Summary

We have learned a lot in this chapter about conguring Squid. Aer this chapter, we should

feel more comfortable in dealing with the Squid conguraon le. We should be able to

apply the things we learnt in this chapter to ne tune Squid to achieve beer performance.

Although we learned about a lot of conguraon direcves, we specically covered:

The syntax of the conguraon le. We learned about various types of direcves

generally used in the conguraon le and the possible types of values that

they take.

Caching in the main memory and hard disk in detail. We learned about using RAM

and disks for caching in an opmized manner to achieve higher HIT rao.

Fine tuning the cache. We learned about achieving a beer HIT rao by nkering

with various HTTP headers.

The required DNS conguraon for Squid. We learned about specifying DNS servers

and opmizing the DNS cache to reduce latency.

We also discussed restricng access to the Squid server, modifying HTTP headers, and had

a brief overview of cache peers and the logging system.

Now that we have learned about conguring Squid, we are ready to proceed with running

the Squid server, which is the topic of the next chapter.



Running Squid

In the previous chapters, we had learned about compiling, installing, and

conguring the Squid proxy server. In this chapter, we are going to learn about

the dierent ways of running Squid and the available opons that can be

passed to Squid from the command line. We will also learn about debugging

the Squid conguraon le.

In this chapter, we will learn the following:

Various command line opons for running Squid

Parsing the squid conguraon le for syntax errors

Using an alternate squid conguraon le for tesng purposes

Dierent ways of starng Squid

Rotang log les generated by Squid

Let's get started and explore the previous points.

Command line options

Normally, all of the Squid conguraon opons reside with in the squid.conf le (the main

Squid conguraon le). To tweak the Squid funconality, the preferred method is to change

the opons in the squid.conf le. However there are some opons which can also be

controlled using addional command line opons while running Squid.

These opons are not very popular and are rarely used, but these are very useful for

debugging problems without the Squid proxy server. Before exploring the command line

opons, let's see how Squid is run from the command line.



Running Squid

[ 76 ]

As we saw in the rst chapter, the locaon of the Squid binary le depends on the --prefix

opon passed to the configure command before compiling. So, depending upon the value

of the --prefix opon, the locaon of the Squid executable may be one of /usr/local/

sbin/squid

or ${prefix}/sbin/squid, where ${prefix} is the value of the opon

--prefix passed to the configure command. Therefore, to run Squid, we need to run

one of the following commands on the terminal:

When the

--prefix opon was not used with the configure command, the

default locaon of the Squid executable will be /usr/local/sbin/squid.

When the

--prefix opon was used and was set to a directory, then the locaon

of the Squid executable will be

${prefix}/sbin/squid.

It's painful to type the absolute path for Squid to run. So, to avoid typing the absolute path,

we can include the path to the Squid executable in our

PATH shell variable, using the export

command as shown in the following example:

$ export PATH=$PATH:/usr/local/sbin/

Alternavely, we can use the following command:

$ export PATH=$PATH:/opt/squid/sbin/

We can also add the preceding command to our ~/.bashrc or ~/.bash_profile le

to avoid running the export command every me we enter a new shell.

Aer seng the

PATH shell variable appropriately, we can run Squid by simply typing the

following command on shell:

$ squid

This command will run Squid aer loading the conguraon opons from the

squid.conf le.

We'll be using the squid command without an absolute path for

running the Squid process. Please use the appropriate path according

to the installaon prex which you have chosen.

Now that we know how to run Squid from the command line, let's have a look at the various

command line opons.

Getting a list of available options

Before actually moving forward, we should rstly check the available set of opons for our

Squid installaon.



Chapter 3

[ 77 ]

Time for action – listing the options

Like a lot of other Linux programs, Squid also provides the opon -h which can be used

to retrieve a list of opons:

squid -h

The previous command will result in the following output:

Usage: squid [-cdhvzCFNRVYX] [-s | -l facility] [-f config-file] [-[au]

port] [-k signal]

-a port Specify HTTP port number (default: 3128).

-d level Write debugging to stderr also.

-f file Use given config-file instead of

/opt/squid/etc/squid.conf.

-h Print help message.

Parse configuration file, then send signal to

running copy (except -k parse) and exit.

-s | -l facility

Enable logging to syslog.

-u port Specify ICP port number (default: 3130), disable with 0.

-v Print version.

-z Create swap directories.

-C Do not catch fatal signals.

-F Don't serve any requests until store is rebuilt.

-N No daemon mode.

-R Do not set REUSEADDR on port.

-S Double-check swap during rebuild.

...

We will now have a look at a few important opons from the preceding list. We will also,

have a look at the squid(8) man page or http://linux.die.net/man/8/squid for

more details.

What just happened?

We have just used the squid command to list the available opons which we can use on the

command line.

Running Squid

[ 78 ]

Getting information about our Squid installation

Various features may vary across dierent versions of Squid. Before proceeding any further,

it's a good idea to know the version of Squid installed on our machine.

Time for action – nding out the Squid version

Just in case we want to check which version of Squid we are using or the opons we

used with the configure command before compiling, we can use the opon -v on the

command line. Let's run Squid with this opon:

squid -v

If we try to run the preceding command in the terminal, it will produce an output similar to

the following:

Squid Cache: Version 3.1.10

configure options: '--config-cache' '--prefix=/opt/squid/' '--enable-

storeio=ufs,aufs' '--enable-removal-policies=lru,heap' '--enable-icmp'

'--enable-useragent-log' '--enable-referer-log' '--enable-cache-digests'

'--with-large-files' --enable-ltdl-convenience

What just happened?

We used the squid command with the -v opon to nd out the version of Squid installed on

our machine, and the opons used with the

configure command before compiling Squid.

Creating cache or swap directories

As we learned in the previous chapter, the cache directories specied using the cache_dir

direcve in the

squid.conf le, must already exist before Squid can actually use them.

Time for action – creating cache directories

Squid provides the -z command line opon to create the swap directories. Let's see

an example:

squid -z

If this opon is used and the cache directories don't exist already, the output will look similar

to the following:

2010/07/20 21:48:35| Creating Swap Directories

2010/07/20 21:48:35| Making directories in /squid_cache/00

Chapter 3

[ 79 ]

2010/07/20 21:48:35| Making directories in /squid_cache/01

2010/07/20 21:48:35| Making directories in /squid_cache/02

2010/07/20 21:48:35| Making directories in /squid_cache/03

...

We should use this opon whenever we add new cache directories in the Squid

conguraon le.

What just happened?

When the squid command is run with the opon -z, Squid reads all the cache directories

from the conguraon le and checks if they already exist. It will then create the directory

structure for all the cache directories that don't exist.

Have a go hero – adding cache directories

Add two or three test cache directories with dierent values of level 1 (8, 16, and 32) and

level 2 (64, 256, and 512) to the conguraon le. Then try creang them using the squid

command. Now study the dierence in the directory structure.

Using a different conguration le

The default locaon for Squid's main conguraon le is ${prefix}/etc/squid/squid.

conf

. Whenever we run Squid, the main conguraon is read from the default locaon.

While tesng or deploying a new conguraon, we may want to use a dierent conguraon

le just to check whether it will work or not. We can achieve this by using the opon

-f,

which allows us to specify a custom locaon for the conguraon le. Let's see an example:

squid -f /etc/squid.minimal.conf

# OR

squid -f /etc/squid.alternate.conf

If Squid is run this way, Squid will try to load the conguraon from /etc/squid.minimal.

conf

or /etc/squid.alternate.conf, and it will completely ignore the squid.conf

from the default locaon.

Getting verbose output

When we run Squid from the terminal without any addional command line opons, only

warnings and errors are displayed on the terminal (or

stderr). However, while tesng,

we would like to get a verbose output on the terminal, to see what is happening when

Squid starts up.

Running Squid

[ 80 ]

Time for action – debugging output in the console

To get more informaon on the terminal, we can use the opon -d. The following is

an example:

squid -d 2

We must specify an integer with the opon -d to indicate the verbosity level. Let's have

a look at the meaning of the dierent levels:

Only crical and fatal errors are logged when level

0 (zero) is used.

Level

1 includes the logging of important problems.

Level

2 and higher includes the logging of informave details and other acons.

Higher levels result in more output on the terminal. A sample output on the terminal with

level

2 would look similar to the following:

2010/07/20 21:40:53| Starting Squid Cache version 3.1.10 for i686-pc-

linux-gnu...

2010/07/20 21:40:53| Process ID 15861

2010/07/20 21:40:53| With 1024 file descriptors available

2010/07/20 21:40:53| Initializing IP Cache...

2010/07/20 21:40:53| DNS Socket created at [::], FD 7

2010/07/20 21:40:53| Adding nameserver 192.168.36.222 from /etc/resolv.

conf

2010/07/20 21:40:53| User-Agent logging is disabled.

2010/07/20 21:40:53| Referer logging is disabled.

2010/07/20 21:40:53| Unlinkd pipe opened on FD 13

2010/07/20 21:40:53| Local cache digest enabled; rebuild/rewrite every

3600/3600 sec

2010/07/20 21:40:53| Store logging disabled

2010/07/20 21:40:53| Swap maxSize 0 + 262144 KB, estimated 20164 objects

2010/07/20 21:40:53| Target number of buckets: 1008

2010/07/20 21:40:53| Using 8192 Store buckets

2010/07/20 21:40:53| Max Mem size: 262144 KB

2010/07/20 21:40:53| Max Swap size: 0 KB

2010/07/20 21:40:53| Using Least Load store dir selection

2010/07/20 21:40:53| Current Directory is /opt/squid/sbin

2010/07/20 21:40:53| Loaded Icons.



Chapter 3

[ 81 ]

2010/07/20 21:40:53| Accepting HTTP connections at [::]:3128, FD 14.

2010/07/20 21:40:53| HTCP Disabled.

2010/07/20 21:40:53| Squid modules loaded: 0

2010/07/20 21:40:53| Ready to serve requests.

2010/07/20 21:40:54| storeLateRelease: released 0 objects

...

As we can see, Squid is trying to dump a log of acons that it is performing. The messages

shown are mostly startup messages and there will be fewer messages when Squid starts

accepng connecons.

Starng Squid in debug mode is quite helpful when Squid is up and running and

users complain about poor speeds or being unable to connect. We can have a

look at the debugging output and the appropriate acons to take.

What just happened?

We started Squid in debugging mode and can now see Squid wring an output on

the command line, which is basically a log of the acons which Squid is performing. If Squid

is not working, we'll be able to see the reasons on the command line and we'll be able to

take acons accordingly.

Full debugging output on the terminal

The opon -d species the verbosity level of the output dumped by Squid on the terminal.

If we require all of the debugging informaon on the terminal, we can use the opon -X,

which will force Squid to write debugging informaon at every single step. If the opon

-X is used, we'll see informaon about parsing the squid.conf le and the acons taken

by Squid, based on the conguraon direcves encountered. Let's see a sample output

produced when opon -X is used:

...

2010/07/21 21:50:51.515| Processing: 'acl my_machines src 172.17.8.175

10.2.44.46 127.0.0.1 172.17.11.68 192.168.1.3'

2010/07/21 21:50:51.515| ACL::Prototype::Registered: invoked for type src

2010/07/21 21:50:51.515| ACL::Prototype::Registered: yes

2010/07/21 21:50:51.515| ACL::FindByName 'my_machines'

2010/07/21 21:50:51.515| ACL::FindByName found no match

2010/07/21 21:50:51.515| aclParseAclLine: Creating ACL 'my_machines'

2010/07/21 21:50:51.515| ACL::Prototype::Factory: cloning an object for

type 'src'

Running Squid

[ 82 ]

2010/07/21 21:50:51.515| aclParseIpData: 172.17.8.175

2010/07/21 21:50:51.515| aclParseIpData: 10.2.44.46

2010/07/21 21:50:51.515| aclParseIpData: 127.0.0.1

2010/07/21 21:50:51.515| aclParseIpData: 172.17.11.68

2010/07/21 21:50:51.515| aclParseIpData: 192.168.1.3

...

Let's see what this output means. In the rst line, Squid encountered a line dening an ACL

my_machines. The next few lines in the output describe Squid invoking dierent methods

to parse, creang a new ACL, and then assigning values to it. This opon can be very helpful

while debugging ambiguous ACLs.

Running as a normal process

Someme during tesng, we may not want Squid to run as a daemon. Instead, we may

want it to run as a normal process which we can interrupt easily by pressing CTRL-C. To

achieve this, we can use the opon -N. When this opon is used, Squid will not run in the

background it will run in the current shell instead.

Parsing the Squid conguration le for errors or warnings

It's a good idea to parse or check the conguraon le (squid.conf) for any errors or

warnings before we actually try to run Squid, or reload a Squid process which is already

running in a producon deployment. Squid provides an opon -k with an argument parse,

which, if supplied, will force Squid to parse the current Squid conguraon le and report

any errors and warnings. Squid -k is also used to check and report direcve and opon

changes when we upgrade our Squid version.

Time for action – testing our conguration le

As we learned before, we can use the -k parse opon to test our conguraon le. Now,

we are going to add a test line and see if Squid can catch the error.

1. For example, let's add the following line to our squid.conf le:

unknown_directive 1234

2. Now we'll run Squid with the -k parse opon as follows:

squid -k parse

Chapter 3

[ 83 ]

3. As unknown_directive is not a valid direcve for the Squid conguraon le, we

should get an error similar to the following:

2010/07/21 22:28:40| cache_cf.cc(346) squid.conf:945 unrecognized:

'unknown_directive'

So, if we nd an error within our conguraon le, we can go back and x the errors and

then parse the conguraon le again.

What just happened?

We rst added an invalid line in to our conguraon le and then tried to parse it using

a squid command which resulted in an error. It is a good idea to always parse the

conguraon le before starng Squid.

Sending various signals to a running Squid process

Squid provides the -k opon to send various signals to a Squid process which is already

running. Using this opon, we can send various management signals such as, reload the

conguraon le, rotate the log les, shut down the proxy server, switch to debug mode,

and many more. Let's have a look at some of the important signals which are available.

Please note that when the argument parse is used with the opon -k, no

signal is sent to the running Squid process.

Reloading a new conguration le in a running process

We may need to make changes to our Squid conguraon le somemes, even when it is

deployed in producon mode. In such cases, aer making changes, we don't want to restart

our proxy server because that will introduce a signicant downme and will also interrupt

acve connecons between clients and remote servers. In these situaons, we can use the

opon -k with reconfigure as an argument, to signal Squid to re-read the conguraon

le and apply the new sengs:

squid -k reconfigure

The previous command will force Squid to re-read the conguraon le, while serving the

requests normally and not terminang any acve connecons.

It's good pracce to parse the conguraon le for any errors or warning using

the -k parse opon before issuing the reconfigure signal.

Running Squid

[ 84 ]

Shutting down the Squid process

To shut down a Squid process which is already running, we can issue a shutdown signal with

the help of the opon -k as follows:

squid -k shutdown

Squid tries to terminate connecons gracefully when it receives a shutdown signal. It will allow

the acve connecon to nish before the process is completely shut down or terminated.

Interrupting or killing a running Squid process

If we have a lot of clients, Squid can take a signicant amount of me before it completely

terminates itself on receiving the -k shutdown signal. To get Squid to immediately stop

serving requests, we can use the -k interrupt signal. The -k interrupt signal will not

allow Squid to wait for acve connecons to nish and will stop the process immediately.

In some cases, the Squid process may not be stopped using

-k shutdown or -k

interrupt

signals. If we want to terminate the process immediately, we can issue a -k

kill

signal, which will immediately kill the process. This signal should only be used when

Squid can't be stopped with -k shutdown or -k interrupt. For example, to send a -k

kill

signal to Squid, we can use the following command:

squid -k kill

This command will kill the Squid process immediately.

Please note that shutdown, interrupt, and kill are Squid signals and not

the system kill signals which are emulated.

Checking the status of a running Squid process

To know whether a Squid process is already running or not, we can issue a check signal

which will tell us the status of the Squid process. Squid will also validate the conguraon le

and report any fatal problems. If the Squid process is running ne, the following command

will exit without prinng anything:

squid -k check

Otherwise, if the process has exited, this command will give an error similar to the following

error message:

squid: ERROR: Could not send signal 0 to process 25243: (3) No such

process

Chapter 3

[ 85 ]

Have a go hero – check the return value

Aer running squid -k check, nd out the return value or status in scenarios when:

Squid was running

Squid was not running

Soluon: The return value or the status of a command can be found out by using the

command

echo $?. This will print the return status or value of the previous command

that was executed in the same shell. Return values should be (1) -> 0, (2) -> 1.

Sending a running process in to debug mode

If we didn't start Squid in debug mode and for tesng we don't want to stop an already

running Squid process, we can issue a debug signal which will send the already running

process into debug mode. The debugging output will then be wrien to Squid's cache.log

le located at ${prefix}/var/logs/cache.log or /var/log/squid/cache.log.

The Squid process running in debug mode may write a log of debugging output

to the cache.log le and may quickly consume a lot of disk space.

Rotating the log les

The log les used by Squid grow in size over a period of me and can consume a signicant

amount of disk space. To get rid of the older logs, Squid provides a rotate signal that can

be issued when we want to rotate the exisng log les. Upon receiving this signal, Squid

will close the current log les, move them to other lenames, or delete them based on the

conguraon direcve logfile_rotate (in squid.conf) then reopen the les to write

the logs.

It's quite inconvenient to rotate log les manually. So, we can automate the process of log

le rotaon with the help of a cron job. Let's say we want to rotate the log les at midnight,

the corresponding cron tab entry will be:

59 23 * * * /opt/squid/sbin/squid -k rotate

Please note that the path to Squid executable may dier depending on the installaon prex.

We'll learn more about log les in Chapter 5, Understanding Log Files and Log Formats.



Running Squid

[ 86 ]

Forcing the storage metadata to rebuild

When Squid starts, it tries to load the storage metadata. If Squid fails to load the storage

metadata, then it will try to rebuild it. If it receives any requests during that period, Squid

will try to sasfy those requests in parallel, which results in slow rebuild. We can force Squid

to rebuild the metadata before it starts processing any requests using the opon -F on the

command line. This may result in a quick rebuild of storage metadata but clients may have

to wait for a signicant me, if the cache is large. For large caches, we should try to avoid

this opon:

squid -F

Squid will now rebuild the cache metadata and will not serve any client requests unl the

metadata rebuild process is complete.

Double checking swap during rebuild

The opon -F determines whether Squid should serve requests while the storage metadata

is being rebuilt. We have another opon, -S, which can be used to force Squid to double

check the cache during rebuild. If we use the

-S opon along with the opon -d as follows:

squid -S -d 1

This will produce a debugging output on the terminal which will look similar to the following:

2010/07/21 21:29:22| Beginning Validation Procedure

2010/07/21 21:29:22| UFSSwapDir::doubleCheck: SIZE MISMATCH

2010/07/21 21:29:22| UFSSwapDir::doubleCheck: ENTRY SIZE: 1332, FILE

SIZE: 114

2010/07/21 21:29:22| UFSSwapDir::dumpEntry: FILENO 00000092

2010/07/21 21:29:22| UFSSwapDir::dumpEntry: PATH /squid_

cache/00/00/00000092

2010/07/21 21:29:22| StoreEntry->key: 0060E9E547F3A1AAEEDE369C5573F8D9

2010/07/21 21:29:22| StoreEntry->next: 0

2010/07/21 21:29:22| StoreEntry->mem_obj: 0

2010/07/21 21:29:22| StoreEntry->timestamp: 1248375049

2010/07/21 21:29:22| StoreEntry->lastref: 1248375754

2010/07/21 21:29:22| StoreEntry->expires: 1279911049

2010/07/21 21:29:22| StoreEntry->lastmod: 1205097338

2010/07/21 21:29:22| StoreEntry->swap_file_sz: 1332

2010/07/21 21:29:22| StoreEntry->refcount: 1

2010/07/21 21:29:22| StoreEntry->flags: CACHABLE,DISPATCHED

2010/07/21 21:29:22| StoreEntry->swap_dirn: 0

Chapter 3

[ 87 ]

2010/07/21 21:29:22| StoreEntry->swap_filen: 146

2010/07/21 21:29:22| StoreEntry->lock_count: 0

2010/07/21 21:29:22| StoreEntry->mem_status: 0

...

Squid is basically trying to validate each and every cached object on the disk.

Automatically starting Squid at system startup

Once we have a properly congured and running proxy server, we would like it to start

whenever the system is started or rebooted. Next, we'll have a brief look at the most

common ways of adding or modifying the boot scripts for popular operang systems. These

methods will most probably work on your operang system. If they don't, please refer to the

corresponding operang system manual for informaon on boot scripts.

Adding Squid command to /etc/rc.local le

Adding the full path of the Squid executable le is the easiest way to start Squid on system

startup. The le /etc/rc.local is executed aer the system boots as the super or root

user. We can place the Squid command in this le and it will run every me the system

boots up. Add the following line at the end of the

/etc/rc.local le:

${prefix}/sbin/squid

Please replace ${prefix} with the installaon prex which you used before compiling Squid.

Adding init script

Alternavely, we can add a simple init script which will be a simple shell script to start the

Squid process or send various signals to a running Squid process. Init scripts are supported

by most operang systems and are generally located at

/etc/init.d/, /etc/rc.d/, or

/etc/rc.d/init.d/. Any shell script placed in any of these directories is executed at

system startup with root privileges.

Time for action – adding the init script

We are going to use a simple shell script, as shown in the following example, which takes

a single command line argument and acts accordingly:

#!/bin/bash

# init script to control Squid server

case "$1" in

Running Squid

[ 88 ]

start)

/opt/squid/sbin/squid

;;

stop)

/opt/squid/sbin/squid -k shutdown

;;

reload)

/opt/squid/sbin/squid -k reconfigure

;;

restart)

/opt/squid/sbin/squid -k shutdown

sleep 2

/opt/squid/sbin/squid

;;

echo $"Usage: $0 {start|stop|reload|restart}"

exit 2

esac

exit $?

Please note the absolute path to the Squid executable here and change it accordingly. We

can save this shell script to a le with the name squid and then move it to one of the

directories we discussed earlier depending on our operang system.

The Squid source carries an init script located at contrib/squid.

rc, but it's installed only on a few systems by default.

What just happened?

We added an init script to control the Squid proxy server. Using this script, we can start,

stop, restart, or reload the process. It's important that the script is placed in the correct

directory, otherwise the Squid proxy server will not start on system startup.

Chapter 3

[ 89 ]

Pop quiz

1. What should be the rst step undertaken aer adding new cache directories to the

conguraon le?

a. Reboot the server.

b. Run the squid command with the -z opon.

c. Do nothing, Squid will take care of everything by itself.

d. Run Squid with root privileges.

2. Where should we look for errors in case Squid is not running properly?

a. Access log le.

b. Cache log le.

c. Squid conguraon le.

d. None of the above.

3. In which scenario should we avoid debug mode?

a. While tesng the server.

b. While Squid is deployed in producon mode.

c. When we have a lot of spare disk space.

d. When we have a lot of spare RAM.

Summary

In this chapter, we learned about the various command line opons which can be used while

running Squid, how to start the Squid process in a dierent mode, and how to send signals

to a process which is already running. We also learned about creang new cache directories

aer adding them to the Squid conguraon le.

We specically covered the following:

Parsing the Squid conguraon le for errors and warnings.

Using various opons to get suitable debugging outputs while tesng.

Reloading a new conguraon in a Squid process which is already running, without

interrupng service.

Automac rotaon of log les to recover disk space.



Running Squid

[ 90 ]

We also learned about conguring our system to start a Squid process whenever the system

boots up.

Now that we have learned about running a Squid process, we're ready to explore access

control lists in detail and test them on a live Squid server.

Getting Started with Squid's Powerful

ACLs and Access Rules

In the previous chapters, we learned about installing, conguring, and running

Squid in dierent modes. We also learned the basics of protecng our Squid

proxy server from unauthorized access, and granng or revoking access based

on dierent criteria. We previously had a brief overview of Access Control

Lists in Chapter 2, Conguring Squid. However, in this chapter, we are going

to explore Access Control Lists in detail. We'll also construct rules for a few

example scenarios.

In this chapter, we will learn about:

Various types of ACL lists

Types of access rules

Mixing ACL lists and access list rules to achieve complex access rules

Tesng access rules with

squidclient

Once we have a Squid proxy server up and running, we can dene rules for allowing or

denying access to dierent people or to control the usage of resources. It is also possible

to dene lower and upper limits on the usage of dierent resources. Access list rules, which

are basically combinaons of allow or deny keyword and ACL elements, play a vital role in

achieving this type of control. So let's get started.



Geng Started with Squid’s Powerful ACLs and Access Rules

[ 92 ]

Access control lists

Access Control Lists are the base elements in the Squid conguraon le, which help in

idenfying web transacons, by various aributes of that transacon. We have already

learned about the syntax for construcng ACLs in Chapter 2. So, let's write an ACL element

that can idenfy all the requests from a group of clients in the IP range 192.0.2.1 to

192.0.2.127.

acl clients src 192.0.2.0/25

That was quite easy, as 192.0.2.0/25 denotes that the rst 25 bits of the available 32 bits

in the IP address are xed and only the last seven bits can vary, which will result in the range

0-127. In the conguraon above,

192.0.2.0/25 denotes a subnet with 127 possible IP

addresses. For more informaon on subnets, please check

http://en.wikipedia.org/

wiki/Subnetwork#IPv4_subnetting

In the previous ACL element, we used the

src ACL type to idenfy the IP address of the

source of the request. There are various other ACL types available, which can be used to

idenfy requests and specify acons that should be taken for the idened requests.

So, let's have a look at a few important ACL types.

Fast and slow ACL types

All ACL types fall into two major categories known as fast and slow ACL types. The fast ACL

types use informaon accompanied with a web transacon. These ACL types generally

perform matching against the source IP address, desnaon domain name, URL, HTTP

request header elds, and so on. The slow ACL types need to perform addional lookups,

and this introduces a signicant delay which is why they are known as the slow ACL types.

The examples of slow ACL types are dst and srcdomain, as these will involve DNS and

reverse DNS lookups respecvely. For a list of the latest fast and slow ACL types, please check

http://wiki.squid-cache.org/SquidFaq/SquidAcl#Fast_and_Slow_ACLs.

Source and destination IP address

Every request received by Squid from a client has several properes such as the source

IP address, desnaon IP address, source domain, desnaon domain, source MAC address,

and so on. So, when we dene an ACL element, we basically try to pick up a request and

match its properes with a pre-determined value.

Chapter 4

[ 93 ]

Time for action – constructing ACL lists using IP addresses

1. The two ACL types, src and dst, are used to idenfy the source and desnaon

IP addresses of a parcular request. There are dierent ways to specify the IP

addresses. The rst one is to specify a single IP address per ACL element, as follows:

acl client src 192.0.2.25/32

2. The previous ACL element will match all the requests being generated from the

client 192.0.2.25. We are supposed to specify a mask while specifying the IP

address, but if we don't then, Squid will try to determine the mask automacally.

To learn more about mask, and Classless Inter Domain Roung (CIDR) notaon,

please check http://en.wikipedia.org/wiki/Classless_Inter-Domain_

Routing

and http://en.wikipedia.org/wiki/CIDR_notation.

For example, the ACL following element will also idenfy the requests from the

client 192.0.2.25:

acl client src 192.0.2.25

3. Therefore, in the previous example, Squid will automacally set the mask to 32.

So we have covered the ways to specify a single IP address, now let's a have a look

at the ways in which to specify mulple IP address.

In its simplest form, we can specify mulple addresses using subnets. If we want to specify

clients in mulple connuous subnets which can't be represented as a single subnet, we can

specify them using a range of subnets. Let's say we want to idenfy all the clients in a small

research lab which has IP addresses ranging from

192.0.2.0 to 192.0.2.31. Let's see the

ACL for this case:

acl research_lab src 192.0.2.0/27

The above ACL element will idenfy the IP addresses in the range 192.0.2.0 to

192.0.2.31 as only the last ve bits of the last octet in the IP address are variable.

Construcng an ACL element with the ACL type

dst is similar. Let's say we want to write an

ACL that will idenfy all requests desned to 198.51.100.86. We can use the following

dst ACL type:

acl website dst 198.51.100.86

The previous ACL element will idenfy all requests that are desned to the IP address

198.51.100.86.

The src and dst are fast and slow ACL types respecvely.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 94 ]

What just happened?

We have just learned about two simple ways of specifying mulple IP addresses while

construcng ACL lists which use source and desnaon IP addresses in a request. These are

the most popular techniques of specifying IP addresses because they are simple to interpret

and chances of confusion are very low.

Time for action – using a range of IP addresses to build ACL lists

Now, let's say in a company, the markeng department is spread over ve oors. We

have used a convenon 10.1.FLOOR_NUM.MACHINE_NUM to assign IP addresses to each

machine on every oor. The oor number starts from two and goes up to six. So, we basically

have the following subnets.

10.1.2.0/24 # 2nd Floor

10.1.3.0/24 # 3rd Floor

10.1.4.0/24 # 4th Floor

10.1.5.0/24 # 5th Floor

10.1.6.0/24 # 6th Floor

A simple way to idenfy all these client computers is dened in the following ACL:

acl mkt_dept src 10.1.2.0/24 10.1.3.0/24 10.1.4.0/24 10.1.5.0/24

10.1.6.0/24

The previous methods are a bit cluered and long winded. Squid provides a simple way to

specify mulple addresses the following is an example of this:

acl mkt_dept src 10.1.2.0-10.1.6.0/24

The preceding ACL dening mkt_dept is simply a shortened version of the following:

acl mkt_dept src 10.1.2.0/24

acl mkt_dept src 10.1.3.0/24

acl mkt_dept src 10.1.4.0/24

acl mkt_dept src 10.1.5.0/24

acl mkt_dept src 10.1.6.0/24

So, we can use the shortened example for specifying connuous subnets. Another good

use of this method is to specify connuous IP addresses in a subnet. For example, let's say

we want to idenfy all the requests from client in the range of 10.2.44.25 to 10.2.44.35.

The IP address range we are trying to idenfy can't be put under a subnet as that will include

other IP addresses too. So, we can use the shortened version to idenfy this IP address range

as follows:

acl bad_clients src 10.2.44.25-10.2.44.35/32

Chapter 4

[ 95 ]

acl bad_clients src 10.2.44.25-10.2.44.35

The previous example also works, as Squid will try to establish a mask automacally.

So far in this secon, we learned about the dierent ways in which to idenfy client requests

on the basis of clients' IP addresses. The method to idenfy the requests on the basis of

desnaon IP addresses is similar. We just need to use the

dst ACL type instead of src.

ACL elements congured with dst as a ACL type works slower compared

to ACLs with the src ACL type, as Squid will have to resolve the desnaon

domain name before evaluang the ACL, which will involve a DNS query.

What just happened?

We have just learned how to ulize the range feature to specify a range of IP addresses to

minimize the number of IP addresses we have to specify while construcng ACL lists. We also

learned that we should try not to use the ACL type dst, as it's slower compared to the src

ACL type because Squid will have to resolve the desnaon domain before it can match ACL

lists of the dst type.

Have a go hero – make a list of the client IP addresses in your network

Try to make an exhausve list of clients' IP addresses on your network and then construct

ACL lists of the ACL type src. Now try to adjust the predened ACL localnet in the Squid

conguraon le and remove the ranges which are not present in your network.

Identifying local IP addresses

There is one more ACL type, myip, which falls in to this category. This can be used to idenfy

the local IP address on which Squid is serving requests. This is useful only if the server

running Squid has more than one network interface.

For example, if we have a proxy server with the IP addresses

192.0.2.25, 198.51.100.25,

and a public IP address. Let's say our research centers use 198.51.100.25 to connect to the

Squid proxy, and student labs use 192.0.2.25 to connect to Squid, then we can dene the

following two ACLs:

acl research_center_net myip 198.51.100.25

acl student_lab_ip myip 192.0.2.25

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 96 ]

Now using these ACLs, we can easily provide dierent services to dierent subnets

connecng to dierent interfaces on the proxy servers.

Although we use IP addresses to idenfy clients, in some rare cases, we can use a client's

MAC address for idencaon. Let's have a look at this.

Client MAC addresses

We can idenfy client requests on the basis of a client's MAC (Media Access Control

address) address. A MAC address is a unique idener assigned to network interface cards

usually by the manufacturer, for idencaon. Squid provides a special ACL type arp to

idenfy requests. MAC addresses are generally represented as XX:XX:XX:XX:XX:XX, where

X is a hexadecimal number. Let's construct an ACL using a client's MAC address

acl mac_acl arp 00:1D:7D:D4:F3:EE

So, the previous ACL mac_acl will match all requests originang from a client with the MAC

address 00:1D:7D:D4:F3:EE.

This ACL type is available only if Squid was compiled with the

--enable-eui or --enable-arp-acl opon, depending on

the Squid version we have.

Please note that this ACL type is not supported on all operang systems and we should

conrm it's availability on our operang system before using it.

Squid can only detect MAC addresses of clients on the same broadcast

domain. For more informaon on broadcast domains, please check

http://en.wikipedia.org/wiki/Broadcast_domain.

Source and destination domain names

It's convenient to use IP addresses while idenfying requests with respect to client IP

addresses because we already know the network for which we are dening the ACLs.

However, when we want to idenfy requests on the basis of desnaon addresses, it's

not convenient or foolproof to use IP addresses because:

The IP address of the remote host providing the blocked service may change

Resolving the desnaon address is a slow process, which will introduce latency



Chapter 4

[ 97 ]

Squid provides two ACL types namely, srcdomain and dstdomain, to construct ACLs based

on source and desnaon domain names respecvely. However, we prefer using domain

names instead of IP addresses for idenfying requests with respect to the desnaon ,

for the reason which we have explained previously. We should note that srcdomain and

dstdomain are slow and fast ACL types respecvely.

Time for action – constructing ACL lists using domain names

Let's construct an ACL to idenfy requests for pages on www.example.com.

acl example dstdomain www.example.com

The previous ACL element will be able to idenfy any request for any web page on the

domain www.example.com. So, if we try to browse http://www.example.com/ or

http://www.example.com/index.html, the URLs will be idened by the ACL example.

However, the problem with this ACL is that it will not be able to idenfy requests to

example.com or some.example.com and so on. So, if we browse to http://example.

com/

or http://video.example.com/, our requests will not be idened

by the ACL example.

To overcome this problem, we can prex the domain name with a period or dot (.). A dot is

treated as a wildcard by Squid and an ACL will match that domain or any sub-domain of that

parcular domain. Let's see an example.

acl example dstdomain .example.com

The previous ACL element will match example.com or any of its sub-domains such as

video.example.com, news.example.com, www.exmaple.com and so on.

Similarly, if we have an ACL dened as follows:

acl example_uk dstdomain .uk.example.com

We will be able to match requests to uk.example.com or any sub-domain of

uk.example.com but not example.com, as it's not a sub-domain of uk.example.com.

So, now we know how to construct ACLs using desnaon domain names. Using source

domain names to idenfy requests is similar, and the ACL type for that is

srcdomain.

Here is an example:

acl our_network srcdomain .company.example.com

The ACL our_network will match any requests originang from company.example.com or

any of its sub-domains.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 98 ]

ACL elements with srcdomain as ACL types works slower, compared to

ACLs with the dstdomain ACL type because Squid will have to perform

a reverse DNS lookup before evaluang ACL. This will introduce signicant

latency. Moreover, the reverse DNS lookup may not work properly with

local IP addresses.

What just happened?

In this secon, we saw how we can specify domains or sub-domains of a domain while

building ACL lists of the

srcdomain or dstdomain type. We should also note here that

ACL lists of the type srcdomain are slower than the ones of the type dstdomain, as Squid

will try a reverse lookup on the source IP address before matching.

Have a go hero – make a list of domains hosted in your local network

Try to nd out all the domains and their sub-domains which are hosted in your local area

network and organize them into an ACL list local_domains.

Regular expressions for domain names

Squid provides two interesng ACL types, namely, srcdom_regex and dstdom_regex,

which can be used to idenfy requests based on the source or desnaon domain names

aached with each request. Let's say, we don't want to allow websites that have a torrent

in their domain names, we would therefore need to construct the following ACL:

acl torrent_sites dstdom_regex -i torrent

http_access deny torrent_sites

This conguraon will simply deny access to any website that has a torrent in its domain

name. The ACL type srcdom_regex can be used in a similar way to control access from

domains matching a specic regular expression.

Destination port

Whenever a client requests some web documents, Squid needs to connect to the remote

server on a specic port number. For example, if a client requests http://example.com/,

Squid will try to connect to a server at example.com on port 80, because that's the default

port used for HTTP communicaon. Now, let's say a client requests https://example.com/,

then Squid will try to connect to the server example.com on port 443 because 443 is the

default port for secure HTTP (or HTTPS) communicaon.

Chapter 4

[ 99 ]

Time for action – building ACL lists using destination ports

So, we can use network port numbers to idenfy requests and then combine them with an

access rule to control access to resources. Squid provides an ACL type port, which can be

used to declare one or more port numbers to construct an ACL. Let's see a simple example:

acl allowed_port port 80

The previous ACL will match any request for port 80 on the desnaon server requests. The

ACL type

port can take more than one port or a range of ports as an argument. So, if we

want to assign mulple ports, we can list them as follows:

acl allowed_ports port 80 443 1025-65535

The ACL allowed_ports will match all the requests requesng a connecon to ports 80,

443, or any within the range of 1025 to 65535.

Normally, the policy is to allow only needed ports and deny connecon to all other ports

to prevent any type of illegal or unauthorized access. Squid has a lot of pre-dened ports

aggregated under the ACLs named

SSL_ports and Safe_ports. The following lines are

from the default conguraon le:

acl SSL_ports port 443

acl Safe_ports port 80 # http

acl Safe_ports port 21 # ftp

acl Safe_ports port 443 # https

acl Safe_ports port 70 # gopher

acl Safe_ports port 210 # wais

acl Safe_ports port 280 # http-mgmt

acl Safe_ports port 488 # gss-http

acl Safe_ports port 591 # filemaker

acl Safe_ports port 777 # multiling http

acl Safe_ports port 1025-65535 # unregistered ports

The preceding example contains a list of ports for well-known services such as, HTTP, FTP,

HTTPS, and so on and other ports over which HTTP is known to be safely transmied. We

should be careful while adding new ports to the safe ports lists. For example, if we add port

25 (Simple Mail Transfer Protocol or SMTP) to the safe ports list, clients will be able to relay

mails through our proxy server due to the design similaries in HTTP and SMTP protocols. So,

we should not add port 25 to safe ports list unless we are fully aware of the implicaons.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 100 ]

Also, the ports listed previously may not be an exhausve list of allowed ports for our

environment and we may need to allow more ports, depending upon the client requirements.

For example, we don't have port 873 (rsync) listed above, which may be needed in some

cases. So, we keep adding more ports to the safe ports list. Let's see an example:

acl SSL_ports port 443 563

acl SSL_ports port 444 # other SSL ports

acl Safe_ports port 80 # http

acl Safe_ports port 21 # ftp

acl Safe_ports port 443 563 444 # https

acl Safe_ports port 70 # gopher

acl Safe_ports port 119 # Usenet news group

acl Safe_ports port 210 # wais

acl Safe_ports port 280 # http-mgmt

acl Safe_ports port 488 # gss-http

acl Safe_ports port 591 # filemaker

acl Safe_ports port 777 # multiling http

acl Safe_ports port 873 # rsync

acl Safe_ports port 1025-65535 # unregistered ports

The general approach is to deny access to all the ports that are not in the allowed list. To

deny all the unsafe ports, we'll write:

http_access deny !Safe_ports

Now, clients will not be able to connect to any port on the remote server, which is not listed

in the Safe_ports list.

What just happened?

In this secon, we learned to specify ports or a range of ports for construcng ACL lists of

the type port. We also learned that we shouldn't allow all ports by default as that can lead

to illegal or unauthorized access.

Local port name

Squid provides another ACL type myportname, to idenfy the network port name but it's

dierent to port. The ACL type myportname idenes the port number on the Squid proxy

server where clients connect to Squid. Just like the ACL type myip, myportname is also

useful if we congure Squid to listen on more than one port using the http_port direcve

in the Squid conguraon le.

Chapter 4

[ 101 ]

Let's say we have Squid listening on port 3128 and 8080. Therefore, we can have the

following ACLs:

http_port 192.0.2.21:3128 name=research_port

http_port 192.0.2.25:8080 name=student_port

acl research_lab_net myportname research_port

acl student_lab_net myportname student_port

Now, we can use these two ACLs to control access to dierent subnets.

HTTP methods

Every HTTP request is accompanied by a HTTP method. For example, when we type

http://example.com/ in our web browser's address bar, we make a GET request to the

example.com server. Also, when we submit an online form, we make a POST request

to the server. Similarly, PUT, DELETE, CONNECT, and so on, are other commonly used

HTTP methods.

Squid provides the ACL type

method to idenfy requests based on the HTTP method

used for that parcular request. Normally, all the methods are allowed by default, except

the CONNECT method. The HTTP method CONNECT is a bit tricky and is used to tunnel

requests through HTTP proxies. So, we should allow only trusted requests such as HTTPS

through CONNECT.

Let's see an example of a method ACL from Squid's default conguraon:

acl CONNECT method CONNECT

Don't confuse CONNECT the ACL name, with CONNECT the HTTP method. The, ACL CONNECT

will idenfy all the requests with the HTTP method CONNECT. Now, let's see Squid's default

conguraon for using the CONNECT method:

acl SSL_ports port 443

http_access deny CONNECT !SSL_ports

By default, Squid will allow the CONNECT HTTP method only for SSL port 443, which is the

standard port for HTTPS communicaon. Again, we should go with the default conguraon

and add more ports to the SSL_ports ACL as the need arises.

We should note that the port numbers we add to the SSL ports list

should be added to the safe ports list as well.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 102 ]

Identifying requests using the request protocol

Squid provides another ACL type, proto, which can be used to idenfy the communicaon

protocol or URL scheme for a request. For example, when we access

http://example.com/,

the URL scheme used is HTTP and when we browse ftp://example.com/, the URL scheme

being used is FTP. Other commonly used URL schemes are gopher, urn, https, and whois.

Time for action – using a request protocol

to construct access rules

Let's say we want to deny all FTP requests from a parcular subnet, known as, research labs.

The conguraon should look similar to the following:

acl ftp_requests proto FTP

acl research_labs src 192.0.2.0/24

http_access deny research_labs ftp_requests

The previous conguraon lines will instruct Squid to deny all the FTP requests from the

network 192.0.2.0/24.

Please note that some rewalls block acve FTP by default. Please

check http://www.ncftp.com/ncftpd/doc/misc/

ftp_and_firewalls.html for more informaon.

Apart from the previously menoned standard schemes, we have a Squid specic URL

scheme called cache_object, which is used for the cache manager (cachemgr) interface.

By default, the cache manager can only be accessed from the Squid proxy server itself

because of the following code in squid.conf:

acl manager proto cache_object

acl localhost src 127.0.0.1/32

http_access allow manager localhost

http_access deny manager

Therefore the URL scheme cache_object can only be accessed from the localhost

(the proxy server itself). If we want to access the cache_object URL scheme from other

machines (for example, from the machines of all our administrators), we can add the

following special access rules as follows:

acl manager proto cache_object

acl localhost src 127.0.0.1/32

acl admin_machines src 192.0.2.86 192.0.2.10

http_access allow manager localhost

http_access allow manager admin_machines

http_access deny manager

Chapter 4

[ 103 ]

The previous conguraon lines will ensure that only administrators can use Squid's cache

manager interface.

What just happened?

We have just seen that it is possible to build ACL lists based on the protocol used by the

client in the requests. By using this type of ACL we can completely deny requests to all

other protocols than HTTP and HTTPS, in very restricted environments.

Time-based ACLs

Access control based on me is one of the most excing features of Squid. Using the time

ACL type, we can specify a me period in the form of day(s) or me range. Then the requests

during that me period will be matched or idened by that ACL. The format of the

time

ACL type is as follows:

acl ACL_NAME time [day-abbreviation] [h1:m1-h2:m2]

Specifying days and me range are oponal, but one of them must be specied. The

following are the abbreviaons used:

Day Abbreviaon

Sunday S

Monday M

Tuesday T

Wednesday W

Thursday H

Friday F

Saturday A

All Weekdays D

We should note that me is taken only when the ACL is checked.

Therefore, it may not aect the requests made during the allow

period and performed during the deny period and vice-versa.

So, for idenfying all the requests on Sunday, Monday, and Wednesday, we'll have the

following ACL:

acl days time SMW

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 104 ]

The day abbreviaons should be wrien altogether. While specifying the me, h1:m1 should

be less than h2:m2. Moreover, me should be in a 24 hour format. Now, let's construct a few

ACLs for the typical oce hours:

acl morning_hrs time MTWHF 09:00-12:59

acl lunch_hrs time D 13:00-13:59

acl evening_hrs time MTWHF 14:00-18:00

Now, let's say we don't want our clients to access YouTube during oce hours, but it's ok if

they access it during lunch hours. Also, we will allow browsing only in oce hours. So, we'll

have the following lines in our conguraon le:

acl youtube dstdomain .youtube.com

acl office dstdomain .office.example.com

http_access allow office

http_access allow youtube !morning_hours !evening_hours

http_access deny all

URL and URL path-based identication

Squid provides the ACL type url_regex, using which we can specify regular expressions

which will be matched against the enre URL. URLs are generally of the form http://

example.com/path/directory/index.php?page=2&count=10

or http://example.

com/path2/index.html#example-section

. So, let's construct an ACL that will match all

requests to JPG images on the example.com server.

acl example_com_jpg url_regex ^http://example.com/.*\.jpg$

By default, the regular expressions passed to any ACL type are treated as case-sensive.

Hence, the previous regular expression will not match if a JPG image on the server has a

lename linux.JPG. To make the regular expressions case-insensive, we can use the

opon -i while dening ACL. For example:

acl example_com_jpg url_regex -i ^http://example.com/.*\.jpg$

Now, the ACL example_com_jpg will match all the JPG images on the server

example.com.

In the URL

http://example.com/path/directory/index.php?page=2&count=10,

the secon path/directory/index.php?page=2&count=10 is the URL path. So, the

URL path is basically the URL minus the URL scheme and hostname.

Similar to

url_regex, we have another ACL type called urlpath_regex. The only

dierence is that url_regex searches for the regular expression in the complete URL

while urlpath_regex searches only in the URL path.

Chapter 4

[ 105 ]

This ACL type is specically helpful when we only want to search a string in the path and not

in the hostname. Let's see an example:

acl torrent urlpath_regex -i torrent

In another example, let's try to block some video content:

acl videos urlpath_regex -i \.(avi|mp4|mov|m4v|mkv|flv)(\?.*)?$

The above ACL videos will match a few of the well known video formats.

Please note that regular expression matching is slower than other

ACL type matching. It is highly recommended to break the regular

expression into dstdomain and urlpath_regex to enhance

ACL matching performance.

Have a go hero – ACL list for audio content

Construct an ACL list which can be used to idenfy requests for at least three types of

audio les.

Matching client usernames

Squid supports idenfying clients using the ident protocol by providing the ACL type ident.

Squid tries to connect to the ident server on the client machine and get the username

corresponding to the current request, when the ident ACL type is used. The username that

Squid will receive may not be the username of the logged in user. For example, when Squid

tries to get the username of a down-stream proxy server, it may get the username squid,

proxy, or nobody, depending on the value of the cache_effective_user direcve.

The ident protocol is not really secure and it's very easy to spoof an ident

server. So, it should be used carefully.

If we have an exhausve list of usernames for our network, we can construct an ACL

as follows:

acl friends ident john sarah michelle priya

http_access allow friends

http_access deny all

If the previous conguraon is used, only the users specied previously will be able to access

our proxy server.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 106 ]

Please note that the ident lookups are blocking calls and Squid will wait for the

reply before it can proceed with processing the request, and that may increase

the delays by a signicant margin.

Normally, it's not possible to specify all users especially if we have a large network. For

such cases, Squid provides a special keyword, REQUIRED, which can be used to enforce

a username for all the requests. If an ident lookup results in any username, the ACL is

matched, otherwise the ACL will not be matched.

To know more about the ident protocol, please visit

http://en.wikipedia.org/wiki/Ident.

So, to enforce a username, we can have the following conguraon in our Squid

conguraon le:

acl username ident REQUIRED

http_access allow username

http_access deny all

Regular expressions for client usernames

Similar to ident, we have another ACL type, ident_regex, which can be used to specify

regular expressions instead of complete usernames. This is helpful in networks, where we

have specic formats for usernames. For example, let's say we use the department name as

a sux to the usernames. Then we can construct the following ACLs:

acl mkt_dept ident_regex -i \.marketing$

acl cust_care_dept ident_regex -i \.cust_care$

Now, based on the above ACLs, we can have control over the way the resources are used by

the two departments.

Proxy authentication

The best way to keep bad guys out of a proxy server is to use proxy authencaon. In this

case, a client will need to enter a username and password to be able to use our proxy

server. If proxy authencaon is enabled, the client will send an addional header with

authencaon credenals, which Squid will evaluate and check whether the client should be

allowed to use our proxy server. The interesng part is that Squid can't validate credenals

sent by the client on its own. Squid passes the credenals it receives from a client to a helper

process, and the validity of credenals is determined by the external process.

Chapter 4

[ 107 ]

So, we have the proxy_auth ACL type where we can specify a list of usernames for

authencaon. However, as we saw previously, Squid can't validate credenals itself; we

must specify at least one authencaon scheme for validang the username and password

sent by the client. Authencaon schemes are congured using the auth_param direcves

in our Squid conguraon le.

Squid supports the Basic, Digest, NTLM, and Negoate authencaon schemes, and all of

them are built by default.

Time for action – enforcing proxy authentication

If we want to enforce proxy authencaon, we can add the following lines to our

conguraon le:

acl authenticated proxy_auth REQUIRED

http_access allow authenticated

http_access deny all

With the previous conguraon, only authencated users will be able to access the proxy

server. If we want to specically idenfy individual clients with usernames, we can pass

a list of users as well. This may be needed if we want to give extra privileges to some users.

For example:

acl authenticated proxy_auth REQUIRED

acl admins proxy_auth john sarah

acl special_website dstdomain admin.example.com

http_access allow admins special_website

http_access deny special_website

http_access allow authenticated

http_access deny all

Therefore, if we have the preceding lines in our conguraon le, only the users john and

sarah will be able to access admin.example.com, but other authencated users will be

able to access all websites except admin.example.com.

Regular expressions for usernames

Similar to proxy_auth, we have the proxy_auth_regex ACL type, which can be used

to idenfy usernames using a regular expression. Let's say, we follow a nomenclature for

allong usernames to our employees and all employees in the accounts department will

have the username of accounts_username, then we can construct an ACL matching the

usernames of the employees from the accounts department as follows:

acl accounts_dept proxy_auth_regex ^accounts_

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 108 ]

If we want employees in the accounts department to access only the accounts website, we

can have the following conguraon:

acl accounts_dept proxy_auth_regex ^accounts_

acl accounts_web dstdomain .account.example.com

http_access allow accounts_dept accounts_web

http_access deny all

In accordance with the previous conguraon, employees in the accounts department will

be able to access only the accounts website.

What just happened?

In the previous example, we saw how we can enforce proxy authencaon for all the clients'

or only a group of clients using dierent types of ACL lists. We'll learn more about proxy

authencaon in Chapter 7.

User limits

Squid provides dierent ACL types, using which we can construct ACL lists to limit the

number of connecons from a client and the number of logins per user. Let's have a look.

Maximum number of connections per client

Generally, we want to place a limit on the number of parallel connecons a client can ulize

to enforce a fair usage policy. Squid provides an ACL type maxconn, which will match if a

client's IP address has more than the maximum specied acve connecons. An example

of this could be if we want to enforce a maximum of 25 connecons per client:

acl connections maxconn 25

http_access deny connections

According to the preceding conguraon lines, a client will return an access denied error if it

tries to open more than 25 parallel connecons.

In a dierent scenario, we may want to enforce dierent parallel connecon limits for

dierent user groups. Let's see an example of such a conguraon:

acl normal_users src 10.2.0.0/16

acl corporate_users src 10.1.0.0/16

acl norm_conn maxconn 15

acl corp_conn maxconn 30

http_access deny normal_users norm_conn

http_access deny corporate_users corp_conn

Chapter 4

[ 109 ]

So, according to the preceding conguraon lines, normal_users will have a maximum

limit of 15 parallel connecons, while corporate_users will enjoy a maximum limit of

30 parallel connecons.

Maximum logins per user

Squid provides an ACL type max_user_ip, which is matched when a single username

is used for authencaon from more than a specied number of machines. A direcve

authenticate_ip_ttl is used to determine the meout for the IP address entries. So,

if we want our clients to log in from, no more than, three dierent machines, we can

use the following conguraon:

acl ip_limit max_user_ip 3

http_access deny ip_limit

The default behavior is to deny random requests once the limit is reached. We can deny

complete access by specifying the opon -s while construcng an ACL.

At least one of the authencaon schemes must be congured before we

can use this feature.

Identication based on various HTTP headers

Requests or replies can be idened based on the informaon hidden in HTTP headers,

which accompany every HTTP request or reply. Let's have a look at some of the important

HTTP headers used for idenfying requests.

User-agent or browser

Almost all the HTTP requests carry a User-Agent string in their headers, which is basically

a string to idenfy the name and version of the HTTP client. For a certain version of Mozilla

Firefox, it may look like:

Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.9.2.6) Gecko/20100625

Firefox/3.6.6 GTB7.1

The previous User-Agent string represents Mozilla Firefox 3.6.6 on a Linux-based 32 bit

operang system.

Squid provides an ACL type

browser, using which we can idenfy client requests based on

the User-Agent header and combine that with an access rule to grant access to users with

specic HTTP clients. The ACL type browser takes a regular expression as an argument. For

example, if we want to restrict access to Mozilla Firefox and Internet Explorer, we can add

following lines to our conguraon les:

acl allowed_clients browser -i firefox msie

http_access allow allowed_clients

http_access deny all

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 110 ]

It's very easy to spoof User-Agent header strings and we should not rely

solely on User-Agent to control access.

Referer identication

HTTP requests generally carry a Referer header string, which represents the website from

which the client was directed to the current request. An example Referer header string is

http://www.google.com/search?rlz=1C1GGLS_enIN345IN345&sourceid=chrom

e&ie=UTF-8&q=what+my+user+agent

. Squid has an ACL type referer_regex, which

can be used to match requests on the basis of the Referer header string. It's useful when we

don't want users to be directed to genuine websites from a malicious website. For example:

acl malicious_website dstdomain .malicious.example.com

acl malicious_referer referer_regex -i malicious.example.com

http_access deny malicious_website

http_access deny malicious_referer

The previous conguraon will prevent our clients from vising the malicious website and

will also prevent them from being directed to genuine websites from the malicious website.

This will in turn will prevent them from aacks like phishing.

Content type-based identication

Squid provides two ACL types, req_mime_type and rep_mime_type, which can be used

to match the content type of requests and replies respecvely. These are generally helpful in

controlling the access to le uploads and downloads. Both these ACL types try to match the

Content-Type HTTP header accompanied with requests and replies.

For uploads, the

req_mime_type is used. For example, if we want to prevent the uploading

of an MPEG video le, then we can have the following lines in our conguraon le:

acl mpeg_upload req_mime_type -i video/mpeg

http_access deny mpeg_upload

Similarly, rep_mime_type is used for matching against the Content-Type header of replies

from remote servers. To disable all video downloads, we can use the http_reply_access

direcve, which is used to control access to replies received from the remote servers.

Therefore, we can have the following lines in our conguraon le.

acl video_download rep_mime_type -i ^video/

http_reply_access deny video_download

These ACL types are only eecve if the HTTP client and remote web servers set

the Content-Type HTTP header properly.

Chapter 4

[ 111 ]

Other HTTP headers

We have learned about the browser, referer, req_mime_type, and rep_mime_type

ACL types, which idenfy requests or replies by matching a regular expression against

dierent HTTP header elds. Squid provides two addional ACL types, namely,

req_header

and

rep_header, which can be used to match any of the HTTP header elds in requests

or replies.

The

req_header ACL type is used to match HTTP headers in requests. Let's see an example:

acl user_agent req_header User-Agent -i ^Mozilla

http_access allow user_agent

http_access deny all

In the previous conguraon, User-Agent is a HTTP header. Similarly, we can specify any of

the known HTTP headers and Squid will try to match the regular expression against the value

of that parcular HTTP header.

In a similar manner, we can use

rep_header for matching HTTP header elds in replies.

However, it is worth nong that the rep_header ACL type is useful only when used with

the http_reply_access direcve as only replies can be matched.

HTTP reply status

When Squid tries to contact the remote server on the client's behalf, it'll receive a reply

corresponding to every request. Depending upon the remote web server's ability to serve

the current request, the reply will have a status code. For example, if the request can be

served, a status code 200 will be returned.

Please visit http://en.wikipedia.org/wiki/List_of_HTTP_

status_codes for a complete list of HTTP status codes.

Squid has the ACL type http_status to idenfy replies on the basis of the HTTP status

codes returned by a remote server. Let's say we want to idenfy all the server errors (5xx),

our conguraon would look similar to the following:

acl server_errors http_status 500-510

Similar to the src and port ACL types, we can pass a range as an argument to http_status.

We can take the appropriate acon based on the HTTP status or reply codes. The

https_status ACL type can be helpful in bypassing adaptaon rules.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 112 ]

Identifying random requests

The ACL type random can be used to idenfy random requests with a pre-dened

probability. The following is the format for construcng ACLs of the type random:

acl ACL_NAME random probability

The parameter probability can be specied in the following three ways:

Fracon: In the form of a fracon. For example, 2/3, 3/4, and so on.

Decimal: In the form of a decimal number. For example, 0.67, 0.2, and so on.

Raon: In the form of a matches:non-matches rao. For example, 3:4, 2:3,

and so on.

So, an example ACL matching 70 percent of requests can be wrien as:

acl random_req random 0.7

The ACL random_req will randomly match 70 percent of the total requests received by the

Squid proxy server.

Access list rules

In the previous secon, we learned about ACL lists in detail. However, as we saw, ACL lists

can only be used to idenfy, and that is only of use if they are combined with some access

rules to control access to various components of our proxy server. Squid provides a lot of

access list rules, with http_access being the most widely used.

As we have learned in Chapter 2, when we have mulple access rules, Squid matches a

parcular request against them from top to boom and keeps doing so unl a denite acon

(allow or deny) is determined. We also learned that if we have mulple ACLs within a single

access rule, then a request is matched against all the ACLs from le to right and Squid stops

processing the rule as soon as it encounters an ACL that can't idenfy the request. An access

rule with mulple ACLs results in a denite acon only if the request is idened by all the

ACLs used in the rule.

Now, let's have a brief look at the dierent access list rules provided by Squid.

Access to HTTP protocol

The http_access is the most important access list rule. Only the client allowed by this

rule will be able to send HTTP requests and requests from all other clients will be denied.

However, the behavior of this access list rule is a bit tricky.



Chapter 4

[ 113 ]

The default behavior is to allow requests only from LAN clients. If no access rules are

congured, then the default behavior is to deny all requests. Squid will stop at the rst

access rule with an ACL list matching the current request and will allow or deny the request

depending on the rule. If the current request is not idened by any of the ACL lists in the

access rules, the opposite acon of the last access rule is performed. So, if the last access

rule is to deny a request, the unmatched request will be allow and vice-versa.

Because of the above behavior, a

deny all ACL rule is always recommended in the end, so

that Squid can idenfy a denite acon. The general rule is to allow known clients and deny

the rest by default. So, our conguraon le should look something like:

http_access allow employees

http_access allow customers

http_access allow guests

http_access allow vpn_users

http_access deny all

So, we allowed all of the possible genuine users and then at the end denied all the requests.

If the need arises to add more users, we can simply add them to an exisng or another ACL

list and will add an allow access rule.

The line that denies all the requests at the end also prevents our proxy server from

unauthorized access as a result of misconguraon.

Adapted HTTP access

Squid provides the adapted_http_access rule, which is similar to http_acces but is

checked aer all the redirectors, URL rewriters, or ICAP/eCAP adaptaons, which allows

access control based on the output returned. This is only useful when we are using

redirectors, URL rewriters, or ICAP/eCAP adaptaon.

For more informaon on ICAP/eCAP, please visit http://wiki.

squid-cache.org/Features/ICAP and http://wiki.squid-

cache.org/Features/eCAP.

It is not absolutely necessary to use this rule. The syntax and behavior is similar

to http_access.

HTTP access for replies

We have seen that there are ACL types that can idenfy requests and replies. For example,

src, dst, dstdomain, req_header, and so on are a few ACL types that idenfy clients

on the basis of requests, while rep_header, http_status, rep_mime_type, and so on

can idenfy replies. The ACL lists that idenfy replies should be used with the access rule

http_reply_access to control access.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 114 ]

Squid fetches the replies from a remote server, even if the replies are denied using the

http_reply_access rules, but they are not delivered to the clients. On the other hand,

even when the replies are denied using the http_reply_acess rules, the clients will sll

receive replies in the form of access denied messages from Squid.

The usage and behavior of

http_reply_access is similar to http_access.

If a client is denied access by the http_access rule, it'll never match

an http_reply_access rule. This is because, if a client's request is

denied then Squid will not fetch a reply.

Access to other ports

Our neighbor proxy server can access a proxy server via ICP and HTCP ports. Also, our proxy

server can be accessed via the SNMP port. Let's see how to control access to these ports.

ICP port

We have seen the icp_port direcve in Squid, which is used to set the ICP port for

communicaon with neighboring proxy servers. To limit access to the ICP port of our proxy

server, we have an access list rule called icp_access. The default behavior is to deny all the

requests to the ICP port. Generally, we prefer to enable ICP port access for all clients in our

local area network, but it totally depends on our network policies.

The default Squid conguraon le contains an ACL list,

localnet, which idenes all the

clients on our LAN. So, if we want to allow ICP access to all our local clients, we can use the

following lines in the conguraon le:

acl localnet src 10.0.0.0/8

acl localnet src 172.16.0.0/12

acl localnet src 192.168.0.0/16

acl localnet src fc00::/7

acl localnet src fe80::/10

icp_access allow localnet

icp_access deny all

HTCP port

HTCP (Hypertext Caching Protocol) is used for discovering HTTP caches and communicaon

among the proxy servers. We set the HTCP port in the conguraon le using the

htcp_port direcve. We can prevent access to the HTCP port on our proxy server by using

the access list rule htcp_access, which has usage and behavior similar to icp_access.

The default behavior is to deny all requests to the HTCP port.

Chapter 4

[ 115 ]

Purge access via HTCP

A proxy server can send HTCP CLR or purge requests to other proxy servers using HTCP. We

may want to allow access to only trusted clients to prevent a proxy server from unauthorized

access. Squid provides the access rule

htcp_clr_access, which can be used to determine

the clients that will be able to issue HTCP CLR requests to purge content.

We should note that HTCP CLR requests are relayed regardless

of whether they are acted on locally.

SNMP port

We can restrict access to the SNMP port (specied by the snmp_port direcve) on our proxy

server using a combinaon of the access list rule snmp_access, the ACL list constructed

from the ACL type snmp_community and any other ACL idenfying the client requesng

SNMP access. Let's see the following example:

acl admins src 127.0.0.1 192.0.2.21 192.0.2.86

acl snmppublic snmp_community public

snmp_access allow snmppublic admins

snmp_access deny all

So, now only admins will be able to access the SNMP port. The default behavior is to deny

access to all clients.

Enforcing limited access to neighbors

When we have cache peers or neighbor proxy servers in our network, they can use our proxy

server as a sibling or a parent proxy server. When they use our proxy server as a sibling proxy

server, only HITS will be fetched from our proxy server and they will fetch all the MISS(s) on

their own. However, if they are using our proxy server as a parent, then they'll be able to

fetch MISS(s) via our proxy server. In some cases, this may not be a desirable behavior,

as it will consume our upstream bandwidth.

Time for action – denying miss_access to neighbors

To force other proxy servers to use our proxy server as a sibling proxy server, we have

an access rule miss_access. Let's say we have two neighbor proxy servers, namely,

192.0.2.25 and 198.51.100.25, in our network. Now, we don't mind if 192.0.2.25

uses our proxy server as a parent proxy server, but we don't want to allow 198.51.100.25

to fetch MISS(s) via our proxy server. So, we can have the following conguraon:

acl good_neighbour src 192.0.2.25

acl bad_neighbour src 198.51.100.25

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 116 ]

miss_access allow good_neighbour # This line is not needed. Why?

miss_access deny bad_neighbour

miss_access allow all

The default behavior is to allow all proxy servers to fetch MISS(s) via our proxy server. In the

previous conguraon line, the rst allow rule is not needed because we have the allow

all

rule at the end. The allow rule was just used to draw your aenon towards the nature

of miss_access direcve.

What just happened?

We just learned the usage of the miss_access access list rule to prevent leakage of

upstream bandwidth to unknown or misbehaving clients.

Requesting neighbor proxy servers

If there are neighbor proxy servers or cache peers in our network and we have added them

to our Squid conguraon le using the cache_peer direcve, then our proxy server will

try to contact those servers using HTTP, ICP, or HTCP protocols based on the opons we used

with the cache_peer direcve. By default, Squid will select the rst or closest proxy server

to contact for various communicaons. We can however control the selecon with the

access list rule cache_peer_access.

We can combine

cache_peer_access, cache peer name, and the ACL lists to achieve

control over a selecon of the proxy servers for dierent domains, clients, or any other

criterion. The following is the format for construcng a cache_peer_access rule:

cache_peer_access CACHE_HOST allow allow|deny [!]ACL_NAME ...

Let's have a look at the following example. We have two cache peers, namely, cache1.

example.com

and cache2.example.com, and we want all YouTube trac to go through

cache1.example.com and all Google trac to go through cache2.example.com.

cache_peer cache1.example.com parent 8080 3130 proxy-only weight=1

cache_peer cache2.example.com parent 3128 3130 proxy-only weight=2

acl youtube dstdomain .youtube.com

acl google dstdomain .google.com

cache_peer_access cache1.example.com allow youtube

cache_peer_access cache1.example.com deny all

cache_peer_access cache2.example.com allow google

cache_peer_access cache2.example.com deny all

Have a go hero – make a list of proxy servers in your network

Make a list of the available proxy servers in your environment and add them as cache peers

to your conguraon le.

Chapter 4

[ 117 ]

Forwarding requests to remote servers

When we have neighboring proxy servers or cache peers in our network and we have

congured our Squid proxy server to use them via the cache_peer direcve, then the

requests from the clients will be forwarded through the peers depending on the opons

used with cache_peer while adding the peer hosts. However, Squid provides the following

access list rules, namely,

always_direct and never_direct, which can be used to

determine whether a request should be forwarded through other peers or the remote

servers should be contacted directly.

When we want to forward requests directly to remote servers without using any peers, we

can use the

always_direct access list rule. This is generally used to avoid contacng peers

for serving content from websites on the local area network. For example, for forwarding

requests to local web servers directly, we can use the following conguraon:

acl local_domains dstdomain .local.example.com

acl local_ips dst 192.0.2.0/24

always_direct allow local_domains

always_direct allow local_ips

The previous conguraon will successfully reduce the unnecessary latency introduced

because of communicaon with peers while serving local content.

The access list rule

never_direct is the opposite of always_direct. So, if we decided

that all the requests must not be forwarded to remote servers directly, then we can have

the following conguraon:

never_direct allow all

Ident lookup access

We learned that we have the ACL type ident, using which we can force username

idencaon before allowing any clients to access our proxy server. However by default,

ident lookups are not performed even if we have ACL lists with the ident ACL type, unless

the current requests are allowed by the access rule ident_lookup_access. The default

behavior is not to perform any ident lookups at all.

It's actually a good idea to perform selecve

ident lookups because not all hosts support

this feature. So, let's say we want to perform ident lookups for all the Unix/Linux hosts in

our network 192.0.2.0/24. We can add the following lines to our conguraon le:

acl nix_hosts src 192.0.2.0/24

ident_lookup_access allow nix_hosts

ident_lookup_access deny all

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 118 ]

Controlled caching of web documents

Squid tries to cache all the cacheable web documents for sasfying the subsequent requests

for the same content. However, there may be mes when we may not want to cache all of

the replies. A good example is the content from websites on our local area network. We have

an access list rule, cache, using which we can allow or deny caching of content with the help

of ACL lists.

For example, for denying caching of any content on the local area network, we can add the

following lines to our conguraon le:

acl local_domain dstdomain .local.example.com

cache deny local_domain

cache allow all

Don't use the localnet ACL list here because that idenes

requests on the basis of source IP addresses and not on the basis of

desnaon IP addresses.

URL rewrite access

When we have congured our Squid proxy server to use URL rewriters, Squid will send all

the incoming requests to URL rewriters for further processing. Generally, URL rewriters are

plugins designed for a specic purpose and will operate on selecve websites. So, it's good

pracce to pass only selecve URLs to a URL rewriter to save some CPU cycles, and Squid will

not have to wait for the rewriter to process a URL that is not meant to be processed by the

rewriter. We should also avoid rewring CONNECT requests. Rewring HTTP PUT and POST

requests can also result in unexpected behavior.

We can pass only selecve requests to URL rewriters using the access list rule

url_rewrite_access. This access list rule is similar to http_access and cache_peer_

access

and only requests allowed by url_rewrite_access will be passed to the URL

rewriter. Let's say we have dened a URL rewriter that acts only on the videos.example.

com

URLs; we would need the following conguraon:

acl video_web dstdomain .videos.example.com

url_rewrite_access allow video_web

url_rewrite_access deny all

In accordance with the previous conguraon, Squid will pass all the URLs to the URL

rewriter program, which are matched by the video_web ACL.

Chapter 4

[ 119 ]

HTTP header access

Another couple of access list rules which we have are request_header_access and

reply_header_access. These can be combined with ACL lists to control access to

dierent headers in HTTP requests and replies respecvely. If we deny access to a certain

HTTP header for some requests, then that parcular HTTP header will be dropped from the

headers while sending the request to remote servers. We should note that dropping HTTP

headers from requests or replies is a violaon of HTTP protocol standards.

Let's say we want to remove the User-Agent HTTP header from both requests and

replies from the subnet

192.0.2.0/24. We would need the following conguraon

for achieving this:

acl special_net src 192.0.2.0/24

request_header_access User-Agent deny special_net

reply_header_access Content-Type deny special_net

request_header_access User-Agent allow all

reply_header_access Content-Type allow all

Custom error pages

Whenever access is denied to a client for a parcular request, Squid sends a standard access

denied page to a client with instrucons on contacng the system administrator. We can

send custom error pages to the clients, redirect them to a dierent URL, or reset the TCP

connecon using the access list rule deny_info. There are three possible ways to do this by

using deny_info; let's have a look at them. In the rst form, we return a custom error page:

deny_info ERR_PAGE ACL_NAME

In this form, we write a HTML page and store it in the errors directory dened by the

error_directory direcve in squid.conf. Let's say we have a custom access denied

error message in the ERR_CUSTOM_ACCESS_DENIED le in our errors directory. We would

need the following conguraon:

acl bad_guys src 192.0.2.0/24

deny_info ERR_CUSTOM_ACCESS_DENIED bad_guys

In the next form, we redirect clients to a custom URL:

acl bad_guys src 192.0.2.0/24

deny_info http://errors.example.com/access_denied.html bad_guys

In the last form, we simply reset the TCP connecon:

acl bad_guys src 192.0.2.0/24

deny_info TCP_RESET bad_guys

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 120 ]

Have a go hero – custom access denied page

Design a custom access denied page for your Squid proxy server, which explains the reason

for revoking access.

Maximum size of the reply body

In some environments, where we don't have enough bandwidth, we may want to restrict

people from downloading large les like movies, music, games, and so on. To achieve this

goal, we can use the access list rule reply_body_max_size to put a limit on the maximum

size of the reply body that a client can access. If the size of the reply body exceeds the

maximum size, the client will be sent a proper denial message.

The syntax for using

reply_body_max_size is as follows:

reply_body_max_size SIZE UNITS ACLNAME

So, let's say we want to limit the maximum reply body size to 10MB and 20MB for dierent

subnets. We can have the following conguraon:

acl max_size_10 src 192.0.2.0/24

acl max_size_20 src 198.51.100.0/24

reply_body_max_size 10 MB max_size_10

reply_body_max_size 20 MB max_size_20

reply_body_max_size none

The reply size is calculated on the basis of the CONTENT-LENGTH HTTP header received

from the remote server. If the value is larger than the maximum allowed size for the current

request, the client will get a 'reply too large' error. If there is no CONTENT-LENGTH HTTP

header in the reply and the reply size is more than the maximum allowed, the connecon

is closed and client receives only a paral reply.

Logging requests selectively

By default, all the client requests are logged to Squid's access log le whose locaon is

determined by the access_log direcve in the conguraon le. However, there may

be requests which we may not want to log to the access log for privacy reasons.

For example, let's say we have a research lab in subnet

192.0.2.0/24 where people

work on a secret project and we don't want their requests to be logged to the access log to

prevent any collecon of browsing paerns. We can use the access list rule log_access to

prevent logging for certain requests as shown in the following example:

acl secret_req src 192.0.2.0/24

log_access deny secret_req

log_access allow all

Chapter 4

[ 121 ]

The previous conguraon will prevent logging of requests from the 192.0.2.0/24 subnet.

We learn about gaining ne control over logging in Chapter 5.

Mixing ACL lists and rules – example scenarios

We have seen various ways in which to construct ACL lists to idenfy dierent requests from

clients, and replies in some cases. We have also learned about the basic usage of access list

rules. In this secon, we'll be dening conguraons for the dierent scenarios that a Squid

administrator may face in day-to-day life.

Handling caching of local content

When we deploy a proxy server, normally all requests to external and internal websites ows

through the proxy server. If we have caching enabled on our proxy server, then it's going

to cache everything that is cacheable, which will result in caching of content from internal

websites also. When we cache content from internal websites, we are unnecessarily wasng

disk space on the proxy server because the advantage of caching the local content is almost

none, as we generally have lots of free bandwidth on LAN.

Time for action – avoiding caching of local content

First of all, we'll need to idenfy the requests in which content on your local area network is

being requested. So, let's say in our network, some clients have hosted FTP and HTTP servers

on their machines to share content on the intranet. The client machines have IP addresses

in the subnets 192.0.2.0/24 and 198.51.100.0/24. So, we need to construct an ACL

list that can idenfy all the requests directed to these machines. The following ACL list does

exactly that:

acl client_servers dst 192.0.2.0/24 198.51.100.0/24

Also, we have mail.internal.example.com and docs.internal.example.com

hosted in the local network. So, let's construct an ACL list to idenfy all the requests

to these websites:

acl internal_websites dstdomain .internal.example.com

So, as we have idened the requests for local content, we just need to instruct Squid

not to cache replies to any of these requests. Therefore, we will use the access list rule

cache to deny caching, as shown in the following example:

cache deny client_servers

cache deny internal_websites

cache allow all

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 122 ]

What just happened?

We just learned about opmizing our Squid proxy server to cache only the content

that actually needs to be cached and will not waste the disk space on the proxy server

unnecessarily. We can keep updang these ACL lists as we encounter the requests that

do not need to be cached.

Denying access from external networks

When we deploy a proxy server, we normally want it to be available to users on our local

area network and no other person should be able to use our proxy server to browse

websites. In this case we will also have to idenfy all our clients on the local area network

by using ACL lists. Generally, we assign IP addresses in the local network from the private

address space. Squid already has ACL lists dened to idenfy the machines in the local

network. If we go ahead with the default Squid conguraon, requests from the local

network will be allowed and all other requests will be denied. Let's have a look at the

default conguraon provided by Squid:

acl localhost src 127.0.0.1/32

acl localnet src 10.0.0.0/8

acl localnet src 172.16.0.0/12

acl localnet src 192.168.0.0/16

acl localnet src fc00::/7

acl localnet src fe80::/10

http_access allow localnet

http_access allow localhost

http_access deny all

If we want to allow any other clients from outside our network, we'll have to construct

addional ACL lists and allow them by using

http_access.

Denying access to selective clients

There may be several reasons for blocking a parcular client but one of the most common

reasons is a huge number of requests being sent from a single client to a parcular website.

This may be due to a virus infected computer or download managers with very low retry

me in case access is denied.

For revoking access from such clients, rst we'll need to construct an ACL list to idenfy such

users and then we'll need to deny access using the

http_access access list rule. We'll have

to take care that the deny rule goes above all the allow rules, in case there are any.

acl bad_clients src 192.0.2.21 198.51.100.25

http_access deny bad_clients

http_access allow localnet

http_access allow localhost

http_access deny all

Chapter 4

[ 123 ]

Blocking the download of video content

Most of the bandwidth is consumed by only a few clients for downloading video content

such as movies, TV shows, and so on. So, we may want to deny access to all the video

content so that we can provide quality bandwidth to the clients trying to browse other

websites. This is generally required only when we have low bandwidth and a lot of clients.

Time for action – blocking video content

So, for blocking the video content, rst we'll need to idenfy all the requests for video

content. For this purpose, we can simply use the ACL type url_regex as follows:

acl video_content urlpath_regex -i \.(mpg|mpeg|avi|mov|flv|wmv|mkv|rm

vb)(\?.*)?$

The previous ACL list will match all the URLs ending with extensions of common video formats.

As a video can be served using dynamic URLs, the URL returning video content may not look

like a URL to a video le at all. For achieving beer control, we also need to use the ACL type

rep_mime_type to detect the content type of the replies returned by webservers. So, we

can construct another ACL list as follows:

acl video_in_reply rep_mime_type -i ^video\/

The previous ACL list will match all the replies with video as a part of their content

type. So, now we need to deny access to these ACL lists, which we can do by using

the following rules:

http_access deny video_content

http_reply_access deny video_in_reply

http_reply_access allow all

What just happened?

We have just seen a real life example of http_reply_access, in which we used it to

control the download of video content. The previous list is not foolproof and it will not be

able to match the replies containing video content if the remote web server doesn't send

the Content-Type HTTP header.

Special access for certain clients

This is a common scenario when clients have restricted access. Generally, we need to provide

special access to administrators. If this is the case, we need to idenfy all the requests by

administrators either by their usernames or by the origin of the requests.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 124 ]

Time for action – writing rules for special access

Let's say john, michelle, and sarah are the usernames alloed to our administrators and

192.0.2.46, 192.0.2.9, and 192.0.2.182 are their respecve IP addresses alloed to

their laptops. In this case, we are allowing addional access when the requests are originang

from the above IP addresses or if the requests are authencated with the credenals of the

aforemenoned users. The required ACL lists should look similar to the following:

acl admin_laptops src 192.0.2.46 192.0.2.9 192.0.2.182

acl authenticated proxy_auth REQUIRED

acl admin_user proxy_auth john michelle sarah

acl work_related_websites dstdomain "/opt/squid/etc/work_websites"

Now, we need to allow everyone to access only work related websites, except administrators

who should be able to access everything. Therefore, we should build the following

access rules:

http_access allow admin_laptops

http_access allow admin_user

http_access allow localnet work_related_websites authenticated

http_access deny all

So, in accordance to the previous rules, requests idened by admin_laptops and

admin_user are always allowed, but all other requests have to pass through three lters.

First of all, the requests should be authenticated, then it should originate from an IP

address in the local network, and then the requests should be to a website listed in the

/opt/squid/etc/work_websites le. If all these criteria are matched, only then will

a request be allowed; otherwise it's denied.

What just happened?

In the ACL lists we used a mixture of types (such as src, dstdomain, and proxy_auth) to

achieve special access for a set of users. Similarly, we can use various other types of ACL lists

to ne-tune our access control conguraon.

Limited access during working hours

In some organizaons, it's a part of the network usage policy to restrict access to only

work related websites during working hours. This is mostly done either due to a lack of

bandwidth or to enforce people to focus on work. In such cases, we will rst need to

construct an ACL list dening the working hours. This should look similar to the following:

acl working_hours time D 10:00-13:00

acl working_hours time D 14:00-18:00

Chapter 4

[ 125 ]

In the previous code, we have kept 1300HRS - 1400HRS as lunch me, and we don't really

mind what people browse in that period.

Now, we need to construct a list of allowed websites, which are allowed during working

hours. Let's say we are going to load them from the le

work_related.txt. So, we

construct another ACL type as:

acl work_related dstdomain "/opt/squid/etc/work_related.txt"

As we have now idened the working hours and the websites which can be accessed during

working hours, we can proceed with wring the following rules:

http_access allow working_hours localnet work_related

http_access allow !working_hours localnet

http_access deny all

If the previous conguraon is applied, clients on the local network will be able to access

only work related websites during working hours. However, they will be able to access all

websites during non working hours.

Allowing some clients to connect to special ports

From me-to–me, there may be requests from various clients that need to connect to a

website on a non-HTTP port. To handle such requests, we need to use the ACL type port to

construct an ACL list of addional ports which are allowed for only a few clients.

For example, let's say we have requests for opening ports 119 (Usenet News Group), 2082,

3389, and 9418 (Git version control system). If we add these ports to the list of

Safe_ports,

which is a default ACL list provided by Squid, then everyone will be able to connect to these

ports. However, we want only a few clients (who have requested to the special access) to

connect to these ports. So, we'll need to construct another ACL list as follows:

acl special_ports port 119 2082 3389 9418

Aer idenfying the ports, we need to idenfy the requests from the clients requesng

special access. This can be achieved in two ways. The rst, and most simple method is to

idenfy the clients by their IP addresses. The other way is to idenfy the special clients by

their usernames, but this method only works when we have authencaon enforced. So,

to idenfy the clients, we can use the following ACL lists:

acl special_clients src 192.0.2.9 192.0.2.46 192.0.2.182

acl authenticated proxy_auth REQUIRED

acl special_users proxy_auth sarah john michelle

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 126 ]

Now we need to allow special_clients or special_users to connect to

special_ports. However, we should remember that the rules we are going to construct

for this scenario should go before the following line in squid.conf.

http_access deny !Safe_ports

The previous line will deny access to any port that doesn't exist in the Safe_ports ACL list.

So, the rules which we will need to construct will be as follows:

http_access allow special_ports special_clients

http_access allow special_ports special_user

http_access deny !Safe_ports

Testing access control with squidclient

We learned in Chapter 3 that we should always test our conguraon le for errors or

warnings before deploying it on the producon servers. Squid provides the command-line

opon

-k parse using which the conguraon le can be parsed quickly.

However, successful parsing of the conguraon le doesn't guarantee that Squid will

be able to allow or deny the requests or replies in the manner we are expecng. As the

conguraon les grows in size, the number ACL lists and corresponding rules keeps on

increasing, which may somemes lead to confusion. To test the access control in our new

conguraon le, we can use the

squidclient program.

For this purpose, we'll either need a dierent test server or we'll need to compile Squid on

the producon server with a dierent

--prefix opon with the congure program. For

example, we can compile Squid using the following commands:

configure --prefix=/opt/squidtest/

make

make install

The previous commands will install Squid in the /opt/squidtest/ directory. We'll need

to change the

http_port opon and set the port to 8080 or something other than the port

which is used by the original installaon.

Aer this, we need to copy the access control part from our new conguraon le to the

conguraon le of our new test Squid installaon in

/opt/squidtest/. Once we have

nished copying the access control conguraon, we can start our test proxy server.

Chapter 4

[ 127 ]

Options for squidclient

Squidclient executable or binary is generally located at ${prefix}/bin/squidclient.

If we run

squidclient without any arguments, it'll display a list of available opons which

we can specify on the command line. So, let's take a look at the available opons for our

version of the

squidclient.

./squidclient

Version: 3.1.4

Usage: ./squidclient [options] url

The following table shows a brief overview of supported opons:

Opon Usage

-a

Don't include the Accept HTTP header.

-g count

Ping mode. Performs count iteraons (0 to count unl interrupted).

-h host

Retrieve a URL from the proxy server on hostname. The default is

localhost.

-H 'string'

Extra HTTP headers to send. We can use \n for new lines.

-i IMS

Species the If-Modified-Since me (in Epoch seconds).

-I interval

Ping interval in seconds. The default is 1 second.

-j hosthdr

Host HTTP header to send.

-k

Keep the connecon acve. The default is only one request.

-l host

Specify a local IP address to bind to. The default is none.

-m method

HTTP Request method to use. The default is GET.

-p port

Port number of the proxy server. The default is 3128.

-P filename

HTTP PUT request using the le named filename.

-r

Force proxy server to reload the URL.

-s

Operate in silent mode. Do not print data to the standard output

(stdout).

-t count

Trace count proxy server hops.

-T timeout

Timeout value (in seconds) for read/write operaons.

-u username

Provide username for proxy authencaon.

-U username

Provide username for WWW authencaon.

-v

Operate in verbose mode. Print outgoing messages to standard error

(stderr).

-V version

HTTP Version to use. Use hyphen (-) for the HTTP/0.9 omied case.

-w password

Provide password for proxy authencaon.

-W password

Provide password for WWW authencaon.

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 128 ]

As you can see from the previous table, the opons are prey easy to understand. We don't

really need to use all of them. We are most likely to need opons such as -i, -j, -l, -h, -p,

-u, -w, -m, and -H.

Using the squidclient

So, let's get started and begin tesng our Squid server. Let's say we have blocked access

to the website malware.example.com using the following access control in our

conguraon le:

acl malware dstdomain malware.example.com

http_access deny malware

Time for action – testing our access control example

with squidclient

We now need to run the squidclient to fetch http://malware.example.com/ to

check if we get an access denied error or not. If we are running the squidclient on the

producon server, then we don't need to use the -h opon to specify the hostname. In this

scenario, we can run the squidclient with the -p opon to specify the port.

./squidclient -p 8080 http://malware.example.com

However, if we are running the squidclient on a dierent machine, we will have to use

the -h opon to specify the hostname of the proxy server. In this scenario, we can run the

squidclient with the following conguraon:

./squidclient -h proxy.example.com -p 8080 http://malware.example.com

If our access control rules are working and they are rightly placed in the conguraon le,

we should get an output similar to the following:

HTTP/1.0 403 Forbidden

Server: squid/3.1.4

Date: Mon, 06 Sep 2010 09:28:38 GMT

Content-Type: text/html

Content-Length: 2408

Expires: Mon, 06 Sep 2010 09:28:38 GMT

X-Squid-Error: ERR_ACCESS_DENIED 0

X-Cache: MISS from proxy.example.com

X-Cache-Lookup: NONE from proxy.example.com:8080

Via: 1.0 proxy.example.com:8080 (squid/3.1.4)

Proxy-Connection: close

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/

TR/html4/strict.dtd">

Chapter 4

[ 129 ]

1">

<title>ERROR: The requested URL could not be retrieved</title>

...

In the previous output, the rst line contains the HTTP status of the reply which denotes

that the request was denied. So, we can say that the access rule which we have used in the

conguraon le is working ne.

What just happened?

We have seen an example of the basic usage of the squidclient to test the access control

conguraon before deploying a new conguraon on the producon server.

Time for action – testing a complex access control

An access control involving IP addresses from dierent subnets is a bit dicult to test

but can be tested using the

squidclient. This can be done by creang virtual or alias

network interfaces on the machine. For example, the IP address of our proxy server is

192.168.36.204 and we have the following access control conguraon in our

squid.conf, which we want to test:

acl bad_guys src 10.1.33.9 10.1.33.182

http_access deny bad_guys

We can't test these rules directly as our IP address is dierent from the clients we have

blocked and Squid will check for the source IP address in the requests. However, we can

use opon

-l, which is available with the squidclient, which will bind it to a dierent

IP address while sending requests to the Squid proxy server. To achieve this, we need to

create an alias network interface on our server. In most Linux/Unix-based systems, this

can be achieved by using the following command:

ifconfig eth0:0 10.1.33.9 up

Once the alias interface is up, we can use the following command to test our

new conguraon:

./squidclient -l 10.1.33.9 -p 8080 http://www.example.com/

We should get an output similar to the following:

HTTP/1.0 403 Forbidden

Server: squid/3.1.4

Mime-Version: 1.0

Date: Mon, 06 Sep 2010 09:40:22 GMT

Content-Type: text/html

Content-Length: 1361

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 130 ]

Expires: Mon, 06 Sep 2010 09:40:22 GMT

X-Squid-Error: ERR_ACCESS_DENIED 0

X-Cache: MISS from proxy.example.com

X-Cache-Lookup: NONE from proxy.example.com:8080

Via: 1.0 proxy.bordeaux.com:8080 (squid/3.1.4)

Proxy-Connection: close

<!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/

TR/html4/strict.dtd">

1">

<title>ERROR: The requested URL could not be retrieved</title>

...

What just happened?

We created a virtual network interface or alias network interface and asked the

squidclient to project us as a totally dierent client by sending out the IP address of the

alias interface as the source IP address of the request. This helped us in tesng our access

control conguraon in the reference frame of another client.

Similarly, we can use other opons to test dierent access control conguraons before

deploying them on our producon servers.

Pop quiz

1. Consider the following lines in the Squid conguraon le:

acl client1 src 10.1.33.9/255.255.255.255

acl client2 src 10.1.33.9/32

acl client3 src 10.1.33.0/24

acl client4 src 10.1.33.0/30

Which of the following ACL lists will not match a request from a client with the

IP address 10.1.33.9?

a. client1

b. client2

c. client3

d. client4

Chapter 4

[ 131 ]

2. Consider the following line in the Squid conguraon le:

acl domain dstdomain amazon.com

Which requests to of the following domain names will be matched by the ACL

list domain?

a. amazon.com

b. www.amazon.com

c. mail.amazon.com

d. amazon.com.au

3. Consider the following conguraon:

acl manager proto cache_object

acl localhost src 127.0.0.1

acl admin1 src 10.2.44.46

acl admin2 src 10.1.33.182

http_access allow manager localhost

http_access allow manager admin1 admin2

http_access deny manager

Which of the following clients will be able to access Squid's cache manager?

a. localhost

b. 10.1.33.9

c. 10.1.33.182

d. clients b and c

4. Consider the following conguraon:

acl client1 src 10.2.44.46

acl client2 src 10.1.33.9

http_access allow client1

http_access allow client2

http_reply_access deny all

Which of the following clients will be able to view websites using our

proxy server?

a. 10.2.44.46

b. 10.1.33.9

c. a and b both

d. None

Geng Started with Squid’s Powerful ACLs and Access Rules

[ 132 ]

Summary

We have learned a lot about access control lists and access list rules in this chapter. We

had a detailed look at the various types of ACL and how to construct ACL lists for dierent

scenarios. We took examples describing a general situaon in which we needed to use

a mixture of ACL types and rules to achieve the desired access control.

Specically, we covered:

Dierent types of ACL which can idenfy individual requests and replies.

Dierent types of access list rules which can be used to control access to various

components of the Squid proxy server.

Achieving desired access control by mixing various ACL types with access rules.

Tesng our new Squid conguraon with the

squidclient before actually using it

in a producon environment.

We also discussed some example scenarios which can serve as the base conguraon for

various organizaons.

Now that we have learned about compiling, installing, conguring, and running Squid, we

can try to deploy Squid on some test machines and begin tesng them. In the next chapter,

we'll learn about logging in detail.



Understanding Log Files and

Log Formats

Understanding Squid log les and log formats is prey easy. In this chapter,

we'll present a brief explanaon of the log format and how we can customize

it to t our needs. We will cover the related Squid conguraon opons

and look at how a client's privacy can be protected, by ensuring Squid is

properly congured.

In this chapter, we will learn to interpret the dierent log les. We will also

learn about conguring Squid to achieve dierent log messages, depending

on requirements or network policies.

In this chapter, we shall learn about the following:

Cache log

Access log

Customizing the access log

Selecve logging or protecng clients' privacy

Referer log

User agent log

Emulang the HTTP server like logs

Log le rotaon

So let's get on with it.



Understanding Log Files and Log Formats

[ 134 ]

Log messages

Log messages are a nice way for any applicaon to convey messages about its current

acons to human users. A log message is basically a computer-generated message that can

be interpreted by a human being with prior knowledge of the locaon of the dierent elds

in the message. Squid also tries to log every possible acon in dierent log les at dierent

stages. When Squid encounters any errors before starng, it logs them to the output log

which generally goes to a le named cache.log. Similarly, when clients access our proxy

server, a message is logged to the le named access.log whose locaon is determined by

the

access_log direcve in the Squid conguraon le.

Squid uses dierent formats for logging messages to these les. Log les are important and

we can analyze resource consumpon and the performance of our proxy server by reading

through the log les, or by using various log le parsers available. In this chapter, we will

learn to interpret the dierent log les.

Cache log or debug log

Squid logs all the errors and debugging messages to the cache.log le. This log le also

contains messages about the integrity checks such as, availability and validity of cache

directories, which are performed by Squid.

Time for action – understanding the cache log

Let's go through the log messages for a test Squid run and see what each line means:

2010/09/10 23:31:10| Starting Squid Cache version 3.1.10 for i686-pc-

linux-gnu...

2010/09/10 23:31:10| Process ID 14892

Looking at the preceding example, the rst line represents the version of Squid we are

currently running and provides some informaon about the plaorm. The next line contains

the process ID for this instance of Squid.

2010/09/10 23:31:10| With 1024 file descriptors available

This line shows the number of le descriptors available for Squid in this run. We can check

back similar lines in our cache log, if we increase or decrease the available number of le

descriptors and restart the Squid process. Please refer to the secon on Congure or system

check in Chapter 1, Geng Started with Squid.

2010/09/10 23:31:10| Initializing IP Cache...

2010/09/10 23:31:10| DNS Socket created at [::], FD 7

2010/09/10 23:31:10| Adding nameserver 192.0.2.86 from /etc/resolv.

conf

Chapter 5

[ 135 ]

When Squid is started, it'll inialize the DNS systems starng with IP cache, as shown in the

rst line. The second and third lines show informaon about the DNS conguraon. Squid

added 192.0.2.86 as a DNS server from the le /etc/resolv.conf, which is the default

locaon for specifying DNS servers on Linux machines. If we have more than one DNS server

in the /etc/resolv.conf le, there will be more lines similar to the last line.

2010/09/10 23:31:10| User-Agent logging is disabled.

2010/09/10 23:31:10| Referer logging is disabled.

In the aforemenoned lines, Squid is trying to show the status of the oponal modules which

we have enabled while compiling Squid. It is clear to see in these lines, User-Agent and

Referer logging is disabled for this run.

The following are the log messages related to logging:

2010/09/10 23:31:10| Logfile: opening log daemon:/opt/squid/var/logs/

access.log

2010/09/10 23:31:10| Unlinkd pipe opened on FD 13

2010/09/10 23:31:10| Local cache digest enabled; rebuild/rewrite every

3600/3600 sec

2010/09/10 23:31:10| Store logging disabled

In the preceding log message shown, the rst line shows that Squid is going to use the le

/opt/squid/var/logs/access.log as an access log le. It also shows that unlinkd is

being used as the program to purge stale cache objects. Addionally, cache digest is enabled

and will be rebuilt and rewrien every hour. The last line demonstrates that the logging of all

storage-related acvies has been disabled.

2010/09/10 23:31:10| Swap maxSize 1024000 + 262144 KB, estimated 98934

objects

2010/09/10 23:31:10| Target number of buckets: 4946

2010/09/10 23:31:10| Using 8192 Store buckets

2010/09/10 23:31:10| Max Mem size: 262144 KB

2010/09/10 23:31:10| Max Swap size: 1024000 KB

2010/09/10 23:31:10| Version 1 of swap file with LFS support

detected...

2010/09/10 23:31:10| Rebuilding storage in /opt/squid/var/cache

(DIRTY)

2010/09/10 23:31:10| Using Least Load store dir selection

2010/09/10 23:31:10| Set Current Directory to /opt/squid/var/cache

The previous log message is referring to the cache directories and represents informaon

about the various parameters involved in caching web documents onto the hard disks. The

Swap in this log message refers to the Squid disk cache storage and should not be confused

with the system swap memory.

2010/09/10 23:31:10| Loaded Icons.

2010/09/10 23:31:10| Accepting HTTP connections at [::]:3128, FD 16.

2010/09/10 23:31:10| HTCP Disabled.

Understanding Log Files and Log Formats

[ 136 ]

2010/09/10 23:31:10| Squid plugin modules loaded: 0

2010/09/10 23:31:10| Ready to serve requests.

From these lines, we can interpret that Squid has loaded the required modules and is now

ready to accept connecons from clients. We can also see that the HTCP module is disabled.

2010/09/10 23:31:10| Done reading /opt/squid/var/cache swaplog (0

entries)

2010/09/10 23:31:10| Finished rebuilding storage from disk.

2010/09/10 23:31:10| 0 Entries scanned

2010/09/10 23:31:10| 0 Invalid entries.

2010/09/10 23:31:10| 0 With invalid flags.

2010/09/10 23:31:10| 0 Objects loaded.

2010/09/10 23:31:10| 0 Objects expired.

2010/09/10 23:31:10| 0 Objects cancelled.

2010/09/10 23:31:10| 0 Duplicate URLs purged.

2010/09/10 23:31:10| 0 Swapfile clashes avoided.

2010/09/10 23:31:10| Took 0.03 seconds ( 0.00 objects/sec).

2010/09/10 23:31:10| Beginning Validation Procedure

2010/09/10 23:31:10| Completed Validation Procedure

2010/09/10 23:31:10| Validated 25 Entries

2010/09/10 23:31:10| store_swap_size = 0

2010/09/10 23:31:11| storeLateRelease: released 0 objects

This log message contains informaon on the rebuilding of the cache from the hard disks.

The previous examples of log messages which we have looked at are for a successful startup

of Squid. Let's see how the log messages look when Squid encounters some problems. For

example, if Squid doesn't have write permissions on the cache directory, then the following

log message will appear in the cache log:

2010/09/10 01:42:30| Max Mem size: 262144 KB

2010/09/10 01:42:30| Max Swap size: 1024000 KB

2010/09/10 01:42:30| /opt/squid/var/cache/00: (13) Permission denied

FATAL: Failed to verify one of the swap directories, Check cache.log

for details. Run 'squid -z' to create swap directories

if needed, or if running Squid for the first time.

Squid Cache (Version 3.1.10): Terminated abnormally.

So, we can see that Squid is reporng 'Permission denied' on the cache directory. Whenever

there is a problem, Squid will try to describe the possible cause and a resoluon, or the most

appropriate acon that may x the problem.

Chapter 5

[ 137 ]

What just happened?

We learned the meaning of the various messages popping up in a cache log. Generally, if

anything goes wrong with our proxy server, the rst thing we should do is check the cache

log for any error messages or warnings. If Squid is running out of resources such as memory,

le descriptors, or disk space for example, then it will log appropriate messages in the cache

log and will also try to log the possible xes for the problems.

Have a go hero – exploring the cache log

Run the Squid proxy server and try to understand the messages being logged by Squid in the

cache log le.

Access log

The cache.log le is important for debugging if Squid is misbehaving. But the most

important log le is the access.log le, where Squid logs the live informaon about who is

accessing our proxy server, and related informaon about the status of requests and replies.

The locaon of the access.log le is determined by the direcve access_log, in the Squid

conguraon le. By default it is set defaults to ${prefix}/var/logs/access.log.

Understanding the access log

The log messages in the access.log le are not as readable as messages in the cache.log

le, but once we understand what the dierent elds mean, it's very easy to interpret the log

messages. There are mulple formats in which messages are logged in the access.log le.

The messages that we are going to see next, are in the default log format called squid.

Time for action – understanding the access log messages

Let's look at a few lines from the access.log le before we actually explore the dierent

elds in the log message:

1284565351.509 114 127.0.0.1 TCP_MISS/302 781 GET http://www.

google.com/ - FIRST_UP_PARENT/proxy.example.com text/html

1284565351.633 108 127.0.0.1 TCP_MISS/200 6526 GET http://www.

google.co.in/ - FIRST_UP_PARENT/proxy.example.com text/html

1284565352.610 517 127.0.0.1 TCP_MISS/200 29963 GET http://www.

google.co.in/images/srpr/nav_logo14.png - FIRST_UP_PARENT/proxy.

example.com image/png

1284565354.102 147 127.0.0.1 TCP_MISS/200 1786 GET http://www.

google.co.in/favicon.ico - FIRST_UP_PARENT/proxy.example.com image/x-

icon

Understanding Log Files and Log Formats

[ 138 ]

In the previous example of a log message, the rst column represents the seconds

elapsed since a Unix epoch (for more informaon on the Unix epoch, refer to http://

en.wikipedia.org/wiki/Unix_epoch

), which can't really be interpreted by human

users. To quickly convert the mestamps in access log messages, we can use Perl, as shown:

$ perl -p -e 's/^([0-9]*)/"[".localtime($1)."]"/e' < access.log > access.

log.h

Now the access log messages should look similar to the following with mestamps converted

to normal me:

[Wed Sep 15 21:12:31 2010].509 114 127.0.0.1 TCP_MISS/302 781 GET

http://www.google.com/ - FIRST_UP_PARENT/proxy.example.com text/html

[Wed Sep 15 21:12:31 2010].633 108 127.0.0.1 TCP_MISS/200 6526 GET

http://www.google.co.in/ - FIRST_UP_PARENT/proxy.example.com text/html

[Wed Sep 15 21:12:32 2010].610 517 127.0.0.1 TCP_MISS/200 29963 GET

http://www.google.co.in/images/srpr/nav_logo14.png - FIRST_UP_PARENT/

proxy.example.com image/png

[Wed Sep 15 21:12:34 2010].102 147 127.0.0.1 TCP_MISS/200 1786 GET

http://www.google.co.in/favicon.ico - FIRST_UP_PARENT/proxy.example.com

image/x-icon

The second column represents the response me in milliseconds. The third column

represents the client's IP address. The fourth column is a combinaon of Squid's requests

status and the HTTP status code. The h column represents the size of the reply including

HTTP headers. The sixth column in the log message represents the HTTP request method

which will be GET most of the me, but may also have values such as POST, PUT, DELETE,

and so on.

The seventh column represents the request URL. The eighth column is the username, which

is blank in this case because the request was not authencated. The ninth column is a

combinaon of the Squid hierarchy status and IP address or peer name of the cache peer.

The last column represents the content type of the replies.

What just happened?

We had a look at a few log messages generated by Squid in the default log format. We also

learned what the individual columns mean in the messages. We don't need to memorize

this as the meaning of these columns will become obvious once we learn about the various

format codes to construct the log formats.

Chapter 5

[ 139 ]

Access log syntax

We can use dierent places for logging access log messages. We can use a combinaon

of access_log and logformat direcves to specify the locaon and format of the log

messages. Next, we are going to explore them one by one.

Time for action – analyzing a syntax to specify access log

Let's have a look at the syntax of the access_log direcve:

access_log <module>:<place> [<logformat name> [acl acl ...]]

The eld module is one of the none, stdio, daemon, syslog, tcp, and udp methods,

which determine how the messages will be logged to a

place, and is the absolute path to

the le or place where the messages should be logged. Let's take a brief look at the meaning

of dierent modules:

none—The log messages will not be logged at all.

stdio—The log messages will be logged to a le immediately aer the compleon

of each request.

daemon—This module is similar to stdio module, however the log messages are

not wrien to the disk and are passed to a daemon helper for asynchronous

handling instead.

syslog—This module is used to log each message using the syslog facility. The

parameter place is specied in the form of the syslog facility and the priority

level for the log entries. For example, daemon.info will use the daemon syslog

facility and messages will be logged with the info priority.

The valid values of the

syslog facilies are authpriv, daemon, local0, local1,

..., local7, and user. The valid values of priority are err, warning, notice,

info, and debug.

tcp—When the tcp module is used, the log messages are sent to a TCP receiver.

The format for specifying the place parameter is \\host:port.

udp—When the udp module is used, the log messages are sent to a UDP receiver.

The format for specifying a place parameter is \\host:port.

We can specify an oponal

logformat name and can control logging using ACL lists as well.

The following is the default access log conguraon used by Squid:

access_log daemon:/opt/squid/var/logs/access.log squid

In this conguraon, /opt/squid/ is the ${prefix} and squid is the logformat

being used.



Understanding Log Files and Log Formats

[ 140 ]

What just happened?

We learned about specifying opons for the access_log direcve. We also had a brief look

at the various modules available for wring logs. We also learned about oponal controlling

of log messages using ACL lists so that we can log the only requests that we are interested in.

Have a go hero – logging messages to the syslog module

Use the access_log direcve to congure Squid to send log messages to the

syslog module.

Log format

In the previous secon, we had learned about sending log messages to dierent places. Now,

we are going to learn about formang the log messages according to our needs.

Time for action – learning log format and format codes

Log format can be dened using the logformat direcve available in the Squid

conguraon le. The syntax for dening logformat is as follows:

logformat <name> <format specification>

Format specification is a series of format code, as described in the following informaon:

Format code Format descripon

A literal % character.

Unique sequence number per log line entry.

err_code

The ID of an error response served by Squid or a similar internal error idener.

err_detail

Addional err_code dependent error informaon.

Client's source IP address.

Client's FQDN (Fully Qualied Domain Name).

Client's source port.

Server's IP address or peer name.

Local IP address of the Squid proxy server.

Local port number on which Squid is listening.

<lp

Local port number of the last server or peer connecon.

Seconds since Unix epoch.

Sub-second me (in milliseconds).

Local me. Oponal strftime format argument. The default is %d/%b/

%Y:%H:%M:%S %z

Chapter 5

[ 141 ]

Format code Format descripon

GMT me. Oponal strftime format argument. The default is %d/%b/

%Y:%H:%M:%S %z

Response me (milliseconds).

Total me spent making DNS lookups (milliseconds).

[http::]>h

Original request header. Oponal header name argument on the format header

[:[separator]element].

[http::]>ha

The HTTP request headers aer adaptaon and redirecon. Oponal header

name argument as for >h.

[http::]un

User name.

[http::]<h

Reply header. Oponal header name argument as for >h.

[http::]ul

User name from authencaon.

[http::]ui

User name from ident request.

[http::]>Hs

HTTP status code sent to the client.

[http::]<Hs

HTTP status code received from the next hop.

[http::]<bs

Number of HTTP-equivalent message body bytes received from the next hop,

excluding chunked transfer encoding and control messages. Generated FTP/

Gopher lisngs are treated as received bodies.

[http::]Ss

Squid request status (TCP_MISS, and so on).

[http::]Sh

Squid hierarchy status (DEFAULT_PARENT, and so on).

[http::]mt

MIME content type of the reply.

[http::]rm

HTTP request method (GET/POST, and so on).

[http::]ru

Request URL.

[http::]rp

Request URL path, excluding hostname.

[http::]rv

Request protocol version.

[http::]<st

Sent reply size including HTTP headers.

[http::]>st

Received request size including HTTP headers. In the case of chunked requests,

the chunked encoding metadata is not included.

[http::]>sh

Received HTTP request headers' size.

[http::]<sh

Sent HTTP reply headers' size.

[http::]st

Request and reply size including HTTP headers.

[http::]<sH

Reply high oset sent.

[http::]<sS

Upstream object size.

[http::]<pt

Peer response me in milliseconds. The mer starts when the last request byte

is sent to the next hop and stops when the last response byte is received.

[http::]<tt

Total server-side me in milliseconds. The mer starts with the rst connect

request (or write I/O) sent to the rst selected peer. The mer stops with the

last I/O with the last peer.

Understanding Log Files and Log Formats

[ 142 ]

We can use any number of the aforemenoned format codes to construct a log format

according to our choice or requirement. While specifying a format code, we must prex

the format code with a % so that Squid can evaluate it.

What just happened?

We have just learned about the syntax for building new log formats that can be used with

the access_log direcve at a later stage, for custom logging. We also saw a list of format

codes available for logging dierent informaon about a parcular request from a client. Log

formats are always a combinaon of these format codes.

Log formats provided by Squid

By default, Squid provides four log formats that can be used right away. Let's see the default

log formats provided by Squid:

logformat squid %ts.%03tu %6tr %>a %Ss/%03>Hs %<st %rm %ru %un %Sh/%<A

%mt

logformat squidmime %ts.%03tu %6tr %>a %Ss/%03>Hs %<st %rm %ru %un

%Sh/%<A %mt [%>h] [%<h]

logformat common %>a %ui %un [%tl] "%rm %ru HTTP/%rv" %>Hs %<st

%Ss:%Sh

logformat combined %>a %ui %un [%tl] "%rm %ru HTTP/%rv" %>Hs %<st

"%{Referer}>h" "%{User-Agent}>h" %Ss:%Sh

As we saw, the default log format for access log is squid, therefore we can now interpret the

log messages we saw earlier very easily. Let's see one of the lines from the log messages

shown earlier:

1284565354.102 147 127.0.0.1 TCP_MISS/200 1786 GET http://www.

google.co.in/favicon.ico - FIRST_UP_PARENT/proxy.example.com image/x-

icon

If we refer to the table of format codes, we can observe that the seventh column (%ru)

represents the request URL sent by the HTTP client.

Time for action – customizing the access log with

a new log format

Squid has a lot of informaon about every client request and reply, however it writes only the

requested informaon to the log le, which we can customize by dening several log formats.

Chapter 5

[ 143 ]

Now, let's dene a log format in which the me will appear in a human-readable format and

use it with access_log:

logformat minimal %tl %>a %Ss/%03>Hs %rm %ru

access_log daemon:/opt/squid/var/logs/access.log minimal

So, we have constructed a new log format that will log the informaon we are most

interested in. Let's see a few log messages in the preceding format:

11/Sep/2010:23:52:33 +0530 127.0.0.1 TCP_MISS/200 GET http://

en.wikipedia.org/wiki/Main_Page

11/Sep/2010:23:52:34 +0530 127.0.0.1 TCP_MISS/200 GET http://

en.wikipedia.org/images/wikimedia-button.png

Now the me in the log messages is human-readable and we can therefore tell when

a parcular URL was accessed.

We should note that if we are using custom formats for access log, then we may not be able to

use several external programs that can parse and analyze Squid's access log. However, we can

solve this problem by using mulple access log direcves to log the messages in more than one

format so that one log le is for analyzing, and the other le can be used for manual viewing.

What just happened?

We constructed a new log format and used it with the access_log direcve. Now the

me of requests in all the log messages will be in a human-readable format. Next, we can

construct any number of log formats and use them with the access_log direcve to

achieve dierent types of log messages.

Selective logging of requests

Somemes we may not want to log requests from certain clients. This could be because of

several reasons. One reason may be that a team is working on a highly secret project and we

don't want to leave any impressions of their browsing paerns anywhere.

Logging of requests can be controlled using two direcves, namely,

log_access and

access_log. These direcves may look confusing when used in the same sentence but

we can interpret the meaning by the sequence in which the individual words appear in

the direcve name. The direcve access_log is used for controlling the format of the

log messages and the locaon where the messages will be logged. While the direcve

log_access is used to control whether a parcular request should be logged or not.

We have already learned about the

log_access direcve in the Log Access secon in

Chapter 2, Conguring Squid. Now, we will learn about using the access_log direcve

to cache selecve requests.

Understanding Log Files and Log Formats

[ 144 ]

Time for action – using access_log to control logging

of requests

As we have seen in a previous secon of this chapter, the syntax of the access_log direcve

is as follows:

access_log <module>:<path> [<logformat name> [acl acl ...]]

So, here we have an opon to specify ACL lists which we can use to control where the

dierent requests will be logged, if at all. Let's consider a scenario where we don't want

to log requests to Yahoo! servers and we do want to log requests to Google and Facebook

servers to separate les, and all other requests go to the access log. This scenario can be

realized with the following conguraon:

acl yahoo dstdomain .yahoo.com

acl google dstdomain .google.com

acl facebook dstdomain .facebook.com

log_access deny yahoo

log_access allow all

access_log /opt/squid/var/logs/google.log squid google

access_log /opt/squid/var/logs/facebook.log squid facebook

access_log /opt/squid/var/logs/access.log

If we look at the conguraon carefully, we are denying log_access for all the requests

to Yahoo! servers. This means that clients will be able to browse Yahoo! websites, but

the informaon will not be logged to any access log. Also, we are logging requests to

Google websites in a le named google.log and requests to Facebook in a le named

facebook.log. All requests will be logged to the access.log le, which is the default

log le used by Squid.

What just happened?

We just learned about the control provided by Squid, using which we can log various

requests to dierent log les for analysis at a later stage.

Referer log

When a client clicks a link to other.example.com on the website example.com, then

the website example.com is a referrer and the client is referred to the website other.

example.com

. When a client is referred by a website, a HTTP header referer is sent by

the HTTP clients. Squid has the ability to log referer HTTP headers, which can later be used

for analyzing trac paerns.

Chapter 5

[ 145 ]

"Referer" is actually a misspelling of the word "Referrer",

but it has been ocially specied that way in HTTP RFCs.

Time for action – enabling the referer log

By default, there is no referer log. We can enable the referer log using the access_log

direcve in combinaon with a custom log format. To generate the referer log, rst of all,

we need to create a log format as shown:

logformat referer %ts.%03tu %>a %{Referer}>h %ru

This conguraon denes a new log format called referer, which contains a request

mestamp, IP address of the client, the referer URL, and the request URL. Now, we need to

use the

access_log direcve with the aforemenoned constructed log format as shown:

access_log /opt/squid/var/logs/referer.log referer

Now, let's look at a few lines from the referer log le:

1284576601.898 127.0.0.1 http://en.wikipedia.org/wiki/Main_Page

http://en.wikiquote.org/wiki/Main_Page

1284576607.732 127.0.0.1 http://en.wikiquote.org/wiki/Main_Page

http://upload.wikimedia.org/wikiquote/en/b/bc/Wiki.png

The referer log is a bit easier to understand. The rst column is the me elapsed since epoch,

which can't be customized to a human-readable me. The second column is the client's IP

address. The third column is the referer link, and the fourth column is the link to which the

client is referred.

What just happened?

We enabled the logging of referrers, which is not present by default. Now we can observe the

web browsing paerns on our network. Referer logging is done mostly for analysis purposes.

Time for action – translating the referer logs to a

human-readable format

We can translate a referer log to a human-readable format by using the command line ulity

awk. We can convert the enre referer.log le to a human-readable format by using the

following command sequence:

$ cat referer.log | awk '{printf("%s ", strftime("%d/%b/

%Y:%H:%M:%S",$1)); print $2 " " $3 " " $4;}' > referer_human_readable.log

Understanding Log Files and Log Formats

[ 146 ]

The log messages from referer.log, as shown, should look like the following messages

aer conversion:

12/Sep/2010:01:36:06 127.0.0.1 http://en.wikipedia.org/wiki/Main_Page

http://en.wikiquote.org/

12/Sep/2010:01:36:12 127.0.0.1 http://en.wikiquote.org/wiki/Main_Page

http://upload.wikimedia.org/wikiquote/en/b/bc/Wiki.png

The command we saw before works ne for the conversion of the enre log le, but is not

useful if we want to see the live referer log with human-readable mestamps. For achieving

this, we can use the following command:

$ tail -f referer.log | awk '{printf("%s ", strftime("%d/%b/

%Y:%H:%M:%S",$1)); print $2 " " $3 " " $4;}'

This will convert the mestamp to a human-readable me on the y.

If we don't want to use the previous command combinaons, we can modify our referer log

format to log mestamps in a human-readable format, as shown:

logformat referer %tl %>a %{Referer}>h %ru

This log format contains the mestamp in a human-readable local me format.

What just happened?

We learned to use the command line ulies like cat, tail, and awk to print the

mestamps in our proxy server's referer logs in a more user-friendly format.

Have a go hero – referer log

Enable referer logging on your proxy server. Now, using your proxy server, browse to any

website and click a few links on that website. Now check your referer log le and observe

the referer links.

User agent log

All requests from clients generally contain the User-Agent HTTP header, which is basically

a formaed string describing the HTTP client being used for the current request. As Squid

knows everything about the requests, it can log this HTTP header eld to the log le dened

by the useragent_log direcve in the Squid conguraon le.

Chapter 5

[ 147 ]

Time for action – enabling user agent logging

By default, the user agent log is disabled and we can enable it by using the following line in

our conguraon le:

useragent_log /opt/squid/var/logs/useragent.log

Once we have the user agent log enabled, Squid will start logging the User-Agent HTTP

header eld from the requests, depending on the availability of the eld. Let's see a few lines

from an example user agent log:

127.0.0.1 [12/Sep/2010:01:55:33 +0530] "Mozilla/5.0 (X11; U; Linux

i686; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 GTB7.1"

127.0.0.1 [12/Sep/2010:01:55:33 +0530] "Mozilla/5.0 (X11; U; Linux

i686; en-US; rv:1.9.2.6) Gecko/20100625 Firefox/3.6.6 GTB7.1

GoogleToolbarFF 7.1.20100830 GTBA"

The format of this le is quite simple and only the last column, represenng the user agent,

is of interest here. The user agent log can be used to analyze the popular web browsers on

a network.

What just happened?

We learned to enable logging of the User-Agent HTTP header eld from all client requests,

which are subject to availability, and can be used for analyzing the popular HTTP clients at

a later stage.

Emulating HTTP server-like logs

Squid has an oponal feature that can help in generang log messages similar to messages

generated for most HTTP servers. We can use the access_log direcve to log messages

with the log format

common.

Time for action – enabling HTTP server log emulation

By default, Squid will generate a nave log, which contains more informaon than the

logs generated with the HTTP log emulaon on. We can use the following line in our

conguraon line:

access_log daemon:/opt/squid/var/logs/access.log common

Understanding Log Files and Log Formats

[ 148 ]

This conguraon will log messages in a web server-like format. Let's have a look at a few log

messages in the HTTP server-like log format:

127.0.0.1 - - [13/Sep/2010:17:38:57 +0530] "GET http://www.google.com/

HTTP/1.1" 200 6637 TCP_MISS:FIRSTUP_PARENT

127.0.0.1 - - [13/Sep/2010:17:40:11 +0530] "GET http://example.com/

HTTP/1.1" 200 1147 TCP_HIT:HIER_NONE

127.0.0.1 - - [13/Sep/2010:17:40:12 +0530] "GET http://example.com/

favicon.ico HTTP/1.1" 404 717 TCP_MISS:FIRSTUP_PARENT

These log messages are similar to log messages generated by the famous open source web

server Apache and many others.

What just happened?

We learned to switch on the HTTP server-like log emulaon of Squid access logs. Squid

access logs can be easy to understand if we are already familiar with web server logs.

Log le rotation

As me passes, the size of the log les increases rapidly and starts occupying more and more

disk space. To overcome this problem of the accumulaon of logs over me, we generally

keep the logs for the previous one or two weeks. To remove old log messages and retain

the recent ones, Squid has a built-in feature of log le rotaon, which can move older log

messages to separate les. Moreover, Squid stores the incremental copy of the storage index

in a le swap.state, which is also pruned down during log rotaon.

To rotate logs, we have to use the

squid command as follows:

$ squid -k rotate

This command will rotate logs depending on the value specied with the direcve

logfile_rotate in the conguraon le. The default value of logfile_rotate is 10.

This means that 10 older versions of all log les will be retained.

Have a go hero – rotate log les

Try to rotate log les on your proxy server and see how the log les are renamed.

Other log related features

We discussed important logging related direcves in the previous secons. Squid has

more direcves related to logging, but they are less important and we should not have any

problems in operang Squid normally, even if we are not aware of these features.

Chapter 5

[ 149 ]

Cache store log

If we have disk caching enabled on our proxy server, Squid can log its enre disk

caching related acvies to a separate log le whose locaon is determined by the direcve

cache_store_log. This log le, contains informaon about the web objects being cached

on the disk, stale objects being removed from the cache, and how long an object was in the

cache. The informaon logged in this le is not parcularly user-friendly. By default, logging

of storage acvity is disabled.

Pop quiz

1. Consider the following conguraon line:

access_log daemon:/opt/squid/var/logs/access.log

Which log format will be used by Squid in accordance with the previous conguraon?

a. common

b. squid

c. combined

d. squidmime

2. Which one of the following is a disadvantage of logging client requests?

a. An administrator can gure out resource usage by several clients.

b. A client's browsing behavior can be predicted by analyzing requests.

c. It can help administrators in debugging anomalies.

d. Logs can ll up hard disks.

3. Which of the following is a not valid reason for log rotaon?

a. Keeping old logs is a violaon of client privacy.

b. Old logs are generally not needed.

c. Log rotaon can help us in recovering disk space periodically.

d. Generally, we analyze logs, store the results, and delete logs to save disk space.

Understanding Log Files and Log Formats

[ 150 ]

Summary

In this chapter, we have learned to interpret several log les generated by Squid. We had

a detailed look at the format codes that Squid uses to construct log messages and how

we can construct custom log formats depending on the requirements.

Specically, we understood cache log, debugged messages generated by Squid, and had a

detailed overview of access log and format codes. We customized log messages using several

log formats and selecvely logged requests to various log les, and enabled the referer and

user agent log messages.

We also discussed about rotang log les to prevent unnecessary wastage of disk space.

Now that we have learned about the various log les and log messages, we will go

on to learn about using these messages to monitor our proxy server and analyze the

performance of our cache, in the next chapter.

Managing Squid and

Monitoring Trafc

In the previous chapter on log les, we learned about the dierent types of log

messages generated by Squid and the various log les containing the dierent

types of log messages. So, in the last few chapters, we have learned about

running a Squid proxy server and interpreng the various log les. As it's not

convenient to manually check log les every me, and as it's almost impossible

to analyze trac by manually going through the log les, it's me to explore

Squid's cache manager which is a web interface which is used to monitor and

manage the proxy server. We'll also look at a few log le analyzers that can

directly parse log les generated by Squid and then present a stascal analysis

of web pages browsed by clients.

In this chapter, we shall learn the following:

Using cache manager (web interface)

Installing the external log le analyzer soware

So, let's get started.

Cache manager

As described briey in the earlier chapters, cache manager (cachemgr) is a web interface

for managing the Squid proxy servers. It is provided by default. This means that we don't

have to install any addional module, or soware, other than a web server to have a web

interface to manage our proxy server. Also, cache manager is not just an interface to manage

our proxy server. It provides various stascs about the usage of dierent resources that can

help us in monitoring the proxy server from a web interface.



Managing Squid and Monitoring Trac

[ 152 ]

But before we can use the cache manager web interface, we need to congure Squid and our

web server to use the

cachmgr.cgi program for providing the web interface.

Installing the Apache Web server

Although this topic is out of the scope of this book, we'll have a quick look at installing

Apache, which is a very popular open source Web server and is available for free from

http://httpd.apache.org/. Apache is available in soware repositories of most

Linux/Unix-based operang systems under dierent names.

Time for action – installing Apache Web server

To install Apache on Red Hat Enterprise Linux, CentOS, or Fedora, we can use yum, the

default package manager for these distribuons, for example:

$ yum install httpd

To install Apache Ubuntu or Debian, we can use the aptitude package manager, as shown

in the following example:

$ aptitude install apache2

For installing Apache on other operang systems, please check the package installaon

manual of the operang system.

What just happened?

We learned to install the very popular open source Web server, Apache, using the package

manager for our operang system. This will help us in geng the web interface for the

cache manager up and running.

Conguring Apache for providing the cache manager

web interface

Aer installing Apache, we need to congure it to use cachemgr.cgi. The le cachemgr.

cgi

is generally located at ${prefix}/libexec/cachemgr.cgi where ${prefix} is the

value specied for the --prefix opon, before running configure.

On some operang systems like OpenBSD, Apache is chrooted by default. Please

visit http://www.openbsd.org/faq/faq10.html#httpdchroot for

more informaon.

Chapter 6

[ 153 ]

Time for action – conguring Apache to use cachemgr.cgi

To complete this task quickly we need to put the following lines in a le named

squid-cachemgr.conf and then move that le to our Apache installaon's conf.d

directory (which is generally

/etc/httpd/conf.d/ or /etc/apache2/conf.d/).

ScriptAlias /Squid/cgi-bin/cachemgr.cgi /opt/squid/libexec/cachemgr.

cgi

# Only allow access from localhost by default

order allow,deny

allow from localhost

# If we want to allow access to cache manager from 192.0.2.25,

# uncomment the following line

# allow from 192.0.2.25

# Add additional allowed hosts as needed

# allow from .example.com

</Location>

Once we have copied these lines in to a le called squid-cachemgr.conf and moved that

le to the appropriate directory, we need to restart or reload the Apache Web server using

the following command:

$ apachectl graceful

To learn more about conguring Apache, please check: http://httpd.apache.org/

docs/current/configuring.html

What just happened?

We congured Apache to use cachemgr.cgi as a cgi script to provide the cache manager

web interface, which is a source of a lot of useful informaon about Squid's runme.

Accessing the cache manager web interface

Before we can use the cache manager, we need to congure Squid to allow us to log in to

the cache manager interface. The cache manager specic direcves are cache_mgr and

cachemgr_passwd. Let's learn how to use these direcves.

Managing Squid and Monitoring Trac

[ 154 ]

Conguring Squid

The direcve cache_mgr is used to specify the e-mail address of a local administrator who

will receive an e-mail if the Squid proxy server stops funconing. The default is webmaster,

however, we can change it to something beer such as admin@example.com. For example:

cache_mgr admin@example.com

This conguraon will set the administrators e-mail address to admin@example.com and

an e-mail alert will be sent to this e-mail address if Squid stops funconing.

The direcve

cachemgr_passwd is used for controlling access to various parts of the cache

manager web interface. The format for using the cachemgr_passwd direcve is as follows:

cachemgr_passwd PASSWORD ACTION ACTION ...

The parameter PASSWORD in the conguraon line is the password for the cache manager

web interface in plain text format. There are two special values to the password named

disable and none. The value disable will disable access to acons specied. The value

none can be used if we want to give password less access to some acons.

The parameter

ACTION can be replaced with the names one or more parts of the cache

manager web interface. This parameter has a special value all, which means all parts

of the cache manager web interface.

To allow access to all parts of the cache manager web interface using a password, we can use

the following conguraon line:

cachemgr_passwd s3cr3tP4sS all

This conguraon will allow this password access to all parts of the cache manager

web interface.

To access the cache manager's web interface, we can launch a web browser and go

to the URL http://localhost/Squid/cgi-bin/cachemgr.cgi. We should replace

localhost with the IP address of the proxy server, if we are accessing the web interface

from a dierent machine.

Chapter 6

[ 155 ]

When we go to the previously menoned URL, Squid will present us with a login screen, as

shown in the following screenshot:

Here we can enter admin@example.com as Manager name and Password (which we

set to s3cr3tP4sS in a previous example) and then click on the Connue buon. Once we

authencate, we'll see a list of links that we can use to nd out about the dierent stascs

of Squid. The following screenshot shows some of the links:

The previous screenshot doesn't display all the links available and the number of links

available in the cache manager menu will depend on the version of Squid installed, and

the features which were enabled before compiling.

Now, let's go through some of the pages in the cache manager and see what they represent.

Managing Squid and Monitoring Trac

[ 156 ]

General Runtime Information

We can learn more about Squid and its resource usage from the General Runme

Informaon link in the Cache Manager menu. This link will take us to a page displaying

informaon about the various components of our proxy server:

The rst table in the previous screenshot displays informaon about the me when Squid

was started and the current me.

Following that, the rst block of details gives out informaon about the client connecons.

So, according to the stascs, we have 1146 clients accessing our proxy server and the proxy

server has received more than 60 million requests since starng. Also, we can see that our

proxy server has been serving more than ve thousand requests per minute, on average

since it started.

The second block of details displays informaon about the performance of disk and memory

caching. Request Hit Raos is the rao of requests served from the cache to the total

number of requests in a parcular interval of me. Byte Hit Raos is the rao of the

bytes served from the cache to the total bytes served by the proxy server.

The previous screenshot is only a subset of the total informaon displayed on the page.

Chapter 6

[ 157 ]

IP Cache Stats and Contents

Find IP Cache Stats and Contents in the Cache Manager menu and click on it. This will take

us to a page containing stascs about the IP address cache which Squid has built over me

(refer to the ipcache_size, ipcache_low, and ipcache_high direcves in the Squid

conguraon le).

The stascs will be displayed on the top and should look similar to the following:

IP Cache Statistics:

IPcache Entries: 14550

IPcache Requests: 139729579

IPcache Hits: 119273350

IPcache Negative Hits: 2619823

IPcache Numeric Hits: 8339299

IPcache Misses: 9496827

IPcache Invalid Requests: 280

Next is an explanaon of the previous stascs:

Entry name Descripon

IPcache Entries

The total number of entries in the IP cache. This can be limited using

the ipcache_size direcve in squid.conf.

IPcache Requests

The total number of requests to resolve domain names that Squid has

received so far.

IPcache Hits

The number of requests which could be sased from the IP cache

itself, saving a DNS query.

IPcache Negative

Hits

The number of hits for failed DNS requests due to various errors such

as temporary roung issues.

IPcache Numeric

Hits

Numeric hits occur when a request is for an IP address instead of a

domain name which results in zero DNS queries.

IPcache Misses

IP cache misses is the number of DNS queries that Squid had to make

because the IP addresses for those domain names were not present in

the cache.

IPcache Invalid

Requests

Invalid requests are caused by badly formaed domain names.

Apart from the aforemenoned stascs, cache manager can also show detailed contents of

IP cache. The following is an example of this:

IP Cache Contents:

Hostname Flg lstref TTL N

chesscube.com 0 12 1(0) 174.129.143.69-OK

Managing Squid and Monitoring Trac

[ 158 ]

policy.chesscube.com 0 42 1(0) 75.101.157.73-OK

www.warez-bb.org 0 8084 1(0) 119.42.146.35-OK

rooms.chesscube.com 0 42 1(0) 174.129.144.56-OK

proxy.example.com H 187749 -1 1(0) 127.0.0.1-OK

The rst column in the contents list is the Hostname or domain name seen in

the request.

The second column is

Flg (ag) if present. Flag is blank most of the me. Other

possible values of ag are N, represenng a negavely cached entry and H,

represenng an entry used from host les generally located at /etc/hosts

(refer to the hosts_file direcve in squid.conf).

The third column represents the number of seconds elapsed since the IP address

for this domain name was last requested.

The fourth column represents the me remaining aer which the cached entry

will expire.

The h column represents the number of IP addresses cached for this domain

name and number of addresses in the parentheses that can't be contacted due

to temporary roung issues.

The last column represents a list of IP addresses with sux

OK for good entries and

BAD for corrupted entries.

FQDN Cache Statistics

FQDN (Fully Qualied Domain Name) is a domain name that species its exact locaon in

the tree hierarchy of the Domain Name System (DNS). We can congure Squid to limit the

FQDN entries in the cache using the fqdncache_size direcve in the Squid configure

le. From the list of links on the Cache Manager home page, go to FQDN Cache Stascs.

On this page, we'll see stascs similar to the IP cache stascs. The stascs should look

like the following:

FQDN Cache Statistics:

FQDNcache Entries: 13499

FQDNcache Requests: 13252038

FQDNcache Hits: 7097795

FQDNcache Negative Hits: 2787241

FQDNcache Misses: 3367002

These stats are self descripve, and in case of any problems, please refer to the IP cache

stascs in the previous secon. Now, let's have a look at a few FQDN cache contents.

Address Flg TTL Cnt Hostnames

79.100.155.138 28678 1 79-100-155-138.btc-net.bg

209.197.11.179 22931 1 cds055.lo1.hwcdn.net



Chapter 6

[ 159 ]

114.178.90.174 9099 1 p13174-ipngn501funabasi.chiba.ocn.ne.jp

190.228.215.10 36224 1 host10.190-228-215.telecom.net.ar

80.221.230.176 6887 1 cable-imt-fee6dd00-176.dhcp.inet.fi

190.178.245.105 6597 1 190-178-245-105.speedy.com.ar

187.36.54.232 -3809 1 bb2436e8.virtua.com.br

88.230.162.6 N -10941 0

The format of the FQDN cache contents is similar to the IP cache contents. The only

dierences are:

Hostnames and IP addresses have swapped columns.

The Count column doesn't have any entries for

BAD or corrupt FQDN entries.

There is no column represenng the me since the entry was last referenced.

HTTP Header Statistics

We have learned about the various HTTP header elds in all requests and replies in the

previous chapters. Squid maintains counters for all the header elds it encounters in the

requests and replies. Click the link to HTTP Header Stascs in the Cache Manager menu.

On this page, we can see stascs about the various header elds in requests and replies,

in a nicely formaed tabular form. Let's have a look at a few entries from one of the tables.

These are a few entries for the counters of the header elds in client requests:

The rst column is

id, which is for Squid's internal use.

The second column represents the name of the HTTP header eld.

The third column represents the number of mes a parcular header eld was

found in the HTTP headers, in all client requests.



Managing Squid and Monitoring Trac

[ 160 ]

The fourth column represents the percentage of cases when a parcular header

eld occurred. For example, in the previous screenshot, the occurrence of the

Accept header eld is 88 percent.

Trafc and Resource Counters

Squid keeps tracks of all the requests and data owing through it. A detailed view of these

counters is available using the link Trac and Resource Counters in the Cache Manager

menu. Although a lot of data is available on this page, it's not nicely formaed and is really

only meant for advanced users. Sll, let's try to understand a few elds in the following

screenshot of the page:

Let's try and understand the meaning of a few of the counters in the previous screenshot:

Field Descripon

client_http.requests

Total number of requests received by the proxy server so far,

which is 61 million in this case.

client_http.hits

Total number of requests that could be served from the cache

itself without making a request to the remote web servers. In

this case, the total hits are 22 million which is quite signicant.

client_http.errors

Total number of requests which resulted in an HTTP error like

404 (Page Not Found), 403 (Access Denied), and so on.

client_http.kbytes_in

Total data uploaded by clients in the form of requests or le

uploads. In this case, 187 GB of data has been uploaded by

clients so far.



Chapter 6

[ 161 ]

Field Descripon

client_http.kbytes_out

Total data downloaded by clients in the form of web pages

or le downloads. In this case, 1.6 TB of data has been

downloaded so far since Squid was started.

client_http.hit_

kbytes_out

Total data sent to clients as a result of cache hits. In this case, a

total 279 GB of data has been served as a result of cache hits.

All other elds are similar and can be interpreted easily.

Request Forwarding Statistics

When a request from a client is received by Squid, it idenes a set of possible servers

and tries to forward the request to remote servers. If request forwarding fails, Squid will

try again. A table containing complete stascs about the number of tries versus the HTTP

status code received from the remote server can be accessed using the Request Forwarding

Stascs link in the Cache Manager menu.

The rst column represents HTTP status code (for a list of HTTP status codes and their

meanings, check http://en.wikipedia.org/wiki/List_of_HTTP_status_codes).

The numbers in the cells represent the number of requests. For example, 26.8 million

request forwards resulted in HTTP status code 200 in the rst aempt.

It's worth nong that small numbers of second or third tries are normal, but

if these numbers get large in proporon, it's a sign of network trouble.

Managing Squid and Monitoring Trac

[ 162 ]

Cache Client List

Squid maintains a list of clients which have been served in the past 24 hours. The entries may

fade out depending on the frequency of requests. It also maintains a few stascs related to

each client, which may be of interest when we want to check what a parcular client is up to.

Find the Cache Client List link in the Cache Manager menu and browse to it. The page will

contain a complete list of all the clients.

Let's have a look at the details for the rst client. The rst line represents the IP address

of the client. The second line represents the domain name corresponding to the client's IP

address (will be omied if domain name is not available or if reverse lookups are disabled).

The third line shows the Currently established connecons to this client which is currently

zero. The next line shows the total number of ICP Requests made by this client which is

also zero.

Chapter 6

[ 163 ]

The following line represents the number of HTTP requests made by that parcular client.

The list may also contain a line which will show the clients login username, if it's known. The

next few lines in the HTTP requests block show the counts and percentages of various Squid

statuses for those requests. For the latest list of Squid status codes, check http://wiki.

squid-cache.org/SquidFaq/SquidLogs#Squid_result_codes

Memory Utilization

Squid provides detailed stascs about its memory ulizaon. You will nd a link to

Memory Ulizaon in the Cache Manager menu, click and browse to it. The following

table of informaon is a small secon of the memory ulizaon page:

These stascs are mainly targeted at developers trying to analyze the memory ulized by

various components. The rst column represents the component occupying the memory.

As we can see, the components are acl, acl_deny_info_list, acl_ip_data, and so

on. Therefore, according to the previous table of informaon, a total of 3 KB of memory has

been allocated to the acl component.

This table doesn't represent the total memory occupied by all of Squid's

components. The actual memory ulizaon will be higher than shown in

this table because this table doesn't contain the memory consumpon

by all components.

Managing Squid and Monitoring Trac

[ 164 ]

Internal DNS Statistics

As we learned in the previous chapters, Squid has its own built-in implementaon of a

DNS client, which helps it in resolving domain names to IP addresses. If we click on the

Internal DNS Stascs link in the Cache Manager menu, we'll be presented with various

stascs about the requests performed by the internal DNS client. See the following

screenshot for an example of these stascs:

The rst table represents any DNS queries in the queue for which Squid has not received any

response yet. This table is generally empty or has only a few entries. If this table has a lot

of entries, then that may be an indicaon of a problem with our DNS servers.

The second table shows the number of queries and replies for each DNS server we have

specied, which is prey simple to understand.

The last table is a table represenng the response code of a DNS query against the

number of aempts to resolve a domain name. The count in the cell represents the

number of DNS queries. RCODE value zero (0) means a successful compleon of a DNS

query. For more details on the various values of RCODE, check page 27 of RFC 1035 at

http://tools.ietf.org/html/rfc1035.

Chapter 6

[ 165 ]

Have a go hero – exploring cache manager

There are a lot of other pages available through the cache manager web interface. Explore

them and check what stascs they provide about your proxy server. Its also worth nong

that the Squid components which we have disabled, are missing from the cache manager

menu or have empty stascs pages.

So, we have learned about using cache manager to obtain informaon about resource

ulizaon and general performance stascs over a period of me. Now it's me to install

a Squid log le analyzer which can read and analyze Squid's access log le to generate

interesng stascs.

Log le analyzers

In the previous chapter, we learned about Squid's access log le where every client request

is logged unless congured otherwise. Over a period of me, it's not possible to evaluate this

le manually as it may contain tens of thousands or even millions of entries. To parse and

analyze this le, there are a lot of open source and free third party programs available. A list

of these programs can be accessed at http://www.squid-cache.org/Scripts/.

In this book, we'll have a look at Calamaris, which is a Perl (

http://www.perl.org/) based

log ne analyzer and stascs generator. So, let's have a look at Calamaris.

Calamaris

Calamaris is a Perl-based script that can analyze Squid's access log les and generate

interesng stascs about the usage and performance of the proxy server. The following

are a few of the types reports that Calamaris can produce:

A brief summary of the requests, clients served, bandwidth used, plus stascs

about cache hits and hit rate

Incoming requests by HTTP methods

Incoming TCP/UDP requests by status

Outgoing requests by status, desnaon

Domain-level data ow

Request analysis based on content type (audio, video, images, HTML, and so on) of

the requests.



Managing Squid and Monitoring Trac

[ 166 ]

Calamaris proves to be a good choice because of the following features:

It can cache the parsed data for the le which has already been parsed so we don't

need to parse the same le again to generate reports

It can produce nicely formaed printable plain text reports that look good

It can also generate graphical reports which are a good way to analyze usage

and performance

It can be run using cron to periodically update the document root of a website,

congured in a web server, to view the stascs in a web browser

For the most recent informaon on Calamaris, check the Calamaris ocial website at

http://cord.de/tools/squid/calamaris/.

Installing Calamaris

We must have Perl installed on our server before we can install Calamaris. Perl is available

in soware repositories of almost all Linux/Unix operang systems. Check the installaon

manual for your operang system to install Perl.

Time for action – installing Calamaris

Calamaris can be installed using a package manager for our operang system. For installing

Calamaris on Red Hat Enterprise Linux, CentOS, or Fedora, we can use yum as follows:

$ yum install calamaris

To install Calamaris on Debian, Ubuntu, or other Debian-like operang systems, we can use

aptitude as follows:

$ aptitude install calamaris

If Calamaris is not available in our operang system's soware repository, we can visit the

Calamaris ocial website and download the latest version. Please follow the installaon

instrucons in the soware bundle. We'll be using version 2.99.4.0 in this book.

What just happened?

We learned how to install Calamaris using the package managers for several

operang systems.



Chapter 6

[ 167 ]

Using Calamaris to generate statistics

Once we have nished installing Calamaris, we can use it on the command line to parse our

log les.

Time for action – generating stats in plain text format

Let's say we want to parse our current log le; we can use Calamaris as follows:

$ cd /opt/squid/var/logs/

$ cat access.log | calamaris -a

By default, Calamaris generates stats in plain text format and prints them on a standard

output. To output the stats to a text le, we can use Calamaris as follows:

$ cat access.log | calamaris -a --output-file access_stats.txt

The content of access_stats.txt should look similar to the following:

# Summary

Calamaris statistics

------------------------------------------ ----------- --------

lines parsed: lines 50405872

invalid lines: lines 1

parse time: sec 2456

parse speed: lines/sec 20524

------------------------------------------ ----------- --------

Proxy statistics

------------------------------------------ ----------- --------

Total amount: requests 50405872

unique hosts/users: hosts 1606

Total Bandwidth: Byte 1582G

Proxy efficiency: factor 54.85

(HIT [kB/sec] / DIRECT [kB/sec])

Average speed increase: % 15.81

TCP response time of 86.96% requests: msec 241

(requests > 2000 msec skipped)

------------------------------------------ ----------- --------

Cache statistics

------------------------------------------ ----------- --------

Total amount cached: requests 17955880

Managing Squid and Monitoring Trac

[ 168 ]

Request hit rate: % 35.62

Bandwidth savings: Byte 220G

Bandwidth savings in Percent % 13.90

(Byte hit rate):

Average cached object size: Byte 13149

Average direct object size: Byte 45056

Average object size: Byte 33690

------------------------------------------ ----------- --------

The previous output is self descripve, we can analyze the bandwidth we have been saving

by enabling disk caching and the requests we have served so far. We can also see the stats

about object sizes we have in our cache.

What just happened?

We used Calamaris on the command line to generate plain text reports for our current

access log le. We also learned that Calamaris will print the reports to standard output or

the terminal by default, and we can use the --output-file opon to output the reports

to a le.

Have a go hero – exploring the reports

There will be stats using several other criteria in the stats le generated by Calamaris. Study

them to see what the most popular websites are among your clients.

Time for action – generating graphical reports with Calamaris

Now, let's learn to generate HTML and graphical stascs using Calamaris. To generate

graphical stats, we need to create a directory where Calamaris can dump image les.

So, let's see how it works:

$ mkdir stats

$ cat access.log | calamaris -a --output-file access_stats.html -F

html,graph --output-path ./stats/

Chapter 6

[ 169 ]

The previous command will generate an access_stats.html le along with a few image

les in the stats directory. Let's have a look at a few images from the stats directory:

This image is a graph of TCP requests by the Squid status. On the le-hand side is a scale

represenng the number of requests, and on the right-hand side is a scale represenng the

data transferred in Gigabytes. As we can see from the previous graph, around 17 million

requests resulted in a hit. This means that they could be served from the cache without

fetching data from remote servers.

Managing Squid and Monitoring Trac

[ 170 ]

Let's have a look at another graph:

The previous screenshot shows is a graph of request desnaons by the second level

domain. As we can observe from the graph, a total of 9 million requests were sent to

Facebook servers (*.cdn.net and *.facebook.com).

Calamaris generates a lot of interesng graphs like the ones shown, which can be helpful in

analyzing and opmizing our proxy server to enable it to perform beer.

What just happened?

We learned how to use the various opons with Calamaris to generate HTML and graphical

reports for beer analysis.

Have a go hero – exploring Calamaris

Have a look at the Calamaris man page for more details about the dierent opons which

can be used on the command line.

Chapter 6

[ 171 ]

Pop quiz

1. Which of the following is a correct choice for running the cache manager

CGI program?

a. Apache Web Server

b. Lighpd

c. Roxen Web Server

d. All of the above

2. Which of the following is the correct formula to calculate cache hit rao?

a. Number of cache hits * 100 / Number of requests

b. Number of cache hits * 100 / Number of cache misses

c. Number of bytes served as hits * 100 / Total Number of bytes served

d. Number of cache misses * 100 / Number of requests.

3. Which of the following is the correct formula for calculang byte hit rate?

a. Number of bytes served as cache hits * 100 / Total Number of requests served

b. Number of bytes served in the past 24 hours * 100 / Total Number of

bytes served

c. Number of bytes served as cache hits * 100 / Total Number of requests served

d. Number of bytes served as cache hits / Number of bytes served as cache misses

Summary

We have learned about using the cache manager to monitor our Squid proxy server for

various stascs.

Specically, we have covered the following:

Installing and conguring Apache to use the

cachemgr.cgi program to provide a

web interface for the cache manager.

Various types of informaon and stascs about our running proxy server

An overview of log le analyzers

Installing and using Calamaris to generate interesng stascs about the usage and

performance of our proxy server.

Now that we have learned about monitoring the performance of our proxy server, we'll learn

about protecng our proxy server with authencaon in the next chapter.



Protecting your Squid Proxy Server

with Authentication

In the previous chapters, we have learned about installing, conguring, running,

and monitoring our Squid proxy server. In the last chapter, we also learned about

analyzing the performance of our proxy server along with the usage stascs for

dierent resources. In this chapter, we'll learn about protecng our Squid proxy

server from unauthorized access, using the various authencaon systems which

are available. We'll also learn to develop a custom authencaon helper, using

which, we can design our own authencaon system for our proxy server.

In this chapter, we will learn about:

Squid authencaon

HTTP basic authencaon

HTTP digest authencaon

Microso NTLM authencaon

Negoate authencaon

Using mulple authencaon schemes

Wring a custom authencaon helper

Making non-concurrent helpers concurrent

Common issues with authencaon

So let's get on with it.



Protecng your Squid Proxy Server with Authencaon

[ 174 ]

HTTP authentication

So far we have learned about various ways of controlling access to our Squid proxy server.

Using IP addresses and MAC addresses to idenfy clients provides signicant access control,

but these properes can be spoofed our proxy server can sll be accessed by unauthorized

people. Using Squid authencaon helpers, we can enforce username/password/based

authencaon which can guarantee a higher level of access control.

Squid authencaon helpers work in a simple way by which the user agent or browser sends

out an

Authentication HTTP header eld, containing encoded credenals lled in by

the user. Squid tries to decode the Authentication header eld and passes the decoded

elds to the helper, which then checks the credenals against a precongured service. If the

credenals provided were valid, the client is allowed to access our proxy server; otherwise a

HTTP status 407 (Proxy Authencaon Required) is sent back. This is the complete process of

authencang a client using the Squid authencaon helper against a precongured service.

Squid currently supports four types of authencaon schemes named Basic, Digest, NTML,

and Negoate, which have their own advantages and disadvantages. Authencaon schemes

are congured using the

auth_param direcve in the Squid conguraon le that supports

various opons for dierent authencaon schemes. So let's move on and discuss the various

authencaon schemes and some of the corresponding helpers provided by Squid.

Basic authentication

Basic authencaon is the simplest scheme to congure so that our proxy server enforces

authencaon, but it's the most insecure scheme. This is due to the fact that credenals

are transmied in a Base64-encoded string format, which can be decoded very easily to

get the original credenals, such as, the username and password supplied by the client to

authencate with Squid.

This authencaon scheme is generally discouraged because anyone who is able to sni

your user's network packets will be able to see that person's username and password and

will be able to exploit it very easily. The authencaon schemes Digest or Negoate are

recommended over the Basic authencaon scheme.

This scheme can be used in small, isolated networks where the chances of packet sning

are low and because of the simplicity of conguring Squid to use this scheme.

Time for action – exploring Basic authentication

HTTP Basic authencaon supports the following auth_param opons:

auth_param basic program COMMAND

auth_param basic utf8 on|off

auth_param basic children NUMBER [startup=N] [idle=N] [concurrency=N]

Chapter 7

[ 175 ]

auth_param basic realm STRING

auth_param basic credentialsttl TIME_TO_LIVE

auth_param basic casesensitive on|off

Now let's discuss what each parameter species and what possible values can be passed

with it.

Please note that the opons startup, idle, and concurrency are

available only in Squid version 3.2 or later.

The program parameter species the absolute path to the authencaon helper we are

trying to congure. We should note that, we can also specify addional arguments to the

program on the same line. By default, all the authencaon helpers reside in ${prefix}/

libexec/

where ${prefix} is the value supplied to the --prefix opon while running

the configure program.

The aforemenoned code is given the

username password string aer decoding the

Authentication HTTP header received from the client and the program should output

either OK or ERR, depending on the validity of the credenals. The program should work in

an endless loop following the logic just described.

The

utf8 parameter species whether the credenals 'with' will be translated to UTF-8

encoding before they are passed to the authencaon helper. This is because the HTTP

uses ISO Latin-1 encoding and some authencaon helpers may expect UTF-8.

The

children parameter sets various opons for the authencaon helper. Normally, Squid

will run more than one instance of the authencaon helper, depending on the number

of requests being received from clients. This ensures that the delay caused due to the

authencaon helper, while processing, can be minimized. NUMBER species the number of

child helpers Squid is allowed to spawn. This number should be kept high enough, so that

Squid will not be choked because of a high waing me introduced by authencaon helpers,

and low enough so that the authencaon helpers don't take all the system resources.

The

startup and idle opons with the children parameter species the number of the

processes that should be started when Squid is started or recongured, and the maximum

number of idle helpers present at any me. These numbers help Squid in spawning the

appropriate number of authencaon helpers depending on the current trac.

The

concurrency opon species the number of concurrent credenal validaon requests

one instance of an authencaon helper can process at a me. Most authencaon helpers

will process only one request at a me, per instance, so the default value of concurrency

is 0 (zero) to turn it o. If we are using an authencaon helper that can process mulple

requests concurrently, we can set this value accordingly. Please note that this feature is

available only with Squid version 3.2 or later, we can however, make our exisng helpers

concurrent using helper -mux, which we'll discuss at the end of this chapter.

Protecng your Squid Proxy Server with Authencaon

[ 176 ]

The realm parameter species the message presented to the user by the HTTP client.

The

credentialsttl parameter sets the me aer which Squid will ask the authencaon

helper whether the credenals provided by the client are sll valid or the me for which

they will remain valid. This value should be set high enough to ensure that the user is not

prompted to enter their credenals me and me again. This should be set to a lower value

if there is a short-lived password system in place.

The

casesensitive parameter sets whether the usernames will be case sensive or not.

Most databases for storing user informaon are case insensive and allow usernames in any

case. Seng this parameter to on or o will aect the

max_user_ip ACL type, so we should

use it carefully.

Let's see a conguraon example of an authencaon helper using the Basic

authencaon scheme:

auth_param basic program /opt/squid/libexec/basic_pam_auth

auth_param basic utf8 on

auth_param basic children 15 start=1 idle=1

auth_param basic realm Squid proxy Server at proxy.example.com

auth_param basic credentialsttl 4 hours

auth_param basic casesensitive off

acl authenticated proxy_auth REQUIRED

http_access allow authenticated

http_access deny all

Conguring authencaon helpers is of no use unless

we use the proxy_auth ACL type to control access.

What just happened?

We learned about the various opons available for conguring HTTP Basic authencaon.

We also learned that we must construct ACL lists of the ACL type proxy_auth in order

to enforce proxy authencaon.

Now, we'll have a look at the various authencaon helpers which implement the Basic

authencaon scheme.

Database authentication

The authencaon helper basic_db_auth can validate credenals provided by a client

against a database containing usernames and passwords. For every set of usernames and

passwords supplied, basic_db_auth will try to match it against an exisng database table

containing the username and password columns.

Chapter 7

[ 177 ]

Conguring database authentication

We need to pass addional opons to this authencaon helper to tell it about the database

table which should be used for authencaon. Let's have a quick look at the opons that can

be passed.

Opon Descripon

--dsn

The --dsn opon is used to specify the Database Source Name (DSN) that will

be used by the authencaon helper to connect to a parcular database. The

default value is DBI:mysql:database=squid (replacing the word 'squid'

with the name of the relevant database). So, if we set our database name as

'clients', the corresponding DSN will be DBI:mysql:database=clients.

This helper uses Perl's database library, so any SQL style database can be used.

For a database on a dierent server, we can set the DSN to DBI:mysql:

database=clients:example.com:3306.

--user

The --user opon species the username that will be used while connecng

to the database.

--password

The --password opon species the password that will be used while

connecng to the database.

--table

The database table where Squid will look for usernames and passwords is

specied using the --table options. The default table name used is

passwd.

--usercol

The column name for the usernames is specied using the --usercol opon.

The default value is user.

--passwdcol

The --passwdcol opon can be used to specify the password column name

in the database table. The default value is password.

--plaintext

The --plaintext opon determines whether the passwords stored in the

database are plain text or encrypted. Then authencaon helper assumes

that they are encrypted by default. We can set this opon's value to 1 if the

passwords are stored in plaintext format.

--cond

The --cond opon is quite handy when we want to temporarily deny access

to certain clients using a ag or several condions set using a database table.

The default value of --cond is enabled=1, which means the authencaon

helper will add another condion, AND enabled = 1, in the SQL query

before querying the database. We must set this opon to " " (blank string) if we

don't want any addional condions to be used.

--md5

We can use the --md5 opon if the database contains unsalted passwords.

--salt

Using the --salt opon, we can specify the salt to hash passwords.

--persist

The connecons to the database will be persistent and will remain open in

between queries if the --persist opon is used.

--joomla

We can set the --joomla opon to tell the helper that the database we are

using is a Joomla database, so that it can use appropriate salt hashing. For more

informaon on Joomla, please visit http://www.joomla.org/.

Protecng your Squid Proxy Server with Authencaon

[ 178 ]

So, an example conguraon for basic_db_auth will look like the following:

auth_param basic program /opt/squid/libexec/basic_db_auth --dsn "DBI:

mysql:database=squid_auth" --user 'db_squid' --password 'sQu1Dp4sS' --

table 'clients' --cond ''

This conguraon line will congure basic_db_auth as a basic authencaon helper and

will also supply various opons to the authencaon helper basic_db_auth.

NCSA authentication

NCSA authencaon is an authencaon against a NCSA HTTPd style password le. To know

more about NCSA HTTPd, refer to http://en.wikipedia.org/wiki/NCSA_HTTPd.

Basic NCSA authencaon is easy to set up and manage. All we need to do is, create a le

containing usernames and passwords in a special format and use that password le as an

opon with the authencaon helper program.

Time for action – conguring NCSA authentication

To create and manage users, we can use the htpasswd program, which is a part of httpd

(Apache Web Server).

Let's say we are going to keep the passwords in the

/opt/squid/etc/passwd le, then we

can add some users as follows:

htpasswd /opt/squid/etc/passwd saini

New password:

Re-type new password:

We should enter the password when asked and a combinaon of a username and

encrypted password will be wrien to the password le. To add more users, we can

use the same command.

Now we need to congure the NCSA authencaon helper to use this password le. We can

do so using the following command:

auth_param basic program /opt/squid/libexec/basic_ncsa_auth /opt/

squid/etc/passwd

What just happened?

We learned to add new users to the password le, which is then used by the NCSA

authencaon helper to validate the credenals provided by the user.

Chapter 7

[ 179 ]

NIS authentication

The network Informaon Service or NIS (previously Yellow Pages or YP) is a client-server

directory protocol developed by Sun Microsystems. To be able to use NIS authencaon with

Squid, we need to provide the NIS domain name and the password database, as shown:

auth_param basic program /opt/squid/libexec/basic_nis_auth example.com

passwd.byname

LDAP authentication

The basic LDAP (Lightweight Directory Access Protocol) authencaon helper (basic_

ldap_auth

) provides authencaon using an LDAP server. For this to work, we should have

the OpenLDAP development libraries installed. Refer to http://www.openldap.org/ for

details on installing and conguring an LDAP server.

The

basic_ldap_auth helper has a large number of opons available to congure dierent

sengs for authencang against the LDAP server. However, in this book we will cover only

the necessary opons to get LDAP authencaon working. For the details of all the available

opons, we can always refer to the basic_ldap_auth man page provided by Squid.

Therefore, an example conguraon for proxy authencaon against the LDAP server

ldap.example.com will be as follows:

auth_param basic program /opt/squid/libexec/basic_ldap_auth -b

"dc=example,dc=com" ldap.example.com

In the conguraon shown, ldap.example.com is our LDAP server. The domain

example.com is the base disnguished name (DN).

SMB authentication

SMB authencaon or basic_smb_auth is a very simple way to authencate against SMB

servers like Microso Windows NT or Samba. To be able to use basic_smb_auth, we

should have Samba (http://www.samba.org/) installed on our machine or on another

machine accessible to Squid. Samba is available in soware repositories of most Linux/Unix

distribuons.

Once everything is installed and congured properly, we can add the following conguraon

line to use SMB authencaon:

auth_param basic program /opt/squid/libexec/basic_smb_auth -W

WORKGROUP

The opon -W species the Windows domain name.

Protecng your Squid Proxy Server with Authencaon

[ 180 ]

PAM authentication

Pluggable Authencaon Modules (PAM, http://www.sun.com/software/solaris/

pam/

) is a mechanism to integrate several low-level authencaon schemes such as,

ngerprint, smart cards, one me keys, and so on, into a high-level API. We should note that

PAM is not available on systems such as BSD. Squid provides the basic_pam_auth helper,

which provides authencaon against the PAM database. However, to be able to use PAM

authencaon, we need to congure the Squid (or any other name) service in the

/etc/pam.d/ directory and congure the PAM modules we plan to use.

Time for action – conguring PAM service

An example /etc/pam.d/ Squid le may look similar to the following:

#%PAM-1.0

auth include password-auth

account include password-auth

Once the Squid service is congured in /etc/pam.d/, we need to congure Squid to use the

PAM authencaon. The following conguraon example will allow Squid to authencate

using the PAM database:

auth_param basic program /opt/squid/libexec/basic_pam_auth

For more informaon on conguring basic_pam_auth, refer to the basic_pam_auth

man page.

What just happened?

We learned to congure PAM and to use the basic_pam_auth Squid helper for authencaon.

MSNT authentication

The MSNT Basic authencaon helper provides a way to authencate against NT domain

controllers on Windows or Samba.

Time for action – conguring MSNT authentication

Conguring the MSNT authencaon helper is quite easy and is done by modifying the

/opt/squid/etc/msntauth.conf le. The default conguraon le looks as follows:

# NT domain hosts. Best to put the hostnames in /etc/hosts.

server myPDC myBDC myNTdomain

# Denied and allowed users. Comment these if not needed.

denyusers /opt/squid/etc/msntauth.denyusers

allowusers /opt/squid/etc/msntauth.allowusers

Chapter 7

[ 181 ]

We should replace myPDC (Primary Domain Controller), myBDC (Backup Domain Controller),

and myNTdomain (Windows NT Domain) with values for our environment. We can add as

many as ve dierent domains in this conguraon le.

Also noce the

denyusers and allowusers direcves. The denyusers direcve species

a le that contains all the usernames that must not be allowed to access our proxy server.

The helper will not even bother to check the credenals of the usernames in this le.

The direcve

allowusers species a le which contains a list of usernames that

should always be allowed to access the proxy server, even if the credenals result

in failed validaon.

Once we have nished conguring the MSNT authencaon helper, we can add the following

line in our Squid conguraon le:

auth_param basic program /opt/squid/libexec/msnt_auth

What just happened?

We just learned to create the conguraon le for MSNT authencaon. We have also

learned to create excepons (allow or deny) for certain users without validang the

credenals provided by them.

MSNT multi domain authentication

The MSNT mul domain authencaon works similar to the MSNT authencator, except that

with MSNT mul domain, the client has to enter the Windows NT domain name before the

username, as follows:

workgroup\sarah

The conguraon line for the MSNT mul domain authencaon helper will look similar

to the following:

auth_param basic program /opt/squid/libexec/basic_msnt_multi_domain_

auth

This authencaon helper is a Perl script.

This authencaon helper needs the Authen::SMB Perl package.

Moreover, Samba should be installed on the same system or any

other system accessible to Squid. On the same system, we need the

nmblookup and smbclient binaries.

Protecng your Squid Proxy Server with Authencaon

[ 182 ]

SASL authentication

Simple Authencaon and Secure Layer (SASL) is a framework for authencaon that

decouples the authencaon mechanism from applicaon protocols. The SASL authencaon

helper (

basic_sasl_auth) for Squid is similar to the PAM authencaon helper.

Time for action – conguring Squid to use SASL authentication

To congure the SASL authencator, we need to create a le named basic_sasl_auth.

conf

with the following content:

pwcheck_method:sasldb

1. Move this le to the /usr/lib/sasl2/ directory.

2. Once we have placed the conguraon le in the appropriate directory, we can add

the following line in our conguraon le to ensure the use of SASL authencaon:

auth_param basic program /opt/squid/libexec/basic_sasl_auth

This command will congure Squid to use the basic_sasl_auth program as an SASL

authencaon helper.

The basic_sasl_auth requires the Cyrus SASL

library (http://asg.web.cmu.edu/sasl/).

What just happened?

We learned to congure the SASL authencator and then congure Squid to use SASL

authencaon.

getpwnam authentication

The getpwnam authencaon helper can allow Squid to authencate local users. This

authencaon helper uses the getpwnam() Unix ulity to locate users who have login

accounts on the Squid server and authencate them. Addionally, it can authencate

users against NIS, LDAP, and PAM databases.

To use the

getpwnam authencaon helper, we need to add the following line in our

conguraon le:

auth_param basic program /opt/squid/libexec/basic_getpwnam_auth

Chapter 7

[ 183 ]

POP3 authentication

Squid can authencate clients against an exisng POP3 (Post Oce Protocol Version 3)

user database using the authencaon helper basic_pop3_auth. To use POP3, we need

to specify the IP address or domain name of the server running the POP3 service. We can

congure Squid to use POP3 authencaon, as shown:

auth_param basic program /opt/squid/libexec/basic_pop3_auth pop3.

example.com

The basic_pop3_auth helper uses the Net::POP3 Perl package. So,

we should make sure that we have this package installed before using the

authencaon helper.

RADIUS authentication

The basic_radius_auth authencaon helper allows Squid to connect to a RADIUS server

(for more informaon on RADIUS servers, refer to http://en.wikipedia.org/wiki/

RADIUS

) and then validate the credenals provided by the HTTP client.

Time for action – conguring RADIUS authentication

We can add the following line to our Squid conguraon le to use the RADIUS server for

authencaon:

auth_param basic program /opt/squid/libexec/basic_radius_auth -h

radius.example.com -p 1645 -i squid_proxy -w s3cR37 -t 15

In this conguraon line, the opon -h species the RADIUS server to connect to. The opon

-p idenes which port to use to connect to the RADIUS server. The opon -i species the

unique idener for idenfying the Squid proxy server on the RADIUS server. If opon -i

is not specied, the authencaon helper will use the IP address of the proxy server. The

shared secret with the RADIUS server is specied using the -w opon. Finally, the opon -t

species the request meout. The default request meout is 10 seconds.

In order to avoid specifying a lot of opons in the Squid conguraon le, we can create a

separate conguraon le containing connecon-related informaon. Let's say, we are going

to place the conguraon le at

/opt/squid/etc/basic_radius_auth.conf, then we

can write the following lines in the le:

server radius.example.com

port 1645

identifier squid_proxy

secret s3cR37

Protecng your Squid Proxy Server with Authencaon

[ 184 ]

Now, we can replace the line in our Squid conguraon le with the following line:

auth_param basic program /opt/squid/libexec/basic_radius_auth -f /opt/

squid/etc/basic_radius_auth.conf -t 15

The opon -f is used to specify the conguraon le that will be used by the

basic_radius_auth helper to connect to the RADIUS server.

What just happened?

We learned two ways of using the basic_radius_auth helper. In one method, we can

pass all opons as arguments, and in the other, we can create a separate le containing

informaon about the RADIUS server. Using a separate conguraon le is the more

convenient and recommended method.

Fake Basic authentication

Squid includes an interesng authencaon helper called basic_fake_auth. This

authencaon helper is used for logging clients' credenals without actually checking them

against any user database or service. This authencaon helper is mainly used for tesng and

as a base helper which can be extended to implement complex Basic authencaon helpers.

Digest authentication

HTTP Digest authencaon is an improvement over the regular unencrypted HTTP Basic

authencaon mechanism, allowing user identy to be established securely without having

to send a password over the network. HTTP Digest authencaon is an applicaon of

MD5 cryptographic hashing with the use of the nonce value (for more informaon on the

nonce value, refer to http://en.wikipedia.org/wiki/Cryptographic_nonce)

to prevent cryptanalysis.

The following

auth_param parameters are available for conguring HTTP Digest

authencaon helpers:

auth_param digest program COMMAND

auth_param digest utf8 on|off

auth_param digest children NUMBER [startup=N] [idle=N] [concurrency=N]

auth_param digest realm STRING

auth_param digest nonce_garbage_interval TIME

auth_param digest nonce_max_duration TIME

auth_param digest nonce_max_count NUMBER

auth_param digest nonce_strictness on|off

auth_param digest check_nonce_count on|off

auth_param digest post_workaround on|off

Chapter 7

[ 185 ]

The parameters program, utf8, children, and realm have the same meanings as in HTTP

Basic authencaon. The following is a descripon of the remaining parameters:

Parameter Descripon

nonce_garbage_

interval

The parameter nonce_garbage_interval is used to specify

the me interval aer which the nonces that have been issued are

checked for validity.

nonce_max_duration

The nonce_max_duration parameter species the me for

which a given nonce will remain valid.

nonce_max_count

The parameter nonce_max_count denes the maximum number

of mes a nonce can be used.

nonce_strictness

The client may eventually skip some values while generang nonce

counts like 3, 4, 5, 6, 8, 9, 11, and so on. The parameter nonce_

strictness determines whether Squid should allow cases where

the user agent or client misses a count value. The default value is o

and the user agent is allowed to miss values.

check_nonce_count

The parameter check_nonce_count enforces Squid to check

the nonce count, and in case of failure, the client will be sent

the HTTP status 401 (Unauthorized). This is generally helpful

against authencaon replay aacks. For more informaon on

authencaon replay aacks, refer to http://en.wikipedia.

org/wiki/Replay_attack.

post_workaround

Certain buggy HTTP clients send incorrect request digests in HTTP

POST requests while ulizing the nonce acquired in an earlier HTTP

GET request. The parameter post_workaround is a workaround

for this situaon.

Time for action – conguring Digest authentication

Therefore, an example HTTP Digest authencaon with Squid will look similar to the following:

auth_param digest program /opt/squid/libexec/digest_file_auth

auth_param digest utf8 on

auth_param digest children 20 startup=0 idle=1

auth_param digest realm Squid proxy server at proxy.example.com

auth_param digest nonce_garbage_interval 5 minutes

auth_param digest nonce_max_duration 30 minutes

auth_param digest nonce_max_count 50

auth_param digest nonce_strictness on

auth_param digest check_nonce_count on

auth_param digest post_workaround on

acl authenticated proxy_auth REQUIRED

http_access allow authenticated

http_access deny all

Protecng your Squid Proxy Server with Authencaon

[ 186 ]

Now, let's have a look at the available HTTP Digest authencaon helpers provided by Squid.

What just happened?

We learned about the dierent opons available while conguraon HTTP Digest

authencaon. We also saw an example conguraon that will t most cases.

File authentication

The authencaon helper digest_file_auth (previously known as digest_pw_auth)

authencates credenals provided by the client, against a password le containing

passwords, either in plain text or MD5 encrypted.

If the passwords are stored in a plain text format, a line containing the username and

password will look like the following:

username:password

However, if we store passwords in a plain text format we would not be helping ourselves in

improving security. The only advantage is that we are not transming passwords in a plain text

format over the network. So, there is another format using which we can store the passwords

in the encrypted format. The format for storing the encrypted passwords is as follows:

username:realm:HA1

In this format code, HA1 is MD5(username:realm:password). So, once we have our

password le ready, we can proceed to congure Squid to use Digest authencaon by

adding the following line in our Squid conguraon le:

auth_param digest program /opt/squid/libexec/digest_file_auth -c /opt/

squid/etc/digest_file_passwd

Note that we have used the opon -c while specifying the digest password le because we

are using encrypted passwords in our password le. In case where the digest password le

contains passwords in a plain text format, we should not pass the opon -c.

LDAP authentication

We can use LDAP authencaon using the digest_ldap_auth helper for HTTP

Digest authencaon. The conguraon and parameters are similar to the LDAP Basic

authencaon helper. To use digest_ldap_auth with Squid, we can add the following

to the conguraon le:

auth_param digest program /opt/squid/libexec/digest_ldap_auth

-b "ou=clients,dc=example,dc=com" -u "uid" -A "l"

-D "uid=digest,dc=example,dc=com"

-W "/opt/squid/etc/digest_cred" -e -v 3 -h ldap.example.com

Chapter 7

[ 187 ]

The following is an explanaon of the opons used in the preceding conguraon

le addion:

Opon Descripon

-b

The opon -b species the base disnguished name (DN)

-u

The opon -u species the aribute that should be used along with base

DN to generate user DN.

-A

The opon -A idenes the password aribute.

-D

The opon -D represents the DN to bind, to perform searches.

-W

The opon -W represents a path to the le containing the digest password.

-e

The opon -e enforces encrypted passwords.

-v

The opon -v represents the LDAP version.

-h

The last opon -h represents the LDAP server to connect to.

eDirectory authentication

Squid supports Digest authencaon against Novell eDirectory using the digest_

edirectory_auth

authencaon helper. The conguraon opons and usage of this

authencaon helper is similar to the

digest_ldap_auth authencaon helper. For more

informaon on Novell eDirectory, refer to

http://en.wikipedia.org/wiki/Novell_

eDirectory

Microsoft NTLM authentication

NTLM (NT LAN Manager) is a proprietary connecon authencaon protocol developed

by Microso. The following are some important facts that we should know about

NTLM authencaon:

NTLM authencaon only authencates a TCP connecon and not the user using it.

It requires a three-way handshake, which puts a limit on the speed and maximum

client capacity.

It is a binary protocol. So only the windows domain controller can be used.

For more details about NTLM, refer to

http://en.wikipedia.org/wiki/NTLM. The

following auth_param parameters are supported by the NTLM authencaon helpers:

auth_param ntlm program COMMAND

auth_param ntlm children NUMBER [startup=N] [idle=N] [concurrency=N]

auth_param ntlm keep_alive on|off



Protecng your Squid Proxy Server with Authencaon

[ 188 ]

The parameters program and children are similar to the ones in HTTP Basic and Digest

authencaon. If the parameter keep_alive is set to off, Squid will terminate the

connecon aer the inial requests where browsers enquire about the supported schemes.

Default value of the keep_alive parameter is on.

Therefore, an example conguraon with NTLM authencaon will be as follows:

auth_param ntlm program /opt/squid/libexec/ntlm_smb_lm_auth

auth_param ntlm children 20 startup=0 idle=1

auth_param ntlm keep_alive on

acl authenticated proxy_auth REQUIRED

http_access allow authenticated

http_access deny all

Samba's NTLM authentication

We can use NTLM authencaon with the help of the ntlm_auth program, which is a part

of Samba. To congure Squid to use ntlm_auth as an NTLM authencaon helper, we need

to add the following line to our Squid conguraon le:

auth_param ntlm program /absolute/path/to/ntlm_auth

--helper-protocol=squid-2.5-ntlmssp

Please make sure that the path to the ntlm_auth program provided by Samba

is correct in the conguraon line.

We can also force a group limitaon with the ntlm_auth program using the

--require-membership-of opon, as shown as follows:

auth_param ntlm program /absolute/path/to/ntlm_auth

--helper-protocol=squid-2.5-ntlmssp

--require-membership-of="WORKGROUP\Domain Users"

This conguraon will allow users to log in if they are members of a parcular group. To

explore all of the opons available with the ntlm_auth program, refer to http://www.

samba.org/samba/docs/man/manpages/ntlm_auth.1.html

Fake NTLM authentication

Similar to the basic_fake_auth authencaon helper, we have the ntlm_fake_auth

authencaon helper, which acts as a fake NTLM authencator. This authencaon helper

doesn't authencate credenals provided by the client. It is generally used for logging

purposes while tesng NTLM authencaon.

Chapter 7

[ 189 ]

Negotiate authentication

Negoate authencaon is a wrapper of GSSAPI, which in turn is a wrapper over Kerberos

or NTLM authencaon schemes. This protocol is used in Microso Acve Directory enabled

environments with modern versions of the Microso Internet Explorer, Mozilla Firefox, and

Google Chrome browsers. In this protocol, the credenals are exchanged with the Squid

proxy server using the Kerberos mechanism. This authencaon scheme is more secure

compared to NTLM authencaon and is preferred over NTLM.

Time for action – conguring Negotiate authentication

Negoate/Kerberos authencaon is provided by the negotiate_kerberos_auth

authencaon helper. Next, we'll learn to congure the system running Squid to enable

Negoate authencaon.

1. First of all, we need to generate a keytab le using the ktpass ulity on a

Windows machine, as shown:

ktpass -princ HTTP/proxy.example.com@REALM

-mapuser proxy.example.com -crypto rc4-hmac-nt pass s3cr3t

-ptype KRB5_NT_SRV_HST -out squid.keytab

We should make sure that we have a proxy.example.com user account on our

Windows machine before generang the keytab le. Once the keytab le is

generated, move it to an appropriate locaon on the Squid server, for example,

/opt/squid/etc/squid.keytab. We should make sure that only the Squid

user has access to the keytab le on our system.

2. Now, we need to congure Kerberos on our Squid proxy server. For that, we need

to change the libdefaults secon in our Kerberos conguraon le, which is

generally located at /etc/krb5.conf, to the following:

[libdefaults]

default_realm = REALM

dns_lookup_realm = true

dns_lookup_kdc = true

ticket_lifetime = 24h

renew_lifetime = 7d

forwardable = true

3. Aer making changes to the Kerberos conguraon le, we need to make

changes to our Squid startup le. Please refer to Chapter 3, Running Squid for

determining the locaon of the startup script. We should add the following

line to our startup script:

export KRB5_KTNAME=/etc/squid/squid.keytab

Protecng your Squid Proxy Server with Authencaon

[ 190 ]

4. Finally, we need to add the following lines to our Squid conguraon le:

auth_param negotiate program /opt/squid/libexec/negotiate_

kerberos_auth

auth_param negotiate children 15

auth_param negotiate keep_alive on

acl authenticated proxy_auth REQUIRED

http_access allow authenticated

http_access deny all

This conguraon will enable Squid to use Negoate authencaon using the

negotiate_kerberos_auth authencaon helper.

What just happened?

We just learned about Negoate authencaon using Kerberos and how we can congure

our Squid proxy server to use Negoate authencaon for stronger security.

Using multiple authentication schemes

We can congure Squid to use mulple authencaon schemes by using the auth_param

direcve for each authencaon scheme. If we use mulple authencaon schemes, then

Squid will present the clients with a list of available authencaon schemes. According to

RFC 2617 (http://www.ietf.org/rfc/rfc2617), a client must select the strongest

authencaon scheme that it understands. However, due to bugs in various user agents,

they generally pick the rst one.

So, while adding the conguraon lines with the

auth_param direcve in our conguraon

le, we should consider the following order (strongest rst) for the dierent authencaon

schemes:

1. Negoate/Kerberos Authencaon

2. Microso NTLM Authencaon

3. Digest Authencaon

4. Basic Authencaon

Also, it's not compulsory to use congure helpers for all authencaon schemes. We

can congure helpers for any number of authencaon schemes. All we need to do is

preserve the aforemenoned order so that even the buggy clients will pick up a stronger

authencaon scheme to authencate the users.

Chapter 7

[ 191 ]

Writing a custom authentication helper

There is no need to worry if none of the exisng authencaon helpers seem to t your

needs. It is possible to write your own HTTP Basic authencaon helper relavely quickly.

The HTTP Basic authencaon helpers are very simple programs that read username

password

strings from standard input, extract the username and password from the string,

match them against an exisng user database, and then write OK or ERR on the standard

output in a never ending loop.

Time for action – writing a helper program

So, let's write a dummy Python script that can act as a Basic authencaon helper:

#!/usr/bin/env python

import sys

def validate_credentials(username, password):

"""

Returns True if the username and password are valid.

False otherwise

"""

# Write your own function definition.

# Use mysql, files, /etc/passwd or some service you want

return False

if __name__ == '__main__':

while True:

# Read a line from stdin

line = sys.stdin.readline()

# Remove '\n' from line

line = line.strip()

# Check if string contains two entities

parts = line.split(' ', 1)

if len(parts) == 2:

# Extract username and password from line

username, password = parts

# Check if username and password are valid

if validate_credentials(username, password):

sys.stdout.write('OK\n')

else:

sys.stdout.write('ERR Wrong username or password\n')

else:

# Unexpected input

sys.stdout.write('ERR Invalid input\n')

# Flush the output to stdout.

sys.stdout.flush()

Protecng your Squid Proxy Server with Authencaon

[ 192 ]

What just happened?

In the previous program, we are reading one line from the standard input at a me. Then

we go on to extracng the username and password from the input provided. Then, we

try to validate the username and password using our validate_credentials method,

which is a skeleton method and can be implemented to validate the username and

password provided against any system. Depending on the return value of the validate_

credentials

method, we write either OK or ERR on the standard output, which is read by

Squid and it authencates the client accordingly.

We can save the preceding program in a le named

basic_generic_auth.py and move it

to the /opt/squid/libexec/ directory. Now, we can add the following lines to our Squid

conguraon le to use this as a Basic authencaon helper:

auth_param basic program /opt/squid/libexec/basic_generic_auth.py

auth_param basic children 15 startup=0 idle=1

auth_param basic realm Squid proxy server at proxy.example.com

auth_param basic credentialsttl 4 hours

auth_param basic casesensitive on

acl authenticated proxy_auth REQUIRED

http_access allow authenticated

http_access deny all

Have a go hero – implementing the validation function

Implement the validate_credentials method in the previous program such that when

a user enters a password that is a palindrome, the program will consider the supplied

username and password valid.

Making non-concurrent helpers concurrent

Helper concurrency is a relavely new concept in Squid and is supported only within

Squid versions 3.2 or later. However, there is a contributed script called helper-mux that

can convert our old style non-concurrent helper programs into a concurrent helper, thus

improving the overall helper performance by a signicant amount.

The purpose of the

helper-mux program is to share some of the load that Squid has to

handle while dealing with relavely slower helper programs. The helper mulplexer program

acts as a medium through which Squid and actual helper programs exchange messages. So,

the helper mulplexer's interface with Squid is totally concurrent and while talking to the

actual helper program, it uses a non-concurrent interface.

The

helper-mux program can start helper programs on demand and can handle up to 1000

helpers per instance. The helper-mux program doesn't know anything about the messages

being exchanged between Squid and the actual helper program.

Chapter 7

[ 193 ]

Therefore, we can use the helper-mux program to make our Basic NCSA authencaon

helper concurrent, as demonstrated:

auth_param basic program /opt/squid/libexec/helper-mux.pl /opt/squid/

libexec/basic_ncsa_auth /opt/squid/etc/passwd

auth_param basic children 1 startup=1 idle=1 concurrency=5

auth_param basic realm Squid proxy-caching web server

auth_param basic credentialsttl 2 hours

The previous conguraon is equivalent to the following conguraon without concurrency:

auth_param basic program /opt/squid/libexec/basic_ncsa_auth /opt/

squid/etc/passwd

auth_param basic children 5

auth_param basic realm Squid proxy-caching web server

auth_param basic credentialsttl 2 hours

We'll have ve helper processes running if the concurrent conguraon is used.

So, we learned how we can make the old style non-concurrent helpers concurrent using the

helper mulplexer program available with Squid.

Common issues with authentication

Somemes, we may run into problems with authencaon helpers due to incorrect

conguraon. Next, we'll have a look at a few commonly known issues that can be

xed easily by modifying our main conguraon le.

Whitelisting selected websites

Depending on our environment, there may be some websites which can be accessed by our

users without authencang with the proxy server. We can create special ACL lists for such

websites and allow non-authencated users access to those websites. Let's have a look at

the conguraon lines we need to add to our conguraon le:

acl whitelisted dstdomain .example.com .news.example.net

acl authenticated proxy_auth REQUIRED

# Allow access to whitelisted websites.

# But only from our local network.

# localnet is default ACL list provided by Squid.

http_access allow localnet whitelisted

# Allow access to authenticated users.

http_access allow authenticated

# Deny access to everyone else.

http_access deny all

Protecng your Squid Proxy Server with Authencaon

[ 194 ]

This conguraon will allow users in our LAN access to whitelisted websites, and they will

not have to authencate with our proxy server to use or browse these websites. To browse

websites other than ones which are whitelisted, all users will have to be authencated.

Challenge loops

Squid asks HTTP clients for login credenals if the client is denied access by a proxy

authencaon-related ACL (proxy_auth, proxy_auth_regex, external ACL using

%LOGIN). The order of ACLs in an http_access rule determines whether Squid will ask for

can get really annoying.

Normally, a user will see a login pop up once when they open or re-open (aer closing)

their browser for Basic or Digest authencaon. A login pop up may not appear or may

appear once in case of NTLM or Kerberos authencaon. A login pop up may appear again

if the user changes his/her password in the master system. If there are more pop ups than

described here, then we might have some conguraon issues.

Let's have a look at two dierent conguraons.

First of all, let's consider the following conguraon:

# Below auth_acl is of type proxy_auth, proxy_auth_regex

# or externl_acl using %LOGIN

http_access deny non_auth_acl auth_acl

Squid will prompt for new login credenals if the preceding http_access rule is matched

and access is denied because of auth_acl.

Now, consider the following conguraon:

# Below auth_acl is of type proxy_auth, proxy_auth_regex

# or externl_acl using %LOGIN

http_access deny auth_acl non_auth_acl

According this conguraon, Squid will not prompt for new login credenals if this

http_access rule is matched and access is denied because of non_auth_acl. In this case,

the client will be presented with a simple access denied page if the authencaon fails.

To prevent challenge loops, we can keep

all as the last ACL element in our http_access

rule, as shown:

http_access deny !authenticated all

This conguraon will prevent challenge loops as the last ACL element all in the

http_access rule will always match.

Chapter 7

[ 195 ]

Authentication in the intercept or transparent mode

It is not possible to achieve proxy authencaon when Squid is operang in intercept or

transparent mode because the HTTP client is not aware that there is a proxy in between

the client and remote server and hence it doesn't send the credenals required for

authencang a user.

Pop quiz

1. In what format are usernames and passwords transmied over the network while

using HTTP Basic authencaon?

a. Plain text

b. HTML

c. Encrypted text which can't be decrypted

d. Encoded in base64

2. Why should we use case insensive usernames when using database authencaon?

a. Squid can't dierenate between upper and lower case characters

b. Browser's can't dierenate between upper and lower case characters

c. String comparisons in most databases are case insensive

d. There is no such limitaon. We can use case-sensive usernames with all

the databases.

3. Which of the following is the correct command line ulity to change a user's

password in an NCSA HTTPd style password le?

a. passwd

b. chpasswd

c. htpasswd

d. kpasswd

Protecng your Squid Proxy Server with Authencaon

[ 196 ]

Summary

We have learned the various authencaon schemes supported by Squid. We also learned

about the various authencaon helpers available for dierent authencaon schemes.

Specically, we have covered:

A lot of dierent ways to authencate using the HTTP Basic authencaon

HTTP Digest authencaon and helpers supporng Digest authencaon

Microso NTLM authencaon

Negoate authencaon

Wring our own custom authencaon helper, using which we can authencate

against various types of user databases.

Now, we know several ways to protect our Squid proxy server from unauthorized access. In

the next chapter, we'll learn about building a hierarchy of Squid proxy servers to distribute

load and opmize performance.



Building a Hierarchy of Squid Caches

In the previous chapters, we learned that the Squid proxy server can talk

to other proxy servers over the network to share informaon about cached

content, to fetch content from remote servers on behalf of other proxy servers,

or to use other proxy servers to fetch content from remote servers. In this

chapter, we will explore cache hierarchies in detail. We'll also learn to congure

Squid to act as a parent or a sibling proxy server in a hierarchy, and use other

proxy servers as parent or sibling proxy servers.

In this chapter, we will learn about the following:

Cache hierarchies

Reasons to use hierarchical caching

Problems with hierarchical caching

Related Squid conguraon

Controlling communicaon with peers

Peer communicaon protocols

So let's get started.



Building a Hierarchy of Squid Caches

[ 198 ]

Cache hierarchies

A cache hierarchy is the name given to an arrangement of proxy servers which can

communicate with each other to forward requests. The arrangement is typically a tree

structure in which the proxy servers have a parent-child or sibling relaonship. Parent proxy

servers are closer to the remote servers, compared to the child servers, and the child servers

typically use the parent servers to fetch content for their clients. Child servers can act as a

parent server to other proxy servers. Let's have a look at the following diagram:

Siblings are the proxy servers at the same level in the tree structure. In a cache hierarchy,

proxy servers use protocols like ICP, HTCP, Cache Digests, and CARP to idenfy a useful

source. The other peering types are origin server, which is generally a special type of parent

and mulcast, which in essence is a special type of sibling.

Reasons to use hierarchical caching

Somemes, it's necessary to be a part of a cache hierarchy. For example, in a large network

where all packets must pass through a rewall proxy we will be forced to use the rewall

proxy server as a parent proxy, as it's the only point of contact with the Internet. So, all our

cache misses will be fetched by the rewall proxy.

Chapter 8

[ 199 ]

Somemes, we join a cache hierarchy to reduce the average page load me. It helps only

when the fetch me from neighbors is signicantly less than the fetch me from remote

servers. Therefore, if some requests result in a cache MISS in our proxy's cache, it may be

a cache HIT in one of our neighbors caches.

Another example may be, a network where we have a large number of clients and one proxy

server is not able to handle all the trac. In this case, we'll split the load by deploying two

proxy servers as siblings. Two servers will be able to serve HITs from each other's cache,

which will further enhance the performance.

An example that is becoming increasingly popular and important for scalability and

availability of modern day websites, in a hierarchy of proxy servers, is the reverse proxy

mode. Reverse proxy mode is more commonly known as Content Delivery Networks

(CDNs). The main purpose of a CDN is to replicate the content of one or more websites to

various geographic locaons across the Internet and then transparently direct the client web

browsers to the nearest or most responsive cache. For more informaon on CDN, refer to

http://en.wikipedia.org/wiki/Content_delivery_network.

We can also join a cache hierarchy to redirect trac, based on dierent criteria such as, domain

names, content type, request origins, and so on. We'll see examples later in this chapter.

Problems with hierarchical caching

When we are a part of a cache hierarchy, we serve the content received from neighbors

directly to our clients. So, there is a serious problem if content received from neighbors is not

genuine. For example, let's say we are a part of a cache hierarchy and one of our neighbor

proxy servers is compromised. In such a scenario, the compromised proxy server can serve

any content for the requests we are forwarding. This generally leads to propagaon of

viruses and worms on a network. Therefore, all our neighbors should be properly secured

and up-to-date so that we don't end up compromising our client for the sake of increasing

our hit rao.

We'll essenally be forwarding a lot of our client's requests to our neighbor proxy servers.

This may result in leakage of private informaon of our clients. For example, a lot of data is

sent as a part of the URL in HTTP GET requests. If a neighbor cache is not properly striping

query terms from the URL before logging, then the complete URL will be logged in the access

log le, which can be later parsed for retrieving sensive informaon about clients. Hence,

client privacy is also one of the problems we face when we are a part of cache hierarchy.

Building a Hierarchy of Squid Caches

[ 200 ]

Another common problem with hierarchical caching is forwarding loops as a result of

misconguraon. The following is an example scenario:

In the preceding diagram, a request is sent from a client machine to the proxy server 1,

which may in turn, forward the request to proxy server 2 or 3. Let's say the request is

forwarded to proxy server 2, as shown in this scenario. Also, as proxy servers 2 and 3 are

siblings, proxy server 2 will check if proxy server 3 has a cached response for the current

request. Now, if the request results in a cache miss in proxy server 3, then it'll again forward

the request to proxy server 2 to check for a cache hit. This will go on forever, resulng in a

forwarding loop.

Avoiding a forwarding loop

We can avoid such situaons easily by conguring our proxy server properly. We don't have to

forward a request to a proxy server if that proxy server itself was the source of the request.

One quick and paral soluon for this problem is to set the value of the direcve

via to on.

If via is set to on, then Squid will include a via header in requests and replies as required

by RFC 2616. In the presence of a via header, peers will abort early and log an error

message instead of consuming network bandwidth and memory on the proxy servers.

Chapter 8

[ 201 ]

Another foolproof soluon is to have a proper conguraon. Consider the following Squid

conguraon on the proxy server s1.example.com (192.0.2.25):

cache_peer s2.example.com sibling 3128 3130

And the following conguraon on a proxy server s2.example.com (198.51.100.86):

cache_peer s1.example.com sibling 3128 3130

This conguraon may result in a forwarding loop. Now we'll edit the conguraon on both

the servers to avoid forwarding loops.

The conguraon on

s1.example.com should be:

cache_peer s2.example.com sibling 3128 3130

acl sibling2 src 198.51.100.86

cache_peer_access s2.example.com deny sibling2

And the conguraon on s2.example.com should be:

cache_peer s1.example.com sibling 3128 3130

acl sibling1 src 192.0.2.25

cache_peer_access s2.example.com deny sibling1

These conguraons will prevent any possible forwarding loops.

Joining a cache hierarchy

In the previous chapters, we learned about the cache_peer direcve in the Squid

conguraon le and how to use cache_peer to add other proxy servers in our

conguraon le, so that our proxy server can forward requests to neighbors. However, we

only had a brief overview of the opons used along with cache_peer. In this chapter, we'll

explore cache_peer and its various opons in detail.

The following is the format to add a proxy server in a conguraon le using

cache_peer:

cache_peer HOSTNAME TYPE HTTP_PORT ICP_OR_HTCP_PORT [OPTIONS]

The parameter HOSTNAME is the IP address or domain name of the proxy server we

are trying to add to the conguraon le. The TYPE parameter takes one of the values

parent, sibling, or multicast, and species the type of the proxy server, which further

determines the type of communicaon between the two servers.

Building a Hierarchy of Squid Caches

[ 202 ]

Please note that DNS resoluon must be working if you want to use domain

name as a value for the HOSTNAME parameter. Also note that future releases

will support originserver as a type for the TYPE parameter.

The HTTP_PORT parameter species the port on which a neighbor or peer accepts HTTP

requests on the hostname specied with the HOSTNAME parameter. Normally it's 3128.

The

ICP_OR_HTCP_PORT parameter species the ICP or HTCP port for peer communicaon.

The default ICP port is 3130, but we sll need to specify it. Also, if we specify the HTCP port

(default 4827), we must append htcp so that Squid can send HTCP queries to the peer. We

can set this to 0 if we don't want any ICP or HTCP communicaon with the peer.

Time for action – joining a cache hierarchy

Let's add two proxy servers to our Squid conguraon le:

cache_peer parent.example.com parent 3128 3130 default

cache_peer sib.example.com sibling 3128 3130 proxy-only

So, according to this conguraon, parent.example.com is a parent proxy server and

sib.example.com is a sibling proxy server.

What just happened?

We just learned how to add a proxy server or neighbors to our Squid conguraon le,

so that our proxy server can be a part of a cache hierarchy.

Now, let's have a look at the opons which can be used to control ICP or HTCP communicaon.

ICP options

When we congure a peer with ICP communicaon, we must congure the icp_port and

icp_access direcves properly. Next, we'll have a look at the ICP-related opons for the

cache_peer direcve.

no-query

If we use the opon no-query, then Squid will never send any ICP queries to this peer.

multicast-responder

The opon multicast-responder species that this peer is a member of a mulcast

group and Squid should not send ICP queries directly to this peer, however, we can receive

ICP replies from this host.

Chapter 8

[ 203 ]

closest-only

When the closest-only opon is used, and in case there are ICP_OP_MISS replies,

Squid will not forward requests resulng in

FIRST_PARENT_MISS. However, Squid can sll

forward requests resulng in CLOSEST_PARENT_MISS.

background-ping

The background-ping opon instructs Squid to send ICP queries to this peer in the

background only and that too infrequently. This is generally used to update the round trip me.

HTCP options

When we congure a peer with HTCP communicaon, we must properly congure the

htcp_port and htcp_access direcves in the Squid conguraon le. Let's have a look

at the addional opons for HTCP communicaon. Refer to http://tools.ietf.org/

html/rfc2756

for details on the HTCP protocol.

htcp

If we want Squid to use HTCP instead of ICP for communicaon, we must append the htcp

opon with the ICP_PORT parameter, while adding a neighbor using the cache_peer

direcve. Also, we should specify the port 4827 instead of 3130. This direcve accepts a

comma-separated list of the opons described below.

htcp=oldsquid

If we use the opon htcp=oldsquid, Squid will treat this neighbor as Squid version 2.5

or earlier and send HTCP queries accordingly.

htcp=no-clr

When the htcp=no-clr opon is used, Squid is allowed to send HTCP queries to this

neighbor, but CLR requests will not be sent. This opon conicts with the htcp=only-clr

opon and they should not be used together.

htcp=only-clr

The opon htcp=only-clr instructs Squid to send only HTCP CLR requests to this neighbor.

htcp=no-purge-clr

When the opon htcp=no-purge-clr is used, Squid is allowed to send HTCP queries

including CLR requests, only when CLR requests don't result from PURGE requests.

htcp=forward-clr

If the opon htcp=forward-clr is used and our proxy server receives a HTCP CLR request,

they will be forwarded to this peer.

Building a Hierarchy of Squid Caches

[ 204 ]

Peer or neighbor selection

If we add more than one cache peer or neighbor to our Squid conguraon le, then we may

be concerned about how Squid should select the peer to forward misses or send ICP or HTCP

queries to. Squid provides the following opons or methods for peer selecon, depending on

our environment or needs. By default, ICP is used for peer selecon.

default

If we specify the opon default while adding a peer or neighbor, then this parent will

be used when no other peer can be located using any other peer selecon algorithm. We

should not use this opon with more than one peer because this will mean that only the

rst one with the

default opon will be used.

round-robin

The opon round-robin can be used to enable a very simple form of load balancing. The

requests will be forwarded to cache peers marked with the round-robin opon in an

alternate order. This opon is useful only when we use it with at least two peers. The opon

weight, which we'll see in the next secon, biases the request counter which results in

biased peer selecon.

weighted-round-robin

The opon weighted-round-robin instructs Squid to load balance requests among

peers based on the round trip me, calculated by the background-ping opon which we

saw earlier. When this opon is used, closer parents are used more oen than other peers.

We can also use the weight opon which biases the round trip me, resulng in a biased

peer selecon.

userhash

The opon userhash load balances requests on the basis of the client proxy_auth or the

ident username.

sourcehash

The opon sourcehash is similar to userhash, but the load balancing is done on the basis

of the clients source IP address.

carp

The Cache Array Roung Protocol (CARP) is used to load balance HTTP requests across

mulple caching proxy servers by generang a hash for individual URLs. For more

informaon about CARP protocol, please visit http://icp.ircache.net/carp.txt. The

opon carp makes a parent peer a part of the cache array. The requests will be distributed

uniformly among the parents in this cache array based on the CARP load balancing hash

funcon. The opon weight will cause biased peer selecon in this case also.

Chapter 8

[ 205 ]

multicast-siblings

The opon multicast-siblings can be used only with mulcast peers. The members

of the mulcast group must have a sibling relaonship with this cache peer. This opon is

parcularly useful when we want to congure a pool of redundant proxies that are members

of the same mulcast group, which are also known as a cluster of siblings, where mulcast is

used to speed up and reduce the ICP query overhead.

Options for peer selection methods

Along with the various peer selecon methods we discussed previously, there are various

opons that can be combined with peer selecon methods to further opmize the load

balancing. Let's have a look at the available opons.

weight

The opon weight (weight=N) is used to aect the peer selecon in methods that perform

a weighted peer selecon. The larger value of weight means that we are favoring a cache

peer more over other cache peers with smaller values of weight. By default, the value of

weight is 1, which means that all peers are equally favored.

basetime

The opon basetime (basetime=N) is used to specify an amount of me that will be

subtracted from the round trip me of all the parents, before dividing it by the weight

to decide or select the parent to fetch from.

ttl

The opon ttl (ttl=N) is specic to mulcast groups. This opon can be used to specify

a Time to Live (TTL) when sending ICP queries to the mulcast group. Other peers or the

members of the mulcast group must be congured with the

multicast-responder

opon so that they can receive ICP replies properly. For more informaon on TTL, please

check http://en.wikipedia.org/wiki/Time_to_live.

no-delay

If we are using Squid delay pools and we have added several peers to our conguraon le,

then the cache hits from peer proxy servers will be included in the client's delay pools. We

don't want cache hits from other peers to be limited and they should not aect the delay

pools. To achieve such behavior, we can use the no-delay opon to ensure maximum speed.

digest-URL

If cache digests are enabled, Squid will try to fetch them using the standard locaon for

cache digests. However, if we want Squid to fetch cache digests from an alternate URL, we

can use the opon digest-URL (digest-URL=URL) to instruct Squid to fetch digests from

a dierent URL.

Building a Hierarchy of Squid Caches

[ 206 ]

no-digest

The opon no-digest disables fetching of cache digests from this peer.

SSL or HTTPS opons

We can encrypt our connecons to a cache peer with SSL or TLS. Squid provides a series of

related opons using which we can customize the connecon parameters. Later, we'll have

a look at these opons.

ssl

When the opon ssl is set, the communicaon to this cache peer will be encrypted with SSL

or TLS.

sslcert

The opon sslcert (sslcert=FILE) is used to specify the absolute path of a le containing

the client SSL cercate, which should be used while connecng to this cache peer.

sslkey

We can oponally use the sslkey (sslkey=FILE) opon to specify the absolute path to

a le containing private SSL key, corresponding to the SSL cercate, specied using the

sslcert opon. If this opon is not specied, then the SSL cercate le specied using

the sslcert opon is assumed to be a combined le containing the SSL cercate and the

private key.

sslversion

The opon sslversion (sslversion=NUMBER) can be used to specify the version of the

SSL/TLS protocols we need to support. The following are the possible values of sslversion:

1: Automac detecon. This is the default value.

2: SSLv2 only.

3: SSLv3 only.

4: TLSv1 only.

sslcipher

We can specify a colon separated list of supported ciphers using the sslcipher

(sslcipher=COLON_SEPARATED_LIST) opon. Please check man page for ciphers(1)

or visit http://www.openssl.org/docs/apps/ciphers.html for more informaon on

ciphers supported by OpenSSL. Please note that the availability of ciphers depends on the

version of OpenSSL being used.

Chapter 8

[ 207 ]

ssloptions

We can specify various SSL engine-specic opons in the form of a colon separated list

using the

ssloptions (ssloptions=LIST) parameter. Please check the SSL_CTX_set_

options(3)

man page or visit http://www.openssl.org/docs/ssl/SSL_CTX_set_

options.html

for a list of supported SSL engine opons.

sslcale

We can specify a le containing addional CA cercates using the opon sslcafile

(sslcafile=FILE), which can be used to verify the cache peer cercates.

sslcapath

The opon sslcapath (sslcapath=DIRECTORY) is used to specify the absolute path to

a directory containing addional cercates and CRL (Cercate Revocaon List) lists that

should be used while verifying the cache peer cercates.

sslcrlle

The opon sslcrlfile (sslcrlfile=FILE) is the absolute path to a le containing

addional CRL lists, which should be used to verify cache peer cercates. These CRL lists

will be used in addion to the CRL lists stored in sslcapath.

sslags

Using the opon sslflags (sslflags=LIST_OF_FLAGS), we can specify one or more

ags, which will modify the usage of SSL. Let's have a look at the available ags:

DONT_VERIFY_PEER: Accept the peer cercates, even if they fail to verify.

NO_DEFAULT_CA: If the ag NO_DEFAULT_CA is used, the default CA lists built in

OpenSSL will not be used.

DONT_VERIFY_DOMAIN: If the DONT_VERIFY_DOMAIN ag is used, the peer

cercate will not be veried if the domain matches the server name.

ssldomain

The opon ssldomain (ssldomain=DOMAIN_NAME) can be used to specify the peer

domain name, as menoned in the peer cercate. This opon is used to verify the received

peer cercates. If this opon is not specied, then the peer hostname will be used.

front-end-https

Using the opon front-end-hps will enable the Front-End-Https: On header when

Squid is used as a SSL frontend in front of Microso Outlook Web Access (OWA). For more

informaon on why this header is needed, please check http://support.microsoft.

com/kb/307347



Building a Hierarchy of Squid Caches

[ 208 ]

Other cache peer options

Unl now, we have learned about various opons that can be specied for opmizing

peer selecon and hence opmizing the ow of trac. Now, let's have a look at the other

important opons provided by Squid.

Some of our peers may require proxy authencaon for access. For such scenarios, we can

use login opon (login=username:password) to authencate our cache so that it can

use this peer.

The opon login=PASS is used when we want to pass on login details received from the

client to this parcular peer. Proxy authencaon is not a requirement for using this opon.

Also, if Squid didn't receive any authencaon headers from the client but the username and

password are available from external ACL user= and password= result tags, then they may

be sent instead.

If we want to use proxy authencaon on our proxy server as well as with this peer, then

both the proxies must share the same user database, as HTTP allows only a single login

(one for the proxy server and one for the origin server).

The opon login=PASSTHRU is used when we want to forward HTTP authencaon

(Proxy-Authentication and WWW-Authorization) headers to this peer without

any modicaon.

We can use the opon login=NEGOTIATE if this is a personal or a workgroup proxy

server and the parent proxy server requires a secure proxy authencaon. The rst

Service Principal Name (SPN) from the default keytab or dened by the environment

variable KRB5_KTNAME will be used.

connect-timeout

The opon connect-timeout determines the connecon meout to this peer. This can be

dierent for dierent peers. If this opon is not used, then Squid will determine the meout

from the peer_connect_timeout direcve.

connect-fail-limit

The opon connect-fail-limit (connect-fail-limit=N) determines the number

of connecon failures to this peer or neighbor, aer which it will be declared dead or

unreachable. The default value of connect-fail-limit is 10.

Chapter 8

[ 209 ]

max-conn

The number of parallel connecons that can be opened to this peer is determined by the

opon

max-conn (max-conn=N).

name

There may be cases when we have mulple peers on the same host listening on dierent

ports. In such cases, the hostname will not be able to uniquely idenfy a peer or neighbor.

So, we can use the

name (name=STRING) opon to specify a unique name for this peer. Also,

this opon is always set and defaults to either the hostname, the IP address, or the cache

peer. The name specied using this opon is used with direcves like cache_peer_access

and

cache_peer_domain.

proxy-only

Normally, Squid will try to store the responses locally if the requests are cacheable. However,

if the request is fetched from peers in the local area network, then this will result in the

unnecessary waste of disk space on our proxy servers if the responses can be fetched at a very

high speed from our peers. Specifying the proxy-only opon will instruct Squid not to cache

any responses from this peer. Please note that we should use this opon only when the cache

peer is connected with a low latency and high speed connecon to our proxy server.

allow-miss

The client requests are forwarded to a sibling only when they result in hits. We can use the

allow-miss opon to forward cache misses to siblings. We should use this opon carefully

as this may result in forward loops.

Controlling communication with peers

Unil now we have learned about various opons that can be used to congure cache peers or

neighbors as parents or siblings. Now, we'll learn about controlling access to dierent peers

and sending a variety of requests to dierent peers, depending on various rules. Access control

over peer communicaon is achieved via various direcves in the Squid conguraon le. We

have learned about these direcves briey, but we'll explore them in detail now.

Domain-based forwarding

Squid provides a direcve cache_peer_domain, using which we can restrict the domains

for which a parcular peer or neighbor will be referred. The general format for the

cache_peer_domain direcve is:

cache_peer_domain NAME [!]domain [[!]domain] ...

In the preceding format, NAME is the name of the neighbor cache, which will be either the

values of the name opon, the hostname, or the IP address specied while declaring it as

a peer using the cache_peer direcve.

Building a Hierarchy of Squid Caches

[ 210 ]

We can specify any number of domains with the cache_peer_domain direcve, either on

the same line or mulple lines. Prexing a domain name with '!' will result in all domains

being matched except the specied one.

Time for action – conguring Squid for domain-based

forwarding

Let's see an example conguraon:

cache_peer parent.example.com parent 3128 3130 default proxy-only

cache_peer acad.example.com parent 3128 3130 proxy-only

cache_peer video.example.com parent 3128 3130 proxy-only

cache_peer social.example.com parent 3128 3130 proxy-only

cache_peer search.example.com parent 3128 3130 proxy-only

cache_peer_domain acad.example.com .edu

cache_peer_domain video.example.com .youtube.com .vimeo.com

cache_peer_domain video.example.com .metacafe.com .dailymotion.com

cache_peer_domain social.example.com .facebook.com .myspace.com

cache_peer_domain social.example.com .twitter.com

cache_peer_domain search.example.com .google.com .yahoo.com .bing.com

According to the previous conguraon example, the proxy server acad.example.com can

be used to forward requests for the .edu domains only. A cache peer can only be contacted

for the domain names matching the ones specied with cache_peer_domain. If we don't

specify any domain name for a peer (parent.example.com in the above example), then it

can be used to forward all requests.

So, we can see how straight forward it is to paron trac by using some simple rules such

as those in the previous conguraon.

What just happened?

We learned to use the direcve cache_peer_domain to paron the trac or client requests

based on domain names so that the requests can be forwarded to dierent proxy servers.

Cache peer access

Squid provides another direcve named cache_peer_access, which is a more exible

version of cache_peer_domain as we can control request forwarding using powerful

access control lists. The format of the cache_peer_access direcve is as follows:

cache_peer_access NAME allow|deny [!]ACL_NAME [[!]ACL_NAME] ...

Chapter 8

[ 211 ]

The NAME parameter is the same as the one used with cache_peer_domain and species

the name of the cache peer or neighbor. The opons allow or deny will determine whether

a request will be forwarded or not to this cache peer.

Time for action – forwarding requests to cache peers

using ACLs

Let's say we have three parent proxy servers (p1.exmaple.com, p2.example.com, and

p3.example.com). The proxy server p3.example.com is connected to the internet with a

highly reliable, but expensive connecon, with a fair usage policy. The proxy servers p1 and

p2 are cheaper but unreliable. Also, we have three subnets (academic, research, and nance)

on our local area network, according to the following diagram:

Now, let's have a look at the following conguraon:

cache_peer p1.example.com parent 3128 3130 round-robin

cache_peer p2.example.com parent 3128 3130 round-robin

cache_peer p3.example.com parent 8080 3130

acl academic src 192.0.2.0/16

acl finance src 198.51.100.0/16

acl research src 203.0.13.0/16

acl imp_domains dstdomain .corporate.example.com .edu

Building a Hierarchy of Squid Caches

[ 212 ]

acl ftp proto FTP

cache_peer_access p3.example.com deny ftp

cache_peer_access p3.example.com allow research

cache_peer_access p3.exmaple.com allow academic imp_domains

cache_peer_access p3.exmaple.com allow finance imp_domains

cache_peer_access p3.example.com deny academic

cache_peer_access p3.example.com deny finance

As we can see in the previous example, we have allowed request forwarding to the parent proxy

server p3.example.com for the requests originang from the research subnet only. We have

allowed other subnets to access some important domains using the highly reliable connecon

and we have completely disabled the use of this connecon for the FTP protocol. Also, note

that the requests will be forwarded to the proxy server

p3.example.com only when both

p1.example.com and p2.example.com are not reachable. The requests will be forwarded

to the p1.example.com and p2.example.com proxy servers in a round robin fashion.

We can achieve even beer control by using ACL lists of dierent ACL types.

What just happened?

We just explored the power of the cache_peer_access direcve, which in combinaon

with Squid's access control lists, provides a powerful way to forward requests to dierent

peers. We can further improve the request forwarding by using the me-based ACLs along

with

cache_peer_access.

Have a go hero – join a cache hierarchy

Make a list of proxy servers on your network. Add these proxy servers to the Squid

conguraon le and then paron trac to these proxy servers in such a way that the

requests go to one group of servers in the day me, and to a dierent group at night.

Switching peer relationship

As we saw earlier, we have to specify the peer relaonship while adding a peer to our

Squid conguraon le. However, there may be cache peers who can oer to serve cache

misses only for certain domains, while serving cache hits for all domains. The misses and

hits menoned above are corresponding to the ICP, Cache Digest, or HTCP, misses and

hits. An ICP, Cache Digest, or HTCP miss means that the peer does not have the required

object. The peer relaonship switch for certain domains can be achieved using the

neighbor_type_domain direcve in the conguraon le. The following format

uses the neighbor_type_domain direcve:

neighbor_type_domain CACHE_HOST parent|sibling domain [domain] ...

Chapter 8

[ 213 ]

Time for action – conguring Squid to switch peer relationship

For example, let's say we have congured sibling.example.com as a sibling proxy server

but sibling.example.com allows us to forward requests to .edu domains, even if there

are cache misses. So, we can have the following conguraon:

cache_peer parent.example.com parent 3128 3130 default proxy-only

cache_peer sibling.example.com sibling 3128 3130 proxy-only

neighbor_type_domain sibling.example.com parent .edu

In accordance with the previous conguraon, we can fetch cache misses for .edu domains

using

sibling.example.com.

What just happened?

In this secon, we learned to switch the peer relaonship, from sibling to parent,

dynamically for certain domains.

Controlling request redirects

We have just seen a list of direcves, using which, we can use dierent peers to forward

requests based on various parameters. In addion to the previously menoned direcves, Squid

provides a few more direcves, using which we can force certain requests to be forwarded to

remote servers directly or to always pass through peers. Let's have a look at these direcves.

hierarchy_stoplist

We normally use cache peers to increase the cache hit rao, but there are certain requests

which can't be cached as the content served in response to these requests is dynamic and

changes every me it's requested. It's of no use to query our cache peers for such requests.

We can instruct Squid to stop forwarding requests to peers and instead contact the remote

servers directly using the direcve hierarchy_stoplist. The direcve hierarchy_

stoplist

essenally takes a list of words, which if found in a request URL, will mean that

the URL will be handled by this cache and will not be forwarded to any of the neighbors.

hierarchy_stoplist cgi-bin ? jsp

We should note that never_direct overrides

the direcve hierarchy_stoplist.

Building a Hierarchy of Squid Caches

[ 214 ]

always_direct

There may be certain requests which we always want to forward to remote servers instead

of forwarding them to neighbor caches. We can use the direcve

always_direct to direct

or forward such requests directly to remote servers. This is generally helpful in retrieving

the content on the local area network directly because cache peers will introduce

unnecessary delay.

For example, consider the following conguraon:

acl local_domain dstdomain .local.example.com

always_direct allow local_domain

The requests desned to .local.example.com will be sent directly to the corresponding

servers instead of roung them through cache peers.

never_direct

Using the direcve never_direct, we can control the requests which must not be sent

to remote servers directly and must be forwarded to a peer cache. This is generally helpful

when all the packets going to internet must pass through a proxy rewall, which is normally

congured as a default parent.

Let's say we have

firewall.example.com as a rewall proxy which must be used for

forwarding all requests, but we can sll forward all requests for local servers directly. So,

we can have the following conguraon:

cache_peer firewall.example.com parent 3128 3130 default

acl local_domain dstdomain .local.example.com

always_direct allow local_domain

never_direct allow all

The previous conguraon will congure Squid so that all requests to the local servers are

forwarded directly to the desnaon servers and that all external requests pass through the

rewall proxy server firewall.example.com.

prefer_direct

Any requests which are cacheable by Squid are routed via peers so that we can ulize

neighbor caches to improve the average page load me. However, in case we are willing

to forward the cacheable requests directly to remote servers, we can set the value of the

prefer_direct direcve to on. The default value of this direcve is o and Squid will try

to use neighbor caches rst instead of forwarding the requests directly to remote servers.

Chapter 8

[ 215 ]

The direcve prefer_direct modies Squid's behavior only for cacheable

requests. If we want to route all requests through a rewall proxy, we should use

never_direct instead.

nonhierarchical_direct

Non-hierarchical requests are the requests that are either idened by the hierarchy_

stoplist

direcve or can't be cached by Squid. Such requests should not be sent to

cache peers because they will not result in cache hits. Therefore it's a good idea to forward

these requests directly to remote servers. We achieve this behavior by seng the value

of the direcve nonhierarchical_direct to on. If we set this direcve's value to off,

these requests will not be sent to remote servers directly. Please note that although HTTPS

requests are not cacheable, nonhierarchical_direct must be set to off for HTTPS

requests to be relayed through a rewall parent proxy.

It's not recommended to set the value of nonhierarchical_

direct to o. If we want to direct all requests via a rewall

proxy, we should use the never_direct direcve instead.

Have a go hero – proxy server behind a rewall

Congure your proxy server so that it forwards all the requests to a parent proxy server and

never contacts the remote servers directly.

Peer communication protocols

We have learned about conguring Squid to be a part of a cache hierarchy. When many

proxy servers are a part of cache hierarchy, they need to communicate to share informaon

about the objects present in their cache so that neighbors can ulize these cached objects

as hits. For communicaon among peers, Squid implements ICP, HTCP, and Cache Digest

protocols. Later on, we'll have a brief look at ICP, HTCP, and Cache Digest protocols.

Internet Cache Protocol

ICP or Internet Cache Protocol is a simple web-caching protocol used to query proxy

servers (cache peers) about the existence of a parcular object in their cache. Depending

on the replies received from the neighbors, Squid will decide the forwarding path for the

parcular request.

Building a Hierarchy of Squid Caches

[ 216 ]

As we saw in the peer selecon algorithms, ICP is also used to calculate the round trip

me and also for detecng dead peers in a hierarchy. The round trip me calculaon is

an important measure as it can help Squid in dynamically reroung the trac to a less

congested network route.

Although ICP is a simple protocol and it's very easy to congure proxy servers to communicate

with each other using it, ICP also suers from a lot of problems. The rst one is latency. Squid

doesn't know whether an object is present in a peer cache or not. It has to query all the peer

caches for each object, which in some cases (if the number of peers is large), will introduce a

signicant delay as it will take me to query all of them. So, if we have a large number of peers,

there will be a lot of ICP packets oang around on the network which may end up causing

congeson. To get around the congeson issue here, we can use the mulcast ICP protocol.

Other aws in the ICP protocol are false hits, security, and so on. For more details on the ICP

protocol, please visit

http://icp.ircache.net/rfc2186.txt. Another interesng read

about the applicaon of the ICP protocol is at http://tools.ietf.org/html/rfc2187.

The HTCP protocol is recommended over the ICP protocol to avoid problems like latency,

false hits, and so on.

Cache digests

Squid keeps a list of all the cached objects in the main memory in the form of a hash, so that

it can quickly guess whether a URI will result in a hit or miss without actually searching for

the les on disk. Cache digest is a summary of these cached objects into a compact bitmap

using the Bloom Filter data structure (for more informaon on the Bloom Filter, please visit

http://en.wikipedia.org/wiki/Bloom_filter). The value of the bit determines

whether a parcular object is present in the cache or not. If the bit is on or set to 1, the object

is present in the cache, otherwise it's not in the cache. This summary is available to other peers

via a special URL over the HTTP protocol. When a cache digest is retrieved by peers they can

determine, by checking the digest, whether a parcular URI is present in the cache or not.

So, cache digests signicantly reduce the number of packets owing on the network which

are just for querying the other peers, but the total data transfer amount increases as the

cache digests are fetched by all the peers periodically. However, this helps in signicantly

decreasing the delay introduced by ICP queries.

With the cache digest protocol, the problem of false hits get worse as the digest grows older.

The digest is rebuilt only periodically (hourly by default). This also introduces the problems of

false misses. The false misses are for web objects which were cached aer the cache digest

was built.

Chapter 8

[ 217 ]

Squid and cache digest conguration

To be able to use cache digests, we must enable cache digests using the --enable-cache-

digests

opon with the configure program before compiling Squid. Let's have a look at

the cache digest related direcves available in the Squid conguraon le.

Digest generation

It makes sense to generate cache digests only when we plan to use cache digests for peer

communicaon. Therefore, we can use the digest_generation direcve in the conguraon

le to select whether the digest will be generated or not. The possible values for this direcve

can be on or o. By default, this direcve is set to on and Squid generates cache digests.

Digest bits per entry

The data structure, Bloom Filter, which is used to build cache digest, provides a lossy

encoding and there may be false hits even in the cache digests. The direcve

digest_bits_

per_entry

determines the number of bits that will be used for encoding one single entry

or a cached object. The larger value of bits per entry will result in higher accuracy and hence

lesser false hits, but will this consume more space in the main memory and more bandwidth

while transferring over the network. The default value of digest_bits_per_entry is 5

but we can safely push it to 7 for more accuracy if we have a large cache.

Digest rebuild period

We can use the direcve digest_rebuild_period to set the me interval, aer which

the cache digest will be rebuilt. One hour is the default, which will result in a not so up-to-

date cache digest, but rebuilding a cache digest is a CPU-intensive job and this me interval

should be set depending on the hardware capabilies and load on the proxy server. We can

safely set it to 10 or 15 minutes to keep things fresh.

Digest rebuild period implies the me aer which the cache

digest will be rebuilt in memory. This me doesn't imply the

me aer which the cache digest will be wrien to disk.

Digest rebuild chunk percentage

The direcve digest_rebuild_chunk_percentage determines the percentage of the cache

which will be added to the cache digest every me the rebuild roune is called on schedule. The

default behavior is to add 10 percent of the cache to the cache digest every rebuild.

Digest swapout chunk

The amount or number of bytes of the cache digest that will be wrien to the disk at a me

is determined by the direcve digest_swapout_chunk. The default behavior is to write

4096 bytes at a me.

Building a Hierarchy of Squid Caches

[ 218 ]

Digest rewrite period

The digest rewrite period is the me interval aer which the cache digest is wrien to disk,

which then can be served to other peers. We can congure this me interval using the

digest_rewrite_period direcve. Generally, it should be equal to digest the rebuild period.

Hypertext Caching Protocol

HTCP or Hypertext Caching Protocol is similar to ICP but has advanced features and

generally results in beer performance compared to the ICP protocol. Both the ICP and HTCP

protocols use UDP for communicaon and TCP communicaon is oponally allowed for HTCP

for protocol debugging. HTCP has the following advantages over the ICP protocol:

ICP queries include only URI while HTCP queries include full HTTP headers. HTCP also

includes HTTP headers in a request, which helps the server avoid false hits, but would

be true only for a URL key and would be false if more headers are known.

HTCP allows third party replies, using which a peer can inform us about an alternate

locaon of a cached object. ICP doesn't have a similar provision.

HTCP supports monitoring of peers for cache addions or deleons while ICP doesn't.

HTCP uses a variable sized binary message format, which can be used for extending

the protocol, while ICP uses a xed size binary message format rendering ICP to be

very dicult to extend.

HTCP provides oponal message authencaon using shared secret keys while

ICP doesn't.

For more details on the HTCP protocol, please visit

http://tools.ietf.org/html/

rfc2756

Pop quiz

1. Consider the following conguraon and then select the most appropriate answer

from the following selecon:

cache_peer p1.example.com parent 3128 3130 default weight=1

cache_peer p2.example.com parent 3128 3130 default weight=10

cache_peer s1.example.com sibling 3128 3130 default

cache_peer s2.example.com sibling 3128 3130 default

If all siblings are dead, then which parent proxy servers will be used for forwarding

requests?

a. p1.example.com

b. s1.example.com

c. p2.example.com

d. s2.example.com



Chapter 8

[ 219 ]

2. Consider the following Squid conguraon and then select the most appropriate

answer from the following selecon:

cache_peer sibling.example.com sibling 3128 0 no-query no-digest

Which of the following direcves can be used for forwarding all requests, except

requests to local.example.com?

a. cache_peer_domain

b. cache_peer_access

c. Both a and b

d. None

Summary

In this chapter, we have learned about conguring the Squid proxy server to join a

cache hierarchy. We also learned about the various relaonships between cache

peers or neighbors. We also learned about various peer selecon mechanisms for

forwarding requests.

Specically, we covered:

Advantages and disadvantages of joining a cache hierarchy

Various conguraon opons while adding peers

Ways to restrict access to cache peers

Various conguraon direcves to control request forwarding to peers

The protocols used for communicaon among cache peers

In the next chapter, we'll learn about conguring Squid in the reverse proxy mode.



Squid in Reverse Proxy Mode

So far, we have learned to use Squid for caching the requests to various

websites on the Internet, and for hiding a number of clients behind a single or

a hierarchy of proxy servers. The Squid proxy server can also act as an origin

server accelerator in which it accepts normal HTTP requests and forwards the

cache misses to the origin servers. This is commonly known as surrogate mode.

In this chapter, we'll learn about conguring Squid in reverse proxy mode.

In this chapter, we will learn about:

Reverse proxy mode

Conguring Squid as a server surrogate (also known as an accelerator)

Access controls in reverse proxy mode

Example conguraons

So let's get started...



Squid in Reverse Proxy Mode

[ 222 ]

What is reverse proxy mode?

In previous chapters, we have learned to use Squid to cache the web documents locally so

that we can enhance the user experience. This is done by serving the cached web documents

from the proxy server, which is generally on the same local network as the clients. So, we can

visualize this behavior using the following diagram:

As we can see in the previous diagram, we try to cache the responses received from various

web servers on the Internet and then use those cached responses to serve the subsequent

requests for the same web documents. In short, we are using Squid to improve the

performance of our Internet connecon.

Exploring reverse proxy mode

Now, consider the scenario from a point of view of a web server. Let's say that the website

www.example.com is hosted on a web server and there are tens of thousands of clients

browsing the website. So, in the scenario where a website gets way too many visitors, the

web server will be overloaded and we would have to distribute the load by deploying more

servers. We can visualize this situaon using the following diagram:

In the previous diagram, a group of web servers are hosng the domain www.example.com

and serving the responses to the requests from all over the internet.

Chapter 9

[ 223 ]

As we know, most of the content served to clients by a web server hosng a website doesn't

change frequently. For example, the addional les like JavaScript les, CSS style sheets,

and images embedded in a web page, which constute the major part of a web page, don't

change frequently. So, we can introduce a Squid proxy server in a reverse proxy mode (also

known as a surrogate or an accelerator), which will cache the content that doesn't change

frequently. It will also try to help the otherwise overloaded web server by responding to the

majority of requests targeted to the web server, by serving the content from its cache. Let's

have a look at the following diagram:

In the preceding diagram, we placed a Squid proxy server in front of the web servers so that

all the requests to the web servers are passing through the proxy server. Therefore in this

scenario, Squid will be accepng the HTTP requests. Squid will forward all the requests to

the origin web servers, except the requests that it has already cached. So, web servers will

not have to deal with the requests that are already cached on the proxy server. This mode

of Squid is called reverse proxy mode or server accelerator mode.

Now that we have understood the reverse proxy mode, it's me to learn to congure Squid

as a server accelerator.

Conguring Squid as a server surrogate

To congure Squid as a server surrogate, we need to provide the appropriate opons with

various direcves, depending on the requirements. We can congure Squid to act as a

forward proxy and server surrogate at the same me. However, the access control rules must

be wrien very carefully in such cases, which we will cover in our special secon on Access

Control Conguraon for surrogate servers. However, to omit any possible confusion, it's

always beer to have a dedicated instance of Squid for server acceleraon and a separate

instance for the forward proxy.

Squid in Reverse Proxy Mode

[ 224 ]

Also, as Squid will be listening on port 80 to accept HTTP requests, our web server can't

listen on the same IP address as Squid. In this scenario, we have the following opons:

Squid can listen on port 80 on the public IP address and the web server can listen on

port 80 on the loopback (

127.0.0.1) address.

The web server can listen on port 80 on a virtual network interface with an IP

address from the private address space. If the web server and Squid are on dierent

machines, then this is not going to be a problem at all.

HTTP port

As we learned that Squid will be accepng HTTP requests on behalf of the web servers

sing behind it, the most important conguraon direcve is http_port. We need to

set the HTTP port with the appropriate opons. Let's have a look at the general format of

http_port for conguring Squid in the reverse proxy mode.

http_port 80 accel [options]

So, we need to specify a port number, such as 80. Apart from the port, we need to use the

opon

accel, which will tell Squid that port 80 will be used for server acceleraon. Also,

there are addional opons that are required to properly congure Squid so that it can

communicate with the web servers.

Please note that while conguring Squid in surrogate mode, we need to specify at

least one opon from defaultsite, vhost, or vport. We should also note

that the CONNECT requests are blocked from receiving accel agged ports.

HTTP options in reverse proxy mode

Let's have a look at the other opons that can be used with the http_port direcve.

defaultsite

The opon defaultsite (defaultsite=domain_name) species the domain name or site

that will be used to construct the Host HTTP header if it is missing. The domain name here is

the public domain name that a visitor types in his/her browser to access the website.

vhost

If we specify the opon vhost, Squid will support domains hosted as virtualhosts.

vport

To enable IP-based virtual host support, we can use the vport opon. The opon vport can

be specied in the following two ways:



Chapter 9

[ 225 ]

If we specify the vport opon, Squid will use the port from the Host HTTP header. If the

port in the Host header is missing, then it'll use http_port (port) for virtual host support.

If we specify the

vport opon along with the port (vport=PORT_NUMBER), Squid will use

PORT_NUMBER instead of the port specied with http_port.

allow-direct

The direct forwarding of requests is denied in reverse proxy mode, by default, for security

reasons. When we have direct forwarding enabled in reverse proxy mode, a rogue client may

send a forged request with any external domain name in the Host HTTP header and Squid

will fetch and forward the response to the client. This will permit relay aacks. A very strict

access control is required to prevent such aacks when direct forwarding is enabled. If we

want, we can enable direct forwarding by specifying the opon

allow-direct.

protocol

The protocol (protocol=STRING) opon can be used to reconstruct the requests.

The default is HTTP.

ignore-cc

The HTTP requests carry Cache-Control HTTP headers from the clients which determine

whether the cached response should be ushed or reloaded. If we use the opon ignore-

, the Cache-Control headers will be ignored and Squid will serve the cached response

if it's sll fresh.

The following are a few examples showing the usage of

http_port.

http_port 80 accel defaultsite=www.example.com

http_port 80 accel vhost

http_port 80 accel vport ignore-cc

HTTPS port

Let's consider a scenario where we are serving a website or a few pages on a website over an

encrypted secure connecon using secure HTTP or HTTPS. We can outsource the encrypon

and decrypon work to the Squid proxy server, which can handle HTTPS requests. So, when

we congure Squid to accept HTTPS connecons or requests, it'll decrypt the requests and

forward the unencrypted requests to the web server.

Please note that we should use the --enable-ssl opon with the

configure program before compiling, if we want Squid to accept HTTPS

requests. Also note that several operang systems don't provide packages

capable of HTTPS reverse-proxy due to GPL and policy constraints.

Squid in Reverse Proxy Mode

[ 226 ]

HTTPS options in reverse proxy mode

Let's have a look at the syntax for the https_port direcve.

https_port [IP_ADDRESS:]port accel cert=certificate.pem [key=key.pem]

[options]

In the preceding conguraon line, the IP_ADDRESS to which Squid will bind to can be

oponally specied. The opon

port determines the port on which Squid will listen for

HTTPS requests.

The

cert parameter is used to specify the absolute path to either the SSL cercate le

or an OpenSSL-compable combined cercate and private key le. The

key parameter

is oponal and is used to specify the absolute path to the SSL private key le. If we don't

specify the

key parameter, Squid will assume the le specied by the cert parameter

as a combined cercate and private key le.

Please note that we should have OpenSSL installed on our system. Please check

http://www.openssl.org/ for more informaon on OpenSSL. It is also

recommended to keep an eye on the latest OpenSSL vulnerabilies and to apply

the patches as soon as they are available at http://www.openssl.org/

news/vulnerabilities.html.

Let's have a quick look at the other opons available with the https_port direcve.

defaultsite

The opon defaultsite (defaultsite=domain_name) can be used to specify the

default HTTPS website which should be used in case HTTP Host header is missing.

vhost

Idencal to http_port, the vhost opon can be used to support virtually-hosted domains.

Please note that if the vhost opon is used, the cercate specied should be either a

wildcard cercate or one that is valid for more than one domain.

version

The opon version (version=NUMBER) can be used to specify the version of the SSL/TLS

protocols which we need to support. The following are the possible values of

version:

1: Automac detecon. This is the default value.

2: SSLv2 only.

3: SSLv3 only.

4: TLSv1 only.



Chapter 9

[ 227 ]

cipher

We can specify a colon separated list of supported ciphers using the cipher

(

cipher=COLON_SEPARATED_LIST) opon. Please check the man page for ciphers(1)

or visit http://www.openssl.org/docs/apps/ciphers.html for more informaon

on ciphers supported by OpenSSL. Please note that this list of ciphers is directly passed to

OpenSSL libraries and we should check the availability of ciphers for our version of OpenSSL

before specifying them.

options

We can specify various SSL engine-specic opons in the form of a colon separated list using

the options (options=LIST) parameter. Please check the SSL_CTX_set_options(3)

man page or visit http://www.openssl.org/docs/ssl/SSL_CTX_set_options.html

for a list of supported SSL engine opons. Please note that these opons are directly passed to

OpenSSL libraries and we should check the availability of these opons for our OpenSSL version.

clientca

The opon clientca (clientca=FILE) is used to specify the absolute path to the le

containing a list of Cercate Authories (CAs) to be used while requesng a client cercate.

cale

We can specify a le containing addional CA cercates using the opon cafile

(cafile=FILE), which can be used to verify client cercates.

capath

The opon capath (capath=DIRECTORY) is used to specify the absolute path to a directory

containing addional cercates and CRL (Cercate Revocaon List) lists that should be

used while verifying the client cercates.

Please note that if we don't specify the clientca, cafile, or capath

opons, then SSL library defaults will be used.

crlle

The opon crlfile (crlfile=FILE) is the absolute path to a le containing addional

CRL lists, which should be used to verify client cercates. These CRL lists will be used in

addion to the CRL lists stored in capath.

dhparams

We can specify a le containing DH parameters for DH key exchanges using the opon

dhparams (dhparams=FILE). For more informaon on DH parameter generaon, please

check the dhparam(1) man page or visit http://www.openssl.org/docs/apps/

dhparam.html

Squid in Reverse Proxy Mode

[ 228 ]

sslags

Using the opon sslflags (sslflags=LIST_OF_FLAGS), we can specify one or more

ags, which will modify the usage of SSL. Let's have a look at the available ags:

NO_DEFAULT_CA

If the ag NO_DEFAULT_CA is used, the default CA lists built in OpenSSL will not be used.

NO_SESSION_REUSE

When NO_SESSION_REUSE is used, every new connecon will be a new SSL connecon and

no connecon will be reused.

VERIFY_CRL

The CRL lists contained in the les specied using crlfile or capath opons will be used

to verify the client cercates before accepng them, if the VERIFY_CRL ag is used.

VERIFY_CRL_ALL

If we use the VERIFY_CRL_ALL ag, then all the cercates in the client cercate chain

will be veried.

sslcontext

Using the opon sslcontext (sslcontext=ID) we can set the SSL session ID

context idener.

vport

The opon vport is used to enable the IP-based virtual host support. Its usage is idencal to

the vport opon in the http_port direcve.

Let's see a few examples showing the usage of the

https_port direcve:

https_port 443 accel defaultsite=secure.example.com cert=/opt/squid/

etc/squid_combined.pem sslflags=NO_DEFAULT_CA

https_port 443 accel vhost cert=/opt/squid/etc/squid.pem key=/opt/

squid/etc/squid_key.pem

Have a go hero – exploring OpenSSL

Try to read the OpenSSL documentaon for generang various cercates and private keys.

Chapter 9

[ 229 ]

Adding backend web servers

So far, we learned about conguring Squid to accept HTTP or HTTPS connecons on behalf of

our web servers. Once Squid has received a HTTP or HTTPS request, it needs to forward it to

a web server so that it can fetch content from the web server. It will then pass that content

back to the client requesng the content. So, we need to tell Squid about our backend

servers to which it will connect to sasfy the client requests. We can add one or more web

servers using the cache_peer direcve in the Squid conguraon.

Cache peer options for reverse proxy mode

Let's have a look at the opons for the cache_peer direcve specically meant for Squid in

reverse proxy mode.

originserver

If the opon originserver is used with a cache peer, Squid will treat it as an origin

web server.

forcedomain

The forcedomain (forcedomain=domain_name) can be used to congure Squid to always

send the host header with the domain name specied. This opon is generally used to x

broken origin servers which are publicly available over mulple domains. This opon should

be avoided if the origin server is capable of handling the mulple domains.

Time for action – adding backend web servers

We learned about cache_peer in detail in the previous chapter and previously, we saw two

opons specically meant for Squid in reverse proxy mode. Now, let's see a few examples

showing the usage of the

cache_peer direcve to add backend web servers.

cache_peer 127.0.0.1 parent 80 0 no-query no-digest originserver

cache_peer local_ip_of_web_server parent 80 no-query originserver

forcedomain=www.example.com

What just happened?

We learned to add backend web servers in our Squid conguraon le as cache peers or

neighbors so that Squid can forward them the requests which it receives from clients.

Squid in Reverse Proxy Mode

[ 230 ]

Support for surrogate protocol

The requests and responses for a web document may pass through a series of server

surrogates (reverse proxies or origin server accelerators) and forward caching proxies. While

the server surrogates are used for scaling individual or a group of websites, the forward proxies

are used to provide a beer browsing experience by caching the content locally. The server

surrogates act on behalf of the origin server and they act with the same authority as the origin

server. So, we need a dierent cache control mechanism or a dierent way to control these

server surrogates to achieve higher performance while maintaining the data accuracy.

Surrogate protocol extensions to the HTTP protocol provides a way to assign controls to server

surrogates, which can be dierent from controls assigned to the intermediary forward proxies

or the HTTP clients. Now, we'll explore the surrogate protocol and a few related aspects.

Understanding the surrogate protocol

Let's see how the surrogate protocol works and how the surrogate capabilies and controls

are passed using HTTP header elds.

When a surrogate receives a request, it builds a request which will look similar to

the following:

GET / HTTP/1.1

...

Surrogate-Capability: mirror.example.com="Surrogate/1.0"

...

Noce the special header eld Surrogate-Capability. The Squid proxy server is

adversing itself as a surrogate (mirror.example.com). Now this request will be

forwarded to the origin web server.

Upon receiving the request from a surrogate, the web server will construct a

response with the appropriate surrogate control HTTP header, shown in the

following example:

HTTP/1.1 200OK

...

Cache-Control: no-cache, max-age=1800, s-max-age=3600

Surrogate-Control: max-age=43200;mirror.example.com

...

Let's see what controls are being passed by the origin server to surrogates,

forward proxies, and HTTP clients. The end clients (HTTP clients) can store the

response for a maximum me of half an hour, as determined by Cache-Control:

max-age=1800

. The forward proxies on the way can store the response for an hour,

as determined by Cache-Control:s-max-age=3600. The surrogate known by

the idencaon token mirror.example.com can store the response for half a

day, as dened in Surrogate-Control: max-age=43200.



Chapter 9

[ 231 ]

So, as we can see from the previous examples, the surrogate protocol extensions to

HTTP have facilitated the dierent controls for the HTTP clients, the intermediary forward

proxies and server surrogates. For more details on the surrogate protocol, please visit

http://www.w3.org/TR/edge-arch.

Conguration options for surrogate support

We have two direcves in the Squid conguraon le related to surrogate protocol. Let's

have a look at these direcves.

httpd_accel_surrogate_id

All server surrogates need an idencaon token, which is sent to origin servers so that

they can send appropriate controls to a surrogate gateway. This idencaon token can be

unique for a surrogate or can be shared among a cluster of proxy servers, depending on the

gateway design.

The default value of this idencaon token is the same as the value of

visible_

hostname

. To set it to a dierent value, we can use the httpd_accel_surrogate_id as

shown in the following example:

httpd_accel_surrogate_id mirror1.example.com

The previous conguraon line will set the surrogate ID to mirror1.example.com.

httpd_accel_surrogate_remote

The remote surrogates (such as those in a Content Delivery Network or CDN) honor the

Surrogate-Control: no-store-remote direcve in the HTTP header, which means that

the response should not be stored in cache. Such a response can only be sent in a reply to

the original request. We can adverse our proxy server as a remote surrogate by seng the

direcve http_accel_surrogate_remote to on, which is shown in the following example:

http_accel_surrogate_remote on

We should only set this opon to on when our proxy server is two or more hops away from

the origin server.

Support for ESI protocol

ESI or Edge Side Includes is an XML-based markup language that can facilitate the dynamic

assembling of HTML content at the edge of the Internet or near the end user. ESI is designed

for processing on surrogates capable of processing the ESI language. Its capability token is

ESI/1.0. The following are a few advantages of the ESI protocol:

Allows surrogates to cache parts of web documents, which result in a beer

HIT rao.

Reduces processing overhead on the origin servers as the resource assembling can

be performed by the surrogates or HTTP clients themselves.



Squid in Reverse Proxy Mode

[ 232 ]

Enhances the availability of content.

Improves the performance for the end user as content can be fetched from

mulple caches.

For more informaon on the ESI protocol and the ESI language, please visit

http://www.akamai.com/html/support/esi.html.

Conguring Squid for ESI support

To enable ESI support, we need to use the --enable-esi opon with the congure

program before compiling Squid. If Squid is built with ESI, then we can use the

esi_parser

direcve in the Squid conguraon le to choose the appropriate parser for ESI markup.

We can use the

esi_parser direcve, as shown in the following example:

esi_parser libxml2

This conguraon line will set libxml2 as the parser for ESI markup. We can choose a

parser from libxml2, expat, or custom. The default parser is custom.

We should note that ESI markup is not strictly XML compable. The

custom ESI parser

provides higher performance compared to the other two, however it can't handle non

ASCII character encoding, which may result in unexpected behavior.

Logging messages in web server log format

When we use Squid in reverse proxy mode, most of our web server log messages will go

missing as the requests which can be sased from Squid's cache will never make it to the

web server. So, Squid's access log will be our source of web server logs now. However, the

problem is that, by default Squid's access log format is completely dierent to the log format

used by most web servers. To get rid of this problem, we can use the common log format with

the access_log direcve and this will allow Squid to start logging messages in the Apache

web server log format.

Ignoring the browser reloads

Most browsers have a reload buon, which if used, sets the Cache-Control HTTP header

to no-cache. This will force Squid to purge the cached content and fetch it from the origin

server even if the content in the cache was sll valid, which results in a waste of resources.



Chapter 9

[ 233 ]

Time for action – conguring Squid to ignore the

browser reloads

There are three ways to x this issue using http_port and refresh_pattern direcves

in the Squid conguraon le. Please note that the

refresh_pattern rules apply to both

server and client headers and can pose issues if we ignore certain headers and clients may

receive stale replies.

Using ignore-cc

We have seen the ignore-cc opon in the HTTP port secon previously. If we use this

opon while specifying the HTTP port, Squid will ignore the Cache-Control HTTP header

from clients and will completely depend on the Cache-Control headers supplied by the

backend web servers. For example:

http_port 80 accel defaultsite=example.com vhost ignore-cc

Using ignore-reload

Using the opon ignore-reload with the refresh_pattern direcve, we can

completely ignore the browser reloads and serve the content from cache anyway. However,

this may result in serving stale content in some cases. For example:

refresh_pattern . 0 20% 4320 ignore-reload

Using reload-into-ims

If we don't want to completely ignore the browser reloads using the previously explained

ignore-reload opon, we can use the reload-into-ims opon. This will downgrade

the reload into an IfModifiedSince check, allowing less bandwidth to be wasted while

retaining the data accuracy. For example:

refresh_pattern . 0 20% 4320 reload-into-ims

What just happened?

We learned about three available opons using which we can congure Squid to properly

handle the reloads forced by the browser reload buon.

Access controls in reverse proxy mode

When Squid is congured in reverse proxy mode or our proxy server is acng as a surrogate,

it'll be accepng requests from all over the Internet. In this case, we can't form a list of

clients or subnets to allow access to HTTP via our proxy server, as we did in forward proxy

mode. However, we'll have to make sure that our proxy server doesn't accept requests for

origin servers that we are not accelerang.

Squid in Reverse Proxy Mode

[ 234 ]

We should note that we'll have to be clever while construcng access rules when we are

using the same Squid instance for reverse proxying as well as forward proxying. We'll have

to allow access to foreign origin servers so that our clients can access foreign websites using

our proxy server. Later, we'll have a look at the access control conguraon for various types

of setups.

Squid in only reverse proxy mode

When we have congured Squid to work only as a reverse proxy, we need to restrict access

to the origin server which we are accelerang. Let's say, the origin servers congured for

our proxy servers are

www.example.com and www.example.net, then we can have the

following access control rules:

acl orign_servers dstdomain www.example.com www.example.net

http_access allow origin_servers

http_access deny all

The preceding conguraon will allow all requests desned for www.example.com and

www.example.net.

Please note that these access control rules should be above Squid's default

access controls, otherwise requests from clients will be denied by the default

Squid access controls.

Squid in reverse proxy and forward proxy mode

When Squid is congured to operate in reverse proxy and forward proxy modes

simultaneously, we need to be careful while designing our access controls. We need to keep

the following points in mind:

Clients using our proxy server as a forward proxy should be able to access all the

websites, except the ones that we have blacklisted.

Squid should accept all the requests desned to the origin servers we are

accelerang, except the requests from the clients that we have blacklisted.

Let's say our clients in subnet

192.0.2.0/24 will be using our proxy server as a forward

proxy and they are allowed to access all the websites except www.example.net. We can

therefore, write the access rules as follows:

acl our_clients src 192.0.2.0/24

acl blacklisted_websites dstdomain www.example.net

http_access allow our_clients !blacklisted_websites

http_access deny all



Chapter 9

[ 235 ]

Now, let's say we have congured our proxy server to accelerate the origin server www.

example.com

. However, we have found some suspicious acvity on our origin server from

the subnet 203.0.113.0/24 and don't want these visitors to access our website. So, we

can have the following access rules:

acl origin_servers dstdomain www.example.com

acl bad_visitors src 203.0.113.0/24

http_access allow origin_servers !bad_visitors

http_access deny all

We can combine the preceding two conguraons into one conguraon as follows:

# ACLs for Forward Proxy

acl our_clients src 192.0.2.0/24

acl blacklisted_websites dstdomain www.example.net

# ACLs for Reverse Proxy

acl origin_servers dstdomain www.example.com

acl bad_visitors src 203.0.113.0/24

# Allow local clients to access allowed websites

http_access allow our_clients !blacklisted_websites

# Allow visitors to access origin servers

http_access allow origin_servers !bad_visitors

# Deny access to everyone else

http_access deny all

The preceding conguraon will allow our local clients to access all websites except

www.example.net. Also, it'll allow all visitors (except from the subnet 203.0.113.0/24)

to access our origin server www.example.com. We can extend this conguraon according

to our environment.

Example congurations

Let's have a look at a few common examples of Squid in reverse proxy mode. For the access

control conguraon for the following examples, please refer to the secon on access

controls in reverse proxy mode.

Squid in Reverse Proxy Mode

[ 236 ]

Web server and Squid server on the same machine

In this example, we'll write the Squid conguraon for accelerang a web server hosng

www.example.com. As we will run Squid and the web server on the same machine, we must

ensure that the web server is bound to the loopback address (

127.0.0.1) and listening on

port 80. Let's write this conguraon.

http_port 192.0.2.25:80 accel defaultsite=www.example.com

cache_peer 127.0.0.1 parent 80 0 no-query originserver name=example

cache_peer_domain example .exmaple.com

In the rst conguraon line of the previous example, we have congured Squid to bind

to the IP address 192.0.2.25 and it'll listen on port 80 where it will be accepng visitor

requests on behalf of our origin web server. In the second line, we have added 127.0.0.1

(port 80) as a cache peer where our web server is listening for requests. In the last

conguraon line, we are allowing a cache peer named example to be used for fetching only

example.com and its sub-domains.

Accelerating multiple backend web servers hosting one website

In this example, we have three servers with IP addresses 192.0.2.25, 192.0.2.26,

and 192.0.2.27, which host the same website www.example.com. All web servers are

listening on port 80. Squid is hosted on a dierent machine with a public IP address and

www.example.com points to the public IP address of the Squid server. Let's see an example

of this conguraon:

http_port 80 accel defaultsite=www.example.com

cache_peer 192.0.2.25 parent 80 0 no-query originserver round-robin

name=server1

cache_peer 192.0.2.26 parent 80 0 no-query originserver round-robin

name=server2

cache_peer 192.0.2.27 parent 80 0 no-query originserver round-robin

name=server3

cache_peer_domain server1 .example.com

cache_peer_domain server2 .example.com

cache_peer_domain server3 .example.com

As we have used the round-robin opon with cache_peer in the preceding conguraon,

this will also load balance the requests between the three web servers.

Chapter 9

[ 237 ]

Accelerating multiple web servers hosting multiple websites

In this example, we have example.com and its sub-domains hosted on 192.0.2.25,

example.net and its sub-domains hosted on 192.0.2.26, and example.org and its

sub-domains hosted on

192.0.2.27. We have a Squid server installed on a dierent

machine with a public IP address and all the domains (

example.com, example.net,

example.org, and their sub-domains) point to the public IP address of the Squid server.

The following is an example of such a conguraon:

http_port 80 accel vhost defaultsite=www.example.com ignore-cc

cache_peer 192.0.2.25 parent 80 0 no-query originserver name=server1

cache_peer 192.0.2.26 parent 80 0 no-query originserver name=server2

cache_peer 192.0.2.27 parent 80 0 no-query originserver name=server3

cache_peer_domain server1 .example.com

cache_peer_domain server2 .example.net

cache_peer_domain server3 .example.org

Note that we can't use the round-robin opon with the cache_peer direcve here

because dierent web servers are hosng dierent domains. We have also restricted request

forwarding using the cache_peer_domain direcve so that we contact only the relevant

web server for forwarding the requests.

Have a go hero – set up a Squid proxy server in reverse proxy mode

Try to set up a Squid proxy server in reverse proxy mode as a server accelerator for your

website on the same server as web server.

Pop quiz

1. When the ignore-cc opon is used while specifying http_port as follows:

http_port 80 accel vhost ignore-cc

What will happen when a client clicks on the reload buon in the browser?

a. Squid will not receive the Cache-Control HTTP headers.

b. The ignore-cc opon doesn't aect client requests.

c. Squid will ignore the Cache-Control HTTP header from the request.

d. The backend web server will ignore the Cache-Control HTTP header.

Squid in Reverse Proxy Mode

[ 238 ]

2. Consider the following conguraon:

http_port 80 accel defaultsite=www.example.com

cache_peer 192.0.2.25 parent 80 0 no-query originserver

forcedomain=example.com name=example

What will the contents of the Host HTTP header sent to the backend web server

when a client requests http://www.example.com/ be?

a. www.example.com

b. example.com

c. example

d. 192.0.2.25

Summary

In this chapter, we learned about Squid's reverse proxy mode, which can be used to share

the load of a very busy web server or a cluster of web servers. We also learned about the

various conguraon opons to congure Squid in reverse proxy mode.

Specically, we covered:

What is a web server accelerator and how does Squid t in this model.

Conguring Squid to accept HTTP and HTTPS requests from clients on behalf of our

web servers.

Adding backend web servers to Squid so that it can forward requests to origin

servers appropriately.

We also saw a few conguraon examples in which we tried to accelerate various

web servers hosng dierent websites.

In the next chapter, we'll learn about conguring Squid in intercept mode.



Squid in Intercept Mode

In previous chapters, we have learned about using Squid in the forward proxy

and accelerator or reverse proxy modes. In this chapter, we are going to learn

about conguring Squid in the intercept (or transparent) mode. We'll learn

about Squid's behavior in the intercept mode and also the basic conguraon

required for achieving intercepon caching.

In this chapter, we shall discuss:

Intercepon caching

Advantages of running Squid in the intercept mode

Problems with the intercept mode

Diverng HTTP trac to Squid

Implemenng intercepon caching

So let's get started...

Interception caching

When the requests from clients are intercepted by a proxy server or are redirected to one

or more proxy servers, without conguring the HTTP clients on the client machines or

without the knowledge of clients, it's known as intercepon proxying. As proxying is mostly

accompanied by caching, it's known as intercepon caching. Intercepon caching is also

known by several other names, such as, transparent caching, cache redirecon, and so

on. Squid can be congured to intercept requests from clients so that we can leverage the

benets of caching without explicitly conguring each one of our clients.



Squid in Intercept Mode

[ 240 ]

Time for action – understanding interception caching

Intercepon caching is generally implemented by conguring a network device (router

or switch) on our network perimeter to divert client requests to our Squid server. Other

components that need to be congured include packet ltering soware on the operang

system running Squid, and nally Squid itself. First of all, let's see how the intercepon

of requests occurs:

1. A client requests a web page http://www.example.com/index.html.

2. First of all, the client needs to resolve the domain name to determine the IP address,

so that it can connect to the remote server. Next, the client contacts the DNS server

and resolves the domain name www.example.com to 192.0.2.10.

3. Now, the client iniates a TCP connecon to 192.0.2.10 on port 80.

4. The connecon request in the previous step is intercepted by the router/switch and

is directed to the Squid proxy server instead of sending it directly to a remote server.

5. On the Squid proxy server, the packet is received by the packet ltering tool, which

is congured to redirect all packets on port 80 to a port where Squid is listening.

6. Finally the packet reaches Squid, which then pretends its the remote server and

establishes the TCP connecon with the client.

7. At this point, the client is under the impression that it has established a connecon

with the remote server when it's actually connected to the Squid server.

8. Once the connecon is established, the client sends a HTTP request to the remote

server asking for a specic URL (/index.html in this case).

9. When Squid receives the request, it then pretends to be a client and establishes

a connecon to the remote server, if the client request can't be sased from the

cache, and then fetches the content the client has requested.

So, the idea is to redirect all our HTTP trac to the Squid server using router/switch and

host-based IP packet ltering tools.

What just happened?

We just learned how HTTP packets ow from clients to a router or a switch, which redirects

these packets to the server running Squid. We also saw how these packets are redirected

to Squid by the IP ltering tools on the server and nally how Squid reconstructs clients'

requests using the HTTP headers.

Chapter 10

[ 241 ]

Advantages of interception caching

There are several advantages to using Squid in the intercept mode instead of the normal

caching mode. A few of them are as follows:

Zero client conguration

As we discussed previously, we don't need to congure HTTP clients at all, as all the request

redirecon magic is performed by the switch and routers. This is one of the most prominent

reasons for using intercepon caching in networks where we have thousands of clients,

and it's not possible to congure each and every client to use the proxy server.

Better control

As a user cannot congure their HTTP clients to bypass a proxy server, it's easy to enforce

network usage policies as only administrators can control network devices and the Squid

proxy servers. However, the policies can sll be bypassed by clients using tunnels or using

specially designed soware.

Increased reliability

We can congure our router or switch to forward the client requests directly to the internet

in case our Squid proxy server goes down, which will mean that clients can sll access the

internet without any problems. This results in beer upme and increased reliability.

These few advantages are the reasons for the popularity of intercepon caching among

organizaons with a large number of clients and a requirement for higher upme.

Problems with interception caching

Although intercepon caching is aracve and there are a few advantages as well, it has got

some serious disadvantages, which can make it painful to manage or debug if something

goes wrong. Let's have a look at a few disadvantages of intercepon caching:

Violates TCP/IP standards

The routers or switches in a network are supposed to forward packets to the hosts to which

they are desned. Diverng packets to proxy servers violates the TCP/IP standards. Also, the

proxy server accepts TCP/IP packets which are not desned for it, which is another violaon

of the TCP/IP standards.

The proxy server oen has a dierent OS to the client, which confuses the end-to-end

packet management outside of the HTTP packets. Which in turn can cause servers and

the remote networks to become completely inaccessible or the transfer rates may drop

down considerably.

Squid in Intercept Mode

[ 242 ]

Susceptible to routing problems

Intercepon caching relies on stable routed paths and the diversion of the trac to caching

proxies by a router or a switch. As routes or network paths are determined dynamically,

requests may ow via a dierent path, which may not have a router that will divert the trac

to a caching proxy and a user's session will be lost. Also, somemes the replies may not

return to a proxy server, resulng in long meouts and unavailable websites.

No authentication

Proxy authencaon doesn't work as browsers and HTTP clients are not aware that they are

connected to remote servers via a proxy server, and will refuse to send credenals to the

unknown middleware. The IP-based authencaon doesn't work because the proxy server is

iniang connecons on behalf of all the clients and the remote server thinks that only one

client is trying to access the website.

Supports only HTTP interception

Squid can intercept only HTTP trac as the HTTP request contains Host header and Squid

can fetch content on behalf of the client using the Host and other HTTP headers. It can't

intercept other protocols, as it will not be able to process them.

Client exposure

Since we will be able to intercept only HTTP trac, clients will sll need to go on the

internet directly to make DNS queries or use other protocols like HTTPS, FTP, and so on.

So, essenally we'll have all our clients exposed on the internet, which is not desirable in

most networks.

IP ltering

Intercepon caching is incompable with IP ltering, which prevents IP address spoong. We

must create excepons in our network devices to allow address spoong.

Protocol support

Although this is not a major issue with the modern browsers and newer versions of the

legacy browsers, it may be a signicant problem with older browsers supporng only

HTTP/1.0 (or older) or buggy HTTP clients. As we learned previously, that Squid in intercept

mode totally depends on the Host HTTP header supplied within the HTTP request by the

client, if a client doesn't send the HTTP header, Squid will have no idea what to do with the

HTTP request.

The protocol support problem may be present the other way around as well. This occurs

when the client uses a HTTP feature, which is sll not implemented in Squid or if Squid

doesn't know how to handle the feature. For example, chunked encoding HTTP/1.1 was

not supported by Squid 3.0 or earlier and hence cannot be intercepted.

Chapter 10

[ 243 ]

Security vulnerabilities

We learned that Squid in intercept mode is totally dependent on the Host HTTP header

supplied by the clients. The

Host header can be easily forged by malware or rouge

applicaons to poison our proxy server's cache, which can result in the spread of the

poisoned (cached) content across the whole network.

So, as we can see, there are a lot of disadvantages of using intercepon caching, but it's up

to us to analyze our network and see if the advantages outweigh the disadvantages. Please

also have a look at other alternave soluons such as, Web Proxy Auto-Discovery Protocol

(WPAD,

http://en.wikipedia.org/wiki/Web_Proxy_Autodiscovery_Protocol),

Proxy auto cong (PAC,

http://en.wikipedia.org/wiki/Proxy_auto-config), and

Capve portal (

http://en.wikipedia.org/wiki/Captive_portal).

Have a go hero – interception caching for your network

Based on the advantages and disadvantages of intercepon caching we saw previously,

check if it will be benecial to implement intercepon caching in your network. Also, check

whether you'll be using a router or switch to divert trac to the Squid server.

Diverting HTTP trafc to Squid

We learned in previous secons that we need to divert all HTTP trac from our clients to our

proxy server. Later, we'll have a look at the ways in which we can divert HTTP trac to our

Squid proxy server.

Using a router's policy routing to divert requests

If we have an arrangement where all our client requests are passing through a router, we can

ulize the router's ability to divert the packets, to redirect them to our Squid proxy server.

Therefore if we set our router's policy to redirect all the packets with port 80 to the Squid

server and all other trac is sent to the internet directly, it will look like the following diagram:

Squid in Intercept Mode

[ 244 ]

In the previous diagram, we can see that the router is passing all the HTTP requests to the

Squid proxy server and all the non-HTTP trac is going to the internet directly. A router

can only modify the IP address of a packet. So, we must congure an IP packet ltering tool

(iptables, ipfw) to redirect trac on port 80 to the port on which Squid is listening.

Using rule-based switching to divert requests

We can also use a Layer 4 (L4) or Layer 7 (L7) switch to divert HTTP requests from our clients

to the Squid proxy server, as shown the in the following diagram:

In the previous diagram, we can see that the switch is passing HTTP trac to the Squid proxy

server based on rules congured in the switch. All the non-HTTP trac is directly forwarded

to the internet.

Using Squid server as a bridge

In this scenario, the machine running Squid proxy server also acts as a gateway to the internet

for all the clients. So, all the packets or requests to remote servers pass through the Squid

server. The IP packet ltering tool can be congured to redirect all HTTP trac to the Squid

process and all the non-HTTP trac can be forwarded to the Internet directly.

Chapter 10

[ 245 ]

In the preceding diagram, we can see that we are not using any switch or router to direct

HTTP trac to the Squid server. Instead, all the trac is passing through the Squid server

and the iptables direct HTTP trac to the Squid process, and passes the rest to the router

connected to internet. This is the easiest of the three ways to achieve intercepon caching as

we don't have to congure our router or switch, which is generally a relavely complex task.

Using WCCP tunnel

Web Cache Coordinaon Protocol (WCCP) is a protocol developed by Cisco to route content

with a mechanism to redirect trac in real-me. We can ulize WCCP in the absence of a

Layer4 switch . It is somemes preferred over Policy-based Roung as it allows mulple

proxy servers to parcipate compared to Policy-based Roung, which allows only one server.

WCCP has built-in features such as scaling, load balancing, fault tolerance, and failover

mechanisms.

When using WCCP, a GRE (Generic Roung Encapsulaon) tunnel is established between the

router and the machine running the Squid proxy server. The redirected requests from the

router are encapsulated in GRE packets and sent to the proxy server through the GRE tunnel.

The job of decapsulang the GRE packets and redirecng them to Squid is done by the host

machine using iptables. Then Squid will either fetch the content from the original server or

pull it from the cache and deliver the content back to the router. The router then sends the

response to the HTTP client. For conguring various Cisco devices, the host operang system,

and Squid to use WCCP, please visit

http://wiki.squid-cache.org/Features/Wccp2.

Implementing interception caching

Aer going through the advantages and disadvantages of intercepon caching, if we choose

to go with the intercepon caching, then as described previously, we need to congure three

dierent components to implement intercepon caching. We need to congure a network

device (not needed if we are using the Squid server as a gateway or bridge), the IP ltering

tool (iptables, ipfw, and so on.) on a server running Squid and then Squid itself. Let's have

a quick look at the conguraon of the dierent components.

Conguring the network devices

If we are using a network device to divert trac to our Squid proxy server, then we need to

congure the network device so that it can idenfy all the HTTP trac and redirect it to our

Squid proxy server. As dierent routers and switches have dierent conguraon tools, please

refer to the documentaon or instrucon manual for the router or switch which is going

to divert the trac.

Squid in Intercept Mode

[ 246 ]

Conguring the operating system

Once the packets or HTTP requests reach our machine running Squid, they'll have a

desnaon port 80. Now we need to congure an IP ltering tool, which goes by the

dierent names of dierent operang systems, to divert these packets to the port where

Squid is congured to listen. We should note that the port on which Squid is listening, is used

between the ltering tool and Squid. So, we should rewall this port from external access.

However, before conguring that, we need to congure our operang system to accept

packets that are not desned to it. This is because the packets diverted by the routers or

switches will have a desnaon IP of the remote server. These packets will be dropped

immediately by the kernel because the desnaon IP address doesn't match the address

of any of the interfaces. We need to use the IP forwarding feature in the kernel so that our

server can accept packets that are not desned to it.

Enabling IP forwarding

There are dierent ways to enable IP forwarding for dierent operang systems. Let's have

a look at few of them:

Time for action – enabling IP forwarding

1. To enable IP forwarding on Linux-based operang systems, we can use any of the

following methods.

Using the

sysctl command:

sysctl -w net.ipv4.ip_forward=1

This method doesn't need a reboot and will enable IP forwarding on the y but will

not be preserved aer a reboot.

Using the

sysctl conguraon le, we can add the following line in the

/etc/sysctl.conf le:

net.ipv4.ip_forward = 1

2. To enable the changes made to the /etc/sysctl.conf le, we need to run the

following command:

sysctl -p /etc/sysctl.conf

These changes will be preserved aer a reboot.

3. Enabling IP forwarding on BSD operang systems is almost similar. We can use any

of the following methods:

Using the

sysctl command.

sysctl -w net.inet.ip.forwarding=1

Chapter 10

[ 247 ]

This method doesn't need a reboot and will enable IP forwarding on the y but will

not be preserved aer a reboot. Please note that we don't need the -w opon on

OpenBSD and DragonFlyBSD.

We can add the following line in the

/etc/rc.conf le:

gateway_enable="YES"

4. To enable these changes made to the /etc/rc.conf le, we need to reboot our

server. The changes made will be preserved aer further reboots. Note that we

don't need to perform this step for OpenBSD.

What just happened?

We learned about various commands and methods, using which we can enable IP forwarding

on our operang system so that it can accept packets which are not desned for it.

For other operang systems, please check the respecve instrucon manual.

Redirecting packets to Squid

Once we have enabled our operang system to accept packets on behalf of others, we'll

start geng packets diverted by the router or switch. Now, we need to get these packets

to our Squid process. For this we need to congure iptables (Linux), ipf/ipnat/ipfw

(BSD variants) to redirect the packets which we have received on port 80 to port 3128.X.

Time for action – redirecting HTTP trafc to Squid

Let's have a quick look at the conguraon we need to perform. For the following, we'll

assume that the IP for the Squid proxy server is 192.0.2.25.

1. Working with Linux:

To redirect trac desned to port 80, we can use

iptables as follows:

iptables -t nat -A PREROUTING -s 192.0.2.25 -p tcp --dport 80 -j

iptables -t nat -A PREROUTING -p tcp --dport 80 -j DNAT --to-

destination 192.0.2.25:3128

iptables -t nat -A POSTROUTING -j MASQUERADE

In the previous list of commands, the rst command prevents the redirecng HTTP

trac from the Squid server itself. If we don't have the rst line in place, we'll face

forwarding loops and requests will not be sased. The second command captures

all the trac on port 80 and redirects it to the IP address to which Squid is bound

and port 3128 where Squid is listening. The last command allows Network Address

Translaon (NAT, for more details, please check http://en.wikipedia.org/

wiki/Network_address_translation

Squid in Intercept Mode

[ 248 ]

We can achieve a fully transparent setup using the Tproxy feature. However, we

should note that we'll need a relavely newer Linux kernel version and iptables with

support for Tproxy version 4.1 or later. Please check http://wiki.squid-cache.

org/Features/Tproxy4

for details.

2. Working with BSD:

There are many packet ltering programs available for various avors of BSD but

OpenBSD's Packet Filter (

pf) is one of the most popular programs. Please refer to

the Packet Filter manual at

http://www.openbsd.org/faq/pf/. The Packet

Filter has been integrated in NetBSD as well. Please have a look at the NetBSD's

manual for pf at http://www.netbsd.org/docs/network/pf.html.

What just happened?

We learned how we can redirect HTTP trac desned to port 80 to port 3128 (to Squid)

using iptables on Linux. We also learned that we have to create an excepon for the

IP address to which Squid is bound, to avoid any forwarding loops.

Have a go hero – testing the trafc diversion

Once you have nished enabling IP forwarding and conguring the appropriate rules in the

rewall to redirect trac to port 3128, try accessing any website from a client machine.

Now, check if packets are being directed properly using tcpdump or wireshark.

Conguring Squid

So far, we have congured our environment to divert all HTTP trac to port 3128 on the

Squid server. Finally, it's me to check what conguraon we need to do in Squid so that

it can intercept all the diverted trac.

Conguring HTTP port

Finally, we need to tell Squid that we will be intercepng the client requests. We can do so

by using the appropriate opon with the http_port direcve as follows:

http_port 3128 intercept

http_port 8080

Chapter 10

[ 249 ]

If we use the previous conguraon, the requests on port 3128 will be intercepted and

port 8080 will be used for normal forward proxying. It's not necessary to have the port

8080 conguraon above, but it's useful for proxy management access, which will not

work through the intercept port.

So, that's all we need to do for intercepon caching. Now, Squid will handle all the requests

normally and cached responses will be served from the cache.

Pop quiz

1. Which of the following protocols can be intercepted by Squid?

a. HTTP

b. FTP

c. Gopher

d. HTTPS

2. Which one of the following is an essenal HTTP header for the proper funconing

of Squid in intercept mode?

a. Cache-Control

b. Proxy-Authorizaon

c. Host

d. User-Agent

3. Why can't we use proxy authencaon with Squid in intercept mode?

a. Squid is not responsible for providing authencaon in intercept mode.

b. HTTP clients are not aware of a proxy and don't send the Proxy-Authorizaon

HTTP header.

c. It's not possible to assign usernames and passwords to thousands of clients.

d. Proxy-Authorizaon HTTP headers are removed by the routers or switches on

the way, when using intercepon caching.

Squid in Intercept Mode

[ 250 ]

Summary

We have learned about the basics of intercepon caching in this chapter. We have also

learned how the requests ow and packets are diverted to our Squid server so that Squid

can fetch content on behalf of clients, without explicitly conguring all the clients on

our network.

Specically, we have covered:

Intercepon caching and how it works.

Dierent ways in which to implement intercepon caching.

Advantages and drawbacks of intercepon caching.

Conguring our operang systems to forward IP packets.

Conguring IP ltering tools for our operang systems to redirect web trac to the

Squid server.

Various compile opons that can be used to implement intercepon caching on

dierent operang systems.

In the next chapter, we'll learn about wring Squid plugins or helpers to customize

Squid's behavior.



Writing URL Redirectors and

Rewriters

In the previous chapters, we have learned about installing and conguring

the Squid proxy server for various scenarios. In this chapter, we'll learn about

wring our own URL redirectors or rewriters to customize Squid's behavior.

We'll also see a few examples that can be helpful in enhancing the caching

performance of Squid or enforcing the access control.

In this chapter, we shall learn about:

URL redirectors and rewriters

Wring our own URL helper

Conguring Squid

A special URL redirector - deny_info

Popular URL helpers

So let's get started….

URL redirectors and rewriters

URL redirectors are external helper processes that can redirect the HTTP clients to alternate

URLs using HTTP redirect messages. Similarly, URL rewriters are also external helper

processes that can rewrite the URLs requested by the client with another URL. When a URL

is rewrien by a helper process, Squid fetches the rewrien URL transparently and sends

the response to the end client as if it was the one originally requested by the client.



Wring URL Redirectors and Rewriters

[ 252 ]

The URL redirectors can be used to send HTTP redirect messages like 301, 302, 303, 307,

or 3xx, along with an alternate URL to the HTTP clients. When a HTTP client receives

a redirect message, the client will request the new URL. So, the major dierence between

URL redirectors and URL rewriters is that the client is aware of a URL redirect, while rewrien

URLs are fetched transparently by Squid, and the client remains unaware of a rewrien URL.

Let's try to understand the workings of URL redirector and rewriter helper programs in detail.

Understanding URL redirectors

Now, we'll try to see what happens when we are using a URL redirector helper with the Squid

proxy server and a client tries to access a webpage at http://example.com/index.html.

The previous diagram shows the ow of requests and responses using numbered steps. Let's

try to learn what is happening at each step in the previous diagram:

1. The Client requests the webpage http://example.com/index.html.

2. The Squid Proxy Server receives the requests and forwards the essenal details

related to the request to the URL redirector helper program.

3. The URL redirector helper program processes the details and issues a 303 HTTP

redirect with an alternate URL http://example.net/index.html. In other

words, the URL redirector program suggests to Squid that the client should be

redirected to a dierent URL.

4. Squid, as suggested by the URL redirector helper, sends the redirect message to the

client with the alternate URL.

5. The client, on receiving the redirect message, iniates another request for the new

URL http://example.net/index.html.

6. When Squid receives the new request, it is again sent to the URL redirector

helper program.

Chapter 11

[ 253 ]

7. The URL redirector program processes the request and suggests to Squid that this

URL can be fetched and we don't need to redirect the client to an alternate URL.

8. Squid fetches the URL http://example.net/index.html.

9. The response received by Squid from the origin server at example.net is delivered

to the client.

We have just learned how the client iniated a request, which was redirected to an alternate

URL by the URL redirector helper program. We'll learn about the logic followed by the URL

redirector program for redirecng URLs at a later stage in this chapter. Now, let's try to

understand the useful HTTP status codes for redirecon and where they can be used.

HTTP status codes for redirection

We have learned that we can use various HTTP redirect codes for redirecng clients

to a dierent URL. Now let's try to understand when and where we can use these

HTTP redirect codes.

Code Descripon and usage

301 The HTTP status code 301 means that the URL requested by the client has moved

permanently and all the future requests should be made to the redirected URL. This

status code should be used in reverse proxy setups only.

302 The HTTP status code 302 means that the content can be fetched using an alternate

URL. This status code should be used with GET or HEAD requests.

303 The code 303 means that the request can be sased with an alternate URL but the

alternate URL should be fetched using a GET request. This status code can be used with

POST or PUT requests.

305 The status code 305 indicates that the client should use a proxy for fetching the

content. This status code is intended to be used by intercepon proxies needing to

switch to a forward proxy for the request.

307 The status code 307 means a temporary HTTP redirect to a dierent URL but the

future requests should use the original URL. In this case, the request method should

not be changed while requesng the redirected URL. This status code can be used for

CONNECT/HTTPS requests.

For more informaon on HTTP status codes for redirecon, please visit

http://en.wikipedia.org/wiki/List_of_HTTP_status_codes#3xx_Redirection.

It's now me to learn how the URL rewriter helper programs rewrite URLs.

Wring URL Redirectors and Rewriters

[ 254 ]

Understanding URL rewriters

URL rewriters are almost similar to URL redirectors, with a major dierence being that they

never tell the client about the change of URLs. Let's say, a client is trying to retrieve the

webpage at

http://example.com/index.html and we have a URL rewriter program

working on our proxy server. Now, have a look at the following diagram:

The numbered steps in the previous diagram represent the ow of requests and responses.

Let's try to understand the steps shown in the diagram as follows:

1. The client requests a URL http://example.com/index.html using our

proxy server.

2. Squid receives the request and forwards the essenal details about the request

to the URL rewriter helper program.

3. The URL rewriter helper program processes the details received from Squid and

suggests to Squid that it should fetch http://example.net/index.html instead

of http://example.com/index.html. In other words, the rewriter program has

rewrien the URL with a new URL.

4. Squid receives the rewrien URL (http://example.net/index.html) from

the rewriter program and contacts the origin server at

example.net instead of

contacng

example.com.

5. Squid delivers the response returned by the origin server at example.net to

the client.

So, we have seen how client requests are rewrien by the URL rewriter helper programs and

the client is not even informed about it. The client sll thinks that the response was fetched

from the origin server

example.com and not example.net.

Chapter 11

[ 255 ]

So far, we have learned about URL redirector and rewriter programs. The basic dierence

between the two is the presence or absence of a 3xx HTTP redirect code. When a 3xx

redirect code is present, the client is redirected to a new URL. On the other hand, in the

absence of a 3xx redirect code, the URL is simply rewrien by Squid transparently.

Issues with URL rewriters

There are some known issues when rewring URLs, which can result in unexpected behavior

from original servers or the proxy server itself. Let's have a look at a few possible issues with

URL rewriters.

Rewring URLs on a criterion other than the URL may result in unpredictable cached

responses. Moreover, the same response may be cached for several URLs. This may

expose our proxy server to cache poisoning aacks. This is not a problem when

redirecng URLs as the client will request the redirected URL, and the response if

cached, will correspond to the correct URL.

Rewring upload, POST, or WebDAV requests may result in unpredictable alteraons

on origin servers.

If a rewriter passes an invalid URL back to Squid, it may result in unexpected

behavior from Squid. For example, we may consider that a hash (

#) character is

valid in a URL as our browser understands it. However, when we rewrite a URL with

a dierent URL containing a hash (

#) character, the proxy doesn't know what to do

with it. Hence, Squid will reject the new URL and will either send an error message

to the client or bypass the rewrite depending on the Squid version. A HTTP redirect

to a URL with a hash (

#) in it will work as the browser understands what to do

with fragments.

Rewring

CONNECT/HTTPS requests may result in HTTPS errors breaking the

security channels.

As we saw previously, rewring URLs poses more problems compared to URL redirecon.

Hence, URL redirectors are recommended over URL rewriters, as the client is fully aware

in case of redirecons.

This ability of a redirector to rewrite the originally requested URL exposes a lot of power

to the developers or administrators. We can use this feature to redirect clients to alternate

access-denied pages, help, or documentaon pages, block ads from well known ad networks,

implement more aecve lters, or redirect clients to a locally mirrored content.



Wring URL Redirectors and Rewriters

[ 256 ]

Squid, URL redirectors, and rewriters

Squid and URL redirector (or rewriter) programs work closely and every request is passed

through the specied URL redirector (or rewriter) program and then Squid acts accordingly

(redirects the HTTP client to the rewrien URL or fetches the rewrien URL). Let's have

a look at a few details about Squid and URL redirectors.

Communication interface

The URL redirectors and rewriters communicate with Squid using a similar and simple

interface, which is very easy to understand as well as implement. For each request, the

following details are passed to a helper program in one line.

ID URL client_IP/FQDN username method myip=IP myport=PORT [kv-pairs]

The following table gives a brief explanaon of the elds passed by Squid to the redirectors:

Field Descripon

The ID is used for idenfying each request that Squid passes on as the

standard input to the redirector program. The redirector program is supposed

to pass the ID back to Squid so that it can relate the returned URL to an

appropriate request. This ID is used to achieve concurrency. This eld will be

missing with non-concurrent helpers.

URL

The URL eld is the actual URL requested by the client and is passed to

rewriters as it is.

client_IP

The eld client_IP represents the IP address of the client.

FQDN

The FQDN (Fully Qualified Domain Name) eld contains the fully

qualied domain name of the client, if present. If FQDN is not set, a hyphen

(-) is put in its place. Please note that FQDN will not be available at all when

reverse DNS lookup is not set for the IP address.

username

The username eld contains the username of the client for the current

request, as determined by Squid. The username eld will be replaced by a

hyphen (-) if Squid was unable to determine the username.

method

The method eld contains the HTTP request method used by the client to

request the current URL. The values will be GET, POST, DELETE, and so on.

myip=IP

The myip (myip=IP_ADDRESS) represents the Squid receiving IP address to

which the client request was sent. It is helpful if there is more than one network

interface on the server and Squid is bound to more than one IP address.

myport=PORT

The eld myport (myport=PORT_NUMBER) represents the Squid port on

which the client request was sent. It is helpful in case Squid is listening on more

than one port.

kv-pairs

There may be other key value pairs which may be made available to rewriter

programs in the future.

Chapter 11

[ 257 ]

The URL helper program can process the previous elds and take appropriate acons

according to a predened login in the helper program. Now, it's me to explore how the

messages are passed between Squid and URL helpers.

Time for action – exploring the message ow between

Squid and redirectors

Let's try to understand the message ow between Squid and the redirector (or rewriter)

programs.

1. A line containing the elds shown previously (separated by spaces) is passed by

Squid to the URL redirector program using a single line for each client request.

Once the helper program has nished processing the elds, it must write one of the

following messages on the standard output. Please note that the new line (\n) at

the end of the message is important and must not be omied:

2. The line containing the elds is read by the URL redirector program from the

standard input.

3. Aer reading the line from the standard input, the redirector (or rewriter) program

can process the elds and make decisions based on the values of dierent elds.

A line containing only the idener (

ID \n).

A modied URL with an HTTP redirect code followed by a new line.

(

ID 3XX:URL \n). The HTTP redirect code and the URL should be

separated by a colon.

A modied URL followed by a new line (

ID URL \n)

4. The message wrien by the helper program on the standard output is read by Squid

for further processing. It then takes one of the following acons:

If the helper program wrote a blank line on the standard output, Squid

treats it as if we didn't modify the URL at all and the original URL will be

used by Squid to fetch the content.

If the helper program wrote a dierent URL with a redirect code, then Squid

will send a response to the client redirecng it to the alternate URL.

If a dierent URL without a redirect code was wrien, Squid will treat it as

if that was the original URL requested by the client, will fetch the content

transparently, and return it to the client.

So, as we have seen previously, a single program can act as a URL redirector as well as a URL

rewriter program by execung condional redirecon or rewring URLs. In the following

secons, we'll be using the URL redirector to mean both URL redirector and URL rewriter,

unless specied otherwise.



Wring URL Redirectors and Rewriters

[ 258 ]

What just happened?

We have just learned how Squid communicates with URL redirector programs using standard

I/O. Squid sends some details about each request to the URL redirector program. Then the

URL redirector program processes the elds sent by Squid and makes a decision accordingly.

Aer making the decision, the redirector sends back the appropriate message, which is then

read by Squid.

Now, let's have a look at a few example elds sent by Squid to a URL redirector program:

http://www.example.com/ 127.0.0.1/localhost - GET myip=127.0.0.1

myport=3128

http://www.example.net/index.php?test=123 192.0.2.25/- john GET

myip=192.0.2.25 myport=8080

http://www.example.org/login.php 198.51.100.86/- saini POST

myip=192.0.2.25 myport=8080

As shown in the previous examples, the enre URL is passed to a URL redirector program

along with the query parameters, if any. The fragment ideners are removed from the URL,

while Squid passes the URL to the redirector program.

We should be careful while using URL redirector programs because Squid

passes the enre URL along with query parameters to the URL redirector

program. This may lead to leakage of sensive client informaon as some

websites use HTTP GET methods for passing clients' private informaon.

The URL redirector program has to read lines, as shown in above examples, in an endless

loop unless an EOF (end of file) occurs on the standard input. The program should not

exit. However, if the program exits prematurely, Squid tries to respawn another instance of

the redirector program and writes a message (as shown in the following example) to the

Squid cache log, warning the user of a probable problem with the redirector program:

2010/11/08 22:01:19| WARNING: redirector #1 (FD 8) exited

Time for action – writing a simple URL redirector program

Let's see a very simple Perl script that can act as a URL redirector program.

$|=1;

while (<>) {

s@http://www.example.com@303:http://www.example.net@;

print;

}

Chapter 11

[ 259 ]

The previous code is a URL redirector program in its simplest form. It redirects all URLs

containing the URL

www.example.com to www.example.net without inspecng values

of any of the elds by Squid.

What just happened?

We have just seen a simplisc Perl script which can act as a URL redirector program and can

be used with Squid.

Have a go hero – modify the redirector program

Modify the previous URL redirector program so that all requests to google.co.uk can be

redirected to

google.com.

Concurrency

We can make our URL redirector programs concurrent for beer performance. When we

congure Squid to use a concurrent URL redirector program, it passes an addional eld, ID,

on the standard input to the redirector program. This is used to achieve concurrency as we

learned in the previous secon.

It's always beer to have more concurrency than more children helpers for

beer performance.

Handling whitespace in URLs

There are dierent ways to handle whitespaces in URLs. A few techniques that can be used

are as follows:

Using the uri_whitespace directive

We can use the uri_whitespace direcve to drop, truncate, or encode the whitespaces in

URLs. Let's have a look at the format for using the uri_whitespace direcve.

uri_whitespace OPTION

The possible values that OPTION can have are as follows:

Strip whitespaces

When we use the strip opon, the whitespace characters are completely stripped from the

URL. This behavior is recommended.

Wring URL Redirectors and Rewriters

[ 260 ]

Deny URLs with whitespaces

The requests to URLs containing whitespaces are denied and the user gets an Invalid

Request

message when the deny opon is used.

Encode whitespaces in URLs

When the encode opon is used, the whitespaces in the URLs are encoded. This is a

violaon of HTTP protocol as proxies should not make changes to a URL. It is however,

what the browser should have sent, so it is relavely safe to do if needed.

Chop URLs

When the chop opon is used, the URL is chopped at the rst whitespace. This is not

recommended and may lead to unexpected behavior. This is also a violaon of

HTTP protocol.

Allow URLs with whitespaces

The request URL is not changed at all when the allow opon is used.

So, to remove whitespaces from the URLs before they are passed to the URL

redirector programs, we can use the

strip, encode, deny, or chop opons with the

uri_whitespace direcve, and the redirector program will never have to worry about

whitespaces in the URLs. For example:

uri_whitespace deny

Please note that the default Squid behavior is to strip whitespaces from

all the URLs in compliance with RFC 2396.

Making redirector programs intelligent

Just in case we choose to allow whitespaces in URLs, then we'll need to make our redirector

programs a bit more intelligent. In non-concurrent redirectors, we can remove the ve

elds from the end and whatever is le will be the URL (with or without whitespaces). For

concurrent redirector programs, the logic will be a bit dierent and we'll need to remove one

eld (ID) from the beginning, ve elds from the end, and whatever is le will be the URL

(with or without whitespaces).

Writing our own URL redirector program

Based on the concepts we learned earlier about the URL redirector helper programs, we

can write a program that can redirect/rewrite URLs condionally. So, let's have a look at

an example:

Chapter 11

[ 261 ]

Time for action – writing our own template for a URL redirector

Now, let's have a look at an example URL redirector program in Python, which can be

extended to t any scenario:

#!/usr/bin/env python

import sys

def redirect_url(line, concurrent):

list = line.split(' ')

# 1st or 2nd element of the list

# is the URL depending on concurrency

if concurrent:

old_url = list[1]

else:

old_url = list[0]

# Do remember that the new_url

# should contain a '\n' at the end.

new_url = '\n'

# Take the decision and modify the url if needed

if old_url.endswith('.avi'):

# Rewrite example

new_url = 'http://example.com/' + new_url

elif old_url.endswith('.exe'):

# Redirect example

new_url = '302:http://google.co.in/' + new_url

return new_url

def main(concurrent = True):

# the format of the line read from stdin with concurrency is

# ID URL ip-address/fqdn ident method myip=ip myport=port

# and with concurrency disabled is

# URL ip-address/fqdn ident method myip=ip myport=port

line = sys.stdin.readline().strip()

# We are to keep doing this unless there is EOF

while line:

# new_url will be a URL, URL prefixed with 3xx code

# or just a blank line.

new_url = redirect_url(line, concurrent)

Wring URL Redirectors and Rewriters

[ 262 ]

id = ''

# Add ID for concurrency if concurrency is enabled.

if concurrent:

id += line.split(' ')[0] + ' '

new_url = id + new_url

# Write the new_url to standard output and

# flush immediately so that it's available to Squid

sys.stdout.write(new_url)

sys.stdout.flush()

# Read the next line

line = sys.stdin.readline().strip()

if __name__ == '__main__':

# Check if concurrency is enabled or not

main(len(sys.argv) > 1 and sys.argv[1] == '-c')

The previous program is a bit more powerful than the Perl script we saw before. In the

previous program, we rst read the data (a single line) from a standard input and removed

any unwanted characters from it. Then we call the redirect_url funcon using the data

obtained from a standard input. Then we split the data on whitespace and extract the URL

from it (the second element).

If the URL ends with

.avi, we rewrite the URL with a URL to our custom access denied page.

If the URL ends with .exe, then we redirect the user to a dierent URL, warning them of

a probable virus.

We can extend the

redirect_url funcon according to our needs and return a

rewrien URL.

What just happened?

We wrote our own URL redirector program, which is more of a template, and can be

extended to t any scenario. We can use any programming language to write such URL

redirector programs.

Have a go hero – extend the redirector program

Extend the URL redirector program, shown previously, to redirect all requests from ash

animaon les to a ny GIF image located at http://www.example.com/ban.gif.

Conguring Squid

Once we have nished wring the redirector program, we need to congure Squid to use it

properly. There are a few direcves in the Squid conguraon le using which we can control

how Squid will use our URL redirector program. Let's have a quick look at these direcves.

Chapter 11

[ 263 ]

Specifying the URL redirector program

We can specify the absolute path to our URL redirector program using the

url_rewrite_program direcve. We can also specify any addional interpreter or

command line arguments that the program expects. The following are a few examples:

url_rewrite_program /opt/squid/libexec/custom_rewriter

url_rewrite_program /usr/bin/python /opt/squid/libexec/my_rewriter.py

url_rewrite_program /usr/bin/python /opt/squid/libexec/another_

rewriter.py --concurrent

Squid can use only one URL redirector program at a me, so we should

specify only one program using the url_rewrite_program direcve.

Controlling redirector children

Once we have specied the redirector program, we need to use the url_rewrite_

children

direcve to specify the number of instances of the redirector program (children)

that Squid is allowed to spawn. The format of the url_rewrite_children direcve is

given as follows:

url_rewrite_children CHILDREN startup=N idle=N concurrency=N

In the previous conguraon line, the parameter CHILDREN represents the maximum

number of children or the maximum number of instances of the redirector program

that Squid is allowed to spawn.

We should choose this value carefully because if we keep this value very low, Squid may have

to wait for the redirector programs to process and write data to a standard output, which

may lead to signicant delays in processing client requests. Also, if we keep this value very

high, then the redirector programs will consume a signicant amount of resources (RAM,

CPU) on the server, which in turn may slow down the server, leading to delays in processing

client requests. The default value is 20.

The argument

startup (startup=N) is used to specify the minimum number of children

that will be spawned when Squid is started or recongured. If we set the value of startup

to zero (0), the rst child will be spawned on the rst request. The default value of the

startup argument is zero (0).

Seng startup to a low value will cause inial slowdown if Squid

receives a large number of requests, as it'll have to spawn a lot of children.

Wring URL Redirectors and Rewriters

[ 264 ]

The argument idle (idle=N) is used to set the minimum number of children processes that

should be idle at any point of me. The number of children processes rises with the trac

up to the maximum number, set previously. The minimum and default value of idle

argument is 1.

The value of the argument

concurrency (concurrency=N) determines the number of

concurrent requests that each redirector program can process in parallel. The default value

of concurrency is zero (0) which means that the rewriter program is single threaded.

Controlling requests passed to the redirector program

By default, all the requests are passed to the URL redirector program. However, this may

not be the desired behavior. We can control what requests Squid passes to the redirector

program using the

url_rewrite_access direcve. The format and usage of the

url_rewrite_access direcve is similar to http_access.

Let's say our URL redirector program redirects/rewrites URLs only for the domain

example.com. Now, we can add the following conguraon lines to our Squid

conguraon le:

acl rewrite_domain dstdom .example.com

url_rewrite_access allow rewrite_domain

url_rewrite_access deny all

In accordance to the previous conguraon, Squid will only pass requests whose domain

is example.com or any of its sub-domains. Similarly, we can create powerful lters by

combining Access Control Lists and the url_rewrite_access direcve.

Please note that certain request types such as POST and CONNECT must not

be rewrien as they may lead to errors and unexpected behavior. It's a good

idea to block them using the url_rewrite_access direcve.

Bypassing URL redirector programs when under heavy load

When the redirector programs are under heavy load (receiving more requests than they can

process), Squid will have to wait unl a redirector program returns the redirected or rewrien

URL. This will introduce signicant delays in processing the user requests. In such situaons,

we can use the url_rewrite_bypass direcve to skip passing the requests to the redirector

program so that Squid can handle them on its own. So, to bypass the redirector program, we

can add the following conguraon line to our Squid conguraon le.

url_rewrite_bypass on

Chapter 11

[ 265 ]

The default Squid behavior is not to bypass any request and wait for a redirector to become

free, if all of them are busy.

Bypassing redirector programs may not be desirable in some cases,

especially if the redirector program is being used to limit access to certain

resources, because it may give clients access to resources which are not

accessible otherwise.

Rewriting the Host HTTP header

When we use a URL redirector program to send HTTP redirect messages to the client,

Squid rewrites the Host HTTP header in the redirected requests. This may work when

Squid is congured in the forward proxy mode. However, when in reverse proxy mode,

rewring the Host header may cause problems. To prevent the rewring of the Host

header, we can use the url_rewrite_host_header direcve. When set to off, the

url_rewrite_host_header will stop Squid from rewring the Host HTTP header.

The default Squid behavior is to rewrite the Host HTTP header in all

redirected requests.

A special URL redirector – deny_info

The deny_info opon is a direcve in the Squid conguraon le, which can be used to:

Present clients with a custom access denied page.

Redirect (HTTP 302) the clients to a dierent URL, displaying more informaon

about why access was denied or containing help messages.

Reset the TCP connecon.

Let's have a look at the three syntaxes of the

deny_info direcve:

deny_info CUSTOM_ERROR_PAGE ACL_NAME

deny_info ALTERNATE_URL ACL_NAME

deny_info TCP_RESET ACL_NAME

The syntaxes shown previously correspond to the uses we have just discussed. In the rst

syntax, the parameter CUSTOM_ERROR_PAGE species a custom error page wrien in HTML

or plain text, which will be displayed instead of Squid's default access denied page. The

error page wrien in English should be placed in the ${prefix}/share/errors/en-us/

directory or another appropriate locaon for other languages. We can also place this errors

le in a custom locaon such as /etc/squid/local-errors/.



Wring URL Redirectors and Rewriters

[ 266 ]

In the second syntax, the client will be redirected (HTTP 302) to an alternate URL specied

using the ALTERNATE_URL parameter. In the last syntax, the connecon with the client will

be reset.

In all of the previous syntaxes,

ACL_NAME represents the ACL name that must match for

rendering the corresponding access denied page or reseng the TCP connecon.

When the

http_access rules result in denied access, Squid remembers the last ACL it

evaluated in the http_access rules. If a deny_info line exists for the ACL last evaluated,

then Squid will render the corresponding error page. Now, let's try to understand this in

detail using examples.

Consider the following conguraon:

acl example_com dstdomain .example.com

acl example_net dstdomain .example.net

acl png_file urlpath_regex -i \.png$

http_access allow example_net

http_access deny example_com png_file

http_access deny all

deny_info TCP_RESET example_com

deny_info http://example.net/ png_file

Now, let's say a client tries to access http://www.example.com/default.png. According

to the previous conguraon, the rst access rule with the

example_net ACL doesn't

match. So, we proceed to the second access rule. The URL menoned here is matched by

both the example_com and png_file ACLs. However, note that the last ACL evaluated by

Squid which resulted in denied access is png_file. So, Squid will try to nd a deny_info

line corresponding to the png_file ACL. As a result, the HTTP client will be sent a HTTP 302

redirect, redirecng the client to http://example.net/.

Now, we are going to modify our conguraon by switching the posion of ACL names in the

access rule, shown as follows:

acl example_com dstdomain .example.com

acl example_net dstdomain .example.net

acl png_file urlpath_regex -i \.png$

http_access allow example_net

# Notice the switch

http_access deny png_file example_com

http_access deny all

deny_info TCP_RESET example_com

deny_info http://example.net/ png_file

Chapter 11

[ 267 ]

If a client tries to access the same URL http://www.example.com/default.png, the

result will be a TCP connecon reset. This is because the last ACL resulng in denied access

will be example_com and not png_file.

The

deny_info direcve is preferred over custom URL redirects when we need

to redirect our client to alternate URLs poinng to custom error pages.

Popular URL redirectors

So far, we have learned about how URL redirector programs communicate with Squid

and how we can write our own URL redirector programs. Now, let's have a look at

a few popular URL redirectors. For a full list of available redirector programs, please

visit http://www.squid-cache.org/Misc/related-software.html.

SquidGuard

SquidGuard is a combinaon of lter, URL rewriter, and an access control plugin for Squid.

The main features of SquidGuard includes the fact that it is fast, free, exible, and ease of

installaon. Below are a few use cases of SquidGuard:

Liming access for some users to a list of well known web servers or URLs

Blocking access for some users based on blacklists

Redirect blocked URLs to pages containing helpful informaon

Redirect unregistered users to registraon pages

And much more...

For more details on SquidGuard, please see

http://www.squidguard.org/.

Squirm

Squirm is a fast and congurable URL rewriter for Squid. Please check http://squirm.

foote.com.au/

for more details. A few features of Squirm are as follows:

It is very fast and uses almost no memory

It can read the conguraon le again even when running

It can run in bypass mode in case the conguraon le contains errors

It has an interacve mode for tesng new conguraon les



Wring URL Redirectors and Rewriters

[ 268 ]

Ad Zapper

Ad Zapper is another popular Squid URL rewriter for removing ad banners, ash animaons,

pop-up windows, page counters, and other web bugs. Ad zapper maintains a list of regular

expressions for well known ad networks. For more details on ad zapper, please check

http://adzapper.sourceforge.net/.

Pop quiz

1. If a client requests a URL http://www.example.com/users/list.

php?start=10&end=20#top

, then which one of the following is the URL

which will be received by a URL rewriter program?

a. http://www.example.com

b. http://www.example.com/users/list.php

c. http://www.example.com/users/list.php?start=10&end=20

d. http://www.example.com/users/list.php?start=10&end=20#top

2. How many dierent URL rewriter programs can be used by Squid at any me?

a. Unlimited

b. Depends on the RAM and CPU power of the machine

c. Depends on the number of network interfaces available on the server

d. 1

3. Consider the following snippet from a Squid conguraon le:

url_rewrite_program /opt/squid/libexec/rewriter

acl rewrite_domain dstdom example.com

url_rewrite_access allow rewrite_domain

url_rewrite_access deny all

url_rewrite_bypass off

Now, consider a situaon when all the URL rewriter programs are busy and a client

requests a URL

http://www.example.com/index.html. What will Squid do?

a. Return an access denied message

b. Wait for a rewriter program to become free

c. Crash

d. Will not wait for the rewriter and will process the request normally

Chapter 11

[ 269 ]

Summary

In this chapter, we have learned about URL redirector and rewriter programs, which are

very helpful in extending the basic Squid funconality. We have also learned about

the deny_info direcve which is a beer t for redirecng users to beer and more

understandable error pages. We also learned how Squid communicates with URL helpers.

Specically, we covered:

URL redirectors and their use

How Squid communicates with the URL redirector programs

Wring our own URL redirector program

Conguring Squid to use our URL redirector program

A few popular URL redirectors that are helpful in saving bandwidth and providing

beer access control

Now that we have learned about most of the components of Squid, we need to learn about

troubleshoong in case a component doesn't behave appropriately, and that is the topic of

our next chapter.



Troubleshooting Squid

In the previous chapters, we have learned about installing and conguring the

Squid proxy server in dierent modes. Then we moved on to learning about and

further customizing Squid using the powerful URL redirector programs. Though

we may take utmost care while conguring Squid and tesng everything before

deploying changes in producon mode, somemes we may face issues which

can aect our clients. The issues may be a result of conguraon glitches, Squid

bugs, operang system limitaons, or even because of the network issues. In

this chapter, we'll learn about common known issues and how we can go about

the troubleshoong of these issues in a strategic manner.

In this chapter, we shall learn about:

Some common issues

Debugging problems

Geng help online and reporng bugs

So let's begin...

Some common issues

Most of the issues which arise are due to conguraon errors and ambiguous conguraons,

which are known as Squid bugs or operang system issues. You can x these issues quickly

if you are aware of the issues which are commonly faced by Squid users, as these types of

issues generally have standard soluons. So, let's have a look at a few common problems.



Troubleshoong Squid

[ 272 ]

Cannot write to log les

Somemes, while starng Squid, we may get a warning similar to the following:

WARNING: Cannot write log file: /opt/squid/var/logs/cache.log

/opt/squid/var/logs/cache.log: Permission denied

messages will be sent to 'stderr'.

This generally happens when the user running Squid doesn't have write permissions to the

directory containing log les or the log les themselves. This error can be avoided to a large

extent if we use binary packages for our operang system because the permissions and

ownerships will be set up properly by the packet installer during installaon.

Time for action – changing the ownership of log les

This issue can be quickly xed by changing the ownership of the log directory and les

within. Squid is either run by the user nobody or by the user menoned using the

cache_effective_user direcve in the Squid conguraon le. So, to change the

ownership of the log directory and les within, we can use the

chown command as follows:

chown –R nobody:nobody /opt/squid/var/logs/

Don't forget to replace username, group name, and log directory

in accordance with your Squid installaon.

What just happened?

We learned that Squid should have the ownership of the directory containing log les

to be able to log messages properly. We also learned how to change the ownership

using the chown command.

Could not determine hostname

Another error encountered commonly is shown as follows:

FATAL: Could not determine fully qualified hostname. Please set

'visible_hostname'

Squid Cache (Version 3.1.10): Terminated abnormally.

This happens when Squid is not able to determine the fully-qualied hostname for the

IP address it's binding to.

Please note that with Squid version 3.2 onwards, this error will be converted from FATAL

to WARNING and Squid will sll run using the name

localhost.

Chapter 12

[ 273 ]

This issue can be resolved quickly by seng an appropriate hostname using the

visible_hostname direcve in the Squid conguraon, demonstrated in the

following example:

visible_hostname proxy.example.com

The hostname provided previously should now have DNS records resolving it to the

IP address of the proxy server. In a cluster of proxies, this hostname should be unique

for every proxy server to tackle IP-forwarding issues.

Cannot create swap directories

When we try to create new swap directories using the Squid command, we may get an error

shown as follows:

[root@saini ~]# /opt/squid/sbin/squid -z

2010/11/10 00:42:34| Creating Swap Directories

FATAL: Failed to make swap directory /opt/squid/var/cache: (13)

Permission denied

[root@saini ~]#

As it is clear from the previous error message, Squid didn't have enough permission to create

the swap directories.

Time for action – xing cache directory permissions

We can x this issue by creang the cache directory and then transferring the ownership

to the Squid user manually.

mkdir /opt/squid/var/cache

chown nobody:nobody /opt/squid/var/cache

The previous commands will create the cache directory and will transfer the ownership

to the Squid user.

If we try to create the swap directories now, the command will succeed and will output

something like this:

[root@saini etc]# /opt/squid/sbin/squid -z

2010/11/10 00:44:16| Creating Swap Directories

2010/11/10 00:44:16| /opt/squid/var/cache exists

2010/11/10 00:44:16| Making directories in /opt/squid/var/cache/00

2010/11/10 00:44:16| Making directories in /opt/squid/var/cache/01

...

Troubleshoong Squid

[ 274 ]

What just happened?

We learned how to create the cache directory with proper ownership, so that Squid can

create the swap directories without any problems.

Failed verication of swap directories

In most cases, we'll be using Squid as a caching proxy server and we'll have disk caching

enabled. A common error related to cache or swap directories is as follows:

2010/11/10 00:33:56| /opt/squid/var/cache: (2) No such file or