Understanding HTTP
Most of us will have to deal with HTTP at some point in our career. Understanding the web and its underlying protocols is absolutely necessary for understanding most larger systems in use today. While you can get by with little understanding of the protocol for quite a while, at some point you will have to really dig in and understand not only HTTP as currently implemented, but the history of the protocol. You will need to understand why things were done the way they were years ago, and how those things were overcome in more recent iterations of the protocol.
“The Hypertext Transfer Protocol (HTTP) is an application protocol for distributed, collaborative, hypermedia information systems.”
Having some understanding of HTTP is critical to being a developer these days. Not only is HTTP used to send and retrieve data for browsers, but it’s also a very common method of integrating applications. Whether you are building cloud scale AI applications, working on IoT devices, or simply writing “boring” business code, you are probably going to have to deal with HTTP at some point in your job. Even a rudimentary understanding of how the protocol works can help you significantly if you ever have to troubleshoot problems with it. While our episode wasn’t a deep dive on the subject (even if it felt like it), we hope that it was at least enough to make you a little more comfortable with it.
Episode Breakdown
History
Development of HTTP was initiated by Tim Berners-Lee at CERN in 1989. HTTP/1.1 was first documented in an RFC in 1997, with another RFC in 1999. HTTP/2 was published in 2015 and made HTTP’s semantics more efficient on the wire. HTTP/3 is the successor to HTTP/2. It runs over QUIC, which uses UDP instead of TCP for the transport channel.
Definitions
Request: A packet of information sent from a client (often a browser) to a web server.
Response: The payload that a webserver sends back to a client in response to the request.
User Agent: A user agent is a piece of software acting on behalf of the user. It typically is a browser, but not always.
URL: Uniform Resource Locator. Indicates how to get to something on the web. It’s essentially an address.
Headers: A set of key/value pairs sent as part of a web request, or received as a response to a request that aren’t part of the contents. These are essentially request metadata.
HTTP Verb: Essentially indicates the intent of a web request.
Cookie: A small piece of data that the server sends to the client, which the client stores and sends back with later requests. Session cookies are kept only as long as the client session lasts, while others persist until they expire.
Session: The server side storage of information that should persist for the period of a user’s interaction with a website.
Cache: A store of the response to an HTTP request. This is intended to lighten the load on the server by keeping requests from going to the server when there is no need to do so.
The Request and Response Cycle
While this process looks simple, it gets complicated very quickly. First, the browser gets the IP address of the server. If the IP address is cached, then the browser moves to the next phase (there are multiple layers of caching involved here). Otherwise, the browser has to look it up using DNS. This process is substantially more complicated than we can deal with here.
Once the IP address of the server in question is retrieved, the web browser opens a TCP connection to the server. A TCP handshake process occurs as the server and the client negotiate the connection. This connection will serve as the transport channel for the subsequent HTTP traffic.
Next, the browser will package up an HTTP request and send it across the transport channel. Then, the server will process the request and build up a response in the format requested (HTML, JSON, XML, etc.).
The server sends out the response. This includes not only the payload, but also the response code. Additionally, the response will indicate to the browser how to handle caching, which cookies to set, and privacy information. The browser displays the HTML content that is returned.
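To make the cycle above a little more concrete, here is a rough Python sketch that walks through the same steps by hand: look up the address, open a TCP connection, send a request, and read the response. It is only an illustration under some assumptions. The `HelloHandler` class and the `fetch` function are names we made up, and the "server" is a throwaway one running on your own machine rather than a real website.

```python
import socket
import threading
from http.server import BaseHTTPRequestHandler, HTTPServer


class HelloHandler(BaseHTTPRequestHandler):
    """Tiny stand-in for a real web server (hypothetical, for illustration)."""

    def do_GET(self):
        body = b"<h1>hello</h1>"
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):
        pass  # keep the example quiet


def fetch(host, port, path="/"):
    # 1. Resolve the host name to an IP address.
    ip = socket.gethostbyname(host)
    # 2. Open a TCP connection (the handshake happens during connect).
    with socket.create_connection((ip, port), timeout=5) as conn:
        # 3. Package up the HTTP request and send it.
        request = (
            f"GET {path} HTTP/1.1\r\n"
            f"Host: {host}\r\n"
            "Connection: close\r\n"
            "\r\n"
        )
        conn.sendall(request.encode("ascii"))
        # 4. Read the response until the server closes the connection.
        chunks = []
        while chunk := conn.recv(4096):
            chunks.append(chunk)
    raw = b"".join(chunks)
    # The head (status line + headers) ends at the first blank line.
    head, _, body = raw.partition(b"\r\n\r\n")
    status_line = head.split(b"\r\n", 1)[0].decode("ascii")
    return status_line, body


# Port 0 asks the operating system for any free port.
server = HTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

status_line, body = fetch("localhost", server.server_port)
print(status_line)
print(body)

server.shutdown()
server.server_close()
```

A real browser does far more than this (caching, redirects, TLS, connection reuse), but the basic back-and-forth is the same.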
Understanding the structure of a URL
The scheme is the first segment of the URL and is the part before the colon. For instance, on the web this would be http or https.
Next is the Authority segment, which has a few pieces. The userinfo component is optional and precedes the host. This may consist of a username and password separated by a colon and followed by an @ symbol. There is also a host subcomponent, which is either the name or IP address of the server on which the resource resides. There is also an optional port component, which is after the host segment and separated from it with a colon.
Following the host is the path, which is essentially the address of the resource on the host. This will be encoded as UTF-8. Any characters not part of the basic URL character set are escaped as hexadecimal using percent encoding. For instance, a space is encoded as %20, because a space is character code 32 (which is 20 in hex). One other thing: sometimes the last segment of the path is referred to as a “slug”.
After this path is an optional query component (aka, QueryString). If present, this is preceded by a “?”. The query part of the URL is usually a set of key/value pairs, separated by ampersands (&). The URL specification doesn’t define this format, but it is a widely followed convention.
Additionally, there may be a fragment component. This is preceded by a hash (#). This directs focus to a secondary resource within the HTML, such as a section heading or the id of a specific element.
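If you want to see those pieces pulled apart, Python’s standard library can do it for you. This is just a small sketch. The URL itself is made up for illustration, and `example.com` is a placeholder host.

```python
from urllib.parse import urlsplit, parse_qs, quote, unquote

# A made-up URL that uses every component discussed above.
url = "https://alice:secret@example.com:8443/blog/my%20first%20post?tag=http&page=2#comments"
parts = urlsplit(url)

print(parts.scheme)    # https
print(parts.username)  # alice
print(parts.password)  # secret
print(parts.hostname)  # example.com
print(parts.port)      # 8443
print(parts.path)      # /blog/my%20first%20post
print(parts.fragment)  # comments

# The path is still percent-encoded; unquote turns %20 back into a space.
decoded_path = unquote(parts.path)
print(decoded_path)    # /blog/my first post

# The last path segment is the "slug".
slug = decoded_path.rsplit("/", 1)[-1]
print(slug)            # my first post

# The query string, following the key=value&key=value convention.
query = parse_qs(parts.query)
print(query)           # {'tag': ['http'], 'page': ['2']}

# Going the other way: a space becomes %20 (character code 32, hex 20).
print(quote("my first post"))  # my%20first%20post
```

Note that `parse_qs` gives you a list for each key, because the same key is allowed to appear more than once in a query string.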
HTTP Verbs
GET is very similar to SELECT in SQL.
POST is very similar to INSERT in SQL. In practice, in the field you’ll see a lot of implementations that treat this verb more like an upsert.
PUT is used to replace a resource.
PATCH is used to modify a resource.
DELETE is used to get rid of a resource.
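Here is one way to picture the difference between the verbs, sketched in Python against a plain in-memory dictionary instead of a real web framework. Everything here (the `notes` store, the `handle` function, the choice of status codes) is an assumption made for illustration, and real APIs often differ in the details.

```python
import itertools

# Hypothetical in-memory "database" of notes, keyed by id.
notes = {}
_next_id = itertools.count(1)


def handle(method, note_id=None, body=None):
    """Return (status_code, payload) for a request against the notes store."""
    if method == "POST":
        # Like INSERT: create a new resource and let the server pick the id.
        new_id = next(_next_id)
        notes[new_id] = dict(body)
        return 201, {"id": new_id, **notes[new_id]}
    if note_id not in notes:
        return 404, None
    if method == "GET":
        # Like SELECT: read without changing anything.
        return 200, notes[note_id]
    if method == "PUT":
        # Replace the whole resource with what was sent.
        notes[note_id] = dict(body)
        return 200, notes[note_id]
    if method == "PATCH":
        # Modify only the fields that were sent.
        notes[note_id].update(body)
        return 200, notes[note_id]
    if method == "DELETE":
        del notes[note_id]
        return 204, None
    return 405, None


created = handle("POST", body={"title": "HTTP", "text": "draft"})
patched = handle("PATCH", 1, {"text": "final"})
replaced = handle("PUT", 1, {"title": "HTTP verbs"})
deleted = handle("DELETE", 1)
missing = handle("GET", 1)
```

Notice that after the PATCH the note still has its title, while after the PUT the text field is gone, because PUT swaps in the whole new representation.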
Understanding HTTP Headers
Headers might apply to both requests and responses, or only one. There are also Entity Headers, which contain information about a resource, such as content length or MIME type. Headers consist of the case-insensitive name of the header, followed by a colon and then the value of the header.
There are too many types of headers to discuss here, so we’ll just break them down into some categories and you can research them further if you need them.
Authentication headers indicate how to get access to a resource. These may contain credentials and the like.
Caching headers indicate how a resource should be cached.
Client Hint headers are sent by the client to give the server hints about the device and its preferences, such as the viewport width, so that the server can tailor the payload.
Conditional headers are usually on the request and indicate whether to return the resource or not based on a condition (such as how recently it was changed).
Connection management headers indicate how long the underlying connection should be kept open after the current request completes.
Content negotiation headers tell the server what type of data to send back, how it is encoded, and what language to use.
Cookie headers indicate which cookie values are sent with the request and which ones should be set by the response.
CORS headers indicate policies for how content is accessible across origins. For instance, you might have a policy that indicates that certain content is only accessible from certain domains, so that other parties don’t misuse your server.
There are a lot more beyond this set, but this should give you the idea. Essentially, headers are metadata shipped with a request.
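Since headers are just name/value lines, they are easy to pick apart yourself. The following Python sketch parses the head of a response into a dictionary. The sample response text and the `parse_head` function are made up for illustration, and the sketch deliberately ignores complications such as the same header appearing more than once.

```python
# A made-up response head: status line, headers, then a blank line.
RAW_RESPONSE_HEAD = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "Cache-Control: max-age=3600\r\n"
    "content-length: 27\r\n"
    "\r\n"
)


def parse_head(raw):
    """Split a response head into its status line and a dict of headers."""
    lines = raw.split("\r\n")
    status_line = lines[0]
    headers = {}
    for line in lines[1:]:
        if not line:
            break  # a blank line marks the end of the headers
        name, _, value = line.partition(":")
        # Header names are case-insensitive, so store them lowercased.
        headers[name.strip().lower()] = value.strip()
    return status_line, headers


status_line, headers = parse_head(RAW_RESPONSE_HEAD)
print(status_line)                # HTTP/1.1 200 OK
print(headers["content-type"])    # application/json
print(headers["cache-control"])   # max-age=3600
print(headers["content-length"])  # 27
```

Lowercasing the names on the way in means that `Content-Length` and `content-length` end up under the same key, which matches how the protocol treats them.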
Understanding HTTP Responses
Messages starting with 1 are informational messages.
Messages starting with 2 indicate success. Usually this is a 200 (OK). Some other notables are 201 (created), 202 (accepted) and 206 (partial content).
Messages starting with 3 are redirects. 301 is a permanent redirect. 307 is a temporary redirect.
Messages starting with 4 are client-side errors. 400 is a bad request. 401 is unauthorized (you don’t have credentials). 403 is forbidden (your credentials aren’t good enough). 404 is not found (this resource isn’t present). 405 is method not allowed (you tried to GET something that only takes POST).
Messages starting with 5 are server-side errors. 500 is an internal server error. 502 is a bad gateway. Something upstream broke while processing the request. 503 indicates that the service is unavailable.
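Because the first digit carries the meaning, client code can often decide what to do from that digit alone. Here is a small Python sketch of that idea. The `status_class` helper is a name we made up, while `HTTPStatus` comes from the standard library.

```python
from http import HTTPStatus

# The first digit of the status code tells you the broad category.
CLASSES = {
    1: "informational",
    2: "success",
    3: "redirect",
    4: "client error",
    5: "server error",
}


def status_class(code):
    """Map a status code to its category using only the first digit."""
    return CLASSES.get(code // 100, "unknown")


for code in (200, 206, 301, 307, 403, 405, 502):
    # HTTPStatus knows the standard reason phrase for each code.
    print(code, HTTPStatus(code).phrase, "->", status_class(code))
```

This is handy when you hit a code you don’t recognize: you may not know exactly what went wrong, but you at least know whose side the problem is on.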
HTTP Sessions
A session is essentially the server-side representation of the data that is required during the course of a user’s interaction with the web server. Sessions are identified by a session id, which is passed along with the request. Depending on the application, the amount of data stored in a session could vary considerably in size. Bear in mind that if your website uses multiple servers, you need to be careful about how you manage sessions. You can no longer store sessions on a single server in this case.
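The multiple server problem is easier to see with a toy example. In this Python sketch, each "server" keeps its sessions in its own memory. The `SessionStore` class and the variable names are hypothetical, and a real application would more likely rely on its framework’s session handling.

```python
import secrets


class SessionStore:
    """Hypothetical in-memory session store living on a single server."""

    def __init__(self):
        self._sessions = {}

    def create(self):
        # The session id should be long and unguessable.
        session_id = secrets.token_urlsafe(32)
        self._sessions[session_id] = {}
        return session_id

    def get(self, session_id):
        # Returns the session data, or None if this server has never seen the id.
        return self._sessions.get(session_id)

    def destroy(self, session_id):
        self._sessions.pop(session_id, None)


# Two servers behind a load balancer, each with its own memory.
server_a = SessionStore()
server_b = SessionStore()

# The user's first request lands on server A and starts a session.
sid = server_a.create()
server_a.get(sid)["user"] = "alice"

# A later request carrying the same session id lands on server B.
print(server_a.get(sid))  # {'user': 'alice'}
print(server_b.get(sid))  # None
```

Server B has no idea who this user is. Common ways around this include keeping sessions in a store that every server can reach, or having the load balancer always send a given user to the same server.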
HTTPS
HTTPS is essentially a way of encrypting HTTP traffic. This prevents man-in-the-middle attacks and data sniffing. It also ensures that you are communicating with the server that you think you are communicating with by identifying the server.
A handshake process occurs which enables the two parties to agree on which version of SSL/TLS and which cipher suite will be used. The server sends a certificate containing its public key. The browser verifies the certificate (using a certificate authority). In the classic RSA key exchange, the browser then generates a random secret, encrypts it using the public key that the server sent, and sends that to the server. Only the server’s private key can decrypt it, and both sides derive the same symmetric session keys from that secret. The browser then sends a “change cipher spec” message, which indicates that subsequent data will be encrypted, and the server follows this by sending the browser its own “change cipher spec” message to indicate that data returned will also be encrypted. The symmetric session keys are then used to secure the exchange of information between the browser and the server. Newer versions of TLS use a Diffie-Hellman key exchange instead, but the overall shape is similar.
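You rarely perform the handshake yourself, since a library does it for you, but it helps to know which checks that library is making. This Python sketch uses the standard `ssl` module. The `open_tls` function is a name we made up, and it isn’t called here because it needs a real server on the other end.

```python
import socket
import ssl

# The default context loads the system's trusted certificate authorities
# and turns on the checks that identify the server.
context = ssl.create_default_context()

# Refuse old protocol versions during negotiation (our choice for this sketch).
context.minimum_version = ssl.TLSVersion.TLSv1_2

print(context.verify_mode == ssl.CERT_REQUIRED)  # True
print(context.check_hostname)                    # True


def open_tls(host, port=443):
    """Connect, run the handshake, and report what was negotiated.

    Hypothetical helper: it needs network access, so it is defined but not run.
    """
    with socket.create_connection((host, port), timeout=5) as sock:
        # wrap_socket performs the handshake, including certificate checks.
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.version(), tls.cipher()
```

If the certificate can’t be verified or doesn’t match the host name, the handshake fails with an error instead of quietly connecting, which is exactly the behavior you want.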
Book Club
Algorithms to Live By: The Computer Science of Human Decisions
Brian Christian and Tom Griffiths
The next few chapters talk about caching, scheduling, and Bayes’s Rule. In chapter 4 the authors start off by comparing memory storage problems to having an overfull closet. They go on to explain the history of caching in computer science and apply it to areas such as organizing a task list or how to set up a library for optimal usage. This leads into scheduling in chapter 5 and the science behind how we spend our time. They even go into the history of Gantt charts. In chapter 6 they get into Bayes’s Rule and predictive algorithms.
Tricks of the Trade
The writers of the HTTP specifications put a lot of effort into making sure that the protocol did a good job of surfacing errors to the client in a way that was useful for diagnostics. Go look at it. They did well and you can learn a lot from their approach.