Writing

Software, projects & software projects

Analytics: Calculate Visits From Page Views

Disliking the current state of web analytics solutions, I've set up my own little thing. It's not gifting free data to Google and it's not spending my visitor's CPU cycles on client side Javascript.

It only collects a few data points on the server once a visitor hits a path: visitor_id, path, timestamp and the referrer. This provides the most fundamental metric: page views. A page view being an instance of an Internet user visiting a particular page on a website [1].
But there is a second metric in analytics, which provides useful insights: visits. Also sometimes called sessions, a visit is a sequence of page views. Probably the most common definition on what defines the end of a visit is a window of inactivity of 30 minutes between two page views.

The above dashboard stores page views in a single SQLite table:

| utc_time | visitor_id | path |
| -------- | ---------- | ---- |
      
The magic words to calculate visits from this are window functions. These allow the user to define queries which work in a more iterative sense, instead of being confined to SQL's "match all the things at once" semantics.

Lag

Window functions roughly work by applying a window function over the rows of a window definition. A concrete example would be to calculate the time between page views of a visitors. The window definition being "page views of a visitor", the window function being "calculate time between":

SELECT
  utc_time,
  visitor_id,
  path,
  LAG(utc_time) OVER (PARTITION BY visitor_id ORDER BY utc_time) - utc_time AS inactivity_time
FROM
  page_view
        
The LAG function is a built-in window function, which evaluates the inner expression against the preceding row: Subtracting the timestamp of the preceding row from the current row to calculate the time between page views. The PARTITION BY clause defines the window; rows are divided into partitions per visitors, the function is applied to each partition in isolation.

Visits

With the lag between page views available, visits can be calculated using yet another set of window functions:

SELECT
  visitor_id || '-' || row_number() OVER(PARTITION BY visitor_id ORDER BY utc_time) AS session_id,
  visitor_id,
  utc_time AS start_utc_time,
  lead(utc_time) OVER(PARTITION BY visitor_id ORDER BY utc_time) AS next_start_utc_time
FROM
  (...the query from before...)
WHERE
  ABS(inactivity_time) > 30 * 60 OR inactivity_time IS NULL
        
In the where clause the definition of a visit can be found again. The row_number() function is used to retrieve the number of the current row in the given partition. The row number is then used as part of the unique id assigned to each session.
I went ahead and stored that query as a view called session. The view can be used just like a normal table. Counting visits for a given interval is trivial now. Joining the sessions and page views on the visitor id even allows for stats like average session length or average visit depth.

Alternatives

It's worth taking some time to understand those queries, SQL is still one of the most requested tech skills. But of course there are some alternatives to this multi line query:

  1. Setting the session id of page views when they are inserted into the database. This involves looking up the most recent page view and then deciding whether the next one is the start of a new session or not.
  2. Calculating the session as above, but in application code. That'll probably work for small page view counts, but will become increasingly slower as the number of records increases. It's ridiculous how fast and optimized database queries are.

I've found most of these queries on the cube.js blog. The SQLite documentation on window functions is pretty insightful too.

Other posts