Analytics: Calculate Visits From Page Views
2020-05-10
Disliking the current state of web analytics solutions, I've set up my own little thing. It's not gifting free data to Google and it's not spending my visitor's CPU cycles on client side Javascript.
It only collects a few data points on the server once a visitor hits a path: visitor_id
,
path
, timestamp
and the referrer
. This provides the most fundamental
metric: page views. A page view being an instance of an Internet user visiting a
particular page on a website [1].
But there is a second metric in analytics, which provides useful insights: visits. Also
sometimes called sessions, a visit is a sequence of page views. Probably the most common definition on what
defines the end of a visit is a window of inactivity of 30 minutes between two page views.
The above dashboard stores page views in a single SQLite table:
| utc_time | visitor_id | path | | -------- | ---------- | ---- |The magic words to calculate visits from this are window functions. These allow the user to define queries which work in a more iterative sense, instead of being confined to SQL's "match all the things at once" semantics.
Lag
Window functions roughly work by applying a window function over the rows of a window definition. A concrete example would be to calculate the time between page views of a visitors. The window definition being "page views of a visitor", the window function being "calculate time between":
SELECT utc_time, visitor_id, path, LAG(utc_time) OVER (PARTITION BY visitor_id ORDER BY utc_time) - utc_time AS inactivity_time FROM page_viewThe
LAG
function is a built-in window function, which evaluates the inner expression against the
preceding row: Subtracting the timestamp of the preceding row from the current row to calculate the time between
page
views. The PARTITION BY
clause defines the window; rows are divided into
partitions per visitors, the function is applied to each partition in isolation.
Visits
With the lag between page views available, visits can be calculated using yet another set of window functions:
SELECT visitor_id || '-' || row_number() OVER(PARTITION BY visitor_id ORDER BY utc_time) AS session_id, visitor_id, utc_time AS start_utc_time, lead(utc_time) OVER(PARTITION BY visitor_id ORDER BY utc_time) AS next_start_utc_time FROM (...the query from before...) WHERE ABS(inactivity_time) > 30 * 60 OR inactivity_time IS NULLIn the where clause the definition of a visit can be found again. The
row_number()
function is used
to retrieve the number of the current row in the given partition. The row number is then used as part of the
unique id assigned to each session.
I went ahead and stored that query as a view called
session
. The view can be used
just like a normal table. Counting visits for a given interval is trivial now. Joining the sessions and page views
on the visitor id even allows for stats like average session length or average visit depth.
Alternatives
It's worth taking some time to understand those queries, SQL is still one of the most requested tech skills. But of course there are some alternatives to this multi line query:
- Setting the session id of page views when they are inserted into the database. This involves looking up the most recent page view and then deciding whether the next one is the start of a new session or not.
- Calculating the session as above, but in application code. That'll probably work for small page view counts, but will become increasingly slower as the number of records increases. It's ridiculous how fast and optimized database queries are.
I've found most of these queries on the cube.js blog. The SQLite documentation on window functions is pretty insightful too.