
Automating web analytics through Python

8/27/2017

 
For the Jupyter notebook, please visit my GitHub repository at https://github.com/RRighart/GA/

Python modules

In this blog, Python 2 was used; note that the code may differ slightly for Python 3. The following Python modules were used:

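As a sketch, the imports below cover the analyses in this blog; the exact set and versions used in the original notebook are an assumption:

    # Plausible import list for the analyses in this blog (Python 2).
    import pandas as pd
    import numpy as np
    import matplotlib.pyplot as plt
    import matplotlib.cm as cm
    from matplotlib.patches import Polygon            # country polygons (section 9)
    from mpl_toolkits.basemap import Basemap          # choropleth world map (section 9)
    from google2pandas import GoogleAnalyticsQuery    # Google Analytics -> pandas (section 7)
    from scipy import stats                           # percentile conversion (section 9)
    import plotly.plotly as py                        # pie charts (section 10)
    import plotly.graph_objs as go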

1. Introduction

Web analytics is a fascinating domain and an important area of data science. Seeing where people come from (geographical information), what they do (behavioral analyses), how they visit (device: mobile, tablet, or desktop), and when they visit your website (time-related information, frequency, etc.) are all metrics of web traffic with potential business value. Google Analytics is a free tool that is widely used and well-documented.

This blog deals with how to implement web analytics in Python. I am enthusiastic about the options that are available inside Google Analytics: it has a rich variety of metrics and dimensions, good visualizations, and an intuitive graphical user interface (GUI). However, in certain situations it makes sense to automate web analytics and add advanced statistics and visualizations. In this blog, I will show how to do that using Python.

As an example, I will present the traffic analyses of my own website, for one of my blogs (https://rrighart.github.io/Webscraping/). Note, however, that many of the implementation steps are quite similar for conventional (non-GitHub) websites. So please do stay, and do not despair if GitHub is not your cup of tea.

2. The end in mind

As the options are quite extensive, it is best to start with the end in mind. In other words, for what purposes do we use web traffic analyses? [1]

  1. What is the effect of a campaign on web traffic? Concretely, does publishing a link to my blog "Webscraping and beyond" on a reputable site for fellow developers have a substantial impact on the number of visitors? To answer this question, much like an experiment, I performed a single intervention: publishing a link at http://www.datatau.com .
  2. Where do the visitors come from? Is it a large range of countries across different continents, or rather just one or a few countries?
  3. What kind of devices do my visitors use? (mobile, desktop, or tablet).

To realize these goals, what follows are the analytical steps from data acquisition to visualization. If at this moment you are not able to run Google Analytics for your own website but nevertheless want to reproduce the data analyses in Python (starting below at section 8), I advise loading the DataFrames (df1, df2, df3, df3a, df3b) from my GitHub site using the following code (change the "df" filename accordingly):
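A minimal sketch, assuming the DataFrames are stored as CSV files in the repository (the filenames are hypothetical; adapt them to what you find there):

    import pandas as pd

    # Hypothetical filename; repeat with df2, df3, df3a, df3b as needed.
    url = 'https://raw.githubusercontent.com/RRighart/GA/master/df1.csv'
    df1 = pd.read_csv(url)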

3. Add a website to your Google Analytics account

You need to first subscribe to Google Analytics and add your website [2]:

  • Select Admin
  • In the dropdown menu Property, select Create new property. You need to give the name of your website. The URL has the following format for GitHub sites: https://yourname.github.io/projectname/ (in my case, it is for example https://rrighart.github.io/Webscraping/ ).
  • Do not forget to set the reporting timezone correctly. This is very important if you want to investigate the time of day at which visitors come to your website.
  • After confirming the other steps, you will receive a Universal Analytics (UA) tracking-ID, which has the format UA-xxxxxxxx-x, where each x is a digit.

4. Tracking-ID and code

If you need to find the tracking-ID again later, it can be found under Tracking Info and then Tracking Code. Under the header Website Tracking, Google Analytics provides a script that needs to be pasted into your website. It is essential to place the code correctly, to prevent, for example, half or double counting [3].

5. Check the connection

Now that you have the tracking code pasted into your website, Google Analytics is able to collect traffic data. The "official" way to inspect whether there is a connection in Google Analytics is to select Tracking Info, then Tracking Code, and under Status, push the button Send test traffic. This will open up your website.

However, a more realistic way to do this is to visit your website yourself, using for example your mobile phone. In Google Analytics, select Home, Real-time, and Overview. If you have just visited the website of interest, you should see under pageviews that there is "Right now 1 active users on site" (of course, this could be >1 if there were other visitors at the same moment). Additionally, you may want to check the geographical map and see if your location is highlighted. If you leave your website, the active-users count should return to zero (or go down by one). If this works, you are ready to start web traffic analyses as soon as your first visitors drop in.


6. Query Explorer

So how do we start analyzing web traffic? One option is to visualize traffic in Google Analytics itself. Another option is Query Explorer, a GUI tool that gives a very quick impression of your data, combining different metrics and dimensions at once. It is also very helpful for preparing the Python code needed for data extraction (more about this later). Follow these steps:

  • Log in with your Google account at https://ga-dev-tools.appspot.com/query-explorer/
  • Select under property the webpage that you want to check (in my case "Webscraping").
  • Select view to choose between extracting all data, desktop, or mobile.
  • Select ids: this is the "ga:" code that corresponds with your property.
  • Fill in a start-date. Here we select '2017-07-20' (this is one day before I started campaigning at www.datatau.com ).
  • Fill in an end-date: '2017-08-07'.
  • Metrics: select 'ga:sessions'.
  • Dimensions: select 'ga:date'.
Note that the number of sessions is different from the number of visitors: the same visitor may return to the website several times, resulting in a higher number of sessions. For the time being, leave all the other fields empty. When you hit the button Run Query, this should return a spreadsheet with the number of sessions for each day in your time window.

7. Get your data into Python

A major advantage of using Python is that you can automate data extraction, customize settings, and build your own platform, statistics, predictive analytics, and visualizations, if you desire all in a single script. Regarding visualizations, it would for example be possible to build dynamic geographic maps in Python, showing how the flux of visitors changes locally and globally from day to day and from week to week.

Google2Pandas [4] is a tool that transfers data from Google Analytics into a pandas DataFrame in Python. From there you can proceed further, producing, for example, statistics and visualizations. The most important steps to enable this are:

  • Getting permission from Google Analytics API [2]
  • Install Google2pandas and Pandas.
  • Copy the following code into a Jupyter notebook. Run it and verify that you obtain the right DataFrame. It should contain data identical to those in Query Explorer, or to the data displayed in Google Analytics. So we started collecting the data on July 20 and ended on August 7. There are, by the way, a number of handy options for start_date and end_date, such as '7daysAgo' (you can change the number to your liking, for example '10daysAgo') or 'today'. These options come in very useful if you want to regularly extract and analyze the same time period: for example, if you want to make a report every Thursday morning going one week back ('7daysAgo'), you could in principle run the same script every time without changing anything.

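A sketch of the extraction step, assuming the 2017-era Google2Pandas interface (GoogleAnalyticsQuery and execute_query); the credential filenames and the view ID are placeholders for your own values:

    from google2pandas import GoogleAnalyticsQuery

    # Connect with the credentials obtained during the Google Analytics API setup.
    conn = GoogleAnalyticsQuery(token_file_name='analytics.dat',
                                secrets='client_secrets.json')

    query = {
        'ids'        : 'xxxxxxxx',        # your view (profile) ID
        'metrics'    : ['sessions'],      # the 'ga:' prefix may also be included
        'dimensions' : ['date'],
        'start_date' : '2017-07-20',
        'end_date'   : '2017-08-07'}

    df1, metadata = conn.execute_query(**query)
    print(df1.head())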
           date  sessions
    0  20170720         4
    1  20170721       147
    2  20170722       125
    3  20170723        77
    4  20170724       104
    ...

8. Plotting number of sessions as a function of date

To answer the first question (what is the effect of a campaign on web traffic?), we will analyze whether campaigning had a sizeable effect on the number of sessions. First, to improve the readability of the resulting plot, we modify the date string that goes on the x-axis: we remove the year part and reverse the order of day and month [5].

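A minimal sketch of this string manipulation; note [5] mentions that converting the column to datetime would be the more canonical route:

    # '20170721' -> '21-07': drop the year and reverse day and month.
    df1['date'] = df1['date'].astype(str).apply(lambda d: d[6:8] + '-' + d[4:6])
    print(df1.head())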
Checking the DataFrame df1, we can see that the date column now carries less information. Before plotting the number of sessions, let us first view some summary statistics: the number of sessions was 91 on average over the inspected period, with a maximum of 326.

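These statistics come from the pandas describe method:

    df1['sessions'].describe()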
count 19.000000
mean 91.210526
std 84.652610
min 4.000000
25% 31.000000
50% 67.000000
75% 111.000000
max 326.000000
Name: sessions, dtype: float64

The total number of sessions during the selected time window was 1733.

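Computed with the pandas sum method:

    df1['sessions'].sum()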
1733
Now we are going to plot the number of sessions (y-axis) as a function of date (x-axis). Recall that a link to the blog was published on 21-07-2017. Besides the observation that the number of visitors increased substantially after the publication date, there is an additional boost on July 27. I do not have a definite explanation for this second boost. One admittedly speculative explanation is that the blog had meanwhile received several "likes" at DataTau, and this in turn may have attracted other visitors. That the number of sessions decreases after a certain time is explained by the link slowly falling off the DataTau main page as new blogs drop in, so some time after publication people are less likely to see and visit it.

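A minimal Matplotlib sketch of this plot; the styling choices are mine, not necessarily those of the original figure:

    plt.figure(figsize=(12, 5))
    plt.plot(range(len(df1)), df1['sessions'], marker='o')
    plt.xticks(range(len(df1)), df1['date'], rotation=45)
    plt.xlabel('Date (day-month)')
    plt.ylabel('Number of sessions')
    plt.title('Sessions per day')
    plt.tight_layout()
    plt.show()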

9. Geographic mapping

To investigate the second question (where do visitors come from?), a choropleth map can be used. In this case, a world map displays the number of visitors per country using different color scales. For this purpose, we make a new DataFrame df2 that sorts the countries by number of sessions.

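A sketch of the country query, reusing the conn object from section 7 (the view ID is again a placeholder):

    query = {
        'ids'        : 'xxxxxxxx',
        'metrics'    : ['sessions'],
        'dimensions' : ['country'],
        'start_date' : '2017-07-20',
        'end_date'   : '2017-08-07'}

    df2, metadata = conn.execute_query(**query)
    df2['sessions'] = df2['sessions'].astype(int)   # ensure a numeric type
    # Sort countries by number of sessions, most visited first.
    df2 = df2.sort_values(by='sessions', ascending=False).reset_index(drop=True)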
The top 20 countries are the following:

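Displayed with the usual pandas head call:

    df2.head(20)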
It turns out that visitors come from 80 countries:

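Since df2 contains one row per country, the row count gives the number of countries:

    len(df2)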
80
We are now going to create the choropleth map. This takes a bit of code; an excellent blog that goes into more detail on this method can be found elsewhere [6].

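A sketch of the map setup, following the approach in [6] and assuming the Natural Earth admin-0 countries shapefile has been downloaded and unpacked locally (the filename is an assumption):

    fig, ax = plt.subplots(figsize=(14, 8))
    m = Basemap(projection='robin', lon_0=0, ax=ax)
    m.drawmapboundary(color='w')
    # Load the country polygons; they become available as
    # m.countries and m.countries_info.
    m.readshapefile('ne_110m_admin_0_countries', 'countries',
                    drawbounds=True, linewidth=0.2, color='gray')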
We are going to make a list called cnt consisting of country abbreviations that we will put in the index of df2:

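A sketch for building cnt and coloring the map. How the ISO-3 abbreviations were obtained is not shown here; the pycountry package is used as one possible (assumed) route:

    import pycountry  # assumption: one way to map country names to ISO-3 codes

    def iso3(name):
        # Return the ISO-3 code for a Google Analytics country name,
        # or None when the name cannot be resolved.
        try:
            return pycountry.countries.lookup(name).alpha_3
        except LookupError:
            return None

    cnt = [iso3(c) for c in df2['country']]
    df2.index = cnt

    # Color each country polygon by its number of sessions.
    max_sessions = float(df2['sessions'].max())
    for info, shape in zip(m.countries_info, m.countries):
        iso = info['ADM0_A3']
        if iso in df2.index:
            facecolor = cm.Blues(df2.loc[iso, 'sessions'] / max_sessions)
        else:
            facecolor = 'lightgrey'   # no sessions recorded for this country
        ax.add_patch(Polygon(shape, facecolor=facecolor,
                             edgecolor='gray', linewidth=0.2))
    plt.title('Number of sessions per country')
    plt.show()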
Using absolute session values did not give a clear color distribution: most countries had quite similar values and only a few had much higher ones, so most countries fell into the same color bin. Therefore, I decided to convert the values to percentiles, benefiting from the SciPy package [7]. The resulting map demonstrates that visitors do not come from one local geographic area; they come from a wide variety of countries around the world.

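A sketch of the percentile conversion with SciPy [7]; the map is then redrawn with the same loop as above, now colored by percentile rank:

    # Percentile rank (0-100) of each country's session count.
    df2['pctl'] = [stats.percentileofscore(df2['sessions'], v)
                   for v in df2['sessions']]

    fig, ax = plt.subplots(figsize=(14, 8))
    m = Basemap(projection='robin', lon_0=0, ax=ax)
    m.readshapefile('ne_110m_admin_0_countries', 'countries',
                    drawbounds=True, linewidth=0.2, color='gray')
    for info, shape in zip(m.countries_info, m.countries):
        iso = info['ADM0_A3']
        if iso in df2.index:
            facecolor = cm.Blues(df2.loc[iso, 'pctl'] / 100.0)
        else:
            facecolor = 'lightgrey'
        ax.add_patch(Polygon(shape, facecolor=facecolor,
                             edgecolor='gray', linewidth=0.2))
    plt.title('Sessions per country (percentile ranks)')
    plt.show()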
More detail is possible here, such as regional or city maps; an example of a city map can be found elsewhere [8]. Region and city dimensions can be extracted from Google Analytics. As mentioned before, it is best to explore the available metrics and dimensions in Query Explorer before implementing them in Python.

To get a better impression of the actual number of sessions in the top 30 countries, we can make a bar plot, as sketched below.


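A minimal Matplotlib sketch of such a bar plot:

    top30 = df2.sort_values(by='sessions', ascending=False).head(30)
    plt.figure(figsize=(12, 6))
    plt.bar(range(len(top30)), top30['sessions'])
    plt.xticks(range(len(top30)), top30['country'], rotation=90)
    plt.ylabel('Number of sessions')
    plt.title('Sessions for the top 30 countries')
    plt.tight_layout()
    plt.show()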

10. Device

The third question (what kind of devices do my visitors use?) is best answered with a pie chart, since it nicely illustrates proportions.

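A sketch of the device query (df3), again reusing the conn object; deviceCategory is the relevant Google Analytics dimension:

    query = {
        'ids'        : 'xxxxxxxx',
        'metrics'    : ['sessions'],
        'dimensions' : ['deviceCategory'],
        'start_date' : '2017-07-20',
        'end_date'   : '2017-08-07'}

    df3, metadata = conn.execute_query(**query)
    print(df3)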
For the chart we are going to use Plotly, but one could equally well use, for example, Matplotlib. Please be aware that you need a so-called API key to do this, which you will receive upon registration on the Plotly website.

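A sketch using the 2017-era Plotly cloud API; 'your_username' and 'your_api_key' are placeholders for your own credentials:

    import plotly.plotly as py
    import plotly.graph_objs as go

    # Sign in with the credentials received upon registration at plot.ly.
    py.sign_in('your_username', 'your_api_key')

    trace = go.Pie(labels=df3['deviceCategory'], values=df3['sessions'])
    py.iplot([trace], filename='device-sessions-pie')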
The pie chart shows that a large majority of the visitors used a desktop. Does this change from one week to the next? It is possible to show the change in device use over time using multiple pie charts. First, we make two datasets for the first and the second week, df3a and df3b respectively. The code below is a bit repetitive; it is advisable to use a loop when you have several timepoints.

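A sketch for the two weekly datasets; the exact week boundaries are an assumption, taking the campaign start on 2017-07-21 as the beginning of week 1:

    week1 = {
        'ids'        : 'xxxxxxxx',
        'metrics'    : ['sessions'],
        'dimensions' : ['deviceCategory'],
        'start_date' : '2017-07-21',
        'end_date'   : '2017-07-27'}
    week2 = dict(week1, start_date='2017-07-28', end_date='2017-08-03')

    df3a, _ = conn.execute_query(**week1)
    df3b, _ = conn.execute_query(**week2)
    print(df3a)
    print(df3b)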
Now the goal is to display the pie charts for the first and the second week side by side. It is important to use the domain parameter to set the position of the charts; further parameters can be found in the official Plotly documentation [9].

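A sketch of the side-by-side pie charts, positioned with the domain parameter [9]:

    fig = {
        'data': [
            {'type': 'pie', 'name': 'Week 1',
             'labels': df3a['deviceCategory'], 'values': df3a['sessions'],
             'domain': {'x': [0.0, 0.48]}},
            {'type': 'pie', 'name': 'Week 2',
             'labels': df3b['deviceCategory'], 'values': df3b['sessions'],
             'domain': {'x': [0.52, 1.0]}}],
        'layout': {'title': 'Device use in week 1 (left) and week 2 (right)'}}
    py.iplot(fig, filename='device-sessions-two-weeks')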
The number of mobile sessions increased slightly, while the number of desktop sessions decreased. Whether this is just noise or a real shift in sessions could probably best be evaluated using more timepoints.

11. Save data

If you want to save the DataFrames for later use:

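For example, writing each DataFrame to a CSV file:

    # Save each DataFrame for later use.
    for name, frame in [('df1', df1), ('df2', df2), ('df3', df3),
                        ('df3a', df3a), ('df3b', df3b)]:
        frame.to_csv(name + '.csv', index=False)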

12. Closing words

Web traffic allows a wide variety of potentially interesting analyses, and lots of other measures can be explored. For example: Is web traffic different during certain hours or on certain weekdays? How long do visitors stay on the website? From which websites are users referred? Is there a change in the geographic distribution over time? Just to name a few.
Bringing web traffic data into Python is rewarding for several reasons. Python can automate basic to advanced analyses: after the analysis pipeline has been programmed, a single button press can produce the desired analyses and visualizations in a report. And last but not least, updates can be produced periodically, customized to your goals.

Notes & References

  1. My goal was to track web traffic to my GitHub site, specifically my blog pages (the so-called gh-pages in GitHub). GitHub has its own web traffic statistics for the master branch, where developers typically share their software tools, scripts, etc. The current blog only deals with Google Analytics.
  2. Subscribing and getting your site detected by Google Analytics. http://www.ryanpraski.com/google-analytics-reporting-api-python-tutorial/
  3. Setting the tracking code right. http://www.seerinteractive.com/blog/audit-this-why-it-matters-where-you-put-the-google-analytics-tracking-code/
  4. Google2Pandas. https://github.com/panalysis/Google2Pandas
  5. A perhaps more appropriate way is converting the column to datetime. As we only use this variable as an x-axis in the plot, I have chosen the shorter alternative.
  6. Choropleth mapping using Basemap. http://ramiro.org/notebook/basemap-choropleth/
  7. Conversion to percentiles. https://stackoverflow.com/questions/12414043/map-each-list-value-to-its-corresponding-percentile
  8. City map. https://rrighart.github.io/City/#q24
  9. Plotly subplot options. https://plot.ly/python/pie-charts/

(c) 2017, R. Righart