Exclude personal data from your web analytics

Personally Identifiable Information (PII), or personal data (e.g. email addresses, social security numbers), is a complete nightmare for web analysts. Every web analytics platform is very clear that its systems are not designed to store PII data. Google Analytics also states in its terms of service that any GA account found storing PII data will be deleted by their administrator. If you end up having your account deleted, the chances of bringing it back online are very slim. Adobe Analytics also advises its users against storing personal data on its servers.

Most marketers don’t need to store personal data (although sometimes it’s difficult to understand what’s considered PII and what’s not; more details in this article about GDPR), because you don’t get a lot of actual value from handling PII compared to having a system that can identify visitor behavior. Even if you decide that you will not be storing any personal information in your web analytics platform, it’s easy to miss something, especially when managing a very big website. Visitors are usually very crafty and find new ways, ones you would never have imagined, to sneak PII data into your data layer or your tracking setup.

My rule of thumb is:

Every web analytics variable populated with free text from your visitors can hold personal information, even if it doesn’t make sense

One of the most common cases I’ve seen is when visitors type their email addresses in search boxes, or when email marketing vendors include an un-hashed name of the email recipient in their landing page URLs. In these cases, without you even being aware of it, you end up with a web analytics account full of personal data. As you know, these mistakes are very difficult to erase since you cannot delete data from GA or Adobe Analytics.

Every time I have to capture information containing user input, I prefer to run it through a data sanitization process first. This process looks for known PII data formats using regex rules. When one of these rules matches part of a value, the match is replaced with a standard label (e.g. “[PII_Mask-Email]”). This way you can see in your reports when the sanitization process has kicked in, and you also keep PII data out of your web analytics platform.

This is the JS function I am using to filter out PII data. The following example only looks for email addresses, but it can be extended to look for other personal data formats. You only need to add other functions similar to the “maskEmail” function, replace the regular expression and the replacement label on the line after it, and then call the new function in the same way “maskEmail” is called.

/* Mask PII data */
function all_maskPIIData(originalValue) {
    var newValue = "";
    // Treat missing values as an empty string.
    if (typeof originalValue === "undefined" || originalValue === null)
        originalValue = "";
    try {
        // Decode first so URL-encoded addresses (e.g. %40 for @) are caught too.
        originalValue = decodeURIComponent(originalValue);
    } catch (e) {
        // Malformed escape sequence: keep the raw value.
    }

    var maskEmail = function (value) {
        try {
            var rp = /[-a-z0-9~!$%^&*_+}{\'?]+(\.[-a-z0-9~!$%^&*_+}{\'?]+)*@([a-z0-9_][-a-z0-9_]*(\.[-a-z0-9_]+)*\.(aero|arpa|biz|com|coop|edu|gov|info|int|mil|museum|name|net|org|pro|travel|mobi|[a-z][a-z])|([0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}\.[0-9]{1,3}))(:[0-9]{1,5})?/gi;
            value = value.replace(rp, "[PII_Mask-Email]");
        } catch (e) {
            // Value has no replace() method (not a string): return it unchanged.
        }

        return value;
    };

    /* Email masking */
    newValue = maskEmail(originalValue);

    return newValue;
}

To apply the masking rules to a value, you just need to call the function like in the following example:

s.eVar21 = all_maskPIIData(document.URL);
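Since any free-text variable can carry PII, it may be convenient to run the masking over a whole set of variables in one go rather than one by one. The following is only a sketch of that idea: `maskAllStringValues`, `demoMaskEmail` and the sample variable names are hypothetical, and the demo masker uses a deliberately simplified email pattern so the example stays self-contained. In a real setup you would pass `all_maskPIIData` as the masking function instead.

```javascript
// Hypothetical helper: apply a masking function to every string property
// of an object holding tracking variables. Non-string values pass through.
function maskAllStringValues(trackingVars, maskFn) {
    var cleaned = {};
    Object.keys(trackingVars).forEach(function (key) {
        var value = trackingVars[key];
        cleaned[key] = (typeof value === "string") ? maskFn(value) : value;
    });
    return cleaned;
}

// Stand-in masker for this demo only (simplified pattern, an assumption).
// Replace it with all_maskPIIData in a real implementation.
function demoMaskEmail(value) {
    return value.replace(/[^\s@=&?\/]+@[^\s@=&?\/]+\.[a-z]{2,}/gi, "[PII_Mask-Email]");
}

// Sample values, invented for illustration.
var vars = {
    searchTerm: "jane.doe@example.com",
    pageUrl: "https://www.example.com/landing?recipient=jane.doe@example.com",
    resultsCount: 0
};

var cleanedVars = maskAllStringValues(vars, demoMaskEmail);
```

The helper returns a new object instead of modifying the input, so the raw values are never overwritten by accident and the cleaned copy is the only thing handed to the tracking call.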

Below is a list of common regular expressions you can use to extend the all_maskPIIData function:

IP address

/^(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)$/

Phone number

/^\+?[\d\s]{3,}$/

10-digit phone number

/^\+?[\d\s]+\(?[\d\s]{10,}$/

Birth Date (dd mm yyyy, d/m/yyyy, etc.)

/^([1-9]|0[1-9]|[12][0-9]|3[01])\D([1-9]|0[1-9]|1[012])\D(19[0-9][0-9]|20[0-9][0-9])$/

Social security number

/^(?!(000|666|9))\d{3}-(?!00)\d{2}-(?!0000)\d{4}$/

Zip codes

/^[0-9]{5}(?:-[0-9]{4})?$/
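One thing to keep in mind: the patterns above are anchored with ^ and $, so they only match when the whole value is, say, an IP address. To mask a match inside a longer string such as a URL, one option is to drop the anchors, use word boundaries and add the g flag. The sketch below shows this for two of the patterns. The names `piiRules` and `maskWithRules`, the labels and the adapted patterns are my own assumptions for illustration, not part of any analytics library.

```javascript
// Hypothetical rule table: each entry pairs an unanchored, global pattern
// with the label that replaces every match.
var piiRules = [
    {
        // IPv4 address, adapted from the anchored pattern in the list above.
        pattern: /\b(?:(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\.){3}(?:25[0-5]|2[0-4][0-9]|[01]?[0-9][0-9]?)\b/g,
        label: "[PII_Mask-IP]"
    },
    {
        // US social security number, adapted the same way.
        pattern: /\b(?!(?:000|666|9))\d{3}-(?!00)\d{2}-(?!0000)\d{4}\b/g,
        label: "[PII_Mask-SSN]"
    }
];

// Apply every rule in order and return the masked string.
function maskWithRules(value, rules) {
    var result = String(value);
    for (var i = 0; i < rules.length; i++) {
        result = result.replace(rules[i].pattern, rules[i].label);
    }
    return result;
}

var masked = maskWithRules("ip=192.168.0.1&ssn=123-45-6789", piiRules);
// masked is "ip=[PII_Mask-IP]&ssn=[PII_Mask-SSN]"
```

Unanchored patterns are more likely to produce false positives (a four-part version number looks exactly like an IP address, for example), so it is worth testing each rule against real values from your reports before rolling it out.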

Library of common regex rules

https://projects.lukehaas.me/regexhub/

Bonus – Regex cheat sheet

A useful cheat sheet to help you set up your own regex rules:

https://www.cheatography.com/davechild/cheat-sheets/regular-expressions/

Update June 2019 – Google Cloud

If you are using Google Cloud to host your web analytics data, you can use Google Cloud DLP to obfuscate sensitive data. DLP can go through all your data, identify common PII data patterns and redact them automatically. You can find more information in the Google Cloud DLP documentation.

Panagiotis

Written By

Panagiotis (pronounced Panayotis) is a passionate G(r)eek with experience in digital analytics projects and website implementation. Fan of clear and effective processes, automation of tasks and problem-solving technical hacks. Hands-on experience with projects ranging from small to enterprise-level companies, starting from the communication with the customers and ending with the transformation of business requirements to the final deliverable.