What data is collected by Google Analytics (by default)


Solution 1

... identify what data is actually collected by the default script .... I also have a list of all the possible dimensions and metrics that can be collected

Just to be clear, GA collects more information than it shares with Analytics consumers. While the client-side script may allow additional data to be collected (like custom query string parameters), most of what it collects seems to be the same on every site, regardless of what the analytics user chooses to consume (with the exception of a few configuration items such as "anonymizeIp").

Google's policies are cleverly worded to indicate that turning on "Advertising Features" doesn't necessarily change what they collect with GA, other than the fact that a new cookie might be present:

By enabling the Advertising Features, you enable Google Analytics to collect data about your traffic via Google advertising cookies and identifiers

Knowing what GA collects (even when you don't ask it to) is particularly important given the ambiguity around whether GA is really compliant with the GDPR, which treats IP addresses, cookie identifiers, and GPS locations as "personal data".

Looking at the source code

Google Analytics is a moving target, but there is value in having a snapshot of the identifying information about the client and browser that was being leaked to Google Analytics at a given point in time.

Even though it's a bit outdated, this analysis was done using a manually deobfuscated Google Analytics JavaScript file, from a snapshot taken Mar 27, 2018.

1. Data available in Document and Window Objects

Some key objects to look for in the analytics JS: DOCUMENT, WINDOW, NAVIGATOR, SCREEN, LOCATION

Here are the items that GA utilizes (which doesn't necessarily mean the data is sent back to Google in raw form).

Data Utilized         |   Code Snippet
-------------         |   ------------
Url                   |   LOCATION.protocol + "//" + LOCATION.hostname + LOCATION.pathname + LOCATION.search
ReferringPage         |   DOCUMENT.referrer
PageTitle             |   DOCUMENT.title
HowLongIsPageVisible  |   DOCUMENT.visibilityState .. DOCUMENT,"visibilitychange"
DocumentSize          |   DOCUMENT.documentElement  .clientWidth && .clientHeight
ScreenResolution      |   SCREEN.width  SCREEN.height
ScreenColors          |   SCREEN.colorDepth + "-bit"
ClientSize            |   e = document.body; e.clientWidth && e.clientHeight
ViewportSize          |   ca = [documentEl.clientWidth .... : ca = [e.clientWidth .... ca.join("x")
FlashVersion          |   getFlashVersion
Encoding              |   characterSet || DOCUMENT.charset
JSONAvailable         |   window.JSON
JavaEnabled           |   NAVIGATOR.javaEnabled()
Language              |   NAVIGATOR.language || NAVIGATOR.browserLanguage
UserAgent             |   NAVIGATOR.userAgent
Timezone/LocalTime    |   c.getTimezoneOffset(), c.getYear(), c.getDate(), c.getHours(), c.getMinutes()
PerformanceData       |   WINDOW.performance || WINDOW.webkitPerformance   ... loadEventStart,domainLookupEnd,domainLookupStart,connectStart,responseStart,requestStart,responseEnd,responseStart,fetchStart,domInteractive,domContentLoadedEventStart
Plugins               |   NAVIGATOR.plugins
SignalUserLeaving     |   navigator.sendBeacon()  // how long the user was on the page
HistoryLength         |   WINDOW.history.length   // number of pages viewed with this browser tab
IsTopSiteForUser      |   navigator.loadPurpose   // "Top Sites" section of Safari
NameOfPage (JS)       |   WINDOW.name
IsFrame               |   WINDOW.top != WINDOW
IsEmbedded            |   WINDOW.external
RandomData            |   WINDOW.crypto.getRandomValues  // because of the try/catch, it doesn't appear to leak anything other than random values
ScriptTags            |   getElementsByTagName("script");  // probably for Ads, AutoLink decorating [https://support.google.com/analytics/answer/4627488?hl=en] and cross-domain tracking [https://developers.google.com/analytics/devguides/collection/analyticsjs/cross-domain]
Cookies (JS)          |   DOCUMENT.cookie.split(";")   // limited to cookies not marked as server only
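
To see how little code it takes to read these values, here is a minimal sketch that gathers a subset of the properties from the table. The function name `snapshotBrowserData` and the shape of the `env` argument are assumptions made for illustration; this is not GA's code, only the same standard browser properties read in one place.

```javascript
// Hypothetical helper (not part of GA): reads a subset of the properties
// listed in the table above. `env` is assumed to be an object of the form
// { window, document, navigator, screen, location }.
function snapshotBrowserData(env) {
  const { window: win, document: doc, navigator: nav, screen: scr, location: loc } = env;
  const root = doc.documentElement;
  return {
    url: loc.protocol + "//" + loc.hostname + loc.pathname + loc.search,
    referrer: doc.referrer,
    title: doc.title,
    visibility: doc.visibilityState,
    viewportSize: [root.clientWidth, root.clientHeight].join("x"),
    screenResolution: scr.width + "x" + scr.height,
    screenColors: scr.colorDepth + "-bit",
    encoding: doc.characterSet || doc.charset,
    language: nav.language || nav.browserLanguage,
    userAgent: nav.userAgent,
    timezoneOffsetMinutes: new Date().getTimezoneOffset(),
    historyLength: win.history.length,
    isFrame: win.top !== win,
  };
}

// In a browser console you could call it with the real globals:
// snapshotBrowserData({ window, document, navigator, screen, location });
```

Passing the objects in as an argument, rather than reading globals directly, keeps the sketch runnable outside a browser with stand-in objects.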

2. Data available from the QueryString and Hash

By default, GA seems to only explicitly collect querystring parameters that are documented as specific to Google Analytics. But keep in mind that they also have the entire URL available to extract this data server-side, querystring and hash included:

_ga
_gac
gclid
gclsrc
dclid
utm_id
utm_campaign
utm_source
utm_medium
utm_term
utm_content
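
As a rough illustration of the point above, the sketch below pulls only the listed parameters out of a URL and ignores everything else. The names `GA_PARAMS` and `extractCampaignParams` are hypothetical; the parsing relies on the standard `URL` and `URLSearchParams` APIs and is not taken from the GA script.

```javascript
// The parameter names listed above.
const GA_PARAMS = [
  "_ga", "_gac", "gclid", "gclsrc", "dclid",
  "utm_id", "utm_campaign", "utm_source", "utm_medium", "utm_term", "utm_content",
];

// Hypothetical helper: returns only the GA-specific query string parameters
// that are present in `href`; any other parameter is left out.
function extractCampaignParams(href) {
  const params = new URL(href).searchParams;
  const found = {};
  for (const name of GA_PARAMS) {
    if (params.has(name)) {
      found[name] = params.get(name);
    }
  }
  return found;
}

// Example: "foo" is not on the list, so it is not returned.
const example = extractCampaignParams(
  "https://example.com/page?utm_source=newsletter&foo=bar&gclid=abc123#top"
);
console.log(example);
```

Note that this only shows what is picked out explicitly on the client; the full URL is still available to the server.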

3. Data available in the HTTP Header

They can choose to capture anything on the request header from the browser. Most notably:

Cookies (Google)   |   for the google analytics domain, to track the user between sites
IP Address         |   (parameter "anonymizeIp" claims to anonymize the IP address)
Browser w/ version |
Operating system   |
Device Type        |   
Referer            |   (in this context, only the url of the page the client is currently on)
X-Forwarded-For    |   Is a proxy being used?  And, if not used for privacy, the actual IP address
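
To make the server-side view concrete, here is a small sketch of what any receiving server can read from a single request. The function `describeRequest` and its arguments are assumptions for illustration: `headers` is taken to be a plain object with lower-cased header names (the shape Node's `http` module provides), and `socketAddress` stands for the TCP peer address. It says nothing about what Google's servers actually store.

```javascript
// Hypothetical sketch of the data visible to the server on every request.
function describeRequest(headers, socketAddress) {
  // X-Forwarded-For is a comma-separated list; the first entry is the
  // address the first proxy claims to have seen.
  const forwarded = (headers["x-forwarded-for"] || "")
    .split(",")
    .map((part) => part.trim())
    .filter(Boolean);

  return {
    userAgent: headers["user-agent"] || null, // browser, version, OS, device type
    referer: headers["referer"] || null,      // the page that triggered the request
    cookie: headers["cookie"] || null,        // cookies scoped to the receiving domain
    viaProxy: forwarded.length > 0,
    clientIp: forwarded.length > 0 ? forwarded[0] : socketAddress,
  };
}
```

X-Forwarded-For is set by the client or by intermediaries, so its value can be spoofed; the sketch reports it without validating it.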

4. Other inferred data

Javascript enabled
Cookies enabled

Other identifying information they don't appear to track/utilize

Some other metrics that are readily available, but GA doesn't appear to access:

Canvas Supported
CPU Architecture
CPU Number of cores
AudioContext Supported 
Bluetooth Supported
Battery Status
Memory (RAM)
Number of speakers
Number of microphones
Number of webcams
Device Orientation
Device input is Touchscreen
System Fonts
LocalStorage Data
IndexedDB Data
WebRTC Supported
WebGL Supported
WebSocket Supported

Misc Hacks

They don't appear to use any known hacks to extract additional unique user information, such as finding the video card model of the current machine using Canvas and GL. This is not too surprising, since Google can just expose any data they want in chromium/webkit.

However, their control of 70% of the browser market gives them the power to manipulate otherwise innocuous functions (like the random number generator) to leak data for user tracking, if they so desire.

Summary

What you choose to see from the Google Analytics portal does not necessarily impact what they collect.

GA helps Google determine how well a site performs for Search Ranking, and creates a User Fingerprint to track what each internet user looks at and for how long. The latter helps them select ads, which is where they make the bulk of their money. Much of the data they touch in their script doesn't get sent back in raw form, but rather, is used to create said fingerprint.

Solution 2

If you dig deeper you'll find plenty of literature on Google Analytics architecture.

According to the official documentation:

Google Analytics works by the inclusion of a block of JavaScript code on pages in your website. When users to your website view a page, this JavaScript code references a JavaScript file which then executes the tracking operation for Analytics. The tracking operation retrieves data about the page request through various means and sends this information to the Analytics server via a list of parameters attached to a single-pixel image request.

Source: How Does Google Analytics Collect Data?
Additional reading: Google Analytics Features
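
The mechanism described in the quote, a list of parameters attached to a single-pixel image request, can be sketched roughly as follows. The endpoint `https://example.com/collect.gif`, the function `buildHitUrl`, and the short parameter keys are placeholders chosen for illustration; treat them as assumptions rather than GA's actual parameter set.

```javascript
// Hypothetical sketch: encode collected values as query parameters on an
// image URL. Empty values are skipped so they do not appear in the request.
function buildHitUrl(endpoint, fields) {
  const query = new URLSearchParams();
  for (const [key, value] of Object.entries(fields)) {
    if (value !== undefined && value !== null && value !== "") {
      query.set(key, String(value));
    }
  }
  return endpoint + "?" + query.toString();
}

const hitUrl = buildHitUrl("https://example.com/collect.gif", {
  dl: "https://example.com/page?x=1", // page URL
  dt: "Home page",                    // page title
  sr: "1920x1080",                    // screen resolution
  sd: "24-bit",                       // screen colors
  ul: "en-us",                        // language
  dr: "",                             // referrer (empty here, so omitted)
});

// In a browser, assigning the URL to an image triggers the request:
// new Image().src = hitUrl;
```

Because the request is an ordinary HTTP GET, the header data from section 3 of Solution 1 travels with it automatically.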


Author by

Schuiram

Updated on September 15, 2022

Comments

  • Schuiram
    Schuiram over 1 year

    I am trying to identify what data is actually collected by the default Google Analytics script. What seems to be an easy question turns out to have no clear answer.

    I know that they collect (for example) the IP address, screen resolution, operating system and so forth, but I simply cannot find a complete list. I also have a list of all the possible dimensions and metrics that can be collected, but not one for the "default" analytics script.

    I am asking for a list of all the data collected by default by Google Analytics.

    • nyuen
      nyuen over 9 years
      The best way to find out is to put GA on a test web page and play around with and explore the reports. GA is constantly changing, so the list of "what data is collected" continues to grow. I doubt you will actually find a list of what is collected either. You may find a starter list, but you won't find a complete one.
    • Schuiram
      Schuiram over 9 years
      I already did that. To be clearer: I am not looking for the results GA provides (they document those well: Dimensions & Metrics). I am looking for what is gathered to calculate this data, for example: IP address (which can be anonymized, so no privacy issue here), screen resolution, operating system, URL, color depth, referer, etc. In security we try to collect only the information we need, in order to protect the privacy of customers. To achieve that, and to build a data inventory, this information is needed.
  • Schuiram
    Schuiram over 9 years
    Your answer unfortunately does not answer my question. I did not ask about the how, which I am well aware of; I asked about the what. If you meant to imply "reverse engineer ga.js and analytics.js", my answer is: I hoped somebody had already done that :) I read the book Google Analytics Features, Benefits, and Limitations by Brian Clifton. While it helped me understand the "how" and the "what is possible", it did not answer the what. I also crawled through the information and policies provided by Google (for example the links you just posted), but that did not help me.
  • carlodurso
    carlodurso over 9 years
    I get your point, but I see a complete list of what's collected in the section "The GIF Request Parameters" of the first link. Apart from the IP address, which is not disclosed, and some demographic details linked to the visitor's Google Account, the rest is set and read through cookies.
  • carlodurso
    carlodurso over 9 years
    Additionally, I strongly doubt that Google Analytics will dig deeper into the visitor machine to get emails and the like.
  • Schuiram
    Schuiram over 9 years
    The list you are referring to is only illustrative and non-exhaustive. Also, my task is not about the severity of possible privacy leaks but about the soundness of the approach. I just need the information to be able to say "It is like that!". If I just said "Most probably no personal data will be lost in an emergency!", they would not be satisfied :) But thanks for your effort :)