Content versus metadata

Content conundrum

The contest at the heart of the Investigatory Powers Act

After more scrutiny than probably any other piece of legislation in recent memory, the Investigatory Powers Bill received Royal Assent in November. Notwithstanding the amount of Parliamentary time spent on the 300 pages of powers and safeguards, underpinning the Act are some complex and abstractly defined (in some cases undefined) concepts. Nowhere is this more true than in the distinction the legislation tries to draw between between content and metadata.

The distinction matters because the Act applies fewer safeguards and constraints to selection and examination of metadata than to content.

The government’s position, which finds support in human rights law, is that intercepting, acquiring, processing and examining the content of a communication is more intrusive than for the “who, when, where, how” contextual data wrapped around it.

However, the view is gaining ground that as metadata becomes ever richer and more revealing thanks to the mobile internet and smartphone apps, so the difference in intrusiveness has become less marked.

Where to draw a dividing line between content and metadata is not necessarily obvious. Courts applying ECHR Articles 8 and 10 or the EU Charter may not necessarily draw a line in the same place as domestic legislation.

In fact the Act creates different versions of the content and metadata dividing line for different purposes: one version for mandatory retention and acquisition of communications data from service providers and another for communications interception and equipment interference. The latter version designates more information as metadata and less as content.

Consequences of the demarcation

The demarcation between content and metadata has significant practical consequences.
Under the existing Regulation of Investigatory Powers Act (RIPA) the agency cannot search content for information about people known to be in the British Islands without obtaining special Ministerial authorisation. Under the Act the agency would need a targeted examination warrant. However these restrictions do not apply to “related communications data” (under RIPA) or “secondary data” (under the new Act) gathered as the by-product of an overseas-focused bulk interception campaign. The Snowden disclosures suggest that GCHQ has bulk intercepted and stored metadata by the tens of billions of records.

The Commons Science and Technology Committee and a Joint Parliamentary Committee scrutinised the draft Bill. Neither seemed confident that it (or anyone else) understood where the legislation drew the line between content and metadata.

The Joint Committee’s Recommendation 1 said: “Parliament will need to look again at this issue when the Bill is introduced. We urge the Government to undertake further consultation with communications service providers, oversight bodies and others to ascertain whether the definitions are sufficiently clear to those who will have to use them.”

The closest that the Home Office came to producing a systematic analysis of the difference between content and metadata was evidence submitted which categorised a selection of datatypes.

How does the Act draw the line?

The Act’s definition of content turns on whether data reveals “anything of what might reasonably be considered to be the meaning (if any) of a communication.” The Joint Committee commented on the draft Bill:

  1. The definition of content was also a concern for witnesses. Dr Paul Bernal questioned the reference to “meaning” in the definition of content, saying “It is possible to derive ‘meaning’ from almost any data—this is one of the fundamental problems with the idea that content and communications can be simply and meaningfully separated. In practice, this is far from the case.” Graham Smith posed a challenging philosophical and technical question, “For a computer to computer communication, what is the meaning of ‘meaning’?”

The impression of having to perform metaphysical gymnastics is bolstered when we are introduced to the concept of “inferred meaning”. Paragraph 2.14 of the draft Interception Code of Practice says:

“There are two exceptions to the definition of content … The first is there to address inferred meaning. When a communication is sent, the simple fact of the communication conveys some meaning, eg it can provide a link between persons or between a person and a service. This exception makes clear that any communications data associated with the communication remains communications data and the fact that some meaning can be inferred from it does not make it content.”

If anything this confirms Paul Bernal’s concern that since meaning can be derived from almost any data, a dividing line based on the existence of meaning is problematic.

The practical result of the definitions

One set of metadata definitions applies to interception and equipment interference, the other to retention and acquisition of communications data.

The interception variety of metadata is “secondary data”. For equipment interference it is the similar “equipment data”. Both consist of either “systems data” or “identifying data”. Systems data is an overriding definition, since s 261(6) lays down that if something is systems data it cannot be content.

The draft Interception Code of Practice notes that in practice the intercepting or interfering agency will only have to decide whether information fits within the definition of systems data. If so, it cannot be content even if it reveals some of the meaning of the communication.

The Act also enables “identifying data” to be extracted from the contents of a communication and treated as secondary data. Under RIPA, information such as an email address embedded in a web page is treated as content. Under the Act, intercepting and interfering agencies would be able to scrape such data from the body of a communication and treat it as metadata.

For retention and acquisition of communications data metadata is either “entity data” or “events data”. Here the position is reversed: content takes precedence. If information reveals anything of the meaning of the communication (beyond the mere fact or transmission of the communication) then for these purposes it is content, even if for interception or equipment interference purposes it would be systems data. The “identifying data” scraping exception does not apply.

The result is that some types of information may be treated as metadata for the purposes of interception and equipment interference, but as content for the purposes of communications data retention and acquisition.

Testing the dividing line

With definitions as complex as these we cannot rely on intuition to determine what is content and what is metadata. An interesting exercise is to take a short email and evaluate which of its components might count as content and which as metadata.

For this exercise I have adopted the version of the dividing line that contrasts content with “secondary data”. As we have seen, “secondary data” is generally broader than the “communications data” definition used for mandatory retention and acquisition.

Here is my sample e-mail.

From: abc@xzfxzfxz.co.uk

To: xyz@abafbabaf.co.uk

Sent: Wed 18/05/2016 13:15

Subject: Following up last night’s call

Hi Bill,

Meet Wednesday at Red Lion. DM me @cyberleagle.

Graham

An initial impression is probably that the From, To and Sent fields are metadata and everything else is content. Indeed that is the current position under RIPA. When we turn to the new Act, however, things are rather different. It seems that most of the visible parts of the e-mail may be either systems data, or identifying data that can be extracted and treated as metadata.

To understand how what looks like email content can become metadata, we need to understand more about “secondary data”.

What is secondary data?

Section 137 of the Act provides that secondary data, in relation to any communication transmitted by means of a telecommunication system, means any data falling within either of two subsections:

Subsection (4) is systems data which is comprised in, included as part of, attached to or logically associated with the communication (whether by the sender or otherwise). In general terms systems data is data that enables or facilitates a telecommunication system or service, a system holding a communication, or a service provided by such a system, to function. It is not limited to the system that is conveying the communication in question.

Subsection (5) concerns identifying data. Like systems data it must be comprised in, included as part of, attached to or logically associated with the communication (whether by the sender or otherwise). Unlike systems data it must also be capable of being logically separated from the remainder of the communication; and, if it were separated, must not “reveal anything of what might reasonably be considered to be the meaning (if any) of the communication, disregarding any meaning arising from the fact of the communication or from any data relating to the transmission of the communication.”

This last condition mirrors the Act’s general definition of content. It raises the perplexing question of what (and how much) information can be extracted from the content of a communication without revealing anything of the meaning of the communication. Examples given in the Explanatory Notes include:

  • the location of a meeting in a calendar appointment;
  • photograph information – such as the time/date and location it was taken; and
  • contact “mailto” addresses within a webpage.

The data can, it seems, relate to matters such as a real world meeting or the taking of a photograph that are not an aspect of a communication.

This is confirmed by the definition of “identifying data”, which includes data which may be used to identify any person, apparatus, system or service, any event, or the location of any person, event or thing. Events are – apparently – not limited to events forming part of the use of a communications system. Data may relate to the fact of the event, the type, method or pattern of event, or the time or duration of the event. The systems data and identifying data definitions are not limited to structured data.

It is unclear whether each item of identifying data has to be evaluated separately in determining whether it reveals anything of the meaning of the communication, or whether extracted items of identifying data should be considered cumulatively.

Analysis of sample email

For the purposes of analysing the sample email, let us assume that unstructured information can for the purposes of the Act (whether it is technically possible is another matter) be “logically separated” from the rest of the communication; and that extracted elements of identifying data are not considered cumulatively.

The From, To and Sent fields fit the definition of systems data, as data facilitating the functioning of a telecommunications service. This is unsurprising.

An email Subject line is content. However, as the draft Equipment Interference Code of Practice explains for equipment data, elements of the subject line may be capable of being extracted and treated as metadata: “the text in the subject line would not be equipment data (unless separated as identifying data).”

So consider “last night’s call”. “call” appears to be identifying data, since it identifies both the fact and type of an event. “last night’s” relates to the time of the event.

“Bill” and “Graham” both identify, or may assist in identifying, persons.

“Meet”, Wednesday” and “Red Lion” all appear to be identifying data. “Meet” relates to the type of event, “Wednesday” to its time and “Red Lion” to the location of the event.

“DM”. It is possible that this is systems data, describing something connected with enabling or facilitating the functioning of a telecommunications service. If not, it appears to be identifying data as assisting in identifying a service.

“@cyberleagle” is probably systems data (there no apparent requirement that the data should relate to means used to send the intercepted communication itself). If not, this is identifying data.

If this analysis is correct, the secondary data (and equipment data) provisions of the Act represent a significant change to the existing content/metadata boundary under RIPA.

Graham Smith is a partner at Bird & Bird, specialising in internet, IT and intellectual property law. He is the author of Sweet & Maxwell’s Internet Law and Regulation and blogs as Cyberleagle. Email Graham.Smith@twobirds.com. Twitter @cyberleagle.