Preliminaries: trajectory data
A trajectory dataset is a collection of trajectories \(\mathcal {D}_{T} = \{t_{1}, t_{2}, \ldots, t_{m}\}\). A trajectory \(t = \langle x_{1}, y_{1}, ts_{1}\rangle, \ldots, \langle x_{n}, y_{n}, ts_{n}\rangle\) is a sequence of spatio-temporal points, i.e., triples \(\langle x_{i}, y_{i}, ts_{i}\rangle\), where \((x_{i}, y_{i})\) are points in \(\mathbb{R}^{2}\), i.e., spatial coordinates, and \(ts_{i}\) (\(i = 1, \ldots, n\)) denotes a timestamp such that \(\forall\, 1 \leq i < n\colon ts_{i} < ts_{i+1}\). Intuitively, each triple \(\langle x_{i}, y_{i}, ts_{i}\rangle\) indicates that the object is at position \((x_{i}, y_{i})\) at time \(ts_{i}\). A trajectory \(t' = \langle x_{1}', y_{1}', ts_{1}'\rangle, \ldots, \langle x_{m}', y_{m}', ts_{m}'\rangle\) is a sub-trajectory of \(t\) (\(t' \preceq t\)) if there exist integers \(1 \leq i_{1} < \ldots < i_{m} \leq n\) such that \(\forall\, 1 \leq j \leq m\colon \langle x_{j}', y_{j}', ts_{j}'\rangle = \langle x_{i_{j}}, y_{i_{j}}, ts_{i_{j}}\rangle\). We refer to the number of trajectories in \(\mathcal {D}_{T}\) containing a sub-trajectory \(t'\) as the support of \(t'\) and denote it by \(N_{\mathcal {D}_{T}}(t') = \left|\{t \in \mathcal {D}_{T} \mid t' \preceq t\}\right|\).
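To make these definitions concrete, the following minimal Python sketch (ours, not the authors' code) represents a trajectory as a tuple of \(\langle x, y, ts\rangle\) triples, checks the sub-trajectory relation \(t' \preceq t\), and computes the support \(N_{\mathcal {D}_{T}}(t')\); the names is_subtrajectory and support are illustrative.

```python
from typing import Sequence, Tuple

Point = Tuple[float, float, int]   # (x, y, ts)
Trajectory = Tuple[Point, ...]     # points ordered by timestamp

def is_subtrajectory(t_sub: Trajectory, t: Trajectory) -> bool:
    """Check t_sub <= t: the points of t_sub appear in t, in the same order."""
    i = 0
    for point in t:
        if i < len(t_sub) and point == t_sub[i]:
            i += 1
    return i == len(t_sub)

def support(t_sub: Trajectory, dataset: Sequence[Trajectory]) -> int:
    """N_D(t'): number of trajectories in the dataset containing t_sub."""
    return sum(1 for t in dataset if is_subtrajectory(t_sub, t))

# Tiny example dataset with three trajectories.
d_t = [
    ((0.0, 0.0, 1), (1.0, 0.0, 2), (1.0, 1.0, 3)),
    ((0.0, 0.0, 1), (1.0, 1.0, 4)),
    ((2.0, 2.0, 1), (3.0, 2.0, 2)),
]
t_prime = ((0.0, 0.0, 1), (1.0, 1.0, 3))
print(support(t_prime, d_t))  # 1: only the first trajectory contains t_prime
```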
The k-anonymity framework for trajectory data
A well-known method for anonymisation of data before release is k-anonymity [2]. The k-anonymity model has also been studied in the context of trajectory data [4–6]. Given an input dataset of trajectories \(\mathcal {D}_{T} \subseteq T\), the objective of the data release is to transform \(\mathcal {D}_{T}\) into some k-anonymised form \(\mathcal {D}'_{T}\). Without this transformation, the publication of the original data can put at risk the privacy of the individuals represented in it. Indeed, an intruder who gains access to the released dataset may possess background knowledge that allows him/her to conduct attacks enabling inferences on the dataset; we refer to any such intruder as an attacker. An attacker may know a sub-trajectory of the trajectory of some specific person and could use this information to infer the complete trajectory of the same person from the released dataset. Given the attacker's background knowledge of partial trajectories, a k-anonymous version has to guarantee that the probability of re-identifying the whole trajectory within the released dataset is at most \(\frac {1}{k}\). If we denote by \(\Pr (re\_id | t')\) the probability of re-identification given the trajectory \(t'\) known to the attacker, then the theoretical k-anonymity framework requires that \(\forall t' \in T,\: \Pr (re\_id | t') \leq \frac {1}{k}\). The parameter k is a given threshold that reflects the expected level of privacy.
Note that, given a trajectory dataset \(\mathcal {D}_{T}\) and an anonymity threshold k>1, we can have trajectories with a support lower than k (\(N_{\mathcal {D}_{T}}(t') < k\)) and trajectories with a support of at least k (\(N_{\mathcal {D}_{T}}(t') \geq k\)). Trajectories of the first type are called k-harmful because their probability of re-identification is greater than \(\frac {1}{k}\). In [6], the authors show that if a k-anonymisation method returns a dataset \(\mathcal {D}'_{T}\) guaranteeing that for each k-harmful trajectory \(t' \in \mathcal {D}_{T}\) of the original dataset either \(N_{\mathcal {D}'_{T}}(t') = 0\) or \(N_{\mathcal {D}'_{T}}(t') \geq k\), then for any trajectory \(t'\) known by an attacker (harmful or not), \(\Pr (re\_id | t') \leq \frac {1}{k}\).
This fact is easy to verify. Indeed, given a k-anonymous version \(\mathcal {D}'_{T}\) of a trajectory dataset \(\mathcal {D}_{T}\) that satisfies the above condition, and a trajectory t known by the attacker, two cases can arise:

- t is k-harmful in \(\mathcal {D}_{T}\): in this case we can have either \(N_{\mathcal {D}'_{T}}(t) = 0\), which implies \(\Pr (re\_id | t) = 0\), or \(N_{\mathcal {D}'_{T}}(t) \geq k\), which implies \(\Pr (re\_id | t) = \frac {1}{N_{\mathcal {D}'_{T}}(t)} \leq \frac {1}{k}\).
- t is not k-harmful in \(\mathcal {D}_{T}\): in this case we have \(N_{\mathcal {D}_{T}}(t) = F \geq k\) and t can have an arbitrary support in \(\mathcal {D}'_{T}\). If \(N_{\mathcal {D}'_{T}}(t) = 0\) or \(N_{\mathcal {D}'_{T}}(t) \geq F\), then the same reasoning as in the previous case applies. If \(0 < N_{\mathcal {D}'_{T}}(t) < F\), then the probability of re-identifying a user from the trajectory t is the probability that the user is present in \(\mathcal {D}'_{T}\) times the probability of picking that user in \(\mathcal {D}'_{T}\), i.e., \(\frac {N_{\mathcal {D}'_{T}}(t)}{F} \times \frac {1}{N_{\mathcal {D}'_{T}}(t)} = \frac {1}{F} \leq \frac {1}{k}\).
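The case analysis above maps directly onto a small probability computation. The following is a hedged sketch (not the authors' code): given the support of a trajectory t in the original dataset (\(F = N_{\mathcal {D}_{T}}(t)\)) and in the anonymised one (\(N_{\mathcal {D}'_{T}}(t)\)), it returns \(\Pr (re\_id | t)\) according to the two cases.

```python
def reid_probability(n_orig: int, n_anon: int, k: int) -> float:
    """Pr(re_id | t) from the supports in the original (n_orig) and
    anonymised (n_anon) datasets, following the two cases above."""
    if n_anon == 0:
        # No trajectory in the anonymised dataset matches the attacker's knowledge.
        return 0.0
    if n_orig < k:
        # t is k-harmful; a correct anonymisation guarantees n_anon >= k here.
        return 1.0 / n_anon
    # t is not k-harmful: F = n_orig >= k.
    if n_anon >= n_orig:
        return 1.0 / n_anon   # pick one of the n_anon matching trajectories
    return 1.0 / n_orig       # (n_anon / F) * (1 / n_anon) = 1 / F

# Examples:
print(reid_probability(n_orig=2, n_anon=0, k=3))  # 0.0  (harmful, suppressed)
print(reid_probability(n_orig=2, n_anon=3, k=3))  # 1/3  (harmful, kept with support >= k)
print(reid_probability(n_orig=5, n_anon=2, k=3))  # 1/5  (not harmful)
```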
The mathematical condition that any k-anonymous dataset has to satisfy can be explained as follows. If the attacker's knowledge consists of partial trajectories that are k-harmful, i.e., occurring only a few times in the dataset, then only a few specific complete trajectories match that knowledge, and the probability that the sequence linking attack succeeds is very high. Therefore, the anonymised dataset must contain either at least k trajectories matching the attacker's knowledge or none at all. If the attacker knows a sub-trajectory occurring many times (at least k times), then it is compatible with many subjects, which reduces the probability of a successful attack. If the partially observed trajectories lead to no match, this is equivalent to saying that they could belong to any dataset other than the one under attack, which makes the search space effectively unbounded. This is, informally, equivalent to letting k→∞; in this case, \({\lim }_{k \to \infty }{\Pr (re\_id | t')} = 0\).
This is the theoretical worst-case guarantee of the probability of re-identification of a k-anonymised dataset. However, we shall see in the following sub-section that this does not give us a complete picture of the probabilities of re-identification.
Why is the theoretical worst-case guarantee inadequate?
In order to explain the inadequacies of the theoretical worst-case guarantee, let us consider a toy example of trajectories as shown in Fig. 1. Let \(\mathcal {D}_{T}\) be the example dataset. We can choose, as an example, a value of k=3 and obtain the 3-anonymous dataset \(\mathcal {D}'_{T}\), for which the theoretical worst-case guarantee is that \(\forall t',\: \Pr (re\_id | t') \leq \frac {1}{3}\).
Figure 2 illustrates the probability that a given observed trajectory (i.e., the attacker's knowledge) can be uniquely identified in the anonymised dataset, while Fig. 3 shows the cumulative distributions of these probabilities, with h denoting the number of observations in the attacker's knowledge. We notice in Figs. 2 and 3 that the actual probability of re-identification is often much lower than the theoretical worst case, a fact that the worst-case guarantee alone does not convey.
Empirical risk model for anonymised trajectory data
In the previous sub-section, we showed that the theoretical worst-case guarantee does not describe the distribution of attack probabilities, nor does it reflect the fact that a large majority of attacks have a far lower probability of success than the worst-case bound. We therefore propose an empirical risk model for anonymised trajectory data. If \(t'\) represents the attacker's knowledge and \(h = |t'|\) denotes the number of observations in that knowledge, then the intent is to approximate a probability density and a cumulative distribution of \(\Pr (re\_id | t')\) for each value of h. This can be achieved by iterating over every value of h=1, …, M, where M is the length of the longest trajectory in \(\mathcal {D}_{T}\). For each value of h, we consider all the sub-trajectories \(t' \in \mathcal {D}_{T}\) of length h and compute the probability of re-identification \(\Pr (re\_id | t')\) as described in Algorithm 1. In particular, for each value of h a further iteration runs over each \(t'\) of length h, computing \(N_{\mathcal {D}'_{T}}(t')\), \(N_{\mathcal {D}_{T}}(t')\) and the probability of re-identification, following the reasoning described in Section “The k-anonymity framework for trajectory data”. Algorithm 1 presents the pseudocode of the attack simulation.
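Algorithm 1 itself is not reproduced here; the following is a minimal sketch of the attack simulation under our assumptions. It reuses the Trajectory type, support and reid_probability helpers sketched above and, for simplicity, draws the attack sub-trajectories of length h as contiguous segments of the original trajectories (the paper's algorithm may enumerate candidate sub-trajectories differently).

```python
from collections import defaultdict
from typing import Dict, List, Sequence

def attack_simulation(
    d_orig: Sequence[Trajectory],
    d_anon: Sequence[Trajectory],
    k: int,
) -> Dict[int, List[float]]:
    """For each background-knowledge length h, collect the distribution of
    Pr(re_id | t') over the simulated attacks (cf. Algorithm 1)."""
    max_len = max(len(t) for t in d_orig)
    probabilities: Dict[int, List[float]] = defaultdict(list)
    for h in range(1, max_len + 1):
        # Candidate background knowledge of length h; simplifying assumption:
        # contiguous segments of the original trajectories.
        candidates = {
            t[i:i + h]
            for t in d_orig
            for i in range(len(t) - h + 1)
        }
        for t_prime in candidates:
            n_orig = support(t_prime, d_orig)
            n_anon = support(t_prime, d_anon)
            probabilities[h].append(reid_probability(n_orig, n_anon, k))
    return probabilities
```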
The advantages of this approach are that the model supports arguments such as: (a) “98% of the attacks have at most a \(10^{-5}\) probability of success”; and (b) “only 0.001% of the attacks have a probability close to \(\frac {1}{k}\)”. The disadvantages of this model are: (a) a separate distribution plot is necessary for each value of h; and (b) the probability of re-identification increases as h increases. The illustration in Fig. 3 demonstrates the aforementioned advantages and disadvantages of the risk model.
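Given the per-h probability lists produced by such a simulation, statements like (a) and (b) above reduce to simple empirical-CDF queries; a sketch, with illustrative thresholds:

```python
from typing import List

def fraction_at_most(probs: List[float], threshold: float) -> float:
    """Empirical CDF value: the fraction of simulated attacks whose
    re-identification probability does not exceed the threshold."""
    if not probs:
        return 0.0
    return sum(p <= threshold for p in probs) / len(probs)

# e.g. the share of attacks with h = 3 observations succeeding with
# probability at most 1e-5 (probabilities as returned by attack_simulation):
# print(100 * fraction_at_most(probabilities[3], 1e-5))
```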
For the simulation of the attack we need to select a set of trajectories \(BK_{T}\) from the original dataset of trajectories. The optimal solution would be to take all possible sub-trajectories in the original dataset and compute the probability of re-identification for each of them. Since this set of attack trajectories can be quite large, two strategies can be adopted to avoid a combinatorial explosion. First, we can extract from the original dataset a random subset of trajectories to use as background knowledge for the attacks and estimate the distributions from it. In particular, for each trajectory length h we extract a random subset of trajectories \(B{K^{h}_{T}}\); the union of all \(B{K^{h}_{T}}\) then represents the global background knowledge \(BK_{T}\) used in the attack simulation.
Second, we can use a prefix tree to represent the original dataset compactly and then, by incrementally visiting the tree, enumerate all the distinct sequences to be used as the adversary's background knowledge.
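The following is a minimal prefix-tree sketch of this second strategy (our illustration, not the paper's data structure): each trajectory is inserted point by point, every node keeps a counter, and a depth-first visit enumerates the distinct prefixes together with their supports, which can then serve as background knowledge. It reuses the Point and Trajectory aliases from the first sketch.

```python
from typing import Dict, Iterator, Sequence, Tuple

class PrefixTreeNode:
    def __init__(self) -> None:
        self.children: Dict[Point, "PrefixTreeNode"] = {}
        self.count = 0  # number of trajectories passing through this node

def build_prefix_tree(dataset: Sequence[Trajectory]) -> PrefixTreeNode:
    root = PrefixTreeNode()
    for trajectory in dataset:
        node = root
        for point in trajectory:
            node = node.children.setdefault(point, PrefixTreeNode())
            node.count += 1
    return root

def enumerate_prefixes(
    node: PrefixTreeNode, prefix: Trajectory = ()
) -> Iterator[Tuple[Trajectory, int]]:
    """Depth-first visit yielding every distinct prefix and its support."""
    for point, child in node.children.items():
        current = prefix + (point,)
        yield current, child.count
        yield from enumerate_prefixes(child, current)
```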
Risk versus cost
One of the most important open problems, and one that makes communication between experts in law and in computer science hard, is how to evaluate whether an individual is identifiable or not, i.e., how to evaluate the privacy risk for an individual. The main legal references to this problem usually suggest measuring the difficulty of re-identifying the data subject in terms of “time and manpower”. This definition is well suited to traditional computer security problems. As an example, we can measure the difficulty of decrypting a message without the proper key in terms of how much time is needed to try all possible keys, i.e., the time and resources required by the so-called brute-force attack. In the field of privacy, however, the computer science literature shows that the key factor affecting the difficulty of re-identifying anonymised data is the background knowledge available to the adversary. Thus, we should consider the difficulty of acquiring the knowledge that enables the attack to infer some sensitive information. If we are able to measure the cost of acquiring the background knowledge, then we can provide a single risk indicator that takes into consideration both the probability of success of an attack and its cost. Combining the two factors into one single value could help communicate a specific privacy risk in legal language.
We propose three methods for measuring the cost of an attack and a way to combine it with the probability of re-identification. Specifically, we propose to normalise the probability of re-identification \(\Pr (re\_id | t')\) by the cost incurred by the attacker to gain the knowledge of \(t'\): the longer \(t'\), the higher the cost of acquiring such knowledge. Thus, \(\Pr (t') = \Pr (re\_id | t') / C(t')\), where \(C(t')\) is a cost function that increases with the length of \(t'\). We can then estimate the distribution of \(\Pr (t')\) over all \(t'\) to obtain a single combined measurement of risk over all possible attacks.
The cost function \(C(t')\) can be chosen in various ways. (1) One option is a sub-linear cost function akin to that incurred in machine-operated sensing: the initial cost of setting up the sensing equipment is high, but subsequent observations become cheaper and cheaper. Thus, \(C(t') = 1 + \log (|t'|)\) is a good approximation. (2) Another option is a linear cost, where a spying service is paid a fixed fee per observation, leading to \(C(t') = \alpha |t'|\). (3) A third alternative is a super-linear cost, where the attacker directly invests time and resources in sensing, making the cost function grow in an exponential fashion, such as \(C(t') = e^{\beta |t'|}\).
These cost models are not exhaustive. There can be other factors, beyond the scope of this paper, that can have perceptible effects on the costs of attacks.