Legal talk:Data retention guidelines

From Wikimedia Foundation Governance Wiki

Revision as of 20:51, 22 January 2014

Gender?

The examples list "email and gender in account settings" as examples of non-public data; however the account settings 'gender' property is publicly disclosed by necessity due to its purpose in producing grammatically correct strings.

Is this meant only to treat the combination of the two as private? Otherwise, we're leaking gender-by-username... --brion (talk) 21:24, 9 January 2014 (UTC)

Someone removed it, so ... :D --brion (talk) 21:41, 9 January 2014 (UTC)
Michelle did, but her login is apparently still on vacation :) It was removed in response to this remark. --LVilla (WMF) (talk) 21:49, 9 January 2014 (UTC)

Section 4 (Definition of personal information)

"Information you provide us or information we collect from you that could be used to personally identify you." This reads a bit strange. Maybe a full sentence would be better here? Something like "Personal information means information you provide ..." maybe? --თოგო (D) 22:14, 9 January 2014 (UTC)

Hi თოგო. Thank you for your suggestion! We have adjusted the language accordingly. Mpaulson (WMF) (talk) 00:41, 10 January 2014 (UTC)

Possibilities in case of breaches

Maybe the policy should contain information about a place where users can go if they feel that the policy was breached. --თოგო (D) 22:17, 9 January 2014 (UTC)

Thank you თოგო for your comment - this makes sense. What if we added this sentence to the last section of the document (“Ongoing handling…”):
If you think that these guidelines have been breached, or if you have questions or comments about compliance with the guidelines, please contact us at privacy@wikimedia.org.
Would that address your concern? Any suggestions on how to improve it? --JVargas (WMF) (talk) 00:14, 10 January 2014 (UTC)
Hi თოგო. I went ahead and implemented Jorge's suggested language by adding a new section to the guidelines. Thank you for this helpful suggestion. Mpaulson (WMF) (talk)

Who are "we"?

Does this mean WMF? Or does this mean Wikimedia sites in general? --Rschen7754 23:46, 9 January 2014 (UTC)

Thanks for the question, Rschen7754! Whenever you see "we" / "us" / "our" in the text, we are indeed referring to the Wikimedia Foundation, Inc., the non-profit organization that operates the Wikimedia Sites. This explanation is part of the “Definitions” section of the new Privacy Policy draft. Would it help if we added something like this to the document?
Terms that are not defined in this document have the same meaning given to them in the Privacy Policy.
--JVargas (WMF) (talk) 00:26, 10 January 2014 (UTC)
Yes, that would be helpful. --Rschen7754 00:28, 10 January 2014 (UTC)
We changed the "Definition of Personal Information" section to "Definitions" in the document, and we added the above sentence at the end of it. Thanks again! --JVargas (WMF) (talk) 00:52, 10 January 2014 (UTC)
Does this guideline govern only what WMF will do, and not what those with access to private information will do? Though I believe guidelines about what those users with access should do would likely be only nominal. --朝鲜的轮子 (talk) 03:03, 15 January 2014 (UTC)
@朝鲜的轮子: Generally we don't share personal information unless it is anonymized/aggregated. (There are exceptions in the privacy policy - the most important one is checkusers, but they will be covered by the Access to nonpublic information policy.) So we could make it apply to others, but (1) as you say, it would be hard to enforce and (2) in general they shouldn't have access to the data anyway. So I would prefer not to change it. Does that make sense? --LVilla (WMF) (talk) 20:25, 22 January 2014 (UTC)
Note that for CUs, we're talking with them about how to minimize retention, even though this policy won't formally apply to them. --LVilla (WMF) (talk) 20:27, 22 January 2014 (UTC)

Comments from //Shell

Introduction

How long do we retain non-public data?

    • "After no more than 90 days..." I had to think twice about what it means. Would it be possible to say "After at most 90 days..."?
      Good suggestion. I've made the change. Mpaulson (WMF) (talk) 01:03, 10 January 2014 (UTC)
      Nice. //Shell 09:09, 10 January 2014 (UTC)
    • "Anonymized" What does this mean? Does it mean that it becomes very difficult to associate the data to a specific user, or that it's completely impossible? (Clarification: Especially for small projects, say 5 editors a normal day)
    • "Email address in account settings: Indefinitely" Does this mean that if I remove or change my email address, the old address will still be kept? Is that the meaning? Is it desirable? Not sure how to rephrase it to only be about the current email address.
    • "Non-personal information associated with a user account: Collected from user: Indefinitely" While the given examples seem okay, this category seems broad and that's particularly bad since the data is kept indefinitely. The given examples seem okay, since they're almost already public data (first edit, when a user has verified email, and whether the user edits through mobile are public data). E.g. the list of read articles is not public, but could be covered by this category.
    • "Non-personal information associated with a user account: Optionally provided by a user: Logs of terms entered into the site's search box" I realize that "optional" here means that not every WM site visitor must search, but since it's a key part of any wiki it doesn't feel like I "optionally provided" it - I must do it to see the article I'm interested in (ignoring other search engines). No biggie, but feels a bit weird.
      I see your point here, Shell. We weren't sure how to best phrase the differentiation between information collected from the user and information provided by the user. We're open to suggestions though if you or anyone else has one. Mpaulson (WMF) (talk) 01:17, 10 January 2014 (UTC)
      Would it be possible to remove "optionally" and just say "Provided by a user"? //Shell 09:09, 10 January 2014 (UTC)
      I would be fine with doing that. I think we originally added "optionally" to more clearly distinguish that kind of data from data that is collected either automatically or actively by us. But obviously, if it makes it less clear rather than helping, we can remove it. Mpaulson (WMF) (talk) 14:32, 10 January 2014 (UTC)
      I was confused about search terms being optional, since they feel necessary to use the site, while the email address is usually mandatory, but in the Wikimedia case it's optional. So, I wouldn't mind adding back "optional" to the "personal information" one, but it's more consistent not to. //Shell 19:04, 10 January 2014 (UTC)
    • Do you intend to have most common data in this table, in the form of examples? It would be nice to see a complete list somewhere (though that might be asking too much).
      The table is meant to address broad categories of data so that we address the treatment of as much data as we can in these guidelines. That said, we are going to try to improve the table (and the exceptions section) with more examples over time as we refine our practices. Mpaulson (WMF) (talk) 01:21, 10 January 2014 (UTC)
      It would be nice to have as many examples as possible, so I could imagine that there was a long list in this table, but collapsed by default. //Shell 09:09, 10 January 2014 (UTC)
      I agree. The hope is that we will gradually expand the guidelines with more examples over time. I will talk to people internally and see what additional examples (if any) we can add now though. I imagine if the table gets unwieldy, we'll experiment with formatting so that it's as easy-to-read as we can make it. Mpaulson (WMF) (talk) 14:20, 10 January 2014 (UTC)
      Great. Since there are already examples that feel representative, it's not a big deal, but it'd be nice to eventually have an almost complete list. //Shell 19:04, 10 January 2014 (UTC)

Definition of personal information (good job!)

    • I can think of a couple more items to put in (b), though I'm not sure if it's necessary: (current) city (clarification: which is different/broader than address), marital status, family ties
      I added "marital and familial status" to the definition. I'm checking internally whether it makes sense to add current city. Mpaulson (WMF) (talk) 18:41, 10 January 2014 (UTC)
      I was thinking about city, since that is something you can "easily" get from an IP address, while a street address is not.
      Of course there's lots of other private information, but maybe it's unnecessary to add that, since I don't see how Wikimedia would get the info: income level/economic situation, level of education, profession, current job situation, hobbies/interests (though interests could be gleaned from what pages a user visits).
      There's also the user-agent info: OS/browser version, browser language(s), screen size etc. which websites almost never make public, but which could potentially uniquely identify a user over multiple websites[1]. //Shell 19:04, 10 January 2014 (UTC)
      We have added user-agent string to the definition of personal information, so that should be covered now. As for the other "private information" you mentioned earlier, I don't think that level of detail is necessary as the categories in (b) are meant to be illustrative examples of what we consider to be "sensitive information". Mpaulson (WMF) (talk) 22:51, 14 January 2014 (UTC)

Exceptions to these guidelines

    • "Data may be retained in system backups for longer periods of time." Is there any restriction on how long those backups can exist? Would it be possible, for instance, to delete, aggregate, or anonymize them after at most 5 years?

Design of new systems

    • "inclusion of privacy considerations in the code review process". Would this be added to some checklist, or is it just a general guideline?

Great to see this stuff be explicit. //Shell 00:15, 10 January 2014 (UTC)

Hi Shell! Thank you for taking the time to comment and help us improve these guidelines. Your suggestions are always helpful and greatly appreciated. We will respond in-line to your comments as we work through them. Mpaulson (WMF) (talk) 00:51, 10 January 2014 (UTC)
I've responded to your comments. (and clarified a couple of things) //Shell 09:09, 10 January 2014 (UTC)

Logs of terms entered into the site's search box

Why exactly is this data needed at all? Geni (talk) 00:44, 10 January 2014 (UTC)

Hi Geni, it's needed at least for debugging purposes. There are some searches that trigger bugs or performance problems, and we need to be able to go back and correlate searches with the bugs that get triggered. Sometimes malicious users may actually try to cause performance problems on the site via search, so we need to have the information correlated with IP addresses so that we can take action if necessary. We don't yet do much in the way of analytics on search traffic (that I'm aware of), but I could see that being of use in the future. -- RobLa-WMF (talk) 02:01, 10 January 2014 (UTC)

Non-public vs public data

It might be worth expanding on what is actually meant by non-public data and what data is public by default. For example, this page talks about IP addresses for visitors, which might be confused with IP addresses for editors which are either publicly visible (where an anonymous edit is made) or kept private but presumably kept indefinitely (for user accounts). I think most of this is in the privacy policy, but it might be worth summarising here. Thanks. Mike Peel (talk) 08:26, 10 January 2014 (UTC)

Hi Mike. Thanks for the suggestion! You are correct in that it is addressed more fully in the Privacy Policy draft, but we will draft some language to make that clear as well as give some basic examples. Mpaulson (WMF) (talk) 14:10, 10 January 2014 (UTC)
Thanks Michelle. :-) Mike Peel (talk) 14:44, 10 January 2014 (UTC)
Just to let you know, we added some language. Let me know if that works! Mpaulson (WMF) (talk) 18:56, 10 January 2014 (UTC)
I think it's much better now. The info is in the privacy policy, but you can't expect everybody to read it. //Shell 19:07, 10 January 2014 (UTC)

Donor data

Presumably this page isn't intended to cover donor data? It might be worth linking to wmf:Donor_policy. Thanks. Mike Peel (talk) 08:28, 10 January 2014 (UTC)

Hi Mike! While this document does not currently address donor data, we are hoping to eventually include retention practices in relation to donor data. These guidelines are meant to be a starting point for us and will get more detailed over time. In the meantime, I will see about getting navigational tools to other privacy-related documents (including the donor policy) added. Thanks for the suggestion! Mpaulson (WMF) (talk) 13:34, 10 January 2014 (UTC)

Sampled data

Does the 90 day rule for IP addresses apply to sampled data? As far as I know, the main motivation behind deleting IP addresses is to prevent tracking site usage back to a specific person. If site usage data is (sufficiently) sampled, it is much harder to do this regularly. Ottomata (talk) 16:01, 10 January 2014 (UTC)

It should apply. Your ISP might keep the correlation data IP address<->customer indefinitely, thus making it personally identifiable regardless of frequency. //Shell 19:09, 10 January 2014 (UTC)
Yes, the intent is that it should apply. This does probably mean there will have to be changes to certain existing setups, but as noted in the Audit section, that's a process we expect to occur gradually. --LVilla (WMF) (talk) 22:43, 10 January 2014 (UTC)
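
For illustration, the rule as stated in this thread (IP addresses dropped after at most 90 days, with no exemption for sampled data) could be sketched as follows. This is only a minimal sketch under assumptions: the record layout (`ip`, `timestamp`, `sampled`) and the function name `scrub_expired_ips` are hypothetical and do not describe any actual Wikimedia system.

```python
from datetime import datetime, timedelta, timezone

# Retention limit discussed in this thread ("after at most 90 days").
RETENTION = timedelta(days=90)


def scrub_expired_ips(records, now):
    """Return a list in which records at or past the limit have their IP removed.

    `records` is a list of dicts with hypothetical keys "ip", "timestamp"
    (timezone-aware datetime) and "sampled". The "sampled" flag is deliberately
    ignored: per the thread, sampled data gets no exemption.
    """
    result = []
    for record in records:
        if now - record["timestamp"] >= RETENTION:
            # Build a new dict so the caller's record is not mutated.
            record = {**record, "ip": None}
        result.append(record)
    return result
```

A caller would run this periodically over stored request records, passing the current UTC time as `now`.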

Examples in #How long do we retain non-public data?

Are the examples in the table supposed to be exhaustive? If the WMF retains other types of non-public data, I believe this guideline should explain all of them but at the moment it does not read that way. -- (talk) 08:00, 11 January 2014 (UTC)

The examples are not intended to be exhaustive (they're examples, after all ;) It would be both impractical and not very useful to readers if we listed every type of data we collect. That said, the "data types" should be exhaustive - everything we collect/retain should fit into one of those categories. Hope that helps. --LVilla (WMF) (talk) 21:40, 14 January 2014 (UTC)

Indefinite retention of emails

Why would emails be retained indefinitely? I would have expected that if an account gets "officially" closed, the user identifies under a new account and declares the old one as discontinued, or exercises their Right to Vanish, these are all scenarios where an email would not be kept on record forever. -- (talk) 08:03, 11 January 2014 (UTC)

Non-personal information associated with a user account (server logs)

This should include the contents of some HTTP headers, which may have privacy concerns, including:

  • Referer: the previous page visited, which may be on any other site (in my opinion if this is from another site, it is strictly private and can only be used as analytic data, only in aggregated forms by origin domain). Almost all browsers send this information by default (unless the user has installed a filtering plugin).
  • Accept-Language: the default language of the browser used, or the list of preferred languages defined in browser preferences; some combinations of preferred languages may be very user-specific, notably if these languages are very uncommon in the country or region associated with the geolocated IP (e.g. Icelandic or Wolof selected by a user currently in a location like Monaco, Addis Ababa or Harbin, China).
  • User-Agent: and Accept: which identify precisely the type and version of the browser, and of its supported or installed plugins. This information is used by CheckUser admins trying to identify a user from past navigation with the same browser installation when the IP alone is not enough to assert that this is the same user. The exact configuration of these combinations of software versions may be unique to a user, notably when the user has installed some uncommon plugin (this includes media player extensions, or localized versions of security tools) or uses an uncommon browser for a specific platform.
  • X-Chrome-*: and similar custom HTTP headers defined by browsers or plugins (including antivirus tools). Some of these headers contain user IDs (associated with the registration of the plugin or browser; this is very common for media players, or custom browsers embedded within game software, or within game consoles, or in some smart TV sets or set-top boxes, or in some brands of mobile devices).
  • Via: and similar HTTP headers defined by proxies relaying the user navigation. Some of these headers identify the origin user behind a non-anonymizing proxy. Frequently, they contain personal information such as an authorized user name registered on the proxy, or the IP address of the connected user, or some hardware identifier of a mobile device using a public hotspot, or some user ID associated internally by the proxy or hotspot (for example in a McDonald's restaurant or in a train station), or session identifiers generated on those proxies or hotspots and locally associated with an identified user (whose account there may persist for a long time, so the identifier will be sent again each time the same user returns to the same location to use the hotspot with the same device or same local user account). Generally these identifiers (and the full set of HTTP headers) may be requested by admins of these proxies or hotspots when they receive an alert that one of their users is using their service to abuse external sites such as Wikimedia.

There are also:

  • Cookies: but they are defined by the visited site itself and should be subject to the policy about permanent or session cookies defined by the visited wikimedia sites (this includes cookies generated once the user logs on any Wikimedia site with SUL).
  • Data collected by JavaScript (or scripted plugins such as Flash and media codecs), which can collect other capabilities of the device (such as the display resolution), or its settings, and data sent to servers by dynamic HTTP requests generated by these scripts. Some of these scripts may also send regular "ping" events to show that the user is still connected to the same page. It could even track what the user is reading specifically in the page (for example when the user interacts with it to unhide a "rolled" box, or clicks on visible tabs to see other tabs). Some browser-side scripts may also respond to servers, in response to an incoming event from the server. This allows a site to know that the user has been active for a long time on one specific page; however, this data is sent in separate background HTTP requests, which do not always go to the same site as the visited site and are logged separately on the queried server.
  • Data collected by media players for tracking the quality of connections for the delivery of streams. In some cases the media players will switch to use another stream.
  • Some media such as video and audio include timecodes that also allow the site to track which part of the media has been played by the user, and how many times. When the user pauses the media, rewinds to repeat it, or skips some parts, the media server may know it.
  • DNS resolution requests and similar "site info" requests, including those for TXT records checked by security tools, or for "finger" and "whois" info: not all of them come from an ISP; some may be sent directly to Wikimedia DNS servers by a browser plugin or by the browser itself (trying to assess the site). Some of these requests may be very user-specific if they test some aliased subdomain names within Wikimedia domains, or if they perform queries that are typically only performed by ISPs. Users may perform direct DNS requests to Wikimedia domains. In some cases the ISP may reveal information about the user for which it forwards the DNS resolution request, as part of the DNS query itself in timely reproducible patterns of events. These requests do not reach a webserver but an infrastructure server managed by Wikimedia (but possibly hosted by a third-party domain hosting provider, operating with its own data retention and privacy policies).

More generally, this data includes everything that is stored by the webserver in the server logs, and it is much more than just the IP or the URL visited with its query parameters (some webserver logs may add query parameters not present in the URL but added in POST data, which may be converted by one of the front proxies used by Wikimedia sites into GET parameters present in the URL submitted to the backend server).

Note that there are logs stored in front proxies (including the various Squid instances connected to the public IP address) and logs stored by backend webservers. There may be filters in front proxies, and front proxies may anonymize part of these requests (notably requests whose cacheable results will be delivered to multiple users).

Server logs are subject to US laws where those laws require sites in the US to retain these logs for some period of time. All these logs are also used by CheckUser admins. verdy_p (talk) 00:53, 15 January 2014 (UTC)
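
As an illustration of the header concerns listed in this comment, here is a minimal sketch of redacting those headers before a request is written to a log. It is an assumption-laden example, not a description of how Wikimedia's proxies or webservers actually log requests: the names `SENSITIVE_HEADERS`, `SENSITIVE_PREFIXES` and `redact_headers` are hypothetical, and the header list is simply the one named above, not an exhaustive set.

```python
# Headers named in the comment above as potentially identifying
# (illustrative only, not an exhaustive list).
SENSITIVE_HEADERS = {
    "referer",
    "accept-language",
    "user-agent",
    "accept",
    "via",
    "cookie",
}
# Custom header families such as X-Chrome-* are matched by prefix.
SENSITIVE_PREFIXES = ("x-chrome-",)


def redact_headers(headers):
    """Return a copy of `headers` with sensitive values replaced by a marker.

    HTTP header names are case-insensitive, so matching is done on the
    lowercased name while the original spelling is kept in the output.
    """
    redacted = {}
    for name, value in headers.items():
        key = name.lower()
        if key in SENSITIVE_HEADERS or key.startswith(SENSITIVE_PREFIXES):
            redacted[name] = "[redacted]"
        else:
            redacted[name] = value
    return redacted
```

Whether redaction, aggregation, or time-limited retention is the right treatment for each header is a policy question for the guidelines; the sketch only shows the mechanical step.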

Is there any possibility of reasonable IAR on this guideline?

I just imagined a case: a checkuser ran a check, and the results include data from a range of time, some of which will expire very soon (for example, 89-day-old data). Whether or not the data that is about to go stale is important (say, one user's last logged action was 89 days ago, that data relates to a sockpuppet issue, and later attempts to CU would likely result in "stale"), is it reasonable to keep the data beyond 90 days in order to make the issue clear? --朝鲜的轮子 (talk) 03:04, 15 January 2014 (UTC)

@朝鲜的轮子: Let me answer this in two parts, to explain:
For employees, IAR is not an option :) If an exception was made, it would be added here.
As I said above, for CUs, this generally doesn't apply. Whatever rule we do work out with the CUs will hopefully not be a rule that interferes with their work; we hope that if they are tempted to IAR, they'll discuss with us (or the broader community) first. --LVilla (WMF) (talk) 20:51, 22 January 2014 (UTC)