Legal talk:Data retention guidelines: Difference between revisions

From Wikimedia Foundation Governance Wiki
(174 intermediate revisions by 61 users not shown)
{{User:LincolnBot/archiveconfig
|archive = Legal talk:Data retention guidelines/Archive %(counter)d
|algo = old(180d)
|counter = 1
|maxarchivesize = 150K
|archiveheader = {{talk archive}}
|minthreadstoarchive = 1
|minthreadsleft = 3
}}{{Archives|link1=meta:Talk:Data retention guidelines/Archives/2014}}
{{autotranslate|base=Privacy policy navbox}}

== Added exception for page views investigation ==
The Privacy team has temporarily extended the retention period for two datasets for a short period so that the Data Engineering team can investigate the impact of a data collection technical issue. Between June 4, 2021 and January 27, 2022, some of the Foundation’s caching nodes stopped collecting web traffic data (see the [[phab:T300164|Phabricator task]] for more details). This resulted in data loss for web requests and the derived pageviews, which impacts the Foundation’s ability to correctly report on the Wikimedia pageviews and fundraising banner impressions.

The Data Engineering team required a temporary short-term extension to the usual 90-day retention period in order to better estimate what data was not collected and which projects and geographies were most affected. The [[wikitech:Analytics/Data_Lake/Traffic/Pageview_actor|wmf.pageview_actor]] dataset is being used to estimate the data loss for pageviews and the [[wikitech:Analytics/Data_Lake/Traffic/Webrequest|wmf.webrequest]] dataset is being used to estimate the data loss for fundraising banners. Information from both datasets is required because webrequest data for visited banners is not reported as pageviews. Deletion of these datasets was paused on February 16, 2022 and deletion will resume by March 18, 2022.

If you have questions or concerns, please reach out to [mailto:privacy@wikimedia.org privacy@wikimedia.org]. If you are interested in a conversation meeting to discuss this exception and investigation, please sign up below and we will contact you with details. [[User:MMoss (WMF)|MMoss (WMF)]] ([[User talk:MMoss (WMF)|talk]]) 19:51, 11 March 2022 (UTC)

== Gender? ==
The examples list "email and gender in account settings" as examples of non-public data; however the account settings 'gender' property is publicly disclosed by necessity due to its purpose in producing grammatically correct strings.

Is this meant only to treat the combination of the two as private? Otherwise, we're leaking gender-by-username... --[[User:Brion VIBBER|brion]] ([[User talk:Brion VIBBER|talk]]) 21:24, 9 January 2014 (UTC)
:Someone removed it, so ... :D --[[User:Brion VIBBER|brion]] ([[User talk:Brion VIBBER|talk]]) 21:41, 9 January 2014 (UTC)
:: Michelle did, but her login is apparently still on vacation :) It was removed in response to this remark. --[[User:LVilla (WMF)|LVilla (WMF)]] ([[User talk:LVilla (WMF)|talk]]) 21:49, 9 January 2014 (UTC)


== Section 4 (Definition of personal information) ==

''Information you provide us or information we collect from you that could be used to personally identify you.'' Reads a bit strange. Maybe a full sentence would be better here? Something like "Personal information means information you provide ..." maybe? --[[User:Thogo|თოგო]] <sup><small>([[User talk:Thogo|D]])</small></sup> 22:14, 9 January 2014 (UTC)
:Hi თოგო. Thank you for your suggestion! We have adjusted the language accordingly. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 00:41, 10 January 2014 (UTC)

== Definition of "public information": the IP address really? ==

Quote:
: ''Some examples of "public information" would include: (a) your IP address, if you edit without logging in;''
This is insufficient (and it was a problem for all Wikimedia projects, which is currently being solved, because many users unfortunately made edits without being properly connected and did not notice it immediately; such disconnections have often occurred for various technical reasons and were not always prominently displayed, linking an IP address to a permanent account that was protecting the user's privacy) and in fact this statement may now be false (as it is now illegal in various jurisdictions to make IP addresses visible to the public, and the WMF could have been liable for violations, or subject to orders of termination, or to banning on some networks that must respect privacy laws, especially in the EU and the EEA, or in California where the WMF is located). You should add this clarification:
: ''(if your access has still not been anonymized from the public view)''.
IP accounts visible in the public list of users or in page histories should soon be replaced by anonymized temporary accounts. IP addresses will then only be visible to specific users who have CheckUser access rights and have contractually agreed to strictly follow its usage policy (in addition to the general Privacy policy).

We have to accept the fact that this was not the case in the past and that there exist archives elsewhere (including in old database dumps published by the WMF) where such anonymization will not be possible, as they are now out of the WMF's control. But "IP user" accounts should no longer be used and should disappear from all categories, and former links to their user pages or talk pages should be processed by some admin tool that will associate them with as many anonymized accounts as needed (respecting the "temporary period"), to avoid capturing and keeping information over overly long periods of time. Such a bot will then need the permission to create temporary accounts and to "antedate" them to match the dates found in the edit histories with which these IP user accounts were associated.

You should inform users that the WMF will make every effort to disable existing public views of such data that it hosts now, but that past public records are now out of its control and the WMF is unable to guarantee that other parties won't use the revealed IP addresses (they may be able to do that legally in various jurisdictions that do not protect privacy, especially for users not located in or connected from their own jurisdiction, who don't benefit from the local legal protection). However, the WMF should be clear that these IP addresses were anonymized from public view in order to comply with privacy laws, and that if other parties use such data, they do so at their own liability, because the WMF does not endorse or approve such use of private data (including all the data listed in these guidelines) by third parties, even if the data were published under a free licence (which mostly covers copyright and correct user attribution, but does not void any law related to the protection of privacy, independently of what the licence grants).
-- [[User:Verdy p|Verdy p]] ([[User talk:Verdy p|talk]]) 13:10, 17 June 2023 (UTC)

== Possibilities in case of breaches ==

Maybe the policy should contain information about a place where users can go if they feel that the policy was breached. --[[User:Thogo|თოგო]] <sup><small>([[User talk:Thogo|D]])</small></sup> 22:17, 9 January 2014 (UTC)

:Thank you [[User:Thogo|თოგო]] for your comment - this makes sense. What if we added this sentence to the [[Data_retention_guidelines#Ongoing_handling_of_new_information|last section of the document]] (“Ongoing handling…”):
{{quote|If you think that these guidelines have been breached, or if you have questions or comments about compliance with the guidelines, please contact us at privacy@wikimedia.org.}}
:Would that address your concern? Any suggestions on how to improve it? --[[User:JVargas (WMF)|JVargas (WMF)]] ([[User talk:JVargas (WMF)|talk]]) 00:14, 10 January 2014 (UTC)
::Hi თოგო. I've gone ahead and implemented Jorge's suggested language by adding a new section to the guidelines. Thank you for this helpful suggestion. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]])

==Who are "we"?==
Does this mean WMF? Or does this mean Wikimedia sites in general? --'''[[User:Rschen7754|Rs]][[User talk:Rschen7754|chen]][[Special:Contributions/Rschen7754|7754]]''' 23:46, 9 January 2014 (UTC)
:Thanks for the question, '''[[User:Rschen7754|Rs]][[User talk:Rschen7754|chen]][[Special:Contributions/Rschen7754|7754]]'''! Whenever you see "we" / "us" / "our" in the text, we are indeed referring to the Wikimedia Foundation, Inc., the non-profit organization that operates the Wikimedia Sites. This explanation is part of the “Definitions” section of the new Privacy Policy draft. Would it help if we added something like this to the document?
{{quote|Terms that are not defined in this document have the same meaning given to them in the [[Privacy_policy#Definitions|Privacy Policy]].}}
:--[[User:JVargas (WMF)|JVargas (WMF)]] ([[User talk:JVargas (WMF)|talk]]) 00:26, 10 January 2014 (UTC)
::Yes, that would be helpful. --'''[[User:Rschen7754|Rs]][[User talk:Rschen7754|chen]][[Special:Contributions/Rschen7754|7754]]''' 00:28, 10 January 2014 (UTC)
:::We changed the "Definition of Personal Information" section to "[[Data_retention_guidelines#Definitions|Definitions]]" in the document, and we added the above sentence at the end of it. Thanks again!--[[User:JVargas (WMF)|JVargas (WMF)]] ([[User talk:JVargas (WMF)|talk]]) 00:52, 10 January 2014 (UTC)

== Comments from //Shell ==
* <u>Introduction</u>
** ''"Data is important. It is how we can learn and grow as an organization and a movement..."'' It's not the only way to learn and grow. Is there a way to rephrase it to say that it's an (important) way to learn and grow?
**:What about simply "Data is important. It is one of the ways we can learn and grow as an organization and a movement..."? [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 00:51, 10 January 2014 (UTC)
**::This is much better! Sounds less extreme, while essentially saying the same thing. [[User:Skalman|//Shell]] 09:09, 10 January 2014 (UTC)
**:::Great, done! [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 14:14, 10 January 2014 (UTC)
** ''"for the shortest possible time that is consistent with <u>maintenance</u>, understanding, and <u>improving</u> the Wikimedia Sites, and our obligations under <u>applicable U.S.</u> law"'' This exact text is not (any longer?) in the privacy policy, though two very similar sections are there. You might want to have the two sections actually say the same thing also in the privacy policy.
**:Good catch! I have corrected this sentence. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 00:58, 10 January 2014 (UTC)
**:: Good. [[User:Skalman|//Shell]] 09:09, 10 January 2014 (UTC)
* <u>How long do we retain non-public data?</u>
** ''"After no more than 90 days..."'' I had to think twice about what it means. Would it be possible to say "After at most 90 days..."?
**:Good suggestion. I've made the change. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 01:03, 10 January 2014 (UTC)
**::Nice. [[User:Skalman|//Shell]] 09:09, 10 January 2014 (UTC)
** ''"Anonymized"'' What does this mean? Does it mean that it becomes very difficult to associate the data to a specific user, or that it's completely impossible? (Clarification: Especially for small projects, say 5 editors a normal day)
** ''"Email address in account settings: Indefinitely"'' Does this mean that if I remove or change my email address, the old address will still be kept? Is that the meaning? Is it desirable? Not sure how to rephrase it to only be about the current email address.
** ''"Non-personal information associated with a user account: Collected from user: Indefinitely"'' While the given examples seem okay, this category seems broad and that's particularly bad since the data is kept indefinitely. The given examples seem okay, since they're ''almost'' already public data (first edit, when a user has verified email, and whether the user edits through mobile are public data). E.g. the list of read articles is not public, but could be covered by this category.
** ''"Non-personal information associated with a user account: Optionally provided by a user: Logs of terms entered into the site's search box"'' I realize that "optional" here means that not every WM site visitor must search, but since it's a key part of any wiki it doesn't feel like I "optionally provided" it - I ''must'' do it to see the article I'm interested in (ignoring other search engines). No biggie, but feels a bit weird.
**:I see your point here, Shell. We weren't sure how to best phrase the differentiation between information collected from the user and information provided by the user. We're open to suggestions though if you or anyone else has one. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 01:17, 10 January 2014 (UTC)
**:: Would it be possible to remove "optionally" and just say "Provided by a user"? [[User:Skalman|//Shell]] 09:09, 10 January 2014 (UTC)
**:::I would be fine with doing that. I think we originally added "optionally" to more clearly distinguish that kind of data from data that is collected either automatically or actively by us. But obviously, if it makes it less clear rather than helping, we can remove it. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 14:32, 10 January 2014 (UTC)
**:::: I was confused about ''search terms'' being optional, since they feel necessary to use the site, while the email address is ''usually'' mandatory, but in the Wikimedia case it's optional. So, I wouldn't mind adding back "optional" to the "personal information" one, but it's more consistent not to. [[User:Skalman|//Shell]] 19:04, 10 January 2014 (UTC)
** Do you intend to have most common data in this table, in the form of examples? It would be nice to see a ''complete'' list somewhere (though that might be asking too much).
**:The table is meant to address broad categories of data so that we address the treatment of as much data as we can in these guidelines. That said, we are going to try to improve the table (and the exceptions section) with more examples over time as we refine our practices. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 01:21, 10 January 2014 (UTC)
**:: It would be nice to have as many examples as possible, so I could imagine that there was a long list in this table, but collapsed by default. [[User:Skalman|//Shell]] 09:09, 10 January 2014 (UTC)
**:::I agree. The hope is that we will gradually expand the guidelines with more examples over time. I will talk to people internally and see what additional examples (if any) we can add now though. I imagine if the table gets unwieldy, we'll experiment with formatting so that it's as easy-to-read as we can make it. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 14:20, 10 January 2014 (UTC)
**:::: Great. Since there are already examples that feel representative, it's not a big deal, but it'd be nice to eventually have an almost complete list. [[User:Skalman|//Shell]] 19:04, 10 January 2014 (UTC)
* <u>Definition of personal information</u> (good job!)
** I can think of a couple more items to put in (b), though I'm not sure if it's necessary: (current) city ''(clarification: which is different/broader than address)'', marital status, family ties
**::I added "marital and familial status" to the definition. I'm checking internally whether it makes sense to add current city. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 18:41, 10 January 2014 (UTC)
**::: I was thinking about city, since that's something you can "easily" get from an IP address, while a street address is not.
**::: Of course there's lots of other private information, but maybe it's unnecessary to add that, since I don't see how Wikimedia would get the info: income level/economic situation, level of education, profession, current job situation, hobbies/interests (though interests could be gleaned from what pages a user visits).
**::: There's also the user-agent info: OS/browser version, browser language(s), screen size etc. which websites almost never make public, but which could potentially uniquely identify a user over multiple websites[https://panopticlick.eff.org/]. [[User:Skalman|//Shell]] 19:04, 10 January 2014 (UTC)
* <u>Exceptions to these guidelines</u>: ''"Data may be retained in system backups for longer periods of time."'' Is there any restriction on how long those backups can exist? Would it be possible, for instance, to delete, aggregate, or anonymize them after at most 5 years?
* <u>Design of new systems</u>: ''"inclusion of privacy considerations in the code review process"''. Would this be added to some checklist, or is it just a general guideline?

Great to see this stuff be explicit. [[User:Skalman|//Shell]] 00:15, 10 January 2014 (UTC)
:Hi Shell! Thank you for taking the time to comment and help us improve these guidelines. Your suggestions are always helpful and greatly appreciated. We will respond in-line to your comments as we work through them. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 00:51, 10 January 2014 (UTC)

:: I've responded to your comments. (and clarified a couple of things) [[User:Skalman|//Shell]] 09:09, 10 January 2014 (UTC)

==Logs of terms entered into the site's search box==

Why exactly is this data needed at all?[[User:Geni|Geni]] ([[User talk:Geni|talk]]) 00:44, 10 January 2014 (UTC)
:Hi Geni, it's needed at least for debugging purposes. There are some searches that trigger bugs or performance problems, and we need to be able to go back and correlate searches with the bugs that get triggered. Sometimes malicious users may actually try to cause performance problems on the site via search, so we need to have the information correlated with IP addresses so that we can take action if necessary. We don't yet do much in the way of analytics on search traffic (that I'm aware of), but I could see that being of use in the future. -- [[User:RobLa-WMF|RobLa-WMF]] ([[User talk:RobLa-WMF|talk]]) 02:01, 10 January 2014 (UTC)

== Non-public vs public data ==

It might be worth expanding on what is actually meant by non-public data and what data is public by default. For example, this page talks about IP addresses for visitors, which might be confused with IP addresses for editors which are either publicly visible (where an anonymous edit is made) or kept private but presumably kept indefinitely (for user accounts). I think most of this is in the privacy policy, but it might be worth summarising here. Thanks. [[User:Mike Peel|Mike Peel]] ([[User talk:Mike Peel|talk]]) 08:26, 10 January 2014 (UTC)
:Hi Mike. Thanks for the suggestion! You are correct in that it is addressed more fully in the Privacy Policy draft, but we will draft some language to make that clear as well as give some basic examples. [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 14:10, 10 January 2014 (UTC)
:: Thanks Michelle. :-) [[User:Mike Peel|Mike Peel]] ([[User talk:Mike Peel|talk]]) 14:44, 10 January 2014 (UTC)
:::Just to let you know, we added some language. Let me know if that works! [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 18:56, 10 January 2014 (UTC)

::::I think it's much better now. The info is in the privacy policy, but you can't expect everybody to read it. [[User:Skalman|//Shell]] 19:07, 10 January 2014 (UTC)

== Donor data ==

Presumably this page isn't intended to cover donor data? It might be worth linking to [[:wmf:Donor_policy]]. Thanks. [[User:Mike Peel|Mike Peel]] ([[User talk:Mike Peel|talk]]) 08:28, 10 January 2014 (UTC)
:Hi Mike! While this document does not currently address donor data, we are hoping to eventually include retention practices in relation to donor data. These guidelines are meant to be a starting point for us and will get more detailed over time. In the meantime, I will see about getting navigational tools to other privacy-related documents (including the donor policy) added. Thanks for the suggestion! [[User:Mpaulson (WMF)|Mpaulson (WMF)]] ([[User talk:Mpaulson (WMF)|talk]]) 13:34, 10 January 2014 (UTC)

== Sampled data ==

Does the 90-day rule for IP addresses apply to sampled data? As far as I know, the main motivation behind deleting IP addresses is to prevent tracking site usage back to a specific person. If site usage data is (sufficiently) sampled, it is much harder to do this regularly. [[User:Ottomata|Ottomata]] ([[User talk:Ottomata|talk]]) 16:01, 10 January 2014 (UTC)

:It should apply. Your ISP might keep the correlation data IP address<->customer indefinitely, thus making it personally identifiable regardless of frequency. [[User:Skalman|//Shell]] 19:09, 10 January 2014 (UTC)

:Yes, the intent is that it should apply. This does probably mean there will have to be changes to certain existing setups, but as noted in the Audit section, that's a process we expect to occur gradually. —[[User:LVilla (WMF)|LVilla (WMF)]] ([[User talk:LVilla (WMF)|talk]]) 22:43, 10 January 2014 (UTC)

Latest revision as of 07:04, 14 October 2023