Big Data is the term used to describe the commercial aggregation, mining, and analysis of very large, complex and unstructured datasets based on social media and Web-enabled workloads.
Platforms like Google and LinkedIn have created a revenue stream from the data they collect. Everything from your Facebook posts and tweets to your late-night shopping habits is analyzed, and the results supply businesses, individuals and public institutions with insights into the behavioral patterns of target audiences.
Raw digital data is a resource many companies are turning to in their quest for market advantage.
Dubbed “the new oil” by the World Economic Forum, big data can improve decision making, shorten the time it takes to bring a product to market and increase profits. But it also raises significant risks, ranging from disastrous data breaches to concerns about privacy and compliance.
Its place in the legal landscape is still being defined. And it’s in your best interests to keep abreast of these developments.
The NSA Affair
Big data made global news in 2013 with the revelations of fugitive whistle-blower Edward Snowden concerning the US National Security Agency (NSA). Leaked documents confirmed that the Agency had been gathering sensitive information (including email and phone transcripts) on private individuals and public figures. The purposes of this data collection remain unspecified.
Privacy advocates have pressed for legal action. And, after the NSA’s loss in the US District Court for the District of Columbia, the presiding judge made this comment on the agency’s use of Big Data:
“The threshold issue… is whether plaintiffs have a reasonable expectation of privacy that is violated when the Government indiscriminately collects their telephony metadata along with the metadata of hundreds of millions of other citizens without any particularized suspicion of wrongdoing, retains all of that metadata for five years, and then queries, analyses, and investigates that data without prior judicial approval of the investigative targets.”
The court’s analysis raises vital issues, namely:
- How data can be used to discern other data
- How data is (or can be) used
- Where it comes from in the first place
The Mechanics of Big Data
Gathering information is a costly exercise that requires extensive storage facilities and a solid privacy compliance system.
The legal framework regulating the big data business model is based on existing principles of intellectual property, confidentiality, contract and data protection law.
Under English law, a data set can be protected by the Database Right. The right belongs to the person who takes the initiative in “obtaining, verifying and presenting the content of a database, while assuming the risks involved in doing this.” It is an automatic right of ownership and should be respected.
The Database Right may also cover a database that has been substantially formed by collecting data from various other databases. This is one argument put forward to support claims that the NSA breached copyright laws.
Implications for IT Management
The business of Big Data changes the function of the IT department. There’s less emphasis on technology, and more on information architecture.
Rather than building a universal database, the conventional thinking is to create domain-specific databases that solve domain-specific problems.
In effect, the data sits in a data warehouse, and it’s the responsibility of IT to get the HCI (human-computer interaction) part right.
With cloud infrastructure, software as a service and applications moving outside the organization, the role of IT changes even more. It’s no longer a case of just managing servers, databases, or applications (beyond some security administration).
Instead of traditional IT managers, the trend is towards data scientists with IT capabilities who can work with big data technologies. Professionals with a solid knowledge of data architectures, data quality, and mastery of data management hubs are highly sought after.
Overall, a business should do a proactive job of telling consumers what is known about them, what is done with that information, and why it’s done.
Intellectual Property
In November 2013, Google won a legal case after a suit was brought against the company in relation to book-scanning.
Briefly, Google scanned and indexed books and gave the public access to the indexes. Copyright had expired on some of the indexed books, but allegedly not on all of them. The copyright holders sued Google for infringement of copyright on the scanned and indexed books.
Google’s defense to the infringement claims was that its actions constituted fair use and thus could not be considered infringing.
A business using big data must be clear on the extent to which it can re-use this information. It is wise to address these issues as far up the “data chain” as possible.
In addition, it’s worth considering these factors (which are enshrined in the US fair use statute):
(1) The purpose and character of the use, including whether it is of a commercial nature, or for non-profit educational purposes;
(2) The nature of the copyrighted work;
(3) The amount and substantiality of the portion used in relation to the copyrighted work as a whole; and
(4) The effect of the use on the potential market for or value of the copyrighted work.
The source of the data set needs to be investigated. Unless specifically released as open data, most data sets will be subject to some controls in relation to their use. Any license terms on which data is supplied must be identified, and used to engineer protection in the form of warranties in the agreement with the data owner.
Warranties will also be required in relation to ownership of and “freedom to use” the data. This will avoid disputes arising from an infringement of intellectual property rights or a breach of confidentiality.
Nondisclosure
At the outset of a big data transaction between companies, it’s advisable to enter into a nondisclosure agreement.
Each nondisclosure agreement should include a provision that disclosed information will be considered confidential only if it is:
1. Marked as confidential at the time of disclosure; or
2. If unmarked at the time of disclosure, treated as confidential at the time of disclosure and subsequently identified in writing as confidential, or of a kind the receiving party knows, or should reasonably be expected to know, is confidential to the disclosing party.
In a case decided in 2013, Convolve, Inc. had entered into a nondisclosure agreement with Compaq, and a similar agreement with Seagate.
Having made disclosures of confidential information without meeting any of the conditions set out above, Convolve sued to keep the information it disclosed confidential. The court ruled that the information was no longer a Convolve trade secret, a decision that cost Convolve a great deal of money.
Sometimes, the people reading or drawing up a nondisclosure agreement don’t communicate its requirements to the people making the disclosures. There may also be limitations of liability to consider. The unthinking use of form agreements is another potential pitfall.
Recruitment and Vetting
In the USA, Wichita State University is using Big Data analytics in its recruiting and admissions program. IBM is using Wichita State’s Big Data program as a case study.
A quote in the IBM White Paper, attributed to David Wright, Assistant Vice-President for Strategic Planning and Business Intelligence, states: “Ultimately, [business analytics] predicts the chances of success for potential students, enabling marketing teams to focus on high-quality applicants.”
Big data is a predictive tool for enhancing academic standards and financial stability.
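To make the idea concrete, here is a minimal sketch of the kind of predictive scoring such a program might use. The features, data and labels below are invented for illustration; neither Wichita State’s actual model nor IBM’s tooling is described in this level of detail.

```python
# Minimal sketch of admissions-style predictive scoring.
# All feature names, records and outcome labels are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical historical applicants: [high-school GPA, test percentile, campus visits]
X_train = np.array([
    [3.9, 92, 2],
    [2.4, 40, 0],
    [3.5, 75, 1],
    [2.9, 55, 0],
    [3.7, 88, 3],
    [2.2, 35, 1],
])
# 1 = enrolled and completed the first year, 0 = did not (invented outcomes)
y_train = np.array([1, 0, 1, 0, 1, 0])

model = LogisticRegression()
model.fit(X_train, y_train)

# Score a new applicant pool by predicted probability of "success"
applicants = np.array([[3.6, 80, 1], [2.8, 50, 0]])
probabilities = model.predict_proba(applicants)[:, 1]
for features, p in zip(applicants, probabilities):
    print(f"Applicant {features}: predicted success probability {p:.2f}")
```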
Fine, so far as it goes.
But organizations using big data in their admissions practices, or to develop those practices, will need to understand how its use affects them. They will then need to consider whether they are accurately communicating their admissions standards and practices to applicants.
Codes of Ethics
A set of standards, voluntarily imposed to temper the increasing power that IT professionals now have, has been devised with big data in mind.
Google’s “Don’t be evil” ethos, famously endorsed by Eric Schmidt, isn’t one of those standards. At least, it’s not a very workable one.
A better model might be this passage from the Association for Computing Machinery’s Code of Ethics and Professional Conduct:
“Harm” means injury or negative consequences, such as undesirable loss of information, loss of property, property damage, or unwanted environmental impacts. This principle prohibits use of computing technology in ways that result in harm to any of the following: users, the general public, employees, and employers.
Harmful actions include intentional destruction or modification of files and programs leading to serious loss of resources or unnecessary expenditure of human resources such as the time and effort required to purge systems of “computer viruses.”
The ICO Guidelines: Anonymization
The Data Protection Act controls how organizations use ‘personal data’, which is defined as any information that allows individuals to be identified.
Anonymization is the process of turning data into a form that doesn’t identify individuals, and where identification isn’t likely to take place. This avoids having to tackle problems concerning the data subject’s consent. It also allows for a much wider use of the information.
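As a rough illustration, here is a minimal Python sketch of the kinds of transformations involved: dropping or hashing direct identifiers and coarsening quasi-identifiers. The record layout and field names are hypothetical, and salted hashing on its own is closer to pseudonymization; meeting the ICO’s standard in practice also means assessing re-identification risk across the whole data set.

```python
# Illustrative sketch only: field names and records are hypothetical.
import hashlib
import secrets

records = [
    {"name": "Alice Smith", "email": "alice@example.com", "age": 34, "postcode": "M1 4BT",  "spend": 120.50},
    {"name": "Bob Jones",   "email": "bob@example.com",   "age": 58, "postcode": "SO17 1BJ", "spend": 75.00},
]

SALT = secrets.token_hex(16)  # random per-run salt; discard it if linkage back to individuals must be prevented

def pseudonymize(value: str) -> str:
    """Replace a direct identifier with a salted one-way hash."""
    return hashlib.sha256((SALT + value).encode("utf-8")).hexdigest()[:12]

def generalize_age(age: int) -> str:
    """Coarsen an exact age into a ten-year band to reduce re-identification risk."""
    low = (age // 10) * 10
    return f"{low}-{low + 9}"

def anonymize(record: dict) -> dict:
    return {
        "subject_id": pseudonymize(record["email"]),     # direct identifiers dropped or hashed
        "age_band": generalize_age(record["age"]),       # quasi-identifier generalized
        "postcode_area": record["postcode"].split()[0],  # keep only the outward postcode
        "spend": record["spend"],                        # non-identifying analytic value retained
    }

if __name__ == "__main__":
    for r in records:
        print(anonymize(r))
```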
Guidelines from the UK Information Commissioner’s Office (ICO) include a Code of Practice on Anonymization that’s aimed at helping companies manage risk.
The code of practice explains the issues surrounding the anonymization of personal data, and the disclosure of data once it has been anonymized. It describes the steps an organization can take to ensure that anonymization is conducted effectively while retaining useful data.
The code is useful to any organization wanting to turn personal data into anonymized information for research or other data analysis purposes.
As further help, the ICO is supporting the establishment of a network for practitioners to discuss issues relating to anonymization and share better practices.
The UK Anonymisation Network (UKAN) is coordinated by the University of Manchester, the University of Southampton, the Open Data Institute, and the Office for National Statistics.
Big Data Best Practices
Consider these key questions:
1. Can we trust our sources of Big Data?
2. What information can we collect without exposing the enterprise to legal and regulatory battles?
3. How will we protect our sources, our processes and our decisions from theft and corruption?
4. What policies are in place to ensure that employees keep stakeholder information confidential during and after employment?
5. What actions are we taking that create trends that can be exploited by our rivals?
A business should identify any license terms by which data is supplied.
This should be used to engineer protection in the form of warranties in its agreement with the data owner.
If the data collected is going to be used in a commercial service delivered to users by the business, then those customers can expect to receive assurances as well.
Even where anonymized data is supplied it’s wise to request a warranty from the supplier giving assurances that the data is fully compliant with Data Protection Act (DPA) requirements.
This should include scrutiny of the information on use of data and privacy that was given to the data subjects at the point of data collection.
Individuals should be told, in a privacy statement accessible at the point the data is collected, that their data may be used and disclosed to others in anonymized form. This is good practice, and the credibility of the data set will depend on it.
Businesses also need to multitask as they consider the consequences of using data from multiple sources.