
A summary of ‘Data Governance in Open Source AI: Enabling Responsible and Systemic Access’

This resource was published on Sunday 9 February 2025.

Defining open AI is more difficult than other open pursuits. Through efforts to define it, questions arise as to whether an AI system needs to be trained on open data to be deemed open AI, and whether current practices in open data adequately accommodate the needs of all data stakeholders.

TL;DR

Even as the open movement seeks to define open AI, there is no agreed idea of what ‘good’ open AI is. The use of open data as the training data for open AI is only recommended, not required. Mandating open data is likely to limit the development of open AI, but without such a condition open AI becomes a wide spectrum in which some, but not necessarily all, elements of an AI system are open.

In the white paper, Alek Tarkowski draws out the need for open data to move beyond a basic objective of ‘as open as possible’ to a more nuanced data governance approach. The classic pursuit of open data fails to accommodate the different needs of other stakeholders related to the data set. To rectify this, more than copyright licensing needs to be considered.

What

A white paper looking at the need to establish shared data governance practices in open data

Who

Author: Alek Tarkowski
Publisher: Open Future

When

Published: Friday 24 January 2025

Where

Online: on the Open Future website

Why

To encourage open advocates to strike a better balance between open data and responsible data governance

How

Suggesting areas where open data practitioners can improve their practices

What else

The paper suggests a need for two paradigm shifts to make open data more responsible and six focus areas where the open movement can take action

Estimated reading time: 10 minutes

WTF is this?!

This is a summary of the Open Future whitepaper titled Data Governance in Open Source AI: Enabling Responsible and Systemic Access by Alek Tarkowski, published by Open Future in partnership with the Open Source Initiative, on Friday 24 January 2025.


Summary

Defining open source AI or open AI is a difficult task given how many elements are involved in developing an AI system. While data is a critical element, it is not the only one. How many elements must be open for a system to be open AI? Is a purist view that sees AI as open only when every element is open even possible? Can open AI be responsible and ethical if it is trained on open data that did not respond to the broad needs of stakeholders related to the data?

These questions sit behind Tarkowski’s white paper, which looks at the need for changed data governance practices in open data communities, particularly through the lens of open AI development.


Open and closed AI

Many of the AI systems that exist are proprietary in nature. We have seen this recently with OpenAI investigating whether DeepSeek used ‘its’ data to train their model. While alternatives exist that could broadly fit under an ‘open’ label, they take a diversity of approaches. (Tarkowski provides a solid overview of the spectrum of ‘openness’ in AI development in the white paper.) There is no agreed definition of what ‘good’ open source AI is.

That challenge has been taken up by a number of initiatives: Linux Foundation’s Model Openness Framework (MOF), the Digital Public Goods Alliance’s standard for AI as a digital public good, Mozilla’s Convening on openness in Artificial Intelligence, the Open Source Initiative’s Open Source AI Definition (OSAID) (released in October 2024) and others. The crux of the OSAID is that for AI to be ‘open’ it should provide other parties with freedoms equivalent to those of open source software.


Should open data be used for AI to be ‘open’?

Because data is viewed as one of the key components of AI models or AI systems, “… considerations of openness of AI models must address the gnarly question of the openness of data: to what extent and in what way does openness of the training data (or lack thereof) determine the openness of the overall AI system?” While consensus seems to be emerging on the transparency of data, the openness of the datasets themselves remains unresolved.

On one hand, a definition of open AI that makes open training data a ‘nice to have’ seemingly allows for ‘closed’ components within an ‘open’ AI system. If the data in an ‘open’ AI is not available under an open content licence or otherwise publicly available, important aspects of ‘openness’ are compromised. Without reusability of the underlying training data, desirable ‘open’ activities such as auditing, verification and replication of the model are not possible. On the other hand, the complexity of AI systems and the practicalities that limit the openness of data may make such a definition of open AI almost unattainable. And the ever-present risk of open washing persists.

A concurrent issue is the ethics of mass-scale data extraction to train the Big AI platforms. There are many perspectives on web-scraping the open internet to use as training data for AI: it “is often seen as not properly managed: conducted in ways that are perceived in some cases as outright unlawful and in others as at least morally questionable, not in line with research ethics or unjust.” Regardless of your views on scraping the public web to train AI, as Tarkowski points out, if you want to train AI on “fully open and transparent datasets” that are either in the Public Domain or openly licensed, you are at a disadvantage because of the comparatively small volume of open data available. And even within the body of open data that does exist, much of it will not be the type of data your AI system needs.

So it seems an aspiration to ‘do the right thing’ while also aspiring to a gold standard of open AI could be a double-edged sword. This quote from the white paper sums up the paradox well:

The importance of data transparency and access to data, even if the latter is contested as a requirement for Open Source AI systems, signals the need for not just more data to be shared, but also for better data governance.

Such data governance also needs to navigate the risk that open data generated by and for communities could be opportunistically exploited by powerful third parties. Hence, many community driven AI data collections end up in a dilemma: Protecting openness and respecting the data rights of marginalized communities will limit the general ability to grow a global pool of Open Source AI. Paradoxically, by trying to avoid the freeriding of some, everyone might end up with less Open Source AI which can be used by anyone, including vulnerable and marginalized communities.


Open data is not exempt from ethical concerns

At the heart of the white paper is the reality that open AI cannot compete with, or address the adverse concentrations of power of, Big AI without data that open AI systems can be trained on. The paper also argues that better data governance, and the acknowledgement that it should be the foundation of good AI governance, needs to be advanced further. Emphasis should not be on the quantum of data available to train AI, but on “the quality of the data and specific governance mechanisms that ensure that data is shared in ways that are equitable, sustainable and protected from value extraction”. Other mechanisms beyond open licensing should also be considered. As an illustrative example, Tarkowski points to non-copyright-based preference signaling, such as opt-outs that indicate whether data can be used for AI training.

Although the notion of Indigenous Data Sovereignty (IDSov) is not explicitly mentioned in the white paper, Tarkowski’s conceptualisation of good open data sharing leaves room for it: such efforts should not simply aim to release as much data as possible, as openly as possible, but should instead take as a starting point “proper data preparation, data governance frameworks and stewardship functions”. This accommodates IDSov and First Nations Indigenous Cultural and Intellectual Property (ICIP) principles and practices, as does elevating non-copyright mechanisms.


Moving open data to a more ethical approach

To take open AI forward, Tarkowski suggests two paradigm shifts. The first is the adoption of a ‘data commons’ approach that moves beyond basic open data methods, which fall short of preventing data exploitation, to “robust commons-based governance models.” While it is not totally clear what Tarkowski means, the white paper does say that this would “… result in an acknowledgment of a gradient of data sharing approaches, where open data is the optimum on one side of the spectrum, with other data sharing approaches — suited for cases where open sharing is not desirable or attainable — on the other side.” The paper envisages innovative data licensing models and management approaches such as data trusts and cooperatives.

The second paradigm shift put forward in the white paper suggests that open AI look beyond “… solely meeting AI development needs to a broader view of data sharing that serves the needs and objectives of a broader set of stakeholders.” Recognising and understanding the various needs and goals of others with a stake in data is necessary to successfully open up new sources of data for sharing.


Focus areas for open AI

In the white paper, Tarkowski also sets out six focus areas for open AI. They are:

  • Data preparation and provenance: Establishing robust standards for data collection, classification, anonymization, and metadata to ensure quality and traceability.
  • Preference signaling and licensing: Developing mechanisms like opt-out frameworks and social licenses to allow rights holders and communities to control data use.
  • Data stewards and custodians: Strengthening roles for data stewardship, including intermediary institutions that facilitate data sharing while ensuring ethical governance.
  • Environmental sustainability: Promoting practices that reduce the environmental impact of AI through shared datasets and efficient training methods.
  • Reciprocity and compensation: Implementing mechanisms that ensure value generated from shared data is equitably distributed, particularly to marginalized communities.
  • Policy interventions: Advocating for public policies that mandate data transparency, incentivize data sharing, and support the creation of open datasets.

Here are some links I recommend related to Alek Tarkowski’s open data and open AI white paper:

Data Governance in Open Source AI: Enabling Responsible and Systemic Access [PDF]

Data Governance in Open Source AI: Enabling Responsible and Systemic Access

The publication page on the Open Future website announcing the white paper. It includes an overview of the white paper.


Was this free resource helpful?

If so, I encourage you to please show your support through a small contribution – it all helps me keep creating free arts marketing content.

Disclosure

AI use

This resource was drafted using Google Docs. No part of the text of this resource was generated using AI. The original text was not modified or improved using AI. No text suggested by AI was incorporated. If spelling or grammar corrections were suggested by AI they were accepted or rejected based on my discretion (however, sometimes spelling, grammar and corrections of typos may have occurred automatically in Google Docs).

I used Gemini in Google Workspace to summarise the text of this resource, however the summary (see TL;DR) does not duplicate any of the AI-generated text. Rather, it was used to help me gather my thoughts on the most important parts of the text to include in a summary.


Provenance

This resource was produced by Elliott Bledsoe from Agentry, an arts marketing micro-consultancy. It was first published on Sunday 9 February 2025. It has not been updated since it was first published. This is version 1.0. Questions, comments and corrections are welcome – get in touch any time.


Reuse

Good ideas shouldn’t be kept to yourself. I believe in the power of open access to information and creativity and a thriving commons of shared knowledge and culture. That’s why this resource is licensed for reuse under a Creative Commons licence.

A bright green version of the Creative Commons brand icon. It is two lowercase letter Cs styled similar to the global symbol for copyright but with a second C. Like the C in the copyright symbol, the two Cs are enclosed in a circle.

Unless otherwise stated or indicated, this resource – A summary of ‘Data Governance in Open Source AI: Enabling Responsible and Systemic Access’ – is licensed under the terms of a Creative Commons Attribution 4.0 International licence (CC BY 4.0). Please attribute Elliott Bledsoe as the original creator. View the full copyright licensing information for clarification.

Under the licence, you are free to copy, share and adapt this resource, or any modified version you create from it, even commercially, as long as you give credit to Elliott Bledsoe as the original creator. So please make use of this resource as you see fit.

