
Dear ELP,

I would like to create a free-use chatbot for an endangered language, which I am not a speaker of, but will be making in collaboration with native speakers. Are there considerations I should be aware of to make the chatbot more ethical? 

-Ethical AI

 

“Should AI be used with Indigenous languages at all? And the answer is only with great care. And if it can't be done right it shouldn't be done at all.” -Danielle Boyer (Anishinaabe)

 

Dear Ethical AI,

This is a great question – and a really important one, as tools broadly called “AI” begin to permeate more communities, languages, and parts of life. It’s also a complicated question, so this answer will be pretty long. I hope it can be useful to you.

Disclaimer: I’ll start by saying I’m not a machine learning specialist, a computer scientist, or someone who works directly with large language models (LLMs, like ChatGPT and other chatbots) or other AI tools. I’m just a non-Indigenous linguist who aspires to do ethical work in partnership with Indigenous people and communities, and who tries to stay informed about issues related to AI and Indigenous languages. I’ve had the good fortune to talk about this issue with a lot of smart people, including many Indigenous data scientists and linguists, but my thoughts are also filtered through my own perspectives. So this answer won’t have a ton of technical depth, but hopefully it will be useful in sparking some thoughts and conversations about ethical issues.

I'm going to suggest a few questions below that might serve as a guide as you think through these ethical issues around using AI in language work.

If you’re not a member of an Indigenous or endangered-language community, you might want to ask yourself and/or your team these questions. (For anyone else reading this who is a member of an Indigenous or endangered-language community considering developing AI tools, you might consider asking the following questions to anyone who is proposing to work with you.)

These questions build on the ideas presented in A Linguist’s Code of Conduct, so I’d recommend checking out that document before you start, too. 

 

1. Why are you developing this tool?

This is actually three questions in one. The first one is Why as in, “Who asked you to do this? Did the idea or request come from the community? Or is this an idea you came up with yourself, and brought to the community to ask if they would accept it?” Projects are much harder to do ethically when you only approach the community after you’ve already decided what you plan on doing. Ideally, ideas for new projects should spring from an already-developed collaborative relationship with the community. Think “what would y’all find useful, and can I offer any assistance?” rather than “can we do this thing I came up with?”

The second why is “what’s the purpose of this chatbot?” What is it actually useful for? How is this specific tool going to support the community’s aspirations for their language? What will it realistically accomplish in relation to the community’s language revitalization efforts and goals? Have other communities created or used similar tools (or considered it), and what was their experience like? Sometimes, people can see technology tools as a “magic bullet” – something that will solve all their problems at once. Sometimes, people want specific technology tools because they’re popular, or hyped up, or because other communities (who may be in very different language situations) have them. Unfortunately, this sometimes leads to people building tools that aren’t actually very useful for their specific language situation. You don’t want to be in a position where a community spends lots and lots of time, effort, and resources building a tool that won’t realistically help accomplish their goals.

The third why is “what’s your motive?” This is a challenging question that may require a lot of introspection. If it’s for your own benefit (career advancement/portfolio, recognition, profit, etc.), have you made sure that the benefits to the community will equal or exceed what you’re getting out of it? And, in the case of AI tools, are those benefits things that the community has asked for? Or are you making assumptions about what will benefit them and what they need? How will your project build community capacity in technology, like mentoring others and teaching your own technical skills, or contributing to/funding training and education for community technologists? Throughout all your work, you need to be explicit, transparent, and honest about why you’re doing this work, what you’re going to get out of it, and what the community will gain from this project. This isn’t “just a chatbot” – you’re working with language, which is integral to the health and well-being of the whole community. How will your work fit into a broader ethical approach that takes this into account?

If you and the community are clear on why you’re doing this, you have established collaborative relationships with the community, you’re sure that the community actively wants this project to happen, and you’ve got a plan to ensure that benefits accrue equally or primarily to the community, then it’s time to think about how the project is going to unfold.

 

2. How are you ensuring that Indigenous data sovereignty is upheld in this project?

Data sovereignty is a complex set of ideas, but in short: data sovereignty means that data produced by Indigenous Peoples belongs to Indigenous Peoples. This includes data related to language. I recommend doing lots of reading about this concept if you’re going to be working with Indigenous or endangered languages. Animikii has some useful ebooks for general audiences, and there are lots of more academic writings about it too – check out the “further reading” section at the end of this response.

Any ethical AI project should have a strong data sovereignty framework laid out from the very beginning. This includes key points like:

  • Where are you getting the language data to train the LLM or other tool? Did the source you’re getting it from gather it ethically? A lot of language data out there, especially bundled in big datasets, was acquired unethically or even illegally (there are plenty of examples). 
  • Do you actually have permission to use that data to train the LLM? There are often major differences between legal frameworks and Indigenous protocols around intellectual property – language data that’s “open source” or “licensable” under the laws of your country may actually have been shared in violation of that community’s protocols and laws. Have thorough and ongoing discussions about this with your partners in the community, and commit to only using data that you have all the needed permissions for – whatever that looks like in your context. 
  • What are the protocols for making decisions about the language, and who has authority within the community to make these choices? Who has the authority to change or withdraw permissions later? It’s rarely as simple as “everyone in the community agrees about everything related to the language”, so you’ll need to be very clear about who has the authority to approve this project, make decisions related to it, and so on. You may encounter some difficult or complicated choices. Go slowly and carefully, and have lots of conversations with your partners in the community.
  • Who will own and control the training data set, as well as the model you train and the user interface to access it? The First Nations Information Governance Centre (Canada)’s OCAP model – Ownership, Control, Access, and Possession – is a useful framework for thinking through this issue. So are the CARE Principles for Indigenous Data Governance. (Needless to say, the answer I’d call “ethical” is one that begins and ends with “the community [or a person or organization within the community, depending on your context] owns and controls the data, the model, and the codebase powering the tool”.)
  • How is the data being protected from unauthorized use or theft? Who do you want this chatbot to be accessible to, and if the answer isn’t “everyone”, how will you make sure it’s only accessible to the appropriate people? Are there things the chatbot shouldn’t be able to disclose because they are secret, private, or sacred? How will you ensure that you have these conversations with the community, and that all the people who need to be part of those ongoing conversations are included? To what degree are you able to prevent large commercial LLMs (which generally do not uphold Indigenous data sovereignty, and may be used to directly harm the community) from scraping your data? (Nothing online is truly impossible to steal if a bad actor wants to – there are really no “100% safe” options when it comes to putting materials online, but there are options that are “safer”.)
  • Similarly, are there things the community/you absolutely do not want the chatbot to be usable for? What are those things? Can you build them into the guardrails on the chatbot (the way commercial LLMs aren’t supposed to tell you how to do harmful things, even if they sometimes do), or the platform’s terms of service (so you could have legal recourse if someone does those things)?

    • Te Hiku Media’s Kaitiakitanga License for their Kaituhi Māori speech tools is an interesting model of a license built around Indigenous understandings and protocols around language and knowledge; it forbids uses like surveillance, discrimination, and mining Māori data.
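To make the guardrails idea above a bit more concrete, here is a minimal sketch in Python of a post-processing filter that checks a model’s response against community-defined restrictions before it reaches the user. Everything in it is a placeholder: the restricted terms, the refusal message, and the function name are all hypothetical, a real restriction list would have to come from the community’s own protocols and decision-makers, and simple keyword matching is only a first line of defense, not a complete guardrail system.

```python
# Minimal sketch of output guardrails for a community chatbot.
# The terms below are placeholders; the real list must come from
# the community's own protocols and authorized decision-makers.

RESTRICTED_TERMS = {
    "sacred-term-1",   # hypothetical: knowledge the community marked as private
    "sacred-term-2",
}

REFUSAL_MESSAGE = (
    "I can't share that. Please ask a knowledge keeper in the community."
)

def apply_guardrails(model_output: str) -> str:
    """Check a model response against community-defined restrictions
    before it is shown to the user; refuse if any restricted term appears."""
    lowered = model_output.lower()
    for term in RESTRICTED_TERMS:
        if term in lowered:
            return REFUSAL_MESSAGE
    return model_output
```

A filter like this sits alongside, not instead of, the other layers mentioned above: instructions built into the model itself, access controls on who can use the tool at all, and terms of service that give you recourse if someone misuses it.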

     

3. Who’s involved in the training and testing of this model? What does the validation (testing) process look like?

When you’re working with languages facing some degree of endangerment, there are greater risks of harm from a bad AI model. That makes it extra important to have robust and ongoing validation (testing) of any tools you develop.

AI tools frequently give wrong answers that sound convincing – this well-known phenomenon is sometimes called “hallucination”. When ChatGPT confidently tells you there are two Rs in “strawberry,” most English speakers can immediately go “no there aren’t!” But when you’re a new learner asking a chatbot how to say a sentence in your ancestral language, you might not have any speakers around to correct you if the chatbot gives you a wrong or made-up answer. In situations where every learner is crucial to the future of the language, feeding learners misinformation can do immense damage.

So it’s incredibly important to have fluent speakers closely involved in testing and validating your model. (Again, this may look different depending on your context – maybe there’s an official language committee or authority, an Elder who leads language efforts, a community scholar, a group of strong L2 speakers, etc.) Make sure your model goes through iterative testing (many repeated cycles that build on each other) with people who know the language very well.

And then, keep testing. Do it iteratively, in an ongoing way. As your model changes and incorporates new data (through retraining, fine-tuning, or updates to its materials), ensure there’s a regular process for validating its outputs. Like all ethical, collaborative language work with community, this is going to be a long-term process.
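One lightweight way to structure that ongoing validation is a regression-style check: keep a test set of prompts paired with speaker-approved answers, and re-run it every time the model changes. The sketch below assumes such a test set exists; the `chatbot` callable, the toy stand-in model, and the placeholder answers are all hypothetical.

```python
# Sketch of a regression-style validation pass over a speaker-approved
# test set. `chatbot` stands in for however you actually query your model.

def validate(chatbot, test_set):
    """Run every prompt through the chatbot and collect answers that
    don't match a speaker-approved form, for fluent speakers to review."""
    failures = []
    for prompt, approved_answers in test_set:
        answer = chatbot(prompt)
        if answer not in approved_answers:
            failures.append((prompt, answer))
    return failures

# Toy example: one prompt with two speaker-approved forms.
test_set = [
    ("How do you say 'hello'?", {"<approved form 1>", "<approved form 2>"}),
]

def toy_bot(prompt):
    return "<approved form 1>"  # stand-in for a real model call

# validate(toy_bot, test_set) returns [] here, since the answer is approved.
```

Exact-match checking like this only catches regressions against known-good answers; it’s a floor, not a ceiling. It doesn’t replace fluent speakers reviewing novel outputs in each testing cycle, which is where most real problems will surface.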


4. Who’s accountable for this tool, at the end of the day? How will you address problems or respond to harms? Who will maintain it in the long term?

Chatbots can produce unpredictable outputs. They may even give responses that can harm people. To build one ethically, you have to sort of become the anti-Dr. Frankenstein: you have to accept that the creator of the thing is responsible for what the thing does.

So, do you have a plan for how you’ll address any problems the chatbot causes? What if it tells a kid to say something really offensive to their grandparents or teachers? What if it makes up fake words? What if it discloses private knowledge that it shouldn’t be able to publish? At the end of the day, who will take accountability for this chatbot? Build a really clear plan with your collaborators in the community. “Eh, we’ll figure it out if it happens” isn’t a great plan – in fact, it’s not a plan at all. Have a framework in place for who’ll deal with problems, according to what ethical principles, how repairs will be made, etc. – and ground this framework in the community’s ways of doing and knowing things.

Finally, when developing any technology tool (chatbot or otherwise), you also need to consider who’s going to be responsible for the tool’s upkeep in the long term. Throughout the development process described above, and into the future, who’s going to be involved in updating, improving, or maintaining the tool? If it’s you, what are the agreed-upon terms of that commitment? If it’s other people within the community, does everyone understand the knowledge, skills, and resources required to maintain it? It’s normal for technical tools to degrade over time, and sometimes people choose to let a tool be deprecated (gradually retired as it stops working) if it’s no longer useful or has been otherwise replaced. However, that needs to be an informed choice. It’s important that everyone is aware of this reality, and doesn’t expect the tool to just keep working forever on its own – especially if the tool is being developed for profit. For more guidance on developing language technologies in general, Check Before You Tech is a great place to start.

 

Final thoughts

This answer may make it sound really challenging to build an Indigenous language chatbot ethically – and that’s on purpose. It is challenging! As my colleague Amanda Holmes and I say in A Linguist’s Code of Conduct, this isn’t work to be undertaken lightly. It’s a big commitment. But it can also be important, meaningful, and beneficial if done well.

If you’d like to talk more about any of these issues, you can always book a free appointment with the Language Revitalization Mentors – Yulha is an especially good person to talk to about anything related to computational linguistics! We hope you’ll find a path to build something ethical, collaborative, and really useful to the community you’re working with. 
 

 

Further reading:

 

Indigenous-led organizations working in data sovereignty and language technology: 

Source URL: https://www.endangeredlanguages.com/story/ask-elp-ethical-ai