Developers, Choose Wisely: A Guide for Responsible Use of Machine Learning APIs
Disclaimer: I am an independent researcher @Taraaz with no affiliation with any of the companies mentioned below. Read the Persian version here.
Last month, my friend posted a story on Instagram. It was about boycotting a Unilever-made skin-lightening product from India sold under the brand name “Fair & Lovely.” The campaign’s goal? To bring attention to the larger problem of colorism in India.
I’m not Indian. But the campaign’s message resonated with me. In Iran, where I grew up, I too encountered similar “beauty” products that claimed to be able to lighten the skin of those who used them. I took them for granted when I was a kid. But these days, in the wake of the Black Lives Matter movement, there has been a moment of awakening for many people in different countries who are reflecting on racism and colorism at home.

While reading through tweets from the campaign, I began to think about the emotion behind them. How would Unilever perceive and react to these tweets? Of course, their social media team wouldn’t be able to read every single one. But perhaps they use social media analysis tools, powered by emotion recognition technologies, to get a sense of people’s demands.
I’m a researcher in technology and human rights. My job is to understand how technical designs impact human rights. I know that one of the promises of text-based emotion analysis tools is to help companies understand customer satisfaction based on social media engagement.
That’s why I decided to use the example of “Fair & Lovely” to scrutinize off-the-shelf machine learning-based emotion analysis APIs. How do these tools, which are now the norm among major brands, perform in a specific case such as this? In particular, I wondered whether the positive sentiment of the phrase “Fair & Lovely” might trick an emotion analysis tool into misclassifying a sentence’s sentiment, even when the overall sentiment of the sentence is not positive.
This question led me to write this blog post, especially for developers who use machine learning technologies as a service (MLaaS) and also for my fellow human rights practitioners who are interested in examining human rights implications of tech companies’ third-party relationships.
I’ll explain why API terms are so important to understand, and review some misuses of APIs in the past few years.
I’ll use the IBM Tone Analyzer API and the ParallelDots Text Analysis Emotion API to test their results on tweets about Unilever’s “Fair & Lovely” product. I’ll walk you through those APIs’ developer policies, terms of service, and API documentation, and show you some criteria to consider before choosing an API.
I’ll provide a set of recommendations for developers who want to use general-purpose APIs for a specific domain in a responsible manner. I’ll also provide recommendations for auditors and human rights practitioners who study companies’ third-party relationships.
So, let’s say you are a developer or a social media analyst, and you are approached by Unilever to analyze the emotion behind customers’ social media engagement. What do you do?
As a hypothetical, we will assume that you don’t have the necessary skills, data, and computation power to build a whole custom machine learning model, nor do you want to use any pre-trained model. Instead, you choose the easiest route: an off-the-shelf general-purpose emotion-analysis API.
If that’s the case, what would be your criteria to choose and use these APIs in a responsible manner?
APIs have rules — and power 🔍
First, the basics. An Application Programming Interface (API) is what helps different software applications interact with each other. It allows one application to make a request (for data or a service) and the other application to respond to it. For example, if you are a social media company and want researchers to use your data to conduct research, you give them access to that data via an API. If you want IoT devices at home to interact with each other (for example, your smart lamp reacting to events on your Google calendar), you connect those services through APIs.
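To make this concrete, here is a minimal sketch of the request/response pattern in Python. The endpoint, key, and response shape are all made up for illustration and don’t belong to any real service.

```python
import requests

# A hypothetical emotion-analysis endpoint, purely to illustrate the
# pattern: one application sends a request, the other responds.
response = requests.post(
    "https://api.example.com/v1/emotion",              # made-up endpoint
    headers={"Authorization": "Bearer YOUR_API_KEY"},  # placeholder key
    json={"text": "I love this product!"},
)
print(response.json())  # e.g. {"joy": 0.91, "anger": 0.02, ...}
```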
But as with any kind of interaction, there need to be rules between those services before they start working with each other. These rules are set by API policies, developer policies, and Terms of Service.
So far, so good. But when you give it more thought, you realize that those services make agreements with each other to provide services for you, as a user, or to handle your data, without you fully understanding how those agreements are reached.
It’s kind of bizarre, right?
Here are a couple of reminders of why this sort of thing is important.
1) You remember Cambridge Analytica and Facebook, right? (As a refresher, 87 million users’ information was “improperly” shared with Cambridge Analytica to analyze and manipulate Facebook users’ political behavior.) Long story short, the underlying cause of this privacy-invasive data-sharing practice was the abuse of Facebook’s APIs. As a result, Facebook restricted developers’ data access by changing its API policies.
2) There are also concerns when ML APIs are used as analytical services. In this case, developers are the ones who have the data and go to big tech companies’ ML APIs to process that data (MLaaS). Joy Buolamwini and Timnit Gebru’s Gender Shades study revealed significant racial and gender discrimination in several facial recognition APIs. As a result, big tech companies limited the provision of their facial recognition APIs to law enforcement agencies in the US (who knows about their business relationships with other countries, though…? 🤷🏻♀️).
But what about the responsibilities of developers who want to use tech companies’ general-purpose services? Is there any guidance to help them choose and use those ML APIs responsibly in their specific domains?
An emotion analysis API: IBM Tone Analyzer or ParallelDots Text Analysis?
As a developer, if you don’t want to build an ML system from scratch nor do you want to use a pre-trained model, the other option is to use cloud-based ML APIs. Everything is ready to go: you set up a developer account and receive API credentials, you provide input data, the service provider works its “magic,” and you receive the results as output. Easy! You don’t even need any knowledge about data science and machine learning to be able to integrate that API with your product. Or at least, this is how companies market their services.
As a developer, you have an obvious set of criteria for choosing a service: accuracy, cost, and speed. But what if you wanted to pick your ML API service based on other criteria, such as privacy, security, fairness, and transparency? What process do you go through? What do you check?
Let’s go back to the “Fair & Lovely” tweets. Putting myself in the shoes of our hypothetical developer, I collected several hundred English-language tweets about “fair & lovely” using Twint. Next, I looked at RapidAPI, a platform that helps developers manage and compare different APIs, and picked IBM Watson Tone Analyzer and ParallelDots as the best options. Both services promise to infer emotions such as fear, anger, and joy from tweets.
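For anyone following along, the collection step looked roughly like the sketch below, using Twint; the search phrase, limit, and output file are illustrative.

```python
import twint

# Search for English-language tweets mentioning the product
# and store them in a CSV for later scoring.
c = twint.Config()
c.Search = '"fair & lovely"'
c.Lang = "en"
c.Limit = 500
c.Store_csv = True
c.Output = "fair_and_lovely_tweets.csv"

twint.run.Search(c)
```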
Then I registered with both services and received API credentials for free developer accounts. IBM’s free “Lite” account provides 2500 API calls per month; ParallelDots is free for 1000 API hits/day.
Finally, I ran the experiments below. These are the result of providing my corpus of “fair & lovely” tweets as input and gathering the APIs’ output. You can see more examples in this spreadsheet.
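In code, the scoring step looks roughly like the sketch below, assuming the ibm-watson and paralleldots Python SDKs. The API keys and service URL are placeholders you would replace with your own credentials.

```python
import json

import paralleldots
from ibm_watson import ToneAnalyzerV3
from ibm_cloud_sdk_core.authenticators import IAMAuthenticator

tweet = "Unilever- cancel fair & lovely -sign the petition!"

# IBM Tone Analyzer: authenticate with your Lite-account credentials.
authenticator = IAMAuthenticator("YOUR_IBM_API_KEY")  # placeholder
tone_analyzer = ToneAnalyzerV3(version="2017-09-21", authenticator=authenticator)
tone_analyzer.set_service_url(
    "https://api.us-south.tone-analyzer.watson.cloud.ibm.com"  # placeholder region
)
ibm_result = tone_analyzer.tone(
    {"text": tweet}, content_type="application/json"
).get_result()
print(json.dumps(ibm_result, indent=2))

# ParallelDots: one call per text with the free developer key.
paralleldots.set_api_key("YOUR_PARALLELDOTS_KEY")  # placeholder
print(paralleldots.emotion(tweet))
```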
Please note the drastically different results of the two services.
I also changed the phrase “fair & lovely” to more neutral phrases such as “your product” and “this product.” The output changed. However, from a human analytical standpoint, the message and its sentiment are the same.
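This kind of substitution test is easy to script. Here is a minimal sketch (again assuming the paralleldots SDK, with a placeholder key): if semantically equivalent sentences get drastically different scores, that is a red flag for your use case.

```python
import paralleldots

paralleldots.set_api_key("YOUR_PARALLELDOTS_KEY")  # placeholder

original = "Unilever- cancel fair & lovely -sign the petition!"
neutral = original.replace("fair & lovely", "this product")

# The two sentences carry the same message; a robust model
# should score them similarly.
for text in (original, neutral):
    print(text)
    print(paralleldots.emotion(text))
```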
At this point, I wouldn’t use either of these tools for this specific case! You tell me if the sentence “Unilever- cancel fair & lovely -sign the petition!” is joyous! 🤦🏻♀️
However, let’s say our hypothetical developer still thinks there are benefits to using these tools.
In that case, we’d need to take into account the following criteria. I have to say this list is very preliminary and by no means comprehensive. But at least it gives you a sense of what to look for if you, as a developer, decide to use these tools.
Registration: privacy policies and terms of service
When you want to sign up for a developer account, always read the complete Terms of Service (ToS) and privacy policy. Crucially, this is different from a company website’s terms and policies.
In particular, be vigilant for information about how the data you provide as input is going to be handled. There is a service called Polisis that helps you compare policies from different service providers (it’s not perfect, but it is still helpful).
Read the developer privacy policy and product-specific policy to understand what data the company collects from you (as an account holder) and how they protect it. Do they encrypt the data at rest and in transit? Is the data they collect personally identifiable? Do they define what they mean by metadata? Do they collect the data you provide as input to the service? Do they retain it? For how long? Do they keep log files?
Here’s a comparison between the policies of the two services. (For the rest of this post, my comparisons between the details of IBM Tone Analyzer and ParallelDots will appear in gray text boxes like the one below, featuring summaries of what I found in their posted policies and documentation).
IBM Tone Analyzer: When you create a developer account, IBM points you to their general privacy policy, which covers everything from website visits to cloud services. It contains some vague statements, such as:
• "IBM may also share your personal information with selected partners to help us provide you ..." Who are their partners though?!
• Or "We will not retain personal information longer than necessary to fulfill the purposes for which it is processed." What is "longer than necessary?"
If you want specific information about data collection and retention via the Tone Analyzer API, go to the product documentation page. Some relevant information includes:
• "Request logging is disabled for the Tone Analyzer service.the service does not log or retain data from requests and responses."
• The service "processes but does not store users' data. Users of the Tone Analyzer service do not need to take any action to identify, protect, or delete their data for this service."
ParallelDots: The website says that ParallelDots “protects your data and follow the GDPR compliance guidelines to the last word.” But it doesn’t go further. Which data? Metadata, developers’ information, or users’ data?
• ParallelDots' ToS says "you may not access the services for purposes of monitoring their availability, performance or functionality, or for any other benchmarking or competitive purposes." This is bizarre to me; does that mean I broke their ToS?!
Documentation
If a company has already provided documentation such as a Model Card for Model Reporting for that specific API, read it before using the service. If not, good luck finding such important information! Look at the API’s documentation and dig in for information about API security, background research papers, training data, architecture and algorithms, evaluation metrics, and recommended use and not-to-use cases.
🔐 API Security. During the past few years, there have been numerous examples of data breaches via the use of insecure APIs. It’s an API provider’s responsibility to detect security vulnerabilities, identify suspicious requests, encrypt traffic, and provide traffic-monitoring methods. Make sure an API provider has already put these security practices in place, and read more about API security here.
Here’s another comparison, this time of API security:
IBM Tone Analyzer:
• IBM suggests developers use IBM Cloud Activity Tracker with LogDNA to monitor the activity of an IBM Cloud account and investigate abnormal activity.
• The service also requires a strong password and sends you a verification code to confirm your developer account.
ParallelDots: There is no information about API security on the API documentation page. However, they mention that they only provide encrypted access to premium content.
• For registration, developers are not required to set a strong password; however, ParallelDots sends a verification email to confirm your account.
🌏 Accurate and Precise… but for Whom? In the example of “Fair & Lovely,” language plays an important role. English-only tweets don’t provide an accurate understanding of discussions around the product, because the discussion is not restricted to a single language.
Check whether the API supports other languages. If so, what is the accuracy rate for each language? Service providers often say they support multiple languages but don’t break accuracy and other evaluation metrics down per language. Dig into the API documentation and background research pages and try to find metrics for different sub-categories.
In our case, here’s what I found:
IBM Tone Analyzer: The company lists 11 supported languages. However, there is no per-language breakdown of accuracy or other evaluation metrics.
ParallelDots: The company lists 14 supported languages. However, there is likewise no per-language information about accuracy or other evaluation metrics.
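Given that gap, you can probe multilingual behavior yourself: have fluent speakers translate the same message, run each version through the API, and compare the outputs. The sketch below is hedged; get_emotions() is a hypothetical wrapper around whichever service you test, and you should check the provider’s docs for how (or whether) a language code is passed.

```python
def get_emotions(text, lang_code="en"):
    """Placeholder: call the emotion API for `text` in `lang_code`
    and return a dict of emotion -> score."""
    raise NotImplementedError

# The non-English strings are placeholders for translations of the
# same message, ideally written by fluent speakers.
samples = {
    "en": "Unilever- cancel fair & lovely -sign the petition!",
    "hi": "...",  # Hindi rendering of the same message
    "fa": "...",  # Persian rendering of the same message
}
for lang, text in samples.items():
    print(lang, get_emotions(text, lang_code=lang))
```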
❗️Suggested (Not) Use Cases. Companies provide guidance about the suggested uses of their services, but sometimes a suggested use case can itself be dangerous or unethical. Companies need to be transparent about the cases in which developers should not use their services.
IBM Tone Analyzer: Tone Analyzer use cases listed on the documentation page include predicting customer satisfaction in support forums, predicting customer satisfaction in Twitter responses, predicting online dating matches, and predicting TED Talk applause. There is no indication of not-to-use cases.
ParallelDots: There are two suggested use cases: “target[ing] detractors to improve service to them” and “brand-watching.” There is no indication of not-to-use cases.
⚖️ Fairness Practices. In the past couple of years, researchers and practitioners have raised awareness about the discriminatory outcomes of machine learning systems. They’ve provided numerous toolkits to help companies assess the human rights implications of their tools and be transparent about potential social risks. I keep track of different initiatives, papers, and toolkits here.
But how many companies provide that information for their specific ML APIs?
IBM Tone Analyzer: IBM provides information about background research, the data collection process (Twitter data), and the data annotation method. However, there is no mention of potential discriminatory outcomes and no breakdown of demographics or measurements for different sub-groups (language, gender, age, etc.).
• Fun fact: IBM Research is one of the pioneers in providing fairness and explainability toolkits (check out IBM’s AI Fairness 360 and AI Explainability 360). They also proposed using FactSheets for every ML model to show the origin of training datasets, model specifications, and use cases. But when it comes to their own models, you rarely find such information on their product pages!
This reminded me of a great line of poetry from Nizami, which basically means: fix your own flaws before being too critical of others:
عیب کسان منگر و احسان خویش دیده فرو کن بگریبان خویش
(“Do not dwell on others’ faults and your own virtues; look down into your own collar.”)
ParallelDots: I found no information about fairness practices.
🛠 Maintenance and Updates
IBM Tone Analyzer: The company frequently updates the service and provides information about the updates. However, some updates contain generic sentences such as "The service was also updated for internal changes and improvements." What are those internal changes and improvements?
ParallelDots: I couldn't find information about updates and maintenance.
💬 Developer Community. Communities of developers (via Slack workspaces, Stack Overflow, GitHub, etc.) help members share feedback, interact with each other and with service providers, and raise issues around the privacy, security, fairness, and explainability of a certain product in a specific domain.
IBM Tone Analyzer: IBM Watson provides a Slack workspace (there is no dedicated channel for ethical uses, however) and a Stack Overflow developer community. The GitHub page for the Tone Analyzer is here.
ParallelDots: The company has a GitHub page.
Recommendations
To developers
Don’t use machine learning APIs blindly, especially if they are black boxes. In addition to criteria such as cost, speed, and accuracy — as marketed by a service provider — consider criteria related to fairness, privacy, security, and transparency.
If it’s not documented, reach out to service providers and ask whether they have conducted any fairness audits. It’s their responsibility to publish this information online or walk you through it. Use your buying power; they’ll listen!
Think about the domain for which you will be using the tool. Who might be affected disproportionately by the outcome of integrating a given ML API with your product? Think about gender, race, religion, age, language, accent, country, and socio-economic status (read this to learn more about vulnerable groups who are protected under human rights conventions). I keep track of different ML assessment tools here; you might find them helpful in your assessment process.
Try to find benchmark datasets that relate to discriminatory outcomes of ML projects (the Equity Evaluation Corpus is an example of a benchmark dataset used to examine biases in sentiment analysis systems; see the sketch below). Reach out to the people involved in creating such benchmarks and ask them for help scrutinizing the API in your specific domain. Check out the FAccT conference directory to find people who work on these issues.
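As a sketch of how such a benchmark can be used: run every EEC sentence through the API and compare mean scores across demographic groups. The file and column names below are assumptions; check the actual EEC release, and replace get_anger_score() with a call to whichever service you are auditing.

```python
import csv
from collections import defaultdict

def get_anger_score(text):
    """Placeholder: call the emotion API you are auditing and
    return its anger score for `text`."""
    raise NotImplementedError

# EEC sentences differ only in the demographic term, so systematic
# gaps in mean scores between groups point to bias.
scores = defaultdict(list)
with open("Equity-Evaluation-Corpus.csv", newline="") as f:
    for row in csv.DictReader(f):  # assumed columns: Sentence, Gender, Race
        scores[(row["Gender"], row["Race"])].append(
            get_anger_score(row["Sentence"])
        )

for group, vals in sorted(scores.items()):
    print(group, sum(vals) / len(vals))
```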
When you suspect something is ethically wrong with an API service in your specific domain, share it with other developers by opening an issue on that service’s GitHub page, Stack Overflow, or developer community pages. Almost all service providers have such platforms for developers to share issues. Service providers might say it is impossible to test and audit their tools for every single domain because their service is a general-purpose tool. But you can inform them about the ethical issues you face within your specific domain and use cases. By making this information public, you can also help other developers who might want to use that service!
If you integrate third-party ML APIs into your product, mention it in your product’s privacy policy and terms of service. Don’t minimize it to a sentence saying “we use other parties’ services.” Include information about those third-party services, in this case an ML service provider. Be transparent about how users’ data are handled as a result of that specific third-party relationship.
To Machine Learning API Service Providers
The focus of this post was not on service providers but on third-party developers. However, to highlight some of the service providers’ responsibilities when it comes to informing developers, I would say:
Document and be transparent! Don’t bury fairness criteria in a 500-page document. Use more visible and friendly user interfaces to guide developers to read about fairness and privacy criteria before signing up to use your service.
Surface issues related to the fairness, security, and privacy of your own API services in your developer portals and community pages. Let developers discuss these issues within those portals (e.g., a dedicated Slack channel within the developer workspace) and encourage developers to share their experiences dealing with fairness, privacy, and security while using your services (the IBM 360 Slack channel and Salesforce UI warnings are good examples). Don’t only showcase “successful” uses and positive testimonials on your marketplace page!
Each tier of developer account (free, standard, premium) brings different levels of responsibility for you. Develop privacy-protective practices to monitor potential misuses of your services. This paper offers some feasible solutions: Monitoring Misuse for Accountable ‘Artificial Intelligence as a Service.’
To ML Auditors and Human Rights & Technology Practitioners
We hear a lot about democratizing the building blocks of digital technologies, and a lot about the interoperability of digital services. These are good things. But they bring new kinds of interactions, data flows, and data-ownership questions.
The purpose of this blog post has been to raise awareness about the importance of these often-overlooked relationships and actors. It’s for developers, to think about their responsibilities before integrating these APIs into their services. But it’s also for human rights practitioners, privacy advocates, and ethical tech researchers, to dissect these issues and find practical guidance that helps smaller actors in our data-driven world.
Scrutinize third-party relationships when you audit a certain product or service, and try to assess its potential adverse human rights impacts. Both service providers and developers play a role when things go wrong. Going forward, let’s pay more attention to supply chain issues, and carefully examine the roles and responsibilities of the different actors in the digital technology ecosystem.
I work on issues at the intersection of technology and human rights. If you are a developer and have been thinking about ways to choose and use building blocks of your product more responsibly please reach out to me. I would be happy to speak with you: rpakzad@taraazresearch.org.
If you are interested in tech & human rights check out Taraaz’s website and sign up for our newsletter.