

How our data encodes systematic racism



I’ve often been told, “The data does not lie.” However, that has never been my experience. For me, the data nearly always lies. Google Image search results for “healthy skin” show only light-skinned women, and a query on “Black girls” still returns pornography. The CelebA face data set has labels of “big nose” and “big lips” that are disproportionately assigned to darker-skinned female faces like mine. ImageNet-trained models label me a “bad person,” a “drug addict,” or a “failure.” Data sets for detecting skin cancer are missing samples of darker skin types. 

White supremacy often appears violently—in gunshots at a crowded Walmart or church service, in the sharp remark of a hate-fueled accusation or a rough shove on the street—but sometimes it takes a more subtle form, like these lies. When those of us building AI systems continue to allow the blatant lie of white supremacy to be embedded in everything from how we collect data to how we define data sets and how we choose to use them, it signifies a disturbing tolerance.

Non-white people are not outliers. Globally, we are the norm, and this doesn’t seem to be changing anytime soon. Data sets so specifically built in and for white spaces represent the constructed reality, not the natural one. To have accuracy calculated in the absence of my lived experience not only offends me, but also puts me in real danger. 

Corrupt data

In a research paper titled “Dirty Data, Bad Predictions,” lead author Rashida Richardson describes an alarming scenario: police precincts suspected or confirmed to have engaged in “corrupt, racially biased, or otherwise illegal” practices continue to contribute their data to the development of new automated systems meant to help officers make policing decisions. 

The goal of predictive policing tools is to send officers to the scene of a crime before it happens. The assumption is that locations where individuals have previously been arrested correlate with a likelihood of future illegal activity. What Richardson points out is that this assumption remains unquestioned even when those initial arrests were racially motivated or illegal, sometimes involving “systemic data manipulation, police corruption, falsifying police reports, and violence, including robbing residents, planting evidence, extortion, unconstitutional searches, and other corrupt practices.” Even data from the worst-behaving police departments is still being used to inform predictive policing tools.

As the Tampa Bay Times reports, this approach can provide algorithmic justification for further police harassment of minority and low-income communities. Using such flawed data to train new systems embeds the police department’s documented misconduct in the algorithm and perpetuates practices already known to be terrorizing those most vulnerable to that abuse.

This may appear to describe a handful of tragic situations. However, it is really the norm in machine learning: this is the typical quality of the data we currently accept as our unquestioned “ground truth.” 

One day GPT-2, an earlier publicly available version of the automated language generation model developed by the research organization OpenAI, started talking to me openly about “white rights.” Given simple prompts like “a white man is” or “a Black woman is,” the text the model generated would launch into discussions of “white Aryan nations” and “foreign and non-white invaders.” 

Not only did these diatribes include horrific slurs like “bitch,” “slut,” “nigger,” “chink,” and “slanteye,” but the generated text embodied a specific American white nationalist rhetoric, describing “demographic threats” and veering into anti-Semitic asides against “Jews” and “Communists.” 

GPT-2 doesn’t think for itself—it generates responses by replicating language patterns observed in the data used to develop the model. This data set, named WebText, contains “over 8 million documents for a total of 40 GB of text” sourced from hyperlinks. These links were themselves selected from posts most upvoted on the social media website Reddit, as “a heuristic indicator for whether other users found the link interesting, educational, or just funny.” 
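The mechanics of that selection are worth pausing on. A popularity threshold is the only gatekeeper; the sketch below (illustrative only, not OpenAI’s actual code, with hypothetical post data) shows how a WebText-style filter admits whatever a community chooses to upvote, biases included. The GPT-2 paper’s stated heuristic was a minimum of 3 karma.

```python
def select_links(posts, min_karma=3):
    """Keep outbound links whose posts cleared the karma threshold.

    `posts` is a list of dicts with 'url' and 'karma' keys; this
    structure is illustrative, not Reddit's real API schema.
    """
    return [p["url"] for p in posts if p["karma"] >= min_karma]

# Hypothetical posts: popularity is the only signal the filter sees.
posts = [
    {"url": "https://example.com/science-article", "karma": 57},
    {"url": "https://example.com/extremist-screed", "karma": 12},
    {"url": "https://example.com/ignored-reply", "karma": 1},
]

# The extremist link survives the filter just as easily as the
# science article, because upvotes are treated as a proxy for quality.
print(select_links(posts))
```

The filter encodes no judgment about content at all; it simply launders the community’s preferences into “training data.”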

However, Reddit users—including those uploading and upvoting—are known to include white supremacists. For years, the platform was rife with racist language and permitted links to content expressing racist ideology. And although there are practical options available to curb this behavior on the platform, the first serious attempts to take action, by then-CEO Ellen Pao in 2015, were poorly received by the community and led to intense harassment and backlash.

Whether dealing with wayward cops or wayward users, technologists choose to allow this particular oppressive worldview to solidify in data sets and define the nature of the models we develop. OpenAI itself acknowledged the limitations of sourcing data from Reddit, noting that “many malicious groups use those discussion forums to organize.” Yet the organization continues to make use of the Reddit-derived data set, even in subsequent versions of its language model. The dangerously flawed nature of data sources is effectively dismissed for the sake of convenience, despite the consequences. Malicious intent isn’t necessary for this to happen, though a certain unthinking passivity and neglect are.

Little white lies

White supremacy is the false belief that white individuals are superior to those of other races. It is not a simple misconception but an ideology rooted in deception. Race is the first myth, superiority the next. Proponents of this ideology stubbornly cling to an invention that privileges them. 

I hear how this lie softens language from a “war on drugs” to an “opioid epidemic,” and blames “mental health” or “video games” for the actions of white assailants even as it attributes “laziness” and “criminality” to non-white victims. I notice how it erases those who look like me, and I watch it play out in an endless parade of pale faces that I can’t seem to escape—in film, on magazine covers, and at awards shows.


This shadow follows my every move, an uncomfortable chill on the nape of my neck. When I hear “murder,” I don’t just see the police officer with his knee on a throat or the misguided vigilante with a gun by his side—it’s the economy that strangles us, the disease that weakens us, and the government that silences us.

Tell me—what is the difference between overpolicing in minority neighborhoods and the bias of the algorithm that sent officers there? What is the difference between a segregated school system and a discriminatory grading algorithm? Between a doctor who doesn’t listen and an algorithm that denies you a hospital bed? There is no systematic racism separate from our algorithmic contributions, from the hidden network of algorithmic deployments that regularly collapse on those who are already most vulnerable.

Resisting technological determinism 

Technology is not independent of us; it’s created by us, and we have complete control over it. Data is not just arbitrarily “political”—there are specific toxic and misinformed politics that data scientists carelessly allow to infiltrate our data sets. White supremacy is one of them. 

We’ve already inserted ourselves and our decisions into the outcome—there is no neutral approach. There is no future version of data that is magically unbiased. Data will always be a subjective interpretation of someone’s reality, a specific presentation of the goals and perspectives we choose to prioritize in this moment. That’s a power held by those of us responsible for sourcing, selecting, and designing this data and developing the models that interpret the information. Essentially, there is no exchange of “fairness” for “accuracy”—that’s a mythical sacrifice, an excuse not to own up to our role in defining performance at the exclusion of others in the first place.

Those of us building these systems will choose which subreddits and online sources to crawl, which languages to use or ignore, which data sets to remove or accept. Most important, we choose who we apply these algorithms to, and which objectives we optimize for. We choose the labels we create, the data we take in, the methods we use. We choose who we welcome as data scientists and engineers and researchers—and who we do not. There were many possibilities for the design of the technology we built, and we chose this one. We are responsible. 

So why can’t we be more careful? When will we finally get into the habit of disclosing data provenance, deleting problematic data sets, and explicitly defining the limitations of every model’s scope? At what point can we condemn those operating with an explicit white supremacist agenda, and take serious actions for inclusion?

An uncertain path forward

Distracted by corporate condolences, abstract technical solutions, and articulate social theories, I’ve watched peers congratulate themselves on invisible progress. Ultimately, I envy them, because they have a choice in the same world where I, like every other Black person, cannot opt out of caring about this. 

As Black people now die in a cacophony of natural and unnatural disasters, many of my colleagues are still more galvanized by the latest product or space launch than the jarring horror of a reality that chokes the breath out of me.


For years, I’ve watched this issue extolled as important, but it’s clear that dealing with it is still seen as a non-priority, “nice to have” supplementary action—secondary always to some definition of model functionality that doesn’t include me.

Models clearly still struggling to address these bias challenges get celebrated as breakthroughs, while people brave enough to speak up about the risk get silenced, or worse. There’s a clear cultural complacency with business as usual, and although disappointing, that’s not particularly surprising in a field where the vast majority just don’t understand the stakes.

The fact is that AI doesn’t work until it works for all of us. If we hope to ever address racial injustice, then we need to stop presenting our distorted data as “ground truth.” There’s no rational and just world in which hiring tools systematically exclude women from technical roles, or where self-driving cars are more likely to hit pedestrians with darker skin. The truth of any reality I recognize is not in these models, or in the data sets that inform them.

The machine-learning community continues to accept a certain level of dysfunction as long as only certain groups are affected. This needs conscious change, and that will take as much effort as any other fight against systematic oppression. After all, the lies embedded in our data are not much different from any other lie white supremacy has told. They will thus require just as much energy and investment to counteract.

Deborah Raji is a Mozilla fellow interested in algorithmic auditing and evaluation. She has worked on several award-winning projects to highlight cases of bias in computer vision and improve documentation practices in machine learning.
