The way we train AI is fundamentally flawed

It’s no secret that machine-learning models tuned and tweaked to near-perfect performance in the lab often fail in real settings. This is typically put down to a mismatch between the data the AI was trained and tested on and the data it encounters in the world, a problem known as data shift. For example, an AI trained to spot signs of disease in high-quality medical images will struggle with blurry or cropped images captured by a cheap camera in a busy clinic.   

Now a group of 40 researchers across seven different teams at Google have identified another major cause for the common failure of machine-learning models. Called “underspecification,” it could be an even bigger problem than data shift. “We are asking more of machine-learning models than we are able to guarantee with our current approach,” says Alex D’Amour, who led the study.

Underspecification is a known issue in statistics, where observed effects can have many possible causes. D’Amour, who has a background in causal reasoning, wanted to know why his own machine-learning models often failed in practice. He wondered if underspecification might be the problem here too. D’Amour soon realized that many of his colleagues were noticing the same problem in their own models. “It’s actually a phenomenon that happens all over the place,” he says.

D’Amour’s initial investigation snowballed and dozens of Google researchers ended up looking at a range of different AI applications, from image recognition to natural language processing (NLP) to disease prediction. They found that underspecification was to blame for poor performance in all of them. The problem lies in the way that machine-learning models are trained and tested, and there’s no easy fix.

The paper is a “wrecking ball,” says Brandon Rohrer, a machine-learning engineer at iRobot, who previously worked at Facebook and Microsoft and was not involved in the work.  

Same but different

To understand exactly what’s going on, we need to back up a bit. Roughly put, building a machine-learning model involves training it on a large number of examples and then testing it on a bunch of similar examples that it has not yet seen. When the model passes the test, you’re done.

What the Google researchers point out is that this bar is too low. The training process can produce many different models that all pass the test but—and this is the crucial part—these models will differ in small, arbitrary ways, depending on things like the random values given to the nodes in a neural network before training starts, the way training data is selected or represented, the number of training runs, and so on. These small, often random, differences are typically overlooked if they don’t affect how a model does on the test. But it turns out they can lead to huge variation in performance in the real world.

In other words, the process used to build most machine-learning models today cannot tell which models will work in the real world and which ones won’t.

This is not the same as data shift, where training fails to produce a good model because the training data does not match real-world examples. Underspecification means something different: even if a training process can produce a good model, it could still spit out a bad one because it won’t know the difference. Neither would we.
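
To make that concrete, here is a minimal sketch (my own toy illustration using scikit-learn, not the Google team's code): identically configured networks that differ only in their random seed score almost the same on an i.i.d. test set, yet can diverge once the inputs are perturbed.

```python
# Toy illustration of underspecification, not the paper's actual code:
# models that differ only in their random seed look interchangeable on an
# i.i.d. test set but can diverge under a crude stand-in for data shift.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=4000, n_features=20, n_informative=5,
                           random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Simulate real-world shift with noise the models never saw in training.
rng = np.random.default_rng(0)
X_stress = X_test + rng.normal(scale=1.5, size=X_test.shape)

for seed in range(5):  # the paper trained 50 copies; 5 keeps the sketch quick
    model = MLPClassifier(hidden_layer_sizes=(32,), max_iter=500,
                          random_state=seed).fit(X_train, y_train)
    print(f"seed {seed}: test={model.score(X_test, y_test):.3f}, "
          f"stress={model.score(X_stress, y_test):.3f}")
```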

The researchers looked at the impact of underspecification on a number of different applications. In each case they used the same training processes to produce multiple machine-learning models and then ran those models through stress tests designed to highlight specific differences in their performance.  

For example, they trained 50 versions of an image recognition model on ImageNet, a dataset of images of everyday objects. The only difference between training runs was the random values assigned to the neural network at the start. Yet despite all 50 models scoring more or less the same on the test—suggesting that they were equally accurate—their performance varied wildly in the stress test.

The stress test used ImageNet-C, a dataset of images from ImageNet that have been pixelated or had their brightness and contrast altered, and ObjectNet, a dataset of images of everyday objects in unusual poses, such as chairs on their backs, upside-down teapots, and T-shirts hanging from hooks. Some of the 50 models did well with pixelated images, some did well with the unusual poses; some did much better overall than others. But as far as the standard training process was concerned, they were all the same.
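
ImageNet-C and ObjectNet are published benchmarks with their own tooling; purely to gesture at what such corruptions look like, here is a hypothetical sketch of two home-made ones built with Pillow (an illustration, not the benchmarks' official code; the image filename is made up).

```python
# Two illustrative image corruptions in the spirit of ImageNet-C's
# pixelation and contrast shifts (not the benchmark's official code).
from PIL import Image, ImageEnhance

def pixelate(img: Image.Image, factor: int = 8) -> Image.Image:
    # Downsample, then upsample with nearest-neighbor to create blocky artifacts.
    w, h = img.size
    small = img.resize((max(1, w // factor), max(1, h // factor)), Image.NEAREST)
    return small.resize((w, h), Image.NEAREST)

def adjust_contrast(img: Image.Image, factor: float = 2.0) -> Image.Image:
    # factor > 1 exaggerates contrast; factor < 1 washes the image out.
    return ImageEnhance.Contrast(img).enhance(factor)

# Build a stress set by corrupting each evaluation image, then compare each
# trained model's accuracy on the clean and corrupted versions.
img = Image.open("cat.jpg")  # hypothetical test image
stress_variants = [pixelate(img), adjust_contrast(img, 0.3)]
```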

The researchers carried out similar experiments with two different NLP systems, and three medical AIs for predicting eye disease from retinal scans, cancer from skin lesions, and kidney failure from patient records. Every system had the same problem: models that should have been equally accurate performed differently when tested with real-world data, such as different retinal scans or skin types.

We might need to rethink how we evaluate neural networks, says Rohrer. “It pokes some significant holes in the fundamental assumptions we’ve been making.”

D’Amour agrees. “The biggest, immediate takeaway is that we need to be doing a lot more testing,” he says. That won’t be easy, however. The stress tests were tailored specifically to each task, using data taken from the real world or data that mimicked the real world. Such data is not always available.

Some stress tests are also at odds with each other: models that were good at recognizing pixelated images were often bad at recognizing images with high contrast, for example. It might not always be possible to train a single model that passes all stress tests. 

Multiple choice

One option is to add an extra stage to the training and testing process, in which many models are produced at once instead of just one. These competing models can then be tested again on specific real-world tasks to select the best one for the job.

That’s a lot of work. But for a company like Google, which builds and deploys big models, it could be worth it, says Yannic Kilcher, a machine-learning researcher at ETH Zurich. Google could offer 50 different versions of an NLP model and application developers could pick the one that worked best for them, he says.
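
In code, that selection stage could be as simple as the hypothetical sketch below, assuming each candidate model exposes a scikit-learn-style score method and you have labeled stress data drawn from the target deployment (all names here are my own, not from the paper).

```python
# Hedged sketch of "train many, pick the best for the job": score each
# candidate model on a deployment-specific stress set and keep the winner.
def pick_model_for_deployment(candidates, X_stress, y_stress):
    scored = [(model.score(X_stress, y_stress), model) for model in candidates]
    # Compare by score only, so models themselves never need to be ordered.
    best_score, best_model = max(scored, key=lambda pair: pair[0])
    return best_model, best_score
```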

D’Amour and his colleagues don’t yet have a fix but are exploring ways to improve the training process. “We need to get better at specifying exactly what our requirements are for our models,” he says. “Because often what ends up happening is that we discover these requirements only after the model has failed out in the world.”

Getting a fix is vital if AI is to have as much impact outside the lab as it is having inside. When AI underperforms in the real world, it makes people less willing to use it, says co-author Katherine Heller, who works at Google on AI for healthcare: “We’ve lost a lot of trust when it comes to the killer applications. That’s important trust that we want to regain.”

This Week in Apps: Snapchat clones TikTok, India bans 43 Chinese apps, more data on App Store commission changes

Welcome back to This Week in Apps, the TechCrunch series that recaps the latest in mobile OS news, mobile applications, and the overall app economy.

The app industry is as hot as ever, with a record 204 billion downloads and $120 billion in consumer spending in 2019. People now spend three hours and 40 minutes per day using apps, rivaling TV. Apps aren’t just a way to pass idle hours — they’re also a big business. In 2019, mobile-first companies had a combined $544 billion valuation, 6.5x higher than those without a mobile focus.

This week, we’re digging into more data about how the App Store commission changes will impact developers, as well as other top stories, like Snapchat’s new Spotlight feed and India’s move to ban more Chinese apps from the country, among other things.

We also have our weekly round-up of news about platforms, services, privacy, trends, and other headlines.

Top Stories

More on App Store Commissions

Last week, App Annie confirmed to TechCrunch that around 98% of all iOS developers in 2019 (meaning, unique publisher accounts) fell under the $1 million annual consumer spend threshold that will now reduce their App Store commission from the standard 30% to 15%. The firm also found that only 0.5% of developers were making between $800K and $1M; only 1% were in the $500K-$800K range; and 87.7% made less than $100K.

This week, Appfigures has compiled its own data on how Apple’s changes to App Store commissions will impact the app developer community.

According to its findings, of the 2M published apps on the App Store, 376K apps are paid downloads, offer in-app purchases, or monetize with subscriptions. Those 376K apps are operated by a smaller group of 124.5K developers. Of those developers, only a little under 2% earned more than $1M in 2019. This confirms App Annie’s estimate that 98% of all developers earned under the $1M threshold.

The firm also took a look at companies above the $1M mark, and found that around 53% were games, led by King (of the Candy Crush titles). After a large gap, the next largest categories in 2019 were Health & Fitness, Social Networking, Entertainment, then Photo & Video.

Of the developers making over $1M, the largest percentage — 39% — made between $1M and $2.5M in 2019.

The smallest group (1.5%) of developers making more than $1M is the group making more than $150M. These accounted for 29% of the “over $1M” crowd’s total revenue. And those making between $50M and $150M accounted for 24% of the revenue.

Appfigures also found that of those making less than $1M, most (>97%) fell into the sub-$250K category. Some developers were worried about the way Apple implemented the commission change: the standard 30% rate kicks in immediately upon hitting $1M, and reassessments happen only annually. But so few developers operate in the “danger zone” near the threshold that this doesn’t seem like a significant problem. Read More.
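
As a toy illustration of that “danger zone” concern, here is the arithmetic under the simplified rule described above (my own sketch; Apple’s actual program terms have more detail around enrollment and reassessment).

```python
# Simplified model of the Small Business Program commission as described
# above: 15% until annual proceeds hit $1M, then the standard 30% on
# earnings beyond that point. (A simplification, not Apple's exact terms.)
def developer_net(gross: float, threshold: float = 1_000_000) -> float:
    below = min(gross, threshold)
    above = max(gross - threshold, 0.0)
    return below * 0.85 + above * 0.70

# A developer just under the threshold keeps 85 cents per dollar; every
# dollar past the threshold nets only 70 cents.
print(developer_net(999_000))    # 849150.0
print(developer_net(1_100_000))  # 920000.0
```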

Snapchat takes on TikTok

After taking on TikTok with music-powered features last month, Snapchat this week launched a dedicated place within its app where users can watch short, entertaining videos in a vertically scrollable, TikTok-like feed. This new feature, called Spotlight, will showcase the community’s creative efforts, including the videos now backed by music, as well as other Snaps users may find interesting. Snapchat says its algorithms will work to surface the most engaging Snaps to display to each user on a personalized basis. Read More.

India bans more Chinese apps

India, which has already banned at least 220 apps with links to China in recent months, said on Tuesday it was banning an additional 43 Chinese apps, again citing cybersecurity concerns. Newly banned apps include short video service Snack Video, e-commerce app AliExpress, delivery app Lalamove, shopping app Taobao Live, business card reader CamCard, and others. As a result, there are now no Chinese apps among the 500 most-used apps in India. Read More.

Weekly News

Platforms

  • Apple’s App Store Connect will now require an Apple ID with 2-step verification enabled.
  • Apple announces holiday schedule for App Store Connect. New apps and app updates won’t be accepted Dec. 23-27 (Pacific Time).
  • SKAdNetwork 2.0 adds Source App ID and Conversion Value. The former lets networks identify which app initiated a download from the App Store and the latter lets them know whether users who installed an app through a campaign performed an action in the app, like signing up for a trial or completing a purchase.
  • Apple rounded up developer praise for its App Store commission change. Lending their names to Apple’s list: Little 10 Robot (Tots Letters and Numbers), Broadstreet (Brief), Foundermark (Friended), Shine, Lifesum, Med ART Studios (Sprout Fertility Tracker), RevenueCat, OK Play, SignEasy, Jump Rope, Wine Spectator, Apollo for Reddit, SwingVision Tennis, Cinémoi.

Services

  • Fortnite adds a $12/mo subscription offering a full season battle pass, 1,000 monthly V-Bucks and a Crew Pack featuring an exclusive outfit bundle. More money for Apple to miss out on, I guess.
  • 14 U.S. states plus Washington, D.C. have now adopted COVID-19 contact tracing apps. California and other states may release apps soon. Few people in the U.S. have downloaded the apps, however, which limits their usefulness.
  • Samsung’s TV Plus streaming TV service comes to more Galaxy phones.

Security & Privacy

  • Apple’s senior director of global privacy, Jane Horvath, in a letter to the Ranking Digital Rights organization, confirms the App Tracking Transparency feature will arrive in 2021. The feature will allow users to disable tracking between apps. The letter also slams Facebook for collecting “as much data as possible” on users.
  • Two Baidu apps banned from Google Play, Baidu Maps and the Baidu App, were leaking sensitive user data, researchers said. The apps had 6M U.S. users and millions more worldwide.

Trends

  • U.S. brick-and-mortar retail apps saw 27% growth in the first three quarters of 2020, nearly double the growth of online retailer apps (14%), as measured by new installs, according to Sensor Tower. Top apps included Walmart, Target, Sam’s Club, Nike, Walgreens, and The Home Depot.
  • An App Annie forecast estimates shoppers will spend over 110M hours in (Android) mobile shopping apps this holiday season.
  • PayPal and Square’s Cash App have scooped up 100% of the newly issued supply of bitcoin, a report says.
  • All social media companies now look alike, Axios argues, citing Twitter’s Fleets and Snap’s TikTok-like feature as recent examples.

Funding and M&A

  • CoStar Group, a provider of commercial real estate info and analytics, acquires Homesnap’s platform and app for $250M to move into the residential real estate market.
  • Remote work app Friday raises a $2.1M seed round led by Bessemer Venture Partners.
  • Stories-style Q&A app F3 raises $3.9M. The team previously founded Ask.fm.
  • Edtech company Kahoot acquires Drops, a startup whose apps help people learn languages using games, for $50M.
  • Mobile banking app Current raises $131M Series C, led by Tiger Global Management.
  • Square buys Credit Karma’s tax unit, Credit Karma Tax, for $50M in cash.

The Supreme Court will hear its first big CFAA case

The Supreme Court will hear arguments on Monday in a case that could lead to sweeping changes to America’s controversial computer hacking laws — and affect how millions use their computers and access online services.

The Computer Fraud and Abuse Act was signed into federal law in 1986 and predates the modern internet as we know it, but it governs to this day what constitutes hacking — or “unauthorized” access to a computer or network. The controversial law was designed to prosecute hackers, but it has been dubbed the “worst law” in the technology law books by critics, who say its outdated and vague language fails to protect good-faith hackers who find and disclose security vulnerabilities.

At the center of the case is Nathan Van Buren, a former police sergeant in Georgia. Van Buren used his access to a police license plate database to search for an acquaintance in exchange for cash. Van Buren was caught, and prosecuted on two counts: accepting a kickback for accessing the police database, and violating the CFAA. The first conviction was overturned, but the CFAA conviction was upheld.

Van Buren may have been allowed to access the database by way of his police work, but whether he exceeded his access remains the key legal question.

Orin Kerr, a law professor at the University of California, Berkeley, said Van Buren v. United States was an “ideal case” for the Supreme Court to take up. “The question couldn’t be presented more cleanly,” he argued in a blog post in April.

The Supreme Court will try to clarify the decades-old law by deciding what it means by “unauthorized” access. But that question has no simple answer.

“The Supreme Court’s opinion in this case could decide whether millions of ordinary Americans are committing a federal crime whenever they engage in computer activities that, while common, don’t comport with an online service or employer’s terms of use,” said Riana Pfefferkorn, associate director of surveillance and cybersecurity at Stanford University’s law school. (Pfefferkorn’s colleague Jeff Fisher is representing Van Buren at the Supreme Court.)

How the Supreme Court will determine what “unauthorized” means is anybody’s guess. The court could define unauthorized access anywhere from violating a site’s terms of service to logging into a system that a person has no user account for.

Pfefferkorn said a broad reading of the CFAA could criminalize anything from lying on a dating profile to sharing the password for a streaming service to using a work computer for personal tasks in violation of an employer’s policies.

But the Supreme Court’s eventual ruling could also have broad ramifications on good-faith hackers and security researchers, who purposefully break systems in order to make them more secure. Hackers and security researchers have for decades operated in a legal grey area because the law as written exposes their work to prosecution, even if the goal is to improve cybersecurity.

Tech companies have for years encouraged hackers to privately reach out with security bugs. In return, the companies fix their systems and pay the hackers for their work. Mozilla, Dropbox, and Tesla are among the few companies that have gone a step further by promising not to sue good-faith hackers under the CFAA. Not all companies welcome the scrutiny, however: some have bucked the trend by threatening to sue researchers over their findings, and in some cases have actively launched legal action to prevent unflattering headlines.

Security researchers are no stranger to legal threats, but a decision by the Supreme Court that rules against Van Buren could have a chilling effect on their work, and drive vulnerability disclosure underground.

“If there are potential criminal (and civil) consequences for violating a computerized system’s usage policy, that would empower the owners of such systems to prohibit bona fide security research and to silence researchers from disclosing any vulnerabilities they find in those systems,” said Pfefferkorn. “Even inadvertently coloring outside the lines of a set of bug bounty rules could expose a researcher to liability.”

“The Court now has the chance to resolve the ambiguity over the law’s scope and make it safer for security researchers to do their badly-needed work by narrowly construing the CFAA,” said Pfefferkorn. “We can ill afford to scare off people who want to improve cybersecurity.”

The Supreme Court will likely rule on the case later this year, or early next.

What to make of Stripe’s possible $100B valuation

This is The TechCrunch Exchange, a newsletter that goes out on Saturdays, based on the column of the same name. You can sign up for the email here.

Welcome to a special Thanksgiving edition of The Exchange. Today we will be brief. But not silent, as there is much to talk about.

Up top, The Exchange noodled on the Slack-Salesforce deal here, so please catch up if you missed that while eating pie for breakfast yesterday. And, sadly, I have no idea why Palantir is seeing its value skyrocket. Normally we’d discuss it, asking ourselves what its gains could mean for the lower tiers of private SaaS companies. But as its public market movement appears to be an artificial bump in value, we’ll just wait.

Here’s what I want to talk about this fine Saturday: Bloomberg reporting that Stripe is in the market for more money, at a price that could value the company at “more than $70 billion or significantly higher, at as much as $100 billion.”

Hot damn. Stripe would become the first or second most valuable startup in the world at those prices, depending on how you count. Startup is a weird word to use for a company worth that much, but as Stripe is still clinging to the private markets like some sort of life raft, keeps raising external funds, and is presumably more focused on growth than profitability, it retains the hallmark qualities of a tech startup, so, sure, we can call it one.

Which is odd, because Stripe is a huge concern that could be worth twelve figures, provided it gets that $100 billion price tag. It’s hard to come up with a good reason why it’s still private, other than the fact that it can get away with it.

Anyhoo, are those reported, possible prices bonkers? Maybe. But there is some logic to them. Recall that Square and PayPal earnings pointed to strong payments volume in recent quarters, which bodes well for Stripe’s own recent growth. Also note that 14 months ago or so, Stripe was already processing “hundreds of billions of dollars of transactions a year.”

You can do fun math at this juncture. Let’s say Stripe’s processing volume was $200 billion last September, and $400 billion today, thinking of the number as an annualized metric. Stripe charges 2.9% plus $0.30 for a transaction, so let’s call it 3% for the sake of simplicity and being conservative. That math shakes out to a run rate of $12 billion.
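
Spelled out in code, that back-of-the-envelope calculation looks like this (the volume and take-rate figures are the column’s guesses, not reported numbers).

```python
# The column's rough math: an assumed annualized processing volume times an
# assumed blended take rate gives an implied revenue run rate.
processing_volume = 400e9  # guessed annualized volume, in USD
take_rate = 0.03           # 2.9% + $0.30 per transaction, rounded to ~3%
run_rate = processing_volume * take_rate
print(f"${run_rate / 1e9:.0f}B implied revenue run rate")  # $12B
```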

Now, the company’s actual numbers could be closer to $100 billion or $150 billion in volume and $4.5 billion in revenue, right? And Stripe won’t have the same gross margins as Slack.

But you can start to see why Stripe’s new rumored prices aren’t 100% wild. You can make the multiples work if you are a believer in the company’s growth story. And helping the argument are its public comps. Square’s stock has more than tripled this year. PayPal’s value has more than doubled. Adyen’s shares have almost doubled. That’s the sort of public market pull that can really help a super-late-stage startup looking to raise new capital and secure an aggressive price.

To wrap, Stripe’s possible new valuation could make some sense. The fact that it is still a private company does not.

Various and Sundry

And speaking of edtech, Equity’s Natasha Mascarenhas and our intrepid producer Chris Gates put together a special ep on the education technology market. You can listen to it here. It’s good.

Hugs and let’s both go do some cardio,

Alex
