A research consortium featuring some of the greatest minds in AI are launching a benchmark to measure natural language processing (NLP) abilities.
The consortium includes Google DeepMind, Facebook AI, New York University, and the University of Washington. Each of the consortium’s members believe a more comprehensive benchmark is needed for NLP than current solutions.
The result is a benchmarking platform called SuperGLUE which replaces an older platform called GLUE with a “much harder benchmark with comprehensive human baselines,” according to Facebook AI.
SuperGLUE helps to put NLP abilities to the test where previous benchmarks were beginning to pose too simple for the latest systems.
“Within one year of release, several NLP models have already surpassed human baseline performance on the GLUE benchmark. Current models have advanced a surprisingly effective recipe that combines language model pretraining on huge text data sets with simple multitask and transfer learning techniques,” Facebook said.
In 2018, Google released BERT (Bidirectional Encoder Representations from Transformers) which Facebook calls one of the biggest breakthroughs in NLP. Facebook took Google’s open-source work and identified changes to improve its effectiveness which led to RoBERTa (Robustly Optimized BERT Pretraining Approach).
RoBERTa basically “smashed it,” as the kids would say, in commonly-used benchmarks:
“Within one year of release, several NLP models (including RoBERTa) have already surpassed human baseline performance on the GLUE benchmark. Current models have advanced a surprisingly effective recipe that combines language model pretraining on huge text data sets with simple multitask and transfer learning techniques,” Facebook explains.
For the SuperGLUE benchmark, the consortium decided on tasks which meet four criteria:
- Have varied formats.
- Use more nuanced questions.
- Are yet-to-be-solved using state-of-the-art methods.
- Can be easily solved by people.
The new benchmark includes eight diverse and challenging tasks, including a Choice of Plausible Alternatives (COPA) causal reasoning task. The aforementioned task provides the system with the premise of a sentence and it must determine either the cause or effect of the premise from two possible choices. Humans have managed to achieve 100 percent accuracy on COPA while BERT achieves just 74 percent.
Across SuperGLUE’s tasks, RoBERTa is currently the leading NLP system and isn’t far behind the human baseline:
You can find a full breakdown of SuperGLUE and its various benchmarking tasks in a Facebook AI blog post here.
/">AI & Big Data Expo events with upcoming shows in Silicon Valley, London, and Amsterdam to learn more. Co-located with the IoT Tech Expo, , & .