Wikipedia biography dataset
Citation Credit
Neural Text Generation from Structured Data with Application to the Biography Domain
Rémi Lebret, David Grangier and Michael Auli, EMNLP 2016
http://arxiv.org/abs/1603.07771
This publication provides further information about the data, and we kindly ask you to cite this paper when using the data.
The data was extracted from the English Wikipedia dump (enwiki-20150901), relying on the articles referenced by WikiProject Biography.
Dataset Description
For each article, we extracted the first paragraph (text) and the infobox (structured data).
Each infobox is encoded as a list of (field name, field value) pairs.
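As an illustration, one record might be represented in Python as shown below; the field names and values here are invented placeholders, not entries taken from the dataset:

    # Illustrative only: the person, field names and values are invented.
    infobox = [
        ("name", "jane doe"),
        ("birth_date", "1 january 1900"),
        ("occupation", "botanist"),
        ("nationality", "british"),
    ]

    # The paired text is the tokenized first paragraph of the article.
    first_paragraph = "jane doe ( born 1 january 1900 ) was a british botanist ."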
We used Stanford CoreNLP to preprocess the data: the text was broken into sentences, and both the text and the field values were tokenized. The dataset was randomly split into three subsets: train (80%), valid (10%), and test (10%). We strongly recommend using the test set only for the final evaluation.
The data is organised into three subdirectories: train, valid, and test.
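For example, a minimal Python sketch for locating the three subset directories; the root directory name wikipedia-biography-dataset is an assumption, so adjust it to wherever the download was extracted:

    from pathlib import Path

    # Hypothetical root directory; adjust to where the dataset was extracted.
    DATASET_ROOT = Path("wikipedia-biography-dataset")

    # The data is split 80% / 10% / 10% across these three subdirectories.
    SUBSETS = ("train", "valid", "test")

    subset_dirs = {name: DATASET_ROOT / name for name in SUBSETS}

    for name, path in subset_dirs.items():
        if not path.is_dir():
            raise FileNotFoundError(f"Expected subset directory not found: {path}")
        print(f"{name}: {path}")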