Description

DESCRIPTION
The dataset contains word-occurrence statistics (frequencies) from text corpora obtained from Wikipedia in 32 languages. The corpus for each language is balanced at approximately 5,000,000 words. The attached “metadata.csv” file lists, for each corpus, the ISO 639-1 code of the language, the Spanish name of the language, the English name of the language, the total number of Wikipedia articles in the corpus, the total number of words, the vocabulary size, and the source URL of the Wikipedia dump. The corpus for each language consists of the first articles in the Wikipedia dump, taken in sequence until a threshold of 5,000,000 words was reached. Word boundaries are identified by a simple tokenizer that uses the space character and other punctuation marks as separators. The data was collected during October and November 2015.
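The corpus construction described above (tokenize, count, stop at a word threshold) can be sketched roughly as follows. This is an illustration only, not the tokenizer actually used for the dataset, which is not published here. The function name `build_frequency_table`, the regular expression, the absence of case folding, and the alphabetical tie-breaking of ranks are all assumptions.

```python
import re
from collections import Counter

# Assumption: a word is a maximal run of Unicode word characters, so spaces
# and punctuation act as separators. The dataset's real rules may differ.
TOKEN_RE = re.compile(r"\w+")

def build_frequency_table(articles, threshold=5_000_000):
    """Count words over articles, in order, until `threshold` words are seen.

    Returns a list of (word, rank, freq) tuples and the total word count.
    """
    counts = Counter()
    total = 0
    for text in articles:
        tokens = TOKEN_RE.findall(text)  # no case folding (assumption)
        counts.update(tokens)
        total += len(tokens)
        if total >= threshold:
            break  # stop after the article that reaches the threshold
    # Rank 1 is the most frequent word; ties are broken alphabetically
    # (an assumption, the source does not say how ties are ranked).
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    table = [(word, rank, freq)
             for rank, (word, freq) in enumerate(ranked, start=1)]
    return table, total
```

Because whole articles are consumed, the final word count can slightly exceed the threshold, which is consistent with the corpora being only approximately 5,000,000 words each.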
NOTES
Note 1: Because the vocabulary size differs between languages, the number of data rows differs too. Languages with smaller vocabularies are padded with empty entries, which can be identified by the word “%” and zeroes (0) in the RANK and FREQ columns.
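As a minimal sketch of how the empty entries from Note 1 might be skipped when reading one language, the example below uses Python's standard `csv` module on a tiny made-up sample. The helper name `language_rows`, the sample values, and the lowercase form of the language code in the headers (for example `es_WORD`) are assumptions. Check the real header row before relying on them.

```python
import csv
import io

def language_rows(csv_file, lang):
    """Yield (word, rank, freq) for one language, skipping empty entries."""
    reader = csv.DictReader(csv_file)
    for row in reader:
        word = row[f"{lang}_WORD"]
        rank = int(row[f"{lang}_RANK"])
        freq = int(row[f"{lang}_FREQ"])
        # Empty entries carry "%" as the word and 0 in RANK and FREQ.
        if word == "%" and rank == 0 and freq == 0:
            continue
        yield word, rank, freq

# Made-up two-language sample, only to show the column layout.
sample = io.StringIO(
    "ID,en_WORD,en_RANK,en_FREQ,es_WORD,es_RANK,es_FREQ\n"
    "1,the,1,300,de,1,280\n"
    "2,of,2,150,%,0,0\n"
)
rows_es = list(language_rows(sample, "es"))
print(rows_es)
```

For the real file, the same function could be called on a handle opened with `open(path, encoding="utf-8", newline="")`, where `path` is wherever the downloaded .csv was saved.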

Note 2: If you want to open the .csv file with MS Excel, keep in mind that the file is encoded in UTF-8 and that this information is not stored in the file itself. To open it so that all characters are recognized correctly, follow these steps:

1. Create a new blank spreadsheet (or Blank workbook).
2. Go to the “Data” menu or tab.
3. Select “Get External Data”.
4. Select “From Text”.
5. Select the downloaded .csv file.
6. Select the “Unicode UTF-8” encoding and “Delimited” as the file type.
7. Select “comma” as the field delimiter and the double quote (") as the text qualifier.

Be aware that the file is relatively large for Excel (0.3 GB), so the import could take several minutes.
The figure at the following link was produced from this data and illustrates Zipf’s law.

https://en.wikipedia.org/wiki/Zipf%27s_law#/media/File:Zipf_30wiki_en_labels.png
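
Zipf’s law states that the frequency of a word is roughly proportional to 1 / rank^s, with s close to 1. As a rough sketch, the exponent s can be estimated from one language's FREQ column by a least-squares fit on log-log axes. The function name `zipf_exponent` and the synthetic input are assumptions for illustration. This is not the method used to produce the linked figure.

```python
import math

def zipf_exponent(freqs):
    """Estimate s in freq ~ C / rank**s from frequencies sorted descending.

    Expects at least two strictly positive values (empty entries with
    frequency 0 must be removed first). Rank is the 1-based position.
    """
    xs = [math.log(rank) for rank in range(1, len(freqs) + 1)]
    ys = [math.log(freq) for freq in freqs]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # The fitted slope is negative for Zipf-like data; return it as positive s.
    return -sxy / sxx

# Synthetic, perfectly Zipfian frequencies: the estimate should be near 1.
ideal = [1000 / rank for rank in range(1, 101)]
print(round(zipf_exponent(ideal), 6))
```

Real word frequencies usually deviate from a straight line at the very highest and lowest ranks, so a fit over the whole range is only a coarse summary.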

This dataset is a collaborative effort of the students of the course “Análisis Computacional del Lenguaje” taught in November 2015 at the Instituto Caro y Cuervo, Bogotá D.C., Colombia by Professor Sergio Jiménez.

Meta
Category
Science, Technology and Innovation
Permissions
Public
Tags
wikipedia word frequencies, word ranks, multilingual, 32 languages statistics, zipf’s law
Row Label
ID (row counter), xx_WORD (UTF-8 word character string), xx_RANK (word ranking by frequency), xx_FREQ (word frequency in a corpus of approximately 5,000,000 words). The three columns with ‘xx’ in their headers are repeated once for each of the 32 languages. The two characters ‘xx’ correspond to the ISO 639-1 code of the language. The total number of columns in the file is 97, i.e. (32 x 3) + 1 = 97.
SODA2 Only
Yes
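The column layout given under Row Label can be checked with a short sketch that builds the expected header from a list of language codes. The helper name `build_header`, the three example codes, and the assumption that columns appear in the order ID, then WORD, RANK, FREQ per language are illustrative. The full list of 32 codes is in the metadata file and is not reproduced here.

```python
def build_header(lang_codes):
    """Return the expected column names: ID plus three columns per language."""
    header = ["ID"]
    for code in lang_codes:
        header += [f"{code}_WORD", f"{code}_RANK", f"{code}_FREQ"]
    return header

# Three example ISO 639-1 codes, not the dataset's full list of 32.
example = build_header(["en", "es", "de"])
print(len(example))   # 3 languages * 3 columns + 1 ID column
print(32 * 3 + 1)     # column count stated for the full file
```

Comparing such a generated header against the first row of the downloaded file is a quick way to confirm that no language columns were lost during import.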
Licensing and Attribution
Data Provided By
Instituto Caro y Cuervo - Grupo de Investigación en Lingüística
Source Link
(none)
License
Public Domain
Entity Information
Area or Department
Grupo de Investigación en Lingüística
Entity Name
Instituto Caro y Cuervo
Department
Bogotá D.C.
Municipality
Bogotá D.C.
Level
National
Sector
Science, Technology and Innovation
Data Information
Language
English
Geographic Coverage
International
Update Frequency
Not applicable
Issue Date (yyyy-mm-dd)
2018-03-09