The Socrata Open Data API (SODA) provides programmatic access to this dataset, including the ability to filter, query, and aggregate data. For more information, view the API docs for this dataset or visit our developer portal.

API Endpoint:

Field Names:

ID
id
be_WORD
be_word
be_RANK
be_rank
be_FREQ
be_freq
ca_WORD
ca_word
ca_RANK
ca_rank
ca_FREQ
ca_freq
ce_WORD
ce_word
ce_RANK
ce_rank
ce_FREQ
ce_freq
cs_WORD
cs_word
cs_RANK
cs_rank
cs_FREQ
cs_freq
da_WORD
da_word
da_RANK
da_rank
da_FREQ
da_freq
de_WORD
de_word
de_RANK
de_rank
de_FREQ
de_freq
en_WORD
en_word
en_RANK
en_rank
en_FREQ
en_freq
eo_WORD
eo_word
eo_RANK
eo_rank
eo_FREQ
eo_freq
es_WORD
es_word
es_RANK
es_rank
es_FREQ
es_freq
et_WORD
et_word
et_RANK
et_rank
et_FREQ
et_freq
eu_WORD
eu_word
eu_RANK
eu_rank
eu_FREQ
eu_freq
fi_WORD
fi_word
fi_RANK
fi_rank
fi_FREQ
fi_freq
fr_WORD
fr_word
fr_RANK
fr_rank
fr_FREQ
fr_freq
gl_WORD
gl_word
gl_RANK
gl_rank
gl_FREQ
gl_freq
he_WORD
he_word
he_RANK
he_rank
he_FREQ
he_freq
hr_WORD
hr_word
hr_RANK
hr_rank
hr_FREQ
hr_freq
hu_WORD
hu_word
hu_RANK
hu_rank
hu_FREQ
hu_freq
id_WORD
id_word
id_RANK
id_rank
id_FREQ
id_freq
it_WORD
it_word
it_RANK
it_rank
it_FREQ
it_freq
la_WORD
la_word
la_RANK
la_rank
la_FREQ
la_freq
lt_WORD
lt_word
lt_RANK
lt_rank
lt_FREQ
lt_freq
ms_WORD
ms_word
ms_RANK
ms_rank
ms_FREQ
ms_freq
nl_WORD
nl_word
nl_RANK
nl_rank
nl_FREQ
nl_freq
pl_WORD
pl_word
pl_RANK
pl_rank
pl_FREQ
pl_freq
pt_WORD
pt_word
pt_RANK
pt_rank
pt_FREQ
pt_freq
ro_WORD
ro_word
ro_RANK
ro_rank
ro_FREQ
ro_freq
sk_WORD
sk_word
sk_RANK
sk_rank
sk_FREQ
sk_freq
sl_WORD
sl_word
sl_RANK
sl_rank
sl_FREQ
sl_freq
sr_WORD
sr_word
sr_RANK
sr_rank
sr_FREQ
sr_freq
tr_WORD
tr_word
tr_RANK
tr_rank
tr_FREQ
tr_freq
uk_WORD
uk_word
uk_RANK
uk_rank
uk_FREQ
uk_freq
uz_WORD
uz_word
uz_RANK
uz_rank
uz_FREQ
uz_freq
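
As a rough illustration of the filtering and querying mentioned above, the following sketch builds a SODA query URL for the top-ranked words of one language. The endpoint is a hypothetical placeholder, since this page does not show the real one, and the helper name `build_soda_url` is made up for the example. It also assumes the rank column is stored as a number, which the page does not confirm.

```python
from urllib.parse import urlencode

# Hypothetical placeholder: the real API endpoint is not shown on this page.
ENDPOINT = "https://example.org/resource/xxxx-xxxx.json"


def build_soda_url(endpoint, lang, limit=10):
    """Build a SODA query URL for the top-ranked words of one language.

    `lang` is a language prefix taken from the field names above, e.g. "en".
    """
    params = {
        # Request only the three columns that belong to this language.
        "$select": f"{lang}_word, {lang}_rank, {lang}_freq",
        # Assumption: rank is numeric, so padding rows (rank 0) can be skipped.
        "$where": f"{lang}_rank > 0",
        "$order": f"{lang}_rank",
        "$limit": limit,
    }
    # Keep "$" unescaped so the query parameters stay readable.
    return f"{endpoint}?{urlencode(params, safe='$')}"


url = build_soda_url(ENDPOINT, "en", limit=5)
print(url)
```

The resulting URL can be fetched with any HTTP client once the placeholder is replaced by the dataset's actual endpoint.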

Use OData to open the dataset in tools like Excel or Tableau. This provides a direct connection to the data that can be refreshed on-demand within the connected application.

Socrata OData documentation

Tableau users should select the OData v2 endpoint option.

OData V4 Endpoint:

OData V2 Endpoint:

Author
Instituto Caro y Cuervo - Grupo de Investigación en Lingüística
Description
DESCRIPTION
The dataset contains word-occurrence statistics (frequencies) from text corpora obtained from Wikipedia in 32 languages. The corpus for each language is balanced at approximately 5,000,000 words. The attached “metadata.cvs” file contains, for each corpus, the ISO 639-1 code of the language, the Spanish name of the language, the English name of the language, the total number of Wikipedia articles, the total number of words, the vocabulary size, and the source URL of the Wikipedia dump. The corpus for each language consists of the first articles in the Wikipedia dump, taken in sequence until a threshold of 5,000,000 words was reached. Word boundaries are identified by a simple tokenizer that uses the space character and other punctuation marks as separators. The data was collected during October and November 2015.

NOTES
Note 1: Because the vocabulary size differs for each language, the number of data rows for each language varies too. Empty entries are marked with the “%” word and zeroes (0) in the RANK and FREQ columns.

Note 2: If you want to open the .csv file with MS Excel, keep in mind that the file is encoded in UTF-8 and that this information is not stored in the file. To open it so that all characters are recognized correctly, follow these steps:
1. Create a new blank spreadsheet (or Blank workbook).
2. Go to the “Data” menu or tab.
3. Select “Get External Data”.
4. Select “From Text”.
5. Select the downloaded .csv file.
6. Select the “Unicode UTF-8” encoding and “Delimited” as file type.
7. Select “comma” as field delimiter and double quotes (") as text qualifier.

Be aware that the file is relatively large for Excel (0.3 GB), so the import could take several minutes.

A figure obtained with this data, illustrating Zipf’s law, is available at the following link:
https://en.wikipedia.org/wiki/Zipf%27s_law#/media/File:Zipf_30wiki_en_labels.png

This dataset is a collaborative effort of the students of the course “Análisis Computacional del Lenguaje” taught in November 2015 at the Instituto Caro y Cuervo, Bogotá D.C., Colombia by Professor Sergio Jiménez (https://sites.google.com/site/sergiojimenezvargas/).
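
The corpus construction and the “%” padding convention described above can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' actual code: the tokenizer here treats any run of word characters as a token, ties in frequency are ranked alphabetically, the CSV header is assumed to use names such as `en_WORD`, `en_RANK`, and `en_FREQ`, and the rank and frequency values are assumed to be plain integers. The function names `rank_words` and `load_language` are hypothetical.

```python
import csv
import io
import re
from collections import Counter

# Assumption: a token is a maximal run of word characters. The description
# only says that spaces and punctuation marks act as separators.
TOKEN_RE = re.compile(r"\w+")


def rank_words(articles, threshold=5_000_000):
    """Count tokens article by article until `threshold` tokens are seen.

    Returns (word, rank, freq) tuples sorted by descending frequency.
    """
    counts = Counter()
    total = 0
    for text in articles:
        tokens = TOKEN_RE.findall(text)
        counts.update(tokens)
        total += len(tokens)
        if total >= threshold:
            break  # the article that crosses the threshold is kept whole
    # Assumption: ties are broken alphabetically and ranks are sequential.
    ordered = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(w, rank, f) for rank, (w, f) in enumerate(ordered, start=1)]


def load_language(csv_file, lang):
    """Yield (word, rank, freq) for one language, skipping padding rows."""
    for row in csv.DictReader(csv_file):
        word = row[f"{lang}_WORD"]
        rank = int(row[f"{lang}_RANK"])
        freq = int(row[f"{lang}_FREQ"])
        if word == "%" and rank == 0 and freq == 0:
            continue  # padding for languages with a smaller vocabulary
        yield word, rank, freq


# Tiny in-memory sample standing in for the real file (values are made up).
sample = io.StringIO(
    "ID,en_WORD,en_RANK,en_FREQ,eo_WORD,eo_RANK,eo_FREQ\n"
    "1,the,1,300,la,1,200\n"
    "2,of,2,150,%,0,0\n"
)
print(list(load_language(sample, "eo")))
print(rank_words(["a b a", "c a"], threshold=4))
```

When reading the downloaded file instead of the sample, opening it with `open(path, encoding="utf-8", newline="")` avoids the encoding problem described in Note 2.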
Category
Ciencia, Tecnología e Innovación (Science, Technology and Innovation)
Tags
wikipedia word frequencies, word ranks, multilingual, 32 languages statistics, zipf’s law
Data Provided By
Instituto Caro y Cuervo - Grupo de Investigación en Lingüística
License
Public Domain

The open data platform of the Colombian government