From science to practice: identifying important sources of information in multilingual Wikipedia
Automatic identification and evaluation of information sources in Wikipedia
With over a billion websites in existence, it is a huge challenge for Wikipedia users to assess the credibility of each source individually. Although the various language versions of Wikipedia have detailed guidelines on reliable sources, there is no comprehensive list of sites that can be considered reliable in particular topical contexts. In addition, the credibility and reputation of sites may change over time, so any such list requires regular updates. For this reason, automating the creation and updating of a list of reliable sources is extremely important. Such a list would be a valuable resource not only for Wikipedia editors, but also for readers seeking accurate and reliable information.
The Department of Information Systems at the Poznań University of Economics and Business (PUEB) conducts research on the automatic assessment of article quality and the reliability of information sources in various language versions of Wikipedia. An analysis of over 60 million Wikipedia articles identified over 330 million references to sources, and various evaluation models were then used to identify important sources of information. The table below shows the results of reference extraction for selected language versions of the encyclopedia, including the number of unique websites, as of October 2023:
Wiki | Language Version | Number of Articles | Number of References | Unique Websites |
ar | Arabic | 1,219,168 | 6,355,164 | 294,089 |
ca | Catalan | 735,551 | 3,895,389 | 197,470 |
cs | Czech | 532,602 | 2,752,877 | 119,313 |
de | German | 2,839,878 | 14,473,501 | 622,551 |
en | English | 6,722,214 | 79,687,819 | 1,942,579 |
es | Spanish | 1,833,749 | 12,558,623 | 509,313 |
fa | Persian | 975,931 | 2,477,763 | 133,634 |
fi | Finnish | 559,931 | 3,371,084 | 138,320 |
fr | French | 2,557,559 | 19,455,752 | 576,523 |
he | Hebrew | 342,285 | 1,867,068 | 103,848 |
hi | Hindi | 162,954 | 496,057 | 47,617 |
hu | Hungarian | 530,977 | 2,545,152 | 124,536 |
id | Indonesian | 661,844 | 2,672,604 | 162,924 |
it | Italian | 1,829,095 | 8,856,574 | 278,232 |
ja | Japanese | 1,388,532 | 14,684,917 | 359,446 |
ko | Korean | 646,717 | 1,885,878 | 91,918 |
nl | Dutch | 2,133,536 | 3,010,002 | 112,318 |
no | Norwegian | 616,624 | 2,102,507 | 107,343 |
pl | Polish | 1,583,919 | 8,847,928 | 242,835 |
pt | Portuguese | 1,110,209 | 7,692,600 | 319,534 |
ru | Russian | 1,940,113 | 15,461,960 | 454,351 |
sv | Swedish | 2,572,575 | 11,791,609 | 134,081 |
th | Thai | 158,905 | 1,010,438 | 70,395 |
tr | Turkish | 533,201 | 2,773,455 | 146,854 |
uk | Ukrainian | 1,289,727 | 5,455,954 | 217,787 |
vi | Vietnamese | 1,288,093 | 3,796,577 | 147,041 |
zh | Chinese | 1,379,496 | 8,130,187 | 283,516 |
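The counts in the table come from extracting references from article source text and grouping them by website. The following is a minimal sketch of that idea, not the method used in the research: it assumes references are written as `<ref>...</ref>` tags in wikitext, treats each hostname (with a leading `www.` removed) as one website, and uses a hypothetical helper name `extract_hosts` and made-up sample text.

```python
import re
from collections import Counter
from urllib.parse import urlsplit

# Body of <ref ...>...</ref> tags; self-closing <ref name="x" /> tags are skipped
# because they only reuse a reference defined elsewhere.
REF_PATTERN = re.compile(r"<ref\b[^>/]*>(.*?)</ref>", re.IGNORECASE | re.DOTALL)
# A URL ends at whitespace or at characters that delimit wikitext syntax.
URL_PATTERN = re.compile(r'https?://[^\s|\]<>}"]+', re.IGNORECASE)


def extract_hosts(wikitext):
    """Return the hostname of every URL found inside <ref> tags (assumed format)."""
    hosts = []
    for ref_body in REF_PATTERN.findall(wikitext):
        for url in URL_PATTERN.findall(ref_body):
            host = urlsplit(url).hostname
            if host:
                # Assumption: "www.example.org" and "example.org" are the same site.
                hosts.append(host[4:] if host.startswith("www.") else host)
    return hosts


# Made-up wikitext for illustration only.
sample = (
    'Fact one.<ref>{{cite web |url=https://www.example.org/a |title=A}}</ref> '
    'Fact two.<ref name="b">[http://news.example.com/story Story]</ref> '
    'Fact three.<ref name="b" /> '
    'Fact four.<ref>https://example.org/b</ref>'
)

hosts = extract_hosts(sample)
print("references with URLs:", len(hosts))
print("unique websites:", len(set(hosts)))
print("most cited:", Counter(hosts).most_common(1))
```

A real pipeline would also need to handle citation templates without URLs, archived links, and grouping of subdomains under one site, all of which this sketch ignores.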
During the webinar, Dr. Włodzimierz Lewoniewski presented ways of identifying and automatically assessing the importance of information sources cited in Wikipedia articles across different language versions. The practical part demonstrated some of the capabilities of the BestRef tool, which contains the evaluation results for millions of web sources cited in Wikipedia articles, assessed separately for each language version.
The webinar took place on November 23, 2023. It was organized by Wikimedia Polska, which supports and promotes Wikipedia and its sister projects (such as Wikidata, Wiktionary, Wikinews, and Wikisource).
More information about research on information sources in Wikipedia can be found in the following scientific publications:
- Companies in Multilingual Wikipedia: Articles Quality and Important Sources of Information (2023)
- Identification of Important Web Sources of Information on Wikipedia across various Topics and Languages (2022)
- Reliability in Time: Evaluating the Web Sources of Information on COVID-19 in Wikipedia across Various Language Editions from the Beginning of the Pandemic (2022)
- Identifying Reliable Sources of Information about Companies in Multilingual Wikipedia (2022)
- Modeling Popularity and Reliability of Sources in Multilingual Wikipedia (2020)