Computer Science

Designing a Public Archive Data Structure with Wikipedia as Database

 

Abstract

Lunar Mission One is a publicly funded project that aims to pioneer inclusive space exploration that belongs to everyone. It plans to send an international robotic lander to the Moon and leave a record of life on Earth there. This essay designs a data structure for the mission’s public archive. With Wikipedia as the database, it focuses on how to organize data inclusively and effectively, covering information that ranges from highly qualified material to children’s contributions about their daily lives.

The essay first carries out background research in data modeling and decides to use Wikipedia as the database for the public archive. It then determines the language, structure, and content of the archive. Starting from the choice of English as the language of the data, the essay discusses how that choice was made and identifies the possible weakness of choosing a language that not every intelligent reader will recognize. Next, the essay studies the structure of Wikipedia and selects the most applicable data structure, a hierarchical model. It then examines the categories in Wikipedia and Wikipedia’s criteria for assessing articles in order to select inclusive and concise data for the public archive. Finally, the application of this data model is evaluated, showing both its strengths and its limitations.

This essay designs a data structure for the public archive by focusing on the model’s language, organization, and content. However, the model still has limitations, which require more research to address.

 

1 Introduction

Lunar Mission One is a publicly funded project that aims to pioneer inclusive space exploration that belongs to everyone. Besides sending an international robotic lander to the south pole of the Moon, the project also plans to leave a record of life on Earth there, a lasting record of human existence that will endure for millions of years. Two types of archive are to be sent: public and private. This project focuses on the data structure for the public archive, which is designed to engage the public in space exploration, from collecting the most representative data about human beings to engaging children in making their own record of humanity, which links to education and our everyday life.

A data model is an abstract model that organizes elements of data and standardizes how they relate to one another and to properties of the real world. For both compilation and reading, the archive will require some form of data structure. The database for the public archive in particular is likely to be very complex. However, as the largest online encyclopedia, Wikipedia may help to form the basis for the public archive. With the aid of Wikipedia, how can the data be made both organized and inclusive? How can we cover information ranging from highly qualified and officially approved material to children’s contributions with only teacher approval? This project sets these questions as its objectives.

2 Background Research about Data Modeling

Data modeling is the process of creating a data model for an information system by applying formal data modeling techniques. When modeling data, we need to define and analyze the data requirements that support the processes within the scope of the corresponding information system. Since Wikipedia is among the largest reference sources that are free to use across the planet, I chose to use Wikipedia as the database.

3 Design the Data Structure

3.1 Language

Which language to use is one of the first questions that comes to mind when designing a data structure. To draw on the widest range of information, I chose English Wikipedia as the database for the public archive, since the English version stores the largest number of articles.

In the future, it might be necessary to convert the English text into symbols that other intelligences can recognize, since the public archive might be read by intelligent beings from outer space or by Earth-descended humanity in the far future. This would give far-future readers a basis for translating the text into a usable format.

3.2 Data Structure

To decide which type of data structure to use, I first need to know the data structure of the database: Wikipedia (English).

The data structure of Wikipedia is a hierarchical model, a multi-layered structure in which the data are organized like a tree, as shown below[1].

In a hierarchical database model, the data are stored as records that are connected to one another through links. Each child record has only one parent, whereas each parent record can have one or more child records.
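The single-parent rule can be made concrete with a small sketch. The `Record` class, its method names, and the category names below are hypothetical and chosen only for illustration; they are not part of Wikipedia or of the mission's design.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass(eq=False)
class Record:
    """One record in a hierarchical model: at most one parent, any number of children."""
    name: str
    parent: Optional["Record"] = field(default=None, repr=False)
    children: List["Record"] = field(default_factory=list, repr=False)

    def add_child(self, child: "Record") -> "Record":
        # Enforce the single-parent rule of the hierarchical model.
        if child.parent is not None:
            raise ValueError(f"{child.name!r} already has a parent")
        child.parent = self
        self.children.append(child)
        return child

    def path(self) -> List[str]:
        # Follow parent links up to the root, then reverse to get root-first order.
        names = []
        node = self
        while node is not None:
            names.append(node.name)
            node = node.parent
        return names[::-1]


# Hypothetical category and article names, used only for illustration.
root = Record("Archive")
science = root.add_child(Record("Science"))
astronomy = science.add_child(Record("Astronomy"))
moon = astronomy.add_child(Record("Moon"))

print(moon.path())  # ['Archive', 'Science', 'Astronomy', 'Moon']
```

Because every record stores exactly one parent link, each record has a single, unambiguous path back to the root.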

Wikipedia builds this tree-like structure by grouping related pages. Editors assign each page to categories according to its subject, and the categories are themselves placed within broader categories. Grouping data by level of relevance in this way forms a hierarchical model.

This structure can be seen by browsing Wikipedia. If you open an article and scroll all the way down, you will find the categories the article belongs to. If you click on a category, the website lists all of its sub-categories. If you keep clicking, you are directed to sub-sub-categories and so on, until finally the website directs you to an article. This shows that the organization of data in Wikipedia is like a tree with many branches.

Since Wikipedia was chosen as the database of the public archive, the same data structure should be used to simplify the work.
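The click-through described above corresponds to walking a tree from the top category down to an article. The following is a minimal sketch over a made-up slice of a category tree; the `CATEGORIES` data and the `walk` and `find` helpers are assumptions for illustration and do not query Wikipedia.

```python
from typing import Iterator

# Hypothetical slice of a category tree: dicts are categories, lists hold article titles.
CATEGORIES = {
    "Science": {
        "Astronomy": {"Solar System": ["Moon", "Mars"]},
        "Biology": ["Cell", "Evolution"],
    },
    "History": ["Ancient Egypt"],
}


def walk(tree, prefix=()) -> Iterator[tuple]:
    """Yield (category, ..., article) paths by depth-first traversal."""
    if isinstance(tree, dict):
        for name, subtree in tree.items():
            yield from walk(subtree, prefix + (name,))
    else:
        for article in tree:
            yield prefix + (article,)


def find(tree, title):
    """Return the first path ending in the given article title, or None if absent."""
    return next((p for p in walk(tree) if p[-1] == title), None)


print(" > ".join(find(CATEGORIES, "Moon")))  # Science > Astronomy > Solar System > Moon
```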

3.3 Data Contents

Generally, a public archive should comprise both humanities information, such as history, civilization, and culture, and scientific information, such as introductions to different species and descriptions of the environment of natural life. To select what to include in the public archive, big data analysis is a useful tool. Therefore, let us look at the statistics provided by Wikipedia (English).

3.3.1 Wikipedia’s Statistics

Wikipedia ranks articles according to their quality and importance, as shown below[2]:

Reading down a column gives the ranking by quality from high to low: the articles of best quality are marked FA (Featured Articles). Reading along a row from right to left gives the importance from top to low, which is generated according to the click rate.

The quality of articles is assessed by Wikipedia’s editors. For example, featured articles are considered to be the best articles and are used by editors as examples for writing other articles. Featured articles are reviewed for accuracy, neutrality, completeness, and style according to Wikipedia’s article criteria. Wikipedia also has internal processes for reassessing articles that no longer meet the criteria; such articles are proposed for improvement or removal.

Hence, according to the big data provided, the articles chosen should be those of the best quality and top importance.
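This selection rule amounts to a filter on two attributes. The sketch below assumes each article carries a quality grade and an importance rating; the `Article` type, the `select` helper, and the ratings listed are made up for illustration and are not real Wikipedia assessment data.

```python
from typing import NamedTuple


class Article(NamedTuple):
    title: str
    quality: str     # e.g. "FA", "GA", "B"
    importance: str  # e.g. "Top", "High", "Mid", "Low"


# Made-up ratings for illustration only.
ARTICLES = [
    Article("Moon", "FA", "Top"),
    Article("Mars", "FA", "High"),
    Article("Cell", "GA", "Top"),
    Article("Evolution", "FA", "Top"),
]


def select(articles, quality="FA", importance="Top"):
    """Keep only the titles whose quality and importance both match."""
    return [a.title for a in articles
            if a.quality == quality and a.importance == importance]


print(select(ARTICLES))  # ['Moon', 'Evolution']
```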

3.3.2 General Data Content

According to Wikipedia’s statistics, 1,159 articles out of 5,387,915 are of both best quality and top importance. However, this does not mean that only the articles of top importance (i.e. the top click rate) reflect the key information about human development. This led me to consider how to select articles more effectively. I therefore decided to make further use of ‘categories’, the key to the hierarchical structure, to sort out suitable articles.

In Wikipedia, ‘Category:Wikipedia Did you know articles that are featured articles’ is a hidden category. The Did you know (DYK) section appears on the main page of English Wikipedia.

The DYK section showcases new or expanded articles that are selected through an informal review process, in which the choice of articles is subject to a set of criteria. This category contains 1,384 articles in total, including inclusive portals that provide important information about humanity. Using articles from this category would give the public archive for Lunar Mission One a better and more comprehensive picture of human beings.

3.3.3 From Education’s Point of View

To fulfill the educational purpose of the public archive, the data content can also be tailored to schools.

Wikipedia for Schools is a website that offers a selection of Wikipedia articles matching the UK National Curriculum, which can be used by school children around the world. It is much smaller than Wikipedia, with 6,000 articles covering topics such as art, citizenship, everyday life, language, mathematics, and science.

By using Wikipedia for Schools as our major data content, we could engage school children in learning and in creating their own ‘history’ as humans on the big blue dot[3]. They could write their own information under the relevant categories.

4 Summary of the Data Structure

Language: English;

Structure: multi-layered hierarchical model;

Content:

1: Humanities-based and science-based information: 1,384 articles from ‘The Category:Wikipedia Did you know articles that are featured articles’.

2: Education-based information: children around the world create their own texts under the topics corresponding to the categories of Wikipedia for Schools.

Hence, the data structure for the public archive is as follows:

Data structure for the public archive

Under each category, there would be sub-categories, sub-sub-categories and so on until a targeted article is found.

Impact: This data structure allows information to be searched by category, which keeps closely related information together.
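The summary above can be sketched as a nested structure with two content branches. The branch names `General` and `Education`, the sample entries, and the `add_contribution` helper are hypothetical; they only illustrate how a child's text might be filed under an existing topic.

```python
# Sketch of the summarized design; branch names and sample entries are hypothetical.
ARCHIVE = {
    "language": "English",
    "structure": "hierarchical",
    "content": {
        # Articles drawn from the DYK featured-article category.
        "General": {"Science": {"Astronomy": ["Moon"]}},
        # Children's texts filed under Wikipedia for Schools topics.
        "Education": {"Everyday life": ["My school day"]},
    },
}


def add_contribution(archive, topic, title):
    """File a child's text under an existing education topic."""
    topics = archive["content"]["Education"]
    if topic not in topics:
        raise KeyError(f"unknown topic: {topic}")
    topics[topic].append(title)


add_contribution(ARCHIVE, "Everyday life", "My village")
print(ARCHIVE["content"]["Education"]["Everyday life"])
# ['My school day', 'My village']
```

Restricting contributions to existing topics keeps the children's texts inside the same category tree as the general articles.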

 

5 Applications and Limitations

5.1 Applications

This data structure is designed for the public archive of Lunar Mission One, which is dedicated both to recording human information and to engaging the public in space exploration.

The use of a hierarchical model creates a multi-layered approach that organizes the data efficiently, while the use of Wikipedia (English) as the database helps the data to be inclusive.

5.2 Limitations and Improvements

1. Language: the output of this data structure is in English, but the public archive might be read by a far-future intelligence that does not understand English, so the text might need to be converted into symbols that intelligent species can recognize. It might therefore be necessary to research the design of symbols that intelligent beings commonly recognize.

2. Capacity: the general information comprises 1,384 articles, which takes up a large share of the public archive’s capacity. The content written by children may also take up a lot of space. The actual size of the public archive therefore needs to be taken into account: if the capacity is not large enough, the articles will need to be selected further.
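One simple way to reason about the capacity limit is to keep articles in priority order until the space runs out. This is only a sketch under stated assumptions: the essay does not specify the archive's capacity or the article sizes, so the byte figures, the titles, and the `fit_to_capacity` helper below are all hypothetical.

```python
def fit_to_capacity(articles, capacity_bytes):
    """Keep articles in priority order, skipping any that would exceed the capacity."""
    kept, used = [], 0
    for title, size in articles:
        if used + size <= capacity_bytes:
            kept.append(title)
            used += size
    return kept, used


# Hypothetical titles and sizes in bytes, listed from highest to lowest priority.
CANDIDATES = [
    ("Moon", 120_000),
    ("Evolution", 200_000),
    ("Mars", 150_000),
    ("Cell", 60_000),
]

kept, used = fit_to_capacity(CANDIDATES, capacity_bytes=400_000)
print(kept, used)  # ['Moon', 'Evolution', 'Cell'] 380000
```

This greedy pass is not guaranteed to fill the space optimally, but it respects the priority ordering, which matters when quality and importance decide what to keep.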

References

[1] West, Matthew and Fowler, Julian, Developing High Quality Data Models (The European Process Industries STEP Technical Liaison Executive, 1999).

[2] Simsion, Graeme C. and Witt, Graham C., Data Modeling Essentials, 3rd Edition (Morgan Kaufmann Publishers, 2005).

Footnotes

  1. Intelligence, last modified January 1999. http://www.personalityresearch.org/intelligence/structure.html
  2. Wikipedia, last modified 22 January 2017. https://en.wikipedia.org/wiki/Wikipedia:100,000_feature-quality_articles
  3. This phrase comes from the Voyager space program. The ‘blue marble’ phrase comes from the Moon program.

 

About the Author

Ruihua Zhang, 18, UK

Ruihua Zhang, who grew up in Beijing and now studies in Britain, is a high school student with a great passion for maths and science. In summer 2016, she conducted a research project with Lunar Mission One as a member of Nuffield Research, successfully designing a data structure for the mission’s public archive using Wikipedia as the database and winning a Gold CREST Award.

Mentor: Gerald Shields, J.D., LL.M.
