An Introduction to the Application of Statistics in Big Data

Abstract

Statistics, in its modern sense, is a field of study that aims to analyze natural phenomena through mathematical means. As a highly diverse and versatile discipline, statistics has developed not only within STEM subjects but also in the social sciences, economics, and the humanities. More recently, the use of statistics in big data has increased, mainly in connection with machine learning and artificial intelligence. Combined with these subjects, statistics has been used in numerical and textual analysis, and it is starting to be applied in areas previously thought of as exclusively the domain of humans, such as the arts. Although there are differences between conventional statistics and these new developments, their purpose, which is finding associations in data, remains the same.

Introduction

By reviewing the history and current developments of statistics, this article aims to outline a possible future trajectory for statistics in the new age of big data, as well as the greater role statistics now has in primary and secondary schools as a result of its expanding use across disciplines. It also addresses some realistic criticisms and concerns about the subject in the context of rapidly advancing technology.

Historical Development

While historical records date an early population census, an important part of statistics, to 2 AD during the Han Dynasty of China, the first form of modern statistics is usually said to have emerged in 1662, when John Graunt founded the science of demography, the statistical study of human populations. Among his notable contributions, his methods for analyzing the survival rates of human populations by age paved the way for the framework of modern demography.

While Graunt is largely responsible for a systematic approach to collecting and analyzing human data, the earliest known statistical work had emerged long before his time, in the book Manuscript on Deciphering Cryptographic Messages. Written in the 9th century by Al-Kindi, it discusses how to decode encrypted messages using statistical inference (drawing conclusions about an entire population from sample data) and frequency analysis (the study of how often each letter or symbol occurs in a ciphertext). The book laid the groundwork for modern cryptanalysis and statistics[1].
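Frequency analysis is simple enough to sketch in a few lines of Python. The example below is only an illustration, not a reconstruction of Al-Kindi's own procedure: the ciphertext is a made-up message in which every letter has been shifted three places, and the function names are hypothetical. It counts the letters, assumes the most common one stands for E (the most common letter in English), and undoes the shift.

```python
from collections import Counter

def letter_frequencies(text):
    """Count the letters in a text, most common first."""
    letters = [ch for ch in text.upper() if ch.isalpha()]
    return Counter(letters).most_common()

def guess_shift(ciphertext, most_common_plain="E"):
    """Assume the most frequent ciphertext letter stands for the given plain letter."""
    top_letter = letter_frequencies(ciphertext)[0][0]
    return (ord(top_letter) - ord(most_common_plain)) % 26

def shift_back(text, shift):
    """Undo a simple alphabet shift on upper-case letters, leaving other characters alone."""
    decoded = []
    for ch in text:
        if "A" <= ch <= "Z":
            decoded.append(chr((ord(ch) - ord("A") - shift) % 26 + ord("A")))
        else:
            decoded.append(ch)
    return "".join(decoded)

# Hypothetical ciphertext: a short English message shifted by three letters.
ciphertext = "PHHW PH DW WKH JDWH"
frequencies = letter_frequencies(ciphertext)
shift = guess_shift(ciphertext)
print(frequencies[:3])
print(shift_back(ciphertext, shift))
```

A message this short happens to follow typical English letter frequencies; on real ciphertexts the guess becomes more reliable as the text gets longer.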

The first step toward statistics in its modern form came in the late 19th century, when Sir Francis Galton and Karl Pearson introduced the notion of a standard deviation, a numerical measure of how far a set of data spreads from its mean; methods of identifying correlation, a measure of the strength of the linear relationship between two quantitative variables; and regression analysis, a statistical method for estimating the relationship between an independent variable and a dependent variable in a study. These developments allowed statistics not only to be used more actively in the study of human demographics, but also to be applied to the analysis of industry and politics. Together they founded Biometrika, the first journal of mathematical statistics, and Pearson went on to found the first university statistics department. This was the beginning of statistics as an independent field of study[2].
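The three ideas described above can be computed directly from their definitions. The following Python sketch uses invented data (hours studied against test scores) and hypothetical function names; it is meant only to show how the standard deviation, the correlation coefficient, and a least-squares regression line relate to one another.

```python
import math

def mean(values):
    return sum(values) / len(values)

def sample_std(values):
    """Sample standard deviation: typical distance of the data from its mean."""
    m = mean(values)
    return math.sqrt(sum((v - m) ** 2 for v in values) / (len(values) - 1))

def pearson_r(xs, ys):
    """Pearson correlation: strength of the linear relationship, between -1 and 1."""
    mx, my = mean(xs), mean(ys)
    covariation = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mx) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - my) ** 2 for y in ys))
    return covariation / (spread_x * spread_y)

def least_squares_line(xs, ys):
    """Slope and intercept of the regression line of y on x."""
    mx, my = mean(xs), mean(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return slope, my - slope * mx

# Invented data for illustration only.
hours = [1, 2, 3, 4, 5]
scores = [52, 55, 61, 64, 68]

slope, intercept = least_squares_line(hours, scores)
r = pearson_r(hours, scores)
print(f"standard deviation of scores: {sample_std(scores):.2f}")
print(f"correlation: {r:.3f}")
print(f"regression line: score = {slope:.1f} * hours + {intercept:.1f}")
```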

The second wave of modern statistics came in the first half of the 1900s, when it became actively incorporated into research and higher education curricula. A notable contributor in this period was Ronald Fisher, whose major publications helped outline statistical methods for researchers. He gave directions on how to design experiments to avoid unintentional bias and other human errors. He described how data collection and analysis could be improved through means such as randomized design, in which subjects are assigned to treatments at random, removing possible unintentional bias on the part of both subjects and researchers. He also set an example of how statistics could be used to explore a wide range of questions, provided that an experiment yields three things: a valid null hypothesis, which states that there is no real difference between the two population quantities in question beyond what chance variation in sampling would produce; an alternative hypothesis, which states that there is a discernible difference between them; and a set of data. One such example was his analysis of variation in crop yields, published in 1921[3].
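The logic of testing a null hypothesis in a randomized experiment can be illustrated with a permutation test. This is a sketch in the spirit of Fisher's reasoning, not the analysis from his 1921 paper, and the plot yields below are invented. If the treatment truly made no difference, the labels "fertilized" and "unfertilized" would be arbitrary, so the code asks how many of the possible ways of assigning the labels produce a difference in means at least as large as the one observed.

```python
from itertools import combinations

def permutation_p_value(treated, control):
    """One-sided exact permutation test on the difference in group means."""
    pooled = treated + control
    n_treated = len(treated)
    n_control = len(control)
    observed = sum(treated) / n_treated - sum(control) / n_control
    total = sum(pooled)
    arrangements = 0
    at_least_as_extreme = 0
    # Try every way the treatment labels could have been handed out.
    for chosen in combinations(range(len(pooled)), n_treated):
        treated_sum = sum(pooled[i] for i in chosen)
        diff = treated_sum / n_treated - (total - treated_sum) / n_control
        arrangements += 1
        if diff >= observed - 1e-9:  # small tolerance for floating-point rounding
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / arrangements

# Invented yields for eight plots, four of them randomly chosen for fertilizer.
fertilized = [29.1, 30.4, 31.2, 32.0]
unfertilized = [26.3, 27.5, 28.0, 28.8]

observed, p_value = permutation_p_value(fertilized, unfertilized)
print(f"observed difference in mean yield: {observed:.3f}")
print(f"p-value under the null hypothesis: {p_value:.4f}")
```

A small p-value means the observed difference would be rare if the null hypothesis were true, which counts as evidence for the alternative hypothesis rather than proof of it.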

In the latter half of the 20th century, the development of supercomputers and personal computers led to more information being stored digitally, causing rapid growth in the volume of data. This gave rise to the term “big data,” which refers to volumes of data large enough that they can be analyzed to identify patterns or trends. Applications of big data range from monitoring large-scale financial activities, such as international trade, to customer analysis for effective social media marketing, and with this growing role has come a corresponding increase in the importance of statistics in managing and analyzing it.

The Role of Statistics in Education and Research

Having become an official discipline at the university level in 1911, statistics has since been incorporated into education at various levels. Notably, basic statistical concepts were first introduced to high schools in the 1920s. The 1940s and 1950s saw a vigorous effort to make statistical education available from a younger age, spurred by governmental and social efforts during and after the Second World War, when statistical analysis was frequently used to assess military performance, casualties, and more. While educational efforts slackened in the 1960s and 1970s, the boom of big data revived interest in statistics from the 1980s onward[4]. Presently, statistics is taught in both primary and secondary schools, and is also offered as Honors and Advanced Placement courses to many high school students hoping to study the subject at the college level and beyond.

Statistics has also become a crucial element of research, in applications ranging from predicting the best price of commodities based on consumer demand in the commercial sphere to determining the effectiveness of treatments in medicine. By incorporating statistics into research, researchers have been able to quantify the credibility of their findings through data analysis, and to find evidence for causal relationships using hypothesis testing in well-designed experiments. Statistics is especially necessary in research because it offers a rigorous way of measuring the reliability of the results drawn from a study. Whether constructing a confidence interval for a population mean or testing whether a new treatment has any effect on patients compared with a placebo, it places mathematical bounds on the uncertainty in a study's conclusions[5]. Moreover, statistics allows a study conducted on a sample from a defined population to be extended to that population, provided that the research satisfies a number of conditions, one of which is that the sample is randomly chosen. This is one of the greatest strengths of statistics: the ability to extend findings from a sample to the entire population without having to analyze every single data point.
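As an illustration of extending a sample to a population, the sketch below builds an approximate 95% confidence interval for a population mean. It rests on stated assumptions: the sample is simulated rather than real, it is treated as a random sample, and the interval uses the normal approximation, which is reasonable for large samples (small samples would normally call for a t distribution instead). The function name is hypothetical.

```python
import math
import random
from statistics import NormalDist, mean, stdev

def mean_confidence_interval(sample, confidence=0.95):
    """Approximate confidence interval for a population mean (large samples)."""
    z = NormalDist().inv_cdf(0.5 + confidence / 2)  # about 1.96 for 95%
    centre = mean(sample)
    margin = z * stdev(sample) / math.sqrt(len(sample))
    return centre - margin, centre + margin

# Simulated measurements standing in for a random sample of 200 people.
rng = random.Random(0)
sample = [rng.gauss(170, 10) for _ in range(200)]

low, high = mean_confidence_interval(sample)
print(f"sample mean = {mean(sample):.1f}")
print(f"95% confidence interval = ({low:.1f}, {high:.1f})")
```

The interval is computed from 200 values only, yet it makes a statement about the mean of the whole population the sample was drawn from.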

Statistics in Big Data and Artificial Intelligence

In the age of big data and artificial intelligence (AI), that is, reasoning and other intellectual abilities demonstrated by machines rather than humans, statistics is being used in education and research more than ever. Often combined with computer science and engineering, statistics is used in many different capacities, such as building probability models through which complex data can be filtered to produce a model of best fit[6]. Even now, statistics continues to be transformed and applied in new ways to cope with the growing size and complexity of big data, as well as the many other rapid advancements being made in artificial intelligence.

While a large portion of big data consists of quantitative data, the statistical analysis of qualitative data also plays a large role. Notably, the analysis of text by artificial intelligence using statistical techniques has come to the forefront of modern applied statistics. Text mining is the process of deriving information, such as underlying sentiments, from a piece of text. It is intertwined with sentiment analysis, which, put simply, is the analysis of subjective content in textual data. The fundamental purpose of sentiment analysis is to classify text by its underlying sentiment: positive, negative, or neutral[7]. For example, “the chef was very friendly” has a positive underlying sentiment, while “but the food was mediocre” has a negative one.
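The classification task can be made concrete with a deliberately simple sketch. The word lists and function name below are hypothetical, and counting words from a fixed list is far cruder than the statistical and deep learning methods used in practice; the example only shows what it means to sort a sentence into positive, negative, or neutral.

```python
# Tiny hand-made word lists, for illustration only.
POSITIVE_WORDS = {"friendly", "great", "delicious", "excellent"}
NEGATIVE_WORDS = {"mediocre", "rude", "bland", "terrible"}

def classify_sentiment(sentence):
    """Label a sentence by comparing counts of positive and negative words."""
    words = [word.strip(".,!?\"'").lower() for word in sentence.split()]
    score = (sum(word in POSITIVE_WORDS for word in words)
             - sum(word in NEGATIVE_WORDS for word in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

for sentence in ["The chef was very friendly",
                 "but the food was mediocre",
                 "The restaurant opens at noon"]:
    print(f"{sentence!r}: {classify_sentiment(sentence)}")
```

A word list like this fails on negation ("not friendly") and sarcasm, which is one reason more complex models are needed.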

While earlier statistical techniques were too underdeveloped for sentiment analysis to work effectively, recent developments in deep learning, a subfield of AI that loosely mimics the workings of the human brain through layered neural networks[8], have allowed for richer, more complex sentiment analysis. Sentiment analysis is a main application of natural language processing (NLP), the study of how computers can analyze human language and, among other things, draw conclusions about the connotation of a piece of text. It is often used to measure sentiment in corporate financial statements. For example, when a company's top management comments on its quarterly or annual performance, the level of positivity in the comment can be analyzed through NLP. The management report is generally a piece of unstructured text, which NLP converts into a structured format that AI can then interpret. Through this process, the performance of companies can be gauged more effectively and accurately.

To train computers to identify these implicit undertones, researchers must first provide them with a set of data related to the task. This training method also goes beyond sentiment analysis: if a machine is being trained to recognize and locate a human face in an image, as is common in phone camera applications, it must be given a large data set of pictures containing human faces to train on.

This data set can be split into three sections: training data, validation data, and testing data. Training data is the data that helps the AI model learn by picking up patterns within it. It consists of two parts: input information and corresponding target answers. Given the inputs, the model is trained to output the target answers as often as possible, and it can run over the training data numerous times until a solid pattern is identified. Validation data is structured like training data in that it has both inputs and targets. By running the validation inputs through the model, it is possible to see whether it produces the target information for data it was not trained on, which indicates whether training has been successful. Testing data, which comes after both training and validation data, is a series of inputs presented to the model without any target information. Mimicking real-world applications, it aims to recreate the realistic environment in which the model will eventually run. Testing data makes no improvements to the existing model. Instead, it tests whether the model can make accurate predictions on a consistent basis[9]. If it does, the program is judged ready for real-world use.
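The three-way split described above can be sketched as follows. The proportions (70% training, 15% validation, 15% testing), the function name, and the placeholder examples are assumptions made for illustration; the source does not prescribe particular sizes.

```python
import random

def split_dataset(examples, train_fraction=0.7, validation_fraction=0.15, seed=0):
    """Shuffle the examples and cut them into training, validation, and testing sets."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)  # fixed seed so the split is repeatable
    n_train = round(len(shuffled) * train_fraction)
    n_validation = round(len(shuffled) * validation_fraction)
    train = shuffled[:n_train]
    validation = shuffled[n_train:n_train + n_validation]
    test = shuffled[n_train + n_validation:]  # whatever remains
    return train, validation, test

# Placeholder data: each example pairs an input with a target answer.
examples = [(f"input_{i}", i % 2) for i in range(100)]
train, validation, test = split_dataset(examples)
print(len(train), len(validation), len(test))
```

Shuffling before cutting matters: if the examples were stored in some order, such as all positive cases first, an unshuffled split would give the three sets different compositions.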

An example of these types of data can be found in AlphaGo, a computer program designed to play Go, a two-player board game in which the players alternate placing black and white stones. The goal is to enclose as much of the board’s territory as possible. Large numbers of recorded games between strong human players made up the training data used to teach the AlphaGo program. After the program had analyzed the moves taken by those players, the creators of AlphaGo set up different versions of the program to play against each other, which served as its validation data. AlphaGo’s widely broadcast matches against professional players, most notably Lee Sedol, served as the program’s testing data[10].

The quality and quantity of training data are also crucial in creating an effective AI model. A large set of refined data will help the AI identify statistical patterns and thereby fulfill its purpose more accurately. The facial recognition example illustrates the point: if a large set of images containing human faces is given to the AI during training, it will be able to recognize patterns within human faces, such as the presence of two eyes, a nose, and a mouth, and thereby increase its success rate in identifying faces during testing. However, if images of trees and stones are mixed into what is supposed to be a set of face images, the AI program may find it more difficult to perceive patterns within the data, and consequently become less effective in fulfilling its purpose. Moreover, a larger set of training data allows an AI model to make more accurate predictions, since it has a larger pool of information in which to identify patterns.

Training data is used for a range of purposes, such as the aforementioned image recognition and sentiment analysis, as well as spam detection and text categorization. A common risk across these different uses, however, is training done wrongly. Because artificial intelligence can mimic the process of human thought, inputs paired with incorrect target results could create a machine with a harmful thought process. For example, if an AI program is continuously shown images of aircraft being bombed and taught that the target result should be positive, then the machine may treat terrorist bombings or warfare as positive when applied to real life.

Artificial intelligence, like all things created by mankind, retains the potential to be used for a malevolent cause. In particular, because we do not understand all of the statistical techniques being used by computers to analyze training data, we must continue to tread cautiously in our efforts to develop and understand AI through the application of statistics.

The statistical methods used to understand and categorize big data are by no means as simple as those used by human statisticians; in fact, many of the mechanisms used by computers to find and analyze patterns in data sets still remain a mystery to us. They cannot be labeled with discrete descriptions such as “standard deviation” or “normal distribution.” Instead, they are an amalgamation of various complex pattern-identifying and data-processing techniques.

Furthermore, the statistical techniques used in big data and artificial intelligence are somewhat different from previous applications of statistics. For example, the previously mentioned training data is a concept that became part of statistical practice only with the subject’s application to AI. Statistics, which had dealt almost exclusively with quantitative data in the past, is now also used to analyze qualitative data, creating a need for this training data. Training data also points to another difference between conventional and modern applications: statistics in AI and machine learning typically relies on supervised learning, in which relationships are learned from labelled examples, while conventional statistics typically relies on explicitly specified models such as regression analysis[11].

Conventional statistics is more intuitive to humans but limited in its usage. On the other hand, statistics in AI and machine learning is essentially a black box that cannot be explained through previous rules, but proves more efficient in deriving implications from larger and more diverse sets of data.

However, despite these many distinctions, the subject’s fundamental purpose has not changed: statistics, in the end, is an effort to approach phenomena mathematically, identify patterns in data, and apply our findings to new situations. Consequently, recent developments in statistics and its traditional applications should be used in conjunction, each offsetting the other’s drawbacks with its strengths.

Criticisms of Statistics

Apart from the concerns raised about the use of statistics in artificial intelligence and big data, conventional statistics also has its fair share of criticisms. Statistics is a constantly changing, improving discipline, and there continue to be imperfections in it that we should be cautious of whenever we use statistical analysis. For example, in 2012, statistician Nate Silver used statistical analysis to successfully predict the results of the presidential election in all 50 U.S. states[12]. While this brought much media attention to the role of statistics in fields beyond those it was commonly associated with, it also led to what could arguably be called an overreliance on statistical prediction in the next U.S. presidential election. As this example shows, there are certainly shortcomings in statistics, both in the collection of statistical data and in our use of it.

Among the criticisms frequently made about the subject, there is a recurring theme: they often condemn how statistics distorts our perception of phenomena by oversimplifying them. While statistics is a convenient tool for perceiving the message carried by large sets of data, it is, in the end, a discipline based on averages and predictions. The real world does not always behave accordingly, and so it often deviates from statistical predictions.

Moreover, data analysis is mostly done on quantitative data, so qualitative aspects of socioeconomic phenomena are often underrepresented in statistical results. This also makes it easier for data to be used to understate or exaggerate the issue at hand, making some statistical claims unreliable[13]. However, we do need some form of numeric representation for situations that require comparison, so using statistics is necessary. This is why overreliance on statistical analysis is both easy and dangerous.

One example is the overreliance on GDP statistics: a rising GDP is usually taken to mean that the economic situations of most citizens of a country are improving. This is not always the case, especially in countries where economic disparity is also widening. The individual welfare of the population is not accurately or entirely reflected in the GDP of a nation, which only describes its overall economic status, including its corporations, government, and net exports. Therefore, relying only on GDP statistics may lead to an inaccurate analysis of the personal welfare of the people.
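A small numerical sketch shows how an aggregate figure can rise while the typical person's situation stays the same. The incomes below are invented and stand in only loosely for a per-person average such as GDP per capita: between the two years only the two highest incomes grow, so the mean rises by 30% while the median, the income of the person in the middle, does not move.

```python
from statistics import mean, median

# Invented annual incomes (in thousands) for ten people in two different years.
incomes_before = [20, 22, 25, 28, 30, 35, 40, 50, 80, 170]
incomes_after = [20, 22, 25, 28, 30, 35, 40, 50, 100, 300]

print(f"mean income:   {mean(incomes_before)} -> {mean(incomes_after)}")
print(f"median income: {median(incomes_before)} -> {median(incomes_after)}")
```

Reporting the median or the full distribution alongside the average is one way to keep this kind of disparity visible.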

Statistics, in the end, is a discipline of averages and predictions. No matter how much effort researchers put into refining the analysis methods of numerical data, they will always fall short of being able to fully represent a real-life phenomenon by only deploying numbers. Statistics will always fall short of giving a definite answer about virtually anything. All conclusions made about hypotheses are never certain, and comparisons between two sets of data at best give a solid prediction. However, it must also be understood that this is the very definition of statistics. Statistics serves to give a better interpretation of complicated issues by removing certain factors that bring about uncertainty during the process of research; thus, it may be too much to expect statistics to be able to give an exact one-to-one portrayal of the situation it is analyzing. It is, like all other disciplines, used best when amalgamated with other approaches and fields.

The Future of Statistics

Statistics, with its ability to explore social phenomena by testing hypotheses and to reliably interpret nonphysical trends, is a rapidly growing discipline in the modern world. Because it can be used in conjunction with a variety of other subjects such as mathematics, economics, the social sciences, and computer science, statistics is relevant and necessary in many different fields.

While the future of statistics is not entirely clear (predictions vary on which domain it will be used in most often, and which spheres of knowledge it will most frequently intermingle with), it is safe to say that statistics will take on a role in our future at least as important as, if not greater than, the one it has now.

Statistics has already played a large role in helping us understand general trends in data, and with the world becoming increasingly interconnected, this aspect of statistics will only become more necessary. Big data and artificial intelligence are becoming the centerpiece of modern technological development, and because the statistical techniques used in these fields are very different from, and go well beyond, the mechanisms previously used by human statisticians, adapting data analysis and statistical practice to this new trend is all the more necessary.

Combined with statistics, big data and AI have been explored in numerical and textual analysis for many years. This is not, however, the limit of their potential: efforts are already being made to expand their use into fields of human creation, such as the arts. A major example is the development, by a team of researchers from MIT, of artificial intelligence algorithms that find similarities between paintings by various artists[14]. In a world that is increasingly reliant on more types and greater amounts of big data, statistics must evolve to fit its needs, and at this moment it seems to be on the right path.

References

[1] “History of Statistics.” Wikipedia, Wikimedia Foundation, 15 Aug. 2020, https://en.wikipedia.org/wiki/History_of_statistics. Accessed 19 Aug. 2020.

[2] “Statistics.” Wikipedia, Wikimedia Foundation, 17 Aug. 2020, https://en.wikipedia.org/wiki/Statistics. Accessed 19 Aug. 2020.

[3] Fisher, R. A. “Studies in Crop Variation. I. An Examination of the Yield of Dressed Grain from Broadbalk: The Journal of Agricultural Science.” Cambridge Core, Cambridge University Press, 27 Mar. 2009, www.cambridge.org/core/journals/journal-of-agricultural-science/article/studies-in-crop-variation-i-an-examination-of-the-yield-of-dressed-grain-from-broadbalk/882CB236D1EC608B1A6C74CA96F82CC3. Accessed 6 Oct. 2020.

[4] Scheaffer, Richard L., and Tim Jacobbe. “Statistics Education in the K-12 Schools of the United States: A Brief History.” Journal of Statistics Education, vol. 22, no. 2, 2014, pp. 1–14, https://doi.org/10.1080/10691898.2014.11889705. Accessed 15 Aug. 2020.

[5] Calmorin, L. Statistics in Education and the Sciences. Rex Bookstore, Inc., 1997.

[6] Secchi, Piercesare. “On the Role of Statistics in the Era of Big Data: A Call for a Debate.” Statistics & Probability Letters, vol. 136, 2018, pp. 10–14, https://www.sciencedirect.com/science/article/abs/pii/S0167715218300865. Accessed 16 Aug. 2020.

[7] Gupta, Shashank. “Sentiment Analysis: Concept, Analysis and Applications.” Towards Data Science, Medium, 19 Jan. 2018, https://towardsdatascience.com/sentiment-analysis-concept-analysis-and-applications-6c94d6f58c17. Accessed 19 Aug. 2020.

[8] Brownlee, Jason. “What Is Deep Learning?” Machine Learning Mastery, Machine Learning Mastery Pty. Ltd., 16 Aug. 2019, https://machinelearningmastery.com/what-is-deep-learning/. Accessed 20 Aug. 2020.

[9] Smith, Daniel. “What Is AI Training Data?” Lionbridge, Lionbridge Technologies, Inc., 28 Dec. 2019, https://lionbridge.ai/articles/what-is-ai-training-data/. Accessed 20 Aug. 2020.

[10] “AlphaGo: The Story so Far.” DeepMind, Google, 2020, https://deepmind.com/research/case-studies/alphago-the-story-so-far. Accessed 6 Oct. 2020.

[11] Shah, Aatash. “Machine Learning vs Statistics.” KDnuggets, KDnuggets, 29 Nov. 2016, www.kdnuggets.com/2016/11/machine-learning-vs-statistics.html. Accessed 19 Aug. 2020.

[12] O’Hara, Bob. “How Did Nate Silver Predict the US Election?” The Guardian, Guardian News and Media, 8 Nov. 2012, www.theguardian.com/science/grrlscientist/2012/nov/08/nate-sliver-predict-us-election. Accessed 21 Aug. 2020.

[13] Davies, William. “How Statistics Lost Their Power – and Why We Should Fear What Comes Next.” The Guardian, Guardian News and Media, 19 Jan. 2017, www.theguardian.com/politics/2017/jan/19/crisis-of-statistics-big-data-democracy. Accessed 21 Aug. 2020.

[14] Gordon, Rachel. “Algorithm Finds Hidden Connections between Paintings at the Met.” MIT News, Massachusetts Institute of Technology, 29 July 2020, https://news.mit.edu/2020/algorithm-finds-hidden-connections-between-paintings-met-museum-0729. Accessed 6 Oct. 2020.
