## — Mathematical statistics is vital in the 21st Century —

### (1) What is statistics?

When you think of the skills required for the 21st Century, you might think of reading, writing and computing. However, as we become overwhelmed with huge amounts of data in all areas of life, knowledge of statistics is becoming extremely important. We can even say that statistical literacy is vital for understanding and accurately interpreting all kinds of scientific and business data. More and more, statistical skills are necessary for all of us.

In any field, the accumulation of experience, combined with the collection and analysis of data, allows us to search for the best answers. Where there is data, there is statistics: you will find it in the natural sciences, medicine, education, psychology, sociology, economics, business administration, philology, and even sports. Students with a solid understanding of statistics will have a much deeper understanding of the true meaning of their own data and will also be able to correctly evaluate what they read and hear.

*“I keep saying that the sexy job in the next 10 years will be statisticians.”*

This is a famous quote from Google's chief economist, Hal Varian (Ph.D.), during a keynote speech in 2009. And now, some years later, we have truly arrived in the era of statistics.

Professor Makoto Aoshima

Assistant Professor Kazuyoshi Yata

### (2) Could you explain, in simple terms, the details of your research?

One of our research themes is constructing theories and methodologies for high-dimensional data analysis. A defining feature of modern scientific data is its vast number of dimensions. In many such datasets, the number of dimensions (*p*) is far greater than the number of samples (*n*); that is, they are high-dimension, low-sample-size datasets (*p >> n*). For example, genomic data can have tens of thousands to millions of dimensions (corresponding to the number of genes), while the sample size (number of subjects) is only on the order of several tens.

Fig. 1. DNA microarray data: the light intensity at each point indicates the expression level of the gene.

The number of dimensions (number of genes) is around 10,000. The number of samples (number of subjects) is around 50.

Previously, statistics did not need to analyze datasets with a vast number of dimensions; the assumption was that the number of samples would always be considerably greater than the number of dimensions. However, data has changed in the 21st Century, and conventional statistics does not guarantee statistical accuracy for datasets with high dimensions and low sample sizes. In fact, it has been mathematically demonstrated that conventional statistics may provide incorrect solutions when applied to such data. To analyze modern data, new ideas and approaches beyond the framework of conventional statistics are required.
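
One concrete way to see the difficulty is that the sample covariance matrix of *n* observations has rank at most *n* − 1. When *p* > *n* it is therefore singular, and any classical procedure that needs its inverse cannot be applied as is. The following is only a minimal illustrative sketch, not the method developed in our research; the function name `centered_rank`, the simulated normal data, and the sizes *n* = 5 and *p* = 50 are assumptions chosen for the example.

```python
import random


def centered_rank(rows, tol=1e-9):
    """Rank of the column-centered data matrix.

    This equals the rank of the sample covariance matrix S, because
    S = Xc^T Xc / (n - 1), where Xc is the centered n x p data matrix.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(row[j] for row in rows) / n for j in range(p)]
    a = [[row[j] - means[j] for j in range(p)] for row in rows]

    rank = 0
    for col in range(p):
        if rank == n:
            break
        # Partial pivoting: largest remaining entry in this column.
        pivot = max(range(rank, n), key=lambda i: abs(a[i][col]))
        if abs(a[pivot][col]) < tol:
            continue  # no usable pivot in this column
        a[rank], a[pivot] = a[pivot], a[rank]
        for i in range(rank + 1, n):
            factor = a[i][col] / a[rank][col]
            for j in range(col, p):
                a[i][j] -= factor * a[rank][j]
        rank += 1
    return rank


# Hypothetical HDLSS setting: n = 5 subjects, p = 50 variables.
rng = random.Random(0)
n, p = 5, 50
data = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]

rank = centered_rank(data)
print(f"p = {p}, n = {n}, rank of sample covariance = {rank}")
```

The reported rank is *n* − 1 = 4, far below *p* = 50, so the 50 × 50 sample covariance matrix cannot be inverted no matter how the five observations turn out.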

In a series of studies, we discovered geometric representations of the data space for handling datasets with high dimensions and low sample sizes. We have also provided statistical theories in the data dual space. From a unified approach that guarantees statistical accuracy of the developed methodologies, we constructed an asymptotic theory applicable to datasets with high dimensions and low sample sizes (*p >> n*).
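
A simple geometric feature of such data can be checked numerically: when the coordinates are independent with unit variance, the distance between any two observations, divided by √(2*p*), concentrates near 1 as *p* grows, so a small sample looks almost like the vertices of a regular simplex. The sketch below illustrates only this basic phenomenon under an assumed model of independent standard normal coordinates; it is not the dual-space construction itself, and the name `scaled_pairwise_distances` is hypothetical.

```python
import math
import random
from itertools import combinations


def scaled_pairwise_distances(n, p, rng):
    """Pairwise Euclidean distances among n points in R^p, divided by sqrt(2p).

    Assumes (for illustration only) i.i.d. standard normal coordinates,
    for which the squared distance between two points has mean 2p.
    """
    points = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
    scale = math.sqrt(2.0 * p)
    return [math.dist(x, y) / scale for x, y in combinations(points, 2)]


rng = random.Random(1)
n = 5  # small sample size, as in HDLSS data
for p in (10, 10_000):
    d = scaled_pairwise_distances(n, p, rng)
    print(f"p = {p:>6}: min = {min(d):.3f}, max = {max(d):.3f}")
```

For small *p* the scaled distances vary widely, whereas for large *p* they all lie close to 1. Regularities of this kind are what make an asymptotic theory in *p*, rather than in *n*, possible.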

A theory that guarantees the accuracy of high-dimensional data analysis was first developed by Aoshima and Yata (2011). This study was at the cutting edge of mathematics worldwide and has provided many useful results. It was awarded the international “Abraham Wald Prize in Sequential Analysis” in 2012. Their studies have also won national recognition with the 2012 Japan Statistical Society Achievement Award, which commended the work as follows: “Extremely ingenious and pioneering research, outside the framework of conventional multivariate analysis, has been developed. In a series of excellent theoretical research accomplishments, these gentlemen have produced a creative and meaningful methodology that has contributed tremendously to both theoretical and practical statistics.” These details appear in the Proceedings of the Annual Meeting of the Japan Statistical Society, No. 152 (pp. 7–8) and No. 153 (pp. 12–14).

Fig. 2. Geometric representations in three-dimensional dual space for datasets with high dimensions and low sample sizes: as the number of dimensions (*p*) increases, the dataset falls into one of two geometric representations, with the boundary between them determined by conditions on the data type. Please refer to Yata and Aoshima (2012) for the details.

### (3) Could you please give a message to those who want to study statistics in the future?

With the prevalence of computers, statistical software is now readily accessible to individuals. We can obtain some sort of analytical result simply by entering data. However, the mathematical assumptions on which such software is built, and the accuracy of the results obtained, appear to be frequently overlooked. This problem cannot be ignored, because incorrect decisions based on misused or misinterpreted data can lead to critical errors.

The recent rapid growth of information has given us new problems. On social networking sites such as Twitter, an enormous amount of varied data (e.g., tweets) relating to individual behavior is accumulated every day. The arrival of the big data era has given rise to new types of data that existing statistics never anticipated, creating a new set of associated problems.

As the big data era advances, flexible thinking will become essential for appropriately handling enormous and varied datasets. Demand for mathematical statisticians is expected to increase. Society also requires individuals who can adopt the correct mathematical approach when the existing statistical framework is inappropriate. We need people who can pioneer new fields to solve new problems; being able to apply existing techniques and software packages is no longer enough. Raising the level of mathematical statistics and truly understanding the mathematical background underlying statistics are more important than ever before. This is precisely the stance adopted in our research.

Proper mathematical learning of statistics requires an adequate education system and research environment. Such requirements will become increasingly important in the future. Therefore, we hope to sufficiently develop our statistical theories and methodologies and share them globally, and in particular, we aim to contribute to society by educating young people who are knowledgeable enough to thrive in the coming era.

Every year we welcome many visitors who are interested in our research and laboratory. Please visit our homepage: