Preface

Our lives are shaped by the predictions we make daily. The process of predicting is not static: we want to improve our predictions to avoid catastrophes in our lives … We learn from our mistakes. We are all self-learning, walking machines with very limited processing capacity. Can we develop a self-learning algorithm for a high-capacity machine that makes predictions more efficiently and accurately for us? Yes, we can: with well-developed statistical models implemented in effective algorithms and run on high-capacity computers.

This book takes on the first part, statistical models, without too much abstraction. It does not teach every aspect of programming, but it covers enough coding skills that you can find your way in building predictive algorithms with R. Everything ultimately runs on a computer, so we also need to know enough about the “machines” to use them efficiently. The book has enough of that too …

In Statistical Modeling: The Two Cultures, Leo Breiman (Breiman 2001a) argues that there are two goals in analyzing data:

Prediction: To be able to predict what the responses are going to be to future input variables;
Information: To extract some information about how nature is associating the response variables to the input variables.

And there are two approaches towards those goals:

The Data Modeling Culture: One assumes that the data are generated by a given stochastic data model (econometrics) …
Algorithmic Modeling Culture: One uses algorithmic models and treats the data mechanism as unknown (machine learning) …

And he describes the current state:

…With the insistence on data models, multivariate analysis tools in statistics are frozen at discriminant analysis and logistic regression in classification and multiple linear regression in regression. Nobody really believes that multivariate data is multivariate normal, but that data model occupies a large number of pages in every graduate text book on multivariate statistical analysis…

Broadly speaking, many social scientists look at statistical analysis through the lens of causal inference. Most courses in their training have been (and are) based on inferential statistics, covering regression-based parametric approaches using interface-based statistical packages (like EViews, Stata, SPSS, and SAS). Since the demand for broader, more inclusive “data analytics” courses has been rising in the last decade, most departments (Economics, Finance, and other Business fields) are looking for a data analytics course that is less foreign to their traditional curriculum. This integration is important for two reasons. First, “predictive” and nonparametric methods are rarely given the front seat in conventional curricula: we have “forecasting” courses, but they mostly cover conventional parametric time series using ARIMA/GARCH-type applications. Second, interface-based statistical packages are no longer enough for unconventional nonparametric applications; R and Python are the languages that students increasingly demand in all Business schools.

This should not be surprising. First, machine learning is new to many fields: not only is the concept relatively new, but its “language” is different: hyperparameters, classification, features, the bias-variance trade-off, tuning, and so on. Further, the culture in our quantitative courses is different. We do not quite “understand” how high prediction accuracy in itself could be a focal point in data analytics: even if ice-cream sales predicted crime rates very well, the result would be useless. At least, this is what many policy analysts think.

The structure of the book is different from that of many other books on machine learning. First, unlike most machine learning books, it is not written mainly for practitioners. For example, the initial chapters are positioned for a smooth transition from inferential statistics and the “parametric world” to predictive models, with the help of a section that covers nonparametric methods. Even at the PhD level, we rarely teach nonparametric methods, as they have been less applied in inferential statistics. Nonparametric econometrics, however, makes the link between the two cultures (data modeling and algorithmic modeling), as machine learning is an extension of nonparametric econometrics. The order of chapters follows this transition, so our first nonparametric application, kNN, appears in Chapter 8 (see the sketch below). After the traditional tree-based models, the book also covers Support Vector Machines and Neural Networks. These chapters can be skipped, but they are self-contained, so that even students with a weak background in linear algebra can understand these “black-box” models.
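To give a flavor of that transition, here is a minimal, hypothetical R sketch contrasting the two cultures on simulated data; the data-generating process, the prediction grid, and the choice of k = 15 are all made up for illustration:

```r
# A minimal sketch on simulated data (k = 15 is an arbitrary choice)
set.seed(1)
n <- 200
x <- runif(n, 0, 10)
y <- sin(x) + rnorm(n, sd = 0.3)   # the "unknown" data-generating mechanism

# Data modeling culture: assume a (linear) stochastic model for the data
fit_lm <- lm(y ~ x)

# Algorithmic modeling culture: predict y at x0 by averaging its k nearest neighbors
knn_predict <- function(x0, x, y, k = 15) {
  nbrs <- order(abs(x - x0))[1:k]  # indices of the k closest observations
  mean(y[nbrs])
}

x_grid   <- seq(0, 10, by = 0.1)
yhat_knn <- sapply(x_grid, knn_predict, x = x, y = y)
yhat_lm  <- predict(fit_lm, newdata = data.frame(x = x_grid))
```

The kNN fit tracks the nonlinear pattern without assuming any functional form, which is exactly the bridge from the parametric world that the early chapters build.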

The book covers the concepts with applications. However, most applications use “toy data” to keep the book generalizable to other fields with a similar curriculum in inferential statistics. We will have supplementary books covering field-specific data with applications; I do not believe that we need a separate book for each field (accounting, management, finance, economics, political science, sociology, and so on). The first supplementary online book, using real data from economic and financial sources, will be ready very soon.

After the well-known predictive algorithms, collected in two sections, the book proceeds to a set of additional sections. Penalized Regressions covers the well-known high-dimensional methods in economics and finance used for model selection and sparsity. The following section covers Dimension Reduction Methods, which are important tools for “noise” reduction in almost all fields of social science; it puts the main methods (EVD, SVD, rank approximations, PCA, FA, DMD) together with applications. The next section gives a new look at time-series applications by showing how time-series data can be used in predictions beyond traditional parametric models. The last section covers Network Analysis, a concept that is also very prevalent in many business fields, especially in economics and finance; it summarizes the new developments in graphical network analysis.
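As a small taste of the dimension reduction section, here is a minimal sketch, using base R only and simulated data, of the tight connection between two of those tools: the principal component scores from PCA can be recovered from the SVD of the centered data matrix (the columns may differ in sign, which is why the comparison uses absolute values):

```r
# Minimal sketch: PCA via SVD of the centered data matrix (simulated data)
set.seed(1)
X  <- matrix(rnorm(100 * 5), nrow = 100, ncol = 5)
Xc <- scale(X, center = TRUE, scale = FALSE)  # center each column

s <- svd(Xc)                # Xc = U D V'
scores_svd <- Xc %*% s$v    # principal component scores

pca <- prcomp(X, center = TRUE, scale. = FALSE)

# TRUE: identical up to the sign of each component
all.equal(abs(unname(scores_svd)), abs(unname(pca$x)))
```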

Finally, the chapters in the Appendix provide enough information on two important subjects: algorithmic optimization (including gradient descent applications, as in the sketch below) and classification with imbalanced data. Moreover, there will be some additions to several chapters covering Conditional Inference Trees, General CV, and Causal Random Forest.
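For instance, a gradient descent application in the Appendix could look like the following minimal sketch, which recovers OLS coefficients iteratively on simulated data; the learning rate and the number of iterations are arbitrary illustrative choices:

```r
# Minimal sketch: gradient descent for OLS on simulated data
set.seed(1)
n <- 500
X <- cbind(1, rnorm(n))             # design matrix with an intercept
beta_true <- c(2, -3)
y <- X %*% beta_true + rnorm(n)

beta <- c(0, 0)                     # starting values
lr   <- 0.1                         # learning rate (an arbitrary choice)
for (i in 1:2000) {
  grad <- -2 * t(X) %*% (y - X %*% beta) / n  # gradient of the mean squared error
  beta <- beta - lr * grad                    # step against the gradient
}

# The iterative solution converges to the closed-form OLS estimates
cbind(gradient_descent = beta, closed_form = coef(lm(y ~ X[, 2])))
```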

I hope that this book provides a good starting point for giving predictive analytics a well-deserved place in the curricula of social science and business fields. I also hope that this book will “grow” over time to keep up with a fast-changing field. Hence, I guess, it will always remain a “draft”, in a humble way …

Who

This book is targeted at motivated students and researchers who have a background in inferential statistics using parametric models. It is applied in nature: I skip many theoretical proofs and justifications that can easily be found elsewhere. I do not assume previous experience with R, but some familiarity with coding helps.

Acknowledgements

This book was made possible by Mutlu Yuksel, Tolga Kaya, Mehmet Caner, Juri Marcucci, Atul Dar, and Andrea Guisto. This work is greatly inspired by the following books and people:

  1. Introduction to Statistical Learning by Gareth James, Daniela Witten, Trevor Hastie, and Robert Tibshirani.
  2. Introduction to Data Science by Rafael A. Irizarry.
  3. Applied Statistics with R by David Dalpiaz.
  4. R for Statistical Learning by David Dalpiaz.

I’ve also greatly benefited from my participation in the SiDe Summer Schools: in 2019, on Machine Learning Algorithms for Econometricians, taught by Arthur Carpentier and Emmanuel Flachaire, and in 2017, on High-Dimensional Econometrics, taught by Anders Bredahl Kock and Mehmet Caner. I never stop learning from these people.

I also thank my research assistant Kyle Morton. Without him, this book wouldn’t be possible.

Finally, my wife and my son, Isik and Ege: you are always my compass, helping me find my purpose …

References

Breiman, Leo. 2001a. “Statistical Modeling: The Two Cultures (with Comments and a Rejoinder by the Author).” Statistical Science 16 (3): 199–231. https://doi.org/10.1214/ss/1009213726.