Statistics for Data Science and Analytics

By Peter C. Bruce, Peter Gedeck, Janet Dobbins

Description

Statistics for Data Science and Analytics is an excellent primary textbook for courses in preliminary statistics as well as a supplement for courses in upper-level statistics and related fields, such as biostatistics and econometrics. The book is also a general reference for readers interested in revisiting the value of statistics, and in gaining a true understanding of hypothesis tests and confidence intervals with examples using Python.

Statistics and Data Science

Statistics and data science are rapidly evolving to meet the needs of business, government, and research organizations. It is useful, though an oversimplification, to think of two main communities:

  1. Research Communities: Traditional academic and medical researchers following strict standards.

  2. Data Science Community: Businesses and organizations using statistical methods to extract value from data quickly, prioritizing reliability over academic rigor.

Most users now fall into the second category, where AI integrates statistical methods originating from the first group. This book aims to clarify relevant techniques for data science and uses resampling/simulation methods to make statistical inference understandable.

The book starts with examples of statistics in action, addresses study design, and considers the role of chance. It covers all standard introductory statistics topics (probability, descriptive statistics, inference, sampling, correlation) within relevant contexts.

Using Python with this Book

This book presents relevant Python code in the second part of each chapter and provides the tools you need to implement the statistical procedures that are discussed in this book. Because many of these procedures are based on iterative resampling, rather than simply calculating formulas, you will get useful practice with the data handling and manipulation that is a Python strength. No specific level of Python ability is required to get started.
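As a rough illustration of what "iterative resampling" means here, the sketch below computes a percentile bootstrap interval for a sample mean using only the Python standard library. It is not taken from the book: the function name bootstrap_ci, its parameters, and the sample values are all hypothetical, chosen only to show the general idea of drawing repeated resamples instead of applying a formula.

```python
import random
import statistics


def bootstrap_ci(data, stat=statistics.mean, n_resamples=2000, alpha=0.10, seed=0):
    """Percentile bootstrap interval for a statistic (illustrative sketch only)."""
    rng = random.Random(seed)  # fixed seed so the example is reproducible
    n = len(data)
    estimates = []
    for _ in range(n_resamples):
        # Draw n values from the data with replacement and recompute the statistic
        resample = rng.choices(data, k=n)
        estimates.append(stat(resample))
    estimates.sort()
    # Take the alpha/2 and 1 - alpha/2 percentiles of the resampled statistics
    lower = estimates[int((alpha / 2) * n_resamples)]
    upper = estimates[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical measurements, invented for this example
sample = [12.1, 9.8, 11.4, 10.3, 13.7, 8.9, 10.8, 11.9, 9.5, 12.6]
low, high = bootstrap_ci(sample)
print(f"sample mean: {statistics.mean(sample):.2f}")
print(f"90% bootstrap interval: ({low:.2f}, {high:.2f})")
```

The loop is the whole method: resample, recompute, repeat, then read the interval off the sorted results. Libraries such as NumPy or pandas can do the same more compactly, but the plain-Python version keeps each step visible.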

Resources

Short Answers

These files are answers to short questions and exercises in the text. If you are using the print version of the text, they are provided in print at the end of each chapter. If you are using the e-book, the links will take you directly to the relevant PDF.

Getting Started with Python

Instructions for installing Python

Videos

Videos Mentioned in the Text

GitHub Repository

You will find the following files:

  • datasets.zip: contains all datafiles used in the book

  • notebooks.zip: Jupyter notebooks with code from chapters and the Python sections of each chapter

  • python.zip: raw Python files

About us

Peter C. Bruce

Peter Bruce is the Founder of the Institute for Statistics Education, a privately owned online educational institution. Since its creation in 2002, the Institute has specialized in introductory and graduate-level online education in statistics, machine learning, data science, optimization, and other subjects in quantitative analytics.

Prior to founding the Institute, in partnership with the noted economist Julian Simon, Peter continued and commercialized the development of Simon's Resampling Stats, a tool for bootstrapping and resampling. In his work at Cytel Software Corp., he developed Box Sampler along similar lines, and helped bring XLMiner, a machine learning add-in for Excel, to market. He has authored a number of journal articles in the area of resampling, and is a co-author of Practical Statistics for Data Science and Machine Learning for Business Analytics. He is also the author of Introductory Statistics and Analytics, which was developed in consultation with members of the statistical education community in the American Statistical Association and with the Guidelines for Assessment and Instruction in Statistics Education (GAISE) in mind. The text was class tested for nearly a decade in the Introductory Statistics for College Credit course at Statistics.com. Early in his career, he co-authored (with D. Traynham) a noted review of airline deregulation in the National Review (May 1980).

Prior to his retirement in 2024, Peter's role at the Institute centered on course development and faculty recruitment. The Institute has over 60 faculty members from around the world who are published experts in their fields; most teach from their own texts. Peter also teaches a course on resampling methods.

Peter has degrees in Russian from Princeton and Harvard, and an MBA from the University of Maryland; he is an autodidact in the area of statistics. Prior to his work in statistics, Peter worked in the US diplomatic corps as a Foreign Service Officer.

Dr. Peter Gedeck holds a Ph.D. in chemistry. He worked for twenty years as a computational chemist in drug discovery at Novartis in the United Kingdom, Switzerland, and Singapore. His research interests include the application of statistical and machine learning methods to problems in drug discovery. He is a scientist in the research informatics team at Collaborative Drug Discovery, which offers the pharmaceutical industry cloud-based software to manage the huge amount of data involved in the drug discovery process.

Peter’s specialty is the development of machine learning algorithms to predict biological and physicochemical properties of drug candidates. His scientific work is published in more than 50 peer-reviewed articles and five books.

Peter is also a lecturer at the University of Virginia's School of Data Science teaching courses for the Master's program.

Janet Dobbins is the Chair of the Board of Directors for Data Community DC (dc2), a nonprofit 501(c)(3) organization committed to connecting and promoting the work of data professionals in the National Capital Region by fostering education, opportunity, and professional development through high-quality, community-driven events, resources, products, and services. She worked for nearly twenty years as a Vice President of Strategic Partnerships at The Institute for Statistics Education at Statistics.com. She directed community outreach, communication, and marketing efforts, working with colleges, universities, and industry teams to develop innovative curricula and help teams acquire necessary skills.

Janet co-organizes monthly Data Science DC meetups and works with the Arlington Tech Program (an alternative Arlington Public High School) to create a mentorship program for girls in grades 9-12 who are interested in STEM.

Coming soon - membership site for instructors