Data Preparation and Exploration

Data scientists spend more than two-thirds of their time cleaning, preparing, exploring, and visualizing data before it is ready for modeling and mining. This textbook covers the important steps of data preparation and exploration that anyone who works with data should know, and it is an excellent companion to our other textbook, Introduction to Biomedical Data Science. The methods we include span spreadsheet and statistics package approaches as well as the programming languages R and Python. The reader is introduced to the free statistics packages Jamovi and BlueSky Statistics, and multiple techniques for data visualization are presented. Medical datasets are used for demonstrations and student exercises. Importantly, chapter content is supplemented with YouTube videos. Chapters are well referenced (100+ references), and there is a chapter on health data resources so readers can find data to prepare and explore on their own. Prominent issues such as how to handle missing data and imbalanced datasets are covered, along with sections on descriptive statistics, visualization, correlations, handling duplicates and outliers, scaling, standardization, and much more. A downloadable Data Checklist is available on

AIM Magazine Book Reviews

Review of Introduction to Biomedical Data Science

This volume on biomedical data science is an excellent reference for all those interested in data science in the healthcare domain. Most reference books in this domain lack direct relevance for the clinician, but this one is much more relatable. What is very helpful is that one does not need programming skills to enjoy the interactive sections of the book. It also has an impressive balance of theory and practice: while it covers essential topics such as an overview of biomedical data science (data analytical processes and major types of analytics), a biostatistics primer, an introduction to databases, and machine learning, it also has chapters on practical topics such as spreadsheet tools and tips as well as programming languages for data analysis. There is also a very helpful section on biomedical data science resources, as well as exercises (with step-by-step instructions) and references for each of the chapters. This compendium makes the book an ideal companion for everyone from the beginner student to the more advanced practitioner.

Review of Data Preparation and Exploration: Applied to Healthcare Data

The authors of the aforementioned book have also published this companion volume on the part of the data science project that is both daunting and fundamental: data preparation and exploration. This book is the perfect complement to Introduction to Biomedical Data Science, as it is a very hands-on and practical resource for data preparation (raw data to data cleanup and leakage) and exploration (near zero variance, scaling, binning, and dimension reduction). The third section focuses on the background of a data science project (from defining the problem all the way to deploying the model) and on automated data preparation and exploration, which makes this difficult area more manageable, especially for the beginner. Among the many strengths of this book is its use of biomedical datasets (including a chapter on health data resources) in demonstrations and exercises. Another advantage is its myriad resources, including many video clips on relevant topics. Like their prior work, this book also strikes a good balance between theory (topics such as imbalanced datasets and missing data reconciliation) and practice.

These two timely books reflect the decades of unparalleled wisdom and hands-on experience that Dr. Hoyt has in both biomedical informatics and biomedical data science. His personal mission to make informatics and data science available and accessible to all practitioners is evident in the details of both works. We all owe Dr. Hoyt a debt of gratitude for producing this set of works that is both enjoyable to peruse and useful to read.

Anthony Chang, MD, MBA, MPH, MS
Founder, AIMed
Chief Intelligence and Innovation Officer
Medical Director, The Sharon Disney Lund Medical Intelligence and Innovation Institute (MI3)
Children’s Hospital of Orange County
Editor, Intelligence-Based Medicine
Editor-in-Chief, Intelligence-Based Medicine