Know Your Terms: Big Data and Precision Public Health

An article by Dr Wan Kim Sui (our Doctor of Public Health candidate), Professor Dr Moy Foong Ming, and Professor Dr Noran Naqiah Hairi has been published by The Malaysian Medical Gazette. They aim to raise awareness of the importance of big data and precision public health among healthcare professionals and the general public.

Big data has become ubiquitous in our daily lives and is indispensable in precision public health. Precision public health is “an emerging practice to more granularly predict and understand public health risks and customise treatments for more specific and homogeneous subpopulations, often using new data, technologies, and methods.” Fundamentally, the goal is to deliver the right intervention to the right population at the right time. Big data has been used successfully in surveillance and signal detection, predicting future risk, targeting interventions and understanding disease.

Big data is often described using four dimensions, the four V’s of big data: volume, velocity, veracity and variety. What makes data “big” is its sheer volume, which can be at least 100 terabytes for large companies or organisations. Velocity is the speed at which incoming data must be processed, whereas veracity refers to the data’s accuracy or trustworthiness. Variety refers to the different forms that data can take.

One way to scale up data volume is to link existing databases, which can yield many benefits. However, linking databases comes with its own set of challenges. Besides the considerable resources required in terms of money, material and technical expertise, there are real obstacles: differences in database structures, ownership issues, data confidentiality, and ethical and legal concerns.

Nevertheless, all is not lost. Data integration (also called data fusion, data matching or data merging) is an emerging methodological field that enables researchers to pool data drawn from multiple existing studies. At its most basic, combining data means merging information from different datasets that share some common variables. The resulting dataset allows more flexible analysis than analysing each dataset separately.
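
To make the idea of merging on a common variable concrete, here is a minimal Python sketch. It is only an illustration, not the method used in the article: the helper merge_on_key, the field names (patient_id, year, a1c, sex) and all values are hypothetical, and it assumes both datasets share an exact, reliable identifier.

```python
from collections import defaultdict


def merge_on_key(left, right, key):
    """Inner-join two lists of dict records on a shared key (hypothetical helper)."""
    # Index the right-hand records by the common variable for fast lookup.
    index = defaultdict(list)
    for row in right:
        index[row[key]].append(row)

    merged = []
    for row in left:
        # Records with no match in the right-hand dataset are skipped.
        for match in index.get(row[key], []):
            # Combine the fields of both records into one row.
            merged.append({**row, **match})
    return merged


# Illustrative records only; names and values are made up.
clinic_visits = [
    {"patient_id": "P001", "year": 2015, "a1c": 8.2},
    {"patient_id": "P002", "year": 2015, "a1c": 6.9},
    {"patient_id": "P003", "year": 2015, "a1c": 9.1},
]
demographics = [
    {"patient_id": "P001", "sex": "F"},
    {"patient_id": "P002", "sex": "M"},
]

combined = merge_on_key(clinic_visits, demographics, "patient_id")
```

Because this sketch keeps only records found in both datasets, the record for P003 is dropped. In practice, deciding how to treat unmatched or inconsistently identified records is a large part of the data integration work.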

In our project, six different datasets from the National Diabetes Registry were merged to form a five-year longitudinal cohort dataset. The resultant dataset was then used to address several research objectives, including the trends in glycosylated haemoglobin (A1C), blood pressure and LDL-cholesterol among patients with diabetes, and the time to treatment intensification among those with uncontrolled A1C. Surveillance of A1C, blood pressure and LDL-cholesterol increases our understanding of the quality of diabetes care in public health clinics. High-risk subpopulations were identified, which allows targeted intervention. Clinical inertia in diabetes management was also quantified.