Big Data, Scarce Data: Which One Fits Medicine?


When I visited CERN last spring, one catchphrase used during the visit stuck in my mind. For the ATLAS detector, at the heart of one of the four main experiments at CERN and one of the experiments that found experimental evidence for the Higgs boson (or a Higgs boson…), the interesting data were the equivalent of a 100-megapixel camera taking 400 photos per second (or maybe the other way around, but it does not change the sheer scale of things)!

This amount of data is what remains after all kinds of real-time software and hardware processing, because the raw data rate during normal operation (read: beam-on conditions) is close to 1 petabyte per second (MB, GB = 1000 MB, TB = 1000 GB and finally PB = 1000 TB)… and this is only for ATLAS. In fact, everything about the Large Hadron Collider (LHC) is big, from cost to equipment to human resources to the data generated. Nature had an interesting article about how the data are handled and distributed worldwide among the collaborators.


Now what about medicine? We hear a lot about big data in the biological sciences and medicine. The main problem, at least in medicine and in my opinion, is not that there is too much data for researchers and physicians but rather the other way around. Databases for clinical trials conducted at various levels (from internal trials at individual hospitals to more global trials) are not all, or at all(!), compatible with each other. Furthermore, numerous databases tend to be incomplete, not by design but simply because of the difficulty of filling them in and ensuring data integrity. While big data also sounds great for personalized medicine, personalized medicine by definition means low numbers of very specific medical conditions. Overall, we are unfortunately, at this point in time, in a scarce data mode.

The next big step for big data in medicine is a revolution with regard to database management, sharing and analysis. And yes, personalized medicine will likely mean bigger research consortia and more sharing of data. There is a lot to learn from the particle physics community and initiatives like the LHC. I do hope that the big data grant programs we are seeing in our country will address that as a priority. Until then, we will remain with incomplete or scarce data in medicine.

