Hi, I'm Victoria! I'm a data scientist working on open data.

My group focuses on understanding the effect of big data and computation on scientific inference. We are interested in:

How effectively does statistical methodology translate to big data settings?

Instead of collecting data to test a particular hypothesis, researchers are now generating hypotheses by direct inspection of the data, then using that same data to test those hypotheses. What counts as a significant finding in this case? Can we estimate how likely such a finding is to replicate in a new sample?
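To make the concern concrete, here is a minimal simulation sketch (assuming Python with NumPy and SciPy; all names and parameters are illustrative, not part of our work): if we "discover" the predictor that looks strongest in a pure-noise dataset and then test it on that same dataset, it often looks significant, yet it rarely replicates in an independent sample.

```python
# Sketch: hypotheses suggested by the data, then tested on the same data.
# All settings here are illustrative.
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
n_obs, n_predictors, n_trials = 100, 50, 1000
same_sample_sig, new_sample_sig = 0, 0

for _ in range(n_trials):
    X = rng.normal(size=(n_obs, n_predictors))
    y = rng.normal(size=n_obs)  # outcome is pure noise: no real effects exist
    # "Generate" a hypothesis by picking the predictor most correlated with y.
    corrs = [stats.pearsonr(X[:, j], y)[0] for j in range(n_predictors)]
    best = int(np.argmax(np.abs(corrs)))
    # Test the selected hypothesis on the same data that suggested it.
    _, p_same = stats.pearsonr(X[:, best], y)
    same_sample_sig += p_same < 0.05
    # Test the same hypothesis on an independent replication sample.
    X_new = rng.normal(size=(n_obs, n_predictors))
    y_new = rng.normal(size=n_obs)
    _, p_new = stats.pearsonr(X_new[:, best], y_new)
    new_sample_sig += p_new < 0.05

print(f"'Significant' on the discovery sample: {same_sample_sig / n_trials:.0%}")
print(f"Replicated in a fresh sample:          {new_sample_sig / n_trials:.0%}")
```

With 50 candidate predictors, the naive same-sample test comes out "significant" most of the time even though every effect is zero, while the replication rate stays near the nominal 5%.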

What information is needed to verify and replicate data science findings?

When computation is used in research, it becomes part of the methods used to derive a result. How should these computational steps be made openly available to the community for inspection, verification, replication, and re-use?
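One small illustration of what recording those steps can mean in practice (a sketch only; the file name, seed, and statistic are hypothetical): fix the sources of randomness and write the exact computational environment next to the result, so that someone re-running the analysis can check they obtain the same numbers.

```python
# Sketch: record enough provenance alongside a result for others to
# inspect and re-run it. Names, paths, and values are illustrative.
import json
import platform
import numpy as np

SEED = 20240101                      # fixed seed so the run is repeatable
rng = np.random.default_rng(SEED)

# Stand-in for the real analysis: a summary statistic of simulated data.
data = rng.normal(loc=0.2, scale=1.0, size=500)
result = {"mean": float(data.mean()), "std": float(data.std(ddof=1))}

# Record what a reader needs to reproduce and verify this result.
provenance = {
    "seed": SEED,
    "python": platform.python_version(),
    "numpy": np.__version__,
    "result": result,
}
with open("analysis_record.json", "w") as f:
    json.dump(provenance, f, indent=2)

print(json.dumps(provenance, indent=2))
```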

What tools and computational environments are needed for data science?

We have an opportunity to think about data science as a life cycle, from experimental design and databases through to the scientific findings, and to design tools and environments that enable reliable scientific inference at scale.