2019-01-24-lecture0-overview.pdf - What is Data Science...


What is Data Science?
January 24, 2019
Data Science, CSCI 1951A, Brown University
Instructor: Ellie Pavlick
HTAs: Wennie Zhang, Maulik Dang, Gurnaaz Kaur

Waitlist
• If you are not registered, make sure you are on the waitlist (link on CAB)
• We have a *little* wiggle room in the enrollment cap
• Indicate relevant extenuating circumstances, we will try to prioritize fairly

What is Data Science?
Moneyball!
Obama Campaign
Google’s “40 Shades of Blue”
Why Google has 200m reasons to put engineers over designers. The Guardian.
The Origin of A/B Testing. Nicolai Kramer Jakobsen.

Data Science = Magic
Data Science!

The Scientific Method
• Data Collection, Sampling, Cleaning and Processing
• Data Analytics, Visualization, Presentation
• Machine Learning, Forecasting, Modeling

What is Data Science?

Data “Science”
So many maps!

To be fair…
• Intuition plays a huge role in the scientific method (“make observations” is Step 1).
• Exploratory analysis is necessary; it’s okay to not be all rigor all the time.
But!
• Exploratory analysis (even when it involves the biggest of data) is meant to *form* a hypothesis, not test one.
• Good experimental design and rigorous statistics are essential if we want to make claims about how the world works.

“Eyeballing it”
• Facebook posts by age group (13-18, 19-22, 23-29, 30-65). Personality, Gender, and Age in the Language of Social Media: The Open-Vocabulary Approach. Schwartz et al. (2013).
• Frequent topics observed in 17,000 Science articles. Probabilistic Topic Models. Blei (2012).
• Similarity of words according to a word2vec model.

[Figure: per capita cheese consumption (lbs) correlates with the number of people who died by becoming tangled in their bedsheets, by year, 2000-2009; ρ = 0.95. Source: tylervigen.com]

Neural correlates of interspecies perspective taking in the post-mortem Atlantic Salmon: An argument for multiple comparisons correction
Craig M. Bennett (1), Abigail A. Baird (2), Michael B. Miller (1), and George L. Wolford (3)
(1) Psychology Department, University of California Santa Barbara, Santa Barbara, CA; (2) Department of Psychology, Vassar College, Poughkeepsie, NY; (3) Department of Psychological & Brain Sciences, Dartmouth College, Hanover, NH

INTRODUCTION
With the extreme dimensionality of functional neuroimaging data comes extreme risk for false positives. Across the 130,000 voxels in a typical fMRI volume the probability of a false positive is almost certain. Correction for multiple comparisons should be completed with these datasets, but is often ignored by investigators. To illustrate the magnitude of the problem we carried out a real experiment that demonstrates the danger of not correcting for chance properly.

METHODS
Subject. One mature Atlantic Salmon (Salmo salar) participated in the fMRI study. The salmon was approximately 18 inches long, weighed 3.8 lbs, and was not alive at the time of scanning.
Task. The task administered to the salmon involved completing an open-ended mentalizing task. The salmon was shown a series of photographs depicting human individuals in social situations with a specified emotional valence. The salmon was asked to determine what emotion the individual in the photo must have been experiencing.
Design. Stimuli were presented in a block design with each photo presented for 10 seconds followed by 12 seconds of rest. A total of 15 photos were displayed. Total scan time was 5.5 minutes.
Preprocessing. Image processing was completed using SPM2. Preprocessing steps for the functional imaging data included a 6-parameter rigid-body affine realignment of the fMRI timeseries, coregistration of the data to a T1-weighted anatomical image, and 8 mm full-width at half-maximum (FWHM) Gaussian smoothing.
Analysis. Voxelwise statistics on the salmon data were calculated through an ordinary least-squares estimation of the general linear model (GLM). Predictors of the hemodynamic response were modeled by a boxcar function convolved with a canonical hemodynamic response. A temporal high pass filter of 128 seconds was included to account for low frequency drift. No autocorrelation correction was applied.
Voxel Selection. Two methods were used for the correction of multiple comparisons in the fMRI results. The first method controlled the overall false discovery rate (FDR) and was based on a method defined by Benjamini and Hochberg (1995). The second method controlled the overall familywise error rate (FWER) through the use of Gaussian random field theory. This was done using algorithms originally devised by Friston et al. (1994).

GLM RESULTS
A t-contrast was used to test for regions with significant BOLD signal change during the photo condition compared to rest. The parameters for this comparison were t(131) > 3.15, p(uncorrected) < 0.001, 3 voxel extent threshold.
Several active voxels were discovered in a cluster located within the salmon’s brain cavity (Figure 1, see above). The size of this cluster was 81 mm³ with a cluster-level significance of p = 0.001. Due to the coarse resolution of the echo-planar image acquisition and the relatively small size of the salmon brain, further discrimination between brain regions could not be completed. Out of a search volume of 8064 voxels a total of 16 voxels were significant.
Identical t-contrasts controlling the false discovery rate (FDR) and familywise error rate (FWER) were completed. These contrasts indicated no active voxels, even at relaxed statistical thresholds (p = 0.25).

VOXELWISE VARIABILITY
To examine the spatial configuration of false positives we completed a variability analysis of the fMRI timeseries. On a voxel-by-voxel basis we calculated the standard deviation of signal values across all 140 volumes.
We observed clustering of highly variable voxels into groups near areas of high voxel signal intensity. Figure 2a shows the mean EPI image for all 140 image volumes. Figure 2b shows the standard deviation values of each voxel. Figure 2c shows thresholded standard deviation values overlaid onto a high-resolution T1-weighted image.
To investigate this effect in greater detail we conducted a Pearson correlation to examine the relationship between the signal in a voxel and its variability. There was a significant positive correlation between the mean voxel value and its variability over time (r = 0.54, p < 0.001). A scatterplot of mean voxel signal intensity against voxel standard deviation is presented to the right.

DISCUSSION
Can we conclude from this data that the salmon is engaging in the perspective-taking task? Certainly not. What we can determine is that random noise in the EPI timeseries may yield spurious results if multiple comparisons are not controlled for. Adaptive methods for controlling the FDR and FWER are excellent options and are widely available in all major fMRI analysis packages.
We argue that relying on standard statistical thresholds (p < 0.001) and low minimum cluster sizes (k > 8) is an ineffective control for multiple comparisons. We further argue that the vast majority of fMRI studies should be utilizing multiple comparisons correction as standard practice in the computation of their statistics.

REFERENCES
Benjamini Y and Hochberg Y (1995). Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society: Series B, 57:289-300.
Friston KJ, Worsley KJ, Frackowiak RSJ, Mazziotta JC, and Evans AC (1994). Assessing the significance of focal activations using their spatial extent. Human Brain Mapping, 1:214-220.
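
The salmon poster's argument about uncorrected thresholds can be illustrated with a small simulation. This is only a sketch, not the authors' analysis (they used SPM2 and Gaussian random field theory). It assumes the tests are independent and draws p-values uniformly from [0, 1], which is what true null hypotheses produce for continuous test statistics. All function names are hypothetical, and the Bonferroni correction is swapped in as a simple stand-in for FWER control. The Benjamini-Hochberg step-up procedure is the FDR method the poster cites.

```python
import random


def uncorrected(p_values, threshold=0.001):
    """Indices that pass a fixed, uncorrected p-value threshold."""
    return [i for i, p in enumerate(p_values) if p < threshold]


def bonferroni(p_values, alpha=0.05):
    """Indices rejected when alpha is divided by the number of tests (FWER control)."""
    m = len(p_values)
    return [i for i, p in enumerate(p_values) if p <= alpha / m]


def benjamini_hochberg(p_values, q=0.05):
    """Indices rejected by the Benjamini-Hochberg step-up procedure (FDR control)."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest 1-based rank k such that p_(k) <= (k / m) * q.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * q:
            cutoff_rank = rank
    # Reject every hypothesis ranked at or below the cutoff.
    return sorted(order[:cutoff_rank])


def prob_any_false_positive(n_tests, threshold):
    """P(at least one false positive) across independent null tests."""
    return 1.0 - (1.0 - threshold) ** n_tests


def simulate_null_p_values(n_tests, seed=0):
    """Assumption: every test is a true null, so p-values are Uniform(0, 1)."""
    rng = random.Random(seed)
    return [rng.random() for _ in range(n_tests)]


if __name__ == "__main__":
    n_voxels = 8064  # search volume reported on the poster
    p_values = simulate_null_p_values(n_voxels)
    print("P(any false positive):", prob_any_false_positive(n_voxels, 0.001))
    print("uncorrected p < 0.001:", len(uncorrected(p_values)))
    print("Bonferroni, alpha 0.05:", len(bonferroni(p_values)))
    print("Benjamini-Hochberg, q 0.05:", len(benjamini_hochberg(p_values)))
```

With 8064 independent null tests at p < 0.001, about 8 false positives are expected by chance alone (8064 × 0.001), so a handful of "active" voxels in a dead salmon is unsurprising. Real voxels are spatially correlated, especially after smoothing, so the independence assumption here is a simplification.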