ICSA Book Series in Statistics
Series Editors: Jiahua Chen · Ding-Geng (Din) Chen

Ding-Geng (Din) Chen · Jiahua Chen · Xuewen Lu · Grace Y. Yi · Hao Yu
Editors

Advanced Statistical Methods in Data Science

Series editors
Jiahua Chen, Department of Statistics, University of British Columbia, Vancouver, Canada
Ding-Geng (Din) Chen, University of North Carolina, Chapel Hill, NC, USA

Editors
Ding-Geng (Din) Chen, School of Social Work, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA; Department of Biostatistics, Gillings School of Global Public Health, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jiahua Chen, Department of Statistics, University of British Columbia, Vancouver, BC, Canada
Xuewen Lu, Department of Mathematics and Statistics, University of Calgary, Calgary, AB, Canada
Grace Y. Yi, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
Hao Yu, Department of Statistics and Actuarial Science, Western University, London, ON, Canada

ICSA Book Series in Statistics
ISSN 2199-0980
ISSN 2199-0999 (electronic)
ISBN 978-981-10-2593-8
ISBN 978-981-10-2594-5 (eBook)
DOI 10.1007/978-981-10-2594-5
Library of Congress Control Number: 2016959593

© Springer Science+Business Media Singapore 2016

This work is subject to copyright. All rights are reserved by the Publisher, whether the whole or part of the material is concerned, specifically the rights of translation, reprinting, reuse of illustrations, recitation, broadcasting, reproduction on microfilms or in any other physical way, and transmission or information storage and retrieval, electronic adaptation, computer software, or by similar or dissimilar methodology now known or hereafter developed.
The use of general descriptive names, registered names, trademarks, service marks, etc. in this publication does not imply, even in the absence of a specific statement, that such names are exempt from the relevant protective laws and regulations and therefore free for general use. The publisher, the authors and the editors are safe to assume that the advice and information in this book are believed to be true and accurate at the date of publication. Neither the publisher nor the authors or the editors give a warranty, express or implied, with respect to the material contained herein or for any errors or omissions that may have been made.

Printed on acid-free paper
This Springer imprint is published by Springer Nature
The registered company is Springer Nature Singapore Pte Ltd.
The registered company address is: 152 Beach Road, #22-06/08 Gateway East, Singapore 189721, Singapore

To my parents and parents-in-law, who value higher education and hard work; to my wife Ke, for her love, support, and patience; and to my son John D. Chen and my daughter Jenny K. Chen for their love and support.
Ding-Geng (Din) Chen, PhD

To my wife, my daughter Amy, and my son Andy, whose admiring conversations transformed into lasting enthusiasm for my research activities.
Jiahua Chen, PhD

To my wife Xiaobo, my daughter Sophia, and my son Samuel, for their support and understanding.
Xuewen Lu, PhD

To my family, Wenqing He, Morgan He, and Joy He, for being my inspiration and offering everlasting support.
Grace Y. Yi, PhD

Preface

This book is a compilation of invited presentations and lectures that were presented at the Second Symposium of the International Chinese Statistical Association–Canada Chapter (ICSA–CANADA) held at the University of Calgary, Canada, August 4–6, 2015.
The Symposium was organized around the theme “Embracing Challenges and Opportunities of Statistics and Data Science in the Modern World” with a threefold goal: to promote advanced statistical methods in big data sciences, to create an opportunity for the exchange of ideas among researchers in statistics and data science, and to embrace the opportunities inherent in the challenges of using statistics and data science in the modern world. The Symposium encompassed diverse topics in advanced statistical analysis in big data sciences, including methods for administrative data analysis, survival data analysis, missing data analysis, high-dimensional and genetic data analysis, and longitudinal and functional data analysis; design and analysis of studies with response-dependent and multiphase designs; time series and robust statistics; and statistical inference based on likelihood, empirical likelihood, and estimating functions.

This book compiles 12 research articles generated from Symposium presentations. Our aim in creating this book was to provide a venue for timely dissemination of the research presented during the Symposium to promote further research and collaborative work in advanced statistics. In the era of big data, this collection of innovative research not only has high potential to have a substantial impact on the development of advanced statistical models across a wide spectrum of big data sciences but also has great promise for fostering more research and collaborations addressing the ever-changing challenges and opportunities of statistics and data science. The authors have made their data and computer programs publicly available so that readers can replicate the model development and data analysis presented in each chapter, enabling them to readily apply these new methods in their own research.

The 12 chapters are organized into three sections.
Part I includes four chapters that present and discuss data analyses based on latent variable models in data sciences. Part II comprises four chapters that share a common focus on lifetime data analyses. Part III is composed of four chapters that address applied data analyses in big data sciences.

Part I Data Analysis Based on Latent or Dependent Variable Models (Chaps. 1, 2, 3, and 4)

Chapter 1 presents a weighted multiple testing procedure commonly used and known in clinical trials. Given this wide use, many researchers have proposed methods for making multiple testing adjustments to control family-wise error rates while accounting for the logical relations among the null hypotheses. However, most of those methods not only disregard the correlation among the endpoints within the same family but also assume the hypotheses associated with each family are equally weighted. Authors Enas Ghulam, Kesheng Wang, and Changchun Xie report on their work in which they proposed and tested a gatekeeping procedure based on Xie’s weighted multiple testing correction for correlated tests. The proposed method is illustrated with an example to clearly demonstrate how it can be used in complex clinical trials.

In Chap. 2, Abbas Khalili, Jiahua Chen, and David A. Stephens consider the regime-switching Gaussian autoregressive model as an effective platform for analyzing financial and economic time series. The authors first explain the heterogeneous behavior in volatility over time and multimodality of the conditional or marginal distributions and then propose a computationally more efficient regularization method for simultaneous autoregressive-order and parameter estimation when the number of autoregressive regimes is predetermined. The authors provide a helpful demonstration by applying this method to analysis of the growth of the US gross domestic product and US unemployment rate data.
Chapter 3 deals with a practical problem of healthcare use for understanding the risk factors associated with the length of hospital stay. In this chapter, Cindy Xin Feng and Longhai Li develop hurdle and zero-inflated models to accommodate both the excess zeros and skewness of data with various configurations of spatial random effects. In addition, these models allow for the analysis of the nonlinear effect of seasonality and other fixed effect covariates. This research draws attention to considerable drawbacks regarding model misspecifications. The modeling and inference presented by Feng and Li use the fully Bayesian approach via Markov Chain Monte Carlo (MCMC) simulation techniques.

Chapter 4 discusses emerging issues in the era of precision medicine and the development of multi-agent combination therapy or polytherapy. Prior research has established that, as compared with conventional single-agent therapy (monotherapy), polytherapy often leads to a high-dimensional dose searching space, especially when a treatment combines three or more drugs. To overcome the burden of calibration of multiple design parameters, Ruitao Lin and Guosheng Yin propose a robust optimal interval (ROI) design to locate the maximum tolerated dose (MTD) in Phase I clinical trials. The optimal interval is determined by minimizing the probability of incorrect decisions under the Bayesian paradigm. To tackle high-dimensional drug combinations, the authors develop a random-walk ROI design to identify the MTD combination in the multi-agent dose space. The authors of this chapter designed extensive simulation studies to demonstrate the finite-sample performance of the proposed methods.

Part II Lifetime Data Analysis (Chaps. 5, 6, 7, and 8)

In Chap. 5, Longlong Huang, Karen Kopciuk, and Xuewen Lu present a new method for group selection in an accelerated failure time (AFT) model with a group bridge penalty.
This method is capable of simultaneously carrying out feature selection at the group and within-group individual variable levels. The authors conducted a series of simulation studies to demonstrate the capacity of this group bridge approach to identify the correct group and correct individual variable even with high censoring rates. Real data analysis illustrates the application of the proposed method to scientific problems.

Chapter 6 considers issues around Case I interval censored data, also known as current status data, commonly encountered in areas such as demography, economics, epidemiology, and medical science. In this chapter, Pooneh Pordeli and Xuewen Lu first introduce a partially linear single-index proportional odds model to analyze these types of data and then propose a method for simultaneous sieve maximum likelihood estimation. The resultant estimator of the regression parameter vector is asymptotically normal, and, under some regularity conditions, this estimator can achieve the semiparametric information bound.

Chapter 7 presents a framework for general empirical likelihood inference of Type I censored multiple samples. Authors Song Cai and Jiahua Chen develop an effective empirical likelihood ratio test and efficient methods for distribution function and quantile estimation for Type I censored samples. This newly developed approach can achieve high efficiency without requiring risky model assumptions. The maximum empirical likelihood estimator is asymptotically normal. Simulation studies show that, as compared to some semiparametric competitors, the proposed empirical likelihood ratio test has superior power under a wide range of population distribution settings.

Chapter 8 provides readers with an overview of recent developments in the joint modeling of longitudinal quality of life (QoL) measurements and survival time for cancer patients that promise more efficient estimation.
Authors Hui Song, Yingwei Peng, and Dongsheng Tu then propose semiparametric estimation methods to estimate the parameters in these joint models and illustrate the applications of these joint modeling procedures to analyze longitudinal QoL measurements and recurrence times using data from a clinical trial sample of women with early breast cancer.

Part III Applied Data Analysis (Chaps. 9, 10, 11, and 12)

Chapter 9 presents an interesting discussion of a confidence weighting model applied to multiple-choice tests commonly used in undergraduate mathematics and statistics courses. Michael Cavers and Joseph Ling discuss an approach to multiple-choice testing called the student-weighted model and report on findings based on the implementation of this method in two sections of a first-year calculus course at the University of Calgary (2014 and 2015).

Chapter 10 discusses parametric imputation in missing data analysis. Author Peisong Han proposes to estimate and subtract the asymptotic bias to obtain consistent estimators. Han demonstrates that the resulting estimator is consistent if any of the missingness mechanism models or the imputation model is correctly specified.

Chapter 11 considers one of the basic and important problems in statistics: the estimation of the center of a symmetric distribution. In this chapter, authors Pengfei Li and Zhaoyang Tian propose a new estimator by maximizing the smoothed likelihood. Li and Tian’s simulation studies show that, as compared with the existing methods, their proposed estimator has much smaller mean square errors under uniform distribution, t-distribution with one degree of freedom, and mixtures of normal distributions on the mean parameter. Additionally, the proposed estimator is comparable to the existing methods under other symmetric distributions.

Chapter 12 presents the work of Jingjia Chu, Reg Kulperger, and Hao Yu in which they propose a new class of multivariate time series models.
Specifically, the authors propose a multivariate time series model with an additive GARCH-type structure to capture the common risk among equities. The dynamic conditional covariance between series is aggregated by a common risk term, which is key to characterizing the conditional correlation.

As a general note, the references for each chapter are included immediately following the chapter text. We have organized the chapters as self-contained units so readers can more easily and readily refer to the cited sources for each chapter.

The editors are deeply grateful to many organizations and individuals for their support of the research and efforts that have gone into the creation of this collection of impressive, innovative work. First, we would like to thank the authors of each chapter for the contribution of their knowledge, time, and expertise to this book as well as to the Second Symposium of the ICSA–CANADA. Second, our sincere gratitude goes to the sponsors of the Symposium for their financial support: the Canadian Statistical Sciences Institute (CANSSI), the Pacific Institute for the Mathematical Sciences (PIMS), and the Department of Mathematics and Statistics, University of Calgary; without their support, this book would not have become a reality. We also owe big thanks to the volunteers and the staff of the University of Calgary for their assistance at the Symposium. We express our sincere thanks to the Symposium organizers: Gemai Chen, PhD, University of Calgary; Jiahua Chen, PhD, University of British Columbia; X. Joan Hu, PhD, Simon Fraser University; Wendy Lou, PhD, University of Toronto; Xuewen Lu, PhD, University of Calgary; Chao Qiu, PhD, University of Calgary; Bingrui (Cindy) Sun, PhD, University of Calgary; Jingjing Wu, PhD, University of Calgary; Grace Y. Yi, PhD, University of Waterloo; and Ying Zhang, PhD, Acadia University.
The editors wish to acknowledge the professional support of Hannah Qiu (Springer/ICSA Book Series coordinator) and Wei Zhao (associate editor) from Springer Beijing that made publishing this book with Springer a reality.

We welcome readers’ comments, including notes on typos or other errors, and look forward to receiving suggestions for improvements to future editions of this book. Please send comments and suggestions to any of the editors listed below.

Ding-Geng (Din) Chen, MSc, PhD, University of North Carolina at Chapel Hill, Chapel Hill, NC, USA
Jiahua Chen, MSc, PhD, University of British Columbia, Vancouver, BC, Canada
Xuewen Lu, MSc, PhD, University of Calgary, Calgary, AB, Canada
Grace Y. Yi, MSc, MA, PhD, University of Waterloo, Waterloo, ON, Canada
Hao Yu, MSc, PhD, Western University, London, ON, Canada
July 28, 2016

Contents

Part I Data Analysis Based on Latent or Dependent Variable Models

1 The Mixture Gatekeeping Procedure Based on Weighted Multiple Testing Correction for Correlated Tests
  Enas Ghulam, Kesheng Wang, and Changchun Xie
2 Regularization in Regime-Switching Gaussian Autoregressive Models
  Abbas Khalili, Jiahua Chen, and David A. Stephens
3 Modeling Zero Inflation and Overdispersion in the Length of Hospital Stay for Patients with Ischaemic Heart Disease
  Cindy Xin Feng and Longhai Li
4 Robust Optimal Interval Design for High-Dimensional Dose Finding in Multi-agent Combination Trials
  Ruitao Lin and Guosheng Yin

Part II Life Time Data Analysis

5 Group Selection in Semiparametric Accelerated Failure Time Model
  Longlong Huang, Karen Kopciuk, and Xuewen Lu
6 A Proportional Odds Model for Regression Analysis of Case I Interval-Censored Data
  Pooneh Pordeli and Xuewen Lu
7 Empirical Likelihood Inference Under Density Ratio Models Based on Type I Censored Samples: Hypothesis Testing and Quantile Estimation
  Song Cai and Jiahua Chen
8 Recent Development in the Joint Modeling of Longitudinal Quality of Life Measurements and Survival Data from Cancer Clinical Trials
  Hui Song, Yingwei Peng, and Dongsheng Tu

Part III Applied Data Analysis

9 Confidence Weighting Procedures for Multiple-Choice Tests
  Michael Cavers and Joseph Ling
10 Improving the Robustness of Parametric Imputation
  Peisong Han
11 Maximum Smoothed Likelihood Estimation of the Centre of a Symmetric Distribution
  Pengfei Li and Zhaoyang Tian
12 Modelling the Common Risk Among Equities: A Multivariate Time Series Model with an Additive GARCH Structure
  Jingjia Chu, Reg Kulperger, and Hao Yu

Index

Contributors

Song Cai, School of Mathematics and Statistics, Carleton University, Ottawa, ON, Canada
Michael Cavers, Department of Mathematics and Statistics, University of Calgary, Calgary, AB, Canada
Jiahua Chen, Big Data Research Institute of Yunnan University and Department of Statistics, University of British Columbia, Vancouver, BC, Canada
Jingjia Chu, Department of Statistical and Actuarial Sciences, Western University, London, ON, Canada
Cindy Xin Feng, School of Public Health and Western College of Veterinary Medicine, University of Saskatchewan, Saskatoon, SK, Canada
Enas Ghulam, Division of Biostatistics and Bioinformatics, Department of Environmental Health, University of Cincinnati, Cincinnati, OH, USA
Peisong Han, Department of Statistics and Actuarial Science, University of Waterloo, Waterloo, ON, Canada
Longlong Huang, Department of Mathematics and Statistics, University of Calgary, Calgary, AB, Canada
Abbas Khalili, Department of Mathematics and Statistics, McGill University, Montreal, QC, Canada
Karen Kopciuk, Department of Cancer Epidemiology and Prevention Research, Alberta Health Services, Calgary, AB, Canada
Reg Kulperger, Department of Statistical and Actua...