Daniel B. Neill
Professor of Computer Science and Public Service, NYU Wagner; Professor of Computer Science and Public Service, NYU Courant; Professor of Urban Analytics, NYU Center for Urban Science and Progress
Room 375
New York, NY 10003
Daniel B. Neill, Ph.D., is Professor of Computer Science, Public Service, and Urban Analytics at New York University (NYU), jointly tenured at NYU's Courant Institute Department of Computer Science, Robert F. Wagner Graduate School of Public Service, and the Center for Urban Science and Progress (part of NYU's Tandon School of Engineering). He is also Affiliated Faculty at NYU's Center for Data Science and the NYU Tandon Department of Computer Science and Engineering. At NYU, he directs the Machine Learning for Good (ML4G) Laboratory and recently finished a 3-year term as co-director of the university's Urban Initiative. Dr. Neill was previously a tenured faculty member at Carnegie Mellon University’s Heinz College, where he was the Dean’s Career Development Professor, Associate Professor of Information Systems, and Director of the Event and Pattern Detection Laboratory.
Dr. Neill's research focuses on developing novel machine learning methods for social good, with applications ranging from medicine and public health to urban analytics and fairness in criminal justice. He works closely with organizations including health departments, hospitals, and city leaders to create and deploy data-driven tools and systems to improve the quality of public health, safety, and security, for example, through the early detection of disease outbreaks. He has served as an Associate Editor of six journals (IEEE Intelligent Systems, Decision Sciences, Security Informatics, ACM Transactions on Management Information Systems, INFORMS Journal on Data Science, and ACM Journal on Computing and Sustainable Societies). He was the recipient of an NSF CAREER award and an NSF Graduate Research Fellowship, and was named one of the "top ten artificial intelligence researchers to watch" by IEEE Intelligent Systems. He received his M.Phil. from Cambridge University and his M.S. and Ph.D. in Computer Science from Carnegie Mellon University.
Please see Dr. Neill's personal webpage (http://www.cs.nyu.edu/~neill) for more information.
The past decade has seen the increasing availability of very large scale data sets, arising from the rapid growth of transformative technologies such as the Internet and cellular telephones, along with the development of new and powerful computational methods to analyze such datasets. Such methods, developed in the closely related fields of machine learning, data mining, and artificial intelligence, provide a powerful set of tools for intelligent problem-solving and data-driven policy analysis. These methods have the potential to dramatically improve the public welfare by guiding policy decisions and interventions, and their incorporation into intelligent information systems will improve public services in domains ranging from medicine and public health to law enforcement and security.
The LSDA course series will provide a basic introduction to large scale data analysis methods, focusing on four main problem paradigms (prediction, clustering, modeling, and detection). The first course (LSDA I) will focus on prediction (both classification and regression) and clustering (identifying underlying group structure in data), while the second course (LSDA II) will focus on probabilistic modeling using Bayesian networks and on anomaly and pattern detection. LSDA I is a prerequisite for LSDA II, as a number of concepts from classification and clustering will be used in the Bayesian networks and anomaly detection modules, and students are expected to understand these without the need for extensive review.
In both LSDA I and LSDA II, students will learn how to translate policy questions into these paradigms, choose and apply the appropriate machine learning and data mining tools, and correctly interpret, evaluate, and apply the results for policy analysis and decision making. We will emphasize tools that can "scale up" to real-world policy problems involving reasoning in complex and uncertain environments, discovering new and useful patterns, and drawing inferences from large amounts of structured, high-dimensional, and multivariate data.
No previous knowledge of machine learning or data mining is required, and no knowledge of computer programming is required. We will be using Weka, a freely available and easy-to-use machine learning and data mining toolkit, to analyze data in this course.
The course video provides more information.
2023
2022
We propose a new approach, the calibrated nonparametric scan statistic (CNSS), for more accurate detection of anomalous patterns in large-scale, real-world graphs. Scan statistics identify connected subgraphs that are interesting or unexpected through maximization of a likelihood ratio statistic; in particular, nonparametric scan statistics (NPSSs) identify subgraphs with a higher than expected proportion of individually significant nodes. However, we show that recently proposed NPSS methods are miscalibrated, failing to account for the maximization of the statistic over the multiplicity of subgraphs. This results in both reduced detection power for subtle signals and low precision of the detected subgraph even for stronger signals. Thus we develop a new statistical approach to recalibrate NPSSs, correctly adjusting for multiple hypothesis testing and taking the underlying graph structure into account. While the recalibration, based on randomization testing, is computationally expensive, we propose both an efficient (approximate) algorithm and new, closed-form lower bounds (on the expected maximum proportion of significant nodes for subgraphs of a given size, under the null hypothesis of no anomalous patterns). These advances, along with the integration of recent core-tree decomposition methods, enable CNSS to scale to large real-world graphs, with substantial improvement in the accuracy of detected subgraphs. Extensive experiments on both semi-synthetic and real-world datasets demonstrate the effectiveness of our proposed methods in comparison with state-of-the-art counterparts.
Currently under review for ACM KDD 2022 Conference on Knowledge Discovery and Data Mining
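To make the statistic concrete: a minimal sketch (not the paper's calibrated implementation) of the kind of uncalibrated nonparametric scan score that CNSS recalibrates. For a candidate node subset with p-values `pvals`, the Berk-Jones-style score compares the observed proportion of significant nodes at a threshold alpha to its null expectation alpha; the candidate threshold grid `alphas` here is an illustrative assumption.

```python
import math

def kl_divergence(observed, expected):
    """One-sided KL divergence between Bernoulli(observed) and Bernoulli(expected)."""
    if observed <= expected:
        return 0.0  # only an excess of significant nodes is interesting
    result = observed * math.log(observed / expected)
    if observed < 1.0:
        result += (1.0 - observed) * math.log((1.0 - observed) / (1.0 - expected))
    return result

def berk_jones_score(pvals, alphas=(0.01, 0.05, 0.1)):
    """Maximize N(S) * KL(N_alpha(S)/N(S), alpha) over candidate thresholds alpha."""
    n = len(pvals)
    best = 0.0
    for alpha in alphas:
        n_alpha = sum(1 for p in pvals if p < alpha)  # individually significant nodes
        best = max(best, n * kl_divergence(n_alpha / n, alpha))
    return best
```

The paper's point is that maximizing this score over the multiplicity of connected subgraphs inflates it under the null; CNSS's contribution is the recalibration step, which this sketch omits.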
2021
Machine learning is gaining popularity in a broad range of areas working with geographic data. Here, data often exhibit spatial effects, which can be difficult to learn for neural networks. We propose SXL, a method for embedding information on the autoregressive nature of spatial data directly into the learning process using auxiliary tasks. We utilize the local Moran's I, a measure of local spatial autocorrelation, to "nudge" the model to learn the direction and magnitude of local spatial effects, complementing learning of the primary task. We further introduce a novel expansion of Moran's I to multiple resolutions, capturing spatial interactions over longer and shorter distances simultaneously. The novel multi-resolution Moran's I can be constructed easily and offers seamless integration into existing machine learning frameworks. Over a range of experiments using real-world data, we highlight how our method consistently improves the training of neural networks in unsupervised and supervised learning tasks. In generative spatial modeling experiments, we propose a novel loss for auxiliary task GANs utilizing task uncertainty weights. SXL outperforms domain-specific spatial interpolation benchmarks, highlighting its potential for downstream applications.
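The local Moran's I that SXL uses as its auxiliary signal has a standard closed form: for location i, I_i = (z_i / m_2) * sum_j w_ij z_j, where z is the mean-centered variable, m_2 its second moment, and W a row-standardized spatial weight matrix. A minimal single-resolution sketch (the paper's multi-resolution extension would repeat this with weight matrices at several distance bands):

```python
import numpy as np

def local_morans_i(values, weights):
    """Local Moran's I at each location.

    values:  length-n sequence of observations
    weights: (n, n) row-standardized spatial weight matrix with zero diagonal
    """
    x = np.asarray(values, dtype=float)
    z = x - x.mean()                 # mean-centered values
    m2 = (z ** 2).mean()             # second moment used as the scaling term
    return (z / m2) * (weights @ z)  # positive: location resembles its neighbors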
Discovery of localized and irregularly shaped anomalous patterns in spatial data provides useful context for operational decisions across many policy domains. The support vector subset scan (SVSS) integrates the penalized fast subset scan with a kernel support vector machine classifier to accurately detect spatial clusters without imposing hard constraints on the shape or size of the pattern. The method iterates between (1) efficiently maximizing a penalized log-likelihood ratio over subsets of locations to obtain an anomalous pattern, and (2) learning a high-dimensional decision boundary between locations included in and excluded from the anomalous subset. On each iteration, location-specific penalties to the log-likelihood ratio are assigned according to distance to the decision boundary, encouraging patterns which are spatially compact but potentially highly irregular in shape. SVSS outperforms competing methods for spatial cluster detection at the task of detecting randomly generated patterns in simulated experiments. SVSS enables discovery of practically useful anomalous patterns for disease surveillance in Chicago, IL, crime hotspot detection in Portland, OR, and pothole cluster detection in Pittsburgh, PA, as demonstrated by experiments using publicly available data sets from these domains.
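The unpenalized fast subset scan that SVSS builds on can be sketched briefly. Under an expectation-based Poisson model, the linear-time subset scanning property guarantees that the highest-scoring subset is among the top-k locations ranked by count-to-baseline ratio, so only n subsets need scoring instead of 2^n. This sketch omits SVSS's contribution, the location-specific penalties derived from the SVM decision boundary; the example data in the test are hypothetical.

```python
import math

def ebp_score(c, b):
    """Expectation-based Poisson log-likelihood ratio for total count c, baseline b."""
    return c * math.log(c / b) + b - c if c > b else 0.0

def fast_subset_scan(counts, baselines):
    """Linear-time subset scan: rank locations by c/b, score only top-k subsets."""
    order = sorted(range(len(counts)),
                   key=lambda i: counts[i] / baselines[i], reverse=True)
    best_score, best_subset = 0.0, []
    c = b = 0.0
    for k, i in enumerate(order, 1):
        c += counts[i]
        b += baselines[i]
        score = ebp_score(c, b)
        if score > best_score:
            best_score, best_subset = score, sorted(order[:k])
    return best_subset, best_score
```

In the penalized variant, each location's contribution to the log-likelihood ratio would additionally carry a penalty term reflecting its distance from the learned decision boundary.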
Under-reporting and delayed reporting of rape crime are severe issues that can complicate the prosecution of perpetrators and prevent rape survivors from receiving needed support. Building on a massive database of publicly available criminal reports from two US cities, we develop a machine learning framework to predict delayed reporting of rape to help tackle this issue. Motivated by large and unexplained spatial variation in reporting delays, we build predictive models to analyse spatial, temporal and socio-economic factors that might explain this variation. Our findings suggest that we can explain a substantial proportion of the variation in rape reporting delays using only openly available data. The insights from this study can be used to motivate targeted, data-driven policies to assist vulnerable communities. For example, we find that younger rape survivors and crimes committed during holiday seasons exhibit longer delays. Our insights can thus help organizations focused on supporting survivors of sexual violence to provide their services at the right place and time. Due to the non-confidential nature of the data used in our models, even community organizations lacking access to sensitive police data can use these findings to optimize their operations.