My research develops econometric tools for settings where standard identification or inference arguments are strained by high-dimensional, non-Gaussian, or unstructured data. Drafts without public links are available upon request.

Working Papers

Moment-Based Inference for Regression with Latent Dirichlet Covariates

Status: Working paper
Public version: arXiv working paper, 2026

Abstract. Topic models are often used as first-stage dimension-reduction tools before regression, with estimated document-level topic shares treated as observed covariates. This plug-in workflow creates two inferential difficulties. First, valid inference requires a regular first-stage-to-second-stage expansion that propagates topic-estimation uncertainty. Second, at fixed document length, a document’s topic mixture is not consistently recoverable from its own words, even when the population topic matrix is known. Corrected spectral moment methods for LDA provide a natural starting point: when the total Dirichlet concentration parameter is known, low-order word moments can be corrected to yield operators that are diagonal in the latent topic basis. We extend this idea to downstream regression.
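
For readers unfamiliar with the correction, the display below restates the standard corrected moments from the spectral LDA literature. It is background rather than a result of this paper. The notation is assumed for illustration: x_1, x_2, x_3 are one-hot encodings of three distinct tokens in one document, μ_i is the word distribution of topic i, the topic mixture is Dirichlet with parameter (α_1, ..., α_k), and α_0 is the sum of the α_i.

```latex
\begin{align*}
M_1 &= \mathbb{E}[x_1], \\
M_2(\alpha_0) &= \mathbb{E}[x_1 \otimes x_2]
  - \frac{\alpha_0}{\alpha_0+1}\, M_1 \otimes M_1
  = \sum_{i=1}^{k} \frac{\alpha_i}{\alpha_0(\alpha_0+1)}\, \mu_i \otimes \mu_i, \\
M_3(\alpha_0) &= \mathbb{E}[x_1 \otimes x_2 \otimes x_3]
  - \frac{\alpha_0}{\alpha_0+2}\Big(
      \mathbb{E}[x_1 \otimes x_2 \otimes M_1]
    + \mathbb{E}[x_1 \otimes M_1 \otimes x_2]
    + \mathbb{E}[M_1 \otimes x_1 \otimes x_2]\Big) \\
  &\quad + \frac{2\alpha_0^2}{(\alpha_0+2)(\alpha_0+1)}\, M_1 \otimes M_1 \otimes M_1
  = \sum_{i=1}^{k} \frac{2\alpha_i}{\alpha_0(\alpha_0+1)(\alpha_0+2)}\,
    \mu_i \otimes \mu_i \otimes \mu_i.
\end{align*}
```

Both corrections depend on α_0, which is why the diagonal structure is available directly only when the total concentration is known.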

Under a finite latent Dirichlet allocation model with response residuals orthogonal to the low-order token moments used for identification, response-weighted word moments admit the same correction, and the resulting supervised operator identifies the regression coefficient β directly, without estimating document-level topic shares. The main theoretical obstacle is that the spectral correction depends on the unknown total concentration α0. We show that, for k ≥ 3 topics and under a generic finite-probe condition, α0 is identifiable by commutativity: at the true value, the operators in a family of corrected word-moment operators commute, whereas away from the truth they generically do not.
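
The commutativity idea can be illustrated numerically. The sketch below is not the paper's estimator. It assumes the standard corrected second and third word moments from the spectral LDA literature, works with exact population moments rather than data, and builds a hypothetical operator family A(η; a) = M3(a)(η) M2(a)^+ indexed by a probe vector η and a candidate concentration a. The topic matrix, α, the probes, and all function names are made up for illustration.

```python
import numpy as np

def dirichlet_moments(alpha):
    """Exact first three raw moments of h ~ Dirichlet(alpha)."""
    a0, k = alpha.sum(), alpha.size
    eye = np.eye(k)
    m1 = alpha / a0
    m2 = (np.diag(alpha) + np.outer(alpha, alpha)) / (a0 * (a0 + 1))
    # E[h_i h_j h_l] via rising factorials, written term by term.
    m3 = np.einsum("i,j,l->ijl", alpha, alpha, alpha)
    m3 = m3 + np.einsum("ij,i,l->ijl", eye, alpha, alpha)
    m3 = m3 + np.einsum("il,i,j->ijl", eye, alpha, alpha)
    m3 = m3 + np.einsum("jl,i,j->ijl", eye, alpha, alpha)
    idx = np.arange(k)
    m3[idx, idx, idx] += 2 * alpha
    return m1, m2, m3 / (a0 * (a0 + 1) * (a0 + 2))

def word_moments(topics, alpha):
    """Population moments of one, two, and three distinct tokens."""
    m1, m2, m3 = dirichlet_moments(alpha)
    p1 = topics @ m1
    p2 = topics @ m2 @ topics.T
    p3 = np.einsum("ai,bj,cl,ijl->abc", topics, topics, topics, m3)
    return p1, p2, p3

def corrected_operator(p1, p2, p3, a, eta):
    """Hypothetical operator M3(a)(eta) @ pinv(M2(a)) at candidate a."""
    m2 = p2 - a / (a + 1) * np.outer(p1, p1)
    p2_eta = p2 @ eta
    m3 = (
        np.einsum("abc,c->ab", p3, eta)
        - a / (a + 2)
        * ((p1 @ eta) * p2 + np.outer(p2_eta, p1) + np.outer(p1, p2_eta))
        + 2 * a**2 / ((a + 2) * (a + 1)) * (p1 @ eta) * np.outer(p1, p1)
    )
    # rcond discards the numerically zero singular values of the rank-k M2.
    return m3 @ np.linalg.pinv(m2, rcond=1e-10)

def commutator_norm(p1, p2, p3, a, eta1, eta2):
    """Frobenius norm of the commutator of two probed operators."""
    op1 = corrected_operator(p1, p2, p3, a, eta1)
    op2 = corrected_operator(p1, p2, p3, a, eta2)
    return np.linalg.norm(op1 @ op2 - op2 @ op1)

# Illustrative setup: 5 words, k = 3 topics (columns sum to one), alpha_0 = 1.
topics = np.array([
    [0.5, 0.1, 0.1],
    [0.2, 0.5, 0.1],
    [0.1, 0.2, 0.2],
    [0.1, 0.1, 0.5],
    [0.1, 0.1, 0.1],
])
alpha = np.array([0.5, 0.3, 0.2])
eta1 = np.array([1.0, 0.0, 0.0, 0.0, 0.0])
eta2 = np.array([0.0, 1.0, 0.0, 0.0, 0.0])

p1, p2, p3 = word_moments(topics, alpha)
grid = [0.5, 0.75, 1.0, 1.5, 2.0]
norms = [commutator_norm(p1, p2, p3, a, eta1, eta2) for a in grid]
for a, value in zip(grid, norms):
    print(f"candidate a = {a:4.2f}   commutator norm = {value:.3e}")
```

In this toy configuration the commutator norm should vanish up to floating-point error at a = 1.0, the true α0, and be visibly positive at the other grid points. A feasible version would replace the population moments with sample analogues and minimize a commutator criterion over a.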

This yields a feasible estimator and allows uncertainty in α̂0 to be propagated into inference for β. The estimator is asymptotically linear as the number of documents grows with fixed document length, with sandwich standard errors based on document-level moment contributions. Simulations show near-nominal coverage where plug-in topic-share regressions can undercover, and an application to top economics journals illustrates contrast inference for latent topic effects.

Work in Progress

Micro-foundation for Topic Models

Status: Work in progress