If the mantra in real estate is “location, location, location,” the mantra for public data users is—or should be—“disaggregation, disaggregation, disaggregation.”
The average income in your neighborhood will go up if a billionaire moves in, but that rising average doesn’t tell you anything about whether income rose, fell, or remained the same for everyone else. A new billionaire neighbor is a dramatic example but one that illustrates the point: Disaggregated data are crucial for understanding how people are doing.
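The billionaire example can be made concrete in a few lines of Python. All figures below are invented for illustration; the point is that the mean jumps dramatically while the median—one simple disaggregation-friendlier summary—barely moves.

```python
import statistics

# Ten hypothetical households with typical incomes.
incomes = [45_000, 52_000, 48_000, 61_000, 55_000,
           47_000, 58_000, 50_000, 62_000, 53_000]
mean_before = sum(incomes) / len(incomes)    # 53,100
median_before = statistics.median(incomes)   # 52,500

# A billionaire moves into the neighborhood.
incomes.append(1_000_000_000)
mean_after = sum(incomes) / len(incomes)     # roughly 91 million
median_after = statistics.median(incomes)    # 53,000

# The average explodes even though no one else's income changed.
```

The median shifts only from $52,500 to $53,000, while the mean climbs past $90 million—which is why a single aggregate figure can mask what is happening for almost everyone in the data.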
We couldn’t uncover truths like these without disaggregated data.
During the 2021 conference of the Association of Public Data Users (APDU), speaker Rhonda Vonshay Sharpe provided numerous examples of how disaggregating data—by gender, race and ethnicity, and education—provides crucial insights for improving public health and well-being. Her talk was inspiring and also left many in the audience wondering…
To be fair, some people probably just don’t think about disaggregation. But there are bigger, systemwide challenges.
Sometimes survey sample sizes are too small to produce reliable estimates for a population of interest. When this happens, researchers—hoping to provide some data rather than none—may combine smaller demographic groups until the pooled survey responses are large enough to support a reportable estimate.
I have done this kind of aggregation in my own work—grouping across income levels, sexual orientations, racial/ethnic groups, geographies, or ages—because in the context of the work I was doing, aggregated data were preferable to tables full of missing data. If you’re considering aggregating groups, the Urban Institute provides some handy guidelines. And remember that sometimes noting in your work that the sample size is small or estimates are unreliable is important because it signals that there’s a data gap.
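One simple version of this kind of aggregation is to pool any group whose sample size falls below a reliability threshold into a combined category. The sketch below is illustrative only—the threshold of 30, the function name, and the category label are my own assumptions, not a standard from any particular survey or the Urban Institute guidelines.

```python
MIN_N = 30  # illustrative threshold; real reliability cutoffs vary by survey

def collapse_small_groups(sample_sizes, min_n=MIN_N):
    """Pool groups with fewer than min_n survey responses into a single
    combined category so an estimate can be reported for them together."""
    collapsed = {}
    other = 0
    for group, n in sample_sizes.items():
        if n >= min_n:
            collapsed[group] = n
        else:
            other += n
    if other:
        collapsed["Other (combined)"] = other
    return collapsed

collapse_small_groups({"A": 480, "B": 22, "C": 205, "D": 9})
# → {"A": 480, "C": 205, "Other (combined)": 31}
```

Even with a rule like this, it is worth reporting which groups were pooled and why—that note is itself a signal of the data gap.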
Speaking of data gaps: Sometimes data are only reported for aggregate groups. A visitor to federal statistical websites will often find data for just five racial groups (American Indian or Alaska Native, Asian, Black or African American, Native Hawaiian or Other Pacific Islander, and White) and two ethnic groups (Hispanic or Latino and Not Hispanic or Latino). These groups reflect minimum standards set by the U.S. Office of Management and Budget (OMB) in 1997. Even with those standards in place, it was just last month that the Bureau of Labor Statistics began publishing jobs data for American Indians and Alaska Natives.
While many agencies in the federal statistical system go beyond the minimum standards—including reporting data for multiracial populations—the standards (in my opinion) are overdue for an overhaul.
Recently, PRB joined more than 150 other signatories in a letter to the acting director of OMB requesting that the minimum standards be revised. Developed in collaboration with community groups and grounded in the latest research on self-identification, the letter included specific suggestions focused on the data needs of Asian American, Native Hawaiian and Pacific Islander, Hispanic/Latino, Middle Eastern and North African, and Black and African American populations.
But racial demographics are just the tip of the disaggregation iceberg. Sexual orientation and gender identity, age, geography, education, and other topics also deserve data systems that are robust enough to support disaggregation. The solution for survey data is to structure—and fund—surveys that have enough records to support detailed disaggregation. This could be achieved through larger sample sizes overall, as proposed by The Census Project for the American Community Survey, or through strategic oversampling of specific smaller populations of interest.
For administrative data such as birth and death records, education statistics, and others, many agencies already collect more data on race/ethnicity, age, income, and sexual orientation and gender identity than they report. Reporting is often limited by staff time, data quality issues, and, in some cases, privacy and confidentiality concerns. In these cases, newer tools, such as synthetic estimation or noise infusion, may help achieve a balance between reporting disaggregated data and protecting individual privacy.
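One widely studied form of noise infusion is the Laplace mechanism from the differential privacy literature: add random noise to each published cell count so that no individual record can be pinned down. The sketch below is a minimal, standard-library-only illustration—the function names, the epsilon value, and the clamp-and-round step are my assumptions, not any agency's actual disclosure-avoidance procedure.

```python
import math
import random

def sample_laplace(scale, rng):
    """Draw one sample from a Laplace(0, scale) distribution via inverse CDF."""
    u = rng.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def noise_infused_counts(counts, epsilon=1.0, seed=None):
    """Add Laplace noise with scale 1/epsilon to each cell count, then
    clamp at zero and round so published values still look like counts.
    Smaller epsilon means more noise and stronger privacy protection."""
    rng = random.Random(seed)
    scale = 1.0 / epsilon
    return {
        group: max(0, round(n + sample_laplace(scale, rng)))
        for group, n in counts.items()
    }

published = noise_infused_counts(
    {"Group A": 1520, "Group B": 37, "Group C": 4}, epsilon=0.5, seed=42
)
```

Notice the trade-off this makes explicit: noise that is negligible for a large cell like Group A can swamp a tiny cell like Group C—which is exactly the tension between disaggregation and privacy that agencies must balance.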
There is no one perfect answer, but larger samples, strategic oversampling, and privacy-protecting reporting tools are all promising places to start.
Only by breaking down the data can we understand enough to make wise policy decisions that build up our communities.
Note: A version of this piece first appeared in the Association of Public Data Users blog. It has been modified slightly for PRB.