The US Census Bureau has been working diligently to refine the Disclosure Avoidance System (DAS) for the Census 2020 redistricting data release in September. The DAS is an essential component of this decade’s Census data releases because the Census Bureau is required by law to protect individual privacy under Title 13. The DAS represents a compromise between accurate data and privacy protection. In prior censuses the Census Bureau used various forms of disclosure avoidance. These techniques historically consisted of table suppression and data swapping. Given the risk of a “reconstruction attack,” Census 2020 will apply a more modern formal statistical approach of disclosure avoidance called differential privacy. Refer to this blog for more information on the move to differential privacy.
To better understand what differentially privatized data look like, the Census Bureau has released a series of demonstration products that compare published Census 2010 data with incremental versions the Census 2020 DAS applied to Census 2010 data. This allows for a comparison of the differences between the old method of data swapping and the new method of differential privacy. This comparison therefore allows users to test the Census 2020 DAS for fitness of use in various use cases. The latest version of the DAS has increased the Privacy-Loss Budget (PLB) to 12.2. This essentially means that there is less noise injected into the data than in prior versions of the DAS. The Census Bureau plans to apply this DAS to the upcoming redistricting data release. However, there is still time to provide the Census Bureau with valuable feedback on the data’s fitness for use.
To make the differentially privatized Census 2010 demonstration data more accessible and easier to use, IPUMS NHGIS have processed and tabulated the released microdata for 15 geographies. In conjunction with these tabulation releases, Esri has published five versions of the demonstration data as feature sets for 8 geographies in the ArcGIS Living Atlas of the World. The most recent version is V5 (April 2021).
Esri used the latest version of Census demonstration data published in the Living Atlas of the World to analyze the differences between the demonstration and released data sets at the place level. Places are easy to interact with as we are all familiar with our general local areas, but not many people know what tract or block group they live in. There are 29,261 places in the 2010 Census places inventory. This includes 9,721 Census Designated Places (CDPs) and 19,540 incorporated jurisdictions. Use this application to better understand places that you are familiar with.
Are the differences between the latest version of demonstration data (DP) and the released Census 2010 data (SF1) acceptable for your use case?
Click the image below to launch the Dashboard
Differences across datasets
There are many ways to explore the differences between the two datasets. As you navigate through the application it is helpful to have some context as to how the differences in your places of interest compare to the typical differences among all places. For this context refer to the descriptive statistics below.
Descriptive Statistics For All 2010 Places
|Differences (DP – SF1)||Average||Min||Median||Max||SD||Min %*||Median %*||Max %*||MALPD||MAPD|
|Black or African American||0.3||-334||0||309||13.5||-94.1%||0.0%||1800.0%||3.7%||33.2%|
|Amer Indian and AK Native||0.0||-310||0||269||9.6||-95.5%||0.0%||2200.0%||7.2%||44.4%|
|Two or More Races||-2.4||-1,727||0||3,988||54.9||-96.3%||0.0%||2800.0%||27.5%||52.6%|
|Persons Per Household||0.0||-5.0||0.0||48.0||0.6||N/A||N/A||N/A||3.2%||9.8%|
MALPD = Mean Algebraic Percent Difference, a measure of bias
MAPD = Mean Absolute Percent Difference, a measure of average percent difference
SD = Standard Deviation
* These calculations are limited to cases where DP values are > 0 and SF1 values are > 0
Looking at these descriptive statistics we can see some patterns in how the proposed DAS impacts the data:
- The average percent difference for the White population is lower than all the other race groups at 2.7%. The tendency (bias) is to reduce the White population via differential privacy by -0.5% on average.
- Two or More Races shows the largest differences of all the race groups with an average absolute percent difference of 52.6%. On average, this is a reduction of -2.4 persons. However, looking at the mean algebraic percentage change shows an average increase of 27.5%. These measures show divergent trends because we are often dealing with small numbers and percent increase has no limit while percent decrease is limited to -100%.
- Persons Per Household (PPH) is a sensitive measure used in planning and program administration. The average absolute difference in PPH is 9.8% with a tendency (bias) to be higher than SF1 by 3.2%. The maximum difference is 48, indicating that Summary File 1 showed 2 PPH, while differential privacy states 48 PPH.
Many statistical tables from census data yield information on the interaction between people and housing (e.g., persons per household, household type, occupancy, etc.). However, differential privacy is applied separately to person-level counts and housing counts. This poses the additional challenge of data integrity across the person and housing universes. The DAS attempts to maintain integrity through various forms of post-processing, a second step of the DAS after the formal privacy protection has been applied. Looking at the latest version (V5, PLB 12.2) we see that data integrity has improved but still yields some peculiar results even at the place level. Faults in data integrity can be broken out into two types, impossible and improbable. For example, a place cannot have more households than household population because, by definition, a household is occupied by at least one person. Improbable results are possible but highly unlikely. For example, a place could include all population under 18 years of age if the place only contains a juvenile facility and the caretakers do not live at the facility full time. This scenario, although possible, is highly improbable, and these cases should be scrutinized.
Checks on Data Integrity
|DP Data Integrity Problems – Impossible||Place Count||Percent|
|More households than household population||39||0.13%|
|Household population > 0 but households = 0||16||0.05%|
|Households > 0 but household population = 0||8||0.03%|
|DP Data Integrity Problems – Improbable||Place Count||Percent|
|All population under 18 years of age||2||0.01%|
|Persons per household greater than 10||20||0.07%|
|DP occupancy rate is 100% but SF1 occupancy rate is not 100%||433||1.48%|
|SF1 occupancy rate is 100% but DP occupancy rate is not 100%||56||0.19%|
|DP occupancy rate is 0% but SF1 occupancy rate is not 0%||17||0.06%|
|SF1 occupancy rate is 0% but DP occupancy rate is not 0%||6||0.02%|
|SF1 household population = 0 but DP household population > 0||8||0.03%|
|DP household population = 0 but SF1 household population > 0||11||0.04%|
When applying the same tests to the released 2010 SF1 data only one place has persons per household greater than 10. All other counts equal zero for the above data integrity tests.
This is a limited analysis of a subset of the demonstration data. These differences represent the trade-off between privacy protection and accurate data. We encourage you to perform your own analysis of the demonstration data to test your use cases. This blog by Lauren Scott Griffin provides an excellent workflow for creating web maps to compare the differences. If you are comfortable with the proposed changes to census data, then you are well positioned for the upcoming 2020 census release. If you have concerns, now is the last opportunity to voice them. Please send any comments to the Census Bureau at 2020DAS@census.gov before the May 28th deadline.