# Author: University of Washington Center for Human Rights
# Title: Hidden in Plain Sight: ICE Air Data Appendix
# Date: 2019-04-24
# License: GPL 3.0 or greater

import pandas as pd
import numpy as np
import yaml
import matplotlib.pyplot as plt

Hidden in Plain Sight: ICE Air Data Appendix

This is an appendix to the report Hidden in Plain Sight: ICE Air and the Machinery of Mass Deportation, which uses data from ICE's Alien Repatriation Tracking System (ARTS) released by ICE Enforcement and Removal Operations pursuant to a Freedom of Information Act request by the University of Washington Center for Human Rights. This appendix is intended to provide readers with greater detail on the contents, structure, and limitations of this dataset, and on the process our researchers performed to render it suitable for social scientific analysis. The appendix is a living document that will be updated over time in order to make ICE Air data as widely accessible and transparently documented as possible.

The project repository contains all the data and code used for the production of the report.

# Get optimal data types before reading in the ARTS dataset
with open('input/dtypes.yaml', 'r') as yamlfile:
    # safe_load avoids the Loader argument that newer PyYAML versions require for yaml.load
    column_types = yaml.safe_load(yamlfile)
read_csv_opts = {'sep': '|',
                 'quotechar': '"',
                 'compression': 'gzip',
                 'encoding': 'utf-8',
                 'dtype': column_types,
                 'parse_dates': ['MissionDate'],
                 'infer_datetime_format': True}
df = pd.read_csv('input/ice-air.csv.gz', **read_csv_opts)

# The ARTS Data Dictionary as released by ICE
data_dict = pd.read_csv('input/ARTS_Data_Dictionary.csv.gz', compression='gzip', sep='|')
data_dict.columns = ['Field', 'Definition']

# A YAML file containing the field names in the original ARTS dataset
with open('hand/arts_cols.yaml', 'r') as yamlfile:
    # safe_load avoids the Loader argument that newer PyYAML versions require for yaml.load
    arts_cols = yaml.safe_load(yamlfile)

# Asserting characteristics of key fields
assert sum(df['AlienMasterID'].isnull()) == 0
assert len(df) == len(set(df['AlienMasterID']))
assert sum(df['MissionID'].isnull()) == 0
assert sum(df['MissionNumber'].isnull()) == 0
assert len(set(df['MissionID'])) == len(set(df['MissionNumber']))

The ARTS dataset

According to a 2015 audit of ICE Air by the Department of Homeland Security Office of Inspector General (OIG), the ARTS dataset was created by an ICE contractor at an unknown date and transitioned to government personnel in May 2014.^[1] The identity of the contractor that originally created the database is not currently known.

Following cleaning by UWCHR as detailed below, the ARTS dataset contains 1732625 records relating to ICE Air Operations charter flights during the period from October 1, 2010 to December 5, 2018, including full data for U.S. Federal Government Fiscal Years 2011 through 2018. Each record in the dataset relates to a single passenger on a single ICE Air mission.

The ARTS dataset is made up of 43 fields, defined in a data dictionary provided by ICE; however, as described below, the content of these fields does not always conform to the definitions provided:

Field	Definition
Status	Criminal Status
Sex	Sex
Convictions	Convictions
GangMember	Gang Member
ClassLvl	Class Level
Age	Age
MissionDate	Mission Date
MissionNumber	Mission Number (i.e., Flight Number)
PULOC	Pick up location
DropLoc	Drop off location
StrikeFromList	Was Alien struck from mission
ReasonStruck	Reason Struck
R-T	Removal or Transfer
Code	Criminality Code
CountryOfCitizenship	Country of Citizenship
Juvenile	Is Alien a Juvenile
MissionWeek	Mission Week
MissionQuarter	Mission Quarter
MissionYear	Mission Year
MissionMonth	Mission Month
Criminality	Criminality
FamilyUnitFlag	Family Unit Flag
UnaccompaniedFlag	Unaccompanied Flag
AlienMasterID	Unique value assigned by ARTS for Alien
MissionID	Unique value assigned by ARTS for mission
air_AirportID	Airport ID (numerical)
air_AirportName	Airport Name
air_City	Airport City
st_StateID	State ID (Numerical)
st_StateAbbr	State Abbreviation
AOR_AORID	Area of Responsibility (Numerical)
AOR_AOR	Area of Responsibility (Abbreviation)
AOR_AORName	Area of Responsibility
air_Country	Airport Country
air2_AirportID	Airport ID (numerical)
air2_AirportName	Airport Name
air2_City	Airport City
st2_StateID	State ID (Numerical)
st2_StateAbbr	State Abbreviation
aor2_AORID	Area of Responsibility (Numerical)
aor2_AOR	Area of Responsibility (Abbreviation)
aor2_AORName	Area of Responsibility
air2_Country	Airport Country

The first 25 fields relate to passenger and mission characteristics; the latter 18 fields, marked by various prefixes, relate to characteristics of the airports and locations associated with each record. (Additional fields generated by UWCHR in the process of analysis of the dataset are not enumerated here.) There is no indication that the content of any of these fields was withheld or redacted by ICE upon release of the dataset. However, according to the 2015 DHS OIG audit, the ARTS database does include additional fields which were not released to UWCHR, including passenger A-Numbers and Fingerprint IDs, and details on the cost of individual flights.

The ARTS dataset uses three key fields to identify passengers (AlienMasterID) and missions (MissionID and MissionNumber). The AlienMasterID field is made up of 1732625 unique values. AlienMasterID values are numeric strings starting at 5000 and incrementing to 2047246, with some values skipped. Each AlienMasterID value is used only once; repeat passengers on multiple flights cannot be identified based solely on this field. Below we will display a subset of columns relating to passenger characteristics for the most recent records in the dataset.

selected_cols = ['AlienMasterID', 'Status', 'Sex',
    'Convictions', 'GangMember', 'Age', 'PULOC', 'DropLoc',
    'R-T', 'CountryOfCitizenship']

sample_records = df.loc[:, selected_cols].tail().dropna(axis=1)

AlienMasterID	Status	Sex	Convictions	GangMember	Age	PULOC	DropLoc	R-T	CountryOfCitizenship
2047241	8F	M	NC	N	22.0	KIAH	MGGT	R	GUATEMALA
2047242	8F	M	NC	N	20.0	KIAH	MGGT	R	GUATEMALA
2047243	8F	M	Illegal Entry	N	32.0	KIAH	MGGT	R	GUATEMALA
2047245	16	M	Illegal Entry	N	24.0	KIAH	MGGT	R	GUATEMALA
2047246	16	M	NC	N	27.0	KIAH	MGGT	R	GUATEMALA
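The "some values skipped" observation above can be quantified with simple arithmetic. The following is a minimal sketch on a small made-up series of IDs (not real records); the variable names are illustrative and this is not the code used for the report. It counts how many integers inside the observed ID range never appear, then applies the same arithmetic to the figures reported above for the cleaned dataset.

```python
import pandas as pd

# Made-up numeric-string IDs standing in for AlienMasterID values.
ids = pd.Series(['5000', '5001', '5003', '5006'])

as_int = ids.astype(int)
span = as_int.max() - as_int.min() + 1   # size of the full ID range
skipped = span - as_int.nunique()        # integers in the range that never occur

# Same arithmetic with the figures reported for the cleaned dataset:
# IDs run from 5000 to 2047246, with 1732625 unique values present.
reported_span = 2047246 - 5000 + 1
reported_skipped = reported_span - 1732625
print(skipped, reported_skipped)
```

Under those reported figures, roughly 310,000 values in the ID range do not appear in the cleaned dataset. Some of these absences are expected, since records with missing locations were dropped during cleaning; the reason for the remainder is not documented.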

The MissionID and MissionNumber fields each contain 14961 values; like AlienMasterID, these are also numeric strings that increase incrementally over time. Below we will display records for a single ICE Air mission^[2] using the Pandas groupby function, which groups records by specified characteristics. We can include here other fields such as R-T, CountryOfCitizenship, etc.:

sample_mission_id = 47425
sample_records = df[df['MissionID'] == sample_mission_id]
sample_groupby = sample_records.groupby(['MissionDate', 'MissionID', 'MissionNumber', 'PULOC', 'DropLoc'])
sample_table = sample_groupby['AlienMasterID'].nunique().reset_index()
sample_table = sample_table.rename({'AlienMasterID': 'PassengerCount'}, axis=1)

MissionDate	MissionID	MissionNumber	PULOC	DropLoc	PassengerCount
2018-12-04	47425	190331	KBFI	KELP	47
2018-12-04	47425	190331	KBFI	KIWA	19
2018-12-04	47425	190331	KELP	KIWA	51
2018-12-04	47425	190331	KIWA	KBFI	14
2018-12-04	47425	190331	KIWA	KELP	15

While mission records allow us to determine the numbers of individuals (summarized in the grouping above) moved between each pickup and dropoff location involved in a mission, we are not able to reconstruct the exact itinerary of the flight from these records. See below for more on the limitations of the MissionID and MissionNumber keys.

Data cleaning

The raw ARTS dataset was released by ICE as 9 XLSX format files, which were combined into a single dataset containing 1763020 records. For space reasons, the raw files are not stored in the project's Git repository, but are available via UWCHR's Google Drive. Prior to analysis, the ARTS dataset was converted from XLSX to CSV format and cleaned to standardize fields and remove some records with missing data.

The cleaning process is fully documented in code; see the clean/ directory in the project repository. Selected data cleaning steps are described below:

Duplicate AlienMasterID values and airport metadata: In the raw ARTS dataset, 29465 AlienMasterID values are repeated up to 2 times. Upon close inspection, it becomes apparent that these records are repeated because of inconsistencies in certain airport metadata values (fields starting with the prefixes air_ or air2_), resulting in the duplication of some passenger records, probably due to a database merge by ICE prior to release. (For example, all records for Yuma International Airport in Yuma, AZ are duplicated, with one version of the records incorrectly listing "AR" as the state associated with the airport.) 29465 duplicate AlienMasterID values were dropped and erroneous or missing airport metadata was corrected; see ice-air/clean/hand/bad_airports.csv for values that were substituted. Airport metadata fields not used in the present analysis, such as numeric codes for US states, were not cleaned.
Passenger characteristics: The ARTS dataset includes several fields with unstandardized values. UWCHR has cleaned or partially cleaned several of these fields for analysis; original values to be cleaned and their substitutions are documented in the file ice-air/clean/hand/clean.yaml. Fields not used in the present analysis were not cleaned. See below for discussion of some of these fields.
Missing data: 808 passenger records with missing pickup locations and 122 records with missing dropoff airports were dropped from the dataset. Records with missing data in other fields were not dropped.
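The deduplication and missing-data steps described above can be sketched as follows. This is a simplified illustration on a made-up toy table, not the project's actual cleaning code (which lives in the clean/ directory); in particular, keeping the first of each duplicate pair is an assumption made here for brevity, whereas the real process corrected airport metadata using hand-curated substitutions. The final assertion checks that the record counts reported in this appendix are arithmetically consistent.

```python
import pandas as pd

# Toy stand-in for raw records; all values are illustrative, not real data.
# The first two rows mimic a passenger duplicated by inconsistent airport metadata.
raw = pd.DataFrame({
    'AlienMasterID': ['5000', '5000', '5001', '5002', '5003'],
    'PULOC': ['KIWA', 'KIWA', 'KIAH', None, 'KBFI'],
    'DropLoc': ['KELP', 'KELP', 'MGGT', 'KELP', None],
    'st_StateAbbr': ['AZ', 'AR', 'TX', 'TX', 'WA'],
})

# Keep one record per AlienMasterID (assumption: the first occurrence).
deduped = raw.drop_duplicates(subset='AlienMasterID', keep='first')

# Drop records missing a pickup or drop-off location.
cleaned = deduped.dropna(subset=['PULOC', 'DropLoc'])

# After cleaning, each AlienMasterID should appear exactly once.
assert cleaned['AlienMasterID'].is_unique
print(len(raw), len(deduped), len(cleaned))

# Reported counts: raw records, dropped duplicates, missing pickup, missing drop-off.
raw_count, dup_count, no_pickup, no_dropoff = 1763020, 29465, 808, 122
assert raw_count - dup_count - no_pickup - no_dropoff == 1732625
```

The reported figures reconcile exactly: 1763020 raw records, less 29465 duplicates, 808 records without a pickup location, and 122 without a drop-off location, leaves the 1732625 records in the cleaned dataset.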

Limitations of the dataset

While this dataset provides the first public view into the operations of ICE Air, it raises as many questions as it answers. In part, this is because it was released without an explanation of the context in which it is used: as noted in the report, we know some deportation flights are on commercial airlines, which appear not to be listed here, and others are charter flights whose records are similarly absent from this set. It is important, therefore, to be conscious of the limitations to the conclusions we can draw from this data alone, to “ground-truth” any observations by comparing to the lived experiences of immigrant communities, and to continue to demand full transparency from federal and local governments about the mechanics of deportations.

There are also several specific elements of this dataset’s design which limit its usefulness.

First, it does not permit the tracking of individual passengers; it is very likely that some individuals are repeated in the collection, but it is not possible to identify them because individuals are not assigned a consistent identifier in this dataset.

According to the 2015 DHS OIG audit, the ARTS database reviewed by OIG includes fields for passenger A-Numbers and Fingerprint IDs, which would permit tracking of repeat passengers, though the OIG also notes inconsistencies in the usage of these identifiers. Unfortunately, these fields were not included in the database shared with UWCHR. There is no inherent way to track repeat passengers on multiple flights in the ARTS dataset as released to UWCHR by ICE; close analysis of specific combinations of passenger characteristics (e.g., age, nationality, criminal conviction status) does suggest that passengers are represented multiple times in the dataset, but systematically isolating repeat passengers without access to additional unique ID fields would be prohibitively difficult.
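To illustrate what "combinations of passenger characteristics" means in practice, the sketch below flags records in a made-up toy table that share the same sex, age, nationality, and conviction value. The choice of columns in traits is an assumption for illustration, not the method used by UWCHR. A match of this kind is only a candidate: unrelated passengers frequently share these traits, and a real repeat passenger's age may change between flights, which is why this approach cannot isolate repeat passengers reliably.

```python
import pandas as pd

# Made-up records; AlienMasterID is unique per row, as in the ARTS dataset.
toy = pd.DataFrame({
    'AlienMasterID': ['1', '2', '3', '4'],
    'Sex': ['M', 'M', 'F', 'M'],
    'Age': [22.0, 22.0, 30.0, 41.0],
    'CountryOfCitizenship': ['GUATEMALA', 'GUATEMALA', 'HONDURAS', 'MEXICO'],
    'Convictions': ['NC', 'NC', 'NC', 'Illegal Entry'],
    'MissionDate': pd.to_datetime(
        ['2018-01-02', '2018-03-05', '2018-01-02', '2018-01-02']),
})

# Hypothetical set of traits to compare; other combinations are possible.
traits = ['Sex', 'Age', 'CountryOfCitizenship', 'Convictions']

# keep=False marks every member of a group of matching records, not just the later ones.
candidates = toy[toy.duplicated(subset=traits, keep=False)]
print(candidates[['AlienMasterID', 'MissionDate'] + traits])
```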

Second, the database does not track flight itineraries. Each record represents an individual passenger on an ICE Air mission, but reconstructing the flight path for a single mission or passenger is not possible. While the database contains fields labeled MissionID and MissionNumber that might initially appear useful for such purposes, they too introduce limitations. While MissionID and MissionNumber values differ, their functions in the dataset appear to be completely equivalent: combinations of these values are strictly one-to-one, and they are not hierarchical (see code snippet below). Both values consist of numeric strings which increase incrementally, with some values skipped. Contrary to the ARTS data dictionary released by ICE, the MissionNumber field appears to bear no relation to actual flight numbers.

assert sum(df.groupby(['MissionID'])['MissionNumber'].nunique() > 1) == 0
assert sum(df.groupby(['MissionNumber'])['MissionID'].nunique() > 1) == 0
assert sum(df.groupby(['MissionNumber', 'MissionID'])['MissionDate'].nunique() > 1) == 0
assert sum(df.groupby(['MissionID', 'MissionNumber'])['MissionDate'].nunique() > 1) == 0

ICE Air missions as represented by the MissionID and MissionNumber fields never span more than one day, though multiple missions may occur on a given date. The MissionDate field only records the day of the mission; the dataset does not include any other time data, such as takeoff or landing timestamps. Each mission can include multiple combinations of pickup and drop-off locations, represented by the PULOC and DropLoc fields. These values encode the pickup and drop-off location for each passenger on the mission, not the flight itinerary of the mission. Therefore, while each mission may include multiple flight legs, it is not possible to use this version of the ARTS database to conclusively reconstruct itineraries or calculate the total number of legs on flights operated by ICE Air.
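The distinction between passenger pickup/drop-off pairs and a flight itinerary can be seen in the sample mission shown earlier. The sketch below re-enters that mission's published passenger counts by hand (rather than reading the dataset) and summarizes what can be recovered: the set of airports involved and the number of passengers boarding and deplaning at each. The variable names are illustrative.

```python
import pandas as pd

# Passenger counts per pickup/drop-off pair for sample mission 47425,
# copied from the table earlier in this appendix.
pairs = pd.DataFrame({
    'MissionID': [47425] * 5,
    'PULOC': ['KBFI', 'KBFI', 'KELP', 'KIWA', 'KIWA'],
    'DropLoc': ['KELP', 'KIWA', 'KIWA', 'KBFI', 'KELP'],
    'PassengerCount': [47, 19, 51, 14, 15],
})

# Airports involved in the mission, with no ordering information.
airports = sorted(set(pairs['PULOC']) | set(pairs['DropLoc']))

# Passengers boarding and deplaning at each airport.
boarded = pairs.groupby('PULOC')['PassengerCount'].sum()
deplaned = pairs.groupby('DropLoc')['PassengerCount'].sum()
summary = pd.DataFrame({'boarded': boarded, 'deplaned': deplaned}).fillna(0).astype(int)

print(airports)
print(summary)
```

Every airport in this mission appears as both a pickup and a drop-off location, so the aircraft presumably visited at least one of them more than once; nothing in these fields indicates the order of stops or how many legs were flown.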

Third, ICE agents’ entry of data is inconsistent. Many of the fields are unstandardized and present significant challenges for cleaning and analysis, especially the Status, GangMember, and Convictions fields, which include many unique and often irrelevant values. (This concern was also noted by the DHS OIG in its 2015 audit; in several cases, data entry processes seem to have become more standardized over time, which may obscure real trends in the data.)

Several of these fields merit additional explanation:

The Status field, despite being defined in the ARTS Data Dictionary as relating to "Criminal Status", appears to relate to the status of a passenger's deportation proceedings. Some values in this field conform to a set of 29 alphanumeric codes used by ICE to categorize the status of removal processes (see Kerwin et al., 2015, for a description of each of these codes); others consist of unstandardized text descriptions, including unrelated or irrelevant values. Analysis of the distribution of Status values prior to data cleaning shows that the 29 standardized codes were used very rarely prior to FY 2013; after FY 2013 the standardized values are more frequent. We have translated unstandardized values with more than 100 occurrences into their standardized equivalents, where possible; see cleaning description above and code in the clean/ task. In total this field contains more than 937 unique values, after cleaning.
The Convictions, Criminality and Code fields all represent different ways of coding a passenger's criminal status. Convictions is an unstructured field with 14709 unique values, presumably representing each passenger's most serious criminal conviction. The Criminality field is more structured and was easily cleaned into a binary category where "NC" represents passengers without a criminal conviction and "C" represents passengers with a criminal conviction. However, it is important to note that this field is not always consistent with the Convictions field, and it contains 21163 missing values, especially in the earlier period of the dataset. The Code field consists of a relatively structured set of alphabetic codes which also appear to relate to criminal status, but their meaning is unclear.
The Age and Juvenile fields are relatively self-explanatory. Juvenile is a binary field where all passengers aged 17 or younger are marked True; the values in this field are consistent with the numeric values in Age. Values below 0 or above 99 in Age were set as null values in the cleaning process.
Notably, two ARTS fields that would likely be of great interest to researchers and advocates, FamilyUnitFlag and UnaccompaniedFlag (presumably relating to family groups and unaccompanied minors), are entirely unused in this dataset: not a single record is flagged with either of these values. As noted above, there is no indication that these values have been redacted or withheld by ICE, suggesting that they are simply not used.
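The checks described for the Age, Juvenile, and flag fields can be sketched on a made-up toy table as follows. This is an illustration of the stated rules, not the project's cleaning code; how the unused flag fields are actually encoded in the released data (for example as False, zero, or missing) is not specified here, so the all-False column below is an assumption.

```python
import pandas as pd

# Made-up values; FamilyUnitFlag encoding is an assumption for illustration.
toy = pd.DataFrame({
    'Age': [16.0, 27.0, -1.0, 120.0, 17.0],
    'Juvenile': [True, False, False, False, True],
    'FamilyUnitFlag': [False] * 5,
})

# Ages below 0 or above 99 are set to missing, per the cleaning rule described above.
toy['Age'] = toy['Age'].where(toy['Age'].between(0, 99))

# Wherever Age is known, Juvenile should be True exactly when Age is 17 or younger.
known = toy['Age'].notnull()
consistent = (toy.loc[known, 'Juvenile'] == (toy.loc[known, 'Age'] <= 17)).all()
assert consistent

# An unused flag field has no record marked True.
flag_used = bool(toy['FamilyUnitFlag'].any())
assert not flag_used
```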

Notes

^[1] The DHS OIG explains the ARTS database as such: "ICE Air records data pertaining to charter flights and detainees in the Alien Repatriation Tracking System (ARTS). A contractor who managed the program’s daily activities created the ARTS database to provide a historical record of charter flights, monthly statistical reports, and data responses to ICE required under the contract. ARTS captures information such as the dates, routes, detainees, delays or cancelations, and costs associated with each charter flight."

^[2] The 2015 audit by the DHS OIG defines an ICE Air mission: "A mission begins when the aircraft departs from its initial location and ends when the aircraft reaches its final destination. Depending on the need, a mission can contain one or more stops to destinations within the United States, internationally, or a combination." In a footnote, the DHS OIG specifies, "The definition for the term “mission” is derived from the definition used in the former Office of Detention and Removal Operations’ Policy and Procedure Manual (June 2008)." UWCHR researchers have not been able to locate this reference.