Use of Solitary Confinement at the Northwest Detention Center: Data Appendix

1. GEO Group Internal Datasets ("SMU", "RHU")

UW Center for Human Rights

Back to Data Appendix Index

Data analyzed:

1.1 - GEOtrack report of Special Management Unit ("SMU") housing assignments at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.

1.2 - GEO Group Segregation Lieutenant's log of Restricted Housing Unit ("RHU") placements at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yaml

with open('input/cleanstats.yaml','r') as yamlfile:
    # safe_load avoids yaml.load's unsafe default loader, which is deprecated/removed in newer PyYAML versions
    cur_yaml = yaml.safe_load(yamlfile)
    smu_cleanstats = cur_yaml['output/smu.csv.gz']
    rhu_cleanstats = cur_yaml['output/rhu.csv.gz']

1.1 - GEOtrack report ("SMU")

Original filename: Sep_1_2013_to_March_31_2020_SMU_geotrack_report_Redacted.pdf

Described by US DOJ attorneys for ICE as follows:

"The GEOtrack report that was provided to Plaintiffs runs from September 1, 2013 to March 31, 2020. That report not only reports all placements into segregation, but it also tracks movement. This means that if an individual is placed into one particular unit then simply moves to a different unit, it is tracked in that report (if an individual is moved from H unit cell 101 to H unit cell 102, it would reflect the move as a new placement on the report)."

We refer to this dataset here by the shorthand "SMU" for "Special Management Unit".

The original file has been converted from PDF to CSV format using the Xpdf pdftotext command line tool with the --table option, and hand-cleaned to correct OCR errors. The resulting CSV has been minimally cleaned in a private repository, dropping 14 duplicated records and adding a unique identifier field, hashid; cleaning code available upon request.
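The cleaning code itself is private; the following is only a hypothetical sketch (the file path and hashing scheme are assumptions, not the actual implementation) of how duplicates might be dropped and a content-based hashid derived:

import hashlib
import pandas as pd

# Hypothetical input path; the actual hand-cleaned file and cleaning code are private
smu_raw = pd.read_csv('frozen/smu_handcleaned.csv')

# Drop exact duplicates (the private cleaning step removed 14 such records)
smu_raw = smu_raw.drop_duplicates()

def make_hashid(row):
    # Hash the concatenated field values to produce a stable record identifier
    record = '|'.join(str(v) for v in row)
    return hashlib.sha1(record.encode('utf-8')).hexdigest()

smu_raw['hashid'] = smu_raw.apply(make_hashid, axis=1)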

The original file includes three redacted fields: Alien #, Name, and Birthdate. The file appears to be generated by a database report for the date range "9/1/2013 To 3/31/2020", presumably from the "GEOtrack" database referenced in the filename and by the DOJ attorneys for ICE. The original file has no un-redacted unique field identifiers or individual identifiers.

csv_opts = {'sep': '|',
            'quotechar': '"',
            'compression': 'gzip',
            'encoding': 'utf-8'}

smu = pd.read_csv('input/smu.csv.gz', **csv_opts)

# Confirm hashid is a unique, non-null record identifier
assert len(set(smu['hashid'])) == len(smu)
assert sum(smu['hashid'].isnull()) == 0

# Data columns excluding the hashid identifier
data_cols = list(smu.columns)
data_cols.remove('hashid')

print(smu.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3433 entries, 0 to 3432
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   citizenship    3433 non-null   object
 1   housing        3433 non-null   object
 2   assigned_dt    3433 non-null   object
 3   removed_dt     3433 non-null   object
 4   days_in_seg    3433 non-null   int64
 5   assigned_date  3433 non-null   object
 6   assigned_hour  3433 non-null   object
 7   removed_date   3433 non-null   object
 8   removed_hour   3433 non-null   object
 9   hashid         3433 non-null   object
dtypes: int64(1), object(9)
memory usage: 268.3+ KB
None

Here we display the first five records in the dataset (excluding hashid field):
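A call along the following lines produces the sample shown below (data_cols, defined above, excludes the hashid field):

print(smu[data_cols].head())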

citizenship | housing | assigned_dt | removed_dt | days_in_seg | assigned_date | assigned_hour | removed_date | removed_hour
GUATEMALA | H-NA-108 | 6/27/2013 1:31:00AM | 4/9/2014 11:49:00PM | 286 | 6/27/2013 | 1:31:00AM | 4/9/2014 | 11:49:00PM
MEXICO | H-NA-205 | 8/5/2013 2:30:00PM | 11/10/2014 6:34:00AM | 462 | 8/5/2013 | 2:30:00PM | 11/10/2014 | 6:34:00AM
MEXICO | H-NA-106 | 8/8/2013 10:08:00AM | 9/6/2013 11:41:00AM | 29 | 8/8/2013 | 10:08:00AM | 9/6/2013 | 11:41:00AM
MARSHALL ISLANDS | H-NA-203 | 8/15/2013 11:17:00AM | 9/13/2013 9:05:00AM | 29 | 8/15/2013 | 11:17:00AM | 9/13/2013 | 9:05:00AM
MEXICO | H-NA-209 | 8/15/2013 10:07:00PM | 9/9/2013 12:00:00AM | 25 | 8/15/2013 | 10:07:00PM | 9/9/2013 | 12:00:00AM
# All date fields convert successfully

assert pd.to_datetime(smu['assigned_dt']).isnull().sum() == 0
smu['assigned_dt'] = pd.to_datetime(smu['assigned_dt'])
assert pd.to_datetime(smu['removed_dt']).isnull().sum() == 0
smu['removed_dt'] = pd.to_datetime(smu['removed_dt'])
assert pd.to_datetime(smu['assigned_date']).isnull().sum() == 0
smu['assigned_date'] = pd.to_datetime(smu['assigned_date'])
assert pd.to_datetime(smu['removed_date']).isnull().sum() == 0
smu['removed_date'] = pd.to_datetime(smu['removed_date'])

The GEOtrack database export time frame (9/1/2013 to 3/31/2020) matches the minimum and maximum removed_dt values:

print(smu['assigned_dt'].describe())
print()
print(smu['removed_dt'].describe())
count                    3433
unique                   3297
top       2020-02-28 04:29:00
freq                        3
first     2013-06-27 01:31:00
last      2020-03-31 18:28:00
Name: assigned_dt, dtype: object

count                    3433
unique                   3303
top       2020-03-31 12:00:00
freq                       17
first     2013-09-01 18:18:00
last      2020-03-31 12:00:00
Name: removed_dt, dtype: object

One record has a removed_dt value earlier than its assigned_dt value, but the discrepancy is limited to the hour values:
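This record can be isolated with a filter such as the following:

print(smu.loc[smu['removed_dt'] < smu['assigned_dt'], data_cols])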

citizenship | housing | assigned_dt | removed_dt | days_in_seg | assigned_date | assigned_hour | removed_date | removed_hour
MEXICO | H-NA-110 | 2020-03-31 18:28:00 | 2020-03-31 12:00:00 | 0 | 2020-03-31 | 6:28:00PM | 2020-03-31 | 12:00:00PM

81 records have a removed_dt value equal to assigned_dt, as seen in this sample of five records:
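A filter along these lines selects the records in question:

same_dt = smu.loc[smu['removed_dt'] == smu['assigned_dt'], data_cols]
assert len(same_dt) == 81
print(same_dt.head())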

citizenship | housing | assigned_dt | removed_dt | days_in_seg | assigned_date | assigned_hour | removed_date | removed_hour
MEXICO | H-NA-209 | 2013-09-22 03:25:00 | 2013-09-22 03:25:00 | 0 | 2013-09-22 | 3:25:00AM | 2013-09-22 | 3:25:00AM
MEXICO | H-NA-209 | 2013-09-22 03:29:00 | 2013-09-22 03:29:00 | 0 | 2013-09-22 | 3:29:00AM | 2013-09-22 | 3:29:00AM
UKRAINE | H-NA-210 | 2013-11-08 03:30:00 | 2013-11-08 03:30:00 | 0 | 2013-11-08 | 3:30:00AM | 2013-11-08 | 3:30:00AM
MOROCCO | H-NA-103 | 2013-11-29 02:11:00 | 2013-11-29 02:11:00 | 0 | 2013-11-29 | 2:11:00AM | 2013-11-29 | 2:11:00AM
LAOS | H-NA-102 | 2013-12-28 20:03:00 | 2013-12-28 20:03:00 | 0 | 2013-12-28 | 8:03:00PM | 2013-12-28 | 8:03:00PM

We retain these records despite the logical inconsistency of these datetime fields, under the assumption that they represent short placements of less than one full day.

Recalculating segregation placement length based on dates only yields the same values as the days_in_seg field.

Note that this calculation is not first-day inclusive, unlike the formula used in the original version of the RHU dataset (see below). We disregard hourly data for comparison purposes, as no other dataset includes hourly placement or release times.

# First-day exclusive stay length in whole days
smu['days_calc'] = (smu['removed_date'] - smu['assigned_date']) / np.timedelta64(1, 'D')
# Recalculated values match the reported days_in_seg field for every record
assert sum(smu['days_in_seg'] == smu['days_calc']) == len(smu)

The descriptive statistics below reflect first-day exclusive stay lengths, including stays of 0 days. 547 records, or 15.93%, reflect stay lengths of less than one full day, based on placement dates. Note that placements in the SMU dataset represent specific housing assignments within one of the 20 cells in the segregation unit, and would therefore be expected to reflect more, and shorter, placements than other datasets:
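The zero-day figures cited above can be reproduced as follows (assuming zero-day stays are exactly those with days_calc equal to 0):

zero_day = (smu['days_calc'] == 0).sum()
print(zero_day, f"{zero_day / len(smu):.2%}")  # 547 records, 15.93%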

print(smu['days_calc'].describe())
count    3433.000000
mean        9.976697
std        23.672531
min         0.000000
25%         1.000000
50%         3.000000
75%        10.000000
max       488.000000
Name: days_calc, dtype: float64

All 20 housing units are in use during each year covered by the dataset, but usage patterns vary, with housing units in the 200 block associated with longer average placements:

smu_annual = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])

# Number of distinct housing units used per year
housing_unit_count = smu_annual['housing'].nunique()

# All 20 housing units appear in every year of the dataset
assert (housing_unit_count == 20).all()

print(smu.groupby('housing')['days_calc'].mean())
housing
H-NA-101     5.921162
H-NA-102     6.841202
H-NA-103     6.396624
H-NA-104     5.917355
H-NA-105     7.285088
H-NA-106     8.342857
H-NA-107     7.387755
H-NA-108     8.229665
H-NA-109     6.763158
H-NA-110     7.327014
H-NA-201    14.446281
H-NA-202    14.398374
H-NA-203    13.976923
H-NA-204    14.376000
H-NA-205    41.538462
H-NA-206    21.675676
H-NA-207    20.444444
H-NA-208    12.503597
H-NA-209    13.992308
H-NA-210    11.358974
Name: days_calc, dtype: float64

Annual median and mean placement lengths show an increase during calendar years 2017-2018:

g = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])

smu_annual_med = g['days_calc'].median()

smu_annual_avg = g['days_calc'].mean()

print(smu_annual_med)
print()
print(smu_annual_avg)
assigned_date
2013-01-01    3.0
2014-01-01    3.0
2015-01-01    3.0
2016-01-01    3.0
2017-01-01    4.0
2018-01-01    5.0
2019-01-01    4.0
2020-01-01    2.0
Freq: AS-JAN, Name: days_calc, dtype: float64

assigned_date
2013-01-01    10.785185
2014-01-01     7.831721
2015-01-01     9.393814
2016-01-01     9.889693
2017-01-01    12.404537
2018-01-01    11.188940
2019-01-01     9.501969
2020-01-01     7.437956
Freq: AS-JAN, Name: days_calc, dtype: float64

Total placement counts per calendar year (note incomplete data for 2013, 2020):

smu_total_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()

print(smu_total_annual)
assigned_dt
2013-01-01    270
2014-01-01    517
2015-01-01    485
2016-01-01    553
2017-01-01    529
2018-01-01    434
2019-01-01    508
2020-01-01    137
Freq: AS-JAN, Name: hashid, dtype: int64

Stays over 14 days must be reported to ICE's SRMS; here we flag long placements and calculate them as a percentage of total placements per year. Again, note that placements here are housing assignments to one of 20 housing locations, not cumulative stays, so long stays may not be accurately represented. The lack of unique identifiers makes it impossible to track cases of individuals in segregation for a total of 14 non-consecutive days during any 21-day period, or individuals with special vulnerabilities.

We find that long placements increase over time, both in absolute terms and as a proportion of total placements. However, this may simply reflect fewer transfers of individuals between housing assignments:

smu['long_stay'] = smu['days_calc'] > 14
long_stays_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()

print(long_stays_annual)
print()
print(long_stays_annual / smu_total_annual)
assigned_dt
2013-01-01     48.0
2014-01-01     66.0
2015-01-01     63.0
2016-01-01     85.0
2017-01-01     94.0
2018-01-01    105.0
2019-01-01    111.0
2020-01-01     26.0
Freq: AS-JAN, Name: long_stay, dtype: float64

assigned_dt
2013-01-01    0.177778
2014-01-01    0.127660
2015-01-01    0.129897
2016-01-01    0.153707
2017-01-01    0.177694
2018-01-01    0.241935
2019-01-01    0.218504
2020-01-01    0.189781
Freq: AS-JAN, dtype: float64

Top citizenship values:

Table 1: SMU dataset top five countries of citizenship

citizenship | placements
MEXICO | 1782
EL SALVADOR | 189
HONDURAS | 148
GUATEMALA | 133
CANADA | 95
ALL OTHERS | 1086
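A sketch of how such a tabulation can be produced with pandas (the published table may have been generated differently):

cit_counts = smu['citizenship'].value_counts()
top5 = cit_counts.head(5).copy()
# Aggregate all remaining countries into a single row
top5['ALL OTHERS'] = cit_counts.iloc[5:].sum()
print(top5)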

Comparison with segregation placements reported by DHS inspectors

A June 24-26, 2014 DHS inspection report for NWDC states, "Documentation reflects there were 776 assignments to segregation in the past year". The DHS inspection report does not specify the source of the records cited.

The SMU dataset covers this period, albeit with only partial records prior to September 2013. The total count of placements recorded in the SMU dataset during this period, 615, is reasonably close to the figure cited by DHS inspectors, which implies an average of about 65.0 placements per month (roughly 776 / 12):

# Monthly total placements during period of DHS inspection report:

dhs_period = smu.set_index('assigned_dt').loc[:'2014-06-30']

g = dhs_period.groupby(pd.Grouper(freq='M'))

print(g['hashid'].nunique())

dhs_period_complete = smu.set_index('assigned_dt').loc['2013-09-01':'2014-06-30']

g = dhs_period_complete.groupby(pd.Grouper(freq='M'))

dhs_period_complete_monthly_avg = g['hashid'].nunique().mean()
assigned_dt
2013-06-30     1
2013-07-31     0
2013-08-31    14
2013-09-30    65
2013-10-31    56
2013-11-30    56
2013-12-31    78
2014-01-31    61
2014-02-28    55
2014-03-31    61
2014-04-30    63
2014-05-31    48
2014-06-30    57
Freq: M, Name: hashid, dtype: int64

This is comparable to the average of 60.0 placements per month reported in the SMU dataset during the period for which complete data exists (September 2013 - June 2014). If the GEOtrack database is the source of the data cited in the 2014 DHS inspection report, this is not noted in the inspection report itself.

1.2 - GEO Lieutenant's report ("RHU")

Original file: 15_16_17_18_19_20_RHU_admission_Redacted.xlsx

A log created and maintained by hand by a GEO employee to track Restricted Housing Unit placements, described by US DOJ attorneys for ICE as follows:

"The spreadsheet runs from January 2015 to May 28, 2020 and was created by and for a lieutenant within the facility once he took over the segregation lieutenant duties. The spreadsheet is updated once a detainee departs segregation. The subjects who are included on this list, therefore, are those who were placed into segregation and have already been released from segregation. It does not include those individuals who are currently in segregation."

We refer to this dataset here by the shorthand "RHU" for "Restricted Housing Unit".[1]

The original file has been converted from XLSX to CSV format, with each annual tab saved as a separate CSV. The resulting CSVs have been concatenated and minimally cleaned in a private repository, dropping 75 duplicated records and adding a unique identifier field, hashid; cleaning code available upon request.
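The concatenation step might look roughly like the following (file paths are hypothetical; the actual cleaning code is private):

import glob
import pandas as pd

# Hypothetical: one hand-exported CSV per annual spreadsheet tab
frames = [pd.read_csv(f) for f in sorted(glob.glob('frozen/rhu_20*.csv'))]
rhu_raw = pd.concat(frames, ignore_index=True)

# Drop exact duplicates (the private cleaning step removed 75 such records)
rhu_raw = rhu_raw.drop_duplicates()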

The original file includes two fully redacted fields: Name and Alien #; and one partially redacted field, Placement reason. The original file has no un-redacted unique field identifiers or individual identifiers.

csv_opts = {'sep': '|',
            'quotechar': '"',
            'compression': 'gzip',
            'encoding': 'utf-8'}

rhu = pd.read_csv('input/rhu.csv.gz', **csv_opts)

assert len(set(rhu['hashid'])) == len(rhu)
assert sum(rhu['hashid'].isnull()) == 0

data_cols = list(rhu.columns)
data_cols.remove('hashid')

print(rhu.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2457 entries, 0 to 2456
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   name               2457 non-null   object
 1   alien_#            2457 non-null   object
 2   date_in            2457 non-null   object
 3   date_out           2457 non-null   object
 4   total_days         2457 non-null   int64
 5   initial_placement  2457 non-null   object
 6   placement_reason   2457 non-null   object
 7   disc_seg           2457 non-null   object
 8   release_reason     2457 non-null   object
 9   hashid             2457 non-null   object
dtypes: int64(1), object(9)
memory usage: 192.1+ KB
None

Here we display the first five records in the dataset (excluding hashid field):
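As with the SMU data, a call along these lines produces the sample shown below (data_cols here excludes the hashid field):

print(rhu[data_cols].head())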

name | alien_# | date_in | date_out | total_days | initial_placement | placement_reason | disc_seg | release_reason
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/3/2015 | 1/5/2015 | 20 | pending hearing | refusing staff orders | N | release to gen pop
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/10/2015 | 1/12/2015 | 294 | pending hearing | assault | N | release from custody
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/12/2015 | 1/16/2015 | 8 | pending hearing | disrupting facility operations | Y | released to medical
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/13/2015 | 1/29/2015 | 3 | pending hearing | assault | Y | release to gen pop
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/14/2015 | 2/6/2015 | 4 | security risk | (b)(6), (b)(7)(C), (b)(7)(E) | N | release from custody

Dates and total days calculation

Inspection of the original Excel file shows that the Total days column values are often incorrect, apparently because a cell formula was missing or misapplied. For example, on the "2020" spreadsheet tab, the Total days values are integers which only occasionally align with placement lengths calculated from the Date in and Date out columns. However, additional rows at the bottom of the sheet, which contain no values in other fields, include an Excel formula ("=(D138-C138)+1") which should have been used to calculate these values. Comparing calculated stay lengths with the reported Total days suggests that this formula was not updated consistently, causing values to become misaligned. Additionally, the "2015" spreadsheet tab includes many Total days values equal to "1", suggesting that the formula was applied incorrectly or with missing data.

We can recalculate actual stay lengths either using the formula cited above (inclusive of the start day, with stays of less than one day counted as "1"), or using the formula applied to the "SMU" records above (exclusive of the start day, with stays of less than one day counted as "0"), for more consistent comparison with other datasets.

The above issue raises the possibility that fields other than Total days may also be misaligned in the original dataset. One fact mitigating this possibility is that no Date out values predate their associated Date in values. We can also look more closely at qualitative fields to make an educated guess about data quality: for example, do initial_placement values suggesting disciplinary placements align with placement_reason values also consistent with disciplinary placements? However, we do not intend to use this dataset for detailed qualitative analysis; of most interest are total segregation placements and segregation stay lengths.
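As an illustration only, a cross-tabulation along the following lines would show whether initial_placement categories track the disc_seg flag; this is a quick consistency check of the kind described above, not part of our analysis:

print(pd.crosstab(rhu['initial_placement'].str.strip().str.lower(),
                  rhu['disc_seg'].str.strip().str.upper()))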

rhu['date_in'] = pd.to_datetime(rhu['date_in'])
rhu['date_out'] = pd.to_datetime(rhu['date_out'])

# As noted above, no `date_out` values predate associated `date_in` values:

assert sum(rhu['date_in'] > rhu['date_out']) == 0

print(rhu['date_in'].describe())
print()
print(rhu['date_out'].describe())
count                    2457
unique                   1237
top       2019-01-08 00:00:00
freq                       12
first     2015-01-03 00:00:00
last      2020-05-28 00:00:00
Name: date_in, dtype: object

count                    2457
unique                   1105
top       2017-06-13 00:00:00
freq                       11
first     2015-01-05 00:00:00
last      2020-05-29 00:00:00
Name: date_out, dtype: object

Here we recalculate the total days field based on the first day inclusive formula in the original Excel spreadsheet ("=(D138-C138)+1"):

# First-day inclusive stay length, per the original spreadsheet formula
rhu['total_days_calc'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D') + 1

# Share of original total_days values that match the recalculated lengths
compare_pct = sum(rhu['total_days_calc'] == rhu['total_days']) / len(rhu) * 100

print(rhu['total_days'].describe())
print()
print(rhu['total_days_calc'].describe())
count    2457.000000
mean       13.877900
std        37.544802
min         1.000000
25%         2.000000
50%         4.000000
75%        11.000000
max       694.000000
Name: total_days, dtype: float64

count    2457.000000
mean       15.077737
std        38.142819
min         1.000000
25%         3.000000
50%         5.000000
75%        12.000000
max       694.000000
Name: total_days_calc, dtype: float64

Only 7.86% of original total_days values match their respective recalculated stay lengths in total_days_calc.

However, note that the above summary statistics for the original field (total_days) are very similar to the recalculated field (total_days_calc), suggesting that most values are present in the dataset but misaligned.

We therefore conclude that recalculating the total_days field is justified. Instead of the first-day inclusive formula suggested by the original dataset, here we use a first-day exclusive formula, under which placements starting and ending on the same day have a length of 0. While this risks underestimating the placement lengths represented in the dataset, it is more consistent with the calculation of placement lengths in the SMU and SRMS datasets:

# Overwrite total_days with the first-day exclusive calculation
rhu['total_days'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D')
rhu = rhu.drop('total_days_calc', axis=1)

print(rhu['total_days'].describe())
count    2457.000000
mean       14.077737
std        38.142819
min         0.000000
25%         2.000000
50%         4.000000
75%        11.000000
max       693.000000
Name: total_days, dtype: float64

Annual median and mean placement lengths are relatively consistent, with an apparent decrease during the first few months of 2020; this may be explained by the fact that placements still underway are excluded from this dataset:

g = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])

rhu_annual_med = g['total_days'].median()

rhu_annual_avg = g['total_days'].mean()

print(rhu_annual_med)
print()
print(rhu_annual_avg)
date_in
2015-01-01    4.0
2016-01-01    4.0
2017-01-01    4.0
2018-01-01    4.0
2019-01-01    4.0
2020-01-01    3.0
Freq: AS-JAN, Name: total_days, dtype: float64

date_in
2015-01-01    14.374631
2016-01-01    12.013761
2017-01-01    16.978927
2018-01-01    13.826690
2019-01-01    14.343681
2020-01-01     8.848485
Freq: AS-JAN, Name: total_days, dtype: float64

Total placement counts per calendar year (note data for 2020 is incomplete):

rhu_total_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()

print(rhu_total_annual)
date_in
2015-01-01    339
2016-01-01    436
2017-01-01    522
2018-01-01    577
2019-01-01    451
2020-01-01    132
Freq: AS-JAN, Name: hashid, dtype: int64

Stays over 14 days must be reported to ICE's SRMS; here we flag long placements and calculate them as a percentage of total placements per year. The lack of unique identifiers makes it impossible to track cases of individuals in segregation for a total of 14 non-consecutive days during any 21-day period. Inconsistencies and lack of information in placement_reason make it a poor candidate for flagging placements involving individuals with special vulnerabilities. We note an increasing proportion and absolute number of long placements during 2017-2019:

rhu['long_stay'] = rhu['total_days'] > 14
long_stays_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()

print(long_stays_annual)
print()
print(long_stays_annual / rhu_total_annual)
date_in
2015-01-01     52.0
2016-01-01     73.0
2017-01-01    102.0
2018-01-01    129.0
2019-01-01    103.0
2020-01-01     21.0
Freq: AS-JAN, Name: long_stay, dtype: float64

date_in
2015-01-01    0.153392
2016-01-01    0.167431
2017-01-01    0.195402
2018-01-01    0.223570
2019-01-01    0.228381
2020-01-01    0.159091
Freq: AS-JAN, dtype: float64

There are 12 initial_placement values. These closely correspond to the placement_reason values cited in the SRMS datasets (see SRMS 2, National SRMS Comparison appendices). The most common initial_placement values (not correcting for some minor spelling variations) are:

print(rhu['initial_placement'].str.strip().str.lower().value_counts().head(5))
pending hearing       1161
security risk          543
medical                393
protective custody     331
disciplinary             9
Name: initial_placement, dtype: int64

There are 319 placement_reason values, including some redacted fields. Below we print the 10 most common values:

print(rhu['placement_reason'].str.strip().str.lower().value_counts().head(10))
fighting                        398
fears general population        209
medical overflow                209
(b)(6), (b)(7)(c), (b)(7)(e)    163
assault                          93
assault on detainee              69
medical isolation overflow       67
sharpened instrument             58
threats to staff                 43
threat to others                 39
Name: placement_reason, dtype: int64

There are 54 release_reason values (not correcting for spelling or other variations). Below we print the 10 most common values:

print(rhu['release_reason'].str.strip().str.lower().value_counts().head(10))
release to population          408
release to gen pop             344
release from custody           313
release to medical             227
transfer facility              172
disc time complete             168
discipline time complete       130
time served                    121
not guilty                     115
time served - release to gp     80
Name: release_reason, dtype: int64

The field disc_seg flags disciplinary segregation placements, which require a hearing process, as opposed to administrative segregation placements. The majority of placements are administrative. Average stay lengths for disciplinary and administrative placements are similar, though median stay lengths are longer for disciplinary placements:

rhu['disc_seg'] = rhu['disc_seg'].str.strip().str.upper()

assert sum(rhu['disc_seg'].isnull()) == 0

print('Proportion:')
print(rhu['disc_seg'].value_counts(normalize=True, dropna=False))
print('\nCount per year:')
print(rhu.set_index('date_in').groupby(pd.Grouper(freq='AS'))['disc_seg'].value_counts())
print('\nStay length by category:')
print(rhu.set_index('date_in').groupby(['disc_seg'])['total_days'].describe())
print('\nAnnual mean stay length by category:')
print(rhu.set_index('date_in').groupby([pd.Grouper(freq='AS'), 'disc_seg'])['total_days'].mean())
Proportion:
N    0.756614
Y    0.243386
Name: disc_seg, dtype: float64

Count per year:
date_in     disc_seg
2015-01-01  N           241
            Y            98
2016-01-01  N           339
            Y            97
2017-01-01  N           411
            Y           111
2018-01-01  N           432
            Y           145
2019-01-01  N           339
            Y           112
2020-01-01  N            97
            Y            35
Name: disc_seg, dtype: int64

Stay length by category:
           count       mean        std  min  25%   50%   75%    max
disc_seg
N         1859.0  13.866595  41.405676  0.0  1.0   3.0   7.0  693.0
Y          598.0  14.734114  25.474417  1.0  6.0  10.0  19.0  412.0

Annual mean stay length by category:
date_in     disc_seg
2015-01-01  N           14.514523
            Y           14.030612
2016-01-01  N           11.687316
            Y           13.154639
2017-01-01  N           17.530414
            Y           14.936937
2018-01-01  N           12.780093
            Y           16.944828
2019-01-01  N           14.268437
            Y           14.571429
2020-01-01  N            7.783505
            Y           11.800000
Name: total_days, dtype: float64

Next section: Data Appendix 2. Comparison of GEO Group and ICE SRMS Datasets

Back to Data Appendix Index