Data analyzed:
1.1 - GEO Group Segregation Lieutenant's log of Restricted Housing Unit ("RHU") placements at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.

1.2 - GEOTrack report of Segregation Management Unit ("SMU") housing assignments at NWDC, released to UWCHR via FOIA litigation on August 12, 2020.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import yaml
with open('input/cleanstats.yaml', 'r') as yamlfile:
    cur_yaml = yaml.safe_load(yamlfile)

smu_cleanstats = cur_yaml['output/smu.csv.gz']
rhu_cleanstats = cur_yaml['output/rhu.csv.gz']
Original filename: Sep_1_2013_to_March_31_2020_SMU_geotrack_report_Redacted.pdf
Described by US DOJ attorneys for ICE as follows:
"The GEOtrack report that was provided to Plaintiffs runs from September 1, 2013 to March 31, 2020. That report not only reports all placements into segregation, but it also tracks movement. This means that if an individual is placed into one particular unit then simply moves to a different unit, it is tracked in that report (if an individual is moved from H unit cell 101 to H unit cell 102, it would reflect the move as a new placement on the report)."
We refer to this dataset here by the shorthand "SMU" for "Special Management Unit".
The original file has been converted from PDF to CSV format using the Xpdf pdftotext command line tool with the -table option, and hand cleaned to correct OCR errors. The resulting CSV has been minimally cleaned in a private repository, dropping 14 duplicated records and adding a unique identifier field, hashid; cleaning code available upon request.
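The cleaning code itself is private, but one plausible approach to generating such an identifier is to hash the concatenated field values of each deduplicated row. The function below is a hypothetical sketch, not the project's actual code:

import hashlib

def make_hashid(row):
    # Concatenate all field values and hash them, yielding a stable,
    # content-derived record identifier (hypothetical implementation)
    record = '|'.join(str(v) for v in row)
    return hashlib.sha1(record.encode('utf-8')).hexdigest()

# df = df.drop_duplicates()
# df['hashid'] = df.apply(make_hashid, axis=1)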
The original file includes three redacted fields: Alien #, Name, and Birthdate. The file appears to be generated by a database report for the date range "9/1/2013 To 3/31/2020", presumably from the "GEOtrack" database referenced in the filename and by the DOJ attorneys for ICE. The original file has no un-redacted unique field identifiers or individual identifiers.
csv_opts = {'sep': '|',
'quotechar': '"',
'compression': 'gzip',
'encoding': 'utf-8'}
smu = pd.read_csv('input/smu.csv.gz', **csv_opts)
assert len(set(smu['hashid'])) == len(smu)
assert sum(smu['hashid'].isnull()) == 0
data_cols = list(smu.columns)
data_cols.remove('hashid')
print(smu.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3433 entries, 0 to 3432
Data columns (total 10 columns):
 #   Column         Non-Null Count  Dtype 
---  ------         --------------  ----- 
 0   citizenship    3433 non-null   object
 1   housing        3433 non-null   object
 2   assigned_dt    3433 non-null   object
 3   removed_dt     3433 non-null   object
 4   days_in_seg    3433 non-null   int64 
 5   assigned_date  3433 non-null   object
 6   assigned_hour  3433 non-null   object
 7   removed_date   3433 non-null   object
 8   removed_hour   3433 non-null   object
 9   hashid         3433 non-null   object
dtypes: int64(1), object(9)
memory usage: 268.3+ KB
None
Here we display the first five records in the dataset (excluding the hashid field):
citizenship | housing | assigned_dt | removed_dt | days_in_seg | assigned_date | assigned_hour | removed_date | removed_hour |
---|---|---|---|---|---|---|---|---|
GUATEMALA | H-NA-108 | 6/27/2013 1:31:00AM | 4/9/2014 11:49:00PM | 286 | 6/27/2013 | 1:31:00AM | 4/9/2014 | 11:49:00PM |
MEXICO | H-NA-205 | 8/5/2013 2:30:00PM | 11/10/2014 6:34:00AM | 462 | 8/5/2013 | 2:30:00PM | 11/10/2014 | 6:34:00AM |
MEXICO | H-NA-106 | 8/8/2013 10:08:00AM | 9/6/2013 11:41:00AM | 29 | 8/8/2013 | 10:08:00AM | 9/6/2013 | 11:41:00AM |
MARSHALL ISLANDS | H-NA-203 | 8/15/2013 11:17:00AM | 9/13/2013 9:05:00AM | 29 | 8/15/2013 | 11:17:00AM | 9/13/2013 | 9:05:00AM |
MEXICO | H-NA-209 | 8/15/2013 10:07:00PM | 9/9/2013 12:00:00AM | 25 | 8/15/2013 | 10:07:00PM | 9/9/2013 | 12:00:00AM |
# All date fields convert successfully
assert pd.to_datetime(smu['assigned_dt']).isnull().sum() == 0
smu['assigned_dt'] = pd.to_datetime(smu['assigned_dt'])
assert pd.to_datetime(smu['removed_dt']).isnull().sum() == 0
smu['removed_dt'] = pd.to_datetime(smu['removed_dt'])
assert pd.to_datetime(smu['assigned_date']).isnull().sum() == 0
smu['assigned_date'] = pd.to_datetime(smu['assigned_date'])
assert pd.to_datetime(smu['removed_date']).isnull().sum() == 0
smu['removed_date'] = pd.to_datetime(smu['removed_date'])
The GEOTrack database export time frame conforms to the removed_dt min/max values:
print(smu['assigned_dt'].describe())
print()
print(smu['removed_dt'].describe())
count 3433
unique 3297
top 2020-02-28 04:29:00
freq 3
first 2013-06-27 01:31:00
last 2020-03-31 18:28:00
Name: assigned_dt, dtype: object
count 3433
unique 3303
top 2020-03-31 12:00:00
freq 17
first 2013-09-01 18:18:00
last 2020-03-31 12:00:00
Name: removed_dt, dtype: object
One record has a removed_dt value less than its assigned_dt value, but this is only a discrepancy in the hour values:
citizenship | housing | assigned_dt | removed_dt | days_in_seg | assigned_date | assigned_hour | removed_date | removed_hour |
---|---|---|---|---|---|---|---|---|
MEXICO | H-NA-110 | 2020-03-31 18:28:00 | 2020-03-31 12:00:00 | 0 | 2020-03-31 | 6:28:00PM | 2020-03-31 | 12:00:00PM |
81 records have a removed_dt value equal to assigned_dt, as seen in this sample of five records:
citizenship | housing | assigned_dt | removed_dt | days_in_seg | assigned_date | assigned_hour | removed_date | removed_hour |
---|---|---|---|---|---|---|---|---|
MEXICO | H-NA-209 | 2013-09-22 03:25:00 | 2013-09-22 03:25:00 | 0 | 2013-09-22 | 3:25:00AM | 2013-09-22 | 3:25:00AM |
MEXICO | H-NA-209 | 2013-09-22 03:29:00 | 2013-09-22 03:29:00 | 0 | 2013-09-22 | 3:29:00AM | 2013-09-22 | 3:29:00AM |
UKRAINE | H-NA-210 | 2013-11-08 03:30:00 | 2013-11-08 03:30:00 | 0 | 2013-11-08 | 3:30:00AM | 2013-11-08 | 3:30:00AM |
MOROCCO | H-NA-103 | 2013-11-29 02:11:00 | 2013-11-29 02:11:00 | 0 | 2013-11-29 | 2:11:00AM | 2013-11-29 | 2:11:00AM |
LAOS | H-NA-102 | 2013-12-28 20:03:00 | 2013-12-28 20:03:00 | 0 | 2013-12-28 | 8:03:00PM | 2013-12-28 | 8:03:00PM |
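Both counts can be verified directly with boolean comparisons of the converted datetime fields (a minimal sketch consistent with the figures above):

# One record with removed_dt earlier than assigned_dt; 81 with equal values
assert (smu['removed_dt'] < smu['assigned_dt']).sum() == 1
assert (smu['removed_dt'] == smu['assigned_dt']).sum() == 81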
We retain these records despite the logical inconsistency of these datetime fields, under the assumption that they represent short placements of less than one full day.
Recalculating segregation placement length based on date only results in the same value as the days_in_seg field.

Note that unlike the original version of the RHU dataset (see below), this calculation is not first day inclusive. We will disregard hourly data for comparison purposes, as no other dataset includes hourly placement or release times.
smu['days_calc'] = (smu['removed_date'] - smu['assigned_date']) / np.timedelta64(1, 'D')
assert sum(smu['days_in_seg'] == smu['days_calc']) == len(smu)
The below descriptive statistics reflect first day exclusive stay lengths, including stays of 0 days. 547 records, or 15.93%, reflect stay lengths of less than one day, based on placement dates. Note that placements in the SMU dataset represent specific housing assignments within one of 20 cells in the segregation management unit, and would therefore be expected to reflect more and shorter placements than other datasets:
print(smu['days_calc'].describe())
count 3433.000000
mean 9.976697
std 23.672531
min 0.000000
25% 1.000000
50% 3.000000
75% 10.000000
max 488.000000
Name: days_calc, dtype: float64
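The sub-day share cited above can be checked directly (a sketch, using the days_calc field computed earlier):

# Placements of less than one full day, based on placement dates
sub_day = (smu['days_calc'] < 1).sum()
print(sub_day, f'({sub_day / len(smu) * 100:.2f}% of records)')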
All housing assignments are represented during each year covered by the dataset, but usage patterns vary, with housing units in the 200 block associated with longer average placements:
smu_annual = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])
housing_unit_count = smu_annual['housing'].nunique()
# All 20 housing units appear in every year of the dataset
assert (housing_unit_count == 20).all()
print(smu.groupby('housing')['days_calc'].mean())
housing
H-NA-101 5.921162
H-NA-102 6.841202
H-NA-103 6.396624
H-NA-104 5.917355
H-NA-105 7.285088
H-NA-106 8.342857
H-NA-107 7.387755
H-NA-108 8.229665
H-NA-109 6.763158
H-NA-110 7.327014
H-NA-201 14.446281
H-NA-202 14.398374
H-NA-203 13.976923
H-NA-204 14.376000
H-NA-205 41.538462
H-NA-206 21.675676
H-NA-207 20.444444
H-NA-208 12.503597
H-NA-209 13.992308
H-NA-210 11.358974
Name: days_calc, dtype: float64
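To summarize the pattern noted above, we can group cells by block. This sketch derives the block from the first digit of the cell number, which is an assumption about the housing naming convention:

# Group housing units into 100- and 200-blocks by the first digit of the cell number
block = smu['housing'].str.extract(r'H-NA-(\d)', expand=False) + '00 block'
print(smu.groupby(block)['days_calc'].mean())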
Annual median and mean placement lengths show an increase during calendar years 2017-2018:
g = smu.set_index('assigned_date').groupby([pd.Grouper(freq='AS')])
smu_annual_med = g['days_calc'].median()
smu_annual_avg = g['days_calc'].mean()
print(smu_annual_med)
print()
print(smu_annual_avg)
assigned_date
2013-01-01 3.0
2014-01-01 3.0
2015-01-01 3.0
2016-01-01 3.0
2017-01-01 4.0
2018-01-01 5.0
2019-01-01 4.0
2020-01-01 2.0
Freq: AS-JAN, Name: days_calc, dtype: float64
assigned_date
2013-01-01 10.785185
2014-01-01 7.831721
2015-01-01 9.393814
2016-01-01 9.889693
2017-01-01 12.404537
2018-01-01 11.188940
2019-01-01 9.501969
2020-01-01 7.437956
Freq: AS-JAN, Name: days_calc, dtype: float64
Total placement counts per calendar year (note incomplete data for 2013, 2020):
smu_total_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()
print(smu_total_annual)
assigned_dt
2013-01-01 270
2014-01-01 517
2015-01-01 485
2016-01-01 553
2017-01-01 529
2018-01-01 434
2019-01-01 508
2020-01-01 137
Freq: AS-JAN, Name: hashid, dtype: int64
Stays over 14 days must be reported to ICE SRMS; here we flag long placements and calculate them as a percentage of total placements per year. Again note that placements are by housing assignment in one of 20 total housing locations, not cumulative stay length, so long stays may not be accurately represented here. The lack of unique identifiers makes it impossible to track cases of individuals in segregation for a total of 14 non-consecutive days during any 21-day period, or to identify individuals with special vulnerabilities.
We find that long placements increase over time both absolutely and as proportion of total placements. However, this may simply reflect fewer transfers of individuals between housing assignments:
smu['long_stay'] = smu['days_calc'] > 14
long_stays_annual = smu.set_index('assigned_dt').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()
print(long_stays_annual)
print()
print(long_stays_annual / smu_total_annual)
assigned_dt
2013-01-01 48.0
2014-01-01 66.0
2015-01-01 63.0
2016-01-01 85.0
2017-01-01 94.0
2018-01-01 105.0
2019-01-01 111.0
2020-01-01 26.0
Freq: AS-JAN, Name: long_stay, dtype: float64
assigned_dt
2013-01-01 0.177778
2014-01-01 0.127660
2015-01-01 0.129897
2016-01-01 0.153707
2017-01-01 0.177694
2018-01-01 0.241935
2019-01-01 0.218504
2020-01-01 0.189781
Freq: AS-JAN, dtype: float64
Top citizenship values:
Table 1: SMU dataset top five countries of citizenship
citizenship | placements |
---|---|
MEXICO | 1782 |
EL SALVADOR | 189 |
HONDURAS | 148 |
GUATEMALA | 133 |
CANADA | 95 |
ALL OTHERS | 1086 |
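Table 1 can be reproduced from the citizenship field with a simple value count plus a residual bucket (a sketch):

counts = smu['citizenship'].value_counts()
print(counts.head(5))
print('ALL OTHERS', counts.iloc[5:].sum())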
A June 24-26, 2014 DHS inspection report for NWDC states, "Documentation reflects there were 776 assignments to segregation in the past year". The DHS inspection report does not specify the source of the records cited.

The SMU dataset covers this period, albeit with only partial records for June-September 2013. The total count of placements recorded in the SMU dataset during this period, 615, is reasonably close to the figure cited by DHS inspectors, which suggests an average of about 65.0 placements per month:
### Monthly total placements during period of DHS inspection report:
dhs_period = smu.set_index('assigned_dt').loc[:'2014-06-30']
g = dhs_period.groupby(pd.Grouper(freq='M'))
print(g['hashid'].nunique())
dhs_period_complete = smu.set_index('assigned_dt').loc['2013-09-01':'2014-06-30']
g = dhs_period_complete.groupby(pd.Grouper(freq='M'))
dhs_period_complete_monthly_avg = g['hashid'].nunique().mean()
assigned_dt
2013-06-30 1
2013-07-31 0
2013-08-31 14
2013-09-30 65
2013-10-31 56
2013-11-30 56
2013-12-31 78
2014-01-31 61
2014-02-28 55
2014-03-31 61
2014-04-30 63
2014-05-31 48
2014-06-30 57
Freq: M, Name: hashid, dtype: int64
This is comparable to the average of 60.0 placements per month reported in the SMU dataset during the period for which complete data exists (September 2013 - June 2014). If the GEOtrack database is the source of the data cited in the 2014 DHS inspection report, this is not noted in the inspection report itself.
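As a quick check on the figures above, the DHS-reported annual total implies roughly 65 placements per month, against the monthly average computed above for the complete portion of the SMU data:

# 776 reported assignments spread over twelve months
print(round(776 / 12, 1))
# Monthly average for September 2013 - June 2014, computed above
print(round(dhs_period_complete_monthly_avg, 1))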
Original file: 15_16_17_18_19_20_RHU_admission_Redacted.xlsx
Log created and maintained by hand by GEO employee to track Restricted Housing Unit placements. Described by US DOJ attorneys for ICE as follows:
"The spreadsheet runs from January 2015 to May 28, 2020 and was created by and for a lieutenant within the facility once he took over the segregation lieutenant duties. The spreadsheet is updated once a detainee departs segregation. The subjects who are included on this list, therefore, are those who were placed into segregation and have already been released from segregation. It does not include those individuals who are currently in segregation."
We refer to this dataset here by the shorthand "RHU" for "Restricted Housing Unit".[1]
The original file has been converted from XLSX to CSV format, with each annual tab saved as a separate CSV. The resulting CSVs have been concatenated and minimally cleaned in a private repository, dropping 75 duplicated records and adding a unique identifier field, hashid; cleaning code available upon request.
The original file includes two fully redacted fields: Name and Alien #; and one partially redacted field, Placement reason. The original file has no un-redacted unique field identifiers or individual identifiers.
csv_opts = {'sep': '|',
'quotechar': '"',
'compression': 'gzip',
'encoding': 'utf-8'}
rhu = pd.read_csv('input/rhu.csv.gz', **csv_opts)
assert len(set(rhu['hashid'])) == len(rhu)
assert sum(rhu['hashid'].isnull()) == 0
data_cols = list(rhu.columns)
data_cols.remove('hashid')
print(rhu.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2457 entries, 0 to 2456
Data columns (total 10 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   name               2457 non-null   object
 1   alien_#            2457 non-null   object
 2   date_in            2457 non-null   object
 3   date_out           2457 non-null   object
 4   total_days         2457 non-null   int64 
 5   initial_placement  2457 non-null   object
 6   placement_reason   2457 non-null   object
 7   disc_seg           2457 non-null   object
 8   release_reason     2457 non-null   object
 9   hashid             2457 non-null   object
dtypes: int64(1), object(9)
memory usage: 192.1+ KB
None
Here we display the first five records in the dataset (excluding the hashid field):
name | alien_# | date_in | date_out | total_days | initial_placement | placement_reason | disc_seg | release_reason |
---|---|---|---|---|---|---|---|---|
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/3/2015 | 1/5/2015 | 20 | pending hearing | refusing staff orders | N | release to gen pop |
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/10/2015 | 1/12/2015 | 294 | pending hearing | assault | N | release from custody |
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/12/2015 | 1/16/2015 | 8 | pending hearing | disrupting facility operations | Y | released to medical |
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/13/2015 | 1/29/2015 | 3 | pending hearing | assault | Y | release to gen pop |
(b)(6),(b)(7)(c ) | (b)(6),(b)(7)(c ) | 1/14/2015 | 2/6/2015 | 4 | security risk | (b)(6), (b)(7)(C), (b)(7)(E) | N | release from custody |
Inspection of the original Excel file shows that the Total days column values are often incorrect, apparently due to a missing or misapplied cell formula. For example, on the "2020" spreadsheet tab, the Total days column values are integers which only occasionally align with placement lengths calculated from the Date in and Date out columns. However, additional rows at the bottom of the sheet, empty in all other fields, contain an Excel formula ("=(D138-C138)+1") which should have been used to calculate these values. Comparing calculated stay lengths with reported Total days suggests that this formula was not updated consistently, causing fields to become misaligned. Additionally, the "2015" spreadsheet tab includes many Total days values equal to "1", suggesting that the formula was applied incorrectly or with missing data.
We can recalculate actual stay lengths based on the formula cited above (inclusive of start days, with stays of less than one day calculated as "1"); or with the formula used for the "SMU" records above (exclusive of start days, with stays of less than one day calculated as "0"), for more consistent comparison with other datasets.
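A worked example using the first record displayed above (placed 1/3/2015, released 1/5/2015) illustrates the difference between the two formulas:

date_in = pd.Timestamp('2015-01-03')
date_out = pd.Timestamp('2015-01-05')
inclusive = (date_out - date_in).days + 1  # Excel formula =(D-C)+1: 3 days
exclusive = (date_out - date_in).days      # SMU-style calculation: 2 days
print(inclusive, exclusive)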
The above issue raises the possibility that other fields in addition to Total days may be misaligned in the original dataset. One fact mitigating this possibility is that no Date out values predate associated Date in values. We can also look more closely at qualitative fields to make an educated guess as to the data quality: for example, do initial_placement values suggesting disciplinary placements align with placement_reason values also consistent with disciplinary placements (see the sketch below)? However, we do not intend to use this dataset for detailed qualitative analysis; of most interest are total segregation placements and segregation stay lengths.
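Such a consistency check might look like the following sketch, cross-tabulating normalized initial_placement values against the disc_seg flag; this is illustrative only, not analysis performed here:

# Do placements flagged as disciplinary cluster under disciplinary-sounding
# initial placement types?
check = pd.crosstab(rhu['initial_placement'].str.strip().str.lower(),
                    rhu['disc_seg'].str.strip().str.upper())
print(check)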
rhu['date_in'] = pd.to_datetime(rhu['date_in'])
rhu['date_out'] = pd.to_datetime(rhu['date_out'])
# As noted above, no `date_out` values predate associated `date_in` values:
assert sum(rhu['date_in'] > rhu['date_out']) == 0
print(rhu['date_in'].describe())
print()
print(rhu['date_out'].describe())
count 2457
unique 1237
top 2019-01-08 00:00:00
freq 12
first 2015-01-03 00:00:00
last 2020-05-28 00:00:00
Name: date_in, dtype: object
count 2457
unique 1105
top 2017-06-13 00:00:00
freq 11
first 2015-01-05 00:00:00
last 2020-05-29 00:00:00
Name: date_out, dtype: object
Here we recalculate the total days field based on the first day inclusive formula in the original Excel spreadsheet ("=(D138-C138)+1"):
rhu['total_days_calc'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D') + 1
compare_pct = sum(rhu['total_days_calc'] == rhu['total_days']) / len(rhu) * 100
print(rhu['total_days'].describe())
print()
print(rhu['total_days_calc'].describe())
count 2457.000000
mean 13.877900
std 37.544802
min 1.000000
25% 2.000000
50% 4.000000
75% 11.000000
max 694.000000
Name: total_days, dtype: float64
count 2457.000000
mean 15.077737
std 38.142819
min 1.000000
25% 3.000000
50% 5.000000
75% 12.000000
max 694.000000
Name: total_days_calc, dtype: float64
Only 7.86% of original total_days values match their respective recalculated stay lengths in total_days_calc.
However, note that the above summary statistics for the original field (total_days) are very similar to those for the recalculated field (total_days_calc), suggesting that most values are present in the dataset but misaligned.
We therefore conclude that it is appropriate to recalculate the total_days field. Instead of the first day inclusive formula suggested in the original dataset, here we will use a first day exclusive formula, where placements starting and ending on the same day have length 0. While this risks underestimating placement lengths represented in the dataset, it is more consistent with the calculation of placement lengths in the SMU and SRMS datasets:
rhu['total_days'] = (rhu['date_out'] - rhu['date_in']) / np.timedelta64(1, 'D')
rhu = rhu.drop('total_days_calc', axis=1)
print(rhu['total_days'].describe())
count 2457.000000
mean 14.077737
std 38.142819
min 0.000000
25% 2.000000
50% 4.000000
75% 11.000000
max 693.000000
Name: total_days, dtype: float64
Annual median and mean placement lengths are relatively consistent, showing an apparent decrease during the first few months of 2020, possibly explained by incomplete placements excluded from this dataset:
g = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])
rhu_annual_med = g['total_days'].median()
rhu_annual_avg = g['total_days'].mean()
print(rhu_annual_med)
print()
print(rhu_annual_avg)
date_in
2015-01-01 4.0
2016-01-01 4.0
2017-01-01 4.0
2018-01-01 4.0
2019-01-01 4.0
2020-01-01 3.0
Freq: AS-JAN, Name: total_days, dtype: float64
date_in
2015-01-01 14.374631
2016-01-01 12.013761
2017-01-01 16.978927
2018-01-01 13.826690
2019-01-01 14.343681
2020-01-01 8.848485
Freq: AS-JAN, Name: total_days, dtype: float64
Total placement counts per calendar year (note data for 2020 is incomplete):
rhu_total_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['hashid'].nunique()
print(rhu_total_annual)
date_in
2015-01-01 339
2016-01-01 436
2017-01-01 522
2018-01-01 577
2019-01-01 451
2020-01-01 132
Freq: AS-JAN, Name: hashid, dtype: int64
Stays over 14 days must be reported to ICE SRMS; here we flag long placements and calculate them as a percentage of total placements per year. The lack of unique identifiers makes it impossible to track cases of individuals in segregation for a total of 14 non-consecutive days during any 21-day period. Inconsistencies and lack of information in placement_reason make it a poor candidate for flagging placements involving individuals with special vulnerabilities. We note an increasing proportion and absolute number of long placements during 2017-2019:
rhu['long_stay'] = rhu['total_days'] > 14
long_stays_annual = rhu.set_index('date_in').groupby([pd.Grouper(freq='AS')])['long_stay'].sum()
print(long_stays_annual)
print()
print(long_stays_annual / rhu_total_annual)
date_in
2015-01-01 52.0
2016-01-01 73.0
2017-01-01 102.0
2018-01-01 129.0
2019-01-01 103.0
2020-01-01 21.0
Freq: AS-JAN, Name: long_stay, dtype: float64
date_in
2015-01-01 0.153392
2016-01-01 0.167431
2017-01-01 0.195402
2018-01-01 0.223570
2019-01-01 0.228381
2020-01-01 0.159091
Freq: AS-JAN, dtype: float64
There are 12 initial_placement values. These closely correspond to the placement_reason values cited in the SRMS datasets (see SRMS 2, National SRMS Comparison appendices). The most common initial_placement values (not correcting for some minor spelling variations) are:
print(rhu['initial_placement'].str.strip().str.lower().value_counts().head(5))
pending hearing 1161
security risk 543
medical 393
protective custody 331
disciplinary 9
Name: initial_placement, dtype: int64
There are 319 placement_reason values, including some redacted fields. Below we print the 10 most common values:
print(rhu['placement_reason'].str.strip().str.lower().value_counts().head(10))
fighting 398
fears general population 209
medical overflow 209
(b)(6), (b)(7)(c), (b)(7)(e) 163
assault 93
assault on detainee 69
medical isolation overflow 67
sharpened instrument 58
threats to staff 43
threat to others 39
Name: placement_reason, dtype: int64
There are 54 release_reason values (not correcting for spelling or other variations). Below we print the 10 most common values:
print(rhu['release_reason'].str.strip().str.lower().value_counts().head(10))
release to population 408
release to gen pop 344
release from custody 313
release to medical 227
transfer facility 172
disc time complete 168
discipline time complete 130
time served 121
not guilty 115
time served - release to gp 80
Name: release_reason, dtype: int64
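The distinct-value counts cited in this section can be confirmed directly (a sketch; the totals are sensitive to whitespace and capitalization, so they may differ slightly under normalization):

for col in ['initial_placement', 'placement_reason', 'release_reason']:
    print(col, rhu[col].nunique())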
The field disc_seg flags disciplinary segregation placements, which require a hearing process, as opposed to administrative segregation placements. The majority of placements are administrative. Average stay lengths for disciplinary and administrative placements are similar, though the median stay length is notably longer for disciplinary placements:
rhu['disc_seg'] = rhu['disc_seg'].str.strip().str.upper()
assert sum(rhu['disc_seg'].isnull()) == 0
print('Proportion:')
print(rhu['disc_seg'].value_counts(normalize=True, dropna=False))
print('\nCount per year:')
print(rhu.set_index('date_in').groupby(pd.Grouper(freq='AS'))['disc_seg'].value_counts())
print('\nStay length by category:')
print(rhu.set_index('date_in').groupby(['disc_seg'])['total_days'].describe())
print('\nAnnual mean stay length by category:')
print(rhu.set_index('date_in').groupby([pd.Grouper(freq='AS'), 'disc_seg'])['total_days'].mean())
Proportion:
N 0.756614
Y 0.243386
Name: disc_seg, dtype: float64
Count per year:
date_in disc_seg
2015-01-01 N 241
Y 98
2016-01-01 N 339
Y 97
2017-01-01 N 411
Y 111
2018-01-01 N 432
Y 145
2019-01-01 N 339
Y 112
2020-01-01 N 97
Y 35
Name: disc_seg, dtype: int64
Stay length by category:
count mean std min 25% 50% 75% max
disc_seg
N 1859.0 13.866595 41.405676 0.0 1.0 3.0 7.0 693.0
Y 598.0 14.734114 25.474417 1.0 6.0 10.0 19.0 412.0
Annual mean stay length by category:
date_in disc_seg
2015-01-01 N 14.514523
Y 14.030612
2016-01-01 N 11.687316
Y 13.154639
2017-01-01 N 17.530414
Y 14.936937
2018-01-01 N 12.780093
Y 16.944828
2019-01-01 N 14.268437
Y 14.571429
2020-01-01 N 7.783505
Y 11.800000
Name: total_days, dtype: float64
Next section: Data Appendix 2. Comparison of GEO Group and ICE SRMS Datasets