Tuesday, November 17, 2020

Making Data Management Decisions

In week three of the Data Management and Visualization course, I have been learning how to manage data and prepare it for the next steps. First, I needed to decide how to manage the variables I am studying. Our professor has given us several tools for this, such as recoding and grouping values. I have used recoding and have also created secondary variables; as I mentioned previously, my examination of respondents' religiosity and their acceptance of others outside their groups involves a large number of questions that must be analyzed to gauge those two values.

Gauging respondents' acceptance of other groups will be particularly tricky, as it involves grouping respondents according to certain responses (e.g., race or political party), then comparing their answers to some questions but not others (e.g., Republicans will not be measured on their perception of Republicans).  I am hopeful we will learn more about this next week.  However, measuring respondents' religiosity was doable with the tools we have been given.
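One possible way to set up that comparison is sketched below. It is only an illustration on made-up data: the column names come from my program, but the new variable OUTPARTY, the toy values, and the choice to leave independents unscored are assumptions, not a method from the course.

```python
import pandas
import numpy

# Toy data (hypothetical values); column names follow the survey variables used in this post
toy = pandas.DataFrame({
    'W1_C1':  [1, 2, 3, 1, 2],        # party: R=1, D=2, I=3
    'W1_D11': [90, 20, 50, 80, 30],   # rating of Republicans (0-100)
    'W1_D12': [10, 85, 55, 25, 95],   # rating of Democrats (0-100)
})

# Republicans are scored on their rating of Democrats, Democrats on their rating
# of Republicans; everyone else is left missing (an assumption for this sketch)
toy['OUTPARTY'] = numpy.where(
    toy['W1_C1'] == 1, toy['W1_D12'],
    numpy.where(toy['W1_C1'] == 2, toy['W1_D11'], numpy.nan))

# Average out-group rating for each party
means = toy.groupby('W1_C1')['OUTPARTY'].mean()
print(means)
```

The same pattern could in principle be repeated for race, using PPETHM and the W1_N1 rating questions.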

Values for religiosity questions
    M1  M1A    M4   M5   M6   QL2   QL3  QL4
0  1.0  2.0  0.25  2.5  2.0  1.00  1.00  1.0
1  1.0  2.0  0.75  2.0  2.0   NaN   NaN  NaN
2  1.0  2.0  0.75  2.5  0.0  1.00  0.75  0.8
3  0.0  0.0  0.50  0.0  NaN  0.25  0.25  0.2
4  1.0  0.0  0.50  1.0  0.0   NaN   NaN  NaN
5  1.0  0.0  0.50  0.5  0.0  0.50  0.75  0.6
6  1.0  0.0  0.50  2.0  0.0  0.75  0.75  0.4
7  1.0  0.0  0.75  0.0  NaN   NaN   NaN  NaN
8  1.0  2.0  0.75  2.0  2.0  0.75  0.75  0.2
9  1.0  2.0  0.50  2.0  2.0  0.75  0.50  0.2

The table above shows the first ten values of the eight variables I used to gauge religiosity, including "what is your religion" (M1), "are you born again" (M1A), and "how often do you attend church" (M5). I selected these three for frequency distributions:

counts for M1 respondent's religion, 1=religious, 0=atheist
1.0    1981
0.0     294
NaN      19
Name: M1, dtype: int64

percentages for M1 respondent's religion, 1=religious, 0=atheist
1.0    0.870769
0.0    0.129231
Name: M1, dtype: float64

Originally, respondents were asked to select their religion from a list; I recoded this data to simply indicate whether they consider themselves religious (1) or not (0). Among respondents with a valid answer, 13% are non-religious (labeled atheist above), which follows expectations for the general population.

counts for M1A is respondent born again?, no=0, yes=2
2.0     833
0.0    1455
NaN       6
Name: M1A, dtype: int64

percentages for M1A is respondent born again?, no=0, yes=2
2.0    0.364073
0.0    0.635927
Name: M1A, dtype: float64

Question M1A originally had three substantive codes: 1=yes, 2=no, and 3=was not asked (presumably because of a negative answer to a previous question). I combined the 2 and 3 responses and gave them a value of 0, then assigned a value of 2 to each 'yes' because I believe that response is a sign of high religiosity and therefore should add more to the final tally. Refusals (code 4) are not part of the recode, so they end up as missing (NaN). Here we can see about 36% of those who answered this question professed to be born again.
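Here is a minimal sketch of how this recode behaves, using a handful of made-up responses rather than the survey data. The names raw and scored are just for illustration; the dictionary is the same one used in my program.

```python
import pandas
import numpy

# Toy responses in the original coding: 1=yes, 2=no, 3=not asked, 4=refused
raw = pandas.Series([1, 2, 3, 4, 1])

recode = {1: 2, 2: 0, 3: 0}   # yes=2 points, no / not asked=0 points
scored = raw.map(recode)      # codes missing from the dictionary (here 4) become NaN

print(scored)
```

Because map() returns NaN for any code that is not a key in the dictionary, leaving a code out is a quick way to treat it as missing.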

counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never
2.5    356
2.0    534
0.0    422
1.0    405
0.5    326
1.5    234
NaN     17
Name: M5, dtype: int64

percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never
2.5    0.156346
2.0    0.234519
0.0    0.185332
1.0    0.177866
0.5    0.143171
1.5    0.102767
Name: M5, dtype: float64

Again, this question had multiple possible answers, ranging from "more than once per week" (1) to "once or twice a month" (3) to "never" (6). I recoded the responses so that each step up in attendance adds 0.5 point: "never" became 0.0 and "more than once per week" became 2.5. Thus, 18.5% of respondents say they never go to church (0), 10% say they go once or twice a month (1.5), and 16% say they go multiple times per week (2.5).
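The half-point scale can also be written as simple arithmetic. The sketch below covers only codes 1 through 6 and checks the formula against a dictionary recode; the variable names are made up for the example.

```python
import pandas

codes = pandas.Series([1, 2, 3, 4, 5, 6])   # 1=more than once per week ... 6=never
recode = {1: 2.5, 2: 2, 3: 1.5, 4: 1, 5: 0.5, 6: 0}

by_dict = codes.map(recode)        # dictionary recode
by_formula = (6 - codes) * 0.5     # same scale: each step up in attendance adds 0.5

print(by_formula.tolist())
print((by_dict == by_formula).all())
```

I find the dictionary version easier to read and to adjust if one category should count for more, which is why my program uses it.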

Next, I created secondary variables RELIG, CNT, and RELIND to develop a religious index of each respondent:

Sum of Religiosity values for each respondent
0    10.75
1     7.75
2     8.80
3     1.20
4     2.50
5     3.85
6     5.40
7     1.75
8     9.45
9     8.95
Name: RELIG, dtype: float64

RELIG is the sum of the variables shown in the first table. In the first ten rows, it ranges from 1.2 to 10.75; higher numbers indicate higher religiosity. I was pleased with these values because they followed the patterns I expected: in particular, those who attend church often and claim to be born again had the highest scores. However, I also realized the scores could be misleading because of missing data (NaN in the first table). The sum simply skips missing answers, so a respondent who answered fewer questions gets a lower total no matter how they answered the rest. To account for this, I made my next secondary variable:
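To see how missing answers affect the total, here is a small sketch that rebuilds respondents 0 and 7 from the first table by hand. The names rows, skipped, and strict are only for this illustration.

```python
import pandas
import numpy

cols = ['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']
rows = pandas.DataFrame([
    [1.0, 2.0, 0.25, 2.5, 2.0, 1.0, 1.0, 1.0],                          # respondent 0
    [1.0, 0.0, 0.75, 0.0, numpy.nan, numpy.nan, numpy.nan, numpy.nan],  # respondent 7
], columns=cols, index=[0, 7])

skipped = rows.sum(axis=1)                 # default: NaN is left out of the sum
strict = rows.sum(axis=1, skipna=False)    # NaN makes the whole row total NaN

print(skipped)
print(strict)
```

With the default behavior, respondent 7 still gets a total, but it is built from only four answers, so it is not directly comparable to a total built from eight.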

Number of religiosity questions with valid answers
0    8
1    5
2    8
3    7
4    5
5    8
6    8
7    4
8    8
9    8
Name: CNT, dtype: int64

CNT is the number of valid answers that can be used for calculation. As you can see, the first ten respondents answered as many as 8 or as few as 4 questions, and this could make a difference in my final values. In fact, it did! Notice the first two values in RELIG (10.75 and 7.75) and CNT (8 and 5), for example.

Religious Index: sum of values / number of questions
0    1.343750
1    1.550000
2    1.100000
3    0.171429
4    0.500000
5    0.481250
6    0.675000
7    0.437500
8    1.181250
9    1.118750
Name: RELIND, dtype: float64

When RELIG is divided by CNT to get RELIND, the religious index, the first and second values flip: now the second is higher than the first! The first ten values of the religious index range from 0.17 (a self-professed atheist) to 1.55 (a born-again, active church member who attends church often but answered only five of the eight questions). The remaining questions in the index have to do with respondents' opinions on moral issues and whether they are active in their church.
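The flip can be reproduced with just the first two respondents from the first table. This is a sketch with the rows typed in by hand; total, answered, and relind are illustrative names standing in for RELIG, CNT, and RELIND.

```python
import pandas
import numpy

cols = ['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']
relig = pandas.DataFrame([
    [1.0, 2.0, 0.25, 2.5, 2.0, 1.0, 1.0, 1.0],                    # respondent 0: 8 answers
    [1.0, 2.0, 0.75, 2.0, 2.0, numpy.nan, numpy.nan, numpy.nan],  # respondent 1: 5 answers
], columns=cols)

total = relig.sum(axis=1)        # like RELIG: NaN is skipped
answered = relig.count(axis=1)   # like CNT: number of non-missing answers
relind = total / answered        # like RELIND: average points per answered question

print(relind)
print(relig.mean(axis=1))        # the row mean also skips NaN, so it gives the same index
```

Since dividing the sum by the count of valid answers is just the row mean, relig.mean(axis=1) would be a shorter way to get the same index.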

I may decide to further manage this data by dividing religious indexes into high, medium, and low, but I'm not sure yet.  These values will be compared based on respondents' answers to outside-group acceptance questions such as "would you date outside your race?" or "rate Democrats (0-100)".
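If I do split the index into groups, pandas.cut is one way to do it. The sketch below uses the first ten RELIND values shown above; the cut points (0.5 and 1.0) and the name relcat are placeholders I made up for the example, not values I have settled on.

```python
import pandas

# First ten RELIND values from the output above
relind = pandas.Series([1.34375, 1.55, 1.10, 0.171429, 0.50,
                        0.48125, 0.675, 0.4375, 1.18125, 1.11875])

# Hypothetical cut points: low = 0 to 0.5, medium = above 0.5 to 1.0, high = above 1.0
bins = [0, 0.5, 1.0, 2.5]
labels = ['low', 'medium', 'high']
relcat = pandas.cut(relind, bins=bins, labels=labels, include_lowest=True)

print(relcat.value_counts(sort=False))
```

Another option would be pandas.qcut, which splits on quantiles so that each group has about the same number of respondents.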

Below is the text of my program:

"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""

import pandas
import numpy

data = pandas.read_csv('oolpds.csv', low_memory=False) # read data file into 'data'; low_memory=False parses each column in one pass so its type is inferred consistently (avoids mixed-type warnings)

print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables (print is needed to see the result when run as a script)
print (data['W1_M1'].dtype)

#setting variables you will be working with to numeric; errors="coerce" turns values that cannot be parsed as numbers into NaN
data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5

# recode missing values to python missing (NaN)
data['W1_C1']=data['W1_C1'].replace(-1, numpy.nan)
data['W1_D11']=data['W1_D11'].replace(998, numpy.nan)
data['W1_D12']=data['W1_D12'].replace(998, numpy.nan)
data['W1_E4']=data['W1_E4'].replace(-1, numpy.nan)
data['W1_E61_C']=data['W1_E61_C'].replace(-1, numpy.nan)
data['W1_E62_C']=data['W1_E62_C'].replace(-1, numpy.nan)
data['W1_E63_C']=data['W1_E63_C'].replace(-1, numpy.nan)
data['W1_E64_C']=data['W1_E64_C'].replace(-1, numpy.nan)
data['W1_M1']=data['W1_M1'].replace(-1, numpy.nan)
data['W1_M1A']=data['W1_M1A'].replace(-1, numpy.nan)
data['W1_M2']=data['W1_M2'].replace(-1, numpy.nan)
data['W1_M4']=data['W1_M4'].replace(-1, numpy.nan)
data['W1_M5']=data['W1_M5'].replace(-1, numpy.nan)
data['W1_M6']=data['W1_M6'].replace(-1, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(998, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(998, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(998, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(998, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(998, numpy.nan)
data['W1_O1']=data['W1_O1'].replace(-1, numpy.nan)
data['W1_O3']=data['W1_O3'].replace(-1, numpy.nan)
data['W1_QA5D']=data['W1_QA5D'].replace(-1, numpy.nan)
data['W2_QL2B']=data['W2_QL2B'].replace(-1, numpy.nan)
data['W2_QL3']=data['W2_QL3'].replace(-1, numpy.nan)
data['W2_QL4']=data['W2_QL4'].replace(-1, numpy.nan)


#create new variables that can be added together for religiosity from current variables
recode1 = {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1, 13:0}
data['M1'] = data['W1_M1'].map(recode1)
recode2 = {1:2, 2:0, 3:0}                  #yes=2pts, no=0
data['M1A'] = data['W1_M1A'].map(recode2)
data['M4'] = data['W1_M4'] / 4 #0.25-1pt in 0.25 increments
recode3 = {1:2.5, 2:2, 3:1.5, 4:1, 5:0.5, 6:0, 7:0}
data['M5'] = data['W1_M5'].map(recode3) #0-2.5pts based on church attendance
data['M6'] = (2 - data['W1_M6']) * 2 #yes=2pts, no=0
data['QL2'] = data['W2_QL2B'] / 4 #0.25-1pt in 0.25 increments
data['QL3'] = (5 - data['W2_QL3']) / 4 #0.25-1pt in 0.25 increments, reversed so agree scores highest
data['QL4'] = data['W2_QL4'] / 5 #0.2-1pt in 0.2 increments


#create data subset for religiosity calculation
relig = data[['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']]
a = relig.head (n=10)
print ("Values for religiosity questions")
print(a)

#create new secondary variable, RELIG, the sum of the religiosity values
data['RELIG']= relig.sum(axis=1)   #axis=1 sums across the columns of each row; NaN values are skipped
b = data['RELIG'].head (n=10)
print ()
print ("Sum of Religiosity values for each respondent")
print (b)

data['CNT'] = relig.count(axis=1)
d = data['CNT'].head (n=10)
print ()
print ("Number of religiosity questions with valid answers")
print (d)

data['RELIND'] = relig.sum(axis=1) / relig.count(axis=1)  #religious index=sum of religious variables/num of questions answered
c = data['RELIND'].head (n=10)
print ()
print ("Religious Index: sum of values / number of questions")
print (c)


#recode 'would you date outside your race?' to 1=Y, 0=N
recode4 = {1:1, 2:0}
data['E4']=data['W1_E4'].map(recode4)


#counts and percentages (i.e. frequency distributions) for selected variables
print ("counts for M1 respondent's religion, 1=religious, 0=atheist")
c1 = data['M1'].value_counts(sort=False, dropna=False)
print (c1)

print ("percentages for M1 respondent's religion, 1=religious, 0=atheist")
p1 = data['M1'].value_counts(sort=False, normalize=True)
print (p1)

print ('counts for M1A is respondent born again?, no=0, yes=2')
c2 = data['M1A'].value_counts(sort=False, dropna=False)
print (c2)

print ('percentages for M1A is respondent born again?, no=0, yes=2')
p2 = data['M1A'].value_counts(sort=False, normalize=True)
print (p2)

print ('counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
c3 = data['M5'].value_counts(sort=False, dropna=False)
print (c3)

print ('percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
p3 = data['M5'].value_counts(sort=False, normalize=True)
print (p3)

