My Analysis of 2012 Outlook on Life Survey: Making Data Management Decisions

In week three of the Data Management and Visualization course, I have been learning how to manage data to prepare it for the next steps of analysis. First, I needed to decide how to manage the variables I am studying, and our professor has given us several tools for this, such as recoding and grouping values. I have used recoding and have also created secondary variables. As I mentioned previously, my examination of respondents' religiosity and their acceptance of people outside their own groups involves a large number of questions that must be combined to gauge those two values.

Gauging respondents' acceptance of other groups will be particularly tricky, as it involves grouping respondents according to certain responses (e.g., race or political party), then comparing their answers to some questions but not others (e.g., Republicans will not be measured on their perception of Republicans). I am hopeful we will learn more about this next week. However, measuring respondents' religiosity was doable with the tools we have been given.
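One possible way to handle the "some questions but not others" problem is to blank out each respondent's rating of their own group before averaging. The sketch below uses a tiny made-up table. The column names `party`, `rate_rep`, and `rate_dem` and the mapping `own_group_column` are illustrative assumptions, not the survey's variable names or a method from the course.

```python
import numpy as np
import pandas as pd

# Made-up respondents: party (1=Republican, 2=Democrat) and 0-100 ratings.
toy = pd.DataFrame({
    "party":    [1, 2, 1, 2],
    "rate_rep": [90, 20, 80, 40],
    "rate_dem": [30, 85, 50, 95],
})

# Hypothetical mapping: which rating column describes the respondent's own group.
own_group_column = {1: "rate_rep", 2: "rate_dem"}

ratings = toy[["rate_rep", "rate_dem"]].astype(float)
for party_code, column in own_group_column.items():
    # Blank out the rating a respondent gave to their own party.
    ratings.loc[toy["party"] == party_code, column] = np.nan

# Mean of the remaining (out-group) ratings; NaN values are skipped.
toy["outgroup_rating"] = ratings.mean(axis=1)
print(toy)
```

With more than two groups the same masking step would leave several out-group ratings per respondent, and the row mean would summarize them.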

Values for religiosity questions
    M1  M1A    M4   M5   M6   QL2   QL3  QL4
0  1.0  2.0  0.25  2.5  2.0  1.00  1.00  1.0
1  1.0  2.0  0.75  2.0  2.0   NaN   NaN  NaN
2  1.0  2.0  0.75  2.5  0.0  1.00  0.75  0.8
3  0.0  0.0  0.50  0.0  NaN  0.25  0.25  0.2
4  1.0  0.0  0.50  1.0  0.0   NaN   NaN  NaN
5  1.0  0.0  0.50  0.5  0.0  0.50  0.75  0.6
6  1.0  0.0  0.50  2.0  0.0  0.75  0.75  0.4
7  1.0  0.0  0.75  0.0  NaN   NaN   NaN  NaN
8  1.0  2.0  0.75  2.0  2.0  0.75  0.75  0.2
9  1.0  2.0  0.50  2.0  2.0  0.75  0.50  0.2

The table above shows the first ten rows of the eight variables I used to gauge religiosity. Three of them, "what is your religion" (M1), "are you born again" (M1A), and "how often do you attend church" (M5), are the ones I selected for frequency distributions:

counts for M1 respondent's religion, 1=religious, 0=atheist
1.0 1981
0.0 294
NaN 19
Name: M1, dtype: int64

percentages for M1 respondent's religion, 1=religious, 0=atheist
1.0 0.870769
0.0 0.129231
Name: M1, dtype: float64

Originally, respondents were asked to select their religion from a list; I recoded this data to simply indicate whether they consider themselves religious (1) or not (0). The percentage of non-religious respondents (labeled "atheist" in the output) is 13% of valid answers, which is in line with what I expected for the general population.
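As a small illustration of this kind of recode, the sketch below collapses a list of religion codes into a religious/not-religious flag and then computes counts and proportions. The input values are made up; only the coding scheme (1 to 12 for specific religions, 13 for "none") follows the program at the end of this post.

```python
import numpy as np
import pandas as pd

# Made-up religion codes: 1-12 are specific religions, 13 is "none", NaN is missing.
religion = pd.Series([1, 13, 5, np.nan, 12, 13])

# Every listed religion -> 1, "none" -> 0.
recode = {code: 1 for code in range(1, 13)}
recode[13] = 0

religious = religion.map(recode)
print(religious.value_counts(dropna=False))    # counts, including missing
print(religious.value_counts(normalize=True))  # proportions among valid answers
```

Note that `normalize=True` divides by the number of valid answers only, which is why the percentages in the output above do not include the NaN rows.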

counts for M1A is respondent born again?, no=0, yes=2
2.0 833
0.0 1455
NaN 6
Name: M1A, dtype: int64

percentages for M1A is respondent born again?, no=0, yes=2
2.0 0.364073
0.0 0.635927
Name: M1A, dtype: float64

Question M1A originally included three possibilities: 1=yes, 2=no, and 3=was not asked (presumably due to a negative previous question). I collapsed the 2 and 3 responses into a value of 0, then assigned a value of 2 to each 'yes' because I believe that response is a sign of high religiosity and therefore should add more to the final tally. Here we can see that about 36% of respondents with a valid value (including those who were not asked the question) professed to be born again.
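One detail of dictionary recoding worth seeing in isolation: any code that is not a key in the dictionary becomes NaN. The sketch below uses made-up answers with the coding noted in the program's comments (1=yes, 2=no, 3=not asked, 4=refused) and a weighting dictionary like the one used for M1A.

```python
import pandas as pd

# Made-up answers: 1=yes, 2=no, 3=not asked, 4=refused.
born_again = pd.Series([1, 2, 3, 4, 1])

# "yes" earns 2 points, "no" and "not asked" earn 0; code 4 is deliberately left out.
weights = {1: 2, 2: 0, 3: 0}

points = born_again.map(weights)
print(points)  # the refused answer (code 4) comes out as NaN
```

So a refusal is treated as missing rather than as a "no", which keeps it out of both the sum and the count of valid answers.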

counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never
2.5 356
2.0 534
0.0 422
1.0 405
0.5 326
1.5 234
NaN 17
Name: M5, dtype: int64

percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never
2.5 0.156346
2.0 0.234519
0.0 0.185332
1.0 0.177866
0.5 0.143171
1.5 0.102767
Name: M5, dtype: float64

Again, this question had multiple possible answers, ranging from "more than once per week" (1) through "once or twice a month" (3) to "never" (6). I recoded the responses so that each step up in attendance adds 0.5 points: "never" became 0.0 and "more than once per week" became 2.5. Among respondents with a valid answer, 18.5% say they never go to church (0.0), 10% say they go once or twice a month (1.5), and 16% say they go more than once per week (2.5).
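The attendance recode is a reversed scale with half-point steps, so for codes 1 through 6 it can be written either as a dictionary or as arithmetic. The sketch below checks that the two agree. It is an illustration only: the dictionary in my program also maps a code 7 to 0, which the formula would not reproduce.

```python
import pandas as pd

# Attendance codes: 1 = more than once a week ... 6 = never.
codes = pd.Series([1, 2, 3, 4, 5, 6])

# Dictionary recode for codes 1-6.
by_dict = codes.map({1: 2.5, 2: 2.0, 3: 1.5, 4: 1.0, 5: 0.5, 6: 0.0})

# Equivalent arithmetic: reverse the scale, then give half a point per step.
by_formula = (6 - codes) * 0.5

print(pd.DataFrame({"code": codes, "dict": by_dict, "formula": by_formula}))
```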

Next, I created secondary variables RELIG, CNT, and RELIND to develop a religious index of each respondent:

Sum of Religiosity values for each respondent
0 10.75
1 7.75
2 8.80
3 1.20
4 2.50
5 3.85
6 5.40
7 1.75
8 9.45
9 8.95
Name: RELIG, dtype: float64

RELIG is the sum of the variables shown in the first table. In the first ten rows, it ranges from 1.2 to 10.75; higher numbers indicate higher measured religiosity. I was pleased with these values because they followed the patterns I expected: in particular, those who attend church often and claim to be born again had the highest scores. However, I also realized the scores could be misleading because of missing data (NaN in the first table). Missing values are skipped when the row is summed, so a respondent who left questions unanswered ends up with a lower total. To account for this, I made my next secondary variable:

Number of religiosity questions with valid answers
0 8
1 5
2 8
3 7
4 5
5 8
6 8
7 4
8 8
9 8
Name: CNT, dtype: int64

CNT is the number of valid (non-missing) answers available for the calculation. As you can see, the first ten respondents answered as many as 8 or as few as 4 questions, and this could make a difference in my final values. In fact, it did! Notice the first two values in RELIG (10.75 and 7.75) and CNT (8 and 5), for example.
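To make the effect of missing answers concrete, here is a minimal sketch that rebuilds rows 0 and 1 of the first table and computes their sum and their count of valid answers. Only the two rows are copied from the table; everything else is a stand-alone illustration.

```python
import numpy as np
import pandas as pd

columns = ["M1", "M1A", "M4", "M5", "M6", "QL2", "QL3", "QL4"]
# Rows 0 and 1 from the table of religiosity values.
relig = pd.DataFrame(
    [[1.0, 2.0, 0.25, 2.5, 2.0, 1.0, 1.0, 1.0],
     [1.0, 2.0, 0.75, 2.0, 2.0, np.nan, np.nan, np.nan]],
    columns=columns,
)

total = relig.sum(axis=1)       # NaN is skipped by default (skipna=True)
answered = relig.count(axis=1)  # number of non-missing answers per row

print(total.tolist())     # [10.75, 7.75]
print(answered.tolist())  # [8, 5]
```

The second respondent's total is lower mainly because three questions contribute nothing to it, not because the answers given were less religious.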

Religious Index: sum of values / number of questions
0 1.343750
1 1.550000
2 1.100000
3 0.171429
4 0.500000
5 0.481250
6 0.675000
7 0.437500
8 1.181250
9 1.118750
Name: RELIND, dtype: float64

When RELIG is divided by CNT to get RELIND, the religious index, the order of the first and second values flips: the second is now higher than the first. The first ten values of the religious index range from 0.17 (a self-professed atheist) to 1.55 (a born-again, active church member who scored 2.0 on attendance, the second-highest category). The remaining questions in the index concern respondents' opinions on moral issues and whether they are active in their church.
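Dividing the row sum by the count of valid answers is the same as taking the row mean with missing values skipped. The sketch below checks this on rows 0 and 1 of the first table and shows the flip in order; it is an illustration, not a replacement for the program at the end of this post.

```python
import numpy as np
import pandas as pd

# Rows 0 and 1 from the table of religiosity values.
relig = pd.DataFrame(
    [[1.0, 2.0, 0.25, 2.5, 2.0, 1.0, 1.0, 1.0],
     [1.0, 2.0, 0.75, 2.0, 2.0, np.nan, np.nan, np.nan]],
    columns=["M1", "M1A", "M4", "M5", "M6", "QL2", "QL3", "QL4"],
)

index_by_division = relig.sum(axis=1) / relig.count(axis=1)
index_by_mean = relig.mean(axis=1)  # skips NaN, so it is the same quantity

print(index_by_division.tolist())
print(np.allclose(index_by_division, index_by_mean))
```

Row 1 has the smaller sum (7.75 versus 10.75) but the larger index (7.75 / 5 = 1.55 versus 10.75 / 8 = 1.34375).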

I may decide to further manage this data by dividing the religious index into high, medium, and low groups, but I'm not sure yet. These values will then be compared against respondents' answers to outside-group acceptance questions such as "would you date outside your race?" or "rate Democrats (0-100)".
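If I do split the index into groups, `pandas.cut` is one way to do it. The sketch below bins the first ten RELIND values shown above. The cut points 0.5 and 1.0 are hypothetical placeholders chosen only for illustration; I have not decided on any thresholds.

```python
import pandas as pd

# The first ten RELIND values printed above.
relind = pd.Series([1.343750, 1.550000, 1.100000, 0.171429, 0.500000,
                    0.481250, 0.675000, 0.437500, 1.181250, 1.118750])

# Hypothetical cut points; none have been chosen for the real analysis.
bins = [0.0, 0.5, 1.0, float("inf")]
labels = ["low", "medium", "high"]

# Intervals are closed on the right: (0, 0.5], (0.5, 1.0], (1.0, inf).
# include_lowest=True also places an index of exactly 0 in "low".
level = pd.cut(relind, bins=bins, labels=labels, include_lowest=True)
print(level.value_counts())
```

An alternative would be `pandas.qcut`, which splits by quantiles so that the groups have roughly equal size instead of fixed boundaries.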

Below is the text of my program:

"""

Analysis of Outlook on Life 2012 dataset using Python

created Tuesday, November 10, 2020

@author Joel Caren

"""

import pandas

import numpy

data = pandas.read_csv('oolpds.csv', low_memory=False) # read data file into 'data'; low_memory=False parses the file in one pass so columns are not given mixed types

print (len(data)) #number of observations (rows)

print (len(data.columns)) #number of variables (columns)

#check format of variables

print (data['W1_M1'].dtype) #print the dtype so it is visible when run as a script

#set the variables you will be working with to numeric; errors="coerce" turns values that cannot be parsed as numbers into NaN

data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)

data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)

data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)

data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1

data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree

data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree

data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree

data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree

data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13

data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4

data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6

data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree

data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6

data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1

data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)

data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)

data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)

data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)

data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)

data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes

data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes

data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree

data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree

data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree

data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5

data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5

# recode missing values to python missing (NaN)

minus_one_missing = ['W1_C1', 'W1_E4', 'W1_E61_C', 'W1_E62_C', 'W1_E63_C', 'W1_E64_C', 'W1_M1', 'W1_M1A', 'W1_M2', 'W1_M4', 'W1_M5', 'W1_M6', 'W1_O1', 'W1_O3', 'W1_QA5D', 'W2_QL2B', 'W2_QL3', 'W2_QL4'] #questions where -1 marks a missing answer

rating_missing = ['W1_D11', 'W1_D12', 'W1_N1B', 'W1_N1C', 'W1_N1D', 'W1_N1E', 'W1_N1F'] #0-100 ratings where 998 marks a missing answer

data[minus_one_missing] = data[minus_one_missing].replace(-1, numpy.nan)

data[rating_missing] = data[rating_missing].replace(998, numpy.nan)

#create new variables that can be added together for religiosity from current variables

recode1 = {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1, 13:0}

data['M1'] = data['W1_M1'].map(recode1)

recode2 = {1:2, 2:0, 3:0} #yes=2pts, no=0

data['M1A'] = data['W1_M1A'].map(recode2)

data['M4'] = data['W1_M4'] / 4 #0.25-1pt in 0.25 increments (1=agree gives 0.25, 4=disagree gives 1)

recode3 = {1:2.5, 2:2, 3:1.5, 4:1, 5:0.5, 6:0, 7:0}

data['M5'] = data['W1_M5'].map(recode3) #0-2.5pts based on church attendance

data['M6'] = (2 - data['W1_M6']) * 2 #yes=2pts, no=0

data['QL2'] = data['W2_QL2B'] / 4 #0.25-1pt in 0.25 increments

data['QL3'] = (5 - data['W2_QL3']) / 4 #0.25-1pt in 0.25 increments, reversed so 1=agree gives 1pt

data['QL4'] = data['W2_QL4'] / 5 #0.2-1pt in 0.2 increments

#create data subset for religiosity calculation

relig = data[['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']]

a = relig.head (n=10)

print ("Values for religiosity questions")

print(a)

#create new secondary variable, RELIG, the sum of religiosity points

data['RELIG']= relig.sum(axis=1) #axis=1 sums across the columns of each row; NaN values are skipped

b = data['RELIG'].head (n=10)

print ()

print ("Sum of Religiosity values for each respondent")

print (b)

data['CNT'] = relig.count(axis=1)

d = data['CNT'].head (n=10)

print ()

print ("Number of religiosity questions with valid answers")

print (d)

data['RELIND'] = relig.sum(axis=1) / relig.count(axis=1) #religious index=sum of religious variables/num of questions answered

c = data['RELIND'].head (n=10)

print ()

print ("Religious Index: sum of values / number of questions")

print (c)

#recode 'would you date outside your race?' to 1=Y, 0=N

recode4 = {1:1, 2:0}

data['E4']=data['W1_E4'].map(recode4)

#counts and percentages (i.e. frequency distributions) for selected variables

print ("counts for M1 respondent's religion, 1=religious, 0=atheist")

c1 = data['M1'].value_counts(sort=False, dropna=False)

print (c1)

print ("percentages for M1 respondent's religion, 1=religious, 0=atheist")

p1 = data['M1'].value_counts(sort=False, normalize=True)

print (p1)

print ('counts for M1A is respondent born again?, no=0, yes=2')

c2 = data['M1A'].value_counts(sort=False, dropna=False)

print (c2)

print ('percentages for M1A is respondent born again?, no=0, yes=2')

p2 = data['M1A'].value_counts(sort=False, normalize=True)

print (p2)

print ('counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')

c3 = data['M5'].value_counts(sort=False, dropna=False)

print (c3)

print ('percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')

p3 = data['M5'].value_counts(sort=False, normalize=True)

print (p3)
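The three frequency distributions at the end of the program repeat the same two calls, so they could be wrapped in a small helper. The function name `frequency_table` and the example values below are assumptions made for illustration; this is a sketch of one possible tidy-up, not part of the submitted program.

```python
import numpy as np
import pandas as pd

def frequency_table(series):
    """Return counts (missing included) and proportions of valid answers."""
    counts = series.value_counts(sort=False, dropna=False)
    # normalize=True divides by the number of non-missing answers.
    proportions = series.value_counts(sort=False, normalize=True)
    return counts, proportions

# Made-up recoded answers, not survey data.
example = pd.Series([1.0, 1.0, 0.0, np.nan, 1.0], name="M1")
counts, proportions = frequency_table(example)
print("counts for", example.name)
print(counts)
print("percentages for", example.name)
print(proportions)
```

Calling the helper once per variable (M1, M1A, M5) would replace the six separate `value_counts` lines while printing the same tables.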

Tuesday, November 17, 2020
