Tuesday, November 24, 2020

Visualizing the data

 In week four, we finally get to see the results of our efforts over the last few weeks: the visualization of the interaction between our chosen variables.  In review:  I chose to look at the effect of religiosity on racial acceptance.  Thus, religiosity is my independent variable, and racial acceptance is my dependent variable.  These are both quantitative variables, so I made a histogram for each:


Here we see that religiosity ranged from 0.0 to 2.5 (this is simply an index; the numbers themselves are only meaningful in their own context), with a single mode at about 0.5, though the mean is 0.75.  There is an obvious right skew (the long tail is on the right).
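For readers who want a numerical check, skewness can be computed directly with pandas.  This is just a sketch with made-up values, not the survey data; a positive result confirms a right skew:

```python
import pandas

# Illustrative values only -- a small right-skewed sample shaped like RELIND
relind = pandas.Series([0.2, 0.3, 0.5, 0.5, 0.6, 0.75, 1.2, 2.0, 2.5])

# .skew() returns the sample skewness; > 0 means the long tail is on the right
print(relind.skew())
```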


Racial acceptance is also skewed, but to the left, with a mode of about 0.8 and a mean of 0.69 in a range of 0.0 to 1.0.

My next step is to make a scatterplot of the two variables.


My hypothesis was that religiosity and racial acceptance would be positively correlated, but this plot clearly shows no correlation between the two.  In fact, the data points cluster in the area indicating relatively low religiosity and high acceptance.  To better examine my original hypothesis, I would need to sample more high-religiosity individuals.
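The visual impression can be backed up with a correlation coefficient.  This sketch uses a small made-up frame in place of the real dataset; with the actual data, the same `.corr()` call on the RELIND and ACCIND columns built in the code below would give Pearson's r directly (values near 0 mean no linear association):

```python
import pandas

# Toy stand-in for the survey data -- illustrative values only
data = pandas.DataFrame({
    'RELIND': [1.34, 1.55, 1.10, 0.17, 0.50, 0.48, 0.68],
    'ACCIND': [0.60, 0.72, 0.55, 0.80, 0.65, 0.70, 0.62],
})

# .corr() computes the pairwise-complete Pearson correlation, ignoring NaNs
r = data['RELIND'].corr(data['ACCIND'])
print(r)
```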

One caveat:  the 2012 Outlook on Life survey (which I used for this project) included individuals from various ethnicities/races, but the proportions do not reflect the general population: about half of the respondents were black, 40% white, and 10% other.  Therefore, I would hesitate to apply these conclusions to the general population.
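That imbalance is easy to verify with a normalized frequency table.  A sketch, assuming the PPETHM coding from the codebook (w=1, b=2, other=3, L=4, mix=5); the toy values here just mimic the roughly half-black, 40%-white split:

```python
import pandas

# Toy stand-in: 5 black (2), 4 white (1), 1 other (3)
data = pandas.DataFrame({'PPETHM': [2, 2, 1, 2, 1, 3, 2, 1, 2, 1]})

# normalize=True turns raw counts into proportions of the sample
shares = data['PPETHM'].value_counts(normalize=True).sort_index()
print(shares)
```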

My code is below in case you are interested.

"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('oolpds.csv',low_memory = False) # read data file into 'data' using pandas; low_memory=False avoids mixed-dtype warnings
print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables
data['W1_M1'].dtype
#setting variables you will be working with to numeric; errors="coerce" turns unparseable entries into NaN
#data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5
# recode missing values to python missing (NaN)
#data['W1_C1']=data['W1_C1'].replace(-1, numpy.nan)
data['W1_D11']=data['W1_D11'].replace(998, numpy.nan)
data['W1_D12']=data['W1_D12'].replace(998, numpy.nan)
data['W1_E4']=data['W1_E4'].replace(-1, numpy.nan)
data['W1_E61_C']=data['W1_E61_C'].replace(-1, numpy.nan)
data['W1_E62_C']=data['W1_E62_C'].replace(-1, numpy.nan)
data['W1_E63_C']=data['W1_E63_C'].replace(-1, numpy.nan)
data['W1_E64_C']=data['W1_E64_C'].replace(-1, numpy.nan)
data['W1_M1']=data['W1_M1'].replace(-1, numpy.nan)
data['W1_M1A']=data['W1_M1A'].replace(-1, numpy.nan)
data['W1_M2']=data['W1_M2'].replace(-1, numpy.nan)
data['W1_M4']=data['W1_M4'].replace(-1, numpy.nan)
data['W1_M5']=data['W1_M5'].replace(-1, numpy.nan)
data['W1_M6']=data['W1_M6'].replace(-1, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(998, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(998, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(998, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(998, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(998, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(-1, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(-1, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(-1, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(-1, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(-1, numpy.nan)
data['W1_O1']=data['W1_O1'].replace(-1, numpy.nan)
data['W1_O3']=data['W1_O3'].replace(-1, numpy.nan)
data['W1_QA5D']=data['W1_QA5D'].replace(-1, numpy.nan)
data['W2_QL2B']=data['W2_QL2B'].replace(-1, numpy.nan)
data['W2_QL3']=data['W2_QL3'].replace(-1, numpy.nan)
data['W2_QL4']=data['W2_QL4'].replace(-1, numpy.nan)

#create new variables that can be added together for religiosity from current variables
recode1 = {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1, 13:0}
data['M1'] = data['W1_M1'].map(recode1)
recode2 = {1:2, 2:0, 3:0}                  #yes=2pts, no=0
data['M1A'] = data['W1_M1A'].map(recode2)
data['M4'] = data['W1_M4'] / 4 #0-1pt in 0.25 increments
recode3 = {1:2.5, 2:2, 3:1.5, 4:1, 5:0.5, 6:0, 7:0}
data['M5'] = data['W1_M5'].map(recode3) #0-2.5pts based on church attendance
data['M6'] = (2 - data['W1_M6']) * 2 #yes=2pts, no=0
data['QL2'] = data['W2_QL2B'] / 4 #0-1pt in 0.25 increments
data['QL3'] = (5 - data['W2_QL3']) / 4 #0-1pt in 0.25 increments
data['QL4'] = data['W2_QL4'] / 5 #0-1pt in 0.2 increments
#create data subset for religiosity calculation
relig = data[['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']]
a = relig.head (n=10)
print ("Values for religiosity questions")
print(a)
#create new secondary variable, RELIG, the index of religiosity
data['RELIG']= relig.sum(axis=1)   #axis=1 sums across columns, giving one total per row
b = data['RELIG'].head (n=10)
print ()
print ("Sum of Religiosity values for each respondent")
print (b)
data['CNT'] = relig.count(axis=1)
d = data['CNT'].head (n=10)
print ()
print ("Number of religiosity questions with valid answers")
print (d)
data['RELIND'] = relig.sum(axis=1) / relig.count(axis=1)  #religious index=sum of religious variables/num of questions answered
c = data['RELIND'].head (n=10)
print ()
print ("Religious Index: sum of values / number of questions")
print (c)
#Manage acceptance variables
recode7 = {1:2, 2:1, 3:5, 4:3, 5:6}
data['ETH'] = data['PPETHM'].map(recode7)  #remap ethnicity to match M2 church race
recode4 = {1:1, 2:0}
data['E4']=data['W1_E4'].map(recode4) #recode 'would you date outside your race?' to 1=Y, 0=N
recode5 = {1:0, 2:0.33, 3:0.66, 4:1}
data['E61'] = data['W1_E61_C'].map(recode5)  #recode scale to 0--1 from 1--4
data['E62'] = data['W1_E62_C'].map(recode5)  #concerned about biracial kids?
data['E63'] = data['W1_E63_C'].map(recode5)
data['E64'] = data['W1_E64_C'].map(recode5)
data['QA5D'] =data['W1_QA5D'].map(recode5)
data['N1B'] = data['W1_N1B'] / 100    #change 0--100 to 0.0--1.0 scale
data['N1C'] = data['W1_N1C'] / 100    #rate races
data['N1D'] = data['W1_N1D'] / 100    
data['N1E'] = data['W1_N1E'] / 100    
data['N1F'] = data['W1_N1F'] / 100    
recode6 = {1:0, 2:0.25, 3:0.5, 4:0.75, 5:1.0}
data['O1'] = data['W1_O1'].map(recode6)  #recode scale to 0--1 from 1--5
data['O3'] = data['W1_O3'].map(recode6)
#subdividing based on ethnicity
blk = data[(data['ETH'] == 1)]  #black sub
wht = data[(data['ETH'] == 2)]  #white sub
ltn = data[(data['ETH'] == 3)]  #latino sub
oth = data[(data['ETH'] == 5)]  #other sub
mix = data[(data['ETH'] == 6)]  #mixed sub
def CHURAC (row):   #1pt if church race != own race; returns None (NaN) if either value is missing
    if row['W1_M2'] == row['ETH']:
        return 0
    if row['W1_M2'] < row['ETH'] or row['W1_M2'] > row['ETH']:
        return 1
data['M2'] = data.apply (CHURAC, axis=1)
#group the acceptance questions for each race (omit rating of own race)
baccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1D', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for blacks
waccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for whites
laccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1D', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for latinos
oaccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2',        'N1C', 'N1D', 'N1E', 'O1', 'O3', 'QA5D']]  #the questions for other
maccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1D', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for mixed
#add the columns for each race, but do it for all respondents
data['BACSUM'] = baccept.sum (axis=1)
data['WACSUM'] = waccept.sum (axis=1)
data['LACSUM'] = laccept.sum (axis=1)
data['OACSUM'] = oaccept.sum (axis=1)
data['MACSUM'] = maccept.sum (axis=1)
def ACCEP (row):  #select the sum (from above) based on respondent's race
    a = 0.0
    if row['ETH'] == 1:  #black
        a = row['BACSUM']
    if row['ETH'] == 2:  #white
        a = row['WACSUM']
    if row['ETH'] == 3:  #latino
        a = row['LACSUM']
    if row['ETH'] == 5:  #other
        a = row['OACSUM']
    if row['ETH'] == 6:  #mix
        a = row['MACSUM']
    return a
data['ACCEPT'] = data.apply (lambda row: ACCEP (row),axis=1)
a1 = data['ACCEPT'].head (n=10)
print ('acceptance')
print (a1)
#count number of valid acceptance answers for all respondents based on race
data['BACCNT'] = baccept.count(axis=1)
data['WACCNT'] = waccept.count(axis=1)
data['LACCNT'] = laccept.count(axis=1)
data['OACCNT'] = oaccept.count(axis=1)
data['MACCNT'] = maccept.count(axis=1)
def ACCNT (row):  #select the count (from above) based on respondent's race
    a = 0.0
    if row['ETH'] == 1:  #black
        a = row['BACCNT']
    if row['ETH'] == 2:  #white
        a = row['WACCNT']
    if row['ETH'] == 3:  #latino
        a = row['LACCNT']
    if row['ETH'] == 5:  #other
        a = row['OACCNT']
    if row['ETH'] == 6:  #mix
        a = row['MACCNT']
    return a
data['ACCNT'] = data.apply (lambda row: ACCNT (row),axis=1)
a2 = data['ACCNT'].head (n=10)
print ('# of valid answers to acceptance questions')
print (a2)
#calculate acceptance index: acceptance sum / valid question count
data['ACCIND'] = data['ACCEPT'] / data['ACCNT']
a3 = data['ACCIND'].head (n=10)
print ('Acceptance index')
print (a3)
descr = data['RELIND'].describe()
print (descr)
desca = data['ACCIND'].describe()
print (desca)
#Univariate histogram for religiosity:
hist1 = seaborn.histplot(data["RELIND"].dropna(), kde=False);
plt.xlabel('Index of Religiosity')
plt.title('Index of Religiosity')
print (hist1)
#Univariate histogram for acceptance:
hist2 = seaborn.histplot(data["ACCIND"].dropna(), kde=False);
plt.xlabel('Index of Racial Acceptance')
plt.title('Index of Racial Acceptance')
print (hist2)

#basic scatterplot:  Q->Q
scat1 = seaborn.regplot(x="RELIND", y="ACCIND", fit_reg=False, data=data)
plt.xlabel('Religiosity')
plt.ylabel('Acceptance')
plt.title('Scatterplot for the Association Between Religiosity and Acceptance')

'''
data['BACCEP']= baccept.sum(axis=1)   #axis=1 means the horizontal axis
b = data['BACCEP'].head (n=10)
print ()
print ("Sum of acceptance values for blacks")
print (b)
data['WACCEP']= waccept.sum(axis=1)   #axis=1 means the horizontal axis
b = data['WACCEP'].head (n=10)
print ()
print ("Sum of acceptance values for whites")
print (b)

print ('number of blacks')
a1 = blk['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of whites')
a1 = wht['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of latinos')
a1 = ltn['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of other')
a1 = oth['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of mix')
a1 = mix['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('average religiosity of blacks')
b2 = blk['RELIND'].mean()
print (b2)
print ('average religiosity of whites')
b2 = wht['RELIND'].mean()
print (b2)
print ('average religiosity of latinos')
b2 = ltn['RELIND'].mean()
print (b2)
print ('average religiosity of others')
b2 = oth['RELIND'].mean()
print (b2)
print ('average religiosity of mixed')
b2 = mix['RELIND'].mean()
print (b2)

#counts and percentages (i.e. frequency distributions) for each variable
print ("counts for M1 respondent's religion, 1=religious, 0=atheist")
c1 = data['M1'].value_counts(sort=False, dropna=False)
print (c1)
print ("percentages for M1 respondent's religion, 1=religious, 0=atheist")
p1 = data['M1'].value_counts(sort=False, normalize=True)
print (p1)
print ('counts for M1A is respondent born again?, no=0, yes=2')
c2 = data['M1A'].value_counts(sort=False, dropna=False)
print (c2)
print ('percentages for M1A is respondent born again?, no=0, yes=2')
p2 = data['M1A'].value_counts(sort=False, normalize=True)
print (p2)
print ('counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
c3 = data['M5'].value_counts(sort=False, dropna=False)
print (c3)
print ('percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
p3 = data['M5'].value_counts(sort=False, normalize=True)
print (p3)
'''

Tuesday, November 17, 2020

Making Data Management Decisions

 In week three of the Data Management and Visualization course, I have been learning how to actually manage the data, to prepare it for the next steps.  First, I needed to decide how to manage the variables I am studying, and our professor has given us several tools to do this, such as recoding and grouping values.  I have made use of recoding and also creating secondary variables; as I mentioned previously, my examination of respondents' religiosity and their acceptance of others outside their groups involves a large number of questions that must be analyzed to gauge those two values.  

Gauging respondents' acceptance of other groups will be particularly tricky, as it involves grouping respondents according to certain responses (e.g., race or political party), then comparing their answers to some questions but not others (e.g., Republicans will not be measured on their perception of Republicans).  I am hopeful we will learn more about this next week.  However, measuring respondents' religiosity was doable with the tools we have been given.
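One approach I am considering (a sketch with illustrative values, not the real data) is to blank out each respondent's own-group rating before averaging, so the mean covers only out-group questions:

```python
import numpy
import pandas

df = pandas.DataFrame({
    'ETH': [1, 2, 3],              # 1=black, 2=white, 3=latino
    'N1C': [80.0, 60.0, 90.0],     # rate latinos (0-100)
    'N1D': [70.0, 95.0, 50.0],     # rate whites (0-100)
    'N1E': [99.0, 40.0, 65.0],     # rate blacks (0-100)
})
own = {1: 'N1E', 2: 'N1D', 3: 'N1C'}   # each group's own-race column

for i, row in df.iterrows():
    df.loc[i, own[row['ETH']]] = numpy.nan  # drop the own-race rating

df['OUTGROUP'] = df[['N1C', 'N1D', 'N1E']].mean(axis=1)  # NaNs are skipped
print(df['OUTGROUP'])
```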

Values for religiosity questions
    M1  M1A    M4   M5   M6   QL2   QL3  QL4
0  1.0  2.0  0.25  2.5  2.0  1.00  1.00  1.0
1  1.0  2.0  0.75  2.0  2.0   NaN   NaN  NaN
2  1.0  2.0  0.75  2.5  0.0  1.00  0.75  0.8
3  0.0  0.0  0.50  0.0  NaN  0.25  0.25  0.2
4  1.0  0.0  0.50  1.0  0.0   NaN   NaN  NaN
5  1.0  0.0  0.50  0.5  0.0  0.50  0.75  0.6
6  1.0  0.0  0.50  2.0  0.0  0.75  0.75  0.4
7  1.0  0.0  0.75  0.0  NaN   NaN   NaN  NaN
8  1.0  2.0  0.75  2.0  2.0  0.75  0.75  0.2
9  1.0  2.0  0.50  2.0  2.0  0.75  0.50  0.2

The table above shows the first ten values of the eight variables I used to gauge religiosity, including "what is your religion" (M1), "are you born again" (M1A), and "how often do you attend church" (M5), the three variables I selected for frequency distributions:

counts for M1 respondent's religion, 1=religious, 0=atheist
1.0    1981
0.0     294
NaN      19
Name: M1, dtype: int64

percentages for M1 respondent's religion, 1=religious, 0=atheist
1.0    0.870769
0.0    0.129231
Name: M1, dtype: float64

Originally, respondents were asked to select their religion from a list; I recoded this data to simply indicate whether they consider themselves religious (1) or not (0).  The percentage of non-religious respondents (atheists) is 13%, which matches expectations for the general population.

counts for M1A is respondent born again?, no=0, yes=2
2.0     833
0.0    1455
NaN       6
Name: M1A, dtype: int64

percentages for M1A is respondent born again?, no=0, yes=2
2.0    0.364073
0.0    0.635927
Name: M1A, dtype: float64

Question M1A originally included three possibilities: 1=yes, 2=no, and 3=was not asked (presumably because of a negative answer to a previous question).  I condensed the 2 and 3 responses and gave them a value of 0, then assigned a value of 2 to each 'yes' because I believe that response is a sign of high religiosity and therefore should add more to the final tally.  Here we can see that about 36% of those who answered this question professed to be born again.

counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never
2.5    356
2.0    534
0.0    422
1.0    405
0.5    326
1.5    234
NaN     17
Name: M5, dtype: int64

percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never
2.5    0.156346
2.0    0.234519
0.0    0.185332
1.0    0.177866
0.5    0.143171
1.5    0.102767
Name: M5, dtype: float64

Again, this question had multiple possible answers, ranging from "more than once per week" (1) to "once or twice a month" (3) to "never" (6).  I recoded the responses to give 0.5 points per ascending category, so "never" became 0.0 and "more than once per week" became 2.5.  Thus, 18.5% of respondents say they never go to church (0.0), 10% say they go once or twice a month (1.5), and 16% say they go multiple times per week (2.5).

Next, I created secondary variables RELIG, CNT, and RELIND to develop a religious index of each respondent:

Sum of Religiosity values for each respondent
0    10.75
1     7.75
2     8.80
3     1.20
4     2.50
5     3.85
6     5.40
7     1.75
8     9.45
9     8.95
Name: RELIG, dtype: float64

RELIG is the sum of the variables shown in the first table.  In the first ten rows, it ranges from 1.2 to 10.75; higher numbers indicate higher religiosity.  I was pleased with these values because they followed the patterns I expected:  in particular, those who attend church often and claim to be born again had the highest scores.  However, I also realized the scores could be misleading because of missing data (the NaN entries in the first table).  I cannot run calculations on missing data, so I made my next secondary variable:

Number of religiosity questions with valid answers
0    8
1    5
2    8
3    7
4    5
5    8
6    8
7    4
8    8
9    8
Name: CNT, dtype: int64

CNT is the count of viable data that can be used for calculation.  As you can see, the first ten respondents answered as many as 8 or as few as 4 questions, and this could make a difference in my final values.  In fact, it did!  Notice the first two values in RELIG (10.75 and 7.75) and CNT (8 and 5), for example. 

Religious Index: sum of values / number of questions
0    1.343750
1    1.550000
2    1.100000
3    0.171429
4    0.500000
5    0.481250
6    0.675000
7    0.437500
8    1.181250
9    1.118750
Name: RELIND, dtype: float64

When RELIG is divided by CNT to get RELIND, the religious index, the first and second values flipped -- now the second is higher than the first!  So the first ten values of the religious index range from 0.17 (a self-professed atheist) to 1.55 (a born-again believer who attends church more than once per week).  Other questions had to do with respondents' opinions on moral issues and whether they are active in their church.

I may decide to further manage this data by dividing the religious index into high, medium, and low categories, but I'm not sure yet.  These values will then be compared against respondents' answers to outside-group acceptance questions such as "would you date outside your race?" or "rate Democrats (0-100)".
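If I do split the index, pandas.cut would be one way to do it.  A sketch with illustrative cut points (the thresholds here are assumptions, not final choices):

```python
import pandas

# Illustrative index values in the 0.0-2.5 range
relind = pandas.Series([0.17, 0.44, 0.50, 0.68, 1.10, 1.34, 1.55])

# bins are (0.0, 0.5], (0.5, 1.0], (1.0, 2.5]; include_lowest keeps 0.0 itself
groups = pandas.cut(relind, bins=[0.0, 0.5, 1.0, 2.5],
                    labels=['low', 'medium', 'high'], include_lowest=True)
print(groups.value_counts())
```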

Below is the text of my program:

"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""

import pandas
import numpy

data = pandas.read_csv('oolpds.csv',low_memory = False) # read data file into 'data' using pandas; low_memory=False avoids mixed-dtype warnings

print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables
data['W1_M1'].dtype

#setting variables you will be working with to numeric; errors="coerce" turns unparseable entries into NaN
data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5

# recode missing values to python missing (NaN)
data['W1_C1']=data['W1_C1'].replace(-1, numpy.nan)
data['W1_D11']=data['W1_D11'].replace(998, numpy.nan)
data['W1_D12']=data['W1_D12'].replace(998, numpy.nan)
data['W1_E4']=data['W1_E4'].replace(-1, numpy.nan)
data['W1_E61_C']=data['W1_E61_C'].replace(-1, numpy.nan)
data['W1_E62_C']=data['W1_E62_C'].replace(-1, numpy.nan)
data['W1_E63_C']=data['W1_E63_C'].replace(-1, numpy.nan)
data['W1_E64_C']=data['W1_E64_C'].replace(-1, numpy.nan)
data['W1_M1']=data['W1_M1'].replace(-1, numpy.nan)
data['W1_M1A']=data['W1_M1A'].replace(-1, numpy.nan)
data['W1_M2']=data['W1_M2'].replace(-1, numpy.nan)
data['W1_M4']=data['W1_M4'].replace(-1, numpy.nan)
data['W1_M5']=data['W1_M5'].replace(-1, numpy.nan)
data['W1_M6']=data['W1_M6'].replace(-1, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(998, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(998, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(998, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(998, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(998, numpy.nan)
data['W1_O1']=data['W1_O1'].replace(-1, numpy.nan)
data['W1_O3']=data['W1_O3'].replace(-1, numpy.nan)
data['W1_QA5D']=data['W1_QA5D'].replace(-1, numpy.nan)
data['W2_QL2B']=data['W2_QL2B'].replace(-1, numpy.nan)
data['W2_QL3']=data['W2_QL3'].replace(-1, numpy.nan)
data['W2_QL4']=data['W2_QL4'].replace(-1, numpy.nan)


#create new variables that can be added together for religiosity from current variables
recode1 = {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1, 13:0}
data['M1'] = data['W1_M1'].map(recode1)
recode2 = {1:2, 2:0, 3:0}                  #yes=2pts, no=0
data['M1A'] = data['W1_M1A'].map(recode2)
data['M4'] = data['W1_M4'] / 4 #0-1pt in 0.25 increments
recode3 = {1:2.5, 2:2, 3:1.5, 4:1, 5:0.5, 6:0, 7:0}
data['M5'] = data['W1_M5'].map(recode3) #0-2.5pts based on church attendance
data['M6'] = (2 - data['W1_M6']) * 2 #yes=2pts, no=0
data['QL2'] = data['W2_QL2B'] / 4 #0-1pt in 0.25 increments
data['QL3'] = (5 - data['W2_QL3']) / 4 #0-1pt in 0.25 increments
data['QL4'] = data['W2_QL4'] / 5 #0-1pt in 0.2 increments


#create data subset for religiosity calculation
relig = data[['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']]
a = relig.head (n=10)
print ("Values for religiosity questions")
print(a)

#create new secondary variable, RELIG, the index of religiosity
data['RELIG']= relig.sum(axis=1)   #axis=1 sums across columns, giving one total per row
b = data['RELIG'].head (n=10)
print ()
print ("Sum of Religiosity values for each respondent")
print (b)

data['CNT'] = relig.count(axis=1)
d = data['CNT'].head (n=10)
print ()
print ("Number of religiosity questions with valid answers")
print (d)

data['RELIND'] = relig.sum(axis=1) / relig.count(axis=1)  #religious index=sum of religious variables/num of questions answered
c = data['RELIND'].head (n=10)
print ()
print ("Religious Index: sum of values / number of questions")
print (c)


#recode 'would you date outside your race?' to 1=Y, 0=N
recode4 = {1:1, 2:0}
data['E4']=data['W1_E4'].map(recode4)


#counts and percentages (i.e. frequency distributions) for selected variables
print ("counts for M1 respondent's religion, 1=religious, 0=atheist")
c1 = data['M1'].value_counts(sort=False, dropna=False)
print (c1)

print ("percentages for M1 respondent's religion, 1=religious, 0=atheist")
p1 = data['M1'].value_counts(sort=False, normalize=True)
print (p1)

print ('counts for M1A is respondent born again?, no=0, yes=2')
c2 = data['M1A'].value_counts(sort=False, dropna=False)
print (c2)

print ('percentages for M1A is respondent born again?, no=0, yes=2')
p2 = data['M1A'].value_counts(sort=False, normalize=True)
print (p2)

print ('counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
c3 = data['M5'].value_counts(sort=False, dropna=False)
print (c3)

print ('percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
p3 = data['M5'].value_counts(sort=False, normalize=True)
print (p3)


Thursday, November 12, 2020

My first Python program

 In my previous post, I explained that I wanted to use the 2012 Outlook on Life survey to examine the intersection of religiosity and acceptance of others outside one's group.  To do this, I needed to use multiple variables to get an index of respondents' religiosity, and I also needed to know which groups each belonged to, to gauge their view of others outside that group.  This ended up eliminating several variables, since I couldn't use, for example, "please rate socialists (0-100)" if I didn't know which respondents considered themselves to be socialists.  Even then, when I wrote out all the variables I was considering, I had over forty!  So, as I expected I would, I had to narrow my focus.  Currently, I am concentrating mostly on racial acceptance, with a couple political questions thrown in (primarily Republican vs Democrat).

Next I set about writing my program, following the examples given in the videos.  The Spyder IDE makes the process very easy (compared to pounding out C++ on Programmer's Notepad)!  I did run into problems with missing data causing the program to throw errors, but with a quick search of the forums I found the phrase 'errors = "coerce"', which eliminated (or at least silenced) that problem.
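For anyone curious what that phrase actually does: errors="coerce" tells pandas.to_numeric to turn any entry it cannot parse into NaN instead of raising an exception, and later calculations then skip those NaNs.  A small sketch:

```python
import pandas

raw = pandas.Series(['1', '2', ' ', 'refused', '4'])

# Unparseable strings become NaN rather than raising a ValueError
nums = pandas.to_numeric(raw, errors='coerce')
print(nums)
print(nums.mean())  # NaNs are ignored: (1 + 2 + 4) / 3
```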

Our instructor gave suggestions for using a subset of the data, but I think that I will use the whole dataset, so my program does not have code for that.  Currently, my program only checks the variables I want to use, makes sure they are numeric, and then prints out the frequency distributions of three representative variables:


"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""

import pandas
import numpy
data = pandas.read_csv('oolpds.csv',low_memory = False) # read data file into 'data' using pandas; low_memory=False avoids mixed-dtype warnings
print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables
data['W1_M1'].dtype

#setting variables you will be working with to numeric; errors="coerce" turns unparseable entries into NaN
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5
data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree

#counts and percentages (i.e. frequency distributions) for each variable
print ("counts for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6")
c1 = data['W1_M2'].value_counts(sort=False)
print (c1)
print ("percentages for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6")
p1 = data['W1_M2'].value_counts(sort=False, normalize=True)
print (p1)

print ('counts for W1_E4 would respondent date outside their race?, yes=1')
c2 = data['W1_E4'].value_counts(sort=False)
print (c2)
print ('percentages for W1_E4 would respondent date outside their race?, yes=1')
p2 = data['W1_E4'].value_counts(sort=False, normalize=True)
print (p2)

print ('counts for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4')
c3 = data['W1_M1A'].value_counts(sort=False)
print (c3)
print ('percentages for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4')
p3 = data['W1_M1A'].value_counts(sort=False, normalize=True)
print (p3)

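As an aside, the long run of nearly identical to_numeric calls above could be condensed into a loop over a list of column names.  Here is a minimal sketch of that idea using a small made-up DataFrame (the real oolpds.csv file is not included here, so the column values below are invented for illustration):

```python
import pandas

# Hypothetical stand-in for a slice of the survey data.
# Blank strings mimic skipped questions in the real file.
data = pandas.DataFrame({
    'W1_M1A': ['1', '2', ' ', '1'],   # born again? yes=1, no=2
    'W1_E4':  ['1', '2', '1', ' '],   # date outside race? yes=1
})

# Convert every listed column in one pass; errors="coerce"
# turns non-numeric entries (like blanks) into NaN.
numeric_cols = ['W1_M1A', 'W1_E4']
for col in numeric_cols:
    data[col] = pandas.to_numeric(data[col], errors='coerce')

print(data.dtypes)  # both columns are now float64
```

The list of column names could then be reused for the value_counts printing as well, which would keep the variable descriptions in one place.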
Running the program in the IDE, I first get the number of observations (2294) and variables (436), followed by these frequency distributions:

2294
436

counts for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6
 5.0    108
 1.0    736
 2.0    644
 6.0    242
-1.0     66
 3.0     49
 4.0     11
Name: W1_M2, dtype: int64

percentages for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6
 5.0    0.058190
 1.0    0.396552
 2.0    0.346983
 6.0    0.130388
-1.0    0.035560
 3.0    0.026401
 4.0    0.005927
Name: W1_M2, dtype: float64

counts for W1_E4 would respondent date outside their race?, yes=1
 2.0    603
 1.0    671
-1.0     23
Name: W1_E4, dtype: int64

percentages for W1_E4 would respondent date outside their race?, yes=1
 2.0    0.464919
 1.0    0.517348
-1.0    0.017733
Name: W1_E4, dtype: float64

counts for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4
 2    1052
-1       6
 1     833
 3     403
Name: W1_M1A, dtype: int64

percentages for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4
 2    0.458588
-1    0.002616
 1    0.363121
 3    0.175676
Name: W1_M1A, dtype: float64

In this dataset, "-1" typically indicates "refused", which I take to mean the respondent selected "refuse to answer"; however, many people seem to have simply skipped questions.  For example, almost 2300 people filled out the survey, but fewer than 1300 answered the question, "would you date outside your race?"  I foresee some difficulty in building an index for each category because of this missing data, but I will work that out as I go.

That said, of those roughly 1300, about 52% (671) said they would date outside their race.  As to the question of being born again, about 36% (833) said they were.  When respondents were asked the racial makeup of their church, the largest group attended Black churches (736, or 40%), followed by White churches (644, or 35%).  This question is flawed, in my opinion, because "don't know" could mean the respondent didn't attend church, or it could even mean that he or she attended a mixed church with no clear majority (yes, they exist!).
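One way I could handle the "-1 = refused" codes before building my indices is to recode them to NaN, so that pandas ignores them alongside the skipped questions when counting and averaging.  A rough sketch with made-up responses (the recoding scheme is my own assumption, not something prescribed by the codebook):

```python
import pandas
import numpy

# Hypothetical responses to W1_E4 (date outside race? yes=1, no=2,
# refused=-1, NaN=skipped) -- invented values for illustration.
w1_e4 = pandas.Series([1, 2, -1, 1, 2, numpy.nan, 1])

# Treat -1 ("refused") the same as a skipped question
w1_e4 = w1_e4.replace(-1, numpy.nan)

valid = w1_e4.count()             # counts non-missing answers only
pct_yes = (w1_e4 == 1).sum() / valid
print(valid, round(pct_yes, 2))   # prints: 5 0.6
```

With the real data, the same recoding would make the denominators in my percentages reflect only people who actually answered each question.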

Thanks for reading my post, and I welcome your suggestions.


Thursday, November 5, 2020

Getting my research project started

Hello, and welcome!  You have stumbled upon my blog for the Coursera Data Management and Visualization course.  I have chosen to analyze the 2012 Outlook on Life Survey, and specifically to examine the association between religiosity and acceptance of other groups, which I have divided into racial, political, and cultural acceptance.

Once I chose the dataset, I read through its codebook to decide what I would study within it.  The Outlook on Life (OOL) codebook is 421 pages long, and I had to scan through it two or three times to find the data subsets I wanted to compare.  Then my task was to prepare my own codebook for those subsets.  At present, my codebook is still 116 pages.

While religiosity can be internal or external (or perhaps intrinsic and extrinsic -- I am not very familiar with psychological jargon), social status today depends far less on perceived religiosity than it did in the past, so I suspect the distinction will not matter much.  Besides, the dataset does not allow for parsing these types of religiosity, as far as I can tell.  That said, I am considering the possibility that my second topic, acceptance of other groups, may be too broad, and I may need to narrow it to one of the three subtopics.

Of course, the next step is to study what others have already done in this area.  I found the results to be mixed.  Photiadis and Biggar (1962) found a negative correlation between church participation and "ethnic distance".  Similarly, Verma and Upadhyay (1984) found a negative correlation between high religiosity and "desire for social distance".  Likewise, Grandin and Brinkerhoff (1991) stated that "fundamentalism was not found to be significantly related to racial and ethnic intolerance.... a significant, negative relationship was found between intrinsic religious orientation and social distance."  However, Duriez (2009) concluded that "The Literal vs. Symbolic dimension [of religion] was found to significantly contribute to the prediction of racism", and Hall, Matz, and Wood (2009) claimed that in their study, "Only religious agnostics were racially tolerant".

In the interest of full disclosure, I am a molecular biologist (M.S.) and a Christian with high religiosity.  As a biologist, I am struck by the subjective nature of studying human attitudes, and I am aware that my own perceptions and biases will play a part in how I look at the data and in the conclusions I draw.  For example, I rather suspect that *which* religion you belong to affects desire for social distance.  I'm sure others have studied this.

Looking at the short selection of previous literature, I notice that the earlier papers find more positive correlations between religiosity and acceptance of other groups than do the later papers.  The more recent articles also focus more specifically on "racism", which I contend has broadened its definition over time.  I hypothesize that high religiosity will positively correlate with acceptance of other groups.  I think that whatever I find, my conclusions must be tempered by the knowledge that there are many factors at work.


Testing a Basic Linear Regression Model on the Outlook on Life Survey

 In review, I am working with the 2012 Outlook on Life Survey, specifically the relationship between Religiosity (RELIND, an aggregate index...