Wednesday, January 27, 2021

Testing a Basic Linear Regression Model on the Outlook on Life Survey

 In review, I am working with the 2012 Outlook on Life Survey, specifically the relationship between Religiosity (RELIND, an aggregate index of several questions) and Acceptance of Others (ACCIND, also an aggregate index). Both variables are quantitative. For this assignment, I "centered" Religiosity by finding the mean of all respondents' scores (0.75) and subtracting that mean from each respondent's score to create a new variable (RELIN0). I then ran the linear regression on the two variables. The results indicated that Religiosity (beta = -0.026, p = .006) was significantly and negatively associated with Acceptance of Others. However, Religiosity accounts for only 0.3% of the variability in the response variable (R-squared = 0.003). After centering my explanatory variable, its mean was -5.49e-15, essentially zero. Below is the output of the linear regression, followed by the relevant program snippet.

OLS regression model for the association between Religiosity and Acceptance of Others
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 ACCIND   R-squared:                       0.003
Model:                            OLS   Adj. R-squared:                  0.003
Method:                 Least Squares   F-statistic:                     7.711
Date:                Wed, 27 Jan 2021   Prob (F-statistic):            0.00553
Time:                        15:43:24   Log-Likelihood:                 776.02
No. Observations:                2269   AIC:                            -1548.
Df Residuals:                    2267   BIC:                            -1537.
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6877      0.004    190.500      0.000       0.681       0.695
RELIN0        -0.0260      0.009     -2.777      0.006      -0.044      -0.008
==============================================================================
Omnibus:                       81.863   Durbin-Watson:                   1.961
Prob(Omnibus):                  0.000   Jarque-Bera (JB):               90.391
Skew:                          -0.489   Prob(JB):                     2.35e-20
Kurtosis:                       2.966   Cond. No.                         2.59
==============================================================================

relmean = data['RELIND'].mean(axis=0)
print ('mean of Religious Index')
print (relmean)

#subtract mean from all values to center Religious Index values on 0
data['RELIN0'] = data['RELIND'] - relmean
relmean0 = data['RELIN0'].mean(axis=0)
print ('mean of centered Religious Index (should be zero)')
print (relmean0)

#basic scatterplot:  Q->Q
scat1 = seaborn.regplot(x="RELIN0", y="ACCIND", fit_reg=False, data=data)
plt.xlabel('Religiosity')
plt.ylabel('Acceptance')
plt.title('Scatterplot for the Association Between Religiosity and Acceptance')

#linear regression using centered explanatory variable (Religiosity)
print ("OLS regression model for the association between Religiosity and Acceptance of Others")
reg1 = smf.ols('ACCIND ~ RELIN0', data=data).fit()  #response variable ~ explanatory variable
print (reg1.summary())
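For quick numeric checks, the slope, p-value, and R-squared can be read straight off the fitted results object instead of the full summary table. A minimal sketch on synthetic data (the RELIN0/ACCIND names mirror mine, but the data here are made up):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# synthetic stand-in for the survey: a weak negative effect of the
# centered predictor, roughly matching the magnitudes reported above
rng = np.random.default_rng(0)
df = pd.DataFrame({'RELIN0': rng.normal(0, 0.5, 500)})
df['ACCIND'] = 0.69 - 0.026 * df['RELIN0'] + rng.normal(0, 0.05, 500)

fit = smf.ols('ACCIND ~ RELIN0', data=df).fit()
slope = fit.params['RELIN0']    # coefficient on the centered predictor
pval = fit.pvalues['RELIN0']    # its two-sided p-value
rsq = fit.rsquared              # proportion of variance explained
print(slope, pval, rsq)
```

Because the predictor is centered, `fit.params['Intercept']` is the predicted ACCIND for a respondent of average religiosity, which is what makes the centered model easier to interpret.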

Wednesday, December 30, 2020

Writing About My Data: Describing My Study Methods

 Sample

The sample studied included adults over the age of 18 residing in the United States who responded to the online Outlook on Life survey in 2012. A total of 2,294 respondents completed the first wave of the study; of those, 1,601 also completed the second wave. Respondents were categorized into ethnic groups: Blacks (55.7%), Whites (35.5%), Latinos (5.4%), Other (2.0%), and Mixed Race (1.4%). My sample included all respondents who answered the questions relevant to my analysis.

Procedure

Data were collected by GfK Knowledge Networks for the University of California, Irvine via online surveys in two waves: wave one was conducted in August 2012; those who completed wave one were then asked to complete wave two in December 2012. Individuals were initially contacted via phone or postal mail and asked to fill out the online survey. The survey was conducted to study how variables such as political and sexual orientation, religiosity, and other beliefs influence political and social attitudes.

Measures

My explanatory variable was Religiosity, which was calculated as a composite of several questions:  “what is your religion?”, “are you born again?”, “do you approve of women clergy?”, “how often do you attend church?”, “are you an active member of your church?”, “do you approve of gay clergy?”, “do you consider homosexuality immoral?”, and “do you approve of laws protecting gays?”.  The composite was divided by the number of valid responses to account for non-answers, giving an index ranging from 0.0 to 2.0.
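The sum-over-valid-answers step can be sketched in a few lines of pandas; the three question columns here are hypothetical stand-ins, not the survey's actual items:

```python
import numpy as np
import pandas as pd

# hypothetical recoded religion questions; NaN marks a skipped question
q = pd.DataFrame({
    'q1': [1.0, 0.0, np.nan],
    'q2': [2.0, np.nan, 1.0],
    'q3': [0.5, 1.0, np.nan],
})

# index = sum of answered items / number of valid answers, so
# respondents who skip questions are not dragged toward zero
index = q.sum(axis=1) / q.count(axis=1)
print(index.tolist())
```

pandas skips NaN in both `sum` and `count` by default, which is exactly the behavior the divide-by-valid-responses scheme needs.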

My response variable was Racial Acceptance, also a composite of 14 questions, including: “were you ever willing to date outside your race?”, “how do you feel about biracial children with [blacks/Latinos/whites/Asians]?”, “what is the predominant race of your church?” (1 point if it was different from respondent’s ethnicity), “rate [Native Americans/Latinos/whites/blacks/Asians] on a scale of 1 – 100”, “should blacks teach their children against racism?”, and “should blacks segregate from whites?”. This composite was also divided by the number of valid responses, giving an index ranging from 0.0 to 1.0. Since both indices are quantitative variables, a Pearson Correlation was run to test for a possible relationship.


Monday, December 14, 2020

Potential Moderator for ANOVA

 In our first week of this course, we were to run an ANOVA on two variables in our dataset.  I decided to compare ethnicity to religiosity (an index I formulated from several questions about religion).  I found that black respondents were significantly more religious than other races.  However, one variable that might confound racial comparisons is socioeconomic status.  Therefore, I decided to test whether such status is a moderator for religiosity in the various ethnicities.

Following the videos' instructions, I subdivided the 2012 Outlook on Life dataset by self-reported socioeconomic status (W1_P2). Then I ran ANOVAs on each subset of the data. For those in the "poor" socioeconomic class, F(4, 253) = 2.6 and p = 0.04, significant but barely. For the "working" class, F(4, 750) = 8.7 and p < 0.001. For the "middle" class, F(4, 950) = 14.7 and p < 0.001. For the "upper-middle" class, F(4, 209) = 4.7 and p = 0.001. For the "upper" class, F(3, 21) = 6.1 and p = 0.004. Thus, I could still reject the null hypothesis for all classes.

As before, since I am running ANOVA on a variable with multiple groups, I ran the Tukey HSD test for each socioeconomic class. When the "poor" groups were compared, the null hypothesis could not be rejected for any pairwise comparison. Among the "working" class, blacks and whites were significantly different (p=0.001), but all other pairwise comparisons were not. Among those in the "middle" class, blacks were significantly more religious than all other races (p=0.001 vs whites, p=0.007 vs Latinos, p=0.017 vs others, and p=0.036 vs mixed race), but the other races were not significantly different from each other. Among races in the "upper-middle" class, blacks were again significantly more religious than whites (p=0.0025) but not than any other race; pairwise comparisons of the other races were not significant. The same pattern held in the "upper" class, with blacks vs whites significant (p=0.01) and all other comparisons not.

If socioeconomic status were, in fact, a moderating variable for the relationship between ethnicity and religiosity, I would expect the comparisons for some classes to differ noticeably from the overall result. However, the same pattern held regardless of class, with some expected variation. Thus, I reject the possibility of socioeconomic status as a moderator of this relationship.

This is the snippet of code I used to run these tests:

# ANOVA of ethnicity vs religiosity
model2 = smf.ols(formula='RELIND ~ C(ETH)', data=data).fit()
print (model2.summary())
#create sub4 with only RELIND and ETH to print means and s.d.'s
sub4 = data[['RELIND', 'ETH']].dropna()
print ('means for Religiosity by Race')
m3= sub4.groupby('ETH').mean()
print (m3)
print ('standard deviations for Religiosity by Race')
sd3 = sub4.groupby('ETH').std()
print (sd3)
#run post hoc Tukey test to find out which group was significantly different
mc2 = multi.MultiComparison(sub4['RELIND'], sub4['ETH'])
res2 = mc2.tukeyhsd()
print(res2.summary())

#create sub-dataframes based on socioeconomic class: 1=poor, 2=working, 3=middle, 4=upper-middle, 5=upper
sub5=data[(data['W1_P2']==1)]
sub5 = sub5[['RELIND', 'ETH', 'W1_P2']].dropna()
sub6=data[(data['W1_P2']==2)]
sub6 = sub6[['RELIND', 'ETH', 'W1_P2']].dropna()
sub7=data[(data['W1_P2']==3)]
sub7 = sub7[['RELIND', 'ETH', 'W1_P2']].dropna()
sub8=data[(data['W1_P2']==4)]
sub8 = sub8[['RELIND', 'ETH', 'W1_P2']].dropna()
sub9=data[(data['W1_P2']==5)]
sub9 = sub9[['RELIND', 'ETH', 'W1_P2']].dropna()

#run ANOVAs for each socioeconomic class
print ('association between ethnicity and religiosity for those in "Poor" socioeconomic class')
model3 = smf.ols(formula='RELIND ~ C(ETH)', data=sub5).fit()
print (model3.summary())
print ('association between ethnicity and religiosity for those in "Working" socioeconomic class')
model4 = smf.ols(formula='RELIND ~ C(ETH)', data=sub6).fit()
print (model4.summary())
print ('association between ethnicity and religiosity for those in "Middle" socioeconomic class')
model5 = smf.ols(formula='RELIND ~ C(ETH)', data=sub7).fit()
print (model5.summary())
print ('association between ethnicity and religiosity for those in "Upper-middle" socioeconomic class')
model6 = smf.ols(formula='RELIND ~ C(ETH)', data=sub8).fit()
print (model6.summary())
print ('association between ethnicity and religiosity for those in "Upper" socioeconomic class')
model7 = smf.ols(formula='RELIND ~ C(ETH)', data=sub9).fit()
print (model7.summary())

#print comparative means for each socioeconomic class
print ("means for religiosity by ethnicity for Poor")
m5= sub5.groupby('ETH').mean()
print (m5)
print ("means for religiosity by ethnicity for Working")
m6= sub6.groupby('ETH').mean()
print (m6)
print ("means for religiosity by ethnicity for Middle")
m7= sub7.groupby('ETH').mean()
print (m7)
print ("means for religiosity by ethnicity for Upper-middle")
m8= sub8.groupby('ETH').mean()
print (m8)
print ("means for religiosity by ethnicity for Upper")
m9= sub9.groupby('ETH').mean()
print (m9)

#run post hoc Tukey test for all classes to find out which group in that class was significantly different
mc3 = multi.MultiComparison(sub5['RELIND'], sub5['ETH'])  #poor
res3 = mc3.tukeyhsd()
print('Tukey test of ethnicity and religiosity for "Poor" class')
print(res3.summary())
mc4 = multi.MultiComparison(sub6['RELIND'], sub6['ETH'])  #working
res4 = mc4.tukeyhsd()
print('Tukey test of ethnicity and religiosity for "Working" class')
print(res4.summary())
mc5 = multi.MultiComparison(sub7['RELIND'], sub7['ETH'])  #middle
res5 = mc5.tukeyhsd()
print('Tukey test of ethnicity and religiosity for "Middle" class')
print(res5.summary())
mc6 = multi.MultiComparison(sub8['RELIND'], sub8['ETH'])  #upper-middle
res6 = mc6.tukeyhsd()
print('Tukey test of ethnicity and religiosity for "Upper-middle" class')
print(res6.summary())
mc7 = multi.MultiComparison(sub9['RELIND'], sub9['ETH'])  #upper
res7 = mc7.tukeyhsd()
print('Tukey test of ethnicity and religiosity for "Upper" class')
print(res7.summary())

Wednesday, December 9, 2020

Pearson Correlation of Religiosity to Racial Acceptance

 This week, we were instructed to use Python to calculate the Pearson Correlation for two variables in our chosen dataset.  I went back to my original question, which was whether, according to the 2012 Outlook on Life survey, respondent religiosity correlated with respondent acceptance of others outside their ethnicity.  To do this, I combined several questions to calculate indices for each measure.  The graph looked like this:


So, visually we can surmise that there is no correlation.  I ran the Pearson Correlation using Python, and the code snippet reads:

#calculate and print Pearson Correlation
data_clean=data.dropna()
print ('association between Religiosity Index and Racial Acceptance Index')
print (scipy.stats.pearsonr(data_clean['RELIND'], data_clean['ACCIND']))

The output was: r = -0.09, p = 0.15. Therefore, the Pearson Correlation confirms no significant relationship between the two variables. Squaring r gives r^2 = 0.0081, a minuscule proportion of the variability in racial acceptance that might be credited to religiosity.
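The r-to-r^2 step is worth verifying mechanically; a small sketch on made-up paired data (scipy's `pearsonr` returns the coefficient and its two-sided p-value):

```python
import numpy as np
from scipy.stats import pearsonr

rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = -0.1 * x + rng.normal(size=200)   # deliberately weak association

r, p = pearsonr(x, y)
r_squared = r ** 2   # proportion of variability in y shared with x
print(r, r_squared)
```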

Thursday, December 3, 2020

Chi Square Test of Independence: Race vs Willingness to Date Extraracially

 For this assignment, we were to run a Chi Square test on our dataset, which in my case is the 2012 Outlook on Life Survey.  I chose to compare respondents' indicated ethnicity to their willingness to date outside their race (one of the few questions with a simple yes/no answer).  This one was fairly straightforward.  Plotted on a bar graph, the results look like this:


where 1=Black, 2=White, 3=Latino, 5=Other, and 6=Mixed.

Right away, one group (blacks) appears much higher than the others. Running the Chi Square test on all groups gave a value of 40.7, with p<0.0001 and df=4.

However, I'm not done, because I am comparing multiple groups. I need to run pairwise comparisons for all groups, for a total of 10 comparisons. The Bonferroni Adjustment requires dividing my significance level (0.05) by 10, giving p<0.005. I must fail to reject the null hypothesis for anything higher than that.
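The adjustment can be computed mechanically: with five groups there are C(5,2) = 10 unordered pairs, so each test must clear 0.05/10. A short sketch (group codes follow this post's labeling):

```python
from itertools import combinations

groups = {1: 'Black', 2: 'White', 3: 'Latino', 5: 'Other', 6: 'Mixed'}
pairs = list(combinations(groups, 2))   # every unordered pair of group codes
alpha_adjusted = 0.05 / len(pairs)      # Bonferroni-corrected threshold
print(len(pairs), alpha_adjusted)
```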

As expected, Black vs White Chi Square gave a value of 38.5, with p<0.0001 and df=1. Black vs Latino gave CS=5.0, p=0.026, with df=1; this would have been significant as well, but following the Bonferroni Adjustment shown above, it is not. No other comparisons were even close to significant. Thus, the Chi Square test indicates blacks were significantly more willing to date outside their race than were whites, but not significantly more willing than the other ethnicities.

For your perusal, my code follows:

"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""
import pandas
import numpy
import seaborn
import scipy.stats
import matplotlib.pyplot as plt
import statsmodels.formula.api as smf
import statsmodels.stats.multicomp as multi 
data = pandas.read_csv('oolpds.csv',low_memory = False) # read data file into 'data' using pandas and fend off a memory error
print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables
data['W1_M1'].dtype
#set working variables to numeric; errors="coerce" converts non-numeric entries to NaN
#data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5
# recode missing values to python missing (NaN)
#data['W1_C1']=data['W1_C1'].replace(-1, numpy.nan)
data['W1_D11']=data['W1_D11'].replace(998, numpy.nan)
data['W1_D12']=data['W1_D12'].replace(998, numpy.nan)
data['W1_E4']=data['W1_E4'].replace(-1, numpy.nan)
data['W1_E61_C']=data['W1_E61_C'].replace(-1, numpy.nan)
data['W1_E62_C']=data['W1_E62_C'].replace(-1, numpy.nan)
data['W1_E63_C']=data['W1_E63_C'].replace(-1, numpy.nan)
data['W1_E64_C']=data['W1_E64_C'].replace(-1, numpy.nan)
data['W1_M1']=data['W1_M1'].replace(-1, numpy.nan)
data['W1_M1A']=data['W1_M1A'].replace(-1, numpy.nan)
data['W1_M2']=data['W1_M2'].replace(-1, numpy.nan)
data['W1_M4']=data['W1_M4'].replace(-1, numpy.nan)
data['W1_M5']=data['W1_M5'].replace(-1, numpy.nan)
data['W1_M6']=data['W1_M6'].replace(-1, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(998, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(998, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(998, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(998, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(998, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(-1, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(-1, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(-1, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(-1, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(-1, numpy.nan)
data['W1_O1']=data['W1_O1'].replace(-1, numpy.nan)
data['W1_O3']=data['W1_O3'].replace(-1, numpy.nan)
data['W1_QA5D']=data['W1_QA5D'].replace(-1, numpy.nan)
data['W2_QL2B']=data['W2_QL2B'].replace(-1, numpy.nan)
data['W2_QL3']=data['W2_QL3'].replace(-1, numpy.nan)
data['W2_QL4']=data['W2_QL4'].replace(-1, numpy.nan)

#create new variables that can be added together for religiosity from current variables
recode1 = {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1, 13:0}
data['M1'] = data['W1_M1'].map(recode1)
recode2 = {1:2, 2:0, 3:0}                  #yes=2pts, no=0
data['M1A'] = data['W1_M1A'].map(recode2)
data['M4'] = data['W1_M4'] / 4 #0-1pt in 0.25 increments
recode3 = {1:2.5, 2:2, 3:1.5, 4:1, 5:0.5, 6:0, 7:0}
data['M5'] = data['W1_M5'].map(recode3) #0-2.5pts based on church attendance
data['M6'] = (2 - data['W1_M6']) * 2 #yes=2pts, no=0
data['QL2'] = data['W2_QL2B'] / 4 #0-1pt in 0.25 increments
data['QL3'] = (5 - data['W2_QL3']) / 4 #0-1pt in 0.25 increments
data['QL4'] = data['W2_QL4'] / 5 #0-1pt in 0.2 increments
#create data subset for religiosity calculation
relig = data[['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']]
a = relig.head (n=10)
print ("Values for religiosity questions")
print(a)
#create new secondary variable, RELIG, the index of religiosity
data['RELIG']= relig.sum(axis=1)   #axis=1 means the horizontal axis
b = data['RELIG'].head (n=10)
print ()
print ("Sum of Religiosity values for each respondent")
print (b)
data['CNT'] = relig.count(axis=1)
d = data['CNT'].head (n=10)
print ()
print ("Number of religiosity questions with valid answers")
print (d)
data['RELIND'] = relig.sum(axis=1) / relig.count(axis=1)  #religious index=sum of religious variables/num of questions answered
c = data['RELIND'].head (n=10)
print ()
print ("Religious Index: sum of values / number of questions")
print (c)
#Manage acceptance variables
recode7 = {1:2, 2:1, 3:5, 4:3, 5:6}
data['ETH'] = data['PPETHM'].map(recode7)  #remap ethnicity to match M2 church race
recode4 = {1:1, 2:0}
data['E4']=data['W1_E4'].map(recode4) #recode 'would you date outside your race?' to 1=Y, 0=N
recode5 = {1:0, 2:0.33, 3:0.66, 4:1}
data['E61'] = data['W1_E61_C'].map(recode5)  #recode scale to 0--1 from 1--4
data['E62'] = data['W1_E62_C'].map(recode5)  #concerned about biracial kids?
data['E63'] = data['W1_E63_C'].map(recode5)
data['E64'] = data['W1_E64_C'].map(recode5)
data['QA5D'] =data['W1_QA5D'].map(recode5)
data['N1B'] = data['W1_N1B'] / 100    #change 0--100 to 0.0--1.0 scale
data['N1C'] = data['W1_N1C'] / 100    #rate races
data['N1D'] = data['W1_N1D'] / 100    
data['N1E'] = data['W1_N1E'] / 100    
data['N1F'] = data['W1_N1F'] / 100    
recode6 = {1:0, 2:0.25, 3:0.5, 4:0.75, 5:1.0}
data['O1'] = data['W1_O1'].map(recode6)  #recode scale to 0--1 from 1--5
data['O3'] = data['W1_O3'].map(recode6)
#subdividing based on ethnicity
blk = data[(data['ETH'] == 1)]  #black sub
wht = data[(data['ETH'] == 2)]  #white sub
ltn = data[(data['ETH'] == 3)]  #latino sub
oth = data[(data['ETH'] == 5)]  #other sub
mix = data[(data['ETH'] == 6)]  #mixed sub
def CHURAC (row):   #1pt if church race != own race
    if row['W1_M2'] > row['ETH']:
        return 1
    if row['W1_M2'] < row['ETH']:
        return 1
    if row['W1_M2'] == row['ETH']:
        return 0
data['M2'] = data.apply (lambda row: CHURAC (row),axis=1)
#group the acceptance questions for each race (omit rating of own race)
baccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1D', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for blacks
waccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for whites
laccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1D', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for latinos
oaccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2',        'N1C', 'N1D', 'N1E', 'O1', 'O3', 'QA5D']]  #the questions for other
maccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1D', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for mixed
#add the columns for each race, but do it for all respondents
data['BACSUM'] = baccept.sum (axis=1)
data['WACSUM'] = waccept.sum (axis=1)
data['LACSUM'] = laccept.sum (axis=1)
data['OACSUM'] = oaccept.sum (axis=1)
data['MACSUM'] = maccept.sum (axis=1)
def ACCEP (row):  #select the sum (from above) based on respondent's race
    a = 0.0
    if row['ETH'] == 1:  #black
        a = row['BACSUM']
    if row['ETH'] == 2:  #white
        a = row['WACSUM']
    if row['ETH'] == 3:  #latino
        a = row['LACSUM']
    if row['ETH'] == 5:  #other
        a = row['OACSUM']
    if row['ETH'] == 6:  #mix
        a = row['MACSUM']
    return a
data['ACCEPT'] = data.apply (lambda row: ACCEP (row),axis=1)
a1 = data['ACCEPT'].head (n=10)
print ('acceptance')
print (a1)
#count number of valid acceptance answers for all respondents based on race
data['BACCNT'] = baccept.count(axis=1)
data['WACCNT'] = waccept.count(axis=1)
data['LACCNT'] = laccept.count(axis=1)
data['OACCNT'] = oaccept.count(axis=1)
data['MACCNT'] = maccept.count(axis=1)
def ACCNT (row):  #select the count (from above) based on respondent's race
    a = 0.0
    if row['ETH'] == 1:  #black
        a = row['BACCNT']
    if row['ETH'] == 2:  #white
        a = row['WACCNT']
    if row['ETH'] == 3:  #latino
        a = row['LACCNT']
    if row['ETH'] == 5:  #other
        a = row['OACCNT']
    if row['ETH'] == 6:  #mix
        a = row['MACCNT']
    return a
data['ACCNT'] = data.apply (lambda row: ACCNT (row),axis=1)
a2 = data['ACCNT'].head (n=10)
print ('# of valid answers to acceptance questions')
print (a2)
#calculate acceptance index: acceptance sum / valid question count
data['ACCIND'] = data['ACCEPT'] / data['ACCNT']
a3 = data['ACCIND'].head (n=10)
print ('Acceptance index')
print (a3)
descr = data['RELIND'].describe()
print (descr)
desca = data['ACCIND'].describe()
print (desca)
#Univariate histogram for religiosity:
hist1 = seaborn.histplot(data["RELIND"].dropna(), kde=False);
plt.xlabel('Index of Religiosity')
plt.title('Index of Religiosity')
print (hist1)
#Univariate histogram for acceptance:
hist2 = seaborn.histplot(data["ACCIND"].dropna(), kde=False);
plt.xlabel('Index of Racial Acceptance')
plt.title('Index of Racial Acceptance')
print (hist2)

#basic scatterplot:  Q->Q
scat1 = seaborn.regplot(x="RELIND", y="ACCIND", fit_reg=False, data=data)
plt.xlabel('Religiosity')
plt.ylabel('Acceptance')
plt.title('Scatterplot for the Association Between Religiosity and Acceptance')

#Set up for Chi Square comparison:  race vs willingness to date outside race
# contingency table of observed counts
ct1=pandas.crosstab(data['E4'], data['ETH'])
print (ct1)
# column percentages
colsum=ct1.sum(axis=0)
colpct=ct1/colsum
print(colpct)
# chi-square
print ('chi-square value, p value, expected counts')
cs1= scipy.stats.chi2_contingency(ct1)
print (cs1)
# graph percent willing to date outside race within each racial group 
seaborn.catplot(x="ETH", y="E4", data=data, kind="bar", ci=None)
plt.xlabel('Ethnicity')
plt.ylabel('Proportion Willing to Date Extraracially')
#Using Bonferroni Adjustment post hoc test to guard vs Type I error: 10 different comparisons, so significance level p<0.005
#use recode to do individual pairwise comparisons: 1-2, 1-3, 1-5, 1-5, 2-3, 2-5, 2-6, 3-5, 3-6, 5-6
recode2 = {1: 1, 2: 2}   #1-2 comparison
data['COMP1v2']= data['ETH'].map(recode2)
# contingency table of observed counts
ct2=pandas.crosstab(data['E4'], data['COMP1v2'])
print (ct2)
# column percentages
colsum=ct2.sum(axis=0)
colpct=ct2/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs2= scipy.stats.chi2_contingency(ct2)
print (cs2)
recode3 = {1: 1, 3: 3}   #1-3 comparison
data['COMP1v3']= data['ETH'].map(recode3)
# contingency table of observed counts
ct3=pandas.crosstab(data['E4'], data['COMP1v3'])
print (ct3)
# column percentages
colsum=ct3.sum(axis=0)
colpct=ct3/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs3= scipy.stats.chi2_contingency(ct3)
print (cs3)
recode4 = {1: 1, 5: 5}   #1-5 comparison
data['COMP1v5']= data['ETH'].map(recode4)
# contingency table of observed counts
ct4=pandas.crosstab(data['E4'], data['COMP1v5'])
print (ct4)
# column percentages
colsum=ct4.sum(axis=0)
colpct=ct4/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs4= scipy.stats.chi2_contingency(ct4)
print (cs4)
recode5 = {1: 1, 6: 6}   #1-6 comparison
data['COMP1v6']= data['ETH'].map(recode5)
# contingency table of observed counts
ct5=pandas.crosstab(data['E4'], data['COMP1v6'])
print (ct5)
# column percentages
colsum=ct5.sum(axis=0)
colpct=ct5/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs5= scipy.stats.chi2_contingency(ct5)
print (cs5)
recode6 = {2: 2, 3: 3}   #2-3 comparison
data['COMP2v3']= data['ETH'].map(recode6)
# contingency table of observed counts
ct6=pandas.crosstab(data['E4'], data['COMP2v3'])
print (ct6)
# column percentages
colsum=ct6.sum(axis=0)
colpct=ct6/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs6= scipy.stats.chi2_contingency(ct6)
print (cs6)
recode7 = {2: 2, 5: 5}   #2-5 comparison
data['COMP2v5']= data['ETH'].map(recode7)
# contingency table of observed counts
ct7=pandas.crosstab(data['E4'], data['COMP2v5'])
print (ct7)
# column percentages
colsum=ct7.sum(axis=0)
colpct=ct7/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs7= scipy.stats.chi2_contingency(ct7)
print (cs7)
recode8 = {2: 2, 6: 6}   #2-6 comparison
data['COMP2v6']= data['ETH'].map(recode8)
# contingency table of observed counts
ct8=pandas.crosstab(data['E4'], data['COMP2v6'])
print (ct8)
# column percentages
colsum=ct8.sum(axis=0)
colpct=ct8/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs8= scipy.stats.chi2_contingency(ct8)
print (cs8)
recode9 = {3: 3, 5: 5}   #3-5 comparison
data['COMP3v5']= data['ETH'].map(recode9)
# contingency table of observed counts
ct9=pandas.crosstab(data['E4'], data['COMP3v5'])
print (ct9)
# column percentages
colsum=ct9.sum(axis=0)
colpct=ct9/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs9= scipy.stats.chi2_contingency(ct9)
print (cs9)
recode10 = {3: 3, 6: 6}   #3-6 comparison
data['COMP3v6']= data['ETH'].map(recode10)
# contingency table of observed counts
ct10=pandas.crosstab(data['E4'], data['COMP3v6'])
print (ct10)
# column percentages
colsum=ct10.sum(axis=0)
colpct=ct10/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs10= scipy.stats.chi2_contingency(ct10)
print (cs10)
recode11 = {5: 5, 6: 6}   #5-6 comparison
data['COMP5v6']= data['ETH'].map(recode11)
# contingency table of observed counts
ct11=pandas.crosstab(data['E4'], data['COMP5v6'])
print (ct11)
# column percentages
colsum=ct11.sum(axis=0)
colpct=ct11/colsum
print(colpct)
print ('chi-square value, p value, expected counts')
cs11= scipy.stats.chi2_contingency(ct11)
print (cs11)

Tuesday, December 1, 2020

ANOVA of Race vs Racial Acceptance and Religiosity

Previously, I looked for a correlation between religiosity and racial acceptance, as determined by the 2012 Outlook on Life survey. My alternative hypothesis was that high religiosity would correlate with high racial acceptance; however, visualizing the data in a scatterplot showed no correlation, so I failed to reject the null hypothesis.

This week, I decided to take the question further:  we were assigned to run an ANOVA on our data, so I chose to run two:  Race vs Racial Acceptance, and Race vs Religiosity.  In each case, race is the categorical explanatory variable; racial acceptance or religiosity is the quantitative response variable.  Here's what I found:

For  Race vs Racial Acceptance, ANOVA found a significant relationship: F(4, 2264) = 59.49, p<0.001.  Because race has more than two levels (black, white, latino, other, and mixed) and the ANOVA was significant, I ran post hoc analysis using the Tukey HSD test.  The pairwise comparisons that were significant were:  black-white, black-latino, white-latino, white-other, white-mixed.  The means for each group were:


So this means that whites showed significantly lower racial acceptance than did other ethnicities.  This was an unfortunate surprise.

For Race vs Religiosity, ANOVA again found a significant relationship: F(4, 2289) = 29.35, p<0.001. Again, for the reasons listed above, I ran the Tukey HSD test. The pairwise comparisons that were significant were: black-white, black-latino, and black-other. The means for each group were:


So, as you can see, blacks showed significantly higher religiosity than whites, latinos, and others (the black-mixed comparison was not significant).
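Group means like the ones I report above come from a one-line groupby.  A sketch with toy values (not the survey's):

```python
import pandas

# Toy stand-in: ETH group codes with made-up RELIND scores
toy = pandas.DataFrame({
    'ETH':    [1, 1, 2, 2, 3, 3],
    'RELIND': [1.2, 1.4, 0.6, 0.8, 0.9, 1.1],
})

# Mean religiosity per ethnicity group
means = toy.groupby('ETH')['RELIND'].mean()
print(means)
```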

Tuesday, November 24, 2020

Visualizing the data

 In week four, we finally get to see the results of our efforts over the last few weeks: the visualization of the interaction between our chosen variables.  In review:  I chose to look at the effect of religiosity on racial acceptance.  Thus, religiosity is my independent variable, and racial acceptance is my dependent variable.  Both are quantitative variables, so I made a histogram for each:


Here we see that religiosity ranged from 0.0 to 2.5 (this is simply an index; the numbers themselves are only meaningful in their own context), with a single mode at about 0.5, though the mean is 0.75.  There is an obvious right skew (the long tail is on the right).


Racial acceptance is also skewed, but to the left, with a mode of about 0.8 and a mean of 0.69, in a range of 0.0 to 1.0.
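The skew I'm describing by eye can also be quantified: pandas' .skew() returns a positive number for a right skew and a negative number for a left skew.  A sketch with made-up values:

```python
import pandas

# Made-up sample with a long right tail
s = pandas.Series([0.2, 0.3, 0.3, 0.4, 0.5, 0.6, 1.5, 2.4])

# Positive result indicates a right skew
print(s.skew())
```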

My next step is to make a scatterplot of the two variables.


My hypothesis was that religiosity and racial acceptance would be positively correlated, but this plot clearly shows no correlation between the two.  In fact, data points cluster in the area indicating relatively low religiosity and high acceptance.  To better examine my original hypothesis, I would seek out more high-religiosity individuals.

One caveat:  the 2012 Outlook on Life survey (which I used for this project) included individuals from various ethnicities/races, but the proportions do not reflect the general population: about half of respondents were black, 40% white, and 10% other.  Therefore, I would hesitate to apply these conclusions to the general population.
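Sample composition like this is easy to check with value_counts.  A sketch (the group codes follow my recoding, but the counts are made up):

```python
import pandas

# Made-up ethnicity codes (1=black, 2=white, 3=latino under my recoding)
eth = pandas.Series([1, 1, 1, 1, 1, 2, 2, 2, 2, 3])

# Proportion of each group in the sample
print(eth.value_counts(normalize=True))
```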

My code is below in case you are interested.

"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""
import pandas
import numpy
import seaborn
import matplotlib.pyplot as plt
data = pandas.read_csv('oolpds.csv',low_memory = False) # read data file into 'data'; low_memory=False avoids mixed-dtype warnings on large files
print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables
print(data['W1_M1'].dtype)
#setting variables you will be working with to numeric; errors="coerce" turns non-numeric entries into NaN
#data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5
# recode missing values to python missing (NaN)
#data['W1_C1']=data['W1_C1'].replace(-1, numpy.nan)
data['W1_D11']=data['W1_D11'].replace(998, numpy.nan)
data['W1_D12']=data['W1_D12'].replace(998, numpy.nan)
data['W1_E4']=data['W1_E4'].replace(-1, numpy.nan)
data['W1_E61_C']=data['W1_E61_C'].replace(-1, numpy.nan)
data['W1_E62_C']=data['W1_E62_C'].replace(-1, numpy.nan)
data['W1_E63_C']=data['W1_E63_C'].replace(-1, numpy.nan)
data['W1_E64_C']=data['W1_E64_C'].replace(-1, numpy.nan)
data['W1_M1']=data['W1_M1'].replace(-1, numpy.nan)
data['W1_M1A']=data['W1_M1A'].replace(-1, numpy.nan)
data['W1_M2']=data['W1_M2'].replace(-1, numpy.nan)
data['W1_M4']=data['W1_M4'].replace(-1, numpy.nan)
data['W1_M5']=data['W1_M5'].replace(-1, numpy.nan)
data['W1_M6']=data['W1_M6'].replace(-1, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(998, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(998, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(998, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(998, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(998, numpy.nan)
data['W1_N1B']=data['W1_N1B'].replace(-1, numpy.nan)
data['W1_N1C']=data['W1_N1C'].replace(-1, numpy.nan)
data['W1_N1D']=data['W1_N1D'].replace(-1, numpy.nan)
data['W1_N1E']=data['W1_N1E'].replace(-1, numpy.nan)
data['W1_N1F']=data['W1_N1F'].replace(-1, numpy.nan)
data['W1_O1']=data['W1_O1'].replace(-1, numpy.nan)
data['W1_O3']=data['W1_O3'].replace(-1, numpy.nan)
data['W1_QA5D']=data['W1_QA5D'].replace(-1, numpy.nan)
data['W2_QL2B']=data['W2_QL2B'].replace(-1, numpy.nan)
data['W2_QL3']=data['W2_QL3'].replace(-1, numpy.nan)
data['W2_QL4']=data['W2_QL4'].replace(-1, numpy.nan)

#create new variables that can be added together for religiosity from current variables
recode1 = {1:1, 2:1, 3:1, 4:1, 5:1, 6:1, 7:1, 8:1, 9:1, 10:1, 11:1, 12:1, 13:0}
data['M1'] = data['W1_M1'].map(recode1)
recode2 = {1:2, 2:0, 3:0}                  #yes=2pts, no=0
data['M1A'] = data['W1_M1A'].map(recode2)
data['M4'] = data['W1_M4'] / 4 #0-1pt in 0.25 increments
recode3 = {1:2.5, 2:2, 3:1.5, 4:1, 5:0.5, 6:0, 7:0}
data['M5'] = data['W1_M5'].map(recode3) #0-2.5pts based on church attendance
data['M6'] = (2 - data['W1_M6']) * 2 #yes=2pts, no=0
data['QL2'] = data['W2_QL2B'] / 4 #0-1pt in 0.25 increments
data['QL3'] = (5 - data['W2_QL3']) / 4 #0-1pt in 0.25 increments
data['QL4'] = data['W2_QL4'] / 5 #0-1pt in 0.2 increments
#create data subset for religiosity calculation
relig = data[['M1', 'M1A', 'M4', 'M5', 'M6', 'QL2', 'QL3', 'QL4']]
a = relig.head (n=10)
print ("Values for religiosity questions")
print(a)
#create new secondary variable, RELIG, the index of religiosity
data['RELIG']= relig.sum(axis=1)   #axis=1 sums across columns, giving one value per respondent
b = data['RELIG'].head (n=10)
print ()
print ("Sum of Religiosity values for each respondent")
print (b)
data['CNT'] = relig.count(axis=1)
d = data['CNT'].head (n=10)
print ()
print ("Number of religiosity questions with valid answers")
print (d)
data['RELIND'] = relig.sum(axis=1) / relig.count(axis=1)  #religious index=sum of religious variables/num of questions answered
c = data['RELIND'].head (n=10)
print ()
print ("Religious Index: sum of values / number of questions")
print (c)
#Manage acceptance variables
recode7 = {1:2, 2:1, 3:5, 4:3, 5:6}
data['ETH'] = data['PPETHM'].map(recode7)  #remap ethnicity to match M2 church race
recode4 = {1:1, 2:0}
data['E4']=data['W1_E4'].map(recode4) #recode 'would you date outside your race?' to 1=Y, 0=N
recode5 = {1:0, 2:0.33, 3:0.66, 4:1}
data['E61'] = data['W1_E61_C'].map(recode5)  #recode scale to 0--1 from 1--4
data['E62'] = data['W1_E62_C'].map(recode5)  #concerned about biracial kids?
data['E63'] = data['W1_E63_C'].map(recode5)
data['E64'] = data['W1_E64_C'].map(recode5)
data['QA5D'] =data['W1_QA5D'].map(recode5)
data['N1B'] = data['W1_N1B'] / 100    #change 0--100 to 0.0--1.0 scale
data['N1C'] = data['W1_N1C'] / 100    #rate races
data['N1D'] = data['W1_N1D'] / 100    
data['N1E'] = data['W1_N1E'] / 100    
data['N1F'] = data['W1_N1F'] / 100    
recode6 = {1:0, 2:0.25, 3:0.5, 4:0.75, 5:1.0}
data['O1'] = data['W1_O1'].map(recode6)  #recode scale to 0--1 from 1--5
data['O3'] = data['W1_O3'].map(recode6)
#subdividing based on ethnicity
blk = data[(data['ETH'] == 1)]  #black sub
wht = data[(data['ETH'] == 2)]  #white sub
ltn = data[(data['ETH'] == 3)]  #latino sub
oth = data[(data['ETH'] == 5)]  #other sub
mix = data[(data['ETH'] == 6)]  #mixed sub
def CHURAC (row):   #1pt if church race != own race
    if row['W1_M2'] == row['ETH']:
        return 0
    if row['W1_M2'] < row['ETH'] or row['W1_M2'] > row['ETH']:
        return 1
    #NaN compares False to everything above, so missing values fall through and return None (NaN)
data['M2'] = data.apply (lambda row: CHURAC (row),axis=1)
#group the acceptance questions for each race (omit rating of own race)
baccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1D', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for blacks
waccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for whites
laccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1D', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for latinos
oaccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2',        'N1C', 'N1D', 'N1E', 'O1', 'O3', 'QA5D']]  #the questions for other
maccept = data[['E4', 'E61', 'E62', 'E63', 'E64', 'M2', 'N1B', 'N1C', 'N1D', 'N1E', 'N1F', 'O1', 'O3', 'QA5D']]  #the questions for mixed
#add the columns for each race, but do it for all respondents
data['BACSUM'] = baccept.sum (axis=1)
data['WACSUM'] = waccept.sum (axis=1)
data['LACSUM'] = laccept.sum (axis=1)
data['OACSUM'] = oaccept.sum (axis=1)
data['MACSUM'] = maccept.sum (axis=1)
def ACCEP (row):  #select the sum (from above) based on respondent's race
    a = 0.0
    if row['ETH'] == 1:  #black
        a = row['BACSUM']
    if row['ETH'] == 2:  #white
        a = row['WACSUM']
    if row['ETH'] == 3:  #latino
        a = row['LACSUM']
    if row['ETH'] == 5:  #other
        a = row['OACSUM']
    if row['ETH'] == 6:  #mix
        a = row['MACSUM']
    return a
data['ACCEPT'] = data.apply (lambda row: ACCEP (row),axis=1)
a1 = data['ACCEPT'].head (n=10)
print ('acceptance')
print (a1)
#count number of valid acceptance answers for all respondents based on race
data['BACCNT'] = baccept.count(axis=1)
data['WACCNT'] = waccept.count(axis=1)
data['LACCNT'] = laccept.count(axis=1)
data['OACCNT'] = oaccept.count(axis=1)
data['MACCNT'] = maccept.count(axis=1)
def ACCNT (row):  #select the count (from above) based on respondent's race
    a = 0.0
    if row['ETH'] == 1:  #black
        a = row['BACCNT']
    if row['ETH'] == 2:  #white
        a = row['WACCNT']
    if row['ETH'] == 3:  #latino
        a = row['LACCNT']
    if row['ETH'] == 5:  #other
        a = row['OACCNT']
    if row['ETH'] == 6:  #mix
        a = row['MACCNT']
    return a
data['ACCNT'] = data.apply (lambda row: ACCNT (row),axis=1)
a2 = data['ACCNT'].head (n=10)
print ('# of valid answers to acceptance questions')
print (a2)
#calculate acceptance index: acceptance sum / valid question count
data['ACCIND'] = data['ACCEPT'] / data['ACCNT']
a3 = data['ACCIND'].head (n=10)
print ('Acceptance index')
print (a3)
descr = data['RELIND'].describe()
print (descr)
desca = data['ACCIND'].describe()
print (desca)
#Univariate histogram for religiosity:
hist1 = seaborn.histplot(data["RELIND"].dropna(), kde=False);
plt.xlabel('Index of Religiosity')
plt.title('Index of Religiosity')
print (hist1)
#Univariate histogram for acceptance:
hist2 = seaborn.histplot(data["ACCIND"].dropna(), kde=False);
plt.xlabel('Index of Racial Acceptance')
plt.title('Index of Racial Acceptance')
print (hist2)

#basic scatterplot:  Q->Q
scat1 = seaborn.regplot(x="RELIND", y="ACCIND", fit_reg=False, data=data)
plt.xlabel('Religiosity')
plt.ylabel('Acceptance')
plt.title('Scatterplot for the Association Between Religiosity and Acceptance')

'''
data['BACCEP']= baccept.sum(axis=1)   #axis=1 means the horizontal axis
b = data['BACCEP'].head (n=10)
print ()
print ("Sum of acceptance values for blacks")
print (b)
data['WACCEP']= waccept.sum(axis=1)   #axis=1 means the horizontal axis
b = data['WACCEP'].head (n=10)
print ()
print ("Sum of acceptance values for whites")
print (b)

print ('number of blacks')
a1 = blk['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of whites')
a1 = wht['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of latinos')
a1 = ltn['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of other')
a1 = oth['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('number of mix')
a1 = mix['ETH'].value_counts(sort=False, dropna=False)
print (a1)
print ('average religiosity of blacks')
b2 = blk['RELIND'].mean()
print (b2)
print ('average religiosity of whites')
b2 = wht['RELIND'].mean()
print (b2)
print ('average religiosity of latinos')
b2 = ltn['RELIND'].mean()
print (b2)
print ('average religiosity of others')
b2 = oth['RELIND'].mean()
print (b2)
print ('average religiosity of mixed')
b2 = mix['RELIND'].mean()
print (b2)

#counts and percentages (i.e. frequency distributions) for each variable
print ("counts for M1 respondent's religion, 1=religious, 0=atheist")
c1 = data['M1'].value_counts(sort=False, dropna=False)
print (c1)
print ("percentages for M1 respondent's religion, 1=religious, 0=atheist")
p1 = data['M1'].value_counts(sort=False, normalize=True)
print (p1)
print ('counts for M1A is respondent born again?, no=0, yes=2')
c2 = data['M1A'].value_counts(sort=False, dropna=False)
print (c2)
print ('percentages for M1A is respondent born again?, no=0, yes=2')
p2 = data['M1A'].value_counts(sort=False, normalize=True)
print (p2)
print ('counts for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
c3 = data['M5'].value_counts(sort=False, dropna=False)
print (c3)
print ('percentages for M5 how often does respondent attend church? 2.5=more than once/wk, 0=never')
p3 = data['M5'].value_counts(sort=False, normalize=True)
print (p3)
'''
