Thursday, November 12, 2020

My first Python program

In my previous post, I explained that I wanted to use the 2012 Outlook on Life survey to examine the intersection of religiosity and acceptance of others outside one's group. To do this, I needed multiple variables to build an index of respondents' religiosity, and I also needed to know which groups each respondent belonged to, so that I could gauge their view of those outside that group. This eliminated several variables, since I couldn't use, for example, "please rate socialists (0-100)" if I didn't know which respondents considered themselves socialists. Even then, when I wrote out all the variables I was considering, I had over forty! So, as I expected, I had to narrow my focus. Currently, I am concentrating mostly on racial acceptance, with a couple of political questions thrown in (primarily Republican vs. Democrat).

Next I set about writing my program, following the examples given in the videos. The Spyder IDE makes the process very easy (compared to pounding out C++ on Programmer's Notepad)! I did run into problems with missing data causing the program to throw errors, but with a quick search of the forums I found the option errors="coerce", which eliminated (or at least silenced) that problem: it tells pandas to turn any value it can't read as a number into NaN (missing) instead of raising an error.
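
To see what that option does, here is a small sketch using a made-up column (the values are invented for illustration and are not taken from the survey file). It assumes the blanks in the raw data arrive as strings such as a single space, which is my guess at why the conversion failed in the first place.

```python
import pandas

# Hypothetical raw column: numeric strings, a blank, and a stray text label
raw = pandas.Series(["1", "2", " ", "refused", "4"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
converted = pandas.to_numeric(raw, errors="coerce")
print(converted)

# Count how many entries were coerced to missing
print(converted.isna().sum())
```

So the errors are not so much fixed as converted into missing values, which then have to be accounted for in any later analysis.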

Our instructor gave suggestions for using a subset of the data, but I think that I will use the whole dataset, so my program does not have code for that.  Currently, my program only checks the variables I want to use, makes sure they are numeric, and then prints out the frequency distributions of three representative variables:


"""
Analysis of Outlook on Life 2012 dataset using Python
created Tuesday, November 10, 2020
@author Joel Caren
"""

import pandas
import numpy
data = pandas.read_csv('oolpds.csv', low_memory=False) # read data file into 'data'; low_memory=False reads the file in one pass so column types are inferred consistently
print (len(data)) #number of observations (rows)
print (len(data.columns)) #number of variables (columns)

#check format of variables
data['W1_M1'].dtype

#setting variables you will be working with to numeric; errors="coerce" turns values that can't be parsed (e.g. blanks) into NaN
data['W1_M1'] = pandas.to_numeric(data['W1_M1'], errors = "coerce") #which religion? none=13
data['W1_M1A'] = pandas.to_numeric(data['W1_M1A'], errors = "coerce") #born again? y=1,n=2,not asked=3,refuse=4
data['W1_M2'] = pandas.to_numeric(data['W1_M2'], errors = "coerce") #church race? B=1,W=2,L=3,A=4,other=5,don't know=6
data['W1_M4'] = pandas.to_numeric(data['W1_M4'], errors = "coerce") #allow women clergy? 1=agree--4=disagree
data['W1_M6'] = pandas.to_numeric(data['W1_M6'], errors = "coerce") #active church member? y=1
data['W1_M5'] = pandas.to_numeric(data['W1_M5'], errors = "coerce") #how often attend church? mult times/wk=1,never=6
data['W2_QL2B'] = pandas.to_numeric(data['W2_QL2B'], errors = "coerce") #gays as priests? 1=agree--4=disagree
data['W2_QL3'] = pandas.to_numeric(data['W2_QL3'], errors = "coerce") #homosex immoral? 1=agree--4=disagree
data['W2_QL4'] = pandas.to_numeric(data['W2_QL4'], errors = "coerce") #laws vs gay discrimination? y=1--n=5
data['PPETHM'] = pandas.to_numeric(data['PPETHM'], errors = "coerce") #your race? w=1,b=2,other=3,L=4,mix=5
data['W1_C1'] = pandas.to_numeric(data['W1_C1'], errors = "coerce") #pol party? (R=1,D=2,I=3,other=4)
data['W1_D11'] = pandas.to_numeric(data['W1_D11'], errors = "coerce") #rate republicans (0-100)
data['W1_D12'] = pandas.to_numeric(data['W1_D12'], errors = "coerce") #rate dems (0-100)
data['W1_E4'] = pandas.to_numeric(data['W1_E4'], errors = "coerce") #date outside race? y=1
data['W1_N1B'] = pandas.to_numeric(data['W1_N1B'], errors = "coerce") #rate native americans (0-100)
data['W1_N1C'] = pandas.to_numeric(data['W1_N1C'], errors = "coerce") #rate latinos (0-100)
data['W1_N1D'] = pandas.to_numeric(data['W1_N1D'], errors = "coerce") #rate whites (0-100)
data['W1_N1E'] = pandas.to_numeric(data['W1_N1E'], errors = "coerce") #rate blacks (0-100)
data['W1_N1F'] = pandas.to_numeric(data['W1_N1F'], errors = "coerce") #rate asians (0-100)
data['W1_O1'] = pandas.to_numeric(data['W1_O1'], errors = "coerce") #blacks should teach kids vs racism, 1=no--5=yes
data['W1_O3'] = pandas.to_numeric(data['W1_O3'], errors = "coerce") #blacks should teach kids not all whites racist, 1=no--5=yes
data['W1_QA5D'] = pandas.to_numeric(data['W1_QA5D'], errors = "coerce") #blacks should segregate, 1=agree--4=disagree
data['W1_E61_C'] = pandas.to_numeric(data['W1_E61_C'], errors = "coerce") #against biracial kids (black), 1=agree--4=disagree
data['W1_E62_C'] = pandas.to_numeric(data['W1_E62_C'], errors = "coerce") #against biracial kids (latino), 1=agree--4=disagree
data['W1_E63_C'] = pandas.to_numeric(data['W1_E63_C'], errors = "coerce") #against biracial kids (white), 1=agree--4=disagree
data['W1_E64_C'] = pandas.to_numeric(data['W1_E64_C'], errors = "coerce") #against biracial kids (asian), 1=agree--4=disagree

#counts and percentages (i.e. frequency distributions) for each variable
print ("counts for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6")
c1 = data['W1_M2'].value_counts(sort=False)
print (c1)
print ("percentages for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6")
p1 = data['W1_M2'].value_counts(sort=False, normalize=True)
print (p1)

print ('counts for W1_E4 would respondent date outside their race?, yes=1')
c2 = data['W1_E4'].value_counts(sort=False)
print (c2)
print ('percentages for W1_E4 would respondent date outside their race?, yes=1')
p2 = data['W1_E4'].value_counts(sort=False, normalize=True)
print (p2)

print ('counts for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4')
c3 = data['W1_M1A'].value_counts(sort=False)
print (c3)
print ('percentages for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4')
p3 = data['W1_M1A'].value_counts(sort=False, normalize=True)
print (p3)
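
As an aside, the long run of conversion lines above could probably be collapsed into a single step by listing the column names and applying pandas.to_numeric to all of them at once. The sketch below uses a tiny stand-in DataFrame rather than the real file; the ID_LABEL column and the sample values are hypothetical, included only to show that columns outside the list are left alone.

```python
import pandas

# Stand-in for the DataFrame returned by read_csv (values are made up)
data = pandas.DataFrame({
    "W1_M2": ["1", "2", " ", "6"],
    "W1_E4": ["1", " ", "2", "-1"],
    "ID_LABEL": ["a", "b", "c", "d"],  # hypothetical column, not converted
})

# Names of the variables to convert; the real list would hold all of them
columns_to_convert = ["W1_M2", "W1_E4"]

# apply() runs to_numeric on each listed column, passing errors="coerce" through
data[columns_to_convert] = data[columns_to_convert].apply(
    pandas.to_numeric, errors="coerce"
)

print(data.dtypes)
```

The per-variable comments from the original listing would still be worth keeping, for example next to each name in the list.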

Running the program in the IDE, I get these frequency distributions:

2294
436

counts for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6
 5.0    108
 1.0    736
 2.0    644
 6.0    242
-1.0     66
 3.0     49
 4.0     11
Name: W1_M2, dtype: int64

percentages for W1_M2 racial makeup of respondent's church, Black=1, White=2, Latino=3, Asian=4, other=5, don't know=6
 5.0    0.058190
 1.0    0.396552
 2.0    0.346983
 6.0    0.130388
-1.0    0.035560
 3.0    0.026401
 4.0    0.005927
Name: W1_M2, dtype: float64

counts for W1_E4 would respondent date outside their race?, yes=1
 2.0    603
 1.0    671
-1.0     23
Name: W1_E4, dtype: int64

percentages for W1_E4 would respondent date outside their race?, yes=1
 2.0    0.464919
 1.0    0.517348
-1.0    0.017733
Name: W1_E4, dtype: float64

counts for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4
 2    1052
-1       6
 1     833
 3     403
Name: W1_M1A, dtype: int64

percentages for W1_M1A is respondent born again? yes=1, no=2, not asked=3, refused=4
 2    0.458588
-1    0.002616
 1    0.363121
 3    0.175676
Name: W1_M1A, dtype: float64

In this dataset, "-1" typically indicates "refused", which I take to mean the respondent selected "refuse to answer"; however, many people seem to have simply skipped questions. For example, almost 2300 people filled out the survey, but fewer than 1300 answered the question, "would you date outside your race?" (Skipped answers do not show up in the tables above, because value_counts leaves out missing values by default, so the percentages are out of those who answered.) I foresee some difficulty in getting an index for each category because of this, but I will work that out as I go. That said, of those 1300, about 52% (671) said they would. As to the question of being born-again, about 36%, or 833, said they were. When respondents were asked the racial makeup of their church, the highest number attended Black churches (736, or 40% of those who answered), followed by White churches (644, or 35%). This question is flawed in my opinion, because "don't know" could mean the respondent didn't attend church, or it could even mean that he or she attended a mixed church with no clear majority (yes, they exist!).
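
One possible way to make the skipped answers visible, and to decide how to treat "refused", is sketched below. The responses are invented, and folding -1 into missing is only one option among several, not something the course or the codebook prescribes.

```python
import numpy
import pandas

# Made-up responses: 1 = yes, 2 = no, -1 = refused, NaN = skipped
answers = pandas.Series(
    [1, 2, 1, -1, numpy.nan, numpy.nan, 2, 1], name="W1_E4"
)

# Default behaviour: NaN rows are dropped before counting
print(answers.value_counts(sort=False))

# dropna=False keeps NaN as its own row, so the counts sum to len(answers)
with_missing = answers.value_counts(sort=False, dropna=False)
print(with_missing)

# Optionally treat "refused" as missing before computing percentages
recoded = answers.replace(-1, numpy.nan)
shares = recoded.value_counts(sort=False, normalize=True)
print(shares)
```

With dropna=False the number of skipped answers appears in the table itself, which makes it easier to see how many respondents each percentage is actually based on.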

Thanks for reading my post, and I welcome your suggestions.

