import moduleName1
import moduleName2
Data Management
1 Import Modules
The default way to import a module in Python
is:
For example:
import math
import random
import collections
import numpy as np
import pandas as pd
import matplotlib.pyplot as pp
2 Import Data
Importing data using pandas
takes syntax of the following form for .csv
files:
= pd.read_csv("filepath/filename.csv") # uses the pandas module data
Below, I import a .csv
file and save it into an object called mydata
(you could call this object whatever you want):
= pd.read_csv("https://osf.io/s6wrm/download") # uses the pandas module mydata
3 Save Data
Saving data in Python takes syntax of the following form for .csv
files:
object.to_csv("filepath/filename.csv", index = False)
For example:
"mydata.csv", index = False) mydata.to_csv(
4 Set a Seed
Set a seed (any number) to reproduce the results of analyses that involve random number generation.
52242) # uses the random module random.seed(
5 Run a Python
Script
To run a Python
script, use the following syntax:
%run "filepath/filename.py"
6 Render a Quarto (.qmd
) File
To render a Quarto (.qmd
) file, you would typically use the command line. Here is the equivalent command in a Python
cell using the !
operator to run shell commands:
!quarto render "filepath/filename.qmd"
7 Variable Names
To look at the names of variables in a dataframe, use the following syntax:
list(mydata.columns)
['survived', 'pclass', 'sex', 'age', 'sibsp', 'parch', 'prediction']
8 Logical Operators
Logical operators evaluate a condition for each value and yield values of True
and False
, corresponding to whether the evaluation for a given value met the condition.
8.1 Is Equal To: ==
'survived'] == 1 mydata[
0 True
1 True
2 False
3 False
4 False
...
1041 False
1042 False
1043 False
1044 False
1045 False
Name: survived, Length: 1046, dtype: bool
8.2 Is Not Equal To: !=
'survived'] != 1 mydata[
0 False
1 False
2 True
3 True
4 True
...
1041 True
1042 True
1043 True
1044 True
1045 True
Name: survived, Length: 1046, dtype: bool
8.2.1 Greater Than: >
'parch'] > 1 mydata[
0 False
1 True
2 True
3 True
4 True
...
1041 False
1042 False
1043 False
1044 False
1045 False
Name: parch, Length: 1046, dtype: bool
8.3 Less Than: <
'parch'] < 1 mydata[
0 True
1 False
2 False
3 False
4 False
...
1041 True
1042 True
1043 True
1044 True
1045 True
Name: parch, Length: 1046, dtype: bool
8.4 Greater Than or Equal To: >=
'parch'] >= 1 mydata[
0 False
1 True
2 True
3 True
4 True
...
1041 False
1042 False
1043 False
1044 False
1045 False
Name: parch, Length: 1046, dtype: bool
8.5 Less Than or Equal To: <=
'parch'] <= 1 mydata[
0 True
1 False
2 False
3 False
4 False
...
1041 True
1042 True
1043 True
1044 True
1045 True
Name: parch, Length: 1046, dtype: bool
8.6 Is in a Value of Another Vector: isin
= [0,1]
anotherVector 'parch'].isin(anotherVector) mydata[
0 True
1 False
2 False
3 False
4 False
...
1041 True
1042 True
1043 True
1044 True
1045 True
Name: parch, Length: 1046, dtype: bool
8.7 Is Not in a Value of Another Vector
In Python, you can use the ~
operator in combination with the isin
method to check if values are not in another sequence.
~mydata['parch'].isin(anotherVector)
0 False
1 True
2 True
3 True
4 True
...
1041 False
1042 False
1043 False
1044 False
1045 False
Name: parch, Length: 1046, dtype: bool
8.8 Is Missing: isnull()
'prediction'].isnull() mydata[
0 False
1 False
2 False
3 False
4 False
...
1041 False
1042 False
1043 False
1044 False
1045 False
Name: prediction, Length: 1046, dtype: bool
8.9 Is Not Missing: notnull()
'prediction'].notnull() mydata[
0 True
1 True
2 True
3 True
4 True
...
1041 True
1042 True
1043 True
1044 True
1045 True
Name: prediction, Length: 1046, dtype: bool
8.10 And: &
'prediction'].notnull() & (mydata['parch'] >= 1) mydata[
0 False
1 True
2 True
3 True
4 True
...
1041 False
1042 False
1043 False
1044 False
1045 False
Length: 1046, dtype: bool
8.11 Or: |
'prediction'].isnull() | (mydata['parch'] >= 1) mydata[
0 False
1 True
2 True
3 True
4 True
...
1041 False
1042 False
1043 False
1044 False
1045 False
Length: 1046, dtype: bool
9 Subset
To subset a dataframe, you can use the loc
and iloc
accessors, or directly access the columns by their names.
dataframe.loc[rowsToKeep, columnsToKeep] dataframe.iloc[rowIndices, columnIndices]
You can subset by using any of the following:
- numeric indices of the rows/columns to keep (or drop)
- names of the rows/columns to keep (or drop)
- boolean arrays corresponding to which rows/columns to keep
9.1 One Variable
To subset one variable, use the following syntax:
'age'] mydata[
0 29.00
1 0.92
2 2.00
3 30.00
4 25.00
...
1041 45.50
1042 14.50
1043 26.50
1044 27.00
1045 29.00
Name: age, Length: 1046, dtype: float64
9.2 Particular Rows of One Variable
To subset one variable, use the following syntax:
'survived'] == 1, 'age'] mydata.loc[mydata[
0 29.00
1 0.92
5 48.00
6 63.00
8 53.00
...
1010 63.00
1024 22.00
1030 38.00
1034 47.00
1040 15.00
Name: age, Length: 427, dtype: float64
9.3 Particular Columns (Variables)
To subset particular columns/variables, use the following syntax:
= ["survived", "age", "prediction"]
subsetVars
mydata[subsetVars]
survived | age | prediction | |
---|---|---|---|
0 | 1 | 29.00 | 0.944861 |
1 | 1 | 0.92 | 0.783519 |
2 | 0 | 2.00 | 0.978557 |
3 | 0 | 30.00 | 0.516082 |
4 | 0 | 25.00 | 0.945537 |
... | ... | ... | ... |
1041 | 0 | 45.50 | 0.058135 |
1042 | 0 | 14.50 | 0.655366 |
1043 | 0 | 26.50 | 0.115592 |
1044 | 0 | 27.00 | 0.113589 |
1045 | 0 | 29.00 | 0.105877 |
1046 rows × 3 columns
Or, to drop columns:
= ["sibsp", "parch"]
dropVars
= dropVars) mydata.drop(columns
survived | pclass | sex | age | prediction | |
---|---|---|---|---|---|
0 | 1 | 1 | female | 29.00 | 0.944861 |
1 | 1 | 1 | male | 0.92 | 0.783519 |
2 | 0 | 1 | female | 2.00 | 0.978557 |
3 | 0 | 1 | male | 30.00 | 0.516082 |
4 | 0 | 1 | female | 25.00 | 0.945537 |
... | ... | ... | ... | ... | ... |
1041 | 0 | 3 | male | 45.50 | 0.058135 |
1042 | 0 | 3 | female | 14.50 | 0.655366 |
1043 | 0 | 3 | male | 26.50 | 0.115592 |
1044 | 0 | 3 | male | 27.00 | 0.113589 |
1045 | 0 | 3 | male | 29.00 | 0.105877 |
1046 rows × 5 columns
9.4 Particular Rows
To subset particular rows, you can use the iloc
accessor or boolean indexing.
= [0, 2, 4] # Python uses 0-based indexing
subsetRows
mydata.iloc[subsetRows]'survived'] == 1] mydata[mydata[
survived | pclass | sex | age | sibsp | parch | prediction | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | female | 29.00 | 0 | 0 | 0.944861 |
1 | 1 | 1 | male | 0.92 | 1 | 2 | 0.783519 |
5 | 1 | 1 | male | 48.00 | 0 | 0 | 0.368893 |
6 | 1 | 1 | female | 63.00 | 1 | 0 | 0.739512 |
8 | 1 | 1 | female | 53.00 | 2 | 0 | 0.749240 |
... | ... | ... | ... | ... | ... | ... | ... |
1010 | 1 | 3 | female | 63.00 | 0 | 0 | 0.271834 |
1024 | 1 | 3 | male | 22.00 | 0 | 0 | 0.134332 |
1030 | 1 | 3 | female | 38.00 | 0 | 0 | 0.511020 |
1034 | 1 | 3 | female | 47.00 | 1 | 0 | 0.335634 |
1040 | 1 | 3 | female | 15.00 | 1 | 0 | 0.650893 |
427 rows × 7 columns
9.5 Particular Rows and Columns
To subset particular rows and columns, you can use the iloc
accessor or boolean indexing.
= [0, 2, 4] # Python uses 0-based indexing
subsetRows = ["survived", "age", "prediction"]
subsetVars
mydata.iloc[subsetRows][subsetVars]'survived'] == 1, subsetVars] mydata.loc[mydata[
survived | age | prediction | |
---|---|---|---|
0 | 1 | 29.00 | 0.944861 |
1 | 1 | 0.92 | 0.783519 |
5 | 1 | 48.00 | 0.368893 |
6 | 1 | 63.00 | 0.739512 |
8 | 1 | 53.00 | 0.749240 |
... | ... | ... | ... |
1010 | 1 | 63.00 | 0.271834 |
1024 | 1 | 22.00 | 0.134332 |
1030 | 1 | 38.00 | 0.511020 |
1034 | 1 | 47.00 | 0.335634 |
1040 | 1 | 15.00 | 0.650893 |
427 rows × 3 columns
10 View Data
10.1 All Data
To view data in Python, you can simply print the dataframe:
print(mydata)
survived pclass sex age sibsp parch prediction
0 1 1 female 29.00 0 0 0.944861
1 1 1 male 0.92 1 2 0.783519
2 0 1 female 2.00 1 2 0.978557
3 0 1 male 30.00 1 2 0.516082
4 0 1 female 25.00 1 2 0.945537
... ... ... ... ... ... ... ...
1041 0 3 male 45.50 0 0 0.058135
1042 0 3 female 14.50 1 0 0.655366
1043 0 3 male 26.50 0 0 0.115592
1044 0 3 male 27.00 0 0 0.113589
1045 0 3 male 29.00 0 0 0.105877
[1046 rows x 7 columns]
Or, if you’re using a Jupyter
notebook, you can just write the name of the dataframe:
mydata
survived | pclass | sex | age | sibsp | parch | prediction | |
---|---|---|---|---|---|---|---|
0 | 1 | 1 | female | 29.00 | 0 | 0 | 0.944861 |
1 | 1 | 1 | male | 0.92 | 1 | 2 | 0.783519 |
2 | 0 | 1 | female | 2.00 | 1 | 2 | 0.978557 |
3 | 0 | 1 | male | 30.00 | 1 | 2 | 0.516082 |
4 | 0 | 1 | female | 25.00 | 1 | 2 | 0.945537 |
... | ... | ... | ... | ... | ... | ... | ... |
1041 | 0 | 3 | male | 45.50 | 0 | 0 | 0.058135 |
1042 | 0 | 3 | female | 14.50 | 1 | 0 | 0.655366 |
1043 | 0 | 3 | male | 26.50 | 0 | 0 | 0.115592 |
1044 | 0 | 3 | male | 27.00 | 0 | 0 | 0.113589 |
1045 | 0 | 3 | male | 29.00 | 0 | 0 | 0.105877 |
1046 rows × 7 columns
10.2 First 6 Rows/Elements
To view only the first six rows of a dataframe or elements of a series, use the following syntax:
mydata.head()'age'].head() mydata[
0 29.00
1 0.92
2 2.00
3 30.00
4 25.00
Name: age, dtype: float64
11 Data Characteristics
11.1 Data Structure
print(mydata.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1046 entries, 0 to 1045
Data columns (total 7 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 survived 1046 non-null int64
1 pclass 1046 non-null int64
2 sex 1046 non-null object
3 age 1046 non-null float64
4 sibsp 1046 non-null int64
5 parch 1046 non-null int64
6 prediction 1046 non-null float64
dtypes: float64(2), int64(4), object(1)
memory usage: 57.3+ KB
None
11.2 Data Dimensions
Number of rows and columns:
print(mydata.shape)
(1046, 7)
11.3 Number of Elements
print(len(mydata['age']))
1046
11.4 Number of Missing Elements
print(mydata['age'].isnull().sum())
0
11.5 Number of Non-Missing Elements
print(mydata['age'].notnull().sum())
1046
12 Create New Variables
To create a new variable, you can directly assign a value to a new column in the dataframe.
'newVar'] = None mydata[
Here is an example of creating a new variable:
'ID'] = range(1, len(mydata) + 1) mydata[
13 Create a Dataframe
Here is an example of creating a dataframe:
= pd.DataFrame({ # uses pandas module
mydata2 'ID': list(range(1, 6)) + list(range(1047, 1052)),
'cat': np.random.choice([0, 1], 10) # uses numpy module
})
mydata2
ID | cat | |
---|---|---|
0 | 1 | 0 |
1 | 2 | 1 |
2 | 3 | 0 |
3 | 4 | 0 |
4 | 5 | 0 |
5 | 1047 | 0 |
6 | 1048 | 0 |
7 | 1049 | 0 |
8 | 1050 | 0 |
9 | 1051 | 1 |
14 Recode Variables
Here is an example of recoding a variable:
'sex'] == "male", 'oldVar1'] = 0
mydata.loc[mydata['sex'] == "female", 'oldVar1'] = 1
mydata.loc[mydata[
'sex'] == "male", 'oldVar2'] = 1
mydata.loc[mydata['sex'] == "female", 'oldVar2'] = 0 mydata.loc[mydata[
Recode multiple variables:
= ['survived', 'pclass']
columns_to_recode
for col in columns_to_recode:
= mydata[col].map({1: 'Yes', 0: 'No'})
mydata[col]
for col in columns_to_recode:
= mydata[col].map(lambda x: 1 if x in [0, 1] else 2) mydata[col]
15 Rename Variables
= mydata.rename(columns = {
mydata 'oldVar1': 'newVar1',
'oldVar2': 'newVar2'
})
Using a dictionary of variable names:
= ["oldVar1","oldVar2"]
varNamesFrom = ["newVar1","newVar2"]
varNamesTo
= dict(zip(varNamesFrom, varNamesTo))
rename_dict
= mydata.rename(columns = rename_dict) mydata
16 Convert the Types of Variables
One variable:
'factorVar'] = mydata['sex'].astype('category')
mydata['numericVar'] = mydata['prediction'].astype(float)
mydata['integerVar'] = mydata['parch'].astype(int)
mydata['characterVar'] = mydata['sex'].astype(str) mydata[
Multiple variables:
'age', 'parch', 'prediction']] = mydata[['age', 'parch', 'prediction']].astype(float)
mydata[[
'age':'parch'] = mydata.loc[:, 'age':'parch'].astype(float)
mydata.loc[:,
# Convert all categorical columns to string
for col in mydata.select_dtypes('category').columns:
= mydata[col].astype(str) mydata[col]
17 Merging/Joins
17.1 Overview
Merging (also called joining) merges two data objects using a shared set of variables called “keys.” The keys are the variable(s) that uniquely identify each row (i.e., they account for the levels of nesting). In some data objects, the key might be the participant’s ID (e.g., participantID
). However, some data objects have multiple keys. For instance, in long form data objects, each participant may have multiple rows corresponding to multiple timepoints. In this case, the keys are participantID
and timepoint
. If a participant has multiple rows corresponding to timepoints and measures, the keys are participantID
, timepoint
, and measure
. In general, each row should have a value on each of the keys; there should be no missingness in the keys.
To merge two objects, the keys must be present in both objects. The keys are used to merge the variables in object 1 (x
) with the variables in object 2 (y
). Different merge types select different rows to merge.
Note: if the two objects include variables with the same name (apart from the keys), Python will not know how you want each to appear in the merged object. So, it will add a suffix (e.g., _x
, _y
) to each common variable to indicate which object (i.e., object x
or object y
) the variable came from, where object x
is the first object—i.e., the object to which object y
(the second object) is merged. In general, apart from the keys, you should not include variables with the same name in two objects to be merged. To prevent this, either remove or rename the shared variable in one of the objects, or include the shared variable as a key. However, as described above, you should include it as a key only if it uniquely identifies each row in terms of levels of nesting.
17.2 Data Before Merging
Here are the data in the mydata
object:
print(mydata)
print(mydata.shape)
survived pclass sex age sibsp parch prediction newVar ID \
0 2 2 female 29.00 0 0.0 0.944861 None 1
1 2 2 male 0.92 1 2.0 0.783519 None 2
2 2 2 female 2.00 1 2.0 0.978557 None 3
3 2 2 male 30.00 1 2.0 0.516082 None 4
4 2 2 female 25.00 1 2.0 0.945537 None 5
... ... ... ... ... ... ... ... ... ...
1041 2 2 male 45.50 0 0.0 0.058135 None 1042
1042 2 2 female 14.50 1 0.0 0.655366 None 1043
1043 2 2 male 26.50 0 0.0 0.115592 None 1044
1044 2 2 male 27.00 0 0.0 0.113589 None 1045
1045 2 2 male 29.00 0 0.0 0.105877 None 1046
newVar1 newVar2 factorVar numericVar integerVar characterVar
0 1.0 0.0 female 0.944861 0 female
1 0.0 1.0 male 0.783519 2 male
2 1.0 0.0 female 0.978557 2 female
3 0.0 1.0 male 0.516082 2 male
4 1.0 0.0 female 0.945537 2 female
... ... ... ... ... ... ...
1041 0.0 1.0 male 0.058135 0 male
1042 1.0 0.0 female 0.655366 0 female
1043 0.0 1.0 male 0.115592 0 male
1044 0.0 1.0 male 0.113589 0 male
1045 0.0 1.0 male 0.105877 0 male
[1046 rows x 15 columns]
(1046, 15)
Here are the data in the mydata2
object:
print(mydata2)
print(mydata2.shape)
ID cat
0 1 0
1 2 1
2 3 0
3 4 0
4 5 0
5 1047 0
6 1048 0
7 1049 0
8 1050 0
9 1051 1
(10, 2)
17.3 Types of Joins
17.3.1 Visual Overview of Join Types
Below is a visual that depicts various types of merges/joins. Object x
is the circle labeled as A
. Object y
is the circle labeled as B
. The area of overlap in the Venn diagram indicates the rows on the keys that are shared between the two objects (e.g., participantID
values 1, 2, and 3). The non-overlapping area indicates the rows on the keys that are unique to each object (e.g., participantID
values 4, 5, and 6 in Object x
and values 7, 8, and 9 in Object y
). The shaded yellow area indicates which rows (on the keys) are kept in the merged object from each of the two objects, when using each of the merge types. For instance, a left outer join keeps the shared rows and the rows that are unique to object x
, but it drops the rows that are unique to object y
.
Image source: Predictive Hacks (archived at: https://perma.cc/WV7U-BS68)
17.3.2 Full Outer Join
A full outer join includes all rows in \(x\) or \(y\). It returns columns from \(x\) and \(y\). Here is how to merge two data frames using a full outer join (i.e., “full join”):
= pd.merge(mydata, mydata2, on = "ID", how = "outer")
fullJoinData
print(fullJoinData)
print(fullJoinData.shape)
survived pclass sex age sibsp parch prediction newVar ID \
0 2.0 2.0 female 29.00 0.0 0.0 0.944861 None 1
1 2.0 2.0 male 0.92 1.0 2.0 0.783519 None 2
2 2.0 2.0 female 2.00 1.0 2.0 0.978557 None 3
3 2.0 2.0 male 30.00 1.0 2.0 0.516082 None 4
4 2.0 2.0 female 25.00 1.0 2.0 0.945537 None 5
... ... ... ... ... ... ... ... ... ...
1046 NaN NaN NaN NaN NaN NaN NaN NaN 1047
1047 NaN NaN NaN NaN NaN NaN NaN NaN 1048
1048 NaN NaN NaN NaN NaN NaN NaN NaN 1049
1049 NaN NaN NaN NaN NaN NaN NaN NaN 1050
1050 NaN NaN NaN NaN NaN NaN NaN NaN 1051
newVar1 newVar2 factorVar numericVar integerVar characterVar cat
0 1.0 0.0 female 0.944861 0.0 female 0.0
1 0.0 1.0 male 0.783519 2.0 male 1.0
2 1.0 0.0 female 0.978557 2.0 female 0.0
3 0.0 1.0 male 0.516082 2.0 male 0.0
4 1.0 0.0 female 0.945537 2.0 female 0.0
... ... ... ... ... ... ... ...
1046 NaN NaN NaN NaN NaN NaN 0.0
1047 NaN NaN NaN NaN NaN NaN 0.0
1048 NaN NaN NaN NaN NaN NaN 0.0
1049 NaN NaN NaN NaN NaN NaN 0.0
1050 NaN NaN NaN NaN NaN NaN 1.0
[1051 rows x 16 columns]
(1051, 16)
17.3.3 Left Outer Join
A left outer join includes all rows in \(x\). It returns columns from \(x\) and \(y\). Here is how to merge two data frames using a left outer join (“left join”):
= pd.merge(mydata, mydata2, on = "ID", how = "left")
leftJoinData
print(leftJoinData)
print(leftJoinData.shape)
survived pclass sex age sibsp parch prediction newVar ID \
0 2 2 female 29.00 0 0.0 0.944861 None 1
1 2 2 male 0.92 1 2.0 0.783519 None 2
2 2 2 female 2.00 1 2.0 0.978557 None 3
3 2 2 male 30.00 1 2.0 0.516082 None 4
4 2 2 female 25.00 1 2.0 0.945537 None 5
... ... ... ... ... ... ... ... ... ...
1041 2 2 male 45.50 0 0.0 0.058135 None 1042
1042 2 2 female 14.50 1 0.0 0.655366 None 1043
1043 2 2 male 26.50 0 0.0 0.115592 None 1044
1044 2 2 male 27.00 0 0.0 0.113589 None 1045
1045 2 2 male 29.00 0 0.0 0.105877 None 1046
newVar1 newVar2 factorVar numericVar integerVar characterVar cat
0 1.0 0.0 female 0.944861 0 female 0.0
1 0.0 1.0 male 0.783519 2 male 1.0
2 1.0 0.0 female 0.978557 2 female 0.0
3 0.0 1.0 male 0.516082 2 male 0.0
4 1.0 0.0 female 0.945537 2 female 0.0
... ... ... ... ... ... ... ...
1041 0.0 1.0 male 0.058135 0 male NaN
1042 1.0 0.0 female 0.655366 0 female NaN
1043 0.0 1.0 male 0.115592 0 male NaN
1044 0.0 1.0 male 0.113589 0 male NaN
1045 0.0 1.0 male 0.105877 0 male NaN
[1046 rows x 16 columns]
(1046, 16)
17.3.4 Right Outer Join
A right outer join includes all rows in \(y\). It returns columns from \(x\) and \(y\). Here is how to merge two data frames using a right outer join (“right join”):
= pd.merge(mydata, mydata2, on = "ID", how = "right")
rightJoinData
print(rightJoinData)
print(rightJoinData.shape)
survived pclass sex age sibsp parch prediction newVar ID \
0 2.0 2.0 female 29.00 0.0 0.0 0.944861 None 1
1 2.0 2.0 male 0.92 1.0 2.0 0.783519 None 2
2 2.0 2.0 female 2.00 1.0 2.0 0.978557 None 3
3 2.0 2.0 male 30.00 1.0 2.0 0.516082 None 4
4 2.0 2.0 female 25.00 1.0 2.0 0.945537 None 5
5 NaN NaN NaN NaN NaN NaN NaN NaN 1047
6 NaN NaN NaN NaN NaN NaN NaN NaN 1048
7 NaN NaN NaN NaN NaN NaN NaN NaN 1049
8 NaN NaN NaN NaN NaN NaN NaN NaN 1050
9 NaN NaN NaN NaN NaN NaN NaN NaN 1051
newVar1 newVar2 factorVar numericVar integerVar characterVar cat
0 1.0 0.0 female 0.944861 0.0 female 0
1 0.0 1.0 male 0.783519 2.0 male 1
2 1.0 0.0 female 0.978557 2.0 female 0
3 0.0 1.0 male 0.516082 2.0 male 0
4 1.0 0.0 female 0.945537 2.0 female 0
5 NaN NaN NaN NaN NaN NaN 0
6 NaN NaN NaN NaN NaN NaN 0
7 NaN NaN NaN NaN NaN NaN 0
8 NaN NaN NaN NaN NaN NaN 0
9 NaN NaN NaN NaN NaN NaN 1
(10, 16)
17.3.5 Inner Join
An inner join includes all rows that are in both \(x\) and \(y\). An inner join will return one row of \(x\) for each matching row of \(y\), and can duplicate values of records on either side (left or right) if \(x\) and \(y\) have more than one matching record. It returns columns from \(x\) and \(y\). Here is how to merge two data frames using an inner join:
= pd.merge(mydata, mydata2, on = "ID", how = "inner")
innerJoinData
print(innerJoinData)
print(innerJoinData.shape)
survived pclass sex age sibsp parch prediction newVar ID \
0 2 2 female 29.00 0 0.0 0.944861 None 1
1 2 2 male 0.92 1 2.0 0.783519 None 2
2 2 2 female 2.00 1 2.0 0.978557 None 3
3 2 2 male 30.00 1 2.0 0.516082 None 4
4 2 2 female 25.00 1 2.0 0.945537 None 5
newVar1 newVar2 factorVar numericVar integerVar characterVar cat
0 1.0 0.0 female 0.944861 0 female 0
1 0.0 1.0 male 0.783519 2 male 1
2 1.0 0.0 female 0.978557 2 female 0
3 0.0 1.0 male 0.516082 2 male 0
4 1.0 0.0 female 0.945537 2 female 0
(5, 16)
17.3.6 Cross Join
A cross join combines each row in \(x\) with each row in \(y\).
= pd.DataFrame({'rater': ["Mother","Father","Teacher"]})
rater = pd.DataFrame({'timepoint': range(1, 4)})
timepoint
= rater.assign(key = 1).merge(timepoint.assign(key = 1), on = 'key').drop('key', axis = 1)
crossJoinData
print(crossJoinData)
print(crossJoinData.shape)
rater timepoint
0 Mother 1
1 Mother 2
2 Mother 3
3 Father 1
4 Father 2
5 Father 3
6 Teacher 1
7 Teacher 2
8 Teacher 3
(9, 2)
18 Long to Wide
import seaborn as sns
# Load the iris dataset
= sns.load_dataset('iris')
iris
# Melt the dataset to a long format
= iris.melt(
iris_long = 'species',
id_vars = 'measurement',
var_name = 'value')
value_name
print(iris_long)
# Pivot the dataset to a wide format
= iris_long.pivot_table(
iris_wide = 'species',
index = 'measurement',
columns = 'value')
values
print(iris_wide)
species measurement value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0
.. ... ... ...
595 virginica petal_width 2.3
596 virginica petal_width 1.9
597 virginica petal_width 2.0
598 virginica petal_width 2.3
599 virginica petal_width 1.8
[600 rows x 3 columns]
measurement petal_length petal_width sepal_length sepal_width
species
setosa 1.462 0.246 5.006 3.428
versicolor 4.260 1.326 5.936 2.770
virginica 5.552 2.026 6.588 2.974
19 Wide to Long
Original data:
import seaborn as sns
# Load the iris dataset
= sns.load_dataset('iris')
iris
print(iris)
sepal_length sepal_width petal_length petal_width species
0 5.1 3.5 1.4 0.2 setosa
1 4.9 3.0 1.4 0.2 setosa
2 4.7 3.2 1.3 0.2 setosa
3 4.6 3.1 1.5 0.2 setosa
4 5.0 3.6 1.4 0.2 setosa
.. ... ... ... ... ...
145 6.7 3.0 5.2 2.3 virginica
146 6.3 2.5 5.0 1.9 virginica
147 6.5 3.0 5.2 2.0 virginica
148 6.2 3.4 5.4 2.3 virginica
149 5.9 3.0 5.1 1.8 virginica
[150 rows x 5 columns]
Data in long form, transformed from wide form using pandas
:
= iris.melt(
iris_long = 'species',
id_vars = 'measurement',
var_name = 'value')
value_name
print(iris_long)
species measurement value
0 setosa sepal_length 5.1
1 setosa sepal_length 4.9
2 setosa sepal_length 4.7
3 setosa sepal_length 4.6
4 setosa sepal_length 5.0
.. ... ... ...
595 virginica petal_width 2.3
596 virginica petal_width 1.9
597 virginica petal_width 2.0
598 virginica petal_width 2.3
599 virginica petal_width 1.8
[600 rows x 3 columns]
20 Average Ratings Across Coders
Create data with multiple coders:
# Create a dataframe with multiple coders
= pd.DataFrame(np.array(np.meshgrid(np.arange(1, 101), np.arange(1, 4), np.arange(1, 4))).T.reshape(-1,3), columns=['id', 'wave', 'coder'])
idWaveCoder
# Add positiveAffect and negativeAffect columns with random values
0)
np.random.seed('positiveAffect'] = np.random.randn(len(idWaveCoder))
idWaveCoder['negativeAffect'] = np.random.randn(len(idWaveCoder))
idWaveCoder[
# Sort the dataframe
= idWaveCoder.sort_values(['id', 'wave', 'coder'])
idWaveCoder
print(idWaveCoder)
id wave coder positiveAffect negativeAffect
0 1 1 1 1.764052 -1.461733
300 1 1 2 -1.306527 -0.896370
600 1 1 3 -1.550429 0.013240
1 1 2 1 0.400157 -0.683440
301 1 2 2 1.658131 -1.962732
.. ... ... ... ... ...
598 100 2 2 -0.930157 -1.079343
898 100 2 3 2.594425 -0.470742
299 100 3 1 0.370056 0.589704
599 100 3 2 -0.178589 -0.114226
899 100 3 3 -0.404032 0.272796
[900 rows x 5 columns]
Average data across coders:
# Group by id and wave, then calculate the mean for each group
= idWaveCoder.groupby(['id', 'wave']).mean().reset_index()
idWave
# Drop the coder column
= idWave.drop(columns=['coder'])
idWave
print(idWave)
id wave positiveAffect negativeAffect
0 1 1 -0.364301 -0.781621
1 1 2 0.825202 -0.922706
2 1 3 -0.027932 0.763808
3 2 1 0.599606 0.082882
4 2 2 0.375993 -0.962184
.. ... ... ... ...
295 99 2 1.171361 0.600051
296 99 3 -0.409619 -0.064589
297 100 1 -0.391788 -0.219040
298 100 2 0.421606 -1.187322
299 100 3 -0.070855 0.249425
[300 rows x 4 columns]
21 Session Info
import sys
print(sys.version)
3.12.3 (main, Apr 10 2024, 03:36:41) [GCC 11.4.0]