Last updated: 06/Aug/2022
Events are in reverse chronological order.
Find and delete specific keys from dict
Let's assume the nc dict has keys like cur_7, cur_6, cur_5, cur_4, cur_3, cur_2, cur_1, cur,
and we want to keep the key cur (plus the numbered keys up to 4)
and delete all keys whose numeric suffix is greater than 4.
nc = {'cur_7': 7, 'cur_6': 6, 'cur_5': 5, 'cur_4': 4, 'cur_3': 3, 'cur_2': 2, 'cur_1': 1, 'cur': 0}
temp = []
for k in nc:
    print(k)
    if k == 'cur':
        continue
    num = int(k[4:])
    print(num)
    if num > 4:
        temp.append(k)
print(temp)
for k in temp:
    nc.pop(k, None)
print(nc)
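A more concise alternative is a dict comprehension that rebuilds nc with only the keys we want to keep (a sketch, assuming the same key naming as above):
nc = {k: v for k, v in nc.items() if k == 'cur' or int(k[4:]) <= 4}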
Updating dataframe with another dataframe
# Column names should be the same in df1 and df2 for the update to work
df1.set_index(['sc_code'], inplace=True)
df1.update(df2.set_index(['sc_code']))
df1.reset_index(inplace=True)
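A minimal sketch of what update does, with hypothetical data (sc_code is the join key; the price values are made up):
import pandas as pd

df1 = pd.DataFrame({'sc_code': [1, 2], 'price': [10.0, 20.0]})
df2 = pd.DataFrame({'sc_code': [2], 'price': [25.0]})
df1.set_index(['sc_code'], inplace=True)
df1.update(df2.set_index(['sc_code']))  # overwrites price for sc_code 2 with 25.0
df1.reset_index(inplace=True)
print(df1)  # sc_code 1 keeps 10.0; sc_code 2 becomes 25.0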
Find null on specific columns
# Find rows which have null or NaN in any column
df[df.isnull().any(axis=1)]
cols = ['cur_4', 'cur_3', 'cur_2', 'cur_1', 'cur']
# Find rows where all of the specific columns are null
df_pivot_avg = df_pivot_avg[df_pivot_avg[cols].isnull().all(axis=1)].copy()
# Find rows where all of the specific columns are non-null
df_pivot_avg = df_pivot_avg[df_pivot_avg[cols].notnull().all(axis=1)].copy()
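A small sketch of the any/all distinction, with made-up data:
import numpy as np
import pandas as pd

demo = pd.DataFrame({'cur_1': [1.0, np.nan, np.nan], 'cur': [5.0, 6.0, np.nan]})
print(demo[demo.isnull().any(axis=1)])                    # rows 1 and 2 (at least one NaN)
print(demo[demo[['cur_1', 'cur']].isnull().all(axis=1)])  # row 2 only (all NaN)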
Convert specific column types
s = df.select_dtypes(include='object').columns
df[s] = df[s].astype("float")
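Note that astype raises a ValueError on non-numeric strings; pd.to_numeric with errors='coerce' is a more forgiving alternative (a sketch with made-up values):
import pandas as pd

s = pd.Series(['1.5', '2', 'bad'], dtype=object)
# s.astype("float") would raise ValueError on 'bad';
# pd.to_numeric coerces unparseable values to NaN instead
print(pd.to_numeric(s, errors='coerce'))  # 1.5, 2.0, NaN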
Convert dictionary entries into variables named after the keys
# Contents of input file walkPath.txt
bucket=bucket-name
pattern=*
root=./tmp
# Py Code
from pathlib import Path

if Path('walkPath.txt').is_file():  # file exists
    temp = [line.rstrip() for line in open('walkPath.txt')]
    for i in temp:
        key, val = i.split('=')
        exec(key + '=val')
        print(key, val)

# If it was an actual JSON dictionary
d = {'bucket': 'bucket-name', 'root': './tmp'}
for key, val in d.items():
    exec(key + '=val')
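Note that exec assignments like this are only visible at module scope. A safer alternative that avoids exec is to write the pairs straight into globals() (same hypothetical dict as above):
d = {'bucket': 'bucket-name', 'root': './tmp'}
globals().update(d)
print(bucket, root)  # bucket-name ./tmp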
Getting time differences for comparisons
import datetime
import time

t1 = datetime.datetime.now()
print("Local Time: ", t1)
time.sleep(5)
t2 = datetime.datetime.now()
print("Local Time: ", t2)
print("Time taken: ", t2 - t1)
print((t2 - t1).total_seconds())
minutes_diff = (t2 - t1).total_seconds() / 60.0
print(minutes_diff)
# Output
Local Time:  2021-09-29 08:04:43.191698
Local Time:  2021-09-29 08:04:48.195868
Time taken:  0:00:05.004170
5.00417
0.08340283333333334
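If all you need is the elapsed duration, time.perf_counter() is a simpler sketch of an alternative: it returns float seconds from a monotonic clock, so it is unaffected by system clock changes:
import time

start = time.perf_counter()
time.sleep(5)
elapsed = time.perf_counter() - start  # float seconds
print(f"Time taken: {elapsed:.3f}s, {elapsed / 60.0:.5f} min")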
To list installed Python versions
(env) D:\BigData\12. Python\1. Tryouts\Beam> py -0p
Installed Pythons found by py Launcher for Windows
 (venv)     D:\BigData\12. Python\1. Tryouts\Beam\env\Scripts\python.exe *
 -3.9-64    python3.9.exe
 -3.7-64    C:\Users\Sushanth\Anaconda3\python.exe
 -3.5-32    C:\Users\Sushanth\AppData\Local\Programs\Python\Python35-32\python.exe
Virtual Environments
cd your-project
python3 -m venv env
.\env\Scripts\activate
# Install any packages
pip3 install google-cloud-storage
pip3 install --upgrade google-cloud-storage
pip3 install wheel
pip3 install apache-beam[gcp]
# Listing installed packages
pip3 list
pip3 freeze --all > requirements.txt
# When done
deactivate
# Later, in a new environment, the command below installs the packages listed in requirements.txt
pip3 install -r requirements.txt
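The activate line above is for Windows; on Linux/macOS the activation step differs (the rest of the commands are the same):
source env/bin/activate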
Drop duplicates, but ignore nulls
# Creating a sample dataframe
import numpy as np
import pandas as pd

df = pd.DataFrame()
names = ['Gandolf', 'Gandolf', 'Gimli', 'Frodo', 'Legolas', 'Bilbo', 'Aragorn', 'Arwen']
types = ['Wizard', 'Wizard', 'Dwarf', 'Hobbit', 'Elf', 'Hobbit', np.nan, np.nan]
magic = [0, 10, 1, 4, 6, 4, np.nan, np.nan]
aggression = [0, 7, 10, 2, 5, 1, np.nan, np.nan]
stealth = [0, 8, 2, 5, 10, 5, np.nan, np.nan]
df['names'], df['type'], df['magic_power'] = names, types, magic
df['aggression'], df['stealth'] = aggression, stealth
df
+----+---------+--------+-------------+------------+---------+
|    | names   | type   | magic_power | aggression | stealth |
|----+---------+--------+-------------+------------+---------|
|  0 | Gandolf | Wizard |           0 |          0 |       0 |
|  1 | Gandolf | Wizard |          10 |          7 |       8 |
|  2 | Gimli   | Dwarf  |           1 |         10 |       2 |
|  3 | Frodo   | Hobbit |           4 |          2 |       5 |
|  4 | Legolas | Elf    |           6 |          5 |      10 |
|  5 | Bilbo   | Hobbit |           4 |          1 |       5 |
|  6 | Aragorn | nan    |         nan |        nan |     nan |
|  7 | Arwen   | nan    |         nan |        nan |     nan |
+----+---------+--------+-------------+------------+---------+
Here we have two sets of real duplicates in the type column (Wizard and Hobbit), and with keep='last' we keep the last occurrence of each and eliminate the earlier ones. Notice, though, that the two NaN rows are also treated as duplicates of each other, so Aragorn gets dropped too:
df1 = df[~df['type'].duplicated(keep='last')]
+----+---------+--------+-------------+------------+---------+
|    | names   | type   | magic_power | aggression | stealth |
|----+---------+--------+-------------+------------+---------|
|  1 | Gandolf | Wizard |          10 |          7 |       8 |
|  2 | Gimli   | Dwarf  |           1 |         10 |       2 |
|  4 | Legolas | Elf    |           6 |          5 |      10 |
|  5 | Bilbo   | Hobbit |           4 |          1 |       5 |
|  7 | Arwen   | nan    |         nan |        nan |     nan |
+----+---------+--------+-------------+------------+---------+
The general idea is that NaNs are supposed to be incomparable, since NaN basically means 'no value'. Comparing missing values and keeping only the last one does not make sense, so we add an isna() clause that keeps every null row while still deduplicating the rest:
df1 = df[(~df['type'].duplicated(keep='last')) | df['type'].isna()]
+----+---------+--------+-------------+------------+---------+
|    | names   | type   | magic_power | aggression | stealth |
|----+---------+--------+-------------+------------+---------|
|  1 | Gandolf | Wizard |          10 |          7 |       8 |
|  2 | Gimli   | Dwarf  |           1 |         10 |       2 |
|  4 | Legolas | Elf    |           6 |          5 |      10 |
|  5 | Bilbo   | Hobbit |           4 |          1 |       5 |
|  6 | Aragorn | nan    |         nan |        nan |     nan |
|  7 | Arwen   | nan    |         nan |        nan |     nan |
+----+---------+--------+-------------+------------+---------+
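For reference, the built-in drop_duplicates behaves like the first command here, i.e. it also collapses the NaN rows into one, which is why the explicit mask with isna() is needed (a sketch of the equivalent call):
df1 = df.drop_duplicates(subset=['type'], keep='last')  # drops Aragorn too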
Convert two arrays to a dictionary
keys = ['a', 'b', 'c']
values = [1, 2, 3]
dictionary = dict(zip(keys, values))
print(dictionary)
# output (dicts preserve insertion order on Python 3.7+)
{'a': 1, 'b': 2, 'c': 3}
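One caveat worth knowing: zip stops at the shorter input, so mismatched lengths silently drop entries; on Python 3.10+ you can pass strict=True to get an error instead (a small sketch):
keys = ['a', 'b', 'c']
values = [1, 2]
print(dict(zip(keys, values)))          # {'a': 1, 'b': 2} - 'c' silently dropped
# dict(zip(keys, values, strict=True))  # raises ValueError on Python 3.10+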
View specific row by index
df_joined.loc[[28]]   # label-based: the row whose index label is 28
df_joined.iloc[[27]]  # position-based: the 28th row (positions start at 0)
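The two only coincide while labels match positions; after filtering or sorting they diverge (a sketch with a made-up frame):
import pandas as pd

demo = pd.DataFrame({'x': [10, 20, 30]}, index=[5, 28, 99])
print(demo.loc[[28]])   # selects by the label 28 -> x = 20
print(demo.iloc[[1]])   # selects by position 1   -> also x = 20 here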
Read all columns into pandas as object
df = pd.read_csv(osfp_file, sep='|', names=all_cols, usecols=use_cols, skip_blank_lines=True, dtype=object)
Common pandas options
pd.set_option('display.max_columns', None)
pd.set_option('display.max_colwidth', 200)
# To display all rows
pd.set_option('display.max_rows', None)
# Setting limit to display max 100 rows
pd.set_option('display.max_rows', 100)
# When there are lots of columns (by default 100+), df.info() won't list all of them,
# and you get only a summary like this:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3984 entries, 0 to 3983
Columns: 114 entries, name to process
dtypes: datetime64[ns](10), float64(68), int64(1), object(35)
memory usage: 2.9+ MB
# When that happens, you can use
df.info(verbose=True, null_counts=True)  # in newer pandas, null_counts is renamed show_counts
# The option that sets that default 100 is
pd.options.display.max_info_columns
print('Max columns in display setting: ', pd.get_option('max_columns'))
Display all columns of specific datatype
If you follow a column-naming convention where the datatype can be identified from the column name itself, you can use the methods below to get those columns. Here we are trying to get the date columns.
Method 1
[col for col in df_companies.columns if 'date' in col.lower()]
Method 2
df_companies['lastraded'].dtypes
# Output:
dtype('<M8[ns]')
df_companies.dtypes[df_companies.dtypes == '<M8[ns]']
# Output:
lastraded datetime64[ns]
QresultDate datetime64[ns]
PLResultDate datetime64[ns]
BSResultDate datetime64[ns]
RResultDate datetime64[ns]
CFResultDate datetime64[ns]
YRResultDate datetime64[ns]
HiDate datetime64[ns]
LoDate datetime64[ns]
TVDate datetime64[ns]
dtype: object
Method 3
dtCols = list(df_companies.select_dtypes(include=['datetime']).columns)
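A quick check that the three methods agree, on a tiny hypothetical frame (column names are made up):
import pandas as pd

demo = pd.DataFrame({'name': ['a'], 'trade_date': pd.to_datetime(['2021-09-29'])})
print([col for col in demo.columns if 'date' in col.lower()])    # Method 1: name-based
print(demo.dtypes[demo.dtypes == '<M8[ns]'])                     # Method 2: dtype string
print(list(demo.select_dtypes(include=['datetime']).columns))    # Method 3: select_dtypes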