— python, jupyter, numpy — 4 min read
It's been a month since I last worked in Python; the last time was during the Udacity Data Engineering Nanodegree, and in the meantime I was fully occupied with Rexx and Db2 REST.
In this exercise, I am exploring what one can do with BSE daily data; for this analysis I will be using one year's worth of data. This is not technical analysis, just a little exercise to improve my Python skills (upskilling).
After giving it a lot of thought while looking at the data, I think I can derive the metrics below; this alone took a couple of days to work out :D
Field | Description |
---|---|
closeH | High close in the week |
closeL | Low close in the week |
volHigh | Highest volume in the week |
volAvg | Volume average |
daysTraded | Number of days traded in the week |
HSDL | Highest Single Day Loss |
HSDG | Highest Single Day Gain |
HSDLp | Highest Single Day Loss percent |
HSDGp | Highest Single Day Gain percent |
first | First close of the week |
last | Last close of the week |
wChng | Week change |
wChngp | Week change percent |
lastTrdDoW | Last traded day of week |
TI | Times increased |
volAvgWOhv | Volume average without high volume |
HVdAV | High volume / average volume (without high volume) |
CPveoHVD | Close positive on high volume day |
lastDVotWk | Last day volume |
lastDVdAV | Last day volume / average volume |
I was able to calculate all the above fields into a dataframe. The step-by-step approach is available in the notebook on GitHub.
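Most of these weekly metrics boil down to a pandas groupby. Here is a minimal sketch over toy data; the column names follow the table above, but the values and the exact aggregation are my own illustration, not the notebook's code:

```python
import pandas as pd

# Hypothetical daily data for one ticker; column names follow the article
df_daily = pd.DataFrame({
    'yearWeek': [202001, 202001, 202001, 202002, 202002],
    'close':    [100.0, 103.0, 101.0, 105.0, 110.0],
    'volumes':  [1000, 5000, 1200, 800, 900],
})

g = df_daily.groupby('yearWeek')
df_weekly = pd.DataFrame({
    'closeH':     g['close'].max(),      # high close in the week
    'closeL':     g['close'].min(),      # low close in the week
    'volHigh':    g['volumes'].max(),    # highest volume in the week
    'volAvg':     g['volumes'].mean(),   # average volume
    'daysTraded': g['close'].count(),    # number of days traded
    'first':      g['close'].first(),    # first close of the week
    'last':       g['close'].last(),     # last close of the week
})
df_weekly['wChng'] = df_weekly['last'] - df_weekly['first']
df_weekly['wChngp'] = df_weekly['last'] / df_weekly['first'] * 100 - 100
print(df_weekly)
```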
To cut things short: the first run took 1h 26min and the second run took 1h 41min 41s via pandas to complete the analysis for one year of data. I didn't expect it to take that long to execute. I didn't know where to start either, so I posted my code on Stack Overflow asking for suggestions, and one user asked:
> Did you profile to see exactly which line(s) were the bottleneck(s)? – Mad Physicist
At that point I didn't know. A bit of googling got me a few links on profiling Python, which led me to line_profiler.
I faced a bit of a problem installing line_profiler on my system (Win10), so I had to do a workaround; that's another topic, covered here.
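For context, the usual line_profiler workflow is to decorate the function you want profiled and run the script under kernprof (shown here with this article's script name; adjust to your setup):

```shell
# 1. Decorate the target function with @profile (the name is injected by kernprof)
# 2. Run the script under kernprof; -l enables line-by-line mode, -v prints the report
kernprof -l -v BTD-Analysis1V3.py
```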
Here is the output of the profiler:
```
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s

Total time: 0.043637 s
File: BTD-Analysis1V3.py
Function: weekly_trend_analysis at line 36

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    36                                           def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
    37
    38         1          3.0      3.0      0.0      if exchange == 'BSE':
    39         1        963.0    963.0      2.2          ticker = df_daily.iloc[0]['sc_code']
    40                                               else:
    41                                                   ticker = df_daily.iloc[0]['symbol']
    42
    43         1        201.0    201.0      0.5      arr_yearWeek = df_daily['yearWeek'].to_numpy()
    44         1        100.0    100.0      0.2      arr_close = df_daily['close'].to_numpy()
    45         1         87.0     87.0      0.2      arr_prevclose = df_daily['prevclose'].to_numpy()
    46         1         85.0     85.0      0.2      arr_chng = df_daily['chng'].to_numpy()
    47         1         83.0     83.0      0.2      arr_chngp = df_daily['chngp'].to_numpy()
    48         1        108.0    108.0      0.2      arr_ts = df_daily['ts'].to_numpy()
    49         1         89.0     89.0      0.2      arr_volumes = df_daily['volumes'].to_numpy()
    50
    51                                               # Close
    52         1         41.0     41.0      0.1      arr_concat = np.column_stack((arr_yearWeek, arr_close))
    53         1        241.0    241.0      0.6      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    54
    55                                               #a = df_temp[['yearWeek', 'close']].to_numpy()
    56         1        113.0    113.0      0.3      yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
    57
    58         1          4.0      4.0      0.0      cmaxs, cmins = [], []
    59         1          3.0      3.0      0.0      first, last, wChng, wChngp = [], [], [], []
    60         2         11.0      5.5      0.0      for idx,subarr in enumerate(npi_gb):
    61         1         32.0     32.0      0.1          cmaxs.append( np.amax(subarr) )
    62         1         17.0     17.0      0.0          cmins.append( np.amin(subarr) )
    63         1          2.0      2.0      0.0          first.append(subarr[0])
    64         1          2.0      2.0      0.0          last.append(subarr[-1])
    65         1          3.0      3.0      0.0          wChng.append( subarr[-1] - subarr[0] )
    66         1          6.0      6.0      0.0          wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
    67
    68                                               #npi_gb.clear()
    69         1          4.0      4.0      0.0      arr_concat = np.empty((100,100))
    70
    71                                               # Chng
    72         1         21.0     21.0      0.0      arr_concat = np.column_stack((arr_yearWeek, arr_chng))
    73         1        109.0    109.0      0.2      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    74
    75         1          2.0      2.0      0.0      HSDL, HSDG = [], []
    76         2          7.0      3.5      0.0      for idx,subarr in enumerate(npi_gb):
    77         1         12.0     12.0      0.0          HSDL.append( np.amin(subarr) )
    78         1          9.0      9.0      0.0          HSDG.append( np.amax(subarr) )
    79
    80                                               #npi_gb.clear()
    81         1          3.0      3.0      0.0      arr_concat = np.empty((100,100))
    82
    83                                               # Chngp
    84         1         15.0     15.0      0.0      arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
    85         1         86.0     86.0      0.2      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    86
    87         1          1.0      1.0      0.0      HSDLp, HSDGp = [], []
    88         2          7.0      3.5      0.0      for idx,subarr in enumerate(npi_gb):
    89         1         11.0     11.0      0.0          HSDLp.append( np.amin(subarr) )
    90         1          9.0      9.0      0.0          HSDGp.append( np.amax(subarr) )
    91
    92                                               #npi_gb.clear()
    93         1          3.0      3.0      0.0      arr_concat = np.empty((100,100))
    94
    95                                               # Last Traded Date of the Week
    96         1       3111.0   3111.0      7.1      i = df_daily[['yearWeek', 'ts']].to_numpy()
    97         1        128.0    128.0      0.3      j = npi.group_by(i[:, 0]).split(i[:, 1])
    98
    99         1          2.0      2.0      0.0      lastTrdDoW = []
   100         2          9.0      4.5      0.0      for idx,subarr in enumerate(j):
   101         1          2.0      2.0      0.0          lastTrdDoW.append( subarr[-1] )
   102
   103         1          4.0      4.0      0.0      i = np.empty((100,100))
   104                                               #j.clear()
   105
   106                                               # Times inreased
   107         1         11.0     11.0      0.0      TI = np.where(arr_close > arr_prevclose, 1, 0)
   108
   109                                               # Below npi_gb_yearWeekTI is used in volumes section
   110         1         19.0     19.0      0.0      arr_concat = np.column_stack((arr_yearWeek, TI))
   111         1        111.0    111.0      0.3      npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
   112
   113         1         73.0     73.0      0.2      tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
   114
   115                                               # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
   116         1         39.0     39.0      0.1      arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
   117         1         94.0     94.0      0.2      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
   118
   119         1          2.0      2.0      0.0      vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
   120         2          8.0      4.0      0.0      for idx,subarr in enumerate(npi_gb):
   121         1         53.0     53.0      0.1          vavgs.append( np.mean(subarr) )
   122         1          2.0      2.0      0.0          ldvotWk = subarr[-1]
   123         1          2.0      2.0      0.0          lastDVotWk.append(ldvotWk)
   124
   125                                                   #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
   126         1         13.0     13.0      0.0          ixDel = np.argmax(subarr)
   127         1          2.0      2.0      0.0          hV = subarr[ixDel]
   128         1          2.0      2.0      0.0          vmaxs.append( hV )
   129
   130         1          1.0      1.0      0.0          if(len(subarr)>1):
   131         1         53.0     53.0      0.1              subarr = np.delete(subarr, ixDel)
   132         1         29.0     29.0      0.1              vawoHV = np.mean(subarr)
   133                                                   else:
   134                                                       vawoHV = np.mean(subarr)
   135         1          2.0      2.0      0.0          volAvgWOhv.append( vawoHV )
   136         1         12.0     12.0      0.0          HVdAV.append(hV / vawoHV)
   137         1          3.0      3.0      0.0          CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
   138         1          6.0      6.0      0.0          lastDVdAV.append(ldvotWk / vawoHV)
   139
   140                                               #npi_gb.clear()
   141         1          3.0      3.0      0.0      arr_concat = np.empty((100,100))
   142
   143                                               # Preparing the dataframe
   144                                               # yearWeek and occurances
   145                                               #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
   146         1          5.0      5.0      0.0      yearWeek = yearWeek.astype(int)
   147         1         44.0     44.0      0.1      HSDL = np.round(HSDL,2)
   148         1         21.0     21.0      0.0      HSDG = np.round(HSDG,2)
   149         1         18.0     18.0      0.0      HSDLp = np.round(HSDLp,2)
   150         1         18.0     18.0      0.0      HSDGp = np.round(HSDGp,2)
   151
   152         1         17.0     17.0      0.0      first = np.round(first,2)
   153         1         17.0     17.0      0.0      last = np.round(last,2)
   154         1         17.0     17.0      0.0      wChng = np.round(wChng,2)
   155         1         16.0     16.0      0.0      wChngp = np.round(wChngp,2)
   156
   157         1          5.0      5.0      0.0      vavgs = np.array(vavgs).astype(int)
   158         1          3.0      3.0      0.0      volAvgWOhv = np.array(volAvgWOhv).astype(int)
   159         1         17.0     17.0      0.0      HVdAV = np.round(HVdAV,2)
   160
   161         1          3.0      3.0      0.0      dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
   162         1          2.0      2.0      0.0          ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
   163         1          2.0      2.0      0.0          ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
   164         1          2.0      2.0      0.0          ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
   165         1       3677.0   3677.0      8.4      df_weekly = pd.DataFrame(data=dict_temp)
   166
   167         1       1102.0   1102.0      2.5      df_weekly['sc_code'] = ticker
   168
   169         1          3.0      3.0      0.0      cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
   170         1          1.0      1.0      0.0          , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
   171         1          2.0      2.0      0.0          , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
   172
   173         1       2816.0   2816.0      6.5      df_weekly = df_weekly[cols].copy()
   174
   175                                               # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
   176         1         13.0     13.0      0.0      if df_weekly_all.shape[0] == 0:
   177         1      20473.0  20473.0     46.9          df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
   178
   179                                               # Removing all yearWeek in df_weekly2 from df_weekly
   180         1        321.0    321.0      0.7      a = set(df_weekly_all['yearWeek'])
   181         1        190.0    190.0      0.4      b = set(df_weekly['yearWeek'])
   182         1          5.0      5.0      0.0      c = list(a.difference(b))
   183                                               #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
   184         1       1538.0   1538.0      3.5      df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
   185
   186                                               # Append the latest week data to df_weekly
   187         1       6998.0   6998.0     16.0      df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
   188                                               #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
   189
   190         1          2.0      2.0      0.0      return df_weekly_all
```
Below are the sections of the code with high times:
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38         1          3.0      3.0      0.0      if exchange == 'BSE':
    39         1        963.0    963.0      2.2          ticker = df_daily.iloc[0]['sc_code']
    40                                               else:
    41                                                   ticker = df_daily.iloc[0]['symbol']

    95                                               # Last Traded Date of the Week
    96         1       3111.0   3111.0      7.1      i = df_daily[['yearWeek', 'ts']].to_numpy()
    97         1        128.0    128.0      0.3      j = npi.group_by(i[:, 0]).split(i[:, 1])
    98
    99         1          2.0      2.0      0.0      lastTrdDoW = []
   100         2          9.0      4.5      0.0      for idx,subarr in enumerate(j):
   101         1          2.0      2.0      0.0          lastTrdDoW.append( subarr[-1] )

   161         1          3.0      3.0      0.0      dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
   162         1          2.0      2.0      0.0          ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
   163         1          2.0      2.0      0.0          ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
   164         1          2.0      2.0      0.0          ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
   165         1       3677.0   3677.0      8.4      df_weekly = pd.DataFrame(data=dict_temp)
   166
   167         1       1102.0   1102.0      2.5      df_weekly['sc_code'] = ticker
   168
   169         1          3.0      3.0      0.0      cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
   170         1          1.0      1.0      0.0          , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
   171         1          2.0      2.0      0.0          , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
   172
   173         1       2816.0   2816.0      6.5      df_weekly = df_weekly[cols].copy()
   174
   175                                               # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
   176         1         13.0     13.0      0.0      if df_weekly_all.shape[0] == 0:
   177         1      20473.0  20473.0     46.9          df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
   178
   179                                               # Removing all yearWeek in df_weekly2 from df_weekly
   180         1        321.0    321.0      0.7      a = set(df_weekly_all['yearWeek'])
   181         1        190.0    190.0      0.4      b = set(df_weekly['yearWeek'])
   182         1          5.0      5.0      0.0      c = list(a.difference(b))
   183                                               #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
   184         1       1538.0   1538.0      3.5      df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
   185
   186                                               # Append the latest week data to df_weekly
   187         1       6998.0   6998.0     16.0      df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
   188                                               #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
```
So I started trying alternative approaches to make things run faster. I had been reading articles saying "NumPy is fast", "vectors are fast", and started my tests.
Here are some of the attempts to solve the above performance problems.
Getting the ticker code
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38         1          3.0      3.0      0.0      if exchange == 'BSE':
    39         1        963.0    963.0      2.2          ticker = df_daily.iloc[0]['sc_code']
    40                                               else:
    41                                                   ticker = df_daily.iloc[0]['symbol']
```
I had to settle for NumPy here:
```python
#ticker = df_temp.iloc[0]['sc_code']
#ticker = df_daily.at[0,'sc_code']
#ticker = df_daily.head(1)['sc_code'].values[0]
ticker = df_daily['sc_code'].to_numpy()[0]
```
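To sanity-check a switch like this, a quick `timeit` comparison can be run; the frame below is a toy stand-in (only the `sc_code` column name comes from the article):

```python
import timeit
import pandas as pd

# Toy frame standing in for df_daily
df_daily = pd.DataFrame({'sc_code': ['500325'] * 250})

# Time both access styles; number=1000 keeps the run short
t_iloc = timeit.timeit(lambda: df_daily.iloc[0]['sc_code'], number=1000)
t_np = timeit.timeit(lambda: df_daily['sc_code'].to_numpy()[0], number=1000)
print(f"iloc: {t_iloc:.4f}s  to_numpy: {t_np:.4f}s")
```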
~3,000 microseconds just to convert to NumPy
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    95                                               # Last Traded Date of the Week
    96         1       3111.0   3111.0      7.1      i = df_daily[['yearWeek', 'ts']].to_numpy()
    97         1        128.0    128.0      0.3      j = npi.group_by(i[:, 0]).split(i[:, 1])
    98
    99         1          2.0      2.0      0.0      lastTrdDoW = []
   100         2          9.0      4.5      0.0      for idx,subarr in enumerate(j):
   101         1          2.0      2.0      0.0          lastTrdDoW.append( subarr[-1] )
```
The pandas approach seemed a bit expensive here, so I tried to convert it to NumPy and faced a `TypeError: invalid type promotion`. This error came up because I was trying to concatenate `arr_yearWeek` (int64) with `arr_ts` (datetime64[ns]) via `arr_concat = np.column_stack((arr_yearWeek, arr_ts))`. To resolve the issue, I converted the timestamp `ts` to a string with `df_temp['ts'] = df_temp['ts'].dt.strftime('%Y-%m-%d')`, and this worked.
```python
arr_concat = np.column_stack((arr_yearWeek, arr_ts))
npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

lastTrdDoW = []
for idx,subarr in enumerate(npi_gb):
    lastTrdDoW.append( subarr[-1] )
```
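The error and the fix can be reproduced on a toy frame (a small sketch of my own; only the column names mirror the article):

```python
import numpy as np
import pandas as pd

df_temp = pd.DataFrame({
    'yearWeek': [202001, 202001, 202002],
    'ts': pd.to_datetime(['2020-01-03', '2020-01-04', '2020-01-10']),
})

arr_yearWeek = df_temp['yearWeek'].to_numpy()

# int64 and datetime64[ns] have no common dtype, so stacking them fails
promotion_failed = False
try:
    np.column_stack((arr_yearWeek, df_temp['ts'].to_numpy()))
except TypeError:
    promotion_failed = True

# After formatting the timestamps as strings, the stack succeeds
arr_ts = df_temp['ts'].dt.strftime('%Y-%m-%d').to_numpy()
arr_concat = np.column_stack((arr_yearWeek, arr_ts))
print(promotion_failed, arr_concat.dtype)  # everything coerced to one common dtype
```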
Here `npi` is `import numpy_indexed as npi`, another library. While testing I had encountered various issues, and when I checked Stack Overflow for solutions I kept seeing user Eelco Hoogendoorn suggest numpy_indexed in multiple posts (he is also the author of the library), so I had to try it out. Now most of my code uses it.
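For readers who prefer to avoid the extra dependency, the same group-and-split pattern can be sketched with plain NumPy, assuming the key column is already sorted (as consecutive daily rows per week usually are):

```python
import numpy as np

# Toy data: week keys (sorted) and the values to group
arr_yearWeek = np.array([202001, 202001, 202001, 202002, 202002])
arr_close = np.array([100.0, 103.0, 101.0, 105.0, 110.0])

# np.unique gives the first index of each key; np.split cuts at those boundaries
keys, first_idx = np.unique(arr_yearWeek, return_index=True)
groups = np.split(arr_close, first_idx[1:])

for k, subarr in zip(keys, groups):
    print(k, subarr.max(), subarr.min(), subarr[0], subarr[-1])
```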
Converting to a dataframe: it looks like creating a new dataframe is an expensive operation. When I step back and think about the flow of the program, this function is called in a loop and a new dataframe is created for each company; if it takes this long for a single company, obviously the whole program is going to be slow. So my thought was: what if I keep everything in NumPy itself, and see how it goes?
There are two things to solve here:
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   161         1          3.0      3.0      0.0      dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
   162         1          2.0      2.0      0.0          ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
   163         1          2.0      2.0      0.0          ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
   164         1          2.0      2.0      0.0          ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
   165         1       3677.0   3677.0      8.4      df_weekly = pd.DataFrame(data=dict_temp)

   169         1          3.0      3.0      0.0      cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
   170         1          1.0      1.0      0.0          , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
   171         1          2.0      2.0      0.0          , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
   172
   173         1       2816.0   2816.0      6.5      df_weekly = df_weekly[cols].copy()
```
I combined the above two problems into one solution. Since all the values are going to be the same length, `np.column_stack` helped resolve the issue:
```python
np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
                             , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
                             , CPveoHVD, lastDVotWk, lastDVdAV))
```
Filling a single value into an entire dataframe column:
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   167         1       1102.0   1102.0      2.5      df_weekly['sc_code'] = ticker
```
Resolved by filling the value in NumPy instead:
```python
ticker = np.full(yearWeek.shape[0], ticker)
```
For the last part, there are multiple high timers here:
```
Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
   175                                               # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
   176         1         13.0     13.0      0.0      if df_weekly_all.shape[0] == 0:
   177         1      20473.0  20473.0     46.9          df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
   178
   179                                               # Removing all yearWeek in df_weekly2 from df_weekly
   180         1        321.0    321.0      0.7      a = set(df_weekly_all['yearWeek'])
   181         1        190.0    190.0      0.4      b = set(df_weekly['yearWeek'])
   182         1          5.0      5.0      0.0      c = list(a.difference(b))
   183                                               #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
   184         1       1538.0   1538.0      3.5      df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
   185
   186                                               # Append the latest week data to df_weekly
   187         1       6998.0   6998.0     16.0      df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
   188                                               #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
```
The thought process here is the same as before: why use a dataframe at all? Can I do everything in NumPy? So I proceeded in that direction.
```python
# Removing latest yearWeek from df_weekly_all as it could be partial and concatenating latest one(df_weekly) to df_weekly_all
if len(np_weekly_all) > 0:
    #print(len(np_weekly_all))
    a = np_weekly_all[:,1]
    b = np_weekly[:,1]
    tf_1 = np.isin(a, b, invert=True)
    #print(tf_1)
    t_result = list(compress(range(len(tf_1)), tf_1))
    #print(t_result)
    np_weekly_all = np_weekly_all[t_result]
    np_weekly_all = np.vstack((np_weekly_all, np_weekly))
else:
    np_weekly_all = []
    np_weekly_all = np.vstack((np_weekly))
```
An interesting concept learned here is `from itertools import compress`.
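A tiny illustration of how `np.isin` and `compress` work together in the snippet above (toy week keys of my own):

```python
import numpy as np
from itertools import compress

a = np.array([202001, 202002, 202003])   # weeks already accumulated
b = np.array([202003, 202004])           # weeks in the latest batch

# True wherever a value of `a` is NOT present in `b`
mask = np.isin(a, b, invert=True)
print(mask.tolist())   # [True, True, False]

# compress() yields the indices where the mask is True
keep_rows = list(compress(range(len(mask)), mask))
print(keep_rows)       # [0, 1]
```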
After making all the above changes, the new run took only 7min 47s.
Here is the new profiler report:
```
('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s

Total time: 0.013077 s
File: BTD-Analysis1V3-lf.py
Function: weekly_trend_analysis_np at line 38

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    38                                           def weekly_trend_analysis_np(exchange, np_weekly_all, df_daily):
    39
    40         1          4.0      4.0      0.0      if exchange == 'BSE':
    41                                                   #ticker = df_daily.at[0,'sc_code']
    42                                                   #ticker = df_daily.head(1)['sc_code'].values[0]
    43         1        152.0    152.0      1.2          ticker = df_daily['sc_code'].to_numpy()[0]
    44                                               else:
    45                                                   #ticker = df_daily.at[0,'symbol']
    46                                                   #ticker = df_daily.head(1)['symbol'].values[0]
    47                                                   ticker = df_daily['symbol'].to_numpy()[0]
    48
    49         1         34.0     34.0      0.3      arr_yearWeek = df_daily['yearWeek'].to_numpy()
    50         1         34.0     34.0      0.3      arr_close = df_daily['close'].to_numpy()
    51         1         31.0     31.0      0.2      arr_prevclose = df_daily['prevclose'].to_numpy()
    52         1         29.0     29.0      0.2      arr_chng = df_daily['chng'].to_numpy()
    53         1         27.0     27.0      0.2      arr_chngp = df_daily['chngp'].to_numpy()
    54         1         28.0     28.0      0.2      arr_ts = df_daily['ts'].to_numpy()
    55         1         27.0     27.0      0.2      arr_volumes = df_daily['volumes'].to_numpy()
    56
    57                                               # Close
    58         1         72.0     72.0      0.6      arr_concat = np.column_stack((arr_yearWeek, arr_close))
    59         1        651.0    651.0      5.0      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    60
    61                                               #a = df_temp[['yearWeek', 'close']].to_numpy()
    62         1        111.0    111.0      0.8      yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
    63
    64         1          2.0      2.0      0.0      cmaxs, cmins = [], []
    65         1          1.0      1.0      0.0      first, last, wChng, wChngp = [], [], [], []
    66        54         79.0      1.5      0.6      for idx,subarr in enumerate(npi_gb):
    67        53        465.0      8.8      3.6          cmaxs.append( np.amax(subarr) )
    68        53        423.0      8.0      3.2          cmins.append( np.amin(subarr) )
    69        53         86.0      1.6      0.7          first.append(subarr[0])
    70        53         75.0      1.4      0.6          last.append(subarr[-1])
    71        53        103.0      1.9      0.8          wChng.append( subarr[-1] - subarr[0] )
    72        53        142.0      2.7      1.1          wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
    73
    74                                               #npi_gb.clear()
    75         1          6.0      6.0      0.0      arr_concat = np.empty((100,100))
    76
    77                                               # Chng
    78         1         24.0     24.0      0.2      arr_concat = np.column_stack((arr_yearWeek, arr_chng))
    79         1        357.0    357.0      2.7      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    80
    81         1          2.0      2.0      0.0      HSDL, HSDG = [], []
    82        54         75.0      1.4      0.6      for idx,subarr in enumerate(npi_gb):
    83        53        387.0      7.3      3.0          HSDL.append( np.amin(subarr) )
    84        53        402.0      7.6      3.1          HSDG.append( np.amax(subarr) )
    85
    86                                               #npi_gb.clear()
    87         1          4.0      4.0      0.0      arr_concat = np.empty((100,100))
    88
    89                                               # Chngp
    90         1         21.0     21.0      0.2      arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
    91         1        292.0    292.0      2.2      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
    92
    93         1          2.0      2.0      0.0      HSDLp, HSDGp = [], []
    94        54         75.0      1.4      0.6      for idx,subarr in enumerate(npi_gb):
    95        53        382.0      7.2      2.9          HSDLp.append( np.amin(subarr) )
    96        53        403.0      7.6      3.1          HSDGp.append( np.amax(subarr) )
    97
    98                                               #npi_gb.clear()
    99         1          6.0      6.0      0.0      arr_concat = np.empty((100,100))
   100
   101                                               # Last Traded Date of the Week
   102         1         33.0     33.0      0.3      arr_concat = np.column_stack((arr_yearWeek, arr_ts))
   103         1        341.0    341.0      2.6      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
   104
   105         1          1.0      1.0      0.0      lastTrdDoW = []
   106        54         70.0      1.3      0.5      for idx,subarr in enumerate(npi_gb):
   107        53         79.0      1.5      0.6          lastTrdDoW.append( subarr[-1] )
   108
   109                                               #npi_gb.clear()
   110         1          5.0      5.0      0.0      arr_concat = np.empty((100,100))
   111
   112                                               # Times inreased
   113         1         14.0     14.0      0.1      TI = np.where(arr_close > arr_prevclose, 1, 0)
   114
   115                                               # Below npi_gb_yearWeekTI is used in volumes section
   116         1         16.0     16.0      0.1      arr_concat = np.column_stack((arr_yearWeek, TI))
   117         1        267.0    267.0      2.0      npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
   118
   119         1         78.0     78.0      0.6      tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
   120
   121                                               # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
   122         1         16.0     16.0      0.1      arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
   123         1        277.0    277.0      2.1      npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
   124
   125         1          3.0      3.0      0.0      vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
   126        54         87.0      1.6      0.7      for idx,subarr in enumerate(npi_gb):
   127        53       1077.0     20.3      8.2          vavgs.append( np.mean(subarr) )
   128        53         82.0      1.5      0.6          ldvotWk = subarr[-1]
   129        53         83.0      1.6      0.6          lastDVotWk.append(ldvotWk)
   130
   131                                                   #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
   132        53        235.0      4.4      1.8          ixDel = np.argmax(subarr)
   133        53         88.0      1.7      0.7          hV = subarr[ixDel]
   134        53         73.0      1.4      0.6          vmaxs.append( hV )
   135
   136        53         71.0      1.3      0.5          if(len(subarr)>1):
   137        53       1610.0     30.4     12.3              subarr = np.delete(subarr, ixDel)
   138        53       1024.0     19.3      7.8              vawoHV = np.mean(subarr)
   139                                                   else:
   140                                                       vawoHV = np.mean(subarr)
   141        53         85.0      1.6      0.6          volAvgWOhv.append( vawoHV )
   142        53        327.0      6.2      2.5          HVdAV.append(hV / vawoHV)
   143        53        102.0      1.9      0.8          CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
   144        53        791.0     14.9      6.0          lastDVdAV.append( np.round(ldvotWk / vawoHV, 2) )
   145
   146                                               #npi_gb.clear()
   147         1          4.0      4.0      0.0      arr_concat = np.empty((100,100))
   148
   149                                               # Preparing the dataframe
   150                                               # yearWeek and occurances
   151                                               #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
   152         1          5.0      5.0      0.0      yearWeek = yearWeek.astype(int)
   153         1         59.0     59.0      0.5      HSDL = np.round(HSDL,2)
   154         1         26.0     26.0      0.2      HSDG = np.round(HSDG,2)
   155         1         23.0     23.0      0.2      HSDLp = np.round(HSDLp,2)
   156         1         23.0     23.0      0.2      HSDGp = np.round(HSDGp,2)
   157
   158         1         23.0     23.0      0.2      first = np.round(first,2)
   159         1         23.0     23.0      0.2      last = np.round(last,2)
   160         1         23.0     23.0      0.2      wChng = np.round(wChng,2)
   161         1         23.0     23.0      0.2      wChngp = np.round(wChngp,2)
   162
   163         1         12.0     12.0      0.1      vavgs = np.array(vavgs).astype(int)
   164         1         16.0     16.0      0.1      volAvgWOhv = np.array(volAvgWOhv).astype(int)
   165         1         24.0     24.0      0.2      HVdAV = np.round(HVdAV,2)
   166
   167         1         16.0     16.0      0.1      ticker = np.full(yearWeek.shape[0], ticker)
   168         1          2.0      2.0      0.0      np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
   169         1          2.0      2.0      0.0          , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
   170         1        546.0    546.0      4.2          , CPveoHVD, lastDVotWk, lastDVdAV))
   171
   172                                               # Removing latest yearWeek from df_weekly_all as it could be partial and concatenating latest one(df_weekly) to df_weekly_all
   173         1          2.0      2.0      0.0      if len(np_weekly_all) > 0:
   174                                                   #print(len(np_weekly_all))
   175         1          2.0      2.0      0.0          a = np_weekly_all[:,1]
   176         1          1.0      1.0      0.0          b = np_weekly[:,1]
   177         1        205.0    205.0      1.6          tf_1 = np.isin(a, b, invert=True)
   178                                                   #print(tf_1)
   179         1         13.0     13.0      0.1          t_result = list(compress(range(len(tf_1)), tf_1))
   180                                                   #print(t_result)
   181         1         13.0     13.0      0.1          np_weekly_all = np_weekly_all[t_result]
   182         1         40.0     40.0      0.3          np_weekly_all = np.vstack((np_weekly_all, np_weekly))
   183                                               else:
   184                                                   np_weekly_all = []
   185                                                   np_weekly_all = np.vstack((np_weekly))
   186
   187         1          2.0      2.0      0.0      return np_weekly_all
```
Needless to say, the hype is real. NumPy wins; it's fast.
One issue faced during full execution using NumPy: `MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32`. This occurred in two tests, first after processing 1600 companies and the second time after 900 companies. Since all elements of a NumPy array must be of the same type, and my data mixes integers, decimals and dates, everything gets stored in one wide common type (`<U32` here), which is much bigger in size.
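The dtype widening is easy to demonstrate on toy values of my own: stacking ints, floats and date strings forces one common string dtype for the whole array, so every cell costs the full itemsize no matter how small the value is:

```python
import numpy as np

# Mixing ints, floats and date strings forces one common dtype for the whole array
mixed = np.column_stack((
    np.array([1, 2]),                         # int
    np.array([1.5, 2.5]),                     # float
    np.array(['2020-01-03', '2020-01-10']),   # date string
))
print(mixed.dtype)           # a wide Unicode dtype, e.g. <U32

# Every cell now costs dtype.itemsize bytes, regardless of its content
print(mixed.dtype.itemsize)
```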
I tried a couple of approaches to work around the memory error:
```python
npi_gb = None
arr_concat = None
```
```python
for name in dir():
    if not name.startswith('_'):
        print(name)
        del globals()[name]
```
```python
# To know which bitness of Python you are running
import struct
print(struct.calcsize("P") * 8)
```
Going forward, should I write only NumPy programs?
No. It depends, to a certain extent. What am I doing here? I am calling weekly_trend_analysis() in a loop, passing the last one year of daily data for each company as an argument; the function processes it and returns one year of weekly data for that company, and this repeats for all companies.
Instead of passing each company's daily data in a loop, if I pass the entire dataset, with all the companies, to the function via pandas, it takes only `Wall time: 1min 22s`.
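A sketch of that whole-dataset shape, using hypothetical company codes and a grouped named aggregation (not the article's exact function): one groupby over `(sc_code, yearWeek)` replaces the per-company Python loop entirely.

```python
import pandas as pd

# Hypothetical all-companies daily frame; one groupby instead of a Python loop
df_daily_all = pd.DataFrame({
    'sc_code':  ['A', 'A', 'A', 'B', 'B'],
    'yearWeek': [202001, 202001, 202002, 202001, 202001],
    'close':    [10.0, 11.0, 12.0, 50.0, 49.0],
})

# Named aggregation gives one weekly row per (company, week) in a single pass
df_weekly_all = (df_daily_all
                 .groupby(['sc_code', 'yearWeek'])['close']
                 .agg(closeH='max', closeL='min', first='first', last='last')
                 .reset_index())
print(df_weekly_all)
```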
So pandas is good at processing large datasets as well. We have two enemies here, and they should be avoided at all cost:
A couple of other points to remember:
So finally, what's best? Both are; it's the thought process that's going to save time.