
BSE Weekly trend analysis using Pandas & Numpy

python, jupyter, numpy · 4 min read

It's been a month since I last worked on Python; the last time was during the Udacity Data Engineering Nanodegree, and over the past month I was fully occupied with Rexx and Db2 REST.

In this exercise I am exploring what one can do with BSE daily data; for this analysis I will be using one year's worth of data. This is not technical analysis, just a little exercise to brush up some Python skills (upskilling).

Data available here

After looking at the data and giving it a lot of thought, I figured I could derive the metrics below; this alone took a couple of days to work out :D

| Field | Description |
| --- | --- |
| closeH | High close in the week |
| closeL | Low close in the week |
| volHigh | Highest volume in the week |
| volAvg | Volume average |
| daysTraded | Number of days traded in the week |
| HSDL | Highest single day loss |
| HSDG | Highest single day gain |
| HSDLp | Highest single day loss percent |
| HSDGp | Highest single day gain percent |
| first | First close of the week |
| last | Last close of the week |
| wChng | Week change |
| wChngp | Week change percent |
| lastTrdDoW | Last traded day of the week |
| TI | Times increased |
| volAvgWOhv | Volume average without the high-volume day |
| HVdAV | High volume / average volume (without the high-volume day) |
| CPveoHVD | Close positive on high-volume day |
| lastDVotWk | Last day volume of the week |
| lastDVdAV | Last day volume / average volume |

I was able to calculate all of the above fields into a dataframe. The step-by-step approach is available in the notebook on GitHub.
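
To make the idea concrete, here is a minimal sketch (not the notebook code itself) of how a few of these weekly fields could be derived with a plain pandas groupby, assuming daily columns named yearWeek, close, chngp and volumes as in the profiler listings further below:

```python
import pandas as pd

# Hedged sketch: weekly aggregates from one company's daily rows.
# Column names (yearWeek, close, chngp, volumes) follow the ones used later in this post.
def weekly_metrics(df_daily: pd.DataFrame) -> pd.DataFrame:
    g = df_daily.groupby('yearWeek')
    return pd.DataFrame({
        'closeH':     g['close'].max(),       # high close in the week
        'closeL':     g['close'].min(),       # low close in the week
        'volHigh':    g['volumes'].max(),     # highest volume in the week
        'volAvg':     g['volumes'].mean(),    # volume average
        'daysTraded': g['close'].count(),     # number of days traded in the week
        'HSDGp':      g['chngp'].max(),       # highest single day gain percent
        'HSDLp':      g['chngp'].min(),       # highest single day loss percent
        'first':      g['close'].first(),     # first close of the week
        'last':       g['close'].last(),      # last close of the week
    }).reset_index()
```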

To cut things short, with pandas the first run took 1h 26min and the second run took 1h 41min 41s to complete the analysis for one year of data. I didn't expect it to take that long, and I didn't know where to start either, so I posted my code on Stack Overflow asking for suggestions, and one user asked:

> Did you profile to see exactly which line(s) were the bottleneck(s)? – Mad Physicist

At that point I didn't know:

  • How to enable a profiler
  • How to install it
  • What kind of report it produces
  • How to read that report

A bit of googling got me a few useful links on line_profiler.

I faced a bit of a problem installing line_profiler on my system (Win10), so I had to do a workaround; that's another topic, covered here.
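
For anyone wanting to reproduce this, the usual line_profiler workflow is to decorate the function of interest and run the script through kernprof; roughly like this (the script name is the one from the report below):

```python
# Run with:  kernprof -l -v BTD-Analysis1V3.py
#   -l : profile line by line, -v : print the report when the run finishes.
# kernprof injects the @profile decorator at run time, so no import is needed.

@profile
def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
    ...
```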

Here is the output of the profiler,

('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s

Total time: 0.043637 s
File: BTD-Analysis1V3.py
Function: weekly_trend_analysis at line 36

Line # Hits Time Per Hit % Time Line Contents
==============================================================
 36 def weekly_trend_analysis(exchange, df_weekly_all, df_daily):
 37
 38 1 3.0 3.0 0.0 if exchange == 'BSE':
 39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
 40 else:
 41 ticker = df_daily.iloc[0]['symbol']
 42
 43 1 201.0 201.0 0.5 arr_yearWeek = df_daily['yearWeek'].to_numpy()
 44 1 100.0 100.0 0.2 arr_close = df_daily['close'].to_numpy()
 45 1 87.0 87.0 0.2 arr_prevclose = df_daily['prevclose'].to_numpy()
 46 1 85.0 85.0 0.2 arr_chng = df_daily['chng'].to_numpy()
 47 1 83.0 83.0 0.2 arr_chngp = df_daily['chngp'].to_numpy()
 48 1 108.0 108.0 0.2 arr_ts = df_daily['ts'].to_numpy()
 49 1 89.0 89.0 0.2 arr_volumes = df_daily['volumes'].to_numpy()
 50
 51 # Close
 52 1 41.0 41.0 0.1 arr_concat = np.column_stack((arr_yearWeek, arr_close))
 53 1 241.0 241.0 0.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 54
 55 #a = df_temp[['yearWeek', 'close']].to_numpy()
 56 1 113.0 113.0 0.3 yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
 57
 58 1 4.0 4.0 0.0 cmaxs, cmins = [], []
 59 1 3.0 3.0 0.0 first, last, wChng, wChngp = [], [], [], []
 60 2 11.0 5.5 0.0 for idx,subarr in enumerate(npi_gb):
 61 1 32.0 32.0 0.1 cmaxs.append( np.amax(subarr) )
 62 1 17.0 17.0 0.0 cmins.append( np.amin(subarr) )
 63 1 2.0 2.0 0.0 first.append(subarr[0])
 64 1 2.0 2.0 0.0 last.append(subarr[-1])
 65 1 3.0 3.0 0.0 wChng.append( subarr[-1] - subarr[0] )
 66 1 6.0 6.0 0.0 wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
 67
 68 #npi_gb.clear()
 69 1 4.0 4.0 0.0 arr_concat = np.empty((100,100))
 70
 71 # Chng
 72 1 21.0 21.0 0.0 arr_concat = np.column_stack((arr_yearWeek, arr_chng))
 73 1 109.0 109.0 0.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 74
 75 1 2.0 2.0 0.0 HSDL, HSDG = [], []
 76 2 7.0 3.5 0.0 for idx,subarr in enumerate(npi_gb):
 77 1 12.0 12.0 0.0 HSDL.append( np.amin(subarr) )
 78 1 9.0 9.0 0.0 HSDG.append( np.amax(subarr) )
 79
 80 #npi_gb.clear()
 81 1 3.0 3.0 0.0 arr_concat = np.empty((100,100))
 82
 83 # Chngp
 84 1 15.0 15.0 0.0 arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
 85 1 86.0 86.0 0.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 86
 87 1 1.0 1.0 0.0 HSDLp, HSDGp = [], []
 88 2 7.0 3.5 0.0 for idx,subarr in enumerate(npi_gb):
 89 1 11.0 11.0 0.0 HSDLp.append( np.amin(subarr) )
 90 1 9.0 9.0 0.0 HSDGp.append( np.amax(subarr) )
 91
 92 #npi_gb.clear()
 93 1 3.0 3.0 0.0 arr_concat = np.empty((100,100))
 94
 95 # Last Traded Date of the Week
 96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
 97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
 98
 99 1 2.0 2.0 0.0 lastTrdDoW = []
 100 2 9.0 4.5 0.0 for idx,subarr in enumerate(j):
 101 1 2.0 2.0 0.0 lastTrdDoW.append( subarr[-1] )
 102
 103 1 4.0 4.0 0.0 i = np.empty((100,100))
 104 #j.clear()
 105
 106 # Times inreased
 107 1 11.0 11.0 0.0 TI = np.where(arr_close > arr_prevclose, 1, 0)
 108
 109 # Below npi_gb_yearWeekTI is used in volumes section
 110 1 19.0 19.0 0.0 arr_concat = np.column_stack((arr_yearWeek, TI))
 111 1 111.0 111.0 0.3 npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 112
 113 1 73.0 73.0 0.2 tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
 114
 115 # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
 116 1 39.0 39.0 0.1 arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
 117 1 94.0 94.0 0.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 118
 119 1 2.0 2.0 0.0 vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
 120 2 8.0 4.0 0.0 for idx,subarr in enumerate(npi_gb):
 121 1 53.0 53.0 0.1 vavgs.append( np.mean(subarr) )
 122 1 2.0 2.0 0.0 ldvotWk = subarr[-1]
 123 1 2.0 2.0 0.0 lastDVotWk.append(ldvotWk)
 124
 125 #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
 126 1 13.0 13.0 0.0 ixDel = np.argmax(subarr)
 127 1 2.0 2.0 0.0 hV = subarr[ixDel]
 128 1 2.0 2.0 0.0 vmaxs.append( hV )
 129
 130 1 1.0 1.0 0.0 if(len(subarr)>1):
 131 1 53.0 53.0 0.1 subarr = np.delete(subarr, ixDel)
 132 1 29.0 29.0 0.1 vawoHV = np.mean(subarr)
 133 else:
 134 vawoHV = np.mean(subarr)
 135 1 2.0 2.0 0.0 volAvgWOhv.append( vawoHV )
 136 1 12.0 12.0 0.0 HVdAV.append(hV / vawoHV)
 137 1 3.0 3.0 0.0 CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
 138 1 6.0 6.0 0.0 lastDVdAV.append(ldvotWk / vawoHV)
 139
 140 #npi_gb.clear()
 141 1 3.0 3.0 0.0 arr_concat = np.empty((100,100))
 142
 143 # Preparing the dataframe
 144 # yearWeek and occurances
 145 #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
 146 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
 147 1 44.0 44.0 0.1 HSDL = np.round(HSDL,2)
 148 1 21.0 21.0 0.0 HSDG = np.round(HSDG,2)
 149 1 18.0 18.0 0.0 HSDLp = np.round(HSDLp,2)
 150 1 18.0 18.0 0.0 HSDGp = np.round(HSDGp,2)
 151
 152 1 17.0 17.0 0.0 first = np.round(first,2)
 153 1 17.0 17.0 0.0 last = np.round(last,2)
 154 1 17.0 17.0 0.0 wChng = np.round(wChng,2)
 155 1 16.0 16.0 0.0 wChngp = np.round(wChngp,2)
 156
 157 1 5.0 5.0 0.0 vavgs = np.array(vavgs).astype(int)
 158 1 3.0 3.0 0.0 volAvgWOhv = np.array(volAvgWOhv).astype(int)
 159 1 17.0 17.0 0.0 HVdAV = np.round(HVdAV,2)
 160
 161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
 162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
 163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
 164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
 165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
 166
 167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
 168
 169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
 170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
 171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
 172
 173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
 174
 175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
 176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
 177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
 178
 179 # Removing all yearWeek in df_weekly2 from df_weekly
 180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
 181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
 182 1 5.0 5.0 0.0 c = list(a.difference(b))
 183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
 184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
 185
 186 # Append the latest week data to df_weekly
 187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
 188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))
 189
 190 1 2.0 2.0 0.0 return df_weekly_all

Below are the sections of the code with the highest times:

Line # Hits Time Per Hit % Time Line Contents
==============================================================
 38 1 3.0 3.0 0.0 if exchange == 'BSE':
 39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
 40 else:
 41 ticker = df_daily.iloc[0]['symbol']

 95 # Last Traded Date of the Week
 96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
 97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
 98
 99 1 2.0 2.0 0.0 lastTrdDoW = []
 100 2 9.0 4.5 0.0 for idx,subarr in enumerate(j):
 101 1 2.0 2.0 0.0 lastTrdDoW.append( subarr[-1] )

 161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
 162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
 163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
 164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
 165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)
 166
 167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker
 168
 169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
 170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
 171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
 172
 173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()
 174
 175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
 176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
 177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
 178
 179 # Removing all yearWeek in df_weekly2 from df_weekly
 180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
 181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
 182 1 5.0 5.0 0.0 c = list(a.difference(b))
 183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
 184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
 185
 186 # Append the latest week data to df_weekly
 187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
 188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))

So I started trying alternative approaches to get things to run faster. I had been reading articles saying "NumPy is fast", "vectors are fast", so I started my tests.

Here are some of the attempts to solve the performance problems above:

  1. Getting the ticker code

    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
     38 1 3.0 3.0 0.0 if exchange == 'BSE':
     39 1 963.0 963.0 2.2 ticker = df_daily.iloc[0]['sc_code']
     40 else:
     41 ticker = df_daily.iloc[0]['symbol']

    I had to settle on numpy here:

    #ticker = df_temp.iloc[0]['sc_code']
    #ticker = df_daily.at[0,'sc_code']
    #ticker = df_daily.head(1)['sc_code'].values[0]
    ticker = df_daily['sc_code'].to_numpy()[0]
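
    If you want to compare such micro-differences yourself, one quick way (in a Jupyter/IPython cell) is to %timeit each variant; the column name here is the one from my data:

    ```python
    # Each statement is timed separately in its own cell
    %timeit df_daily.iloc[0]['sc_code']
    %timeit df_daily.at[0, 'sc_code']
    %timeit df_daily['sc_code'].to_numpy()[0]
    ```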
  2. ~3k µs to convert to numpy

    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
     95 # Last Traded Date of the Week
     96 1 3111.0 3111.0 7.1 i = df_daily[['yearWeek', 'ts']].to_numpy()
     97 1 128.0 128.0 0.3 j = npi.group_by(i[:, 0]).split(i[:, 1])
     98
     99 1 2.0 2.0 0.0 lastTrdDoW = []
     100 2 9.0 4.5 0.0 for idx,subarr in enumerate(j):
     101 1 2.0 2.0 0.0 lastTrdDoW.append( subarr[-1] )

    The pandas approach seemed a bit expensive here, so I tried to convert it to numpy and faced a TypeError: invalid type promotion. This error came up because I was trying to concatenate arr_yearWeek (int64) with arr_ts (datetime64[ns]) like arr_concat = np.column_stack((arr_yearWeek, arr_ts)).

    To resolve this, I converted the timestamp column ts to a string with df_temp['ts'] = df_temp['ts'].dt.strftime('%Y-%m-%d'), and this worked:

    arr_concat = np.column_stack((arr_yearWeek, arr_ts))
    npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])

    lastTrdDoW = []
    for idx,subarr in enumerate(npi_gb):
        lastTrdDoW.append( subarr[-1] )

    Here npi refers to import numpy_indexed as npi, another library. While testing I had run into various issues, and when I checked Stack Overflow for solutions I saw user Eelco Hoogendoorn suggesting numpy_indexed in multiple posts, so I had to try it out; he is also the author of the library.

    So now most of my code uses this.
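
    For anyone unfamiliar with the library, here is a minimal, self-contained sketch of the group-and-split pattern used throughout this post (the yearWeek/close values are made up):

    ```python
    import numpy as np
    import numpy_indexed as npi

    yearWeek = np.array([202001, 202001, 202001, 202002, 202002])
    close    = np.array([100.0, 102.5, 101.0, 99.0, 103.0])

    # One sub-array of close values per yearWeek, in sorted key order
    groups = npi.group_by(yearWeek).split(close)
    print([g.tolist() for g in groups])   # [[100.0, 102.5, 101.0], [99.0, 103.0]]

    # Aggregations return the unique keys plus one value per group
    keys, sums = npi.group_by(yearWeek).sum(close)
    print(keys, sums)                     # [202001 202002] [303.5 202. ]
    ```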

  3. Converting to a dataframe. Creating a new dataframe looks like an expensive process. When I step back and think about the flow of the program, this function is called in a loop and a new dataframe is created for each company; if it takes this much time for a single company, obviously the whole program is going to be slow. So my thought was: what if I keep everything in numpy itself and see how it goes?

    There are two things to solve here:

    • Dataframe creation (profiler line no. 165)
    • Ordering of the columns in the dataframe (profiler line no. 173; I did not think this step would consume this much time)
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    161 1 3.0 3.0 0.0 dict_temp = {'yearWeek': yearWeek, 'closeH': cmaxs, 'closeL': cmins, 'volHigh':vmaxs, 'volAvg':vavgs, 'daysTraded':daysTraded
    162 1 2.0 2.0 0.0 ,'HSDL':HSDL, 'HSDG':HSDG, 'HSDLp':HSDLp, 'HSDGp':HSDGp, 'first':first, 'last':last, 'wChng':wChng, 'wChngp':wChngp
    163 1 2.0 2.0 0.0 ,'lastTrdDoW':lastTrdDoW, 'TI':TI, 'volAvgWOhv':volAvgWOhv, 'HVdAV':HVdAV, 'CPveoHVD':CPveoHVD
    164 1 2.0 2.0 0.0 ,'lastDVotWk':lastDVotWk, 'lastDVdAV':lastDVdAV}
    165 1 3677.0 3677.0 8.4 df_weekly = pd.DataFrame(data=dict_temp)

    169 1 3.0 3.0 0.0 cols = ['sc_code', 'yearWeek', 'lastTrdDoW', 'daysTraded', 'closeL', 'closeH', 'volAvg', 'volHigh'
    170 1 1.0 1.0 0.0 , 'HSDL', 'HSDG', 'HSDLp', 'HSDGp', 'first', 'last', 'wChng', 'wChngp', 'TI', 'volAvgWOhv', 'HVdAV'
    171 1 2.0 2.0 0.0 , 'CPveoHVD', 'lastDVotWk', 'lastDVdAV']
    172
    173 1 2816.0 2816.0 6.5 df_weekly = df_weekly[cols].copy()

    I combined the above two problems into one solution. Since all the value arrays are of the same length, np.column_stack helped resolve the issue:

    np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
                                 , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
                                 , CPveoHVD, lastDVotWk, lastDVdAV))
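
    One side effect worth flagging here (it comes back in the memory issues later): stacking the string ticker together with numeric columns forces the whole array into a single wide fixed-width string dtype. A tiny made-up illustration:

    ```python
    import numpy as np

    ticker   = np.array(['500325', '500325'])
    yearWeek = np.array([202001, 202002])
    closeH   = np.array([1250.5, 1275.0])

    stacked = np.column_stack((ticker, yearWeek, closeH))
    print(stacked.dtype)   # <U32 -- every cell is stored as a 32-character string
    ```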
  4. Filling the same value into an entire dataframe column

    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    167 1 1102.0 1102.0 2.5 df_weekly['sc_code'] = ticker

    Resolved by filling the value in numpy instead:

    ticker = np.full(yearWeek.shape[0], ticker)
  5. The last part, where there are multiple high timers:

    • 20k µs: basically just defining an empty dataframe with columns; that alone turns out to be very expensive here.
    • 6k µs: dataframe concatenation
    • 1.5k µs: using .isin()
    Line # Hits Time Per Hit % Time Line Contents
    ==============================================================
    175 # df_weekly_all will be 0, when its a new company or its a FTA(First Time Analysis)
    176 1 13.0 13.0 0.0 if df_weekly_all.shape[0] == 0:
    177 1 20473.0 20473.0 46.9 df_weekly_all = pd.DataFrame(columns=list(df_weekly.columns))
    178
    179 # Removing all yearWeek in df_weekly2 from df_weekly
    180 1 321.0 321.0 0.7 a = set(df_weekly_all['yearWeek'])
    181 1 190.0 190.0 0.4 b = set(df_weekly['yearWeek'])
    182 1 5.0 5.0 0.0 c = list(a.difference(b))
    183 #print('df_weekly_all={}, df_weekly={}, difference={}'.format(len(a), len(b), len(c)) )
    184 1 1538.0 1538.0 3.5 df_weekly_all = df_weekly_all[df_weekly_all.yearWeek.isin(c)].copy()
    185
    186 # Append the latest week data to df_weekly
    187 1 6998.0 6998.0 16.0 df_weekly_all = pd.concat([df_weekly_all, df_weekly], sort=False)
    188 #print('After concat : df_weekly_all={}'.format(df_weekly_all.shape[0]))

    The thought process here is the same as before: why should I use a dataframe here at all, can I do everything in numpy? So I proceeded along that angle.

    # Removing latest yearWeek from df_weekly_all as it could be partial and concatenating latest one(df_weekly) to df_weekly_all
    if len(np_weekly_all) > 0:
        #print(len(np_weekly_all))
        a = np_weekly_all[:,1]
        b = np_weekly[:,1]
        tf_1 = np.isin(a, b, invert=True)
        #print(tf_1)
        t_result = list(compress(range(len(tf_1)), tf_1))
        #print(t_result)
        np_weekly_all = np_weekly_all[t_result]
        np_weekly_all = np.vstack((np_weekly_all, np_weekly))
    else:
        np_weekly_all = []
        np_weekly_all = np.vstack((np_weekly))

    Interesting concepts learned here (see the short sketch after this list):

    1. Slicing in numpy
    2. numpy.isin, and that it has an invert option
    3. compress from itertools (from itertools import compress)
    4. vstack
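
    As a small aside (a sketch with made-up data, not the exact program flow): the boolean mask returned by np.isin can index the array directly, so the compress() step could in principle be skipped:

    ```python
    import numpy as np

    np_weekly_all = np.array([['500325', '202001'], ['500325', '202002']])
    np_weekly     = np.array([['500325', '202002'], ['500325', '202003']])

    # True where the yearWeek (column 1) is NOT present in the freshly computed weeks
    keep = np.isin(np_weekly_all[:, 1], np_weekly[:, 1], invert=True)

    # Boolean indexing replaces the compress()/index-list step
    np_weekly_all = np.vstack((np_weekly_all[keep], np_weekly))
    print(np_weekly_all)
    ```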

    After making all the above changes, the new run took only 7min 47s.

Here is the new profiler report,

('getcwd : ', '/home/bobby_dreamer')
Timer unit: 1e-06 s

Total time: 0.013077 s
File: BTD-Analysis1V3-lf.py
Function: weekly_trend_analysis_np at line 38

Line # Hits Time Per Hit % Time Line Contents
==============================================================
 38 def weekly_trend_analysis_np(exchange, np_weekly_all, df_daily):
 39
 40 1 4.0 4.0 0.0 if exchange == 'BSE':
 41 #ticker = df_daily.at[0,'sc_code']
 42 #ticker = df_daily.head(1)['sc_code'].values[0]
 43 1 152.0 152.0 1.2 ticker = df_daily['sc_code'].to_numpy()[0]
 44 else:
 45 #ticker = df_daily.at[0,'symbol']
 46 #ticker = df_daily.head(1)['symbol'].values[0]
 47 ticker = df_daily['symbol'].to_numpy()[0]
 48
 49 1 34.0 34.0 0.3 arr_yearWeek = df_daily['yearWeek'].to_numpy()
 50 1 34.0 34.0 0.3 arr_close = df_daily['close'].to_numpy()
 51 1 31.0 31.0 0.2 arr_prevclose = df_daily['prevclose'].to_numpy()
 52 1 29.0 29.0 0.2 arr_chng = df_daily['chng'].to_numpy()
 53 1 27.0 27.0 0.2 arr_chngp = df_daily['chngp'].to_numpy()
 54 1 28.0 28.0 0.2 arr_ts = df_daily['ts'].to_numpy()
 55 1 27.0 27.0 0.2 arr_volumes = df_daily['volumes'].to_numpy()
 56
 57 # Close
 58 1 72.0 72.0 0.6 arr_concat = np.column_stack((arr_yearWeek, arr_close))
 59 1 651.0 651.0 5.0 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 60
 61 #a = df_temp[['yearWeek', 'close']].to_numpy()
 62 1 111.0 111.0 0.8 yearWeek, daysTraded = np.unique(arr_concat[:,0], return_counts=True)
 63
 64 1 2.0 2.0 0.0 cmaxs, cmins = [], []
 65 1 1.0 1.0 0.0 first, last, wChng, wChngp = [], [], [], []
 66 54 79.0 1.5 0.6 for idx,subarr in enumerate(npi_gb):
 67 53 465.0 8.8 3.6 cmaxs.append( np.amax(subarr) )
 68 53 423.0 8.0 3.2 cmins.append( np.amin(subarr) )
 69 53 86.0 1.6 0.7 first.append(subarr[0])
 70 53 75.0 1.4 0.6 last.append(subarr[-1])
 71 53 103.0 1.9 0.8 wChng.append( subarr[-1] - subarr[0] )
 72 53 142.0 2.7 1.1 wChngp.append( ( (subarr[-1] / subarr[0]) * 100) - 100 )
 73
 74 #npi_gb.clear()
 75 1 6.0 6.0 0.0 arr_concat = np.empty((100,100))
 76
 77 # Chng
 78 1 24.0 24.0 0.2 arr_concat = np.column_stack((arr_yearWeek, arr_chng))
 79 1 357.0 357.0 2.7 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 80
 81 1 2.0 2.0 0.0 HSDL, HSDG = [], []
 82 54 75.0 1.4 0.6 for idx,subarr in enumerate(npi_gb):
 83 53 387.0 7.3 3.0 HSDL.append( np.amin(subarr) )
 84 53 402.0 7.6 3.1 HSDG.append( np.amax(subarr) )
 85
 86 #npi_gb.clear()
 87 1 4.0 4.0 0.0 arr_concat = np.empty((100,100))
 88
 89 # Chngp
 90 1 21.0 21.0 0.2 arr_concat = np.column_stack((arr_yearWeek, arr_chngp))
 91 1 292.0 292.0 2.2 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 92
 93 1 2.0 2.0 0.0 HSDLp, HSDGp = [], []
 94 54 75.0 1.4 0.6 for idx,subarr in enumerate(npi_gb):
 95 53 382.0 7.2 2.9 HSDLp.append( np.amin(subarr) )
 96 53 403.0 7.6 3.1 HSDGp.append( np.amax(subarr) )
 97
 98 #npi_gb.clear()
 99 1 6.0 6.0 0.0 arr_concat = np.empty((100,100))
 100
 101 # Last Traded Date of the Week
 102 1 33.0 33.0 0.3 arr_concat = np.column_stack((arr_yearWeek, arr_ts))
 103 1 341.0 341.0 2.6 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 104
 105 1 1.0 1.0 0.0 lastTrdDoW = []
 106 54 70.0 1.3 0.5 for idx,subarr in enumerate(npi_gb):
 107 53 79.0 1.5 0.6 lastTrdDoW.append( subarr[-1] )
 108
 109 #npi_gb.clear()
 110 1 5.0 5.0 0.0 arr_concat = np.empty((100,100))
 111
 112 # Times inreased
 113 1 14.0 14.0 0.1 TI = np.where(arr_close > arr_prevclose, 1, 0)
 114
 115 # Below npi_gb_yearWeekTI is used in volumes section
 116 1 16.0 16.0 0.1 arr_concat = np.column_stack((arr_yearWeek, TI))
 117 1 267.0 267.0 2.0 npi_gb_yearWeekTI = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 118
 119 1 78.0 78.0 0.6 tempArr, TI = npi.group_by(arr_yearWeek).sum(TI)
 120
 121 # Volume ( dependent on above section value t_group , thats the reason to move from top to here)
 122 1 16.0 16.0 0.1 arr_concat = np.column_stack((arr_yearWeek, arr_volumes))
 123 1 277.0 277.0 2.1 npi_gb = npi.group_by(arr_concat[:, 0]).split(arr_concat[:, 1])
 124
 125 1 3.0 3.0 0.0 vmaxs, vavgs, volAvgWOhv, HVdAV, CPveoHVD, lastDVotWk, lastDVdAV = [], [], [], [], [], [], []
 126 54 87.0 1.6 0.7 for idx,subarr in enumerate(npi_gb):
 127 53 1077.0 20.3 8.2 vavgs.append( np.mean(subarr) )
 128 53 82.0 1.5 0.6 ldvotWk = subarr[-1]
 129 53 83.0 1.6 0.6 lastDVotWk.append(ldvotWk)
 130
 131 #print(idx, 'O - ',subarr, np.argmax(subarr), ', average : ',np.mean(subarr))
 132 53 235.0 4.4 1.8 ixDel = np.argmax(subarr)
 133 53 88.0 1.7 0.7 hV = subarr[ixDel]
 134 53 73.0 1.4 0.6 vmaxs.append( hV )
 135
 136 53 71.0 1.3 0.5 if(len(subarr)>1):
 137 53 1610.0 30.4 12.3 subarr = np.delete(subarr, ixDel)
 138 53 1024.0 19.3 7.8 vawoHV = np.mean(subarr)
 139 else:
 140 vawoHV = np.mean(subarr)
 141 53 85.0 1.6 0.6 volAvgWOhv.append( vawoHV )
 142 53 327.0 6.2 2.5 HVdAV.append(hV / vawoHV)
 143 53 102.0 1.9 0.8 CPveoHVD.append( npi_gb_yearWeekTI[idx][ixDel] )
 144 53 791.0 14.9 6.0 lastDVdAV.append( np.round(ldvotWk / vawoHV, 2) )
 145
 146 #npi_gb.clear()
 147 1 4.0 4.0 0.0 arr_concat = np.empty((100,100))
 148
 149 # Preparing the dataframe
 150 # yearWeek and occurances
 151 #yearWeek, daysTraded = np.unique(a[:,0], return_counts=True)
 152 1 5.0 5.0 0.0 yearWeek = yearWeek.astype(int)
 153 1 59.0 59.0 0.5 HSDL = np.round(HSDL,2)
 154 1 26.0 26.0 0.2 HSDG = np.round(HSDG,2)
 155 1 23.0 23.0 0.2 HSDLp = np.round(HSDLp,2)
 156 1 23.0 23.0 0.2 HSDGp = np.round(HSDGp,2)
 157
 158 1 23.0 23.0 0.2 first = np.round(first,2)
 159 1 23.0 23.0 0.2 last = np.round(last,2)
 160 1 23.0 23.0 0.2 wChng = np.round(wChng,2)
 161 1 23.0 23.0 0.2 wChngp = np.round(wChngp,2)
 162
 163 1 12.0 12.0 0.1 vavgs = np.array(vavgs).astype(int)
 164 1 16.0 16.0 0.1 volAvgWOhv = np.array(volAvgWOhv).astype(int)
 165 1 24.0 24.0 0.2 HVdAV = np.round(HVdAV,2)
 166
 167 1 16.0 16.0 0.1 ticker = np.full(yearWeek.shape[0], ticker)
 168 1 2.0 2.0 0.0 np_weekly = np.column_stack((ticker, yearWeek, lastTrdDoW, daysTraded, cmins, cmaxs, vavgs, vmaxs, HSDL
 169 1 2.0 2.0 0.0 , HSDG, HSDLp, HSDGp, first, last, wChng, wChngp, TI, volAvgWOhv, HVdAV
 170 1 546.0 546.0 4.2 , CPveoHVD, lastDVotWk, lastDVdAV))
 171
 172 # Removing latest yearWeek from df_weekly_all as it could be partial and concatenating latest one(df_weekly) to df_weekly_all
 173 1 2.0 2.0 0.0 if len(np_weekly_all) > 0:
 174 #print(len(np_weekly_all))
 175 1 2.0 2.0 0.0 a = np_weekly_all[:,1]
 176 1 1.0 1.0 0.0 b = np_weekly[:,1]
 177 1 205.0 205.0 1.6 tf_1 = np.isin(a, b, invert=True)
 178 #print(tf_1)
 179 1 13.0 13.0 0.1 t_result = list(compress(range(len(tf_1)), tf_1))
 180 #print(t_result)
 181 1 13.0 13.0 0.1 np_weekly_all = np_weekly_all[t_result]
 182 1 40.0 40.0 0.3 np_weekly_all = np.vstack((np_weekly_all, np_weekly))
 183 else:
 184 np_weekly_all = []
 185 np_weekly_all = np.vstack((np_weekly))
 186
 187 1 2.0 2.0 0.0 return np_weekly_all

Needless to say, the hype is real. Numpy wins; it's fast.

Issues faced during the full execution using numpy:

  1. MemoryError: Unable to allocate array with shape (82912, 22) and data type <U32. This issue occurred in two tests, the first after processing 1600 companies and the second after 900 companies. Since all elements of a numpy array must have the same type, and my data mixes integers, decimals and dates, everything ends up stored as a wide fixed-width string (<U32), which is much bigger per element (128 bytes for <U32 versus 8 bytes for a float64).

    I tried a couple of approaches:

    • Tried to clear the memory of used variables, e.g.
      npi_gb = None
      arr_concat = None
    • Tested (it doesn't help) whether deleting user-defined variables helps:
      for name in dir():
          if not name.startswith('_'):
              print(name)
              del globals()[name]
    • Thought the memory issue might have to do with the Python installation as well. My computer is 64-bit but the Python I had installed was 32-bit, so I reinstalled and checked. It doesn't help:
      # To know which bit version of Python you are running
      import struct
      print(struct.calcsize("P") * 8)
    • Finally, what I did was write an intermediate result file after every 500 companies and then reinitialize the variables holding that data (a rough sketch follows).
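
    A rough sketch of that chunked flush (the all_companies iterable, CHUNK_SIZE and the file names are made up for illustration, not the exact notebook code):

    ```python
    import numpy as np
    import pandas as pd

    CHUNK_SIZE = 500          # companies per intermediate file
    results, part = [], 0     # accumulated weekly arrays, file counter

    for i, df_daily in enumerate(all_companies, start=1):   # yields one company's daily data
        np_weekly = weekly_trend_analysis_np('BSE', [], df_daily)
        results.append(np_weekly)

        if i % CHUNK_SIZE == 0:
            # Flush to disk and reinitialize the variable holding the data
            pd.DataFrame(np.vstack(results)).to_csv(f'weekly_part_{part}.csv', index=False)
            results, part = [], part + 1

    if results:               # flush whatever is left after the loop
        pd.DataFrame(np.vstack(results)).to_csv(f'weekly_part_{part}.csv', index=False)
    ```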

# Things to do next

Going forward, should I write only numpy programs?

No. It depends, to a certain extent. What am I doing here? I am calling weekly_trend_analysis() in a loop, passing the last one year of daily data for each company as an argument; the function processes it and returns one year of weekly data for that company, and this repeats for all companies.

  • Doing this via Pandas takes 1h 26min
  • Via Numpy takes 7min 47s

Instead of passing each company's daily data in a loop, if I pass the entire dataset with all the companies to the function via pandas, it takes only 1min 22s (wall time).
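
That whole-dataset run essentially replaces the per-company loop with a single grouped aggregation over both company and week; a hedged sketch of the idea (not the exact notebook code):

```python
import pandas as pd

# df_daily here holds the daily rows of ALL companies at once
def weekly_trend_all(df_daily: pd.DataFrame) -> pd.DataFrame:
    g = df_daily.groupby(['sc_code', 'yearWeek'])
    out = g.agg(
        closeH=('close', 'max'),
        closeL=('close', 'min'),
        volHigh=('volumes', 'max'),
        volAvg=('volumes', 'mean'),
        daysTraded=('close', 'count'),
        first=('close', 'first'),
        last=('close', 'last'),
    ).reset_index()
    out['wChng'] = out['last'] - out['first']
    out['wChngp'] = (out['last'] / out['first']) * 100 - 100
    return out
```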

So pandas is good at processing large datasets as well. We have two enemies here, and they should be avoided at all costs:

  1. Looping
  2. The thought process that gets you to write a loop.

A couple of other points to remember:

  1. With complex code, maintenance will hit hard. Pandas is easier to understand than numpy.
  2. If you really want to scale, it is easy to convert from pandas to PySpark; I have done it. (Easy meaning you have to sit with it for a couple of weeks, but it is doable.)
  3. Numpy is very fast: even with the looping it brought the elapsed time down from 1h 26min to about 7min. What I didn't like is the memory error, but that problem is with my data; I wanted multiple datatypes, so everything had to be stored as wide string/object values.

So finally, which is best? Both are, but it's the thought process that is going to save time.

# Related articles

  1. Running python line_profiler in cloud shell
  2. Github : BSE Trend analysis using Pandas(Notebook)
  3. Github : BSE Trend analysis using Numpy(Notebook)