ADVANCES IN FINANCIAL MACHINE LEARNING
BY MARCOS LÓPEZ DE PRADO

Contents

Table 1.1, Table 1.2, Table 2.1, Figure 2.1, Equation 1, Equation 2, Equation 3, Equation 4, Equation 5, Equation 6, Equation 7, Equation 8, Equation 9, Equation 10, Equation 11, Equation 12, Equation 13, Equation 14, Equation 15, Expression 1, Equation 16, Equation 17, Equation 18, Expression 2, Equation 19, Equation 20, Equation 21, Equation 22, Equation 23, Equation 24, Equation 25, Figure 2.2, Snippet 2.1, Snippet 2.2, Snippet 2.3

Equation 26, Equation 27, Equation 28, Equation 29a, Equation 29b, Snippet 2.4, Figure 2.3, Equation 30, Equation 31, Snippet 3.1, Snippet 3.2, Figure 3.1, Snippet 3.3, Snippet 3.4, Snippet 3.5, Snippet 3.6, Snippet 3.7, Figure 3.2, Snippet 3.8, Snippet 4.1, Equation 32, Figure 4.1, Snippet 4.2, Equation 33, Snippet 4.3, Snippet 4.4, Snippet 4.5, Matrix 1, Snippet 4.6, Snippet 4.7, Snippet 4.8, Snippet 4.9, Figure 4.2, Function 1, Snippet 4.10, Section 4.7, Snippet 4.11

Figure 4.3, Section 5.4, Figure 5.1, Figure 5.2, Snippet 5.1, Figure 5.3, Snippet 5.2, Figure 5.4, Snippet 5.3, Figure 5.5, Snippet 5.4, Table 5.1, Equation 34, Section 6.3.1, Figure 6.1, Equation 35, Snippet 6.1, Figure 6.2, Snippet 6.2, Figure 6.3, Figure 7.1, Snippet 7.1, Figure 7.2, Figure 7.3, Snippet 7.2, Snippet 7.3, Snippet 7.4, Snippet 8.2, Snippet 8.3, Snippet 8.4, Snippet 8.5, Figure 8.1, Snippet 8.6, Snippet 8.7, Snippet 8.8, Snippet 8.9, Snippet 8.10

Figure 8.2, Figure 8.3, Figure 8.4, Snippet 9.1, Snippet 9.2, Snippet 9.3, Function 2, Figure 9.1, Snippet 9.4, Log 1, Figure 9.2, Equation 36, Section 10.3, Figure 10.1, Snippet 10.1, Snippet 10.2, Figure 10.2, Snippet 10.3, Section 10.6, Snippet 10.4, Figure 10.3, Equation 37, Figure 11.1, Figure 11.2, Equation 38, Figure 12.1, Equation 39, Equation 40, Figure 12.2, Equation 41, Equation 42, Equation 43a, Equation 43b, Equation 44, Definition 2, Section 13.4, Section 13.5.1

Snippet 13.1, Snippet 13.2, Table 13.1, Figure 13.1, Figure 13.2, Figure 13.3, Figure 13.4, Figure 13.5, Figure 13.6, Figure 13.7, Figure 13.8, Figure 13.9, Figure 13.10, Figure 13.11, Figure 13.12, Figure 13.13, Figure 13.14, Figure 13.15, Figure 13.16, Figure 13.17, Figure 13.18, Figure 13.19, Figure 13.20, Figure 13.21, Figure 13.22, Figure 13.23, Figure 13.24, Figure 13.25, Snippet 14.1, Snippet 14.2, Equation 45, Equation 46, Equation 47, Equation 48, Equation 49, Series 1-4, Equation 50

Equation 51, Snippet 14.3, Snippet 14.4, Figure 14.1, Equation 52, Figure 14.2, Equation 53, Figure 14.3, Table 14.1, Equation 54, Figure 15.1, Snippet 15.1, Equation 55, Equation 56, Equation 57, Snippet 15.2, Snippet 15.3, Figure 15.2, Snippet 15.4, Figure 15.3, Snippet 15.5, Figure 16.1, Section 16.4.1, Equation 58, Figure 16.2, Figure 16.3, Snippet 16.1, Snippet 16.2, Snippet 16.3, Figure 16.4, Figure 16.5, Figure 16.6, Table 16.1, Figure 16.7, Figure 16.8, Chapter 16 Appendices, Section 17.3

Equation 59, Equation 60, Equation 61, Equation 62, Figure 17.1, Equation 63, Equation 64, Table 17.1, Figure 17.2, Figure 17.3, Snippet 17.1, Snippet 17.2, Snippet 17.3, Snippet 17.4, Equation 65, Equation 66, Section 18.2, Equation 67, Snippet 18.1, Snippet 18.2, Equation 68, Snippet 18.3, Equation 69, Equation 70, Equation 71, Equation 72, Equation 73, Snippet 18.4, Equation 74, Equation 75, Figure 18.1, Equation 76, Figure 18.2, Equation 77, Equation 78, Equation 79, Equation 80

Equation 81, Equation 82, Equation 83, Equation 84, Equation 85, Equation 86, Equation 87, Equation 88, Equation 89, Snippet 19.1, Snippet 19.2, Figure 19.1, Equation 90, Equation 91, Figure 19.2, Equation 92, Figure 19.3, Section 19.5.1, Section 19.5.2, Snippet 20.1, Snippet 20.2, Snippet 20.3, Snippet 20.4, Snippet 20.5, Figure 20.1, Equation 93, Equation 94, Equation 95, Snippet 20.6, Figure 20.2, Snippet 20.7, Snippet 20.8, Snippet 20.9, Snippet 20.10, Snippet 20.11, Snippet 20.12, Snippet 20.13

Equation 96, Snippet 20.14, Equation 97, Equation 98, Figure 21.1, Snippet 21.1, Snippet 21.2, Snippet 21.3, Snippet 21.4, Snippet 21.5, Snippet 21.6, Snippet 21.7, Figure 22.1, Figure 22.2, Figure 22.3, Figure 22.4, Figure 22.5, Figure 22.6, Figure 22.7, Figure 22.8, Figure 22.9, Figure 22.10

JWBT2318-c01 JWBT2318-Marcos January 5, 2018 17:20 Printer Name: Trim: 6in × 9in

TABLE 1.1 Overview of the Challenges Addressed by Every Chapter

Columns: Part, Chapter, Fin. data, Software, Hardware, Math, Meta-Strat, Overfitting

Part | Chapters
1 | 2, 3, 4, 5
2 | 6, 7, 8, 9
3 | 10, 11, 12, 13, 14, 15, 16
4 | 17, 18, 19
5 | 20, 21, 22

[The X marks assigning challenges to chapters cannot be recovered from the extracted text.]

TABLE 1.2 Common Pitfalls in Financial ML

# | Category | Pitfall | Solution | Chapter
1 | Epistemological | The Sisyphus paradigm | The meta-strategy paradigm | 1
2 | Epistemological | Research through backtesting | Feature importance analysis | 8
3 | Data processing | Chronological sampling | The volume clock | 2
4 | Data processing | Integer differentiation | Fractional differentiation | 5
5 | Classification | Fixed-time horizon labeling | The triple-barrier method | 3
6 | Classification | Learning side and size simultaneously | Meta-labeling | 3
7 | Classification | Weighting of non-IID samples | Uniqueness weighting; sequential bootstrapping | 4
8 | Evaluation | Cross-validation leakage | Purging and embargoing | 7, 9
9 | Evaluation | Walk-forward (historical) backtesting | Combinatorial purged cross-validation | 11, 12
10 | Evaluation | Backtest overfitting | Backtesting on synthetic data; the deflated Sharpe ratio | 10–16

JWBT2318-c02 JWBT2318-Marcos January 3, 2018 17:39 Printer Name: Trim: 6in × 9in

TABLE 2.1 The Four Essential Types of Financial Data

Fundamental Data | Market Data | Analytics | Alternative Data
Assets | Price/yield/implied volatility | Analyst recommendations | Satellite/CCTV images
Liabilities | Volume | Credit ratings | Google searches
Sales | Dividend/coupons | Earnings expectations | Twitter/chats
Costs/earnings | Open interest | News sentiment | Metadata
Macro variables | Quotes/cancellations | ... | ...
... | Aggressor side | |
 | ... | |
FIGURE 2.1 Average daily frequency of tick, volume, and dollar bars

$$b_t = \begin{cases} b_{t-1} & \text{if } \Delta p_t = 0 \\ \dfrac{|\Delta p_t|}{\Delta p_t} & \text{if } \Delta p_t \neq 0 \end{cases}$$
Equation 1

$$\theta_T = \sum_{t=1}^{T} b_t$$
Equation 2

$$T^* = \arg\min_T \left\{ |\theta_T| \geq E_0[T]\,\big|2P[b_t = 1] - 1\big| \right\}$$
Equation 3

$$\theta_T = \sum_{t=1}^{T} b_t v_t$$
Equation 4

$$E_0[\theta_T] = E_0\Big[\sum_{t|b_t=1}^{T} v_t\Big] - E_0\Big[\sum_{t|b_t=-1}^{T} v_t\Big] = E_0[T]\big(P[b_t = 1]E_0[v_t|b_t = 1] - P[b_t = -1]E_0[v_t|b_t = -1]\big)$$
Equation 5

Equation 6

Equation 7

$$E_0[\theta_T] = E_0[T](v^+ - v^-) = E_0[T](2v^+ - E_0[v_t])$$
Equation 8

$$T^* = \arg\min_T \left\{ |\theta_T| \geq E_0[T]\,\big|2v^+ - E_0[v_t]\big| \right\}$$
Equation 9

Equation 10

$$E_0[\theta_T] = E_0[T]\max\{P[b_t = 1],\ 1 - P[b_t = 1]\}$$
Equation 11

$$T^* = \arg\min_T \left\{ \theta_T \geq E_0[T]\max\{P[b_t = 1],\ 1 - P[b_t = 1]\} \right\}$$
Equation 12

$$\theta_T = \max\Big\{ \sum_{t|b_t=1}^{T} b_t v_t,\ -\sum_{t|b_t=-1}^{T} b_t v_t \Big\}$$
Equation 13

$$E_0[\theta_T] = E_0[T]\max\{P[b_t = 1]E_0[v_t|b_t = 1],\ (1 - P[b_t = 1])E_0[v_t|b_t = -1]\}$$
Equation 14

Equation 15

$$\max\{P[b_t = 1]E_0[v_t|b_t = 1],\ (1 - P[b_t = 1])E_0[v_t|b_t = -1]\}$$
Expression 1

$$h_{i,t} = \begin{cases} \dfrac{\omega_{i,t} K_t}{o_{i,t+1}\,\varphi_{i,t} \sum_{i=1}^{I} |\omega_{i,t}|} & \text{if } t \in B \\ h_{i,t-1} & \text{otherwise} \end{cases}$$
$$\delta_{i,t} = \begin{cases} p_{i,t} - o_{i,t} & \text{if } (t-1) \in B \\ \Delta p_{i,t} & \text{otherwise} \end{cases}$$
$$K_t = K_{t-1} + \sum_{i=1}^{I} h_{i,t-1}\,\varphi_{i,t}\,(\delta_{i,t} + d_{i,t})$$
Equations 16, 17, and 18

Expression 2

1. Rebalance costs: The variable cost $\{c_t\}$ associated with the allocation rebalance is $c_t = \sum_{i=1}^{I} (|h_{i,t-1}|\,p_{i,t} + |h_{i,t}|\,o_{i,t+1})\,\varphi_{i,t}\,\tau_i$, $\forall t \in B$. We do not embed $c_t$ in $K_t$, or shorting the spread will generate fictitious profits when the allocation is rebalanced. In your code, you can treat $\{c_t\}$ as a (negative) dividend.
2. Bid-ask spread: The cost $\{\tilde{c}_t\}$ of buying or selling one unit of this virtual ETF is $\tilde{c}_t = \sum_{i=1}^{I} |h_{i,t-1}|\,p_{i,t}\,\varphi_{i,t}\,\tau_i$. When a unit is bought or sold, the strategy must charge this cost $\tilde{c}_t$, which is equivalent to crossing the bid-ask spread of this virtual ETF.
3. Volume: The volume traded $\{v_t\}$ is determined by the least active member in the basket. Let $v_{i,t}$ be the volume traded by instrument $i$ over bar $t$. The number of tradeable basket units is $v_t = \min_i \left\{ \dfrac{v_{i,t}}{|h_{i,t-1}|} \right\}$.
Equations 19, 20, and 21

Equation 22

Equation 23

Equation 24

Equation 25

FIGURE 2.2 Contribution to risk per principal component

SNIPPET 2.1 PCA WEIGHTS FROM A RISK DISTRIBUTION R

    import numpy as np

    def pcaWeights(cov,riskDist=None,riskTarget=1.):
        # Following the riskAlloc distribution, match riskTarget
        eVal,eVec=np.linalg.eigh(cov) # must be Hermitian
        indices=eVal.argsort()[::-1] # arguments for sorting eVal desc
        eVal,eVec=eVal[indices],eVec[:,indices]
        if riskDist is None:
            # default: allocate all risk to the smallest-eigenvalue component
            riskDist=np.zeros(cov.shape[0])
            riskDist[-1]=1.
        loads=riskTarget*(riskDist/eVal)**.5
        wghts=np.dot(eVec,np.reshape(loads,(-1,1)))
        #ctr=(loads/riskTarget)**2*eVal # verify riskDist
        return wghts

SNIPPET 2.2 FORM A GAPS SERIES, DETRACT IT FROM PRICES

    import pandas as pd

    def getRolledSeries(pathIn,key):
        series=pd.read_hdf(pathIn,key='bars/ES_10k')
        series['Time']=pd.to_datetime(series['Time'],format='%Y%m%d%H%M%S%f')
        series=series.set_index('Time')
        gaps=rollGaps(series)
        for fld in ['Close','VWAP']:series[fld]-=gaps
        return series
    #---------------------------------------------------------------------
    def rollGaps(series,dictio={'Instrument':'FUT_CUR_GEN_TICKER','Open':'PX_OPEN', \
            'Close':'PX_LAST'},matchEnd=True):
        # Compute gaps at each roll, between previous close and next open
        rollDates=series[dictio['Instrument']].drop_duplicates(keep='first').index
        gaps=series[dictio['Close']]*0
        iloc=list(series.index)
        iloc=[iloc.index(i)-1 for i in rollDates] # index of days prior to roll
        gaps.loc[rollDates[1:]]=series[dictio['Open']].loc[rollDates[1:]]- \
            series[dictio['Close']].iloc[iloc[1:]].values
        gaps=gaps.cumsum()
        if matchEnd:gaps-=gaps.iloc[-1] # roll backward
        return gaps

SNIPPET 2.3 NON-NEGATIVE ROLLED PRICE SERIES

    raw=pd.read_csv(filePath,index_col=0,parse_dates=True)
    gaps=rollGaps(raw,dictio={'Instrument':'Symbol','Open':'Open','Close':'Close'})
    rolled=raw.copy(deep=True)
    for fld in ['Open','Close']:rolled[fld]-=gaps
    rolled['Returns']=rolled['Close'].diff()/raw['Close'].shift(1)
    rolled['rPrices']=(1+rolled['Returns']).cumprod()

$$S_t = \max\{0,\ S_{t-1} + y_t - E_{t-1}[y_t]\}$$
Equation 26

$$S_t \geq h \iff \exists\, \tau \in [1, t] \ \Big|\ \sum_{i=\tau}^{t} \big(y_i - E_{i-1}[y_t]\big) \geq h$$
Equation 27

$$S_t^+ = \max\{0,\ S_{t-1}^+ + y_t - E_{t-1}[y_t]\}, \quad S_0^+ = 0$$
$$S_t^- = \min\{0,\ S_{t-1}^- + y_t - E_{t-1}[y_t]\}, \quad S_0^- = 0$$
$$S_t = \max\{S_t^+,\ -S_t^-\}$$
Equations 28, 29a, and 29b

SNIPPET 2.4 THE SYMMETRIC CUSUM FILTER

    import pandas as pd

    def getTEvents(gRaw,h):
        tEvents,sPos,sNeg=[],0,0
        diff=gRaw.diff()
        for i in diff.index[1:]:
            sPos,sNeg=max(0,sPos+diff.loc[i]),min(0,sNeg+diff.loc[i])
            if sNeg<-h:
                sNeg=0;tEvents.append(i) # downward threshold hit: reset and record
            elif sPos>h:
                sPos=0;tEvents.append(i) # upward threshold hit: reset and record
        return pd.DatetimeIndex(tEvents)

FIGURE 2.3 CUSUM sampling of a price series

JWBT2318-c03 JWBT2318-Marcos January 4, 2018 14:41 Printer Name: Trim: 6in × 9in

$$y_i = \begin{cases} -1 & \text{if } r_{t_{i,0},\,t_{i,0}+h} < -\tau \\ 0 & \text{if } |r_{t_{i,0},\,t_{i,0}+h}| \leq \tau \\ 1 & \text{if } r_{t_{i,0},\,t_{i,0}+h} > \tau \end{cases}$$
Equation 30

$$r_{t_{i,0},\,t_{i,0}+h} = \frac{p_{t_{i,0}+h}}{p_{t_{i,0}}} - 1$$
Equation 31

SNIPPET 3.1 DAILY VOLATILITY ESTIMATES

    import pandas as pd

    def getDailyVol(close,span0=100):
        # daily vol, reindexed to close
        df0=close.index.searchsorted(close.index-pd.Timedelta(days=1))
        df0=df0[df0>0]
        df0=pd.Series(close.index[df0-1], index=close.index[close.shape[0]-df0.shape[0]:])
        df0=close.loc[df0.index]/close.loc[df0.values].values-1 # daily returns
        df0=df0.ewm(span=span0).std()
        return df0

SNIPPET 3.2 TRIPLE-BARRIER LABELING METHOD

    import pandas as pd

    def applyPtSlOnT1(close,events,ptSl,molecule):
        # apply stop loss/profit taking, if it takes place before t1 (end of event)
        events_=events.loc[molecule]
        out=events_[['t1']].copy(deep=True)
        if ptSl[0]>0:pt=ptSl[0]*events_['trgt']
        else:pt=pd.Series(index=events.index) # NaNs
        if ptSl[1]>0:sl=-ptSl[1]*events_['trgt']
        else:sl=pd.Series(index=events.index) # NaNs
        for loc,t1 in events_['t1'].fillna(close.index[-1]).items():
            df0=close[loc:t1] # path prices
            df0=(df0/close[loc]-1)*events_.at[loc,'side'] # path returns
            out.loc[loc,'sl']=df0[df0<sl[loc]].index.min() # earliest stop loss.
            out.loc[loc,'pt']=df0[df0>pt[loc]].index.min() # earliest profit taking.
        return out

FIGURE 3.1 Two alternative configurations of the triple-barrier method (panels (a) and (b))

SNIPPET 3.3 GETTING THE TIME OF FIRST TOUCH

    import pandas as pd
    # mpPandasObj is the book's multiprocessing helper, defined elsewhere in the book

    def getEvents(close,tEvents,ptSl,trgt,minRet,numThreads,t1=False):
        #1) get target
        trgt=trgt.loc[tEvents]
        trgt=trgt[trgt>minRet] # minRet
        #2) get t1 (max holding period)
        if t1 is False:t1=pd.Series(pd.NaT,index=tEvents)
        #3) form events object, apply stop loss on t1
        side_=pd.Series(1.,index=trgt.index)
        events=pd.concat({'t1':t1,'trgt':trgt,'side':side_}, \
            axis=1).dropna(subset=['trgt'])
        df0=mpPandasObj(func=applyPtSlOnT1,pdObj=('molecule',events.index), \
            numThreads=numThreads,close=close,events=events,ptSl=[ptSl,ptSl])
        events['t1']=df0.dropna(how='all').min(axis=1) # pd.min ignores nan
        events=events.drop('side',axis=1)
        return events

SNIPPET 3.4 ADDING A VERTICAL BARRIER

    t1=close.index.searchsorted(tEvents+pd.Timedelta(days=numDays))
    t1=t1[t1<close.shape[0]]
    t1=pd.Series(close.index[t1],index=tEvents[:t1.shape[0]]) # NaNs at end

SNIPPET 3.5 LABELING FOR SIDE AND SIZE

    import numpy as np
    import pandas as pd

    def getBins(events,close):
        #1) prices aligned with events
        events_=events.dropna(subset=['t1'])
        px=events_.index.union(events_['t1'].values).drop_duplicates()
        px=close.reindex(px,method='bfill')
        #2) create out object
        out=pd.DataFrame(index=events_.index)
        out['ret']=px.loc[events_['t1'].values].values/px.loc[events_.index]-1
        out['bin']=np.sign(out['ret'])
        return out

SNIPPET 3.6 EXPANDING getEvents TO INCORPORATE META-LABELING

    import pandas as pd
    # mpPandasObj is the book's multiprocessing helper, defined elsewhere in the book

    def getEvents(close,tEvents,ptSl,trgt,minRet,numThreads,t1=False,side=None):
        #1) get target
        trgt=trgt.loc[tEvents]
        trgt=trgt[trgt>minRet] # minRet
        #2) get t1 (max holding period)
        if t1 is False:t1=pd.Series(pd.NaT,index=tEvents)
        #3) form events object, apply stop loss on t1
        if side is None:side_,ptSl_=pd.Series(1.,index=trgt.index),[ptSl[0],ptSl[0]]
        else:side_,ptSl_=side.loc[trgt.index],ptSl[:2]
        events=pd.concat({'t1':t1,'trgt':trgt,'side':side_}, \
            axis=1).dropna(subset=['trgt'])
        df0=mpPandasObj(func=applyPtSlOnT1,pdObj=('molecule',events.index), \
            numThreads=numThreads,close=close,events=events,ptSl=ptSl_)
        events['t1']=df0.dropna(how='all').min(axis=1) # pd.min ignores nan
        if side is None:events=events.drop('side',axis=1)
        return events

SNIPPET 3.7 EXPANDING getBins TO INCORPORATE META-LABELING

    import numpy as np
    import pandas as pd

    def getBins(events,close):
        '''
        Compute event's outcome (including side information, if provided).
        events is a DataFrame where:
        -events.index is event's starttime
        -events['t1'] is event's endtime
        -events['trgt'] is event's target
        -events['side'] (optional) implies the algo's position side
        Case 1: ('side' not in events): bin in (-1,1) <-label by price action
        Case 2: ('side' in events): bin in (0,1) <-label by pnl (meta-labeling)
        '''
        #1) prices aligned with events
        events_=events.dropna(subset=['t1'])
        px=events_.index.union(events_['t1'].values).drop_duplicates()
        px=close.reindex(px,method='bfill')
        #2) create out object
        out=pd.DataFrame(index=events_.index)
        out['ret']=px.loc[events_['t1'].values].values/px.loc[events_.index]-1
        if 'side' in events_:out['ret']*=events_['side'] # meta-labeling
        out['bin']=np.sign(out['ret'])
        if 'side' in events_:out.loc[out['ret']<=0,'bin']=0 # meta-labeling
        return out

FIGURE 3.2 A visualization of the "confusion matrix"

SNIPPET 3.8 DROPPING UNDER-POPULATED LABELS

    def dropLabels(events,minPct=.05):
        # apply weights, drop labels with insufficient examples
        while True:
            df0=events['bin'].value_counts(normalize=True)
            if df0.min()>minPct or df0.shape[0]<3:break
            print('dropped label',df0.idxmin(),df0.min())
            events=events[events['bin']!=df0.idxmin()] # remove the rarest label
        return events

JWBT2318-c04 JWBT2318-Marcos January 3, 2018 19:1 Printer Name: Trim: 6in × 9in

SNIPPET 4.1 ESTIMATING THE UNIQUENESS OF A LABEL

    import numpy as np
    import pandas as pd

    def mpNumCoEvents(closeIdx,t1,molecule):
        '''
        Compute the number of concurrent events per bar.
        +molecule[0] is the date of the first event on which the weight will be computed
        +molecule[-1] is the date of the last event on which the weight will be computed
        Any event that starts before t1[molecule].max() impacts the count.
        '''
        #1) find events that span the period [molecule[0],molecule[-1]]
        t1=t1.fillna(closeIdx[-1]) # unclosed events still must impact other weights
        t1=t1[t1>=molecule[0]] # events that end at or after molecule[0]
        t1=t1.loc[:t1[molecule].max()] # events that start at or before t1[molecule].max()
        #2) count events spanning a bar
        iloc=closeIdx.searchsorted(np.array([t1.index[0],t1.max()]))
        count=pd.Series(0.,index=closeIdx[iloc[0]:iloc[1]+1])
        for tIn,tOut in t1.items():count.loc[tIn:tOut]+=1.
        return count.loc[molecule[0]:t1[molecule].max()]

Equation 32

FIGURE 4.1 Histogram of uniqueness values

SNIPPET 4.2 ESTIMATING THE AVERAGE UNIQUENESS OF A LABEL

    import pandas as pd
    # mpPandasObj is the book's multiprocessing helper, defined elsewhere in the book

    def mpSampleTW(t1,numCoEvents,molecule):
        # Derive average uniqueness over the event's lifespan
        wght=pd.Series(index=molecule)
        for tIn,tOut in t1.loc[wght.index].items():
            wght.loc[tIn]=(1./numCoEvents.loc[tIn:tOut]).mean()
        return wght
    #---------------------------------------------------------------------
    numCoEvents=mpPandasObj(mpNumCoEvents,('molecule',events.index),numThreads, \
        closeIdx=close.index,t1=events['t1'])
    numCoEvents=numCoEvents.loc[~numCoEvents.index.duplicated(keep='last')]
    numCoEvents=numCoEvents.reindex(close.index).fillna(0)
    out['tW']=mpPandasObj(mpSampleTW,('molecule',events.index),numThreads, \
        t1=events['t1'],numCoEvents=numCoEvents)

$$\delta_j^{(2)} = \bar{u}_j^{(2)} \Big( \sum^{I} \ldots$$
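The concurrency count of Snippet 4.1 and the average uniqueness of Snippet 4.2 can be checked by hand on a toy example. The sketch below is a single-process illustration only, not the book's code: it drops the mpPandasObj wrapper, and the function names num_co_events and avg_uniqueness, the six daily bars, and the three label spans are all hypothetical choices made for this example.

```python
import pandas as pd

def num_co_events(close_idx, t1):
    """Count how many labels are active at each bar (single-process sketch)."""
    count = pd.Series(0.0, index=close_idx)
    for t_in, t_out in t1.items():
        # Label-based slicing on a DatetimeIndex includes both endpoints.
        count.loc[t_in:t_out] += 1.0
    return count

def avg_uniqueness(t1, co_events):
    """Average of 1/concurrency over each label's lifespan."""
    out = pd.Series(index=t1.index, dtype=float)
    for t_in, t_out in t1.items():
        out.loc[t_in] = (1.0 / co_events.loc[t_in:t_out]).mean()
    return out

# Hypothetical toy data: six daily bars and three labels.
# t1 maps each label's start time (index) to its end time (value).
bars = pd.date_range("2020-01-01", periods=6, freq="D")
t1 = pd.Series([bars[2], bars[3], bars[5]], index=[bars[0], bars[2], bars[4]])

co_events = num_co_events(bars, t1)
uniqueness = avg_uniqueness(t1, co_events)
print(co_events)
print(uniqueness)
```

In this toy setup the first two labels overlap on the third bar only, so that bar has a concurrency of 2 and the two labels receive average uniqueness of 5/6 and 3/4 respectively, while the third label overlaps with nothing and keeps a uniqueness of 1.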