Sharpe and the number of days

Inspired by the discussion about RT Forex North, I made a plot of the Sharpe ratio and the number of days that the system exists for all forex systems in the Grid. I removed one outlier with a Sharpe of -40. You can see the plot at



http://www.xs4all.nl/~julese/Sharpe_days_forex.pdf



Very interesting. So it supports your theory that a system’s Sharpe ratio decreases over time. If I remember right, you were suggesting that a system that has been running twice as long as another, and that has a larger drawdown and a worse Sharpe, is not necessarily a worse system. As the data seem to suggest, the longer a system has been trading, the more likely it is to have encountered its max drawdown, and it will have a worse Sharpe (and drawdown stats, naturally) as a result.

Yes. Formulated more precisely, the Sharpe ratios tend to go to 0 in the long run. Positive Sharpes will become smaller, negative Sharpes will become larger (= closer to 0).



(A note of caution: another interpretation is that, as C2 grows, a wider variety of systems is launched. A more detailed study would compute, for each system, the Sharpe when it was young and when it is old.)



So with respect to the effect of "luck": It is true that having multiple systems increases the chance of having one "lucky" system, but it seems that it is also true that younger systems are more affected by (good or bad) luck.

It suggests (to me) that it may be better to compare systems of different ages on the basis of their t-value (see advanced statistics guide), or even better their p-value (idem), or equivalently, -log of the p-value. The forex systems with -log(p) > 1.5 are:



System                    Age (days)  Sharpe  T-value  -log(p)
Team Aphid Bird                  183     5.5     4.01     4.36
positive forex                   440     3.3     3.73     3.97
Forex Fighter GBP/USD KO         360     2.7     2.76     2.52
SNIPER ¤ 4X                      166     4       2.78     2.52
ForexEmail                       211     3.3     2.58     2.28
Gold Survivor Forex Port         180     3.1     2.24     1.88
Entropia                         611     1.6     2.13     1.78
VN Forex Club                    221     2.6     2.08     1.72
TrueOrange ForeX                  48     5.7     2.13     1.72
ATForex                           67     4.6     2.03     1.63
GFCM fx2                          23     8.1     2.09     1.62



According to this metric, the Sharpe of 5.7 of TrueOrange ForeX and the 2.6 of VN Forex Club would be comparable to the Sharpe of 1.6 of Entropia, because Entropia is much older.
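For anyone who wants to redo the age correction, here is a minimal Python sketch. It rests on two assumptions that are not stated in the thread: that the t-value is roughly the annualized Sharpe times the square root of the age in years (with 365 days per year, which roughly reproduces the table), and that a normal approximation is good enough for the one-sided p-value. The function names are made up for illustration.

```python
import math

DAYS_PER_YEAR = 365  # assumption: calendar days, which roughly matches the table


def t_from_sharpe(annual_sharpe, age_days):
    """Approximate t-value: annualized Sharpe times sqrt(age in years)."""
    return annual_sharpe * math.sqrt(age_days / DAYS_PER_YEAR)


def neg_log10_p(t_value):
    """-log10 of the one-sided p-value, using a normal approximation
    (the t-distribution gives slightly larger p-values for young systems)."""
    p = 0.5 * math.erfc(t_value / math.sqrt(2))
    return -math.log10(p)


for name, age, sharpe in [("Entropia", 611, 1.6), ("TrueOrange ForeX", 48, 5.7)]:
    t = t_from_sharpe(sharpe, age)
    print(f"{name}: t = {t:.2f}, -log10(p) = {neg_log10_p(t):.2f}")
```

Both systems come out with a t-value close to 2.07, near the 2.13 listed for both in the table (the small gap is probably due to the rounded Sharpe values). That is the sense in which a Sharpe of 5.7 after 48 days is comparable to a Sharpe of 1.6 after 611 days.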

I think Sharpe was only part of it. The profit factor was more interesting to me, and that was loads higher for VN.



I would be curious to see your analysis of this…

See



http://www.xs4all.nl/~julese/Profit_factor_days_trades_forex.pdf



The first figure is profit factor on number of days. You don’t see very much on this figure because there are some outliers that I don’t want to remove because they may contradict my theory (this wasn’t the case with the outlier in the Sharpe).



The second figure plots the log of the profit factor on the number of days. This makes the distribution more symmetrical and the outliers less extreme.



The third figure plots the log of the profit factor on the number of trades, which is more logical because the profit factor is computed on trades, not days.



My conclusion is that the profit factor displays the same pattern with respect to the number of trades.



I am less sure that the same formulas can be used for an age-correction here. The “t-value” would be obtained as log(profit factor) * sqrt(#trades). I don’t think that this has a t-distribution and therefore it makes no sense to compute the p-value from it. If I sort on this “t-value”, I get a lot of systems with a value above 2:



System                    Age (days)  Trades  Profit factor  log(PF)*sqrt(#trades)
positive forex                   440     126            4.5                   7.33
FX-Traders.com Forex Str         209      15           63.4                   6.98
Plutarch Hedge                   387     340            1.9                   5.14
LTR FX                           184     160            2.5                   5.03
Team Aphid Bird                  183     156            2.3                   4.52
GBP/USD Power Trade              152      82            3.1                   4.45
(FREE!!!) Monthly FX Sys         359      17            9.7                   4.07
CTS ForexTrends                  198     105            2.4                   3.90
Tradency 2 Testting              417       5           44.7                   3.69
ForexEmail                       211     164            1.9                   3.57
FXParadise                       296     198            1.7                   3.24
Foreximo                         235      59            2.6                   3.19
Forex Fighter GBP/USD KO         360     176            1.7                   3.06
Gold Survivor Forex Port         180     265            1.5                   2.87
M-GEAR                           186      21            3.9                   2.71
ATForex                           67      40            2.6                   2.62
TrueOrange ForeX                  48      30            3                     2.61
Trading Fork extremes            421     206            1.5                   2.53
Entropia                         611     268            1.4                   2.39
GFCM fx2                          23      36            2.5                   2.39
SNIPER ¤ 4X                      166     259            1.4                   2.35
FA (Pt)                           10       5           10.9                   2.32
VN Forex Club                    221      96            1.7                   2.26
Swift Pips                       529      20            2.9                   2.07
FR Helper                         43      27            2.5                   2.07
Broadsword Forex                 301      80            1.7                   2.06



So I think that the profit factor 2.6 of Foreximo is actually a little less than the 1.7 of FXParadise after correction for the number of trades, since the “t-values” are 3.19 and 3.24 respectively.
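The Foreximo and FXParadise comparison can be checked with a short Python sketch. The listed values are reproduced when the logarithm is taken to base 10. The function name is hypothetical, and as said above this score is only a rough ranking device, not a real t-statistic.

```python
import math


def trade_adjusted_score(profit_factor, n_trades):
    """log10(profit factor) * sqrt(number of trades), as in the table above."""
    return math.log10(profit_factor) * math.sqrt(n_trades)


for name, pf, trades in [("Foreximo", 2.6, 59), ("FXParadise", 1.7, 198)]:
    print(f"{name}: {trade_adjusted_score(pf, trades):.2f}")
```

A profit factor of exactly 1 (break-even) gives a score of 0 regardless of the number of trades, which is why the log is a natural scale here.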



With respect to the 2 systems we discussed previously, RT Forex has 1.89 on this metric and VN Forex 2.26. With the t-value based on the Sharpe ratio, RT Forex has 1.43 and VN Forex has 2.08. Both statistics still suggest that VN Forex is better, but not so much as you might think when you see the raw statistics.



BTW, Ross, my theory is actually a version of what you have always said, namely that no system lives forever and that good statistics of young systems are probably a matter of luck. I just add to it that systems of moderate age fit into the same pattern somewhere between these two extremes.

Jules, two remarks:

1. What about systems that were terminated? Don’t they have a whole bunch of days with returns of zero after the date of termination? Wouldn’t that result in Sharpe ratios going to zero?



2. If the aim is to compare systems directly against each other, shouldn’t we take the returns or trades over the same time period for both systems (equal to the length of the shorter system), and test whether the difference between the systems for some statistic of interest (e.g. profit factor, sharpe ratio etc.) is different from zero?



I bet that for most systems with reasonable-looking equity curves, the difference would be statistically insignificant.
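Remark 1 can be illustrated with a small Python sketch. The returns are made up, the Sharpe is left per-day (not annualized), and the risk-free rate is taken as zero. It only shows the direction of the effect when idle days with zero return are appended, not how C2 actually computes its statistics.

```python
import statistics


def daily_sharpe(returns):
    """Mean over (population) standard deviation of daily returns, not annualized."""
    return statistics.mean(returns) / statistics.pstdev(returns)


# Made-up daily returns while the system was actively trading.
active = [0.010, -0.004, 0.006, 0.012, -0.008, 0.005, 0.009, -0.002]
# The same history followed by 24 idle days after termination.
padded = active + [0.0] * 24

print(f"active only: {daily_sharpe(active):.3f}")
print(f"with idle days: {daily_sharpe(padded):.3f}")
```

The padded history has a lower Sharpe than the active one, because the mean shrinks faster than the standard deviation when zeros are appended.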

ST,



1. You’re right. I didn’t realize that. But this has no effect on the relation of the profit factor with the number of trades, where we see the same pattern. And I think that many systems stopped because they were not profitable anymore, i.e. that the Sharpe went to 0 before they stopped trading. Nevertheless, I agree that the analysis should be redone using, for each system, only the number of days before it stopped trading and the Sharpe at that moment. But this is not easily downloadable, so I won’t do it.



2. The problem with that suggestion is that you ignore data that contain relevant information. In the example of RT Forex and VN Forex, this would mean that you consider both systems over a period of 220 days, as I did previously, but in this way you ignore about 200 days of RT Forex. Why are these not relevant?



Yes, I expect that the differences between two good systems are often not significant, but my points in the present thread are somewhat more general. One point is that statistics like the Sharpe and the profit factor seem to be subject to a kind of regression to the mean (0 or 1). So when you see an exceptionally high Sharpe or profit factor, it is most likely to be lower later (at least, this is my conjecture). Another point is that such exceptional statistics are more likely with younger systems, and that selection on the basis of these statistics creates a bias against older systems. A third point is that most advantages of systems seem to exist only temporarily (as you would expect according to the efficient market hypothesis).

Jules



I mulled your analyses. Although it is interesting, it still bothers me, as I see no reason for there to be a natural drop. Often, research like yours can take two paths. For example:



1) A scientist/mathematician has a hypothesis, and they gather sufficient data in an unbiased fashion to see whether their hypothesis is supported.



2) A statistician runs analysis on a sufficient set of data to see what the data says. Usually, they run some follow-on testing to ensure the results are not spurious and that the results are trustworthy.



Yours seems to be the second variety. Some problems I see include



1) One of the lists you have above has about 15 entries. Another has perhaps 30 or less. In statistics, 30 is often listed as the BEGINNING of statistically significant. Your sample size really does not prove your relationship of declining Sharpe/PFs, but at most, can be called "interesting."



2) I don’t think it accounts for numerous other possible explanations. You need to be objective, and not make assumptions. You need to generate other plausible reasons. For example, when people analyze CTA returns, there is something called survivorship bias. This is where poor-performing CTAs drop out, and are not counted in the annual/quarterly results. This incorrectly pushes up the performance of CTAs as a whole.



Especially with PF, there is no reason for a system to automatically drop. Either vendors have ability or not. The thought that the market structure changes and causes vendors is interesting. But again, it is not proven, and more cases need to be thought up.



But it is more my belief that the strong majority of systems on C2 that look good do not really work. I still maintain that most top systems simply look good, because when C2 tracks a few hundred systems, that the top 5-10% over a few months are simply due to luck producing a trading-results string that is fortuitous. That is why after seeing an interesting, top-performing system, I follow it for a few weeks. Invariably, most of them stop their high performance. And more strongly is the complaint that “how come it stopped working once I got in?” because getting in on the lucky string of a “top” lucky system is going to produce/experience randomness… Randy and someone else (ST?) just said that, about getting into VN Forex, I think. And how it suddenly went into DD.



I would say for example, something like positive forex has pretty much proven itself. And the views on it demonstrate this. ES Shark has not yet, but might if it keeps it up another month or two (since it trades a lot).

> Especially with PF, there is no reason for a system to automatically drop.



Not automatically, but probably. Most systems, even good ones, are tailored to types of situations (trends, channels, chop, whatever). Often a new system is “lucky” to be in a situation that flows with its theory. Aestreux was an oversimplified example: it was bullish the EURUSD, and when that strong move broke, it broke the system. The theory can be more robust and/or elegant, but every system has strong suits and weaknesses, and over time they are exposed.



> I would say for example, something like positive forex has pretty much proven itself.



Pretty much, but some of these FX trends have been around longer than even positive forex (even C2!). Markets can trend for a long time and then change character. Don’t get me wrong, this system has been great, and I’m not saying it will not be great in the future, but it is wise to see why a system has done well and if that situation still exists.



>And the views on it demonstrate this. ES Shark has not yet, but might if it keeps it up another month or two (since it trades a lot).



Well, maybe. Again, let’s say a system does 10,000 long trades in a month in a steep uptrend. Does it prove the system will work in choppy or down markets “since it trades a lot”? No way.



BTW, on this note of handling different market characteristics a tip of the hat to TMG. Don’t you agree Ross?



Ross,



I formulated the theory before I analyzed the data, in the RT Forex discussion. Or was this theory not explicit enough in your opinion?



1) The total sample, used in the graphs, included all 160 forex systems that were listed on the Grid. The lists in my posts were only the “top” systems according to that metric.



2) Yes, there are alternative explanations. I pointed that out in my post to Jon already. In the language of social scientists, the present sample is “cross-sectional”, and has all the limitations of such studies. It would be interesting to analyze “longitudinal” data, i.e. to track the same sample of systems over time.



"I still maintain that most top systems simply look good, because when C2 tracks a few hundred systems, that the top 5-10% over a few months are simply due to luck producing a trading-results string that is fortuitous."



That is one of the points that I’m trying to illustrate.



"I would say for example, something like positive forex has pretty much proven itself."



That agrees with the top position that it got in these lists, if I remember correctly (I can’t see that post while I am typing this one).



PS. According to my dictionary mulled means “to heat and spice”. I don’t understand its meaning in your first sentence.

Verb: mull over

Reflect deeply on a subject

but i like Jules’ definition better LOL

"> Especially with PF, there is no reason for a system to automatically drop. Not automatically, but probably. Most systems, even good ones, are

tailored to types of situations (trends, channels, chop, whatever). "



"Probably" is a reasonable thought. It is the lack of statistical significance in the trading world that galls me. Although Jules is a very sober individual (except perhaps on the occasional Friday night Happy Hour…).

"BTW, on this note of handling different market characteristics a tip of the hat to TMG. Don’t you agree Ross? "



It is certainly still around…



> "BTW, on this note of handling different market characteristics a tip of the hat to TMG. Don’t you agree Ross? "



> It is certainly still around…



Awww… maybe you need to join Jules for that drink and lighten up a bit ;-)))).



TMG not only handled the CIT, but unlike most other FX or Index systems it is up strong on the decline. This nimble ability is noteworthy and commendable.

"It is the lack of statistical significance in the trading world that galls me."



Ah, I guess I was not clear enough about this. What I suggested for the Sharpe is actually an ordinary t-test of significance. The null hypothesis is that the system is not more profitable than the riskfree asset. The alternative hypothesis is that the system is more profitable than the riskfree asset. You reject the null hypothesis in favor of the alternative hypothesis if the p-value is small, e.g. p < 0.05.



The p-value is computed from the Sharpe ratio, where the t-value is an intermediate step. Now what I say is that if you compare systems of different ages, it is better to look at this p-value than to look at the Sharpe ratio itself. The reason for this is that the p-value includes a correction for the age. Loosely speaking it can be viewed as a measure of the certainty that the system is profitable (where a smaller p-value means more certainty).
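To make the age correction concrete, here is a minimal Python sketch of the ordinary one-sample t-statistic computed directly from daily excess returns. The returns are made up and the function name is hypothetical. The two series have the same mean and nearly the same spread, so they differ essentially only in length.

```python
import math
import statistics


def one_sample_t(excess_returns):
    """t statistic for H0: mean daily excess return is not above zero."""
    n = len(excess_returns)
    mean = statistics.mean(excess_returns)
    sd = statistics.stdev(excess_returns)  # sample standard deviation
    return mean / (sd / math.sqrt(n))


pattern = [0.004, -0.003, 0.002, 0.001, -0.002, 0.003]  # made-up daily excess returns
young = pattern * 5   # 30 days
old = pattern * 40    # 240 days of the same pattern

print(f"young: t = {one_sample_t(young):.2f}")
print(f"old:   t = {one_sample_t(old):.2f}")
```

The Sharpe ratio (mean divided by standard deviation) is almost the same for both series, but the t-value of the older series is larger by roughly the square root of 240/30, which is what the p-value rewards.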



I could think of better tests, but this p-value has the practical advantage that it is listed in the advanced statistics.



For the best systems the p-values are all very close to 0, and to make the differences better visible I used -log(p). This is not a matter of substance but only of presentation, comparable to the difference of using a linear scale or a logarithmic scale in graphs. I used 10 as base, so a 1 means p = 0.1, a 2 means p = 0.01, a 3 means p = 0.001, etc.



Surely it is true that a more rigorous statistical analysis would be needed if I intended to publish this in a scientific journal. But this is a forum. I am afraid that nobody (except ST) will read it if I become too scientific. You yourself pointed out that many people don’t understand the Sharpe ratio. My posts above are already several levels above that. I think that most people skip it, thinking “there we have Jules again with his 100-paragraph headache”. Then, what’s the point of making the analysis 10 times more sophisticated? In a case like this I believe that it is better to use a statistically less optimal method, so that at least a few readers will understand the basic idea, provided that I believe that a better method will probably produce essentially similar results.



Moreover, if I read your comments then I get the impression that we actually agree on most points. So I don’t think that another statistical analysis would change much. If we disagree on anything then it is not the theory itself but the conclusions that we draw from the theory. If you believe that systems with a high Sharpe are often just being lucky, and I think you said that - and I agree with it - then for me the conclusion is that the Sharpe ratio itself is not the best measure to select systems on and that we should correct for this luck factor. This implies a correction for age, because the contribution of luck will decrease with age. A similar argument applies to the profit factor.



BTW, because reading the posts here is becoming a day job, I missed the happy hour and the girls - who would not be interested in me anyway - so I just paid more for my own drinks and I hope I am not talking nonsense again :wink:

I agree that the t-test or p-statistic, or more generally, a confidence interval around the statistic of interest (either the Sharpe ratio, profit factor or whatever you like) is helpful. But another question is whether it is converging to some stable estimate. If it’s not, then the law of large numbers doesn’t apply, and we’re in real trouble. That’s why I track this convergence and for a system like ARS for example, it has been converging quite nicely and appears to be pretty stable during the last year, although skeptics might see a slight downward slope:

http://scitra.blogspot.com/2007/07/ars-sharpe-ratio.html



Perhaps a test exists for this convergence? I haven’t really looked into that…
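One simple way to track such convergence is to recompute the Sharpe on an expanding window. The following Python sketch is only an illustration under stated assumptions (risk-free rate of zero, 365 days per year, at least 20 days before the first estimate) and is not necessarily how the linked chart was made. The function name and parameters are hypothetical.

```python
import math
import statistics


def expanding_sharpe(daily_returns, min_days=20, days_per_year=365):
    """Annualized Sharpe computed on days 1..k for each k >= min_days."""
    path = []
    for k in range(min_days, len(daily_returns) + 1):
        window = daily_returns[:k]
        sd = statistics.stdev(window)
        if sd == 0:
            continue  # Sharpe is undefined when returns do not vary
        sharpe = statistics.mean(window) / sd * math.sqrt(days_per_year)
        path.append((k, sharpe))
    return path


# Made-up daily returns, 60 days.
returns = [0.002, -0.001, 0.003, 0.0, -0.002, 0.001] * 10
for day, sharpe in expanding_sharpe(returns)[::10]:
    print(day, round(sharpe, 2))
```

Plotting the second element of each pair against the first gives a curve that flattens out if the estimate is stabilizing.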

"The null hypothesis is that the system is not more profitable than the riskfree asset. The alternative hypothesis is that the system is more profitable than the riskfree asset."



But the conclusion was that the systems see a decline over time when compared, which I do not see as this statement (above). I just think that there is a long list of other possible explanations, besides "Sharpe, PF, and xxx decline over time on C2."



That was really what I was driving at.

ST, I am not sure what you mean and how technical you want to become.



"But another question is whether it is converging to some stable estimate."



I have written a long and a short answer. The short answer is: I don’t know, this depends on the stability of the market conditions (and in case of discretionary systems, the stability of the vendor). For forex systems my first plot suggests that the Sharpe ratio converges to 0. ARS is not a forex system though.



The long answer is rather theoretical, and I can post or e-mail it if that was what you meant. In summary: if the distribution of returns is constant then the Sharpe converges.



"Perhaps a test exist for this convergence?"



Again, I don’t know how sophisticated you want to be. What I would do is this: The Sharpe ratio has a non-central t-distribution and this is quite inconvenient to work with, even if you have a statistics program with functions for it (unlike Excel). I would rather test that the mean return doesn’t change. To do this, you can split the history into two periods. Then conduct a t-test for independent samples where you compare the mean daily return of the first period with the mean daily return of the second period. Alternatively, and more relevant for the Sharpe, you can use the daily excess log return rates. You know the assumptions. If you want more periods, use a one-way analysis of variance. If you want no distributional assumptions, use a Kruskal-Wallis test. If you want to test for a downward trend, use a regression analysis, and then you don’t need to split into periods, but you might want to do a nonlinear transformation on the number of days because any decay of the mean return is probably not linear. If you want to assume an autocorrelation then a time-series analysis could be appropriate, but there are a lot of versions of that.
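Two of these options can be sketched in plain Python. This is only an illustration. The split-half comparison below uses the Welch form of the two-sample t-statistic (unequal variances) rather than the pooled version, and the regression uses log(day number) as one possible nonlinear transformation. Both choices are assumptions, and the function names are hypothetical.

```python
import math
import statistics


def split_half_t(daily_returns):
    """Welch t statistic comparing mean return of the first and second half."""
    half = len(daily_returns) // 2
    first, second = daily_returns[:half], daily_returns[half:]
    se = math.sqrt(statistics.variance(first) / len(first)
                   + statistics.variance(second) / len(second))
    return (statistics.mean(first) - statistics.mean(second)) / se


def trend_slope(daily_returns):
    """Least-squares slope of daily return on log(day number)."""
    x = [math.log(day) for day in range(1, len(daily_returns) + 1)]
    mx, my = statistics.mean(x), statistics.mean(daily_returns)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, daily_returns))
    sxx = sum((xi - mx) ** 2 for xi in x)
    return sxy / sxx


# Made-up history: a good first 50 days followed by a weaker 50 days.
history = [0.005, 0.003] * 25 + [0.002, 0.000] * 25
print(f"split-half t: {split_half_t(history):.2f}")
print(f"slope on log(day): {trend_slope(history):.5f}")
```

A large positive split-half t, or a clearly negative slope, would point to a declining mean return. The statistic still has to be compared against the appropriate t-distribution to get a p-value.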