The Blog of Florian Dahms - How to Analyze Historical Stock Data with Ruby

First of all I want to thank you for all the positive feedback I got on the last article. It is wonderful to know that I have such an awesome readership in you!

Now to todays topic: I want to share how I gathered the data for the last article. There are two parts to it. First the historical data points need to be collected from somewhere. I chose Yahoo Finance, because it has an openly accessible API and provides most of the data I needed (sadly not all, as we will see later). The second part is to calculate the values I needed for the article (i.e. for the graphs and tables). I used Ruby for these two tasks. The graphs where then generated with R. That part only consisted of generating plots. I will therefore not elaborate on it here and hope for a chance to tell more about R in some other article in the future.

Parsing historical stock prices from Yahoo Finance

How to get the original data points? When you are on the Yahoo Finance page it is easy to look up historical data points when manually navigation through their interface. But while possible to emulate the clicks a human does on the website (e.g., Ruby has the Mechanize Gem for that), it would be much nicer to have some API that just returns the data we want, e.g, in CSV format.

After a little search it turns out the Yahoo provides such an API, although it is a little hidden. Ido Green explains on his Blog how to use it. The URL we need is

http://ichart.finance.yahoo.com/table.csv?s=...&a=...&b=...&c=...&d=...&e=...&f=...&g=d&ignore=.csv

(dots indicate places we need to fill in)

s has to be assgned the symbol of the stock we are searching for (you can get them e.g, on the Yahoo Finance Page or in this Excel Spreadsheet)
a is the day of the start date - note that the API assumes that the first day of the month is 0
b is the month of the start data - here January translates to 1, December to 12
c is the year of the start date
d,e,f are the day, month and year of the last date we want data for (with the same peculiarities as before)

When playing around with the API, I noticed that it does not matter when the start date is set to a point in time that is before the stock started to exist. The API will just return the available data. As I always want all the available data, I can therefore just set the start date on a date that precedes the foundation of the indices I want to look at, e.g., the 1.1.1800.

A drawback of the API is, that it does not (for a reason I don't know yet) return data for all available symbols. For that reason I did not analyze, e.g., the Dow Jones Industrial Average.

The first step is to download the data from Yahoo. This can be done via the following commands (here for the S&P 500 with the symbol ^GSPC):

require 'open-uri'
require 'uri'

symbol = "^GSPC"

file = open(URI.escape("http://ichart.finance.yahoo.com/table.csv?s=#{symbol}&a=00&b=01&c=1800&d=#{Time.now.month-1}&e=#{Time.now.day}&f=#{Time.now.year}&g=d&ignore=.csv"))

From open-uri we get the open command that lets us obtain the data from a given URL. To avoid building malformed URLs, due to special characters like "^", one can use URI.escape to translate those into a URL safe form. Now parsing the retrieved data is as simple as

require 'csv'

raw_data = CSV.parse(file)[1..-1]

where [1..-1] is the Ruby notation for taking all but the first entry from an Array. The first row of our returned data set is just a header row that we can discard.

Because every entry is returned as a string by default, we want to map them into more usable data types. There are 7 columns, with the following content:

Date: the date of this data point. All other columns refer to that day
Open: the initial price
High: the maximum price
Low: the minimum price
Close: the last price
Volume: the amount of traded shares
Adj Close: the adjusted close is the close value, adjusted so that the price is comparable over time. E.g. if the stocks of a company are split, their value decreases - but an investor is not affected, as she now holds more shares. Adjusted close takes such events into account and normalizes the price.

In most cases it makes sense to look for the adjusted close to avoid inconsistencies.

Mapping the strings from the CSV parser into more useful data types works via

data = raw_data.map {|e| [Time.parse(e[0]), e[1].to_f, e[2].to_f, e[3].to_f, e[4].to_f, e[5].to_i, e[6].to_f]}

Interesting is the Time.parse function as it usually does a very good job determining the date that is indicated by a string.

Calculating best case performances

Best case performances? Wasn't it all about worst cases?

What interests us is the return rate that we can set as our target. And we should only set a target that we will be able to withdraw within our time limit of n days (or months, or years). We can get that return rate by looking at all possible n day periods, each time taking the best possible return rate within the period, and then take the number we got from all of them. That is the return we could have made in any period.

So here is what our algorithm shall do

For each day i in the historical data that is at least n days before today
- For each day j between i and i+n
  - Calculate the return between i and j. It is important to calculate annual returns in order to make them comparable
- Remember the best case performance among these

This is by no way the most efficient way to obtain this data. But a more clever way to calculate these values would likely make the code more complex and error prone and therefore take more development time. Not a good trade off for a script that is supposed to run only once.

The following listing contains the Ruby code that performs these calculations. It is not pretty, but again getting the job done is more important if there is going to be only one run:

latest_date = data[0][0]
best_case_performances = Array.new

# Look at all entries starting with the oldest
i = data.size - 1
while i >= 0 # just to be sure

	# Variable to store the incumbent best solution
	max_return = nil
	max_j = nil

	# look at the following days until we reach the limit of n days
	j = i - 1
	while j >= 0 && (data[j][0] - data[i][0]) / (60*60*24) <= n
		# Calculate performance on day j
		n_days = ((data[j][0] - data[i][0]) / (60*60*24)).to_i
		period_return = data[j][6] / data[i][6] - 1
		annual_return = (1 + period_return) ** (365.0 / n_days) - 1

		# Update the incumbent
		if max_return.nil? || annual_return > max_return
			max_return = annual_return
			max_j = j
		end

		j -= 1
	end

	# Save the best performance for the n day period starting at i
	best_case_performances << max_return

	# End loop if the most current date was reached
	if j <= 0
		return best_case_performances
	end
	i -= 1
end

Among the values returned we just need to take the minimum. And of course repeat this procedure for different choices of n and different indices / symbols.

Afterwords

I hope especially the first half could give you an idea how simple it can be to retrieve data from a web source and bring it into a form you can work with, given you have the right tools.

If you are interested, you can find the code (a little more polished) on GitHub.

Header photo by Death to the Stock Photo