IMDB data parsing with regular expressions

Programming comments edit

I need a large dataset to practice on SQL server while preparing to exams. One idea is to generate random data but there is no fun in that. Because it is hard to create real-life scenarios but gibberish data. Then I found out that IMDB actually releases all their data and allows usage for non-commercial applications.

Data Analysis

Unfortunately, IMDB data doesn’t come in a very structured format. You have to deal with the exceptional cases and create your own data structures. I found some nice info in a Github repo called imdb-blaster (link in the resources section).

I started with the main file to parse: Movies.lst. This file contains movies, TV shows and their episodes. I want to store these in separate tables so I wanted to treat them separately. What they look like in a nutshell is as follows:

  • Movies: The Lord of the Rings: The Return of the King (2003)    2003
  • TV Shows: “Seinfeld” (1990)                    1990-1998
  • Episodes: “Seinfeld” (1990) {Male Unbonding (#1.4)}        1990

All TV Shows are enclosed with double-quotes. There are also some other optional attributes that need to be taken into account. Such as deleted items are marked with like “Seinfeld” (1990) {Good News, Bad News (#1.0)}     1989

Start parsing

I don’t like regular expressions too much as they are very hard to read. But they are very helpful and powerful in scenarios like this. The goal is validating and parsing it to groups so that it can be inserted easily to a relational database. So I developed 3 regular expression patterns for the type of data in the file. One thing I learned about regular expressions is that you can group parts by using parenthesises. This creates an unnamed group. If you want to name them you can use angle brackets and assign a name to group such as: (subexpression)

Parsing movies

As most regular expressions tend to wind up with hard-to-read complex queries this was no exception! Even if the structure is simple because of the optional parameters and exceptional cases it became very complex very soon. Here’s the sample code to run in LINQPad to parse a sample line for a movie:

string line = @"America the Punchline (2009) 		2009";
string pattern = @"(^.*)\((\d{4}|\?{4})(/I|II|III)?\)([ ]?)()(\(V\)|\(TV\))?\t*(\d{4})$";
Console.WriteLine(Regex.IsMatch(line, pattern));

MatchCollection matches = Regex.Matches(line, pattern);
foreach (Match match in matches)
{
	Console.WriteLine("Group count: {0}", match.Groups.Count);
	foreach (Group matchGroup in match.Groups)
	{
		Console.WriteLine("{0}", matchGroup.Value);
	}
}

And the output is: ``` True Group count: 8 America the Punchline (2009) 2009 America the Punchline 2009

2009 ```

The first group is the whole string. The second one is the title, the third is the year. There is an optional group because some titles come with I, II or III after the release date. I have no idea what that means but had to break it into a group to avoid an unmatch. The last two parameters are the type (V: Video, TV: Television, VG: Video game) and another date. I don’t know why the movies have two dates and they almost always match so I’ll ignore one of them. So the actual code is very simple:

if (regexMovie.IsMatch(line))
{
	Match match = regexMovie.Match(line);

	commandMovie.Parameters.Clear();
	commandMovie.Parameters.AddWithValue("@Name", match.Groups[1].Value);
	commandMovie.Parameters.AddWithValue("@ProductionYear", match.Groups[2].Value);
	commandMovie.Parameters.AddWithValue("@IsDeleted", (match.Groups[5].Value == ""));
	commandMovie.Parameters.AddWithValue("@ReleasedFor", match.Groups[6].Value);
	commandMovie.ExecuteNonQuery();
}

All the hard work is done by regular expressions already. All we have to do is to check is a line matches a movie definition and if it does just insert the data that is already broken up into groups.

Parsing the rest

I’ll publish the full source code in a Github repository and will update this post with the link. But I guess to start parsing the movies file the other two patterns suffice for now: TV Show Pattern: (^””.””) (((\d{4}|\?{4})(/I|II|III)?))\t.(\d{4}-\d{4}|\?{4})$ Episode Pattern: (^””.””) ((\d{4}|\?{4})(/I|II|III)?) {(.?)(#(\d).(\d))}\t*(\d{4}|\?{4})$

Resources

Comments