Data Mining Portfolio

Thursday, April 30, 2009

Portfolio Assignment 9

Our final project extended the movie clustering project we worked on earlier this year. We decided to focus our clustering on genre, year, and rating. Our goal was to see how time affected the criteria for each genre. The most prominent example is how the horror genre has changed over time. Today, most people would find horror movies from the 1930s and 1940s not scary at all and maybe even a little funny, but when those movies first debuted, they were some of the scariest movies ever made. Our dendrogram was able to differentiate older horror movies from newer ones and put them in different clusters.
We chose to cluster on genre, year, and rating because adding keywords would have vastly complicated our project and most likely been of minimal benefit, since keywords would not help with what we were trying to study. Many keywords, such as vampire, show up in more than one genre. Some drama and chick flick movies also have vampires in them, so these keywords would interfere with getting a good horror movie cluster.
As far as future improvements to our clustering algorithm, we would have liked to work out a way to differentiate between subgenres. There are multiple kinds of chick flicks and it would be nice to have a way to separate them, especially to move the teen drama movies into their own group. Often a few legitimately good chick flicks are released (as in, these are actually really good movies), but so are several not very good teen melodramatic ones, and they would all share the same year. Breaking the chick flick category into subgenres would keep the highly rated movies from being lumped in with the melodramatic ones.
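
To make the idea of clustering on genre, year, and rating concrete, here is a minimal sketch. It is not the project code: the genre list, the year range used for scaling, the 0-10 rating scale, and the three movies are all made up for illustration, and plain Euclidean distance is just one reasonable choice.

```python
import math

GENRES = ["horror", "romance", "drama"]  # hypothetical genre list


def movie_vector(genre, year, rating, first_year=1930, last_year=2009):
    """Encode one movie as a genre one-hot plus scaled year and rating."""
    one_hot = [1.0 if genre == g else 0.0 for g in GENRES]
    year_scaled = (year - first_year) / float(last_year - first_year)
    rating_scaled = rating / 10.0  # assumes ratings run from 0 to 10
    return one_hot + [year_scaled, rating_scaled]


def distance(a, b):
    """Euclidean distance between two feature vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Made-up movies, not taken from any real data set
old_horror_a = movie_vector("horror", 1935, 7.0)
old_horror_b = movie_vector("horror", 1940, 7.5)
new_horror = movie_vector("horror", 2005, 6.0)

print(distance(old_horror_a, old_horror_b))  # small: same genre, same era
print(distance(old_horror_a, new_horror))    # larger: same genre, decades apart
```

With an encoding like this, a hierarchical clustering would merge the two old horror movies long before it merged either of them with the new one, which is the kind of split by era described above.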

Thursday, February 12, 2009

Portfolio Assignment 3

I worked with Renee Carignan and Elizabeth Ramsey on this assignment.

I am glad Dr. Zacharski posted what we had to add to the pylast library. It saved us a lot of time and we really appreciate that. I don't think I could have figured that one out by myself.

Installing pylast was easy, unlike installing some of the libraries for Assignment 2.

We discovered that the last.fm API is not very well documented. Various function names are spelled differently than shown on their website. The function listed as getSimilar on the website was really get_similar. We spent a very long time staring at errors about this. I'm glad the autocomplete feature was there to help us.

We successfully made a band recommendation program on the command line. The user inputs a band and it outputs a list of similar bands the user might also enjoy. We used Python to do this, since it's what we were used to working with for this class.

Here's the band recommendation code:

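def recommend():
    # key, secret, and sk are assumed to be defined elsewhere
    # (last.fm API key, API secret, and session key)
    name = raw_input("Enter a band that you like: ")
    print(name)
    a = pylast.Artist(name, key, secret, sk)
    # ask last.fm for artists similar to the one the user typed
    similar = pylast.Artist.get_similar(a, key)
    for item in similar:
        print(item)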
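
For anyone trying this with a newer pylast, a rough equivalent might look like the sketch below. It assumes a release where a network object offers get_artist, where Artist.get_similar accepts a limit, and where each result exposes the similar artist as .item. The function name similar_bands and the default limit are made up for illustration.

```python
def similar_bands(network, band_name, limit=5):
    """Return the names of bands similar to band_name.

    network is assumed to be a pylast.LastFMNetwork created with your own
    API key and secret; nothing here talks to last.fm until it is called.
    """
    artist = network.get_artist(band_name)
    # each entry is assumed to carry the similar artist in its .item field
    return [entry.item.get_name() for entry in artist.get_similar(limit=limit)]
```

Passing the network in as an argument keeps the credentials out of the function, so the lookup can be tried with a fake network object before spending real API calls.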

Saturday, January 31, 2009

Portfolio Assignment 2

Python Stuff

I got pydelicious to work after editing the file to use the hashlib library instead of the md5 library, and then adding the feedparser library to the python26 folder. I don't know if it's in the right spot, but it's working and that's good enough for me.

I tried to build the dataset using the commands on the bottom of page 21 and got an error:
>>> from deliciousrec import *
>>> delusers=initializeUserDict('programming')

Traceback (most recent call last):
  File "", line 1, in
    delusers=initializeUserDict('programming')
  File "C:\Python26\deliciousrec.py", line 9, in initializeUserDict
    for p2 in get_urlposts(p1['href']):
  File "C:\Python26\pydelicious.py", line 803, in get_urlposts
    return getrss(url = url)
  File "C:\Python26\pydelicious.py", line 794, in getrss
    return dlcs_rss_request(tag=tag, popular=popular, user=user, url=url)
  File "C:\Python26\pydelicious.py", line 418, in dlcs_rss_request
    url = DLCS_RSS + '''url/%s'''%md5.new(url).hexdigest()
NameError: global name 'md5' is not defined

Turns out you can't just change the library name. I have to deal with the md5 deprecation warning instead. At least it works now.
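
The reason the rename alone fails is that the calls have to change too: md5.new(x) from the old module becomes hashlib.md5(x). Here is a small sketch of that substitution on its own, outside of pydelicious. The helper name url_hash is made up, and the encode step is only there so the same code also runs on Python 3.

```python
import hashlib


def url_hash(url):
    """Hex MD5 digest of a URL, the hashlib way of writing md5.new(url).hexdigest()."""
    # hashlib.md5 wants bytes, so encode text first (required on Python 3)
    if not isinstance(url, bytes):
        url = url.encode("utf-8")
    return hashlib.md5(url).hexdigest()


print(url_hash("hello"))  # 5d41402abc4b2a76b9719d911017c592
```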

The sections on building the data set, recommending neighbors and links, building the item comparison data set, and getting recommendations all worked. My output just didn't look nearly as neat as the book examples because the numbers were not nicely rounded to 3 decimal places. The actual data matched correctly though.

As for the MovieLens part, once I hardcoded my file path into the loadMovieLens function, it worked great for me.
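
If the long decimals are a bother, the scores can be rounded just before printing. This is a small sketch that assumes the results come back as a list of (score, name) pairs. The helper name round_scores and the sample values are made up.

```python
def round_scores(pairs, places=3):
    """Round the score in each (score, name) pair for display."""
    return [(round(score, places), name) for score, name in pairs]


# Made-up values, only to show the effect
raw = [(0.99124070716192991, "person A"), (0.38124642583151169, "person B")]
print(round_scores(raw))
```

Rounding only at display time keeps the full precision in the data that the recommendation functions actually use.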

def loadMovieLens(path='C:\Python26/data/movielens'):

That bit of code was quite important, especially the one slash that goes the other way...
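
The mixed slashes work partly by luck: \P is not an escape sequence, so Python leaves that backslash alone, but a folder name starting with t or n would have turned \t or \n into a tab or a newline. Here is a small sketch of two safer ways to write the same path, using a raw string or forward slashes only. It uses ntpath so the Windows rules apply no matter where it runs.

```python
import ntpath  # Windows path rules, importable on any operating system

mixed = r'C:\Python26/data/movielens'   # raw string, backslash kept literally
forward = 'C:/Python26/data/movielens'  # forward slashes only

# Both spellings normalize to the same Windows path
print(ntpath.normpath(mixed))
print(ntpath.normpath(forward))

# Why raw strings matter: '\t' is one tab character, r'\t' is two characters
print(len('\t'), len(r'\t'))
```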

Building the item-based recommendations took about a minute to complete on my desktop, which makes me glad I did not use my much slower laptop for this assignment. My outputs matched the book again, which made me quite happy.

Weka Part 1

This part seemed pretty straightforward to me. As long as you followed the book's directions you were fine. I'm still working on understanding the algorithms, but it seemed to do a good job in most cases. It had fewer errors than the 1 rule method on the weather data set.

Weka Part 2

I ran the J4.8 tree building algorithm on the data set and got the following results:

=== Summary ===

Correctly Classified Instances         235               77.5578 %
Incorrectly Classified Instances        68               22.4422 %
Kappa statistic                          0.5443
Mean absolute error                      0.1044
Root mean squared error                  0.2725
Relative absolute error                 52.0476 %
Root relative squared error             86.5075 %
Total Number of Instances              303

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall   F-Measure   ROC Area   Class
               0.83      0.29      0.774       0.83     0.801       0.809      <50
               0.71      0.17      0.778       0.71     0.742       0.809      >50_1
               0         0         0           0        0           ?          >50_2
               0         0         0           0        0           ?          >50_3
               0         0         0           0        0           ?          >50_4
Weighted Avg.  0.776     0.235     0.776       0.776    0.774       0.809

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 137  28   0   0   0 |   a = <50
  40  98   0   0   0 |   b = >50_1
   0   0   0   0   0 |   c = >50_2
   0   0   0   0   0 |   d = >50_3
   0   0   0   0   0 |   e = >50_4

It seems to have worked. This method seems fairly accurate, but I definitely would not bet my life on it because about 22% of the instances were classified incorrectly. There were also a lot more variables to consider in this set, and as far as machine learning goes, it did really well.
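
As a sanity check, the summary numbers can be recomputed from the confusion matrix by hand. This sketch uses only the two classes that actually have instances, and the variable names are my own.

```python
# Rows are the actual class, columns are the predicted class
matrix = [[137, 28],   # actual <50
          [40, 98]]    # actual >50_1

total = sum(sum(row) for row in matrix)
correct = sum(matrix[i][i] for i in range(len(matrix)))
accuracy = correct / float(total)

# For class <50: recall uses its row, precision uses its column
recall_a = matrix[0][0] / float(sum(matrix[0]))
precision_a = matrix[0][0] / float(matrix[0][0] + matrix[1][0])

print(total, correct)              # 303 235
print(round(accuracy * 100, 4))    # 77.5578
print(round(recall_a, 2), round(precision_a, 3))  # 0.83 0.774
```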

Friday, January 23, 2009

Portfolio Assignment 1

I got Python all set up. I admit that it took me a little while to figure out how to actually make and run Python files. It turns out that you must explicitly add .py to your file name, even though it says it's saving as type .py.

I was able to run recommendations.py.

Euclidean Distance
I added the sim_distance function from the book but got a different result than the book shows. It turns out that the result his code produces is right, and the result printed in the book is not. If you do the math yourself according to the formula, you get 0.294298..., which is what his code produced when I typed it into my version of recommendations.py. I spent a really long time figuring out why my code wasn't working, but it turns out the book was wrong :(
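
Here is the arithmetic as a small sketch. The ratings are what I believe the book's critics dictionary holds for Lisa Rose and Gene Seymour, so treat them as an assumption, and the formula is 1/(1 + sqrt(sum of squared differences)), which is the version that reproduces 0.294298. The function name euclidean_similarity is my own.

```python
import math

# Assumed ratings for the movies both critics rated
lisa = {"Lady in the Water": 2.5, "Snakes on a Plane": 3.5, "Just My Luck": 3.0,
        "Superman Returns": 3.5, "You, Me and Dupree": 2.5, "The Night Listener": 3.0}
gene = {"Lady in the Water": 3.0, "Snakes on a Plane": 3.5, "Just My Luck": 1.5,
        "Superman Returns": 5.0, "You, Me and Dupree": 3.5, "The Night Listener": 3.0}


def euclidean_similarity(a, b):
    """Return 1 / (1 + Euclidean distance) over the items both people rated."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0
    sum_of_squares = sum((a[item] - b[item]) ** 2 for item in shared)
    return 1 / (1 + math.sqrt(sum_of_squares))


print(euclidean_similarity(lisa, gene))
```

With these ratings the squared differences add up to 5.75. Leaving out the square root gives 1/6.75, about 0.148, while including it gives about 0.294298.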

Pearson Correlation Score
Worked great, minus the hour lost to a stupid indentation error. Turns out most of my stuff was in the for loop. Lovely. I got the book answer for this one.

Recommendations
This part worked fine for me.

Manhattan Distance
There aren't any good websites on Manhattan distance. I swear they don't expect you to actually need to use it. I couldn't quite figure out what the x's and y's were or how you add them up. But after emailing Dr. Zacharski I was able to implement it. This is what I came up with.

#Manhattan Distance Stuff
def manhattan(prefs,person1,person2):

    #Get the list of mutually rated items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]: si[item]=1

    #Find the number of the elements
    n=len(si)

    #if they have no ratings in common, return 0
    if n==0: return 0

    #Absolute rating difference for each shared item
    mdists = [abs(prefs[person1][item] - prefs[person2][item])
              for item in si]

    #Doing the thing at the bottom of page 10 to make it between 0 and 1
    #we are adding 1 before inverting so we don't divide by 0
    #(1.0 keeps the division floating point even if every rating is an integer)
    return 1.0/(1+sum(mdists))

I got a result of 0.1818181... for Lisa Rose and Gene Seymour, which I think is at least close to the right answer.
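
To double check that number, here is the sum worked out as a small sketch. The ratings are what I believe the book's critics dictionary holds for these two critics, so treat them as an assumption. The function name manhattan_similarity is my own.

```python
# Assumed ratings for the movies both critics rated
lisa = {"Lady in the Water": 2.5, "Snakes on a Plane": 3.5, "Just My Luck": 3.0,
        "Superman Returns": 3.5, "You, Me and Dupree": 2.5, "The Night Listener": 3.0}
gene = {"Lady in the Water": 3.0, "Snakes on a Plane": 3.5, "Just My Luck": 1.5,
        "Superman Returns": 5.0, "You, Me and Dupree": 3.5, "The Night Listener": 3.0}


def manhattan_similarity(a, b):
    """Return 1 / (1 + sum of absolute rating differences) over shared items."""
    shared = [item for item in a if item in b]
    if not shared:
        return 0
    total = sum(abs(a[item] - b[item]) for item in shared)
    return 1.0 / (1 + total)


print(manhattan_similarity(lisa, gene))
```

The absolute differences are 0.5, 0, 1.5, 1.5, 1.0, and 0, which add up to 4.5, so the score is 1/5.5, or about 0.1818.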