Data Mining Portfolio: January 2009

Python Stuff

I got pydelicious to work after editing the file to swap the md5 library for the hashlib library, and then I had to add the feedparser library to the python26 folder. I don't know if it's in the right spot, but it's working and that's good enough for me.

I tried to build the dataset using the commands on the bottom of page 21 and got several errors:
>>> from deliciousrec import *
>>> delusers=initializeUserDict('programming')

Traceback (most recent call last):
  File "", line 1, in
    delusers=initializeUserDict('programming')
  File "C:\Python26\deliciousrec.py", line 9, in initializeUserDict
    for p2 in get_urlposts(p1['href']):
  File "C:\Python26\pydelicious.py", line 803, in get_urlposts
    return getrss(url = url)
  File "C:\Python26\pydelicious.py", line 794, in getrss
    return dlcs_rss_request(tag=tag, popular=popular, user=user, url=url)
  File "C:\Python26\pydelicious.py", line 418, in dlcs_rss_request
    url = DLCS_RSS + '''url/%s'''%md5.new(url).hexdigest()
NameError: global name 'md5' is not defined

Turns out you can't just change the library name in the import, because the rest of the file still calls md5.new(). I went back to md5 and have to deal with the md5 deprecation warning instead. At least it works now.
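For what it's worth, here is a rough sketch of what a fuller switch to hashlib might look like for the one call in the traceback. The helper name url_digest is made up for illustration and is not part of pydelicious, and the UTF-8 encoding is an assumption. The idea is that hashlib.md5(...).hexdigest() stands in for md5.new(...).hexdigest().

```python
import hashlib


def url_digest(url):
    """Hex MD5 digest of a URL, in place of md5.new(url).hexdigest().

    url_digest is a hypothetical helper name, not a pydelicious function.
    """
    # hashlib works on bytes. On Python 3 a str has to be encoded first.
    # Assumption: UTF-8 is an acceptable encoding for the URL.
    if not isinstance(url, bytes):
        url = url.encode('utf-8')
    return hashlib.md5(url).hexdigest()


print(url_digest('http://example.com/'))
```

With something like this, the line in dlcs_rss_request would call the helper instead of md5.new, and the deprecation warning should go away. I have not tried it against the real pydelicious file.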

Building the data set, recommending neighbors and links, building the item comparison data set, and getting recommendations all worked. The output just didn't look as neat as the book examples because the numbers were not rounded to 3 decimal places. The actual data matched, though.
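If the long decimals are the only difference, a small helper can tidy them up before printing. This is only a sketch: round_scores is a made-up name, the sample numbers are invented for illustration, and it assumes the results come back as a list of (score, name) pairs.

```python
def round_scores(ranked, places=3):
    """Round the score in each (score, name) pair for display.

    round_scores is a hypothetical helper, not something from the book.
    """
    return [(round(score, places), name) for score, name in ranked]


# Made-up sample values, only to show the effect of the rounding.
sample = [(0.123456, 'item a'), (0.987654, 'item b')]
print(round_scores(sample))
```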

As for the MovieLens stuff, once I hardcoded my file path into the loadMovieLens function it worked great for me.

def loadMovieLens(path='C:\Python26/data/movielens'):

That bit of code was quite important, especially the one slash that goes the other way...
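The slash matters because a backslash inside a normal Python string starts an escape sequence. '\P' happens not to be one, so that path survives, but something like '\t' would quietly turn into a tab. Here is a small sketch of two safer ways to write the same path. It uses ntpath (the Windows flavor of os.path) so it behaves the same on any machine, and the file name u.item is an assumption about what lives in that folder.

```python
import ntpath  # Windows path rules, importable on any platform

# A raw string keeps every backslash exactly as typed.
raw_path = r'C:\Python26\data\movielens'

# Forward slashes all the way through avoid escapes entirely.
forward_path = 'C:/Python26/data/movielens'

# Both spellings name the same folder once normalized.
print(ntpath.normpath(forward_path) == raw_path)

# What goes wrong without either trick: \t here is a tab character.
risky = 'C:\temp'
print('\t' in risky)

# Assumption: u.item is one of the data files in the folder.
item_file = ntpath.join(raw_path, 'u.item')
print(item_file)
```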

Building the item-based recommendations took about a minute to complete on my desktop, which makes me glad I did not use my much slower laptop for this assignment. My outputs matched the book again, which made me quite happy.

Weka Part 1

This part seemed pretty straightforward to me: as long as you followed the book's directions you were fine. I'm still working on understanding the algorithms, but it seemed to do a good job in most cases. It had fewer errors than the 1R (one-rule) method on the weather data set.

Weka Part 2

I ran the J4.8 tree building algorithm on the data set and got the following results:

=== Summary ===

Correctly Classified Instances         235               77.5578 %
Incorrectly Classified Instances        68               22.4422 %
Kappa statistic                          0.5443
Mean absolute error                      0.1044
Root mean squared error                  0.2725
Relative absolute error                 52.0476 %
Root relative squared error             86.5075 %
Total Number of Instances              303

=== Detailed Accuracy By Class ===

               TP Rate   FP Rate   Precision   Recall    F-Measure   ROC Area   Class
                 0.83      0.29      0.774       0.83      0.801       0.809      <50
                 0.71      0.17      0.778       0.71      0.742       0.809      >50_1
                 0         0         0           0         0           ?          >50_2
                 0         0         0           0         0           ?          >50_3
                 0         0         0           0         0           ?          >50_4
Weighted Avg.    0.776     0.235     0.776       0.776     0.774       0.809

=== Confusion Matrix ===

   a   b   c   d   e   <-- classified as
 137  28   0   0   0 |   a = <50
  40  98   0   0   0 |   b = >50_1
   0   0   0   0   0 |   c = >50_2
   0   0   0   0   0 |   d = >50_3
   0   0   0   0   0 |   e = >50_4

It seems to have worked. This method seems fairly accurate, but I definitely would not bet my life on it because about 22% of the instances were classified incorrectly. There were also a lot more variables to consider in this set and, as far as machine learning goes, it did really well.
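The summary numbers can be recomputed from the confusion matrix, which is a decent way to see what they mean. The sketch below uses only the two classes that actually have instances. The variable names are mine, and the kappa formula is the standard observed-versus-chance agreement one, which I am assuming is what Weka reports.

```python
# Counts copied from the confusion matrix above
# (rows = actual class, columns = predicted class).
confusion = [
    [137, 28],  # actual <50
    [40, 98],   # actual >50_1
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(confusion)))
accuracy = correct / float(total)

# Agreement expected by chance, from the row and column totals.
row_totals = [sum(row) for row in confusion]
col_totals = [sum(col) for col in zip(*confusion)]
expected = sum(r * c for r, c in zip(row_totals, col_totals)) / float(total ** 2)
kappa = (accuracy - expected) / (1 - expected)

# Precision and recall for the <50 class.
precision_lt50 = confusion[0][0] / float(col_totals[0])
recall_lt50 = confusion[0][0] / float(row_totals[0])

print(round(accuracy, 4), round(kappa, 4))
print(round(precision_lt50, 3), round(recall_lt50, 3))
```

These come out to the same accuracy (77.56%), kappa (0.5443), precision (0.774) and recall (0.83) that Weka printed.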

I got Python all set up, though I admit that it took me a little while to figure out how to actually make and run Python files. It turns out that you must explicitly add .py to your file name, even though it says it's saving as type .py.

I was able to run recommendations.py.

Euclidean Distance
I added the sim_distance function from the book but I got a different result from the one printed there. It turns out that the result his code produces is right, and the result printed in the book is not. If you do the math yourself according to the formula, you get 0.294298..., which is what his code produced when I typed it into my version of recommendations.py. I spent a really long time figuring out why my code wasn't working, but it turns out the book was wrong :(
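Here is a small sketch of that check. The ratings for Lisa Rose and Gene Seymour are my assumption of what the book's sample data contains, and the function is a compact rewrite rather than the book's exact code. It applies the formula 1/(1 + sqrt(sum of squared differences)) over the movies both people rated.

```python
from math import sqrt

# ASSUMPTION: these ratings match the book's sample data for the two critics.
critics = {
    'Lisa Rose': {'Lady in the Water': 2.5, 'Snakes on a Plane': 3.5,
                  'Just My Luck': 3.0, 'Superman Returns': 3.5,
                  'You, Me and Dupree': 2.5, 'The Night Listener': 3.0},
    'Gene Seymour': {'Lady in the Water': 3.0, 'Snakes on a Plane': 3.5,
                     'Just My Luck': 1.5, 'Superman Returns': 5.0,
                     'The Night Listener': 3.0, 'You, Me and Dupree': 3.5},
}


def sim_distance(prefs, person1, person2):
    """Euclidean-distance similarity in the range 0 to 1 (1 = identical)."""
    shared = [item for item in prefs[person1] if item in prefs[person2]]
    if not shared:
        return 0
    sum_of_squares = sum((prefs[person1][item] - prefs[person2][item]) ** 2
                         for item in shared)
    return 1 / (1 + sqrt(sum_of_squares))


print(sim_distance(critics, 'Lisa Rose', 'Gene Seymour'))
```

With these ratings the squared differences add up to 5.75, and 1/(1 + sqrt(5.75)) is about 0.294298. For comparison, leaving out the square root would give 1/(1 + 5.75), about 0.148, so a slip of that kind is one possible explanation for a mismatch. That part is a guess.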

Pearson Correlation Score
Worked great, minus the hour lost to a stupid indentation error. Turns out most of my code had ended up inside the for loop. Lovely. I got the book answer for this one.

Recommendations
This part worked fine for me.

Manhattan Distance
There aren't any good websites on Manhattan distance. I swear they don't expect you to actually need to use it. I couldn't quite figure out what the x's and y's are and how you add them. But after emailing Dr. Zacharski I was able to implement it. This is what I came up with.

#Manhattan Distance Stuff
def manhattan(prefs,person1,person2):

    #Get the list of mutually rated items
    si={}
    for item in prefs[person1]:
        if item in prefs[person2]: si[item]=1

    #Find the number of shared elements
    n=len(si)

    #If they have no ratings in common, return 0
    if n==0: return 0

    #Absolute difference in rating for each shared item
    mdists = [abs(prefs[person1][item] - prefs[person2][item])
              for item in si]

    #Doing the thing at the bottom of page 10 to make it between 0 and 1:
    #we add 1 before inverting so we don't divide by 0
    #(1.0 keeps the division floating-point even if the ratings are ints)
    return 1.0/(1+sum(mdists))

I got a result of 0.1818181... for Lisa Rose and Gene Seymour, which I think is at least close to the right answer.
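The result can be checked by hand. Manhattan distance is just the sum of the absolute differences between the two people's ratings, movie by movie. The sketch below does that arithmetic directly, assuming the ratings listed are the ones in the book's sample data for these two critics (that part is my assumption).

```python
# ASSUMPTION: ratings from the book's sample data, in the same movie order
# for both critics (Lady in the Water, Snakes on a Plane, Just My Luck,
# Superman Returns, You Me and Dupree, The Night Listener).
lisa = [2.5, 3.5, 3.0, 3.5, 2.5, 3.0]
gene = [3.0, 3.5, 1.5, 5.0, 3.5, 3.0]

# Absolute difference for each movie, then the total distance.
abs_diffs = [abs(a - b) for a, b in zip(lisa, gene)]
distance = sum(abs_diffs)

# Same trick as in the function: add 1, then invert.
similarity = 1.0 / (1 + distance)

print(abs_diffs)
print(distance)
print(similarity)
```

With these ratings the differences add up to 4.5, and 1/(1 + 4.5) is 0.1818..., so the function's answer agrees with the hand calculation.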

Saturday, January 31, 2009

Portfolio Assignment 2

Friday, January 23, 2009

Portfolio Assignment 1
