Project Guidelines (please read carefully):
Project Suggestions:
Additional References:
Yan T.W., Jacobsen M., Garcia-Molina H., Dayal U., "From User Access Patterns to Dynamic Hypertext Linking", Fifth International World Wide Web Conference, Paris, 1996. http://www5conf.inria.fr/fich_html/papers/P8/Overview.html
Shahabi, C., Zarkesh A., Adibi J., Shah V., "Knowledge Discovery From Users Web-Page Navigation", Proc. Seventh IEEE Intl Workshop on Research Issues in Data Engineering(RIDE) 1997. http://dimlab.usc.edu/publications.asp
Schechter S., Krishnan M., Smith M., "Using path profiles to predict HTTP requests", The Seventh International World Wide Web Conference, Brisbane, Australia. http://www.eecs.harvard.edu/~stuart/
Nasraoui O., Frigui H., Joshi A., Krishnapuram R., "Mining Web Access Logs Using Relational Competitive Fuzzy Clustering", Proc. Eighth International Fuzzy Systems Association World Congress - IFSA 99, August 1999. http://www.cs.umbc.edu/~ajoshi/web-mine/publications.html
Workshops:
KDD'99 Workshop on Web Usage Analysis http://www.acm.org/sigkdd/proceedings/webkdd99/toconline.htm
KDD'00 Workshop on Web Mining for E-Commerce http://www.wiwi.hu-berlin.de/~myra/WEBKDD2000/schedule.html
User Modeling'99:
http://www.cs.usask.ca/UM99/papers.shtml
Also see the forthcoming Workshop on Web Mining for additional topics
of interest.
http://www.lans.ece.utexas.edu/workshop_index.htm
Software:
http://www.hypernews.org/HyperNews/get/www/log-analyzers.html
lists packages on log analysis.
WUM: Web Utilization Miner
http://wum.wiwi.hu-berlin.de/ (please contact Dr. Ghosh if you want to
use this package).
Possible projects include:
A. Collaborative Filtering:
How can the importance of different item attributes (e.g. genre vs. director
vs. length vs. actor for movies) be customized for different users? The
EachMovie data set can be used here. How can the collaborative filtering
approach be made more scalable to handle larger numbers of people and items?
How can content and/or usage data (web logs) be used to augment collaborative
filtering?
See the Collaborative Filtering Mailing List Archive (by thread):
http://info.berkeley.edu/resources/mailing-lists/collab/
Also see the Syskill and Webert data from UCI for rating web pages.
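As a possible starting point for a collaborative filtering project, here is a minimal sketch of user-based filtering: a user's rating for an unseen item is predicted as that user's mean rating plus a correlation-weighted average of how far other users' ratings of the item deviate from their own means. The toy ratings table and the function names are illustrative assumptions; they are not taken from EachMovie or any particular package.

```python
from math import sqrt

# Toy ratings (hypothetical data): user -> {item: rating}
RATINGS = {
    "ann": {"m1": 5, "m2": 3, "m3": 4},
    "bob": {"m1": 4, "m2": 2, "m3": 3, "m4": 5},
    "cat": {"m1": 1, "m2": 5, "m4": 2},
}

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def pearson(ratings, u, v):
    """Pearson correlation of users u and v over the items both have rated."""
    common = set(ratings[u]) & set(ratings[v])
    if len(common) < 2:
        return 0.0
    mu_u = mean(ratings[u][i] for i in common)
    mu_v = mean(ratings[v][i] for i in common)
    num = sum((ratings[u][i] - mu_u) * (ratings[v][i] - mu_v) for i in common)
    den_u = sqrt(sum((ratings[u][i] - mu_u) ** 2 for i in common))
    den_v = sqrt(sum((ratings[v][i] - mu_v) ** 2 for i in common))
    if den_u == 0 or den_v == 0:
        return 0.0
    return num / (den_u * den_v)

def predict(ratings, user, item):
    """Predict user's rating of item from the other users who rated it."""
    base = mean(ratings[user].values())
    num = den = 0.0
    for other in ratings:
        if other == user or item not in ratings[other]:
            continue
        w = pearson(ratings, user, other)
        # Deviation of the neighbor's rating from the neighbor's own mean
        num += w * (ratings[other][item] - mean(ratings[other].values()))
        den += abs(w)
    return base if den == 0 else base + num / den
```

Every prediction here scans all users, which is exactly the scalability problem raised above; clustering users or precomputing neighbor lists are possible directions to explore.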
B. Infomediaries:
These agents can modify information streams at the client end, server end,
or somewhere in-between. They can be used to customize, filter, annotate,
aggregate or transcode in a personalized way. See http://wwwcssrv.almaden.ibm.com/wbi/intermediaries.html
and http://www.almaden.ibm.com/cs/wbi/papers/chi97/wbipaper.html
Also, a WBI plugin is available at: http://www.almaden.ibm.com/cs/wbi/
C. Personalized Web Searching/Browsing/Filtering
Develop agents that work on behalf of the individual user: a personal
e-valet, if you will. They can identify and gather interesting news
articles, extract information about specific products, filter email, etc.
Additional References:
Alexandros Moukas and Pattie Maes, "Amalthaea: An Evolving Multi-Agent Information Filtering and Discovery System for the WWW", http://www.media.mit.edu/~moux/papers/jaamas98.ps
Liren Chen and Katia Sycara, "WebMate: A personal agent for
browsing and searching", In Proc. of the 2nd International Conference
on Autonomous Agents, 1998.
http://www.cs.cmu.edu/~softagents/webmate/aa98-webmate.ps
J. Allan, "Incremental relevance feedback for information
filtering". In Proc. ACM SIGIR Conf., Zurich, Switzerland, August
1996.
http://www-ciir.cs.umass.edu/~allan/Papers/sigir96.ps
See www.xml.org and www.xml.com for starters. Also see the articles
"Top Tech Trends for 2001":
http://www.infoworld.com/articles/op/xml/01/01/08/010108opvizard.xml
and "XML enlivens e-business transactions":
http://www.infoworld.com/supplements/2000toy/010129tcxml.html
Possible projects include:
- develop an architecture for searching/querying a collection of XML
documents
- determine ways of finding similarity between two XML documents that take
into account both the words and the tags
- find ways to (semi-)automatically convert an XML document into another one
with a different DTD; this is useful for information exchange, say between two
companies with different conventions for marking up documents
- extract meta-data from web pages, and use it to augment a web
content mining application
- develop XML tools to integrate content data, usage data, and domain
knowledge such as ontologies or word vocabularies, to model users of an
e-commerce site
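For the document-similarity item in the list above, one simple baseline (a sketch, not a recommended method) is to describe each document by the set of (tag path, word) pairs it contains and compare two documents with the Jaccard coefficient. That way a word only matches when it appears under the same tags. The function names and the choice of Jaccard are assumptions made for illustration.

```python
import re
import xml.etree.ElementTree as ET

def features(xml_text):
    """Return the set of (tag path, word) pairs found in an XML string."""
    root = ET.fromstring(xml_text)
    found = set()

    def walk(node, path):
        path = path + "/" + node.tag
        # Text directly inside this element: its leading text plus the
        # tail text that follows each of its children.
        chunks = [node.text or ""] + [child.tail or "" for child in node]
        for word in re.findall(r"\w+", " ".join(chunks).lower()):
            found.add((path, word))
        for child in node:
            walk(child, path)

    walk(root, "")
    return found

def jaccard(a, b):
    """Size of the intersection divided by size of the union."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def xml_similarity(doc1, doc2):
    return jaccard(features(doc1), features(doc2))
```

Two documents with the same words under swapped tags score zero with this measure, which may be too strict; weighting tag agreement and word agreement separately is one obvious refinement.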
We have purchased TREC 4 and TREC 5 CDs for the class. These have many
large document collections in SGML, a superset of XML, and thus can possibly
be adapted for your XML project.
Andrei Broder, Steve Glassman, Mark Manasse, Geoffrey Zweig, "Syntactic
Clustering of the Web", Sixth International World Wide Web Conference,
pp. 391--404, 1997. SRC Technical Note 1997-015.
http://gatekeeper.dec.com/pub/DEC/SRC/technical-notes/SRC-1997-015-html
N. Shivakumar, H. Garcia-Molina, "SCAM: A Copy Detection Mechanism for Digital Documents", Proceedings of the 2nd International Conference on Theory and Practice of Digital Libraries, Austin, Texas, 1995. Available from http://www-db.stanford.edu/~shiva/Pubs/scam.ps
Misc Stuff:
The TREC and SMART queries are of different flavors. Some queries are short while others have a more verbose description. For many TREC queries, the documents judged relevant do not contain any of the words in the query.
The first phase of this project would be to explore how "far" relevant documents are from the words contained in a query. For example, are all relevant documents within two "hops" of the query; that is, do all relevant documents either (i) contain words in the query or (ii) contain words that appear in documents that contain words in the query? Such questions can be answered by looking at a graph between the documents and words.
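The hop question can be explored with a breadth-first search over the document-word graph. The sketch below assumes each document has already been reduced to a set of words; the function name and the input format are illustrative choices. A document gets hop count 1 if it contains a query word, 2 if it shares a word with a hop-1 document, and so on. Documents that cannot be reached at all are left out of the result.

```python
def hop_distances(docs, query_words):
    """docs: doc id -> set of words. Returns doc id -> hop count."""
    # Inverted index: word -> ids of the documents that contain it
    index = {}
    for doc_id, words in docs.items():
        for w in words:
            index.setdefault(w, set()).add(doc_id)

    hops = {}
    frontier_words = set(query_words)
    seen_words = set(frontier_words)
    hop = 0
    while frontier_words:
        hop += 1
        # Documents reachable from the current words and not yet labeled
        new_docs = set()
        for w in frontier_words:
            new_docs |= index.get(w, set())
        new_docs.difference_update(hops)
        for d in new_docs:
            hops[d] = hop
        # Words of the newly reached documents form the next frontier
        frontier_words = set()
        for d in new_docs:
            frontier_words |= docs[d]
        frontier_words -= seen_words
        seen_words |= frontier_words
    return hops
```

Given relevance judgments for a query, checking whether every relevant document has a hop count of at most 2 answers the example question directly. On a real collection, stop words would probably have to be removed first, since very common words connect nearly every pair of documents.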
The second part of the project would be to explore algorithms (graph algorithms?) that would exploit the information discovered in the above phase to get better precision and recall.
See "Clustering Hypertext with Applications to Web Searching" (http://www.lans.ece.utexas.edu/course/ee380l/2001sp/readinglist/toric.ps.gz) for one approach.
Classifying email automatically into different folders is a challenging and worthwhile problem. See http://www.ai.mit.edu/~ychang/learningmail.pdf for a comparison of Naive Bayes and Fisher's Linear Discriminant. Source code is available too! Also see last year's CS395T project: http://www.cs.utexas.edu/users/zhux2/datamining/
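To make the Naive Bayes side of that comparison concrete, here is a minimal sketch of a multinomial Naive Bayes folder classifier with add-one (Laplace) smoothing. It is written from scratch for illustration and is not the source code from the paper above; the class name, the tokenizer, and the folder labels in the usage example are all assumptions.

```python
import math
import re
from collections import Counter, defaultdict

def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z0-9']+", text.lower())

class NaiveBayes:
    def __init__(self):
        self.word_counts = defaultdict(Counter)  # folder -> word -> count
        self.doc_counts = Counter()              # folder -> number of messages
        self.vocab = set()

    def train(self, text, folder):
        words = tokenize(text)
        self.word_counts[folder].update(words)
        self.doc_counts[folder] += 1
        self.vocab.update(words)

    def classify(self, text):
        """Return the folder with the highest log posterior, or None if untrained."""
        words = tokenize(text)
        total_docs = sum(self.doc_counts.values())
        best_folder, best_score = None, -math.inf
        for folder, n_docs in self.doc_counts.items():
            counts = self.word_counts[folder]
            # Add-one smoothing over the whole vocabulary
            denom = sum(counts.values()) + len(self.vocab)
            score = math.log(n_docs / total_docs)
            for w in words:
                score += math.log((counts[w] + 1) / denom)
            if score > best_score:
                best_folder, best_score = folder, score
        return best_folder
```

A short usage example with made-up messages:

```python
nb = NaiveBayes()
nb.train("meeting agenda for monday project review", "work")
nb.train("project deadline and budget report", "work")
nb.train("dinner party on saturday with friends", "personal")
nb.train("weekend hiking trip with friends", "personal")
print(nb.classify("budget review meeting"))
```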
The following paper uses a so-called Fisher index:
http://www.cs.berkeley.edu/~soumen/doc/www99focus/html/
Last year's class project (http://www.cs.utexas.edu/users/subbiah1/mining/) did an excellent job in building prototype software in MATLAB for this problem. The scope of this year's project would be to use this software and extend it to use computationally different (and more efficient) methods.
Resources