
Recommender Module Performance Enhancement & Drupal for Data-intensive Computing

The Proposal

This student proposal has two parts. The first part is to enhance the performance of the Recommender modules via Apache Mahout integration.

The Recommender API module and its helper modules were developed as a GSoC 2009 project. Those modules enable Drupal sites to provide content recommendation services based on users' browsing history, Fivestar ratings, product purchasing history, and so on, similar to what http://amazon.com offers. However, I have received a lot of feedback from users complaining about the performance/scalability of those modules: for sites with more than ~1k nodes or users, the modules simply won't finish (using 4GB RAM and running for more than 10 hours). This is not acceptable, because it is exactly those sites with lots of nodes/users that need the recommendation service the most.

The performance issue is due to the following reasons:

  • The recommendation algorithm involves complex matrix computation, which by nature requires lots of CPU/RAM. For those interested in the algorithm, please read "Amazon.com recommendations: item-to-item collaborative filtering". (Note: there are other algorithms that require fewer resources, e.g., the Slope One algorithm, but none of those is used by Amazon, Netflix, Pandora, iTunes Genius, Facebook friend recommendations, etc.) See the illustrative sketch after this list for a sense of the computational cost.
  • Recommender API is implemented in PHP, which is not optimized for high-performance matrix computation.
  • Recommender API runs locally on the Drupal web server, which usually has limited RAM (<2GB) and CPU time.
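
To give a concrete sense of the cost: item-to-item collaborative filtering compares every pair of items across all users' preferences. A purely illustrative brute-force sketch of that pairwise step (cosine similarity on a dense rating matrix; this is not the actual Recommender API code) shows why it blows up for large sites:

    // Illustrative only: brute-force item-to-item cosine similarity.
    // With n items and m users this is O(n^2 * m) work and keeps the whole
    // rating matrix in memory, which is why PHP on a shared web server struggles.
    public final class ItemSimilarityDemo {

        /** ratings[item][user] = rating, or 0.0 if the user has not rated the item. */
        static double[][] cosineSimilarities(double[][] ratings) {
            int n = ratings.length;
            double[][] sim = new double[n][n];
            for (int i = 0; i < n; i++) {
                for (int j = i + 1; j < n; j++) {
                    double dot = 0, normI = 0, normJ = 0;
                    for (int u = 0; u < ratings[i].length; u++) {
                        dot += ratings[i][u] * ratings[j][u];
                        normI += ratings[i][u] * ratings[i][u];
                        normJ += ratings[j][u] * ratings[j][u];
                    }
                    sim[i][j] = sim[j][i] =
                            (normI == 0 || normJ == 0) ? 0 : dot / Math.sqrt(normI * normJ);
                }
            }
            return sim;
        }
    }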

To fix the performance issue, the general idea is to outsource the recommendation computation to another program (written in Java/Python/C) running on a local or remote machine. I evaluated three approaches:

  • Outsourcing only the matrix computation to a 3rd-party program such as Matlab, R, Octave, NumPy, or Java.
  • Outsourcing all recommendation computation to Apache Mahout using direct MySQL access.
  • Outsourcing all recommendation computation to Apache Mahout via REST/web services.

Most users, and I myself, favor the 2nd approach. Specifically, it consists of these sub-tasks:

  • I'll write a Java program that uses Apache Mahout to do the recommendation computation (see the sketch after this list). The Java program can run either on the local Drupal server or on a remote computer with better CPU/RAM capacity, and it uses MySQL-JDBC to directly access the required Drupal database tables.
  • I'll also write a Drupal module so that users can issue commands to the Java program through the Drupal interface; the Java program will then pick up those commands and execute them accordingly.
  • All the nitty-gritty communication between Drupal and the Java program is handled by Recommender API; the helper modules (Browsing History Recommender, Fivestar Recommender, etc.) just use Recommender API to calculate the recommendations.
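
Below is a minimal sketch of what the Recommender-Java program could look like, assuming Mahout's Taste API (org.apache.mahout.cf.taste.*) and a hypothetical preference table recommender_preference with uid/nid/score/created columns. The table and column names are placeholders rather than the actual Recommender API schema, and the exact MySQLJDBCDataModel constructor arguments vary between Mahout releases:

    import java.util.List;
    import com.mysql.jdbc.jdbc2.optional.MysqlDataSource;
    import org.apache.mahout.cf.taste.impl.model.jdbc.MySQLJDBCDataModel;
    import org.apache.mahout.cf.taste.impl.recommender.GenericItemBasedRecommender;
    import org.apache.mahout.cf.taste.impl.similarity.LogLikelihoodSimilarity;
    import org.apache.mahout.cf.taste.model.DataModel;
    import org.apache.mahout.cf.taste.recommender.RecommendedItem;
    import org.apache.mahout.cf.taste.similarity.ItemSimilarity;

    public class DrupalMahoutRecommender {
        public static void main(String[] args) throws Exception {
            // Direct MySQL access to the Drupal database (a read-only account is enough).
            MysqlDataSource dataSource = new MysqlDataSource();
            dataSource.setServerName("localhost");
            dataSource.setDatabaseName("drupal");
            dataSource.setUser("recommender");
            dataSource.setPassword("secret");

            // Map a (user, item, score) preference table into a Mahout DataModel.
            DataModel model = new MySQLJDBCDataModel(
                    dataSource, "recommender_preference", "uid", "nid", "score", "created");

            // Item-based collaborative filtering, the same family of algorithms
            // as the Amazon item-to-item paper cited above.
            ItemSimilarity similarity = new LogLikelihoodSimilarity(model);
            GenericItemBasedRecommender recommender =
                    new GenericItemBasedRecommender(model, similarity);

            // Top 10 recommendations for user 1; the real program would write the
            // results back to a Drupal table for display.
            List<RecommendedItem> items = recommender.recommend(1L, 10);
            for (RecommendedItem item : items) {
                System.out.println(item.getItemID() + "\t" + item.getValue());
            }
        }
    }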

End users of the Recommender modules would need to do the following:

  • First, install the Recommender API module and helper modules on the Drupal server.
  • Second, install the Recommender-Java program and the Apache Mahout library on either the local Drupal server or a remote computer, and configure MySQL access so that it can read certain Drupal database tables.
  • Third, run the Recommender-Java program as a daemon service.
  • Finally, from Drupal, site admins can issue commands to the Recommender-Java program to compute recommendations, which are written back to the Drupal database for display on the Drupal site (a sketch of one possible command loop follows this list).
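
How the daemon picks up commands is an implementation detail, but one plausible mechanism is polling a small queue table that the Drupal module writes to. The sketch below assumes a hypothetical table recommender_command with id/app/status columns; nothing here is final:

    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.Statement;
    import javax.sql.DataSource;

    public class RecommenderDaemon {
        private final DataSource dataSource;

        public RecommenderDaemon(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        /** Poll the (hypothetical) command table and run any pending jobs. */
        public void runForever() throws Exception {
            while (true) {
                try (Connection c = dataSource.getConnection();
                     Statement s = c.createStatement();
                     ResultSet rs = s.executeQuery(
                             "SELECT id, app FROM recommender_command WHERE status = 'pending'")) {
                    while (rs.next()) {
                        long id = rs.getLong("id");
                        String app = rs.getString("app"); // e.g. "fivestar" or "browsing_history"
                        // ... run the Mahout computation for this app and write the
                        // results back to a Drupal table, then mark the command done ...
                        try (Statement done = c.createStatement()) {
                            done.executeUpdate(
                                    "UPDATE recommender_command SET status = 'done' WHERE id = " + id);
                        }
                    }
                }
                Thread.sleep(60000); // check for new commands once a minute
            }
        }
    }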

In addition to the performance enhancement, I'll also work on two highly-requested issues: adding Views support, and taking "accesslog" as input for the Browsing History Recommender module (see the timeline below).

The proposed work targets Drupal 7, with a Drupal 6 backport.

Extension

The second part of the proposal is to extend the "Apache Mahout integration" idea to a broader usage scenario -- "Drupal for data-intensive computing".

Drupal is awesome for website building, but not so great at data-intensive computing. Computing recommendations is one example. Other examples could be:

  • For e-commerce sites, to analyze historical sales data and make predictions on future sales.
  • For news sites, to use machine learning algorithms to categorize news articles.
  • For biology research sites, to use scientific software (Matlab, SPSS, etc.) to analyze gene data stored in the Drupal database.

In all these scenarios, Drupal is used to organize and display information, while 3rd-party software/scripts are used for the data-intensive computing. What's missing is a framework to help that 3rd-party software exchange data with Drupal.

I have proposed such a framework in the first part of this proposal; here I would generalize it to work not only with Apache Mahout but also with other programs (Matlab, R, NumPy, etc.). Specifically, I propose the following tasks:

  • I'll write a generic Java class that provides the following (see the sketch after this list):
    1) connection pooling to the Drupal database via JDBC
    2) data object mapping (via the Java Persistence API, so that developers can use nodes, users, etc. as objects)
    3) "paging" for slow connections
    4) multi-threading
    5) running as a "daemon" service
    6) CSV output to be fed into Matlab, R, SPSS, etc.
  • Developers can write Java/Jython/JRuby scripts for Drupal data-intensive computing tasks simply by sub-classing that generic Java class, or output CSV files to work with Matlab, R, SPSS, etc.
  • I'll write a Drupal module so that site admins can issue commands to the data-intensive computing programs within the Drupal interface.
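
A rough sketch of the generic base class follows; the class name DrupalDataTask and all table/column specifics are placeholders, and the JPA mapping, paging, and multi-threading pieces are only hinted at here:

    import java.io.FileWriter;
    import java.io.IOException;
    import java.sql.Connection;
    import java.sql.ResultSet;
    import java.sql.SQLException;
    import java.sql.Statement;
    import javax.sql.DataSource;

    public abstract class DrupalDataTask implements Runnable {

        private final DataSource dataSource; // pooled JDBC connections to the Drupal database

        protected DrupalDataTask(DataSource dataSource) {
            this.dataSource = dataSource;
        }

        /** Subclasses implement the actual data-intensive computation here. */
        protected abstract void compute(Connection connection) throws SQLException;

        /** Runs the task once; a daemon wrapper would call this repeatedly on command. */
        @Override
        public void run() {
            try (Connection connection = dataSource.getConnection()) {
                compute(connection);
            } catch (SQLException e) {
                throw new RuntimeException("Drupal database access failed", e);
            }
        }

        /** Dumps an arbitrary query to a CSV file so Matlab, R, or SPSS can pick it up. */
        protected void exportCsv(Connection connection, String sql, String path)
                throws SQLException, IOException {
            try (Statement stmt = connection.createStatement();
                 ResultSet rs = stmt.executeQuery(sql);
                 FileWriter out = new FileWriter(path)) {
                int columns = rs.getMetaData().getColumnCount();
                while (rs.next()) {
                    StringBuilder row = new StringBuilder();
                    for (int i = 1; i <= columns; i++) {
                        if (i > 1) row.append(',');
                        row.append(rs.getString(i));
                    }
                    out.write(row.append('\n').toString());
                }
            }
        }
    }

A task such as "categorize news articles" would then be a small subclass that overrides compute() and either writes its results back to a Drupal table or calls exportCsv() for offline analysis.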

Contribution to the Drupal community

For the first part of the proposal, "Recommender modules performance enhancement", I have received many requests from users, so I know it would add value to the Drupal community. My hope is to give Drupal the best possible recommender system, competitive with similar services offered by Amazon, Netflix, etc.

For the second part of the proposal, "Drupal for data-intensive computing", I know there are some initiatives within the community trying to build machine learning and artificial intelligence capabilities into Drupal, e.g., D.A.I.L., Machine Learning API, etc. Again, performance is a big issue if they are implemented in PHP, and it doesn't make much sense to re-write lots of algorithms in PHP given that many of them are already implemented in Java/Python/Matlab/R. I think the most viable solution is to write a framework, as proposed, that makes it easy to use 3rd-party programs/scripts for data-intensive computation. I hope this framework would facilitate more innovation in data-intensive computing with Drupal, such as making predictions on sales, text categorization, etc.

About me

I'm a PhD student at the University of Michigan School of Information. My expertise is in recommender systems and machine learning algorithms. I served/will serve on the Program Committee of the ACM Recommender Systems Conference in 2010 and 2011. It is my goal to build a cutting-edge recommender system into Drupal.

In terms of Drupal involvement, I participated in GSoC 2009 to develop the initial version of the Recommender modules. I have also collaborated with the Drupal.org infrastructure team and implemented the "Related projects" page on drupal.org module pages. My Drupal account is http://drupal.org/user/112233

Timeline

  • May 24 - 31 - Get familiar with the Apache Mahout library, set up the development environment
  • June 1 - 21 - Develop the Java program for recommendation computation with Mahout
  • June 21 - July 12 - Develop the Drupal module that issues commands to the Java program; add Views support; take "accesslog" as input for the Browsing History Recommender module
  • July 12 - Submit midterm evaluation
  • July 16 - August 2 - Generalize the "Drupal for data-intensive computing" framework to go beyond Apache Mahout
  • August 3 - August 16 - Drupal 6 backport
  • August 17 - August 20 - Polish up and submit final report
