Saturday, December 12, 2015

Playing with PostgreSQL FDW for MongoDB

While it was interesting to read articles on people bashing MongoDB's latest BI connector is based on PostgreSQL, I wonder how the (alleged) Python implemented Foreign Data Wrapper (FDW) performs.

First I got PostgreSQL 9.4 and Python dev installed on my openSUSE Tumbleweed machine:

postgresql94 postgresql94-server postgresql94-devel postgresql94-contrib python-devel python-setuptools


Then build and install Multicorn, which is a Postgresql extension that the (alleged) MongoDB FDW is based on:

git clone git://github.com/Kozea/Multicorn.git
cd Multicorn
sudo make
sudo make install


Also install the FDW:

git clone https://github.com/asya999/yam_fdw.git
cd yam_fdw
sudo python setup.py install


Start the PostgreSQL server  if it is not running:

sudo systemctl start postgresql.service


To test the performance, I got some sample Yelp data loaded into MongoDB 3.0.7.  Here is what the format of "business" data:

> db.yelp_business.findOne()
{
        "_id" : ObjectId("566bb192563714b25d604a94"),
        "business_id" : "UsFtqoBl7naz8AVUBZMjQQ",
        "full_address" : "202 McClure St\nDravosburg, PA 15034",
        "hours" : {

        },
        "open" : true,
        "categories" : [
                "Nightlife"
        ],
        "city" : "Dravosburg",
        "review_count" : 4,
        "name" : "Clancy's Pub",
        "neighborhoods" : [ ],
        "longitude" : -79.88693,
        "state" : "PA",
        "stars" : 3.5,
        "latitude" : 40.350519,
        "attributes" : {
                "Happy Hour" : true,
                "Accepts Credit Cards" : true,
                "Good For Groups" : true,
                "Outdoor Seating" : false,
                "Price Range" : 1
        },
        "type" : "business"
}
> db.yelp_business.count()
61184


And here is sample of the review data:

> db.yelp_review.findOne()
{
        "_id" : ObjectId("566bb1eb563714b25d61ea02"),
        "votes" : {
                "funny" : 0,
                "useful" : 2,
                "cool" : 0
        },
        "user_id" : "H1kH6QZV7Le4zqTRNxoZow",
        "review_id" : "RF6UnRTtG7tWMcrO2GEoAg",
        "stars" : 2,
        "date" : "2010-03-22",
        "text" : "Unfortunately, the frustration of being Dr. Goldberg's patient is a repeat of the experience I've had with so many other doctors in NYC -- good doctor, terrible staff.  It seems that his staff simply never answers the phone.  It usually takes 2 hours of repeated calling to get an answer.  Who has time for that or wants to deal with it?  I have run into this problem with many other doctors and I just don't get it.  You have office workers, you have patients with medical needs, why isn't anyone answering the phone?  It's incomprehensible and not work the aggravation.  It's with regret that I feel that I have to give Dr. Goldberg 2 stars.",
        "type" : "review",
        "business_id" : "vcNAWiLM4dR7D2nwwJ7nCA"
}
> db.yelp_review.count()
1569264


So, say I want to do a join of the two collections in PostgreSQL to find out, for each business, how many reviews give lower rating than its existing rating.  Here are the sample SQL scripts execute in psql:

-- create the Multicorn extension
CREATE EXTENSION multicorn;

-- define the mongodb FDW
create server mongodb_yelp foreign data wrapper multicorn options (wrapper 'yam_fdw.Yamfdw');
-- turn on debug if necessary
--create server mongodb_yelp foreign data wrapper multicorn options (wrapper 'yam_fdw.Yamfdw', debug 'True');

-- create the foreign tables
create foreign table yelp_review ("_id" varchar, "review_id" varchar, "user_id" varchar, "business_id" varchar, "stars" numeric, "date" date, "text" varchar) server mongodb_yelp options (db 'test', collection 'yelp_review');
create foreign table yelp_business ("_id" varchar, "business_id" varchar, "name" varchar, "stars" numeric, "longitude" float8, "latitude" float8, "full_address" varchar, "type" varchar) server mongodb_yelp options (db 'test', collection 'yelp_business');

-- use SQL to join the collections
select b.name, count(1)
from yelp_business b
left outer join yelp_review r
  on r.business_id = b.business_id
  and r.stars < b.stars
group by b.name
fetch first 100 rows only;


The performance is nothing to write home about, but is better than I expected.  It is hard to do push-down to foreign data source.

Anyway, I did fork the FDW and made some minor changes.

PS.  Not sure if the bashing (here, here) of MongoDB's BI connector was caused by conflict of interest (after all, Slam Data is an analytic tool for NoSQL).  But I am no fanboy of any particular software.

I just use whatever software that suits my needs.  The more choices I have, the better off I will be.

No comments: