Sunday, October 18, 2009

Fixing "font not embedded" issue to pass the IEEE PDF eXpress check

We recently had to make a paper compliant with the IEEE PDF eXpress format. The paper did not pass the check in the first few attempts; hence this blog post. I'd like to thank my colleague Ning Shang, who did most of the fixes to get it working. I am listing the fixes here so that anyone else who encounters similar issues may find this post useful.

For some context: I work on Ubuntu 9.04 with Kile 2.1 (the IDE) and use the tools latex, bibtex, and dvipdf to generate PDF files from tex/bib/cls files (i.e., latex file.tex; bibtex file (to attach the ref.bib file); latex file.tex; dvipdf file.dvi to finally get file.pdf).

The tex file uses the IEEE conference style. Additionally we used the following packages initially:
times, epsfig, graphicx, url, verbatim, amsmath, amsfonts


Issue #1: Document contains bookmarks
Fix: We had to remove the url package from the included package list and convert \url{address} to {address} in ref.bib.

Issue #2: Fonts Times-Italic, Times-Roman, Times-BoldItalic, Times-Bold, Helvetica, and Courier are not embedded.

You can see which fonts are embedded and which are not by running "pdffonts file.pdf" and looking at the "emb" column. In our case, it did show that some fonts were not embedded.

Fix: We searched the Internet [1, 2] and found that in order to fix this (i.e., to embed all the required fonts) we need to do the conversion from tex to pdf in two stages. This is a dirty hack, but it works.

latex file.tex
bibtex file
latex file.tex
latex file.tex (Now we have file.dvi)
dvips -Ppdf -G0 -tletter file.dvi (Now we have file.ps)
ps2pdf -dCompatibilityLevel=1.4 -dPDFSETTINGS=/prepress file.ps file.pdf (Now we have file.pdf)
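To confirm the fix worked, you can check that every font in the final PDF shows "yes" in the "emb" column of pdffonts. Here is a minimal Python sketch of that check; the helper name is my own, and the sample output is illustrative (it only assumes the pdffonts column order, and that PostScript font names contain no spaces):

```python
def unembedded_fonts(pdffonts_output: str) -> list:
    """Return names of fonts whose 'emb' column is not 'yes' in the
    output of `pdffonts` (poppler-utils)."""
    missing = []
    for row in pdffonts_output.splitlines()[2:]:  # skip header and ruler lines
        parts = row.split()
        if len(parts) < 6:
            continue
        # Counting from the right: object ID is two tokens (number, generation),
        # preceded by the uni, sub, and emb columns.
        emb = parts[-5]
        if emb != "yes":
            missing.append(parts[0])
    return missing

sample = """\
name                         type   emb sub uni object ID
---------------------------- ------ --- --- --- ---------
Times-Roman                  Type 1 no  no  no       9  0
ABCDEF+NimbusRomNo9L-Regu    Type 1 yes yes no      12  0
"""
print(unembedded_fonts(sample))  # ['Times-Roman']
```

An empty list after the two-stage conversion means the paper should pass the font-embedding part of the PDF eXpress check.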

Thursday, October 15, 2009

Accountability or anonymity or can we have both?

This blog post was prompted by the question "which is more important when it comes to online activities: accountability or anonymity?" (short answer: it depends :)

Accountability is all about holding an individual accountable for what (s)he does; it's about identification. Anonymity, on the other hand, is about de-identification and the privacy of individuals.

From the point of view of on-going research, I find three main areas under which anonymity is being considered (am I missing any others?):

1. Anonymous access to resources
Here the goal is to allow users to access some service or resource anonymously - without revealing their identity. It's mainly about unlinkability - no two transactions can be linked back to the same individual. Say Bob buys a T-shirt from JC Penney and a pair of jeans from Old Navy, using the same Chase credit card for both transactions. Bob may not want his bank, Chase, to know how he spent his money, for privacy reasons - if the system worked that way, it would provide unlinkability for Bob at his bank. However, unlinkability, in this case, may be undesirable for security reasons; if the transactions cannot be linked to Bob, it would be really hard, if not impossible, to identify fraudulent activities by bad users. If Bob really wants to prevent his bank from knowing how he spends his money, the safest way is to use cash - that's the price Bob needs to pay to remain unlinkable! Note that there are many cryptography-based e-cash schemes that achieve the same objective.

As you can see, the decision to go anonymous has a cost. The issue is deciding whether the benefits outweigh the cost. Take another example. You may not like, say, Marsh or Pay Less tracking all your transactions - you lose your privacy, apparently without any gain for you. But what if the loyalty card from Marsh or Pay Less gives you a discount on most of the items you buy? Most of us (at least graduate students) will go for the loyalty card. The problem here is that there is no way for us to quantify the cost of losing privacy (shopping history), and further, the effect of losing privacy may not be immediate.

What about online services? Would you be comfortable if a digital library service recorded all your moves? Think about it: when you go to a public library in your area, you can read whatever books, newspapers, and magazines you want, in whatever sections you want, without being noticed/recorded. Anonymization techniques may come to your rescue and protect your privacy (i.e., your reading habits). But what do you lose for your privacy? If you are not anonymous (unlinkable), the digital library service may offer you a better service by recommending books and magazines that are closely related to your reading habits. The same goes for online shopping web sites. On a related note, your online access traces could be a valuable source of income for free services such as YouTube, Hulu, etc.

2. Anonymous publishing
Here I am talking about publishing content without revealing your real identity. Most of the time one publishes under a pseudonym. Publications could be a comment, a blog post, a news article, a paper, photos/videos, tweets, etc. A pseudonym hides your real identity but does not prevent linkability. There are systems such as Freenet and Publius that even make it difficult, if not impossible, to censor what is published; once you publish, no one can take it down. There are good and bad things about anonymous publishing. It is a good thing if someone wants to voice a political opinion or something similar without having to face repercussions. But we make a very important implicit assumption here: that the society we live in always acts in good faith, and that whatever people do falls under what we perceive as 'acceptable'. It'd be naive to think we can always assume this to be true. We do have bad guys - true, they are only a minority - but this minority could do major damage. A simple example is defaming others while hiding behind the screen for personal, political, or business advantage. Isn't it a cowardly act? No question about it.

Here's an example of a defamatory blog (in the victim's own words - let's call him "Joe"):
There is someone who, for complicated psychiatric reasons, developed a severe dislike of me. This is an extraordinarily vindictive and immature girl whom I have NOT wronged in any remotely substantial way. She created an anonymous blog and posted alleging falsely that I'm gay and saying a number of inaccurate and very negative things about my character. (Basically, name-calling.) I'm concerned that this will affect future job prospects since the post appears within the first couple of pages of search results for my name. She confirmed to a mutual friend that she wrote the blog but refused to take it down. Google/blogspot says they don't take down defamatory posts without a court order.

(I am not sure what exactly is legally considered defamatory. Let's assume this qualifies. What actions can Joe take? IMO, hiring a lawyer for a not-so-grave incident like this may cost Joe more than it's worth. If he's worried about his online reputation, the first thing he should do is increase his online presence by posting/blogging true facts, writing about topics of interest, etc.)

Now apply this to a business or corporation, a popular person, or even a major religion. The problem comes when we allow people to freely publish incorrect/falsified information without being accountable. Censorship-resistant systems make the problem worse.

3. Data anonymization for statistical analysis
Here we talk about modifying existing records such that sensitive/private information about individuals cannot be inferred from the published data. For example, say Alice is doing a survey of cancer patients in Indiana. A good source for her survey is the medical records and patient information held by hospitals in Indiana. However, hospitals may not be willing to give Alice raw data, as that would violate patient privacy (and in fact is not allowed under law). Since this study could be beneficial (e.g., correlating cancer with location, public facilities, living habits, etc.), hospitals can anonymize the data such that Alice cannot link what is provided to her with individual residents of Indiana. In the research literature, much work has been done in this area; k-anonymity, l-diversity, and t-closeness are a few examples. A key issue here is the trade-off between privacy and utility. The data can be completely anonymized, providing the highest level of privacy but no utility at all, or it can be published as-is, providing the highest level of utility but no privacy. On-going research tries to strike a balance between these two parameters - sufficiently protecting individually identifiable data while still allowing statistical analysis. I don't have any problem with this type of anonymization; in fact, it is encouraged before releasing data for studies.
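To make the k-anonymity idea concrete, here is a minimal Python sketch with made-up data: quasi-identifiers (ZIP code, age) are coarsened until every record is indistinguishable from at least k-1 others. The generalization policy (truncated ZIP, decade age buckets) is purely illustrative, not taken from any particular paper.

```python
from collections import Counter

def generalize(record):
    """Coarsen quasi-identifiers: truncate ZIP to 3 digits, bucket age
    into decades. (Illustrative policy only.)"""
    zip_code, age, diagnosis = record
    decade = age // 10 * 10
    return (zip_code[:3] + "**", f"{decade}-{decade + 9}", diagnosis)

def is_k_anonymous(records, k):
    """A table is k-anonymous if every quasi-identifier combination
    (here: ZIP, age) appears in at least k records."""
    counts = Counter((z, a) for z, a, _ in records)
    return all(c >= k for c in counts.values())

raw = [("47906", 23, "cancer"), ("47907", 27, "flu"), ("47901", 21, "cancer")]
released = [generalize(r) for r in raw]
print(is_k_anonymous(raw, 2))       # False: each raw record is unique
print(is_k_anonymous(released, 3))  # True: all share ("479**", "20-29")
```

The privacy/utility trade-off shows up directly in the policy: truncating the ZIP to two digits would give stronger privacy but make a location-correlation study like Alice's far less useful.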


As we all know, if you take any real user base, a vast majority of them are good users and only a few of them are bad users. So whatever solution we provide should be beneficial to the vast majority. Since anonymity helps good users in certain scenarios, should we focus more on anonymity over accountability? There are consequences.


Regarding the first two types, a bad effect of anonymity is that it may reduce the accountability one feels for one's actions. This could be an incentive for good people to turn bad and bad people to turn worse. My mother used to tell us that too much of anything is not good. The same applies here. We need to define an 'acceptable' level of freedom of speech and censorship resistance. IMO, there should be a way to identify bad people in anonymous systems while good people continue to remain anonymous.


Another bad side of anonymity is related to trust. We trust a publication that explicitly names its authors more than an anonymous publication, don't we? Of course, there are other ways to increase the level of trust we place in content. For example, many people are comfortable if the author goes under a pseudonym and that pseudonym has a good publication history, if the content backs up its facts with citations, or if it is on a shared content-management system (like Wikipedia) and there is little dispute from other users.

Here's another interesting point raised by Sarah Hinchliff Pearson in her blog:
The National Fair Housing Alliance (NFHA) has been fighting a defamation lawsuit brought by a real estate company that was the target of its fair housing testing. (Disclosure: I helped defend NFHA in this litigation at my prior firm.) NFHA conducted months of well-documented fair housing tests and then reported its results to the media. Despite NFHA’s due diligence, it has been subjected to the burden of ongoing litigation. Yet under amici’s proposed standard, it would likely not have faced this burden if it had reported the results anonymously on the Internet. By giving better protection to anonymous speakers, the heightened standard reflects an implicit judgment that anonymous speech should be valued more highly than regular speech. It also produces a perverse incentive for all speakers to withhold their name from reports, comments, and opinions online.

She argues that we should not place a premium on anonymous contributions. I agree with her when the anonymity is related to publishing. Further, anonymous access may not be desirable for access to restricted materials or when there is a legal requirement to audit. However, for routine tasks such as accessing a digital library (or any other content that has economic value but is innocuous in nature) or voicing your political opinion, it is desirable to have some degree of anonymity.

In conclusion, ideally I would like to see systems where good guys remain anonymous but bad guys are identified. Anonymity in certain cases is a good thing; but there are situations where it could lead to unpleasant consequences - that's where we need some level of accountability. In certain other cases, you may have to pay a price for remaining anonymous. It is likely the issues mentioned in this post will take time and effort to solve. You are more than welcome to provide your thoughts on this.

Update: 12/1/2009
Here's a good article about the dark side of the Internet; it is related to the topic discussed above.

Tuesday, October 13, 2009

Thought of the day

I saw the following quote in a friend's feed:

"The saddest failures in life are those that come from not putting forth the power and will to succeed". ~Edwin P. Whipple

I cannot agree more with this quote. Personally, I don't mind failing. However, I feel bad when I fail knowing that I didn't put in enough effort to succeed. The more I think about this quote, the more certain I feel that it's not just skill/talent that matters; determination and the willingness to prepare yourself play a bigger role.

I try to keep my game simple; there are no short cuts - you've got to practice hard every cricket shot you want to master - you've got to prepare even harder if you are to innovate a new shot. Now apply that to whatever game you play in life. What's your game plan?

Monday, October 5, 2009

Overcoming hibernate/mysql connection reset issue

One of the projects I have been working on uses Java 1.6/JSP/Servlet/Hibernate 3.2/Tomcat/MySQL 5. Since it is just a prototype, I initially used Hibernate's native connection pool management mechanism (which is not recommended for a production-level deployment).

Every now and then, when we tried to connect to the database server, it threw a connection reset exception. This happens because MySQL drops idle connections after the configured wait_timeout. But when I tried to connect a second time, it worked. It is not acceptable to have a piece of software that only works on the second attempt! So, I tried different fixes.

I added the following property to hibernate.cfg.xml:

<property name="hibernate.connection.autoReconnect">true</property>

However, it did not solve the connection reset problem; the first attempt still failed. Apparently, Hibernate's connection pooling library does not support this property.

From Hibernate (Jboss):
Hibernate's own connection pooling algorithm is, however, quite rudimentary. It is intended to help you get started and is not intended for use in a production system, or even for performance testing. You should use a third party pool for best performance and stability.

(It would be helpful if the documentation stated what works and what doesn't. But one can't complain; this is free software.)


There are three possible avenues:
1. modify the MySQL configuration (my.cnf) to use a longer wait_timeout
2. use Tomcat managed connections
3. use a third-party connection pooling library

The first two options were out of my control, since we only have limited privileges on the MySQL and Tomcat instances. So, the only option was to look into #3.

I downloaded c3p0 and added the following configuration to the hibernate.cfg.xml file for a basic setup (I did not try to optimize these figures; I just used numbers that worked for others, since the objective was not performance tuning but getting it working):


<!-- Min pool size -->
<property name="c3p0.min_size">5</property>

<!--Max pool size -->
<property name="c3p0.max_size">20</property>

<!-- Max idle time -->
<property name="c3p0.timeout">1800</property>

<!--Max statements - size of the prepared statement cache -->
<property name="c3p0.max_statements">50</property>

<!-- Set the pooling implementation to c3p0 -->
<property name="connection.provider_class">org.hibernate.connection.C3P0ConnectionProvider</property>


Those are the basic pool settings. Still, the problem of the first-time failure is not solved. We need to tell c3p0 to swallow the first failure and transparently reconnect on the second attempt. This does have a performance cost: the connection is tested every time you want to connect.

You have to set an extra c3p0 property using a c3p0.properties file. Add the file c3p0.properties to the root of the class path (in classes or WEB-INF/classes, for example) and turn on the c3p0.testConnectionOnCheckout property in that file.

c3p0.testConnectionOnCheckout=true

Note from Hibernate:
Don't use c3p0.testConnectionOnCheckout, this feature is very expensive. If set to true, an operation will be performed at every connection checkout to verify that the connection is valid. A better choice is to verify connections periodically using c3p0.idleConnectionTestPeriod.

As you can see, they recommend a polling-based approach where c3p0 periodically tests idle connections. But I guess the right choice also depends on how frequently the Hibernate layer is accessed. In our case, it is not that frequent. I didn't try that option, but it should work.
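If you do go with the recommended periodic check, the property lives in the same c3p0.properties file; the value is in seconds, and 300 here is just an illustrative figure, not one I have tuned:

```
# Test idle connections every 5 minutes instead of on every checkout.
c3p0.idleConnectionTestPeriod=300
```

The trade-off is that a connection dropped by MySQL between two test cycles could still be handed out, so the period should be chosen well below wait_timeout.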

Other pooling libraries such as Apache DBCP and Proxool should also work, but I didn't have time to check them out.
 

References: 1, 2, 3, 4, 5

What DHS knows about you

If you ever wondered what the DHS collects about you from travel agents, read on.

It is a good idea to pay cash or use a one-time credit card number (like the ones Citibank issues, which allow you to set the expiration date and credit limit and to have multiple numbers) if you are booking through a travel agent and are concerned about security/privacy (assuming your PNR is passed to the DHS upon booking).

Or, we need ways to fly under the radar. Anonymous booking?

You can request your PNRs and other records of your international travel that are being kept by the U.S. Customs and Border Protection (CBP) division of the Department of Homeland Security (DHS). I haven't tried this. This link shows how.

They keep records on both travel agents and airline reservation staff:
The CBP eventually admitted that their records include information about travel agents and airline reservation staff...

They collect information from other sources as well:
In February 2009, the DHS admitted that Amtrak and bus companies "voluntarily" provide the DHS with information on bus and train passengers travelling between the USA and Canada and Mexico.

Your travel data may be shared with other parties in addition to DHS:
If you traveled on an airline based in the European Union, or made your reservations or bought your ticket in the EU or from an airline office or travel agency or tour operator in the EU, you can also request your records (including an accounting of what information they passed on directly to the DHS or outsourced or transferred to Computerized Reservation Systems (CRS's) or other commercial entities in the USA), from the airline, travel agency, tour operator, or CRS. Even if they claim that you "consented" to data sharing, EU laws require that they disclose, on request, exactly what data about you they have "shared", and with whom. Note that you can make such a request of a USA-based airline if you bought your ticket from them in Europe. EU data protection law is applicable whenever data is originally collected in the EU, regardless of your citizenship or where the company is based...By subscribing to CRS's based in the USA, and by participating in code-sharing and other marketing (and data sharing) "partnerships", most airlines, travel agencies, and tour operators based in the EU have effectively outsourced and offshored the storage of all of their PNR's and customer data.

Reference: 1, 2

Saturday, October 3, 2009

Open decentralized microblogging

I recently wanted to access Twitter updates through Facebook. I clicked on the "add Twitter" application, but after seeing the authentication anti-pattern it uses, I backed off (yet many of my friends have added it; it looks like their perceived risk is less than the benefits they expect).

If an imaginary dude added Twitter to his FB profile, the conversation would go as follows.
Dude: hey FB, I want to access Tweets.
FB: sure dude, give me your Twitter username and password. (domain - facebook.com)
Dude: my Twitter username and password.
FB: hey twitter, I am (pretending to be) the dude with this user name and password.
Twitter: hey dude (actually FB pretending to be the dude - which Twitter does not know), you are authenticated and welcome back to Twitter.
FB: dude, now you are all set.

Do you want to allow FB (or any other third-party service provider) to pretend to be you to some other service you are already using (e.g., Twitter)? What are the possible risks/benefits of doing so?

At least there are some positive signs: Twitter already has an OAuth API. (They would also like to drop the basic authentication API that uses the above conversation pattern; I guess they keep it due to migration/usability issues.) I would feel a little safer (but not completely safe) if the FB folks implemented the following conversation using a delegated authentication mechanism.

Dude: hey FB, I want to access Tweets.
FB: No problem dude, I am sending you to Twitter (open ups a new browser window - domain twitter.com).
FB: hey Twitter, a dude wants to connect to Twitter.
Twitter: hey dude, FB (or any other third-party dude) wants to access your tweets; you cool with that?
Dude: yep, I am. (Dude type his/her username and password and give approval)
Twitter: hey FB, use this token to access Dude's tweets.
FB: dude, now you are all set.

Notice that the dude did not have to give private information such as his Twitter password to FB. In other words, the Twitter password is still under the control of the dude. Nabeel's dilemma: should I wait till FB provides such an application, or should I sacrifice my private information and go ahead with the current application?
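The delegated conversation above can be sketched as a toy Python simulation. Every class and method name here is made up for illustration; real OAuth additionally involves signed requests, callback URLs, and token expiry. The point is simply that the third party ends up holding a token, never the password:

```python
import secrets

class Twitter:
    """Toy service: checks passwords and issues access tokens."""
    def __init__(self):
        self.passwords = {"dude": "s3cret"}
        self.tokens = {}  # token -> username

    def authorize(self, username, password, third_party):
        # The user authenticates *directly* with Twitter and approves the
        # third party; the password never leaves this call.  A real service
        # would also record which application the token was issued to.
        if self.passwords.get(username) != password:
            raise PermissionError("bad credentials")
        token = secrets.token_hex(8)
        self.tokens[token] = username
        return token  # the third party gets this, not the password

    def get_tweets(self, token):
        user = self.tokens.get(token)
        if user is None:
            raise PermissionError("invalid token")
        return [f"{user}: hello world"]

class Facebook:
    """Toy third party: stores only the token, never the password."""
    def connect(self, twitter, token):
        self.twitter, self.token = twitter, token

    def show_tweets(self):
        return self.twitter.get_tweets(self.token)

twitter = Twitter()
fb = Facebook()
token = twitter.authorize("dude", "s3cret", third_party="FB")  # on twitter.com
fb.connect(twitter, token)
print(fb.show_tweets())  # FB reads tweets without ever seeing the password
```

If the dude later changes his mind, Twitter can simply invalidate that one token; with the password-sharing anti-pattern, the only remedy is changing the password everywhere.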

It may appear I have derailed from the $subject of this blog post, but I was telling you all this to motivate the need for open decentralized microblogging (aka the $subject). There has been some research work in this area. I found the following paper interesting in this regard.

Birds of a FETHR: Open, Decentralized Microblogging, by researchers at Rice University.
Abstract:
Microblogging, as exemplified by Twitter, is gaining popularity as a way to exchange short messages within social networks. However, the limitations of current microblog services—proprietary, centralized, and isolated—threaten the long-term viability of this new medium. In this work we investigate the current state of microblogging and envision an open, distributed micropublishing service that addresses the weaknesses of today’s systems. We draw on traces taken from Twitter to characterize the microblogging workload. Our proposal, fethr, connects micropublishers large and small in a single global network. New messages are gossiped among subscribers using a lightweight http-based protocol. Cryptographic measures protect authenticity and continuity of updates and prove message ordering even across providers.