An associate professor at New York University's Stern School of Business uncovers who the employers in paid crowdsourcing are, what tasks they post, and how much they pay.
Analyzing the Amazon Mechanical Turk marketplace
Amazon Mechanical Turk (AMT) is a popular crowdsourcing marketplace, introduced by Amazon in 2005. The marketplace is named after an 18th-century "automatic" chess-playing machine that handily beat human opponents. Of course, the automaton was not using any artificial intelligence algorithms. The secret of the Mechanical Turk machine was a human operator, hidden inside the machine, who was the real source of intelligence.
AMT is also a marketplace for small tasks that cannot be easily automated. For example, humans can tell whether two different descriptions correspond to the same product, can easily tag an image with descriptions of its content, and can transcribe an audio snippet with high quality, though all of these tasks are extremely difficult for computers to do.
Using Mechanical Turk, requesters can call a programmatic API to post tasks on the marketplace, which are then fulfilled by human users. This API-based interaction gives the impression that the tasks are completed automatically, hence the name.
In the marketplace, employers are known as requesters and they post tasks, called human intelligence tasks, or HITs. The HITs are then picked up by online users, referred to as workers, who complete them in exchange for a small payment, typically a few cents per HIT.
Since the concept of crowdsourcing is relatively new, many potential participants have questions about the AMT marketplace. For example, a common set of questions that pop up in an "introduction to crowdsourcing and AMT" session are the following:
- Who are the workers that complete these tasks?
- What type of tasks can be completed in the marketplace?
- How much does it cost?
- How fast can I get results back?
- How big is the AMT marketplace?
For the first question, about worker demographics, past research [5, 6] indicated that the workers on the marketplace come mainly from the United States, with an increasing proportion coming from India. In general, the workers are representative of the overall Internet user population but are younger and, correspondingly, have lower incomes and smaller families.
"By observing the practices of the successful requesters, we can learn more about what makes crowdsourcing successful, and increase the demand from the smaller requesters."
At the same time, the answers for the other questions remain largely anecdotal and based on personal observations and experiences. To understand better what types of tasks are being completed today using crowdsourcing techniques, we started collecting data about the marketplace. Here, we present a preliminary analysis of the findings and provide directions for interesting future research.
We started gathering data about AMT in January 2009, and we continue to collect data at the time of this writing. Every hour, we crawl the list of HITs available on AMT and keep the status of each available HIT group (groupid, requester, title, description, keywords, rewards, number of HITs available within the HIT group, qualifications required and time of expiration). We also store the HTML content of each HIT.
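The hourly bookkeeping can be sketched in a few lines. The `Snapshot` record and its field names below are illustrative stand-ins for the stored attributes listed above, not AMT's actual API types:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Snapshot:
    """State of one HIT group as seen in a single hourly crawl (illustrative fields)."""
    group_id: str
    requester: str
    reward: float          # reward per HIT, in dollars
    hits_available: int    # HITs remaining in the group
    expiration: float      # expiration time, as a Unix timestamp

def diff_crawls(prev, curr):
    """Compare two consecutive crawls and report new and disappeared HIT groups."""
    prev_ids = {s.group_id for s in prev}
    curr_ids = {s.group_id for s in curr}
    new = [s for s in curr if s.group_id not in prev_ids]
    gone = [s for s in prev if s.group_id not in curr_ids]
    return new, gone
```

Diffing consecutive crawls in this way is what lets the study attribute each HIT group's disappearance to completion, expiration, or cancellation.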
Following this approach, we can identify the new HITs posted over time, the completion rate of each HIT, and the time at which each disappears from the market, either because it has been completed or expired, or because a requester canceled and removed the remaining HITs. (Identifying expired HITs is easy, as we know the expiration time of a HIT. Identifying cancelled HITs is trickier: we monitor the usual completion rate of a HIT over time and check whether it is plausible, at the time of disappearance, for the remaining HITs to have been completed within the time since the last crawl.)
A shortcoming of this approach is that it cannot measure the redundancy of the posted HITs. So, if a single HIT needs to be completed by multiple workers, we can only observe it as a single HIT.
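A rough version of the expired/completed/cancelled classification described above might look like the following; the one-hour crawl gap and the rate-based threshold are assumptions for illustration, not the exact rule used in the study:

```python
def classify_disappearance(remaining_hits, hits_per_hour, expiration_ts,
                           last_seen_ts, crawl_gap_hours=1.0):
    """Guess why a HIT group vanished between two hourly crawls.

    remaining_hits: HITs left at the last crawl where the group was seen.
    hits_per_hour:  the group's recently observed completion rate.
    expiration_ts / last_seen_ts: Unix timestamps.
    """
    # If the group's expiration fell inside the crawl gap, it simply expired.
    if expiration_ts <= last_seen_ts + crawl_gap_hours * 3600:
        return "expired"
    # Could the remaining HITs plausibly have been finished in one crawl gap?
    if remaining_hits <= hits_per_hour * crawl_gap_hours:
        return "completed"
    # Otherwise the requester most likely cancelled the remaining HITs.
    return "cancelled"
```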
From January 2009 through April 2010, we collected 165,368 HIT groups, with 6,701,406 HITs total, from 9,436 requesters. The total value of the posted HITs was $529,259. These numbers, of course, do not account for the redundancy of the posted HITs, or for HITs that were posted and disappeared between our crawls. Nevertheless, they should be good approximations (within an order of magnitude) of the activity of the marketplace.
One way to understand what types of tasks are being completed in the marketplace is to find the "top" requesters and analyze the HITs that they post. Table 1 shows the top requesters, based on the total rewards of the HITs posted, filtering out requesters that were active only for a short period of time.
We can see that a very small number of active requesters post a significant number of tasks in the marketplace and account for a large fraction of the posted rewards. According to our measurements, the top requesters listed in Table 1, who represent just 0.1 percent of the requesters in our dataset, account for more than 30 percent of the overall activity of the market.
Given the high concentration of the market, the tasks posted by these top requesters indicate what is being completed in the marketplace. Castingwords is the largest requester, frequently posting transcription tasks; two other semi-anonymous requesters post transcription tasks as well.
Among the top requesters we also see two mediator services, Dolores Labs (aka Crowdflower) and Smartsheet.com, which post tasks on Mechanical Turk on behalf of their clients. Such services are essentially aggregators of tasks and provide quality-assurance services on top of Mechanical Turk. The fact that they account for approximately 10 percent of the market indicates that many users interested in crowdsourcing prefer an intermediary that addresses concerns about worker quality and allows the posting of complex tasks without the need for programming.
We also see that four of the top requesters use Mechanical Turk to create a variety of original content: product reviews, feature stories, blog posts, and so on. (One requester, "Paul Pullen," uses Mechanical Turk to paraphrase existing content instead of asking the workers to create content from scratch.) Finally, two requesters use Mechanical Turk to classify a variety of objects into categories; this was the original task for which Amazon used Mechanical Turk.
The high concentration of the market is not unusual for an online community. There is always a long tail of participants with significantly lower activity than the top contributors. Figure 1 shows how this activity is distributed, according to the value of the HITs posted by each requester. The x-axis shows the log2 of the value of the posted HITs, and the y-axis shows the percentage of requesters with that level of activity. As we can see, the distribution is approximately log-normal. Interestingly enough, workers demonstrate approximately the same level of activity.
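The binning behind Figure 1 is straightforward to reproduce. `activity_histogram` below is a hypothetical helper, not the study's actual code, that maps per-requester reward totals to log2 bins:

```python
import math
from collections import Counter

def activity_histogram(reward_per_requester):
    """Bin total posted rewards per requester by log2(value), as in Figure 1.

    Returns {bin: fraction of requesters}, where bin = floor(log2(dollars)).
    """
    bins = Counter(math.floor(math.log2(v)) for v in reward_per_requester if v > 0)
    n = sum(bins.values())
    return {b: c / n for b, c in sorted(bins.items())}
```

Plotting the returned fractions against the bin index reproduces the log2 x-axis described above; an approximately parabolic shape on that axis is the signature of a log-normal distribution.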
For our analysis, we also wanted to examine the marketplace as a whole, to see whether the HITs submitted by other requesters differed significantly from the ones posted by the top requesters. For this, we measured the popularity of keywords across HITgroups: the number of HITgroups with a given keyword, the number of HITs, and the total rewards associated with that keyword. Table 2 shows the results.
Our keyword analysis of all HITs in our dataset indicates that transcription is indeed a very common task on the AMT marketplace. Notice that it is one of the most "rewarding" keywords and appears in many HITgroups, but not in many HITs. This means that most transcription HITs are posted as single HITs and not as groups of many similar HITs. Comparing prices, we also noticed that transcription pays comparatively well per HIT. It is unclear at this point whether this is due to high quality expectations or whether the higher price simply reflects the higher effort required to complete the work.
Beyond transcription, Table 2 indicates that classification and categorization are indeed tasks that appear in many (inexpensive) HITs. Table 2 also indicates that many tasks are about data collection, image tagging and classification, and also ask workers for feedback and advice for a variety of tasks (e.g., usability testing of websites).
To better understand the typical prices paid for crowdsourcing tasks on AMT, we examined the distribution of HIT prices and the sizes of the posted HITgroups. Figure 2 illustrates the results. When examining HIT groups, we can see that only 10 percent have a price tag of $0.02 or less, 50 percent have a price above $0.10, and 15 percent come with a price tag of $1 or more.
However, this analysis can be misleading. In general, HITgroups with a high price contain only a single HIT, while HITgroups with a large number of HITs have a low price. Therefore, if we compute the distribution of HITs (not HITgroups) according to price, we see that 25 percent of the HITs created on Mechanical Turk have a price tag of just $0.01, 70 percent have a reward of $0.05 or less, and 90 percent pay less than $0.10. This analysis confirms the common feeling that most tasks on Mechanical Turk carry tiny rewards.
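A toy example makes the difference between the two distributions concrete: weighting each HIT group once versus weighting each group by its HIT count yields very different medians. The prices below are invented for illustration only:

```python
def weighted_median(pairs):
    """Median of prices where each (price, weight) pair counts `weight` times."""
    pairs = sorted(pairs)
    total = sum(w for _, w in pairs)
    cum = 0
    for price, w in pairs:
        cum += w
        if cum >= total / 2:
            return price

# Toy marketplace: a few expensive single-HIT groups, one huge cheap group.
groups = [(1.00, 1), (0.50, 1), (0.25, 1), (0.01, 1000)]

by_group = weighted_median([(p, 1) for p, _ in groups])  # each group counts once
by_hit = weighted_median(groups)                          # each HIT counts once
```

Here the group-weighted median is $0.25 while the HIT-weighted median is $0.01, mirroring the gap between the HITgroup and HIT distributions described above.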
Of course, this analysis simply scratches the surface of the bigger problem: How can we automatically price tasks, taking into consideration the nature of the task, the existing competition, the expected activity level of the workers, the desired completion time, the tenure and prior activity of the requester, and many other factors? How much should we pay for an image tagging task, for 100,000 images in order to get it done within 24 hours? Building such models will allow the execution of crowdsourcing tasks to become easier for people that simply want to "get things done" and do not want to tune and micro-optimize their crowdsourcing process.
What is the typical activity in the AMT marketplace? What is the volume of the transactions? These are very common questions from people who are interested in understanding the size of the market and its demonstrated capacity for handling big tasks. (Detecting the true capacity of the market is a more involved task than simply measuring its current serving rate. Many workers may show up only when there is a significant amount of work for them, and be dormant under normal loads.)
One way to approach such questions is to examine the task posting and task completion activity on AMT. By studying the posting activity we can understand the demand for crowdsourcing, and the completion rate shows how fast the market can handle the demand. To study these processes, we computed, for each day, the value of tasks being posted by AMT requesters and the value of the tasks that got completed in each day.
We present first an analysis of the two processes (posting and completion), ignoring any dependencies on task-specific and time-specific factors. Figure 3 illustrates the distributions of the posting and completion processes. The two distributions are similar, but we see that, in general, the rate of completion is slightly higher than the rate of arrival. This is not surprising; it is a required stability condition. If the completion rate were lower than the arrival rate, the number of incomplete tasks in the marketplace would grow without bound.
We observed that the median arrival rate is $1,040 per day and the median completion rate is $1,155 per day. If we assume that the AMT marketplace behaves like an M/M/1 queuing system, and using basic queuing theory, we can see that a task worth $1 has an average completion time of 12.5 minutes, resulting in an effective hourly wage of $4.80.
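The arithmetic behind these figures is the standard M/M/1 waiting-time formula, W = 1/(μ − λ), applied to the dollar-denominated arrival and completion rates from the text:

```python
def mm1_wait_minutes(arrival_per_day, completion_per_day):
    """Average time in system for an M/M/1 queue, in minutes: W = 1/(mu - lambda)."""
    w_days = 1.0 / (completion_per_day - arrival_per_day)
    return w_days * 24 * 60

wait = mm1_wait_minutes(1040, 1155)   # ~12.5 minutes for $1 worth of work
hourly_wage = 60 / wait               # ~$4.80 per hour
```

Treating dollars as the "customers" of the queue is what allows a $1 task's sojourn time to be converted directly into an effective hourly wage.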
Of course, this analysis is an over-simplification of the actual process. The tasks are not completed in a first-in, first-out manner. In reality, workers pick tasks according to personal preferences or as guided by the AMT interface. For example, Chilton et al. [4] indicated that most workers use two of the main task-sorting mechanisms provided by AMT to find and complete tasks (the "recently posted" and "largest number of HITs" orders). Furthermore, the completion rate is not independent of the arrival rate.
When there are many tasks available, more workers come to complete tasks, as there are more opportunities to find and work for bigger tasks, as opposed to working for one-time HITs. As a simple example, consider the dependency of posting and completion rates on the day of the week. Figure 4 illustrates the results.
The posting activity from the requesters is significantly lower over the weekends and is typically maximized on Tuesdays. This can be rather easily explained. Since most requesters are corporations and organizations, most of the tasks are being posted during normal working days. However, the same does not hold for workers. The completion activity is rather unaffected by the weekends. The only day on which the completion rate drops is on Monday, and this is most probably a side-effect of the lower posting rate over the weekends. (There are fewer tasks available for completion on Monday, due to the lower posting rate over the weekend.)
An interesting open question is to understand better how to model the marketplace. Work on queuing theory for modeling call centers is related and can help us understand better the dynamics of the market and the way that workers handle the posted tasks. Next, we present some evidence that modeling can help us understand better the shortcomings of the market and point to potential design improvements.
Given that the system does not satisfy the usual M/M/1 queuing assumptions for the analysis of completion times, we analyzed empirically the completion times of the posted tasks. The goal of this analysis was to understand what approaches may be appropriate for modeling the behavior of the AMT marketplace.
Our analysis indicated that the completion time follows (approximately) a power law, as illustrated in Figure 5. We observe some irregularities, with outliers at approximately the 12-hour and seven-day completion times. These are common "expiration times" set for many HITs, hence the sudden disappearance of many HITs at those points. Similarly, we see a different behavior for HITs that are available for longer than one week: these HITs are typically "renewed" by their requesters through the continuous posting of new HITs within the same HITgroup. (A common reason for this behavior is to make the HIT appear on the first page of the "Most recently posted" list of HITgroups, as many workers pick the tasks to work on from this list.) Although it is still unclear what dynamics cause this behavior, the analysis by Barabási indicates that priority-based completion of tasks can lead to such power-law distributions.
To better characterize this power-law distribution of completion times, we used the maximum likelihood estimator for power laws. To avoid biases, we marked as "censored" the HITs that we detected to be aborted before completion and the HITs that were still running on the last crawling date of our dataset. (The details of this estimation are not given in this article.)
The MLE indicated that the most likely exponent for the power-law distribution of completion times on Mechanical Turk is α = -1.48. This exponent is very close to the value predicted theoretically for the queuing model of Cobham, in which each task, upon arrival, is assigned to a queue with a different priority. Barabási indicates that the Cobham model is a good explanation of the power-law distribution of completion times only when the arrival rate is equal to the completion rate of tasks. Our earlier results indicate that for the AMT marketplace this is not far from reality. Hence, the Cobham model of priority-based execution of tasks can explain the power-law distribution of completion times.
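For readers who want to reproduce this kind of fit, the standard continuous MLE for a power-law exponent is a one-liner. Note this sketch ignores the censoring corrections used in the study, and the synthetic sample is generated here purely as a sanity check:

```python
import math
import random

def powerlaw_mle_alpha(xs, x_min):
    """Continuous MLE for a power-law exponent: alpha = 1 + n / sum(ln(x / x_min))."""
    tail = [x for x in xs if x >= x_min]
    return 1 + len(tail) / sum(math.log(x / x_min) for x in tail)

# Sanity check: draw from p(x) ~ x^(-2.5) via inverse-transform sampling,
# x = x_min * (1 - u)^(-1 / (alpha - 1)), then recover the exponent.
random.seed(0)
sample = [1.0 * (1 - random.random()) ** (-1 / 1.5) for _ in range(10000)]
alpha_hat = powerlaw_mle_alpha(sample, x_min=1.0)  # should be close to 2.5
```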
Unfortunately, a system with a power-law distribution of completion times is rather undesirable. Given the infinite variance of power-law distributions, it is inherently difficult to predict the time required to complete a task. Although we can predict that many tasks will finish quickly, there is a non-negligible probability that a posted task will need a significant amount of time to finish. This can happen when a small task is not executed quickly and therefore no longer appears in either of the two preferred lists from which workers pick tasks to work on. The probability of a "forgotten" task increases if the task is not discoverable through any of the other sorting methods either.
This result indicates that it is necessary for the marketplace of AMT to be equipped with better ways for workers to pick tasks. If workers can pick tasks to work on in a slightly more "randomized" fashion, it will be possible to change the behavior of the system and eliminate the "heavy tailed" distribution of completion times. This can lead to a higher predictability of completion times, which is a desirable characteristic for requesters. Especially new requesters, without the necessary experience for making their tasks visible, would find such a characteristic desirable, as it will lower the barrier to successfully complete tasks as a new requester on the AMT market.
We should note, of course, that these results do not take into consideration the effect of various factors. For example, an established requester is expected to have its tasks completed faster than a new requester that has not established connections with the worker community. A task with a higher price will be picked up faster than an identical task with lower price. An image recognition task is typically easier than a content generation task, hence more workers will be available to work on it and finish it faster. These are interesting directions for future research, as they can show the effect of various factors when designing and posting tasks. This can lead to a better understanding of the crowdsourcing process and a better prediction of completion times when crowdsourcing various tasks.
Higher predictability means lower risk for new participants. Lower risk means higher participation and higher satisfaction both for requesters and for workers.
Our analysis indicates that AMT is a heavy-tailed market in terms of requester activity, with the activity of requesters following a log-normal distribution. The top 0.1 percent of requesters account for 30 percent of the dollar activity, and the top 1 percent post more than 50 percent of the dollar-weighted tasks.
A similar activity pattern also appears on the workers' side. This can be interpreted both positively and negatively. The negative aspect is that the adoption of crowdsourcing solutions is still minimal, as only a small number of participants actively use crowdsourcing for large-scale tasks. On the other hand, the long tail of requesters indicates significant interest in such solutions. By observing the practices of the successful requesters, we can learn more about what makes crowdsourcing successful, and increase the demand from the smaller requesters.
We also observe that the activity is still concentrated around small tasks, with 90 percent of the posted HITs giving a reward of $0.10 or less. A next step in this analysis is to separate the price distributions by type of task and identify the "usual" pricing points for different types of tasks. This can provide guidance to new requesters that do not know whether they are pricing their tasks correctly.
Finally, we presented a first analysis of the dynamics of the AMT marketplace. By analyzing the speed of posting and completion of the posted HITs, we can see that Mechanical Turk is a cost-effective task completion marketplace, as the estimated hourly wage is approximately $5.
Further analysis will allow us to gain better insight into "how things get done" on the AMT market, identifying elements that can be improved and lead to a better marketplace design. For example, by analyzing the waiting times of posted tasks, we find significant evidence that workers are limited by the current user interface and complete tasks by picking the HITs surfaced by one of the existing sorting criteria. This limitation leads to a high degree of unpredictability in completion times, a significant shortcoming for requesters that want a high degree of reliability. A better search and discovery interface (or perhaps a better task recommendation service, a specialty of Amazon.com) can lead to improvements in the efficiency and predictability of the marketplace.
Further research is also necessary to better predict how changes in the design and parameters of a task affect quality and completion speed. Ideally, we should have a framework that automatically optimizes all aspects of task design. Database systems hide the underlying complexity of data management, using query optimizers to pick appropriate execution plans. Google Predict hides the complexity of predictive modeling by offering an auto-optimizing framework for classification. Crowdsourcing can benefit significantly from the development of frameworks that provide similar abstractions and automatic task optimization.
1. Mechanical Turk Monitor, http://www.mturk-tracker.com.
4. Chilton, L. B., Horton, J. J., Miller, R. C., and Azenkot, S. 2010. Task search in a human computation market. In Proceedings of the ACM SIGKDD Workshop on Human Computation (Washington, DC, July 25, 2010). HCOMP '10. ACM, New York, NY, 1-9.
5. Ipeirotis, P. 2010. Demographics of Mechanical Turk. Working paper CeDER-10-01, New York University, Stern School of Business. Available at http://hdl.handle.net/2451/29585.
6. Ross, J., Irani, L., Silberman, M. S., Zaldivar, A., and Tomlinson, B. 2010. Who are the crowdworkers?: Shifting demographics in Mechanical Turk. In Proceedings of the 28th International Conference Extended Abstracts on Human Factors in Computing Systems (Atlanta, Georgia, USA, April 10-15, 2010). CHI EA '10. ACM, New York, NY, 2863-2872.
7. M/M/1 model, http://en.wikipedia.org/wiki/M/M/1_model.
Panagiotis G. Ipeirotis is an associate professor at the Department of Information, Operations, and Management Sciences at the Leonard N. Stern School of Business of New York University. His recent research interests focus on crowdsourcing. He received his PhD in computer science from Columbia University in 2004, with distinction, and has received two Microsoft Live Labs Awards, two best paper awards (IEEE ICDE 2005, ACM SIGMOD 2006), two best paper runner-up awards (JCDL 2002, ACM KDD 2008), and a CAREER award from the National Science Foundation. This work was supported by the National Science Foundation under Grant No. IIS-0643846.
Figure 5. The distribution of completion times for HIT groups posted on AMT. The distribution does not change significantly if we use the completion time per HIT (and not per HITgroup), as 80 percent of the HIT groups contain just one HIT.
©2010 ACM 1528-4972/10/1200 $10.00
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
The Digital Library is published by the Association for Computing Machinery. Copyright © 2010 ACM, Inc.
CAPTCHA (Completely Automated Public Turing test to tell Computers and Humans Apart): A contrived acronym intentionally redolent of the word "capture," used to describe a test issued on web forms to protect against automated responses.
GWAP (Game With A Purpose): A term used to describe a computer game that layers a recreational challenge on top of a problem that demands human intelligence for efficient solution, e.g., protein folding.
HIT (Human Intelligence Task): A task that an AMT requester is willing to pay to have accomplished by AMT providers. More generally, a task that may be best completed via crowdsourcing.
HGS (Human-Guided Search): A research project investigating a strategy for search and optimization problems that incorporates human intuition and insight.