Using Weighted Sampling to Understand the Prevalence of Spam

To fight spam effectively, we need an unbiased estimate of how much bad content there is in the ecosystem and where it resides. In this presentation we discuss sampling schemes for identifying the small percentage of bad content viewed, in both user-generated content and commercially motivated content such as ads and sponsored posts. These methods use ML-derived classifiers to weight the sampling, which increases the volume of bad content in the samples. With more bad content in hand, we can segment it further and measure the prevalence of bad material within particular segments, or as identified by particular policies.
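As a rough illustration of the idea, the sketch below draws items with probability proportional to a classifier score and then reweights each reviewed item by the inverse of its draw probability, so that the oversampling of bad content does not bias the prevalence estimate. The abstract does not say which estimator the presentation uses; the Hansen-Hurwitz style estimator, the score floor, the synthetic population, and every function name here are assumptions made for illustration only.

```python
import random


def weighted_sample(scores, n, floor=0.01, rng=None):
    """Draw n indices with replacement, with probability proportional to
    max(score, floor). The floor (an assumed value) keeps every item
    reachable, which the unbiasedness of the estimate depends on."""
    rng = rng or random.Random()
    weights = [max(s, floor) for s in scores]
    total = sum(weights)
    probs = [w / total for w in weights]
    sampled = rng.choices(range(len(scores)), weights=probs, k=n)
    return sampled, probs


def estimate_prevalence(sampled, probs, labels, population_size):
    """Hansen-Hurwitz style estimate of the fraction of bad items.
    labels[i] is 1 if reviewers judged item i bad, else 0."""
    n = len(sampled)
    return sum(labels[i] / (population_size * probs[i]) for i in sampled) / n


def demo(seed=0, population_size=20000, n=2000):
    """Synthetic check: returns (true prevalence, estimate, bad share of sample)."""
    rng = random.Random(seed)
    # Hypothetical population: about 2% bad, with an imperfect classifier
    # that tends to score bad items higher than good ones.
    labels = [1 if rng.random() < 0.02 else 0 for _ in range(population_size)]
    scores = [rng.uniform(0.5, 1.0) if y else rng.uniform(0.0, 0.3) for y in labels]

    sampled, probs = weighted_sample(scores, n, rng=rng)
    # In practice only the sampled items would be sent for human review;
    # here the synthetic labels stand in for those review outcomes.
    estimate = estimate_prevalence(sampled, probs, labels, population_size)
    true_prevalence = sum(labels) / population_size
    sample_bad_share = sum(labels[i] for i in sampled) / n
    return true_prevalence, estimate, sample_bad_share


if __name__ == "__main__":
    true_prevalence, estimate, sample_bad_share = demo()
    print(f"true prevalence:     {true_prevalence:.4f}")
    print(f"weighted estimate:   {estimate:.4f}")
    print(f"bad share of sample: {sample_bad_share:.4f}")
```

In this toy setup the sample holds a much larger share of bad items than the population does, which is what makes finer segmentation feasible, while the reweighted estimate stays close to the true prevalence. Restricting the sum to items in one segment or flagged under one policy gives that slice's contribution to overall prevalence; dividing by the slice's share of the population gives prevalence within the slice.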
