Enable Simple Anti-Scraping for Websites

When it comes to web scraping, I believe that most developers can act as creators of both the spear and the shield.

  • So why create web scrapers? To make valuable information programmatically accessible.
  • And why fight against web scrapers? To ensure that the value of information is owned by its rightful owners.

Anti-Scraping Techniques#

In addition to implementing basic risk control measures for visitors, you may have come across some niche anti-scraping techniques, such as:

Dynamic Font Rendering#

For example, the case of Maoyan Movies.

image

When you view the page source, the numbers appear as garbled text. The trick is to dynamically generate a font on the backend so that the data only renders correctly for users viewing the page in a browser.

Shuffled Paragraph Rendering#

For example, the case of a news website.

image

The font is not hidden, but the order of the paragraphs is shuffled. The frontend then uses the data-s attribute to restore the correct paragraph order.
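
As a rough illustration of that idea, restoring the order on the frontend might look something like the sketch below. The markup (an #article container whose shuffled <p> elements carry their true position in data-s) is a hypothetical assumption; the actual site's structure isn't shown here.

// Hypothetical markup: each shuffled <p> stores its real position in data-s.
// Sorting by that value and re-appending the nodes restores the reading order.
const container = document.querySelector('#article');
Array.from(container.querySelectorAll('p[data-s]'))
  .sort((a, b) => Number(a.dataset.s) - Number(b.dataset.s))
  .forEach(p => container.appendChild(p)); // appendChild moves an existing node to the end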

These techniques rely on the frontend's ability to execute JavaScript so that real users can still read the content normally. Of course, both of the cases above have known workarounds that let a scraper easily restore the content to a readable form.

Therefore, the more common approach today is to apply risk control to the "user" itself: determining whether a visitor is a real, non-malicious person.

However, anti-scraping is not only about stopping malicious attempts to harvest valuable data. Sometimes you simply want to prevent any automated retrieval of your website's content at all.

Why Prevent Automated Retrieval of Website Content#

Have you seen these before?

image

Image source: https://cloud.tencent.com/developer/article/1591533

Or these:

image

This is something I've encountered myself. 😢

You might wonder: I only shared the website URL in an instant messaging app, and sometimes I didn't share it at all, so how did it get flagged?

It's because something is constantly fetching the website's content and matching it against keywords to make a crude judgment.

⚠️ Note: I'm not teaching you to exploit loopholes. In addition to machine-based review, there is still a manual review process, so don't rely on luck.

So our requirement is simple: prevent non-browser access (including browsers with JavaScript disabled) from obtaining the actual website content.

image

The "JS Challenge" provided by Cloudflare, for example, makes visitors wait a few seconds before they can access the website.

But this is not an advertisement for it; if it were all upside, I wouldn't have written this article. In mainland China, Cloudflare turns into a decelerator: even if you've paid a lot for a premium network line, once traffic goes through their CDN (at least on the free plan), everyone ends up equally slow.

However, not every visitor that retrieves content automatically means you harm; search engines do it too. So the waiting page also needs to be SEO-friendly, to avoid hurting the site's search performance.

Implementing a JS Challenge#

What Does a JS Challenge Do?#

Before implementing it, let's understand what a JS Challenge does:

  • When a webpage is accessed, it checks if the user has been marked as legitimate, such as using cookies and session data. If legitimate, it directly renders the webpage content.
  • If not legitimate, it renders a waiting page that contains encrypted (or unencrypted) JS code. When executed, the code produces a unique result, which is then compared with a pre-stored result on the server. If they match, the user is marked as legitimate for a certain period of time.

In other words, conventional scraping methods (such as directly simulating HTTP requests) cannot access the website properly because they cannot execute the expected JS code.

What Should We Implement?#

Unfortunately, I don't know what keywords to search for, and even when I asked GPT, it only recommended existing solutions. Searching for "js challenge" only returns programming questions...

So I had to find a way to implement it myself. I summarized the following requirements for the program:

  • A piece of JS code that takes time to compute
  • The server needs to set the answer in advance
  • The answer should be unique and not easily obtainable on the frontend

Then I thought of something: blockchain. I won't go into detail here, and what I'm building is nowhere near as complex, but one of its building blocks seems to meet the requirements above: hashing.

In simple terms, a hash function is a fixed algorithm that turns any content into a fixed-length string that is, for practical purposes, unique to that content. For example, the commonly used MD5 algorithm produces a 32-character string composed of 0-9 and a-f.

Implementation#

Setting the Goal#

What kind of thing can the server know in advance that is not easily obtainable on the frontend? My answer is:

The server generates a "string that meets certain conditions," and the frontend calculates a number that produces a hash ending with the string provided by the server. Then the frontend sends the number to the server, which checks if the hash produced by the number ends with the string it provided.
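
To make the scheme concrete, here is an end-to-end sketch of one round trip in Node.js. It is purely illustrative (the names are mine, not from the article), and it jumps ahead by using SHA-1 and a 4-character suffix, which is where the article lands below; the browser-side implementation comes next.

// Illustrative proof-of-work round trip using Node's built-in crypto module.
const crypto = require('crypto');
const sha1hex = s => crypto.createHash('sha1').update(s).digest('hex');

// Server: generate a random 4-character target suffix and remember it.
const target = sha1hex(String(Math.random())).slice(-4);

// Client: brute-force a number whose hash ends with the target.
let answer = 0;
while (!sha1hex(String(answer)).endsWith(target)) answer++;

// Server: verify the submitted answer against the stored target.
console.log(sha1hex(String(answer)).endsWith(target)); // true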

So how do we calculate the hash in the browser? JavaScript exposes the Web Crypto API through the crypto object, and its browser compatibility is as follows:

image

Let's use it to implement the solution.

image

The general spirit here: just trust it without testing it

Verifying the Algorithm#

Of course, SHA-1 is considered an insecure hash algorithm, so GPT warned me every time it generated the code. Fortunately, it did eventually give me this snippet:

async function sha256(str) {
  const encoder = new TextEncoder();
  const data = encoder.encode(str);
  const hash = await crypto.subtle.digest('SHA-256', data);
  const hexHash = Array.from(new Uint8Array(hash))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
  return hexHash;
}

async function findHashCollision() {
  const targetSuffix = 'fff';
  const maxLength = 10;
  const output = [];

  for (let i = 0; i < Math.pow(36, maxLength); i++) {
    const str = i.toString(36).padStart(maxLength, '0');
    const hash = await sha256(str);
    if (hash.endsWith(targetSuffix)) {
      output.push(str);
    }
  }

  return output;
}

findHashCollision().then(output => {
  console.log(output);
});

But GPT stubbornly insisted on giving me SHA-256. The approach finds a matching hash by hashing one number after another, so let's switch it to SHA-1, make a few modifications, and test it in the browser:

const encoder = new TextEncoder();
async function sha1(str) {
  const hash = await crypto.subtle.digest('SHA-1', encoder.encode(str));
  return Array.from(new Uint8Array(hash))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

async function work(target) {
  for (let i = 0; i < Number.MAX_SAFE_INTEGER; i++) {
    const hash = await sha1(String(i));
    if (hash.endsWith(target)) {
      return i;
    }
  }
}

work('fff').then(output => {
  console.log(output);
});

After testing, it found an answer for a 3-character suffix almost instantly, took about 1 second for a 4-character suffix, and started to struggle at 5 characters, taking more than 10 seconds.

Weighing everything up, let's go with 4 characters.
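
For reference, the timing above can be reproduced with a quick console check along these lines (the target string is just an example):

// Rough console benchmark for the work() function defined above.
console.time('4-char suffix');
work('ffff').then(answer => {
  console.timeEnd('4-char suffix'); // prints the elapsed time
  console.log('answer:', answer);
});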

Testing Its Generality#

Testing with just ffff or aaaa might not be enough, so let's try some random strings:

image

The speed for 9a9a was slightly slower, but still around 1 second. So let's consider the verification successful: the backend randomly generates a string, and the frontend finds a number that produces a hash ending with that string.

Creating a Nice-Looking Frontend#

There is still a slight delay before accessing our website, so we need to create a friendly waiting page.

But if we put too much on it, the page might not finish loading before the redirect happens. After thinking it over, why not just imitate Cloudflare? 🤪

image
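
The waiting page itself isn't shown in the post, so here is a minimal sketch of the script it could run. This is not the actual template: the data-code attribute, the TARGET name, and the reload-with-query-string approach are all illustrative assumptions; only the _challenge parameter and the code value handed over by the middleware below come from the article.

// Sketch of the waiting page's script. The server is assumed to inject the
// 4-character challenge into the page, e.g. <script data-code="{{ $code }}">.
const TARGET = document.currentScript.dataset.code;

const encoder = new TextEncoder();
async function sha1(str) {
  const hash = await crypto.subtle.digest('SHA-1', encoder.encode(str));
  return Array.from(new Uint8Array(hash))
    .map(b => b.toString(16).padStart(2, '0'))
    .join('');
}

async function work(target) {
  for (let i = 0; i < Number.MAX_SAFE_INTEGER; i++) {
    if ((await sha1(String(i))).endsWith(target)) return i;
  }
}

// Solve the challenge, then reload the page with the answer so the
// middleware below can verify it and mark the session as passed.
work(TARGET).then(answer => {
  const url = new URL(location.href);
  url.searchParams.set('_challenge', answer);
  location.replace(url.toString());
});

Keep in mind that crypto.subtle is only available in secure contexts (HTTPS or localhost), so the waiting page needs to be served over HTTPS.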

Backend Implementation#

// Skip the challenge if this session has already passed
$status = session('challenge');
if ($status === "pass")
    return $next($request);

// The visitor submitted an answer: accept it if its SHA-1 hash ends with the stored challenge
if (isset($_REQUEST['_challenge'])) {
    if (substr(sha1($_REQUEST['_challenge']), -4) === $status) {
        session(['challenge' => 'pass']);
        return $next($request);
    }
}

// Otherwise generate a new 4-character challenge and render the waiting page
$challenge = substr(sha1(rand()), -4);
session(['challenge' => $challenge]);
return response()->view('common/challenge', ['code' => $challenge]);

The code above is all there is to it; the implementation is fairly simple.

Adding Some Details#

Besides malicious bots, there are also good bots, such as search engines.

Search engines are also crawlers, but they won't spend resources executing your JS Challenge. So what do we do? Since every product is different, I'll just outline a couple of ideas:

  • Run the JS Challenge at a later stage of request handling, after the page has been rendered, so that SEO information can still be obtained.
  • Let requests that match the characteristics of search engine crawlers bypass the challenge. Those characteristics can be spoofed, though, and I'm not sure how Cloudflare handles this; I'll study it when I have time.

So in my product, I use an IP database and the country code passed by Cloudflare:

// Prefer the country code that Cloudflare already passes along
if (isset($_SERVER["HTTP_CF_IPCOUNTRY"]))
    $isoCode = $_SERVER["HTTP_CF_IPCOUNTRY"];
else {
    // Fall back to a local GeoLite2 country lookup (Reader from the MaxMind GeoIP2 library)
    $reader = new Reader(storage_path('app/library/GeoLite2-Country.mmdb'));
    $isoCode = $reader->country($request->ip())->country->isoCode;
}
// Only challenge visitors from mainland China; everyone else passes straight through
if ($isoCode != 'CN') {
    session(['challenge' => 'pass']);
    return $next($request);
}

With this processing, we can achieve our goal.
