Scraping an Android app's data: when it works, it is more efficient than web scraping

Imagine I give you the following Centaline API. Would it be easier than scraping the Centaline website by locating each HTML element and looping through all the pages?

So, how do you do that?

  1. Find the APK of your target app via Google.
  2. Download and run the script on GitHub. This script modifies the network_security_config.xml file to allow third parties to listen to network requests.
  3. Download mitmproxy, a well-known man-in-the-middle attack tool.
  4. Set up your phone to forward its traffic to port 8080 (mitmproxy's default listening port) on the machine running mitmproxy.
  5. Download the mitmproxy user certificate by going to mitm.it on your phone.
  6. Browse your target app and capture the traffic in mitmproxy.
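
For step 6, a small mitmproxy addon script can log the API calls for you instead of you reading them off the screen. This is only a rough sketch: the output file name and the "keep only JSON responses" filter are my own assumptions, not something from the original setup.

```python
import json

# Hypothetical output file; one JSON record per line.
OUTPUT_PATH = "captured_api_calls.jsonl"


def summarize(flow):
    """Return a small dict for JSON responses, or None for anything else."""
    content_type = flow.response.headers.get("content-type", "")
    if "json" not in content_type.lower():
        return None  # skip images, HTML, fonts, etc.
    return {
        "method": flow.request.method,
        "url": flow.request.pretty_url,
        "status": flow.response.status_code,
        # strict=False: decode leniently instead of raising on bad bytes
        "body": flow.response.get_text(strict=False),
    }


def response(flow):
    """mitmproxy event hook, called once for each completed HTTP response."""
    record = summarize(flow)
    if record is not None:
        with open(OUTPUT_PATH, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")
```

If you save it as, say, `capture_json.py` (a made-up name), you can load it with `mitmdump -s capture_json.py` and then browse the app. The URLs that show up in the log are the candidates for the API you want to call directly.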

For an introduction to using mitmproxy with an Android phone, please refer to this Medium post.

To see how to modify the APK, and to watch scraping the Centaline app in action, please refer to my video on LinkedIn.
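
To give an idea of what the change to network_security_config.xml in step 2 amounts to: since Android 7, apps do not trust user-installed certificates by default, so the config has to list `user` as a trust anchor. I cannot say what the GitHub script writes exactly, so treat the sketch below as an assumption. The `write_config` helper and the idea of a decoded APK folder are hypothetical too.

```python
from pathlib import Path
import xml.etree.ElementTree as ET

# A network security config that trusts both system and user-installed CAs.
NETWORK_SECURITY_CONFIG = """<?xml version="1.0" encoding="utf-8"?>
<network-security-config>
    <base-config>
        <trust-anchors>
            <certificates src="system" />
            <certificates src="user" />
        </trust-anchors>
    </base-config>
</network-security-config>
"""


def trusted_sources(xml_text):
    """List the certificate sources a config trusts, e.g. ['system', 'user']."""
    root = ET.fromstring(xml_text.encode("utf-8"))
    return [cert.get("src") for cert in root.iter("certificates")]


def write_config(decoded_apk_dir):
    """Hypothetical helper: write the config into a decoded APK folder."""
    target = Path(decoded_apk_dir) / "res" / "xml" / "network_security_config.xml"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(NETWORK_SECURITY_CONFIG, encoding="utf-8")
    return target
```

The app's manifest also has to point at this file through the `android:networkSecurityConfig` attribute on `<application>`, and the APK has to be rebuilt and re-signed afterwards.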

SSL pinning prevents this kind of interception, although there are ways to bypass it. Luckily, none of the major property listing apps in Hong Kong use SSL pinning, haha. The mainland apps do use it. I am still learning how to bypass SSL pinning so that I can scrape some mainland listing apps as well. Will keep you guys updated.

Btw, first post in this discourse channel! I am John :slight_smile:

Welcome @johnchan, great intro!

Once you have the API, how do you personally scrape it? (I totally agree that APIs should be preferred over scraping Web pages if possible).

@gfreeman
First, I use an async request package like aiohttp to improve speed.

Second, to avoid being blocked, I switch user agents randomly and use the Tor network to change my IP address.

But some sites, like Airbnb, block Tor network access, so this does not always work.

P.S. Please treat the server nicely! Do it as slowly as you can :stuck_out_tongue:
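
Roughly, the async part with random user agents and a polite delay could look like the sketch below. The user-agent strings, the concurrency limit and the delay are placeholder values I picked for illustration, and Tor is left out here.

```python
import asyncio
import random

# Example user-agent strings; swap in whatever set you want to rotate through.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/120.0.0.0 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/17.0 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:121.0) Gecko/20100101 Firefox/121.0",
]


def random_headers():
    """Pick a different User-Agent header for each request."""
    return {"User-Agent": random.choice(USER_AGENTS)}


async def fetch_json(session, url, semaphore, delay=1.0):
    """Fetch one URL while holding a semaphore slot, then pause."""
    async with semaphore:
        async with session.get(url, headers=random_headers()) as resp:
            resp.raise_for_status()
            data = await resp.json()
        await asyncio.sleep(delay)  # be nice to the server
        return data


async def crawl(urls, max_concurrency=3, delay=1.0):
    """Fetch all URLs concurrently; results come back in input order."""
    import aiohttp  # third-party: pip install aiohttp

    semaphore = asyncio.Semaphore(max_concurrency)
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_json(session, u, semaphore, delay) for u in urls]
        return await asyncio.gather(*tasks)
```

You would run it with something like `asyncio.run(crawl(list_of_api_urls))`. The semaphore is what keeps "async" from turning into "hammer the server with every request at once".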

Very interesting… Have you considered a scraping framework like Scrapy to deal with all that infrastructure for you?

@gfreeman I was not aware that Scrapy supports async requests (reference). I may be wrong.

If we cannot do async requests, then scraping will be quite slow, since we need to wait for one request to complete before starting the next.

As for switching IPs using Tor, Scrapy does not support that as far as I know. We may need to write our own middleware like this.

But I do not have much experience with Scrapy.

Scrapy makes asynchronous requests using the Twisted library. See here: https://docs.scrapy.org/en/latest/topics/architecture.html#event-driven-networking

Scrapy is neutral when it comes to which IP to use, but it’s easy to use it with proxies through the HTTP Proxy downloader middleware. See here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
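
As a rough illustration of how little is needed, a custom downloader middleware only has to set `request.meta["proxy"]` and the built-in HTTP proxy middleware takes it from there. In the sketch below, the `PROXY_LIST` setting name and the class name are made up for the example; they are not built into Scrapy.

```python
import random


class RandomProxyMiddleware:
    """Assign a random proxy from a list to each outgoing request."""

    def __init__(self, proxies):
        self.proxies = list(proxies)

    @classmethod
    def from_crawler(cls, crawler):
        # PROXY_LIST is a hypothetical custom setting, e.g.
        # PROXY_LIST = ["http://127.0.0.1:8118", "http://proxy.example.com:3128"]
        return cls(crawler.settings.getlist("PROXY_LIST"))

    def process_request(self, request, spider=None):
        # Leave requests alone if a proxy was already chosen for them.
        if self.proxies and "proxy" not in request.meta:
            request.meta["proxy"] = random.choice(self.proxies)
        return None  # let Scrapy continue processing the request
```

It would be enabled through `DOWNLOADER_MIDDLEWARES` in the project settings, with an order number below 750 so that it runs before `HttpProxyMiddleware`. The proxy URLs above are placeholders; whether they point at paid proxies or at a local HTTP proxy sitting in front of Tor is up to you.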

Scrapy is probably the best industrial-scale and professional scraping framework in the world right now. I recommend all aspiring scrapers become familiar with it.
