Scraping an Android app's data: when it works, it is more efficient than web scraping

johnchan · 16 March 2020 00:51

Imagine I give you the Centaline API directly. Wouldn't that be easier than scraping the Centaline website by locating each HTML element and looping through all the pages?

So, how do you do that?

1. Find the APK of your target app via Google.
2. Download and run the script on GitHub. This script modifies the network_security_config.xml file so that the app trusts user-installed certificates, which allows third parties to listen to its network requests.
3. Download mitmproxy, a well-known man-in-the-middle attack tool.
4. Configure your phone to forward its traffic to mitmproxy on port 8080 (for example, by setting the Wi-Fi proxy to the computer running mitmproxy).
5. Install the mitmproxy certificate as a user certificate by visiting mitm.it on your phone.
6. Browse your target app and capture the traffic in mitmproxy.
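
To make the capture step above less manual, here is a minimal sketch of a mitmproxy addon script that appends every JSON response to a file. The class name `JsonCapture` and the output file `captured.jsonl` are my own placeholders, not part of the original workflow, and the "json in content-type" filter is just an assumption about how the target API labels its responses.

```python
# Save as capture_json.py and run with: mitmdump -s capture_json.py
import json


class JsonCapture:
    """Append JSON API responses seen by mitmproxy to a JSON Lines file."""

    def __init__(self, path="captured.jsonl"):
        self.path = path  # hypothetical output file name

    def response(self, flow):
        # mitmproxy calls this hook once for each completed HTTP response.
        content_type = flow.response.headers.get("content-type", "")
        if "json" not in content_type.lower():
            return  # skip HTML, images, and other non-API traffic
        record = {
            "method": flow.request.method,
            "url": flow.request.pretty_url,
            "status": flow.response.status_code,
            "body": flow.response.get_text(),
        }
        with open(self.path, "a", encoding="utf-8") as fh:
            fh.write(json.dumps(record, ensure_ascii=False) + "\n")


# mitmproxy looks for a module-level list named "addons".
addons = [JsonCapture()]
```

After browsing the app for a while, the file should contain one line per API call, which makes it easy to spot the listing endpoints and their query parameters.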

For an introduction to using mitmproxy with an Android phone, please refer to this Medium post.

For how to modify the APK, and to see scraping the Centaline app in action, please refer to my video on LinkedIn.

SSL pinning will prevent such attacks, but there are ways to bypass it. Luckily, none of the major property listing apps in Hong Kong use SSL pinning, haha. Mainland apps do use it. I am still learning how to bypass SSL pinning so that I can scrape some mainland listing apps as well. Will keep you guys updated.

Btw, this is my first post on this Discourse channel! I am John.

gfreeman · 16 March 2020 15:35

Welcome @johnchan, great intro!

Once you have the API, how do you personally scrape it? (I totally agree that APIs should be preferred over scraping Web pages if possible).

johnchan · 17 March 2020 03:50

@gfreeman
First, I use an async request package like aiohttp to improve speed.
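
As a rough illustration of the async approach, here is a minimal sketch using asyncio with aiohttp. The helper names `fetch_all` and `main` and the concurrency limit of 5 are assumptions for the example, not something taken from a real scraper, and it assumes the API returns JSON.

```python
import asyncio


async def fetch_all(urls, fetch_one, max_concurrency=5):
    """Run fetch_one(url) for every URL, at most max_concurrency at a time."""
    semaphore = asyncio.Semaphore(max_concurrency)

    async def bounded(url):
        async with semaphore:
            return await fetch_one(url)

    # gather returns results in the same order as the input URLs.
    return await asyncio.gather(*(bounded(u) for u in urls))


async def main(urls):
    import aiohttp  # third-party package: pip install aiohttp

    # One shared session reuses connections across requests.
    async with aiohttp.ClientSession() as session:

        async def fetch_one(url):
            async with session.get(url) as resp:
                resp.raise_for_status()
                return await resp.json()

        return await fetch_all(urls, fetch_one)
```

You would call it with something like `asyncio.run(main(urls))`, where `urls` are the endpoints you saw in mitmproxy. The semaphore is there so that "async" does not turn into hammering the server with hundreds of requests at once.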

Second, to avoid being blocked, I switch user agents randomly and use the Tor network to change my IP address.

But some sites, like Airbnb, block access from the Tor network, so this does not always work.

P.S. Please treat the server nicely! Scrape as slowly as you can.
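
The points above (random user agents, Tor, and going slowly) could be sketched roughly like this. Note the assumptions: the user-agent strings are only illustrative, the keyword arguments are shaped for the synchronous `requests` library (with its SOCKS extra installed) rather than aiohttp, and the proxy address assumes a local Tor daemon listening on its default SOCKS port 9050.

```python
import random
import time

# Illustrative User-Agent strings; use whatever set suits your target.
USER_AGENTS = [
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 "
    "(KHTML, like Gecko) Chrome/80.0.3987.132 Safari/537.36",
    "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_3) AppleWebKit/605.1.15 "
    "(KHTML, like Gecko) Version/13.0.5 Safari/605.1.15",
    "Mozilla/5.0 (X11; Linux x86_64; rv:74.0) Gecko/20100101 Firefox/74.0",
]

# Assumes a local Tor daemon on its default SOCKS port.
# "socks5h" asks the proxy to resolve DNS names as well.
TOR_PROXY = "socks5h://127.0.0.1:9050"


def build_request_kwargs():
    """Return keyword arguments for requests.get with a random User-Agent."""
    return {
        "headers": {"User-Agent": random.choice(USER_AGENTS)},
        "proxies": {"http": TOR_PROXY, "https": TOR_PROXY},
        "timeout": 30,
    }


def polite_sleep(min_seconds=2.0, max_seconds=5.0):
    """Sleep for a random interval between requests and return the delay."""
    delay = random.uniform(min_seconds, max_seconds)
    time.sleep(delay)
    return delay
```

A loop would then do something like `requests.get(url, **build_request_kwargs())` followed by `polite_sleep()`. The 2 to 5 second range is an arbitrary starting point; slower is kinder to the server.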

gfreeman · 17 March 2020 09:28

Very interesting… Have you considered a scraping framework like Scrapy to deal with all that infrastructure for you?

johnchan · 18 March 2020 02:26

@gfreeman, I am not aware that Scrapy supports async requests (reference). I may be wrong.

If we cannot make async requests, then scraping will be quite slow, since we need to wait for one request to complete before starting the next.

As for switching IPs using Tor, Scrapy does not support that as far as I know. We may need to write our own middleware, like this.

But I do not have much experience with Scrapy.

gfreeman · 20 March 2020 09:47

Scrapy makes asynchronous requests using the Twisted library. See here: https://docs.scrapy.org/en/latest/topics/architecture.html#event-driven-networking

Scrapy is neutral when it comes to which IP to use, but it’s easy to use it with proxies through the HTTP Proxy downloader middleware. See here: https://docs.scrapy.org/en/latest/topics/downloader-middleware.html#scrapy.downloadermiddlewares.httpproxy.HttpProxyMiddleware
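
To make the proxy point concrete, here is a minimal sketch of a custom downloader middleware that sets `request.meta["proxy"]`, which is the key the built-in HttpProxyMiddleware reads. The class name, the `myproject.middlewares` module path, and the proxy address are hypothetical. As far as I know that middleware handles HTTP proxies rather than SOCKS, so the sketch assumes an HTTP proxy such as Privoxy (default port 8118) sitting in front of Tor.

```python
class FixedProxyMiddleware:
    """Downloader middleware that routes requests through one HTTP proxy."""

    # Hypothetical local HTTP proxy, e.g. Privoxy forwarding to Tor.
    PROXY_URL = "http://127.0.0.1:8118"

    def process_request(self, request, spider):
        # Only set the proxy if the spider has not chosen one already.
        request.meta.setdefault("proxy", self.PROXY_URL)
        return None  # None tells Scrapy to keep processing the request


# In settings.py (module path is a placeholder for your own project).
# The number must be lower than HttpProxyMiddleware's default of 750
# so that this middleware runs first.
DOWNLOADER_MIDDLEWARES = {
    "myproject.middlewares.FixedProxyMiddleware": 350,
}
```

Rotating between several proxies would only need `PROXY_URL` replaced by a random choice from a list inside `process_request`.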

Scrapy is probably the best industrial-scale and professional scraping framework in the world right now. I recommend all aspiring scrapers become familiar with it.