Heroku is a powerful tool to supplement your value as a developer. In this article, I’m going to discuss how to combine Heroku and Salesforce to collect valuable information on the Internet for free. I’ve been dabbling in Python and web scraping for a few months now and felt comfortable meshing multiple systems together into an automation symphony. We’re going to pull a list of countries from a site to keep this simple. However, if you want to scrape more complex sites, there are plenty of Selenium WebDriver Python tutorials out there.
Definitions
Before I dive too deep, I want to clarify a few things. Python is a high-level programming language often used to teach new computer scientists, write artificial intelligence, or perform various tasks like web scraping. It’s one of the simpler languages to pick up but does about as much as any other language. I usually write a third as many lines in Python as I do in other languages, including Apex.
Web scraping (also known as “spidering”) is a method of programmatically retrieving information from websites for your own purposes. Instead of manually visiting a website, such as a stock market page, to look up information, wouldn’t it be nice if your computer just downloaded that information into an Excel spreadsheet for you every day? What if that downloaded info could be converted into datapoints that allow you to make quicker decisions about your business? Those are the kinds of problems web scraping attempts to solve.
A word of caution: not every site likes to be scraped, and some may even take legal action if you violate their rules. Every site has different rules, so it’s a good idea to review its robots.txt file to make sure you’re behaving in an acceptable way. The robots.txt file tells you which pages can be scraped and, in some cases, how often. If the website has an API, use that instead. The legality of scraping is also debated, and developers sometimes refer to it as “jaywalking on the Internet.” But I always come back to the fact that search engines like Google rely entirely on their ability to effectively scrape websites. You should be fine if you’re respectful about how often you scrape and what time of day you scrape, and if you only scrape permitted pages. For instance, querying a page only once every 10 seconds during non-peak hours (e.g. midnight) is often an accepted method.
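Here is a minimal sketch of that robots.txt check using Python’s built-in urllib.robotparser. The robots.txt content, the example.com URLs, and the helper names are all made up for illustration, and the rules are parsed from a string so the example runs without touching the network. Against a real site you would point the parser at the site’s own robots.txt instead.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content (an assumption for this example).
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Crawl-delay: 10
"""


def build_parser(robots_txt):
    """Parse robots.txt text into a RobotFileParser."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser


def can_scrape(parser, url, user_agent="*"):
    """Return True if the rules allow this user agent to fetch the URL."""
    return parser.can_fetch(user_agent, url)


def polite_delay(parser, user_agent="*", default_seconds=10.0):
    """Use the site's Crawl-delay if it declares one, else fall back to a default."""
    delay = parser.crawl_delay(user_agent)
    return float(delay) if delay is not None else default_seconds


if __name__ == "__main__":
    rules = build_parser(SAMPLE_ROBOTS_TXT)
    for url in ("https://example.com/countries", "https://example.com/private/data"):
        print(url, "allowed" if can_scrape(rules, url) else "blocked")
    print("wait", polite_delay(rules), "seconds between requests")
```

In a real scraper you would call time.sleep() with that delay between page loads.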
As with any article I’ve written, I’ll discuss why you would want to do something like this in your role as a Salesforce developer. Salesforce is a complex and powerful system, but it can’t do everything; nor should it. There are things other languages do beautifully that just can’t be done easily in Salesforce. Since Salesforce acquired Heroku around 2010-2011, developers can now incorporate more advanced techniques and logic that are generally available in open-source languages. For instance, you could pull in project opportunities from RFP websites automatically, parse contact information from websites as a Lead generation tool, and so much more. This project is a nice way to add value to your work without incurring any (or hardly any) costs.
Setup
Before you can code any of this, you’ll need the right tools. Below are the items I recommend for a straightforward experience.
IDE: PyCharm (choose the Community option – it’s free for learning purposes)
ChromeDriver (you’ll use this to programmatically control Chrome for local testing – make sure it matches your current Chrome version)
Heroku – Python Getting Started (do all of this to ensure it’s all working. You won’t use the web app part but it’s good to understand)
Salesforce Setup
We’ll use a sandbox to store this information. All that needs to be done is to create an Object called “Countries.” You’ll need to go to Setup -> Object Manager -> Create.
Configure Heroku
The first thing is to configure the Heroku application. We’ll have to do the following.
Create a Heroku App
Add Python buildpack
Add Headless Google Chrome buildpack
Add Chromedriver buildpack
Add environment variables
Install Postgres Add-on
Install Heroku Connect Add-on
Create a Heroku App
First, we’ll need to make an app. Once you’re logged into the Heroku website, click on “New” -> “Create new app.”
You can choose any name as long as it’s unique across all of Heroku.
Add Python Buildpack
Click on the Settings tab.
Scroll down to “Buildpacks” and click the “Add buildpack” button.
Click on the “Python” option and “Save Changes.”
Repeat these steps to add the Headless Google Chrome and Chromedriver buildpacks by URL.
Scroll up to “Config Vars” and click “Reveal Config Vars.”
Add the following keys and values.
CHROMEDRIVER = /app/.chromedriver/bin/chromedriver
GOOGLE_CHROME = /app/.apt/usr/bin/google-chrome
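To show how those two config vars might be consumed, here is a rough sketch that reads them from the environment and falls back to local values when they are not set. The function names and the local fallback paths are my own placeholders, not names from the project, and build_driver assumes Selenium 4 is installed.

```python
import os

# Local fallbacks are placeholders (assumptions); point them at your own installs.
LOCAL_CHROMEDRIVER = "./chromedriver"
LOCAL_CHROME = ""  # empty string: let Selenium find the local Chrome install


def resolve_chrome_paths(env=os.environ):
    """Return (chromedriver_path, chrome_binary_path) from the config vars."""
    driver_path = env.get("CHROMEDRIVER", LOCAL_CHROMEDRIVER)
    chrome_path = env.get("GOOGLE_CHROME", LOCAL_CHROME)
    return driver_path, chrome_path


def build_driver(env=os.environ):
    """Create a headless Chrome driver (assumes Selenium 4 is installed)."""
    # Imported here so resolve_chrome_paths works even without Selenium.
    from selenium import webdriver
    from selenium.webdriver.chrome.service import Service

    driver_path, chrome_path = resolve_chrome_paths(env)
    options = webdriver.ChromeOptions()
    options.add_argument("--headless")
    options.add_argument("--no-sandbox")
    options.add_argument("--disable-dev-shm-usage")
    if chrome_path:
        options.binary_location = chrome_path
    service = Service(executable_path=driver_path)
    return webdriver.Chrome(service=service, options=options)


if __name__ == "__main__":
    print(resolve_chrome_paths())
```

Reading the paths from the environment means the same code can run on Heroku and on your own machine without edits.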
Now you’ll have to add both the Postgres and Heroku Connect add-ons. Simply go to your app and click on Add-ons in the overview. The hyperlink for “Configure Add-ons” is where we’ll add them. Search for and provision both Heroku Postgres and Heroku Connect. The free versions will work for this.
Click on the Heroku Connect add-on so you can begin authentication.
Click “Setup Connection” to start the login process with the org you wish to tie to this.
Now we must define the mappings and which objects are permitted. Click on the “Mappings” tab and then “Create Mapping.”
Find the Country object you made and click on it. Check all the fields you want to come over. At minimum, we need the Name field. Also, check “Accelerated Polling.” Click Save.
Configure Local
Now that you have a source of truth, let’s dive into the local computer environment used to build this application. First, you’ll need to download this project into a folder on your computer. Open the folder with PyCharm. Then open the console and enter “pip install -r requirements.txt”. This configures everything to match the build I had when I made this.
If you see an error message at this point, the simplest fix is to download Postgres to your machine and restart PyCharm to try again. The install just needs a PATH variable that includes the Postgres files to work.
Next, update the PostgresHandler.py file with the URI of your Postgres database where it says “<YOUR DATABASE URI>”. This makes it easier to test database records in a local program, but you don’t want to commit the URI to GitHub, since it contains your database credentials.
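One way to keep the URI out of version control is to read it from an environment variable instead of pasting it into the file. This is an alternative to the approach above, not what the project does. Heroku Postgres sets a DATABASE_URL config var on the app, so the sketch below reads that name; the function names and the sample URI are hypothetical.

```python
import os
from urllib.parse import urlparse


def get_database_uri(env=os.environ):
    """Read the Postgres URI from DATABASE_URL rather than hard-coding it."""
    uri = env.get("DATABASE_URL")
    if not uri:
        raise RuntimeError("DATABASE_URL is not set; export it before running locally.")
    return uri


def describe_uri(uri):
    """Break a Postgres URI into its parts, leaving the password out."""
    parts = urlparse(uri)
    return {
        "host": parts.hostname,
        "port": parts.port,
        "user": parts.username,
        "database": parts.path.lstrip("/"),
    }


if __name__ == "__main__":
    # A made-up URI, used only to show the shape of the parsed result.
    sample = {"DATABASE_URL": "postgres://user:secret@host.example.com:5432/dbname"}
    print(describe_uri(get_database_uri(sample)))
```

For local testing, you would set DATABASE_URL in your shell or in PyCharm’s run configuration using the value from “View Credentials.”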
To find your Postgres URI, go into your Heroku app, click on the Heroku Postgres add-on, open the Settings tab, and then click the “View Credentials” button.
Next, you need to fix the table name in your file. To do that, go to your Heroku app and then to the Heroku Postgres add-on. Create a Dataclip by clicking on the Dataclips tab. Click “Create Dataclip” and you’ll see the database definitions on the right-hand side. Here you can test queries without doing it in code.
Change the “FROM” statements in the PostgresHandler.py file to match your table name.
Lastly, you need to set the path of your ChromeDriver download. Make sure it’s unzipped, then paste its directory into the WebCrawler.py file.
Note: you may run into issues with the ChromeDriver “not being found” when running the program. When that happens, I temporarily place the ChromeDriver file in the same directory as the project, update the googleChromeExecutePath variable, and it starts working.
The Process
Now that you’ve downloaded the code, you can begin to understand the thinking behind this scraping process. We’re going to scrape this example site for the list of countries to get the hang of the code. The flow is below, and everything in the code is commented.
1. Load the webpage
2. Identify the table we want
3. Pull the text
4. Click Next and repeat 1 through 4 until Next is no longer available
5. Get a list of existing countries in the database
6. Check the list and bulk insert anything new
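The flow above can be sketched in plain Python. To keep it runnable without a browser or database, this version swaps Selenium for the standard library’s HTML parser and works on two made-up HTML pages, so treat the class, function names, and sample markup as assumptions rather than the project’s actual code. The bulk insert itself is left out; the sketch stops at working out which countries are new.

```python
from html.parser import HTMLParser

# Hypothetical pages standing in for the site being scraped.
SAMPLE_PAGES = {
    1: "<table><tr><td>Canada</td></tr><tr><td>Mexico</td></tr></table>"
       "<a href='?page=2'>Next</a>",
    2: "<table><tr><td>Peru</td></tr><tr><td>Canada</td></tr></table>",
}


class TableTextParser(HTMLParser):
    """Collect the text of each <td> cell and note whether a Next link exists."""

    def __init__(self):
        super().__init__()
        self.cells = []
        self.has_next = False
        self._in_cell = False
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "td":
            self._in_cell = True
            self.cells.append("")
        elif tag == "a":
            self._in_link = True

    def handle_endtag(self, tag):
        if tag == "td":
            self._in_cell = False
        elif tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_cell:
            self.cells[-1] += data
        elif self._in_link and data.strip().lower() == "next":
            self.has_next = True


def scrape_all_pages(load_page):
    """Steps 1-4: load each page, pull the cell text, stop when Next is gone."""
    countries, page_number = [], 1
    while True:
        parser = TableTextParser()
        parser.feed(load_page(page_number))
        countries.extend(cell.strip() for cell in parser.cells if cell.strip())
        if not parser.has_next:
            return countries
        page_number += 1


def find_new_countries(scraped, existing):
    """Steps 5-6: keep scraped names not already stored, without duplicates."""
    known = {name.lower() for name in existing}
    new, seen = [], set()
    for name in scraped:
        key = name.lower()
        if key not in known and key not in seen:
            seen.add(key)
            new.append(name)
    return new


if __name__ == "__main__":
    scraped = scrape_all_pages(SAMPLE_PAGES.__getitem__)
    print(find_new_countries(scraped, existing=["Mexico"]))
```

Comparing against the existing names before inserting is what keeps repeated runs from creating duplicate records.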
Testing
If everything was set up correctly and you run the code, you should see the scraped countries when you run a dataclip in Heroku.
Uploading
Heroku has a great tutorial on how to handle uploading code directly to Heroku or using a service like GitHub. I will not cover that part here, but once the code is working, you can upload it to Heroku.
Taking it a step further
I left a lot of print statements mainly because that’s a simple way to see the order of execution and read logs. I also left them in there because now you’re able to schedule your Heroku application to run like a Batch job in Salesforce. All you need to do is verify your account with a credit card and install the add-on “Heroku Scheduler.” You will set a frequency and then put in the command “python main.py.” That will do the trick. It will run on your desired schedule.
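If you later want timestamps on those log lines, Python’s built-in logging module is one alternative to print statements. The sketch below is my own illustration, not part of the project: the logger name, the run helper, and the placeholder steps are all hypothetical. It writes to standard output, which is where Heroku picks up application logs.

```python
import logging
import sys


def configure_logging():
    """Send timestamped log lines to stdout."""
    logger = logging.getLogger("scraper")
    logger.setLevel(logging.INFO)
    if not logger.handlers:  # avoid adding a second handler on repeat calls
        handler = logging.StreamHandler(sys.stdout)
        handler.setFormatter(logging.Formatter("%(asctime)s %(levelname)s %(message)s"))
        logger.addHandler(handler)
    return logger


def run(logger, steps):
    """Run each (name, callable) step in order, logging its start and finish."""
    results = {}
    for name, step in steps:
        logger.info("starting %s", name)
        results[name] = step()
        logger.info("finished %s", name)
    return results


if __name__ == "__main__":
    # Placeholder steps; the real ones would scrape and then insert records.
    run(configure_logging(), [
        ("scrape", lambda: ["Canada", "Mexico"]),
        ("insert", lambda: 2),
    ])
```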