2009-03-04

Extended sessions with the Haskell Curl bindings

I recently needed to automate retrieving protected data from a secure web site. I had to:

  1. Log into the website with a POST request.
  2. Download the protected data with a GET request.

All this had to be done using SSL, and I suspected I’d need to handle cookies too.

I had read that libcurl supports sessions and cookies spanning multiple requests, and I knew that it could handle SSL. I was also aware of a Haskell binding to libcurl on Hackage (aptly named “curl”, but hereafter referred to as Network.Curl to avoid confusion), so my tools were chosen for me. While I had used the command-line curl utility quite a bit, I had never programmed against libcurl before and had some learning to do.

It wasn’t entirely clear to me from the Haddocks how to use Network.Curl. This may not be a problem if you are already familiar with libcurl (I couldn’t tell), but for me it was quite a hurdle. Googling the topic, I found some blogged examples that got me started, but none that demonstrated a multi-request session. However, with the basics from the blogs I was able to return to Network.Curl and figure things out by inspecting its source code. I’ll share an example here for the benefit of others who find themselves in the same situation. I’m using version 1.3.4 of Network.Curl.

As a contrived example let’s assume we want to write a small program that, given a user name and password, fetches the user’s API token from GitHub. Here is the code (literate Haskell, just copy and paste into a .lhs file):

> import Network.Curl
> import System.Environment (getArgs)
> import Text.Regex.Posix
>
> -- | Standard options used for all requests. Uncomment the @CurlVerbose@
> -- option for lots of debug output (libcurl writes it to stderr).
> opts = [ CurlCookieJar "cookies" {- , CurlVerbose True -} ]
>
> -- | Additional options to simulate submitting the login form.
> loginOptions user pass =
>   CurlPostFields [ "login=" ++ user, "password=" ++ pass ] : method_POST
>
> main = withCurlDo $ do
>   -- Get username and password from command line arguments (will cause
>   -- pattern match failure if incorrect number of args provided).
>   [user, pass] <- getArgs
>
>   -- Initialize curl instance.
>   curl <- initialize
>   setopts curl opts
>
>   -- POST request to login.
>   r <- do_curl_ curl "https://github.com/session" (loginOptions user pass)
>        :: IO CurlResponse
>   if respCurlCode r /= CurlOK || respStatus r /= 302
>     then error $ "Failed to log in: "
>                  ++ show (respCurlCode r) ++ " -- " ++ respStatusLine r
>     else do
>       -- GET request to fetch account page.
>       r <- do_curl_ curl "https://github.com/account" method_GET
>            :: IO CurlResponse
>       if respCurlCode r /= CurlOK || respStatus r /= 200
>         then error $ "Failed to retrieve account page: "
>                      ++ show (respCurlCode r) ++ " -- " ++ respStatusLine r
>         else putStrLn $ extractToken $ respBody r

The first thing to note is that we use do_curl_ rather than e.g. curlPost and curlGet. The latter two don’t actually give you access to the response body but instead print it to stdout! The general process is:

  1. Initialize a curl instance.
  2. Set options.
  3. Call do_curl_ with the URL and request-specific options.
  4. Inspect the CurlResponse.
  5. Repeat from 3 until done.

Note that all use of libcurl should be wrapped by withCurlDo; in the example I wrapped the entire body of main. Also note that the result type of do_curl_ must be specified explicitly unless it can be inferred from later use. The CurlResponse type specified above uses vanilla Strings for everything.
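
One way to avoid repeating the :: IO CurlResponse annotation at every call site is to wrap do_curl_ in a small function with an explicit signature. This is only a sketch: request is a hypothetical name, not part of Network.Curl, and the block is fenced rather than bird-tracked so that it stays out of the literate program.

```haskell
import Network.Curl

-- | Hypothetical wrapper (not part of Network.Curl): the signature fixes the
-- response type to CurlResponse, i.e. String headers and a String body.
request :: Curl -> URLString -> [CurlOption] -> IO CurlResponse
request = do_curl_
```

With this in scope, the login call could be written as r <- request curl "https://github.com/session" (loginOptions user pass), with no annotation needed.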

For the POST request I added some CurlPostFields to the method_POST options predefined in Network.Curl. The GET request needs no extra options, since the predefined method_GET is sufficient.

A GitHub-specific peculiarity here is the error checking after the POST request. GitHub returns a 302 (“Moved Temporarily”) on successful login and a 200 (“OK”) when the credentials are bad. Stuff like this needs to be figured out on a site-by-site basis.
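
Because the expected status differs from request to request, the comparison can be factored into a small pure function. The following is a minimal sketch, not part of the program above (hence fenced, not bird-tracked); checkStatus is a hypothetical helper, and the status codes passed to it are whatever you have observed for the site in question.

```haskell
-- | Hypothetical helper: compare the HTTP status a site is known to return
-- on success with the status actually received.
checkStatus :: String -> Int -> Int -> Either String ()
checkStatus what expected actual
  | actual == expected = Right ()
  | otherwise =
      Left (what ++ ": expected HTTP " ++ show expected
                 ++ ", got " ++ show actual)
```

For the login request this would be called as checkStatus "login" 302 (respStatus r). The transport-level check on respCurlCode still has to be made separately.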

For completeness here is the function that extracts the token from the response body using a regular expression:

> -- | Extracts the token from GitHub account HTML page.
> extractToken :: String -> String
> extractToken body = head' "GitHub token not found" xs
>   where
>     head' msg l = if null l then error msg else head l
>     (_,_,_,xs) = body =~ "github\\.token (.+)"
>                  :: (String, String, String, [String])
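
The four-tuple result type asks =~ for the text before the match, the matched text, the text after it, and the list of parenthesised submatches; extractToken keeps only the last of these. Here is a small sketch on a made-up input (the sample string is an assumption, not real GitHub markup, and tokenMatch is a hypothetical name). It is fenced so that it stays out of the literate program.

```haskell
import Text.Regex.Posix ((=~))

-- | Same pattern as extractToken, but returning the whole result tuple:
-- (text before match, matched text, text after match, submatches).
tokenMatch :: String -> (String, String, String, [String])
tokenMatch body = body =~ "github\\.token (.+)"

-- Made-up input, for illustration only:
--   tokenMatch "x github.token abc123"
--     == ("x ", "github.token abc123", "", ["abc123"])
```

When the pattern does not match, the submatch list is empty, which is the case head' guards against.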

If you load this code in ghci and type :main username password, the Octocat will deliver your token.

Octocat, GitHub’s mascot
