Server log parsing with Spark and schema extraction from query string parameters

I needed to parse server logs and create Spark DataFrames to query information from the query string parameters. My naive version kept throwing errors about a mismatch between the number of fields in the schema and the number of fields in the rows being queried.

It turns out I was dealing with over 350 different query string params across the logs. The set could change over time, and there was no way I was going to add all of these to a schema by hand.

The technique I used was to parse out all the various parameters in a first pass, and then, as a second job, make sure that every row had all the fields that were present anywhere in the logs. I am not sure whether there is a more idiomatic Spark way of doing this.

Here is what worked in my particular case.

First, parse the logs using a regex to split out the fields and query string parameters.

from pyspark.sql import SQLContext, Row
from pyspark.sql.types import *
from pyspark import SparkContext

sc = SparkContext("local", "Parse Logs")
sqlContext = SQLContext(sc)

def parseSchema(line):
    """Return the list of field names present in a parsed Row."""
    return list(line.asDict().keys())

def parseLogs():
    """ Read and parse log file """
    # parseServerLogLine (sketched below) turns a raw log line into a Row whose
    # fields include every query string parameter found on that line; logFile
    # points at the server log to parse.
    access_logs = (sc
                   .textFile(logFile)
                   .map(parseServerLogLine)
                   .cache())
    schema = (access_logs
                .flatMap(parseSchema)
                .map(lambda s: (s, 1))
                .reduceByKey(lambda a, b: a + b)
                .cache())
    return access_logs, schema

Make sure the parsed logs are cached by calling cache(). The parseSchema function pulls the field names out of each parsed Row in the access_logs RDD.

The schema RDD accumulates all possible parameter keys as it processes each line of the log file; the reduceByKey transformation builds the final list with a count of occurrences of each parameter.
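The parseServerLogLine function is not shown above. Here is a minimal sketch of what it might look like, assuming an Apache-style access log line where the query string rides on the request path; the regex, the field names and the log layout are all assumptions, so adjust them to whatever your logs actually contain.

import re
from urlparse import parse_qsl  # urllib.parse in Python 3
from pyspark.sql import Row

# Hypothetical layout: host identd userid [date] "method endpoint protocol" status size
LOG_PATTERN = re.compile(
    r'^(\S+) (\S+) (\S+) \[([^\]]+)\] "(\S+) (\S+) (\S+)" (\d{3}) (\S+)')

def parseServerLogLine(line):
    """Parse one raw log line into a Row whose fields include every
    query string parameter present on that line."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return Row(invalid_line=line)  # keep bad lines visible instead of dropping them
    (host, client_identd, user_id, date_time,
     method, endpoint, protocol, response_code, content_size) = match.groups()
    path, _, query = endpoint.partition('?')
    fields = {'host': host, 'client_identd': client_identd, 'user_id': user_id,
              'date_time': date_time, 'method': method, 'endpoint': path,
              'protocol': protocol, 'response_code': response_code,
              'content_size': content_size}
    # Every query string parameter becomes its own field on the Row.
    fields.update(dict(parse_qsl(query)))
    return Row(**fields)

One caveat: Row(**fields) requires the field names to be valid Python keyword arguments, so unusual query string keys may need to be sanitized before they are used as Row fields.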

As a next step, once we have all the possible keys parsed out, we create a dictionary of all possible parameters via a dictionary comprehension.

access_logs, schema = parseLogs()
schemaDict = {pair[0]: '' for pair in schema.collect()}

The next step is to modify the log Rows so that each row has all the possible elements, albeit empty for ones that don’t exist in that particular row.

By this point we have (1) all the rows parsed, with a variable number of parameters per row, and (2) the entire schema in a dictionary.
The tricky bit here is to realize that since Spark splits the work across multiple workers on different Hadoop nodes, the schema needs to be shared somehow. The Spark way of doing this is broadcast variables: read-only serializable values that are cached on each worker node. To do this we create a new broadcast variable from the schema dictionary we created above.


broadcastVar = sc.broadcast(schemaDict)

The final job is to enrich the log Rows with all the missing parameters so that the DataFrame can be created.

def enrichRow(row):
    """Add missing fields for DataFrame to work
    """
    rowdata = row.asDict()
    newRow = broadcastVar.value.copy()
    newRow.update(rowdata)
    return (Row(**newRow))

def finalLogs():
    final_logs = (access_logs
                .map(enrichRow)
                .cache())
    return final_logs

final_logs = finalLogs()

At this point every parsed row has all the possible parameters, and we are ready to create a DataFrame.

schemaAccess = sqlContext.createDataFrame(final_logs)
schemaAccess.printSchema()

schemaAccess.registerTempTable("access")
schemaAccess.show()

res = sqlContext.sql("select client_identd, endpoint, count(*) from access group by client_identd, endpoint order by count(*) desc")

You will notice that we are letting Spark infer the schema itself, typically by inspecting the first row. If specific types are needed, you will have to declare them with StructType and StructField constructs.
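For illustration, here is a sketch of what that might look like, built from the broadcast schema dictionary; the response_code field and its integer type are assumptions, and everything else is left as a nullable string.

from pyspark.sql.types import StructType, StructField, StringType, IntegerType

# Hypothetical explicit schema: one typed column plus every discovered
# field as a nullable string column.
explicit_schema = StructType(
    [StructField("response_code", IntegerType(), True)] +
    [StructField(name, StringType(), True)
     for name in sorted(broadcastVar.value) if name != "response_code"])

def rowToTuple(row):
    """Emit values in the same order as the schema, casting the typed column."""
    d = row.asDict()
    code = d.get("response_code")
    values = [int(code) if code not in (None, '') else None]
    values += [d.get(f.name) for f in explicit_schema.fields[1:]]
    return tuple(values)

typedAccess = sqlContext.createDataFrame(final_logs.map(rowToTuple), explicit_schema)
typedAccess.printSchema()

With an explicit schema in hand, the column types no longer depend on whatever happens to be in the first row.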

Note that this is not production-ready code; it mostly explains the concept behind creating DataFrames from data that may not be complete.

How deep is your marketplace?

Marketplaces are all the rage these days; in truth, marketplaces have been around since ancient times. A marketplace usually exists to connect sellers of goods or services with buyers. The marketplace itself usually benefits by selling space to sellers or by taking a cut of the transaction.

This post is focused on service marketplaces, which are connecting and democratizing human capital in a way that has never been possible before. Things like vetting, payments, reviews and support have been researched and A/B tested into an art.

Technologist and blogger Sangeet Paul Choudary has a series of great posts on strategies to kick-start your marketplace.

Let’s say you have now achieved a critical mass in your marketplace and enough relevant transactions are occurring that it’s starting to look good.

What next?

Which factors make your marketplace the go-to place for my business? What makes me come back for repeat business? Is the marketplace only useful in connecting me, the buyer, to the right seller? How do you prevent leakage?

Here are some factors that are hallmarks of a successful marketplace strategy. These factors add depth to a marketplace, the extras that delight both your consumers and your producers.

Payments

While there are new and easy ways for sellers to charge customers directly, a centralized payments system allows a marketplace to engender trust through standardized payment handling. The marketplace is able to offer multiple ways to complete the transaction: Paypal, credit cards, and perhaps invoices for the more traditional enterprise client who might order in bulk. A trustworthy payment system will also go a long way toward curtailing leakage. What is a reasonable leakage percentage?

CRM integration

While most marketplaces have conflict resolution guidelines for sellers, in the world of service marketplaces transactions usually have shades of gray that platform guidelines and rating systems alone cannot resolve. While not cheap, the ability to put a human at the other end of a phone number is often what makes the difference between your consumers and sellers having a fulfilling transaction and becoming evangelists, or never coming back. Other benefits include marketing campaigns, customer-specific analytics, and clear data showing which areas of your platform add friction to the workflow.

Delivery endpoints

Being able to track and be part of the final exchange of goods is ideal for a marketplace. At Smartshoot we stay with the transaction right to the end and beyond by offering hosting and distribution services for videos contracted on the site. Other reasons for a marketplace to be part of the final delivery are to collect data about customer satisfaction ratings, referrals and social media mentions, and to provide opportunities for repeat transactions.

API

One of the key reasons for the success of Ebay has been early access to APIs. These APIs allowed an ecosystem of providers to develop around the basic platform, including payment systems (Paypal), bulk listing management tools, warranty services and even web developers who built horrible-looking listing pages. With their livelihoods tied to the bigger platform, it was in everyone's interest to see Ebay succeed.

Collaboration

Your consumers and providers could communicate with each other through limited-period throwaway email addresses, but then you have seen the mess that is trying to sell your old Macbook on Craigslist. Providing a clean, centralized collaboration system that allows private and public collaboration increases the chances that a contact will lead to a completed transaction.

LogParser – Log extraction and visualizing

Or How to use the Log Parser to create a message sequence chart from protocol logs.

My call processing application generates copious amounts of logs that are irreplaceable when debugging problems with the system, but it's a pain to read and sift through thousands of lines to track the logs for a particular call.
Log Parser is a great but overlooked tool that can help. Although there is a wide variety of built-in input log formats that Log Parser knows about, I found the CSV format to be the most convenient for my purpose.
I also found the Log Parser Toolkit book to be a great resource for real world examples, diving right in.

An excerpted version of my log file shows comma-separated fields of information; I have added the header row so Log Parser can parse out the fields.

Time, Type, Port, Direction, Message, Extra
05:28.2,PROTO,0x410113fa , Recv, INVITE, From 4152484012 To 8886938437 IIDigits 0 CallId 5a87905440af7d8d2d677b8f6a4999ab@198.87.25.207. User2 is 0
05:28.2,PROTO,0x410113fa , Sent, 200 OK for INVITE, User2 is 0
05:28.4,FYI,0x410113fa ,, Playing prompt(s) c:\hello\Ivr\Prompts\NonWorkingNumber.pcm^,
05:28.4,PROTO,0x410113fa , Recv, ACK, User2 is 0
05:29.7,PROTO,0x410113fa , Recv, BYE, No Contents
05:29.7,PROTO,0x410113fa , Sent, 200 OK, for BYE
05:29.8,FYI,0x410113fa ,, Prompt play canceled,
08:14.6,FYI,0xffffffff , Initiating call to route 7, IP Address 10.2.20.20, Port 5060
08:14.6,PROTO,0x43f6e1a4 , Sent, INVITE, to User To – 14152484155 From – 8009751379 UserInfo – none CallId – 127bd67abc760140@REVWQURNSU4wMi5kZXYua2Vlbi5jb20.. User2 is 0
08:14.6,PROTO,0x43f6e1a4 , Recv,100, for INVITE
08:21.1,PROTO,0x43f6e1a4 , Recv,183, for INVITE
08:21.1,FYI,0x43f6e1a4 ,, Send Port Change of State. User2 is 0 ,
08:29.4,PROTO,0x43f6e1a4 , Recv,200, for INVITE
08:29.4,PROTO,0x43f6e1a4 , Sent, ACK, User2 is 0

The application dumps PROTO as part of the log statement where the SIP message is received or sent. We’ll use this field to identify the interesting lines.
Log Parser lets you search log files by writing a SQL-like query, and the one I use looks something like this.

select
    Message,
    case Direction
        when 'Sent' then '<-----------------'
        else '----------------->'
    end as Direction,
    Extra
into msc.out
from input.csv
where Type = 'PROTO'

To generate a nice looking message sequence chart, I use the TPL output type and use this template file.

<LPHEADER>
<HTML>
<HEAD>
<TITLE>MSC</TITLE>
</HEAD>

<BODY>
<H1>MSC</H1>
<TABLE  >
<TR>
<TH align="left">Carrier</TH>
<TH></TH>
<TH align="left"><B>Switch</B></TH>
<TH></TH>
</TR>
</LPHEADER>

<LPBODY>
<TR>
<TD align="center" colspan="2"><TT>%Message%<BR>%Direction%</TT></TD>
<TD>%Extra%</TD>
</TR>

</LPBODY>

<LPFOOTER>
</BODY>
</HTML>
</LPFOOTER>

Now all we have to do is run C:\> logparser file:input.sql -i:CSV -o:TPL -tpl:html.tpl and here is what the output looks like.

Carrier                                  Switch
INVITE              ----------------->   From 4152484012 To 8886938437 IIDigits 0 CallId 5a87905440af7d8d2d677b8f6a4999ab@198.87.25.207. User2 is 0
200 OK for INVITE   <-----------------   User2 is 0
BYE                 ----------------->   No Contents
200 OK              <-----------------   for BYE
INVITE              <-----------------
100                 ----------------->   for INVITE
183                 ----------------->   for INVITE
200                 ----------------->   for INVITE
ACK                 <-----------------   User2 is 0

Nice!!

Finding with Find

Find is a really versatile utility that can be used to enumerate files and narrow the list by file type, date, size, access time and a whole list of other expressions. The output can be formatted with various switches, for example as CSV.

My goal was to list the sizes and access times for all video files in the system. I knew that there were over 3 TB of files, but not how recently they were accessed or played. I also needed to know at what rate these files were being added to the system.

Here is a simple expression to list all these.

find . -iname '*.mp4' -print

This prints one file name per line, a good starting point for my list.

The -exec switch lets you execute another utility on each file that find enumerates. The '{}' is substituted with each filename that is found.

find . -iname '*.mp4' -exec stat --printf="%n, %s, %y, %x\n" '{}' \; | gawk '{ split($0, a, "_"); print $0,",", a[4] }'

The --printf option lets me format the output: %n prints the name, %s prints the size, and %y and %x print the last modification and last access times.
The gawk step splits the name further, since the files are named in a particular way to identify different types of videos.

Pretty powerful stuff – all in one line.
Gotta know your shell utilities.

Once I had the information in a CSV file, I opened it up in Excel and added a pivot table to see the rate at which files were being added, grouping by month and quarter.
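If you'd rather skip Excel, the same roll-up can be scripted. Here is a rough sketch that assumes the output above was saved as videos.csv, with columns in the order the stat format string produces them (name, size, modification time, access time, plus the extra field appended by gawk).

import csv
from collections import Counter

added_per_month = Counter()
bytes_per_month = Counter()

# videos.csv is assumed to hold: name, size, modification time, access time, ...
with open('videos.csv') as f:
    for row in csv.reader(f, skipinitialspace=True):
        name, size, mtime = row[0], int(row[1]), row[2]
        month = mtime[:7]  # 'YYYY-MM' from a timestamp like '2015-03-02 10:15:30...'
        added_per_month[month] += 1
        bytes_per_month[month] += size

for month in sorted(added_per_month):
    print('%s %d files, %.1f GB' %
          (month, added_per_month[month], bytes_per_month[month] / 1e9))

Grouping by quarter is the same idea with a slightly coarser key.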

Tilt Shift Photography

Just saw this really great tutorial on tilt shift photography, no need to buy a lensbaby yet. Played with some of my photos. The idea is to take really wide photos with an angled top-down view to get the best results.

I really like the “miniature” look.

This one is from Honolulu.


Here is one from Yellowstone


This one is from Prague

Prague - Karlov Most (Charles Bridge)

Apple iPhone

Steve Jobs just announced the 3G iPhone, so naturally I wanted to go get me some new iPhone candy. Since my contract with AT&T is up, I should be able to upgrade my phone.

The Apple iPhone website already has updated information about the new 3G phone, but their partner AT&T is still undergoing maintenance when you try to locate a store that might have the new iPhone in stock. Tsk tsk.

AT&T store locator down.

Gmail bug?

Found something curious today; this might be a bug.

If you have a gmail account and your id is JohnDoe@gmail.com, you can send mail to J.ohnDoe@gmail.com, Jo.hnDoe@gmail.com, Joh.nDoe@gmail.com, John.Doe@gmail.com, JohnD.oe@gmail.com, JohnDo.e@gmail.com and so on, and that email will hit your account. You cannot log in with the dotted variations, so it does not seem like a security hole, but nonetheless.
What about users who really do have a . in their ids?



Power me up – Power laws and phone calls

Power law distributions are observed in numerous real-world and natural phenomena in the fields of physics, biology, sociology and linguistics. Think of the 80-20 rule, discovered by the Italian economist Pareto, who observed that 80% of the wealth is held by 20% of the population.
I came across a non-intuitive example of such a distribution.

Think of your behavior on the phone: are most of your calls short or long, or are they distributed normally around an average length? I was expecting to see a Gaussian bell-shaped distribution, but to my surprise it turns out that we make a lot of short calls and very few long ones.

I have access to some data on call lengths for calls that random people make to yellow page advertisers. Here is data from one randomly chosen advertiser with a random sample of calls, and guess what: it looks like a power-law distribution.
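I can't share the raw numbers, but the shape is easy to check on your own data: bucket the call durations into a histogram and plot the counts on log-log axes; a power law shows up as a roughly straight line, while a normal distribution falls off much faster. Here is a small sketch with made-up data; swap in your own list of call lengths in seconds.

import numpy as np
import matplotlib.pyplot as plt

# Stand-in data drawn from a Pareto (power-law) distribution;
# replace with your own array of call durations in seconds.
durations = (np.random.pareto(a=1.5, size=10000) + 1) * 30

counts, edges = np.histogram(durations, bins=np.logspace(1, 4, 40))
centers = np.sqrt(edges[:-1] * edges[1:])  # geometric bin centers

plt.loglog(centers[counts > 0], counts[counts > 0], 'o')
plt.xlabel('call length (seconds)')
plt.ylabel('number of calls')
plt.title('Call length distribution (log-log)')
plt.show()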