Have you ever wondered what happens to your Android application code when it’s compiled and packaged into an APK? This post takes a deep dive into the Dalvik Executable Format, with a practical example of the structure of a minimal Dex file.
A Dex file contains code which is ultimately executed by the Android Runtime. Every APK has a single classes.dex
file, which references any classes or methods used within an app. Essentially, any Activity
, Object
, or Fragment
used within your codebase, will be transformed into bytes within a Dex file that can be run as an Android app.
It can be useful to understand the structure of a Dex file because all these references can take up a lot of space in your application. Using many 3rd party libraries can increase your APK size by megabytes, or worse, lead to the infamous 64k method size limit. And of course, there may come a day when knowledge of Dex files helps you track down unexpected behaviour in your app.
All the Java source files in an Android project are first compiled to .class
files, which consist of bytecode instructions. In a traditional Java application, these instructions would be executed on the JVM. However, Android apps are executed on the Android Runtime, which uses incompatible opcodes, and therefore an additional Dexing step is required, where .class
files are converted into a single .dex
file.
Because most mobile devices are severely constrained in the amount of memory, processing power, and battery life available, ART offers superior performance to the JVM. One key feature that helps achieve this is that ART performs both Ahead-of-Time and Just-in-Time compilation. This avoids some of the runtime overhead of JIT, while still allowing performance improvements over time as an app is profiled.
A practical example of a Dex file makes this a lot easier to understand. Let’s create a minimal APK that only contains one Application class, as this allows us to understand the file format without being overwhelmed by the thousands of methods that are present in a typical app.
We’ll use Hexfiend to view our Dex file in hexadecimal, as Dex uses some unusual data types to save space. We’ve hidden any null bytes, so empty spaces in the above screenshot actually represent 00
.
The full structure of our 480-byte Dex file is shown in hexadecimal and UTF-8 below. Some sections are instantly recognisable when interpreted as UTF-8, such as the single BugsnagApp
class that we have defined within our source code, and others not so much:
6465780A 30333800 7A44CBBB FB4AE841 0286C06A 8DF19000
3C5DE024 D07326A2 E0010000 70000000 78563412 00000000
00000000 64010000 05000000 70000000 03000000 84000000
01000000 90000000 00000000 00000000 02000000 9C000000
01000000 AC000000 14010000 CC000000 E4000000 EC000000
07010000 2C010000 2F010000 01000000 02000000 03000000
03000000 02000000 00000000 00000000 00000000 01000000
00000000 01000000 01000000 00000000 00000000 FFFFFFFF
00000000 57010000 00000000 01000100 01000000 00000000
04000000 70100000 00000E00 063C696E 69743E00 194C616E
64726F69 642F6170 702F4170 706C6963 6174696F 6E3B0023
4C636F6D 2F627567 736E6167 2F646578 6578616D 706C652F
42756773 6E616741 70703B00 01560026 7E7E4438 7B226D69
6E2D6170 69223A32 362C2276 65727369 6F6E223A 2276302E
312E3134 227D0000 00010001 818004CC 01000000 0A000000
00000000 01000000 00000000 01000000 05000000 70000000
02000000 03000000 84000000 03000000 01000000 90000000
05000000 02000000 9C000000 06000000 01000000 AC000000
01200000 01000000 CC000000 02200000 05000000 E4000000
00200000 01000000 57010000 00100000 01000000 64010000
dex
038zDÀª˚JËAÜ¿jçÒê<]‡$–s&¢‡pxv4dpñêú¨ã‰ï, ˇˇˇˇwp<init="">Landroid/app/Application;</]‡$–s&¢‡pxv4dpñêú¨ã‰ï,>
#Lcom/bugsnag/dexexample/BugsnagApp;
V&~~D8{"min-api":26,"version":"v0.1.14"}ÅÄÃ
pÑêú¨ Ã ‰ Wd
At a very high level, Dex files can be thought of as two separate parts. A file header which contains metadata, and a body which contains the majority of the data. A diagram of the file header structure is shown below.
Let’s step through each item in the header sequentially.
Many file formats begin with a fixed sequence of bytes that uniquely identify the application used to manipulate them, and Dex is no exception.
6465780A 30333800
dex
038
We can see that the first 8 bytes must contain ‘dex’, and the version number – currently 38 when our targetSdkVersion
is API 26.
You may have also noticed that the 4th byte encodes a newline character, and the 8th byte is null. These are validated by the Android Framework to check for file corruption — an APK should refuse to install if this exact sequence isn’t present.
7A44CBBB
The next value is a checksum, which is calculated by applying a function to the contents of the entire file, excluding any bytes preceding the checksum. If a byte within the file was corrupted during download or storage on disk, the calculated checksum won’t match and the Android Framework will refuse to install the APK.
FB4AE841 0286C06A 8DF19000 3C5DE024 D07326A2
The header also includes a SHA-1 hash of the file (excluding any preceding bytes). This is used to uniquely identify Dex files, which may be useful in scenarios such as Multidex.
E0010000
480
This matches the file size in bytes, and can also be used for validation when reading the Dex file.
70000000
112
The header size should be 112 bytes long.
Therefore we can now highlight all the remaining fields within the header_item
.
78563412
Dex files support both big endian and little endian encoding. This value equals REVERSE_ENDIAN_CONSTANT
, indicating that this particular Dex file is encoded in little endian, which is the default behaviour.
The remaining values in the file header define the size and location of other data structures which hold identifiers for methods, strings, and other items.
00000000 00000000 64010000 05000000
70000000 03000000 84000000 01000000
90000000 00000000 00000000 02000000
9C000000 01000000 AC000000 14010000
CC000000
These values are summarised in the table below, where size equals the array length, and the offset is the number of bytes from the start of file where this information can be found.
It’s worth noting that link_size
and field_ids
are both 0, because our app doesn’t statically link any libraries or contain any fields. The map_off
structure in the data
section largely duplicates this information, in an easier format for Dex file parsing.
As an example, we can see that there are 5 Strings IDs in our Dex file, encoded between bytes 112-132. Each ID at this position also points to an offset within the data
section, that encodes the actual value of the String.
The map_list
is a section within the data body which contains similar information to the file header.
With this knowledge, we can use the offsets to resolve the actual information, and determine what our Dex file encodes.
Enough talk — let’s see something concrete. Let’s find out what the string_ids
structure points at.
E4000000 EC000000 07010000 2C010000 2F010000
228, 236, 263, 300, 303
The array encodes 5 integer offsets, which point at the data section.
<init>
Landroid/app/Application;
Lcom/bugsnag/dexexample/BugsnagApp;
V
~~D8{"min-api":26,"version":"v0.1.14"}
If we retrieve these values as UTF-8, we are greeted by a few Java symbols which will look familiar to anyone who has used the JNI before, and also some JSON which indicates D8 created the Dex file. All this business of IDs, offsets, and multiple headers, may seem a bit useless at this point. Why not just encode the string value directly in the header?
Some of the reasoning behind this is that these strings are referenced from multiple points within the Dex file. Providing an ID for each one prevents duplication of information and reduces the overall file size, simplifies parsing as an ID will always be a fixed length, and means values are only accessed when required.
01000000 02000000 03000000
1, 2, 3
Our Dex file defines 3 Java types. Each value here is an index into the previous string_id
array — therefore we can determine that the types in our file are as follows:
Landroid/app/Application;
Lcom/bugsnag/dexexample/BugsnagApp;
V
The TypeDescriptor syntax may look somewhat unfamiliar, but the L simply refers to a full class name, and V is the type void
. Our types include our custom BugsnagApp
class, and the Application
class from the Android framework.
03000000 02000000 00000000
3, 2, 0
"V", V
A method prototype consists of information on the return type of a method, and the number of parameters it takes. The proto_id
section uses indices to retrieve the type information, and an offset, which is non-functional in this case as our method doesn’t take any parameters.
The Method section also uses indices. Each method looks up the class ID where it was defined, the method prototype, and the name of the method from the strings table.
00000000 00000000 01000000 00000000
0, 0, 0, 1, 0, 0
Landroid/app/Application "V" <init>
Lcom/bugsnag/dexexample/BugsnagApp; "V" <init>
The only methods in our Dex file relate to the constructor for BugsnagApp
— which is exactly what we’d expect.
This section contains the type, inheritance hierarchy, access metadata, and other class metadata such as annotations and source file indices.
01000000 01000000 00000000 00000000 FFFFFFFF 00000000 57010000 00000000
1, 1, 0, 0, NO_INDEX,0, 343, 0
This evaluates as a public class
Lcom/bugsnag/dexexample/BugsnagApp
which inherits from Landroid/app/Application
, whose class data is stored from byte 343. The public
access modifier is determined from a bit field. Let’s view the class data.
The first 4 bytes of our BugsnagApp
class data define the number of static and instance fields, along with any direct or virtual methods.
00 00 01 00
0, 0, 1, 0,
01 81 80 04 CC 01 00 00 00
1, 460
There is only one direct method defined in this class. It has an ID of #1, which corresponds to Lcom/bugsnag/dexexample/BugsnagApp; "V" •<init>
, and a code data offset of 460
. If our method was abstract
or native
, there wouldn’t be a code data offset.</init>
If our class defined fields and other information, more data would be encoded in this section. Incidentally, if the method ID was a value larger than 65,536, we would have encountered the infamous 64k method limit.
We’re now analysing the constructor method defined in our class, which has the following structure at the offset of 460
:
0100 0000 5701 0000 0010, 0000 01000000 64010000
1, 0, 343, 0, 16, 0 1, 64,1
This corresponds to a register size of 1, 0 incoming arguments, 343 outgoing arguments, and an offset of 16 where debug information is stored.
The most important part however, is the last few bytes. We have an instruction list size of 1, which means that our method has compiled to one opcode: 64010000
.
The Dalvik Bytecode table suggests that 64
corresponds to the sget-byte
operation on a register, using the field reference index of 1. This seems to match up with our expectation that a singleton #BugsnagApp field will be created for our application — but diving deep into Dalvik is a story for another day!
We haven’t touched too much on the compilation process, but our minimal Dex file was created using D8, a new compiler that will be rolled out by default in Android Studio 3.1. It offers performance benefits in the overall file size and build speed, so let’s test those claims.
Let’s create a greenfield app with Android Studio 3.0.1. We’ll add Kotlin Support and a Navigation Drawer, but will otherwise leave all the options as default, generate a signed APK, and view it with the APK Analyzer.
We can retrieve classes.dex
from within the APK by unzipping the APK with unzip app-release.apk -d app
, then measuring the file size in bytes: stat -f%z app/classes.dex
.
Our Dex file is approximately 88% of its previous size when compiling with D8. Your mileage may vary, as this is a very simple example project. One other interesting thing to note is that using D8, we appear to have lost the following two method references:
android.view.View#isInEditMode
java.lang.Class#desiredAssertionStatus
These don’t appear to be used at runtime, so could be an optimisation. Please get in touch if you know why these are missing!
Enabling minification and obfuscation is the single greatest thing you can do for your app, and now that you’re an expert in the Dex format, you can probably think of some reasons why.
Firstly, stripping out unused Java classes with Proguard will reduce the size of an APK, as the generated Dex file won’t contain unused class definitions and all their associated data which take up space.
Obfuscation will also reduce the Dex file size, as unless you’re the type of developer who names their classes a.a.A
and z.z.Z
, less characters will be required for each symbol, which will save space overall. Solutions exist for mapping obfuscated stacktraces, which allow you to easily diagnose crashes within your app.
Finally, a smaller Dex file leads to a smaller APK, which means users spend less on mobile data, and are less likely to give up on a download. If you offer an Instant App, then the hard limit of 4Mb means keeping APK size low is a big consideration.
Hopefully this has helped you understand Dex files, which are about to get a lot smaller with the advent of D8. If you have any questions or feedback, please feel free to get in touch.
———
Bugsnag automatically monitors your applications for harmful errors and alerts you to them, giving you visibility into the stability of your software. You can think of us as mission control for software quality.
Try Bugsnag’s Android crash reporting.