Hubbub $Id$
detect.c File Reference
#include <assert.h>
#include <stdbool.h>
#include <string.h>
#include <strings.h>
#include <parserutils/charset/mibenum.h>
#include <hubbub/types.h>
#include "utils/utils.h"
#include "detect.h"

Macros

#define PEEK(a)
 
#define ADVANCE(a)
 
#define ISSPACE(a)
 

Functions

parserutils_error hubbub_charset_extract (const uint8_t *data, size_t len, uint16_t *mibenum, uint32_t *source)
 Extract a charset from a chunk of data.
 
uint16_t hubbub_charset_parse_content (const uint8_t *value, uint32_t valuelen)
 Parse a content= attribute's value.
 
void hubbub_charset_fix_charset (uint16_t *charset)
 Fix charsets, according to the override table in HTML5, section 8.2.2.2.
 

Macro Definition Documentation

◆ ADVANCE

#define ADVANCE (   a)
Value:
while (pos < end - SLEN(a)) { \
    if (PEEK(a)) \
        break; \
    pos++; \
} \
\
if (pos == end - SLEN(a)) \
    return 0;
Referenced macros:
PEEK(a)  Definition: detect.c:185
SLEN(s)  Definition: utils.h:21

◆ ISSPACE

#define ISSPACE (   a)
Value:
(a == 0x09 || a == 0x0a || a == 0x0c || \
a == 0x0d || a == 0x20 || a == 0x2f)
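
ISSPACE accepts 0x2f ('/') alongside tab, line feed, form feed, carriage return and space. Because it is a macro, its argument is expanded once per comparison, so an argument with side effects (such as *pos++) would be evaluated more than once. The following is a minimal sketch of an equivalent function form; the names is_meta_space and skip_meta_space are hypothetical and are not part of Hubbub.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical function form of ISSPACE (not part of Hubbub).
 * Matches TAB, LF, FF, CR, SPACE and '/'; the argument is
 * evaluated exactly once, unlike the macro. */
static bool is_meta_space(uint8_t c)
{
	return c == 0x09 || c == 0x0a || c == 0x0c ||
	       c == 0x0d || c == 0x20 || c == 0x2f;
}

/* Hypothetical helper: skip a run of such bytes, never reading
 * at or beyond `end`. */
static const uint8_t *skip_meta_space(const uint8_t *pos,
		const uint8_t *end)
{
	assert(pos != NULL && end != NULL && pos <= end);

	while (pos < end && is_meta_space(*pos))
		pos++;

	return pos;
}
```

Note that vertical tab (0x0b) is not in the set, so it is not skipped.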

◆ PEEK

#define PEEK (   a)
Value:
(pos < end - SLEN(a) && \
strncasecmp((const char *) pos, a, SLEN(a)) == 0)
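
PEEK and ADVANCE both rely on local variables named pos and end being in scope, and ADVANCE can return from the enclosing function. The sketch below restates them as functions to make the bounds explicit. It assumes SLEN(s) yields the length of a string literal without its terminating NUL; the names peek and advance are hypothetical. One deliberate difference: the macro returns 0 only when pos lands exactly on end - SLEN(a), whereas this sketch reports "not found" whenever too few bytes remain.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <strings.h>

/* Hypothetical function form of PEEK: true if strictly more than
 * strlen(s) bytes remain before `end` and the bytes at `pos` match
 * `s` case-insensitively. */
static bool peek(const uint8_t *pos, const uint8_t *end, const char *s)
{
	size_t n = strlen(s);

	assert(pos != NULL && end != NULL && pos <= end);

	return (size_t) (end - pos) > n &&
	       strncasecmp((const char *) pos, s, n) == 0;
}

/* Hypothetical function form of ADVANCE: returns a pointer to the
 * first match of `s`, or NULL where the macro would `return 0`. */
static const uint8_t *advance(const uint8_t *pos, const uint8_t *end,
		const char *s)
{
	size_t n = strlen(s);

	assert(pos != NULL && end != NULL && pos <= end);

	while ((size_t) (end - pos) > n) {
		if (peek(pos, end, s))
			return pos;
		pos++;
	}

	return NULL;
}
```

Writing the bound as end - pos > n avoids forming the pointer end - n, which would point before the buffer when the buffer is shorter than the search string.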

Function Documentation

◆ hubbub_charset_extract()

parserutils_error hubbub_charset_extract ( const uint8_t *  data,
size_t  len,
uint16_t *  mibenum,
uint32_t *  source 
)

Extract a charset from a chunk of data.

Parameters
data     Pointer to buffer containing data
len      Buffer length
mibenum  Pointer to location containing current MIB enum
source   Pointer to location containing current charset source
Returns
PARSERUTILS_OK on success, appropriate error otherwise

mibenum and source will be updated on exit

The larger a chunk of data fed to this routine, the better, as it allows charset autodetection access to a larger dataset for analysis.

Meaning of *source on entry:

CONFIDENT - Do not pass Go, do not attempt auto-detection.
TENTATIVE - We've tried to autodetect already, but subsequently discovered that we don't actually support the detected charset. Thus, we've defaulted to Windows-1252. Don't perform auto-detection again, as it would be futile. (This bit diverges from the spec)
UNKNOWN - No autodetection performed yet. Get on with it.
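
The three entry states reduce to a single question: should auto-detection run at all? A minimal sketch of that gate follows. The SRC_* constants and should_autodetect are hypothetical stand-ins; the real charset source constants and their numeric values are defined by Hubbub and are not reproduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the charset source values. */
enum { SRC_UNKNOWN, SRC_TENTATIVE, SRC_CONFIDENT };

/* Returns true only when auto-detection should be attempted for the
 * given value of *source on entry. */
static bool should_autodetect(uint32_t source)
{
	assert(source <= (uint32_t) SRC_CONFIDENT);

	switch (source) {
	case SRC_CONFIDENT:
		/* Encoding is already certain: leave it alone */
		return false;
	case SRC_TENTATIVE:
		/* Detection already ran and fell back to a default;
		 * repeating it would give the same result */
		return false;
	default:
		/* SRC_UNKNOWN: nothing has been tried yet */
		return true;
	}
}
```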

Todo:
We probably want to wait for ~512 bytes of data / 500ms here
Todo:
Charset autodetection

◆ hubbub_charset_fix_charset()

void hubbub_charset_fix_charset ( uint16_t *  charset)

Fix charsets, according to the override table in HTML5, section 8.2.2.2.

Character encoding requirements http://www.whatwg.org/specs/web-apps/current-work/#character0

Parameters
charset  Pointer to charset value to fix
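
An override table of this kind maps a declared encoding to the encoding that should actually be used for decoding. The sketch below shows the lookup pattern only. It uses two mappings from the HTML5 draft table (ISO-8859-1 and US-ASCII, both to windows-1252) with their IANA MIB enum values; it is not Hubbub's table, and fix_charset_sketch is a hypothetical name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* IANA MIB enum values, used here for illustration only. */
enum {
	MIB_US_ASCII     = 3,
	MIB_ISO_8859_1   = 4,
	MIB_UTF_8        = 106,
	MIB_WINDOWS_1252 = 2252
};

struct charset_override {
	uint16_t from;	/* MIB enum as declared */
	uint16_t to;	/* MIB enum to use instead */
};

/* Illustrative subset; the real table has more entries. */
static const struct charset_override overrides[] = {
	{ MIB_ISO_8859_1, MIB_WINDOWS_1252 },
	{ MIB_US_ASCII,   MIB_WINDOWS_1252 }
};

/* Hypothetical sketch: rewrite *charset in place if it appears in
 * the override table; otherwise leave it unchanged. */
static void fix_charset_sketch(uint16_t *charset)
{
	size_t i;

	assert(charset != NULL);

	for (i = 0; i < sizeof overrides / sizeof overrides[0]; i++) {
		if (*charset == overrides[i].from) {
			*charset = overrides[i].to;
			return;
		}
	}
}
```

Encodings that are not listed, such as UTF-8 here, pass through untouched.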

◆ hubbub_charset_parse_content()

uint16_t hubbub_charset_parse_content ( const uint8_t *  value,
uint32_t  valuelen 
)

Parse a content= attribute's value.

Parameters
value     Attribute's value
valuelen  Length of value
Returns
MIB enum of detected encoding, or 0 if none found
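
A content= value typically looks like "text/html; charset=utf-8". The sketch below shows one way the charset name could be located in such a value; it is not Hubbub's implementation, and find_charset_name is a hypothetical helper. It accepts optional whitespace around '=' and an optional quoted name, both of which are assumptions. It returns the name as a slice and does not convert it to a MIB enum.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <strings.h>

/* Hypothetical helper: find the charset name inside a content=
 * value. On success returns a pointer to the name and stores its
 * length in *namelen; returns NULL if no charset is found. */
static const uint8_t *find_charset_name(const uint8_t *value,
		uint32_t valuelen, size_t *namelen)
{
	const uint8_t *pos = value;
	const uint8_t *end = value + valuelen;
	const uint8_t *start;
	uint8_t quote = 0;

	assert(value != NULL && namelen != NULL);

	/* Scan for the case-insensitive keyword "charset" */
	while ((size_t) (end - pos) >= 7 &&
			strncasecmp((const char *) pos, "charset", 7) != 0)
		pos++;
	if ((size_t) (end - pos) < 7)
		return NULL;
	pos += 7;

	/* Optional whitespace, a mandatory '=', optional whitespace */
	while (pos < end && (*pos == ' ' || *pos == '\t'))
		pos++;
	if (pos == end || *pos != '=')
		return NULL;
	pos++;
	while (pos < end && (*pos == ' ' || *pos == '\t'))
		pos++;

	/* Optional opening quote */
	if (pos < end && (*pos == '"' || *pos == '\''))
		quote = *pos++;

	/* Name runs to the closing quote, or to whitespace or ';' */
	start = pos;
	while (pos < end) {
		if (quote != 0 ? *pos == quote :
				(*pos == ' ' || *pos == '\t' || *pos == ';'))
			break;
		pos++;
	}

	*namelen = (size_t) (pos - start);
	return *namelen > 0 ? start : NULL;
}
```

A caller would then map the returned name to a MIB enum, for example through a charset alias lookup, and treat a NULL result as the "0 if none found" case.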